* [PATCH net-next 0/5] Add support for per-NAPI config via netlink
@ 2024-08-29 13:11 Joe Damato
  2024-08-29 13:11 ` [PATCH net-next 1/5] net: napi: Make napi_defer_hard_irqs per-NAPI Joe Damato
                   ` (4 more replies)
  0 siblings, 5 replies; 48+ messages in thread
From: Joe Damato @ 2024-08-29 13:11 UTC (permalink / raw)
  To: netdev
  Cc: edumazet, amritha.nambiar, sridhar.samudrala, sdf, bjorn, hch,
	willy, willemdebruijn.kernel, skhawaja, kuba, Joe Damato,
	Alexander Lobakin, Breno Leitao, Daniel Jurgens, David S. Miller,
	Donald Hunter, Heiner Kallweit, Jesper Dangaard Brouer,
	Jiri Pirko, Johannes Berg, open list, Lorenzo Bianconi,
	Martin Karsten, Paolo Abeni, Sebastian Andrzej Siewior, Xuan Zhuo

Greetings:

This series makes gro_flush_timeout and napi_defer_hard_irqs available
per NAPI instance, in addition to the existing sysfs parameters. The
sysfs parameters remain and care was taken to keep them working, but an
important edge case is introduced, described below.

The netdev netlink spec has been updated to export both parameters from
the napi-get operation, and a new operation, napi-set, has been added to
set them. The parameters can be set individually or together; for
example, a user app might want to update gro_flush_timeout dynamically
during busy poll while leaving defer_hard_irqs at its existing value.
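
For example, a user application could drive napi-set programmatically via
the in-tree YNL Python library rather than the CLI. A minimal sketch
(assumptions: tools/net/ynl/lib is on the Python path, NAPI ID 913 is just
an illustrative value, and napi-set requires CAP_NET_ADMIN):

from lib import YnlFamily

ynl = YnlFamily("Documentation/netlink/specs/netdev.yaml")

# Bump the GRO flush timeout for one NAPI while busy polling, leaving
# defer-hard-irqs at its current value.
ynl.do('napi-set', {'id': 913, 'gro-flush-timeout': 50000})

# Later, restore both parameters in a single request.
ynl.do('napi-set', {'id': 913, 'defer-hard-irqs': 0, 'gro-flush-timeout': 0})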

The intention is that if this is accepted, it will be expanded to
support the suspend parameter proposed in a recent series [1].

Important edge case introduced:

In order to keep the existing sysfs parameters working as intended while
also supporting per-NAPI settings, an important change was made:
  - Writing a sysfs parameter writes both to the net_device-level field
    and to the per-NAPI field of every NAPI associated with the net
    device. This was done because the intent of writing to sysfs appears
    to be a global setting that takes effect for all NAPIs.
  - Reading a sysfs parameter reads the net_device-level field.
  - It is technically possible for a user to do the following:
    - Write a value to a sysfs param, which in turn sets all NAPIs to
      that value
    - Using the netlink API, write a new value to every NAPI on the
      system
    - Print the sysfs param

Printing the param then reveals a value that is no longer in use by any
existing NAPI, but which any newly created NAPIs (e.g. if new queues are
created) will inherit, as illustrated below.
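
Concretely, reusing the device and NAPI IDs from the walkthrough further
down (eth4 / ifindex 7, NAPI IDs 913 and 914):

$ sudo bash -c 'echo 20000 >/sys/class/net/eth4/gro_flush_timeout'
$ sudo ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
       --do napi-set --json='{"id": 913, "gro-flush-timeout": 11111}'
$ sudo ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
       --do napi-set --json='{"id": 914, "gro-flush-timeout": 11111}'
$ cat /sys/class/net/eth4/gro_flush_timeout
20000

sysfs still reports 20000 even though neither NAPI uses that value
anymore; only NAPIs created later would inherit it.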

To avoid the complexity of writing values to every NAPI, it is tempting
to think that the implementation could be something as simple as
(pseudocode):

   if (!napi->gro_flush_timeout)
     return dev->gro_flush_timeout;

However, this approach does not work if the user wants gro_flush_timeout
to be 0 for a specific NAPI while having it set to non-zero for the rest
of the system.
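
For reference, the patches below take the opposite approach: each NAPI's
fields are initialized from the net_device values when the NAPI is added,
and sysfs writes fan the new value out to every NAPI. Condensed from the
diffs in patches 1 and 3:

   /* netif_napi_add_weight(): new NAPIs inherit the net_device values */
   napi_set_defer_hard_irqs(napi, READ_ONCE(dev->napi_defer_hard_irqs));
   napi_set_gro_flush_timeout(napi, READ_ONCE(dev->gro_flush_timeout));

   /* sysfs write path: propagate the new value to every NAPI */
   void netdev_set_gro_flush_timeout(struct net_device *netdev,
                                     unsigned long timeout)
   {
           struct napi_struct *napi;

           WRITE_ONCE(netdev->gro_flush_timeout, timeout);
           list_for_each_entry(napi, &netdev->napi_list, dev_list)
                   napi_set_gro_flush_timeout(napi, timeout);
   }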

Here's a walkthrough of some common commands to illustrate how one
might use this:

First, output the current NAPI settings:

$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
                         --dump napi-get --json='{"ifindex": 7}'
[{'defer-hard-irqs': 0,
  'gro-flush-timeout': 0,
  'id': 914,
  'ifindex': 7,
  'irq': 529},
 {'defer-hard-irqs': 0,
  'gro-flush-timeout': 0,
  'id': 913,
  'ifindex': 7,
  'irq': 528},
 [...]

Now, set the global sysfs parameters:

$ sudo bash -c 'echo 20000 >/sys/class/net/eth4/gro_flush_timeout'
$ sudo bash -c 'echo 100 >/sys/class/net/eth4/napi_defer_hard_irqs' 

Output current NAPI settings again:

$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
                         --dump napi-get --json='{"ifindex": 7}'
[{'defer-hard-irqs': 100,
  'gro-flush-timeout': 20000,
  'id': 914,
  'ifindex': 7,
  'irq': 529},
 {'defer-hard-irqs': 100,
  'gro-flush-timeout': 20000,
  'id': 913,
  'ifindex': 7,
  'irq': 528},
 [...]

Now set NAPI ID 913 to specific values:

$ sudo ./tools/net/ynl/cli.py \
             --spec Documentation/netlink/specs/netdev.yaml \
             --do napi-set \
             --json='{"id": 913, "defer-hard-irqs": 111,
                      "gro-flush-timeout": 11111}'
None

Now output current NAPI settings again to ensure only 913 changed:

$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
                         --dump napi-get --json='{"ifindex": 7}'
[{'defer-hard-irqs': 100,
  'gro-flush-timeout': 20000,
  'id': 914,
  'ifindex': 7,
  'irq': 529},
 {'defer-hard-irqs': 111,
  'gro-flush-timeout': 11111,
  'id': 913,
  'ifindex': 7,
  'irq': 528},
[...]

Now, increase gro-flush-timeout only:

$ sudo ./tools/net/ynl/cli.py \
       --spec Documentation/netlink/specs/netdev.yaml \
       --do napi-set --json='{"id": 913, "gro-flush-timeout": 44444}'
None

Now output the current NAPI settings once more:

$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
                         --dump napi-get --json='{"ifindex": 7}'
[{'defer-hard-irqs': 100,
  'gro-flush-timeout': 20000,
  'id': 914,
  'ifindex': 7,
  'irq': 529},
 {'defer-hard-irqs': 111,
  'gro-flush-timeout': 44444,
  'id': 913,
  'ifindex': 7,
  'irq': 528},
[...]

Now set NAPI ID 913 to have gro_flush_timeout of 0:

$ sudo ./tools/net/ynl/cli.py \
       --spec Documentation/netlink/specs/netdev.yaml \
       --do napi-set --json='{"id": 913, "gro-flush-timeout": 0}'
None

Check that NAPI ID 913 has a value of 0:

$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
                         --dump napi-get --json='{"ifindex": 7}'
[{'defer-hard-irqs': 100,
  'gro-flush-timeout': 20000,
  'id': 914,
  'ifindex': 7,
  'irq': 529},
 {'defer-hard-irqs': 111,
  'gro-flush-timeout': 0,
  'id': 913,
  'ifindex': 7,
  'irq': 528},
[...]

Last, but not least, let's try writing the sysfs parameters to ensure
all NAPIs are rewritten:

$ sudo bash -c 'echo 33333 >/sys/class/net/eth4/gro_flush_timeout'
$ sudo bash -c 'echo 222 >/sys/class/net/eth4/napi_defer_hard_irqs' 

Check that worked:

$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
                         --dump napi-get --json='{"ifindex": 7}'
[{'defer-hard-irqs': 222,
  'gro-flush-timeout': 33333,
  'id': 914,
  'ifindex': 7,
  'irq': 529},
 {'defer-hard-irqs': 222,
  'gro-flush-timeout': 33333,
  'id': 913,
  'ifindex': 7,
  'irq': 528},
[...]

Thanks,
Joe

[1]: https://lore.kernel.org/lkml/20240823173103.94978-1-jdamato@fastly.com/

Joe Damato (5):
  net: napi: Make napi_defer_hard_irqs per-NAPI
  netdev-genl: Dump napi_defer_hard_irqs
  net: napi: Make gro_flush_timeout per-NAPI
  netdev-genl: Dump gro_flush_timeout
  netdev-genl: Support setting per-NAPI config values

 Documentation/netlink/specs/netdev.yaml | 23 ++++++++++
 include/linux/netdevice.h               | 49 ++++++++++++++++++++
 include/uapi/linux/netdev.h             |  3 ++
 net/core/dev.c                          | 61 ++++++++++++++++++++++---
 net/core/net-sysfs.c                    |  7 ++-
 net/core/netdev-genl-gen.c              | 14 ++++++
 net/core/netdev-genl-gen.h              |  1 +
 net/core/netdev-genl.c                  | 56 +++++++++++++++++++++++
 tools/include/uapi/linux/netdev.h       |  3 ++
 9 files changed, 208 insertions(+), 9 deletions(-)

-- 
2.25.1


^ permalink raw reply	[flat|nested] 48+ messages in thread

* [PATCH net-next 1/5] net: napi: Make napi_defer_hard_irqs per-NAPI
  2024-08-29 13:11 [PATCH net-next 0/5] Add support for per-NAPI config via netlink Joe Damato
@ 2024-08-29 13:11 ` Joe Damato
  2024-08-29 13:46   ` Eric Dumazet
                     ` (3 more replies)
  2024-08-29 13:11 ` [PATCH net-next 2/5] netdev-genl: Dump napi_defer_hard_irqs Joe Damato
                   ` (3 subsequent siblings)
  4 siblings, 4 replies; 48+ messages in thread
From: Joe Damato @ 2024-08-29 13:11 UTC (permalink / raw)
  To: netdev
  Cc: edumazet, amritha.nambiar, sridhar.samudrala, sdf, bjorn, hch,
	willy, willemdebruijn.kernel, skhawaja, kuba, Joe Damato,
	Martin Karsten, David S. Miller, Paolo Abeni, Jiri Pirko,
	Sebastian Andrzej Siewior, Lorenzo Bianconi, Breno Leitao,
	Johannes Berg, Alexander Lobakin, open list

Allow per-NAPI defer_hard_irqs setting.

The existing sysfs parameter is respected; writes to sysfs will write to
all NAPI structs for the device and the net_device defer_hard_irq field.
Reads from sysfs will read from the net_device field.

The sysfs code was updated to guard against a potential overflow: the
field is an int, but the value passed in is an unsigned long.

The ability to set defer_hard_irqs on specific NAPI instances will be
added in a later commit, via netdev-genl.

Signed-off-by: Joe Damato <jdamato@fastly.com>
Reviewed-by: Martin Karsten <mkarsten@uwaterloo.ca>
Tested-by: Martin Karsten <mkarsten@uwaterloo.ca>
---
 include/linux/netdevice.h | 23 +++++++++++++++++++++++
 net/core/dev.c            | 29 ++++++++++++++++++++++++++---
 net/core/net-sysfs.c      |  5 ++++-
 3 files changed, 53 insertions(+), 4 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index fce70990b209..7d53380da4c0 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -371,6 +371,7 @@ struct napi_struct {
 	struct list_head	rx_list; /* Pending GRO_NORMAL skbs */
 	int			rx_count; /* length of rx_list */
 	unsigned int		napi_id;
+	int			defer_hard_irqs;
 	struct hrtimer		timer;
 	struct task_struct	*thread;
 	/* control-path-only fields follow */
@@ -534,6 +535,28 @@ static inline void napi_schedule_irqoff(struct napi_struct *n)
 		__napi_schedule_irqoff(n);
 }
 
+/**
+ * napi_get_defer_hard_irqs - get the NAPI's defer_hard_irqs
+ * @n: napi struct to get the defer_hard_irqs field from
+ *
+ * Returns the per-NAPI value of the defer_hard_irqs field.
+ */
+int napi_get_defer_hard_irqs(const struct napi_struct *n);
+
+/**
+ * napi_set_defer_hard_irqs - set the defer_hard_irqs for a napi
+ * @n: napi_struct to set the defer_hard_irqs field
+ * @defer: the value the field should be set to
+ */
+void napi_set_defer_hard_irqs(struct napi_struct *n, int defer);
+
+/**
+ * netdev_set_defer_hard_irqs - set defer_hard_irqs for all NAPIs of a netdev
+ * @netdev: the net_device for which all NAPIs will have their defer_hard_irqs set
+ * @defer: the defer_hard_irqs value to set
+ */
+void netdev_set_defer_hard_irqs(struct net_device *netdev, int defer);
+
 /**
  * napi_complete_done - NAPI processing complete
  * @n: NAPI context
diff --git a/net/core/dev.c b/net/core/dev.c
index 63987b8b7c85..f7baff0da057 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -6212,6 +6212,28 @@ void __napi_schedule_irqoff(struct napi_struct *n)
 }
 EXPORT_SYMBOL(__napi_schedule_irqoff);
 
+int napi_get_defer_hard_irqs(const struct napi_struct *n)
+{
+	return READ_ONCE(n->defer_hard_irqs);
+}
+EXPORT_SYMBOL_GPL(napi_get_defer_hard_irqs);
+
+void napi_set_defer_hard_irqs(struct napi_struct *n, int defer)
+{
+	WRITE_ONCE(n->defer_hard_irqs, defer);
+}
+EXPORT_SYMBOL_GPL(napi_set_defer_hard_irqs);
+
+void netdev_set_defer_hard_irqs(struct net_device *netdev, int defer)
+{
+	struct napi_struct *napi;
+
+	WRITE_ONCE(netdev->napi_defer_hard_irqs, defer);
+	list_for_each_entry(napi, &netdev->napi_list, dev_list)
+		napi_set_defer_hard_irqs(napi, defer);
+}
+EXPORT_SYMBOL_GPL(netdev_set_defer_hard_irqs);
+
 bool napi_complete_done(struct napi_struct *n, int work_done)
 {
 	unsigned long flags, val, new, timeout = 0;
@@ -6230,7 +6252,7 @@ bool napi_complete_done(struct napi_struct *n, int work_done)
 	if (work_done) {
 		if (n->gro_bitmask)
 			timeout = READ_ONCE(n->dev->gro_flush_timeout);
-		n->defer_hard_irqs_count = READ_ONCE(n->dev->napi_defer_hard_irqs);
+		n->defer_hard_irqs_count = napi_get_defer_hard_irqs(n);
 	}
 	if (n->defer_hard_irqs_count > 0) {
 		n->defer_hard_irqs_count--;
@@ -6368,7 +6390,7 @@ static void busy_poll_stop(struct napi_struct *napi, void *have_poll_lock,
 	bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx);
 
 	if (flags & NAPI_F_PREFER_BUSY_POLL) {
-		napi->defer_hard_irqs_count = READ_ONCE(napi->dev->napi_defer_hard_irqs);
+		napi->defer_hard_irqs_count = napi_get_defer_hard_irqs(napi);
 		timeout = READ_ONCE(napi->dev->gro_flush_timeout);
 		if (napi->defer_hard_irqs_count && timeout) {
 			hrtimer_start(&napi->timer, ns_to_ktime(timeout), HRTIMER_MODE_REL_PINNED);
@@ -6650,6 +6672,7 @@ void netif_napi_add_weight(struct net_device *dev, struct napi_struct *napi,
 	INIT_HLIST_NODE(&napi->napi_hash_node);
 	hrtimer_init(&napi->timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL_PINNED);
 	napi->timer.function = napi_watchdog;
+	napi_set_defer_hard_irqs(napi, READ_ONCE(dev->napi_defer_hard_irqs));
 	init_gro_hash(napi);
 	napi->skb = NULL;
 	INIT_LIST_HEAD(&napi->rx_list);
@@ -11032,7 +11055,7 @@ void netdev_sw_irq_coalesce_default_on(struct net_device *dev)
 
 	if (!IS_ENABLED(CONFIG_PREEMPT_RT)) {
 		dev->gro_flush_timeout = 20000;
-		dev->napi_defer_hard_irqs = 1;
+		netdev_set_defer_hard_irqs(dev, 1);
 	}
 }
 EXPORT_SYMBOL_GPL(netdev_sw_irq_coalesce_default_on);
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 0e2084ce7b75..8272f0144d81 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -425,7 +425,10 @@ NETDEVICE_SHOW_RW(gro_flush_timeout, fmt_ulong);
 
 static int change_napi_defer_hard_irqs(struct net_device *dev, unsigned long val)
 {
-	WRITE_ONCE(dev->napi_defer_hard_irqs, val);
+	if (val > S32_MAX)
+		return -EINVAL;
+
+	netdev_set_defer_hard_irqs(dev, (int)val);
 	return 0;
 }
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH net-next 2/5] netdev-genl: Dump napi_defer_hard_irqs
  2024-08-29 13:11 [PATCH net-next 0/5] Add support for per-NAPI config via netlink Joe Damato
  2024-08-29 13:11 ` [PATCH net-next 1/5] net: napi: Make napi_defer_hard_irqs per-NAPI Joe Damato
@ 2024-08-29 13:11 ` Joe Damato
  2024-08-29 22:08   ` Jakub Kicinski
  2024-08-29 13:11 ` [PATCH net-next 3/5] net: napi: Make gro_flush_timeout per-NAPI Joe Damato
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 48+ messages in thread
From: Joe Damato @ 2024-08-29 13:11 UTC (permalink / raw)
  To: netdev
  Cc: edumazet, amritha.nambiar, sridhar.samudrala, sdf, bjorn, hch,
	willy, willemdebruijn.kernel, skhawaja, kuba, Joe Damato,
	Martin Karsten, Donald Hunter, David S. Miller, Paolo Abeni,
	Jesper Dangaard Brouer, Xuan Zhuo, Daniel Jurgens, open list

Support dumping defer_hard_irqs for a NAPI ID.

Signed-off-by: Joe Damato <jdamato@fastly.com>
Reviewed-by: Martin Karsten <mkarsten@uwaterloo.ca>
Tested-by: Martin Karsten <mkarsten@uwaterloo.ca>
---
 Documentation/netlink/specs/netdev.yaml | 6 ++++++
 include/uapi/linux/netdev.h             | 1 +
 net/core/netdev-genl.c                  | 5 +++++
 tools/include/uapi/linux/netdev.h       | 1 +
 4 files changed, 13 insertions(+)

diff --git a/Documentation/netlink/specs/netdev.yaml b/Documentation/netlink/specs/netdev.yaml
index 959755be4d7f..ee4f99fd4574 100644
--- a/Documentation/netlink/specs/netdev.yaml
+++ b/Documentation/netlink/specs/netdev.yaml
@@ -244,6 +244,11 @@ attribute-sets:
              threaded mode. If NAPI is not in threaded mode (i.e. uses normal
              softirq context), the attribute will be absent.
         type: u32
+      -
+        name: defer-hard-irqs
+        doc: The number of consecutive empty polls before IRQ deferral ends
+             and hardware IRQs are re-enabled.
+        type: s32
   -
     name: queue
     attributes:
@@ -593,6 +598,7 @@ operations:
             - ifindex
             - irq
             - pid
+            - defer-hard-irqs
       dump:
         request:
           attributes:
diff --git a/include/uapi/linux/netdev.h b/include/uapi/linux/netdev.h
index 43742ac5b00d..43bb1aad9611 100644
--- a/include/uapi/linux/netdev.h
+++ b/include/uapi/linux/netdev.h
@@ -121,6 +121,7 @@ enum {
 	NETDEV_A_NAPI_ID,
 	NETDEV_A_NAPI_IRQ,
 	NETDEV_A_NAPI_PID,
+	NETDEV_A_NAPI_DEFER_HARD_IRQS,
 
 	__NETDEV_A_NAPI_MAX,
 	NETDEV_A_NAPI_MAX = (__NETDEV_A_NAPI_MAX - 1)
diff --git a/net/core/netdev-genl.c b/net/core/netdev-genl.c
index 05f9515d2c05..79b14ad6f4bb 100644
--- a/net/core/netdev-genl.c
+++ b/net/core/netdev-genl.c
@@ -160,6 +160,7 @@ static int
 netdev_nl_napi_fill_one(struct sk_buff *rsp, struct napi_struct *napi,
 			const struct genl_info *info)
 {
+	int napi_defer_hard_irqs;
 	void *hdr;
 	pid_t pid;
 
@@ -188,6 +189,10 @@ netdev_nl_napi_fill_one(struct sk_buff *rsp, struct napi_struct *napi,
 			goto nla_put_failure;
 	}
 
+	napi_defer_hard_irqs = napi_get_defer_hard_irqs(napi);
+	if (nla_put_s32(rsp, NETDEV_A_NAPI_DEFER_HARD_IRQS, napi_defer_hard_irqs))
+		goto nla_put_failure;
+
 	genlmsg_end(rsp, hdr);
 
 	return 0;
diff --git a/tools/include/uapi/linux/netdev.h b/tools/include/uapi/linux/netdev.h
index 43742ac5b00d..43bb1aad9611 100644
--- a/tools/include/uapi/linux/netdev.h
+++ b/tools/include/uapi/linux/netdev.h
@@ -121,6 +121,7 @@ enum {
 	NETDEV_A_NAPI_ID,
 	NETDEV_A_NAPI_IRQ,
 	NETDEV_A_NAPI_PID,
+	NETDEV_A_NAPI_DEFER_HARD_IRQS,
 
 	__NETDEV_A_NAPI_MAX,
 	NETDEV_A_NAPI_MAX = (__NETDEV_A_NAPI_MAX - 1)
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH net-next 3/5] net: napi: Make gro_flush_timeout per-NAPI
  2024-08-29 13:11 [PATCH net-next 0/5] Add support for per-NAPI config via netlink Joe Damato
  2024-08-29 13:11 ` [PATCH net-next 1/5] net: napi: Make napi_defer_hard_irqs per-NAPI Joe Damato
  2024-08-29 13:11 ` [PATCH net-next 2/5] netdev-genl: Dump napi_defer_hard_irqs Joe Damato
@ 2024-08-29 13:11 ` Joe Damato
  2024-08-29 13:48   ` Eric Dumazet
                     ` (2 more replies)
  2024-08-29 13:12 ` [PATCH net-next 4/5] netdev-genl: Dump gro_flush_timeout Joe Damato
  2024-08-29 13:12 ` [PATCH net-next 5/5] netdev-genl: Support setting per-NAPI config values Joe Damato
  4 siblings, 3 replies; 48+ messages in thread
From: Joe Damato @ 2024-08-29 13:11 UTC (permalink / raw)
  To: netdev
  Cc: edumazet, amritha.nambiar, sridhar.samudrala, sdf, bjorn, hch,
	willy, willemdebruijn.kernel, skhawaja, kuba, Joe Damato,
	Martin Karsten, David S. Miller, Paolo Abeni, Jiri Pirko,
	Sebastian Andrzej Siewior, Lorenzo Bianconi, Breno Leitao,
	Johannes Berg, Alexander Lobakin, Heiner Kallweit, open list

Allow per-NAPI gro_flush_timeout setting.

The existing sysfs parameter is respected; writes to sysfs will write to
all NAPI structs for the device and the net_device gro_flush_timeout
field.  Reads from sysfs will read from the net_device field.

The ability to set gro_flush_timeout on specific NAPI instances will be
added in a later commit, via netdev-genl.

Signed-off-by: Joe Damato <jdamato@fastly.com>
Reviewed-by: Martin Karsten <mkarsten@uwaterloo.ca>
Tested-by: Martin Karsten <mkarsten@uwaterloo.ca>
---
 include/linux/netdevice.h | 26 ++++++++++++++++++++++++++
 net/core/dev.c            | 32 ++++++++++++++++++++++++++++----
 net/core/net-sysfs.c      |  2 +-
 3 files changed, 55 insertions(+), 5 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 7d53380da4c0..d00024d9f857 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -372,6 +372,7 @@ struct napi_struct {
 	int			rx_count; /* length of rx_list */
 	unsigned int		napi_id;
 	int			defer_hard_irqs;
+	unsigned long		gro_flush_timeout;
 	struct hrtimer		timer;
 	struct task_struct	*thread;
 	/* control-path-only fields follow */
@@ -557,6 +558,31 @@ void napi_set_defer_hard_irqs(struct napi_struct *n, int defer);
  */
 void netdev_set_defer_hard_irqs(struct net_device *netdev, int defer);
 
+/**
+ * napi_get_gro_flush_timeout - get the gro_flush_timeout
+ * @n: napi struct to get the gro_flush_timeout from
+ *
+ * Returns the per-NAPI value of the gro_flush_timeout field.
+ */
+unsigned long napi_get_gro_flush_timeout(const struct napi_struct *n);
+
+/**
+ * napi_set_gro_flush_timeout - set the gro_flush_timeout for a napi
+ * @n: napi struct to set the gro_flush_timeout
+ * @timeout: timeout value to set
+ *
+ * napi_set_gro_flush_timeout sets the per-NAPI gro_flush_timeout
+ */
+void napi_set_gro_flush_timeout(struct napi_struct *n, unsigned long timeout);
+
+/**
+ * netdev_set_gro_flush_timeout - set gro_flush_timeout for all NAPIs of a netdev
+ * @netdev: the net_device for which all NAPIs will have their gro_flush_timeout set
+ * @timeout: the timeout value to set
+ */
+void netdev_set_gro_flush_timeout(struct net_device *netdev,
+				  unsigned long timeout);
+
 /**
  * napi_complete_done - NAPI processing complete
  * @n: NAPI context
diff --git a/net/core/dev.c b/net/core/dev.c
index f7baff0da057..3f7cb1085efa 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -6234,6 +6234,29 @@ void netdev_set_defer_hard_irqs(struct net_device *netdev, int defer)
 }
 EXPORT_SYMBOL_GPL(netdev_set_defer_hard_irqs);
 
+unsigned long napi_get_gro_flush_timeout(const struct napi_struct *n)
+{
+	return READ_ONCE(n->gro_flush_timeout);
+}
+EXPORT_SYMBOL_GPL(napi_get_gro_flush_timeout);
+
+void napi_set_gro_flush_timeout(struct napi_struct *n, unsigned long timeout)
+{
+	WRITE_ONCE(n->gro_flush_timeout, timeout);
+}
+EXPORT_SYMBOL_GPL(napi_set_gro_flush_timeout);
+
+void netdev_set_gro_flush_timeout(struct net_device *netdev,
+				  unsigned long timeout)
+{
+	struct napi_struct *napi;
+
+	WRITE_ONCE(netdev->gro_flush_timeout, timeout);
+	list_for_each_entry(napi, &netdev->napi_list, dev_list)
+		napi_set_gro_flush_timeout(napi, timeout);
+}
+EXPORT_SYMBOL_GPL(netdev_set_gro_flush_timeout);
+
 bool napi_complete_done(struct napi_struct *n, int work_done)
 {
 	unsigned long flags, val, new, timeout = 0;
@@ -6251,12 +6274,12 @@ bool napi_complete_done(struct napi_struct *n, int work_done)
 
 	if (work_done) {
 		if (n->gro_bitmask)
-			timeout = READ_ONCE(n->dev->gro_flush_timeout);
+			timeout = napi_get_gro_flush_timeout(n);
 		n->defer_hard_irqs_count = napi_get_defer_hard_irqs(n);
 	}
 	if (n->defer_hard_irqs_count > 0) {
 		n->defer_hard_irqs_count--;
-		timeout = READ_ONCE(n->dev->gro_flush_timeout);
+		timeout = napi_get_gro_flush_timeout(n);
 		if (timeout)
 			ret = false;
 	}
@@ -6391,7 +6414,7 @@ static void busy_poll_stop(struct napi_struct *napi, void *have_poll_lock,
 
 	if (flags & NAPI_F_PREFER_BUSY_POLL) {
 		napi->defer_hard_irqs_count = napi_get_defer_hard_irqs(napi);
-		timeout = READ_ONCE(napi->dev->gro_flush_timeout);
+		timeout = napi_get_gro_flush_timeout(napi);
 		if (napi->defer_hard_irqs_count && timeout) {
 			hrtimer_start(&napi->timer, ns_to_ktime(timeout), HRTIMER_MODE_REL_PINNED);
 			skip_schedule = true;
@@ -6673,6 +6696,7 @@ void netif_napi_add_weight(struct net_device *dev, struct napi_struct *napi,
 	hrtimer_init(&napi->timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL_PINNED);
 	napi->timer.function = napi_watchdog;
 	napi_set_defer_hard_irqs(napi, READ_ONCE(dev->napi_defer_hard_irqs));
+	napi_set_gro_flush_timeout(napi, READ_ONCE(dev->gro_flush_timeout));
 	init_gro_hash(napi);
 	napi->skb = NULL;
 	INIT_LIST_HEAD(&napi->rx_list);
@@ -11054,7 +11078,7 @@ void netdev_sw_irq_coalesce_default_on(struct net_device *dev)
 	WARN_ON(dev->reg_state == NETREG_REGISTERED);
 
 	if (!IS_ENABLED(CONFIG_PREEMPT_RT)) {
-		dev->gro_flush_timeout = 20000;
+		netdev_set_gro_flush_timeout(dev, 20000);
 		netdev_set_defer_hard_irqs(dev, 1);
 	}
 }
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 8272f0144d81..ff545a422b1f 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -408,7 +408,7 @@ NETDEVICE_SHOW_RW(tx_queue_len, fmt_dec);
 
 static int change_gro_flush_timeout(struct net_device *dev, unsigned long val)
 {
-	WRITE_ONCE(dev->gro_flush_timeout, val);
+	netdev_set_gro_flush_timeout(dev, val);
 	return 0;
 }
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH net-next 4/5] netdev-genl: Dump gro_flush_timeout
  2024-08-29 13:11 [PATCH net-next 0/5] Add support for per-NAPI config via netlink Joe Damato
                   ` (2 preceding siblings ...)
  2024-08-29 13:11 ` [PATCH net-next 3/5] net: napi: Make gro_flush_timeout per-NAPI Joe Damato
@ 2024-08-29 13:12 ` Joe Damato
  2024-08-29 22:09   ` Jakub Kicinski
  2024-08-29 13:12 ` [PATCH net-next 5/5] netdev-genl: Support setting per-NAPI config values Joe Damato
  4 siblings, 1 reply; 48+ messages in thread
From: Joe Damato @ 2024-08-29 13:12 UTC (permalink / raw)
  To: netdev
  Cc: edumazet, amritha.nambiar, sridhar.samudrala, sdf, bjorn, hch,
	willy, willemdebruijn.kernel, skhawaja, kuba, Joe Damato,
	Martin Karsten, Donald Hunter, David S. Miller, Paolo Abeni,
	Jesper Dangaard Brouer, Xuan Zhuo, Larysa Zaremba, open list

Support dumping gro_flush_timeout for a NAPI ID.

Signed-off-by: Joe Damato <jdamato@fastly.com>
Reviewed-by: Martin Karsten <mkarsten@uwaterloo.ca>
Tested-by: Martin Karsten <mkarsten@uwaterloo.ca>
---
 Documentation/netlink/specs/netdev.yaml | 6 ++++++
 include/uapi/linux/netdev.h             | 1 +
 net/core/netdev-genl.c                  | 6 ++++++
 tools/include/uapi/linux/netdev.h       | 1 +
 4 files changed, 14 insertions(+)

diff --git a/Documentation/netlink/specs/netdev.yaml b/Documentation/netlink/specs/netdev.yaml
index ee4f99fd4574..290894962ac4 100644
--- a/Documentation/netlink/specs/netdev.yaml
+++ b/Documentation/netlink/specs/netdev.yaml
@@ -249,6 +249,11 @@ attribute-sets:
         doc: The number of consecutive empty polls before IRQ deferral ends
              and hardware IRQs are re-enabled.
         type: s32
+      -
+        name: gro-flush-timeout
+        doc: The timeout, in nanoseconds, of when to trigger the NAPI
+             watchdog timer and schedule NAPI processing.
+        type: u64
   -
     name: queue
     attributes:
@@ -599,6 +604,7 @@ operations:
             - irq
             - pid
             - defer-hard-irqs
+            - gro-flush-timeout
       dump:
         request:
           attributes:
diff --git a/include/uapi/linux/netdev.h b/include/uapi/linux/netdev.h
index 43bb1aad9611..b088a34e9254 100644
--- a/include/uapi/linux/netdev.h
+++ b/include/uapi/linux/netdev.h
@@ -122,6 +122,7 @@ enum {
 	NETDEV_A_NAPI_IRQ,
 	NETDEV_A_NAPI_PID,
 	NETDEV_A_NAPI_DEFER_HARD_IRQS,
+	NETDEV_A_NAPI_GRO_FLUSH_TIMEOUT,
 
 	__NETDEV_A_NAPI_MAX,
 	NETDEV_A_NAPI_MAX = (__NETDEV_A_NAPI_MAX - 1)
diff --git a/net/core/netdev-genl.c b/net/core/netdev-genl.c
index 79b14ad6f4bb..2eee95d05fe0 100644
--- a/net/core/netdev-genl.c
+++ b/net/core/netdev-genl.c
@@ -161,6 +161,7 @@ netdev_nl_napi_fill_one(struct sk_buff *rsp, struct napi_struct *napi,
 			const struct genl_info *info)
 {
 	int napi_defer_hard_irqs;
+	unsigned long gro_flush_timeout;
 	void *hdr;
 	pid_t pid;
 
@@ -193,6 +194,11 @@ netdev_nl_napi_fill_one(struct sk_buff *rsp, struct napi_struct *napi,
 	if (nla_put_s32(rsp, NETDEV_A_NAPI_DEFER_HARD_IRQS, napi_defer_hard_irqs))
 		goto nla_put_failure;
 
+	gro_flush_timeout = napi_get_gro_flush_timeout(napi);
+	if (nla_put_u64_64bit(rsp, NETDEV_A_NAPI_GRO_FLUSH_TIMEOUT,
+			      gro_flush_timeout, NETDEV_A_DEV_PAD))
+		goto nla_put_failure;
+
 	genlmsg_end(rsp, hdr);
 
 	return 0;
diff --git a/tools/include/uapi/linux/netdev.h b/tools/include/uapi/linux/netdev.h
index 43bb1aad9611..b088a34e9254 100644
--- a/tools/include/uapi/linux/netdev.h
+++ b/tools/include/uapi/linux/netdev.h
@@ -122,6 +122,7 @@ enum {
 	NETDEV_A_NAPI_IRQ,
 	NETDEV_A_NAPI_PID,
 	NETDEV_A_NAPI_DEFER_HARD_IRQS,
+	NETDEV_A_NAPI_GRO_FLUSH_TIMEOUT,
 
 	__NETDEV_A_NAPI_MAX,
 	NETDEV_A_NAPI_MAX = (__NETDEV_A_NAPI_MAX - 1)
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH net-next 5/5] netdev-genl: Support setting per-NAPI config values
  2024-08-29 13:11 [PATCH net-next 0/5] Add support for per-NAPI config via netlink Joe Damato
                   ` (3 preceding siblings ...)
  2024-08-29 13:12 ` [PATCH net-next 4/5] netdev-genl: Dump gro_flush_timeout Joe Damato
@ 2024-08-29 13:12 ` Joe Damato
  2024-08-29 22:31   ` Jakub Kicinski
  4 siblings, 1 reply; 48+ messages in thread
From: Joe Damato @ 2024-08-29 13:12 UTC (permalink / raw)
  To: netdev
  Cc: edumazet, amritha.nambiar, sridhar.samudrala, sdf, bjorn, hch,
	willy, willemdebruijn.kernel, skhawaja, kuba, Joe Damato,
	Martin Karsten, Donald Hunter, David S. Miller, Paolo Abeni,
	Jesper Dangaard Brouer, Xuan Zhuo, Daniel Jurgens, open list

Add support to set per-NAPI defer_hard_irqs and gro_flush_timeout.

Signed-off-by: Joe Damato <jdamato@fastly.com>
Reviewed-by: Martin Karsten <mkarsten@uwaterloo.ca>
Tested-by: Martin Karsten <mkarsten@uwaterloo.ca>
---
 Documentation/netlink/specs/netdev.yaml | 11 ++++++
 include/uapi/linux/netdev.h             |  1 +
 net/core/netdev-genl-gen.c              | 14 ++++++++
 net/core/netdev-genl-gen.h              |  1 +
 net/core/netdev-genl.c                  | 45 +++++++++++++++++++++++++
 tools/include/uapi/linux/netdev.h       |  1 +
 6 files changed, 73 insertions(+)

diff --git a/Documentation/netlink/specs/netdev.yaml b/Documentation/netlink/specs/netdev.yaml
index 290894962ac4..cb7049a1d6d8 100644
--- a/Documentation/netlink/specs/netdev.yaml
+++ b/Documentation/netlink/specs/netdev.yaml
@@ -631,6 +631,17 @@ operations:
             - rx-bytes
             - tx-packets
             - tx-bytes
+    -
+      name: napi-set
+      doc: Set configurable NAPI instance settings.
+      attribute-set: napi
+      flags: [ admin-perm ]
+      do:
+        request:
+          attributes:
+            - id
+            - defer-hard-irqs
+            - gro-flush-timeout
 
 mcast-groups:
   list:
diff --git a/include/uapi/linux/netdev.h b/include/uapi/linux/netdev.h
index b088a34e9254..4c5bfbc85504 100644
--- a/include/uapi/linux/netdev.h
+++ b/include/uapi/linux/netdev.h
@@ -188,6 +188,7 @@ enum {
 	NETDEV_CMD_QUEUE_GET,
 	NETDEV_CMD_NAPI_GET,
 	NETDEV_CMD_QSTATS_GET,
+	NETDEV_CMD_NAPI_SET,
 
 	__NETDEV_CMD_MAX,
 	NETDEV_CMD_MAX = (__NETDEV_CMD_MAX - 1)
diff --git a/net/core/netdev-genl-gen.c b/net/core/netdev-genl-gen.c
index 8350a0afa9ec..5ddb5d926850 100644
--- a/net/core/netdev-genl-gen.c
+++ b/net/core/netdev-genl-gen.c
@@ -74,6 +74,13 @@ static const struct nla_policy netdev_qstats_get_nl_policy[NETDEV_A_QSTATS_SCOPE
 	[NETDEV_A_QSTATS_SCOPE] = NLA_POLICY_MASK(NLA_UINT, 0x1),
 };
 
+/* NETDEV_CMD_NAPI_SET - set */
+static const struct nla_policy netdev_napi_set_nl_policy[NETDEV_A_NAPI_GRO_FLUSH_TIMEOUT + 1] = {
+	[NETDEV_A_NAPI_ID] = { .type = NLA_U32, },
+	[NETDEV_A_NAPI_DEFER_HARD_IRQS] = { .type = NLA_S32 },
+	[NETDEV_A_NAPI_GRO_FLUSH_TIMEOUT] = { .type = NLA_U64 },
+};
+
 /* Ops table for netdev */
 static const struct genl_split_ops netdev_nl_ops[] = {
 	{
@@ -151,6 +158,13 @@ static const struct genl_split_ops netdev_nl_ops[] = {
 		.maxattr	= NETDEV_A_QSTATS_SCOPE,
 		.flags		= GENL_CMD_CAP_DUMP,
 	},
+	{
+		.cmd		= NETDEV_CMD_NAPI_SET,
+		.doit		= netdev_nl_napi_set_doit,
+		.policy		= netdev_napi_set_nl_policy,
+		.maxattr	= NETDEV_A_NAPI_GRO_FLUSH_TIMEOUT,
+		.flags		= GENL_ADMIN_PERM | GENL_CMD_CAP_DO,
+	},
 };
 
 static const struct genl_multicast_group netdev_nl_mcgrps[] = {
diff --git a/net/core/netdev-genl-gen.h b/net/core/netdev-genl-gen.h
index 4db40fd5b4a9..b70cb0f20acb 100644
--- a/net/core/netdev-genl-gen.h
+++ b/net/core/netdev-genl-gen.h
@@ -28,6 +28,7 @@ int netdev_nl_queue_get_dumpit(struct sk_buff *skb,
 			       struct netlink_callback *cb);
 int netdev_nl_napi_get_doit(struct sk_buff *skb, struct genl_info *info);
 int netdev_nl_napi_get_dumpit(struct sk_buff *skb, struct netlink_callback *cb);
+int netdev_nl_napi_set_doit(struct sk_buff *skb, struct genl_info *info);
 int netdev_nl_qstats_get_dumpit(struct sk_buff *skb,
 				struct netlink_callback *cb);
 
diff --git a/net/core/netdev-genl.c b/net/core/netdev-genl.c
index 2eee95d05fe0..43dd30eadbc6 100644
--- a/net/core/netdev-genl.c
+++ b/net/core/netdev-genl.c
@@ -299,6 +299,51 @@ int netdev_nl_napi_get_dumpit(struct sk_buff *skb, struct netlink_callback *cb)
 	return err;
 }
 
+static int
+netdev_nl_napi_set_config(struct napi_struct *napi, struct genl_info *info)
+{
+	u64 gro_flush_timeout = 0;
+	int defer = 0;
+
+	if (info->attrs[NETDEV_A_NAPI_DEFER_HARD_IRQS]) {
+		defer = nla_get_s32(info->attrs[NETDEV_A_NAPI_DEFER_HARD_IRQS]);
+		if (defer < 0)
+			return -EINVAL;
+		napi_set_defer_hard_irqs(napi, defer);
+	}
+
+	if (info->attrs[NETDEV_A_NAPI_GRO_FLUSH_TIMEOUT]) {
+		gro_flush_timeout = nla_get_u64(info->attrs[NETDEV_A_NAPI_GRO_FLUSH_TIMEOUT]);
+		napi_set_gro_flush_timeout(napi, gro_flush_timeout);
+	}
+
+	return 0;
+}
+
+int netdev_nl_napi_set_doit(struct sk_buff *skb, struct genl_info *info)
+{
+	struct napi_struct *napi;
+	u32 napi_id;
+	int err;
+
+	if (GENL_REQ_ATTR_CHECK(info, NETDEV_A_NAPI_ID))
+		return -EINVAL;
+
+	napi_id = nla_get_u32(info->attrs[NETDEV_A_NAPI_ID]);
+
+	rtnl_lock();
+
+	napi = napi_by_id(napi_id);
+	if (napi)
+		err = netdev_nl_napi_set_config(napi, info);
+	else
+		err = -EINVAL;
+
+	rtnl_unlock();
+
+	return err;
+}
+
 static int
 netdev_nl_queue_fill_one(struct sk_buff *rsp, struct net_device *netdev,
 			 u32 q_idx, u32 q_type, const struct genl_info *info)
diff --git a/tools/include/uapi/linux/netdev.h b/tools/include/uapi/linux/netdev.h
index b088a34e9254..4c5bfbc85504 100644
--- a/tools/include/uapi/linux/netdev.h
+++ b/tools/include/uapi/linux/netdev.h
@@ -188,6 +188,7 @@ enum {
 	NETDEV_CMD_QUEUE_GET,
 	NETDEV_CMD_NAPI_GET,
 	NETDEV_CMD_QSTATS_GET,
+	NETDEV_CMD_NAPI_SET,
 
 	__NETDEV_CMD_MAX,
 	NETDEV_CMD_MAX = (__NETDEV_CMD_MAX - 1)
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: [PATCH net-next 1/5] net: napi: Make napi_defer_hard_irqs per-NAPI
  2024-08-29 13:11 ` [PATCH net-next 1/5] net: napi: Make napi_defer_hard_irqs per-NAPI Joe Damato
@ 2024-08-29 13:46   ` Eric Dumazet
  2024-08-29 22:05   ` Jakub Kicinski
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 48+ messages in thread
From: Eric Dumazet @ 2024-08-29 13:46 UTC (permalink / raw)
  To: Joe Damato
  Cc: netdev, amritha.nambiar, sridhar.samudrala, sdf, bjorn, hch,
	willy, willemdebruijn.kernel, skhawaja, kuba, Martin Karsten,
	David S. Miller, Paolo Abeni, Jiri Pirko,
	Sebastian Andrzej Siewior, Lorenzo Bianconi, Breno Leitao,
	Johannes Berg, Alexander Lobakin, open list

On Thu, Aug 29, 2024 at 3:13 PM Joe Damato <jdamato@fastly.com> wrote:
>
> Allow per-NAPI defer_hard_irqs setting.
>
> The existing sysfs parameter is respected; writes to sysfs will write to
> all NAPI structs for the device and the net_device defer_hard_irq field.
> Reads from sysfs will read from the net_device field.
>
> The sysfs code was updated to guard against a potential overflow: the
> field is an int, but the value passed in is an unsigned long.
>
> The ability to set defer_hard_irqs on specific NAPI instances will be
> added in a later commit, via netdev-genl.
>
> Signed-off-by: Joe Damato <jdamato@fastly.com>
> Reviewed-by: Martin Karsten <mkarsten@uwaterloo.ca>
> Tested-by: Martin Karsten <mkarsten@uwaterloo.ca>
> ---
>  include/linux/netdevice.h | 23 +++++++++++++++++++++++
>  net/core/dev.c            | 29 ++++++++++++++++++++++++++---
>  net/core/net-sysfs.c      |  5 ++++-
>  3 files changed, 53 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index fce70990b209..7d53380da4c0 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -371,6 +371,7 @@ struct napi_struct {
>         struct list_head        rx_list; /* Pending GRO_NORMAL skbs */
>         int                     rx_count; /* length of rx_list */
>         unsigned int            napi_id;
> +       int                     defer_hard_irqs;
>         struct hrtimer          timer;
>         struct task_struct      *thread;
>         /* control-path-only fields follow */
> @@ -534,6 +535,28 @@ static inline void napi_schedule_irqoff(struct napi_struct *n)
>                 __napi_schedule_irqoff(n);
>  }
>

Since dev->napi_defer_hard_irqs is no longer used in fast path,
please move it outside of the net_device_read_rx group.

Also update Documentation/networking/net_cachelines/net_device.rst

> +/**
> + * napi_get_defer_hard_irqs - get the NAPI's defer_hard_irqs
> + * @n: napi struct to get the defer_hard_irqs field from
> + *
> + * Returns the per-NAPI value of the defer_hard_irqs field.
> + */
> +int napi_get_defer_hard_irqs(const struct napi_struct *n);
> +
> +/**
> + * napi_set_defer_hard_irqs - set the defer_hard_irqs for a napi
> + * @n: napi_struct to set the defer_hard_irqs field
> + * @defer: the value the field should be set to
> + */
> +void napi_set_defer_hard_irqs(struct napi_struct *n, int defer);
> +
> +/**
> + * netdev_set_defer_hard_irqs - set defer_hard_irqs for all NAPIs of a netdev
> + * @netdev: the net_device for which all NAPIs will have their defer_hard_irqs set
> + * @defer: the defer_hard_irqs value to set
> + */
> +void netdev_set_defer_hard_irqs(struct net_device *netdev, int defer);
> +
>  /**
>   * napi_complete_done - NAPI processing complete
>   * @n: NAPI context
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 63987b8b7c85..f7baff0da057 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -6212,6 +6212,28 @@ void __napi_schedule_irqoff(struct napi_struct *n)
>  }
>  EXPORT_SYMBOL(__napi_schedule_irqoff);
>
> +int napi_get_defer_hard_irqs(const struct napi_struct *n)
> +{
> +       return READ_ONCE(n->defer_hard_irqs);
> +}
> +EXPORT_SYMBOL_GPL(napi_get_defer_hard_irqs);
> +
> +void napi_set_defer_hard_irqs(struct napi_struct *n, int defer)
> +{
> +       WRITE_ONCE(n->defer_hard_irqs, defer);
> +}
> +EXPORT_SYMBOL_GPL(napi_set_defer_hard_irqs);
> +
> +void netdev_set_defer_hard_irqs(struct net_device *netdev, int defer)
> +{
> +       struct napi_struct *napi;
> +
> +       WRITE_ONCE(netdev->napi_defer_hard_irqs, defer);
> +       list_for_each_entry(napi, &netdev->napi_list, dev_list)
> +               napi_set_defer_hard_irqs(napi, defer);
> +}
> +EXPORT_SYMBOL_GPL(netdev_set_defer_hard_irqs);
> +
>  bool napi_complete_done(struct napi_struct *n, int work_done)
>  {
>         unsigned long flags, val, new, timeout = 0;
> @@ -6230,7 +6252,7 @@ bool napi_complete_done(struct napi_struct *n, int work_done)
>         if (work_done) {
>                 if (n->gro_bitmask)
>                         timeout = READ_ONCE(n->dev->gro_flush_timeout);
> -               n->defer_hard_irqs_count = READ_ONCE(n->dev->napi_defer_hard_irqs);
> +               n->defer_hard_irqs_count = napi_get_defer_hard_irqs(n);
>         }
>         if (n->defer_hard_irqs_count > 0) {
>                 n->defer_hard_irqs_count--;
> @@ -6368,7 +6390,7 @@ static void busy_poll_stop(struct napi_struct *napi, void *have_poll_lock,
>         bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx);
>
>         if (flags & NAPI_F_PREFER_BUSY_POLL) {
> -               napi->defer_hard_irqs_count = READ_ONCE(napi->dev->napi_defer_hard_irqs);
> +               napi->defer_hard_irqs_count = napi_get_defer_hard_irqs(napi);
>                 timeout = READ_ONCE(napi->dev->gro_flush_timeout);
>                 if (napi->defer_hard_irqs_count && timeout) {
>                         hrtimer_start(&napi->timer, ns_to_ktime(timeout), HRTIMER_MODE_REL_PINNED);
> @@ -6650,6 +6672,7 @@ void netif_napi_add_weight(struct net_device *dev, struct napi_struct *napi,
>         INIT_HLIST_NODE(&napi->napi_hash_node);
>         hrtimer_init(&napi->timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL_PINNED);
>         napi->timer.function = napi_watchdog;
> +       napi_set_defer_hard_irqs(napi, READ_ONCE(dev->napi_defer_hard_irqs));
>         init_gro_hash(napi);
>         napi->skb = NULL;
>         INIT_LIST_HEAD(&napi->rx_list);
> @@ -11032,7 +11055,7 @@ void netdev_sw_irq_coalesce_default_on(struct net_device *dev)
>
>         if (!IS_ENABLED(CONFIG_PREEMPT_RT)) {
>                 dev->gro_flush_timeout = 20000;
> -               dev->napi_defer_hard_irqs = 1;
> +               netdev_set_defer_hard_irqs(dev, 1);
>         }
>  }
>  EXPORT_SYMBOL_GPL(netdev_sw_irq_coalesce_default_on);
> diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
> index 0e2084ce7b75..8272f0144d81 100644
> --- a/net/core/net-sysfs.c
> +++ b/net/core/net-sysfs.c
> @@ -425,7 +425,10 @@ NETDEVICE_SHOW_RW(gro_flush_timeout, fmt_ulong);
>
>  static int change_napi_defer_hard_irqs(struct net_device *dev, unsigned long val)
>  {
> -       WRITE_ONCE(dev->napi_defer_hard_irqs, val);
> +       if (val > S32_MAX)
> +               return -EINVAL;
> +
> +       netdev_set_defer_hard_irqs(dev, (int)val);
>         return 0;
>  }
>
> --
> 2.25.1
>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH net-next 3/5] net: napi: Make gro_flush_timeout per-NAPI
  2024-08-29 13:11 ` [PATCH net-next 3/5] net: napi: Make gro_flush_timeout per-NAPI Joe Damato
@ 2024-08-29 13:48   ` Eric Dumazet
  2024-08-29 13:57     ` Joe Damato
  2024-08-29 15:28     ` Joe Damato
  2024-08-30 16:18   ` kernel test robot
  2024-08-30 16:18   ` kernel test robot
  2 siblings, 2 replies; 48+ messages in thread
From: Eric Dumazet @ 2024-08-29 13:48 UTC (permalink / raw)
  To: Joe Damato
  Cc: netdev, amritha.nambiar, sridhar.samudrala, sdf, bjorn, hch,
	willy, willemdebruijn.kernel, skhawaja, kuba, Martin Karsten,
	David S. Miller, Paolo Abeni, Jiri Pirko,
	Sebastian Andrzej Siewior, Lorenzo Bianconi, Breno Leitao,
	Johannes Berg, Alexander Lobakin, Heiner Kallweit, open list

On Thu, Aug 29, 2024 at 3:13 PM Joe Damato <jdamato@fastly.com> wrote:
>
> Allow per-NAPI gro_flush_timeout setting.
>
> The existing sysfs parameter is respected; writes to sysfs will write to
> all NAPI structs for the device and the net_device gro_flush_timeout
> field.  Reads from sysfs will read from the net_device field.
>
> The ability to set gro_flush_timeout on specific NAPI instances will be
> added in a later commit, via netdev-genl.
>
> Signed-off-by: Joe Damato <jdamato@fastly.com>
> Reviewed-by: Martin Karsten <mkarsten@uwaterloo.ca>
> Tested-by: Martin Karsten <mkarsten@uwaterloo.ca>
> ---
>  include/linux/netdevice.h | 26 ++++++++++++++++++++++++++
>  net/core/dev.c            | 32 ++++++++++++++++++++++++++++----
>  net/core/net-sysfs.c      |  2 +-
>  3 files changed, 55 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 7d53380da4c0..d00024d9f857 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -372,6 +372,7 @@ struct napi_struct {
>         int                     rx_count; /* length of rx_list */
>         unsigned int            napi_id;
>         int                     defer_hard_irqs;
> +       unsigned long           gro_flush_timeout;
>         struct hrtimer          timer;
>         struct task_struct      *thread;
>         /* control-path-only fields follow */
> @@ -557,6 +558,31 @@ void napi_set_defer_hard_irqs(struct napi_struct *n, int defer);
>   */
>  void netdev_set_defer_hard_irqs(struct net_device *netdev, int defer);
>

Same remark :  dev->gro_flush_timeout is no longer read in the fast path.

Please move gro_flush_timeout out of net_device_read_txrx and update
Documentation/networking/net_cachelines/net_device.rst

> +/**
> + * napi_get_gro_flush_timeout - get the gro_flush_timeout
> + * @n: napi struct to get the gro_flush_timeout from
> + *
> + * Returns the per-NAPI value of the gro_flush_timeout field.
> + */
> +unsigned long napi_get_gro_flush_timeout(const struct napi_struct *n);
> +
> +/**
> + * napi_set_gro_flush_timeout - set the gro_flush_timeout for a napi
> + * @n: napi struct to set the gro_flush_timeout
> + * @timeout: timeout value to set
> + *
> + * napi_set_gro_flush_timeout sets the per-NAPI gro_flush_timeout
> + */
> +void napi_set_gro_flush_timeout(struct napi_struct *n, unsigned long timeout);
> +
> +/**
> + * netdev_set_gro_flush_timeout - set gro_flush_timeout for all NAPIs of a netdev
> + * @netdev: the net_device for which all NAPIs will have their gro_flush_timeout set
> + * @timeout: the timeout value to set
> + */
> +void netdev_set_gro_flush_timeout(struct net_device *netdev,
> +                                 unsigned long timeout);
> +
>  /**
>   * napi_complete_done - NAPI processing complete
>   * @n: NAPI context
> diff --git a/net/core/dev.c b/net/core/dev.c
> index f7baff0da057..3f7cb1085efa 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -6234,6 +6234,29 @@ void netdev_set_defer_hard_irqs(struct net_device *netdev, int defer)
>  }
>  EXPORT_SYMBOL_GPL(netdev_set_defer_hard_irqs);
>
> +unsigned long napi_get_gro_flush_timeout(const struct napi_struct *n)
> +{
> +       return READ_ONCE(n->gro_flush_timeout);
> +}
> +EXPORT_SYMBOL_GPL(napi_get_gro_flush_timeout);
> +
> +void napi_set_gro_flush_timeout(struct napi_struct *n, unsigned long timeout)
> +{
> +       WRITE_ONCE(n->gro_flush_timeout, timeout);
> +}
> +EXPORT_SYMBOL_GPL(napi_set_gro_flush_timeout);
> +
> +void netdev_set_gro_flush_timeout(struct net_device *netdev,
> +                                 unsigned long timeout)
> +{
> +       struct napi_struct *napi;
> +
> +       WRITE_ONCE(netdev->gro_flush_timeout, timeout);
> +       list_for_each_entry(napi, &netdev->napi_list, dev_list)
> +               napi_set_gro_flush_timeout(napi, timeout);
> +}
> +EXPORT_SYMBOL_GPL(netdev_set_gro_flush_timeout);
> +
>  bool napi_complete_done(struct napi_struct *n, int work_done)
>  {
>         unsigned long flags, val, new, timeout = 0;
> @@ -6251,12 +6274,12 @@ bool napi_complete_done(struct napi_struct *n, int work_done)
>
>         if (work_done) {
>                 if (n->gro_bitmask)
> -                       timeout = READ_ONCE(n->dev->gro_flush_timeout);
> +                       timeout = napi_get_gro_flush_timeout(n);
>                 n->defer_hard_irqs_count = napi_get_defer_hard_irqs(n);
>         }
>         if (n->defer_hard_irqs_count > 0) {
>                 n->defer_hard_irqs_count--;
> -               timeout = READ_ONCE(n->dev->gro_flush_timeout);
> +               timeout = napi_get_gro_flush_timeout(n);
>                 if (timeout)
>                         ret = false;
>         }
> @@ -6391,7 +6414,7 @@ static void busy_poll_stop(struct napi_struct *napi, void *have_poll_lock,
>
>         if (flags & NAPI_F_PREFER_BUSY_POLL) {
>                 napi->defer_hard_irqs_count = napi_get_defer_hard_irqs(napi);
> -               timeout = READ_ONCE(napi->dev->gro_flush_timeout);
> +               timeout = napi_get_gro_flush_timeout(napi);
>                 if (napi->defer_hard_irqs_count && timeout) {
>                         hrtimer_start(&napi->timer, ns_to_ktime(timeout), HRTIMER_MODE_REL_PINNED);
>                         skip_schedule = true;
> @@ -6673,6 +6696,7 @@ void netif_napi_add_weight(struct net_device *dev, struct napi_struct *napi,
>         hrtimer_init(&napi->timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL_PINNED);
>         napi->timer.function = napi_watchdog;
>         napi_set_defer_hard_irqs(napi, READ_ONCE(dev->napi_defer_hard_irqs));
> +       napi_set_gro_flush_timeout(napi, READ_ONCE(dev->gro_flush_timeout));
>         init_gro_hash(napi);
>         napi->skb = NULL;
>         INIT_LIST_HEAD(&napi->rx_list);
> @@ -11054,7 +11078,7 @@ void netdev_sw_irq_coalesce_default_on(struct net_device *dev)
>         WARN_ON(dev->reg_state == NETREG_REGISTERED);
>
>         if (!IS_ENABLED(CONFIG_PREEMPT_RT)) {
> -               dev->gro_flush_timeout = 20000;
> +               netdev_set_gro_flush_timeout(dev, 20000);
>                 netdev_set_defer_hard_irqs(dev, 1);
>         }
>  }
> diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
> index 8272f0144d81..ff545a422b1f 100644
> --- a/net/core/net-sysfs.c
> +++ b/net/core/net-sysfs.c
> @@ -408,7 +408,7 @@ NETDEVICE_SHOW_RW(tx_queue_len, fmt_dec);
>
>  static int change_gro_flush_timeout(struct net_device *dev, unsigned long val)
>  {
> -       WRITE_ONCE(dev->gro_flush_timeout, val);
> +       netdev_set_gro_flush_timeout(dev, val);
>         return 0;
>  }
>
> --
> 2.25.1
>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH net-next 3/5] net: napi: Make gro_flush_timeout per-NAPI
  2024-08-29 13:48   ` Eric Dumazet
@ 2024-08-29 13:57     ` Joe Damato
  2024-08-29 15:28     ` Joe Damato
  1 sibling, 0 replies; 48+ messages in thread
From: Joe Damato @ 2024-08-29 13:57 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: netdev, amritha.nambiar, sridhar.samudrala, sdf, bjorn, hch,
	willy, willemdebruijn.kernel, skhawaja, kuba, Martin Karsten,
	David S. Miller, Paolo Abeni, Jiri Pirko,
	Sebastian Andrzej Siewior, Lorenzo Bianconi, Breno Leitao,
	Johannes Berg, Alexander Lobakin, Heiner Kallweit, open list

On Thu, Aug 29, 2024 at 03:48:05PM +0200, Eric Dumazet wrote:
> On Thu, Aug 29, 2024 at 3:13 PM Joe Damato <jdamato@fastly.com> wrote:
> >
> > Allow per-NAPI gro_flush_timeout setting.
> >
> > The existing sysfs parameter is respected; writes to sysfs will write to
> > all NAPI structs for the device and the net_device gro_flush_timeout
> > field.  Reads from sysfs will read from the net_device field.
> >
> > The ability to set gro_flush_timeout on specific NAPI instances will be
> > added in a later commit, via netdev-genl.
> >
> > Signed-off-by: Joe Damato <jdamato@fastly.com>
> > Reviewed-by: Martin Karsten <mkarsten@uwaterloo.ca>
> > Tested-by: Martin Karsten <mkarsten@uwaterloo.ca>
> > ---
> >  include/linux/netdevice.h | 26 ++++++++++++++++++++++++++
> >  net/core/dev.c            | 32 ++++++++++++++++++++++++++++----
> >  net/core/net-sysfs.c      |  2 +-
> >  3 files changed, 55 insertions(+), 5 deletions(-)
> >
> > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> > index 7d53380da4c0..d00024d9f857 100644
> > --- a/include/linux/netdevice.h
> > +++ b/include/linux/netdevice.h
> > @@ -372,6 +372,7 @@ struct napi_struct {
> >         int                     rx_count; /* length of rx_list */
> >         unsigned int            napi_id;
> >         int                     defer_hard_irqs;
> > +       unsigned long           gro_flush_timeout;
> >         struct hrtimer          timer;
> >         struct task_struct      *thread;
> >         /* control-path-only fields follow */
> > @@ -557,6 +558,31 @@ void napi_set_defer_hard_irqs(struct napi_struct *n, int defer);
> >   */
> >  void netdev_set_defer_hard_irqs(struct net_device *netdev, int defer);
> >
> 
> Same remark :  dev->gro_flush_timeout is no longer read in the fast path.
> 
> Please move gro_flush_timeout out of net_device_read_txrx and update
> Documentation/networking/net_cachelines/net_device.rst

Thanks, Eric. I will take care of both in the v2.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH net-next 3/5] net: napi: Make gro_flush_timeout per-NAPI
  2024-08-29 13:48   ` Eric Dumazet
  2024-08-29 13:57     ` Joe Damato
@ 2024-08-29 15:28     ` Joe Damato
  2024-08-29 15:31       ` Eric Dumazet
  1 sibling, 1 reply; 48+ messages in thread
From: Joe Damato @ 2024-08-29 15:28 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: netdev, amritha.nambiar, sridhar.samudrala, sdf, bjorn, hch,
	willy, willemdebruijn.kernel, skhawaja, kuba, Martin Karsten,
	David S. Miller, Paolo Abeni, Jiri Pirko,
	Sebastian Andrzej Siewior, Lorenzo Bianconi, Breno Leitao,
	Johannes Berg, Alexander Lobakin, Heiner Kallweit, open list

On Thu, Aug 29, 2024 at 03:48:05PM +0200, Eric Dumazet wrote:
> On Thu, Aug 29, 2024 at 3:13 PM Joe Damato <jdamato@fastly.com> wrote:
> >
> > Allow per-NAPI gro_flush_timeout setting.
> >
> > The existing sysfs parameter is respected; writes to sysfs will write to
> > all NAPI structs for the device and the net_device gro_flush_timeout
> > field.  Reads from sysfs will read from the net_device field.
> >
> > The ability to set gro_flush_timeout on specific NAPI instances will be
> > added in a later commit, via netdev-genl.
> >
> > Signed-off-by: Joe Damato <jdamato@fastly.com>
> > Reviewed-by: Martin Karsten <mkarsten@uwaterloo.ca>
> > Tested-by: Martin Karsten <mkarsten@uwaterloo.ca>
> > ---
> >  include/linux/netdevice.h | 26 ++++++++++++++++++++++++++
> >  net/core/dev.c            | 32 ++++++++++++++++++++++++++++----
> >  net/core/net-sysfs.c      |  2 +-
> >  3 files changed, 55 insertions(+), 5 deletions(-)
> >
> > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> > index 7d53380da4c0..d00024d9f857 100644
> > --- a/include/linux/netdevice.h
> > +++ b/include/linux/netdevice.h
> > @@ -372,6 +372,7 @@ struct napi_struct {
> >         int                     rx_count; /* length of rx_list */
> >         unsigned int            napi_id;
> >         int                     defer_hard_irqs;
> > +       unsigned long           gro_flush_timeout;
> >         struct hrtimer          timer;
> >         struct task_struct      *thread;
> >         /* control-path-only fields follow */
> > @@ -557,6 +558,31 @@ void napi_set_defer_hard_irqs(struct napi_struct *n, int defer);
> >   */
> >  void netdev_set_defer_hard_irqs(struct net_device *netdev, int defer);
> >
> 
> Same remark :  dev->gro_flush_timeout is no longer read in the fast path.
> 
> Please move gro_flush_timeout out of net_device_read_txrx and update
> Documentation/networking/net_cachelines/net_device.rst

Is there some tooling I should use to generate this file?

I am asking because it seems like the file is missing two fields in
net_device at the end of the struct:

struct hlist_head          page_pools;
struct dim_irq_moder *     irq_moder;

Both of which seem to have been added just before and long after
(respectively) commit 14006f1d8fa2 ("Documentations: Analyze heavily
used Networking related structs").

If this is a bug, I can submit one patch (with two fixes tags) which
adds both fields to the file?

- Joe

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH net-next 3/5] net: napi: Make gro_flush_timeout per-NAPI
  2024-08-29 15:28     ` Joe Damato
@ 2024-08-29 15:31       ` Eric Dumazet
  2024-08-29 15:39         ` Joe Damato
  0 siblings, 1 reply; 48+ messages in thread
From: Eric Dumazet @ 2024-08-29 15:31 UTC (permalink / raw)
  To: Joe Damato, Eric Dumazet, netdev, amritha.nambiar,
	sridhar.samudrala, sdf, bjorn, hch, willy, willemdebruijn.kernel,
	skhawaja, kuba, Martin Karsten, David S. Miller, Paolo Abeni,
	Jiri Pirko, Sebastian Andrzej Siewior, Lorenzo Bianconi,
	Breno Leitao, Johannes Berg, Alexander Lobakin, Heiner Kallweit,
	open list

On Thu, Aug 29, 2024 at 5:28 PM Joe Damato <jdamato@fastly.com> wrote:
>
> On Thu, Aug 29, 2024 at 03:48:05PM +0200, Eric Dumazet wrote:
> > On Thu, Aug 29, 2024 at 3:13 PM Joe Damato <jdamato@fastly.com> wrote:
> > >
> > > Allow per-NAPI gro_flush_timeout setting.
> > >
> > > The existing sysfs parameter is respected; writes to sysfs will write to
> > > all NAPI structs for the device and the net_device gro_flush_timeout
> > > field.  Reads from sysfs will read from the net_device field.
> > >
> > > The ability to set gro_flush_timeout on specific NAPI instances will be
> > > added in a later commit, via netdev-genl.
> > >
> > > Signed-off-by: Joe Damato <jdamato@fastly.com>
> > > Reviewed-by: Martin Karsten <mkarsten@uwaterloo.ca>
> > > Tested-by: Martin Karsten <mkarsten@uwaterloo.ca>
> > > ---
> > >  include/linux/netdevice.h | 26 ++++++++++++++++++++++++++
> > >  net/core/dev.c            | 32 ++++++++++++++++++++++++++++----
> > >  net/core/net-sysfs.c      |  2 +-
> > >  3 files changed, 55 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> > > index 7d53380da4c0..d00024d9f857 100644
> > > --- a/include/linux/netdevice.h
> > > +++ b/include/linux/netdevice.h
> > > @@ -372,6 +372,7 @@ struct napi_struct {
> > >         int                     rx_count; /* length of rx_list */
> > >         unsigned int            napi_id;
> > >         int                     defer_hard_irqs;
> > > +       unsigned long           gro_flush_timeout;
> > >         struct hrtimer          timer;
> > >         struct task_struct      *thread;
> > >         /* control-path-only fields follow */
> > > @@ -557,6 +558,31 @@ void napi_set_defer_hard_irqs(struct napi_struct *n, int defer);
> > >   */
> > >  void netdev_set_defer_hard_irqs(struct net_device *netdev, int defer);
> > >
> >
> > Same remark :  dev->gro_flush_timeout is no longer read in the fast path.
> >
> > Please move gro_flush_timeout out of net_device_read_txrx and update
> > Documentation/networking/net_cachelines/net_device.rst
>
> Is there some tooling I should use to generate this file?
>
> I am asking because it seems like the file is missing two fields in
> net_device at the end of the struct:
>
> struct hlist_head          page_pools;
> struct dim_irq_moder *     irq_moder;
>

At first glance this is control path only, no big deal.

> Both of which seem to have been added just before and long after
> (respectively) commit 14006f1d8fa2 ("Documentations: Analyze heavily
> used Networking related structs").
>
> If this is a bug, I can submit one patch (with two fixes tags) which
> adds both fields to the file?

No need for a Fixes: tag for this, just submit to net-next.

This file is really 'needed' for current development, for people
caring about data locality.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH net-next 3/5] net: napi: Make gro_flush_timeout per-NAPI
  2024-08-29 15:31       ` Eric Dumazet
@ 2024-08-29 15:39         ` Joe Damato
  0 siblings, 0 replies; 48+ messages in thread
From: Joe Damato @ 2024-08-29 15:39 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: netdev, amritha.nambiar, sridhar.samudrala, sdf, bjorn, hch,
	willy, willemdebruijn.kernel, skhawaja, kuba, Martin Karsten,
	David S. Miller, Paolo Abeni, Jiri Pirko,
	Sebastian Andrzej Siewior, Lorenzo Bianconi, Breno Leitao,
	Johannes Berg, Alexander Lobakin, Heiner Kallweit, open list

On Thu, Aug 29, 2024 at 05:31:30PM +0200, Eric Dumazet wrote:
> On Thu, Aug 29, 2024 at 5:28 PM Joe Damato <jdamato@fastly.com> wrote:
> >
> > On Thu, Aug 29, 2024 at 03:48:05PM +0200, Eric Dumazet wrote:
> > > On Thu, Aug 29, 2024 at 3:13 PM Joe Damato <jdamato@fastly.com> wrote:
> > > >
> > > > Allow per-NAPI gro_flush_timeout setting.
> > > >
> > > > The existing sysfs parameter is respected; writes to sysfs will write to
> > > > all NAPI structs for the device and the net_device gro_flush_timeout
> > > > field.  Reads from sysfs will read from the net_device field.
> > > >
> > > > The ability to set gro_flush_timeout on specific NAPI instances will be
> > > > added in a later commit, via netdev-genl.
> > > >
> > > > Signed-off-by: Joe Damato <jdamato@fastly.com>
> > > > Reviewed-by: Martin Karsten <mkarsten@uwaterloo.ca>
> > > > Tested-by: Martin Karsten <mkarsten@uwaterloo.ca>
> > > > ---
> > > >  include/linux/netdevice.h | 26 ++++++++++++++++++++++++++
> > > >  net/core/dev.c            | 32 ++++++++++++++++++++++++++++----
> > > >  net/core/net-sysfs.c      |  2 +-
> > > >  3 files changed, 55 insertions(+), 5 deletions(-)
> > > >
> > > > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> > > > index 7d53380da4c0..d00024d9f857 100644
> > > > --- a/include/linux/netdevice.h
> > > > +++ b/include/linux/netdevice.h
> > > > @@ -372,6 +372,7 @@ struct napi_struct {
> > > >         int                     rx_count; /* length of rx_list */
> > > >         unsigned int            napi_id;
> > > >         int                     defer_hard_irqs;
> > > > +       unsigned long           gro_flush_timeout;
> > > >         struct hrtimer          timer;
> > > >         struct task_struct      *thread;
> > > >         /* control-path-only fields follow */
> > > > @@ -557,6 +558,31 @@ void napi_set_defer_hard_irqs(struct napi_struct *n, int defer);
> > > >   */
> > > >  void netdev_set_defer_hard_irqs(struct net_device *netdev, int defer);
> > > >
> > >
> > > Same remark :  dev->gro_flush_timeout is no longer read in the fast path.
> > >
> > > Please move gro_flush_timeout out of net_device_read_txrx and update
> > > Documentation/networking/net_cachelines/net_device.rst
> >
> > Is there some tooling I should use to generate this file?
> >
> > I am asking because it seems like the file is missing two fields in
> > net_device at the end of the struct:
> >
> > struct hlist_head          page_pools;
> > struct dim_irq_moder *     irq_moder;
> >
> 
> At first glance this is control path only, no big deal.

My plan in my series was to move both fields you pointed out to the
end of the struct (there is a big enough hole for both) and add
them to the bottom of this doc file (with just Type and Name, of
course).

The fact that the two fields (page_pools, irq_moder) are missing
made me unsure whether I was planning to do the right thing for
the v2.

> > Both of which seem to have been added just before and long after
> > (respectively) commit 14006f1d8fa2 ("Documentations: Analyze heavily
> > used Networking related structs").
> >
> > If this is a bug, I can submit one patch (with two fixes tags) which
> > adds both fields to the file?
> 
> No need for a Fixes: tag for this, just submit to net-next.
> 
> This file is really 'needed' for current development, for people
> caring about data locality.

Will do; thanks for the guidance.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH net-next 1/5] net: napi: Make napi_defer_hard_irqs per-NAPI
  2024-08-29 13:11 ` [PATCH net-next 1/5] net: napi: Make napi_defer_hard_irqs per-NAPI Joe Damato
  2024-08-29 13:46   ` Eric Dumazet
@ 2024-08-29 22:05   ` Jakub Kicinski
  2024-08-30  9:14     ` Joe Damato
  2024-08-30  8:36   ` Simon Horman
  2024-08-30 16:50   ` kernel test robot
  3 siblings, 1 reply; 48+ messages in thread
From: Jakub Kicinski @ 2024-08-29 22:05 UTC (permalink / raw)
  To: Joe Damato
  Cc: netdev, edumazet, amritha.nambiar, sridhar.samudrala, sdf, bjorn,
	hch, willy, willemdebruijn.kernel, skhawaja, Martin Karsten,
	David S. Miller, Paolo Abeni, Jiri Pirko,
	Sebastian Andrzej Siewior, Lorenzo Bianconi, Breno Leitao,
	Johannes Berg, Alexander Lobakin, open list

On Thu, 29 Aug 2024 13:11:57 +0000 Joe Damato wrote:
> +/**
> + * napi_get_defer_hard_irqs - get the NAPI's defer_hard_irqs
> + * @n: napi struct to get the defer_hard_irqs field from
> + *
> + * Returns the per-NAPI value of the defer_hard_irqs field.
> + */
> +int napi_get_defer_hard_irqs(const struct napi_struct *n);
> +
> +/**
> + * napi_set_defer_hard_irqs - set the defer_hard_irqs for a napi
> + * @n: napi_struct to set the defer_hard_irqs field
> + * @defer: the value the field should be set to
> + */
> +void napi_set_defer_hard_irqs(struct napi_struct *n, int defer);
> +
> +/**
> + * netdev_set_defer_hard_irqs - set defer_hard_irqs for all NAPIs of a netdev
> + * @netdev: the net_device for which all NAPIs will have their defer_hard_irqs set
> + * @defer: the defer_hard_irqs value to set
> + */
> +void netdev_set_defer_hard_irqs(struct net_device *netdev, int defer);

Do you expect drivers or modules to call these?
I'm not sure we need the wrappers just to cover up the READ/WRITE_ONCE()
but if you do want to keep them they can be static inlines in
net/core/dev.h

nit: IIUC the kdoc should go on the definition, not the declaration.
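
A minimal sketch of what such static inlines in net/core/dev.h could
look like, reusing the helper names and the napi_list/dev_list
linkage from this series (illustrative only, not the final code):

/* Sketch: wrap the READ_ONCE()/WRITE_ONCE() accesses behind static
 * inlines visible only to net/core (net/core/dev.h).
 */
static inline int napi_get_defer_hard_irqs(const struct napi_struct *n)
{
	return READ_ONCE(n->defer_hard_irqs);
}

static inline void napi_set_defer_hard_irqs(struct napi_struct *n, int defer)
{
	WRITE_ONCE(n->defer_hard_irqs, defer);
}

/* Walk the device's NAPIs and update each one; per the series
 * description, the sysfs handler also updates the net_device-level
 * field separately.
 */
static inline void netdev_set_defer_hard_irqs(struct net_device *netdev,
					      int defer)
{
	struct napi_struct *napi;

	list_for_each_entry(napi, &netdev->napi_list, dev_list)
		napi_set_defer_hard_irqs(napi, defer);
}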

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH net-next 2/5] netdev-genl: Dump napi_defer_hard_irqs
  2024-08-29 13:11 ` [PATCH net-next 2/5] netdev-genl: Dump napi_defer_hard_irqs Joe Damato
@ 2024-08-29 22:08   ` Jakub Kicinski
  2024-08-30  9:10     ` Joe Damato
  0 siblings, 1 reply; 48+ messages in thread
From: Jakub Kicinski @ 2024-08-29 22:08 UTC (permalink / raw)
  To: Joe Damato
  Cc: netdev, edumazet, amritha.nambiar, sridhar.samudrala, sdf, bjorn,
	hch, willy, willemdebruijn.kernel, skhawaja, Martin Karsten,
	Donald Hunter, David S. Miller, Paolo Abeni,
	Jesper Dangaard Brouer, Xuan Zhuo, Daniel Jurgens, open list

On Thu, 29 Aug 2024 13:11:58 +0000 Joe Damato wrote:
> diff --git a/Documentation/netlink/specs/netdev.yaml b/Documentation/netlink/specs/netdev.yaml
> index 959755be4d7f..ee4f99fd4574 100644
> --- a/Documentation/netlink/specs/netdev.yaml
> +++ b/Documentation/netlink/specs/netdev.yaml
> @@ -244,6 +244,11 @@ attribute-sets:
>               threaded mode. If NAPI is not in threaded mode (i.e. uses normal
>               softirq context), the attribute will be absent.
>          type: u32
> +      -
> +        name: defer-hard-irqs
> +        doc: The number of consecutive empty polls before IRQ deferral ends
> +             and hardware IRQs are re-enabled.
> +        type: s32

Why is this a signed value? 🤔️
You can use:

	check:
		max: s32-max

to have netlink validate the overflow if you switch to u32.
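
A hand-written equivalent of what that spec check boils down to, for
illustration only (the real policy is generated from the YAML spec,
so the names and exact form may differ; NETDEV_A_NAPI_DEFER_HARD_IRQS
is the attribute added by this series):

/* Reject defer-hard-irqs values above S32_MAX at netlink parse time. */
static const struct netlink_range_validation
netdev_napi_defer_hard_irqs_range = {
	.max	= S32_MAX,
};

static const struct nla_policy
netdev_napi_set_nl_policy[NETDEV_A_NAPI_MAX + 1] = {
	[NETDEV_A_NAPI_ID] = { .type = NLA_U32, },
	[NETDEV_A_NAPI_DEFER_HARD_IRQS] =
		NLA_POLICY_FULL_RANGE(NLA_U32,
				      &netdev_napi_defer_hard_irqs_range),
};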

>    -
>      name: queue
>      attributes:

> @@ -188,6 +189,10 @@ netdev_nl_napi_fill_one(struct sk_buff *rsp, struct napi_struct *napi,
>  			goto nla_put_failure;
>  	}
>  
> +	napi_defer_hard_irqs = napi_get_defer_hard_irqs(napi);

Here, for example the READ_ONCE() wouldn't have been necessary, right?
Cause we are holding rtnl_lock(), just a random thought, not really
actionable.

> +	if (nla_put_s32(rsp, NETDEV_A_NAPI_DEFER_HARD_IRQS, napi_defer_hard_irqs))
> +		goto nla_put_failure;

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH net-next 4/5] netdev-genl: Dump gro_flush_timeout
  2024-08-29 13:12 ` [PATCH net-next 4/5] netdev-genl: Dump gro_flush_timeout Joe Damato
@ 2024-08-29 22:09   ` Jakub Kicinski
  2024-08-30  9:17     ` Joe Damato
  0 siblings, 1 reply; 48+ messages in thread
From: Jakub Kicinski @ 2024-08-29 22:09 UTC (permalink / raw)
  To: Joe Damato
  Cc: netdev, edumazet, amritha.nambiar, sridhar.samudrala, sdf, bjorn,
	hch, willy, willemdebruijn.kernel, skhawaja, Martin Karsten,
	Donald Hunter, David S. Miller, Paolo Abeni,
	Jesper Dangaard Brouer, Xuan Zhuo, Larysa Zaremba, open list

On Thu, 29 Aug 2024 13:12:00 +0000 Joe Damato wrote:
> +        type: u64

uint

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH net-next 5/5] netdev-genl: Support setting per-NAPI config values
  2024-08-29 13:12 ` [PATCH net-next 5/5] netdev-genl: Support setting per-NAPI config values Joe Damato
@ 2024-08-29 22:31   ` Jakub Kicinski
  2024-08-30 10:43     ` Joe Damato
  0 siblings, 1 reply; 48+ messages in thread
From: Jakub Kicinski @ 2024-08-29 22:31 UTC (permalink / raw)
  To: Joe Damato
  Cc: netdev, edumazet, amritha.nambiar, sridhar.samudrala, sdf, bjorn,
	hch, willy, willemdebruijn.kernel, skhawaja, Martin Karsten,
	Donald Hunter, David S. Miller, Paolo Abeni,
	Jesper Dangaard Brouer, Xuan Zhuo, Daniel Jurgens, open list

On Thu, 29 Aug 2024 13:12:01 +0000 Joe Damato wrote:
> +	napi = napi_by_id(napi_id);
> +	if (napi)
> +		err = netdev_nl_napi_set_config(napi, info);
> +	else
> +		err = -EINVAL;

if (napi) {
...
} else {
	NL_SET_BAD_ATTR(info->extack, info->attrs[NETDEV_A_NAPI_ID])
	err = -ENOENT;
}

> +      doc: Set configurable NAPI instance settings.

We should pause and think here how configuring NAPI params should
behave. NAPI instances are ephemeral, if you close and open the
device (or for some drivers change any BPF or ethtool setting)
the NAPIs may get wiped and recreated, discarding all configuration.

This is not how the sysfs API behaves, the sysfs settings on the device
survive close. It's (weirdly?) also not how queues behave, because we
have struct netdev{_rx,}_queue to store stuff persistently. Even tho
you'd think queues are as ephemeral as NAPIs if not more.

I guess we can either document this, and move on (which may be fine,
you have more practical experience than me). Or we can add an internal
concept of a "channel" (which perhaps maybe if you squint is what
ethtool -l calls NAPIs?) or just "napi_storage" as an array inside
net_device and store such config there. For simplicity of matching
config to NAPIs we can assume drivers add NAPI instances in order. 
If driver wants to do something more fancy we can add a variant of
netif_napi_add() which specifies the channel/storage to use.

Thoughts? I may be overly sensitive to the ephemeral thing, maybe
I work with unfortunate drivers...
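
A rough, hypothetical sketch of the napi_storage idea described above
(none of these names exist today; they only illustrate the shape of
such a change, assuming drivers add NAPI instances in order):

/* Persistent per-channel config that survives NAPI tear-down/re-add. */
struct napi_storage {
	int		defer_hard_irqs;
	unsigned long	gro_flush_timeout;
};

/* struct net_device would carry an array of these, sized by the
 * channel count, e.g.:  struct napi_storage *napi_storage;
 */

/* Variant of netif_napi_add() that binds a NAPI to a storage slot. */
static inline void
netif_napi_add_storage(struct net_device *dev, struct napi_struct *napi,
		       int (*poll)(struct napi_struct *, int),
		       unsigned int idx)
{
	napi->storage = &dev->napi_storage[idx];	/* hypothetical field */
	netif_napi_add(dev, napi, poll);
}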

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH net-next 1/5] net: napi: Make napi_defer_hard_irqs per-NAPI
  2024-08-29 13:11 ` [PATCH net-next 1/5] net: napi: Make napi_defer_hard_irqs per-NAPI Joe Damato
  2024-08-29 13:46   ` Eric Dumazet
  2024-08-29 22:05   ` Jakub Kicinski
@ 2024-08-30  8:36   ` Simon Horman
  2024-08-30  9:11     ` Joe Damato
  2024-08-30 16:50   ` kernel test robot
  3 siblings, 1 reply; 48+ messages in thread
From: Simon Horman @ 2024-08-30  8:36 UTC (permalink / raw)
  To: Joe Damato
  Cc: netdev, edumazet, amritha.nambiar, sridhar.samudrala, sdf, bjorn,
	hch, willy, willemdebruijn.kernel, skhawaja, kuba, Martin Karsten,
	David S. Miller, Paolo Abeni, Jiri Pirko,
	Sebastian Andrzej Siewior, Lorenzo Bianconi, Breno Leitao,
	Johannes Berg, Alexander Lobakin, open list

On Thu, Aug 29, 2024 at 01:11:57PM +0000, Joe Damato wrote:
> Allow per-NAPI defer_hard_irqs setting.
> 
> The existing sysfs parameter is respected; writes to sysfs will write to
> all NAPI structs for the device and the net_device defer_hard_irq field.
> Reads from sysfs will read from the net_device field.
> 
> sysfs code was updated to guard against what appears to be a potential
> overflow as the field is an int, but the value passed in is an unsigned
> long.
> 
> The ability to set defer_hard_irqs on specific NAPI instances will be
> added in a later commit, via netdev-genl.
> 
> Signed-off-by: Joe Damato <jdamato@fastly.com>
> Reviewed-by: Martin Karsten <mkarsten@uwaterloo.ca>
> Tested-by: Martin Karsten <mkarsten@uwaterloo.ca>
> ---
>  include/linux/netdevice.h | 23 +++++++++++++++++++++++
>  net/core/dev.c            | 29 ++++++++++++++++++++++++++---
>  net/core/net-sysfs.c      |  5 ++++-
>  3 files changed, 53 insertions(+), 4 deletions(-)

...

> @@ -534,6 +535,28 @@ static inline void napi_schedule_irqoff(struct napi_struct *n)
>  		__napi_schedule_irqoff(n);
>  }
>  
> +/**
> + * napi_get_defer_hard_irqs - get the NAPI's defer_hard_irqs
> + * @n: napi struct to get the defer_hard_irqs field from
> + *
> > + * Returns the per-NAPI value of the defer_hard_irqs field.
> + */
> +int napi_get_defer_hard_irqs(const struct napi_struct *n);

Hi Joe,

As it looks like there will be a v2 anyway, a minor nit from my side.

Thanks for documenting the return value, but I believe that
./scripts/kernel-doc -none -Wall expects "Return: " or "Returns: "

Likewise in patch 3/5.
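
For example, the block above would become something like:

/**
 * napi_get_defer_hard_irqs - get the NAPI's defer_hard_irqs
 * @n: napi struct to get the defer_hard_irqs field from
 *
 * Return: the per-NAPI value of the defer_hard_irqs field.
 */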

...

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH net-next 2/5] netdev-genl: Dump napi_defer_hard_irqs
  2024-08-29 22:08   ` Jakub Kicinski
@ 2024-08-30  9:10     ` Joe Damato
  2024-08-30 20:28       ` Jakub Kicinski
  0 siblings, 1 reply; 48+ messages in thread
From: Joe Damato @ 2024-08-30  9:10 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: netdev, edumazet, amritha.nambiar, sridhar.samudrala, sdf, bjorn,
	hch, willy, willemdebruijn.kernel, skhawaja, Martin Karsten,
	Donald Hunter, David S. Miller, Paolo Abeni,
	Jesper Dangaard Brouer, Xuan Zhuo, Daniel Jurgens, open list

On Thu, Aug 29, 2024 at 03:08:28PM -0700, Jakub Kicinski wrote:
> On Thu, 29 Aug 2024 13:11:58 +0000 Joe Damato wrote:
> > diff --git a/Documentation/netlink/specs/netdev.yaml b/Documentation/netlink/specs/netdev.yaml
> > index 959755be4d7f..ee4f99fd4574 100644
> > --- a/Documentation/netlink/specs/netdev.yaml
> > +++ b/Documentation/netlink/specs/netdev.yaml
> > @@ -244,6 +244,11 @@ attribute-sets:
> >               threaded mode. If NAPI is not in threaded mode (i.e. uses normal
> >               softirq context), the attribute will be absent.
> >          type: u32
> > +      -
> > +        name: defer-hard-irqs
> > +        doc: The number of consecutive empty polls before IRQ deferral ends
> > +             and hardware IRQs are re-enabled.
> > +        type: s32
> 
> Why is this a signed value? 🤔️

In commit 6f8b12d661d0 ("net: napi: add hard irqs deferral
feature"), napi_defer_hard_irqs was added to struct net_device as an
int. I was trying to match that and thus made the field a signed int
in the napi struct, as well.

It looks like that commit may have introduced a possible overflow
in change_napi_defer_hard_irqs?

If you'd prefer I could:
  - submit a Fixes to change the net_device field to a u32 and then
    change the netlink code to also be u32
  - add an overflow check (val > U32_MAX) in
    change_napi_defer_hard_irqs

Which would mean for the v2 of this series:
  - drop the overflow check I added in Patch 1
  - Change netlink to use u32 in this patch 

What do you think?
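
For concreteness, the overflow check option could look roughly like
this in net-sysfs.c (a sketch only; the exact bound and error code
depend on whether the field ends up signed or u32, and
netdev_set_defer_hard_irqs() is the helper from patch 1):

static int change_napi_defer_hard_irqs(struct net_device *dev,
					unsigned long val)
{
	/* the value arrives as an unsigned long but the stored field
	 * is 32 bits wide, so reject anything that does not fit
	 */
	if (val > U32_MAX)
		return -ERANGE;

	WRITE_ONCE(dev->napi_defer_hard_irqs, val);
	netdev_set_defer_hard_irqs(dev, val);
	return 0;
}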

> You can use:
> 
> 	check:
> 		max: s32-max
> 
> to have netlink validate the overflow if you switch to u32.
> 
> >    -
> >      name: queue
> >      attributes:
> 
> > @@ -188,6 +189,10 @@ netdev_nl_napi_fill_one(struct sk_buff *rsp, struct napi_struct *napi,
> >  			goto nla_put_failure;
> >  	}
> >  
> > +	napi_defer_hard_irqs = napi_get_defer_hard_irqs(napi);
> 
> Here, for example the READ_ONCE() wouldn't have been necessary, right?
> Cause we are holding rtnl_lock(), just a random thought, not really
> actionable.

That's right, yes.

I think it depends on where we land with the wrapper functions? I'll
reply with my thoughts about that in that thread.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH net-next 1/5] net: napi: Make napi_defer_hard_irqs per-NAPI
  2024-08-30  8:36   ` Simon Horman
@ 2024-08-30  9:11     ` Joe Damato
  0 siblings, 0 replies; 48+ messages in thread
From: Joe Damato @ 2024-08-30  9:11 UTC (permalink / raw)
  To: Simon Horman
  Cc: netdev, edumazet, amritha.nambiar, sridhar.samudrala, sdf, bjorn,
	hch, willy, willemdebruijn.kernel, skhawaja, kuba, Martin Karsten,
	David S. Miller, Paolo Abeni, Jiri Pirko,
	Sebastian Andrzej Siewior, Lorenzo Bianconi, Breno Leitao,
	Johannes Berg, Alexander Lobakin, open list

On Fri, Aug 30, 2024 at 09:36:41AM +0100, Simon Horman wrote:
> On Thu, Aug 29, 2024 at 01:11:57PM +0000, Joe Damato wrote:
> > Allow per-NAPI defer_hard_irqs setting.
> > 
> > The existing sysfs parameter is respected; writes to sysfs will write to
> > all NAPI structs for the device and the net_device defer_hard_irq field.
> > Reads from sysfs will read from the net_device field.
> > 
> > sysfs code was updated to guard against what appears to be a potential
> > overflow as the field is an int, but the value passed in is an unsigned
> > long.
> > 
> > The ability to set defer_hard_irqs on specific NAPI instances will be
> > added in a later commit, via netdev-genl.
> > 
> > Signed-off-by: Joe Damato <jdamato@fastly.com>
> > Reviewed-by: Martin Karsten <mkarsten@uwaterloo.ca>
> > Tested-by: Martin Karsten <mkarsten@uwaterloo.ca>
> > ---
> >  include/linux/netdevice.h | 23 +++++++++++++++++++++++
> >  net/core/dev.c            | 29 ++++++++++++++++++++++++++---
> >  net/core/net-sysfs.c      |  5 ++++-
> >  3 files changed, 53 insertions(+), 4 deletions(-)
> 
> ...
> 
> > @@ -534,6 +535,28 @@ static inline void napi_schedule_irqoff(struct napi_struct *n)
> >  		__napi_schedule_irqoff(n);
> >  }
> >  
> > +/**
> > + * napi_get_defer_hard_irqs - get the NAPI's defer_hard_irqs
> > + * @n: napi struct to get the defer_hard_irqs field from
> > + *
> > > + * Returns the per-NAPI value of the defer_hard_irqs field.
> > + */
> > +int napi_get_defer_hard_irqs(const struct napi_struct *n);
> 
> Hi Joe,
> 
> As it looks like there will be a v2 anyway, a minor nit from my side.
> 
> Thanks for documenting the return value, but I believe that
> ./scripts/kernel-doc -none -Wall expects "Return: " or "Returns: "
> 
> Likewise in patch 3/5.

Thanks Simon, will make sure to take care of this in the v2.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH net-next 1/5] net: napi: Make napi_defer_hard_irqs per-NAPI
  2024-08-29 22:05   ` Jakub Kicinski
@ 2024-08-30  9:14     ` Joe Damato
  2024-08-30 20:21       ` Jakub Kicinski
  0 siblings, 1 reply; 48+ messages in thread
From: Joe Damato @ 2024-08-30  9:14 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: netdev, edumazet, amritha.nambiar, sridhar.samudrala, sdf, bjorn,
	hch, willy, willemdebruijn.kernel, skhawaja, Martin Karsten,
	David S. Miller, Paolo Abeni, Jiri Pirko,
	Sebastian Andrzej Siewior, Lorenzo Bianconi, Breno Leitao,
	Johannes Berg, Alexander Lobakin, open list

On Thu, Aug 29, 2024 at 03:05:02PM -0700, Jakub Kicinski wrote:
> On Thu, 29 Aug 2024 13:11:57 +0000 Joe Damato wrote:
> > +/**
> > + * napi_get_defer_hard_irqs - get the NAPI's defer_hard_irqs
> > + * @n: napi struct to get the defer_hard_irqs field from
> > + *
> > > + * Returns the per-NAPI value of the defer_hard_irqs field.
> > + */
> > +int napi_get_defer_hard_irqs(const struct napi_struct *n);
> > +
> > +/**
> > + * napi_set_defer_hard_irqs - set the defer_hard_irqs for a napi
> > + * @n: napi_struct to set the defer_hard_irqs field
> > + * @defer: the value the field should be set to
> > + */
> > +void napi_set_defer_hard_irqs(struct napi_struct *n, int defer);
> > +
> > +/**
> > + * netdev_set_defer_hard_irqs - set defer_hard_irqs for all NAPIs of a netdev
> > + * @netdev: the net_device for which all NAPIs will have their defer_hard_irqs set
> > + * @defer: the defer_hard_irqs value to set
> > + */
> > +void netdev_set_defer_hard_irqs(struct net_device *netdev, int defer);
> 
> Do you expect drivers or modules to call these?
> I'm not sure we need the wrappers just to cover up the READ/WRITE_ONCE()
> but if you do want to keep them they can be static inlines in
> net/core/dev.h

It looked like there were a few call sites for these in
net/core/dev.c, the sysfs code, and the netlink code.

I figured having it all wrapped up somewhere might be better than
repeating the READ/WRITE_ONCE() stuff.

I have no preference on whether there are wrappers or not, though.
If you'd like me to drop the wrappers for the v2, let me know.

Otherwise: I'll make them static inlines as you suggested.

Let me know if you have a preference here because I am neutral.

> nit: IIUC the kdoc should go on the definition, not the declaration.

My mistake; thanks. I suppose if I make them static inlines, I'll
just move the kdoc as well and the problem solves itself :)

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH net-next 4/5] netdev-genl: Dump gro_flush_timeout
  2024-08-29 22:09   ` Jakub Kicinski
@ 2024-08-30  9:17     ` Joe Damato
  0 siblings, 0 replies; 48+ messages in thread
From: Joe Damato @ 2024-08-30  9:17 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: netdev, edumazet, amritha.nambiar, sridhar.samudrala, sdf, bjorn,
	hch, willy, willemdebruijn.kernel, skhawaja, Martin Karsten,
	Donald Hunter, David S. Miller, Paolo Abeni,
	Jesper Dangaard Brouer, Xuan Zhuo, Larysa Zaremba, open list

On Thu, Aug 29, 2024 at 03:09:35PM -0700, Jakub Kicinski wrote:
> On Thu, 29 Aug 2024 13:12:00 +0000 Joe Damato wrote:
> > +        type: u64
> 
> uint

Ah, thanks. I just looked at include/net/netlink.h and it seems like
I can use put_uint instead of what I am doing in this
patch (and get_uint instead of what I'm doing in Patch 5).
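
Roughly, for the dump side (a sketch only; the attribute and the
per-NAPI accessor are names added by this series):

	gro_flush_timeout = napi_get_gro_flush_timeout(napi);
	if (nla_put_uint(rsp, NETDEV_A_NAPI_GRO_FLUSH_TIMEOUT,
			 gro_flush_timeout))
		goto nla_put_failure;

and on the set side, nla_get_uint() reads the same attribute back as
a u64:

	timeout = nla_get_uint(info->attrs[NETDEV_A_NAPI_GRO_FLUSH_TIMEOUT]);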

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH net-next 5/5] netdev-genl: Support setting per-NAPI config values
  2024-08-29 22:31   ` Jakub Kicinski
@ 2024-08-30 10:43     ` Joe Damato
  2024-08-30 21:22       ` Jakub Kicinski
  0 siblings, 1 reply; 48+ messages in thread
From: Joe Damato @ 2024-08-30 10:43 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: netdev, edumazet, amritha.nambiar, sridhar.samudrala, sdf, bjorn,
	hch, willy, willemdebruijn.kernel, skhawaja, Martin Karsten,
	Donald Hunter, David S. Miller, Paolo Abeni,
	Jesper Dangaard Brouer, Xuan Zhuo, Daniel Jurgens, open list

On Thu, Aug 29, 2024 at 03:31:05PM -0700, Jakub Kicinski wrote:
> On Thu, 29 Aug 2024 13:12:01 +0000 Joe Damato wrote:
> > +	napi = napi_by_id(napi_id);
> > +	if (napi)
> > +		err = netdev_nl_napi_set_config(napi, info);
> > +	else
> > +		err = -EINVAL;
> 
> if (napi) {
> ...
> } else {
> 	NL_SET_BAD_ATTR(info->extack, info->attrs[NETDEV_A_NAPI_ID])
> 	err = -ENOENT;
> }

Thanks, I'll make that change in the v2.

Should I send a Fixes patch for the same pattern in
netdev_nl_napi_get_doit?
 
> > +      doc: Set configurable NAPI instance settings.
> 
> We should pause and think here how configuring NAPI params should
> behave. NAPI instances are ephemeral, if you close and open the
> device (or for some drivers change any BPF or ethtool setting)
> the NAPIs may get wiped and recreated, discarding all configuration.
> 
> This is not how the sysfs API behaves, the sysfs settings on the device
> survive close. It's (weirdly?) also not how queues behave, because we
> have struct netdev{_rx,}_queue to store stuff persistently. Even tho
> you'd think queues are as ephemeral as NAPIs if not more.
> 
> I guess we can either document this, and move on (which may be fine,
> you have more practical experience than me). Or we can add an internal
> concept of a "channel" (which perhaps maybe if you squint is what
> ethtool -l calls NAPIs?) or just "napi_storage" as an array inside
> net_device and store such config there. For simplicity of matching
> config to NAPIs we can assume drivers add NAPI instances in order. 
> If driver wants to do something more fancy we can add a variant of
> netif_napi_add() which specifies the channel/storage to use.
> 
> Thoughts? I may be overly sensitive to the ephemeral thing, maybe
> I work with unfortunate drivers...

Thanks for pointing this out. I think this is an important case to
consider. Here's how I'm thinking about it.

There are two cases:

1) sysfs setting is used by existing/legacy apps: if the NAPIs are
discarded and recreated, the code I added to netif_napi_add_weight
in patches 1 and 3 should take care of that case, preserving how
sysfs works today, I believe. I think we are good on this case?

2) apps using netlink to set various custom settings. This seems
like a case where a future extension can be made to add a notifier
for NAPI changes (like the netdevice notifier?).

If you think this is a good idea, then we'd do something like:
  1. Document that the NAPI settings are wiped when NAPIs are wiped
  2. In the future (not part of this series) a NAPI notifier is
     added
  3. User apps can then listen for NAPI create/delete events
     and update settings when a NAPI is created. It would be
     helpful, I think, for user apps to know about NAPI
     create/delete events in general because it means NAPI IDs are
     changing.

One could argue:

  When wiping/recreating a NAPI for an existing HW queue, that HW
  queue gets a new NAPI ID associated with it. User apps operating
  at this level probably care about NAPI IDs changing (as it affects
  epoll busy poll). Since the settings in this series are per-NAPI
  (and not per HW queue), the argument could be that user apps need
  to setup NAPIs when they are created and settings do not persist
  between NAPIs with different IDs even if associated with the same
  HW queue.

Admittedly, from the perspective of a user it would be nice if a new
NAPI created for an existing HW queue retained the previous
settings so that I, as the user, can do less work.

But, what happens if a HW queue is destroyed and recreated? Will any
HW settings be retained? And does that have any influence on what we
do in software? See below.

This part of your message:

> we can assume drivers add NAPI instances in order. If driver wants
> to do something more fancy we can add a variant of
> netif_napi_add() which specifies the channel/storage to use.

assuming drivers will "do a thing", so to speak, makes me uneasy.

I started to wonder: how do drivers handle per-queue HW IRQ coalesce
settings when queue counts increase? It's a different, but adjacent
problem, I think.

I tried a couple of experiments on mlx5 and got very strange results
that deserve their own thread; I didn't want to get this thread
too far off track.

I think you have much more practical experience when it comes to
dealing with drivers, so I am happy to follow your lead on this one,
but assuming drivers will "do a thing" seems mildly scary to me with
limited driver experience.

My two goals with this series are:
  1. Make it possible to set these values per NAPI
  2. Unblock the IRQ suspension series by threading the suspend
     parameter through the code path carved in this series

So, I'm happy to proceed with this series as you prefer whether
that's documentation or "napi_storage"; I think you are probably the
best person to answer this question :)

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH net-next 3/5] net: napi: Make gro_flush_timeout per-NAPI
  2024-08-29 13:11 ` [PATCH net-next 3/5] net: napi: Make gro_flush_timeout per-NAPI Joe Damato
  2024-08-29 13:48   ` Eric Dumazet
@ 2024-08-30 16:18   ` kernel test robot
  2024-08-30 16:18   ` kernel test robot
  2 siblings, 0 replies; 48+ messages in thread
From: kernel test robot @ 2024-08-30 16:18 UTC (permalink / raw)
  To: Joe Damato, netdev
  Cc: oe-kbuild-all, edumazet, amritha.nambiar, sridhar.samudrala, sdf,
	bjorn, hch, willy, willemdebruijn.kernel, skhawaja, kuba,
	Joe Damato, Martin Karsten, Paolo Abeni, Jiri Pirko,
	Sebastian Andrzej Siewior, Lorenzo Bianconi, Breno Leitao,
	Johannes Berg, Alexander Lobakin, Heiner Kallweit, linux-kernel

Hi Joe,

kernel test robot noticed the following build errors:

[auto build test ERROR on net-next/main]

url:    https://github.com/intel-lab-lkp/linux/commits/Joe-Damato/net-napi-Make-napi_defer_hard_irqs-per-NAPI/20240829-211617
base:   net-next/main
patch link:    https://lore.kernel.org/r/20240829131214.169977-4-jdamato%40fastly.com
patch subject: [PATCH net-next 3/5] net: napi: Make gro_flush_timeout per-NAPI
config: s390-allyesconfig (https://download.01.org/0day-ci/archive/20240831/202408310026.wxZySizP-lkp@intel.com/config)
compiler: s390-linux-gcc (GCC) 14.1.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240831/202408310026.wxZySizP-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202408310026.wxZySizP-lkp@intel.com/

All errors (new ones prefixed by >>):

   In file included from include/linux/minmax.h:5,
                    from include/linux/jiffies.h:8,
                    from include/net/pkt_sched.h:5,
                    from drivers/net/ethernet/intel/idpf/idpf.h:12,
                    from drivers/net/ethernet/intel/idpf/idpf_dev.c:4:
   include/linux/build_bug.h:78:41: error: static assertion failed: "offsetof(struct idpf_q_vector, __cacheline_group_end__read_write) - offsetofend(struct idpf_q_vector, __cacheline_group_begin__read_write) <= (424 + 2 * sizeof(struct dim))"
      78 | #define __static_assert(expr, msg, ...) _Static_assert(expr, msg)
         |                                         ^~~~~~~~~~~~~~
   include/linux/build_bug.h:77:34: note: in expansion of macro '__static_assert'
      77 | #define static_assert(expr, ...) __static_assert(expr, ##__VA_ARGS__, #expr)
         |                                  ^~~~~~~~~~~~~~~
   include/net/libeth/cache.h:24:9: note: in expansion of macro 'static_assert'
      24 |         static_assert(offsetof(type, __cacheline_group_end__##grp) -          \
         |         ^~~~~~~~~~~~~
   include/net/libeth/cache.h:62:9: note: in expansion of macro 'libeth_cacheline_group_assert'
      62 |         libeth_cacheline_group_assert(type, read_write, rw);                  \
         |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   drivers/net/ethernet/intel/idpf/idpf_txrx.h:475:1: note: in expansion of macro 'libeth_cacheline_set_assert'
     475 | libeth_cacheline_set_assert(struct idpf_q_vector, 104,
         | ^~~~~~~~~~~~~~~~~~~~~~~~~~~
>> include/linux/build_bug.h:78:41: error: static assertion failed: "sizeof(struct idpf_q_vector) <= ((((((104)) + ((__typeof__((104)))((256)) - 1)) & ~((__typeof__((104)))((256)) - 1)) + ((((424 + 2 * sizeof(struct dim))) + ((__typeof__((424 + 2 * sizeof(struct dim))))((256)) - 1)) & ~((__typeof__((424 + 2 * sizeof(struct dim))))((256)) - 1)) + ((((8 + sizeof(cpumask_var_t))) + ((__typeof__((8 + sizeof(cpumask_var_t))))((256)) - 1)) & ~((__typeof__((8 + sizeof(cpumask_var_t))))((256)) - 1))))"
      78 | #define __static_assert(expr, msg, ...) _Static_assert(expr, msg)
         |                                         ^~~~~~~~~~~~~~
   include/linux/build_bug.h:77:34: note: in expansion of macro '__static_assert'
      77 | #define static_assert(expr, ...) __static_assert(expr, ##__VA_ARGS__, #expr)
         |                                  ^~~~~~~~~~~~~~~
   include/net/libeth/cache.h:28:9: note: in expansion of macro 'static_assert'
      28 |         static_assert(sizeof(type) <= (sz))
         |         ^~~~~~~~~~~~~
   include/net/libeth/cache.h:48:9: note: in expansion of macro '__libeth_cacheline_struct_assert'
      48 |         __libeth_cacheline_struct_assert(type, __libeth_cls(__VA_ARGS__));    \
         |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   include/net/libeth/cache.h:64:9: note: in expansion of macro 'libeth_cacheline_struct_assert'
      64 |         libeth_cacheline_struct_assert(type, ro, rw, c)
         |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   drivers/net/ethernet/intel/idpf/idpf_txrx.h:475:1: note: in expansion of macro 'libeth_cacheline_set_assert'
     475 | libeth_cacheline_set_assert(struct idpf_q_vector, 104,
         | ^~~~~~~~~~~~~~~~~~~~~~~~~~~


vim +78 include/linux/build_bug.h

bc6245e5efd70c Ian Abbott       2017-07-10  60  
6bab69c65013be Rasmus Villemoes 2019-03-07  61  /**
6bab69c65013be Rasmus Villemoes 2019-03-07  62   * static_assert - check integer constant expression at build time
6bab69c65013be Rasmus Villemoes 2019-03-07  63   *
6bab69c65013be Rasmus Villemoes 2019-03-07  64   * static_assert() is a wrapper for the C11 _Static_assert, with a
6bab69c65013be Rasmus Villemoes 2019-03-07  65   * little macro magic to make the message optional (defaulting to the
6bab69c65013be Rasmus Villemoes 2019-03-07  66   * stringification of the tested expression).
6bab69c65013be Rasmus Villemoes 2019-03-07  67   *
6bab69c65013be Rasmus Villemoes 2019-03-07  68   * Contrary to BUILD_BUG_ON(), static_assert() can be used at global
6bab69c65013be Rasmus Villemoes 2019-03-07  69   * scope, but requires the expression to be an integer constant
6bab69c65013be Rasmus Villemoes 2019-03-07  70   * expression (i.e., it is not enough that __builtin_constant_p() is
6bab69c65013be Rasmus Villemoes 2019-03-07  71   * true for expr).
6bab69c65013be Rasmus Villemoes 2019-03-07  72   *
6bab69c65013be Rasmus Villemoes 2019-03-07  73   * Also note that BUILD_BUG_ON() fails the build if the condition is
6bab69c65013be Rasmus Villemoes 2019-03-07  74   * true, while static_assert() fails the build if the expression is
6bab69c65013be Rasmus Villemoes 2019-03-07  75   * false.
6bab69c65013be Rasmus Villemoes 2019-03-07  76   */
6bab69c65013be Rasmus Villemoes 2019-03-07  77  #define static_assert(expr, ...) __static_assert(expr, ##__VA_ARGS__, #expr)
6bab69c65013be Rasmus Villemoes 2019-03-07 @78  #define __static_assert(expr, msg, ...) _Static_assert(expr, msg)
6bab69c65013be Rasmus Villemoes 2019-03-07  79  
07a368b3f55a79 Maxim Levitsky   2022-10-25  80  

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH net-next 3/5] net: napi: Make gro_flush_timeout per-NAPI
  2024-08-29 13:11 ` [PATCH net-next 3/5] net: napi: Make gro_flush_timeout per-NAPI Joe Damato
  2024-08-29 13:48   ` Eric Dumazet
  2024-08-30 16:18   ` kernel test robot
@ 2024-08-30 16:18   ` kernel test robot
  2 siblings, 0 replies; 48+ messages in thread
From: kernel test robot @ 2024-08-30 16:18 UTC (permalink / raw)
  To: Joe Damato, netdev
  Cc: oe-kbuild-all, edumazet, amritha.nambiar, sridhar.samudrala, sdf,
	bjorn, hch, willy, willemdebruijn.kernel, skhawaja, kuba,
	Joe Damato, Martin Karsten, Paolo Abeni, Jiri Pirko,
	Sebastian Andrzej Siewior, Lorenzo Bianconi, Breno Leitao,
	Johannes Berg, Alexander Lobakin, Heiner Kallweit, linux-kernel

Hi Joe,

kernel test robot noticed the following build errors:

[auto build test ERROR on net-next/main]

url:    https://github.com/intel-lab-lkp/linux/commits/Joe-Damato/net-napi-Make-napi_defer_hard_irqs-per-NAPI/20240829-211617
base:   net-next/main
patch link:    https://lore.kernel.org/r/20240829131214.169977-4-jdamato%40fastly.com
patch subject: [PATCH net-next 3/5] net: napi: Make gro_flush_timeout per-NAPI
config: loongarch-allmodconfig (https://download.01.org/0day-ci/archive/20240831/202408310043.fmwHg8BS-lkp@intel.com/config)
compiler: loongarch64-linux-gcc (GCC) 14.1.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240831/202408310043.fmwHg8BS-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202408310043.fmwHg8BS-lkp@intel.com/

All errors (new ones prefixed by >>):

   In file included from include/linux/minmax.h:5,
                    from include/linux/jiffies.h:8,
                    from include/net/pkt_sched.h:5,
                    from drivers/net/ethernet/intel/idpf/idpf.h:12,
                    from drivers/net/ethernet/intel/idpf/idpf_dev.c:4:
   include/linux/build_bug.h:78:41: error: static assertion failed: "offsetof(struct idpf_q_vector, __cacheline_group_end__read_write) - offsetofend(struct idpf_q_vector, __cacheline_group_begin__read_write) == (424 + 2 * sizeof(struct dim))"
      78 | #define __static_assert(expr, msg, ...) _Static_assert(expr, msg)
         |                                         ^~~~~~~~~~~~~~
   include/linux/build_bug.h:77:34: note: in expansion of macro '__static_assert'
      77 | #define static_assert(expr, ...) __static_assert(expr, ##__VA_ARGS__, #expr)
         |                                  ^~~~~~~~~~~~~~~
   include/net/libeth/cache.h:17:9: note: in expansion of macro 'static_assert'
      17 |         static_assert(offsetof(type, __cacheline_group_end__##grp) -          \
         |         ^~~~~~~~~~~~~
   include/net/libeth/cache.h:62:9: note: in expansion of macro 'libeth_cacheline_group_assert'
      62 |         libeth_cacheline_group_assert(type, read_write, rw);                  \
         |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   drivers/net/ethernet/intel/idpf/idpf_txrx.h:475:1: note: in expansion of macro 'libeth_cacheline_set_assert'
     475 | libeth_cacheline_set_assert(struct idpf_q_vector, 104,
         | ^~~~~~~~~~~~~~~~~~~~~~~~~~~
>> include/linux/build_bug.h:78:41: error: static assertion failed: "sizeof(struct idpf_q_vector) == ((((((104)) + ((__typeof__((104)))(((1 << 6))) - 1)) & ~((__typeof__((104)))(((1 << 6))) - 1)) + ((((424 + 2 * sizeof(struct dim))) + ((__typeof__((424 + 2 * sizeof(struct dim))))(((1 << 6))) - 1)) & ~((__typeof__((424 + 2 * sizeof(struct dim))))(((1 << 6))) - 1)) + ((((8 + sizeof(cpumask_var_t))) + ((__typeof__((8 + sizeof(cpumask_var_t))))(((1 << 6))) - 1)) & ~((__typeof__((8 + sizeof(cpumask_var_t))))(((1 << 6))) - 1))))"
      78 | #define __static_assert(expr, msg, ...) _Static_assert(expr, msg)
         |                                         ^~~~~~~~~~~~~~
   include/linux/build_bug.h:77:34: note: in expansion of macro '__static_assert'
      77 | #define static_assert(expr, ...) __static_assert(expr, ##__VA_ARGS__, #expr)
         |                                  ^~~~~~~~~~~~~~~
   include/net/libeth/cache.h:21:9: note: in expansion of macro 'static_assert'
      21 |         static_assert(sizeof(type) == (sz))
         |         ^~~~~~~~~~~~~
   include/net/libeth/cache.h:48:9: note: in expansion of macro '__libeth_cacheline_struct_assert'
      48 |         __libeth_cacheline_struct_assert(type, __libeth_cls(__VA_ARGS__));    \
         |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   include/net/libeth/cache.h:64:9: note: in expansion of macro 'libeth_cacheline_struct_assert'
      64 |         libeth_cacheline_struct_assert(type, ro, rw, c)
         |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   drivers/net/ethernet/intel/idpf/idpf_txrx.h:475:1: note: in expansion of macro 'libeth_cacheline_set_assert'
     475 | libeth_cacheline_set_assert(struct idpf_q_vector, 104,
         | ^~~~~~~~~~~~~~~~~~~~~~~~~~~


vim +78 include/linux/build_bug.h

bc6245e5efd70c Ian Abbott       2017-07-10  60  
6bab69c65013be Rasmus Villemoes 2019-03-07  61  /**
6bab69c65013be Rasmus Villemoes 2019-03-07  62   * static_assert - check integer constant expression at build time
6bab69c65013be Rasmus Villemoes 2019-03-07  63   *
6bab69c65013be Rasmus Villemoes 2019-03-07  64   * static_assert() is a wrapper for the C11 _Static_assert, with a
6bab69c65013be Rasmus Villemoes 2019-03-07  65   * little macro magic to make the message optional (defaulting to the
6bab69c65013be Rasmus Villemoes 2019-03-07  66   * stringification of the tested expression).
6bab69c65013be Rasmus Villemoes 2019-03-07  67   *
6bab69c65013be Rasmus Villemoes 2019-03-07  68   * Contrary to BUILD_BUG_ON(), static_assert() can be used at global
6bab69c65013be Rasmus Villemoes 2019-03-07  69   * scope, but requires the expression to be an integer constant
6bab69c65013be Rasmus Villemoes 2019-03-07  70   * expression (i.e., it is not enough that __builtin_constant_p() is
6bab69c65013be Rasmus Villemoes 2019-03-07  71   * true for expr).
6bab69c65013be Rasmus Villemoes 2019-03-07  72   *
6bab69c65013be Rasmus Villemoes 2019-03-07  73   * Also note that BUILD_BUG_ON() fails the build if the condition is
6bab69c65013be Rasmus Villemoes 2019-03-07  74   * true, while static_assert() fails the build if the expression is
6bab69c65013be Rasmus Villemoes 2019-03-07  75   * false.
6bab69c65013be Rasmus Villemoes 2019-03-07  76   */
6bab69c65013be Rasmus Villemoes 2019-03-07  77  #define static_assert(expr, ...) __static_assert(expr, ##__VA_ARGS__, #expr)
6bab69c65013be Rasmus Villemoes 2019-03-07 @78  #define __static_assert(expr, msg, ...) _Static_assert(expr, msg)
6bab69c65013be Rasmus Villemoes 2019-03-07  79  
07a368b3f55a79 Maxim Levitsky   2022-10-25  80  

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH net-next 1/5] net: napi: Make napi_defer_hard_irqs per-NAPI
  2024-08-29 13:11 ` [PATCH net-next 1/5] net: napi: Make napi_defer_hard_irqs per-NAPI Joe Damato
                     ` (2 preceding siblings ...)
  2024-08-30  8:36   ` Simon Horman
@ 2024-08-30 16:50   ` kernel test robot
  3 siblings, 0 replies; 48+ messages in thread
From: kernel test robot @ 2024-08-30 16:50 UTC (permalink / raw)
  To: Joe Damato, netdev
  Cc: llvm, oe-kbuild-all, edumazet, amritha.nambiar, sridhar.samudrala,
	sdf, bjorn, hch, willy, willemdebruijn.kernel, skhawaja, kuba,
	Joe Damato, Martin Karsten, Paolo Abeni, Jiri Pirko,
	Sebastian Andrzej Siewior, Lorenzo Bianconi, Breno Leitao,
	Johannes Berg, Alexander Lobakin, linux-kernel

Hi Joe,

kernel test robot noticed the following build errors:

[auto build test ERROR on net-next/main]

url:    https://github.com/intel-lab-lkp/linux/commits/Joe-Damato/net-napi-Make-napi_defer_hard_irqs-per-NAPI/20240829-211617
base:   net-next/main
patch link:    https://lore.kernel.org/r/20240829131214.169977-2-jdamato%40fastly.com
patch subject: [PATCH net-next 1/5] net: napi: Make napi_defer_hard_irqs per-NAPI
config: x86_64-allyesconfig (https://download.01.org/0day-ci/archive/20240831/202408310038.VztWz8YR-lkp@intel.com/config)
compiler: clang version 18.1.5 (https://github.com/llvm/llvm-project 617a15a9eac96088ae5e9134248d8236e34b91b1)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240831/202408310038.VztWz8YR-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202408310038.VztWz8YR-lkp@intel.com/

All errors (new ones prefixed by >>):

   In file included from drivers/net/ethernet/intel/idpf/idpf_dev.c:4:
   In file included from drivers/net/ethernet/intel/idpf/idpf.h:22:
>> drivers/net/ethernet/intel/idpf/idpf_txrx.h:475:1: error: static assertion failed due to requirement '__builtin_offsetof(struct idpf_q_vector, __cacheline_group_end__read_write) - (__builtin_offsetof(struct idpf_q_vector, __cacheline_group_begin__read_write) + sizeof ((((struct idpf_q_vector *)0)->__cacheline_group_begin__read_write))) == (424 + 2 * sizeof(struct dim))': offsetof(struct idpf_q_vector, __cacheline_group_end__read_write) - offsetofend(struct idpf_q_vector, __cacheline_group_begin__read_write) == (424 + 2 * sizeof(struct dim))
     475 | libeth_cacheline_set_assert(struct idpf_q_vector, 104,
         | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
     476 |                             424 + 2 * sizeof(struct dim),
         |                             ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
     477 |                             8 + sizeof(cpumask_var_t));
         |                             ~~~~~~~~~~~~~~~~~~~~~~~~~~
   include/net/libeth/cache.h:62:2: note: expanded from macro 'libeth_cacheline_set_assert'
      62 |         libeth_cacheline_group_assert(type, read_write, rw);                  \
         |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   include/net/libeth/cache.h:17:16: note: expanded from macro 'libeth_cacheline_group_assert'
      17 |         static_assert(offsetof(type, __cacheline_group_end__##grp) -          \
         |         ~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      18 |                       offsetofend(type, __cacheline_group_begin__##grp) ==    \
         |                       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      19 |                       (sz))
         |                       ~~~~~
   include/linux/stddef.h:16:32: note: expanded from macro 'offsetof'
      16 | #define offsetof(TYPE, MEMBER)  __builtin_offsetof(TYPE, MEMBER)
         |                                 ^
   include/linux/build_bug.h:77:50: note: expanded from macro 'static_assert'
      77 | #define static_assert(expr, ...) __static_assert(expr, ##__VA_ARGS__, #expr)
         |                                  ~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~
   include/linux/build_bug.h:78:56: note: expanded from macro '__static_assert'
      78 | #define __static_assert(expr, msg, ...) _Static_assert(expr, msg)
         |                                                        ^~~~
   drivers/net/ethernet/intel/idpf/idpf_txrx.h:475:1: note: expression evaluates to '768 == 760'
     475 | libeth_cacheline_set_assert(struct idpf_q_vector, 104,
         | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
     476 |                             424 + 2 * sizeof(struct dim),
         |                             ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
     477 |                             8 + sizeof(cpumask_var_t));
         |                             ~~~~~~~~~~~~~~~~~~~~~~~~~~
   include/net/libeth/cache.h:62:2: note: expanded from macro 'libeth_cacheline_set_assert'
      62 |         libeth_cacheline_group_assert(type, read_write, rw);                  \
         |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   include/net/libeth/cache.h:18:59: note: expanded from macro 'libeth_cacheline_group_assert'
      17 |         static_assert(offsetof(type, __cacheline_group_end__##grp) -          \
         |         ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      18 |                       offsetofend(type, __cacheline_group_begin__##grp) ==    \
         |                       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~
      19 |                       (sz))
         |                       ~~~~~
   include/linux/build_bug.h:77:50: note: expanded from macro 'static_assert'
      77 | #define static_assert(expr, ...) __static_assert(expr, ##__VA_ARGS__, #expr)
         |                                  ~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~
   include/linux/build_bug.h:78:56: note: expanded from macro '__static_assert'
      78 | #define __static_assert(expr, msg, ...) _Static_assert(expr, msg)
         |                                                        ^~~~
   1 error generated.
--
   In file included from drivers/net/ethernet/intel/idpf/idpf_main.c:4:
   In file included from drivers/net/ethernet/intel/idpf/idpf.h:22:
>> drivers/net/ethernet/intel/idpf/idpf_txrx.h:475:1: error: static assertion failed due to requirement '__builtin_offsetof(struct idpf_q_vector, __cacheline_group_end__read_write) - (__builtin_offsetof(struct idpf_q_vector, __cacheline_group_begin__read_write) + sizeof ((((struct idpf_q_vector *)0)->__cacheline_group_begin__read_write))) == (424 + 2 * sizeof(struct dim))': offsetof(struct idpf_q_vector, __cacheline_group_end__read_write) - offsetofend(struct idpf_q_vector, __cacheline_group_begin__read_write) == (424 + 2 * sizeof(struct dim))
     475 | libeth_cacheline_set_assert(struct idpf_q_vector, 104,
         | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
     476 |                             424 + 2 * sizeof(struct dim),
         |                             ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
     477 |                             8 + sizeof(cpumask_var_t));
         |                             ~~~~~~~~~~~~~~~~~~~~~~~~~~
   include/net/libeth/cache.h:62:2: note: expanded from macro 'libeth_cacheline_set_assert'
      62 |         libeth_cacheline_group_assert(type, read_write, rw);                  \
         |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   include/net/libeth/cache.h:17:16: note: expanded from macro 'libeth_cacheline_group_assert'
      17 |         static_assert(offsetof(type, __cacheline_group_end__##grp) -          \
         |         ~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      18 |                       offsetofend(type, __cacheline_group_begin__##grp) ==    \
         |                       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      19 |                       (sz))
         |                       ~~~~~
   include/linux/stddef.h:16:32: note: expanded from macro 'offsetof'
      16 | #define offsetof(TYPE, MEMBER)  __builtin_offsetof(TYPE, MEMBER)
         |                                 ^
   include/linux/build_bug.h:77:50: note: expanded from macro 'static_assert'
      77 | #define static_assert(expr, ...) __static_assert(expr, ##__VA_ARGS__, #expr)
         |                                  ~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~
   include/linux/build_bug.h:78:56: note: expanded from macro '__static_assert'
      78 | #define __static_assert(expr, msg, ...) _Static_assert(expr, msg)
         |                                                        ^~~~
   drivers/net/ethernet/intel/idpf/idpf_txrx.h:475:1: note: expression evaluates to '768 == 760'
     475 | libeth_cacheline_set_assert(struct idpf_q_vector, 104,
         | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
     476 |                             424 + 2 * sizeof(struct dim),
         |                             ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
     477 |                             8 + sizeof(cpumask_var_t));
         |                             ~~~~~~~~~~~~~~~~~~~~~~~~~~
   include/net/libeth/cache.h:62:2: note: expanded from macro 'libeth_cacheline_set_assert'
      62 |         libeth_cacheline_group_assert(type, read_write, rw);                  \
         |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   include/net/libeth/cache.h:18:59: note: expanded from macro 'libeth_cacheline_group_assert'
      17 |         static_assert(offsetof(type, __cacheline_group_end__##grp) -          \
         |         ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      18 |                       offsetofend(type, __cacheline_group_begin__##grp) ==    \
         |                       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~
      19 |                       (sz))
         |                       ~~~~~
   include/linux/build_bug.h:77:50: note: expanded from macro 'static_assert'
      77 | #define static_assert(expr, ...) __static_assert(expr, ##__VA_ARGS__, #expr)
         |                                  ~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~
   include/linux/build_bug.h:78:56: note: expanded from macro '__static_assert'
      78 | #define __static_assert(expr, msg, ...) _Static_assert(expr, msg)
         |                                                        ^~~~
   drivers/net/ethernet/intel/idpf/idpf_main.c:167:39: warning: shift count >= width of type [-Wshift-count-overflow]
     167 |         err = dma_set_mask_and_coherent(dev, DMA_BIT_MASK(64));
         |                                              ^~~~~~~~~~~~~~~~
   include/linux/dma-mapping.h:77:54: note: expanded from macro 'DMA_BIT_MASK'
      77 | #define DMA_BIT_MASK(n) (((n) == 64) ? ~0ULL : ((1ULL<<(n))-1))
         |                                                      ^ ~~~
   1 warning and 1 error generated.


vim +475 drivers/net/ethernet/intel/idpf/idpf_txrx.h

4930fbf419a72d Pavan Kumar Linga 2023-08-07  412  
4930fbf419a72d Pavan Kumar Linga 2023-08-07  413  /**
4930fbf419a72d Pavan Kumar Linga 2023-08-07  414   * struct idpf_q_vector
1c325aac10a82f Alan Brady        2023-08-07  415   * @vport: Vport back pointer
5a816aae2d463d Alexander Lobakin 2024-06-20  416   * @num_rxq: Number of RX queues
d4d5587182664b Pavan Kumar Linga 2023-08-07  417   * @num_txq: Number of TX queues
5a816aae2d463d Alexander Lobakin 2024-06-20  418   * @num_bufq: Number of buffer queues
e4891e4687c8dd Alexander Lobakin 2024-06-20  419   * @num_complq: number of completion queues
5a816aae2d463d Alexander Lobakin 2024-06-20  420   * @rx: Array of RX queues to service
1c325aac10a82f Alan Brady        2023-08-07  421   * @tx: Array of TX queues to service
5a816aae2d463d Alexander Lobakin 2024-06-20  422   * @bufq: Array of buffer queues to service
e4891e4687c8dd Alexander Lobakin 2024-06-20  423   * @complq: array of completion queues
5a816aae2d463d Alexander Lobakin 2024-06-20  424   * @intr_reg: See struct idpf_intr_reg
5a816aae2d463d Alexander Lobakin 2024-06-20  425   * @napi: napi handler
5a816aae2d463d Alexander Lobakin 2024-06-20  426   * @total_events: Number of interrupts processed
c2d548cad1508d Joshua Hay        2023-08-07  427   * @tx_dim: Data for TX net_dim algorithm
1c325aac10a82f Alan Brady        2023-08-07  428   * @tx_itr_value: TX interrupt throttling rate
1c325aac10a82f Alan Brady        2023-08-07  429   * @tx_intr_mode: Dynamic ITR or not
1c325aac10a82f Alan Brady        2023-08-07  430   * @tx_itr_idx: TX ITR index
3a8845af66edb3 Alan Brady        2023-08-07  431   * @rx_dim: Data for RX net_dim algorithm
95af467d9a4e3b Alan Brady        2023-08-07  432   * @rx_itr_value: RX interrupt throttling rate
95af467d9a4e3b Alan Brady        2023-08-07  433   * @rx_intr_mode: Dynamic ITR or not
95af467d9a4e3b Alan Brady        2023-08-07  434   * @rx_itr_idx: RX ITR index
5a816aae2d463d Alexander Lobakin 2024-06-20  435   * @v_idx: Vector index
bf9bf7042a38eb Alexander Lobakin 2024-06-20  436   * @affinity_mask: CPU affinity mask
4930fbf419a72d Pavan Kumar Linga 2023-08-07  437   */
4930fbf419a72d Pavan Kumar Linga 2023-08-07  438  struct idpf_q_vector {
5a816aae2d463d Alexander Lobakin 2024-06-20  439  	__cacheline_group_begin_aligned(read_mostly);
1c325aac10a82f Alan Brady        2023-08-07  440  	struct idpf_vport *vport;
1c325aac10a82f Alan Brady        2023-08-07  441  
5a816aae2d463d Alexander Lobakin 2024-06-20  442  	u16 num_rxq;
d4d5587182664b Pavan Kumar Linga 2023-08-07  443  	u16 num_txq;
5a816aae2d463d Alexander Lobakin 2024-06-20  444  	u16 num_bufq;
e4891e4687c8dd Alexander Lobakin 2024-06-20  445  	u16 num_complq;
5a816aae2d463d Alexander Lobakin 2024-06-20  446  	struct idpf_rx_queue **rx;
e4891e4687c8dd Alexander Lobakin 2024-06-20  447  	struct idpf_tx_queue **tx;
5a816aae2d463d Alexander Lobakin 2024-06-20  448  	struct idpf_buf_queue **bufq;
e4891e4687c8dd Alexander Lobakin 2024-06-20  449  	struct idpf_compl_queue **complq;
e4891e4687c8dd Alexander Lobakin 2024-06-20  450  
5a816aae2d463d Alexander Lobakin 2024-06-20  451  	struct idpf_intr_reg intr_reg;
5a816aae2d463d Alexander Lobakin 2024-06-20  452  	__cacheline_group_end_aligned(read_mostly);
5a816aae2d463d Alexander Lobakin 2024-06-20  453  
5a816aae2d463d Alexander Lobakin 2024-06-20  454  	__cacheline_group_begin_aligned(read_write);
5a816aae2d463d Alexander Lobakin 2024-06-20  455  	struct napi_struct napi;
5a816aae2d463d Alexander Lobakin 2024-06-20  456  	u16 total_events;
5a816aae2d463d Alexander Lobakin 2024-06-20  457  
c2d548cad1508d Joshua Hay        2023-08-07  458  	struct dim tx_dim;
1c325aac10a82f Alan Brady        2023-08-07  459  	u16 tx_itr_value;
1c325aac10a82f Alan Brady        2023-08-07  460  	bool tx_intr_mode;
1c325aac10a82f Alan Brady        2023-08-07  461  	u32 tx_itr_idx;
1c325aac10a82f Alan Brady        2023-08-07  462  
3a8845af66edb3 Alan Brady        2023-08-07  463  	struct dim rx_dim;
95af467d9a4e3b Alan Brady        2023-08-07  464  	u16 rx_itr_value;
95af467d9a4e3b Alan Brady        2023-08-07  465  	bool rx_intr_mode;
95af467d9a4e3b Alan Brady        2023-08-07  466  	u32 rx_itr_idx;
5a816aae2d463d Alexander Lobakin 2024-06-20  467  	__cacheline_group_end_aligned(read_write);
95af467d9a4e3b Alan Brady        2023-08-07  468  
5a816aae2d463d Alexander Lobakin 2024-06-20  469  	__cacheline_group_begin_aligned(cold);
5a816aae2d463d Alexander Lobakin 2024-06-20  470  	u16 v_idx;
bf9bf7042a38eb Alexander Lobakin 2024-06-20  471  
bf9bf7042a38eb Alexander Lobakin 2024-06-20  472  	cpumask_var_t affinity_mask;
5a816aae2d463d Alexander Lobakin 2024-06-20  473  	__cacheline_group_end_aligned(cold);
4930fbf419a72d Pavan Kumar Linga 2023-08-07  474  };
5a816aae2d463d Alexander Lobakin 2024-06-20 @475  libeth_cacheline_set_assert(struct idpf_q_vector, 104,
5a816aae2d463d Alexander Lobakin 2024-06-20  476  			    424 + 2 * sizeof(struct dim),
5a816aae2d463d Alexander Lobakin 2024-06-20  477  			    8 + sizeof(cpumask_var_t));
0fe45467a1041e Pavan Kumar Linga 2023-08-07  478  

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


* Re: [PATCH net-next 1/5] net: napi: Make napi_defer_hard_irqs per-NAPI
  2024-08-30  9:14     ` Joe Damato
@ 2024-08-30 20:21       ` Jakub Kicinski
  2024-08-30 20:23         ` Joe Damato
  0 siblings, 1 reply; 48+ messages in thread
From: Jakub Kicinski @ 2024-08-30 20:21 UTC (permalink / raw)
  To: Joe Damato
  Cc: netdev, edumazet, amritha.nambiar, sridhar.samudrala, sdf, bjorn,
	hch, willy, willemdebruijn.kernel, skhawaja, Martin Karsten,
	David S. Miller, Paolo Abeni, Jiri Pirko,
	Sebastian Andrzej Siewior, Lorenzo Bianconi, Breno Leitao,
	Johannes Berg, Alexander Lobakin, open list

On Fri, 30 Aug 2024 10:14:41 +0100 Joe Damato wrote:
> Otherwise: I'll make them static inlines as you suggested.
> 
> Let me know if you have a preference here because I am neutral.

No strong preference either, static inlines seem like a good
middle ground :)
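
For reference, a minimal sketch of what such static inline helpers might
look like (purely illustrative: the names are made up here, the per-NAPI
field is the one patch 1 of this series proposes to add, and the exact
type is discussed further down the thread):

/* illustrative accessors for a per-NAPI defer_hard_irqs field */
static inline int napi_get_defer_hard_irqs(const struct napi_struct *n)
{
        return READ_ONCE(n->defer_hard_irqs);
}

static inline void napi_set_defer_hard_irqs(struct napi_struct *n, int defer)
{
        WRITE_ONCE(n->defer_hard_irqs, defer);
}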


* Re: [PATCH net-next 1/5] net: napi: Make napi_defer_hard_irqs per-NAPI
  2024-08-30 20:21       ` Jakub Kicinski
@ 2024-08-30 20:23         ` Joe Damato
  0 siblings, 0 replies; 48+ messages in thread
From: Joe Damato @ 2024-08-30 20:23 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: netdev, edumazet, amritha.nambiar, sridhar.samudrala, sdf, bjorn,
	hch, willy, willemdebruijn.kernel, skhawaja, Martin Karsten,
	David S. Miller, Paolo Abeni, Jiri Pirko,
	Sebastian Andrzej Siewior, Lorenzo Bianconi, Breno Leitao,
	Johannes Berg, Alexander Lobakin, open list

On Fri, Aug 30, 2024 at 01:21:12PM -0700, Jakub Kicinski wrote:
> On Fri, 30 Aug 2024 10:14:41 +0100 Joe Damato wrote:
> > Otherwise: I'll make them static inlines as you suggested.
> > 
> > Let me know if you have a preference here because I am neutral.
> 
> No strong preference either, static inlines seem like a good
> middle ground :)

Ack. Already queued up in my v2 branch.


* Re: [PATCH net-next 2/5] netdev-genl: Dump napi_defer_hard_irqs
  2024-08-30  9:10     ` Joe Damato
@ 2024-08-30 20:28       ` Jakub Kicinski
  2024-08-30 20:31         ` Joe Damato
  0 siblings, 1 reply; 48+ messages in thread
From: Jakub Kicinski @ 2024-08-30 20:28 UTC (permalink / raw)
  To: Joe Damato
  Cc: netdev, edumazet, amritha.nambiar, sridhar.samudrala, sdf, bjorn,
	hch, willy, willemdebruijn.kernel, skhawaja, Martin Karsten,
	Donald Hunter, David S. Miller, Paolo Abeni,
	Jesper Dangaard Brouer, Xuan Zhuo, Daniel Jurgens, open list

On Fri, 30 Aug 2024 10:10:47 +0100 Joe Damato wrote:
> > > +        name: defer-hard-irqs
> > > +        doc: The number of consecutive empty polls before IRQ deferral ends
> > > +             and hardware IRQs are re-enabled.
> > > +        type: s32  
> > 
> > Why is this a signed value? 🤔️  
> 
> In commit 6f8b12d661d0 ("net: napi: add hard irqs deferral
> feature"), napi_defer_hard_irqs was added to struct net_device as an
> int. I was trying to match that and thus made the field a signed int
> in the napi struct, as well.

It's probably because int is the default type in C.
The choice of types in netlink feels more deliberate.

> It looks like there was a possibility of overflow introduced in that
> commit in change_napi_defer_hard_irqs maybe ?
> 
> If you'd prefer I could:
>   - submit a Fixes to change the net_device field to a u32 and then
>     change the netlink code to also be u32
>   - add an overflow check (val > U32_MAX) in
>     change_napi_defer_hard_irqs
> 
> Which would mean for the v2 of this series:
>   - drop the overflow check I added in Patch 1
>   - Change netlink to use u32 in this patch 
> 
> What do you think?

Whether we want to clean things up internally is up to you; the overflow
check you're adding in sysfs seems good. We can use u32 in netlink with
a "max: s32-max" check, and lift this requirement later if we ever need
the 32nd bit?
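
For concreteness, the sysfs-side bound could look roughly like the below
(a sketch only; the helper name comes from the discussion above, and the
exact limit, U32_MAX vs S32_MAX, is precisely what is being settled here):

static int change_napi_defer_hard_irqs(struct net_device *dev,
                                       unsigned long val)
{
        /* reject values that can't be represented in the field / netlink attr */
        if (val > S32_MAX)
                return -ERANGE;

        dev->napi_defer_hard_irqs = val;
        return 0;
}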


* Re: [PATCH net-next 2/5] netdev-genl: Dump napi_defer_hard_irqs
  2024-08-30 20:28       ` Jakub Kicinski
@ 2024-08-30 20:31         ` Joe Damato
  2024-08-30 21:22           ` Jakub Kicinski
  0 siblings, 1 reply; 48+ messages in thread
From: Joe Damato @ 2024-08-30 20:31 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: netdev, edumazet, amritha.nambiar, sridhar.samudrala, sdf, bjorn,
	hch, willy, willemdebruijn.kernel, skhawaja, Martin Karsten,
	Donald Hunter, David S. Miller, Paolo Abeni,
	Jesper Dangaard Brouer, Xuan Zhuo, Daniel Jurgens, open list

On Fri, Aug 30, 2024 at 01:28:08PM -0700, Jakub Kicinski wrote:
> On Fri, 30 Aug 2024 10:10:47 +0100 Joe Damato wrote:
> > > > +        name: defer-hard-irqs
> > > > +        doc: The number of consecutive empty polls before IRQ deferral ends
> > > > +             and hardware IRQs are re-enabled.
> > > > +        type: s32  
> > > 
> > > Why is this a signed value? 🤔️  
> > 
> > In commit 6f8b12d661d0 ("net: napi: add hard irqs deferral
> > feature"), napi_defer_hard_irqs was added to struct net_device as an
> > int. I was trying to match that and thus made the field a signed int
> > in the napi struct, as well.
> 
> It's probably because int is the default type in C.
> The choice of types in netlink feels more deliberate.
> 
> > It looks like there was a possibility of overflow introduced in that
> > commit in change_napi_defer_hard_irqs maybe ?
> > 
> > If you'd prefer I could:
> >   - submit a Fixes to change the net_device field to a u32 and then
> >     change the netlink code to also be u32
> >   - add an overflow check (val > U32_MAX) in
> >     change_napi_defer_hard_irqs
> > 
> > Which would mean for the v2 of this series:
> >   - drop the overflow check I added in Patch 1
> >   - Change netlink to use u32 in this patch 
> > 
> > What do you think?
> 
> Whether we want to clean things up internally is up to you, the overflow
> check you're adding in sysfs seems good. We can use u32 in netlink, with
> a check: max: s32-max and lift this requirement later if we ever need
> the 32nd bit?

OK, u32 + check for max: s32-max seems good.

Is the overflow check in sysfs a fixes I send separately or can I
sneak that into this series?


* Re: [PATCH net-next 5/5] netdev-genl: Support setting per-NAPI config values
  2024-08-30 10:43     ` Joe Damato
@ 2024-08-30 21:22       ` Jakub Kicinski
  2024-08-31 17:27         ` Joe Damato
  2024-09-02 16:56         ` Joe Damato
  0 siblings, 2 replies; 48+ messages in thread
From: Jakub Kicinski @ 2024-08-30 21:22 UTC (permalink / raw)
  To: Joe Damato
  Cc: netdev, edumazet, amritha.nambiar, sridhar.samudrala, sdf, bjorn,
	hch, willy, willemdebruijn.kernel, skhawaja, Martin Karsten,
	Donald Hunter, David S. Miller, Paolo Abeni,
	Jesper Dangaard Brouer, Xuan Zhuo, Daniel Jurgens, open list

On Fri, 30 Aug 2024 11:43:00 +0100 Joe Damato wrote:
> On Thu, Aug 29, 2024 at 03:31:05PM -0700, Jakub Kicinski wrote:
> > On Thu, 29 Aug 2024 13:12:01 +0000 Joe Damato wrote:  
> > > +	napi = napi_by_id(napi_id);
> > > +	if (napi)
> > > +		err = netdev_nl_napi_set_config(napi, info);
> > > +	else
> > > +		err = -EINVAL;  
> > 
> > if (napi) {
> > ...
> > } else {
> > 	NL_SET_BAD_ATTR(info->extack, info->attrs[NETDEV_A_NAPI_ID])
> > 	err = -ENOENT;
> > }  
> 
> Thanks, I'll make that change in the v2.
> 
> Should I send a Fixes for the same pattern in
> netdev_nl_napi_get_doit ?

SG, standalone patch is good, FWIW, no need to add to the series.

> > > +      doc: Set configurable NAPI instance settings.  
> > 
> > We should pause and think here how configuring NAPI params should
> > behave. NAPI instances are ephemeral, if you close and open the
> > device (or for some drivers change any BPF or ethtool setting)
> > the NAPIs may get wiped and recreated, discarding all configuration.
> > 
> > This is not how the sysfs API behaves, the sysfs settings on the device
> > survive close. It's (weirdly?) also not how queues behave, because we
> > have struct netdev{_rx,}_queue to store stuff persistently. Even tho
> > you'd think queues are as ephemeral as NAPIs if not more.
> > 
> > I guess we can either document this, and move on (which may be fine,
> > you have more practical experience than me). Or we can add an internal
> > concept of a "channel" (which perhaps maybe if you squint is what
> > ethtool -l calls NAPIs?) or just "napi_storage" as an array inside
> > net_device and store such config there. For simplicity of matching
> > config to NAPIs we can assume drivers add NAPI instances in order. 
> > If driver wants to do something more fancy we can add a variant of
> > netif_napi_add() which specifies the channel/storage to use.
> > 
> > Thoughts? I may be overly sensitive to the ephemeral thing, maybe
> > I work with unfortunate drivers...  
> 
> Thanks for pointing this out. I think this is an important case to
> consider. Here's how I'm thinking about it.
> 
> There are two cases:
> 
> 1) sysfs setting is used by existing/legacy apps: If the NAPIs are
> discarded and recreated, the code I added to netif_napi_add_weight
> in patch 1 and 3 should take care of that case preserving how sysfs
> works today, I believe. I think we are good on this case ?

Agreed.

> 2) apps using netlink to set various custom settings. This seems
> like a case where a future extension can be made to add a notifier
> for NAPI changes (like the netdevice notifier?).

Yes, the notifier may help, but it's a bit of a stop gap / fallback.

> If you think this is a good idea, then we'd do something like:
>   1. Document that the NAPI settings are wiped when NAPIs are wiped
>   2. In the future (not part of this series) a NAPI notifier is
>      added
>   3. User apps can then listen for NAPI create/delete events
>      and update settings when a NAPI is created. It would be
>      helpful, I think, for user apps to know about NAPI
>      create/delete events in general because it means NAPI IDs are
>      changing.
> 
> One could argue:
> 
>   When wiping/recreating a NAPI for an existing HW queue, that HW
>   queue gets a new NAPI ID associated with it. User apps operating
>   at this level probably care about NAPI IDs changing (as it affects
>   epoll busy poll). Since the settings in this series are per-NAPI
>   (and not per HW queue), the argument could be that user apps need
>   to setup NAPIs when they are created and settings do not persist
>   between NAPIs with different IDs even if associated with the same
>   HW queue.

IDK if the fact that NAPI ID gets replaced was intentional in the first
place. I would venture a guess that the person who added the IDs was
working with NICs which have stable NAPI instances once the device is
opened. This is, unfortunately, not universally the case.

I just poked at bnxt, mlx5 and fbnic and all of them reallocate NAPIs
on an open device. The closer we get to the queue API, the more dynamic the whole
setup will become (read: the more often reconfigurations will happen).

> Admittedly, from the perspective of a user it would be nice if a new
> NAPI created for an existing HW queue retained the previous
> settings so that I, as the user, can do less work.
> 
> But, what happens if a HW queue is destroyed and recreated? Will any
> HW settings be retained? And does that have any influence on what we
> do in software? See below.

Up to the driver, today. But settings we store in queue structs in 
the core are not wiped.

> This part of your message:
> 
> > we can assume drivers add NAPI instances in order. If driver wants
> > to do something more fancy we can add a variant of
> > netif_napi_add() which specifies the channel/storage to use.  
> 
> assuming drivers will "do a thing", so to speak, makes me uneasy.

Yeah.. :(

> I started to wonder: how do drivers handle per-queue HW IRQ coalesce
> settings when queue counts increase? It's a different, but adjacent
> problem, I think.
> 
> I tried a couple experiments on mlx5 and got very strange results
> suitable for their own thread and I didn't want to get this thread
> too far off track.

Yes, but ethtool is an old shallow API from the times when semantics
were simpler. It's precisely this mess which we try to avoid by storing
more of the config in the core, in a consistent fashion.

> I think you have much more practical experience when it comes to
> dealing with drivers, so I am happy to follow your lead on this one,
> but assuming drivers will "do a thing" seems mildly scary to me with
> limited driver experience.
> 
> My two goals with this series are:
>   1. Make it possible to set these values per NAPI
>   2. Unblock the IRQ suspension series by threading the suspend
>      parameter through the code path carved in this series
> 
> So, I'm happy to proceed with this series as you prefer whether
> that's documentation or "napi_storage"; I think you are probably the
> best person to answer this question :)

How do you feel about making this configuration opt-in / require driver
changes? What I'm thinking is that having the new "netif_napi_add()"
variant (or perhaps extending netif_napi_set_irq()) take an extra
"index" parameter would make the whole thing much simpler.

Index would basically be an integer 0..n, where n is the number of
IRQs configured for the driver. The index of a NAPI instance would
likely match the queue ID of the queue the NAPI serves.

We can then allocate an array of "napi_configs" in net_device -
like we allocate queues, the array size would be max(num_rx_queue,
num_tx_queues). We just need to store a couple of ints so it will
be tiny compared to queue structs, anyway.

The NAPI_SET netlink op can then work based on NAPI index rather 
than the ephemeral NAPI ID. It can apply the config to all live
NAPI instances with that index (of which there really should only 
be one, unless driver is mid-reconfiguration somehow but even that
won't cause issues, we can give multiple instances the same settings)
and also store the user config in the array in net_device.

When a new NAPI instance is associated with a NAPI index it should get
all the config associated with that index applied.

Thoughts? Does that make sense, and if so do you think it's an
over-complication?
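
To make the shape of this concrete, a rough sketch of the storage and the
opt-in helper (every name below is illustrative, not something this series
or the kernel defines today):

/* per-index persistent settings, illustrative */
struct napi_config {
        u64 gro_flush_timeout;
        u32 defer_hard_irqs;
};

/* in struct net_device: */
        struct napi_config *napi_config;        /* max(rx, tx) entries */

/* opt-in variant: the driver names the persistent slot ("NAPI index") */
static inline void netif_napi_add_config(struct net_device *dev,
                                         struct napi_struct *napi,
                                         int (*poll)(struct napi_struct *, int),
                                         unsigned int index)
{
        /* stored settings would be reapplied whenever the NAPI is (re)created */
        napi->config = &dev->napi_config[index];
        netif_napi_add(dev, napi, poll);
}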


* Re: [PATCH net-next 2/5] netdev-genl: Dump napi_defer_hard_irqs
  2024-08-30 20:31         ` Joe Damato
@ 2024-08-30 21:22           ` Jakub Kicinski
  0 siblings, 0 replies; 48+ messages in thread
From: Jakub Kicinski @ 2024-08-30 21:22 UTC (permalink / raw)
  To: Joe Damato
  Cc: netdev, edumazet, amritha.nambiar, sridhar.samudrala, sdf, bjorn,
	hch, willy, willemdebruijn.kernel, skhawaja, Martin Karsten,
	Donald Hunter, David S. Miller, Paolo Abeni,
	Jesper Dangaard Brouer, Xuan Zhuo, Daniel Jurgens, open list

On Fri, 30 Aug 2024 21:31:25 +0100 Joe Damato wrote:
> Is the overflow check in sysfs a fixes I send separately or can I
> sneak that into this series?

It does look kinda random in that patch; I'd send it separately


* Re: [PATCH net-next 5/5] netdev-genl: Support setting per-NAPI config values
  2024-08-30 21:22       ` Jakub Kicinski
@ 2024-08-31 17:27         ` Joe Damato
  2024-09-03  0:49           ` Jakub Kicinski
  2024-09-02 16:56         ` Joe Damato
  1 sibling, 1 reply; 48+ messages in thread
From: Joe Damato @ 2024-08-31 17:27 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: netdev, edumazet, amritha.nambiar, sridhar.samudrala, sdf, bjorn,
	hch, willy, willemdebruijn.kernel, skhawaja, Martin Karsten,
	Donald Hunter, David S. Miller, Paolo Abeni,
	Jesper Dangaard Brouer, Xuan Zhuo, Daniel Jurgens, open list

On Fri, Aug 30, 2024 at 02:22:35PM -0700, Jakub Kicinski wrote:
> On Fri, 30 Aug 2024 11:43:00 +0100 Joe Damato wrote:
> > On Thu, Aug 29, 2024 at 03:31:05PM -0700, Jakub Kicinski wrote:
> > > On Thu, 29 Aug 2024 13:12:01 +0000 Joe Damato wrote:  
> > > > +	napi = napi_by_id(napi_id);
> > > > +	if (napi)
> > > > +		err = netdev_nl_napi_set_config(napi, info);
> > > > +	else
> > > > +		err = -EINVAL;  
> > > 
> > > if (napi) {
> > > ...
> > > } else {
> > > 	NL_SET_BAD_ATTR(info->extack, info->attrs[NETDEV_A_NAPI_ID])
> > > 	err = -ENOENT;
> > > }  
> > 
> > Thanks, I'll make that change in the v2.
> > 
> > Should I send a Fixes for the same pattern in
> > netdev_nl_napi_get_doit ?
> 
> SG, standalone patch is good, FWIW, no need to add to the series.

Done. TBH: couldn't tell if it was a fixes for net or a net-next
thing.
 
> > > > +      doc: Set configurable NAPI instance settings.  
> > > 
> > > We should pause and think here how configuring NAPI params should
> > > behave. NAPI instances are ephemeral, if you close and open the
> > > device (or for some drivers change any BPF or ethtool setting)
> > > the NAPIs may get wiped and recreated, discarding all configuration.
> > > 
> > > This is not how the sysfs API behaves, the sysfs settings on the device
> > > survive close. It's (weirdly?) also not how queues behave, because we
> > > have struct netdev{_rx,}_queue to store stuff persistently. Even tho
> > > you'd think queues are as ephemeral as NAPIs if not more.
> > > 
> > > I guess we can either document this, and move on (which may be fine,
> > > you have more practical experience than me). Or we can add an internal
> > > concept of a "channel" (which perhaps maybe if you squint is what
> > > ethtool -l calls NAPIs?) or just "napi_storage" as an array inside
> > > net_device and store such config there. For simplicity of matching
> > > config to NAPIs we can assume drivers add NAPI instances in order. 
> > > If driver wants to do something more fancy we can add a variant of
> > > netif_napi_add() which specifies the channel/storage to use.
> > > 
> > > Thoughts? I may be overly sensitive to the ephemeral thing, maybe
> > > I work with unfortunate drivers...  
> > 
> > Thanks for pointing this out. I think this is an important case to
> > consider. Here's how I'm thinking about it.
> > 
> > There are two cases:
> > 
> > 1) sysfs setting is used by existing/legacy apps: If the NAPIs are
> > discarded and recreated, the code I added to netif_napi_add_weight
> > in patch 1 and 3 should take care of that case preserving how sysfs
> > works today, I believe. I think we are good on this case ?
> 
> Agreed.
> 
> > 2) apps using netlink to set various custom settings. This seems
> > like a case where a future extension can be made to add a notifier
> > for NAPI changes (like the netdevice notifier?).
> 
> Yes, the notifier may help, but it's a bit of a stop gap / fallback.
> 
> > If you think this is a good idea, then we'd do something like:
> >   1. Document that the NAPI settings are wiped when NAPIs are wiped
> >   2. In the future (not part of this series) a NAPI notifier is
> >      added
> >   3. User apps can then listen for NAPI create/delete events
> >      and update settings when a NAPI is created. It would be
> >      helpful, I think, for user apps to know about NAPI
> >      create/delete events in general because it means NAPI IDs are
> >      changing.
> > 
> > One could argue:
> > 
> >   When wiping/recreating a NAPI for an existing HW queue, that HW
> >   queue gets a new NAPI ID associated with it. User apps operating
> >   at this level probably care about NAPI IDs changing (as it affects
> >   epoll busy poll). Since the settings in this series are per-NAPI
> >   (and not per HW queue), the argument could be that user apps need
> >   to setup NAPIs when they are created and settings do not persist
> >   between NAPIs with different IDs even if associated with the same
> >   HW queue.
> 
> IDK if the fact that NAPI ID gets replaced was intentional in the first
> place. I would venture a guess that the person who added the IDs was
> working with NICs which have stable NAPI instances once the device is
> opened. This is, unfortunately, not universally the case.
> 
> I just poked at bnxt, mlx5 and fbnic and all of them reallocate NAPIs
> on an open device. Closer we get to queue API the more dynamic the whole
> setup will become (read: the more often reconfigurations will happen).
> 
> > Admittedly, from the perspective of a user it would be nice if a new
> > NAPI created for an existing HW queue retained the previous
> > settings so that I, as the user, can do less work.
> > 
> > But, what happens if a HW queue is destroyed and recreated? Will any
> > HW settings be retained? And does that have any influence on what we
> > do in software? See below.
> 
> Up to the driver, today. But settings we store in queue structs in 
> the core are not wiped.
> 
> > This part of your message:
> > 
> > > we can assume drivers add NAPI instances in order. If driver wants
> > > to do something more fancy we can add a variant of
> > > netif_napi_add() which specifies the channel/storage to use.  
> > 
> > assuming drivers will "do a thing", so to speak, makes me uneasy.
> 
> Yeah.. :(
> 
> > I started to wonder: how do drivers handle per-queue HW IRQ coalesce
> > settings when queue counts increase? It's a different, but adjacent
> > problem, I think.
> > 
> > I tried a couple experiments on mlx5 and got very strange results
> > suitable for their own thread and I didn't want to get this thread
> > too far off track.
> 
> Yes, but ethtool is an old shallow API from the times when semantics
> were simpler. It's precisely this mess which we try to avoid by storing
> more of the config in the core, in a consistent fashion.
> 
> > I think you have much more practical experience when it comes to
> > dealing with drivers, so I am happy to follow your lead on this one,
> > but assuming drivers will "do a thing" seems mildly scary to me with
> > limited driver experience.
> > 
> > My two goals with this series are:
> >   1. Make it possible to set these values per NAPI
> >   2. Unblock the IRQ suspension series by threading the suspend
> >      parameter through the code path carved in this series
> > 
> > So, I'm happy to proceed with this series as you prefer whether
> > that's documentation or "napi_storage"; I think you are probably the
> > best person to answer this question :)
> 
> How do you feel about making this configuration opt-in / require driver
> changes? What I'm thinking is that having the new "netif_napi_add()"
> variant (or perhaps extending netif_napi_set_irq()) to take an extra
> "index" parameter would make the whole thing much simpler.

I think if we are going to go this way, then opt-in is probably the
way to go. This series would include the necessary changes for mlx5,
in that case (because that's what I have access to) so that the new
variant has a user?

> Index would basically be an integer 0..n, where n is the number of
> IRQs configured for the driver. The index of a NAPI instance would
> likely match the queue ID of the queue the NAPI serves.
> 
> We can then allocate an array of "napi_configs" in net_device -
> like we allocate queues, the array size would be max(num_rx_queue,
> num_tx_queues). We just need to store a couple of ints so it will
> be tiny compared to queue structs, anyway.

I assume napi_storage exists for both combined RX/TX NAPIs (i.e.
drivers that multiplex RX/TX on a single NAPI like mlx5) as well
as drivers which create NAPIs that are RX or TX-only, right?

If so, it seems like we'd either need to:
  - Do something more complicated when computing how much NAPI
    storage to make, or
  - Provide a different path for drivers which don't multiplex and
    create some number of (for example) TX-only NAPIs ?

I guess I'm just imagining a weird case where a driver has 8 RX
queues but 64 TX queues. max of that is 64, so we'd be missing 8
napi_storage ?

Sorry, I'm probably just missing something about the implementation
details you summarized above.

> The NAPI_SET netlink op can then work based on NAPI index rather 
> than the ephemeral NAPI ID. It can apply the config to all live
> NAPI instances with that index (of which there really should only 
> be one, unless driver is mid-reconfiguration somehow but even that
> won't cause issues, we can give multiple instances the same settings)
> and also store the user config in the array in net_device.

I understand what you are proposing. I suppose napi-get could be
extended to include the NAPI index, too?

Then users could map queues to NAPI indexes (via NAPI ID)?

> When new NAPI instance is associate with a NAPI index it should get
> all the config associated with that index applied.
> 
> Thoughts? Does that makes sense, and if so do you think it's an
> over-complication?

It feels a bit tricky, to me, as it seems there are some edge cases
to be careful with (queue count change). I could probably give the
implementation a try and see where I end up.

Having these settings per-NAPI would be really useful and being able
to support IRQ suspension would be useful, too.

I think being thoughtful about how we get there is important; I'm a
little wary of getting side tracked, but I trust your judgement and
if you think this is worth exploring I'll think on it some more.

- Joe


* Re: [PATCH net-next 5/5] netdev-genl: Support setting per-NAPI config values
  2024-08-30 21:22       ` Jakub Kicinski
  2024-08-31 17:27         ` Joe Damato
@ 2024-09-02 16:56         ` Joe Damato
  2024-09-03  1:02           ` Jakub Kicinski
  2024-09-04 23:40           ` Stanislav Fomichev
  1 sibling, 2 replies; 48+ messages in thread
From: Joe Damato @ 2024-09-02 16:56 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: netdev, edumazet, amritha.nambiar, sridhar.samudrala, sdf, bjorn,
	hch, willy, willemdebruijn.kernel, skhawaja, Martin Karsten,
	Donald Hunter, David S. Miller, Paolo Abeni,
	Jesper Dangaard Brouer, Xuan Zhuo, Daniel Jurgens, open list

On Fri, Aug 30, 2024 at 02:22:35PM -0700, Jakub Kicinski wrote:
> On Fri, 30 Aug 2024 11:43:00 +0100 Joe Damato wrote:
> > On Thu, Aug 29, 2024 at 03:31:05PM -0700, Jakub Kicinski wrote:
> > > On Thu, 29 Aug 2024 13:12:01 +0000 Joe Damato wrote:  
> > > > +      doc: Set configurable NAPI instance settings.  
> > > 
> > > We should pause and think here how configuring NAPI params should
> > > behave. NAPI instances are ephemeral, if you close and open the
> > > device (or for some drivers change any BPF or ethtool setting)
> > > the NAPIs may get wiped and recreated, discarding all configuration.
> > > 
> > > This is not how the sysfs API behaves, the sysfs settings on the device
> > > survive close. It's (weirdly?) also not how queues behave, because we
> > > have struct netdev{_rx,}_queue to store stuff persistently. Even tho
> > > you'd think queues are as ephemeral as NAPIs if not more.
> > > 
> > > I guess we can either document this, and move on (which may be fine,
> > > you have more practical experience than me). Or we can add an internal
> > > concept of a "channel" (which perhaps maybe if you squint is what
> > > ethtool -l calls NAPIs?) or just "napi_storage" as an array inside
> > > net_device and store such config there. For simplicity of matching
> > > config to NAPIs we can assume drivers add NAPI instances in order. 
> > > If driver wants to do something more fancy we can add a variant of
> > > netif_napi_add() which specifies the channel/storage to use.
> > > 
> > > Thoughts? I may be overly sensitive to the ephemeral thing, maybe
> > > I work with unfortunate drivers...  
> > 
> > Thanks for pointing this out. I think this is an important case to
> > consider. Here's how I'm thinking about it.
> > 
> > There are two cases:
> > 
> > 1) sysfs setting is used by existing/legacy apps: If the NAPIs are
> > discarded and recreated, the code I added to netif_napi_add_weight
> > in patch 1 and 3 should take care of that case preserving how sysfs
> > works today, I believe. I think we are good on this case ?
> 
> Agreed.
> 
> > 2) apps using netlink to set various custom settings. This seems
> > like a case where a future extension can be made to add a notifier
> > for NAPI changes (like the netdevice notifier?).
> 
> Yes, the notifier may help, but it's a bit of a stop gap / fallback.
> 
> > If you think this is a good idea, then we'd do something like:
> >   1. Document that the NAPI settings are wiped when NAPIs are wiped
> >   2. In the future (not part of this series) a NAPI notifier is
> >      added
> >   3. User apps can then listen for NAPI create/delete events
> >      and update settings when a NAPI is created. It would be
> >      helpful, I think, for user apps to know about NAPI
> >      create/delete events in general because it means NAPI IDs are
> >      changing.
> > 
> > One could argue:
> > 
> >   When wiping/recreating a NAPI for an existing HW queue, that HW
> >   queue gets a new NAPI ID associated with it. User apps operating
> >   at this level probably care about NAPI IDs changing (as it affects
> >   epoll busy poll). Since the settings in this series are per-NAPI
> >   (and not per HW queue), the argument could be that user apps need
> >   to setup NAPIs when they are created and settings do not persist
> >   between NAPIs with different IDs even if associated with the same
> >   HW queue.
> 
> IDK if the fact that NAPI ID gets replaced was intentional in the first
> place. I would venture a guess that the person who added the IDs was
> working with NICs which have stable NAPI instances once the device is
> opened. This is, unfortunately, not universally the case.
> 
> I just poked at bnxt, mlx5 and fbnic and all of them reallocate NAPIs
> on an open device. Closer we get to queue API the more dynamic the whole
> setup will become (read: the more often reconfigurations will happen).
>

[...]

> > I think you have much more practical experience when it comes to
> > dealing with drivers, so I am happy to follow your lead on this one,
> > but assuming drivers will "do a thing" seems mildly scary to me with
> > limited driver experience.
> > 
> > My two goals with this series are:
> >   1. Make it possible to set these values per NAPI
> >   2. Unblock the IRQ suspension series by threading the suspend
> >      parameter through the code path carved in this series
> > 
> > So, I'm happy to proceed with this series as you prefer whether
> > that's documentation or "napi_storage"; I think you are probably the
> > best person to answer this question :)
> 
> How do you feel about making this configuration opt-in / require driver
> changes? What I'm thinking is that having the new "netif_napi_add()"
> variant (or perhaps extending netif_napi_set_irq()) to take an extra
> "index" parameter would make the whole thing much simpler.

What about extending netif_queue_set_napi instead? That function
takes a napi and a queue index.

Locally I kind of hacked up something simple that:
  - Allocates napi_storage in net_device in alloc_netdev_mqs
  - Modifies netif_queue_set_napi to:
     if (napi)
       napi->storage = dev->napi_storage[queue_index];

I think I'm still missing the bit about the
max(rx_queues,tx_queues), though :(
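
For context, the association helper mentioned above already carries both
a queue index and a queue type, roughly:

void netif_queue_set_napi(struct net_device *dev, unsigned int queue_index,
                          enum netdev_queue_type type,
                          struct napi_struct *napi);

which is also why the "more queues than NAPIs" concern comes up below:
several queue indices of either type can point at the same NAPI.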

> Index would basically be an integer 0..n, where n is the number of
> IRQs configured for the driver. The index of a NAPI instance would
> likely match the queue ID of the queue the NAPI serves.

Hmmm. I'm hesitant about the "number of IRQs" part. What if there
are NAPIs for which no IRQ is allocated ~someday~ ?

It seems like (I could totally be wrong) that netif_queue_set_napi
can be called and work and create the association even without an
IRQ allocated.

I guess the issue is mostly the queue index question above: combined
rx/tx vs drivers having different numbers of rx and tx queues.

> We can then allocate an array of "napi_configs" in net_device -
> like we allocate queues, the array size would be max(num_rx_queue,
> num_tx_queues). We just need to store a couple of ints so it will
> be tiny compared to queue structs, anyway.
> 
> The NAPI_SET netlink op can then work based on NAPI index rather 
> than the ephemeral NAPI ID. It can apply the config to all live
> NAPI instances with that index (of which there really should only 
> be one, unless driver is mid-reconfiguration somehow but even that
> won't cause issues, we can give multiple instances the same settings)
> and also store the user config in the array in net_device.
> 
> When new NAPI instance is associate with a NAPI index it should get
> all the config associated with that index applied.
> 
> Thoughts? Does that makes sense, and if so do you think it's an
> over-complication?

I think what you are proposing seems fine; I'm just working out the
implementation details and making sure I understand before sending
another revision.


* Re: [PATCH net-next 5/5] netdev-genl: Support setting per-NAPI config values
  2024-08-31 17:27         ` Joe Damato
@ 2024-09-03  0:49           ` Jakub Kicinski
  0 siblings, 0 replies; 48+ messages in thread
From: Jakub Kicinski @ 2024-09-03  0:49 UTC (permalink / raw)
  To: Joe Damato
  Cc: netdev, edumazet, amritha.nambiar, sridhar.samudrala, sdf, bjorn,
	hch, willy, willemdebruijn.kernel, skhawaja, Martin Karsten,
	Donald Hunter, David S. Miller, Paolo Abeni,
	Jesper Dangaard Brouer, Xuan Zhuo, Daniel Jurgens, open list

On Sat, 31 Aug 2024 18:27:45 +0100 Joe Damato wrote:
> > How do you feel about making this configuration opt-in / require driver
> > changes? What I'm thinking is that having the new "netif_napi_add()"
> > variant (or perhaps extending netif_napi_set_irq()) to take an extra
> > "index" parameter would make the whole thing much simpler.  
> 
> I think if we are going to go this way, then opt-in is probably the
> way to go. This series would include the necessary changes for mlx5,
> in that case (because that's what I have access to) so that the new
> variant has a user?

SG! FWIW for bnxt the "id" is struct bnxt_napi::index (I haven't looked
at bnxt before writing the suggestion :))

> > Index would basically be an integer 0..n, where n is the number of
> > IRQs configured for the driver. The index of a NAPI instance would
> > likely match the queue ID of the queue the NAPI serves.
> > 
> > We can then allocate an array of "napi_configs" in net_device -
> > like we allocate queues, the array size would be max(num_rx_queue,
> > num_tx_queues). We just need to store a couple of ints so it will
> > be tiny compared to queue structs, anyway.  
> 
> I assume napi_storage exists for both combined RX/TX NAPIs (i.e.
> drivers that multiplex RX/TX on a single NAPI like mlx5) as well
> as drivers which create NAPIs that are RX or TX-only, right?

Hm.

> If so, it seems like we'd either need to:
>   - Do something more complicated when computing how much NAPI
>     storage to make, or
>   - Provide a different path for drivers which don't multiplex and
>     create some number of (for example) TX-only NAPIs ?
> 
> I guess I'm just imagining a weird case where a driver has 8 RX
> queues but 64 TX queues. max of that is 64, so we'd be missing 8
> napi_storage ?
> 
> Sorry, I'm probably just missing something about the implementation
> details you summarized above.

I wouldn't worry about it. We can add a variant of alloc_netdev_mqs()
later which takes the NAPI count explicitly. For now we can simply
assume max(rx, tx) is good enough, and maybe add a WARN_ON_ONCE() to 
the set function to catch drivers which need something more complicated.

Modern NICs have far more queues than IRQs (~NAPIs).
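
A sketch of what that guard could look like in the set helper (names
reuse the illustrative napi_config storage sketched earlier in the
thread; nothing here is an existing kernel API):

static inline void netif_napi_set_index(struct net_device *dev,
                                        struct napi_struct *napi,
                                        unsigned int index)
{
        unsigned int slots = max(dev->num_rx_queues, dev->num_tx_queues);

        /* catch drivers whose NAPI count exceeds the max(rx, tx) assumption */
        if (WARN_ON_ONCE(index >= slots))
                return;

        napi->config = &dev->napi_config[index];
}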

> > The NAPI_SET netlink op can then work based on NAPI index rather 
> > than the ephemeral NAPI ID. It can apply the config to all live
> > NAPI instances with that index (of which there really should only 
> > be one, unless driver is mid-reconfiguration somehow but even that
> > won't cause issues, we can give multiple instances the same settings)
> > and also store the user config in the array in net_device.  
> 
> I understand what you are proposing. I suppose napi-get could be
> extended to include the NAPI index, too?

Yup!

> Then users could map queues to NAPI indexes to queues (via NAPI ID)?

Yes.

> > When new NAPI instance is associate with a NAPI index it should get
> > all the config associated with that index applied.
> > 
> > Thoughts? Does that makes sense, and if so do you think it's an
> > over-complication?  
> 
> It feels a bit tricky, to me, as it seems there are some edge cases
> to be careful with (queue count change). I could probably give the
> implementation a try and see where I end up.
> 
> Having these settings per-NAPI would be really useful and being able
> to support IRQ suspension would be useful, too.
> 
> I think being thoughtful about how we get there is important; I'm a
> little wary of getting side tracked, but I trust your judgement and
> if you think this is worth exploring I'll think on it some more.

I understand; we can abandon it if the implementation drags out due to
various nitpicks and back-and-forths. But I don't expect much of that
🤞️


* Re: [PATCH net-next 5/5] netdev-genl: Support setting per-NAPI config values
  2024-09-02 16:56         ` Joe Damato
@ 2024-09-03  1:02           ` Jakub Kicinski
  2024-09-03 19:04             ` Samiullah Khawaja
  2024-09-04 23:40           ` Stanislav Fomichev
  1 sibling, 1 reply; 48+ messages in thread
From: Jakub Kicinski @ 2024-09-03  1:02 UTC (permalink / raw)
  To: Joe Damato
  Cc: netdev, edumazet, amritha.nambiar, sridhar.samudrala, sdf, bjorn,
	hch, willy, willemdebruijn.kernel, skhawaja, Martin Karsten,
	Donald Hunter, David S. Miller, Paolo Abeni,
	Jesper Dangaard Brouer, Xuan Zhuo, Daniel Jurgens, open list

On Mon, 2 Sep 2024 18:56:07 +0200 Joe Damato wrote:
> > How do you feel about making this configuration opt-in / require driver
> > changes? What I'm thinking is that having the new "netif_napi_add()"
> > variant (or perhaps extending netif_napi_set_irq()) to take an extra
> > "index" parameter would make the whole thing much simpler.  
> 
> What about extending netif_queue_set_napi instead? That function
> takes a napi and a queue index.

This may become problematic, since there may be more queues than NAPIs,
either with multiple combined queues mapped to a single NAPI or with
separate Rx/Tx NAPIs.

No strong preference on which function we modify (netif_napi_add vs the
IRQ setting helper) but I think we need to take the index as an
explicit param, rather than try to guess it based on another ID. 
The queue ID will match 95% of the time, today. But Intel was
interested in "remapping" queues. And we all think about a queue API...
So using queue ID is going to cause problems down the road.

> Locally I kinda of hacked up something simple that:
>   - Allocates napi_storage in net_device in alloc_netdev_mqs
>   - Modifies netif_queue_set_napi to:
>      if (napi)
>        napi->storage = dev->napi_storage[queue_index];
> 
> I think I'm still missing the bit about the
> max(rx_queues,tx_queues), though :(

You mean whether it's enough to do

	napi_cnt = max(txqs, rxqs)

? Or some other question?

AFAIU mlx5 can only have as many NAPIs as it has IRQs (256?).
Look at ip -d link to see how many queues it has.
We could do txqs + rxqs to be safe, if you prefer, but it will 
waste a bit of memory.
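
In alloc_netdev_mqs() terms, that sizing could amount to something like
the below (a sketch; napi_config is the illustrative storage name used
earlier, and the error label mirrors the one already in that function):

        unsigned int napi_cnt = max(txqs, rxqs);

        dev->napi_config = kvcalloc(napi_cnt, sizeof(*dev->napi_config),
                                    GFP_KERNEL);
        if (!dev->napi_config)
                goto free_all;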

> > Index would basically be an integer 0..n, where n is the number of
> > IRQs configured for the driver. The index of a NAPI instance would
> > likely match the queue ID of the queue the NAPI serves.  
> 
> Hmmm. I'm hesitant about the "number of IRQs" part. What if there
> are NAPIs for which no IRQ is allocated ~someday~ ?

A device NAPI without an IRQ? Whoever does something of such silliness
will have to deal with consequences. I think it's unlikely.

> It seems like (I could totally be wrong) that netif_queue_set_napi
> can be called and work and create the association even without an
> IRQ allocated.
> 
> I guess the issue is mostly the queue index question above: combined
> rx/tx vs drivers having different numbers of rx and tx queues.

Yes, and in the future the ability to allocate more queues than NAPIs.
netif_queue_set_napi() was expected to cover such cases.

> > We can then allocate an array of "napi_configs" in net_device -
> > like we allocate queues, the array size would be max(num_rx_queue,
> > num_tx_queues). We just need to store a couple of ints so it will
> > be tiny compared to queue structs, anyway.
> > 
> > The NAPI_SET netlink op can then work based on NAPI index rather 
> > than the ephemeral NAPI ID. It can apply the config to all live
> > NAPI instances with that index (of which there really should only 
> > be one, unless driver is mid-reconfiguration somehow but even that
> > won't cause issues, we can give multiple instances the same settings)
> > and also store the user config in the array in net_device.
> > 
> > When new NAPI instance is associate with a NAPI index it should get
> > all the config associated with that index applied.
> > 
> > Thoughts? Does that makes sense, and if so do you think it's an
> > over-complication?  
> 
> I think what you are proposing seems fine; I'm just working out the
> implementation details and making sure I understand before sending
> another revision.

If you get stuck feel free to send a link to a GH or post an RFC.
I'm adding extra asks, so whatever form of review helps... :)


* Re: [PATCH net-next 5/5] netdev-genl: Support setting per-NAPI config values
  2024-09-03  1:02           ` Jakub Kicinski
@ 2024-09-03 19:04             ` Samiullah Khawaja
  2024-09-03 19:40               ` Jakub Kicinski
  0 siblings, 1 reply; 48+ messages in thread
From: Samiullah Khawaja @ 2024-09-03 19:04 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Joe Damato, netdev, edumazet, amritha.nambiar, sridhar.samudrala,
	sdf, bjorn, hch, willy, willemdebruijn.kernel, Martin Karsten,
	Donald Hunter, David S. Miller, Paolo Abeni,
	Jesper Dangaard Brouer, Xuan Zhuo, Daniel Jurgens, open list

This is great, Joe.

On Mon, Sep 2, 2024 at 6:02 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Mon, 2 Sep 2024 18:56:07 +0200 Joe Damato wrote:
> > > How do you feel about making this configuration opt-in / require driver
> > > changes? What I'm thinking is that having the new "netif_napi_add()"
> > > variant (or perhaps extending netif_napi_set_irq()) to take an extra
> > > "index" parameter would make the whole thing much simpler.
> >
> > What about extending netif_queue_set_napi instead? That function
> > takes a napi and a queue index.
>
> This may become problematic, since there may be more queues than NAPIs.
> Either with multiple combined queues mapped to a single NAPI, or having
> separate Rx/Tx NAPIs.
>
> No strong preference on which function we modify (netif_napi_add vs the
> IRQ setting helper) but I think we need to take the index as an
> explicit param, rather than try to guess it based on another ID.
> The queue ID will match 95% of the time, today. But Intel was
> interested in "remapping" queues. And we all think about a queue API...
> So using queue ID is going to cause problems down the road.
Do we need a queue-to-napi association to set/persist napi
configurations? Could a new index param be added to netif_napi_add
to persist the configurations in napi_storage? I guess the problem
would be the size of napi_storage.

Also wondering if, for some use cases, persistence would be problematic
when the napis are recreated, since the new napi instances might not
represent the same context. For example, if I resize the dev from 16
rx/tx to 8 rx/tx queues, the napi index that was used by a TX queue
now polls an RX queue.
>
> > Locally I kinda of hacked up something simple that:
> >   - Allocates napi_storage in net_device in alloc_netdev_mqs
> >   - Modifies netif_queue_set_napi to:
> >      if (napi)
> >        napi->storage = dev->napi_storage[queue_index];
> >
> > I think I'm still missing the bit about the
> > max(rx_queues,tx_queues), though :(
>
> You mean whether it's enough to do
>
>         napi_cnt = max(txqs, rxqs)
>
> ? Or some other question?
>
> AFAIU mlx5 can only have as many NAPIs as it has IRQs (256?).
> Look at ip -d link to see how many queues it has.
> We could do txqs + rxqs to be safe, if you prefer, but it will
> waste a bit of memory.
>
> > > Index would basically be an integer 0..n, where n is the number of
> > > IRQs configured for the driver. The index of a NAPI instance would
> > > likely match the queue ID of the queue the NAPI serves.
> >
> > Hmmm. I'm hesitant about the "number of IRQs" part. What if there
> > are NAPIs for which no IRQ is allocated ~someday~ ?
>
> A device NAPI without an IRQ? Whoever does something of such silliness
> will have to deal with consequences. I think it's unlikely.
>
> > It seems like (I could totally be wrong) that netif_queue_set_napi
> > can be called and work and create the association even without an
> > IRQ allocated.
> >
> > I guess the issue is mostly the queue index question above: combined
> > rx/tx vs drivers having different numbers of rx and tx queues.
>
> Yes, and in the future the ability to allocate more queues than NAPIs.
> netif_queue_set_napi() was expected to cover such cases.
>
> > > We can then allocate an array of "napi_configs" in net_device -
> > > like we allocate queues, the array size would be max(num_rx_queue,
> > > num_tx_queues). We just need to store a couple of ints so it will
> > > be tiny compared to queue structs, anyway.
> > >
> > > The NAPI_SET netlink op can then work based on NAPI index rather
> > > than the ephemeral NAPI ID. It can apply the config to all live
> > > NAPI instances with that index (of which there really should only
> > > be one, unless driver is mid-reconfiguration somehow but even that
> > > won't cause issues, we can give multiple instances the same settings)
> > > and also store the user config in the array in net_device.
> > >
> > > When new NAPI instance is associate with a NAPI index it should get
> > > all the config associated with that index applied.
> > >
> > > Thoughts? Does that makes sense, and if so do you think it's an
> > > over-complication?
> >
> > I think what you are proposing seems fine; I'm just working out the
> > implementation details and making sure I understand before sending
> > another revision.
>
> If you get stuck feel free to send a link to a GH or post RFC.
> I'm adding extra asks so whether form of review helps.. :)


* Re: [PATCH net-next 5/5] netdev-genl: Support setting per-NAPI config values
  2024-09-03 19:04             ` Samiullah Khawaja
@ 2024-09-03 19:40               ` Jakub Kicinski
  2024-09-03 21:58                 ` Samiullah Khawaja
  2024-09-08 15:54                 ` Joe Damato
  0 siblings, 2 replies; 48+ messages in thread
From: Jakub Kicinski @ 2024-09-03 19:40 UTC (permalink / raw)
  To: Samiullah Khawaja
  Cc: Joe Damato, netdev, edumazet, amritha.nambiar, sridhar.samudrala,
	sdf, bjorn, hch, willy, willemdebruijn.kernel, Martin Karsten,
	Donald Hunter, David S. Miller, Paolo Abeni,
	Jesper Dangaard Brouer, Xuan Zhuo, Daniel Jurgens, open list

On Tue, 3 Sep 2024 12:04:52 -0700 Samiullah Khawaja wrote:
> Do we need a queue to napi association to set/persist napi
> configurations? 

I'm afraid zero-copy schemes will make multiple queues per NAPI more
and more common, so pretending the NAPI params (related to polling)
are per queue will soon become highly problematic.

> Can a new index param be added to the netif_napi_add
> and persist the configurations in napi_storage.

That'd be my (weak) preference.

> I guess the problem would be the size of napi_storage.

I don't think so; we're talking about 16B per NAPI, while
struct netdev_queue is 320B and struct netdev_rx_queue is 192B.
NAPI storage is a rounding error next to those :S

> Also wondering if for some use case persistence would be problematic
> when the napis are recreated, since the new napi instances might not
> represent the same context? For example If I resize the dev from 16
> rx/tx to 8 rx/tx queues and the napi index that was used by TX queue,
> now polls RX queue.

We can clear the config when NAPI is activated (ethtool -L /
set-channels). That seems like a good idea.

The distinction between Rx and Tx NAPIs is a bit more tricky, tho.
When^w If we can dynamically create Rx queues one day, a NAPI may 
start out as a Tx NAPI and become a combined one when Rx queue is 
added to it.

Maybe it's enough to document how rings are distributed to NAPIs?

First set of NAPIs should get allocated to the combined channels,
then for remaining rx- and tx-only NAPIs they should be interleaved
starting with rx?

Example, asymmetric config: combined + some extra tx:

    combined        tx
 [0..#combined-1] [#combined..#combined+#tx-1]

Split rx / tx - interleave:

 [0 rx0] [1 tx0] [2 rx1] [3 tx1] [4 rx2] [5 tx2] ...

This would limit the churn when changing channel counts.
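
For the pure split rx/tx layout above, the index rule boils down to
something like this (illustrative only; the combined + extra-tx case
follows the first layout instead):

/* rx_i -> 2 * i, tx_i -> 2 * i + 1, when there are no combined channels */
static inline unsigned int napi_index_split(bool is_rx, unsigned int i)
{
        return 2 * i + (is_rx ? 0 : 1);
}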


* Re: [PATCH net-next 5/5] netdev-genl: Support setting per-NAPI config values
  2024-09-03 19:40               ` Jakub Kicinski
@ 2024-09-03 21:58                 ` Samiullah Khawaja
  2024-09-05  9:20                   ` Joe Damato
  2024-09-08 15:54                 ` Joe Damato
  1 sibling, 1 reply; 48+ messages in thread
From: Samiullah Khawaja @ 2024-09-03 21:58 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Joe Damato, netdev, edumazet, amritha.nambiar, sridhar.samudrala,
	sdf, bjorn, hch, willy, willemdebruijn.kernel, Martin Karsten,
	Donald Hunter, David S. Miller, Paolo Abeni,
	Jesper Dangaard Brouer, Xuan Zhuo, Daniel Jurgens, open list

On Tue, Sep 3, 2024 at 12:40 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Tue, 3 Sep 2024 12:04:52 -0700 Samiullah Khawaja wrote:
> > Do we need a queue to napi association to set/persist napi
> > configurations?
>
> I'm afraid zero-copy schemes will make multiple queues per NAPI more
> and more common, so pretending the NAPI params (related to polling)
> are pre queue will soon become highly problematic.
Agreed.
>
> > Can a new index param be added to the netif_napi_add
> > and persist the configurations in napi_storage.
>
> That'd be my (weak) preference.
>
> > I guess the problem would be the size of napi_storage.
>
> I don't think so, we're talking about 16B per NAPI,
> struct netdev_queue is 320B, struct netdev_rx_queue is 192B.
> NAPI storage is rounding error next to those :S
Oh, I am sorry; I was actually referring to the problem of figuring out
the count of the napi_storage array.
>
> > Also wondering if for some use case persistence would be problematic
> > when the napis are recreated, since the new napi instances might not
> > represent the same context? For example If I resize the dev from 16
> > rx/tx to 8 rx/tx queues and the napi index that was used by TX queue,
> > now polls RX queue.
>
> We can clear the config when NAPI is activated (ethtool -L /
> set-channels). That seems like a good idea.
That sounds good.
>
> The distinction between Rx and Tx NAPIs is a bit more tricky, tho.
> When^w If we can dynamically create Rx queues one day, a NAPI may
> start out as a Tx NAPI and become a combined one when Rx queue is
> added to it.
>
> Maybe it's enough to document how rings are distributed to NAPIs?
>
> First set of NAPIs should get allocated to the combined channels,
> then for remaining rx- and tx-only NAPIs they should be interleaved
> starting with rx?
>
> Example, asymmetric config: combined + some extra tx:
>
>     combined        tx
>  [0..#combined-1] [#combined..#combined+#tx-1]
>
> Split rx / tx - interleave:
>
>  [0 rx0] [1 tx0] [2 rx1] [3 tx1] [4 rx2] [5 tx2] ...
>
> This would limit the churn when changing channel counts.
I think this is good. The queue-get netlink dump does provide details
of all the queues in a dev. It also provides a napi-id if the driver
has set it (only a few drivers set this). So basically a busy-poll
application would look at the queue type and apply configurations to
the relevant napi based on the documentation above (if napi-id is not
set on the queue)?


* Re: [PATCH net-next 5/5] netdev-genl: Support setting per-NAPI config values
  2024-09-02 16:56         ` Joe Damato
  2024-09-03  1:02           ` Jakub Kicinski
@ 2024-09-04 23:40           ` Stanislav Fomichev
  2024-09-04 23:54             ` Jakub Kicinski
  2024-09-05  9:30             ` Joe Damato
  1 sibling, 2 replies; 48+ messages in thread
From: Stanislav Fomichev @ 2024-09-04 23:40 UTC (permalink / raw)
  To: Joe Damato, Jakub Kicinski, netdev, edumazet, amritha.nambiar,
	sridhar.samudrala, bjorn, hch, willy, willemdebruijn.kernel,
	skhawaja, Martin Karsten, Donald Hunter, David S. Miller,
	Paolo Abeni, Jesper Dangaard Brouer, Xuan Zhuo, Daniel Jurgens,
	open list

On 09/02, Joe Damato wrote:
> On Fri, Aug 30, 2024 at 02:22:35PM -0700, Jakub Kicinski wrote:
> > On Fri, 30 Aug 2024 11:43:00 +0100 Joe Damato wrote:
> > > On Thu, Aug 29, 2024 at 03:31:05PM -0700, Jakub Kicinski wrote:
> > > > On Thu, 29 Aug 2024 13:12:01 +0000 Joe Damato wrote:  
> > > > > +      doc: Set configurable NAPI instance settings.  
> > > > 
> > > > We should pause and think here how configuring NAPI params should
> > > > behave. NAPI instances are ephemeral, if you close and open the
> > > > device (or for some drivers change any BPF or ethtool setting)
> > > > the NAPIs may get wiped and recreated, discarding all configuration.
> > > > 
> > > > This is not how the sysfs API behaves, the sysfs settings on the device
> > > > survive close. It's (weirdly?) also not how queues behave, because we
> > > > have struct netdev{_rx,}_queue to store stuff persistently. Even tho
> > > > you'd think queues are as ephemeral as NAPIs if not more.
> > > > 
> > > > I guess we can either document this, and move on (which may be fine,
> > > > you have more practical experience than me). Or we can add an internal
> > > > concept of a "channel" (which perhaps maybe if you squint is what
> > > > ethtool -l calls NAPIs?) or just "napi_storage" as an array inside
> > > > net_device and store such config there. For simplicity of matching
> > > > config to NAPIs we can assume drivers add NAPI instances in order. 
> > > > If driver wants to do something more fancy we can add a variant of
> > > > netif_napi_add() which specifies the channel/storage to use.
> > > > 
> > > > Thoughts? I may be overly sensitive to the ephemeral thing, maybe
> > > > I work with unfortunate drivers...  
> > > 
> > > Thanks for pointing this out. I think this is an important case to
> > > consider. Here's how I'm thinking about it.
> > > 
> > > There are two cases:
> > > 
> > > 1) sysfs setting is used by existing/legacy apps: If the NAPIs are
> > > discarded and recreated, the code I added to netif_napi_add_weight
> > > in patch 1 and 3 should take care of that case preserving how sysfs
> > > works today, I believe. I think we are good on this case ?
> > 
> > Agreed.
> > 
> > > 2) apps using netlink to set various custom settings. This seems
> > > like a case where a future extension can be made to add a notifier
> > > for NAPI changes (like the netdevice notifier?).
> > 
> > Yes, the notifier may help, but it's a bit of a stop gap / fallback.
> > 
> > > If you think this is a good idea, then we'd do something like:
> > >   1. Document that the NAPI settings are wiped when NAPIs are wiped
> > >   2. In the future (not part of this series) a NAPI notifier is
> > >      added
> > >   3. User apps can then listen for NAPI create/delete events
> > >      and update settings when a NAPI is created. It would be
> > >      helpful, I think, for user apps to know about NAPI
> > >      create/delete events in general because it means NAPI IDs are
> > >      changing.
> > > 
> > > One could argue:
> > > 
> > >   When wiping/recreating a NAPI for an existing HW queue, that HW
> > >   queue gets a new NAPI ID associated with it. User apps operating
> > >   at this level probably care about NAPI IDs changing (as it affects
> > >   epoll busy poll). Since the settings in this series are per-NAPI
> > >   (and not per HW queue), the argument could be that user apps need
> > >   to setup NAPIs when they are created and settings do not persist
> > >   between NAPIs with different IDs even if associated with the same
> > >   HW queue.
> > 
> > IDK if the fact that NAPI ID gets replaced was intentional in the first
> > place. I would venture a guess that the person who added the IDs was
> > working with NICs which have stable NAPI instances once the device is
> > opened. This is, unfortunately, not universally the case.
> > 
> > I just poked at bnxt, mlx5 and fbnic and all of them reallocate NAPIs
> > on an open device. Closer we get to queue API the more dynamic the whole
> > setup will become (read: the more often reconfigurations will happen).
> >
> 
> [...]
> 
> > > I think you have much more practical experience when it comes to
> > > dealing with drivers, so I am happy to follow your lead on this one,
> > > but assuming drivers will "do a thing" seems mildly scary to me with
> > > limited driver experience.
> > > 
> > > My two goals with this series are:
> > >   1. Make it possible to set these values per NAPI
> > >   2. Unblock the IRQ suspension series by threading the suspend
> > >      parameter through the code path carved in this series
> > > 
> > > So, I'm happy to proceed with this series as you prefer whether
> > > that's documentation or "napi_storage"; I think you are probably the
> > > best person to answer this question :)
> > 
> > How do you feel about making this configuration opt-in / require driver
> > changes? What I'm thinking is that having the new "netif_napi_add()"
> > variant (or perhaps extending netif_napi_set_irq()) to take an extra
> > "index" parameter would make the whole thing much simpler.
> 
> What about extending netif_queue_set_napi instead? That function
> takes a napi and a queue index.
> 
> Locally I kinda of hacked up something simple that:
>   - Allocates napi_storage in net_device in alloc_netdev_mqs
>   - Modifies netif_queue_set_napi to:
>      if (napi)
>        napi->storage = dev->napi_storage[queue_index];
> 
> I think I'm still missing the bit about the
> max(rx_queues,tx_queues), though :(
> 
> > Index would basically be an integer 0..n, where n is the number of
> > IRQs configured for the driver. The index of a NAPI instance would
> > likely match the queue ID of the queue the NAPI serves.
> 
> Hmmm. I'm hesitant about the "number of IRQs" part. What if there
> are NAPIs for which no IRQ is allocated ~someday~ ?
> 
> It seems like (I could totally be wrong) that netif_queue_set_napi
> can be called and work and create the association even without an
> IRQ allocated.
> 
> I guess the issue is mostly the queue index question above: combined
> rx/tx vs drivers having different numbers of rx and tx queues.
> 
> > We can then allocate an array of "napi_configs" in net_device -
> > like we allocate queues, the array size would be max(num_rx_queue,
> > num_tx_queues). We just need to store a couple of ints so it will
> > be tiny compared to queue structs, anyway.
> > 
> > The NAPI_SET netlink op can then work based on NAPI index rather 
> > than the ephemeral NAPI ID. It can apply the config to all live
> > NAPI instances with that index (of which there really should only 
> > be one, unless driver is mid-reconfiguration somehow but even that
> > won't cause issues, we can give multiple instances the same settings)
> > and also store the user config in the array in net_device.
> > 
> > When new NAPI instance is associate with a NAPI index it should get
> > all the config associated with that index applied.
> > 
> > Thoughts? Does that makes sense, and if so do you think it's an
> > over-complication?
> 
> I think what you are proposing seems fine; I'm just working out the
> implementation details and making sure I understand before sending
> another revision.

What if instead of an extra storage index in UAPI, we make napi_id persistent?
Then we can keep using napi_id as a user-facing number for the configuration.

Having a stable napi_id would also be super useful for the epoll setup so you
don't have to match old/invalid ids to the new ones on device reset.

In the code, we can keep the same idea with napi_storage in netdev and
ask drivers to provide storage id, but keep that id internal.

The only complication with that is napi_hash_add/napi_hash_del that
happen in netif_napi_add_weight. So for the devices that allocate
new napi before removing the old ones (most devices?), we'd have to add
some new netif_napi_takeover(old_napi, new_napi) to remove the
old napi_id from the hash and reuse it in the new one.

So for mlx5, the flow would look like the following:

- mlx5e_safe_switch_params
  - mlx5e_open_channels
    - netif_napi_add(new_napi)
      - adds napi with 'ephemeral' napi id
  - mlx5e_switch_priv_channels
    - mlx5e_deactivate_priv_channels
      - napi_disable(old_napi)
      - netif_napi_del(old_napi) - this frees the old napi_id
  - mlx5e_activate_priv_channels
    - mlx5e_activate_channels
      - mlx5e_activate_channel
        - netif_napi_takeover(old_napi is gone, so probably take id from napi_storage?)
	  - if napi is not hashed - safe to reuse?
	- napi_enable

This is a bit ugly because we still have random napi ids during reset, but
is not super complicated implementation-wise. We can eventually improve
the above by splitting netif_napi_add_weight into two steps: allocate and
activate (to do the napi_id allocation & hashing). Thoughts?
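
To make the takeover bit above a little more concrete, I'm picturing
something like this (very rough sketch: struct napi_storage,
dev->napi_storage[] and napi_hash_add_with_id() don't exist, the names
are made up, and this would live next to the hash helpers in
net/core/dev.c):

static void netif_napi_takeover(struct napi_struct *napi, unsigned int index)
{
	struct napi_storage *ns = &napi->dev->napi_storage[index];

	if (!ns->napi_id) {
		/* first use of this slot: persist the current id */
		ns->napi_id = napi->napi_id;
		return;
	}

	if (napi->napi_id == ns->napi_id)
		return;

	/* drop the ephemeral id picked by netif_napi_add() ... */
	napi_hash_del(napi);
	/* ... and re-hash under the persistent id from the slot */
	napi_hash_add_with_id(napi, ns->napi_id);
}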

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH net-next 5/5] netdev-genl: Support setting per-NAPI config values
  2024-09-04 23:40           ` Stanislav Fomichev
@ 2024-09-04 23:54             ` Jakub Kicinski
  2024-09-05  9:32               ` Joe Damato
  2024-09-08 15:57               ` Joe Damato
  2024-09-05  9:30             ` Joe Damato
  1 sibling, 2 replies; 48+ messages in thread
From: Jakub Kicinski @ 2024-09-04 23:54 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Joe Damato, netdev, edumazet, amritha.nambiar, sridhar.samudrala,
	bjorn, hch, willy, willemdebruijn.kernel, skhawaja,
	Martin Karsten, Donald Hunter, David S. Miller, Paolo Abeni,
	Jesper Dangaard Brouer, Xuan Zhuo, Daniel Jurgens, open list

On Wed, 4 Sep 2024 16:40:41 -0700 Stanislav Fomichev wrote:
> > I think what you are proposing seems fine; I'm just working out the
> > implementation details and making sure I understand before sending
> > another revision.  
> 
> What if instead of an extra storage index in UAPI, we make napi_id persistent?
> Then we can keep using napi_id as a user-facing number for the configuration.
> 
> Having a stable napi_id would also be super useful for the epoll setup so you
> don't have to match old/invalid ids to the new ones on device reset.

that'd be nice, initially I thought that we have some drivers that have
multiple instances of NAPI enabled for a single "index", but I don't
see such drivers now.

> In the code, we can keep the same idea with napi_storage in netdev and
> ask drivers to provide storage id, but keep that id internal.
> 
> The only complication with that is napi_hash_add/napi_hash_del that
> happen in netif_napi_add_weight. So for the devices that allocate
> new napi before removing the old ones (most devices?), we'd have to add
> some new netif_napi_takeover(old_napi, new_napi) to remove the
> old napi_id from the hash and reuse it in the new one.
> 
> So for mlx5, the flow would look like the following:
> 
> - mlx5e_safe_switch_params
>   - mlx5e_open_channels
>     - netif_napi_add(new_napi)
>       - adds napi with 'ephemeral' napi id
>   - mlx5e_switch_priv_channels
>     - mlx5e_deactivate_priv_channels
>       - napi_disable(old_napi)
>       - netif_napi_del(old_napi) - this frees the old napi_id
>   - mlx5e_activate_priv_channels
>     - mlx5e_activate_channels
>       - mlx5e_activate_channel
>         - netif_napi_takeover(old_napi is gone, so probably take id from napi_storage?)
> 	  - if napi is not hashed - safe to reuse?
> 	- napi_enable
> 
> This is a bit ugly because we still have random napi ids during reset, but
> is not super complicated implementation-wise. We can eventually improve
> the above by splitting netif_napi_add_weight into two steps: allocate and
> activate (to do the napi_id allocation & hashing). Thoughts?

The "takeover" would be problematic for drivers which free old NAPI
before allocating new one (bnxt?). But splitting the two steps sounds
pretty clean. We can add a helper to mark NAPI as "driver will
explicitly list/hash later", and have the driver call a new helper
which takes storage ID and lists the NAPI in the hash.
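
From the driver side that could end up looking roughly like this
(names are invented here purely to illustrate the split, nothing below
exists):

	/* alloc path: NAPI is created but not hashed, no id visible yet */
	netif_napi_add_nohash(dev, &c->napi, poll);

	/* activate path: take the persistent id from storage slot c->ix,
	 * hash the NAPI, then enable it as usual
	 */
	napi_hash_with_storage(&c->napi, c->ix);
	napi_enable(&c->napi);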

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH net-next 5/5] netdev-genl: Support setting per-NAPI config values
  2024-09-03 21:58                 ` Samiullah Khawaja
@ 2024-09-05  9:20                   ` Joe Damato
  0 siblings, 0 replies; 48+ messages in thread
From: Joe Damato @ 2024-09-05  9:20 UTC (permalink / raw)
  To: Samiullah Khawaja
  Cc: Jakub Kicinski, netdev, edumazet, amritha.nambiar,
	sridhar.samudrala, sdf, bjorn, hch, willy, willemdebruijn.kernel,
	Martin Karsten, Donald Hunter, David S. Miller, Paolo Abeni,
	Jesper Dangaard Brouer, Xuan Zhuo, Daniel Jurgens, open list

On Tue, Sep 03, 2024 at 02:58:14PM -0700, Samiullah Khawaja wrote:
> On Tue, Sep 3, 2024 at 12:40 PM Jakub Kicinski <kuba@kernel.org> wrote:
> >
> > On Tue, 3 Sep 2024 12:04:52 -0700 Samiullah Khawaja wrote:
> > > Do we need a queue to napi association to set/persist napi
> > > configurations?
> >
> > I'm afraid zero-copy schemes will make multiple queues per NAPI more
> > and more common, so pretending the NAPI params (related to polling)
> > are pre queue will soon become highly problematic.
> Agreed.
> >
> > > Can a new index param be added to the netif_napi_add
> > > and persist the configurations in napi_storage.
> >
> > That'd be my (weak) preference.
> >
> > > I guess the problem would be the size of napi_storage.
> >
> > I don't think so, we're talking about 16B per NAPI,
> > struct netdev_queue is 320B, struct netdev_rx_queue is 192B.
> > NAPI storage is rounding error next to those :S
> Oh, I am sorry I was actually referring to the problem of figuring out
> the count of the napi_storage array.
> >
> > > Also wondering if for some use case persistence would be problematic
> > > when the napis are recreated, since the new napi instances might not
> > > represent the same context? For example If I resize the dev from 16
> > > rx/tx to 8 rx/tx queues and the napi index that was used by TX queue,
> > > now polls RX queue.
> >
> > We can clear the config when NAPI is activated (ethtool -L /
> > set-channels). That seems like a good idea.
> That sounds good.
> >
> > The distinction between Rx and Tx NAPIs is a bit more tricky, tho.
> > When^w If we can dynamically create Rx queues one day, a NAPI may
> > start out as a Tx NAPI and become a combined one when Rx queue is
> > added to it.
> >
> > Maybe it's enough to document how rings are distributed to NAPIs?
> >
> > First set of NAPIs should get allocated to the combined channels,
> > then for remaining rx- and tx-only NAPIs they should be interleaved
> > starting with rx?
> >
> > Example, asymmetric config: combined + some extra tx:
> >
> >     combined        tx
> >  [0..#combined-1] [#combined..#combined+#tx-1]
> >
> > Split rx / tx - interleave:
> >
> >  [0 rx0] [1 tx0] [2 rx1] [3 tx1] [4 rx2] [5 tx2] ...
> >
> > This would limit the churn when changing channel counts.
> I think this is good. The queue-get netlink dump does provide details
> of all the queues in a dev. It also provides a napi-id if the driver
> has set it (only a few drivers set this).

This is true, but there are several drivers that do set it, and IMHO
extending the existing drivers to support this can be done. I have
been adding "nits" in reviews of new drivers, asking the author(s)
to consider adding support for the API.

Not sure which driver you are using, but I can help you add support
for the API if it is needed.

> So basically a busy poll application would look at the queue type
> and apply configurations on the relevant napi based on the
> documentation above (if napi-id is not set on the queue)?

That was my plan for my user app based on the conversation so far.
At startup, the app reads a config with a list of ifindexes it will
bind to for incoming connections, then gets the NAPI IDs via
netlink and sets the per-NAPI params via netlink as well.

Haven't implemented this yet in the user app, but that's the
direction I am planning to go with this all.
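
Concretely I'd expect it to be the same ynl plumbing as in the cover
letter, just using the new napi-set op from this series, roughly (values
are only illustrative, and I'm assuming napi-set takes the same 'id',
'gro-flush-timeout' and 'defer-hard-irqs' attribute names that napi-get
reports):

$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
                         --do napi-set \
                         --json='{"id": 914, "gro-flush-timeout": 20000, "defer-hard-irqs": 100}'

repeated for each NAPI ID the app gets back from the napi-get dump.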

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH net-next 5/5] netdev-genl: Support setting per-NAPI config values
  2024-09-04 23:40           ` Stanislav Fomichev
  2024-09-04 23:54             ` Jakub Kicinski
@ 2024-09-05  9:30             ` Joe Damato
  2024-09-05 16:56               ` Stanislav Fomichev
  1 sibling, 1 reply; 48+ messages in thread
From: Joe Damato @ 2024-09-05  9:30 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Jakub Kicinski, netdev, edumazet, amritha.nambiar,
	sridhar.samudrala, bjorn, hch, willy, willemdebruijn.kernel,
	skhawaja, Martin Karsten, Donald Hunter, David S. Miller,
	Paolo Abeni, Jesper Dangaard Brouer, Xuan Zhuo, Daniel Jurgens,
	open list

On Wed, Sep 04, 2024 at 04:40:41PM -0700, Stanislav Fomichev wrote:
> On 09/02, Joe Damato wrote:
> > On Fri, Aug 30, 2024 at 02:22:35PM -0700, Jakub Kicinski wrote:
> > > On Fri, 30 Aug 2024 11:43:00 +0100 Joe Damato wrote:
> > > > On Thu, Aug 29, 2024 at 03:31:05PM -0700, Jakub Kicinski wrote:
> > > > > On Thu, 29 Aug 2024 13:12:01 +0000 Joe Damato wrote:  
> > > > > > +      doc: Set configurable NAPI instance settings.  
> > > > > 
> > > > > We should pause and think here how configuring NAPI params should
> > > > > behave. NAPI instances are ephemeral, if you close and open the
> > > > > device (or for some drivers change any BPF or ethtool setting)
> > > > > the NAPIs may get wiped and recreated, discarding all configuration.
> > > > > 
> > > > > This is not how the sysfs API behaves, the sysfs settings on the device
> > > > > survive close. It's (weirdly?) also not how queues behave, because we
> > > > > have struct netdev{_rx,}_queue to store stuff persistently. Even tho
> > > > > you'd think queues are as ephemeral as NAPIs if not more.
> > > > > 
> > > > > I guess we can either document this, and move on (which may be fine,
> > > > > you have more practical experience than me). Or we can add an internal
> > > > > concept of a "channel" (which perhaps maybe if you squint is what
> > > > > ethtool -l calls NAPIs?) or just "napi_storage" as an array inside
> > > > > net_device and store such config there. For simplicity of matching
> > > > > config to NAPIs we can assume drivers add NAPI instances in order. 
> > > > > If driver wants to do something more fancy we can add a variant of
> > > > > netif_napi_add() which specifies the channel/storage to use.
> > > > > 
> > > > > Thoughts? I may be overly sensitive to the ephemeral thing, maybe
> > > > > I work with unfortunate drivers...  
> > > > 
> > > > Thanks for pointing this out. I think this is an important case to
> > > > consider. Here's how I'm thinking about it.
> > > > 
> > > > There are two cases:
> > > > 
> > > > 1) sysfs setting is used by existing/legacy apps: If the NAPIs are
> > > > discarded and recreated, the code I added to netif_napi_add_weight
> > > > in patch 1 and 3 should take care of that case preserving how sysfs
> > > > works today, I believe. I think we are good on this case ?
> > > 
> > > Agreed.
> > > 
> > > > 2) apps using netlink to set various custom settings. This seems
> > > > like a case where a future extension can be made to add a notifier
> > > > for NAPI changes (like the netdevice notifier?).
> > > 
> > > Yes, the notifier may help, but it's a bit of a stop gap / fallback.
> > > 
> > > > If you think this is a good idea, then we'd do something like:
> > > >   1. Document that the NAPI settings are wiped when NAPIs are wiped
> > > >   2. In the future (not part of this series) a NAPI notifier is
> > > >      added
> > > >   3. User apps can then listen for NAPI create/delete events
> > > >      and update settings when a NAPI is created. It would be
> > > >      helpful, I think, for user apps to know about NAPI
> > > >      create/delete events in general because it means NAPI IDs are
> > > >      changing.
> > > > 
> > > > One could argue:
> > > > 
> > > >   When wiping/recreating a NAPI for an existing HW queue, that HW
> > > >   queue gets a new NAPI ID associated with it. User apps operating
> > > >   at this level probably care about NAPI IDs changing (as it affects
> > > >   epoll busy poll). Since the settings in this series are per-NAPI
> > > >   (and not per HW queue), the argument could be that user apps need
> > > >   to setup NAPIs when they are created and settings do not persist
> > > >   between NAPIs with different IDs even if associated with the same
> > > >   HW queue.
> > > 
> > > IDK if the fact that NAPI ID gets replaced was intentional in the first
> > > place. I would venture a guess that the person who added the IDs was
> > > working with NICs which have stable NAPI instances once the device is
> > > opened. This is, unfortunately, not universally the case.
> > > 
> > > I just poked at bnxt, mlx5 and fbnic and all of them reallocate NAPIs
> > > on an open device. Closer we get to queue API the more dynamic the whole
> > > setup will become (read: the more often reconfigurations will happen).
> > >
> > 
> > [...]
> > 
> > > > I think you have much more practical experience when it comes to
> > > > dealing with drivers, so I am happy to follow your lead on this one,
> > > > but assuming drivers will "do a thing" seems mildly scary to me with
> > > > limited driver experience.
> > > > 
> > > > My two goals with this series are:
> > > >   1. Make it possible to set these values per NAPI
> > > >   2. Unblock the IRQ suspension series by threading the suspend
> > > >      parameter through the code path carved in this series
> > > > 
> > > > So, I'm happy to proceed with this series as you prefer whether
> > > > that's documentation or "napi_storage"; I think you are probably the
> > > > best person to answer this question :)
> > > 
> > > How do you feel about making this configuration opt-in / require driver
> > > changes? What I'm thinking is that having the new "netif_napi_add()"
> > > variant (or perhaps extending netif_napi_set_irq()) to take an extra
> > > "index" parameter would make the whole thing much simpler.
> > 
> > What about extending netif_queue_set_napi instead? That function
> > takes a napi and a queue index.
> > 
> > Locally I kinda of hacked up something simple that:
> >   - Allocates napi_storage in net_device in alloc_netdev_mqs
> >   - Modifies netif_queue_set_napi to:
> >      if (napi)
> >        napi->storage = dev->napi_storage[queue_index];
> > 
> > I think I'm still missing the bit about the
> > max(rx_queues,tx_queues), though :(
> > 
> > > Index would basically be an integer 0..n, where n is the number of
> > > IRQs configured for the driver. The index of a NAPI instance would
> > > likely match the queue ID of the queue the NAPI serves.
> > 
> > Hmmm. I'm hesitant about the "number of IRQs" part. What if there
> > are NAPIs for which no IRQ is allocated ~someday~ ?
> > 
> > It seems like (I could totally be wrong) that netif_queue_set_napi
> > can be called and work and create the association even without an
> > IRQ allocated.
> > 
> > I guess the issue is mostly the queue index question above: combined
> > rx/tx vs drivers having different numbers of rx and tx queues.
> > 
> > > We can then allocate an array of "napi_configs" in net_device -
> > > like we allocate queues, the array size would be max(num_rx_queue,
> > > num_tx_queues). We just need to store a couple of ints so it will
> > > be tiny compared to queue structs, anyway.
> > > 
> > > The NAPI_SET netlink op can then work based on NAPI index rather 
> > > than the ephemeral NAPI ID. It can apply the config to all live
> > > NAPI instances with that index (of which there really should only 
> > > be one, unless driver is mid-reconfiguration somehow but even that
> > > won't cause issues, we can give multiple instances the same settings)
> > > and also store the user config in the array in net_device.
> > > 
> > > When new NAPI instance is associate with a NAPI index it should get
> > > all the config associated with that index applied.
> > > 
> > > Thoughts? Does that makes sense, and if so do you think it's an
> > > over-complication?
> > 
> > I think what you are proposing seems fine; I'm just working out the
> > implementation details and making sure I understand before sending
> > another revision.
> 
> What if instead of an extra storage index in UAPI, we make napi_id persistent?
> Then we can keep using napi_id as a user-facing number for the configuration.
> 
> Having a stable napi_id would also be super useful for the epoll setup so you
> don't have to match old/invalid ids to the new ones on device reset.

Up to now for prototyping purposes: the way I've been dealing with this is
using a SO_ATTACH_REUSEPORT_CBPF program like this:

struct sock_filter code[] = {
    /* A = skb->queue_mapping */
    { BPF_LD | BPF_W | BPF_ABS, 0, 0, SKF_AD_OFF + SKF_AD_QUEUE },
    /* A = A % n */
    { BPF_ALU | BPF_MOD, 0, 0, n },
    /* return A */
    { BPF_RET | BPF_A, 0, 0, 0 },
};

with SO_BINDTODEVICE. Note that the above uses queue_mapping (not NAPI ID) so
even if the NAPI IDs change the filter still distributes connections from the
same "queue_mapping" to the same thread.

Since epoll busy poll is based on NAPI ID (and not queue_mapping), this will
probably cause some issue if the NAPI ID changes because the NAPI ID associated
with the epoll context will suddenly change meaning the "old" NAPI won't be
busy polled. This might be fine because if that happens the old NAPI is being
disabled anyway?

At any rate the user program doesn't "need" to do anything when the NAPI ID
changes... unless it has a more complicated ebpf program that relies on NAPI ID
;)

> In the code, we can keep the same idea with napi_storage in netdev and
> ask drivers to provide storage id, but keep that id internal.
> 
> The only complication with that is napi_hash_add/napi_hash_del that
> happen in netif_napi_add_weight. So for the devices that allocate
> new napi before removing the old ones (most devices?), we'd have to add
> some new netif_napi_takeover(old_napi, new_napi) to remove the
> old napi_id from the hash and reuse it in the new one.
> 
> So for mlx5, the flow would look like the following:
> 
> - mlx5e_safe_switch_params
>   - mlx5e_open_channels
>     - netif_napi_add(new_napi)
>       - adds napi with 'ephemeral' napi id
>   - mlx5e_switch_priv_channels
>     - mlx5e_deactivate_priv_channels
>       - napi_disable(old_napi)
>       - netif_napi_del(old_napi) - this frees the old napi_id
>   - mlx5e_activate_priv_channels
>     - mlx5e_activate_channels
>       - mlx5e_activate_channel
>         - netif_napi_takeover(old_napi is gone, so probably take id from napi_storage?)
> 	  - if napi is not hashed - safe to reuse?
> 	- napi_enable
> 
> This is a bit ugly because we still have random napi ids during reset, but
> is not super complicated implementation-wise. We can eventually improve
> the above by splitting netif_napi_add_weight into two steps: allocate and
> activate (to do the napi_id allocation & hashing). Thoughts?

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH net-next 5/5] netdev-genl: Support setting per-NAPI config values
  2024-09-04 23:54             ` Jakub Kicinski
@ 2024-09-05  9:32               ` Joe Damato
  2024-09-08 15:57               ` Joe Damato
  1 sibling, 0 replies; 48+ messages in thread
From: Joe Damato @ 2024-09-05  9:32 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Stanislav Fomichev, netdev, edumazet, amritha.nambiar,
	sridhar.samudrala, bjorn, hch, willy, willemdebruijn.kernel,
	skhawaja, Martin Karsten, Donald Hunter, David S. Miller,
	Paolo Abeni, Jesper Dangaard Brouer, Xuan Zhuo, Daniel Jurgens,
	open list

On Wed, Sep 04, 2024 at 04:54:17PM -0700, Jakub Kicinski wrote:
> On Wed, 4 Sep 2024 16:40:41 -0700 Stanislav Fomichev wrote:
> > > I think what you are proposing seems fine; I'm just working out the
> > > implementation details and making sure I understand before sending
> > > another revision.  
> > 
> > What if instead of an extra storage index in UAPI, we make napi_id persistent?
> > Then we can keep using napi_id as a user-facing number for the configuration.
> > 
> > Having a stable napi_id would also be super useful for the epoll setup so you
> > don't have to match old/invalid ids to the new ones on device reset.
> 
> that'd be nice, initially I thought that we have some drivers that have
> multiple instances of NAPI enabled for a single "index", but I don't
> see such drivers now.
> 
> > In the code, we can keep the same idea with napi_storage in netdev and
> > ask drivers to provide storage id, but keep that id internal.
> > 
> > The only complication with that is napi_hash_add/napi_hash_del that
> > happen in netif_napi_add_weight. So for the devices that allocate
> > new napi before removing the old ones (most devices?), we'd have to add
> > some new netif_napi_takeover(old_napi, new_napi) to remove the
> > old napi_id from the hash and reuse it in the new one.
> > 
> > So for mlx5, the flow would look like the following:
> > 
> > - mlx5e_safe_switch_params
> >   - mlx5e_open_channels
> >     - netif_napi_add(new_napi)
> >       - adds napi with 'ephemeral' napi id
> >   - mlx5e_switch_priv_channels
> >     - mlx5e_deactivate_priv_channels
> >       - napi_disable(old_napi)
> >       - netif_napi_del(old_napi) - this frees the old napi_id
> >   - mlx5e_activate_priv_channels
> >     - mlx5e_activate_channels
> >       - mlx5e_activate_channel
> >         - netif_napi_takeover(old_napi is gone, so probably take id from napi_storage?)
> > 	  - if napi is not hashed - safe to reuse?
> > 	- napi_enable
> > 
> > This is a bit ugly because we still have random napi ids during reset, but
> > is not super complicated implementation-wise. We can eventually improve
> > the above by splitting netif_napi_add_weight into two steps: allocate and
> > activate (to do the napi_id allocation & hashing). Thoughts?
> 
> The "takeover" would be problematic for drivers which free old NAPI
> before allocating new one (bnxt?). But splitting the two steps sounds
> pretty clean. We can add a helper to mark NAPI as "driver will
> explicitly list/hash later", and have the driver call a new helper
> which takes storage ID and lists the NAPI in the hash.

Hm... I thought I had an idea of how to write this up, but I think
maybe I've been thinking about it wrong.

Whatever I land on, I'll send first as an RFC to make sure I'm
following all the feedback that has come in. I definitely want to
get this right.

Sorry for the slow responses; I am technically on PTO for a bit
before LPC :)

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH net-next 5/5] netdev-genl: Support setting per-NAPI config values
  2024-09-05  9:30             ` Joe Damato
@ 2024-09-05 16:56               ` Stanislav Fomichev
  2024-09-05 17:05                 ` Joe Damato
  0 siblings, 1 reply; 48+ messages in thread
From: Stanislav Fomichev @ 2024-09-05 16:56 UTC (permalink / raw)
  To: Joe Damato, Jakub Kicinski, netdev, edumazet, amritha.nambiar,
	sridhar.samudrala, bjorn, hch, willy, willemdebruijn.kernel,
	skhawaja, Martin Karsten, Donald Hunter, David S. Miller,
	Paolo Abeni, Jesper Dangaard Brouer, Xuan Zhuo, Daniel Jurgens,
	open list

On 09/05, Joe Damato wrote:
> On Wed, Sep 04, 2024 at 04:40:41PM -0700, Stanislav Fomichev wrote:
> > On 09/02, Joe Damato wrote:
> > > On Fri, Aug 30, 2024 at 02:22:35PM -0700, Jakub Kicinski wrote:
> > > > On Fri, 30 Aug 2024 11:43:00 +0100 Joe Damato wrote:
> > > > > On Thu, Aug 29, 2024 at 03:31:05PM -0700, Jakub Kicinski wrote:
> > > > > > On Thu, 29 Aug 2024 13:12:01 +0000 Joe Damato wrote:  
> > > > > > > +      doc: Set configurable NAPI instance settings.  
> > > > > > 
> > > > > > We should pause and think here how configuring NAPI params should
> > > > > > behave. NAPI instances are ephemeral, if you close and open the
> > > > > > device (or for some drivers change any BPF or ethtool setting)
> > > > > > the NAPIs may get wiped and recreated, discarding all configuration.
> > > > > > 
> > > > > > This is not how the sysfs API behaves, the sysfs settings on the device
> > > > > > survive close. It's (weirdly?) also not how queues behave, because we
> > > > > > have struct netdev{_rx,}_queue to store stuff persistently. Even tho
> > > > > > you'd think queues are as ephemeral as NAPIs if not more.
> > > > > > 
> > > > > > I guess we can either document this, and move on (which may be fine,
> > > > > > you have more practical experience than me). Or we can add an internal
> > > > > > concept of a "channel" (which perhaps maybe if you squint is what
> > > > > > ethtool -l calls NAPIs?) or just "napi_storage" as an array inside
> > > > > > net_device and store such config there. For simplicity of matching
> > > > > > config to NAPIs we can assume drivers add NAPI instances in order. 
> > > > > > If driver wants to do something more fancy we can add a variant of
> > > > > > netif_napi_add() which specifies the channel/storage to use.
> > > > > > 
> > > > > > Thoughts? I may be overly sensitive to the ephemeral thing, maybe
> > > > > > I work with unfortunate drivers...  
> > > > > 
> > > > > Thanks for pointing this out. I think this is an important case to
> > > > > consider. Here's how I'm thinking about it.
> > > > > 
> > > > > There are two cases:
> > > > > 
> > > > > 1) sysfs setting is used by existing/legacy apps: If the NAPIs are
> > > > > discarded and recreated, the code I added to netif_napi_add_weight
> > > > > in patch 1 and 3 should take care of that case preserving how sysfs
> > > > > works today, I believe. I think we are good on this case ?
> > > > 
> > > > Agreed.
> > > > 
> > > > > 2) apps using netlink to set various custom settings. This seems
> > > > > like a case where a future extension can be made to add a notifier
> > > > > for NAPI changes (like the netdevice notifier?).
> > > > 
> > > > Yes, the notifier may help, but it's a bit of a stop gap / fallback.
> > > > 
> > > > > If you think this is a good idea, then we'd do something like:
> > > > >   1. Document that the NAPI settings are wiped when NAPIs are wiped
> > > > >   2. In the future (not part of this series) a NAPI notifier is
> > > > >      added
> > > > >   3. User apps can then listen for NAPI create/delete events
> > > > >      and update settings when a NAPI is created. It would be
> > > > >      helpful, I think, for user apps to know about NAPI
> > > > >      create/delete events in general because it means NAPI IDs are
> > > > >      changing.
> > > > > 
> > > > > One could argue:
> > > > > 
> > > > >   When wiping/recreating a NAPI for an existing HW queue, that HW
> > > > >   queue gets a new NAPI ID associated with it. User apps operating
> > > > >   at this level probably care about NAPI IDs changing (as it affects
> > > > >   epoll busy poll). Since the settings in this series are per-NAPI
> > > > >   (and not per HW queue), the argument could be that user apps need
> > > > >   to setup NAPIs when they are created and settings do not persist
> > > > >   between NAPIs with different IDs even if associated with the same
> > > > >   HW queue.
> > > > 
> > > > IDK if the fact that NAPI ID gets replaced was intentional in the first
> > > > place. I would venture a guess that the person who added the IDs was
> > > > working with NICs which have stable NAPI instances once the device is
> > > > opened. This is, unfortunately, not universally the case.
> > > > 
> > > > I just poked at bnxt, mlx5 and fbnic and all of them reallocate NAPIs
> > > > on an open device. Closer we get to queue API the more dynamic the whole
> > > > setup will become (read: the more often reconfigurations will happen).
> > > >
> > > 
> > > [...]
> > > 
> > > > > I think you have much more practical experience when it comes to
> > > > > dealing with drivers, so I am happy to follow your lead on this one,
> > > > > but assuming drivers will "do a thing" seems mildly scary to me with
> > > > > limited driver experience.
> > > > > 
> > > > > My two goals with this series are:
> > > > >   1. Make it possible to set these values per NAPI
> > > > >   2. Unblock the IRQ suspension series by threading the suspend
> > > > >      parameter through the code path carved in this series
> > > > > 
> > > > > So, I'm happy to proceed with this series as you prefer whether
> > > > > that's documentation or "napi_storage"; I think you are probably the
> > > > > best person to answer this question :)
> > > > 
> > > > How do you feel about making this configuration opt-in / require driver
> > > > changes? What I'm thinking is that having the new "netif_napi_add()"
> > > > variant (or perhaps extending netif_napi_set_irq()) to take an extra
> > > > "index" parameter would make the whole thing much simpler.
> > > 
> > > What about extending netif_queue_set_napi instead? That function
> > > takes a napi and a queue index.
> > > 
> > > Locally I kinda of hacked up something simple that:
> > >   - Allocates napi_storage in net_device in alloc_netdev_mqs
> > >   - Modifies netif_queue_set_napi to:
> > >      if (napi)
> > >        napi->storage = dev->napi_storage[queue_index];
> > > 
> > > I think I'm still missing the bit about the
> > > max(rx_queues,tx_queues), though :(
> > > 
> > > > Index would basically be an integer 0..n, where n is the number of
> > > > IRQs configured for the driver. The index of a NAPI instance would
> > > > likely match the queue ID of the queue the NAPI serves.
> > > 
> > > Hmmm. I'm hesitant about the "number of IRQs" part. What if there
> > > are NAPIs for which no IRQ is allocated ~someday~ ?
> > > 
> > > It seems like (I could totally be wrong) that netif_queue_set_napi
> > > can be called and work and create the association even without an
> > > IRQ allocated.
> > > 
> > > I guess the issue is mostly the queue index question above: combined
> > > rx/tx vs drivers having different numbers of rx and tx queues.
> > > 
> > > > We can then allocate an array of "napi_configs" in net_device -
> > > > like we allocate queues, the array size would be max(num_rx_queue,
> > > > num_tx_queues). We just need to store a couple of ints so it will
> > > > be tiny compared to queue structs, anyway.
> > > > 
> > > > The NAPI_SET netlink op can then work based on NAPI index rather 
> > > > than the ephemeral NAPI ID. It can apply the config to all live
> > > > NAPI instances with that index (of which there really should only 
> > > > be one, unless driver is mid-reconfiguration somehow but even that
> > > > won't cause issues, we can give multiple instances the same settings)
> > > > and also store the user config in the array in net_device.
> > > > 
> > > > When new NAPI instance is associate with a NAPI index it should get
> > > > all the config associated with that index applied.
> > > > 
> > > > Thoughts? Does that makes sense, and if so do you think it's an
> > > > over-complication?
> > > 
> > > I think what you are proposing seems fine; I'm just working out the
> > > implementation details and making sure I understand before sending
> > > another revision.
> > 
> > What if instead of an extra storage index in UAPI, we make napi_id persistent?
> > Then we can keep using napi_id as a user-facing number for the configuration.
> > 
> > Having a stable napi_id would also be super useful for the epoll setup so you
> > don't have to match old/invalid ids to the new ones on device reset.
> 
> Up to now for prototyping purposes: the way I've been dealing with this is
> using a SO_ATTACH_REUSEPORT_CBPF program like this:
> 
> struct sock_filter code[] = {
>     /* A = skb->queue_mapping */
>     { BPF_LD | BPF_W | BPF_ABS, 0, 0, SKF_AD_OFF + SKF_AD_QUEUE },
>     /* A = A % n */
>     { BPF_ALU | BPF_MOD, 0, 0, n },
>     /* return A */
>     { BPF_RET | BPF_A, 0, 0, 0 },
> };
> 
> with SO_BINDTODEVICE. Note that the above uses queue_mapping (not NAPI ID) so
> even if the NAPI IDs change the filter still distributes connections from the
> same "queue_mapping" to the same thread.
> 
> Since epoll busy poll is based on NAPI ID (and not queue_mapping), this will
> probably cause some issue if the NAPI ID changes because the NAPI ID associated
> with the epoll context will suddenly change meaning the "old" NAPI won't be
> busy polled. This might be fine because if that happens the old NAPI is being
> disabled anyway?
> 
> At any rate the user program doesn't "need" to do anything when the NAPI ID
> changes... unless it has a more complicated ebpf program that relies on NAPI ID
> ;)

Ah, you went a more creative route. We were doing SO_INCOMING_NAPI_ID on
a single listener and manually adding fds to the appropriate epoll.
But regardless of the method, having a stable napi_id is still good
to have and hopefully it's not a lot of work compared to exposing
an extra napi storage id to userspace.
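
(For reference, the single-listener flow is basically: accept, ask the
kernel which NAPI the connection landed on, then hand the fd to the
thread/epoll instance that busy-polls that NAPI. Sketch below, error
handling omitted, epoll_fd_for_napi() stands in for our app-side
lookup:)

	int cfd = accept(listen_fd, NULL, NULL);
	unsigned int napi_id = 0;
	socklen_t len = sizeof(napi_id);

	getsockopt(cfd, SOL_SOCKET, SO_INCOMING_NAPI_ID, &napi_id, &len);

	struct epoll_event ev = { .events = EPOLLIN, .data.fd = cfd };
	epoll_ctl(epoll_fd_for_napi(napi_id), EPOLL_CTL_ADD, cfd, &ev);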

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH net-next 5/5] netdev-genl: Support setting per-NAPI config values
  2024-09-05 16:56               ` Stanislav Fomichev
@ 2024-09-05 17:05                 ` Joe Damato
  0 siblings, 0 replies; 48+ messages in thread
From: Joe Damato @ 2024-09-05 17:05 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Jakub Kicinski, netdev, edumazet, amritha.nambiar,
	sridhar.samudrala, bjorn, hch, willy, willemdebruijn.kernel,
	skhawaja, Martin Karsten, Donald Hunter, David S. Miller,
	Paolo Abeni, Jesper Dangaard Brouer, Xuan Zhuo, Daniel Jurgens,
	open list

On Thu, Sep 05, 2024 at 09:56:43AM -0700, Stanislav Fomichev wrote:
> On 09/05, Joe Damato wrote:
> > On Wed, Sep 04, 2024 at 04:40:41PM -0700, Stanislav Fomichev wrote:
> > > On 09/02, Joe Damato wrote:
> > > > On Fri, Aug 30, 2024 at 02:22:35PM -0700, Jakub Kicinski wrote:
> > > > > On Fri, 30 Aug 2024 11:43:00 +0100 Joe Damato wrote:
> > > > > > On Thu, Aug 29, 2024 at 03:31:05PM -0700, Jakub Kicinski wrote:
> > > > > > > On Thu, 29 Aug 2024 13:12:01 +0000 Joe Damato wrote:  
> > > > > > > > +      doc: Set configurable NAPI instance settings.  
> > > > > > > 
> > > > > > > We should pause and think here how configuring NAPI params should
> > > > > > > behave. NAPI instances are ephemeral, if you close and open the
> > > > > > > device (or for some drivers change any BPF or ethtool setting)
> > > > > > > the NAPIs may get wiped and recreated, discarding all configuration.
> > > > > > > 
> > > > > > > This is not how the sysfs API behaves, the sysfs settings on the device
> > > > > > > survive close. It's (weirdly?) also not how queues behave, because we
> > > > > > > have struct netdev{_rx,}_queue to store stuff persistently. Even tho
> > > > > > > you'd think queues are as ephemeral as NAPIs if not more.
> > > > > > > 
> > > > > > > I guess we can either document this, and move on (which may be fine,
> > > > > > > you have more practical experience than me). Or we can add an internal
> > > > > > > concept of a "channel" (which perhaps maybe if you squint is what
> > > > > > > ethtool -l calls NAPIs?) or just "napi_storage" as an array inside
> > > > > > > net_device and store such config there. For simplicity of matching
> > > > > > > config to NAPIs we can assume drivers add NAPI instances in order. 
> > > > > > > If driver wants to do something more fancy we can add a variant of
> > > > > > > netif_napi_add() which specifies the channel/storage to use.
> > > > > > > 
> > > > > > > Thoughts? I may be overly sensitive to the ephemeral thing, maybe
> > > > > > > I work with unfortunate drivers...  
> > > > > > 
> > > > > > Thanks for pointing this out. I think this is an important case to
> > > > > > consider. Here's how I'm thinking about it.
> > > > > > 
> > > > > > There are two cases:
> > > > > > 
> > > > > > 1) sysfs setting is used by existing/legacy apps: If the NAPIs are
> > > > > > discarded and recreated, the code I added to netif_napi_add_weight
> > > > > > in patch 1 and 3 should take care of that case preserving how sysfs
> > > > > > works today, I believe. I think we are good on this case ?
> > > > > 
> > > > > Agreed.
> > > > > 
> > > > > > 2) apps using netlink to set various custom settings. This seems
> > > > > > like a case where a future extension can be made to add a notifier
> > > > > > for NAPI changes (like the netdevice notifier?).
> > > > > 
> > > > > Yes, the notifier may help, but it's a bit of a stop gap / fallback.
> > > > > 
> > > > > > If you think this is a good idea, then we'd do something like:
> > > > > >   1. Document that the NAPI settings are wiped when NAPIs are wiped
> > > > > >   2. In the future (not part of this series) a NAPI notifier is
> > > > > >      added
> > > > > >   3. User apps can then listen for NAPI create/delete events
> > > > > >      and update settings when a NAPI is created. It would be
> > > > > >      helpful, I think, for user apps to know about NAPI
> > > > > >      create/delete events in general because it means NAPI IDs are
> > > > > >      changing.
> > > > > > 
> > > > > > One could argue:
> > > > > > 
> > > > > >   When wiping/recreating a NAPI for an existing HW queue, that HW
> > > > > >   queue gets a new NAPI ID associated with it. User apps operating
> > > > > >   at this level probably care about NAPI IDs changing (as it affects
> > > > > >   epoll busy poll). Since the settings in this series are per-NAPI
> > > > > >   (and not per HW queue), the argument could be that user apps need
> > > > > >   to setup NAPIs when they are created and settings do not persist
> > > > > >   between NAPIs with different IDs even if associated with the same
> > > > > >   HW queue.
> > > > > 
> > > > > IDK if the fact that NAPI ID gets replaced was intentional in the first
> > > > > place. I would venture a guess that the person who added the IDs was
> > > > > working with NICs which have stable NAPI instances once the device is
> > > > > opened. This is, unfortunately, not universally the case.
> > > > > 
> > > > > I just poked at bnxt, mlx5 and fbnic and all of them reallocate NAPIs
> > > > > on an open device. Closer we get to queue API the more dynamic the whole
> > > > > setup will become (read: the more often reconfigurations will happen).
> > > > >
> > > > 
> > > > [...]
> > > > 
> > > > > > I think you have much more practical experience when it comes to
> > > > > > dealing with drivers, so I am happy to follow your lead on this one,
> > > > > > but assuming drivers will "do a thing" seems mildly scary to me with
> > > > > > limited driver experience.
> > > > > > 
> > > > > > My two goals with this series are:
> > > > > >   1. Make it possible to set these values per NAPI
> > > > > >   2. Unblock the IRQ suspension series by threading the suspend
> > > > > >      parameter through the code path carved in this series
> > > > > > 
> > > > > > So, I'm happy to proceed with this series as you prefer whether
> > > > > > that's documentation or "napi_storage"; I think you are probably the
> > > > > > best person to answer this question :)
> > > > > 
> > > > > How do you feel about making this configuration opt-in / require driver
> > > > > changes? What I'm thinking is that having the new "netif_napi_add()"
> > > > > variant (or perhaps extending netif_napi_set_irq()) to take an extra
> > > > > "index" parameter would make the whole thing much simpler.
> > > > 
> > > > What about extending netif_queue_set_napi instead? That function
> > > > takes a napi and a queue index.
> > > > 
> > > > Locally I kinda of hacked up something simple that:
> > > >   - Allocates napi_storage in net_device in alloc_netdev_mqs
> > > >   - Modifies netif_queue_set_napi to:
> > > >      if (napi)
> > > >        napi->storage = dev->napi_storage[queue_index];
> > > > 
> > > > I think I'm still missing the bit about the
> > > > max(rx_queues,tx_queues), though :(
> > > > 
> > > > > Index would basically be an integer 0..n, where n is the number of
> > > > > IRQs configured for the driver. The index of a NAPI instance would
> > > > > likely match the queue ID of the queue the NAPI serves.
> > > > 
> > > > Hmmm. I'm hesitant about the "number of IRQs" part. What if there
> > > > are NAPIs for which no IRQ is allocated ~someday~ ?
> > > > 
> > > > It seems like (I could totally be wrong) that netif_queue_set_napi
> > > > can be called and work and create the association even without an
> > > > IRQ allocated.
> > > > 
> > > > I guess the issue is mostly the queue index question above: combined
> > > > rx/tx vs drivers having different numbers of rx and tx queues.
> > > > 
> > > > > We can then allocate an array of "napi_configs" in net_device -
> > > > > like we allocate queues, the array size would be max(num_rx_queue,
> > > > > num_tx_queues). We just need to store a couple of ints so it will
> > > > > be tiny compared to queue structs, anyway.
> > > > > 
> > > > > The NAPI_SET netlink op can then work based on NAPI index rather 
> > > > > than the ephemeral NAPI ID. It can apply the config to all live
> > > > > NAPI instances with that index (of which there really should only 
> > > > > be one, unless driver is mid-reconfiguration somehow but even that
> > > > > won't cause issues, we can give multiple instances the same settings)
> > > > > and also store the user config in the array in net_device.
> > > > > 
> > > > > When new NAPI instance is associate with a NAPI index it should get
> > > > > all the config associated with that index applied.
> > > > > 
> > > > > Thoughts? Does that makes sense, and if so do you think it's an
> > > > > over-complication?
> > > > 
> > > > I think what you are proposing seems fine; I'm just working out the
> > > > implementation details and making sure I understand before sending
> > > > another revision.
> > > 
> > > What if instead of an extra storage index in UAPI, we make napi_id persistent?
> > > Then we can keep using napi_id as a user-facing number for the configuration.
> > > 
> > > Having a stable napi_id would also be super useful for the epoll setup so you
> > > don't have to match old/invalid ids to the new ones on device reset.
> > 
> > Up to now for prototyping purposes: the way I've been dealing with this is
> > using a SO_ATTACH_REUSEPORT_CBPF program like this:
> > 
> > struct sock_filter code[] = {
> >     /* A = skb->queue_mapping */
> >     { BPF_LD | BPF_W | BPF_ABS, 0, 0, SKF_AD_OFF + SKF_AD_QUEUE },
> >     /* A = A % n */
> >     { BPF_ALU | BPF_MOD, 0, 0, n },
> >     /* return A */
> >     { BPF_RET | BPF_A, 0, 0, 0 },
> > };
> > 
> > with SO_BINDTODEVICE. Note that the above uses queue_mapping (not NAPI ID) so
> > even if the NAPI IDs change the filter still distributes connections from the
> > same "queue_mapping" to the same thread.
> > 
> > Since epoll busy poll is based on NAPI ID (and not queue_mapping), this will
> > probably cause some issue if the NAPI ID changes because the NAPI ID associated
> > with the epoll context will suddenly change meaning the "old" NAPI won't be
> > busy polled. This might be fine because if that happens the old NAPI is being
> > disabled anyway?
> > 
> > At any rate the user program doesn't "need" to do anything when the NAPI ID
> > changes... unless it has a more complicated ebpf program that relies on NAPI ID
> > ;)
> 
> Ah, you went a more creative route. We were doing SO_INCOMING_NAPI_ID on
> a single listener and manually adding fds to the appropriate epoll.
> But regardless of the method, having a stable napi_id is still good
> to have and hopefully it's not a lot of work compared to exposing
> an extra napi storage id to userspace.

Yes, your approach makes sense, as well. With a single acceptor
thread I'd also do what you are doing.

I was essentially modifying an existing REUSEPORT user app in which
one app listens on all interfaces, so BINDTODEVICE + a queue_mapping
BPF filter was the only way (that I could find) to make the NAPI ID
based polling work.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH net-next 5/5] netdev-genl: Support setting per-NAPI config values
  2024-09-03 19:40               ` Jakub Kicinski
  2024-09-03 21:58                 ` Samiullah Khawaja
@ 2024-09-08 15:54                 ` Joe Damato
  1 sibling, 0 replies; 48+ messages in thread
From: Joe Damato @ 2024-09-08 15:54 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Samiullah Khawaja, netdev, edumazet, amritha.nambiar,
	sridhar.samudrala, sdf, bjorn, hch, willy, willemdebruijn.kernel,
	Martin Karsten, Donald Hunter, David S. Miller, Paolo Abeni,
	Jesper Dangaard Brouer, Xuan Zhuo, Daniel Jurgens, open list

On Tue, Sep 03, 2024 at 12:40:08PM -0700, Jakub Kicinski wrote:
> On Tue, 3 Sep 2024 12:04:52 -0700 Samiullah Khawaja wrote:
> > Do we need a queue to napi association to set/persist napi
> > configurations? 
> 
> I'm afraid zero-copy schemes will make multiple queues per NAPI more
> and more common, so pretending the NAPI params (related to polling)
> are pre queue will soon become highly problematic.
> 
> > Can a new index param be added to the netif_napi_add
> > and persist the configurations in napi_storage.
> 
> That'd be my (weak) preference.
> 
> > I guess the problem would be the size of napi_storage.
> 
> I don't think so, we're talking about 16B per NAPI, 
> struct netdev_queue is 320B, struct netdev_rx_queue is 192B. 
> NAPI storage is rounding error next to those :S
> 
> > Also wondering if for some use case persistence would be problematic
> > when the napis are recreated, since the new napi instances might not
> > represent the same context? For example If I resize the dev from 16
> > rx/tx to 8 rx/tx queues and the napi index that was used by TX queue,
> > now polls RX queue.
> 
> We can clear the config when NAPI is activated (ethtool -L /
> set-channels). That seems like a good idea.

I'm probably misunderstanding this bit; I've implemented this by
just memsetting the napi_storage entries to 0 on napi_enable, which
obviously breaks the persistence and is incorrect because it doesn't
respect sysfs.

I'm going to send what I have as an RFC anyway, because it might be
easier to discuss by commenting on code that is (hopefully) moving
in the right direction?

I hope that's OK; I'll explicitly call it out in the cover letter
that I am about to send.

- Joe

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH net-next 5/5] netdev-genl: Support setting per-NAPI config values
  2024-09-04 23:54             ` Jakub Kicinski
  2024-09-05  9:32               ` Joe Damato
@ 2024-09-08 15:57               ` Joe Damato
  2024-09-09 23:03                 ` Jakub Kicinski
  1 sibling, 1 reply; 48+ messages in thread
From: Joe Damato @ 2024-09-08 15:57 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Stanislav Fomichev, netdev, edumazet, amritha.nambiar,
	sridhar.samudrala, bjorn, hch, willy, willemdebruijn.kernel,
	skhawaja, Martin Karsten, Donald Hunter, David S. Miller,
	Paolo Abeni, Jesper Dangaard Brouer, Xuan Zhuo, Daniel Jurgens,
	open list

On Wed, Sep 04, 2024 at 04:54:17PM -0700, Jakub Kicinski wrote:
> On Wed, 4 Sep 2024 16:40:41 -0700 Stanislav Fomichev wrote:
> > > I think what you are proposing seems fine; I'm just working out the
> > > implementation details and making sure I understand before sending
> > > another revision.  
> > 
> > What if instead of an extra storage index in UAPI, we make napi_id persistent?
> > Then we can keep using napi_id as a user-facing number for the configuration.
> > 
> > Having a stable napi_id would also be super useful for the epoll setup so you
> > don't have to match old/invalid ids to the new ones on device reset.
> 
> that'd be nice, initially I thought that we have some drivers that have
> multiple instances of NAPI enabled for a single "index", but I don't
> see such drivers now.
> 
> > In the code, we can keep the same idea with napi_storage in netdev and
> > ask drivers to provide storage id, but keep that id internal.
> > 
> > The only complication with that is napi_hash_add/napi_hash_del that
> > happen in netif_napi_add_weight. So for the devices that allocate
> > new napi before removing the old ones (most devices?), we'd have to add
> > some new netif_napi_takeover(old_napi, new_napi) to remove the
> > old napi_id from the hash and reuse it in the new one.
> > 
> > So for mlx5, the flow would look like the following:
> > 
> > - mlx5e_safe_switch_params
> >   - mlx5e_open_channels
> >     - netif_napi_add(new_napi)
> >       - adds napi with 'ephemeral' napi id
> >   - mlx5e_switch_priv_channels
> >     - mlx5e_deactivate_priv_channels
> >       - napi_disable(old_napi)
> >       - netif_napi_del(old_napi) - this frees the old napi_id
> >   - mlx5e_activate_priv_channels
> >     - mlx5e_activate_channels
> >       - mlx5e_activate_channel
> >         - netif_napi_takeover(old_napi is gone, so probably take id from napi_storage?)
> > 	  - if napi is not hashed - safe to reuse?
> > 	- napi_enable
> > 
> > This is a bit ugly because we still have random napi ids during reset, but
> > is not super complicated implementation-wise. We can eventually improve
> > the above by splitting netif_napi_add_weight into two steps: allocate and
> > activate (to do the napi_id allocation & hashing). Thoughts?
> 
> The "takeover" would be problematic for drivers which free old NAPI
> before allocating new one (bnxt?). But splitting the two steps sounds
> pretty clean. We can add a helper to mark NAPI as "driver will
> explicitly list/hash later", and have the driver call a new helper
> which takes storage ID and lists the NAPI in the hash.

It sounds like you all are suggesting that the napi_id be moved into
the napi_storage array as well, so that NAPI IDs persist even if the
NAPI structs themselves are recreated?
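
Something like the below, I guess (purely illustrative, the struct and
field names are made up, though it does line up with the "16B per NAPI"
figure mentioned earlier):

	struct napi_storage {
		u64	gro_flush_timeout;
		u32	defer_hard_irqs;
		u32	napi_id;	/* persists across NAPI re-creation */
	};

with one entry per NAPI index, allocated alongside the netdev.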

I think that's interesting and I'm open to supporting that. I wrote
up an RFC that moves stuff in the direction of napi_storage and
modifies 3 drivers, but it:
  - is incorrect because it breaks the persistence we are talking
    about, and
  - doesn't do the two-step takeover described above to inherit
    NAPI IDs (as far as I understand?)

I'm going to send the RFC anyway because I think it'll be easier to
pick up the discussion on code that is hopefully closer to where we
want to land.

I hope that is OK.

- Joe

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH net-next 5/5] netdev-genl: Support setting per-NAPI config values
  2024-09-08 15:57               ` Joe Damato
@ 2024-09-09 23:03                 ` Jakub Kicinski
  0 siblings, 0 replies; 48+ messages in thread
From: Jakub Kicinski @ 2024-09-09 23:03 UTC (permalink / raw)
  To: Joe Damato
  Cc: Stanislav Fomichev, netdev, edumazet, amritha.nambiar,
	sridhar.samudrala, bjorn, hch, willy, willemdebruijn.kernel,
	skhawaja, Martin Karsten, Donald Hunter, David S. Miller,
	Paolo Abeni, Jesper Dangaard Brouer, Xuan Zhuo, Daniel Jurgens,
	open list

On Sun, 8 Sep 2024 17:57:22 +0200 Joe Damato wrote:
> I hope that is OK.

I'll comment on the RFC, but FWIW, yes, sharing the code with the
discussion points / TODOs listed in the cover letter is the right way
of moving things forward, thanks :)

^ permalink raw reply	[flat|nested] 48+ messages in thread

end of thread, other threads:[~2024-09-09 23:03 UTC | newest]

Thread overview: 48+ messages
2024-08-29 13:11 [PATCH net-next 0/5] Add support for per-NAPI config via netlink Joe Damato
2024-08-29 13:11 ` [PATCH net-next 1/5] net: napi: Make napi_defer_hard_irqs per-NAPI Joe Damato
2024-08-29 13:46   ` Eric Dumazet
2024-08-29 22:05   ` Jakub Kicinski
2024-08-30  9:14     ` Joe Damato
2024-08-30 20:21       ` Jakub Kicinski
2024-08-30 20:23         ` Joe Damato
2024-08-30  8:36   ` Simon Horman
2024-08-30  9:11     ` Joe Damato
2024-08-30 16:50   ` kernel test robot
2024-08-29 13:11 ` [PATCH net-next 2/5] netdev-genl: Dump napi_defer_hard_irqs Joe Damato
2024-08-29 22:08   ` Jakub Kicinski
2024-08-30  9:10     ` Joe Damato
2024-08-30 20:28       ` Jakub Kicinski
2024-08-30 20:31         ` Joe Damato
2024-08-30 21:22           ` Jakub Kicinski
2024-08-29 13:11 ` [PATCH net-next 3/5] net: napi: Make gro_flush_timeout per-NAPI Joe Damato
2024-08-29 13:48   ` Eric Dumazet
2024-08-29 13:57     ` Joe Damato
2024-08-29 15:28     ` Joe Damato
2024-08-29 15:31       ` Eric Dumazet
2024-08-29 15:39         ` Joe Damato
2024-08-30 16:18   ` kernel test robot
2024-08-30 16:18   ` kernel test robot
2024-08-29 13:12 ` [PATCH net-next 4/5] netdev-genl: Dump gro_flush_timeout Joe Damato
2024-08-29 22:09   ` Jakub Kicinski
2024-08-30  9:17     ` Joe Damato
2024-08-29 13:12 ` [PATCH net-next 5/5] netdev-genl: Support setting per-NAPI config values Joe Damato
2024-08-29 22:31   ` Jakub Kicinski
2024-08-30 10:43     ` Joe Damato
2024-08-30 21:22       ` Jakub Kicinski
2024-08-31 17:27         ` Joe Damato
2024-09-03  0:49           ` Jakub Kicinski
2024-09-02 16:56         ` Joe Damato
2024-09-03  1:02           ` Jakub Kicinski
2024-09-03 19:04             ` Samiullah Khawaja
2024-09-03 19:40               ` Jakub Kicinski
2024-09-03 21:58                 ` Samiullah Khawaja
2024-09-05  9:20                   ` Joe Damato
2024-09-08 15:54                 ` Joe Damato
2024-09-04 23:40           ` Stanislav Fomichev
2024-09-04 23:54             ` Jakub Kicinski
2024-09-05  9:32               ` Joe Damato
2024-09-08 15:57               ` Joe Damato
2024-09-09 23:03                 ` Jakub Kicinski
2024-09-05  9:30             ` Joe Damato
2024-09-05 16:56               ` Stanislav Fomichev
2024-09-05 17:05                 ` Joe Damato
