netdev.vger.kernel.org archive mirror
* [RFC net-next v3 0/9] Add support for per-NAPI config via netlink
@ 2024-09-12 10:07 Joe Damato
  2024-09-12 10:07 ` [RFC net-next v3 1/9] net: napi: Make napi_defer_hard_irqs per-NAPI Joe Damato
                   ` (8 more replies)
  0 siblings, 9 replies; 18+ messages in thread
From: Joe Damato @ 2024-09-12 10:07 UTC (permalink / raw)
  To: netdev
  Cc: mkarsten, kuba, skhawaja, sdf, bjorn, amritha.nambiar,
	sridhar.samudrala, Joe Damato, Alexander Lobakin, Breno Leitao,
	Daniel Jurgens, David Ahern, David S. Miller, Donald Hunter,
	Eric Dumazet, Jesper Dangaard Brouer, Jiri Pirko, Johannes Berg,
	Jonathan Corbet, Kory Maincent, Leon Romanovsky,
	open list:DOCUMENTATION, open list,
	open list:MELLANOX MLX4 core VPI driver, Lorenzo Bianconi,
	Michael Chan, Paolo Abeni, Saeed Mahameed,
	Sebastian Andrzej Siewior, Tariq Toukan, Xuan Zhuo

Greetings:

Welcome to RFC v3.

This implementation allocates an array of "struct napi_config" in
net_device and each NAPI instance is assigned an index into the config
array.

Per-NAPI settings like:
  - NAPI ID
  - gro_flush_timeout
  - defer_hard_irqs

are persisted in napi_config: they are saved when napi_disable is called
and restored when napi_enable is called.
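
For reference, the config area added in patch 5 is a small struct, one
entry per NAPI index (copied from that patch):

  /* Structure for per-NAPI storage */
  struct napi_config {
          u64 gro_flush_timeout;
          u32 defer_hard_irqs;
          unsigned int napi_id;
  };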

To help illustrate how this would end up working, I've added patches for
3 drivers, of which I have access to only 1:
  - mlx5, which is the basis of the examples below
  - mlx4, which has TX-only NAPIs, just to highlight that case. I have
    only compile tested this patch; I don't have this hardware.
  - bnxt, which I have also only compile tested. I don't have this
    hardware.

NOTE: I only tested this on mlx5; I have no access to the other hardware
for which I provided patches. Hopefully other folks can help test :)
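
For reference, the per-driver change is tiny; e.g. the mlx5 patch below
ends up doing:

  -	netif_napi_add(netdev, &c->napi, mlx5e_napi_poll);
  +	netif_napi_add_storage(netdev, &c->napi, mlx5e_napi_poll, ix);

where the last argument is the NAPI's index into the config array.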

This iteration seems to persist NAPI IDs and settings even when queues
are resized (see below), so I think this is getting close to where we
want to land?

Here's an example of how it works on my mlx5:

# start with 2 queues

$ ethtool -l eth4 | grep Combined | tail -1
Combined:	2

First, output the current NAPI settings:

$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
                         --dump napi-get --json='{"ifindex": 7}'
[{'defer-hard-irqs': 0,
  'gro-flush-timeout': 0,
  'id': 345,
  'ifindex': 7,
  'irq': 527},
 {'defer-hard-irqs': 0,
  'gro-flush-timeout': 0,
  'id': 344,
  'ifindex': 7,
  'irq': 327}]

Now, set the global sysfs parameters:

$ sudo bash -c 'echo 20000 >/sys/class/net/eth4/gro_flush_timeout'
$ sudo bash -c 'echo 100 >/sys/class/net/eth4/napi_defer_hard_irqs'

Output current NAPI settings again:

$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
                         --dump napi-get --json='{"ifindex": 7}'
[{'defer-hard-irqs': 100,
  'gro-flush-timeout': 20000,
  'id': 345,
  'ifindex': 7,
  'irq': 527},
 {'defer-hard-irqs': 100,
  'gro-flush-timeout': 20000,
  'id': 344,
  'ifindex': 7,
  'irq': 327}]

Now set NAPI ID 345, addressed by its NAPI ID, to specific values:

$ sudo ./tools/net/ynl/cli.py \
          --spec Documentation/netlink/specs/netdev.yaml \
          --do napi-set \
          --json='{"id": 345,
                   "defer-hard-irqs": 111,
                   "gro-flush-timeout": 11111}'
None

Now output current NAPI settings again to ensure only NAPI ID 345
changed:

$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
                         --dump napi-get --json='{"ifindex": 7}'

[{'defer-hard-irqs': 111,
  'gro-flush-timeout': 11111,
  'id': 345,
  'ifindex': 7,
  'irq': 527},
 {'defer-hard-irqs': 100,
  'gro-flush-timeout': 20000,
  'id': 344,
  'ifindex': 7,
  'irq': 327}]

Now, increase gro-flush-timeout only:

$ sudo ./tools/net/ynl/cli.py \
       --spec Documentation/netlink/specs/netdev.yaml \
       --do napi-set --json='{"id": 345,
                              "gro-flush-timeout": 44444}'
None

Now output the current NAPI settings once more:

$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
                         --dump napi-get --json='{"ifindex": 7}'
[{'defer-hard-irqs': 111,
  'gro-flush-timeout': 44444,
  'id': 345,
  'ifindex': 7,
  'irq': 527},
 {'defer-hard-irqs': 100,
  'gro-flush-timeout': 20000,
  'id': 344,
  'ifindex': 7,
  'irq': 327}]

Now set NAPI ID 345 to have a gro_flush_timeout of 0:

$ sudo ./tools/net/ynl/cli.py \
       --spec Documentation/netlink/specs/netdev.yaml \
       --do napi-set --json='{"id": 345,
                              "gro-flush-timeout": 0}'
None

Check that NAPI ID 345 now has a gro-flush-timeout of 0:

$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
                         --dump napi-get --json='{"ifindex": 7}'

[{'defer-hard-irqs': 111,
  'gro-flush-timeout': 0,
  'id': 345,
  'ifindex': 7,
  'irq': 527},
 {'defer-hard-irqs': 100,
  'gro-flush-timeout': 20000,
  'id': 344,
  'ifindex': 7,
  'irq': 327}]

Change the queue count, ensuring that NAPI ID 345 retains its settings:

$ sudo ethtool -L eth4 combined 4

Check that the new queues have the system-wide settings but that NAPI ID
345 remains unchanged:

$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
                         --dump napi-get --json='{"ifindex": 7}'

[{'defer-hard-irqs': 100,
  'gro-flush-timeout': 20000,
  'id': 347,
  'ifindex': 7,
  'irq': 529},
 {'defer-hard-irqs': 100,
  'gro-flush-timeout': 20000,
  'id': 346,
  'ifindex': 7,
  'irq': 528},
 {'defer-hard-irqs': 111,
  'gro-flush-timeout': 0,
  'id': 345,
  'ifindex': 7,
  'irq': 527},
 {'defer-hard-irqs': 100,
  'gro-flush-timeout': 20000,
  'id': 344,
  'ifindex': 7,
  'irq': 327}]

Now reduce the queue count below where NAPI ID 345 is indexed:

$ sudo ethtool -L eth4 combined 1

Check the output:

$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
                         --dump napi-get --json='{"ifindex": 7}'
[{'defer-hard-irqs': 100,
  'gro-flush-timeout': 20000,
  'id': 344,
  'ifindex': 7,
  'irq': 327}]

Re-increase the queue count to ensure NAPI ID 345 is re-assigned the same
values:

$ sudo ethtool -L eth4 combined 2

$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
                         --dump napi-get --json='{"ifindex": 7}'
[{'defer-hard-irqs': 111,
  'gro-flush-timeout': 0,
  'id': 345,
  'ifindex': 7,
  'irq': 527},
 {'defer-hard-irqs': 100,
  'gro-flush-timeout': 20000,
  'id': 344,
  'ifindex': 7,
  'irq': 327}]

Create new queues to ensure the sysfs globals are used for the new NAPIs
but that NAPI ID 345 is unchanged:

$ sudo ethtool -L eth4 combined 8

$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
                         --dump napi-get --json='{"ifindex": 7}'
[...]
 {'defer-hard-irqs': 100,
  'gro-flush-timeout': 20000,
  'id': 346,
  'ifindex': 7,
  'irq': 528},
 {'defer-hard-irqs': 111,
  'gro-flush-timeout': 0,
  'id': 345,
  'ifindex': 7,
  'irq': 527},
 {'defer-hard-irqs': 100,
  'gro-flush-timeout': 20000,
  'id': 344,
  'ifindex': 7,
  'irq': 327}]

Last but not least, write the sysfs parameters again to ensure that all
NAPIs, including NAPI ID 345, are overwritten:

$ sudo bash -c 'echo 33333 >/sys/class/net/eth4/gro_flush_timeout'
$ sudo bash -c 'echo 222 >/sys/class/net/eth4/napi_defer_hard_irqs'

Check that worked:

$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
                         --dump napi-get --json='{"ifindex": 7}'

[...]
 {'defer-hard-irqs': 222,
  'gro-flush-timeout': 33333,
  'id': 346,
  'ifindex': 7,
  'irq': 528},
 {'defer-hard-irqs': 222,
  'gro-flush-timeout': 33333,
  'id': 345,
  'ifindex': 7,
  'irq': 527},
 {'defer-hard-irqs': 222,
  'gro-flush-timeout': 33333,
  'id': 344,
  'ifindex': 7,
  'irq': 327}]
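
The persistence shown across the ethtool -L calls above comes from the
save/restore hooks patch 5 adds to napi_disable() and napi_enable();
roughly (copied from that patch):

  static void napi_restore_config(struct napi_struct *n)
  {
          n->defer_hard_irqs = n->config->defer_hard_irqs;
          n->gro_flush_timeout = n->config->gro_flush_timeout;
          napi_hash_add_with_id(n, n->config->napi_id);
  }

  static void napi_save_config(struct napi_struct *n)
  {
          n->config->defer_hard_irqs = n->defer_hard_irqs;
          n->config->gro_flush_timeout = n->gro_flush_timeout;
          n->config->napi_id = n->napi_id;
          napi_hash_del(n);
  }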

Thanks,
Joe

rfcv3:
  - Renamed napi_storage to napi_config
  - Reordered patches
  - Added defer_hard_irqs and gro_flush_timeout to napi_struct
  - Attempt to save and restore settings on napi_disable/napi_enable
  - Removed weight as a parameter to netif_napi_add_storage
  - Updated driver patches to no longer pass in weight

rfcv2:
  - Almost total rewrite from v1

Joe Damato (9):
  net: napi: Make napi_defer_hard_irqs per-NAPI
  netdev-genl: Dump napi_defer_hard_irqs
  net: napi: Make gro_flush_timeout per-NAPI
  netdev-genl: Dump gro_flush_timeout
  net: napi: Add napi_config
  netdev-genl: Support setting per-NAPI config values
  bnxt: Add support for napi storage
  mlx5: Add support for napi storage
  mlx4: Add support for napi storage to RX CQs

 Documentation/netlink/specs/netdev.yaml       | 25 ++++++
 .../networking/net_cachelines/net_device.rst  |  5 +-
 drivers/net/ethernet/broadcom/bnxt/bnxt.c     |  3 +-
 drivers/net/ethernet/mellanox/mlx4/en_cq.c    |  3 +-
 .../net/ethernet/mellanox/mlx5/core/en_main.c |  2 +-
 include/linux/netdevice.h                     | 40 ++++++++-
 include/uapi/linux/netdev.h                   |  3 +
 net/core/dev.c                                | 90 +++++++++++++++----
 net/core/dev.h                                | 87 ++++++++++++++++++
 net/core/net-sysfs.c                          |  4 +-
 net/core/netdev-genl-gen.c                    | 14 +++
 net/core/netdev-genl-gen.h                    |  1 +
 net/core/netdev-genl.c                        | 55 ++++++++++++
 tools/include/uapi/linux/netdev.h             |  3 +
 14 files changed, 310 insertions(+), 25 deletions(-)

-- 
2.25.1


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [RFC net-next v3 1/9] net: napi: Make napi_defer_hard_irqs per-NAPI
  2024-09-12 10:07 [RFC net-next v3 0/9] Add support for per-NAPI config via netlink Joe Damato
@ 2024-09-12 10:07 ` Joe Damato
  2024-09-12 10:07 ` [RFC net-next v3 2/9] netdev-genl: Dump napi_defer_hard_irqs Joe Damato
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 18+ messages in thread
From: Joe Damato @ 2024-09-12 10:07 UTC (permalink / raw)
  To: netdev
  Cc: mkarsten, kuba, skhawaja, sdf, bjorn, amritha.nambiar,
	sridhar.samudrala, Joe Damato, David S. Miller, Eric Dumazet,
	Paolo Abeni, Jonathan Corbet, Jiri Pirko,
	Sebastian Andrzej Siewior, Lorenzo Bianconi, David Ahern,
	Kory Maincent, Johannes Berg, Breno Leitao, Alexander Lobakin,
	open list:DOCUMENTATION, open list

Allow per-NAPI defer_hard_irqs setting.

The existing sysfs parameter is respected; writes to sysfs will write to
all NAPI structs for the device and the net_device defer_hard_irq field.
Reads from sysfs will read from the net_device field.

The sysfs code was updated to guard against what appears to be a
potential overflow, as the field is an int but the value passed in is an
unsigned long.

The ability to set defer_hard_irqs on specific NAPI instances will be
added in a later commit, via netdev-genl.

Signed-off-by: Joe Damato <jdamato@fastly.com>
---
 .../networking/net_cachelines/net_device.rst  |  2 +-
 include/linux/netdevice.h                     |  3 +-
 net/core/dev.c                                | 10 +++---
 net/core/dev.h                                | 36 +++++++++++++++++++
 net/core/net-sysfs.c                          |  2 +-
 5 files changed, 45 insertions(+), 8 deletions(-)

diff --git a/Documentation/networking/net_cachelines/net_device.rst b/Documentation/networking/net_cachelines/net_device.rst
index 22b07c814f4a..eeeb7c925ec5 100644
--- a/Documentation/networking/net_cachelines/net_device.rst
+++ b/Documentation/networking/net_cachelines/net_device.rst
@@ -99,7 +99,6 @@ unsigned_int                        num_rx_queues
 unsigned_int                        real_num_rx_queues      -                   read_mostly         get_rps_cpu
 struct_bpf_prog*                    xdp_prog                -                   read_mostly         netif_elide_gro()
 unsigned_long                       gro_flush_timeout       -                   read_mostly         napi_complete_done
-u32                                 napi_defer_hard_irqs    -                   read_mostly         napi_complete_done
 unsigned_int                        gro_max_size            -                   read_mostly         skb_gro_receive
 unsigned_int                        gro_ipv4_max_size       -                   read_mostly         skb_gro_receive
 rx_handler_func_t*                  rx_handler              read_mostly         -                   __netif_receive_skb_core
@@ -183,3 +182,4 @@ struct_devlink_port*                devlink_port
 struct_dpll_pin*                    dpll_pin                                                        
 struct hlist_head                   page_pools
 struct dim_irq_moder*               irq_moder
+u32                                 napi_defer_hard_irqs
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 2e40a137dc12..f28b96c95259 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -377,6 +377,7 @@ struct napi_struct {
 	struct list_head	dev_list;
 	struct hlist_node	napi_hash_node;
 	int			irq;
+	u32			defer_hard_irqs;
 };
 
 enum {
@@ -2075,7 +2076,6 @@ struct net_device {
 	unsigned int		real_num_rx_queues;
 	struct netdev_rx_queue	*_rx;
 	unsigned long		gro_flush_timeout;
-	u32			napi_defer_hard_irqs;
 	unsigned int		gro_max_size;
 	unsigned int		gro_ipv4_max_size;
 	rx_handler_func_t __rcu	*rx_handler;
@@ -2398,6 +2398,7 @@ struct net_device {
 
 	/** @irq_moder: dim parameters used if IS_ENABLED(CONFIG_DIMLIB). */
 	struct dim_irq_moder	*irq_moder;
+	u32			napi_defer_hard_irqs;
 
 	u8			priv[] ____cacheline_aligned
 				       __counted_by(priv_len);
diff --git a/net/core/dev.c b/net/core/dev.c
index 8f4dead64284..d3d0680664b3 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -6221,7 +6221,7 @@ bool napi_complete_done(struct napi_struct *n, int work_done)
 	if (work_done) {
 		if (n->gro_bitmask)
 			timeout = READ_ONCE(n->dev->gro_flush_timeout);
-		n->defer_hard_irqs_count = READ_ONCE(n->dev->napi_defer_hard_irqs);
+		n->defer_hard_irqs_count = napi_get_defer_hard_irqs(n);
 	}
 	if (n->defer_hard_irqs_count > 0) {
 		n->defer_hard_irqs_count--;
@@ -6359,7 +6359,7 @@ static void busy_poll_stop(struct napi_struct *napi, void *have_poll_lock,
 	bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx);
 
 	if (flags & NAPI_F_PREFER_BUSY_POLL) {
-		napi->defer_hard_irqs_count = READ_ONCE(napi->dev->napi_defer_hard_irqs);
+		napi->defer_hard_irqs_count = napi_get_defer_hard_irqs(napi);
 		timeout = READ_ONCE(napi->dev->gro_flush_timeout);
 		if (napi->defer_hard_irqs_count && timeout) {
 			hrtimer_start(&napi->timer, ns_to_ktime(timeout), HRTIMER_MODE_REL_PINNED);
@@ -6641,6 +6641,7 @@ void netif_napi_add_weight(struct net_device *dev, struct napi_struct *napi,
 	INIT_HLIST_NODE(&napi->napi_hash_node);
 	hrtimer_init(&napi->timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL_PINNED);
 	napi->timer.function = napi_watchdog;
+	napi_set_defer_hard_irqs(napi, READ_ONCE(dev->napi_defer_hard_irqs));
 	init_gro_hash(napi);
 	napi->skb = NULL;
 	INIT_LIST_HEAD(&napi->rx_list);
@@ -11023,7 +11024,7 @@ void netdev_sw_irq_coalesce_default_on(struct net_device *dev)
 
 	if (!IS_ENABLED(CONFIG_PREEMPT_RT)) {
 		dev->gro_flush_timeout = 20000;
-		dev->napi_defer_hard_irqs = 1;
+		netdev_set_defer_hard_irqs(dev, 1);
 	}
 }
 EXPORT_SYMBOL_GPL(netdev_sw_irq_coalesce_default_on);
@@ -11960,7 +11961,6 @@ static void __init net_dev_struct_check(void)
 	CACHELINE_ASSERT_GROUP_MEMBER(struct net_device, net_device_read_rx, real_num_rx_queues);
 	CACHELINE_ASSERT_GROUP_MEMBER(struct net_device, net_device_read_rx, _rx);
 	CACHELINE_ASSERT_GROUP_MEMBER(struct net_device, net_device_read_rx, gro_flush_timeout);
-	CACHELINE_ASSERT_GROUP_MEMBER(struct net_device, net_device_read_rx, napi_defer_hard_irqs);
 	CACHELINE_ASSERT_GROUP_MEMBER(struct net_device, net_device_read_rx, gro_max_size);
 	CACHELINE_ASSERT_GROUP_MEMBER(struct net_device, net_device_read_rx, gro_ipv4_max_size);
 	CACHELINE_ASSERT_GROUP_MEMBER(struct net_device, net_device_read_rx, rx_handler);
@@ -11972,7 +11972,7 @@ static void __init net_dev_struct_check(void)
 #ifdef CONFIG_NET_XGRESS
 	CACHELINE_ASSERT_GROUP_MEMBER(struct net_device, net_device_read_rx, tcx_ingress);
 #endif
-	CACHELINE_ASSERT_GROUP_SIZE(struct net_device, net_device_read_rx, 104);
+	CACHELINE_ASSERT_GROUP_SIZE(struct net_device, net_device_read_rx, 100);
 }
 
 /*
diff --git a/net/core/dev.h b/net/core/dev.h
index 5654325c5b71..f24fa38a2cac 100644
--- a/net/core/dev.h
+++ b/net/core/dev.h
@@ -138,6 +138,42 @@ static inline void netif_set_gro_ipv4_max_size(struct net_device *dev,
 	WRITE_ONCE(dev->gro_ipv4_max_size, size);
 }
 
+/**
+ * napi_get_defer_hard_irqs - get the NAPI's defer_hard_irqs
+ * @n: napi struct to get the defer_hard_irqs field from
+ *
+ * Return: the per-NAPI value of the defer_hard_irqs field.
+ */
+static inline u32 napi_get_defer_hard_irqs(const struct napi_struct *n)
+{
+	return READ_ONCE(n->defer_hard_irqs);
+}
+
+/**
+ * napi_set_defer_hard_irqs - set the defer_hard_irqs for a napi
+ * @n: napi_struct to set the defer_hard_irqs field
+ * @defer: the value the field should be set to
+ */
+static inline void napi_set_defer_hard_irqs(struct napi_struct *n, u32 defer)
+{
+	WRITE_ONCE(n->defer_hard_irqs, defer);
+}
+
+/**
+ * netdev_set_defer_hard_irqs - set defer_hard_irqs for all NAPIs of a netdev
+ * @netdev: the net_device for which all NAPIs will have their defer_hard_irqs set
+ * @defer: the defer_hard_irqs value to set
+ */
+static inline void netdev_set_defer_hard_irqs(struct net_device *netdev,
+					      u32 defer)
+{
+	struct napi_struct *napi;
+
+	WRITE_ONCE(netdev->napi_defer_hard_irqs, defer);
+	list_for_each_entry(napi, &netdev->napi_list, dev_list)
+		napi_set_defer_hard_irqs(napi, defer);
+}
+
 int rps_cpumask_housekeeping(struct cpumask *mask);
 
 #if defined(CONFIG_DEBUG_NET) && defined(CONFIG_BPF_SYSCALL)
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 05cf5347f25e..25125f356a15 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -429,7 +429,7 @@ static int change_napi_defer_hard_irqs(struct net_device *dev, unsigned long val
 	if (val > S32_MAX)
 		return -ERANGE;
 
-	WRITE_ONCE(dev->napi_defer_hard_irqs, val);
+	netdev_set_defer_hard_irqs(dev, (u32)val);
 	return 0;
 }
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [RFC net-next v3 2/9] netdev-genl: Dump napi_defer_hard_irqs
  2024-09-12 10:07 [RFC net-next v3 0/9] Add support for per-NAPI config via netlink Joe Damato
  2024-09-12 10:07 ` [RFC net-next v3 1/9] net: napi: Make napi_defer_hard_irqs per-NAPI Joe Damato
@ 2024-09-12 10:07 ` Joe Damato
  2024-09-12 10:07 ` [RFC net-next v3 3/9] net: napi: Make gro_flush_timeout per-NAPI Joe Damato
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 18+ messages in thread
From: Joe Damato @ 2024-09-12 10:07 UTC (permalink / raw)
  To: netdev
  Cc: mkarsten, kuba, skhawaja, sdf, bjorn, amritha.nambiar,
	sridhar.samudrala, Joe Damato, Donald Hunter, David S. Miller,
	Eric Dumazet, Paolo Abeni, Jesper Dangaard Brouer, Xuan Zhuo,
	Daniel Jurgens, open list

Support dumping defer_hard_irqs for a NAPI ID.

Signed-off-by: Joe Damato <jdamato@fastly.com>
---
 Documentation/netlink/specs/netdev.yaml | 8 ++++++++
 include/uapi/linux/netdev.h             | 1 +
 net/core/netdev-genl.c                  | 5 +++++
 tools/include/uapi/linux/netdev.h       | 1 +
 4 files changed, 15 insertions(+)

diff --git a/Documentation/netlink/specs/netdev.yaml b/Documentation/netlink/specs/netdev.yaml
index 959755be4d7f..351d93994a66 100644
--- a/Documentation/netlink/specs/netdev.yaml
+++ b/Documentation/netlink/specs/netdev.yaml
@@ -244,6 +244,13 @@ attribute-sets:
              threaded mode. If NAPI is not in threaded mode (i.e. uses normal
              softirq context), the attribute will be absent.
         type: u32
+      -
+        name: defer-hard-irqs
+        doc: The number of consecutive empty polls before IRQ deferral ends
+             and hardware IRQs are re-enabled.
+        type: u32
+        checks:
+          max: s32-max
   -
     name: queue
     attributes:
@@ -593,6 +600,7 @@ operations:
             - ifindex
             - irq
             - pid
+            - defer-hard-irqs
       dump:
         request:
           attributes:
diff --git a/include/uapi/linux/netdev.h b/include/uapi/linux/netdev.h
index 43742ac5b00d..43bb1aad9611 100644
--- a/include/uapi/linux/netdev.h
+++ b/include/uapi/linux/netdev.h
@@ -121,6 +121,7 @@ enum {
 	NETDEV_A_NAPI_ID,
 	NETDEV_A_NAPI_IRQ,
 	NETDEV_A_NAPI_PID,
+	NETDEV_A_NAPI_DEFER_HARD_IRQS,
 
 	__NETDEV_A_NAPI_MAX,
 	NETDEV_A_NAPI_MAX = (__NETDEV_A_NAPI_MAX - 1)
diff --git a/net/core/netdev-genl.c b/net/core/netdev-genl.c
index a17d7eaeb001..e67918dd97be 100644
--- a/net/core/netdev-genl.c
+++ b/net/core/netdev-genl.c
@@ -160,6 +160,7 @@ static int
 netdev_nl_napi_fill_one(struct sk_buff *rsp, struct napi_struct *napi,
 			const struct genl_info *info)
 {
+	u32 napi_defer_hard_irqs;
 	void *hdr;
 	pid_t pid;
 
@@ -188,6 +189,10 @@ netdev_nl_napi_fill_one(struct sk_buff *rsp, struct napi_struct *napi,
 			goto nla_put_failure;
 	}
 
+	napi_defer_hard_irqs = napi_get_defer_hard_irqs(napi);
+	if (nla_put_s32(rsp, NETDEV_A_NAPI_DEFER_HARD_IRQS, napi_defer_hard_irqs))
+		goto nla_put_failure;
+
 	genlmsg_end(rsp, hdr);
 
 	return 0;
diff --git a/tools/include/uapi/linux/netdev.h b/tools/include/uapi/linux/netdev.h
index 43742ac5b00d..43bb1aad9611 100644
--- a/tools/include/uapi/linux/netdev.h
+++ b/tools/include/uapi/linux/netdev.h
@@ -121,6 +121,7 @@ enum {
 	NETDEV_A_NAPI_ID,
 	NETDEV_A_NAPI_IRQ,
 	NETDEV_A_NAPI_PID,
+	NETDEV_A_NAPI_DEFER_HARD_IRQS,
 
 	__NETDEV_A_NAPI_MAX,
 	NETDEV_A_NAPI_MAX = (__NETDEV_A_NAPI_MAX - 1)
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [RFC net-next v3 3/9] net: napi: Make gro_flush_timeout per-NAPI
  2024-09-12 10:07 [RFC net-next v3 0/9] Add support for per-NAPI config via netlink Joe Damato
  2024-09-12 10:07 ` [RFC net-next v3 1/9] net: napi: Make napi_defer_hard_irqs per-NAPI Joe Damato
  2024-09-12 10:07 ` [RFC net-next v3 2/9] netdev-genl: Dump napi_defer_hard_irqs Joe Damato
@ 2024-09-12 10:07 ` Joe Damato
  2024-09-12 10:07 ` [RFC net-next v3 4/9] netdev-genl: Dump gro_flush_timeout Joe Damato
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 18+ messages in thread
From: Joe Damato @ 2024-09-12 10:07 UTC (permalink / raw)
  To: netdev
  Cc: mkarsten, kuba, skhawaja, sdf, bjorn, amritha.nambiar,
	sridhar.samudrala, Joe Damato, David S. Miller, Eric Dumazet,
	Paolo Abeni, Jonathan Corbet, Jiri Pirko,
	Sebastian Andrzej Siewior, Lorenzo Bianconi, Johannes Berg,
	Breno Leitao, Alexander Lobakin, open list:DOCUMENTATION,
	open list

Allow per-NAPI gro_flush_timeout setting.

The existing sysfs parameter is respected; writes to sysfs will write to
all NAPI structs for the device and the net_device gro_flush_timeout
field. Reads from sysfs will read from the net_device field.

The ability to set gro_flush_timeout on specific NAPI instances will be
added in a later commit, via netdev-genl.

Signed-off-by: Joe Damato <jdamato@fastly.com>
---
 .../networking/net_cachelines/net_device.rst  |  2 +-
 include/linux/netdevice.h                     |  3 +-
 net/core/dev.c                                | 12 +++---
 net/core/dev.h                                | 39 +++++++++++++++++++
 net/core/net-sysfs.c                          |  2 +-
 5 files changed, 49 insertions(+), 9 deletions(-)

diff --git a/Documentation/networking/net_cachelines/net_device.rst b/Documentation/networking/net_cachelines/net_device.rst
index eeeb7c925ec5..3d02ae79c850 100644
--- a/Documentation/networking/net_cachelines/net_device.rst
+++ b/Documentation/networking/net_cachelines/net_device.rst
@@ -98,7 +98,6 @@ struct_netdev_queue*                _rx                     read_mostly
 unsigned_int                        num_rx_queues                                                   
 unsigned_int                        real_num_rx_queues      -                   read_mostly         get_rps_cpu
 struct_bpf_prog*                    xdp_prog                -                   read_mostly         netif_elide_gro()
-unsigned_long                       gro_flush_timeout       -                   read_mostly         napi_complete_done
 unsigned_int                        gro_max_size            -                   read_mostly         skb_gro_receive
 unsigned_int                        gro_ipv4_max_size       -                   read_mostly         skb_gro_receive
 rx_handler_func_t*                  rx_handler              read_mostly         -                   __netif_receive_skb_core
@@ -182,4 +181,5 @@ struct_devlink_port*                devlink_port
 struct_dpll_pin*                    dpll_pin                                                        
 struct hlist_head                   page_pools
 struct dim_irq_moder*               irq_moder
+unsigned_long                       gro_flush_timeout
 u32                                 napi_defer_hard_irqs
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index f28b96c95259..3e07ab8e0295 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -377,6 +377,7 @@ struct napi_struct {
 	struct list_head	dev_list;
 	struct hlist_node	napi_hash_node;
 	int			irq;
+	unsigned long		gro_flush_timeout;
 	u32			defer_hard_irqs;
 };
 
@@ -2075,7 +2076,6 @@ struct net_device {
 	int			ifindex;
 	unsigned int		real_num_rx_queues;
 	struct netdev_rx_queue	*_rx;
-	unsigned long		gro_flush_timeout;
 	unsigned int		gro_max_size;
 	unsigned int		gro_ipv4_max_size;
 	rx_handler_func_t __rcu	*rx_handler;
@@ -2398,6 +2398,7 @@ struct net_device {
 
 	/** @irq_moder: dim parameters used if IS_ENABLED(CONFIG_DIMLIB). */
 	struct dim_irq_moder	*irq_moder;
+	unsigned long		gro_flush_timeout;
 	u32			napi_defer_hard_irqs;
 
 	u8			priv[] ____cacheline_aligned
diff --git a/net/core/dev.c b/net/core/dev.c
index d3d0680664b3..f2fd503516de 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -6220,12 +6220,12 @@ bool napi_complete_done(struct napi_struct *n, int work_done)
 
 	if (work_done) {
 		if (n->gro_bitmask)
-			timeout = READ_ONCE(n->dev->gro_flush_timeout);
+			timeout = napi_get_gro_flush_timeout(n);
 		n->defer_hard_irqs_count = napi_get_defer_hard_irqs(n);
 	}
 	if (n->defer_hard_irqs_count > 0) {
 		n->defer_hard_irqs_count--;
-		timeout = READ_ONCE(n->dev->gro_flush_timeout);
+		timeout = napi_get_gro_flush_timeout(n);
 		if (timeout)
 			ret = false;
 	}
@@ -6360,7 +6360,7 @@ static void busy_poll_stop(struct napi_struct *napi, void *have_poll_lock,
 
 	if (flags & NAPI_F_PREFER_BUSY_POLL) {
 		napi->defer_hard_irqs_count = napi_get_defer_hard_irqs(napi);
-		timeout = READ_ONCE(napi->dev->gro_flush_timeout);
+		timeout = napi_get_gro_flush_timeout(napi);
 		if (napi->defer_hard_irqs_count && timeout) {
 			hrtimer_start(&napi->timer, ns_to_ktime(timeout), HRTIMER_MODE_REL_PINNED);
 			skip_schedule = true;
@@ -6642,6 +6642,7 @@ void netif_napi_add_weight(struct net_device *dev, struct napi_struct *napi,
 	hrtimer_init(&napi->timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL_PINNED);
 	napi->timer.function = napi_watchdog;
 	napi_set_defer_hard_irqs(napi, READ_ONCE(dev->napi_defer_hard_irqs));
+	napi_set_gro_flush_timeout(napi, READ_ONCE(dev->gro_flush_timeout));
 	init_gro_hash(napi);
 	napi->skb = NULL;
 	INIT_LIST_HEAD(&napi->rx_list);
@@ -11023,7 +11024,7 @@ void netdev_sw_irq_coalesce_default_on(struct net_device *dev)
 	WARN_ON(dev->reg_state == NETREG_REGISTERED);
 
 	if (!IS_ENABLED(CONFIG_PREEMPT_RT)) {
-		dev->gro_flush_timeout = 20000;
+		netdev_set_gro_flush_timeout(dev, 20000);
 		netdev_set_defer_hard_irqs(dev, 1);
 	}
 }
@@ -11960,7 +11961,6 @@ static void __init net_dev_struct_check(void)
 	CACHELINE_ASSERT_GROUP_MEMBER(struct net_device, net_device_read_rx, ifindex);
 	CACHELINE_ASSERT_GROUP_MEMBER(struct net_device, net_device_read_rx, real_num_rx_queues);
 	CACHELINE_ASSERT_GROUP_MEMBER(struct net_device, net_device_read_rx, _rx);
-	CACHELINE_ASSERT_GROUP_MEMBER(struct net_device, net_device_read_rx, gro_flush_timeout);
 	CACHELINE_ASSERT_GROUP_MEMBER(struct net_device, net_device_read_rx, gro_max_size);
 	CACHELINE_ASSERT_GROUP_MEMBER(struct net_device, net_device_read_rx, gro_ipv4_max_size);
 	CACHELINE_ASSERT_GROUP_MEMBER(struct net_device, net_device_read_rx, rx_handler);
@@ -11972,7 +11972,7 @@ static void __init net_dev_struct_check(void)
 #ifdef CONFIG_NET_XGRESS
 	CACHELINE_ASSERT_GROUP_MEMBER(struct net_device, net_device_read_rx, tcx_ingress);
 #endif
-	CACHELINE_ASSERT_GROUP_SIZE(struct net_device, net_device_read_rx, 100);
+	CACHELINE_ASSERT_GROUP_SIZE(struct net_device, net_device_read_rx, 92);
 }
 
 /*
diff --git a/net/core/dev.h b/net/core/dev.h
index f24fa38a2cac..a9d5f678564a 100644
--- a/net/core/dev.h
+++ b/net/core/dev.h
@@ -174,6 +174,45 @@ static inline void netdev_set_defer_hard_irqs(struct net_device *netdev,
 		napi_set_defer_hard_irqs(napi, defer);
 }
 
+/**
+ * napi_get_gro_flush_timeout - get the gro_flush_timeout
+ * @n: napi struct to get the gro_flush_timeout from
+ *
+ * Return: the per-NAPI value of the gro_flush_timeout field.
+ */
+static inline unsigned long napi_get_gro_flush_timeout(const struct napi_struct *n)
+{
+	return READ_ONCE(n->gro_flush_timeout);
+}
+
+/**
+ * napi_set_gro_flush_timeout - set the gro_flush_timeout for a napi
+ * @n: napi struct to set the gro_flush_timeout
+ * @timeout: timeout value to set
+ *
+ * napi_set_gro_flush_timeout sets the per-NAPI gro_flush_timeout
+ */
+static inline void napi_set_gro_flush_timeout(struct napi_struct *n,
+					      unsigned long timeout)
+{
+	WRITE_ONCE(n->gro_flush_timeout, timeout);
+}
+
+/**
+ * netdev_set_gro_flush_timeout - set gro_flush_timeout for all NAPIs of a netdev
+ * @netdev: the net_device for which all NAPIs will have their gro_flush_timeout set
+ * @timeout: the timeout value to set
+ */
+static inline void netdev_set_gro_flush_timeout(struct net_device *netdev,
+						unsigned long timeout)
+{
+	struct napi_struct *napi;
+
+	WRITE_ONCE(netdev->gro_flush_timeout, timeout);
+	list_for_each_entry(napi, &netdev->napi_list, dev_list)
+		napi_set_gro_flush_timeout(napi, timeout);
+}
+
 int rps_cpumask_housekeeping(struct cpumask *mask);
 
 #if defined(CONFIG_DEBUG_NET) && defined(CONFIG_BPF_SYSCALL)
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 25125f356a15..2d9afc6e2161 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -409,7 +409,7 @@ NETDEVICE_SHOW_RW(tx_queue_len, fmt_dec);
 
 static int change_gro_flush_timeout(struct net_device *dev, unsigned long val)
 {
-	WRITE_ONCE(dev->gro_flush_timeout, val);
+	netdev_set_gro_flush_timeout(dev, val);
 	return 0;
 }
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [RFC net-next v3 4/9] netdev-genl: Dump gro_flush_timeout
  2024-09-12 10:07 [RFC net-next v3 0/9] Add support for per-NAPI config via netlink Joe Damato
                   ` (2 preceding siblings ...)
  2024-09-12 10:07 ` [RFC net-next v3 3/9] net: napi: Make gro_flush_timeout per-NAPI Joe Damato
@ 2024-09-12 10:07 ` Joe Damato
  2024-09-12 10:07 ` [RFC net-next v3 5/9] net: napi: Add napi_config Joe Damato
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 18+ messages in thread
From: Joe Damato @ 2024-09-12 10:07 UTC (permalink / raw)
  To: netdev
  Cc: mkarsten, kuba, skhawaja, sdf, bjorn, amritha.nambiar,
	sridhar.samudrala, Joe Damato, David S. Miller, Eric Dumazet,
	Paolo Abeni, Donald Hunter, Jesper Dangaard Brouer, Xuan Zhuo,
	Daniel Jurgens, open list

Support dumping gro_flush_timeout for a NAPI ID.

Signed-off-by: Joe Damato <jdamato@fastly.com>
---
 Documentation/netlink/specs/netdev.yaml | 6 ++++++
 include/uapi/linux/netdev.h             | 1 +
 net/core/netdev-genl.c                  | 5 +++++
 tools/include/uapi/linux/netdev.h       | 1 +
 4 files changed, 13 insertions(+)

diff --git a/Documentation/netlink/specs/netdev.yaml b/Documentation/netlink/specs/netdev.yaml
index 351d93994a66..906091c3059a 100644
--- a/Documentation/netlink/specs/netdev.yaml
+++ b/Documentation/netlink/specs/netdev.yaml
@@ -251,6 +251,11 @@ attribute-sets:
         type: u32
         checks:
           max: s32-max
+      -
+        name: gro-flush-timeout
+        doc: The timeout, in nanoseconds, after which the NAPI watchdog
+             timer fires and NAPI processing is scheduled.
+        type: uint
   -
     name: queue
     attributes:
@@ -601,6 +606,7 @@ operations:
             - irq
             - pid
             - defer-hard-irqs
+            - gro-flush-timeout
       dump:
         request:
           attributes:
diff --git a/include/uapi/linux/netdev.h b/include/uapi/linux/netdev.h
index 43bb1aad9611..b088a34e9254 100644
--- a/include/uapi/linux/netdev.h
+++ b/include/uapi/linux/netdev.h
@@ -122,6 +122,7 @@ enum {
 	NETDEV_A_NAPI_IRQ,
 	NETDEV_A_NAPI_PID,
 	NETDEV_A_NAPI_DEFER_HARD_IRQS,
+	NETDEV_A_NAPI_GRO_FLUSH_TIMEOUT,
 
 	__NETDEV_A_NAPI_MAX,
 	NETDEV_A_NAPI_MAX = (__NETDEV_A_NAPI_MAX - 1)
diff --git a/net/core/netdev-genl.c b/net/core/netdev-genl.c
index e67918dd97be..4698034b5a49 100644
--- a/net/core/netdev-genl.c
+++ b/net/core/netdev-genl.c
@@ -160,6 +160,7 @@ static int
 netdev_nl_napi_fill_one(struct sk_buff *rsp, struct napi_struct *napi,
 			const struct genl_info *info)
 {
+	unsigned long gro_flush_timeout;
 	u32 napi_defer_hard_irqs;
 	void *hdr;
 	pid_t pid;
@@ -193,6 +194,10 @@ netdev_nl_napi_fill_one(struct sk_buff *rsp, struct napi_struct *napi,
 	if (nla_put_s32(rsp, NETDEV_A_NAPI_DEFER_HARD_IRQS, napi_defer_hard_irqs))
 		goto nla_put_failure;
 
+	gro_flush_timeout = napi_get_gro_flush_timeout(napi);
+	if (nla_put_uint(rsp, NETDEV_A_NAPI_GRO_FLUSH_TIMEOUT, gro_flush_timeout))
+		goto nla_put_failure;
+
 	genlmsg_end(rsp, hdr);
 
 	return 0;
diff --git a/tools/include/uapi/linux/netdev.h b/tools/include/uapi/linux/netdev.h
index 43bb1aad9611..b088a34e9254 100644
--- a/tools/include/uapi/linux/netdev.h
+++ b/tools/include/uapi/linux/netdev.h
@@ -122,6 +122,7 @@ enum {
 	NETDEV_A_NAPI_IRQ,
 	NETDEV_A_NAPI_PID,
 	NETDEV_A_NAPI_DEFER_HARD_IRQS,
+	NETDEV_A_NAPI_GRO_FLUSH_TIMEOUT,
 
 	__NETDEV_A_NAPI_MAX,
 	NETDEV_A_NAPI_MAX = (__NETDEV_A_NAPI_MAX - 1)
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [RFC net-next v3 5/9] net: napi: Add napi_config
  2024-09-12 10:07 [RFC net-next v3 0/9] Add support for per-NAPI config via netlink Joe Damato
                   ` (3 preceding siblings ...)
  2024-09-12 10:07 ` [RFC net-next v3 4/9] netdev-genl: Dump gro_flush_timeout Joe Damato
@ 2024-09-12 10:07 ` Joe Damato
  2024-09-12 15:03   ` Joe Damato
  2024-09-13 17:42   ` Stanislav Fomichev
  2024-09-12 10:07 ` [RFC net-next v3 6/9] netdev-genl: Support setting per-NAPI config values Joe Damato
                   ` (3 subsequent siblings)
  8 siblings, 2 replies; 18+ messages in thread
From: Joe Damato @ 2024-09-12 10:07 UTC (permalink / raw)
  To: netdev
  Cc: mkarsten, kuba, skhawaja, sdf, bjorn, amritha.nambiar,
	sridhar.samudrala, Joe Damato, David S. Miller, Eric Dumazet,
	Paolo Abeni, Jonathan Corbet, Jiri Pirko,
	Sebastian Andrzej Siewior, Lorenzo Bianconi, David Ahern,
	Johannes Berg, open list:DOCUMENTATION, open list

Add a persistent NAPI config area for NAPI configuration to the core.
Drivers opt-in to setting the storage for a NAPI by passing an index
when calling netif_napi_add_storage.

napi_config is allocated in alloc_netdev_mqs, freed in free_netdev
(after the NAPIs are deleted), and set to 0 when napi_enable is called.

Drivers which call netif_napi_add_storage will have persistent
NAPI IDs.

Signed-off-by: Joe Damato <jdamato@fastly.com>
---
 .../networking/net_cachelines/net_device.rst  |  1 +
 include/linux/netdevice.h                     | 34 +++++++++
 net/core/dev.c                                | 74 +++++++++++++++++--
 net/core/dev.h                                | 12 +++
 4 files changed, 113 insertions(+), 8 deletions(-)

diff --git a/Documentation/networking/net_cachelines/net_device.rst b/Documentation/networking/net_cachelines/net_device.rst
index 3d02ae79c850..11d659051f5e 100644
--- a/Documentation/networking/net_cachelines/net_device.rst
+++ b/Documentation/networking/net_cachelines/net_device.rst
@@ -183,3 +183,4 @@ struct hlist_head                   page_pools
 struct dim_irq_moder*               irq_moder
 unsigned_long                       gro_flush_timeout
 u32                                 napi_defer_hard_irqs
+struct napi_config*                 napi_config
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 3e07ab8e0295..08afc96179f9 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -342,6 +342,15 @@ struct gro_list {
  */
 #define GRO_HASH_BUCKETS	8
 
+/*
+ * Structure for per-NAPI storage
+ */
+struct napi_config {
+	u64 gro_flush_timeout;
+	u32 defer_hard_irqs;
+	unsigned int napi_id;
+};
+
 /*
  * Structure for NAPI scheduling similar to tasklet but with weighting
  */
@@ -379,6 +388,8 @@ struct napi_struct {
 	int			irq;
 	unsigned long		gro_flush_timeout;
 	u32			defer_hard_irqs;
+	int			index;
+	struct napi_config	*config;
 };
 
 enum {
@@ -2011,6 +2022,9 @@ enum netdev_reg_state {
  *	@dpll_pin: Pointer to the SyncE source pin of a DPLL subsystem,
  *		   where the clock is recovered.
  *
+ *	@napi_config: An array of napi_config structures containing per-NAPI
+ *		      settings.
+ *
  *	FIXME: cleanup struct net_device such that network protocol info
  *	moves out.
  */
@@ -2400,6 +2414,7 @@ struct net_device {
 	struct dim_irq_moder	*irq_moder;
 	unsigned long		gro_flush_timeout;
 	u32			napi_defer_hard_irqs;
+	struct napi_config	*napi_config;
 
 	u8			priv[] ____cacheline_aligned
 				       __counted_by(priv_len);
@@ -2650,6 +2665,23 @@ netif_napi_add_tx_weight(struct net_device *dev,
 	netif_napi_add_weight(dev, napi, poll, weight);
 }
 
+/**
+ * netif_napi_add_storage - initialize a NAPI context and set storage area
+ * @dev: network device
+ * @napi: NAPI context
+ * @poll: polling function
+ * @weight: the poll weight of this NAPI
+ * @index: the NAPI index
+ */
+static inline void
+netif_napi_add_storage(struct net_device *dev, struct napi_struct *napi,
+		       int (*poll)(struct napi_struct *, int), int index)
+{
+	napi->index = index;
+	napi->config = &dev->napi_config[index];
+	netif_napi_add_weight(dev, napi, poll, NAPI_POLL_WEIGHT);
+}
+
 /**
  * netif_napi_add_tx() - initialize a NAPI context to be used for Tx only
  * @dev:  network device
@@ -2685,6 +2717,8 @@ void __netif_napi_del(struct napi_struct *napi);
  */
 static inline void netif_napi_del(struct napi_struct *napi)
 {
+	napi->config = NULL;
+	napi->index = -1;
 	__netif_napi_del(napi);
 	synchronize_net();
 }
diff --git a/net/core/dev.c b/net/core/dev.c
index f2fd503516de..ca2227d0b8ed 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -6493,6 +6493,18 @@ EXPORT_SYMBOL(napi_busy_loop);
 
 #endif /* CONFIG_NET_RX_BUSY_POLL */
 
+static void napi_hash_add_with_id(struct napi_struct *napi, unsigned int napi_id)
+{
+	spin_lock(&napi_hash_lock);
+
+	napi->napi_id = napi_id;
+
+	hlist_add_head_rcu(&napi->napi_hash_node,
+			   &napi_hash[napi->napi_id % HASH_SIZE(napi_hash)]);
+
+	spin_unlock(&napi_hash_lock);
+}
+
 static void napi_hash_add(struct napi_struct *napi)
 {
 	if (test_bit(NAPI_STATE_NO_BUSY_POLL, &napi->state))
@@ -6505,12 +6517,13 @@ static void napi_hash_add(struct napi_struct *napi)
 		if (unlikely(++napi_gen_id < MIN_NAPI_ID))
 			napi_gen_id = MIN_NAPI_ID;
 	} while (napi_by_id(napi_gen_id));
-	napi->napi_id = napi_gen_id;
-
-	hlist_add_head_rcu(&napi->napi_hash_node,
-			   &napi_hash[napi->napi_id % HASH_SIZE(napi_hash)]);
 
 	spin_unlock(&napi_hash_lock);
+
+	napi_hash_add_with_id(napi, napi_gen_id);
+
+	if (napi->config)
+		napi->config->napi_id = napi_gen_id;
 }
 
 /* Warning : caller is responsible to make sure rcu grace period
@@ -6631,6 +6644,21 @@ void netif_queue_set_napi(struct net_device *dev, unsigned int queue_index,
 }
 EXPORT_SYMBOL(netif_queue_set_napi);
 
+static void napi_restore_config(struct napi_struct *n)
+{
+	n->defer_hard_irqs = n->config->defer_hard_irqs;
+	n->gro_flush_timeout = n->config->gro_flush_timeout;
+	napi_hash_add_with_id(n, n->config->napi_id);
+}
+
+static void napi_save_config(struct napi_struct *n)
+{
+	n->config->defer_hard_irqs = n->defer_hard_irqs;
+	n->config->gro_flush_timeout = n->gro_flush_timeout;
+	n->config->napi_id = n->napi_id;
+	napi_hash_del(n);
+}
+
 void netif_napi_add_weight(struct net_device *dev, struct napi_struct *napi,
 			   int (*poll)(struct napi_struct *, int), int weight)
 {
@@ -6641,8 +6669,6 @@ void netif_napi_add_weight(struct net_device *dev, struct napi_struct *napi,
 	INIT_HLIST_NODE(&napi->napi_hash_node);
 	hrtimer_init(&napi->timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL_PINNED);
 	napi->timer.function = napi_watchdog;
-	napi_set_defer_hard_irqs(napi, READ_ONCE(dev->napi_defer_hard_irqs));
-	napi_set_gro_flush_timeout(napi, READ_ONCE(dev->gro_flush_timeout));
 	init_gro_hash(napi);
 	napi->skb = NULL;
 	INIT_LIST_HEAD(&napi->rx_list);
@@ -6660,7 +6686,15 @@ void netif_napi_add_weight(struct net_device *dev, struct napi_struct *napi,
 	set_bit(NAPI_STATE_SCHED, &napi->state);
 	set_bit(NAPI_STATE_NPSVC, &napi->state);
 	list_add_rcu(&napi->dev_list, &dev->napi_list);
-	napi_hash_add(napi);
+	/* if there is no config associated with this NAPI, generate a fresh
+	 * NAPI ID and hash it. Otherwise, settings will be restored in napi_enable.
+	 */
+	if (!napi->config || (napi->config && !napi->config->napi_id)) {
+		napi_hash_add(napi);
+		napi_set_defer_hard_irqs(napi, READ_ONCE(dev->napi_defer_hard_irqs));
+		napi_set_gro_flush_timeout(napi, READ_ONCE(dev->gro_flush_timeout));
+	}
+
 	napi_get_frags_check(napi);
 	/* Create kthread for this napi if dev->threaded is set.
 	 * Clear dev->threaded if kthread creation failed so that
@@ -6692,6 +6726,9 @@ void napi_disable(struct napi_struct *n)
 
 	hrtimer_cancel(&n->timer);
 
+	if (n->config)
+		napi_save_config(n);
+
 	clear_bit(NAPI_STATE_DISABLE, &n->state);
 }
 EXPORT_SYMBOL(napi_disable);
@@ -6714,6 +6751,9 @@ void napi_enable(struct napi_struct *n)
 		if (n->dev->threaded && n->thread)
 			new |= NAPIF_STATE_THREADED;
 	} while (!try_cmpxchg(&n->state, &val, new));
+
+	if (n->config)
+		napi_restore_config(n);
 }
 EXPORT_SYMBOL(napi_enable);
 
@@ -6736,7 +6776,13 @@ void __netif_napi_del(struct napi_struct *napi)
 	if (!test_and_clear_bit(NAPI_STATE_LISTED, &napi->state))
 		return;
 
-	napi_hash_del(napi);
+	if (!napi->config) {
+		napi_hash_del(napi);
+	} else {
+		napi->index = -1;
+		napi->config = NULL;
+	}
+
 	list_del_rcu(&napi->dev_list);
 	napi_free_frags(napi);
 
@@ -11049,6 +11095,8 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
 		unsigned int txqs, unsigned int rxqs)
 {
 	struct net_device *dev;
+	size_t napi_config_sz;
+	unsigned int maxqs;
 
 	BUG_ON(strlen(name) >= sizeof(dev->name));
 
@@ -11062,6 +11110,9 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
 		return NULL;
 	}
 
+	WARN_ON_ONCE(txqs != rxqs);
+	maxqs = max(txqs, rxqs);
+
 	dev = kvzalloc(struct_size(dev, priv, sizeof_priv),
 		       GFP_KERNEL_ACCOUNT | __GFP_RETRY_MAYFAIL);
 	if (!dev)
@@ -11136,6 +11187,11 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
 	if (!dev->ethtool)
 		goto free_all;
 
+	napi_config_sz = array_size(maxqs, sizeof(*dev->napi_config));
+	dev->napi_config = kvzalloc(napi_config_sz, GFP_KERNEL_ACCOUNT);
+	if (!dev->napi_config)
+		goto free_all;
+
 	strscpy(dev->name, name);
 	dev->name_assign_type = name_assign_type;
 	dev->group = INIT_NETDEV_GROUP;
@@ -11197,6 +11253,8 @@ void free_netdev(struct net_device *dev)
 	list_for_each_entry_safe(p, n, &dev->napi_list, dev_list)
 		netif_napi_del(p);
 
+	kvfree(dev->napi_config);
+
 	ref_tracker_dir_exit(&dev->refcnt_tracker);
 #ifdef CONFIG_PCPU_DEV_REFCNT
 	free_percpu(dev->pcpu_refcnt);
diff --git a/net/core/dev.h b/net/core/dev.h
index a9d5f678564a..9eb3f559275c 100644
--- a/net/core/dev.h
+++ b/net/core/dev.h
@@ -167,11 +167,17 @@ static inline void napi_set_defer_hard_irqs(struct napi_struct *n, u32 defer)
 static inline void netdev_set_defer_hard_irqs(struct net_device *netdev,
 					      u32 defer)
 {
+	unsigned int count = max(netdev->num_rx_queues,
+				 netdev->num_tx_queues);
 	struct napi_struct *napi;
+	int i;
 
 	WRITE_ONCE(netdev->napi_defer_hard_irqs, defer);
 	list_for_each_entry(napi, &netdev->napi_list, dev_list)
 		napi_set_defer_hard_irqs(napi, defer);
+
+	for (i = 0; i < count; i++)
+		netdev->napi_config[i].defer_hard_irqs = defer;
 }
 
 /**
@@ -206,11 +212,17 @@ static inline void napi_set_gro_flush_timeout(struct napi_struct *n,
 static inline void netdev_set_gro_flush_timeout(struct net_device *netdev,
 						unsigned long timeout)
 {
+	unsigned int count = max(netdev->num_rx_queues,
+				 netdev->num_tx_queues);
 	struct napi_struct *napi;
+	int i;
 
 	WRITE_ONCE(netdev->gro_flush_timeout, timeout);
 	list_for_each_entry(napi, &netdev->napi_list, dev_list)
 		napi_set_gro_flush_timeout(napi, timeout);
+
+	for (i = 0; i < count; i++)
+		netdev->napi_config[i].gro_flush_timeout = timeout;
 }
 
 int rps_cpumask_housekeeping(struct cpumask *mask);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [RFC net-next v3 6/9] netdev-genl: Support setting per-NAPI config values
  2024-09-12 10:07 [RFC net-next v3 0/9] Add support for per-NAPI config via netlink Joe Damato
                   ` (4 preceding siblings ...)
  2024-09-12 10:07 ` [RFC net-next v3 5/9] net: napi: Add napi_config Joe Damato
@ 2024-09-12 10:07 ` Joe Damato
  2024-09-12 10:38   ` Joe Damato
  2024-09-12 10:07 ` [RFC net-next v3 7/9] bnxt: Add support for napi storage Joe Damato
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 18+ messages in thread
From: Joe Damato @ 2024-09-12 10:07 UTC (permalink / raw)
  To: netdev
  Cc: mkarsten, kuba, skhawaja, sdf, bjorn, amritha.nambiar,
	sridhar.samudrala, Joe Damato, David S. Miller, Eric Dumazet,
	Paolo Abeni, Donald Hunter, Jesper Dangaard Brouer, Xuan Zhuo,
	Daniel Jurgens, Larysa Zaremba, open list

Add support to set per-NAPI defer_hard_irqs and gro_flush_timeout.

Signed-off-by: Joe Damato <jdamato@fastly.com>
---
 Documentation/netlink/specs/netdev.yaml | 11 ++++++
 include/uapi/linux/netdev.h             |  1 +
 net/core/netdev-genl-gen.c              | 14 ++++++++
 net/core/netdev-genl-gen.h              |  1 +
 net/core/netdev-genl.c                  | 45 +++++++++++++++++++++++++
 tools/include/uapi/linux/netdev.h       |  1 +
 6 files changed, 73 insertions(+)

diff --git a/Documentation/netlink/specs/netdev.yaml b/Documentation/netlink/specs/netdev.yaml
index 906091c3059a..319c1e179b08 100644
--- a/Documentation/netlink/specs/netdev.yaml
+++ b/Documentation/netlink/specs/netdev.yaml
@@ -633,6 +633,17 @@ operations:
             - rx-bytes
             - tx-packets
             - tx-bytes
+    -
+      name: napi-set
+      doc: Set configurable NAPI instance settings.
+      attribute-set: napi
+      flags: [ admin-perm ]
+      do:
+        request:
+          attributes:
+            - napi-id
+            - defer-hard-irqs
+            - gro-flush-timeout
 
 mcast-groups:
   list:
diff --git a/include/uapi/linux/netdev.h b/include/uapi/linux/netdev.h
index b088a34e9254..4c5bfbc85504 100644
--- a/include/uapi/linux/netdev.h
+++ b/include/uapi/linux/netdev.h
@@ -188,6 +188,7 @@ enum {
 	NETDEV_CMD_QUEUE_GET,
 	NETDEV_CMD_NAPI_GET,
 	NETDEV_CMD_QSTATS_GET,
+	NETDEV_CMD_NAPI_SET,
 
 	__NETDEV_CMD_MAX,
 	NETDEV_CMD_MAX = (__NETDEV_CMD_MAX - 1)
diff --git a/net/core/netdev-genl-gen.c b/net/core/netdev-genl-gen.c
index 8350a0afa9ec..ead570c6ff7d 100644
--- a/net/core/netdev-genl-gen.c
+++ b/net/core/netdev-genl-gen.c
@@ -74,6 +74,13 @@ static const struct nla_policy netdev_qstats_get_nl_policy[NETDEV_A_QSTATS_SCOPE
 	[NETDEV_A_QSTATS_SCOPE] = NLA_POLICY_MASK(NLA_UINT, 0x1),
 };
 
+/* NETDEV_CMD_NAPI_SET - set */
+static const struct nla_policy netdev_napi_set_nl_policy[NETDEV_A_NAPI_GRO_FLUSH_TIMEOUT + 1] = {
+	[NETDEV_A_NAPI_ID] = { .type = NLA_U32, },
+	[NETDEV_A_NAPI_DEFER_HARD_IRQS] = { .type = NLA_S32 },
+	[NETDEV_A_NAPI_GRO_FLUSH_TIMEOUT] = { .type = NLA_UINT },
+};
+
 /* Ops table for netdev */
 static const struct genl_split_ops netdev_nl_ops[] = {
 	{
@@ -151,6 +158,13 @@ static const struct genl_split_ops netdev_nl_ops[] = {
 		.maxattr	= NETDEV_A_QSTATS_SCOPE,
 		.flags		= GENL_CMD_CAP_DUMP,
 	},
+	{
+		.cmd		= NETDEV_CMD_NAPI_SET,
+		.doit		= netdev_nl_napi_set_doit,
+		.policy		= netdev_napi_set_nl_policy,
+		.maxattr	= NETDEV_A_NAPI_GRO_FLUSH_TIMEOUT,
+		.flags		= GENL_ADMIN_PERM | GENL_CMD_CAP_DO,
+	},
 };
 
 static const struct genl_multicast_group netdev_nl_mcgrps[] = {
diff --git a/net/core/netdev-genl-gen.h b/net/core/netdev-genl-gen.h
index 4db40fd5b4a9..b70cb0f20acb 100644
--- a/net/core/netdev-genl-gen.h
+++ b/net/core/netdev-genl-gen.h
@@ -28,6 +28,7 @@ int netdev_nl_queue_get_dumpit(struct sk_buff *skb,
 			       struct netlink_callback *cb);
 int netdev_nl_napi_get_doit(struct sk_buff *skb, struct genl_info *info);
 int netdev_nl_napi_get_dumpit(struct sk_buff *skb, struct netlink_callback *cb);
+int netdev_nl_napi_set_doit(struct sk_buff *skb, struct genl_info *info);
 int netdev_nl_qstats_get_dumpit(struct sk_buff *skb,
 				struct netlink_callback *cb);
 
diff --git a/net/core/netdev-genl.c b/net/core/netdev-genl.c
index 4698034b5a49..3c90a2fd005a 100644
--- a/net/core/netdev-genl.c
+++ b/net/core/netdev-genl.c
@@ -300,6 +300,51 @@ int netdev_nl_napi_get_dumpit(struct sk_buff *skb, struct netlink_callback *cb)
 	return err;
 }
 
+static int
+netdev_nl_napi_set_config(struct napi_struct *napi, struct genl_info *info)
+{
+	u64 gro_flush_timeout = 0;
+	u32 defer = 0;
+
+	if (info->attrs[NETDEV_A_NAPI_DEFER_HARD_IRQS]) {
+		defer = nla_get_u32(info->attrs[NETDEV_A_NAPI_DEFER_HARD_IRQS]);
+		napi_set_defer_hard_irqs(napi, defer);
+	}
+
+	if (info->attrs[NETDEV_A_NAPI_GRO_FLUSH_TIMEOUT]) {
+		gro_flush_timeout = nla_get_uint(info->attrs[NETDEV_A_NAPI_GRO_FLUSH_TIMEOUT]);
+		napi_set_gro_flush_timeout(napi, gro_flush_timeout);
+	}
+
+	return 0;
+}
+
+int netdev_nl_napi_set_doit(struct sk_buff *skb, struct genl_info *info)
+{
+	struct napi_struct *napi;
+	unsigned int napi_id;
+	int err;
+
+	if (GENL_REQ_ATTR_CHECK(info, NETDEV_A_NAPI_ID))
+		return -EINVAL;
+
+	napi_id = nla_get_u32(info->attrs[NETDEV_A_NAPI_ID]);
+
+	rtnl_lock();
+
+	napi = napi_by_id(napi_id);
+	if (napi) {
+		err = netdev_nl_napi_set_config(napi, info);
+	} else {
+		NL_SET_BAD_ATTR(info->extack, info->attrs[NETDEV_A_NAPI_ID]);
+		err = -ENOENT;
+	}
+
+	rtnl_unlock();
+
+	return err;
+}
+
 static int
 netdev_nl_queue_fill_one(struct sk_buff *rsp, struct net_device *netdev,
 			 u32 q_idx, u32 q_type, const struct genl_info *info)
diff --git a/tools/include/uapi/linux/netdev.h b/tools/include/uapi/linux/netdev.h
index b088a34e9254..4c5bfbc85504 100644
--- a/tools/include/uapi/linux/netdev.h
+++ b/tools/include/uapi/linux/netdev.h
@@ -188,6 +188,7 @@ enum {
 	NETDEV_CMD_QUEUE_GET,
 	NETDEV_CMD_NAPI_GET,
 	NETDEV_CMD_QSTATS_GET,
+	NETDEV_CMD_NAPI_SET,
 
 	__NETDEV_CMD_MAX,
 	NETDEV_CMD_MAX = (__NETDEV_CMD_MAX - 1)
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [RFC net-next v3 7/9] bnxt: Add support for napi storage
  2024-09-12 10:07 [RFC net-next v3 0/9] Add support for per-NAPI config via netlink Joe Damato
                   ` (5 preceding siblings ...)
  2024-09-12 10:07 ` [RFC net-next v3 6/9] netdev-genl: Support setting per-NAPI config values Joe Damato
@ 2024-09-12 10:07 ` Joe Damato
  2024-09-12 10:07 ` [RFC net-next v3 8/9] mlx5: " Joe Damato
  2024-09-12 10:07 ` [RFC net-next v3 9/9] mlx4: Add support for napi storage to RX CQs Joe Damato
  8 siblings, 0 replies; 18+ messages in thread
From: Joe Damato @ 2024-09-12 10:07 UTC (permalink / raw)
  To: netdev
  Cc: mkarsten, kuba, skhawaja, sdf, bjorn, amritha.nambiar,
	sridhar.samudrala, Joe Damato, Michael Chan, David S. Miller,
	Eric Dumazet, Paolo Abeni, open list

Use netif_napi_add_storage to assign per-NAPI storage when initializing
NAPIs.

Signed-off-by: Joe Damato <jdamato@fastly.com>
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 6e422e24750a..44fc38efff33 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -10986,7 +10986,8 @@ static void bnxt_init_napi(struct bnxt *bp)
 		cp_nr_rings--;
 	for (i = 0; i < cp_nr_rings; i++) {
 		bnapi = bp->bnapi[i];
-		netif_napi_add(bp->dev, &bnapi->napi, poll_fn);
+		netif_napi_add_storage(bp->dev, &bnapi->napi, poll_fn,
+				       bnapi->index);
 	}
 	if (BNXT_CHIP_TYPE_NITRO_A0(bp)) {
 		bnapi = bp->bnapi[cp_nr_rings];
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [RFC net-next v3 8/9] mlx5: Add support for napi storage
  2024-09-12 10:07 [RFC net-next v3 0/9] Add support for per-NAPI config via netlink Joe Damato
                   ` (6 preceding siblings ...)
  2024-09-12 10:07 ` [RFC net-next v3 7/9] bnxt: Add support for napi storage Joe Damato
@ 2024-09-12 10:07 ` Joe Damato
  2024-09-12 10:07 ` [RFC net-next v3 9/9] mlx4: Add support for napi storage to RX CQs Joe Damato
  8 siblings, 0 replies; 18+ messages in thread
From: Joe Damato @ 2024-09-12 10:07 UTC (permalink / raw)
  To: netdev
  Cc: mkarsten, kuba, skhawaja, sdf, bjorn, amritha.nambiar,
	sridhar.samudrala, Joe Damato, Saeed Mahameed, Tariq Toukan,
	Leon Romanovsky, David S. Miller, Eric Dumazet, Paolo Abeni,
	open list:MELLANOX MLX5 core VPI driver, open list

Use netif_napi_add_storage to assign per-NAPI storage when initializing
NAPIs.

Signed-off-by: Joe Damato <jdamato@fastly.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 47e7a80d221b..aa4eb5677b8a 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -2696,7 +2696,7 @@ static int mlx5e_open_channel(struct mlx5e_priv *priv, int ix,
 	c->aff_mask = irq_get_effective_affinity_mask(irq);
 	c->lag_port = mlx5e_enumerate_lag_port(mdev, ix);
 
-	netif_napi_add(netdev, &c->napi, mlx5e_napi_poll);
+	netif_napi_add_storage(netdev, &c->napi, mlx5e_napi_poll, ix);
 	netif_napi_set_irq(&c->napi, irq);
 
 	err = mlx5e_open_queues(c, params, cparam);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [RFC net-next v3 9/9] mlx4: Add support for napi storage to RX CQs
  2024-09-12 10:07 [RFC net-next v3 0/9] Add support for per-NAPI config via netlink Joe Damato
                   ` (7 preceding siblings ...)
  2024-09-12 10:07 ` [RFC net-next v3 8/9] mlx5: " Joe Damato
@ 2024-09-12 10:07 ` Joe Damato
  8 siblings, 0 replies; 18+ messages in thread
From: Joe Damato @ 2024-09-12 10:07 UTC (permalink / raw)
  To: netdev
  Cc: mkarsten, kuba, skhawaja, sdf, bjorn, amritha.nambiar,
	sridhar.samudrala, Joe Damato, Tariq Toukan, David S. Miller,
	Eric Dumazet, Paolo Abeni,
	open list:MELLANOX MLX4 core VPI driver, open list

Use netif_napi_add_storage to assign per-NAPI storage when initializing
RX CQ NAPIs.

Presently, struct napi_config only carries two settings, both used for
RX, so there is no need to support TX CQs yet.

Signed-off-by: Joe Damato <jdamato@fastly.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_cq.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_cq.c b/drivers/net/ethernet/mellanox/mlx4/en_cq.c
index 461cc2c79c71..6943268e8256 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_cq.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_cq.c
@@ -156,7 +156,8 @@ int mlx4_en_activate_cq(struct mlx4_en_priv *priv, struct mlx4_en_cq *cq,
 		break;
 	case RX:
 		cq->mcq.comp = mlx4_en_rx_irq;
-		netif_napi_add(cq->dev, &cq->napi, mlx4_en_poll_rx_cq);
+		netif_napi_add_storage(cq->dev, &cq->napi, mlx4_en_poll_rx_cq,
+				       cq_idx);
 		netif_napi_set_irq(&cq->napi, irq);
 		napi_enable(&cq->napi);
 		netif_queue_set_napi(cq->dev, cq_idx, NETDEV_QUEUE_TYPE_RX, &cq->napi);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [RFC net-next v3 6/9] netdev-genl: Support setting per-NAPI config values
  2024-09-12 10:07 ` [RFC net-next v3 6/9] netdev-genl: Support setting per-NAPI config values Joe Damato
@ 2024-09-12 10:38   ` Joe Damato
  0 siblings, 0 replies; 18+ messages in thread
From: Joe Damato @ 2024-09-12 10:38 UTC (permalink / raw)
  To: netdev
  Cc: mkarsten, kuba, skhawaja, sdf, bjorn, amritha.nambiar,
	sridhar.samudrala, David S. Miller, Eric Dumazet, Paolo Abeni,
	Donald Hunter, Jesper Dangaard Brouer, Xuan Zhuo, Daniel Jurgens,
	Larysa Zaremba, open list

On Thu, Sep 12, 2024 at 10:07:14AM +0000, Joe Damato wrote:
> Add support to set per-NAPI defer_hard_irqs and gro_flush_timeout.
> 
> Signed-off-by: Joe Damato <jdamato@fastly.com>
> ---
>  Documentation/netlink/specs/netdev.yaml | 11 ++++++
>  include/uapi/linux/netdev.h             |  1 +
>  net/core/netdev-genl-gen.c              | 14 ++++++++
>  net/core/netdev-genl-gen.h              |  1 +
>  net/core/netdev-genl.c                  | 45 +++++++++++++++++++++++++
>  tools/include/uapi/linux/netdev.h       |  1 +
>  6 files changed, 73 insertions(+)

My apologies; there are a few merge conflicts with this patch against
the latest net-next/main. I pulled recently, but it looks like
Mina's work got merged (which is excellent news) after I pulled, so
my patch won't apply cleanly.

I can wait the 48 hours to resend or simply reply with an updated
patch 6 in the body of my message, as this is only an RFC.

Let me know what works best for anyone who wants to actually test
this (vs just review it).

Sorry about that; I should have pulled this morning before sending :)

- Joe

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC net-next v3 5/9] net: napi: Add napi_config
  2024-09-12 10:07 ` [RFC net-next v3 5/9] net: napi: Add napi_config Joe Damato
@ 2024-09-12 15:03   ` Joe Damato
  2024-09-13 21:44     ` Willem de Bruijn
  2024-09-13 17:42   ` Stanislav Fomichev
  1 sibling, 1 reply; 18+ messages in thread
From: Joe Damato @ 2024-09-12 15:03 UTC (permalink / raw)
  To: netdev
  Cc: mkarsten, kuba, skhawaja, sdf, bjorn, amritha.nambiar,
	sridhar.samudrala, David S. Miller, Eric Dumazet, Paolo Abeni,
	Jonathan Corbet, Jiri Pirko, Sebastian Andrzej Siewior,
	Lorenzo Bianconi, David Ahern, Johannes Berg,
	open list:DOCUMENTATION, open list

Several comments below on different things in this patch that I just noticed.

On Thu, Sep 12, 2024 at 10:07:13AM +0000, Joe Damato wrote:
> Add a persistent NAPI config area for NAPI configuration to the core.
> Drivers opt-in to setting the storage for a NAPI by passing an index
> when calling netif_napi_add_storage.
> 
> napi_config is allocated in alloc_netdev_mqs, freed in free_netdev
> (after the NAPIs are deleted), and set to 0 when napi_enable is called.

I forgot to re-read all the commit messages. I will do that for rfcv4
and make sure they are all correct; this one is not.
 
> Drivers which call netif_napi_add_storage will have persistent
> NAPI IDs.
> 
> Signed-off-by: Joe Damato <jdamato@fastly.com>
> ---
>  .../networking/net_cachelines/net_device.rst  |  1 +
>  include/linux/netdevice.h                     | 34 +++++++++
>  net/core/dev.c                                | 74 +++++++++++++++++--
>  net/core/dev.h                                | 12 +++
>  4 files changed, 113 insertions(+), 8 deletions(-)
>

[...]

> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 3e07ab8e0295..08afc96179f9 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h

[...]

> @@ -2685,6 +2717,8 @@ void __netif_napi_del(struct napi_struct *napi);
>   */
>  static inline void netif_napi_del(struct napi_struct *napi)
>  {
> +	napi->config = NULL;
> +	napi->index = -1;
>  	__netif_napi_del(napi);
>  	synchronize_net();
>  }

I don't quite see how, but occasionally when changing the queue
count with ethtool -L, I seem to trigger a page_pool issue.

I assumed it was related to my changes either above in netif_napi_del or
below in __netif_napi_del, but I'm not so sure, because the issue does not
happen reliably and I can't figure out how my change would cause this.

When it does happen, this is the stack trace:

  page_pool_empty_ring() page_pool refcnt -30528 violation
  ------------[ cut here ]------------
  Negative(-1) inflight packet-pages
  WARNING: CPU: 1 PID: 5117 at net/core/page_pool.c:617 page_pool_inflight+0x4c/0x90

  [...]

  CPU: 1 UID: 0 PID: 5117 Comm: ethtool Tainted: G        W          [...]

  RIP: 0010:page_pool_inflight+0x4c/0x90
  Code: e4 b8 00 00 00 00 44 0f 48 e0 44 89 e0 41 5c e9 8a c1 1b 00 66 90 45 85 e4 79 ef 44 89 e6 48 c7 c7 78 56 26 82 e8 14 63 78 ff <0f> 0b eb dc 65 8b 05 b5 af 71 7e 48 0f a3 05 21 d9 3b 01 73 d7 48
  RSP: 0018:ffffc9001d01b640 EFLAGS: 00010282
  RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000027
  RDX: 0000000000000027 RSI: 00000000ffefffff RDI: ffff88bf4f45c988
  RBP: ffff8900da55a800 R08: 0000000000000000 R09: ffffc9001d01b480
  R10: 0000000000000001 R11: 0000000000000001 R12: 00000000ffffffff
  R13: ffffffff82cd35b0 R14: ffff890062550f00 R15: ffff8881b0e85400
  FS:  00007fa9cb382740(0000) GS:ffff88bf4f440000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000558baa9d3b38 CR3: 000000011222a000 CR4: 0000000000350ef0
  Call Trace:
   <TASK>
   ? __warn+0x80/0x110
   ? page_pool_inflight+0x4c/0x90
   ? report_bug+0x19c/0x1b0
   ? handle_bug+0x3c/0x70
   ? exc_invalid_op+0x18/0x70
   ? asm_exc_invalid_op+0x1a/0x20
   ? page_pool_inflight+0x4c/0x90
   page_pool_release+0x10e/0x1d0
   page_pool_destroy+0x66/0x160
   mlx5e_free_rq+0x69/0xb0 [mlx5_core]
   mlx5e_close_queues+0x39/0x150 [mlx5_core]
   mlx5e_close_channel+0x1c/0x60 [mlx5_core]
   mlx5e_close_channels+0x49/0xa0 [mlx5_core]
   mlx5e_switch_priv_channels+0xa9/0x120 [mlx5_core]
   ? __pfx_mlx5e_num_channels_changed_ctx+0x10/0x10 [mlx5_core]
   mlx5e_safe_switch_params+0xb8/0xf0 [mlx5_core]
   mlx5e_ethtool_set_channels+0x17a/0x290 [mlx5_core]
   ethnl_set_channels+0x243/0x310
   ethnl_default_set_doit+0xc1/0x170
   genl_family_rcv_msg_doit+0xd9/0x130
   genl_rcv_msg+0x18f/0x2c0
   ? __pfx_ethnl_default_set_doit+0x10/0x10
   ? __pfx_genl_rcv_msg+0x10/0x10
   netlink_rcv_skb+0x5a/0x110
   genl_rcv+0x28/0x40
   netlink_unicast+0x1aa/0x260
   netlink_sendmsg+0x1e9/0x420
   __sys_sendto+0x1d5/0x1f0
   ? handle_mm_fault+0x1cb/0x290
   ? do_user_addr_fault+0x558/0x7c0
   __x64_sys_sendto+0x29/0x30
   do_syscall_64+0x5d/0x170
   entry_SYSCALL_64_after_hwframe+0x76/0x7e

> diff --git a/net/core/dev.c b/net/core/dev.c
> index f2fd503516de..ca2227d0b8ed 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c

[...]

> @@ -6736,7 +6776,13 @@ void __netif_napi_del(struct napi_struct *napi)
>  	if (!test_and_clear_bit(NAPI_STATE_LISTED, &napi->state))
>  		return;
>  
> -	napi_hash_del(napi);
> +	if (!napi->config) {
> +		napi_hash_del(napi);
> +	} else {
> +		napi->index = -1;
> +		napi->config = NULL;
> +	}
> +

See above; perhaps something related to this change is triggering the page pool
warning occasionally.

>  	list_del_rcu(&napi->dev_list);
>  	napi_free_frags(napi);
>  
> @@ -11049,6 +11095,8 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
>  		unsigned int txqs, unsigned int rxqs)
>  {
>  	struct net_device *dev;
> +	size_t napi_config_sz;
> +	unsigned int maxqs;
>  
>  	BUG_ON(strlen(name) >= sizeof(dev->name));
>  
> @@ -11062,6 +11110,9 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
>  		return NULL;
>  	}
>  
> +	WARN_ON_ONCE(txqs != rxqs);

This warning triggers for me on boot every time with mlx5 NICs.

The code in mlx5 seems to get the rxq and txq maximums in:
  drivers/net/ethernet/mellanox/mlx5/core/en_main.c
    mlx5e_create_netdev

  which does:

    txqs = mlx5e_get_max_num_txqs(mdev, profile);
    rxqs = mlx5e_get_max_num_rxqs(mdev, profile);

    netdev = alloc_etherdev_mqs(sizeof(struct mlx5e_priv), txqs, rxqs);

In my case for my device, txqs: 760, rxqs: 63.

I would guess that this warning will trigger every time for mlx5 NICs
and would be quite annoying.

We may just want to change the allocation logic to allocate
txqs + rxqs, remove the WARN_ON_ONCE, and be OK with some wasted
space?
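
Very roughly, something like this instead (just a sketch of that
idea, reusing the existing kvzalloc/free_all path from the hunk
further down):

	napi_config_sz = array_size(txqs + rxqs, sizeof(*dev->napi_config));
	dev->napi_config = kvzalloc(napi_config_sz, GFP_KERNEL_ACCOUNT);
	if (!dev->napi_config)
		goto free_all;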

> +	maxqs = max(txqs, rxqs);
> +
>  	dev = kvzalloc(struct_size(dev, priv, sizeof_priv),
>  		       GFP_KERNEL_ACCOUNT | __GFP_RETRY_MAYFAIL);
>  	if (!dev)
> @@ -11136,6 +11187,11 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
>  	if (!dev->ethtool)
>  		goto free_all;
>  
> +	napi_config_sz = array_size(maxqs, sizeof(*dev->napi_config));
> +	dev->napi_config = kvzalloc(napi_config_sz, GFP_KERNEL_ACCOUNT);
> +	if (!dev->napi_config)
> +		goto free_all;
> +

[...]

> diff --git a/net/core/dev.h b/net/core/dev.h
> index a9d5f678564a..9eb3f559275c 100644
> --- a/net/core/dev.h
> +++ b/net/core/dev.h
> @@ -167,11 +167,17 @@ static inline void napi_set_defer_hard_irqs(struct napi_struct *n, u32 defer)
>  static inline void netdev_set_defer_hard_irqs(struct net_device *netdev,
>  					      u32 defer)
>  {
> +	unsigned int count = max(netdev->num_rx_queues,
> +				 netdev->num_tx_queues);
>  	struct napi_struct *napi;
> +	int i;
>  
>  	WRITE_ONCE(netdev->napi_defer_hard_irqs, defer);
>  	list_for_each_entry(napi, &netdev->napi_list, dev_list)
>  		napi_set_defer_hard_irqs(napi, defer);
> +
> +	for (i = 0; i < count; i++)
> +		netdev->napi_config[i].defer_hard_irqs = defer;

The above is incorrect. Some devices may have netdev->napi_config =
NULL if they don't call the add_storage wrapper.

Unless major feedback or changes that affect this code are requested,
in the rfcv4 branch I plan to fix this by adding a NULL check:

  if (netdev->napi_config)
    for (i = 0; i < count; i++)
      netdev->napi_config[i].defer_hard_irqs = defer;

>  
>  /**
> @@ -206,11 +212,17 @@ static inline void napi_set_gro_flush_timeout(struct napi_struct *n,
>  static inline void netdev_set_gro_flush_timeout(struct net_device *netdev,
>  						unsigned long timeout)
>  {
> +	unsigned int count = max(netdev->num_rx_queues,
> +				 netdev->num_tx_queues);
>  	struct napi_struct *napi;
> +	int i;
>  
>  	WRITE_ONCE(netdev->gro_flush_timeout, timeout);
>  	list_for_each_entry(napi, &netdev->napi_list, dev_list)
>  		napi_set_gro_flush_timeout(napi, timeout);
> +
> +	for (i = 0; i < count; i++)
> +		netdev->napi_config[i].gro_flush_timeout = timeout;

Same as above.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC net-next v3 5/9] net: napi: Add napi_config
  2024-09-12 10:07 ` [RFC net-next v3 5/9] net: napi: Add napi_config Joe Damato
  2024-09-12 15:03   ` Joe Damato
@ 2024-09-13 17:42   ` Stanislav Fomichev
  2024-09-13 18:55     ` Joe Damato
  2024-09-13 19:39     ` Joe Damato
  1 sibling, 2 replies; 18+ messages in thread
From: Stanislav Fomichev @ 2024-09-13 17:42 UTC (permalink / raw)
  To: Joe Damato
  Cc: netdev, mkarsten, kuba, skhawaja, sdf, bjorn, amritha.nambiar,
	sridhar.samudrala, David S. Miller, Eric Dumazet, Paolo Abeni,
	Jonathan Corbet, Jiri Pirko, Sebastian Andrzej Siewior,
	Lorenzo Bianconi, David Ahern, Johannes Berg,
	open list:DOCUMENTATION, open list

On 09/12, Joe Damato wrote:
> Add a persistent NAPI config area for NAPI configuration to the core.
> Drivers opt-in to setting the storage for a NAPI by passing an index
> when calling netif_napi_add_storage.
> 
> napi_config is allocated in alloc_netdev_mqs, freed in free_netdev
> (after the NAPIs are deleted), and set to 0 when napi_enable is called.
> 
> Drivers which call netif_napi_add_storage will have persistent
> NAPI IDs.
> 
> Signed-off-by: Joe Damato <jdamato@fastly.com>
> ---
>  .../networking/net_cachelines/net_device.rst  |  1 +
>  include/linux/netdevice.h                     | 34 +++++++++
>  net/core/dev.c                                | 74 +++++++++++++++++--
>  net/core/dev.h                                | 12 +++
>  4 files changed, 113 insertions(+), 8 deletions(-)
> 
> diff --git a/Documentation/networking/net_cachelines/net_device.rst b/Documentation/networking/net_cachelines/net_device.rst
> index 3d02ae79c850..11d659051f5e 100644
> --- a/Documentation/networking/net_cachelines/net_device.rst
> +++ b/Documentation/networking/net_cachelines/net_device.rst
> @@ -183,3 +183,4 @@ struct hlist_head                   page_pools
>  struct dim_irq_moder*               irq_moder
>  unsigned_long                       gro_flush_timeout
>  u32                                 napi_defer_hard_irqs
> +struct napi_config*                 napi_config
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 3e07ab8e0295..08afc96179f9 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -342,6 +342,15 @@ struct gro_list {
>   */
>  #define GRO_HASH_BUCKETS	8
>  
> +/*
> + * Structure for per-NAPI storage
> + */
> +struct napi_config {
> +	u64 gro_flush_timeout;
> +	u32 defer_hard_irqs;
> +	unsigned int napi_id;
> +};
> +
>  /*
>   * Structure for NAPI scheduling similar to tasklet but with weighting
>   */
> @@ -379,6 +388,8 @@ struct napi_struct {
>  	int			irq;
>  	unsigned long		gro_flush_timeout;
>  	u32			defer_hard_irqs;
> +	int			index;
> +	struct napi_config	*config;
>  };
>  
>  enum {
> @@ -2011,6 +2022,9 @@ enum netdev_reg_state {
>   *	@dpll_pin: Pointer to the SyncE source pin of a DPLL subsystem,
>   *		   where the clock is recovered.
>   *
> + *	@napi_config: An array of napi_config structures containing per-NAPI
> + *		      settings.
> + *
>   *	FIXME: cleanup struct net_device such that network protocol info
>   *	moves out.
>   */
> @@ -2400,6 +2414,7 @@ struct net_device {
>  	struct dim_irq_moder	*irq_moder;
>  	unsigned long		gro_flush_timeout;
>  	u32			napi_defer_hard_irqs;
> +	struct napi_config	*napi_config;
>  
>  	u8			priv[] ____cacheline_aligned
>  				       __counted_by(priv_len);
> @@ -2650,6 +2665,23 @@ netif_napi_add_tx_weight(struct net_device *dev,
>  	netif_napi_add_weight(dev, napi, poll, weight);
>  }
>  
> +/**
> + * netif_napi_add_storage - initialize a NAPI context and set storage area
> + * @dev: network device
> + * @napi: NAPI context
> + * @poll: polling function
> + * @weight: the poll weight of this NAPI
> + * @index: the NAPI index
> + */
> +static inline void
> +netif_napi_add_storage(struct net_device *dev, struct napi_struct *napi,
> +		       int (*poll)(struct napi_struct *, int), int index)
> +{
> +	napi->index = index;
> +	napi->config = &dev->napi_config[index];
> +	netif_napi_add_weight(dev, napi, poll, NAPI_POLL_WEIGHT);
> +}
> +
>  /**
>   * netif_napi_add_tx() - initialize a NAPI context to be used for Tx only
>   * @dev:  network device
> @@ -2685,6 +2717,8 @@ void __netif_napi_del(struct napi_struct *napi);
>   */
>  static inline void netif_napi_del(struct napi_struct *napi)
>  {
> +	napi->config = NULL;
> +	napi->index = -1;
>  	__netif_napi_del(napi);
>  	synchronize_net();
>  }
> diff --git a/net/core/dev.c b/net/core/dev.c
> index f2fd503516de..ca2227d0b8ed 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -6493,6 +6493,18 @@ EXPORT_SYMBOL(napi_busy_loop);
>  
>  #endif /* CONFIG_NET_RX_BUSY_POLL */
>  
> +static void napi_hash_add_with_id(struct napi_struct *napi, unsigned int napi_id)
> +{
> +	spin_lock(&napi_hash_lock);
> +
> +	napi->napi_id = napi_id;
> +
> +	hlist_add_head_rcu(&napi->napi_hash_node,
> +			   &napi_hash[napi->napi_id % HASH_SIZE(napi_hash)]);
> +
> +	spin_unlock(&napi_hash_lock);
> +}
> +
>  static void napi_hash_add(struct napi_struct *napi)
>  {
>  	if (test_bit(NAPI_STATE_NO_BUSY_POLL, &napi->state))
> @@ -6505,12 +6517,13 @@ static void napi_hash_add(struct napi_struct *napi)
>  		if (unlikely(++napi_gen_id < MIN_NAPI_ID))
>  			napi_gen_id = MIN_NAPI_ID;
>  	} while (napi_by_id(napi_gen_id));

[..]

> -	napi->napi_id = napi_gen_id;
> -
> -	hlist_add_head_rcu(&napi->napi_hash_node,
> -			   &napi_hash[napi->napi_id % HASH_SIZE(napi_hash)]);
>  
>  	spin_unlock(&napi_hash_lock);
> +
> +	napi_hash_add_with_id(napi, napi_gen_id);

nit: it is very unlikely that napi_gen_id is gonna wrap around after the
spin_unlock above, but maybe it's safer to have the following?

static void __napi_hash_add_with_id(struct napi_struct *napi, unsigned int napi_id)
{
	napi->napi_id = napi_id;
	hlist_add_head_rcu(&napi->napi_hash_node,
			   &napi_hash[napi->napi_id % HASH_SIZE(napi_hash)]);
}

static void napi_hash_add_with_id(struct napi_struct *napi, unsigned int napi_id)
{
	spin_lock(&napi_hash_lock);
	__napi_hash_add_with_id(...);
	spin_unlock(&napi_hash_lock);
}

And use __napi_hash_add_with_id here before spin_unlock?
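
Roughly, napi_hash_add would then look like this (sketch only):

	spin_lock(&napi_hash_lock);

	/* pick the next unused id as before ... */
	do {
		if (unlikely(++napi_gen_id < MIN_NAPI_ID))
			napi_gen_id = MIN_NAPI_ID;
	} while (napi_by_id(napi_gen_id));

	/* ... and publish it while still holding the lock */
	__napi_hash_add_with_id(napi, napi_gen_id);

	spin_unlock(&napi_hash_lock);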

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC net-next v3 5/9] net: napi: Add napi_config
  2024-09-13 17:42   ` Stanislav Fomichev
@ 2024-09-13 18:55     ` Joe Damato
  2024-09-13 19:39     ` Joe Damato
  1 sibling, 0 replies; 18+ messages in thread
From: Joe Damato @ 2024-09-13 18:55 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: netdev, mkarsten, kuba, skhawaja, sdf, bjorn, amritha.nambiar,
	sridhar.samudrala, David S. Miller, Eric Dumazet, Paolo Abeni,
	Jonathan Corbet, Jiri Pirko, Sebastian Andrzej Siewior,
	Lorenzo Bianconi, David Ahern, Johannes Berg,
	open list:DOCUMENTATION, open list

On Fri, Sep 13, 2024 at 10:42:37AM -0700, Stanislav Fomichev wrote:
> On 09/12, Joe Damato wrote:

[...]

> > @@ -6505,12 +6517,13 @@ static void napi_hash_add(struct napi_struct *napi)
> >  		if (unlikely(++napi_gen_id < MIN_NAPI_ID))
> >  			napi_gen_id = MIN_NAPI_ID;
> >  	} while (napi_by_id(napi_gen_id));
> 
> [..]
> 
> > -	napi->napi_id = napi_gen_id;
> > -
> > -	hlist_add_head_rcu(&napi->napi_hash_node,
> > -			   &napi_hash[napi->napi_id % HASH_SIZE(napi_hash)]);
> >  
> >  	spin_unlock(&napi_hash_lock);
> > +
> > +	napi_hash_add_with_id(napi, napi_gen_id);
> 
> nit: it is very unlikely that napi_gen_id is gonna wrap around after the
> spin_unlock above, but maybe it's safer to have the following?
> 
> static void __napi_hash_add_with_id(struct napi_struct *napi, unsigned int napi_id)
> {
> 	napi->napi_id = napi_id;
> 	hlist_add_head_rcu(&napi->napi_hash_node,
> 			   &napi_hash[napi->napi_id % HASH_SIZE(napi_hash)]);
> }
> 
> static void napi_hash_add_with_id(struct napi_struct *napi, unsigned int napi_id)
> {
> 	spin_lock(&napi_hash_lock);
> 	__napi_hash_add_with_id(...);
> 	spin_unlock(&napi_hash_lock);
> }
> 
> And use __napi_hash_add_with_id here before spin_unlock?

Thanks for taking a look.

Sure, that seems reasonable. I can add that for the rfcv4.

I'll probably hold off on posting the rfcv4 until after LPC and/or
until I have some time to debug the mlx5/page_pool thing.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC net-next v3 5/9] net: napi: Add napi_config
  2024-09-13 17:42   ` Stanislav Fomichev
  2024-09-13 18:55     ` Joe Damato
@ 2024-09-13 19:39     ` Joe Damato
  2024-09-14 13:00       ` Joe Damato
  1 sibling, 1 reply; 18+ messages in thread
From: Joe Damato @ 2024-09-13 19:39 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: netdev, mkarsten, kuba, skhawaja, sdf, bjorn, amritha.nambiar,
	sridhar.samudrala, David S. Miller, Eric Dumazet, Paolo Abeni,
	Jonathan Corbet, Jiri Pirko, Sebastian Andrzej Siewior,
	Lorenzo Bianconi, David Ahern, Johannes Berg,
	open list:DOCUMENTATION, open list

On Fri, Sep 13, 2024 at 10:42:37AM -0700, Stanislav Fomichev wrote:
> On 09/12, Joe Damato wrote:

[...]

> > --- a/net/core/dev.c
> > +++ b/net/core/dev.c
> > @@ -6493,6 +6493,18 @@ EXPORT_SYMBOL(napi_busy_loop);
> >  
> >  #endif /* CONFIG_NET_RX_BUSY_POLL */
> >  
> > +static void napi_hash_add_with_id(struct napi_struct *napi, unsigned int napi_id)
> > +{
> > +	spin_lock(&napi_hash_lock);
> > +
> > +	napi->napi_id = napi_id;
> > +
> > +	hlist_add_head_rcu(&napi->napi_hash_node,
> > +			   &napi_hash[napi->napi_id % HASH_SIZE(napi_hash)]);
> > +
> > +	spin_unlock(&napi_hash_lock);
> > +}
> > +
> >  static void napi_hash_add(struct napi_struct *napi)
> >  {
> >  	if (test_bit(NAPI_STATE_NO_BUSY_POLL, &napi->state))
> > @@ -6505,12 +6517,13 @@ static void napi_hash_add(struct napi_struct *napi)
> >  		if (unlikely(++napi_gen_id < MIN_NAPI_ID))
> >  			napi_gen_id = MIN_NAPI_ID;
> >  	} while (napi_by_id(napi_gen_id));
> 
> [..]
> 
> > -	napi->napi_id = napi_gen_id;
> > -
> > -	hlist_add_head_rcu(&napi->napi_hash_node,
> > -			   &napi_hash[napi->napi_id % HASH_SIZE(napi_hash)]);
> >  
> >  	spin_unlock(&napi_hash_lock);
> > +
> > +	napi_hash_add_with_id(napi, napi_gen_id);
> 
> nit: it is very unlikely that napi_gen_id is gonna wrap around after the
> spin_unlock above, but maybe it's safer to have the following?
> 
> static void __napi_hash_add_with_id(struct napi_struct *napi, unsigned int napi_id)
> {
> 	napi->napi_id = napi_id;
> 	hlist_add_head_rcu(&napi->napi_hash_node,
> 			   &napi_hash[napi->napi_id % HASH_SIZE(napi_hash)]);
> }
> 
> static void napi_hash_add_with_id(struct napi_struct *napi, unsigned int napi_id)
> {
> 	spin_lock(&napi_hash_lock);
> 	__napi_hash_add_with_id(...);
> 	spin_unlock(&napi_hash_lock);
> }
> 
> And use __napi_hash_add_with_id here before spin_unlock?

After making this change and re-testing on a couple reboots, I haven't
been able to reproduce the page pool issue I mentioned in the other
email [1].

Not sure if I've just been... "getting lucky" or if this fixed
something that I won't fully grasp until I read the mlx5 source
again.

Will test it a few more times, though.

[1]: https://lore.kernel.org/netdev/ZuMC2fYPPtWggB2w@LQ3V64L9R2.homenet.telecomitalia.it/

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC net-next v3 5/9] net: napi: Add napi_config
  2024-09-12 15:03   ` Joe Damato
@ 2024-09-13 21:44     ` Willem de Bruijn
  2024-09-13 22:10       ` Joe Damato
  0 siblings, 1 reply; 18+ messages in thread
From: Willem de Bruijn @ 2024-09-13 21:44 UTC (permalink / raw)
  To: Joe Damato, netdev
  Cc: mkarsten, kuba, skhawaja, sdf, bjorn, amritha.nambiar,
	sridhar.samudrala, David S. Miller, Eric Dumazet, Paolo Abeni,
	Jonathan Corbet, Jiri Pirko, Sebastian Andrzej Siewior,
	Lorenzo Bianconi, David Ahern, Johannes Berg,
	open list:DOCUMENTATION, open list

Joe Damato wrote:
> Several comments on different things below for this patch that I just noticed.
> 
> On Thu, Sep 12, 2024 at 10:07:13AM +0000, Joe Damato wrote:
> > Add a persistent NAPI config area for NAPI configuration to the core.
> > Drivers opt-in to setting the storage for a NAPI by passing an index
> > when calling netif_napi_add_storage.
> > 
> > napi_config is allocated in alloc_netdev_mqs, freed in free_netdev
> > (after the NAPIs are deleted), and set to 0 when napi_enable is called.
> 
> Forgot to re-read all the commit messages. I will do that for rfcv4
> and make sure they are all correct; this message is not correct.
>  
> > Drivers which call netif_napi_add_storage will have persistent
> > NAPI IDs.
> > 
> > Signed-off-by: Joe Damato <jdamato@fastly.com>

> > @@ -11062,6 +11110,9 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
> >  		return NULL;
> >  	}
> >  
> > +	WARN_ON_ONCE(txqs != rxqs);
> 
> This warning triggers for me on boot every time with mlx5 NICs.
> 
> The code in mlx5 seems to get the rxq and txq maximums in:
>   drivers/net/ethernet/mellanox/mlx5/core/en_main.c
>     mlx5e_create_netdev
> 
>   which does:
> 
>     txqs = mlx5e_get_max_num_txqs(mdev, profile);
>     rxqs = mlx5e_get_max_num_rxqs(mdev, profile);
> 
>     netdev = alloc_etherdev_mqs(sizeof(struct mlx5e_priv), txqs, rxqs);
> 
> In my case for my device, txqs: 760, rxqs: 63.
> 
> I would guess that this warning will trigger every time for mlx5 NICs
> and would be quite annoying.
> 
> We may just want to replace the allocation logic to allocate
> txqs+rxqs, remove the WARN_ON_ONCE, and be OK with some wasted
> space?

I was about to say that txqs == rxqs is not necessary.

The number of napi config structs you want depends on whether the
driver configures separate IRQs for Tx and Rx or not.

Allocating the max of the two is perhaps sufficient for now.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC net-next v3 5/9] net: napi: Add napi_config
  2024-09-13 21:44     ` Willem de Bruijn
@ 2024-09-13 22:10       ` Joe Damato
  0 siblings, 0 replies; 18+ messages in thread
From: Joe Damato @ 2024-09-13 22:10 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: netdev, mkarsten, kuba, skhawaja, sdf, bjorn, amritha.nambiar,
	sridhar.samudrala, David S. Miller, Eric Dumazet, Paolo Abeni,
	Jonathan Corbet, Jiri Pirko, Sebastian Andrzej Siewior,
	Lorenzo Bianconi, David Ahern, Johannes Berg,
	open list:DOCUMENTATION, open list

On Fri, Sep 13, 2024 at 05:44:07PM -0400, Willem de Bruijn wrote:
> Joe Damato wrote:
> > Several comments on different things below for this patch that I just noticed.
> > 
> > On Thu, Sep 12, 2024 at 10:07:13AM +0000, Joe Damato wrote:
> > > Add a persistent NAPI config area for NAPI configuration to the core.
> > > Drivers opt-in to setting the storage for a NAPI by passing an index
> > > when calling netif_napi_add_storage.
> > > 
> > > napi_config is allocated in alloc_netdev_mqs, freed in free_netdev
> > > (after the NAPIs are deleted), and set to 0 when napi_enable is called.
> > 
> > Forgot to re-read all the commit messages. I will do that for rfcv4
> > and make sure they are all correct; this message is not correct.
> >  
> > > Drivers which call netif_napi_add_storage will have persistent
> > > NAPI IDs.
> > > 
> > > Signed-off-by: Joe Damato <jdamato@fastly.com>
> 
> > > @@ -11062,6 +11110,9 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
> > >  		return NULL;
> > >  	}
> > >  
> > > +	WARN_ON_ONCE(txqs != rxqs);
> > 
> > This warning triggers for me on boot every time with mlx5 NICs.
> > 
> > The code in mlx5 seems to get the rxq and txq maximums in:
> >   drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> >     mlx5e_create_netdev
> > 
> >   which does:
> > 
> >     txqs = mlx5e_get_max_num_txqs(mdev, profile);
> >     rxqs = mlx5e_get_max_num_rxqs(mdev, profile);
> > 
> >     netdev = alloc_etherdev_mqs(sizeof(struct mlx5e_priv), txqs, rxqs);
> > 
> > In my case for my device, txqs: 760, rxqs: 63.
> > 
> > I would guess that this warning will trigger every time for mlx5 NICs
> > and would be quite annoying.
> > 
> > We may just want to replace the allocation logic to allocate
> > txqs+rxqs, remove the WARN_ON_ONCE, and be OK with some wasted
> > space?
> 
> I was about to say that txqs == rxqs is not necessary.

Correct.

> The number of napi config structs you want depends on whether the
> driver configures separate IRQs for Tx and Rx or not.

Correct. This is why I included the mlx4 patch.

> Allocating the max of the two is perhaps sufficient for now.

I don't think I agree. The max of the two means you'll always be
missing some config space if the user/device ends up allocating the
maximum number of both.

The WARN_ON_ONCE was added as suggested in a previous conversation
[1], but due to the imbalance in mlx5 (and probably other devices)
the warning will be more of a nuisance and will likely trigger on
every boot, at least for mlx5 and probably others.

Regardless of how many we decide to allocate, the point I was making
above was that the WARN_ON_ONCE should likely be removed.

[1]: https://lore.kernel.org/lkml/20240902174944.293dfe4b@kernel.org/

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC net-next v3 5/9] net: napi: Add napi_config
  2024-09-13 19:39     ` Joe Damato
@ 2024-09-14 13:00       ` Joe Damato
  0 siblings, 0 replies; 18+ messages in thread
From: Joe Damato @ 2024-09-14 13:00 UTC (permalink / raw)
  To: Stanislav Fomichev, netdev, mkarsten, kuba, skhawaja, sdf, bjorn,
	amritha.nambiar, sridhar.samudrala, David S. Miller, Eric Dumazet,
	Paolo Abeni, Jonathan Corbet, Jiri Pirko,
	Sebastian Andrzej Siewior, Lorenzo Bianconi, David Ahern,
	Johannes Berg, open list:DOCUMENTATION, open list

On Fri, Sep 13, 2024 at 09:39:49PM +0200, Joe Damato wrote:
> On Fri, Sep 13, 2024 at 10:42:37AM -0700, Stanislav Fomichev wrote:
> > On 09/12, Joe Damato wrote:
> 
> [...]
> 
> > > --- a/net/core/dev.c
> > > +++ b/net/core/dev.c
> > > @@ -6493,6 +6493,18 @@ EXPORT_SYMBOL(napi_busy_loop);
> > >  
> > >  #endif /* CONFIG_NET_RX_BUSY_POLL */
> > >  
> > > +static void napi_hash_add_with_id(struct napi_struct *napi, unsigned int napi_id)
> > > +{
> > > +	spin_lock(&napi_hash_lock);
> > > +
> > > +	napi->napi_id = napi_id;
> > > +
> > > +	hlist_add_head_rcu(&napi->napi_hash_node,
> > > +			   &napi_hash[napi->napi_id % HASH_SIZE(napi_hash)]);
> > > +
> > > +	spin_unlock(&napi_hash_lock);
> > > +}
> > > +
> > >  static void napi_hash_add(struct napi_struct *napi)
> > >  {
> > >  	if (test_bit(NAPI_STATE_NO_BUSY_POLL, &napi->state))
> > > @@ -6505,12 +6517,13 @@ static void napi_hash_add(struct napi_struct *napi)
> > >  		if (unlikely(++napi_gen_id < MIN_NAPI_ID))
> > >  			napi_gen_id = MIN_NAPI_ID;
> > >  	} while (napi_by_id(napi_gen_id));
> > 
> > [..]
> > 
> > > -	napi->napi_id = napi_gen_id;
> > > -
> > > -	hlist_add_head_rcu(&napi->napi_hash_node,
> > > -			   &napi_hash[napi->napi_id % HASH_SIZE(napi_hash)]);
> > >  
> > >  	spin_unlock(&napi_hash_lock);
> > > +
> > > +	napi_hash_add_with_id(napi, napi_gen_id);
> > 
> > nit: it is very unlikely that napi_gen_id is gonna wrap around after the
> > spin_unlock above, but maybe it's safer to have the following?
> > 
> > static void __napi_hash_add_with_id(struct napi_struct *napi, unsigned int napi_id)
> > {
> > 	napi->napi_id = napi_id;
> > 	hlist_add_head_rcu(&napi->napi_hash_node,
> > 			   &napi_hash[napi->napi_id % HASH_SIZE(napi_hash)]);
> > }
> > 
> > static void napi_hash_add_with_id(struct napi_struct *napi, unsigned int napi_id)
> > {
> > 	spin_lock(&napi_hash_lock);
> > 	__napi_hash_add_with_id(...);
> > 	spin_unlock(&napi_hash_lock);
> > }
> > 
> > And use __napi_hash_add_with_id here before spin_unlock?
> 
> After making this change and re-testing on a couple reboots, I haven't
> been able to reproduce the page pool issue I mentioned in the other
> email [1].
> 
> Not sure if I've just been... "getting lucky" or if this fixed
> something that I won't fully grasp until I read the mlx5 source
> again.
> 
> Will test it a few more times, though.
> 
> [1]: https://lore.kernel.org/netdev/ZuMC2fYPPtWggB2w@LQ3V64L9R2.homenet.telecomitalia.it/

I was able to reproduce the issue by running a simple script
(suggested by Martin):

  for ((i=0;i<100;i++)); do for q in 1 2 4 8 16 32; do sudo ethtool -L eth4 combined $q ;done;done

It does not seem to reproduce on net-next/main, so it is almost certainly a bug
introduced in patch 5, and the suggested changes above did not solve the problem.

That said, the changes I have queued for the RFCv4:
  - Updated commit messages of most patches
  - Renamed netif_napi_add_storage to netif_napi_add_config in patch 5 and
    updated the driver patches.
  - Added a NULL check in netdev_set_defer_hard_irqs and
    netdev_set_gro_flush_timeout for netdev->napi_config in patch 5
  - Add Stanislav's suggestion for more safe handling of napi_gen_id in
    patch 5
  - Fixed merge conflicts in patch 6 so it applies cleanly

The outstanding items at this point are:
  - Getting rid of the WARN_ON_ONCE (assuming we all agree on this)
  - Deciding if we are allocating max(rxqs, txqs) or something else
  - Debugging the page pool issue

Jakub suggested the first two items on the list above, so I'm hoping to hear
his thoughts on them :) I have no strong preference on those two.

Once those two are solved, I can send an RFCv4 to see where we are, and I'll
call out the outstanding page pool issue in the cover letter.

I likely won't have time to debug the page pool thing until after LPC, though
:(

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2024-09-14 13:00 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-09-12 10:07 [RFC net-next v3 0/9] Add support for per-NAPI config via netlink Joe Damato
2024-09-12 10:07 ` [RFC net-next v3 1/9] net: napi: Make napi_defer_hard_irqs per-NAPI Joe Damato
2024-09-12 10:07 ` [RFC net-next v3 2/9] netdev-genl: Dump napi_defer_hard_irqs Joe Damato
2024-09-12 10:07 ` [RFC net-next v3 3/9] net: napi: Make gro_flush_timeout per-NAPI Joe Damato
2024-09-12 10:07 ` [RFC net-next v3 4/9] netdev-genl: Dump gro_flush_timeout Joe Damato
2024-09-12 10:07 ` [RFC net-next v3 5/9] net: napi: Add napi_config Joe Damato
2024-09-12 15:03   ` Joe Damato
2024-09-13 21:44     ` Willem de Bruijn
2024-09-13 22:10       ` Joe Damato
2024-09-13 17:42   ` Stanislav Fomichev
2024-09-13 18:55     ` Joe Damato
2024-09-13 19:39     ` Joe Damato
2024-09-14 13:00       ` Joe Damato
2024-09-12 10:07 ` [RFC net-next v3 6/9] netdev-genl: Support setting per-NAPI config values Joe Damato
2024-09-12 10:38   ` Joe Damato
2024-09-12 10:07 ` [RFC net-next v3 7/9] bnxt: Add support for napi storage Joe Damato
2024-09-12 10:07 ` [RFC net-next v3 8/9] mlx5: " Joe Damato
2024-09-12 10:07 ` [RFC net-next v3 9/9] mlx4: Add support for napi storage to RX CQs Joe Damato

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).