netdev.vger.kernel.org archive mirror
* [net-next v6 0/9] Add support for per-NAPI config via netlink
@ 2024-10-11 18:44 Joe Damato
  2024-10-11 18:44 ` [net-next v6 1/9] net: napi: Make napi_defer_hard_irqs per-NAPI Joe Damato
                   ` (9 more replies)
  0 siblings, 10 replies; 25+ messages in thread
From: Joe Damato @ 2024-10-11 18:44 UTC (permalink / raw)
  To: netdev
  Cc: mkarsten, skhawaja, sdf, bjorn, amritha.nambiar,
	sridhar.samudrala, willemdebruijn.kernel, edumazet, Joe Damato,
	Alexander Lobakin, Breno Leitao, Daniel Jurgens, David Ahern,
	David S. Miller, Donald Hunter, Jakub Kicinski,
	Jesper Dangaard Brouer, Jiri Pirko, Johannes Berg,
	Jonathan Corbet, Leon Romanovsky, open list:DOCUMENTATION,
	open list, open list:MELLANOX MLX4 core VPI driver,
	Lorenzo Bianconi, Michael Chan, Mina Almasry, Paolo Abeni,
	Saeed Mahameed, Sebastian Andrzej Siewior, Tariq Toukan,
	Xuan Zhuo

Greetings:

Welcome to v6. Minor changes from v5 [1]; please see the changelog below.

There were no explicit comments from reviewers on the call-outs in my
v5 cover letter, so I'm retaining them here just in case :)

A few important call-outs for reviewers:

  1. This revision seems to work (see below for a full walkthrough). I
     think this is the behavior we talked about, but please let me know
     if a use case is missing.
  
  2. Re a previous point made by Stanislav regarding "taking over a NAPI
     ID" when the channel count changes: mlx5 seems to call napi_disable
     followed by netif_napi_del for the old queues and then calls
     napi_enable for the new ones (see the sketch after this list). In
     this series, NAPI ID generation is deferred to napi_enable. This
     means we won't end up with two NAPIs holding the same NAPI ID in
     the hash at the same time.

     Can we assume all drivers will napi_disable the old queues before
     they napi_enable the new ones?
       - If yes: we might not need to worry about a NAPI ID takeover
         function.
       - If no: I'll need to make a change so that NAPI ID generation is
         deferred only for drivers which have opted into the config
         space via calls to netif_napi_add_config.

  3. I made the decision to remove the WARN_ON_ONCE that (I think?)
     Jakub previously suggested in alloc_netdev_mqs (WARN_ON_ONCE(txqs
     != rxqs);) because this was triggering on every kernel boot with my
     mlx5 NIC.

  4. I left the "maxqs = max(txqs, rxqs);" in alloc_netdev_mqs despite
     finding it a bit strange that we might end up short some number of
     NAPI configs; most people seem to be in favor of this approach, so
     I've kept it.
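
For reference, the reconfiguration ordering item 2 relies on looks roughly
like the sketch below. This is only a hedged illustration with made-up
names (example_ring, example_reconfigure, old_rings/new_rings, and poll are
not from mlx5 or any other driver); the core calls themselves
(napi_disable, netif_napi_del, netif_napi_add, napi_enable) are real:

    /* Hypothetical channel-count change. With this series, napi_disable
     * removes the NAPI ID from the hash and napi_enable generates (or
     * restores) it, so as long as the old NAPIs are disabled before the
     * new ones are enabled, the same NAPI ID is never hashed twice.
     */
    static void example_reconfigure(struct net_device *dev,
                                    struct example_ring *old_rings, int old_count,
                                    struct example_ring *new_rings, int new_count,
                                    int (*poll)(struct napi_struct *, int))
    {
            int i;

            for (i = 0; i < old_count; i++) {
                    napi_disable(&old_rings[i].napi);  /* ID leaves the hash */
                    netif_napi_del(&old_rings[i].napi);
            }

            for (i = 0; i < new_count; i++) {
                    netif_napi_add(dev, &new_rings[i].napi, poll);
                    napi_enable(&new_rings[i].napi);   /* ID added to the hash */
            }
    }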

I'd appreciate thoughts from reviewers on the above items, if at all
possible.

Now, on to the implementation.

First, this implementation moves certain settings into napi_struct so
that they are "per-NAPI", while taking care to respect the existing sysfs
parameters, which are interface-wide and affect all NAPIs:
  - NAPI ID
  - gro_flush_timeout
  - defer_hard_irqs

Furthermore:
   - NAPI ID generation and addition to the hash is now deferred to
     napi_enable, instead of during netif_napi_add
   - NAPIs are removed from the hash during napi_disable, instead of
     netif_napi_del.
   - An array of "struct napi_config" is allocated in net_device.

IMPORTANT: The above changes affect all network drivers.

Optionally, drivers may opt in to using their config space by calling
netif_napi_add_config instead of netif_napi_add.

If a driver does this, the NAPI being added is linked with an allocated
"struct napi_config" and the per-NAPI settings (including NAPI ID) are
persisted even as hardware queues are destroyed and recreated.
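
As a rough sketch of what opting in looks like (hypothetical driver code:
example_priv, rings, num_rings, and example_poll are made-up names), only
the add call changes, with a stable index selecting the NAPI's slot in the
netdev's config array:

    /* Hypothetical opt-in: pass a stable index (here the ring index) so
     * each NAPI is linked to dev->napi_config[index] and its NAPI ID,
     * gro_flush_timeout and defer_hard_irqs survive queue
     * destroy/recreate cycles.
     */
    static void example_init_napis(struct example_priv *priv)
    {
            int i;

            for (i = 0; i < priv->num_rings; i++)
                    netif_napi_add_config(priv->netdev, &priv->rings[i].napi,
                                          example_poll, i);
    }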

To help illustrate how this would end up working, I've added patches for
3 drivers, of which I have access to only 1:
  - mlx5, which is the basis of the examples below
  - mlx4, which has TX-only NAPIs, just to highlight that case. I have
    only compile tested this patch; I don't have this hardware.
  - bnxt, which I have only compile tested. I don't have this
    hardware.

NOTE: I only tested this on mlx5; I have no access to the other hardware
for which I provided patches. Hopefully other folks can help test :)

Here's how it works when I test it on my mlx5 system:

# start with 2 queues

$ ethtool -l eth4 | grep Combined | tail -1
Combined:       2

First, output the current NAPI settings:

$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
                         --dump napi-get --json='{"ifindex": 7}'
[{'defer-hard-irqs': 0,
  'gro-flush-timeout': 0,
  'id': 345,
  'ifindex': 7,
  'irq': 527},
 {'defer-hard-irqs': 0,
  'gro-flush-timeout': 0,
  'id': 344,
  'ifindex': 7,
  'irq': 327}]

Now, set the global sysfs parameters:

$ sudo bash -c 'echo 20000 >/sys/class/net/eth4/gro_flush_timeout'
$ sudo bash -c 'echo 100 >/sys/class/net/eth4/napi_defer_hard_irqs'

Output current NAPI settings again:

$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
                         --dump napi-get --json='{"ifindex": 7}'
[{'defer-hard-irqs': 100,
  'gro-flush-timeout': 20000,
  'id': 345,
  'ifindex': 7,
  'irq': 527},
 {'defer-hard-irqs': 100,
  'gro-flush-timeout': 20000,
  'id': 344,
  'ifindex': 7,
  'irq': 327}]

Now, set NAPI ID 345 to specific values, addressing it by its NAPI ID:

$ sudo ./tools/net/ynl/cli.py \
          --spec Documentation/netlink/specs/netdev.yaml \
          --do napi-set \
          --json='{"id": 345,
                   "defer-hard-irqs": 111,
                   "gro-flush-timeout": 11111}'
None

Now output current NAPI settings again to ensure only NAPI ID 345
changed:

$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
                         --dump napi-get --json='{"ifindex": 7}'

[{'defer-hard-irqs': 111,
  'gro-flush-timeout': 11111,
  'id': 345,
  'ifindex': 7,
  'irq': 527},
 {'defer-hard-irqs': 100,
  'gro-flush-timeout': 20000,
  'id': 344,
  'ifindex': 7,
  'irq': 327}]

Now, increase gro-flush-timeout only:

$ sudo ./tools/net/ynl/cli.py \
       --spec Documentation/netlink/specs/netdev.yaml \
       --do napi-set --json='{"id": 345,
                              "gro-flush-timeout": 44444}'
None

Now output the current NAPI settings once more:

$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
                         --dump napi-get --json='{"ifindex": 7}'
[{'defer-hard-irqs': 111,
  'gro-flush-timeout': 44444,
  'id': 345,
  'ifindex': 7,
  'irq': 527},
 {'defer-hard-irqs': 100,
  'gro-flush-timeout': 20000,
  'id': 344,
  'ifindex': 7,
  'irq': 327}]

Now set NAPI ID 345 to have a gro_flush_timeout of 0:

$ sudo ./tools/net/ynl/cli.py \
       --spec Documentation/netlink/specs/netdev.yaml \
       --do napi-set --json='{"id": 345,
                              "gro-flush-timeout": 0}'
None

Check that NAPI ID 345 now has a gro-flush-timeout of 0:

$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
                         --dump napi-get --json='{"ifindex": 7}'

[{'defer-hard-irqs': 111,
  'gro-flush-timeout': 0,
  'id': 345,
  'ifindex': 7,
  'irq': 527},
 {'defer-hard-irqs': 100,
  'gro-flush-timeout': 20000,
  'id': 344,
  'ifindex': 7,
  'irq': 327}]

Change the queue count, ensuring that NAPI ID 345 retains its settings:

$ sudo ethtool -L eth4 combined 4

Check that the new queues have the system-wide settings but that NAPI ID
345 remains unchanged:

$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
                         --dump napi-get --json='{"ifindex": 7}'

[{'defer-hard-irqs': 100,
  'gro-flush-timeout': 20000,
  'id': 347,
  'ifindex': 7,
  'irq': 529},
 {'defer-hard-irqs': 100,
  'gro-flush-timeout': 20000,
  'id': 346,
  'ifindex': 7,
  'irq': 528},
 {'defer-hard-irqs': 111,
  'gro-flush-timeout': 0,
  'id': 345,
  'ifindex': 7,
  'irq': 527},
 {'defer-hard-irqs': 100,
  'gro-flush-timeout': 20000,
  'id': 344,
  'ifindex': 7,
  'irq': 327}]

Now reduce the queue count below where NAPI ID 345 is indexed:

$ sudo ethtool -L eth4 combined 1

Check the output:

$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
                         --dump napi-get --json='{"ifindex": 7}'
[{'defer-hard-irqs': 100,
  'gro-flush-timeout': 20000,
  'id': 344,
  'ifindex': 7,
  'irq': 327}]

Re-increase the queue count to ensure NAPI ID 345 is re-assigned the same
values:

$ sudo ethtool -L eth4 combined 2

$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
                         --dump napi-get --json='{"ifindex": 7}'
[{'defer-hard-irqs': 111,
  'gro-flush-timeout': 0,
  'id': 345,
  'ifindex': 7,
  'irq': 527},
 {'defer-hard-irqs': 100,
  'gro-flush-timeout': 20000,
  'id': 344,
  'ifindex': 7,
  'irq': 327}]

Create new queues to ensure the sysfs globals are used for the new NAPIs
but that NAPI ID 345 is unchanged:

$ sudo ethtool -L eth4 combined 8

$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
                         --dump napi-get --json='{"ifindex": 7}'
[...]
 {'defer-hard-irqs': 100,
  'gro-flush-timeout': 20000,
  'id': 346,
  'ifindex': 7,
  'irq': 528},
 {'defer-hard-irqs': 111,
  'gro-flush-timeout': 0,
  'id': 345,
  'ifindex': 7,
  'irq': 527},
 {'defer-hard-irqs': 100,
  'gro-flush-timeout': 20000,
  'id': 344,
  'ifindex': 7,
  'irq': 327}]

Last but not least, let's write the sysfs parameters to confirm that
all NAPIs are overwritten with the global values:

$ sudo bash -c 'echo 33333 >/sys/class/net/eth4/gro_flush_timeout'
$ sudo bash -c 'echo 222 >/sys/class/net/eth4/napi_defer_hard_irqs'

Check that it worked:

$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
                         --dump napi-get --json='{"ifindex": 7}'

[...]
 {'defer-hard-irqs': 222,
  'gro-flush-timeout': 33333,
  'id': 346,
  'ifindex': 7,
  'irq': 528},
 {'defer-hard-irqs': 222,
  'gro-flush-timeout': 33333,
  'id': 345,
  'ifindex': 7,
  'irq': 527},
 {'defer-hard-irqs': 222,
  'gro-flush-timeout': 33333,
  'id': 344,
  'ifindex': 7,
  'irq': 327}]

Thanks,
Joe

[1]: https://lore.kernel.org/netdev/20241009005525.13651-1-jdamato@fastly.com/

v6:
  - Added Jakub's Reviewed-by to all commit messages
  - Added Eric's Reviewed-by to the commits for which he provided his
    review (i.e. all patches except 4, 5, and 6)
  - Updated the netlink doc for gro_flush_timeout in patch 4
  - Added WARN_ON_ONCE to napi_hash_add_with_id as suggested by Eric
    Dumazet to patch 5

v5: https://lore.kernel.org/netdev/20241009005525.13651-1-jdamato@fastly.com/
  - Converted from an RFC to a PATCH
  - Moved defer_hard_irqs above control-fields comment in napi_struct in
    patch 1
  - Moved gro_flush_timeout above control-fields comment in napi_struct in
    patch 3
  - Removed unnecessary idpf changes from patch 3
  - Removed unnecessary kdoc in patch 5 for a parameter removed in an
    earlier revision
  - Removed unnecessary NULL check in patch 5
  - Used tooling to autogenerate code for patch 6, which fixed the type and
    range of NETDEV_A_NAPI_DEFER_HARD_IRQS.

rfcv4: https://lore.kernel.org/lkml/20241001235302.57609-1-jdamato@fastly.com/
  - Updated commit messages of most patches
  - Renamed netif_napi_add_storage to netif_napi_add_config in patch 5
  - Added a NULL check in netdev_set_defer_hard_irqs and
    netdev_set_gro_flush_timeout for netdev->napi_config in patch 5
  - Removed the WARN_ON_ONCE suggested in an earlier revision
    in alloc_netdev_mqs from patch 5; it triggers every time on my mlx5
    machine at boot and needlessly spams the log
  - Added a locking adjustment suggested by Stanislav to patch 6 to
    protect napi_id in patch 5
  - Removed napi_hash_del from netif_napi_del in patch 5. netif_napi_del
    calls __netif_napi_del which itself calls napi_hash_del. The
    original code thus resulted in two napi_hash_del calls, which is
    incorrect.
  - Removed the napi_hash_add from netif_napi_add_weight in patch 5.
    NAPIs are added to the hash when napi_enable is called, instead.
  - Moved the napi_restore_config to the top of napi_enable in patch 5.
  - Simplified the logic in __netif_napi_del and removed napi_hash_del.
    NAPIs are removed in napi_disable.
  - Fixed merge conflicts in patch 6 so it applies cleanly

rfcv3: https://lore.kernel.org/netdev/20240912100738.16567-8-jdamato@fastly.com/T/
  - Renamed napi_storage to napi_config
  - Reordered patches
  - Added defer_hard_irqs and gro_flush_timeout to napi_struct
  - Attempt to save and restore settings on napi_disable/napi_enable
  - Removed weight as a parameter to netif_napi_add_storage
  - Updated driver patches to no longer pass in weight

rfcv2: https://lore.kernel.org/netdev/20240908160702.56618-1-jdamato@fastly.com/
  - Almost total rewrite from v1

Joe Damato (9):
  net: napi: Make napi_defer_hard_irqs per-NAPI
  netdev-genl: Dump napi_defer_hard_irqs
  net: napi: Make gro_flush_timeout per-NAPI
  netdev-genl: Dump gro_flush_timeout
  net: napi: Add napi_config
  netdev-genl: Support setting per-NAPI config values
  bnxt: Add support for persistent NAPI config
  mlx5: Add support for persistent NAPI config
  mlx4: Add support for persistent NAPI config to RX CQs

 Documentation/netlink/specs/netdev.yaml       | 28 ++++++
 .../networking/net_cachelines/net_device.rst  |  3 +
 drivers/net/ethernet/broadcom/bnxt/bnxt.c     |  3 +-
 drivers/net/ethernet/mellanox/mlx4/en_cq.c    |  3 +-
 .../net/ethernet/mellanox/mlx5/core/en_main.c |  2 +-
 include/linux/netdevice.h                     | 42 +++++++-
 include/uapi/linux/netdev.h                   |  3 +
 net/core/dev.c                                | 96 ++++++++++++++++---
 net/core/dev.h                                | 88 +++++++++++++++++
 net/core/net-sysfs.c                          |  4 +-
 net/core/netdev-genl-gen.c                    | 18 ++++
 net/core/netdev-genl-gen.h                    |  1 +
 net/core/netdev-genl.c                        | 57 +++++++++++
 tools/include/uapi/linux/netdev.h             |  3 +
 14 files changed, 326 insertions(+), 25 deletions(-)


base-commit: d677aebd663ddc287f2b2bda098474694a0ca875
-- 
2.25.1


^ permalink raw reply	[flat|nested] 25+ messages in thread

* [net-next v6 1/9] net: napi: Make napi_defer_hard_irqs per-NAPI
  2024-10-11 18:44 [net-next v6 0/9] Add support for per-NAPI config via netlink Joe Damato
@ 2024-10-11 18:44 ` Joe Damato
  2024-10-11 18:44 ` [net-next v6 2/9] netdev-genl: Dump napi_defer_hard_irqs Joe Damato
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 25+ messages in thread
From: Joe Damato @ 2024-10-11 18:44 UTC (permalink / raw)
  To: netdev
  Cc: mkarsten, skhawaja, sdf, bjorn, amritha.nambiar,
	sridhar.samudrala, willemdebruijn.kernel, edumazet, Joe Damato,
	Jakub Kicinski, David S. Miller, Paolo Abeni, Jonathan Corbet,
	Jiri Pirko, Sebastian Andrzej Siewior, Lorenzo Bianconi,
	Johannes Berg, Breno Leitao, Alexander Lobakin,
	open list:DOCUMENTATION, open list

Add defer_hard_irqs to napi_struct in preparation for per-NAPI
settings.

The existing sysfs parameter is respected; writes to sysfs will write to
all NAPI structs for the device and the net_device napi_defer_hard_irqs field.
Reads from sysfs show the net_device field.

The ability to set defer_hard_irqs on specific NAPI instances will be
added in a later commit, via netdev-genl.

Signed-off-by: Joe Damato <jdamato@fastly.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
---
 .../networking/net_cachelines/net_device.rst  |  1 +
 include/linux/netdevice.h                     |  3 +-
 net/core/dev.c                                | 10 +++---
 net/core/dev.h                                | 36 +++++++++++++++++++
 net/core/net-sysfs.c                          |  2 +-
 5 files changed, 45 insertions(+), 7 deletions(-)

diff --git a/Documentation/networking/net_cachelines/net_device.rst b/Documentation/networking/net_cachelines/net_device.rst
index 1b018ac35e9a..5a7388b2ab6f 100644
--- a/Documentation/networking/net_cachelines/net_device.rst
+++ b/Documentation/networking/net_cachelines/net_device.rst
@@ -186,4 +186,5 @@ struct dpll_pin*                    dpll_pin
 struct hlist_head                   page_pools
 struct dim_irq_moder*               irq_moder
 u64                                 max_pacing_offload_horizon
+u32                                 napi_defer_hard_irqs
 =================================== =========================== =================== =================== ===================================================================================
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index e6b93d01e631..2e7bc23660ec 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -373,6 +373,7 @@ struct napi_struct {
 	unsigned int		napi_id;
 	struct hrtimer		timer;
 	struct task_struct	*thread;
+	u32			defer_hard_irqs;
 	/* control-path-only fields follow */
 	struct list_head	dev_list;
 	struct hlist_node	napi_hash_node;
@@ -2085,7 +2086,6 @@ struct net_device {
 	unsigned int		real_num_rx_queues;
 	struct netdev_rx_queue	*_rx;
 	unsigned long		gro_flush_timeout;
-	u32			napi_defer_hard_irqs;
 	unsigned int		gro_max_size;
 	unsigned int		gro_ipv4_max_size;
 	rx_handler_func_t __rcu	*rx_handler;
@@ -2413,6 +2413,7 @@ struct net_device {
 	struct dim_irq_moder	*irq_moder;
 
 	u64			max_pacing_offload_horizon;
+	u32			napi_defer_hard_irqs;
 
 	/**
 	 * @lock: protects @net_shaper_hierarchy, feel free to use for other
diff --git a/net/core/dev.c b/net/core/dev.c
index b590eefce3b4..fbaa9eabf77f 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -6233,7 +6233,7 @@ bool napi_complete_done(struct napi_struct *n, int work_done)
 	if (work_done) {
 		if (n->gro_bitmask)
 			timeout = READ_ONCE(n->dev->gro_flush_timeout);
-		n->defer_hard_irqs_count = READ_ONCE(n->dev->napi_defer_hard_irqs);
+		n->defer_hard_irqs_count = napi_get_defer_hard_irqs(n);
 	}
 	if (n->defer_hard_irqs_count > 0) {
 		n->defer_hard_irqs_count--;
@@ -6371,7 +6371,7 @@ static void busy_poll_stop(struct napi_struct *napi, void *have_poll_lock,
 	bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx);
 
 	if (flags & NAPI_F_PREFER_BUSY_POLL) {
-		napi->defer_hard_irqs_count = READ_ONCE(napi->dev->napi_defer_hard_irqs);
+		napi->defer_hard_irqs_count = napi_get_defer_hard_irqs(napi);
 		timeout = READ_ONCE(napi->dev->gro_flush_timeout);
 		if (napi->defer_hard_irqs_count && timeout) {
 			hrtimer_start(&napi->timer, ns_to_ktime(timeout), HRTIMER_MODE_REL_PINNED);
@@ -6653,6 +6653,7 @@ void netif_napi_add_weight(struct net_device *dev, struct napi_struct *napi,
 	INIT_HLIST_NODE(&napi->napi_hash_node);
 	hrtimer_init(&napi->timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL_PINNED);
 	napi->timer.function = napi_watchdog;
+	napi_set_defer_hard_irqs(napi, READ_ONCE(dev->napi_defer_hard_irqs));
 	init_gro_hash(napi);
 	napi->skb = NULL;
 	INIT_LIST_HEAD(&napi->rx_list);
@@ -11059,7 +11060,7 @@ void netdev_sw_irq_coalesce_default_on(struct net_device *dev)
 
 	if (!IS_ENABLED(CONFIG_PREEMPT_RT)) {
 		dev->gro_flush_timeout = 20000;
-		dev->napi_defer_hard_irqs = 1;
+		netdev_set_defer_hard_irqs(dev, 1);
 	}
 }
 EXPORT_SYMBOL_GPL(netdev_sw_irq_coalesce_default_on);
@@ -12003,7 +12004,6 @@ static void __init net_dev_struct_check(void)
 	CACHELINE_ASSERT_GROUP_MEMBER(struct net_device, net_device_read_rx, real_num_rx_queues);
 	CACHELINE_ASSERT_GROUP_MEMBER(struct net_device, net_device_read_rx, _rx);
 	CACHELINE_ASSERT_GROUP_MEMBER(struct net_device, net_device_read_rx, gro_flush_timeout);
-	CACHELINE_ASSERT_GROUP_MEMBER(struct net_device, net_device_read_rx, napi_defer_hard_irqs);
 	CACHELINE_ASSERT_GROUP_MEMBER(struct net_device, net_device_read_rx, gro_max_size);
 	CACHELINE_ASSERT_GROUP_MEMBER(struct net_device, net_device_read_rx, gro_ipv4_max_size);
 	CACHELINE_ASSERT_GROUP_MEMBER(struct net_device, net_device_read_rx, rx_handler);
@@ -12015,7 +12015,7 @@ static void __init net_dev_struct_check(void)
 #ifdef CONFIG_NET_XGRESS
 	CACHELINE_ASSERT_GROUP_MEMBER(struct net_device, net_device_read_rx, tcx_ingress);
 #endif
-	CACHELINE_ASSERT_GROUP_SIZE(struct net_device, net_device_read_rx, 104);
+	CACHELINE_ASSERT_GROUP_SIZE(struct net_device, net_device_read_rx, 100);
 }
 
 /*
diff --git a/net/core/dev.h b/net/core/dev.h
index d3ea92949ff3..0716b1048261 100644
--- a/net/core/dev.h
+++ b/net/core/dev.h
@@ -148,6 +148,42 @@ static inline void netif_set_gro_ipv4_max_size(struct net_device *dev,
 	WRITE_ONCE(dev->gro_ipv4_max_size, size);
 }
 
+/**
+ * napi_get_defer_hard_irqs - get the NAPI's defer_hard_irqs
+ * @n: napi struct to get the defer_hard_irqs field from
+ *
+ * Return: the per-NAPI value of the defer_hard_irqs field.
+ */
+static inline u32 napi_get_defer_hard_irqs(const struct napi_struct *n)
+{
+	return READ_ONCE(n->defer_hard_irqs);
+}
+
+/**
+ * napi_set_defer_hard_irqs - set the defer_hard_irqs for a napi
+ * @n: napi_struct to set the defer_hard_irqs field
+ * @defer: the value the field should be set to
+ */
+static inline void napi_set_defer_hard_irqs(struct napi_struct *n, u32 defer)
+{
+	WRITE_ONCE(n->defer_hard_irqs, defer);
+}
+
+/**
+ * netdev_set_defer_hard_irqs - set defer_hard_irqs for all NAPIs of a netdev
+ * @netdev: the net_device for which all NAPIs will have defer_hard_irqs set
+ * @defer: the defer_hard_irqs value to set
+ */
+static inline void netdev_set_defer_hard_irqs(struct net_device *netdev,
+					      u32 defer)
+{
+	struct napi_struct *napi;
+
+	WRITE_ONCE(netdev->napi_defer_hard_irqs, defer);
+	list_for_each_entry(napi, &netdev->napi_list, dev_list)
+		napi_set_defer_hard_irqs(napi, defer);
+}
+
 int rps_cpumask_housekeeping(struct cpumask *mask);
 
 #if defined(CONFIG_DEBUG_NET) && defined(CONFIG_BPF_SYSCALL)
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 05cf5347f25e..25125f356a15 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -429,7 +429,7 @@ static int change_napi_defer_hard_irqs(struct net_device *dev, unsigned long val
 	if (val > S32_MAX)
 		return -ERANGE;
 
-	WRITE_ONCE(dev->napi_defer_hard_irqs, val);
+	netdev_set_defer_hard_irqs(dev, (u32)val);
 	return 0;
 }
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [net-next v6 2/9] netdev-genl: Dump napi_defer_hard_irqs
  2024-10-11 18:44 [net-next v6 0/9] Add support for per-NAPI config via netlink Joe Damato
  2024-10-11 18:44 ` [net-next v6 1/9] net: napi: Make napi_defer_hard_irqs per-NAPI Joe Damato
@ 2024-10-11 18:44 ` Joe Damato
  2024-10-11 18:44 ` [net-next v6 3/9] net: napi: Make gro_flush_timeout per-NAPI Joe Damato
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 25+ messages in thread
From: Joe Damato @ 2024-10-11 18:44 UTC (permalink / raw)
  To: netdev
  Cc: mkarsten, skhawaja, sdf, bjorn, amritha.nambiar,
	sridhar.samudrala, willemdebruijn.kernel, edumazet, Joe Damato,
	Jakub Kicinski, David S. Miller, Paolo Abeni, Donald Hunter,
	Jesper Dangaard Brouer, Mina Almasry, Xuan Zhuo, Larysa Zaremba,
	open list

Support dumping defer_hard_irqs for a NAPI ID.

Signed-off-by: Joe Damato <jdamato@fastly.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
---
 Documentation/netlink/specs/netdev.yaml | 8 ++++++++
 include/uapi/linux/netdev.h             | 1 +
 net/core/netdev-genl.c                  | 6 ++++++
 tools/include/uapi/linux/netdev.h       | 1 +
 4 files changed, 16 insertions(+)

diff --git a/Documentation/netlink/specs/netdev.yaml b/Documentation/netlink/specs/netdev.yaml
index 08412c279297..585e87ec3c16 100644
--- a/Documentation/netlink/specs/netdev.yaml
+++ b/Documentation/netlink/specs/netdev.yaml
@@ -248,6 +248,13 @@ attribute-sets:
              threaded mode. If NAPI is not in threaded mode (i.e. uses normal
              softirq context), the attribute will be absent.
         type: u32
+      -
+        name: defer-hard-irqs
+        doc: The number of consecutive empty polls before IRQ deferral ends
+             and hardware IRQs are re-enabled.
+        type: u32
+        checks:
+          max: s32-max
   -
     name: queue
     attributes:
@@ -636,6 +643,7 @@ operations:
             - ifindex
             - irq
             - pid
+            - defer-hard-irqs
       dump:
         request:
           attributes:
diff --git a/include/uapi/linux/netdev.h b/include/uapi/linux/netdev.h
index 7c308f04e7a0..13dc0b027e86 100644
--- a/include/uapi/linux/netdev.h
+++ b/include/uapi/linux/netdev.h
@@ -122,6 +122,7 @@ enum {
 	NETDEV_A_NAPI_ID,
 	NETDEV_A_NAPI_IRQ,
 	NETDEV_A_NAPI_PID,
+	NETDEV_A_NAPI_DEFER_HARD_IRQS,
 
 	__NETDEV_A_NAPI_MAX,
 	NETDEV_A_NAPI_MAX = (__NETDEV_A_NAPI_MAX - 1)
diff --git a/net/core/netdev-genl.c b/net/core/netdev-genl.c
index 358cba248796..f98e5d1d0d21 100644
--- a/net/core/netdev-genl.c
+++ b/net/core/netdev-genl.c
@@ -161,6 +161,7 @@ static int
 netdev_nl_napi_fill_one(struct sk_buff *rsp, struct napi_struct *napi,
 			const struct genl_info *info)
 {
+	u32 napi_defer_hard_irqs;
 	void *hdr;
 	pid_t pid;
 
@@ -189,6 +190,11 @@ netdev_nl_napi_fill_one(struct sk_buff *rsp, struct napi_struct *napi,
 			goto nla_put_failure;
 	}
 
+	napi_defer_hard_irqs = napi_get_defer_hard_irqs(napi);
+	if (nla_put_s32(rsp, NETDEV_A_NAPI_DEFER_HARD_IRQS,
+			napi_defer_hard_irqs))
+		goto nla_put_failure;
+
 	genlmsg_end(rsp, hdr);
 
 	return 0;
diff --git a/tools/include/uapi/linux/netdev.h b/tools/include/uapi/linux/netdev.h
index 7c308f04e7a0..13dc0b027e86 100644
--- a/tools/include/uapi/linux/netdev.h
+++ b/tools/include/uapi/linux/netdev.h
@@ -122,6 +122,7 @@ enum {
 	NETDEV_A_NAPI_ID,
 	NETDEV_A_NAPI_IRQ,
 	NETDEV_A_NAPI_PID,
+	NETDEV_A_NAPI_DEFER_HARD_IRQS,
 
 	__NETDEV_A_NAPI_MAX,
 	NETDEV_A_NAPI_MAX = (__NETDEV_A_NAPI_MAX - 1)
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [net-next v6 3/9] net: napi: Make gro_flush_timeout per-NAPI
  2024-10-11 18:44 [net-next v6 0/9] Add support for per-NAPI config via netlink Joe Damato
  2024-10-11 18:44 ` [net-next v6 1/9] net: napi: Make napi_defer_hard_irqs per-NAPI Joe Damato
  2024-10-11 18:44 ` [net-next v6 2/9] netdev-genl: Dump napi_defer_hard_irqs Joe Damato
@ 2024-10-11 18:44 ` Joe Damato
  2024-10-11 18:44 ` [net-next v6 4/9] netdev-genl: Dump gro_flush_timeout Joe Damato
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 25+ messages in thread
From: Joe Damato @ 2024-10-11 18:44 UTC (permalink / raw)
  To: netdev
  Cc: mkarsten, skhawaja, sdf, bjorn, amritha.nambiar,
	sridhar.samudrala, willemdebruijn.kernel, edumazet, Joe Damato,
	Jakub Kicinski, David S. Miller, Paolo Abeni, Jonathan Corbet,
	Jiri Pirko, Sebastian Andrzej Siewior, Lorenzo Bianconi,
	Johannes Berg, Breno Leitao, Alexander Lobakin,
	open list:DOCUMENTATION, open list

Allow per-NAPI gro_flush_timeout setting.

The existing sysfs parameter is respected; writes to sysfs will write to
all NAPI structs for the device and the net_device gro_flush_timeout
field. Reads from sysfs will read from the net_device field.

The ability to set gro_flush_timeout on specific NAPI instances will be
added in a later commit, via netdev-genl.

Signed-off-by: Joe Damato <jdamato@fastly.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
---
 .../networking/net_cachelines/net_device.rst  |  1 +
 include/linux/netdevice.h                     |  3 +-
 net/core/dev.c                                | 12 +++---
 net/core/dev.h                                | 40 +++++++++++++++++++
 net/core/net-sysfs.c                          |  2 +-
 5 files changed, 50 insertions(+), 8 deletions(-)

diff --git a/Documentation/networking/net_cachelines/net_device.rst b/Documentation/networking/net_cachelines/net_device.rst
index 5a7388b2ab6f..67910ea49160 100644
--- a/Documentation/networking/net_cachelines/net_device.rst
+++ b/Documentation/networking/net_cachelines/net_device.rst
@@ -186,5 +186,6 @@ struct dpll_pin*                    dpll_pin
 struct hlist_head                   page_pools
 struct dim_irq_moder*               irq_moder
 u64                                 max_pacing_offload_horizon
+unsigned_long                       gro_flush_timeout
 u32                                 napi_defer_hard_irqs
 =================================== =========================== =================== =================== ===================================================================================
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 2e7bc23660ec..93241d4de437 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -373,6 +373,7 @@ struct napi_struct {
 	unsigned int		napi_id;
 	struct hrtimer		timer;
 	struct task_struct	*thread;
+	unsigned long		gro_flush_timeout;
 	u32			defer_hard_irqs;
 	/* control-path-only fields follow */
 	struct list_head	dev_list;
@@ -2085,7 +2086,6 @@ struct net_device {
 	int			ifindex;
 	unsigned int		real_num_rx_queues;
 	struct netdev_rx_queue	*_rx;
-	unsigned long		gro_flush_timeout;
 	unsigned int		gro_max_size;
 	unsigned int		gro_ipv4_max_size;
 	rx_handler_func_t __rcu	*rx_handler;
@@ -2413,6 +2413,7 @@ struct net_device {
 	struct dim_irq_moder	*irq_moder;
 
 	u64			max_pacing_offload_horizon;
+	unsigned long		gro_flush_timeout;
 	u32			napi_defer_hard_irqs;
 
 	/**
diff --git a/net/core/dev.c b/net/core/dev.c
index fbaa9eabf77f..e21ace3551d5 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -6232,12 +6232,12 @@ bool napi_complete_done(struct napi_struct *n, int work_done)
 
 	if (work_done) {
 		if (n->gro_bitmask)
-			timeout = READ_ONCE(n->dev->gro_flush_timeout);
+			timeout = napi_get_gro_flush_timeout(n);
 		n->defer_hard_irqs_count = napi_get_defer_hard_irqs(n);
 	}
 	if (n->defer_hard_irqs_count > 0) {
 		n->defer_hard_irqs_count--;
-		timeout = READ_ONCE(n->dev->gro_flush_timeout);
+		timeout = napi_get_gro_flush_timeout(n);
 		if (timeout)
 			ret = false;
 	}
@@ -6372,7 +6372,7 @@ static void busy_poll_stop(struct napi_struct *napi, void *have_poll_lock,
 
 	if (flags & NAPI_F_PREFER_BUSY_POLL) {
 		napi->defer_hard_irqs_count = napi_get_defer_hard_irqs(napi);
-		timeout = READ_ONCE(napi->dev->gro_flush_timeout);
+		timeout = napi_get_gro_flush_timeout(napi);
 		if (napi->defer_hard_irqs_count && timeout) {
 			hrtimer_start(&napi->timer, ns_to_ktime(timeout), HRTIMER_MODE_REL_PINNED);
 			skip_schedule = true;
@@ -6654,6 +6654,7 @@ void netif_napi_add_weight(struct net_device *dev, struct napi_struct *napi,
 	hrtimer_init(&napi->timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL_PINNED);
 	napi->timer.function = napi_watchdog;
 	napi_set_defer_hard_irqs(napi, READ_ONCE(dev->napi_defer_hard_irqs));
+	napi_set_gro_flush_timeout(napi, READ_ONCE(dev->gro_flush_timeout));
 	init_gro_hash(napi);
 	napi->skb = NULL;
 	INIT_LIST_HEAD(&napi->rx_list);
@@ -11059,7 +11060,7 @@ void netdev_sw_irq_coalesce_default_on(struct net_device *dev)
 	WARN_ON(dev->reg_state == NETREG_REGISTERED);
 
 	if (!IS_ENABLED(CONFIG_PREEMPT_RT)) {
-		dev->gro_flush_timeout = 20000;
+		netdev_set_gro_flush_timeout(dev, 20000);
 		netdev_set_defer_hard_irqs(dev, 1);
 	}
 }
@@ -12003,7 +12004,6 @@ static void __init net_dev_struct_check(void)
 	CACHELINE_ASSERT_GROUP_MEMBER(struct net_device, net_device_read_rx, ifindex);
 	CACHELINE_ASSERT_GROUP_MEMBER(struct net_device, net_device_read_rx, real_num_rx_queues);
 	CACHELINE_ASSERT_GROUP_MEMBER(struct net_device, net_device_read_rx, _rx);
-	CACHELINE_ASSERT_GROUP_MEMBER(struct net_device, net_device_read_rx, gro_flush_timeout);
 	CACHELINE_ASSERT_GROUP_MEMBER(struct net_device, net_device_read_rx, gro_max_size);
 	CACHELINE_ASSERT_GROUP_MEMBER(struct net_device, net_device_read_rx, gro_ipv4_max_size);
 	CACHELINE_ASSERT_GROUP_MEMBER(struct net_device, net_device_read_rx, rx_handler);
@@ -12015,7 +12015,7 @@ static void __init net_dev_struct_check(void)
 #ifdef CONFIG_NET_XGRESS
 	CACHELINE_ASSERT_GROUP_MEMBER(struct net_device, net_device_read_rx, tcx_ingress);
 #endif
-	CACHELINE_ASSERT_GROUP_SIZE(struct net_device, net_device_read_rx, 100);
+	CACHELINE_ASSERT_GROUP_SIZE(struct net_device, net_device_read_rx, 92);
 }
 
 /*
diff --git a/net/core/dev.h b/net/core/dev.h
index 0716b1048261..7d0aab7e3ef1 100644
--- a/net/core/dev.h
+++ b/net/core/dev.h
@@ -184,6 +184,46 @@ static inline void netdev_set_defer_hard_irqs(struct net_device *netdev,
 		napi_set_defer_hard_irqs(napi, defer);
 }
 
+/**
+ * napi_get_gro_flush_timeout - get the gro_flush_timeout
+ * @n: napi struct to get the gro_flush_timeout from
+ *
+ * Return: the per-NAPI value of the gro_flush_timeout field.
+ */
+static inline unsigned long
+napi_get_gro_flush_timeout(const struct napi_struct *n)
+{
+	return READ_ONCE(n->gro_flush_timeout);
+}
+
+/**
+ * napi_set_gro_flush_timeout - set the gro_flush_timeout for a napi
+ * @n: napi struct to set the gro_flush_timeout
+ * @timeout: timeout value to set
+ *
+ * napi_set_gro_flush_timeout sets the per-NAPI gro_flush_timeout
+ */
+static inline void napi_set_gro_flush_timeout(struct napi_struct *n,
+					      unsigned long timeout)
+{
+	WRITE_ONCE(n->gro_flush_timeout, timeout);
+}
+
+/**
+ * netdev_set_gro_flush_timeout - set gro_flush_timeout of a netdev's NAPIs
+ * @netdev: the net_device for which all NAPIs will have gro_flush_timeout set
+ * @timeout: the timeout value to set
+ */
+static inline void netdev_set_gro_flush_timeout(struct net_device *netdev,
+						unsigned long timeout)
+{
+	struct napi_struct *napi;
+
+	WRITE_ONCE(netdev->gro_flush_timeout, timeout);
+	list_for_each_entry(napi, &netdev->napi_list, dev_list)
+		napi_set_gro_flush_timeout(napi, timeout);
+}
+
 int rps_cpumask_housekeeping(struct cpumask *mask);
 
 #if defined(CONFIG_DEBUG_NET) && defined(CONFIG_BPF_SYSCALL)
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 25125f356a15..2d9afc6e2161 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -409,7 +409,7 @@ NETDEVICE_SHOW_RW(tx_queue_len, fmt_dec);
 
 static int change_gro_flush_timeout(struct net_device *dev, unsigned long val)
 {
-	WRITE_ONCE(dev->gro_flush_timeout, val);
+	netdev_set_gro_flush_timeout(dev, val);
 	return 0;
 }
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [net-next v6 4/9] netdev-genl: Dump gro_flush_timeout
  2024-10-11 18:44 [net-next v6 0/9] Add support for per-NAPI config via netlink Joe Damato
                   ` (2 preceding siblings ...)
  2024-10-11 18:44 ` [net-next v6 3/9] net: napi: Make gro_flush_timeout per-NAPI Joe Damato
@ 2024-10-11 18:44 ` Joe Damato
  2024-10-11 18:47   ` Eric Dumazet
  2024-10-11 18:45 ` [net-next v6 5/9] net: napi: Add napi_config Joe Damato
                   ` (5 subsequent siblings)
  9 siblings, 1 reply; 25+ messages in thread
From: Joe Damato @ 2024-10-11 18:44 UTC (permalink / raw)
  To: netdev
  Cc: mkarsten, skhawaja, sdf, bjorn, amritha.nambiar,
	sridhar.samudrala, willemdebruijn.kernel, edumazet, Joe Damato,
	Jakub Kicinski, David S. Miller, Paolo Abeni, Donald Hunter,
	Jesper Dangaard Brouer, Mina Almasry, Xuan Zhuo, Larysa Zaremba,
	open list

Support dumping gro_flush_timeout for a NAPI ID.

Signed-off-by: Joe Damato <jdamato@fastly.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
---
 Documentation/netlink/specs/netdev.yaml | 9 +++++++++
 include/uapi/linux/netdev.h             | 1 +
 net/core/netdev-genl.c                  | 6 ++++++
 tools/include/uapi/linux/netdev.h       | 1 +
 4 files changed, 17 insertions(+)

diff --git a/Documentation/netlink/specs/netdev.yaml b/Documentation/netlink/specs/netdev.yaml
index 585e87ec3c16..7b47454c51dd 100644
--- a/Documentation/netlink/specs/netdev.yaml
+++ b/Documentation/netlink/specs/netdev.yaml
@@ -255,6 +255,14 @@ attribute-sets:
         type: u32
         checks:
           max: s32-max
+      -
+        name: gro-flush-timeout
+        doc: The timeout, in nanoseconds, after which the NAPI watchdog timer
+             fires and schedules NAPI processing. A non-zero
+             value will also prevent GRO from flushing recent super-frames at
+             the end of a NAPI cycle. This may add receive latency in exchange
+             for reducing the number of frames processed by the network stack.
+        type: uint
   -
     name: queue
     attributes:
@@ -644,6 +652,7 @@ operations:
             - irq
             - pid
             - defer-hard-irqs
+            - gro-flush-timeout
       dump:
         request:
           attributes:
diff --git a/include/uapi/linux/netdev.h b/include/uapi/linux/netdev.h
index 13dc0b027e86..cacd33359c76 100644
--- a/include/uapi/linux/netdev.h
+++ b/include/uapi/linux/netdev.h
@@ -123,6 +123,7 @@ enum {
 	NETDEV_A_NAPI_IRQ,
 	NETDEV_A_NAPI_PID,
 	NETDEV_A_NAPI_DEFER_HARD_IRQS,
+	NETDEV_A_NAPI_GRO_FLUSH_TIMEOUT,
 
 	__NETDEV_A_NAPI_MAX,
 	NETDEV_A_NAPI_MAX = (__NETDEV_A_NAPI_MAX - 1)
diff --git a/net/core/netdev-genl.c b/net/core/netdev-genl.c
index f98e5d1d0d21..ac19f2e6cfbe 100644
--- a/net/core/netdev-genl.c
+++ b/net/core/netdev-genl.c
@@ -161,6 +161,7 @@ static int
 netdev_nl_napi_fill_one(struct sk_buff *rsp, struct napi_struct *napi,
 			const struct genl_info *info)
 {
+	unsigned long gro_flush_timeout;
 	u32 napi_defer_hard_irqs;
 	void *hdr;
 	pid_t pid;
@@ -195,6 +196,11 @@ netdev_nl_napi_fill_one(struct sk_buff *rsp, struct napi_struct *napi,
 			napi_defer_hard_irqs))
 		goto nla_put_failure;
 
+	gro_flush_timeout = napi_get_gro_flush_timeout(napi);
+	if (nla_put_uint(rsp, NETDEV_A_NAPI_GRO_FLUSH_TIMEOUT,
+			 gro_flush_timeout))
+		goto nla_put_failure;
+
 	genlmsg_end(rsp, hdr);
 
 	return 0;
diff --git a/tools/include/uapi/linux/netdev.h b/tools/include/uapi/linux/netdev.h
index 13dc0b027e86..cacd33359c76 100644
--- a/tools/include/uapi/linux/netdev.h
+++ b/tools/include/uapi/linux/netdev.h
@@ -123,6 +123,7 @@ enum {
 	NETDEV_A_NAPI_IRQ,
 	NETDEV_A_NAPI_PID,
 	NETDEV_A_NAPI_DEFER_HARD_IRQS,
+	NETDEV_A_NAPI_GRO_FLUSH_TIMEOUT,
 
 	__NETDEV_A_NAPI_MAX,
 	NETDEV_A_NAPI_MAX = (__NETDEV_A_NAPI_MAX - 1)
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [net-next v6 5/9] net: napi: Add napi_config
  2024-10-11 18:44 [net-next v6 0/9] Add support for per-NAPI config via netlink Joe Damato
                   ` (3 preceding siblings ...)
  2024-10-11 18:44 ` [net-next v6 4/9] netdev-genl: Dump gro_flush_timeout Joe Damato
@ 2024-10-11 18:45 ` Joe Damato
  2024-10-11 18:49   ` Eric Dumazet
  2024-11-27 17:43   ` Guenter Roeck
  2024-10-11 18:45 ` [net-next v6 6/9] netdev-genl: Support setting per-NAPI config values Joe Damato
                   ` (4 subsequent siblings)
  9 siblings, 2 replies; 25+ messages in thread
From: Joe Damato @ 2024-10-11 18:45 UTC (permalink / raw)
  To: netdev
  Cc: mkarsten, skhawaja, sdf, bjorn, amritha.nambiar,
	sridhar.samudrala, willemdebruijn.kernel, edumazet, Joe Damato,
	Jakub Kicinski, David S. Miller, Paolo Abeni, Jonathan Corbet,
	Jiri Pirko, Sebastian Andrzej Siewior, Lorenzo Bianconi,
	Johannes Berg, open list:DOCUMENTATION, open list

Add a persistent per-NAPI config area to the core. Drivers opt in to
setting the persistent config for a NAPI by passing an index when
calling netif_napi_add_config.

napi_config is allocated in alloc_netdev_mqs, freed in free_netdev
(after the NAPIs are deleted).

Drivers which call netif_napi_add_config will have persistent per-NAPI
settings: NAPI ID, gro_flush_timeout, and defer_hard_irqs.

Per-NAPI settings are saved in napi_disable and restored in napi_enable.

Co-developed-by: Martin Karsten <mkarsten@uwaterloo.ca>
Signed-off-by: Martin Karsten <mkarsten@uwaterloo.ca>
Signed-off-by: Joe Damato <jdamato@fastly.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
---
 .../networking/net_cachelines/net_device.rst  |  1 +
 include/linux/netdevice.h                     | 36 ++++++++-
 net/core/dev.c                                | 80 +++++++++++++++++--
 net/core/dev.h                                | 12 +++
 4 files changed, 119 insertions(+), 10 deletions(-)

diff --git a/Documentation/networking/net_cachelines/net_device.rst b/Documentation/networking/net_cachelines/net_device.rst
index 67910ea49160..db6192b2bb50 100644
--- a/Documentation/networking/net_cachelines/net_device.rst
+++ b/Documentation/networking/net_cachelines/net_device.rst
@@ -186,6 +186,7 @@ struct dpll_pin*                    dpll_pin
 struct hlist_head                   page_pools
 struct dim_irq_moder*               irq_moder
 u64                                 max_pacing_offload_horizon
+struct_napi_config*                 napi_config
 unsigned_long                       gro_flush_timeout
 u32                                 napi_defer_hard_irqs
 =================================== =========================== =================== =================== ===================================================================================
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 93241d4de437..8feaca12655e 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -342,6 +342,15 @@ struct gro_list {
  */
 #define GRO_HASH_BUCKETS	8
 
+/*
+ * Structure for per-NAPI config
+ */
+struct napi_config {
+	u64 gro_flush_timeout;
+	u32 defer_hard_irqs;
+	unsigned int napi_id;
+};
+
 /*
  * Structure for NAPI scheduling similar to tasklet but with weighting
  */
@@ -379,6 +388,8 @@ struct napi_struct {
 	struct list_head	dev_list;
 	struct hlist_node	napi_hash_node;
 	int			irq;
+	int			index;
+	struct napi_config	*config;
 };
 
 enum {
@@ -1868,9 +1879,6 @@ enum netdev_reg_state {
  *				allocated at register_netdev() time
  *	@real_num_rx_queues: 	Number of RX queues currently active in device
  *	@xdp_prog:		XDP sockets filter program pointer
- *	@gro_flush_timeout:	timeout for GRO layer in NAPI
- *	@napi_defer_hard_irqs:	If not zero, provides a counter that would
- *				allow to avoid NIC hard IRQ, on busy queues.
  *
  *	@rx_handler:		handler for received packets
  *	@rx_handler_data: 	XXX: need comments on this one
@@ -2020,6 +2028,11 @@ enum netdev_reg_state {
  *		   where the clock is recovered.
  *
  *	@max_pacing_offload_horizon: max EDT offload horizon in nsec.
+ *	@napi_config: An array of napi_config structures containing per-NAPI
+ *		      settings.
+ *	@gro_flush_timeout:	timeout for GRO layer in NAPI
+ *	@napi_defer_hard_irqs:	If not zero, provides a counter that would
+ *				allow to avoid NIC hard IRQ, on busy queues.
  *
  *	FIXME: cleanup struct net_device such that network protocol info
  *	moves out.
@@ -2413,6 +2426,7 @@ struct net_device {
 	struct dim_irq_moder	*irq_moder;
 
 	u64			max_pacing_offload_horizon;
+	struct napi_config	*napi_config;
 	unsigned long		gro_flush_timeout;
 	u32			napi_defer_hard_irqs;
 
@@ -2678,6 +2692,22 @@ netif_napi_add_tx_weight(struct net_device *dev,
 	netif_napi_add_weight(dev, napi, poll, weight);
 }
 
+/**
+ * netif_napi_add_config - initialize a NAPI context with persistent config
+ * @dev: network device
+ * @napi: NAPI context
+ * @poll: polling function
+ * @index: the NAPI index
+ */
+static inline void
+netif_napi_add_config(struct net_device *dev, struct napi_struct *napi,
+		      int (*poll)(struct napi_struct *, int), int index)
+{
+	napi->index = index;
+	napi->config = &dev->napi_config[index];
+	netif_napi_add_weight(dev, napi, poll, NAPI_POLL_WEIGHT);
+}
+
 /**
  * netif_napi_add_tx() - initialize a NAPI context to be used for Tx only
  * @dev:  network device
diff --git a/net/core/dev.c b/net/core/dev.c
index e21ace3551d5..c682173a7642 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -6505,6 +6505,23 @@ EXPORT_SYMBOL(napi_busy_loop);
 
 #endif /* CONFIG_NET_RX_BUSY_POLL */
 
+static void __napi_hash_add_with_id(struct napi_struct *napi,
+				    unsigned int napi_id)
+{
+	napi->napi_id = napi_id;
+	hlist_add_head_rcu(&napi->napi_hash_node,
+			   &napi_hash[napi->napi_id % HASH_SIZE(napi_hash)]);
+}
+
+static void napi_hash_add_with_id(struct napi_struct *napi,
+				  unsigned int napi_id)
+{
+	spin_lock(&napi_hash_lock);
+	WARN_ON_ONCE(napi_by_id(napi_id));
+	__napi_hash_add_with_id(napi, napi_id);
+	spin_unlock(&napi_hash_lock);
+}
+
 static void napi_hash_add(struct napi_struct *napi)
 {
 	if (test_bit(NAPI_STATE_NO_BUSY_POLL, &napi->state))
@@ -6517,10 +6534,8 @@ static void napi_hash_add(struct napi_struct *napi)
 		if (unlikely(++napi_gen_id < MIN_NAPI_ID))
 			napi_gen_id = MIN_NAPI_ID;
 	} while (napi_by_id(napi_gen_id));
-	napi->napi_id = napi_gen_id;
 
-	hlist_add_head_rcu(&napi->napi_hash_node,
-			   &napi_hash[napi->napi_id % HASH_SIZE(napi_hash)]);
+	__napi_hash_add_with_id(napi, napi_gen_id);
 
 	spin_unlock(&napi_hash_lock);
 }
@@ -6643,6 +6658,28 @@ void netif_queue_set_napi(struct net_device *dev, unsigned int queue_index,
 }
 EXPORT_SYMBOL(netif_queue_set_napi);
 
+static void napi_restore_config(struct napi_struct *n)
+{
+	n->defer_hard_irqs = n->config->defer_hard_irqs;
+	n->gro_flush_timeout = n->config->gro_flush_timeout;
+	/* a NAPI ID might be stored in the config, if so use it. if not, use
+	 * napi_hash_add to generate one for us. It will be saved to the config
+	 * in napi_disable.
+	 */
+	if (n->config->napi_id)
+		napi_hash_add_with_id(n, n->config->napi_id);
+	else
+		napi_hash_add(n);
+}
+
+static void napi_save_config(struct napi_struct *n)
+{
+	n->config->defer_hard_irqs = n->defer_hard_irqs;
+	n->config->gro_flush_timeout = n->gro_flush_timeout;
+	n->config->napi_id = n->napi_id;
+	napi_hash_del(n);
+}
+
 void netif_napi_add_weight(struct net_device *dev, struct napi_struct *napi,
 			   int (*poll)(struct napi_struct *, int), int weight)
 {
@@ -6653,8 +6690,6 @@ void netif_napi_add_weight(struct net_device *dev, struct napi_struct *napi,
 	INIT_HLIST_NODE(&napi->napi_hash_node);
 	hrtimer_init(&napi->timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL_PINNED);
 	napi->timer.function = napi_watchdog;
-	napi_set_defer_hard_irqs(napi, READ_ONCE(dev->napi_defer_hard_irqs));
-	napi_set_gro_flush_timeout(napi, READ_ONCE(dev->gro_flush_timeout));
 	init_gro_hash(napi);
 	napi->skb = NULL;
 	INIT_LIST_HEAD(&napi->rx_list);
@@ -6672,7 +6707,13 @@ void netif_napi_add_weight(struct net_device *dev, struct napi_struct *napi,
 	set_bit(NAPI_STATE_SCHED, &napi->state);
 	set_bit(NAPI_STATE_NPSVC, &napi->state);
 	list_add_rcu(&napi->dev_list, &dev->napi_list);
-	napi_hash_add(napi);
+
+	/* default settings from sysfs are applied to all NAPIs. any per-NAPI
+	 * configuration will be loaded in napi_enable
+	 */
+	napi_set_defer_hard_irqs(napi, READ_ONCE(dev->napi_defer_hard_irqs));
+	napi_set_gro_flush_timeout(napi, READ_ONCE(dev->gro_flush_timeout));
+
 	napi_get_frags_check(napi);
 	/* Create kthread for this napi if dev->threaded is set.
 	 * Clear dev->threaded if kthread creation failed so that
@@ -6704,6 +6745,11 @@ void napi_disable(struct napi_struct *n)
 
 	hrtimer_cancel(&n->timer);
 
+	if (n->config)
+		napi_save_config(n);
+	else
+		napi_hash_del(n);
+
 	clear_bit(NAPI_STATE_DISABLE, &n->state);
 }
 EXPORT_SYMBOL(napi_disable);
@@ -6719,6 +6765,11 @@ void napi_enable(struct napi_struct *n)
 {
 	unsigned long new, val = READ_ONCE(n->state);
 
+	if (n->config)
+		napi_restore_config(n);
+	else
+		napi_hash_add(n);
+
 	do {
 		BUG_ON(!test_bit(NAPI_STATE_SCHED, &val));
 
@@ -6748,7 +6799,11 @@ void __netif_napi_del(struct napi_struct *napi)
 	if (!test_and_clear_bit(NAPI_STATE_LISTED, &napi->state))
 		return;
 
-	napi_hash_del(napi);
+	if (napi->config) {
+		napi->index = -1;
+		napi->config = NULL;
+	}
+
 	list_del_rcu(&napi->dev_list);
 	napi_free_frags(napi);
 
@@ -11085,6 +11140,8 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
 		unsigned int txqs, unsigned int rxqs)
 {
 	struct net_device *dev;
+	size_t napi_config_sz;
+	unsigned int maxqs;
 
 	BUG_ON(strlen(name) >= sizeof(dev->name));
 
@@ -11098,6 +11155,8 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
 		return NULL;
 	}
 
+	maxqs = max(txqs, rxqs);
+
 	dev = kvzalloc(struct_size(dev, priv, sizeof_priv),
 		       GFP_KERNEL_ACCOUNT | __GFP_RETRY_MAYFAIL);
 	if (!dev)
@@ -11174,6 +11233,11 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
 	if (!dev->ethtool)
 		goto free_all;
 
+	napi_config_sz = array_size(maxqs, sizeof(*dev->napi_config));
+	dev->napi_config = kvzalloc(napi_config_sz, GFP_KERNEL_ACCOUNT);
+	if (!dev->napi_config)
+		goto free_all;
+
 	strscpy(dev->name, name);
 	dev->name_assign_type = name_assign_type;
 	dev->group = INIT_NETDEV_GROUP;
@@ -11237,6 +11301,8 @@ void free_netdev(struct net_device *dev)
 	list_for_each_entry_safe(p, n, &dev->napi_list, dev_list)
 		netif_napi_del(p);
 
+	kvfree(dev->napi_config);
+
 	ref_tracker_dir_exit(&dev->refcnt_tracker);
 #ifdef CONFIG_PCPU_DEV_REFCNT
 	free_percpu(dev->pcpu_refcnt);
diff --git a/net/core/dev.h b/net/core/dev.h
index 7d0aab7e3ef1..7881bced70a9 100644
--- a/net/core/dev.h
+++ b/net/core/dev.h
@@ -177,11 +177,17 @@ static inline void napi_set_defer_hard_irqs(struct napi_struct *n, u32 defer)
 static inline void netdev_set_defer_hard_irqs(struct net_device *netdev,
 					      u32 defer)
 {
+	unsigned int count = max(netdev->num_rx_queues,
+				 netdev->num_tx_queues);
 	struct napi_struct *napi;
+	int i;
 
 	WRITE_ONCE(netdev->napi_defer_hard_irqs, defer);
 	list_for_each_entry(napi, &netdev->napi_list, dev_list)
 		napi_set_defer_hard_irqs(napi, defer);
+
+	for (i = 0; i < count; i++)
+		netdev->napi_config[i].defer_hard_irqs = defer;
 }
 
 /**
@@ -217,11 +223,17 @@ static inline void napi_set_gro_flush_timeout(struct napi_struct *n,
 static inline void netdev_set_gro_flush_timeout(struct net_device *netdev,
 						unsigned long timeout)
 {
+	unsigned int count = max(netdev->num_rx_queues,
+				 netdev->num_tx_queues);
 	struct napi_struct *napi;
+	int i;
 
 	WRITE_ONCE(netdev->gro_flush_timeout, timeout);
 	list_for_each_entry(napi, &netdev->napi_list, dev_list)
 		napi_set_gro_flush_timeout(napi, timeout);
+
+	for (i = 0; i < count; i++)
+		netdev->napi_config[i].gro_flush_timeout = timeout;
 }
 
 int rps_cpumask_housekeeping(struct cpumask *mask);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [net-next v6 6/9] netdev-genl: Support setting per-NAPI config values
  2024-10-11 18:44 [net-next v6 0/9] Add support for per-NAPI config via netlink Joe Damato
                   ` (4 preceding siblings ...)
  2024-10-11 18:45 ` [net-next v6 5/9] net: napi: Add napi_config Joe Damato
@ 2024-10-11 18:45 ` Joe Damato
  2024-10-11 18:49   ` Eric Dumazet
  2024-11-12  9:17   ` Paolo Abeni
  2024-10-11 18:45 ` [net-next v6 7/9] bnxt: Add support for persistent NAPI config Joe Damato
                   ` (3 subsequent siblings)
  9 siblings, 2 replies; 25+ messages in thread
From: Joe Damato @ 2024-10-11 18:45 UTC (permalink / raw)
  To: netdev
  Cc: mkarsten, skhawaja, sdf, bjorn, amritha.nambiar,
	sridhar.samudrala, willemdebruijn.kernel, edumazet, Joe Damato,
	Jakub Kicinski, Donald Hunter, David S. Miller, Paolo Abeni,
	Jesper Dangaard Brouer, Mina Almasry, Xuan Zhuo, open list

Add support to set per-NAPI defer_hard_irqs and gro_flush_timeout.

Signed-off-by: Joe Damato <jdamato@fastly.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
---
 Documentation/netlink/specs/netdev.yaml | 11 ++++++
 include/uapi/linux/netdev.h             |  1 +
 net/core/netdev-genl-gen.c              | 18 ++++++++++
 net/core/netdev-genl-gen.h              |  1 +
 net/core/netdev-genl.c                  | 45 +++++++++++++++++++++++++
 tools/include/uapi/linux/netdev.h       |  1 +
 6 files changed, 77 insertions(+)

diff --git a/Documentation/netlink/specs/netdev.yaml b/Documentation/netlink/specs/netdev.yaml
index 7b47454c51dd..f9cb97d6106c 100644
--- a/Documentation/netlink/specs/netdev.yaml
+++ b/Documentation/netlink/specs/netdev.yaml
@@ -693,6 +693,17 @@ operations:
         reply:
           attributes:
             - id
+    -
+      name: napi-set
+      doc: Set configurable NAPI instance settings.
+      attribute-set: napi
+      flags: [ admin-perm ]
+      do:
+        request:
+          attributes:
+            - id
+            - defer-hard-irqs
+            - gro-flush-timeout
 
 kernel-family:
   headers: [ "linux/list.h"]
diff --git a/include/uapi/linux/netdev.h b/include/uapi/linux/netdev.h
index cacd33359c76..e3ebb49f60d2 100644
--- a/include/uapi/linux/netdev.h
+++ b/include/uapi/linux/netdev.h
@@ -201,6 +201,7 @@ enum {
 	NETDEV_CMD_NAPI_GET,
 	NETDEV_CMD_QSTATS_GET,
 	NETDEV_CMD_BIND_RX,
+	NETDEV_CMD_NAPI_SET,
 
 	__NETDEV_CMD_MAX,
 	NETDEV_CMD_MAX = (__NETDEV_CMD_MAX - 1)
diff --git a/net/core/netdev-genl-gen.c b/net/core/netdev-genl-gen.c
index b28424ae06d5..e197bd84997c 100644
--- a/net/core/netdev-genl-gen.c
+++ b/net/core/netdev-genl-gen.c
@@ -22,6 +22,10 @@ static const struct netlink_range_validation netdev_a_page_pool_ifindex_range =
 	.max	= 2147483647ULL,
 };
 
+static const struct netlink_range_validation netdev_a_napi_defer_hard_irqs_range = {
+	.max	= 2147483647ULL,
+};
+
 /* Common nested types */
 const struct nla_policy netdev_page_pool_info_nl_policy[NETDEV_A_PAGE_POOL_IFINDEX + 1] = {
 	[NETDEV_A_PAGE_POOL_ID] = NLA_POLICY_FULL_RANGE(NLA_UINT, &netdev_a_page_pool_id_range),
@@ -87,6 +91,13 @@ static const struct nla_policy netdev_bind_rx_nl_policy[NETDEV_A_DMABUF_FD + 1]
 	[NETDEV_A_DMABUF_QUEUES] = NLA_POLICY_NESTED(netdev_queue_id_nl_policy),
 };
 
+/* NETDEV_CMD_NAPI_SET - do */
+static const struct nla_policy netdev_napi_set_nl_policy[NETDEV_A_NAPI_GRO_FLUSH_TIMEOUT + 1] = {
+	[NETDEV_A_NAPI_ID] = { .type = NLA_U32, },
+	[NETDEV_A_NAPI_DEFER_HARD_IRQS] = NLA_POLICY_FULL_RANGE(NLA_U32, &netdev_a_napi_defer_hard_irqs_range),
+	[NETDEV_A_NAPI_GRO_FLUSH_TIMEOUT] = { .type = NLA_UINT, },
+};
+
 /* Ops table for netdev */
 static const struct genl_split_ops netdev_nl_ops[] = {
 	{
@@ -171,6 +182,13 @@ static const struct genl_split_ops netdev_nl_ops[] = {
 		.maxattr	= NETDEV_A_DMABUF_FD,
 		.flags		= GENL_ADMIN_PERM | GENL_CMD_CAP_DO,
 	},
+	{
+		.cmd		= NETDEV_CMD_NAPI_SET,
+		.doit		= netdev_nl_napi_set_doit,
+		.policy		= netdev_napi_set_nl_policy,
+		.maxattr	= NETDEV_A_NAPI_GRO_FLUSH_TIMEOUT,
+		.flags		= GENL_ADMIN_PERM | GENL_CMD_CAP_DO,
+	},
 };
 
 static const struct genl_multicast_group netdev_nl_mcgrps[] = {
diff --git a/net/core/netdev-genl-gen.h b/net/core/netdev-genl-gen.h
index 8cda334fd042..e09dd7539ff2 100644
--- a/net/core/netdev-genl-gen.h
+++ b/net/core/netdev-genl-gen.h
@@ -33,6 +33,7 @@ int netdev_nl_napi_get_dumpit(struct sk_buff *skb, struct netlink_callback *cb);
 int netdev_nl_qstats_get_dumpit(struct sk_buff *skb,
 				struct netlink_callback *cb);
 int netdev_nl_bind_rx_doit(struct sk_buff *skb, struct genl_info *info);
+int netdev_nl_napi_set_doit(struct sk_buff *skb, struct genl_info *info);
 
 enum {
 	NETDEV_NLGRP_MGMT,
diff --git a/net/core/netdev-genl.c b/net/core/netdev-genl.c
index ac19f2e6cfbe..b49c3b4e5fbe 100644
--- a/net/core/netdev-genl.c
+++ b/net/core/netdev-genl.c
@@ -303,6 +303,51 @@ int netdev_nl_napi_get_dumpit(struct sk_buff *skb, struct netlink_callback *cb)
 	return err;
 }
 
+static int
+netdev_nl_napi_set_config(struct napi_struct *napi, struct genl_info *info)
+{
+	u64 gro_flush_timeout = 0;
+	u32 defer = 0;
+
+	if (info->attrs[NETDEV_A_NAPI_DEFER_HARD_IRQS]) {
+		defer = nla_get_u32(info->attrs[NETDEV_A_NAPI_DEFER_HARD_IRQS]);
+		napi_set_defer_hard_irqs(napi, defer);
+	}
+
+	if (info->attrs[NETDEV_A_NAPI_GRO_FLUSH_TIMEOUT]) {
+		gro_flush_timeout = nla_get_uint(info->attrs[NETDEV_A_NAPI_GRO_FLUSH_TIMEOUT]);
+		napi_set_gro_flush_timeout(napi, gro_flush_timeout);
+	}
+
+	return 0;
+}
+
+int netdev_nl_napi_set_doit(struct sk_buff *skb, struct genl_info *info)
+{
+	struct napi_struct *napi;
+	unsigned int napi_id;
+	int err;
+
+	if (GENL_REQ_ATTR_CHECK(info, NETDEV_A_NAPI_ID))
+		return -EINVAL;
+
+	napi_id = nla_get_u32(info->attrs[NETDEV_A_NAPI_ID]);
+
+	rtnl_lock();
+
+	napi = napi_by_id(napi_id);
+	if (napi) {
+		err = netdev_nl_napi_set_config(napi, info);
+	} else {
+		NL_SET_BAD_ATTR(info->extack, info->attrs[NETDEV_A_NAPI_ID]);
+		err = -ENOENT;
+	}
+
+	rtnl_unlock();
+
+	return err;
+}
+
 static int
 netdev_nl_queue_fill_one(struct sk_buff *rsp, struct net_device *netdev,
 			 u32 q_idx, u32 q_type, const struct genl_info *info)
diff --git a/tools/include/uapi/linux/netdev.h b/tools/include/uapi/linux/netdev.h
index cacd33359c76..e3ebb49f60d2 100644
--- a/tools/include/uapi/linux/netdev.h
+++ b/tools/include/uapi/linux/netdev.h
@@ -201,6 +201,7 @@ enum {
 	NETDEV_CMD_NAPI_GET,
 	NETDEV_CMD_QSTATS_GET,
 	NETDEV_CMD_BIND_RX,
+	NETDEV_CMD_NAPI_SET,
 
 	__NETDEV_CMD_MAX,
 	NETDEV_CMD_MAX = (__NETDEV_CMD_MAX - 1)
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [net-next v6 7/9] bnxt: Add support for persistent NAPI config
  2024-10-11 18:44 [net-next v6 0/9] Add support for per-NAPI config via netlink Joe Damato
                   ` (5 preceding siblings ...)
  2024-10-11 18:45 ` [net-next v6 6/9] netdev-genl: Support setting per-NAPI config values Joe Damato
@ 2024-10-11 18:45 ` Joe Damato
  2024-10-11 18:45 ` [net-next v6 8/9] mlx5: " Joe Damato
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 25+ messages in thread
From: Joe Damato @ 2024-10-11 18:45 UTC (permalink / raw)
  To: netdev
  Cc: mkarsten, skhawaja, sdf, bjorn, amritha.nambiar,
	sridhar.samudrala, willemdebruijn.kernel, edumazet, Joe Damato,
	Jakub Kicinski, Michael Chan, David S. Miller, Paolo Abeni,
	open list

Use netif_napi_add_config to assign persistent per-NAPI config when
initializing NAPIs.

Signed-off-by: Joe Damato <jdamato@fastly.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 6e422e24750a..f5da2dace982 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -10986,7 +10986,8 @@ static void bnxt_init_napi(struct bnxt *bp)
 		cp_nr_rings--;
 	for (i = 0; i < cp_nr_rings; i++) {
 		bnapi = bp->bnapi[i];
-		netif_napi_add(bp->dev, &bnapi->napi, poll_fn);
+		netif_napi_add_config(bp->dev, &bnapi->napi, poll_fn,
+				      bnapi->index);
 	}
 	if (BNXT_CHIP_TYPE_NITRO_A0(bp)) {
 		bnapi = bp->bnapi[cp_nr_rings];
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [net-next v6 8/9] mlx5: Add support for persistent NAPI config
  2024-10-11 18:44 [net-next v6 0/9] Add support for per-NAPI config via netlink Joe Damato
                   ` (6 preceding siblings ...)
  2024-10-11 18:45 ` [net-next v6 7/9] bnxt: Add support for persistent NAPI config Joe Damato
@ 2024-10-11 18:45 ` Joe Damato
  2024-10-11 18:45 ` [net-next v6 9/9] mlx4: Add support for persistent NAPI config to RX CQs Joe Damato
  2024-10-15  1:10 ` [net-next v6 0/9] Add support for per-NAPI config via netlink patchwork-bot+netdevbpf
  9 siblings, 0 replies; 25+ messages in thread
From: Joe Damato @ 2024-10-11 18:45 UTC (permalink / raw)
  To: netdev
  Cc: mkarsten, skhawaja, sdf, bjorn, amritha.nambiar,
	sridhar.samudrala, willemdebruijn.kernel, edumazet, Joe Damato,
	Jakub Kicinski, Saeed Mahameed, Tariq Toukan, Leon Romanovsky,
	David S. Miller, Paolo Abeni,
	open list:MELLANOX MLX5 core VPI driver, open list

Use netif_napi_add_config to assign persistent per-NAPI config when
initializing NAPIs.

Signed-off-by: Joe Damato <jdamato@fastly.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
---
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index a5659c0c4236..09ab7ac07c29 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -2697,7 +2697,7 @@ static int mlx5e_open_channel(struct mlx5e_priv *priv, int ix,
 	c->aff_mask = irq_get_effective_affinity_mask(irq);
 	c->lag_port = mlx5e_enumerate_lag_port(mdev, ix);
 
-	netif_napi_add(netdev, &c->napi, mlx5e_napi_poll);
+	netif_napi_add_config(netdev, &c->napi, mlx5e_napi_poll, ix);
 	netif_napi_set_irq(&c->napi, irq);
 
 	err = mlx5e_open_queues(c, params, cparam);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [net-next v6 9/9] mlx4: Add support for persistent NAPI config to RX CQs
  2024-10-11 18:44 [net-next v6 0/9] Add support for per-NAPI config via netlink Joe Damato
                   ` (7 preceding siblings ...)
  2024-10-11 18:45 ` [net-next v6 8/9] mlx5: " Joe Damato
@ 2024-10-11 18:45 ` Joe Damato
  2024-10-15  1:10 ` [net-next v6 0/9] Add support for per-NAPI config via netlink patchwork-bot+netdevbpf
  9 siblings, 0 replies; 25+ messages in thread
From: Joe Damato @ 2024-10-11 18:45 UTC (permalink / raw)
  To: netdev
  Cc: mkarsten, skhawaja, sdf, bjorn, amritha.nambiar,
	sridhar.samudrala, willemdebruijn.kernel, edumazet, Joe Damato,
	Jakub Kicinski, Tariq Toukan, David S. Miller, Paolo Abeni,
	open list:MELLANOX MLX4 core VPI driver, open list

Use netif_napi_add_config to assign persistent per-NAPI config when
initializing RX CQ NAPIs.

Presently, struct napi_config only has support for two fields used for
RX, so there is no need to support them with TX CQs, yet.

Signed-off-by: Joe Damato <jdamato@fastly.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
---
 drivers/net/ethernet/mellanox/mlx4/en_cq.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_cq.c b/drivers/net/ethernet/mellanox/mlx4/en_cq.c
index 461cc2c79c71..0e92956e84cf 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_cq.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_cq.c
@@ -156,7 +156,8 @@ int mlx4_en_activate_cq(struct mlx4_en_priv *priv, struct mlx4_en_cq *cq,
 		break;
 	case RX:
 		cq->mcq.comp = mlx4_en_rx_irq;
-		netif_napi_add(cq->dev, &cq->napi, mlx4_en_poll_rx_cq);
+		netif_napi_add_config(cq->dev, &cq->napi, mlx4_en_poll_rx_cq,
+				      cq_idx);
 		netif_napi_set_irq(&cq->napi, irq);
 		napi_enable(&cq->napi);
 		netif_queue_set_napi(cq->dev, cq_idx, NETDEV_QUEUE_TYPE_RX, &cq->napi);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 25+ messages in thread
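
All three driver patches above follow the same opt-in pattern: pass a
queue-relative index that stays stable across ifdown/ifup so that each NAPI
is re-linked to the same persistent config slot whenever its queues are
recreated. A rough sketch of the pattern (the driver structure and names
below are hypothetical; netif_napi_add_config() is the only API from this
series being shown):

/* Hypothetical driver sketch: example_priv, example_poll and the ring
 * layout are invented for illustration. */
static void example_init_napis(struct example_priv *priv)
{
	int i;

	for (i = 0; i < priv->num_rx_rings; i++)
		/* The index must be stable across ifdown/ifup so the NAPI
		 * is matched back to the same napi_config slot. */
		netif_napi_add_config(priv->netdev,
				      &priv->rx_rings[i].napi,
				      example_poll, i);
}

If the index were derived from something that changes across a channel
reconfiguration, the persisted defer_hard_irqs and gro_flush_timeout values
would attach to a different queue afterwards, which is why the drivers above
pass the queue/channel index directly.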

* Re: [net-next v6 4/9] netdev-genl: Dump gro_flush_timeout
  2024-10-11 18:44 ` [net-next v6 4/9] netdev-genl: Dump gro_flush_timeout Joe Damato
@ 2024-10-11 18:47   ` Eric Dumazet
  0 siblings, 0 replies; 25+ messages in thread
From: Eric Dumazet @ 2024-10-11 18:47 UTC (permalink / raw)
  To: Joe Damato
  Cc: netdev, mkarsten, skhawaja, sdf, bjorn, amritha.nambiar,
	sridhar.samudrala, willemdebruijn.kernel, Jakub Kicinski,
	David S. Miller, Paolo Abeni, Donald Hunter,
	Jesper Dangaard Brouer, Mina Almasry, Xuan Zhuo, Larysa Zaremba,
	open list

On Fri, Oct 11, 2024 at 8:46 PM Joe Damato <jdamato@fastly.com> wrote:
>
> Support dumping gro_flush_timeout for a NAPI ID.
>
> Signed-off-by: Joe Damato <jdamato@fastly.com>
> Reviewed-by: Jakub Kicinski <kuba@kernel.org>

Reviewed-by: Eric Dumazet <edumazet@google.com>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [net-next v6 5/9] net: napi: Add napi_config
  2024-10-11 18:45 ` [net-next v6 5/9] net: napi: Add napi_config Joe Damato
@ 2024-10-11 18:49   ` Eric Dumazet
  2024-11-27 17:43   ` Guenter Roeck
  1 sibling, 0 replies; 25+ messages in thread
From: Eric Dumazet @ 2024-10-11 18:49 UTC (permalink / raw)
  To: Joe Damato
  Cc: netdev, mkarsten, skhawaja, sdf, bjorn, amritha.nambiar,
	sridhar.samudrala, willemdebruijn.kernel, Jakub Kicinski,
	David S. Miller, Paolo Abeni, Jonathan Corbet, Jiri Pirko,
	Sebastian Andrzej Siewior, Lorenzo Bianconi, Johannes Berg,
	open list:DOCUMENTATION, open list

On Fri, Oct 11, 2024 at 8:46 PM Joe Damato <jdamato@fastly.com> wrote:
>
> Add a persistent NAPI config area for NAPI configuration to the core.
> Drivers opt-in to setting the persistent config for a NAPI by passing an
> index when calling netif_napi_add_config.
>
> napi_config is allocated in alloc_netdev_mqs, freed in free_netdev
> (after the NAPIs are deleted).
>
> Drivers which call netif_napi_add_config will have persistent per-NAPI
> settings: NAPI IDs, gro_flush_timeout, and defer_hard_irq settings.
>
> Per-NAPI settings are saved in napi_disable and restored in napi_enable.
>
> Co-developed-by: Martin Karsten <mkarsten@uwaterloo.ca>
> Signed-off-by: Martin Karsten <mkarsten@uwaterloo.ca>
> Signed-off-by: Joe Damato <jdamato@fastly.com>
> Reviewed-by: Jakub Kicinski <kuba@kernel.org>

Reviewed-by: Eric Dumazet <edumazet@google.com>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [net-next v6 6/9] netdev-genl: Support setting per-NAPI config values
  2024-10-11 18:45 ` [net-next v6 6/9] netdev-genl: Support setting per-NAPI config values Joe Damato
@ 2024-10-11 18:49   ` Eric Dumazet
  2024-11-12  9:17   ` Paolo Abeni
  1 sibling, 0 replies; 25+ messages in thread
From: Eric Dumazet @ 2024-10-11 18:49 UTC (permalink / raw)
  To: Joe Damato
  Cc: netdev, mkarsten, skhawaja, sdf, bjorn, amritha.nambiar,
	sridhar.samudrala, willemdebruijn.kernel, Jakub Kicinski,
	Donald Hunter, David S. Miller, Paolo Abeni,
	Jesper Dangaard Brouer, Mina Almasry, Xuan Zhuo, open list

On Fri, Oct 11, 2024 at 8:46 PM Joe Damato <jdamato@fastly.com> wrote:
>
> Add support to set per-NAPI defer_hard_irqs and gro_flush_timeout.
>
> Signed-off-by: Joe Damato <jdamato@fastly.com>
> Reviewed-by: Jakub Kicinski <kuba@kernel.org>

Reviewed-by: Eric Dumazet <edumazet@google.com>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [net-next v6 0/9] Add support for per-NAPI config via netlink
  2024-10-11 18:44 [net-next v6 0/9] Add support for per-NAPI config via netlink Joe Damato
                   ` (8 preceding siblings ...)
  2024-10-11 18:45 ` [net-next v6 9/9] mlx4: Add support for persistent NAPI config to RX CQs Joe Damato
@ 2024-10-15  1:10 ` patchwork-bot+netdevbpf
  9 siblings, 0 replies; 25+ messages in thread
From: patchwork-bot+netdevbpf @ 2024-10-15  1:10 UTC (permalink / raw)
  To: Joe Damato
  Cc: netdev, mkarsten, skhawaja, sdf, bjorn, amritha.nambiar,
	sridhar.samudrala, willemdebruijn.kernel, edumazet,
	aleksander.lobakin, leitao, danielj, dsahern, davem,
	donald.hunter, kuba, hawk, jiri, johannes.berg, corbet, leon,
	linux-doc, linux-kernel, linux-rdma, lorenzo, michael.chan,
	almasrymina, pabeni, saeedm, bigeasy, tariqt, xuanzhuo

Hello:

This series was applied to netdev/net-next.git (main)
by Jakub Kicinski <kuba@kernel.org>:

On Fri, 11 Oct 2024 18:44:55 +0000 you wrote:
> Greetings:
> 
> Welcome to v6. Minor changes from v5 [1], please see changelog below.
> 
> There were no explicit comments from reviewers on the call outs in my
> v5, so I'm retaining them from my previous cover letter just in case :)
> 
> [...]

Here is the summary with links:
  - [net-next,v6,1/9] net: napi: Make napi_defer_hard_irqs per-NAPI
    https://git.kernel.org/netdev/net-next/c/f15e3b3ddb9f
  - [net-next,v6,2/9] netdev-genl: Dump napi_defer_hard_irqs
    https://git.kernel.org/netdev/net-next/c/516010460011
  - [net-next,v6,3/9] net: napi: Make gro_flush_timeout per-NAPI
    https://git.kernel.org/netdev/net-next/c/acb8d4ed5661
  - [net-next,v6,4/9] netdev-genl: Dump gro_flush_timeout
    https://git.kernel.org/netdev/net-next/c/0137891e7457
  - [net-next,v6,5/9] net: napi: Add napi_config
    https://git.kernel.org/netdev/net-next/c/86e25f40aa1e
  - [net-next,v6,6/9] netdev-genl: Support setting per-NAPI config values
    https://git.kernel.org/netdev/net-next/c/1287c1ae0fc2
  - [net-next,v6,7/9] bnxt: Add support for persistent NAPI config
    https://git.kernel.org/netdev/net-next/c/419365227496
  - [net-next,v6,8/9] mlx5: Add support for persistent NAPI config
    https://git.kernel.org/netdev/net-next/c/2a3372cafe02
  - [net-next,v6,9/9] mlx4: Add support for persistent NAPI config to RX CQs
    https://git.kernel.org/netdev/net-next/c/c9191eaa7285

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [net-next v6 6/9] netdev-genl: Support setting per-NAPI config values
  2024-10-11 18:45 ` [net-next v6 6/9] netdev-genl: Support setting per-NAPI config values Joe Damato
  2024-10-11 18:49   ` Eric Dumazet
@ 2024-11-12  9:17   ` Paolo Abeni
  2024-11-12 17:12     ` Joe Damato
  1 sibling, 1 reply; 25+ messages in thread
From: Paolo Abeni @ 2024-11-12  9:17 UTC (permalink / raw)
  To: Joe Damato, netdev
  Cc: mkarsten, skhawaja, sdf, bjorn, amritha.nambiar,
	sridhar.samudrala, willemdebruijn.kernel, edumazet,
	Jakub Kicinski, Donald Hunter, David S. Miller,
	Jesper Dangaard Brouer, Mina Almasry, Xuan Zhuo, open list

On 10/11/24 20:45, Joe Damato wrote:
> +int netdev_nl_napi_set_doit(struct sk_buff *skb, struct genl_info *info)
> +{
> +	struct napi_struct *napi;
> +	unsigned int napi_id;
> +	int err;
> +
> +	if (GENL_REQ_ATTR_CHECK(info, NETDEV_A_NAPI_ID))
> +		return -EINVAL;
> +
> +	napi_id = nla_get_u32(info->attrs[NETDEV_A_NAPI_ID]);
> +
> +	rtnl_lock();
> +
> +	napi = napi_by_id(napi_id);

AFAICS the above causes a RCU splat in the selftests:

https://netdev-3.bots.linux.dev/vmksft-net-dbg/results/856342/61-busy-poll-test-sh/stderr

because napi_by_id() only checks for the RCU lock.

Could you please have a look?

Thanks!

Paolo


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [net-next v6 6/9] netdev-genl: Support setting per-NAPI config values
  2024-11-12  9:17   ` Paolo Abeni
@ 2024-11-12 17:12     ` Joe Damato
  0 siblings, 0 replies; 25+ messages in thread
From: Joe Damato @ 2024-11-12 17:12 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: netdev, mkarsten, skhawaja, sdf, bjorn, amritha.nambiar,
	sridhar.samudrala, willemdebruijn.kernel, edumazet,
	Jakub Kicinski, Donald Hunter, David S. Miller,
	Jesper Dangaard Brouer, Mina Almasry, Xuan Zhuo, open list

On Tue, Nov 12, 2024 at 10:17:40AM +0100, Paolo Abeni wrote:
> On 10/11/24 20:45, Joe Damato wrote:
> > +int netdev_nl_napi_set_doit(struct sk_buff *skb, struct genl_info *info)
> > +{
> > +	struct napi_struct *napi;
> > +	unsigned int napi_id;
> > +	int err;
> > +
> > +	if (GENL_REQ_ATTR_CHECK(info, NETDEV_A_NAPI_ID))
> > +		return -EINVAL;
> > +
> > +	napi_id = nla_get_u32(info->attrs[NETDEV_A_NAPI_ID]);
> > +
> > +	rtnl_lock();
> > +
> > +	napi = napi_by_id(napi_id);
> 
> AFAICS the above causes a RCU splat in the selftests:
> 
> https://netdev-3.bots.linux.dev/vmksft-net-dbg/results/856342/61-busy-poll-test-sh/stderr
> 
> because napi_by_id() only checks for the RCU lock.
> 
> Could you please have a look?

Thanks for letting me know.

I rebuilt my kernel with CONFIG_PROVE_RCU_LIST and a couple other
debugging options and I was able to reproduce the splat you
mentioned.

I took a look and it looks like there might be two things:
  - netdev_nl_napi_set_doit needs to call rcu_read_lock /
    rcu_read_unlock, which would be a Fixes on the commit in the
    series just merged, and
  - netdev_nl_napi_get_doit also has the same issue and should be
    fixed in a separate commit with its own fixes tag.

If that sounds right to you, I'll propose a short series of 2
patches, 1 to fix each.

Let me know if that sounds OK?

In the meantime, I'm rebuilding a kernel now to ensure my proposed
fix fixes the splat.

^ permalink raw reply	[flat|nested] 25+ messages in thread
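
A minimal sketch of the fix described above (an assumption based on this
message; the patch that was eventually posted may differ) is to take the RCU
read lock around the napi_by_id() lookup and the use of the returned pointer,
in addition to holding rtnl:

int netdev_nl_napi_set_doit(struct sk_buff *skb, struct genl_info *info)
{
	struct napi_struct *napi;
	unsigned int napi_id;
	int err;

	if (GENL_REQ_ATTR_CHECK(info, NETDEV_A_NAPI_ID))
		return -EINVAL;

	napi_id = nla_get_u32(info->attrs[NETDEV_A_NAPI_ID]);

	rtnl_lock();
	rcu_read_lock();	/* napi_by_id() requires the RCU read lock */

	napi = napi_by_id(napi_id);
	if (napi) {
		err = netdev_nl_napi_set_config(napi, info);
	} else {
		NL_SET_BAD_ATTR(info->extack, info->attrs[NETDEV_A_NAPI_ID]);
		err = -ENOENT;
	}

	rcu_read_unlock();
	rtnl_unlock();

	return err;
}

Per the plan above, netdev_nl_napi_get_doit would receive the same treatment
in a separate patch with its own Fixes tag.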

* Re: [net-next v6 5/9] net: napi: Add napi_config
  2024-10-11 18:45 ` [net-next v6 5/9] net: napi: Add napi_config Joe Damato
  2024-10-11 18:49   ` Eric Dumazet
@ 2024-11-27 17:43   ` Guenter Roeck
  2024-11-27 18:51     ` Joe Damato
  1 sibling, 1 reply; 25+ messages in thread
From: Guenter Roeck @ 2024-11-27 17:43 UTC (permalink / raw)
  To: Joe Damato
  Cc: netdev, mkarsten, skhawaja, sdf, bjorn, amritha.nambiar,
	sridhar.samudrala, willemdebruijn.kernel, edumazet,
	Jakub Kicinski, David S. Miller, Paolo Abeni, Jonathan Corbet,
	Jiri Pirko, Sebastian Andrzej Siewior, Lorenzo Bianconi,
	Johannes Berg, open list:DOCUMENTATION, open list

Hi,

On Fri, Oct 11, 2024 at 06:45:00PM +0000, Joe Damato wrote:
> Add a persistent NAPI config area for NAPI configuration to the core.
> Drivers opt-in to setting the persistent config for a NAPI by passing an
> index when calling netif_napi_add_config.
> 
> napi_config is allocated in alloc_netdev_mqs, freed in free_netdev
> (after the NAPIs are deleted).
> 
> Drivers which call netif_napi_add_config will have persistent per-NAPI
> settings: NAPI IDs, gro_flush_timeout, and defer_hard_irq settings.
> 
> Per-NAPI settings are saved in napi_disable and restored in napi_enable.
> 
> Co-developed-by: Martin Karsten <mkarsten@uwaterloo.ca>
> Signed-off-by: Martin Karsten <mkarsten@uwaterloo.ca>
> Signed-off-by: Joe Damato <jdamato@fastly.com>
> Reviewed-by: Jakub Kicinski <kuba@kernel.org>

This patch triggers a lock inversion message on pcnet Ethernet adapters.

========================================================
WARNING: possible irq lock inversion dependency detected
6.12.0-08446-g228a1157fb9f #1 Tainted: G                 N

The problem seems obvious - napi_hash_lock and the local driver lock are
acquired in different order depending on the call sequence. Unfortunately
I have no idea how to fix the problem or I'd submit a patch.

Complete backtrace and bisect log attached. Bisect was run with qemu on
riscv64, but the architecture/platform does not really matter.

Please let me know if there is anything I can do to help resolve the
problem.

Thanks,
Guenter

---
[   13.251894] ========================================================
[   13.252024] WARNING: possible irq lock inversion dependency detected
[   13.252307] 6.12.0-08446-g228a1157fb9f #1 Tainted: G                 N
[   13.252472] --------------------------------------------------------
[   13.252569] ip/1816 just changed the state of lock:
[   13.252678] ffffffff81dec490 (napi_hash_lock){+...}-{3:3}, at: napi_disable+0xf8/0x10c
[   13.253497] but this lock was taken by another, HARDIRQ-safe lock in the past:
[   13.253637]  (&lp->lock){-.-.}-{3:3}
[   13.253682]
[   13.253682]
[   13.253682] and interrupts could create inverse lock ordering between them.
[   13.253682]
[   13.253923]
[   13.253923] other info that might help us debug this:
[   13.254082]  Possible interrupt unsafe locking scenario:
[   13.254082]
[   13.254186]        CPU0                    CPU1
[   13.254264]        ----                    ----
[   13.254340]   lock(napi_hash_lock);
[   13.254438]                                local_irq_disable();
[   13.254532]                                lock(&lp->lock);
[   13.254649]                                lock(napi_hash_lock);
[   13.254772]   <Interrupt>
[   13.254828]     lock(&lp->lock);
[   13.254921]
[   13.254921]  *** DEADLOCK ***
[   13.254921]
[   13.255049] 1 lock held by ip/1816:
[   13.255127]  #0: ffffffff81dece50 (rtnl_mutex){+.+.}-{4:4}, at: rtnl_newlink+0x22a/0x74c
[   13.255398]
[   13.255398] the shortest dependencies between 2nd lock and 1st lock:
[   13.255593]  -> (&lp->lock){-.-.}-{3:3} ops: 75 {
[   13.255802]     IN-HARDIRQ-W at:
[   13.255910]                       __lock_acquire+0xa3e/0x2158
[   13.256055]                       lock_acquire.part.0+0xba/0x21e
[   13.256153]                       lock_acquire+0x44/0x5a
[   13.256241]                       _raw_spin_lock+0x2c/0x40
[   13.256343]                       pcnet32_interrupt+0x3c/0x200
[   13.256442]                       __handle_irq_event_percpu+0xa0/0x2e0
[   13.256547]                       handle_irq_event+0x3c/0x8a
[   13.256640]                       handle_fasteoi_irq+0x9c/0x1d2
[   13.256738]                       generic_handle_domain_irq+0x1c/0x2a
[   13.256840]                       plic_handle_irq+0x7e/0xfc
[   13.256937]                       generic_handle_domain_irq+0x1c/0x2a
[   13.257041]                       riscv_intc_irq+0x26/0x60
[   13.257133]                       handle_riscv_irq+0x4a/0x74
[   13.257228]                       call_on_irq_stack+0x32/0x40
[   13.257349]     IN-SOFTIRQ-W at:
[   13.257420]                       __lock_acquire+0x40a/0x2158
[   13.257515]                       lock_acquire.part.0+0xba/0x21e
[   13.257611]                       lock_acquire+0x44/0x5a
[   13.257699]                       _raw_spin_lock_irqsave+0x3a/0x64
[   13.257798]                       pcnet32_poll+0x2ac/0x768
[   13.257892]                       __napi_poll.constprop.0+0x26/0x128
[   13.257997]                       net_rx_action+0x186/0x30e
[   13.258090]                       handle_softirqs+0x110/0x4a2
[   13.258187]                       __irq_exit_rcu+0xe2/0x10c
[   13.258279]                       irq_exit_rcu+0xc/0x36
[   13.258368]                       handle_riscv_irq+0x64/0x74
[   13.258463]                       call_on_irq_stack+0x32/0x40
[   13.258558]     INITIAL USE at:
[   13.258629]                      __lock_acquire+0x46c/0x2158
[   13.258723]                      lock_acquire.part.0+0xba/0x21e
[   13.258820]                      lock_acquire+0x44/0x5a
[   13.258909]                      _raw_spin_lock_irqsave+0x3a/0x64
[   13.259007]                      pcnet32_get_stats+0x2a/0x62
[   13.259101]                      dev_get_stats+0xc4/0x2a6
[   13.259380]                      rtnl_fill_stats+0x32/0xee
[   13.259480]                      rtnl_fill_ifinfo.isra.0+0x648/0x141c
[   13.259589]                      rtmsg_ifinfo_build_skb+0x98/0xf0
[   13.259711]                      rtmsg_ifinfo+0x36/0x78
[   13.259799]                      register_netdevice+0x758/0x7a8
[   13.259905]                      register_netdev+0x18/0x2e
[   13.259999]                      pcnet32_probe1+0xb96/0x103e
[   13.260099]                      pcnet32_probe_pci+0xcc/0x12e
[   13.260196]                      pci_device_probe+0x82/0x100
[   13.260291]                      really_probe+0x86/0x234
[   13.260383]                      __driver_probe_device+0x5c/0xda
[   13.260482]                      driver_probe_device+0x2c/0xb2
[   13.260578]                      __driver_attach+0x70/0x120
[   13.260671]                      bus_for_each_dev+0x60/0xae
[   13.260764]                      driver_attach+0x1a/0x22
[   13.260853]                      bus_add_driver+0xce/0x1d6
[   13.260951]                      driver_register+0x3e/0xd8
[   13.261048]                      __pci_register_driver+0x5c/0x66
[   13.261149]                      pcnet32_init_module+0x58/0x14a
[   13.261255]                      do_one_initcall+0x7e/0x2b6
[   13.261352]                      kernel_init_freeable+0x2cc/0x33e
[   13.261456]                      kernel_init+0x1e/0x11a
[   13.261550]                      ret_from_fork+0xe/0x18
[   13.261653]   }
[   13.261705]   ... key      at: [<ffffffff82a9efb0>] __key.4+0x0/0x10
[   13.261839]   ... acquired at:
[   13.261907]    lock_acquire.part.0+0xba/0x21e
[   13.261987]    lock_acquire+0x44/0x5a
[   13.262056]    _raw_spin_lock+0x2c/0x40
[   13.262132]    napi_hash_add+0x26/0xb8
[   13.262207]    napi_enable+0x10e/0x124
[   13.262281]    pcnet32_open+0x3c2/0x6b8
[   13.262357]    __dev_open+0xba/0x158
[   13.262427]    __dev_change_flags+0x19a/0x214
[   13.262508]    dev_change_flags+0x1e/0x56
[   13.262583]    do_setlink.isra.0+0x20c/0xcc2
[   13.262661]    rtnl_newlink+0x592/0x74c
[   13.262731]    rtnetlink_rcv_msg+0x3a8/0x582
[   13.262807]    netlink_rcv_skb+0x44/0xf4
[   13.262882]    rtnetlink_rcv+0x14/0x1c
[   13.262956]    netlink_unicast+0x1a0/0x23e
[   13.263031]    netlink_sendmsg+0x17e/0x38e
[   13.263109]    __sock_sendmsg+0x40/0x7a
[   13.263182]    ____sys_sendmsg+0x18e/0x1de
[   13.263288]    ___sys_sendmsg+0x82/0xc6
[   13.263360]    __sys_sendmsg+0x8e/0xe0
[   13.263430]    __riscv_sys_sendmsg+0x16/0x1e
[   13.263507]    do_trap_ecall_u+0x1b6/0x1e2
[   13.263583]    _new_vmalloc_restore_context_a0+0xc2/0xce
[   13.263685]
[   13.263746] -> (napi_hash_lock){+...}-{3:3} ops: 2 {
[   13.263891]    HARDIRQ-ON-W at:
[   13.263963]                     __lock_acquire+0x688/0x2158
[   13.264057]                     lock_acquire.part.0+0xba/0x21e
[   13.264153]                     lock_acquire+0x44/0x5a
[   13.264240]                     _raw_spin_lock+0x2c/0x40
[   13.264330]                     napi_disable+0xf8/0x10c
[   13.264419]                     pcnet32_close+0x5c/0x126
[   13.264511]                     __dev_close_many+0x8a/0xf8
[   13.264609]                     __dev_change_flags+0x176/0x214
[   13.264708]                     dev_change_flags+0x1e/0x56
[   13.264800]                     do_setlink.isra.0+0x20c/0xcc2
[   13.264900]                     rtnl_newlink+0x592/0x74c
[   13.264988]                     rtnetlink_rcv_msg+0x3a8/0x582
[   13.265082]                     netlink_rcv_skb+0x44/0xf4
[   13.265174]                     rtnetlink_rcv+0x14/0x1c
[   13.265268]                     netlink_unicast+0x1a0/0x23e
[   13.265367]                     netlink_sendmsg+0x17e/0x38e
[   13.265466]                     __sock_sendmsg+0x40/0x7a
[   13.265557]                     ____sys_sendmsg+0x18e/0x1de
[   13.265649]                     ___sys_sendmsg+0x82/0xc6
[   13.265738]                     __sys_sendmsg+0x8e/0xe0
[   13.265828]                     __riscv_sys_sendmsg+0x16/0x1e
[   13.265928]                     do_trap_ecall_u+0x1b6/0x1e2
[   13.266024]                     _new_vmalloc_restore_context_a0+0xc2/0xce
[   13.266134]    INITIAL USE at:
[   13.266206]                    __lock_acquire+0x46c/0x2158
[   13.266298]                    lock_acquire.part.0+0xba/0x21e
[   13.266394]                    lock_acquire+0x44/0x5a
[   13.266480]                    _raw_spin_lock+0x2c/0x40
[   13.266572]                    napi_hash_add+0x26/0xb8
[   13.266660]                    napi_enable+0x10e/0x124
[   13.266748]                    pcnet32_open+0x3c2/0x6b8
[   13.266839]                    __dev_open+0xba/0x158
[   13.266930]                    __dev_change_flags+0x19a/0x214
[   13.267026]                    dev_change_flags+0x1e/0x56
[   13.267116]                    do_setlink.isra.0+0x20c/0xcc2
[   13.267211]                    rtnl_newlink+0x592/0x74c
[   13.267299]                    rtnetlink_rcv_msg+0x3a8/0x582
[   13.267395]                    netlink_rcv_skb+0x44/0xf4
[   13.267489]                    rtnetlink_rcv+0x14/0x1c
[   13.267578]                    netlink_unicast+0x1a0/0x23e
[   13.267670]                    netlink_sendmsg+0x17e/0x38e
[   13.267763]                    __sock_sendmsg+0x40/0x7a
[   13.267851]                    ____sys_sendmsg+0x18e/0x1de
[   13.267946]                    ___sys_sendmsg+0x82/0xc6
[   13.268034]                    __sys_sendmsg+0x8e/0xe0
[   13.268121]                    __riscv_sys_sendmsg+0x16/0x1e
[   13.268215]                    do_trap_ecall_u+0x1b6/0x1e2
[   13.268309]                    _new_vmalloc_restore_context_a0+0xc2/0xce
[   13.268419]  }
[   13.268460]  ... key      at: [<ffffffff81dec490>] napi_hash_lock+0x18/0x40
[   13.268579]  ... acquired at:
[   13.268636]    mark_lock+0x5f2/0x88a
[   13.268705]    __lock_acquire+0x688/0x2158
[   13.268779]    lock_acquire.part.0+0xba/0x21e
[   13.268858]    lock_acquire+0x44/0x5a
[   13.268930]    _raw_spin_lock+0x2c/0x40
[   13.269002]    napi_disable+0xf8/0x10c
[   13.269073]    pcnet32_close+0x5c/0x126
[   13.269145]    __dev_close_many+0x8a/0xf8
[   13.269220]    __dev_change_flags+0x176/0x214
[   13.269298]    dev_change_flags+0x1e/0x56
[   13.269371]    do_setlink.isra.0+0x20c/0xcc2
[   13.269449]    rtnl_newlink+0x592/0x74c
[   13.269520]    rtnetlink_rcv_msg+0x3a8/0x582
[   13.269596]    netlink_rcv_skb+0x44/0xf4
[   13.269671]    rtnetlink_rcv+0x14/0x1c
[   13.269742]    netlink_unicast+0x1a0/0x23e
[   13.269818]    netlink_sendmsg+0x17e/0x38e
[   13.269894]    __sock_sendmsg+0x40/0x7a
[   13.269966]    ____sys_sendmsg+0x18e/0x1de
[   13.270041]    ___sys_sendmsg+0x82/0xc6
[   13.270114]    __sys_sendmsg+0x8e/0xe0
[   13.270187]    __riscv_sys_sendmsg+0x16/0x1e
[   13.270268]    do_trap_ecall_u+0x1b6/0x1e2
[   13.270346]    _new_vmalloc_restore_context_a0+0xc2/0xce
[   13.270439]
[   13.270525]
[   13.270525] stack backtrace:
[   13.270729] CPU: 0 UID: 0 PID: 1816 Comm: ip Tainted: G                 N 6.12.0-08446-g228a1157fb9f #1
[   13.270933] Tainted: [N]=TEST
[   13.271006] Hardware name: riscv-virtio,qemu (DT)
[   13.271165] Call Trace:
[   13.271270] [<ffffffff80006d42>] dump_backtrace+0x1c/0x24
[   13.271373] [<ffffffff80dc5574>] show_stack+0x2c/0x38
[   13.271467] [<ffffffff80ddc854>] dump_stack_lvl+0x74/0xac
[   13.271565] [<ffffffff80ddc8a0>] dump_stack+0x14/0x1c
[   13.271657] [<ffffffff8008551a>] print_irq_inversion_bug.part.0+0x1aa/0x1fe
[   13.271775] [<ffffffff800867f8>] mark_lock+0x5f2/0x88a
[   13.271869] [<ffffffff80087d44>] __lock_acquire+0x688/0x2158
[   13.271973] [<ffffffff80086e10>] lock_acquire.part.0+0xba/0x21e
[   13.272078] [<ffffffff80086fb8>] lock_acquire+0x44/0x5a
[   13.272173] [<ffffffff80de8ea4>] _raw_spin_lock+0x2c/0x40
[   13.272269] [<ffffffff80b89bf0>] napi_disable+0xf8/0x10c
[   13.272360] [<ffffffff8099d022>] pcnet32_close+0x5c/0x126
[   13.272454] [<ffffffff80b8fa52>] __dev_close_many+0x8a/0xf8
[   13.272549] [<ffffffff80b9790e>] __dev_change_flags+0x176/0x214
[   13.272647] [<ffffffff80b979ca>] dev_change_flags+0x1e/0x56
[   13.272740] [<ffffffff80ba93a8>] do_setlink.isra.0+0x20c/0xcc2
[   13.272838] [<ffffffff80baa3f0>] rtnl_newlink+0x592/0x74c
[   13.272935] [<ffffffff80babf5a>] rtnetlink_rcv_msg+0x3a8/0x582
[   13.273032] [<ffffffff80c05bcc>] netlink_rcv_skb+0x44/0xf4
[   13.273127] [<ffffffff80ba64f6>] rtnetlink_rcv+0x14/0x1c
[   13.273223] [<ffffffff80c05578>] netlink_unicast+0x1a0/0x23e
[   13.273326] [<ffffffff80c05794>] netlink_sendmsg+0x17e/0x38e
[   13.273423] [<ffffffff80b633e2>] __sock_sendmsg+0x40/0x7a
[   13.273515] [<ffffffff80b63742>] ____sys_sendmsg+0x18e/0x1de
[   13.273610] [<ffffffff80b65e44>] ___sys_sendmsg+0x82/0xc6
[   13.273704] [<ffffffff80b662ac>] __sys_sendmsg+0x8e/0xe0
[   13.273795] [<ffffffff80b66314>] __riscv_sys_sendmsg+0x16/0x1e
[   13.273894] [<ffffffff80ddd3e4>] do_trap_ecall_u+0x1b6/0x1e2
[   13.273992] [<ffffffff80de9c7a>] _new_vmalloc_restore_context_a0+0xc2/0xce

---
# bad: [228a1157fb9fec47eb135b51c0202b574e079ebf] Merge tag '6.13-rc-part1-SMB3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6
# good: [43fb83c17ba2d63dfb798f0be7453ed55ca3f9c2] Merge tag 'soc-arm-6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
git bisect start '228a1157fb9f' '43fb83c17ba2'
# bad: [071b34dcf71523a559b6c39f5d21a268a9531b50] Merge tag 'sound-6.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
git bisect bad 071b34dcf71523a559b6c39f5d21a268a9531b50
# bad: [31a1f8752f7df7e3d8122054fbef02a9a8bff38f] Merge branch 'phy-mediatek-reorg'
git bisect bad 31a1f8752f7df7e3d8122054fbef02a9a8bff38f
# bad: [de51ad08b1177bbbb8b60cb7dd4c3c5dd50d262f] phonet: Pass net and ifindex to rtm_phonet_notify().
git bisect bad de51ad08b1177bbbb8b60cb7dd4c3c5dd50d262f
# good: [76d46d766a45e205e59af511efbb24abe22d0b4c] net: emaclite: Adopt clock support
git bisect good 76d46d766a45e205e59af511efbb24abe22d0b4c
# bad: [a37b0e4eca0436ebc17d512d70b1409956340688] ipv6: Use rtnl_register_many().
git bisect bad a37b0e4eca0436ebc17d512d70b1409956340688
# bad: [de306f0051ae947680a13c13a9fd9373d7460bb1] net: gianfar: Use __be64 * to store pointers to big endian values
git bisect bad de306f0051ae947680a13c13a9fd9373d7460bb1
# good: [5c16e118b796e95d6e5c80c5d8af2591262431c9] net: ethernet: ti: am65-cpsw: Use __be64 type for id_temp
git bisect good 5c16e118b796e95d6e5c80c5d8af2591262431c9
# bad: [41936522749654e64531121bbd6a95bab5d56d76] bnxt: Add support for persistent NAPI config
git bisect bad 41936522749654e64531121bbd6a95bab5d56d76
# good: [79636038d37e7bd4d078238f2a3f002cab4423bc] ipv4: tcp: give socket pointer to control skbs
git bisect good 79636038d37e7bd4d078238f2a3f002cab4423bc
# good: [516010460011ae74ac3b7383cf90ed27e2711cd6] netdev-genl: Dump napi_defer_hard_irqs
git bisect good 516010460011ae74ac3b7383cf90ed27e2711cd6
# good: [0137891e74576f77a7901718dc0ce08ca074ae74] netdev-genl: Dump gro_flush_timeout
git bisect good 0137891e74576f77a7901718dc0ce08ca074ae74
# bad: [1287c1ae0fc227e5acef11a539eb4e75646e31c7] netdev-genl: Support setting per-NAPI config values
git bisect bad 1287c1ae0fc227e5acef11a539eb4e75646e31c7
# bad: [86e25f40aa1e9e54e081e55016f65b5c92523989] net: napi: Add napi_config
git bisect bad 86e25f40aa1e9e54e081e55016f65b5c92523989
# first bad commit: [86e25f40aa1e9e54e081e55016f65b5c92523989] net: napi: Add napi_config

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [net-next v6 5/9] net: napi: Add napi_config
  2024-11-27 17:43   ` Guenter Roeck
@ 2024-11-27 18:51     ` Joe Damato
  2024-11-27 20:00       ` Joe Damato
  2024-11-27 21:43       ` Guenter Roeck
  0 siblings, 2 replies; 25+ messages in thread
From: Joe Damato @ 2024-11-27 18:51 UTC (permalink / raw)
  To: Guenter Roeck
  Cc: netdev, mkarsten, skhawaja, sdf, bjorn, amritha.nambiar,
	sridhar.samudrala, willemdebruijn.kernel, edumazet,
	Jakub Kicinski, David S. Miller, Paolo Abeni, Jonathan Corbet,
	Jiri Pirko, Sebastian Andrzej Siewior, Lorenzo Bianconi,
	Johannes Berg, open list:DOCUMENTATION, open list, pcnet32

On Wed, Nov 27, 2024 at 09:43:54AM -0800, Guenter Roeck wrote:
> Hi,
> 
> On Fri, Oct 11, 2024 at 06:45:00PM +0000, Joe Damato wrote:
> > Add a persistent NAPI config area for NAPI configuration to the core.
> > Drivers opt-in to setting the persistent config for a NAPI by passing an
> > index when calling netif_napi_add_config.
> > 
> > napi_config is allocated in alloc_netdev_mqs, freed in free_netdev
> > (after the NAPIs are deleted).
> > 
> > Drivers which call netif_napi_add_config will have persistent per-NAPI
> > settings: NAPI IDs, gro_flush_timeout, and defer_hard_irq settings.
> > 
> > Per-NAPI settings are saved in napi_disable and restored in napi_enable.
> > 
> > Co-developed-by: Martin Karsten <mkarsten@uwaterloo.ca>
> > Signed-off-by: Martin Karsten <mkarsten@uwaterloo.ca>
> > Signed-off-by: Joe Damato <jdamato@fastly.com>
> > Reviewed-by: Jakub Kicinski <kuba@kernel.org>
> 
> This patch triggers a lock inversion message on pcnet Ethernet adapters.

Thanks for the report. I am not familiar with the pcnet driver, but
took some time now to read the report below and the driver code.

I could definitely be reading the output incorrectly (if so please
let me know), but it seems like the issue can be triggered in this
case:

CPU 0:
pcnet32_open
   lock(lp->lock)
     napi_enable
       napi_hash_add
         lock(napi_hash_lock)
         unlock(napi_hash_lock)
   unlock(lp->lock)


Meanwhile on CPU 1:
  pcnet32_close
    napi_disable
      napi_hash_del
        lock(napi_hash_lock)
        unlock(napi_hash_lock)
    lock(lp->lock)
    [... other code ...]
    unlock(lp->lock)
    [... other code ...]
    lock(lp->lock)
    [... other code ...]
    unlock(lp->lock)

In other words: while the close path is holding napi_hash_lock (and
before it acquires lp->lock), the enable path takes lp->lock and
then napi_hash_lock.

It seems this was triggered because before the identified commit,
napi_enable did not call napi_hash_add (and thus did not take the
napi_hash_lock).

So, I agree there is an inversion; I can't say for sure if this
would cause a deadlock in certain situations. It seems like
napi_hash_del in the close path will return, so the inversion
doesn't seem like it'd lead to a deadlock, but I am not an expert in
this and could certainly be wrong.

I wonder if a potential fix for this would be in the driver's close
function? 

In pcnet32_open the order is:
  lock(lp->lock)
    napi_enable
    netif_start_queue
    mod_timer(watchdog)
  unlock(lp->lock)

Perhaps pcnet32_close should be the same?

I've included an example patch below for pcnet32_close and I've CC'd
the maintainer of pcnet32 that is not currently CC'd.

Guenter: Is there any chance you might be able to test the proposed
patch below?

Don: Would you mind taking a look to see if this change is sensible?

Netdev maintainers: at a higher level, I'm not sure how many other
drivers might have locking patterns like this that commit
86e25f40aa1e ("net: napi: Add napi_config") will break in a similar
manner. 

Do I:
  - comb through drivers trying to identify these, and/or
  - do we find a way to implement the identified commit with the
    original lock ordering to avoid breaking any other driver?

I'd appreciate guidance/insight from the maintainers on how to best
proceed.

diff --git a/drivers/net/ethernet/amd/pcnet32.c b/drivers/net/ethernet/amd/pcnet32.c
index 72db9f9e7bee..ff56a308fec9 100644
--- a/drivers/net/ethernet/amd/pcnet32.c
+++ b/drivers/net/ethernet/amd/pcnet32.c
@@ -2623,13 +2623,13 @@ static int pcnet32_close(struct net_device *dev)
        struct pcnet32_private *lp = netdev_priv(dev);
        unsigned long flags;

+       spin_lock_irqsave(&lp->lock, flags);
+
        del_timer_sync(&lp->watchdog_timer);

        netif_stop_queue(dev);
        napi_disable(&lp->napi);

-       spin_lock_irqsave(&lp->lock, flags);
-
        dev->stats.rx_missed_errors = lp->a->read_csr(ioaddr, 112);

        netif_printk(lp, ifdown, KERN_DEBUG, dev,

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [net-next v6 5/9] net: napi: Add napi_config
  2024-11-27 18:51     ` Joe Damato
@ 2024-11-27 20:00       ` Joe Damato
  2024-11-27 22:48         ` Guenter Roeck
  2024-11-30 20:45         ` Jakub Kicinski
  2024-11-27 21:43       ` Guenter Roeck
  1 sibling, 2 replies; 25+ messages in thread
From: Joe Damato @ 2024-11-27 20:00 UTC (permalink / raw)
  To: Guenter Roeck, netdev, mkarsten, skhawaja, sdf, bjorn,
	amritha.nambiar, sridhar.samudrala, willemdebruijn.kernel,
	edumazet, Jakub Kicinski, David S. Miller, Paolo Abeni,
	Jonathan Corbet, Jiri Pirko, Sebastian Andrzej Siewior,
	Lorenzo Bianconi, Johannes Berg, open list:DOCUMENTATION,
	open list, pcnet32

On Wed, Nov 27, 2024 at 10:51:16AM -0800, Joe Damato wrote:
> On Wed, Nov 27, 2024 at 09:43:54AM -0800, Guenter Roeck wrote:
> > Hi,
> > 
> > On Fri, Oct 11, 2024 at 06:45:00PM +0000, Joe Damato wrote:
> > > Add a persistent NAPI config area for NAPI configuration to the core.
> > > Drivers opt-in to setting the persistent config for a NAPI by passing an
> > > index when calling netif_napi_add_config.
> > > 
> > > napi_config is allocated in alloc_netdev_mqs, freed in free_netdev
> > > (after the NAPIs are deleted).
> > > 
> > > Drivers which call netif_napi_add_config will have persistent per-NAPI
> > > settings: NAPI IDs, gro_flush_timeout, and defer_hard_irq settings.
> > > 
> > > Per-NAPI settings are saved in napi_disable and restored in napi_enable.
> > > 
> > > Co-developed-by: Martin Karsten <mkarsten@uwaterloo.ca>
> > > Signed-off-by: Martin Karsten <mkarsten@uwaterloo.ca>
> > > Signed-off-by: Joe Damato <jdamato@fastly.com>
> > > Reviewed-by: Jakub Kicinski <kuba@kernel.org>
> > 
> > This patch triggers a lock inversion message on pcnet Ethernet adapters.
> 
> Thanks for the report. I am not familiar with the pcnet driver, but
> took some time now to read the report below and the driver code.
> 
> I could definitely be reading the output incorrectly (if so please
> let me know), but it seems like the issue can be triggered in this
> case:

Sorry, my apologies, I both:
  - read the report incorrectly, and
  - proposed a bad patch that would result in a deadlock :)

After re-reading it and running this by Martin (who is CC'd), the
inversion is actually:

CPU 0:
pcnet32_open
   lock(lp->lock)
     napi_enable
       napi_hash_add <- before this executes, CPU 1 proceeds
         lock(napi_hash_lock)
CPU 1:
  pcnet32_close
    napi_disable
      napi_hash_del
        lock(napi_hash_lock)
         < INTERRUPT >
            pcnet32_interrupt
              lock(lp->lock)

This is now an inversion because:

CPU 0: holds lp->lock and is about to take napi_hash_lock
CPU 1: holds napi_hash_lock and an IRQ firing on CPU 1 tries to take
       lp->lock (which CPU 0 already holds)

Neither side can proceed:
  - CPU 0 is stuck waiting for napi_hash_lock
  - CPU 1 is stuck waiting for lp->lock

I think the below explanation is still correct as to why the
identified commit causes the issue:

> It seems this was triggered because before the identified commit,
> napi_enable did not call napi_hash_add (and thus did not take the
> napi_hash_lock).

However, the previous patch I proposed for pcnet32 would also cause
a deadlock as the watchdog timer's function also needs lp->lock.

A corrected patch for pcnet32 can be found below.

Guenter: Sorry, would you mind testing the below instead of the
previous patch?

Don: Let me know what you think about the below?

Netdev maintainers, there is an alternate locking solution I can
propose as an RFC that might avoid this class of problem if this
sort of issue is more widespread than just pcnet32:
  - add the NAPI to the hash in netif_napi_add_weight (instead of napi_enable)
  - remove the NAPI from the hash in __netif_napi_del (instead of
    napi_disable)

If changing the locking order in core is the desired route, then the
patch below should be unnecessary, but:

diff --git a/drivers/net/ethernet/amd/pcnet32.c b/drivers/net/ethernet/amd/pcnet32.c
index 72db9f9e7bee..2e0077e68883 100644
--- a/drivers/net/ethernet/amd/pcnet32.c
+++ b/drivers/net/ethernet/amd/pcnet32.c
@@ -2625,11 +2625,10 @@ static int pcnet32_close(struct net_device *dev)

        del_timer_sync(&lp->watchdog_timer);

+       spin_lock_irqsave(&lp->lock, flags);
        netif_stop_queue(dev);
        napi_disable(&lp->napi);

-       spin_lock_irqsave(&lp->lock, flags);
-
        dev->stats.rx_missed_errors = lp->a->read_csr(ioaddr, 112);

        netif_printk(lp, ifdown, KERN_DEBUG, dev,

^ permalink raw reply related	[flat|nested] 25+ messages in thread
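
To make the alternate proposal above concrete, a rough, untested sketch
(function bodies abridged to the relevant calls; the existing setup and
teardown are unchanged):

void netif_napi_add_weight(struct net_device *dev, struct napi_struct *napi,
			   int (*poll)(struct napi_struct *, int), int weight)
{
	/* existing initialization of napi->poll, weight, list linkage */
	napi_hash_add(napi);		/* moved here from napi_enable() */
}

void __netif_napi_del(struct napi_struct *napi)
{
	napi_hash_del(napi);		/* moved here from napi_disable() */
	/* existing teardown */
}

With this ordering, napi_enable()/napi_disable() never touch napi_hash_lock,
so driver locks held around them can no longer invert with it.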

* Re: [net-next v6 5/9] net: napi: Add napi_config
  2024-11-27 18:51     ` Joe Damato
  2024-11-27 20:00       ` Joe Damato
@ 2024-11-27 21:43       ` Guenter Roeck
  2024-11-28  1:17         ` Joe Damato
  1 sibling, 1 reply; 25+ messages in thread
From: Guenter Roeck @ 2024-11-27 21:43 UTC (permalink / raw)
  To: Joe Damato, netdev, mkarsten, skhawaja, sdf, bjorn,
	amritha.nambiar, sridhar.samudrala, willemdebruijn.kernel,
	edumazet, Jakub Kicinski, David S. Miller, Paolo Abeni,
	Jonathan Corbet, Jiri Pirko, Sebastian Andrzej Siewior,
	Lorenzo Bianconi, Johannes Berg, open list:DOCUMENTATION,
	open list, pcnet32

On 11/27/24 10:51, Joe Damato wrote:
> On Wed, Nov 27, 2024 at 09:43:54AM -0800, Guenter Roeck wrote:
>> Hi,
>>
>> On Fri, Oct 11, 2024 at 06:45:00PM +0000, Joe Damato wrote:
>>> Add a persistent NAPI config area for NAPI configuration to the core.
>>> Drivers opt-in to setting the persistent config for a NAPI by passing an
>>> index when calling netif_napi_add_config.
>>>
>>> napi_config is allocated in alloc_netdev_mqs, freed in free_netdev
>>> (after the NAPIs are deleted).
>>>
>>> Drivers which call netif_napi_add_config will have persistent per-NAPI
>>> settings: NAPI IDs, gro_flush_timeout, and defer_hard_irq settings.
>>>
>>> Per-NAPI settings are saved in napi_disable and restored in napi_enable.
>>>
>>> Co-developed-by: Martin Karsten <mkarsten@uwaterloo.ca>
>>> Signed-off-by: Martin Karsten <mkarsten@uwaterloo.ca>
>>> Signed-off-by: Joe Damato <jdamato@fastly.com>
>>> Reviewed-by: Jakub Kicinski <kuba@kernel.org>
>>
>> This patch triggers a lock inversion message on pcnet Ethernet adapters.
> 
> Thanks for the report. I am not familiar with the pcnet driver, but
> took some time now to read the report below and the driver code.
> 
> I could definitely be reading the output incorrectly (if so please
> let me know), but it seems like the issue can be triggered in this
> case:
> 
> CPU 0:
> pcnet32_open
>     lock(lp->lock)
>       napi_enable
>         napi_hash_add
>           lock(napi_hash_lock)
>           unlock(napi_hash_lock)
>     unlock(lp->lock)
> 
> 
> Meanwhile on CPU 1:
>    pcnet32_close
>      napi_disable
>        napi_hash_del
>          lock(napi_hash_lock)
>          unlock(napi_hash_lock)
>      lock(lp->lock)
>      [... other code ...]
>      unlock(lp->lock)
>      [... other code ...]
>      lock(lp->lock)
>      [... other code ...]
>      unlock(lp->lock)
> 
> In other words: while the close path is holding napi_hash_lock (and
> before it acquires lp->lock), the enable path takes lp->lock and
> then napi_hash_lock.
> 
> It seems this was triggered because before the identified commit,
> napi_enable did not call napi_hash_add (and thus did not take the
> napi_hash_lock).
> 
> So, I agree there is an inversion; I can't say for sure if this
> would cause a deadlock in certain situations. It seems like
> napi_hash_del in the close path will return, so the inversion
> doesn't seem like it'd lead to a deadlock, but I am not an expert in
> this and could certainly be wrong.
> 
> I wonder if a potential fix for this would be in the driver's close
> function?
> 
> In pcnet32_open the order is:
>    lock(lp->lock)
>      napi_enable
>      netif_start_queue
>      mod_timer(watchdog)
>    unlock(lp->lock)
> 
> Perhaps pcnet32_close should be the same?
> 
> I've included an example patch below for pcnet32_close and I've CC'd
> the maintainer of pcnet32 that is not currently CC'd.
> 
> Guenter: Is there any change you might be able to test the proposed
> patch below?
> 

I moved the spinlock after del_timer_sync() because it is not a good idea
to hold a spinlock when calling that function. That results in:

[   10.646956] BUG: sleeping function called from invalid context at net/core/dev.c:6775
[   10.647142] in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 1817, name: ip
[   10.647237] preempt_count: 1, expected: 0
[   10.647319] 2 locks held by ip/1817:
[   10.647383]  #0: ffffffff81ded990 (rtnl_mutex){+.+.}-{4:4}, at: rtnl_newlink+0x22a/0x74c
[   10.647880]  #1: ff6000000498ccb0 (&lp->lock){-.-.}-{3:3}, at: pcnet32_close+0x40/0x126
[   10.648050] irq event stamp: 3720
[   10.648102] hardirqs last  enabled at (3719): [<ffffffff80decaf4>] _raw_spin_unlock_irqrestore+0x54/0x62
[   10.648204] hardirqs last disabled at (3720): [<ffffffff80dec8a2>] _raw_spin_lock_irqsave+0x5e/0x64
[   10.648301] softirqs last  enabled at (3712): [<ffffffff8001efca>] handle_softirqs+0x3e6/0x4a2
[   10.648396] softirqs last disabled at (3631): [<ffffffff80ded6cc>] __do_softirq+0x12/0x1a
[   10.648666] CPU: 0 UID: 0 PID: 1817 Comm: ip Tainted: G                 N 6.12.0-10313-g7d4050728c83-dirty #1
[   10.648828] Tainted: [N]=TEST
[   10.648879] Hardware name: riscv-virtio,qemu (DT)
[   10.648978] Call Trace:
[   10.649048] [<ffffffff80006d42>] dump_backtrace+0x1c/0x24
[   10.649117] [<ffffffff80dc8d94>] show_stack+0x2c/0x38
[   10.649180] [<ffffffff80de00b0>] dump_stack_lvl+0x74/0xac
[   10.649246] [<ffffffff80de00fc>] dump_stack+0x14/0x1c
[   10.649308] [<ffffffff8004da18>] __might_resched+0x23e/0x248
[   10.649377] [<ffffffff8004da60>] __might_sleep+0x3e/0x62
[   10.649441] [<ffffffff80b8d370>] napi_disable+0x24/0x10c
[   10.649506] [<ffffffff809a06fe>] pcnet32_close+0x6c/0x126
...

This is due to might_sleep() at the beginning of napi_disable(). So it doesn't
work as intended; it just replaces one problem with another.

> Don: Would you mind taking a look to see if this change is sensible?
> 
> Netdev maintainers: at a higher level, I'm not sure how many other
> drivers might have locking patterns like this that commit
> 86e25f40aa1e ("net: napi: Add napi_config") will break in a similar
> manner.
> 
> Do I:
>    - comb through drivers trying to identify these, and/or

Coccinelle, checking for napi_enable calls under spinlock, points to:

napi_enable called under spin_lock_irqsave from drivers/net/ethernet/via/via-velocity.c:2325
napi_enable called under spin_lock_irqsave from drivers/net/can/grcan.c:1076
napi_enable called under spin_lock from drivers/net/ethernet/marvell/mvneta.c:4388
napi_enable called under spin_lock_irqsave from drivers/net/ethernet/amd/pcnet32.c:2104

Guenter

>    - do we find a way to implement the identified commit with the
>      original lock ordering to avoid breaking any other driver?
> 
> I'd appreciate guidance/insight from the maintainers on how to best
> proceed.
> 
> diff --git a/drivers/net/ethernet/amd/pcnet32.c b/drivers/net/ethernet/amd/pcnet32.c
> index 72db9f9e7bee..ff56a308fec9 100644
> --- a/drivers/net/ethernet/amd/pcnet32.c
> +++ b/drivers/net/ethernet/amd/pcnet32.c
> @@ -2623,13 +2623,13 @@ static int pcnet32_close(struct net_device *dev)
>          struct pcnet32_private *lp = netdev_priv(dev);
>          unsigned long flags;
> 
> +       spin_lock_irqsave(&lp->lock, flags);
> +
>          del_timer_sync(&lp->watchdog_timer);
> 
>          netif_stop_queue(dev);
>          napi_disable(&lp->napi);
> 
> -       spin_lock_irqsave(&lp->lock, flags);
> -
>          dev->stats.rx_missed_errors = lp->a->read_csr(ioaddr, 112);
> 
>          netif_printk(lp, ifdown, KERN_DEBUG, dev,


^ permalink raw reply	[flat|nested] 25+ messages in thread
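
For context on the might_sleep() splat above: napi_disable() can sleep while
it waits for an in-flight poll of that NAPI to finish, which is why it must
not be called under a spinlock with IRQs disabled. Roughly (a simplified
sketch, not the exact mainline code):

void napi_disable(struct napi_struct *n)
{
	might_sleep();
	set_bit(NAPI_STATE_DISABLE, &n->state);

	/* Sleep until the SCHED bit can be claimed, i.e. until any
	 * in-flight poll has completed. */
	while (test_and_set_bit(NAPI_STATE_SCHED, &n->state))
		msleep(1);

	/* remaining state/timer cleanup omitted */
}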

* Re: [net-next v6 5/9] net: napi: Add napi_config
  2024-11-27 20:00       ` Joe Damato
@ 2024-11-27 22:48         ` Guenter Roeck
  2024-11-30 20:45         ` Jakub Kicinski
  1 sibling, 0 replies; 25+ messages in thread
From: Guenter Roeck @ 2024-11-27 22:48 UTC (permalink / raw)
  To: Joe Damato, netdev, mkarsten, skhawaja, sdf, bjorn,
	amritha.nambiar, sridhar.samudrala, willemdebruijn.kernel,
	edumazet, Jakub Kicinski, David S. Miller, Paolo Abeni,
	Jonathan Corbet, Jiri Pirko, Sebastian Andrzej Siewior,
	Lorenzo Bianconi, Johannes Berg, open list:DOCUMENTATION,
	open list, pcnet32

On 11/27/24 12:00, Joe Damato wrote:
> On Wed, Nov 27, 2024 at 10:51:16AM -0800, Joe Damato wrote:
>> On Wed, Nov 27, 2024 at 09:43:54AM -0800, Guenter Roeck wrote:
>>> Hi,
>>>
>>> On Fri, Oct 11, 2024 at 06:45:00PM +0000, Joe Damato wrote:
>>>> Add a persistent NAPI config area for NAPI configuration to the core.
>>>> Drivers opt-in to setting the persistent config for a NAPI by passing an
>>>> index when calling netif_napi_add_config.
>>>>
>>>> napi_config is allocated in alloc_netdev_mqs, freed in free_netdev
>>>> (after the NAPIs are deleted).
>>>>
>>>> Drivers which call netif_napi_add_config will have persistent per-NAPI
>>>> settings: NAPI IDs, gro_flush_timeout, and defer_hard_irq settings.
>>>>
>>>> Per-NAPI settings are saved in napi_disable and restored in napi_enable.
>>>>
>>>> Co-developed-by: Martin Karsten <mkarsten@uwaterloo.ca>
>>>> Signed-off-by: Martin Karsten <mkarsten@uwaterloo.ca>
>>>> Signed-off-by: Joe Damato <jdamato@fastly.com>
>>>> Reviewed-by: Jakub Kicinski <kuba@kernel.org>
>>>
>>> This patch triggers a lock inversion message on pcnet Ethernet adapters.
>>
>> Thanks for the report. I am not familiar with the pcnet driver, but
>> took some time now to read the report below and the driver code.
>>
>> I could definitely be reading the output incorrectly (if so please
>> let me know), but it seems like the issue can be triggered in this
>> case:
> 
> Sorry, my apologies, I both:
>    - read the report incorrectly, and
>    - proposed a bad patch that would result in a deadlock :)
> 
> After re-reading it and running this by Martin (who is CC'd), the
> inversion is actually:
> 
> CPU 0:
> pcnet32_open
>     lock(lp->lock)
>       napi_enable
>         napi_hash_add <- before this executes, CPU 1 proceeds
>           lock(napi_hash_lock)
> CPU 1:
>    pcnet32_close
>      napi_disable
>        napi_hash_del
>          lock(napi_hash_lock)
>           < INTERRUPT >
>              pcnet32_interrupt
>                lock(lp->lock)
> 
> This is now an inversion because:
> 
> CPU 0: holds lp->lock and is about to take napi_hash_lock
> CPU 1: holds napi_hashlock and an IRQ firing on CPU 1 tries to take
>         lp->lock (which CPU 0 already holds)
> 
> Neither side can proceed:
>    - CPU 0 is stuck waiting for napi_hash_lock
>    - CPU 1 is stuck waiting for lp->lock
> 
> I think the below explanation is still correct as to why the
> identified commit causes the issue:
> 
>> It seems this was triggered because before the identified commit,
>> napi_enable did not call napi_hash_add (and thus did not take the
>> napi_hash_lock).
> 
> However, the previous patch I proposed for pcnet32 would also cause
> a deadlock as the watchdog timer's function also needs lp->lock.
> 
> A corrected patch for pcnet32 can be found below.
> 
> Guenter: Sorry, would you mind testing the below instead of the
> previous patch?
> 
> Don: Let me know what you think about the below?
> 
> Netdev maintainers, there is an alternate locking solution I can
> propose as an RFC that might avoid this class of problem if this
> sort of issue is more widespread than just pcnet32:
>    - add the NAPI to the hash in netif_napi_add_weight (instead of napi_enable)
>    - remove the NAPI from the hash in __netif_napi_del (instead of
>      napi_disable)
> 
> If changing the locking order in core is the desired route, than the
> patch below should be unnecessary, but:
> 
> diff --git a/drivers/net/ethernet/amd/pcnet32.c b/drivers/net/ethernet/amd/pcnet32.c
> index 72db9f9e7bee..2e0077e68883 100644
> --- a/drivers/net/ethernet/amd/pcnet32.c
> +++ b/drivers/net/ethernet/amd/pcnet32.c
> @@ -2625,11 +2625,10 @@ static int pcnet32_close(struct net_device *dev)
> 
>          del_timer_sync(&lp->watchdog_timer);
> 
> +       spin_lock_irqsave(&lp->lock, flags);
>          netif_stop_queue(dev);
>          napi_disable(&lp->napi);
> 

That is what I did, actually. Problem with that is that napi_disable()
wants to be able to sleep, thus the above triggers:

BUG: sleeping function called from invalid context at net/core/dev.c:6775
in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 1817, name: ip
preempt_count: 1, expected: 0
2 locks held by ip/1817:
#0: ffffffff81ded990 (rtnl_mutex){+.+.}-{4:4}, at: rtnl_newlink+0x22a/0x74c
#1: ff6000000498ccb0 (&lp->lock){-.-.}-{3:3}, at: pcnet32_close+0x40/0x126
irq event stamp: 3720
hardirqs last  enabled at (3719): [<ffffffff80decaf4>] _raw_spin_unlock_irqrestore+0x54/0x62
hardirqs last disabled at (3720): [<ffffffff80dec8a2>] _raw_spin_lock_irqsave+0x5e/0x64
softirqs last  enabled at (3712): [<ffffffff8001efca>] handle_softirqs+0x3e6/0x4a2
softirqs last disabled at (3631): [<ffffffff80ded6cc>] __do_softirq+0x12/0x1a
CPU: 0 UID: 0 PID: 1817 Comm: ip Tainted: G                 N 6.12.0-10313-g7d4050728c83-dirty #1
Tainted: [N]=TEST
Hardware name: riscv-virtio,qemu (DT)
Call Trace:
[<ffffffff80006d42>] dump_backtrace+0x1c/0x24
[<ffffffff80dc8d94>] show_stack+0x2c/0x38
[<ffffffff80de00b0>] dump_stack_lvl+0x74/0xac
[<ffffffff80de00fc>] dump_stack+0x14/0x1c
[<ffffffff8004da18>] __might_resched+0x23e/0x248
[<ffffffff8004da60>] __might_sleep+0x3e/0x62
[<ffffffff80b8d370>] napi_disable+0x24/0x10c
[<ffffffff809a06fe>] pcnet32_close+0x6c/0x126

Guenter


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [net-next v6 5/9] net: napi: Add napi_config
  2024-11-27 21:43       ` Guenter Roeck
@ 2024-11-28  1:17         ` Joe Damato
  2024-11-28  1:34           ` Guenter Roeck
  0 siblings, 1 reply; 25+ messages in thread
From: Joe Damato @ 2024-11-28  1:17 UTC (permalink / raw)
  To: Guenter Roeck
  Cc: netdev, mkarsten, skhawaja, sdf, bjorn, amritha.nambiar,
	sridhar.samudrala, willemdebruijn.kernel, edumazet,
	Jakub Kicinski, David S. Miller, Paolo Abeni, Jonathan Corbet,
	Jiri Pirko, Sebastian Andrzej Siewior, Lorenzo Bianconi,
	Johannes Berg, open list:DOCUMENTATION, open list, pcnet32

On Wed, Nov 27, 2024 at 01:43:14PM -0800, Guenter Roeck wrote:
> On 11/27/24 10:51, Joe Damato wrote:
> > On Wed, Nov 27, 2024 at 09:43:54AM -0800, Guenter Roeck wrote:
> > > Hi,
> > > 
> > > On Fri, Oct 11, 2024 at 06:45:00PM +0000, Joe Damato wrote:

[...]

> > It seems this was triggered because before the identified commit,
> > napi_enable did not call napi_hash_add (and thus did not take the
> > napi_hash_lock).
> > 
> > So, I agree there is an inversion; I can't say for sure if this
> > would cause a deadlock in certain situations. It seems like
> > napi_hash_del in the close path will return, so the inversion
> > doesn't seem like it'd lead to a deadlock, but I am not an expert in
> > this and could certainly be wrong.
> > 
> > I wonder if a potential fix for this would be in the driver's close
> > function?
> > 
> > In pcnet32_open the order is:
> >    lock(lp->lock)
> >      napi_enable
> >      netif_start_queue
> >      mod_timer(watchdog)
> >    unlock(lp->lock)
> > 
> > Perhaps pcnet32_close should be the same?
> > 
> > I've included an example patch below for pcnet32_close and I've CC'd
> > the maintainer of pcnet32 that is not currently CC'd.
> > 
> > Guenter: Is there any change you might be able to test the proposed
> > patch below?
> > 
> 
> I moved the spinlock after del_timer_sync() because it is not a good idea
> to hold a spinlock when calling that function. That results in:
> 
> [   10.646956] BUG: sleeping function called from invalid context at net/core/dev.c:6775
> [   10.647142] in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 1817, name: ip
> [   10.647237] preempt_count: 1, expected: 0
> [   10.647319] 2 locks held by ip/1817:
> [   10.647383]  #0: ffffffff81ded990 (rtnl_mutex){+.+.}-{4:4}, at: rtnl_newlink+0x22a/0x74c
> [   10.647880]  #1: ff6000000498ccb0 (&lp->lock){-.-.}-{3:3}, at: pcnet32_close+0x40/0x126

[...]

> This is due to might_sleep() at the beginning of napi_disable(). So it doesn't
> work as intended, it just replaces one problem with another.

Thanks for testing that. And you are right, it is also not correct.
I will give it some thought to see if I can think of something
better.

Maybe Don will have some ideas.
 
> > Don: Would you mind taking a look to see if this change is sensible?
> > 
> > Netdev maintainers: at a higher level, I'm not sure how many other
> > drivers might have locking patterns like this that commit
> > 86e25f40aa1e ("net: napi: Add napi_config") will break in a similar
> > manner.
> > 
> > Do I:
> >    - comb through drivers trying to identify these, and/or
> 
> Coccinelle, checking for napi_enable calls under spinlock, points to:
> 
> napi_enable called under spin_lock_irqsave from drivers/net/ethernet/via/via-velocity.c:2325
> napi_enable called under spin_lock_irqsave from drivers/net/can/grcan.c:1076
> napi_enable called under spin_lock from drivers/net/ethernet/marvell/mvneta.c:4388
> napi_enable called under spin_lock_irqsave from drivers/net/ethernet/amd/pcnet32.c:2104

I checked the 3 cases above other than pcnet32 and they appear to be
false positives to me.

Guenter: would you mind sending me your cocci script? Mostly for
selfish reasons; I'd like to see how you did it so I can learn more
:) Feel free to do so off list if you prefer.

I tried to write my first coccinelle script (which you can find
below). It is probably wrong, but it attempts to detect:
  - interrupt routines that hold locks
  - in drivers that call napi_enable between a lock/unlock

I couldn't figure out how to get regexps to work in my script, so I
made a couple variants of the script for each of the spin_lock_*
variants and ran them all.

Only one offender was detected: pcnet32.
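
For illustration, the shape being searched for is roughly the
following (a hypothetical foo driver, not pcnet32's actual code):
napi_enable() is called under a spinlock that the driver's interrupt
handler also takes, so after the napi_config change the ordering
lp->lock -> napi_hash_lock is established on open, while the close
path takes napi_hash_lock first.

#include <linux/interrupt.h>
#include <linux/netdevice.h>
#include <linux/spinlock.h>

/* Hypothetical driver shape; not pcnet32's actual code. */
struct foo_priv {
	spinlock_t lock;		/* also taken from the IRQ handler */
	struct napi_struct napi;
};

static int foo_open(struct net_device *dev)
{
	struct foo_priv *lp = netdev_priv(dev);
	unsigned long flags;

	spin_lock_irqsave(&lp->lock, flags);
	napi_enable(&lp->napi);		/* now also takes napi_hash_lock */
	netif_start_queue(dev);
	spin_unlock_irqrestore(&lp->lock, flags);

	return 0;
}

static irqreturn_t foo_interrupt(int irq, void *dev_id)
{
	struct net_device *dev = dev_id;
	struct foo_priv *lp = netdev_priv(dev);

	spin_lock(&lp->lock);
	/* ... device handling ... */
	spin_unlock(&lp->lock);

	return IRQ_HANDLED;
}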

So, I guess the question to put out there to maintainers / others on
the list is:

  - There seems to be at least 1 driver affected (pcnet32). There
    might be others, but my simple (and likely incorrect) cocci
    script below couldn't detect any with this particular bug shape.
    Worth mentioning: there could be other bug shapes that trigger
    an inversion that I am currently unaware of.

  - As far as I can tell, there are three ways to proceed:
    1. Find and fix all drivers which broke (pcnet32 being the only
       known driver at this point), or
    2. Disable IRQs when taking the lock in napi_hash_del, or
    3. Move the napi hash add/remove out of napi enable/disable.

Happy to proceed however seems most reasonable to the maintainers;
please let me know.

My cocci script follows; as noted above I am too much of a noob and
couldn't figure out how to use regexps to match the different
spin_lock* variants, so I simply made multiple versions of this
script for each variant:

virtual report

@napi@
identifier func0;
position p0;
@@

func0(...)
{
	...
	spin_lock_irqsave(...)
	...
	napi_enable@p0(...)
	...
	spin_unlock_irqrestore(...)
	...
}

@u@
position p;
identifier func;
typedef irqreturn_t;
@@

irqreturn_t func (...)
{
	...
	spin_lock@p(...)
	...
}

@script:python depends on napi && u@
p << u.p;
func << u.func;
disable << napi.p0;
@@

print("* file: %s irq handler %s takes lock on line %s and calls napi_enable under lock %s" % (p[0].file,func,p[0].line,disable[0].line))


* Re: [net-next v6 5/9] net: napi: Add napi_config
  2024-11-28  1:17         ` Joe Damato
@ 2024-11-28  1:34           ` Guenter Roeck
  0 siblings, 0 replies; 25+ messages in thread
From: Guenter Roeck @ 2024-11-28  1:34 UTC (permalink / raw)
  To: Joe Damato, netdev, mkarsten, skhawaja, sdf, bjorn,
	amritha.nambiar, sridhar.samudrala, willemdebruijn.kernel,
	edumazet, Jakub Kicinski, David S. Miller, Paolo Abeni,
	Jonathan Corbet, Jiri Pirko, Sebastian Andrzej Siewior,
	Lorenzo Bianconi, Johannes Berg, open list:DOCUMENTATION,
	open list, pcnet32

On Wed, Nov 27, 2024 at 05:17:15PM -0800, Joe Damato wrote:
> 
> Guenter: would you mind sending me your cocci script? Mostly for
> selfish reasons; I'd like to see how you did it so I can learn more
> :) Feel free to do so off list if you prefer.
> 
You mean because it is so clumsy? :-) No worries, I attached
it below. It is way too complicated, but it was good enough
for a quick hack.

> I tried to write my first coccinelle script (which you can find
> below) that is probably wrong, but it attempts to detect:
>   - interrupt routines that hold locks
>   - in drivers that call napi_enable between a lock/unlock
> 
> I couldn't figure out how to get regexps to work in my script, so I
> made a couple variants of the script for each of the spin_lock_*
> variants and ran them all.

I pretty much did the same, only in one script.

> 
> Only one offender was detected: pcnet32.
> 

Your script doesn't take into account that the spinlock may have been
released before the call to napi_enable(). Other than that it should
work just fine. I didn't try to track the interrupt handler because
tracking that through multiple functions can be difficult.

Thanks,
Guenter

---
virtual report

@f1@
identifier flags;
position p;
expression e;
@@

<...
    spin_lock_irqsave@p(e, flags);
    ... when != spin_unlock_irqrestore(e, flags);
    napi_enable(...);
    ... when any
    spin_unlock_irqrestore(e, flags);
...>

@f2@
position p;
expression e;
@@

<...
    spin_lock_irq@p(e);
    ... when != spin_unlock_irq(e);
    napi_enable(...);
    ... when any
    spin_unlock_irq(e);
...>

@f3@
position p;
expression e;
@@

<...
    spin_lock@p(e);
    ... when != spin_unlock(e);
    napi_enable(...);
    ... when any
    spin_unlock(e);
...>

@script:python@
p << f1.p;
@@

print("napi_enable called under spin_lock_irqsave from %s:%s" % (p[0].file, p[0].line))

@script:python@
p << f2.p;
@@

print("napi_enable called under spin_lock_irq from %s:%s" % (p[0].file, p[0].line))

@script:python@
p << f3.p;
@@

print("napi_enable called under spin_lock from %s:%s" % (p[0].file, p[0].line))


* Re: [net-next v6 5/9] net: napi: Add napi_config
  2024-11-27 20:00       ` Joe Damato
  2024-11-27 22:48         ` Guenter Roeck
@ 2024-11-30 20:45         ` Jakub Kicinski
  2024-12-02 17:34           ` Joe Damato
  1 sibling, 1 reply; 25+ messages in thread
From: Jakub Kicinski @ 2024-11-30 20:45 UTC (permalink / raw)
  To: Joe Damato
  Cc: Guenter Roeck, netdev, mkarsten, skhawaja, sdf, bjorn,
	amritha.nambiar, sridhar.samudrala, willemdebruijn.kernel,
	edumazet, David S. Miller, Paolo Abeni, Jonathan Corbet,
	Jiri Pirko, Sebastian Andrzej Siewior, Lorenzo Bianconi,
	Johannes Berg, open list:DOCUMENTATION, open list, pcnet32

On Wed, 27 Nov 2024 12:00:02 -0800 Joe Damato wrote:
> CPU 0:
> pcnet32_open
>    lock(lp->lock)
>      napi_enable
>        napi_hash_add <- before this executes, CPU 1 proceeds
>          lock(napi_hash_lock)
> CPU 1:
>   pcnet32_close
>     napi_disable
>       napi_hash_del
>         lock(napi_hash_lock)
>          < INTERRUPT >

How about making napi_hash_lock irq-safe ?
It's a control path lock, it should be fine to disable irqs.

>             pcnet32_interrupt
>               lock(lp->lock)



* Re: [net-next v6 5/9] net: napi: Add napi_config
  2024-11-30 20:45         ` Jakub Kicinski
@ 2024-12-02 17:34           ` Joe Damato
  0 siblings, 0 replies; 25+ messages in thread
From: Joe Damato @ 2024-12-02 17:34 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Guenter Roeck, netdev, mkarsten, skhawaja, sdf, bjorn,
	amritha.nambiar, sridhar.samudrala, willemdebruijn.kernel,
	edumazet, David S. Miller, Paolo Abeni, Jonathan Corbet,
	Jiri Pirko, Sebastian Andrzej Siewior, Lorenzo Bianconi,
	Johannes Berg, open list:DOCUMENTATION, open list, pcnet32

On Sat, Nov 30, 2024 at 12:45:01PM -0800, Jakub Kicinski wrote:
> On Wed, 27 Nov 2024 12:00:02 -0800 Joe Damato wrote:
> > CPU 0:
> > pcnet32_open
> >    lock(lp->lock)
> >      napi_enable
> >        napi_hash_add <- before this executes, CPU 1 proceeds
> >          lock(napi_hash_lock)
> > CPU 1:
> >   pcnet32_close
> >     napi_disable
> >       napi_hash_del
> >         lock(napi_hash_lock)
> >          < INTERRUPT >
> 
> How about making napi_hash_lock irq-safe ?
> It's a control path lock, it should be fine to disable irqs.

Ah, right. That should fix it.

I'll write a fix against net that changes napi_hash_lock to use
spin_lock_irqsave and spin_unlock_irqrestore, and send it shortly.

> >             pcnet32_interrupt
> >               lock(lp->lock)
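
A minimal sketch of that change (an assumption of what it might look
like, not the committed patch), based on the hash helpers in
net/core/dev.c: with IRQs disabled while napi_hash_lock is held, an
interrupt handler can no longer spin on a driver lock inside the
napi_hash_lock critical section, which breaks the cycle shown above.

/* Sketch only -- not the exact upstream fix. */
static DEFINE_SPINLOCK(napi_hash_lock);

static void napi_hash_add(struct napi_struct *napi)
{
	unsigned long flags;

	if (test_bit(NAPI_STATE_NO_BUSY_POLL, &napi->state))
		return;

	spin_lock_irqsave(&napi_hash_lock, flags);
	/* ... assign napi->napi_id and hash the NAPI, as before ... */
	spin_unlock_irqrestore(&napi_hash_lock, flags);
}

static void napi_hash_del(struct napi_struct *napi)
{
	unsigned long flags;

	spin_lock_irqsave(&napi_hash_lock, flags);
	hlist_del_init_rcu(&napi->napi_hash_node);
	spin_unlock_irqrestore(&napi_hash_lock, flags);
}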
> 


end of thread, other threads:[~2024-12-02 17:34 UTC | newest]

Thread overview: 25+ messages
2024-10-11 18:44 [net-next v6 0/9] Add support for per-NAPI config via netlink Joe Damato
2024-10-11 18:44 ` [net-next v6 1/9] net: napi: Make napi_defer_hard_irqs per-NAPI Joe Damato
2024-10-11 18:44 ` [net-next v6 2/9] netdev-genl: Dump napi_defer_hard_irqs Joe Damato
2024-10-11 18:44 ` [net-next v6 3/9] net: napi: Make gro_flush_timeout per-NAPI Joe Damato
2024-10-11 18:44 ` [net-next v6 4/9] netdev-genl: Dump gro_flush_timeout Joe Damato
2024-10-11 18:47   ` Eric Dumazet
2024-10-11 18:45 ` [net-next v6 5/9] net: napi: Add napi_config Joe Damato
2024-10-11 18:49   ` Eric Dumazet
2024-11-27 17:43   ` Guenter Roeck
2024-11-27 18:51     ` Joe Damato
2024-11-27 20:00       ` Joe Damato
2024-11-27 22:48         ` Guenter Roeck
2024-11-30 20:45         ` Jakub Kicinski
2024-12-02 17:34           ` Joe Damato
2024-11-27 21:43       ` Guenter Roeck
2024-11-28  1:17         ` Joe Damato
2024-11-28  1:34           ` Guenter Roeck
2024-10-11 18:45 ` [net-next v6 6/9] netdev-genl: Support setting per-NAPI config values Joe Damato
2024-10-11 18:49   ` Eric Dumazet
2024-11-12  9:17   ` Paolo Abeni
2024-11-12 17:12     ` Joe Damato
2024-10-11 18:45 ` [net-next v6 7/9] bnxt: Add support for persistent NAPI config Joe Damato
2024-10-11 18:45 ` [net-next v6 8/9] mlx5: " Joe Damato
2024-10-11 18:45 ` [net-next v6 9/9] mlx4: Add support for persistent NAPI config to RX CQs Joe Damato
2024-10-15  1:10 ` [net-next v6 0/9] Add support for per-NAPI config via netlink patchwork-bot+netdevbpf
