* [RFC net-next v3 0/9] Add support for per-NAPI config via netlink
@ 2024-09-12 10:07 Joe Damato
2024-09-12 10:07 ` [RFC net-next v3 1/9] net: napi: Make napi_defer_hard_irqs per-NAPI Joe Damato
` (8 more replies)
0 siblings, 9 replies; 18+ messages in thread
From: Joe Damato @ 2024-09-12 10:07 UTC (permalink / raw)
To: netdev
Cc: mkarsten, kuba, skhawaja, sdf, bjorn, amritha.nambiar,
sridhar.samudrala, Joe Damato, Alexander Lobakin, Breno Leitao,
Daniel Jurgens, David Ahern, David S. Miller, Donald Hunter,
Eric Dumazet, Jesper Dangaard Brouer, Jiri Pirko, Johannes Berg,
Jonathan Corbet, Kory Maincent, Leon Romanovsky,
open list:DOCUMENTATION, open list,
open list:MELLANOX MLX4 core VPI driver, Lorenzo Bianconi,
Michael Chan, Paolo Abeni, Saeed Mahameed,
Sebastian Andrzej Siewior, Tariq Toukan, Xuan Zhuo
Greetings:
Welcome to RFC v3.
This implementation allocates an array of "struct napi_config" in
net_device and each NAPI instance is assigned an index into the config
array.
Per-NAPI settings like:
- NAPI ID
- gro_flush_timeout
- defer_hard_irqs
are persisted in napi_config: they are saved on napi_disable and
restored on napi_enable.
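To make the shape of this concrete, here is a rough sketch of the
per-NAPI storage and the driver opt-in; names are taken from patch 5
and the driver patches below, which have the authoritative definitions:

  /* Sketch for illustration only; see patch 5 for the real structure. */
  struct napi_config {
          u64             gro_flush_timeout;
          u32             defer_hard_irqs;
          unsigned int    napi_id;
  };

  /* Drivers opt in by passing an index, e.g. as in the mlx5 patch: */
  netif_napi_add_storage(netdev, &c->napi, mlx5e_napi_poll, ix);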
To help illustrate how this would end up working, I've added patches for
3 drivers, of which I have access to only 1:
- mlx5, which is the basis of the examples below
- mlx4, which has TX-only NAPIs, just to highlight that case. I have
only compile-tested this patch; I don't have this hardware.
- bnxt, which I have also only compile-tested. I don't have this
hardware.
NOTE: I only tested this on mlx5; I have no access to the other hardware
for which I provided patches. Hopefully other folks can help test :)
This iteration seems to persist NAPI IDs and settings even when resizing
queues (see below), so I think this may be getting close to where we
want to land?
Here's an example of how it works on my mlx5:
# start with 2 queues
$ ethtool -l eth4 | grep Combined | tail -1
Combined: 2
First, output the current NAPI settings:
$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
--dump napi-get --json='{"ifindex": 7}'
[{'defer-hard-irqs': 0,
'gro-flush-timeout': 0,
'id': 345,
'ifindex': 7,
'irq': 527},
{'defer-hard-irqs': 0,
'gro-flush-timeout': 0,
'id': 344,
'ifindex': 7,
'irq': 327}]
Now, set the global sysfs parameters:
$ sudo bash -c 'echo 20000 >/sys/class/net/eth4/gro_flush_timeout'
$ sudo bash -c 'echo 100 >/sys/class/net/eth4/napi_defer_hard_irqs'
Output current NAPI settings again:
$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
--dump napi-get --json='{"ifindex": 7}'
[{'defer-hard-irqs': 100,
'gro-flush-timeout': 20000,
'id': 345,
'ifindex': 7,
'irq': 527},
{'defer-hard-irqs': 100,
'gro-flush-timeout': 20000,
'id': 344,
'ifindex': 7,
'irq': 327}]
Now set NAPI ID 345 to specific values, addressing it directly by its NAPI ID:
$ sudo ./tools/net/ynl/cli.py \
--spec Documentation/netlink/specs/netdev.yaml \
--do napi-set \
--json='{"id": 345,
"defer-hard-irqs": 111,
"gro-flush-timeout": 11111}'
None
Now output current NAPI settings again to ensure only NAPI ID 345
changed:
$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
--dump napi-get --json='{"ifindex": 7}'
[{'defer-hard-irqs': 111,
'gro-flush-timeout': 11111,
'id': 345,
'ifindex': 7,
'irq': 527},
{'defer-hard-irqs': 100,
'gro-flush-timeout': 20000,
'id': 344,
'ifindex': 7,
'irq': 327}]
Now, increase gro-flush-timeout only:
$ sudo ./tools/net/ynl/cli.py \
--spec Documentation/netlink/specs/netdev.yaml \
--do napi-set --json='{"id": 345,
"gro-flush-timeout": 44444}'
None
Now output the current NAPI settings once more:
$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
--dump napi-get --json='{"ifindex": 7}'
[{'defer-hard-irqs': 111,
'gro-flush-timeout': 44444,
'id': 345,
'ifindex': 7,
'irq': 527},
{'defer-hard-irqs': 100,
'gro-flush-timeout': 20000,
'id': 344,
'ifindex': 7,
'irq': 327}]
Now set NAPI ID 345 to have a gro-flush-timeout of 0:
$ sudo ./tools/net/ynl/cli.py \
--spec Documentation/netlink/specs/netdev.yaml \
--do napi-set --json='{"id": 345,
"gro-flush-timeout": 0}'
None
Check that NAPI ID 345 now has a gro-flush-timeout of 0:
$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
--dump napi-get --json='{"ifindex": 7}'
[{'defer-hard-irqs': 111,
'gro-flush-timeout': 0,
'id': 345,
'ifindex': 7,
'irq': 527},
{'defer-hard-irqs': 100,
'gro-flush-timeout': 20000,
'id': 344,
'ifindex': 7,
'irq': 327}]
Change the queue count, ensuring that NAPI ID 345 retains its settings:
$ sudo ethtool -L eth4 combined 4
Check that the new queues have the system-wide settings but that NAPI ID
345 remains unchanged:
$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
--dump napi-get --json='{"ifindex": 7}'
[{'defer-hard-irqs': 100,
'gro-flush-timeout': 20000,
'id': 347,
'ifindex': 7,
'irq': 529},
{'defer-hard-irqs': 100,
'gro-flush-timeout': 20000,
'id': 346,
'ifindex': 7,
'irq': 528},
{'defer-hard-irqs': 111,
'gro-flush-timeout': 0,
'id': 345,
'ifindex': 7,
'irq': 527},
{'defer-hard-irqs': 100,
'gro-flush-timeout': 20000,
'id': 344,
'ifindex': 7,
'irq': 327}]
Now reduce the queue count below where NAPI ID 345 is indexed:
$ sudo ethtool -L eth4 combined 1
Check the output:
$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
--dump napi-get --json='{"ifindex": 7}'
[{'defer-hard-irqs': 100,
'gro-flush-timeout': 20000,
'id': 344,
'ifindex': 7,
'irq': 327}]
Re-increase the queue count to ensure NAPI ID 345 is re-assigned the same
values:
$ sudo ethtool -L eth4 combined 2
$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
--dump napi-get --json='{"ifindex": 7}'
[{'defer-hard-irqs': 111,
'gro-flush-timeout': 0,
'id': 345,
'ifindex': 7,
'irq': 527},
{'defer-hard-irqs': 100,
'gro-flush-timeout': 20000,
'id': 344,
'ifindex': 7,
'irq': 327}]
Create new queues to ensure the sysfs globals are used for the new NAPIs
but that NAPI ID 345 is unchanged:
$ sudo ethtool -L eth4 combined 8
$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
--dump napi-get --json='{"ifindex": 7}'
[...]
{'defer-hard-irqs': 100,
'gro-flush-timeout': 20000,
'id': 346,
'ifindex': 7,
'irq': 528},
{'defer-hard-irqs': 111,
'gro-flush-timeout': 0,
'id': 345,
'ifindex': 7,
'irq': 527},
{'defer-hard-irqs': 100,
'gro-flush-timeout': 20000,
'id': 344,
'ifindex': 7,
'irq': 327}]
Last, but not least, let's write the sysfs parameters again to ensure
all NAPIs are updated:
$ sudo bash -c 'echo 33333 >/sys/class/net/eth4/gro_flush_timeout'
$ sudo bash -c 'echo 222 >/sys/class/net/eth4/napi_defer_hard_irqs'
Check that it worked:
$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
--dump napi-get --json='{"ifindex": 7}'
[...]
{'defer-hard-irqs': 222,
'gro-flush-timeout': 33333,
'id': 346,
'ifindex': 7,
'irq': 528},
{'defer-hard-irqs': 222,
'gro-flush-timeout': 33333,
'id': 345,
'ifindex': 7,
'irq': 527},
{'defer-hard-irqs': 222,
'gro-flush-timeout': 33333,
'id': 344,
'ifindex': 7,
'irq': 327}]
Thanks,
Joe
rfcv3:
- Renamed napi_storage to napi_config
- Reordered patches
- Added defer_hard_irqs and gro_flush_timeout to napi_struct
- Attempt to save and restore settings on napi_disable/napi_enable
- Removed weight as a parameter to netif_napi_add_storage
- Updated driver patches to no longer pass in weight
rfcv2:
- Almost total rewrite from v1
Joe Damato (9):
net: napi: Make napi_defer_hard_irqs per-NAPI
netdev-genl: Dump napi_defer_hard_irqs
net: napi: Make gro_flush_timeout per-NAPI
netdev-genl: Dump gro_flush_timeout
net: napi: Add napi_config
netdev-genl: Support setting per-NAPI config values
bnxt: Add support for napi storage
mlx5: Add support for napi storage
mlx4: Add support for napi storage to RX CQs
Documentation/netlink/specs/netdev.yaml | 25 ++++++
.../networking/net_cachelines/net_device.rst | 5 +-
drivers/net/ethernet/broadcom/bnxt/bnxt.c | 3 +-
drivers/net/ethernet/mellanox/mlx4/en_cq.c | 3 +-
.../net/ethernet/mellanox/mlx5/core/en_main.c | 2 +-
include/linux/netdevice.h | 40 ++++++++-
include/uapi/linux/netdev.h | 3 +
net/core/dev.c | 90 +++++++++++++++----
net/core/dev.h | 87 ++++++++++++++++++
net/core/net-sysfs.c | 4 +-
net/core/netdev-genl-gen.c | 14 +++
net/core/netdev-genl-gen.h | 1 +
net/core/netdev-genl.c | 55 ++++++++++++
tools/include/uapi/linux/netdev.h | 3 +
14 files changed, 310 insertions(+), 25 deletions(-)
--
2.25.1
* [RFC net-next v3 1/9] net: napi: Make napi_defer_hard_irqs per-NAPI
2024-09-12 10:07 [RFC net-next v3 0/9] Add support for per-NAPI config via netlink Joe Damato
@ 2024-09-12 10:07 ` Joe Damato
2024-09-12 10:07 ` [RFC net-next v3 2/9] netdev-genl: Dump napi_defer_hard_irqs Joe Damato
` (7 subsequent siblings)
8 siblings, 0 replies; 18+ messages in thread
From: Joe Damato @ 2024-09-12 10:07 UTC (permalink / raw)
To: netdev
Cc: mkarsten, kuba, skhawaja, sdf, bjorn, amritha.nambiar,
sridhar.samudrala, Joe Damato, David S. Miller, Eric Dumazet,
Paolo Abeni, Jonathan Corbet, Jiri Pirko,
Sebastian Andrzej Siewior, Lorenzo Bianconi, David Ahern,
Kory Maincent, Johannes Berg, Breno Leitao, Alexander Lobakin,
open list:DOCUMENTATION, open list
Allow per-NAPI defer_hard_irqs setting.
The existing sysfs parameter is respected; writes to sysfs will write to
all NAPI structs for the device and the net_device defer_hard_irq field.
Reads from sysfs will read from the net_device field.
The sysfs code was updated to guard against what appears to be a
potential overflow, as the field is an int but the value passed in is
an unsigned long.
The ability to set defer_hard_irqs on specific NAPI instances will be
added in a later commit, via netdev-genl.
Signed-off-by: Joe Damato <jdamato@fastly.com>
---
.../networking/net_cachelines/net_device.rst | 2 +-
include/linux/netdevice.h | 3 +-
net/core/dev.c | 10 +++---
net/core/dev.h | 36 +++++++++++++++++++
net/core/net-sysfs.c | 2 +-
5 files changed, 45 insertions(+), 8 deletions(-)
diff --git a/Documentation/networking/net_cachelines/net_device.rst b/Documentation/networking/net_cachelines/net_device.rst
index 22b07c814f4a..eeeb7c925ec5 100644
--- a/Documentation/networking/net_cachelines/net_device.rst
+++ b/Documentation/networking/net_cachelines/net_device.rst
@@ -99,7 +99,6 @@ unsigned_int num_rx_queues
unsigned_int real_num_rx_queues - read_mostly get_rps_cpu
struct_bpf_prog* xdp_prog - read_mostly netif_elide_gro()
unsigned_long gro_flush_timeout - read_mostly napi_complete_done
-u32 napi_defer_hard_irqs - read_mostly napi_complete_done
unsigned_int gro_max_size - read_mostly skb_gro_receive
unsigned_int gro_ipv4_max_size - read_mostly skb_gro_receive
rx_handler_func_t* rx_handler read_mostly - __netif_receive_skb_core
@@ -183,3 +182,4 @@ struct_devlink_port* devlink_port
struct_dpll_pin* dpll_pin
struct hlist_head page_pools
struct dim_irq_moder* irq_moder
+u32 napi_defer_hard_irqs
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 2e40a137dc12..f28b96c95259 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -377,6 +377,7 @@ struct napi_struct {
struct list_head dev_list;
struct hlist_node napi_hash_node;
int irq;
+ u32 defer_hard_irqs;
};
enum {
@@ -2075,7 +2076,6 @@ struct net_device {
unsigned int real_num_rx_queues;
struct netdev_rx_queue *_rx;
unsigned long gro_flush_timeout;
- u32 napi_defer_hard_irqs;
unsigned int gro_max_size;
unsigned int gro_ipv4_max_size;
rx_handler_func_t __rcu *rx_handler;
@@ -2398,6 +2398,7 @@ struct net_device {
/** @irq_moder: dim parameters used if IS_ENABLED(CONFIG_DIMLIB). */
struct dim_irq_moder *irq_moder;
+ u32 napi_defer_hard_irqs;
u8 priv[] ____cacheline_aligned
__counted_by(priv_len);
diff --git a/net/core/dev.c b/net/core/dev.c
index 8f4dead64284..d3d0680664b3 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -6221,7 +6221,7 @@ bool napi_complete_done(struct napi_struct *n, int work_done)
if (work_done) {
if (n->gro_bitmask)
timeout = READ_ONCE(n->dev->gro_flush_timeout);
- n->defer_hard_irqs_count = READ_ONCE(n->dev->napi_defer_hard_irqs);
+ n->defer_hard_irqs_count = napi_get_defer_hard_irqs(n);
}
if (n->defer_hard_irqs_count > 0) {
n->defer_hard_irqs_count--;
@@ -6359,7 +6359,7 @@ static void busy_poll_stop(struct napi_struct *napi, void *have_poll_lock,
bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx);
if (flags & NAPI_F_PREFER_BUSY_POLL) {
- napi->defer_hard_irqs_count = READ_ONCE(napi->dev->napi_defer_hard_irqs);
+ napi->defer_hard_irqs_count = napi_get_defer_hard_irqs(napi);
timeout = READ_ONCE(napi->dev->gro_flush_timeout);
if (napi->defer_hard_irqs_count && timeout) {
hrtimer_start(&napi->timer, ns_to_ktime(timeout), HRTIMER_MODE_REL_PINNED);
@@ -6641,6 +6641,7 @@ void netif_napi_add_weight(struct net_device *dev, struct napi_struct *napi,
INIT_HLIST_NODE(&napi->napi_hash_node);
hrtimer_init(&napi->timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL_PINNED);
napi->timer.function = napi_watchdog;
+ napi_set_defer_hard_irqs(napi, READ_ONCE(dev->napi_defer_hard_irqs));
init_gro_hash(napi);
napi->skb = NULL;
INIT_LIST_HEAD(&napi->rx_list);
@@ -11023,7 +11024,7 @@ void netdev_sw_irq_coalesce_default_on(struct net_device *dev)
if (!IS_ENABLED(CONFIG_PREEMPT_RT)) {
dev->gro_flush_timeout = 20000;
- dev->napi_defer_hard_irqs = 1;
+ netdev_set_defer_hard_irqs(dev, 1);
}
}
EXPORT_SYMBOL_GPL(netdev_sw_irq_coalesce_default_on);
@@ -11960,7 +11961,6 @@ static void __init net_dev_struct_check(void)
CACHELINE_ASSERT_GROUP_MEMBER(struct net_device, net_device_read_rx, real_num_rx_queues);
CACHELINE_ASSERT_GROUP_MEMBER(struct net_device, net_device_read_rx, _rx);
CACHELINE_ASSERT_GROUP_MEMBER(struct net_device, net_device_read_rx, gro_flush_timeout);
- CACHELINE_ASSERT_GROUP_MEMBER(struct net_device, net_device_read_rx, napi_defer_hard_irqs);
CACHELINE_ASSERT_GROUP_MEMBER(struct net_device, net_device_read_rx, gro_max_size);
CACHELINE_ASSERT_GROUP_MEMBER(struct net_device, net_device_read_rx, gro_ipv4_max_size);
CACHELINE_ASSERT_GROUP_MEMBER(struct net_device, net_device_read_rx, rx_handler);
@@ -11972,7 +11972,7 @@ static void __init net_dev_struct_check(void)
#ifdef CONFIG_NET_XGRESS
CACHELINE_ASSERT_GROUP_MEMBER(struct net_device, net_device_read_rx, tcx_ingress);
#endif
- CACHELINE_ASSERT_GROUP_SIZE(struct net_device, net_device_read_rx, 104);
+ CACHELINE_ASSERT_GROUP_SIZE(struct net_device, net_device_read_rx, 100);
}
/*
diff --git a/net/core/dev.h b/net/core/dev.h
index 5654325c5b71..f24fa38a2cac 100644
--- a/net/core/dev.h
+++ b/net/core/dev.h
@@ -138,6 +138,42 @@ static inline void netif_set_gro_ipv4_max_size(struct net_device *dev,
WRITE_ONCE(dev->gro_ipv4_max_size, size);
}
+/**
+ * napi_get_defer_hard_irqs - get the NAPI's defer_hard_irqs
+ * @n: napi struct to get the defer_hard_irqs field from
+ *
+ * Return: the per-NAPI value of the defer_hard_irqs field.
+ */
+static inline u32 napi_get_defer_hard_irqs(const struct napi_struct *n)
+{
+ return READ_ONCE(n->defer_hard_irqs);
+}
+
+/**
+ * napi_set_defer_hard_irqs - set the defer_hard_irqs for a napi
+ * @n: napi_struct to set the defer_hard_irqs field
+ * @defer: the value the field should be set to
+ */
+static inline void napi_set_defer_hard_irqs(struct napi_struct *n, u32 defer)
+{
+ WRITE_ONCE(n->defer_hard_irqs, defer);
+}
+
+/**
+ * netdev_set_defer_hard_irqs - set defer_hard_irqs for all NAPIs of a netdev
+ * @netdev: the net_device for which all NAPIs will have their defer_hard_irqs set
+ * @defer: the defer_hard_irqs value to set
+ */
+static inline void netdev_set_defer_hard_irqs(struct net_device *netdev,
+ u32 defer)
+{
+ struct napi_struct *napi;
+
+ WRITE_ONCE(netdev->napi_defer_hard_irqs, defer);
+ list_for_each_entry(napi, &netdev->napi_list, dev_list)
+ napi_set_defer_hard_irqs(napi, defer);
+}
+
int rps_cpumask_housekeeping(struct cpumask *mask);
#if defined(CONFIG_DEBUG_NET) && defined(CONFIG_BPF_SYSCALL)
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 05cf5347f25e..25125f356a15 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -429,7 +429,7 @@ static int change_napi_defer_hard_irqs(struct net_device *dev, unsigned long val
if (val > S32_MAX)
return -ERANGE;
- WRITE_ONCE(dev->napi_defer_hard_irqs, val);
+ netdev_set_defer_hard_irqs(dev, (u32)val);
return 0;
}
--
2.25.1
* [RFC net-next v3 2/9] netdev-genl: Dump napi_defer_hard_irqs
2024-09-12 10:07 [RFC net-next v3 0/9] Add support for per-NAPI config via netlink Joe Damato
2024-09-12 10:07 ` [RFC net-next v3 1/9] net: napi: Make napi_defer_hard_irqs per-NAPI Joe Damato
@ 2024-09-12 10:07 ` Joe Damato
2024-09-12 10:07 ` [RFC net-next v3 3/9] net: napi: Make gro_flush_timeout per-NAPI Joe Damato
` (6 subsequent siblings)
8 siblings, 0 replies; 18+ messages in thread
From: Joe Damato @ 2024-09-12 10:07 UTC (permalink / raw)
To: netdev
Cc: mkarsten, kuba, skhawaja, sdf, bjorn, amritha.nambiar,
sridhar.samudrala, Joe Damato, Donald Hunter, David S. Miller,
Eric Dumazet, Paolo Abeni, Jesper Dangaard Brouer, Xuan Zhuo,
Daniel Jurgens, open list
Support dumping defer_hard_irqs for a NAPI ID.
Signed-off-by: Joe Damato <jdamato@fastly.com>
---
Documentation/netlink/specs/netdev.yaml | 8 ++++++++
include/uapi/linux/netdev.h | 1 +
net/core/netdev-genl.c | 5 +++++
tools/include/uapi/linux/netdev.h | 1 +
4 files changed, 15 insertions(+)
diff --git a/Documentation/netlink/specs/netdev.yaml b/Documentation/netlink/specs/netdev.yaml
index 959755be4d7f..351d93994a66 100644
--- a/Documentation/netlink/specs/netdev.yaml
+++ b/Documentation/netlink/specs/netdev.yaml
@@ -244,6 +244,13 @@ attribute-sets:
threaded mode. If NAPI is not in threaded mode (i.e. uses normal
softirq context), the attribute will be absent.
type: u32
+ -
+ name: defer-hard-irqs
+ doc: The number of consecutive empty polls before IRQ deferral ends
+ and hardware IRQs are re-enabled.
+ type: u32
+ checks:
+ max: s32-max
-
name: queue
attributes:
@@ -593,6 +600,7 @@ operations:
- ifindex
- irq
- pid
+ - defer-hard-irqs
dump:
request:
attributes:
diff --git a/include/uapi/linux/netdev.h b/include/uapi/linux/netdev.h
index 43742ac5b00d..43bb1aad9611 100644
--- a/include/uapi/linux/netdev.h
+++ b/include/uapi/linux/netdev.h
@@ -121,6 +121,7 @@ enum {
NETDEV_A_NAPI_ID,
NETDEV_A_NAPI_IRQ,
NETDEV_A_NAPI_PID,
+ NETDEV_A_NAPI_DEFER_HARD_IRQS,
__NETDEV_A_NAPI_MAX,
NETDEV_A_NAPI_MAX = (__NETDEV_A_NAPI_MAX - 1)
diff --git a/net/core/netdev-genl.c b/net/core/netdev-genl.c
index a17d7eaeb001..e67918dd97be 100644
--- a/net/core/netdev-genl.c
+++ b/net/core/netdev-genl.c
@@ -160,6 +160,7 @@ static int
netdev_nl_napi_fill_one(struct sk_buff *rsp, struct napi_struct *napi,
const struct genl_info *info)
{
+ u32 napi_defer_hard_irqs;
void *hdr;
pid_t pid;
@@ -188,6 +189,10 @@ netdev_nl_napi_fill_one(struct sk_buff *rsp, struct napi_struct *napi,
goto nla_put_failure;
}
+ napi_defer_hard_irqs = napi_get_defer_hard_irqs(napi);
+ if (nla_put_s32(rsp, NETDEV_A_NAPI_DEFER_HARD_IRQS, napi_defer_hard_irqs))
+ goto nla_put_failure;
+
genlmsg_end(rsp, hdr);
return 0;
diff --git a/tools/include/uapi/linux/netdev.h b/tools/include/uapi/linux/netdev.h
index 43742ac5b00d..43bb1aad9611 100644
--- a/tools/include/uapi/linux/netdev.h
+++ b/tools/include/uapi/linux/netdev.h
@@ -121,6 +121,7 @@ enum {
NETDEV_A_NAPI_ID,
NETDEV_A_NAPI_IRQ,
NETDEV_A_NAPI_PID,
+ NETDEV_A_NAPI_DEFER_HARD_IRQS,
__NETDEV_A_NAPI_MAX,
NETDEV_A_NAPI_MAX = (__NETDEV_A_NAPI_MAX - 1)
--
2.25.1
* [RFC net-next v3 3/9] net: napi: Make gro_flush_timeout per-NAPI
2024-09-12 10:07 [RFC net-next v3 0/9] Add support for per-NAPI config via netlink Joe Damato
2024-09-12 10:07 ` [RFC net-next v3 1/9] net: napi: Make napi_defer_hard_irqs per-NAPI Joe Damato
2024-09-12 10:07 ` [RFC net-next v3 2/9] netdev-genl: Dump napi_defer_hard_irqs Joe Damato
@ 2024-09-12 10:07 ` Joe Damato
2024-09-12 10:07 ` [RFC net-next v3 4/9] netdev-genl: Dump gro_flush_timeout Joe Damato
` (5 subsequent siblings)
8 siblings, 0 replies; 18+ messages in thread
From: Joe Damato @ 2024-09-12 10:07 UTC (permalink / raw)
To: netdev
Cc: mkarsten, kuba, skhawaja, sdf, bjorn, amritha.nambiar,
sridhar.samudrala, Joe Damato, David S. Miller, Eric Dumazet,
Paolo Abeni, Jonathan Corbet, Jiri Pirko,
Sebastian Andrzej Siewior, Lorenzo Bianconi, Johannes Berg,
Breno Leitao, Alexander Lobakin, open list:DOCUMENTATION,
open list
Allow per-NAPI gro_flush_timeout setting.
The existing sysfs parameter is respected; writes to sysfs will write to
all NAPI structs for the device and the net_device gro_flush_timeout
field. Reads from sysfs will read from the net_device field.
The ability to set gro_flush_timeout on specific NAPI instances will be
added in a later commit, via netdev-genl.
Signed-off-by: Joe Damato <jdamato@fastly.com>
---
.../networking/net_cachelines/net_device.rst | 2 +-
include/linux/netdevice.h | 3 +-
net/core/dev.c | 12 +++---
net/core/dev.h | 39 +++++++++++++++++++
net/core/net-sysfs.c | 2 +-
5 files changed, 49 insertions(+), 9 deletions(-)
diff --git a/Documentation/networking/net_cachelines/net_device.rst b/Documentation/networking/net_cachelines/net_device.rst
index eeeb7c925ec5..3d02ae79c850 100644
--- a/Documentation/networking/net_cachelines/net_device.rst
+++ b/Documentation/networking/net_cachelines/net_device.rst
@@ -98,7 +98,6 @@ struct_netdev_queue* _rx read_mostly
unsigned_int num_rx_queues
unsigned_int real_num_rx_queues - read_mostly get_rps_cpu
struct_bpf_prog* xdp_prog - read_mostly netif_elide_gro()
-unsigned_long gro_flush_timeout - read_mostly napi_complete_done
unsigned_int gro_max_size - read_mostly skb_gro_receive
unsigned_int gro_ipv4_max_size - read_mostly skb_gro_receive
rx_handler_func_t* rx_handler read_mostly - __netif_receive_skb_core
@@ -182,4 +181,5 @@ struct_devlink_port* devlink_port
struct_dpll_pin* dpll_pin
struct hlist_head page_pools
struct dim_irq_moder* irq_moder
+unsigned_long gro_flush_timeout
u32 napi_defer_hard_irqs
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index f28b96c95259..3e07ab8e0295 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -377,6 +377,7 @@ struct napi_struct {
struct list_head dev_list;
struct hlist_node napi_hash_node;
int irq;
+ unsigned long gro_flush_timeout;
u32 defer_hard_irqs;
};
@@ -2075,7 +2076,6 @@ struct net_device {
int ifindex;
unsigned int real_num_rx_queues;
struct netdev_rx_queue *_rx;
- unsigned long gro_flush_timeout;
unsigned int gro_max_size;
unsigned int gro_ipv4_max_size;
rx_handler_func_t __rcu *rx_handler;
@@ -2398,6 +2398,7 @@ struct net_device {
/** @irq_moder: dim parameters used if IS_ENABLED(CONFIG_DIMLIB). */
struct dim_irq_moder *irq_moder;
+ unsigned long gro_flush_timeout;
u32 napi_defer_hard_irqs;
u8 priv[] ____cacheline_aligned
diff --git a/net/core/dev.c b/net/core/dev.c
index d3d0680664b3..f2fd503516de 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -6220,12 +6220,12 @@ bool napi_complete_done(struct napi_struct *n, int work_done)
if (work_done) {
if (n->gro_bitmask)
- timeout = READ_ONCE(n->dev->gro_flush_timeout);
+ timeout = napi_get_gro_flush_timeout(n);
n->defer_hard_irqs_count = napi_get_defer_hard_irqs(n);
}
if (n->defer_hard_irqs_count > 0) {
n->defer_hard_irqs_count--;
- timeout = READ_ONCE(n->dev->gro_flush_timeout);
+ timeout = napi_get_gro_flush_timeout(n);
if (timeout)
ret = false;
}
@@ -6360,7 +6360,7 @@ static void busy_poll_stop(struct napi_struct *napi, void *have_poll_lock,
if (flags & NAPI_F_PREFER_BUSY_POLL) {
napi->defer_hard_irqs_count = napi_get_defer_hard_irqs(napi);
- timeout = READ_ONCE(napi->dev->gro_flush_timeout);
+ timeout = napi_get_gro_flush_timeout(napi);
if (napi->defer_hard_irqs_count && timeout) {
hrtimer_start(&napi->timer, ns_to_ktime(timeout), HRTIMER_MODE_REL_PINNED);
skip_schedule = true;
@@ -6642,6 +6642,7 @@ void netif_napi_add_weight(struct net_device *dev, struct napi_struct *napi,
hrtimer_init(&napi->timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL_PINNED);
napi->timer.function = napi_watchdog;
napi_set_defer_hard_irqs(napi, READ_ONCE(dev->napi_defer_hard_irqs));
+ napi_set_gro_flush_timeout(napi, READ_ONCE(dev->gro_flush_timeout));
init_gro_hash(napi);
napi->skb = NULL;
INIT_LIST_HEAD(&napi->rx_list);
@@ -11023,7 +11024,7 @@ void netdev_sw_irq_coalesce_default_on(struct net_device *dev)
WARN_ON(dev->reg_state == NETREG_REGISTERED);
if (!IS_ENABLED(CONFIG_PREEMPT_RT)) {
- dev->gro_flush_timeout = 20000;
+ netdev_set_gro_flush_timeout(dev, 20000);
netdev_set_defer_hard_irqs(dev, 1);
}
}
@@ -11960,7 +11961,6 @@ static void __init net_dev_struct_check(void)
CACHELINE_ASSERT_GROUP_MEMBER(struct net_device, net_device_read_rx, ifindex);
CACHELINE_ASSERT_GROUP_MEMBER(struct net_device, net_device_read_rx, real_num_rx_queues);
CACHELINE_ASSERT_GROUP_MEMBER(struct net_device, net_device_read_rx, _rx);
- CACHELINE_ASSERT_GROUP_MEMBER(struct net_device, net_device_read_rx, gro_flush_timeout);
CACHELINE_ASSERT_GROUP_MEMBER(struct net_device, net_device_read_rx, gro_max_size);
CACHELINE_ASSERT_GROUP_MEMBER(struct net_device, net_device_read_rx, gro_ipv4_max_size);
CACHELINE_ASSERT_GROUP_MEMBER(struct net_device, net_device_read_rx, rx_handler);
@@ -11972,7 +11972,7 @@ static void __init net_dev_struct_check(void)
#ifdef CONFIG_NET_XGRESS
CACHELINE_ASSERT_GROUP_MEMBER(struct net_device, net_device_read_rx, tcx_ingress);
#endif
- CACHELINE_ASSERT_GROUP_SIZE(struct net_device, net_device_read_rx, 100);
+ CACHELINE_ASSERT_GROUP_SIZE(struct net_device, net_device_read_rx, 92);
}
/*
diff --git a/net/core/dev.h b/net/core/dev.h
index f24fa38a2cac..a9d5f678564a 100644
--- a/net/core/dev.h
+++ b/net/core/dev.h
@@ -174,6 +174,45 @@ static inline void netdev_set_defer_hard_irqs(struct net_device *netdev,
napi_set_defer_hard_irqs(napi, defer);
}
+/**
+ * napi_get_gro_flush_timeout - get the gro_flush_timeout
+ * @n: napi struct to get the gro_flush_timeout from
+ *
+ * Return: the per-NAPI value of the gro_flush_timeout field.
+ */
+static inline unsigned long napi_get_gro_flush_timeout(const struct napi_struct *n)
+{
+ return READ_ONCE(n->gro_flush_timeout);
+}
+
+/**
+ * napi_set_gro_flush_timeout - set the gro_flush_timeout for a napi
+ * @n: napi struct to set the gro_flush_timeout
+ * @timeout: timeout value to set
+ *
+ * napi_set_gro_flush_timeout sets the per-NAPI gro_flush_timeout
+ */
+static inline void napi_set_gro_flush_timeout(struct napi_struct *n,
+ unsigned long timeout)
+{
+ WRITE_ONCE(n->gro_flush_timeout, timeout);
+}
+
+/**
+ * netdev_set_gro_flush_timeout - set gro_flush_timeout for all NAPIs of a netdev
+ * @netdev: the net_device for which all NAPIs will have their gro_flush_timeout set
+ * @timeout: the timeout value to set
+ */
+static inline void netdev_set_gro_flush_timeout(struct net_device *netdev,
+ unsigned long timeout)
+{
+ struct napi_struct *napi;
+
+ WRITE_ONCE(netdev->gro_flush_timeout, timeout);
+ list_for_each_entry(napi, &netdev->napi_list, dev_list)
+ napi_set_gro_flush_timeout(napi, timeout);
+}
+
int rps_cpumask_housekeeping(struct cpumask *mask);
#if defined(CONFIG_DEBUG_NET) && defined(CONFIG_BPF_SYSCALL)
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 25125f356a15..2d9afc6e2161 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -409,7 +409,7 @@ NETDEVICE_SHOW_RW(tx_queue_len, fmt_dec);
static int change_gro_flush_timeout(struct net_device *dev, unsigned long val)
{
- WRITE_ONCE(dev->gro_flush_timeout, val);
+ netdev_set_gro_flush_timeout(dev, val);
return 0;
}
--
2.25.1
* [RFC net-next v3 4/9] netdev-genl: Dump gro_flush_timeout
2024-09-12 10:07 [RFC net-next v3 0/9] Add support for per-NAPI config via netlink Joe Damato
` (2 preceding siblings ...)
2024-09-12 10:07 ` [RFC net-next v3 3/9] net: napi: Make gro_flush_timeout per-NAPI Joe Damato
@ 2024-09-12 10:07 ` Joe Damato
2024-09-12 10:07 ` [RFC net-next v3 5/9] net: napi: Add napi_config Joe Damato
` (4 subsequent siblings)
8 siblings, 0 replies; 18+ messages in thread
From: Joe Damato @ 2024-09-12 10:07 UTC (permalink / raw)
To: netdev
Cc: mkarsten, kuba, skhawaja, sdf, bjorn, amritha.nambiar,
sridhar.samudrala, Joe Damato, David S. Miller, Eric Dumazet,
Paolo Abeni, Donald Hunter, Jesper Dangaard Brouer, Xuan Zhuo,
Daniel Jurgens, open list
Support dumping gro_flush_timeout for a NAPI ID.
Signed-off-by: Joe Damato <jdamato@fastly.com>
---
Documentation/netlink/specs/netdev.yaml | 6 ++++++
include/uapi/linux/netdev.h | 1 +
net/core/netdev-genl.c | 5 +++++
tools/include/uapi/linux/netdev.h | 1 +
4 files changed, 13 insertions(+)
diff --git a/Documentation/netlink/specs/netdev.yaml b/Documentation/netlink/specs/netdev.yaml
index 351d93994a66..906091c3059a 100644
--- a/Documentation/netlink/specs/netdev.yaml
+++ b/Documentation/netlink/specs/netdev.yaml
@@ -251,6 +251,11 @@ attribute-sets:
type: u32
checks:
max: s32-max
+ -
+ name: gro-flush-timeout
+ doc: The timeout, in nanoseconds, after which to trigger the NAPI
+ watchdog timer and schedule NAPI processing.
+ type: uint
-
name: queue
attributes:
@@ -601,6 +606,7 @@ operations:
- irq
- pid
- defer-hard-irqs
+ - gro-flush-timeout
dump:
request:
attributes:
diff --git a/include/uapi/linux/netdev.h b/include/uapi/linux/netdev.h
index 43bb1aad9611..b088a34e9254 100644
--- a/include/uapi/linux/netdev.h
+++ b/include/uapi/linux/netdev.h
@@ -122,6 +122,7 @@ enum {
NETDEV_A_NAPI_IRQ,
NETDEV_A_NAPI_PID,
NETDEV_A_NAPI_DEFER_HARD_IRQS,
+ NETDEV_A_NAPI_GRO_FLUSH_TIMEOUT,
__NETDEV_A_NAPI_MAX,
NETDEV_A_NAPI_MAX = (__NETDEV_A_NAPI_MAX - 1)
diff --git a/net/core/netdev-genl.c b/net/core/netdev-genl.c
index e67918dd97be..4698034b5a49 100644
--- a/net/core/netdev-genl.c
+++ b/net/core/netdev-genl.c
@@ -160,6 +160,7 @@ static int
netdev_nl_napi_fill_one(struct sk_buff *rsp, struct napi_struct *napi,
const struct genl_info *info)
{
+ unsigned long gro_flush_timeout;
u32 napi_defer_hard_irqs;
void *hdr;
pid_t pid;
@@ -193,6 +194,10 @@ netdev_nl_napi_fill_one(struct sk_buff *rsp, struct napi_struct *napi,
if (nla_put_s32(rsp, NETDEV_A_NAPI_DEFER_HARD_IRQS, napi_defer_hard_irqs))
goto nla_put_failure;
+ gro_flush_timeout = napi_get_gro_flush_timeout(napi);
+ if (nla_put_uint(rsp, NETDEV_A_NAPI_GRO_FLUSH_TIMEOUT, gro_flush_timeout))
+ goto nla_put_failure;
+
genlmsg_end(rsp, hdr);
return 0;
diff --git a/tools/include/uapi/linux/netdev.h b/tools/include/uapi/linux/netdev.h
index 43bb1aad9611..b088a34e9254 100644
--- a/tools/include/uapi/linux/netdev.h
+++ b/tools/include/uapi/linux/netdev.h
@@ -122,6 +122,7 @@ enum {
NETDEV_A_NAPI_IRQ,
NETDEV_A_NAPI_PID,
NETDEV_A_NAPI_DEFER_HARD_IRQS,
+ NETDEV_A_NAPI_GRO_FLUSH_TIMEOUT,
__NETDEV_A_NAPI_MAX,
NETDEV_A_NAPI_MAX = (__NETDEV_A_NAPI_MAX - 1)
--
2.25.1
* [RFC net-next v3 5/9] net: napi: Add napi_config
2024-09-12 10:07 [RFC net-next v3 0/9] Add support for per-NAPI config via netlink Joe Damato
` (3 preceding siblings ...)
2024-09-12 10:07 ` [RFC net-next v3 4/9] netdev-genl: Dump gro_flush_timeout Joe Damato
@ 2024-09-12 10:07 ` Joe Damato
2024-09-12 15:03 ` Joe Damato
2024-09-13 17:42 ` Stanislav Fomichev
2024-09-12 10:07 ` [RFC net-next v3 6/9] netdev-genl: Support setting per-NAPI config values Joe Damato
` (3 subsequent siblings)
8 siblings, 2 replies; 18+ messages in thread
From: Joe Damato @ 2024-09-12 10:07 UTC (permalink / raw)
To: netdev
Cc: mkarsten, kuba, skhawaja, sdf, bjorn, amritha.nambiar,
sridhar.samudrala, Joe Damato, David S. Miller, Eric Dumazet,
Paolo Abeni, Jonathan Corbet, Jiri Pirko,
Sebastian Andrzej Siewior, Lorenzo Bianconi, David Ahern,
Johannes Berg, open list:DOCUMENTATION, open list
Add a persistent NAPI config area for NAPI configuration to the core.
Drivers opt-in to setting the storage for a NAPI by passing an index
when calling netif_napi_add_storage.
napi_config is allocated in alloc_netdev_mqs, freed in free_netdev
(after the NAPIs are deleted), and set to 0 when napi_enable is called.
Drivers which call netif_napi_add_storage will have persistent
NAPI IDs.
Signed-off-by: Joe Damato <jdamato@fastly.com>
---
.../networking/net_cachelines/net_device.rst | 1 +
include/linux/netdevice.h | 34 +++++++++
net/core/dev.c | 74 +++++++++++++++++--
net/core/dev.h | 12 +++
4 files changed, 113 insertions(+), 8 deletions(-)
diff --git a/Documentation/networking/net_cachelines/net_device.rst b/Documentation/networking/net_cachelines/net_device.rst
index 3d02ae79c850..11d659051f5e 100644
--- a/Documentation/networking/net_cachelines/net_device.rst
+++ b/Documentation/networking/net_cachelines/net_device.rst
@@ -183,3 +183,4 @@ struct hlist_head page_pools
struct dim_irq_moder* irq_moder
unsigned_long gro_flush_timeout
u32 napi_defer_hard_irqs
+struct napi_config* napi_config
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 3e07ab8e0295..08afc96179f9 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -342,6 +342,15 @@ struct gro_list {
*/
#define GRO_HASH_BUCKETS 8
+/*
+ * Structure for per-NAPI storage
+ */
+struct napi_config {
+ u64 gro_flush_timeout;
+ u32 defer_hard_irqs;
+ unsigned int napi_id;
+};
+
/*
* Structure for NAPI scheduling similar to tasklet but with weighting
*/
@@ -379,6 +388,8 @@ struct napi_struct {
int irq;
unsigned long gro_flush_timeout;
u32 defer_hard_irqs;
+ int index;
+ struct napi_config *config;
};
enum {
@@ -2011,6 +2022,9 @@ enum netdev_reg_state {
* @dpll_pin: Pointer to the SyncE source pin of a DPLL subsystem,
* where the clock is recovered.
*
+ * @napi_config: An array of napi_config structures containing per-NAPI
+ * settings.
+ *
* FIXME: cleanup struct net_device such that network protocol info
* moves out.
*/
@@ -2400,6 +2414,7 @@ struct net_device {
struct dim_irq_moder *irq_moder;
unsigned long gro_flush_timeout;
u32 napi_defer_hard_irqs;
+ struct napi_config *napi_config;
u8 priv[] ____cacheline_aligned
__counted_by(priv_len);
@@ -2650,6 +2665,23 @@ netif_napi_add_tx_weight(struct net_device *dev,
netif_napi_add_weight(dev, napi, poll, weight);
}
+/**
+ * netif_napi_add_storage - initialize a NAPI context and set storage area
+ * @dev: network device
+ * @napi: NAPI context
+ * @poll: polling function
+ * @weight: the poll weight of this NAPI
+ * @index: the NAPI index
+ */
+static inline void
+netif_napi_add_storage(struct net_device *dev, struct napi_struct *napi,
+ int (*poll)(struct napi_struct *, int), int index)
+{
+ napi->index = index;
+ napi->config = &dev->napi_config[index];
+ netif_napi_add_weight(dev, napi, poll, NAPI_POLL_WEIGHT);
+}
+
/**
* netif_napi_add_tx() - initialize a NAPI context to be used for Tx only
* @dev: network device
@@ -2685,6 +2717,8 @@ void __netif_napi_del(struct napi_struct *napi);
*/
static inline void netif_napi_del(struct napi_struct *napi)
{
+ napi->config = NULL;
+ napi->index = -1;
__netif_napi_del(napi);
synchronize_net();
}
diff --git a/net/core/dev.c b/net/core/dev.c
index f2fd503516de..ca2227d0b8ed 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -6493,6 +6493,18 @@ EXPORT_SYMBOL(napi_busy_loop);
#endif /* CONFIG_NET_RX_BUSY_POLL */
+static void napi_hash_add_with_id(struct napi_struct *napi, unsigned int napi_id)
+{
+ spin_lock(&napi_hash_lock);
+
+ napi->napi_id = napi_id;
+
+ hlist_add_head_rcu(&napi->napi_hash_node,
+ &napi_hash[napi->napi_id % HASH_SIZE(napi_hash)]);
+
+ spin_unlock(&napi_hash_lock);
+}
+
static void napi_hash_add(struct napi_struct *napi)
{
if (test_bit(NAPI_STATE_NO_BUSY_POLL, &napi->state))
@@ -6505,12 +6517,13 @@ static void napi_hash_add(struct napi_struct *napi)
if (unlikely(++napi_gen_id < MIN_NAPI_ID))
napi_gen_id = MIN_NAPI_ID;
} while (napi_by_id(napi_gen_id));
- napi->napi_id = napi_gen_id;
-
- hlist_add_head_rcu(&napi->napi_hash_node,
- &napi_hash[napi->napi_id % HASH_SIZE(napi_hash)]);
spin_unlock(&napi_hash_lock);
+
+ napi_hash_add_with_id(napi, napi_gen_id);
+
+ if (napi->config)
+ napi->config->napi_id = napi_gen_id;
}
/* Warning : caller is responsible to make sure rcu grace period
@@ -6631,6 +6644,21 @@ void netif_queue_set_napi(struct net_device *dev, unsigned int queue_index,
}
EXPORT_SYMBOL(netif_queue_set_napi);
+static void napi_restore_config(struct napi_struct *n)
+{
+ n->defer_hard_irqs = n->config->defer_hard_irqs;
+ n->gro_flush_timeout = n->config->gro_flush_timeout;
+ napi_hash_add_with_id(n, n->config->napi_id);
+}
+
+static void napi_save_config(struct napi_struct *n)
+{
+ n->config->defer_hard_irqs = n->defer_hard_irqs;
+ n->config->gro_flush_timeout = n->gro_flush_timeout;
+ n->config->napi_id = n->napi_id;
+ napi_hash_del(n);
+}
+
void netif_napi_add_weight(struct net_device *dev, struct napi_struct *napi,
int (*poll)(struct napi_struct *, int), int weight)
{
@@ -6641,8 +6669,6 @@ void netif_napi_add_weight(struct net_device *dev, struct napi_struct *napi,
INIT_HLIST_NODE(&napi->napi_hash_node);
hrtimer_init(&napi->timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL_PINNED);
napi->timer.function = napi_watchdog;
- napi_set_defer_hard_irqs(napi, READ_ONCE(dev->napi_defer_hard_irqs));
- napi_set_gro_flush_timeout(napi, READ_ONCE(dev->gro_flush_timeout));
init_gro_hash(napi);
napi->skb = NULL;
INIT_LIST_HEAD(&napi->rx_list);
@@ -6660,7 +6686,15 @@ void netif_napi_add_weight(struct net_device *dev, struct napi_struct *napi,
set_bit(NAPI_STATE_SCHED, &napi->state);
set_bit(NAPI_STATE_NPSVC, &napi->state);
list_add_rcu(&napi->dev_list, &dev->napi_list);
- napi_hash_add(napi);
+ /* if there is no config associated with this NAPI, generate a fresh
+ * NAPI ID and hash it. Otherwise, settings will be restored in napi_enable.
+ */
+ if (!napi->config || (napi->config && !napi->config->napi_id)) {
+ napi_hash_add(napi);
+ napi_set_defer_hard_irqs(napi, READ_ONCE(dev->napi_defer_hard_irqs));
+ napi_set_gro_flush_timeout(napi, READ_ONCE(dev->gro_flush_timeout));
+ }
+
napi_get_frags_check(napi);
/* Create kthread for this napi if dev->threaded is set.
* Clear dev->threaded if kthread creation failed so that
@@ -6692,6 +6726,9 @@ void napi_disable(struct napi_struct *n)
hrtimer_cancel(&n->timer);
+ if (n->config)
+ napi_save_config(n);
+
clear_bit(NAPI_STATE_DISABLE, &n->state);
}
EXPORT_SYMBOL(napi_disable);
@@ -6714,6 +6751,9 @@ void napi_enable(struct napi_struct *n)
if (n->dev->threaded && n->thread)
new |= NAPIF_STATE_THREADED;
} while (!try_cmpxchg(&n->state, &val, new));
+
+ if (n->config)
+ napi_restore_config(n);
}
EXPORT_SYMBOL(napi_enable);
@@ -6736,7 +6776,13 @@ void __netif_napi_del(struct napi_struct *napi)
if (!test_and_clear_bit(NAPI_STATE_LISTED, &napi->state))
return;
- napi_hash_del(napi);
+ if (!napi->config) {
+ napi_hash_del(napi);
+ } else {
+ napi->index = -1;
+ napi->config = NULL;
+ }
+
list_del_rcu(&napi->dev_list);
napi_free_frags(napi);
@@ -11049,6 +11095,8 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
unsigned int txqs, unsigned int rxqs)
{
struct net_device *dev;
+ size_t napi_config_sz;
+ unsigned int maxqs;
BUG_ON(strlen(name) >= sizeof(dev->name));
@@ -11062,6 +11110,9 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
return NULL;
}
+ WARN_ON_ONCE(txqs != rxqs);
+ maxqs = max(txqs, rxqs);
+
dev = kvzalloc(struct_size(dev, priv, sizeof_priv),
GFP_KERNEL_ACCOUNT | __GFP_RETRY_MAYFAIL);
if (!dev)
@@ -11136,6 +11187,11 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
if (!dev->ethtool)
goto free_all;
+ napi_config_sz = array_size(maxqs, sizeof(*dev->napi_config));
+ dev->napi_config = kvzalloc(napi_config_sz, GFP_KERNEL_ACCOUNT);
+ if (!dev->napi_config)
+ goto free_all;
+
strscpy(dev->name, name);
dev->name_assign_type = name_assign_type;
dev->group = INIT_NETDEV_GROUP;
@@ -11197,6 +11253,8 @@ void free_netdev(struct net_device *dev)
list_for_each_entry_safe(p, n, &dev->napi_list, dev_list)
netif_napi_del(p);
+ kvfree(dev->napi_config);
+
ref_tracker_dir_exit(&dev->refcnt_tracker);
#ifdef CONFIG_PCPU_DEV_REFCNT
free_percpu(dev->pcpu_refcnt);
diff --git a/net/core/dev.h b/net/core/dev.h
index a9d5f678564a..9eb3f559275c 100644
--- a/net/core/dev.h
+++ b/net/core/dev.h
@@ -167,11 +167,17 @@ static inline void napi_set_defer_hard_irqs(struct napi_struct *n, u32 defer)
static inline void netdev_set_defer_hard_irqs(struct net_device *netdev,
u32 defer)
{
+ unsigned int count = max(netdev->num_rx_queues,
+ netdev->num_tx_queues);
struct napi_struct *napi;
+ int i;
WRITE_ONCE(netdev->napi_defer_hard_irqs, defer);
list_for_each_entry(napi, &netdev->napi_list, dev_list)
napi_set_defer_hard_irqs(napi, defer);
+
+ for (i = 0; i < count; i++)
+ netdev->napi_config[i].defer_hard_irqs = defer;
}
/**
@@ -206,11 +212,17 @@ static inline void napi_set_gro_flush_timeout(struct napi_struct *n,
static inline void netdev_set_gro_flush_timeout(struct net_device *netdev,
unsigned long timeout)
{
+ unsigned int count = max(netdev->num_rx_queues,
+ netdev->num_tx_queues);
struct napi_struct *napi;
+ int i;
WRITE_ONCE(netdev->gro_flush_timeout, timeout);
list_for_each_entry(napi, &netdev->napi_list, dev_list)
napi_set_gro_flush_timeout(napi, timeout);
+
+ for (i = 0; i < count; i++)
+ netdev->napi_config[i].gro_flush_timeout = timeout;
}
int rps_cpumask_housekeeping(struct cpumask *mask);
--
2.25.1
* [RFC net-next v3 6/9] netdev-genl: Support setting per-NAPI config values
2024-09-12 10:07 [RFC net-next v3 0/9] Add support for per-NAPI config via netlink Joe Damato
` (4 preceding siblings ...)
2024-09-12 10:07 ` [RFC net-next v3 5/9] net: napi: Add napi_config Joe Damato
@ 2024-09-12 10:07 ` Joe Damato
2024-09-12 10:38 ` Joe Damato
2024-09-12 10:07 ` [RFC net-next v3 7/9] bnxt: Add support for napi storage Joe Damato
` (2 subsequent siblings)
8 siblings, 1 reply; 18+ messages in thread
From: Joe Damato @ 2024-09-12 10:07 UTC (permalink / raw)
To: netdev
Cc: mkarsten, kuba, skhawaja, sdf, bjorn, amritha.nambiar,
sridhar.samudrala, Joe Damato, David S. Miller, Eric Dumazet,
Paolo Abeni, Donald Hunter, Jesper Dangaard Brouer, Xuan Zhuo,
Daniel Jurgens, Larysa Zaremba, open list
Add support to set per-NAPI defer_hard_irqs and gro_flush_timeout.
Signed-off-by: Joe Damato <jdamato@fastly.com>
---
Documentation/netlink/specs/netdev.yaml | 11 ++++++
include/uapi/linux/netdev.h | 1 +
net/core/netdev-genl-gen.c | 14 ++++++++
net/core/netdev-genl-gen.h | 1 +
net/core/netdev-genl.c | 45 +++++++++++++++++++++++++
tools/include/uapi/linux/netdev.h | 1 +
6 files changed, 73 insertions(+)
diff --git a/Documentation/netlink/specs/netdev.yaml b/Documentation/netlink/specs/netdev.yaml
index 906091c3059a..319c1e179b08 100644
--- a/Documentation/netlink/specs/netdev.yaml
+++ b/Documentation/netlink/specs/netdev.yaml
@@ -633,6 +633,17 @@ operations:
- rx-bytes
- tx-packets
- tx-bytes
+ -
+ name: napi-set
+ doc: Set configurable NAPI instance settings.
+ attribute-set: napi
+ flags: [ admin-perm ]
+ do:
+ request:
+ attributes:
+ - napi-id
+ - defer-hard-irqs
+ - gro-flush-timeout
mcast-groups:
list:
diff --git a/include/uapi/linux/netdev.h b/include/uapi/linux/netdev.h
index b088a34e9254..4c5bfbc85504 100644
--- a/include/uapi/linux/netdev.h
+++ b/include/uapi/linux/netdev.h
@@ -188,6 +188,7 @@ enum {
NETDEV_CMD_QUEUE_GET,
NETDEV_CMD_NAPI_GET,
NETDEV_CMD_QSTATS_GET,
+ NETDEV_CMD_NAPI_SET,
__NETDEV_CMD_MAX,
NETDEV_CMD_MAX = (__NETDEV_CMD_MAX - 1)
diff --git a/net/core/netdev-genl-gen.c b/net/core/netdev-genl-gen.c
index 8350a0afa9ec..ead570c6ff7d 100644
--- a/net/core/netdev-genl-gen.c
+++ b/net/core/netdev-genl-gen.c
@@ -74,6 +74,13 @@ static const struct nla_policy netdev_qstats_get_nl_policy[NETDEV_A_QSTATS_SCOPE
[NETDEV_A_QSTATS_SCOPE] = NLA_POLICY_MASK(NLA_UINT, 0x1),
};
+/* NETDEV_CMD_NAPI_SET - set */
+static const struct nla_policy netdev_napi_set_nl_policy[NETDEV_A_NAPI_GRO_FLUSH_TIMEOUT + 1] = {
+ [NETDEV_A_NAPI_ID] = { .type = NLA_U32, },
+ [NETDEV_A_NAPI_DEFER_HARD_IRQS] = { .type = NLA_S32 },
+ [NETDEV_A_NAPI_GRO_FLUSH_TIMEOUT] = { .type = NLA_UINT },
+};
+
/* Ops table for netdev */
static const struct genl_split_ops netdev_nl_ops[] = {
{
@@ -151,6 +158,13 @@ static const struct genl_split_ops netdev_nl_ops[] = {
.maxattr = NETDEV_A_QSTATS_SCOPE,
.flags = GENL_CMD_CAP_DUMP,
},
+ {
+ .cmd = NETDEV_CMD_NAPI_SET,
+ .doit = netdev_nl_napi_set_doit,
+ .policy = netdev_napi_set_nl_policy,
+ .maxattr = NETDEV_A_NAPI_GRO_FLUSH_TIMEOUT,
+ .flags = GENL_ADMIN_PERM | GENL_CMD_CAP_DO,
+ },
};
static const struct genl_multicast_group netdev_nl_mcgrps[] = {
diff --git a/net/core/netdev-genl-gen.h b/net/core/netdev-genl-gen.h
index 4db40fd5b4a9..b70cb0f20acb 100644
--- a/net/core/netdev-genl-gen.h
+++ b/net/core/netdev-genl-gen.h
@@ -28,6 +28,7 @@ int netdev_nl_queue_get_dumpit(struct sk_buff *skb,
struct netlink_callback *cb);
int netdev_nl_napi_get_doit(struct sk_buff *skb, struct genl_info *info);
int netdev_nl_napi_get_dumpit(struct sk_buff *skb, struct netlink_callback *cb);
+int netdev_nl_napi_set_doit(struct sk_buff *skb, struct genl_info *info);
int netdev_nl_qstats_get_dumpit(struct sk_buff *skb,
struct netlink_callback *cb);
diff --git a/net/core/netdev-genl.c b/net/core/netdev-genl.c
index 4698034b5a49..3c90a2fd005a 100644
--- a/net/core/netdev-genl.c
+++ b/net/core/netdev-genl.c
@@ -300,6 +300,51 @@ int netdev_nl_napi_get_dumpit(struct sk_buff *skb, struct netlink_callback *cb)
return err;
}
+static int
+netdev_nl_napi_set_config(struct napi_struct *napi, struct genl_info *info)
+{
+ u64 gro_flush_timeout = 0;
+ u32 defer = 0;
+
+ if (info->attrs[NETDEV_A_NAPI_DEFER_HARD_IRQS]) {
+ defer = nla_get_u32(info->attrs[NETDEV_A_NAPI_DEFER_HARD_IRQS]);
+ napi_set_defer_hard_irqs(napi, defer);
+ }
+
+ if (info->attrs[NETDEV_A_NAPI_GRO_FLUSH_TIMEOUT]) {
+ gro_flush_timeout = nla_get_uint(info->attrs[NETDEV_A_NAPI_GRO_FLUSH_TIMEOUT]);
+ napi_set_gro_flush_timeout(napi, gro_flush_timeout);
+ }
+
+ return 0;
+}
+
+int netdev_nl_napi_set_doit(struct sk_buff *skb, struct genl_info *info)
+{
+ struct napi_struct *napi;
+ unsigned int napi_id;
+ int err;
+
+ if (GENL_REQ_ATTR_CHECK(info, NETDEV_A_NAPI_ID))
+ return -EINVAL;
+
+ napi_id = nla_get_u32(info->attrs[NETDEV_A_NAPI_ID]);
+
+ rtnl_lock();
+
+ napi = napi_by_id(napi_id);
+ if (napi) {
+ err = netdev_nl_napi_set_config(napi, info);
+ } else {
+ NL_SET_BAD_ATTR(info->extack, info->attrs[NETDEV_A_NAPI_ID]);
+ err = -ENOENT;
+ }
+
+ rtnl_unlock();
+
+ return err;
+}
+
static int
netdev_nl_queue_fill_one(struct sk_buff *rsp, struct net_device *netdev,
u32 q_idx, u32 q_type, const struct genl_info *info)
diff --git a/tools/include/uapi/linux/netdev.h b/tools/include/uapi/linux/netdev.h
index b088a34e9254..4c5bfbc85504 100644
--- a/tools/include/uapi/linux/netdev.h
+++ b/tools/include/uapi/linux/netdev.h
@@ -188,6 +188,7 @@ enum {
NETDEV_CMD_QUEUE_GET,
NETDEV_CMD_NAPI_GET,
NETDEV_CMD_QSTATS_GET,
+ NETDEV_CMD_NAPI_SET,
__NETDEV_CMD_MAX,
NETDEV_CMD_MAX = (__NETDEV_CMD_MAX - 1)
--
2.25.1
* [RFC net-next v3 7/9] bnxt: Add support for napi storage
2024-09-12 10:07 [RFC net-next v3 0/9] Add support for per-NAPI config via netlink Joe Damato
` (5 preceding siblings ...)
2024-09-12 10:07 ` [RFC net-next v3 6/9] netdev-genl: Support setting per-NAPI config values Joe Damato
@ 2024-09-12 10:07 ` Joe Damato
2024-09-12 10:07 ` [RFC net-next v3 8/9] mlx5: " Joe Damato
2024-09-12 10:07 ` [RFC net-next v3 9/9] mlx4: Add support for napi storage to RX CQs Joe Damato
8 siblings, 0 replies; 18+ messages in thread
From: Joe Damato @ 2024-09-12 10:07 UTC (permalink / raw)
To: netdev
Cc: mkarsten, kuba, skhawaja, sdf, bjorn, amritha.nambiar,
sridhar.samudrala, Joe Damato, Michael Chan, David S. Miller,
Eric Dumazet, Paolo Abeni, open list
Use netif_napi_add_storage to assign per-NAPI storage when initializing
NAPIs.
Signed-off-by: Joe Damato <jdamato@fastly.com>
---
drivers/net/ethernet/broadcom/bnxt/bnxt.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 6e422e24750a..44fc38efff33 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -10986,7 +10986,8 @@ static void bnxt_init_napi(struct bnxt *bp)
cp_nr_rings--;
for (i = 0; i < cp_nr_rings; i++) {
bnapi = bp->bnapi[i];
- netif_napi_add(bp->dev, &bnapi->napi, poll_fn);
+ netif_napi_add_storage(bp->dev, &bnapi->napi, poll_fn,
+ bnapi->index);
}
if (BNXT_CHIP_TYPE_NITRO_A0(bp)) {
bnapi = bp->bnapi[cp_nr_rings];
--
2.25.1
* [RFC net-next v3 8/9] mlx5: Add support for napi storage
2024-09-12 10:07 [RFC net-next v3 0/9] Add support for per-NAPI config via netlink Joe Damato
` (6 preceding siblings ...)
2024-09-12 10:07 ` [RFC net-next v3 7/9] bnxt: Add support for napi storage Joe Damato
@ 2024-09-12 10:07 ` Joe Damato
2024-09-12 10:07 ` [RFC net-next v3 9/9] mlx4: Add support for napi storage to RX CQs Joe Damato
8 siblings, 0 replies; 18+ messages in thread
From: Joe Damato @ 2024-09-12 10:07 UTC (permalink / raw)
To: netdev
Cc: mkarsten, kuba, skhawaja, sdf, bjorn, amritha.nambiar,
sridhar.samudrala, Joe Damato, Saeed Mahameed, Tariq Toukan,
Leon Romanovsky, David S. Miller, Eric Dumazet, Paolo Abeni,
open list:MELLANOX MLX5 core VPI driver, open list
Use netif_napi_add_storage to assign per-NAPI storage when initializing
NAPIs.
Signed-off-by: Joe Damato <jdamato@fastly.com>
---
drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 47e7a80d221b..aa4eb5677b8a 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -2696,7 +2696,7 @@ static int mlx5e_open_channel(struct mlx5e_priv *priv, int ix,
c->aff_mask = irq_get_effective_affinity_mask(irq);
c->lag_port = mlx5e_enumerate_lag_port(mdev, ix);
- netif_napi_add(netdev, &c->napi, mlx5e_napi_poll);
+ netif_napi_add_storage(netdev, &c->napi, mlx5e_napi_poll, ix);
netif_napi_set_irq(&c->napi, irq);
err = mlx5e_open_queues(c, params, cparam);
--
2.25.1
* [RFC net-next v3 9/9] mlx4: Add support for napi storage to RX CQs
2024-09-12 10:07 [RFC net-next v3 0/9] Add support for per-NAPI config via netlink Joe Damato
` (7 preceding siblings ...)
2024-09-12 10:07 ` [RFC net-next v3 8/9] mlx5: " Joe Damato
@ 2024-09-12 10:07 ` Joe Damato
8 siblings, 0 replies; 18+ messages in thread
From: Joe Damato @ 2024-09-12 10:07 UTC (permalink / raw)
To: netdev
Cc: mkarsten, kuba, skhawaja, sdf, bjorn, amritha.nambiar,
sridhar.samudrala, Joe Damato, Tariq Toukan, David S. Miller,
Eric Dumazet, Paolo Abeni,
open list:MELLANOX MLX4 core VPI driver, open list
Use netif_napi_add_storage to assign per-NAPI storage when initializing
RX CQ NAPIs.
Presently, struct napi_config only stores two fields that are used for
RX, so there is no need to use it for TX CQs yet.
Signed-off-by: Joe Damato <jdamato@fastly.com>
---
drivers/net/ethernet/mellanox/mlx4/en_cq.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_cq.c b/drivers/net/ethernet/mellanox/mlx4/en_cq.c
index 461cc2c79c71..6943268e8256 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_cq.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_cq.c
@@ -156,7 +156,8 @@ int mlx4_en_activate_cq(struct mlx4_en_priv *priv, struct mlx4_en_cq *cq,
break;
case RX:
cq->mcq.comp = mlx4_en_rx_irq;
- netif_napi_add(cq->dev, &cq->napi, mlx4_en_poll_rx_cq);
+ netif_napi_add_storage(cq->dev, &cq->napi, mlx4_en_poll_rx_cq,
+ cq_idx);
netif_napi_set_irq(&cq->napi, irq);
napi_enable(&cq->napi);
netif_queue_set_napi(cq->dev, cq_idx, NETDEV_QUEUE_TYPE_RX, &cq->napi);
--
2.25.1
* Re: [RFC net-next v3 6/9] netdev-genl: Support setting per-NAPI config values
2024-09-12 10:07 ` [RFC net-next v3 6/9] netdev-genl: Support setting per-NAPI config values Joe Damato
@ 2024-09-12 10:38 ` Joe Damato
0 siblings, 0 replies; 18+ messages in thread
From: Joe Damato @ 2024-09-12 10:38 UTC (permalink / raw)
To: netdev
Cc: mkarsten, kuba, skhawaja, sdf, bjorn, amritha.nambiar,
sridhar.samudrala, David S. Miller, Eric Dumazet, Paolo Abeni,
Donald Hunter, Jesper Dangaard Brouer, Xuan Zhuo, Daniel Jurgens,
Larysa Zaremba, open list
On Thu, Sep 12, 2024 at 10:07:14AM +0000, Joe Damato wrote:
> Add support to set per-NAPI defer_hard_irqs and gro_flush_timeout.
>
> Signed-off-by: Joe Damato <jdamato@fastly.com>
> ---
> Documentation/netlink/specs/netdev.yaml | 11 ++++++
> include/uapi/linux/netdev.h | 1 +
> net/core/netdev-genl-gen.c | 14 ++++++++
> net/core/netdev-genl-gen.h | 1 +
> net/core/netdev-genl.c | 45 +++++++++++++++++++++++++
> tools/include/uapi/linux/netdev.h | 1 +
> 6 files changed, 73 insertions(+)
My apologies; there are a few merge conflicts with this patch against
the latest net-next/main. I pulled recently, but it looks like
Mina's work got merged (which is excellent news) after I pulled and
so my patch won't apply cleanly.
I can wait the 48 hours to resend or simply reply with an updated
patch 6 in the body of my message, as this is only an RFC.
Let me know what works best for anyone who wants to actually test
this (vs. just reviewing it).
Sorry about that; should have pulled this morning before sending :)
- Joe
* Re: [RFC net-next v3 5/9] net: napi: Add napi_config
2024-09-12 10:07 ` [RFC net-next v3 5/9] net: napi: Add napi_config Joe Damato
@ 2024-09-12 15:03 ` Joe Damato
2024-09-13 21:44 ` Willem de Bruijn
2024-09-13 17:42 ` Stanislav Fomichev
1 sibling, 1 reply; 18+ messages in thread
From: Joe Damato @ 2024-09-12 15:03 UTC (permalink / raw)
To: netdev
Cc: mkarsten, kuba, skhawaja, sdf, bjorn, amritha.nambiar,
sridhar.samudrala, David S. Miller, Eric Dumazet, Paolo Abeni,
Jonathan Corbet, Jiri Pirko, Sebastian Andrzej Siewior,
Lorenzo Bianconi, David Ahern, Johannes Berg,
open list:DOCUMENTATION, open list
Several comments below on different things in this patch that I just noticed.
On Thu, Sep 12, 2024 at 10:07:13AM +0000, Joe Damato wrote:
> Add a persistent NAPI config area for NAPI configuration to the core.
> Drivers opt-in to setting the storage for a NAPI by passing an index
> when calling netif_napi_add_storage.
>
> napi_config is allocated in alloc_netdev_mqs, freed in free_netdev
> (after the NAPIs are deleted), and set to 0 when napi_enable is called.
I forgot to re-read all the commit messages. I will do that for rfcv4
and make sure they are all correct; this message is not correct.
> Drivers which call netif_napi_add_storage will have persistent
> NAPI IDs.
>
> Signed-off-by: Joe Damato <jdamato@fastly.com>
> ---
> .../networking/net_cachelines/net_device.rst | 1 +
> include/linux/netdevice.h | 34 +++++++++
> net/core/dev.c | 74 +++++++++++++++++--
> net/core/dev.h | 12 +++
> 4 files changed, 113 insertions(+), 8 deletions(-)
>
[...]
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 3e07ab8e0295..08afc96179f9 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
[...]
> @@ -2685,6 +2717,8 @@ void __netif_napi_del(struct napi_struct *napi);
> */
> static inline void netif_napi_del(struct napi_struct *napi)
> {
> + napi->config = NULL;
> + napi->index = -1;
> __netif_napi_del(napi);
> synchronize_net();
> }
I don't quite see how, but occasionally when changing the queue
count with ethtool -L, I seem to trigger a page_pool issue.
I assumed it was related to either my changes above in netif_napi_del, or
below in __netif_napi_del, but I'm not so sure because the issue does not
happen reliably and I can't seem to figure out how my change would cause this.
When it does happen, this is the stack trace:
page_pool_empty_ring() page_pool refcnt -30528 violation
------------[ cut here ]------------
Negative(-1) inflight packet-pages
WARNING: CPU: 1 PID: 5117 at net/core/page_pool.c:617 page_pool_inflight+0x4c/0x90
[...]
CPU: 1 UID: 0 PID: 5117 Comm: ethtool Tainted: G W [...]
RIP: 0010:page_pool_inflight+0x4c/0x90
Code: e4 b8 00 00 00 00 44 0f 48 e0 44 89 e0 41 5c e9 8a c1 1b 00 66 90 45 85 e4 79 ef 44 89 e6 48 c7 c7 78 56 26 82 e8 14 63 78 ff <0f> 0b eb dc 65 8b 05 b5 af 71 7e 48 0f a3 05 21 d9 3b 01 73 d7 48
RSP: 0018:ffffc9001d01b640 EFLAGS: 00010282
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000027
RDX: 0000000000000027 RSI: 00000000ffefffff RDI: ffff88bf4f45c988
RBP: ffff8900da55a800 R08: 0000000000000000 R09: ffffc9001d01b480
R10: 0000000000000001 R11: 0000000000000001 R12: 00000000ffffffff
R13: ffffffff82cd35b0 R14: ffff890062550f00 R15: ffff8881b0e85400
FS: 00007fa9cb382740(0000) GS:ffff88bf4f440000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000558baa9d3b38 CR3: 000000011222a000 CR4: 0000000000350ef0
Call Trace:
<TASK>
? __warn+0x80/0x110
? page_pool_inflight+0x4c/0x90
? report_bug+0x19c/0x1b0
? handle_bug+0x3c/0x70
? exc_invalid_op+0x18/0x70
? asm_exc_invalid_op+0x1a/0x20
? page_pool_inflight+0x4c/0x90
page_pool_release+0x10e/0x1d0
page_pool_destroy+0x66/0x160
mlx5e_free_rq+0x69/0xb0 [mlx5_core]
mlx5e_close_queues+0x39/0x150 [mlx5_core]
mlx5e_close_channel+0x1c/0x60 [mlx5_core]
mlx5e_close_channels+0x49/0xa0 [mlx5_core]
mlx5e_switch_priv_channels+0xa9/0x120 [mlx5_core]
? __pfx_mlx5e_num_channels_changed_ctx+0x10/0x10 [mlx5_core]
mlx5e_safe_switch_params+0xb8/0xf0 [mlx5_core]
mlx5e_ethtool_set_channels+0x17a/0x290 [mlx5_core]
ethnl_set_channels+0x243/0x310
ethnl_default_set_doit+0xc1/0x170
genl_family_rcv_msg_doit+0xd9/0x130
genl_rcv_msg+0x18f/0x2c0
? __pfx_ethnl_default_set_doit+0x10/0x10
? __pfx_genl_rcv_msg+0x10/0x10
netlink_rcv_skb+0x5a/0x110
genl_rcv+0x28/0x40
netlink_unicast+0x1aa/0x260
netlink_sendmsg+0x1e9/0x420
__sys_sendto+0x1d5/0x1f0
? handle_mm_fault+0x1cb/0x290
? do_user_addr_fault+0x558/0x7c0
__x64_sys_sendto+0x29/0x30
do_syscall_64+0x5d/0x170
entry_SYSCALL_64_after_hwframe+0x76/0x7e
> diff --git a/net/core/dev.c b/net/core/dev.c
> index f2fd503516de..ca2227d0b8ed 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
[...]
> @@ -6736,7 +6776,13 @@ void __netif_napi_del(struct napi_struct *napi)
> if (!test_and_clear_bit(NAPI_STATE_LISTED, &napi->state))
> return;
>
> - napi_hash_del(napi);
> + if (!napi->config) {
> + napi_hash_del(napi);
> + } else {
> + napi->index = -1;
> + napi->config = NULL;
> + }
> +
See above; perhaps something related to this change is triggering the page pool
warning occasionally.
> list_del_rcu(&napi->dev_list);
> napi_free_frags(napi);
>
> @@ -11049,6 +11095,8 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
> unsigned int txqs, unsigned int rxqs)
> {
> struct net_device *dev;
> + size_t napi_config_sz;
> + unsigned int maxqs;
>
> BUG_ON(strlen(name) >= sizeof(dev->name));
>
> @@ -11062,6 +11110,9 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
> return NULL;
> }
>
> + WARN_ON_ONCE(txqs != rxqs);
This warning triggers for me on boot every time with mlx5 NICs.
The code in mlx5 seems to get the rxq and txq maximums in:
drivers/net/ethernet/mellanox/mlx5/core/en_main.c
mlx5e_create_netdev
which does:
txqs = mlx5e_get_max_num_txqs(mdev, profile);
rxqs = mlx5e_get_max_num_rxqs(mdev, profile);
netdev = alloc_etherdev_mqs(sizeof(struct mlx5e_priv), txqs, rxqs);
In my case for my device, txqs: 760, rxqs: 63.
I would guess that this warning will trigger every time for mlx5 NICs
and would be quite annoying.
We may just want to replace the allocation logic to allocate
txqs+rxqs, remove the WARN_ON_ONCE, and be OK with some wasted
space?
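For illustration, a minimal (untested) sketch of that alternative, with
the WARN_ON_ONCE dropped and the array sized for txqs + rxqs instead of
the maxqs-based allocation further down:

	/* hypothetical: size napi_config for the worst case of fully
	 * separate TX and RX NAPIs, rather than warning when the queue
	 * counts are asymmetric
	 */
	napi_config_sz = array_size(txqs + rxqs, sizeof(*dev->napi_config));
	dev->napi_config = kvzalloc(napi_config_sz, GFP_KERNEL_ACCOUNT);
	if (!dev->napi_config)
		goto free_all;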
> + maxqs = max(txqs, rxqs);
> +
> dev = kvzalloc(struct_size(dev, priv, sizeof_priv),
> GFP_KERNEL_ACCOUNT | __GFP_RETRY_MAYFAIL);
> if (!dev)
> @@ -11136,6 +11187,11 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
> if (!dev->ethtool)
> goto free_all;
>
> + napi_config_sz = array_size(maxqs, sizeof(*dev->napi_config));
> + dev->napi_config = kvzalloc(napi_config_sz, GFP_KERNEL_ACCOUNT);
> + if (!dev->napi_config)
> + goto free_all;
> +
[...]
> diff --git a/net/core/dev.h b/net/core/dev.h
> index a9d5f678564a..9eb3f559275c 100644
> --- a/net/core/dev.h
> +++ b/net/core/dev.h
> @@ -167,11 +167,17 @@ static inline void napi_set_defer_hard_irqs(struct napi_struct *n, u32 defer)
> static inline void netdev_set_defer_hard_irqs(struct net_device *netdev,
> u32 defer)
> {
> + unsigned int count = max(netdev->num_rx_queues,
> + netdev->num_tx_queues);
> struct napi_struct *napi;
> + int i;
>
> WRITE_ONCE(netdev->napi_defer_hard_irqs, defer);
> list_for_each_entry(napi, &netdev->napi_list, dev_list)
> napi_set_defer_hard_irqs(napi, defer);
> +
> + for (i = 0; i < count; i++)
> + netdev->napi_config[i].defer_hard_irqs = defer;
The above is incorrect. Some devices may have netdev->napi_config =
NULL if they don't call the add_storage wrapper.
Unless major feedback or requested changes affect this code, I plan
to fix this in the rfcv4 branch by adding a NULL check:
if (netdev->napi_config)
for (....)
netdev->napi_config[i]....
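Concretely, something like this untested sketch for
netdev_set_defer_hard_irqs (the gro_flush_timeout helper below would
get the same guard):

	/* only touch the per-NAPI config area if the driver opted in */
	if (netdev->napi_config) {
		for (i = 0; i < count; i++)
			netdev->napi_config[i].defer_hard_irqs = defer;
	}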
>
> /**
> @@ -206,11 +212,17 @@ static inline void napi_set_gro_flush_timeout(struct napi_struct *n,
> static inline void netdev_set_gro_flush_timeout(struct net_device *netdev,
> unsigned long timeout)
> {
> + unsigned int count = max(netdev->num_rx_queues,
> + netdev->num_tx_queues);
> struct napi_struct *napi;
> + int i;
>
> WRITE_ONCE(netdev->gro_flush_timeout, timeout);
> list_for_each_entry(napi, &netdev->napi_list, dev_list)
> napi_set_gro_flush_timeout(napi, timeout);
> +
> + for (i = 0; i < count; i++)
> + netdev->napi_config[i].gro_flush_timeout = timeout;
Same as above.
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [RFC net-next v3 5/9] net: napi: Add napi_config
2024-09-12 10:07 ` [RFC net-next v3 5/9] net: napi: Add napi_config Joe Damato
2024-09-12 15:03 ` Joe Damato
@ 2024-09-13 17:42 ` Stanislav Fomichev
2024-09-13 18:55 ` Joe Damato
2024-09-13 19:39 ` Joe Damato
1 sibling, 2 replies; 18+ messages in thread
From: Stanislav Fomichev @ 2024-09-13 17:42 UTC (permalink / raw)
To: Joe Damato
Cc: netdev, mkarsten, kuba, skhawaja, sdf, bjorn, amritha.nambiar,
sridhar.samudrala, David S. Miller, Eric Dumazet, Paolo Abeni,
Jonathan Corbet, Jiri Pirko, Sebastian Andrzej Siewior,
Lorenzo Bianconi, David Ahern, Johannes Berg,
open list:DOCUMENTATION, open list
On 09/12, Joe Damato wrote:
> Add a persistent NAPI config area for NAPI configuration to the core.
> Drivers opt-in to setting the storage for a NAPI by passing an index
> when calling netif_napi_add_storage.
>
> napi_config is allocated in alloc_netdev_mqs, freed in free_netdev
> (after the NAPIs are deleted), and set to 0 when napi_enable is called.
>
> Drivers which call netif_napi_add_storage will have persistent
> NAPI IDs.
>
> Signed-off-by: Joe Damato <jdamato@fastly.com>
> ---
> .../networking/net_cachelines/net_device.rst | 1 +
> include/linux/netdevice.h | 34 +++++++++
> net/core/dev.c | 74 +++++++++++++++++--
> net/core/dev.h | 12 +++
> 4 files changed, 113 insertions(+), 8 deletions(-)
>
> diff --git a/Documentation/networking/net_cachelines/net_device.rst b/Documentation/networking/net_cachelines/net_device.rst
> index 3d02ae79c850..11d659051f5e 100644
> --- a/Documentation/networking/net_cachelines/net_device.rst
> +++ b/Documentation/networking/net_cachelines/net_device.rst
> @@ -183,3 +183,4 @@ struct hlist_head page_pools
> struct dim_irq_moder* irq_moder
> unsigned_long gro_flush_timeout
> u32 napi_defer_hard_irqs
> +struct napi_config* napi_config
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 3e07ab8e0295..08afc96179f9 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -342,6 +342,15 @@ struct gro_list {
> */
> #define GRO_HASH_BUCKETS 8
>
> +/*
> + * Structure for per-NAPI storage
> + */
> +struct napi_config {
> + u64 gro_flush_timeout;
> + u32 defer_hard_irqs;
> + unsigned int napi_id;
> +};
> +
> /*
> * Structure for NAPI scheduling similar to tasklet but with weighting
> */
> @@ -379,6 +388,8 @@ struct napi_struct {
> int irq;
> unsigned long gro_flush_timeout;
> u32 defer_hard_irqs;
> + int index;
> + struct napi_config *config;
> };
>
> enum {
> @@ -2011,6 +2022,9 @@ enum netdev_reg_state {
> * @dpll_pin: Pointer to the SyncE source pin of a DPLL subsystem,
> * where the clock is recovered.
> *
> + * @napi_config: An array of napi_config structures containing per-NAPI
> + * settings.
> + *
> * FIXME: cleanup struct net_device such that network protocol info
> * moves out.
> */
> @@ -2400,6 +2414,7 @@ struct net_device {
> struct dim_irq_moder *irq_moder;
> unsigned long gro_flush_timeout;
> u32 napi_defer_hard_irqs;
> + struct napi_config *napi_config;
>
> u8 priv[] ____cacheline_aligned
> __counted_by(priv_len);
> @@ -2650,6 +2665,23 @@ netif_napi_add_tx_weight(struct net_device *dev,
> netif_napi_add_weight(dev, napi, poll, weight);
> }
>
> +/**
> + * netif_napi_add_storage - initialize a NAPI context and set storage area
> + * @dev: network device
> + * @napi: NAPI context
> + * @poll: polling function
> + * @weight: the poll weight of this NAPI
> + * @index: the NAPI index
> + */
> +static inline void
> +netif_napi_add_storage(struct net_device *dev, struct napi_struct *napi,
> + int (*poll)(struct napi_struct *, int), int index)
> +{
> + napi->index = index;
> + napi->config = &dev->napi_config[index];
> + netif_napi_add_weight(dev, napi, poll, NAPI_POLL_WEIGHT);
> +}
> +
> /**
> * netif_napi_add_tx() - initialize a NAPI context to be used for Tx only
> * @dev: network device
> @@ -2685,6 +2717,8 @@ void __netif_napi_del(struct napi_struct *napi);
> */
> static inline void netif_napi_del(struct napi_struct *napi)
> {
> + napi->config = NULL;
> + napi->index = -1;
> __netif_napi_del(napi);
> synchronize_net();
> }
> diff --git a/net/core/dev.c b/net/core/dev.c
> index f2fd503516de..ca2227d0b8ed 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -6493,6 +6493,18 @@ EXPORT_SYMBOL(napi_busy_loop);
>
> #endif /* CONFIG_NET_RX_BUSY_POLL */
>
> +static void napi_hash_add_with_id(struct napi_struct *napi, unsigned int napi_id)
> +{
> + spin_lock(&napi_hash_lock);
> +
> + napi->napi_id = napi_id;
> +
> + hlist_add_head_rcu(&napi->napi_hash_node,
> + &napi_hash[napi->napi_id % HASH_SIZE(napi_hash)]);
> +
> + spin_unlock(&napi_hash_lock);
> +}
> +
> static void napi_hash_add(struct napi_struct *napi)
> {
> if (test_bit(NAPI_STATE_NO_BUSY_POLL, &napi->state))
> @@ -6505,12 +6517,13 @@ static void napi_hash_add(struct napi_struct *napi)
> if (unlikely(++napi_gen_id < MIN_NAPI_ID))
> napi_gen_id = MIN_NAPI_ID;
> } while (napi_by_id(napi_gen_id));
[..]
> - napi->napi_id = napi_gen_id;
> -
> - hlist_add_head_rcu(&napi->napi_hash_node,
> - &napi_hash[napi->napi_id % HASH_SIZE(napi_hash)]);
>
> spin_unlock(&napi_hash_lock);
> +
> + napi_hash_add_with_id(napi, napi_gen_id);
nit: it is very unlikely that napi_gen_id is gonna wrap around after the
spin_unlock above, but maybe it's safer to have the following?
static void __napi_hash_add_with_id(struct napi_struct *napi, unsigned int napi_id)
{
napi->napi_id = napi_id;
hlist_add_head_rcu(&napi->napi_hash_node,
&napi_hash[napi->napi_id % HASH_SIZE(napi_hash)]);
}
static void napi_hash_add_with_id(struct napi_struct *napi, unsigned int napi_id)
{
spin_lock(&napi_hash_lock);
__napi_hash_add_with_id(...);
spin_unlock(&napi_hash_lock);
}
And use __napi_hash_add_with_id here before spin_unlock?
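For completeness, the tail of napi_hash_add would then look roughly
like this (an untested sketch based on the quoted hunk):

	spin_lock(&napi_hash_lock);

	do {
		if (unlikely(++napi_gen_id < MIN_NAPI_ID))
			napi_gen_id = MIN_NAPI_ID;
	} while (napi_by_id(napi_gen_id));

	/* assign the id and insert into the hash under the same lock hold */
	__napi_hash_add_with_id(napi, napi_gen_id);

	spin_unlock(&napi_hash_lock);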
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [RFC net-next v3 5/9] net: napi: Add napi_config
2024-09-13 17:42 ` Stanislav Fomichev
@ 2024-09-13 18:55 ` Joe Damato
2024-09-13 19:39 ` Joe Damato
1 sibling, 0 replies; 18+ messages in thread
From: Joe Damato @ 2024-09-13 18:55 UTC (permalink / raw)
To: Stanislav Fomichev
Cc: netdev, mkarsten, kuba, skhawaja, sdf, bjorn, amritha.nambiar,
sridhar.samudrala, David S. Miller, Eric Dumazet, Paolo Abeni,
Jonathan Corbet, Jiri Pirko, Sebastian Andrzej Siewior,
Lorenzo Bianconi, David Ahern, Johannes Berg,
open list:DOCUMENTATION, open list
On Fri, Sep 13, 2024 at 10:42:37AM -0700, Stanislav Fomichev wrote:
> On 09/12, Joe Damato wrote:
[...]
> > @@ -6505,12 +6517,13 @@ static void napi_hash_add(struct napi_struct *napi)
> > if (unlikely(++napi_gen_id < MIN_NAPI_ID))
> > napi_gen_id = MIN_NAPI_ID;
> > } while (napi_by_id(napi_gen_id));
>
> [..]
>
> > - napi->napi_id = napi_gen_id;
> > -
> > - hlist_add_head_rcu(&napi->napi_hash_node,
> > - &napi_hash[napi->napi_id % HASH_SIZE(napi_hash)]);
> >
> > spin_unlock(&napi_hash_lock);
> > +
> > + napi_hash_add_with_id(napi, napi_gen_id);
>
> nit: it is very unlikely that napi_gen_id is gonna wrap around after the
> spin_unlock above, but maybe it's safer to have the following?
>
> static void __napi_hash_add_with_id(struct napi_struct *napi, unsigned int napi_id)
> {
> napi->napi_id = napi_id;
> hlist_add_head_rcu(&napi->napi_hash_node,
> &napi_hash[napi->napi_id % HASH_SIZE(napi_hash)]);
> }
>
> static void napi_hash_add_with_id(struct napi_struct *napi, unsigned int napi_id)
> {
> spin_lock(&napi_hash_lock);
> __napi_hash_add_with_id(...);
> spin_unlock(&napi_hash_lock);
> }
>
> And use __napi_hash_add_with_id here before spin_unlock?
Thanks for taking a look.
Sure, that seems reasonable. I can add that for the rfcv4.
I'll probably hold off on posting the rfcv4 until after LPC, or until
I've had some time to debug the mlx5/page_pool thing.
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [RFC net-next v3 5/9] net: napi: Add napi_config
2024-09-13 17:42 ` Stanislav Fomichev
2024-09-13 18:55 ` Joe Damato
@ 2024-09-13 19:39 ` Joe Damato
2024-09-14 13:00 ` Joe Damato
1 sibling, 1 reply; 18+ messages in thread
From: Joe Damato @ 2024-09-13 19:39 UTC (permalink / raw)
To: Stanislav Fomichev
Cc: netdev, mkarsten, kuba, skhawaja, sdf, bjorn, amritha.nambiar,
sridhar.samudrala, David S. Miller, Eric Dumazet, Paolo Abeni,
Jonathan Corbet, Jiri Pirko, Sebastian Andrzej Siewior,
Lorenzo Bianconi, David Ahern, Johannes Berg,
open list:DOCUMENTATION, open list
On Fri, Sep 13, 2024 at 10:42:37AM -0700, Stanislav Fomichev wrote:
> On 09/12, Joe Damato wrote:
[...]
> > --- a/net/core/dev.c
> > +++ b/net/core/dev.c
> > @@ -6493,6 +6493,18 @@ EXPORT_SYMBOL(napi_busy_loop);
> >
> > #endif /* CONFIG_NET_RX_BUSY_POLL */
> >
> > +static void napi_hash_add_with_id(struct napi_struct *napi, unsigned int napi_id)
> > +{
> > + spin_lock(&napi_hash_lock);
> > +
> > + napi->napi_id = napi_id;
> > +
> > + hlist_add_head_rcu(&napi->napi_hash_node,
> > + &napi_hash[napi->napi_id % HASH_SIZE(napi_hash)]);
> > +
> > + spin_unlock(&napi_hash_lock);
> > +}
> > +
> > static void napi_hash_add(struct napi_struct *napi)
> > {
> > if (test_bit(NAPI_STATE_NO_BUSY_POLL, &napi->state))
> > @@ -6505,12 +6517,13 @@ static void napi_hash_add(struct napi_struct *napi)
> > if (unlikely(++napi_gen_id < MIN_NAPI_ID))
> > napi_gen_id = MIN_NAPI_ID;
> > } while (napi_by_id(napi_gen_id));
>
> [..]
>
> > - napi->napi_id = napi_gen_id;
> > -
> > - hlist_add_head_rcu(&napi->napi_hash_node,
> > - &napi_hash[napi->napi_id % HASH_SIZE(napi_hash)]);
> >
> > spin_unlock(&napi_hash_lock);
> > +
> > + napi_hash_add_with_id(napi, napi_gen_id);
>
> nit: it is very unlikely that napi_gen_id is gonna wrap around after the
> spin_unlock above, but maybe it's safer to have the following?
>
> static void __napi_hash_add_with_id(struct napi_struct *napi, unsigned int napi_id)
> {
> napi->napi_id = napi_id;
> hlist_add_head_rcu(&napi->napi_hash_node,
> &napi_hash[napi->napi_id % HASH_SIZE(napi_hash)]);
> }
>
> static void napi_hash_add_with_id(struct napi_struct *napi, unsigned int napi_id)
> {
> spin_lock(&napi_hash_lock);
> __napi_hash_add_with_id(...);
> spin_unlock(&napi_hash_lock);
> }
>
> And use __napi_hash_add_with_id here before spin_unlock?
After making this change and re-testing on a couple reboots, I haven't
been able to reproduce the page pool issue I mentioned in the other
email [1].
Not sure if I've just been... "getting lucky" or if this fixed
something that I won't fully grasp until I read the mlx5 source
again.
Will test it a few more times, though.
[1]: https://lore.kernel.org/netdev/ZuMC2fYPPtWggB2w@LQ3V64L9R2.homenet.telecomitalia.it/
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [RFC net-next v3 5/9] net: napi: Add napi_config
2024-09-12 15:03 ` Joe Damato
@ 2024-09-13 21:44 ` Willem de Bruijn
2024-09-13 22:10 ` Joe Damato
0 siblings, 1 reply; 18+ messages in thread
From: Willem de Bruijn @ 2024-09-13 21:44 UTC (permalink / raw)
To: Joe Damato, netdev
Cc: mkarsten, kuba, skhawaja, sdf, bjorn, amritha.nambiar,
sridhar.samudrala, David S. Miller, Eric Dumazet, Paolo Abeni,
Jonathan Corbet, Jiri Pirko, Sebastian Andrzej Siewior,
Lorenzo Bianconi, David Ahern, Johannes Berg,
open list:DOCUMENTATION, open list
Joe Damato wrote:
> Several comments on different things below for this patch that I just noticed.
>
> On Thu, Sep 12, 2024 at 10:07:13AM +0000, Joe Damato wrote:
> > Add a persistent NAPI config area for NAPI configuration to the core.
> > Drivers opt-in to setting the storage for a NAPI by passing an index
> > when calling netif_napi_add_storage.
> >
> > napi_config is allocated in alloc_netdev_mqs, freed in free_netdev
> > (after the NAPIs are deleted), and set to 0 when napi_enable is called.
>
> Forgot to re-read all the commit messages. I will do that for rfcv4
> and make sure they are all correct; this message is not correct.
>
> > Drivers which call netif_napi_add_storage will have persistent
> > NAPI IDs.
> >
> > Signed-off-by: Joe Damato <jdamato@fastly.com>
> > @@ -11062,6 +11110,9 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
> > return NULL;
> > }
> >
> > + WARN_ON_ONCE(txqs != rxqs);
>
> This warning triggers for me on boot every time with mlx5 NICs.
>
> The code in mlx5 seems to get the rxq and txq maximums in:
> drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> mlx5e_create_netdev
>
> which does:
>
> txqs = mlx5e_get_max_num_txqs(mdev, profile);
> rxqs = mlx5e_get_max_num_rxqs(mdev, profile);
>
> netdev = alloc_etherdev_mqs(sizeof(struct mlx5e_priv), txqs, rxqs);
>
> In my case for my device, txqs: 760, rxqs: 63.
>
> I would guess that this warning will trigger every time for mlx5 NICs
> and would be quite annoying.
>
> We may just want to replace the allocation logic to allocate
> txqs+rxqs, remove the WARN_ON_ONCE, and be OK with some wasted
> space?
I was about to say that txqs == rxqs is not necessary.
The number of napi config structs you want depends on whether the
driver configures separate IRQs for Tx and Rx or not.
Allocating the max of the two is perhaps sufficient for now.
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [RFC net-next v3 5/9] net: napi: Add napi_config
2024-09-13 21:44 ` Willem de Bruijn
@ 2024-09-13 22:10 ` Joe Damato
0 siblings, 0 replies; 18+ messages in thread
From: Joe Damato @ 2024-09-13 22:10 UTC (permalink / raw)
To: Willem de Bruijn
Cc: netdev, mkarsten, kuba, skhawaja, sdf, bjorn, amritha.nambiar,
sridhar.samudrala, David S. Miller, Eric Dumazet, Paolo Abeni,
Jonathan Corbet, Jiri Pirko, Sebastian Andrzej Siewior,
Lorenzo Bianconi, David Ahern, Johannes Berg,
open list:DOCUMENTATION, open list
On Fri, Sep 13, 2024 at 05:44:07PM -0400, Willem de Bruijn wrote:
> Joe Damato wrote:
> > Several comments on different things below for this patch that I just noticed.
> >
> > On Thu, Sep 12, 2024 at 10:07:13AM +0000, Joe Damato wrote:
> > > Add a persistent NAPI config area for NAPI configuration to the core.
> > > Drivers opt-in to setting the storage for a NAPI by passing an index
> > > when calling netif_napi_add_storage.
> > >
> > > napi_config is allocated in alloc_netdev_mqs, freed in free_netdev
> > > (after the NAPIs are deleted), and set to 0 when napi_enable is called.
> >
> > Forgot to re-read all the commit messages. I will do that for rfcv4
> > and make sure they are all correct; this message is not correct.
> >
> > > Drivers which call netif_napi_add_storage will have persistent
> > > NAPI IDs.
> > >
> > > Signed-off-by: Joe Damato <jdamato@fastly.com>
>
> > > @@ -11062,6 +11110,9 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
> > > return NULL;
> > > }
> > >
> > > + WARN_ON_ONCE(txqs != rxqs);
> >
> > This warning triggers for me on boot every time with mlx5 NICs.
> >
> > The code in mlx5 seems to get the rxq and txq maximums in:
> > drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> > mlx5e_create_netdev
> >
> > which does:
> >
> > txqs = mlx5e_get_max_num_txqs(mdev, profile);
> > rxqs = mlx5e_get_max_num_rxqs(mdev, profile);
> >
> > netdev = alloc_etherdev_mqs(sizeof(struct mlx5e_priv), txqs, rxqs);
> >
> > In my case for my device, txqs: 760, rxqs: 63.
> >
> > I would guess that this warning will trigger every time for mlx5 NICs
> > and would be quite annoying.
> >
> > We may just want to replace the allocation logic to allocate
> > txqs+rxqs, remove the WARN_ON_ONCE, and be OK with some wasted
> > space?
>
> I was about to say that txqs == rxqs is not necessary.
Correct.
> The number of napi config structs you want depends on whether the
> driver configures separate IRQs for Tx and Rx or not.
Correct. This is why I included the mlx4 patch.
> Allocating the max of the two is perhaps sufficient for now.
I don't think I agree. Using the max of the two means you'll be
missing some config space whenever the maximum number of both TX and
RX queues is allocated by the user/device.
The WARN_ON_ONCE was added as suggested in a previous conversation
[1], but due to the imbalance in mlx5 (and probably other devices)
the warning is more of a nuisance than a help and will likely trigger
on every boot for at least mlx5.
Regardless of how many we decide to allocate: the point I was making
above was that the WARN_ON_ONCE should likely be removed.
[1]: https://lore.kernel.org/lkml/20240902174944.293dfe4b@kernel.org/
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [RFC net-next v3 5/9] net: napi: Add napi_config
2024-09-13 19:39 ` Joe Damato
@ 2024-09-14 13:00 ` Joe Damato
0 siblings, 0 replies; 18+ messages in thread
From: Joe Damato @ 2024-09-14 13:00 UTC (permalink / raw)
To: Stanislav Fomichev, netdev, mkarsten, kuba, skhawaja, sdf, bjorn,
amritha.nambiar, sridhar.samudrala, David S. Miller, Eric Dumazet,
Paolo Abeni, Jonathan Corbet, Jiri Pirko,
Sebastian Andrzej Siewior, Lorenzo Bianconi, David Ahern,
Johannes Berg, open list:DOCUMENTATION, open list
On Fri, Sep 13, 2024 at 09:39:49PM +0200, Joe Damato wrote:
> On Fri, Sep 13, 2024 at 10:42:37AM -0700, Stanislav Fomichev wrote:
> > On 09/12, Joe Damato wrote:
>
> [...]
>
> > > --- a/net/core/dev.c
> > > +++ b/net/core/dev.c
> > > @@ -6493,6 +6493,18 @@ EXPORT_SYMBOL(napi_busy_loop);
> > >
> > > #endif /* CONFIG_NET_RX_BUSY_POLL */
> > >
> > > +static void napi_hash_add_with_id(struct napi_struct *napi, unsigned int napi_id)
> > > +{
> > > + spin_lock(&napi_hash_lock);
> > > +
> > > + napi->napi_id = napi_id;
> > > +
> > > + hlist_add_head_rcu(&napi->napi_hash_node,
> > > + &napi_hash[napi->napi_id % HASH_SIZE(napi_hash)]);
> > > +
> > > + spin_unlock(&napi_hash_lock);
> > > +}
> > > +
> > > static void napi_hash_add(struct napi_struct *napi)
> > > {
> > > if (test_bit(NAPI_STATE_NO_BUSY_POLL, &napi->state))
> > > @@ -6505,12 +6517,13 @@ static void napi_hash_add(struct napi_struct *napi)
> > > if (unlikely(++napi_gen_id < MIN_NAPI_ID))
> > > napi_gen_id = MIN_NAPI_ID;
> > > } while (napi_by_id(napi_gen_id));
> >
> > [..]
> >
> > > - napi->napi_id = napi_gen_id;
> > > -
> > > - hlist_add_head_rcu(&napi->napi_hash_node,
> > > - &napi_hash[napi->napi_id % HASH_SIZE(napi_hash)]);
> > >
> > > spin_unlock(&napi_hash_lock);
> > > +
> > > + napi_hash_add_with_id(napi, napi_gen_id);
> >
> > nit: it is very unlikely that napi_gen_id is gonna wrap around after the
> > spin_unlock above, but maybe it's safer to have the following?
> >
> > static void __napi_hash_add_with_id(struct napi_struct *napi, unsigned int napi_id)
> > {
> > napi->napi_id = napi_id;
> > hlist_add_head_rcu(&napi->napi_hash_node,
> > &napi_hash[napi->napi_id % HASH_SIZE(napi_hash)]);
> > }
> >
> > static void napi_hash_add_with_id(struct napi_struct *napi, unsigned int napi_id)
> > {
> > spin_lock(&napi_hash_lock);
> > __napi_hash_add_with_id(...);
> > spin_unlock(&napi_hash_lock);
> > }
> >
> > And use __napi_hash_add_with_id here before spin_unlock?
>
> After making this change and re-testing on a couple reboots, I haven't
> been able to reproduce the page pool issue I mentioned in the other
> email [1].
>
> Not sure if I've just been... "getting lucky" or if this fixed
> something that I won't fully grasp until I read the mlx5 source
> again.
>
> Will test it a few more times, though.
>
> [1]: https://lore.kernel.org/netdev/ZuMC2fYPPtWggB2w@LQ3V64L9R2.homenet.telecomitalia.it/
I was able to reproduce the issue by running a simple script
(suggested by Martin):
for ((i=0;i<100;i++)); do for q in 1 2 4 8 16 32; do sudo ethtool -L eth4 combined $q ;done;done
It does not seem to reproduce on net-next/main, so it is almost certainly a bug
introduced in patch 5 and the suggested changes above didn't solve the problem.
That said, the changes I have queued for the RFCv4:
- Updated commit messages of most patches
- Renamed netif_napi_add_storage to netif_napi_add_config in patch 5 and
updated the driver patches.
- Added a NULL check in netdev_set_defer_hard_irqs and
netdev_set_gro_flush_timeout for netdev->napi_config in patch 5
- Added Stanislav's suggestion for safer handling of napi_gen_id in
patch 5
- Fixed merge conflicts in patch 6 so it applies cleanly
The outstanding items at this point are:
- Getting rid of the WARN_ON_ONCE (assuming we all agree on this)
- Deciding if we are allocating max(rxqs, txqs) or something else
- Debugging the page pool issue
Jakub suggested the first two items on the list above, so I'm hoping to hear
his thoughts on them :) I have no strong preference on those two.
Once those two are solved, I can send an RFCv4 to see where we are at and I'll
call out the outstanding page pool issue in the cover letter.
I likely won't have time to debug the page pool thing until after LPC, though
:(
^ permalink raw reply [flat|nested] 18+ messages in thread
end of thread, newest: 2024-09-14 13:00 UTC
Thread overview: 18+ messages
2024-09-12 10:07 [RFC net-next v3 0/9] Add support for per-NAPI config via netlink Joe Damato
2024-09-12 10:07 ` [RFC net-next v3 1/9] net: napi: Make napi_defer_hard_irqs per-NAPI Joe Damato
2024-09-12 10:07 ` [RFC net-next v3 2/9] netdev-genl: Dump napi_defer_hard_irqs Joe Damato
2024-09-12 10:07 ` [RFC net-next v3 3/9] net: napi: Make gro_flush_timeout per-NAPI Joe Damato
2024-09-12 10:07 ` [RFC net-next v3 4/9] netdev-genl: Dump gro_flush_timeout Joe Damato
2024-09-12 10:07 ` [RFC net-next v3 5/9] net: napi: Add napi_config Joe Damato
2024-09-12 15:03 ` Joe Damato
2024-09-13 21:44 ` Willem de Bruijn
2024-09-13 22:10 ` Joe Damato
2024-09-13 17:42 ` Stanislav Fomichev
2024-09-13 18:55 ` Joe Damato
2024-09-13 19:39 ` Joe Damato
2024-09-14 13:00 ` Joe Damato
2024-09-12 10:07 ` [RFC net-next v3 6/9] netdev-genl: Support setting per-NAPI config values Joe Damato
2024-09-12 10:38 ` Joe Damato
2024-09-12 10:07 ` [RFC net-next v3 7/9] bnxt: Add support for napi storage Joe Damato
2024-09-12 10:07 ` [RFC net-next v3 8/9] mlx5: " Joe Damato
2024-09-12 10:07 ` [RFC net-next v3 9/9] mlx4: Add support for napi storage to RX CQs Joe Damato