* [RFC net-next v2 0/9] Add support for per-NAPI config via netlink
@ 2024-09-08 16:06 Joe Damato
2024-09-08 16:06 ` [RFC net-next v2 1/9] net: napi: Add napi_storage Joe Damato
` (8 more replies)
0 siblings, 9 replies; 22+ messages in thread
From: Joe Damato @ 2024-09-08 16:06 UTC (permalink / raw)
To: netdev
Cc: mkarsten, kuba, skhawaja, sdf, bjorn, amritha.nambiar,
sridhar.samudrala, Joe Damato, Alexander Lobakin, Breno Leitao,
Daniel Jurgens, David Ahern, David S. Miller, Donald Hunter,
Eric Dumazet, Jesper Dangaard Brouer, Jiri Pirko, Johannes Berg,
Jonathan Corbet, Kory Maincent, Larysa Zaremba, Leon Romanovsky,
open list:DOCUMENTATION, open list,
open list:MELLANOX MLX4 core VPI driver, Lorenzo Bianconi,
Michael Chan, Paolo Abeni, Saeed Mahameed,
Sebastian Andrzej Siewior, Tariq Toukan, Xuan Zhuo
Greetings:
Welcome to v2, converted to an RFC. The implementation is definitely
incorrect, but hopefully it can serve as a basis for discussion to get
to something better.
This implementation allocates "struct napi_storage" and each NAPI
instance is assigned an index into the storage array.
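For reference, the storage area itself (added in patch 1) is minimal;
each NAPI holds a pointer into a per-device array of:

    struct napi_storage {
            u64 gro_flush_timeout;
            u32 defer_hard_irqs;
    };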
It seemed like this storage area should persist even if different HW
queues are created, but should be cleared when queue counts are resized
(ethtool -L).
What I did is flat out incorrect: memset the struct to 0
on napi_enable.
I am not sure I fully understood the part of the previous conversation
about mapping rings to NAPIs and so on, but I wanted to make sure the
rest of the implementation was starting to vaguely resemble what was
discussed in the previous thread.
To help illustrate how this would end up working, I've added patches
for 3 drivers, of which I have access to only 1:
- mlx5, which is the basis of the examples below
- mlx4, which has TX-only NAPIs, just to highlight that case. I have
only compile tested this patch; I don't have this hardware.
- bnxt, which I have also only compile tested.
Zeroing on napi_enable is incorrect because, at the very least, it
breaks sysfs (the sysfs settings should be inherited). It's not clear
to me how we achieve persistence with the zeroing unless we assume the
drivers are changed somehow? IIRC, there was a suggestion in the
previous thread to memset the napi_storage to 0 on queue resize, but
I've definitely gotten the nuance of what was desired wrong.
Anyway, I'm sending what I have before iterating further, to check
whether this is even remotely the right direction before going too
deep down this path. I hope that's OK.
Here's an example of how it works on my mlx5 as is:
# start with 4 queues
$ ethtool -l eth4 | grep Combined | tail -1
Combined: 4
First, output the current NAPI settings:
$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
--dump napi-get --json='{"ifindex": 7}'
[{'defer-hard-irqs': 0,
'gro-flush-timeout': 0,
'id': 928,
'ifindex': 7,
'index': 3,
'irq': 529},
{'defer-hard-irqs': 0,
'gro-flush-timeout': 0,
'id': 927,
'ifindex': 7,
'index': 2,
'irq': 528},
[...]
Now, set the global sysfs parameters:
$ sudo bash -c 'echo 20000 >/sys/class/net/eth4/gro_flush_timeout'
$ sudo bash -c 'echo 100 >/sys/class/net/eth4/napi_defer_hard_irqs'
Output current NAPI settings again:
$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
--dump napi-get --json='{"ifindex": 7}'
[{'defer-hard-irqs': 100,
'gro-flush-timeout': 20000,
'id': 928,
'ifindex': 7,
'index': 3,
'irq': 529},
{'defer-hard-irqs': 100,
'gro-flush-timeout': 20000,
'id': 927,
'ifindex': 7,
'index': 2,
'irq': 528},
[...]
Now set NAPI ID 927, via its ifindex and index, to specific values:
$ sudo ./tools/net/ynl/cli.py \
--spec Documentation/netlink/specs/netdev.yaml \
--do napi-set \
--json='{"ifindex": 7, "index": 2,
"defer-hard-irqs": 111,
"gro-flush-timeout": 11111}'
None
Now output current NAPI settings again to ensure only 927 changed:
$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
--dump napi-get --json='{"ifindex": 7}'
[{'defer-hard-irqs': 100,
'gro-flush-timeout': 20000,
'id': 928,
'ifindex': 7,
'index': 3,
'irq': 529},
{'defer-hard-irqs': 111,
'gro-flush-timeout': 11111,
'id': 927,
'ifindex': 7,
'index': 2,
'irq': 528},
[...]
Now, increase gro-flush-timeout only:
$ sudo ./tools/net/ynl/cli.py \
--spec Documentation/netlink/specs/netdev.yaml \
--do napi-set --json='{"ifindex": 7, "index": 2,
"gro-flush-timeout": 44444}
None
Now output the current NAPI settings once more:
$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
--dump napi-get --json='{"ifindex": 7}'
[{'defer-hard-irqs': 100,
'gro-flush-timeout': 20000,
'id': 928,
'ifindex': 7,
'index': 3,
'irq': 529},
{'defer-hard-irqs': 111,
'gro-flush-timeout': 44444,
'id': 927,
'ifindex': 7,
'index': 2,
'irq': 528},
[...]
Now set NAPI ID 927, via its ifindex and index, to have
gro_flush_timeout of 0:
$ sudo ./tools/net/ynl/cli.py \
--spec Documentation/netlink/specs/netdev.yaml \
--do napi-set --json='{"ifindex": 7, "index": 2,
"gro-flush-timeout": 0}'
None
Check that NAPI ID 927 now has a gro-flush-timeout of 0:
$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
--dump napi-get --json='{"ifindex": 7}'
[{'defer-hard-irqs': 100,
'gro-flush-timeout': 20000,
'id': 928,
'ifindex': 7,
'index': 3,
'irq': 529},
{'defer-hard-irqs': 111,
'gro-flush-timeout': 0,
'id': 927,
'ifindex': 7,
'index': 2,
'irq': 528},
[...]
Last, but not least, let's try writing the sysfs parameters to ensure
all NAPIs are rewritten:
$ sudo bash -c 'echo 33333 >/sys/class/net/eth4/gro_flush_timeout'
$ sudo bash -c 'echo 222 >/sys/class/net/eth4/napi_defer_hard_irqs'
Check that it worked:
$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
--dump napi-get --json='{"ifindex": 7}'
[{'defer-hard-irqs': 222,
'gro-flush-timeout': 33333,
'id': 928,
'ifindex': 7,
'index': 3,
'irq': 529},
{'defer-hard-irqs': 222,
'gro-flush-timeout': 33333,
'id': 927,
'ifindex': 7,
'index': 2,
'irq': 528},
[...]
Resizing the queues (ethtool -L) resets everything to 0, which is wrong
because it breaks sysfs and it breaks the persistence goal.
I hope, though, that this code can still be discussed to ensure that I
am moving in the correct direction toward solving these issues before
going too far down the rabbit hole :)
Thanks,
Joe
Joe Damato (9):
net: napi: Add napi_storage
netdev-genl: Export NAPI index
net: napi: Make napi_defer_hard_irqs per-NAPI
netdev-genl: Dump napi_defer_hard_irqs
net: napi: Make gro_flush_timeout per-NAPI
netdev-genl: Support setting per-NAPI config values
bnxt: Add support for napi storage
mlx5: Add support for napi storage
mlx4: Add support for napi storage to RX CQs
Documentation/netlink/specs/netdev.yaml | 35 ++++++++
.../networking/net_cachelines/net_device.rst | 3 +-
drivers/net/ethernet/broadcom/bnxt/bnxt.c | 3 +-
drivers/net/ethernet/mellanox/mlx4/en_cq.c | 3 +-
.../net/ethernet/mellanox/mlx5/core/en_main.c | 2 +-
include/linux/netdevice.h | 38 ++++++++-
include/uapi/linux/netdev.h | 4 +
net/core/dev.c | 36 +++++---
net/core/dev.h | 83 +++++++++++++++++++
net/core/net-sysfs.c | 4 +-
net/core/netdev-genl-gen.c | 15 ++++
net/core/netdev-genl-gen.h | 1 +
net/core/netdev-genl.c | 65 +++++++++++++++
tools/include/uapi/linux/netdev.h | 3 +
14 files changed, 276 insertions(+), 19 deletions(-)
--
2.25.1
^ permalink raw reply [flat|nested] 22+ messages in thread
* [RFC net-next v2 1/9] net: napi: Add napi_storage
2024-09-08 16:06 [RFC net-next v2 0/9] Add support for per-NAPI config via netlink Joe Damato
@ 2024-09-08 16:06 ` Joe Damato
2024-09-08 20:49 ` Joe Damato
2024-09-09 23:40 ` Jakub Kicinski
2024-09-08 16:06 ` [RFC net-next v2 2/9] netdev-genl: Export NAPI index Joe Damato
` (7 subsequent siblings)
8 siblings, 2 replies; 22+ messages in thread
From: Joe Damato @ 2024-09-08 16:06 UTC (permalink / raw)
To: netdev
Cc: mkarsten, kuba, skhawaja, sdf, bjorn, amritha.nambiar,
sridhar.samudrala, Joe Damato, David S. Miller, Eric Dumazet,
Paolo Abeni, Jonathan Corbet, Jiri Pirko,
Sebastian Andrzej Siewior, Lorenzo Bianconi,
open list:DOCUMENTATION, open list
Add a persistent NAPI storage area for NAPI configuration to the core.
Drivers opt-in to setting the storage for a NAPI by passing an index
when calling netif_napi_add_storage.
napi_storage is allocated in alloc_netdev_mqs, freed in free_netdev
(after the NAPIs are deleted), and set to 0 when napi_enable is called.
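A driver opts in like this (taken from the mlx5 patch later in this
series, where ix is the channel index):

    netif_napi_add_storage(netdev, &c->napi, mlx5e_napi_poll,
                           NAPI_POLL_WEIGHT, ix);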
Signed-off-by: Joe Damato <jdamato@fastly.com>
---
.../networking/net_cachelines/net_device.rst | 1 +
include/linux/netdevice.h | 34 +++++++++++++++++++
net/core/dev.c | 18 +++++++++-
3 files changed, 52 insertions(+), 1 deletion(-)
diff --git a/Documentation/networking/net_cachelines/net_device.rst b/Documentation/networking/net_cachelines/net_device.rst
index 22b07c814f4a..a82751c88d18 100644
--- a/Documentation/networking/net_cachelines/net_device.rst
+++ b/Documentation/networking/net_cachelines/net_device.rst
@@ -106,6 +106,7 @@ rx_handler_func_t* rx_handler read_mostly
void* rx_handler_data read_mostly -
struct_netdev_queue* ingress_queue read_mostly -
struct_bpf_mprog_entry tcx_ingress - read_mostly sch_handle_ingress
+struct napi_storage* napi_storage - read_mostly napi_complete_done
struct_nf_hook_entries* nf_hooks_ingress
unsigned_char broadcast[32]
struct_cpu_rmap* rx_cpu_rmap
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index b47c00657bd0..54da1c800e65 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -342,6 +342,14 @@ struct gro_list {
*/
#define GRO_HASH_BUCKETS 8
+/*
+ * Structure for per-NAPI storage
+ */
+struct napi_storage {
+ u64 gro_flush_timeout;
+ u32 defer_hard_irqs;
+};
+
/*
* Structure for NAPI scheduling similar to tasklet but with weighting
*/
@@ -377,6 +385,8 @@ struct napi_struct {
struct list_head dev_list;
struct hlist_node napi_hash_node;
int irq;
+ int index;
+ struct napi_storage *napi_storage;
};
enum {
@@ -2009,6 +2019,9 @@ enum netdev_reg_state {
* @dpll_pin: Pointer to the SyncE source pin of a DPLL subsystem,
* where the clock is recovered.
*
+ * @napi_storage: An array of napi_storage structures containing per-NAPI
+ * settings.
+ *
* FIXME: cleanup struct net_device such that network protocol info
* moves out.
*/
@@ -2087,6 +2100,7 @@ struct net_device {
#ifdef CONFIG_NET_XGRESS
struct bpf_mprog_entry __rcu *tcx_ingress;
#endif
+ struct napi_storage *napi_storage;
__cacheline_group_end(net_device_read_rx);
char name[IFNAMSIZ];
@@ -2648,6 +2662,24 @@ netif_napi_add_tx_weight(struct net_device *dev,
netif_napi_add_weight(dev, napi, poll, weight);
}
+/**
+ * netif_napi_add_storage - initialize a NAPI context and set storage area
+ * @dev: network device
+ * @napi: NAPI context
+ * @poll: polling function
+ * @weight: the poll weight of this NAPI
+ * @index: the NAPI index
+ */
+static inline void
+netif_napi_add_storage(struct net_device *dev, struct napi_struct *napi,
+ int (*poll)(struct napi_struct *, int), int weight,
+ int index)
+{
+ napi->index = index;
+ napi->napi_storage = &dev->napi_storage[index];
+ netif_napi_add_weight(dev, napi, poll, weight);
+}
+
/**
* netif_napi_add_tx() - initialize a NAPI context to be used for Tx only
* @dev: network device
@@ -2683,6 +2715,8 @@ void __netif_napi_del(struct napi_struct *napi);
*/
static inline void netif_napi_del(struct napi_struct *napi)
{
+ napi->napi_storage = NULL;
+ napi->index = -1;
__netif_napi_del(napi);
synchronize_net();
}
diff --git a/net/core/dev.c b/net/core/dev.c
index 22c3f14d9287..ca90e8cab121 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -6719,6 +6719,9 @@ void napi_enable(struct napi_struct *n)
if (n->dev->threaded && n->thread)
new |= NAPIF_STATE_THREADED;
} while (!try_cmpxchg(&n->state, &val, new));
+
+ if (n->napi_storage)
+ memset(n->napi_storage, 0, sizeof(*n->napi_storage));
}
EXPORT_SYMBOL(napi_enable);
@@ -11054,6 +11057,8 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
unsigned int txqs, unsigned int rxqs)
{
struct net_device *dev;
+ size_t napi_storage_sz;
+ unsigned int maxqs;
BUG_ON(strlen(name) >= sizeof(dev->name));
@@ -11067,6 +11072,9 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
return NULL;
}
+ WARN_ON_ONCE(txqs != rxqs);
+ maxqs = max(txqs, rxqs);
+
dev = kvzalloc(struct_size(dev, priv, sizeof_priv),
GFP_KERNEL_ACCOUNT | __GFP_RETRY_MAYFAIL);
if (!dev)
@@ -11141,6 +11149,11 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
if (!dev->ethtool)
goto free_all;
+ napi_storage_sz = array_size(maxqs, sizeof(*dev->napi_storage));
+ dev->napi_storage = kvzalloc(napi_storage_sz, GFP_KERNEL_ACCOUNT);
+ if (!dev->napi_storage)
+ goto free_all;
+
strscpy(dev->name, name);
dev->name_assign_type = name_assign_type;
dev->group = INIT_NETDEV_GROUP;
@@ -11202,6 +11215,8 @@ void free_netdev(struct net_device *dev)
list_for_each_entry_safe(p, n, &dev->napi_list, dev_list)
netif_napi_del(p);
+ kvfree(dev->napi_storage);
+
ref_tracker_dir_exit(&dev->refcnt_tracker);
#ifdef CONFIG_PCPU_DEV_REFCNT
free_percpu(dev->pcpu_refcnt);
@@ -11979,7 +11994,8 @@ static void __init net_dev_struct_check(void)
#ifdef CONFIG_NET_XGRESS
CACHELINE_ASSERT_GROUP_MEMBER(struct net_device, net_device_read_rx, tcx_ingress);
#endif
- CACHELINE_ASSERT_GROUP_SIZE(struct net_device, net_device_read_rx, 104);
+ CACHELINE_ASSERT_GROUP_MEMBER(struct net_device, net_device_read_rx, napi_storage);
+ CACHELINE_ASSERT_GROUP_SIZE(struct net_device, net_device_read_rx, 112);
}
/*
--
2.25.1
^ permalink raw reply related [flat|nested] 22+ messages in thread
* [RFC net-next v2 2/9] netdev-genl: Export NAPI index
2024-09-08 16:06 [RFC net-next v2 0/9] Add support for per-NAPI config via netlink Joe Damato
2024-09-08 16:06 ` [RFC net-next v2 1/9] net: napi: Add napi_storage Joe Damato
@ 2024-09-08 16:06 ` Joe Damato
2024-09-08 16:06 ` [RFC net-next v2 3/9] net: napi: Make napi_defer_hard_irqs per-NAPI Joe Damato
` (6 subsequent siblings)
8 siblings, 0 replies; 22+ messages in thread
From: Joe Damato @ 2024-09-08 16:06 UTC (permalink / raw)
To: netdev
Cc: mkarsten, kuba, skhawaja, sdf, bjorn, amritha.nambiar,
sridhar.samudrala, Joe Damato, David S. Miller, Eric Dumazet,
Paolo Abeni, Donald Hunter, Jesper Dangaard Brouer, Xuan Zhuo,
Daniel Jurgens, open list
Export the NAPI index on napi-get operations. This index will be used in
future commits to set per-NAPI parameters.
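For example (as shown in the cover letter), the index now appears in
napi-get output:

$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
    --dump napi-get --json='{"ifindex": 7}'
[{'defer-hard-irqs': 0,
  'gro-flush-timeout': 0,
  'id': 928,
  'ifindex': 7,
  'index': 3,
  'irq': 529},
[...]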
Signed-off-by: Joe Damato <jdamato@fastly.com>
---
Documentation/netlink/specs/netdev.yaml | 9 +++++++++
include/uapi/linux/netdev.h | 1 +
net/core/netdev-genl.c | 3 +++
3 files changed, 13 insertions(+)
diff --git a/Documentation/netlink/specs/netdev.yaml b/Documentation/netlink/specs/netdev.yaml
index 959755be4d7f..cf3e77c6fd5e 100644
--- a/Documentation/netlink/specs/netdev.yaml
+++ b/Documentation/netlink/specs/netdev.yaml
@@ -244,6 +244,14 @@ attribute-sets:
threaded mode. If NAPI is not in threaded mode (i.e. uses normal
softirq context), the attribute will be absent.
type: u32
+ -
+ name: index
+ doc: The index of the NAPI instance. Refers to persistent storage for
+ any NAPI with the same index.
+ type: u32
+ checks:
+ min: 0
+ max: s32-max
-
name: queue
attributes:
@@ -593,6 +601,7 @@ operations:
- ifindex
- irq
- pid
+ - index
dump:
request:
attributes:
diff --git a/include/uapi/linux/netdev.h b/include/uapi/linux/netdev.h
index 43742ac5b00d..e06e33acb6fd 100644
--- a/include/uapi/linux/netdev.h
+++ b/include/uapi/linux/netdev.h
@@ -121,6 +121,7 @@ enum {
NETDEV_A_NAPI_ID,
NETDEV_A_NAPI_IRQ,
NETDEV_A_NAPI_PID,
+ NETDEV_A_NAPI_INDEX,
__NETDEV_A_NAPI_MAX,
NETDEV_A_NAPI_MAX = (__NETDEV_A_NAPI_MAX - 1)
diff --git a/net/core/netdev-genl.c b/net/core/netdev-genl.c
index a17d7eaeb001..9561841b9d2d 100644
--- a/net/core/netdev-genl.c
+++ b/net/core/netdev-genl.c
@@ -182,6 +182,9 @@ netdev_nl_napi_fill_one(struct sk_buff *rsp, struct napi_struct *napi,
if (napi->irq >= 0 && nla_put_u32(rsp, NETDEV_A_NAPI_IRQ, napi->irq))
goto nla_put_failure;
+ if (napi->index >= 0 && nla_put_u32(rsp, NETDEV_A_NAPI_INDEX, napi->index))
+ goto nla_put_failure;
+
if (napi->thread) {
pid = task_pid_nr(napi->thread);
if (nla_put_u32(rsp, NETDEV_A_NAPI_PID, pid))
--
2.25.1
^ permalink raw reply related [flat|nested] 22+ messages in thread
* [RFC net-next v2 3/9] net: napi: Make napi_defer_hard_irqs per-NAPI
2024-09-08 16:06 [RFC net-next v2 0/9] Add support for per-NAPI config via netlink Joe Damato
2024-09-08 16:06 ` [RFC net-next v2 1/9] net: napi: Add napi_storage Joe Damato
2024-09-08 16:06 ` [RFC net-next v2 2/9] netdev-genl: Export NAPI index Joe Damato
@ 2024-09-08 16:06 ` Joe Damato
2024-09-08 16:06 ` [RFC net-next v2 4/9] netdev-genl: Dump napi_defer_hard_irqs Joe Damato
` (5 subsequent siblings)
8 siblings, 0 replies; 22+ messages in thread
From: Joe Damato @ 2024-09-08 16:06 UTC (permalink / raw)
To: netdev
Cc: mkarsten, kuba, skhawaja, sdf, bjorn, amritha.nambiar,
sridhar.samudrala, Joe Damato, David S. Miller, Eric Dumazet,
Paolo Abeni, Jonathan Corbet, Jiri Pirko,
Sebastian Andrzej Siewior, Lorenzo Bianconi, David Ahern,
Johannes Berg, Breno Leitao, Alexander Lobakin,
open list:DOCUMENTATION, open list
Allow per-NAPI defer_hard_irqs setting.
The existing sysfs parameter is respected; writes to sysfs will write
to all NAPI structs for the device and to the net_device
napi_defer_hard_irqs field. Reads from sysfs will read from the
net_device field.
The sysfs code was updated to guard against a potential overflow: the
value passed in is an unsigned long, but the field it lands in is
limited to the s32 range.
The ability to set defer_hard_irqs on specific NAPI instances will be
added in a later commit, via netdev-genl.
Signed-off-by: Joe Damato <jdamato@fastly.com>
---
.../networking/net_cachelines/net_device.rst | 1 -
include/linux/netdevice.h | 2 +-
net/core/dev.c | 10 ++---
net/core/dev.h | 40 +++++++++++++++++++
net/core/net-sysfs.c | 2 +-
5 files changed, 47 insertions(+), 8 deletions(-)
diff --git a/Documentation/networking/net_cachelines/net_device.rst b/Documentation/networking/net_cachelines/net_device.rst
index a82751c88d18..4cd801398c4e 100644
--- a/Documentation/networking/net_cachelines/net_device.rst
+++ b/Documentation/networking/net_cachelines/net_device.rst
@@ -99,7 +99,6 @@ unsigned_int num_rx_queues
unsigned_int real_num_rx_queues - read_mostly get_rps_cpu
struct_bpf_prog* xdp_prog - read_mostly netif_elide_gro()
unsigned_long gro_flush_timeout - read_mostly napi_complete_done
-u32 napi_defer_hard_irqs - read_mostly napi_complete_done
unsigned_int gro_max_size - read_mostly skb_gro_receive
unsigned_int gro_ipv4_max_size - read_mostly skb_gro_receive
rx_handler_func_t* rx_handler read_mostly - __netif_receive_skb_core
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 54da1c800e65..5a58cf61539e 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2088,7 +2088,6 @@ struct net_device {
unsigned int real_num_rx_queues;
struct netdev_rx_queue *_rx;
unsigned long gro_flush_timeout;
- u32 napi_defer_hard_irqs;
unsigned int gro_max_size;
unsigned int gro_ipv4_max_size;
rx_handler_func_t __rcu *rx_handler;
@@ -2412,6 +2411,7 @@ struct net_device {
/** @irq_moder: dim parameters used if IS_ENABLED(CONFIG_DIMLIB). */
struct dim_irq_moder *irq_moder;
+ u32 napi_defer_hard_irqs;
u8 priv[] ____cacheline_aligned
__counted_by(priv_len);
diff --git a/net/core/dev.c b/net/core/dev.c
index ca90e8cab121..9495448fedaa 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -6228,7 +6228,7 @@ bool napi_complete_done(struct napi_struct *n, int work_done)
if (work_done) {
if (n->gro_bitmask)
timeout = READ_ONCE(n->dev->gro_flush_timeout);
- n->defer_hard_irqs_count = READ_ONCE(n->dev->napi_defer_hard_irqs);
+ n->defer_hard_irqs_count = napi_get_defer_hard_irqs(n);
}
if (n->defer_hard_irqs_count > 0) {
n->defer_hard_irqs_count--;
@@ -6366,7 +6366,7 @@ static void busy_poll_stop(struct napi_struct *napi, void *have_poll_lock,
bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx);
if (flags & NAPI_F_PREFER_BUSY_POLL) {
- napi->defer_hard_irqs_count = READ_ONCE(napi->dev->napi_defer_hard_irqs);
+ napi->defer_hard_irqs_count = napi_get_defer_hard_irqs(napi);
timeout = READ_ONCE(napi->dev->gro_flush_timeout);
if (napi->defer_hard_irqs_count && timeout) {
hrtimer_start(&napi->timer, ns_to_ktime(timeout), HRTIMER_MODE_REL_PINNED);
@@ -6648,6 +6648,7 @@ void netif_napi_add_weight(struct net_device *dev, struct napi_struct *napi,
INIT_HLIST_NODE(&napi->napi_hash_node);
hrtimer_init(&napi->timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL_PINNED);
napi->timer.function = napi_watchdog;
+ napi_set_defer_hard_irqs(napi, READ_ONCE(dev->napi_defer_hard_irqs));
init_gro_hash(napi);
napi->skb = NULL;
INIT_LIST_HEAD(&napi->rx_list);
@@ -11033,7 +11034,7 @@ void netdev_sw_irq_coalesce_default_on(struct net_device *dev)
if (!IS_ENABLED(CONFIG_PREEMPT_RT)) {
dev->gro_flush_timeout = 20000;
- dev->napi_defer_hard_irqs = 1;
+ netdev_set_defer_hard_irqs(dev, 1);
}
}
EXPORT_SYMBOL_GPL(netdev_sw_irq_coalesce_default_on);
@@ -11982,7 +11983,6 @@ static void __init net_dev_struct_check(void)
CACHELINE_ASSERT_GROUP_MEMBER(struct net_device, net_device_read_rx, real_num_rx_queues);
CACHELINE_ASSERT_GROUP_MEMBER(struct net_device, net_device_read_rx, _rx);
CACHELINE_ASSERT_GROUP_MEMBER(struct net_device, net_device_read_rx, gro_flush_timeout);
- CACHELINE_ASSERT_GROUP_MEMBER(struct net_device, net_device_read_rx, napi_defer_hard_irqs);
CACHELINE_ASSERT_GROUP_MEMBER(struct net_device, net_device_read_rx, gro_max_size);
CACHELINE_ASSERT_GROUP_MEMBER(struct net_device, net_device_read_rx, gro_ipv4_max_size);
CACHELINE_ASSERT_GROUP_MEMBER(struct net_device, net_device_read_rx, rx_handler);
@@ -11995,7 +11995,7 @@ static void __init net_dev_struct_check(void)
CACHELINE_ASSERT_GROUP_MEMBER(struct net_device, net_device_read_rx, tcx_ingress);
#endif
CACHELINE_ASSERT_GROUP_MEMBER(struct net_device, net_device_read_rx, napi_storage);
- CACHELINE_ASSERT_GROUP_SIZE(struct net_device, net_device_read_rx, 112);
+ CACHELINE_ASSERT_GROUP_SIZE(struct net_device, net_device_read_rx, 108);
}
/*
diff --git a/net/core/dev.h b/net/core/dev.h
index 5654325c5b71..2584a7de189f 100644
--- a/net/core/dev.h
+++ b/net/core/dev.h
@@ -138,6 +138,46 @@ static inline void netif_set_gro_ipv4_max_size(struct net_device *dev,
WRITE_ONCE(dev->gro_ipv4_max_size, size);
}
+/**
+ * napi_get_defer_hard_irqs - get the NAPI's defer_hard_irqs
+ * @n: napi struct to get the defer_hard_irqs field from
+ *
+ * Return: the per-NAPI value of the defer_hard_irqs field.
+ */
+static inline u32 napi_get_defer_hard_irqs(const struct napi_struct *n)
+{
+ if (n->napi_storage)
+ return READ_ONCE(n->napi_storage->defer_hard_irqs);
+ else
+ return READ_ONCE(n->dev->napi_defer_hard_irqs);
+}
+
+/**
+ * napi_set_defer_hard_irqs - set the defer_hard_irqs for a napi
+ * @n: napi_struct to set the defer_hard_irqs field
+ * @defer: the value the field should be set to
+ */
+static inline void napi_set_defer_hard_irqs(struct napi_struct *n, u32 defer)
+{
+ if (n->napi_storage)
+ WRITE_ONCE(n->napi_storage->defer_hard_irqs, defer);
+}
+
+/**
+ * netdev_set_defer_hard_irqs - set defer_hard_irqs for all NAPIs of a netdev
+ * @netdev: the net_device for which all NAPIs will have their defer_hard_irqs set
+ * @defer: the defer_hard_irqs value to set
+ */
+static inline void netdev_set_defer_hard_irqs(struct net_device *netdev,
+ u32 defer)
+{
+ struct napi_struct *napi;
+
+ WRITE_ONCE(netdev->napi_defer_hard_irqs, defer);
+ list_for_each_entry(napi, &netdev->napi_list, dev_list)
+ napi_set_defer_hard_irqs(napi, defer);
+}
+
int rps_cpumask_housekeeping(struct cpumask *mask);
#if defined(CONFIG_DEBUG_NET) && defined(CONFIG_BPF_SYSCALL)
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 0648dbf0e234..0a0bbbfb39b4 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -429,7 +429,7 @@ static int change_napi_defer_hard_irqs(struct net_device *dev, unsigned long val
if (val > S32_MAX)
return -ERANGE;
- WRITE_ONCE(dev->napi_defer_hard_irqs, val);
+ netdev_set_defer_hard_irqs(dev, (u32)val);
return 0;
}
--
2.25.1
^ permalink raw reply related [flat|nested] 22+ messages in thread
* [RFC net-next v2 4/9] netdev-genl: Dump napi_defer_hard_irqs
2024-09-08 16:06 [RFC net-next v2 0/9] Add support for per-NAPI config via netlink Joe Damato
` (2 preceding siblings ...)
2024-09-08 16:06 ` [RFC net-next v2 3/9] net: napi: Make napi_defer_hard_irqs per-NAPI Joe Damato
@ 2024-09-08 16:06 ` Joe Damato
2024-09-08 20:36 ` Joe Damato
2024-09-08 16:06 ` [RFC net-next v2 5/9] net: napi: Make gro_flush_timeout per-NAPI Joe Damato
` (4 subsequent siblings)
8 siblings, 1 reply; 22+ messages in thread
From: Joe Damato @ 2024-09-08 16:06 UTC (permalink / raw)
To: netdev
Cc: mkarsten, kuba, skhawaja, sdf, bjorn, amritha.nambiar,
sridhar.samudrala, Joe Damato, Donald Hunter, David S. Miller,
Eric Dumazet, Paolo Abeni, Jesper Dangaard Brouer, Xuan Zhuo,
open list
Support dumping defer_hard_irqs for a NAPI ID.
Signed-off-by: Joe Damato <jdamato@fastly.com>
---
Documentation/netlink/specs/netdev.yaml | 8 ++++++++
include/uapi/linux/netdev.h | 1 +
net/core/netdev-genl.c | 5 +++++
tools/include/uapi/linux/netdev.h | 1 +
4 files changed, 15 insertions(+)
diff --git a/Documentation/netlink/specs/netdev.yaml b/Documentation/netlink/specs/netdev.yaml
index cf3e77c6fd5e..e4219bfff08d 100644
--- a/Documentation/netlink/specs/netdev.yaml
+++ b/Documentation/netlink/specs/netdev.yaml
@@ -252,6 +252,13 @@ attribute-sets:
checks:
min: 0
max: s32-max
+ -
+ name: defer-hard-irqs
+ doc: The number of consecutive empty polls before IRQ deferral ends
+ and hardware IRQs are re-enabled.
+ type: u32
+ checks:
+ max: s32-max
-
name: queue
attributes:
@@ -602,6 +609,7 @@ operations:
- irq
- pid
- index
+ - defer-hard-irqs
dump:
request:
attributes:
diff --git a/include/uapi/linux/netdev.h b/include/uapi/linux/netdev.h
index e06e33acb6fd..bcc95b7ebd92 100644
--- a/include/uapi/linux/netdev.h
+++ b/include/uapi/linux/netdev.h
@@ -122,6 +122,7 @@ enum {
NETDEV_A_NAPI_IRQ,
NETDEV_A_NAPI_PID,
NETDEV_A_NAPI_INDEX,
+ NETDEV_A_NAPI_DEFER_HARD_IRQS,
__NETDEV_A_NAPI_MAX,
NETDEV_A_NAPI_MAX = (__NETDEV_A_NAPI_MAX - 1)
diff --git a/net/core/netdev-genl.c b/net/core/netdev-genl.c
index 9561841b9d2d..f1e505ad069f 100644
--- a/net/core/netdev-genl.c
+++ b/net/core/netdev-genl.c
@@ -160,6 +160,7 @@ static int
netdev_nl_napi_fill_one(struct sk_buff *rsp, struct napi_struct *napi,
const struct genl_info *info)
{
+ int napi_defer_hard_irqs;
void *hdr;
pid_t pid;
@@ -191,6 +192,10 @@ netdev_nl_napi_fill_one(struct sk_buff *rsp, struct napi_struct *napi,
goto nla_put_failure;
}
+ napi_defer_hard_irqs = napi_get_defer_hard_irqs(napi);
+ if (nla_put_s32(rsp, NETDEV_A_NAPI_DEFER_HARD_IRQS, napi_defer_hard_irqs))
+ goto nla_put_failure;
+
genlmsg_end(rsp, hdr);
return 0;
diff --git a/tools/include/uapi/linux/netdev.h b/tools/include/uapi/linux/netdev.h
index 43742ac5b00d..43bb1aad9611 100644
--- a/tools/include/uapi/linux/netdev.h
+++ b/tools/include/uapi/linux/netdev.h
@@ -121,6 +121,7 @@ enum {
NETDEV_A_NAPI_ID,
NETDEV_A_NAPI_IRQ,
NETDEV_A_NAPI_PID,
+ NETDEV_A_NAPI_DEFER_HARD_IRQS,
__NETDEV_A_NAPI_MAX,
NETDEV_A_NAPI_MAX = (__NETDEV_A_NAPI_MAX - 1)
--
2.25.1
^ permalink raw reply related [flat|nested] 22+ messages in thread
* [RFC net-next v2 5/9] net: napi: Make gro_flush_timeout per-NAPI
2024-09-08 16:06 [RFC net-next v2 0/9] Add support for per-NAPI config via netlink Joe Damato
` (3 preceding siblings ...)
2024-09-08 16:06 ` [RFC net-next v2 4/9] netdev-genl: Dump napi_defer_hard_irqs Joe Damato
@ 2024-09-08 16:06 ` Joe Damato
2024-09-08 16:06 ` [RFC net-next v2 6/9] netdev-genl: Support setting per-NAPI config values Joe Damato
` (3 subsequent siblings)
8 siblings, 0 replies; 22+ messages in thread
From: Joe Damato @ 2024-09-08 16:06 UTC (permalink / raw)
To: netdev
Cc: mkarsten, kuba, skhawaja, sdf, bjorn, amritha.nambiar,
sridhar.samudrala, Joe Damato, David S. Miller, Eric Dumazet,
Paolo Abeni, Donald Hunter, Jonathan Corbet,
Jesper Dangaard Brouer, Xuan Zhuo, Jiri Pirko,
Sebastian Andrzej Siewior, Lorenzo Bianconi, David Ahern,
Johannes Berg, Breno Leitao, Alexander Lobakin, Daniel Jurgens,
open list, open list:DOCUMENTATION
Allow per-NAPI gro_flush_timeout setting.
The existing sysfs parameter is respected; writes to sysfs will write to
all NAPI structs for the device and the net_device gro_flush_timeout
field. Reads from sysfs will read from the net_device field.
The ability to set gro_flush_timeout on specific NAPI instances will be
added in a later commit, via netdev-genl.
Signed-off-by: Joe Damato <jdamato@fastly.com>
---
Documentation/netlink/specs/netdev.yaml | 6 +++
.../networking/net_cachelines/net_device.rst | 1 -
include/linux/netdevice.h | 2 +-
include/uapi/linux/netdev.h | 1 +
net/core/dev.c | 12 +++---
net/core/dev.h | 43 +++++++++++++++++++
net/core/net-sysfs.c | 2 +-
net/core/netdev-genl.c | 5 +++
tools/include/uapi/linux/netdev.h | 1 +
9 files changed, 64 insertions(+), 9 deletions(-)
diff --git a/Documentation/netlink/specs/netdev.yaml b/Documentation/netlink/specs/netdev.yaml
index e4219bfff08d..3034c480d0b4 100644
--- a/Documentation/netlink/specs/netdev.yaml
+++ b/Documentation/netlink/specs/netdev.yaml
@@ -259,6 +259,11 @@ attribute-sets:
type: u32
checks:
max: s32-max
+ -
+ name: gro-flush-timeout
+ doc: The timeout, in nanoseconds, after which the NAPI watchdog
+ timer fires and NAPI processing is scheduled.
+ type: uint
-
name: queue
attributes:
@@ -610,6 +615,7 @@ operations:
- pid
- index
- defer-hard-irqs
+ - gro-flush-timeout
dump:
request:
attributes:
diff --git a/Documentation/networking/net_cachelines/net_device.rst b/Documentation/networking/net_cachelines/net_device.rst
index 4cd801398c4e..048cc9d1eafc 100644
--- a/Documentation/networking/net_cachelines/net_device.rst
+++ b/Documentation/networking/net_cachelines/net_device.rst
@@ -98,7 +98,6 @@ struct_netdev_queue* _rx read_mostly
unsigned_int num_rx_queues
unsigned_int real_num_rx_queues - read_mostly get_rps_cpu
struct_bpf_prog* xdp_prog - read_mostly netif_elide_gro()
-unsigned_long gro_flush_timeout - read_mostly napi_complete_done
unsigned_int gro_max_size - read_mostly skb_gro_receive
unsigned_int gro_ipv4_max_size - read_mostly skb_gro_receive
rx_handler_func_t* rx_handler read_mostly - __netif_receive_skb_core
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 5a58cf61539e..862c835bcf09 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2087,7 +2087,6 @@ struct net_device {
int ifindex;
unsigned int real_num_rx_queues;
struct netdev_rx_queue *_rx;
- unsigned long gro_flush_timeout;
unsigned int gro_max_size;
unsigned int gro_ipv4_max_size;
rx_handler_func_t __rcu *rx_handler;
@@ -2411,6 +2410,7 @@ struct net_device {
/** @irq_moder: dim parameters used if IS_ENABLED(CONFIG_DIMLIB). */
struct dim_irq_moder *irq_moder;
+ unsigned long gro_flush_timeout;
u32 napi_defer_hard_irqs;
u8 priv[] ____cacheline_aligned
diff --git a/include/uapi/linux/netdev.h b/include/uapi/linux/netdev.h
index bcc95b7ebd92..fd02b5b3b081 100644
--- a/include/uapi/linux/netdev.h
+++ b/include/uapi/linux/netdev.h
@@ -123,6 +123,7 @@ enum {
NETDEV_A_NAPI_PID,
NETDEV_A_NAPI_INDEX,
NETDEV_A_NAPI_DEFER_HARD_IRQS,
+ NETDEV_A_NAPI_GRO_FLUSH_TIMEOUT,
__NETDEV_A_NAPI_MAX,
NETDEV_A_NAPI_MAX = (__NETDEV_A_NAPI_MAX - 1)
diff --git a/net/core/dev.c b/net/core/dev.c
index 9495448fedaa..a45a0dbcf711 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -6227,12 +6227,12 @@ bool napi_complete_done(struct napi_struct *n, int work_done)
if (work_done) {
if (n->gro_bitmask)
- timeout = READ_ONCE(n->dev->gro_flush_timeout);
+ timeout = napi_get_gro_flush_timeout(n);
n->defer_hard_irqs_count = napi_get_defer_hard_irqs(n);
}
if (n->defer_hard_irqs_count > 0) {
n->defer_hard_irqs_count--;
- timeout = READ_ONCE(n->dev->gro_flush_timeout);
+ timeout = napi_get_gro_flush_timeout(n);
if (timeout)
ret = false;
}
@@ -6367,7 +6367,7 @@ static void busy_poll_stop(struct napi_struct *napi, void *have_poll_lock,
if (flags & NAPI_F_PREFER_BUSY_POLL) {
napi->defer_hard_irqs_count = napi_get_defer_hard_irqs(napi);
- timeout = READ_ONCE(napi->dev->gro_flush_timeout);
+ timeout = napi_get_gro_flush_timeout(napi);
if (napi->defer_hard_irqs_count && timeout) {
hrtimer_start(&napi->timer, ns_to_ktime(timeout), HRTIMER_MODE_REL_PINNED);
skip_schedule = true;
@@ -6649,6 +6649,7 @@ void netif_napi_add_weight(struct net_device *dev, struct napi_struct *napi,
hrtimer_init(&napi->timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL_PINNED);
napi->timer.function = napi_watchdog;
napi_set_defer_hard_irqs(napi, READ_ONCE(dev->napi_defer_hard_irqs));
+ napi_set_gro_flush_timeout(napi, READ_ONCE(dev->gro_flush_timeout));
init_gro_hash(napi);
napi->skb = NULL;
INIT_LIST_HEAD(&napi->rx_list);
@@ -11033,7 +11034,7 @@ void netdev_sw_irq_coalesce_default_on(struct net_device *dev)
WARN_ON(dev->reg_state == NETREG_REGISTERED);
if (!IS_ENABLED(CONFIG_PREEMPT_RT)) {
- dev->gro_flush_timeout = 20000;
+ netdev_set_gro_flush_timeout(dev, 20000);
netdev_set_defer_hard_irqs(dev, 1);
}
}
@@ -11982,7 +11983,6 @@ static void __init net_dev_struct_check(void)
CACHELINE_ASSERT_GROUP_MEMBER(struct net_device, net_device_read_rx, ifindex);
CACHELINE_ASSERT_GROUP_MEMBER(struct net_device, net_device_read_rx, real_num_rx_queues);
CACHELINE_ASSERT_GROUP_MEMBER(struct net_device, net_device_read_rx, _rx);
- CACHELINE_ASSERT_GROUP_MEMBER(struct net_device, net_device_read_rx, gro_flush_timeout);
CACHELINE_ASSERT_GROUP_MEMBER(struct net_device, net_device_read_rx, gro_max_size);
CACHELINE_ASSERT_GROUP_MEMBER(struct net_device, net_device_read_rx, gro_ipv4_max_size);
CACHELINE_ASSERT_GROUP_MEMBER(struct net_device, net_device_read_rx, rx_handler);
@@ -11995,7 +11995,7 @@ static void __init net_dev_struct_check(void)
CACHELINE_ASSERT_GROUP_MEMBER(struct net_device, net_device_read_rx, tcx_ingress);
#endif
CACHELINE_ASSERT_GROUP_MEMBER(struct net_device, net_device_read_rx, napi_storage);
- CACHELINE_ASSERT_GROUP_SIZE(struct net_device, net_device_read_rx, 108);
+ CACHELINE_ASSERT_GROUP_SIZE(struct net_device, net_device_read_rx, 100);
}
/*
diff --git a/net/core/dev.h b/net/core/dev.h
index 2584a7de189f..f33d7bcb923f 100644
--- a/net/core/dev.h
+++ b/net/core/dev.h
@@ -178,6 +178,49 @@ static inline void netdev_set_defer_hard_irqs(struct net_device *netdev,
napi_set_defer_hard_irqs(napi, defer);
}
+/**
+ * napi_get_gro_flush_timeout - get the gro_flush_timeout
+ * @n: napi struct to get the gro_flush_timeout from
+ *
+ * Return: the per-NAPI value of the gro_flush_timeout field.
+ */
+static inline unsigned long napi_get_gro_flush_timeout(const struct napi_struct *n)
+{
+ if (n->napi_storage)
+ return READ_ONCE(n->napi_storage->gro_flush_timeout);
+ else
+ return READ_ONCE(n->dev->gro_flush_timeout);
+}
+
+/**
+ * napi_set_gro_flush_timeout - set the gro_flush_timeout for a napi
+ * @n: napi struct to set the gro_flush_timeout
+ * @timeout: timeout value to set
+ *
+ * napi_set_gro_flush_timeout sets the per-NAPI gro_flush_timeout
+ */
+static inline void napi_set_gro_flush_timeout(struct napi_struct *n,
+ unsigned long timeout)
+{
+ if (n->napi_storage)
+ WRITE_ONCE(n->napi_storage->gro_flush_timeout, timeout);
+}
+
+/**
+ * netdev_set_gro_flush_timeout - set gro_flush_timeout for all NAPIs of a netdev
+ * @netdev: the net_device for which all NAPIs will have their gro_flush_timeout set
+ * @timeout: the timeout value to set
+ */
+static inline void netdev_set_gro_flush_timeout(struct net_device *netdev,
+ unsigned long timeout)
+{
+ struct napi_struct *napi;
+
+ WRITE_ONCE(netdev->gro_flush_timeout, timeout);
+ list_for_each_entry(napi, &netdev->napi_list, dev_list)
+ napi_set_gro_flush_timeout(napi, timeout);
+}
+
int rps_cpumask_housekeeping(struct cpumask *mask);
#if defined(CONFIG_DEBUG_NET) && defined(CONFIG_BPF_SYSCALL)
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 0a0bbbfb39b4..daa32b5a6623 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -409,7 +409,7 @@ NETDEVICE_SHOW_RW(tx_queue_len, fmt_dec);
static int change_gro_flush_timeout(struct net_device *dev, unsigned long val)
{
- WRITE_ONCE(dev->gro_flush_timeout, val);
+ netdev_set_gro_flush_timeout(dev, val);
return 0;
}
diff --git a/net/core/netdev-genl.c b/net/core/netdev-genl.c
index f1e505ad069f..68ec8265567d 100644
--- a/net/core/netdev-genl.c
+++ b/net/core/netdev-genl.c
@@ -161,6 +161,7 @@ netdev_nl_napi_fill_one(struct sk_buff *rsp, struct napi_struct *napi,
const struct genl_info *info)
{
int napi_defer_hard_irqs;
+ unsigned long gro_flush_timeout;
void *hdr;
pid_t pid;
@@ -196,6 +197,10 @@ netdev_nl_napi_fill_one(struct sk_buff *rsp, struct napi_struct *napi,
if (nla_put_s32(rsp, NETDEV_A_NAPI_DEFER_HARD_IRQS, napi_defer_hard_irqs))
goto nla_put_failure;
+ gro_flush_timeout = napi_get_gro_flush_timeout(napi);
+ if (nla_put_uint(rsp, NETDEV_A_NAPI_GRO_FLUSH_TIMEOUT, gro_flush_timeout))
+ goto nla_put_failure;
+
genlmsg_end(rsp, hdr);
return 0;
diff --git a/tools/include/uapi/linux/netdev.h b/tools/include/uapi/linux/netdev.h
index 43bb1aad9611..b088a34e9254 100644
--- a/tools/include/uapi/linux/netdev.h
+++ b/tools/include/uapi/linux/netdev.h
@@ -122,6 +122,7 @@ enum {
NETDEV_A_NAPI_IRQ,
NETDEV_A_NAPI_PID,
NETDEV_A_NAPI_DEFER_HARD_IRQS,
+ NETDEV_A_NAPI_GRO_FLUSH_TIMEOUT,
__NETDEV_A_NAPI_MAX,
NETDEV_A_NAPI_MAX = (__NETDEV_A_NAPI_MAX - 1)
--
2.25.1
^ permalink raw reply related [flat|nested] 22+ messages in thread
* [RFC net-next v2 6/9] netdev-genl: Support setting per-NAPI config values
2024-09-08 16:06 [RFC net-next v2 0/9] Add support for per-NAPI config via netlink Joe Damato
` (4 preceding siblings ...)
2024-09-08 16:06 ` [RFC net-next v2 5/9] net: napi: Make gro_flush_timeout per-NAPI Joe Damato
@ 2024-09-08 16:06 ` Joe Damato
2024-09-08 16:06 ` [RFC net-next v2 7/9] bnxt: Add support for napi storage Joe Damato
` (2 subsequent siblings)
8 siblings, 0 replies; 22+ messages in thread
From: Joe Damato @ 2024-09-08 16:06 UTC (permalink / raw)
To: netdev
Cc: mkarsten, kuba, skhawaja, sdf, bjorn, amritha.nambiar,
sridhar.samudrala, Joe Damato, Donald Hunter, David S. Miller,
Eric Dumazet, Paolo Abeni, Jesper Dangaard Brouer, Xuan Zhuo,
open list
Add support to set per-NAPI defer_hard_irqs and gro_flush_timeout.
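For example (as shown in the cover letter), both values can be set on
a single NAPI, identified by its ifindex and index:

$ sudo ./tools/net/ynl/cli.py \
    --spec Documentation/netlink/specs/netdev.yaml \
    --do napi-set \
    --json='{"ifindex": 7, "index": 2,
             "defer-hard-irqs": 111,
             "gro-flush-timeout": 11111}'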
Signed-off-by: Joe Damato <jdamato@fastly.com>
---
Documentation/netlink/specs/netdev.yaml | 12 ++++++
include/uapi/linux/netdev.h | 1 +
net/core/netdev-genl-gen.c | 15 +++++++
net/core/netdev-genl-gen.h | 1 +
net/core/netdev-genl.c | 52 +++++++++++++++++++++++++
tools/include/uapi/linux/netdev.h | 1 +
6 files changed, 82 insertions(+)
diff --git a/Documentation/netlink/specs/netdev.yaml b/Documentation/netlink/specs/netdev.yaml
index 3034c480d0b4..7c0c25e5b808 100644
--- a/Documentation/netlink/specs/netdev.yaml
+++ b/Documentation/netlink/specs/netdev.yaml
@@ -642,6 +642,18 @@ operations:
- rx-bytes
- tx-packets
- tx-bytes
+ -
+ name: napi-set
+ doc: Set configurable NAPI instance settings.
+ attribute-set: napi
+ flags: [ admin-perm ]
+ do:
+ request:
+ attributes:
+ - ifindex
+ - index
+ - defer-hard-irqs
+ - gro-flush-timeout
mcast-groups:
list:
diff --git a/include/uapi/linux/netdev.h b/include/uapi/linux/netdev.h
index fd02b5b3b081..4e6941b45f3e 100644
--- a/include/uapi/linux/netdev.h
+++ b/include/uapi/linux/netdev.h
@@ -189,6 +189,7 @@ enum {
NETDEV_CMD_QUEUE_GET,
NETDEV_CMD_NAPI_GET,
NETDEV_CMD_QSTATS_GET,
+ NETDEV_CMD_NAPI_SET,
__NETDEV_CMD_MAX,
NETDEV_CMD_MAX = (__NETDEV_CMD_MAX - 1)
diff --git a/net/core/netdev-genl-gen.c b/net/core/netdev-genl-gen.c
index 8350a0afa9ec..209c56cf08f1 100644
--- a/net/core/netdev-genl-gen.c
+++ b/net/core/netdev-genl-gen.c
@@ -74,6 +74,14 @@ static const struct nla_policy netdev_qstats_get_nl_policy[NETDEV_A_QSTATS_SCOPE
[NETDEV_A_QSTATS_SCOPE] = NLA_POLICY_MASK(NLA_UINT, 0x1),
};
+/* NETDEV_CMD_NAPI_SET - set */
+static const struct nla_policy netdev_napi_set_nl_policy[NETDEV_A_NAPI_GRO_FLUSH_TIMEOUT + 1] = {
+ [NETDEV_A_NAPI_IFINDEX] = { .type = NLA_U32, },
+ [NETDEV_A_NAPI_INDEX] = { .type = NLA_U32, },
+ [NETDEV_A_NAPI_DEFER_HARD_IRQS] = { .type = NLA_S32 },
+ [NETDEV_A_NAPI_GRO_FLUSH_TIMEOUT] = { .type = NLA_UINT },
+};
+
/* Ops table for netdev */
static const struct genl_split_ops netdev_nl_ops[] = {
{
@@ -151,6 +159,13 @@ static const struct genl_split_ops netdev_nl_ops[] = {
.maxattr = NETDEV_A_QSTATS_SCOPE,
.flags = GENL_CMD_CAP_DUMP,
},
+ {
+ .cmd = NETDEV_CMD_NAPI_SET,
+ .doit = netdev_nl_napi_set_doit,
+ .policy = netdev_napi_set_nl_policy,
+ .maxattr = NETDEV_A_NAPI_GRO_FLUSH_TIMEOUT,
+ .flags = GENL_ADMIN_PERM | GENL_CMD_CAP_DO,
+ },
};
static const struct genl_multicast_group netdev_nl_mcgrps[] = {
diff --git a/net/core/netdev-genl-gen.h b/net/core/netdev-genl-gen.h
index 4db40fd5b4a9..b70cb0f20acb 100644
--- a/net/core/netdev-genl-gen.h
+++ b/net/core/netdev-genl-gen.h
@@ -28,6 +28,7 @@ int netdev_nl_queue_get_dumpit(struct sk_buff *skb,
struct netlink_callback *cb);
int netdev_nl_napi_get_doit(struct sk_buff *skb, struct genl_info *info);
int netdev_nl_napi_get_dumpit(struct sk_buff *skb, struct netlink_callback *cb);
+int netdev_nl_napi_set_doit(struct sk_buff *skb, struct genl_info *info);
int netdev_nl_qstats_get_dumpit(struct sk_buff *skb,
struct netlink_callback *cb);
diff --git a/net/core/netdev-genl.c b/net/core/netdev-genl.c
index 68ec8265567d..fca1670706cc 100644
--- a/net/core/netdev-genl.c
+++ b/net/core/netdev-genl.c
@@ -303,6 +303,58 @@ int netdev_nl_napi_get_dumpit(struct sk_buff *skb, struct netlink_callback *cb)
return err;
}
+static int
+netdev_nl_napi_set_config(struct napi_storage *napi_storage, struct genl_info *info)
+{
+ u64 gro_flush_timeout = 0;
+ int defer = 0;
+
+ if (info->attrs[NETDEV_A_NAPI_DEFER_HARD_IRQS]) {
+ defer = nla_get_s32(info->attrs[NETDEV_A_NAPI_DEFER_HARD_IRQS]);
+ WRITE_ONCE(napi_storage->defer_hard_irqs, defer);
+ }
+
+ if (info->attrs[NETDEV_A_NAPI_GRO_FLUSH_TIMEOUT]) {
+ gro_flush_timeout = nla_get_uint(info->attrs[NETDEV_A_NAPI_GRO_FLUSH_TIMEOUT]);
+ WRITE_ONCE(napi_storage->gro_flush_timeout, gro_flush_timeout);
+ }
+
+ return 0;
+}
+
+int netdev_nl_napi_set_doit(struct sk_buff *skb, struct genl_info *info)
+{
+ struct napi_storage *napi_storage;
+ struct net_device *netdev;
+ u32 ifindex;
+ u32 index;
+ int err;
+
+ if (GENL_REQ_ATTR_CHECK(info, NETDEV_A_NAPI_IFINDEX))
+ return -EINVAL;
+
+ if (GENL_REQ_ATTR_CHECK(info, NETDEV_A_NAPI_INDEX))
+ return -EINVAL;
+
+ ifindex = nla_get_u32(info->attrs[NETDEV_A_NAPI_IFINDEX]);
+ index = nla_get_u32(info->attrs[NETDEV_A_NAPI_INDEX]);
+
+ rtnl_lock();
+
+ netdev = __dev_get_by_index(genl_info_net(info), ifindex);
+ if (netdev) {
+ napi_storage = &netdev->napi_storage[index];
+ err = netdev_nl_napi_set_config(napi_storage, info);
+ } else {
+ NL_SET_BAD_ATTR(info->extack, info->attrs[NETDEV_A_NAPI_IFINDEX]);
+ err = -ENODEV;
+ }
+
+ rtnl_unlock();
+
+ return err;
+}
+
static int
netdev_nl_queue_fill_one(struct sk_buff *rsp, struct net_device *netdev,
u32 q_idx, u32 q_type, const struct genl_info *info)
diff --git a/tools/include/uapi/linux/netdev.h b/tools/include/uapi/linux/netdev.h
index b088a34e9254..4c5bfbc85504 100644
--- a/tools/include/uapi/linux/netdev.h
+++ b/tools/include/uapi/linux/netdev.h
@@ -188,6 +188,7 @@ enum {
NETDEV_CMD_QUEUE_GET,
NETDEV_CMD_NAPI_GET,
NETDEV_CMD_QSTATS_GET,
+ NETDEV_CMD_NAPI_SET,
__NETDEV_CMD_MAX,
NETDEV_CMD_MAX = (__NETDEV_CMD_MAX - 1)
--
2.25.1
^ permalink raw reply related [flat|nested] 22+ messages in thread
* [RFC net-next v2 7/9] bnxt: Add support for napi storage
2024-09-08 16:06 [RFC net-next v2 0/9] Add support for per-NAPI config via netlink Joe Damato
` (5 preceding siblings ...)
2024-09-08 16:06 ` [RFC net-next v2 6/9] netdev-genl: Support setting per-NAPI config values Joe Damato
@ 2024-09-08 16:06 ` Joe Damato
2024-09-08 16:06 ` [RFC net-next v2 8/9] mlx5: " Joe Damato
2024-09-08 16:06 ` [RFC net-next v2 9/9] mlx4: Add support for napi storage to RX CQs Joe Damato
8 siblings, 0 replies; 22+ messages in thread
From: Joe Damato @ 2024-09-08 16:06 UTC (permalink / raw)
To: netdev
Cc: mkarsten, kuba, skhawaja, sdf, bjorn, amritha.nambiar,
sridhar.samudrala, Joe Damato, Michael Chan, David S. Miller,
Eric Dumazet, Paolo Abeni, open list
Use netif_napi_add_storage to assign per-NAPI storage when initializing
NAPIs.
Signed-off-by: Joe Damato <jdamato@fastly.com>
---
drivers/net/ethernet/broadcom/bnxt/bnxt.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index c9248ed9330c..6f231a43775b 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -10986,7 +10986,8 @@ static void bnxt_init_napi(struct bnxt *bp)
cp_nr_rings--;
for (i = 0; i < cp_nr_rings; i++) {
bnapi = bp->bnapi[i];
- netif_napi_add(bp->dev, &bnapi->napi, poll_fn);
+ netif_napi_add_storage(bp->dev, &bnapi->napi, poll_fn,
+ NAPI_POLL_WEIGHT, bnapi->index);
}
if (BNXT_CHIP_TYPE_NITRO_A0(bp)) {
bnapi = bp->bnapi[cp_nr_rings];
--
2.25.1
^ permalink raw reply related [flat|nested] 22+ messages in thread
* [RFC net-next v2 8/9] mlx5: Add support for napi storage
2024-09-08 16:06 [RFC net-next v2 0/9] Add support for per-NAPI config via netlink Joe Damato
` (6 preceding siblings ...)
2024-09-08 16:06 ` [RFC net-next v2 7/9] bnxt: Add support for napi storage Joe Damato
@ 2024-09-08 16:06 ` Joe Damato
2024-09-08 16:06 ` [RFC net-next v2 9/9] mlx4: Add support for napi storage to RX CQs Joe Damato
8 siblings, 0 replies; 22+ messages in thread
From: Joe Damato @ 2024-09-08 16:06 UTC (permalink / raw)
To: netdev
Cc: mkarsten, kuba, skhawaja, sdf, bjorn, amritha.nambiar,
sridhar.samudrala, Joe Damato, Saeed Mahameed, Tariq Toukan,
Leon Romanovsky, David S. Miller, Eric Dumazet, Paolo Abeni,
open list:MELLANOX MLX5 core VPI driver, open list
Use netif_napi_add_storage to assign per-NAPI storage when initializing
NAPIs.
Signed-off-by: Joe Damato <jdamato@fastly.com>
---
drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 47e7a80d221b..2b718ee61140 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -2696,7 +2696,7 @@ static int mlx5e_open_channel(struct mlx5e_priv *priv, int ix,
c->aff_mask = irq_get_effective_affinity_mask(irq);
c->lag_port = mlx5e_enumerate_lag_port(mdev, ix);
- netif_napi_add(netdev, &c->napi, mlx5e_napi_poll);
+ netif_napi_add_storage(netdev, &c->napi, mlx5e_napi_poll, NAPI_POLL_WEIGHT, ix);
netif_napi_set_irq(&c->napi, irq);
err = mlx5e_open_queues(c, params, cparam);
--
2.25.1
^ permalink raw reply related [flat|nested] 22+ messages in thread
* [RFC net-next v2 9/9] mlx4: Add support for napi storage to RX CQs
2024-09-08 16:06 [RFC net-next v2 0/9] Add support for per-NAPI config via netlink Joe Damato
` (7 preceding siblings ...)
2024-09-08 16:06 ` [RFC net-next v2 8/9] mlx5: " Joe Damato
@ 2024-09-08 16:06 ` Joe Damato
8 siblings, 0 replies; 22+ messages in thread
From: Joe Damato @ 2024-09-08 16:06 UTC (permalink / raw)
To: netdev
Cc: mkarsten, kuba, skhawaja, sdf, bjorn, amritha.nambiar,
sridhar.samudrala, Joe Damato, Tariq Toukan, David S. Miller,
Eric Dumazet, Paolo Abeni,
open list:MELLANOX MLX4 core VPI driver, open list
Use netif_napi_add_storage to assign per-NAPI storage when initializing
RX CQ NAPIs.
Presently, struct napi_storage only has two fields, both used for RX,
so there is no need to support TX CQs yet.
Signed-off-by: Joe Damato <jdamato@fastly.com>
---
drivers/net/ethernet/mellanox/mlx4/en_cq.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_cq.c b/drivers/net/ethernet/mellanox/mlx4/en_cq.c
index 461cc2c79c71..74f4d8b63ffb 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_cq.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_cq.c
@@ -156,7 +156,8 @@ int mlx4_en_activate_cq(struct mlx4_en_priv *priv, struct mlx4_en_cq *cq,
break;
case RX:
cq->mcq.comp = mlx4_en_rx_irq;
- netif_napi_add(cq->dev, &cq->napi, mlx4_en_poll_rx_cq);
+ netif_napi_add_storage(cq->dev, &cq->napi, mlx4_en_poll_rx_cq,
+ NAPI_POLL_WEIGHT, cq_idx);
netif_napi_set_irq(&cq->napi, irq);
napi_enable(&cq->napi);
netif_queue_set_napi(cq->dev, cq_idx, NETDEV_QUEUE_TYPE_RX, &cq->napi);
--
2.25.1
^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: [RFC net-next v2 4/9] netdev-genl: Dump napi_defer_hard_irqs
2024-09-08 16:06 ` [RFC net-next v2 4/9] netdev-genl: Dump napi_defer_hard_irqs Joe Damato
@ 2024-09-08 20:36 ` Joe Damato
0 siblings, 0 replies; 22+ messages in thread
From: Joe Damato @ 2024-09-08 20:36 UTC (permalink / raw)
To: netdev
Cc: mkarsten, kuba, skhawaja, sdf, bjorn, amritha.nambiar,
sridhar.samudrala, Donald Hunter, David S. Miller, Eric Dumazet,
Paolo Abeni, Jesper Dangaard Brouer, Xuan Zhuo, open list
On Sun, Sep 08, 2024 at 04:06:38PM +0000, Joe Damato wrote:
> Support dumping defer_hard_irqs for a NAPI ID.
>
> Signed-off-by: Joe Damato <jdamato@fastly.com>
> ---
> Documentation/netlink/specs/netdev.yaml | 8 ++++++++
> include/uapi/linux/netdev.h | 1 +
> net/core/netdev-genl.c | 5 +++++
> tools/include/uapi/linux/netdev.h | 1 +
> 4 files changed, 15 insertions(+)
[...]
> diff --git a/net/core/netdev-genl.c b/net/core/netdev-genl.c
> index 9561841b9d2d..f1e505ad069f 100644
> --- a/net/core/netdev-genl.c
> +++ b/net/core/netdev-genl.c
> @@ -160,6 +160,7 @@ static int
> netdev_nl_napi_fill_one(struct sk_buff *rsp, struct napi_struct *napi,
> const struct genl_info *info)
> {
> + int napi_defer_hard_irqs;
I've already made this a u32 in my rfcv2 branch.
> void *hdr;
> pid_t pid;
>
> @@ -191,6 +192,10 @@ netdev_nl_napi_fill_one(struct sk_buff *rsp, struct napi_struct *napi,
> goto nla_put_failure;
> }
>
> + napi_defer_hard_irqs = napi_get_defer_hard_irqs(napi);
> + if (nla_put_s32(rsp, NETDEV_A_NAPI_DEFER_HARD_IRQS, napi_defer_hard_irqs))
> + goto nla_put_failure;
> +
> genlmsg_end(rsp, hdr);
>
> return 0;
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC net-next v2 1/9] net: napi: Add napi_storage
2024-09-08 16:06 ` [RFC net-next v2 1/9] net: napi: Add napi_storage Joe Damato
@ 2024-09-08 20:49 ` Joe Damato
2024-09-09 22:37 ` Stanislav Fomichev
2024-09-09 23:40 ` Jakub Kicinski
1 sibling, 1 reply; 22+ messages in thread
From: Joe Damato @ 2024-09-08 20:49 UTC (permalink / raw)
To: netdev
Cc: mkarsten, kuba, skhawaja, sdf, bjorn, amritha.nambiar,
sridhar.samudrala, David S. Miller, Eric Dumazet, Paolo Abeni,
Jonathan Corbet, Jiri Pirko, Sebastian Andrzej Siewior,
Lorenzo Bianconi, open list:DOCUMENTATION, open list
On Sun, Sep 08, 2024 at 04:06:35PM +0000, Joe Damato wrote:
> Add a persistent NAPI storage area for NAPI configuration to the core.
> Drivers opt-in to setting the storage for a NAPI by passing an index
> when calling netif_napi_add_storage.
>
> napi_storage is allocated in alloc_netdev_mqs, freed in free_netdev
> (after the NAPIs are deleted), and set to 0 when napi_enable is called.
>
> Signed-off-by: Joe Damato <jdamato@fastly.com>
> ---
> .../networking/net_cachelines/net_device.rst | 1 +
> include/linux/netdevice.h | 34 +++++++++++++++++++
> net/core/dev.c | 18 +++++++++-
> 3 files changed, 52 insertions(+), 1 deletion(-)
>
[...]
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -6719,6 +6719,9 @@ void napi_enable(struct napi_struct *n)
> if (n->dev->threaded && n->thread)
> new |= NAPIF_STATE_THREADED;
> } while (!try_cmpxchg(&n->state, &val, new));
> +
> + if (n->napi_storage)
> + memset(n->napi_storage, 0, sizeof(*n->napi_storage));
This part is very obviously wrong ;)
I think when I was reading the other thread about resetting on
channel change [1] I got a bit confused.
Maybe what was intended was on napi_enable, do nothing / remove the
above code (which I suppose would give the persistence desired?).
But modify net/ethtool/channels.c to reset NAPIs to the global
(sysfs) settings? Not sure how to balance both persistence with
queue count changes in a way that makes sense for users of the API.
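Something like this is what I had in mind for the ethtool path (the
helper is hypothetical, not part of this series):

    /* Hypothetical: reset all per-NAPI storage back to the
     * device-wide (sysfs) values; would be called from
     * net/ethtool/channels.c when the channel count changes.
     */
    static void netdev_reset_napi_storage(struct net_device *dev)
    {
            unsigned int i, maxqs;

            maxqs = max(dev->num_rx_queues, dev->num_tx_queues);
            for (i = 0; i < maxqs; i++) {
                    dev->napi_storage[i].gro_flush_timeout =
                            READ_ONCE(dev->gro_flush_timeout);
                    dev->napi_storage[i].defer_hard_irqs =
                            READ_ONCE(dev->napi_defer_hard_irqs);
            }
    }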
And, I didn't quite follow the bits about:
1. The proposed ring to NAPI mapping
2. The two step "takeover" which seemed to imply that we might
pull napi_id into napi_storage? Or maybe I just read that part
wrong?
I'll go back re-re-(re?)-read those emails to see if I can figure
out what the desired implementation is.
I'd welcome any/all feedback/clarifications on this part in
particular.
[1]: https://lore.kernel.org/netdev/20240903124008.4793c087@kernel.org/
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC net-next v2 1/9] net: napi: Add napi_storage
2024-09-08 20:49 ` Joe Damato
@ 2024-09-09 22:37 ` Stanislav Fomichev
2024-09-10 6:16 ` Joe Damato
0 siblings, 1 reply; 22+ messages in thread
From: Stanislav Fomichev @ 2024-09-09 22:37 UTC (permalink / raw)
To: Joe Damato, netdev, mkarsten, kuba, skhawaja, sdf, bjorn,
amritha.nambiar, sridhar.samudrala, David S. Miller, Eric Dumazet,
Paolo Abeni, Jonathan Corbet, Jiri Pirko,
Sebastian Andrzej Siewior, Lorenzo Bianconi,
open list:DOCUMENTATION, open list
On 09/08, Joe Damato wrote:
> On Sun, Sep 08, 2024 at 04:06:35PM +0000, Joe Damato wrote:
> > Add a persistent NAPI storage area for NAPI configuration to the core.
> > Drivers opt-in to setting the storage for a NAPI by passing an index
> > when calling netif_napi_add_storage.
> >
> > napi_storage is allocated in alloc_netdev_mqs, freed in free_netdev
> > (after the NAPIs are deleted), and set to 0 when napi_enable is called.
> >
> > Signed-off-by: Joe Damato <jdamato@fastly.com>
> > ---
> > .../networking/net_cachelines/net_device.rst | 1 +
> > include/linux/netdevice.h | 34 +++++++++++++++++++
> > net/core/dev.c | 18 +++++++++-
> > 3 files changed, 52 insertions(+), 1 deletion(-)
> >
>
> [...]
>
> > --- a/net/core/dev.c
> > +++ b/net/core/dev.c
> > @@ -6719,6 +6719,9 @@ void napi_enable(struct napi_struct *n)
> > if (n->dev->threaded && n->thread)
> > new |= NAPIF_STATE_THREADED;
> > } while (!try_cmpxchg(&n->state, &val, new));
> > +
> > + if (n->napi_storage)
> > + memset(n->napi_storage, 0, sizeof(*n->napi_storage));
>
> This part is very obviously wrong ;)
>
> I think when I was reading the other thread about resetting on
> channel change [1] I got a bit confused.
>
> Maybe what was intended was on napi_enable, do nothing / remove the
> above code (which I suppose would give the persistence desired?).
>
> But modify net/ethtool/channels.c to reset NAPIs to the global
> (sysfs) settings? Not sure how to balance both persistence with
> queue count changes in a way that makes sense for users of the API.
>
> And, I didn't quite follow the bits about:
> 1. The proposed ring to NAPI mapping
[..]
> 2. The two step "takeover" which seemed to imply that we might
> pull napi_id into napi_storage? Or maybe I just read that part
> wrong?
Yes, the suggestion here is to drop patch #2 from your series and
keep napi_id as a user-facing 'id' for the persistent storage. But,
obviously, this requires persistent napi_id(s) that survive device
resets.
The function that allocates a new napi_id is napi_hash_add, called
from netif_napi_add_weight. So we can do either of the following:
1. Keep everything as is, but add a napi_rehash somewhere
around napi_enable to 'take over' the previously allocated napi_id.
2. (preferred) Separate napi_hash_add out of netif_napi_add_weight,
and have some new napi_hash_with_id(previous_napi_id) to expose it to
userspace. Then convert mlx5 to this new interface.
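A minimal sketch of option 2 (illustrative only, mirroring what
napi_hash_add currently does internally in net/core/dev.c):

    /* Hash a NAPI under a previously allocated ID instead of
     * allocating a fresh one.
     */
    static void napi_hash_with_id(struct napi_struct *napi,
                                  unsigned int napi_id)
    {
            spin_lock(&napi_hash_lock);
            napi->napi_id = napi_id;
            hlist_add_head_rcu(&napi->napi_hash_node,
                               &napi_hash[napi->napi_id % HASH_SIZE(napi_hash)]);
            spin_unlock(&napi_hash_lock);
    }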
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC net-next v2 1/9] net: napi: Add napi_storage
2024-09-08 16:06 ` [RFC net-next v2 1/9] net: napi: Add napi_storage Joe Damato
2024-09-08 20:49 ` Joe Damato
@ 2024-09-09 23:40 ` Jakub Kicinski
2024-09-10 6:13 ` Joe Damato
2024-09-11 7:51 ` Joe Damato
1 sibling, 2 replies; 22+ messages in thread
From: Jakub Kicinski @ 2024-09-09 23:40 UTC (permalink / raw)
To: Joe Damato
Cc: netdev, mkarsten, skhawaja, sdf, bjorn, amritha.nambiar,
sridhar.samudrala, David S. Miller, Eric Dumazet, Paolo Abeni,
Jonathan Corbet, Jiri Pirko, Sebastian Andrzej Siewior,
Lorenzo Bianconi, open list:DOCUMENTATION, open list
On Sun, 8 Sep 2024 16:06:35 +0000 Joe Damato wrote:
> Add a persistent NAPI storage area for NAPI configuration to the core.
> Drivers opt-in to setting the storage for a NAPI by passing an index
> when calling netif_napi_add_storage.
>
> napi_storage is allocated in alloc_netdev_mqs, freed in free_netdev
> (after the NAPIs are deleted), and set to 0 when napi_enable is called.
> enum {
> @@ -2009,6 +2019,9 @@ enum netdev_reg_state {
> * @dpll_pin: Pointer to the SyncE source pin of a DPLL subsystem,
> * where the clock is recovered.
> *
> + * @napi_storage: An array of napi_storage structures containing per-NAPI
> + * settings.
FWIW you can use inline kdoc; with the size of the struct it's easier
to find it. Also this doesn't need to be accessed from fastpath so you
can move it down.
> +/**
> + * netif_napi_add_storage - initialize a NAPI context and set storage area
> + * @dev: network device
> + * @napi: NAPI context
> + * @poll: polling function
> + * @weight: the poll weight of this NAPI
> + * @index: the NAPI index
> + */
> +static inline void
> +netif_napi_add_storage(struct net_device *dev, struct napi_struct *napi,
> + int (*poll)(struct napi_struct *, int), int weight,
> + int index)
> +{
> + napi->index = index;
> + napi->napi_storage = &dev->napi_storage[index];
> + netif_napi_add_weight(dev, napi, poll, weight);
You can drop the weight param, just pass NAPI_POLL_WEIGHT.
Then -- change netif_napi_add_weight() to prevent it from
calling napi_hash_add() if it has index >= 0.
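Putting those two comments together, the helper could end up roughly
like this (a sketch of the suggested direction, not the posted code):

	static inline void
	netif_napi_add_storage(struct net_device *dev, struct napi_struct *napi,
			       int (*poll)(struct napi_struct *, int), int index)
	{
		napi->index = index;
		napi->napi_storage = &dev->napi_storage[index];
		/* netif_napi_add_weight() is assumed to skip
		 * napi_hash_add() when napi->index >= 0, per the
		 * comment above.
		 */
		netif_napi_add_weight(dev, napi, poll, NAPI_POLL_WEIGHT);
	}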
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 22c3f14d9287..ca90e8cab121 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -6719,6 +6719,9 @@ void napi_enable(struct napi_struct *n)
> if (n->dev->threaded && n->thread)
> new |= NAPIF_STATE_THREADED;
> } while (!try_cmpxchg(&n->state, &val, new));
> +
> + if (n->napi_storage)
> + memset(n->napi_storage, 0, sizeof(*n->napi_storage));
And here inherit the settings and the NAPI ID from storage, then call
napi_hash_add(). napi_hash_add() will need a minor diff to use the
existing ID if already assigned.
And the inverse of that has to happen in napi_disable() (unhash, save
settings to storage), and __netif_napi_del() (don't unhash if it has
index).
I think that should work?
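In other words, a loose sketch of the suggested lifecycle (the per-NAPI
field names are assumed from this series, not existing code):

	void napi_enable(struct napi_struct *n)
	{
		/* ... existing state transition ... */
		if (n->napi_storage) {
			/* Inherit persisted settings instead of zeroing. */
			n->defer_hard_irqs = n->napi_storage->defer_hard_irqs;
			n->gro_flush_timeout = n->napi_storage->gro_flush_timeout;
			napi_hash_add(n); /* re-uses a previously assigned ID */
		}
	}

	void napi_disable(struct napi_struct *n)
	{
		/* ... existing teardown ... */
		if (n->napi_storage) {
			/* Save settings back and unhash; __netif_napi_del()
			 * would then skip the unhash for indexed NAPIs.
			 */
			n->napi_storage->defer_hard_irqs = n->defer_hard_irqs;
			n->napi_storage->gro_flush_timeout = n->gro_flush_timeout;
			napi_hash_del(n);
		}
	}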
* Re: [RFC net-next v2 1/9] net: napi: Add napi_storage
2024-09-09 23:40 ` Jakub Kicinski
@ 2024-09-10 6:13 ` Joe Damato
2024-09-10 14:52 ` Jakub Kicinski
2024-09-11 7:51 ` Joe Damato
1 sibling, 1 reply; 22+ messages in thread
From: Joe Damato @ 2024-09-10 6:13 UTC (permalink / raw)
To: Jakub Kicinski
Cc: netdev, mkarsten, skhawaja, sdf, bjorn, amritha.nambiar,
sridhar.samudrala, David S. Miller, Eric Dumazet, Paolo Abeni,
Jonathan Corbet, Jiri Pirko, Sebastian Andrzej Siewior,
Lorenzo Bianconi, open list:DOCUMENTATION, open list
On Mon, Sep 09, 2024 at 04:40:39PM -0700, Jakub Kicinski wrote:
> On Sun, 8 Sep 2024 16:06:35 +0000 Joe Damato wrote:
> > Add a persistent NAPI storage area for NAPI configuration to the core.
> > Drivers opt-in to setting the storage for a NAPI by passing an index
> > when calling netif_napi_add_storage.
> >
> > napi_storage is allocated in alloc_netdev_mqs, freed in free_netdev
> > (after the NAPIs are deleted), and set to 0 when napi_enable is called.
>
> > enum {
> > @@ -2009,6 +2019,9 @@ enum netdev_reg_state {
> > * @dpll_pin: Pointer to the SyncE source pin of a DPLL subsystem,
> > * where the clock is recovered.
> > *
> > + * @napi_storage: An array of napi_storage structures containing per-NAPI
> > + * settings.
>
> FWIW you can use inline kdoc; with the size of the struct it's easier
> to find it. Also this doesn't need to be accessed from fastpath so you
> can move it down.
OK. I figured since it was being deref'd in napi_complete_done
(where we previously read napi_defer_hard_irqs and
gro_flush_timeout) it needed to be in the fast path.
I'll move it down for the next RFC.
> > +/**
> > + * netif_napi_add_storage - initialize a NAPI context and set storage area
> > + * @dev: network device
> > + * @napi: NAPI context
> > + * @poll: polling function
> > + * @weight: the poll weight of this NAPI
> > + * @index: the NAPI index
> > + */
> > +static inline void
> > +netif_napi_add_storage(struct net_device *dev, struct napi_struct *napi,
> > + int (*poll)(struct napi_struct *, int), int weight,
> > + int index)
> > +{
> > + napi->index = index;
> > + napi->napi_storage = &dev->napi_storage[index];
> > + netif_napi_add_weight(dev, napi, poll, weight);
>
> You can drop the weight param, just pass NAPI_POLL_WEIGHT.
OK will do.
> Then -- change netif_napi_add_weight() to prevent it from
> calling napi_hash_add() if it has index >= 0.
OK.
> > diff --git a/net/core/dev.c b/net/core/dev.c
> > index 22c3f14d9287..ca90e8cab121 100644
> > --- a/net/core/dev.c
> > +++ b/net/core/dev.c
> > @@ -6719,6 +6719,9 @@ void napi_enable(struct napi_struct *n)
> > if (n->dev->threaded && n->thread)
> > new |= NAPIF_STATE_THREADED;
> > } while (!try_cmpxchg(&n->state, &val, new));
> > +
> > + if (n->napi_storage)
> > + memset(n->napi_storage, 0, sizeof(*n->napi_storage));
OK, your comments below will probably make more sense to me after I
try implementing it, but I'll definitely have some questions.
> And here inherit the settings and the NAPI ID from storage, then call
> napi_hash_add(). napi_hash_add() will need a minor diff to use the
> existing ID if already assigned.
I don't think I realized we settled on the NAPI ID being persistent.
I'm not opposed to that, I just think I missed that part in the
previous conversation.
I'll give it a shot and see what the next RFC looks like.
> And the inverse of that has to happen in napi_disable() (unhash, save
> settings to storage), and __netif_napi_del() (don't unhash if it has
> index).
>
> I think that should work?
Only one way to find out ;)
Separately: your comment about documenting rings to NAPIs... I am
not following that bit.
Is that a thing you meant should be documented for driver writers to
follow to reduce churn?
* Re: [RFC net-next v2 1/9] net: napi: Add napi_storage
2024-09-09 22:37 ` Stanislav Fomichev
@ 2024-09-10 6:16 ` Joe Damato
2024-09-10 14:56 ` Jakub Kicinski
0 siblings, 1 reply; 22+ messages in thread
From: Joe Damato @ 2024-09-10 6:16 UTC (permalink / raw)
To: Stanislav Fomichev
Cc: netdev, mkarsten, kuba, skhawaja, sdf, bjorn, amritha.nambiar,
sridhar.samudrala, David S. Miller, Eric Dumazet, Paolo Abeni,
Jonathan Corbet, Jiri Pirko, Sebastian Andrzej Siewior,
Lorenzo Bianconi, open list:DOCUMENTATION, open list
On Mon, Sep 09, 2024 at 03:37:41PM -0700, Stanislav Fomichev wrote:
> On 09/08, Joe Damato wrote:
> > On Sun, Sep 08, 2024 at 04:06:35PM +0000, Joe Damato wrote:
> > > Add a persistent NAPI storage area for NAPI configuration to the core.
> > > Drivers opt-in to setting the storage for a NAPI by passing an index
> > > when calling netif_napi_add_storage.
> > >
> > > napi_storage is allocated in alloc_netdev_mqs, freed in free_netdev
> > > (after the NAPIs are deleted), and set to 0 when napi_enable is called.
> > >
> > > Signed-off-by: Joe Damato <jdamato@fastly.com>
> > > ---
> > > .../networking/net_cachelines/net_device.rst | 1 +
> > > include/linux/netdevice.h | 34 +++++++++++++++++++
> > > net/core/dev.c | 18 +++++++++-
> > > 3 files changed, 52 insertions(+), 1 deletion(-)
> > >
> >
> > [...]
> >
> > > --- a/net/core/dev.c
> > > +++ b/net/core/dev.c
> > > @@ -6719,6 +6719,9 @@ void napi_enable(struct napi_struct *n)
> > > if (n->dev->threaded && n->thread)
> > > new |= NAPIF_STATE_THREADED;
> > > } while (!try_cmpxchg(&n->state, &val, new));
> > > +
> > > + if (n->napi_storage)
> > > + memset(n->napi_storage, 0, sizeof(*n->napi_storage));
> >
> > This part is very obviously wrong ;)
> >
> > I think when I was reading the other thread about resetting on
> > channel change [1] I got a bit confused.
> >
> > Maybe what was intended was on napi_enable, do nothing / remove the
> > above code (which I suppose would give the persistence desired?).
> >
> > But modify net/ethtool/channels.c to reset NAPIs to the global
> > (sysfs) settings? Not sure how to balance both persistence with
> > queue count changes in a way that makes sense for users of the API.
> >
> > And, I didn't quite follow the bits about:
> > 1. The proposed ring to NAPI mapping
>
> [..]
>
> > 2. The two step "takeover" which seemed to imply that we might
> > pull napi_id into napi_storage? Or maybe I just read that part
> > wrong?
>
> Yes, the suggestion here is to drop patch #2 from your series and
> keep napi_id as a user-facing 'id' for the persistent storage. But,
> obviously, this requires persistent napi_id(s) that survive device
> resets.
>
> The function that allocates new napi_id is napi_hash_add
> from netif_napi_add_weight. So we can do either of the following:
> 1. Keep everything as is, but add the napi_rehash somewhere
> around napi_enable to 'takeover' previously allocated napi_id.
> 2. (preferred) Separate napi_hash_add out of netif_napi_add_weight.
> And have some new napi_hash_with_id(previous_napi_id) to expose it to
> userspace. Then convert mlx5 to this new interface.
Jakub, is this what you were thinking too?
If this is the case, then the netlink code needs to be tweaked to
operate on NAPI IDs again (since they are persistent) instead of
ifindex + napi_storage index?
LMK.
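Hypothetically, a set operation keyed on the NAPI ID alone might then
look something like this (the op and attribute names here are
assumptions, not the posted spec):

  $ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
      --do napi-set \
      --json='{"id": 927, "defer-hard-irqs": 100, "gro-flush-timeout": 20000}'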
* Re: [RFC net-next v2 1/9] net: napi: Add napi_storage
2024-09-10 6:13 ` Joe Damato
@ 2024-09-10 14:52 ` Jakub Kicinski
2024-09-10 16:10 ` Joe Damato
0 siblings, 1 reply; 22+ messages in thread
From: Jakub Kicinski @ 2024-09-10 14:52 UTC (permalink / raw)
To: Joe Damato
Cc: netdev, mkarsten, skhawaja, sdf, bjorn, amritha.nambiar,
sridhar.samudrala, David S. Miller, Eric Dumazet, Paolo Abeni,
Jonathan Corbet, Jiri Pirko, Sebastian Andrzej Siewior,
Lorenzo Bianconi, open list:DOCUMENTATION, open list
On Tue, 10 Sep 2024 08:13:51 +0200 Joe Damato wrote:
> On Mon, Sep 09, 2024 at 04:40:39PM -0700, Jakub Kicinski wrote:
> > On Sun, 8 Sep 2024 16:06:35 +0000 Joe Damato wrote:
> > > Add a persistent NAPI storage area for NAPI configuration to the core.
> > > Drivers opt-in to setting the storage for a NAPI by passing an index
> > > when calling netif_napi_add_storage.
> > >
> > > napi_storage is allocated in alloc_netdev_mqs, freed in free_netdev
> > > (after the NAPIs are deleted), and set to 0 when napi_enable is called.
> >
> > > enum {
> > > @@ -2009,6 +2019,9 @@ enum netdev_reg_state {
> > > * @dpll_pin: Pointer to the SyncE source pin of a DPLL subsystem,
> > > * where the clock is recovered.
> > > *
> > > + * @napi_storage: An array of napi_storage structures containing per-NAPI
> > > + * settings.
> >
> > FWIW you can use inline kdoc; with the size of the struct it's easier
> > to find it. Also this doesn't need to be accessed from fastpath so you
> > can move it down.
>
> OK. I figured since it was being deref'd in napi_complete_done
> (where we previously read napi_defer_hard_irqs and
> gro_flush_timeout) it needed to be in the fast path.
>
> I'll move it down for the next RFC.
Hm, fair point. In my mind I expected we'd still add the fast path fields
to NAPI instances. And the storage would only be there to stash that
information for the period of time when real NAPI instances are not
present (napi_disable() -> napi_enable() cycles).
But looking at napi_struct, all the cachelines seem full, anyway, so we
may as well split the info. No strong preference, feel free to keep as
is, then. But maybe rename from napi_storage to napi_config or such?
> > > diff --git a/net/core/dev.c b/net/core/dev.c
> > > index 22c3f14d9287..ca90e8cab121 100644
> > > --- a/net/core/dev.c
> > > +++ b/net/core/dev.c
> > > @@ -6719,6 +6719,9 @@ void napi_enable(struct napi_struct *n)
> > > if (n->dev->threaded && n->thread)
> > > new |= NAPIF_STATE_THREADED;
> > > } while (!try_cmpxchg(&n->state, &val, new));
> > > +
> > > + if (n->napi_storage)
> > > + memset(n->napi_storage, 0, sizeof(*n->napi_storage));
>
> OK, your comments below will probably make more sense to me after I
> try implementing it, but I'll definitely have some questions.
>
> > And here inherit the settings and the NAPI ID from storage, then call
> > napi_hash_add(). napi_hash_add() will need a minor diff to use the
> > existing ID if already assigned.
>
> I don't think I realized we settled on the NAPI ID being persistent.
> I'm not opposed to that, I just think I missed that part in the
> previous conversation.
>
> I'll give it a shot and see what the next RFC looks like.
The main reason to try to make NAPI ID persistent from the start is that
if it works we don't have to add index to the uAPI. I don't feel
strongly about it, if you or anyone else has arguments against / why
it won't work.
> > And the inverse of that has to happen in napi_disable() (unhash, save
> > settings to storage), and __netif_napi_del() (don't unhash if it has
> > index).
> >
> > I think that should work?
>
> Only one way to find out ;)
>
> Separately: your comment about documenting rings to NAPIs... I am
> not following that bit.
>
> Is that a thing you meant should be documented for driver writers to
> follow to reduce churn?
Which comment?
* Re: [RFC net-next v2 1/9] net: napi: Add napi_storage
2024-09-10 6:16 ` Joe Damato
@ 2024-09-10 14:56 ` Jakub Kicinski
0 siblings, 0 replies; 22+ messages in thread
From: Jakub Kicinski @ 2024-09-10 14:56 UTC (permalink / raw)
To: Joe Damato
Cc: Stanislav Fomichev, netdev, mkarsten, skhawaja, sdf, bjorn,
amritha.nambiar, sridhar.samudrala, David S. Miller, Eric Dumazet,
Paolo Abeni, Jonathan Corbet, Jiri Pirko,
Sebastian Andrzej Siewior, Lorenzo Bianconi,
open list:DOCUMENTATION, open list
On Tue, 10 Sep 2024 08:16:05 +0200 Joe Damato wrote:
> > > 2. The two step "takeover" which seemed to imply that we might
> > > pull napi_id into napi_storage? Or maybe I just read that part
> > > wrong?
> >
> > Yes, the suggestion here is to drop patch #2 from your series and
> > keep napi_id as a user-facing 'id' for the persistent storage. But,
> > obviously, this requires persistent napi_id(s) that survive device
> > resets.
> >
> > The function that allocates new napi_id is napi_hash_add
> > from netif_napi_add_weight. So we can do either of the following:
> > 1. Keep everything as is, but add the napi_rehash somewhere
> > around napi_enable to 'takeover' previously allocated napi_id.
> > 2. (preferred) Separate napi_hash_add out of netif_napi_add_weight.
> > And have some new napi_hash_with_id(previous_napi_id) to expose it to
> > userspace. Then convert mlx5 to this new interface.
>
> Jakub, is this what you were thinking too?
>
> If this is the case, then the netlink code needs to be tweaked to
> operate on NAPI IDs again (since they are persistent) instead of
> ifindex + napi_storage index?
No strong preference on the driver-facing API, TBH, your
netif_napi_add_storage() with some additional 'ifs' in the existing
functions should work.
Also no strong preference on the uAPI, avoiding adding new fields is
a little bit tempting. But if you think NAPI ID won't work or is less
clean - we can stick the index in.
* Re: [RFC net-next v2 1/9] net: napi: Add napi_storage
2024-09-10 14:52 ` Jakub Kicinski
@ 2024-09-10 16:10 ` Joe Damato
2024-09-11 0:10 ` Jakub Kicinski
0 siblings, 1 reply; 22+ messages in thread
From: Joe Damato @ 2024-09-10 16:10 UTC (permalink / raw)
To: Jakub Kicinski
Cc: netdev, mkarsten, skhawaja, sdf, bjorn, amritha.nambiar,
sridhar.samudrala, David S. Miller, Eric Dumazet, Paolo Abeni,
Jonathan Corbet, Jiri Pirko, Sebastian Andrzej Siewior,
Lorenzo Bianconi, open list:DOCUMENTATION, open list
On Tue, Sep 10, 2024 at 07:52:17AM -0700, Jakub Kicinski wrote:
> On Tue, 10 Sep 2024 08:13:51 +0200 Joe Damato wrote:
> > On Mon, Sep 09, 2024 at 04:40:39PM -0700, Jakub Kicinski wrote:
> > > On Sun, 8 Sep 2024 16:06:35 +0000 Joe Damato wrote:
> > > > Add a persistent NAPI storage area for NAPI configuration to the core.
> > > > Drivers opt-in to setting the storage for a NAPI by passing an index
> > > > when calling netif_napi_add_storage.
> > > >
> > > > napi_storage is allocated in alloc_netdev_mqs, freed in free_netdev
> > > > (after the NAPIs are deleted), and set to 0 when napi_enable is called.
> > >
> > > > enum {
> > > > @@ -2009,6 +2019,9 @@ enum netdev_reg_state {
> > > > * @dpll_pin: Pointer to the SyncE source pin of a DPLL subsystem,
> > > > * where the clock is recovered.
> > > > *
> > > > + * @napi_storage: An array of napi_storage structures containing per-NAPI
> > > > + * settings.
> > >
> > > FWIW you can use inline kdoc; with the size of the struct it's easier
> > > to find it. Also this doesn't need to be accessed from fastpath so you
> > > can move it down.
> >
> > OK. I figured since it was being deref'd in napi_complete_done
> > (where we previously read napi_defer_hard_irqs and
> > gro_flush_timeout) it needed to be in the fast path.
> >
> > I'll move it down for the next RFC.
>
> Hm, fair point. In my mind I expected we'd still add the fast path fields
> to NAPI instances. And the storage would only be there to stash that
> information for the period of time when real NAPI instances are not
> present (napi_disable() -> napi_enable() cycles).
I see, I hadn't understood that part. It sounds like you were
thinking we'd stash the values in between whereas I thought we were
reading from the struct directly (hence the implementation of the
static inline wrappers).
I don't really have a preference one way or the other.
> But looking at napi_struct, all the cachelines seem full, anyway, so we
> may as well split the info. No strong preference, feel free to keep as
> is, then. But maybe rename from napi_storage to napi_config or such?
I can give that a shot for the next RFC and see how it looks. We can
always duplicate the fields in napi_struct and napi_config and copy
them over as you suggested above.
> > > > diff --git a/net/core/dev.c b/net/core/dev.c
> > > > index 22c3f14d9287..ca90e8cab121 100644
> > > > --- a/net/core/dev.c
> > > > +++ b/net/core/dev.c
> > > > @@ -6719,6 +6719,9 @@ void napi_enable(struct napi_struct *n)
> > > > if (n->dev->threaded && n->thread)
> > > > new |= NAPIF_STATE_THREADED;
> > > > } while (!try_cmpxchg(&n->state, &val, new));
> > > > +
> > > > + if (n->napi_storage)
> > > > + memset(n->napi_storage, 0, sizeof(*n->napi_storage));
> >
> > OK, your comments below will probably make more sense to me after I
> > try implementing it, but I'll definitely have some questions.
> >
> > > And here inherit the settings and the NAPI ID from storage, then call
> > > napi_hash_add(). napi_hash_add() will need a minor diff to use the
> > > existing ID if already assigned.
> >
> > I don't think I realized we settled on the NAPI ID being persistent.
> > I'm not opposed to that, I just think I missed that part in the
> > previous conversation.
> >
> > I'll give it a shot and see what the next RFC looks like.
>
> The main reason to try to make NAPI ID persistent from the start is that
> if it works we don't have to add index to the uAPI. I don't feel
> strongly about it, if you or anyone else has arguments against / why
> it won't work.
Yea, I think not exposing the index in the uAPI is probably a good
idea? Making the NAPI IDs persistent lets us avoid that so I can
give that a shot because it's easier from the user app perspective,
IMO.
> > > And the inverse of that has to happen in napi_disable() (unhash, save
> > > settings to storage), and __netif_napi_del() (don't unhash if it has
> > > index).
> > >
> > > I think that should work?
> >
> > Only one way to find out ;)
> >
> > Separately: your comment about documenting rings to NAPIs... I am
> > not following that bit.
> >
> > Is that a thing you meant should be documented for driver writers to
> > follow to reduce churn?
>
> Which comment?
In this message:
https://lore.kernel.org/netdev/20240903124008.4793c087@kernel.org/
You mentioned this, which I interpreted as a thing that core needs
to solve for, but perhaps this was intended as advice for drivers?
Maybe it's enough to document how rings are distributed to NAPIs?
First set of NAPIs should get allocated to the combined channels,
then for remaining rx- and tx-only NAPIs they should be interleaved
starting with rx?
Example, asymmetric config: combined + some extra tx:
combined tx
[0..#combined-1] [#combined..#combined+#tx-1]
Split rx / tx - interleave:
[0 rx0] [1 tx0] [2 rx1] [3 tx1] [4 rx2] [5 tx2] ...
This would limit the churn when changing channel counts.
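A tiny hypothetical helper, just to make the split rx/tx interleaving
above concrete (assumes equal rx and tx counts; an asymmetric tail
would be appended sequentially, as in the combined+tx example):

	/* Split rx/tx, interleaved starting with rx:
	 * rx0 -> 0, tx0 -> 1, rx1 -> 2, tx1 -> 3, ...
	 */
	static int napi_index_rx(int rx) { return 2 * rx; }
	static int napi_index_tx(int tx) { return 2 * tx + 1; }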
* Re: [RFC net-next v2 1/9] net: napi: Add napi_storage
2024-09-10 16:10 ` Joe Damato
@ 2024-09-11 0:10 ` Jakub Kicinski
2024-09-11 7:47 ` Joe Damato
0 siblings, 1 reply; 22+ messages in thread
From: Jakub Kicinski @ 2024-09-11 0:10 UTC (permalink / raw)
To: Joe Damato
Cc: netdev, mkarsten, skhawaja, sdf, bjorn, amritha.nambiar,
sridhar.samudrala, David S. Miller, Eric Dumazet, Paolo Abeni,
Jonathan Corbet, Jiri Pirko, Sebastian Andrzej Siewior,
Lorenzo Bianconi, open list:DOCUMENTATION, open list
On Tue, 10 Sep 2024 18:10:42 +0200 Joe Damato wrote:
> On Tue, Sep 10, 2024 at 07:52:17AM -0700, Jakub Kicinski wrote:
> > Hm, fair point. In my mind I expected we'd still add the fast path fields
> > to NAPI instances. And the storage would only be there to stash that
> > information for the period of time when real NAPI instances are not
> > present (napi_disable() -> napi_enable() cycles).
>
> I see, I hadn't understood that part. It sounds like you were
> thinking we'd stash the values in between whereas I thought we were
> reading from the struct directly (hence the implementation of the
> static inline wrappers).
>
> I don't really have a preference one way or the other.
Me neither. I suspect having the fields in napi_struct would be slightly
more optimal. No need for indirect accesses via napi->storage->$field,
and conditions to check if napi->storage is set. We can make sure we
populate the fields in NAPIs when they are created and when sysfs writes
happen. So slightly better fastpath locality at the expense of more
code on the control path keeping it in sync.
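As a sketch of that control-path syncing (the helper name and the
per-NAPI field are assumptions from this series; locking elided), a
sysfs write would fan the new value out to each NAPI so the fast path
never chases napi->storage:

	static void netdev_sync_defer_hard_irqs(struct net_device *dev,
						unsigned int val)
	{
		struct napi_struct *napi;

		WRITE_ONCE(dev->napi_defer_hard_irqs, val);
		/* Propagate to the per-NAPI copy proposed in this series. */
		list_for_each_entry(napi, &dev->napi_list, dev_list)
			WRITE_ONCE(napi->defer_hard_irqs, val);
	}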
FWIW the more we discuss this the less I like the word "storage" :)
If you could sed -i 's/storage/config/' on the patches that'd be great :)
> > > I don't think I realized we settled on the NAPI ID being persistent.
> > > I'm not opposed to that, I just think I missed that part in the
> > > previous conversation.
> > >
> > > I'll give it a shot and see what the next RFC looks like.
> >
> > The main reason to try to make NAPI ID persistent from the start is that
> > if it works we don't have to add index to the uAPI. I don't feel
> > strongly about it, if you or anyone else has arguments against / why
> > it won't work.
>
> Yea, I think not exposing the index in the uAPI is probably a good
> idea? Making the NAPI IDs persistent lets us avoid that so I can
> give that a shot because it's easier from the user app perspective,
> IMO.
Right, basically we can always add it later. Removing it later won't
be possible :S
> > > Only one way to find out ;)
> > >
> > > Separately: your comment about documenting rings to NAPIs... I am
> > > not following that bit.
> > >
> > > Is that a thing you meant should be documented for driver writers to
> > > follow to reduce churn?
> >
> > Which comment?
>
> In this message:
>
> https://lore.kernel.org/netdev/20240903124008.4793c087@kernel.org/
>
> You mentioned this, which I interpreted as a thing that core needs
> to solve for, but perhaps this was intended as advice for drivers?
>
> Maybe it's enough to document how rings are distributed to NAPIs?
>
> First set of NAPIs should get allocated to the combined channels,
> then for remaining rx- and tx-only NAPIs they should be interleaved
> starting with rx?
>
> Example, asymmetric config: combined + some extra tx:
>
> combined tx
> [0..#combined-1] [#combined..#combined+#tx-1]
>
> Split rx / tx - interleave:
>
> [0 rx0] [1 tx0] [2 rx1] [3 tx1] [4 rx2] [5 tx2] ...
>
> This would limit the churn when changing channel counts.
Oh, yes. I'm not sure _where_ to document it. But if the driver supports
asymmetric ring count or dedicated Rx and Tx NAPIs - this is the
recommended way to distribute the rings to NAPIs, to, as you say,
limit the config churn / mismatch after ring count changes.
* Re: [RFC net-next v2 1/9] net: napi: Add napi_storage
2024-09-11 0:10 ` Jakub Kicinski
@ 2024-09-11 7:47 ` Joe Damato
0 siblings, 0 replies; 22+ messages in thread
From: Joe Damato @ 2024-09-11 7:47 UTC (permalink / raw)
To: Jakub Kicinski
Cc: netdev, mkarsten, skhawaja, sdf, bjorn, amritha.nambiar,
sridhar.samudrala, David S. Miller, Eric Dumazet, Paolo Abeni,
Jonathan Corbet, Jiri Pirko, Sebastian Andrzej Siewior,
Lorenzo Bianconi, open list:DOCUMENTATION, open list
On Tue, Sep 10, 2024 at 05:10:48PM -0700, Jakub Kicinski wrote:
> On Tue, 10 Sep 2024 18:10:42 +0200 Joe Damato wrote:
> > On Tue, Sep 10, 2024 at 07:52:17AM -0700, Jakub Kicinski wrote:
> > > Hm, fair point. In my mind I expected we'd still add the fast path fields
> > > to NAPI instances. And the storage would only be there to stash that
> > > information for the period of time when real NAPI instances are not
> > > present (napi_disable() -> napi_enable() cycles).
> >
> > I see, I hadn't understood that part. It sounds like you were
> > thinking we'd stash the values in between whereas I thought we were
> > reading from the struct directly (hence the implementation of the
> > static inline wrappers).
> >
> > I don't really have a preference one way or the other.
>
> Me neither. I suspect having the fields in napi_struct would be slightly
> more optimal. No need for indirect accesses via napi->storage->$field,
> and conditions to check if napi->storage is set. We can make sure we
> populate the fields in NAPIs when they are created and when sysfs writes
> happen. So slightly better fastpath locality at the expense of more
> code on the control path keeping it in sync.
>
> FWIW the more we discuss this the less I like the word "storage" :)
> If you could sed -i 's/storage/config/' on the patches that'd be great :)
Yup, I actually did that yesterday. I also like config more.
Right now, I have:
- struct napi_config
- in the napi_struct it's a struct napi_config *config
- in the net_device it's a struct napi_config *napi_config
I figured in the second case you are already de-refing through a
"napi_struct" so writing "napi->napi_config->..." seemed a bit wordy
compared to "napi->config->...".
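For reference, a sketch of how that naming reads (the exact field set
is assumed from the series so far, not final):

	struct napi_config {
		u64		gro_flush_timeout;	/* assumed */
		u32		defer_hard_irqs;	/* assumed */
		unsigned int	napi_id;		/* persisted ID */
	};

	/* napi_struct gains:  struct napi_config *config;
	 * net_device gains:  struct napi_config *napi_config; (array)
	 * so accesses read as napi->config->... rather than the wordier
	 * napi->napi_config->...
	 */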
> > > > I don't think I realized we settled on the NAPI ID being persistent.
> > > > I'm not opposed to that, I just think I missed that part in the
> > > > previous conversation.
> > > >
> > > > I'll give it a shot and see what the next RFC looks like.
> > >
> > > The main reason to try to make NAPI ID persistent from the start is that
> > > if it works we don't have to add index to the uAPI. I don't feel
> > > strongly about it, if you or anyone else has arguments against / why
> > > it won't work.
> >
> > Yea, I think not exposing the index in the uAPI is probably a good
> > idea? Making the NAPI IDs persistent lets us avoid that so I can
> > give that a shot because it's easier from the user app perspective,
> > IMO.
>
> Right, basically we can always add it later. Removing it later won't
> be possible :S
Yup, I totally get that.
> > > > Only one way to find out ;)
> > > >
> > > > Separately: your comment about documenting rings to NAPIs... I am
> > > > not following that bit.
> > > >
> > > > Is that a thing you meant should be documented for driver writers to
> > > > follow to reduce churn?
> > >
> > > Which comment?
> >
> > In this message:
> >
> > https://lore.kernel.org/netdev/20240903124008.4793c087@kernel.org/
> >
> > You mentioned this, which I interpreted as a thing that core needs
> > to solve for, but perhaps this was intended as advice for drivers?
> >
> > Maybe it's enough to document how rings are distributed to NAPIs?
> >
> > First set of NAPIs should get allocated to the combined channels,
> > then for remaining rx- and tx-only NAPIs they should be interleaved
> > starting with rx?
> >
> > Example, asymmetric config: combined + some extra tx:
> >
> > combined tx
> > [0..#combined-1] [#combined..#combined+#tx-1]
> >
> > Split rx / tx - interleave:
> >
> > [0 rx0] [1 tx0] [2 rx1] [3 tx1] [4 rx2] [5 tx2] ...
> >
> > This would limit the churn when changing channel counts.
>
> Oh, yes. I'm not sure _where_ to document it. But if the driver supports
> asymmetric ring count or dedicated Rx and Tx NAPIs - this is the
> recommended way to distribute the rings to NAPIs, to, as you say,
> limit the config churn / mismatch after ring count changes.
OK, so that seems like it's related but separate from the changes I
am proposing via RFC, so I don't think I need to include that in my
next RFC.
* Re: [RFC net-next v2 1/9] net: napi: Add napi_storage
2024-09-09 23:40 ` Jakub Kicinski
2024-09-10 6:13 ` Joe Damato
@ 2024-09-11 7:51 ` Joe Damato
1 sibling, 0 replies; 22+ messages in thread
From: Joe Damato @ 2024-09-11 7:51 UTC (permalink / raw)
To: Jakub Kicinski
Cc: netdev, mkarsten, skhawaja, sdf, bjorn, amritha.nambiar,
sridhar.samudrala, David S. Miller, Eric Dumazet, Paolo Abeni,
Jonathan Corbet, Jiri Pirko, Sebastian Andrzej Siewior,
Lorenzo Bianconi, open list:DOCUMENTATION, open list
On Mon, Sep 09, 2024 at 04:40:39PM -0700, Jakub Kicinski wrote:
> On Sun, 8 Sep 2024 16:06:35 +0000 Joe Damato wrote:
> > Add a persistent NAPI storage area for NAPI configuration to the core.
> > Drivers opt-in to setting the storage for a NAPI by passing an index
> > when calling netif_napi_add_storage.
> >
> > napi_storage is allocated in alloc_netdev_mqs, freed in free_netdev
> > (after the NAPIs are deleted), and set to 0 when napi_enable is called.
>
> > enum {
> > @@ -2009,6 +2019,9 @@ enum netdev_reg_state {
> > * @dpll_pin: Pointer to the SyncE source pin of a DPLL subsystem,
> > * where the clock is recovered.
> > *
> > + * @napi_storage: An array of napi_storage structures containing per-NAPI
> > + * settings.
>
> FWIW you can use inline kdoc; with the size of the struct it's easier
> to find it. Also this doesn't need to be accessed from fastpath so you
> can move it down.
>
> > +/**
> > + * netif_napi_add_storage - initialize a NAPI context and set storage area
> > + * @dev: network device
> > + * @napi: NAPI context
> > + * @poll: polling function
> > + * @weight: the poll weight of this NAPI
> > + * @index: the NAPI index
> > + */
> > +static inline void
> > +netif_napi_add_storage(struct net_device *dev, struct napi_struct *napi,
> > + int (*poll)(struct napi_struct *, int), int weight,
> > + int index)
> > +{
> > + napi->index = index;
> > + napi->napi_storage = &dev->napi_storage[index];
> > + netif_napi_add_weight(dev, napi, poll, weight);
>
> You can drop the weight param, just pass NAPI_POLL_WEIGHT.
>
> Then -- change netif_napi_add_weight() to prevent it from
> calling napi_hash_add() if it has index >= 0.
>
> > diff --git a/net/core/dev.c b/net/core/dev.c
> > index 22c3f14d9287..ca90e8cab121 100644
> > --- a/net/core/dev.c
> > +++ b/net/core/dev.c
> > @@ -6719,6 +6719,9 @@ void napi_enable(struct napi_struct *n)
> > if (n->dev->threaded && n->thread)
> > new |= NAPIF_STATE_THREADED;
> > } while (!try_cmpxchg(&n->state, &val, new));
> > +
> > + if (n->napi_storage)
> > + memset(n->napi_storage, 0, sizeof(*n->napi_storage));
>
> And here inherit the settings and the NAPI ID from storage, then call
> napi_hash_add(). napi_hash_add() will need a minor diff to use the
> existing ID if already assigned.
>
> And the inverse of that has to happen in napi_disable() (unhash, save
> settings to storage), and __netif_napi_del() (don't unhash if it has
> index).
>
> I think that should work?
I made the changes you suggested above yesterday, and I also
renamed the struct to napi_config because I liked that better
than storage.
I'll update the code to put the values in the napi_struct and copy
them over as you suggested in your other message.
That said: the copying is more of an optimization, so the changes
I have should be almost (?) working; copying the values into the
napi_struct should only affect performance, not
functionality/correctness.
I say that because there's still a bug I'm trying to work out with
mlx5 and it's almost certainly in the codepath Stanislav pointed out
in his messages [1] [2]. Haven't had much time to debug it just yet,
but I am hoping to be able to debug it and submit another RFC before
the end of this week.
FWIW: I too have fallen down this code path in mlx5 in the past for
other reasons. It appears it is time to fall down it again.
[1]: https://lore.kernel.org/netdev/Ztjv-dgNFwFBnXwd@mini-arch/
[2]: https://lore.kernel.org/netdev/Zt94tXG_lzGLWo1w@mini-arch/