* [PATCH net-next v5 1/4] net: Create separate gro_flush helper function
2025-04-24 20:02 [PATCH net-next v5 0/4] Add support to do threaded napi busy poll Samiullah Khawaja
@ 2025-04-24 20:02 ` Samiullah Khawaja
2025-04-24 23:21 ` Joe Damato
2025-04-24 20:02 ` [PATCH net-next v5 2/4] net: define an enum for the napi threaded state Samiullah Khawaja
` (3 subsequent siblings)
4 siblings, 1 reply; 23+ messages in thread
From: Samiullah Khawaja @ 2025-04-24 20:02 UTC (permalink / raw)
To: Jakub Kicinski, David S. Miller, Eric Dumazet, Paolo Abeni,
almasrymina, willemb, jdamato, mkarsten
Cc: netdev, skhawaja
Move the multiple copies of the same code snippet that do `gro_flush` and
`gro_normal_list` into a separate helper function.
Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
---
net/core/dev.c | 16 ++++++++++------
1 file changed, 10 insertions(+), 6 deletions(-)
diff --git a/net/core/dev.c b/net/core/dev.c
index ea3c8a30bb97..3ff275bbf6e2 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -6506,6 +6506,13 @@ void __napi_schedule_irqoff(struct napi_struct *n)
}
EXPORT_SYMBOL(__napi_schedule_irqoff);
+static void __napi_gro_flush_helper(struct napi_struct *napi,
+ bool flush_old)
+{
+ gro_flush(&napi->gro, flush_old);
+ gro_normal_list(&napi->gro);
+}
+
bool napi_complete_done(struct napi_struct *n, int work_done)
{
unsigned long flags, val, new, timeout = 0;
@@ -6538,8 +6545,7 @@ bool napi_complete_done(struct napi_struct *n, int work_done)
* it, we need to bound somehow the time packets are kept in
* the GRO layer.
*/
- gro_flush(&n->gro, !!timeout);
- gro_normal_list(&n->gro);
+ __napi_gro_flush_helper(n, !!timeout);
if (unlikely(!list_empty(&n->poll_list))) {
/* If n->poll_list is not empty, we need to mask irqs */
@@ -6609,8 +6615,7 @@ static void __busy_poll_stop(struct napi_struct *napi, bool skip_schedule)
}
/* Flush too old packets. If HZ < 1000, flush all packets */
- gro_flush(&napi->gro, HZ >= 1000);
- gro_normal_list(&napi->gro);
+ __napi_gro_flush_helper(napi, HZ >= 1000);
clear_bit(NAPI_STATE_SCHED, &napi->state);
}
@@ -7433,8 +7438,7 @@ static int __napi_poll(struct napi_struct *n, bool *repoll)
}
/* Flush too old packets. If HZ < 1000, flush all packets */
- gro_flush(&n->gro, HZ >= 1000);
- gro_normal_list(&n->gro);
+ __napi_gro_flush_helper(n, HZ >= 1000);
/* Some drivers may have called napi_schedule
* prior to exhausting their budget.
--
2.49.0.850.g28803427d3-goog
* Re: [PATCH net-next v5 1/4] net: Create separate gro_flush helper function
2025-04-24 20:02 ` [PATCH net-next v5 1/4] net: Create separate gro_flush helper function Samiullah Khawaja
@ 2025-04-24 23:21 ` Joe Damato
0 siblings, 0 replies; 23+ messages in thread
From: Joe Damato @ 2025-04-24 23:21 UTC (permalink / raw)
To: Samiullah Khawaja
Cc: Jakub Kicinski, David S. Miller, Eric Dumazet, Paolo Abeni,
almasrymina, willemb, mkarsten, netdev
On Thu, Apr 24, 2025 at 08:02:19PM +0000, Samiullah Khawaja wrote:
> Move the multiple copies of the same code snippet that do `gro_flush` and
> `gro_normal_list` into a separate helper function.
>
> Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
> Reviewed-by: Willem de Bruijn <willemb@google.com>
> ---
> net/core/dev.c | 16 ++++++++++------
Looks like there's an instance in kernel/bpf/cpumap.c that can be
replaced as well?
Maybe the code proposed should be moved to a helper in net/gro.h and
then it can be used in net/core and kernel/bpf?
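For illustration, a minimal sketch of such a shared helper (the name and
exact placement are hypothetical, assuming the struct gro_node that
gro_flush() and gro_normal_list() already operate on):

    /* net/gro.h (sketch): flush old GRO packets and deliver the
     * normal list in one place, usable outside net/core/dev.c too.
     */
    static inline void gro_flush_normal(struct gro_node *gro, bool flush_old)
    {
            gro_flush(gro, flush_old);
            gro_normal_list(gro);
    }

Both net/core/dev.c and kernel/bpf/cpumap.c could then call it on their
respective struct gro_node.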
* [PATCH net-next v5 2/4] net: define an enum for the napi threaded state
2025-04-24 20:02 [PATCH net-next v5 0/4] Add support to do threaded napi busy poll Samiullah Khawaja
2025-04-24 20:02 ` [PATCH net-next v5 1/4] net: Create separate gro_flush helper function Samiullah Khawaja
@ 2025-04-24 20:02 ` Samiullah Khawaja
2025-04-24 23:40 ` Joe Damato
2025-04-26 1:36 ` Jakub Kicinski
2025-04-24 20:02 ` [PATCH net-next v5 3/4] Extend napi threaded polling to allow kthread based busy polling Samiullah Khawaja
` (2 subsequent siblings)
4 siblings, 2 replies; 23+ messages in thread
From: Samiullah Khawaja @ 2025-04-24 20:02 UTC (permalink / raw)
To: Jakub Kicinski, David S. Miller, Eric Dumazet, Paolo Abeni,
almasrymina, willemb, jdamato, mkarsten
Cc: netdev, skhawaja
Instead of using '0' and '1' for the napi threaded state, use an enum with
'disable' and 'enable' states, preparing for the next patch to add a new
'busy-poll-enable' state. Also change the 'threaded' field in struct
net_device from bool to u8.
Tested:
./tools/testing/selftests/net/nl_netdev.py
TAP version 13
1..6
ok 1 nl_netdev.empty_check
ok 2 nl_netdev.lo_check
ok 3 nl_netdev.page_pool_check
ok 4 nl_netdev.napi_list_check
ok 5 nl_netdev.napi_set_threaded
ok 6 nl_netdev.nsim_rxq_reset_down
# Totals: pass:6 fail:0 xfail:0 xpass:0 skip:0 error:0
Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
---
Documentation/netlink/specs/netdev.yaml | 11 +++++---
.../networking/net_cachelines/net_device.rst | 2 +-
.../net/ethernet/atheros/atl1c/atl1c_main.c | 2 +-
drivers/net/ethernet/mellanox/mlxsw/pci.c | 2 +-
drivers/net/ethernet/renesas/ravb_main.c | 2 +-
drivers/net/wireless/ath/ath10k/snoc.c | 2 +-
include/linux/netdevice.h | 7 +++--
include/uapi/linux/netdev.h | 5 ++++
net/core/dev.c | 25 ++++++++++++++---
net/core/dev.h | 12 +++++---
net/core/netdev-genl-gen.c | 2 +-
net/core/netdev-genl.c | 4 +--
tools/include/uapi/linux/netdev.h | 5 ++++
tools/testing/selftests/net/nl_netdev.py | 28 +++++++++----------
14 files changed, 72 insertions(+), 37 deletions(-)
diff --git a/Documentation/netlink/specs/netdev.yaml b/Documentation/netlink/specs/netdev.yaml
index c9d190fe1f05..c8834161e8ec 100644
--- a/Documentation/netlink/specs/netdev.yaml
+++ b/Documentation/netlink/specs/netdev.yaml
@@ -82,6 +82,10 @@ definitions:
name: qstats-scope
type: flags
entries: [ queue ]
+ -
+ name: napi-threaded
+ type: enum
+ entries: [ disable, enable ]
attribute-sets:
-
@@ -283,11 +287,10 @@ attribute-sets:
-
name: threaded
doc: Whether the napi is configured to operate in threaded polling
- mode. If this is set to `1` then the NAPI context operates
+ mode. If this is set to `enable` then the NAPI context operates
in threaded polling mode.
- type: uint
- checks:
- max: 1
+ type: u32
+ enum: napi-threaded
-
name: xsk-info
attributes: []
diff --git a/Documentation/networking/net_cachelines/net_device.rst b/Documentation/networking/net_cachelines/net_device.rst
index 6327e689e8a8..e9796adc2fb0 100644
--- a/Documentation/networking/net_cachelines/net_device.rst
+++ b/Documentation/networking/net_cachelines/net_device.rst
@@ -164,7 +164,7 @@ struct sfp_bus* sfp_bus
struct lock_class_key* qdisc_tx_busylock
bool proto_down
unsigned:1 wol_enabled
-unsigned:1 threaded napi_poll(napi_enable,dev_set_threaded)
+u8 threaded napi_poll(napi_enable,dev_set_threaded)
unsigned_long:1 see_all_hwtstamp_requests
unsigned_long:1 change_proto_down
unsigned_long:1 netns_immutable
diff --git a/drivers/net/ethernet/atheros/atl1c/atl1c_main.c b/drivers/net/ethernet/atheros/atl1c/atl1c_main.c
index 82137f9deae9..58254113678c 100644
--- a/drivers/net/ethernet/atheros/atl1c/atl1c_main.c
+++ b/drivers/net/ethernet/atheros/atl1c/atl1c_main.c
@@ -2688,7 +2688,7 @@ static int atl1c_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
adapter->mii.mdio_write = atl1c_mdio_write;
adapter->mii.phy_id_mask = 0x1f;
adapter->mii.reg_num_mask = MDIO_CTRL_REG_MASK;
- dev_set_threaded(netdev, true);
+ dev_set_threaded(netdev, NETDEV_NAPI_THREADED_ENABLE);
for (i = 0; i < adapter->rx_queue_count; ++i)
netif_napi_add(netdev, &adapter->rrd_ring[i].napi,
atl1c_clean_rx);
diff --git a/drivers/net/ethernet/mellanox/mlxsw/pci.c b/drivers/net/ethernet/mellanox/mlxsw/pci.c
index 058dcabfaa2e..2ed3b9263be2 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/pci.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/pci.c
@@ -156,7 +156,7 @@ static int mlxsw_pci_napi_devs_init(struct mlxsw_pci *mlxsw_pci)
}
strscpy(mlxsw_pci->napi_dev_rx->name, "mlxsw_rx",
sizeof(mlxsw_pci->napi_dev_rx->name));
- dev_set_threaded(mlxsw_pci->napi_dev_rx, true);
+ dev_set_threaded(mlxsw_pci->napi_dev_rx, NETDEV_NAPI_THREADED_ENABLE);
return 0;
diff --git a/drivers/net/ethernet/renesas/ravb_main.c b/drivers/net/ethernet/renesas/ravb_main.c
index c9f4976a3527..12e4f68c0c8f 100644
--- a/drivers/net/ethernet/renesas/ravb_main.c
+++ b/drivers/net/ethernet/renesas/ravb_main.c
@@ -3075,7 +3075,7 @@ static int ravb_probe(struct platform_device *pdev)
if (info->coalesce_irqs) {
netdev_sw_irq_coalesce_default_on(ndev);
if (num_present_cpus() == 1)
- dev_set_threaded(ndev, true);
+ dev_set_threaded(ndev, NETDEV_NAPI_THREADED_ENABLE);
}
/* Network device register */
diff --git a/drivers/net/wireless/ath/ath10k/snoc.c b/drivers/net/wireless/ath/ath10k/snoc.c
index 866bad2db334..902517bed686 100644
--- a/drivers/net/wireless/ath/ath10k/snoc.c
+++ b/drivers/net/wireless/ath/ath10k/snoc.c
@@ -935,7 +935,7 @@ static int ath10k_snoc_hif_start(struct ath10k *ar)
bitmap_clear(ar_snoc->pending_ce_irqs, 0, CE_COUNT_MAX);
- dev_set_threaded(ar->napi_dev, true);
+ dev_set_threaded(ar->napi_dev, NETDEV_NAPI_THREADED_ENABLE);
ath10k_core_napi_enable(ar);
ath10k_snoc_irq_enable(ar);
ath10k_snoc_rx_post(ar);
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 3817720e8b24..2eda563307f9 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -369,7 +369,7 @@ struct napi_config {
u64 irq_suspend_timeout;
u32 defer_hard_irqs;
cpumask_t affinity_mask;
- bool threaded;
+ u8 threaded;
unsigned int napi_id;
};
@@ -589,7 +589,8 @@ static inline bool napi_complete(struct napi_struct *n)
return napi_complete_done(n, 0);
}
-int dev_set_threaded(struct net_device *dev, bool threaded);
+int dev_set_threaded(struct net_device *dev,
+ enum netdev_napi_threaded threaded);
void napi_disable(struct napi_struct *n);
void napi_disable_locked(struct napi_struct *n);
@@ -2428,7 +2429,7 @@ struct net_device {
struct sfp_bus *sfp_bus;
struct lock_class_key *qdisc_tx_busylock;
bool proto_down;
- bool threaded;
+ u8 threaded;
bool irq_affinity_auto;
bool rx_cpu_rmap_auto;
diff --git a/include/uapi/linux/netdev.h b/include/uapi/linux/netdev.h
index fac1b8ffeb55..a5737572ce92 100644
--- a/include/uapi/linux/netdev.h
+++ b/include/uapi/linux/netdev.h
@@ -77,6 +77,11 @@ enum netdev_qstats_scope {
NETDEV_QSTATS_SCOPE_QUEUE = 1,
};
+enum netdev_napi_threaded {
+ NETDEV_NAPI_THREADED_DISABLE,
+ NETDEV_NAPI_THREADED_ENABLE,
+};
+
enum {
NETDEV_A_DEV_IFINDEX = 1,
NETDEV_A_DEV_PAD,
diff --git a/net/core/dev.c b/net/core/dev.c
index 3ff275bbf6e2..41d809f2a7f7 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -6893,7 +6893,19 @@ static enum hrtimer_restart napi_watchdog(struct hrtimer *timer)
return HRTIMER_NORESTART;
}
-int napi_set_threaded(struct napi_struct *napi, bool threaded)
+static void napi_set_threaded_state(struct napi_struct *napi,
+ enum netdev_napi_threaded threaded)
+{
+ unsigned long val;
+
+ val = 0;
+ if (threaded)
+ val |= NAPIF_STATE_THREADED;
+ set_mask_bits(&napi->state, NAPIF_STATE_THREADED, val);
+}
+
+int napi_set_threaded(struct napi_struct *napi,
+ enum netdev_napi_threaded threaded)
{
if (threaded) {
if (!napi->thread) {
@@ -6909,14 +6921,16 @@ int napi_set_threaded(struct napi_struct *napi, bool threaded)
/* Make sure kthread is created before THREADED bit is set. */
smp_mb__before_atomic();
- assign_bit(NAPI_STATE_THREADED, &napi->state, threaded);
+ napi_set_threaded_state(napi, threaded);
return 0;
}
-int dev_set_threaded(struct net_device *dev, bool threaded)
+int dev_set_threaded(struct net_device *dev,
+ enum netdev_napi_threaded threaded)
{
struct napi_struct *napi;
+ unsigned long val;
int err = 0;
netdev_assert_locked_or_invisible(dev);
@@ -6924,12 +6938,15 @@ int dev_set_threaded(struct net_device *dev, bool threaded)
if (dev->threaded == threaded)
return 0;
+ val = 0;
if (threaded) {
+ val |= NAPIF_STATE_THREADED;
+
list_for_each_entry(napi, &dev->napi_list, dev_list) {
if (!napi->thread) {
err = napi_kthread_create(napi);
if (err) {
- threaded = false;
+ threaded = NETDEV_NAPI_THREADED_DISABLE;
break;
}
}
diff --git a/net/core/dev.h b/net/core/dev.h
index b50d118ad014..3924996ae85c 100644
--- a/net/core/dev.h
+++ b/net/core/dev.h
@@ -321,19 +321,23 @@ static inline void napi_set_irq_suspend_timeout(struct napi_struct *n,
*
* Return: the per-NAPI threaded state.
*/
-static inline bool napi_get_threaded(struct napi_struct *n)
+static inline enum netdev_napi_threaded napi_get_threaded(struct napi_struct *n)
{
- return test_bit(NAPI_STATE_THREADED, &n->state);
+ if (test_bit(NAPI_STATE_THREADED, &n->state))
+ return NETDEV_NAPI_THREADED_ENABLE;
+
+ return NETDEV_NAPI_THREADED_DISABLE;
}
/**
* napi_set_threaded - set napi threaded state
* @n: napi struct to set the threaded state on
- * @threaded: whether this napi does threaded polling
+ * @threaded: napi threaded state
*
* Return 0 on success and negative errno on failure.
*/
-int napi_set_threaded(struct napi_struct *n, bool threaded);
+int napi_set_threaded(struct napi_struct *n,
+ enum netdev_napi_threaded threaded);
int rps_cpumask_housekeeping(struct cpumask *mask);
diff --git a/net/core/netdev-genl-gen.c b/net/core/netdev-genl-gen.c
index 2791b6b232fa..c2e5cee857d2 100644
--- a/net/core/netdev-genl-gen.c
+++ b/net/core/netdev-genl-gen.c
@@ -97,7 +97,7 @@ static const struct nla_policy netdev_napi_set_nl_policy[NETDEV_A_NAPI_THREADED
[NETDEV_A_NAPI_DEFER_HARD_IRQS] = NLA_POLICY_FULL_RANGE(NLA_U32, &netdev_a_napi_defer_hard_irqs_range),
[NETDEV_A_NAPI_GRO_FLUSH_TIMEOUT] = { .type = NLA_UINT, },
[NETDEV_A_NAPI_IRQ_SUSPEND_TIMEOUT] = { .type = NLA_UINT, },
- [NETDEV_A_NAPI_THREADED] = NLA_POLICY_MAX(NLA_UINT, 1),
+ [NETDEV_A_NAPI_THREADED] = NLA_POLICY_MAX(NLA_U32, 1),
};
/* Ops table for netdev */
diff --git a/net/core/netdev-genl.c b/net/core/netdev-genl.c
index f7d000a600cf..7335c8f6ef68 100644
--- a/net/core/netdev-genl.c
+++ b/net/core/netdev-genl.c
@@ -330,8 +330,8 @@ netdev_nl_napi_set_config(struct napi_struct *napi, struct genl_info *info)
u32 defer = 0;
if (info->attrs[NETDEV_A_NAPI_THREADED]) {
- threaded = nla_get_uint(info->attrs[NETDEV_A_NAPI_THREADED]);
- napi_set_threaded(napi, !!threaded);
+ threaded = nla_get_u32(info->attrs[NETDEV_A_NAPI_THREADED]);
+ napi_set_threaded(napi, threaded);
}
if (info->attrs[NETDEV_A_NAPI_DEFER_HARD_IRQS]) {
diff --git a/tools/include/uapi/linux/netdev.h b/tools/include/uapi/linux/netdev.h
index fac1b8ffeb55..a5737572ce92 100644
--- a/tools/include/uapi/linux/netdev.h
+++ b/tools/include/uapi/linux/netdev.h
@@ -77,6 +77,11 @@ enum netdev_qstats_scope {
NETDEV_QSTATS_SCOPE_QUEUE = 1,
};
+enum netdev_napi_threaded {
+ NETDEV_NAPI_THREADED_DISABLE,
+ NETDEV_NAPI_THREADED_ENABLE,
+};
+
enum {
NETDEV_A_DEV_IFINDEX = 1,
NETDEV_A_DEV_PAD,
diff --git a/tools/testing/selftests/net/nl_netdev.py b/tools/testing/selftests/net/nl_netdev.py
index 505564818fa8..ec60bff34f5e 100755
--- a/tools/testing/selftests/net/nl_netdev.py
+++ b/tools/testing/selftests/net/nl_netdev.py
@@ -52,53 +52,53 @@ def napi_set_threaded(nf) -> None:
napi1_id = napis[1]['id']
# set napi threaded and verify
- nf.napi_set({'id': napi0_id, 'threaded': 1})
+ nf.napi_set({'id': napi0_id, 'threaded': "enable"})
napi0 = nf.napi_get({'id': napi0_id})
- ksft_eq(napi0['threaded'], 1)
+ ksft_eq(napi0['threaded'], "enable")
ip(f"link set dev {nsim.ifname} down")
ip(f"link set dev {nsim.ifname} up")
# verify if napi threaded is still set
napi0 = nf.napi_get({'id': napi0_id})
- ksft_eq(napi0['threaded'], 1)
+ ksft_eq(napi0['threaded'], "enable")
# unset napi threaded and verify
- nf.napi_set({'id': napi0_id, 'threaded': 0})
+ nf.napi_set({'id': napi0_id, 'threaded': "disable"})
napi0 = nf.napi_get({'id': napi0_id})
- ksft_eq(napi0['threaded'], 0)
+ ksft_eq(napi0['threaded'], "disable")
# set napi threaded for napi0
- nf.napi_set({'id': napi0_id, 'threaded': 1})
+ nf.napi_set({'id': napi0_id, 'threaded': "enable"})
napi0 = nf.napi_get({'id': napi0_id})
- ksft_eq(napi0['threaded'], 1)
+ ksft_eq(napi0['threaded'], "enable")
# check it is not set for napi1
napi1 = nf.napi_get({'id': napi1_id})
- ksft_eq(napi1['threaded'], 0)
+ ksft_eq(napi1['threaded'], "disable")
# set threaded at device level
system(f"echo 1 > /sys/class/net/{nsim.ifname}/threaded")
# check napi threaded is set for both napis
napi0 = nf.napi_get({'id': napi0_id})
- ksft_eq(napi0['threaded'], 1)
+ ksft_eq(napi0['threaded'], "enable")
napi1 = nf.napi_get({'id': napi1_id})
- ksft_eq(napi1['threaded'], 1)
+ ksft_eq(napi1['threaded'], "enable")
# set napi threaded for napi0
- nf.napi_set({'id': napi0_id, 'threaded': 1})
+ nf.napi_set({'id': napi0_id, 'threaded': "enable"})
napi0 = nf.napi_get({'id': napi0_id})
- ksft_eq(napi0['threaded'], 1)
+ ksft_eq(napi0['threaded'], "enable")
# unset threaded at device level
system(f"echo 0 > /sys/class/net/{nsim.ifname}/threaded")
# check napi threaded is unset for both napis
napi0 = nf.napi_get({'id': napi0_id})
- ksft_eq(napi0['threaded'], 0)
+ ksft_eq(napi0['threaded'], "disable")
napi1 = nf.napi_get({'id': napi1_id})
- ksft_eq(napi1['threaded'], 0)
+ ksft_eq(napi1['threaded'], "disable")
def nsim_rxq_reset_down(nf) -> None:
"""
--
2.49.0.850.g28803427d3-goog
* Re: [PATCH net-next v5 2/4] net: define an enum for the napi threaded state
2025-04-24 20:02 ` [PATCH net-next v5 2/4] net: define an enum for the napi threaded state Samiullah Khawaja
@ 2025-04-24 23:40 ` Joe Damato
2025-04-26 1:36 ` Jakub Kicinski
1 sibling, 0 replies; 23+ messages in thread
From: Joe Damato @ 2025-04-24 23:40 UTC (permalink / raw)
To: Samiullah Khawaja
Cc: Jakub Kicinski, David S. Miller, Eric Dumazet, Paolo Abeni,
almasrymina, willemb, mkarsten, netdev
On Thu, Apr 24, 2025 at 08:02:20PM +0000, Samiullah Khawaja wrote:
> Instead of using '0' and '1' for the napi threaded state, use an enum with
> 'disable' and 'enable' states, preparing for the next patch to add a new
> 'busy-poll-enable' state. Also change the 'threaded' field in struct
> net_device from bool to u8.
[...]
>
> diff --git a/Documentation/netlink/specs/netdev.yaml b/Documentation/netlink/specs/netdev.yaml
> index c9d190fe1f05..c8834161e8ec 100644
> --- a/Documentation/netlink/specs/netdev.yaml
> +++ b/Documentation/netlink/specs/netdev.yaml
[...]
> @@ -283,11 +287,10 @@ attribute-sets:
> -
> name: threaded
> doc: Whether the napi is configured to operate in threaded polling
> - mode. If this is set to `1` then the NAPI context operates
> + mode. If this is set to `enable` then the NAPI context operates
> in threaded polling mode.
> - type: uint
> - checks:
> - max: 1
> + type: u32
> + enum: napi-threaded
I think this can still be uint even if associated with an enum? IIUC
(I could be wrong) uint is preferred.
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 3817720e8b24..2eda563307f9 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -2428,7 +2429,7 @@ struct net_device {
> struct sfp_bus *sfp_bus;
> struct lock_class_key *qdisc_tx_busylock;
> bool proto_down;
> - bool threaded;
> + u8 threaded;
Looks like there's a 1 byte hole in a cacheline further up after
unsigned char lower_level
Not sure if putting the u8 there and filling that cacheline makes you
feel anything in particular.
(I feel nothing.)
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 3ff275bbf6e2..41d809f2a7f7 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
[...]
> @@ -6924,12 +6938,15 @@ int dev_set_threaded(struct net_device *dev, bool threaded)
> if (dev->threaded == threaded)
> return 0;
>
> + val = 0;
> if (threaded) {
> + val |= NAPIF_STATE_THREADED;
> +
> list_for_each_entry(napi, &dev->napi_list, dev_list) {
> if (!napi->thread) {
> err = napi_kthread_create(napi);
> if (err) {
> - threaded = false;
> + threaded = NETDEV_NAPI_THREADED_DISABLE;
> break;
> }
> }
Re the feedback on the per-NAPI threading setting patch, I think if
you used napi_set_threaded in dev_set_threaded (as mentioned in the
feedback to that patch) you might reduce the changes here.
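For illustration, a rough sketch of that shape (hypothetical; unwinding of
already-updated NAPIs on failure is elided):

    int dev_set_threaded(struct net_device *dev,
                         enum netdev_napi_threaded threaded)
    {
            struct napi_struct *napi;
            int err = 0;

            netdev_assert_locked_or_invisible(dev);

            if (dev->threaded == threaded)
                    return 0;

            list_for_each_entry(napi, &dev->napi_list, dev_list) {
                    /* creates the kthread and sets the state bits */
                    err = napi_set_threaded(napi, threaded);
                    if (err) {
                            threaded = NETDEV_NAPI_THREADED_DISABLE;
                            break;
                    }
            }

            WRITE_ONCE(dev->threaded, threaded);
            return err;
    }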
* Re: [PATCH net-next v5 2/4] net: define an enum for the napi threaded state
2025-04-24 20:02 ` [PATCH net-next v5 2/4] net: define an enum for the napi threaded state Samiullah Khawaja
2025-04-24 23:40 ` Joe Damato
@ 2025-04-26 1:36 ` Jakub Kicinski
2025-04-26 3:54 ` Samiullah Khawaja
1 sibling, 1 reply; 23+ messages in thread
From: Jakub Kicinski @ 2025-04-26 1:36 UTC (permalink / raw)
To: Samiullah Khawaja
Cc: David S. Miller, Eric Dumazet, Paolo Abeni, almasrymina,
willemb, jdamato, mkarsten, netdev
On Thu, 24 Apr 2025 20:02:20 +0000 Samiullah Khawaja wrote:
> - dev_set_threaded(ndev, true);
> + dev_set_threaded(ndev, NETDEV_NAPI_THREADED_ENABLE);
The drivers having to specify the type of threading is too much.
The drivers are just indicating to the core that they are too
IRQ-constrained to reasonably handle load.
Please add a wrapper, something like dev_set_threaded_hint(netdev).
All the drivers pass true now, so the second argument is already
pointless. For extra points you can export just
dev_set_threaded_hint() and make dev_set_threaded() a core-only function.
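For illustration, the wrapper could be as small as (hypothetical sketch):

    /* Drivers only hint that they are IRQ-constrained; the core
     * decides the threading mode.
     */
    int dev_set_threaded_hint(struct net_device *dev)
    {
            return dev_set_threaded(dev, NETDEV_NAPI_THREADED_ENABLE);
    }
    EXPORT_SYMBOL(dev_set_threaded_hint);

with the EXPORT_SYMBOL then dropped from dev_set_threaded() itself.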
* Re: [PATCH net-next v5 2/4] net: define an enum for the napi threaded state
2025-04-26 1:36 ` Jakub Kicinski
@ 2025-04-26 3:54 ` Samiullah Khawaja
0 siblings, 0 replies; 23+ messages in thread
From: Samiullah Khawaja @ 2025-04-26 3:54 UTC (permalink / raw)
To: Jakub Kicinski
Cc: David S. Miller, Eric Dumazet, Paolo Abeni, almasrymina, willemb,
jdamato, mkarsten, netdev
On Fri, Apr 25, 2025 at 6:36 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Thu, 24 Apr 2025 20:02:20 +0000 Samiullah Khawaja wrote:
> > - dev_set_threaded(ndev, true);
> > + dev_set_threaded(ndev, NETDEV_NAPI_THREADED_ENABLE);
>
> The drivers having to specify the type of threading is too much.
> The drivers are just indicating to the core that they are too
> IRQ-constrained to reasonably handle load..
>
> Please add a wrapper, something like dev_set_threaded_hint(netdev).
> All the drivers pass true now, the second argument is already
> pointless. For extra points you can export just the
> dev_set_threaded_hint and make dev_set_threaded() be a core-only function.
Agree. Will do this.
* [PATCH net-next v5 3/4] Extend napi threaded polling to allow kthread based busy polling
2025-04-24 20:02 [PATCH net-next v5 0/4] Add support to do threaded napi busy poll Samiullah Khawaja
2025-04-24 20:02 ` [PATCH net-next v5 1/4] net: Create separate gro_flush helper function Samiullah Khawaja
2025-04-24 20:02 ` [PATCH net-next v5 2/4] net: define an enum for the napi threaded state Samiullah Khawaja
@ 2025-04-24 20:02 ` Samiullah Khawaja
2025-04-24 23:42 ` Joe Damato
2025-04-24 20:02 ` [PATCH net-next v5 4/4] selftests: Add napi threaded busy poll test in `busy_poller` Samiullah Khawaja
2025-04-28 13:50 ` [PATCH net-next v5 0/4] Add support to do threaded napi busy poll Martin Karsten
4 siblings, 1 reply; 23+ messages in thread
From: Samiullah Khawaja @ 2025-04-24 20:02 UTC (permalink / raw)
To: Jakub Kicinski, David S. Miller, Eric Dumazet, Paolo Abeni,
almasrymina, willemb, jdamato, mkarsten
Cc: netdev, skhawaja
Add a new state to the napi state enum:
- STATE_THREADED_BUSY_POLL
  Threaded busy poll is enabled/running for this napi.
The following changes are introduced in the napi scheduling and state logic:
- When threaded busy poll is enabled through sysfs, NAPI_STATE_THREADED is
  also enabled so a kthread is created per napi. The
  NAPI_STATE_THREADED_BUSY_POLL bit is also set on each napi to indicate
  that it should be busy polled.
- When a napi is scheduled with STATE_SCHED_THREADED and the associated
  kthread is woken up, the kthread owns the context. If
  NAPI_STATE_THREADED_BUSY_POLL and NAPI_STATE_SCHED_THREADED are both set,
  the kthread can busy poll.
- To keep busy polling and to avoid scheduling interrupts,
  napi_complete_done returns false when both SCHED_THREADED and
  THREADED_BUSY_POLL flags are set. napi_complete_done also returns early
  to avoid clearing the STATE_SCHED_THREADED bit.
- If at any point STATE_THREADED_BUSY_POLL is cleared, napi_complete_done
  runs and also clears the SCHED_THREADED bit. This makes the associated
  kthread go to sleep as per the existing logic.
Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
---
Documentation/ABI/testing/sysfs-class-net | 3 +-
Documentation/netlink/specs/netdev.yaml | 5 +-
Documentation/networking/napi.rst | 67 ++++++++++++++++++++-
include/linux/netdevice.h | 8 +++
include/uapi/linux/netdev.h | 1 +
net/core/dev.c | 71 +++++++++++++++++++----
net/core/dev.h | 3 +
net/core/net-sysfs.c | 2 +-
net/core/netdev-genl-gen.c | 2 +-
tools/include/uapi/linux/netdev.h | 1 +
10 files changed, 145 insertions(+), 18 deletions(-)
diff --git a/Documentation/ABI/testing/sysfs-class-net b/Documentation/ABI/testing/sysfs-class-net
index ebf21beba846..15d7d36a8294 100644
--- a/Documentation/ABI/testing/sysfs-class-net
+++ b/Documentation/ABI/testing/sysfs-class-net
@@ -343,7 +343,7 @@ Date: Jan 2021
KernelVersion: 5.12
Contact: netdev@vger.kernel.org
Description:
- Boolean value to control the threaded mode per device. User could
+ Integer value to control the threaded mode per device. User could
set this value to enable/disable threaded mode for all napi
belonging to this device, without the need to do device up/down.
@@ -351,4 +351,5 @@ Description:
== ==================================
0 threaded mode disabled for this dev
1 threaded mode enabled for this dev
+ 2 threaded mode enabled, and busy polling enabled.
== ==================================
diff --git a/Documentation/netlink/specs/netdev.yaml b/Documentation/netlink/specs/netdev.yaml
index c8834161e8ec..650179559558 100644
--- a/Documentation/netlink/specs/netdev.yaml
+++ b/Documentation/netlink/specs/netdev.yaml
@@ -85,7 +85,7 @@ definitions:
-
name: napi-threaded
type: enum
- entries: [ disable, enable ]
+ entries: [ disable, enable, busy-poll-enable ]
attribute-sets:
-
@@ -288,7 +288,8 @@ attribute-sets:
name: threaded
doc: Whether the napi is configured to operate in threaded polling
mode. If this is set to `enable` then the NAPI context operates
- in threaded polling mode.
+ in threaded polling mode. If this is set to `busy-poll-enable`
+ then the NAPI kthread also does busy polling.
type: u32
enum: napi-threaded
-
diff --git a/Documentation/networking/napi.rst b/Documentation/networking/napi.rst
index 63f98c05860f..0f83142c624d 100644
--- a/Documentation/networking/napi.rst
+++ b/Documentation/networking/napi.rst
@@ -263,7 +263,9 @@ are not well known).
Busy polling is enabled by either setting ``SO_BUSY_POLL`` on
selected sockets or using the global ``net.core.busy_poll`` and
``net.core.busy_read`` sysctls. An io_uring API for NAPI busy polling
-also exists.
+also exists. Threaded polling of NAPI also has a mode to busy poll for
+packets (:ref:`threaded busy polling<threaded_busy_poll>`) using the same
+thread that is used for NAPI processing.
epoll-based busy polling
------------------------
@@ -426,6 +428,69 @@ Therefore, setting ``gro_flush_timeout`` and ``napi_defer_hard_irqs`` is
the recommended usage, because otherwise setting ``irq-suspend-timeout``
might not have any discernible effect.
+.. _threaded_busy_poll:
+
+Threaded NAPI busy polling
+--------------------------
+
+Threaded napi allows processing of packets from each NAPI in a kthread in
+the kernel. Threaded napi busy polling extends this and adds support for
+continuously busy polling this napi. This can be used to enable busy polling
+independent of the userspace application or the API (epoll, io_uring, raw
+sockets) being used in userspace to process the packets.
+
+It can be enabled for each NAPI using the netlink interface or at the device
+level using the threaded NAPI sysfs attribute.
+
+For example, using the following command:
+
+.. code-block:: bash
+
+ $ kernel-source/tools/net/ynl/pyynl/cli.py \
+ --spec Documentation/netlink/specs/netdev.yaml \
+ --do napi-set \
+ --json='{"id": 66,
+ "threaded": "busy-poll-enable"}'
+
+
+Enabling it for each NAPI allows finer control, enabling busy polling for
+only the set of NIC queues that will receive traffic with low latency
+requirements.
+
+Depending on the application requirements, the user might want to set the
+affinity of the kthread that is busy polling each NAPI. The user might also
+want to set the priority and the scheduler of the thread depending on the
+latency requirements.
+
+For a hard low-latency application, the user might want to dedicate a full
+core to the NAPI polling so the NIC queue descriptors are picked up from the
+queue as soon as they appear. For a more relaxed low-latency requirement, the
+user might want to share the core with other threads.
+
+Once threaded busy polling is enabled for a NAPI, the PID of the kthread can
+be fetched using the netlink interface so that the affinity, priority and
+scheduler configuration can be done.
+
+For example, the following command can be used to fetch the PID:
+
+.. code-block:: bash
+
+ $ kernel-source/tools/net/ynl/pyynl/cli.py \
+ --spec Documentation/netlink/specs/netdev.yaml \
+ --do napi-get \
+ --json='{"id": 66}'
+
+This will output something like the following; the `pid` value `258` is the
+PID of the kthread that is polling this NAPI.
+
+.. code-block:: bash
+
+ $ {'defer-hard-irqs': 0,
+ 'gro-flush-timeout': 0,
+ 'id': 66,
+ 'ifindex': 2,
+ 'irq-suspend-timeout': 0,
+ 'pid': 258,
+ 'threaded': 'busy-poll-enable'}
+
.. _threaded:
Threaded NAPI
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 2eda563307f9..c67a7424605e 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -427,6 +427,8 @@ enum {
NAPI_STATE_THREADED, /* The poll is performed inside its own thread*/
NAPI_STATE_SCHED_THREADED, /* Napi is currently scheduled in threaded mode */
NAPI_STATE_HAS_NOTIFIER, /* Napi has an IRQ notifier */
+ NAPI_STATE_THREADED_BUSY_POLL, /* The threaded napi poller will busy poll */
+ NAPI_STATE_SCHED_THREADED_BUSY_POLL, /* The threaded napi poller is busy polling */
};
enum {
@@ -441,8 +443,14 @@ enum {
NAPIF_STATE_THREADED = BIT(NAPI_STATE_THREADED),
NAPIF_STATE_SCHED_THREADED = BIT(NAPI_STATE_SCHED_THREADED),
NAPIF_STATE_HAS_NOTIFIER = BIT(NAPI_STATE_HAS_NOTIFIER),
+ NAPIF_STATE_THREADED_BUSY_POLL = BIT(NAPI_STATE_THREADED_BUSY_POLL),
+ NAPIF_STATE_SCHED_THREADED_BUSY_POLL =
+ BIT(NAPI_STATE_SCHED_THREADED_BUSY_POLL),
};
+#define NAPIF_STATE_THREADED_BUSY_POLL_MASK \
+ (NAPIF_STATE_THREADED | NAPIF_STATE_THREADED_BUSY_POLL)
+
enum gro_result {
GRO_MERGED,
GRO_MERGED_FREE,
diff --git a/include/uapi/linux/netdev.h b/include/uapi/linux/netdev.h
index a5737572ce92..b9b59d60957f 100644
--- a/include/uapi/linux/netdev.h
+++ b/include/uapi/linux/netdev.h
@@ -80,6 +80,7 @@ enum netdev_qstats_scope {
enum netdev_napi_threaded {
NETDEV_NAPI_THREADED_DISABLE,
NETDEV_NAPI_THREADED_ENABLE,
+ NETDEV_NAPI_THREADED_BUSY_POLL_ENABLE,
};
enum {
diff --git a/net/core/dev.c b/net/core/dev.c
index 41d809f2a7f7..7270e0a13c9f 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -78,6 +78,7 @@
#include <linux/slab.h>
#include <linux/sched.h>
#include <linux/sched/isolation.h>
+#include <linux/sched/types.h>
#include <linux/sched/mm.h>
#include <linux/smpboot.h>
#include <linux/mutex.h>
@@ -6525,7 +6526,8 @@ bool napi_complete_done(struct napi_struct *n, int work_done)
* the guarantee we will be called later.
*/
if (unlikely(n->state & (NAPIF_STATE_NPSVC |
- NAPIF_STATE_IN_BUSY_POLL)))
+ NAPIF_STATE_IN_BUSY_POLL |
+ NAPIF_STATE_SCHED_THREADED_BUSY_POLL)))
return false;
if (work_done) {
@@ -6899,9 +6901,11 @@ static void napi_set_threaded_state(struct napi_struct *napi,
unsigned long val;
val = 0;
+ if (threaded == NETDEV_NAPI_THREADED_BUSY_POLL_ENABLE)
+ val |= NAPIF_STATE_THREADED_BUSY_POLL;
if (threaded)
val |= NAPIF_STATE_THREADED;
- set_mask_bits(&napi->state, NAPIF_STATE_THREADED, val);
+ set_mask_bits(&napi->state, NAPIF_STATE_THREADED_BUSY_POLL_MASK, val);
}
int napi_set_threaded(struct napi_struct *napi,
@@ -6941,6 +6945,8 @@ int dev_set_threaded(struct net_device *dev,
val = 0;
if (threaded) {
val |= NAPIF_STATE_THREADED;
+ if (threaded == NETDEV_NAPI_THREADED_BUSY_POLL_ENABLE)
+ val |= NAPIF_STATE_THREADED_BUSY_POLL;
list_for_each_entry(napi, &dev->napi_list, dev_list) {
if (!napi->thread) {
@@ -6965,9 +6971,13 @@ int dev_set_threaded(struct net_device *dev,
* polled. In this case, the switch between threaded mode and
* softirq mode will happen in the next round of napi_schedule().
* This should not cause hiccups/stalls to the live traffic.
+ *
+ * Switch to busy_poll threaded napi will occur after the threaded
+ * napi is scheduled.
*/
list_for_each_entry(napi, &dev->napi_list, dev_list)
- assign_bit(NAPI_STATE_THREADED, &napi->state, threaded);
+ set_mask_bits(&napi->state,
+ NAPIF_STATE_THREADED_BUSY_POLL_MASK, val);
return err;
}
@@ -7285,8 +7295,12 @@ void netif_napi_add_weight_locked(struct net_device *dev,
* Clear dev->threaded if kthread creation failed so that
* threaded mode will not be enabled in napi_enable().
*/
- if (dev->threaded && napi_kthread_create(napi))
- dev->threaded = false;
+ if (dev->threaded) {
+ if (napi_kthread_create(napi))
+ dev->threaded = false;
+ else
+ napi_set_threaded_state(napi, dev->threaded);
+ }
netif_napi_set_irq_locked(napi, -1);
}
EXPORT_SYMBOL(netif_napi_add_weight_locked);
@@ -7308,7 +7322,9 @@ void napi_disable_locked(struct napi_struct *n)
}
new = val | NAPIF_STATE_SCHED | NAPIF_STATE_NPSVC;
- new &= ~(NAPIF_STATE_THREADED | NAPIF_STATE_PREFER_BUSY_POLL);
+ new &= ~(NAPIF_STATE_THREADED
+ | NAPIF_STATE_THREADED_BUSY_POLL
+ | NAPIF_STATE_PREFER_BUSY_POLL);
} while (!try_cmpxchg(&n->state, &val, new));
hrtimer_cancel(&n->timer);
@@ -7352,7 +7368,7 @@ void napi_enable_locked(struct napi_struct *n)
new = val & ~(NAPIF_STATE_SCHED | NAPIF_STATE_NPSVC);
if (n->dev->threaded && n->thread)
- new |= NAPIF_STATE_THREADED;
+ napi_set_threaded_state(n, n->dev->threaded);
} while (!try_cmpxchg(&n->state, &val, new));
}
EXPORT_SYMBOL(napi_enable_locked);
@@ -7515,7 +7531,7 @@ static int napi_thread_wait(struct napi_struct *napi)
return -1;
}
-static void napi_threaded_poll_loop(struct napi_struct *napi)
+static void napi_threaded_poll_loop(struct napi_struct *napi, bool busy_poll)
{
struct bpf_net_context __bpf_net_ctx, *bpf_net_ctx;
struct softnet_data *sd;
@@ -7544,22 +7560,53 @@ static void napi_threaded_poll_loop(struct napi_struct *napi)
}
skb_defer_free_flush(sd);
bpf_net_ctx_clear(bpf_net_ctx);
+
+ /* Flush too old packets. If HZ < 1000, flush all packets */
+ if (busy_poll)
+ __napi_gro_flush_helper(napi, HZ >= 1000);
local_bh_enable();
- if (!repoll)
+ /* If busy polling then do not break here because we need to
+ * call cond_resched and rcu_softirq_qs_periodic to prevent
+ * watchdog warnings.
+ */
+ if (!repoll && !busy_poll)
break;
rcu_softirq_qs_periodic(last_qs);
cond_resched();
+
+ if (!repoll)
+ break;
}
}
static int napi_threaded_poll(void *data)
{
struct napi_struct *napi = data;
+ bool busy_poll_sched;
+ unsigned long val;
+ bool busy_poll;
+
+ while (!napi_thread_wait(napi)) {
+ /* Once woken up, this means that we are scheduled as threaded
+ * napi and this thread owns the napi context, if busy poll
+ * state is set then we busy poll this napi.
+ */
+ val = READ_ONCE(napi->state);
+ busy_poll = val & NAPIF_STATE_THREADED_BUSY_POLL;
+ busy_poll_sched = val & NAPIF_STATE_SCHED_THREADED_BUSY_POLL;
- while (!napi_thread_wait(napi))
- napi_threaded_poll_loop(napi);
+ /* Do not busy poll if napi is disabled. */
+ if (unlikely(val & NAPIF_STATE_DISABLE))
+ busy_poll = false;
+
+ if (busy_poll != busy_poll_sched)
+ assign_bit(NAPI_STATE_SCHED_THREADED_BUSY_POLL,
+ &napi->state, busy_poll);
+
+ napi_threaded_poll_loop(napi, busy_poll);
+ }
return 0;
}
@@ -12744,7 +12791,7 @@ static void run_backlog_napi(unsigned int cpu)
{
struct softnet_data *sd = per_cpu_ptr(&softnet_data, cpu);
- napi_threaded_poll_loop(&sd->backlog);
+ napi_threaded_poll_loop(&sd->backlog, false);
}
static void backlog_napi_setup(unsigned int cpu)
diff --git a/net/core/dev.h b/net/core/dev.h
index 3924996ae85c..bd9d26b4a6ba 100644
--- a/net/core/dev.h
+++ b/net/core/dev.h
@@ -323,6 +323,9 @@ static inline void napi_set_irq_suspend_timeout(struct napi_struct *n,
*/
static inline enum netdev_napi_threaded napi_get_threaded(struct napi_struct *n)
{
+ if (test_bit(NAPI_STATE_THREADED_BUSY_POLL, &n->state))
+ return NETDEV_NAPI_THREADED_BUSY_POLL_ENABLE;
+
if (test_bit(NAPI_STATE_THREADED, &n->state))
return NETDEV_NAPI_THREADED_ENABLE;
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 1ace0cd01adc..0b7624236896 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -741,7 +741,7 @@ static int modify_napi_threaded(struct net_device *dev, unsigned long val)
if (list_empty(&dev->napi_list))
return -EOPNOTSUPP;
- if (val != 0 && val != 1)
+ if (val > NETDEV_NAPI_THREADED_BUSY_POLL_ENABLE)
return -EOPNOTSUPP;
ret = dev_set_threaded(dev, val);
diff --git a/net/core/netdev-genl-gen.c b/net/core/netdev-genl-gen.c
index c2e5cee857d2..1dbe5f19a192 100644
--- a/net/core/netdev-genl-gen.c
+++ b/net/core/netdev-genl-gen.c
@@ -97,7 +97,7 @@ static const struct nla_policy netdev_napi_set_nl_policy[NETDEV_A_NAPI_THREADED
[NETDEV_A_NAPI_DEFER_HARD_IRQS] = NLA_POLICY_FULL_RANGE(NLA_U32, &netdev_a_napi_defer_hard_irqs_range),
[NETDEV_A_NAPI_GRO_FLUSH_TIMEOUT] = { .type = NLA_UINT, },
[NETDEV_A_NAPI_IRQ_SUSPEND_TIMEOUT] = { .type = NLA_UINT, },
- [NETDEV_A_NAPI_THREADED] = NLA_POLICY_MAX(NLA_U32, 1),
+ [NETDEV_A_NAPI_THREADED] = NLA_POLICY_MAX(NLA_U32, 2),
};
/* Ops table for netdev */
diff --git a/tools/include/uapi/linux/netdev.h b/tools/include/uapi/linux/netdev.h
index a5737572ce92..b9b59d60957f 100644
--- a/tools/include/uapi/linux/netdev.h
+++ b/tools/include/uapi/linux/netdev.h
@@ -80,6 +80,7 @@ enum netdev_qstats_scope {
enum netdev_napi_threaded {
NETDEV_NAPI_THREADED_DISABLE,
NETDEV_NAPI_THREADED_ENABLE,
+ NETDEV_NAPI_THREADED_BUSY_POLL_ENABLE,
};
enum {
--
2.49.0.850.g28803427d3-goog
* Re: [PATCH net-next v5 3/4] Extend napi threaded polling to allow kthread based busy polling
2025-04-24 20:02 ` [PATCH net-next v5 3/4] Extend napi threaded polling to allow kthread based busy polling Samiullah Khawaja
@ 2025-04-24 23:42 ` Joe Damato
0 siblings, 0 replies; 23+ messages in thread
From: Joe Damato @ 2025-04-24 23:42 UTC (permalink / raw)
To: Samiullah Khawaja
Cc: Jakub Kicinski, David S. Miller, Eric Dumazet, Paolo Abeni,
almasrymina, willemb, mkarsten, netdev
On Thu, Apr 24, 2025 at 08:02:21PM +0000, Samiullah Khawaja wrote:
> Add a new state to the napi state enum:
[...]
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 2eda563307f9..c67a7424605e 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -427,6 +427,8 @@ enum {
> NAPI_STATE_THREADED, /* The poll is performed inside its own thread*/
> NAPI_STATE_SCHED_THREADED, /* Napi is currently scheduled in threaded mode */
> NAPI_STATE_HAS_NOTIFIER, /* Napi has an IRQ notifier */
> + NAPI_STATE_THREADED_BUSY_POLL, /* The threaded napi poller will busy poll */
> + NAPI_STATE_SCHED_THREADED_BUSY_POLL, /* The threaded napi poller is busy polling */
> };
Now that threaded has three states, doesn't the comment for struct
net_device need to be updated?
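For example, something along these lines (hypothetical wording):

    /* @threaded: napi threaded state for all NAPIs of this device
     *            (enum netdev_napi_threaded: disable, enable or
     *            busy-poll-enable)
     */
    u8 threaded;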
* [PATCH net-next v5 4/4] selftests: Add napi threaded busy poll test in `busy_poller`
2025-04-24 20:02 [PATCH net-next v5 0/4] Add support to do threaded napi busy poll Samiullah Khawaja
` (2 preceding siblings ...)
2025-04-24 20:02 ` [PATCH net-next v5 3/4] Extend napi threaded polling to allow kthread based busy polling Samiullah Khawaja
@ 2025-04-24 20:02 ` Samiullah Khawaja
2025-04-28 13:50 ` [PATCH net-next v5 0/4] Add support to do threaded napi busy poll Martin Karsten
4 siblings, 0 replies; 23+ messages in thread
From: Samiullah Khawaja @ 2025-04-24 20:02 UTC (permalink / raw)
To: Jakub Kicinski, David S. Miller, Eric Dumazet, Paolo Abeni,
almasrymina, willemb, jdamato, mkarsten
Cc: netdev, skhawaja
Add testcase to run busy poll test with threaded napi busy poll enabled.
Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
---
tools/testing/selftests/net/busy_poll_test.sh | 25 ++++++++++++++++++-
tools/testing/selftests/net/busy_poller.c | 14 ++++++++---
2 files changed, 35 insertions(+), 4 deletions(-)
diff --git a/tools/testing/selftests/net/busy_poll_test.sh b/tools/testing/selftests/net/busy_poll_test.sh
index 7db292ec4884..aeca610dc989 100755
--- a/tools/testing/selftests/net/busy_poll_test.sh
+++ b/tools/testing/selftests/net/busy_poll_test.sh
@@ -27,6 +27,9 @@ NAPI_DEFER_HARD_IRQS=100
GRO_FLUSH_TIMEOUT=50000
SUSPEND_TIMEOUT=20000000
+# NAPI threaded busy poll config
+NAPI_THREADED_POLL=2
+
setup_ns()
{
set -e
@@ -62,6 +65,9 @@ cleanup_ns()
test_busypoll()
{
suspend_value=${1:-0}
+ napi_threaded_value=${2:-0}
+ prefer_busy_poll_value=${3:-$PREFER_BUSY_POLL}
+
tmp_file=$(mktemp)
out_file=$(mktemp)
@@ -73,10 +79,11 @@ test_busypoll()
-b${SERVER_IP} \
-m${MAX_EVENTS} \
-u${BUSY_POLL_USECS} \
- -P${PREFER_BUSY_POLL} \
+ -P${prefer_busy_poll_value} \
-g${BUSY_POLL_BUDGET} \
-i${NSIM_SV_IFIDX} \
-s${suspend_value} \
+ -t${napi_threaded_value} \
-o${out_file}&
wait_local_port_listen nssv ${SERVER_PORT} tcp
@@ -109,6 +116,15 @@ test_busypoll_with_suspend()
return $?
}
+test_busypoll_with_napi_threaded()
+{
+ # Only enable napi threaded poll. Set suspend timeout and prefer busy
+ # poll to 0.
+ test_busypoll 0 ${NAPI_THREADED_POLL} 0
+
+ return $?
+}
+
###
### Code start
###
@@ -154,6 +170,13 @@ if [ $? -ne 0 ]; then
exit 1
fi
+test_busypoll_with_napi_threaded
+if [ $? -ne 0 ]; then
+ echo "test_busypoll_with_napi_threaded failed"
+ cleanup_ns
+ exit 1
+fi
+
echo "$NSIM_SV_FD:$NSIM_SV_IFIDX" > $NSIM_DEV_SYS_UNLINK
echo $NSIM_CL_ID > $NSIM_DEV_SYS_DEL
diff --git a/tools/testing/selftests/net/busy_poller.c b/tools/testing/selftests/net/busy_poller.c
index 04c7ff577bb8..f7407f09f635 100644
--- a/tools/testing/selftests/net/busy_poller.c
+++ b/tools/testing/selftests/net/busy_poller.c
@@ -65,15 +65,16 @@ static uint32_t cfg_busy_poll_usecs;
static uint16_t cfg_busy_poll_budget;
static uint8_t cfg_prefer_busy_poll;
-/* IRQ params */
+/* NAPI params */
static uint32_t cfg_defer_hard_irqs;
static uint64_t cfg_gro_flush_timeout;
static uint64_t cfg_irq_suspend_timeout;
+static enum netdev_napi_threaded cfg_napi_threaded_poll = NETDEV_NAPI_THREADED_DISABLE;
static void usage(const char *filepath)
{
error(1, 0,
- "Usage: %s -p<port> -b<addr> -m<max_events> -u<busy_poll_usecs> -P<prefer_busy_poll> -g<busy_poll_budget> -o<outfile> -d<defer_hard_irqs> -r<gro_flush_timeout> -s<irq_suspend_timeout> -i<ifindex>",
+ "Usage: %s -p<port> -b<addr> -m<max_events> -u<busy_poll_usecs> -P<prefer_busy_poll> -g<busy_poll_budget> -o<outfile> -d<defer_hard_irqs> -r<gro_flush_timeout> -s<irq_suspend_timeout> -t<napi_threaded_poll> -i<ifindex>",
filepath);
}
@@ -86,7 +87,7 @@ static void parse_opts(int argc, char **argv)
if (argc <= 1)
usage(argv[0]);
- while ((c = getopt(argc, argv, "p:m:b:u:P:g:o:d:r:s:i:")) != -1) {
+ while ((c = getopt(argc, argv, "p:m:b:u:P:g:o:d:r:s:i:t:")) != -1) {
/* most options take integer values, except o and b, so reduce
* code duplication a bit for the common case by calling
* strtoull here and leave bounds checking and casting per
@@ -168,6 +169,12 @@ static void parse_opts(int argc, char **argv)
cfg_ifindex = (int)tmp;
break;
+ case 't':
+ if (tmp == ULLONG_MAX || tmp > 2)
+ error(1, ERANGE, "napi threaded poll value must be 0-2");
+
+ cfg_napi_threaded_poll = (enum netdev_napi_threaded)tmp;
+ break;
}
}
@@ -246,6 +253,7 @@ static void setup_queue(void)
cfg_gro_flush_timeout);
netdev_napi_set_req_set_irq_suspend_timeout(set_req,
cfg_irq_suspend_timeout);
+ netdev_napi_set_req_set_threaded(set_req, cfg_napi_threaded_poll);
if (netdev_napi_set(ys, set_req))
error(1, 0, "can't set NAPI params: %s\n", yerr.msg);
--
2.49.0.850.g28803427d3-goog
* Re: [PATCH net-next v5 0/4] Add support to do threaded napi busy poll
2025-04-24 20:02 [PATCH net-next v5 0/4] Add support to do threaded napi busy poll Samiullah Khawaja
` (3 preceding siblings ...)
2025-04-24 20:02 ` [PATCH net-next v5 4/4] selftests: Add napi threaded busy poll test in `busy_poller` Samiullah Khawaja
@ 2025-04-28 13:50 ` Martin Karsten
2025-04-28 16:52 ` Willem de Bruijn
2025-04-30 15:23 ` Martin Karsten
4 siblings, 2 replies; 23+ messages in thread
From: Martin Karsten @ 2025-04-28 13:50 UTC (permalink / raw)
To: Samiullah Khawaja, Jakub Kicinski, David S. Miller, Eric Dumazet,
Paolo Abeni, almasrymina, willemb, jdamato
Cc: netdev
On 2025-04-24 16:02, Samiullah Khawaja wrote:
> Extend the already existing support of threaded napi poll to do continuous
> busy polling.
>
> This is used for doing continuous polling of napi to fetch descriptors
> from backing RX/TX queues for low latency applications. Allow enabling
> of threaded busypoll using netlink so this can be enabled on a set of
> dedicated napis for low latency applications.
>
> Once enabled user can fetch the PID of the kthread doing NAPI polling
> and set affinity, priority and scheduler for it depending on the
> low-latency requirements.
>
> Currently threaded napi is only enabled at device level using sysfs. Add
> support to enable/disable threaded mode for a napi individually. This
> can be done using the netlink interface. Extend `napi-set` op in netlink
> spec that allows setting the `threaded` attribute of a napi.
>
> Extend the threaded attribute in napi struct to add an option to enable
> continuous busy polling. Extend the netlink and sysfs interface to allow
> enabling/disabling threaded busypolling at device or individual napi
> level.
>
> We use this for our AF_XDP based hard low-latency usecase with usecs
> level latency requirement. For our usecase we want low jitter and stable
> latency at P99.
>
> Following is an analysis and comparison of available (and compatible)
> busy poll interfaces for a low latency usecase with stable P99. Please
> note that throughput and cpu efficiency are non-goals.
>
> For analysis we use an AF_XDP based benchmarking tool `xdp_rr`. The
> description of the tool and how it tries to simulate the real workload
> is following,
>
> - It sends UDP packets between 2 machines.
> - The client machine sends packets at a fixed frequency. To maintain the
> frequency of the packet being sent, we use open-loop sampling. That is
> the packets are sent in a separate thread.
> - The server replies to the packet inline by reading the pkt from the
> recv ring and replies using the tx ring.
> - To simulate the application processing time, we use a configurable
> delay in usecs on the client side after a reply is received from the
> server.
>
> The xdp_rr tool is posted separately as an RFC for tools/testing/selftest.
>
> We use this tool with following napi polling configurations,
>
> - Interrupts only
> - SO_BUSYPOLL (inline in the same thread where the client receives the
> packet).
> - SO_BUSYPOLL (separate thread and separate core)
> - Threaded NAPI busypoll
>
> System is configured using following script in all 4 cases,
>
> ```
> echo 0 | sudo tee /sys/class/net/eth0/threaded
> echo 0 | sudo tee /proc/sys/kernel/timer_migration
> echo off | sudo tee /sys/devices/system/cpu/smt/control
>
> sudo ethtool -L eth0 rx 1 tx 1
> sudo ethtool -G eth0 rx 1024
>
> echo 0 | sudo tee /proc/sys/net/core/rps_sock_flow_entries
> echo 0 | sudo tee /sys/class/net/eth0/queues/rx-0/rps_cpus
>
> # pin IRQs on CPU 2
> IRQS="$(gawk '/eth0-(TxRx-)?1/ {match($1, /([0-9]+)/, arr); \
> print arr[0]}' < /proc/interrupts)"
> for irq in "${IRQS}"; \
> do echo 2 | sudo tee /proc/irq/$irq/smp_affinity_list; done
>
> echo -1 | sudo tee /proc/sys/kernel/sched_rt_runtime_us
>
> for i in /sys/devices/virtual/workqueue/*/cpumask; \
> do echo $i; echo 1,2,3,4,5,6 > $i; done
>
> if [[ -z "$1" ]]; then
> echo 400 | sudo tee /proc/sys/net/core/busy_read
> echo 100 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
> echo 15000 | sudo tee /sys/class/net/eth0/gro_flush_timeout
> fi
>
> sudo ethtool -C eth0 adaptive-rx off adaptive-tx off rx-usecs 0 tx-usecs 0
>
> if [[ "$1" == "enable_threaded" ]]; then
> echo 0 | sudo tee /proc/sys/net/core/busy_poll
> echo 0 | sudo tee /proc/sys/net/core/busy_read
> echo 100 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
> echo 15000 | sudo tee /sys/class/net/eth0/gro_flush_timeout
> echo 2 | sudo tee /sys/class/net/eth0/threaded
> NAPI_T=$(ps -ef | grep napi | grep -v grep | awk '{ print $2 }')
> sudo chrt -f -p 50 $NAPI_T
>
> # pin threaded poll thread to CPU 2
> sudo taskset -pc 2 $NAPI_T
> fi
>
> if [[ "$1" == "enable_interrupt" ]]; then
> echo 0 | sudo tee /proc/sys/net/core/busy_read
> echo 0 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
> echo 15000 | sudo tee /sys/class/net/eth0/gro_flush_timeout
> fi
> ```
>
> To enable various configurations, script can be run as following,
>
> - Interrupt Only
> ```
> <script> enable_interrupt
> ```
>
> - SO_BUSYPOLL (no arguments to script)
> ```
> <script>
> ```
>
> - NAPI threaded busypoll
> ```
> <script> enable_threaded
> ```
>
> If using idpf, the script needs to be run again after launching the
> workload just to make sure that the configurations are not reverted. As
> idpf reverts some configurations on software reset when AF_XDP program
> is attached.
>
> Once configured, the workload is run with various configurations using
> following commands. Set period (1/frequency) and delay in usecs to
> produce results for packet frequency and application processing delay.
>
> ## Interrupt Only and SO_BUSY_POLL (inline)
>
> - Server
> ```
> sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
> -D <IP-dest> -S <IP-src> -M <MAC-dst> -m <MAC-src> -p 54321 -h -v
> ```
>
> - Client
> ```
> sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
> -S <IP-src> -D <IP-dest> -m <MAC-src> -M <MAC-dst> -p 54321 \
> -P <Period-usecs> -d <Delay-usecs> -T -l 1 -v
> ```
>
> ## SO_BUSY_POLL(done in separate core using recvfrom)
>
> Argument -t spawns a separate thread and continuously calls recvfrom.
>
> - Server
> ```
> sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
> -D <IP-dest> -S <IP-src> -M <MAC-dst> -m <MAC-src> -p 54321 \
> -h -v -t
> ```
>
> - Client
> ```
> sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
> -S <IP-src> -D <IP-dest> -m <MAC-src> -M <MAC-dst> -p 54321 \
> -P <Period-usecs> -d <Delay-usecs> -T -l 1 -v -t
> ```
>
> ## NAPI Threaded Busy Poll
>
> Argument -n skips the recvfrom call as there is no recv kick needed.
>
> - Server
> ```
> sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
> -D <IP-dest> -S <IP-src> -M <MAC-dst> -m <MAC-src> -p 54321 \
> -h -v -n
> ```
>
> - Client
> ```
> sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
> -S <IP-src> -D <IP-dest> -m <MAC-src> -M <MAC-dst> -p 54321 \
> -P <Period-usecs> -d <Delay-usecs> -T -l 1 -v -n
> ```
>
> | Experiment | interrupts | SO_BUSYPOLL | SO_BUSYPOLL(separate) | NAPI threaded |
> |---|---|---|---|---|
> | 12 Kpkt/s + 0us delay | | | | |
> | | p5: 12700 | p5: 12900 | p5: 13300 | p5: 12800 |
> | | p50: 13100 | p50: 13600 | p50: 14100 | p50: 13000 |
> | | p95: 13200 | p95: 13800 | p95: 14400 | p95: 13000 |
> | | p99: 13200 | p99: 13800 | p99: 14400 | p99: 13000 |
> | 32 Kpkt/s + 30us delay | | | | |
> | | p5: 19900 | p5: 16600 | p5: 13100 | p5: 12800 |
> | | p50: 21100 | p50: 17000 | p50: 13700 | p50: 13000 |
> | | p95: 21200 | p95: 17100 | p95: 14000 | p95: 13000 |
> | | p99: 21200 | p99: 17100 | p99: 14000 | p99: 13000 |
> | 125 Kpkt/s + 6us delay | | | | |
> | | p5: 14600 | p5: 17100 | p5: 13300 | p5: 12900 |
> | | p50: 15400 | p50: 17400 | p50: 13800 | p50: 13100 |
> | | p95: 15600 | p95: 17600 | p95: 14000 | p95: 13100 |
> | | p99: 15600 | p99: 17600 | p99: 14000 | p99: 13100 |
> | 12 Kpkt/s + 78us delay | | | | |
> | | p5: 14100 | p5: 16700 | p5: 13200 | p5: 12600 |
> | | p50: 14300 | p50: 17100 | p50: 13900 | p50: 12800 |
> | | p95: 14300 | p95: 17200 | p95: 14200 | p95: 12800 |
> | | p99: 14300 | p99: 17200 | p99: 14200 | p99: 12800 |
> | 25 Kpkt/s + 38us delay | | | | |
> | | p5: 19900 | p5: 16600 | p5: 13000 | p5: 12700 |
> | | p50: 21000 | p50: 17100 | p50: 13800 | p50: 12900 |
> | | p95: 21100 | p95: 17100 | p95: 14100 | p95: 12900 |
> | | p99: 21100 | p99: 17100 | p99: 14100 | p99: 12900 |
>
> ## Observations
>
> - Here without application processing all the approaches give the same
> latency within a 1 usec range and NAPI threaded gives the minimum latency.
> - With application processing the latency increases by 3-4usecs when
> doing inline polling.
> - Using a dedicated core to drive napi polling keeps the latency the same
> even with application processing. This is observed both in userspace
> and threaded napi (in the kernel).
> - Using napi threaded polling in the kernel gives lower latency by
> 1-1.5 usecs as compared to userspace driven polling on a separate core.
> - With application processing userspace will get the packet from recv
> ring and spend some time doing application processing and then do napi
> polling. While application processing is happening a dedicated core
> doing napi polling can pull the packet off the NAPI RX queue and
> populate the AF_XDP recv ring. This means that when the application
> thread is done with application processing it has new packets ready to
> recv and process in recv ring.
> - Napi threaded busy polling in the kernel with a dedicated core gives
> the consistent P5-P99 latency.
I've experimented with this some more. I can confirm latency savings of
about 1 usec arising from busy-looping a NAPI thread on a dedicated core
when compared to in-thread busy-polling. A few more comments:
1) I note that the experiment results above show that 'interrupts' is
almost as fast as 'NAPI threaded' in the base case. I cannot confirm
these results, because I currently only have (very) old hardware
available for testing. However, these results worry me in terms of
necessity of the threaded busy-polling mechanism - also see Item 4) below.
2) The experiments reported here are symmetric in that they use the same
polling variant at both the client and the server. When mixing things up
by combining different polling variants, it becomes clear that the
latency savings are split between both ends. The total savings of 1 usec
are thus a combination of 0.5 usec at either end. So the ultimate
trade-off is 0.5 usec latency gain for burning 1 core.
3) I believe the savings arise from running two tight loops (separate
NAPI and application) instead of one longer loop. The shorter loops
likely result in better cache utilization on their respective dedicated
cores (and L1 caches). However I am not sure right how to explicitly
confirm this.
4) I still believe that the additional experiments with setting both
delay and period are meaningless. They create corner cases where rate *
delay is about 1. Nobody would run a latency-critical system at 100%
load. I also note that the experiment program xsk_rr fails when trying
to increase the load beyond saturation (client fails with 'xsk_rr:
oustanding array full').
5) I worry that a mechanism like this might be misinterpreted as some
kind of magic wand for improving performance and might end up being used
in practice and cause substantial overhead without much gain. If
accepted, I would hope that this will be documented very clearly and
have appropriate warnings attached. Given that the patch cover letter is
often used as a basis for documentation, I believe this should be
spelled out in the cover letter.
With the above in mind, someone else will need to judge whether (at
most) 0.5 usec for burning a core is a worthy enough trade-off to
justify inclusion of this mechanism. Maybe someone else can take a
closer look at the 'interrupts' variant on modern hardware.
Thanks,
Martin
* Re: [PATCH net-next v5 0/4] Add support to do threaded napi busy poll
2025-04-28 13:50 ` [PATCH net-next v5 0/4] Add support to do threaded napi busy poll Martin Karsten
@ 2025-04-28 16:52 ` Willem de Bruijn
2025-04-28 17:39 ` Joe Damato
2025-04-28 18:05 ` Martin Karsten
2025-04-30 15:23 ` Martin Karsten
1 sibling, 2 replies; 23+ messages in thread
From: Willem de Bruijn @ 2025-04-28 16:52 UTC (permalink / raw)
To: Martin Karsten, Samiullah Khawaja, Jakub Kicinski,
David S . Miller, Eric Dumazet, Paolo Abeni, almasrymina, willemb,
jdamato
Cc: netdev
Martin Karsten wrote:
> On 2025-04-24 16:02, Samiullah Khawaja wrote:
> > Extend the existing threaded napi poll support to do continuous busy
> > polling.
> >
> > This is used for continuous polling of a napi to fetch descriptors from
> > its backing RX/TX queues, for low latency applications. Allow enabling
> > threaded busypoll using netlink so it can be turned on for a set of
> > dedicated napis.
> >
> > Once enabled, the user can fetch the PID of the kthread doing NAPI
> > polling and set its affinity, priority and scheduler depending on the
> > low-latency requirements.
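> >
> > As a minimal sketch of that flow (assuming the kthread naming of the
> > existing threaded napi support, napi/<dev>-<napi-id>):
> >
> > ```
> > NAPI_T=$(pgrep -f 'napi/eth0')   # PID of the napi kthread for eth0
> > sudo chrt -f -p 50 $NAPI_T       # SCHED_FIFO, priority 50
> > sudo taskset -pc 2 $NAPI_T       # pin it to CPU 2
> > ```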
> >
> > Currently threaded napi is only enabled at device level using sysfs. Add
> > support to enable/disable threaded mode for a napi individually. This
> > can be done using the netlink interface. Extend the `napi-set` op in the
> > netlink spec to allow setting the `threaded` attribute of a napi.
> >
> > Extend the threaded attribute in napi struct to add an option to enable
> > continuous busy polling. Extend the netlink and sysfs interfaces to allow
> > enabling/disabling threaded busypolling at device or individual napi
> > level.
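> >
> > As a sketch of how this could be driven per-napi via YNL (the op is
> > added by this series; the attribute values here are assumptions, with 2
> > mapping to the busy-poll variant as in the sysfs knob below):
> >
> > ```
> > # find the napi ids backing the device
> > ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
> >     --dump napi-get --json '{"ifindex": 2}'
> >
> > # enable threaded busy polling on one napi
> > ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
> >     --do napi-set --json '{"id": 345, "threaded": 2}'
> > ```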
> >
> > We use this for our AF_XDP based hard low-latency use case with
> > usec-level latency requirements. For our use case we want low jitter and
> > stable latency at P99.
> >
> > Following is an analysis and comparison of available (and compatible)
> > busy poll interfaces for a low latency use case with stable P99. Please
> > note that throughput and cpu efficiency are non-goals.
> >
> > For analysis we use an AF_XDP based benchmarking tool, `xsk_rr`. The
> > description of the tool and how it tries to simulate the real workload
> > is as follows:
> >
> > - It sends UDP packets between 2 machines.
> > - The client machine sends packets at a fixed frequency. To maintain the
> > frequency of the packet being sent, we use open-loop sampling. That is
> > the packets are sent in a separate thread.
> > - The server replies to the packet inline by reading the pkt from the
> > recv ring and replies using the tx ring.
> > - To simulate the application processing time, we use a configurable
> > delay in usecs on the client side after a reply is received from the
> > server.
> >
> > The xsk_rr tool is posted separately as an RFC for tools/testing/selftest.
> >
> > We use this tool with the following napi polling configurations:
> >
> > - Interrupts only
> > - SO_BUSYPOLL (inline in the same thread where the client receives the
> > packet).
> > - SO_BUSYPOLL (separate thread and separate core)
> > - Threaded NAPI busypoll
> >
> > The system is configured using the following script in all 4 cases:
> >
> > ```
> > echo 0 | sudo tee /sys/class/net/eth0/threaded
> > echo 0 | sudo tee /proc/sys/kernel/timer_migration
> > echo off | sudo tee /sys/devices/system/cpu/smt/control
> >
> > sudo ethtool -L eth0 rx 1 tx 1
> > sudo ethtool -G eth0 rx 1024
> >
> > echo 0 | sudo tee /proc/sys/net/core/rps_sock_flow_entries
> > echo 0 | sudo tee /sys/class/net/eth0/queues/rx-0/rps_cpus
> >
> > # pin IRQs on CPU 2
> > IRQS="$(gawk '/eth0-(TxRx-)?1/ {match($1, /([0-9]+)/, arr); \
> > print arr[0]}' < /proc/interrupts)"
> > for irq in "${IRQS}"; \
> > do echo 2 | sudo tee /proc/irq/$irq/smp_affinity_list; done
> >
> > echo -1 | sudo tee /proc/sys/kernel/sched_rt_runtime_us
> >
> > for i in /sys/devices/virtual/workqueue/*/cpumask; \
> > do echo $i; echo 1,2,3,4,5,6 > $i; done
> >
> > if [[ -z "$1" ]]; then
> > echo 400 | sudo tee /proc/sys/net/core/busy_read
> > echo 100 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
> > echo 15000 | sudo tee /sys/class/net/eth0/gro_flush_timeout
> > fi
> >
> > sudo ethtool -C eth0 adaptive-rx off adaptive-tx off rx-usecs 0 tx-usecs 0
> >
> > if [[ "$1" == "enable_threaded" ]]; then
> > echo 0 | sudo tee /proc/sys/net/core/busy_poll
> > echo 0 | sudo tee /proc/sys/net/core/busy_read
> > echo 100 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
> > echo 15000 | sudo tee /sys/class/net/eth0/gro_flush_timeout
> > echo 2 | sudo tee /sys/class/net/eth0/threaded
> > NAPI_T=$(ps -ef | grep napi | grep -v grep | awk '{ print $2 }')
> > sudo chrt -f -p 50 $NAPI_T
> >
> > # pin threaded poll thread to CPU 2
> > sudo taskset -pc 2 $NAPI_T
> > fi
> >
> > if [[ "$1" == "enable_interrupt" ]]; then
> > echo 0 | sudo tee /proc/sys/net/core/busy_read
> > echo 0 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
> > echo 15000 | sudo tee /sys/class/net/eth0/gro_flush_timeout
> > fi
> > ```
> >
> > To enable the various configurations, the script can be run as follows:
> >
> > - Interrupt Only
> > ```
> > <script> enable_interrupt
> > ```
> >
> > - SO_BUSYPOLL (no arguments to script)
> > ```
> > <script>
> > ```
> >
> > - NAPI threaded busypoll
> > ```
> > <script> enable_threaded
> > ```
> >
> > If using idpf, the script needs to be run again after launching the
> > workload to make sure that the configurations are not reverted, as idpf
> > reverts some configurations on software reset when an AF_XDP program is
> > attached.
> >
> > Once configured, the workload is run with various configurations using
> > the following commands. Set the period (1/frequency) and delay in usecs to
> > produce results for packet frequency and application processing delay.
> >
> > ## Interrupt Only and SO_BUSY_POLL (inline)
> >
> > - Server
> > ```
> > sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
> > -D <IP-dest> -S <IP-src> -M <MAC-dst> -m <MAC-src> -p 54321 -h -v
> > ```
> >
> > - Client
> > ```
> > sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
> > -S <IP-src> -D <IP-dest> -m <MAC-src> -M <MAC-dst> -p 54321 \
> > -P <Period-usecs> -d <Delay-usecs> -T -l 1 -v
> > ```
> >
> > ## SO_BUSY_POLL(done in separate core using recvfrom)
> >
> > Argument -t spawns a separate thread and continuously calls recvfrom.
> >
> > - Server
> > ```
> > sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
> > -D <IP-dest> -S <IP-src> -M <MAC-dst> -m <MAC-src> -p 54321 \
> > -h -v -t
> > ```
> >
> > - Client
> > ```
> > sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
> > -S <IP-src> -D <IP-dest> -m <MAC-src> -M <MAC-dst> -p 54321 \
> > -P <Period-usecs> -d <Delay-usecs> -T -l 1 -v -t
> > ```
> >
> > ## NAPI Threaded Busy Poll
> >
> > Argument -n skips the recvfrom call as there is no recv kick needed.
> >
> > - Server
> > ```
> > sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
> > -D <IP-dest> -S <IP-src> -M <MAC-dst> -m <MAC-src> -p 54321 \
> > -h -v -n
> > ```
> >
> > - Client
> > ```
> > sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
> > -S <IP-src> -D <IP-dest> -m <MAC-src> -M <MAC-dst> -p 54321 \
> > -P <Period-usecs> -d <Delay-usecs> -T -l 1 -v -n
> > ```
> >
> > | Experiment (latency in ns) | interrupts | SO_BUSYPOLL | SO_BUSYPOLL(separate) | NAPI threaded |
> > |---|---|---|---|---|
> > | 12 Kpkt/s + 0us delay | | | | |
> > | | p5: 12700 | p5: 12900 | p5: 13300 | p5: 12800 |
> > | | p50: 13100 | p50: 13600 | p50: 14100 | p50: 13000 |
> > | | p95: 13200 | p95: 13800 | p95: 14400 | p95: 13000 |
> > | | p99: 13200 | p99: 13800 | p99: 14400 | p99: 13000 |
> > | 32 Kpkt/s + 30us delay | | | | |
> > | | p5: 19900 | p5: 16600 | p5: 13100 | p5: 12800 |
> > | | p50: 21100 | p50: 17000 | p50: 13700 | p50: 13000 |
> > | | p95: 21200 | p95: 17100 | p95: 14000 | p95: 13000 |
> > | | p99: 21200 | p99: 17100 | p99: 14000 | p99: 13000 |
> > | 125 Kpkt/s + 6us delay | | | | |
> > | | p5: 14600 | p5: 17100 | p5: 13300 | p5: 12900 |
> > | | p50: 15400 | p50: 17400 | p50: 13800 | p50: 13100 |
> > | | p95: 15600 | p95: 17600 | p95: 14000 | p95: 13100 |
> > | | p99: 15600 | p99: 17600 | p99: 14000 | p99: 13100 |
> > | 12 Kpkt/s + 78us delay | | | | |
> > | | p5: 14100 | p5: 16700 | p5: 13200 | p5: 12600 |
> > | | p50: 14300 | p50: 17100 | p50: 13900 | p50: 12800 |
> > | | p95: 14300 | p95: 17200 | p95: 14200 | p95: 12800 |
> > | | p99: 14300 | p99: 17200 | p99: 14200 | p99: 12800 |
> > | 25 Kpkt/s + 38us delay | | | | |
> > | | p5: 19900 | p5: 16600 | p5: 13000 | p5: 12700 |
> > | | p50: 21000 | p50: 17100 | p50: 13800 | p50: 12900 |
> > | | p95: 21100 | p95: 17100 | p95: 14100 | p95: 12900 |
> > | | p99: 21100 | p99: 17100 | p99: 14100 | p99: 12900 |
> >
> > ## Observations
> >
> > - Here, without application processing, all the approaches give the same
> > latency within a 1 usec range, and NAPI threaded gives the minimum latency.
> > - With application processing the latency increases by 3-4usecs when
> > doing inline polling.
> > - Using a dedicated core to drive napi polling keeps the latency same
> > even with application processing. This is observed both in userspace
> > and threaded napi (in kernel).
> > - Using napi threaded polling in kernel gives lower latency by
> > 1-1.5usecs as compared to userspace driven polling in separate core.
> > - With application processing userspace will get the packet from recv
> > ring and spend some time doing application processing and then do napi
> > polling. While application processing is happening a dedicated core
> > doing napi polling can pull the packet off the NAPI RX queue and
> > populate the AF_XDP recv ring. This means that when the application
> > thread is done with application processing it has new packets ready to
> > recv and process in recv ring.
> > - Napi threaded busy polling in the kernel with a dedicated core gives
> > consistent P5-P99 latency.
> I've experimented with this some more. I can confirm latency savings of
> about 1 usec arising from busy-looping a NAPI thread on a dedicated core
> when compared to in-thread busy-polling. A few more comments:
>
> 1) I note that the experiment results above show that 'interrupts' is
> almost as fast as 'NAPI threaded' in the base case. I cannot confirm
> these results, because I currently only have (very) old hardware
> available for testing. However, these results worry me in terms of
> necessity of the threaded busy-polling mechanism - also see Item 4) below.
>
> 2) The experiments reported here are symmetric in that they use the same
> polling variant at both the client and the server. When mixing things up
> by combining different polling variants, it becomes clear that the
> latency savings are split between both ends. The total savings of 1 usec
> are thus a combination of 0.5 usec at either end. So the ultimate
> trade-off is 0.5 usec latency gain for burning 1 core.
>
> 3) I believe the savings arise from running two tight loops (separate
> NAPI and application) instead of one longer loop. The shorter loops
> likely result in better cache utilization on their respective dedicated
> cores (and L1 caches). However, I am not sure right now how to explicitly
> confirm this.
>
> 4) I still believe that the additional experiments with setting both
> delay and period are meaningless. They create corner cases where rate *
> delay is about 1. Nobody would run a latency-critical system at 100%
> load. I also note that the experiment program xsk_rr fails when trying
> to increase the load beyond saturation (client fails with 'xsk_rr:
> oustanding array full').
>
> 5) I worry that a mechanism like this might be misinterpreted as some
> kind of magic wand for improving performance and might end up being used
> in practice and cause substantial overhead without much gain. If
> accepted, I would hope that this will be documented very clearly and
> have appropriate warnings attached. Given that the patch cover letter is
> often used as a basis for documentation, I believe this should be
> spelled out in the cover letter.
>
> With the above in mind, someone else will need to judge whether (at
> most) 0.5 usec for burning a core is a worthy enough trade-off to
> justify inclusion of this mechanism. Maybe someone else can take a
> closer look at the 'interrupts' variant on modern hardware.
Ack on documentation of the pros/cons.
There is also a functional argument for this feature. It brings
parity with userspace network stacks like DPDK and Google's SNAP [1].
These also run packet (and L4+) network processing on dedicated cores,
and by default do so in polling mode. An XDP plane currently lacks
this well-understood configuration; this series brings it closer to parity.
Users of such advanced environments can be expected to be well
familiar with the cost of polling. The cost/benefit can be debated
and benchmarked for individual applications. But there clearly are
active uses for polling, so I think it should be an operating system
facility.
[1] https://research.google/pubs/snap-a-microkernel-approach-to-host-networking/
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH net-next v5 0/4] Add support to do threaded napi busy poll
2025-04-28 16:52 ` Willem de Bruijn
@ 2025-04-28 17:39 ` Joe Damato
2025-04-28 17:59 ` Willem de Bruijn
2025-04-28 18:05 ` Martin Karsten
1 sibling, 1 reply; 23+ messages in thread
From: Joe Damato @ 2025-04-28 17:39 UTC (permalink / raw)
To: Willem de Bruijn
Cc: Martin Karsten, Samiullah Khawaja, Jakub Kicinski,
David S . Miller, Eric Dumazet, Paolo Abeni, almasrymina, willemb,
netdev
On Mon, Apr 28, 2025 at 12:52:11PM -0400, Willem de Bruijn wrote:
> Martin Karsten wrote:
> > On 2025-04-24 16:02, Samiullah Khawaja wrote:
> Ack on documentation of the pros/cons.
In my mind this includes documenting CPU usage, which I know is
considered a "non-goal" of this series. It can be a "non-goal", but it
is still very relevant to the conversation and documentation.
> There is also a functional argument for this feature. It brings
> parity with userspace network stacks like DPDK and Google's SNAP [1].
> These also run packet (and L4+) network processing on dedicated cores,
> and by default do so in polling mode. An XDP plane currently lacks
> this well-understood configuration; this series brings it closer to parity.
It would be good if this could also be included in the cover letter,
I think, possibly with example use cases.
> Users of such advanced environments can be expected to be well
> familiar with the cost of polling. The cost/benefit can be debated
> and benchmarked for individual applications. But there clearly are
> active uses for polling, so I think it should be an operating system
> facility.
You mention users of advanced environments, but I think it's
important to consider the average user who is not necessarily a
kernel programmer.
Will that user understand that not all apps support this? Or will
they think that they can simply run a few YNL commands, burning CPUs at
100% for apps that don't even support this, thinking they are making
their networking faster?
I think providing a mechanism to burn CPU cores at 100% by
enabling threaded busy poll has serious consequences for power
consumption, cooling requirements, and, errr, earth. I don't think
it's a decision to be taken lightly.
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH net-next v5 0/4] Add support to do threaded napi busy poll
2025-04-28 17:39 ` Joe Damato
@ 2025-04-28 17:59 ` Willem de Bruijn
2025-04-28 18:05 ` Martin Karsten
0 siblings, 1 reply; 23+ messages in thread
From: Willem de Bruijn @ 2025-04-28 17:59 UTC (permalink / raw)
To: Joe Damato, Willem de Bruijn
Cc: Martin Karsten, Samiullah Khawaja, Jakub Kicinski,
David S . Miller, Eric Dumazet, Paolo Abeni, almasrymina, willemb,
netdev
Joe Damato wrote:
> On Mon, Apr 28, 2025 at 12:52:11PM -0400, Willem de Bruijn wrote:
> > Martin Karsten wrote:
> > > On 2025-04-24 16:02, Samiullah Khawaja wrote:
>
> > Ack on documentation of the pros/cons.
>
> In my mind this includes documenting CPU usage which I know is
> considered as "non-goal" of this series. It can be a "non-goal" but
> it is still very relevant to the conversation and documentation.
>
> > There is also a functional argument for this feature. It brings
> > parity with userspace network stacks like DPDK and Google's SNAP [1].
> > These also run packet (and L4+) network processing on dedicated cores,
> > and by default do so in polling mode. An XDP plane currently lacks
> > this well-understood configuration; this series brings it closer to parity.
>
> It would be good if this could also be included in the cover letter,
> I think, possibly with example use cases.
>
> > Users of such advanced environments can be expected to be well
> > familiar with the cost of polling. The cost/benefit can be debated
> > and benchmarked for individual applications. But there clearly are
> > active uses for polling, so I think it should be an operating system
> > facility.
>
> You mention users of advanced environments, but I think it's
> important to consider the average user who is not necessarily a
> kernel programmer.
>
> Will that user understand that not all apps support this? Or will
> they think that they can simply run a few YNL commands burning CPUs at
> 100% for apps that don't even support this thinking they are making
> their networking faster?
Busy polling can already be configured through sysfs.
I have not seen any conversations where this is suggested to
non-expert users. I don't think this will be different.
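For reference, the existing knobs include the busy_read/busy_poll
sysctls (values in usecs; 50 is just an example):
```
echo 50 | sudo tee /proc/sys/net/core/busy_read
echo 50 | sudo tee /proc/sys/net/core/busy_poll
```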
But we can and should definitely increase the confidence by making
sure that any documentation of the feature contains a clear warning
about the impact and that this is for expert users only.
> I think providing a mechanism to burn CPU cores at 100% by
> enabling threaded busy poll has serious consequences for power
> consumption, cooling requirements, and, errr, earth. I don't think
> it's a decision to be taken lightly.
Good point.
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH net-next v5 0/4] Add support to do threaded napi busy poll
2025-04-28 17:59 ` Willem de Bruijn
@ 2025-04-28 18:05 ` Martin Karsten
2025-04-30 12:37 ` David Laight
0 siblings, 1 reply; 23+ messages in thread
From: Martin Karsten @ 2025-04-28 18:05 UTC (permalink / raw)
To: Willem de Bruijn, Joe Damato
Cc: Samiullah Khawaja, Jakub Kicinski, David S . Miller, Eric Dumazet,
Paolo Abeni, almasrymina, willemb, netdev
On 2025-04-28 13:59, Willem de Bruijn wrote:
> Joe Damato wrote:
>> On Mon, Apr 28, 2025 at 12:52:11PM -0400, Willem de Bruijn wrote:
>>> Martin Karsten wrote:
>>>> On 2025-04-24 16:02, Samiullah Khawaja wrote:
[snip]
>>> Users of such advanced environments can be expected to be well
>>> familiar with the cost of polling. The cost/benefit can be debated
>>> and benchmarked for individual applications. But there clearly are
>>> active uses for polling, so I think it should be an operating system
>>> facility.
>>
>> You mention users of advanced environments, but I think it's
>> important to consider the average user who is not necessarily a
>> kernel programmer.
>>
>> Will that user understand that not all apps support this? Or will
>> they think that they can simply run a few YNL commands burning CPUs at
>> 100% for apps that don't even support this thinking they are making
>> their networking faster?
>
> Busy polling can already be configured through sysfs.
That's not the same. The existing busy-poll mechanism still needs an
application thread. This thing will just go off and spin, whether
there's an application or not.
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH net-next v5 0/4] Add support to do threaded napi busy poll
2025-04-28 18:05 ` Martin Karsten
@ 2025-04-30 12:37 ` David Laight
2025-04-30 16:47 ` Samiullah Khawaja
0 siblings, 1 reply; 23+ messages in thread
From: David Laight @ 2025-04-30 12:37 UTC (permalink / raw)
To: Martin Karsten
Cc: Willem de Bruijn, Joe Damato, Samiullah Khawaja, Jakub Kicinski,
David S . Miller, Eric Dumazet, Paolo Abeni, almasrymina, willemb,
netdev
On Mon, 28 Apr 2025 14:05:17 -0400
Martin Karsten <mkarsten@uwaterloo.ca> wrote:
...
> That's not the same. The existing busy-poll mechanism still needs an
> application thread. This thing will just go off and spin, whether
> there's an application or not.
I think this (and elsewhere) should be 'busy spin', not 'busy poll'.
That would make it much more obvious that it really is a cpu-intensive
spin loop.
Note that on some cpus all the 'cores' run at the same speed,
so putting one into a 'spin' will cause the frequency of all of
them to increase - thus speeding up a benchmark.
That is rather the opposite of tests where a cpu-busy thread (doing work)
gets bounced around physical cpus - so it keeps being run on ones
running at a low clock speed.
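One way to control for this effect when benchmarking (a sketch; governor
availability depends on the cpufreq driver):
```
# pin all cores to the performance governor so one spinning core
# doesn't skew the comparison
sudo cpupower frequency-set -g performance

# watch per-core clocks while the test runs
watch -n1 'grep MHz /proc/cpuinfo'
```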
David
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH net-next v5 0/4] Add support to do threaded napi busy poll
2025-04-30 12:37 ` David Laight
@ 2025-04-30 16:47 ` Samiullah Khawaja
0 siblings, 0 replies; 23+ messages in thread
From: Samiullah Khawaja @ 2025-04-30 16:47 UTC (permalink / raw)
To: David Laight
Cc: Martin Karsten, Willem de Bruijn, Joe Damato, Jakub Kicinski,
David S . Miller, Eric Dumazet, Paolo Abeni, almasrymina, willemb,
netdev
On Wed, Apr 30, 2025 at 5:37 AM David Laight
<david.laight.linux@gmail.com> wrote:
>
> On Mon, 28 Apr 2025 14:05:17 -0400
> Martin Karsten <mkarsten@uwaterloo.ca> wrote:
>
> ...
> > That's not the same. The existing busy-poll mechanism still needs an
> > application thread. This thing will just go off and spin, whether
> > there's an application or not.
>
> I think this (and elsewhere) should be 'busy spin', not 'busy poll'.
> That would make it much more obvious that it really is a cpu-intensive
> spin loop.
"Busy Poll" is synonymous to polling a socket/napi/queue while
spinning. For example it is used for socket busy poll and epoll busy
poll. Calling it "busy spin" would be confusing I think. I will add a
note in documentation, as suggested, to explain that it will spin a
core and depending on requirements one might want to share the core by
using affinity and priority configurations.
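A sketch of that kind of sharing (priorities are illustrative and $APP
is assumed to be the latency-critical task):
```
NAPI_T=$(pgrep -f 'napi/eth0')   # napi kthread, as in the setup script
sudo taskset -pc 2 $NAPI_T       # napi thread and app share CPU 2
sudo taskset -pc 2 $APP
sudo chrt -f -p 50 $NAPI_T       # napi thread at SCHED_FIFO 50
sudo chrt -f -p 60 $APP          # app can preempt the spinning napi thread
```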
>
> Note that on some cpus all the 'cores' run at the same speed,
> so putting one into a 'spin' will cause the frequency of all of
> them to increase - thus speeding up a benchmark.
>
> That is rather the opposite of tests where a cpu-busy thread (doing work)
> gets bounced around physical cpus - so it keeps being run on ones
> running at a low clock speed.
>
> David
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH net-next v5 0/4] Add support to do threaded napi busy poll
2025-04-28 16:52 ` Willem de Bruijn
2025-04-28 17:39 ` Joe Damato
@ 2025-04-28 18:05 ` Martin Karsten
2025-04-28 18:20 ` Willem de Bruijn
1 sibling, 1 reply; 23+ messages in thread
From: Martin Karsten @ 2025-04-28 18:05 UTC (permalink / raw)
To: Willem de Bruijn, Samiullah Khawaja, Jakub Kicinski,
David S . Miller, Eric Dumazet, Paolo Abeni, almasrymina, willemb,
jdamato
Cc: netdev
On 2025-04-28 12:52, Willem de Bruijn wrote:
> Martin Karsten wrote:
>> On 2025-04-24 16:02, Samiullah Khawaja wrote:
[snip]
> There is also a functional argument for this feature. It brings
> parity with userspace network stacks like DPDK and Google's SNAP [1].
> These also run packet (and L4+) network processing on dedicated cores,
> and by default do so in polling mode. An XDP plane currently lacks
> this well-understood configuration; this series brings it closer to parity.
I believe these existing user-level stacks burn one core per queue. This
scheme burns two.
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH net-next v5 0/4] Add support to do threaded napi busy poll
2025-04-28 18:05 ` Martin Karsten
@ 2025-04-28 18:20 ` Willem de Bruijn
0 siblings, 0 replies; 23+ messages in thread
From: Willem de Bruijn @ 2025-04-28 18:20 UTC (permalink / raw)
To: Martin Karsten, Willem de Bruijn, Samiullah Khawaja,
Jakub Kicinski, David S . Miller, Eric Dumazet, Paolo Abeni,
almasrymina, willemb, jdamato
Cc: netdev
Martin Karsten wrote:
> On 2025-04-28 12:52, Willem de Bruijn wrote:
> > Martin Karsten wrote:
> >> On 2025-04-24 16:02, Samiullah Khawaja wrote:
>
> [snip]
>
> > There is also a functional argument for this feature. It brings
> > parity with userspace network stacks like DPDK and Google's SNAP [1].
> > These also run packet (and L4+) network processing on dedicated cores,
> > and by default do so in polling mode. An XDP plane currently lacks
> > this well-understood configuration; this series brings it closer to parity.
> I believe these existing user-level stacks burn one core per queue. This
> scheme burns two.
That's specific to the AF_XDP use case, where AF_XDP also runs in
polling mode.
There are other use cases that more closely match DPDK, such as an
XDP dataplane. Conversely, DPDK with reinjection into the host kernel
will also have to handle the signal or polling of the second stage.
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH net-next v5 0/4] Add support to do threaded napi busy poll
2025-04-28 13:50 ` [PATCH net-next v5 0/4] Add support to do threaded napi busy poll Martin Karsten
2025-04-28 16:52 ` Willem de Bruijn
@ 2025-04-30 15:23 ` Martin Karsten
2025-04-30 16:58 ` Samiullah Khawaja
1 sibling, 1 reply; 23+ messages in thread
From: Martin Karsten @ 2025-04-30 15:23 UTC (permalink / raw)
To: Samiullah Khawaja, Jakub Kicinski, David S . Miller, Eric Dumazet,
Paolo Abeni, almasrymina, willemb, jdamato
Cc: netdev
On 2025-04-28 09:50, Martin Karsten wrote:
> On 2025-04-24 16:02, Samiullah Khawaja wrote:
[snip]
>> | Experiment (latency in ns) | interrupts | SO_BUSYPOLL | SO_BUSYPOLL(separate) | NAPI threaded |
>> |---|---|---|---|---|
>> | 12 Kpkt/s + 0us delay | | | | |
>> | | p5: 12700 | p5: 12900 | p5: 13300 | p5: 12800 |
>> | | p50: 13100 | p50: 13600 | p50: 14100 | p50: 13000 |
>> | | p95: 13200 | p95: 13800 | p95: 14400 | p95: 13000 |
>> | | p99: 13200 | p99: 13800 | p99: 14400 | p99: 13000 |
>> | 32 Kpkt/s + 30us delay | | | | |
>> | | p5: 19900 | p5: 16600 | p5: 13100 | p5: 12800 |
>> | | p50: 21100 | p50: 17000 | p50: 13700 | p50: 13000 |
>> | | p95: 21200 | p95: 17100 | p95: 14000 | p95: 13000 |
>> | | p99: 21200 | p99: 17100 | p99: 14000 | p99: 13000 |
>> | 125 Kpkt/s + 6us delay | | | | |
>> | | p5: 14600 | p5: 17100 | p5: 13300 | p5: 12900 |
>> | | p50: 15400 | p50: 17400 | p50: 13800 | p50: 13100 |
>> | | p95: 15600 | p95: 17600 | p95: 14000 | p95: 13100 |
>> | | p99: 15600 | p99: 17600 | p99: 14000 | p99: 13100 |
>> | 12 Kpkt/s + 78us delay | | | | |
>> | | p5: 14100 | p5: 16700 | p5: 13200 | p5: 12600 |
>> | | p50: 14300 | p50: 17100 | p50: 13900 | p50: 12800 |
>> | | p95: 14300 | p95: 17200 | p95: 14200 | p95: 12800 |
>> | | p99: 14300 | p99: 17200 | p99: 14200 | p99: 12800 |
>> | 25 Kpkt/s + 38us delay | | | | |
>> | | p5: 19900 | p5: 16600 | p5: 13000 | p5: 12700 |
>> | | p50: 21000 | p50: 17100 | p50: 13800 | p50: 12900 |
>> | | p95: 21100 | p95: 17100 | p95: 14100 | p95: 12900 |
>> | | p99: 21100 | p99: 17100 | p99: 14100 | p99: 12900 |
>>
>> ## Observations
>>
>> - Here, without application processing, all the approaches give the same
>> latency within a 1 usec range, and NAPI threaded gives the minimum latency.
>> - With application processing the latency increases by 3-4usecs when
>> doing inline polling.
>> - Using a dedicated core to drive napi polling keeps the latency same
>> even with application processing. This is observed both in userspace
>> and threaded napi (in kernel).
>> - Using napi threaded polling in kernel gives lower latency by
>> 1-1.5usecs as compared to userspace driven polling in separate core.
>> - With application processing userspace will get the packet from recv
>> ring and spend some time doing application processing and then do napi
>> polling. While application processing is happening a dedicated core
>> doing napi polling can pull the packet off the NAPI RX queue and
>> populate the AF_XDP recv ring. This means that when the application
>> thread is done with application processing it has new packets ready to
>> recv and process in recv ring.
>> - Napi threaded busy polling in the kernel with a dedicated core gives
>> consistent P5-P99 latency.
> I've experimented with this some more. I can confirm latency savings of
> about 1 usec arising from busy-looping a NAPI thread on a dedicated core
> when compared to in-thread busy-polling. A few more comments:
>
> 1) I note that the experiment results above show that 'interrupts' is
> almost as fast as 'NAPI threaded' in the base case. I cannot confirm
> these results, because I currently only have (very) old hardware
> available for testing. However, these results worry me in terms of
> necessity of the threaded busy-polling mechanism - also see Item 4) below.
I want to add one more thought, just to spell this out explicitly:
Assuming the latency benefits result from better cache utilization of
two shorter processing loops (NAPI and application) using a dedicated
core each, it would make sense to see softirq processing on the NAPI
core being almost as fast. While there might be a small penalty for the
initial hardware interrupt, the following softirq processing does not
differ much from what a NAPI spin-loop does? The experiments seem to
corroborate this, because latency results for 'interrupts' and 'NAPI
threaded' are extremely close.
In this case, it would be essential that interrupt handling happens on a
dedicated empty core, so it can react to hardware interrupts right away
and its local cache isn't dirtied by code other than softirq processing.
While this also means dedicating an entire core to NAPI processing, at
least the core wouldn't have to spin all the time, hopefully reducing
power consumption and heat generation.
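A sketch of such a setup, reusing the IRQ pinning from the cover letter
script ($IRQS as computed there; the boot parameters are assumptions
about the target system):
```
# keep CPU 2 otherwise empty: no irqbalance, no stray work
sudo systemctl stop irqbalance
for irq in ${IRQS}; do
    echo 2 | sudo tee /proc/irq/$irq/smp_affinity_list
done
# and isolate the core from the scheduler at boot, e.g.:
#   isolcpus=2 nohz_full=2 rcu_nocbs=2
```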
Thanks,
Martin
> 2) The experiments reported here are symmetric in that they use the same
> polling variant at both the client and the server. When mixing things up
> by combining different polling variants, it becomes clear that the
> latency savings are split between both ends. The total savings of 1 usec
> are thus a combination of 0.5 usec at either end. So the ultimate
> trade-off is 0.5 usec latency gain for burning 1 core.
>
> 3) I believe the savings arise from running two tight loops (separate
> NAPI and application) instead of one longer loop. The shorter loops
> likely result in better cache utilization on their respective dedicated
> cores (and L1 caches). However, I am not sure right now how to explicitly
> confirm this.
>
> 4) I still believe that the additional experiments with setting both
> delay and period are meaningless. They create corner cases where rate *
> delay is about 1. Nobody would run a latency-critical system at 100%
> load. I also note that the experiment program xsk_rr fails when trying
> to increase the load beyond saturation (client fails with 'xsk_rr:
> oustanding array full').
>
> 5) I worry that a mechanism like this might be misinterpreted as some
> kind of magic wand for improving performance and might end up being used
> in practice and cause substantial overhead without much gain. If
> accepted, I would hope that this will be documented very clearly and
> have appropriate warnings attached. Given that the patch cover letter is
> often used as a basis for documentation, I believe this should be
> spelled out in the cover letter.
>
> With the above in mind, someone else will need to judge whether (at
> most) 0.5 usec for burning a core is a worthy enough trade-off to
> justify inclusion of this mechanism. Maybe someone else can take a
> closer look at the 'interrupts' variant on modern hardware.
>
> Thanks,
> Martin
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH net-next v5 0/4] Add support to do threaded napi busy poll
2025-04-30 15:23 ` Martin Karsten
@ 2025-04-30 16:58 ` Samiullah Khawaja
2025-04-30 19:57 ` Martin Karsten
0 siblings, 1 reply; 23+ messages in thread
From: Samiullah Khawaja @ 2025-04-30 16:58 UTC (permalink / raw)
To: Martin Karsten
Cc: Jakub Kicinski, David S . Miller, Eric Dumazet, Paolo Abeni,
almasrymina, willemb, jdamato, netdev
On Wed, Apr 30, 2025 at 8:23 AM Martin Karsten <mkarsten@uwaterloo.ca> wrote:
>
> On 2025-04-28 09:50, Martin Karsten wrote:
> > On 2025-04-24 16:02, Samiullah Khawaja wrote:
>
> [snip]
>
> >> | Experiment (latency in ns) | interrupts | SO_BUSYPOLL | SO_BUSYPOLL(separate) | NAPI threaded |
> >> |---|---|---|---|---|
> >> | 12 Kpkt/s + 0us delay | | | | |
> >> | | p5: 12700 | p5: 12900 | p5: 13300 | p5: 12800 |
> >> | | p50: 13100 | p50: 13600 | p50: 14100 | p50: 13000 |
> >> | | p95: 13200 | p95: 13800 | p95: 14400 | p95: 13000 |
> >> | | p99: 13200 | p99: 13800 | p99: 14400 | p99: 13000 |
> >> | 32 Kpkt/s + 30us delay | | | | |
> >> | | p5: 19900 | p5: 16600 | p5: 13100 | p5: 12800 |
> >> | | p50: 21100 | p50: 17000 | p50: 13700 | p50: 13000 |
> >> | | p95: 21200 | p95: 17100 | p95: 14000 | p95: 13000 |
> >> | | p99: 21200 | p99: 17100 | p99: 14000 | p99: 13000 |
> >> | 125 Kpkt/s + 6us delay | | | | |
> >> | | p5: 14600 | p5: 17100 | p5: 13300 | p5: 12900 |
> >> | | p50: 15400 | p50: 17400 | p50: 13800 | p50: 13100 |
> >> | | p95: 15600 | p95: 17600 | p95: 14000 | p95: 13100 |
> >> | | p99: 15600 | p99: 17600 | p99: 14000 | p99: 13100 |
> >> | 12 Kpkt/s + 78us delay | | | | |
> >> | | p5: 14100 | p5: 16700 | p5: 13200 | p5: 12600 |
> >> | | p50: 14300 | p50: 17100 | p50: 13900 | p50: 12800 |
> >> | | p95: 14300 | p95: 17200 | p95: 14200 | p95: 12800 |
> >> | | p99: 14300 | p99: 17200 | p99: 14200 | p99: 12800 |
> >> | 25 Kpkt/s + 38us delay | | | | |
> >> | | p5: 19900 | p5: 16600 | p5: 13000 | p5: 12700 |
> >> | | p50: 21000 | p50: 17100 | p50: 13800 | p50: 12900 |
> >> | | p95: 21100 | p95: 17100 | p95: 14100 | p95: 12900 |
> >> | | p99: 21100 | p99: 17100 | p99: 14100 | p99: 12900 |
> >>
> >> ## Observations
> >>
> >> - Here, without application processing, all the approaches give the same
> >> latency within a 1 usec range, and NAPI threaded gives the minimum latency.
> >> - With application processing the latency increases by 3-4usecs when
> >> doing inline polling.
> >> - Using a dedicated core to drive napi polling keeps the latency same
> >> even with application processing. This is observed both in userspace
> >> and threaded napi (in kernel).
> >> - Using napi threaded polling in kernel gives lower latency by
> >> 1-1.5usecs as compared to userspace driven polling in separate core.
> >> - With application processing userspace will get the packet from recv
> >> ring and spend some time doing application processing and then do napi
> >> polling. While application processing is happening a dedicated core
> >> doing napi polling can pull the packet off the NAPI RX queue and
> >> populate the AF_XDP recv ring. This means that when the application
> >> thread is done with application processing it has new packets ready to
> >> recv and process in recv ring.
> >> - Napi threaded busy polling in the kernel with a dedicated core gives
> >> consistent P5-P99 latency.
> > I've experimented with this some more. I can confirm latency savings of
> > about 1 usec arising from busy-looping a NAPI thread on a dedicated core
> > when compared to in-thread busy-polling. A few more comments:
Thanks for the experiments and reproducing this. I really appreciate it.
> >
> > 1) I note that the experiment results above show that 'interrupts' is
> > almost as fast as 'NAPI threaded' in the base case. I cannot confirm
> > these results, because I currently only have (very) old hardware
> > available for testing. However, these results worry me in terms of
> > necessity of the threaded busy-polling mechanism - also see Item 4) below.
>
> I want to add one more thought, just to spell this out explicitly:
> Assuming the latency benefits result from better cache utilization of
> two shorter processing loops (NAPI and application) using a dedicated
> core each, it would make sense to see softirq processing on the NAPI
> core being almost as fast. While there might be a small penalty for the
> initial hardware interrupt, the following softirq processing does not
The interrupt experiment in the last row demonstrates the penalty you
mentioned. While this effect might be acceptable for some use cases,
it could be problematic in scenarios sensitive to jitter (P99
latency).
> differ much from what a NAPI spin-loop does? The experiments seem to
> corroborate this, because latency results for 'interrupts' and 'NAPI
> threaded' are extremely close.
>
> In this case, it would be essential that interrupt handling happens on a
> dedicated empty core, so it can react to hardware interrupts right away
> and its local cache isn't dirtied by code other than softirq processing.
> While this also means dedicating an entire core to NAPI processing, at
> least the core wouldn't have to spin all the time, hopefully reducing
> power consumption and heat generation.
>
> Thanks,
> Martin
> > 2) The experiments reported here are symmetric in that they use the same
> > polling variant at both the client and the server. When mixing things up
> > by combining different polling variants, it becomes clear that the
> > latency savings are split between both ends. The total savings of 1 usec
> > are thus a combination of 0.5 usec at either end. So the ultimate
> > trade-off is 0.5 usec latency gain for burning 1 core.
> >
> > 3) I believe the savings arise from running two tight loops (separate
> > NAPI and application) instead of one longer loop. The shorter loops
> > likely result in better cache utilization on their respective dedicated
> > cores (and L1 caches). However, I am not sure right now how to explicitly
> > confirm this.
> >
> > 4) I still believe that the additional experiments with setting both
> > delay and period are meaningless. They create corner cases where rate *
> > delay is about 1. Nobody would run a latency-critical system at 100%
> > load. I also note that the experiment program xsk_rr fails when trying
> > to increase the load beyond saturation (client fails with 'xsk_rr:
> > oustanding array full').
> >
> > 5) I worry that a mechanism like this might be misinterpreted as some
> > kind of magic wand for improving performance and might end up being used
> > in practice and cause substantial overhead without much gain. If
> > accepted, I would hope that this will be documented very clearly and
> > have appropriate warnings attached. Given that the patch cover letter is
> > often used as a basis for documentation, I believe this should be
> > spelled out in the cover letter.
> >
> > With the above in mind, someone else will need to judge whether (at
> > most) 0.5 usec for burning a core is a worthy enough trade-off to
> > justify inclusion of this mechanism. Maybe someone else can take a
> > closer look at the 'interrupts' variant on modern hardware.
> >
> > Thanks,
> > Martin
>
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH net-next v5 0/4] Add support to do threaded napi busy poll
2025-04-30 16:58 ` Samiullah Khawaja
@ 2025-04-30 19:57 ` Martin Karsten
2025-04-30 20:33 ` Samiullah Khawaja
0 siblings, 1 reply; 23+ messages in thread
From: Martin Karsten @ 2025-04-30 19:57 UTC (permalink / raw)
To: Samiullah Khawaja
Cc: Jakub Kicinski, David S . Miller, Eric Dumazet, Paolo Abeni,
almasrymina, willemb, jdamato, netdev
On 2025-04-30 12:58, Samiullah Khawaja wrote:
> On Wed, Apr 30, 2025 at 8:23 AM Martin Karsten <mkarsten@uwaterloo.ca> wrote:
>>
>> On 2025-04-28 09:50, Martin Karsten wrote:
>>> On 2025-04-24 16:02, Samiullah Khawaja wrote:
>>
>> [snip]
>>
>>>> | Experiment (latency in ns) | interrupts | SO_BUSYPOLL | SO_BUSYPOLL(separate) | NAPI threaded |
>>>> |---|---|---|---|---|
>>>> | 12 Kpkt/s + 0us delay | | | | |
>>>> | | p5: 12700 | p5: 12900 | p5: 13300 | p5: 12800 |
>>>> | | p50: 13100 | p50: 13600 | p50: 14100 | p50: 13000 |
>>>> | | p95: 13200 | p95: 13800 | p95: 14400 | p95: 13000 |
>>>> | | p99: 13200 | p99: 13800 | p99: 14400 | p99: 13000 |
>>>> | 32 Kpkt/s + 30us delay | | | | |
>>>> | | p5: 19900 | p5: 16600 | p5: 13100 | p5: 12800 |
>>>> | | p50: 21100 | p50: 17000 | p50: 13700 | p50: 13000 |
>>>> | | p95: 21200 | p95: 17100 | p95: 14000 | p95: 13000 |
>>>> | | p99: 21200 | p99: 17100 | p99: 14000 | p99: 13000 |
>>>> | 125 Kpkt/s + 6us delay | | | | |
>>>> | | p5: 14600 | p5: 17100 | p5: 13300 | p5: 12900 |
>>>> | | p50: 15400 | p50: 17400 | p50: 13800 | p50: 13100 |
>>>> | | p95: 15600 | p95: 17600 | p95: 14000 | p95: 13100 |
>>>> | | p99: 15600 | p99: 17600 | p99: 14000 | p99: 13100 |
>>>> | 12 Kpkt/s + 78us delay | | | | |
>>>> | | p5: 14100 | p5: 16700 | p5: 13200 | p5: 12600 |
>>>> | | p50: 14300 | p50: 17100 | p50: 13900 | p50: 12800 |
>>>> | | p95: 14300 | p95: 17200 | p95: 14200 | p95: 12800 |
>>>> | | p99: 14300 | p99: 17200 | p99: 14200 | p99: 12800 |
>>>> | 25 Kpkt/s + 38us delay | | | | |
>>>> | | p5: 19900 | p5: 16600 | p5: 13000 | p5: 12700 |
>>>> | | p50: 21000 | p50: 17100 | p50: 13800 | p50: 12900 |
>>>> | | p95: 21100 | p95: 17100 | p95: 14100 | p95: 12900 |
>>>> | | p99: 21100 | p99: 17100 | p99: 14100 | p99: 12900 |
>>>>
>>>> ## Observations
>>>>
>>>> - Here, without application processing, all the approaches give the same
>>>> latency within a 1 usec range, and NAPI threaded gives the minimum latency.
>>>> - With application processing the latency increases by 3-4usecs when
>>>> doing inline polling.
>>>> - Using a dedicated core to drive napi polling keeps the latency same
>>>> even with application processing. This is observed both in userspace
>>>> and threaded napi (in kernel).
>>>> - Using napi threaded polling in kernel gives lower latency by
>>>> 1-1.5usecs as compared to userspace driven polling in separate core.
>>>> - With application processing userspace will get the packet from recv
>>>> ring and spend some time doing application processing and then do napi
>>>> polling. While application processing is happening a dedicated core
>>>> doing napi polling can pull the packet off the NAPI RX queue and
>>>> populate the AF_XDP recv ring. This means that when the application
>>>> thread is done with application processing it has new packets ready to
>>>> recv and process in recv ring.
>>>> - Napi threaded busy polling in the kernel with a dedicated core gives
>>>> consistent P5-P99 latency.
>>> I've experimented with this some more. I can confirm latency savings of
>>> about 1 usec arising from busy-looping a NAPI thread on a dedicated core
>>> when compared to in-thread busy-polling. A few more comments:
> Thanks for the experiments and reproducing this. I really appreciate it.
>>>
>>> 1) I note that the experiment results above show that 'interrupts' is
>>> almost as fast as 'NAPI threaded' in the base case. I cannot confirm
>>> these results, because I currently only have (very) old hardware
>>> available for testing. However, these results worry me in terms of
>>> necessity of the threaded busy-polling mechanism - also see Item 4) below.
>>
>> I want to add one more thought, just to spell this out explicitly:
>> Assuming the latency benefits result from better cache utilization of
>> two shorter processing loops (NAPI and application) using a dedicated
>> core each, it would make sense to see softirq processing on the NAPI
>> core being almost as fast. While there might be a small penalty for the
>> initial hardware interrupt, the following softirq processing does not
> The interrupt experiment in the last row demonstrates the penalty you
> mentioned. While this effect might be acceptable for some use cases,
> it could be problematic in scenarios sensitive to jitter (P99
> latency).
Just to be clear and explicit: the difference is 200 nsecs for P99 (13200
vs 13000), i.e., 100 nsecs per core burned on either side. As I mentioned
before, I don't think the 100%-load experiments (those with nonzero
delay setting) are representative of any real-world scenario.
Thanks,
Martin
>> differ much from what a NAPI spin-loop does? The experiments seem to
>> corroborate this, because latency results for 'interrupts' and 'NAPI
>> threaded' are extremely close.
>>
>> In this case, it would be essential that interrupt handling happens on a
>> dedicated empty core, so it can react to hardware interrupts right away
>> and its local cache isn't dirtied by code other than softirq processing.
>> While this also means dedicating an entire core to NAPI processing, at
>> least the core wouldn't have to spin all the time, hopefully reducing
>> power consumption and heat generation.
>>
>> Thanks,
>> Martin
>>> 2) The experiments reported here are symmetric in that they use the same
>>> polling variant at both the client and the server. When mixing things up
>>> by combining different polling variants, it becomes clear that the
>>> latency savings are split between both ends. The total savings of 1 usec
>>> are thus a combination of 0.5 usec at either end. So the ultimate
>>> trade-off is 0.5 usec latency gain for burning 1 core.
>>>
>>> 3) I believe the savings arise from running two tight loops (separate
>>> NAPI and application) instead of one longer loop. The shorter loops
>>> likely result in better cache utilization on their respective dedicated
>>> cores (and L1 caches). However, I am not sure right now how to explicitly
>>> confirm this.
>>>
>>> 4) I still believe that the additional experiments with setting both
>>> delay and period are meaningless. They create corner cases where rate *
>>> delay is about 1. Nobody would run a latency-critical system at 100%
>>> load. I also note that the experiment program xsk_rr fails when trying
>>> to increase the load beyond saturation (client fails with 'xsk_rr:
>>> oustanding array full').
>>>
>>> 5) I worry that a mechanism like this might be misinterpreted as some
>>> kind of magic wand for improving performance and might end up being used
>>> in practice and cause substantial overhead without much gain. If
>>> accepted, I would hope that this will be documented very clearly and
>>> have appropriate warnings attached. Given that the patch cover letter is
>>> often used as a basis for documentation, I believe this should be
>>> spelled out in the cover letter.
>>>
>>> With the above in mind, someone else will need to judge whether (at
>>> most) 0.5 usec for burning a core is a worthy enough trade-off to
>>> justify inclusion of this mechanism. Maybe someone else can take a
>>> closer look at the 'interrupts' variant on modern hardware.
>>>
>>> Thanks,
>>> Martin
>>
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH net-next v5 0/4] Add support to do threaded napi busy poll
2025-04-30 19:57 ` Martin Karsten
@ 2025-04-30 20:33 ` Samiullah Khawaja
0 siblings, 0 replies; 23+ messages in thread
From: Samiullah Khawaja @ 2025-04-30 20:33 UTC (permalink / raw)
To: Martin Karsten
Cc: Jakub Kicinski, David S . Miller, Eric Dumazet, Paolo Abeni,
almasrymina, willemb, jdamato, netdev
On Wed, Apr 30, 2025 at 12:57 PM Martin Karsten <mkarsten@uwaterloo.ca> wrote:
>
> On 2025-04-30 12:58, Samiullah Khawaja wrote:
> > On Wed, Apr 30, 2025 at 8:23 AM Martin Karsten <mkarsten@uwaterloo.ca> wrote:
> >>
> >> On 2025-04-28 09:50, Martin Karsten wrote:
> >>> On 2025-04-24 16:02, Samiullah Khawaja wrote:
> >>
> >> [snip]
> >>
> >>>> | Experiment (latency in ns) | interrupts | SO_BUSYPOLL | SO_BUSYPOLL(separate) | NAPI threaded |
> >>>> |---|---|---|---|---|
> >>>> | 12 Kpkt/s + 0us delay | | | | |
> >>>> | | p5: 12700 | p5: 12900 | p5: 13300 | p5: 12800 |
> >>>> | | p50: 13100 | p50: 13600 | p50: 14100 | p50: 13000 |
> >>>> | | p95: 13200 | p95: 13800 | p95: 14400 | p95: 13000 |
> >>>> | | p99: 13200 | p99: 13800 | p99: 14400 | p99: 13000 |
> >>>> | 32 Kpkt/s + 30us delay | | | | |
> >>>> | | p5: 19900 | p5: 16600 | p5: 13100 | p5: 12800 |
> >>>> | | p50: 21100 | p50: 17000 | p50: 13700 | p50: 13000 |
> >>>> | | p95: 21200 | p95: 17100 | p95: 14000 | p95: 13000 |
> >>>> | | p99: 21200 | p99: 17100 | p99: 14000 | p99: 13000 |
> >>>> | 125 Kpkt/s + 6us delay | | | | |
> >>>> | | p5: 14600 | p5: 17100 | p5: 13300 | p5: 12900 |
> >>>> | | p50: 15400 | p50: 17400 | p50: 13800 | p50: 13100 |
> >>>> | | p95: 15600 | p95: 17600 | p95: 14000 | p95: 13100 |
> >>>> | | p99: 15600 | p99: 17600 | p99: 14000 | p99: 13100 |
> >>>> | 12 Kpkt/s + 78us delay | | | | |
> >>>> | | p5: 14100 | p5: 16700 | p5: 13200 | p5: 12600 |
> >>>> | | p50: 14300 | p50: 17100 | p50: 13900 | p50: 12800 |
> >>>> | | p95: 14300 | p95: 17200 | p95: 14200 | p95: 12800 |
> >>>> | | p99: 14300 | p99: 17200 | p99: 14200 | p99: 12800 |
> >>>> | 25 Kpkt/s + 38us delay | | | | |
> >>>> | | p5: 19900 | p5: 16600 | p5: 13000 | p5: 12700 |
> >>>> | | p50: 21000 | p50: 17100 | p50: 13800 | p50: 12900 |
> >>>> | | p95: 21100 | p95: 17100 | p95: 14100 | p95: 12900 |
> >>>> | | p99: 21100 | p99: 17100 | p99: 14100 | p99: 12900 |
> >>>>
> >>>> ## Observations
> >>>>
> >>>> - Here, without application processing, all the approaches give the same
> >>>> latency within a 1 usec range, and NAPI threaded gives the minimum latency.
> >>>> - With application processing the latency increases by 3-4usecs when
> >>>> doing inline polling.
> >>>> - Using a dedicated core to drive napi polling keeps the latency same
> >>>> even with application processing. This is observed both in userspace
> >>>> and threaded napi (in kernel).
> >>>> - Using napi threaded polling in kernel gives lower latency by
> >>>> 1-1.5usecs as compared to userspace driven polling in separate core.
> >>>> - With application processing userspace will get the packet from recv
> >>>> ring and spend some time doing application processing and then do napi
> >>>> polling. While application processing is happening a dedicated core
> >>>> doing napi polling can pull the packet off the NAPI RX queue and
> >>>> populate the AF_XDP recv ring. This means that when the application
> >>>> thread is done with application processing it has new packets ready to
> >>>> recv and process in recv ring.
> >>>> - Napi threaded busy polling in the kernel with a dedicated core gives
> >>>> consistent P5-P99 latency.
> >>> I've experimented with this some more. I can confirm latency savings of
> >>> about 1 usec arising from busy-looping a NAPI thread on a dedicated core
> >>> when compared to in-thread busy-polling. A few more comments:
> > Thanks for the experiments and reproducing this. I really appreciate it.
> >>>
> >>> 1) I note that the experiment results above show that 'interrupts' is
> >>> almost as fast as 'NAPI threaded' in the base case. I cannot confirm
> >>> these results, because I currently only have (very) old hardware
> >>> available for testing. However, these results worry me in terms of
> >>> necessity of the threaded busy-polling mechanism - also see Item 4) below.
> >>
> >> I want to add one more thought, just to spell this out explicitly:
> >> Assuming the latency benefits result from better cache utilization of
> >> two shorter processing loops (NAPI and application) using a dedicated
> >> core each, it would make sense to see softirq processing on the NAPI
> >> core being almost as fast. While there might be a small penalty for the
> >> initial hardware interrupt, the following softirq processing does not
> > The interrupt experiment in the last row demonstrates the penalty you
> > mentioned. While this effect might be acceptable for some use cases,
> > it could be problematic in scenarios sensitive to jitter (P99
> > latency).
>
> Just to be clear and explicit: the difference is 200 nsecs for P99 (13200
> vs 13000), i.e., 100 nsecs per core burned on either side. As I mentioned
> before, I don't think the 100%-load experiments (those with nonzero
> delay setting) are representative of any real-world scenario.
Oh, you are only considering the first row. Yes, with zero delay it
would (mostly) be equal. I agree with you that there is very little
difference in that particular scenario.
>
> Thanks,
> Martin
>
> >> differ much from what a NAPI spin-loop does? The experiments seem to
> >> corroborate this, because latency results for 'interrupts' and 'NAPI
> >> threaded' are extremely close.
> >>
> >> In this case, it would be essential that interrupt handling happens on a
> >> dedicated empty core, so it can react to hardware interrupts right away
> >> and its local cache isn't dirtied by code other than softirq processing.
> >> While this also means dedicating an entire core to NAPI processing, at
> >> least the core wouldn't have to spin all the time, hopefully reducing
> >> power consumption and heat generation.
> >>
> >> Thanks,
> >> Martin
> >>> 2) The experiments reported here are symmetric in that they use the same
> >>> polling variant at both the client and the server. When mixing things up
> >>> by combining different polling variants, it becomes clear that the
> >>> latency savings are split between both ends. The total savings of 1 usec
> >>> are thus a combination of 0.5 usec at either end. So the ultimate
> >>> trade-off is 0.5 usec latency gain for burning 1 core.
> >>>
> >>> 3) I believe the savings arise from running two tight loops (separate
> >>> NAPI and application) instead of one longer loop. The shorter loops
> >>> likely result in better cache utilization on their respective dedicated
> >>> cores (and L1 caches). However, I am not sure right now how to explicitly
> >>> confirm this.
> >>>
> >>> 4) I still believe that the additional experiments with setting both
> >>> delay and period are meaningless. They create corner cases where rate *
> >>> delay is about 1. Nobody would run a latency-critical system at 100%
> >>> load. I also note that the experiment program xsk_rr fails when trying
> >>> to increase the load beyond saturation (client fails with 'xsk_rr:
> >>> oustanding array full').
> >>>
> >>> 5) I worry that a mechanism like this might be misinterpreted as some
> >>> kind of magic wand for improving performance and might end up being used
> >>> in practice and cause substantial overhead without much gain. If
> >>> accepted, I would hope that this will be documented very clearly and
> >>> have appropriate warnings attached. Given that the patch cover letter is
> >>> often used as a basis for documentation, I believe this should be
> >>> spelled out in the cover letter.
> >>>
> >>> With the above in mind, someone else will need to judge whether (at
> >>> most) 0.5 usec for burning a core is a worthy enough trade-off to
> >>> justify inclusion of this mechanism. Maybe someone else can take a
> >>> closer look at the 'interrupts' variant on modern hardware.
> >>>
> >>> Thanks,
> >>> Martin
> >>
>
^ permalink raw reply [flat|nested] 23+ messages in thread