* [PATCH net-next v3 0/4] Add support to do threaded napi busy poll
@ 2025-02-05 0:10 Samiullah Khawaja
2025-02-05 0:10 ` [PATCH net-next v3 1/4] Add support to set napi threaded for individual napi Samiullah Khawaja
` (7 more replies)
0 siblings, 8 replies; 37+ messages in thread
From: Samiullah Khawaja @ 2025-02-05 0:10 UTC (permalink / raw)
To: Jakub Kicinski, David S. Miller, Eric Dumazet, Paolo Abeni,
almasrymina
Cc: netdev, skhawaja
Extend the existing threaded napi poll support to do continuous
busy polling.
This is used for continuous polling of a napi to fetch descriptors
from the backing RX/TX queues for low-latency applications. Allow enabling
of threaded busy poll using netlink so it can be enabled on a set of
dedicated napis.
It allows enabling NAPI busy polling for any userspace application,
independent of the userspace API being used for packet and event
processing (epoll, io_uring, raw socket APIs). Once enabled, the user can
fetch the PID of the kthread doing the NAPI polling and set its affinity,
priority and scheduler depending on the low-latency requirements.
Currently threaded napi can only be enabled at the device level using
sysfs. Add support to enable/disable threaded mode for an individual napi
using the netlink interface. Extend the `napi-set` op in the netlink
spec to allow setting the `threaded` attribute of a napi.
Extend the threaded attribute in the napi struct to add an option to
enable continuous busy polling. Extend the netlink and sysfs interfaces to
allow enabling/disabling threaded busy polling at the device or individual
napi level.
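For example, a single NAPI can be switched into threaded busy-poll mode
with the ynl CLI (the NAPI id 66 below is illustrative):
```
./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
    --do napi-set --json '{"id": 66, "threaded": "busy-poll-enable"}'
```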
We use this for our AF_XDP based hard low-latency use case with the onload
stack (https://github.com/Xilinx-CNS/onload) that runs in userspace. Our
use case is fixed-frequency RPC-style traffic with a fixed
request/response size. We simulated this using neper by only starting the
next transaction when the last one has completed. The experiment results
are listed below.
Setup:
- Running on Google C3 VMs with the idpf driver and the following configuration.
- IRQ affinity and coalescing settings are common to both experiments.
- There is only 1 RX/TX queue configured.
- The first experiment enables busy poll using sysctl for both the epoll and
socket APIs.
- The second experiment enables NAPI threaded busy poll for the full device
using sysfs.
Non-threaded NAPI busy poll enabled using sysctl and sysfs:
```
echo 400 | sudo tee /proc/sys/net/core/busy_poll
echo 400 | sudo tee /proc/sys/net/core/busy_read
echo 2 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
echo 15000 | sudo tee /sys/class/net/eth0/gro_flush_timeout
```
Results using the following command:
```
sudo EF_NO_FAIL=0 EF_POLL_USEC=100000 taskset -c 3-10 onload -v \
--profile=latency ./neper/tcp_rr -Q 200 -R 400 -T 1 -F 50 \
-p 50,90,99,999 -H <IP> -l 10
...
...
num_transactions=2835
latency_min=0.000018976
latency_max=0.049642100
latency_mean=0.003243618
latency_stddev=0.010636847
latency_p50=0.000025270
latency_p90=0.005406710
latency_p99=0.049807350
latency_p99.9=0.049807350
```
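For the second experiment the device-level threaded busy-poll mode added in
this series was enabled instead of the busy_poll/busy_read sysctls; a
minimal sketch of that configuration (assuming eth0 and the sysfs value 2
introduced in patch 3):
```
echo 2 | sudo tee /sys/class/net/eth0/threaded
```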
Results with napi threaded busy poll, using the following command:
```
sudo EF_NO_FAIL=0 EF_POLL_USEC=100000 taskset -c 3-10 onload -v \
--profile=latency ./neper/tcp_rr -Q 200 -R 400 -T 1 -F 50 \
-p 50,90,99,999 -H <IP> -l 10
...
...
num_transactions=460163
latency_min=0.000015707
latency_max=0.200182942
latency_mean=0.000019453
latency_stddev=0.000720727
latency_p50=0.000016950
latency_p90=0.000017270
latency_p99=0.000018710
latency_p99.9=0.000020150
```
Here, with NAPI threaded busy poll on a separate core, we are able to
consistently poll the NAPI and keep latency to an absolute minimum. We are
also able to do this without any major changes to the onload stack or its
threading model.
v3:
- Fixed calls to dev_set_threaded in drivers
v2:
- Add documentation in napi.rst.
- Provide experiment data and usecase details.
- Update busy_poller selftest to include napi threaded poll testcase.
- Define threaded mode enum in netlink interface.
- Included NAPI threaded state in napi config to save/restore.
Samiullah Khawaja (4):
Add support to set napi threaded for individual napi
net: Create separate gro_flush helper function
Extend napi threaded polling to allow kthread based busy polling
selftests: Add napi threaded busy poll test in `busy_poller`
Documentation/ABI/testing/sysfs-class-net | 3 +-
Documentation/netlink/specs/netdev.yaml | 14 ++
Documentation/networking/napi.rst | 80 ++++++++++-
.../net/ethernet/atheros/atl1c/atl1c_main.c | 2 +-
drivers/net/ethernet/mellanox/mlxsw/pci.c | 2 +-
drivers/net/ethernet/renesas/ravb_main.c | 2 +-
drivers/net/wireless/ath/ath10k/snoc.c | 2 +-
include/linux/netdevice.h | 24 +++-
include/uapi/linux/netdev.h | 7 +
net/core/dev.c | 127 ++++++++++++++----
net/core/net-sysfs.c | 2 +-
net/core/netdev-genl-gen.c | 5 +-
net/core/netdev-genl.c | 9 ++
tools/include/uapi/linux/netdev.h | 7 +
tools/testing/selftests/net/busy_poll_test.sh | 25 +++-
tools/testing/selftests/net/busy_poller.c | 14 +-
16 files changed, 285 insertions(+), 40 deletions(-)
--
2.48.1.362.g079036d154-goog
* [PATCH net-next v3 1/4] Add support to set napi threaded for individual napi
2025-02-05 0:10 [PATCH net-next v3 0/4] Add support to do threaded napi busy poll Samiullah Khawaja
@ 2025-02-05 0:10 ` Samiullah Khawaja
2025-02-05 2:50 ` Joe Damato
2025-02-05 23:10 ` David Laight
2025-02-05 0:10 ` [PATCH net-next v3 2/4] net: Create separate gro_flush helper function Samiullah Khawaja
` (6 subsequent siblings)
7 siblings, 2 replies; 37+ messages in thread
From: Samiullah Khawaja @ 2025-02-05 0:10 UTC (permalink / raw)
To: Jakub Kicinski, David S. Miller, Eric Dumazet, Paolo Abeni,
almasrymina
Cc: netdev, skhawaja
A net device has a threaded sysfs attribute that can be used to enable
threaded napi polling on all of the NAPI contexts under that device. Allow
enabling threaded napi polling at the individual napi level using netlink.
Extend the netlink operation `napi-set` to allow setting the threaded
attribute of a NAPI. This enables threaded polling on that napi
context.
Tested using the following command in qemu/virtio-net:
./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
--do napi-set --json '{"id": 66, "threaded": 1}'
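The current value can be read back via `napi-get`, which this patch extends
to report the threaded attribute, e.g.:
./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
--do napi-get --json '{"id": 66}'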
Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
---
Documentation/netlink/specs/netdev.yaml | 10 ++++++++
Documentation/networking/napi.rst | 13 ++++++++++-
include/linux/netdevice.h | 10 ++++++++
include/uapi/linux/netdev.h | 1 +
net/core/dev.c | 31 +++++++++++++++++++++++++
net/core/netdev-genl-gen.c | 5 ++--
net/core/netdev-genl.c | 9 +++++++
tools/include/uapi/linux/netdev.h | 1 +
8 files changed, 77 insertions(+), 3 deletions(-)
diff --git a/Documentation/netlink/specs/netdev.yaml b/Documentation/netlink/specs/netdev.yaml
index cbb544bd6c84..785240d60df6 100644
--- a/Documentation/netlink/specs/netdev.yaml
+++ b/Documentation/netlink/specs/netdev.yaml
@@ -268,6 +268,14 @@ attribute-sets:
doc: The timeout, in nanoseconds, of how long to suspend irq
processing, if event polling finds events
type: uint
+ -
+ name: threaded
+ doc: Whether the napi is configured to operate in threaded polling
+ mode. If this is set to `1` then the NAPI context operates
+ in threaded polling mode.
+ type: u32
+ checks:
+ max: 1
-
name: queue
attributes:
@@ -659,6 +667,7 @@ operations:
- defer-hard-irqs
- gro-flush-timeout
- irq-suspend-timeout
+ - threaded
dump:
request:
attributes:
@@ -711,6 +720,7 @@ operations:
- defer-hard-irqs
- gro-flush-timeout
- irq-suspend-timeout
+ - threaded
kernel-family:
headers: [ "linux/list.h"]
diff --git a/Documentation/networking/napi.rst b/Documentation/networking/napi.rst
index f970a2be271a..73c83b4533dc 100644
--- a/Documentation/networking/napi.rst
+++ b/Documentation/networking/napi.rst
@@ -413,7 +413,18 @@ dependent). The NAPI instance IDs will be assigned in the opposite
order than the process IDs of the kernel threads.
Threaded NAPI is controlled by writing 0/1 to the ``threaded`` file in
-netdev's sysfs directory.
+netdev's sysfs directory. It can also be enabled for a specific napi using
+the netlink interface.
+
+For example, using the script:
+
+.. code-block:: bash
+
+ $ kernel-source/tools/net/ynl/pyynl/cli.py \
+ --spec Documentation/netlink/specs/netdev.yaml \
+ --do napi-set \
+ --json='{"id": 66,
+ "threaded": 1}'
.. rubric:: Footnotes
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 2a59034a5fa2..a0e485722ed9 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -352,6 +352,7 @@ struct napi_config {
u64 gro_flush_timeout;
u64 irq_suspend_timeout;
u32 defer_hard_irqs;
+ bool threaded;
unsigned int napi_id;
};
@@ -572,6 +573,15 @@ static inline bool napi_complete(struct napi_struct *n)
int dev_set_threaded(struct net_device *dev, bool threaded);
+/*
+ * napi_set_threaded - set napi threaded state
+ * @napi: NAPI context
+ * @threaded: whether this napi does threaded polling
+ *
+ * Return 0 on success and negative errno on failure.
+ */
+int napi_set_threaded(struct napi_struct *napi, bool threaded);
+
void napi_disable(struct napi_struct *n);
void napi_disable_locked(struct napi_struct *n);
diff --git a/include/uapi/linux/netdev.h b/include/uapi/linux/netdev.h
index e4be227d3ad6..829648b2ef65 100644
--- a/include/uapi/linux/netdev.h
+++ b/include/uapi/linux/netdev.h
@@ -125,6 +125,7 @@ enum {
NETDEV_A_NAPI_DEFER_HARD_IRQS,
NETDEV_A_NAPI_GRO_FLUSH_TIMEOUT,
NETDEV_A_NAPI_IRQ_SUSPEND_TIMEOUT,
+ NETDEV_A_NAPI_THREADED,
__NETDEV_A_NAPI_MAX,
NETDEV_A_NAPI_MAX = (__NETDEV_A_NAPI_MAX - 1)
diff --git a/net/core/dev.c b/net/core/dev.c
index c0021cbd28fc..50fb234dd7a0 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -6787,6 +6787,30 @@ static void init_gro_hash(struct napi_struct *napi)
napi->gro_bitmask = 0;
}
+int napi_set_threaded(struct napi_struct *napi, bool threaded)
+{
+ if (napi->dev->threaded)
+ return -EINVAL;
+
+ if (threaded) {
+ if (!napi->thread) {
+ int err = napi_kthread_create(napi);
+
+ if (err)
+ return err;
+ }
+ }
+
+ if (napi->config)
+ napi->config->threaded = threaded;
+
+ /* Make sure kthread is created before THREADED bit is set. */
+ smp_mb__before_atomic();
+ assign_bit(NAPI_STATE_THREADED, &napi->state, threaded);
+
+ return 0;
+}
+
int dev_set_threaded(struct net_device *dev, bool threaded)
{
struct napi_struct *napi;
@@ -6798,6 +6822,11 @@ int dev_set_threaded(struct net_device *dev, bool threaded)
return 0;
if (threaded) {
+ /* Check if threaded is set at napi level already */
+ list_for_each_entry(napi, &dev->napi_list, dev_list)
+ if (test_bit(NAPI_STATE_THREADED, &napi->state))
+ return -EINVAL;
+
list_for_each_entry(napi, &dev->napi_list, dev_list) {
if (!napi->thread) {
err = napi_kthread_create(napi);
@@ -6880,6 +6909,8 @@ static void napi_restore_config(struct napi_struct *n)
napi_hash_add(n);
n->config->napi_id = n->napi_id;
}
+
+ napi_set_threaded(n, n->config->threaded);
}
static void napi_save_config(struct napi_struct *n)
diff --git a/net/core/netdev-genl-gen.c b/net/core/netdev-genl-gen.c
index 996ac6a449eb..a1f80e687f53 100644
--- a/net/core/netdev-genl-gen.c
+++ b/net/core/netdev-genl-gen.c
@@ -92,11 +92,12 @@ static const struct nla_policy netdev_bind_rx_nl_policy[NETDEV_A_DMABUF_FD + 1]
};
/* NETDEV_CMD_NAPI_SET - do */
-static const struct nla_policy netdev_napi_set_nl_policy[NETDEV_A_NAPI_IRQ_SUSPEND_TIMEOUT + 1] = {
+static const struct nla_policy netdev_napi_set_nl_policy[NETDEV_A_NAPI_THREADED + 1] = {
[NETDEV_A_NAPI_ID] = { .type = NLA_U32, },
[NETDEV_A_NAPI_DEFER_HARD_IRQS] = NLA_POLICY_FULL_RANGE(NLA_U32, &netdev_a_napi_defer_hard_irqs_range),
[NETDEV_A_NAPI_GRO_FLUSH_TIMEOUT] = { .type = NLA_UINT, },
[NETDEV_A_NAPI_IRQ_SUSPEND_TIMEOUT] = { .type = NLA_UINT, },
+ [NETDEV_A_NAPI_THREADED] = NLA_POLICY_MAX(NLA_U32, 1),
};
/* Ops table for netdev */
@@ -187,7 +188,7 @@ static const struct genl_split_ops netdev_nl_ops[] = {
.cmd = NETDEV_CMD_NAPI_SET,
.doit = netdev_nl_napi_set_doit,
.policy = netdev_napi_set_nl_policy,
- .maxattr = NETDEV_A_NAPI_IRQ_SUSPEND_TIMEOUT,
+ .maxattr = NETDEV_A_NAPI_THREADED,
.flags = GENL_ADMIN_PERM | GENL_CMD_CAP_DO,
},
};
diff --git a/net/core/netdev-genl.c b/net/core/netdev-genl.c
index 715f85c6b62e..208c3dd768ec 100644
--- a/net/core/netdev-genl.c
+++ b/net/core/netdev-genl.c
@@ -183,6 +183,9 @@ netdev_nl_napi_fill_one(struct sk_buff *rsp, struct napi_struct *napi,
if (napi->irq >= 0 && nla_put_u32(rsp, NETDEV_A_NAPI_IRQ, napi->irq))
goto nla_put_failure;
+ if (nla_put_u32(rsp, NETDEV_A_NAPI_THREADED, !!napi->thread))
+ goto nla_put_failure;
+
if (napi->thread) {
pid = task_pid_nr(napi->thread);
if (nla_put_u32(rsp, NETDEV_A_NAPI_PID, pid))
@@ -321,8 +324,14 @@ netdev_nl_napi_set_config(struct napi_struct *napi, struct genl_info *info)
{
u64 irq_suspend_timeout = 0;
u64 gro_flush_timeout = 0;
+ u32 threaded = 0;
u32 defer = 0;
+ if (info->attrs[NETDEV_A_NAPI_THREADED]) {
+ threaded = nla_get_u32(info->attrs[NETDEV_A_NAPI_THREADED]);
+ napi_set_threaded(napi, !!threaded);
+ }
+
if (info->attrs[NETDEV_A_NAPI_DEFER_HARD_IRQS]) {
defer = nla_get_u32(info->attrs[NETDEV_A_NAPI_DEFER_HARD_IRQS]);
napi_set_defer_hard_irqs(napi, defer);
diff --git a/tools/include/uapi/linux/netdev.h b/tools/include/uapi/linux/netdev.h
index e4be227d3ad6..829648b2ef65 100644
--- a/tools/include/uapi/linux/netdev.h
+++ b/tools/include/uapi/linux/netdev.h
@@ -125,6 +125,7 @@ enum {
NETDEV_A_NAPI_DEFER_HARD_IRQS,
NETDEV_A_NAPI_GRO_FLUSH_TIMEOUT,
NETDEV_A_NAPI_IRQ_SUSPEND_TIMEOUT,
+ NETDEV_A_NAPI_THREADED,
__NETDEV_A_NAPI_MAX,
NETDEV_A_NAPI_MAX = (__NETDEV_A_NAPI_MAX - 1)
--
2.48.1.362.g079036d154-goog
* [PATCH net-next v3 2/4] net: Create separate gro_flush helper function
2025-02-05 0:10 [PATCH net-next v3 0/4] Add support to do threaded napi busy poll Samiullah Khawaja
2025-02-05 0:10 ` [PATCH net-next v3 1/4] Add support to set napi threaded for individual napi Samiullah Khawaja
@ 2025-02-05 0:10 ` Samiullah Khawaja
2025-02-05 2:55 ` Joe Damato
2025-02-05 0:10 ` [PATCH net-next v3 3/4] Extend napi threaded polling to allow kthread based busy polling Samiullah Khawaja
` (5 subsequent siblings)
7 siblings, 1 reply; 37+ messages in thread
From: Samiullah Khawaja @ 2025-02-05 0:10 UTC (permalink / raw)
To: Jakub Kicinski, David S. Miller, Eric Dumazet, Paolo Abeni,
almasrymina
Cc: netdev, skhawaja
Move multiple copies of the same code snippet doing `gro_flush` and
`gro_normal_list` into a separate helper function.
Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
---
net/core/dev.c | 28 +++++++++++++---------------
1 file changed, 13 insertions(+), 15 deletions(-)
diff --git a/net/core/dev.c b/net/core/dev.c
index 50fb234dd7a0..d5dcf9dd6225 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -6484,6 +6484,17 @@ static void skb_defer_free_flush(struct softnet_data *sd)
}
}
+static void __napi_gro_flush_helper(struct napi_struct *napi)
+{
+ if (napi->gro_bitmask) {
+ /* flush too old packets
+ * If HZ < 1000, flush all packets.
+ */
+ napi_gro_flush(napi, HZ >= 1000);
+ }
+ gro_normal_list(napi);
+}
+
#if defined(CONFIG_NET_RX_BUSY_POLL)
static void __busy_poll_stop(struct napi_struct *napi, bool skip_schedule)
@@ -6494,14 +6505,8 @@ static void __busy_poll_stop(struct napi_struct *napi, bool skip_schedule)
return;
}
- if (napi->gro_bitmask) {
- /* flush too old packets
- * If HZ < 1000, flush all packets.
- */
- napi_gro_flush(napi, HZ >= 1000);
- }
+ __napi_gro_flush_helper(napi);
- gro_normal_list(napi);
clear_bit(NAPI_STATE_SCHED, &napi->state);
}
@@ -7170,14 +7175,7 @@ static int __napi_poll(struct napi_struct *n, bool *repoll)
return work;
}
- if (n->gro_bitmask) {
- /* flush too old packets
- * If HZ < 1000, flush all packets.
- */
- napi_gro_flush(n, HZ >= 1000);
- }
-
- gro_normal_list(n);
+ __napi_gro_flush_helper(n);
/* Some drivers may have called napi_schedule
* prior to exhausting their budget.
--
2.48.1.362.g079036d154-goog
* [PATCH net-next v3 3/4] Extend napi threaded polling to allow kthread based busy polling
2025-02-05 0:10 [PATCH net-next v3 0/4] Add support to do threaded napi busy poll Samiullah Khawaja
2025-02-05 0:10 ` [PATCH net-next v3 1/4] Add support to set napi threaded for individual napi Samiullah Khawaja
2025-02-05 0:10 ` [PATCH net-next v3 2/4] net: Create separate gro_flush helper function Samiullah Khawaja
@ 2025-02-05 0:10 ` Samiullah Khawaja
2025-02-05 9:08 ` Paul Barker
2025-02-06 3:40 ` Samudrala, Sridhar
2025-02-05 0:10 ` [PATCH net-next v3 4/4] selftests: Add napi threaded busy poll test in `busy_poller` Samiullah Khawaja
` (4 subsequent siblings)
7 siblings, 2 replies; 37+ messages in thread
From: Samiullah Khawaja @ 2025-02-05 0:10 UTC (permalink / raw)
To: Jakub Kicinski, David S. Miller, Eric Dumazet, Paolo Abeni,
almasrymina
Cc: netdev, skhawaja
Add a new state to napi state enum:
- STATE_THREADED_BUSY_POLL
Threaded busy poll is enabled/running for this napi.
The following changes are introduced in the napi scheduling and state logic:
- When threaded busy poll is enabled through sysfs, it also enables
NAPI_STATE_THREADED so a kthread is created per napi. It also sets the
NAPI_STATE_THREADED_BUSY_POLL bit on each napi to indicate that we are
supposed to busy poll that napi.
- When a napi is scheduled with STATE_SCHED_THREADED and the associated
kthread is woken up, the kthread owns the context. If both
NAPI_STATE_THREADED_BUSY_POLL and NAPI_SCHED_THREADED are set,
it means that we can busy poll.
- To keep busy polling and to avoid rescheduling interrupts,
napi_complete_done returns false when both the SCHED_THREADED and
THREADED_BUSY_POLL flags are set. napi_complete_done also returns
early to avoid the STATE_SCHED_THREADED bit being unset.
- If at any point STATE_THREADED_BUSY_POLL is unset,
napi_complete_done will run and also unset the SCHED_THREADED bit.
This will make the associated kthread go to sleep as per the existing
logic.
Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
---
Documentation/ABI/testing/sysfs-class-net | 3 +-
Documentation/netlink/specs/netdev.yaml | 12 ++--
Documentation/networking/napi.rst | 67 ++++++++++++++++-
.../net/ethernet/atheros/atl1c/atl1c_main.c | 2 +-
drivers/net/ethernet/mellanox/mlxsw/pci.c | 2 +-
drivers/net/ethernet/renesas/ravb_main.c | 2 +-
drivers/net/wireless/ath/ath10k/snoc.c | 2 +-
include/linux/netdevice.h | 20 ++++--
include/uapi/linux/netdev.h | 6 ++
net/core/dev.c | 72 ++++++++++++++++---
net/core/net-sysfs.c | 2 +-
net/core/netdev-genl-gen.c | 2 +-
net/core/netdev-genl.c | 2 +-
tools/include/uapi/linux/netdev.h | 6 ++
14 files changed, 171 insertions(+), 29 deletions(-)
diff --git a/Documentation/ABI/testing/sysfs-class-net b/Documentation/ABI/testing/sysfs-class-net
index ebf21beba846..15d7d36a8294 100644
--- a/Documentation/ABI/testing/sysfs-class-net
+++ b/Documentation/ABI/testing/sysfs-class-net
@@ -343,7 +343,7 @@ Date: Jan 2021
KernelVersion: 5.12
Contact: netdev@vger.kernel.org
Description:
- Boolean value to control the threaded mode per device. User could
+ Integer value to control the threaded mode per device. User could
set this value to enable/disable threaded mode for all napi
belonging to this device, without the need to do device up/down.
@@ -351,4 +351,5 @@ Description:
== ==================================
0 threaded mode disabled for this dev
1 threaded mode enabled for this dev
+ 2 threaded mode enabled, and busy polling enabled.
== ==================================
diff --git a/Documentation/netlink/specs/netdev.yaml b/Documentation/netlink/specs/netdev.yaml
index 785240d60df6..db3bf1eb9a63 100644
--- a/Documentation/netlink/specs/netdev.yaml
+++ b/Documentation/netlink/specs/netdev.yaml
@@ -78,6 +78,10 @@ definitions:
name: qstats-scope
type: flags
entries: [ queue ]
+ -
+ name: napi-threaded
+ type: enum
+ entries: [ disable, enable, busy-poll-enable ]
attribute-sets:
-
@@ -271,11 +275,11 @@ attribute-sets:
-
name: threaded
doc: Whether the napi is configured to operate in threaded polling
- mode. If this is set to `1` then the NAPI context operates
- in threaded polling mode.
+ mode. If this is set to `enable` then the NAPI context operates
+ in threaded polling mode. If this is set to `busy-poll-enable`
+ then the NAPI kthread also does busy polling.
type: u32
- checks:
- max: 1
+ enum: napi-threaded
-
name: queue
attributes:
diff --git a/Documentation/networking/napi.rst b/Documentation/networking/napi.rst
index 73c83b4533dc..f6596573b777 100644
--- a/Documentation/networking/napi.rst
+++ b/Documentation/networking/napi.rst
@@ -232,7 +232,9 @@ are not well known).
Busy polling is enabled by either setting ``SO_BUSY_POLL`` on
selected sockets or using the global ``net.core.busy_poll`` and
``net.core.busy_read`` sysctls. An io_uring API for NAPI busy polling
-also exists.
+also exists. Threaded polling of NAPI also has a mode to busy poll for
+packets (:ref:`threaded busy polling<threaded_busy_poll>`) using the same
+thread that is used for NAPI processing.
epoll-based busy polling
------------------------
@@ -395,6 +397,69 @@ Therefore, setting ``gro_flush_timeout`` and ``napi_defer_hard_irqs`` is
the recommended usage, because otherwise setting ``irq-suspend-timeout``
might not have any discernible effect.
+.. _threaded_busy_poll:
+
+Threaded NAPI busy polling
+--------------------------
+
+Threaded napi allows processing of packets from each NAPI in a kernel
+thread. Threaded napi busy polling extends this and adds support for
+continuous busy polling of this napi. This can be used to enable busy polling
+independent of the userspace application or the API (epoll, io_uring, raw
+sockets) being used in userspace to process the packets.
+
+It can be enabled for each NAPI using the netlink interface or at the device
+level using the threaded NAPI sysfs attribute.
+
+For example, using the following command:
+
+.. code-block:: bash
+
+ $ kernel-source/tools/net/ynl/pyynl/cli.py \
+ --spec Documentation/netlink/specs/netdev.yaml \
+ --do napi-set \
+ --json='{"id": 66,
+ "threaded": "busy-poll-enable"}'
+
+
+Enabling it for each NAPI allows finer control, so busy polling can be enabled
+only for the NIC queues that carry traffic with low-latency requirements.
+
+Depending on the application's requirements, the user might want to set the
+affinity of the kthread that is busy polling each NAPI, and might also want to
+set the priority and the scheduler of the thread based on latency requirements.
+
+For a hard low-latency application, the user might want to dedicate a full core
+to the NAPI polling so the NIC queue descriptors are picked up from the queue
+as soon as they appear. For more relaxed low-latency requirements, the user
+might want to share the core with other threads.
+
+Once threaded busy polling is enabled for a NAPI, the PID of the kthread can be
+fetched using the netlink interface so that the affinity, priority and
+scheduler configuration can be applied.
+
+For example, the following command can be used to fetch the PID:
+
+.. code-block:: bash
+
+ $ kernel-source/tools/net/ynl/pyynl/cli.py \
+ --spec Documentation/netlink/specs/netdev.yaml \
+ --do napi-get \
+ --json='{"id": 66}'
+
+This will output something like the following, where the pid `258` is the PID
+of the kthread that is polling this NAPI.
+
+.. code-block:: bash
+
+ $ {'defer-hard-irqs': 0,
+ 'gro-flush-timeout': 0,
+ 'id': 66,
+ 'ifindex': 2,
+ 'irq-suspend-timeout': 0,
+ 'pid': 258,
+ 'threaded': 'enable'}
+
.. _threaded:
Threaded NAPI
diff --git a/drivers/net/ethernet/atheros/atl1c/atl1c_main.c b/drivers/net/ethernet/atheros/atl1c/atl1c_main.c
index c571614b1d50..513328476770 100644
--- a/drivers/net/ethernet/atheros/atl1c/atl1c_main.c
+++ b/drivers/net/ethernet/atheros/atl1c/atl1c_main.c
@@ -2688,7 +2688,7 @@ static int atl1c_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
adapter->mii.mdio_write = atl1c_mdio_write;
adapter->mii.phy_id_mask = 0x1f;
adapter->mii.reg_num_mask = MDIO_CTRL_REG_MASK;
- dev_set_threaded(netdev, true);
+ dev_set_threaded(netdev, NETDEV_NAPI_THREADED_ENABLE);
for (i = 0; i < adapter->rx_queue_count; ++i)
netif_napi_add(netdev, &adapter->rrd_ring[i].napi,
atl1c_clean_rx);
diff --git a/drivers/net/ethernet/mellanox/mlxsw/pci.c b/drivers/net/ethernet/mellanox/mlxsw/pci.c
index 5b44c931b660..52f1ffb77a3c 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/pci.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/pci.c
@@ -156,7 +156,7 @@ static int mlxsw_pci_napi_devs_init(struct mlxsw_pci *mlxsw_pci)
}
strscpy(mlxsw_pci->napi_dev_rx->name, "mlxsw_rx",
sizeof(mlxsw_pci->napi_dev_rx->name));
- dev_set_threaded(mlxsw_pci->napi_dev_rx, true);
+ dev_set_threaded(mlxsw_pci->napi_dev_rx, NETDEV_NAPI_THREADED_ENABLE);
return 0;
diff --git a/drivers/net/ethernet/renesas/ravb_main.c b/drivers/net/ethernet/renesas/ravb_main.c
index c9f4976a3527..12e4f68c0c8f 100644
--- a/drivers/net/ethernet/renesas/ravb_main.c
+++ b/drivers/net/ethernet/renesas/ravb_main.c
@@ -3075,7 +3075,7 @@ static int ravb_probe(struct platform_device *pdev)
if (info->coalesce_irqs) {
netdev_sw_irq_coalesce_default_on(ndev);
if (num_present_cpus() == 1)
- dev_set_threaded(ndev, true);
+ dev_set_threaded(ndev, NETDEV_NAPI_THREADED_ENABLE);
}
/* Network device register */
diff --git a/drivers/net/wireless/ath/ath10k/snoc.c b/drivers/net/wireless/ath/ath10k/snoc.c
index d436a874cd5a..52e0de8c3069 100644
--- a/drivers/net/wireless/ath/ath10k/snoc.c
+++ b/drivers/net/wireless/ath/ath10k/snoc.c
@@ -935,7 +935,7 @@ static int ath10k_snoc_hif_start(struct ath10k *ar)
bitmap_clear(ar_snoc->pending_ce_irqs, 0, CE_COUNT_MAX);
- dev_set_threaded(ar->napi_dev, true);
+ dev_set_threaded(ar->napi_dev, NETDEV_NAPI_THREADED_ENABLE);
ath10k_core_napi_enable(ar);
ath10k_snoc_irq_enable(ar);
ath10k_snoc_rx_post(ar);
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index a0e485722ed9..c3069a17fa7e 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -352,7 +352,7 @@ struct napi_config {
u64 gro_flush_timeout;
u64 irq_suspend_timeout;
u32 defer_hard_irqs;
- bool threaded;
+ u8 threaded;
unsigned int napi_id;
};
@@ -410,6 +410,8 @@ enum {
NAPI_STATE_PREFER_BUSY_POLL, /* prefer busy-polling over softirq processing*/
NAPI_STATE_THREADED, /* The poll is performed inside its own thread*/
NAPI_STATE_SCHED_THREADED, /* Napi is currently scheduled in threaded mode */
+ NAPI_STATE_THREADED_BUSY_POLL, /* The threaded napi poller will busy poll */
+ NAPI_STATE_SCHED_THREADED_BUSY_POLL, /* The threaded napi poller is busy polling */
};
enum {
@@ -423,8 +425,14 @@ enum {
NAPIF_STATE_PREFER_BUSY_POLL = BIT(NAPI_STATE_PREFER_BUSY_POLL),
NAPIF_STATE_THREADED = BIT(NAPI_STATE_THREADED),
NAPIF_STATE_SCHED_THREADED = BIT(NAPI_STATE_SCHED_THREADED),
+ NAPIF_STATE_THREADED_BUSY_POLL = BIT(NAPI_STATE_THREADED_BUSY_POLL),
+ NAPIF_STATE_SCHED_THREADED_BUSY_POLL
+ = BIT(NAPI_STATE_SCHED_THREADED_BUSY_POLL),
};
+#define NAPIF_STATE_THREADED_BUSY_POLL_MASK \
+ (NAPIF_STATE_THREADED | NAPIF_STATE_THREADED_BUSY_POLL)
+
enum gro_result {
GRO_MERGED,
GRO_MERGED_FREE,
@@ -571,16 +579,18 @@ static inline bool napi_complete(struct napi_struct *n)
return napi_complete_done(n, 0);
}
-int dev_set_threaded(struct net_device *dev, bool threaded);
+int dev_set_threaded(struct net_device *dev,
+ enum netdev_napi_threaded threaded);
/*
* napi_set_threaded - set napi threaded state
* @napi: NAPI context
- * @threaded: whether this napi does threaded polling
+ * @threaded: threading mode
*
* Return 0 on success and negative errno on failure.
*/
-int napi_set_threaded(struct napi_struct *napi, bool threaded);
+int napi_set_threaded(struct napi_struct *napi,
+ enum netdev_napi_threaded threaded);
void napi_disable(struct napi_struct *n);
void napi_disable_locked(struct napi_struct *n);
@@ -2404,7 +2414,7 @@ struct net_device {
struct sfp_bus *sfp_bus;
struct lock_class_key *qdisc_tx_busylock;
bool proto_down;
- bool threaded;
+ u8 threaded;
/* priv_flags_slow, ungrouped to save space */
unsigned long see_all_hwtstamp_requests:1;
diff --git a/include/uapi/linux/netdev.h b/include/uapi/linux/netdev.h
index 829648b2ef65..c2a9dbb361f6 100644
--- a/include/uapi/linux/netdev.h
+++ b/include/uapi/linux/netdev.h
@@ -74,6 +74,12 @@ enum netdev_qstats_scope {
NETDEV_QSTATS_SCOPE_QUEUE = 1,
};
+enum netdev_napi_threaded {
+ NETDEV_NAPI_THREADED_DISABLE,
+ NETDEV_NAPI_THREADED_ENABLE,
+ NETDEV_NAPI_THREADED_BUSY_POLL_ENABLE,
+};
+
enum {
NETDEV_A_DEV_IFINDEX = 1,
NETDEV_A_DEV_PAD,
diff --git a/net/core/dev.c b/net/core/dev.c
index d5dcf9dd6225..1964c184ce8a 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -78,6 +78,7 @@
#include <linux/slab.h>
#include <linux/sched.h>
#include <linux/sched/isolation.h>
+#include <linux/sched/types.h>
#include <linux/sched/mm.h>
#include <linux/smpboot.h>
#include <linux/mutex.h>
@@ -6403,7 +6404,8 @@ bool napi_complete_done(struct napi_struct *n, int work_done)
* the guarantee we will be called later.
*/
if (unlikely(n->state & (NAPIF_STATE_NPSVC |
- NAPIF_STATE_IN_BUSY_POLL)))
+ NAPIF_STATE_IN_BUSY_POLL |
+ NAPIF_STATE_SCHED_THREADED_BUSY_POLL)))
return false;
if (work_done) {
@@ -6792,8 +6794,10 @@ static void init_gro_hash(struct napi_struct *napi)
napi->gro_bitmask = 0;
}
-int napi_set_threaded(struct napi_struct *napi, bool threaded)
+int napi_set_threaded(struct napi_struct *napi,
+ enum netdev_napi_threaded threaded)
{
+ unsigned long val;
if (napi->dev->threaded)
return -EINVAL;
@@ -6811,14 +6815,20 @@ int napi_set_threaded(struct napi_struct *napi, bool threaded)
/* Make sure kthread is created before THREADED bit is set. */
smp_mb__before_atomic();
- assign_bit(NAPI_STATE_THREADED, &napi->state, threaded);
+ val = 0;
+ if (threaded == NETDEV_NAPI_THREADED_BUSY_POLL_ENABLE)
+ val |= NAPIF_STATE_THREADED_BUSY_POLL;
+ if (threaded)
+ val |= NAPIF_STATE_THREADED;
+ set_mask_bits(&napi->state, NAPIF_STATE_THREADED_BUSY_POLL_MASK, val);
return 0;
}
-int dev_set_threaded(struct net_device *dev, bool threaded)
+int dev_set_threaded(struct net_device *dev, enum netdev_napi_threaded threaded)
{
struct napi_struct *napi;
+ unsigned long val;
int err = 0;
netdev_assert_locked_or_invisible(dev);
@@ -6826,17 +6836,22 @@ int dev_set_threaded(struct net_device *dev, bool threaded)
if (dev->threaded == threaded)
return 0;
+ val = 0;
if (threaded) {
/* Check if threaded is set at napi level already */
list_for_each_entry(napi, &dev->napi_list, dev_list)
if (test_bit(NAPI_STATE_THREADED, &napi->state))
return -EINVAL;
+ val |= NAPIF_STATE_THREADED;
+ if (threaded == NETDEV_NAPI_THREADED_BUSY_POLL_ENABLE)
+ val |= NAPIF_STATE_THREADED_BUSY_POLL;
+
list_for_each_entry(napi, &dev->napi_list, dev_list) {
if (!napi->thread) {
err = napi_kthread_create(napi);
if (err) {
- threaded = false;
+ threaded = NETDEV_NAPI_THREADED_DISABLE;
break;
}
}
@@ -6855,9 +6870,13 @@ int dev_set_threaded(struct net_device *dev, bool threaded)
* polled. In this case, the switch between threaded mode and
* softirq mode will happen in the next round of napi_schedule().
* This should not cause hiccups/stalls to the live traffic.
+ *
+ * Switch to busy_poll threaded napi will occur after the threaded
+ * napi is scheduled.
*/
list_for_each_entry(napi, &dev->napi_list, dev_list)
- assign_bit(NAPI_STATE_THREADED, &napi->state, threaded);
+ set_mask_bits(&napi->state,
+ NAPIF_STATE_THREADED_BUSY_POLL_MASK, val);
return err;
}
@@ -7235,7 +7254,7 @@ static int napi_thread_wait(struct napi_struct *napi)
return -1;
}
-static void napi_threaded_poll_loop(struct napi_struct *napi)
+static void napi_threaded_poll_loop(struct napi_struct *napi, bool busy_poll)
{
struct bpf_net_context __bpf_net_ctx, *bpf_net_ctx;
struct softnet_data *sd;
@@ -7264,22 +7283,53 @@ static void napi_threaded_poll_loop(struct napi_struct *napi)
}
skb_defer_free_flush(sd);
bpf_net_ctx_clear(bpf_net_ctx);
+
+ /* Push the skbs up the stack if busy polling. */
+ if (busy_poll)
+ __napi_gro_flush_helper(napi);
local_bh_enable();
- if (!repoll)
+ /* If busy polling then do not break here because we need to
+ * call cond_resched and rcu_softirq_qs_periodic to prevent
+ * watchdog warnings.
+ */
+ if (!repoll && !busy_poll)
break;
rcu_softirq_qs_periodic(last_qs);
cond_resched();
+
+ if (!repoll)
+ break;
}
}
static int napi_threaded_poll(void *data)
{
struct napi_struct *napi = data;
+ bool busy_poll_sched;
+ unsigned long val;
+ bool busy_poll;
+
+ while (!napi_thread_wait(napi)) {
+ /* Once woken up, this means that we are scheduled as threaded
+ * napi and this thread owns the napi context, if busy poll
+ * state is set then we busy poll this napi.
+ */
+ val = READ_ONCE(napi->state);
+ busy_poll = val & NAPIF_STATE_THREADED_BUSY_POLL;
+ busy_poll_sched = val & NAPIF_STATE_SCHED_THREADED_BUSY_POLL;
+
+ /* Do not busy poll if napi is disabled. */
+ if (unlikely(val & NAPIF_STATE_DISABLE))
+ busy_poll = false;
+
+ if (busy_poll != busy_poll_sched)
+ assign_bit(NAPI_STATE_SCHED_THREADED_BUSY_POLL,
+ &napi->state, busy_poll);
- while (!napi_thread_wait(napi))
- napi_threaded_poll_loop(napi);
+ napi_threaded_poll_loop(napi, busy_poll);
+ }
return 0;
}
@@ -12474,7 +12524,7 @@ static void run_backlog_napi(unsigned int cpu)
{
struct softnet_data *sd = per_cpu_ptr(&softnet_data, cpu);
- napi_threaded_poll_loop(&sd->backlog);
+ napi_threaded_poll_loop(&sd->backlog, false);
}
static void backlog_napi_setup(unsigned int cpu)
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 07cb99b114bd..beb496bcb633 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -657,7 +657,7 @@ static int modify_napi_threaded(struct net_device *dev, unsigned long val)
if (list_empty(&dev->napi_list))
return -EOPNOTSUPP;
- if (val != 0 && val != 1)
+ if (val > NETDEV_NAPI_THREADED_BUSY_POLL_ENABLE)
return -EOPNOTSUPP;
ret = dev_set_threaded(dev, val);
diff --git a/net/core/netdev-genl-gen.c b/net/core/netdev-genl-gen.c
index a1f80e687f53..b572beba42e7 100644
--- a/net/core/netdev-genl-gen.c
+++ b/net/core/netdev-genl-gen.c
@@ -97,7 +97,7 @@ static const struct nla_policy netdev_napi_set_nl_policy[NETDEV_A_NAPI_THREADED
[NETDEV_A_NAPI_DEFER_HARD_IRQS] = NLA_POLICY_FULL_RANGE(NLA_U32, &netdev_a_napi_defer_hard_irqs_range),
[NETDEV_A_NAPI_GRO_FLUSH_TIMEOUT] = { .type = NLA_UINT, },
[NETDEV_A_NAPI_IRQ_SUSPEND_TIMEOUT] = { .type = NLA_UINT, },
- [NETDEV_A_NAPI_THREADED] = NLA_POLICY_MAX(NLA_U32, 1),
+ [NETDEV_A_NAPI_THREADED] = NLA_POLICY_MAX(NLA_U32, 2),
};
/* Ops table for netdev */
diff --git a/net/core/netdev-genl.c b/net/core/netdev-genl.c
index 208c3dd768ec..7ae5f3ed0961 100644
--- a/net/core/netdev-genl.c
+++ b/net/core/netdev-genl.c
@@ -329,7 +329,7 @@ netdev_nl_napi_set_config(struct napi_struct *napi, struct genl_info *info)
if (info->attrs[NETDEV_A_NAPI_THREADED]) {
threaded = nla_get_u32(info->attrs[NETDEV_A_NAPI_THREADED]);
- napi_set_threaded(napi, !!threaded);
+ napi_set_threaded(napi, threaded);
}
if (info->attrs[NETDEV_A_NAPI_DEFER_HARD_IRQS]) {
diff --git a/tools/include/uapi/linux/netdev.h b/tools/include/uapi/linux/netdev.h
index 829648b2ef65..c2a9dbb361f6 100644
--- a/tools/include/uapi/linux/netdev.h
+++ b/tools/include/uapi/linux/netdev.h
@@ -74,6 +74,12 @@ enum netdev_qstats_scope {
NETDEV_QSTATS_SCOPE_QUEUE = 1,
};
+enum netdev_napi_threaded {
+ NETDEV_NAPI_THREADED_DISABLE,
+ NETDEV_NAPI_THREADED_ENABLE,
+ NETDEV_NAPI_THREADED_BUSY_POLL_ENABLE,
+};
+
enum {
NETDEV_A_DEV_IFINDEX = 1,
NETDEV_A_DEV_PAD,
--
2.48.1.362.g079036d154-goog
* [PATCH net-next v3 4/4] selftests: Add napi threaded busy poll test in `busy_poller`
2025-02-05 0:10 [PATCH net-next v3 0/4] Add support to do threaded napi busy poll Samiullah Khawaja
` (2 preceding siblings ...)
2025-02-05 0:10 ` [PATCH net-next v3 3/4] Extend napi threaded polling to allow kthread based busy polling Samiullah Khawaja
@ 2025-02-05 0:10 ` Samiullah Khawaja
2025-02-05 0:14 ` [PATCH net-next v3 0/4] Add support to do threaded napi busy poll Samiullah Khawaja
` (3 subsequent siblings)
7 siblings, 0 replies; 37+ messages in thread
From: Samiullah Khawaja @ 2025-02-05 0:10 UTC (permalink / raw)
To: Jakub Kicinski, David S. Miller, Eric Dumazet, Paolo Abeni,
almasrymina
Cc: netdev, skhawaja
Add a test case to run the busy poll test with threaded napi busy poll enabled.
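The new case is run by the existing busy_poll_test.sh script; a minimal
invocation (assuming root and netdevsim support in the test kernel) is:
cd tools/testing/selftests/net && ./busy_poll_test.sh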
Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
---
tools/testing/selftests/net/busy_poll_test.sh | 25 ++++++++++++++++++-
tools/testing/selftests/net/busy_poller.c | 14 ++++++++---
2 files changed, 35 insertions(+), 4 deletions(-)
diff --git a/tools/testing/selftests/net/busy_poll_test.sh b/tools/testing/selftests/net/busy_poll_test.sh
index 7db292ec4884..aeca610dc989 100755
--- a/tools/testing/selftests/net/busy_poll_test.sh
+++ b/tools/testing/selftests/net/busy_poll_test.sh
@@ -27,6 +27,9 @@ NAPI_DEFER_HARD_IRQS=100
GRO_FLUSH_TIMEOUT=50000
SUSPEND_TIMEOUT=20000000
+# NAPI threaded busy poll config
+NAPI_THREADED_POLL=2
+
setup_ns()
{
set -e
@@ -62,6 +65,9 @@ cleanup_ns()
test_busypoll()
{
suspend_value=${1:-0}
+ napi_threaded_value=${2:-0}
+ prefer_busy_poll_value=${3:-$PREFER_BUSY_POLL}
+
tmp_file=$(mktemp)
out_file=$(mktemp)
@@ -73,10 +79,11 @@ test_busypoll()
-b${SERVER_IP} \
-m${MAX_EVENTS} \
-u${BUSY_POLL_USECS} \
- -P${PREFER_BUSY_POLL} \
+ -P${prefer_busy_poll_value} \
-g${BUSY_POLL_BUDGET} \
-i${NSIM_SV_IFIDX} \
-s${suspend_value} \
+ -t${napi_threaded_value} \
-o${out_file}&
wait_local_port_listen nssv ${SERVER_PORT} tcp
@@ -109,6 +116,15 @@ test_busypoll_with_suspend()
return $?
}
+test_busypoll_with_napi_threaded()
+{
+ # Only enable napi threaded poll. Set suspend timeout and prefer busy
+ # poll to 0.
+ test_busypoll 0 ${NAPI_THREADED_POLL} 0
+
+ return $?
+}
+
###
### Code start
###
@@ -154,6 +170,13 @@ if [ $? -ne 0 ]; then
exit 1
fi
+test_busypoll_with_napi_threaded
+if [ $? -ne 0 ]; then
+ echo "test_busypoll_with_napi_threaded failed"
+ cleanup_ns
+ exit 1
+fi
+
echo "$NSIM_SV_FD:$NSIM_SV_IFIDX" > $NSIM_DEV_SYS_UNLINK
echo $NSIM_CL_ID > $NSIM_DEV_SYS_DEL
diff --git a/tools/testing/selftests/net/busy_poller.c b/tools/testing/selftests/net/busy_poller.c
index 04c7ff577bb8..f7407f09f635 100644
--- a/tools/testing/selftests/net/busy_poller.c
+++ b/tools/testing/selftests/net/busy_poller.c
@@ -65,15 +65,16 @@ static uint32_t cfg_busy_poll_usecs;
static uint16_t cfg_busy_poll_budget;
static uint8_t cfg_prefer_busy_poll;
-/* IRQ params */
+/* NAPI params */
static uint32_t cfg_defer_hard_irqs;
static uint64_t cfg_gro_flush_timeout;
static uint64_t cfg_irq_suspend_timeout;
+static enum netdev_napi_threaded cfg_napi_threaded_poll = NETDEV_NAPI_THREADED_DISABLE;
static void usage(const char *filepath)
{
error(1, 0,
- "Usage: %s -p<port> -b<addr> -m<max_events> -u<busy_poll_usecs> -P<prefer_busy_poll> -g<busy_poll_budget> -o<outfile> -d<defer_hard_irqs> -r<gro_flush_timeout> -s<irq_suspend_timeout> -i<ifindex>",
+ "Usage: %s -p<port> -b<addr> -m<max_events> -u<busy_poll_usecs> -P<prefer_busy_poll> -g<busy_poll_budget> -o<outfile> -d<defer_hard_irqs> -r<gro_flush_timeout> -s<irq_suspend_timeout> -t<napi_threaded_poll> -i<ifindex>",
filepath);
}
@@ -86,7 +87,7 @@ static void parse_opts(int argc, char **argv)
if (argc <= 1)
usage(argv[0]);
- while ((c = getopt(argc, argv, "p:m:b:u:P:g:o:d:r:s:i:")) != -1) {
+ while ((c = getopt(argc, argv, "p:m:b:u:P:g:o:d:r:s:i:t:")) != -1) {
/* most options take integer values, except o and b, so reduce
* code duplication a bit for the common case by calling
* strtoull here and leave bounds checking and casting per
@@ -168,6 +169,12 @@ static void parse_opts(int argc, char **argv)
cfg_ifindex = (int)tmp;
break;
+ case 't':
+ if (tmp == ULLONG_MAX || tmp > 2)
+ error(1, ERANGE, "napi threaded poll value must be 0-2");
+
+ cfg_napi_threaded_poll = (enum netdev_napi_threaded)tmp;
+ break;
}
}
@@ -246,6 +253,7 @@ static void setup_queue(void)
cfg_gro_flush_timeout);
netdev_napi_set_req_set_irq_suspend_timeout(set_req,
cfg_irq_suspend_timeout);
+ netdev_napi_set_req_set_threaded(set_req, cfg_napi_threaded_poll);
if (netdev_napi_set(ys, set_req))
error(1, 0, "can't set NAPI params: %s\n", yerr.msg);
--
2.48.1.362.g079036d154-goog
* Re: [PATCH net-next v3 0/4] Add support to do threaded napi busy poll
2025-02-05 0:10 [PATCH net-next v3 0/4] Add support to do threaded napi busy poll Samiullah Khawaja
` (3 preceding siblings ...)
2025-02-05 0:10 ` [PATCH net-next v3 4/4] selftests: Add napi threaded busy poll test in `busy_poller` Samiullah Khawaja
@ 2025-02-05 0:14 ` Samiullah Khawaja
2025-02-05 1:32 ` Martin Karsten
` (2 subsequent siblings)
7 siblings, 0 replies; 37+ messages in thread
From: Samiullah Khawaja @ 2025-02-05 0:14 UTC (permalink / raw)
To: Jakub Kicinski, David S. Miller, Eric Dumazet, Paolo Abeni,
almasrymina
Cc: netdev, Joe Damato, Martin Karsten
On Tue, Feb 4, 2025 at 4:10 PM Samiullah Khawaja <skhawaja@google.com> wrote:
>
> Extend the already existing support of threaded napi poll to do continuous
> busy polling.
>
> This is used for doing continuous polling of napi to fetch descriptors
> from backing RX/TX queues for low latency applications. Allow enabling
> of threaded busypoll using netlink so this can be enabled on a set of
> dedicated napis for low latency applications.
>
> It allows enabling NAPI busy poll for any userspace application
> indepdendent of userspace API being used for packet and event processing
> (epoll, io_uring, raw socket APIs). Once enabled user can fetch the PID
> of the kthread doing NAPI polling and set affinity, priority and
> scheduler for it depending on the low-latency requirements.
>
> Currently threaded napi is only enabled at device level using sysfs. Add
> support to enable/disable threaded mode for a napi individually. This
> can be done using the netlink interface. Extend `napi-set` op in netlink
> spec that allows setting the `threaded` attribute of a napi.
>
> Extend the threaded attribute in napi struct to add an option to enable
> continuous busy polling. Extend the netlink and sysfs interface to allow
> enabled/disabling threaded busypolling at device or individual napi
> level.
>
> We use this for our AF_XDP based hard low-latency usecase using onload
> stack (https://github.com/Xilinx-CNS/onload) that runs in userspace. Our
> usecase is a fixed frequency RPC style traffic with fixed
> request/response size. We simulated this using neper by only starting
> next transaction when last one has completed. The experiment results are
> listed below,
>
> Setup:
>
> - Running on Google C3 VMs with idpf driver with following configurations.
> - IRQ affinity and coalascing is common for both experiments.
> - There is only 1 RX/TX queue configured.
> - First experiment enables busy poll using sysctl for both epoll and
> socket APIs.
> - Second experiment enables NAPI threaded busy poll for the full device
> using sysctl.
>
> Non threaded NAPI busy poll enabled using sysctl.
> ```
> echo 400 | sudo tee /proc/sys/net/core/busy_poll
> echo 400 | sudo tee /proc/sys/net/core/busy_read
> echo 2 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
> echo 15000 | sudo tee /sys/class/net/eth0/gro_flush_timeout
> ```
>
> Results using following command,
> ```
> sudo EF_NO_FAIL=0 EF_POLL_USEC=100000 taskset -c 3-10 onload -v \
> --profile=latency ./neper/tcp_rr -Q 200 -R 400 -T 1 -F 50 \
> -p 50,90,99,999 -H <IP> -l 10
>
> ...
> ...
>
> num_transactions=2835
> latency_min=0.000018976
> latency_max=0.049642100
> latency_mean=0.003243618
> latency_stddev=0.010636847
> latency_p50=0.000025270
> latency_p90=0.005406710
> latency_p99=0.049807350
> latency_p99.9=0.049807350
> ```
>
> Results with napi threaded busy poll using following command,
> ```
> sudo EF_NO_FAIL=0 EF_POLL_USEC=100000 taskset -c 3-10 onload -v \
> --profile=latency ./neper/tcp_rr -Q 200 -R 400 -T 1 -F 50 \
> -p 50,90,99,999 -H <IP> -l 10
>
> ...
> ...
>
> num_transactions=460163
> latency_min=0.000015707
> latency_max=0.200182942
> latency_mean=0.000019453
> latency_stddev=0.000720727
> latency_p50=0.000016950
> latency_p90=0.000017270
> latency_p99=0.000018710
> latency_p99.9=0.000020150
> ```
>
> Here with NAPI threaded busy poll in a separate core, we are able to
> consistently poll the NAPI to keep latency to absolute minimum. And also
> we are able to do this without any major changes to the onload stack and
> threading model.
>
> v3:
> - Fixed calls to dev_set_threaded in drivers
>
> v2:
> - Add documentation in napi.rst.
> - Provide experiment data and usecase details.
> - Update busy_poller selftest to include napi threaded poll testcase.
> - Define threaded mode enum in netlink interface.
> - Included NAPI threaded state in napi config to save/restore.
>
> Samiullah Khawaja (4):
> Add support to set napi threaded for individual napi
> net: Create separate gro_flush helper function
> Extend napi threaded polling to allow kthread based busy polling
> selftests: Add napi threaded busy poll test in `busy_poller`
>
> Documentation/ABI/testing/sysfs-class-net | 3 +-
> Documentation/netlink/specs/netdev.yaml | 14 ++
> Documentation/networking/napi.rst | 80 ++++++++++-
> .../net/ethernet/atheros/atl1c/atl1c_main.c | 2 +-
> drivers/net/ethernet/mellanox/mlxsw/pci.c | 2 +-
> drivers/net/ethernet/renesas/ravb_main.c | 2 +-
> drivers/net/wireless/ath/ath10k/snoc.c | 2 +-
> include/linux/netdevice.h | 24 +++-
> include/uapi/linux/netdev.h | 7 +
> net/core/dev.c | 127 ++++++++++++++----
> net/core/net-sysfs.c | 2 +-
> net/core/netdev-genl-gen.c | 5 +-
> net/core/netdev-genl.c | 9 ++
> tools/include/uapi/linux/netdev.h | 7 +
> tools/testing/selftests/net/busy_poll_test.sh | 25 +++-
> tools/testing/selftests/net/busy_poller.c | 14 +-
> 16 files changed, 285 insertions(+), 40 deletions(-)
>
> --
> 2.48.1.362.g079036d154-goog
>
Adding Joe and Martin as they requested to be CC'd in the next
revision. It seems I missed them when sending this out :(.
* Re: [PATCH net-next v3 0/4] Add support to do threaded napi busy poll
2025-02-05 0:10 [PATCH net-next v3 0/4] Add support to do threaded napi busy poll Samiullah Khawaja
` (4 preceding siblings ...)
2025-02-05 0:14 ` [PATCH net-next v3 0/4] Add support to do threaded napi busy poll Samiullah Khawaja
@ 2025-02-05 1:32 ` Martin Karsten
2025-02-05 20:35 ` Samiullah Khawaja
2025-02-05 3:18 ` Joe Damato
2025-02-06 5:36 ` Dave Taht
7 siblings, 1 reply; 37+ messages in thread
From: Martin Karsten @ 2025-02-05 1:32 UTC (permalink / raw)
To: Samiullah Khawaja, Jakub Kicinski, David S. Miller, Eric Dumazet,
Paolo Abeni, almasrymina
Cc: netdev, Joe Damato
On 2025-02-04 19:10, Samiullah Khawaja wrote:
> Extend the already existing support of threaded napi poll to do continuous
> busy polling.
[snip]
> Setup:
>
> - Running on Google C3 VMs with idpf driver with following configurations.
> - IRQ affinity and coalascing is common for both experiments.
> - There is only 1 RX/TX queue configured.
> - First experiment enables busy poll using sysctl for both epoll and
> socket APIs.
> - Second experiment enables NAPI threaded busy poll for the full device
> using sysctl.
>
> Non threaded NAPI busy poll enabled using sysctl.
> ```
> echo 400 | sudo tee /proc/sys/net/core/busy_poll
> echo 400 | sudo tee /proc/sys/net/core/busy_read
> echo 2 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
> echo 15000 | sudo tee /sys/class/net/eth0/gro_flush_timeout
> ```
>
> Results using following command,
> ```
> sudo EF_NO_FAIL=0 EF_POLL_USEC=100000 taskset -c 3-10 onload -v \
> --profile=latency ./neper/tcp_rr -Q 200 -R 400 -T 1 -F 50 \
> -p 50,90,99,999 -H <IP> -l 10
>
> ...
> ...
>
> num_transactions=2835
> latency_min=0.000018976
> latency_max=0.049642100
> latency_mean=0.003243618
> latency_stddev=0.010636847
> latency_p50=0.000025270
> latency_p90=0.005406710
> latency_p99=0.049807350
> latency_p99.9=0.049807350
> ```
>
> Results with napi threaded busy poll using following command,
> ```
> sudo EF_NO_FAIL=0 EF_POLL_USEC=100000 taskset -c 3-10 onload -v \
> --profile=latency ./neper/tcp_rr -Q 200 -R 400 -T 1 -F 50 \
> -p 50,90,99,999 -H <IP> -l 10
>
> ...
> ...
>
> num_transactions=460163
> latency_min=0.000015707
> latency_max=0.200182942
> latency_mean=0.000019453
> latency_stddev=0.000720727
> latency_p50=0.000016950
> latency_p90=0.000017270
> latency_p99=0.000018710
> latency_p99.9=0.000020150
> ```
>
> Here with NAPI threaded busy poll in a separate core, we are able to
> consistently poll the NAPI to keep latency to absolute minimum. And also
> we are able to do this without any major changes to the onload stack and
> threading model.
As far as I'm concerned, this is still not sufficient information to
fully assess the experiment. The experiment shows an 162-fold decrease
in latency and a corresponding increase in throughput for this
closed-loop workload (which, btw, is different from your open-loop fixed
rate use case). This would be an extraordinary improvement and that
alone warrants some scrutiny. 162X means either the base case has a lot
of idle time or wastes an enormous amount of cpu cycles. How can that be
explained? It would be good to get some instruction/cycle counts to
drill down further.
The server process invocation and the actual irq routing is not
provided. Just stating its common for both experiments is not
sufficient. Without further information, I still cannot rule out that:
- In the base case, application and napi processing execute on the same
core and trample on each other. I don't know how onload implements
epoll_wait, but I worry that it cannot align application processing
(busy-looping?) and napi processing (also busy-looping?).
- In the threaded busy-loop case, napi processing ends up on one core,
while the application executes on another one. This uses two cores
instead of one.
Based on the above, I think at least the following additional scenarios
need to be investigated:
a) Run the server application in proper fullbusy mode, i.e., cleanly
alternating between application processing and napi processing. As a
second step, spread the incoming traffic across two cores to compare
apples to apples.
b) Run application and napi processing on separate cores, but simply by
way of thread pinning and interrupt routing. How close does that get to
the current results? Then selectively add threaded napi and then busy
polling.
c) Run the whole thing without onload for comparison. The benefits
should show without onload as well and it's easier to reproduce. Also, I
suspect onload hurts in the base case and that explains the atrociously
high latency and low throughput of it.
Or, even better, simply provide a complete specification / script for
the experiment that makes it easy to reproduce.
Note that I don't dismiss the approach out of hand. I just think it's
important to properly understand the purported performance improvements.
At the same time, I don't think it's good for the planet to burn cores
with busy-looping without good reason.
Thanks,
Martin
* Re: [PATCH net-next v3 1/4] Add support to set napi threaded for individual napi
2025-02-05 0:10 ` [PATCH net-next v3 1/4] Add support to set napi threaded for individual napi Samiullah Khawaja
@ 2025-02-05 2:50 ` Joe Damato
2025-02-05 23:10 ` David Laight
1 sibling, 0 replies; 37+ messages in thread
From: Joe Damato @ 2025-02-05 2:50 UTC (permalink / raw)
To: Samiullah Khawaja
Cc: Jakub Kicinski, David S. Miller, Eric Dumazet, Paolo Abeni,
almasrymina, netdev
On Wed, Feb 05, 2025 at 12:10:49AM +0000, Samiullah Khawaja wrote:
> A net device has a threaded sysctl that can be used to enable threaded
> napi polling on all of the NAPI contexts under that device. Allow
> enabling threaded napi polling at individual napi level using netlink.
>
> Extend the netlink operation `napi-set` and allow setting the threaded
> attribute of a NAPI. This will enable the threaded polling on a napi
> context.
>
> Tested using following command in qemu/virtio-net:
> ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
> --do napi-set --json '{"id": 66, "threaded": 1}'
At a high level, I think this patch could probably be sent on its
own; IMHO it makes sense to make threaded NAPI per-NAPI via
netdev-genl.
I think its fine if you want to include it in your overall series,
but I think a change like this could probably go in on its own.
[...]
> diff --git a/net/core/dev.c b/net/core/dev.c
> index c0021cbd28fc..50fb234dd7a0 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -6787,6 +6787,30 @@ static void init_gro_hash(struct napi_struct *napi)
> napi->gro_bitmask = 0;
> }
>
> +int napi_set_threaded(struct napi_struct *napi, bool threaded)
> +{
> + if (napi->dev->threaded)
> + return -EINVAL;
I feel this is probably a WARN_ON_ONCE situation? Not sure, but in
any case, see below.
> +
> + if (threaded) {
> + if (!napi->thread) {
> + int err = napi_kthread_create(napi);
> +
> + if (err)
> + return err;
> + }
> + }
> +
> + if (napi->config)
> + napi->config->threaded = threaded;
> +
> + /* Make sure kthread is created before THREADED bit is set. */
> + smp_mb__before_atomic();
> + assign_bit(NAPI_STATE_THREADED, &napi->state, threaded);
> +
> + return 0;
> +}
> +
> int dev_set_threaded(struct net_device *dev, bool threaded)
> {
> struct napi_struct *napi;
> @@ -6798,6 +6822,11 @@ int dev_set_threaded(struct net_device *dev, bool threaded)
> return 0;
>
> if (threaded) {
> + /* Check if threaded is set at napi level already */
> + list_for_each_entry(napi, &dev->napi_list, dev_list)
> + if (test_bit(NAPI_STATE_THREADED, &napi->state))
> + return -EINVAL;
> +
> list_for_each_entry(napi, &dev->napi_list, dev_list) {
> if (!napi->thread) {
> err = napi_kthread_create(napi);
> @@ -6880,6 +6909,8 @@ static void napi_restore_config(struct napi_struct *n)
> napi_hash_add(n);
> n->config->napi_id = n->napi_id;
> }
> +
> + napi_set_threaded(n, n->config->threaded);
The return value isn't checked so even if napi_set_threaded returned
EINVAL it seems like there's no way for that error to "bubble" back
up?
* Re: [PATCH net-next v3 2/4] net: Create separate gro_flush helper function
2025-02-05 0:10 ` [PATCH net-next v3 2/4] net: Create separate gro_flush helper function Samiullah Khawaja
@ 2025-02-05 2:55 ` Joe Damato
0 siblings, 0 replies; 37+ messages in thread
From: Joe Damato @ 2025-02-05 2:55 UTC (permalink / raw)
To: Samiullah Khawaja
Cc: Jakub Kicinski, David S. Miller, Eric Dumazet, Paolo Abeni,
almasrymina, netdev
On Wed, Feb 05, 2025 at 12:10:50AM +0000, Samiullah Khawaja wrote:
> Move multiple copies of same code snippet doing `gro_flush` and
> `gro_normal_list` into a separate helper function.
Like the other patch, this one could probably be sent on its own. I
think if you want to include it in your series, that's fine, but I
think factoring this out into helpers is good clean up.
> Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
> ---
> net/core/dev.c | 28 +++++++++++++---------------
> 1 file changed, 13 insertions(+), 15 deletions(-)
>
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 50fb234dd7a0..d5dcf9dd6225 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -6484,6 +6484,17 @@ static void skb_defer_free_flush(struct softnet_data *sd)
> }
> }
>
> +static void __napi_gro_flush_helper(struct napi_struct *napi)
I wonder if this could be changed to take a boolean as a second arg
then you could also use this helper in napi_complete_done in
addition to the locations you've already cleaned up?
* Re: [PATCH net-next v3 0/4] Add support to do threaded napi busy poll
2025-02-05 0:10 [PATCH net-next v3 0/4] Add support to do threaded napi busy poll Samiullah Khawaja
` (5 preceding siblings ...)
2025-02-05 1:32 ` Martin Karsten
@ 2025-02-05 3:18 ` Joe Damato
2025-02-06 21:19 ` Joe Damato
2025-02-06 5:36 ` Dave Taht
7 siblings, 1 reply; 37+ messages in thread
From: Joe Damato @ 2025-02-05 3:18 UTC (permalink / raw)
To: Samiullah Khawaja
Cc: Jakub Kicinski, David S. Miller, Eric Dumazet, Paolo Abeni,
almasrymina, netdev
On Wed, Feb 05, 2025 at 12:10:48AM +0000, Samiullah Khawaja wrote:
> Extend the already existing support of threaded napi poll to do continuous
> busy polling.
[...]
Overall, +1 to everything Martin said in his response. I think I'd
like to try to reproduce this myself to better understand the stated
numbers below.
IMHO: the cover letter needs more details.
>
> Setup:
>
> - Running on Google C3 VMs with idpf driver with following configurations.
> - IRQ affinity and coalascing is common for both experiments.
As Martin suggested, a lot more detail here would be helpful.
> - There is only 1 RX/TX queue configured.
> - First experiment enables busy poll using sysctl for both epoll and
> socket APIs.
> - Second experiment enables NAPI threaded busy poll for the full device
> using sysctl.
>
> Non threaded NAPI busy poll enabled using sysctl.
> ```
> echo 400 | sudo tee /proc/sys/net/core/busy_poll
> echo 400 | sudo tee /proc/sys/net/core/busy_read
I'm not sure why busy_read is enabled here?
Maybe more details on how exactly the internals of onload+neper work
would explain it, but I presume it's an epoll_wait loop with
non-blocking reads so busy_read wouldn't do anything?
> echo 2 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
> echo 15000 | sudo tee /sys/class/net/eth0/gro_flush_timeout
> ```
The deferral amounts above are relatively small, which makes me
wonder if you are seeing IRQ and softIRQ interference in the base
case?
I ask because it seems like in the test case (if I read the patch
correctly) the processing of packets happens when BH is disabled.
Did I get that right?
If so, then:
- In the base case, IRQs can be generated and softirq can interfere
with packet processing.
- In the test case, packet processing happens but BH is disabled,
reducing interference.
If I got that right, it sounds like IRQ suspension would show good
results in this case, too, and it's probably worth comparing IRQ
suspension in the onload+neper setup.
It seems like it shouldn't be too difficult to get onload+neper
using it and the data would be very enlightening.
* Re: [PATCH net-next v3 3/4] Extend napi threaded polling to allow kthread based busy polling
2025-02-05 0:10 ` [PATCH net-next v3 3/4] Extend napi threaded polling to allow kthread based busy polling Samiullah Khawaja
@ 2025-02-05 9:08 ` Paul Barker
2025-02-06 3:40 ` Samudrala, Sridhar
1 sibling, 0 replies; 37+ messages in thread
From: Paul Barker @ 2025-02-05 9:08 UTC (permalink / raw)
To: Samiullah Khawaja, Jakub Kicinski, David S . Miller, Eric Dumazet,
Paolo Abeni, almasrymina
Cc: netdev
On 05/02/2025 00:10, Samiullah Khawaja wrote:
> Add a new state to napi state enum:
>
> - STATE_THREADED_BUSY_POLL
> Threaded busy poll is enabled/running for this napi.
>
> Following changes are introduced in the napi scheduling and state logic:
>
> - When threaded busy poll is enabled through sysfs it also enables
> NAPI_STATE_THREADED so a kthread is created per napi. It also sets
> NAPI_STATE_THREADED_BUSY_POLL bit on each napi to indicate that we are
> supposed to busy poll for each napi.
>
> - When napi is scheduled with STATE_SCHED_THREADED and associated
> kthread is woken up, the kthread owns the context. If
> NAPI_STATE_THREADED_BUSY_POLL and NAPI_SCHED_THREADED both are set
> then it means that we can busy poll.
>
> - To keep busy polling and to avoid scheduling of the interrupts, the
> napi_complete_done returns false when both SCHED_THREADED and
> THREADED_BUSY_POLL flags are set. Also napi_complete_done returns
> early to avoid the STATE_SCHED_THREADED being unset.
>
> - If at any point STATE_THREADED_BUSY_POLL is unset, the
> napi_complete_done will run and unset the SCHED_THREADED bit also.
> This will make the associated kthread go to sleep as per existing
> logic.
>
> Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
> Reviewed-by: Willem de Bruijn <willemb@google.com>
> ---
> Documentation/ABI/testing/sysfs-class-net | 3 +-
> Documentation/netlink/specs/netdev.yaml | 12 ++--
> Documentation/networking/napi.rst | 67 ++++++++++++++++-
> .../net/ethernet/atheros/atl1c/atl1c_main.c | 2 +-
> drivers/net/ethernet/mellanox/mlxsw/pci.c | 2 +-
> drivers/net/ethernet/renesas/ravb_main.c | 2 +-
> drivers/net/wireless/ath/ath10k/snoc.c | 2 +-
> include/linux/netdevice.h | 20 ++++--
> include/uapi/linux/netdev.h | 6 ++
> net/core/dev.c | 72 ++++++++++++++++---
> net/core/net-sysfs.c | 2 +-
> net/core/netdev-genl-gen.c | 2 +-
> net/core/netdev-genl.c | 2 +-
> tools/include/uapi/linux/netdev.h | 6 ++
> 14 files changed, 171 insertions(+), 29 deletions(-)
Please copy in the maintainers of the network drivers which have been changed.
For ravb,
Acked-by: Paul Barker <paul.barker.ct@bp.renesas.com>
--
Paul Barker
* Re: [PATCH net-next v3 0/4] Add support to do threaded napi busy poll
2025-02-05 1:32 ` Martin Karsten
@ 2025-02-05 20:35 ` Samiullah Khawaja
2025-02-05 22:06 ` Joe Damato
0 siblings, 1 reply; 37+ messages in thread
From: Samiullah Khawaja @ 2025-02-05 20:35 UTC (permalink / raw)
To: Martin Karsten
Cc: Jakub Kicinski, David S . Miller, Eric Dumazet, Paolo Abeni,
almasrymina, netdev, Joe Damato
On Tue, Feb 4, 2025 at 5:32 PM Martin Karsten <mkarsten@uwaterloo.ca> wrote:
>
> On 2025-02-04 19:10, Samiullah Khawaja wrote:
> > Extend the already existing support of threaded napi poll to do continuous
> > busy polling.
>
> [snip]
>
> > Setup:
> >
> > - Running on Google C3 VMs with idpf driver with following configurations.
> > - IRQ affinity and coalascing is common for both experiments.
> > - There is only 1 RX/TX queue configured.
> > - First experiment enables busy poll using sysctl for both epoll and
> > socket APIs.
> > - Second experiment enables NAPI threaded busy poll for the full device
> > using sysctl.
> >
> > Non threaded NAPI busy poll enabled using sysctl.
> > ```
> > echo 400 | sudo tee /proc/sys/net/core/busy_poll
> > echo 400 | sudo tee /proc/sys/net/core/busy_read
> > echo 2 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
> > echo 15000 | sudo tee /sys/class/net/eth0/gro_flush_timeout
> > ```
> >
> > Results using following command,
> > ```
> > sudo EF_NO_FAIL=0 EF_POLL_USEC=100000 taskset -c 3-10 onload -v \
> > --profile=latency ./neper/tcp_rr -Q 200 -R 400 -T 1 -F 50 \
> > -p 50,90,99,999 -H <IP> -l 10
> >
> > ...
> > ...
> >
> > num_transactions=2835
> > latency_min=0.000018976
> > latency_max=0.049642100
> > latency_mean=0.003243618
> > latency_stddev=0.010636847
> > latency_p50=0.000025270
> > latency_p90=0.005406710
> > latency_p99=0.049807350
> > latency_p99.9=0.049807350
> > ```
> >
> > Results with napi threaded busy poll using following command,
> > ```
> > sudo EF_NO_FAIL=0 EF_POLL_USEC=100000 taskset -c 3-10 onload -v \
> > --profile=latency ./neper/tcp_rr -Q 200 -R 400 -T 1 -F 50 \
> > -p 50,90,99,999 -H <IP> -l 10
> >
> > ...
> > ...
> >
> > num_transactions=460163
> > latency_min=0.000015707
> > latency_max=0.200182942
> > latency_mean=0.000019453
> > latency_stddev=0.000720727
> > latency_p50=0.000016950
> > latency_p90=0.000017270
> > latency_p99=0.000018710
> > latency_p99.9=0.000020150
> > ```
> >
> > Here with NAPI threaded busy poll in a separate core, we are able to
> > consistently poll the NAPI to keep latency to absolute minimum. And also
> > we are able to do this without any major changes to the onload stack and
> > threading model.
>
> As far as I'm concerned, this is still not sufficient information to
> fully assess the experiment. The experiment shows an 162-fold decrease
> in latency and a corresponding increase in throughput for this
> closed-loop workload (which, btw, is different from your open-loop fixed
> rate use case). This would be an extraordinary improvement and that
> alone warrants some scrutiny. 162X means either the base case has a lot
> of idle time or wastes an enormous amount of cpu cycles. How can that be
> explained? It would be good to get some instruction/cycle counts to
> drill down further.
The difference is much more apparent (and larger) when I am using more
sockets (50 in this case). I have noticed that the situation gets
worse as I add many more sockets to the mix, but I think this is
enough to show the effect. Processing packets on a core and
then going back to userspace to do application work (or protocol
processing, in the case of onload) is not acceptable for this use case.
If you look at P50, most of the time there is not much difference, but
the tail latencies add up at P90. I want the descriptors to be pulled
from the NIC queues and handed over right away for processing on a
separate core.
>
> The server process invocation and the actual irq routing is not
> provided. Just stating its common for both experiments is not
> sufficient. Without further information, I still cannot rule out that:
>
> - In the base case, application and napi processing execute on the same
> core and trample on each other. I don't know how onload implements
> epoll_wait, but I worry that it cannot align application processing
> (busy-looping?) and napi processing (also busy-looping?).
>
> - In the threaded busy-loop case, napi processing ends up on one core,
> while the application executes on another one. This uses two cores
> instead of one.
>
> Based on the above, I think at least the following additional scenarios
> need to be investigated:
>
> a) Run the server application in proper fullbusy mode, i.e., cleanly
> alternating between application processing and napi processing. As a
> second step, spread the incoming traffic across two cores to compare
> apples to apples.
This is exactly what the experiment I posted does, and it shows a
massive degradation of latency when the core is shared between
application processing and napi processing. The busy_read setup I
mentioned above makes onload do napi processing when xsk_recvmsg is
called. Onload also spins in userspace to handle the AF_XDP
queues/rings in memory.
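For context, a condensed sketch of what that busy_read path looks like
on the AF_XDP side (simplified; binding checks, error handling and ring
servicing are elided):

```c
/* Simplified from the AF_XDP receive path: with busy_read set on the
 * socket, sk_can_busy_loop() is true and the recvmsg() call itself
 * busy-polls the napi before the rings are serviced, which is what
 * onload relies on in the base-case experiment above. */
static int xsk_recvmsg(struct socket *sock, struct msghdr *m,
		       size_t len, int flags)
{
	struct sock *sk = sock->sk;

	if (sk_can_busy_loop(sk))
		sk_busy_loop(sk, 1);	/* non-blocking sockets only */

	/* ... wake the driver / service the RX ring ... */
	return 0;
}
```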
>
> b) Run application and napi processing on separate cores, but simply by
> way of thread pinning and interrupt routing. How close does that get to
> the current results? Then selectively add threaded napi and then busy
> polling.
This was the base case from which we started looking into this work,
and it gives much worse latency because packets are only picked up
from the RX queue on interrupt wakeups (and BH). In fact, moving them
to separate cores in this case lets the core taking interrupts go
idle and sleep if the packet rate is low.
>
> c) Run the whole thing without onload for comparison. The benefits
> should show without onload as well and it's easier to reproduce. Also, I
> suspect onload hurts in the base case and that explains the atrociously
> high latency and low throughput of it.
>
> Or, even better, simply provide a complete specification / script for
> the experiment that makes it easy to reproduce.
That would require setting up onload on the platform you use, provided
it has all the AF_XDP support needed to bring it up. I think I have
provided everything that you would need to set this up on your
platform: the onload repo (it is open source and its README has the
setup steps), the sysctl configuration I am using, and the exact
command with all the arguments I use to run onload with neper
(configuration and environment, including CPU affinity setup).
>
> Note that I don't dismiss the approach out of hand. I just think it's
> important to properly understand the purported performance improvements.
I think the performance improvements are apparent in the data I
provided; I purposefully used more sockets to show the real
differences in tail latency with this revision.
Also, one thing that you are probably missing here is that the change
also has an API aspect: it allows a user to drive a napi independently
of the userspace API or protocol being used. I can certainly drive the
napi using recvmsg, but the napi will only be driven if there is no
data in the receive queue. recvmsg checks the receive queue and, if it
is not empty, returns. This forces the application to drain the socket
and then do napi processing, basically introducing the same effect of
alternating between napi processing and application processing. The
use case of driving the napi on a separate core (or a couple of
threads sharing a single core) is handled cleanly with this change by
enabling it through netlink.
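To make the recvmsg behaviour described above concrete, the in-kernel
busy-poll hook in the receive path only runs when the receive queue is
empty, roughly along these lines (condensed from the TCP receive path;
not part of this series):

```c
/* Condensed illustration: the socket is only busy-polled when there is
 * nothing already queued, so the application ends up alternating
 * between draining the socket and driving the napi. */
if (sk_can_busy_loop(sk) &&
    skb_queue_empty_lockless(&sk->sk_receive_queue) &&
    sk->sk_state == TCP_ESTABLISHED)
	sk_busy_loop(sk, flags & MSG_DONTWAIT);
```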
> At the same time, I don't think it's good for the planet to burn cores
> with busy-looping without good reason.
>
> Thanks,
> Martin
>
* Re: [PATCH net-next v3 0/4] Add support to do threaded napi busy poll
2025-02-05 20:35 ` Samiullah Khawaja
@ 2025-02-05 22:06 ` Joe Damato
2025-02-06 0:45 ` Samiullah Khawaja
2025-02-06 1:15 ` Martin Karsten
0 siblings, 2 replies; 37+ messages in thread
From: Joe Damato @ 2025-02-05 22:06 UTC (permalink / raw)
To: Samiullah Khawaja
Cc: Martin Karsten, Jakub Kicinski, David S . Miller, Eric Dumazet,
Paolo Abeni, almasrymina, netdev
On Wed, Feb 05, 2025 at 12:35:00PM -0800, Samiullah Khawaja wrote:
> On Tue, Feb 4, 2025 at 5:32 PM Martin Karsten <mkarsten@uwaterloo.ca> wrote:
> >
> > On 2025-02-04 19:10, Samiullah Khawaja wrote:
> > > Extend the already existing support of threaded napi poll to do continuous
> > > busy polling.
> >
> > [snip]
> >
> > > Setup:
> > >
> > > - Running on Google C3 VMs with idpf driver with following configurations.
> > > - IRQ affinity and coalascing is common for both experiments.
> > > - There is only 1 RX/TX queue configured.
> > > - First experiment enables busy poll using sysctl for both epoll and
> > > socket APIs.
> > > - Second experiment enables NAPI threaded busy poll for the full device
> > > using sysctl.
> > >
> > > Non threaded NAPI busy poll enabled using sysctl.
> > > ```
> > > echo 400 | sudo tee /proc/sys/net/core/busy_poll
> > > echo 400 | sudo tee /proc/sys/net/core/busy_read
> > > echo 2 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
> > > echo 15000 | sudo tee /sys/class/net/eth0/gro_flush_timeout
> > > ```
> > >
> > > Results using following command,
> > > ```
> > > sudo EF_NO_FAIL=0 EF_POLL_USEC=100000 taskset -c 3-10 onload -v \
> > > --profile=latency ./neper/tcp_rr -Q 200 -R 400 -T 1 -F 50 \
> > > -p 50,90,99,999 -H <IP> -l 10
> > >
> > > ...
> > > ...
> > >
> > > num_transactions=2835
> > > latency_min=0.000018976
> > > latency_max=0.049642100
> > > latency_mean=0.003243618
> > > latency_stddev=0.010636847
> > > latency_p50=0.000025270
> > > latency_p90=0.005406710
> > > latency_p99=0.049807350
> > > latency_p99.9=0.049807350
> > > ```
> > >
> > > Results with napi threaded busy poll using following command,
> > > ```
> > > sudo EF_NO_FAIL=0 EF_POLL_USEC=100000 taskset -c 3-10 onload -v \
> > > --profile=latency ./neper/tcp_rr -Q 200 -R 400 -T 1 -F 50 \
> > > -p 50,90,99,999 -H <IP> -l 10
> > >
> > > ...
> > > ...
> > >
> > > num_transactions=460163
> > > latency_min=0.000015707
> > > latency_max=0.200182942
> > > latency_mean=0.000019453
> > > latency_stddev=0.000720727
> > > latency_p50=0.000016950
> > > latency_p90=0.000017270
> > > latency_p99=0.000018710
> > > latency_p99.9=0.000020150
> > > ```
> > >
> > > Here with NAPI threaded busy poll in a separate core, we are able to
> > > consistently poll the NAPI to keep latency to absolute minimum. And also
> > > we are able to do this without any major changes to the onload stack and
> > > threading model.
> >
> > As far as I'm concerned, this is still not sufficient information to
> > fully assess the experiment. The experiment shows an 162-fold decrease
> > in latency and a corresponding increase in throughput for this
> > closed-loop workload (which, btw, is different from your open-loop fixed
> > rate use case). This would be an extraordinary improvement and that
> > alone warrants some scrutiny. 162X means either the base case has a lot
> > of idle time or wastes an enormous amount of cpu cycles. How can that be
> > explained? It would be good to get some instruction/cycle counts to
> > drill down further.
> The difference is much more apparent (and larger) when I am using more
> sockets (50) in this case. I have noticed that the situation gets
> worse if I add much more sockets in the mix, but I think this here is
> enough to show the effect.
Can you be more precise? What does "the situation" refer to? Are you
saying that the latency decrease is larger than 162x as the number
of sockets increases?
It would be helpful to include that data in the cover letter showing
the latency differences amongst various socket counts.
> The processing of packets on a core and
> then going back to userspace to do application work (or protocol
> processing in case of onload) is not ok for this use case.
Why is it not OK? I assume because there is too much latency? If
so... the data for that configuration should be provided so it can
be examined and compared.
> If you look at P50, most of the time there is not much difference,
> but the tail latencies add up in the P90 case. I want the
> descriptors to be pulled from the NIC queues and handed over right
> away for processing to a separate core.
Sure; it's a trade-off between CPU cycles and latency, right? I
think that data should be provided so that it is clearer what the
actual trade-off is; how many cycles am I trading for the latency
reduction at p90+?
> >
> > The server process invocation and the actual irq routing is not
> > provided. Just stating its common for both experiments is not
> > sufficient. Without further information, I still cannot rule out that:
> >
> > - In the base case, application and napi processing execute on the same
> > core and trample on each other. I don't know how onload implements
> > epoll_wait, but I worry that it cannot align application processing
> > (busy-looping?) and napi processing (also busy-looping?).
> >
> > - In the threaded busy-loop case, napi processing ends up on one core,
> > while the application executes on another one. This uses two cores
> > instead of one.
> >
> > Based on the above, I think at least the following additional scenarios
> > need to be investigated:
> >
> > a) Run the server application in proper fullbusy mode, i.e., cleanly
> > alternating between application processing and napi processing. As a
> > second step, spread the incoming traffic across two cores to compare
> > apples to apples.
> This is exactly what is being done in the experiment I posted and it
> shows massive degradation of latency when the core is shared between
> application processing and napi processing.
It's not clear from the cover letter that this is what the base case
is testing. I am not doubting that you are testing this; I'm
commenting that the cover letter does not clearly explain this
because I also didn't follow that.
> The busy_read setup above that I mentioned, makes onload do napi
> processing when xsk_recvmsg is called. Also onload spins in the
> userspace to handle the AF_XDP queues/rings in memory.
This detail would be useful to explain in the cover letter; it was
not clear to me why busy_read would be used, as you'll see in my
response.
If onload spins in userspace, wouldn't a test case of running
userspace on one core so it can spin and doing NAPI processing on
another core be a good test case to include in the data?
> >
> > b) Run application and napi processing on separate cores, but simply by
> > way of thread pinning and interrupt routing. How close does that get to
> > the current results? Then selectively add threaded napi and then busy
> > polling.
> This was the base case with which we started looking into this work.
> And this gives much worse latency because the packets are only picked
> from the RX queue on interrupt wakeups (and BH). In fact moving them
> to separate cores in this case makes the core getting interrupts be
> idle and go to sleep if the frequency of packets is low.
Two comments here:
1. If it was the base case which motivated the work, should an
explanation of this and data about it be provided so it can be
examined?
2. If PPS is low, wouldn't you _want_ the core to go idle?
> >
> > c) Run the whole thing without onload for comparison. The benefits
> > should show without onload as well and it's easier to reproduce. Also, I
> > suspect onload hurts in the base case and that explains the atrociously
> > high latency and low throughput of it.
> >
> > Or, even better, simply provide a complete specification / script for
> > the experiment that makes it easy to reproduce.
> That would require setting up onload on the platform you use, provided
> it has all the AF_XDP things needed to bring it up. I think I have
> provided everything that you would need to set this up on your
> platform. I have provided the onload repo, it is open source and it
> has README with steps to set it up. I have provided the sysctls
> configuration I am using. I have also provided the exact command with
> all the arguments I am using to run onload with neper (configuration
> and environment including cpu affinity setup).
Respectfully, I disagree.
I think the cover letter lacks a significant amount of detail, test
data, and test setup information which makes reproducing the results
a lot more work for a reviewer.
> >
> > Note that I don't dismiss the approach out of hand. I just think it's
> > important to properly understand the purported performance improvements.
> I think the performance improvements are apparent with the data I
> provided, I purposefully used more sockets to show the real
> differences in tail latency with this revision.
Respectfully, I don't agree that the improvements are "apparent." I
think my comments and Martin's comments both suggest that the cover
letter does not make the improvements apparent.
> Also one thing that you are probably missing here is that the change
> here also has an API aspect, that is it allows a user to drive napi
> independent of the user API or protocol being used.
I'm not missing that part; I'll let Martin speak for himself but I
suspect he also follows that part.
If some one were missing that part, though, it'd probably suggest a
more thorough approach to documenting the change in the cover
letter, right?
> I mean I can certainly drive the napi using recvmsg, but the napi
> will only be driven if there is no data in the recv queue. The
> recvmsg will check the recv_queue and if it is not empty it will
> return. This forces the application to drain the socket and then
> do napi processing, basically introducing the same effect of
> alternating between napi processing and application processing.
> The use case to drive the napi in a separate core (or a couple of
> threads sharing a single core) is handled cleanly with this change
> by enabling it through netlink.
Why not provide the numbers for this and many other test cases (as
mentioned above) so that this can be evaluated more clearly?
> > At the same time, I don't think it's good for the planet to burn cores
> > with busy-looping without good reason.
I think the point Martin makes here is subtle, but extremely
important.
Introducing a general-purpose mechanism to burn cores at 100% CPU,
potentially doing nothing if network traffic is very low, should not
be taken lightly and, IMHO, very rigorous data should be provided so
that anyone who decides to turn this on knows what the trade-off is.
As it stands presently, it's not clear to me how many CPU cycles I
am trading to get the claimed 162X latency improvement at the
tail (a claim which, like Martin, I find quite odd and worth
investigating in and of itself as to _why_ it is so large).
* Re: [PATCH net-next v3 1/4] Add support to set napi threaded for individual napi
2025-02-05 0:10 ` [PATCH net-next v3 1/4] Add support to set napi threaded for individual napi Samiullah Khawaja
2025-02-05 2:50 ` Joe Damato
@ 2025-02-05 23:10 ` David Laight
2025-02-06 18:40 ` Joe Damato
1 sibling, 1 reply; 37+ messages in thread
From: David Laight @ 2025-02-05 23:10 UTC (permalink / raw)
To: Samiullah Khawaja
Cc: Jakub Kicinski, David S . Miller , Eric Dumazet, Paolo Abeni,
almasrymina, netdev
On Wed, 5 Feb 2025 00:10:49 +0000
Samiullah Khawaja <skhawaja@google.com> wrote:
> A net device has a threaded sysctl that can be used to enable threaded
> napi polling on all of the NAPI contexts under that device. Allow
> enabling threaded napi polling at individual napi level using netlink.
>
> Extend the netlink operation `napi-set` and allow setting the threaded
> attribute of a NAPI. This will enable the threaded polling on a napi
> context.
>
> Tested using following command in qemu/virtio-net:
> ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
> --do napi-set --json '{"id": 66, "threaded": 1}'
Is there a sane way for a 'real person' to set these from a normal
startup/network configuration script?
The netlink API is hardly user-friendly.
David
* Re: [PATCH net-next v3 0/4] Add support to do threaded napi busy poll
2025-02-05 22:06 ` Joe Damato
@ 2025-02-06 0:45 ` Samiullah Khawaja
2025-02-06 13:42 ` Joe Damato
2025-02-06 1:15 ` Martin Karsten
1 sibling, 1 reply; 37+ messages in thread
From: Samiullah Khawaja @ 2025-02-06 0:45 UTC (permalink / raw)
To: Joe Damato, Samiullah Khawaja, Martin Karsten, Jakub Kicinski,
David S . Miller, Eric Dumazet, Paolo Abeni, almasrymina, netdev
On Wed, Feb 5, 2025 at 2:06 PM Joe Damato <jdamato@fastly.com> wrote:
>
> On Wed, Feb 05, 2025 at 12:35:00PM -0800, Samiullah Khawaja wrote:
> > On Tue, Feb 4, 2025 at 5:32 PM Martin Karsten <mkarsten@uwaterloo.ca> wrote:
> > >
> > > On 2025-02-04 19:10, Samiullah Khawaja wrote:
> > > > Extend the already existing support of threaded napi poll to do continuous
> > > > busy polling.
> > >
> > > [snip]
> > >
> > > > Setup:
> > > >
> > > > - Running on Google C3 VMs with idpf driver with following configurations.
> > > > - IRQ affinity and coalascing is common for both experiments.
> > > > - There is only 1 RX/TX queue configured.
> > > > - First experiment enables busy poll using sysctl for both epoll and
> > > > socket APIs.
> > > > - Second experiment enables NAPI threaded busy poll for the full device
> > > > using sysctl.
> > > >
> > > > Non threaded NAPI busy poll enabled using sysctl.
> > > > ```
> > > > echo 400 | sudo tee /proc/sys/net/core/busy_poll
> > > > echo 400 | sudo tee /proc/sys/net/core/busy_read
> > > > echo 2 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
> > > > echo 15000 | sudo tee /sys/class/net/eth0/gro_flush_timeout
> > > > ```
> > > >
> > > > Results using following command,
> > > > ```
> > > > sudo EF_NO_FAIL=0 EF_POLL_USEC=100000 taskset -c 3-10 onload -v \
> > > > --profile=latency ./neper/tcp_rr -Q 200 -R 400 -T 1 -F 50 \
> > > > -p 50,90,99,999 -H <IP> -l 10
> > > >
> > > > ...
> > > > ...
> > > >
> > > > num_transactions=2835
> > > > latency_min=0.000018976
> > > > latency_max=0.049642100
> > > > latency_mean=0.003243618
> > > > latency_stddev=0.010636847
> > > > latency_p50=0.000025270
> > > > latency_p90=0.005406710
> > > > latency_p99=0.049807350
> > > > latency_p99.9=0.049807350
> > > > ```
> > > >
> > > > Results with napi threaded busy poll using following command,
> > > > ```
> > > > sudo EF_NO_FAIL=0 EF_POLL_USEC=100000 taskset -c 3-10 onload -v \
> > > > --profile=latency ./neper/tcp_rr -Q 200 -R 400 -T 1 -F 50 \
> > > > -p 50,90,99,999 -H <IP> -l 10
> > > >
> > > > ...
> > > > ...
> > > >
> > > > num_transactions=460163
> > > > latency_min=0.000015707
> > > > latency_max=0.200182942
> > > > latency_mean=0.000019453
> > > > latency_stddev=0.000720727
> > > > latency_p50=0.000016950
> > > > latency_p90=0.000017270
> > > > latency_p99=0.000018710
> > > > latency_p99.9=0.000020150
> > > > ```
> > > >
> > > > Here with NAPI threaded busy poll in a separate core, we are able to
> > > > consistently poll the NAPI to keep latency to absolute minimum. And also
> > > > we are able to do this without any major changes to the onload stack and
> > > > threading model.
> > >
> > > As far as I'm concerned, this is still not sufficient information to
> > > fully assess the experiment. The experiment shows an 162-fold decrease
> > > in latency and a corresponding increase in throughput for this
> > > closed-loop workload (which, btw, is different from your open-loop fixed
> > > rate use case). This would be an extraordinary improvement and that
> > > alone warrants some scrutiny. 162X means either the base case has a lot
> > > of idle time or wastes an enormous amount of cpu cycles. How can that be
> > > explained? It would be good to get some instruction/cycle counts to
> > > drill down further.
> > The difference is much more apparent (and larger) when I am using more
> > sockets (50) in this case. I have noticed that the situation gets
> > worse if I add much more sockets in the mix, but I think this here is
> > enough to show the effect.
>
> Can you be more precise? What does "the situation" refer to? Are you
> saying that the latency decrease is larger than 162x as the number
> of sockets increases?
Yes, the latency at higher percentiles would degrade, since with more
sockets there are longer queues, more work, and more time spent
doing application processing.
>
> It would be helpful to include that data in the cover letter showing
> the latency differences amongst various socket counts.
I just fetched some new data without napi threaded busy poll and
pasted it below; notice the change in P90. But I think the result is
pretty intuitive (w.r.t. extra queueing with more traffic and more
things to do in application processing) and also apparent in the
1-socket data I posted earlier.
20 Socket data:
num_transactions=2345
latency_min=0.000018449
latency_max=0.049824327
latency_mean=0.003838511
latency_stddev=0.012978531
latency_p50=0.000024950
latency_p90=0.000122870
latency_p99=0.049807350
latency_p99.9=0.049807350
30 Socket data:
num_transactions=2098
latency_min=0.000018519
latency_max=0.049794902
latency_mean=0.004295050
latency_stddev=0.013671535
latency_p50=0.000025270
latency_p90=0.000174070
latency_p99=0.049807350
latency_p99.9=0.049807350
>
> > The processing of packets on a core and
> > then going back to userspace to do application work (or protocol
> > processing in case of onload) is not ok for this use case.
>
> Why is it not OK? I assume because there is too much latency? If
> so... the data for that configuration should be provided so it can
> be examined and compared.
The time spent doing application processing of the packets on the
same core takes time away from the napi processing, introducing a
latency difference at the tail as packets get queued. For some
use cases this would be acceptable; they can certainly set the affinity
of this napi thread equal to the userspace thread, or maybe use
epoll/recvmsg to drive it. For my use case, I want a solid
P90+ below 16us. A couple of microseconds spent doing application
processing pushes it to 17-18us, and that is unacceptable for my use
case.
>
> > If you look at P50, most of the time there is not much difference,
> > but the tail latencies add up in the P90 case. I want the
> > descriptors to be pulled from the NIC queues and handed over right
> > away for processing to a separate core.
>
> Sure; it's a trade off between CPU cycles and latency, right? I
> think that data should be provided so that it is more clear what the
> actual trade-off is; how many cycles am I trading for the latency
> reduction at p90+ ?
I think the trade-off depends on the use-case requirements and the
amount of work done in application processing; please see my reply
above.
>
> > >
> > > The server process invocation and the actual irq routing is not
> > > provided. Just stating its common for both experiments is not
> > > sufficient. Without further information, I still cannot rule out that:
> > >
> > > - In the base case, application and napi processing execute on the same
> > > core and trample on each other. I don't know how onload implements
> > > epoll_wait, but I worry that it cannot align application processing
> > > (busy-looping?) and napi processing (also busy-looping?).
> > >
> > > - In the threaded busy-loop case, napi processing ends up on one core,
> > > while the application executes on another one. This uses two cores
> > > instead of one.
> > >
> > > Based on the above, I think at least the following additional scenarios
> > > need to be investigated:
> > >
> > > a) Run the server application in proper fullbusy mode, i.e., cleanly
> > > alternating between application processing and napi processing. As a
> > > second step, spread the incoming traffic across two cores to compare
> > > apples to apples.
> > This is exactly what is being done in the experiment I posted and it
> > shows massive degradation of latency when the core is shared between
> > application processing and napi processing.
>
> It's not clear from the cover letter that this is what the base case
> is testing. I am not doubting that you are testing this; I'm
> commenting that the cover letter does not clearly explain this
> because I also didn't follow that.
>
> > The busy_read setup above that I mentioned, makes onload do napi
> > processing when xsk_recvmsg is called. Also onload spins in the
> > userspace to handle the AF_XDP queues/rings in memory.
>
> This detail would be useful to explain in the cover letter; it was
> not clear to me why busy_read would be used, as you'll see in my
> response.
>
> If onload spins in userspace, wouldn't a test case of running
> userspace on one core so it can spin and doing NAPI processing on
> another core be a good test case to include in the data?
That is what kthread-based napi busy polling does, and it gives me
the best numbers: I run the napi processing on a separate core and
onload spins in userspace on a separate thread. This ties into
my comment about the API aspect of this change; doing napi processing
on a separate core with this change is very flexible regardless of
the UAPI you use.
>
> > >
> > > b) Run application and napi processing on separate cores, but simply by
> > > way of thread pinning and interrupt routing. How close does that get to
> > > the current results? Then selectively add threaded napi and then busy
> > > polling.
> > This was the base case with which we started looking into this work.
> > And this gives much worse latency because the packets are only picked
> > from the RX queue on interrupt wakeups (and BH). In fact moving them
> > to separate cores in this case makes the core getting interrupts be
> > idle and go to sleep if the frequency of packets is low.
>
> Two comments here:
>
> 1. If it was the base case which motivated the work, should an
> explanation of this and data about it be provided so it can be
> examined?
>
> 2. If PPS is low, wouldn't you _want_ the core to go idle?
Not really; if the core goes idle, then the latency of a packet that
arrives after the core goes idle will suffer due to wakeups. This is
what happens with interrupts as well, but it could be acceptable for a
different use case. My guess is that with large message sizes one
would not care, since after the wakeup for the first descriptor the
remaining packets can be processed in a batch.
>
> > >
> > > c) Run the whole thing without onload for comparison. The benefits
> > > should show without onload as well and it's easier to reproduce. Also, I
> > > suspect onload hurts in the base case and that explains the atrociously
> > > high latency and low throughput of it.
> > >
> > > Or, even better, simply provide a complete specification / script for
> > > the experiment that makes it easy to reproduce.
> > That would require setting up onload on the platform you use, provided
> > it has all the AF_XDP things needed to bring it up. I think I have
> > provided everything that you would need to set this up on your
> > platform. I have provided the onload repo, it is open source and it
> > has README with steps to set it up. I have provided the sysctls
> > configuration I am using. I have also provided the exact command with
> > all the arguments I am using to run onload with neper (configuration
> > and environment including cpu affinity setup).
>
> Respectfully, I disagree.
>
> I think the cover letter lacks a significant amount of detail, test
> data, and test setup information which makes reproducing the results
> a lot more work for a reviewer.
>
> > >
> > > Note that I don't dismiss the approach out of hand. I just think it's
> > > important to properly understand the purported performance improvements.
> > I think the performance improvements are apparent with the data I
> > provided, I purposefully used more sockets to show the real
> > differences in tail latency with this revision.
>
> Respectfully, I don't agree that the improvements are "apparent." I
> think my comments and Martin's comments both suggest that the cover
> letter does not make the improvements apparent.
>
> > Also one thing that you are probably missing here is that the change
> > here also has an API aspect, that is it allows a user to drive napi
> > independent of the user API or protocol being used.
>
> I'm not missing that part; I'll let Martin speak for himself but I
> suspect he also follows that part.
>
> If some one were missing that part, though, it'd probably suggest a
> more thorough approach to documenting the change in the cover
> letter, right?
>
> > I mean I can certainly drive the napi using recvmsg, but the napi
> > will only be driven if there is no data in the recv queue. The
> > recvmsg will check the recv_queue and if it is not empty it will
> > return. This forces the application to drain the socket and then
> > do napi processing, basically introducing the same effect of
> > alternating between napi processing and application processing.
> > The use case to drive the napi in a separate core (or a couple of
> > threads sharing a single core) is handled cleanly with this change
> > by enabling it through netlink.
>
> Why not provide the numbers for this and many other test cases (as
> mentioned above) so that this can be evaluated more clearly?
>
> > > At the same time, I don't think it's good for the planet to burn cores
> > > with busy-looping without good reason.
>
> I think the point Martin makes here is subtle, but extremely
> important.
>
> Introducing a general purpose mechanism to burn cores at 100% CPU,
> potentially doing nothing if network traffic is very low, should not
> be taken lightly and, IMHO, very rigorous data should be provided so
> that anyone who decides to turn this on knows what trade-off is.
>
> As it stands presently, it's not clear to me how many CPU cycles I
> am trading to get the claim of a 162X latency improvement at the
> tail (which, like Martin, is a claim I find quite odd and worth
> investigating in and of itself as to _why_ it is so large).
* Re: [PATCH net-next v3 0/4] Add support to do threaded napi busy poll
2025-02-05 22:06 ` Joe Damato
2025-02-06 0:45 ` Samiullah Khawaja
@ 2025-02-06 1:15 ` Martin Karsten
2025-02-06 4:43 ` Samiullah Khawaja
1 sibling, 1 reply; 37+ messages in thread
From: Martin Karsten @ 2025-02-06 1:15 UTC (permalink / raw)
To: Joe Damato, Samiullah Khawaja, Jakub Kicinski, David S . Miller,
Eric Dumazet, Paolo Abeni, almasrymina, netdev
On 2025-02-05 17:06, Joe Damato wrote:
> On Wed, Feb 05, 2025 at 12:35:00PM -0800, Samiullah Khawaja wrote:
>> On Tue, Feb 4, 2025 at 5:32 PM Martin Karsten <mkarsten@uwaterloo.ca> wrote:
>>>
>>> On 2025-02-04 19:10, Samiullah Khawaja wrote:
[snip]
>>> Note that I don't dismiss the approach out of hand. I just think it's
>>> important to properly understand the purported performance improvements.
>> I think the performance improvements are apparent with the data I
>> provided, I purposefully used more sockets to show the real
>> differences in tail latency with this revision.
>
> Respectfully, I don't agree that the improvements are "apparent." I
> think my comments and Martin's comments both suggest that the cover
> letter does not make the improvements apparent.
>
>> Also one thing that you are probably missing here is that the change
>> here also has an API aspect, that is it allows a user to drive napi
>> independent of the user API or protocol being used.
>
> I'm not missing that part; I'll let Martin speak for himself but I
> suspect he also follows that part.
Yes, the API aspect is quite interesting. In fact, Joe has given you
pointers on how to split this patch into multiple incremental steps,
the first of which should be uncontroversial.
I also just read your subsequent response to Joe. He has captured the
relevant concerns very well and I don't understand why you refuse to
document your complete experiment setup for transparency and
reproducibility. This shouldn't be hard.
To be clear: I would like to reproduce the experiments and then engage
in a meaningful discussion about the pros and cons of this mechanism,
but right now I need to speculate and there's no point in arguing
speculation vs. assertions.
Thanks,
Martin
* Re: [PATCH net-next v3 3/4] Extend napi threaded polling to allow kthread based busy polling
2025-02-05 0:10 ` [PATCH net-next v3 3/4] Extend napi threaded polling to allow kthread based busy polling Samiullah Khawaja
2025-02-05 9:08 ` Paul Barker
@ 2025-02-06 3:40 ` Samudrala, Sridhar
1 sibling, 0 replies; 37+ messages in thread
From: Samudrala, Sridhar @ 2025-02-06 3:40 UTC (permalink / raw)
To: Samiullah Khawaja, Jakub Kicinski, David S . Miller, Eric Dumazet,
Paolo Abeni, almasrymina
Cc: netdev
On 2/4/2025 4:10 PM, Samiullah Khawaja wrote:
> Add a new state to napi state enum:
>
> - STATE_THREADED_BUSY_POLL
> Threaded busy poll is enabled/running for this napi.
>
> Following changes are introduced in the napi scheduling and state logic:
>
> - When threaded busy poll is enabled through sysfs it also enables
> NAPI_STATE_THREADED so a kthread is created per napi. It also sets
> NAPI_STATE_THREADED_BUSY_POLL bit on each napi to indicate that we are
> supposed to busy poll for each napi.
It looks like this patch changes the sysfs 'threaded' field from a
boolean to an integer, with the value 2 indicating threaded mode with
busy poll.
So I think the above comment should reflect that, instead of just
saying 'enabled', for both 'threaded' and 'threaded with busypoll'.
>
> - When napi is scheduled with STATE_SCHED_THREADED and associated
> kthread is woken up, the kthread owns the context. If
> NAPI_STATE_THREADED_BUSY_POLL and NAPI_SCHED_THREADED both are set
> then it means that we can busy poll.
>
> - To keep busy polling and to avoid scheduling of the interrupts, the
> napi_complete_done returns false when both SCHED_THREADED and
> THREADED_BUSY_POLL flags are set. Also napi_complete_done returns
> early to avoid the STATE_SCHED_THREADED being unset.
>
> - If at any point STATE_THREADED_BUSY_POLL is unset, the
> napi_complete_done will run and unset the SCHED_THREADED bit also.
> This will make the associated kthread go to sleep as per existing
> logic.
When does STATE_THREADED_BUSY_POLL get unset? Can it only be unset
from userspace? Don't we need a timeout value to come out of
busy-poll mode if there is no traffic?
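For reference, a minimal sketch of the early return described in the
quoted commit message (the NAPIF_STATE_THREADED_BUSY_POLL mask name is
assumed from the commit text; the actual patch may differ):

```c
/* Sketch only: while both bits are set, the polling kthread keeps
 * ownership, so the normal completion path (which would clear
 * SCHED_THREADED and allow interrupts to be re-armed) is skipped.
 * NAPIF_STATE_THREADED_BUSY_POLL is the assumed mask for the new
 * state bit described above. */
bool napi_complete_done(struct napi_struct *n, int work_done)
{
	unsigned long busy = NAPIF_STATE_SCHED_THREADED |
			     NAPIF_STATE_THREADED_BUSY_POLL;

	if ((READ_ONCE(n->state) & busy) == busy)
		return false;

	/* ... existing completion logic ... */
	return true;
}
```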
* Re: [PATCH net-next v3 0/4] Add support to do threaded napi busy poll
2025-02-06 1:15 ` Martin Karsten
@ 2025-02-06 4:43 ` Samiullah Khawaja
2025-02-06 4:50 ` Martin Karsten
2025-02-06 13:54 ` Joe Damato
0 siblings, 2 replies; 37+ messages in thread
From: Samiullah Khawaja @ 2025-02-06 4:43 UTC (permalink / raw)
To: Martin Karsten
Cc: Joe Damato, Jakub Kicinski, David S . Miller, Eric Dumazet,
Paolo Abeni, almasrymina, netdev
On Wed, Feb 5, 2025 at 5:15 PM Martin Karsten <mkarsten@uwaterloo.ca> wrote:
>
> On 2025-02-05 17:06, Joe Damato wrote:
> > On Wed, Feb 05, 2025 at 12:35:00PM -0800, Samiullah Khawaja wrote:
> >> On Tue, Feb 4, 2025 at 5:32 PM Martin Karsten <mkarsten@uwaterloo.ca> wrote:
> >>>
> >>> On 2025-02-04 19:10, Samiullah Khawaja wrote:
>
> [snip]
>
> >>> Note that I don't dismiss the approach out of hand. I just think it's
> >>> important to properly understand the purported performance improvements.
> >> I think the performance improvements are apparent with the data I
> >> provided, I purposefully used more sockets to show the real
> >> differences in tail latency with this revision.
> >
> > Respectfully, I don't agree that the improvements are "apparent." I
> > think my comments and Martin's comments both suggest that the cover
> > letter does not make the improvements apparent.
> >
> >> Also one thing that you are probably missing here is that the change
> >> here also has an API aspect, that is it allows a user to drive napi
> >> independent of the user API or protocol being used.
> >
> > I'm not missing that part; I'll let Martin speak for himself but I
> > suspect he also follows that part.
>
> Yes, the API aspect is quite interesting. In fact, Joe has given you
> pointers how to split this patch into multiple incremental steps, the
> first of which should be uncontroversial.
>
> I also just read your subsequent response to Joe. He has captured the
> relevant concerns very well and I don't understand why you refuse to
> document your complete experiment setup for transparency and
> reproducibility. This shouldn't be hard.
I think I have provided all the setup details and pointers to the
components. I appreciate that you want to reproduce the results, and if
you really want to set it up, then start by setting up onload on
your platform. I cannot provide a generic installer script for onload
that _claims_ to set it up on an arbitrary platform (with an arbitrary
NIC and environment). If it works on your platform (on top of AF_XDP),
then from that point you can certainly build neper and run it using
the command I shared.
>
> To be clear: I would like to reproduce the experiments and then engage
> in a meaningful discussion about the pros and cons of this mechanism,
> but right now I need to speculate and there's no point in arguing
> speculation vs. assertions.
>
> Thanks,
> Martin
>
* Re: [PATCH net-next v3 0/4] Add support to do threaded napi busy poll
2025-02-06 4:43 ` Samiullah Khawaja
@ 2025-02-06 4:50 ` Martin Karsten
2025-02-06 6:43 ` Samiullah Khawaja
2025-02-06 13:54 ` Joe Damato
1 sibling, 1 reply; 37+ messages in thread
From: Martin Karsten @ 2025-02-06 4:50 UTC (permalink / raw)
To: Samiullah Khawaja
Cc: Joe Damato, Jakub Kicinski, David S . Miller, Eric Dumazet,
Paolo Abeni, almasrymina, netdev
On 2025-02-05 23:43, Samiullah Khawaja wrote:
> On Wed, Feb 5, 2025 at 5:15 PM Martin Karsten <mkarsten@uwaterloo.ca> wrote:
>>
>> On 2025-02-05 17:06, Joe Damato wrote:
>>> On Wed, Feb 05, 2025 at 12:35:00PM -0800, Samiullah Khawaja wrote:
>>>> On Tue, Feb 4, 2025 at 5:32 PM Martin Karsten <mkarsten@uwaterloo.ca> wrote:
>>>>>
>>>>> On 2025-02-04 19:10, Samiullah Khawaja wrote:
>>
>> [snip]
>>
>>>>> Note that I don't dismiss the approach out of hand. I just think it's
>>>>> important to properly understand the purported performance improvements.
>>>> I think the performance improvements are apparent with the data I
>>>> provided, I purposefully used more sockets to show the real
>>>> differences in tail latency with this revision.
>>>
>>> Respectfully, I don't agree that the improvements are "apparent." I
>>> think my comments and Martin's comments both suggest that the cover
>>> letter does not make the improvements apparent.
>>>
>>>> Also one thing that you are probably missing here is that the change
>>>> here also has an API aspect, that is it allows a user to drive napi
>>>> independent of the user API or protocol being used.
>>>
>>> I'm not missing that part; I'll let Martin speak for himself but I
>>> suspect he also follows that part.
>>
>> Yes, the API aspect is quite interesting. In fact, Joe has given you
>> pointers how to split this patch into multiple incremental steps, the
>> first of which should be uncontroversial.
>>
>> I also just read your subsequent response to Joe. He has captured the
>> relevant concerns very well and I don't understand why you refuse to
>> document your complete experiment setup for transparency and
>> reproducibility. This shouldn't be hard.
> I think I have provided all the setup details and pointers to
> components. I appreciate that you want to reproduce the results and If
> you really really want to set it up then start by setting up onload on
> your platform. I cannot provide a generic installer script for onload
> that _claims_ to set it up on an arbitrary platform (with arbitrary
> NIC and environment). If it works on your platform (on top of AF_XDP)
> then from that point you can certainly build neper and run it using
> the command I shared.
This is not what I asked. Installing onload and neper is a given.
At a minimum, I need the IRQ routing and any thread affinity settings
at the server. Providing a full experiment script would be appreciated,
but at least the server configuration needs to be specified.
Thanks,
Martin
* Re: [PATCH net-next v3 0/4] Add support to do threaded napi busy poll
2025-02-05 0:10 [PATCH net-next v3 0/4] Add support to do threaded napi busy poll Samiullah Khawaja
` (6 preceding siblings ...)
2025-02-05 3:18 ` Joe Damato
@ 2025-02-06 5:36 ` Dave Taht
2025-02-06 5:49 ` Samiullah Khawaja
2025-02-06 19:50 ` David Laight
7 siblings, 2 replies; 37+ messages in thread
From: Dave Taht @ 2025-02-06 5:36 UTC (permalink / raw)
To: Samiullah Khawaja
Cc: Jakub Kicinski, David S . Miller, Eric Dumazet, Paolo Abeni,
almasrymina, netdev
I have often wondered about the effects of reducing the napi poll
weight from 64 to 16 or less.
Also your test shows an increase in max latency...
latency_max=0.200182942
* Re: [PATCH net-next v3 0/4] Add support to do threaded napi busy poll
2025-02-06 5:36 ` Dave Taht
@ 2025-02-06 5:49 ` Samiullah Khawaja
2025-02-06 5:57 ` Dave Taht
2025-02-06 14:01 ` Joe Damato
2025-02-06 19:50 ` David Laight
1 sibling, 2 replies; 37+ messages in thread
From: Samiullah Khawaja @ 2025-02-06 5:49 UTC (permalink / raw)
To: Dave Taht
Cc: Jakub Kicinski, David S . Miller, Eric Dumazet, Paolo Abeni,
almasrymina, netdev
On Wed, Feb 5, 2025 at 9:36 PM Dave Taht <dave.taht@gmail.com> wrote:
>
> I have often wondered the effects of reducing napi poll weight from 64
> to 16 or less.
Yes, that is interesting. I think a higher weight would allow it to
fetch more descriptors, doing more batching, but then packets are
pushed up the stack later. A lower value would push packets up the
stack more quickly, but if the core is shared with the application
processing thread, the descriptors will spend more time in the NIC
queue.
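For reference (not part of this series), the poll weight is chosen by
the driver at napi registration time, so experimenting with it means a
driver change rather than a runtime knob; a hypothetical example, with
placeholder driver names:

```c
/* Illustrative only: netif_napi_add() registers with the default
 * weight of 64 (NAPI_POLL_WEIGHT); netif_napi_add_weight() lets a
 * driver pick a smaller per-poll budget, e.g. 16 as discussed above.
 * "priv->napi" and "my_poll" are placeholder names. */
netif_napi_add_weight(netdev, &priv->napi, my_poll, 16);
```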
>
> Also your test shows an increase in max latency...
>
> latency_max=0.200182942
I noticed this anomaly; my guess is that it is a packet drop and
this is basically a retransmit timeout. I am going through tcpdumps to
confirm.
* Re: [PATCH net-next v3 0/4] Add support to do threaded napi busy poll
2025-02-06 5:49 ` Samiullah Khawaja
@ 2025-02-06 5:57 ` Dave Taht
2025-02-06 14:01 ` Joe Damato
1 sibling, 0 replies; 37+ messages in thread
From: Dave Taht @ 2025-02-06 5:57 UTC (permalink / raw)
To: Samiullah Khawaja
Cc: Jakub Kicinski, David S . Miller, Eric Dumazet, Paolo Abeni,
almasrymina, netdev
On Wed, Feb 5, 2025 at 9:49 PM Samiullah Khawaja <skhawaja@google.com> wrote:
>
> On Wed, Feb 5, 2025 at 9:36 PM Dave Taht <dave.taht@gmail.com> wrote:
> >
> > I have often wondered the effects of reducing napi poll weight from 64
> > to 16 or less.
> Yes, that is Interesting. I think higher weight would allow it to
> fetch more descriptors doing more batching but then packets are pushed
> up the stack late. A lower value would push packet up the stack
> quicker, but then if the core is being shared with the application
> processing thread then the descriptors will spend more time in the NIC
> queue.
My take has been that a very low weight would keep far more data in L1
before it is processed elsewhere. Modern interrupt response times on
arm gear seem a bit lower than on x86_64, but still pretty horrible.
It's not really related to your patch, but I would love to see
cache hits/misses vis-a-vis this benchmark (with/without a lower weight).
> >
> > Also your test shows an increase in max latency...
> >
> > latency_max=0.200182942
> I noticed this anomaly and my guess is that it is a packet drop and
> this is basically a retransmit timeout. Going through tcpdumps to
> confirm.
Sometimes enabling ECN helps as a debugging tool.
--
Dave Täht CSO, LibreQos
* Re: [PATCH net-next v3 0/4] Add support to do threaded napi busy poll
2025-02-06 4:50 ` Martin Karsten
@ 2025-02-06 6:43 ` Samiullah Khawaja
2025-02-06 14:00 ` Joe Damato
0 siblings, 1 reply; 37+ messages in thread
From: Samiullah Khawaja @ 2025-02-06 6:43 UTC (permalink / raw)
To: Martin Karsten
Cc: Joe Damato, Jakub Kicinski, David S . Miller, Eric Dumazet,
Paolo Abeni, almasrymina, netdev
On Wed, Feb 5, 2025 at 8:50 PM Martin Karsten <mkarsten@uwaterloo.ca> wrote:
>
> On 2025-02-05 23:43, Samiullah Khawaja wrote:
> > On Wed, Feb 5, 2025 at 5:15 PM Martin Karsten <mkarsten@uwaterloo.ca> wrote:
> >>
> >> On 2025-02-05 17:06, Joe Damato wrote:
> >>> On Wed, Feb 05, 2025 at 12:35:00PM -0800, Samiullah Khawaja wrote:
> >>>> On Tue, Feb 4, 2025 at 5:32 PM Martin Karsten <mkarsten@uwaterloo.ca> wrote:
> >>>>>
> >>>>> On 2025-02-04 19:10, Samiullah Khawaja wrote:
> >>
> >> [snip]
> >>
> >>>>> Note that I don't dismiss the approach out of hand. I just think it's
> >>>>> important to properly understand the purported performance improvements.
> >>>> I think the performance improvements are apparent with the data I
> >>>> provided, I purposefully used more sockets to show the real
> >>>> differences in tail latency with this revision.
> >>>
> >>> Respectfully, I don't agree that the improvements are "apparent." I
> >>> think my comments and Martin's comments both suggest that the cover
> >>> letter does not make the improvements apparent.
> >>>
> >>>> Also one thing that you are probably missing here is that the change
> >>>> here also has an API aspect, that is it allows a user to drive napi
> >>>> independent of the user API or protocol being used.
> >>>
> >>> I'm not missing that part; I'll let Martin speak for himself but I
> >>> suspect he also follows that part.
> >>
> >> Yes, the API aspect is quite interesting. In fact, Joe has given you
> >> pointers how to split this patch into multiple incremental steps, the
> >> first of which should be uncontroversial.
> >>
> >> I also just read your subsequent response to Joe. He has captured the
> >> relevant concerns very well and I don't understand why you refuse to
> >> document your complete experiment setup for transparency and
> >> reproducibility. This shouldn't be hard.
> > I think I have provided all the setup details and pointers to
> > components. I appreciate that you want to reproduce the results and If
> > you really really want to set it up then start by setting up onload on
> > your platform. I cannot provide a generic installer script for onload
> > that _claims_ to set it up on an arbitrary platform (with arbitrary
> > NIC and environment). If it works on your platform (on top of AF_XDP)
> > then from that point you can certainly build neper and run it using
> > the command I shared.
>
> This is not what I have asked. Installing onload and neper is a given.
> At least, I need the irq routing and potential thread affinity settings
> at the server. Providing a full experiment script would be appreciated,
> but at least the server configuration needs to be specified.
- There is only 1 rx/tx queue and IRQs are deferred as mentioned in
the cover letter. I have listed the following command for you to
configure your netdev with 1 queue pair.
```
sudo ethtool -L eth0 rx 1 tx 1
```
- There is no special interrupt routing; there is only 1 queue pair,
and IDPF shares the same IRQ for both queues. The results remain the
same whichever core I pin this IRQ to; I tested with cores 2, 3 and 23
(chosen at random) on my machine. This is mostly irrelevant since
interrupts are deferred in both tests. I don't know how your NIC uses
IRQs and whether tx and rx are combined, so you might have to figure
that part out. Sorry about that.
- I moved my napi polling thread to core 2 and, as you can see in the
command I shared, I run neper on cores 3-10. I enable threaded napi at
the device level for ease of use, as I have only 1 queue pair and both
queues share a single NAPI on idpf. This is probably different on your
platform. I use the following commands,
```
# enable threaded busy polling for all NAPIs of the device
echo 2 | sudo tee /sys/class/net/eth0/threaded
# find the NAPI kthread PID, keep it at SCHED_OTHER and pin it to core 2
NAPI_T=$(ps -ef | grep napi | grep -v grep | awk '{ print $2 }')
sudo chrt -o -p 0 $NAPI_T
sudo taskset -pc 2 $NAPI_T
```
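For completeness, pinning the IRQ itself was just the usual affinity
write; a rough sketch, not a tested recipe (the /proc/interrupts match
is a guess, your driver may name the vector differently):
```
# pin the single eth0 IRQ to one core; adjust the name match for your NIC
IRQ=$(awk -F: '/eth0/ {gsub(/ /,"",$1); print $1; exit}' /proc/interrupts)
echo 2 | sudo tee /proc/irq/$IRQ/smp_affinity_list
```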
>
> Thanks,
> Martin
>
* Re: [PATCH net-next v3 0/4] Add support to do threaded napi busy poll
2025-02-06 0:45 ` Samiullah Khawaja
@ 2025-02-06 13:42 ` Joe Damato
2025-02-06 22:49 ` Samiullah Khawaja
0 siblings, 1 reply; 37+ messages in thread
From: Joe Damato @ 2025-02-06 13:42 UTC (permalink / raw)
To: Samiullah Khawaja
Cc: Martin Karsten, Jakub Kicinski, David S . Miller, Eric Dumazet,
Paolo Abeni, almasrymina, netdev
On Wed, Feb 05, 2025 at 04:45:59PM -0800, Samiullah Khawaja wrote:
> On Wed, Feb 5, 2025 at 2:06 PM Joe Damato <jdamato@fastly.com> wrote:
> >
> > On Wed, Feb 05, 2025 at 12:35:00PM -0800, Samiullah Khawaja wrote:
> > > On Tue, Feb 4, 2025 at 5:32 PM Martin Karsten <mkarsten@uwaterloo.ca> wrote:
> > > >
> > > > On 2025-02-04 19:10, Samiullah Khawaja wrote:
> > > > > Extend the already existing support of threaded napi poll to do continuous
> > > > > busy polling.
> > > >
> > > > [snip]
> > > >
> > > > > Setup:
> > > > >
> > > > > - Running on Google C3 VMs with idpf driver with following configurations.
> > > > > - IRQ affinity and coalascing is common for both experiments.
> > > > > - There is only 1 RX/TX queue configured.
> > > > > - First experiment enables busy poll using sysctl for both epoll and
> > > > > socket APIs.
> > > > > - Second experiment enables NAPI threaded busy poll for the full device
> > > > > using sysctl.
> > > > >
> > > > > Non threaded NAPI busy poll enabled using sysctl.
> > > > > ```
> > > > > echo 400 | sudo tee /proc/sys/net/core/busy_poll
> > > > > echo 400 | sudo tee /proc/sys/net/core/busy_read
> > > > > echo 2 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
> > > > > echo 15000 | sudo tee /sys/class/net/eth0/gro_flush_timeout
> > > > > ```
> > > > >
> > > > > Results using following command,
> > > > > ```
> > > > > sudo EF_NO_FAIL=0 EF_POLL_USEC=100000 taskset -c 3-10 onload -v \
> > > > > --profile=latency ./neper/tcp_rr -Q 200 -R 400 -T 1 -F 50 \
> > > > > -p 50,90,99,999 -H <IP> -l 10
> > > > >
> > > > > ...
> > > > > ...
> > > > >
> > > > > num_transactions=2835
> > > > > latency_min=0.000018976
> > > > > latency_max=0.049642100
> > > > > latency_mean=0.003243618
> > > > > latency_stddev=0.010636847
> > > > > latency_p50=0.000025270
> > > > > latency_p90=0.005406710
> > > > > latency_p99=0.049807350
> > > > > latency_p99.9=0.049807350
> > > > > ```
> > > > >
> > > > > Results with napi threaded busy poll using following command,
> > > > > ```
> > > > > sudo EF_NO_FAIL=0 EF_POLL_USEC=100000 taskset -c 3-10 onload -v \
> > > > > --profile=latency ./neper/tcp_rr -Q 200 -R 400 -T 1 -F 50 \
> > > > > -p 50,90,99,999 -H <IP> -l 10
> > > > >
> > > > > ...
> > > > > ...
> > > > >
> > > > > num_transactions=460163
> > > > > latency_min=0.000015707
> > > > > latency_max=0.200182942
> > > > > latency_mean=0.000019453
> > > > > latency_stddev=0.000720727
> > > > > latency_p50=0.000016950
> > > > > latency_p90=0.000017270
> > > > > latency_p99=0.000018710
> > > > > latency_p99.9=0.000020150
> > > > > ```
> > > > >
> > > > > Here with NAPI threaded busy poll in a separate core, we are able to
> > > > > consistently poll the NAPI to keep latency to absolute minimum. And also
> > > > > we are able to do this without any major changes to the onload stack and
> > > > > threading model.
> > > >
> > > > As far as I'm concerned, this is still not sufficient information to
> > > > fully assess the experiment. The experiment shows an 162-fold decrease
> > > > in latency and a corresponding increase in throughput for this
> > > > closed-loop workload (which, btw, is different from your open-loop fixed
> > > > rate use case). This would be an extraordinary improvement and that
> > > > alone warrants some scrutiny. 162X means either the base case has a lot
> > > > of idle time or wastes an enormous amount of cpu cycles. How can that be
> > > > explained? It would be good to get some instruction/cycle counts to
> > > > drill down further.
> > > The difference is much more apparent (and larger) when I am using more
> > > sockets (50) in this case. I have noticed that the situation gets
> > > worse if I add much more sockets in the mix, but I think this here is
> > > enough to show the effect.
> >
> > Can you be more precise? What does "the situation" refer to? Are you
> > saying that the latency decrease is larger than 162x as the number
> > of sockets increases?
> Yes the latency with higher percentiles would degrade since with more
> sockets there will be longer queues with more work and more time spent
> in doing application processing.
Why not capture that in the cover letter and show that, for all test
cases? It seems like it would be pretty useful to see and
understand.
> >
> > It would be helpful to include that data in the cover letter showing
> > the latency differences amongst various socket counts.
> I just fetched some new data without the napi threaded busy poll and
> pasted below, notice the change in P90. But I think the result is
> pretty intuitive (wrt to extra queueing with more traffic and more
> things to do in application processing) and also apparent with 1
> socket data I posted earlier.
>
> 20 Socket data:
> num_transactions=2345
> latency_min=0.000018449
> latency_max=0.049824327
> latency_mean=0.003838511
> latency_stddev=0.012978531
> latency_p50=0.000024950
> latency_p90=0.000122870
> latency_p99=0.049807350
> latency_p99.9=0.049807350
>
> 30 Socket data:
> num_transactions=2098
> latency_min=0.000018519
> latency_max=0.049794902
> latency_mean=0.004295050
> latency_stddev=0.013671535
> latency_p50=0.000025270
> latency_p90=0.000174070
> latency_p99=0.049807350
> latency_p99.9=0.049807350
Why not generate this data for the non-busy poll case and provide it
as well so that it can be compared side-by-side with the busy poll
case?
> > > The processing of packets on a core and
> > > then going back to userspace to do application work (or protocol
> > > processing in case of onload) is not ok for this use case.
> >
> > Why is it not OK? I assume because there is too much latency? If
> > so... the data for that configuration should be provided so it can
> > be examined and compared.
> The time taken to do the application processing of the packets on the
> same core would take time away from the napi processing, introducing
> latency difference at tail with packets getting queued. Now for some
> use cases this would be acceptable, they can certainly set affinity of
> this napi thread equal to the userspace thread or maybe use
> epoll/recvmsg to drive it. For my use case, I want it to have a solid
> P90+ in sub 16us. A couple of microseconds spent doing application
> processing pushes it to 17-18us and that is unacceptable for my use
> case.
Right, so the issue is that sharing a core induces latency which you
want to avoid.
It seems like this data should be provided to highlight the concern?
> >
> > > If you look at P50, most of the time there is not much difference,
> > > but the tail latencies add up in the P90 case. I want the
> > > descriptors to be pulled from the NIC queues and handed over right
> > > away for processing to a separate core.
> >
> > Sure; it's a trade off between CPU cycles and latency, right? I
> > think that data should be provided so that it is more clear what the
> > actual trade-off is; how many cycles am I trading for the latency
> > reduction at p90+ ?
> I think the tradeoff depends on the use case requirements and the
> amount of work done in application processing, please see my reply
> above.
I respectfully disagree; you can show the numbers you are getting
for CPU consumption in exchange for the latency numbers you are
seeing for each test case with varying numbers of sockets.
This is what Martin and I did in our series and it makes the
comparison for reviewers much easier, IMHO.
> >
> > > >
> > > > The server process invocation and the actual irq routing is not
> > > > provided. Just stating its common for both experiments is not
> > > > sufficient. Without further information, I still cannot rule out that:
> > > >
> > > > - In the base case, application and napi processing execute on the same
> > > > core and trample on each other. I don't know how onload implements
> > > > epoll_wait, but I worry that it cannot align application processing
> > > > (busy-looping?) and napi processing (also busy-looping?).
> > > >
> > > > - In the threaded busy-loop case, napi processing ends up on one core,
> > > > while the application executes on another one. This uses two cores
> > > > instead of one.
> > > >
> > > > Based on the above, I think at least the following additional scenarios
> > > > need to be investigated:
> > > >
> > > > a) Run the server application in proper fullbusy mode, i.e., cleanly
> > > > alternating between application processing and napi processing. As a
> > > > second step, spread the incoming traffic across two cores to compare
> > > > apples to apples.
> > > This is exactly what is being done in the experiment I posted and it
> > > shows massive degradation of latency when the core is shared between
> > > application processing and napi processing.
> >
> > It's not clear from the cover letter that this is what the base case
> > is testing. I am not doubting that you are testing this; I'm
> > commenting that the cover letter does not clearly explain this
> > because I also didn't follow that.
> >
> > > The busy_read setup above that I mentioned, makes onload do napi
> > > processing when xsk_recvmsg is called. Also onload spins in the
> > > userspace to handle the AF_XDP queues/rings in memory.
> >
> > This detail would be useful to explain in the cover letter; it was
> > not clear to me why busy_read would be used, as you'll see in my
> > response.
> >
> > If onload spins in userspace, wouldn't a test case of running
> > userspace on one core so it can spin and doing NAPI processing on
> > another core be a good test case to include in the data?
> That is what kthread based napi busy polling does. And this gives me
> the best numbers. I run the napi processing in a separate core and
> onload spinning in userspace on a separate thread. Now this ties into
> my comment about the API aspect of this change. Doing napi processing
> on a separate core with this change is super flexible regardless of
> the UAPI you use.
None of this was clear to me in the cover letter.
> > > >
> > > > b) Run application and napi processing on separate cores, but simply by
> > > > way of thread pinning and interrupt routing. How close does that get to
> > > > the current results? Then selectively add threaded napi and then busy
> > > > polling.
> > > This was the base case with which we started looking into this work.
> > > And this gives much worse latency because the packets are only picked
> > > from the RX queue on interrupt wakeups (and BH). In fact moving them
> > > to separate cores in this case makes the core getting interrupts be
> > > idle and go to sleep if the frequency of packets is low.
> >
> > Two comments here:
> >
> > 1. If it was the base case which motivated the work, should an
> > explanation of this and data about it be provided so it can be
> > examined?
> >
> > 2. If PPS is low, wouldn't you _want_ the core to go idle?
> Not really, if the core goes idle then the latency of a packet that
> arrives after core goes idle will suffer due to wakeups.
Again: can you quantify what that "suffering" is numerically?
> This is what happens with interrupts also, but it could be
> acceptable for a different use case. I am guessing/maybe with
> large message sizes, one would not care since after the wakeup for
> first descriptor remaining packets can be processed in a batch.
To me that suggests the test cases should be expanded to include:
- More thorough and clear descriptions of each experimental case
being tested
- Different message sizes
- Different socket counts
- CPU / cycle consumption information
Then it would be much easier for a reviewer to determine the impact
of this change.
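For example, a sweep over socket counts and message sizes could be as
simple as the sketch below (hypothetical: I'm assuming -F controls the
flow/socket count and -Q/-R the request/response sizes, as in your
cover letter command, and SERVER_IP is a placeholder):
```
# hypothetical sweep; flag meanings and value ranges are assumptions
for FLOWS in 1 20 30 50; do
  for SIZE in 200 1000 4000; do
    sudo EF_NO_FAIL=0 EF_POLL_USEC=100000 taskset -c 3-10 onload -v \
      --profile=latency ./neper/tcp_rr -Q $SIZE -R $SIZE -T 1 -F $FLOWS \
      -p 50,90,99,999 -H "$SERVER_IP" -l 10 > "rr_f${FLOWS}_s${SIZE}.txt"
  done
done
```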
I think what Martin and I are calling for is a more rigorous
approach to gathering, comparing, and analyzing the data.
See the responses below from my previous email.
> >
> > > >
> > > > c) Run the whole thing without onload for comparison. The benefits
> > > > should show without onload as well and it's easier to reproduce. Also, I
> > > > suspect onload hurts in the base case and that explains the atrociously
> > > > high latency and low throughput of it.
> > > >
> > > > Or, even better, simply provide a complete specification / script for
> > > > the experiment that makes it easy to reproduce.
> > > That would require setting up onload on the platform you use, provided
> > > it has all the AF_XDP things needed to bring it up. I think I have
> > > provided everything that you would need to set this up on your
> > > platform. I have provided the onload repo, it is open source and it
> > > has README with steps to set it up. I have provided the sysctls
> > > configuration I am using. I have also provided the exact command with
> > > all the arguments I am using to run onload with neper (configuration
> > > and environment including cpu affinity setup).
> >
> > Respectfully, I disagree.
> >
> > I think the cover letter lacks a significant amount of detail, test
> > data, and test setup information which makes reproducing the results
> > a lot more work for a reviewer.
> >
> > > >
> > > > Note that I don't dismiss the approach out of hand. I just think it's
> > > > important to properly understand the purported performance improvements.
> > > I think the performance improvements are apparent with the data I
> > > provided, I purposefully used more sockets to show the real
> > > differences in tail latency with this revision.
> >
> > Respectfully, I don't agree that the improvements are "apparent." I
> > think my comments and Martin's comments both suggest that the cover
> > letter does not make the improvements apparent.
> >
> > > Also one thing that you are probably missing here is that the change
> > > here also has an API aspect, that is it allows a user to drive napi
> > > independent of the user API or protocol being used.
> >
> > I'm not missing that part; I'll let Martin speak for himself but I
> > suspect he also follows that part.
> >
> > If some one were missing that part, though, it'd probably suggest a
> > more thorough approach to documenting the change in the cover
> > letter, right?
> >
> > > I mean I can certainly drive the napi using recvmsg, but the napi
> > > will only be driven if there is no data in the recv queue. The
> > > recvmsg will check the recv_queue and if it is not empty it will
> > > return. This forces the application to drain the socket and then
> > > do napi processing, basically introducing the same effect of
> > > alternating between napi processing and application processing.
> > > The use case to drive the napi in a separate core (or a couple of
> > > threads sharing a single core) is handled cleanly with this change
> > > by enabling it through netlink.
> >
> > Why not provide the numbers for this and many other test cases (as
> > mentioned above) so that this can be evaluated more clearly?
> >
> > > > At the same time, I don't think it's good for the planet to burn cores
> > > > with busy-looping without good reason.
> >
> > I think the point Martin makes here is subtle, but extremely
> > important.
> >
> > Introducing a general purpose mechanism to burn cores at 100% CPU,
> > potentially doing nothing if network traffic is very low, should not
> > be taken lightly and, IMHO, very rigorous data should be provided so
> > that anyone who decides to turn this on knows what trade-off is.
> >
> > As it stands presently, it's not clear to me how many CPU cycles I
> > am trading to get the claim of a 162X latency improvement at the
> > tail (which, like Martin, is a claim I find quite odd and worth
> > investigating in and of itself as to _why_ it is so large).
* Re: [PATCH net-next v3 0/4] Add support to do threaded napi busy poll
2025-02-06 4:43 ` Samiullah Khawaja
2025-02-06 4:50 ` Martin Karsten
@ 2025-02-06 13:54 ` Joe Damato
1 sibling, 0 replies; 37+ messages in thread
From: Joe Damato @ 2025-02-06 13:54 UTC (permalink / raw)
To: Samiullah Khawaja
Cc: Martin Karsten, Jakub Kicinski, David S . Miller, Eric Dumazet,
Paolo Abeni, almasrymina, netdev
On Wed, Feb 05, 2025 at 08:43:48PM -0800, Samiullah Khawaja wrote:
> On Wed, Feb 5, 2025 at 5:15 PM Martin Karsten <mkarsten@uwaterloo.ca> wrote:
> >
> > On 2025-02-05 17:06, Joe Damato wrote:
> > > On Wed, Feb 05, 2025 at 12:35:00PM -0800, Samiullah Khawaja wrote:
> > >> On Tue, Feb 4, 2025 at 5:32 PM Martin Karsten <mkarsten@uwaterloo.ca> wrote:
> > >>>
> > >>> On 2025-02-04 19:10, Samiullah Khawaja wrote:
> >
> > [snip]
> >
> > >>> Note that I don't dismiss the approach out of hand. I just think it's
> > >>> important to properly understand the purported performance improvements.
> > >> I think the performance improvements are apparent with the data I
> > >> provided, I purposefully used more sockets to show the real
> > >> differences in tail latency with this revision.
> > >
> > > Respectfully, I don't agree that the improvements are "apparent." I
> > > think my comments and Martin's comments both suggest that the cover
> > > letter does not make the improvements apparent.
> > >
> > >> Also one thing that you are probably missing here is that the change
> > >> here also has an API aspect, that is it allows a user to drive napi
> > >> independent of the user API or protocol being used.
> > >
> > > I'm not missing that part; I'll let Martin speak for himself but I
> > > suspect he also follows that part.
> >
> > Yes, the API aspect is quite interesting. In fact, Joe has given you
> > pointers how to split this patch into multiple incremental steps, the
> > first of which should be uncontroversial.
> >
> > I also just read your subsequent response to Joe. He has captured the
> > relevant concerns very well and I don't understand why you refuse to
> > document your complete experiment setup for transparency and
> > reproducibility. This shouldn't be hard.
> I think I have provided all the setup details and pointers to
> components. I appreciate that you want to reproduce the results and If
> you really really want to set it up then start by setting up onload on
> your platform. I cannot provide a generic installer script for onload
> that _claims_ to set it up on an arbitrary platform (with arbitrary
> NIC and environment). If it works on your platform (on top of AF_XDP)
> then from that point you can certainly build neper and run it using
> the command I shared.
I'm going to reproduce this on a GCP instance like you illustrated
in your cover letter, but there is a tremendous lack of detail about
how to get the system into its starting state.
As Martin explains in his follow-up: I am not asking for a generic
installer script for onload. I can definitely compile that program
on a GCP instance myself.
What we are both asking for is more information on how the system is
setup or, as Martin has repeatedly asked for, a script that gets the
system into its initial state for each experiment to streamline
reproduction of results.
I'm confused why your responses seem so strongly opposed to what we
are asking for?
Neither I nor Martin are against a new mechanism for busy polling
being introduced; we are both saying that a general purpose
mechanism that burns 100% CPU even when the network is idle should
be carefully considered. Wouldn't you agree?
The best way to carefully consider it would be to include more data
and more measurements of important system metrics, like CPU cycle
counts, CPU consumption, etc.
I don't intend to speak for Martin, but I'm asking for (and I think
he'd agree):
- More rigorous analysis, with more test cases and data (like CPU
consumption, socket counts, message sizes, etc)
- Better documentation of how one configures the system, possibly
using a script. It's OK if the script only works on the exact
GCP instance you used; I'd like to create that GCP instance and
know 100% that I have the same starting state as you when I run
the test.
- If more test cases are going to be tested and the data extracted
for a cover letter, a script that generates this in a standard
format with all of the metrics would be very helpful to ensure
we are all measuring this the same way, using the same command
line tools, with the same arguments. In other words, if you
decide to run "perf" to count cycles or cache misses or whatever
it is, capturing that in a script ensures we are all measuring
this the same way.
If you think this is unreasonable, please let me know; your
responses suggest you are opposed to more detailed analysis and I am
trying to understand why that might be.
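To make the last point concrete, the kind of measurement wrapper I
have in mind is tiny; something like this sketch (the perf events,
core numbers and neper arguments are placeholders I picked, not a
prescription, and SERVER_IP must be set):
```
#!/bin/bash
# sketch of a wrapper that collects latency and CPU cycle counts together
EVENTS=cycles,instructions,cache-misses

# cycles burned on the NAPI core (core 2 here) while the test runs
sudo perf stat -C 2 -e "$EVENTS" -- sleep 10 &

# latency numbers from neper, with the application side also under perf stat
sudo EF_NO_FAIL=0 EF_POLL_USEC=100000 perf stat -e "$EVENTS" -- \
  taskset -c 3-10 onload -v --profile=latency \
  ./neper/tcp_rr -Q 200 -R 400 -T 1 -F 50 -p 50,90,99,999 \
  -H "$SERVER_IP" -l 10

wait
```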
* Re: [PATCH net-next v3 0/4] Add support to do threaded napi busy poll
2025-02-06 6:43 ` Samiullah Khawaja
@ 2025-02-06 14:00 ` Joe Damato
0 siblings, 0 replies; 37+ messages in thread
From: Joe Damato @ 2025-02-06 14:00 UTC (permalink / raw)
To: Samiullah Khawaja
Cc: Martin Karsten, Jakub Kicinski, David S . Miller, Eric Dumazet,
Paolo Abeni, almasrymina, netdev
On Wed, Feb 05, 2025 at 10:43:43PM -0800, Samiullah Khawaja wrote:
> On Wed, Feb 5, 2025 at 8:50 PM Martin Karsten <mkarsten@uwaterloo.ca> wrote:
> >
> > On 2025-02-05 23:43, Samiullah Khawaja wrote:
> > > On Wed, Feb 5, 2025 at 5:15 PM Martin Karsten <mkarsten@uwaterloo.ca> wrote:
> > >>
> > >> On 2025-02-05 17:06, Joe Damato wrote:
> > >>> On Wed, Feb 05, 2025 at 12:35:00PM -0800, Samiullah Khawaja wrote:
> > >>>> On Tue, Feb 4, 2025 at 5:32 PM Martin Karsten <mkarsten@uwaterloo.ca> wrote:
> > >>>>>
> > >>>>> On 2025-02-04 19:10, Samiullah Khawaja wrote:
> > >>
> > >> [snip]
> > >>
> > >>>>> Note that I don't dismiss the approach out of hand. I just think it's
> > >>>>> important to properly understand the purported performance improvements.
> > >>>> I think the performance improvements are apparent with the data I
> > >>>> provided, I purposefully used more sockets to show the real
> > >>>> differences in tail latency with this revision.
> > >>>
> > >>> Respectfully, I don't agree that the improvements are "apparent." I
> > >>> think my comments and Martin's comments both suggest that the cover
> > >>> letter does not make the improvements apparent.
> > >>>
> > >>>> Also one thing that you are probably missing here is that the change
> > >>>> here also has an API aspect, that is it allows a user to drive napi
> > >>>> independent of the user API or protocol being used.
> > >>>
> > >>> I'm not missing that part; I'll let Martin speak for himself but I
> > >>> suspect he also follows that part.
> > >>
> > >> Yes, the API aspect is quite interesting. In fact, Joe has given you
> > >> pointers how to split this patch into multiple incremental steps, the
> > >> first of which should be uncontroversial.
> > >>
> > >> I also just read your subsequent response to Joe. He has captured the
> > >> relevant concerns very well and I don't understand why you refuse to
> > >> document your complete experiment setup for transparency and
> > >> reproducibility. This shouldn't be hard.
> > > I think I have provided all the setup details and pointers to
> > > components. I appreciate that you want to reproduce the results and If
> > > you really really want to set it up then start by setting up onload on
> > > your platform. I cannot provide a generic installer script for onload
> > > that _claims_ to set it up on an arbitrary platform (with arbitrary
> > > NIC and environment). If it works on your platform (on top of AF_XDP)
> > > then from that point you can certainly build neper and run it using
> > > the command I shared.
> >
> > This is not what I have asked. Installing onload and neper is a given.
> > At least, I need the irq routing and potential thread affinity settings
> > at the server. Providing a full experiment script would be appreciated,
> > but at least the server configuration needs to be specified.
> - There is only 1 rx/tx queue and IRQs are deferred as mentioned in
> the cover letter. I have listed the following command for you to
> configure your netdev with 1 queue pair.
> ```
> sudo ethtool -L eth0 rx 1 tx 1
> ```
> - There is no special interrupt routing, there is only 1 queue pair
> and IDPF shares the same IRQ for both queues. The results remain the
> same whatever core I pin this IRQ to, I tested with core 2, 3 and 23
> (random) on my machine. This is mostly irrelevant since interrupts are
> deferred in both tests. I don't know how your NIC uses IRQs and
> whether the tx and rx are combined, so you might have to figure that
> part out. Sorry about that.
Again, you can assume we are using the same GCP instance you used.
If pinning the IRQ to other cores has no impact on the results, why
not include data that shows that in your cover letter?
> - I moved my napi polling thread to core 2 and as you can see in the
> command I shared I run neper on core 3-10. I enable threaded napi at
> device level for ease of use as I have only 1 queue pair and they both
> share a single NAPI on idpf. Probably different for your platform. I
> use following command,
> ```
> echo 2 | sudo tee /sys/class/net/eth0/threaded
> NAPI_T=$(ps -ef | grep napi | grep -v grep | awk '{ print $2 }')
> sudo chrt -o -p 0 $NAPI_T
> sudo taskset -pc 2 $NAPI_T
> ```
I must have missed it in the original cover letter because I didn't
know that you had moved NAPI polling to core 2.
So:
- your base case tests sharing one CPU core, where IRQ/softirq
interference is likely to occur when network load is high
- your test case tests a 2 CPU setup, where one core does
NAPI processing with BH disabled so that IRQ/softirq cannot
interfere
Is that accurate? It may not be, and if not please let me know.
I never heard back on my response to the cover letter so I have no
idea if my understanding of how this works, what is being tested,
and the machine's configuration is accurate.
I think this points to more detail being required in the cover letter,
given that none of this seems particularly obvious to either Martin
or me.
* Re: [PATCH net-next v3 0/4] Add support to do threaded napi busy poll
2025-02-06 5:49 ` Samiullah Khawaja
2025-02-06 5:57 ` Dave Taht
@ 2025-02-06 14:01 ` Joe Damato
1 sibling, 0 replies; 37+ messages in thread
From: Joe Damato @ 2025-02-06 14:01 UTC (permalink / raw)
To: Samiullah Khawaja
Cc: Dave Taht, Jakub Kicinski, David S . Miller, Eric Dumazet,
Paolo Abeni, almasrymina, netdev
On Wed, Feb 05, 2025 at 09:49:04PM -0800, Samiullah Khawaja wrote:
> On Wed, Feb 5, 2025 at 9:36 PM Dave Taht <dave.taht@gmail.com> wrote:
> >
> > I have often wondered the effects of reducing napi poll weight from 64
> > to 16 or less.
> Yes, that is Interesting. I think higher weight would allow it to
> fetch more descriptors doing more batching but then packets are pushed
> up the stack late. A lower value would push packet up the stack
> quicker, but then if the core is being shared with the application
> processing thread then the descriptors will spend more time in the NIC
> queue.
Seems testable?
> >
> > Also your test shows an increase in max latency...
> >
> > latency_max=0.200182942
> I noticed this anomaly and my guess is that it is a packet drop and
> this is basically a retransmit timeout. Going through tcpdumps to
> confirm.
Can you call out anomalies like this more explicitly and explain why
they occur?
If it weren't for Dave's response, I would have missed this.
* Re: [PATCH net-next v3 1/4] Add support to set napi threaded for individual napi
2025-02-05 23:10 ` David Laight
@ 2025-02-06 18:40 ` Joe Damato
0 siblings, 0 replies; 37+ messages in thread
From: Joe Damato @ 2025-02-06 18:40 UTC (permalink / raw)
To: David Laight
Cc: Samiullah Khawaja, Jakub Kicinski, David S . Miller ,
Eric Dumazet, Paolo Abeni, almasrymina, netdev
On Wed, Feb 05, 2025 at 11:10:03PM +0000, David Laight wrote:
> On Wed, 5 Feb 2025 00:10:49 +0000
> Samiullah Khawaja <skhawaja@google.com> wrote:
>
> > A net device has a threaded sysctl that can be used to enable threaded
> > napi polling on all of the NAPI contexts under that device. Allow
> > enabling threaded napi polling at individual napi level using netlink.
> >
> > Extend the netlink operation `napi-set` and allow setting the threaded
> > attribute of a NAPI. This will enable the threaded polling on a napi
> > context.
> >
> > Tested using following command in qemu/virtio-net:
> > ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
> > --do napi-set --json '{"id": 66, "threaded": 1}'
>
> Is there a sane way for a 'real person' to set these from a normal
> startup/network configuration script?
>
> The netlink API is hardly user-friendly.
There is a C library, libynl, that abstracts a lot of the netlink
stuff away and is quite nice. That said, if you wanted to use it
from a script, you'd probably need bindings for it in the language
of your choice.
If you meant more of a shell-script type setup, then yeah... the Python
CLI is (AFAIK) the only option.
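As a rough, untested sketch of what a startup script could do with
the Python CLI (the spec path and napi-set call are from this series;
the ifindex lookup and the hard-coded NAPI id are placeholders, the
real ids come from the napi-get dump):
```
SPEC=Documentation/netlink/specs/netdev.yaml
IFINDEX=$(cat /sys/class/net/eth0/ifindex)

# list the NAPI ids of the device; read the ids from this output
./tools/net/ynl/cli.py --spec "$SPEC" --dump napi-get \
    --json "{\"ifindex\": $IFINDEX}"

# enable threaded polling on one of them (id 66 is a placeholder)
./tools/net/ynl/cli.py --spec "$SPEC" --do napi-set \
    --json '{"id": 66, "threaded": 1}'
```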
* Re: [PATCH net-next v3 0/4] Add support to do threaded napi busy poll
2025-02-06 5:36 ` Dave Taht
2025-02-06 5:49 ` Samiullah Khawaja
@ 2025-02-06 19:50 ` David Laight
1 sibling, 0 replies; 37+ messages in thread
From: David Laight @ 2025-02-06 19:50 UTC (permalink / raw)
To: Dave Taht
Cc: Samiullah Khawaja, Jakub Kicinski, David S . Miller, Eric Dumazet,
Paolo Abeni, almasrymina, netdev
On Wed, 5 Feb 2025 21:36:31 -0800
Dave Taht <dave.taht@gmail.com> wrote:
> I have often wondered the effects of reducing napi poll weight from 64
> to 16 or less.
Doesn't that just move the loop from inside 'NAPI' to the softint
scheduler? That could easily slow things down because of L1 I-cache
pressure.
IIRC what happens next is that the softint scheduler decides it has run
too many functions on the current process stack and decides to defer
further calls to a thread context.
Since that thread runs at a normal user priority, hardware receive rings
then overflow (discarding packets) because the receive code has suddenly
gone from being very high priority (higher than any RT process) to really
quite low.
On a system doing real work, all the ethernet receive code needs to run
at a reasonably high priority (e.g. a low FIFO one) in order to avoid
packet loss for anything needing high rx packet rates.
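E.g. something along these lines for the threaded case (untested
sketch; the thread name match and the priority value are guesses):
```
# give the NAPI kthread a modest SCHED_FIFO priority instead of SCHED_OTHER,
# and keep it pinned away from the application cores
NAPI_T=$(ps -e -o pid,comm | awk '/napi/ {print $1; exit}')
sudo chrt -f -p 1 "$NAPI_T"
sudo taskset -pc 2 "$NAPI_T"
```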
David
* Re: [PATCH net-next v3 0/4] Add support to do threaded napi busy poll
2025-02-05 3:18 ` Joe Damato
@ 2025-02-06 21:19 ` Joe Damato
2025-02-06 22:06 ` Samiullah Khawaja
0 siblings, 1 reply; 37+ messages in thread
From: Joe Damato @ 2025-02-06 21:19 UTC (permalink / raw)
To: Samiullah Khawaja, Jakub Kicinski, David S . Miller ,
Eric Dumazet, Paolo Abeni, almasrymina, netdev
On Tue, Feb 04, 2025 at 07:18:36PM -0800, Joe Damato wrote:
> On Wed, Feb 05, 2025 at 12:10:48AM +0000, Samiullah Khawaja wrote:
> > Extend the already existing support of threaded napi poll to do continuous
> > busy polling.
>
> [...]
>
> Overall, +1 to everything Martin said in his response. I think I'd
> like to try to reproduce this myself to better understand the stated
> numbers below.
>
> IMHO: the cover letter needs more details.
>
> >
> > Setup:
> >
> > - Running on Google C3 VMs with idpf driver with following configurations.
> > - IRQ affinity and coalascing is common for both experiments.
>
> As Martin suggested, a lot more detail here would be helpful.
Just to give you a sense of the questions I ran into while trying to
reproduce this just now:
- What is the base SHA? You should use --base when using git
format-patch. I assumed the latest net-next SHA and applied the
patches to that.
- Which C3 instance type? I chose c3-highcpu-192-metal, but I could
have chosen c3-standard-192-metal, apparently. No idea what
difference this makes on the results, if any.
- Was "tier 1 networking" enabled? I enabled it. No idea if it
matters or not. I assume not, since it would be internal
networking within the GCP VPC of my instances and not real egress?
- What version of onload was used? Which SHA or release tag?
- I have no idea where to put CPU affinity for the 1 TX/RX queue, I
assume CPU 2 based on your other message.
- The neper commands provided seem to be the server side since there
is no -c mentioned. What is the neper client side command?
- What do the environment variables set for onload+neper mean?
...
Do you follow what I'm getting at here? The cover letter lacks a tremendous
amount of detail that makes reproducing the setup you are using
unnecessarily difficult.
Do you agree that I should be able to read the cover letter and, if
so desired, go off and reproduce the setup and get similar results?
* Re: [PATCH net-next v3 0/4] Add support to do threaded napi busy poll
2025-02-06 21:19 ` Joe Damato
@ 2025-02-06 22:06 ` Samiullah Khawaja
2025-02-06 22:48 ` Joe Damato
2025-02-11 2:52 ` Martin Karsten
0 siblings, 2 replies; 37+ messages in thread
From: Samiullah Khawaja @ 2025-02-06 22:06 UTC (permalink / raw)
To: Joe Damato, Samiullah Khawaja, Jakub Kicinski, David S . Miller,
Eric Dumazet, Paolo Abeni, almasrymina, netdev
On Thu, Feb 6, 2025 at 1:19 PM Joe Damato <jdamato@fastly.com> wrote:
>
> On Tue, Feb 04, 2025 at 07:18:36PM -0800, Joe Damato wrote:
> > On Wed, Feb 05, 2025 at 12:10:48AM +0000, Samiullah Khawaja wrote:
> > > Extend the already existing support of threaded napi poll to do continuous
> > > busy polling.
> >
> > [...]
> >
> > Overall, +1 to everything Martin said in his response. I think I'd
> > like to try to reproduce this myself to better understand the stated
> > numbers below.
> >
> > IMHO: the cover letter needs more details.
> >
> > >
> > > Setup:
> > >
> > > - Running on Google C3 VMs with idpf driver with following configurations.
> > > - IRQ affinity and coalascing is common for both experiments.
> >
> > As Martin suggested, a lot more detail here would be helpful.
>
> Just to give you a sense of the questions I ran into while trying to
> reproduce this just now:
>
> - What is the base SHA? You should use --base when using git
> format-patch. I assumed the latest net-next SHA and applied the
> patches to that.
Yes that is true. I will use --base when I do it next. Thanks for the
suggestion.
>
> - Which C3 instance type? I chose c3-highcpu-192-metal, but I could
> have chosen c3-standard-192-metal, apparently. No idea what
> difference this makes on the results, if any.
The tricky part is that the c3 instance variant that I am using for
dev is probably not available publicly. It is a variant of
c3-standard-192-metal, but we had to enable AF_XDP on it to make it
stable enough to run onload. You will have to reproduce this on a
platform available to you with AF_XDP as suggested in the onload
github I shared. This is the problem with providing an
installer/runner script and also system configuration. My
configuration would not be best for your platform, but you should
certainly be able to reproduce the relative improvements in latency
using the different busypolling schemes (busy_read/busy_poll vs
threaded napi busy poll) I mentioned in the cover letter.
>
> - Was "tier 1 networking" enabled? I enabled it. No idea if it
> matters or not. I assume not, since it would be internal
> networking within the GCP VPC of my instances and not real egress?
>
> - What version of onload was used? Which SHA or release tag?
v9.0. I agree this should be added to the cover letter.
>
> - I have no idea where to put CPU affinity for the 1 TX/RX queue, I
> assume CPU 2 based on your other message.
Yes, I replied to Martin with this information, but like I said it
certainly depends on your platform, and hence I didn't include it in
the cover letter, since I don't know what/where core 2 looks like on
your platform.
>
> - The neper commands provided seem to be the server side since there
> is no -c mentioned. What is the neper client side command?
Same command with -c
>
> - What do the environment variables set for onload+neper mean?
>
> ...
>
> Do you follow what I'm getting at here? The cover lacks a tremendous
> amount of detail that makes reproducing the setup you are using
> unnecessarily difficult.
>
> Do you agree that I should be able to read the cover letter and, if
> so desired, go off and reproduce the setup and get similar results?
Yes, you should be able to do that, but there are micro details of
your platform and configuration that I have no way of knowing, so I
cannot suggest configurations for them. I have certainly pointed out
the relevant environment and special configuration (netdev queue
sizes, sysctls, irq defer, the neper command and onload environment
variables) that I used for each test case in my experiment. Beyond
that, I have no way of providing you an internal C3 platform or
providing a system configuration for your platform.
* Re: [PATCH net-next v3 0/4] Add support to do threaded napi busy poll
2025-02-06 22:06 ` Samiullah Khawaja
@ 2025-02-06 22:48 ` Joe Damato
2025-02-07 3:13 ` Samiullah Khawaja
2025-02-11 2:52 ` Martin Karsten
1 sibling, 1 reply; 37+ messages in thread
From: Joe Damato @ 2025-02-06 22:48 UTC (permalink / raw)
To: Samiullah Khawaja
Cc: Jakub Kicinski, David S . Miller, Eric Dumazet, Paolo Abeni,
almasrymina, netdev
On Thu, Feb 06, 2025 at 02:06:14PM -0800, Samiullah Khawaja wrote:
> On Thu, Feb 6, 2025 at 1:19 PM Joe Damato <jdamato@fastly.com> wrote:
> >
> > On Tue, Feb 04, 2025 at 07:18:36PM -0800, Joe Damato wrote:
> > > On Wed, Feb 05, 2025 at 12:10:48AM +0000, Samiullah Khawaja wrote:
> > > > Extend the already existing support of threaded napi poll to do continuous
> > > > busy polling.
> > >
> > > [...]
> > >
> > > Overall, +1 to everything Martin said in his response. I think I'd
> > > like to try to reproduce this myself to better understand the stated
> > > numbers below.
> > >
> > > IMHO: the cover letter needs more details.
> > >
> > > >
> > > > Setup:
> > > >
> > > > - Running on Google C3 VMs with idpf driver with following configurations.
> > > > - IRQ affinity and coalascing is common for both experiments.
> > >
> > > As Martin suggested, a lot more detail here would be helpful.
> >
> > Just to give you a sense of the questions I ran into while trying to
> > reproduce this just now:
> >
> > - What is the base SHA? You should use --base when using git
> > format-patch. I assumed the latest net-next SHA and applied the
> > patches to that.
> Yes that is true. I will use --base when I do it next. Thanks for the
> suggestion.
> >
> > - Which C3 instance type? I chose c3-highcpu-192-metal, but I could
> > have chosen c3-standard-192-metal, apparently. No idea what
> > difference this makes on the results, if any.
> The tricky part is that the c3 instance variant that I am using for
> dev is probably not available publicly.
Can you use a publicly available c3 instance type instead? Maybe you
can't, and if so you should probably mention that it's an internal
c3 image and can't be reproduced on the public c3's because of XYZ
in the cover letter.
> It is a variant of c3-standard-192-metal but we had to enable
> AF_XDP on it to make it stable to be able to run onload. You will
> have to reproduce this on a platform available to you with AF_XDP
> as suggested in the onload github I shared. This is the problem
> with providing an installer/runner script and also system
> configuration. My configuration would not be best for your
> platform, but you should certainly be able to reproduce the
> relative improvements in latency using the different busypolling
> schemes (busy_read/busy_poll vs threaded napi busy poll) I
> mentioned in the cover letter.
Sorry, I still don't agree. The fact that your configuration won't
work for me out of the box is, IMHO, totally unrelated to what
Martin and I are asking for.
I won't speak for Martin, but I am saying that the configuration you
are using should be thoroughly documented so that I can at least
understand it and how I might reproduce it.
> >
> > - Was "tier 1 networking" enabled? I enabled it. No idea if it
> > matters or not. I assume not, since it would be internal
> > networking within the GCP VPC of my instances and not real egress?
> >
> > - What version of onload was used? Which SHA or release tag?
> v9.0, I agree this should be added to the cover letter.
To the list of things to add to the cover letter:
- What OS and version are you using?
- How many runs of neper? It seems like it was just 1 run. Is that
sufficient? I'd argue you need to re-run the experiment many
times, with different message sizes, queue counts, etc and
compute some statistical analysis of the results.
> >
> > - I have no idea where to put CPU affinity for the 1 TX/RX queue, I
> > assume CPU 2 based on your other message.
> Yes I replied to Martin with this information, but like I said it
> certainly depends on your platform and hence didn't include it in the
> cover letter. Since I don't know what/where core 2 looks like on your
> platform.
You keep referring to "your platform" and I'm still confused.
Don't you think it's important to properly document _your setup_,
including mentioning that core 2 is used for the IRQ and the
onload+neper runs on other cores? Maybe I missed it in the cover
letter, but that detail seems pretty important for analysis,
wouldn't you agree?
Even if my computer is different, there should be enough detail for
me to form a mental model of what you are doing so that I can think
through it, understand the data, and, if I want to, try to reproduce
it.
> >
> > - The neper commands provided seem to be the server side since there
> > is no -c mentioned. What is the neper client side command?
> Same command with -c
> >
> > - What do the environment variables set for onload+neper mean?
> >
> > ...
> >
> > Do you follow what I'm getting at here? The cover lacks a tremendous
> > amount of detail that makes reproducing the setup you are using
> > unnecessarily difficult.
> >
> > Do you agree that I should be able to read the cover letter and, if
> > so desired, go off and reproduce the setup and get similar results?
> Yes you should be able to that, but there are micro details of your
> platform and configuration that I have no way of knowing and suggest
> configurations. I have certainly pointed out the relevant environment
> and special configurations (netdev queues sizes, sysctls, irq defer,
> neper command and onload environment variables) that I did for each
> test case in my experiment. Beyond that I have no way of providing you
> an internal C3 platform or providing system configuration for your
> platform.
I'm going to let the thread rest at this point; I just think we are
talking past each other here and it just doesn't feel productive.
You are saying that your platform and configuration are not publicly
available, there are too many "micro details", and that you can't
suggest a configuration for my computer, which is sure to be wildly
different.
I keep repeating this, but I'll repeat it more explicitly: I'm not
asking you to suggest a configuration to me for my computer.
I'm asking you to thoroughly, rigorously, and verbosely document
what _exactly_ *your setup* is, precisely how it is configured, and
all the versions and SHAs of everything so that if I want to try to
reproduce it I have enough information in order to do so.
With your cover letter as it stands presently: this is not possible.
Surely, if I can run neper and get a latency measurement, I can run
a script that you wrote which measures CPU cycles using a publicly
available tool, right? Then at least we are both measuring CPU
consumption the same way and can compare latency vs CPU tradeoff
results based on the same measurement.
By providing better documentation, you make it more likely that
other people will try to reproduce your results. The more people who
reproduce your results, the stronger the argument to get this merged
becomes.
* Re: [PATCH net-next v3 0/4] Add support to do threaded napi busy poll
2025-02-06 13:42 ` Joe Damato
@ 2025-02-06 22:49 ` Samiullah Khawaja
2025-02-06 22:58 ` Joe Damato
0 siblings, 1 reply; 37+ messages in thread
From: Samiullah Khawaja @ 2025-02-06 22:49 UTC (permalink / raw)
To: Joe Damato, Samiullah Khawaja, Martin Karsten, Jakub Kicinski,
David S . Miller, Eric Dumazet, Paolo Abeni, almasrymina, netdev
On Thu, Feb 6, 2025 at 5:42 AM Joe Damato <jdamato@fastly.com> wrote:
>
> On Wed, Feb 05, 2025 at 04:45:59PM -0800, Samiullah Khawaja wrote:
> > On Wed, Feb 5, 2025 at 2:06 PM Joe Damato <jdamato@fastly.com> wrote:
> > >
> > > On Wed, Feb 05, 2025 at 12:35:00PM -0800, Samiullah Khawaja wrote:
> > > > On Tue, Feb 4, 2025 at 5:32 PM Martin Karsten <mkarsten@uwaterloo.ca> wrote:
> > > > >
> > > > > On 2025-02-04 19:10, Samiullah Khawaja wrote:
> > > > > > Extend the already existing support of threaded napi poll to do continuous
> > > > > > busy polling.
> > > > >
> > > > > [snip]
> > > > >
> > > > > > Setup:
> > > > > >
> > > > > > - Running on Google C3 VMs with idpf driver with following configurations.
> > > > > > - IRQ affinity and coalascing is common for both experiments.
> > > > > > - There is only 1 RX/TX queue configured.
> > > > > > - First experiment enables busy poll using sysctl for both epoll and
> > > > > > socket APIs.
> > > > > > - Second experiment enables NAPI threaded busy poll for the full device
> > > > > > using sysctl.
> > > > > >
> > > > > > Non threaded NAPI busy poll enabled using sysctl.
> > > > > > ```
> > > > > > echo 400 | sudo tee /proc/sys/net/core/busy_poll
> > > > > > echo 400 | sudo tee /proc/sys/net/core/busy_read
> > > > > > echo 2 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
> > > > > > echo 15000 | sudo tee /sys/class/net/eth0/gro_flush_timeout
> > > > > > ```
> > > > > >
> > > > > > Results using following command,
> > > > > > ```
> > > > > > sudo EF_NO_FAIL=0 EF_POLL_USEC=100000 taskset -c 3-10 onload -v \
> > > > > > --profile=latency ./neper/tcp_rr -Q 200 -R 400 -T 1 -F 50 \
> > > > > > -p 50,90,99,999 -H <IP> -l 10
> > > > > >
> > > > > > ...
> > > > > > ...
> > > > > >
> > > > > > num_transactions=2835
> > > > > > latency_min=0.000018976
> > > > > > latency_max=0.049642100
> > > > > > latency_mean=0.003243618
> > > > > > latency_stddev=0.010636847
> > > > > > latency_p50=0.000025270
> > > > > > latency_p90=0.005406710
> > > > > > latency_p99=0.049807350
> > > > > > latency_p99.9=0.049807350
> > > > > > ```
> > > > > >
> > > > > > Results with napi threaded busy poll using following command,
> > > > > > ```
> > > > > > sudo EF_NO_FAIL=0 EF_POLL_USEC=100000 taskset -c 3-10 onload -v \
> > > > > > --profile=latency ./neper/tcp_rr -Q 200 -R 400 -T 1 -F 50 \
> > > > > > -p 50,90,99,999 -H <IP> -l 10
> > > > > >
> > > > > > ...
> > > > > > ...
> > > > > >
> > > > > > num_transactions=460163
> > > > > > latency_min=0.000015707
> > > > > > latency_max=0.200182942
> > > > > > latency_mean=0.000019453
> > > > > > latency_stddev=0.000720727
> > > > > > latency_p50=0.000016950
> > > > > > latency_p90=0.000017270
> > > > > > latency_p99=0.000018710
> > > > > > latency_p99.9=0.000020150
> > > > > > ```
> > > > > >
> > > > > > Here with NAPI threaded busy poll in a separate core, we are able to
> > > > > > consistently poll the NAPI to keep latency to absolute minimum. And also
> > > > > > we are able to do this without any major changes to the onload stack and
> > > > > > threading model.
> > > > >
> > > > > As far as I'm concerned, this is still not sufficient information to
> > > > > fully assess the experiment. The experiment shows an 162-fold decrease
> > > > > in latency and a corresponding increase in throughput for this
> > > > > closed-loop workload (which, btw, is different from your open-loop fixed
> > > > > rate use case). This would be an extraordinary improvement and that
> > > > > alone warrants some scrutiny. 162X means either the base case has a lot
> > > > > of idle time or wastes an enormous amount of cpu cycles. How can that be
> > > > > explained? It would be good to get some instruction/cycle counts to
> > > > > drill down further.
> > > > The difference is much more apparent (and larger) when I am using more
> > > > sockets (50) in this case. I have noticed that the situation gets
> > > > worse if I add much more sockets in the mix, but I think this here is
> > > > enough to show the effect.
> > >
> > > Can you be more precise? What does "the situation" refer to? Are you
> > > saying that the latency decrease is larger than 162x as the number
> > > of sockets increases?
> > Yes the latency with higher percentiles would degrade since with more
> > sockets there will be longer queues with more work and more time spent
> > in doing application processing.
>
> Why not capture that in the cover letter and show that, for all test
> cases? It seems like it would be pretty useful to see and
> understand.
>
> > >
> > > It would be helpful to include that data in the cover letter showing
> > > the latency differences amongst various socket counts.
> > I just fetched some new data without the napi threaded busy poll and
> > pasted below, notice the change in P90. But I think the result is
> > pretty intuitive (wrt to extra queueing with more traffic and more
> > things to do in application processing) and also apparent with 1
> > socket data I posted earlier.
> >
> > 20 Socket data:
> > num_transactions=2345
> > latency_min=0.000018449
> > latency_max=0.049824327
> > latency_mean=0.003838511
> > latency_stddev=0.012978531
> > latency_p50=0.000024950
> > latency_p90=0.000122870
> > latency_p99=0.049807350
> > latency_p99.9=0.049807350
> >
> > 30 Socket data:
> > num_transactions=2098
> > latency_min=0.000018519
> > latency_max=0.049794902
> > latency_mean=0.004295050
> > latency_stddev=0.013671535
> > latency_p50=0.000025270
> > latency_p90=0.000174070
> > latency_p99=0.049807350
> > latency_p99.9=0.049807350
>
> Why not generate this data for the non-busy poll case and provide it
> as well so that it cane be compared side-by-side with the busy poll
> case?
>
> > > > The processing of packets on a core and
> > > > then going back to userspace to do application work (or protocol
> > > > processing in case of onload) is not ok for this use case.
> > >
> > > Why is it not OK? I assume because there is too much latency? If
> > > so... the data for that configuration should be provided so it can
> > > be examined and compared.
> > The time taken to do the application processing of the packets on the
> > same core would take time away from the napi processing, introducing
> > latency difference at tail with packets getting queued. Now for some
> > use cases this would be acceptable, they can certainly set affinity of
> > this napi thread equal to the userspace thread or maybe use
> > epoll/recvmsg to drive it. For my use case, I want it to have a solid
> > P90+ in sub 16us. A couple of microseconds spent doing application
> > processing pushes it to 17-18us and that is unacceptable for my use
> > case.
>
> Right, so the issue is that sharing a core induces latency which you
> want to avoid.
>
> It seems like this data should be provided to highlight the concern?
The 2 data points I provided are exactly that. Basically I am
comparing 2 mechanisms of enabling busy polling: one (socket/epoll
based) shares a core (or does the work in sequence because of the API
design), while the other drives napi in a separate thread (in my case
also on a separate core) independent of the application. Different
message sizes, numbers of sockets, hops between client and server, etc.
that would magnify the problem are all orthogonal issues that are
irrelevant to the comparison I am trying to do here. Some of the
points that you raised are certainly important, like the small value
used for deferring interrupts and whether that causes some interference
with the socket/epoll based busypolling approach. But beyond that, I
think the variety of experiments and results you are asking for might
be interesting but is irrelevant to the scope of what I am proposing
here:
- The first experiment runs busypolling using busy_read, which is by
design a mechanism that shares a cpu core (or time, as the userspace
would be waiting) between application processing and napi processing.
- The second experiment enables threaded napi busypoll and allows
driving it from a separate thread (on a separate core in my case).
>
> > >
> > > > If you look at P50, most of the time there is not much difference,
> > > > but the tail latencies add up in the P90 case. I want the
> > > > descriptors to be pulled from the NIC queues and handed over right
> > > > away for processing to a separate core.
> > >
> > > Sure; it's a trade off between CPU cycles and latency, right? I
> > > think that data should be provided so that it is more clear what the
> > > actual trade-off is; how many cycles am I trading for the latency
> > > reduction at p90+ ?
> > I think the tradeoff depends on the use case requirements and the
> > amount of work done in application processing, please see my reply
> > above.
>
> I respectfully disagree; you can show the numbers you are getting
> for CPU consumption in exchange for the latency numbers you are
> seeing for each test case with varying numbers of sockets.
>
> This is what Martin and I did in our series and it makes the
> comparison for reviewers much easier, IMHO.
>
>
> > >
> > > > >
> > > > > The server process invocation and the actual irq routing is not
> > > > > provided. Just stating its common for both experiments is not
> > > > > sufficient. Without further information, I still cannot rule out that:
> > > > >
> > > > > - In the base case, application and napi processing execute on the same
> > > > > core and trample on each other. I don't know how onload implements
> > > > > epoll_wait, but I worry that it cannot align application processing
> > > > > (busy-looping?) and napi processing (also busy-looping?).
> > > > >
> > > > > - In the threaded busy-loop case, napi processing ends up on one core,
> > > > > while the application executes on another one. This uses two cores
> > > > > instead of one.
> > > > >
> > > > > Based on the above, I think at least the following additional scenarios
> > > > > need to be investigated:
> > > > >
> > > > > a) Run the server application in proper fullbusy mode, i.e., cleanly
> > > > > alternating between application processing and napi processing. As a
> > > > > second step, spread the incoming traffic across two cores to compare
> > > > > apples to apples.
> > > > This is exactly what is being done in the experiment I posted and it
> > > > shows massive degradation of latency when the core is shared between
> > > > application processing and napi processing.
> > >
> > > It's not clear from the cover letter that this is what the base case
> > > is testing. I am not doubting that you are testing this; I'm
> > > commenting that the cover letter does not clearly explain this
> > > because I also didn't follow that.
> > >
> > > > The busy_read setup above that I mentioned, makes onload do napi
> > > > processing when xsk_recvmsg is called. Also onload spins in the
> > > > userspace to handle the AF_XDP queues/rings in memory.
> > >
> > > This detail would be useful to explain in the cover letter; it was
> > > not clear to me why busy_read would be used, as you'll see in my
> > > response.
> > >
> > > If onload spins in userspace, wouldn't a test case of running
> > > userspace on one core so it can spin and doing NAPI processing on
> > > another core be a good test case to include in the data?
> > That is what kthread based napi busy polling does. And this gives me
> > the best numbers. I run the napi processing in a separate core and
> > onload spinning in userspace on a separate thread. Now this ties into
> > my comment about the API aspect of this change. Doing napi processing
> > on a separate core with this change is super flexible regardless of
> > the UAPI you use.
>
> None of this was clear to me in the cover letter.
>
> > > > >
> > > > > b) Run application and napi processing on separate cores, but simply by
> > > > > way of thread pinning and interrupt routing. How close does that get to
> > > > > the current results? Then selectively add threaded napi and then busy
> > > > > polling.
> > > > This was the base case with which we started looking into this work,
> > > > and it gives much worse latency because packets are only picked up
> > > > from the RX queue on interrupt wakeups (and BH). In fact, moving them
> > > > to separate cores in this case lets the core receiving interrupts go
> > > > idle and sleep if the frequency of packets is low.
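For clarity, the base case I am referring to here is roughly the sketch
below; the interface name, IRQ number and core ids are placeholders for
whatever the platform uses:
```
# Interrupt-driven base case: route the NIC IRQ to core 2 and keep the
# application on other cores; packets are then only picked up from the
# RX queue on interrupt/softirq wakeups.
IRQ=$(awk '/eth0/ {sub(":", "", $1); print $1; exit}' /proc/interrupts)
echo 2 | sudo tee /proc/irq/$IRQ/smp_affinity_list
taskset -c 3-10 <application command>   # e.g. the onload + neper invocation
```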
> > >
> > > Two comments here:
> > >
> > > 1. If it was the base case which motivated the work, should an
> > > explanation of this and data about it be provided so it can be
> > > examined?
> > >
> > > 2. If PPS is low, wouldn't you _want_ the core to go idle?
> > Not really; if the core goes idle then the latency of a packet that
> > arrives after the core goes idle will suffer due to the wakeup.
>
> Again: can you quantify what that "suffering" is numerically?
>
> > This is what happens with interrupts also, but it could be
> > acceptable for a different use case. I am guessing that with
> > large message sizes one might not care, since after the wakeup for
> > the first descriptor the remaining packets can be processed in a batch.
>
> To me that suggests the test cases should be expanded to include:
> - More thorough and clear descriptions of each experimental case
> being tested
> - Different message sizes
> - Different socket counts
> - CPU / cycle consumption information
>
> Then it would be much easier for a reviewer to determine the impact
> of this change.
>
> I think what Martin and I are calling for is a more rigorous
> approach to gathering, comparing, and analyzing the data.
>
> See the responses below from my previous email.
>
> > >
> > > > >
> > > > > c) Run the whole thing without onload for comparison. The benefits
> > > > > should show without onload as well and it's easier to reproduce. Also, I
> > > > > suspect onload hurts in the base case and that explains the atrociously
> > > > > high latency and low throughput of it.
> > > > >
> > > > > Or, even better, simply provide a complete specification / script for
> > > > > the experiment that makes it easy to reproduce.
> > > > That would require setting up onload on the platform you use, provided
> > > > it has all the AF_XDP pieces needed to bring it up. I think I have
> > > > provided everything that you would need to set this up on your
> > > > platform. I have provided the onload repo, which is open source and
> > > > has a README with steps to set it up. I have provided the sysctl
> > > > configuration I am using. I have also provided the exact command with
> > > > all the arguments I am using to run onload with neper (configuration
> > > > and environment including CPU affinity setup).
> > >
> > > Respectfully, I disagree.
> > >
> > > I think the cover letter lacks a significant amount of detail, test
> > > data, and test setup information which makes reproducing the results
> > > a lot more work for a reviewer.
> > >
> > > > >
> > > > > Note that I don't dismiss the approach out of hand. I just think it's
> > > > > important to properly understand the purported performance improvements.
> > > > I think the performance improvements are apparent with the data I
> > > > provided; I purposefully used more sockets to show the real
> > > > differences in tail latency with this revision.
> > >
> > > Respectfully, I don't agree that the improvements are "apparent." I
> > > think my comments and Martin's comments both suggest that the cover
> > > letter does not make the improvements apparent.
> > >
> > > > Also, one thing that you are probably missing here is that this
> > > > change also has an API aspect: it allows a user to drive napi
> > > > independently of the user API or protocol being used.
> > >
> > > I'm not missing that part; I'll let Martin speak for himself but I
> > > suspect he also follows that part.
> > >
> > > If someone were missing that part, though, it'd probably suggest a
> > > more thorough approach to documenting the change in the cover
> > > letter, right?
> > >
> > > > I mean I can certainly drive the napi using recvmsg, but the napi
> > > > will only be driven if there is no data in the recv queue. recvmsg
> > > > checks the recv queue and returns if it is not empty. This forces
> > > > the application to drain the socket and then do napi processing,
> > > > basically introducing the same effect of alternating between napi
> > > > processing and application processing. The use case of driving the
> > > > napi on a separate core (or a couple of threads sharing a single
> > > > core) is handled cleanly with this change by enabling it through
> > > > netlink.
> > >
> > > Why not provide the numbers for this and many other test cases (as
> > > mentioned above) so that this can be evaluated more clearly?
> > >
> > > > > At the same time, I don't think it's good for the planet to burn cores
> > > > > with busy-looping without good reason.
> > >
> > > I think the point Martin makes here is subtle, but extremely
> > > important.
> > >
> > > Introducing a general purpose mechanism to burn cores at 100% CPU,
> > > potentially doing nothing if network traffic is very low, should not
> > > be taken lightly and, IMHO, very rigorous data should be provided so
> > > that anyone who decides to turn this on knows what the trade-off is.
> > >
> > > As it stands presently, it's not clear to me how many CPU cycles I
> > > am trading to get the claimed 162X latency improvement at the
> > > tail (a claim which, like Martin, I find quite odd and worth
> > > investigating in and of itself as to _why_ it is so large).
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH net-next v3 0/4] Add support to do threaded napi busy poll
2025-02-06 22:49 ` Samiullah Khawaja
@ 2025-02-06 22:58 ` Joe Damato
0 siblings, 0 replies; 37+ messages in thread
From: Joe Damato @ 2025-02-06 22:58 UTC (permalink / raw)
To: Samiullah Khawaja
Cc: Martin Karsten, Jakub Kicinski, David S . Miller, Eric Dumazet,
Paolo Abeni, almasrymina, netdev
On Thu, Feb 06, 2025 at 02:49:08PM -0800, Samiullah Khawaja wrote:
> On Thu, Feb 6, 2025 at 5:42 AM Joe Damato <jdamato@fastly.com> wrote:
> >
> > On Wed, Feb 05, 2025 at 04:45:59PM -0800, Samiullah Khawaja wrote:
> > > On Wed, Feb 5, 2025 at 2:06 PM Joe Damato <jdamato@fastly.com> wrote:
> > > >
> > > > On Wed, Feb 05, 2025 at 12:35:00PM -0800, Samiullah Khawaja wrote:
> > > > > On Tue, Feb 4, 2025 at 5:32 PM Martin Karsten <mkarsten@uwaterloo.ca> wrote:
> > > > > >
> > > > > > On 2025-02-04 19:10, Samiullah Khawaja wrote:
[...]
> > > > > The processing of packets on a core and
> > > > > then going back to userspace to do application work (or protocol
> > > > > processing in the case of onload) is not OK for this use case.
> > > >
> > > > Why is it not OK? I assume because there is too much latency? If
> > > > so... the data for that configuration should be provided so it can
> > > > be examined and compared.
> > > The time taken to do the application processing of the packets on the
> > > same core would take time away from the napi processing, introducing
> > > a latency difference at the tail as packets get queued. For some
> > > use cases this would be acceptable; they can certainly set the affinity
> > > of this napi thread to match the userspace thread, or maybe use
> > > epoll/recvmsg to drive it. For my use case, I want it to have a solid
> > > P90+ under 16us. A couple of microseconds spent doing application
> > > processing pushes it to 17-18us, and that is unacceptable for my use
> > > case.
> >
> > Right, so the issue is that sharing a core induces latency which you
> > want to avoid.
> >
> > It seems like this data should be provided to highlight the concern?
> The 2 data points I provided are exactly that. Basically I am
> comparing 2 mechanisms of enabling busy polling: one (socket/epoll
> based) sharing a core (or doing work in sequence because of the API
> design) and the other driving napi in a separate thread (in my
> case also a separate core) independent of the application. Different
> message sizes, numbers of sockets, hops between client/server etc. that
> would magnify the problem are all orthogonal issues that are
> irrelevant to the comparison I am trying to do here. Some of the
> points that you raised are certainly important, like the small value
> used for deferring interrupts, which maybe causes some interference
> with the socket/epoll based busypolling approach. But beyond that, I
> think the variety of experiments and results you are asking for might
> be interesting but is irrelevant to the scope of what I am proposing
> here.
With the utmost respect for the work, effort, and time you've put
into this in mind: I respectfully disagree in the strongest possible
terms.
Two data points (which lack significant documentation in the current
iteration of the cover letter) are not sufficient evidence of the
claim, especially when the claim is a >100x improvement.
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH net-next v3 0/4] Add support to do threaded napi busy poll
2025-02-06 22:48 ` Joe Damato
@ 2025-02-07 3:13 ` Samiullah Khawaja
2025-02-07 3:50 ` Joe Damato
0 siblings, 1 reply; 37+ messages in thread
From: Samiullah Khawaja @ 2025-02-07 3:13 UTC (permalink / raw)
To: Joe Damato, Samiullah Khawaja, Jakub Kicinski, David S . Miller,
Eric Dumazet, Paolo Abeni, almasrymina, netdev
On Thu, Feb 6, 2025 at 2:48 PM Joe Damato <jdamato@fastly.com> wrote:
>
> On Thu, Feb 06, 2025 at 02:06:14PM -0800, Samiullah Khawaja wrote:
> > On Thu, Feb 6, 2025 at 1:19 PM Joe Damato <jdamato@fastly.com> wrote:
> > >
> > > On Tue, Feb 04, 2025 at 07:18:36PM -0800, Joe Damato wrote:
> > > > On Wed, Feb 05, 2025 at 12:10:48AM +0000, Samiullah Khawaja wrote:
> > > > > Extend the already existing support of threaded napi poll to do continuous
> > > > > busy polling.
> > > >
> > > > [...]
> > > >
> > > > Overall, +1 to everything Martin said in his response. I think I'd
> > > > like to try to reproduce this myself to better understand the stated
> > > > numbers below.
> > > >
> > > > IMHO: the cover letter needs more details.
> > > >
> > > > >
> > > > > Setup:
> > > > >
> > > > > - Running on Google C3 VMs with idpf driver with following configurations.
> > > > > - IRQ affinity and coalascing is common for both experiments.
> > > >
> > > > As Martin suggested, a lot more detail here would be helpful.
> > >
> > > Just to give you a sense of the questions I ran into while trying to
> > > reproduce this just now:
> > >
> > > - What is the base SHA? You should use --base when using git
> > > format-patch. I assumed the latest net-next SHA and applied the
> > > patches to that.
> > Yes that is true. I will use --base when I do it next. Thanks for the
> > suggestion.
> > >
> > > - Which C3 instance type? I chose c3-highcpu-192-metal, but I could
> > > have chosen c3-standard-192-metal, apparently. No idea what
> > > difference this makes on the results, if any.
> > The tricky part is that the c3 instance variant that I am using for
> > dev is probably not available publicly.
>
> Can you use a publicly available c3 instance type instead? Maybe you
> can't, and if so you should probably mention that it's an internal
> c3 image and can't be reproduced on the public c3's because of XYZ
> in the cover letter.
>
> > It is a variant of c3-standard-192-metal, but we had to enable
> > AF_XDP on it to make it stable enough to run onload. You will
> > have to reproduce this on a platform available to you with AF_XDP,
> > as suggested in the onload GitHub repo I shared. This is the problem
> > with providing an installer/runner script and also system
> > configuration. My configuration would not be best for your
> > platform, but you should certainly be able to reproduce the
> > relative improvements in latency using the different busypolling
> > schemes (busy_read/busy_poll vs threaded napi busy poll) I
> > mentioned in the cover letter.
>
> Sorry, I still don't agree. Just because your configuration won't
> work for me out of the box is, IMHO, totally unrelated to what
> Martin and I are asking for.
>
> I won't speak for Martin, but I am saying that the configuration you
> are using should be thoroughly documented so that I can at least
> understand it and how I might reproduce it.
I provided all the relevant configuration I used that you can apply on
your platform. I later also provided the IRQ routing and thread affinity
as Martin asked, but as you can see it is pretty opaque and irrelevant
to the experiment I am doing, and it also depends on the platform you
use. But I can add those to the cover letter next time. Specifically I
will add the following, plus the things that I already provided (a sketch
of how these could be captured follows the list):
- interrupt routing
- any thread affinity
- number of queues (already added)
- threaded napi enablement (with affinity)
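Something along the following lines is what I have in mind for capturing
those details; the device name, IRQ number and kthread name are
placeholders for my setup:
```
# Capture the details listed above for the cover letter:
grep eth0 /proc/interrupts                      # which IRQs belong to the NIC
cat /proc/irq/<irq>/smp_affinity_list           # interrupt routing
ethtool -l eth0                                 # number of queues
cat /sys/class/net/eth0/threaded                # threaded napi enablement
taskset -pc "$(pgrep napi/eth0 | head -n1)"     # napi kthread affinity
```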
>
> > >
> > > - Was "tier 1 networking" enabled? I enabled it. No idea if it
> > > matters or not. I assume not, since it would be internal
> > > networking within the GCP VPC of my instances and not real egress?
> > >
> > > - What version of onload was used? Which SHA or release tag?
> > v9.0, I agree this should be added to the cover letter.
>
> To the list of things to add to the cover letter:
> - What OS and version are you using?
> - How many runs of neper? It seems like it was just 1 run. Is that
> sufficient? I'd argue you need to re-run the experiment many
> times, with different message sizes, queue counts, etc and
> compute some statistical analysis of the results.
>
> > >
> > > - I have no idea where to put CPU affinity for the 1 TX/RX queue, I
> > > assume CPU 2 based on your other message.
> > Yes, I replied to Martin with this information, but like I said it
> > certainly depends on your platform, and hence I didn't include it in the
> > cover letter, since I don't know what/where core 2 looks like on your
> > platform.
>
> You keep referring to "your platform" and I'm still confused.
>
> Don't you think it's important to properly document _your setup_,
> including mentioning that core 2 is used for the IRQ and the
> onload+neper runs on other cores? Maybe I missed it in the cover
> letter, but that detail seems pretty important for analysis,
> wouldn't you agree?
Respectfully, I think here you are again confusing things; napi
threaded polling is running on a separate core (2). And the cover
letter clearly states the following about the experiment:
```
Here with NAPI threaded busy poll in a separate core, we are able to
consistently poll the NAPI to keep latency to absolute minimum.
```
>
> Even if my computer is different, there should be enough detail for
> me to form a mental model of what you are doing so that I can think
> through it, understand the data, and, if I want to, try to reproduce
> it.
I agree with this 100% and I will fill in the interrupt routing and
other affinity info so it gives you a mental model: I am doing
a comparison between sharing a core between application processing and
napi processing vs doing napi processing on dedicated cores. I want to
focus on the premise of the problem/use case I am trying to solve. I
mentioned this in the cover letter, but it seems you are looking for
specifics, however irrelevant they might be to your platform. I will
put those in the next iteration.
>
> > >
> > > - The neper commands provided seem to be the server side since there
> > > is no -c mentioned. What is the neper client side command?
> > Same command with -c
> > >
> > > - What do the environment variables set for onload+neper mean?
> > >
> > > ...
> > >
> > > Do you follow what I'm getting at here? The cover letter lacks a tremendous
> > > amount of detail that makes reproducing the setup you are using
> > > unnecessarily difficult.
> > >
> > > Do you agree that I should be able to read the cover letter and, if
> > > so desired, go off and reproduce the setup and get similar results?
> > Yes, you should be able to do that, but there are micro details of your
> > platform and configuration that I have no way of knowing, so I cannot
> > suggest configurations for them. I have certainly pointed out the relevant
> > environment and special configurations (netdev queue sizes, sysctls, irq
> > defer, neper command and onload environment variables) that I used for
> > each test case in my experiment. Beyond that I have no way of providing
> > you with an internal C3 platform or a system configuration for your
> > platform.
>
> I'm going to let the thread rest at this point; I just think we are
> talking past each other here and it just doesn't feel productive.
>
> You are saying that your platform and configuration are not publicly
> available, there are too many "micro details", and that you can't
> suggest a configuration for my computer, which is sure to be wildly
> different.
>
> I keep repeating this, but I'll repeat it more explicitly: I'm not
> asking you to suggest a configuration to me for my computer.
>
> I'm asking you to thoroughly, rigorously, and verbosely document
> what _exactly_ *your setup* is, precisely how it is configured, and
> all the versions and SHAs of everything so that if I want to try to
> reproduce it I have enough information in order to do so.
>
> With your cover letter as it stands presently: this is not possible.
>
> Surely, if I can run neper and get a latency measurement, I can run
> a script that you wrote which measures CPU cycles using a publicly
> available tool, right? Then at least we are both measuring CPU
> consumption the same way and can compare latency vs CPU tradeoff
> results based on the same measurement.
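If it helps, I understand you to mean something along these lines, using
publicly available tools (the core selection and the napi kthread name
are placeholders for my setup):
```
# Sample per-core CPU utilization while the latency run is in flight
# (mpstat is from the sysstat package):
mpstat -P ALL 1 10 > cpu_during_run.txt &
# ... run the onload + neper latency test as before ...
wait

# Or count cycles consumed by the napi polling kthread itself:
sudo perf stat -e cycles,instructions \
    -p "$(pgrep napi/eth0 | head -n1)" -- sleep 10
```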
I am not considering the CPU/Latency tradeoff since my use case
requires consistent latency. This is very apparent when the core is
shared between application processing and napi processing and it is
pretty intuitive. I think affinity info and the mental model you are
looking for would probably make this more apparent.
>
> By providing better documentation, you make it more likely that
> other people will try to reproduce your results. The more people who
> reproduce your results, the stronger the argument to get this merged
> becomes.
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH net-next v3 0/4] Add support to do threaded napi busy poll
2025-02-07 3:13 ` Samiullah Khawaja
@ 2025-02-07 3:50 ` Joe Damato
0 siblings, 0 replies; 37+ messages in thread
From: Joe Damato @ 2025-02-07 3:50 UTC (permalink / raw)
To: Samiullah Khawaja
Cc: Jakub Kicinski, David S . Miller, Eric Dumazet, Paolo Abeni,
almasrymina, netdev
On Thu, Feb 06, 2025 at 07:13:00PM -0800, Samiullah Khawaja wrote:
> On Thu, Feb 6, 2025 at 2:48 PM Joe Damato <jdamato@fastly.com> wrote:
> >
> > On Thu, Feb 06, 2025 at 02:06:14PM -0800, Samiullah Khawaja wrote:
> > > On Thu, Feb 6, 2025 at 1:19 PM Joe Damato <jdamato@fastly.com> wrote:
> > > >
> > > > On Tue, Feb 04, 2025 at 07:18:36PM -0800, Joe Damato wrote:
> > > > > On Wed, Feb 05, 2025 at 12:10:48AM +0000, Samiullah Khawaja wrote:
> > > > > > Extend the already existing support of threaded napi poll to do continuous
> > > > > > busy polling.
> > > > >
> > > > > [...]
> > > > >
> > > > > Overall, +1 to everything Martin said in his response. I think I'd
> > > > > like to try to reproduce this myself to better understand the stated
> > > > > numbers below.
> > > > >
> > > > > IMHO: the cover letter needs more details.
> > > > >
> > > > > >
> > > > > > Setup:
> > > > > >
> > > > > > - Running on Google C3 VMs with idpf driver with following configurations.
> > > > > > - IRQ affinity and coalascing is common for both experiments.
> > > > >
> > > > > As Martin suggested, a lot more detail here would be helpful.
> > > >
> > > > Just to give you a sense of the questions I ran into while trying to
> > > > reproduce this just now:
> > > >
> > > > - What is the base SHA? You should use --base when using git
> > > > format-patch. I assumed the latest net-next SHA and applied the
> > > > patches to that.
> > > Yes that is true. I will use --base when I do it next. Thanks for the
> > > suggestion.
> > > >
> > > > - Which C3 instance type? I chose c3-highcpu-192-metal, but I could
> > > > have chosen c3-standard-192-metal, apparently. No idea what
> > > > difference this makes on the results, if any.
> > > The tricky part is that the c3 instance variant that I am using for
> > > dev is probably not available publicly.
> >
> > Can you use a publicly available c3 instance type instead? Maybe you
> > can't, and if so you should probably mention that it's an internal
> > c3 image and can't be reproduced on the public c3's because of XYZ
> > in the cover letter.
> >
> > > It is a variant of c3-standard-192-metal, but we had to enable
> > > AF_XDP on it to make it stable enough to run onload. You will
> > > have to reproduce this on a platform available to you with AF_XDP,
> > > as suggested in the onload GitHub repo I shared. This is the problem
> > > with providing an installer/runner script and also system
> > > configuration. My configuration would not be best for your
> > > platform, but you should certainly be able to reproduce the
> > > relative improvements in latency using the different busypolling
> > > schemes (busy_read/busy_poll vs threaded napi busy poll) I
> > > mentioned in the cover letter.
> >
> > Sorry, I still don't agree. Just because your configuration won't
> > work for me out of the box is, IMHO, totally unrelated to what
> > Martin and I are asking for.
> >
> > I won't speak for Martin, but I am saying that the configuration you
> > are using should be thoroughly documented so that I can at least
> > understand it and how I might reproduce it.
> I provided all the relevant configuration I used that you can apply on
> your platform.
Sorry, but that is not true -- both because of your statement below that
the IRQ routing and thread affinity were only provided later, and because
of your agreement below that you didn't include version numbers or git SHAs.
> I later also provided the IRQ routing and thread affinity
> as Martin asked, but as you can see it is pretty opaque and irrelevant
> to the experiment I am doing, and it also depends on the platform you
> use.
[...]
> > > >
> > > > - I have no idea where to put CPU affinity for the 1 TX/RX queue, I
> > > > assume CPU 2 based on your other message.
> > > Yes, I replied to Martin with this information, but like I said it
> > > certainly depends on your platform, and hence I didn't include it in the
> > > cover letter, since I don't know what/where core 2 looks like on your
> > > platform.
> >
> > You keep referring to "your platform" and I'm still confused.
> >
> > Don't you think it's important to properly document _your setup_,
> > including mentioning that core 2 is used for the IRQ and the
> > onload+neper runs on other cores? Maybe I missed it in the cover
> > letter, but that detail seems pretty important for analysis,
> > wouldn't you agree?
> Respectfully, I think here you are again confusing things; napi
> threaded polling is running on a separate core (2). And the cover
> letter clearly states the following about the experiment:
> ```
> Here with NAPI threaded busy poll in a separate core, we are able to
> consistently poll the NAPI to keep latency to absolute minimum.
> ```
Is core 2 mentioned anywhere in the cover letter? Is there a
description of the core layout? Is core 2 NUMA local to the NIC? Are
the cores where neper+onload run NUMA local?
I'm probably "again confusing things" and this is all clearly
explained in the cover letter, I bet.
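For what it's worth, that information is cheap to gather and include;
something like the following would answer it (the interface name is a
placeholder):
```
# NUMA node of the NIC (if it is a PCI device):
cat /sys/class/net/eth0/device/numa_node
# Which cores belong to which NUMA node:
lscpu -e=CPU,NODE
```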
> >
> > Even if my computer is different, there should be enough detail for
> > me to form a mental model of what you are doing so that I can think
> > through it, understand the data, and, if I want to, try to reproduce
> > it.
> I agree with this 100% and I will fill in the interrupt routing and
> other affinity info so it gives you a mental model: I am doing
> a comparison between sharing a core between application processing and
> napi processing vs doing napi processing on dedicated cores. I want to
> focus on the premise of the problem/use case I am trying to solve. I
> mentioned this in the cover letter, but it seems you are looking for
> specifics, however irrelevant they might be to your platform. I will
> put those in the next iteration.
The specifics are necessary for two reasons:
1. So I can understand what you claim to be measuring
2. So I can attempt to reproduce it
How odd to call this "irrelevant"; it's probably one of the most
relevant things required to understand and analyze the impact of the
proposed change.
> >
> > > >
> > > > - The neper commands provided seem to be the server side since there
> > > > is no -c mentioned. What is the neper client side command?
> > > Same command with -c
> > > >
> > > > - What do the environment variables set for onload+neper mean?
> > > >
> > > > ...
> > > >
> > > > Do you follow what I'm getting at here? The cover letter lacks a tremendous
> > > > amount of detail that makes reproducing the setup you are using
> > > > unnecessarily difficult.
> > > >
> > > > Do you agree that I should be able to read the cover letter and, if
> > > > so desired, go off and reproduce the setup and get similar results?
> > > Yes, you should be able to do that, but there are micro details of your
> > > platform and configuration that I have no way of knowing, so I cannot
> > > suggest configurations for them. I have certainly pointed out the relevant
> > > environment and special configurations (netdev queue sizes, sysctls, irq
> > > defer, neper command and onload environment variables) that I used for
> > > each test case in my experiment. Beyond that I have no way of providing
> > > you with an internal C3 platform or a system configuration for your
> > > platform.
> >
> > I'm going to let the thread rest at this point; I just think we are
> > talking past each other here and it just doesn't feel productive.
> >
> > You are saying that your platform and configuration are not publicly
> > available, there are too many "micro details", and that you can't
> > suggest a configuration for my computer, which is sure to be wildly
> > different.
> >
> > I keep repeating this, but I'll repeat it more explicitly: I'm not
> > asking you to suggest a configuration to me for my computer.
> >
> > I'm asking you to thoroughly, rigorously, and verbosely document
> > what _exactly_ *your setup* is, precisely how it is configured, and
> > all the versions and SHAs of everything so that if I want to try to
> > reproduce it I have enough information in order to do so.
> >
> > With your cover letter as it stands presently: this is not possible.
> >
> > Surely, if I can run neper and get a latency measurement, I can run
> > a script that you wrote which measures CPU cycles using a publicly
> > available tool, right? Then at least we are both measuring CPU
> > consumption the same way and can compare latency vs CPU tradeoff
> > results based on the same measurement.
> I am not considering the CPU/Latency tradeoff since my use case
> requires consistent latency. This is very apparent when the core is
> shared between application processing and napi processing and it is
> pretty intuitive. I think affinity info and the mental model you are
> looking for would probably make this more apparent.
FYI: I will strongly object to any future submission of this work
that does not include a rigorous, thorough, verbose, reproducible, and
clear analysis of the test system setup, test cases, and trade-offs
introduced by this change.
I'm not a maintainer though, so maybe my strong objections won't
impact this being merged.
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH net-next v3 0/4] Add support to do threaded napi busy poll
2025-02-06 22:06 ` Samiullah Khawaja
2025-02-06 22:48 ` Joe Damato
@ 2025-02-11 2:52 ` Martin Karsten
1 sibling, 0 replies; 37+ messages in thread
From: Martin Karsten @ 2025-02-11 2:52 UTC (permalink / raw)
To: Samiullah Khawaja, Joe Damato, Jakub Kicinski, David S . Miller,
Eric Dumazet, Paolo Abeni, almasrymina, netdev
On 2025-02-06 17:06, Samiullah Khawaja wrote:
> On Thu, Feb 6, 2025 at 1:19 PM Joe Damato <jdamato@fastly.com> wrote:
>>
>> On Tue, Feb 04, 2025 at 07:18:36PM -0800, Joe Damato wrote:
>>> On Wed, Feb 05, 2025 at 12:10:48AM +0000, Samiullah Khawaja wrote:
>>>> Extend the already existing support of threaded napi poll to do continuous
>>>> busy polling.
>>>
>>> [...]
>>>
>>> Overall, +1 to everything Martin said in his response. I think I'd
>>> like to try to reproduce this myself to better understand the stated
>>> numbers below.
>>>
>>> IMHO: the cover letter needs more details.
>>>
>>>>
>>>> Setup:
>>>>
>>>> - Running on Google C3 VMs with idpf driver with following configurations.
>>>> - IRQ affinity and coalascing is common for both experiments.
>>>
>>> As Martin suggested, a lot more detail here would be helpful.
>>
>> Just to give you a sense of the questions I ran into while trying to
>> reproduce this just now:
>>
>> - What is the base SHA? You should use --base when using git
>> format-patch. I assumed the latest net-next SHA and applied the
>> patches to that.
> Yes that is true. I will use --base when I do it next. Thanks for the
> suggestion.
Sorry for replying to an older message. In trying to reproduce your
setup, it seems that onload from the GitHub repo does not compile since
the update to the kernel build system in kernel commit 13b2548. I am not
that familiar with the details of either build system, so I have filed an
issue with onload. I was just wondering whether you had a workaround
you could share, please?
Thanks,
Martin
^ permalink raw reply [flat|nested] 37+ messages in thread
end of thread, other threads:[~2025-02-11 2:52 UTC | newest]
Thread overview: 37+ messages
-- links below jump to the message on this page --
2025-02-05 0:10 [PATCH net-next v3 0/4] Add support to do threaded napi busy poll Samiullah Khawaja
2025-02-05 0:10 ` [PATCH net-next v3 1/4] Add support to set napi threaded for individual napi Samiullah Khawaja
2025-02-05 2:50 ` Joe Damato
2025-02-05 23:10 ` David Laight
2025-02-06 18:40 ` Joe Damato
2025-02-05 0:10 ` [PATCH net-next v3 2/4] net: Create separate gro_flush helper function Samiullah Khawaja
2025-02-05 2:55 ` Joe Damato
2025-02-05 0:10 ` [PATCH net-next v3 3/4] Extend napi threaded polling to allow kthread based busy polling Samiullah Khawaja
2025-02-05 9:08 ` Paul Barker
2025-02-06 3:40 ` Samudrala, Sridhar
2025-02-05 0:10 ` [PATCH net-next v3 4/4] selftests: Add napi threaded busy poll test in `busy_poller` Samiullah Khawaja
2025-02-05 0:14 ` [PATCH net-next v3 0/4] Add support to do threaded napi busy poll Samiullah Khawaja
2025-02-05 1:32 ` Martin Karsten
2025-02-05 20:35 ` Samiullah Khawaja
2025-02-05 22:06 ` Joe Damato
2025-02-06 0:45 ` Samiullah Khawaja
2025-02-06 13:42 ` Joe Damato
2025-02-06 22:49 ` Samiullah Khawaja
2025-02-06 22:58 ` Joe Damato
2025-02-06 1:15 ` Martin Karsten
2025-02-06 4:43 ` Samiullah Khawaja
2025-02-06 4:50 ` Martin Karsten
2025-02-06 6:43 ` Samiullah Khawaja
2025-02-06 14:00 ` Joe Damato
2025-02-06 13:54 ` Joe Damato
2025-02-05 3:18 ` Joe Damato
2025-02-06 21:19 ` Joe Damato
2025-02-06 22:06 ` Samiullah Khawaja
2025-02-06 22:48 ` Joe Damato
2025-02-07 3:13 ` Samiullah Khawaja
2025-02-07 3:50 ` Joe Damato
2025-02-11 2:52 ` Martin Karsten
2025-02-06 5:36 ` Dave Taht
2025-02-06 5:49 ` Samiullah Khawaja
2025-02-06 5:57 ` Dave Taht
2025-02-06 14:01 ` Joe Damato
2025-02-06 19:50 ` David Laight