netdev.vger.kernel.org archive mirror
* [PATCH net-next 0/3] Add support to do threaded napi busy poll
@ 2025-01-02 19:12 Samiullah Khawaja
  2025-01-02 19:12 ` [PATCH net-next 1/3] Add support to set napi threaded for individual napi Samiullah Khawaja
                   ` (5 more replies)
  0 siblings, 6 replies; 18+ messages in thread
From: Samiullah Khawaja @ 2025-01-02 19:12 UTC (permalink / raw)
  To: Jakub Kicinski, David S. Miller, Eric Dumazet, Paolo Abeni
  Cc: netdev, skhawaja

Extend the existing threaded napi poll support to do continuous busy
polling.

This is used for continuous polling of a napi to fetch descriptors from the
backing RX/TX queues in low-latency applications. Allow enabling threaded
busy polling using netlink, so it can be turned on for a set of dedicated
napis.

Currently threaded napi can only be enabled at the device level using sysfs.
Add support to enable/disable threaded mode for each napi individually. This
can be done using the netlink interface. Add a `set_threaded` op in the
netlink spec that allows setting the `threaded` attribute of a napi.

Extend the threaded attribute in the napi struct with an option to enable
continuous busy polling. Extend the netlink and sysfs interfaces to allow
enabling/disabling threaded busy polling at the device or individual napi
level.

Once threaded busy polling is enabled on a napi, the polling thread can be
moved to dedicated cores depending on the application requirements. We used
this for AF_XDP use cases, fetching packets from RX queues with lower
latency.

Samiullah Khawaja (3):
  Add support to set napi threaded for individual napi
  net: Create separate gro_flush helper function
  Extend napi threaded polling to allow kthread based busy polling

 Documentation/ABI/testing/sysfs-class-net     |   3 +-
 Documentation/netlink/specs/netdev.yaml       |  20 +++
 .../net/ethernet/atheros/atl1c/atl1c_main.c   |   2 +-
 include/linux/netdevice.h                     |  29 ++++-
 include/uapi/linux/netdev.h                   |   2 +
 net/core/dev.c                                | 122 ++++++++++++++----
 net/core/net-sysfs.c                          |   2 +-
 net/core/netdev-genl-gen.c                    |  13 ++
 net/core/netdev-genl-gen.h                    |   2 +
 net/core/netdev-genl.c                        |  37 ++++++
 tools/include/uapi/linux/netdev.h             |   2 +
 11 files changed, 205 insertions(+), 29 deletions(-)

-- 
2.47.1.613.gc27f4b7a9f-goog



* [PATCH net-next 1/3] Add support to set napi threaded for individual napi
  2025-01-02 19:12 [PATCH net-next 0/3] Add support to do threaded napi busy poll Samiullah Khawaja
@ 2025-01-02 19:12 ` Samiullah Khawaja
  2025-01-02 21:15   ` Stanislav Fomichev
  2025-01-02 19:12 ` [PATCH net-next 2/3] net: Create separate gro_flush helper function Samiullah Khawaja
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 18+ messages in thread
From: Samiullah Khawaja @ 2025-01-02 19:12 UTC (permalink / raw)
  To: Jakub Kicinski, David S. Miller, Eric Dumazet, Paolo Abeni
  Cc: netdev, skhawaja

A net device has a `threaded` sysfs attribute that can be used to enable
threaded napi polling on all of the NAPI contexts under that device. Allow
enabling threaded napi polling at the individual napi level using netlink.

Add a new netlink operation `napi-set-threaded` that takes napi `id` and
`threaded` attributes and enables threaded polling on that napi context.

Tested using the following command in qemu/virtio-net:
./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
  --do napi-set-threaded --json '{"id": 513, "threaded": 1}'

Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
---
 Documentation/netlink/specs/netdev.yaml | 19 +++++++++++++
 include/linux/netdevice.h               |  9 ++++++
 include/uapi/linux/netdev.h             |  2 ++
 net/core/dev.c                          | 26 +++++++++++++++++
 net/core/netdev-genl-gen.c              | 13 +++++++++
 net/core/netdev-genl-gen.h              |  2 ++
 net/core/netdev-genl.c                  | 37 +++++++++++++++++++++++++
 tools/include/uapi/linux/netdev.h       |  2 ++
 8 files changed, 110 insertions(+)

diff --git a/Documentation/netlink/specs/netdev.yaml b/Documentation/netlink/specs/netdev.yaml
index cbb544bd6c84..aac343af7246 100644
--- a/Documentation/netlink/specs/netdev.yaml
+++ b/Documentation/netlink/specs/netdev.yaml
@@ -268,6 +268,14 @@ attribute-sets:
         doc: The timeout, in nanoseconds, of how long to suspend irq
              processing, if event polling finds events
         type: uint
+      -
+        name: threaded
+        doc: Whether the napi is configured to operate in threaded polling
+             mode. If this is set to `1` then the NAPI context operates
+             in threaded polling mode.
+        type: u32
+        checks:
+          max: 1
   -
     name: queue
     attributes:
@@ -659,6 +667,7 @@ operations:
             - defer-hard-irqs
             - gro-flush-timeout
             - irq-suspend-timeout
+            - threaded
       dump:
         request:
           attributes:
@@ -711,6 +720,16 @@ operations:
             - defer-hard-irqs
             - gro-flush-timeout
             - irq-suspend-timeout
+    -
+      name: napi-set-threaded
+      doc: Set threaded napi mode on this napi.
+      attribute-set: napi
+      flags: [ admin-perm ]
+      do:
+        request:
+          attributes:
+            - id
+            - threaded
 
 kernel-family:
   headers: [ "linux/list.h"]
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 2593019ad5b1..8f531d528869 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -570,6 +570,15 @@ static inline bool napi_complete(struct napi_struct *n)
 
 int dev_set_threaded(struct net_device *dev, bool threaded);
 
+/*
+ * napi_set_threaded - set napi threaded state
+ * @napi: NAPI context
+ * @threaded: whether this napi does threaded polling
+ *
+ * Return 0 on success and negative errno on failure.
+ */
+int napi_set_threaded(struct napi_struct *napi, bool threaded);
+
 /**
  *	napi_disable - prevent NAPI from scheduling
  *	@n: NAPI context
diff --git a/include/uapi/linux/netdev.h b/include/uapi/linux/netdev.h
index e4be227d3ad6..cefbb8f39ae7 100644
--- a/include/uapi/linux/netdev.h
+++ b/include/uapi/linux/netdev.h
@@ -125,6 +125,7 @@ enum {
 	NETDEV_A_NAPI_DEFER_HARD_IRQS,
 	NETDEV_A_NAPI_GRO_FLUSH_TIMEOUT,
 	NETDEV_A_NAPI_IRQ_SUSPEND_TIMEOUT,
+	NETDEV_A_NAPI_THREADED,
 
 	__NETDEV_A_NAPI_MAX,
 	NETDEV_A_NAPI_MAX = (__NETDEV_A_NAPI_MAX - 1)
@@ -203,6 +204,7 @@ enum {
 	NETDEV_CMD_QSTATS_GET,
 	NETDEV_CMD_BIND_RX,
 	NETDEV_CMD_NAPI_SET,
+	NETDEV_CMD_NAPI_SET_THREADED,
 
 	__NETDEV_CMD_MAX,
 	NETDEV_CMD_MAX = (__NETDEV_CMD_MAX - 1)
diff --git a/net/core/dev.c b/net/core/dev.c
index c7f3dea3e0eb..3c95994323ea 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -6628,6 +6628,27 @@ static void init_gro_hash(struct napi_struct *napi)
 	napi->gro_bitmask = 0;
 }
 
+int napi_set_threaded(struct napi_struct *napi, bool threaded)
+{
+	if (napi->dev->threaded)
+		return -EINVAL;
+
+	if (threaded) {
+		if (!napi->thread) {
+			int err = napi_kthread_create(napi);
+
+			if (err)
+				return err;
+		}
+	}
+
+	/* Make sure kthread is created before THREADED bit is set. */
+	smp_mb__before_atomic();
+	assign_bit(NAPI_STATE_THREADED, &napi->state, threaded);
+
+	return 0;
+}
+
 int dev_set_threaded(struct net_device *dev, bool threaded)
 {
 	struct napi_struct *napi;
@@ -6637,6 +6658,11 @@ int dev_set_threaded(struct net_device *dev, bool threaded)
 		return 0;
 
 	if (threaded) {
+		/* Check if threaded is set at napi level already */
+		list_for_each_entry(napi, &dev->napi_list, dev_list)
+			if (test_bit(NAPI_STATE_THREADED, &napi->state))
+				return -EINVAL;
+
 		list_for_each_entry(napi, &dev->napi_list, dev_list) {
 			if (!napi->thread) {
 				err = napi_kthread_create(napi);
diff --git a/net/core/netdev-genl-gen.c b/net/core/netdev-genl-gen.c
index a89cbd8d87c3..93dc74dad6de 100644
--- a/net/core/netdev-genl-gen.c
+++ b/net/core/netdev-genl-gen.c
@@ -99,6 +99,12 @@ static const struct nla_policy netdev_napi_set_nl_policy[NETDEV_A_NAPI_IRQ_SUSPE
 	[NETDEV_A_NAPI_IRQ_SUSPEND_TIMEOUT] = { .type = NLA_UINT, },
 };
 
+/* NETDEV_CMD_NAPI_SET_THREADED - do */
+static const struct nla_policy netdev_napi_set_threaded_nl_policy[NETDEV_A_NAPI_THREADED + 1] = {
+	[NETDEV_A_NAPI_ID] = { .type = NLA_U32, },
+	[NETDEV_A_NAPI_THREADED] = NLA_POLICY_MAX(NLA_U32, 1),
+};
+
 /* Ops table for netdev */
 static const struct genl_split_ops netdev_nl_ops[] = {
 	{
@@ -190,6 +196,13 @@ static const struct genl_split_ops netdev_nl_ops[] = {
 		.maxattr	= NETDEV_A_NAPI_IRQ_SUSPEND_TIMEOUT,
 		.flags		= GENL_ADMIN_PERM | GENL_CMD_CAP_DO,
 	},
+	{
+		.cmd		= NETDEV_CMD_NAPI_SET_THREADED,
+		.doit		= netdev_nl_napi_set_threaded_doit,
+		.policy		= netdev_napi_set_threaded_nl_policy,
+		.maxattr	= NETDEV_A_NAPI_THREADED,
+		.flags		= GENL_ADMIN_PERM | GENL_CMD_CAP_DO,
+	},
 };
 
 static const struct genl_multicast_group netdev_nl_mcgrps[] = {
diff --git a/net/core/netdev-genl-gen.h b/net/core/netdev-genl-gen.h
index e09dd7539ff2..00c229569b7a 100644
--- a/net/core/netdev-genl-gen.h
+++ b/net/core/netdev-genl-gen.h
@@ -34,6 +34,8 @@ int netdev_nl_qstats_get_dumpit(struct sk_buff *skb,
 				struct netlink_callback *cb);
 int netdev_nl_bind_rx_doit(struct sk_buff *skb, struct genl_info *info);
 int netdev_nl_napi_set_doit(struct sk_buff *skb, struct genl_info *info);
+int netdev_nl_napi_set_threaded_doit(struct sk_buff *skb,
+				     struct genl_info *info);
 
 enum {
 	NETDEV_NLGRP_MGMT,
diff --git a/net/core/netdev-genl.c b/net/core/netdev-genl.c
index 2d3ae0cd3ad2..ace22b24be7e 100644
--- a/net/core/netdev-genl.c
+++ b/net/core/netdev-genl.c
@@ -186,6 +186,9 @@ netdev_nl_napi_fill_one(struct sk_buff *rsp, struct napi_struct *napi,
 	if (napi->irq >= 0 && nla_put_u32(rsp, NETDEV_A_NAPI_IRQ, napi->irq))
 		goto nla_put_failure;
 
+	if (nla_put_u32(rsp, NETDEV_A_NAPI_THREADED, !!napi->thread))
+		goto nla_put_failure;
+
 	if (napi->thread) {
 		pid = task_pid_nr(napi->thread);
 		if (nla_put_u32(rsp, NETDEV_A_NAPI_PID, pid))
@@ -311,6 +314,40 @@ int netdev_nl_napi_get_dumpit(struct sk_buff *skb, struct netlink_callback *cb)
 	return err;
 }
 
+int netdev_nl_napi_set_threaded_doit(struct sk_buff *skb,
+				     struct genl_info *info)
+{
+	struct napi_struct *napi;
+	u32 napi_threaded;
+	u32 napi_id;
+	int err = 0;
+
+	if (GENL_REQ_ATTR_CHECK(info, NETDEV_A_NAPI_ID) ||
+	    GENL_REQ_ATTR_CHECK(info, NETDEV_A_NAPI_THREADED))
+		return -EINVAL;
+
+	napi_id = nla_get_u32(info->attrs[NETDEV_A_NAPI_ID]);
+	napi_threaded = nla_get_u32(info->attrs[NETDEV_A_NAPI_THREADED]);
+
+	rtnl_lock();
+
+	napi = napi_by_id(napi_id);
+	if (!napi) {
+		NL_SET_BAD_ATTR(info->extack, info->attrs[NETDEV_A_NAPI_ID]);
+		err = -ENOENT;
+		goto napi_set_threaded_failure;
+	}
+
+	err = napi_set_threaded(napi, napi_threaded);
+	if (err)
+		NL_SET_ERR_MSG(info->extack,
+			       "unable to set threaded state of napi");
+
+napi_set_threaded_failure:
+	rtnl_unlock();
+	return err;
+}
+
 static int
 netdev_nl_napi_set_config(struct napi_struct *napi, struct genl_info *info)
 {
diff --git a/tools/include/uapi/linux/netdev.h b/tools/include/uapi/linux/netdev.h
index e4be227d3ad6..cefbb8f39ae7 100644
--- a/tools/include/uapi/linux/netdev.h
+++ b/tools/include/uapi/linux/netdev.h
@@ -125,6 +125,7 @@ enum {
 	NETDEV_A_NAPI_DEFER_HARD_IRQS,
 	NETDEV_A_NAPI_GRO_FLUSH_TIMEOUT,
 	NETDEV_A_NAPI_IRQ_SUSPEND_TIMEOUT,
+	NETDEV_A_NAPI_THREADED,
 
 	__NETDEV_A_NAPI_MAX,
 	NETDEV_A_NAPI_MAX = (__NETDEV_A_NAPI_MAX - 1)
@@ -203,6 +204,7 @@ enum {
 	NETDEV_CMD_QSTATS_GET,
 	NETDEV_CMD_BIND_RX,
 	NETDEV_CMD_NAPI_SET,
+	NETDEV_CMD_NAPI_SET_THREADED,
 
 	__NETDEV_CMD_MAX,
 	NETDEV_CMD_MAX = (__NETDEV_CMD_MAX - 1)
-- 
2.47.1.613.gc27f4b7a9f-goog



* [PATCH net-next 2/3] net: Create separate gro_flush helper function
  2025-01-02 19:12 [PATCH net-next 0/3] Add support to do threaded napi busy poll Samiullah Khawaja
  2025-01-02 19:12 ` [PATCH net-next 1/3] Add support to set napi threaded for individual napi Samiullah Khawaja
@ 2025-01-02 19:12 ` Samiullah Khawaja
  2025-01-02 19:12 ` [PATCH net-next 3/3] Extend napi threaded polling to allow kthread based busy polling Samiullah Khawaja
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 18+ messages in thread
From: Samiullah Khawaja @ 2025-01-02 19:12 UTC (permalink / raw)
  To: Jakub Kicinski, David S. Miller, Eric Dumazet, Paolo Abeni
  Cc: netdev, skhawaja

Move the multiple copies of the same code snippet doing `gro_flush` and
`gro_normal_list` into a separate helper function.

Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
---
 net/core/dev.c | 28 +++++++++++++---------------
 1 file changed, 13 insertions(+), 15 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 3c95994323ea..762977a62da2 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -6325,6 +6325,17 @@ static void skb_defer_free_flush(struct softnet_data *sd)
 	}
 }
 
+static void __napi_gro_flush_helper(struct napi_struct *napi)
+{
+	if (napi->gro_bitmask) {
+		/* flush too old packets
+		 * If HZ < 1000, flush all packets.
+		 */
+		napi_gro_flush(napi, HZ >= 1000);
+	}
+	gro_normal_list(napi);
+}
+
 #if defined(CONFIG_NET_RX_BUSY_POLL)
 
 static void __busy_poll_stop(struct napi_struct *napi, bool skip_schedule)
@@ -6335,14 +6346,8 @@ static void __busy_poll_stop(struct napi_struct *napi, bool skip_schedule)
 		return;
 	}
 
-	if (napi->gro_bitmask) {
-		/* flush too old packets
-		 * If HZ < 1000, flush all packets.
-		 */
-		napi_gro_flush(napi, HZ >= 1000);
-	}
+	__napi_gro_flush_helper(napi);
 
-	gro_normal_list(napi);
 	clear_bit(NAPI_STATE_SCHED, &napi->state);
 }
 
@@ -6942,14 +6947,7 @@ static int __napi_poll(struct napi_struct *n, bool *repoll)
 		return work;
 	}
 
-	if (n->gro_bitmask) {
-		/* flush too old packets
-		 * If HZ < 1000, flush all packets.
-		 */
-		napi_gro_flush(n, HZ >= 1000);
-	}
-
-	gro_normal_list(n);
+	__napi_gro_flush_helper(n);
 
 	/* Some drivers may have called napi_schedule
 	 * prior to exhausting their budget.
-- 
2.47.1.613.gc27f4b7a9f-goog



* [PATCH net-next 3/3] Extend napi threaded polling to allow kthread based busy polling
  2025-01-02 19:12 [PATCH net-next 0/3] Add support to do threaded napi busy poll Samiullah Khawaja
  2025-01-02 19:12 ` [PATCH net-next 1/3] Add support to set napi threaded for individual napi Samiullah Khawaja
  2025-01-02 19:12 ` [PATCH net-next 2/3] net: Create separate gro_flush helper function Samiullah Khawaja
@ 2025-01-02 19:12 ` Samiullah Khawaja
  2025-01-02 21:16   ` Samudrala, Sridhar
                     ` (4 more replies)
  2025-01-02 21:30 ` [PATCH net-next 0/3] Add support to do threaded napi busy poll Stanislav Fomichev
                   ` (2 subsequent siblings)
  5 siblings, 5 replies; 18+ messages in thread
From: Samiullah Khawaja @ 2025-01-02 19:12 UTC (permalink / raw)
  To: Jakub Kicinski, David S. Miller, Eric Dumazet, Paolo Abeni
  Cc: netdev, skhawaja

Add a new state to napi state enum:

- STATE_THREADED_BUSY_POLL
  Threaded busy poll is enabled/running for this napi.

The following changes are introduced in the napi scheduling and state logic:

- When threaded busy poll is enabled through sysfs (value `2`), it also
  enables NAPI_STATE_THREADED so a kthread is created per napi. It also
  sets the NAPI_STATE_THREADED_BUSY_POLL bit on each napi to indicate
  that the napi should be busy polled.

- When a napi is scheduled with STATE_SCHED_THREADED and the associated
  kthread is woken up, the kthread owns the context. If both
  NAPI_STATE_THREADED_BUSY_POLL and NAPI_STATE_SCHED_THREADED are set,
  then the kthread can busy poll.

- To keep busy polling and avoid rescheduling of interrupts,
  napi_complete_done() returns false when both the SCHED_THREADED and
  THREADED_BUSY_POLL flags are set. napi_complete_done() also returns
  early, so that the STATE_SCHED_THREADED bit is not cleared.

- If at any point STATE_THREADED_BUSY_POLL is cleared, napi_complete_done()
  runs and also clears the SCHED_THREADED bit. This makes the associated
  kthread go to sleep as per the existing logic.

Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
---
 Documentation/ABI/testing/sysfs-class-net     |  3 +-
 Documentation/netlink/specs/netdev.yaml       |  5 +-
 .../net/ethernet/atheros/atl1c/atl1c_main.c   |  2 +-
 include/linux/netdevice.h                     | 24 +++++--
 net/core/dev.c                                | 72 ++++++++++++++++---
 net/core/net-sysfs.c                          |  2 +-
 net/core/netdev-genl-gen.c                    |  2 +-
 7 files changed, 89 insertions(+), 21 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-class-net b/Documentation/ABI/testing/sysfs-class-net
index ebf21beba846..15d7d36a8294 100644
--- a/Documentation/ABI/testing/sysfs-class-net
+++ b/Documentation/ABI/testing/sysfs-class-net
@@ -343,7 +343,7 @@ Date:		Jan 2021
 KernelVersion:	5.12
 Contact:	netdev@vger.kernel.org
 Description:
-		Boolean value to control the threaded mode per device. User could
+		Integer value to control the threaded mode per device. User could
 		set this value to enable/disable threaded mode for all napi
 		belonging to this device, without the need to do device up/down.
 
@@ -351,4 +351,5 @@ Description:
 		== ==================================
 		0  threaded mode disabled for this dev
 		1  threaded mode enabled for this dev
+		2  threaded mode enabled, and busy polling enabled.
 		== ==================================
diff --git a/Documentation/netlink/specs/netdev.yaml b/Documentation/netlink/specs/netdev.yaml
index aac343af7246..9c905243a1cc 100644
--- a/Documentation/netlink/specs/netdev.yaml
+++ b/Documentation/netlink/specs/netdev.yaml
@@ -272,10 +272,11 @@ attribute-sets:
         name: threaded
         doc: Whether the napi is configured to operate in threaded polling
              mode. If this is set to `1` then the NAPI context operates
-             in threaded polling mode.
+             in threaded polling mode. If this is set to `2` then the NAPI
+             kthread also does busypolling.
         type: u32
         checks:
-          max: 1
+          max: 2
   -
     name: queue
     attributes:
diff --git a/drivers/net/ethernet/atheros/atl1c/atl1c_main.c b/drivers/net/ethernet/atheros/atl1c/atl1c_main.c
index c571614b1d50..a709cddcd292 100644
--- a/drivers/net/ethernet/atheros/atl1c/atl1c_main.c
+++ b/drivers/net/ethernet/atheros/atl1c/atl1c_main.c
@@ -2688,7 +2688,7 @@ static int atl1c_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
 	adapter->mii.mdio_write = atl1c_mdio_write;
 	adapter->mii.phy_id_mask = 0x1f;
 	adapter->mii.reg_num_mask = MDIO_CTRL_REG_MASK;
-	dev_set_threaded(netdev, true);
+	dev_set_threaded(netdev, DEV_NAPI_THREADED);
 	for (i = 0; i < adapter->rx_queue_count; ++i)
 		netif_napi_add(netdev, &adapter->rrd_ring[i].napi,
 			       atl1c_clean_rx);
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 8f531d528869..c384ffe0976e 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -407,6 +407,8 @@ enum {
 	NAPI_STATE_PREFER_BUSY_POLL,	/* prefer busy-polling over softirq processing*/
 	NAPI_STATE_THREADED,		/* The poll is performed inside its own thread*/
 	NAPI_STATE_SCHED_THREADED,	/* Napi is currently scheduled in threaded mode */
+	NAPI_STATE_THREADED_BUSY_POLL,	/* The threaded napi poller will busy poll */
+	NAPI_STATE_SCHED_THREADED_BUSY_POLL,  /* The threaded napi poller is busy polling */
 };
 
 enum {
@@ -420,8 +422,14 @@ enum {
 	NAPIF_STATE_PREFER_BUSY_POLL	= BIT(NAPI_STATE_PREFER_BUSY_POLL),
 	NAPIF_STATE_THREADED		= BIT(NAPI_STATE_THREADED),
 	NAPIF_STATE_SCHED_THREADED	= BIT(NAPI_STATE_SCHED_THREADED),
+	NAPIF_STATE_THREADED_BUSY_POLL	= BIT(NAPI_STATE_THREADED_BUSY_POLL),
+	NAPIF_STATE_SCHED_THREADED_BUSY_POLL
+				= BIT(NAPI_STATE_SCHED_THREADED_BUSY_POLL),
 };
 
+#define NAPIF_STATE_THREADED_BUSY_POLL_MASK \
+	(NAPIF_STATE_THREADED | NAPIF_STATE_THREADED_BUSY_POLL)
+
 enum gro_result {
 	GRO_MERGED,
 	GRO_MERGED_FREE,
@@ -568,16 +576,24 @@ static inline bool napi_complete(struct napi_struct *n)
 	return napi_complete_done(n, 0);
 }
 
-int dev_set_threaded(struct net_device *dev, bool threaded);
+enum napi_threaded_state {
+	NAPI_THREADED_OFF = 0,
+	NAPI_THREADED = 1,
+	NAPI_THREADED_BUSY_POLL = 2,
+	NAPI_THREADED_MAX = NAPI_THREADED_BUSY_POLL,
+};
+
+int dev_set_threaded(struct net_device *dev, enum napi_threaded_state threaded);
 
 /*
  * napi_set_threaded - set napi threaded state
  * @napi: NAPI context
- * @threaded: whether this napi does threaded polling
+ * @threaded: threading mode
  *
  * Return 0 on success and negative errno on failure.
  */
-int napi_set_threaded(struct napi_struct *napi, bool threaded);
+int napi_set_threaded(struct napi_struct *napi,
+		      enum napi_threaded_state threaded);
 
 /**
  *	napi_disable - prevent NAPI from scheduling
@@ -2406,7 +2422,7 @@ struct net_device {
 	struct sfp_bus		*sfp_bus;
 	struct lock_class_key	*qdisc_tx_busylock;
 	bool			proto_down;
-	bool			threaded;
+	u8			threaded;
 
 	/* priv_flags_slow, ungrouped to save space */
 	unsigned long		see_all_hwtstamp_requests:1;
diff --git a/net/core/dev.c b/net/core/dev.c
index 762977a62da2..b6cd9474bdd3 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -78,6 +78,7 @@
 #include <linux/slab.h>
 #include <linux/sched.h>
 #include <linux/sched/isolation.h>
+#include <linux/sched/types.h>
 #include <linux/sched/mm.h>
 #include <linux/smpboot.h>
 #include <linux/mutex.h>
@@ -6231,7 +6232,8 @@ bool napi_complete_done(struct napi_struct *n, int work_done)
 	 *    the guarantee we will be called later.
 	 */
 	if (unlikely(n->state & (NAPIF_STATE_NPSVC |
-				 NAPIF_STATE_IN_BUSY_POLL)))
+				 NAPIF_STATE_IN_BUSY_POLL |
+				 NAPIF_STATE_SCHED_THREADED_BUSY_POLL)))
 		return false;
 
 	if (work_done) {
@@ -6633,8 +6635,10 @@ static void init_gro_hash(struct napi_struct *napi)
 	napi->gro_bitmask = 0;
 }
 
-int napi_set_threaded(struct napi_struct *napi, bool threaded)
+int napi_set_threaded(struct napi_struct *napi,
+		      enum napi_threaded_state threaded)
 {
+	unsigned long val;
 	if (napi->dev->threaded)
 		return -EINVAL;
 
@@ -6649,30 +6653,41 @@ int napi_set_threaded(struct napi_struct *napi, bool threaded)
 
 	/* Make sure kthread is created before THREADED bit is set. */
 	smp_mb__before_atomic();
-	assign_bit(NAPI_STATE_THREADED, &napi->state, threaded);
+	val = 0;
+	if (threaded == NAPI_THREADED_BUSY_POLL)
+		val |= NAPIF_STATE_THREADED_BUSY_POLL;
+	if (threaded)
+		val |= NAPIF_STATE_THREADED;
+	set_mask_bits(&napi->state, NAPIF_STATE_THREADED_BUSY_POLL_MASK, val);
 
 	return 0;
 }
 
-int dev_set_threaded(struct net_device *dev, bool threaded)
+int dev_set_threaded(struct net_device *dev, enum napi_threaded_state threaded)
 {
 	struct napi_struct *napi;
+	unsigned long val;
 	int err = 0;
 
 	if (dev->threaded == threaded)
 		return 0;
 
+	val = 0;
 	if (threaded) {
 		/* Check if threaded is set at napi level already */
 		list_for_each_entry(napi, &dev->napi_list, dev_list)
 			if (test_bit(NAPI_STATE_THREADED, &napi->state))
 				return -EINVAL;
 
+		val |= NAPIF_STATE_THREADED;
+		if (threaded == NAPI_THREADED_BUSY_POLL)
+			val |= NAPIF_STATE_THREADED_BUSY_POLL;
+
 		list_for_each_entry(napi, &dev->napi_list, dev_list) {
 			if (!napi->thread) {
 				err = napi_kthread_create(napi);
 				if (err) {
-					threaded = false;
+					threaded = NAPI_THREADED_OFF;
 					break;
 				}
 			}
@@ -6691,9 +6706,13 @@ int dev_set_threaded(struct net_device *dev, bool threaded)
 	 * polled. In this case, the switch between threaded mode and
 	 * softirq mode will happen in the next round of napi_schedule().
 	 * This should not cause hiccups/stalls to the live traffic.
+	 *
+	 * Switch to busy_poll threaded napi will occur after the threaded
+	 * napi is scheduled.
 	 */
 	list_for_each_entry(napi, &dev->napi_list, dev_list)
-		assign_bit(NAPI_STATE_THREADED, &napi->state, threaded);
+		set_mask_bits(&napi->state,
+			      NAPIF_STATE_THREADED_BUSY_POLL_MASK, val);
 
 	return err;
 }
@@ -7007,7 +7026,7 @@ static int napi_thread_wait(struct napi_struct *napi)
 	return -1;
 }
 
-static void napi_threaded_poll_loop(struct napi_struct *napi)
+static void napi_threaded_poll_loop(struct napi_struct *napi, bool busy_poll)
 {
 	struct bpf_net_context __bpf_net_ctx, *bpf_net_ctx;
 	struct softnet_data *sd;
@@ -7036,22 +7055,53 @@ static void napi_threaded_poll_loop(struct napi_struct *napi)
 		}
 		skb_defer_free_flush(sd);
 		bpf_net_ctx_clear(bpf_net_ctx);
+
+		/* Push the skbs up the stack if busy polling. */
+		if (busy_poll)
+			__napi_gro_flush_helper(napi);
 		local_bh_enable();
 
-		if (!repoll)
+		/* If busy polling then do not break here because we need to
+		 * call cond_resched and rcu_softirq_qs_periodic to prevent
+		 * watchdog warnings.
+		 */
+		if (!repoll && !busy_poll)
 			break;
 
 		rcu_softirq_qs_periodic(last_qs);
 		cond_resched();
+
+		if (!repoll)
+			break;
 	}
 }
 
 static int napi_threaded_poll(void *data)
 {
 	struct napi_struct *napi = data;
+	bool busy_poll_sched;
+	unsigned long val;
+	bool busy_poll;
+
+	while (!napi_thread_wait(napi)) {
+		/* Once woken up, this means that we are scheduled as threaded
+		 * napi and this thread owns the napi context, if busy poll
+		 * state is set then we busy poll this napi.
+		 */
+		val = READ_ONCE(napi->state);
+		busy_poll = val & NAPIF_STATE_THREADED_BUSY_POLL;
+		busy_poll_sched = val & NAPIF_STATE_SCHED_THREADED_BUSY_POLL;
+
+		/* Do not busy poll if napi is disabled. */
+		if (unlikely(val & NAPIF_STATE_DISABLE))
+			busy_poll = false;
+
+		if (busy_poll != busy_poll_sched)
+			assign_bit(NAPI_STATE_SCHED_THREADED_BUSY_POLL,
+				   &napi->state, busy_poll);
 
-	while (!napi_thread_wait(napi))
-		napi_threaded_poll_loop(napi);
+		napi_threaded_poll_loop(napi, busy_poll);
+	}
 
 	return 0;
 }
@@ -12205,7 +12255,7 @@ static void run_backlog_napi(unsigned int cpu)
 {
 	struct softnet_data *sd = per_cpu_ptr(&softnet_data, cpu);
 
-	napi_threaded_poll_loop(&sd->backlog);
+	napi_threaded_poll_loop(&sd->backlog, false);
 }
 
 static void backlog_napi_setup(unsigned int cpu)
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 2d9afc6e2161..36d0a22e341c 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -626,7 +626,7 @@ static int modify_napi_threaded(struct net_device *dev, unsigned long val)
 	if (list_empty(&dev->napi_list))
 		return -EOPNOTSUPP;
 
-	if (val != 0 && val != 1)
+	if (val > NAPI_THREADED_MAX)
 		return -EOPNOTSUPP;
 
 	ret = dev_set_threaded(dev, val);
diff --git a/net/core/netdev-genl-gen.c b/net/core/netdev-genl-gen.c
index 93dc74dad6de..4086d2577dcc 100644
--- a/net/core/netdev-genl-gen.c
+++ b/net/core/netdev-genl-gen.c
@@ -102,7 +102,7 @@ static const struct nla_policy netdev_napi_set_nl_policy[NETDEV_A_NAPI_IRQ_SUSPE
 /* NETDEV_CMD_NAPI_SET_THREADED - do */
 static const struct nla_policy netdev_napi_set_threaded_nl_policy[NETDEV_A_NAPI_THREADED + 1] = {
 	[NETDEV_A_NAPI_ID] = { .type = NLA_U32, },
-	[NETDEV_A_NAPI_THREADED] = NLA_POLICY_MAX(NLA_U32, 1),
+	[NETDEV_A_NAPI_THREADED] = NLA_POLICY_MAX(NLA_U32, 2),
 };
 
 /* Ops table for netdev */
-- 
2.47.1.613.gc27f4b7a9f-goog



* Re: [PATCH net-next 1/3] Add support to set napi threaded for individual napi
  2025-01-02 19:12 ` [PATCH net-next 1/3] Add support to set napi threaded for individual napi Samiullah Khawaja
@ 2025-01-02 21:15   ` Stanislav Fomichev
  0 siblings, 0 replies; 18+ messages in thread
From: Stanislav Fomichev @ 2025-01-02 21:15 UTC (permalink / raw)
  To: Samiullah Khawaja
  Cc: Jakub Kicinski, David S. Miller, Eric Dumazet, Paolo Abeni,
	netdev

On 01/02, Samiullah Khawaja wrote:
> A net device has a threaded sysfs attribute that can be used to enable
> napi polling on all of the NAPI contexts under that device. Allow
> enabling threaded napi polling at individual napi level using netlink.
> 
> Add a new netlink operation `napi-set-threaded` that takes napi `id` and
> `threaded` attributes. This will enable the threaded polling on napi
> context.
> 
> Tested using following command in qemu/virtio-net:
> ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
>   --do napi-set-threaded       --json '{"id": 513, "threaded": 1}'
> 
> Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
> Reviewed-by: Willem de Bruijn <willemb@google.com>
> ---
>  Documentation/netlink/specs/netdev.yaml | 19 +++++++++++++
>  include/linux/netdevice.h               |  9 ++++++
>  include/uapi/linux/netdev.h             |  2 ++
>  net/core/dev.c                          | 26 +++++++++++++++++
>  net/core/netdev-genl-gen.c              | 13 +++++++++
>  net/core/netdev-genl-gen.h              |  2 ++
>  net/core/netdev-genl.c                  | 37 +++++++++++++++++++++++++
>  tools/include/uapi/linux/netdev.h       |  2 ++
>  8 files changed, 110 insertions(+)
> 
> diff --git a/Documentation/netlink/specs/netdev.yaml b/Documentation/netlink/specs/netdev.yaml
> index cbb544bd6c84..aac343af7246 100644
> --- a/Documentation/netlink/specs/netdev.yaml
> +++ b/Documentation/netlink/specs/netdev.yaml
> @@ -268,6 +268,14 @@ attribute-sets:
>          doc: The timeout, in nanoseconds, of how long to suspend irq
>               processing, if event polling finds events
>          type: uint
> +      -
> +        name: threaded
> +        doc: Whether the napi is configured to operate in threaded polling
> +             mode. If this is set to `1` then the NAPI context operates
> +             in threaded polling mode.
> +        type: u32
> +        checks:
> +          max: 1
>    -
>      name: queue
>      attributes:
> @@ -659,6 +667,7 @@ operations:
>              - defer-hard-irqs
>              - gro-flush-timeout
>              - irq-suspend-timeout
> +            - threaded
>        dump:
>          request:
>            attributes:
> @@ -711,6 +720,16 @@ operations:
>              - defer-hard-irqs
>              - gro-flush-timeout
>              - irq-suspend-timeout
> +    -
> +      name: napi-set-threaded
> +      doc: Set threaded napi mode on this napi.
> +      attribute-set: napi
> +      flags: [ admin-perm ]
> +      do:
> +        request:
> +          attributes:
> +            - id
> +            - threaded

Any reason we need a separate op to enable/disable? Why not piggyback
on napi_set?

Maybe also move 'threaded' flag to persistent napi config?


* Re: [PATCH net-next 3/3] Extend napi threaded polling to allow kthread based busy polling
  2025-01-02 19:12 ` [PATCH net-next 3/3] Extend napi threaded polling to allow kthread based busy polling Samiullah Khawaja
@ 2025-01-02 21:16   ` Samudrala, Sridhar
  2025-01-02 21:28   ` Stanislav Fomichev
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 18+ messages in thread
From: Samudrala, Sridhar @ 2025-01-02 21:16 UTC (permalink / raw)
  To: Samiullah Khawaja, Jakub Kicinski, David S. Miller, Eric Dumazet,
	Paolo Abeni
  Cc: netdev



On 1/2/2025 11:12 AM, Samiullah Khawaja wrote:
> Add a new state to napi state enum:
> 
> - STATE_THREADED_BUSY_POLL
>    Threaded busy poll is enabled/running for this napi.
> 
> Following changes are introduced in the napi scheduling and state logic:
> 
> - When threaded busy poll is enabled through sysfs it also enables
>    NAPI_STATE_THREADED so a kthread is created per napi. It also sets
>    NAPI_STATE_THREADED_BUSY_POLL bit on each napi to indicate that we are
>    supposed to busy poll for each napi.

Looks like this patch changes the sysfs 'threaded' field from a boolean 
to an integer, with the value 2 indicating threaded mode with busypoll.
So I think the above comment should reflect that, instead of just saying 
that both threaded mode and busypoll are enabled.

> 
> - When napi is scheduled with STATE_SCHED_THREADED and associated
>    kthread is woken up, the kthread owns the context. If
>    NAPI_STATE_THREADED_BUSY_POLL and NAPI_SCHED_THREADED both are set
>    then it means that we can busy poll.
> 
> - To keep busy polling and to avoid scheduling of the interrupts, the
>    napi_complete_done returns false when both SCHED_THREADED and
>    THREADED_BUSY_POLL flags are set. Also napi_complete_done returns
>    early to avoid the STATE_SCHED_THREADED being unset.
> 
> - If at any point STATE_THREADED_BUSY_POLL is unset, the
>    napi_complete_done will run and unset the SCHED_THREADED bit also.
>    This will make the associated kthread go to sleep as per existing
>    logic.

When does STATE_THREADED_BUSY_POLL get unset? Don't we need a timeout 
value to come out of busypoll mode if there is no traffic?

> 
> Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
> Reviewed-by: Willem de Bruijn <willemb@google.com>
> ---
>   Documentation/ABI/testing/sysfs-class-net     |  3 +-
>   Documentation/netlink/specs/netdev.yaml       |  5 +-
>   .../net/ethernet/atheros/atl1c/atl1c_main.c   |  2 +-
>   include/linux/netdevice.h                     | 24 +++++--
>   net/core/dev.c                                | 72 ++++++++++++++++---
>   net/core/net-sysfs.c                          |  2 +-
>   net/core/netdev-genl-gen.c                    |  2 +-
>   7 files changed, 89 insertions(+), 21 deletions(-)
> 
> diff --git a/Documentation/ABI/testing/sysfs-class-net b/Documentation/ABI/testing/sysfs-class-net
> index ebf21beba846..15d7d36a8294 100644
> --- a/Documentation/ABI/testing/sysfs-class-net
> +++ b/Documentation/ABI/testing/sysfs-class-net
> @@ -343,7 +343,7 @@ Date:		Jan 2021
>   KernelVersion:	5.12
>   Contact:	netdev@vger.kernel.org
>   Description:
> -		Boolean value to control the threaded mode per device. User could
> +		Integer value to control the threaded mode per device. User could
>   		set this value to enable/disable threaded mode for all napi
>   		belonging to this device, without the need to do device up/down.
>   
> @@ -351,4 +351,5 @@ Description:
>   		== ==================================
>   		0  threaded mode disabled for this dev
>   		1  threaded mode enabled for this dev
> +		2  threaded mode enabled, and busy polling enabled.
>   		== ==================================
> diff --git a/Documentation/netlink/specs/netdev.yaml b/Documentation/netlink/specs/netdev.yaml
> index aac343af7246..9c905243a1cc 100644
> --- a/Documentation/netlink/specs/netdev.yaml
> +++ b/Documentation/netlink/specs/netdev.yaml
> @@ -272,10 +272,11 @@ attribute-sets:
>           name: threaded
>           doc: Whether the napi is configured to operate in threaded polling
>                mode. If this is set to `1` then the NAPI context operates
> -             in threaded polling mode.
> +             in threaded polling mode. If this is set to `2` then the NAPI
> +             kthread also does busypolling.
>           type: u32
>           checks:
> -          max: 1
> +          max: 2
>     -
>       name: queue
>       attributes:
> diff --git a/drivers/net/ethernet/atheros/atl1c/atl1c_main.c b/drivers/net/ethernet/atheros/atl1c/atl1c_main.c
> index c571614b1d50..a709cddcd292 100644
> --- a/drivers/net/ethernet/atheros/atl1c/atl1c_main.c
> +++ b/drivers/net/ethernet/atheros/atl1c/atl1c_main.c
> @@ -2688,7 +2688,7 @@ static int atl1c_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
>   	adapter->mii.mdio_write = atl1c_mdio_write;
>   	adapter->mii.phy_id_mask = 0x1f;
>   	adapter->mii.reg_num_mask = MDIO_CTRL_REG_MASK;
> -	dev_set_threaded(netdev, true);
> +	dev_set_threaded(netdev, DEV_NAPI_THREADED);
>   	for (i = 0; i < adapter->rx_queue_count; ++i)
>   		netif_napi_add(netdev, &adapter->rrd_ring[i].napi,
>   			       atl1c_clean_rx);
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 8f531d528869..c384ffe0976e 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -407,6 +407,8 @@ enum {
>   	NAPI_STATE_PREFER_BUSY_POLL,	/* prefer busy-polling over softirq processing*/
>   	NAPI_STATE_THREADED,		/* The poll is performed inside its own thread*/
>   	NAPI_STATE_SCHED_THREADED,	/* Napi is currently scheduled in threaded mode */
> +	NAPI_STATE_THREADED_BUSY_POLL,	/* The threaded napi poller will busy poll */
> +	NAPI_STATE_SCHED_THREADED_BUSY_POLL,  /* The threaded napi poller is busy polling */
>   };
>   
>   enum {
> @@ -420,8 +422,14 @@ enum {
>   	NAPIF_STATE_PREFER_BUSY_POLL	= BIT(NAPI_STATE_PREFER_BUSY_POLL),
>   	NAPIF_STATE_THREADED		= BIT(NAPI_STATE_THREADED),
>   	NAPIF_STATE_SCHED_THREADED	= BIT(NAPI_STATE_SCHED_THREADED),
> +	NAPIF_STATE_THREADED_BUSY_POLL	= BIT(NAPI_STATE_THREADED_BUSY_POLL),
> +	NAPIF_STATE_SCHED_THREADED_BUSY_POLL
> +				= BIT(NAPI_STATE_SCHED_THREADED_BUSY_POLL),
>   };
>   
> +#define NAPIF_STATE_THREADED_BUSY_POLL_MASK \
> +	(NAPIF_STATE_THREADED | NAPIF_STATE_THREADED_BUSY_POLL)
> +
>   enum gro_result {
>   	GRO_MERGED,
>   	GRO_MERGED_FREE,
> @@ -568,16 +576,24 @@ static inline bool napi_complete(struct napi_struct *n)
>   	return napi_complete_done(n, 0);
>   }
>   
> -int dev_set_threaded(struct net_device *dev, bool threaded);
> +enum napi_threaded_state {
> +	NAPI_THREADED_OFF = 0,
> +	NAPI_THREADED = 1,
> +	NAPI_THREADED_BUSY_POLL = 2,
> +	NAPI_THREADED_MAX = NAPI_THREADED_BUSY_POLL,
> +};
> +
> +int dev_set_threaded(struct net_device *dev, enum napi_threaded_state threaded);
>   
>   /*
>    * napi_set_threaded - set napi threaded state
>    * @napi: NAPI context
> - * @threaded: whether this napi does threaded polling
> + * @threaded: threading mode
>    *
>    * Return 0 on success and negative errno on failure.
>    */
> -int napi_set_threaded(struct napi_struct *napi, bool threaded);
> +int napi_set_threaded(struct napi_struct *napi,
> +		      enum napi_threaded_state threaded);
>   
>   /**
>    *	napi_disable - prevent NAPI from scheduling
> @@ -2406,7 +2422,7 @@ struct net_device {
>   	struct sfp_bus		*sfp_bus;
>   	struct lock_class_key	*qdisc_tx_busylock;
>   	bool			proto_down;
> -	bool			threaded;
> +	u8			threaded;
>   
>   	/* priv_flags_slow, ungrouped to save space */
>   	unsigned long		see_all_hwtstamp_requests:1;
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 762977a62da2..b6cd9474bdd3 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -78,6 +78,7 @@
>   #include <linux/slab.h>
>   #include <linux/sched.h>
>   #include <linux/sched/isolation.h>
> +#include <linux/sched/types.h>
>   #include <linux/sched/mm.h>
>   #include <linux/smpboot.h>
>   #include <linux/mutex.h>
> @@ -6231,7 +6232,8 @@ bool napi_complete_done(struct napi_struct *n, int work_done)
>   	 *    the guarantee we will be called later.
>   	 */
>   	if (unlikely(n->state & (NAPIF_STATE_NPSVC |
> -				 NAPIF_STATE_IN_BUSY_POLL)))
> +				 NAPIF_STATE_IN_BUSY_POLL |
> +				 NAPIF_STATE_SCHED_THREADED_BUSY_POLL)))
>   		return false;
>   
>   	if (work_done) {
> @@ -6633,8 +6635,10 @@ static void init_gro_hash(struct napi_struct *napi)
>   	napi->gro_bitmask = 0;
>   }
>   
> -int napi_set_threaded(struct napi_struct *napi, bool threaded)
> +int napi_set_threaded(struct napi_struct *napi,
> +		      enum napi_threaded_state threaded)
>   {
> +	unsigned long val;
>   	if (napi->dev->threaded)
>   		return -EINVAL;
>   
> @@ -6649,30 +6653,41 @@ int napi_set_threaded(struct napi_struct *napi, bool threaded)
>   
>   	/* Make sure kthread is created before THREADED bit is set. */
>   	smp_mb__before_atomic();
> -	assign_bit(NAPI_STATE_THREADED, &napi->state, threaded);
> +	val = 0;
> +	if (threaded == NAPI_THREADED_BUSY_POLL)
> +		val |= NAPIF_STATE_THREADED_BUSY_POLL;
> +	if (threaded)
> +		val |= NAPIF_STATE_THREADED;
> +	set_mask_bits(&napi->state, NAPIF_STATE_THREADED_BUSY_POLL_MASK, val);
>   
>   	return 0;
>   }
>   
> -int dev_set_threaded(struct net_device *dev, bool threaded)
> +int dev_set_threaded(struct net_device *dev, enum napi_threaded_state threaded)
>   {
>   	struct napi_struct *napi;
> +	unsigned long val;
>   	int err = 0;
>   
>   	if (dev->threaded == threaded)
>   		return 0;
>   
> +	val = 0;
>   	if (threaded) {
>   		/* Check if threaded is set at napi level already */
>   		list_for_each_entry(napi, &dev->napi_list, dev_list)
>   			if (test_bit(NAPI_STATE_THREADED, &napi->state))
>   				return -EINVAL;
>   
> +		val |= NAPIF_STATE_THREADED;
> +		if (threaded == NAPI_THREADED_BUSY_POLL)
> +			val |= NAPIF_STATE_THREADED_BUSY_POLL;
> +
>   		list_for_each_entry(napi, &dev->napi_list, dev_list) {
>   			if (!napi->thread) {
>   				err = napi_kthread_create(napi);
>   				if (err) {
> -					threaded = false;
> +					threaded = NAPI_THREADED_OFF;
>   					break;
>   				}
>   			}
> @@ -6691,9 +6706,13 @@ int dev_set_threaded(struct net_device *dev, bool threaded)
>   	 * polled. In this case, the switch between threaded mode and
>   	 * softirq mode will happen in the next round of napi_schedule().
>   	 * This should not cause hiccups/stalls to the live traffic.
> +	 *
> +	 * Switch to busy_poll threaded napi will occur after the threaded
> +	 * napi is scheduled.
>   	 */
>   	list_for_each_entry(napi, &dev->napi_list, dev_list)
> -		assign_bit(NAPI_STATE_THREADED, &napi->state, threaded);
> +		set_mask_bits(&napi->state,
> +			      NAPIF_STATE_THREADED_BUSY_POLL_MASK, val);
>   
>   	return err;
>   }
> @@ -7007,7 +7026,7 @@ static int napi_thread_wait(struct napi_struct *napi)
>   	return -1;
>   }
>   
> -static void napi_threaded_poll_loop(struct napi_struct *napi)
> +static void napi_threaded_poll_loop(struct napi_struct *napi, bool busy_poll)
>   {
>   	struct bpf_net_context __bpf_net_ctx, *bpf_net_ctx;
>   	struct softnet_data *sd;
> @@ -7036,22 +7055,53 @@ static void napi_threaded_poll_loop(struct napi_struct *napi)
>   		}
>   		skb_defer_free_flush(sd);
>   		bpf_net_ctx_clear(bpf_net_ctx);
> +
> +		/* Push the skbs up the stack if busy polling. */
> +		if (busy_poll)
> +			__napi_gro_flush_helper(napi);
>   		local_bh_enable();
>   
> -		if (!repoll)
> +		/* If busy polling then do not break here because we need to
> +		 * call cond_resched and rcu_softirq_qs_periodic to prevent
> +		 * watchdog warnings.
> +		 */
> +		if (!repoll && !busy_poll)
>   			break;
>   
>   		rcu_softirq_qs_periodic(last_qs);
>   		cond_resched();
> +
> +		if (!repoll)
> +			break;
>   	}
>   }
>   
>   static int napi_threaded_poll(void *data)
>   {
>   	struct napi_struct *napi = data;
> +	bool busy_poll_sched;
> +	unsigned long val;
> +	bool busy_poll;
> +
> +	while (!napi_thread_wait(napi)) {
> +		/* Once woken up, this means that we are scheduled as threaded
> +		 * napi and this thread owns the napi context, if busy poll
> +		 * state is set then we busy poll this napi.
> +		 */
> +		val = READ_ONCE(napi->state);
> +		busy_poll = val & NAPIF_STATE_THREADED_BUSY_POLL;
> +		busy_poll_sched = val & NAPIF_STATE_SCHED_THREADED_BUSY_POLL;
> +
> +		/* Do not busy poll if napi is disabled. */
> +		if (unlikely(val & NAPIF_STATE_DISABLE))
> +			busy_poll = false;
> +
> +		if (busy_poll != busy_poll_sched)
> +			assign_bit(NAPI_STATE_SCHED_THREADED_BUSY_POLL,
> +				   &napi->state, busy_poll);
>   
> -	while (!napi_thread_wait(napi))
> -		napi_threaded_poll_loop(napi);
> +		napi_threaded_poll_loop(napi, busy_poll);
> +	}
>   
>   	return 0;
>   }
> @@ -12205,7 +12255,7 @@ static void run_backlog_napi(unsigned int cpu)
>   {
>   	struct softnet_data *sd = per_cpu_ptr(&softnet_data, cpu);
>   
> -	napi_threaded_poll_loop(&sd->backlog);
> +	napi_threaded_poll_loop(&sd->backlog, false);
>   }
>   
>   static void backlog_napi_setup(unsigned int cpu)
> diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
> index 2d9afc6e2161..36d0a22e341c 100644
> --- a/net/core/net-sysfs.c
> +++ b/net/core/net-sysfs.c
> @@ -626,7 +626,7 @@ static int modify_napi_threaded(struct net_device *dev, unsigned long val)
>   	if (list_empty(&dev->napi_list))
>   		return -EOPNOTSUPP;
>   
> -	if (val != 0 && val != 1)
> +	if (val > NAPI_THREADED_MAX)
>   		return -EOPNOTSUPP;
>   
>   	ret = dev_set_threaded(dev, val);
> diff --git a/net/core/netdev-genl-gen.c b/net/core/netdev-genl-gen.c
> index 93dc74dad6de..4086d2577dcc 100644
> --- a/net/core/netdev-genl-gen.c
> +++ b/net/core/netdev-genl-gen.c
> @@ -102,7 +102,7 @@ static const struct nla_policy netdev_napi_set_nl_policy[NETDEV_A_NAPI_IRQ_SUSPE
>   /* NETDEV_CMD_NAPI_SET_THREADED - do */
>   static const struct nla_policy netdev_napi_set_threaded_nl_policy[NETDEV_A_NAPI_THREADED + 1] = {
>   	[NETDEV_A_NAPI_ID] = { .type = NLA_U32, },
> -	[NETDEV_A_NAPI_THREADED] = NLA_POLICY_MAX(NLA_U32, 1),
> +	[NETDEV_A_NAPI_THREADED] = NLA_POLICY_MAX(NLA_U32, 2),
>   };
>   
>   /* Ops table for netdev */



* Re: [PATCH net-next 3/3] Extend napi threaded polling to allow kthread based busy polling
  2025-01-02 19:12 ` [PATCH net-next 3/3] Extend napi threaded polling to allow kthread based busy polling Samiullah Khawaja
  2025-01-02 21:16   ` Samudrala, Sridhar
@ 2025-01-02 21:28   ` Stanislav Fomichev
  2025-01-03  1:05   ` kernel test robot
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 18+ messages in thread
From: Stanislav Fomichev @ 2025-01-02 21:28 UTC (permalink / raw)
  To: Samiullah Khawaja
  Cc: Jakub Kicinski, David S . Miller , Eric Dumazet, Paolo Abeni,
	netdev

On 01/02, Samiullah Khawaja wrote:
> Add a new state to napi state enum:
> 
> - STATE_THREADED_BUSY_POLL
>   Threaded busy poll is enabled/running for this napi.
> 
> Following changes are introduced in the napi scheduling and state logic:
> 
> - When threaded busy poll is enabled through sysfs it also enables
>   NAPI_STATE_THREADED so a kthread is created per napi. It also sets
>   NAPI_STATE_THREADED_BUSY_POLL bit on each napi to indicate that we are
>   supposed to busy poll for each napi.
> 
> - When napi is scheduled with STATE_SCHED_THREADED and associated
>   kthread is woken up, the kthread owns the context. If
>   NAPI_STATE_THREADED_BUSY_POLL and NAPI_SCHED_THREADED both are set
>   then it means that we can busy poll.
> 
> - To keep busy polling and to avoid scheduling of the interrupts, the
>   napi_complete_done returns false when both SCHED_THREADED and
>   THREADED_BUSY_POLL flags are set. Also napi_complete_done returns
>   early to avoid the STATE_SCHED_THREADED being unset.
> 
> - If at any point STATE_THREADED_BUSY_POLL is unset, the
>   napi_complete_done will run and unset the SCHED_THREADED bit also.
>   This will make the associated kthread go to sleep as per existing
>   logic.
> 
> Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
> Reviewed-by: Willem de Bruijn <willemb@google.com>
> ---
>  Documentation/ABI/testing/sysfs-class-net     |  3 +-
>  Documentation/netlink/specs/netdev.yaml       |  5 +-
>  .../net/ethernet/atheros/atl1c/atl1c_main.c   |  2 +-
>  include/linux/netdevice.h                     | 24 +++++--
>  net/core/dev.c                                | 72 ++++++++++++++++---
>  net/core/net-sysfs.c                          |  2 +-
>  net/core/netdev-genl-gen.c                    |  2 +-
>  7 files changed, 89 insertions(+), 21 deletions(-)
> 
> diff --git a/Documentation/ABI/testing/sysfs-class-net b/Documentation/ABI/testing/sysfs-class-net
> index ebf21beba846..15d7d36a8294 100644
> --- a/Documentation/ABI/testing/sysfs-class-net
> +++ b/Documentation/ABI/testing/sysfs-class-net
> @@ -343,7 +343,7 @@ Date:		Jan 2021
>  KernelVersion:	5.12
>  Contact:	netdev@vger.kernel.org
>  Description:
> -		Boolean value to control the threaded mode per device. User could
> +		Integer value to control the threaded mode per device. User could
>  		set this value to enable/disable threaded mode for all napi
>  		belonging to this device, without the need to do device up/down.
>  
> @@ -351,4 +351,5 @@ Description:
>  		== ==================================
>  		0  threaded mode disabled for this dev
>  		1  threaded mode enabled for this dev
> +		2  threaded mode enabled, and busy polling enabled.
>  		== ==================================
> diff --git a/Documentation/netlink/specs/netdev.yaml b/Documentation/netlink/specs/netdev.yaml
> index aac343af7246..9c905243a1cc 100644
> --- a/Documentation/netlink/specs/netdev.yaml
> +++ b/Documentation/netlink/specs/netdev.yaml
> @@ -272,10 +272,11 @@ attribute-sets:
>          name: threaded
>          doc: Whether the napi is configured to operate in threaded polling
>               mode. If this is set to `1` then the NAPI context operates
> -             in threaded polling mode.
> +             in threaded polling mode. If this is set to `2` then the NAPI
> +             kthread also does busypolling.
>          type: u32
>          checks:
> -          max: 1
> +          max: 2
>    -

I'd vote for a separate threaded-busy-poll parameter (and separate doc)
instead of overloading the 'threaded' bool. But if you prefer to
have a single argument, let's at least change it to an enum with proper
values for the busy and non-busy modes instead of magic numbers?


* Re: [PATCH net-next 0/3] Add support to do threaded napi busy poll
  2025-01-02 19:12 [PATCH net-next 0/3] Add support to do threaded napi busy poll Samiullah Khawaja
                   ` (2 preceding siblings ...)
  2025-01-02 19:12 ` [PATCH net-next 3/3] Extend napi threaded polling to allow kthread based busy polling Samiullah Khawaja
@ 2025-01-02 21:30 ` Stanislav Fomichev
  2025-01-02 21:56 ` David Laight
  2025-01-03  0:47 ` Jakub Kicinski
  5 siblings, 0 replies; 18+ messages in thread
From: Stanislav Fomichev @ 2025-01-02 21:30 UTC (permalink / raw)
  To: Samiullah Khawaja
  Cc: Jakub Kicinski, David S . Miller , Eric Dumazet, Paolo Abeni,
	netdev, jdamato

On 01/02, Samiullah Khawaja wrote:
> Extend the already existing support of threaded napi poll to do continuous
> busypolling.
> 
> This is used for doing continuous polling of napi to fetch descriptors from
> backing RX/TX queues for low latency applications. Allow enabling of threaded
> busypoll using netlink so this can be enabled on a set of dedicated napis for
> low latency applications.
> 
> Currently threaded napi is only enabled at device level using sysfs. Add
> support to enable/disable threaded mode for a napi individually. This can be
> done using the netlink interface. Add `set_threaded` op in netlink spec that
> allows setting the `threaded` attribute of a napi.
> 
> Extend the threaded attribute in napi struct to add an option to enable
> continuous busy polling. Extend the netlink and sysfs interface to allow
> enabled/disabling threaded busypolling at device or individual napi level.
> 
> Once threaded busypoll on a napi is enabled, depending on the application
> requirements the polling thread can be moved to dedicated cores. We used this
> for AF_XDP usecases to fetch packets from RX queues to reduce latency.

Joe recently added tools/testing/selftests/net/busy_poller.c, should we
extend it to cover your new threaded/non-thread busy/non-busy napi modes?


* Re: [PATCH net-next 0/3] Add support to do threaded napi busy poll
  2025-01-02 19:12 [PATCH net-next 0/3] Add support to do threaded napi busy poll Samiullah Khawaja
                   ` (3 preceding siblings ...)
  2025-01-02 21:30 ` [PATCH net-next 0/3] Add support to do threaded napi busy poll Stanislav Fomichev
@ 2025-01-02 21:56 ` David Laight
  2025-01-03  0:47 ` Jakub Kicinski
  5 siblings, 0 replies; 18+ messages in thread
From: David Laight @ 2025-01-02 21:56 UTC (permalink / raw)
  To: Samiullah Khawaja
  Cc: Jakub Kicinski, David S . Miller , Eric Dumazet, Paolo Abeni,
	netdev

On Thu,  2 Jan 2025 19:12:24 +0000
Samiullah Khawaja <skhawaja@google.com> wrote:

> Extend the already existing support of threaded napi poll to do continuous
> busypolling.
> 
> This is used for doing continuous polling of napi to fetch descriptors from
> backing RX/TX queues for low latency applications. Allow enabling of threaded
> busypoll using netlink so this can be enabled on a set of dedicated napis for
> low latency applications.
> 
> Currently threaded napi is only enabled at device level using sysfs. Add
> support to enable/disable threaded mode for a napi individually. This can be
> done using the netlink interface. Add `set_threaded` op in netlink spec that
> allows setting the `threaded` attribute of a napi.
> 
> Extend the threaded attribute in napi struct to add an option to enable
> continuous busy polling. Extend the netlink and sysfs interface to allow
> enabled/disabling threaded busypolling at device or individual napi level.
> 
> Once threaded busypoll on a napi is enabled, depending on the application
> requirements the polling thread can be moved to dedicated cores. We used this
> for AF_XDP usecases to fetch packets from RX queues to reduce latency.

I think it would be more generally useful to be able to set the priority
of the NAPI thread.
'busypolling' is just an extreme version of a high-priority thread.

We've had to use threaded NAPI and a low SCHED_FIFO priority in order to
clear the RX queues without the softirq code locking out other SCHED_FIFO
user threads for 20ms+.
That is currently rather a pain to set up.

	David




* Re: [PATCH net-next 0/3] Add support to do threaded napi busy poll
  2025-01-02 19:12 [PATCH net-next 0/3] Add support to do threaded napi busy poll Samiullah Khawaja
                   ` (4 preceding siblings ...)
  2025-01-02 21:56 ` David Laight
@ 2025-01-03  0:47 ` Jakub Kicinski
  2025-01-08 19:25   ` Joe Damato
  5 siblings, 1 reply; 18+ messages in thread
From: Jakub Kicinski @ 2025-01-03  0:47 UTC (permalink / raw)
  To: Samiullah Khawaja; +Cc: David S . Miller , Eric Dumazet, Paolo Abeni, netdev

On Thu,  2 Jan 2025 19:12:24 +0000 Samiullah Khawaja wrote:
> Extend the already existing support of threaded napi poll to do continuous
> busypolling.
> 
> This is used for doing continuous polling of napi to fetch descriptors from
> backing RX/TX queues for low latency applications. Allow enabling of threaded
> busypoll using netlink so this can be enabled on a set of dedicated napis for
> low latency applications.

This is lacking clear justification and experimental results
vs doing the same thing from user space.

I'd also appreciate if Google could share the experience and results
of using basic threaded NAPI _in production_.
-- 
pw-bot: cr


* Re: [PATCH net-next 3/3] Extend napi threaded polling to allow kthread based busy polling
  2025-01-02 19:12 ` [PATCH net-next 3/3] Extend napi threaded polling to allow kthread based busy polling Samiullah Khawaja
  2025-01-02 21:16   ` Samudrala, Sridhar
  2025-01-02 21:28   ` Stanislav Fomichev
@ 2025-01-03  1:05   ` kernel test robot
  2025-01-03  8:11   ` kernel test robot
  2025-01-03  8:12   ` kernel test robot
  4 siblings, 0 replies; 18+ messages in thread
From: kernel test robot @ 2025-01-03  1:05 UTC (permalink / raw)
  To: Samiullah Khawaja, Jakub Kicinski, David S . Miller, Eric Dumazet,
	Paolo Abeni
  Cc: oe-kbuild-all, netdev, skhawaja

Hi Samiullah,

kernel test robot noticed the following build errors:

[auto build test ERROR on net-next/main]

url:    https://github.com/intel-lab-lkp/linux/commits/Samiullah-Khawaja/Add-support-to-set-napi-threaded-for-individual-napi/20250103-031428
base:   net-next/main
patch link:    https://lore.kernel.org/r/20250102191227.2084046-4-skhawaja%40google.com
patch subject: [PATCH net-next 3/3] Extend napi threaded polling to allow kthread based busy polling
config: i386-buildonly-randconfig-006-20250103 (https://download.01.org/0day-ci/archive/20250103/202501030842.OdBE8ADq-lkp@intel.com/config)
compiler: gcc-12 (Debian 12.2.0-14) 12.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250103/202501030842.OdBE8ADq-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202501030842.OdBE8ADq-lkp@intel.com/

All errors (new ones prefixed by >>):

   drivers/net/ethernet/atheros/atl1c/atl1c_main.c: In function 'atl1c_probe':
>> drivers/net/ethernet/atheros/atl1c/atl1c_main.c:2691:34: error: 'DEV_NAPI_THREADED' undeclared (first use in this function); did you mean 'NAPI_THREADED'?
    2691 |         dev_set_threaded(netdev, DEV_NAPI_THREADED);
         |                                  ^~~~~~~~~~~~~~~~~
         |                                  NAPI_THREADED
   drivers/net/ethernet/atheros/atl1c/atl1c_main.c:2691:34: note: each undeclared identifier is reported only once for each function it appears in


vim +2691 drivers/net/ethernet/atheros/atl1c/atl1c_main.c

  2600	
  2601	/**
  2602	 * atl1c_probe - Device Initialization Routine
  2603	 * @pdev: PCI device information struct
  2604	 * @ent: entry in atl1c_pci_tbl
  2605	 *
  2606	 * Returns 0 on success, negative on failure
  2607	 *
  2608	 * atl1c_probe initializes an adapter identified by a pci_dev structure.
  2609	 * The OS initialization, configuring of the adapter private structure,
  2610	 * and a hardware reset occur.
  2611	 */
  2612	static int atl1c_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
  2613	{
  2614		struct net_device *netdev;
  2615		struct atl1c_adapter *adapter;
  2616		static int cards_found;
  2617		u8 __iomem *hw_addr;
  2618		enum atl1c_nic_type nic_type;
  2619		u32 queue_count = 1;
  2620		int err = 0;
  2621		int i;
  2622	
  2623		/* enable device (incl. PCI PM wakeup and hotplug setup) */
  2624		err = pci_enable_device_mem(pdev);
  2625		if (err)
  2626			return dev_err_probe(&pdev->dev, err, "cannot enable PCI device\n");
  2627	
  2628		/*
  2629		 * The atl1c chip can DMA to 64-bit addresses, but it uses a single
  2630		 * shared register for the high 32 bits, so only a single, aligned,
  2631		 * 4 GB physical address range can be used at a time.
  2632		 *
  2633		 * Supporting 64-bit DMA on this hardware is more trouble than it's
  2634		 * worth.  It is far easier to limit to 32-bit DMA than update
  2635		 * various kernel subsystems to support the mechanics required by a
  2636		 * fixed-high-32-bit system.
  2637		 */
  2638		err = dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(32));
  2639		if (err) {
  2640			dev_err(&pdev->dev, "No usable DMA configuration,aborting\n");
  2641			goto err_dma;
  2642		}
  2643	
  2644		err = pci_request_regions(pdev, atl1c_driver_name);
  2645		if (err) {
  2646			dev_err(&pdev->dev, "cannot obtain PCI resources\n");
  2647			goto err_pci_reg;
  2648		}
  2649	
  2650		pci_set_master(pdev);
  2651	
  2652		hw_addr = pci_ioremap_bar(pdev, 0);
  2653		if (!hw_addr) {
  2654			err = -EIO;
  2655			dev_err(&pdev->dev, "cannot map device registers\n");
  2656			goto err_ioremap;
  2657		}
  2658	
  2659		nic_type = atl1c_get_mac_type(pdev, hw_addr);
  2660		if (nic_type == athr_mt)
  2661			queue_count = 4;
  2662	
  2663		netdev = alloc_etherdev_mq(sizeof(struct atl1c_adapter), queue_count);
  2664		if (netdev == NULL) {
  2665			err = -ENOMEM;
  2666			goto err_alloc_etherdev;
  2667		}
  2668	
  2669		err = atl1c_init_netdev(netdev, pdev);
  2670		if (err) {
  2671			dev_err(&pdev->dev, "init netdevice failed\n");
  2672			goto err_init_netdev;
  2673		}
  2674		adapter = netdev_priv(netdev);
  2675		adapter->bd_number = cards_found;
  2676		adapter->netdev = netdev;
  2677		adapter->pdev = pdev;
  2678		adapter->hw.adapter = adapter;
  2679		adapter->hw.nic_type = nic_type;
  2680		adapter->msg_enable = netif_msg_init(-1, atl1c_default_msg);
  2681		adapter->hw.hw_addr = hw_addr;
  2682		adapter->tx_queue_count = queue_count;
  2683		adapter->rx_queue_count = queue_count;
  2684	
  2685		/* init mii data */
  2686		adapter->mii.dev = netdev;
  2687		adapter->mii.mdio_read  = atl1c_mdio_read;
  2688		adapter->mii.mdio_write = atl1c_mdio_write;
  2689		adapter->mii.phy_id_mask = 0x1f;
  2690		adapter->mii.reg_num_mask = MDIO_CTRL_REG_MASK;
> 2691		dev_set_threaded(netdev, DEV_NAPI_THREADED);
  2692		for (i = 0; i < adapter->rx_queue_count; ++i)
  2693			netif_napi_add(netdev, &adapter->rrd_ring[i].napi,
  2694				       atl1c_clean_rx);
  2695		for (i = 0; i < adapter->tx_queue_count; ++i)
  2696			netif_napi_add_tx(netdev, &adapter->tpd_ring[i].napi,
  2697					  atl1c_clean_tx);
  2698		timer_setup(&adapter->phy_config_timer, atl1c_phy_config, 0);
  2699		/* setup the private structure */
  2700		err = atl1c_sw_init(adapter);
  2701		if (err) {
  2702			dev_err(&pdev->dev, "net device private data init failed\n");
  2703			goto err_sw_init;
  2704		}
  2705		/* set max MTU */
  2706		atl1c_set_max_mtu(netdev);
  2707	
  2708		atl1c_reset_pcie(&adapter->hw, ATL1C_PCIE_L0S_L1_DISABLE);
  2709	
  2710		/* Init GPHY as early as possible due to power saving issue  */
  2711		atl1c_phy_reset(&adapter->hw);
  2712	
  2713		err = atl1c_reset_mac(&adapter->hw);
  2714		if (err) {
  2715			err = -EIO;
  2716			goto err_reset;
  2717		}
  2718	
  2719		/* reset the controller to
  2720		 * put the device in a known good starting state */
  2721		err = atl1c_phy_init(&adapter->hw);
  2722		if (err) {
  2723			err = -EIO;
  2724			goto err_reset;
  2725		}
  2726		if (atl1c_read_mac_addr(&adapter->hw)) {
  2727			/* got a random MAC address, set NET_ADDR_RANDOM to netdev */
  2728			netdev->addr_assign_type = NET_ADDR_RANDOM;
  2729		}
  2730		eth_hw_addr_set(netdev, adapter->hw.mac_addr);
  2731		if (netif_msg_probe(adapter))
  2732			dev_dbg(&pdev->dev, "mac address : %pM\n",
  2733				adapter->hw.mac_addr);
  2734	
  2735		atl1c_hw_set_mac_addr(&adapter->hw, adapter->hw.mac_addr);
  2736		INIT_WORK(&adapter->common_task, atl1c_common_task);
  2737		adapter->work_event = 0;
  2738		err = register_netdev(netdev);
  2739		if (err) {
  2740			dev_err(&pdev->dev, "register netdevice failed\n");
  2741			goto err_register;
  2742		}
  2743	
  2744		cards_found++;
  2745		return 0;
  2746	
  2747	err_reset:
  2748	err_register:
  2749	err_sw_init:
  2750	err_init_netdev:
  2751		free_netdev(netdev);
  2752	err_alloc_etherdev:
  2753		iounmap(hw_addr);
  2754	err_ioremap:
  2755		pci_release_regions(pdev);
  2756	err_pci_reg:
  2757	err_dma:
  2758		pci_disable_device(pdev);
  2759		return err;
  2760	}
  2761	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


* Re: [PATCH net-next 3/3] Extend napi threaded polling to allow kthread based busy polling
  2025-01-02 19:12 ` [PATCH net-next 3/3] Extend napi threaded polling to allow kthread based busy polling Samiullah Khawaja
                     ` (2 preceding siblings ...)
  2025-01-03  1:05   ` kernel test robot
@ 2025-01-03  8:11   ` kernel test robot
  2025-01-03  8:12   ` kernel test robot
  4 siblings, 0 replies; 18+ messages in thread
From: kernel test robot @ 2025-01-03  8:11 UTC (permalink / raw)
  To: Samiullah Khawaja, Jakub Kicinski, David S . Miller, Eric Dumazet,
	Paolo Abeni
  Cc: llvm, oe-kbuild-all, netdev, skhawaja

Hi Samiullah,

kernel test robot noticed the following build errors:

[auto build test ERROR on net-next/main]

url:    https://github.com/intel-lab-lkp/linux/commits/Samiullah-Khawaja/Add-support-to-set-napi-threaded-for-individual-napi/20250103-031428
base:   net-next/main
patch link:    https://lore.kernel.org/r/20250102191227.2084046-4-skhawaja%40google.com
patch subject: [PATCH net-next 3/3] Extend napi threaded polling to allow kthread based busy polling
config: x86_64-kexec (https://download.01.org/0day-ci/archive/20250103/202501031537.QXSNLahs-lkp@intel.com/config)
compiler: clang version 19.1.3 (https://github.com/llvm/llvm-project ab51eccf88f5321e7c60591c5546b254b6afab99)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250103/202501031537.QXSNLahs-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202501031537.QXSNLahs-lkp@intel.com/

All errors (new ones prefixed by >>):

   In file included from drivers/net/ethernet/atheros/atl1c/atl1c_main.c:9:
   In file included from drivers/net/ethernet/atheros/atl1c/atl1c.h:16:
   In file included from include/linux/pci.h:1658:
   In file included from include/linux/dmapool.h:14:
   In file included from include/linux/scatterlist.h:8:
   In file included from include/linux/mm.h:2223:
   include/linux/vmstat.h:504:43: warning: arithmetic between different enumeration types ('enum zone_stat_item' and 'enum numa_stat_item') [-Wenum-enum-conversion]
     504 |         return vmstat_text[NR_VM_ZONE_STAT_ITEMS +
         |                            ~~~~~~~~~~~~~~~~~~~~~ ^
     505 |                            item];
         |                            ~~~~
   include/linux/vmstat.h:511:43: warning: arithmetic between different enumeration types ('enum zone_stat_item' and 'enum numa_stat_item') [-Wenum-enum-conversion]
     511 |         return vmstat_text[NR_VM_ZONE_STAT_ITEMS +
         |                            ~~~~~~~~~~~~~~~~~~~~~ ^
     512 |                            NR_VM_NUMA_EVENT_ITEMS +
         |                            ~~~~~~~~~~~~~~~~~~~~~~
   include/linux/vmstat.h:518:36: warning: arithmetic between different enumeration types ('enum node_stat_item' and 'enum lru_list') [-Wenum-enum-conversion]
     518 |         return node_stat_name(NR_LRU_BASE + lru) + 3; // skip "nr_"
         |                               ~~~~~~~~~~~ ^ ~~~
   include/linux/vmstat.h:524:43: warning: arithmetic between different enumeration types ('enum zone_stat_item' and 'enum numa_stat_item') [-Wenum-enum-conversion]
     524 |         return vmstat_text[NR_VM_ZONE_STAT_ITEMS +
         |                            ~~~~~~~~~~~~~~~~~~~~~ ^
     525 |                            NR_VM_NUMA_EVENT_ITEMS +
         |                            ~~~~~~~~~~~~~~~~~~~~~~
>> drivers/net/ethernet/atheros/atl1c/atl1c_main.c:2691:27: error: use of undeclared identifier 'DEV_NAPI_THREADED'; did you mean 'NAPI_THREADED'?
    2691 |         dev_set_threaded(netdev, DEV_NAPI_THREADED);
         |                                  ^~~~~~~~~~~~~~~~~
         |                                  NAPI_THREADED
   include/linux/netdevice.h:581:2: note: 'NAPI_THREADED' declared here
     581 |         NAPI_THREADED = 1,
         |         ^
   4 warnings and 1 error generated.


vim +2691 drivers/net/ethernet/atheros/atl1c/atl1c_main.c

  2600	
  2601	/**
  2602	 * atl1c_probe - Device Initialization Routine
  2603	 * @pdev: PCI device information struct
  2604	 * @ent: entry in atl1c_pci_tbl
  2605	 *
  2606	 * Returns 0 on success, negative on failure
  2607	 *
  2608	 * atl1c_probe initializes an adapter identified by a pci_dev structure.
  2609	 * The OS initialization, configuring of the adapter private structure,
  2610	 * and a hardware reset occur.
  2611	 */
  2612	static int atl1c_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
  2613	{
  2614		struct net_device *netdev;
  2615		struct atl1c_adapter *adapter;
  2616		static int cards_found;
  2617		u8 __iomem *hw_addr;
  2618		enum atl1c_nic_type nic_type;
  2619		u32 queue_count = 1;
  2620		int err = 0;
  2621		int i;
  2622	
  2623		/* enable device (incl. PCI PM wakeup and hotplug setup) */
  2624		err = pci_enable_device_mem(pdev);
  2625		if (err)
  2626			return dev_err_probe(&pdev->dev, err, "cannot enable PCI device\n");
  2627	
  2628		/*
  2629		 * The atl1c chip can DMA to 64-bit addresses, but it uses a single
  2630		 * shared register for the high 32 bits, so only a single, aligned,
  2631		 * 4 GB physical address range can be used at a time.
  2632		 *
  2633		 * Supporting 64-bit DMA on this hardware is more trouble than it's
  2634		 * worth.  It is far easier to limit to 32-bit DMA than update
  2635		 * various kernel subsystems to support the mechanics required by a
  2636		 * fixed-high-32-bit system.
  2637		 */
  2638		err = dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(32));
  2639		if (err) {
  2640			dev_err(&pdev->dev, "No usable DMA configuration,aborting\n");
  2641			goto err_dma;
  2642		}
  2643	
  2644		err = pci_request_regions(pdev, atl1c_driver_name);
  2645		if (err) {
  2646			dev_err(&pdev->dev, "cannot obtain PCI resources\n");
  2647			goto err_pci_reg;
  2648		}
  2649	
  2650		pci_set_master(pdev);
  2651	
  2652		hw_addr = pci_ioremap_bar(pdev, 0);
  2653		if (!hw_addr) {
  2654			err = -EIO;
  2655			dev_err(&pdev->dev, "cannot map device registers\n");
  2656			goto err_ioremap;
  2657		}
  2658	
  2659		nic_type = atl1c_get_mac_type(pdev, hw_addr);
  2660		if (nic_type == athr_mt)
  2661			queue_count = 4;
  2662	
  2663		netdev = alloc_etherdev_mq(sizeof(struct atl1c_adapter), queue_count);
  2664		if (netdev == NULL) {
  2665			err = -ENOMEM;
  2666			goto err_alloc_etherdev;
  2667		}
  2668	
  2669		err = atl1c_init_netdev(netdev, pdev);
  2670		if (err) {
  2671			dev_err(&pdev->dev, "init netdevice failed\n");
  2672			goto err_init_netdev;
  2673		}
  2674		adapter = netdev_priv(netdev);
  2675		adapter->bd_number = cards_found;
  2676		adapter->netdev = netdev;
  2677		adapter->pdev = pdev;
  2678		adapter->hw.adapter = adapter;
  2679		adapter->hw.nic_type = nic_type;
  2680		adapter->msg_enable = netif_msg_init(-1, atl1c_default_msg);
  2681		adapter->hw.hw_addr = hw_addr;
  2682		adapter->tx_queue_count = queue_count;
  2683		adapter->rx_queue_count = queue_count;
  2684	
  2685		/* init mii data */
  2686		adapter->mii.dev = netdev;
  2687		adapter->mii.mdio_read  = atl1c_mdio_read;
  2688		adapter->mii.mdio_write = atl1c_mdio_write;
  2689		adapter->mii.phy_id_mask = 0x1f;
  2690		adapter->mii.reg_num_mask = MDIO_CTRL_REG_MASK;
> 2691		dev_set_threaded(netdev, DEV_NAPI_THREADED);
  2692		for (i = 0; i < adapter->rx_queue_count; ++i)
  2693			netif_napi_add(netdev, &adapter->rrd_ring[i].napi,
  2694				       atl1c_clean_rx);
  2695		for (i = 0; i < adapter->tx_queue_count; ++i)
  2696			netif_napi_add_tx(netdev, &adapter->tpd_ring[i].napi,
  2697					  atl1c_clean_tx);
  2698		timer_setup(&adapter->phy_config_timer, atl1c_phy_config, 0);
  2699		/* setup the private structure */
  2700		err = atl1c_sw_init(adapter);
  2701		if (err) {
  2702			dev_err(&pdev->dev, "net device private data init failed\n");
  2703			goto err_sw_init;
  2704		}
  2705		/* set max MTU */
  2706		atl1c_set_max_mtu(netdev);
  2707	
  2708		atl1c_reset_pcie(&adapter->hw, ATL1C_PCIE_L0S_L1_DISABLE);
  2709	
  2710		/* Init GPHY as early as possible due to power saving issue  */
  2711		atl1c_phy_reset(&adapter->hw);
  2712	
  2713		err = atl1c_reset_mac(&adapter->hw);
  2714		if (err) {
  2715			err = -EIO;
  2716			goto err_reset;
  2717		}
  2718	
  2719		/* reset the controller to
  2720		 * put the device in a known good starting state */
  2721		err = atl1c_phy_init(&adapter->hw);
  2722		if (err) {
  2723			err = -EIO;
  2724			goto err_reset;
  2725		}
  2726		if (atl1c_read_mac_addr(&adapter->hw)) {
  2727			/* got a random MAC address, set NET_ADDR_RANDOM to netdev */
  2728			netdev->addr_assign_type = NET_ADDR_RANDOM;
  2729		}
  2730		eth_hw_addr_set(netdev, adapter->hw.mac_addr);
  2731		if (netif_msg_probe(adapter))
  2732			dev_dbg(&pdev->dev, "mac address : %pM\n",
  2733				adapter->hw.mac_addr);
  2734	
  2735		atl1c_hw_set_mac_addr(&adapter->hw, adapter->hw.mac_addr);
  2736		INIT_WORK(&adapter->common_task, atl1c_common_task);
  2737		adapter->work_event = 0;
  2738		err = register_netdev(netdev);
  2739		if (err) {
  2740			dev_err(&pdev->dev, "register netdevice failed\n");
  2741			goto err_register;
  2742		}
  2743	
  2744		cards_found++;
  2745		return 0;
  2746	
  2747	err_reset:
  2748	err_register:
  2749	err_sw_init:
  2750	err_init_netdev:
  2751		free_netdev(netdev);
  2752	err_alloc_etherdev:
  2753		iounmap(hw_addr);
  2754	err_ioremap:
  2755		pci_release_regions(pdev);
  2756	err_pci_reg:
  2757	err_dma:
  2758		pci_disable_device(pdev);
  2759		return err;
  2760	}
  2761	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH net-next 3/3] Extend napi threaded polling to allow kthread based busy polling
  2025-01-02 19:12 ` [PATCH net-next 3/3] Extend napi threaded polling to allow kthread based busy polling Samiullah Khawaja
                     ` (3 preceding siblings ...)
  2025-01-03  8:11   ` kernel test robot
@ 2025-01-03  8:12   ` kernel test robot
  4 siblings, 0 replies; 18+ messages in thread
From: kernel test robot @ 2025-01-03  8:12 UTC (permalink / raw)
  To: Samiullah Khawaja, Jakub Kicinski, David S . Miller, Eric Dumazet,
	Paolo Abeni
  Cc: oe-kbuild-all, netdev, skhawaja

Hi Samiullah,

kernel test robot noticed the following build warnings:

[auto build test WARNING on net-next/main]

url:    https://github.com/intel-lab-lkp/linux/commits/Samiullah-Khawaja/Add-support-to-set-napi-threaded-for-individual-napi/20250103-031428
base:   net-next/main
patch link:    https://lore.kernel.org/r/20250102191227.2084046-4-skhawaja%40google.com
patch subject: [PATCH net-next 3/3] Extend napi threaded polling to allow kthread based busy polling
config: x86_64-randconfig-073-20250103 (https://download.01.org/0day-ci/archive/20250103/202501031530.ss0kvHke-lkp@intel.com/config)
compiler: gcc-12 (Debian 12.2.0-14) 12.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250103/202501031530.ss0kvHke-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202501031530.ss0kvHke-lkp@intel.com/

All warnings (new ones prefixed by >>):

   drivers/net/ethernet/mellanox/mlxsw/pci.c: In function 'mlxsw_pci_napi_devs_init':
>> drivers/net/ethernet/mellanox/mlxsw/pci.c:158:50: warning: implicit conversion from 'enum <anonymous>' to 'enum napi_threaded_state' [-Wenum-conversion]
     158 |         dev_set_threaded(mlxsw_pci->napi_dev_rx, true);
         |                                                  ^~~~


vim +158 drivers/net/ethernet/mellanox/mlxsw/pci.c

eda6500a987a02 Jiri Pirko 2015-07-29  140  
5d01ed2e970812 Amit Cohen 2024-04-26  141  static int mlxsw_pci_napi_devs_init(struct mlxsw_pci *mlxsw_pci)
5d01ed2e970812 Amit Cohen 2024-04-26  142  {
5d01ed2e970812 Amit Cohen 2024-04-26  143  	int err;
5d01ed2e970812 Amit Cohen 2024-04-26  144  
5d01ed2e970812 Amit Cohen 2024-04-26  145  	mlxsw_pci->napi_dev_tx = alloc_netdev_dummy(0);
5d01ed2e970812 Amit Cohen 2024-04-26  146  	if (!mlxsw_pci->napi_dev_tx)
5d01ed2e970812 Amit Cohen 2024-04-26  147  		return -ENOMEM;
5d01ed2e970812 Amit Cohen 2024-04-26  148  	strscpy(mlxsw_pci->napi_dev_tx->name, "mlxsw_tx",
5d01ed2e970812 Amit Cohen 2024-04-26  149  		sizeof(mlxsw_pci->napi_dev_tx->name));
5d01ed2e970812 Amit Cohen 2024-04-26  150  
5d01ed2e970812 Amit Cohen 2024-04-26  151  	mlxsw_pci->napi_dev_rx = alloc_netdev_dummy(0);
5d01ed2e970812 Amit Cohen 2024-04-26  152  	if (!mlxsw_pci->napi_dev_rx) {
5d01ed2e970812 Amit Cohen 2024-04-26  153  		err = -ENOMEM;
5d01ed2e970812 Amit Cohen 2024-04-26  154  		goto err_alloc_rx;
5d01ed2e970812 Amit Cohen 2024-04-26  155  	}
5d01ed2e970812 Amit Cohen 2024-04-26  156  	strscpy(mlxsw_pci->napi_dev_rx->name, "mlxsw_rx",
5d01ed2e970812 Amit Cohen 2024-04-26  157  		sizeof(mlxsw_pci->napi_dev_rx->name));
5d01ed2e970812 Amit Cohen 2024-04-26 @158  	dev_set_threaded(mlxsw_pci->napi_dev_rx, true);
5d01ed2e970812 Amit Cohen 2024-04-26  159  
5d01ed2e970812 Amit Cohen 2024-04-26  160  	return 0;
5d01ed2e970812 Amit Cohen 2024-04-26  161  
5d01ed2e970812 Amit Cohen 2024-04-26  162  err_alloc_rx:
5d01ed2e970812 Amit Cohen 2024-04-26  163  	free_netdev(mlxsw_pci->napi_dev_tx);
5d01ed2e970812 Amit Cohen 2024-04-26  164  	return err;
5d01ed2e970812 Amit Cohen 2024-04-26  165  }
5d01ed2e970812 Amit Cohen 2024-04-26  166  

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH net-next 0/3] Add support to do threaded napi busy poll
  2025-01-03  0:47 ` Jakub Kicinski
@ 2025-01-08 19:25   ` Joe Damato
  2025-01-08 21:18     ` Samiullah Khawaja
  0 siblings, 1 reply; 18+ messages in thread
From: Joe Damato @ 2025-01-08 19:25 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Samiullah Khawaja, David S . Miller , Eric Dumazet, Paolo Abeni,
	netdev, mkarsten

On Thu, Jan 02, 2025 at 04:47:14PM -0800, Jakub Kicinski wrote:
> On Thu,  2 Jan 2025 19:12:24 +0000 Samiullah Khawaja wrote:
> > Extend the already existing support of threaded napi poll to do continuous
> > busypolling.
> > 
> > This is used for doing continuous polling of napi to fetch descriptors from
> > backing RX/TX queues for low latency applications. Allow enabling of threaded
> > busypoll using netlink so this can be enabled on a set of dedicated napis for
> > low latency applications.
> 
> This is lacking clear justification and experimental results
> vs doing the same thing from user space.

Apologies for chiming in late here as I was out of the office, but I
agree with Jakub and Stanislav:

- This lacks clear justification and data to compare packet delivery
  mechanisms. IMHO, at a minimum a real world application should be
  benchmarked and various packet delivery mechanisms (including this
  one) should be compared side-by-side. You don't need to do exactly
  what Martin and I did [1], but I'd offer that as a possible
  suggestion for how you might proceed if you need some suggestions.

- This should include a test of some sort; perhaps expanding the test
  I added (as Stanislav suggested) would be a good start?

- IMHO, this change should also include updated kernel documentation
  to clearly explain how, when, and why a user might use this and
  what tradeoffs a user can expect. The commit message is, IMHO, far
  too vague.

  Including example code snippets or ynl invocations etc in the
  kernel documentation would be very helpful.

> I'd also appreciate if Google could share the experience and results
> of using basic threaded NAPI _in production_.

+1; this data would be very insightful.

[1]: https://lore.kernel.org/netdev/20241109050245.191288-1-jdamato@fastly.com/

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH net-next 0/3] Add support to do threaded napi busy poll
  2025-01-08 19:25   ` Joe Damato
@ 2025-01-08 21:18     ` Samiullah Khawaja
  2025-01-08 21:53       ` Martin Karsten
  0 siblings, 1 reply; 18+ messages in thread
From: Samiullah Khawaja @ 2025-01-08 21:18 UTC (permalink / raw)
  To: Joe Damato, Jakub Kicinski, Samiullah Khawaja, David S . Miller,
	Eric Dumazet, Paolo Abeni, netdev, mkarsten

On Wed, Jan 8, 2025 at 11:25 AM Joe Damato <jdamato@fastly.com> wrote:
>
> On Thu, Jan 02, 2025 at 04:47:14PM -0800, Jakub Kicinski wrote:
> > On Thu,  2 Jan 2025 19:12:24 +0000 Samiullah Khawaja wrote:
> > > Extend the already existing support of threaded napi poll to do continuous
> > > busypolling.
> > >
> > > This is used for doing continuous polling of napi to fetch descriptors from
> > > backing RX/TX queues for low latency applications. Allow enabling of threaded
> > > busypoll using netlink so this can be enabled on a set of dedicated napis for
> > > low latency applications.
> >
> > This is lacking clear justification and experimental results
> > vs doing the same thing from user space.
Thanks for the response.

The major benefit is that this is one common way to enable busy
polling of descriptors on a particular napi. It is basically
independent of the userspace API and allows for enabling busy polling
on a subset, out of the complete list, of napi instances in a device
that can be shared among multiple processes/applications that have low
latency requirements. This allows for a dedicated subset of napi
instances that are configured for busy polling on a machine and
workload/jobs can target these napi instances.

Once enabled, the relevant kthread can be queried using netlink
`get-napi` op. The thread priority, scheduler and any thread core
affinity can also be set. Any userspace application using a variety of
interfaces (AF_XDP, io_uring, xsk, epoll etc) can run on top of it
without any further complexity. For userspace driven napi busy
polling, one has to either use sysctls to setup busypolling that are
done at device level or use different interfaces depending on the use
cases,
- epoll params (or a sysctl that is system wide) for epoll based interface
- socket option (or sysctl that is system wide) for sk_recvmsg
- io_uring (I believe SQPOLL can be configured with it)

Our application for this feature uses a userspace implementation of
TCP (https://github.com/Xilinx-CNS/onload) that interfaces with AF_XDP
to send/receive packets. We use neper (running with AF_XDP + userspace
TCP library) to measure latency improvements with and without napi
threaded busy poll. Our target application sends packets with a well
defined frequency with a couple of 100 bytes of RPC style
request/response.

Test Environment:
Google C3 VMs running netdev-net/main kernel with idpf driver

Without napi threaded busy poll (p50 at around 44us)
num_transactions=47918
latency_min=0.000018838
latency_max=0.333912365
latency_mean=0.000189570
latency_stddev=0.005859874
latency_p50=0.000043510
latency_p90=0.000053750
latency_p99=0.000058230
latency_p99.9=0.000184310

With napi threaded busy poll (p50 around 14us)
latency_min=0.000012271
latency_max=0.209365389
latency_mean=0.000021611
latency_stddev=0.001166541
latency_p50=0.000013590
latency_p90=0.000019990
latency_p99=0.000023670
latency_p99.9=0.000027830

> Apologies for chiming in late here as I was out of the office, but I
> agree with Jakub and Stanislav:
Thanks for chiming in.
>
> - This lacks clear justification and data to compare packet delivery
>   mechanisms. IMHO, at a minimum a real world application should be
>   benchmarked and various packet delivery mechanisms (including this
>   one) should be compared side-by-side. You don't need to do exactly
>   what Martin and I did [1], but I'd offer that as a possible
>   suggestion for how you might proceed if you need some suggestions.
Some packet delivery mechanisms, like epoll, can only be compared
with this using an application that uses them. This napi threaded
approach enables busy polling of a napi regardless of the userspace
API the application uses; for example, our target application uses
AF_XDP and interfaces with the rings directly.
>
> - This should include a test of some sort; perhaps expanding the test
>   I added (as Stanislav suggested) would be a good start?
Thanks for the suggestion. I am currently expanding the test you added
to cover this (as Stanislav suggested) and will include it in the next
update.
>
> - IMHO, this change should also include updated kernel documentation
>   to clearly explain how, when, and why a user might use this and
>   what tradeoffs a user can expect. The commit message is, IMHO, far
>   too vague.
>
>   Including example code snippets or ynl invocations etc in the
>   kernel documentation would be very helpful.
Thanks for the suggestion. I will add those in the next update.
>
> > I'd also appreciate if Google could share the experience and results
> > of using basic threaded NAPI _in production_.
We are not using basic threaded NAPI _in production_, but we are going
to use this threaded napi busy poll in production for one of our use
cases. We are currently improving neper to simulate the network
traffic accurately and to handle the request/response frequency
properly.
>
> +1; this data would be very insightful.
>
> [1]: https://lore.kernel.org/netdev/20241109050245.191288-1-jdamato@fastly.com/

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH net-next 0/3] Add support to do threaded napi busy poll
  2025-01-08 21:18     ` Samiullah Khawaja
@ 2025-01-08 21:53       ` Martin Karsten
  2025-01-15 22:35         ` Samiullah Khawaja
  0 siblings, 1 reply; 18+ messages in thread
From: Martin Karsten @ 2025-01-08 21:53 UTC (permalink / raw)
  To: Samiullah Khawaja, Joe Damato, Jakub Kicinski, David S . Miller,
	Eric Dumazet, Paolo Abeni, netdev

On 2025-01-08 16:18, Samiullah Khawaja wrote:
> On Wed, Jan 8, 2025 at 11:25 AM Joe Damato <jdamato@fastly.com> wrote:
>>
>> On Thu, Jan 02, 2025 at 04:47:14PM -0800, Jakub Kicinski wrote:
>>> On Thu,  2 Jan 2025 19:12:24 +0000 Samiullah Khawaja wrote:
>>>> Extend the already existing support of threaded napi poll to do continuous
>>>> busypolling.
>>>>
>>>> This is used for doing continuous polling of napi to fetch descriptors from
>>>> backing RX/TX queues for low latency applications. Allow enabling of threaded
>>>> busypoll using netlink so this can be enabled on a set of dedicated napis for
>>>> low latency applications.
>>>
>>> This is lacking clear justification and experimental results
>>> vs doing the same thing from user space.
> Thanks for the response.
> 
> The major benefit is that this is one common way to enable busy
> polling of descriptors on a particular napi. It is basically
> independent of the userspace API and allows for enabling busy polling
> on a subset, out of the complete list, of napi instances in a device
> that can be shared among multiple processes/applications that have low
> latency requirements. This allows for a dedicated subset of napi
> instances that are configured for busy polling on a machine and
> workload/jobs can target these napi instances.
> 
> Once enabled, the relevant kthread can be queried using netlink
> `get-napi` op. The thread priority, scheduler and any thread core
> affinity can also be set. Any userspace application using a variety of
> interfaces (AF_XDP, io_uring, xsk, epoll etc) can run on top of it
> without any further complexity. For userspace driven napi busy
> polling, one has to either use sysctls to setup busypolling that are
> done at device level or use different interfaces depending on the use
> cases,
> - epoll params (or a sysctl that is system wide) for epoll based interface
> - socket option (or sysctl that is system wide) for sk_recvmsg
> - io_uring (I believe SQPOLL can be configured with it)
> 
> Our application for this feature uses a userspace implementation of
> TCP (https://github.com/Xilinx-CNS/onload) that interfaces with AF_XDP
> to send/receive packets. We use neper (running with AF_XDP + userspace
> TCP library) to measure latency improvements with and without napi
> threaded busy poll. Our target application sends packets with a well
> defined frequency with a couple of 100 bytes of RPC style
> request/response.

Let me also apologize for being late to the party. I am not always 
paying close attention to the list and did not see this until Joe 
flagged it for me.

I have a couple of questions about your experiments. In general, I find 
this level of experiment description not sufficient for reproducibility. 
Ideally you point to complete scripts.

A canonical problem with using network benchmarks like neper to assess 
network stack processing is that it typically inflates the relative 
importance of network stack processing, because there is no application
processing involved.

Were the experiments below run single-threaded?

> Test Environment:
> Google C3 VMs running netdev-net/main kernel with idpf driver
> 
> Without napi threaded busy poll (p50 at around 44us)
> num_transactions=47918
> latency_min=0.000018838
> latency_max=0.333912365
> latency_mean=0.000189570
> latency_stddev=0.005859874
> latency_p50=0.000043510
> latency_p90=0.000053750
> latency_p99=0.000058230
> latency_p99.9=0.000184310

What was the interrupt routing in the above base case?

> With napi threaded busy poll (p50 around 14us)
> latency_min=0.000012271
> latency_max=0.209365389
> latency_mean=0.000021611
> latency_stddev=0.001166541
> latency_p50=0.000013590
> latency_p90=0.000019990
> latency_p99=0.000023670
> latency_p99.9=0.000027830

How many cores are in play in this case?

I am wondering whether your baseline effectively uses only one core, but 
your "threaded busy poll" case uses two? Then I am wondering whether a 
similar effect could be achieved by suitable interrupt and thread affinity?

Thanks,
Martin

[snip]


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH net-next 0/3] Add support to do threaded napi busy poll
  2025-01-08 21:53       ` Martin Karsten
@ 2025-01-15 22:35         ` Samiullah Khawaja
  2025-01-16  0:28           ` Martin Karsten
  0 siblings, 1 reply; 18+ messages in thread
From: Samiullah Khawaja @ 2025-01-15 22:35 UTC (permalink / raw)
  To: Martin Karsten
  Cc: Joe Damato, Jakub Kicinski, David S . Miller, Eric Dumazet,
	Paolo Abeni, netdev

On Wed, Jan 8, 2025 at 1:54 PM Martin Karsten <mkarsten@uwaterloo.ca> wrote:
>
> On 2025-01-08 16:18, Samiullah Khawaja wrote:
> > On Wed, Jan 8, 2025 at 11:25 AM Joe Damato <jdamato@fastly.com> wrote:
> >>
> >> On Thu, Jan 02, 2025 at 04:47:14PM -0800, Jakub Kicinski wrote:
> >>> On Thu,  2 Jan 2025 19:12:24 +0000 Samiullah Khawaja wrote:
> >>>> Extend the already existing support of threaded napi poll to do continuous
> >>>> busypolling.
> >>>>
> >>>> This is used for doing continuous polling of napi to fetch descriptors from
> >>>> backing RX/TX queues for low latency applications. Allow enabling of threaded
> >>>> busypoll using netlink so this can be enabled on a set of dedicated napis for
> >>>> low latency applications.
> >>>
> >>> This is lacking clear justification and experimental results
> >>> vs doing the same thing from user space.
> > Thanks for the response.
> >
> > The major benefit is that this is one common way to enable busy
> > polling of descriptors on a particular napi. It is basically
> > independent of the userspace API and allows for enabling busy polling
> > on a subset, out of the complete list, of napi instances in a device
> > that can be shared among multiple processes/applications that have low
> > latency requirements. This allows for a dedicated subset of napi
> > instances that are configured for busy polling on a machine and
> > workload/jobs can target these napi instances.
> >
> > Once enabled, the relevant kthread can be queried using netlink
> > `get-napi` op. The thread priority, scheduler and any thread core
> > affinity can also be set. Any userspace application using a variety of
> > interfaces (AF_XDP, io_uring, xsk, epoll etc) can run on top of it
> > without any further complexity. For userspace driven napi busy
> > polling, one has to either use sysctls to setup busypolling that are
> > done at device level or use different interfaces depending on the use
> > cases,
> > - epoll params (or a sysctl that is system wide) for epoll based interface
> > - socket option (or sysctl that is system wide) for sk_recvmsg
> > - io_uring (I believe SQPOLL can be configured with it)
> >
> > Our application for this feature uses a userspace implementation of
> > TCP (https://github.com/Xilinx-CNS/onload) that interfaces with AF_XDP
> > to send/receive packets. We use neper (running with AF_XDP + userspace
> > TCP library) to measure latency improvements with and without napi
> > threaded busy poll. Our target application sends packets with a well
> > defined frequency with a couple of 100 bytes of RPC style
> > request/response.
>
> Let me also apologize for being late to the party. I am not always
> paying close attention to the list and did not see this until Joe
> flagged it for me.
Thanks for the reply.
>
> I have a couple of questions about your experiments. In general, I find
> this level of experiment description not sufficient for reproducibility.
> Ideally you point to complete scripts.
>
> A canonical problem with using network benchmarks like neper to assess
> network stack processing is that it typically inflates the relative
> importance of network stack processing, because there is no application
> processing involved.
Agreed with your assessment; I went back to gather some more
information before replying. Our use case is very low latency: a solid
14us for RPCs with very small messages of around 200 bytes and minimal
application processing. We have well defined traffic patterns for this
use case, with a defined maximum number of packets per second, to make
sure the latency is guaranteed. To measure the performance of such a
use case, we picked up neper and used it to generate our traffic
pattern; while we are using neper, this does represent our real world
use case. Also, in my experimentation I am using neper with the onload
library mentioned earlier to interface with the NIC using AF_XDP. In
short, we do want the maximum network stack optimization, where
packets are pulled off the descriptor queue quickly.
>
> Were the experiments below run single-threaded?
Since we are waiting on some of the other features in our environment,
we are working with only 1 RX queue that has multiple flows running.
Both experiments have the same interrupt configuration; also, the
userspace process's affinity is set to be close to the core receiving
the interrupts.
>
> > Test Environment:
> > Google C3 VMs running netdev-net/main kernel with idpf driver
> >
> > Without napi threaded busy poll (p50 at around 44us)
> > num_transactions=47918
> > latency_min=0.000018838
> > latency_max=0.333912365
> > latency_mean=0.000189570
> > latency_stddev=0.005859874
> > latency_p50=0.000043510
> > latency_p90=0.000053750
> > latency_p99=0.000058230
> > latency_p99.9=0.000184310
>
> What was the interrupt routing in the above base case?
>
> > With napi threaded busy poll (p50 around 14us)
> > latency_min=0.000012271
> > latency_max=0.209365389
> > latency_mean=0.000021611
> > latency_stddev=0.001166541
> > latency_p50=0.000013590
> > latency_p90=0.000019990
> > latency_p99=0.000023670
> > latency_p99.9=0.000027830
>
> How many cores are in play in this case?
Same in userspace, but the napi has its own dedicated core polling it
inside the kernel. Since the napi is polled continuously, we don't
enable interrupts in this case, as implemented in the patch. This is
one of the major reasons we cannot drive this from userspace and want
the napi driven on a separate core, independent of the application
processing logic. We don't want latency to suffer while the thread
driving the napi goes back to userspace and handles some application
logic or packet processing that might be happening in onload.
>
> I am wondering whether your baseline effectively uses only one core, but
> your "threaded busy poll" case uses two? Then I am wondering whether a
> similar effect could be achieved by suitable interrupt and thread affinity?
We tried this in earlier experiments by setting up proper interrupt
and thread affinity to place them close together, and the 44us latency
was achieved using that. In non-AF_XDP tests, enabling busypolling at
the socket level using setsockopt, we were only able to achieve around
20us, and P99 still suffers. This is mostly because the thread driving
the napi goes back to userspace to do application work. This netlink
based mechanism solves that and provides a UAPI independent way to
enable busypolling for a napi. One can still choose to configure the
napi thread's core affinity and priority to share cores with userspace
processes if desired.
>
> Thanks,
> Martin
>
> [snip]
>


* Re: [PATCH net-next 0/3] Add support to do threaded napi busy poll
  2025-01-15 22:35         ` Samiullah Khawaja
@ 2025-01-16  0:28           ` Martin Karsten
  0 siblings, 0 replies; 18+ messages in thread
From: Martin Karsten @ 2025-01-16  0:28 UTC (permalink / raw)
  To: Samiullah Khawaja
  Cc: Joe Damato, Jakub Kicinski, David S . Miller, Eric Dumazet,
	Paolo Abeni, netdev

On 2025-01-15 17:35, Samiullah Khawaja wrote:
> On Wed, Jan 8, 2025 at 1:54 PM Martin Karsten <mkarsten@uwaterloo.ca> wrote:
>>
>> On 2025-01-08 16:18, Samiullah Khawaja wrote:
>>> On Wed, Jan 8, 2025 at 11:25 AM Joe Damato <jdamato@fastly.com> wrote:
>>>>
>>>> On Thu, Jan 02, 2025 at 04:47:14PM -0800, Jakub Kicinski wrote:
>>>>> On Thu,  2 Jan 2025 19:12:24 +0000 Samiullah Khawaja wrote:
>>>>>> Extend the already existing support of threaded napi poll to do continuous
>>>>>> busypolling.
>>>>>>
>>>>>> This is used for doing continuous polling of napi to fetch descriptors from
>>>>>> backing RX/TX queues for low latency applications. Allow enabling of threaded
>>>>>> busypoll using netlink so this can be enabled on a set of dedicated napis for
>>>>>> low latency applications.
>>>>>
>>>>> This is lacking clear justification and experimental results
>>>>> vs doing the same thing from user space.
>>> Thanks for the response.
>>>
>>> The major benefit is that this is a one common way to enable busy
>>> polling of descriptors on a particular napi. It is basically
>>> independent of the userspace API and allows for enabling busy polling
>>> on a subset, out of the complete list, of napi instances in a device
>>> that can be shared among multiple processes/applications that have low
>>> latency requirements. This allows for a dedicated subset of napi
>>> instances that are configured for busy polling on a machine and
>>> workload/jobs can target these napi instances.
>>>
>>> Once enabled, the relevant kthread can be queried using netlink
>>> `get-napi` op. The thread priority, scheduler and any thread core
>>> affinity can also be set. Any userspace application using a variety of
>>> interfaces (AF_XDP, io_uring, xsk, epoll etc) can run on top of it
>>> without any further complexity. For userspace driven napi busy
>>> polling, one has to either use sysctls to setup busypolling that are
>>> done at device level or use different interfaces depending on the use
>>> cases,
>>> - epoll params (or a sysctl that is system wide) for epoll based interface
>>> - socket option (or sysctl that is system wide) for sk_recvmsg
>>> - io_uring (I believe SQPOLL can be configured with it)
>>>
>>> Our application for this feature uses a userspace implementation of
>>> TCP (https://github.com/Xilinx-CNS/onload) that interfaces with AF_XDP
>>> to send/receive packets. We use neper (running with AF_XDP + userspace
>>> TCP library) to measure latency improvements with and without napi
>>> threaded busy poll. Our target application sends packets with a well
>>> defined frequency with a couple of 100 bytes of RPC style
>>> request/response.
>>
>> Let me also apologize for being late to the party. I am not always
>> paying close attention to the list and did not see this until Joe
>> flagged it for me.
> Thanks for the reply.
>>
>> I have a couple of questions about your experiments. In general, I find
>> this level of experiment description not sufficient for reproducibility.
>> Ideally you point to complete scripts.
>>
>> A canonical problem with using network benchmarks like neper to assess
>> network stack processing is that it typically inflates the relative
>> importance of network stack processing, because there is no application
>> processing involved
> Agreed on your assessment and I went back to get some more info before
> I could reply to this. Basically our use case is very low latency: a
> solid 14us RPC with very small messages, around 200 bytes, and minimal
> application processing. We have well-defined traffic patterns for this
> use case with a defined maximum number of packets per second to make
> sure the latency is guaranteed. So to measure the performance of such
> a use case, we basically picked up neper and used it to generate our
> traffic pattern. While we are using neper, this does represent our
> real-world use case. Also, in my experimentation, I am using neper with
> the onload library that I mentioned earlier to interface with the NIC
> using AF_XDP. In short we do want to get the maximum network stack
> optimization where the packets are pulled off the descriptor queue
> quickly.
>>
>> Were the experiments below run single-threaded?
> Since we are waiting on some of the other features in our environment,
> we are working with only 1 RX queue that has multiple flows running.
> Both experiments have the same interrupt configuration. Also, the
> userspace process has its affinity set close to the core receiving
> the interrupts.
>>
>>> Test Environment:
>>> Google C3 VMs running netdev-net/main kernel with idpf driver
>>>
>>> Without napi threaded busy poll (p50 at around 44us)
>>> num_transactions=47918
>>> latency_min=0.000018838
>>> latency_max=0.333912365
>>> latency_mean=0.000189570
>>> latency_stddev=0.005859874
>>> latency_p50=0.000043510
>>> latency_p90=0.000053750
>>> latency_p99=0.000058230
>>> latency_p99.9=0.000184310
>>
>> What was the interrupt routing in the above base case?
>>
>>> With napi threaded busy poll (p50 around 14us)
>>> latency_min=0.000012271
>>> latency_max=0.209365389
>>> latency_mean=0.000021611
>>> latency_stddev=0.001166541
>>> latency_p50=0.000013590
>>> latency_p90=0.000019990
>>> latency_p99=0.000023670
>>> latency_p99.9=0.000027830
>>
>> How many cores are in play in this case?
> Same in userspace, but napi has its own dedicated core polling it
> inside the kernel. Since the napi is polled continuously, we don't
> enable interrupts in this case, as implemented in the patch. This is
> one of the major reasons we cannot drive this from userspace and want
> the napi driven on a separate core, independent of the application
> processing logic. We don't want latency to suffer while the thread
> that is driving the napi goes back to userspace and handles some
> application logic or packet processing that might be happening in onload.
>>
>> I am wondering whether your baseline effectively uses only one core, but
>> your "threaded busy poll" case uses two? Then I am wondering whether a
>> similar effect could be achieved by suitable interrupt and thread affinity?
> We tried this in earlier experiments by setting up proper interrupt
> and thread affinity to bring them closer, and the 44us latency above
> was achieved that way. With non-AF_XDP tests, enabling busy polling at
> the socket level using setsockopt, we are only able to achieve around
> 20us, and P99 still suffers. This is mostly because the thread that is
> driving the napi goes back to userspace to do application work. This
> netlink-based mechanism solves that and provides a UAPI-independent
> way to enable busy polling for a napi. One can choose to configure the
> napi thread's core affinity and priority to share cores with userspace
> processes if desired.

Thanks for your explanations. I have a better sense now for your 
motivation. However, I can think of several follow-up questions and 
what-ifs about your experiments, but rather than going back and forth on 
the list, I would find it extremely helpful to see your actual and 
complete experiment setups, ideally in script(s), so that one can 
reproduce your observations and tinker with variations and what-ifs.

But I'm just a bit of a tourist here, so you might get away with 
ignoring me. :-)

Best,
Martin




Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
2025-01-02 19:12 [PATCH net-next 0/3] Add support to do threaded napi busy poll Samiullah Khawaja
2025-01-02 19:12 ` [PATCH net-next 1/3] Add support to set napi threaded for individual napi Samiullah Khawaja
2025-01-02 21:15   ` Stanislav Fomichev
2025-01-02 19:12 ` [PATCH net-next 2/3] net: Create separate gro_flush helper function Samiullah Khawaja
2025-01-02 19:12 ` [PATCH net-next 3/3] Extend napi threaded polling to allow kthread based busy polling Samiullah Khawaja
2025-01-02 21:16   ` Samudrala, Sridhar
2025-01-02 21:28   ` Stanislav Fomichev
2025-01-03  1:05   ` kernel test robot
2025-01-03  8:11   ` kernel test robot
2025-01-03  8:12   ` kernel test robot
2025-01-02 21:30 ` [PATCH net-next 0/3] Add support to do threaded napi busy poll Stanislav Fomichev
2025-01-02 21:56 ` David Laight
2025-01-03  0:47 ` Jakub Kicinski
2025-01-08 19:25   ` Joe Damato
2025-01-08 21:18     ` Samiullah Khawaja
2025-01-08 21:53       ` Martin Karsten
2025-01-15 22:35         ` Samiullah Khawaja
2025-01-16  0:28           ` Martin Karsten
