From: Jakub Kicinski <kuba@kernel.org>
To: Samiullah Khawaja <skhawaja@google.com>
Cc: "David S . Miller " <davem@davemloft.net>,
Eric Dumazet <edumazet@google.com>,
Paolo Abeni <pabeni@redhat.com>,
almasrymina@google.com, willemb@google.com,
Joe Damato <joe@dama.to>,
mkarsten@uwaterloo.ca, netdev@vger.kernel.org
Subject: Re: [PATCH net-next v9 1/2] Extend napi threaded polling to allow kthread based busy polling
Date: Fri, 12 Sep 2025 18:46:08 -0700 [thread overview]
Message-ID: <20250912184608.6c6a7c51@kernel.org> (raw)
In-Reply-To: <20250911212901.1718508-2-skhawaja@google.com>
On Thu, 11 Sep 2025 21:29:00 +0000 Samiullah Khawaja wrote:
> Add a new state to napi state enum:
>
> - NAPI_STATE_THREADED_BUSY_POLL
> Threaded busy poll is enabled/running for this napi.
>
> Following changes are introduced in the napi scheduling and state logic:
>
> - When threaded busy poll is enabled through netlink it also enables
> NAPI_STATE_THREADED so a kthread is created per napi. It also sets
> NAPI_STATE_THREADED_BUSY_POLL bit on each napi to indicate that it is
> going to busy poll the napi.
>
> - When napi is scheduled with NAPI_STATE_SCHED_THREADED and associated
> kthread is woken up, the kthread owns the context. If
> NAPI_STATE_THREADED_BUSY_POLL and NAPI_STATE_SCHED_THREADED both are
> set then it means that kthread can busy poll.
>
> - To keep busy polling and to avoid scheduling of the interrupts, the
> napi_complete_done returns false when both NAPI_STATE_SCHED_THREADED
> and NAPI_STATE_THREADED_BUSY_POLL flags are set. Also
> napi_complete_done returns early to avoid the
> NAPI_STATE_SCHED_THREADED being unset.
>
> - If at any point NAPI_STATE_THREADED_BUSY_POLL is unset, the
> napi_complete_done will run and unset the NAPI_STATE_SCHED_THREADED
> bit also. This will make the associated kthread go to sleep as per
> existing logic.
>
> Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
I think you need to spend some time trying to make this code more..
elegant.
> diff --git a/Documentation/netlink/specs/netdev.yaml b/Documentation/netlink/specs/netdev.yaml
> index c035dc0f64fd..ce28e8708a87 100644
> --- a/Documentation/netlink/specs/netdev.yaml
> +++ b/Documentation/netlink/specs/netdev.yaml
> @@ -88,7 +88,7 @@ definitions:
> -
> name: napi-threaded
> type: enum
> - entries: [disabled, enabled]
> + entries: [disabled, enabled, busy-poll-enabled]
drop the -enabled
> attribute-sets:
> -
> @@ -291,7 +291,8 @@ attribute-sets:
> name: threaded
> doc: Whether the NAPI is configured to operate in threaded polling
> mode. If this is set to enabled then the NAPI context operates
> - in threaded polling mode.
> + in threaded polling mode. If this is set to busy-poll-enabled
> + then the NAPI kthread also does busypolling.
I don't think busypolling is a word? I mean, I don't think English
combines words like this.
> +Threaded NAPI busy polling
> +--------------------------
Please feed the documentation into a grammar checker. A bunch of
articles seems to be missing.
> +Threaded NAPI allows processing of packets from each NAPI in a kthread in
> +kernel. Threaded NAPI busy polling extends this and adds support to do
> +continuous busy polling of this NAPI. This can be used to enable busy polling
> +independent of userspace application or the API (epoll, io_uring, raw sockets)
> +being used in userspace to process the packets.
> +
> +It can be enabled for each NAPI using netlink interface.
Netlink, capital letter
> +For example, using following script:
> +
> +.. code-block:: bash
> +
> + $ ynl --family netdev --do napi-set \
> + --json='{"id": 66, "threaded": "busy-poll-enabled"}'
> +
> +
> +Enabling it for each NAPI allows finer control to enable busy pollling for
pollling -> polling
> +only a set of NIC queues which will get traffic with low latency requirements.
A bit of a non-sequitur. Sounds like you just cut off the device-wide
config here.
> +Depending on application requirement, user might want to set affinity of the
> +kthread that is busy polling each NAPI. User might also want to set priority
> +and the scheduler of the thread depending on the latency requirements.
> +
> +For a hard low-latency application, user might want to dedicate the full core
> +for the NAPI polling so the NIC queue descriptors are picked up from the queue
> +as soon as they appear. Once enabled, the NAPI thread will poll the NIC queues
> +continuously without sleeping. This will keep the CPU core busy with 100%
> +usage. For more relaxed low-latency requirement, user might want to share the
> +core with other threads by setting thread affinity and priority.
Is there such a thing a priority in the Linux scheduler? Being more
specific would be useful. I think this code is useful for forwarding
or AF_XDP but normal socket applications would struggle to use this
mode.
> Threaded NAPI
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index f3a3b761abfb..a88f6596aef7 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -427,6 +427,8 @@ enum {
> NAPI_STATE_THREADED, /* The poll is performed inside its own thread*/
> NAPI_STATE_SCHED_THREADED, /* Napi is currently scheduled in threaded mode */
> NAPI_STATE_HAS_NOTIFIER, /* Napi has an IRQ notifier */
> + NAPI_STATE_THREADED_BUSY_POLL, /* The threaded napi poller will busy poll */
> + NAPI_STATE_SCHED_THREADED_BUSY_POLL, /* The threaded napi poller is busy polling */
I don't get why you need 2 bits to implement this feature.
> @@ -1873,7 +1881,8 @@ enum netdev_reg_state {
> * @addr_len: Hardware address length
> * @upper_level: Maximum depth level of upper devices.
> * @lower_level: Maximum depth level of lower devices.
> - * @threaded: napi threaded state.
> + * @threaded: napi threaded mode is disabled, enabled or
> + * enabled with busy polling.
And you are still updating the device level kdoc.
> * @neigh_priv_len: Used in neigh_alloc()
> * @dev_id: Used to differentiate devices that share
> * the same link layer address
> @@ -78,6 +78,7 @@
> #include <linux/slab.h>
> #include <linux/sched.h>
> #include <linux/sched/isolation.h>
> +#include <linux/sched/types.h>
Leftover from experiments with setting scheduler params in the core?
Or dare I say Google prod kernel?
> #include <linux/sched/mm.h>
> #include <linux/smpboot.h>
> #include <linux/mutex.h>
> @@ -6619,7 +6620,8 @@ bool napi_complete_done(struct napi_struct *n, int work_done)
> * the guarantee we will be called later.
> */
> if (unlikely(n->state & (NAPIF_STATE_NPSVC |
> - NAPIF_STATE_IN_BUSY_POLL)))
> + NAPIF_STATE_IN_BUSY_POLL |
> + NAPIF_STATE_SCHED_THREADED_BUSY_POLL)))
Why not just set the IN_BUSY_POLL when the thread starts polling?
What's the significance of the distinction?
> return false;
>
> if (work_done) {
> +static void napi_set_threaded_state(struct napi_struct *napi,
> + enum netdev_napi_threaded threaded)
> +{
> + unsigned long val;
> +
> + val = 0;
> + if (threaded == NETDEV_NAPI_THREADED_BUSY_POLL_ENABLED)
> + val |= NAPIF_STATE_THREADED_BUSY_POLL;
> + if (threaded)
> + val |= NAPIF_STATE_THREADED;
this reads odd, set threaded first then the sub-option
> + set_mask_bits(&napi->state, NAPIF_STATE_THREADED_BUSY_POLL_MASK, val);
Does this actually have to be atomic? I don't think so.
> +}
> +
> int napi_set_threaded(struct napi_struct *napi,
> enum netdev_napi_threaded threaded)
> {
> @@ -7050,7 +7066,7 @@ int napi_set_threaded(struct napi_struct *napi,
> } else {
> /* Make sure kthread is created before THREADED bit is set. */
> smp_mb__before_atomic();
> - assign_bit(NAPI_STATE_THREADED, &napi->state, threaded);
> + napi_set_threaded_state(napi, threaded);
> }
>
> return 0;
> @@ -7442,7 +7458,9 @@ void napi_disable_locked(struct napi_struct *n)
> }
>
> new = val | NAPIF_STATE_SCHED | NAPIF_STATE_NPSVC;
> - new &= ~(NAPIF_STATE_THREADED | NAPIF_STATE_PREFER_BUSY_POLL);
> + new &= ~(NAPIF_STATE_THREADED
> + | NAPIF_STATE_THREADED_BUSY_POLL
> + | NAPIF_STATE_PREFER_BUSY_POLL);
kernel coding style has | at the end of the line.
> } while (!try_cmpxchg(&n->state, &val, new));
>
> hrtimer_cancel(&n->timer);
> @@ -7486,7 +7504,7 @@ void napi_enable_locked(struct napi_struct *n)
>
> new = val & ~(NAPIF_STATE_SCHED | NAPIF_STATE_NPSVC);
> if (n->dev->threaded && n->thread)
> - new |= NAPIF_STATE_THREADED;
> + napi_set_threaded_state(n, n->dev->threaded);
> } while (!try_cmpxchg(&n->state, &val, new));
> }
> EXPORT_SYMBOL(napi_enable_locked);
> @@ -7654,7 +7672,7 @@ static int napi_thread_wait(struct napi_struct *napi)
> return -1;
> }
>
> -static void napi_threaded_poll_loop(struct napi_struct *napi)
> +static void napi_threaded_poll_loop(struct napi_struct *napi, bool busy_poll)
> {
> struct bpf_net_context __bpf_net_ctx, *bpf_net_ctx;
> struct softnet_data *sd;
> @@ -7683,22 +7701,58 @@ static void napi_threaded_poll_loop(struct napi_struct *napi)
> }
> skb_defer_free_flush(sd);
> bpf_net_ctx_clear(bpf_net_ctx);
> +
> + /* Flush too old packets. If HZ < 1000, flush all packets */
Probably better to say something about the condition than just copy
the comment third time.
> + if (busy_poll)
> + gro_flush_normal(&napi->gro, HZ >= 1000);
> local_bh_enable();
>
> - if (!repoll)
> + /* If busy polling then do not break here because we need to
> + * call cond_resched and rcu_softirq_qs_periodic to prevent
> + * watchdog warnings.
> + */
> + if (!repoll && !busy_poll)
> break;
>
> rcu_softirq_qs_periodic(last_qs);
> cond_resched();
> +
> + if (!repoll)
> + break;
> }
> }
>
> static int napi_threaded_poll(void *data)
> {
> struct napi_struct *napi = data;
> + bool busy_poll_sched;
> + unsigned long val;
> + bool busy_poll;
> +
> + while (!napi_thread_wait(napi)) {
> + /* Once woken up, this means that we are scheduled as threaded
> + * napi and this thread owns the napi context, if busy poll
please capitalize NAPI in all comments and docs
> + * state is set then busy poll this napi.
> + */
> + val = READ_ONCE(napi->state);
> + busy_poll = val & NAPIF_STATE_THREADED_BUSY_POLL;
> + busy_poll_sched = val & NAPIF_STATE_SCHED_THREADED_BUSY_POLL;
> +
> + /* Do not busy poll if napi is disabled. */
It's not disabled, disable is pending
> + if (unlikely(val & NAPIF_STATE_DISABLE))
> + busy_poll = false;
> +
> + if (busy_poll != busy_poll_sched) {
> + val = busy_poll ?
> + NAPIF_STATE_SCHED_THREADED_BUSY_POLL :
> + 0;
> + set_mask_bits(&napi->state,
> + NAPIF_STATE_SCHED_THREADED_BUSY_POLL,
> + val);
why set single bit with set_mask_bits() ?
Please improve, IIRC it took quite a lot of reviewer effort to get
the thread stopping into shape, similar level of effort will not be
afforded again.
next prev parent reply other threads:[~2025-09-13 1:46 UTC|newest]
Thread overview: 11+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-09-11 21:28 [PATCH net-next v9 0/2] Add support to do threaded napi busy poll Samiullah Khawaja
2025-09-11 21:29 ` [PATCH net-next v9 1/2] Extend napi threaded polling to allow kthread based busy polling Samiullah Khawaja
2025-09-13 1:46 ` Jakub Kicinski [this message]
2025-09-15 16:24 ` Samiullah Khawaja
2025-09-11 21:29 ` [PATCH net-next v9 2/2] selftests: Add napi threaded busy poll test in `busy_poller` Samiullah Khawaja
2025-09-12 8:08 ` [PATCH net-next v9 0/2] Add support to do threaded napi busy poll Martin Karsten
2025-09-12 16:22 ` Willem de Bruijn
2025-09-13 2:07 ` Jakub Kicinski
2025-09-13 16:03 ` Martin Karsten
2025-09-15 16:26 ` Samiullah Khawaja
2025-09-12 20:22 ` Stanislav Fomichev
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20250912184608.6c6a7c51@kernel.org \
--to=kuba@kernel.org \
--cc=almasrymina@google.com \
--cc=davem@davemloft.net \
--cc=edumazet@google.com \
--cc=joe@dama.to \
--cc=mkarsten@uwaterloo.ca \
--cc=netdev@vger.kernel.org \
--cc=pabeni@redhat.com \
--cc=skhawaja@google.com \
--cc=willemb@google.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.