Date: Fri, 12 Sep 2025 18:46:08 -0700
From: Jakub Kicinski
To: Samiullah Khawaja
Cc: "David S. Miller", Eric Dumazet, Paolo Abeni, almasrymina@google.com,
 willemb@google.com, Joe Damato, mkarsten@uwaterloo.ca,
 netdev@vger.kernel.org
Subject: Re: [PATCH net-next v9 1/2] Extend napi threaded polling to allow
 kthread based busy polling
Message-ID: <20250912184608.6c6a7c51@kernel.org>
In-Reply-To: <20250911212901.1718508-2-skhawaja@google.com>
References: <20250911212901.1718508-1-skhawaja@google.com>
 <20250911212901.1718508-2-skhawaja@google.com>

On Thu, 11 Sep 2025 21:29:00 +0000 Samiullah Khawaja wrote:
> Add a new state to napi state enum:
>
>  - NAPI_STATE_THREADED_BUSY_POLL
>    Threaded busy poll is enabled/running for this napi.
>
> Following changes are introduced in the napi scheduling and state logic:
>
> - When threaded busy poll is enabled through netlink it also enables
>   NAPI_STATE_THREADED so a kthread is created per napi. It also sets
>   NAPI_STATE_THREADED_BUSY_POLL bit on each napi to indicate that it is
>   going to busy poll the napi.
>
> - When napi is scheduled with NAPI_STATE_SCHED_THREADED and associated
>   kthread is woken up, the kthread owns the context. If
>   NAPI_STATE_THREADED_BUSY_POLL and NAPI_STATE_SCHED_THREADED both are
>   set then it means that kthread can busy poll.
>
> - To keep busy polling and to avoid scheduling of the interrupts, the
>   napi_complete_done returns false when both NAPI_STATE_SCHED_THREADED
>   and NAPI_STATE_THREADED_BUSY_POLL flags are set. Also
>   napi_complete_done returns early to avoid the
>   NAPI_STATE_SCHED_THREADED being unset.
>
> - If at any point NAPI_STATE_THREADED_BUSY_POLL is unset, the
>   napi_complete_done will run and unset the NAPI_STATE_SCHED_THREADED
>   bit also. This will make the associated kthread go to sleep as per
>   existing logic.
>
> Signed-off-by: Samiullah Khawaja

I think you need to spend some time trying to make this code more..
elegant.

> diff --git a/Documentation/netlink/specs/netdev.yaml b/Documentation/netlink/specs/netdev.yaml
> index c035dc0f64fd..ce28e8708a87 100644
> --- a/Documentation/netlink/specs/netdev.yaml
> +++ b/Documentation/netlink/specs/netdev.yaml
> @@ -88,7 +88,7 @@ definitions:
>    -
>      name: napi-threaded
>      type: enum
> -    entries: [disabled, enabled]
> +    entries: [disabled, enabled, busy-poll-enabled]

drop the -enabled

>  attribute-sets:
>    -
> @@ -291,7 +291,8 @@ attribute-sets:
>        name: threaded
>        doc: Whether the NAPI is configured to operate in threaded polling
>            mode. If this is set to enabled then the NAPI context operates
> -          in threaded polling mode.
> +          in threaded polling mode. If this is set to busy-poll-enabled
> +          then the NAPI kthread also does busypolling.

I don't think busypolling is a word? I mean, I don't think English
combines words like this.

> +Threaded NAPI busy polling
> +--------------------------

Please feed the documentation into a grammar checker. A bunch of
articles seem to be missing.

> +Threaded NAPI allows processing of packets from each NAPI in a kthread in
> +kernel. Threaded NAPI busy polling extends this and adds support to do
> +continuous busy polling of this NAPI. This can be used to enable busy polling
> +independent of userspace application or the API (epoll, io_uring, raw sockets)
> +being used in userspace to process the packets.
> +
> +It can be enabled for each NAPI using netlink interface.

Netlink, capital letter

> +For example, using following script:
> +
> +.. code-block:: bash
> +
> +  $ ynl --family netdev --do napi-set \
> +        --json='{"id": 66, "threaded": "busy-poll-enabled"}'
> +
> +
> +Enabling it for each NAPI allows finer control to enable busy pollling for

pollling -> polling

> +only a set of NIC queues which will get traffic with low latency requirements.

A bit of a non-sequitur. Sounds like you just cut off the device-wide
config here.

> +Depending on application requirement, user might want to set affinity of the
> +kthread that is busy polling each NAPI. User might also want to set priority
> +and the scheduler of the thread depending on the latency requirements.
> +
> +For a hard low-latency application, user might want to dedicate the full core
> +for the NAPI polling so the NIC queue descriptors are picked up from the queue
> +as soon as they appear. Once enabled, the NAPI thread will poll the NIC queues
> +continuously without sleeping. This will keep the CPU core busy with 100%
> +usage. For more relaxed low-latency requirement, user might want to share the
> +core with other threads by setting thread affinity and priority.

Is there such a thing as priority in the Linux scheduler? Being more
specific would be useful.
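To be concrete -- if what the doc means is an RT scheduling class, it
could just show one. A minimal userspace sketch of what I'd imagine
(untested; assumes the kthread's tid was already looked up, IIRC the
threads are named napi/<dev>-<napi-id>; the CPU and priority values
are made up):

	#define _GNU_SOURCE
	#include <sched.h>

	/* Pin the NAPI kthread to CPU 3 and make it SCHED_FIFO prio 50.
	 * Error handling omitted for brevity.
	 */
	static void tune_napi_kthread(pid_t tid)
	{
		struct sched_param sp = { .sched_priority = 50 };
		cpu_set_t set;

		CPU_ZERO(&set);
		CPU_SET(3, &set);
		sched_setaffinity(tid, sizeof(set), &set);
		sched_setscheduler(tid, SCHED_FIFO, &sp);
	}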
I think this code is useful for forwarding or AF_XDP but normal socket
applications would struggle to use this mode.

> Threaded NAPI

> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index f3a3b761abfb..a88f6596aef7 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -427,6 +427,8 @@ enum {
> 	NAPI_STATE_THREADED,		/* The poll is performed inside its own thread*/
> 	NAPI_STATE_SCHED_THREADED,	/* Napi is currently scheduled in threaded mode */
> 	NAPI_STATE_HAS_NOTIFIER,	/* Napi has an IRQ notifier */
> +	NAPI_STATE_THREADED_BUSY_POLL,	/* The threaded napi poller will busy poll */
> +	NAPI_STATE_SCHED_THREADED_BUSY_POLL, /* The threaded napi poller is busy polling */

I don't get why you need 2 bits to implement this feature.

> @@ -1873,7 +1881,8 @@ enum netdev_reg_state {
>  *	@addr_len:	Hardware address length
>  *	@upper_level:	Maximum depth level of upper devices.
>  *	@lower_level:	Maximum depth level of lower devices.
> - *	@threaded:	napi threaded state.
> + *	@threaded:	napi threaded mode is disabled, enabled or
> + *			enabled with busy polling.

And you are still updating the device level kdoc.

>  *	@neigh_priv_len:	Used in neigh_alloc()
>  *	@dev_id:	Used to differentiate devices that share
>  *			the same link layer address
> @@ -78,6 +78,7 @@
>  #include
>  #include
>  #include
> +#include

Leftover from experiments with setting scheduler params in the core?
Or dare I say Google prod kernel?

>  #include
>  #include
>  #include
> @@ -6619,7 +6620,8 @@ bool napi_complete_done(struct napi_struct *n, int work_done)
>  	 * the guarantee we will be called later.
>  	 */
>  	if (unlikely(n->state & (NAPIF_STATE_NPSVC |
> -				 NAPIF_STATE_IN_BUSY_POLL)))
> +				 NAPIF_STATE_IN_BUSY_POLL |
> +				 NAPIF_STATE_SCHED_THREADED_BUSY_POLL)))

Why not just set the IN_BUSY_POLL when the thread starts polling?
What's the significance of the distinction?
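Something along these lines is what I'd have expected (untested
sketch) -- the thread marks the existing bit for the duration of the
busy poll instead of growing a second state:

	/* in napi_threaded_poll(), around the poll loop: reuse the
	 * pre-existing IN_BUSY_POLL bit while the kthread busy polls
	 */
	if (busy_poll)
		set_bit(NAPI_STATE_IN_BUSY_POLL, &napi->state);
	napi_threaded_poll_loop(napi, busy_poll);
	if (busy_poll)
		clear_bit(NAPI_STATE_IN_BUSY_POLL, &napi->state);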
>  	return false;
>
>  	if (work_done) {

> +static void napi_set_threaded_state(struct napi_struct *napi,
> +				    enum netdev_napi_threaded threaded)
> +{
> +	unsigned long val;
> +
> +	val = 0;
> +	if (threaded == NETDEV_NAPI_THREADED_BUSY_POLL_ENABLED)
> +		val |= NAPIF_STATE_THREADED_BUSY_POLL;
> +	if (threaded)
> +		val |= NAPIF_STATE_THREADED;

this reads odd, set threaded first then the sub-option

> +	set_mask_bits(&napi->state, NAPIF_STATE_THREADED_BUSY_POLL_MASK, val);

Does this actually have to be atomic? I don't think so.

> +}
> +
>  int napi_set_threaded(struct napi_struct *napi,
>  		      enum netdev_napi_threaded threaded)
>  {
> @@ -7050,7 +7066,7 @@ int napi_set_threaded(struct napi_struct *napi,
>  	} else {
>  		/* Make sure kthread is created before THREADED bit is set. */
>  		smp_mb__before_atomic();
> -		assign_bit(NAPI_STATE_THREADED, &napi->state, threaded);
> +		napi_set_threaded_state(napi, threaded);
>  	}
>
>  	return 0;
> @@ -7442,7 +7458,9 @@ void napi_disable_locked(struct napi_struct *n)
>  	}
>
>  	new = val | NAPIF_STATE_SCHED | NAPIF_STATE_NPSVC;
> -	new &= ~(NAPIF_STATE_THREADED | NAPIF_STATE_PREFER_BUSY_POLL);
> +	new &= ~(NAPIF_STATE_THREADED
> +		 | NAPIF_STATE_THREADED_BUSY_POLL
> +		 | NAPIF_STATE_PREFER_BUSY_POLL);

kernel coding style has | at the end of the line.
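IOW:

	new &= ~(NAPIF_STATE_THREADED |
		 NAPIF_STATE_THREADED_BUSY_POLL |
		 NAPIF_STATE_PREFER_BUSY_POLL);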
>  	} while (!try_cmpxchg(&n->state, &val, new));
>
>  	hrtimer_cancel(&n->timer);
> @@ -7486,7 +7504,7 @@ void napi_enable_locked(struct napi_struct *n)
>
>  		new = val & ~(NAPIF_STATE_SCHED | NAPIF_STATE_NPSVC);
>  		if (n->dev->threaded && n->thread)
> -			new |= NAPIF_STATE_THREADED;
> +			napi_set_threaded_state(n, n->dev->threaded);
>  	} while (!try_cmpxchg(&n->state, &val, new));
>  }
>  EXPORT_SYMBOL(napi_enable_locked);
> @@ -7654,7 +7672,7 @@ static int napi_thread_wait(struct napi_struct *napi)
>  	return -1;
>  }
>
> -static void napi_threaded_poll_loop(struct napi_struct *napi)
> +static void napi_threaded_poll_loop(struct napi_struct *napi, bool busy_poll)
>  {
>  	struct bpf_net_context __bpf_net_ctx, *bpf_net_ctx;
>  	struct softnet_data *sd;
> @@ -7683,22 +7701,58 @@ static void napi_threaded_poll_loop(struct napi_struct *napi)
>  		}
>  		skb_defer_free_flush(sd);
>  		bpf_net_ctx_clear(bpf_net_ctx);
> +
> +		/* Flush too old packets. If HZ < 1000, flush all packets */

Probably better to say something about the condition than just copy the
comment a third time.

> +		if (busy_poll)
> +			gro_flush_normal(&napi->gro, HZ >= 1000);
>  		local_bh_enable();
>
> -		if (!repoll)
> +		/* If busy polling then do not break here because we need to
> +		 * call cond_resched and rcu_softirq_qs_periodic to prevent
> +		 * watchdog warnings.
> +		 */
> +		if (!repoll && !busy_poll)
>  			break;
>
>  		rcu_softirq_qs_periodic(last_qs);
>  		cond_resched();
> +
> +		if (!repoll)
> +			break;
>  	}
>  }
>
>  static int napi_threaded_poll(void *data)
>  {
>  	struct napi_struct *napi = data;
> +	bool busy_poll_sched;
> +	unsigned long val;
> +	bool busy_poll;
> +
> +	while (!napi_thread_wait(napi)) {
> +		/* Once woken up, this means that we are scheduled as threaded
> +		 * napi and this thread owns the napi context, if busy poll

please capitalize NAPI in all comments and docs

> +		 * state is set then busy poll this napi.
> +		 */
> +		val = READ_ONCE(napi->state);
> +		busy_poll = val & NAPIF_STATE_THREADED_BUSY_POLL;
> +		busy_poll_sched = val & NAPIF_STATE_SCHED_THREADED_BUSY_POLL;
> +
> +		/* Do not busy poll if napi is disabled. */

It's not disabled, disable is pending

> +		if (unlikely(val & NAPIF_STATE_DISABLE))
> +			busy_poll = false;
> +
> +		if (busy_poll != busy_poll_sched) {
> +			val = busy_poll ?
> +			      NAPIF_STATE_SCHED_THREADED_BUSY_POLL :
> +			      0;
> +			set_mask_bits(&napi->state,
> +				      NAPIF_STATE_SCHED_THREADED_BUSY_POLL,
> +				      val);

why set a single bit with set_mask_bits()?

Please improve, IIRC it took quite a lot of reviewer effort to get the
thread stopping into shape, similar level of effort will not be
afforded again.
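FWIW, on the set_mask_bits() point above -- assign_bit() is the usual
helper for setting or clearing a single bit, something like (untested):

	assign_bit(NAPI_STATE_SCHED_THREADED_BUSY_POLL, &napi->state,
		   busy_poll);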