From: Martin Karsten <mkarsten@uwaterloo.ca>
To: Samiullah Khawaja <skhawaja@google.com>,
Jakub Kicinski <kuba@kernel.org>,
"David S . Miller" <davem@davemloft.net>,
Eric Dumazet <edumazet@google.com>,
Paolo Abeni <pabeni@redhat.com>,
almasrymina@google.com, willemb@google.com, jdamato@fastly.com
Cc: netdev@vger.kernel.org
Subject: Re: [PATCH net-next v5 0/4] Add support to do threaded napi busy poll
Date: Wed, 30 Apr 2025 11:23:20 -0400
Message-ID: <db35fe8a-05c3-4227-9b2b-eeca8b7cb75a@uwaterloo.ca>
In-Reply-To: <52e7cf72-6655-49ed-984c-44bd1ecb0d95@uwaterloo.ca>
On 2025-04-28 09:50, Martin Karsten wrote:
> On 2025-04-24 16:02, Samiullah Khawaja wrote:
[snip]
>> | Experiment | interrupts | SO_BUSYPOLL | SO_BUSYPOLL(separate) | NAPI threaded |
>> |---|---|---|---|---|
>> | 12 Kpkt/s + 0us delay | | | | |
>> | | p5: 12700 | p5: 12900 | p5: 13300 | p5: 12800 |
>> | | p50: 13100 | p50: 13600 | p50: 14100 | p50: 13000 |
>> | | p95: 13200 | p95: 13800 | p95: 14400 | p95: 13000 |
>> | | p99: 13200 | p99: 13800 | p99: 14400 | p99: 13000 |
>> | 32 Kpkt/s + 30us delay | | | | |
>> | | p5: 19900 | p5: 16600 | p5: 13100 | p5: 12800 |
>> | | p50: 21100 | p50: 17000 | p50: 13700 | p50: 13000 |
>> | | p95: 21200 | p95: 17100 | p95: 14000 | p95: 13000 |
>> | | p99: 21200 | p99: 17100 | p99: 14000 | p99: 13000 |
>> | 125 Kpkt/s + 6us delay | | | | |
>> | | p5: 14600 | p5: 17100 | p5: 13300 | p5: 12900 |
>> | | p50: 15400 | p50: 17400 | p50: 13800 | p50: 13100 |
>> | | p95: 15600 | p95: 17600 | p95: 14000 | p95: 13100 |
>> | | p99: 15600 | p99: 17600 | p99: 14000 | p99: 13100 |
>> | 12 Kpkt/s + 78us delay | | | | |
>> | | p5: 14100 | p5: 16700 | p5: 13200 | p5: 12600 |
>> | | p50: 14300 | p50: 17100 | p50: 13900 | p50: 12800 |
>> | | p95: 14300 | p95: 17200 | p95: 14200 | p95: 12800 |
>> | | p99: 14300 | p99: 17200 | p99: 14200 | p99: 12800 |
>> | 25 Kpkt/s + 38us delay | | | | |
>> | | p5: 19900 | p5: 16600 | p5: 13000 | p5: 12700 |
>> | | p50: 21000 | p50: 17100 | p50: 13800 | p50: 12900 |
>> | | p95: 21100 | p95: 17100 | p95: 14100 | p95: 12900 |
>> | | p99: 21100 | p99: 17100 | p99: 14100 | p99: 12900 |
>>
>> ## Observations
>>
>> - Without application processing, all approaches give the same latency
>> within a 1 usec range, with NAPI threaded giving the lowest latency.
>> - With application processing, the latency increases by 3-4 usecs when
>> doing inline polling.
>> - Using a dedicated core to drive napi polling keeps the latency the
>> same even with application processing. This is observed for both
>> userspace polling and threaded napi (in kernel).
>> - Napi threaded polling in the kernel lowers latency by 1-1.5 usecs
>> compared to userspace-driven polling on a separate core.
>> - With application processing, userspace gets a packet from the recv
>> ring, spends some time on application processing, and then does napi
>> polling. While application processing is happening, a dedicated core
>> doing napi polling can pull packets off the NAPI RX queue and
>> populate the AF_XDP recv ring. This means that when the application
>> thread is done with application processing, it has new packets ready
>> to receive and process in the recv ring.
>> - Napi threaded busy polling in the kernel with a dedicated core gives
>> consistent P5-P99 latency.
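
As an aside, the recv-ring interplay described in the last bullet looks
roughly like the sketch below (libxdp xsk helpers; process_packet(), the
batch size, and the omitted fill-ring refill are simplifications). With
in-thread busy polling the recvfrom() kick drives the NAPI poll inline;
with a dedicated poller (kernel thread or separate userspace core) it
merely reaps descriptors that are already there.

/* Minimal, illustrative AF_XDP receive loop (fill-ring refill omitted). */
#include <xdp/xsk.h>
#include <sys/socket.h>

extern void process_packet(__u64 addr, __u32 len); /* application work, not shown */

static void rx_loop(struct xsk_socket *xsk, struct xsk_ring_cons *rx)
{
	__u32 idx;

	for (;;) {
		/* Kick the kernel so it may run the NAPI poll inline
		 * (it has little to do when a dedicated poller keeps
		 * the ring populated). */
		recvfrom(xsk_socket__fd(xsk), NULL, 0, MSG_DONTWAIT, NULL, NULL);

		__u32 rcvd = xsk_ring_cons__peek(rx, 64, &idx);
		if (!rcvd)
			continue;

		for (__u32 i = 0; i < rcvd; i++) {
			const struct xdp_desc *desc =
				xsk_ring_cons__rx_desc(rx, idx + i);
			process_packet(desc->addr, desc->len);
		}
		xsk_ring_cons__release(rx, rcvd);
	}
}
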
> I've experimented with this some more. I can confirm latency savings of
> about 1 usec arising from busy-looping a NAPI thread on a dedicated core
> when compared to in-thread busy-polling. A few more comments:
>
> 1) I note that the experiment results above show that 'interrupts' is
> almost as fast as 'NAPI threaded' in the base case. I cannot confirm
> these results, because I currently only have (very) old hardware
> available for testing. However, these results make me question the
> necessity of the threaded busy-polling mechanism - also see Item 4) below.
I want to add one more thought, just to spell this out explicitly:
Assuming the latency benefits result from the better cache utilization
of two shorter processing loops (NAPI and application), each on a
dedicated core, it would make sense for softirq processing on the NAPI
core to be almost as fast. While there might be a small penalty for the
initial hardware interrupt, the subsequent softirq processing should not
differ much from what a NAPI spin-loop does. The experiments seem to
corroborate this, because the latency results for 'interrupts' and 'NAPI
threaded' are extremely close.
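
One way to check the cache-utilization assumption directly might be to
count L1D load misses around the processing loop with perf_event_open(2)
and compare per-packet miss rates between the combined loop and the
split setup. A rough sketch of the counter setup (the loop under test is
elided):

/* Count L1D load misses for the calling thread, including kernel-side
 * softirq/NAPI work (requires perf_event_paranoid <= 1 or CAP_PERFMON). */
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <string.h>
#include <stdio.h>
#include <unistd.h>

static int open_l1d_miss_counter(void)
{
	struct perf_event_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_HW_CACHE;
	attr.config = PERF_COUNT_HW_CACHE_L1D |
		      (PERF_COUNT_HW_CACHE_OP_READ << 8) |
		      (PERF_COUNT_HW_CACHE_RESULT_MISS << 16);
	attr.disabled = 1;

	/* This thread, any CPU, no group, no flags. */
	return syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
}

int main(void)
{
	long long misses;
	int fd = open_l1d_miss_counter();

	if (fd < 0) {
		perror("perf_event_open");
		return 1;
	}
	ioctl(fd, PERF_EVENT_IOC_RESET, 0);
	ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

	/* ... run the receive/processing loop under test here ... */

	ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
	if (read(fd, &misses, sizeof(misses)) == (ssize_t)sizeof(misses))
		printf("L1D load misses: %lld\n", misses);
	close(fd);
	return 0;
}
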
If this holds, it would be essential that interrupt handling happens on
a dedicated, otherwise-idle core, so it can react to hardware interrupts
right away and its local cache isn't dirtied by code other than softirq
processing. While this also means dedicating an entire core to NAPI
processing, at least that core wouldn't have to spin all the time,
hopefully reducing power consumption and heat generation.
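
For what it's worth, steering the RX interrupt to such a dedicated core
only requires writing a CPU mask to /proc/irq/<irq>/smp_affinity. A
minimal sketch, with the IRQ number and CPU as placeholders (the real
IRQ number comes from /proc/interrupts):

/* Pin one interrupt to a single CPU; the simple mask below only covers
 * CPUs 0-31, larger systems need a longer hex mask. */
#include <stdio.h>

static int pin_irq_to_cpu(unsigned int irq, unsigned int cpu)
{
	char path[64];
	FILE *f;

	snprintf(path, sizeof(path), "/proc/irq/%u/smp_affinity", irq);
	f = fopen(path, "w");
	if (!f)
		return -1;
	/* smp_affinity takes a hexadecimal CPU bitmask. */
	fprintf(f, "%x\n", 1u << cpu);
	return fclose(f) ? -1 : 0;
}

int main(void)
{
	return pin_irq_to_cpu(123, 2) ? 1 : 0;	/* placeholder IRQ and CPU */
}
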
Thanks,
Martin
> 2) The experiments reported here are symmetric in that they use the same
> polling variant at both the client and the server. When mixing things up
> by combining different polling variants, it becomes clear that the
> latency savings are split between both ends. The total savings of 1 usec
> are thus a combination of 0.5 usec at either end. So the ultimate
> trade-off is 0.5 usec of latency gain for burning 1 core.
>
> 3) I believe the savings arise from running two tight loops (separate
> NAPI and application) instead of one longer loop. The shorter loops
> likely result in better cache utilization on their respective dedicated
> cores (and L1 caches). However, I am not sure right now how to
> explicitly confirm this.
>
> 4) I still believe that the additional experiments with setting both
> delay and period are meaningless. They create corner cases where rate *
> delay is about 1 (e.g., 25 Kpkt/s * 38 us = ~0.95), i.e. close to 100%
> load. Nobody would run a latency-critical system at 100% load. I also
> note that the experiment program xsk_rr fails when trying
> to increase the load beyond saturation (client fails with 'xsk_rr:
> oustanding array full').
>
> 5) I worry that a mechanism like this might be misinterpreted as some
> kind of magic wand for improving performance and might end up being used
> in practice and cause substantial overhead without much gain. If
> accepted, I would hope that this will be documented very clearly and
> have appropriate warnings attached. Given that the patch cover letter is
> often used as a basis for documentation, I believe this should be
> spelled out in the cover letter.
>
> With the above in mind, someone else will need to judge whether (at
> most) 0.5 usec for burning a core is a worthy enough trade-off to
> justify inclusion of this mechanism. Maybe someone else can take a
> closer look at the 'interrupts' variant on modern hardware.
>
> Thanks,
> Martin