Re: [RFC/RFT v2 0/3] Introduce GRO support to cpumap codebase

From: Jesper Dangaard Brouer <hawk@kernel.org>
To: Alexander Lobakin <aleksander.lobakin@intel.com>,
	Daniel Xu <dxu@dxuuu.xyz>
Cc: Lorenzo Bianconi <lorenzo@kernel.org>,
	"bpf@vger.kernel.org" <bpf@vger.kernel.org>,
	Jakub Kicinski <kuba@kernel.org>,
	Alexei Starovoitov <ast@kernel.org>,
	Daniel Borkmann <daniel@iogearbox.net>,
	Andrii Nakryiko <andrii@kernel.org>,
	John Fastabend <john.fastabend@gmail.com>,
	Martin KaFai Lau <martin.lau@linux.dev>,
	David Miller <davem@davemloft.net>,
	Eric Dumazet <edumazet@google.com>,
	Paolo Abeni <pabeni@redhat.com>,
	netdev@vger.kernel.org,
	Lorenzo Bianconi <lorenzo.bianconi@redhat.com>,
	kernel-team <kernel-team@cloudflare.com>,
	mfleming@cloudflare.com
Subject: Re: [RFC/RFT v2 0/3] Introduce GRO support to cpumap codebase
Date: Mon, 25 Nov 2024 19:50:41 +0100
Message-ID: <fcaae4c8-4083-4eef-8cfe-3d1f7e340079@kernel.org>
In-Reply-To: <05991551-415c-49d0-8f14-f99cb84fc5cb@intel.com>

On 25/11/2024 16.12, Alexander Lobakin wrote:
> From: Daniel Xu <dxu@dxuuu.xyz>
> Date: Fri, 22 Nov 2024 17:10:06 -0700
> 
>> Hi Olek,
>>
>> Here are the results.
>>
>> On Wed, Nov 13, 2024 at 03:39:13PM GMT, Daniel Xu wrote:
>>>
>>>
>>> On Tue, Nov 12, 2024, at 9:43 AM, Alexander Lobakin wrote:
> 
> [...]
> 
>> Baseline (again)
>>
>>          Transactions  Latency P50 (s)  Latency P90 (s)  Latency P99 (s)  Throughput (Mbit/s)
>> Run 1    3169917       0.00007295       0.00007871       0.00009343       21749.43
>> Run 2    3228290       0.00007103       0.00007679       0.00009215       21897.17
>> Run 3    3226746       0.00007231       0.00007871       0.00009087       21906.82
>> Run 4    3191258       0.00007231       0.00007743       0.00009087       21155.15
>> Run 5    3235653       0.00007231       0.00007743       0.00008703       21397.06
>> Average  3210372.8     0.000072182      0.000077814      0.00009087       21621.126
>>

We need to talk about what we are measuring, and how to control the
experiment setup to get reproducible results. In particular, we need to
control which CPU cores our code paths execute on.

In the above "baseline" case, we have two processes/tasks executing:
  (1) RX-napi softirq/thread (until napi_gro_receive delivers to socket)
  (2) Userspace netserver process receiving TCP data from the socket.

My experience is that you will see two noticeably different
throughput results depending on whether (1) and (2) are
executing on the *same* CPU (multi-tasking, context-switching)
or executing in parallel (e.g. pinned) on two different CPU cores.

The netperf command has an option for this:

  -T lcpu,remcpu
       Request that netperf be bound to local CPU lcpu and/or netserver
       be bound to remote CPU rcpu.

Verify the setting by listing the pinning like this:
   for PID in $(pidof netserver); do taskset -pc $PID ; done

You can also change the pinning at runtime like this:
  export CPU=2; for PID in $(pidof netserver); do sudo taskset -pc $CPU $PID; done

For troubleshooting, I like to use the periodic 1 sec (netperf -D1)
output and adjust the pinning at runtime to observe the effect quickly.

My experience is unfortunately that TCP results have a lot of variation
(thanks for including 5 runs in your benchmarks), as they depend on task
timing, which can be affected by CPU sleep states. The system's CPU
latency setting can be seen in /dev/cpu_dma_latency, which can be read
like this:

  sudo hexdump --format '"%d\n"' /dev/cpu_dma_latency

To experiment with changing /dev/cpu_dma_latency I choose to use
tuned-adm, as a change only stays in effect while the file is held open.
E.g. I play with these profiles:

  sudo tuned-adm profile throughput-performance
  sudo tuned-adm profile latency-performance
  sudo tuned-adm profile network-latency
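
For a one-off experiment without tuned-adm, the hold-the-file-open
requirement can be sketched in plain shell. This is a minimal sketch, not
a tested tuning recipe: with_cpu_latency and the CPU_DMA_LATENCY_DEV
override are hypothetical names, and the hex-string write format is an
assumption based on a reading of kernel/power/qos.c.

```shell
# Hypothetical helper: run a command while holding a CPU latency request.
# usage: with_cpu_latency USEC COMMAND [ARGS...]
# CPU_DMA_LATENCY_DEV is an assumed override for dry runs against a regular
# file; by default the real device node is used (needs root).
with_cpu_latency() (
    usec=$1; shift
    dev=${CPU_DMA_LATENCY_DEV:-/dev/cpu_dma_latency}
    # The request is only registered while this file descriptor stays open.
    exec 3> "$dev" || exit 1
    # Eight hex digits on purpose: a write that is not exactly 4 bytes long
    # is assumed to be parsed by the kernel as a hexadecimal string.
    printf '%08x' "$usec" >&3
    # Run the workload; fd 3 is closed when this subshell exits, which
    # drops the latency request again.
    "$@"
)
```

Run as root, for example "with_cpu_latency 0 sleep 60" in one terminal
while the benchmark runs in another; the hexdump command above should
then report 0 for as long as the helper is running.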

>> cpumap v2 Olek
>>
>>          Transactions  Latency P50 (s)  Latency P90 (s)  Latency P99 (s)  Throughput (Mbit/s)
>> Run 1    3253651       0.00007167       0.00007807       0.00009343       13497.57
>> Run 2    3221492       0.00007231       0.00007743       0.00009087       12115.53
>> Run 3    3296453       0.00007039       0.00007807       0.00009087       12323.38
>> Run 4    3254460       0.00007167       0.00007807       0.00009087       12901.88
>> Run 5    3173327       0.00007295       0.00007871       0.00009215       12593.22
>> Average  3239876.6     0.000071798      0.00007807       0.000091638      12686.316
>> Delta    0.92%         -0.53%           0.33%            0.85%            -41.32%
>>
>>

We now have three processes/tasks executing:
  (1) RX-napi softirq/thread (doing XDP_REDIRECT into cpumap)
  (2) CPUmap kthread (until gro_receive_skb/gro_flush delivers to socket)
  (3) Userspace netserver process receiving TCP data from the socket.

Again, the performance is now going to depend on which CPU cores the
processes/tasks are running on and whether some are sharing the same
CPU. (There are both wakeup timing and cache-line effects.)

There are now more combinations to test...

CPUmap is a CPU scaling facility, and you will likely also see different
CPU utilization on the different cores once you start to pin these to
control the scenarios.
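
To make the "more combinations" concrete: three tasks can share or not
share CPUs in five distinct ways. The sketch below only enumerates them
for three caller-supplied CPU ids; the helper name placements is
hypothetical, and how each task is actually placed (IRQ affinity for the
RX-napi side, the cpumap entry used for the kthread, taskset for
netserver) is outside the sketch.

```shell
# Hypothetical helper: list the five sharing patterns for three tasks.
# usage: placements CPU_A CPU_B CPU_C   (three distinct CPU ids)
placements() {
    a=$1 b=$2 c=$3
    # printf reuses the format string for each group of three arguments,
    # so every row below becomes one output line.
    printf 'napi=%s cpumap=%s netserver=%s\n' \
        "$a" "$a" "$a" \
        "$a" "$a" "$b" \
        "$a" "$b" "$a" \
        "$a" "$b" "$b" \
        "$a" "$b" "$c"
}
```

Each output line is one scenario to benchmark separately, from "all three
on one CPU" to "each task on its own CPU".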

>> It's very interesting that we see -40% tput w/ the patches. I went back
> 

Sad that we see -40% throughput... but do we know which CPU cores the
three different tasks/processes now run on(?)
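
For reference, the Delta row in the quoted table can be reproduced from
the two "Average" rows. A small sketch, where pct_delta is a hypothetical
helper name:

```shell
# Hypothetical helper: relative change of NEW against BASE, in percent.
# usage: pct_delta BASE NEW
pct_delta() {
    awk -v base="$1" -v new="$2" \
        'BEGIN { printf "%.2f\n", (new - base) / base * 100 }'
}

# Average throughput (Mbit/s): baseline vs. cpumap v2
pct_delta 21621.126 12686.316
# Average transactions: baseline vs. cpumap v2
pct_delta 3210372.8 3239876.6
```

The transaction and latency numbers move by less than 1%, while only the
throughput number drops by about 41%.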

> Oh no, I messed up something =\
>
> Could you please also test not the whole series, but patches 1-3 (up to
> "bpf:cpumap: switch to GRO...") and 1-4 (up to "bpf: cpumap: reuse skb
> array...")? Would be great to see whether this implementation works
> worse right from the start or I just broke something later on.
> 
>> and double checked and it seems the numbers are right. Here's the
>> some output from some profiles I took with:
>>
>>      perf record -e cycles:k -a -- sleep 10
>>      perf --no-pager diff perf.data.baseline perf.data.withpatches > ...
>>
>>      # Event 'cycles:k'
>>      # Baseline  Delta Abs  Shared Object                                                    Symbol
>>           6.13%     -3.60%  [kernel.kallsyms]                                                [k] _copy_to_iter
>

I really appreciate that you provide perf data and perf diff, but as
described above, we need data and information on what CPU cores are
running which workload.

Fortunately perf diff (and perf report) supports this:
  perf diff --sort=cpu,symbol

But then you also need to control the CPUs used in the experiment for
the diff to work.
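
A possible workflow, assuming the tasks have already been pinned to a
known set of CPUs: record only those CPUs for each kernel, then diff per
CPU. The wrapper names record_pinned and diff_per_cpu and the PERF
override are hypothetical; the perf options used are -e, -C (CPU list)
and -o for record, and --sort for diff.

```shell
# Hypothetical wrappers; PERF is an assumed override for dry runs
# (e.g. PERF=echo prints the command line instead of running perf).

# usage: record_pinned OUTPUT_FILE CPU_LIST [SECONDS]
record_pinned() {
    # -C restricts sampling to the CPUs the experiment was pinned to.
    ${PERF:-perf} record -e cycles:k -C "$2" -o "$1" -- sleep "${3:-10}"
}

# usage: diff_per_cpu BASELINE_FILE PATCHED_FILE
diff_per_cpu() {
    # Sorting by cpu first keeps each task's CPU in its own group.
    ${PERF:-perf} --no-pager diff --sort=cpu,symbol "$1" "$2"
}
```

For example, "record_pinned perf.data.baseline 2,4" on the baseline
kernel and "record_pinned perf.data.withpatches 2,4" on the patched one,
followed by "diff_per_cpu perf.data.baseline perf.data.withpatches".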

I hope this makes sense, as these kinds of CPU scaling benchmarks are tricky,
--Jesper
