From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 976D32500CC for ; Mon, 25 Nov 2024 22:19:49 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=170.10.129.124 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1732573191; cv=none; b=k/BMolX7B74uD4scWgUzxQyZ2yeT/l7mzo+mQ9f/uSOgUWTesxwNUdP+KRuuzwJHei7NRnD93NcTOGhc4UuVGouAP4cLX0VQ9T0j2o6h9Ra2VXmkFK5FQa6FqhLmmCyJat+jO+VSBQci6CQU1I/f/hJqvLg+ZmntknKpFxlvRw0= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1732573191; c=relaxed/simple; bh=FbgVT8CtZaJUYW4PUlH8WrEiVPj7jSW5V+nyidW4wyY=; h=Date:From:To:Cc:Subject:Message-ID:References:MIME-Version: Content-Type:Content-Disposition:In-Reply-To; b=FJ0KBPQgnGli8p0js6PLQB9Dj+bwqOn27D5cR8FSHwEyuWNiLIr/4xrbCMitNlnVCasQ/paFxOIfL2GnBlhi4+UfHrNGZUfsGeGfPdDMsZQkXrx+af6R7pFK4GeVeu5nmybBMs8/CyOBaZISsTHh1kAHGNCPyVtdcl6NLKd6j14= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=redhat.com; spf=pass smtp.mailfrom=redhat.com; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b=dJ8WYtuZ; arc=none smtp.client-ip=170.10.129.124 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=redhat.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=redhat.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b="dJ8WYtuZ" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1732573188; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=aQUXO4Z20IXKMD6bN/TfftHEgC4oaBaDs3134FSiqJE=; b=dJ8WYtuZXHbC8FS8cHlKgDac0u+70ZHbVUXpMjpb5cwUpCqYCv1JgEZOuZPjLdyOrTpi4j LX/8bFyQgAsqiJcQDAqRoJrDtqnzrZjEW3YSG0DQMlXixj3oKTBAHMlfm9AxEisQc6ZS2A yj0mcGVYK2jO+aP+gX4UqgGALeTI57k= Received: from mail-wr1-f71.google.com (mail-wr1-f71.google.com [209.85.221.71]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-653-um8vvkIlNnuZfs7ltZCM9A-1; Mon, 25 Nov 2024 17:19:46 -0500 X-MC-Unique: um8vvkIlNnuZfs7ltZCM9A-1 X-Mimecast-MFC-AGG-ID: um8vvkIlNnuZfs7ltZCM9A Received: by mail-wr1-f71.google.com with SMTP id ffacd0b85a97d-38240d9ed31so2881554f8f.3 for ; Mon, 25 Nov 2024 14:19:46 -0800 (PST) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1732573185; x=1733177985; h=in-reply-to:content-disposition:mime-version:references:message-id :subject:cc:to:from:date:x-gm-message-state:from:to:cc:subject:date :message-id:reply-to; bh=aQUXO4Z20IXKMD6bN/TfftHEgC4oaBaDs3134FSiqJE=; b=SFC7nhR9N61b4AubkF+tWF8m++OnRSupU1tpjqlNNeW44Hm1djAtfa7+C2vv/Yxe3r qz/J/Mbjw2tIWXhEMo0mrX3v4ZR15+euWjYoEBjJiTeJZhftrNRpB8cwl+seXOvXGsTD U0SRiLDv8Zdx0cV9NNXt4GB5himviqoZM92KVgqo9/RyJS65uANaKwYvYTM7b3/Rnxoc pD+BNBUdNn3SZwBupLk8v+OLeuJnlCO81uMOjOM+uMgbtsOVtXsW0MfpK4sOpQJgt8ZE 0Ki7jsg2V1jrRgsm4IlRm5176spogmMfYdANrb5GS6nktZhDRSNCrQCNjVYEh2Qyt3Qb 3U6g== X-Forwarded-Encrypted: i=1; AJvYcCVpkHtT+NYUtgV3NC2SFJL/gn+/Y6DwtmMF9Lk6Zq0BaP5JUXSU7fPxvuWTT4zV9Hb+JhE=@vger.kernel.org X-Gm-Message-State: AOJu0YxNlPnuwxHog9fbZ/GQq3suA+M/Me37knQMP1QFUQbMQO3IMf9f kwo8Tqx3WAtjUnayhUB0O51j68w4Me7Qh9W92YfXULon87h1MUKuFob47Xr2ytFipd22U8vCCpa Q5dUZCCBE8Hw4vbkZBNnG2fn4FaDBkwhcwTdWsgGIQQx1fc75+Q== X-Gm-Gg: ASbGnctnEMsf23bi50m3M2e9Jcq8KMloInBuy/Q/nGqraoPMUiZT8EVxTQS7/zraPGT 2EHocGgoGroOCys/sGil8oiB09zCEV+U2Msoh9SljRyb7/w25wFqfzHRRgrkuLR3axZBpPD8GPf BXtUPNegzpy5/w0rM/xRKjKupLJi8rUTnAXqFg4A2eue8/QJ4WM4B7CmzDZ+JtJYFSg97RgO0+1 daqUNsJsgEfGxqWKKQBox59jp3Pd35MF8L4YC+DIH1moG78Vkt7VsYnC9KNnXplWMZB1TDOzf+i ju2AQJ0KtSNjJDzfTmsQgA== X-Received: by 2002:a05:6000:78c:b0:385:bee7:5c63 with SMTP id ffacd0b85a97d-385bee75c97mr2671440f8f.14.1732573184907; Mon, 25 Nov 2024 14:19:44 -0800 (PST) X-Google-Smtp-Source: AGHT+IE7/Qyz9bgV82qGNWSH5Tl3jHd0ty6lmhQ4ezBZR0e4vaXQ3DF97mcrZGxNdqjdY6lGH9mkww== X-Received: by 2002:a05:6000:78c:b0:385:bee7:5c63 with SMTP id ffacd0b85a97d-385bee75c97mr2671412f8f.14.1732573184445; Mon, 25 Nov 2024 14:19:44 -0800 (PST) Received: from localhost (net-93-146-37-148.cust.vodafonedsl.it. [93.146.37.148]) by smtp.gmail.com with ESMTPSA id ffacd0b85a97d-3825fb30bfdsm11625478f8f.56.2024.11.25.14.19.43 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 25 Nov 2024 14:19:43 -0800 (PST) Date: Mon, 25 Nov 2024 23:19:43 +0100 From: Lorenzo Bianconi To: Daniel Xu Cc: Jesper Dangaard Brouer , Alexander Lobakin , Lorenzo Bianconi , "bpf@vger.kernel.org" , Jakub Kicinski , Alexei Starovoitov , Daniel Borkmann , Andrii Nakryiko , John Fastabend , Martin KaFai Lau , David Miller , Eric Dumazet , Paolo Abeni , netdev@vger.kernel.org, kernel-team , mfleming@cloudflare.com Subject: Re: [RFC/RFT v2 0/3] Introduce GRO support to cpumap codebase Message-ID: References: <55d2ac1c-0619-4b24-b8ab-6eb5f553c1dd@intel.com> <01dcfecc-ab8e-43b8-b20c-96cc476a826d@intel.com> <05991551-415c-49d0-8f14-f99cb84fc5cb@intel.com> <25ujrqfgfkyek2mxh2c2kuuvyt5dyx2e6uysujgv3q43ezab4s@aedwgrlhnvft> Precedence: bulk X-Mailing-List: bpf@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha512; protocol="application/pgp-signature"; boundary="gE0rb2NCZ79+/A22" Content-Disposition: inline In-Reply-To: <25ujrqfgfkyek2mxh2c2kuuvyt5dyx2e6uysujgv3q43ezab4s@aedwgrlhnvft> --gE0rb2NCZ79+/A22 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable > Hi Jesper, >=20 > On Mon, Nov 25, 2024 at 07:50:41PM GMT, Jesper Dangaard Brouer wrote: > >=20 > >=20 > > On 25/11/2024 16.12, Alexander Lobakin wrote: > > > From: Daniel Xu > > > Date: Fri, 22 Nov 2024 17:10:06 -0700 > > >=20 > > > > Hi Olek, > > > >=20 > > > > Here are the results. > > > >=20 > > > > On Wed, Nov 13, 2024 at 03:39:13PM GMT, Daniel Xu wrote: > > > > >=20 > > > > >=20 > > > > > On Tue, Nov 12, 2024, at 9:43 AM, Alexander Lobakin wrote: > > >=20 > > > [...] > > >=20 > > > > Baseline (again) > > > >=20 > > > > Transactions Latency P50 (s) Latency P90 (s) Latency P99 (s) Thr= oughput (Mbit/s) > > > > Run 1 3169917 0.00007295 0.00007871 0.00009343 Run 1 21749= =2E43 > > > > Run 2 3228290 0.00007103 0.00007679 0.00009215 Run 2 21897= =2E17 > > > > Run 3 3226746 0.00007231 0.00007871 0.00009087 Run 3 21906= =2E82 > > > > Run 4 3191258 0.00007231 0.00007743 0.00009087 Run 4 21155= =2E15 > > > > Run 5 3235653 0.00007231 0.00007743 0.00008703 Run 5 21397= =2E06 > > > > Average 3210372.8 0.000072182 0.000077814 0.00009087 Average 21621= =2E126 > > > >=20 > >=20 > > We need to talk about what we are measuring, and how to control the > > experiment setup to get reproducible results. > > Especially controlling on what CPU cores our code paths are executing. > >=20 > > In above "baseline" case, we have two processes/tasks executing: > > (1) RX-napi softirq/thread (until napi_gro_receive deliver to socket) > > (2) Userspace netserver process TCP receiving data from socket. >=20 > "baseline" in this case is still cpumap, just without these GRO patches. >=20 > >=20 > > My experience is that you will see two noticeable different > > throughput performance results depending on whether (1) and (2) is > > executing on the *same* CPU (multi-tasking context-switching), > > or executing in parallel (e.g. pinned) on two different CPU cores. > >=20 > > The netperf command have an option > >=20 > > -T lcpu,remcpu > > Request that netperf be bound to local CPU lcpu and/or netserver = be > > bound to remote CPU rcpu. > >=20 > > Verify setting by listing pinning like this: > > for PID in $(pidof netserver); do taskset -pc $PID ; done > >=20 > > You can also set pinning runtime like this: > > export CPU=3D2; for PID in $(pidof netserver); do sudo taskset -pc $CP= U $PID; > > done > >=20 > > For troubleshooting, I like to use the periodic 1 sec (netperf -D1) > > output and adjust pinning runtime to observe the effect quickly. > >=20 > > My experience is unfortunately that TCP results have a lot of variation > > (thanks for incliding 5 runs in your benchmarks), as it depends on tasks > > timing, that can get affected by CPU sleep states. The systems CPU > > latency setting can be seen in /dev/cpu_dma_latency, which can be read > > like this: > >=20 > > sudo hexdump --format '"%d\n"' /dev/cpu_dma_latency > >=20 > > For playing with changing /dev/cpu_dma_latency I choose to use tuned-adm > > as it requires holding the file open. E.g I play with these profiles: > >=20 > > sudo tuned-adm profile throughput-performance > > sudo tuned-adm profile latency-performance > > sudo tuned-adm profile network-latency >=20 > Appreciate the tips - I should keep this saved somewhere. >=20 > >=20 > >=20 > > > > cpumap v2 Olek > > > >=20 > > > > Transactions Latency P50 (s) Latency P90 (s) Latency P99 (s) Thr= oughput (Mbit/s) > > > > Run 1 3253651 0.00007167 0.00007807 0.00009343 Run 1 13497= =2E57 > > > > Run 2 3221492 0.00007231 0.00007743 0.00009087 Run 2 12115= =2E53 > > > > Run 3 3296453 0.00007039 0.00007807 0.00009087 Run 3 12323= =2E38 > > > > Run 4 3254460 0.00007167 0.00007807 0.00009087 Run 4 12901= =2E88 > > > > Run 5 3173327 0.00007295 0.00007871 0.00009215 Run 5 12593= =2E22 > > > > Average 3239876.6 0.000071798 0.00007807 0.000091638 Average 12686= =2E316 > > > > Delta 0.92% -0.53% 0.33% 0.85% -4= 1.32% > > > >=20 > > > >=20 > >=20 > >=20 > > We now three processes/tasks executing: > > (1) RX-napi softirq/thread (doing XDP_REDIRECT into cpumap) > > (2) CPUmap kthread (until gro_receive_skb/gro_flush deliver to socket) > > (3) Userspace netserver process TCP receiving data from socket. > >=20 > > Again, now the performance is going to depend on depending on which CPU > > cores the processes/tasks are running and whether some are sharing the > > same CPU. (There are both wakeup timing and cache-line effects). > >=20 > > There are now more combinations to test... > >=20 > > CPUmap is a CPU scaling facility, and you will likely also see different > > CPU utilization on the difference cores one you start to pin these to > > control the scenarios. > >=20 > > > > It's very interesting that we see -40% tput w/ the patches. I went = back > > >=20 > >=20 > > Sad that we see -40% throughput... but do we know what CPU cores the > > now three different tasks/processes run on(?) > >=20 >=20 > Roughly, yes. For context, my primary use case for cpumap is to provide > some degree of isolation between colocated containers on a single host. > In particular, colocation occurs on AMD Bergamo. And containers are > CPU pinned to their own CCX (roughly). My RX steering program ensures > RX packets destined to a specific container are cpumap redirected to any > of the container's pinned CPUs. It not only provides a good measure of > isolation but ensures resources are properly accounted. >=20 > So to answer your question of which CPUs the 3 things run on: cpumap > kthread and application run on the same set of cores. More than that, > they share the same L3 cache by design. irq/softirq is effectively > random given default RSS config and IRQ affinities. >=20 >=20 > >=20 > > > Oh no, I messed up something =3D\ > > > > Could you please also test not the whole series, but patches 1-3 (= up to > > > "bpf:cpumap: switch to GRO...") and 1-4 (up to "bpf: cpumap: reuse skb > > > array...")? Would be great to see whether this implementation works > > > worse right from the start or I just broke something later on. > > >=20 > > > > and double checked and it seems the numbers are right. Here's the > > > > some output from some profiles I took with: > > > >=20 > > > > perf record -e cycles:k -a -- sleep 10 > > > > perf --no-pager diff perf.data.baseline perf.data.withpatches = > ... > > > >=20 > > > > # Event 'cycles:k' > > > > # Baseline Delta Abs Shared Object = Symbol > > > > 6.13% -3.60% [kernel.kallsyms] = [k] _copy_to_iter > > >=20 > >=20 > > I really appreciate that you provide perf data and perf diff, but as > > described above, we need data and information on what CPU cores are > > running which workload. > >=20 > > Fortunately perf diff (and perf report) support doing like this: > > perf diff --sort=3Dcpu,symbol > >=20 > > But then you also need to control the CPUs used in experiment for the > > diff to work. > >=20 > > I hope I made sense as these kind of CPU scaling benchmarks are tricky, >=20 > Indeed, sounds quite tricky. >=20 > My understanding with GRO is that it's a powerful general purpose > optimization. Enough that it should rise above the usual noise on a > reasonably configured system (which mine is). >=20 > Maybe we can consider decoupling the cpumap GRO enablement with the > later optimizations? I agree. First, we need to identify the best approach to enable GRO on cpum= ap (between Olek's approach and what I have suggested) and then we can evaluate subsequent optimizations. @Olek: do you agree? Regards, Lorenzo >=20 > So in Olek's above series, patches 1-3 seem like they would still > benefit from an simpler testbed. But the more targetted optimizations in > patch 4+ would probably justify a de-noised setup. Possibly single host > with xdp-trafficgen or something. >=20 > Procedurally speaking, maybe it would save some wasted effort if > everyone agreed on the general approach before investing more time into > finer optimizations built on top of the basic GRO support? >=20 > Thanks, > Daniel >=20 --gE0rb2NCZ79+/A22 Content-Type: application/pgp-signature; name="signature.asc" -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQTquNwa3Txd3rGGn7Y6cBh0uS2trAUCZ0T3/wAKCRA6cBh0uS2t rIwVAP4o+gnNTbb/ewx0Xp01ji0XIGOVlAQduKUyi85Y5/Vr5gEA7ORstLAqRQVv PLN/WFIuCxPy9Wv+dMNacu/DiJltIAI= =SCGB -----END PGP SIGNATURE----- --gE0rb2NCZ79+/A22--