Subject: Re: [PATCH bpf-next V1 7/7] net: xdp: update documentation for xdp-rx-metadata.rst
From: Jesper Dangaard Brouer
Date: Fri, 13 Jun 2025 12:59:32 +0200
To: Daniel Borkmann, Stanislav Fomichev, Lorenzo Bianconi
Cc: Toke Høiland-Jørgensen, Daniel Borkmann, bpf@vger.kernel.org,
 netdev@vger.kernel.org, Jakub Kicinski, Alexei Starovoitov, Eric Dumazet,
 "David S. Miller", Paolo Abeni, sdf@fomichev.me, kernel-team@cloudflare.com,
 arthur@arthurfabre.com, jakub@cloudflare.com, Magnus Karlsson,
 Maciej Fijalkowski, arzeznik@cloudflare.com, Yan Zhai

On 11/06/2025 05.40, Stanislav Fomichev wrote:
> On 06/11, Lorenzo Bianconi wrote:
>>> Daniel Borkmann writes:
>>>
>> [...]
>>>>>
>>>>> Why not have a new flag for bpf_redirect that transparently stores all
>>>>> available metadata? If you care only about the redirect -> skb case.
>>>>> Might give us more wiggle room in the future to make it work with
>>>>> traits.
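
[ For concreteness, I read the "new flag" suggestion as something along
  the lines of the sketch below; BPF_F_STORE_RX_METADATA is a made-up
  name for illustration, not an existing flag:

      /* XDP prog: redirect to CPUMAP, asking the kernel to snapshot
       * all driver-provided RX metadata into the xdp_frame, with no
       * per-field kfunc calls in the BPF program itself. */
      return bpf_redirect_map(&cpu_map, cpu_idx, BPF_F_STORE_RX_METADATA);

  More below on why this alone doesn't cover our use-case. ]
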
>>>>
>>>> Also q from my side: If I understand the proposal correctly, in order to fully
>>>> populate an skb at some point, you have to call all the bpf_xdp_metadata_* kfuncs
>>>> to collect the data from the driver descriptors (indirect call), and then yet
>>>> again all equivalent bpf_xdp_store_rx_* kfuncs to re-store the data in struct
>>>> xdp_rx_meta again. This seems rather costly, and once you add more kfuncs with
>>>> metadata, aren't you better off switching to tc(x) directly so the driver can
>>>> do all this natively? :/
>>>
>>> I agree that the "one kfunc per metadata item" approach scales poorly. IIRC,
>>> the hope was (back when we added the initial HW metadata support) that we
>>> would be able to inline them to avoid the function call overhead.
>>>
>>> That being said, even with half a dozen function calls, that's still a
>>> lot less overhead than going all the way to TC(x). The goal of the use
>>> case here is to do as little work as possible on the CPU that initially
>>> receives the packet, instead moving the network stack processing (and
>>> skb allocation) to a different CPU with cpumap.
>>>
>>> So even if the *total* amount of work being done is a bit higher because
>>> of the kfunc overhead, that can still be beneficial because it's split
>>> between two (or more) CPUs.
>>>
>>> I'm sure Jesper has some concrete benchmarks for this lying around
>>> somewhere; hopefully he can share those :)
>>
>> Another possible approach would be to have some utility functions (not
>> kfuncs) used to 'store' the hw metadata in the xdp_frame, executed in each
>> driver codebase before performing XDP_REDIRECT. The downside of this
>> approach is that we need to parse the hw metadata twice if the eBPF
>> program bound to the NIC also consumes this info. What do you think?
>
> That's the option I was asking about. I'm assuming we should be able
> to reuse the existing xmo metadata callbacks for this. Hopefully we can
> hide it from the drivers as well.

I'm not against this idea of transparently storing all available
metadata into the xdp_frame (via some flag/config), but it does not fit
our production use-case. I also think that this can be added later.

We need the ability to overwrite the RX-hash value before redirecting a
packet to CPUMAP. (Remember, as the cover letter describes, the RX-hash
is needed *before* the GRO engine processes the packet in CPUMAP, which
is before TC/BPF.)

Our use-case for overwriting the RX-hash value is load-balancing
IPsec-encapsulated tunnel traffic at the XDP stage via CPUMAP redirects.
This is generally applicable to tunneling, in that we want to store the
RX-hash of the tunnel's inner headers. Our IPsec use-case has a
variation: we only decrypt[1] the first 32 bytes to calculate an LB hash
over the inner headers, and then redirect the original packet to CPUMAP.
The IPsec packets travel into a veth device, which we discovered will
send everything to a single RX-queue... because the RX-hash (calculated
by the netstack) will obviously use the outer headers, meaning this LB
doesn't scale.

I hope this makes it clear why we need the BPF-prog ability to
explicitly "store" the RX-hash in the xdp_frame.

--Jesper

[1] https://docs.ebpf.io/linux/kfuncs/bpf_crypto_decrypt/
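
P.S. To make the use-case concrete, here is a minimal sketch of the kind
of XDP program we have in mind. The bpf_xdp_store_rx_hash() signature
and the inner-header hash stub are my assumptions for illustration, not
a final API; a real implementation would decrypt and hash the inner
5-tuple via bpf_crypto_decrypt[1]:

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    #define NR_CPUS 64

    /* Proposed kfunc from this series; exact signature assumed here. */
    extern int bpf_xdp_store_rx_hash(struct xdp_md *ctx, __u32 hash) __ksym;

    struct {
            __uint(type, BPF_MAP_TYPE_CPUMAP);
            __uint(max_entries, NR_CPUS);
            __type(key, __u32);
            __type(value, struct bpf_cpumap_val);
    } cpu_map SEC(".maps");

    /* Placeholder: a real implementation decrypts the first 32 bytes
     * of the ESP payload and hashes the inner 5-tuple, so flows spread
     * across RX-queues/CPUs instead of all landing on one. */
    static __always_inline __u32 inner_headers_hash(struct xdp_md *ctx)
    {
            return 0; /* stub */
    }

    SEC("xdp")
    int lb_ipsec_tunnel(struct xdp_md *ctx)
    {
            __u32 hash = inner_headers_hash(ctx);

            /* Store the hash in the xdp_frame, so the remote CPU's GRO
             * engine sees it *before* netstack/TC processing runs. */
            bpf_xdp_store_rx_hash(ctx, hash);

            return bpf_redirect_map(&cpu_map, hash % NR_CPUS, 0);
    }

    char _license[] SEC("license") = "GPL";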