Re: [RFC net-next 1/9] skb: introduce gro_disabled bit

From: Daniel Borkmann <daniel@iogearbox.net>
To: Willem de Bruijn <willemdebruijn.kernel@gmail.com>,
	Yan Zhai <yan@cloudflare.com>
Cc: netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	bpf@vger.kernel.org, kernel-team <kernel-team@cloudflare.com>
Subject: Re: [RFC net-next 1/9] skb: introduce gro_disabled bit
Date: Mon, 24 Jun 2024 15:30:12 +0200
Message-ID: <caecbff8-ffc4-976b-4516-dba41848ef30@iogearbox.net>
In-Reply-To: <6677db8b2ef78_33522729492@willemb.c.googlers.com.notmuch>

On 6/23/24 10:23 AM, Willem de Bruijn wrote:
> Yan Zhai wrote:
>> On Fri, Jun 21, 2024 at 11:41 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
>>> On 6/21/24 6:00 PM, Yan Zhai wrote:
>>>> On Fri, Jun 21, 2024 at 8:13 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
>>>>> On 6/21/24 2:15 PM, Willem de Bruijn wrote:
>>>>>> Yan Zhai wrote:
>>>>>>> Software GRO is currently controlled by a single switch, i.e.
>>>>>>>
>>>>>>>      ethtool -K dev gro on|off
>>>>>>>
>>>>>>> However, this is not always desired. When GRO is enabled, even if the
>>>>>>> kernel cannot GRO certain traffic, it has to run through the GRO receive
>>>>>>> handlers with no benefit.
>>>>>>>
>>>>>>> There are also scenarios that turning off GRO is a requirement. For
>>>>>>> example, our production environment has a scenario that a TC egress hook
>>>>>>> may add multiple encapsulation headers to forwarded skbs for load
>>>>>>> balancing and isolation purpose. The encapsulation is implemented via
>>>>>>> BPF. But the problem arises then: there is no way to properly offload a
>>>>>>> double-encapsulated packet, since skb only has network_header and
>>>>>>> inner_network_header to track one layer of encapsulation, but not two.
>>>>>>> On the other hand, not all the traffic through this device needs double
>>>>>>> encapsulation. But we have to turn off GRO completely for any ingress
>>>>>>> device as a result.
>>>>>>>
>>>>>>> Introduce a bit on skb so that GRO engine can be notified to skip GRO on
>>>>>>> this skb, rather than having to be 0-or-1 for all traffic.
>>>>>>>
>>>>>>> Signed-off-by: Yan Zhai <yan@cloudflare.com>
>>>>>>> ---
>>>>>>>     include/linux/netdevice.h |  9 +++++++--
>>>>>>>     include/linux/skbuff.h    | 10 ++++++++++
>>>>>>>     net/Kconfig               | 10 ++++++++++
>>>>>>>     net/core/gro.c            |  2 +-
>>>>>>>     net/core/gro_cells.c      |  2 +-
>>>>>>>     net/core/skbuff.c         |  4 ++++
>>>>>>>     6 files changed, 33 insertions(+), 4 deletions(-)
>>>>>>>
>>>>>>> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>>>>>>> index c83b390191d4..2ca0870b1221 100644
>>>>>>> --- a/include/linux/netdevice.h
>>>>>>> +++ b/include/linux/netdevice.h
>>>>>>> @@ -2415,11 +2415,16 @@ struct net_device {
>>>>>>>        ((dev)->devlink_port = (port));                         \
>>>>>>>     })
>>>>>>>
>>>>>>> -static inline bool netif_elide_gro(const struct net_device *dev)
>>>>>>> +static inline bool netif_elide_gro(const struct sk_buff *skb)
>>>>>>>     {
>>>>>>> -    if (!(dev->features & NETIF_F_GRO) || dev->xdp_prog)
>>>>>>> +    if (!(skb->dev->features & NETIF_F_GRO) || skb->dev->xdp_prog)
>>>>>>>                return true;
>>>>>>> +
>>>>>>> +#ifdef CONFIG_SKB_GRO_CONTROL
>>>>>>> +    return skb->gro_disabled;
>>>>>>> +#else
>>>>>>>        return false;
>>>>>>> +#endif
>>>>>>
>>>>>> Yet more branches in the hot path.
>>>>>>
>>>>>> Compile time configurability does not help, as that will be
>>>>>> enabled by distros.
>>>>>>
>>>>>> For a fairly niche use case. Where functionality of GRO already
>>>>>> works. So just a performance for a very rare case at the cost of a
>>>>>> regression in the common case. A small regression perhaps, but death
>>>>>> by a thousand cuts.
>>>>>
>>>>> Mentioning it here b/c it perhaps fits in this context, longer time ago
>>>>> there was the idea mentioned to have BPF operating as GRO engine which
>>>>> might also help to reduce attack surface by only having to handle packets
>>>>> of interest for the concrete production use case. Perhaps here meta data
>>>>> buffer could be used to pass a notification from XDP to exit early w/o
>>>>> aggregation.
>>>>
>>>> Metadata is in fact one of our interests as well. We discussed using
>>>> metadata instead of a skb bit to carry this information internally.
>>>> Since metadata is opaque atm, it seems the only option is to have a
>>>> GRO control hook before napi_gro_receive, and let BPF decide
>>>> netif_receive_skb or napi_gro_receive (echo what Paolo said). With BPF
>>>> it could indeed be more flexible, but the downside is that it could be
>>>> even slower than taking a bit on skb. I am actually open to
>>>> either approach, as long as it gives us more control on when to enable
>>>> GRO :)
>>>
>>> Oh wait, one thing that just came to mind.. have you tried u64 per-CPU
>>> counter map in XDP? For packets which should not be GRO-aggregated you
>>> add count++ into the meta data area, and this forces GRO to not aggregate
>>> since meta data that needs to be transported to tc BPF layer mismatches
>>> (and therefore the contract/intent is that tc BPF needs to see the different
>>> meta data passed to it).
>>
>> We did this before accidentally (we put a timestamp for debugging
>> purposes in metadata) and this actually caused about 20% of OoO for
>> TCP in production: all PSH packets are reordered. GRO does not fire
>> the packet to the upper layer when a diff in metadata is found for a
>> non-PSH packet, instead it is queued as a “new flow” on the GRO list
>> and waits for flushing. When a PSH packet arrives, its semantic is to
>> flush this packet immediately and thus precedes earlier packets of the
>> same flow.
> 
> Is that a bug in XDP metadata handling for GRO?
> 
> Mismatching metadata should not be taken as separate flows, but as a
> flush condition.

Definitely a bug, as it should flush. If no one is faster, I can add it to my
backlog to fix, but it will probably take a week before I get to it.
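
To make the difference concrete, here is a rough userspace sketch, not
kernel code. It is a toy model of the behaviour Yan described versus the
flush behaviour Willem suggested. All names in it (struct pkt, struct
gro_sim, flush_flow(), receive(), run()) are made up for illustration,
merging of packets is left out entirely, and the real logic in
net/core/gro.c is more involved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_PKTS 16

/* Toy packet: 'meta' stands in for the XDP metadata area. */
struct pkt {
	int flow;
	int seq;
	unsigned long meta;
	bool psh;
};

/* Toy GRO state: packets still held, and the order of delivery upward. */
struct gro_sim {
	struct pkt held[MAX_PKTS];
	size_t nheld;
	int out[MAX_PKTS];
	size_t nout;
};

static void deliver(struct gro_sim *s, int seq)
{
	assert(s->nout < MAX_PKTS);
	s->out[s->nout++] = seq;
}

/*
 * Deliver held packets of 'flow' whose metadata equals 'meta'
 * (want_equal) or differs from it (!want_equal); keep the rest in order.
 */
static void flush_flow(struct gro_sim *s, int flow, unsigned long meta,
		       bool want_equal)
{
	size_t i, keep = 0;

	for (i = 0; i < s->nheld; i++) {
		struct pkt h = s->held[i];

		if (h.flow == flow && (h.meta == meta) == want_equal)
			deliver(s, h.seq);
		else
			s->held[keep++] = h;
	}
	s->nheld = keep;
}

static void receive(struct gro_sim *s, struct pkt p, bool flush_on_mismatch)
{
	/* Suggested behaviour: differing metadata flushes the older packets. */
	if (flush_on_mismatch)
		flush_flow(s, p.flow, p.meta, false);

	if (p.psh) {
		/* PSH goes up right away, after any held packets it matches. */
		flush_flow(s, p.flow, p.meta, true);
		deliver(s, p.seq);
	} else {
		/* Non-PSH packets are held until a flush. */
		assert(s->nheld < MAX_PKTS);
		s->held[s->nheld++] = p;
	}
}

static void run(const struct pkt *pkts, size_t n, bool flush_on_mismatch,
		struct gro_sim *s)
{
	size_t i;

	s->nheld = 0;
	s->nout = 0;
	for (i = 0; i < n; i++)
		receive(s, pkts[i], flush_on_mismatch);
	/* End of poll: flush whatever is still held, oldest first. */
	for (i = 0; i < s->nheld; i++)
		deliver(s, s->held[i].seq);
	s->nheld = 0;
}
```

Feeding one flow with seq 1, 2, 3, a different meta value on each packet
and PSH set only on seq 3, run() with flush_on_mismatch == false delivers
3, 1, 2 (the PSH packet overtakes the held ones), while
flush_on_mismatch == true delivers 1, 2, 3.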

Thanks,
Daniel
