BPF List
 help / color / mirror / Atom feed
From: "Ilpo Järvinen" <ij@kernel.org>
To: Paolo Abeni <pabeni@redhat.com>
Cc: chia-yu.chang@nokia-bell-labs.com, netdev@vger.kernel.org,
	 davem@davemloft.net, edumazet@google.com, kuba@kernel.org,
	 dsahern@kernel.org, netfilter-devel@vger.kernel.org,
	kadlec@netfilter.org,  coreteam@netfilter.org,
	pablo@netfilter.org, bpf@vger.kernel.org,
	 joel.granados@kernel.org, linux-fsdevel@vger.kernel.org,
	kees@kernel.org,  mcgrof@kernel.org, ncardwell@google.com,
	 koen.de_schepper@nokia-bell-labs.com, g.white@CableLabs.com,
	 ingemar.s.johansson@ericsson.com, mirja.kuehlewind@ericsson.com,
	 cheshire@apple.com, rs.ietf@gmx.at, Jason_Livingood@comcast.com,
	 vidhi_goel@apple.com
Subject: Re: [PATCH v4 net-next 09/14] gro: prevent ACE field corruption & better AccECN handling
Date: Tue, 29 Oct 2024 23:17:40 +0200 (EET)	[thread overview]
Message-ID: <4b3ede21-27cf-bc17-be71-4c388e670f2c@kernel.org> (raw)
In-Reply-To: <eb04ddfd-6e17-464b-a629-09aed99e2e95@redhat.com>

[-- Attachment #1: Type: text/plain, Size: 5007 bytes --]

On Tue, 29 Oct 2024, Paolo Abeni wrote:
> On 10/21/24 23:59, chia-yu.chang@nokia-bell-labs.com wrote:
> > From: Ilpo Järvinen <ij@kernel.org>
> > 
> > There are important differences in how the CWR field behaves
> > in RFC3168 and AccECN. With AccECN, CWR flag is part of the
> > ACE counter and its changes are important so adjust the flags
> > changed mask accordingly.
> > 
> > Also, if CWR is there, set the Accurate ECN GSO flag to avoid
> > corrupting CWR flag somewhere.
> > 
> > Signed-off-by: Ilpo Järvinen <ij@kernel.org>
> > Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
> > ---
> >  net/ipv4/tcp_offload.c | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> > 
> > diff --git a/net/ipv4/tcp_offload.c b/net/ipv4/tcp_offload.c
> > index 0b05f30e9e5f..f59762d88c38 100644
> > --- a/net/ipv4/tcp_offload.c
> > +++ b/net/ipv4/tcp_offload.c
> > @@ -329,7 +329,7 @@ struct sk_buff *tcp_gro_receive(struct list_head *head, struct sk_buff *skb,
> >  	th2 = tcp_hdr(p);
> >  	flush = (__force int)(flags & TCP_FLAG_CWR);
> >  	flush |= (__force int)((flags ^ tcp_flag_word(th2)) &
> > -		  ~(TCP_FLAG_CWR | TCP_FLAG_FIN | TCP_FLAG_PSH));
> > +		  ~(TCP_FLAG_FIN | TCP_FLAG_PSH));
> 
> If I read correctly, if the peer is using RFC3168 and TSO_ECN, GRO will
> now pump into the stack twice the number of packets it was doing prior
> to this patch, am I correct?
> 
> That is likely causing measurable performance regressions.

Hi Paolo,

Thanks for taking a look!

While it's true on surface that this might cause some more packets with 
RFC3168 (by design, as network cannot know if the sender is using RFC3168 
or not), the important question is the scale how many of extra packets 
will occur in practice.

First of all, RFC3168 requires CWR flag to be sent no more frequently 
than once per window of data, or in other words, once per RTT. And that
means just one packet, not e.g. all packets of a super-skb (the RFC3168
signalling will lose its integrity if this is violated by the sender).

Secondly, the TCP sender uses CWR flag to indicate it just halved its 
congestion window which mean it is sending half the amount of packets in 
this window than in the previous window (analoguous to halving sending 
rate). 2 RTTs with CWR each means two window reductions (this behavior 
is spec'ed in RFC3168).

So lets say the sender was using 100 packets congestion window, this 
change will add one packet to 50 packets on this next RTT. Note those are
raw numbers of packets on wire and do not tell how many packets GRO 
combined into each super-skb which will wary case-by-case basis. 
Regardless, I suspect the extra packet added to the half of the packets 
will be hard/impossible to measure to cause a performance regression.

This change would double the number of packets only if the congestion 
window is 1 or 2 packets and in that case TSO/GSO/GRO benefits will be 
pretty small to begin with (or even counterproductive). Also, the 
traditional TCP congestion control (RFC3168 included) has many issues 
anyway with that small windows because it doesn't deal with fractional 
congestion windows well.

> >  	flush |= (__force int)(th->ack_seq ^ th2->ack_seq);
> >  	for (i = sizeof(*th); i < thlen; i += 4)
> >  		flush |= *(u32 *)((u8 *)th + i) ^
> > @@ -405,7 +405,7 @@ void tcp_gro_complete(struct sk_buff *skb)
> >  	shinfo->gso_segs = NAPI_GRO_CB(skb)->count;
> >  
> >  	if (th->cwr)
> > -		shinfo->gso_type |= SKB_GSO_TCP_ECN;
> > +		shinfo->gso_type |= SKB_GSO_TCP_ACCECN;
> 
> If this packet is forwarded, it will not leverage TSO anymore - with
> current H/W.
> 
> I think we need a way to enable this feature conditionally, but I fear
> another sysctl will be ugly and the additional conditionals will not be
> good for GRO.
>
> Smarter suggestions welcome ;)

Well, it is already very selectively _conditional_, SKB_GSO_TCP_ACCECN is 
only set for the skb when CWR is set. That is, once per RTT (data window) 
when it comes to RFC3168. 

I don't have any source for this (other than reading many many tcpdumps 
in the past) but I believe the percentage of packets with CWR set (due to 
RFC3168 signalling) is going to be very small overall.

Do you think that is not good enough?

To answer more generally to your suggestion on making it conditional based 
on some other logic, it would mean you accept network middleboxes are 
allowed to corrupt AccECN ACE field when forwarding. If RFC3168 TSO/GSO 
trickery remains in use (without a middlebox explicitly tracking the 
connection had negotiated RFC3168), a forwarder won't be able to reproduce 
the exactly same stream of TCP packets headers thus corrupting non-RFC3168 
use of CWR flag. It's not something any middlebox should be doing (I hope 
we agree on this as a general principle)!

-- 
 i.
 
> Cheers,
> 
> Paolo
> 
> >  }
> >  EXPORT_SYMBOL(tcp_gro_complete);
> >  
> 
> 

  reply	other threads:[~2024-10-29 21:17 UTC|newest]

Thread overview: 24+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-10-21 21:58 [PATCH v4 net-next 00/14] AccECN protocol preparation patch series chia-yu.chang
2024-10-21 21:58 ` [PATCH v4 net-next 01/14] tcp: reorganize tcp_in_ack_event() and tcp_count_delivered() chia-yu.chang
2024-10-21 21:58 ` [PATCH v4 net-next 02/14] tcp: create FLAG_TS_PROGRESS chia-yu.chang
2024-10-21 21:58 ` [PATCH v4 net-next 03/14] tcp: use BIT() macro in include/net/tcp.h chia-yu.chang
2024-10-21 21:59 ` [PATCH v4 net-next 04/14] tcp: extend TCP flags to allow AE bit/ACE field chia-yu.chang
2024-10-29 11:43   ` Paolo Abeni
2024-10-29 11:45     ` Paolo Abeni
2024-10-21 21:59 ` [PATCH v4 net-next 05/14] tcp: reorganize SYN ECN code chia-yu.chang
2024-10-21 21:59 ` [PATCH v4 net-next 06/14] tcp: rework {__,}tcp_ecn_check_ce() -> tcp_data_ecn_check() chia-yu.chang
2024-10-21 21:59 ` [PATCH v4 net-next 07/14] tcp: helpers for ECN mode handling chia-yu.chang
2024-10-21 21:59 ` [PATCH v4 net-next 08/14] gso: AccECN support chia-yu.chang
2024-10-21 21:59 ` [PATCH v4 net-next 09/14] gro: prevent ACE field corruption & better AccECN handling chia-yu.chang
2024-10-29 12:03   ` Paolo Abeni
2024-10-29 21:17     ` Ilpo Järvinen [this message]
2024-10-21 21:59 ` [PATCH v4 net-next 10/14] tcp: AccECN support to tcp_add_backlog chia-yu.chang
2024-10-21 21:59 ` [PATCH v4 net-next 11/14] tcp: allow ECN bits in TOS/traffic class chia-yu.chang
2024-10-29 12:18   ` Paolo Abeni
2024-10-21 21:59 ` [PATCH v4 net-next 12/14] tcp: Pass flags to __tcp_send_ack chia-yu.chang
2024-10-21 21:59 ` [PATCH v4 net-next 13/14] tcp: fast path functions later chia-yu.chang
2024-10-21 21:59 ` [PATCH v4 net-next 14/14] net: sysctl: introduce sysctl SYSCTL_FIVE chia-yu.chang
2024-10-29 12:26   ` Paolo Abeni
2024-10-29 21:29   ` Ilpo Järvinen
2024-10-31 14:08   ` Joel Granados
2024-10-31 15:44     ` Chia-Yu Chang (Nokia)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4b3ede21-27cf-bc17-be71-4c388e670f2c@kernel.org \
    --to=ij@kernel.org \
    --cc=Jason_Livingood@comcast.com \
    --cc=bpf@vger.kernel.org \
    --cc=cheshire@apple.com \
    --cc=chia-yu.chang@nokia-bell-labs.com \
    --cc=coreteam@netfilter.org \
    --cc=davem@davemloft.net \
    --cc=dsahern@kernel.org \
    --cc=edumazet@google.com \
    --cc=g.white@CableLabs.com \
    --cc=ingemar.s.johansson@ericsson.com \
    --cc=joel.granados@kernel.org \
    --cc=kadlec@netfilter.org \
    --cc=kees@kernel.org \
    --cc=koen.de_schepper@nokia-bell-labs.com \
    --cc=kuba@kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=mcgrof@kernel.org \
    --cc=mirja.kuehlewind@ericsson.com \
    --cc=ncardwell@google.com \
    --cc=netdev@vger.kernel.org \
    --cc=netfilter-devel@vger.kernel.org \
    --cc=pabeni@redhat.com \
    --cc=pablo@netfilter.org \
    --cc=rs.ietf@gmx.at \
    --cc=vidhi_goel@apple.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox