netdev.vger.kernel.org archive mirror
* [PATCH v5 net-next 00/13] AccECN protocol preparation patch series
@ 2024-11-05 10:06 chia-yu.chang
  2024-11-05 10:06 ` [PATCH v5 net-next 01/13] tcp: reorganize tcp_in_ack_event() and tcp_count_delivered() chia-yu.chang
                   ` (12 more replies)
  0 siblings, 13 replies; 37+ messages in thread
From: chia-yu.chang @ 2024-11-05 10:06 UTC (permalink / raw)
  To: netdev, dsahern, davem, edumazet, dsahern, pabeni, joel.granados,
	kuba, andrew+netdev, horms, pablo, kadlec, netfilter-devel,
	coreteam, ij, ncardwell, koen.de_schepper, g.white,
	ingemar.s.johansson, mirja.kuehlewind, cheshire, rs.ietf,
	Jason_Livingood, vidhi_goel
  Cc: Chia-Yu Chang

From: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>

Hello,

Specific changes in v5 (5-Nov-2024)
- Add helper function "tcp_flags_ntohs" to preserve the last 2 bytes of TCP flags in patch #4 (Paolo Abeni <pabeni@redhat.com>)
- Fix reverse X-mas tree order in patches #4, #11 (Paolo Abeni <pabeni@redhat.com>)
- Rename variable "delta" to "timestamp_delta" in patch #2 for clarity
- Remove patch #14 from this series (Paolo Abeni <pabeni@redhat.com>, Joel Granados <joel.granados@kernel.org>)

Specific changes in v4 (21-Oct-2024)
- Fix line length warning of patches #2, #4, #8, #10, #11, #14
- Fix spaces preferred around '|' (ctx:VxV) warning of patch #7
- Add missing CC'ed recipients to patches #4, #12, #14

Specific changes in v3 (19-Oct-2024)
- Fix build error in v2

Specific changes in v2 (18-Oct-2024)
- Fix warning caused by NETIF_F_GSO_ACCECN_BIT in patch #9 (Jakub Kicinski <kuba@kernel.org>)

The full patch series can be found in
https://github.com/L4STeam/linux-net-next/commits/upstream_l4steam/

The Accurate ECN draft can be found in
https://datatracker.ietf.org/doc/html/draft-ietf-tcpm-accurate-ecn-28

--
Chia-Yu

Chia-Yu Chang (1):
  tcp: use BIT() macro in include/net/tcp.h

Ilpo Järvinen (12):
  tcp: reorganize tcp_in_ack_event() and tcp_count_delivered()
  tcp: create FLAG_TS_PROGRESS
  tcp: extend TCP flags to allow AE bit/ACE field
  tcp: reorganize SYN ECN code
  tcp: rework {__,}tcp_ecn_check_ce() -> tcp_data_ecn_check()
  tcp: helpers for ECN mode handling
  gso: AccECN support
  gro: prevent ACE field corruption & better AccECN handling
  tcp: AccECN support to tcp_add_backlog
  tcp: allow ECN bits in TOS/traffic class
  tcp: Pass flags to __tcp_send_ack
  tcp: fast path functions later

 include/linux/netdev_features.h |   8 +-
 include/linux/netdevice.h       |   2 +
 include/linux/skbuff.h          |   2 +
 include/net/tcp.h               | 135 +++++++++++++++++++++-----------
 include/uapi/linux/tcp.h        |   9 ++-
 net/ethtool/common.c            |   1 +
 net/ipv4/bpf_tcp_ca.c           |   2 +-
 net/ipv4/ip_output.c            |   3 +-
 net/ipv4/tcp.c                  |   2 +-
 net/ipv4/tcp_dctcp.c            |   2 +-
 net/ipv4/tcp_dctcp.h            |   2 +-
 net/ipv4/tcp_input.c            | 120 ++++++++++++++++------------
 net/ipv4/tcp_ipv4.c             |  28 +++++--
 net/ipv4/tcp_minisocks.c        |   6 +-
 net/ipv4/tcp_offload.c          |  10 ++-
 net/ipv4/tcp_output.c           |  23 +++---
 net/ipv6/tcp_ipv6.c             |  26 ++++--
 net/netfilter/nf_log_syslog.c   |   8 +-
 18 files changed, 249 insertions(+), 140 deletions(-)

-- 
2.34.1


^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH v5 net-next 01/13] tcp: reorganize tcp_in_ack_event() and tcp_count_delivered()
  2024-11-05 10:06 [PATCH v5 net-next 00/13] AccECN protocol preparation patch series chia-yu.chang
@ 2024-11-05 10:06 ` chia-yu.chang
  2024-11-07  9:31   ` Eric Dumazet
  2024-11-05 10:06 ` [PATCH v5 net-next 02/13] tcp: create FLAG_TS_PROGRESS chia-yu.chang
                   ` (11 subsequent siblings)
  12 siblings, 1 reply; 37+ messages in thread
From: chia-yu.chang @ 2024-11-05 10:06 UTC (permalink / raw)
  To: netdev, dsahern, davem, edumazet, dsahern, pabeni, joel.granados,
	kuba, andrew+netdev, horms, pablo, kadlec, netfilter-devel,
	coreteam, ij, ncardwell, koen.de_schepper, g.white,
	ingemar.s.johansson, mirja.kuehlewind, cheshire, rs.ietf,
	Jason_Livingood, vidhi_goel
  Cc: Chia-Yu Chang

From: Ilpo Järvinen <ij@kernel.org>

- Move tcp_count_delivered() earlier and split tcp_count_delivered_ce()
  out of it
- Move tcp_in_ack_event() later
- While at it, remove the inline from tcp_in_ack_event() and let
  the compiler decide

Accurate ECN's heuristics do not know whether there is going
to be an ACE-field-based CE counter increase until after the
rtx queue has been processed; only then is the number of ACKed
bytes/pkts available. As the CE outcome affects the presence
of FLAG_ECE, that information is not yet available at the old
location of the call to tcp_in_ack_event().

Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
---
 net/ipv4/tcp_input.c | 56 +++++++++++++++++++++++++-------------------
 1 file changed, 32 insertions(+), 24 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 5bdf13ac26ef..fc52eab4fcc9 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -413,6 +413,20 @@ static bool tcp_ecn_rcv_ecn_echo(const struct tcp_sock *tp, const struct tcphdr
 	return false;
 }
 
+static void tcp_count_delivered_ce(struct tcp_sock *tp, u32 ecn_count)
+{
+	tp->delivered_ce += ecn_count;
+}
+
+/* Updates the delivered and delivered_ce counts */
+static void tcp_count_delivered(struct tcp_sock *tp, u32 delivered,
+				bool ece_ack)
+{
+	tp->delivered += delivered;
+	if (ece_ack)
+		tcp_count_delivered_ce(tp, delivered);
+}
+
 /* Buffer size and advertised window tuning.
  *
  * 1. Tuning sk->sk_sndbuf, when connection enters established state.
@@ -1148,15 +1162,6 @@ void tcp_mark_skb_lost(struct sock *sk, struct sk_buff *skb)
 	}
 }
 
-/* Updates the delivered and delivered_ce counts */
-static void tcp_count_delivered(struct tcp_sock *tp, u32 delivered,
-				bool ece_ack)
-{
-	tp->delivered += delivered;
-	if (ece_ack)
-		tp->delivered_ce += delivered;
-}
-
 /* This procedure tags the retransmission queue when SACKs arrive.
  *
  * We have three tag bits: SACKED(S), RETRANS(R) and LOST(L).
@@ -3856,12 +3861,23 @@ static void tcp_process_tlp_ack(struct sock *sk, u32 ack, int flag)
 	}
 }
 
-static inline void tcp_in_ack_event(struct sock *sk, u32 flags)
+static void tcp_in_ack_event(struct sock *sk, int flag)
 {
 	const struct inet_connection_sock *icsk = inet_csk(sk);
 
-	if (icsk->icsk_ca_ops->in_ack_event)
-		icsk->icsk_ca_ops->in_ack_event(sk, flags);
+	if (icsk->icsk_ca_ops->in_ack_event) {
+		u32 ack_ev_flags = 0;
+
+		if (flag & FLAG_WIN_UPDATE)
+			ack_ev_flags |= CA_ACK_WIN_UPDATE;
+		if (flag & FLAG_SLOWPATH) {
+			ack_ev_flags = CA_ACK_SLOWPATH;
+			if (flag & FLAG_ECE)
+				ack_ev_flags |= CA_ACK_ECE;
+		}
+
+		icsk->icsk_ca_ops->in_ack_event(sk, ack_ev_flags);
+	}
 }
 
 /* Congestion control has updated the cwnd already. So if we're in
@@ -3978,12 +3994,8 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
 		tcp_snd_una_update(tp, ack);
 		flag |= FLAG_WIN_UPDATE;
 
-		tcp_in_ack_event(sk, CA_ACK_WIN_UPDATE);
-
 		NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPHPACKS);
 	} else {
-		u32 ack_ev_flags = CA_ACK_SLOWPATH;
-
 		if (ack_seq != TCP_SKB_CB(skb)->end_seq)
 			flag |= FLAG_DATA;
 		else
@@ -3995,19 +4007,12 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
 			flag |= tcp_sacktag_write_queue(sk, skb, prior_snd_una,
 							&sack_state);
 
-		if (tcp_ecn_rcv_ecn_echo(tp, tcp_hdr(skb))) {
+		if (tcp_ecn_rcv_ecn_echo(tp, tcp_hdr(skb)))
 			flag |= FLAG_ECE;
-			ack_ev_flags |= CA_ACK_ECE;
-		}
 
 		if (sack_state.sack_delivered)
 			tcp_count_delivered(tp, sack_state.sack_delivered,
 					    flag & FLAG_ECE);
-
-		if (flag & FLAG_WIN_UPDATE)
-			ack_ev_flags |= CA_ACK_WIN_UPDATE;
-
-		tcp_in_ack_event(sk, ack_ev_flags);
 	}
 
 	/* This is a deviation from RFC3168 since it states that:
@@ -4034,6 +4039,8 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
 
 	tcp_rack_update_reo_wnd(sk, &rs);
 
+	tcp_in_ack_event(sk, flag);
+
 	if (tp->tlp_high_seq)
 		tcp_process_tlp_ack(sk, ack, flag);
 
@@ -4065,6 +4072,7 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
 	return 1;
 
 no_queue:
+	tcp_in_ack_event(sk, flag);
 	/* If data was DSACKed, see if we can undo a cwnd reduction. */
 	if (flag & FLAG_DSACKING_ACK) {
 		tcp_fastretrans_alert(sk, prior_snd_una, num_dupack, &flag,
-- 
2.34.1



* [PATCH v5 net-next 02/13] tcp: create FLAG_TS_PROGRESS
  2024-11-05 10:06 [PATCH v5 net-next 00/13] AccECN protocol preparation patch series chia-yu.chang
  2024-11-05 10:06 ` [PATCH v5 net-next 01/13] tcp: reorganize tcp_in_ack_event() and tcp_count_delivered() chia-yu.chang
@ 2024-11-05 10:06 ` chia-yu.chang
  2024-11-07  9:34   ` Eric Dumazet
  2024-11-05 10:06 ` [PATCH v5 net-next 03/13] tcp: use BIT() macro in include/net/tcp.h chia-yu.chang
                   ` (10 subsequent siblings)
  12 siblings, 1 reply; 37+ messages in thread
From: chia-yu.chang @ 2024-11-05 10:06 UTC (permalink / raw)
  To: netdev, dsahern, davem, edumazet, dsahern, pabeni, joel.granados,
	kuba, andrew+netdev, horms, pablo, kadlec, netfilter-devel,
	coreteam, ij, ncardwell, koen.de_schepper, g.white,
	ingemar.s.johansson, mirja.kuehlewind, cheshire, rs.ietf,
	Jason_Livingood, vidhi_goel
  Cc: Chia-Yu Chang

From: Ilpo Järvinen <ij@kernel.org>

Whenever the timestamp advances, it declares progress, which
can be used by other parts of the stack to decide that the
ACK is the most recent one seen so far.

AccECN will use this flag when deciding whether to use the
ACK to update AccECN state or not.
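The timestamp delta behind FLAG_TS_PROGRESS is computed with wrap-safe
signed 32-bit arithmetic, as in the patch's
"delta = tp->rx_opt.rcv_tsval - tp->rx_opt.ts_recent". A minimal
user-space sketch of that comparison (illustrative names only, not the
kernel code) shows why the signed cast matters when the 32-bit
timestamp wraps:

```c
#include <assert.h>
#include <stdint.h>

/* Wrap-safe "did the timestamp advance?" test: rcv_tsval and ts_recent
 * are u32 values that may wrap, so the progress decision is made on the
 * signed 32-bit difference, not on a direct unsigned comparison. */
static int ts_progress(uint32_t rcv_tsval, uint32_t ts_recent)
{
	int32_t delta = (int32_t)(rcv_tsval - ts_recent);

	return delta > 0;	/* positive delta -> FLAG_TS_PROGRESS */
}
```

A direct `rcv_tsval > ts_recent` comparison would wrongly report no
progress right after the 32-bit timestamp wraps around zero.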

Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
---
 net/ipv4/tcp_input.c | 37 ++++++++++++++++++++++++++++---------
 1 file changed, 28 insertions(+), 9 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index fc52eab4fcc9..ecb3de69c6de 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -102,6 +102,7 @@ int sysctl_tcp_max_orphans __read_mostly = NR_FILE;
 #define FLAG_NO_CHALLENGE_ACK	0x8000 /* do not call tcp_send_challenge_ack()	*/
 #define FLAG_ACK_MAYBE_DELAYED	0x10000 /* Likely a delayed ACK */
 #define FLAG_DSACK_TLP		0x20000 /* DSACK for tail loss probe */
+#define FLAG_TS_PROGRESS	0x40000 /* Positive timestamp delta */
 
 #define FLAG_ACKED		(FLAG_DATA_ACKED|FLAG_SYN_ACKED)
 #define FLAG_NOT_DUP		(FLAG_DATA|FLAG_WIN_UPDATE|FLAG_ACKED)
@@ -3813,8 +3814,16 @@ static void tcp_store_ts_recent(struct tcp_sock *tp)
 	tp->rx_opt.ts_recent_stamp = ktime_get_seconds();
 }
 
-static void tcp_replace_ts_recent(struct tcp_sock *tp, u32 seq)
+static int __tcp_replace_ts_recent(struct tcp_sock *tp, s32 tstamp_delta)
 {
+	tcp_store_ts_recent(tp);
+	return tstamp_delta > 0 ? FLAG_TS_PROGRESS : 0;
+}
+
+static int tcp_replace_ts_recent(struct tcp_sock *tp, u32 seq)
+{
+	s32 delta;
+
 	if (tp->rx_opt.saw_tstamp && !after(seq, tp->rcv_wup)) {
 		/* PAWS bug workaround wrt. ACK frames, the PAWS discard
 		 * extra check below makes sure this can only happen
@@ -3823,9 +3832,13 @@ static void tcp_replace_ts_recent(struct tcp_sock *tp, u32 seq)
 		 * Not only, also it occurs for expired timestamps.
 		 */
 
-		if (tcp_paws_check(&tp->rx_opt, 0))
-			tcp_store_ts_recent(tp);
+		if (tcp_paws_check(&tp->rx_opt, 0)) {
+			delta = tp->rx_opt.rcv_tsval - tp->rx_opt.ts_recent;
+			return __tcp_replace_ts_recent(tp, delta);
+		}
 	}
+
+	return 0;
 }
 
 /* This routine deals with acks during a TLP episode and ends an episode by
@@ -3982,7 +3995,7 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
 	 * is in window.
 	 */
 	if (flag & FLAG_UPDATE_TS_RECENT)
-		tcp_replace_ts_recent(tp, TCP_SKB_CB(skb)->seq);
+		flag |= tcp_replace_ts_recent(tp, TCP_SKB_CB(skb)->seq);
 
 	if ((flag & (FLAG_SLOWPATH | FLAG_SND_UNA_ADVANCED)) ==
 	    FLAG_SND_UNA_ADVANCED) {
@@ -6140,6 +6153,8 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb)
 	    TCP_SKB_CB(skb)->seq == tp->rcv_nxt &&
 	    !after(TCP_SKB_CB(skb)->ack_seq, tp->snd_nxt)) {
 		int tcp_header_len = tp->tcp_header_len;
+		s32 delta = 0;
+		int flag = 0;
 
 		/* Timestamp header prediction: tcp_header_len
 		 * is automatically equal to th->doff*4 due to pred_flags
@@ -6152,8 +6167,10 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb)
 			if (!tcp_parse_aligned_timestamp(tp, th))
 				goto slow_path;
 
+			delta = tp->rx_opt.rcv_tsval -
+				tp->rx_opt.ts_recent;
 			/* If PAWS failed, check it more carefully in slow path */
-			if ((s32)(tp->rx_opt.rcv_tsval - tp->rx_opt.ts_recent) < 0)
+			if (delta < 0)
 				goto slow_path;
 
 			/* DO NOT update ts_recent here, if checksum fails
@@ -6173,12 +6190,13 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb)
 				if (tcp_header_len ==
 				    (sizeof(struct tcphdr) + TCPOLEN_TSTAMP_ALIGNED) &&
 				    tp->rcv_nxt == tp->rcv_wup)
-					tcp_store_ts_recent(tp);
+					flag |= __tcp_replace_ts_recent(tp,
+									delta);
 
 				/* We know that such packets are checksummed
 				 * on entry.
 				 */
-				tcp_ack(sk, skb, 0);
+				tcp_ack(sk, skb, flag);
 				__kfree_skb(skb);
 				tcp_data_snd_check(sk);
 				/* When receiving pure ack in fast path, update
@@ -6209,7 +6227,8 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb)
 			if (tcp_header_len ==
 			    (sizeof(struct tcphdr) + TCPOLEN_TSTAMP_ALIGNED) &&
 			    tp->rcv_nxt == tp->rcv_wup)
-				tcp_store_ts_recent(tp);
+				flag |= __tcp_replace_ts_recent(tp,
+								delta);
 
 			tcp_rcv_rtt_measure_ts(sk, skb);
 
@@ -6224,7 +6243,7 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb)
 
 			if (TCP_SKB_CB(skb)->ack_seq != tp->snd_una) {
 				/* Well, only one small jumplet in fast path... */
-				tcp_ack(sk, skb, FLAG_DATA);
+				tcp_ack(sk, skb, flag | FLAG_DATA);
 				tcp_data_snd_check(sk);
 				if (!inet_csk_ack_scheduled(sk))
 					goto no_ack;
-- 
2.34.1



* [PATCH v5 net-next 03/13] tcp: use BIT() macro in include/net/tcp.h
  2024-11-05 10:06 [PATCH v5 net-next 00/13] AccECN protocol preparation patch series chia-yu.chang
  2024-11-05 10:06 ` [PATCH v5 net-next 01/13] tcp: reorganize tcp_in_ack_event() and tcp_count_delivered() chia-yu.chang
  2024-11-05 10:06 ` [PATCH v5 net-next 02/13] tcp: create FLAG_TS_PROGRESS chia-yu.chang
@ 2024-11-05 10:06 ` chia-yu.chang
  2024-11-07 10:20   ` Eric Dumazet
  2024-11-05 10:06 ` [PATCH v5 net-next 04/13] tcp: extend TCP flags to allow AE bit/ACE field chia-yu.chang
                   ` (9 subsequent siblings)
  12 siblings, 1 reply; 37+ messages in thread
From: chia-yu.chang @ 2024-11-05 10:06 UTC (permalink / raw)
  To: netdev, dsahern, davem, edumazet, dsahern, pabeni, joel.granados,
	kuba, andrew+netdev, horms, pablo, kadlec, netfilter-devel,
	coreteam, ij, ncardwell, koen.de_schepper, g.white,
	ingemar.s.johansson, mirja.kuehlewind, cheshire, rs.ietf,
	Jason_Livingood, vidhi_goel
  Cc: Chia-Yu Chang

From: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>

Use the BIT() macro for the TCP flags field and for the TCP
congestion control flags used by congestion control algorithms.

No functional changes.

Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Reviewed-by: Ilpo Järvinen <ij@kernel.org>
---
 include/net/tcp.h | 21 +++++++++++----------
 1 file changed, 11 insertions(+), 10 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index e9b37b76e894..cccc6b739532 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -26,6 +26,7 @@
 #include <linux/kref.h>
 #include <linux/ktime.h>
 #include <linux/indirect_call_wrapper.h>
+#include <linux/bits.h>
 
 #include <net/inet_connection_sock.h>
 #include <net/inet_timewait_sock.h>
@@ -911,14 +912,14 @@ static inline u32 tcp_rsk_tsval(const struct tcp_request_sock *treq)
 
 #define tcp_flag_byte(th) (((u_int8_t *)th)[13])
 
-#define TCPHDR_FIN 0x01
-#define TCPHDR_SYN 0x02
-#define TCPHDR_RST 0x04
-#define TCPHDR_PSH 0x08
-#define TCPHDR_ACK 0x10
-#define TCPHDR_URG 0x20
-#define TCPHDR_ECE 0x40
-#define TCPHDR_CWR 0x80
+#define TCPHDR_FIN	BIT(0)
+#define TCPHDR_SYN	BIT(1)
+#define TCPHDR_RST	BIT(2)
+#define TCPHDR_PSH	BIT(3)
+#define TCPHDR_ACK	BIT(4)
+#define TCPHDR_URG	BIT(5)
+#define TCPHDR_ECE	BIT(6)
+#define TCPHDR_CWR	BIT(7)
 
 #define TCPHDR_SYN_ECN	(TCPHDR_SYN | TCPHDR_ECE | TCPHDR_CWR)
 
@@ -1107,9 +1108,9 @@ enum tcp_ca_ack_event_flags {
 #define TCP_CA_UNSPEC	0
 
 /* Algorithm can be set on socket without CAP_NET_ADMIN privileges */
-#define TCP_CONG_NON_RESTRICTED 0x1
+#define TCP_CONG_NON_RESTRICTED		BIT(0)
 /* Requires ECN/ECT set on all packets */
-#define TCP_CONG_NEEDS_ECN	0x2
+#define TCP_CONG_NEEDS_ECN		BIT(1)
 #define TCP_CONG_MASK	(TCP_CONG_NON_RESTRICTED | TCP_CONG_NEEDS_ECN)
 
 union tcp_cc_info;
-- 
2.34.1



* [PATCH v5 net-next 04/13] tcp: extend TCP flags to allow AE bit/ACE field
  2024-11-05 10:06 [PATCH v5 net-next 00/13] AccECN protocol preparation patch series chia-yu.chang
                   ` (2 preceding siblings ...)
  2024-11-05 10:06 ` [PATCH v5 net-next 03/13] tcp: use BIT() macro in include/net/tcp.h chia-yu.chang
@ 2024-11-05 10:06 ` chia-yu.chang
  2024-11-07 10:23   ` Eric Dumazet
  2024-11-05 10:06 ` [PATCH v5 net-next 05/13] tcp: reorganize SYN ECN code chia-yu.chang
                   ` (8 subsequent siblings)
  12 siblings, 1 reply; 37+ messages in thread
From: chia-yu.chang @ 2024-11-05 10:06 UTC (permalink / raw)
  To: netdev, dsahern, davem, edumazet, dsahern, pabeni, joel.granados,
	kuba, andrew+netdev, horms, pablo, kadlec, netfilter-devel,
	coreteam, ij, ncardwell, koen.de_schepper, g.white,
	ingemar.s.johansson, mirja.kuehlewind, cheshire, rs.ietf,
	Jason_Livingood, vidhi_goel
  Cc: Chia-Yu Chang

From: Ilpo Järvinen <ij@kernel.org>

With AccECN, there is one additional TCP flag to be used (AE),
and the ACE field overloads the definition of the AE, CWR, and
ECE flags. As tcp_flags was previously only 1 byte, byte-order
handling needs to be added to it.
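As a side note to the ACE overload described above, here is a minimal
user-space sketch (an illustration, not the kernel implementation) of
how the three header bits form the 3-bit ACE counter field once AccECN
is negotiated; the bit values mirror the patch (ECE=BIT(6), CWR=BIT(7),
AE=BIT(8) — the AE bit is why tcp_flags had to grow from u8 to u16):

```c
#include <assert.h>
#include <stdint.h>

/* With AccECN, AE/CWR/ECE are no longer independent flags but are read
 * together as a 3-bit counter (the ACE field), AE being the most
 * significant bit. */
#define TCPHDR_ECE 0x040
#define TCPHDR_CWR 0x080
#define TCPHDR_AE  0x100

static unsigned int ace_field(uint16_t tcp_flags)
{
	return ((tcp_flags & TCPHDR_AE)  ? 4 : 0) |
	       ((tcp_flags & TCPHDR_CWR) ? 2 : 0) |
	       ((tcp_flags & TCPHDR_ECE) ? 1 : 0);
}
```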

Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
---
 include/net/tcp.h             | 11 +++++++++--
 include/uapi/linux/tcp.h      |  9 ++++++---
 net/ipv4/tcp_ipv4.c           |  2 +-
 net/ipv4/tcp_output.c         |  8 ++++----
 net/ipv6/tcp_ipv6.c           |  2 +-
 net/netfilter/nf_log_syslog.c |  8 +++++---
 6 files changed, 26 insertions(+), 14 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index cccc6b739532..a9948fe3537a 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -920,7 +920,14 @@ static inline u32 tcp_rsk_tsval(const struct tcp_request_sock *treq)
 #define TCPHDR_URG	BIT(5)
 #define TCPHDR_ECE	BIT(6)
 #define TCPHDR_CWR	BIT(7)
-
+#define TCPHDR_AE	BIT(8)
+#define TCPHDR_FLAGS_MASK (TCPHDR_FIN | TCPHDR_SYN | TCPHDR_RST | \
+			   TCPHDR_PSH | TCPHDR_ACK | TCPHDR_URG | \
+			   TCPHDR_ECE | TCPHDR_CWR | TCPHDR_AE)
+#define tcp_flags_ntohs(th) (ntohs(*(__be16 *)&tcp_flag_word(th)) & \
+			    TCPHDR_FLAGS_MASK)
+
+#define TCPHDR_ACE (TCPHDR_ECE | TCPHDR_CWR | TCPHDR_AE)
 #define TCPHDR_SYN_ECN	(TCPHDR_SYN | TCPHDR_ECE | TCPHDR_CWR)
 
 /* State flags for sacked in struct tcp_skb_cb */
@@ -955,7 +962,7 @@ struct tcp_skb_cb {
 			u16	tcp_gso_size;
 		};
 	};
-	__u8		tcp_flags;	/* TCP header flags. (tcp[13])	*/
+	__u16		tcp_flags;	/* TCP header flags (tcp[12-13])*/
 
 	__u8		sacked;		/* State flags for SACK.	*/
 	__u8		ip_dsfield;	/* IPv4 tos or IPv6 dsfield	*/
diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
index dbf896f3146c..3fe08d7dddaf 100644
--- a/include/uapi/linux/tcp.h
+++ b/include/uapi/linux/tcp.h
@@ -28,7 +28,8 @@ struct tcphdr {
 	__be32	seq;
 	__be32	ack_seq;
 #if defined(__LITTLE_ENDIAN_BITFIELD)
-	__u16	res1:4,
+	__u16	ae:1,
+		res1:3,
 		doff:4,
 		fin:1,
 		syn:1,
@@ -40,7 +41,8 @@ struct tcphdr {
 		cwr:1;
 #elif defined(__BIG_ENDIAN_BITFIELD)
 	__u16	doff:4,
-		res1:4,
+		res1:3,
+		ae:1,
 		cwr:1,
 		ece:1,
 		urg:1,
@@ -70,6 +72,7 @@ union tcp_word_hdr {
 #define tcp_flag_word(tp) (((union tcp_word_hdr *)(tp))->words[3])
 
 enum {
+	TCP_FLAG_AE  = __constant_cpu_to_be32(0x01000000),
 	TCP_FLAG_CWR = __constant_cpu_to_be32(0x00800000),
 	TCP_FLAG_ECE = __constant_cpu_to_be32(0x00400000),
 	TCP_FLAG_URG = __constant_cpu_to_be32(0x00200000),
@@ -78,7 +81,7 @@ enum {
 	TCP_FLAG_RST = __constant_cpu_to_be32(0x00040000),
 	TCP_FLAG_SYN = __constant_cpu_to_be32(0x00020000),
 	TCP_FLAG_FIN = __constant_cpu_to_be32(0x00010000),
-	TCP_RESERVED_BITS = __constant_cpu_to_be32(0x0F000000),
+	TCP_RESERVED_BITS = __constant_cpu_to_be32(0x0E000000),
 	TCP_DATA_OFFSET = __constant_cpu_to_be32(0xF0000000)
 };
 
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index a38c8b1f44db..536767723584 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2163,7 +2163,7 @@ static void tcp_v4_fill_cb(struct sk_buff *skb, const struct iphdr *iph,
 	TCP_SKB_CB(skb)->end_seq = (TCP_SKB_CB(skb)->seq + th->syn + th->fin +
 				    skb->len - th->doff * 4);
 	TCP_SKB_CB(skb)->ack_seq = ntohl(th->ack_seq);
-	TCP_SKB_CB(skb)->tcp_flags = tcp_flag_byte(th);
+	TCP_SKB_CB(skb)->tcp_flags = tcp_flags_ntohs(th);
 	TCP_SKB_CB(skb)->ip_dsfield = ipv4_get_dsfield(iph);
 	TCP_SKB_CB(skb)->sacked	 = 0;
 	TCP_SKB_CB(skb)->has_rxtstamp =
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 5485a70b5fe5..1b2f4a2e7332 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -400,7 +400,7 @@ static void tcp_ecn_send(struct sock *sk, struct sk_buff *skb,
 /* Constructs common control bits of non-data skb. If SYN/FIN is present,
  * auto increment end seqno.
  */
-static void tcp_init_nondata_skb(struct sk_buff *skb, u32 seq, u8 flags)
+static void tcp_init_nondata_skb(struct sk_buff *skb, u32 seq, u16 flags)
 {
 	skb->ip_summed = CHECKSUM_PARTIAL;
 
@@ -1382,7 +1382,7 @@ static int __tcp_transmit_skb(struct sock *sk, struct sk_buff *skb,
 	th->seq			= htonl(tcb->seq);
 	th->ack_seq		= htonl(rcv_nxt);
 	*(((__be16 *)th) + 6)	= htons(((tcp_header_size >> 2) << 12) |
-					tcb->tcp_flags);
+					(tcb->tcp_flags & TCPHDR_FLAGS_MASK));
 
 	th->check		= 0;
 	th->urg_ptr		= 0;
@@ -1603,8 +1603,8 @@ int tcp_fragment(struct sock *sk, enum tcp_queue tcp_queue,
 	struct sk_buff *buff;
 	int old_factor;
 	long limit;
+	u16 flags;
 	int nlen;
-	u8 flags;
 
 	if (WARN_ON(len > skb->len))
 		return -EINVAL;
@@ -2159,7 +2159,7 @@ static int tso_fragment(struct sock *sk, struct sk_buff *skb, unsigned int len,
 {
 	int nlen = skb->len - len;
 	struct sk_buff *buff;
-	u8 flags;
+	u16 flags;
 
 	/* All of a TSO frame must be composed of paged data.  */
 	DEBUG_NET_WARN_ON_ONCE(skb->len != skb->data_len);
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index c748eeae1453..fec9acffb167 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1737,7 +1737,7 @@ static void tcp_v6_fill_cb(struct sk_buff *skb, const struct ipv6hdr *hdr,
 	TCP_SKB_CB(skb)->end_seq = (TCP_SKB_CB(skb)->seq + th->syn + th->fin +
 				    skb->len - th->doff*4);
 	TCP_SKB_CB(skb)->ack_seq = ntohl(th->ack_seq);
-	TCP_SKB_CB(skb)->tcp_flags = tcp_flag_byte(th);
+	TCP_SKB_CB(skb)->tcp_flags = tcp_flags_ntohs(th);
 	TCP_SKB_CB(skb)->ip_dsfield = ipv6_get_dsfield(hdr);
 	TCP_SKB_CB(skb)->sacked = 0;
 	TCP_SKB_CB(skb)->has_rxtstamp =
diff --git a/net/netfilter/nf_log_syslog.c b/net/netfilter/nf_log_syslog.c
index 58402226045e..86d5fc5d28e3 100644
--- a/net/netfilter/nf_log_syslog.c
+++ b/net/netfilter/nf_log_syslog.c
@@ -216,7 +216,9 @@ nf_log_dump_tcp_header(struct nf_log_buf *m,
 	/* Max length: 9 "RES=0x3C " */
 	nf_log_buf_add(m, "RES=0x%02x ", (u_int8_t)(ntohl(tcp_flag_word(th) &
 					    TCP_RESERVED_BITS) >> 22));
-	/* Max length: 32 "CWR ECE URG ACK PSH RST SYN FIN " */
+	/* Max length: 35 "AE CWR ECE URG ACK PSH RST SYN FIN " */
+	if (th->ae)
+		nf_log_buf_add(m, "AE ");
 	if (th->cwr)
 		nf_log_buf_add(m, "CWR ");
 	if (th->ece)
@@ -516,7 +518,7 @@ dump_ipv4_packet(struct net *net, struct nf_log_buf *m,
 
 	/* Proto    Max log string length */
 	/* IP:	    40+46+6+11+127 = 230 */
-	/* TCP:     10+max(25,20+30+13+9+32+11+127) = 252 */
+	/* TCP:     10+max(25,20+30+13+9+35+11+127) = 255 */
 	/* UDP:     10+max(25,20) = 35 */
 	/* UDPLITE: 14+max(25,20) = 39 */
 	/* ICMP:    11+max(25, 18+25+max(19,14,24+3+n+10,3+n+10)) = 91+n */
@@ -526,7 +528,7 @@ dump_ipv4_packet(struct net *net, struct nf_log_buf *m,
 
 	/* (ICMP allows recursion one level deep) */
 	/* maxlen =  IP + ICMP +  IP + max(TCP,UDP,ICMP,unknown) */
-	/* maxlen = 230+   91  + 230 + 252 = 803 */
+	/* maxlen = 230+   91  + 230 + 255 = 806 */
 }
 
 static noinline_for_stack void
-- 
2.34.1



* [PATCH v5 net-next 05/13] tcp: reorganize SYN ECN code
  2024-11-05 10:06 [PATCH v5 net-next 00/13] AccECN protocol preparation patch series chia-yu.chang
                   ` (3 preceding siblings ...)
  2024-11-05 10:06 ` [PATCH v5 net-next 04/13] tcp: extend TCP flags to allow AE bit/ACE field chia-yu.chang
@ 2024-11-05 10:06 ` chia-yu.chang
  2024-11-07 10:25   ` Eric Dumazet
  2024-11-05 10:06 ` [PATCH v5 net-next 06/13] tcp: rework {__,}tcp_ecn_check_ce() -> tcp_data_ecn_check() chia-yu.chang
                   ` (7 subsequent siblings)
  12 siblings, 1 reply; 37+ messages in thread
From: chia-yu.chang @ 2024-11-05 10:06 UTC (permalink / raw)
  To: netdev, dsahern, davem, edumazet, dsahern, pabeni, joel.granados,
	kuba, andrew+netdev, horms, pablo, kadlec, netfilter-devel,
	coreteam, ij, ncardwell, koen.de_schepper, g.white,
	ingemar.s.johansson, mirja.kuehlewind, cheshire, rs.ietf,
	Jason_Livingood, vidhi_goel
  Cc: Chia-Yu Chang

From: Ilpo Järvinen <ij@kernel.org>

Prepare for AccECN, which needs access here to the IP ECN
field value that is only available after INET_ECN_xmit().

No functional changes.

Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
---
 net/ipv4/tcp_output.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 1b2f4a2e7332..9c47b46aa14d 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -347,10 +347,11 @@ static void tcp_ecn_send_syn(struct sock *sk, struct sk_buff *skb)
 	tp->ecn_flags = 0;
 
 	if (use_ecn) {
-		TCP_SKB_CB(skb)->tcp_flags |= TCPHDR_ECE | TCPHDR_CWR;
-		tp->ecn_flags = TCP_ECN_OK;
 		if (tcp_ca_needs_ecn(sk) || bpf_needs_ecn)
 			INET_ECN_xmit(sk);
+
+		TCP_SKB_CB(skb)->tcp_flags |= TCPHDR_ECE | TCPHDR_CWR;
+		tp->ecn_flags = TCP_ECN_OK;
 	}
 }
 
-- 
2.34.1



* [PATCH v5 net-next 06/13] tcp: rework {__,}tcp_ecn_check_ce() -> tcp_data_ecn_check()
  2024-11-05 10:06 [PATCH v5 net-next 00/13] AccECN protocol preparation patch series chia-yu.chang
                   ` (4 preceding siblings ...)
  2024-11-05 10:06 ` [PATCH v5 net-next 05/13] tcp: reorganize SYN ECN code chia-yu.chang
@ 2024-11-05 10:06 ` chia-yu.chang
  2024-11-07 10:30   ` Eric Dumazet
  2024-11-05 10:06 ` [PATCH v5 net-next 07/13] tcp: helpers for ECN mode handling chia-yu.chang
                   ` (6 subsequent siblings)
  12 siblings, 1 reply; 37+ messages in thread
From: chia-yu.chang @ 2024-11-05 10:06 UTC (permalink / raw)
  To: netdev, dsahern, davem, edumazet, dsahern, pabeni, joel.granados,
	kuba, andrew+netdev, horms, pablo, kadlec, netfilter-devel,
	coreteam, ij, ncardwell, koen.de_schepper, g.white,
	ingemar.s.johansson, mirja.kuehlewind, cheshire, rs.ietf,
	Jason_Livingood, vidhi_goel
  Cc: Chia-Yu Chang

From: Ilpo Järvinen <ij@kernel.org>

Rename tcp_ecn_check_ce to tcp_data_ecn_check as it is
called only for data segments, not for ACKs (with AccECN,
ACKs may also carry ECN bits).

The extra "layer" in tcp_ecn_check_ce() just checks whether
ECN is enabled; that check can be moved into
tcp_data_ecn_check() itself rather than keeping the __ variant.

No functional changes.

Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
---
 net/ipv4/tcp_input.c | 15 ++++++---------
 1 file changed, 6 insertions(+), 9 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index ecb3de69c6de..b5654f94453e 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -357,10 +357,13 @@ static void tcp_ecn_withdraw_cwr(struct tcp_sock *tp)
 	tp->ecn_flags &= ~TCP_ECN_QUEUE_CWR;
 }
 
-static void __tcp_ecn_check_ce(struct sock *sk, const struct sk_buff *skb)
+static void tcp_data_ecn_check(struct sock *sk, const struct sk_buff *skb)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 
+	if (!(tcp_sk(sk)->ecn_flags & TCP_ECN_OK))
+		return;
+
 	switch (TCP_SKB_CB(skb)->ip_dsfield & INET_ECN_MASK) {
 	case INET_ECN_NOT_ECT:
 		/* Funny extension: if ECT is not set on a segment,
@@ -389,12 +392,6 @@ static void __tcp_ecn_check_ce(struct sock *sk, const struct sk_buff *skb)
 	}
 }
 
-static void tcp_ecn_check_ce(struct sock *sk, const struct sk_buff *skb)
-{
-	if (tcp_sk(sk)->ecn_flags & TCP_ECN_OK)
-		__tcp_ecn_check_ce(sk, skb);
-}
-
 static void tcp_ecn_rcv_synack(struct tcp_sock *tp, const struct tcphdr *th)
 {
 	if ((tp->ecn_flags & TCP_ECN_OK) && (!th->ece || th->cwr))
@@ -866,7 +863,7 @@ static void tcp_event_data_recv(struct sock *sk, struct sk_buff *skb)
 	icsk->icsk_ack.lrcvtime = now;
 	tcp_save_lrcv_flowlabel(sk, skb);
 
-	tcp_ecn_check_ce(sk, skb);
+	tcp_data_ecn_check(sk, skb);
 
 	if (skb->len >= 128)
 		tcp_grow_window(sk, skb, true);
@@ -5028,7 +5025,7 @@ static void tcp_data_queue_ofo(struct sock *sk, struct sk_buff *skb)
 	bool fragstolen;
 
 	tcp_save_lrcv_flowlabel(sk, skb);
-	tcp_ecn_check_ce(sk, skb);
+	tcp_data_ecn_check(sk, skb);
 
 	if (unlikely(tcp_try_rmem_schedule(sk, skb, skb->truesize))) {
 		NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPOFODROP);
-- 
2.34.1



* [PATCH v5 net-next 07/13] tcp: helpers for ECN mode handling
  2024-11-05 10:06 [PATCH v5 net-next 00/13] AccECN protocol preparation patch series chia-yu.chang
                   ` (5 preceding siblings ...)
  2024-11-05 10:06 ` [PATCH v5 net-next 06/13] tcp: rework {__,}tcp_ecn_check_ce() -> tcp_data_ecn_check() chia-yu.chang
@ 2024-11-05 10:06 ` chia-yu.chang
  2024-11-07 12:26   ` Eric Dumazet
  2024-11-05 10:06 ` [PATCH v5 net-next 08/13] gso: AccECN support chia-yu.chang
                   ` (5 subsequent siblings)
  12 siblings, 1 reply; 37+ messages in thread
From: chia-yu.chang @ 2024-11-05 10:06 UTC (permalink / raw)
  To: netdev, dsahern, davem, edumazet, dsahern, pabeni, joel.granados,
	kuba, andrew+netdev, horms, pablo, kadlec, netfilter-devel,
	coreteam, ij, ncardwell, koen.de_schepper, g.white,
	ingemar.s.johansson, mirja.kuehlewind, cheshire, rs.ietf,
	Jason_Livingood, vidhi_goel
  Cc: Chia-Yu Chang

From: Ilpo Järvinen <ij@kernel.org>

Create helpers for TCP ECN modes. No functional changes.
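The subtlety in the encoding is that "pending" is both mode bits set at
once, so tcp_ecn_mode_rfc3168()/tcp_ecn_mode_accecn() must compare the
whole masked field, not test a single bit. A user-space mirror of the
helpers (same bit values as the patch, illustrative only):

```c
#include <assert.h>
#include <stdint.h>

/* Mirror of the mode encoding from the patch: RFC3168 and AccECN each
 * get one bit; "pending" (negotiation outcome not yet known) is both
 * bits set; a socket is "exactly RFC3168" only when the AccECN bit is
 * clear, hence the full-field comparison. */
#define ECN_MODE_RFC3168	(1u << 0)
#define ECN_MODE_ACCECN		(1u << 4)
#define ECN_MODE_ANY		(ECN_MODE_RFC3168 | ECN_MODE_ACCECN)
#define ECN_MODE_PENDING	(ECN_MODE_RFC3168 | ECN_MODE_ACCECN)

static int ecn_mode_rfc3168(uint32_t ecn_flags)
{
	return (ecn_flags & ECN_MODE_ANY) == ECN_MODE_RFC3168;
}

static int ecn_mode_accecn(uint32_t ecn_flags)
{
	return (ecn_flags & ECN_MODE_ANY) == ECN_MODE_ACCECN;
}

static int ecn_mode_pending(uint32_t ecn_flags)
{
	return (ecn_flags & ECN_MODE_PENDING) == ECN_MODE_PENDING;
}
```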

Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
---
 include/net/tcp.h        | 44 ++++++++++++++++++++++++++++++++++++----
 net/ipv4/tcp.c           |  2 +-
 net/ipv4/tcp_dctcp.c     |  2 +-
 net/ipv4/tcp_input.c     | 14 ++++++-------
 net/ipv4/tcp_minisocks.c |  4 +++-
 net/ipv4/tcp_output.c    |  6 +++---
 6 files changed, 55 insertions(+), 17 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index a9948fe3537a..215b7ba105be 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -372,10 +372,46 @@ static inline void tcp_dec_quickack_mode(struct sock *sk)
 	}
 }
 
-#define	TCP_ECN_OK		1
-#define	TCP_ECN_QUEUE_CWR	2
-#define	TCP_ECN_DEMAND_CWR	4
-#define	TCP_ECN_SEEN		8
+#define	TCP_ECN_MODE_RFC3168	BIT(0)
+#define	TCP_ECN_QUEUE_CWR	BIT(1)
+#define	TCP_ECN_DEMAND_CWR	BIT(2)
+#define	TCP_ECN_SEEN		BIT(3)
+#define	TCP_ECN_MODE_ACCECN	BIT(4)
+
+#define	TCP_ECN_DISABLED	0
+#define	TCP_ECN_MODE_PENDING	(TCP_ECN_MODE_RFC3168 | TCP_ECN_MODE_ACCECN)
+#define	TCP_ECN_MODE_ANY	(TCP_ECN_MODE_RFC3168 | TCP_ECN_MODE_ACCECN)
+
+static inline bool tcp_ecn_mode_any(const struct tcp_sock *tp)
+{
+	return tp->ecn_flags & TCP_ECN_MODE_ANY;
+}
+
+static inline bool tcp_ecn_mode_rfc3168(const struct tcp_sock *tp)
+{
+	return (tp->ecn_flags & TCP_ECN_MODE_ANY) == TCP_ECN_MODE_RFC3168;
+}
+
+static inline bool tcp_ecn_mode_accecn(const struct tcp_sock *tp)
+{
+	return (tp->ecn_flags & TCP_ECN_MODE_ANY) == TCP_ECN_MODE_ACCECN;
+}
+
+static inline bool tcp_ecn_disabled(const struct tcp_sock *tp)
+{
+	return !tcp_ecn_mode_any(tp);
+}
+
+static inline bool tcp_ecn_mode_pending(const struct tcp_sock *tp)
+{
+	return (tp->ecn_flags & TCP_ECN_MODE_PENDING) == TCP_ECN_MODE_PENDING;
+}
+
+static inline void tcp_ecn_mode_set(struct tcp_sock *tp, u8 mode)
+{
+	tp->ecn_flags &= ~TCP_ECN_MODE_ANY;
+	tp->ecn_flags |= mode;
+}
 
 enum tcp_tw_status {
 	TCP_TW_SUCCESS = 0,
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 0d704bda6c41..e30204394175 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -4107,7 +4107,7 @@ void tcp_get_info(struct sock *sk, struct tcp_info *info)
 		info->tcpi_rcv_wscale = tp->rx_opt.rcv_wscale;
 	}
 
-	if (tp->ecn_flags & TCP_ECN_OK)
+	if (tcp_ecn_mode_any(tp))
 		info->tcpi_options |= TCPI_OPT_ECN;
 	if (tp->ecn_flags & TCP_ECN_SEEN)
 		info->tcpi_options |= TCPI_OPT_ECN_SEEN;
diff --git a/net/ipv4/tcp_dctcp.c b/net/ipv4/tcp_dctcp.c
index 8a45a4aea933..03abe0848420 100644
--- a/net/ipv4/tcp_dctcp.c
+++ b/net/ipv4/tcp_dctcp.c
@@ -90,7 +90,7 @@ __bpf_kfunc static void dctcp_init(struct sock *sk)
 {
 	const struct tcp_sock *tp = tcp_sk(sk);
 
-	if ((tp->ecn_flags & TCP_ECN_OK) ||
+	if (tcp_ecn_mode_any(tp) ||
 	    (sk->sk_state == TCP_LISTEN ||
 	     sk->sk_state == TCP_CLOSE)) {
 		struct dctcp *ca = inet_csk_ca(sk);
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index b5654f94453e..bafd09ff9a70 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -334,7 +334,7 @@ static bool tcp_in_quickack_mode(struct sock *sk)
 
 static void tcp_ecn_queue_cwr(struct tcp_sock *tp)
 {
-	if (tp->ecn_flags & TCP_ECN_OK)
+	if (tcp_ecn_mode_rfc3168(tp))
 		tp->ecn_flags |= TCP_ECN_QUEUE_CWR;
 }
 
@@ -361,7 +361,7 @@ static void tcp_data_ecn_check(struct sock *sk, const struct sk_buff *skb)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 
-	if (!(tcp_sk(sk)->ecn_flags & TCP_ECN_OK))
+	if (tcp_ecn_disabled(tp))
 		return;
 
 	switch (TCP_SKB_CB(skb)->ip_dsfield & INET_ECN_MASK) {
@@ -394,19 +394,19 @@ static void tcp_data_ecn_check(struct sock *sk, const struct sk_buff *skb)
 
 static void tcp_ecn_rcv_synack(struct tcp_sock *tp, const struct tcphdr *th)
 {
-	if ((tp->ecn_flags & TCP_ECN_OK) && (!th->ece || th->cwr))
-		tp->ecn_flags &= ~TCP_ECN_OK;
+	if (tcp_ecn_mode_rfc3168(tp) && (!th->ece || th->cwr))
+		tcp_ecn_mode_set(tp, TCP_ECN_DISABLED);
 }
 
 static void tcp_ecn_rcv_syn(struct tcp_sock *tp, const struct tcphdr *th)
 {
-	if ((tp->ecn_flags & TCP_ECN_OK) && (!th->ece || !th->cwr))
-		tp->ecn_flags &= ~TCP_ECN_OK;
+	if (tcp_ecn_mode_rfc3168(tp) && (!th->ece || !th->cwr))
+		tcp_ecn_mode_set(tp, TCP_ECN_DISABLED);
 }
 
 static bool tcp_ecn_rcv_ecn_echo(const struct tcp_sock *tp, const struct tcphdr *th)
 {
-	if (th->ece && !th->syn && (tp->ecn_flags & TCP_ECN_OK))
+	if (th->ece && !th->syn && tcp_ecn_mode_rfc3168(tp))
 		return true;
 	return false;
 }
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index bb1fe1ba867a..bd6515ab660f 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -453,7 +453,9 @@ EXPORT_SYMBOL(tcp_openreq_init_rwin);
 static void tcp_ecn_openreq_child(struct tcp_sock *tp,
 				  const struct request_sock *req)
 {
-	tp->ecn_flags = inet_rsk(req)->ecn_ok ? TCP_ECN_OK : 0;
+	tcp_ecn_mode_set(tp, inet_rsk(req)->ecn_ok ?
+			     TCP_ECN_MODE_RFC3168 :
+			     TCP_ECN_DISABLED);
 }
 
 void tcp_ca_openreq_child(struct sock *sk, const struct dst_entry *dst)
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 9c47b46aa14d..f4ee86e2b1b5 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -322,7 +322,7 @@ static void tcp_ecn_send_synack(struct sock *sk, struct sk_buff *skb)
 	const struct tcp_sock *tp = tcp_sk(sk);
 
 	TCP_SKB_CB(skb)->tcp_flags &= ~TCPHDR_CWR;
-	if (!(tp->ecn_flags & TCP_ECN_OK))
+	if (tcp_ecn_disabled(tp))
 		TCP_SKB_CB(skb)->tcp_flags &= ~TCPHDR_ECE;
 	else if (tcp_ca_needs_ecn(sk) ||
 		 tcp_bpf_ca_needs_ecn(sk))
@@ -351,7 +351,7 @@ static void tcp_ecn_send_syn(struct sock *sk, struct sk_buff *skb)
 			INET_ECN_xmit(sk);
 
 		TCP_SKB_CB(skb)->tcp_flags |= TCPHDR_ECE | TCPHDR_CWR;
-		tp->ecn_flags = TCP_ECN_OK;
+		tcp_ecn_mode_set(tp, TCP_ECN_MODE_RFC3168);
 	}
 }
 
@@ -379,7 +379,7 @@ static void tcp_ecn_send(struct sock *sk, struct sk_buff *skb,
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 
-	if (tp->ecn_flags & TCP_ECN_OK) {
+	if (tcp_ecn_mode_rfc3168(tp)) {
 		/* Not-retransmitted data segment: set ECT and inject CWR. */
 		if (skb->len != tcp_header_len &&
 		    !before(TCP_SKB_CB(skb)->seq, tp->snd_nxt)) {
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v5 net-next 08/13] gso: AccECN support
  2024-11-05 10:06 [PATCH v5 net-next 00/13] AccECN protocol preparation patch series chia-yu.chang
                   ` (6 preceding siblings ...)
  2024-11-05 10:06 ` [PATCH v5 net-next 07/13] tcp: helpers for ECN mode handling chia-yu.chang
@ 2024-11-05 10:06 ` chia-yu.chang
  2024-11-07 12:41   ` Eric Dumazet
  2024-11-09 10:09   ` Ilpo Järvinen
  2024-11-05 10:06 ` [PATCH v5 net-next 09/13] gro: prevent ACE field corruption & better AccECN handling chia-yu.chang
                   ` (4 subsequent siblings)
  12 siblings, 2 replies; 37+ messages in thread
From: chia-yu.chang @ 2024-11-05 10:06 UTC (permalink / raw)
  To: netdev, dsahern, davem, edumazet, dsahern, pabeni, joel.granados,
	kuba, andrew+netdev, horms, pablo, kadlec, netfilter-devel,
	coreteam, ij, ncardwell, koen.de_schepper, g.white,
	ingemar.s.johansson, mirja.kuehlewind, cheshire, rs.ietf,
	Jason_Livingood, vidhi_goel
  Cc: Chia-Yu Chang

From: Ilpo Järvinen <ij@kernel.org>

Handling of the CWR flag differs between RFC 3168 ECN and AccECN.
With RFC 3168 ECN aware TSO (NETIF_F_TSO_ECN), the CWR flag is
cleared starting from the 2nd segment, which is incompatible with
how AccECN handles the CWR flag. Such super-segments are indicated
by SKB_GSO_TCP_ECN. With AccECN, the CWR flag (or more accurately,
the ACE field that also includes the ECE & AE flags) changes only
when new packet(s) with a CE mark arrive, so the flag must not be
changed within a super-skb. The new skb/feature flags are necessary
to prevent such TSO engines from corrupting the AccECN ACE counters
by clearing the CWR flag (if the CWR handling feature cannot be
turned off).

If the NIC is completely unaware of RFC 3168 ECN (does not support
NETIF_F_TSO_ECN), or its TSO engine can be set to leave the CWR flag
untouched despite also supporting NETIF_F_TSO_ECN, TSO can safely be
used with AccECN on such a NIC. This should be evaluated on a
per-NIC basis (not done in this patch series for any NICs).

For the cases where TSO cannot keep its hands off the CWR flag,
this patch provides a GSO fallback.

Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
---
 include/linux/netdev_features.h | 8 +++++---
 include/linux/netdevice.h       | 2 ++
 include/linux/skbuff.h          | 2 ++
 net/ethtool/common.c            | 1 +
 net/ipv4/tcp_offload.c          | 6 +++++-
 5 files changed, 15 insertions(+), 4 deletions(-)

diff --git a/include/linux/netdev_features.h b/include/linux/netdev_features.h
index 66e7d26b70a4..c59db449bcf0 100644
--- a/include/linux/netdev_features.h
+++ b/include/linux/netdev_features.h
@@ -53,12 +53,12 @@ enum {
 	NETIF_F_GSO_UDP_BIT,		/* ... UFO, deprecated except tuntap */
 	NETIF_F_GSO_UDP_L4_BIT,		/* ... UDP payload GSO (not UFO) */
 	NETIF_F_GSO_FRAGLIST_BIT,		/* ... Fraglist GSO */
+	NETIF_F_GSO_ACCECN_BIT,         /* TCP AccECN w/ TSO (no clear CWR) */
 	/**/NETIF_F_GSO_LAST =		/* last bit, see GSO_MASK */
-		NETIF_F_GSO_FRAGLIST_BIT,
+		NETIF_F_GSO_ACCECN_BIT,
 
 	NETIF_F_FCOE_CRC_BIT,		/* FCoE CRC32 */
 	NETIF_F_SCTP_CRC_BIT,		/* SCTP checksum offload */
-	__UNUSED_NETIF_F_37,
 	NETIF_F_NTUPLE_BIT,		/* N-tuple filters supported */
 	NETIF_F_RXHASH_BIT,		/* Receive hashing offload */
 	NETIF_F_RXCSUM_BIT,		/* Receive checksumming offload */
@@ -128,6 +128,7 @@ enum {
 #define NETIF_F_SG		__NETIF_F(SG)
 #define NETIF_F_TSO6		__NETIF_F(TSO6)
 #define NETIF_F_TSO_ECN		__NETIF_F(TSO_ECN)
+#define NETIF_F_GSO_ACCECN	__NETIF_F(GSO_ACCECN)
 #define NETIF_F_TSO		__NETIF_F(TSO)
 #define NETIF_F_VLAN_CHALLENGED	__NETIF_F(VLAN_CHALLENGED)
 #define NETIF_F_RXFCS		__NETIF_F(RXFCS)
@@ -210,7 +211,8 @@ static inline int find_next_netdev_feature(u64 feature, unsigned long start)
 				 NETIF_F_TSO_ECN | NETIF_F_TSO_MANGLEID)
 
 /* List of features with software fallbacks. */
-#define NETIF_F_GSO_SOFTWARE	(NETIF_F_ALL_TSO | NETIF_F_GSO_SCTP |	     \
+#define NETIF_F_GSO_SOFTWARE	(NETIF_F_ALL_TSO | \
+				 NETIF_F_GSO_ACCECN | NETIF_F_GSO_SCTP | \
 				 NETIF_F_GSO_UDP_L4 | NETIF_F_GSO_FRAGLIST)
 
 /*
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 6e0f8e4aeb14..480d915b3bdb 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -5076,6 +5076,8 @@ static inline bool net_gso_ok(netdev_features_t features, int gso_type)
 	BUILD_BUG_ON(SKB_GSO_UDP != (NETIF_F_GSO_UDP >> NETIF_F_GSO_SHIFT));
 	BUILD_BUG_ON(SKB_GSO_UDP_L4 != (NETIF_F_GSO_UDP_L4 >> NETIF_F_GSO_SHIFT));
 	BUILD_BUG_ON(SKB_GSO_FRAGLIST != (NETIF_F_GSO_FRAGLIST >> NETIF_F_GSO_SHIFT));
+	BUILD_BUG_ON(SKB_GSO_TCP_ACCECN !=
+		     (NETIF_F_GSO_ACCECN >> NETIF_F_GSO_SHIFT));
 
 	return (features & feature) == feature;
 }
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 48f1e0fa2a13..530cb325fb86 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -694,6 +694,8 @@ enum {
 	SKB_GSO_UDP_L4 = 1 << 17,
 
 	SKB_GSO_FRAGLIST = 1 << 18,
+
+	SKB_GSO_TCP_ACCECN = 1 << 19,
 };
 
 #if BITS_PER_LONG > 32
diff --git a/net/ethtool/common.c b/net/ethtool/common.c
index 0d62363dbd9d..5c3ba2dfaa74 100644
--- a/net/ethtool/common.c
+++ b/net/ethtool/common.c
@@ -32,6 +32,7 @@ const char netdev_features_strings[NETDEV_FEATURE_COUNT][ETH_GSTRING_LEN] = {
 	[NETIF_F_TSO_BIT] =              "tx-tcp-segmentation",
 	[NETIF_F_GSO_ROBUST_BIT] =       "tx-gso-robust",
 	[NETIF_F_TSO_ECN_BIT] =          "tx-tcp-ecn-segmentation",
+	[NETIF_F_GSO_ACCECN_BIT] =	 "tx-tcp-accecn-segmentation",
 	[NETIF_F_TSO_MANGLEID_BIT] =	 "tx-tcp-mangleid-segmentation",
 	[NETIF_F_TSO6_BIT] =             "tx-tcp6-segmentation",
 	[NETIF_F_FSO_BIT] =              "tx-fcoe-segmentation",
diff --git a/net/ipv4/tcp_offload.c b/net/ipv4/tcp_offload.c
index 2308665b51c5..0b05f30e9e5f 100644
--- a/net/ipv4/tcp_offload.c
+++ b/net/ipv4/tcp_offload.c
@@ -139,6 +139,7 @@ struct sk_buff *tcp_gso_segment(struct sk_buff *skb,
 	struct sk_buff *gso_skb = skb;
 	__sum16 newcheck;
 	bool ooo_okay, copy_destructor;
+	bool ecn_cwr_mask;
 	__wsum delta;
 
 	th = tcp_hdr(skb);
@@ -198,6 +199,8 @@ struct sk_buff *tcp_gso_segment(struct sk_buff *skb,
 
 	newcheck = ~csum_fold(csum_add(csum_unfold(th->check), delta));
 
+	ecn_cwr_mask = !!(skb_shinfo(gso_skb)->gso_type & SKB_GSO_TCP_ACCECN);
+
 	while (skb->next) {
 		th->fin = th->psh = 0;
 		th->check = newcheck;
@@ -217,7 +220,8 @@ struct sk_buff *tcp_gso_segment(struct sk_buff *skb,
 		th = tcp_hdr(skb);
 
 		th->seq = htonl(seq);
-		th->cwr = 0;
+
+		th->cwr &= ecn_cwr_mask;
 	}
 
 	/* Following permits TCP Small Queues to work well with GSO :
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v5 net-next 09/13] gro: prevent ACE field corruption & better AccECN handling
  2024-11-05 10:06 [PATCH v5 net-next 00/13] AccECN protocol preparation patch series chia-yu.chang
                   ` (7 preceding siblings ...)
  2024-11-05 10:06 ` [PATCH v5 net-next 08/13] gso: AccECN support chia-yu.chang
@ 2024-11-05 10:06 ` chia-yu.chang
  2024-11-07 12:40   ` Eric Dumazet
  2024-11-05 10:06 ` [PATCH v5 net-next 10/13] tcp: AccECN support to tcp_add_backlog chia-yu.chang
                   ` (3 subsequent siblings)
  12 siblings, 1 reply; 37+ messages in thread
From: chia-yu.chang @ 2024-11-05 10:06 UTC (permalink / raw)
  To: netdev, dsahern, davem, edumazet, dsahern, pabeni, joel.granados,
	kuba, andrew+netdev, horms, pablo, kadlec, netfilter-devel,
	coreteam, ij, ncardwell, koen.de_schepper, g.white,
	ingemar.s.johansson, mirja.kuehlewind, cheshire, rs.ietf,
	Jason_Livingood, vidhi_goel
  Cc: Chia-Yu Chang

From: Ilpo Järvinen <ij@kernel.org>

There are important differences in how the CWR field behaves in
RFC 3168 and AccECN. With AccECN, the CWR flag is part of the ACE
counter and its changes are significant, so adjust the changed-flags
mask accordingly.

Also, if CWR is set, mark the skb with the Accurate ECN GSO flag to
avoid corrupting the CWR flag further down the path.

Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
---
 net/ipv4/tcp_offload.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp_offload.c b/net/ipv4/tcp_offload.c
index 0b05f30e9e5f..f59762d88c38 100644
--- a/net/ipv4/tcp_offload.c
+++ b/net/ipv4/tcp_offload.c
@@ -329,7 +329,7 @@ struct sk_buff *tcp_gro_receive(struct list_head *head, struct sk_buff *skb,
 	th2 = tcp_hdr(p);
 	flush = (__force int)(flags & TCP_FLAG_CWR);
 	flush |= (__force int)((flags ^ tcp_flag_word(th2)) &
-		  ~(TCP_FLAG_CWR | TCP_FLAG_FIN | TCP_FLAG_PSH));
+		  ~(TCP_FLAG_FIN | TCP_FLAG_PSH));
 	flush |= (__force int)(th->ack_seq ^ th2->ack_seq);
 	for (i = sizeof(*th); i < thlen; i += 4)
 		flush |= *(u32 *)((u8 *)th + i) ^
@@ -405,7 +405,7 @@ void tcp_gro_complete(struct sk_buff *skb)
 	shinfo->gso_segs = NAPI_GRO_CB(skb)->count;
 
 	if (th->cwr)
-		shinfo->gso_type |= SKB_GSO_TCP_ECN;
+		shinfo->gso_type |= SKB_GSO_TCP_ACCECN;
 }
 EXPORT_SYMBOL(tcp_gro_complete);
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v5 net-next 10/13] tcp: AccECN support to tcp_add_backlog
  2024-11-05 10:06 [PATCH v5 net-next 00/13] AccECN protocol preparation patch series chia-yu.chang
                   ` (8 preceding siblings ...)
  2024-11-05 10:06 ` [PATCH v5 net-next 09/13] gro: prevent ACE field corruption & better AccECN handling chia-yu.chang
@ 2024-11-05 10:06 ` chia-yu.chang
  2024-11-07 12:42   ` Eric Dumazet
  2024-11-05 10:06 ` [PATCH v5 net-next 11/13] tcp: allow ECN bits in TOS/traffic class chia-yu.chang
                   ` (2 subsequent siblings)
  12 siblings, 1 reply; 37+ messages in thread
From: chia-yu.chang @ 2024-11-05 10:06 UTC (permalink / raw)
  To: netdev, dsahern, davem, edumazet, dsahern, pabeni, joel.granados,
	kuba, andrew+netdev, horms, pablo, kadlec, netfilter-devel,
	coreteam, ij, ncardwell, koen.de_schepper, g.white,
	ingemar.s.johansson, mirja.kuehlewind, cheshire, rs.ietf,
	Jason_Livingood, vidhi_goel
  Cc: Chia-Yu Chang

From: Ilpo Järvinen <ij@kernel.org>

The AE flag needs to be preserved for AccECN.

Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
---
 net/ipv4/tcp_ipv4.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 536767723584..a13d6745d92b 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2055,7 +2055,8 @@ bool tcp_add_backlog(struct sock *sk, struct sk_buff *skb,
 	    !((TCP_SKB_CB(tail)->tcp_flags &
 	      TCP_SKB_CB(skb)->tcp_flags) & TCPHDR_ACK) ||
 	    ((TCP_SKB_CB(tail)->tcp_flags ^
-	      TCP_SKB_CB(skb)->tcp_flags) & (TCPHDR_ECE | TCPHDR_CWR)) ||
+	      TCP_SKB_CB(skb)->tcp_flags) &
+	     (TCPHDR_ECE | TCPHDR_CWR | TCPHDR_AE)) ||
 	    !tcp_skb_can_collapse_rx(tail, skb) ||
 	    thtail->doff != th->doff ||
 	    memcmp(thtail + 1, th + 1, hdrlen - sizeof(*th)))
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v5 net-next 11/13] tcp: allow ECN bits in TOS/traffic class
  2024-11-05 10:06 [PATCH v5 net-next 00/13] AccECN protocol preparation patch series chia-yu.chang
                   ` (9 preceding siblings ...)
  2024-11-05 10:06 ` [PATCH v5 net-next 10/13] tcp: AccECN support to tcp_add_backlog chia-yu.chang
@ 2024-11-05 10:06 ` chia-yu.chang
  2024-11-07 12:55   ` Eric Dumazet
  2024-11-05 10:06 ` [PATCH v5 net-next 12/13] tcp: Pass flags to __tcp_send_ack chia-yu.chang
  2024-11-05 10:06 ` [PATCH v5 net-next 13/13] tcp: fast path functions later chia-yu.chang
  12 siblings, 1 reply; 37+ messages in thread
From: chia-yu.chang @ 2024-11-05 10:06 UTC (permalink / raw)
  To: netdev, dsahern, davem, edumazet, dsahern, pabeni, joel.granados,
	kuba, andrew+netdev, horms, pablo, kadlec, netfilter-devel,
	coreteam, ij, ncardwell, koen.de_schepper, g.white,
	ingemar.s.johansson, mirja.kuehlewind, cheshire, rs.ietf,
	Jason_Livingood, vidhi_goel
  Cc: Chia-Yu Chang

From: Ilpo Järvinen <ij@kernel.org>

An AccECN connection's last ACK cannot currently retain ECT(1) as
the ECN bits are always cleared, causing the packet to switch into
another service queue.

This patch effectively adds finer-grained filtering for the ECN
bits so that acceptable TW ACKs can retain them.

Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
---
 include/net/tcp.h        |  3 ++-
 net/ipv4/ip_output.c     |  3 +--
 net/ipv4/tcp_ipv4.c      | 23 +++++++++++++++++------
 net/ipv4/tcp_minisocks.c |  2 +-
 net/ipv6/tcp_ipv6.c      | 24 +++++++++++++++++-------
 5 files changed, 38 insertions(+), 17 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 215b7ba105be..3a8782874333 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -417,7 +417,8 @@ enum tcp_tw_status {
 	TCP_TW_SUCCESS = 0,
 	TCP_TW_RST = 1,
 	TCP_TW_ACK = 2,
-	TCP_TW_SYN = 3
+	TCP_TW_SYN = 3,
+	TCP_TW_ACK_OOW = 4
 };
 
 
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 0065b1996c94..2fe7b1df3b90 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -75,7 +75,6 @@
 #include <net/checksum.h>
 #include <net/gso.h>
 #include <net/inetpeer.h>
-#include <net/inet_ecn.h>
 #include <net/lwtunnel.h>
 #include <net/inet_dscp.h>
 #include <linux/bpf-cgroup.h>
@@ -1643,7 +1642,7 @@ void ip_send_unicast_reply(struct sock *sk, const struct sock *orig_sk,
 	if (IS_ERR(rt))
 		return;
 
-	inet_sk(sk)->tos = arg->tos & ~INET_ECN_MASK;
+	inet_sk(sk)->tos = arg->tos;
 
 	sk->sk_protocol = ip_hdr(skb)->protocol;
 	sk->sk_bound_dev_if = arg->bound_dev_if;
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index a13d6745d92b..1950d4cd5da8 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -66,6 +66,7 @@
 #include <net/transp_v6.h>
 #include <net/ipv6.h>
 #include <net/inet_common.h>
+#include <net/inet_ecn.h>
 #include <net/timewait_sock.h>
 #include <net/xfrm.h>
 #include <net/secure_seq.h>
@@ -887,7 +888,7 @@ static void tcp_v4_send_reset(const struct sock *sk, struct sk_buff *skb,
 	BUILD_BUG_ON(offsetof(struct sock, sk_bound_dev_if) !=
 		     offsetof(struct inet_timewait_sock, tw_bound_dev_if));
 
-	arg.tos = ip_hdr(skb)->tos;
+	arg.tos = ip_hdr(skb)->tos & ~INET_ECN_MASK;
 	arg.uid = sock_net_uid(net, sk && sk_fullsock(sk) ? sk : NULL);
 	local_bh_disable();
 	local_lock_nested_bh(&ipv4_tcp_sk.bh_lock);
@@ -1033,11 +1034,17 @@ static void tcp_v4_send_ack(const struct sock *sk,
 	local_bh_enable();
 }
 
-static void tcp_v4_timewait_ack(struct sock *sk, struct sk_buff *skb)
+static void tcp_v4_timewait_ack(struct sock *sk, struct sk_buff *skb,
+				enum tcp_tw_status tw_status)
 {
 	struct inet_timewait_sock *tw = inet_twsk(sk);
 	struct tcp_timewait_sock *tcptw = tcp_twsk(sk);
 	struct tcp_key key = {};
+	u8 tos = tw->tw_tos;
+
+	if (tw_status == TCP_TW_ACK_OOW)
+		tos &= ~INET_ECN_MASK;
+
 #ifdef CONFIG_TCP_AO
 	struct tcp_ao_info *ao_info;
 
@@ -1081,7 +1088,7 @@ static void tcp_v4_timewait_ack(struct sock *sk, struct sk_buff *skb)
 			READ_ONCE(tcptw->tw_ts_recent),
 			tw->tw_bound_dev_if, &key,
 			tw->tw_transparent ? IP_REPLY_ARG_NOSRCCHECK : 0,
-			tw->tw_tos,
+			tos,
 			tw->tw_txhash);
 
 	inet_twsk_put(tw);
@@ -1158,7 +1165,7 @@ static void tcp_v4_reqsk_send_ack(const struct sock *sk, struct sk_buff *skb,
 			READ_ONCE(req->ts_recent),
 			0, &key,
 			inet_rsk(req)->no_srccheck ? IP_REPLY_ARG_NOSRCCHECK : 0,
-			ip_hdr(skb)->tos,
+			ip_hdr(skb)->tos & ~INET_ECN_MASK,
 			READ_ONCE(tcp_rsk(req)->txhash));
 	if (tcp_key_is_ao(&key))
 		kfree(key.traffic_key);
@@ -2179,6 +2186,7 @@ int tcp_v4_rcv(struct sk_buff *skb)
 {
 	struct net *net = dev_net(skb->dev);
 	enum skb_drop_reason drop_reason;
+	enum tcp_tw_status tw_status;
 	int sdif = inet_sdif(skb);
 	int dif = inet_iif(skb);
 	const struct iphdr *iph;
@@ -2405,7 +2413,9 @@ int tcp_v4_rcv(struct sk_buff *skb)
 		inet_twsk_put(inet_twsk(sk));
 		goto csum_error;
 	}
-	switch (tcp_timewait_state_process(inet_twsk(sk), skb, th, &isn)) {
+
+	tw_status = tcp_timewait_state_process(inet_twsk(sk), skb, th, &isn);
+	switch (tw_status) {
 	case TCP_TW_SYN: {
 		struct sock *sk2 = inet_lookup_listener(net,
 							net->ipv4.tcp_death_row.hashinfo,
@@ -2426,7 +2436,8 @@ int tcp_v4_rcv(struct sk_buff *skb)
 		/* to ACK */
 		fallthrough;
 	case TCP_TW_ACK:
-		tcp_v4_timewait_ack(sk, skb);
+	case TCP_TW_ACK_OOW:
+		tcp_v4_timewait_ack(sk, skb, tw_status);
 		break;
 	case TCP_TW_RST:
 		tcp_v4_send_reset(sk, skb, SK_RST_REASON_TCP_TIMEWAIT_SOCKET);
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index bd6515ab660f..8fb9f550fdeb 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -44,7 +44,7 @@ tcp_timewait_check_oow_rate_limit(struct inet_timewait_sock *tw,
 		/* Send ACK. Note, we do not put the bucket,
 		 * it will be released by caller.
 		 */
-		return TCP_TW_ACK;
+		return TCP_TW_ACK_OOW;
 	}
 
 	/* We are rate-limiting, so just release the tw sock and drop skb. */
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index fec9acffb167..ea85c117bf96 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -997,7 +997,7 @@ static void tcp_v6_send_response(const struct sock *sk, struct sk_buff *skb, u32
 	if (!IS_ERR(dst)) {
 		skb_dst_set(buff, dst);
 		ip6_xmit(ctl_sk, buff, &fl6, fl6.flowi6_mark, NULL,
-			 tclass & ~INET_ECN_MASK, priority);
+			 tclass, priority);
 		TCP_INC_STATS(net, TCP_MIB_OUTSEGS);
 		if (rst)
 			TCP_INC_STATS(net, TCP_MIB_OUTRSTS);
@@ -1133,7 +1133,8 @@ static void tcp_v6_send_reset(const struct sock *sk, struct sk_buff *skb,
 	trace_tcp_send_reset(sk, skb, reason);
 
 	tcp_v6_send_response(sk, skb, seq, ack_seq, 0, 0, 0, oif, 1,
-			     ipv6_get_dsfield(ipv6h), label, priority, txhash,
+			     ipv6_get_dsfield(ipv6h) & ~INET_ECN_MASK,
+			     label, priority, txhash,
 			     &key);
 
 #if defined(CONFIG_TCP_MD5SIG) || defined(CONFIG_TCP_AO)
@@ -1153,11 +1154,16 @@ static void tcp_v6_send_ack(const struct sock *sk, struct sk_buff *skb, u32 seq,
 			     tclass, label, priority, txhash, key);
 }
 
-static void tcp_v6_timewait_ack(struct sock *sk, struct sk_buff *skb)
+static void tcp_v6_timewait_ack(struct sock *sk, struct sk_buff *skb,
+				enum tcp_tw_status tw_status)
 {
 	struct inet_timewait_sock *tw = inet_twsk(sk);
 	struct tcp_timewait_sock *tcptw = tcp_twsk(sk);
+	u8 tclass = tw->tw_tclass;
 	struct tcp_key key = {};
+
+	if (tw_status == TCP_TW_ACK_OOW)
+		tclass &= ~INET_ECN_MASK;
 #ifdef CONFIG_TCP_AO
 	struct tcp_ao_info *ao_info;
 
@@ -1201,7 +1207,7 @@ static void tcp_v6_timewait_ack(struct sock *sk, struct sk_buff *skb)
 			tcptw->tw_rcv_wnd >> tw->tw_rcv_wscale,
 			tcp_tw_tsval(tcptw),
 			READ_ONCE(tcptw->tw_ts_recent), tw->tw_bound_dev_if,
-			&key, tw->tw_tclass, cpu_to_be32(tw->tw_flowlabel),
+			&key, tclass, cpu_to_be32(tw->tw_flowlabel),
 			tw->tw_priority, tw->tw_txhash);
 
 #ifdef CONFIG_TCP_AO
@@ -1278,7 +1284,8 @@ static void tcp_v6_reqsk_send_ack(const struct sock *sk, struct sk_buff *skb,
 			tcp_synack_window(req) >> inet_rsk(req)->rcv_wscale,
 			tcp_rsk_tsval(tcp_rsk(req)),
 			READ_ONCE(req->ts_recent), sk->sk_bound_dev_if,
-			&key, ipv6_get_dsfield(ipv6_hdr(skb)), 0,
+			&key, ipv6_get_dsfield(ipv6_hdr(skb)) & ~INET_ECN_MASK,
+			0,
 			READ_ONCE(sk->sk_priority),
 			READ_ONCE(tcp_rsk(req)->txhash));
 	if (tcp_key_is_ao(&key))
@@ -1747,6 +1754,7 @@ static void tcp_v6_fill_cb(struct sk_buff *skb, const struct ipv6hdr *hdr,
 INDIRECT_CALLABLE_SCOPE int tcp_v6_rcv(struct sk_buff *skb)
 {
 	enum skb_drop_reason drop_reason;
+	enum tcp_tw_status tw_status;
 	int sdif = inet6_sdif(skb);
 	int dif = inet6_iif(skb);
 	const struct tcphdr *th;
@@ -1967,7 +1975,8 @@ INDIRECT_CALLABLE_SCOPE int tcp_v6_rcv(struct sk_buff *skb)
 		goto csum_error;
 	}
 
-	switch (tcp_timewait_state_process(inet_twsk(sk), skb, th, &isn)) {
+	tw_status = tcp_timewait_state_process(inet_twsk(sk), skb, th, &isn);
+	switch (tw_status) {
 	case TCP_TW_SYN:
 	{
 		struct sock *sk2;
@@ -1992,7 +2001,8 @@ INDIRECT_CALLABLE_SCOPE int tcp_v6_rcv(struct sk_buff *skb)
 		/* to ACK */
 		fallthrough;
 	case TCP_TW_ACK:
-		tcp_v6_timewait_ack(sk, skb);
+	case TCP_TW_ACK_OOW:
+		tcp_v6_timewait_ack(sk, skb, tw_status);
 		break;
 	case TCP_TW_RST:
 		tcp_v6_send_reset(sk, skb, SK_RST_REASON_TCP_TIMEWAIT_SOCKET);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v5 net-next 12/13] tcp: Pass flags to __tcp_send_ack
  2024-11-05 10:06 [PATCH v5 net-next 00/13] AccECN protocol preparation patch series chia-yu.chang
                   ` (10 preceding siblings ...)
  2024-11-05 10:06 ` [PATCH v5 net-next 11/13] tcp: allow ECN bits in TOS/traffic class chia-yu.chang
@ 2024-11-05 10:06 ` chia-yu.chang
  2024-11-07 12:58   ` Eric Dumazet
  2024-11-05 10:06 ` [PATCH v5 net-next 13/13] tcp: fast path functions later chia-yu.chang
  12 siblings, 1 reply; 37+ messages in thread
From: chia-yu.chang @ 2024-11-05 10:06 UTC (permalink / raw)
  To: netdev, dsahern, davem, edumazet, dsahern, pabeni, joel.granados,
	kuba, andrew+netdev, horms, pablo, kadlec, netfilter-devel,
	coreteam, ij, ncardwell, koen.de_schepper, g.white,
	ingemar.s.johansson, mirja.kuehlewind, cheshire, rs.ietf,
	Jason_Livingood, vidhi_goel
  Cc: Chia-Yu Chang

From: Ilpo Järvinen <ij@kernel.org>

Accurate ECN needs to send custom flags to handle the IP-ECN
field reflection during the handshake.

Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
---
 include/net/tcp.h     | 2 +-
 net/ipv4/bpf_tcp_ca.c | 2 +-
 net/ipv4/tcp_dctcp.h  | 2 +-
 net/ipv4/tcp_output.c | 6 +++---
 4 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 3a8782874333..fc9d181e9362 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -704,7 +704,7 @@ void tcp_send_active_reset(struct sock *sk, gfp_t priority,
 			   enum sk_rst_reason reason);
 int tcp_send_synack(struct sock *);
 void tcp_push_one(struct sock *, unsigned int mss_now);
-void __tcp_send_ack(struct sock *sk, u32 rcv_nxt);
+void __tcp_send_ack(struct sock *sk, u32 rcv_nxt, u16 flags);
 void tcp_send_ack(struct sock *sk);
 void tcp_send_delayed_ack(struct sock *sk);
 void tcp_send_loss_probe(struct sock *sk);
diff --git a/net/ipv4/bpf_tcp_ca.c b/net/ipv4/bpf_tcp_ca.c
index 554804774628..e01492234b0b 100644
--- a/net/ipv4/bpf_tcp_ca.c
+++ b/net/ipv4/bpf_tcp_ca.c
@@ -121,7 +121,7 @@ static int bpf_tcp_ca_btf_struct_access(struct bpf_verifier_log *log,
 BPF_CALL_2(bpf_tcp_send_ack, struct tcp_sock *, tp, u32, rcv_nxt)
 {
 	/* bpf_tcp_ca prog cannot have NULL tp */
-	__tcp_send_ack((struct sock *)tp, rcv_nxt);
+	__tcp_send_ack((struct sock *)tp, rcv_nxt, 0);
 	return 0;
 }
 
diff --git a/net/ipv4/tcp_dctcp.h b/net/ipv4/tcp_dctcp.h
index d69a77cbd0c7..4b0259111d81 100644
--- a/net/ipv4/tcp_dctcp.h
+++ b/net/ipv4/tcp_dctcp.h
@@ -28,7 +28,7 @@ static inline void dctcp_ece_ack_update(struct sock *sk, enum tcp_ca_event evt,
 		 */
 		if (inet_csk(sk)->icsk_ack.pending & ICSK_ACK_TIMER) {
 			dctcp_ece_ack_cwr(sk, *ce_state);
-			__tcp_send_ack(sk, *prior_rcv_nxt);
+			__tcp_send_ack(sk, *prior_rcv_nxt, 0);
 		}
 		inet_csk(sk)->icsk_ack.pending |= ICSK_ACK_NOW;
 	}
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index f4ee86e2b1b5..fdb4af44c47e 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -4230,7 +4230,7 @@ void tcp_send_delayed_ack(struct sock *sk)
 }
 
 /* This routine sends an ack and also updates the window. */
-void __tcp_send_ack(struct sock *sk, u32 rcv_nxt)
+void __tcp_send_ack(struct sock *sk, u32 rcv_nxt, u16 flags)
 {
 	struct sk_buff *buff;
 
@@ -4259,7 +4259,7 @@ void __tcp_send_ack(struct sock *sk, u32 rcv_nxt)
 
 	/* Reserve space for headers and prepare control bits. */
 	skb_reserve(buff, MAX_TCP_HEADER);
-	tcp_init_nondata_skb(buff, tcp_acceptable_seq(sk), TCPHDR_ACK);
+	tcp_init_nondata_skb(buff, tcp_acceptable_seq(sk), TCPHDR_ACK | flags);
 
 	/* We do not want pure acks influencing TCP Small Queues or fq/pacing
 	 * too much.
@@ -4274,7 +4274,7 @@ EXPORT_SYMBOL_GPL(__tcp_send_ack);
 
 void tcp_send_ack(struct sock *sk)
 {
-	__tcp_send_ack(sk, tcp_sk(sk)->rcv_nxt);
+	__tcp_send_ack(sk, tcp_sk(sk)->rcv_nxt, 0);
 }
 
 /* This routine sends a packet with an out of date sequence
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v5 net-next 13/13] tcp: fast path functions later
  2024-11-05 10:06 [PATCH v5 net-next 00/13] AccECN protocol preparation patch series chia-yu.chang
                   ` (11 preceding siblings ...)
  2024-11-05 10:06 ` [PATCH v5 net-next 12/13] tcp: Pass flags to __tcp_send_ack chia-yu.chang
@ 2024-11-05 10:06 ` chia-yu.chang
  2024-11-07 13:00   ` Eric Dumazet
  12 siblings, 1 reply; 37+ messages in thread
From: chia-yu.chang @ 2024-11-05 10:06 UTC (permalink / raw)
  To: netdev, dsahern, davem, edumazet, dsahern, pabeni, joel.granados,
	kuba, andrew+netdev, horms, pablo, kadlec, netfilter-devel,
	coreteam, ij, ncardwell, koen.de_schepper, g.white,
	ingemar.s.johansson, mirja.kuehlewind, cheshire, rs.ietf,
	Jason_Livingood, vidhi_goel
  Cc: Chia-Yu Chang

From: Ilpo Järvinen <ij@kernel.org>

The following patch will use tcp_ecn_mode_accecn(),
TCP_ACCECN_CEP_INIT_OFFSET, and TCP_ACCECN_CEP_ACE_MASK in
__tcp_fast_path_on() to set a new flag for AccECN, so move the
fast path functions later in the file.

No functional changes.

Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
---
 include/net/tcp.h | 54 +++++++++++++++++++++++------------------------
 1 file changed, 27 insertions(+), 27 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index fc9d181e9362..6255977bd7f9 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -788,33 +788,6 @@ static inline u32 __tcp_set_rto(const struct tcp_sock *tp)
 	return usecs_to_jiffies((tp->srtt_us >> 3) + tp->rttvar_us);
 }
 
-static inline void __tcp_fast_path_on(struct tcp_sock *tp, u32 snd_wnd)
-{
-	/* mptcp hooks are only on the slow path */
-	if (sk_is_mptcp((struct sock *)tp))
-		return;
-
-	tp->pred_flags = htonl((tp->tcp_header_len << 26) |
-			       ntohl(TCP_FLAG_ACK) |
-			       snd_wnd);
-}
-
-static inline void tcp_fast_path_on(struct tcp_sock *tp)
-{
-	__tcp_fast_path_on(tp, tp->snd_wnd >> tp->rx_opt.snd_wscale);
-}
-
-static inline void tcp_fast_path_check(struct sock *sk)
-{
-	struct tcp_sock *tp = tcp_sk(sk);
-
-	if (RB_EMPTY_ROOT(&tp->out_of_order_queue) &&
-	    tp->rcv_wnd &&
-	    atomic_read(&sk->sk_rmem_alloc) < sk->sk_rcvbuf &&
-	    !tp->urg_data)
-		tcp_fast_path_on(tp);
-}
-
 u32 tcp_delack_max(const struct sock *sk);
 
 /* Compute the actual rto_min value */
@@ -1770,6 +1743,33 @@ static inline bool tcp_paws_reject(const struct tcp_options_received *rx_opt,
 	return true;
 }
 
+static inline void __tcp_fast_path_on(struct tcp_sock *tp, u32 snd_wnd)
+{
+	/* mptcp hooks are only on the slow path */
+	if (sk_is_mptcp((struct sock *)tp))
+		return;
+
+	tp->pred_flags = htonl((tp->tcp_header_len << 26) |
+			       ntohl(TCP_FLAG_ACK) |
+			       snd_wnd);
+}
+
+static inline void tcp_fast_path_on(struct tcp_sock *tp)
+{
+	__tcp_fast_path_on(tp, tp->snd_wnd >> tp->rx_opt.snd_wscale);
+}
+
+static inline void tcp_fast_path_check(struct sock *sk)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+
+	if (RB_EMPTY_ROOT(&tp->out_of_order_queue) &&
+	    tp->rcv_wnd &&
+	    atomic_read(&sk->sk_rmem_alloc) < sk->sk_rcvbuf &&
+	    !tp->urg_data)
+		tcp_fast_path_on(tp);
+}
+
 bool tcp_oow_rate_limited(struct net *net, const struct sk_buff *skb,
 			  int mib_idx, u32 *last_oow_ack_time);
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [PATCH v5 net-next 01/13] tcp: reorganize tcp_in_ack_event() and tcp_count_delivered()
  2024-11-05 10:06 ` [PATCH v5 net-next 01/13] tcp: reorganize tcp_in_ack_event() and tcp_count_delivered() chia-yu.chang
@ 2024-11-07  9:31   ` Eric Dumazet
  2024-11-07 19:32     ` Ilpo Järvinen
  0 siblings, 1 reply; 37+ messages in thread
From: Eric Dumazet @ 2024-11-07  9:31 UTC (permalink / raw)
  To: chia-yu.chang
  Cc: netdev, dsahern, davem, dsahern, pabeni, joel.granados, kuba,
	andrew+netdev, horms, pablo, kadlec, netfilter-devel, coreteam,
	ij, ncardwell, koen.de_schepper, g.white, ingemar.s.johansson,
	mirja.kuehlewind, cheshire, rs.ietf, Jason_Livingood, vidhi_goel

On Tue, Nov 5, 2024 at 11:07 AM <chia-yu.chang@nokia-bell-labs.com> wrote:
>
> From: Ilpo Järvinen <ij@kernel.org>
>
> - Move tcp_count_delivered() earlier and split tcp_count_delivered_ce()
>   out of it
> - Move tcp_in_ack_event() later
> - While at it, remove the inline from tcp_in_ack_event() and let
>   the compiler decide
>
> Accurate ECN's heuristic does not know if there is going
> to be ACE field based CE counter increase or not until after
> rtx queue has been processed. Only then the number of ACKed
> bytes/pkts is available. As CE or not affects presence of
> FLAG_ECE, that information for tcp_in_ack_event is not yet
> available in the old location of the call to tcp_in_ack_event().
>
> Signed-off-by: Ilpo Järvinen <ij@kernel.org>
> Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
> ---
>  net/ipv4/tcp_input.c | 56 +++++++++++++++++++++++++-------------------
>  1 file changed, 32 insertions(+), 24 deletions(-)
>
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index 5bdf13ac26ef..fc52eab4fcc9 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -413,6 +413,20 @@ static bool tcp_ecn_rcv_ecn_echo(const struct tcp_sock *tp, const struct tcphdr
>         return false;
>  }
>
> +static void tcp_count_delivered_ce(struct tcp_sock *tp, u32 ecn_count)
> +{
> +       tp->delivered_ce += ecn_count;
> +}
> +
> +/* Updates the delivered and delivered_ce counts */
> +static void tcp_count_delivered(struct tcp_sock *tp, u32 delivered,
> +                               bool ece_ack)
> +{
> +       tp->delivered += delivered;
> +       if (ece_ack)
> +               tcp_count_delivered_ce(tp, delivered);
> +}
> +
>  /* Buffer size and advertised window tuning.
>   *
>   * 1. Tuning sk->sk_sndbuf, when connection enters established state.
> @@ -1148,15 +1162,6 @@ void tcp_mark_skb_lost(struct sock *sk, struct sk_buff *skb)
>         }
>  }
>
> -/* Updates the delivered and delivered_ce counts */
> -static void tcp_count_delivered(struct tcp_sock *tp, u32 delivered,
> -                               bool ece_ack)
> -{
> -       tp->delivered += delivered;
> -       if (ece_ack)
> -               tp->delivered_ce += delivered;
> -}
> -
>  /* This procedure tags the retransmission queue when SACKs arrive.
>   *
>   * We have three tag bits: SACKED(S), RETRANS(R) and LOST(L).
> @@ -3856,12 +3861,23 @@ static void tcp_process_tlp_ack(struct sock *sk, u32 ack, int flag)
>         }
>  }
>
> -static inline void tcp_in_ack_event(struct sock *sk, u32 flags)
> +static void tcp_in_ack_event(struct sock *sk, int flag)
>  {
>         const struct inet_connection_sock *icsk = inet_csk(sk);
>
> -       if (icsk->icsk_ca_ops->in_ack_event)
> -               icsk->icsk_ca_ops->in_ack_event(sk, flags);
> +       if (icsk->icsk_ca_ops->in_ack_event) {
> +               u32 ack_ev_flags = 0;
> +
> +               if (flag & FLAG_WIN_UPDATE)
> +                       ack_ev_flags |= CA_ACK_WIN_UPDATE;
> +               if (flag & FLAG_SLOWPATH) {
> +                       ack_ev_flags = CA_ACK_SLOWPATH;

This is removing the potential CA_ACK_WIN_UPDATE, I would suggest :

ack_ev_flags |= CA_ACK_SLOWPATH;


> +                       if (flag & FLAG_ECE)
> +                               ack_ev_flags |= CA_ACK_ECE;
> +               }
> +
> +               icsk->icsk_ca_ops->in_ack_event(sk, ack_ev_flags);
> +       }
>  }
>
>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v5 net-next 02/13] tcp: create FLAG_TS_PROGRESS
  2024-11-05 10:06 ` [PATCH v5 net-next 02/13] tcp: create FLAG_TS_PROGRESS chia-yu.chang
@ 2024-11-07  9:34   ` Eric Dumazet
  0 siblings, 0 replies; 37+ messages in thread
From: Eric Dumazet @ 2024-11-07  9:34 UTC (permalink / raw)
  To: chia-yu.chang
  Cc: netdev, dsahern, davem, dsahern, pabeni, joel.granados, kuba,
	andrew+netdev, horms, pablo, kadlec, netfilter-devel, coreteam,
	ij, ncardwell, koen.de_schepper, g.white, ingemar.s.johansson,
	mirja.kuehlewind, cheshire, rs.ietf, Jason_Livingood, vidhi_goel

On Tue, Nov 5, 2024 at 11:07 AM <chia-yu.chang@nokia-bell-labs.com> wrote:
>
> From: Ilpo Järvinen <ij@kernel.org>
>
> Whenever the timestamp advances, it declares progress, which
> can be used by the other parts of the stack to decide that
> the ACK is the most recent one seen so far.
>
> AccECN will use this flag when deciding whether to use the
> ACK to update AccECN state or not.
>
> Signed-off-by: Ilpo Järvinen <ij@kernel.org>
> Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>

Reviewed-by: Eric Dumazet <edumazet@google.com>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v5 net-next 03/13] tcp: use BIT() macro in include/net/tcp.h
  2024-11-05 10:06 ` [PATCH v5 net-next 03/13] tcp: use BIT() macro in include/net/tcp.h chia-yu.chang
@ 2024-11-07 10:20   ` Eric Dumazet
  0 siblings, 0 replies; 37+ messages in thread
From: Eric Dumazet @ 2024-11-07 10:20 UTC (permalink / raw)
  To: chia-yu.chang
  Cc: netdev, dsahern, davem, dsahern, pabeni, joel.granados, kuba,
	andrew+netdev, horms, pablo, kadlec, netfilter-devel, coreteam,
	ij, ncardwell, koen.de_schepper, g.white, ingemar.s.johansson,
	mirja.kuehlewind, cheshire, rs.ietf, Jason_Livingood, vidhi_goel

On Tue, Nov 5, 2024 at 11:07 AM <chia-yu.chang@nokia-bell-labs.com> wrote:
>
> From: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
>
> Use BIT() macro for TCP flags field and TCP congestion control
> flags that will be used by the congestion control algorithm.
>
> No functional changes.
>
> Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
> Reviewed-by: Ilpo Järvinen <ij@kernel.org>
> ---
>  include/net/tcp.h | 21 +++++++++++----------
>  1 file changed, 11 insertions(+), 10 deletions(-)


Reviewed-by: Eric Dumazet <edumazet@google.com>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v5 net-next 04/13] tcp: extend TCP flags to allow AE bit/ACE field
  2024-11-05 10:06 ` [PATCH v5 net-next 04/13] tcp: extend TCP flags to allow AE bit/ACE field chia-yu.chang
@ 2024-11-07 10:23   ` Eric Dumazet
  0 siblings, 0 replies; 37+ messages in thread
From: Eric Dumazet @ 2024-11-07 10:23 UTC (permalink / raw)
  To: chia-yu.chang
  Cc: netdev, dsahern, davem, dsahern, pabeni, joel.granados, kuba,
	andrew+netdev, horms, pablo, kadlec, netfilter-devel, coreteam,
	ij, ncardwell, koen.de_schepper, g.white, ingemar.s.johansson,
	mirja.kuehlewind, cheshire, rs.ietf, Jason_Livingood, vidhi_goel

On Tue, Nov 5, 2024 at 11:07 AM <chia-yu.chang@nokia-bell-labs.com> wrote:
>
> From: Ilpo Järvinen <ij@kernel.org>
>
> With AccECN, there's one additional TCP flag to be used (AE)
> and ACE field that overloads the definition of AE, CWR, and
> ECE flags. As tcp_flags was previously only 1 byte, the
> byte-order stuff needs to be added to its handling.
>
> Signed-off-by: Ilpo Järvinen <ij@kernel.org>
> Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
> ---

Reviewed-by: Eric Dumazet <edumazet@google.com>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v5 net-next 05/13] tcp: reorganize SYN ECN code
  2024-11-05 10:06 ` [PATCH v5 net-next 05/13] tcp: reorganize SYN ECN code chia-yu.chang
@ 2024-11-07 10:25   ` Eric Dumazet
  0 siblings, 0 replies; 37+ messages in thread
From: Eric Dumazet @ 2024-11-07 10:25 UTC (permalink / raw)
  To: chia-yu.chang
  Cc: netdev, dsahern, davem, dsahern, pabeni, joel.granados, kuba,
	andrew+netdev, horms, pablo, kadlec, netfilter-devel, coreteam,
	ij, ncardwell, koen.de_schepper, g.white, ingemar.s.johansson,
	mirja.kuehlewind, cheshire, rs.ietf, Jason_Livingood, vidhi_goel

On Tue, Nov 5, 2024 at 11:07 AM <chia-yu.chang@nokia-bell-labs.com> wrote:
>
> From: Ilpo Järvinen <ij@kernel.org>
>
> Prepare for AccECN, which needs access here to the IP ECN
> field value that is only available after INET_ECN_xmit().
>
> No functional changes.
>
> Signed-off-by: Ilpo Järvinen <ij@kernel.org>
> Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>

Reviewed-by: Eric Dumazet <edumazet@google.com>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v5 net-next 06/13] tcp: rework {__,}tcp_ecn_check_ce() -> tcp_data_ecn_check()
  2024-11-05 10:06 ` [PATCH v5 net-next 06/13] tcp: rework {__,}tcp_ecn_check_ce() -> tcp_data_ecn_check() chia-yu.chang
@ 2024-11-07 10:30   ` Eric Dumazet
  0 siblings, 0 replies; 37+ messages in thread
From: Eric Dumazet @ 2024-11-07 10:30 UTC (permalink / raw)
  To: chia-yu.chang
  Cc: netdev, dsahern, davem, dsahern, pabeni, joel.granados, kuba,
	andrew+netdev, horms, pablo, kadlec, netfilter-devel, coreteam,
	ij, ncardwell, koen.de_schepper, g.white, ingemar.s.johansson,
	mirja.kuehlewind, cheshire, rs.ietf, Jason_Livingood, vidhi_goel

On Tue, Nov 5, 2024 at 11:07 AM <chia-yu.chang@nokia-bell-labs.com> wrote:
>
> From: Ilpo Järvinen <ij@kernel.org>
>
> Rename tcp_ecn_check_ce to tcp_data_ecn_check as it is
> called only for data segments, not for ACKs (with AccECN,
> also ACKs may get ECN bits).
>
> The extra "layer" in the tcp_ecn_check_ce() function just
> checks for ECN being enabled; that check can be moved into
> tcp_ecn_field_check() rather than having the __ variant.
>
> No functional changes.
>
> Signed-off-by: Ilpo Järvinen <ij@kernel.org>
> Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>

Reviewed-by: Eric Dumazet <edumazet@google.com>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v5 net-next 07/13] tcp: helpers for ECN mode handling
  2024-11-05 10:06 ` [PATCH v5 net-next 07/13] tcp: helpers for ECN mode handling chia-yu.chang
@ 2024-11-07 12:26   ` Eric Dumazet
  0 siblings, 0 replies; 37+ messages in thread
From: Eric Dumazet @ 2024-11-07 12:26 UTC (permalink / raw)
  To: chia-yu.chang
  Cc: netdev, dsahern, davem, dsahern, pabeni, joel.granados, kuba,
	andrew+netdev, horms, pablo, kadlec, netfilter-devel, coreteam,
	ij, ncardwell, koen.de_schepper, g.white, ingemar.s.johansson,
	mirja.kuehlewind, cheshire, rs.ietf, Jason_Livingood, vidhi_goel

On Tue, Nov 5, 2024 at 11:07 AM <chia-yu.chang@nokia-bell-labs.com> wrote:
>
> From: Ilpo Järvinen <ij@kernel.org>
>
> Create helpers for TCP ECN modes. No functional changes.
>
> Signed-off-by: Ilpo Järvinen <ij@kernel.org>
> Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
> ---

Reviewed-by: Eric Dumazet <edumazet@google.com>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v5 net-next 09/13] gro: prevent ACE field corruption & better AccECN handling
  2024-11-05 10:06 ` [PATCH v5 net-next 09/13] gro: prevent ACE field corruption & better AccECN handling chia-yu.chang
@ 2024-11-07 12:40   ` Eric Dumazet
  2024-11-07 19:28     ` Ilpo Järvinen
  0 siblings, 1 reply; 37+ messages in thread
From: Eric Dumazet @ 2024-11-07 12:40 UTC (permalink / raw)
  To: chia-yu.chang
  Cc: netdev, dsahern, davem, dsahern, pabeni, joel.granados, kuba,
	andrew+netdev, horms, pablo, kadlec, netfilter-devel, coreteam,
	ij, ncardwell, koen.de_schepper, g.white, ingemar.s.johansson,
	mirja.kuehlewind, cheshire, rs.ietf, Jason_Livingood, vidhi_goel

On Tue, Nov 5, 2024 at 11:07 AM <chia-yu.chang@nokia-bell-labs.com> wrote:
>
> From: Ilpo Järvinen <ij@kernel.org>
>
> There are important differences in how the CWR field behaves
> in RFC3168 and AccECN. With AccECN, CWR flag is part of the
> ACE counter and its changes are important, so adjust the flags
> changed mask accordingly.
>
> Also, if CWR is there, set the Accurate ECN GSO flag to avoid
> corrupting the CWR flag somewhere.
>
> Signed-off-by: Ilpo Järvinen <ij@kernel.org>
> Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
> ---
>  net/ipv4/tcp_offload.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/net/ipv4/tcp_offload.c b/net/ipv4/tcp_offload.c
> index 0b05f30e9e5f..f59762d88c38 100644
> --- a/net/ipv4/tcp_offload.c
> +++ b/net/ipv4/tcp_offload.c
> @@ -329,7 +329,7 @@ struct sk_buff *tcp_gro_receive(struct list_head *head, struct sk_buff *skb,
>         th2 = tcp_hdr(p);
>         flush = (__force int)(flags & TCP_FLAG_CWR);
>         flush |= (__force int)((flags ^ tcp_flag_word(th2)) &
> -                 ~(TCP_FLAG_CWR | TCP_FLAG_FIN | TCP_FLAG_PSH));
> +                 ~(TCP_FLAG_FIN | TCP_FLAG_PSH));
>         flush |= (__force int)(th->ack_seq ^ th2->ack_seq);
>         for (i = sizeof(*th); i < thlen; i += 4)
>                 flush |= *(u32 *)((u8 *)th + i) ^
> @@ -405,7 +405,7 @@ void tcp_gro_complete(struct sk_buff *skb)
>         shinfo->gso_segs = NAPI_GRO_CB(skb)->count;
>
>         if (th->cwr)
> -               shinfo->gso_type |= SKB_GSO_TCP_ECN;
> +               shinfo->gso_type |= SKB_GSO_TCP_ACCECN;
>  }
>  EXPORT_SYMBOL(tcp_gro_complete);
>

I do not really understand this patch. How a GRO engine can know which
ECN variant the peers are using ?

SKB_GSO_TCP_ECN is also used from other points, what is your plan ?

git grep -n SKB_GSO_TCP_ECN
drivers/net/ethernet/hisilicon/hns3/hns3_enet.c:3888:
skb_shinfo(skb)->gso_type |= SKB_GSO_TCP_ECN;
drivers/net/ethernet/mellanox/mlx5/core/en_rx.c:1291:
skb_shinfo(skb)->gso_type |= SKB_GSO_TCP_ECN;
drivers/net/ethernet/mellanox/mlx5/core/en_rx.c:1312:
skb_shinfo(skb)->gso_type |= SKB_GSO_TCP_ECN;
include/linux/netdevice.h:5061: BUILD_BUG_ON(SKB_GSO_TCP_ECN !=
(NETIF_F_TSO_ECN >> NETIF_F_GSO_SHIFT));
include/linux/skbuff.h:664:     SKB_GSO_TCP_ECN = 1 << 2,
include/linux/virtio_net.h:88:                  gso_type |= SKB_GSO_TCP_ECN;
include/linux/virtio_net.h:161:         switch (gso_type & ~SKB_GSO_TCP_ECN) {
include/linux/virtio_net.h:226:         if (sinfo->gso_type & SKB_GSO_TCP_ECN)
net/ipv4/tcp_offload.c:404:             shinfo->gso_type |= SKB_GSO_TCP_ECN;
net/ipv4/tcp_output.c:389:
skb_shinfo(skb)->gso_type |= SKB_GSO_TCP_ECN;

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v5 net-next 08/13] gso: AccECN support
  2024-11-05 10:06 ` [PATCH v5 net-next 08/13] gso: AccECN support chia-yu.chang
@ 2024-11-07 12:41   ` Eric Dumazet
  2024-11-09 10:09   ` Ilpo Järvinen
  1 sibling, 0 replies; 37+ messages in thread
From: Eric Dumazet @ 2024-11-07 12:41 UTC (permalink / raw)
  To: chia-yu.chang
  Cc: netdev, dsahern, davem, dsahern, pabeni, joel.granados, kuba,
	andrew+netdev, horms, pablo, kadlec, netfilter-devel, coreteam,
	ij, ncardwell, koen.de_schepper, g.white, ingemar.s.johansson,
	mirja.kuehlewind, cheshire, rs.ietf, Jason_Livingood, vidhi_goel

On Tue, Nov 5, 2024 at 11:07 AM <chia-yu.chang@nokia-bell-labs.com> wrote:
>
> From: Ilpo Järvinen <ij@kernel.org>
>
> Handling the CWR flag differs between RFC 3168 ECN and AccECN.
> With RFC 3168 ECN aware TSO (NETIF_F_TSO_ECN) CWR flag is cleared
> starting from the 2nd segment, which is incompatible with how AccECN handles
> the CWR flag. Such super-segments are indicated by SKB_GSO_TCP_ECN.
> With AccECN, CWR flag (or more accurately, the ACE field that also
> includes ECE & AE flags) changes only when new packet(s) with CE
> mark arrives so the flag should not be changed within a super-skb.
> The new skb/feature flags are necessary to prevent such TSO engines
> corrupting AccECN ACE counters by clearing the CWR flag (if the
> CWR handling feature cannot be turned off).
>
> If the NIC is completely unaware of RFC3168 ECN (doesn't support
> NETIF_F_TSO_ECN) or its TSO engine can be set to not touch the CWR
> flag despite also supporting NETIF_F_TSO_ECN, TSO could be safely used
> with AccECN on such a NIC. This should be evaluated on a per-NIC basis
> (not done in this patch series for any NICs).
>
> For the cases where TSO cannot keep its hands off the CWR flag,
> a GSO fallback is provided by this patch.
>
> Signed-off-by: Ilpo Järvinen <ij@kernel.org>
> Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>

Reviewed-by: Eric Dumazet <edumazet@google.com>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v5 net-next 10/13] tcp: AccECN support to tcp_add_backlog
  2024-11-05 10:06 ` [PATCH v5 net-next 10/13] tcp: AccECN support to tcp_add_backlog chia-yu.chang
@ 2024-11-07 12:42   ` Eric Dumazet
  0 siblings, 0 replies; 37+ messages in thread
From: Eric Dumazet @ 2024-11-07 12:42 UTC (permalink / raw)
  To: chia-yu.chang
  Cc: netdev, dsahern, davem, dsahern, pabeni, joel.granados, kuba,
	andrew+netdev, horms, pablo, kadlec, netfilter-devel, coreteam,
	ij, ncardwell, koen.de_schepper, g.white, ingemar.s.johansson,
	mirja.kuehlewind, cheshire, rs.ietf, Jason_Livingood, vidhi_goel

On Tue, Nov 5, 2024 at 11:07 AM <chia-yu.chang@nokia-bell-labs.com> wrote:
>
> From: Ilpo Järvinen <ij@kernel.org>
>
> AE flag needs to be preserved for AccECN.
>
> Signed-off-by: Ilpo Järvinen <ij@kernel.org>
> Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>

Reviewed-by: Eric Dumazet <edumazet@google.com>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v5 net-next 11/13] tcp: allow ECN bits in TOS/traffic class
  2024-11-05 10:06 ` [PATCH v5 net-next 11/13] tcp: allow ECN bits in TOS/traffic class chia-yu.chang
@ 2024-11-07 12:55   ` Eric Dumazet
  0 siblings, 0 replies; 37+ messages in thread
From: Eric Dumazet @ 2024-11-07 12:55 UTC (permalink / raw)
  To: chia-yu.chang
  Cc: netdev, dsahern, davem, dsahern, pabeni, joel.granados, kuba,
	andrew+netdev, horms, pablo, kadlec, netfilter-devel, coreteam,
	ij, ncardwell, koen.de_schepper, g.white, ingemar.s.johansson,
	mirja.kuehlewind, cheshire, rs.ietf, Jason_Livingood, vidhi_goel

On Tue, Nov 5, 2024 at 11:07 AM <chia-yu.chang@nokia-bell-labs.com> wrote:
>
> From: Ilpo Järvinen <ij@kernel.org>
>
> AccECN connection's last ACK cannot retain ECT(1) as the bits
> are always cleared, causing the packet to switch into another
> service queue.
>
> This effectively adds a finer-grained filtering for ECN bits
> so that acceptable TW ACKs can retain the bits.

Too cryptic changelog.

Please add more explanations, because I could not really understand
the intent of this patch.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v5 net-next 12/13] tcp: Pass flags to __tcp_send_ack
  2024-11-05 10:06 ` [PATCH v5 net-next 12/13] tcp: Pass flags to __tcp_send_ack chia-yu.chang
@ 2024-11-07 12:58   ` Eric Dumazet
  0 siblings, 0 replies; 37+ messages in thread
From: Eric Dumazet @ 2024-11-07 12:58 UTC (permalink / raw)
  To: chia-yu.chang
  Cc: netdev, dsahern, davem, dsahern, pabeni, joel.granados, kuba,
	andrew+netdev, horms, pablo, kadlec, netfilter-devel, coreteam,
	ij, ncardwell, koen.de_schepper, g.white, ingemar.s.johansson,
	mirja.kuehlewind, cheshire, rs.ietf, Jason_Livingood, vidhi_goel

On Tue, Nov 5, 2024 at 11:07 AM <chia-yu.chang@nokia-bell-labs.com> wrote:
>
> From: Ilpo Järvinen <ij@kernel.org>
>
> Accurate ECN needs to send custom flags to handle IP-ECN
> field reflection during the handshake.
>
> Signed-off-by: Ilpo Järvinen <ij@kernel.org>
> Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>

Reviewed-by: Eric Dumazet <edumazet@google.com>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v5 net-next 13/13] tcp: fast path functions later
  2024-11-05 10:06 ` [PATCH v5 net-next 13/13] tcp: fast path functions later chia-yu.chang
@ 2024-11-07 13:00   ` Eric Dumazet
  2024-11-07 20:38     ` Chia-Yu Chang (Nokia)
  0 siblings, 1 reply; 37+ messages in thread
From: Eric Dumazet @ 2024-11-07 13:00 UTC (permalink / raw)
  To: chia-yu.chang
  Cc: netdev, dsahern, davem, dsahern, pabeni, joel.granados, kuba,
	andrew+netdev, horms, pablo, kadlec, netfilter-devel, coreteam,
	ij, ncardwell, koen.de_schepper, g.white, ingemar.s.johansson,
	mirja.kuehlewind, cheshire, rs.ietf, Jason_Livingood, vidhi_goel

On Tue, Nov 5, 2024 at 11:07 AM <chia-yu.chang@nokia-bell-labs.com> wrote:
>
> From: Ilpo Järvinen <ij@kernel.org>
>
> The following patch will use tcp_ecn_mode_accecn(),
> TCP_ACCECN_CEP_INIT_OFFSET, TCP_ACCECN_CEP_ACE_MASK in
> __tcp_fast_path_on() to add a new flag for AccECN.
>
> No functional changes.
>
> Signed-off-by: Ilpo Järvinen <ij@kernel.org>
> Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>

I guess this patch should not land in this series, but in the following one.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v5 net-next 09/13] gro: prevent ACE field corruption & better AccECN handling
  2024-11-07 12:40   ` Eric Dumazet
@ 2024-11-07 19:28     ` Ilpo Järvinen
  2024-11-12  2:09       ` Chia-Yu Chang (Nokia)
  0 siblings, 1 reply; 37+ messages in thread
From: Ilpo Järvinen @ 2024-11-07 19:28 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: chia-yu.chang, netdev, dsahern, davem, dsahern, pabeni,
	joel.granados, kuba, andrew+netdev, horms, pablo, kadlec,
	netfilter-devel, coreteam, ncardwell, koen.de_schepper, g.white,
	ingemar.s.johansson, mirja.kuehlewind, cheshire, rs.ietf,
	Jason_Livingood, vidhi_goel

[-- Attachment #1: Type: text/plain, Size: 4454 bytes --]

On Thu, 7 Nov 2024, Eric Dumazet wrote:

> On Tue, Nov 5, 2024 at 11:07 AM <chia-yu.chang@nokia-bell-labs.com> wrote:
> >
> > From: Ilpo Järvinen <ij@kernel.org>
> >
> > There are important differences in how the CWR field behaves
> > in RFC3168 and AccECN. With AccECN, CWR flag is part of the
> > ACE counter and its changes are important, so adjust the flags
> > changed mask accordingly.
> >
> > Also, if CWR is there, set the Accurate ECN GSO flag to avoid
> > corrupting the CWR flag somewhere.
> >
> > Signed-off-by: Ilpo Järvinen <ij@kernel.org>
> > Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
> > ---
> >  net/ipv4/tcp_offload.c | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/net/ipv4/tcp_offload.c b/net/ipv4/tcp_offload.c
> > index 0b05f30e9e5f..f59762d88c38 100644
> > --- a/net/ipv4/tcp_offload.c
> > +++ b/net/ipv4/tcp_offload.c
> > @@ -329,7 +329,7 @@ struct sk_buff *tcp_gro_receive(struct list_head *head, struct sk_buff *skb,
> >         th2 = tcp_hdr(p);
> >         flush = (__force int)(flags & TCP_FLAG_CWR);
> >         flush |= (__force int)((flags ^ tcp_flag_word(th2)) &
> > -                 ~(TCP_FLAG_CWR | TCP_FLAG_FIN | TCP_FLAG_PSH));
> > +                 ~(TCP_FLAG_FIN | TCP_FLAG_PSH));
> >         flush |= (__force int)(th->ack_seq ^ th2->ack_seq);
> >         for (i = sizeof(*th); i < thlen; i += 4)
> >                 flush |= *(u32 *)((u8 *)th + i) ^
> > @@ -405,7 +405,7 @@ void tcp_gro_complete(struct sk_buff *skb)
> >         shinfo->gso_segs = NAPI_GRO_CB(skb)->count;
> >
> >         if (th->cwr)
> > -               shinfo->gso_type |= SKB_GSO_TCP_ECN;
> > +               shinfo->gso_type |= SKB_GSO_TCP_ACCECN;
> >  }
> >  EXPORT_SYMBOL(tcp_gro_complete);
> >
> 
> I do not really understand this patch. How a GRO engine can know which
> ECN variant the peers are using ?

Hi Eric,

Thanks for taking a look.

GRO doesn't know. Setting SKB_GSO_TCP_ECN without knowing can result in a 
header change that corrupts the ACE field. Thus, GRO has to assume the 
RFC3168 CWR offloading trick cannot be used anymore (unless it tracks the 
connection and knows the bits are used for RFC3168, which is something 
nobody is going to do).

The main point of SKB_GSO_TCP_ACCECN is to prevent SKB_GSO_TCP_ECN or 
NETIF_F_TSO_ECN offloading from being used for the skb, as that would 
corrupt the ACE field value.

SKB_GSO_TCP_ACCECN doesn't allow the CWR bit to change within a super-skb; 
instead, the same CWR flag should be repeated for all segments. In a sense, 
it's simpler than RFC3168 offloading.

> SKB_GSO_TCP_ECN is also used from other points, what is your plan ?
> 
> git grep -n SKB_GSO_TCP_ECN
> drivers/net/ethernet/hisilicon/hns3/hns3_enet.c:3888:
> skb_shinfo(skb)->gso_type |= SKB_GSO_TCP_ECN;
> drivers/net/ethernet/mellanox/mlx5/core/en_rx.c:1291:
> skb_shinfo(skb)->gso_type |= SKB_GSO_TCP_ECN;
> drivers/net/ethernet/mellanox/mlx5/core/en_rx.c:1312:
> skb_shinfo(skb)->gso_type |= SKB_GSO_TCP_ECN;

These looked like they should be just changed to set SKB_GSO_TCP_ACCECN 
instead.

I no longer recall why I didn't change those when I made this patch a 
long time ago; perhaps it was just an oversight, or things have changed 
somehow since then.

> include/linux/netdevice.h:5061: BUILD_BUG_ON(SKB_GSO_TCP_ECN !=
> (NETIF_F_TSO_ECN >> NETIF_F_GSO_SHIFT));
> include/linux/skbuff.h:664:     SKB_GSO_TCP_ECN = 1 << 2,

Not relevant.

> include/linux/virtio_net.h:88:                  gso_type |= SKB_GSO_TCP_ECN;
> include/linux/virtio_net.h:161:         switch (gso_type & ~SKB_GSO_TCP_ECN) {
> include/linux/virtio_net.h:226:         if (sinfo->gso_type & SKB_GSO_TCP_ECN)

These need to be looked into further to see what's going on, as UAPI is 
also involved here.

> net/ipv4/tcp_offload.c:404:             shinfo->gso_type |= SKB_GSO_TCP_ECN;

This was changed above. :-)

> net/ipv4/tcp_output.c:389: skb_shinfo(skb)->gso_type |= SKB_GSO_TCP_ECN;

I'm pretty sure this relates to the sending side, which will be taken care 
of by a later patch in the full series (not among these preparatory patches).


FYI, these TSO/GSO/GRO changes are the part I was most unsure about 
getting right when I was working on this series a few years back, so 
please keep that in mind while reviewing so that my lack of knowledge 
doesn't end up breaking something. :-)

-- 
 i.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v5 net-next 01/13] tcp: reorganize tcp_in_ack_event() and tcp_count_delivered()
  2024-11-07  9:31   ` Eric Dumazet
@ 2024-11-07 19:32     ` Ilpo Järvinen
  0 siblings, 0 replies; 37+ messages in thread
From: Ilpo Järvinen @ 2024-11-07 19:32 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: chia-yu.chang, netdev, dsahern, davem, dsahern, pabeni,
	joel.granados, kuba, andrew+netdev, horms, pablo, kadlec,
	netfilter-devel, coreteam, ncardwell, koen.de_schepper, g.white,
	ingemar.s.johansson, mirja.kuehlewind, cheshire, rs.ietf,
	Jason_Livingood, vidhi_goel

[-- Attachment #1: Type: text/plain, Size: 2432 bytes --]

On Thu, 7 Nov 2024, Eric Dumazet wrote:

> On Tue, Nov 5, 2024 at 11:07 AM <chia-yu.chang@nokia-bell-labs.com> wrote:
> >
> > From: Ilpo Järvinen <ij@kernel.org>
> >
> > - Move tcp_count_delivered() earlier and split tcp_count_delivered_ce()
> >   out of it
> > - Move tcp_in_ack_event() later
> > - While at it, remove the inline from tcp_in_ack_event() and let
> >   the compiler decide
> >
> > Accurate ECN's heuristic does not know if there is going
> > to be ACE field based CE counter increase or not until after
> > rtx queue has been processed. Only then the number of ACKed
> > bytes/pkts is available. As CE or not affects presence of
> > FLAG_ECE, that information for tcp_in_ack_event is not yet
> > available in the old location of the call to tcp_in_ack_event().
> >
> > Signed-off-by: Ilpo Järvinen <ij@kernel.org>
> > Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
> > ---
> >  net/ipv4/tcp_input.c | 56 +++++++++++++++++++++++++-------------------
> >  1 file changed, 32 insertions(+), 24 deletions(-)
> >
> > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> > index 5bdf13ac26ef..fc52eab4fcc9 100644
> > --- a/net/ipv4/tcp_input.c
> > +++ b/net/ipv4/tcp_input.c

> > @@ -3856,12 +3861,23 @@ static void tcp_process_tlp_ack(struct sock *sk, u32 ack, int flag)
> >         }
> >  }
> >
> > -static inline void tcp_in_ack_event(struct sock *sk, u32 flags)
> > +static void tcp_in_ack_event(struct sock *sk, int flag)
> >  {
> >         const struct inet_connection_sock *icsk = inet_csk(sk);
> >
> > -       if (icsk->icsk_ca_ops->in_ack_event)
> > -               icsk->icsk_ca_ops->in_ack_event(sk, flags);
> > +       if (icsk->icsk_ca_ops->in_ack_event) {
> > +               u32 ack_ev_flags = 0;
> > +
> > +               if (flag & FLAG_WIN_UPDATE)
> > +                       ack_ev_flags |= CA_ACK_WIN_UPDATE;
> > +               if (flag & FLAG_SLOWPATH) {
> > +                       ack_ev_flags = CA_ACK_SLOWPATH;
> 
> This is removing the potential CA_ACK_WIN_UPDATE, I would suggest :
> 
> ack_ev_flags |= CA_ACK_SLOWPATH;

Yes, a good catch.

-- 
 i.


> > +                       if (flag & FLAG_ECE)
> > +                               ack_ev_flags |= CA_ACK_ECE;
> > +               }
> > +
> > +               icsk->icsk_ca_ops->in_ack_event(sk, ack_ev_flags);
> > +       }
> >  }
> >
> >
> 

^ permalink raw reply	[flat|nested] 37+ messages in thread

* RE: [PATCH v5 net-next 13/13] tcp: fast path functions later
  2024-11-07 13:00   ` Eric Dumazet
@ 2024-11-07 20:38     ` Chia-Yu Chang (Nokia)
  0 siblings, 0 replies; 37+ messages in thread
From: Chia-Yu Chang (Nokia) @ 2024-11-07 20:38 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: netdev@vger.kernel.org, dsahern@gmail.com, davem@davemloft.net,
	dsahern@kernel.org, pabeni@redhat.com, joel.granados@kernel.org,
	kuba@kernel.org, andrew+netdev@lunn.ch, horms@kernel.org,
	pablo@netfilter.org, kadlec@netfilter.org,
	netfilter-devel@vger.kernel.org, coreteam@netfilter.org,
	ij@kernel.org, ncardwell@google.com, Koen De Schepper (Nokia),
	g.white@cablelabs.com, ingemar.s.johansson@ericsson.com,
	mirja.kuehlewind@ericsson.com, cheshire@apple.com, rs.ietf@gmx.at,
	Jason_Livingood@comcast.com, vidhi_goel@apple.com

>From: Eric Dumazet <edumazet@google.com> 
>Sent: Thursday, November 7, 2024 2:01 PM
>To: Chia-Yu Chang (Nokia) <chia-yu.chang@nokia-bell-labs.com>
>Cc: netdev@vger.kernel.org; dsahern@gmail.com; davem@davemloft.net; dsahern@kernel.org; pabeni@redhat.com; joel.granados@kernel.org; kuba@kernel.org; andrew+netdev@lunn.ch; horms@kernel.org; pablo@netfilter.org; kadlec@netfilter.org; netfilter-devel@vger.kernel.org; coreteam@netfilter.org; ij@kernel.org; ncardwell@google.com; Koen De Schepper (Nokia) <koen.de_schepper@nokia-bell-labs.com>; g.white@cablelabs.com; ingemar.s.johansson@ericsson.com; mirja.kuehlewind@ericsson.com; cheshire@apple.com; rs.ietf@gmx.at; Jason_Livingood@comcast.com; vidhi_goel@apple.com
>Subject: Re: [PATCH v5 net-next 13/13] tcp: fast path functions later
>
>
>On Tue, Nov 5, 2024 at 11:07 AM <chia-yu.chang@nokia-bell-labs.com> wrote:
>>
>> From: Ilpo Järvinen <ij@kernel.org>
>>
>> The following patch will use tcp_ecn_mode_accecn(),
>> TCP_ACCECN_CEP_INIT_OFFSET, and TCP_ACCECN_CEP_ACE_MASK in
>> __tcp_fast_path_on() to make a new flag for AccECN.
>>
>> No functional changes.
>>
>> Signed-off-by: Ilpo Järvinen <ij@kernel.org>
>> Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
>
>I guess this patch should not land in this series, but in the following one.

Hi Eric,

	Indeed, I will move this patch to the following AccECN series.

Best regards,
Chia-Yu

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v5 net-next 08/13] gso: AccECN support
  2024-11-05 10:06 ` [PATCH v5 net-next 08/13] gso: AccECN support chia-yu.chang
  2024-11-07 12:41   ` Eric Dumazet
@ 2024-11-09 10:09   ` Ilpo Järvinen
  2024-11-09 19:42     ` Neal Cardwell
  1 sibling, 1 reply; 37+ messages in thread
From: Ilpo Järvinen @ 2024-11-09 10:09 UTC (permalink / raw)
  To: Chia-Yu Chang
  Cc: netdev, dsahern, davem, edumazet, dsahern, pabeni, joel.granados,
	kuba, andrew+netdev, horms, pablo, kadlec, netfilter-devel,
	coreteam, ncardwell, koen.de_schepper, g.white,
	ingemar.s.johansson, mirja.kuehlewind, cheshire, rs.ietf,
	Jason_Livingood, vidhi_goel


On Tue, 5 Nov 2024, chia-yu.chang@nokia-bell-labs.com wrote:

> From: Ilpo Järvinen <ij@kernel.org>
> 
> Handling the CWR flag differs between RFC 3168 ECN and AccECN.
> With RFC 3168 ECN aware TSO (NETIF_F_TSO_ECN), the CWR flag is
> cleared starting from the 2nd segment, which is incompatible with
> how AccECN handles the CWR flag. Such super-segments are indicated
> by SKB_GSO_TCP_ECN. With AccECN, the CWR flag (or more accurately,
> the ACE field that also includes the ECE & AE flags) changes only
> when new packet(s) with a CE mark arrive, so the flag should not be
> changed within a super-skb. The new skb/feature flags are necessary
> to prevent such TSO engines from corrupting AccECN ACE counters by
> clearing the CWR flag (if the CWR handling feature cannot be
> turned off).
> 
> If a NIC is completely unaware of RFC 3168 ECN (doesn't support
> NETIF_F_TSO_ECN), or its TSO engine can be set to not touch the CWR
> flag despite also supporting NETIF_F_TSO_ECN, TSO could be safely
> used with AccECN on such a NIC. This should be evaluated on a
> per-NIC basis (not done in this patch series for any NICs).
> 
> For the cases where TSO cannot keep its hands off the CWR flag,
> a GSO fallback is provided by this patch.
> 
> Signed-off-by: Ilpo Järvinen <ij@kernel.org>
> Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
> ---
>  include/linux/netdev_features.h | 8 +++++---
>  include/linux/netdevice.h       | 2 ++
>  include/linux/skbuff.h          | 2 ++
>  net/ethtool/common.c            | 1 +
>  net/ipv4/tcp_offload.c          | 6 +++++-
>  5 files changed, 15 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/netdev_features.h b/include/linux/netdev_features.h
> index 66e7d26b70a4..c59db449bcf0 100644
> --- a/include/linux/netdev_features.h
> +++ b/include/linux/netdev_features.h
> @@ -53,12 +53,12 @@ enum {
>  	NETIF_F_GSO_UDP_BIT,		/* ... UFO, deprecated except tuntap */
>  	NETIF_F_GSO_UDP_L4_BIT,		/* ... UDP payload GSO (not UFO) */
>  	NETIF_F_GSO_FRAGLIST_BIT,		/* ... Fraglist GSO */
> +	NETIF_F_GSO_ACCECN_BIT,         /* TCP AccECN w/ TSO (no clear CWR) */
>  	/**/NETIF_F_GSO_LAST =		/* last bit, see GSO_MASK */
> -		NETIF_F_GSO_FRAGLIST_BIT,
> +		NETIF_F_GSO_ACCECN_BIT,
>  
>  	NETIF_F_FCOE_CRC_BIT,		/* FCoE CRC32 */
>  	NETIF_F_SCTP_CRC_BIT,		/* SCTP checksum offload */
> -	__UNUSED_NETIF_F_37,
>  	NETIF_F_NTUPLE_BIT,		/* N-tuple filters supported */
>  	NETIF_F_RXHASH_BIT,		/* Receive hashing offload */
>  	NETIF_F_RXCSUM_BIT,		/* Receive checksumming offload */
> @@ -128,6 +128,7 @@ enum {
>  #define NETIF_F_SG		__NETIF_F(SG)
>  #define NETIF_F_TSO6		__NETIF_F(TSO6)
>  #define NETIF_F_TSO_ECN		__NETIF_F(TSO_ECN)
> +#define NETIF_F_GSO_ACCECN	__NETIF_F(GSO_ACCECN)
>  #define NETIF_F_TSO		__NETIF_F(TSO)
>  #define NETIF_F_VLAN_CHALLENGED	__NETIF_F(VLAN_CHALLENGED)
>  #define NETIF_F_RXFCS		__NETIF_F(RXFCS)
> @@ -210,7 +211,8 @@ static inline int find_next_netdev_feature(u64 feature, unsigned long start)
>  				 NETIF_F_TSO_ECN | NETIF_F_TSO_MANGLEID)
>  
>  /* List of features with software fallbacks. */
> -#define NETIF_F_GSO_SOFTWARE	(NETIF_F_ALL_TSO | NETIF_F_GSO_SCTP |	     \
> +#define NETIF_F_GSO_SOFTWARE	(NETIF_F_ALL_TSO | \
> +				 NETIF_F_GSO_ACCECN | NETIF_F_GSO_SCTP | \
>  				 NETIF_F_GSO_UDP_L4 | NETIF_F_GSO_FRAGLIST)
>  
>  /*
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 6e0f8e4aeb14..480d915b3bdb 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -5076,6 +5076,8 @@ static inline bool net_gso_ok(netdev_features_t features, int gso_type)
>  	BUILD_BUG_ON(SKB_GSO_UDP != (NETIF_F_GSO_UDP >> NETIF_F_GSO_SHIFT));
>  	BUILD_BUG_ON(SKB_GSO_UDP_L4 != (NETIF_F_GSO_UDP_L4 >> NETIF_F_GSO_SHIFT));
>  	BUILD_BUG_ON(SKB_GSO_FRAGLIST != (NETIF_F_GSO_FRAGLIST >> NETIF_F_GSO_SHIFT));
> +	BUILD_BUG_ON(SKB_GSO_TCP_ACCECN !=
> +		     (NETIF_F_GSO_ACCECN >> NETIF_F_GSO_SHIFT));
>  
>  	return (features & feature) == feature;
>  }
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index 48f1e0fa2a13..530cb325fb86 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -694,6 +694,8 @@ enum {
>  	SKB_GSO_UDP_L4 = 1 << 17,
>  
>  	SKB_GSO_FRAGLIST = 1 << 18,
> +
> +	SKB_GSO_TCP_ACCECN = 1 << 19,
>  };
>  
>  #if BITS_PER_LONG > 32
> diff --git a/net/ethtool/common.c b/net/ethtool/common.c
> index 0d62363dbd9d..5c3ba2dfaa74 100644
> --- a/net/ethtool/common.c
> +++ b/net/ethtool/common.c
> @@ -32,6 +32,7 @@ const char netdev_features_strings[NETDEV_FEATURE_COUNT][ETH_GSTRING_LEN] = {
>  	[NETIF_F_TSO_BIT] =              "tx-tcp-segmentation",
>  	[NETIF_F_GSO_ROBUST_BIT] =       "tx-gso-robust",
>  	[NETIF_F_TSO_ECN_BIT] =          "tx-tcp-ecn-segmentation",
> +	[NETIF_F_GSO_ACCECN_BIT] =	 "tx-tcp-accecn-segmentation",
>  	[NETIF_F_TSO_MANGLEID_BIT] =	 "tx-tcp-mangleid-segmentation",
>  	[NETIF_F_TSO6_BIT] =             "tx-tcp6-segmentation",
>  	[NETIF_F_FSO_BIT] =              "tx-fcoe-segmentation",
> diff --git a/net/ipv4/tcp_offload.c b/net/ipv4/tcp_offload.c
> index 2308665b51c5..0b05f30e9e5f 100644
> --- a/net/ipv4/tcp_offload.c
> +++ b/net/ipv4/tcp_offload.c
> @@ -139,6 +139,7 @@ struct sk_buff *tcp_gso_segment(struct sk_buff *skb,
>  	struct sk_buff *gso_skb = skb;
>  	__sum16 newcheck;
>  	bool ooo_okay, copy_destructor;
> +	bool ecn_cwr_mask;
>  	__wsum delta;
>  
>  	th = tcp_hdr(skb);
> @@ -198,6 +199,8 @@ struct sk_buff *tcp_gso_segment(struct sk_buff *skb,
>  
>  	newcheck = ~csum_fold(csum_add(csum_unfold(th->check), delta));
>  
> +	ecn_cwr_mask = !!(skb_shinfo(gso_skb)->gso_type & SKB_GSO_TCP_ACCECN);
> +
>  	while (skb->next) {
>  		th->fin = th->psh = 0;
>  		th->check = newcheck;
> @@ -217,7 +220,8 @@ struct sk_buff *tcp_gso_segment(struct sk_buff *skb,
>  		th = tcp_hdr(skb);
>  
>  		th->seq = htonl(seq);
> -		th->cwr = 0;
> +
> +		th->cwr &= ecn_cwr_mask;

Hi all,

I started to wonder if this approach would avoid CWR corruption without
the need to introduce the SKB_GSO_TCP_ACCECN flag at all:

- Never set SKB_GSO_TCP_ECN for any skb
- Split the CWR segment at the TCP level before sending. While this
requires splitting the super skb, it's a once-per-RTT thing so it does
not sound very significant (compare e.g. with SACK, which causes a split
on every ACK)
- Remove the th->cwr clearing from GSO (the above line)
- Likely needed: don't negotiate HOST/GUEST_ECN in virtio

I think that would prevent offloads from treating the CWR flag as
RFC 3168, and CWR would just be copied as-is like a normal flag, which is
required to not corrupt it. In practice, RFC 3168 style traffic would
simply not combine anything with the CWR'ed packet, as CWRs are
singletons in RFC 3168.

Would that work? It sure sounds too simple to work, but I cannot
immediately see why it would not.
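To make the masking line under discussion concrete: the sketch below
(hypothetical bit values and helper names, not the kernel's code) shows
how "th->cwr &= ecn_cwr_mask" keeps CWR intact for AccECN super-skbs
while preserving the RFC 3168 clearing behaviour for everything else.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical gso_type bits mirroring the patch; values illustrative. */
#define GSO_TCP_ECN    (1u << 2)   /* RFC 3168 style TSO/GSO           */
#define GSO_TCP_ACCECN (1u << 19)  /* AccECN: CWR must stay untouched  */

/* For AccECN super-skbs the mask is 1, so every segment keeps the
 * original CWR bit; otherwise the mask is 0 and CWR is cleared from
 * the 2nd segment onwards, matching RFC 3168 offload behaviour. */
static uint8_t segment_cwr(uint8_t first_cwr, unsigned gso_type, int seg_idx)
{
    uint8_t ecn_cwr_mask = !!(gso_type & GSO_TCP_ACCECN);

    if (seg_idx == 0)
        return first_cwr;            /* 1st segment always keeps CWR */
    return first_cwr & ecn_cwr_mask; /* later segments: masked       */
}
```

With RFC 3168 style segmentation the 2nd and later segments lose CWR;
with the AccECN flag set they keep it, so the ACE field survives
segmentation unchanged.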

-- 
 i.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v5 net-next 08/13] gso: AccECN support
  2024-11-09 10:09   ` Ilpo Järvinen
@ 2024-11-09 19:42     ` Neal Cardwell
  2024-11-09 21:38       ` Ilpo Järvinen
  0 siblings, 1 reply; 37+ messages in thread
From: Neal Cardwell @ 2024-11-09 19:42 UTC (permalink / raw)
  To: Ilpo Järvinen
  Cc: Chia-Yu Chang, netdev, dsahern, davem, edumazet, dsahern, pabeni,
	joel.granados, kuba, andrew+netdev, horms, pablo, kadlec,
	netfilter-devel, coreteam, koen.de_schepper, g.white,
	ingemar.s.johansson, mirja.kuehlewind, cheshire, rs.ietf,
	Jason_Livingood, vidhi_goel, Bob Briscoe

On Sat, Nov 9, 2024 at 5:09 AM Ilpo Järvinen <ij@kernel.org> wrote:
>
> On Tue, 5 Nov 2024, chia-yu.chang@nokia-bell-labs.com wrote:
>
> > From: Ilpo Järvinen <ij@kernel.org>
> >
> > [... commit message and patch body snipped; quoted in full earlier in the thread ...]
> > @@ -217,7 +220,8 @@ struct sk_buff *tcp_gso_segment(struct sk_buff *skb,
> >               th = tcp_hdr(skb);
> >
> >               th->seq = htonl(seq);
> > -             th->cwr = 0;
> > +
> > +             th->cwr &= ecn_cwr_mask;
>
> Hi all,
>
> I started to wonder if this approach would avoid CWR corruption without
> need to introduce the SKB_GSO_TCP_ACCECN flag at all:
>
> - Never set SKB_GSO_TCP_ECN for any skb
> - Split CWR segment on TCP level before sending. While this causes need to
> split the super skb, it's once per RTT thing so does not sound very
> significant (compare e.g. with SACK that cause split on every ACK)
> - Remove th->cwr cleaning from GSO (the above line)
> - Likely needed: don't negotiate HOST/GUEST_ECN in virtio
>
> I think that would prevent offloading treating CWR flag as RFC3168 and CWR
> would be just copied as is like a normal flag which is required to not
> corrupt it. In practice, RFC3168 style traffic would just not combine
> anything with the CWR'ed packet as CWRs are singletons with RFC3168.
>
> Would that work? It sure sounds too simple to work but I cannot
> immediately see why it would not.

The above description just seems to describe the dynamics for the
RFC3168 case (since it mentions the CWR bit being set as just a "once
per RTT thing").

Ilpo, can you please describe how AccECN traffic would be handled in
your design proposal? (Since with established AccECN connections the
CWR bit is part of the ACE counter and may be set on roughly half of
outgoing data skbs...)
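The ACE-counter behaviour referred to here can be sketched briefly
(constants follow the AccECN draft's ACE layout; the helper names are
hypothetical, not kernel code):

```c
#include <assert.h>
#include <stdint.h>

/* Per the AccECN draft, the AE, CWR and ECE header bits together carry
 * the count of CE-marked packets modulo 8, starting from an initial
 * offset of 5 after the handshake. */
#define ACE_INIT_OFFSET 5u

static unsigned ace_field(uint64_t ce_packets)
{
    return (unsigned)((ACE_INIT_OFFSET + ce_packets) & 0x7); /* 3 bits */
}

/* CWR is bit 1 of the ACE value, so it toggles every two increments
 * and ends up set on roughly half of outgoing segments. */
static unsigned ace_cwr_bit(unsigned ace)
{
    return (ace >> 1) & 0x1;
}
```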

Sorry if I'm misunderstanding something...

Thanks,
neal

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v5 net-next 08/13] gso: AccECN support
  2024-11-09 19:42     ` Neal Cardwell
@ 2024-11-09 21:38       ` Ilpo Järvinen
  0 siblings, 0 replies; 37+ messages in thread
From: Ilpo Järvinen @ 2024-11-09 21:38 UTC (permalink / raw)
  To: Neal Cardwell
  Cc: Chia-Yu Chang, netdev, dsahern, davem, edumazet, dsahern, pabeni,
	joel.granados, kuba, andrew+netdev, horms, pablo, kadlec,
	netfilter-devel, coreteam, koen.de_schepper, g.white,
	ingemar.s.johansson, mirja.kuehlewind, cheshire, rs.ietf,
	Jason_Livingood, vidhi_goel, Bob Briscoe


On Sat, 9 Nov 2024, Neal Cardwell wrote:

> On Sat, Nov 9, 2024 at 5:09 AM Ilpo Järvinen <ij@kernel.org> wrote:
> >
> > On Tue, 5 Nov 2024, chia-yu.chang@nokia-bell-labs.com wrote:
> >
> > > From: Ilpo Järvinen <ij@kernel.org>
> > >
> > > [... commit message and patch body snipped; quoted in full earlier in the thread ...]
> > > @@ -217,7 +220,8 @@ struct sk_buff *tcp_gso_segment(struct sk_buff *skb,
> > >               th = tcp_hdr(skb);
> > >
> > >               th->seq = htonl(seq);
> > > -             th->cwr = 0;
> > > +
> > > +             th->cwr &= ecn_cwr_mask;
> >
> > Hi all,
> >
> > I started to wonder if this approach would avoid CWR corruption without
> > need to introduce the SKB_GSO_TCP_ACCECN flag at all:
> >
> > - Never set SKB_GSO_TCP_ECN for any skb
> > - Split CWR segment on TCP level before sending. While this causes need to
> > split the super skb, it's once per RTT thing so does not sound very
> > significant (compare e.g. with SACK that cause split on every ACK)
> > - Remove th->cwr cleaning from GSO (the above line)
> > - Likely needed: don't negotiate HOST/GUEST_ECN in virtio
> >
> > I think that would prevent offloading treating CWR flag as RFC3168 and CWR
> > would be just copied as is like a normal flag which is required to not
> > corrupt it. In practice, RFC3168 style traffic would just not combine
> > anything with the CWR'ed packet as CWRs are singletons with RFC3168.
> >
> > Would that work? It sure sounds too simple to work but I cannot
> > immediately see why it would not.
> 
> The above description just seems to describe the dynamics for the
> RFC3168 case (since it mentions the CWR bit being set as just a "once
> per RTT thing").

For the RFC 3168 ECN case, that will still hold.

I think the misunderstanding came from this: I meant to only briefly
describe how this would impact TCP using RFC 3168 ECN. It needs to split
the CWR segment into its own skb so it does not share an skb with
non-CWR segments, and even that is infrequent enough (once per RTT) that
it doesn't sound like a bad compromise.

> Ilpo, can you please describe how AccECN traffic would be handled in
> your design proposal? (Since with established AccECN connections the
> CWR bit is part of the ACE counter and may be set on roughly half of
> outgoing data skbs...)

TSO (and GSO) RFC 3168 ECN offloading takes an skb, puts CWR only into
the first segment, and the rest of the skb is sent without it.

What we want for AccECN is to never combine CWR and non-CWR skbs. The 
question is just how to achieve this.

My idea is, instead of this SKB_GSO_TCP_ACCECN flag, to always split
those skbs apart at the TCP level (RFC 3168 case) and to adapt GRO to
not set SKB_GSO_TCP_ECN either (the latter should already be done by
this series).

When SKB_GSO_TCP_ECN is no longer set by TCP or GRO, is that enough to
prevent HW from offloading CWR in the RFC 3168 way? That is, will the HW
offload still mess with the CWR flag, or does HW offloading honor the
lack of SKB_GSO_TCP_ECN (I'm not sure about the answer as I've not dealt
with NICs on that level)? If the HW offloading keeps its hands off CWR,
then there would be no need for this new flag.

The GSO case is handled by simply removing the CWR mangling from the code.


Well, I now went to read some random NIC drivers, and I suspect the
answer is that HW offloading will still mess with CWR, given the lack of
code that would depend on that gso_type flag... So I guess my idea is a
dead end, and the SKB_GSO_TCP_ACCECN flag is needed to prevent HW
offloads from seeing the CWR'ed skb.

> Sorry if I'm misunderstanding something...

-- 
 i.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* RE: [PATCH v5 net-next 09/13] gro: prevent ACE field corruption & better AccECN handling
  2024-11-07 19:28     ` Ilpo Järvinen
@ 2024-11-12  2:09       ` Chia-Yu Chang (Nokia)
  2024-11-12 21:19         ` Ilpo Järvinen
  0 siblings, 1 reply; 37+ messages in thread
From: Chia-Yu Chang (Nokia) @ 2024-11-12  2:09 UTC (permalink / raw)
  To: Ilpo Järvinen, Eric Dumazet
  Cc: netdev@vger.kernel.org, dsahern@gmail.com, davem@davemloft.net,
	dsahern@kernel.org, pabeni@redhat.com, joel.granados@kernel.org,
	kuba@kernel.org, andrew+netdev@lunn.ch, horms@kernel.org,
	pablo@netfilter.org, kadlec@netfilter.org,
	netfilter-devel@vger.kernel.org, coreteam@netfilter.org,
	ncardwell@google.com, Koen De Schepper (Nokia),
	g.white@cablelabs.com, ingemar.s.johansson@ericsson.com,
	mirja.kuehlewind@ericsson.com, cheshire@apple.com, rs.ietf@gmx.at,
	Jason_Livingood@comcast.com, vidhi_goel@apple.com

>From: Ilpo Järvinen <ij@kernel.org> 
>Sent: Thursday, November 7, 2024 8:28 PM
>To: Eric Dumazet <edumazet@google.com>
>Cc: Chia-Yu Chang (Nokia) <chia-yu.chang@nokia-bell-labs.com>; netdev@vger.kernel.org; dsahern@gmail.com; davem@davemloft.net; dsahern@kernel.org; pabeni@redhat.com; joel.granados@kernel.org; kuba@kernel.org; andrew+netdev@lunn.ch; horms@kernel.org; pablo@netfilter.org; kadlec@netfilter.org; netfilter-devel@vger.kernel.org; coreteam@netfilter.org; ncardwell@google.com; Koen De Schepper (Nokia) <koen.de_schepper@nokia-bell-labs.com>; g.white@cablelabs.com; ingemar.s.johansson@ericsson.com; mirja.kuehlewind@ericsson.com; cheshire@apple.com; rs.ietf@gmx.at; Jason_Livingood@comcast.com; vidhi_goel@apple.com
>Subject: Re: [PATCH v5 net-next 09/13] gro: prevent ACE field corruption & better AccECN handling
>
>
>On Thu, 7 Nov 2024, Eric Dumazet wrote:
>
>>On Tue, Nov 5, 2024 at 11:07 AM <chia-yu.chang@nokia-bell-labs.com> wrote:
>>>
>>> From: Ilpo Järvinen <ij@kernel.org>
>>>
>>> There are important differences in how the CWR field behaves in 
>>> RFC3168 and AccECN. With AccECN, CWR flag is part of the ACE counter 
>>> and its changes are important so adjust the flags changed mask 
>>> accordingly.
>>>
>>> Also, if CWR is there, set the Accurate ECN GSO flag to avoid 
>>> corrupting CWR flag somewhere.
>>>
>>> Signed-off-by: Ilpo Järvinen <ij@kernel.org>
>>> Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
>>> ---
>>>  net/ipv4/tcp_offload.c | 4 ++--
>>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/net/ipv4/tcp_offload.c b/net/ipv4/tcp_offload.c index 
>>> 0b05f30e9e5f..f59762d88c38 100644
>>> --- a/net/ipv4/tcp_offload.c
>>> +++ b/net/ipv4/tcp_offload.c
>>> @@ -329,7 +329,7 @@ struct sk_buff *tcp_gro_receive(struct list_head *head, struct sk_buff *skb,
>>>         th2 = tcp_hdr(p);
>>>         flush = (__force int)(flags & TCP_FLAG_CWR);
>>>         flush |= (__force int)((flags ^ tcp_flag_word(th2)) &
>>> -                 ~(TCP_FLAG_CWR | TCP_FLAG_FIN | TCP_FLAG_PSH));
>>> +                 ~(TCP_FLAG_FIN | TCP_FLAG_PSH));
>>>         flush |= (__force int)(th->ack_seq ^ th2->ack_seq);
>>>         for (i = sizeof(*th); i < thlen; i += 4)
>>>                 flush |= *(u32 *)((u8 *)th + i) ^ @@ -405,7 +405,7 
>>> @@ void tcp_gro_complete(struct sk_buff *skb)
>>>         shinfo->gso_segs = NAPI_GRO_CB(skb)->count;
>>>
>>>         if (th->cwr)
>>> -               shinfo->gso_type |= SKB_GSO_TCP_ECN;
>>> +               shinfo->gso_type |= SKB_GSO_TCP_ACCECN;
>>>  }
>>>  EXPORT_SYMBOL(tcp_gro_complete);
>>>
>>
>> I do not really understand this patch. How a GRO engine can know which 
>> ECN variant the peers are using ?
>
>Hi Eric,
>
>Thanks for taking a look.
>
>GRO doesn't know. Setting SKB_GSO_TCP_ECN without knowing can result in a header change that corrupts the ACE field. Thus, GRO has to assume the RFC 3168 CWR offloading trick cannot be used anymore (unless it tracks the connection and knows the bits are used for RFC 3168, which is something nobody is going to do).
>
>The main point of SKB_GSO_TCP_ACCECN is to prevent SKB_GSO_TCP_ECN or NETIF_F_TSO_ECN offloading from being used for the skb, as it would corrupt the ACE field value.
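The flush-mask change in the quoted hunk can be sketched as follows
(simplified host-order flag bits and a hypothetical helper; the kernel
compares a network-order flag word in tcp_gro_receive):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative TCP flag bits, host order for the sketch. */
#define FLAG_FIN 0x01u
#define FLAG_PSH 0x08u
#define FLAG_CWR 0x80u

/* GRO merges two segments only if no significant flag differs. Before
 * the patch, CWR sat in the ignore mask (RFC 3168 semantics); with
 * AccECN, CWR is part of the ACE counter, so any change in it must
 * force a flush instead of being merged away. */
static int flags_force_flush(uint32_t f1, uint32_t f2, int accecn)
{
    uint32_t ignore = FLAG_FIN | FLAG_PSH;

    if (!accecn)
        ignore |= FLAG_CWR;  /* old behaviour: CWR differences ignored */
    return ((f1 ^ f2) & ~ignore) != 0;
}
```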

Hi Eric and Ilpo,

From my understanding of another email thread (patch 08/13), it seems the conclusion is that SKB_GSO_TCP_ACCECN will still be needed.

>
>SKB_GSO_TCP_ACCECN doesn't allow CWR bits change within a super-skb but the same CWR flag should be repeated for all segments. In a sense, it's simpler than RFC3168 offloading.
>
>> SKB_GSO_TCP_ECN is also used from other points, what is your plan ?
>>
>> git grep -n SKB_GSO_TCP_ECN
>> drivers/net/ethernet/hisilicon/hns3/hns3_enet.c:3888:
>> skb_shinfo(skb)->gso_type |= SKB_GSO_TCP_ECN;
>> drivers/net/ethernet/mellanox/mlx5/core/en_rx.c:1291:
>> skb_shinfo(skb)->gso_type |= SKB_GSO_TCP_ECN;
>> drivers/net/ethernet/mellanox/mlx5/core/en_rx.c:1312:
>> skb_shinfo(skb)->gso_type |= SKB_GSO_TCP_ECN;
>
>These looked like they should be just changed to set SKB_GSO_TCP_ACCECN instead.

I agree with these changes and will apply them in the next version.

>
>I don't anymore recall why I didn't change those when I made this patch long time ago, perhaps it was just an oversight or things have changed somehow since then.
>
>> include/linux/netdevice.h:5061: BUILD_BUG_ON(SKB_GSO_TCP_ECN != 
>> (NETIF_F_TSO_ECN >> NETIF_F_GSO_SHIFT));
>> include/linux/skbuff.h:664:     SKB_GSO_TCP_ECN = 1 << 2,
>
>Not relevant.
>
>> include/linux/virtio_net.h:88:                  gso_type |= SKB_GSO_TCP_ECN;
>> include/linux/virtio_net.h:161:         switch (gso_type & ~SKB_GSO_TCP_ECN) {
>> include/linux/virtio_net.h:226:         if (sinfo->gso_type & SKB_GSO_TCP_ECN)
>
>These need to be looked further what's going on as UAPI is also involved here.

I had a look at these parts, and only the 1st and 3rd ones are relevant.
Related to the 1st one, I propose to change
from

                if (hdr->gso_type & VIRTIO_NET_HDR_GSO_ECN)
                        gso_type |= SKB_GSO_TCP_ECN;

to

                if (hdr->gso_type & VIRTIO_NET_HDR_GSO_ECN)
                        gso_type |= SKB_GSO_TCP_ACCECN;

The reason behind this proposed change is similar to the above changes in en_rx.c.

For the 3rd one, I suggest to change from

                if (sinfo->gso_type & SKB_GSO_TCP_ECN)
                        hdr->gso_type |= VIRTIO_NET_HDR_GSO_ECN;

to

                if (sinfo->gso_type & (SKB_GSO_TCP_ECN | SKB_GSO_TCP_ACCECN))
                        hdr->gso_type |= VIRTIO_NET_HDR_GSO_ECN;

This proposed change is because VIRTIO_NET_HDR_GSO_ECN must be set to allow segmentation offload of TCP packets that have the ECN bit set.
So, no matter whether the skb gso_type has the GSO_TCP_ECN flag or the GSO_TCP_ACCECN flag, the corresponding hdr gso_type shall be set.

But I wonder what the driver will do with the VIRTIO_NET_HDR_GSO_ECN flag. Will it corrupt CWR or not?

--
Chia-Yu

>
>> net/ipv4/tcp_offload.c:404:             shinfo->gso_type |= SKB_GSO_TCP_ECN;
>
>This was changed above. :-)
>
>> net/ipv4/tcp_output.c:389: skb_shinfo(skb)->gso_type |= 
>> SKB_GSO_TCP_ECN;
>
>I'm pretty sure this relates to sending side which will be taken care by a later patch in the full series (not among these preparatory patches).
>
>
>FYI, these TSO/GSO/GRO changes are what I was most unsure myself if I got everything right when I was working with this series a few years back so please keep that in mind while reviewing so my lack of knowledge doesn't end up breaking something. :-)


^ permalink raw reply	[flat|nested] 37+ messages in thread

* RE: [PATCH v5 net-next 09/13] gro: prevent ACE field corruption & better AccECN handling
  2024-11-12  2:09       ` Chia-Yu Chang (Nokia)
@ 2024-11-12 21:19         ` Ilpo Järvinen
  2024-11-13  1:42           ` Jason Wang
  0 siblings, 1 reply; 37+ messages in thread
From: Ilpo Järvinen @ 2024-11-12 21:19 UTC (permalink / raw)
  To: Chia-Yu Chang (Nokia), Michael S. Tsirkin, Jason Wang,
	virtualization
  Cc: Eric Dumazet, netdev@vger.kernel.org, dsahern@gmail.com,
	davem@davemloft.net, dsahern@kernel.org, pabeni@redhat.com,
	joel.granados@kernel.org, kuba@kernel.org, andrew+netdev@lunn.ch,
	horms@kernel.org, pablo@netfilter.org, kadlec@netfilter.org,
	netfilter-devel@vger.kernel.org, coreteam@netfilter.org,
	ncardwell@google.com, Koen De Schepper (Nokia),
	g.white@cablelabs.com, ingemar.s.johansson@ericsson.com,
	mirja.kuehlewind@ericsson.com, cheshire@apple.com, rs.ietf@gmx.at,
	Jason_Livingood@comcast.com, vidhi_goel@apple.com

Adding a few virtio people. Please see the virtio spec/flag question 
below.

On Tue, 12 Nov 2024, Chia-Yu Chang (Nokia) wrote:

> >From: Ilpo Järvinen <ij@kernel.org> 
> >Sent: Thursday, November 7, 2024 8:28 PM
> >To: Eric Dumazet <edumazet@google.com>
> >Cc: Chia-Yu Chang (Nokia) <chia-yu.chang@nokia-bell-labs.com>; netdev@vger.kernel.org; dsahern@gmail.com; davem@davemloft.net; dsahern@kernel.org; pabeni@redhat.com; joel.granados@kernel.org; kuba@kernel.org; andrew+netdev@lunn.ch; horms@kernel.org; pablo@netfilter.org; kadlec@netfilter.org; netfilter-devel@vger.kernel.org; coreteam@netfilter.org; ncardwell@google.com; Koen De Schepper (Nokia) <koen.de_schepper@nokia-bell-labs.com>; g.white@cablelabs.com; ingemar.s.johansson@ericsson.com; mirja.kuehlewind@ericsson.com; cheshire@apple.com; rs.ietf@gmx.at; Jason_Livingood@comcast.com; vidhi_goel@apple.com
> >Subject: Re: [PATCH v5 net-next 09/13] gro: prevent ACE field corruption & better AccECN handling
> >
> >On Thu, 7 Nov 2024, Eric Dumazet wrote:
> >
> >>On Tue, Nov 5, 2024 at 11:07 AM <chia-yu.chang@nokia-bell-labs.com> wrote:
> >>>
> >>> From: Ilpo Järvinen <ij@kernel.org>
> >>>
> >>> There are important differences in how the CWR field behaves in 
> >>> RFC3168 and AccECN. With AccECN, CWR flag is part of the ACE counter 
> >>> and its changes are important so adjust the flags changed mask 
> >>> accordingly.
> >>>
> >>> Also, if CWR is there, set the Accurate ECN GSO flag to avoid 
> >>> corrupting CWR flag somewhere.
> >>>
> >>> Signed-off-by: Ilpo Järvinen <ij@kernel.org>
> >>> Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
> >>> ---
> >>>  net/ipv4/tcp_offload.c | 4 ++--
> >>>  1 file changed, 2 insertions(+), 2 deletions(-)
> >>>
> >>> diff --git a/net/ipv4/tcp_offload.c b/net/ipv4/tcp_offload.c
> >>> index 0b05f30e9e5f..f59762d88c38 100644
> >>> --- a/net/ipv4/tcp_offload.c
> >>> +++ b/net/ipv4/tcp_offload.c
> >>> @@ -329,7 +329,7 @@ struct sk_buff *tcp_gro_receive(struct list_head *head, struct sk_buff *skb,
> >>>         th2 = tcp_hdr(p);
> >>>         flush = (__force int)(flags & TCP_FLAG_CWR);
> >>>         flush |= (__force int)((flags ^ tcp_flag_word(th2)) &
> >>> -                 ~(TCP_FLAG_CWR | TCP_FLAG_FIN | TCP_FLAG_PSH));
> >>> +                 ~(TCP_FLAG_FIN | TCP_FLAG_PSH));
> >>>         flush |= (__force int)(th->ack_seq ^ th2->ack_seq);
> >>>         for (i = sizeof(*th); i < thlen; i += 4)
> >>>                 flush |= *(u32 *)((u8 *)th + i) ^
> >>>                          *(u32 *)((u8 *)th2 + i);
> >>> @@ -405,7 +405,7 @@ void tcp_gro_complete(struct sk_buff *skb)
> >>>         shinfo->gso_segs = NAPI_GRO_CB(skb)->count;
> >>>
> >>>         if (th->cwr)
> >>> -               shinfo->gso_type |= SKB_GSO_TCP_ECN;
> >>> +               shinfo->gso_type |= SKB_GSO_TCP_ACCECN;
> >>>  }
> >>>  EXPORT_SYMBOL(tcp_gro_complete);
> >>>
> >>
> >> I do not really understand this patch. How a GRO engine can know which 
> >> ECN variant the peers are using ?
> >
> >Hi Eric,
> >
> >Thanks for taking a look.
> >
> >GRO doesn't know. Setting SKB_GSO_TCP_ECN in case of not knowing can result in header change that corrupts ACE field. Thus, GRO has to assume the RFC3168 CWR offloading trick cannot be used anymore (unless it tracks the connection and knows the bits are used for RFC3168 which is something nobody is going to do).
> >
> >The main point of SKB_GSO_TCP_ACCECN is to prevent SKB_GSO_TCP_ECN or NETIF_F_TSO_ECN offloading to be used for the skb as it would corrupt ACE field value.
> 
> Hi Eric and Ilpo,
> 
> From my understanding of another email thread (patch 08/13), it seems the conclusion is that SKB_GSO_TCP_ACCECN will still be needed.
> 
> >
> >SKB_GSO_TCP_ACCECN doesn't allow CWR bits change within a super-skb but the same CWR flag should be repeated for all segments. In a sense, it's simpler than RFC3168 offloading.
> >
> >> SKB_GSO_TCP_ECN is also used from other points, what is your plan ?
> >>
> >> git grep -n SKB_GSO_TCP_ECN
> >> drivers/net/ethernet/hisilicon/hns3/hns3_enet.c:3888:
> >> skb_shinfo(skb)->gso_type |= SKB_GSO_TCP_ECN;
> >> drivers/net/ethernet/mellanox/mlx5/core/en_rx.c:1291:
> >> skb_shinfo(skb)->gso_type |= SKB_GSO_TCP_ECN;
> >> drivers/net/ethernet/mellanox/mlx5/core/en_rx.c:1312:
> >> skb_shinfo(skb)->gso_type |= SKB_GSO_TCP_ECN;
> >
> >These looked like they should be just changed to set SKB_GSO_TCP_ACCECN instead.
> 
> I agree with these changes and will apply them in the next version.
> 
> >
> >I don't anymore recall why I didn't change those when I made this patch long time ago, perhaps it was just an oversight or things have changed somehow since then.
> >
> >> include/linux/netdevice.h:5061: BUILD_BUG_ON(SKB_GSO_TCP_ECN != 
> >> (NETIF_F_TSO_ECN >> NETIF_F_GSO_SHIFT));
> >> include/linux/skbuff.h:664:     SKB_GSO_TCP_ECN = 1 << 2,
> >
> >Not relevant.
> >
> >> include/linux/virtio_net.h:88:                  gso_type |= SKB_GSO_TCP_ECN;
> >> include/linux/virtio_net.h:161:         switch (gso_type & ~SKB_GSO_TCP_ECN) {
> >> include/linux/virtio_net.h:226:         if (sinfo->gso_type & SKB_GSO_TCP_ECN)
> >
> >These need to be looked further what's going on as UAPI is also 
> > involved here. 
> 
> I had a look at these parts, and only the 1st and 3rd ones are relevant.
> Related to the 1st one, I propose to change
> from
> 
>                 if (hdr->gso_type & VIRTIO_NET_HDR_GSO_ECN)
>                         gso_type |= SKB_GSO_TCP_ECN;
> 
> to
> 
>                 if (hdr->gso_type & VIRTIO_NET_HDR_GSO_ECN)
>                         gso_type |= SKB_GSO_TCP_ACCECN;
> 
> The reason behind this proposed change is similar to the above changes
> in en_rx.c.

But en_rx.c is based on the CWR flag; there's no similar check here.

> For the 3rd one, I suggest to change from
> 
>                 if (sinfo->gso_type & SKB_GSO_TCP_ECN)
>                         hdr->gso_type |= VIRTIO_NET_HDR_GSO_ECN;
>
> to
> 
>                 if (sinfo->gso_type & (SKB_GSO_TCP_ECN | SKB_GSO_TCP_ACCECN))
>                         hdr->gso_type |= VIRTIO_NET_HDR_GSO_ECN;
> 
> This proposed change is because VIRTIO_NET_HDR_GSO_ECN must be set to
> allow segmentation offload of TCP packets that have the ECN bit set.
> So, no matter whether the skb gso_type has the GSO_TCP_ECN flag or the
> GSO_TCP_ACCECN flag, the corresponding hdr gso_type shall be set.
> 
> But I wonder what the driver will do with the VIRTIO_NET_HDR_GSO_ECN
> flag. Will it corrupt CWR or not?

I'm starting to heavily question what is the meaning of that 
VIRTIO_NET_HDR_GSO_ECN flag and if it's even consistent...

https://github.com/qemu/qemu/blob/master/net/eth.c

That sets VIRTIO_NET_HDR_GSO_ECN based on CE?? (Whereas kernel associates 
the related SKB_GSO_TCP_ECN to CWR offloading.)

The virtio spec is way too vague to be useful so it would not be 
surprising if there are diverging interpretations from implementers:

"If the driver negotiated the VIRTIO_NET_F_HOST_ECN feature, the 
VIRTIO_NET_HDR_GSO_ECN bit in gso_type indicates that the TCP packet has 
the ECN bit set."

What is "the ECN bit" in terms used by TCP (or IP)? Could some virtio 
folks please explain?


-- 
 i.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v5 net-next 09/13] gro: prevent ACE field corruption & better AccECN handling
  2024-11-12 21:19         ` Ilpo Järvinen
@ 2024-11-13  1:42           ` Jason Wang
  2024-11-14  1:04             ` Chia-Yu Chang (Nokia)
  0 siblings, 1 reply; 37+ messages in thread
From: Jason Wang @ 2024-11-13  1:42 UTC (permalink / raw)
  To: Ilpo Järvinen
  Cc: Chia-Yu Chang (Nokia), Michael S. Tsirkin, virtualization,
	Eric Dumazet, netdev@vger.kernel.org, dsahern@gmail.com,
	davem@davemloft.net, dsahern@kernel.org, pabeni@redhat.com,
	joel.granados@kernel.org, kuba@kernel.org, andrew+netdev@lunn.ch,
	horms@kernel.org, pablo@netfilter.org, kadlec@netfilter.org,
	netfilter-devel@vger.kernel.org, coreteam@netfilter.org,
	ncardwell@google.com, Koen De Schepper (Nokia),
	g.white@cablelabs.com, ingemar.s.johansson@ericsson.com,
	mirja.kuehlewind@ericsson.com, cheshire@apple.com, rs.ietf@gmx.at,
	Jason_Livingood@comcast.com, vidhi_goel@apple.com

On Wed, Nov 13, 2024 at 5:19 AM Ilpo Järvinen <ij@kernel.org> wrote:
>
> Adding a few virtio people. Please see the virtio spec/flag question
> below.
>
> On Tue, 12 Nov 2024, Chia-Yu Chang (Nokia) wrote:
>
> > >From: Ilpo Järvinen <ij@kernel.org>
> > >Sent: Thursday, November 7, 2024 8:28 PM
> > >To: Eric Dumazet <edumazet@google.com>
> > >Cc: Chia-Yu Chang (Nokia) <chia-yu.chang@nokia-bell-labs.com>; netdev@vger.kernel.org; dsahern@gmail.com; davem@davemloft.net; dsahern@kernel.org; pabeni@redhat.com; joel.granados@kernel.org; kuba@kernel.org; andrew+netdev@lunn.ch; horms@kernel.org; pablo@netfilter.org; kadlec@netfilter.org; netfilter-devel@vger.kernel.org; coreteam@netfilter.org; ncardwell@google.com; Koen De Schepper (Nokia) <koen.de_schepper@nokia-bell-labs.com>; g.white@cablelabs.com; ingemar.s.johansson@ericsson.com; mirja.kuehlewind@ericsson.com; cheshire@apple.com; rs.ietf@gmx.at; Jason_Livingood@comcast.com; vidhi_goel@apple.com
> > >Subject: Re: [PATCH v5 net-next 09/13] gro: prevent ACE field corruption & better AccECN handling
> > >
> > >On Thu, 7 Nov 2024, Eric Dumazet wrote:
> > >
> > >>On Tue, Nov 5, 2024 at 11:07 AM <chia-yu.chang@nokia-bell-labs.com> wrote:
> > >>>
> > >>> From: Ilpo Järvinen <ij@kernel.org>
> > >>>
> > >>> There are important differences in how the CWR field behaves in
> > >>> RFC3168 and AccECN. With AccECN, CWR flag is part of the ACE counter
> > >>> and its changes are important so adjust the flags changed mask
> > >>> accordingly.
> > >>>
> > >>> Also, if CWR is there, set the Accurate ECN GSO flag to avoid
> > >>> corrupting CWR flag somewhere.
> > >>>
> > >>> Signed-off-by: Ilpo Järvinen <ij@kernel.org>
> > >>> Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
> > >>> ---
> > >>>  net/ipv4/tcp_offload.c | 4 ++--
> > >>>  1 file changed, 2 insertions(+), 2 deletions(-)
> > >>>
> > >>> diff --git a/net/ipv4/tcp_offload.c b/net/ipv4/tcp_offload.c
> > >>> index 0b05f30e9e5f..f59762d88c38 100644
> > >>> --- a/net/ipv4/tcp_offload.c
> > >>> +++ b/net/ipv4/tcp_offload.c
> > >>> @@ -329,7 +329,7 @@ struct sk_buff *tcp_gro_receive(struct list_head *head, struct sk_buff *skb,
> > >>>         th2 = tcp_hdr(p);
> > >>>         flush = (__force int)(flags & TCP_FLAG_CWR);
> > >>>         flush |= (__force int)((flags ^ tcp_flag_word(th2)) &
> > >>> -                 ~(TCP_FLAG_CWR | TCP_FLAG_FIN | TCP_FLAG_PSH));
> > >>> +                 ~(TCP_FLAG_FIN | TCP_FLAG_PSH));
> > >>>         flush |= (__force int)(th->ack_seq ^ th2->ack_seq);
> > >>>         for (i = sizeof(*th); i < thlen; i += 4)
> > >>>                 flush |= *(u32 *)((u8 *)th + i) ^
> > >>>                          *(u32 *)((u8 *)th2 + i);
> > >>> @@ -405,7 +405,7 @@ void tcp_gro_complete(struct sk_buff *skb)
> > >>>         shinfo->gso_segs = NAPI_GRO_CB(skb)->count;
> > >>>
> > >>>         if (th->cwr)
> > >>> -               shinfo->gso_type |= SKB_GSO_TCP_ECN;
> > >>> +               shinfo->gso_type |= SKB_GSO_TCP_ACCECN;
> > >>>  }
> > >>>  EXPORT_SYMBOL(tcp_gro_complete);
> > >>>
> > >>
> > >> I do not really understand this patch. How a GRO engine can know which
> > >> ECN variant the peers are using ?
> > >
> > >Hi Eric,
> > >
> > >Thanks for taking a look.
> > >
> > >GRO doesn't know. Setting SKB_GSO_TCP_ECN in case of not knowing can result in header change that corrupts ACE field. Thus, GRO has to assume the RFC3168 CWR offloading trick cannot be used anymore (unless it tracks the connection and knows the bits are used for RFC3168 which is something nobody is going to do).
> > >
> > >The main point of SKB_GSO_TCP_ACCECN is to prevent SKB_GSO_TCP_ECN or NETIF_F_TSO_ECN offloading to be used for the skb as it would corrupt ACE field value.
> >
> > Hi Eric and Ilpo,
> >
> > From my understanding of another email thread (patch 08/13), it seems the conclusion is that SKB_GSO_TCP_ACCECN will still be needed.
> >
> > >
> > >SKB_GSO_TCP_ACCECN doesn't allow CWR bits change within a super-skb but the same CWR flag should be repeated for all segments. In a sense, it's simpler than RFC3168 offloading.
> > >
> > >> SKB_GSO_TCP_ECN is also used from other points, what is your plan ?
> > >>
> > >> git grep -n SKB_GSO_TCP_ECN
> > >> drivers/net/ethernet/hisilicon/hns3/hns3_enet.c:3888:
> > >> skb_shinfo(skb)->gso_type |= SKB_GSO_TCP_ECN;
> > >> drivers/net/ethernet/mellanox/mlx5/core/en_rx.c:1291:
> > >> skb_shinfo(skb)->gso_type |= SKB_GSO_TCP_ECN;
> > >> drivers/net/ethernet/mellanox/mlx5/core/en_rx.c:1312:
> > >> skb_shinfo(skb)->gso_type |= SKB_GSO_TCP_ECN;
> > >
> > >These looked like they should be just changed to set SKB_GSO_TCP_ACCECN instead.
> >
> > I agree with these changes and will apply them in the next version.
> >
> > >
> > >I don't anymore recall why I didn't change those when I made this patch long time ago, perhaps it was just an oversight or things have changed somehow since then.
> > >
> > >> include/linux/netdevice.h:5061: BUILD_BUG_ON(SKB_GSO_TCP_ECN !=
> > >> (NETIF_F_TSO_ECN >> NETIF_F_GSO_SHIFT));
> > >> include/linux/skbuff.h:664:     SKB_GSO_TCP_ECN = 1 << 2,
> > >
> > >Not relevant.
> > >
> > >> include/linux/virtio_net.h:88:                  gso_type |= SKB_GSO_TCP_ECN;
> > >> include/linux/virtio_net.h:161:         switch (gso_type & ~SKB_GSO_TCP_ECN) {
> > >> include/linux/virtio_net.h:226:         if (sinfo->gso_type & SKB_GSO_TCP_ECN)
> > >
> > >These need to be looked further what's going on as UAPI is also
> > > involved here.
> >
> > I had a look at these parts, and only the 1st and 3rd ones are relevant.
> > Related to the 1st one, I propose to change
> > from
> >
> >                 if (hdr->gso_type & VIRTIO_NET_HDR_GSO_ECN)
> >                         gso_type |= SKB_GSO_TCP_ECN;
> >
> > to
> >
> >                 if (hdr->gso_type & VIRTIO_NET_HDR_GSO_ECN)
> >                         gso_type |= SKB_GSO_TCP_ACCECN;
> >
> > The reason behind this proposed change is similar to the above changes
> > in en_rx.c.
>
> But en_rx.c is based on the CWR flag; there's no similar check here.
>
> > For the 3rd one, I suggest to change from
> >
> >                 if (sinfo->gso_type & SKB_GSO_TCP_ECN)
> >                         hdr->gso_type |= VIRTIO_NET_HDR_GSO_ECN;
> >
> > to
> >
> >                 if (sinfo->gso_type & (SKB_GSO_TCP_ECN | SKB_GSO_TCP_ACCECN))
> >                         hdr->gso_type |= VIRTIO_NET_HDR_GSO_ECN;
> >
> > This proposed change is because VIRTIO_NET_HDR_GSO_ECN must be set to
> > allow segmentation offload of TCP packets that have the ECN bit set.
> > So, no matter whether the skb gso_type has the GSO_TCP_ECN flag or the
> > GSO_TCP_ACCECN flag, the corresponding hdr gso_type shall be set.
> >
> > But I wonder what the driver will do with the VIRTIO_NET_HDR_GSO_ECN
> > flag. Will it corrupt CWR or not?
>
> I'm starting to heavily question what is the meaning of that
> VIRTIO_NET_HDR_GSO_ECN flag and if it's even consistent...
>
> https://github.com/qemu/qemu/blob/master/net/eth.c
>
> That sets VIRTIO_NET_HDR_GSO_ECN based on CE?? (Whereas kernel associates
> the related SKB_GSO_TCP_ECN to CWR offloading.)
>
> The virtio spec is way too vague to be useful so it would not be
> surprising if there are diverging interpretations from implementers:
>
> "If the driver negotiated the VIRTIO_NET_F_HOST_ECN feature, the
> VIRTIO_NET_HDR_GSO_ECN bit in gso_type indicates that the TCP packet has
> the ECN bit set."
>
> What is "the ECN bit" in terms used by TCP (or IP)? Could some virtio
> folks please explain?

According to the current Linux implementation in virtio_net_hdr_to_skb():

                if (hdr->gso_type & VIRTIO_NET_HDR_GSO_ECN)
                        gso_type |= SKB_GSO_TCP_ECN;

It maps to SKB_GSO_TCP_ECN, which is:

        /* This indicates the tcp segment has CWR set. */
        SKB_GSO_TCP_ECN = 1 << 2,

Thanks

>
>
> --
>  i.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* RE: [PATCH v5 net-next 09/13] gro: prevent ACE field corruption & better AccECN handling
  2024-11-13  1:42           ` Jason Wang
@ 2024-11-14  1:04             ` Chia-Yu Chang (Nokia)
  0 siblings, 0 replies; 37+ messages in thread
From: Chia-Yu Chang (Nokia) @ 2024-11-14  1:04 UTC (permalink / raw)
  To: Jason Wang, Ilpo Järvinen
  Cc: Michael S. Tsirkin, virtualization@lists.linux.dev, Eric Dumazet,
	netdev@vger.kernel.org, dsahern@gmail.com, davem@davemloft.net,
	dsahern@kernel.org, pabeni@redhat.com, joel.granados@kernel.org,
	kuba@kernel.org, andrew+netdev@lunn.ch, horms@kernel.org,
	pablo@netfilter.org, kadlec@netfilter.org,
	netfilter-devel@vger.kernel.org, coreteam@netfilter.org,
	ncardwell@google.com, Koen De Schepper (Nokia),
	g.white@cablelabs.com, ingemar.s.johansson@ericsson.com,
	mirja.kuehlewind@ericsson.com, cheshire@apple.com, rs.ietf@gmx.at,
	Jason_Livingood@comcast.com, vidhi_goel@apple.com

>From: Jason Wang <jasowang@redhat.com> 
>Sent: Wednesday, November 13, 2024 2:43 AM
>To: Ilpo Järvinen <ij@kernel.org>
>Cc: Chia-Yu Chang (Nokia) <chia-yu.chang@nokia-bell-labs.com>; Michael S. Tsirkin <mst@redhat.com>; virtualization@lists.linux.dev; Eric Dumazet <edumazet@google.com>; netdev@vger.kernel.org; dsahern@gmail.com; davem@davemloft.net; dsahern@kernel.org; pabeni@redhat.com; joel.granados@kernel.org; kuba@kernel.org; andrew+netdev@lunn.ch; horms@kernel.org; pablo@netfilter.org; kadlec@netfilter.org; netfilter-devel@vger.kernel.org; coreteam@netfilter.org; ncardwell@google.com; Koen De Schepper (Nokia) <koen.de_schepper@nokia-bell-labs.com>; g.white@cablelabs.com; ingemar.s.johansson@ericsson.com; mirja.kuehlewind@ericsson.com; cheshire@apple.com; rs.ietf@gmx.at; Jason_Livingood@comcast.com; vidhi_goel@apple.com
>Subject: Re: [PATCH v5 net-next 09/13] gro: prevent ACE field corruption & better AccECN handling
>
>On Wed, Nov 13, 2024 at 5:19 AM Ilpo Järvinen <ij@kernel.org> wrote:
>>
>> Adding a few virtio people. Please see the virtio spec/flag question 
>> below.
>>
>> On Tue, 12 Nov 2024, Chia-Yu Chang (Nokia) wrote:
>>
>> > >From: Ilpo Järvinen <ij@kernel.org>
>> > >Sent: Thursday, November 7, 2024 8:28 PM
>> > >To: Eric Dumazet <edumazet@google.com>
>> > >Cc: Chia-Yu Chang (Nokia) <chia-yu.chang@nokia-bell-labs.com>; 
>> > >netdev@vger.kernel.org; dsahern@gmail.com; davem@davemloft.net; 
>> > >dsahern@kernel.org; pabeni@redhat.com; joel.granados@kernel.org; 
>> > >kuba@kernel.org; andrew+netdev@lunn.ch; horms@kernel.org; 
>> > >pablo@netfilter.org; kadlec@netfilter.org; 
>> > >netfilter-devel@vger.kernel.org; coreteam@netfilter.org; 
>> > >ncardwell@google.com; Koen De Schepper (Nokia) 
>> > ><koen.de_schepper@nokia-bell-labs.com>; g.white@cablelabs.com; 
>> > >ingemar.s.johansson@ericsson.com; mirja.kuehlewind@ericsson.com; 
>> > >cheshire@apple.com; rs.ietf@gmx.at; Jason_Livingood@comcast.com; 
>> > >vidhi_goel@apple.com
>> > >Subject: Re: [PATCH v5 net-next 09/13] gro: prevent ACE field 
>> > >corruption & better AccECN handling
>> > >
>> > >On Thu, 7 Nov 2024, Eric Dumazet wrote:
>> > >
>> > >>On Tue, Nov 5, 2024 at 11:07 AM <chia-yu.chang@nokia-bell-labs.com> wrote:
>> > >>>
>> > >>> From: Ilpo Järvinen <ij@kernel.org>
>> > >>>
>> > >>> There are important differences in how the CWR field behaves in
>> > >>> RFC3168 and AccECN. With AccECN, CWR flag is part of the ACE 
>> > >>> counter and its changes are important so adjust the flags 
>> > >>> changed mask accordingly.
>> > >>>
>> > >>> Also, if CWR is there, set the Accurate ECN GSO flag to avoid 
>> > >>> corrupting CWR flag somewhere.
>> > >>>
>> > >>> Signed-off-by: Ilpo Järvinen <ij@kernel.org>
>> > >>> Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
>> > >>> ---
>> > >>>  net/ipv4/tcp_offload.c | 4 ++--
>> > >>>  1 file changed, 2 insertions(+), 2 deletions(-)
>> > >>>
>> > >>> diff --git a/net/ipv4/tcp_offload.c b/net/ipv4/tcp_offload.c
>> > >>> index 0b05f30e9e5f..f59762d88c38 100644
>> > >>> --- a/net/ipv4/tcp_offload.c
>> > >>> +++ b/net/ipv4/tcp_offload.c
>> > >>> @@ -329,7 +329,7 @@ struct sk_buff *tcp_gro_receive(struct list_head *head, struct sk_buff *skb,
>> > >>>         th2 = tcp_hdr(p);
>> > >>>         flush = (__force int)(flags & TCP_FLAG_CWR);
>> > >>>         flush |= (__force int)((flags ^ tcp_flag_word(th2)) &
>> > >>> -                 ~(TCP_FLAG_CWR | TCP_FLAG_FIN | TCP_FLAG_PSH));
>> > >>> +                 ~(TCP_FLAG_FIN | TCP_FLAG_PSH));
>> > >>>         flush |= (__force int)(th->ack_seq ^ th2->ack_seq);
>> > >>>         for (i = sizeof(*th); i < thlen; i += 4)
>> > >>>                 flush |= *(u32 *)((u8 *)th + i) ^
>> > >>>                          *(u32 *)((u8 *)th2 + i);
>> > >>> @@ -405,7 +405,7 @@ void tcp_gro_complete(struct sk_buff *skb)
>> > >>>         shinfo->gso_segs = NAPI_GRO_CB(skb)->count;
>> > >>>
>> > >>>         if (th->cwr)
>> > >>> -               shinfo->gso_type |= SKB_GSO_TCP_ECN;
>> > >>> +               shinfo->gso_type |= SKB_GSO_TCP_ACCECN;
>> > >>>  }
>> > >>>  EXPORT_SYMBOL(tcp_gro_complete);
>> > >>>
>> > >>
>> > >> I do not really understand this patch. How a GRO engine can know 
>> > >> which ECN variant the peers are using ?
>> > >
>> > >Hi Eric,
>> > >
>> > >Thanks for taking a look.
>> > >
>> > >GRO doesn't know. Setting SKB_GSO_TCP_ECN in case of not knowing can result in header change that corrupts ACE field. Thus, GRO has to assume the RFC3168 CWR offloading trick cannot be used anymore (unless it tracks the connection and knows the bits are used for RFC3168 which is something nobody is going to do).
>> > >
>> > >The main point of SKB_GSO_TCP_ACCECN is to prevent SKB_GSO_TCP_ECN or NETIF_F_TSO_ECN offloading to be used for the skb as it would corrupt ACE field value.
>> >
>> > Hi Eric and Ilpo,
>> >
>> > From my understanding of another email thread (patch 08/13), it seems the conclusion is that SKB_GSO_TCP_ACCECN will still be needed.
>> >
>> > >
>> > >SKB_GSO_TCP_ACCECN doesn't allow CWR bits change within a super-skb but the same CWR flag should be repeated for all segments. In a sense, it's simpler than RFC3168 offloading.
>> > >
>> > >> SKB_GSO_TCP_ECN is also used from other points, what is your plan ?
>> > >>
>> > >> git grep -n SKB_GSO_TCP_ECN
>> > >> drivers/net/ethernet/hisilicon/hns3/hns3_enet.c:3888:
>> > >> skb_shinfo(skb)->gso_type |= SKB_GSO_TCP_ECN;
>> > >> drivers/net/ethernet/mellanox/mlx5/core/en_rx.c:1291:
>> > >> skb_shinfo(skb)->gso_type |= SKB_GSO_TCP_ECN;
>> > >> drivers/net/ethernet/mellanox/mlx5/core/en_rx.c:1312:
>> > >> skb_shinfo(skb)->gso_type |= SKB_GSO_TCP_ECN;
>> > >
>> > >These looked like they should be just changed to set SKB_GSO_TCP_ACCECN instead.
>> >
>> > I agree with these changes and will apply them in the next version.
>> >
>> > >
>> > >I don't anymore recall why I didn't change those when I made this patch long time ago, perhaps it was just an oversight or things have changed somehow since then.
>> > >
>> > >> include/linux/netdevice.h:5061: BUILD_BUG_ON(SKB_GSO_TCP_ECN != 
>> > >> (NETIF_F_TSO_ECN >> NETIF_F_GSO_SHIFT));
>> > >> include/linux/skbuff.h:664:     SKB_GSO_TCP_ECN = 1 << 2,
>> > >
>> > >Not relevant.
>> > >
>> > >> include/linux/virtio_net.h:88:                  gso_type |= SKB_GSO_TCP_ECN;
>> > >> include/linux/virtio_net.h:161:         switch (gso_type & ~SKB_GSO_TCP_ECN) {
>> > >> include/linux/virtio_net.h:226:         if (sinfo->gso_type & SKB_GSO_TCP_ECN)
>> > >
>> > >These need to be looked further what's going on as UAPI is also  
>> > >involved here.
>> >
>> > I had a look at these parts, and only the 1st and 3rd ones are relevant.
>> > Related to the 1st one, I propose to change from
>> >
>> >                 if (hdr->gso_type & VIRTIO_NET_HDR_GSO_ECN)
>> >                         gso_type |= SKB_GSO_TCP_ECN;
>> >
>> > to
>> >
>> >                 if (hdr->gso_type & VIRTIO_NET_HDR_GSO_ECN)
>> >                         gso_type |= SKB_GSO_TCP_ACCECN;
>> >
>> > The reason behind this proposed change is similar to the above
>> > changes in en_rx.c.
>>
>> But en_rx.c is based on the CWR flag; there's no similar check here.
>>
>> > For the 3rd one, I suggest to change from
>> >
>> >                 if (sinfo->gso_type & SKB_GSO_TCP_ECN)
>> >                         hdr->gso_type |= VIRTIO_NET_HDR_GSO_ECN;
>> >
>> > to
>> >
>> >                 if (sinfo->gso_type & (SKB_GSO_TCP_ECN | SKB_GSO_TCP_ACCECN))
>> >                         hdr->gso_type |= VIRTIO_NET_HDR_GSO_ECN;
>> >
>> > This proposed change is because VIRTIO_NET_HDR_GSO_ECN must be set
>> > to allow segmentation offload of TCP packets that have the ECN bit set.
>> > So, no matter whether the skb gso_type has the GSO_TCP_ECN flag or the
>> > GSO_TCP_ACCECN flag, the corresponding hdr gso_type shall be set.
>> >
>> > But I wonder what the driver will do with the
>> > VIRTIO_NET_HDR_GSO_ECN flag. Will it corrupt CWR or not?
>>
>> I'm starting to heavily question what is the meaning of that 
>> VIRTIO_NET_HDR_GSO_ECN flag and if it's even consistent...
>>
>> https://github.com/qemu/qemu/blob/master/net/eth.c
>>
>> That sets VIRTIO_NET_HDR_GSO_ECN based on CE?? (Whereas kernel 
>> associates the related SKB_GSO_TCP_ECN to CWR offloading.)
>>
>> The virtio spec is way too vague to be useful so it would not be 
>> surprising if there are diverging interpretations from implementers:
>>
>> "If the driver negotiated the VIRTIO_NET_F_HOST_ECN feature, the 
>> VIRTIO_NET_HDR_GSO_ECN bit in gso_type indicates that the TCP packet 
>> has the ECN bit set."
>>
>> What is "the ECN bit" in terms used by TCP (or IP)? Could some virtio 
>> folks please explain?
>
>According to the current Linux implementation in virtio_net_hdr_to_skb():
>
>                if (hdr->gso_type & VIRTIO_NET_HDR_GSO_ECN)
>                        gso_type |= SKB_GSO_TCP_ECN;
>
>It maps to SKB_GSO_TCP_ECN, which is:
>
>        /* This indicates the tcp segment has CWR set. */
>        SKB_GSO_TCP_ECN = 1 << 2,
>
>Thanks

This patch series is related to adding the new SKB_GSO_TCP_ACCECN flag.
The reason is that CWR field handling differs between RFC3168 and AccECN.
You can find the original patches below:
https://lore.kernel.org/netdev/20241105100647.117346-9-chia-yu.chang@nokia-bell-labs.com/
https://lore.kernel.org/netdev/20241105100647.117346-10-chia-yu.chang@nokia-bell-labs.com/

When reviewing all occurrences of SKB_GSO_TCP_ECN, we saw it is related to VIRTIO_NET_HDR_GSO_ECN.
But after checking the virtio spec, the meaning of VIRTIO_NET_HDR_GSO_ECN is quite unclear:
"If the driver negotiated the VIRTIO_NET_F_HOST_ECN feature, the VIRTIO_NET_HDR_GSO_ECN bit in gso_type indicates that the TCP packet has the ECN bit set."

In virtnet_probe(), it looks like VIRTIO_NET_F_HOST_ECN is used to enable NETIF_F_TSO_ECN support:
                if (virtio_has_feature(vdev, VIRTIO_NET_F_HOST_ECN))
                        dev->hw_features |= NETIF_F_TSO_ECN;

So the question is: can its TSO engine be configured to leave the CWR flag untouched even with NETIF_F_TSO_ECN enabled?


end of thread, other threads:[~2024-11-14  1:04 UTC | newest]

Thread overview: 37+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-11-05 10:06 [PATCH v5 net-next 00/13] AccECN protocol preparation patch series chia-yu.chang
2024-11-05 10:06 ` [PATCH v5 net-next 01/13] tcp: reorganize tcp_in_ack_event() and tcp_count_delivered() chia-yu.chang
2024-11-07  9:31   ` Eric Dumazet
2024-11-07 19:32     ` Ilpo Järvinen
2024-11-05 10:06 ` [PATCH v5 net-next 02/13] tcp: create FLAG_TS_PROGRESS chia-yu.chang
2024-11-07  9:34   ` Eric Dumazet
2024-11-05 10:06 ` [PATCH v5 net-next 03/13] tcp: use BIT() macro in include/net/tcp.h chia-yu.chang
2024-11-07 10:20   ` Eric Dumazet
2024-11-05 10:06 ` [PATCH v5 net-next 04/13] tcp: extend TCP flags to allow AE bit/ACE field chia-yu.chang
2024-11-07 10:23   ` Eric Dumazet
2024-11-05 10:06 ` [PATCH v5 net-next 05/13] tcp: reorganize SYN ECN code chia-yu.chang
2024-11-07 10:25   ` Eric Dumazet
2024-11-05 10:06 ` [PATCH v5 net-next 06/13] tcp: rework {__,}tcp_ecn_check_ce() -> tcp_data_ecn_check() chia-yu.chang
2024-11-07 10:30   ` Eric Dumazet
2024-11-05 10:06 ` [PATCH v5 net-next 07/13] tcp: helpers for ECN mode handling chia-yu.chang
2024-11-07 12:26   ` Eric Dumazet
2024-11-05 10:06 ` [PATCH v5 net-next 08/13] gso: AccECN support chia-yu.chang
2024-11-07 12:41   ` Eric Dumazet
2024-11-09 10:09   ` Ilpo Järvinen
2024-11-09 19:42     ` Neal Cardwell
2024-11-09 21:38       ` Ilpo Järvinen
2024-11-05 10:06 ` [PATCH v5 net-next 09/13] gro: prevent ACE field corruption & better AccECN handling chia-yu.chang
2024-11-07 12:40   ` Eric Dumazet
2024-11-07 19:28     ` Ilpo Järvinen
2024-11-12  2:09       ` Chia-Yu Chang (Nokia)
2024-11-12 21:19         ` Ilpo Järvinen
2024-11-13  1:42           ` Jason Wang
2024-11-14  1:04             ` Chia-Yu Chang (Nokia)
2024-11-05 10:06 ` [PATCH v5 net-next 10/13] tcp: AccECN support to tcp_add_backlog chia-yu.chang
2024-11-07 12:42   ` Eric Dumazet
2024-11-05 10:06 ` [PATCH v5 net-next 11/13] tcp: allow ECN bits in TOS/traffic class chia-yu.chang
2024-11-07 12:55   ` Eric Dumazet
2024-11-05 10:06 ` [PATCH v5 net-next 12/13] tcp: Pass flags to __tcp_send_ack chia-yu.chang
2024-11-07 12:58   ` Eric Dumazet
2024-11-05 10:06 ` [PATCH v5 net-next 13/13] tcp: fast path functions later chia-yu.chang
2024-11-07 13:00   ` Eric Dumazet
2024-11-07 20:38     ` Chia-Yu Chang (Nokia)
