* [PATCH v5 net-next 00/13] AccECN protocol preparation patch series
@ 2024-11-05 10:06 chia-yu.chang
2024-11-05 10:06 ` [PATCH v5 net-next 01/13] tcp: reorganize tcp_in_ack_event() and tcp_count_delivered() chia-yu.chang
` (12 more replies)
0 siblings, 13 replies; 37+ messages in thread
From: chia-yu.chang @ 2024-11-05 10:06 UTC (permalink / raw)
To: netdev, dsahern, davem, edumazet, dsahern, pabeni, joel.granados,
kuba, andrew+netdev, horms, pablo, kadlec, netfilter-devel,
coreteam, ij, ncardwell, koen.de_schepper, g.white,
ingemar.s.johansson, mirja.kuehlewind, cheshire, rs.ietf,
Jason_Livingood, vidhi_goel
Cc: Chia-Yu Chang
From: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Hello,
Specific changes in v5 (5-Nov-2024)
- Add helper function "tcp_flags_ntohs" to preserve the last 2 bytes of TCP flags in patch #4 (Paolo Abeni <pabeni@redhat.com>)
- Fix reverse xmas tree variable ordering in patches #4, #11 (Paolo Abeni <pabeni@redhat.com>)
- Rename variable "delta" to "timestamp_delta" in patch #2 for clarity
- Remove patch #14 from this series (Paolo Abeni <pabeni@redhat.com>, Joel Granados <joel.granados@kernel.org>)
Specific changes in v4 (21-Oct-2024)
- Fix line length warnings in patches #2, #4, #8, #10, #11, #14
- Fix "spaces preferred around '|' (ctx:VxV)" warning in patch #7
- Add missing CCs to patches #4, #12, #14
Specific changes in v3 (19-Oct-2024)
- Fix build error in v2
Specific changes in v2 (18-Oct-2024)
- Fix warning caused by NETIF_F_GSO_ACCECN_BIT in patch #9 (Jakub Kicinski <kuba@kernel.org>)
The full patch series can be found in
https://github.com/L4STeam/linux-net-next/commits/upstream_l4steam/
The Accurate ECN draft can be found in
https://datatracker.ietf.org/doc/html/draft-ietf-tcpm-accurate-ecn-28
--
Chia-Yu
Chia-Yu Chang (1):
tcp: use BIT() macro in include/net/tcp.h
Ilpo Järvinen (12):
tcp: reorganize tcp_in_ack_event() and tcp_count_delivered()
tcp: create FLAG_TS_PROGRESS
tcp: extend TCP flags to allow AE bit/ACE field
tcp: reorganize SYN ECN code
tcp: rework {__,}tcp_ecn_check_ce() -> tcp_data_ecn_check()
tcp: helpers for ECN mode handling
gso: AccECN support
gro: prevent ACE field corruption & better AccECN handling
tcp: AccECN support to tcp_add_backlog
tcp: allow ECN bits in TOS/traffic class
tcp: Pass flags to __tcp_send_ack
tcp: fast path functions later
include/linux/netdev_features.h | 8 +-
include/linux/netdevice.h | 2 +
include/linux/skbuff.h | 2 +
include/net/tcp.h | 135 +++++++++++++++++++++-----------
include/uapi/linux/tcp.h | 9 ++-
net/ethtool/common.c | 1 +
net/ipv4/bpf_tcp_ca.c | 2 +-
net/ipv4/ip_output.c | 3 +-
net/ipv4/tcp.c | 2 +-
net/ipv4/tcp_dctcp.c | 2 +-
net/ipv4/tcp_dctcp.h | 2 +-
net/ipv4/tcp_input.c | 120 ++++++++++++++++------------
net/ipv4/tcp_ipv4.c | 28 +++++--
net/ipv4/tcp_minisocks.c | 6 +-
net/ipv4/tcp_offload.c | 10 ++-
net/ipv4/tcp_output.c | 23 +++---
net/ipv6/tcp_ipv6.c | 26 ++++--
net/netfilter/nf_log_syslog.c | 8 +-
18 files changed, 249 insertions(+), 140 deletions(-)
--
2.34.1
* [PATCH v5 net-next 01/13] tcp: reorganize tcp_in_ack_event() and tcp_count_delivered()
2024-11-05 10:06 [PATCH v5 net-next 00/13] AccECN protocol preparation patch series chia-yu.chang
@ 2024-11-05 10:06 ` chia-yu.chang
2024-11-07 9:31 ` Eric Dumazet
2024-11-05 10:06 ` [PATCH v5 net-next 02/13] tcp: create FLAG_TS_PROGRESS chia-yu.chang
` (11 subsequent siblings)
12 siblings, 1 reply; 37+ messages in thread
From: chia-yu.chang @ 2024-11-05 10:06 UTC (permalink / raw)
To: netdev, dsahern, davem, edumazet, dsahern, pabeni, joel.granados,
kuba, andrew+netdev, horms, pablo, kadlec, netfilter-devel,
coreteam, ij, ncardwell, koen.de_schepper, g.white,
ingemar.s.johansson, mirja.kuehlewind, cheshire, rs.ietf,
Jason_Livingood, vidhi_goel
Cc: Chia-Yu Chang
From: Ilpo Järvinen <ij@kernel.org>
- Move tcp_count_delivered() earlier and split tcp_count_delivered_ce()
out of it
- Move tcp_in_ack_event() later
- While at it, remove the inline from tcp_in_ack_event() and let
the compiler decide
Accurate ECN's heuristics do not know whether the ACE-field-based CE
counter is going to increase until after the rtx queue has been
processed; only then is the number of ACKed bytes/packets available.
As the CE decision affects the presence of FLAG_ECE, that information
is not yet available at the old location of the call to
tcp_in_ack_event().
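As an illustration only (not part of this patch), the flags computed
here feed straight into a congestion control module's in_ack_event()
hook, which is why the CE/ECE decision must be final before the
callback runs. A minimal sketch of such a hook; struct example_ca and
its fields are hypothetical:

struct example_ca {
        u32 ce_acks;            /* ACKs carrying congestion feedback */
        u32 win_updates;        /* ACKs that advanced the window */
};

static void example_in_ack_event(struct sock *sk, u32 flags)
{
        struct example_ca *ca = inet_csk_ca(sk);

        if (flags & CA_ACK_ECE)         /* set only on the slow path */
                ca->ce_acks++;
        if (flags & CA_ACK_WIN_UPDATE)
                ca->win_updates++;
}

Such a hook would be registered via the .in_ack_event member of
struct tcp_congestion_ops, as e.g. DCTCP does.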
Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
---
net/ipv4/tcp_input.c | 56 +++++++++++++++++++++++++-------------------
1 file changed, 32 insertions(+), 24 deletions(-)
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 5bdf13ac26ef..fc52eab4fcc9 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -413,6 +413,20 @@ static bool tcp_ecn_rcv_ecn_echo(const struct tcp_sock *tp, const struct tcphdr
return false;
}
+static void tcp_count_delivered_ce(struct tcp_sock *tp, u32 ecn_count)
+{
+ tp->delivered_ce += ecn_count;
+}
+
+/* Updates the delivered and delivered_ce counts */
+static void tcp_count_delivered(struct tcp_sock *tp, u32 delivered,
+ bool ece_ack)
+{
+ tp->delivered += delivered;
+ if (ece_ack)
+ tcp_count_delivered_ce(tp, delivered);
+}
+
/* Buffer size and advertised window tuning.
*
* 1. Tuning sk->sk_sndbuf, when connection enters established state.
@@ -1148,15 +1162,6 @@ void tcp_mark_skb_lost(struct sock *sk, struct sk_buff *skb)
}
}
-/* Updates the delivered and delivered_ce counts */
-static void tcp_count_delivered(struct tcp_sock *tp, u32 delivered,
- bool ece_ack)
-{
- tp->delivered += delivered;
- if (ece_ack)
- tp->delivered_ce += delivered;
-}
-
/* This procedure tags the retransmission queue when SACKs arrive.
*
* We have three tag bits: SACKED(S), RETRANS(R) and LOST(L).
@@ -3856,12 +3861,23 @@ static void tcp_process_tlp_ack(struct sock *sk, u32 ack, int flag)
}
}
-static inline void tcp_in_ack_event(struct sock *sk, u32 flags)
+static void tcp_in_ack_event(struct sock *sk, int flag)
{
const struct inet_connection_sock *icsk = inet_csk(sk);
- if (icsk->icsk_ca_ops->in_ack_event)
- icsk->icsk_ca_ops->in_ack_event(sk, flags);
+ if (icsk->icsk_ca_ops->in_ack_event) {
+ u32 ack_ev_flags = 0;
+
+ if (flag & FLAG_WIN_UPDATE)
+ ack_ev_flags |= CA_ACK_WIN_UPDATE;
+ if (flag & FLAG_SLOWPATH) {
+ ack_ev_flags = CA_ACK_SLOWPATH;
+ if (flag & FLAG_ECE)
+ ack_ev_flags |= CA_ACK_ECE;
+ }
+
+ icsk->icsk_ca_ops->in_ack_event(sk, ack_ev_flags);
+ }
}
/* Congestion control has updated the cwnd already. So if we're in
@@ -3978,12 +3994,8 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
tcp_snd_una_update(tp, ack);
flag |= FLAG_WIN_UPDATE;
- tcp_in_ack_event(sk, CA_ACK_WIN_UPDATE);
-
NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPHPACKS);
} else {
- u32 ack_ev_flags = CA_ACK_SLOWPATH;
-
if (ack_seq != TCP_SKB_CB(skb)->end_seq)
flag |= FLAG_DATA;
else
@@ -3995,19 +4007,12 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
flag |= tcp_sacktag_write_queue(sk, skb, prior_snd_una,
&sack_state);
- if (tcp_ecn_rcv_ecn_echo(tp, tcp_hdr(skb))) {
+ if (tcp_ecn_rcv_ecn_echo(tp, tcp_hdr(skb)))
flag |= FLAG_ECE;
- ack_ev_flags |= CA_ACK_ECE;
- }
if (sack_state.sack_delivered)
tcp_count_delivered(tp, sack_state.sack_delivered,
flag & FLAG_ECE);
-
- if (flag & FLAG_WIN_UPDATE)
- ack_ev_flags |= CA_ACK_WIN_UPDATE;
-
- tcp_in_ack_event(sk, ack_ev_flags);
}
/* This is a deviation from RFC3168 since it states that:
@@ -4034,6 +4039,8 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
tcp_rack_update_reo_wnd(sk, &rs);
+ tcp_in_ack_event(sk, flag);
+
if (tp->tlp_high_seq)
tcp_process_tlp_ack(sk, ack, flag);
@@ -4065,6 +4072,7 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
return 1;
no_queue:
+ tcp_in_ack_event(sk, flag);
/* If data was DSACKed, see if we can undo a cwnd reduction. */
if (flag & FLAG_DSACKING_ACK) {
tcp_fastretrans_alert(sk, prior_snd_una, num_dupack, &flag,
--
2.34.1
* [PATCH v5 net-next 02/13] tcp: create FLAG_TS_PROGRESS
2024-11-05 10:06 [PATCH v5 net-next 00/13] AccECN protocol preparation patch series chia-yu.chang
2024-11-05 10:06 ` [PATCH v5 net-next 01/13] tcp: reorganize tcp_in_ack_event() and tcp_count_delivered() chia-yu.chang
@ 2024-11-05 10:06 ` chia-yu.chang
2024-11-07 9:34 ` Eric Dumazet
2024-11-05 10:06 ` [PATCH v5 net-next 03/13] tcp: use BIT() macro in include/net/tcp.h chia-yu.chang
` (10 subsequent siblings)
12 siblings, 1 reply; 37+ messages in thread
From: chia-yu.chang @ 2024-11-05 10:06 UTC (permalink / raw)
To: netdev, dsahern, davem, edumazet, dsahern, pabeni, joel.granados,
kuba, andrew+netdev, horms, pablo, kadlec, netfilter-devel,
coreteam, ij, ncardwell, koen.de_schepper, g.white,
ingemar.s.johansson, mirja.kuehlewind, cheshire, rs.ietf,
Jason_Livingood, vidhi_goel
Cc: Chia-Yu Chang
From: Ilpo Järvinen <ij@kernel.org>
Whenever the TCP timestamp advances, it declares progress, which
other parts of the stack can use to decide that the ACK is the most
recent one seen so far.
AccECN will use this flag when deciding whether to use the ACK to
update AccECN state or not.
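The progress test reduces to a signed 32-bit delta between rcv_tsval
and ts_recent, which stays correct across timestamp wrap; a small
stand-alone illustration (the timestamp values are made up):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
        uint32_t ts_recent = 0xfffffff0u;  /* hypothetical stored value */
        uint32_t rcv_tsval = 0x00000010u;  /* newer value, after u32 wrap */
        int32_t delta = (int32_t)(rcv_tsval - ts_recent);

        /* delta > 0 is what __tcp_replace_ts_recent() below turns
         * into FLAG_TS_PROGRESS
         */
        printf("delta=%d -> %s\n", (int)delta,
               delta > 0 ? "FLAG_TS_PROGRESS" : "no progress");
        return 0;
}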
Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
---
net/ipv4/tcp_input.c | 37 ++++++++++++++++++++++++++++---------
1 file changed, 28 insertions(+), 9 deletions(-)
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index fc52eab4fcc9..ecb3de69c6de 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -102,6 +102,7 @@ int sysctl_tcp_max_orphans __read_mostly = NR_FILE;
#define FLAG_NO_CHALLENGE_ACK 0x8000 /* do not call tcp_send_challenge_ack() */
#define FLAG_ACK_MAYBE_DELAYED 0x10000 /* Likely a delayed ACK */
#define FLAG_DSACK_TLP 0x20000 /* DSACK for tail loss probe */
+#define FLAG_TS_PROGRESS 0x40000 /* Positive timestamp delta */
#define FLAG_ACKED (FLAG_DATA_ACKED|FLAG_SYN_ACKED)
#define FLAG_NOT_DUP (FLAG_DATA|FLAG_WIN_UPDATE|FLAG_ACKED)
@@ -3813,8 +3814,16 @@ static void tcp_store_ts_recent(struct tcp_sock *tp)
tp->rx_opt.ts_recent_stamp = ktime_get_seconds();
}
-static void tcp_replace_ts_recent(struct tcp_sock *tp, u32 seq)
+static int __tcp_replace_ts_recent(struct tcp_sock *tp, s32 tstamp_delta)
{
+ tcp_store_ts_recent(tp);
+ return tstamp_delta > 0 ? FLAG_TS_PROGRESS : 0;
+}
+
+static int tcp_replace_ts_recent(struct tcp_sock *tp, u32 seq)
+{
+ s32 delta;
+
if (tp->rx_opt.saw_tstamp && !after(seq, tp->rcv_wup)) {
/* PAWS bug workaround wrt. ACK frames, the PAWS discard
* extra check below makes sure this can only happen
@@ -3823,9 +3832,13 @@ static void tcp_replace_ts_recent(struct tcp_sock *tp, u32 seq)
* Not only, also it occurs for expired timestamps.
*/
- if (tcp_paws_check(&tp->rx_opt, 0))
- tcp_store_ts_recent(tp);
+ if (tcp_paws_check(&tp->rx_opt, 0)) {
+ delta = tp->rx_opt.rcv_tsval - tp->rx_opt.ts_recent;
+ return __tcp_replace_ts_recent(tp, delta);
+ }
}
+
+ return 0;
}
/* This routine deals with acks during a TLP episode and ends an episode by
@@ -3982,7 +3995,7 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
* is in window.
*/
if (flag & FLAG_UPDATE_TS_RECENT)
- tcp_replace_ts_recent(tp, TCP_SKB_CB(skb)->seq);
+ flag |= tcp_replace_ts_recent(tp, TCP_SKB_CB(skb)->seq);
if ((flag & (FLAG_SLOWPATH | FLAG_SND_UNA_ADVANCED)) ==
FLAG_SND_UNA_ADVANCED) {
@@ -6140,6 +6153,8 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb)
TCP_SKB_CB(skb)->seq == tp->rcv_nxt &&
!after(TCP_SKB_CB(skb)->ack_seq, tp->snd_nxt)) {
int tcp_header_len = tp->tcp_header_len;
+ s32 delta = 0;
+ int flag = 0;
/* Timestamp header prediction: tcp_header_len
* is automatically equal to th->doff*4 due to pred_flags
@@ -6152,8 +6167,10 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb)
if (!tcp_parse_aligned_timestamp(tp, th))
goto slow_path;
+ delta = tp->rx_opt.rcv_tsval -
+ tp->rx_opt.ts_recent;
/* If PAWS failed, check it more carefully in slow path */
- if ((s32)(tp->rx_opt.rcv_tsval - tp->rx_opt.ts_recent) < 0)
+ if (delta < 0)
goto slow_path;
/* DO NOT update ts_recent here, if checksum fails
@@ -6173,12 +6190,13 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb)
if (tcp_header_len ==
(sizeof(struct tcphdr) + TCPOLEN_TSTAMP_ALIGNED) &&
tp->rcv_nxt == tp->rcv_wup)
- tcp_store_ts_recent(tp);
+ flag |= __tcp_replace_ts_recent(tp,
+ delta);
/* We know that such packets are checksummed
* on entry.
*/
- tcp_ack(sk, skb, 0);
+ tcp_ack(sk, skb, flag);
__kfree_skb(skb);
tcp_data_snd_check(sk);
/* When receiving pure ack in fast path, update
@@ -6209,7 +6227,8 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb)
if (tcp_header_len ==
(sizeof(struct tcphdr) + TCPOLEN_TSTAMP_ALIGNED) &&
tp->rcv_nxt == tp->rcv_wup)
- tcp_store_ts_recent(tp);
+ flag |= __tcp_replace_ts_recent(tp,
+ delta);
tcp_rcv_rtt_measure_ts(sk, skb);
@@ -6224,7 +6243,7 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb)
if (TCP_SKB_CB(skb)->ack_seq != tp->snd_una) {
/* Well, only one small jumplet in fast path... */
- tcp_ack(sk, skb, FLAG_DATA);
+ tcp_ack(sk, skb, flag | FLAG_DATA);
tcp_data_snd_check(sk);
if (!inet_csk_ack_scheduled(sk))
goto no_ack;
--
2.34.1
* [PATCH v5 net-next 03/13] tcp: use BIT() macro in include/net/tcp.h
2024-11-05 10:06 [PATCH v5 net-next 00/13] AccECN protocol preparation patch series chia-yu.chang
2024-11-05 10:06 ` [PATCH v5 net-next 01/13] tcp: reorganize tcp_in_ack_event() and tcp_count_delivered() chia-yu.chang
2024-11-05 10:06 ` [PATCH v5 net-next 02/13] tcp: create FLAG_TS_PROGRESS chia-yu.chang
@ 2024-11-05 10:06 ` chia-yu.chang
2024-11-07 10:20 ` Eric Dumazet
2024-11-05 10:06 ` [PATCH v5 net-next 04/13] tcp: extend TCP flags to allow AE bit/ACE field chia-yu.chang
` (9 subsequent siblings)
12 siblings, 1 reply; 37+ messages in thread
From: chia-yu.chang @ 2024-11-05 10:06 UTC (permalink / raw)
To: netdev, dsahern, davem, edumazet, dsahern, pabeni, joel.granados,
kuba, andrew+netdev, horms, pablo, kadlec, netfilter-devel,
coreteam, ij, ncardwell, koen.de_schepper, g.white,
ingemar.s.johansson, mirja.kuehlewind, cheshire, rs.ietf,
Jason_Livingood, vidhi_goel
Cc: Chia-Yu Chang
From: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Use the BIT() macro for the TCP flags field and for the TCP
congestion control flags used by congestion control algorithms.
No functional changes.
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Reviewed-by: Ilpo Järvinen <ij@kernel.org>
---
include/net/tcp.h | 21 +++++++++++----------
1 file changed, 11 insertions(+), 10 deletions(-)
diff --git a/include/net/tcp.h b/include/net/tcp.h
index e9b37b76e894..cccc6b739532 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -26,6 +26,7 @@
#include <linux/kref.h>
#include <linux/ktime.h>
#include <linux/indirect_call_wrapper.h>
+#include <linux/bits.h>
#include <net/inet_connection_sock.h>
#include <net/inet_timewait_sock.h>
@@ -911,14 +912,14 @@ static inline u32 tcp_rsk_tsval(const struct tcp_request_sock *treq)
#define tcp_flag_byte(th) (((u_int8_t *)th)[13])
-#define TCPHDR_FIN 0x01
-#define TCPHDR_SYN 0x02
-#define TCPHDR_RST 0x04
-#define TCPHDR_PSH 0x08
-#define TCPHDR_ACK 0x10
-#define TCPHDR_URG 0x20
-#define TCPHDR_ECE 0x40
-#define TCPHDR_CWR 0x80
+#define TCPHDR_FIN BIT(0)
+#define TCPHDR_SYN BIT(1)
+#define TCPHDR_RST BIT(2)
+#define TCPHDR_PSH BIT(3)
+#define TCPHDR_ACK BIT(4)
+#define TCPHDR_URG BIT(5)
+#define TCPHDR_ECE BIT(6)
+#define TCPHDR_CWR BIT(7)
#define TCPHDR_SYN_ECN (TCPHDR_SYN | TCPHDR_ECE | TCPHDR_CWR)
@@ -1107,9 +1108,9 @@ enum tcp_ca_ack_event_flags {
#define TCP_CA_UNSPEC 0
/* Algorithm can be set on socket without CAP_NET_ADMIN privileges */
-#define TCP_CONG_NON_RESTRICTED 0x1
+#define TCP_CONG_NON_RESTRICTED BIT(0)
/* Requires ECN/ECT set on all packets */
-#define TCP_CONG_NEEDS_ECN 0x2
+#define TCP_CONG_NEEDS_ECN BIT(1)
#define TCP_CONG_MASK (TCP_CONG_NON_RESTRICTED | TCP_CONG_NEEDS_ECN)
union tcp_cc_info;
--
2.34.1
* [PATCH v5 net-next 04/13] tcp: extend TCP flags to allow AE bit/ACE field
2024-11-05 10:06 [PATCH v5 net-next 00/13] AccECN protocol preparation patch series chia-yu.chang
` (2 preceding siblings ...)
2024-11-05 10:06 ` [PATCH v5 net-next 03/13] tcp: use BIT() macro in include/net/tcp.h chia-yu.chang
@ 2024-11-05 10:06 ` chia-yu.chang
2024-11-07 10:23 ` Eric Dumazet
2024-11-05 10:06 ` [PATCH v5 net-next 05/13] tcp: reorganize SYN ECN code chia-yu.chang
` (8 subsequent siblings)
12 siblings, 1 reply; 37+ messages in thread
From: chia-yu.chang @ 2024-11-05 10:06 UTC (permalink / raw)
To: netdev, dsahern, davem, edumazet, dsahern, pabeni, joel.granados,
kuba, andrew+netdev, horms, pablo, kadlec, netfilter-devel,
coreteam, ij, ncardwell, koen.de_schepper, g.white,
ingemar.s.johansson, mirja.kuehlewind, cheshire, rs.ietf,
Jason_Livingood, vidhi_goel
Cc: Chia-Yu Chang
From: Ilpo Järvinen <ij@kernel.org>
With AccECN, there is one additional TCP flag to be used (AE) and an
ACE field that overloads the definition of the AE, CWR, and ECE
flags. As tcp_flags was previously only 1 byte, byte-order handling
needs to be added for it.
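For illustration only, using the flag values added below (the sample
flags word is made up), the 3-bit ACE field can be read as a plain
counter out of the widened flags:

#include <stdint.h>
#include <stdio.h>

#define TCPHDR_ECE (1 << 6)
#define TCPHDR_CWR (1 << 7)
#define TCPHDR_AE  (1 << 8)
#define TCPHDR_ACE (TCPHDR_ECE | TCPHDR_CWR | TCPHDR_AE)

int main(void)
{
        uint16_t tcp_flags = TCPHDR_AE | TCPHDR_ECE;  /* hypothetical ACK */

        /* AE is the most significant ACE bit: AE=4, CWR=2, ECE=1 */
        printf("ACE = %d\n", (tcp_flags & TCPHDR_ACE) >> 6);  /* prints 5 */
        return 0;
}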
Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
---
include/net/tcp.h | 11 +++++++++--
include/uapi/linux/tcp.h | 9 ++++++---
net/ipv4/tcp_ipv4.c | 2 +-
net/ipv4/tcp_output.c | 8 ++++----
net/ipv6/tcp_ipv6.c | 2 +-
net/netfilter/nf_log_syslog.c | 8 +++++---
6 files changed, 26 insertions(+), 14 deletions(-)
diff --git a/include/net/tcp.h b/include/net/tcp.h
index cccc6b739532..a9948fe3537a 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -920,7 +920,14 @@ static inline u32 tcp_rsk_tsval(const struct tcp_request_sock *treq)
#define TCPHDR_URG BIT(5)
#define TCPHDR_ECE BIT(6)
#define TCPHDR_CWR BIT(7)
-
+#define TCPHDR_AE BIT(8)
+#define TCPHDR_FLAGS_MASK (TCPHDR_FIN | TCPHDR_SYN | TCPHDR_RST | \
+ TCPHDR_PSH | TCPHDR_ACK | TCPHDR_URG | \
+ TCPHDR_ECE | TCPHDR_CWR | TCPHDR_AE)
+#define tcp_flags_ntohs(th) (ntohs(*(__be16 *)&tcp_flag_word(th)) & \
+ TCPHDR_FLAGS_MASK)
+
+#define TCPHDR_ACE (TCPHDR_ECE | TCPHDR_CWR | TCPHDR_AE)
#define TCPHDR_SYN_ECN (TCPHDR_SYN | TCPHDR_ECE | TCPHDR_CWR)
/* State flags for sacked in struct tcp_skb_cb */
@@ -955,7 +962,7 @@ struct tcp_skb_cb {
u16 tcp_gso_size;
};
};
- __u8 tcp_flags; /* TCP header flags. (tcp[13]) */
+ __u16 tcp_flags; /* TCP header flags (tcp[12-13])*/
__u8 sacked; /* State flags for SACK. */
__u8 ip_dsfield; /* IPv4 tos or IPv6 dsfield */
diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
index dbf896f3146c..3fe08d7dddaf 100644
--- a/include/uapi/linux/tcp.h
+++ b/include/uapi/linux/tcp.h
@@ -28,7 +28,8 @@ struct tcphdr {
__be32 seq;
__be32 ack_seq;
#if defined(__LITTLE_ENDIAN_BITFIELD)
- __u16 res1:4,
+ __u16 ae:1,
+ res1:3,
doff:4,
fin:1,
syn:1,
@@ -40,7 +41,8 @@ struct tcphdr {
cwr:1;
#elif defined(__BIG_ENDIAN_BITFIELD)
__u16 doff:4,
- res1:4,
+ res1:3,
+ ae:1,
cwr:1,
ece:1,
urg:1,
@@ -70,6 +72,7 @@ union tcp_word_hdr {
#define tcp_flag_word(tp) (((union tcp_word_hdr *)(tp))->words[3])
enum {
+ TCP_FLAG_AE = __constant_cpu_to_be32(0x01000000),
TCP_FLAG_CWR = __constant_cpu_to_be32(0x00800000),
TCP_FLAG_ECE = __constant_cpu_to_be32(0x00400000),
TCP_FLAG_URG = __constant_cpu_to_be32(0x00200000),
@@ -78,7 +81,7 @@ enum {
TCP_FLAG_RST = __constant_cpu_to_be32(0x00040000),
TCP_FLAG_SYN = __constant_cpu_to_be32(0x00020000),
TCP_FLAG_FIN = __constant_cpu_to_be32(0x00010000),
- TCP_RESERVED_BITS = __constant_cpu_to_be32(0x0F000000),
+ TCP_RESERVED_BITS = __constant_cpu_to_be32(0x0E000000),
TCP_DATA_OFFSET = __constant_cpu_to_be32(0xF0000000)
};
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index a38c8b1f44db..536767723584 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2163,7 +2163,7 @@ static void tcp_v4_fill_cb(struct sk_buff *skb, const struct iphdr *iph,
TCP_SKB_CB(skb)->end_seq = (TCP_SKB_CB(skb)->seq + th->syn + th->fin +
skb->len - th->doff * 4);
TCP_SKB_CB(skb)->ack_seq = ntohl(th->ack_seq);
- TCP_SKB_CB(skb)->tcp_flags = tcp_flag_byte(th);
+ TCP_SKB_CB(skb)->tcp_flags = tcp_flags_ntohs(th);
TCP_SKB_CB(skb)->ip_dsfield = ipv4_get_dsfield(iph);
TCP_SKB_CB(skb)->sacked = 0;
TCP_SKB_CB(skb)->has_rxtstamp =
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 5485a70b5fe5..1b2f4a2e7332 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -400,7 +400,7 @@ static void tcp_ecn_send(struct sock *sk, struct sk_buff *skb,
/* Constructs common control bits of non-data skb. If SYN/FIN is present,
* auto increment end seqno.
*/
-static void tcp_init_nondata_skb(struct sk_buff *skb, u32 seq, u8 flags)
+static void tcp_init_nondata_skb(struct sk_buff *skb, u32 seq, u16 flags)
{
skb->ip_summed = CHECKSUM_PARTIAL;
@@ -1382,7 +1382,7 @@ static int __tcp_transmit_skb(struct sock *sk, struct sk_buff *skb,
th->seq = htonl(tcb->seq);
th->ack_seq = htonl(rcv_nxt);
*(((__be16 *)th) + 6) = htons(((tcp_header_size >> 2) << 12) |
- tcb->tcp_flags);
+ (tcb->tcp_flags & TCPHDR_FLAGS_MASK));
th->check = 0;
th->urg_ptr = 0;
@@ -1603,8 +1603,8 @@ int tcp_fragment(struct sock *sk, enum tcp_queue tcp_queue,
struct sk_buff *buff;
int old_factor;
long limit;
+ u16 flags;
int nlen;
- u8 flags;
if (WARN_ON(len > skb->len))
return -EINVAL;
@@ -2159,7 +2159,7 @@ static int tso_fragment(struct sock *sk, struct sk_buff *skb, unsigned int len,
{
int nlen = skb->len - len;
struct sk_buff *buff;
- u8 flags;
+ u16 flags;
/* All of a TSO frame must be composed of paged data. */
DEBUG_NET_WARN_ON_ONCE(skb->len != skb->data_len);
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index c748eeae1453..fec9acffb167 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1737,7 +1737,7 @@ static void tcp_v6_fill_cb(struct sk_buff *skb, const struct ipv6hdr *hdr,
TCP_SKB_CB(skb)->end_seq = (TCP_SKB_CB(skb)->seq + th->syn + th->fin +
skb->len - th->doff*4);
TCP_SKB_CB(skb)->ack_seq = ntohl(th->ack_seq);
- TCP_SKB_CB(skb)->tcp_flags = tcp_flag_byte(th);
+ TCP_SKB_CB(skb)->tcp_flags = tcp_flags_ntohs(th);
TCP_SKB_CB(skb)->ip_dsfield = ipv6_get_dsfield(hdr);
TCP_SKB_CB(skb)->sacked = 0;
TCP_SKB_CB(skb)->has_rxtstamp =
diff --git a/net/netfilter/nf_log_syslog.c b/net/netfilter/nf_log_syslog.c
index 58402226045e..86d5fc5d28e3 100644
--- a/net/netfilter/nf_log_syslog.c
+++ b/net/netfilter/nf_log_syslog.c
@@ -216,7 +216,9 @@ nf_log_dump_tcp_header(struct nf_log_buf *m,
/* Max length: 9 "RES=0x3C " */
nf_log_buf_add(m, "RES=0x%02x ", (u_int8_t)(ntohl(tcp_flag_word(th) &
TCP_RESERVED_BITS) >> 22));
- /* Max length: 32 "CWR ECE URG ACK PSH RST SYN FIN " */
+ /* Max length: 35 "AE CWR ECE URG ACK PSH RST SYN FIN " */
+ if (th->ae)
+ nf_log_buf_add(m, "AE ");
if (th->cwr)
nf_log_buf_add(m, "CWR ");
if (th->ece)
@@ -516,7 +518,7 @@ dump_ipv4_packet(struct net *net, struct nf_log_buf *m,
/* Proto Max log string length */
/* IP: 40+46+6+11+127 = 230 */
- /* TCP: 10+max(25,20+30+13+9+32+11+127) = 252 */
+ /* TCP: 10+max(25,20+30+13+9+35+11+127) = 255 */
/* UDP: 10+max(25,20) = 35 */
/* UDPLITE: 14+max(25,20) = 39 */
/* ICMP: 11+max(25, 18+25+max(19,14,24+3+n+10,3+n+10)) = 91+n */
@@ -526,7 +528,7 @@ dump_ipv4_packet(struct net *net, struct nf_log_buf *m,
/* (ICMP allows recursion one level deep) */
/* maxlen = IP + ICMP + IP + max(TCP,UDP,ICMP,unknown) */
- /* maxlen = 230+ 91 + 230 + 252 = 803 */
+ /* maxlen = 230+ 91 + 230 + 255 = 806 */
}
static noinline_for_stack void
--
2.34.1
* [PATCH v5 net-next 05/13] tcp: reorganize SYN ECN code
2024-11-05 10:06 [PATCH v5 net-next 00/13] AccECN protocol preparation patch series chia-yu.chang
` (3 preceding siblings ...)
2024-11-05 10:06 ` [PATCH v5 net-next 04/13] tcp: extend TCP flags to allow AE bit/ACE field chia-yu.chang
@ 2024-11-05 10:06 ` chia-yu.chang
2024-11-07 10:25 ` Eric Dumazet
2024-11-05 10:06 ` [PATCH v5 net-next 06/13] tcp: rework {__,}tcp_ecn_check_ce() -> tcp_data_ecn_check() chia-yu.chang
` (7 subsequent siblings)
12 siblings, 1 reply; 37+ messages in thread
From: chia-yu.chang @ 2024-11-05 10:06 UTC (permalink / raw)
To: netdev, dsahern, davem, edumazet, dsahern, pabeni, joel.granados,
kuba, andrew+netdev, horms, pablo, kadlec, netfilter-devel,
coreteam, ij, ncardwell, koen.de_schepper, g.white,
ingemar.s.johansson, mirja.kuehlewind, cheshire, rs.ietf,
Jason_Livingood, vidhi_goel
Cc: Chia-Yu Chang
From: Ilpo Järvinen <ij@kernel.org>
Prepare for AccECN, which needs access here to the IP ECN field
value that is only available after INET_ECN_xmit().
No functional changes.
Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
---
net/ipv4/tcp_output.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 1b2f4a2e7332..9c47b46aa14d 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -347,10 +347,11 @@ static void tcp_ecn_send_syn(struct sock *sk, struct sk_buff *skb)
tp->ecn_flags = 0;
if (use_ecn) {
- TCP_SKB_CB(skb)->tcp_flags |= TCPHDR_ECE | TCPHDR_CWR;
- tp->ecn_flags = TCP_ECN_OK;
if (tcp_ca_needs_ecn(sk) || bpf_needs_ecn)
INET_ECN_xmit(sk);
+
+ TCP_SKB_CB(skb)->tcp_flags |= TCPHDR_ECE | TCPHDR_CWR;
+ tp->ecn_flags = TCP_ECN_OK;
}
}
--
2.34.1
* [PATCH v5 net-next 06/13] tcp: rework {__,}tcp_ecn_check_ce() -> tcp_data_ecn_check()
2024-11-05 10:06 [PATCH v5 net-next 00/13] AccECN protocol preparation patch series chia-yu.chang
` (4 preceding siblings ...)
2024-11-05 10:06 ` [PATCH v5 net-next 05/13] tcp: reorganize SYN ECN code chia-yu.chang
@ 2024-11-05 10:06 ` chia-yu.chang
2024-11-07 10:30 ` Eric Dumazet
2024-11-05 10:06 ` [PATCH v5 net-next 07/13] tcp: helpers for ECN mode handling chia-yu.chang
` (6 subsequent siblings)
12 siblings, 1 reply; 37+ messages in thread
From: chia-yu.chang @ 2024-11-05 10:06 UTC (permalink / raw)
To: netdev, dsahern, davem, edumazet, dsahern, pabeni, joel.granados,
kuba, andrew+netdev, horms, pablo, kadlec, netfilter-devel,
coreteam, ij, ncardwell, koen.de_schepper, g.white,
ingemar.s.johansson, mirja.kuehlewind, cheshire, rs.ietf,
Jason_Livingood, vidhi_goel
Cc: Chia-Yu Chang
From: Ilpo Järvinen <ij@kernel.org>
Rename tcp_ecn_check_ce() to tcp_data_ecn_check() as it is called
only for data segments, not for ACKs (with AccECN, ACKs may also
carry ECN bits).
The extra "layer" in tcp_ecn_check_ce() that merely checks whether
ECN is enabled can be folded into the renamed function rather than
keeping a separate __ variant.
No functional changes.
Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
---
net/ipv4/tcp_input.c | 15 ++++++---------
1 file changed, 6 insertions(+), 9 deletions(-)
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index ecb3de69c6de..b5654f94453e 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -357,10 +357,13 @@ static void tcp_ecn_withdraw_cwr(struct tcp_sock *tp)
tp->ecn_flags &= ~TCP_ECN_QUEUE_CWR;
}
-static void __tcp_ecn_check_ce(struct sock *sk, const struct sk_buff *skb)
+static void tcp_data_ecn_check(struct sock *sk, const struct sk_buff *skb)
{
struct tcp_sock *tp = tcp_sk(sk);
+ if (!(tcp_sk(sk)->ecn_flags & TCP_ECN_OK))
+ return;
+
switch (TCP_SKB_CB(skb)->ip_dsfield & INET_ECN_MASK) {
case INET_ECN_NOT_ECT:
/* Funny extension: if ECT is not set on a segment,
@@ -389,12 +392,6 @@ static void __tcp_ecn_check_ce(struct sock *sk, const struct sk_buff *skb)
}
}
-static void tcp_ecn_check_ce(struct sock *sk, const struct sk_buff *skb)
-{
- if (tcp_sk(sk)->ecn_flags & TCP_ECN_OK)
- __tcp_ecn_check_ce(sk, skb);
-}
-
static void tcp_ecn_rcv_synack(struct tcp_sock *tp, const struct tcphdr *th)
{
if ((tp->ecn_flags & TCP_ECN_OK) && (!th->ece || th->cwr))
@@ -866,7 +863,7 @@ static void tcp_event_data_recv(struct sock *sk, struct sk_buff *skb)
icsk->icsk_ack.lrcvtime = now;
tcp_save_lrcv_flowlabel(sk, skb);
- tcp_ecn_check_ce(sk, skb);
+ tcp_data_ecn_check(sk, skb);
if (skb->len >= 128)
tcp_grow_window(sk, skb, true);
@@ -5028,7 +5025,7 @@ static void tcp_data_queue_ofo(struct sock *sk, struct sk_buff *skb)
bool fragstolen;
tcp_save_lrcv_flowlabel(sk, skb);
- tcp_ecn_check_ce(sk, skb);
+ tcp_data_ecn_check(sk, skb);
if (unlikely(tcp_try_rmem_schedule(sk, skb, skb->truesize))) {
NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPOFODROP);
--
2.34.1
* [PATCH v5 net-next 07/13] tcp: helpers for ECN mode handling
2024-11-05 10:06 [PATCH v5 net-next 00/13] AccECN protocol preparation patch series chia-yu.chang
` (5 preceding siblings ...)
2024-11-05 10:06 ` [PATCH v5 net-next 06/13] tcp: rework {__,}tcp_ecn_check_ce() -> tcp_data_ecn_check() chia-yu.chang
@ 2024-11-05 10:06 ` chia-yu.chang
2024-11-07 12:26 ` Eric Dumazet
2024-11-05 10:06 ` [PATCH v5 net-next 08/13] gso: AccECN support chia-yu.chang
` (5 subsequent siblings)
12 siblings, 1 reply; 37+ messages in thread
From: chia-yu.chang @ 2024-11-05 10:06 UTC (permalink / raw)
To: netdev, dsahern, davem, edumazet, dsahern, pabeni, joel.granados,
kuba, andrew+netdev, horms, pablo, kadlec, netfilter-devel,
coreteam, ij, ncardwell, koen.de_schepper, g.white,
ingemar.s.johansson, mirja.kuehlewind, cheshire, rs.ietf,
Jason_Livingood, vidhi_goel
Cc: Chia-Yu Chang
From: Ilpo Järvinen <ij@kernel.org>
Create helpers for TCP ECN modes. No functional changes.
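As a side note, the RFC3168 and ACCECN mode bits encode four states;
a small stand-alone sketch mirroring the definitions added below (the
printed state names are informal):

#include <stdio.h>

#define TCP_ECN_MODE_RFC3168    (1 << 0)
#define TCP_ECN_MODE_ACCECN     (1 << 4)
#define TCP_ECN_MODE_ANY        (TCP_ECN_MODE_RFC3168 | TCP_ECN_MODE_ACCECN)
#define TCP_ECN_MODE_PENDING    TCP_ECN_MODE_ANY
#define TCP_ECN_DISABLED        0

static const char *mode_name(unsigned int ecn_flags)
{
        unsigned int mode = ecn_flags & TCP_ECN_MODE_ANY;

        if (mode == TCP_ECN_MODE_PENDING)
                return "pending";
        if (mode == TCP_ECN_MODE_RFC3168)
                return "rfc3168";
        if (mode == TCP_ECN_MODE_ACCECN)
                return "accecn";
        return "disabled";
}

int main(void)
{
        unsigned int modes[] = { TCP_ECN_DISABLED, TCP_ECN_MODE_RFC3168,
                                 TCP_ECN_MODE_ACCECN, TCP_ECN_MODE_PENDING };
        int i;

        for (i = 0; i < 4; i++)
                printf("ecn_flags=0x%02x -> %s\n", modes[i],
                       mode_name(modes[i]));
        return 0;
}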
Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
---
include/net/tcp.h | 44 ++++++++++++++++++++++++++++++++++++----
net/ipv4/tcp.c | 2 +-
net/ipv4/tcp_dctcp.c | 2 +-
net/ipv4/tcp_input.c | 14 ++++++-------
net/ipv4/tcp_minisocks.c | 4 +++-
net/ipv4/tcp_output.c | 6 +++---
6 files changed, 55 insertions(+), 17 deletions(-)
diff --git a/include/net/tcp.h b/include/net/tcp.h
index a9948fe3537a..215b7ba105be 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -372,10 +372,46 @@ static inline void tcp_dec_quickack_mode(struct sock *sk)
}
}
-#define TCP_ECN_OK 1
-#define TCP_ECN_QUEUE_CWR 2
-#define TCP_ECN_DEMAND_CWR 4
-#define TCP_ECN_SEEN 8
+#define TCP_ECN_MODE_RFC3168 BIT(0)
+#define TCP_ECN_QUEUE_CWR BIT(1)
+#define TCP_ECN_DEMAND_CWR BIT(2)
+#define TCP_ECN_SEEN BIT(3)
+#define TCP_ECN_MODE_ACCECN BIT(4)
+
+#define TCP_ECN_DISABLED 0
+#define TCP_ECN_MODE_PENDING (TCP_ECN_MODE_RFC3168 | TCP_ECN_MODE_ACCECN)
+#define TCP_ECN_MODE_ANY (TCP_ECN_MODE_RFC3168 | TCP_ECN_MODE_ACCECN)
+
+static inline bool tcp_ecn_mode_any(const struct tcp_sock *tp)
+{
+ return tp->ecn_flags & TCP_ECN_MODE_ANY;
+}
+
+static inline bool tcp_ecn_mode_rfc3168(const struct tcp_sock *tp)
+{
+ return (tp->ecn_flags & TCP_ECN_MODE_ANY) == TCP_ECN_MODE_RFC3168;
+}
+
+static inline bool tcp_ecn_mode_accecn(const struct tcp_sock *tp)
+{
+ return (tp->ecn_flags & TCP_ECN_MODE_ANY) == TCP_ECN_MODE_ACCECN;
+}
+
+static inline bool tcp_ecn_disabled(const struct tcp_sock *tp)
+{
+ return !tcp_ecn_mode_any(tp);
+}
+
+static inline bool tcp_ecn_mode_pending(const struct tcp_sock *tp)
+{
+ return (tp->ecn_flags & TCP_ECN_MODE_PENDING) == TCP_ECN_MODE_PENDING;
+}
+
+static inline void tcp_ecn_mode_set(struct tcp_sock *tp, u8 mode)
+{
+ tp->ecn_flags &= ~TCP_ECN_MODE_ANY;
+ tp->ecn_flags |= mode;
+}
enum tcp_tw_status {
TCP_TW_SUCCESS = 0,
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 0d704bda6c41..e30204394175 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -4107,7 +4107,7 @@ void tcp_get_info(struct sock *sk, struct tcp_info *info)
info->tcpi_rcv_wscale = tp->rx_opt.rcv_wscale;
}
- if (tp->ecn_flags & TCP_ECN_OK)
+ if (tcp_ecn_mode_any(tp))
info->tcpi_options |= TCPI_OPT_ECN;
if (tp->ecn_flags & TCP_ECN_SEEN)
info->tcpi_options |= TCPI_OPT_ECN_SEEN;
diff --git a/net/ipv4/tcp_dctcp.c b/net/ipv4/tcp_dctcp.c
index 8a45a4aea933..03abe0848420 100644
--- a/net/ipv4/tcp_dctcp.c
+++ b/net/ipv4/tcp_dctcp.c
@@ -90,7 +90,7 @@ __bpf_kfunc static void dctcp_init(struct sock *sk)
{
const struct tcp_sock *tp = tcp_sk(sk);
- if ((tp->ecn_flags & TCP_ECN_OK) ||
+ if (tcp_ecn_mode_any(tp) ||
(sk->sk_state == TCP_LISTEN ||
sk->sk_state == TCP_CLOSE)) {
struct dctcp *ca = inet_csk_ca(sk);
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index b5654f94453e..bafd09ff9a70 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -334,7 +334,7 @@ static bool tcp_in_quickack_mode(struct sock *sk)
static void tcp_ecn_queue_cwr(struct tcp_sock *tp)
{
- if (tp->ecn_flags & TCP_ECN_OK)
+ if (tcp_ecn_mode_rfc3168(tp))
tp->ecn_flags |= TCP_ECN_QUEUE_CWR;
}
@@ -361,7 +361,7 @@ static void tcp_data_ecn_check(struct sock *sk, const struct sk_buff *skb)
{
struct tcp_sock *tp = tcp_sk(sk);
- if (!(tcp_sk(sk)->ecn_flags & TCP_ECN_OK))
+ if (tcp_ecn_disabled(tp))
return;
switch (TCP_SKB_CB(skb)->ip_dsfield & INET_ECN_MASK) {
@@ -394,19 +394,19 @@ static void tcp_data_ecn_check(struct sock *sk, const struct sk_buff *skb)
static void tcp_ecn_rcv_synack(struct tcp_sock *tp, const struct tcphdr *th)
{
- if ((tp->ecn_flags & TCP_ECN_OK) && (!th->ece || th->cwr))
- tp->ecn_flags &= ~TCP_ECN_OK;
+ if (tcp_ecn_mode_rfc3168(tp) && (!th->ece || th->cwr))
+ tcp_ecn_mode_set(tp, TCP_ECN_DISABLED);
}
static void tcp_ecn_rcv_syn(struct tcp_sock *tp, const struct tcphdr *th)
{
- if ((tp->ecn_flags & TCP_ECN_OK) && (!th->ece || !th->cwr))
- tp->ecn_flags &= ~TCP_ECN_OK;
+ if (tcp_ecn_mode_rfc3168(tp) && (!th->ece || !th->cwr))
+ tcp_ecn_mode_set(tp, TCP_ECN_DISABLED);
}
static bool tcp_ecn_rcv_ecn_echo(const struct tcp_sock *tp, const struct tcphdr *th)
{
- if (th->ece && !th->syn && (tp->ecn_flags & TCP_ECN_OK))
+ if (th->ece && !th->syn && tcp_ecn_mode_rfc3168(tp))
return true;
return false;
}
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index bb1fe1ba867a..bd6515ab660f 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -453,7 +453,9 @@ EXPORT_SYMBOL(tcp_openreq_init_rwin);
static void tcp_ecn_openreq_child(struct tcp_sock *tp,
const struct request_sock *req)
{
- tp->ecn_flags = inet_rsk(req)->ecn_ok ? TCP_ECN_OK : 0;
+ tcp_ecn_mode_set(tp, inet_rsk(req)->ecn_ok ?
+ TCP_ECN_MODE_RFC3168 :
+ TCP_ECN_DISABLED);
}
void tcp_ca_openreq_child(struct sock *sk, const struct dst_entry *dst)
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 9c47b46aa14d..f4ee86e2b1b5 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -322,7 +322,7 @@ static void tcp_ecn_send_synack(struct sock *sk, struct sk_buff *skb)
const struct tcp_sock *tp = tcp_sk(sk);
TCP_SKB_CB(skb)->tcp_flags &= ~TCPHDR_CWR;
- if (!(tp->ecn_flags & TCP_ECN_OK))
+ if (tcp_ecn_disabled(tp))
TCP_SKB_CB(skb)->tcp_flags &= ~TCPHDR_ECE;
else if (tcp_ca_needs_ecn(sk) ||
tcp_bpf_ca_needs_ecn(sk))
@@ -351,7 +351,7 @@ static void tcp_ecn_send_syn(struct sock *sk, struct sk_buff *skb)
INET_ECN_xmit(sk);
TCP_SKB_CB(skb)->tcp_flags |= TCPHDR_ECE | TCPHDR_CWR;
- tp->ecn_flags = TCP_ECN_OK;
+ tcp_ecn_mode_set(tp, TCP_ECN_MODE_RFC3168);
}
}
@@ -379,7 +379,7 @@ static void tcp_ecn_send(struct sock *sk, struct sk_buff *skb,
{
struct tcp_sock *tp = tcp_sk(sk);
- if (tp->ecn_flags & TCP_ECN_OK) {
+ if (tcp_ecn_mode_rfc3168(tp)) {
/* Not-retransmitted data segment: set ECT and inject CWR. */
if (skb->len != tcp_header_len &&
!before(TCP_SKB_CB(skb)->seq, tp->snd_nxt)) {
--
2.34.1
* [PATCH v5 net-next 08/13] gso: AccECN support
2024-11-05 10:06 [PATCH v5 net-next 00/13] AccECN protocol preparation patch series chia-yu.chang
` (6 preceding siblings ...)
2024-11-05 10:06 ` [PATCH v5 net-next 07/13] tcp: helpers for ECN mode handling chia-yu.chang
@ 2024-11-05 10:06 ` chia-yu.chang
2024-11-07 12:41 ` Eric Dumazet
2024-11-09 10:09 ` Ilpo Järvinen
2024-11-05 10:06 ` [PATCH v5 net-next 09/13] gro: prevent ACE field corruption & better AccECN handling chia-yu.chang
` (4 subsequent siblings)
12 siblings, 2 replies; 37+ messages in thread
From: chia-yu.chang @ 2024-11-05 10:06 UTC (permalink / raw)
To: netdev, dsahern, davem, edumazet, dsahern, pabeni, joel.granados,
kuba, andrew+netdev, horms, pablo, kadlec, netfilter-devel,
coreteam, ij, ncardwell, koen.de_schepper, g.white,
ingemar.s.johansson, mirja.kuehlewind, cheshire, rs.ietf,
Jason_Livingood, vidhi_goel
Cc: Chia-Yu Chang
From: Ilpo Järvinen <ij@kernel.org>
Handling of the CWR flag differs between RFC 3168 ECN and AccECN.
With RFC 3168 ECN aware TSO (NETIF_F_TSO_ECN), the CWR flag is
cleared from the 2nd segment onwards, which is incompatible with how
AccECN handles the CWR flag. Such super-segments are indicated by
SKB_GSO_TCP_ECN. With AccECN, the CWR flag (or more accurately, the
ACE field that also includes the ECE & AE flags) changes only when
new packet(s) with a CE mark arrive, so the flag must not be changed
within a super-skb.
The new skb/feature flags are necessary to prevent such TSO engines
from corrupting the AccECN ACE counters by clearing the CWR flag (if
the CWR handling feature cannot be turned off).
If a NIC is completely unaware of RFC 3168 ECN (does not support
NETIF_F_TSO_ECN), or its TSO engine can be configured not to touch
the CWR flag despite also supporting NETIF_F_TSO_ECN, TSO could be
safely used with AccECN on such a NIC. This should be evaluated on a
per-NIC basis (not done in this patch series for any NICs).
For the cases where TSO cannot keep its hands off the CWR flag, this
patch provides a GSO fallback.
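A small stand-alone illustration of what the fallback below does per
segment (the segment count and initial flag value are made up):

#include <stdbool.h>
#include <stdio.h>

int main(void)
{
        /* RFC 3168 style super-skb vs. SKB_GSO_TCP_ACCECN */
        bool accecn_gso[] = { false, true };
        int i, seg;

        for (i = 0; i < 2; i++) {
                unsigned int cwr = 1;   /* CWR set on the original header */

                for (seg = 1; seg <= 3; seg++) {
                        printf("%s segment %d: cwr=%u\n",
                               accecn_gso[i] ? "accecn " : "rfc3168",
                               seg, cwr);
                        cwr &= accecn_gso[i];   /* mask for later segments */
                }
        }
        return 0;
}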
Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
---
include/linux/netdev_features.h | 8 +++++---
include/linux/netdevice.h | 2 ++
include/linux/skbuff.h | 2 ++
net/ethtool/common.c | 1 +
net/ipv4/tcp_offload.c | 6 +++++-
5 files changed, 15 insertions(+), 4 deletions(-)
diff --git a/include/linux/netdev_features.h b/include/linux/netdev_features.h
index 66e7d26b70a4..c59db449bcf0 100644
--- a/include/linux/netdev_features.h
+++ b/include/linux/netdev_features.h
@@ -53,12 +53,12 @@ enum {
NETIF_F_GSO_UDP_BIT, /* ... UFO, deprecated except tuntap */
NETIF_F_GSO_UDP_L4_BIT, /* ... UDP payload GSO (not UFO) */
NETIF_F_GSO_FRAGLIST_BIT, /* ... Fraglist GSO */
+ NETIF_F_GSO_ACCECN_BIT, /* TCP AccECN w/ TSO (no clear CWR) */
/**/NETIF_F_GSO_LAST = /* last bit, see GSO_MASK */
- NETIF_F_GSO_FRAGLIST_BIT,
+ NETIF_F_GSO_ACCECN_BIT,
NETIF_F_FCOE_CRC_BIT, /* FCoE CRC32 */
NETIF_F_SCTP_CRC_BIT, /* SCTP checksum offload */
- __UNUSED_NETIF_F_37,
NETIF_F_NTUPLE_BIT, /* N-tuple filters supported */
NETIF_F_RXHASH_BIT, /* Receive hashing offload */
NETIF_F_RXCSUM_BIT, /* Receive checksumming offload */
@@ -128,6 +128,7 @@ enum {
#define NETIF_F_SG __NETIF_F(SG)
#define NETIF_F_TSO6 __NETIF_F(TSO6)
#define NETIF_F_TSO_ECN __NETIF_F(TSO_ECN)
+#define NETIF_F_GSO_ACCECN __NETIF_F(GSO_ACCECN)
#define NETIF_F_TSO __NETIF_F(TSO)
#define NETIF_F_VLAN_CHALLENGED __NETIF_F(VLAN_CHALLENGED)
#define NETIF_F_RXFCS __NETIF_F(RXFCS)
@@ -210,7 +211,8 @@ static inline int find_next_netdev_feature(u64 feature, unsigned long start)
NETIF_F_TSO_ECN | NETIF_F_TSO_MANGLEID)
/* List of features with software fallbacks. */
-#define NETIF_F_GSO_SOFTWARE (NETIF_F_ALL_TSO | NETIF_F_GSO_SCTP | \
+#define NETIF_F_GSO_SOFTWARE (NETIF_F_ALL_TSO | \
+ NETIF_F_GSO_ACCECN | NETIF_F_GSO_SCTP | \
NETIF_F_GSO_UDP_L4 | NETIF_F_GSO_FRAGLIST)
/*
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 6e0f8e4aeb14..480d915b3bdb 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -5076,6 +5076,8 @@ static inline bool net_gso_ok(netdev_features_t features, int gso_type)
BUILD_BUG_ON(SKB_GSO_UDP != (NETIF_F_GSO_UDP >> NETIF_F_GSO_SHIFT));
BUILD_BUG_ON(SKB_GSO_UDP_L4 != (NETIF_F_GSO_UDP_L4 >> NETIF_F_GSO_SHIFT));
BUILD_BUG_ON(SKB_GSO_FRAGLIST != (NETIF_F_GSO_FRAGLIST >> NETIF_F_GSO_SHIFT));
+ BUILD_BUG_ON(SKB_GSO_TCP_ACCECN !=
+ (NETIF_F_GSO_ACCECN >> NETIF_F_GSO_SHIFT));
return (features & feature) == feature;
}
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 48f1e0fa2a13..530cb325fb86 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -694,6 +694,8 @@ enum {
SKB_GSO_UDP_L4 = 1 << 17,
SKB_GSO_FRAGLIST = 1 << 18,
+
+ SKB_GSO_TCP_ACCECN = 1 << 19,
};
#if BITS_PER_LONG > 32
diff --git a/net/ethtool/common.c b/net/ethtool/common.c
index 0d62363dbd9d..5c3ba2dfaa74 100644
--- a/net/ethtool/common.c
+++ b/net/ethtool/common.c
@@ -32,6 +32,7 @@ const char netdev_features_strings[NETDEV_FEATURE_COUNT][ETH_GSTRING_LEN] = {
[NETIF_F_TSO_BIT] = "tx-tcp-segmentation",
[NETIF_F_GSO_ROBUST_BIT] = "tx-gso-robust",
[NETIF_F_TSO_ECN_BIT] = "tx-tcp-ecn-segmentation",
+ [NETIF_F_GSO_ACCECN_BIT] = "tx-tcp-accecn-segmentation",
[NETIF_F_TSO_MANGLEID_BIT] = "tx-tcp-mangleid-segmentation",
[NETIF_F_TSO6_BIT] = "tx-tcp6-segmentation",
[NETIF_F_FSO_BIT] = "tx-fcoe-segmentation",
diff --git a/net/ipv4/tcp_offload.c b/net/ipv4/tcp_offload.c
index 2308665b51c5..0b05f30e9e5f 100644
--- a/net/ipv4/tcp_offload.c
+++ b/net/ipv4/tcp_offload.c
@@ -139,6 +139,7 @@ struct sk_buff *tcp_gso_segment(struct sk_buff *skb,
struct sk_buff *gso_skb = skb;
__sum16 newcheck;
bool ooo_okay, copy_destructor;
+ bool ecn_cwr_mask;
__wsum delta;
th = tcp_hdr(skb);
@@ -198,6 +199,8 @@ struct sk_buff *tcp_gso_segment(struct sk_buff *skb,
newcheck = ~csum_fold(csum_add(csum_unfold(th->check), delta));
+ ecn_cwr_mask = !!(skb_shinfo(gso_skb)->gso_type & SKB_GSO_TCP_ACCECN);
+
while (skb->next) {
th->fin = th->psh = 0;
th->check = newcheck;
@@ -217,7 +220,8 @@ struct sk_buff *tcp_gso_segment(struct sk_buff *skb,
th = tcp_hdr(skb);
th->seq = htonl(seq);
- th->cwr = 0;
+
+ th->cwr &= ecn_cwr_mask;
}
/* Following permits TCP Small Queues to work well with GSO :
--
2.34.1
* [PATCH v5 net-next 09/13] gro: prevent ACE field corruption & better AccECN handling
2024-11-05 10:06 [PATCH v5 net-next 00/13] AccECN protocol preparation patch series chia-yu.chang
` (7 preceding siblings ...)
2024-11-05 10:06 ` [PATCH v5 net-next 08/13] gso: AccECN support chia-yu.chang
@ 2024-11-05 10:06 ` chia-yu.chang
2024-11-07 12:40 ` Eric Dumazet
2024-11-05 10:06 ` [PATCH v5 net-next 10/13] tcp: AccECN support to tcp_add_backlog chia-yu.chang
` (3 subsequent siblings)
12 siblings, 1 reply; 37+ messages in thread
From: chia-yu.chang @ 2024-11-05 10:06 UTC (permalink / raw)
To: netdev, dsahern, davem, edumazet, dsahern, pabeni, joel.granados,
kuba, andrew+netdev, horms, pablo, kadlec, netfilter-devel,
coreteam, ij, ncardwell, koen.de_schepper, g.white,
ingemar.s.johansson, mirja.kuehlewind, cheshire, rs.ietf,
Jason_Livingood, vidhi_goel
Cc: Chia-Yu Chang
From: Ilpo Järvinen <ij@kernel.org>
There are important differences in how the CWR field behaves in
RFC 3168 and AccECN. With AccECN, the CWR flag is part of the ACE
counter and its changes are significant, so adjust the changed-flags
mask accordingly.
Also, when CWR is present, set the Accurate ECN GSO flag to avoid
the CWR flag getting corrupted further along the path.
Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
---
net/ipv4/tcp_offload.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/net/ipv4/tcp_offload.c b/net/ipv4/tcp_offload.c
index 0b05f30e9e5f..f59762d88c38 100644
--- a/net/ipv4/tcp_offload.c
+++ b/net/ipv4/tcp_offload.c
@@ -329,7 +329,7 @@ struct sk_buff *tcp_gro_receive(struct list_head *head, struct sk_buff *skb,
th2 = tcp_hdr(p);
flush = (__force int)(flags & TCP_FLAG_CWR);
flush |= (__force int)((flags ^ tcp_flag_word(th2)) &
- ~(TCP_FLAG_CWR | TCP_FLAG_FIN | TCP_FLAG_PSH));
+ ~(TCP_FLAG_FIN | TCP_FLAG_PSH));
flush |= (__force int)(th->ack_seq ^ th2->ack_seq);
for (i = sizeof(*th); i < thlen; i += 4)
flush |= *(u32 *)((u8 *)th + i) ^
@@ -405,7 +405,7 @@ void tcp_gro_complete(struct sk_buff *skb)
shinfo->gso_segs = NAPI_GRO_CB(skb)->count;
if (th->cwr)
- shinfo->gso_type |= SKB_GSO_TCP_ECN;
+ shinfo->gso_type |= SKB_GSO_TCP_ACCECN;
}
EXPORT_SYMBOL(tcp_gro_complete);
--
2.34.1
* [PATCH v5 net-next 10/13] tcp: AccECN support to tcp_add_backlog
2024-11-05 10:06 [PATCH v5 net-next 00/13] AccECN protocol preparation patch series chia-yu.chang
` (8 preceding siblings ...)
2024-11-05 10:06 ` [PATCH v5 net-next 09/13] gro: prevent ACE field corruption & better AccECN handling chia-yu.chang
@ 2024-11-05 10:06 ` chia-yu.chang
2024-11-07 12:42 ` Eric Dumazet
2024-11-05 10:06 ` [PATCH v5 net-next 11/13] tcp: allow ECN bits in TOS/traffic class chia-yu.chang
` (2 subsequent siblings)
12 siblings, 1 reply; 37+ messages in thread
From: chia-yu.chang @ 2024-11-05 10:06 UTC (permalink / raw)
To: netdev, dsahern, davem, edumazet, dsahern, pabeni, joel.granados,
kuba, andrew+netdev, horms, pablo, kadlec, netfilter-devel,
coreteam, ij, ncardwell, koen.de_schepper, g.white,
ingemar.s.johansson, mirja.kuehlewind, cheshire, rs.ietf,
Jason_Livingood, vidhi_goel
Cc: Chia-Yu Chang
From: Ilpo Järvinen <ij@kernel.org>
AE flag needs to be preserved for AccECN.
Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
---
net/ipv4/tcp_ipv4.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 536767723584..a13d6745d92b 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2055,7 +2055,8 @@ bool tcp_add_backlog(struct sock *sk, struct sk_buff *skb,
!((TCP_SKB_CB(tail)->tcp_flags &
TCP_SKB_CB(skb)->tcp_flags) & TCPHDR_ACK) ||
((TCP_SKB_CB(tail)->tcp_flags ^
- TCP_SKB_CB(skb)->tcp_flags) & (TCPHDR_ECE | TCPHDR_CWR)) ||
+ TCP_SKB_CB(skb)->tcp_flags) &
+ (TCPHDR_ECE | TCPHDR_CWR | TCPHDR_AE)) ||
!tcp_skb_can_collapse_rx(tail, skb) ||
thtail->doff != th->doff ||
memcmp(thtail + 1, th + 1, hdrlen - sizeof(*th)))
--
2.34.1
* [PATCH v5 net-next 11/13] tcp: allow ECN bits in TOS/traffic class
2024-11-05 10:06 [PATCH v5 net-next 00/13] AccECN protocol preparation patch series chia-yu.chang
` (9 preceding siblings ...)
2024-11-05 10:06 ` [PATCH v5 net-next 10/13] tcp: AccECN support to tcp_add_backlog chia-yu.chang
@ 2024-11-05 10:06 ` chia-yu.chang
2024-11-07 12:55 ` Eric Dumazet
2024-11-05 10:06 ` [PATCH v5 net-next 12/13] tcp: Pass flags to __tcp_send_ack chia-yu.chang
2024-11-05 10:06 ` [PATCH v5 net-next 13/13] tcp: fast path functions later chia-yu.chang
12 siblings, 1 reply; 37+ messages in thread
From: chia-yu.chang @ 2024-11-05 10:06 UTC (permalink / raw)
To: netdev, dsahern, davem, edumazet, dsahern, pabeni, joel.granados,
kuba, andrew+netdev, horms, pablo, kadlec, netfilter-devel,
coreteam, ij, ncardwell, koen.de_schepper, g.white,
ingemar.s.johansson, mirja.kuehlewind, cheshire, rs.ietf,
Jason_Livingood, vidhi_goel
Cc: Chia-Yu Chang
From: Ilpo Järvinen <ij@kernel.org>
An AccECN connection's last ACK cannot currently retain ECT(1)
because the ECN bits are always cleared, causing the packet to
switch into another service queue.
This patch effectively adds finer-grained filtering of the ECN bits
so that acceptable TW ACKs can retain them.
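A tiny stand-alone illustration of that filtering (the TOS value is
made up; INET_ECN_MASK mirrors include/net/inet_ecn.h):

#include <stdio.h>

#define INET_ECN_MASK 3

int main(void)
{
        unsigned int tos = 0x05;        /* hypothetical: DSCP 1 + ECT(1) */

        printf("out-of-window TW ACK: tos=0x%02x\n", tos & ~INET_ECN_MASK);
        printf("acceptable TW ACK:    tos=0x%02x\n", tos);
        return 0;
}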
Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
---
include/net/tcp.h | 3 ++-
net/ipv4/ip_output.c | 3 +--
net/ipv4/tcp_ipv4.c | 23 +++++++++++++++++------
net/ipv4/tcp_minisocks.c | 2 +-
net/ipv6/tcp_ipv6.c | 24 +++++++++++++++++-------
5 files changed, 38 insertions(+), 17 deletions(-)
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 215b7ba105be..3a8782874333 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -417,7 +417,8 @@ enum tcp_tw_status {
TCP_TW_SUCCESS = 0,
TCP_TW_RST = 1,
TCP_TW_ACK = 2,
- TCP_TW_SYN = 3
+ TCP_TW_SYN = 3,
+ TCP_TW_ACK_OOW = 4
};
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 0065b1996c94..2fe7b1df3b90 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -75,7 +75,6 @@
#include <net/checksum.h>
#include <net/gso.h>
#include <net/inetpeer.h>
-#include <net/inet_ecn.h>
#include <net/lwtunnel.h>
#include <net/inet_dscp.h>
#include <linux/bpf-cgroup.h>
@@ -1643,7 +1642,7 @@ void ip_send_unicast_reply(struct sock *sk, const struct sock *orig_sk,
if (IS_ERR(rt))
return;
- inet_sk(sk)->tos = arg->tos & ~INET_ECN_MASK;
+ inet_sk(sk)->tos = arg->tos;
sk->sk_protocol = ip_hdr(skb)->protocol;
sk->sk_bound_dev_if = arg->bound_dev_if;
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index a13d6745d92b..1950d4cd5da8 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -66,6 +66,7 @@
#include <net/transp_v6.h>
#include <net/ipv6.h>
#include <net/inet_common.h>
+#include <net/inet_ecn.h>
#include <net/timewait_sock.h>
#include <net/xfrm.h>
#include <net/secure_seq.h>
@@ -887,7 +888,7 @@ static void tcp_v4_send_reset(const struct sock *sk, struct sk_buff *skb,
BUILD_BUG_ON(offsetof(struct sock, sk_bound_dev_if) !=
offsetof(struct inet_timewait_sock, tw_bound_dev_if));
- arg.tos = ip_hdr(skb)->tos;
+ arg.tos = ip_hdr(skb)->tos & ~INET_ECN_MASK;
arg.uid = sock_net_uid(net, sk && sk_fullsock(sk) ? sk : NULL);
local_bh_disable();
local_lock_nested_bh(&ipv4_tcp_sk.bh_lock);
@@ -1033,11 +1034,17 @@ static void tcp_v4_send_ack(const struct sock *sk,
local_bh_enable();
}
-static void tcp_v4_timewait_ack(struct sock *sk, struct sk_buff *skb)
+static void tcp_v4_timewait_ack(struct sock *sk, struct sk_buff *skb,
+ enum tcp_tw_status tw_status)
{
struct inet_timewait_sock *tw = inet_twsk(sk);
struct tcp_timewait_sock *tcptw = tcp_twsk(sk);
struct tcp_key key = {};
+ u8 tos = tw->tw_tos;
+
+ if (tw_status == TCP_TW_ACK_OOW)
+ tos &= ~INET_ECN_MASK;
+
#ifdef CONFIG_TCP_AO
struct tcp_ao_info *ao_info;
@@ -1081,7 +1088,7 @@ static void tcp_v4_timewait_ack(struct sock *sk, struct sk_buff *skb)
READ_ONCE(tcptw->tw_ts_recent),
tw->tw_bound_dev_if, &key,
tw->tw_transparent ? IP_REPLY_ARG_NOSRCCHECK : 0,
- tw->tw_tos,
+ tos,
tw->tw_txhash);
inet_twsk_put(tw);
@@ -1158,7 +1165,7 @@ static void tcp_v4_reqsk_send_ack(const struct sock *sk, struct sk_buff *skb,
READ_ONCE(req->ts_recent),
0, &key,
inet_rsk(req)->no_srccheck ? IP_REPLY_ARG_NOSRCCHECK : 0,
- ip_hdr(skb)->tos,
+ ip_hdr(skb)->tos & ~INET_ECN_MASK,
READ_ONCE(tcp_rsk(req)->txhash));
if (tcp_key_is_ao(&key))
kfree(key.traffic_key);
@@ -2179,6 +2186,7 @@ int tcp_v4_rcv(struct sk_buff *skb)
{
struct net *net = dev_net(skb->dev);
enum skb_drop_reason drop_reason;
+ enum tcp_tw_status tw_status;
int sdif = inet_sdif(skb);
int dif = inet_iif(skb);
const struct iphdr *iph;
@@ -2405,7 +2413,9 @@ int tcp_v4_rcv(struct sk_buff *skb)
inet_twsk_put(inet_twsk(sk));
goto csum_error;
}
- switch (tcp_timewait_state_process(inet_twsk(sk), skb, th, &isn)) {
+
+ tw_status = tcp_timewait_state_process(inet_twsk(sk), skb, th, &isn);
+ switch (tw_status) {
case TCP_TW_SYN: {
struct sock *sk2 = inet_lookup_listener(net,
net->ipv4.tcp_death_row.hashinfo,
@@ -2426,7 +2436,8 @@ int tcp_v4_rcv(struct sk_buff *skb)
/* to ACK */
fallthrough;
case TCP_TW_ACK:
- tcp_v4_timewait_ack(sk, skb);
+ case TCP_TW_ACK_OOW:
+ tcp_v4_timewait_ack(sk, skb, tw_status);
break;
case TCP_TW_RST:
tcp_v4_send_reset(sk, skb, SK_RST_REASON_TCP_TIMEWAIT_SOCKET);
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index bd6515ab660f..8fb9f550fdeb 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -44,7 +44,7 @@ tcp_timewait_check_oow_rate_limit(struct inet_timewait_sock *tw,
/* Send ACK. Note, we do not put the bucket,
* it will be released by caller.
*/
- return TCP_TW_ACK;
+ return TCP_TW_ACK_OOW;
}
/* We are rate-limiting, so just release the tw sock and drop skb. */
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index fec9acffb167..ea85c117bf96 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -997,7 +997,7 @@ static void tcp_v6_send_response(const struct sock *sk, struct sk_buff *skb, u32
if (!IS_ERR(dst)) {
skb_dst_set(buff, dst);
ip6_xmit(ctl_sk, buff, &fl6, fl6.flowi6_mark, NULL,
- tclass & ~INET_ECN_MASK, priority);
+ tclass, priority);
TCP_INC_STATS(net, TCP_MIB_OUTSEGS);
if (rst)
TCP_INC_STATS(net, TCP_MIB_OUTRSTS);
@@ -1133,7 +1133,8 @@ static void tcp_v6_send_reset(const struct sock *sk, struct sk_buff *skb,
trace_tcp_send_reset(sk, skb, reason);
tcp_v6_send_response(sk, skb, seq, ack_seq, 0, 0, 0, oif, 1,
- ipv6_get_dsfield(ipv6h), label, priority, txhash,
+ ipv6_get_dsfield(ipv6h) & ~INET_ECN_MASK,
+ label, priority, txhash,
&key);
#if defined(CONFIG_TCP_MD5SIG) || defined(CONFIG_TCP_AO)
@@ -1153,11 +1154,16 @@ static void tcp_v6_send_ack(const struct sock *sk, struct sk_buff *skb, u32 seq,
tclass, label, priority, txhash, key);
}
-static void tcp_v6_timewait_ack(struct sock *sk, struct sk_buff *skb)
+static void tcp_v6_timewait_ack(struct sock *sk, struct sk_buff *skb,
+ enum tcp_tw_status tw_status)
{
struct inet_timewait_sock *tw = inet_twsk(sk);
struct tcp_timewait_sock *tcptw = tcp_twsk(sk);
+ u8 tclass = tw->tw_tclass;
struct tcp_key key = {};
+
+ if (tw_status == TCP_TW_ACK_OOW)
+ tclass &= ~INET_ECN_MASK;
#ifdef CONFIG_TCP_AO
struct tcp_ao_info *ao_info;
@@ -1201,7 +1207,7 @@ static void tcp_v6_timewait_ack(struct sock *sk, struct sk_buff *skb)
tcptw->tw_rcv_wnd >> tw->tw_rcv_wscale,
tcp_tw_tsval(tcptw),
READ_ONCE(tcptw->tw_ts_recent), tw->tw_bound_dev_if,
- &key, tw->tw_tclass, cpu_to_be32(tw->tw_flowlabel),
+ &key, tclass, cpu_to_be32(tw->tw_flowlabel),
tw->tw_priority, tw->tw_txhash);
#ifdef CONFIG_TCP_AO
@@ -1278,7 +1284,8 @@ static void tcp_v6_reqsk_send_ack(const struct sock *sk, struct sk_buff *skb,
tcp_synack_window(req) >> inet_rsk(req)->rcv_wscale,
tcp_rsk_tsval(tcp_rsk(req)),
READ_ONCE(req->ts_recent), sk->sk_bound_dev_if,
- &key, ipv6_get_dsfield(ipv6_hdr(skb)), 0,
+ &key, ipv6_get_dsfield(ipv6_hdr(skb)) & ~INET_ECN_MASK,
+ 0,
READ_ONCE(sk->sk_priority),
READ_ONCE(tcp_rsk(req)->txhash));
if (tcp_key_is_ao(&key))
@@ -1747,6 +1754,7 @@ static void tcp_v6_fill_cb(struct sk_buff *skb, const struct ipv6hdr *hdr,
INDIRECT_CALLABLE_SCOPE int tcp_v6_rcv(struct sk_buff *skb)
{
enum skb_drop_reason drop_reason;
+ enum tcp_tw_status tw_status;
int sdif = inet6_sdif(skb);
int dif = inet6_iif(skb);
const struct tcphdr *th;
@@ -1967,7 +1975,8 @@ INDIRECT_CALLABLE_SCOPE int tcp_v6_rcv(struct sk_buff *skb)
goto csum_error;
}
- switch (tcp_timewait_state_process(inet_twsk(sk), skb, th, &isn)) {
+ tw_status = tcp_timewait_state_process(inet_twsk(sk), skb, th, &isn);
+ switch (tw_status) {
case TCP_TW_SYN:
{
struct sock *sk2;
@@ -1992,7 +2001,8 @@ INDIRECT_CALLABLE_SCOPE int tcp_v6_rcv(struct sk_buff *skb)
/* to ACK */
fallthrough;
case TCP_TW_ACK:
- tcp_v6_timewait_ack(sk, skb);
+ case TCP_TW_ACK_OOW:
+ tcp_v6_timewait_ack(sk, skb, tw_status);
break;
case TCP_TW_RST:
tcp_v6_send_reset(sk, skb, SK_RST_REASON_TCP_TIMEWAIT_SOCKET);
--
2.34.1
* [PATCH v5 net-next 12/13] tcp: Pass flags to __tcp_send_ack
2024-11-05 10:06 [PATCH v5 net-next 00/13] AccECN protocol preparation patch series chia-yu.chang
` (10 preceding siblings ...)
2024-11-05 10:06 ` [PATCH v5 net-next 11/13] tcp: allow ECN bits in TOS/traffic class chia-yu.chang
@ 2024-11-05 10:06 ` chia-yu.chang
2024-11-07 12:58 ` Eric Dumazet
2024-11-05 10:06 ` [PATCH v5 net-next 13/13] tcp: fast path functions later chia-yu.chang
12 siblings, 1 reply; 37+ messages in thread
From: chia-yu.chang @ 2024-11-05 10:06 UTC (permalink / raw)
To: netdev, dsahern, davem, edumazet, dsahern, pabeni, joel.granados,
kuba, andrew+netdev, horms, pablo, kadlec, netfilter-devel,
coreteam, ij, ncardwell, koen.de_schepper, g.white,
ingemar.s.johansson, mirja.kuehlewind, cheshire, rs.ietf,
Jason_Livingood, vidhi_goel
Cc: Chia-Yu Chang
From: Ilpo Järvinen <ij@kernel.org>
Accurate ECN needs to send custom flags to handle IP-ECN field
reflection during the handshake.
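The new argument is simply OR'ed together with TCPHDR_ACK into the
header flags of the generated ACK; a hedged usage sketch (the flag
choice here is hypothetical, the real AccECN callers arrive only in a
later series):

/* existing callers stay unchanged */
__tcp_send_ack(sk, tcp_sk(sk)->rcv_nxt, 0);

/* hypothetical AccECN-style caller passing extra header flags */
__tcp_send_ack(sk, tcp_sk(sk)->rcv_nxt, TCPHDR_CWR | TCPHDR_ECE);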
Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
---
include/net/tcp.h | 2 +-
net/ipv4/bpf_tcp_ca.c | 2 +-
net/ipv4/tcp_dctcp.h | 2 +-
net/ipv4/tcp_output.c | 6 +++---
4 files changed, 6 insertions(+), 6 deletions(-)
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 3a8782874333..fc9d181e9362 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -704,7 +704,7 @@ void tcp_send_active_reset(struct sock *sk, gfp_t priority,
enum sk_rst_reason reason);
int tcp_send_synack(struct sock *);
void tcp_push_one(struct sock *, unsigned int mss_now);
-void __tcp_send_ack(struct sock *sk, u32 rcv_nxt);
+void __tcp_send_ack(struct sock *sk, u32 rcv_nxt, u16 flags);
void tcp_send_ack(struct sock *sk);
void tcp_send_delayed_ack(struct sock *sk);
void tcp_send_loss_probe(struct sock *sk);
diff --git a/net/ipv4/bpf_tcp_ca.c b/net/ipv4/bpf_tcp_ca.c
index 554804774628..e01492234b0b 100644
--- a/net/ipv4/bpf_tcp_ca.c
+++ b/net/ipv4/bpf_tcp_ca.c
@@ -121,7 +121,7 @@ static int bpf_tcp_ca_btf_struct_access(struct bpf_verifier_log *log,
BPF_CALL_2(bpf_tcp_send_ack, struct tcp_sock *, tp, u32, rcv_nxt)
{
/* bpf_tcp_ca prog cannot have NULL tp */
- __tcp_send_ack((struct sock *)tp, rcv_nxt);
+ __tcp_send_ack((struct sock *)tp, rcv_nxt, 0);
return 0;
}
diff --git a/net/ipv4/tcp_dctcp.h b/net/ipv4/tcp_dctcp.h
index d69a77cbd0c7..4b0259111d81 100644
--- a/net/ipv4/tcp_dctcp.h
+++ b/net/ipv4/tcp_dctcp.h
@@ -28,7 +28,7 @@ static inline void dctcp_ece_ack_update(struct sock *sk, enum tcp_ca_event evt,
*/
if (inet_csk(sk)->icsk_ack.pending & ICSK_ACK_TIMER) {
dctcp_ece_ack_cwr(sk, *ce_state);
- __tcp_send_ack(sk, *prior_rcv_nxt);
+ __tcp_send_ack(sk, *prior_rcv_nxt, 0);
}
inet_csk(sk)->icsk_ack.pending |= ICSK_ACK_NOW;
}
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index f4ee86e2b1b5..fdb4af44c47e 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -4230,7 +4230,7 @@ void tcp_send_delayed_ack(struct sock *sk)
}
/* This routine sends an ack and also updates the window. */
-void __tcp_send_ack(struct sock *sk, u32 rcv_nxt)
+void __tcp_send_ack(struct sock *sk, u32 rcv_nxt, u16 flags)
{
struct sk_buff *buff;
@@ -4259,7 +4259,7 @@ void __tcp_send_ack(struct sock *sk, u32 rcv_nxt)
/* Reserve space for headers and prepare control bits. */
skb_reserve(buff, MAX_TCP_HEADER);
- tcp_init_nondata_skb(buff, tcp_acceptable_seq(sk), TCPHDR_ACK);
+ tcp_init_nondata_skb(buff, tcp_acceptable_seq(sk), TCPHDR_ACK | flags);
/* We do not want pure acks influencing TCP Small Queues or fq/pacing
* too much.
@@ -4274,7 +4274,7 @@ EXPORT_SYMBOL_GPL(__tcp_send_ack);
void tcp_send_ack(struct sock *sk)
{
- __tcp_send_ack(sk, tcp_sk(sk)->rcv_nxt);
+ __tcp_send_ack(sk, tcp_sk(sk)->rcv_nxt, 0);
}
/* This routine sends a packet with an out of date sequence
--
2.34.1
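As a usage sketch (not part of this patch): callers that do not care about AccECN keep passing 0, while the follow-up AccECN series can hand extra TCPHDR_* bits to the ACK it builds. The flag combination below is illustrative only; the concrete bits are decided by the later series:
	/* Unchanged behaviour for existing callers */
	__tcp_send_ack(sk, tcp_sk(sk)->rcv_nxt, 0);

	/* Hypothetical AccECN handshake ACK reflecting the received IP-ECN
	 * field in the ACE bits (illustrative flag choice only).
	 */
	__tcp_send_ack(sk, tcp_sk(sk)->rcv_nxt, TCPHDR_CWR | TCPHDR_ECE);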
^ permalink raw reply related [flat|nested] 37+ messages in thread
* [PATCH v5 net-next 13/13] tcp: fast path functions later
2024-11-05 10:06 [PATCH v5 net-next 00/13] AccECN protocol preparation patch series chia-yu.chang
` (11 preceding siblings ...)
2024-11-05 10:06 ` [PATCH v5 net-next 12/13] tcp: Pass flags to __tcp_send_ack chia-yu.chang
@ 2024-11-05 10:06 ` chia-yu.chang
2024-11-07 13:00 ` Eric Dumazet
12 siblings, 1 reply; 37+ messages in thread
From: chia-yu.chang @ 2024-11-05 10:06 UTC (permalink / raw)
To: netdev, dsahern, davem, edumazet, dsahern, pabeni, joel.granados,
kuba, andrew+netdev, horms, pablo, kadlec, netfilter-devel,
coreteam, ij, ncardwell, koen.de_schepper, g.white,
ingemar.s.johansson, mirja.kuehlewind, cheshire, rs.ietf,
Jason_Livingood, vidhi_goel
Cc: Chia-Yu Chang
From: Ilpo Järvinen <ij@kernel.org>
The following patch will use tcp_ecn_mode_accecn(),
TCP_ACCECN_CEP_INIT_OFFSET, TCP_ACCECN_CEP_ACE_MASK in
__tcp_fast_path_on() to make new flag for AccECN.
No functional changes.
Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
---
include/net/tcp.h | 54 +++++++++++++++++++++++------------------------
1 file changed, 27 insertions(+), 27 deletions(-)
diff --git a/include/net/tcp.h b/include/net/tcp.h
index fc9d181e9362..6255977bd7f9 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -788,33 +788,6 @@ static inline u32 __tcp_set_rto(const struct tcp_sock *tp)
return usecs_to_jiffies((tp->srtt_us >> 3) + tp->rttvar_us);
}
-static inline void __tcp_fast_path_on(struct tcp_sock *tp, u32 snd_wnd)
-{
- /* mptcp hooks are only on the slow path */
- if (sk_is_mptcp((struct sock *)tp))
- return;
-
- tp->pred_flags = htonl((tp->tcp_header_len << 26) |
- ntohl(TCP_FLAG_ACK) |
- snd_wnd);
-}
-
-static inline void tcp_fast_path_on(struct tcp_sock *tp)
-{
- __tcp_fast_path_on(tp, tp->snd_wnd >> tp->rx_opt.snd_wscale);
-}
-
-static inline void tcp_fast_path_check(struct sock *sk)
-{
- struct tcp_sock *tp = tcp_sk(sk);
-
- if (RB_EMPTY_ROOT(&tp->out_of_order_queue) &&
- tp->rcv_wnd &&
- atomic_read(&sk->sk_rmem_alloc) < sk->sk_rcvbuf &&
- !tp->urg_data)
- tcp_fast_path_on(tp);
-}
-
u32 tcp_delack_max(const struct sock *sk);
/* Compute the actual rto_min value */
@@ -1770,6 +1743,33 @@ static inline bool tcp_paws_reject(const struct tcp_options_received *rx_opt,
return true;
}
+static inline void __tcp_fast_path_on(struct tcp_sock *tp, u32 snd_wnd)
+{
+ /* mptcp hooks are only on the slow path */
+ if (sk_is_mptcp((struct sock *)tp))
+ return;
+
+ tp->pred_flags = htonl((tp->tcp_header_len << 26) |
+ ntohl(TCP_FLAG_ACK) |
+ snd_wnd);
+}
+
+static inline void tcp_fast_path_on(struct tcp_sock *tp)
+{
+ __tcp_fast_path_on(tp, tp->snd_wnd >> tp->rx_opt.snd_wscale);
+}
+
+static inline void tcp_fast_path_check(struct sock *sk)
+{
+ struct tcp_sock *tp = tcp_sk(sk);
+
+ if (RB_EMPTY_ROOT(&tp->out_of_order_queue) &&
+ tp->rcv_wnd &&
+ atomic_read(&sk->sk_rmem_alloc) < sk->sk_rcvbuf &&
+ !tp->urg_data)
+ tcp_fast_path_on(tp);
+}
+
bool tcp_oow_rate_limited(struct net *net, const struct sk_buff *skb,
int mib_idx, u32 *last_oow_ack_time);
--
2.34.1
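Side note on the moved (otherwise unchanged) code: pred_flags is the header-prediction word compared against the doff/flags/window word of incoming TCP headers, i.e. data offset in the top four bits, the ACK flag, and the scaled send window. A rough worked example with assumed values (32-byte header, scaled window 0x1234):
	/*
	 * tcp_header_len = 32      ->  32 << 26 == 0x80000000 (doff = 8)
	 * ntohl(TCP_FLAG_ACK)      ->  0x00100000
	 * snd_wnd >> snd_wscale    ->  0x1234 (example value)
	 *
	 * pred_flags = htonl(0x80000000 | 0x00100000 | 0x1234)
	 *            = htonl(0x80101234)
	 *
	 * A segment whose doff/flags/window word matches this value (plain
	 * ACK, expected header length, same window) stays on the fast path;
	 * anything else falls back to the slow path.
	 */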
^ permalink raw reply related [flat|nested] 37+ messages in thread
* Re: [PATCH v5 net-next 01/13] tcp: reorganize tcp_in_ack_event() and tcp_count_delivered()
2024-11-05 10:06 ` [PATCH v5 net-next 01/13] tcp: reorganize tcp_in_ack_event() and tcp_count_delivered() chia-yu.chang
@ 2024-11-07 9:31 ` Eric Dumazet
2024-11-07 19:32 ` Ilpo Järvinen
0 siblings, 1 reply; 37+ messages in thread
From: Eric Dumazet @ 2024-11-07 9:31 UTC (permalink / raw)
To: chia-yu.chang
Cc: netdev, dsahern, davem, dsahern, pabeni, joel.granados, kuba,
andrew+netdev, horms, pablo, kadlec, netfilter-devel, coreteam,
ij, ncardwell, koen.de_schepper, g.white, ingemar.s.johansson,
mirja.kuehlewind, cheshire, rs.ietf, Jason_Livingood, vidhi_goel
On Tue, Nov 5, 2024 at 11:07 AM <chia-yu.chang@nokia-bell-labs.com> wrote:
>
> From: Ilpo Järvinen <ij@kernel.org>
>
> - Move tcp_count_delivered() earlier and split tcp_count_delivered_ce()
> out of it
> - Move tcp_in_ack_event() later
> - While at it, remove the inline from tcp_in_ack_event() and let
> the compiler decide
>
> Accurate ECN's heuristics do not know if there is going
> to be ACE field based CE counter increase or not until after
> rtx queue has been processed. Only then the number of ACKed
> bytes/pkts is available. As CE or not affects presence of
> FLAG_ECE, that information for tcp_in_ack_event is not yet
> available in the old location of the call to tcp_in_ack_event().
>
> Signed-off-by: Ilpo Järvinen <ij@kernel.org>
> Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
> ---
> net/ipv4/tcp_input.c | 56 +++++++++++++++++++++++++-------------------
> 1 file changed, 32 insertions(+), 24 deletions(-)
>
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index 5bdf13ac26ef..fc52eab4fcc9 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -413,6 +413,20 @@ static bool tcp_ecn_rcv_ecn_echo(const struct tcp_sock *tp, const struct tcphdr
> return false;
> }
>
> +static void tcp_count_delivered_ce(struct tcp_sock *tp, u32 ecn_count)
> +{
> + tp->delivered_ce += ecn_count;
> +}
> +
> +/* Updates the delivered and delivered_ce counts */
> +static void tcp_count_delivered(struct tcp_sock *tp, u32 delivered,
> + bool ece_ack)
> +{
> + tp->delivered += delivered;
> + if (ece_ack)
> + tcp_count_delivered_ce(tp, delivered);
> +}
> +
> /* Buffer size and advertised window tuning.
> *
> * 1. Tuning sk->sk_sndbuf, when connection enters established state.
> @@ -1148,15 +1162,6 @@ void tcp_mark_skb_lost(struct sock *sk, struct sk_buff *skb)
> }
> }
>
> -/* Updates the delivered and delivered_ce counts */
> -static void tcp_count_delivered(struct tcp_sock *tp, u32 delivered,
> - bool ece_ack)
> -{
> - tp->delivered += delivered;
> - if (ece_ack)
> - tp->delivered_ce += delivered;
> -}
> -
> /* This procedure tags the retransmission queue when SACKs arrive.
> *
> * We have three tag bits: SACKED(S), RETRANS(R) and LOST(L).
> @@ -3856,12 +3861,23 @@ static void tcp_process_tlp_ack(struct sock *sk, u32 ack, int flag)
> }
> }
>
> -static inline void tcp_in_ack_event(struct sock *sk, u32 flags)
> +static void tcp_in_ack_event(struct sock *sk, int flag)
> {
> const struct inet_connection_sock *icsk = inet_csk(sk);
>
> - if (icsk->icsk_ca_ops->in_ack_event)
> - icsk->icsk_ca_ops->in_ack_event(sk, flags);
> + if (icsk->icsk_ca_ops->in_ack_event) {
> + u32 ack_ev_flags = 0;
> +
> + if (flag & FLAG_WIN_UPDATE)
> + ack_ev_flags |= CA_ACK_WIN_UPDATE;
> + if (flag & FLAG_SLOWPATH) {
> + ack_ev_flags = CA_ACK_SLOWPATH;
This is removing the potential CA_ACK_WIN_UPDATE, I would suggest :
ack_ev_flags |= CA_ACK_SLOWPATH;
> + if (flag & FLAG_ECE)
> + ack_ev_flags |= CA_ACK_ECE;
> + }
> +
> + icsk->icsk_ca_ops->in_ack_event(sk, ack_ev_flags);
> + }
> }
>
>
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v5 net-next 02/13] tcp: create FLAG_TS_PROGRESS
2024-11-05 10:06 ` [PATCH v5 net-next 02/13] tcp: create FLAG_TS_PROGRESS chia-yu.chang
@ 2024-11-07 9:34 ` Eric Dumazet
0 siblings, 0 replies; 37+ messages in thread
From: Eric Dumazet @ 2024-11-07 9:34 UTC (permalink / raw)
To: chia-yu.chang
Cc: netdev, dsahern, davem, dsahern, pabeni, joel.granados, kuba,
andrew+netdev, horms, pablo, kadlec, netfilter-devel, coreteam,
ij, ncardwell, koen.de_schepper, g.white, ingemar.s.johansson,
mirja.kuehlewind, cheshire, rs.ietf, Jason_Livingood, vidhi_goel
On Tue, Nov 5, 2024 at 11:07 AM <chia-yu.chang@nokia-bell-labs.com> wrote:
>
> From: Ilpo Järvinen <ij@kernel.org>
>
> Whenever timestamp advances, it declares progress which
> can be used by the other parts of the stack to decide that
> the ACK is the most recent one seen so far.
>
> AccECN will use this flag when deciding whether to use the
> ACK to update AccECN state or not.
>
> Signed-off-by: Ilpo Järvinen <ij@kernel.org>
> Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v5 net-next 03/13] tcp: use BIT() macro in include/net/tcp.h
2024-11-05 10:06 ` [PATCH v5 net-next 03/13] tcp: use BIT() macro in include/net/tcp.h chia-yu.chang
@ 2024-11-07 10:20 ` Eric Dumazet
0 siblings, 0 replies; 37+ messages in thread
From: Eric Dumazet @ 2024-11-07 10:20 UTC (permalink / raw)
To: chia-yu.chang
Cc: netdev, dsahern, davem, dsahern, pabeni, joel.granados, kuba,
andrew+netdev, horms, pablo, kadlec, netfilter-devel, coreteam,
ij, ncardwell, koen.de_schepper, g.white, ingemar.s.johansson,
mirja.kuehlewind, cheshire, rs.ietf, Jason_Livingood, vidhi_goel
On Tue, Nov 5, 2024 at 11:07 AM <chia-yu.chang@nokia-bell-labs.com> wrote:
>
> From: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
>
> Use BIT() macro for TCP flags field and TCP congestion control
> flags that will be used by the congestion control algorithm.
>
> No functional changes.
>
> Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
> Reviewed-by: Ilpo Järvinen <ij@kernel.org>
> ---
> include/net/tcp.h | 21 +++++++++++----------
> 1 file changed, 11 insertions(+), 10 deletions(-)
Reviewed-by: Eric Dumazet <edumazet@google.com>
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v5 net-next 04/13] tcp: extend TCP flags to allow AE bit/ACE field
2024-11-05 10:06 ` [PATCH v5 net-next 04/13] tcp: extend TCP flags to allow AE bit/ACE field chia-yu.chang
@ 2024-11-07 10:23 ` Eric Dumazet
0 siblings, 0 replies; 37+ messages in thread
From: Eric Dumazet @ 2024-11-07 10:23 UTC (permalink / raw)
To: chia-yu.chang
Cc: netdev, dsahern, davem, dsahern, pabeni, joel.granados, kuba,
andrew+netdev, horms, pablo, kadlec, netfilter-devel, coreteam,
ij, ncardwell, koen.de_schepper, g.white, ingemar.s.johansson,
mirja.kuehlewind, cheshire, rs.ietf, Jason_Livingood, vidhi_goel
On Tue, Nov 5, 2024 at 11:07 AM <chia-yu.chang@nokia-bell-labs.com> wrote:
>
> From: Ilpo Järvinen <ij@kernel.org>
>
> With AccECN, there's one additional TCP flag to be used (AE)
> and ACE field that overloads the definition of AE, CWR, and
> ECE flags. As tcp_flags was previously only 1 byte, the
> byte-order stuff needs to be added to its handling.
>
> Signed-off-by: Ilpo Järvinen <ij@kernel.org>
> Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
> ---
Reviewed-by: Eric Dumazet <edumazet@google.com>
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v5 net-next 05/13] tcp: reorganize SYN ECN code
2024-11-05 10:06 ` [PATCH v5 net-next 05/13] tcp: reorganize SYN ECN code chia-yu.chang
@ 2024-11-07 10:25 ` Eric Dumazet
0 siblings, 0 replies; 37+ messages in thread
From: Eric Dumazet @ 2024-11-07 10:25 UTC (permalink / raw)
To: chia-yu.chang
Cc: netdev, dsahern, davem, dsahern, pabeni, joel.granados, kuba,
andrew+netdev, horms, pablo, kadlec, netfilter-devel, coreteam,
ij, ncardwell, koen.de_schepper, g.white, ingemar.s.johansson,
mirja.kuehlewind, cheshire, rs.ietf, Jason_Livingood, vidhi_goel
On Tue, Nov 5, 2024 at 11:07 AM <chia-yu.chang@nokia-bell-labs.com> wrote:
>
> From: Ilpo Järvinen <ij@kernel.org>
>
> Prepare for AccECN that needs to have access here on IP ECN
> field value which is only available after INET_ECN_xmit().
>
> No functional changes.
>
> Signed-off-by: Ilpo Järvinen <ij@kernel.org>
> Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v5 net-next 06/13] tcp: rework {__,}tcp_ecn_check_ce() -> tcp_data_ecn_check()
2024-11-05 10:06 ` [PATCH v5 net-next 06/13] tcp: rework {__,}tcp_ecn_check_ce() -> tcp_data_ecn_check() chia-yu.chang
@ 2024-11-07 10:30 ` Eric Dumazet
0 siblings, 0 replies; 37+ messages in thread
From: Eric Dumazet @ 2024-11-07 10:30 UTC (permalink / raw)
To: chia-yu.chang
Cc: netdev, dsahern, davem, dsahern, pabeni, joel.granados, kuba,
andrew+netdev, horms, pablo, kadlec, netfilter-devel, coreteam,
ij, ncardwell, koen.de_schepper, g.white, ingemar.s.johansson,
mirja.kuehlewind, cheshire, rs.ietf, Jason_Livingood, vidhi_goel
On Tue, Nov 5, 2024 at 11:07 AM <chia-yu.chang@nokia-bell-labs.com> wrote:
>
> From: Ilpo Järvinen <ij@kernel.org>
>
> Rename tcp_ecn_check_ce to tcp_data_ecn_check as it is
> called only for data segments, not for ACKs (with AccECN,
> also ACKs may get ECN bits).
>
> The extra "layer" in tcp_ecn_check_ce() function just
> checks for ECN being enabled, that can be moved into
> tcp_ecn_field_check rather than having the __ variant.
>
> No functional changes.
>
> Signed-off-by: Ilpo Järvinen <ij@kernel.org>
> Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v5 net-next 07/13] tcp: helpers for ECN mode handling
2024-11-05 10:06 ` [PATCH v5 net-next 07/13] tcp: helpers for ECN mode handling chia-yu.chang
@ 2024-11-07 12:26 ` Eric Dumazet
0 siblings, 0 replies; 37+ messages in thread
From: Eric Dumazet @ 2024-11-07 12:26 UTC (permalink / raw)
To: chia-yu.chang
Cc: netdev, dsahern, davem, dsahern, pabeni, joel.granados, kuba,
andrew+netdev, horms, pablo, kadlec, netfilter-devel, coreteam,
ij, ncardwell, koen.de_schepper, g.white, ingemar.s.johansson,
mirja.kuehlewind, cheshire, rs.ietf, Jason_Livingood, vidhi_goel
On Tue, Nov 5, 2024 at 11:07 AM <chia-yu.chang@nokia-bell-labs.com> wrote:
>
> From: Ilpo Järvinen <ij@kernel.org>
>
> Create helpers for TCP ECN modes. No functional changes.
>
> Signed-off-by: Ilpo Järvinen <ij@kernel.org>
> Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
> ---
Reviewed-by: Eric Dumazet <edumazet@google.com>
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v5 net-next 09/13] gro: prevent ACE field corruption & better AccECN handling
2024-11-05 10:06 ` [PATCH v5 net-next 09/13] gro: prevent ACE field corruption & better AccECN handling chia-yu.chang
@ 2024-11-07 12:40 ` Eric Dumazet
2024-11-07 19:28 ` Ilpo Järvinen
0 siblings, 1 reply; 37+ messages in thread
From: Eric Dumazet @ 2024-11-07 12:40 UTC (permalink / raw)
To: chia-yu.chang
Cc: netdev, dsahern, davem, dsahern, pabeni, joel.granados, kuba,
andrew+netdev, horms, pablo, kadlec, netfilter-devel, coreteam,
ij, ncardwell, koen.de_schepper, g.white, ingemar.s.johansson,
mirja.kuehlewind, cheshire, rs.ietf, Jason_Livingood, vidhi_goel
On Tue, Nov 5, 2024 at 11:07 AM <chia-yu.chang@nokia-bell-labs.com> wrote:
>
> From: Ilpo Järvinen <ij@kernel.org>
>
> There are important differences in how the CWR field behaves
> in RFC3168 and AccECN. With AccECN, CWR flag is part of the
> ACE counter and its changes are important so adjust the flags
> changed mask accordingly.
>
> Also, if CWR is there, set the Accurate ECN GSO flag to avoid
> corrupting CWR flag somewhere.
>
> Signed-off-by: Ilpo Järvinen <ij@kernel.org>
> Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
> ---
> net/ipv4/tcp_offload.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/net/ipv4/tcp_offload.c b/net/ipv4/tcp_offload.c
> index 0b05f30e9e5f..f59762d88c38 100644
> --- a/net/ipv4/tcp_offload.c
> +++ b/net/ipv4/tcp_offload.c
> @@ -329,7 +329,7 @@ struct sk_buff *tcp_gro_receive(struct list_head *head, struct sk_buff *skb,
> th2 = tcp_hdr(p);
> flush = (__force int)(flags & TCP_FLAG_CWR);
> flush |= (__force int)((flags ^ tcp_flag_word(th2)) &
> - ~(TCP_FLAG_CWR | TCP_FLAG_FIN | TCP_FLAG_PSH));
> + ~(TCP_FLAG_FIN | TCP_FLAG_PSH));
> flush |= (__force int)(th->ack_seq ^ th2->ack_seq);
> for (i = sizeof(*th); i < thlen; i += 4)
> flush |= *(u32 *)((u8 *)th + i) ^
> @@ -405,7 +405,7 @@ void tcp_gro_complete(struct sk_buff *skb)
> shinfo->gso_segs = NAPI_GRO_CB(skb)->count;
>
> if (th->cwr)
> - shinfo->gso_type |= SKB_GSO_TCP_ECN;
> + shinfo->gso_type |= SKB_GSO_TCP_ACCECN;
> }
> EXPORT_SYMBOL(tcp_gro_complete);
>
I do not really understand this patch. How a GRO engine can know which
ECN variant the peers are using ?
SKB_GSO_TCP_ECN is also used from other points, what is your plan ?
git grep -n SKB_GSO_TCP_ECN
drivers/net/ethernet/hisilicon/hns3/hns3_enet.c:3888:
skb_shinfo(skb)->gso_type |= SKB_GSO_TCP_ECN;
drivers/net/ethernet/mellanox/mlx5/core/en_rx.c:1291:
skb_shinfo(skb)->gso_type |= SKB_GSO_TCP_ECN;
drivers/net/ethernet/mellanox/mlx5/core/en_rx.c:1312:
skb_shinfo(skb)->gso_type |= SKB_GSO_TCP_ECN;
include/linux/netdevice.h:5061: BUILD_BUG_ON(SKB_GSO_TCP_ECN !=
(NETIF_F_TSO_ECN >> NETIF_F_GSO_SHIFT));
include/linux/skbuff.h:664: SKB_GSO_TCP_ECN = 1 << 2,
include/linux/virtio_net.h:88: gso_type |= SKB_GSO_TCP_ECN;
include/linux/virtio_net.h:161: switch (gso_type & ~SKB_GSO_TCP_ECN) {
include/linux/virtio_net.h:226: if (sinfo->gso_type & SKB_GSO_TCP_ECN)
net/ipv4/tcp_offload.c:404: shinfo->gso_type |= SKB_GSO_TCP_ECN;
net/ipv4/tcp_output.c:389:
skb_shinfo(skb)->gso_type |= SKB_GSO_TCP_ECN;
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v5 net-next 08/13] gso: AccECN support
2024-11-05 10:06 ` [PATCH v5 net-next 08/13] gso: AccECN support chia-yu.chang
@ 2024-11-07 12:41 ` Eric Dumazet
2024-11-09 10:09 ` Ilpo Järvinen
1 sibling, 0 replies; 37+ messages in thread
From: Eric Dumazet @ 2024-11-07 12:41 UTC (permalink / raw)
To: chia-yu.chang
Cc: netdev, dsahern, davem, dsahern, pabeni, joel.granados, kuba,
andrew+netdev, horms, pablo, kadlec, netfilter-devel, coreteam,
ij, ncardwell, koen.de_schepper, g.white, ingemar.s.johansson,
mirja.kuehlewind, cheshire, rs.ietf, Jason_Livingood, vidhi_goel
On Tue, Nov 5, 2024 at 11:07 AM <chia-yu.chang@nokia-bell-labs.com> wrote:
>
> From: Ilpo Järvinen <ij@kernel.org>
>
> Handling the CWR flag differs between RFC 3168 ECN and AccECN.
> With RFC 3168 ECN aware TSO (NETIF_F_TSO_ECN) CWR flag is cleared
> starting from 2nd segment which is incompatible with how AccECN handles
> the CWR flag. Such super-segments are indicated by SKB_GSO_TCP_ECN.
> With AccECN, CWR flag (or more accurately, the ACE field that also
> includes ECE & AE flags) changes only when new packet(s) with CE
> mark arrives so the flag should not be changed within a super-skb.
> The new skb/feature flags are necessary to prevent such TSO engines
> corrupting AccECN ACE counters by clearing the CWR flag (if the
> CWR handling feature cannot be turned off).
>
> If NIC is completely unaware of RFC3168 ECN (doesn't support
> NETIF_F_TSO_ECN) or its TSO engine can be set to not touch CWR flag
> despite supporting also NETIF_F_TSO_ECN, TSO could be safely used
> with AccECN on such NIC. This should be evaluated per NIC basis
> (not done in this patch series for any NICs).
>
> For the cases, where TSO cannot keep its hands off the CWR flag,
> a GSO fallback is provided by this patch.
>
> Signed-off-by: Ilpo Järvinen <ij@kernel.org>
> Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v5 net-next 10/13] tcp: AccECN support to tcp_add_backlog
2024-11-05 10:06 ` [PATCH v5 net-next 10/13] tcp: AccECN support to tcp_add_backlog chia-yu.chang
@ 2024-11-07 12:42 ` Eric Dumazet
0 siblings, 0 replies; 37+ messages in thread
From: Eric Dumazet @ 2024-11-07 12:42 UTC (permalink / raw)
To: chia-yu.chang
Cc: netdev, dsahern, davem, dsahern, pabeni, joel.granados, kuba,
andrew+netdev, horms, pablo, kadlec, netfilter-devel, coreteam,
ij, ncardwell, koen.de_schepper, g.white, ingemar.s.johansson,
mirja.kuehlewind, cheshire, rs.ietf, Jason_Livingood, vidhi_goel
On Tue, Nov 5, 2024 at 11:07 AM <chia-yu.chang@nokia-bell-labs.com> wrote:
>
> From: Ilpo Järvinen <ij@kernel.org>
>
> AE flag needs to be preserved for AccECN.
>
> Signed-off-by: Ilpo Järvinen <ij@kernel.org>
> Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v5 net-next 11/13] tcp: allow ECN bits in TOS/traffic class
2024-11-05 10:06 ` [PATCH v5 net-next 11/13] tcp: allow ECN bits in TOS/traffic class chia-yu.chang
@ 2024-11-07 12:55 ` Eric Dumazet
0 siblings, 0 replies; 37+ messages in thread
From: Eric Dumazet @ 2024-11-07 12:55 UTC (permalink / raw)
To: chia-yu.chang
Cc: netdev, dsahern, davem, dsahern, pabeni, joel.granados, kuba,
andrew+netdev, horms, pablo, kadlec, netfilter-devel, coreteam,
ij, ncardwell, koen.de_schepper, g.white, ingemar.s.johansson,
mirja.kuehlewind, cheshire, rs.ietf, Jason_Livingood, vidhi_goel
On Tue, Nov 5, 2024 at 11:07 AM <chia-yu.chang@nokia-bell-labs.com> wrote:
>
> From: Ilpo Järvinen <ij@kernel.org>
>
> AccECN connection's last ACK cannot retain ECT(1) as the bits
> are always cleared causing the packet to switch into another
> service queue.
>
> This effectively adds a finer-grained filtering for ECN bits
> so that acceptable TW ACKs can retain the bits.
Too cryptic changelog.
Please add more explanations, because I could not really understand
the intent of this patch.
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v5 net-next 12/13] tcp: Pass flags to __tcp_send_ack
2024-11-05 10:06 ` [PATCH v5 net-next 12/13] tcp: Pass flags to __tcp_send_ack chia-yu.chang
@ 2024-11-07 12:58 ` Eric Dumazet
0 siblings, 0 replies; 37+ messages in thread
From: Eric Dumazet @ 2024-11-07 12:58 UTC (permalink / raw)
To: chia-yu.chang
Cc: netdev, dsahern, davem, dsahern, pabeni, joel.granados, kuba,
andrew+netdev, horms, pablo, kadlec, netfilter-devel, coreteam,
ij, ncardwell, koen.de_schepper, g.white, ingemar.s.johansson,
mirja.kuehlewind, cheshire, rs.ietf, Jason_Livingood, vidhi_goel
On Tue, Nov 5, 2024 at 11:07 AM <chia-yu.chang@nokia-bell-labs.com> wrote:
>
> From: Ilpo Järvinen <ij@kernel.org>
>
> Accurate ECN needs to send custom flags to handle IP-ECN
> field reflection during handshake.
>
> Signed-off-by: Ilpo Järvinen <ij@kernel.org>
> Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v5 net-next 13/13] tcp: fast path functions later
2024-11-05 10:06 ` [PATCH v5 net-next 13/13] tcp: fast path functions later chia-yu.chang
@ 2024-11-07 13:00 ` Eric Dumazet
2024-11-07 20:38 ` Chia-Yu Chang (Nokia)
0 siblings, 1 reply; 37+ messages in thread
From: Eric Dumazet @ 2024-11-07 13:00 UTC (permalink / raw)
To: chia-yu.chang
Cc: netdev, dsahern, davem, dsahern, pabeni, joel.granados, kuba,
andrew+netdev, horms, pablo, kadlec, netfilter-devel, coreteam,
ij, ncardwell, koen.de_schepper, g.white, ingemar.s.johansson,
mirja.kuehlewind, cheshire, rs.ietf, Jason_Livingood, vidhi_goel
On Tue, Nov 5, 2024 at 11:07 AM <chia-yu.chang@nokia-bell-labs.com> wrote:
>
> From: Ilpo Järvinen <ij@kernel.org>
>
> The following patch will use tcp_ecn_mode_accecn(),
> TCP_ACCECN_CEP_INIT_OFFSET, TCP_ACCECN_CEP_ACE_MASK in
> __tcp_fast_path_on() to make new flag for AccECN.
>
> No functional changes.
>
> Signed-off-by: Ilpo Järvinen <ij@kernel.org>
> Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
I guess this patch should not land in this series, but in the following one.
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v5 net-next 09/13] gro: prevent ACE field corruption & better AccECN handling
2024-11-07 12:40 ` Eric Dumazet
@ 2024-11-07 19:28 ` Ilpo Järvinen
2024-11-12 2:09 ` Chia-Yu Chang (Nokia)
0 siblings, 1 reply; 37+ messages in thread
From: Ilpo Järvinen @ 2024-11-07 19:28 UTC (permalink / raw)
To: Eric Dumazet
Cc: chia-yu.chang, netdev, dsahern, davem, dsahern, pabeni,
joel.granados, kuba, andrew+netdev, horms, pablo, kadlec,
netfilter-devel, coreteam, ncardwell, koen.de_schepper, g.white,
ingemar.s.johansson, mirja.kuehlewind, cheshire, rs.ietf,
Jason_Livingood, vidhi_goel
On Thu, 7 Nov 2024, Eric Dumazet wrote:
> On Tue, Nov 5, 2024 at 11:07 AM <chia-yu.chang@nokia-bell-labs.com> wrote:
> >
> > From: Ilpo Järvinen <ij@kernel.org>
> >
> > There are important differences in how the CWR field behaves
> > in RFC3168 and AccECN. With AccECN, CWR flag is part of the
> > ACE counter and its changes are important so adjust the flags
> > changed mask accordingly.
> >
> > Also, if CWR is there, set the Accurate ECN GSO flag to avoid
> > corrupting CWR flag somewhere.
> >
> > Signed-off-by: Ilpo Järvinen <ij@kernel.org>
> > Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
> > ---
> > net/ipv4/tcp_offload.c | 4 ++--
> > 1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/net/ipv4/tcp_offload.c b/net/ipv4/tcp_offload.c
> > index 0b05f30e9e5f..f59762d88c38 100644
> > --- a/net/ipv4/tcp_offload.c
> > +++ b/net/ipv4/tcp_offload.c
> > @@ -329,7 +329,7 @@ struct sk_buff *tcp_gro_receive(struct list_head *head, struct sk_buff *skb,
> > th2 = tcp_hdr(p);
> > flush = (__force int)(flags & TCP_FLAG_CWR);
> > flush |= (__force int)((flags ^ tcp_flag_word(th2)) &
> > - ~(TCP_FLAG_CWR | TCP_FLAG_FIN | TCP_FLAG_PSH));
> > + ~(TCP_FLAG_FIN | TCP_FLAG_PSH));
> > flush |= (__force int)(th->ack_seq ^ th2->ack_seq);
> > for (i = sizeof(*th); i < thlen; i += 4)
> > flush |= *(u32 *)((u8 *)th + i) ^
> > @@ -405,7 +405,7 @@ void tcp_gro_complete(struct sk_buff *skb)
> > shinfo->gso_segs = NAPI_GRO_CB(skb)->count;
> >
> > if (th->cwr)
> > - shinfo->gso_type |= SKB_GSO_TCP_ECN;
> > + shinfo->gso_type |= SKB_GSO_TCP_ACCECN;
> > }
> > EXPORT_SYMBOL(tcp_gro_complete);
> >
>
> I do not really understand this patch. How a GRO engine can know which
> ECN variant the peers are using ?
Hi Eric,
Thanks for taking a look.
GRO doesn't know. Setting SKB_GSO_TCP_ECN without knowing can result in a
header change that corrupts the ACE field. Thus, GRO has to assume the
RFC3168 CWR offloading trick cannot be used anymore (unless it tracks
the connection and knows the bits are used for RFC3168, which is something
nobody is going to do).
The main point of SKB_GSO_TCP_ACCECN is to prevent SKB_GSO_TCP_ECN or
NETIF_F_TSO_ECN offloading from being used for the skb, as that would
corrupt the ACE field value.
SKB_GSO_TCP_ACCECN doesn't allow the CWR bit to change within a super-skb;
instead, the same CWR flag should be repeated for all segments. In a sense,
it's simpler than RFC3168 offloading.
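For readers less familiar with the draft: AE, CWR and ECE together form the 3-bit ACE field, which carries the low bits of the receiver's CE-packet counter (initial value 0b101 per the draft). A rough illustration with a hypothetical helper name:
	/* Hypothetical helper, illustration only: the ACE value
	 * (AE << 2 | CWR << 1 | ECE) advertised for a given CE packet count.
	 */
	static inline u8 accecn_ace_field(u32 delivered_ce)
	{
		return (delivered_ce + 5) & 0x7;   /* 5 == draft's initial offset */
	}

	/* Clearing CWR (the middle bit) in trailing GRO/GSO segments would
	 * therefore silently change the counter value the peer decodes.
	 */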
> SKB_GSO_TCP_ECN is also used from other points, what is your plan ?
>
> git grep -n SKB_GSO_TCP_ECN
> drivers/net/ethernet/hisilicon/hns3/hns3_enet.c:3888:
> skb_shinfo(skb)->gso_type |= SKB_GSO_TCP_ECN;
> drivers/net/ethernet/mellanox/mlx5/core/en_rx.c:1291:
> skb_shinfo(skb)->gso_type |= SKB_GSO_TCP_ECN;
> drivers/net/ethernet/mellanox/mlx5/core/en_rx.c:1312:
> skb_shinfo(skb)->gso_type |= SKB_GSO_TCP_ECN;
These looked like they should just be changed to set SKB_GSO_TCP_ACCECN
instead.
I don't recall anymore why I didn't change those when I made this patch a
long time ago; perhaps it was just an oversight, or things have changed
somehow since then.
> include/linux/netdevice.h:5061: BUILD_BUG_ON(SKB_GSO_TCP_ECN !=
> (NETIF_F_TSO_ECN >> NETIF_F_GSO_SHIFT));
> include/linux/skbuff.h:664: SKB_GSO_TCP_ECN = 1 << 2,
Not relevant.
> include/linux/virtio_net.h:88: gso_type |= SKB_GSO_TCP_ECN;
> include/linux/virtio_net.h:161: switch (gso_type & ~SKB_GSO_TCP_ECN) {
> include/linux/virtio_net.h:226: if (sinfo->gso_type & SKB_GSO_TCP_ECN)
These need to be looked at further to see what's going on, as UAPI is also
involved here.
> net/ipv4/tcp_offload.c:404: shinfo->gso_type |= SKB_GSO_TCP_ECN;
This was changed above. :-)
> net/ipv4/tcp_output.c:389: skb_shinfo(skb)->gso_type |= SKB_GSO_TCP_ECN;
I'm pretty sure this relates to the sending side, which will be taken care
of by a later patch in the full series (not among these preparatory patches).
FYI, these TSO/GSO/GRO changes are the part I was most unsure about getting
right when I was working on this series a few years back, so please keep
that in mind while reviewing so that my lack of knowledge doesn't end up
breaking something. :-)
--
i.
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v5 net-next 01/13] tcp: reorganize tcp_in_ack_event() and tcp_count_delivered()
2024-11-07 9:31 ` Eric Dumazet
@ 2024-11-07 19:32 ` Ilpo Järvinen
0 siblings, 0 replies; 37+ messages in thread
From: Ilpo Järvinen @ 2024-11-07 19:32 UTC (permalink / raw)
To: Eric Dumazet
Cc: chia-yu.chang, netdev, dsahern, davem, dsahern, pabeni,
joel.granados, kuba, andrew+netdev, horms, pablo, kadlec,
netfilter-devel, coreteam, ncardwell, koen.de_schepper, g.white,
ingemar.s.johansson, mirja.kuehlewind, cheshire, rs.ietf,
Jason_Livingood, vidhi_goel
On Thu, 7 Nov 2024, Eric Dumazet wrote:
> On Tue, Nov 5, 2024 at 11:07 AM <chia-yu.chang@nokia-bell-labs.com> wrote:
> >
> > From: Ilpo Järvinen <ij@kernel.org>
> >
> > - Move tcp_count_delivered() earlier and split tcp_count_delivered_ce()
> > out of it
> > - Move tcp_in_ack_event() later
> > - While at it, remove the inline from tcp_in_ack_event() and let
> > the compiler decide
> >
> > Accurate ECN's heuristics do not know if there is going
> > to be ACE field based CE counter increase or not until after
> > rtx queue has been processed. Only then the number of ACKed
> > bytes/pkts is available. As CE or not affects presence of
> > FLAG_ECE, that information for tcp_in_ack_event is not yet
> > available in the old location of the call to tcp_in_ack_event().
> >
> > Signed-off-by: Ilpo Järvinen <ij@kernel.org>
> > Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
> > ---
> > net/ipv4/tcp_input.c | 56 +++++++++++++++++++++++++-------------------
> > 1 file changed, 32 insertions(+), 24 deletions(-)
> >
> > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> > index 5bdf13ac26ef..fc52eab4fcc9 100644
> > --- a/net/ipv4/tcp_input.c
> > +++ b/net/ipv4/tcp_input.c
> > @@ -3856,12 +3861,23 @@ static void tcp_process_tlp_ack(struct sock *sk, u32 ack, int flag)
> > }
> > }
> >
> > -static inline void tcp_in_ack_event(struct sock *sk, u32 flags)
> > +static void tcp_in_ack_event(struct sock *sk, int flag)
> > {
> > const struct inet_connection_sock *icsk = inet_csk(sk);
> >
> > - if (icsk->icsk_ca_ops->in_ack_event)
> > - icsk->icsk_ca_ops->in_ack_event(sk, flags);
> > + if (icsk->icsk_ca_ops->in_ack_event) {
> > + u32 ack_ev_flags = 0;
> > +
> > + if (flag & FLAG_WIN_UPDATE)
> > + ack_ev_flags |= CA_ACK_WIN_UPDATE;
> > + if (flag & FLAG_SLOWPATH) {
> > + ack_ev_flags = CA_ACK_SLOWPATH;
>
> This is removing the potential CA_ACK_WIN_UPDATE, I would suggest :
>
> ack_ev_flags |= CA_ACK_SLOWPATH;
Yes, a good catch.
--
i.
> > + if (flag & FLAG_ECE)
> > + ack_ev_flags |= CA_ACK_ECE;
> > + }
> > +
> > + icsk->icsk_ca_ops->in_ack_event(sk, ack_ev_flags);
> > + }
> > }
> >
> >
>
^ permalink raw reply [flat|nested] 37+ messages in thread
* RE: [PATCH v5 net-next 13/13] tcp: fast path functions later
2024-11-07 13:00 ` Eric Dumazet
@ 2024-11-07 20:38 ` Chia-Yu Chang (Nokia)
0 siblings, 0 replies; 37+ messages in thread
From: Chia-Yu Chang (Nokia) @ 2024-11-07 20:38 UTC (permalink / raw)
To: Eric Dumazet
Cc: netdev@vger.kernel.org, dsahern@gmail.com, davem@davemloft.net,
dsahern@kernel.org, pabeni@redhat.com, joel.granados@kernel.org,
kuba@kernel.org, andrew+netdev@lunn.ch, horms@kernel.org,
pablo@netfilter.org, kadlec@netfilter.org,
netfilter-devel@vger.kernel.org, coreteam@netfilter.org,
ij@kernel.org, ncardwell@google.com, Koen De Schepper (Nokia),
g.white@cablelabs.com, ingemar.s.johansson@ericsson.com,
mirja.kuehlewind@ericsson.com, cheshire@apple.com, rs.ietf@gmx.at,
Jason_Livingood@comcast.com, vidhi_goel@apple.com
>From: Eric Dumazet <edumazet@google.com>
>Sent: Thursday, November 7, 2024 2:01 PM
>To: Chia-Yu Chang (Nokia) <chia-yu.chang@nokia-bell-labs.com>
>Cc: netdev@vger.kernel.org; dsahern@gmail.com; davem@davemloft.net; dsahern@kernel.org; pabeni@redhat.com; joel.granados@kernel.org; kuba@kernel.org; andrew+netdev@lunn.ch; horms@kernel.org; pablo@netfilter.org; kadlec@netfilter.org; netfilter-devel@vger.kernel.org; coreteam@netfilter.org; ij@kernel.org; ncardwell@google.com; Koen De Schepper (Nokia) <koen.de_schepper@nokia-bell-labs.com>; g.white@cablelabs.com; ingemar.s.johansson@ericsson.com; mirja.kuehlewind@ericsson.com; cheshire@apple.com; rs.ietf@gmx.at; Jason_Livingood@comcast.com; vidhi_goel@apple.com
>Subject: Re: [PATCH v5 net-next 13/13] tcp: fast path functions later
>
>On Tue, Nov 5, 2024 at 11:07 AM <chia-yu.chang@nokia-bell-labs.com> wrote:
>>
>> From: Ilpo Järvinen <ij@kernel.org>
>>
>> The following patch will use tcp_ecn_mode_accecn(),
>> TCP_ACCECN_CEP_INIT_OFFSET, TCP_ACCECN_CEP_ACE_MASK in
>> __tcp_fast_path_on() to make new flag for AccECN.
>>
>> No functional changes.
>>
>> Signed-off-by: Ilpo Järvinen <ij@kernel.org>
>> Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
>
>I guess this patch should not land in this series, but in the following one.
Hi Eric,
Indeed, I will move this patch to the following AccECN series.
Best regards,
Chia-Yu
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v5 net-next 08/13] gso: AccECN support
2024-11-05 10:06 ` [PATCH v5 net-next 08/13] gso: AccECN support chia-yu.chang
2024-11-07 12:41 ` Eric Dumazet
@ 2024-11-09 10:09 ` Ilpo Järvinen
2024-11-09 19:42 ` Neal Cardwell
1 sibling, 1 reply; 37+ messages in thread
From: Ilpo Järvinen @ 2024-11-09 10:09 UTC (permalink / raw)
To: Chia-Yu Chang
Cc: netdev, dsahern, davem, edumazet, dsahern, pabeni, joel.granados,
kuba, andrew+netdev, horms, pablo, kadlec, netfilter-devel,
coreteam, ncardwell, koen.de_schepper, g.white,
ingemar.s.johansson, mirja.kuehlewind, cheshire, rs.ietf,
Jason_Livingood, vidhi_goel
On Tue, 5 Nov 2024, chia-yu.chang@nokia-bell-labs.com wrote:
> From: Ilpo Järvinen <ij@kernel.org>
>
> Handling the CWR flag differs between RFC 3168 ECN and AccECN.
> With RFC 3168 ECN aware TSO (NETIF_F_TSO_ECN) CWR flag is cleared
> starting from 2nd segment which is incompatible with how AccECN handles
> the CWR flag. Such super-segments are indicated by SKB_GSO_TCP_ECN.
> With AccECN, CWR flag (or more accurately, the ACE field that also
> includes ECE & AE flags) changes only when new packet(s) with CE
> mark arrives so the flag should not be changed within a super-skb.
> The new skb/feature flags are necessary to prevent such TSO engines
> corrupting AccECN ACE counters by clearing the CWR flag (if the
> CWR handling feature cannot be turned off).
>
> If NIC is completely unaware of RFC3168 ECN (doesn't support
> NETIF_F_TSO_ECN) or its TSO engine can be set to not touch CWR flag
> despite supporting also NETIF_F_TSO_ECN, TSO could be safely used
> with AccECN on such NIC. This should be evaluated per NIC basis
> (not done in this patch series for any NICs).
>
> For the cases, where TSO cannot keep its hands off the CWR flag,
> a GSO fallback is provided by this patch.
>
> Signed-off-by: Ilpo Järvinen <ij@kernel.org>
> Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
> ---
> include/linux/netdev_features.h | 8 +++++---
> include/linux/netdevice.h | 2 ++
> include/linux/skbuff.h | 2 ++
> net/ethtool/common.c | 1 +
> net/ipv4/tcp_offload.c | 6 +++++-
> 5 files changed, 15 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/netdev_features.h b/include/linux/netdev_features.h
> index 66e7d26b70a4..c59db449bcf0 100644
> --- a/include/linux/netdev_features.h
> +++ b/include/linux/netdev_features.h
> @@ -53,12 +53,12 @@ enum {
> NETIF_F_GSO_UDP_BIT, /* ... UFO, deprecated except tuntap */
> NETIF_F_GSO_UDP_L4_BIT, /* ... UDP payload GSO (not UFO) */
> NETIF_F_GSO_FRAGLIST_BIT, /* ... Fraglist GSO */
> + NETIF_F_GSO_ACCECN_BIT, /* TCP AccECN w/ TSO (no clear CWR) */
> /**/NETIF_F_GSO_LAST = /* last bit, see GSO_MASK */
> - NETIF_F_GSO_FRAGLIST_BIT,
> + NETIF_F_GSO_ACCECN_BIT,
>
> NETIF_F_FCOE_CRC_BIT, /* FCoE CRC32 */
> NETIF_F_SCTP_CRC_BIT, /* SCTP checksum offload */
> - __UNUSED_NETIF_F_37,
> NETIF_F_NTUPLE_BIT, /* N-tuple filters supported */
> NETIF_F_RXHASH_BIT, /* Receive hashing offload */
> NETIF_F_RXCSUM_BIT, /* Receive checksumming offload */
> @@ -128,6 +128,7 @@ enum {
> #define NETIF_F_SG __NETIF_F(SG)
> #define NETIF_F_TSO6 __NETIF_F(TSO6)
> #define NETIF_F_TSO_ECN __NETIF_F(TSO_ECN)
> +#define NETIF_F_GSO_ACCECN __NETIF_F(GSO_ACCECN)
> #define NETIF_F_TSO __NETIF_F(TSO)
> #define NETIF_F_VLAN_CHALLENGED __NETIF_F(VLAN_CHALLENGED)
> #define NETIF_F_RXFCS __NETIF_F(RXFCS)
> @@ -210,7 +211,8 @@ static inline int find_next_netdev_feature(u64 feature, unsigned long start)
> NETIF_F_TSO_ECN | NETIF_F_TSO_MANGLEID)
>
> /* List of features with software fallbacks. */
> -#define NETIF_F_GSO_SOFTWARE (NETIF_F_ALL_TSO | NETIF_F_GSO_SCTP | \
> +#define NETIF_F_GSO_SOFTWARE (NETIF_F_ALL_TSO | \
> + NETIF_F_GSO_ACCECN | NETIF_F_GSO_SCTP | \
> NETIF_F_GSO_UDP_L4 | NETIF_F_GSO_FRAGLIST)
>
> /*
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 6e0f8e4aeb14..480d915b3bdb 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -5076,6 +5076,8 @@ static inline bool net_gso_ok(netdev_features_t features, int gso_type)
> BUILD_BUG_ON(SKB_GSO_UDP != (NETIF_F_GSO_UDP >> NETIF_F_GSO_SHIFT));
> BUILD_BUG_ON(SKB_GSO_UDP_L4 != (NETIF_F_GSO_UDP_L4 >> NETIF_F_GSO_SHIFT));
> BUILD_BUG_ON(SKB_GSO_FRAGLIST != (NETIF_F_GSO_FRAGLIST >> NETIF_F_GSO_SHIFT));
> + BUILD_BUG_ON(SKB_GSO_TCP_ACCECN !=
> + (NETIF_F_GSO_ACCECN >> NETIF_F_GSO_SHIFT));
>
> return (features & feature) == feature;
> }
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index 48f1e0fa2a13..530cb325fb86 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -694,6 +694,8 @@ enum {
> SKB_GSO_UDP_L4 = 1 << 17,
>
> SKB_GSO_FRAGLIST = 1 << 18,
> +
> + SKB_GSO_TCP_ACCECN = 1 << 19,
> };
>
> #if BITS_PER_LONG > 32
> diff --git a/net/ethtool/common.c b/net/ethtool/common.c
> index 0d62363dbd9d..5c3ba2dfaa74 100644
> --- a/net/ethtool/common.c
> +++ b/net/ethtool/common.c
> @@ -32,6 +32,7 @@ const char netdev_features_strings[NETDEV_FEATURE_COUNT][ETH_GSTRING_LEN] = {
> [NETIF_F_TSO_BIT] = "tx-tcp-segmentation",
> [NETIF_F_GSO_ROBUST_BIT] = "tx-gso-robust",
> [NETIF_F_TSO_ECN_BIT] = "tx-tcp-ecn-segmentation",
> + [NETIF_F_GSO_ACCECN_BIT] = "tx-tcp-accecn-segmentation",
> [NETIF_F_TSO_MANGLEID_BIT] = "tx-tcp-mangleid-segmentation",
> [NETIF_F_TSO6_BIT] = "tx-tcp6-segmentation",
> [NETIF_F_FSO_BIT] = "tx-fcoe-segmentation",
> diff --git a/net/ipv4/tcp_offload.c b/net/ipv4/tcp_offload.c
> index 2308665b51c5..0b05f30e9e5f 100644
> --- a/net/ipv4/tcp_offload.c
> +++ b/net/ipv4/tcp_offload.c
> @@ -139,6 +139,7 @@ struct sk_buff *tcp_gso_segment(struct sk_buff *skb,
> struct sk_buff *gso_skb = skb;
> __sum16 newcheck;
> bool ooo_okay, copy_destructor;
> + bool ecn_cwr_mask;
> __wsum delta;
>
> th = tcp_hdr(skb);
> @@ -198,6 +199,8 @@ struct sk_buff *tcp_gso_segment(struct sk_buff *skb,
>
> newcheck = ~csum_fold(csum_add(csum_unfold(th->check), delta));
>
> + ecn_cwr_mask = !!(skb_shinfo(gso_skb)->gso_type & SKB_GSO_TCP_ACCECN);
> +
> while (skb->next) {
> th->fin = th->psh = 0;
> th->check = newcheck;
> @@ -217,7 +220,8 @@ struct sk_buff *tcp_gso_segment(struct sk_buff *skb,
> th = tcp_hdr(skb);
>
> th->seq = htonl(seq);
> - th->cwr = 0;
> +
> + th->cwr &= ecn_cwr_mask;
Hi all,
I started to wonder if this approach would avoid CWR corruption without
need to introduce the SKB_GSO_TCP_ACCECN flag at all:
- Never set SKB_GSO_TCP_ECN for any skb
- Split CWR segment on TCP level before sending. While this causes need to
split the super skb, it's once per RTT thing so does not sound very
significant (compare e.g. with SACK that cause split on every ACK)
- Remove th->cwr cleaning from GSO (the above line)
- Likely needed: don't negotiate HOST/GUEST_ECN in virtio
I think that would prevent offloading treating CWR flag as RFC3168 and CWR
would be just copied as is like a normal flag which is required to not
corrupt it. In practice, RFC3168 style traffic would just not combine
anything with the CWR'ed packet as CWRs are singletons with RFC3168.
Would that work? It sure sounds too simple to work but I cannot
immediately see why it would not.
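To make the second bullet concrete, a very rough sketch of what the TCP-level split could look like, purely for discussion; it assumes the tcp_ecn_mode_accecn() helper from patch 7 and the existing limit/tso_fragment() handling in tcp_write_xmit():
	/* Illustration only: cap an RFC3168 CWR-marked skb at one MSS so the
	 * TSO/GSO engine never sees a multi-segment skb with CWR set and
	 * SKB_GSO_TCP_ECN would not need to be set at all.
	 */
	if (!tcp_ecn_mode_accecn(tp) &&
	    (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_CWR) &&
	    skb->len > mss_now)
		limit = mss_now;	/* let tso_fragment() split here */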
--
i.
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v5 net-next 08/13] gso: AccECN support
2024-11-09 10:09 ` Ilpo Järvinen
@ 2024-11-09 19:42 ` Neal Cardwell
2024-11-09 21:38 ` Ilpo Järvinen
0 siblings, 1 reply; 37+ messages in thread
From: Neal Cardwell @ 2024-11-09 19:42 UTC (permalink / raw)
To: Ilpo Järvinen
Cc: Chia-Yu Chang, netdev, dsahern, davem, edumazet, dsahern, pabeni,
joel.granados, kuba, andrew+netdev, horms, pablo, kadlec,
netfilter-devel, coreteam, koen.de_schepper, g.white,
ingemar.s.johansson, mirja.kuehlewind, cheshire, rs.ietf,
Jason_Livingood, vidhi_goel, Bob Briscoe
On Sat, Nov 9, 2024 at 5:09 AM Ilpo Järvinen <ij@kernel.org> wrote:
>
> On Tue, 5 Nov 2024, chia-yu.chang@nokia-bell-labs.com wrote:
>
> > From: Ilpo Järvinen <ij@kernel.org>
> >
> > Handling the CWR flag differs between RFC 3168 ECN and AccECN.
> > With RFC 3168 ECN aware TSO (NETIF_F_TSO_ECN) CWR flag is cleared
> > starting from 2nd segment which is incompatible with how AccECN handles
> > the CWR flag. Such super-segments are indicated by SKB_GSO_TCP_ECN.
> > With AccECN, CWR flag (or more accurately, the ACE field that also
> > includes ECE & AE flags) changes only when new packet(s) with CE
> > mark arrives so the flag should not be changed within a super-skb.
> > The new skb/feature flags are necessary to prevent such TSO engines
> > corrupting AccECN ACE counters by clearing the CWR flag (if the
> > CWR handling feature cannot be turned off).
> >
> > If NIC is completely unaware of RFC3168 ECN (doesn't support
> > NETIF_F_TSO_ECN) or its TSO engine can be set to not touch CWR flag
> > despite supporting also NETIF_F_TSO_ECN, TSO could be safely used
> > with AccECN on such NIC. This should be evaluated per NIC basis
> > (not done in this patch series for any NICs).
> >
> > For the cases, where TSO cannot keep its hands off the CWR flag,
> > a GSO fallback is provided by this patch.
> >
> > Signed-off-by: Ilpo Järvinen <ij@kernel.org>
> > Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
> > ---
> > include/linux/netdev_features.h | 8 +++++---
> > include/linux/netdevice.h | 2 ++
> > include/linux/skbuff.h | 2 ++
> > net/ethtool/common.c | 1 +
> > net/ipv4/tcp_offload.c | 6 +++++-
> > 5 files changed, 15 insertions(+), 4 deletions(-)
> >
> > diff --git a/include/linux/netdev_features.h b/include/linux/netdev_features.h
> > index 66e7d26b70a4..c59db449bcf0 100644
> > --- a/include/linux/netdev_features.h
> > +++ b/include/linux/netdev_features.h
> > @@ -53,12 +53,12 @@ enum {
> > NETIF_F_GSO_UDP_BIT, /* ... UFO, deprecated except tuntap */
> > NETIF_F_GSO_UDP_L4_BIT, /* ... UDP payload GSO (not UFO) */
> > NETIF_F_GSO_FRAGLIST_BIT, /* ... Fraglist GSO */
> > + NETIF_F_GSO_ACCECN_BIT, /* TCP AccECN w/ TSO (no clear CWR) */
> > /**/NETIF_F_GSO_LAST = /* last bit, see GSO_MASK */
> > - NETIF_F_GSO_FRAGLIST_BIT,
> > + NETIF_F_GSO_ACCECN_BIT,
> >
> > NETIF_F_FCOE_CRC_BIT, /* FCoE CRC32 */
> > NETIF_F_SCTP_CRC_BIT, /* SCTP checksum offload */
> > - __UNUSED_NETIF_F_37,
> > NETIF_F_NTUPLE_BIT, /* N-tuple filters supported */
> > NETIF_F_RXHASH_BIT, /* Receive hashing offload */
> > NETIF_F_RXCSUM_BIT, /* Receive checksumming offload */
> > @@ -128,6 +128,7 @@ enum {
> > #define NETIF_F_SG __NETIF_F(SG)
> > #define NETIF_F_TSO6 __NETIF_F(TSO6)
> > #define NETIF_F_TSO_ECN __NETIF_F(TSO_ECN)
> > +#define NETIF_F_GSO_ACCECN __NETIF_F(GSO_ACCECN)
> > #define NETIF_F_TSO __NETIF_F(TSO)
> > #define NETIF_F_VLAN_CHALLENGED __NETIF_F(VLAN_CHALLENGED)
> > #define NETIF_F_RXFCS __NETIF_F(RXFCS)
> > @@ -210,7 +211,8 @@ static inline int find_next_netdev_feature(u64 feature, unsigned long start)
> > NETIF_F_TSO_ECN | NETIF_F_TSO_MANGLEID)
> >
> > /* List of features with software fallbacks. */
> > -#define NETIF_F_GSO_SOFTWARE (NETIF_F_ALL_TSO | NETIF_F_GSO_SCTP | \
> > +#define NETIF_F_GSO_SOFTWARE (NETIF_F_ALL_TSO | \
> > + NETIF_F_GSO_ACCECN | NETIF_F_GSO_SCTP | \
> > NETIF_F_GSO_UDP_L4 | NETIF_F_GSO_FRAGLIST)
> >
> > /*
> > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> > index 6e0f8e4aeb14..480d915b3bdb 100644
> > --- a/include/linux/netdevice.h
> > +++ b/include/linux/netdevice.h
> > @@ -5076,6 +5076,8 @@ static inline bool net_gso_ok(netdev_features_t features, int gso_type)
> > BUILD_BUG_ON(SKB_GSO_UDP != (NETIF_F_GSO_UDP >> NETIF_F_GSO_SHIFT));
> > BUILD_BUG_ON(SKB_GSO_UDP_L4 != (NETIF_F_GSO_UDP_L4 >> NETIF_F_GSO_SHIFT));
> > BUILD_BUG_ON(SKB_GSO_FRAGLIST != (NETIF_F_GSO_FRAGLIST >> NETIF_F_GSO_SHIFT));
> > + BUILD_BUG_ON(SKB_GSO_TCP_ACCECN !=
> > + (NETIF_F_GSO_ACCECN >> NETIF_F_GSO_SHIFT));
> >
> > return (features & feature) == feature;
> > }
> > diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> > index 48f1e0fa2a13..530cb325fb86 100644
> > --- a/include/linux/skbuff.h
> > +++ b/include/linux/skbuff.h
> > @@ -694,6 +694,8 @@ enum {
> > SKB_GSO_UDP_L4 = 1 << 17,
> >
> > SKB_GSO_FRAGLIST = 1 << 18,
> > +
> > + SKB_GSO_TCP_ACCECN = 1 << 19,
> > };
> >
> > #if BITS_PER_LONG > 32
> > diff --git a/net/ethtool/common.c b/net/ethtool/common.c
> > index 0d62363dbd9d..5c3ba2dfaa74 100644
> > --- a/net/ethtool/common.c
> > +++ b/net/ethtool/common.c
> > @@ -32,6 +32,7 @@ const char netdev_features_strings[NETDEV_FEATURE_COUNT][ETH_GSTRING_LEN] = {
> > [NETIF_F_TSO_BIT] = "tx-tcp-segmentation",
> > [NETIF_F_GSO_ROBUST_BIT] = "tx-gso-robust",
> > [NETIF_F_TSO_ECN_BIT] = "tx-tcp-ecn-segmentation",
> > + [NETIF_F_GSO_ACCECN_BIT] = "tx-tcp-accecn-segmentation",
> > [NETIF_F_TSO_MANGLEID_BIT] = "tx-tcp-mangleid-segmentation",
> > [NETIF_F_TSO6_BIT] = "tx-tcp6-segmentation",
> > [NETIF_F_FSO_BIT] = "tx-fcoe-segmentation",
> > diff --git a/net/ipv4/tcp_offload.c b/net/ipv4/tcp_offload.c
> > index 2308665b51c5..0b05f30e9e5f 100644
> > --- a/net/ipv4/tcp_offload.c
> > +++ b/net/ipv4/tcp_offload.c
> > @@ -139,6 +139,7 @@ struct sk_buff *tcp_gso_segment(struct sk_buff *skb,
> > struct sk_buff *gso_skb = skb;
> > __sum16 newcheck;
> > bool ooo_okay, copy_destructor;
> > + bool ecn_cwr_mask;
> > __wsum delta;
> >
> > th = tcp_hdr(skb);
> > @@ -198,6 +199,8 @@ struct sk_buff *tcp_gso_segment(struct sk_buff *skb,
> >
> > newcheck = ~csum_fold(csum_add(csum_unfold(th->check), delta));
> >
> > + ecn_cwr_mask = !!(skb_shinfo(gso_skb)->gso_type & SKB_GSO_TCP_ACCECN);
> > +
> > while (skb->next) {
> > th->fin = th->psh = 0;
> > th->check = newcheck;
> > @@ -217,7 +220,8 @@ struct sk_buff *tcp_gso_segment(struct sk_buff *skb,
> > th = tcp_hdr(skb);
> >
> > th->seq = htonl(seq);
> > - th->cwr = 0;
> > +
> > + th->cwr &= ecn_cwr_mask;
>
> Hi all,
>
> I started to wonder if this approach would avoid CWR corruption without
> need to introduce the SKB_GSO_TCP_ACCECN flag at all:
>
> - Never set SKB_GSO_TCP_ECN for any skb
> - Split CWR segment on TCP level before sending. While this causes need to
> split the super skb, it's once per RTT thing so does not sound very
> significant (compare e.g. with SACK that cause split on every ACK)
> - Remove th->cwr cleaning from GSO (the above line)
> - Likely needed: don't negotiate HOST/GUEST_ECN in virtio
>
> I think that would prevent offloading treating CWR flag as RFC3168 and CWR
> would be just copied as is like a normal flag which is required to not
> corrupt it. In practice, RFC3168 style traffic would just not combine
> anything with the CWR'ed packet as CWRs are singletons with RFC3168.
>
> Would that work? It sure sounds too simple to work but I cannot
> immediately see why it would not.
The above description just seems to describe the dynamics for the
RFC3168 case (since it mentions the CWR bit being set as just a "once
per RTT thing").
Ilpo, can you please describe how AccECN traffic would be handled in
your design proposal? (Since with established AccECN connections the
CWR bit is part of the ACE counter and may be set on roughly half of
outgoing data skbs...)
Sorry if I'm misunderstanding something...
Thanks,
neal
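A quick worked example of this point, assuming the draft's 3-bit ACE counter with initial value 0b101:
	/*
	 * CE count (+5 offset, mod 8) -> ACE field (AE CWR ECE)
	 *   0 -> 101    4 -> 001
	 *   1 -> 110    5 -> 010
	 *   2 -> 111    6 -> 011
	 *   3 -> 000    7 -> 100
	 *
	 * CWR (the middle bit) is set in four of the eight values, so on an
	 * established AccECN connection roughly half of the outgoing segments
	 * can legitimately carry CWR, unlike RFC 3168 where CWR is a
	 * once-per-window-reduction singleton.
	 */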
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v5 net-next 08/13] gso: AccECN support
2024-11-09 19:42 ` Neal Cardwell
@ 2024-11-09 21:38 ` Ilpo Järvinen
0 siblings, 0 replies; 37+ messages in thread
From: Ilpo Järvinen @ 2024-11-09 21:38 UTC (permalink / raw)
To: Neal Cardwell
Cc: Chia-Yu Chang, netdev, dsahern, davem, edumazet, dsahern, pabeni,
joel.granados, kuba, andrew+netdev, horms, pablo, kadlec,
netfilter-devel, coreteam, koen.de_schepper, g.white,
ingemar.s.johansson, mirja.kuehlewind, cheshire, rs.ietf,
Jason_Livingood, vidhi_goel, Bob Briscoe
On Sat, 9 Nov 2024, Neal Cardwell wrote:
> On Sat, Nov 9, 2024 at 5:09 AM Ilpo Järvinen <ij@kernel.org> wrote:
> >
> > On Tue, 5 Nov 2024, chia-yu.chang@nokia-bell-labs.com wrote:
> >
> > > From: Ilpo Järvinen <ij@kernel.org>
> > >
> > > Handling the CWR flag differs between RFC 3168 ECN and AccECN.
> > > With RFC 3168 ECN aware TSO (NETIF_F_TSO_ECN) CWR flag is cleared
> > > starting from 2nd segment which is incompatible with how AccECN handles
> > > the CWR flag. Such super-segments are indicated by SKB_GSO_TCP_ECN.
> > > With AccECN, CWR flag (or more accurately, the ACE field that also
> > > includes ECE & AE flags) changes only when new packet(s) with CE
> > > mark arrives so the flag should not be changed within a super-skb.
> > > The new skb/feature flags are necessary to prevent such TSO engines
> > > corrupting AccECN ACE counters by clearing the CWR flag (if the
> > > CWR handling feature cannot be turned off).
> > >
> > > If NIC is completely unaware of RFC3168 ECN (doesn't support
> > > NETIF_F_TSO_ECN) or its TSO engine can be set to not touch CWR flag
> > > despite supporting also NETIF_F_TSO_ECN, TSO could be safely used
> > > with AccECN on such NIC. This should be evaluated per NIC basis
> > > (not done in this patch series for any NICs).
> > >
> > > For the cases, where TSO cannot keep its hands off the CWR flag,
> > > a GSO fallback is provided by this patch.
> > >
> > > Signed-off-by: Ilpo Järvinen <ij@kernel.org>
> > > Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
> > > ---
> > > include/linux/netdev_features.h | 8 +++++---
> > > include/linux/netdevice.h | 2 ++
> > > include/linux/skbuff.h | 2 ++
> > > net/ethtool/common.c | 1 +
> > > net/ipv4/tcp_offload.c | 6 +++++-
> > > 5 files changed, 15 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/include/linux/netdev_features.h b/include/linux/netdev_features.h
> > > index 66e7d26b70a4..c59db449bcf0 100644
> > > --- a/include/linux/netdev_features.h
> > > +++ b/include/linux/netdev_features.h
> > > @@ -53,12 +53,12 @@ enum {
> > > NETIF_F_GSO_UDP_BIT, /* ... UFO, deprecated except tuntap */
> > > NETIF_F_GSO_UDP_L4_BIT, /* ... UDP payload GSO (not UFO) */
> > > NETIF_F_GSO_FRAGLIST_BIT, /* ... Fraglist GSO */
> > > + NETIF_F_GSO_ACCECN_BIT, /* TCP AccECN w/ TSO (no clear CWR) */
> > > /**/NETIF_F_GSO_LAST = /* last bit, see GSO_MASK */
> > > - NETIF_F_GSO_FRAGLIST_BIT,
> > > + NETIF_F_GSO_ACCECN_BIT,
> > >
> > > NETIF_F_FCOE_CRC_BIT, /* FCoE CRC32 */
> > > NETIF_F_SCTP_CRC_BIT, /* SCTP checksum offload */
> > > - __UNUSED_NETIF_F_37,
> > > NETIF_F_NTUPLE_BIT, /* N-tuple filters supported */
> > > NETIF_F_RXHASH_BIT, /* Receive hashing offload */
> > > NETIF_F_RXCSUM_BIT, /* Receive checksumming offload */
> > > @@ -128,6 +128,7 @@ enum {
> > > #define NETIF_F_SG __NETIF_F(SG)
> > > #define NETIF_F_TSO6 __NETIF_F(TSO6)
> > > #define NETIF_F_TSO_ECN __NETIF_F(TSO_ECN)
> > > +#define NETIF_F_GSO_ACCECN __NETIF_F(GSO_ACCECN)
> > > #define NETIF_F_TSO __NETIF_F(TSO)
> > > #define NETIF_F_VLAN_CHALLENGED __NETIF_F(VLAN_CHALLENGED)
> > > #define NETIF_F_RXFCS __NETIF_F(RXFCS)
> > > @@ -210,7 +211,8 @@ static inline int find_next_netdev_feature(u64 feature, unsigned long start)
> > > NETIF_F_TSO_ECN | NETIF_F_TSO_MANGLEID)
> > >
> > > /* List of features with software fallbacks. */
> > > -#define NETIF_F_GSO_SOFTWARE (NETIF_F_ALL_TSO | NETIF_F_GSO_SCTP | \
> > > +#define NETIF_F_GSO_SOFTWARE (NETIF_F_ALL_TSO | \
> > > + NETIF_F_GSO_ACCECN | NETIF_F_GSO_SCTP | \
> > > NETIF_F_GSO_UDP_L4 | NETIF_F_GSO_FRAGLIST)
> > >
> > > /*
> > > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> > > index 6e0f8e4aeb14..480d915b3bdb 100644
> > > --- a/include/linux/netdevice.h
> > > +++ b/include/linux/netdevice.h
> > > @@ -5076,6 +5076,8 @@ static inline bool net_gso_ok(netdev_features_t features, int gso_type)
> > > BUILD_BUG_ON(SKB_GSO_UDP != (NETIF_F_GSO_UDP >> NETIF_F_GSO_SHIFT));
> > > BUILD_BUG_ON(SKB_GSO_UDP_L4 != (NETIF_F_GSO_UDP_L4 >> NETIF_F_GSO_SHIFT));
> > > BUILD_BUG_ON(SKB_GSO_FRAGLIST != (NETIF_F_GSO_FRAGLIST >> NETIF_F_GSO_SHIFT));
> > > + BUILD_BUG_ON(SKB_GSO_TCP_ACCECN !=
> > > + (NETIF_F_GSO_ACCECN >> NETIF_F_GSO_SHIFT));
> > >
> > > return (features & feature) == feature;
> > > }
> > > diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> > > index 48f1e0fa2a13..530cb325fb86 100644
> > > --- a/include/linux/skbuff.h
> > > +++ b/include/linux/skbuff.h
> > > @@ -694,6 +694,8 @@ enum {
> > > SKB_GSO_UDP_L4 = 1 << 17,
> > >
> > > SKB_GSO_FRAGLIST = 1 << 18,
> > > +
> > > + SKB_GSO_TCP_ACCECN = 1 << 19,
> > > };
> > >
> > > #if BITS_PER_LONG > 32
> > > diff --git a/net/ethtool/common.c b/net/ethtool/common.c
> > > index 0d62363dbd9d..5c3ba2dfaa74 100644
> > > --- a/net/ethtool/common.c
> > > +++ b/net/ethtool/common.c
> > > @@ -32,6 +32,7 @@ const char netdev_features_strings[NETDEV_FEATURE_COUNT][ETH_GSTRING_LEN] = {
> > > [NETIF_F_TSO_BIT] = "tx-tcp-segmentation",
> > > [NETIF_F_GSO_ROBUST_BIT] = "tx-gso-robust",
> > > [NETIF_F_TSO_ECN_BIT] = "tx-tcp-ecn-segmentation",
> > > + [NETIF_F_GSO_ACCECN_BIT] = "tx-tcp-accecn-segmentation",
> > > [NETIF_F_TSO_MANGLEID_BIT] = "tx-tcp-mangleid-segmentation",
> > > [NETIF_F_TSO6_BIT] = "tx-tcp6-segmentation",
> > > [NETIF_F_FSO_BIT] = "tx-fcoe-segmentation",
> > > diff --git a/net/ipv4/tcp_offload.c b/net/ipv4/tcp_offload.c
> > > index 2308665b51c5..0b05f30e9e5f 100644
> > > --- a/net/ipv4/tcp_offload.c
> > > +++ b/net/ipv4/tcp_offload.c
> > > @@ -139,6 +139,7 @@ struct sk_buff *tcp_gso_segment(struct sk_buff *skb,
> > > struct sk_buff *gso_skb = skb;
> > > __sum16 newcheck;
> > > bool ooo_okay, copy_destructor;
> > > + bool ecn_cwr_mask;
> > > __wsum delta;
> > >
> > > th = tcp_hdr(skb);
> > > @@ -198,6 +199,8 @@ struct sk_buff *tcp_gso_segment(struct sk_buff *skb,
> > >
> > > newcheck = ~csum_fold(csum_add(csum_unfold(th->check), delta));
> > >
> > > + ecn_cwr_mask = !!(skb_shinfo(gso_skb)->gso_type & SKB_GSO_TCP_ACCECN);
> > > +
> > > while (skb->next) {
> > > th->fin = th->psh = 0;
> > > th->check = newcheck;
> > > @@ -217,7 +220,8 @@ struct sk_buff *tcp_gso_segment(struct sk_buff *skb,
> > > th = tcp_hdr(skb);
> > >
> > > th->seq = htonl(seq);
> > > - th->cwr = 0;
> > > +
> > > + th->cwr &= ecn_cwr_mask;
> >
> > Hi all,
> >
> > I started to wonder if this approach would avoid CWR corruption without
> > the need to introduce the SKB_GSO_TCP_ACCECN flag at all:
> >
> > - Never set SKB_GSO_TCP_ECN for any skb
> > - Split the CWR segment at the TCP level before sending. While this requires
> > splitting the super-skb, it's a once-per-RTT thing so it does not sound very
> > significant (compare e.g. with SACK, which causes a split on every ACK)
> > - Remove th->cwr cleaning from GSO (the above line)
> > - Likely needed: don't negotiate HOST/GUEST_ECN in virtio
> >
> > I think that would prevent the offload from treating the CWR flag as RFC3168,
> > and CWR would just be copied as is like a normal flag, which is what is needed
> > to avoid corrupting it. In practice, RFC3168-style traffic would simply not
> > combine anything with the CWR'ed packet, as CWRs are singletons with RFC3168.
> >
> > Would that work? It sure sounds too simple to work but I cannot
> > immediately see why it would not.
>
> The above description just seems to describe the dynamics for the
> RFC3168 case (since it mentions the CWR bit being set as just a "once
> per RTT thing").
For the RFC3168 ECN case, that still holds.
I think the misunderstanding comes from this: I only meant to briefly describe
how the proposal would impact TCP using RFC3168 ECN. TCP would need to split the
CWR segment into its own skb so it does not share an skb with non-CWR segments,
and even that is infrequent enough (once per RTT) that it doesn't sound like a
bad compromise.
> Ilpo, can you please describe how AccECN traffic would be handled in
> your design proposal? (Since with established AccECN connections the
> CWR bit is part of the ACE counter and may be set on roughly half of
> outgoing data skbs...)
TSO (and GSO) RFC3168 ECN offloading takes an skb, puts CWR only into
the first segment, and sends the rest of the segments without it.
What we want for AccECN is to never combine CWR and non-CWR skbs. The
question is just how to achieve this.
My idea is, instead of this SKB_GSO_TCP_ACCECN flag, to always split those
skbs apart at the TCP level (RFC3168 case) and to adapt GRO to not set
SKB_GSO_TCP_ECN either (the latter should already be done by this series).
When SKB_GSO_TCP_ECN is no longer set by TCP nor GRO, is that enough to
prevent HW offloading from handling CWR in the RFC3168 way? That is, will the
HW offload still mess with the CWR flag, or does it honor the lack of
SKB_GSO_TCP_ECN (I'm not sure about the answer as I've not dealt with NICs
on that level)? If the HW offloading keeps its hands off CWR, then there
would be no need for this new flag.
The GSO case is handled by simply removing the CWR mangling from the code.
Well, I went and read some NIC drivers, and I suspect the answer is that HW
offloading will still mess with CWR, given the lack of code that would depend
on that gso_type flag... So I guess my idea is a dead end and the
SKB_GSO_TCP_ACCECN flag is needed to prevent HW offloads from seeing the
CWR'ed skb.
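
To make the two offload behaviours concrete, here is a minimal sketch (mine,
not code from the series) of how CWR is meant to be treated per produced
segment in each mode; the helper name is made up and it only restates what the
"th->cwr &= ecn_cwr_mask" hunk quoted above implements:

#include <linux/types.h>
#include <linux/tcp.h>

/* Illustration only: 'first_seg' marks the first segment produced from a
 * super-skb, 'accecn' corresponds to SKB_GSO_TCP_ACCECN being set.
 */
static void sketch_segment_cwr(struct tcphdr *th, bool first_seg, bool accecn)
{
	if (accecn) {
		/* AccECN: CWR is one bit of the ACE counter, so every
		 * segment keeps the super-skb's CWR value unchanged.
		 */
		return;
	}
	/* RFC3168 (SKB_GSO_TCP_ECN): CWR is a one-shot congestion-response
	 * signal, so only the first segment may carry it.
	 */
	if (!first_seg)
		th->cwr = 0;
}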
> Sorry if I'm misunderstanding something...
--
i.
^ permalink raw reply [flat|nested] 37+ messages in thread
* RE: [PATCH v5 net-next 09/13] gro: prevent ACE field corruption & better AccECN handling
2024-11-07 19:28 ` Ilpo Järvinen
@ 2024-11-12 2:09 ` Chia-Yu Chang (Nokia)
2024-11-12 21:19 ` Ilpo Järvinen
0 siblings, 1 reply; 37+ messages in thread
From: Chia-Yu Chang (Nokia) @ 2024-11-12 2:09 UTC (permalink / raw)
To: Ilpo Järvinen, Eric Dumazet
Cc: netdev@vger.kernel.org, dsahern@gmail.com, davem@davemloft.net,
dsahern@kernel.org, pabeni@redhat.com, joel.granados@kernel.org,
kuba@kernel.org, andrew+netdev@lunn.ch, horms@kernel.org,
pablo@netfilter.org, kadlec@netfilter.org,
netfilter-devel@vger.kernel.org, coreteam@netfilter.org,
ncardwell@google.com, Koen De Schepper (Nokia),
g.white@cablelabs.com, ingemar.s.johansson@ericsson.com,
mirja.kuehlewind@ericsson.com, cheshire@apple.com, rs.ietf@gmx.at,
Jason_Livingood@comcast.com, vidhi_goel@apple.com
>From: Ilpo Järvinen <ij@kernel.org>
>Sent: Thursday, November 7, 2024 8:28 PM
>To: Eric Dumazet <edumazet@google.com>
>Cc: Chia-Yu Chang (Nokia) <chia-yu.chang@nokia-bell-labs.com>; netdev@vger.kernel.org; dsahern@gmail.com; davem@davemloft.net; dsahern@kernel.org; pabeni@redhat.com; joel.granados@kernel.org; kuba@kernel.org; andrew+netdev@lunn.ch; horms@kernel.org; pablo@netfilter.org; kadlec@netfilter.org; netfilter-devel@vger.kernel.org; coreteam@netfilter.org; ncardwell@google.com; Koen De Schepper (Nokia) <koen.de_schepper@nokia-bell-labs.com>; g.white@cablelabs.com; ingemar.s.johansson@ericsson.com; mirja.kuehlewind@ericsson.com; cheshire@apple.com; rs.ietf@gmx.at; Jason_Livingood@comcast.com; vidhi_goel@apple.com
>Subject: Re: [PATCH v5 net-next 09/13] gro: prevent ACE field corruption & better AccECN handling
>
>On Thu, 7 Nov 2024, Eric Dumazet wrote:
>
>>On Tue, Nov 5, 2024 at 11:07 AM <chia-yu.chang@nokia-bell-labs.com> wrote:
>>>
>>> From: Ilpo Järvinen <ij@kernel.org>
>>>
>>> There are important differences in how the CWR field behaves in
>>> RFC3168 and AccECN. With AccECN, CWR flag is part of the ACE counter
>>> and its changes are important so adjust the flags changed mask
>>> accordingly.
>>>
>>> Also, if CWR is there, set the Accurate ECN GSO flag to avoid
>>> corrupting CWR flag somewhere.
>>>
>>> Signed-off-by: Ilpo Järvinen <ij@kernel.org>
>>> Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
>>> ---
>>> net/ipv4/tcp_offload.c | 4 ++--
>>> 1 file changed, 2 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/net/ipv4/tcp_offload.c b/net/ipv4/tcp_offload.c index
>>> 0b05f30e9e5f..f59762d88c38 100644
>>> --- a/net/ipv4/tcp_offload.c
>>> +++ b/net/ipv4/tcp_offload.c
>>> @@ -329,7 +329,7 @@ struct sk_buff *tcp_gro_receive(struct list_head *head, struct sk_buff *skb,
>>> th2 = tcp_hdr(p);
>>> flush = (__force int)(flags & TCP_FLAG_CWR);
>>> flush |= (__force int)((flags ^ tcp_flag_word(th2)) &
>>> - ~(TCP_FLAG_CWR | TCP_FLAG_FIN | TCP_FLAG_PSH));
>>> + ~(TCP_FLAG_FIN | TCP_FLAG_PSH));
>>> flush |= (__force int)(th->ack_seq ^ th2->ack_seq);
>>> for (i = sizeof(*th); i < thlen; i += 4)
> >>> flush |= *(u32 *)((u8 *)th + i) ^
> >>> @@ -405,7 +405,7 @@ void tcp_gro_complete(struct sk_buff *skb)
>>> shinfo->gso_segs = NAPI_GRO_CB(skb)->count;
>>>
>>> if (th->cwr)
>>> - shinfo->gso_type |= SKB_GSO_TCP_ECN;
>>> + shinfo->gso_type |= SKB_GSO_TCP_ACCECN;
>>> }
>>> EXPORT_SYMBOL(tcp_gro_complete);
>>>
>>
>> I do not really understand this patch. How a GRO engine can know which
>> ECN variant the peers are using ?
>
>Hi Eric,
>
>Thanks for taking a look.
>
>GRO doesn't know. Setting SKB_GSO_TCP_ECN in case of not knowing can result in header change that corrupts ACE field. Thus, GRO has to assume the RFC3168 CWR offloading trick cannot be used anymore (unless it tracks the connection and knows the bits are used for RFC3168 which is something nobody is going to do).
>
>The main point of SKB_GSO_TCP_ACCECN is to prevent SKB_GSO_TCP_ECN or NETIF_F_TSO_ECN offloading to be used for the skb as it would corrupt ACE field value.
Hi Eric and Ilpo,
From my understanding of the other email thread (patch 08/13), it seems the conclusion is that SKB_GSO_TCP_ACCECN will still be needed.
>
>SKB_GSO_TCP_ACCECN doesn't allow CWR bits change within a super-skb but the same CWR flag should be repeated for all segments. In a sense, it's simpler than RFC3168 offloading.
>
>> SKB_GSO_TCP_ECN is also used from other points, what is your plan ?
>>
>> git grep -n SKB_GSO_TCP_ECN
>> drivers/net/ethernet/hisilicon/hns3/hns3_enet.c:3888:
>> skb_shinfo(skb)->gso_type |= SKB_GSO_TCP_ECN;
>> drivers/net/ethernet/mellanox/mlx5/core/en_rx.c:1291:
>> skb_shinfo(skb)->gso_type |= SKB_GSO_TCP_ECN;
>> drivers/net/ethernet/mellanox/mlx5/core/en_rx.c:1312:
>> skb_shinfo(skb)->gso_type |= SKB_GSO_TCP_ECN;
>
>These looked like they should be just changed to set SKB_GSO_TCP_ACCECN instead.
I agree with these changes and will apply them in the next version.
>
>I don't anymore recall why I didn't change those when I made this patch long time ago, perhaps it was just an oversight or things have changed somehow since then.
>
>> include/linux/netdevice.h:5061: BUILD_BUG_ON(SKB_GSO_TCP_ECN !=
>> (NETIF_F_TSO_ECN >> NETIF_F_GSO_SHIFT));
>> include/linux/skbuff.h:664: SKB_GSO_TCP_ECN = 1 << 2,
>
>Not relevant.
>
>> include/linux/virtio_net.h:88: gso_type |= SKB_GSO_TCP_ECN;
>> include/linux/virtio_net.h:161: switch (gso_type & ~SKB_GSO_TCP_ECN) {
>> include/linux/virtio_net.h:226: if (sinfo->gso_type & SKB_GSO_TCP_ECN)
>
>These need to be looked further what's going on as UAPI is also involved here.
I had a look at these parts, and only the 1st and 3rd ones are relevant.
For the 1st one, I propose to change
from

    if (hdr->gso_type & VIRTIO_NET_HDR_GSO_ECN)
        gso_type |= SKB_GSO_TCP_ECN;

to

    if (hdr->gso_type & VIRTIO_NET_HDR_GSO_ECN)
        gso_type |= SKB_GSO_TCP_ACCECN;

The reasoning behind this proposed change is similar to that for the en_rx.c changes above.
For the 3rd one, I suggest changing from

    if (sinfo->gso_type & SKB_GSO_TCP_ECN)
        hdr->gso_type |= VIRTIO_NET_HDR_GSO_ECN;

to

    if (sinfo->gso_type & (SKB_GSO_TCP_ECN | SKB_GSO_TCP_ACCECN))
        hdr->gso_type |= VIRTIO_NET_HDR_GSO_ECN;

The reason for this change is that VIRTIO_NET_HDR_GSO_ECN must be set to allow segmentation offload for TCP packets that have the ECN bit set.
So, no matter whether the skb gso_type carries the GSO_TCP_ECN flag or the GSO_TCP_ACCECN flag, the corresponding hdr gso_type bit shall be set; a small sketch making this mapping concrete follows below.
But I wonder what the driver would do with the VIRTIO_NET_HDR_GSO_ECN flag: will it corrupt CWR or not?
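
Here is that sketch; the sketch_* helper names are hypothetical (they do not
exist in include/linux/virtio_net.h) and SKB_GSO_TCP_ACCECN assumes patch
08/13 is applied:

#include <linux/skbuff.h>
#include <linux/virtio_net.h>

/* TX direction (3rd occurrence above): either kernel flag maps to the single
 * VIRTIO_NET_HDR_GSO_ECN bit, since the virtio header cannot distinguish
 * RFC3168 from AccECN.
 */
static inline __u8 sketch_skb_ecn_to_virtio(unsigned int gso_type)
{
	if (gso_type & (SKB_GSO_TCP_ECN | SKB_GSO_TCP_ACCECN))
		return VIRTIO_NET_HDR_GSO_ECN;
	return 0;
}

/* RX direction (1st occurrence above): map the virtio bit to the AccECN flag
 * so that no later GSO stage applies RFC3168-style CWR clearing.
 */
static inline unsigned int sketch_virtio_ecn_to_skb(__u8 hdr_gso_type)
{
	if (hdr_gso_type & VIRTIO_NET_HDR_GSO_ECN)
		return SKB_GSO_TCP_ACCECN;
	return 0;
}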
--
Chia-Yu
>
>> net/ipv4/tcp_offload.c:404: shinfo->gso_type |= SKB_GSO_TCP_ECN;
>
>This was changed above. :-)
>
>> net/ipv4/tcp_output.c:389: skb_shinfo(skb)->gso_type |=
>> SKB_GSO_TCP_ECN;
>
>I'm pretty sure this relates to sending side which will be taken care by a later patch in the full series (not among these preparatory patches).
>
>
>FYI, these TSO/GSO/GRO changes are what I was most unsure myself if I got everything right when I was working with this series a few years back so please keep that in mind while reviewing so my lack of knowledge doesn't end up breaking something. :-)
^ permalink raw reply [flat|nested] 37+ messages in thread
* RE: [PATCH v5 net-next 09/13] gro: prevent ACE field corruption & better AccECN handling
2024-11-12 2:09 ` Chia-Yu Chang (Nokia)
@ 2024-11-12 21:19 ` Ilpo Järvinen
2024-11-13 1:42 ` Jason Wang
0 siblings, 1 reply; 37+ messages in thread
From: Ilpo Järvinen @ 2024-11-12 21:19 UTC (permalink / raw)
To: Chia-Yu Chang (Nokia), Michael S. Tsirkin, Jason Wang,
virtualization
Cc: Eric Dumazet, netdev@vger.kernel.org, dsahern@gmail.com,
davem@davemloft.net, dsahern@kernel.org, pabeni@redhat.com,
joel.granados@kernel.org, kuba@kernel.org, andrew+netdev@lunn.ch,
horms@kernel.org, pablo@netfilter.org, kadlec@netfilter.org,
netfilter-devel@vger.kernel.org, coreteam@netfilter.org,
ncardwell@google.com, Koen De Schepper (Nokia),
g.white@cablelabs.com, ingemar.s.johansson@ericsson.com,
mirja.kuehlewind@ericsson.com, cheshire@apple.com, rs.ietf@gmx.at,
Jason_Livingood@comcast.com, vidhi_goel@apple.com
Adding a few virtio people. Please see the virtio spec/flag question
below.
On Tue, 12 Nov 2024, Chia-Yu Chang (Nokia) wrote:
> >From: Ilpo Järvinen <ij@kernel.org>
> >Sent: Thursday, November 7, 2024 8:28 PM
> >To: Eric Dumazet <edumazet@google.com>
> >Cc: Chia-Yu Chang (Nokia) <chia-yu.chang@nokia-bell-labs.com>; netdev@vger.kernel.org; dsahern@gmail.com; davem@davemloft.net; dsahern@kernel.org; pabeni@redhat.com; joel.granados@kernel.org; kuba@kernel.org; andrew+netdev@lunn.ch; horms@kernel.org; pablo@netfilter.org; kadlec@netfilter.org; netfilter-devel@vger.kernel.org; coreteam@netfilter.org; ncardwell@google.com; Koen De Schepper (Nokia) <koen.de_schepper@nokia-bell-labs.com>; g.white@cablelabs.com; ingemar.s.johansson@ericsson.com; mirja.kuehlewind@ericsson.com; cheshire@apple.com; rs.ietf@gmx.at; Jason_Livingood@comcast.com; vidhi_goel@apple.com
> >Subject: Re: [PATCH v5 net-next 09/13] gro: prevent ACE field corruption & better AccECN handling
> >
> >On Thu, 7 Nov 2024, Eric Dumazet wrote:
> >
> >>On Tue, Nov 5, 2024 at 11:07 AM <chia-yu.chang@nokia-bell-labs.com> wrote:
> >>>
> >>> From: Ilpo Järvinen <ij@kernel.org>
> >>>
> >>> There are important differences in how the CWR field behaves in
> >>> RFC3168 and AccECN. With AccECN, CWR flag is part of the ACE counter
> >>> and its changes are important so adjust the flags changed mask
> >>> accordingly.
> >>>
> >>> Also, if CWR is there, set the Accurate ECN GSO flag to avoid
> >>> corrupting CWR flag somewhere.
> >>>
> >>> Signed-off-by: Ilpo Järvinen <ij@kernel.org>
> >>> Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
> >>> ---
> >>> net/ipv4/tcp_offload.c | 4 ++--
> >>> 1 file changed, 2 insertions(+), 2 deletions(-)
> >>>
> >>> diff --git a/net/ipv4/tcp_offload.c b/net/ipv4/tcp_offload.c index
> >>> 0b05f30e9e5f..f59762d88c38 100644
> >>> --- a/net/ipv4/tcp_offload.c
> >>> +++ b/net/ipv4/tcp_offload.c
> >>> @@ -329,7 +329,7 @@ struct sk_buff *tcp_gro_receive(struct list_head *head, struct sk_buff *skb,
> >>> th2 = tcp_hdr(p);
> >>> flush = (__force int)(flags & TCP_FLAG_CWR);
> >>> flush |= (__force int)((flags ^ tcp_flag_word(th2)) &
> >>> - ~(TCP_FLAG_CWR | TCP_FLAG_FIN | TCP_FLAG_PSH));
> >>> + ~(TCP_FLAG_FIN | TCP_FLAG_PSH));
> >>> flush |= (__force int)(th->ack_seq ^ th2->ack_seq);
> >>> for (i = sizeof(*th); i < thlen; i += 4)
> > >>> flush |= *(u32 *)((u8 *)th + i) ^
> > >>> @@ -405,7 +405,7 @@ void tcp_gro_complete(struct sk_buff *skb)
> >>> shinfo->gso_segs = NAPI_GRO_CB(skb)->count;
> >>>
> >>> if (th->cwr)
> >>> - shinfo->gso_type |= SKB_GSO_TCP_ECN;
> >>> + shinfo->gso_type |= SKB_GSO_TCP_ACCECN;
> >>> }
> >>> EXPORT_SYMBOL(tcp_gro_complete);
> >>>
> >>
> >> I do not really understand this patch. How a GRO engine can know which
> >> ECN variant the peers are using ?
> >
> >Hi Eric,
> >
> >Thanks for taking a look.
> >
> >GRO doesn't know. Setting SKB_GSO_TCP_ECN in case of not knowing can result in header change that corrupts ACE field. Thus, GRO has to assume the RFC3168 CWR offloading trick cannot be used anymore (unless it tracks the connection and knows the bits are used for RFC3168 which is something nobody is going to do).
> >
> >The main point of SKB_GSO_TCP_ACCECN is to prevent SKB_GSO_TCP_ECN or NETIF_F_TSO_ECN offloading to be used for the skb as it would corrupt ACE field value.
>
> Hi Eric and Ilpo,
>
> From my understanding of another email thread (patch 08/13), it seems the conclusions that SKB_GSO_TCP_ACCECN will still be needed.
>
> >
> >SKB_GSO_TCP_ACCECN doesn't allow CWR bits change within a super-skb but the same CWR flag should be repeated for all segments. In a sense, it's simpler than RFC3168 offloading.
> >
> >> SKB_GSO_TCP_ECN is also used from other points, what is your plan ?
> >>
> >> git grep -n SKB_GSO_TCP_ECN
> >> drivers/net/ethernet/hisilicon/hns3/hns3_enet.c:3888:
> >> skb_shinfo(skb)->gso_type |= SKB_GSO_TCP_ECN;
> >> drivers/net/ethernet/mellanox/mlx5/core/en_rx.c:1291:
> >> skb_shinfo(skb)->gso_type |= SKB_GSO_TCP_ECN;
> >> drivers/net/ethernet/mellanox/mlx5/core/en_rx.c:1312:
> >> skb_shinfo(skb)->gso_type |= SKB_GSO_TCP_ECN;
> >
> >These looked like they should be just changed to set SKB_GSO_TCP_ACCECN instead.
>
> I agree with these changes and will apply them in the next version.
>
> >
> >I don't anymore recall why I didn't change those when I made this patch long time ago, perhaps it was just an oversight or things have changed somehow since then.
> >
> >> include/linux/netdevice.h:5061: BUILD_BUG_ON(SKB_GSO_TCP_ECN !=
> >> (NETIF_F_TSO_ECN >> NETIF_F_GSO_SHIFT));
> >> include/linux/skbuff.h:664: SKB_GSO_TCP_ECN = 1 << 2,
> >
> >Not relevant.
> >
> >> include/linux/virtio_net.h:88: gso_type |= SKB_GSO_TCP_ECN;
> >> include/linux/virtio_net.h:161: switch (gso_type & ~SKB_GSO_TCP_ECN) {
> >> include/linux/virtio_net.h:226: if (sinfo->gso_type & SKB_GSO_TCP_ECN)
> >
> >These need to be looked further what's going on as UAPI is also
> > involved here.
>
> I had a look at these parts, and only the 1st and 3rd ones are relevant.
> Related to the 1st one, I propose to change
> from
>
> if (hdr->gso_type & VIRTIO_NET_HDR_GSO_ECN)
> gso_type |= SKB_GSO_TCP_ECN;
>
> to
>
> if (hdr->gso_type & VIRTIO_NET_HDR_GSO_ECN)
> gso_type |= SKB_GSO_TCP_ACCECN;
>
> The reason behind this proposed change is similar as the above changes
> in en_rx.c.
But en_rx.c is based on the CWR flag; there's no similar check here.
> For the 3rd one, I suggest to change from
>
> if (sinfo->gso_type & SKB_GSO_TCP_ECN)
> hdr->gso_type |= VIRTIO_NET_HDR_GSO_ECN;
>
> to
>
> if (sinfo->gso_type & (SKB_GSO_TCP_ECN | SKB_GSO_TCP_ACCECN))
> hdr->gso_type |= VIRTIO_NET_HDR_GSO_ECN;
>
> This proposed change is because VIRTIO_NET_HDR_GSO_ECN must be set to
> allow TCP packets requiring segmentation offload which have ECN bit set.
> So, no matter whether skb gso_type have GSO_TCP_ECN flag or
> GSO_TCP_ACCECN flag, the corresponding hdr gso_type shall be set.
>
> But, I wonder what would the driver do when with VIRTIO_NET_HDR_GSO_ECN
> flag. Will it corrupts CWR or not?
I'm starting to seriously question what the meaning of that
VIRTIO_NET_HDR_GSO_ECN flag is and whether it's even used consistently...
https://github.com/qemu/qemu/blob/master/net/eth.c
That sets VIRTIO_NET_HDR_GSO_ECN based on CE?? (Whereas the kernel associates
the related SKB_GSO_TCP_ECN with CWR offloading.)
The virtio spec is way too vague to be useful, so it would not be
surprising if implementers have diverging interpretations:
"If the driver negotiated the VIRTIO_NET_F_HOST_ECN feature, the
VIRTIO_NET_HDR_GSO_ECN bit in gso_type indicates that the TCP packet has
the ECN bit set."
What is "the ECN bit" in terms used by TCP (or IP)? Could some virtio
folks please explain?
--
i.
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v5 net-next 09/13] gro: prevent ACE field corruption & better AccECN handling
2024-11-12 21:19 ` Ilpo Järvinen
@ 2024-11-13 1:42 ` Jason Wang
2024-11-14 1:04 ` Chia-Yu Chang (Nokia)
0 siblings, 1 reply; 37+ messages in thread
From: Jason Wang @ 2024-11-13 1:42 UTC (permalink / raw)
To: Ilpo Järvinen
Cc: Chia-Yu Chang (Nokia), Michael S. Tsirkin, virtualization,
Eric Dumazet, netdev@vger.kernel.org, dsahern@gmail.com,
davem@davemloft.net, dsahern@kernel.org, pabeni@redhat.com,
joel.granados@kernel.org, kuba@kernel.org, andrew+netdev@lunn.ch,
horms@kernel.org, pablo@netfilter.org, kadlec@netfilter.org,
netfilter-devel@vger.kernel.org, coreteam@netfilter.org,
ncardwell@google.com, Koen De Schepper (Nokia),
g.white@cablelabs.com, ingemar.s.johansson@ericsson.com,
mirja.kuehlewind@ericsson.com, cheshire@apple.com, rs.ietf@gmx.at,
Jason_Livingood@comcast.com, vidhi_goel@apple.com
On Wed, Nov 13, 2024 at 5:19 AM Ilpo Järvinen <ij@kernel.org> wrote:
>
> Adding a few virtio people. Please see the virtio spec/flag question
> below.
>
> On Tue, 12 Nov 2024, Chia-Yu Chang (Nokia) wrote:
>
> > >From: Ilpo Järvinen <ij@kernel.org>
> > >Sent: Thursday, November 7, 2024 8:28 PM
> > >To: Eric Dumazet <edumazet@google.com>
> > >Cc: Chia-Yu Chang (Nokia) <chia-yu.chang@nokia-bell-labs.com>; netdev@vger.kernel.org; dsahern@gmail.com; davem@davemloft.net; dsahern@kernel.org; pabeni@redhat.com; joel.granados@kernel.org; kuba@kernel.org; andrew+netdev@lunn.ch; horms@kernel.org; pablo@netfilter.org; kadlec@netfilter.org; netfilter-devel@vger.kernel.org; coreteam@netfilter.org; ncardwell@google.com; Koen De Schepper (Nokia) <koen.de_schepper@nokia-bell-labs.com>; g.white@cablelabs.com; ingemar.s.johansson@ericsson.com; mirja.kuehlewind@ericsson.com; cheshire@apple.com; rs.ietf@gmx.at; Jason_Livingood@comcast.com; vidhi_goel@apple.com
> > >Subject: Re: [PATCH v5 net-next 09/13] gro: prevent ACE field corruption & better AccECN handling
> > >
> > >On Thu, 7 Nov 2024, Eric Dumazet wrote:
> > >
> > >>On Tue, Nov 5, 2024 at 11:07 AM <chia-yu.chang@nokia-bell-labs.com> wrote:
> > >>>
> > >>> From: Ilpo Järvinen <ij@kernel.org>
> > >>>
> > >>> There are important differences in how the CWR field behaves in
> > >>> RFC3168 and AccECN. With AccECN, CWR flag is part of the ACE counter
> > >>> and its changes are important so adjust the flags changed mask
> > >>> accordingly.
> > >>>
> > >>> Also, if CWR is there, set the Accurate ECN GSO flag to avoid
> > >>> corrupting CWR flag somewhere.
> > >>>
> > >>> Signed-off-by: Ilpo Järvinen <ij@kernel.org>
> > >>> Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
> > >>> ---
> > >>> net/ipv4/tcp_offload.c | 4 ++--
> > >>> 1 file changed, 2 insertions(+), 2 deletions(-)
> > >>>
> > >>> diff --git a/net/ipv4/tcp_offload.c b/net/ipv4/tcp_offload.c index
> > >>> 0b05f30e9e5f..f59762d88c38 100644
> > >>> --- a/net/ipv4/tcp_offload.c
> > >>> +++ b/net/ipv4/tcp_offload.c
> > >>> @@ -329,7 +329,7 @@ struct sk_buff *tcp_gro_receive(struct list_head *head, struct sk_buff *skb,
> > >>> th2 = tcp_hdr(p);
> > >>> flush = (__force int)(flags & TCP_FLAG_CWR);
> > >>> flush |= (__force int)((flags ^ tcp_flag_word(th2)) &
> > >>> - ~(TCP_FLAG_CWR | TCP_FLAG_FIN | TCP_FLAG_PSH));
> > >>> + ~(TCP_FLAG_FIN | TCP_FLAG_PSH));
> > >>> flush |= (__force int)(th->ack_seq ^ th2->ack_seq);
> > >>> for (i = sizeof(*th); i < thlen; i += 4)
> > >>> flush |= *(u32 *)((u8 *)th + i) ^
> > >>> @@ -405,7 +405,7 @@ void tcp_gro_complete(struct sk_buff *skb)
> > >>> shinfo->gso_segs = NAPI_GRO_CB(skb)->count;
> > >>>
> > >>> if (th->cwr)
> > >>> - shinfo->gso_type |= SKB_GSO_TCP_ECN;
> > >>> + shinfo->gso_type |= SKB_GSO_TCP_ACCECN;
> > >>> }
> > >>> EXPORT_SYMBOL(tcp_gro_complete);
> > >>>
> > >>
> > >> I do not really understand this patch. How a GRO engine can know which
> > >> ECN variant the peers are using ?
> > >
> > >Hi Eric,
> > >
> > >Thanks for taking a look.
> > >
> > >GRO doesn't know. Setting SKB_GSO_TCP_ECN in case of not knowing can result in header change that corrupts ACE field. Thus, GRO has to assume the RFC3168 CWR offloading trick cannot be used anymore (unless it tracks the connection and knows the bits are used for RFC3168 which is something nobody is going to do).
> > >
> > >The main point of SKB_GSO_TCP_ACCECN is to prevent SKB_GSO_TCP_ECN or NETIF_F_TSO_ECN offloading to be used for the skb as it would corrupt ACE field value.
> >
> > Hi Eric and Ilpo,
> >
> > From my understanding of another email thread (patch 08/13), it seems the conclusions that SKB_GSO_TCP_ACCECN will still be needed.
> >
> > >
> > >SKB_GSO_TCP_ACCECN doesn't allow CWR bits change within a super-skb but the same CWR flag should be repeated for all segments. In a sense, it's simpler than RFC3168 offloading.
> > >
> > >> SKB_GSO_TCP_ECN is also used from other points, what is your plan ?
> > >>
> > >> git grep -n SKB_GSO_TCP_ECN
> > >> drivers/net/ethernet/hisilicon/hns3/hns3_enet.c:3888:
> > >> skb_shinfo(skb)->gso_type |= SKB_GSO_TCP_ECN;
> > >> drivers/net/ethernet/mellanox/mlx5/core/en_rx.c:1291:
> > >> skb_shinfo(skb)->gso_type |= SKB_GSO_TCP_ECN;
> > >> drivers/net/ethernet/mellanox/mlx5/core/en_rx.c:1312:
> > >> skb_shinfo(skb)->gso_type |= SKB_GSO_TCP_ECN;
> > >
> > >These looked like they should be just changed to set SKB_GSO_TCP_ACCECN instead.
> >
> > I agree with these changes and will apply them in the next version.
> >
> > >
> > >I don't anymore recall why I didn't change those when I made this patch long time ago, perhaps it was just an oversight or things have changed somehow since then.
> > >
> > >> include/linux/netdevice.h:5061: BUILD_BUG_ON(SKB_GSO_TCP_ECN !=
> > >> (NETIF_F_TSO_ECN >> NETIF_F_GSO_SHIFT));
> > >> include/linux/skbuff.h:664: SKB_GSO_TCP_ECN = 1 << 2,
> > >
> > >Not relevant.
> > >
> > >> include/linux/virtio_net.h:88: gso_type |= SKB_GSO_TCP_ECN;
> > >> include/linux/virtio_net.h:161: switch (gso_type & ~SKB_GSO_TCP_ECN) {
> > >> include/linux/virtio_net.h:226: if (sinfo->gso_type & SKB_GSO_TCP_ECN)
> > >
> > >These need to be looked further what's going on as UAPI is also
> > > involved here.
> >
> > I had a look at these parts, and only the 1st and 3rd ones are relevant.
> > Related to the 1st one, I propose to change
> > from
> >
> > if (hdr->gso_type & VIRTIO_NET_HDR_GSO_ECN)
> > gso_type |= SKB_GSO_TCP_ECN;
> >
> > to
> >
> > if (hdr->gso_type & VIRTIO_NET_HDR_GSO_ECN)
> > gso_type |= SKB_GSO_TCP_ACCECN;
> >
> > The reason behind this proposed change is similar as the above changes
> > in en_rx.c.
>
> But en_rx.c is based one CWR flag, there's no similar check here.
>
> > For the 3rd one, I suggest to change from
> >
> > if (sinfo->gso_type & SKB_GSO_TCP_ECN)
> > hdr->gso_type |= VIRTIO_NET_HDR_GSO_ECN;
> >
> > to
> >
> > if (sinfo->gso_type & (SKB_GSO_TCP_ECN | SKB_GSO_TCP_ACCECN))
> > hdr->gso_type |= VIRTIO_NET_HDR_GSO_ECN;
> >
> > This proposed change is because VIRTIO_NET_HDR_GSO_ECN must be set to
> > allow TCP packets requiring segmentation offload which have ECN bit set.
> > So, no matter whether skb gso_type have GSO_TCP_ECN flag or
> > GSO_TCP_ACCECN flag, the corresponding hdr gso_type shall be set.
> >
> > But, I wonder what would the driver do when with VIRTIO_NET_HDR_GSO_ECN
> > flag. Will it corrupts CWR or not?
>
> I'm starting to heavily question what is the meaning of that
> VIRTIO_NET_HDR_GSO_ECN flag and if it's even consistent...
>
> https://github.com/qemu/qemu/blob/master/net/eth.c
>
> That sets VIRTIO_NET_HDR_GSO_ECN based on CE?? (Whereas kernel associates
> the related SKB_GSO_TCP_ECN to CWR offloading.)
>
> The virtio spec is way too vague to be useful so it would not be
> surprising if there are diverging interpretations from implementers:
>
> "If the driver negotiated the VIRTIO_NET_F_HOST_ECN feature, the
> VIRTIO_NET_HDR_GSO_ECN bit in gso_type indicates that the TCP packet has
> the ECN bit set."
>
> What is "the ECN bit" in terms used by TCP (or IP)? Could some virtio
> folks please explain?
According to the current Linux implementation in virtio_net_hdr_to_skb():
if (hdr->gso_type & VIRTIO_NET_HDR_GSO_ECN)
gso_type |= SKB_GSO_TCP_ECN;
It maps to SKB_GSO_TCP_ECN, which is:
/* This indicates the tcp segment has CWR set. */
SKB_GSO_TCP_ECN = 1 << 2,
Thanks
>
>
> --
> i.
^ permalink raw reply [flat|nested] 37+ messages in thread
* RE: [PATCH v5 net-next 09/13] gro: prevent ACE field corruption & better AccECN handling
2024-11-13 1:42 ` Jason Wang
@ 2024-11-14 1:04 ` Chia-Yu Chang (Nokia)
0 siblings, 0 replies; 37+ messages in thread
From: Chia-Yu Chang (Nokia) @ 2024-11-14 1:04 UTC (permalink / raw)
To: Jason Wang, Ilpo Järvinen
Cc: Michael S. Tsirkin, virtualization@lists.linux.dev, Eric Dumazet,
netdev@vger.kernel.org, dsahern@gmail.com, davem@davemloft.net,
dsahern@kernel.org, pabeni@redhat.com, joel.granados@kernel.org,
kuba@kernel.org, andrew+netdev@lunn.ch, horms@kernel.org,
pablo@netfilter.org, kadlec@netfilter.org,
netfilter-devel@vger.kernel.org, coreteam@netfilter.org,
ncardwell@google.com, Koen De Schepper (Nokia),
g.white@cablelabs.com, ingemar.s.johansson@ericsson.com,
mirja.kuehlewind@ericsson.com, cheshire@apple.com, rs.ietf@gmx.at,
Jason_Livingood@comcast.com, vidhi_goel@apple.com
>From: Jason Wang <jasowang@redhat.com>
>Sent: Wednesday, November 13, 2024 2:43 AM
>To: Ilpo Järvinen <ij@kernel.org>
>Cc: Chia-Yu Chang (Nokia) <chia-yu.chang@nokia-bell-labs.com>; Michael S. Tsirkin <mst@redhat.com>; virtualization@lists.linux.dev; Eric Dumazet <edumazet@google.com>; netdev@vger.kernel.org; dsahern@gmail.com; davem@davemloft.net; dsahern@kernel.org; pabeni@redhat.com; joel.granados@kernel.org; kuba@kernel.org; andrew+netdev@lunn.ch; horms@kernel.org; pablo@netfilter.org; kadlec@netfilter.org; netfilter-devel@vger.kernel.org; coreteam@netfilter.org; ncardwell@google.com; Koen De Schepper (Nokia) <koen.de_schepper@nokia-bell-labs.com>; g.white@cablelabs.com; ingemar.s.johansson@ericsson.com; mirja.kuehlewind@ericsson.com; cheshire@apple.com; rs.ietf@gmx.at; Jason_Livingood@comcast.com; vidhi_goel@apple.com
>Subject: Re: [PATCH v5 net-next 09/13] gro: prevent ACE field corruption & better AccECN handling
>
>On Wed, Nov 13, 2024 at 5:19 AM Ilpo Järvinen <ij@kernel.org> wrote:
>>
>> Adding a few virtio people. Please see the virtio spec/flag question
>> below.
>>
>> On Tue, 12 Nov 2024, Chia-Yu Chang (Nokia) wrote:
>>
>> > >From: Ilpo Järvinen <ij@kernel.org>
>> > >Sent: Thursday, November 7, 2024 8:28 PM
>> > >To: Eric Dumazet <edumazet@google.com>
>> > >Cc: Chia-Yu Chang (Nokia) <chia-yu.chang@nokia-bell-labs.com>;
>> > >netdev@vger.kernel.org; dsahern@gmail.com; davem@davemloft.net;
>> > >dsahern@kernel.org; pabeni@redhat.com; joel.granados@kernel.org;
>> > >kuba@kernel.org; andrew+netdev@lunn.ch; horms@kernel.org;
>> > >pablo@netfilter.org; kadlec@netfilter.org;
>> > >netfilter-devel@vger.kernel.org; coreteam@netfilter.org;
>> > >ncardwell@google.com; Koen De Schepper (Nokia)
>> > ><koen.de_schepper@nokia-bell-labs.com>; g.white@cablelabs.com;
>> > >ingemar.s.johansson@ericsson.com; mirja.kuehlewind@ericsson.com;
>> > >cheshire@apple.com; rs.ietf@gmx.at; Jason_Livingood@comcast.com;
>> > >vidhi_goel@apple.com
>> > >Subject: Re: [PATCH v5 net-next 09/13] gro: prevent ACE field
>> > >corruption & better AccECN handling
>> > >
>> > >On Thu, 7 Nov 2024, Eric Dumazet wrote:
>> > >
>> > >>On Tue, Nov 5, 2024 at 11:07 AM <chia-yu.chang@nokia-bell-labs.com> wrote:
>> > >>>
>> > >>> From: Ilpo Järvinen <ij@kernel.org>
>> > >>>
>> > >>> There are important differences in how the CWR field behaves in
>> > >>> RFC3168 and AccECN. With AccECN, CWR flag is part of the ACE
>> > >>> counter and its changes are important so adjust the flags
>> > >>> changed mask accordingly.
>> > >>>
>> > >>> Also, if CWR is there, set the Accurate ECN GSO flag to avoid
>> > >>> corrupting CWR flag somewhere.
>> > >>>
>> > >>> Signed-off-by: Ilpo Järvinen <ij@kernel.org>
>> > >>> Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
>> > >>> ---
>> > >>> net/ipv4/tcp_offload.c | 4 ++--
>> > >>> 1 file changed, 2 insertions(+), 2 deletions(-)
>> > >>>
>> > >>> diff --git a/net/ipv4/tcp_offload.c b/net/ipv4/tcp_offload.c
>> > >>> index
>> > >>> 0b05f30e9e5f..f59762d88c38 100644
>> > >>> --- a/net/ipv4/tcp_offload.c
>> > >>> +++ b/net/ipv4/tcp_offload.c
>> > >>> @@ -329,7 +329,7 @@ struct sk_buff *tcp_gro_receive(struct list_head *head, struct sk_buff *skb,
>> > >>> th2 = tcp_hdr(p);
>> > >>> flush = (__force int)(flags & TCP_FLAG_CWR);
>> > >>> flush |= (__force int)((flags ^ tcp_flag_word(th2)) &
>> > >>> - ~(TCP_FLAG_CWR | TCP_FLAG_FIN | TCP_FLAG_PSH));
>> > >>> + ~(TCP_FLAG_FIN | TCP_FLAG_PSH));
>> > >>> flush |= (__force int)(th->ack_seq ^ th2->ack_seq);
>> > >>> for (i = sizeof(*th); i < thlen; i += 4)
>> > >>> flush |= *(u32 *)((u8 *)th + i) ^
>> > >>> @@ -405,7 +405,7 @@ void tcp_gro_complete(struct sk_buff *skb)
>> > >>> shinfo->gso_segs = NAPI_GRO_CB(skb)->count;
>> > >>>
>> > >>> if (th->cwr)
>> > >>> - shinfo->gso_type |= SKB_GSO_TCP_ECN;
>> > >>> + shinfo->gso_type |= SKB_GSO_TCP_ACCECN;
>> > >>> }
>> > >>> EXPORT_SYMBOL(tcp_gro_complete);
>> > >>>
>> > >>
>> > >> I do not really understand this patch. How a GRO engine can know
>> > >> which ECN variant the peers are using ?
>> > >
>> > >Hi Eric,
>> > >
>> > >Thanks for taking a look.
>> > >
>> > >GRO doesn't know. Setting SKB_GSO_TCP_ECN in case of not knowing can result in header change that corrupts ACE field. Thus, GRO has to assume the RFC3168 CWR offloading trick cannot be used anymore (unless it tracks the connection and knows the bits are used for RFC3168 which is something nobody is going to do).
>> > >
>> > >The main point of SKB_GSO_TCP_ACCECN is to prevent SKB_GSO_TCP_ECN or NETIF_F_TSO_ECN offloading to be used for the skb as it would corrupt ACE field value.
>> >
>> > Hi Eric and Ilpo,
>> >
>> > From my understanding of another email thread (patch 08/13), it seems the conclusions that SKB_GSO_TCP_ACCECN will still be needed.
>> >
>> > >
>> > >SKB_GSO_TCP_ACCECN doesn't allow CWR bits change within a super-skb but the same CWR flag should be repeated for all segments. In a sense, it's simpler than RFC3168 offloading.
>> > >
>> > >> SKB_GSO_TCP_ECN is also used from other points, what is your plan ?
>> > >>
>> > >> git grep -n SKB_GSO_TCP_ECN
>> > >> drivers/net/ethernet/hisilicon/hns3/hns3_enet.c:3888:
>> > >> skb_shinfo(skb)->gso_type |= SKB_GSO_TCP_ECN;
>> > >> drivers/net/ethernet/mellanox/mlx5/core/en_rx.c:1291:
>> > >> skb_shinfo(skb)->gso_type |= SKB_GSO_TCP_ECN;
>> > >> drivers/net/ethernet/mellanox/mlx5/core/en_rx.c:1312:
>> > >> skb_shinfo(skb)->gso_type |= SKB_GSO_TCP_ECN;
>> > >
>> > >These looked like they should be just changed to set SKB_GSO_TCP_ACCECN instead.
>> >
>> > I agree with these changes and will apply them in the next version.
>> >
>> > >
>> > >I don't anymore recall why I didn't change those when I made this patch long time ago, perhaps it was just an oversight or things have changed somehow since then.
>> > >
>> > >> include/linux/netdevice.h:5061: BUILD_BUG_ON(SKB_GSO_TCP_ECN !=
>> > >> (NETIF_F_TSO_ECN >> NETIF_F_GSO_SHIFT));
>> > >> include/linux/skbuff.h:664: SKB_GSO_TCP_ECN = 1 << 2,
>> > >
>> > >Not relevant.
>> > >
>> > >> include/linux/virtio_net.h:88: gso_type |= SKB_GSO_TCP_ECN;
>> > >> include/linux/virtio_net.h:161: switch (gso_type & ~SKB_GSO_TCP_ECN) {
>> > >> include/linux/virtio_net.h:226: if (sinfo->gso_type & SKB_GSO_TCP_ECN)
>> > >
>> > >These need to be looked further what's going on as UAPI is also
>> > >involved here.
>> >
>> > I had a look at these parts, and only the 1st and 3rd ones are relevant.
>> > Related to the 1st one, I propose to change from
>> >
>> > if (hdr->gso_type & VIRTIO_NET_HDR_GSO_ECN)
>> > gso_type |= SKB_GSO_TCP_ECN;
>> >
>> > to
>> >
>> > if (hdr->gso_type & VIRTIO_NET_HDR_GSO_ECN)
>> > gso_type |= SKB_GSO_TCP_ACCECN;
>> >
>> > The reason behind this proposed change is similar as the above
>> > changes in en_rx.c.
>>
>> But en_rx.c is based one CWR flag, there's no similar check here.
>>
>> > For the 3rd one, I suggest to change from
>> >
>> > if (sinfo->gso_type & SKB_GSO_TCP_ECN)
>> > hdr->gso_type |= VIRTIO_NET_HDR_GSO_ECN;
>> >
>> > to
>> >
>> > if (sinfo->gso_type & (SKB_GSO_TCP_ECN | SKB_GSO_TCP_ACCECN))
>> > hdr->gso_type |= VIRTIO_NET_HDR_GSO_ECN;
>> >
>> > This proposed change is because VIRTIO_NET_HDR_GSO_ECN must be set
>> > to allow TCP packets requiring segmentation offload which have ECN bit set.
>> > So, no matter whether skb gso_type have GSO_TCP_ECN flag or
>> > GSO_TCP_ACCECN flag, the corresponding hdr gso_type shall be set.
>> >
>> > But, I wonder what would the driver do when with
>> > VIRTIO_NET_HDR_GSO_ECN flag. Will it corrupts CWR or not?
>>
>> I'm starting to heavily question what is the meaning of that
>> VIRTIO_NET_HDR_GSO_ECN flag and if it's even consistent...
>>
>> https://github.com/qemu/qemu/blob/master/net/eth.c
>>
>> That sets VIRTIO_NET_HDR_GSO_ECN based on CE?? (Whereas kernel
>> associates the related SKB_GSO_TCP_ECN to CWR offloading.)
>>
>> The virtio spec is way too vague to be useful so it would not be
>> surprising if there are diverging interpretations from implementers:
>>
>> "If the driver negotiated the VIRTIO_NET_F_HOST_ECN feature, the
>> VIRTIO_NET_HDR_GSO_ECN bit in gso_type indicates that the TCP packet
>> has the ECN bit set."
>>
>> What is "the ECN bit" in terms used by TCP (or IP)? Could some virtio
>> folks please explain?
>
>According to the current Linux implementation in virtio_net_hdr_to_skb():
>
> if (hdr->gso_type & VIRTIO_NET_HDR_GSO_ECN)
> gso_type |= SKB_GSO_TCP_ECN;
>
>It mapps to SKB_GSO_TCP_ECN which is:
>
> /* This indicates the tcp segment has CWR set. */
> SKB_GSO_TCP_ECN = 1 << 2,
>
>Thanks
This patch series adds the new SKB_GSO_TCP_ACCECN flag.
The reason is that the CWR field is handled differently by RFC3168 and AccECN.
You can find the original patches below:
https://lore.kernel.org/netdev/20241105100647.117346-9-chia-yu.chang@nokia-bell-labs.com/
https://lore.kernel.org/netdev/20241105100647.117346-10-chia-yu.chang@nokia-bell-labs.com/
When reviewing all occurrences of SKB_GSO_TCP_ECN, we see it is related to VIRTIO_NET_HDR_GSO_ECN.
But after checking the virtio spec, the meaning of VIRTIO_NET_HDR_GSO_ECN is quite unclear:
"If the driver negotiated the VIRTIO_NET_F_HOST_ECN feature, the VIRTIO_NET_HDR_GSO_ECN bit in gso_type indicates that the TCP packet has the ECN bit set."
In virtnet_probe(), it looks like VIRTIO_NET_F_HOST_ECN is used to enable NETIF_F_TSO_ECN support:

    if (virtio_has_feature(vdev, VIRTIO_NET_F_HOST_ECN))
        dev->hw_features |= NETIF_F_TSO_ECN;

So the question is: can its TSO engine be told not to touch the CWR flag even when NETIF_F_TSO_ECN is set?
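
One side note from my reading of patch 08/13 (not something stated in the
series): since NETIF_F_GSO_ACCECN is part of NETIF_F_GSO_SOFTWARE, a device
that never advertises the new bit would simply have AccECN super-skbs
segmented in software before they reach it, and the software tcp_gso_segment()
keeps th->cwr for them. A rough sketch of the check the stack effectively
performs (the helper name is made up):

#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* Hypothetical helper: true when a GSO skb cannot be handed to the device
 * unsplit and will therefore be segmented by the software GSO path.
 */
static bool sketch_needs_sw_segmentation(const struct net_device *dev,
					 const struct sk_buff *skb)
{
	/* net_gso_ok() is the feature check extended in patch 08/13; without
	 * NETIF_F_GSO_ACCECN in dev->features, an skb carrying
	 * SKB_GSO_TCP_ACCECN fails it and falls back to software GSO.
	 */
	return skb_is_gso(skb) &&
	       !net_gso_ok(dev->features, skb_shinfo(skb)->gso_type);
}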
^ permalink raw reply [flat|nested] 37+ messages in thread
end of thread, other threads:[~2024-11-14 1:04 UTC | newest]
Thread overview: 37+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-11-05 10:06 [PATCH v5 net-next 00/13] AccECN protocol preparation patch series chia-yu.chang
2024-11-05 10:06 ` [PATCH v5 net-next 01/13] tcp: reorganize tcp_in_ack_event() and tcp_count_delivered() chia-yu.chang
2024-11-07 9:31 ` Eric Dumazet
2024-11-07 19:32 ` Ilpo Järvinen
2024-11-05 10:06 ` [PATCH v5 net-next 02/13] tcp: create FLAG_TS_PROGRESS chia-yu.chang
2024-11-07 9:34 ` Eric Dumazet
2024-11-05 10:06 ` [PATCH v5 net-next 03/13] tcp: use BIT() macro in include/net/tcp.h chia-yu.chang
2024-11-07 10:20 ` Eric Dumazet
2024-11-05 10:06 ` [PATCH v5 net-next 04/13] tcp: extend TCP flags to allow AE bit/ACE field chia-yu.chang
2024-11-07 10:23 ` Eric Dumazet
2024-11-05 10:06 ` [PATCH v5 net-next 05/13] tcp: reorganize SYN ECN code chia-yu.chang
2024-11-07 10:25 ` Eric Dumazet
2024-11-05 10:06 ` [PATCH v5 net-next 06/13] tcp: rework {__,}tcp_ecn_check_ce() -> tcp_data_ecn_check() chia-yu.chang
2024-11-07 10:30 ` Eric Dumazet
2024-11-05 10:06 ` [PATCH v5 net-next 07/13] tcp: helpers for ECN mode handling chia-yu.chang
2024-11-07 12:26 ` Eric Dumazet
2024-11-05 10:06 ` [PATCH v5 net-next 08/13] gso: AccECN support chia-yu.chang
2024-11-07 12:41 ` Eric Dumazet
2024-11-09 10:09 ` Ilpo Järvinen
2024-11-09 19:42 ` Neal Cardwell
2024-11-09 21:38 ` Ilpo Järvinen
2024-11-05 10:06 ` [PATCH v5 net-next 09/13] gro: prevent ACE field corruption & better AccECN handling chia-yu.chang
2024-11-07 12:40 ` Eric Dumazet
2024-11-07 19:28 ` Ilpo Järvinen
2024-11-12 2:09 ` Chia-Yu Chang (Nokia)
2024-11-12 21:19 ` Ilpo Järvinen
2024-11-13 1:42 ` Jason Wang
2024-11-14 1:04 ` Chia-Yu Chang (Nokia)
2024-11-05 10:06 ` [PATCH v5 net-next 10/13] tcp: AccECN support to tcp_add_backlog chia-yu.chang
2024-11-07 12:42 ` Eric Dumazet
2024-11-05 10:06 ` [PATCH v5 net-next 11/13] tcp: allow ECN bits in TOS/traffic class chia-yu.chang
2024-11-07 12:55 ` Eric Dumazet
2024-11-05 10:06 ` [PATCH v5 net-next 12/13] tcp: Pass flags to __tcp_send_ack chia-yu.chang
2024-11-07 12:58 ` Eric Dumazet
2024-11-05 10:06 ` [PATCH v5 net-next 13/13] tcp: fast path functions later chia-yu.chang
2024-11-07 13:00 ` Eric Dumazet
2024-11-07 20:38 ` Chia-Yu Chang (Nokia)