* [PATCH v4 net-next 00/14] AccECN protocol preparation patch series
@ 2024-10-21 21:58 chia-yu.chang
2024-10-21 21:58 ` [PATCH v4 net-next 01/14] tcp: reorganize tcp_in_ack_event() and tcp_count_delivered() chia-yu.chang
` (13 more replies)
0 siblings, 14 replies; 24+ messages in thread
From: chia-yu.chang @ 2024-10-21 21:58 UTC (permalink / raw)
To: netdev, davem, edumazet, kuba, pabeni, dsahern, netfilter-devel,
kadlec, coreteam, pablo, bpf, joel.granados, linux-fsdevel, kees,
mcgrof, ij, ncardwell, koen.de_schepper, g.white,
ingemar.s.johansson, mirja.kuehlewind, cheshire, rs.ietf,
Jason_Livingood, vidhi_goel
Cc: Chia-Yu Chang
From: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Hello,
Specific changes in this version:
- Fix line length warnings in patches 02, 04, 08, 10, 11, 14
- Fix the checkpatch warning "spaces preferred around that '|' (ctx:VxV)" in patch 07
- Add missing Cc entries to patches 04, 12, 14
This updated patch series groups the preparatory changes for the AccECN
protocol and is part of the full AccECN patch series.
The full patch series can be found at
https://github.com/L4STeam/linux-net-next/commits/upstream_l4steam/
The Accurate ECN draft can be found at
https://datatracker.ietf.org/doc/html/draft-ietf-tcpm-accurate-ecn-28
--
Chia-Yu
Chia-Yu Chang (2):
tcp: use BIT() macro in include/net/tcp.h
net: sysctl: introduce sysctl SYSCTL_FIVE
Ilpo Järvinen (12):
tcp: reorganize tcp_in_ack_event() and tcp_count_delivered()
tcp: create FLAG_TS_PROGRESS
tcp: extend TCP flags to allow AE bit/ACE field
tcp: reorganize SYN ECN code
tcp: rework {__,}tcp_ecn_check_ce() -> tcp_data_ecn_check()
tcp: helpers for ECN mode handling
gso: AccECN support
gro: prevent ACE field corruption & better AccECN handling
tcp: AccECN support to tcp_add_backlog
tcp: allow ECN bits in TOS/traffic class
tcp: Pass flags to __tcp_send_ack
tcp: fast path functions later
include/linux/netdev_features.h | 8 +-
include/linux/netdevice.h | 2 +
include/linux/skbuff.h | 2 +
include/linux/sysctl.h | 17 ++--
include/net/tcp.h | 133 +++++++++++++++++++++-----------
include/uapi/linux/tcp.h | 9 ++-
kernel/sysctl.c | 3 +-
net/ethtool/common.c | 1 +
net/ipv4/bpf_tcp_ca.c | 2 +-
net/ipv4/ip_output.c | 3 +-
net/ipv4/tcp.c | 2 +-
net/ipv4/tcp_dctcp.c | 2 +-
net/ipv4/tcp_dctcp.h | 2 +-
net/ipv4/tcp_input.c | 120 ++++++++++++++++------------
net/ipv4/tcp_ipv4.c | 29 +++++--
net/ipv4/tcp_minisocks.c | 6 +-
net/ipv4/tcp_offload.c | 10 ++-
net/ipv4/tcp_output.c | 23 +++---
net/ipv6/tcp_ipv6.c | 27 +++++--
net/netfilter/nf_log_syslog.c | 8 +-
20 files changed, 260 insertions(+), 149 deletions(-)
--
2.34.1
* [PATCH v4 net-next 01/14] tcp: reorganize tcp_in_ack_event() and tcp_count_delivered()
2024-10-21 21:58 [PATCH v4 net-next 00/14] AccECN protocol preparation patch series chia-yu.chang
@ 2024-10-21 21:58 ` chia-yu.chang
2024-10-21 21:58 ` [PATCH v4 net-next 02/14] tcp: create FLAG_TS_PROGRESS chia-yu.chang
` (12 subsequent siblings)
13 siblings, 0 replies; 24+ messages in thread
From: chia-yu.chang @ 2024-10-21 21:58 UTC (permalink / raw)
To: netdev, davem, edumazet, kuba, pabeni, dsahern, netfilter-devel,
kadlec, coreteam, pablo, bpf, joel.granados, linux-fsdevel, kees,
mcgrof, ij, ncardwell, koen.de_schepper, g.white,
ingemar.s.johansson, mirja.kuehlewind, cheshire, rs.ietf,
Jason_Livingood, vidhi_goel
Cc: Chia-Yu Chang
From: Ilpo Järvinen <ij@kernel.org>
- Move tcp_count_delivered() earlier and split tcp_count_delivered_ce()
out of it
- Move tcp_in_ack_event() later
- While at it, remove the inline from tcp_in_ack_event() and let
the compiler decide
Accurate ECN's heuristics do not know whether there will be an
ACE-field-based CE counter increase until after the rtx queue
has been processed. Only then is the number of ACKed
bytes/pkts available. As the presence of CE affects whether
FLAG_ECE is set, that information is not yet available to
tcp_in_ack_event() at the old location of the call.
Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
---
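A minimal userspace sketch of the flag translation that this patch moves
into tcp_in_ack_event(). The X_* constants and the function names are
hypothetical stand-ins, not the kernel's FLAG_* / CA_ACK_* values. The
sketch ORs each bit in so that the window-update bit is kept on the slow
path, as the removed code in tcp_ack() did with its trailing
"|= CA_ACK_WIN_UPDATE".

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bit values for illustration; the kernel defines its own. */
#define X_FLAG_WIN_UPDATE	0x0002
#define X_FLAG_ECE		0x0040
#define X_FLAG_SLOWPATH		0x0100

#define X_CA_ACK_SLOWPATH	(1u << 0)
#define X_CA_ACK_WIN_UPDATE	(1u << 1)
#define X_CA_ACK_ECE		(1u << 2)

/* Translate tcp_ack()-style internal flags into CA event flags. */
static inline uint32_t ack_event_flags(int flag)
{
	uint32_t ev = 0;

	if (flag & X_FLAG_WIN_UPDATE)
		ev |= X_CA_ACK_WIN_UPDATE;
	if (flag & X_FLAG_SLOWPATH) {
		/* OR, not assign, so the WIN_UPDATE bit set above survives */
		ev |= X_CA_ACK_SLOWPATH;
		if (flag & X_FLAG_ECE)
			ev |= X_CA_ACK_ECE;
	}
	return ev;
}

/* Usage check: a slow-path ACK carrying ECE and a window update. */
static inline void ack_event_flags_check(void)
{
	int flag = X_FLAG_SLOWPATH | X_FLAG_WIN_UPDATE | X_FLAG_ECE;

	assert(ack_event_flags(flag) ==
	       (X_CA_ACK_SLOWPATH | X_CA_ACK_WIN_UPDATE | X_CA_ACK_ECE));
	/* ECE is only reported together with the slow path */
	assert(ack_event_flags(X_FLAG_ECE) == 0);
}
```

Doing the translation in one place means the caller only has to finish
computing "flag" (including FLAG_ECE) before the single call is made.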
net/ipv4/tcp_input.c | 56 +++++++++++++++++++++++++-------------------
1 file changed, 32 insertions(+), 24 deletions(-)
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 2d844e1f867f..5a6f93148814 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -413,6 +413,20 @@ static bool tcp_ecn_rcv_ecn_echo(const struct tcp_sock *tp, const struct tcphdr
return false;
}
+static void tcp_count_delivered_ce(struct tcp_sock *tp, u32 ecn_count)
+{
+ tp->delivered_ce += ecn_count;
+}
+
+/* Updates the delivered and delivered_ce counts */
+static void tcp_count_delivered(struct tcp_sock *tp, u32 delivered,
+ bool ece_ack)
+{
+ tp->delivered += delivered;
+ if (ece_ack)
+ tcp_count_delivered_ce(tp, delivered);
+}
+
/* Buffer size and advertised window tuning.
*
* 1. Tuning sk->sk_sndbuf, when connection enters established state.
@@ -1148,15 +1162,6 @@ void tcp_mark_skb_lost(struct sock *sk, struct sk_buff *skb)
}
}
-/* Updates the delivered and delivered_ce counts */
-static void tcp_count_delivered(struct tcp_sock *tp, u32 delivered,
- bool ece_ack)
-{
- tp->delivered += delivered;
- if (ece_ack)
- tp->delivered_ce += delivered;
-}
-
/* This procedure tags the retransmission queue when SACKs arrive.
*
* We have three tag bits: SACKED(S), RETRANS(R) and LOST(L).
@@ -3856,12 +3861,23 @@ static void tcp_process_tlp_ack(struct sock *sk, u32 ack, int flag)
}
}
-static inline void tcp_in_ack_event(struct sock *sk, u32 flags)
+static void tcp_in_ack_event(struct sock *sk, int flag)
{
const struct inet_connection_sock *icsk = inet_csk(sk);
- if (icsk->icsk_ca_ops->in_ack_event)
- icsk->icsk_ca_ops->in_ack_event(sk, flags);
+ if (icsk->icsk_ca_ops->in_ack_event) {
+ u32 ack_ev_flags = 0;
+
+ if (flag & FLAG_WIN_UPDATE)
+ ack_ev_flags |= CA_ACK_WIN_UPDATE;
+ if (flag & FLAG_SLOWPATH) {
+ ack_ev_flags |= CA_ACK_SLOWPATH;
+ if (flag & FLAG_ECE)
+ ack_ev_flags |= CA_ACK_ECE;
+ }
+
+ icsk->icsk_ca_ops->in_ack_event(sk, ack_ev_flags);
+ }
}
/* Congestion control has updated the cwnd already. So if we're in
@@ -3978,12 +3994,8 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
tcp_snd_una_update(tp, ack);
flag |= FLAG_WIN_UPDATE;
- tcp_in_ack_event(sk, CA_ACK_WIN_UPDATE);
-
NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPHPACKS);
} else {
- u32 ack_ev_flags = CA_ACK_SLOWPATH;
-
if (ack_seq != TCP_SKB_CB(skb)->end_seq)
flag |= FLAG_DATA;
else
@@ -3995,19 +4007,12 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
flag |= tcp_sacktag_write_queue(sk, skb, prior_snd_una,
&sack_state);
- if (tcp_ecn_rcv_ecn_echo(tp, tcp_hdr(skb))) {
+ if (tcp_ecn_rcv_ecn_echo(tp, tcp_hdr(skb)))
flag |= FLAG_ECE;
- ack_ev_flags |= CA_ACK_ECE;
- }
if (sack_state.sack_delivered)
tcp_count_delivered(tp, sack_state.sack_delivered,
flag & FLAG_ECE);
-
- if (flag & FLAG_WIN_UPDATE)
- ack_ev_flags |= CA_ACK_WIN_UPDATE;
-
- tcp_in_ack_event(sk, ack_ev_flags);
}
/* This is a deviation from RFC3168 since it states that:
@@ -4034,6 +4039,8 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
tcp_rack_update_reo_wnd(sk, &rs);
+ tcp_in_ack_event(sk, flag);
+
if (tp->tlp_high_seq)
tcp_process_tlp_ack(sk, ack, flag);
@@ -4065,6 +4072,7 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
return 1;
no_queue:
+ tcp_in_ack_event(sk, flag);
/* If data was DSACKed, see if we can undo a cwnd reduction. */
if (flag & FLAG_DSACKING_ACK) {
tcp_fastretrans_alert(sk, prior_snd_una, num_dupack, &flag,
--
2.34.1
* [PATCH v4 net-next 02/14] tcp: create FLAG_TS_PROGRESS
2024-10-21 21:58 [PATCH v4 net-next 00/14] AccECN protocol preparation patch series chia-yu.chang
2024-10-21 21:58 ` [PATCH v4 net-next 01/14] tcp: reorganize tcp_in_ack_event() and tcp_count_delivered() chia-yu.chang
@ 2024-10-21 21:58 ` chia-yu.chang
2024-10-21 21:58 ` [PATCH v4 net-next 03/14] tcp: use BIT() macro in include/net/tcp.h chia-yu.chang
` (11 subsequent siblings)
13 siblings, 0 replies; 24+ messages in thread
From: chia-yu.chang @ 2024-10-21 21:58 UTC (permalink / raw)
To: netdev, davem, edumazet, kuba, pabeni, dsahern, netfilter-devel,
kadlec, coreteam, pablo, bpf, joel.granados, linux-fsdevel, kees,
mcgrof, ij, ncardwell, koen.de_schepper, g.white,
ingemar.s.johansson, mirja.kuehlewind, cheshire, rs.ietf,
Jason_Livingood, vidhi_goel
Cc: Chia-Yu Chang
From: Ilpo Järvinen <ij@kernel.org>
Whenever the timestamp advances, it declares progress, which
other parts of the stack can use to decide that
the ACK is the most recent one seen so far.
AccECN will use this flag when deciding whether to use the
ACK to update AccECN state.
Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
---
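A small userspace sketch of the "positive timestamp delta" test, assuming
32-bit timestamps that may wrap. ts_progress_flag() and the X_ prefix are
hypothetical names; only the value 0x40000 is taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

#define X_FLAG_TS_PROGRESS 0x40000 /* same value as FLAG_TS_PROGRESS */

static inline int ts_progress_flag(uint32_t rcv_tsval, uint32_t ts_recent)
{
	/*
	 * Unsigned subtraction wraps modulo 2^32; reading the result as
	 * signed gives the shortest distance between the two values.
	 * The conversion to int32_t is implementation-defined for large
	 * values in ISO C, but is two's complement on GCC and Clang.
	 */
	int32_t delta = (int32_t)(rcv_tsval - ts_recent);

	return delta > 0 ? X_FLAG_TS_PROGRESS : 0;
}

static inline void ts_progress_check(void)
{
	assert(ts_progress_flag(1001, 1000) == X_FLAG_TS_PROGRESS);
	assert(ts_progress_flag(1000, 1000) == 0);	/* no advance */
	/* timestamp wrapped past zero: still counts as progress */
	assert(ts_progress_flag(2, 0xfffffffeu) == X_FLAG_TS_PROGRESS);
	/* value older than ts_recent: no progress */
	assert(ts_progress_flag(0xfffffffeu, 2) == 0);
}
```

An equal timestamp yields no flag, so a duplicate of the most recent ACK
is not treated as progress.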
net/ipv4/tcp_input.c | 37 ++++++++++++++++++++++++++++---------
1 file changed, 28 insertions(+), 9 deletions(-)
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 5a6f93148814..3295ad329aef 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -102,6 +102,7 @@ int sysctl_tcp_max_orphans __read_mostly = NR_FILE;
#define FLAG_NO_CHALLENGE_ACK 0x8000 /* do not call tcp_send_challenge_ack() */
#define FLAG_ACK_MAYBE_DELAYED 0x10000 /* Likely a delayed ACK */
#define FLAG_DSACK_TLP 0x20000 /* DSACK for tail loss probe */
+#define FLAG_TS_PROGRESS 0x40000 /* Positive timestamp delta */
#define FLAG_ACKED (FLAG_DATA_ACKED|FLAG_SYN_ACKED)
#define FLAG_NOT_DUP (FLAG_DATA|FLAG_WIN_UPDATE|FLAG_ACKED)
@@ -3813,8 +3814,16 @@ static void tcp_store_ts_recent(struct tcp_sock *tp)
tp->rx_opt.ts_recent_stamp = ktime_get_seconds();
}
-static void tcp_replace_ts_recent(struct tcp_sock *tp, u32 seq)
+static int __tcp_replace_ts_recent(struct tcp_sock *tp, s32 tstamp_delta)
{
+ tcp_store_ts_recent(tp);
+ return tstamp_delta > 0 ? FLAG_TS_PROGRESS : 0;
+}
+
+static int tcp_replace_ts_recent(struct tcp_sock *tp, u32 seq)
+{
+ s32 delta;
+
if (tp->rx_opt.saw_tstamp && !after(seq, tp->rcv_wup)) {
/* PAWS bug workaround wrt. ACK frames, the PAWS discard
* extra check below makes sure this can only happen
@@ -3823,9 +3832,13 @@ static void tcp_replace_ts_recent(struct tcp_sock *tp, u32 seq)
* Not only, also it occurs for expired timestamps.
*/
- if (tcp_paws_check(&tp->rx_opt, 0))
- tcp_store_ts_recent(tp);
+ if (tcp_paws_check(&tp->rx_opt, 0)) {
+ delta = tp->rx_opt.rcv_tsval - tp->rx_opt.ts_recent;
+ return __tcp_replace_ts_recent(tp, delta);
+ }
}
+
+ return 0;
}
/* This routine deals with acks during a TLP episode and ends an episode by
@@ -3982,7 +3995,7 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
* is in window.
*/
if (flag & FLAG_UPDATE_TS_RECENT)
- tcp_replace_ts_recent(tp, TCP_SKB_CB(skb)->seq);
+ flag |= tcp_replace_ts_recent(tp, TCP_SKB_CB(skb)->seq);
if ((flag & (FLAG_SLOWPATH | FLAG_SND_UNA_ADVANCED)) ==
FLAG_SND_UNA_ADVANCED) {
@@ -6140,6 +6153,8 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb)
TCP_SKB_CB(skb)->seq == tp->rcv_nxt &&
!after(TCP_SKB_CB(skb)->ack_seq, tp->snd_nxt)) {
int tcp_header_len = tp->tcp_header_len;
+ s32 tstamp_delta = 0;
+ int flag = 0;
/* Timestamp header prediction: tcp_header_len
* is automatically equal to th->doff*4 due to pred_flags
@@ -6152,8 +6167,10 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb)
if (!tcp_parse_aligned_timestamp(tp, th))
goto slow_path;
+ tstamp_delta = tp->rx_opt.rcv_tsval -
+ tp->rx_opt.ts_recent;
/* If PAWS failed, check it more carefully in slow path */
- if ((s32)(tp->rx_opt.rcv_tsval - tp->rx_opt.ts_recent) < 0)
+ if (tstamp_delta < 0)
goto slow_path;
/* DO NOT update ts_recent here, if checksum fails
@@ -6173,12 +6190,13 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb)
if (tcp_header_len ==
(sizeof(struct tcphdr) + TCPOLEN_TSTAMP_ALIGNED) &&
tp->rcv_nxt == tp->rcv_wup)
- tcp_store_ts_recent(tp);
+ flag |= __tcp_replace_ts_recent(tp,
+ tstamp_delta);
/* We know that such packets are checksummed
* on entry.
*/
- tcp_ack(sk, skb, 0);
+ tcp_ack(sk, skb, flag);
__kfree_skb(skb);
tcp_data_snd_check(sk);
/* When receiving pure ack in fast path, update
@@ -6209,7 +6227,8 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb)
if (tcp_header_len ==
(sizeof(struct tcphdr) + TCPOLEN_TSTAMP_ALIGNED) &&
tp->rcv_nxt == tp->rcv_wup)
- tcp_store_ts_recent(tp);
+ flag |= __tcp_replace_ts_recent(tp,
+ tstamp_delta);
tcp_rcv_rtt_measure_ts(sk, skb);
@@ -6224,7 +6243,7 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb)
if (TCP_SKB_CB(skb)->ack_seq != tp->snd_una) {
/* Well, only one small jumplet in fast path... */
- tcp_ack(sk, skb, FLAG_DATA);
+ tcp_ack(sk, skb, flag | FLAG_DATA);
tcp_data_snd_check(sk);
if (!inet_csk_ack_scheduled(sk))
goto no_ack;
--
2.34.1
* [PATCH v4 net-next 03/14] tcp: use BIT() macro in include/net/tcp.h
2024-10-21 21:58 [PATCH v4 net-next 00/14] AccECN protocol preparation patch series chia-yu.chang
2024-10-21 21:58 ` [PATCH v4 net-next 01/14] tcp: reorganize tcp_in_ack_event() and tcp_count_delivered() chia-yu.chang
2024-10-21 21:58 ` [PATCH v4 net-next 02/14] tcp: create FLAG_TS_PROGRESS chia-yu.chang
@ 2024-10-21 21:58 ` chia-yu.chang
2024-10-21 21:59 ` [PATCH v4 net-next 04/14] tcp: extend TCP flags to allow AE bit/ACE field chia-yu.chang
` (10 subsequent siblings)
13 siblings, 0 replies; 24+ messages in thread
From: chia-yu.chang @ 2024-10-21 21:58 UTC (permalink / raw)
To: netdev, davem, edumazet, kuba, pabeni, dsahern, netfilter-devel,
kadlec, coreteam, pablo, bpf, joel.granados, linux-fsdevel, kees,
mcgrof, ij, ncardwell, koen.de_schepper, g.white,
ingemar.s.johansson, mirja.kuehlewind, cheshire, rs.ietf,
Jason_Livingood, vidhi_goel
Cc: Chia-Yu Chang
From: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Use BIT() macro for TCP flags field and TCP congestion control
flags that will be used by the congestion control algorithm.
No functional changes.
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Reviewed-by: Ilpo Järvinen <ij@kernel.org>
---
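A quick userspace check of the "No functional changes" claim, assuming
BIT(nr) expands to an unsigned long shift. X_BIT() and the helper names
are hypothetical stand-ins for illustration.

```c
#include <assert.h>

/* Stand-in for the kernel's BIT(); assumed to be an unsigned long shift. */
#define X_BIT(nr) (1UL << (nr))

/* The hex literals the patch removes, in bit order FIN..CWR. */
static const unsigned long x_old_tcphdr[8] = {
	0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80
};

/* Returns 1 when every X_BIT(i) equals the literal it replaces. */
static inline int bit_matches_old_values(void)
{
	for (unsigned int i = 0; i < 8; i++) {
		if (X_BIT(i) != x_old_tcphdr[i])
			return 0;
	}
	return 1;
}

static inline void bit_check(void)
{
	assert(bit_matches_old_values());
	/* values are unchanged, but the type widens from int to unsigned long */
	assert(sizeof(X_BIT(0)) == sizeof(unsigned long));
}
```

The values are identical; the only visible difference in this sketch is
the wider type of the constant, which matters only where a flag is
negated or stored without a mask.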
include/net/tcp.h | 21 +++++++++++----------
1 file changed, 11 insertions(+), 10 deletions(-)
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 739a9fb83d0c..bc34b450929c 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -26,6 +26,7 @@
#include <linux/kref.h>
#include <linux/ktime.h>
#include <linux/indirect_call_wrapper.h>
+#include <linux/bits.h>
#include <net/inet_connection_sock.h>
#include <net/inet_timewait_sock.h>
@@ -911,14 +912,14 @@ static inline u32 tcp_rsk_tsval(const struct tcp_request_sock *treq)
#define tcp_flag_byte(th) (((u_int8_t *)th)[13])
-#define TCPHDR_FIN 0x01
-#define TCPHDR_SYN 0x02
-#define TCPHDR_RST 0x04
-#define TCPHDR_PSH 0x08
-#define TCPHDR_ACK 0x10
-#define TCPHDR_URG 0x20
-#define TCPHDR_ECE 0x40
-#define TCPHDR_CWR 0x80
+#define TCPHDR_FIN BIT(0)
+#define TCPHDR_SYN BIT(1)
+#define TCPHDR_RST BIT(2)
+#define TCPHDR_PSH BIT(3)
+#define TCPHDR_ACK BIT(4)
+#define TCPHDR_URG BIT(5)
+#define TCPHDR_ECE BIT(6)
+#define TCPHDR_CWR BIT(7)
#define TCPHDR_SYN_ECN (TCPHDR_SYN | TCPHDR_ECE | TCPHDR_CWR)
@@ -1107,9 +1108,9 @@ enum tcp_ca_ack_event_flags {
#define TCP_CA_UNSPEC 0
/* Algorithm can be set on socket without CAP_NET_ADMIN privileges */
-#define TCP_CONG_NON_RESTRICTED 0x1
+#define TCP_CONG_NON_RESTRICTED BIT(0)
/* Requires ECN/ECT set on all packets */
-#define TCP_CONG_NEEDS_ECN 0x2
+#define TCP_CONG_NEEDS_ECN BIT(1)
#define TCP_CONG_MASK (TCP_CONG_NON_RESTRICTED | TCP_CONG_NEEDS_ECN)
union tcp_cc_info;
--
2.34.1
* [PATCH v4 net-next 04/14] tcp: extend TCP flags to allow AE bit/ACE field
2024-10-21 21:58 [PATCH v4 net-next 00/14] AccECN protocol preparation patch series chia-yu.chang
` (2 preceding siblings ...)
2024-10-21 21:58 ` [PATCH v4 net-next 03/14] tcp: use BIT() macro in include/net/tcp.h chia-yu.chang
@ 2024-10-21 21:59 ` chia-yu.chang
2024-10-29 11:43 ` Paolo Abeni
2024-10-21 21:59 ` [PATCH v4 net-next 05/14] tcp: reorganize SYN ECN code chia-yu.chang
` (9 subsequent siblings)
13 siblings, 1 reply; 24+ messages in thread
From: chia-yu.chang @ 2024-10-21 21:59 UTC (permalink / raw)
To: netdev, davem, edumazet, kuba, pabeni, dsahern, netfilter-devel,
kadlec, coreteam, pablo, bpf, joel.granados, linux-fsdevel, kees,
mcgrof, ij, ncardwell, koen.de_schepper, g.white,
ingemar.s.johansson, mirja.kuehlewind, cheshire, rs.ietf,
Jason_Livingood, vidhi_goel
Cc: Chia-Yu Chang
From: Ilpo Järvinen <ij@kernel.org>
With AccECN, there is one additional TCP flag (AE) and an
ACE field that overloads the definition of the AE, CWR, and
ECE flags. As tcp_flags was previously only 1 byte, byte-order
conversion needs to be added to its handling.
Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
---
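A userspace sketch of how the 9 flag bits sit in bytes 12-13 of the TCP
header. The function names and the X_ prefix are hypothetical. The bytes
are assembled by hand instead of calling ntohs(), which should give the
same host-order word without needing a network header.

```c
#include <assert.h>
#include <stdint.h>

#define X_TCPHDR_ECE		(1u << 6)
#define X_TCPHDR_CWR		(1u << 7)
#define X_TCPHDR_AE		(1u << 8)
#define X_TCPHDR_FLAGS_MASK	0x01ffu	/* FIN..CWR plus AE */
#define X_TCPHDR_ACE		(X_TCPHDR_ECE | X_TCPHDR_CWR | X_TCPHDR_AE)

/* Read header bytes 12-13 as a big-endian word and keep the 9 flag bits. */
static inline uint16_t flags_from_wire(const uint8_t *th)
{
	uint16_t word = (uint16_t)((th[12] << 8) | th[13]);

	return word & X_TCPHDR_FLAGS_MASK;
}

/* Host-order word for transmit: doff in the top nibble, flags below. */
static inline uint16_t flags_to_word(unsigned int doff_words, uint16_t flags)
{
	return (uint16_t)((doff_words << 12) | (flags & X_TCPHDR_FLAGS_MASK));
}

static inline void flags_check(void)
{
	uint8_t th[20] = { 0 };
	uint16_t flags;

	th[12] = 0x5f;	/* doff = 5, three reserved bits set, AE set */
	th[13] = 0x12;	/* SYN | ACK */
	flags = flags_from_wire(th);

	/* doff and reserved bits are masked off, AE is kept */
	assert(flags == (X_TCPHDR_AE | 0x12));
	/* the 3-bit ACE field is AE:CWR:ECE, here binary 100 */
	assert(((flags & X_TCPHDR_ACE) >> 6) == 4);
	assert(flags_to_word(5, flags) == 0x5112);
}
```

Masking on both receive and transmit keeps the data offset and the
remaining reserved bits out of the stored flags.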
include/net/tcp.h | 7 ++++++-
include/uapi/linux/tcp.h | 9 ++++++---
net/ipv4/tcp_ipv4.c | 3 ++-
net/ipv4/tcp_output.c | 8 ++++----
net/ipv6/tcp_ipv6.c | 3 ++-
net/netfilter/nf_log_syslog.c | 8 +++++---
6 files changed, 25 insertions(+), 13 deletions(-)
diff --git a/include/net/tcp.h b/include/net/tcp.h
index bc34b450929c..55a7f0a7ee59 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -920,7 +920,12 @@ static inline u32 tcp_rsk_tsval(const struct tcp_request_sock *treq)
#define TCPHDR_URG BIT(5)
#define TCPHDR_ECE BIT(6)
#define TCPHDR_CWR BIT(7)
+#define TCPHDR_AE BIT(8)
+#define TCPHDR_FLAGS_MASK (TCPHDR_FIN | TCPHDR_SYN | TCPHDR_RST | \
+ TCPHDR_PSH | TCPHDR_ACK | TCPHDR_URG | \
+ TCPHDR_ECE | TCPHDR_CWR | TCPHDR_AE)
+#define TCPHDR_ACE (TCPHDR_ECE | TCPHDR_CWR | TCPHDR_AE)
#define TCPHDR_SYN_ECN (TCPHDR_SYN | TCPHDR_ECE | TCPHDR_CWR)
/* State flags for sacked in struct tcp_skb_cb */
@@ -955,7 +960,7 @@ struct tcp_skb_cb {
u16 tcp_gso_size;
};
};
- __u8 tcp_flags; /* TCP header flags. (tcp[13]) */
+ __u16 tcp_flags; /* TCP header flags (tcp[12-13])*/
__u8 sacked; /* State flags for SACK. */
__u8 ip_dsfield; /* IPv4 tos or IPv6 dsfield */
diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
index dbf896f3146c..3fe08d7dddaf 100644
--- a/include/uapi/linux/tcp.h
+++ b/include/uapi/linux/tcp.h
@@ -28,7 +28,8 @@ struct tcphdr {
__be32 seq;
__be32 ack_seq;
#if defined(__LITTLE_ENDIAN_BITFIELD)
- __u16 res1:4,
+ __u16 ae:1,
+ res1:3,
doff:4,
fin:1,
syn:1,
@@ -40,7 +41,8 @@ struct tcphdr {
cwr:1;
#elif defined(__BIG_ENDIAN_BITFIELD)
__u16 doff:4,
- res1:4,
+ res1:3,
+ ae:1,
cwr:1,
ece:1,
urg:1,
@@ -70,6 +72,7 @@ union tcp_word_hdr {
#define tcp_flag_word(tp) (((union tcp_word_hdr *)(tp))->words[3])
enum {
+ TCP_FLAG_AE = __constant_cpu_to_be32(0x01000000),
TCP_FLAG_CWR = __constant_cpu_to_be32(0x00800000),
TCP_FLAG_ECE = __constant_cpu_to_be32(0x00400000),
TCP_FLAG_URG = __constant_cpu_to_be32(0x00200000),
@@ -78,7 +81,7 @@ enum {
TCP_FLAG_RST = __constant_cpu_to_be32(0x00040000),
TCP_FLAG_SYN = __constant_cpu_to_be32(0x00020000),
TCP_FLAG_FIN = __constant_cpu_to_be32(0x00010000),
- TCP_RESERVED_BITS = __constant_cpu_to_be32(0x0F000000),
+ TCP_RESERVED_BITS = __constant_cpu_to_be32(0x0E000000),
TCP_DATA_OFFSET = __constant_cpu_to_be32(0xF0000000)
};
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 9d3dd101ea71..9fe314a59240 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2162,7 +2162,8 @@ static void tcp_v4_fill_cb(struct sk_buff *skb, const struct iphdr *iph,
TCP_SKB_CB(skb)->end_seq = (TCP_SKB_CB(skb)->seq + th->syn + th->fin +
skb->len - th->doff * 4);
TCP_SKB_CB(skb)->ack_seq = ntohl(th->ack_seq);
- TCP_SKB_CB(skb)->tcp_flags = tcp_flag_byte(th);
+ TCP_SKB_CB(skb)->tcp_flags = ntohs(*(__be16 *)&tcp_flag_word(th)) &
+ TCPHDR_FLAGS_MASK;
TCP_SKB_CB(skb)->ip_dsfield = ipv4_get_dsfield(iph);
TCP_SKB_CB(skb)->sacked = 0;
TCP_SKB_CB(skb)->has_rxtstamp =
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 054244ce5117..45cb67c635be 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -400,7 +400,7 @@ static void tcp_ecn_send(struct sock *sk, struct sk_buff *skb,
/* Constructs common control bits of non-data skb. If SYN/FIN is present,
* auto increment end seqno.
*/
-static void tcp_init_nondata_skb(struct sk_buff *skb, u32 seq, u8 flags)
+static void tcp_init_nondata_skb(struct sk_buff *skb, u32 seq, u16 flags)
{
skb->ip_summed = CHECKSUM_PARTIAL;
@@ -1382,7 +1382,7 @@ static int __tcp_transmit_skb(struct sock *sk, struct sk_buff *skb,
th->seq = htonl(tcb->seq);
th->ack_seq = htonl(rcv_nxt);
*(((__be16 *)th) + 6) = htons(((tcp_header_size >> 2) << 12) |
- tcb->tcp_flags);
+ (tcb->tcp_flags & TCPHDR_FLAGS_MASK));
th->check = 0;
th->urg_ptr = 0;
@@ -1604,7 +1604,7 @@ int tcp_fragment(struct sock *sk, enum tcp_queue tcp_queue,
int old_factor;
long limit;
int nlen;
- u8 flags;
+ u16 flags;
if (WARN_ON(len > skb->len))
return -EINVAL;
@@ -2159,7 +2159,7 @@ static int tso_fragment(struct sock *sk, struct sk_buff *skb, unsigned int len,
{
int nlen = skb->len - len;
struct sk_buff *buff;
- u8 flags;
+ u16 flags;
/* All of a TSO frame must be composed of paged data. */
DEBUG_NET_WARN_ON_ONCE(skb->len != skb->data_len);
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 597920061a3a..252d3dac3a09 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1737,7 +1737,8 @@ static void tcp_v6_fill_cb(struct sk_buff *skb, const struct ipv6hdr *hdr,
TCP_SKB_CB(skb)->end_seq = (TCP_SKB_CB(skb)->seq + th->syn + th->fin +
skb->len - th->doff*4);
TCP_SKB_CB(skb)->ack_seq = ntohl(th->ack_seq);
- TCP_SKB_CB(skb)->tcp_flags = tcp_flag_byte(th);
+ TCP_SKB_CB(skb)->tcp_flags = ntohs(*(__be16 *)&tcp_flag_word(th)) &
+ TCPHDR_FLAGS_MASK;
TCP_SKB_CB(skb)->ip_dsfield = ipv6_get_dsfield(hdr);
TCP_SKB_CB(skb)->sacked = 0;
TCP_SKB_CB(skb)->has_rxtstamp =
diff --git a/net/netfilter/nf_log_syslog.c b/net/netfilter/nf_log_syslog.c
index 58402226045e..86d5fc5d28e3 100644
--- a/net/netfilter/nf_log_syslog.c
+++ b/net/netfilter/nf_log_syslog.c
@@ -216,7 +216,9 @@ nf_log_dump_tcp_header(struct nf_log_buf *m,
/* Max length: 9 "RES=0x3C " */
nf_log_buf_add(m, "RES=0x%02x ", (u_int8_t)(ntohl(tcp_flag_word(th) &
TCP_RESERVED_BITS) >> 22));
- /* Max length: 32 "CWR ECE URG ACK PSH RST SYN FIN " */
+ /* Max length: 35 "AE CWR ECE URG ACK PSH RST SYN FIN " */
+ if (th->ae)
+ nf_log_buf_add(m, "AE ");
if (th->cwr)
nf_log_buf_add(m, "CWR ");
if (th->ece)
@@ -516,7 +518,7 @@ dump_ipv4_packet(struct net *net, struct nf_log_buf *m,
/* Proto Max log string length */
/* IP: 40+46+6+11+127 = 230 */
- /* TCP: 10+max(25,20+30+13+9+32+11+127) = 252 */
+ /* TCP: 10+max(25,20+30+13+9+35+11+127) = 255 */
/* UDP: 10+max(25,20) = 35 */
/* UDPLITE: 14+max(25,20) = 39 */
/* ICMP: 11+max(25, 18+25+max(19,14,24+3+n+10,3+n+10)) = 91+n */
@@ -526,7 +528,7 @@ dump_ipv4_packet(struct net *net, struct nf_log_buf *m,
/* (ICMP allows recursion one level deep) */
/* maxlen = IP + ICMP + IP + max(TCP,UDP,ICMP,unknown) */
- /* maxlen = 230+ 91 + 230 + 252 = 803 */
+ /* maxlen = 230+ 91 + 230 + 255 = 806 */
}
static noinline_for_stack void
--
2.34.1
* [PATCH v4 net-next 05/14] tcp: reorganize SYN ECN code
2024-10-21 21:58 [PATCH v4 net-next 00/14] AccECN protocol preparation patch series chia-yu.chang
` (3 preceding siblings ...)
2024-10-21 21:59 ` [PATCH v4 net-next 04/14] tcp: extend TCP flags to allow AE bit/ACE field chia-yu.chang
@ 2024-10-21 21:59 ` chia-yu.chang
2024-10-21 21:59 ` [PATCH v4 net-next 06/14] tcp: rework {__,}tcp_ecn_check_ce() -> tcp_data_ecn_check() chia-yu.chang
` (8 subsequent siblings)
13 siblings, 0 replies; 24+ messages in thread
From: chia-yu.chang @ 2024-10-21 21:59 UTC (permalink / raw)
To: netdev, davem, edumazet, kuba, pabeni, dsahern, netfilter-devel,
kadlec, coreteam, pablo, bpf, joel.granados, linux-fsdevel, kees,
mcgrof, ij, ncardwell, koen.de_schepper, g.white,
ingemar.s.johansson, mirja.kuehlewind, cheshire, rs.ietf,
Jason_Livingood, vidhi_goel
Cc: Chia-Yu Chang
From: Ilpo Järvinen <ij@kernel.org>
Prepare for AccECN, which needs access here to the IP ECN
field value, which is only available after INET_ECN_xmit().
No functional changes.
Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
---
net/ipv4/tcp_output.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 45cb67c635be..64d47c18255f 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -347,10 +347,11 @@ static void tcp_ecn_send_syn(struct sock *sk, struct sk_buff *skb)
tp->ecn_flags = 0;
if (use_ecn) {
- TCP_SKB_CB(skb)->tcp_flags |= TCPHDR_ECE | TCPHDR_CWR;
- tp->ecn_flags = TCP_ECN_OK;
if (tcp_ca_needs_ecn(sk) || bpf_needs_ecn)
INET_ECN_xmit(sk);
+
+ TCP_SKB_CB(skb)->tcp_flags |= TCPHDR_ECE | TCPHDR_CWR;
+ tp->ecn_flags = TCP_ECN_OK;
}
}
--
2.34.1
* [PATCH v4 net-next 06/14] tcp: rework {__,}tcp_ecn_check_ce() -> tcp_data_ecn_check()
2024-10-21 21:58 [PATCH v4 net-next 00/14] AccECN protocol preparation patch series chia-yu.chang
` (4 preceding siblings ...)
2024-10-21 21:59 ` [PATCH v4 net-next 05/14] tcp: reorganize SYN ECN code chia-yu.chang
@ 2024-10-21 21:59 ` chia-yu.chang
2024-10-21 21:59 ` [PATCH v4 net-next 07/14] tcp: helpers for ECN mode handling chia-yu.chang
` (7 subsequent siblings)
13 siblings, 0 replies; 24+ messages in thread
From: chia-yu.chang @ 2024-10-21 21:59 UTC (permalink / raw)
To: netdev, davem, edumazet, kuba, pabeni, dsahern, netfilter-devel,
kadlec, coreteam, pablo, bpf, joel.granados, linux-fsdevel, kees,
mcgrof, ij, ncardwell, koen.de_schepper, g.white,
ingemar.s.johansson, mirja.kuehlewind, cheshire, rs.ietf,
Jason_Livingood, vidhi_goel
Cc: Chia-Yu Chang
From: Ilpo Järvinen <ij@kernel.org>
Rename tcp_ecn_check_ce() to tcp_data_ecn_check() as it is
called only for data segments, not for ACKs (with AccECN,
ACKs may also get ECN bits).
The extra "layer" in the tcp_ecn_check_ce() function just
checks for ECN being enabled; that check can be moved into
tcp_data_ecn_check() rather than having the __ variant.
No functional changes.
Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
---
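A userspace sketch of the shape this rework gives the function: the
"is ECN enabled" test becomes an early return, followed by a switch on the
two ECN bits of the DS field. data_segment_has_ce() is a hypothetical,
much reduced stand-in for tcp_data_ecn_check(); the X_ECN_* values are
assumed to mirror the kernel's INET_ECN_* codepoints.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed to match the INET_ECN_* codepoints */
enum {
	X_ECN_NOT_ECT	= 0,
	X_ECN_ECT_1	= 1,
	X_ECN_ECT_0	= 2,
	X_ECN_CE	= 3,
	X_ECN_MASK	= 3,
};

/* True when a data segment carries a CE mark and ECN is in use. */
static inline bool data_segment_has_ce(bool ecn_enabled, uint8_t ip_dsfield)
{
	/* formerly the job of the separate wrapper function */
	if (!ecn_enabled)
		return false;

	switch (ip_dsfield & X_ECN_MASK) {
	case X_ECN_CE:
		return true;
	default:
		return false;
	}
}

static inline void data_ecn_check(void)
{
	assert(data_segment_has_ce(true, 0xb8 | X_ECN_CE));	/* DSCP EF + CE */
	assert(!data_segment_has_ce(true, 0xb8 | X_ECN_ECT_0));
	assert(!data_segment_has_ce(false, X_ECN_CE));		/* ECN off */
}
```

With the check folded in, callers need only one function name and the
"__" variant has no remaining users.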
net/ipv4/tcp_input.c | 15 ++++++---------
1 file changed, 6 insertions(+), 9 deletions(-)
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 3295ad329aef..6d4abd452a36 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -357,10 +357,13 @@ static void tcp_ecn_withdraw_cwr(struct tcp_sock *tp)
tp->ecn_flags &= ~TCP_ECN_QUEUE_CWR;
}
-static void __tcp_ecn_check_ce(struct sock *sk, const struct sk_buff *skb)
+static void tcp_data_ecn_check(struct sock *sk, const struct sk_buff *skb)
{
struct tcp_sock *tp = tcp_sk(sk);
+ if (!(tcp_sk(sk)->ecn_flags & TCP_ECN_OK))
+ return;
+
switch (TCP_SKB_CB(skb)->ip_dsfield & INET_ECN_MASK) {
case INET_ECN_NOT_ECT:
/* Funny extension: if ECT is not set on a segment,
@@ -389,12 +392,6 @@ static void __tcp_ecn_check_ce(struct sock *sk, const struct sk_buff *skb)
}
}
-static void tcp_ecn_check_ce(struct sock *sk, const struct sk_buff *skb)
-{
- if (tcp_sk(sk)->ecn_flags & TCP_ECN_OK)
- __tcp_ecn_check_ce(sk, skb);
-}
-
static void tcp_ecn_rcv_synack(struct tcp_sock *tp, const struct tcphdr *th)
{
if ((tp->ecn_flags & TCP_ECN_OK) && (!th->ece || th->cwr))
@@ -866,7 +863,7 @@ static void tcp_event_data_recv(struct sock *sk, struct sk_buff *skb)
icsk->icsk_ack.lrcvtime = now;
tcp_save_lrcv_flowlabel(sk, skb);
- tcp_ecn_check_ce(sk, skb);
+ tcp_data_ecn_check(sk, skb);
if (skb->len >= 128)
tcp_grow_window(sk, skb, true);
@@ -5028,7 +5025,7 @@ static void tcp_data_queue_ofo(struct sock *sk, struct sk_buff *skb)
bool fragstolen;
tcp_save_lrcv_flowlabel(sk, skb);
- tcp_ecn_check_ce(sk, skb);
+ tcp_data_ecn_check(sk, skb);
if (unlikely(tcp_try_rmem_schedule(sk, skb, skb->truesize))) {
NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPOFODROP);
--
2.34.1
* [PATCH v4 net-next 07/14] tcp: helpers for ECN mode handling
2024-10-21 21:58 [PATCH v4 net-next 00/14] AccECN protocol preparation patch series chia-yu.chang
` (5 preceding siblings ...)
2024-10-21 21:59 ` [PATCH v4 net-next 06/14] tcp: rework {__,}tcp_ecn_check_ce() -> tcp_data_ecn_check() chia-yu.chang
@ 2024-10-21 21:59 ` chia-yu.chang
2024-10-21 21:59 ` [PATCH v4 net-next 08/14] gso: AccECN support chia-yu.chang
` (6 subsequent siblings)
13 siblings, 0 replies; 24+ messages in thread
From: chia-yu.chang @ 2024-10-21 21:59 UTC (permalink / raw)
To: netdev, davem, edumazet, kuba, pabeni, dsahern, netfilter-devel,
kadlec, coreteam, pablo, bpf, joel.granados, linux-fsdevel, kees,
mcgrof, ij, ncardwell, koen.de_schepper, g.white,
ingemar.s.johansson, mirja.kuehlewind, cheshire, rs.ietf,
Jason_Livingood, vidhi_goel
Cc: Chia-Yu Chang
From: Ilpo Järvinen <ij@kernel.org>
Create helpers for TCP ECN modes. No functional changes.
Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
---
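A userspace sketch of the mode helpers, to show how the two mode bits
behave as one small field. struct x_sock and the x_ / X_ names are
hypothetical stand-ins for struct tcp_sock and the TCP_ECN_* macros; the
bit positions follow the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define X_ECN_MODE_RFC3168	(1u << 0)
#define X_ECN_QUEUE_CWR		(1u << 1)
#define X_ECN_MODE_ACCECN	(1u << 4)
#define X_ECN_DISABLED		0u
#define X_ECN_MODE_PENDING	(X_ECN_MODE_RFC3168 | X_ECN_MODE_ACCECN)
#define X_ECN_MODE_ANY		(X_ECN_MODE_RFC3168 | X_ECN_MODE_ACCECN)

struct x_sock { uint8_t ecn_flags; };	/* stand-in for struct tcp_sock */

static inline bool x_mode_any(const struct x_sock *tp)
{
	return tp->ecn_flags & X_ECN_MODE_ANY;
}

static inline bool x_mode_rfc3168(const struct x_sock *tp)
{
	return (tp->ecn_flags & X_ECN_MODE_ANY) == X_ECN_MODE_RFC3168;
}

static inline bool x_mode_accecn(const struct x_sock *tp)
{
	return (tp->ecn_flags & X_ECN_MODE_ANY) == X_ECN_MODE_ACCECN;
}

static inline bool x_mode_pending(const struct x_sock *tp)
{
	return (tp->ecn_flags & X_ECN_MODE_PENDING) == X_ECN_MODE_PENDING;
}

/* Replace only the mode bits; other state flags are left alone. */
static inline void x_mode_set(struct x_sock *tp, uint8_t mode)
{
	tp->ecn_flags = (uint8_t)((tp->ecn_flags & ~X_ECN_MODE_ANY) | mode);
}

static inline void x_mode_check(void)
{
	struct x_sock tp = { .ecn_flags = X_ECN_QUEUE_CWR };

	x_mode_set(&tp, X_ECN_MODE_RFC3168);
	assert(x_mode_rfc3168(&tp) && !x_mode_accecn(&tp));

	x_mode_set(&tp, X_ECN_MODE_PENDING);
	/* both bits set: neither exact-match helper is true */
	assert(x_mode_pending(&tp) && x_mode_any(&tp));
	assert(!x_mode_rfc3168(&tp) && !x_mode_accecn(&tp));

	x_mode_set(&tp, X_ECN_DISABLED);
	assert(!x_mode_any(&tp));
	assert(tp.ecn_flags == X_ECN_QUEUE_CWR);	/* untouched by mode_set */
}
```

Because the RFC 3168 and AccECN helpers compare against the full mode
mask, a connection in the pending state matches neither until the mode is
set to one of them.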
include/net/tcp.h | 44 ++++++++++++++++++++++++++++++++++++----
net/ipv4/tcp.c | 2 +-
net/ipv4/tcp_dctcp.c | 2 +-
net/ipv4/tcp_input.c | 14 ++++++-------
net/ipv4/tcp_minisocks.c | 4 +++-
net/ipv4/tcp_output.c | 6 +++---
6 files changed, 55 insertions(+), 17 deletions(-)
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 55a7f0a7ee59..b6a4e0124280 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -372,10 +372,46 @@ static inline void tcp_dec_quickack_mode(struct sock *sk)
}
}
-#define TCP_ECN_OK 1
-#define TCP_ECN_QUEUE_CWR 2
-#define TCP_ECN_DEMAND_CWR 4
-#define TCP_ECN_SEEN 8
+#define TCP_ECN_MODE_RFC3168 BIT(0)
+#define TCP_ECN_QUEUE_CWR BIT(1)
+#define TCP_ECN_DEMAND_CWR BIT(2)
+#define TCP_ECN_SEEN BIT(3)
+#define TCP_ECN_MODE_ACCECN BIT(4)
+
+#define TCP_ECN_DISABLED 0
+#define TCP_ECN_MODE_PENDING (TCP_ECN_MODE_RFC3168 | TCP_ECN_MODE_ACCECN)
+#define TCP_ECN_MODE_ANY (TCP_ECN_MODE_RFC3168 | TCP_ECN_MODE_ACCECN)
+
+static inline bool tcp_ecn_mode_any(const struct tcp_sock *tp)
+{
+ return tp->ecn_flags & TCP_ECN_MODE_ANY;
+}
+
+static inline bool tcp_ecn_mode_rfc3168(const struct tcp_sock *tp)
+{
+ return (tp->ecn_flags & TCP_ECN_MODE_ANY) == TCP_ECN_MODE_RFC3168;
+}
+
+static inline bool tcp_ecn_mode_accecn(const struct tcp_sock *tp)
+{
+ return (tp->ecn_flags & TCP_ECN_MODE_ANY) == TCP_ECN_MODE_ACCECN;
+}
+
+static inline bool tcp_ecn_disabled(const struct tcp_sock *tp)
+{
+ return !tcp_ecn_mode_any(tp);
+}
+
+static inline bool tcp_ecn_mode_pending(const struct tcp_sock *tp)
+{
+ return (tp->ecn_flags & TCP_ECN_MODE_PENDING) == TCP_ECN_MODE_PENDING;
+}
+
+static inline void tcp_ecn_mode_set(struct tcp_sock *tp, u8 mode)
+{
+ tp->ecn_flags &= ~TCP_ECN_MODE_ANY;
+ tp->ecn_flags |= mode;
+}
enum tcp_tw_status {
TCP_TW_SUCCESS = 0,
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 82cc4a5633ce..94546f55385a 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -4107,7 +4107,7 @@ void tcp_get_info(struct sock *sk, struct tcp_info *info)
info->tcpi_rcv_wscale = tp->rx_opt.rcv_wscale;
}
- if (tp->ecn_flags & TCP_ECN_OK)
+ if (tcp_ecn_mode_any(tp))
info->tcpi_options |= TCPI_OPT_ECN;
if (tp->ecn_flags & TCP_ECN_SEEN)
info->tcpi_options |= TCPI_OPT_ECN_SEEN;
diff --git a/net/ipv4/tcp_dctcp.c b/net/ipv4/tcp_dctcp.c
index 8a45a4aea933..03abe0848420 100644
--- a/net/ipv4/tcp_dctcp.c
+++ b/net/ipv4/tcp_dctcp.c
@@ -90,7 +90,7 @@ __bpf_kfunc static void dctcp_init(struct sock *sk)
{
const struct tcp_sock *tp = tcp_sk(sk);
- if ((tp->ecn_flags & TCP_ECN_OK) ||
+ if (tcp_ecn_mode_any(tp) ||
(sk->sk_state == TCP_LISTEN ||
sk->sk_state == TCP_CLOSE)) {
struct dctcp *ca = inet_csk_ca(sk);
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 6d4abd452a36..0161660938d3 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -334,7 +334,7 @@ static bool tcp_in_quickack_mode(struct sock *sk)
static void tcp_ecn_queue_cwr(struct tcp_sock *tp)
{
- if (tp->ecn_flags & TCP_ECN_OK)
+ if (tcp_ecn_mode_rfc3168(tp))
tp->ecn_flags |= TCP_ECN_QUEUE_CWR;
}
@@ -361,7 +361,7 @@ static void tcp_data_ecn_check(struct sock *sk, const struct sk_buff *skb)
{
struct tcp_sock *tp = tcp_sk(sk);
- if (!(tcp_sk(sk)->ecn_flags & TCP_ECN_OK))
+ if (tcp_ecn_disabled(tp))
return;
switch (TCP_SKB_CB(skb)->ip_dsfield & INET_ECN_MASK) {
@@ -394,19 +394,19 @@ static void tcp_data_ecn_check(struct sock *sk, const struct sk_buff *skb)
static void tcp_ecn_rcv_synack(struct tcp_sock *tp, const struct tcphdr *th)
{
- if ((tp->ecn_flags & TCP_ECN_OK) && (!th->ece || th->cwr))
- tp->ecn_flags &= ~TCP_ECN_OK;
+ if (tcp_ecn_mode_rfc3168(tp) && (!th->ece || th->cwr))
+ tcp_ecn_mode_set(tp, TCP_ECN_DISABLED);
}
static void tcp_ecn_rcv_syn(struct tcp_sock *tp, const struct tcphdr *th)
{
- if ((tp->ecn_flags & TCP_ECN_OK) && (!th->ece || !th->cwr))
- tp->ecn_flags &= ~TCP_ECN_OK;
+ if (tcp_ecn_mode_rfc3168(tp) && (!th->ece || !th->cwr))
+ tcp_ecn_mode_set(tp, TCP_ECN_DISABLED);
}
static bool tcp_ecn_rcv_ecn_echo(const struct tcp_sock *tp, const struct tcphdr *th)
{
- if (th->ece && !th->syn && (tp->ecn_flags & TCP_ECN_OK))
+ if (th->ece && !th->syn && tcp_ecn_mode_rfc3168(tp))
return true;
return false;
}
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index bb1fe1ba867a..bd6515ab660f 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -453,7 +453,9 @@ EXPORT_SYMBOL(tcp_openreq_init_rwin);
static void tcp_ecn_openreq_child(struct tcp_sock *tp,
const struct request_sock *req)
{
- tp->ecn_flags = inet_rsk(req)->ecn_ok ? TCP_ECN_OK : 0;
+ tcp_ecn_mode_set(tp, inet_rsk(req)->ecn_ok ?
+ TCP_ECN_MODE_RFC3168 :
+ TCP_ECN_DISABLED);
}
void tcp_ca_openreq_child(struct sock *sk, const struct dst_entry *dst)
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 64d47c18255f..bb83ad43a4e2 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -322,7 +322,7 @@ static void tcp_ecn_send_synack(struct sock *sk, struct sk_buff *skb)
const struct tcp_sock *tp = tcp_sk(sk);
TCP_SKB_CB(skb)->tcp_flags &= ~TCPHDR_CWR;
- if (!(tp->ecn_flags & TCP_ECN_OK))
+ if (tcp_ecn_disabled(tp))
TCP_SKB_CB(skb)->tcp_flags &= ~TCPHDR_ECE;
else if (tcp_ca_needs_ecn(sk) ||
tcp_bpf_ca_needs_ecn(sk))
@@ -351,7 +351,7 @@ static void tcp_ecn_send_syn(struct sock *sk, struct sk_buff *skb)
INET_ECN_xmit(sk);
TCP_SKB_CB(skb)->tcp_flags |= TCPHDR_ECE | TCPHDR_CWR;
- tp->ecn_flags = TCP_ECN_OK;
+ tcp_ecn_mode_set(tp, TCP_ECN_MODE_RFC3168);
}
}
@@ -379,7 +379,7 @@ static void tcp_ecn_send(struct sock *sk, struct sk_buff *skb,
{
struct tcp_sock *tp = tcp_sk(sk);
- if (tp->ecn_flags & TCP_ECN_OK) {
+ if (tcp_ecn_mode_rfc3168(tp)) {
/* Not-retransmitted data segment: set ECT and inject CWR. */
if (skb->len != tcp_header_len &&
!before(TCP_SKB_CB(skb)->seq, tp->snd_nxt)) {
--
2.34.1
^ permalink raw reply related [flat|nested] 24+ messages in thread
* [PATCH v4 net-next 08/14] gso: AccECN support
2024-10-21 21:58 [PATCH v4 net-next 00/14] AccECN protocol preparation patch series chia-yu.chang
` (6 preceding siblings ...)
2024-10-21 21:59 ` [PATCH v4 net-next 07/14] tcp: helpers for ECN mode handling chia-yu.chang
@ 2024-10-21 21:59 ` chia-yu.chang
2024-10-21 21:59 ` [PATCH v4 net-next 09/14] gro: prevent ACE field corruption & better AccECN handling chia-yu.chang
` (5 subsequent siblings)
13 siblings, 0 replies; 24+ messages in thread
From: chia-yu.chang @ 2024-10-21 21:59 UTC (permalink / raw)
To: netdev, davem, edumazet, kuba, pabeni, dsahern, netfilter-devel,
kadlec, coreteam, pablo, bpf, joel.granados, linux-fsdevel, kees,
mcgrof, ij, ncardwell, koen.de_schepper, g.white,
ingemar.s.johansson, mirja.kuehlewind, cheshire, rs.ietf,
Jason_Livingood, vidhi_goel
Cc: Chia-Yu Chang
From: Ilpo Järvinen <ij@kernel.org>
Handling of the CWR flag differs between RFC 3168 ECN and AccECN.
With RFC 3168 ECN-aware TSO (NETIF_F_TSO_ECN), the CWR flag is cleared
starting from the 2nd segment, which is incompatible with how AccECN
handles the CWR flag. Such super-segments are indicated by
SKB_GSO_TCP_ECN.
With AccECN, the CWR flag (or more accurately, the ACE field that also
includes the ECE & AE flags) changes only when new packet(s) with a CE
mark arrive, so the flag should not be changed within a super-skb.
The new skb/feature flags are necessary to prevent such TSO engines
from corrupting AccECN ACE counters by clearing the CWR flag (if the
CWR handling feature cannot be turned off).
If a NIC is completely unaware of RFC 3168 ECN (doesn't support
NETIF_F_TSO_ECN), or its TSO engine can be set to not touch the CWR
flag despite also supporting NETIF_F_TSO_ECN, TSO could be safely used
with AccECN on such a NIC. This should be evaluated on a per-NIC basis
(not done in this patch series for any NICs).
For the cases where TSO cannot keep its hands off the CWR flag,
a GSO fallback is provided by this patch.
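As a rough illustration of the difference described above, the following
user-space sketch models only the per-segment CWR handling. The names
sketch_segment_flags() and SK_TCPHDR_* are hypothetical stand-ins, not
kernel symbols, and the real tcp_gso_segment() does considerably more.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the TCP header flag bits (ACK, CWR). */
#define SK_TCPHDR_ACK 0x10u
#define SK_TCPHDR_CWR 0x80u

/*
 * Fill seg_flags[0..nsegs-1] with the flags each segment would carry when
 * a super-segment with super_flags is split. Only CWR handling is modelled.
 */
static void sketch_segment_flags(uint8_t super_flags, bool accecn,
				 uint8_t *seg_flags, size_t nsegs)
{
	size_t i;

	for (i = 0; i < nsegs; i++) {
		seg_flags[i] = super_flags;
		/* RFC 3168 style: CWR survives only on the first segment. */
		if (i > 0 && !accecn)
			seg_flags[i] &= (uint8_t)~SK_TCPHDR_CWR;
		/* AccECN style: CWR is part of ACE, keep it on every segment. */
	}
}
```

With accecn set, every segment keeps the same ACE-relevant bit, which is
the behaviour the fallback is meant to guarantee.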
Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
---
include/linux/netdev_features.h | 8 +++++---
include/linux/netdevice.h | 2 ++
include/linux/skbuff.h | 2 ++
net/ethtool/common.c | 1 +
net/ipv4/tcp_offload.c | 6 +++++-
5 files changed, 15 insertions(+), 4 deletions(-)
diff --git a/include/linux/netdev_features.h b/include/linux/netdev_features.h
index 66e7d26b70a4..c59db449bcf0 100644
--- a/include/linux/netdev_features.h
+++ b/include/linux/netdev_features.h
@@ -53,12 +53,12 @@ enum {
NETIF_F_GSO_UDP_BIT, /* ... UFO, deprecated except tuntap */
NETIF_F_GSO_UDP_L4_BIT, /* ... UDP payload GSO (not UFO) */
NETIF_F_GSO_FRAGLIST_BIT, /* ... Fraglist GSO */
+ NETIF_F_GSO_ACCECN_BIT, /* TCP AccECN w/ TSO (no clear CWR) */
/**/NETIF_F_GSO_LAST = /* last bit, see GSO_MASK */
- NETIF_F_GSO_FRAGLIST_BIT,
+ NETIF_F_GSO_ACCECN_BIT,
NETIF_F_FCOE_CRC_BIT, /* FCoE CRC32 */
NETIF_F_SCTP_CRC_BIT, /* SCTP checksum offload */
- __UNUSED_NETIF_F_37,
NETIF_F_NTUPLE_BIT, /* N-tuple filters supported */
NETIF_F_RXHASH_BIT, /* Receive hashing offload */
NETIF_F_RXCSUM_BIT, /* Receive checksumming offload */
@@ -128,6 +128,7 @@ enum {
#define NETIF_F_SG __NETIF_F(SG)
#define NETIF_F_TSO6 __NETIF_F(TSO6)
#define NETIF_F_TSO_ECN __NETIF_F(TSO_ECN)
+#define NETIF_F_GSO_ACCECN __NETIF_F(GSO_ACCECN)
#define NETIF_F_TSO __NETIF_F(TSO)
#define NETIF_F_VLAN_CHALLENGED __NETIF_F(VLAN_CHALLENGED)
#define NETIF_F_RXFCS __NETIF_F(RXFCS)
@@ -210,7 +211,8 @@ static inline int find_next_netdev_feature(u64 feature, unsigned long start)
NETIF_F_TSO_ECN | NETIF_F_TSO_MANGLEID)
/* List of features with software fallbacks. */
-#define NETIF_F_GSO_SOFTWARE (NETIF_F_ALL_TSO | NETIF_F_GSO_SCTP | \
+#define NETIF_F_GSO_SOFTWARE (NETIF_F_ALL_TSO | \
+ NETIF_F_GSO_ACCECN | NETIF_F_GSO_SCTP | \
NETIF_F_GSO_UDP_L4 | NETIF_F_GSO_FRAGLIST)
/*
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 8feaca12655e..4f0747e2325e 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -5066,6 +5066,8 @@ static inline bool net_gso_ok(netdev_features_t features, int gso_type)
BUILD_BUG_ON(SKB_GSO_UDP != (NETIF_F_GSO_UDP >> NETIF_F_GSO_SHIFT));
BUILD_BUG_ON(SKB_GSO_UDP_L4 != (NETIF_F_GSO_UDP_L4 >> NETIF_F_GSO_SHIFT));
BUILD_BUG_ON(SKB_GSO_FRAGLIST != (NETIF_F_GSO_FRAGLIST >> NETIF_F_GSO_SHIFT));
+ BUILD_BUG_ON(SKB_GSO_TCP_ACCECN !=
+ (NETIF_F_GSO_ACCECN >> NETIF_F_GSO_SHIFT));
return (features & feature) == feature;
}
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 48f1e0fa2a13..530cb325fb86 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -694,6 +694,8 @@ enum {
SKB_GSO_UDP_L4 = 1 << 17,
SKB_GSO_FRAGLIST = 1 << 18,
+
+ SKB_GSO_TCP_ACCECN = 1 << 19,
};
#if BITS_PER_LONG > 32
diff --git a/net/ethtool/common.c b/net/ethtool/common.c
index 0d62363dbd9d..5c3ba2dfaa74 100644
--- a/net/ethtool/common.c
+++ b/net/ethtool/common.c
@@ -32,6 +32,7 @@ const char netdev_features_strings[NETDEV_FEATURE_COUNT][ETH_GSTRING_LEN] = {
[NETIF_F_TSO_BIT] = "tx-tcp-segmentation",
[NETIF_F_GSO_ROBUST_BIT] = "tx-gso-robust",
[NETIF_F_TSO_ECN_BIT] = "tx-tcp-ecn-segmentation",
+ [NETIF_F_GSO_ACCECN_BIT] = "tx-tcp-accecn-segmentation",
[NETIF_F_TSO_MANGLEID_BIT] = "tx-tcp-mangleid-segmentation",
[NETIF_F_TSO6_BIT] = "tx-tcp6-segmentation",
[NETIF_F_FSO_BIT] = "tx-fcoe-segmentation",
diff --git a/net/ipv4/tcp_offload.c b/net/ipv4/tcp_offload.c
index 2308665b51c5..0b05f30e9e5f 100644
--- a/net/ipv4/tcp_offload.c
+++ b/net/ipv4/tcp_offload.c
@@ -139,6 +139,7 @@ struct sk_buff *tcp_gso_segment(struct sk_buff *skb,
struct sk_buff *gso_skb = skb;
__sum16 newcheck;
bool ooo_okay, copy_destructor;
+ bool ecn_cwr_mask;
__wsum delta;
th = tcp_hdr(skb);
@@ -198,6 +199,8 @@ struct sk_buff *tcp_gso_segment(struct sk_buff *skb,
newcheck = ~csum_fold(csum_add(csum_unfold(th->check), delta));
+ ecn_cwr_mask = !!(skb_shinfo(gso_skb)->gso_type & SKB_GSO_TCP_ACCECN);
+
while (skb->next) {
th->fin = th->psh = 0;
th->check = newcheck;
@@ -217,7 +220,8 @@ struct sk_buff *tcp_gso_segment(struct sk_buff *skb,
th = tcp_hdr(skb);
th->seq = htonl(seq);
- th->cwr = 0;
+
+ th->cwr &= ecn_cwr_mask;
}
/* Following permits TCP Small Queues to work well with GSO :
--
2.34.1
^ permalink raw reply related [flat|nested] 24+ messages in thread
* [PATCH v4 net-next 09/14] gro: prevent ACE field corruption & better AccECN handling
2024-10-21 21:58 [PATCH v4 net-next 00/14] AccECN protocol preparation patch series chia-yu.chang
` (7 preceding siblings ...)
2024-10-21 21:59 ` [PATCH v4 net-next 08/14] gso: AccECN support chia-yu.chang
@ 2024-10-21 21:59 ` chia-yu.chang
2024-10-29 12:03 ` Paolo Abeni
2024-10-21 21:59 ` [PATCH v4 net-next 10/14] tcp: AccECN support to tcp_add_backlog chia-yu.chang
` (4 subsequent siblings)
13 siblings, 1 reply; 24+ messages in thread
From: chia-yu.chang @ 2024-10-21 21:59 UTC (permalink / raw)
To: netdev, davem, edumazet, kuba, pabeni, dsahern, netfilter-devel,
kadlec, coreteam, pablo, bpf, joel.granados, linux-fsdevel, kees,
mcgrof, ij, ncardwell, koen.de_schepper, g.white,
ingemar.s.johansson, mirja.kuehlewind, cheshire, rs.ietf,
Jason_Livingood, vidhi_goel
Cc: Chia-Yu Chang
From: Ilpo Järvinen <ij@kernel.org>
There are important differences in how the CWR flag behaves
in RFC 3168 and AccECN. With AccECN, the CWR flag is part of the
ACE counter and its changes are significant, so adjust the
flags-changed mask accordingly.
Also, if the CWR flag is set, set the Accurate ECN GSO flag to
avoid corrupting the CWR flag later on.
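The effect of the mask change can be sketched in user space as below.
This models only the flag comparison from tcp_gro_receive(); the name
sketch_gro_flags_flush() and the SK_FLAG_* values are illustrative
assumptions, and the ignore_cwr_diff parameter exists only to contrast the
old and new behaviour.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for TCP flag bits, in host order. */
#define SK_FLAG_FIN 0x01u
#define SK_FLAG_PSH 0x08u
#define SK_FLAG_ACK 0x10u
#define SK_FLAG_ECE 0x40u
#define SK_FLAG_CWR 0x80u

/*
 * Return true if the incoming segment must not be merged with the held one,
 * looking at TCP flags only. ignore_cwr_diff == true models the old mask,
 * false models the mask after this patch.
 */
static bool sketch_gro_flags_flush(uint32_t held, uint32_t incoming,
				   bool ignore_cwr_diff)
{
	uint32_t ignored = SK_FLAG_FIN | SK_FLAG_PSH;
	bool flush;

	if (ignore_cwr_diff)
		ignored |= SK_FLAG_CWR;

	/* An incoming segment carrying CWR always forces a flush. */
	flush = (incoming & SK_FLAG_CWR) != 0;
	/* Any difference outside the ignored bits forces a flush too. */
	if (((held ^ incoming) & ~ignored) != 0)
		flush = true;

	return flush;
}
```

The interesting case is a held segment with CWR followed by one without:
the old mask would let them merge, the new one does not.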
Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
---
net/ipv4/tcp_offload.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/net/ipv4/tcp_offload.c b/net/ipv4/tcp_offload.c
index 0b05f30e9e5f..f59762d88c38 100644
--- a/net/ipv4/tcp_offload.c
+++ b/net/ipv4/tcp_offload.c
@@ -329,7 +329,7 @@ struct sk_buff *tcp_gro_receive(struct list_head *head, struct sk_buff *skb,
th2 = tcp_hdr(p);
flush = (__force int)(flags & TCP_FLAG_CWR);
flush |= (__force int)((flags ^ tcp_flag_word(th2)) &
- ~(TCP_FLAG_CWR | TCP_FLAG_FIN | TCP_FLAG_PSH));
+ ~(TCP_FLAG_FIN | TCP_FLAG_PSH));
flush |= (__force int)(th->ack_seq ^ th2->ack_seq);
for (i = sizeof(*th); i < thlen; i += 4)
flush |= *(u32 *)((u8 *)th + i) ^
@@ -405,7 +405,7 @@ void tcp_gro_complete(struct sk_buff *skb)
shinfo->gso_segs = NAPI_GRO_CB(skb)->count;
if (th->cwr)
- shinfo->gso_type |= SKB_GSO_TCP_ECN;
+ shinfo->gso_type |= SKB_GSO_TCP_ACCECN;
}
EXPORT_SYMBOL(tcp_gro_complete);
--
2.34.1
^ permalink raw reply related [flat|nested] 24+ messages in thread
* [PATCH v4 net-next 10/14] tcp: AccECN support to tcp_add_backlog
2024-10-21 21:58 [PATCH v4 net-next 00/14] AccECN protocol preparation patch series chia-yu.chang
` (8 preceding siblings ...)
2024-10-21 21:59 ` [PATCH v4 net-next 09/14] gro: prevent ACE field corruption & better AccECN handling chia-yu.chang
@ 2024-10-21 21:59 ` chia-yu.chang
2024-10-21 21:59 ` [PATCH v4 net-next 11/14] tcp: allow ECN bits in TOS/traffic class chia-yu.chang
` (3 subsequent siblings)
13 siblings, 0 replies; 24+ messages in thread
From: chia-yu.chang @ 2024-10-21 21:59 UTC (permalink / raw)
To: netdev, davem, edumazet, kuba, pabeni, dsahern, netfilter-devel,
kadlec, coreteam, pablo, bpf, joel.granados, linux-fsdevel, kees,
mcgrof, ij, ncardwell, koen.de_schepper, g.white,
ingemar.s.johansson, mirja.kuehlewind, cheshire, rs.ietf,
Jason_Livingood, vidhi_goel
Cc: Chia-Yu Chang
From: Ilpo Järvinen <ij@kernel.org>
AE flag needs to be preserved for AccECN.
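A minimal sketch of the resulting coalescing condition, in user space. The
helper name sketch_ecn_flags_differ() is hypothetical, and the SK_TCPHDR_AE
value assumes AE occupies the bit directly above CWR in the extended flags
field introduced earlier in this series.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins; AE at bit 8 is an assumption of this sketch. */
#define SK_TCPHDR_ACK 0x010u
#define SK_TCPHDR_ECE 0x040u
#define SK_TCPHDR_CWR 0x080u
#define SK_TCPHDR_AE  0x100u

/* True if two segments differ in any ACE bit and so must not be coalesced. */
static bool sketch_ecn_flags_differ(uint16_t tail_flags, uint16_t skb_flags)
{
	return ((tail_flags ^ skb_flags) &
		(SK_TCPHDR_ECE | SK_TCPHDR_CWR | SK_TCPHDR_AE)) != 0;
}
```

Without AE in the mask, two segments differing only in AE would compare as
equal here and could be merged in the backlog.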
Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
---
net/ipv4/tcp_ipv4.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 9fe314a59240..540fe14bdc32 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2054,7 +2054,8 @@ bool tcp_add_backlog(struct sock *sk, struct sk_buff *skb,
!((TCP_SKB_CB(tail)->tcp_flags &
TCP_SKB_CB(skb)->tcp_flags) & TCPHDR_ACK) ||
((TCP_SKB_CB(tail)->tcp_flags ^
- TCP_SKB_CB(skb)->tcp_flags) & (TCPHDR_ECE | TCPHDR_CWR)) ||
+ TCP_SKB_CB(skb)->tcp_flags) &
+ (TCPHDR_ECE | TCPHDR_CWR | TCPHDR_AE)) ||
!tcp_skb_can_collapse_rx(tail, skb) ||
thtail->doff != th->doff ||
memcmp(thtail + 1, th + 1, hdrlen - sizeof(*th)))
--
2.34.1
^ permalink raw reply related [flat|nested] 24+ messages in thread
* [PATCH v4 net-next 11/14] tcp: allow ECN bits in TOS/traffic class
2024-10-21 21:58 [PATCH v4 net-next 00/14] AccECN protocol preparation patch series chia-yu.chang
` (9 preceding siblings ...)
2024-10-21 21:59 ` [PATCH v4 net-next 10/14] tcp: AccECN support to tcp_add_backlog chia-yu.chang
@ 2024-10-21 21:59 ` chia-yu.chang
2024-10-29 12:18 ` Paolo Abeni
2024-10-21 21:59 ` [PATCH v4 net-next 12/14] tcp: Pass flags to __tcp_send_ack chia-yu.chang
` (2 subsequent siblings)
13 siblings, 1 reply; 24+ messages in thread
From: chia-yu.chang @ 2024-10-21 21:59 UTC (permalink / raw)
To: netdev, davem, edumazet, kuba, pabeni, dsahern, netfilter-devel,
kadlec, coreteam, pablo, bpf, joel.granados, linux-fsdevel, kees,
mcgrof, ij, ncardwell, koen.de_schepper, g.white,
ingemar.s.johansson, mirja.kuehlewind, cheshire, rs.ietf,
Jason_Livingood, vidhi_goel
Cc: Chia-Yu Chang
From: Ilpo Järvinen <ij@kernel.org>
An AccECN connection's last ACK cannot retain ECT(1) because the ECN
bits are always cleared, which causes the packet to switch into
another service queue.
This effectively adds finer-grained filtering of the ECN bits
so that acceptable TW ACKs can retain them.
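The filtering for TIME-WAIT ACKs can be sketched in user space as follows.
The enum and the helper sketch_timewait_ack_tos() are hypothetical names
that mirror the status values used in the diff below; SK_INET_ECN_MASK
stands in for the two ECN bits of the TOS/traffic class byte.

```c
#include <stdint.h>

/* The two low bits of TOS/traffic class carry the ECN codepoint. */
#define SK_INET_ECN_MASK 0x3u

/* Hypothetical mirror of the time-wait status values. */
enum sketch_tw_status {
	SK_TW_SUCCESS = 0,
	SK_TW_RST = 1,
	SK_TW_ACK = 2,
	SK_TW_SYN = 3,
	SK_TW_ACK_OOW = 4
};

/* Return the TOS byte to use for an ACK sent from TIME-WAIT. */
static uint8_t sketch_timewait_ack_tos(uint8_t tw_tos,
				       enum sketch_tw_status status)
{
	/* ACKs answering out-of-window segments lose their ECN bits. */
	if (status == SK_TW_ACK_OOW)
		tw_tos &= (uint8_t)~SK_INET_ECN_MASK;

	return tw_tos;
}
```

A regular TW ACK keeps ECT(1) (0x01 in the low bits) and the DSCP part is
untouched in both cases.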
Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
---
include/net/tcp.h | 3 ++-
net/ipv4/ip_output.c | 3 +--
net/ipv4/tcp_ipv4.c | 23 +++++++++++++++++------
net/ipv4/tcp_minisocks.c | 2 +-
net/ipv6/tcp_ipv6.c | 24 +++++++++++++++++-------
5 files changed, 38 insertions(+), 17 deletions(-)
diff --git a/include/net/tcp.h b/include/net/tcp.h
index b6a4e0124280..d348ea9be172 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -417,7 +417,8 @@ enum tcp_tw_status {
TCP_TW_SUCCESS = 0,
TCP_TW_RST = 1,
TCP_TW_ACK = 2,
- TCP_TW_SYN = 3
+ TCP_TW_SYN = 3,
+ TCP_TW_ACK_OOW = 4
};
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 0065b1996c94..2fe7b1df3b90 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -75,7 +75,6 @@
#include <net/checksum.h>
#include <net/gso.h>
#include <net/inetpeer.h>
-#include <net/inet_ecn.h>
#include <net/lwtunnel.h>
#include <net/inet_dscp.h>
#include <linux/bpf-cgroup.h>
@@ -1643,7 +1642,7 @@ void ip_send_unicast_reply(struct sock *sk, const struct sock *orig_sk,
if (IS_ERR(rt))
return;
- inet_sk(sk)->tos = arg->tos & ~INET_ECN_MASK;
+ inet_sk(sk)->tos = arg->tos;
sk->sk_protocol = ip_hdr(skb)->protocol;
sk->sk_bound_dev_if = arg->bound_dev_if;
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 540fe14bdc32..3d836e0f099a 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -66,6 +66,7 @@
#include <net/transp_v6.h>
#include <net/ipv6.h>
#include <net/inet_common.h>
+#include <net/inet_ecn.h>
#include <net/timewait_sock.h>
#include <net/xfrm.h>
#include <net/secure_seq.h>
@@ -887,7 +888,7 @@ static void tcp_v4_send_reset(const struct sock *sk, struct sk_buff *skb,
BUILD_BUG_ON(offsetof(struct sock, sk_bound_dev_if) !=
offsetof(struct inet_timewait_sock, tw_bound_dev_if));
- arg.tos = ip_hdr(skb)->tos;
+ arg.tos = ip_hdr(skb)->tos & ~INET_ECN_MASK;
arg.uid = sock_net_uid(net, sk && sk_fullsock(sk) ? sk : NULL);
local_bh_disable();
local_lock_nested_bh(&ipv4_tcp_sk.bh_lock);
@@ -1033,11 +1034,17 @@ static void tcp_v4_send_ack(const struct sock *sk,
local_bh_enable();
}
-static void tcp_v4_timewait_ack(struct sock *sk, struct sk_buff *skb)
+static void tcp_v4_timewait_ack(struct sock *sk, struct sk_buff *skb,
+ enum tcp_tw_status tw_status)
{
struct inet_timewait_sock *tw = inet_twsk(sk);
struct tcp_timewait_sock *tcptw = tcp_twsk(sk);
struct tcp_key key = {};
+ u8 tos = tw->tw_tos;
+
+ if (tw_status == TCP_TW_ACK_OOW)
+ tos &= ~INET_ECN_MASK;
+
#ifdef CONFIG_TCP_AO
struct tcp_ao_info *ao_info;
@@ -1080,7 +1087,7 @@ static void tcp_v4_timewait_ack(struct sock *sk, struct sk_buff *skb)
READ_ONCE(tcptw->tw_ts_recent),
tw->tw_bound_dev_if, &key,
tw->tw_transparent ? IP_REPLY_ARG_NOSRCCHECK : 0,
- tw->tw_tos,
+ tos,
tw->tw_txhash);
inet_twsk_put(tw);
@@ -1157,7 +1164,7 @@ static void tcp_v4_reqsk_send_ack(const struct sock *sk, struct sk_buff *skb,
READ_ONCE(req->ts_recent),
0, &key,
inet_rsk(req)->no_srccheck ? IP_REPLY_ARG_NOSRCCHECK : 0,
- ip_hdr(skb)->tos,
+ ip_hdr(skb)->tos & ~INET_ECN_MASK,
READ_ONCE(tcp_rsk(req)->txhash));
if (tcp_key_is_ao(&key))
kfree(key.traffic_key);
@@ -2178,6 +2185,7 @@ static void tcp_v4_fill_cb(struct sk_buff *skb, const struct iphdr *iph,
int tcp_v4_rcv(struct sk_buff *skb)
{
struct net *net = dev_net(skb->dev);
+ enum tcp_tw_status tw_status;
enum skb_drop_reason drop_reason;
int sdif = inet_sdif(skb);
int dif = inet_iif(skb);
@@ -2405,7 +2413,9 @@ int tcp_v4_rcv(struct sk_buff *skb)
inet_twsk_put(inet_twsk(sk));
goto csum_error;
}
- switch (tcp_timewait_state_process(inet_twsk(sk), skb, th, &isn)) {
+
+ tw_status = tcp_timewait_state_process(inet_twsk(sk), skb, th, &isn);
+ switch (tw_status) {
case TCP_TW_SYN: {
struct sock *sk2 = inet_lookup_listener(net,
net->ipv4.tcp_death_row.hashinfo,
@@ -2426,7 +2436,8 @@ int tcp_v4_rcv(struct sk_buff *skb)
/* to ACK */
fallthrough;
case TCP_TW_ACK:
- tcp_v4_timewait_ack(sk, skb);
+ case TCP_TW_ACK_OOW:
+ tcp_v4_timewait_ack(sk, skb, tw_status);
break;
case TCP_TW_RST:
tcp_v4_send_reset(sk, skb, SK_RST_REASON_TCP_TIMEWAIT_SOCKET);
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index bd6515ab660f..8fb9f550fdeb 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -44,7 +44,7 @@ tcp_timewait_check_oow_rate_limit(struct inet_timewait_sock *tw,
/* Send ACK. Note, we do not put the bucket,
* it will be released by caller.
*/
- return TCP_TW_ACK;
+ return TCP_TW_ACK_OOW;
}
/* We are rate-limiting, so just release the tw sock and drop skb. */
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 252d3dac3a09..9beba4dc2f42 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -997,7 +997,7 @@ static void tcp_v6_send_response(const struct sock *sk, struct sk_buff *skb, u32
if (!IS_ERR(dst)) {
skb_dst_set(buff, dst);
ip6_xmit(ctl_sk, buff, &fl6, fl6.flowi6_mark, NULL,
- tclass & ~INET_ECN_MASK, priority);
+ tclass, priority);
TCP_INC_STATS(net, TCP_MIB_OUTSEGS);
if (rst)
TCP_INC_STATS(net, TCP_MIB_OUTRSTS);
@@ -1133,7 +1133,8 @@ static void tcp_v6_send_reset(const struct sock *sk, struct sk_buff *skb,
trace_tcp_send_reset(sk, skb, reason);
tcp_v6_send_response(sk, skb, seq, ack_seq, 0, 0, 0, oif, 1,
- ipv6_get_dsfield(ipv6h), label, priority, txhash,
+ ipv6_get_dsfield(ipv6h) & ~INET_ECN_MASK,
+ label, priority, txhash,
&key);
#if defined(CONFIG_TCP_MD5SIG) || defined(CONFIG_TCP_AO)
@@ -1153,11 +1154,16 @@ static void tcp_v6_send_ack(const struct sock *sk, struct sk_buff *skb, u32 seq,
tclass, label, priority, txhash, key);
}
-static void tcp_v6_timewait_ack(struct sock *sk, struct sk_buff *skb)
+static void tcp_v6_timewait_ack(struct sock *sk, struct sk_buff *skb,
+ enum tcp_tw_status tw_status)
{
struct inet_timewait_sock *tw = inet_twsk(sk);
struct tcp_timewait_sock *tcptw = tcp_twsk(sk);
struct tcp_key key = {};
+ u8 tclass = tw->tw_tclass;
+
+ if (tw_status == TCP_TW_ACK_OOW)
+ tclass &= ~INET_ECN_MASK;
#ifdef CONFIG_TCP_AO
struct tcp_ao_info *ao_info;
@@ -1201,7 +1207,7 @@ static void tcp_v6_timewait_ack(struct sock *sk, struct sk_buff *skb)
tcptw->tw_rcv_wnd >> tw->tw_rcv_wscale,
tcp_tw_tsval(tcptw),
READ_ONCE(tcptw->tw_ts_recent), tw->tw_bound_dev_if,
- &key, tw->tw_tclass, cpu_to_be32(tw->tw_flowlabel),
+ &key, tclass, cpu_to_be32(tw->tw_flowlabel),
tw->tw_priority, tw->tw_txhash);
#ifdef CONFIG_TCP_AO
@@ -1278,7 +1284,8 @@ static void tcp_v6_reqsk_send_ack(const struct sock *sk, struct sk_buff *skb,
tcp_synack_window(req) >> inet_rsk(req)->rcv_wscale,
tcp_rsk_tsval(tcp_rsk(req)),
READ_ONCE(req->ts_recent), sk->sk_bound_dev_if,
- &key, ipv6_get_dsfield(ipv6_hdr(skb)), 0,
+ &key, ipv6_get_dsfield(ipv6_hdr(skb)) & ~INET_ECN_MASK,
+ 0,
READ_ONCE(sk->sk_priority),
READ_ONCE(tcp_rsk(req)->txhash));
if (tcp_key_is_ao(&key))
@@ -1747,6 +1754,7 @@ static void tcp_v6_fill_cb(struct sk_buff *skb, const struct ipv6hdr *hdr,
INDIRECT_CALLABLE_SCOPE int tcp_v6_rcv(struct sk_buff *skb)
{
+ enum tcp_tw_status tw_status;
enum skb_drop_reason drop_reason;
int sdif = inet6_sdif(skb);
int dif = inet6_iif(skb);
@@ -1968,7 +1976,8 @@ INDIRECT_CALLABLE_SCOPE int tcp_v6_rcv(struct sk_buff *skb)
goto csum_error;
}
- switch (tcp_timewait_state_process(inet_twsk(sk), skb, th, &isn)) {
+ tw_status = tcp_timewait_state_process(inet_twsk(sk), skb, th, &isn);
+ switch (tw_status) {
case TCP_TW_SYN:
{
struct sock *sk2;
@@ -1993,7 +2002,8 @@ INDIRECT_CALLABLE_SCOPE int tcp_v6_rcv(struct sk_buff *skb)
/* to ACK */
fallthrough;
case TCP_TW_ACK:
- tcp_v6_timewait_ack(sk, skb);
+ case TCP_TW_ACK_OOW:
+ tcp_v6_timewait_ack(sk, skb, tw_status);
break;
case TCP_TW_RST:
tcp_v6_send_reset(sk, skb, SK_RST_REASON_TCP_TIMEWAIT_SOCKET);
--
2.34.1
^ permalink raw reply related [flat|nested] 24+ messages in thread
* [PATCH v4 net-next 12/14] tcp: Pass flags to __tcp_send_ack
2024-10-21 21:58 [PATCH v4 net-next 00/14] AccECN protocol preparation patch series chia-yu.chang
` (10 preceding siblings ...)
2024-10-21 21:59 ` [PATCH v4 net-next 11/14] tcp: allow ECN bits in TOS/traffic class chia-yu.chang
@ 2024-10-21 21:59 ` chia-yu.chang
2024-10-21 21:59 ` [PATCH v4 net-next 13/14] tcp: fast path functions later chia-yu.chang
2024-10-21 21:59 ` [PATCH v4 net-next 14/14] net: sysctl: introduce sysctl SYSCTL_FIVE chia-yu.chang
13 siblings, 0 replies; 24+ messages in thread
From: chia-yu.chang @ 2024-10-21 21:59 UTC (permalink / raw)
To: netdev, davem, edumazet, kuba, pabeni, dsahern, netfilter-devel,
kadlec, coreteam, pablo, bpf, joel.granados, linux-fsdevel, kees,
mcgrof, ij, ncardwell, koen.de_schepper, g.white,
ingemar.s.johansson, mirja.kuehlewind, cheshire, rs.ietf,
Jason_Livingood, vidhi_goel
Cc: Chia-Yu Chang
From: Ilpo Järvinen <ij@kernel.org>
Accurate ECN needs to send custom flags to handle IP-ECN
field reflection during the handshake.
Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
---
include/net/tcp.h | 2 +-
net/ipv4/bpf_tcp_ca.c | 2 +-
net/ipv4/tcp_dctcp.h | 2 +-
net/ipv4/tcp_output.c | 6 +++---
4 files changed, 6 insertions(+), 6 deletions(-)
diff --git a/include/net/tcp.h b/include/net/tcp.h
index d348ea9be172..81efbe1195fc 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -704,7 +704,7 @@ void tcp_send_active_reset(struct sock *sk, gfp_t priority,
enum sk_rst_reason reason);
int tcp_send_synack(struct sock *);
void tcp_push_one(struct sock *, unsigned int mss_now);
-void __tcp_send_ack(struct sock *sk, u32 rcv_nxt);
+void __tcp_send_ack(struct sock *sk, u32 rcv_nxt, u16 flags);
void tcp_send_ack(struct sock *sk);
void tcp_send_delayed_ack(struct sock *sk);
void tcp_send_loss_probe(struct sock *sk);
diff --git a/net/ipv4/bpf_tcp_ca.c b/net/ipv4/bpf_tcp_ca.c
index 554804774628..e01492234b0b 100644
--- a/net/ipv4/bpf_tcp_ca.c
+++ b/net/ipv4/bpf_tcp_ca.c
@@ -121,7 +121,7 @@ static int bpf_tcp_ca_btf_struct_access(struct bpf_verifier_log *log,
BPF_CALL_2(bpf_tcp_send_ack, struct tcp_sock *, tp, u32, rcv_nxt)
{
/* bpf_tcp_ca prog cannot have NULL tp */
- __tcp_send_ack((struct sock *)tp, rcv_nxt);
+ __tcp_send_ack((struct sock *)tp, rcv_nxt, 0);
return 0;
}
diff --git a/net/ipv4/tcp_dctcp.h b/net/ipv4/tcp_dctcp.h
index d69a77cbd0c7..4b0259111d81 100644
--- a/net/ipv4/tcp_dctcp.h
+++ b/net/ipv4/tcp_dctcp.h
@@ -28,7 +28,7 @@ static inline void dctcp_ece_ack_update(struct sock *sk, enum tcp_ca_event evt,
*/
if (inet_csk(sk)->icsk_ack.pending & ICSK_ACK_TIMER) {
dctcp_ece_ack_cwr(sk, *ce_state);
- __tcp_send_ack(sk, *prior_rcv_nxt);
+ __tcp_send_ack(sk, *prior_rcv_nxt, 0);
}
inet_csk(sk)->icsk_ack.pending |= ICSK_ACK_NOW;
}
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index bb83ad43a4e2..556c2da2bc77 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -4232,7 +4232,7 @@ void tcp_send_delayed_ack(struct sock *sk)
}
/* This routine sends an ack and also updates the window. */
-void __tcp_send_ack(struct sock *sk, u32 rcv_nxt)
+void __tcp_send_ack(struct sock *sk, u32 rcv_nxt, u16 flags)
{
struct sk_buff *buff;
@@ -4261,7 +4261,7 @@ void __tcp_send_ack(struct sock *sk, u32 rcv_nxt)
/* Reserve space for headers and prepare control bits. */
skb_reserve(buff, MAX_TCP_HEADER);
- tcp_init_nondata_skb(buff, tcp_acceptable_seq(sk), TCPHDR_ACK);
+ tcp_init_nondata_skb(buff, tcp_acceptable_seq(sk), TCPHDR_ACK | flags);
/* We do not want pure acks influencing TCP Small Queues or fq/pacing
* too much.
@@ -4276,7 +4276,7 @@ EXPORT_SYMBOL_GPL(__tcp_send_ack);
void tcp_send_ack(struct sock *sk)
{
- __tcp_send_ack(sk, tcp_sk(sk)->rcv_nxt);
+ __tcp_send_ack(sk, tcp_sk(sk)->rcv_nxt, 0);
}
/* This routine sends a packet with an out of date sequence
--
2.34.1
^ permalink raw reply related [flat|nested] 24+ messages in thread
* [PATCH v4 net-next 13/14] tcp: fast path functions later
2024-10-21 21:58 [PATCH v4 net-next 00/14] AccECN protocol preparation patch series chia-yu.chang
` (11 preceding siblings ...)
2024-10-21 21:59 ` [PATCH v4 net-next 12/14] tcp: Pass flags to __tcp_send_ack chia-yu.chang
@ 2024-10-21 21:59 ` chia-yu.chang
2024-10-21 21:59 ` [PATCH v4 net-next 14/14] net: sysctl: introduce sysctl SYSCTL_FIVE chia-yu.chang
13 siblings, 0 replies; 24+ messages in thread
From: chia-yu.chang @ 2024-10-21 21:59 UTC (permalink / raw)
To: netdev, davem, edumazet, kuba, pabeni, dsahern, netfilter-devel,
kadlec, coreteam, pablo, bpf, joel.granados, linux-fsdevel, kees,
mcgrof, ij, ncardwell, koen.de_schepper, g.white,
ingemar.s.johansson, mirja.kuehlewind, cheshire, rs.ietf,
Jason_Livingood, vidhi_goel
Cc: Chia-Yu Chang
From: Ilpo Järvinen <ij@kernel.org>
A following patch will use tcp_ecn_mode_accecn(),
TCP_ACCECN_CEP_INIT_OFFSET and TCP_ACCECN_CEP_ACE_MASK in
__tcp_fast_path_on() to build a new flag for AccECN, so move the
fast path functions later in the header.
No functional changes.
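For reference, the value that __tcp_fast_path_on() builds today can be
sketched in host byte order as below. The helper sketch_pred_flags_host()
and SK_TCP_FLAG_ACK_HOST are hypothetical names; the kernel additionally
converts the result with htonl(), which this sketch leaves out to stay
independent of endianness.

```c
#include <stdint.h>

/* Host-order position of the ACK bit in the 4th 32-bit word of the header. */
#define SK_TCP_FLAG_ACK_HOST 0x00100000u

/*
 * Header length (in bytes) lands in the data offset nibble: bytes / 4
 * shifted by 28 equals bytes shifted by 26. The window fills the low bits.
 */
static uint32_t sketch_pred_flags_host(uint32_t tcp_header_len,
				       uint32_t snd_wnd)
{
	return (tcp_header_len << 26) | SK_TCP_FLAG_ACK_HOST | snd_wnd;
}
```

Any AccECN bits added by a later patch would have to be ORed into this same
word, which is why the function needs the AccECN definitions in scope.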
Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
---
include/net/tcp.h | 54 +++++++++++++++++++++++------------------------
1 file changed, 27 insertions(+), 27 deletions(-)
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 81efbe1195fc..6945541b5874 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -788,33 +788,6 @@ static inline u32 __tcp_set_rto(const struct tcp_sock *tp)
return usecs_to_jiffies((tp->srtt_us >> 3) + tp->rttvar_us);
}
-static inline void __tcp_fast_path_on(struct tcp_sock *tp, u32 snd_wnd)
-{
- /* mptcp hooks are only on the slow path */
- if (sk_is_mptcp((struct sock *)tp))
- return;
-
- tp->pred_flags = htonl((tp->tcp_header_len << 26) |
- ntohl(TCP_FLAG_ACK) |
- snd_wnd);
-}
-
-static inline void tcp_fast_path_on(struct tcp_sock *tp)
-{
- __tcp_fast_path_on(tp, tp->snd_wnd >> tp->rx_opt.snd_wscale);
-}
-
-static inline void tcp_fast_path_check(struct sock *sk)
-{
- struct tcp_sock *tp = tcp_sk(sk);
-
- if (RB_EMPTY_ROOT(&tp->out_of_order_queue) &&
- tp->rcv_wnd &&
- atomic_read(&sk->sk_rmem_alloc) < sk->sk_rcvbuf &&
- !tp->urg_data)
- tcp_fast_path_on(tp);
-}
-
u32 tcp_delack_max(const struct sock *sk);
/* Compute the actual rto_min value */
@@ -1768,6 +1741,33 @@ static inline bool tcp_paws_reject(const struct tcp_options_received *rx_opt,
return true;
}
+static inline void __tcp_fast_path_on(struct tcp_sock *tp, u32 snd_wnd)
+{
+ /* mptcp hooks are only on the slow path */
+ if (sk_is_mptcp((struct sock *)tp))
+ return;
+
+ tp->pred_flags = htonl((tp->tcp_header_len << 26) |
+ ntohl(TCP_FLAG_ACK) |
+ snd_wnd);
+}
+
+static inline void tcp_fast_path_on(struct tcp_sock *tp)
+{
+ __tcp_fast_path_on(tp, tp->snd_wnd >> tp->rx_opt.snd_wscale);
+}
+
+static inline void tcp_fast_path_check(struct sock *sk)
+{
+ struct tcp_sock *tp = tcp_sk(sk);
+
+ if (RB_EMPTY_ROOT(&tp->out_of_order_queue) &&
+ tp->rcv_wnd &&
+ atomic_read(&sk->sk_rmem_alloc) < sk->sk_rcvbuf &&
+ !tp->urg_data)
+ tcp_fast_path_on(tp);
+}
+
bool tcp_oow_rate_limited(struct net *net, const struct sk_buff *skb,
int mib_idx, u32 *last_oow_ack_time);
--
2.34.1
^ permalink raw reply related [flat|nested] 24+ messages in thread
* [PATCH v4 net-next 14/14] net: sysctl: introduce sysctl SYSCTL_FIVE
2024-10-21 21:58 [PATCH v4 net-next 00/14] AccECN protocol preparation patch series chia-yu.chang
` (12 preceding siblings ...)
2024-10-21 21:59 ` [PATCH v4 net-next 13/14] tcp: fast path functions later chia-yu.chang
@ 2024-10-21 21:59 ` chia-yu.chang
2024-10-29 12:26 ` Paolo Abeni
` (2 more replies)
13 siblings, 3 replies; 24+ messages in thread
From: chia-yu.chang @ 2024-10-21 21:59 UTC (permalink / raw)
To: netdev, davem, edumazet, kuba, pabeni, dsahern, netfilter-devel,
kadlec, coreteam, pablo, bpf, joel.granados, linux-fsdevel, kees,
mcgrof, ij, ncardwell, koen.de_schepper, g.white,
ingemar.s.johansson, mirja.kuehlewind, cheshire, rs.ietf,
Jason_Livingood, vidhi_goel
Cc: Chia-Yu Chang
From: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Add SYSCTL_FIVE for new AccECN feedback modes of net.ipv4.tcp_ecn.
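The shared constants work by handing out pointers into one array, so
inserting 5 shifts every later index by one. A user-space sketch of the
idea, with hypothetical names (sketch_vals, SK_*, sketch_in_range()) and a
simplified range check standing in for how a min/max bounded sysctl would
consume the pointers:

```c
#include <limits.h>

/* Same ordering as the updated table in this patch. */
static const int sketch_vals[] = { 0, 1, 2, 3, 4, 5, 100, 200, 1000, 3000,
				   INT_MAX, 65535, -1 };

/* Pointer-style constants; indices must track the array order above. */
#define SK_ZERO        ((const void *)&sketch_vals[0])
#define SK_FIVE        ((const void *)&sketch_vals[5])
#define SK_ONE_HUNDRED ((const void *)&sketch_vals[6])

/* Simplified bounds check: returns 1 if min <= value <= max, else 0. */
static int sketch_in_range(int value, const void *min, const void *max)
{
	return value >= *(const int *)min && value <= *(const int *)max;
}
```

If an index were left unshifted, for example a macro still pointing at
slot 5 for one hundred, the bound would silently become 5.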
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
---
include/linux/sysctl.h | 17 +++++++++--------
kernel/sysctl.c | 3 ++-
2 files changed, 11 insertions(+), 9 deletions(-)
diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h
index aa4c6d44aaa0..37c95a70c10e 100644
--- a/include/linux/sysctl.h
+++ b/include/linux/sysctl.h
@@ -37,21 +37,22 @@ struct ctl_table_root;
struct ctl_table_header;
struct ctl_dir;
-/* Keep the same order as in fs/proc/proc_sysctl.c */
+/* Keep the same order as in kernel/sysctl.c */
#define SYSCTL_ZERO ((void *)&sysctl_vals[0])
#define SYSCTL_ONE ((void *)&sysctl_vals[1])
#define SYSCTL_TWO ((void *)&sysctl_vals[2])
#define SYSCTL_THREE ((void *)&sysctl_vals[3])
#define SYSCTL_FOUR ((void *)&sysctl_vals[4])
-#define SYSCTL_ONE_HUNDRED ((void *)&sysctl_vals[5])
-#define SYSCTL_TWO_HUNDRED ((void *)&sysctl_vals[6])
-#define SYSCTL_ONE_THOUSAND ((void *)&sysctl_vals[7])
-#define SYSCTL_THREE_THOUSAND ((void *)&sysctl_vals[8])
-#define SYSCTL_INT_MAX ((void *)&sysctl_vals[9])
+#define SYSCTL_FIVE ((void *)&sysctl_vals[5])
+#define SYSCTL_ONE_HUNDRED ((void *)&sysctl_vals[6])
+#define SYSCTL_TWO_HUNDRED ((void *)&sysctl_vals[7])
+#define SYSCTL_ONE_THOUSAND ((void *)&sysctl_vals[8])
+#define SYSCTL_THREE_THOUSAND ((void *)&sysctl_vals[9])
+#define SYSCTL_INT_MAX ((void *)&sysctl_vals[10])
/* this is needed for the proc_dointvec_minmax for [fs_]overflow UID and GID */
-#define SYSCTL_MAXOLDUID ((void *)&sysctl_vals[10])
-#define SYSCTL_NEG_ONE ((void *)&sysctl_vals[11])
+#define SYSCTL_MAXOLDUID ((void *)&sysctl_vals[11])
+#define SYSCTL_NEG_ONE ((void *)&sysctl_vals[12])
extern const int sysctl_vals[];
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 79e6cb1d5c48..68b6ca67a0c6 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -82,7 +82,8 @@
#endif
/* shared constants to be used in various sysctls */
-const int sysctl_vals[] = { 0, 1, 2, 3, 4, 100, 200, 1000, 3000, INT_MAX, 65535, -1 };
+const int sysctl_vals[] = { 0, 1, 2, 3, 4, 5, 100, 200, 1000, 3000, INT_MAX,
+ 65535, -1 };
EXPORT_SYMBOL(sysctl_vals);
const unsigned long sysctl_long_vals[] = { 0, 1, LONG_MAX };
--
2.34.1
* Re: [PATCH v4 net-next 04/14] tcp: extend TCP flags to allow AE bit/ACE field
2024-10-21 21:59 ` [PATCH v4 net-next 04/14] tcp: extend TCP flags to allow AE bit/ACE field chia-yu.chang
@ 2024-10-29 11:43 ` Paolo Abeni
2024-10-29 11:45 ` Paolo Abeni
0 siblings, 1 reply; 24+ messages in thread
From: Paolo Abeni @ 2024-10-29 11:43 UTC (permalink / raw)
To: chia-yu.chang, netdev, davem, edumazet, kuba, dsahern,
netfilter-devel, kadlec, coreteam, pablo, bpf, joel.granados,
linux-fsdevel, kees, mcgrof, ij, ncardwell, koen.de_schepper,
g.white, ingemar.s.johansson, mirja.kuehlewind, cheshire, rs.ietf,
Jason_Livingood, vidhi_goel
On 10/21/24 23:59, chia-yu.chang@nokia-bell-labs.com wrote:
> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index 9d3dd101ea71..9fe314a59240 100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -2162,7 +2162,8 @@ static void tcp_v4_fill_cb(struct sk_buff *skb, const struct iphdr *iph,
> TCP_SKB_CB(skb)->end_seq = (TCP_SKB_CB(skb)->seq + th->syn + th->fin +
> skb->len - th->doff * 4);
> TCP_SKB_CB(skb)->ack_seq = ntohl(th->ack_seq);
> - TCP_SKB_CB(skb)->tcp_flags = tcp_flag_byte(th);
> + TCP_SKB_CB(skb)->tcp_flags = ntohs(*(__be16 *)&tcp_flag_word(th)) &
> + TCPHDR_FLAGS_MASK;
As you access the same 2 bytes even later.
> TCP_SKB_CB(skb)->ip_dsfield = ipv4_get_dsfield(iph);
> TCP_SKB_CB(skb)->sacked = 0;
> TCP_SKB_CB(skb)->has_rxtstamp =
[...]
> @@ -1604,7 +1604,7 @@ int tcp_fragment(struct sock *sk, enum tcp_queue tcp_queue,
> int old_factor;
> long limit;
> int nlen;
> - u8 flags;
> + u16 flags;
Minor nit: please respect the reverse x-mas tree order
Cheers,
Paolo
* Re: [PATCH v4 net-next 04/14] tcp: extend TCP flags to allow AE bit/ACE field
2024-10-29 11:43 ` Paolo Abeni
@ 2024-10-29 11:45 ` Paolo Abeni
0 siblings, 0 replies; 24+ messages in thread
From: Paolo Abeni @ 2024-10-29 11:45 UTC (permalink / raw)
To: chia-yu.chang, netdev, davem, edumazet, kuba, dsahern,
netfilter-devel, kadlec, coreteam, pablo, bpf, joel.granados,
linux-fsdevel, kees, mcgrof, ij, ncardwell, koen.de_schepper,
g.white, ingemar.s.johansson, mirja.kuehlewind, cheshire, rs.ietf,
Jason_Livingood, vidhi_goel
On 10/29/24 12:43, Paolo Abeni wrote:
> On 10/21/24 23:59, chia-yu.chang@nokia-bell-labs.com wrote:
>> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
>> index 9d3dd101ea71..9fe314a59240 100644
>> --- a/net/ipv4/tcp_ipv4.c
>> +++ b/net/ipv4/tcp_ipv4.c
>> @@ -2162,7 +2162,8 @@ static void tcp_v4_fill_cb(struct sk_buff *skb, const struct iphdr *iph,
>> TCP_SKB_CB(skb)->end_seq = (TCP_SKB_CB(skb)->seq + th->syn + th->fin +
>> skb->len - th->doff * 4);
>> TCP_SKB_CB(skb)->ack_seq = ntohl(th->ack_seq);
>> - TCP_SKB_CB(skb)->tcp_flags = tcp_flag_byte(th);
>> + TCP_SKB_CB(skb)->tcp_flags = ntohs(*(__be16 *)&tcp_flag_word(th)) &
>> + TCPHDR_FLAGS_MASK;
>
> As you access the same 2 bytes even later.
[Whoops, sorry part of the reply was unintentionally stripped.]
I suggest creating a specific helper to fetch them.
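A minimal userspace sketch of what such a helper might look like. The name sketch_tcp_flags() and the 0x01ff mask are assumptions for illustration, not the patch's actual helper or its TCPHDR_FLAGS_MASK definition; the sketch only relies on the TCP header layout, where bytes 12-13 carry the data offset, the reserved bits, and the AE, CWR, ECE, URG, ACK, PSH, RST, SYN and FIN flags.

```c
#include <stdint.h>

/* Assumed mask: AE (bit 8) plus the eight classic flag bits (7..0). */
#define SKETCH_FLAGS_MASK 0x01ffu

/*
 * Hypothetical helper: read header bytes 12-13 in network byte order
 * and keep only the flag bits, dropping data offset and reserved bits.
 */
static uint16_t sketch_tcp_flags(const uint8_t *tcp_hdr)
{
	uint16_t word = (uint16_t)((tcp_hdr[12] << 8) | tcp_hdr[13]);

	return word & SKETCH_FLAGS_MASK;
}
```

With such a helper, every call site that needs the extended flags would go through one function instead of repeating the cast-and-mask expression.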
/P
* Re: [PATCH v4 net-next 09/14] gro: prevent ACE field corruption & better AccECN handling
2024-10-21 21:59 ` [PATCH v4 net-next 09/14] gro: prevent ACE field corruption & better AccECN handling chia-yu.chang
@ 2024-10-29 12:03 ` Paolo Abeni
2024-10-29 21:17 ` Ilpo Järvinen
0 siblings, 1 reply; 24+ messages in thread
From: Paolo Abeni @ 2024-10-29 12:03 UTC (permalink / raw)
To: chia-yu.chang, netdev, davem, edumazet, kuba, dsahern,
netfilter-devel, kadlec, coreteam, pablo, bpf, joel.granados,
linux-fsdevel, kees, mcgrof, ij, ncardwell, koen.de_schepper,
g.white, ingemar.s.johansson, mirja.kuehlewind, cheshire, rs.ietf,
Jason_Livingood, vidhi_goel
On 10/21/24 23:59, chia-yu.chang@nokia-bell-labs.com wrote:
> From: Ilpo Järvinen <ij@kernel.org>
>
> There are important differences in how the CWR field behaves
> in RFC3168 and AccECN. With AccECN, CWR flag is part of the
> ACE counter and its changes are important so adjust the flags
> changed mask accordingly.
>
> Also, if CWR is there, set the Accurate ECN GSO flag to avoid
> corrupting CWR flag somewhere.
>
> Signed-off-by: Ilpo Järvinen <ij@kernel.org>
> Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
> ---
> net/ipv4/tcp_offload.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/net/ipv4/tcp_offload.c b/net/ipv4/tcp_offload.c
> index 0b05f30e9e5f..f59762d88c38 100644
> --- a/net/ipv4/tcp_offload.c
> +++ b/net/ipv4/tcp_offload.c
> @@ -329,7 +329,7 @@ struct sk_buff *tcp_gro_receive(struct list_head *head, struct sk_buff *skb,
> th2 = tcp_hdr(p);
> flush = (__force int)(flags & TCP_FLAG_CWR);
> flush |= (__force int)((flags ^ tcp_flag_word(th2)) &
> - ~(TCP_FLAG_CWR | TCP_FLAG_FIN | TCP_FLAG_PSH));
> + ~(TCP_FLAG_FIN | TCP_FLAG_PSH));
If I read correctly, if the peer is using RFC3168 and TSO_ECN, GRO will
now pump into the stack twice the number of packets it was doing prior
to this patch, am I correct?
That is likely causing measurable performance regressions.
> flush |= (__force int)(th->ack_seq ^ th2->ack_seq);
> for (i = sizeof(*th); i < thlen; i += 4)
> flush |= *(u32 *)((u8 *)th + i) ^
> @@ -405,7 +405,7 @@ void tcp_gro_complete(struct sk_buff *skb)
> shinfo->gso_segs = NAPI_GRO_CB(skb)->count;
>
> if (th->cwr)
> - shinfo->gso_type |= SKB_GSO_TCP_ECN;
> + shinfo->gso_type |= SKB_GSO_TCP_ACCECN;
If this packet is forwarded, it will not leverage TSO anymore - with
current H/W.
I think we need a way to enable this feature conditionally, but I fear
another sysctl will be ugly and the additional conditionals will not be
good for GRO.
Smarter suggestions welcome ;)
Cheers,
Paolo
> }
> EXPORT_SYMBOL(tcp_gro_complete);
>
* Re: [PATCH v4 net-next 11/14] tcp: allow ECN bits in TOS/traffic class
2024-10-21 21:59 ` [PATCH v4 net-next 11/14] tcp: allow ECN bits in TOS/traffic class chia-yu.chang
@ 2024-10-29 12:18 ` Paolo Abeni
0 siblings, 0 replies; 24+ messages in thread
From: Paolo Abeni @ 2024-10-29 12:18 UTC (permalink / raw)
To: chia-yu.chang, netdev, davem, edumazet, kuba, dsahern,
netfilter-devel, kadlec, coreteam, pablo, bpf, joel.granados,
linux-fsdevel, kees, mcgrof, ij, ncardwell, koen.de_schepper,
g.white, ingemar.s.johansson, mirja.kuehlewind, cheshire, rs.ietf,
Jason_Livingood, vidhi_goel
On 10/21/24 23:59, chia-yu.chang@nokia-bell-labs.com wrote:
> @@ -2178,6 +2185,7 @@ static void tcp_v4_fill_cb(struct sk_buff *skb, const struct iphdr *iph,
> int tcp_v4_rcv(struct sk_buff *skb)
> {
> struct net *net = dev_net(skb->dev);
> + enum tcp_tw_status tw_status;
> enum skb_drop_reason drop_reason;
> int sdif = inet_sdif(skb);
> int dif = inet_iif(skb);
Minor nit: please respect the reverse x-mas tree order.
More instances below.
Cheers,
Paolo
* Re: [PATCH v4 net-next 14/14] net: sysctl: introduce sysctl SYSCTL_FIVE
2024-10-21 21:59 ` [PATCH v4 net-next 14/14] net: sysctl: introduce sysctl SYSCTL_FIVE chia-yu.chang
@ 2024-10-29 12:26 ` Paolo Abeni
2024-10-29 21:29 ` Ilpo Järvinen
2024-10-31 14:08 ` Joel Granados
2 siblings, 0 replies; 24+ messages in thread
From: Paolo Abeni @ 2024-10-29 12:26 UTC (permalink / raw)
To: chia-yu.chang, netdev, davem, edumazet, kuba, dsahern,
netfilter-devel, kadlec, coreteam, pablo, bpf, joel.granados,
linux-fsdevel, kees, mcgrof, ij, ncardwell, koen.de_schepper,
g.white, ingemar.s.johansson, mirja.kuehlewind, cheshire, rs.ietf,
Jason_Livingood, vidhi_goel
On 10/21/24 23:59, chia-yu.chang@nokia-bell-labs.com wrote:
> From: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
>
> Add SYSCTL_FIVE for new AccECN feedback modes of net.ipv4.tcp_ecn.
How many sysctl entries will use such value? If just one you are better
off not introducing the new sysctl value and instead using a static
constant in the tcp code.
Also this patch makes the commit message in the previous one incorrect.
Please adjust that.
Side note: in a new version, you should include the changelog in the
affected patches, after a '---' separator, to help the reviewers.
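The static-constant alternative could be sketched roughly as follows. This is a userspace stand-in, not kernel code: struct sketch_ctl_table mimics only the extra1/extra2 bound pointers of the real struct ctl_table, and the names and the maximum of 5 are assumptions taken from the commit message.

```c
#include <stddef.h>

/* Stand-in for the kernel's struct ctl_table; only the fields used here. */
struct sketch_ctl_table {
	const char *procname;
	const int *extra1;	/* lower bound */
	const int *extra2;	/* upper bound */
};

static const int sketch_zero;			/* plays the role of SYSCTL_ZERO */
static const int sketch_tcp_ecn_max = 5;	/* hypothetical, local to the TCP code */

static const struct sketch_ctl_table sketch_tcp_ecn_entry = {
	.procname = "tcp_ecn",
	.extra1 = &sketch_zero,
	.extra2 = &sketch_tcp_ecn_max,
};

/* Range check in the spirit of a minmax proc handler. */
static int sketch_in_range(const struct sketch_ctl_table *t, int val)
{
	return val >= *t->extra1 && val <= *t->extra2;
}
```

The bound lives next to its only user, so the shared sysctl_vals[] array and its index macros stay untouched.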
Thanks,
Paolo
* Re: [PATCH v4 net-next 09/14] gro: prevent ACE field corruption & better AccECN handling
2024-10-29 12:03 ` Paolo Abeni
@ 2024-10-29 21:17 ` Ilpo Järvinen
0 siblings, 0 replies; 24+ messages in thread
From: Ilpo Järvinen @ 2024-10-29 21:17 UTC (permalink / raw)
To: Paolo Abeni
Cc: chia-yu.chang, netdev, davem, edumazet, kuba, dsahern,
netfilter-devel, kadlec, coreteam, pablo, bpf, joel.granados,
linux-fsdevel, kees, mcgrof, ncardwell, koen.de_schepper, g.white,
ingemar.s.johansson, mirja.kuehlewind, cheshire, rs.ietf,
Jason_Livingood, vidhi_goel
[-- Attachment #1: Type: text/plain, Size: 5007 bytes --]
On Tue, 29 Oct 2024, Paolo Abeni wrote:
> On 10/21/24 23:59, chia-yu.chang@nokia-bell-labs.com wrote:
> > From: Ilpo Järvinen <ij@kernel.org>
> >
> > There are important differences in how the CWR field behaves
> > in RFC3168 and AccECN. With AccECN, CWR flag is part of the
> > ACE counter and its changes are important so adjust the flags
> > changed mask accordingly.
> >
> > Also, if CWR is there, set the Accurate ECN GSO flag to avoid
> > corrupting CWR flag somewhere.
> >
> > Signed-off-by: Ilpo Järvinen <ij@kernel.org>
> > Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
> > ---
> > net/ipv4/tcp_offload.c | 4 ++--
> > 1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/net/ipv4/tcp_offload.c b/net/ipv4/tcp_offload.c
> > index 0b05f30e9e5f..f59762d88c38 100644
> > --- a/net/ipv4/tcp_offload.c
> > +++ b/net/ipv4/tcp_offload.c
> > @@ -329,7 +329,7 @@ struct sk_buff *tcp_gro_receive(struct list_head *head, struct sk_buff *skb,
> > th2 = tcp_hdr(p);
> > flush = (__force int)(flags & TCP_FLAG_CWR);
> > flush |= (__force int)((flags ^ tcp_flag_word(th2)) &
> > - ~(TCP_FLAG_CWR | TCP_FLAG_FIN | TCP_FLAG_PSH));
> > + ~(TCP_FLAG_FIN | TCP_FLAG_PSH));
>
> If I read correctly, if the peer is using RFC3168 and TSO_ECN, GRO will
> now pump into the stack twice the number of packets it was doing prior
> to this patch, am I correct?
>
> That is likely causing measurable performance regressions.
Hi Paolo,
Thanks for taking a look!
While it's true on the surface that this might cause some more packets with
RFC3168 (by design, as the network cannot know whether the sender is using
RFC3168 or not), the important question is how many extra packets will
occur in practice.
First of all, RFC3168 requires the CWR flag to be sent no more frequently
than once per window of data, or in other words, once per RTT. And that
means just one packet, not e.g. all packets of a super-skb (the RFC3168
signalling will lose its integrity if this is violated by the sender).
Secondly, the TCP sender uses the CWR flag to indicate it just halved its
congestion window, which means it is sending half the number of packets in
this window compared to the previous window (analogous to halving the
sending rate). 2 RTTs with CWR each mean two window reductions (this
behavior is spec'ed in RFC3168).
So let's say the sender was using a congestion window of 100 packets; this
change will add one packet to the 50 packets of the next RTT. Note those
are raw numbers of packets on the wire and do not tell how many packets GRO
combined into each super-skb, which will vary on a case-by-case basis.
Regardless, I suspect the extra packet added to the half of the packets
will be hard/impossible to measure as a performance regression.
This change would double the number of packets only if the congestion
window is 1 or 2 packets, and in that case TSO/GSO/GRO benefits will be
pretty small to begin with (or even counterproductive). Also,
traditional TCP congestion control (RFC3168 included) has many issues
anyway with windows that small because it doesn't deal with fractional
congestion windows well.
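The arithmetic above can be sketched as a small function. It is only an illustration under the same simplification (raw packets on the wire, one extra packet per CWR-marked window, window halved on CWR); the function name is hypothetical and nothing here models how GRO actually batches packets.

```c
/*
 * Worst-case relative increase, in percent, of packets handed up the
 * stack in the RTT that follows a window reduction: one extra packet
 * on top of the halved window.
 */
static double sketch_extra_packet_percent(unsigned int cwnd_before)
{
	double cwnd_after = cwnd_before / 2.0;

	/* A sender cannot go below one packet per window. */
	if (cwnd_after < 1.0)
		cwnd_after = 1.0;

	return 100.0 / cwnd_after;
}
```

For a 100-packet window this gives 2%, while for a window of 1 or 2 packets it gives 100%, which is the doubling case discussed above.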
> > flush |= (__force int)(th->ack_seq ^ th2->ack_seq);
> > for (i = sizeof(*th); i < thlen; i += 4)
> > flush |= *(u32 *)((u8 *)th + i) ^
> > @@ -405,7 +405,7 @@ void tcp_gro_complete(struct sk_buff *skb)
> > shinfo->gso_segs = NAPI_GRO_CB(skb)->count;
> >
> > if (th->cwr)
> > - shinfo->gso_type |= SKB_GSO_TCP_ECN;
> > + shinfo->gso_type |= SKB_GSO_TCP_ACCECN;
>
> If this packet is forwarded, it will not leverage TSO anymore - with
> current H/W.
>
> I think we need a way to enable this feature conditionally, but I fear
> another sysctl will be ugly and the additional conditionals will not be
> good for GRO.
>
> Smarter suggestions welcome ;)
Well, it is already very selectively _conditional_, SKB_GSO_TCP_ACCECN is
only set for the skb when CWR is set. That is, once per RTT (data window)
when it comes to RFC3168.
I don't have any source for this (other than reading many many tcpdumps
in the past) but I believe the percentage of packets with CWR set (due to
RFC3168 signalling) is going to be very small overall.
Do you think that is not good enough?
To answer your suggestion of making it conditional on some other logic
more generally: it would mean accepting that network middleboxes are
allowed to corrupt the AccECN ACE field when forwarding. If the RFC3168
TSO/GSO trickery remains in use (without the middlebox explicitly tracking
that the connection had negotiated RFC3168), a forwarder won't be able to
reproduce exactly the same stream of TCP packet headers, thus corrupting
non-RFC3168 use of the CWR flag. It's not something any middlebox should be
doing (I hope we agree on this as a general principle)!
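A small sketch of the corruption scenario described above, assuming the usual RFC3168-style segmentation behaviour (CWR kept on the first resegmented packet only) versus an AccECN-safe mode that copies CWR unchanged to every segment. The enum and function names are hypothetical and the model ignores everything except the CWR bit.

```c
#include <stdbool.h>
#include <stddef.h>

enum sketch_gso_mode { SKETCH_GSO_RFC3168, SKETCH_GSO_ACCECN };

/*
 * Fill cwr_out[0..nsegs-1] with the CWR bit each resegmented packet
 * would carry when a merged super-packet with CWR == cwr_in is split.
 */
static void sketch_resegment_cwr(bool cwr_in, enum sketch_gso_mode mode,
				 bool *cwr_out, size_t nsegs)
{
	for (size_t i = 0; i < nsegs; i++) {
		if (mode == SKETCH_GSO_RFC3168)
			cwr_out[i] = cwr_in && i == 0;	/* first segment only */
		else
			cwr_out[i] = cwr_in;		/* preserved everywhere */
	}
}
```

If three AccECN packets that all carry CWR as part of the ACE field were merged and then split in the RFC3168 mode, the second and third would leave the forwarder with CWR cleared, so the receiver would see a different ACE value than the one that was sent.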
--
i.
> Cheers,
>
> Paolo
>
> > }
> > EXPORT_SYMBOL(tcp_gro_complete);
> >
>
>
* Re: [PATCH v4 net-next 14/14] net: sysctl: introduce sysctl SYSCTL_FIVE
2024-10-21 21:59 ` [PATCH v4 net-next 14/14] net: sysctl: introduce sysctl SYSCTL_FIVE chia-yu.chang
2024-10-29 12:26 ` Paolo Abeni
@ 2024-10-29 21:29 ` Ilpo Järvinen
2024-10-31 14:08 ` Joel Granados
2 siblings, 0 replies; 24+ messages in thread
From: Ilpo Järvinen @ 2024-10-29 21:29 UTC (permalink / raw)
To: Chia-Yu Chang
Cc: netdev, davem, edumazet, kuba, pabeni, dsahern, netfilter-devel,
kadlec, coreteam, pablo, bpf, joel.granados, linux-fsdevel, kees,
mcgrof, ncardwell, koen.de_schepper, g.white, ingemar.s.johansson,
mirja.kuehlewind, cheshire, rs.ietf, Jason_Livingood, vidhi_goel
On Mon, 21 Oct 2024, chia-yu.chang@nokia-bell-labs.com wrote:
> From: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
>
> Add SYSCTL_FIVE for new AccECN feedback modes of net.ipv4.tcp_ecn.
>
> Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
> ---
> include/linux/sysctl.h | 17 +++++++++--------
> kernel/sysctl.c | 3 ++-
> 2 files changed, 11 insertions(+), 9 deletions(-)
>
> diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h
> index aa4c6d44aaa0..37c95a70c10e 100644
> --- a/include/linux/sysctl.h
> +++ b/include/linux/sysctl.h
> @@ -37,21 +37,22 @@ struct ctl_table_root;
> struct ctl_table_header;
> struct ctl_dir;
>
> -/* Keep the same order as in fs/proc/proc_sysctl.c */
> +/* Keep the same order as in kernel/sysctl.c */
> #define SYSCTL_ZERO ((void *)&sysctl_vals[0])
> #define SYSCTL_ONE ((void *)&sysctl_vals[1])
> #define SYSCTL_TWO ((void *)&sysctl_vals[2])
> #define SYSCTL_THREE ((void *)&sysctl_vals[3])
> #define SYSCTL_FOUR ((void *)&sysctl_vals[4])
> -#define SYSCTL_ONE_HUNDRED ((void *)&sysctl_vals[5])
> -#define SYSCTL_TWO_HUNDRED ((void *)&sysctl_vals[6])
> -#define SYSCTL_ONE_THOUSAND ((void *)&sysctl_vals[7])
> -#define SYSCTL_THREE_THOUSAND ((void *)&sysctl_vals[8])
> -#define SYSCTL_INT_MAX ((void *)&sysctl_vals[9])
> +#define SYSCTL_FIVE ((void *)&sysctl_vals[5])
> +#define SYSCTL_ONE_HUNDRED ((void *)&sysctl_vals[6])
> +#define SYSCTL_TWO_HUNDRED ((void *)&sysctl_vals[7])
> +#define SYSCTL_ONE_THOUSAND ((void *)&sysctl_vals[8])
> +#define SYSCTL_THREE_THOUSAND ((void *)&sysctl_vals[9])
> +#define SYSCTL_INT_MAX ((void *)&sysctl_vals[10])
>
> /* this is needed for the proc_dointvec_minmax for [fs_]overflow UID and GID */
> -#define SYSCTL_MAXOLDUID ((void *)&sysctl_vals[10])
> -#define SYSCTL_NEG_ONE ((void *)&sysctl_vals[11])
> +#define SYSCTL_MAXOLDUID ((void *)&sysctl_vals[11])
> +#define SYSCTL_NEG_ONE ((void *)&sysctl_vals[12])
>
> extern const int sysctl_vals[];
>
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 79e6cb1d5c48..68b6ca67a0c6 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -82,7 +82,8 @@
> #endif
>
> /* shared constants to be used in various sysctls */
> -const int sysctl_vals[] = { 0, 1, 2, 3, 4, 100, 200, 1000, 3000, INT_MAX, 65535, -1 };
> +const int sysctl_vals[] = { 0, 1, 2, 3, 4, 5, 100, 200, 1000, 3000, INT_MAX,
> + 65535, -1 };
> EXPORT_SYMBOL(sysctl_vals);
>
> const unsigned long sysctl_long_vals[] = { 0, 1, LONG_MAX };
Hi,
I know I suggested that you put this change into this first batch of
AccECN patches, but I've since changed my mind.
I think this should be moved to the very tail of the AccECN changes in the
series and joined together with the part of the change which allows setting
net.ipv4.tcp_ecn to those higher values. Currently the latter is done in
the AccECN negotiation patch (IIRC), but that part should be moved into a
separate patch, together with this change, that comes only after all AccECN
patches have been included, to prevent enabling AccECN in incomplete form.
(This comment is orthogonal to Paolo's suggestion to use a static constant.
So whichever form is chosen, it should be with the net.ipv4.tcp_ecn
change at the end of the AccECN changes.)
--
i.
* Re: [PATCH v4 net-next 14/14] net: sysctl: introduce sysctl SYSCTL_FIVE
2024-10-21 21:59 ` [PATCH v4 net-next 14/14] net: sysctl: introduce sysctl SYSCTL_FIVE chia-yu.chang
2024-10-29 12:26 ` Paolo Abeni
2024-10-29 21:29 ` Ilpo Järvinen
@ 2024-10-31 14:08 ` Joel Granados
2024-10-31 15:44 ` Chia-Yu Chang (Nokia)
2 siblings, 1 reply; 24+ messages in thread
From: Joel Granados @ 2024-10-31 14:08 UTC (permalink / raw)
To: chia-yu.chang
Cc: netdev, davem, edumazet, kuba, pabeni, dsahern, netfilter-devel,
kadlec, coreteam, pablo, bpf, linux-fsdevel, kees, mcgrof, ij,
ncardwell, koen.de_schepper, g.white, ingemar.s.johansson,
mirja.kuehlewind, cheshire, rs.ietf, Jason_Livingood, vidhi_goel
On Mon, Oct 21, 2024 at 11:59:10PM +0200, chia-yu.chang@nokia-bell-labs.com wrote:
> From: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
>
> Add SYSCTL_FIVE for new AccECN feedback modes of net.ipv4.tcp_ecn.
>
> Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
> ---
> include/linux/sysctl.h | 17 +++++++++--------
> kernel/sysctl.c | 3 ++-
> 2 files changed, 11 insertions(+), 9 deletions(-)
>
> diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h
> index aa4c6d44aaa0..37c95a70c10e 100644
> --- a/include/linux/sysctl.h
> +++ b/include/linux/sysctl.h
> @@ -37,21 +37,22 @@ struct ctl_table_root;
> struct ctl_table_header;
> struct ctl_dir;
>
> -/* Keep the same order as in fs/proc/proc_sysctl.c */
> +/* Keep the same order as in kernel/sysctl.c */
> #define SYSCTL_ZERO ((void *)&sysctl_vals[0])
> #define SYSCTL_ONE ((void *)&sysctl_vals[1])
> #define SYSCTL_TWO ((void *)&sysctl_vals[2])
> #define SYSCTL_THREE ((void *)&sysctl_vals[3])
> #define SYSCTL_FOUR ((void *)&sysctl_vals[4])
> -#define SYSCTL_ONE_HUNDRED ((void *)&sysctl_vals[5])
> -#define SYSCTL_TWO_HUNDRED ((void *)&sysctl_vals[6])
> -#define SYSCTL_ONE_THOUSAND ((void *)&sysctl_vals[7])
> -#define SYSCTL_THREE_THOUSAND ((void *)&sysctl_vals[8])
> -#define SYSCTL_INT_MAX ((void *)&sysctl_vals[9])
> +#define SYSCTL_FIVE ((void *)&sysctl_vals[5])
Is it necessary to insert the value instead of appending it to the end
of sysctl_vals? I would actually consider Paolo Abeni's suggestion to
just use a constant if you are using it only in one place.
> +#define SYSCTL_ONE_HUNDRED ((void *)&sysctl_vals[6])
> +#define SYSCTL_TWO_HUNDRED ((void *)&sysctl_vals[7])
> +#define SYSCTL_ONE_THOUSAND ((void *)&sysctl_vals[8])
> +#define SYSCTL_THREE_THOUSAND ((void *)&sysctl_vals[9])
> +#define SYSCTL_INT_MAX ((void *)&sysctl_vals[10])
>
> /* this is needed for the proc_dointvec_minmax for [fs_]overflow UID and GID */
> -#define SYSCTL_MAXOLDUID ((void *)&sysctl_vals[10])
> -#define SYSCTL_NEG_ONE ((void *)&sysctl_vals[11])
> +#define SYSCTL_MAXOLDUID ((void *)&sysctl_vals[11])
> +#define SYSCTL_NEG_ONE ((void *)&sysctl_vals[12])
>
> extern const int sysctl_vals[];
>
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 79e6cb1d5c48..68b6ca67a0c6 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -82,7 +82,8 @@
> #endif
>
> /* shared constants to be used in various sysctls */
> -const int sysctl_vals[] = { 0, 1, 2, 3, 4, 100, 200, 1000, 3000, INT_MAX, 65535, -1 };
> +const int sysctl_vals[] = { 0, 1, 2, 3, 4, 5, 100, 200, 1000, 3000, INT_MAX,
> + 65535, -1 };
> EXPORT_SYMBOL(sysctl_vals);
>
> const unsigned long sysctl_long_vals[] = { 0, 1, LONG_MAX };
> --
> 2.34.1
>
--
Joel Granados
* RE: [PATCH v4 net-next 14/14] net: sysctl: introduce sysctl SYSCTL_FIVE
2024-10-31 14:08 ` Joel Granados
@ 2024-10-31 15:44 ` Chia-Yu Chang (Nokia)
0 siblings, 0 replies; 24+ messages in thread
From: Chia-Yu Chang (Nokia) @ 2024-10-31 15:44 UTC (permalink / raw)
To: Joel Granados
Cc: netdev@vger.kernel.org, davem@davemloft.net, edumazet@google.com,
kuba@kernel.org, pabeni@redhat.com, dsahern@kernel.org,
netfilter-devel@vger.kernel.org, kadlec@netfilter.org,
coreteam@netfilter.org, pablo@netfilter.org, bpf@vger.kernel.org,
linux-fsdevel@vger.kernel.org, kees@kernel.org, mcgrof@kernel.org,
ij@kernel.org, ncardwell@google.com, Koen De Schepper (Nokia),
g.white@cablelabs.com, ingemar.s.johansson@ericsson.com,
mirja.kuehlewind@ericsson.com, cheshire@apple.com, rs.ietf@gmx.at,
Jason_Livingood@comcast.com, vidhi_goel@apple.com
Hi Paolo and Joel,
We will remove this patch, as we have checked that it will only be used by tcp_ecn in the upcoming patch.
Brs,
Chia-Yu
-----Original Message-----
From: Joel Granados <joel.granados@kernel.org>
Sent: Thursday, October 31, 2024 3:09 PM
To: Chia-Yu Chang (Nokia) <chia-yu.chang@nokia-bell-labs.com>
Cc: netdev@vger.kernel.org; davem@davemloft.net; edumazet@google.com; kuba@kernel.org; pabeni@redhat.com; dsahern@kernel.org; netfilter-devel@vger.kernel.org; kadlec@netfilter.org; coreteam@netfilter.org; pablo@netfilter.org; bpf@vger.kernel.org; linux-fsdevel@vger.kernel.org; kees@kernel.org; mcgrof@kernel.org; ij@kernel.org; ncardwell@google.com; Koen De Schepper (Nokia) <koen.de_schepper@nokia-bell-labs.com>; g.white@cablelabs.com; ingemar.s.johansson@ericsson.com; mirja.kuehlewind@ericsson.com; cheshire@apple.com; rs.ietf@gmx.at; Jason_Livingood@comcast.com; vidhi_goel@apple.com
Subject: Re: [PATCH v4 net-next 14/14] net: sysctl: introduce sysctl SYSCTL_FIVE
On Mon, Oct 21, 2024 at 11:59:10PM +0200, chia-yu.chang@nokia-bell-labs.com wrote:
> From: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
>
> Add SYSCTL_FIVE for new AccECN feedback modes of net.ipv4.tcp_ecn.
>
> Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
> ---
> include/linux/sysctl.h | 17 +++++++++--------
> kernel/sysctl.c | 3 ++-
> 2 files changed, 11 insertions(+), 9 deletions(-)
>
> diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h
> index aa4c6d44aaa0..37c95a70c10e 100644
> --- a/include/linux/sysctl.h
> +++ b/include/linux/sysctl.h
> @@ -37,21 +37,22 @@ struct ctl_table_root;
> struct ctl_table_header;
> struct ctl_dir;
>
> -/* Keep the same order as in fs/proc/proc_sysctl.c */
> +/* Keep the same order as in kernel/sysctl.c */
> #define SYSCTL_ZERO ((void *)&sysctl_vals[0])
> #define SYSCTL_ONE ((void *)&sysctl_vals[1])
> #define SYSCTL_TWO ((void *)&sysctl_vals[2])
> #define SYSCTL_THREE ((void *)&sysctl_vals[3])
> #define SYSCTL_FOUR ((void *)&sysctl_vals[4])
> -#define SYSCTL_ONE_HUNDRED ((void *)&sysctl_vals[5])
> -#define SYSCTL_TWO_HUNDRED ((void *)&sysctl_vals[6])
> -#define SYSCTL_ONE_THOUSAND ((void *)&sysctl_vals[7])
> -#define SYSCTL_THREE_THOUSAND ((void *)&sysctl_vals[8])
> -#define SYSCTL_INT_MAX ((void *)&sysctl_vals[9])
> +#define SYSCTL_FIVE ((void *)&sysctl_vals[5])
Is it necessary to insert the value instead of appending it to the end
of sysctl_vals? I would actually consider Paolo Abeni's suggestion to
just use a constant if you are using it only in one place.
> +#define SYSCTL_ONE_HUNDRED ((void *)&sysctl_vals[6])
> +#define SYSCTL_TWO_HUNDRED ((void *)&sysctl_vals[7])
> +#define SYSCTL_ONE_THOUSAND ((void *)&sysctl_vals[8])
> +#define SYSCTL_THREE_THOUSAND ((void *)&sysctl_vals[9])
> +#define SYSCTL_INT_MAX ((void *)&sysctl_vals[10])
>
> /* this is needed for the proc_dointvec_minmax for [fs_]overflow UID and GID */
> -#define SYSCTL_MAXOLDUID ((void *)&sysctl_vals[10])
> -#define SYSCTL_NEG_ONE ((void *)&sysctl_vals[11])
> +#define SYSCTL_MAXOLDUID ((void *)&sysctl_vals[11])
> +#define SYSCTL_NEG_ONE ((void *)&sysctl_vals[12])
>
> extern const int sysctl_vals[];
>
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 79e6cb1d5c48..68b6ca67a0c6 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -82,7 +82,8 @@
> #endif
>
> /* shared constants to be used in various sysctls */
> -const int sysctl_vals[] = { 0, 1, 2, 3, 4, 100, 200, 1000, 3000, INT_MAX, 65535, -1 };
> +const int sysctl_vals[] = { 0, 1, 2, 3, 4, 5, 100, 200, 1000, 3000, INT_MAX,
> + 65535, -1 };
> EXPORT_SYMBOL(sysctl_vals);
>
> const unsigned long sysctl_long_vals[] = { 0, 1, LONG_MAX };
> --
> 2.34.1
>
--
Joel Granados
Thread overview: 24+ messages
2024-10-21 21:58 [PATCH v4 net-next 00/14] AccECN protocol preparation patch series chia-yu.chang
2024-10-21 21:58 ` [PATCH v4 net-next 01/14] tcp: reorganize tcp_in_ack_event() and tcp_count_delivered() chia-yu.chang
2024-10-21 21:58 ` [PATCH v4 net-next 02/14] tcp: create FLAG_TS_PROGRESS chia-yu.chang
2024-10-21 21:58 ` [PATCH v4 net-next 03/14] tcp: use BIT() macro in include/net/tcp.h chia-yu.chang
2024-10-21 21:59 ` [PATCH v4 net-next 04/14] tcp: extend TCP flags to allow AE bit/ACE field chia-yu.chang
2024-10-29 11:43 ` Paolo Abeni
2024-10-29 11:45 ` Paolo Abeni
2024-10-21 21:59 ` [PATCH v4 net-next 05/14] tcp: reorganize SYN ECN code chia-yu.chang
2024-10-21 21:59 ` [PATCH v4 net-next 06/14] tcp: rework {__,}tcp_ecn_check_ce() -> tcp_data_ecn_check() chia-yu.chang
2024-10-21 21:59 ` [PATCH v4 net-next 07/14] tcp: helpers for ECN mode handling chia-yu.chang
2024-10-21 21:59 ` [PATCH v4 net-next 08/14] gso: AccECN support chia-yu.chang
2024-10-21 21:59 ` [PATCH v4 net-next 09/14] gro: prevent ACE field corruption & better AccECN handling chia-yu.chang
2024-10-29 12:03 ` Paolo Abeni
2024-10-29 21:17 ` Ilpo Järvinen
2024-10-21 21:59 ` [PATCH v4 net-next 10/14] tcp: AccECN support to tcp_add_backlog chia-yu.chang
2024-10-21 21:59 ` [PATCH v4 net-next 11/14] tcp: allow ECN bits in TOS/traffic class chia-yu.chang
2024-10-29 12:18 ` Paolo Abeni
2024-10-21 21:59 ` [PATCH v4 net-next 12/14] tcp: Pass flags to __tcp_send_ack chia-yu.chang
2024-10-21 21:59 ` [PATCH v4 net-next 13/14] tcp: fast path functions later chia-yu.chang
2024-10-21 21:59 ` [PATCH v4 net-next 14/14] net: sysctl: introduce sysctl SYSCTL_FIVE chia-yu.chang
2024-10-29 12:26 ` Paolo Abeni
2024-10-29 21:29 ` Ilpo Järvinen
2024-10-31 14:08 ` Joel Granados
2024-10-31 15:44 ` Chia-Yu Chang (Nokia)