* [PATCH net-next 00/44] DualPI2, Accurate ECN, TCP Prague patch series
@ 2024-10-15 10:28 chia-yu.chang
2024-10-15 10:28 ` [PATCH net-next 01/44] sched: Add dualpi2 qdisc chia-yu.chang
` (44 more replies)
0 siblings, 45 replies; 56+ messages in thread
From: chia-yu.chang @ 2024-10-15 10:28 UTC (permalink / raw)
To: netdev, ij, ncardwell, koen.de_schepper, g.white,
ingemar.s.johansson, mirja.kuehlewind, cheshire, rs.ietf,
Jason_Livingood, vidhi_goel
Cc: Chia-Yu Chang
From: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Hello,
Please find the enclosed patch series covering the L4S (Low Latency,
Low Loss, and Scalable Throughput) architecture as outlined in IETF RFC9330:
https://datatracker.ietf.org/doc/html/rfc9330
* 1 patch for DualPI2 (cf. IETF RFC9332
https://datatracker.ietf.org/doc/html/rfc9332)
* 40 patches for Accurate ECN (implementing the AccECN protocol's
  negotiation, feedback, and compliance requirements:
https://datatracker.ietf.org/doc/html/draft-ietf-tcpm-accurate-ecn-28)
* 3 patches for TCP Prague (implementing the performance and safety
  requirements listed in Appendix A of IETF RFC9331:
https://datatracker.ietf.org/doc/html/rfc9331)
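As a quick illustration of how an application could opt into the new
congestion control module once this series is applied, a minimal userspace
sketch using the standard TCP_CONGESTION socket option is shown below. The
module name "prague" follows net/ipv4/tcp_prague.c in this series; loading
the module and permitting it for the caller is assumed to be done separately:

  /* Hedged sketch: select TCP Prague on a socket, assuming the
   * tcp_prague module from this series is available on the host.
   */
  #include <stdio.h>
  #include <string.h>
  #include <netinet/in.h>
  #include <netinet/tcp.h>
  #include <sys/socket.h>
  #include <unistd.h>

  int main(void)
  {
          int fd = socket(AF_INET, SOCK_STREAM, 0);
          const char cc[] = "prague";

          if (fd < 0) {
                  perror("socket");
                  return 1;
          }
          if (setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION,
                         cc, strlen(cc)) < 0)
                  perror("setsockopt(TCP_CONGESTION)");
          close(fd);
          return 0;
  }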
Best regards,
Chia-Yu
Chia-Yu Chang (17):
tcp: use BIT() macro in include/net/tcp.h
net: sysctl: introduce sysctl SYSCTL_FIVE
tcp: accecn: AccECN option failure handling
tcp: L4S ECT(1) identifier for CC modules
tcp: disable RFC3168 fallback identifier for CC modules
tcp: accecn: handle unexpected AccECN negotiation feedback
tcp: accecn: retransmit downgraded SYN in AccECN negotiation
tcp: move increment of num_retrans
tcp: accecn: retransmit SYN/ACK without AccECN option or non-AccECN
SYN/ACK
tcp: accecn: unset ECT if receive or send ACE=0 in AccECN negotiation
tcp: accecn: fallback outgoing half link to non-AccECN
tcp: accecn: verify ACE counter in 1st ACK after AccECN negotiation
tcp: accecn: stop sending AccECN option when loss ACK with AccECN
option
Documentation: networking: Update ECN related sysctls
tcp: Add tso_segs() CC callback for TCP Prague
tcp: Add mss_cache_set_by_ca for CC algorithm to set MSS
tcp: Add the TCP Prague congestion control module
Ilpo Järvinen (26):
tcp: reorganize tcp_in_ack_event() and tcp_count_delivered()
tcp: create FLAG_TS_PROGRESS
tcp: extend TCP flags to allow AE bit/ACE field
tcp: reorganize SYN ECN code
tcp: rework {__,}tcp_ecn_check_ce() -> tcp_data_ecn_check()
tcp: helpers for ECN mode handling
gso: AccECN support
gro: prevent ACE field corruption & better AccECN handling
tcp: AccECN support to tcp_add_backlog
tcp: allow ECN bits in TOS/traffic class
tcp: Pass flags to __tcp_send_ack
tcp: fast path functions later
tcp: AccECN core
tcp: accecn: AccECN negotiation
tcp: accecn: add AccECN rx byte counters
tcp: allow embedding leftover into option padding
tcp: accecn: AccECN needs to know delivered bytes
tcp: sack option handling improvements
tcp: accecn: AccECN option
tcp: accecn: AccECN option send control
tcp: accecn: AccECN option ceb/cep heuristic
tcp: accecn: AccECN ACE field multi-wrap heuristic
tcp: accecn: try to fit AccECN option with SACK
tcp: try to avoid safer when ACKs are thinned
gro: flushing when CWR is set negatively affects AccECN
tcp: accecn: Add ece_delta to rate_sample
Koen De Schepper (1):
sched: Add dualpi2 qdisc
Documentation/networking/ip-sysctl.rst | 55 +-
include/linux/netdev_features.h | 5 +-
include/linux/netdevice.h | 2 +
include/linux/skbuff.h | 2 +
include/linux/sysctl.h | 17 +-
include/linux/tcp.h | 31 +-
include/net/inet_ecn.h | 20 +-
include/net/netns/ipv4.h | 2 +
include/net/tcp.h | 299 +++++--
include/uapi/linux/inet_diag.h | 13 +
include/uapi/linux/pkt_sched.h | 34 +
include/uapi/linux/tcp.h | 16 +-
kernel/sysctl.c | 2 +-
net/ethtool/common.c | 1 +
net/ipv4/Kconfig | 37 +
net/ipv4/Makefile | 1 +
net/ipv4/bpf_tcp_ca.c | 2 +-
net/ipv4/inet_connection_sock.c | 8 +-
net/ipv4/ip_output.c | 3 +-
net/ipv4/syncookies.c | 3 +
net/ipv4/sysctl_net_ipv4.c | 18 +
net/ipv4/tcp.c | 26 +-
net/ipv4/tcp_cong.c | 9 +-
net/ipv4/tcp_dctcp.c | 2 +-
net/ipv4/tcp_dctcp.h | 2 +-
net/ipv4/tcp_input.c | 689 ++++++++++++++--
net/ipv4/tcp_ipv4.c | 33 +-
net/ipv4/tcp_minisocks.c | 117 ++-
net/ipv4/tcp_offload.c | 13 +-
net/ipv4/tcp_output.c | 336 +++++++-
net/ipv4/tcp_prague.c | 866 ++++++++++++++++++++
net/ipv6/syncookies.c | 1 +
net/ipv6/tcp_ipv6.c | 27 +-
net/netfilter/nf_log_syslog.c | 8 +-
net/sched/Kconfig | 20 +
net/sched/Makefile | 1 +
net/sched/sch_dualpi2.c | 1046 ++++++++++++++++++++++++
37 files changed, 3519 insertions(+), 248 deletions(-)
create mode 100644 net/ipv4/tcp_prague.c
create mode 100644 net/sched/sch_dualpi2.c
--
2.34.1
* [PATCH net-next 01/44] sched: Add dualpi2 qdisc
2024-10-15 10:28 [PATCH net-next 00/44] DualPI2, Accurate ECN, TCP Prague patch series chia-yu.chang
@ 2024-10-15 10:28 ` chia-yu.chang
2024-10-15 15:30 ` Jamal Hadi Salim
2024-10-15 10:28 ` [PATCH net-next 02/44] tcp: reorganize tcp_in_ack_event() and tcp_count_delivered() chia-yu.chang
` (43 subsequent siblings)
44 siblings, 1 reply; 56+ messages in thread
From: chia-yu.chang @ 2024-10-15 10:28 UTC (permalink / raw)
To: netdev, ij, ncardwell, koen.de_schepper, g.white,
ingemar.s.johansson, mirja.kuehlewind, cheshire, rs.ietf,
Jason_Livingood, vidhi_goel
Cc: Olga Albisser, Olivier Tilmans, Henrik Steen, Bob Briscoe,
Chia-Yu Chang
From: Koen De Schepper <koen.de_schepper@nokia-bell-labs.com>
DualPI2 provides L4S-type low latency & low loss to traffic that uses a
scalable congestion controller (e.g. TCP-Prague, DCTCP) without
degrading the performance of 'classic' traffic (e.g. Reno,
Cubic etc.). It is intended to be the reference implementation of the
IETF's DualQ Coupled AQM.
The qdisc provides two queues called low latency and classic. It
classifies packets based on the ECN field in the IP headers. By
default it directs non-ECN and ECT(0) into the classic queue and
ECT(1) and CE into the low latency queue, as per the IETF spec.
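A small userspace sketch of that default classification rule follows. This
is only a model of the rule described above, not the kernel code; the
in-kernel logic lives in dualpi2_skb_classify() further down and can be
overridden with tc filters or the ecn_mask parameter:

  /* Default DualPI2 classification: ECT(1) and CE go to the low-latency
   * (L) queue, Not-ECT and ECT(0) go to the classic (C) queue. The
   * "& ECT_1" mask matches both ECT(1) (0b01) and CE (0b11).
   */
  #include <stdio.h>

  enum { NOT_ECT = 0, ECT_1 = 1, ECT_0 = 2, CE = 3 }; /* IP ECN codepoints */

  static const char *dualpi2_default_queue(unsigned int ecn)
  {
          return (ecn & ECT_1) ? "L (low latency)" : "C (classic)";
  }

  int main(void)
  {
          unsigned int cp;

          for (cp = NOT_ECT; cp <= CE; cp++)
                  printf("ECN codepoint %u -> %s queue\n", cp,
                         dualpi2_default_queue(cp));
          return 0;
  }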
Each queue runs its own AQM:
* The classic AQM is called PI2, which is similar to the PIE AQM but
more responsive and simpler. Classic traffic requires a decent
target queuing delay (default 15ms for Internet deployment) to fully
utilize the link and to avoid high drop rates.
* The low latency AQM is, by default, a very shallow ECN marking
threshold (1ms) similar to that used for DCTCP.
The DualQ isolates the low queuing delay of the Low Latency queue
from the larger delay of the 'Classic' queue. However, from a
bandwidth perspective, flows in either queue will share out the link
capacity as if there was just a single queue. This bandwidth pooling
effect is achieved by coupling together the drop and ECN-marking
probabilities of the two AQMs.
The PI2 AQM has two main parameters in addition to its target delay.
All the defaults are suitable for any Internet setting, but it can
be reconfigured for a Data Centre setting. The integral gain factor
alpha is used to slowly correct any persistent standing queue error
from the target delay, while the proportional gain factor beta is
used to quickly compensate for queue changes (growth or shrinkage).
Either alpha and beta are given as parameters, or they can be
calculated by tc from alternative typical and maximum RTT parameters.
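As a rough floating-point sketch of that update (the kernel code in
calculate_probability() below does the equivalent in scaled fixed-point
arithmetic; the example alpha/beta values of 0.16 and 3.2 correspond
roughly to the defaults configured in dualpi2_reset_default()):

  /* PI2 probability update run every tupdate: alpha acts on the error
   * against the target delay, beta on the change in delay since the
   * last update.
   */
  #include <stdio.h>

  static double pi2_update(double p, double qdelay, double qdelay_old,
                           double target, double alpha, double beta)
  {
          p += alpha * (qdelay - target) + beta * (qdelay - qdelay_old);
          if (p < 0.0)
                  p = 0.0;
          if (p > 1.0)
                  p = 1.0;
          return p;
  }

  int main(void)
  {
          /* Example: delay grew from 10ms to 20ms against a 15ms target */
          double p = pi2_update(0.01, 0.020, 0.010, 0.015, 0.16, 3.2);

          printf("updated PI probability: %f\n", p);
          return 0;
  }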
Internally, the output of a linear Proportional Integral (PI)
controller is used for both queues. This output is squared to
calculate the drop or ECN-marking probability of the classic queue.
This counterbalances the square-root rate equation of Reno/Cubic,
which is the trick that balances flow rates across the queues. For
the ECN-marking probability of the low latency queue, the output of
the base AQM is multiplied by a coupling factor. This determines the
balance between the flow rates in each queue. The default setting
makes the flow rates roughly equal, which should be generally
applicable.
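Expressed as a floating-point sketch (the in-kernel code uses scaled
32-bit probabilities and two independent random draws; see must_drop()
and its helpers below), this amounts to:

  /* The single PI output p is applied differently per queue: classic
   * traffic sees p squared (counterbalancing Reno/Cubic's 1/sqrt(p)
   * rate law), while the L4S queue is ECN-marked at k * p, capped at
   * 100%. k = 2 is the default coupling factor.
   */
  #include <stdio.h>

  static double classic_prob(double p)
  {
          return p * p;
  }

  static double l4s_prob(double p, double k)
  {
          return (p * k > 1.0) ? 1.0 : p * k;
  }

  int main(void)
  {
          double p = 0.05; /* example base PI probability */

          printf("p=%.3f -> pC=%.4f, pL=%.3f\n",
                 p, classic_prob(p), l4s_prob(p, 2.0));
          return 0;
  }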
If the DualPI2 AQM detects overload (due to excessive non-responsive
traffic in either queue), it will switch to signaling congestion
solely using drops, irrespective of the ECN field. Alternatively, it
can be configured to limit the drop probability and let the queue
grow and eventually overflow (like tail-drop).
Additional details can be found in IETF RFC9332:
https://datatracker.ietf.org/doc/html/rfc9332
Signed-off-by: Koen De Schepper <koen.de_schepper@nokia-bell-labs.com>
Co-developed-by: Olga Albisser <olga@albisser.org>
Signed-off-by: Olga Albisser <olga@albisser.org>
Co-developed-by: Olivier Tilmans <olivier.tilmans@nokia.com>
Signed-off-by: Olivier Tilmans <olivier.tilmans@nokia.com>
Co-developed-by: Henrik Steen <henrist@henrist.net>
Signed-off-by: Henrik Steen <henrist@henrist.net>
Signed-off-by: Bob Briscoe <research@bobbriscoe.net>
Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Co-developed-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
---
include/linux/netdevice.h | 1 +
include/uapi/linux/pkt_sched.h | 34 ++
net/sched/Kconfig | 20 +
net/sched/Makefile | 1 +
net/sched/sch_dualpi2.c | 1046 ++++++++++++++++++++++++++++++++
5 files changed, 1102 insertions(+)
create mode 100644 net/sched/sch_dualpi2.c
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 8feaca12655e..bdd7d6262112 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -30,6 +30,7 @@
#include <asm/byteorder.h>
#include <asm/local.h>
+#include <linux/netdev_features.h>
#include <linux/percpu.h>
#include <linux/rculist.h>
#include <linux/workqueue.h>
diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
index 25a9a47001cd..f2418eabdcb1 100644
--- a/include/uapi/linux/pkt_sched.h
+++ b/include/uapi/linux/pkt_sched.h
@@ -1210,4 +1210,38 @@ enum {
#define TCA_ETS_MAX (__TCA_ETS_MAX - 1)
+/* DUALPI2 */
+enum {
+ TCA_DUALPI2_UNSPEC,
+ TCA_DUALPI2_LIMIT, /* Packets */
+ TCA_DUALPI2_TARGET, /* us */
+ TCA_DUALPI2_TUPDATE, /* us */
+ TCA_DUALPI2_ALPHA, /* Hz scaled up by 256 */
+ TCA_DUALPI2_BETA, /* Hz scaled up by 256 */
+ TCA_DUALPI2_STEP_THRESH, /* Packets or us */
+ TCA_DUALPI2_STEP_PACKETS, /* Whether STEP_THRESH is in packets */
+ TCA_DUALPI2_COUPLING, /* Coupling factor between queues */
+ TCA_DUALPI2_DROP_OVERLOAD, /* Whether to drop on overload */
+ TCA_DUALPI2_DROP_EARLY, /* Whether to drop on enqueue */
+ TCA_DUALPI2_C_PROTECTION, /* Percentage */
+ TCA_DUALPI2_ECN_MASK, /* L4S queue classification mask */
+ TCA_DUALPI2_SPLIT_GSO, /* Split GSO packets at enqueue */
+ TCA_DUALPI2_PAD,
+ __TCA_DUALPI2_MAX
+};
+
+#define TCA_DUALPI2_MAX (__TCA_DUALPI2_MAX - 1)
+
+struct tc_dualpi2_xstats {
+ __u32 prob; /* current probability */
+ __u32 delay_c; /* current delay in C queue */
+ __u32 delay_l; /* current delay in L queue */
+ __s32 credit; /* current c_protection credit */
+ __u32 packets_in_c; /* number of packets enqueued in C queue */
+ __u32 packets_in_l; /* number of packets enqueued in L queue */
+ __u32 maxq; /* maximum queue size */
+ __u32 ecn_mark; /* packets marked with ECN */
+ __u32 step_marks; /* ECN marks due to the step AQM */
+};
+
#endif
diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index 8180d0c12fce..c1421e219040 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -403,6 +403,26 @@ config NET_SCH_ETS
If unsure, say N.
+config NET_SCH_DUALPI2
+ tristate "Dual Queue Proportional Integral Controller Improved with a Square (DUALPI2) scheduler"
+ help
+ Say Y here if you want to use the DualPI2 AQM.
+ This is a combination of the DUALQ Coupled-AQM with a PI2 base-AQM.
+ The PI2 AQM is in turn both an extension and a simplification of the
+ PIE AQM. PI2 makes many of PIE's heuristics unnecessary, while being
+ able to control scalable congestion controls like DCTCP and
+ TCP-Prague. With PI2, Reno/Cubic can be used in parallel with
+ DCTCP, maintaining window fairness. DUALQ provides latency separation
+ between low latency DCTCP flows and Reno/Cubic flows that need a
+ bigger queue.
+ For more information, please see
+ https://datatracker.ietf.org/doc/html/rfc9332
+
+ To compile this code as a module, choose M here: the module
+ will be called sch_dualpi2.
+
+ If unsure, say N.
+
menuconfig NET_SCH_DEFAULT
bool "Allow override default queue discipline"
help
diff --git a/net/sched/Makefile b/net/sched/Makefile
index 82c3f78ca486..1abb06554057 100644
--- a/net/sched/Makefile
+++ b/net/sched/Makefile
@@ -62,6 +62,7 @@ obj-$(CONFIG_NET_SCH_FQ_PIE) += sch_fq_pie.o
obj-$(CONFIG_NET_SCH_CBS) += sch_cbs.o
obj-$(CONFIG_NET_SCH_ETF) += sch_etf.o
obj-$(CONFIG_NET_SCH_TAPRIO) += sch_taprio.o
+obj-$(CONFIG_NET_SCH_DUALPI2) += sch_dualpi2.o
obj-$(CONFIG_NET_CLS_U32) += cls_u32.o
obj-$(CONFIG_NET_CLS_ROUTE4) += cls_route.o
diff --git a/net/sched/sch_dualpi2.c b/net/sched/sch_dualpi2.c
new file mode 100644
index 000000000000..18e8934faa4e
--- /dev/null
+++ b/net/sched/sch_dualpi2.c
@@ -0,0 +1,1046 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright (C) 2024 Nokia
+ *
+ * Author: Koen De Schepper <koen.de_schepper@nokia-bell-labs.com>
+ * Author: Olga Albisser <olga@albisser.org>
+ * Author: Henrik Steen <henrist@henrist.net>
+ * Author: Olivier Tilmans <olivier.tilmans@nokia-bell-labs.com>
+ * Author: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
+ *
+ * DualPI Improved with a Square (dualpi2):
+ * - Supports congestion controls that comply with the Prague requirements
+ * in RFC9331 (e.g. TCP-Prague)
+ * - Supports coupled dual-queue with PI2 as defined in RFC9332
+ * - Supports ECN L4S-identifier (IP.ECN==0b*1)
+ *
+ * note: DCTCP is not Prague-compliant, so DCTCP & DualPI2 can only be
+ * used in a DC context; BBRv3 (which overwrites bbr) dropped Prague
+ * support, so use TCP-Prague instead for low-latency apps
+ *
+ * References:
+ * - RFC9332: https://datatracker.ietf.org/doc/html/rfc9332
+ * - De Schepper, Koen, et al. "PI 2: A linearized AQM for both classic and
+ * scalable TCP." in proc. ACM CoNEXT'16, 2016.
+ */
+
+#include <linux/errno.h>
+#include <linux/hrtimer.h>
+#include <linux/if_vlan.h>
+#include <linux/kernel.h>
+#include <linux/limits.h>
+#include <linux/module.h>
+#include <linux/skbuff.h>
+#include <linux/types.h>
+
+#include <net/gso.h>
+#include <net/inet_ecn.h>
+#include <net/pkt_cls.h>
+#include <net/pkt_sched.h>
+
+/* Using 32b probabilities enables support for flows with windows up to
+ * ~8.6 * 1e9 packets, i.e., twice the maximal snd_cwnd.
+ * MAX_PROB must be consistent with the RNG in dualpi2_roll().
+ */
+#define MAX_PROB U32_MAX
+/* alpha/beta values exchanged over netlink are in units of 256ns */
+#define ALPHA_BETA_SHIFT 8
+/* Scaled values of alpha/beta must fit in 32b to avoid overflow in later
+ * computations. Consequently (see dualpi2_scale_alpha_beta()), their
+ * netlink-provided values can use at most 31b, i.e. be at most (2^23)-1
+ * (~4MHz) as those are given in 1/256th. This enables tuning alpha/beta to
+ * control flows whose maximal RTTs range from usec up to a few secs.
+ */
+#define ALPHA_BETA_MAX ((1U << 31) - 1)
+/* Internal alpha/beta are in units of 64ns.
+ * This allows using all alpha/beta values in the allowed range without loss
+ * of precision due to rounding when scaling them internally, e.g.,
+ * scale_alpha_beta(1) will not round down to 0.
+ */
+#define ALPHA_BETA_GRANULARITY 6
+#define ALPHA_BETA_SCALING (ALPHA_BETA_SHIFT - ALPHA_BETA_GRANULARITY)
+/* We express the weights (wc, wl) in %, i.e., wc + wl = 100 */
+#define MAX_WC 100
+
+struct dualpi2_sched_data {
+ struct Qdisc *l_queue; /* The L4S LL queue */
+ struct Qdisc *sch; /* The classic queue (owner of this struct) */
+
+ /* Registered tc filters */
+ struct {
+ struct tcf_proto __rcu *filters;
+ struct tcf_block *block;
+ } tcf;
+
+ struct { /* PI2 parameters */
+ u64 target; /* Target delay in nanoseconds */
+ u32 tupdate;/* Timer frequency in nanoseconds */
+ u32 prob; /* Base PI probability */
+ u32 alpha; /* Gain factor for the integral rate response */
+ u32 beta; /* Gain factor for the proportional response */
+ struct hrtimer timer; /* prob update timer */
+ } pi2;
+
+ struct { /* Step AQM (L4S queue only) parameters */
+ u32 thresh; /* Step threshold */
+ bool in_packets;/* Whether the step is in packets or time */
+ } step;
+
+ struct { /* Classic queue starvation protection */
+ s32 credit; /* Credit (sign indicates which queue) */
+ s32 init; /* Reset value of the credit */
+ u8 wc; /* C queue weight (between 0 and MAX_WC) */
+ u8 wl; /* L queue weight (MAX_WC - wc) */
+ } c_protection;
+
+ /* General dualQ parameters */
+ u8 coupling_factor;/* Coupling factor (k) between both queues */
+ u8 ecn_mask; /* Mask to match L4S packets */
+ bool drop_early; /* Drop at enqueue instead of dequeue if true */
+ bool drop_overload; /* Drop (1) on overload, or overflow (0) */
+ bool split_gso; /* Split aggregated skb (1) or leave as is */
+
+ /* Statistics */
+ u64 c_head_ts; /* Enqueue timestamp of the classic Q's head */
+ u64 l_head_ts; /* Enqueue timestamp of the L Q's head */
+ u64 last_qdelay; /* Q delay val at the last probability update */
+ u32 packets_in_c; /* Number of packets enqueued in C queue */
+ u32 packets_in_l; /* Number of packets enqueued in L queue */
+ u32 maxq; /* maximum queue size */
+ u32 ecn_mark; /* packets marked with ECN */
+ u32 step_marks; /* ECN marks due to the step AQM */
+
+ struct { /* Deferred drop statistics */
+ u32 cnt; /* Packets dropped */
+ u32 len; /* Bytes dropped */
+ } deferred_drops;
+};
+
+struct dualpi2_skb_cb {
+ u64 ts; /* Timestamp at enqueue */
+ u8 apply_step:1, /* Can we apply the step threshold */
+ classified:2, /* Packet classification results */
+ ect:2; /* Packet ECT codepoint */
+};
+
+enum dualpi2_classification_results {
+ DUALPI2_C_CLASSIC = 0, /* C queue */
+ DUALPI2_C_L4S = 1, /* L queue (scalable marking/classic drops) */
+ DUALPI2_C_LLLL = 2, /* L queue (no drops/marks) */
+ __DUALPI2_C_MAX /* Keep last */
+};
+
+static struct dualpi2_skb_cb *dualpi2_skb_cb(struct sk_buff *skb)
+{
+ qdisc_cb_private_validate(skb, sizeof(struct dualpi2_skb_cb));
+ return (struct dualpi2_skb_cb *)qdisc_skb_cb(skb)->data;
+}
+
+static u64 skb_sojourn_time(struct sk_buff *skb, u64 reference)
+{
+ return reference - dualpi2_skb_cb(skb)->ts;
+}
+
+static u64 head_enqueue_time(struct Qdisc *q)
+{
+ struct sk_buff *skb = qdisc_peek_head(q);
+
+ return skb ? dualpi2_skb_cb(skb)->ts : 0;
+}
+
+static u32 dualpi2_scale_alpha_beta(u32 param)
+{
+ u64 tmp = ((u64)param * MAX_PROB >> ALPHA_BETA_SCALING);
+
+ do_div(tmp, NSEC_PER_SEC);
+ return tmp;
+}
+
+static u32 dualpi2_unscale_alpha_beta(u32 param)
+{
+ u64 tmp = ((u64)param * NSEC_PER_SEC << ALPHA_BETA_SCALING);
+
+ do_div(tmp, MAX_PROB);
+ return tmp;
+}
+
+static ktime_t next_pi2_timeout(struct dualpi2_sched_data *q)
+{
+ return ktime_add_ns(ktime_get_ns(), q->pi2.tupdate);
+}
+
+static bool skb_is_l4s(struct sk_buff *skb)
+{
+ return dualpi2_skb_cb(skb)->classified == DUALPI2_C_L4S;
+}
+
+static bool skb_in_l_queue(struct sk_buff *skb)
+{
+ return dualpi2_skb_cb(skb)->classified != DUALPI2_C_CLASSIC;
+}
+
+static bool dualpi2_mark(struct dualpi2_sched_data *q, struct sk_buff *skb)
+{
+ if (INET_ECN_set_ce(skb)) {
+ q->ecn_mark++;
+ return true;
+ }
+ return false;
+}
+
+static void dualpi2_reset_c_protection(struct dualpi2_sched_data *q)
+{
+ q->c_protection.credit = q->c_protection.init;
+}
+
+/* This computes the initial credit value and WRR weight for the L queue (wl)
+ * from the weight of the C queue (wc).
+ * If wl > wc, the scheduler will start with the L queue when reset.
+ */
+static void dualpi2_calculate_c_protection(struct Qdisc *sch,
+ struct dualpi2_sched_data *q, u32 wc)
+{
+ q->c_protection.wc = wc;
+ q->c_protection.wl = MAX_WC - wc;
+ q->c_protection.init = (s32)psched_mtu(qdisc_dev(sch)) *
+ ((int)q->c_protection.wc - (int)q->c_protection.wl);
+ dualpi2_reset_c_protection(q);
+}
+
+static bool dualpi2_roll(u32 prob)
+{
+ return get_random_u32() <= prob;
+}
+
+/* Packets in the C queue are subject to a marking probability pC, which is the
+ * square of the internal PI2 probability (i.e., have an overall lower mark/drop
+ * probability). If the qdisc is overloaded, ignore ECT values and only drop.
+ *
+ * Note that this marking scheme is also applied to L4S packets during overload.
+ * Return true if packet dropping is required in C queue
+ */
+static bool dualpi2_classic_marking(struct dualpi2_sched_data *q,
+ struct sk_buff *skb, u32 prob,
+ bool overload)
+{
+ if (dualpi2_roll(prob) && dualpi2_roll(prob)) {
+ if (overload || dualpi2_skb_cb(skb)->ect == INET_ECN_NOT_ECT)
+ return true;
+ dualpi2_mark(q, skb);
+ }
+ return false;
+}
+
+/* Packets in the L queue are subject to a marking probability pL given by the
+ * internal PI2 probability scaled by the coupling factor.
+ *
+ * On overload (i.e., @local_l_prob is >= 100%):
+ * - if the qdisc is configured to trade losses to preserve latency (i.e.,
+ * @q->drop_overload), apply classic drops first before marking.
+ * - otherwise, preserve the "no loss" property of ECN at the cost of queueing
+ * delay, eventually resulting in taildrop behavior once sch->limit is
+ * reached.
+ * Return true if packet dropping is required in L queue
+ */
+static bool dualpi2_scalable_marking(struct dualpi2_sched_data *q,
+ struct sk_buff *skb,
+ u64 local_l_prob, u32 prob,
+ bool overload)
+{
+ if (overload) {
+ /* Apply classic drop */
+ if (!q->drop_overload ||
+ !(dualpi2_roll(prob) && dualpi2_roll(prob)))
+ goto mark;
+ return true;
+ }
+
+ /* We can safely cut the upper 32b as overload==false */
+ if (dualpi2_roll(local_l_prob)) {
+ /* Non-ECT packets could have been classified as L4S by filters. */
+ if (dualpi2_skb_cb(skb)->ect == INET_ECN_NOT_ECT)
+ return true;
+mark:
+ dualpi2_mark(q, skb);
+ }
+ return false;
+}
+
+/* Decide whether a given packet must be dropped (or marked if ECT), according
+ * to the PI2 probability.
+ *
+ * Never mark/drop if we have a standing queue of less than 2 MTUs.
+ */
+static bool must_drop(struct Qdisc *sch, struct dualpi2_sched_data *q,
+ struct sk_buff *skb)
+{
+ u64 local_l_prob;
+ u32 prob;
+ bool overload;
+
+ if (sch->qstats.backlog < 2 * psched_mtu(qdisc_dev(sch)))
+ return false;
+
+ prob = READ_ONCE(q->pi2.prob);
+ local_l_prob = (u64)prob * q->coupling_factor;
+ overload = local_l_prob > MAX_PROB;
+
+ switch (dualpi2_skb_cb(skb)->classified) {
+ case DUALPI2_C_CLASSIC:
+ return dualpi2_classic_marking(q, skb, prob, overload);
+ case DUALPI2_C_L4S:
+ return dualpi2_scalable_marking(q, skb, local_l_prob, prob,
+ overload);
+ default: /* DUALPI2_C_LLLL */
+ return false;
+ }
+}
+
+static void dualpi2_read_ect(struct sk_buff *skb)
+{
+ struct dualpi2_skb_cb *cb = dualpi2_skb_cb(skb);
+ int wlen = skb_network_offset(skb);
+
+ switch (skb_protocol(skb, true)) {
+ case htons(ETH_P_IP):
+ wlen += sizeof(struct iphdr);
+ if (!pskb_may_pull(skb, wlen) ||
+ skb_try_make_writable(skb, wlen))
+ goto not_ecn;
+
+ cb->ect = ipv4_get_dsfield(ip_hdr(skb)) & INET_ECN_MASK;
+ break;
+ case htons(ETH_P_IPV6):
+ wlen += sizeof(struct ipv6hdr);
+ if (!pskb_may_pull(skb, wlen) ||
+ skb_try_make_writable(skb, wlen))
+ goto not_ecn;
+
+ cb->ect = ipv6_get_dsfield(ipv6_hdr(skb)) & INET_ECN_MASK;
+ break;
+ default:
+ goto not_ecn;
+ }
+ return;
+
+not_ecn:
+ /* Non-pullable/writable packets can only be dropped, hence they are
+ * classified as not ECT.
+ */
+ cb->ect = INET_ECN_NOT_ECT;
+}
+
+static int dualpi2_skb_classify(struct dualpi2_sched_data *q,
+ struct sk_buff *skb)
+{
+ struct dualpi2_skb_cb *cb = dualpi2_skb_cb(skb);
+ struct tcf_result res;
+ struct tcf_proto *fl;
+ int result;
+
+ dualpi2_read_ect(skb);
+ if (cb->ect & q->ecn_mask) {
+ cb->classified = DUALPI2_C_L4S;
+ return NET_XMIT_SUCCESS;
+ }
+
+ if (TC_H_MAJ(skb->priority) == q->sch->handle &&
+ TC_H_MIN(skb->priority) < __DUALPI2_C_MAX) {
+ cb->classified = TC_H_MIN(skb->priority);
+ return NET_XMIT_SUCCESS;
+ }
+
+ fl = rcu_dereference_bh(q->tcf.filters);
+ if (!fl) {
+ cb->classified = DUALPI2_C_CLASSIC;
+ return NET_XMIT_SUCCESS;
+ }
+
+ result = tcf_classify(skb, NULL, fl, &res, false);
+ if (result >= 0) {
+#ifdef CONFIG_NET_CLS_ACT
+ switch (result) {
+ case TC_ACT_STOLEN:
+ case TC_ACT_QUEUED:
+ case TC_ACT_TRAP:
+ return NET_XMIT_SUCCESS | __NET_XMIT_STOLEN;
+ case TC_ACT_SHOT:
+ return NET_XMIT_SUCCESS | __NET_XMIT_BYPASS;
+ }
+#endif
+ cb->classified = TC_H_MIN(res.classid) < __DUALPI2_C_MAX ?
+ TC_H_MIN(res.classid) : DUALPI2_C_CLASSIC;
+ }
+ return NET_XMIT_SUCCESS;
+}
+
+static int dualpi2_enqueue_skb(struct sk_buff *skb, struct Qdisc *sch,
+ struct sk_buff **to_free)
+{
+ struct dualpi2_sched_data *q = qdisc_priv(sch);
+ struct dualpi2_skb_cb *cb;
+
+ if (unlikely(qdisc_qlen(sch) >= sch->limit)) {
+ qdisc_qstats_overlimit(sch);
+ if (skb_in_l_queue(skb))
+ qdisc_qstats_overlimit(q->l_queue);
+ return qdisc_drop(skb, sch, to_free);
+ }
+
+ if (q->drop_early && must_drop(sch, q, skb)) {
+ qdisc_drop(skb, sch, to_free);
+ return NET_XMIT_SUCCESS | __NET_XMIT_BYPASS;
+ }
+
+ cb = dualpi2_skb_cb(skb);
+ cb->ts = ktime_get_ns();
+
+ if (qdisc_qlen(sch) > q->maxq)
+ q->maxq = qdisc_qlen(sch);
+
+ if (skb_in_l_queue(skb)) {
+ /* Only apply the step if a queue is building up */
+ dualpi2_skb_cb(skb)->apply_step =
+ skb_is_l4s(skb) && qdisc_qlen(q->l_queue) > 1;
+ /* Keep the overall qdisc stats consistent */
+ ++sch->q.qlen;
+ qdisc_qstats_backlog_inc(sch, skb);
+ ++q->packets_in_l;
+ if (!q->l_head_ts)
+ q->l_head_ts = cb->ts;
+ return qdisc_enqueue_tail(skb, q->l_queue);
+ }
+ ++q->packets_in_c;
+ if (!q->c_head_ts)
+ q->c_head_ts = cb->ts;
+ return qdisc_enqueue_tail(skb, sch);
+}
+
+/* Optionally, dualpi2 will split GSO skbs into independent skbs and enqueue
+ * each of those individually. This yields the following benefits, at the
+ * expense of CPU usage:
+ * - Finer-grained AQM actions as the sub-packets of a burst no longer share the
+ * same fate (e.g., the random mark/drop probability is applied individually)
+ * - Improved precision of the starvation protection/WRR scheduler at dequeue,
+ * as the size of the dequeued packets will be smaller.
+ */
+static int dualpi2_qdisc_enqueue(struct sk_buff *skb, struct Qdisc *sch,
+ struct sk_buff **to_free)
+{
+ struct dualpi2_sched_data *q = qdisc_priv(sch);
+ int err;
+
+ err = dualpi2_skb_classify(q, skb);
+ if (err != NET_XMIT_SUCCESS) {
+ if (err & __NET_XMIT_BYPASS)
+ qdisc_qstats_drop(sch);
+ __qdisc_drop(skb, to_free);
+ return err;
+ }
+
+ if (q->split_gso && skb_is_gso(skb)) {
+ netdev_features_t features;
+ struct sk_buff *nskb, *next;
+ int cnt, byte_len, orig_len;
+ int err;
+
+ features = netif_skb_features(skb);
+ nskb = skb_gso_segment(skb, features & ~NETIF_F_GSO_MASK);
+ if (IS_ERR_OR_NULL(nskb))
+ return qdisc_drop(skb, sch, to_free);
+
+ cnt = 1;
+ byte_len = 0;
+ orig_len = qdisc_pkt_len(skb);
+ while (nskb) {
+ next = nskb->next;
+ skb_mark_not_on_list(nskb);
+ qdisc_skb_cb(nskb)->pkt_len = nskb->len;
+ dualpi2_skb_cb(nskb)->classified =
+ dualpi2_skb_cb(skb)->classified;
+ dualpi2_skb_cb(nskb)->ect = dualpi2_skb_cb(skb)->ect;
+ err = dualpi2_enqueue_skb(nskb, sch, to_free);
+ if (err == NET_XMIT_SUCCESS) {
+ /* Compute the backlog adjustment that needs
+ * to be propagated in the qdisc tree to reflect
+ * all new skbs successfully enqueued.
+ */
+ ++cnt;
+ byte_len += nskb->len;
+ }
+ nskb = next;
+ }
+ if (err == NET_XMIT_SUCCESS) {
+ /* The caller will add the original skb stats to its
+ * backlog, compensate this.
+ */
+ --cnt;
+ byte_len -= orig_len;
+ }
+ qdisc_tree_reduce_backlog(sch, -cnt, -byte_len);
+ consume_skb(skb);
+ return err;
+ }
+ return dualpi2_enqueue_skb(skb, sch, to_free);
+}
+
+/* Select the queue from which the next packet can be dequeued, ensuring that
+ * neither queue can starve the other with a WRR scheduler.
+ *
+ * The sign of the WRR credit determines the next queue, while the size of
+ * the dequeued packet determines the magnitude of the WRR credit change. If
+ * either queue is empty, the WRR credit is kept unchanged.
+ *
+ * As the dequeued packet can be dropped later, the caller has to perform the
+ * qdisc_bstats_update() calls.
+ */
+static struct sk_buff *dequeue_packet(struct Qdisc *sch,
+ struct dualpi2_sched_data *q,
+ int *credit_change,
+ u64 now)
+{
+ struct sk_buff *skb = NULL;
+ int c_len;
+
+ *credit_change = 0;
+ c_len = qdisc_qlen(sch) - qdisc_qlen(q->l_queue);
+ if (qdisc_qlen(q->l_queue) && (!c_len || q->c_protection.credit <= 0)) {
+ skb = __qdisc_dequeue_head(&q->l_queue->q);
+ WRITE_ONCE(q->l_head_ts, head_enqueue_time(q->l_queue));
+ if (c_len)
+ *credit_change = q->c_protection.wc;
+ qdisc_qstats_backlog_dec(q->l_queue, skb);
+ /* Keep the global queue size consistent */
+ --sch->q.qlen;
+ } else if (c_len) {
+ skb = __qdisc_dequeue_head(&sch->q);
+ WRITE_ONCE(q->c_head_ts, head_enqueue_time(sch));
+ if (qdisc_qlen(q->l_queue))
+ *credit_change = ~((s32)q->c_protection.wl) + 1;
+ } else {
+ dualpi2_reset_c_protection(q);
+ return NULL;
+ }
+ *credit_change *= qdisc_pkt_len(skb);
+ qdisc_qstats_backlog_dec(sch, skb);
+ return skb;
+}
+
+static int do_step_aqm(struct dualpi2_sched_data *q, struct sk_buff *skb,
+ u64 now)
+{
+ u64 qdelay = 0;
+
+ if (q->step.in_packets)
+ qdelay = qdisc_qlen(q->l_queue);
+ else
+ qdelay = skb_sojourn_time(skb, now);
+
+ if (dualpi2_skb_cb(skb)->apply_step && qdelay > q->step.thresh) {
+ if (!dualpi2_skb_cb(skb)->ect)
+ /* Drop this non-ECT packet */
+ return 1;
+ if (dualpi2_mark(q, skb))
+ ++q->step_marks;
+ }
+ qdisc_bstats_update(q->l_queue, skb);
+ return 0;
+}
+
+static void drop_and_retry(struct dualpi2_sched_data *q, struct sk_buff *skb,
+                           struct Qdisc *sch)
+{
+ ++q->deferred_drops.cnt;
+ q->deferred_drops.len += qdisc_pkt_len(skb);
+ consume_skb(skb);
+ qdisc_qstats_drop(sch);
+}
+
+static struct sk_buff *dualpi2_qdisc_dequeue(struct Qdisc *sch)
+{
+ struct dualpi2_sched_data *q = qdisc_priv(sch);
+ struct sk_buff *skb;
+ int credit_change;
+ u64 now;
+
+ now = ktime_get_ns();
+
+ while ((skb = dequeue_packet(sch, q, &credit_change, now))) {
+ if (!q->drop_early && must_drop(sch, q, skb)) {
+ drop_and_retry(q, skb, sch);
+ continue;
+ }
+
+ if (skb_in_l_queue(skb) && do_step_aqm(q, skb, now)) {
+ qdisc_qstats_drop(q->l_queue);
+ drop_and_retry(q, skb, sch);
+ continue;
+ }
+
+ q->c_protection.credit += credit_change;
+ qdisc_bstats_update(sch, skb);
+ break;
+ }
+
+ /* We cannot call qdisc_tree_reduce_backlog() if our qlen is 0,
+ * or HTB crashes.
+ */
+ if (q->deferred_drops.cnt && qdisc_qlen(sch)) {
+ qdisc_tree_reduce_backlog(sch, q->deferred_drops.cnt,
+ q->deferred_drops.len);
+ q->deferred_drops.cnt = 0;
+ q->deferred_drops.len = 0;
+ }
+ return skb;
+}
+
+static s64 __scale_delta(u64 diff)
+{
+ do_div(diff, 1 << ALPHA_BETA_GRANULARITY);
+ return diff;
+}
+
+static void get_queue_delays(struct dualpi2_sched_data *q, u64 *qdelay_c,
+ u64 *qdelay_l)
+{
+ u64 now, qc, ql;
+
+ now = ktime_get_ns();
+ qc = READ_ONCE(q->c_head_ts);
+ ql = READ_ONCE(q->l_head_ts);
+
+ *qdelay_c = qc ? now - qc : 0;
+ *qdelay_l = ql ? now - ql : 0;
+}
+
+static u32 calculate_probability(struct Qdisc *sch)
+{
+ struct dualpi2_sched_data *q = qdisc_priv(sch);
+ u32 new_prob;
+ u64 qdelay_c;
+ u64 qdelay_l;
+ u64 qdelay;
+ s64 delta;
+
+ get_queue_delays(q, &qdelay_c, &qdelay_l);
+ qdelay = max(qdelay_l, qdelay_c);
+ /* Alpha and beta take at most 32b, i.e., the delay difference would
+ * overflow for queuing delay differences > ~4.2sec.
+ */
+ delta = ((s64)qdelay - q->pi2.target) * q->pi2.alpha;
+ delta += ((s64)qdelay - q->last_qdelay) * q->pi2.beta;
+ if (delta > 0) {
+ new_prob = __scale_delta(delta) + q->pi2.prob;
+ if (new_prob < q->pi2.prob)
+ new_prob = MAX_PROB;
+ } else {
+ new_prob = q->pi2.prob - __scale_delta(~delta + 1);
+ if (new_prob > q->pi2.prob)
+ new_prob = 0;
+ }
+ q->last_qdelay = qdelay;
+ /* If we do not drop on overload, ensure we cap the L4S probability to
+ * 100% to keep window fairness when overflowing.
+ */
+ if (!q->drop_overload)
+ return min_t(u32, new_prob, MAX_PROB / q->coupling_factor);
+ return new_prob;
+}
+
+static enum hrtimer_restart dualpi2_timer(struct hrtimer *timer)
+{
+ struct dualpi2_sched_data *q = from_timer(q, timer, pi2.timer);
+
+ WRITE_ONCE(q->pi2.prob, calculate_probability(q->sch));
+
+ hrtimer_set_expires(&q->pi2.timer, next_pi2_timeout(q));
+ return HRTIMER_RESTART;
+}
+
+static const struct nla_policy dualpi2_policy[TCA_DUALPI2_MAX + 1] = {
+ [TCA_DUALPI2_LIMIT] = {.type = NLA_U32},
+ [TCA_DUALPI2_TARGET] = {.type = NLA_U32},
+ [TCA_DUALPI2_TUPDATE] = {.type = NLA_U32},
+ [TCA_DUALPI2_ALPHA] = {.type = NLA_U32},
+ [TCA_DUALPI2_BETA] = {.type = NLA_U32},
+ [TCA_DUALPI2_STEP_THRESH] = {.type = NLA_U32},
+ [TCA_DUALPI2_STEP_PACKETS] = {.type = NLA_U8},
+ [TCA_DUALPI2_COUPLING] = {.type = NLA_U8},
+ [TCA_DUALPI2_DROP_OVERLOAD] = {.type = NLA_U8},
+ [TCA_DUALPI2_DROP_EARLY] = {.type = NLA_U8},
+ [TCA_DUALPI2_C_PROTECTION] = {.type = NLA_U8},
+ [TCA_DUALPI2_ECN_MASK] = {.type = NLA_U8},
+ [TCA_DUALPI2_SPLIT_GSO] = {.type = NLA_U8},
+};
+
+static int dualpi2_change(struct Qdisc *sch, struct nlattr *opt,
+ struct netlink_ext_ack *extack)
+{
+ struct nlattr *tb[TCA_DUALPI2_MAX + 1];
+ struct dualpi2_sched_data *q;
+ int old_backlog;
+ int old_qlen;
+ int err;
+
+ if (!opt)
+ return -EINVAL;
+ err = nla_parse_nested_deprecated(tb, TCA_DUALPI2_MAX, opt,
+ dualpi2_policy, extack);
+ if (err < 0)
+ return err;
+
+ q = qdisc_priv(sch);
+ sch_tree_lock(sch);
+
+ if (tb[TCA_DUALPI2_LIMIT]) {
+ u32 limit = nla_get_u32(tb[TCA_DUALPI2_LIMIT]);
+
+ if (!limit) {
+ NL_SET_ERR_MSG_ATTR(extack, tb[TCA_DUALPI2_LIMIT],
+ "limit must be greater than 0.");
+ sch_tree_unlock(sch);
+ return -EINVAL;
+ }
+ sch->limit = limit;
+ }
+
+ if (tb[TCA_DUALPI2_TARGET])
+ q->pi2.target = (u64)nla_get_u32(tb[TCA_DUALPI2_TARGET]) *
+ NSEC_PER_USEC;
+
+ if (tb[TCA_DUALPI2_TUPDATE]) {
+ u64 tupdate = nla_get_u32(tb[TCA_DUALPI2_TUPDATE]);
+
+ if (!tupdate) {
+ NL_SET_ERR_MSG_ATTR(extack, tb[TCA_DUALPI2_TUPDATE],
+ "tupdate cannot be 0us.");
+ sch_tree_unlock(sch);
+ return -EINVAL;
+ }
+ q->pi2.tupdate = tupdate * NSEC_PER_USEC;
+ }
+
+ if (tb[TCA_DUALPI2_ALPHA]) {
+ u32 alpha = nla_get_u32(tb[TCA_DUALPI2_ALPHA]);
+
+ if (alpha > ALPHA_BETA_MAX) {
+ NL_SET_ERR_MSG_ATTR(extack, tb[TCA_DUALPI2_ALPHA],
+ "alpha is too large.");
+ sch_tree_unlock(sch);
+ return -EINVAL;
+ }
+ q->pi2.alpha = dualpi2_scale_alpha_beta(alpha);
+ }
+
+ if (tb[TCA_DUALPI2_BETA]) {
+ u32 beta = nla_get_u32(tb[TCA_DUALPI2_BETA]);
+
+ if (beta > ALPHA_BETA_MAX) {
+ NL_SET_ERR_MSG_ATTR(extack, tb[TCA_DUALPI2_BETA],
+ "beta is too large.");
+ sch_tree_unlock(sch);
+ return -EINVAL;
+ }
+ q->pi2.beta = dualpi2_scale_alpha_beta(beta);
+ }
+
+ if (tb[TCA_DUALPI2_STEP_THRESH])
+ q->step.thresh = nla_get_u32(tb[TCA_DUALPI2_STEP_THRESH]) *
+ NSEC_PER_USEC;
+
+ if (tb[TCA_DUALPI2_COUPLING]) {
+ u8 coupling = nla_get_u8(tb[TCA_DUALPI2_COUPLING]);
+
+ if (!coupling) {
+ NL_SET_ERR_MSG_ATTR(extack, tb[TCA_DUALPI2_COUPLING],
+ "Must use a non-zero coupling.");
+ sch_tree_unlock(sch);
+ return -EINVAL;
+ }
+ q->coupling_factor = coupling;
+ }
+
+ if (tb[TCA_DUALPI2_STEP_PACKETS])
+ q->step.in_packets = !!nla_get_u8(tb[TCA_DUALPI2_STEP_PACKETS]);
+
+ if (tb[TCA_DUALPI2_DROP_OVERLOAD])
+ q->drop_overload = !!nla_get_u8(tb[TCA_DUALPI2_DROP_OVERLOAD]);
+
+ if (tb[TCA_DUALPI2_DROP_EARLY])
+ q->drop_early = !!nla_get_u8(tb[TCA_DUALPI2_DROP_EARLY]);
+
+ if (tb[TCA_DUALPI2_C_PROTECTION]) {
+ u8 wc = nla_get_u8(tb[TCA_DUALPI2_C_PROTECTION]);
+
+ if (wc > MAX_WC) {
+ NL_SET_ERR_MSG_ATTR(extack,
+ tb[TCA_DUALPI2_C_PROTECTION],
+ "c_protection must be <= 100.");
+ sch_tree_unlock(sch);
+ return -EINVAL;
+ }
+ dualpi2_calculate_c_protection(sch, q, wc);
+ }
+
+ if (tb[TCA_DUALPI2_ECN_MASK])
+ q->ecn_mask = nla_get_u8(tb[TCA_DUALPI2_ECN_MASK]);
+
+ if (tb[TCA_DUALPI2_SPLIT_GSO])
+ q->split_gso = !!nla_get_u8(tb[TCA_DUALPI2_SPLIT_GSO]);
+
+ old_qlen = qdisc_qlen(sch);
+ old_backlog = sch->qstats.backlog;
+ while (qdisc_qlen(sch) > sch->limit) {
+ struct sk_buff *skb = __qdisc_dequeue_head(&sch->q);
+
+ qdisc_qstats_backlog_dec(sch, skb);
+ rtnl_qdisc_drop(skb, sch);
+ }
+ qdisc_tree_reduce_backlog(sch, old_qlen - qdisc_qlen(sch),
+ old_backlog - sch->qstats.backlog);
+
+ sch_tree_unlock(sch);
+ return 0;
+}
+
+/* Default alpha/beta values give a 10dB stability margin with max_rtt=100ms. */
+static void dualpi2_reset_default(struct dualpi2_sched_data *q)
+{
+ q->sch->limit = 10000; /* Max 125ms at 1Gbps */
+
+ q->pi2.target = 15 * NSEC_PER_MSEC;
+ q->pi2.tupdate = 16 * NSEC_PER_MSEC;
+ q->pi2.alpha = dualpi2_scale_alpha_beta(41); /* ~0.16 Hz * 256 */
+ q->pi2.beta = dualpi2_scale_alpha_beta(819); /* ~3.20 Hz * 256 */
+
+ q->step.thresh = 1 * NSEC_PER_MSEC;
+ q->step.in_packets = false;
+
+ dualpi2_calculate_c_protection(q->sch, q, 10); /* wc=10%, wl=90% */
+
+ q->ecn_mask = INET_ECN_ECT_1;
+ q->coupling_factor = 2; /* window fairness for equal RTTs */
+ q->drop_overload = true; /* Preserve latency by dropping */
+ q->drop_early = false; /* PI2 drops on dequeue */
+ q->split_gso = true;
+}
+
+static int dualpi2_init(struct Qdisc *sch, struct nlattr *opt,
+ struct netlink_ext_ack *extack)
+{
+ struct dualpi2_sched_data *q = qdisc_priv(sch);
+ int err;
+
+ q->l_queue = qdisc_create_dflt(sch->dev_queue, &pfifo_qdisc_ops,
+ TC_H_MAKE(sch->handle, 1), extack);
+ if (!q->l_queue)
+ return -ENOMEM;
+
+ err = tcf_block_get(&q->tcf.block, &q->tcf.filters, sch, extack);
+ if (err)
+ return err;
+
+ q->sch = sch;
+ dualpi2_reset_default(q);
+ hrtimer_init(&q->pi2.timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED);
+ q->pi2.timer.function = dualpi2_timer;
+
+ if (opt) {
+ err = dualpi2_change(sch, opt, extack);
+
+ if (err)
+ return err;
+ }
+
+ hrtimer_start(&q->pi2.timer, next_pi2_timeout(q),
+ HRTIMER_MODE_ABS_PINNED);
+ return 0;
+}
+
+static u32 convert_ns_to_usec(u64 ns)
+{
+ do_div(ns, NSEC_PER_USEC);
+ return ns;
+}
+
+static int dualpi2_dump(struct Qdisc *sch, struct sk_buff *skb)
+{
+ struct dualpi2_sched_data *q = qdisc_priv(sch);
+ struct nlattr *opts;
+
+ opts = nla_nest_start_noflag(skb, TCA_OPTIONS);
+ if (!opts)
+ goto nla_put_failure;
+
+ if (nla_put_u32(skb, TCA_DUALPI2_LIMIT, sch->limit) ||
+ nla_put_u32(skb, TCA_DUALPI2_TARGET,
+ convert_ns_to_usec(q->pi2.target)) ||
+ nla_put_u32(skb, TCA_DUALPI2_TUPDATE,
+ convert_ns_to_usec(q->pi2.tupdate)) ||
+ nla_put_u32(skb, TCA_DUALPI2_ALPHA,
+ dualpi2_unscale_alpha_beta(q->pi2.alpha)) ||
+ nla_put_u32(skb, TCA_DUALPI2_BETA,
+ dualpi2_unscale_alpha_beta(q->pi2.beta)) ||
+ nla_put_u32(skb, TCA_DUALPI2_STEP_THRESH, q->step.in_packets ?
+ q->step.thresh : convert_ns_to_usec(q->step.thresh)) ||
+ nla_put_u8(skb, TCA_DUALPI2_COUPLING, q->coupling_factor) ||
+ nla_put_u8(skb, TCA_DUALPI2_DROP_OVERLOAD, q->drop_overload) ||
+ nla_put_u8(skb, TCA_DUALPI2_STEP_PACKETS, q->step.in_packets) ||
+ nla_put_u8(skb, TCA_DUALPI2_DROP_EARLY, q->drop_early) ||
+ nla_put_u8(skb, TCA_DUALPI2_C_PROTECTION, q->c_protection.wc) ||
+ nla_put_u8(skb, TCA_DUALPI2_ECN_MASK, q->ecn_mask) ||
+ nla_put_u8(skb, TCA_DUALPI2_SPLIT_GSO, q->split_gso))
+ goto nla_put_failure;
+
+ return nla_nest_end(skb, opts);
+
+nla_put_failure:
+ nla_nest_cancel(skb, opts);
+ return -1;
+}
+
+static int dualpi2_dump_stats(struct Qdisc *sch, struct gnet_dump *d)
+{
+ struct dualpi2_sched_data *q = qdisc_priv(sch);
+ struct tc_dualpi2_xstats st = {
+ .prob = READ_ONCE(q->pi2.prob),
+ .packets_in_c = q->packets_in_c,
+ .packets_in_l = q->packets_in_l,
+ .maxq = q->maxq,
+ .ecn_mark = q->ecn_mark,
+ .credit = q->c_protection.credit,
+ .step_marks = q->step_marks,
+ };
+ u64 qc, ql;
+
+ get_queue_delays(q, &qc, &ql);
+ st.delay_l = convert_ns_to_usec(ql);
+ st.delay_c = convert_ns_to_usec(qc);
+ return gnet_stats_copy_app(d, &st, sizeof(st));
+}
+
+static void dualpi2_reset(struct Qdisc *sch)
+{
+ struct dualpi2_sched_data *q = qdisc_priv(sch);
+
+ qdisc_reset_queue(sch);
+ qdisc_reset_queue(q->l_queue);
+ q->c_head_ts = 0;
+ q->l_head_ts = 0;
+ q->pi2.prob = 0;
+ q->packets_in_c = 0;
+ q->packets_in_l = 0;
+ q->maxq = 0;
+ q->ecn_mark = 0;
+ q->step_marks = 0;
+ dualpi2_reset_c_protection(q);
+}
+
+static void dualpi2_destroy(struct Qdisc *sch)
+{
+ struct dualpi2_sched_data *q = qdisc_priv(sch);
+
+ q->pi2.tupdate = 0;
+ hrtimer_cancel(&q->pi2.timer);
+ if (q->l_queue)
+ qdisc_put(q->l_queue);
+ tcf_block_put(q->tcf.block);
+}
+
+static struct Qdisc *dualpi2_leaf(struct Qdisc *sch, unsigned long arg)
+{
+ return NULL;
+}
+
+static unsigned long dualpi2_find(struct Qdisc *sch, u32 classid)
+{
+ return 0;
+}
+
+static unsigned long dualpi2_bind(struct Qdisc *sch, unsigned long parent,
+ u32 classid)
+{
+ return 0;
+}
+
+static void dualpi2_unbind(struct Qdisc *q, unsigned long cl)
+{
+}
+
+static struct tcf_block *dualpi2_tcf_block(struct Qdisc *sch, unsigned long cl,
+ struct netlink_ext_ack *extack)
+{
+ struct dualpi2_sched_data *q = qdisc_priv(sch);
+
+ if (cl)
+ return NULL;
+ return q->tcf.block;
+}
+
+static void dualpi2_walk(struct Qdisc *sch, struct qdisc_walker *arg)
+{
+ unsigned int i;
+
+ if (arg->stop)
+ return;
+
+ /* We statically define only 2 queues */
+ for (i = 0; i < 2; i++) {
+ if (arg->count < arg->skip) {
+ arg->count++;
+ continue;
+ }
+ if (arg->fn(sch, i + 1, arg) < 0) {
+ arg->stop = 1;
+ break;
+ }
+ arg->count++;
+ }
+}
+
+/* Minimal class support to handle tc filters */
+static const struct Qdisc_class_ops dualpi2_class_ops = {
+ .leaf = dualpi2_leaf,
+ .find = dualpi2_find,
+ .tcf_block = dualpi2_tcf_block,
+ .bind_tcf = dualpi2_bind,
+ .unbind_tcf = dualpi2_unbind,
+ .walk = dualpi2_walk,
+};
+
+static struct Qdisc_ops dualpi2_qdisc_ops __read_mostly = {
+ .id = "dualpi2",
+ .cl_ops = &dualpi2_class_ops,
+ .priv_size = sizeof(struct dualpi2_sched_data),
+ .enqueue = dualpi2_qdisc_enqueue,
+ .dequeue = dualpi2_qdisc_dequeue,
+ .peek = qdisc_peek_dequeued,
+ .init = dualpi2_init,
+ .destroy = dualpi2_destroy,
+ .reset = dualpi2_reset,
+ .change = dualpi2_change,
+ .dump = dualpi2_dump,
+ .dump_stats = dualpi2_dump_stats,
+ .owner = THIS_MODULE,
+};
+
+static int __init dualpi2_module_init(void)
+{
+ return register_qdisc(&dualpi2_qdisc_ops);
+}
+
+static void __exit dualpi2_module_exit(void)
+{
+ unregister_qdisc(&dualpi2_qdisc_ops);
+}
+
+module_init(dualpi2_module_init);
+module_exit(dualpi2_module_exit);
+
+MODULE_DESCRIPTION("Dual Queue with Proportional Integral controller Improved with a Square (dualpi2) scheduler");
+MODULE_AUTHOR("Koen De Schepper <koen.de_schepper@nokia-bell-labs.com>");
+MODULE_AUTHOR("Olga Albisser <olga@albisser.org>");
+MODULE_AUTHOR("Henrik Steen <henrist@henrist.net>");
+MODULE_AUTHOR("Olivier Tilmans <olivier.tilmans@nokia.com>");
+MODULE_AUTHOR("Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>");
+
+MODULE_LICENSE("GPL");
+MODULE_VERSION("1.0");
--
2.34.1
* [PATCH net-next 02/44] tcp: reorganize tcp_in_ack_event() and tcp_count_delivered()
2024-10-15 10:28 [PATCH net-next 00/44] DualPI2, Accurate ECN, TCP Prague patch series chia-yu.chang
2024-10-15 10:28 ` [PATCH net-next 01/44] sched: Add dualpi2 qdisc chia-yu.chang
@ 2024-10-15 10:28 ` chia-yu.chang
2024-10-15 10:28 ` [PATCH net-next 03/44] tcp: create FLAG_TS_PROGRESS chia-yu.chang
` (42 subsequent siblings)
44 siblings, 0 replies; 56+ messages in thread
From: chia-yu.chang @ 2024-10-15 10:28 UTC (permalink / raw)
To: netdev, ij, ncardwell, koen.de_schepper, g.white,
ingemar.s.johansson, mirja.kuehlewind, cheshire, rs.ietf,
Jason_Livingood, vidhi_goel
Cc: Chia-Yu Chang
From: Ilpo Järvinen <ij@kernel.org>
- Move tcp_count_delivered() earlier and split tcp_count_delivered_ce()
out of it
- Move tcp_in_ack_event() later
- While at it, remove the inline from tcp_in_ack_event() and let
  the compiler decide
Accurate ECN's heuristics do not know whether there is going
to be an ACE field based CE counter increase until after the
rtx queue has been processed. Only then is the number of ACKed
bytes/pkts available. As CE or not affects the presence of
FLAG_ECE, that information is not yet available at the old
location of the call to tcp_in_ack_event().
Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
---
net/ipv4/tcp_input.c | 56 +++++++++++++++++++++++++-------------------
1 file changed, 32 insertions(+), 24 deletions(-)
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 2d844e1f867f..5a6f93148814 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -413,6 +413,20 @@ static bool tcp_ecn_rcv_ecn_echo(const struct tcp_sock *tp, const struct tcphdr
return false;
}
+static void tcp_count_delivered_ce(struct tcp_sock *tp, u32 ecn_count)
+{
+ tp->delivered_ce += ecn_count;
+}
+
+/* Updates the delivered and delivered_ce counts */
+static void tcp_count_delivered(struct tcp_sock *tp, u32 delivered,
+ bool ece_ack)
+{
+ tp->delivered += delivered;
+ if (ece_ack)
+ tcp_count_delivered_ce(tp, delivered);
+}
+
/* Buffer size and advertised window tuning.
*
* 1. Tuning sk->sk_sndbuf, when connection enters established state.
@@ -1148,15 +1162,6 @@ void tcp_mark_skb_lost(struct sock *sk, struct sk_buff *skb)
}
}
-/* Updates the delivered and delivered_ce counts */
-static void tcp_count_delivered(struct tcp_sock *tp, u32 delivered,
- bool ece_ack)
-{
- tp->delivered += delivered;
- if (ece_ack)
- tp->delivered_ce += delivered;
-}
-
/* This procedure tags the retransmission queue when SACKs arrive.
*
* We have three tag bits: SACKED(S), RETRANS(R) and LOST(L).
@@ -3856,12 +3861,23 @@ static void tcp_process_tlp_ack(struct sock *sk, u32 ack, int flag)
}
}
-static inline void tcp_in_ack_event(struct sock *sk, u32 flags)
+static void tcp_in_ack_event(struct sock *sk, int flag)
{
const struct inet_connection_sock *icsk = inet_csk(sk);
- if (icsk->icsk_ca_ops->in_ack_event)
- icsk->icsk_ca_ops->in_ack_event(sk, flags);
+ if (icsk->icsk_ca_ops->in_ack_event) {
+ u32 ack_ev_flags = 0;
+
+ if (flag & FLAG_WIN_UPDATE)
+ ack_ev_flags |= CA_ACK_WIN_UPDATE;
+ if (flag & FLAG_SLOWPATH) {
+ ack_ev_flags = CA_ACK_SLOWPATH;
+ if (flag & FLAG_ECE)
+ ack_ev_flags |= CA_ACK_ECE;
+ }
+
+ icsk->icsk_ca_ops->in_ack_event(sk, ack_ev_flags);
+ }
}
/* Congestion control has updated the cwnd already. So if we're in
@@ -3978,12 +3994,8 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
tcp_snd_una_update(tp, ack);
flag |= FLAG_WIN_UPDATE;
- tcp_in_ack_event(sk, CA_ACK_WIN_UPDATE);
-
NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPHPACKS);
} else {
- u32 ack_ev_flags = CA_ACK_SLOWPATH;
-
if (ack_seq != TCP_SKB_CB(skb)->end_seq)
flag |= FLAG_DATA;
else
@@ -3995,19 +4007,12 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
flag |= tcp_sacktag_write_queue(sk, skb, prior_snd_una,
&sack_state);
- if (tcp_ecn_rcv_ecn_echo(tp, tcp_hdr(skb))) {
+ if (tcp_ecn_rcv_ecn_echo(tp, tcp_hdr(skb)))
flag |= FLAG_ECE;
- ack_ev_flags |= CA_ACK_ECE;
- }
if (sack_state.sack_delivered)
tcp_count_delivered(tp, sack_state.sack_delivered,
flag & FLAG_ECE);
-
- if (flag & FLAG_WIN_UPDATE)
- ack_ev_flags |= CA_ACK_WIN_UPDATE;
-
- tcp_in_ack_event(sk, ack_ev_flags);
}
/* This is a deviation from RFC3168 since it states that:
@@ -4034,6 +4039,8 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
tcp_rack_update_reo_wnd(sk, &rs);
+ tcp_in_ack_event(sk, flag);
+
if (tp->tlp_high_seq)
tcp_process_tlp_ack(sk, ack, flag);
@@ -4065,6 +4072,7 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
return 1;
no_queue:
+ tcp_in_ack_event(sk, flag);
/* If data was DSACKed, see if we can undo a cwnd reduction. */
if (flag & FLAG_DSACKING_ACK) {
tcp_fastretrans_alert(sk, prior_snd_una, num_dupack, &flag,
--
2.34.1
* [PATCH net-next 03/44] tcp: create FLAG_TS_PROGRESS
2024-10-15 10:28 [PATCH net-next 00/44] DualPI2, Accurate ECN, TCP Prague patch series chia-yu.chang
2024-10-15 10:28 ` [PATCH net-next 01/44] sched: Add dualpi2 qdisc chia-yu.chang
2024-10-15 10:28 ` [PATCH net-next 02/44] tcp: reorganize tcp_in_ack_event() and tcp_count_delivered() chia-yu.chang
@ 2024-10-15 10:28 ` chia-yu.chang
2024-10-15 10:29 ` [PATCH net-next 04/44] tcp: use BIT() macro in include/net/tcp.h chia-yu.chang
` (41 subsequent siblings)
44 siblings, 0 replies; 56+ messages in thread
From: chia-yu.chang @ 2024-10-15 10:28 UTC (permalink / raw)
To: netdev, ij, ncardwell, koen.de_schepper, g.white,
ingemar.s.johansson, mirja.kuehlewind, cheshire, rs.ietf,
Jason_Livingood, vidhi_goel
Cc: Chia-Yu Chang
From: Ilpo Järvinen <ij@kernel.org>
Whenever the timestamp advances, it declares progress, which
other parts of the stack can use to decide that the ACK is
the most recent one seen so far.
AccECN will use this flag when deciding whether to use the
ACK to update AccECN state.
Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
---
net/ipv4/tcp_input.c | 34 +++++++++++++++++++++++++---------
1 file changed, 25 insertions(+), 9 deletions(-)
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 5a6f93148814..7b8e69ccbbb0 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -102,6 +102,7 @@ int sysctl_tcp_max_orphans __read_mostly = NR_FILE;
#define FLAG_NO_CHALLENGE_ACK 0x8000 /* do not call tcp_send_challenge_ack() */
#define FLAG_ACK_MAYBE_DELAYED 0x10000 /* Likely a delayed ACK */
#define FLAG_DSACK_TLP 0x20000 /* DSACK for tail loss probe */
+#define FLAG_TS_PROGRESS 0x40000 /* Positive timestamp delta */
#define FLAG_ACKED (FLAG_DATA_ACKED|FLAG_SYN_ACKED)
#define FLAG_NOT_DUP (FLAG_DATA|FLAG_WIN_UPDATE|FLAG_ACKED)
@@ -3813,8 +3814,16 @@ static void tcp_store_ts_recent(struct tcp_sock *tp)
tp->rx_opt.ts_recent_stamp = ktime_get_seconds();
}
-static void tcp_replace_ts_recent(struct tcp_sock *tp, u32 seq)
+static int __tcp_replace_ts_recent(struct tcp_sock *tp, s32 tstamp_delta)
{
+ tcp_store_ts_recent(tp);
+ return tstamp_delta > 0 ? FLAG_TS_PROGRESS : 0;
+}
+
+static int tcp_replace_ts_recent(struct tcp_sock *tp, u32 seq)
+{
+ s32 delta;
+
if (tp->rx_opt.saw_tstamp && !after(seq, tp->rcv_wup)) {
/* PAWS bug workaround wrt. ACK frames, the PAWS discard
* extra check below makes sure this can only happen
@@ -3823,9 +3832,13 @@ static void tcp_replace_ts_recent(struct tcp_sock *tp, u32 seq)
* Not only, also it occurs for expired timestamps.
*/
- if (tcp_paws_check(&tp->rx_opt, 0))
- tcp_store_ts_recent(tp);
+ if (tcp_paws_check(&tp->rx_opt, 0)) {
+ delta = tp->rx_opt.rcv_tsval - tp->rx_opt.ts_recent;
+ return __tcp_replace_ts_recent(tp, delta);
+ }
}
+
+ return 0;
}
/* This routine deals with acks during a TLP episode and ends an episode by
@@ -3982,7 +3995,7 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
* is in window.
*/
if (flag & FLAG_UPDATE_TS_RECENT)
- tcp_replace_ts_recent(tp, TCP_SKB_CB(skb)->seq);
+ flag |= tcp_replace_ts_recent(tp, TCP_SKB_CB(skb)->seq);
if ((flag & (FLAG_SLOWPATH | FLAG_SND_UNA_ADVANCED)) ==
FLAG_SND_UNA_ADVANCED) {
@@ -6140,6 +6153,8 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb)
TCP_SKB_CB(skb)->seq == tp->rcv_nxt &&
!after(TCP_SKB_CB(skb)->ack_seq, tp->snd_nxt)) {
int tcp_header_len = tp->tcp_header_len;
+ s32 tstamp_delta = 0;
+ int flag = 0;
/* Timestamp header prediction: tcp_header_len
* is automatically equal to th->doff*4 due to pred_flags
@@ -6152,8 +6167,9 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb)
if (!tcp_parse_aligned_timestamp(tp, th))
goto slow_path;
+ tstamp_delta = tp->rx_opt.rcv_tsval - tp->rx_opt.ts_recent;
/* If PAWS failed, check it more carefully in slow path */
- if ((s32)(tp->rx_opt.rcv_tsval - tp->rx_opt.ts_recent) < 0)
+ if (tstamp_delta < 0)
goto slow_path;
/* DO NOT update ts_recent here, if checksum fails
@@ -6173,12 +6189,12 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb)
if (tcp_header_len ==
(sizeof(struct tcphdr) + TCPOLEN_TSTAMP_ALIGNED) &&
tp->rcv_nxt == tp->rcv_wup)
- tcp_store_ts_recent(tp);
+ flag |= __tcp_replace_ts_recent(tp, tstamp_delta);
/* We know that such packets are checksummed
* on entry.
*/
- tcp_ack(sk, skb, 0);
+ tcp_ack(sk, skb, flag);
__kfree_skb(skb);
tcp_data_snd_check(sk);
/* When receiving pure ack in fast path, update
@@ -6209,7 +6225,7 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb)
if (tcp_header_len ==
(sizeof(struct tcphdr) + TCPOLEN_TSTAMP_ALIGNED) &&
tp->rcv_nxt == tp->rcv_wup)
- tcp_store_ts_recent(tp);
+ flag |= __tcp_replace_ts_recent(tp, tstamp_delta);
tcp_rcv_rtt_measure_ts(sk, skb);
@@ -6224,7 +6240,7 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb)
if (TCP_SKB_CB(skb)->ack_seq != tp->snd_una) {
/* Well, only one small jumplet in fast path... */
- tcp_ack(sk, skb, FLAG_DATA);
+ tcp_ack(sk, skb, flag | FLAG_DATA);
tcp_data_snd_check(sk);
if (!inet_csk_ack_scheduled(sk))
goto no_ack;
--
2.34.1
* [PATCH net-next 04/44] tcp: use BIT() macro in include/net/tcp.h
2024-10-15 10:28 [PATCH net-next 00/44] DualPI2, Accurate ECN, TCP Prague patch series chia-yu.chang
` (2 preceding siblings ...)
2024-10-15 10:28 ` [PATCH net-next 03/44] tcp: create FLAG_TS_PROGRESS chia-yu.chang
@ 2024-10-15 10:29 ` chia-yu.chang
2024-10-15 10:29 ` [PATCH net-next 05/44] tcp: extend TCP flags to allow AE bit/ACE field chia-yu.chang
` (40 subsequent siblings)
44 siblings, 0 replies; 56+ messages in thread
From: chia-yu.chang @ 2024-10-15 10:29 UTC (permalink / raw)
To: netdev, ij, ncardwell, koen.de_schepper, g.white,
ingemar.s.johansson, mirja.kuehlewind, cheshire, rs.ietf,
Jason_Livingood, vidhi_goel
Cc: Chia-Yu Chang
From: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Use the BIT() macro for the TCP flags field and the TCP congestion
control flags that will be used by congestion control algorithms.
No functional changes.
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Reviewed-by: Ilpo Järvinen <ij@kernel.org>
---
include/net/tcp.h | 21 +++++++++++----------
1 file changed, 11 insertions(+), 10 deletions(-)
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 739a9fb83d0c..bc34b450929c 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -26,6 +26,7 @@
#include <linux/kref.h>
#include <linux/ktime.h>
#include <linux/indirect_call_wrapper.h>
+#include <linux/bits.h>
#include <net/inet_connection_sock.h>
#include <net/inet_timewait_sock.h>
@@ -911,14 +912,14 @@ static inline u32 tcp_rsk_tsval(const struct tcp_request_sock *treq)
#define tcp_flag_byte(th) (((u_int8_t *)th)[13])
-#define TCPHDR_FIN 0x01
-#define TCPHDR_SYN 0x02
-#define TCPHDR_RST 0x04
-#define TCPHDR_PSH 0x08
-#define TCPHDR_ACK 0x10
-#define TCPHDR_URG 0x20
-#define TCPHDR_ECE 0x40
-#define TCPHDR_CWR 0x80
+#define TCPHDR_FIN BIT(0)
+#define TCPHDR_SYN BIT(1)
+#define TCPHDR_RST BIT(2)
+#define TCPHDR_PSH BIT(3)
+#define TCPHDR_ACK BIT(4)
+#define TCPHDR_URG BIT(5)
+#define TCPHDR_ECE BIT(6)
+#define TCPHDR_CWR BIT(7)
#define TCPHDR_SYN_ECN (TCPHDR_SYN | TCPHDR_ECE | TCPHDR_CWR)
@@ -1107,9 +1108,9 @@ enum tcp_ca_ack_event_flags {
#define TCP_CA_UNSPEC 0
/* Algorithm can be set on socket without CAP_NET_ADMIN privileges */
-#define TCP_CONG_NON_RESTRICTED 0x1
+#define TCP_CONG_NON_RESTRICTED BIT(0)
/* Requires ECN/ECT set on all packets */
-#define TCP_CONG_NEEDS_ECN 0x2
+#define TCP_CONG_NEEDS_ECN BIT(1)
#define TCP_CONG_MASK (TCP_CONG_NON_RESTRICTED | TCP_CONG_NEEDS_ECN)
union tcp_cc_info;
--
2.34.1
* [PATCH net-next 05/44] tcp: extend TCP flags to allow AE bit/ACE field
2024-10-15 10:28 [PATCH net-next 00/44] DualPI2, Accurate ECN, TCP Prague patch series chia-yu.chang
` (3 preceding siblings ...)
2024-10-15 10:29 ` [PATCH net-next 04/44] tcp: use BIT() macro in include/net/tcp.h chia-yu.chang
@ 2024-10-15 10:29 ` chia-yu.chang
2024-10-15 10:29 ` [PATCH net-next 06/44] tcp: reorganize SYN ECN code chia-yu.chang
` (39 subsequent siblings)
44 siblings, 0 replies; 56+ messages in thread
From: chia-yu.chang @ 2024-10-15 10:29 UTC (permalink / raw)
To: netdev, ij, ncardwell, koen.de_schepper, g.white,
ingemar.s.johansson, mirja.kuehlewind, cheshire, rs.ietf,
Jason_Livingood, vidhi_goel
Cc: Chia-Yu Chang
From: Ilpo Järvinen <ij@kernel.org>
With AccECN, there is one additional TCP flag to be used (AE)
and the ACE field that overloads the definition of the AE, CWR,
and ECE flags. As tcp_flags was previously only 1 byte, byte-order
handling needs to be added when accessing it.
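For illustration, a small userspace sketch of how the 3-bit ACE field is
composed from the AE, CWR and ECE header bits (AE being the most
significant bit, per the AccECN draft; the kernel-side equivalents are
the TCPHDR_AE/TCPHDR_ACE masks added below):

  /* Sketch only, not kernel code: build the ACE value from the three
   * header bits that the ACE field overloads.
   */
  #include <stdio.h>

  static unsigned int tcp_ace_field(unsigned int ae, unsigned int cwr,
                                    unsigned int ece)
  {
          return (ae << 2) | (cwr << 1) | ece;
  }

  int main(void)
  {
          printf("AE=1 CWR=0 ECE=1 -> ACE=%u\n", tcp_ace_field(1, 0, 1));
          return 0;
  }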
Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
---
include/net/tcp.h | 7 ++++++-
include/uapi/linux/tcp.h | 9 ++++++---
net/ipv4/tcp_ipv4.c | 3 ++-
net/ipv4/tcp_output.c | 8 ++++----
net/ipv6/tcp_ipv6.c | 3 ++-
net/netfilter/nf_log_syslog.c | 8 +++++---
6 files changed, 25 insertions(+), 13 deletions(-)
diff --git a/include/net/tcp.h b/include/net/tcp.h
index bc34b450929c..549fec6681d0 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -920,7 +920,12 @@ static inline u32 tcp_rsk_tsval(const struct tcp_request_sock *treq)
#define TCPHDR_URG BIT(5)
#define TCPHDR_ECE BIT(6)
#define TCPHDR_CWR BIT(7)
+#define TCPHDR_AE BIT(8)
+#define TCPHDR_FLAGS_MASK (TCPHDR_FIN | TCPHDR_SYN | TCPHDR_RST | \
+ TCPHDR_PSH | TCPHDR_ACK | TCPHDR_URG | \
+ TCPHDR_ECE | TCPHDR_CWR | TCPHDR_AE)
+#define TCPHDR_ACE (TCPHDR_ECE | TCPHDR_CWR | TCPHDR_AE)
#define TCPHDR_SYN_ECN (TCPHDR_SYN | TCPHDR_ECE | TCPHDR_CWR)
/* State flags for sacked in struct tcp_skb_cb */
@@ -955,7 +960,7 @@ struct tcp_skb_cb {
u16 tcp_gso_size;
};
};
- __u8 tcp_flags; /* TCP header flags. (tcp[13]) */
+ __u16 tcp_flags; /* TCP header flags. (tcp[12-13]) */
__u8 sacked; /* State flags for SACK. */
__u8 ip_dsfield; /* IPv4 tos or IPv6 dsfield */
diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
index dbf896f3146c..3fe08d7dddaf 100644
--- a/include/uapi/linux/tcp.h
+++ b/include/uapi/linux/tcp.h
@@ -28,7 +28,8 @@ struct tcphdr {
__be32 seq;
__be32 ack_seq;
#if defined(__LITTLE_ENDIAN_BITFIELD)
- __u16 res1:4,
+ __u16 ae:1,
+ res1:3,
doff:4,
fin:1,
syn:1,
@@ -40,7 +41,8 @@ struct tcphdr {
cwr:1;
#elif defined(__BIG_ENDIAN_BITFIELD)
__u16 doff:4,
- res1:4,
+ res1:3,
+ ae:1,
cwr:1,
ece:1,
urg:1,
@@ -70,6 +72,7 @@ union tcp_word_hdr {
#define tcp_flag_word(tp) (((union tcp_word_hdr *)(tp))->words[3])
enum {
+ TCP_FLAG_AE = __constant_cpu_to_be32(0x01000000),
TCP_FLAG_CWR = __constant_cpu_to_be32(0x00800000),
TCP_FLAG_ECE = __constant_cpu_to_be32(0x00400000),
TCP_FLAG_URG = __constant_cpu_to_be32(0x00200000),
@@ -78,7 +81,7 @@ enum {
TCP_FLAG_RST = __constant_cpu_to_be32(0x00040000),
TCP_FLAG_SYN = __constant_cpu_to_be32(0x00020000),
TCP_FLAG_FIN = __constant_cpu_to_be32(0x00010000),
- TCP_RESERVED_BITS = __constant_cpu_to_be32(0x0F000000),
+ TCP_RESERVED_BITS = __constant_cpu_to_be32(0x0E000000),
TCP_DATA_OFFSET = __constant_cpu_to_be32(0xF0000000)
};
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 9d3dd101ea71..9fe314a59240 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2162,7 +2162,8 @@ static void tcp_v4_fill_cb(struct sk_buff *skb, const struct iphdr *iph,
TCP_SKB_CB(skb)->end_seq = (TCP_SKB_CB(skb)->seq + th->syn + th->fin +
skb->len - th->doff * 4);
TCP_SKB_CB(skb)->ack_seq = ntohl(th->ack_seq);
- TCP_SKB_CB(skb)->tcp_flags = tcp_flag_byte(th);
+ TCP_SKB_CB(skb)->tcp_flags = ntohs(*(__be16 *)&tcp_flag_word(th)) &
+ TCPHDR_FLAGS_MASK;
TCP_SKB_CB(skb)->ip_dsfield = ipv4_get_dsfield(iph);
TCP_SKB_CB(skb)->sacked = 0;
TCP_SKB_CB(skb)->has_rxtstamp =
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 054244ce5117..45cb67c635be 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -400,7 +400,7 @@ static void tcp_ecn_send(struct sock *sk, struct sk_buff *skb,
/* Constructs common control bits of non-data skb. If SYN/FIN is present,
* auto increment end seqno.
*/
-static void tcp_init_nondata_skb(struct sk_buff *skb, u32 seq, u8 flags)
+static void tcp_init_nondata_skb(struct sk_buff *skb, u32 seq, u16 flags)
{
skb->ip_summed = CHECKSUM_PARTIAL;
@@ -1382,7 +1382,7 @@ static int __tcp_transmit_skb(struct sock *sk, struct sk_buff *skb,
th->seq = htonl(tcb->seq);
th->ack_seq = htonl(rcv_nxt);
*(((__be16 *)th) + 6) = htons(((tcp_header_size >> 2) << 12) |
- tcb->tcp_flags);
+ (tcb->tcp_flags & TCPHDR_FLAGS_MASK));
th->check = 0;
th->urg_ptr = 0;
@@ -1604,7 +1604,7 @@ int tcp_fragment(struct sock *sk, enum tcp_queue tcp_queue,
int old_factor;
long limit;
int nlen;
- u8 flags;
+ u16 flags;
if (WARN_ON(len > skb->len))
return -EINVAL;
@@ -2159,7 +2159,7 @@ static int tso_fragment(struct sock *sk, struct sk_buff *skb, unsigned int len,
{
int nlen = skb->len - len;
struct sk_buff *buff;
- u8 flags;
+ u16 flags;
/* All of a TSO frame must be composed of paged data. */
DEBUG_NET_WARN_ON_ONCE(skb->len != skb->data_len);
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 597920061a3a..252d3dac3a09 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1737,7 +1737,8 @@ static void tcp_v6_fill_cb(struct sk_buff *skb, const struct ipv6hdr *hdr,
TCP_SKB_CB(skb)->end_seq = (TCP_SKB_CB(skb)->seq + th->syn + th->fin +
skb->len - th->doff*4);
TCP_SKB_CB(skb)->ack_seq = ntohl(th->ack_seq);
- TCP_SKB_CB(skb)->tcp_flags = tcp_flag_byte(th);
+ TCP_SKB_CB(skb)->tcp_flags = ntohs(*(__be16 *)&tcp_flag_word(th)) &
+ TCPHDR_FLAGS_MASK;
TCP_SKB_CB(skb)->ip_dsfield = ipv6_get_dsfield(hdr);
TCP_SKB_CB(skb)->sacked = 0;
TCP_SKB_CB(skb)->has_rxtstamp =
diff --git a/net/netfilter/nf_log_syslog.c b/net/netfilter/nf_log_syslog.c
index 58402226045e..86d5fc5d28e3 100644
--- a/net/netfilter/nf_log_syslog.c
+++ b/net/netfilter/nf_log_syslog.c
@@ -216,7 +216,9 @@ nf_log_dump_tcp_header(struct nf_log_buf *m,
/* Max length: 9 "RES=0x3C " */
nf_log_buf_add(m, "RES=0x%02x ", (u_int8_t)(ntohl(tcp_flag_word(th) &
TCP_RESERVED_BITS) >> 22));
- /* Max length: 32 "CWR ECE URG ACK PSH RST SYN FIN " */
+ /* Max length: 35 "AE CWR ECE URG ACK PSH RST SYN FIN " */
+ if (th->ae)
+ nf_log_buf_add(m, "AE ");
if (th->cwr)
nf_log_buf_add(m, "CWR ");
if (th->ece)
@@ -516,7 +518,7 @@ dump_ipv4_packet(struct net *net, struct nf_log_buf *m,
/* Proto Max log string length */
/* IP: 40+46+6+11+127 = 230 */
- /* TCP: 10+max(25,20+30+13+9+32+11+127) = 252 */
+ /* TCP: 10+max(25,20+30+13+9+35+11+127) = 255 */
/* UDP: 10+max(25,20) = 35 */
/* UDPLITE: 14+max(25,20) = 39 */
/* ICMP: 11+max(25, 18+25+max(19,14,24+3+n+10,3+n+10)) = 91+n */
@@ -526,7 +528,7 @@ dump_ipv4_packet(struct net *net, struct nf_log_buf *m,
/* (ICMP allows recursion one level deep) */
/* maxlen = IP + ICMP + IP + max(TCP,UDP,ICMP,unknown) */
- /* maxlen = 230+ 91 + 230 + 252 = 803 */
+ /* maxlen = 230+ 91 + 230 + 255 = 806 */
}
static noinline_for_stack void
--
2.34.1
^ permalink raw reply related [flat|nested] 56+ messages in thread
* [PATCH net-next 06/44] tcp: reorganize SYN ECN code
2024-10-15 10:28 [PATCH net-next 00/44] DualPI2, Accurate ECN, TCP Prague patch series chia-yu.chang
` (4 preceding siblings ...)
2024-10-15 10:29 ` [PATCH net-next 05/44] tcp: extend TCP flags to allow AE bit/ACE field chia-yu.chang
@ 2024-10-15 10:29 ` chia-yu.chang
2024-10-15 10:29 ` [PATCH net-next 07/44] tcp: rework {__,}tcp_ecn_check_ce() -> tcp_data_ecn_check() chia-yu.chang
` (38 subsequent siblings)
44 siblings, 0 replies; 56+ messages in thread
From: chia-yu.chang @ 2024-10-15 10:29 UTC (permalink / raw)
To: netdev, ij, ncardwell, koen.de_schepper, g.white,
ingemar.s.johansson, mirja.kuehlewind, cheshire, rs.ietf,
Jason_Livingood, vidhi_goel
Cc: Chia-Yu Chang
From: Ilpo Järvinen <ij@kernel.org>
Prepare for AccECN, which needs access here to the IP ECN field
value that only becomes available after INET_ECN_xmit().
No functional changes.
Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
---
net/ipv4/tcp_output.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 45cb67c635be..64d47c18255f 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -347,10 +347,11 @@ static void tcp_ecn_send_syn(struct sock *sk, struct sk_buff *skb)
tp->ecn_flags = 0;
if (use_ecn) {
- TCP_SKB_CB(skb)->tcp_flags |= TCPHDR_ECE | TCPHDR_CWR;
- tp->ecn_flags = TCP_ECN_OK;
if (tcp_ca_needs_ecn(sk) || bpf_needs_ecn)
INET_ECN_xmit(sk);
+
+ TCP_SKB_CB(skb)->tcp_flags |= TCPHDR_ECE | TCPHDR_CWR;
+ tp->ecn_flags = TCP_ECN_OK;
}
}
--
2.34.1
^ permalink raw reply related [flat|nested] 56+ messages in thread
* [PATCH net-next 07/44] tcp: rework {__,}tcp_ecn_check_ce() -> tcp_data_ecn_check()
2024-10-15 10:28 [PATCH net-next 00/44] DualPI2, Accurate ECN, TCP Prague patch series chia-yu.chang
` (5 preceding siblings ...)
2024-10-15 10:29 ` [PATCH net-next 06/44] tcp: reorganize SYN ECN code chia-yu.chang
@ 2024-10-15 10:29 ` chia-yu.chang
2024-10-15 10:29 ` [PATCH net-next 08/44] tcp: helpers for ECN mode handling chia-yu.chang
` (37 subsequent siblings)
44 siblings, 0 replies; 56+ messages in thread
From: chia-yu.chang @ 2024-10-15 10:29 UTC (permalink / raw)
To: netdev, ij, ncardwell, koen.de_schepper, g.white,
ingemar.s.johansson, mirja.kuehlewind, cheshire, rs.ietf,
Jason_Livingood, vidhi_goel
Cc: Chia-Yu Chang
From: Ilpo Järvinen <ij@kernel.org>
Rename tcp_ecn_check_ce to tcp_data_ecn_check as it is called only
for data segments, not for ACKs (with AccECN, ACKs may also carry
ECN bits).
The extra "layer" in the tcp_ecn_check_ce() function only checks
whether ECN is enabled; that check can be moved into
tcp_data_ecn_check() itself rather than keeping the __ variant.
No functional changes.
Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
---
net/ipv4/tcp_input.c | 15 ++++++---------
1 file changed, 6 insertions(+), 9 deletions(-)
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 7b8e69ccbbb0..7b4e7ed8cc52 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -357,10 +357,13 @@ static void tcp_ecn_withdraw_cwr(struct tcp_sock *tp)
tp->ecn_flags &= ~TCP_ECN_QUEUE_CWR;
}
-static void __tcp_ecn_check_ce(struct sock *sk, const struct sk_buff *skb)
+static void tcp_data_ecn_check(struct sock *sk, const struct sk_buff *skb)
{
struct tcp_sock *tp = tcp_sk(sk);
+ if (!(tcp_sk(sk)->ecn_flags & TCP_ECN_OK))
+ return;
+
switch (TCP_SKB_CB(skb)->ip_dsfield & INET_ECN_MASK) {
case INET_ECN_NOT_ECT:
/* Funny extension: if ECT is not set on a segment,
@@ -389,12 +392,6 @@ static void __tcp_ecn_check_ce(struct sock *sk, const struct sk_buff *skb)
}
}
-static void tcp_ecn_check_ce(struct sock *sk, const struct sk_buff *skb)
-{
- if (tcp_sk(sk)->ecn_flags & TCP_ECN_OK)
- __tcp_ecn_check_ce(sk, skb);
-}
-
static void tcp_ecn_rcv_synack(struct tcp_sock *tp, const struct tcphdr *th)
{
if ((tp->ecn_flags & TCP_ECN_OK) && (!th->ece || th->cwr))
@@ -866,7 +863,7 @@ static void tcp_event_data_recv(struct sock *sk, struct sk_buff *skb)
icsk->icsk_ack.lrcvtime = now;
tcp_save_lrcv_flowlabel(sk, skb);
- tcp_ecn_check_ce(sk, skb);
+ tcp_data_ecn_check(sk, skb);
if (skb->len >= 128)
tcp_grow_window(sk, skb, true);
@@ -5028,7 +5025,7 @@ static void tcp_data_queue_ofo(struct sock *sk, struct sk_buff *skb)
bool fragstolen;
tcp_save_lrcv_flowlabel(sk, skb);
- tcp_ecn_check_ce(sk, skb);
+ tcp_data_ecn_check(sk, skb);
if (unlikely(tcp_try_rmem_schedule(sk, skb, skb->truesize))) {
NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPOFODROP);
--
2.34.1
^ permalink raw reply related [flat|nested] 56+ messages in thread
* [PATCH net-next 08/44] tcp: helpers for ECN mode handling
2024-10-15 10:28 [PATCH net-next 00/44] DualPI2, Accurate ECN, TCP Prague patch series chia-yu.chang
` (6 preceding siblings ...)
2024-10-15 10:29 ` [PATCH net-next 07/44] tcp: rework {__,}tcp_ecn_check_ce() -> tcp_data_ecn_check() chia-yu.chang
@ 2024-10-15 10:29 ` chia-yu.chang
2024-10-15 10:29 ` [PATCH net-next 09/44] gso: AccECN support chia-yu.chang
` (36 subsequent siblings)
44 siblings, 0 replies; 56+ messages in thread
From: chia-yu.chang @ 2024-10-15 10:29 UTC (permalink / raw)
To: netdev, ij, ncardwell, koen.de_schepper, g.white,
ingemar.s.johansson, mirja.kuehlewind, cheshire, rs.ietf,
Jason_Livingood, vidhi_goel
Cc: Chia-Yu Chang
From: Ilpo Järvinen <ij@kernel.org>
Create helpers for TCP ECN modes. No functional changes.
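As a rough sketch of the mode encoding the helpers below introduce
(userspace illustration only, inferred from the diff; names with a
trailing '_' are local stand-ins for the TCP_ECN_MODE_* bits):
RFC3168 and AccECN each get one bit, 'disabled' is neither bit, and
'pending' is both bits set until the negotiation outcome is known.

  #include <stdbool.h>
  #include <stdint.h>

  #define ECN_MODE_RFC3168_ (1u << 0)
  #define ECN_MODE_ACCECN_  (1u << 4)
  #define ECN_MODE_ANY_     (ECN_MODE_RFC3168_ | ECN_MODE_ACCECN_)
  #define ECN_MODE_PENDING_ ECN_MODE_ANY_

  static bool ecn_mode_rfc3168(uint32_t ecn_flags)
  {
          /* exactly the RFC3168 bit: not disabled, AccECN or pending */
          return (ecn_flags & ECN_MODE_ANY_) == ECN_MODE_RFC3168_;
  }

  static bool ecn_mode_pending(uint32_t ecn_flags)
  {
          /* both mode bits set: final mode not decided yet */
          return (ecn_flags & ECN_MODE_PENDING_) == ECN_MODE_PENDING_;
  }

  static uint32_t ecn_mode_set(uint32_t ecn_flags, uint32_t mode)
  {
          /* replace only the mode bits, keep the other ecn_flags bits */
          return (ecn_flags & ~ECN_MODE_ANY_) | mode;
  }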
Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
---
include/net/tcp.h | 44 ++++++++++++++++++++++++++++++++++++----
net/ipv4/tcp.c | 2 +-
net/ipv4/tcp_dctcp.c | 2 +-
net/ipv4/tcp_input.c | 14 ++++++-------
net/ipv4/tcp_minisocks.c | 4 +++-
net/ipv4/tcp_output.c | 6 +++---
6 files changed, 55 insertions(+), 17 deletions(-)
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 549fec6681d0..ae3f900f17c1 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -372,10 +372,46 @@ static inline void tcp_dec_quickack_mode(struct sock *sk)
}
}
-#define TCP_ECN_OK 1
-#define TCP_ECN_QUEUE_CWR 2
-#define TCP_ECN_DEMAND_CWR 4
-#define TCP_ECN_SEEN 8
+#define TCP_ECN_MODE_RFC3168 BIT(0)
+#define TCP_ECN_QUEUE_CWR BIT(1)
+#define TCP_ECN_DEMAND_CWR BIT(2)
+#define TCP_ECN_SEEN BIT(3)
+#define TCP_ECN_MODE_ACCECN BIT(4)
+
+#define TCP_ECN_DISABLED 0
+#define TCP_ECN_MODE_PENDING (TCP_ECN_MODE_RFC3168|TCP_ECN_MODE_ACCECN)
+#define TCP_ECN_MODE_ANY (TCP_ECN_MODE_RFC3168|TCP_ECN_MODE_ACCECN)
+
+static inline bool tcp_ecn_mode_any(const struct tcp_sock *tp)
+{
+ return tp->ecn_flags & TCP_ECN_MODE_ANY;
+}
+
+static inline bool tcp_ecn_mode_rfc3168(const struct tcp_sock *tp)
+{
+ return (tp->ecn_flags & TCP_ECN_MODE_ANY) == TCP_ECN_MODE_RFC3168;
+}
+
+static inline bool tcp_ecn_mode_accecn(const struct tcp_sock *tp)
+{
+ return (tp->ecn_flags & TCP_ECN_MODE_ANY) == TCP_ECN_MODE_ACCECN;
+}
+
+static inline bool tcp_ecn_disabled(const struct tcp_sock *tp)
+{
+ return !tcp_ecn_mode_any(tp);
+}
+
+static inline bool tcp_ecn_mode_pending(const struct tcp_sock *tp)
+{
+ return (tp->ecn_flags & TCP_ECN_MODE_PENDING) == TCP_ECN_MODE_PENDING;
+}
+
+static inline void tcp_ecn_mode_set(struct tcp_sock *tp, u8 mode)
+{
+ tp->ecn_flags &= ~TCP_ECN_MODE_ANY;
+ tp->ecn_flags |= mode;
+}
enum tcp_tw_status {
TCP_TW_SUCCESS = 0,
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 82cc4a5633ce..94546f55385a 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -4107,7 +4107,7 @@ void tcp_get_info(struct sock *sk, struct tcp_info *info)
info->tcpi_rcv_wscale = tp->rx_opt.rcv_wscale;
}
- if (tp->ecn_flags & TCP_ECN_OK)
+ if (tcp_ecn_mode_any(tp))
info->tcpi_options |= TCPI_OPT_ECN;
if (tp->ecn_flags & TCP_ECN_SEEN)
info->tcpi_options |= TCPI_OPT_ECN_SEEN;
diff --git a/net/ipv4/tcp_dctcp.c b/net/ipv4/tcp_dctcp.c
index 8a45a4aea933..03abe0848420 100644
--- a/net/ipv4/tcp_dctcp.c
+++ b/net/ipv4/tcp_dctcp.c
@@ -90,7 +90,7 @@ __bpf_kfunc static void dctcp_init(struct sock *sk)
{
const struct tcp_sock *tp = tcp_sk(sk);
- if ((tp->ecn_flags & TCP_ECN_OK) ||
+ if (tcp_ecn_mode_any(tp) ||
(sk->sk_state == TCP_LISTEN ||
sk->sk_state == TCP_CLOSE)) {
struct dctcp *ca = inet_csk_ca(sk);
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 7b4e7ed8cc52..e8d32a231a9e 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -334,7 +334,7 @@ static bool tcp_in_quickack_mode(struct sock *sk)
static void tcp_ecn_queue_cwr(struct tcp_sock *tp)
{
- if (tp->ecn_flags & TCP_ECN_OK)
+ if (tcp_ecn_mode_rfc3168(tp))
tp->ecn_flags |= TCP_ECN_QUEUE_CWR;
}
@@ -361,7 +361,7 @@ static void tcp_data_ecn_check(struct sock *sk, const struct sk_buff *skb)
{
struct tcp_sock *tp = tcp_sk(sk);
- if (!(tcp_sk(sk)->ecn_flags & TCP_ECN_OK))
+ if (tcp_ecn_disabled(tp))
return;
switch (TCP_SKB_CB(skb)->ip_dsfield & INET_ECN_MASK) {
@@ -394,19 +394,19 @@ static void tcp_data_ecn_check(struct sock *sk, const struct sk_buff *skb)
static void tcp_ecn_rcv_synack(struct tcp_sock *tp, const struct tcphdr *th)
{
- if ((tp->ecn_flags & TCP_ECN_OK) && (!th->ece || th->cwr))
- tp->ecn_flags &= ~TCP_ECN_OK;
+ if (tcp_ecn_mode_rfc3168(tp) && (!th->ece || th->cwr))
+ tcp_ecn_mode_set(tp, TCP_ECN_DISABLED);
}
static void tcp_ecn_rcv_syn(struct tcp_sock *tp, const struct tcphdr *th)
{
- if ((tp->ecn_flags & TCP_ECN_OK) && (!th->ece || !th->cwr))
- tp->ecn_flags &= ~TCP_ECN_OK;
+ if (tcp_ecn_mode_rfc3168(tp) && (!th->ece || !th->cwr))
+ tcp_ecn_mode_set(tp, TCP_ECN_DISABLED);
}
static bool tcp_ecn_rcv_ecn_echo(const struct tcp_sock *tp, const struct tcphdr *th)
{
- if (th->ece && !th->syn && (tp->ecn_flags & TCP_ECN_OK))
+ if (th->ece && !th->syn && tcp_ecn_mode_rfc3168(tp))
return true;
return false;
}
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index bb1fe1ba867a..bd6515ab660f 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -453,7 +453,9 @@ EXPORT_SYMBOL(tcp_openreq_init_rwin);
static void tcp_ecn_openreq_child(struct tcp_sock *tp,
const struct request_sock *req)
{
- tp->ecn_flags = inet_rsk(req)->ecn_ok ? TCP_ECN_OK : 0;
+ tcp_ecn_mode_set(tp, inet_rsk(req)->ecn_ok ?
+ TCP_ECN_MODE_RFC3168 :
+ TCP_ECN_DISABLED);
}
void tcp_ca_openreq_child(struct sock *sk, const struct dst_entry *dst)
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 64d47c18255f..bb83ad43a4e2 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -322,7 +322,7 @@ static void tcp_ecn_send_synack(struct sock *sk, struct sk_buff *skb)
const struct tcp_sock *tp = tcp_sk(sk);
TCP_SKB_CB(skb)->tcp_flags &= ~TCPHDR_CWR;
- if (!(tp->ecn_flags & TCP_ECN_OK))
+ if (tcp_ecn_disabled(tp))
TCP_SKB_CB(skb)->tcp_flags &= ~TCPHDR_ECE;
else if (tcp_ca_needs_ecn(sk) ||
tcp_bpf_ca_needs_ecn(sk))
@@ -351,7 +351,7 @@ static void tcp_ecn_send_syn(struct sock *sk, struct sk_buff *skb)
INET_ECN_xmit(sk);
TCP_SKB_CB(skb)->tcp_flags |= TCPHDR_ECE | TCPHDR_CWR;
- tp->ecn_flags = TCP_ECN_OK;
+ tcp_ecn_mode_set(tp, TCP_ECN_MODE_RFC3168);
}
}
@@ -379,7 +379,7 @@ static void tcp_ecn_send(struct sock *sk, struct sk_buff *skb,
{
struct tcp_sock *tp = tcp_sk(sk);
- if (tp->ecn_flags & TCP_ECN_OK) {
+ if (tcp_ecn_mode_rfc3168(tp)) {
/* Not-retransmitted data segment: set ECT and inject CWR. */
if (skb->len != tcp_header_len &&
!before(TCP_SKB_CB(skb)->seq, tp->snd_nxt)) {
--
2.34.1
^ permalink raw reply related [flat|nested] 56+ messages in thread
* [PATCH net-next 09/44] gso: AccECN support
2024-10-15 10:28 [PATCH net-next 00/44] DualPI2, Accurate ECN, TCP Prague patch series chia-yu.chang
` (7 preceding siblings ...)
2024-10-15 10:29 ` [PATCH net-next 08/44] tcp: helpers for ECN mode handling chia-yu.chang
@ 2024-10-15 10:29 ` chia-yu.chang
2024-10-16 1:31 ` Jakub Kicinski
2024-10-15 10:29 ` [PATCH net-next 10/44] gro: prevent ACE field corruption & better AccECN handling chia-yu.chang
` (35 subsequent siblings)
44 siblings, 1 reply; 56+ messages in thread
From: chia-yu.chang @ 2024-10-15 10:29 UTC (permalink / raw)
To: netdev, ij, ncardwell, koen.de_schepper, g.white,
ingemar.s.johansson, mirja.kuehlewind, cheshire, rs.ietf,
Jason_Livingood, vidhi_goel
Cc: Chia-Yu Chang
From: Ilpo Järvinen <ij@kernel.org>
Handling of the CWR flag differs between RFC 3168 ECN and AccECN.
With RFC 3168 ECN aware TSO (NETIF_F_TSO_ECN), the CWR flag is
cleared starting from the 2nd segment, which is incompatible with
how AccECN handles the CWR flag. Such super-segments are indicated
by SKB_GSO_TCP_ECN. With AccECN, the CWR flag (or more accurately,
the ACE field that also includes the ECE & AE flags) changes only
when new packet(s) with a CE mark arrive, so the flag should not be
changed within a super-skb.
The new skb/feature flags are necessary to prevent such TSO engines
from corrupting the AccECN ACE counters by clearing the CWR flag
(if the CWR handling feature cannot be turned off).
If a NIC is completely unaware of RFC3168 ECN (doesn't support
NETIF_F_TSO_ECN), or its TSO engine can be set to leave the CWR
flag alone despite also supporting NETIF_F_TSO_ECN, TSO could
safely be used with AccECN on such a NIC. This should be evaluated
on a per-NIC basis (not done in this patch series for any NICs).
For the cases where TSO cannot keep its hands off the CWR flag,
this patch provides a GSO fallback.
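A minimal sketch of that fallback idea (userspace illustration, not
the tcp_gso_segment() code; the struct and function names are made
up for the example): for an RFC3168-style super-skb
(SKB_GSO_TCP_ECN) CWR is cleared on every segment after the first,
while for an AccECN super-skb (SKB_GSO_TCP_ACCECN) CWR is left
untouched so the ACE field survives segmentation.

  #include <stdbool.h>
  #include <stddef.h>

  struct seg { bool cwr; };

  /* accecn == true models SKB_GSO_TCP_ACCECN, false SKB_GSO_TCP_ECN */
  static void gso_fixup_cwr(struct seg *segs, size_t n, bool accecn)
  {
          /* Branch-free form as in the patch: a mask of 1 keeps CWR,
           * a mask of 0 clears it on all segments after the first.
           */
          bool keep = accecn;
          size_t i;

          for (i = 1; i < n; i++)
                  segs[i].cwr &= keep;
  }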
Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
---
include/linux/netdev_features.h | 5 ++++-
include/linux/netdevice.h | 1 +
include/linux/skbuff.h | 2 ++
net/ethtool/common.c | 1 +
net/ipv4/tcp_offload.c | 6 +++++-
5 files changed, 13 insertions(+), 2 deletions(-)
diff --git a/include/linux/netdev_features.h b/include/linux/netdev_features.h
index 66e7d26b70a4..2419045e0ffd 100644
--- a/include/linux/netdev_features.h
+++ b/include/linux/netdev_features.h
@@ -53,6 +53,7 @@ enum {
NETIF_F_GSO_UDP_BIT, /* ... UFO, deprecated except tuntap */
NETIF_F_GSO_UDP_L4_BIT, /* ... UDP payload GSO (not UFO) */
NETIF_F_GSO_FRAGLIST_BIT, /* ... Fraglist GSO */
+ NETIF_F_GSO_ACCECN_BIT, /* TCP AccECN with TSO (no CWR clearing) */
/**/NETIF_F_GSO_LAST = /* last bit, see GSO_MASK */
NETIF_F_GSO_FRAGLIST_BIT,
@@ -128,6 +129,7 @@ enum {
#define NETIF_F_SG __NETIF_F(SG)
#define NETIF_F_TSO6 __NETIF_F(TSO6)
#define NETIF_F_TSO_ECN __NETIF_F(TSO_ECN)
+#define NETIF_F_GSO_ACCECN __NETIF_F(GSO_ACCECN)
#define NETIF_F_TSO __NETIF_F(TSO)
#define NETIF_F_VLAN_CHALLENGED __NETIF_F(VLAN_CHALLENGED)
#define NETIF_F_RXFCS __NETIF_F(RXFCS)
@@ -210,7 +212,8 @@ static inline int find_next_netdev_feature(u64 feature, unsigned long start)
NETIF_F_TSO_ECN | NETIF_F_TSO_MANGLEID)
/* List of features with software fallbacks. */
-#define NETIF_F_GSO_SOFTWARE (NETIF_F_ALL_TSO | NETIF_F_GSO_SCTP | \
+#define NETIF_F_GSO_SOFTWARE (NETIF_F_ALL_TSO | \
+ NETIF_F_GSO_ACCECN | NETIF_F_GSO_SCTP | \
NETIF_F_GSO_UDP_L4 | NETIF_F_GSO_FRAGLIST)
/*
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index bdd7d6262112..92fb65090ee7 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -5067,6 +5067,7 @@ static inline bool net_gso_ok(netdev_features_t features, int gso_type)
BUILD_BUG_ON(SKB_GSO_UDP != (NETIF_F_GSO_UDP >> NETIF_F_GSO_SHIFT));
BUILD_BUG_ON(SKB_GSO_UDP_L4 != (NETIF_F_GSO_UDP_L4 >> NETIF_F_GSO_SHIFT));
BUILD_BUG_ON(SKB_GSO_FRAGLIST != (NETIF_F_GSO_FRAGLIST >> NETIF_F_GSO_SHIFT));
+ BUILD_BUG_ON(SKB_GSO_TCP_ACCECN != (NETIF_F_GSO_ACCECN >> NETIF_F_GSO_SHIFT));
return (features & feature) == feature;
}
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 48f1e0fa2a13..530cb325fb86 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -694,6 +694,8 @@ enum {
SKB_GSO_UDP_L4 = 1 << 17,
SKB_GSO_FRAGLIST = 1 << 18,
+
+ SKB_GSO_TCP_ACCECN = 1 << 19,
};
#if BITS_PER_LONG > 32
diff --git a/net/ethtool/common.c b/net/ethtool/common.c
index dd345efa114b..75625098df07 100644
--- a/net/ethtool/common.c
+++ b/net/ethtool/common.c
@@ -32,6 +32,7 @@ const char netdev_features_strings[NETDEV_FEATURE_COUNT][ETH_GSTRING_LEN] = {
[NETIF_F_TSO_BIT] = "tx-tcp-segmentation",
[NETIF_F_GSO_ROBUST_BIT] = "tx-gso-robust",
[NETIF_F_TSO_ECN_BIT] = "tx-tcp-ecn-segmentation",
+ [NETIF_F_GSO_ACCECN_BIT] = "tx-tcp-accecn-segmentation",
[NETIF_F_TSO_MANGLEID_BIT] = "tx-tcp-mangleid-segmentation",
[NETIF_F_TSO6_BIT] = "tx-tcp6-segmentation",
[NETIF_F_FSO_BIT] = "tx-fcoe-segmentation",
diff --git a/net/ipv4/tcp_offload.c b/net/ipv4/tcp_offload.c
index 2308665b51c5..0b05f30e9e5f 100644
--- a/net/ipv4/tcp_offload.c
+++ b/net/ipv4/tcp_offload.c
@@ -139,6 +139,7 @@ struct sk_buff *tcp_gso_segment(struct sk_buff *skb,
struct sk_buff *gso_skb = skb;
__sum16 newcheck;
bool ooo_okay, copy_destructor;
+ bool ecn_cwr_mask;
__wsum delta;
th = tcp_hdr(skb);
@@ -198,6 +199,8 @@ struct sk_buff *tcp_gso_segment(struct sk_buff *skb,
newcheck = ~csum_fold(csum_add(csum_unfold(th->check), delta));
+ ecn_cwr_mask = !!(skb_shinfo(gso_skb)->gso_type & SKB_GSO_TCP_ACCECN);
+
while (skb->next) {
th->fin = th->psh = 0;
th->check = newcheck;
@@ -217,7 +220,8 @@ struct sk_buff *tcp_gso_segment(struct sk_buff *skb,
th = tcp_hdr(skb);
th->seq = htonl(seq);
- th->cwr = 0;
+
+ th->cwr &= ecn_cwr_mask;
}
/* Following permits TCP Small Queues to work well with GSO :
--
2.34.1
^ permalink raw reply related [flat|nested] 56+ messages in thread
* [PATCH net-next 10/44] gro: prevent ACE field corruption & better AccECN handling
2024-10-15 10:28 [PATCH net-next 00/44] DualPI2, Accurate ECN, TCP Prague patch series chia-yu.chang
` (8 preceding siblings ...)
2024-10-15 10:29 ` [PATCH net-next 09/44] gso: AccECN support chia-yu.chang
@ 2024-10-15 10:29 ` chia-yu.chang
2024-10-15 10:29 ` [PATCH net-next 11/44] tcp: AccECN support to tcp_add_backlog chia-yu.chang
` (34 subsequent siblings)
44 siblings, 0 replies; 56+ messages in thread
From: chia-yu.chang @ 2024-10-15 10:29 UTC (permalink / raw)
To: netdev, ij, ncardwell, koen.de_schepper, g.white,
ingemar.s.johansson, mirja.kuehlewind, cheshire, rs.ietf,
Jason_Livingood, vidhi_goel
Cc: Chia-Yu Chang
From: Ilpo Järvinen <ij@kernel.org>
There are important differences in how the CWR field behaves in
RFC3168 and AccECN. With AccECN, the CWR flag is part of the ACE
counter and its changes are significant, so adjust the
flags-changed mask accordingly.
Also, if CWR is set, mark the skb with the Accurate ECN GSO flag so
that a later re-segmentation does not clear the CWR flag.
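A minimal sketch of the resulting flush test (illustration only;
the TCP_FLAG_*_ constants are local stand-ins for the kernel's
flag-word values): a segment cannot be merged if CWR is set on it,
or if any flag other than FIN/PSH differs from the segment already
held, which now covers CWR/ECE/AE and thus any ACE change.

  #include <stdint.h>

  #define TCP_FLAG_CWR_ 0x00800000u
  #define TCP_FLAG_PSH_ 0x00080000u
  #define TCP_FLAG_FIN_ 0x00010000u

  /* Returns non-zero when the new segment must be flushed, i.e. not
   * coalesced with the previously held segment.
   */
  static uint32_t gro_tcp_flush(uint32_t flags_new, uint32_t flags_held)
  {
          uint32_t flush;

          flush  = flags_new & TCP_FLAG_CWR_;          /* CE-mark boundary */
          flush |= (flags_new ^ flags_held) &
                   ~(TCP_FLAG_FIN_ | TCP_FLAG_PSH_);   /* CWR/ECE/AE included */
          return flush;
  }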
Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
---
net/ipv4/tcp_offload.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/net/ipv4/tcp_offload.c b/net/ipv4/tcp_offload.c
index 0b05f30e9e5f..f59762d88c38 100644
--- a/net/ipv4/tcp_offload.c
+++ b/net/ipv4/tcp_offload.c
@@ -329,7 +329,7 @@ struct sk_buff *tcp_gro_receive(struct list_head *head, struct sk_buff *skb,
th2 = tcp_hdr(p);
flush = (__force int)(flags & TCP_FLAG_CWR);
flush |= (__force int)((flags ^ tcp_flag_word(th2)) &
- ~(TCP_FLAG_CWR | TCP_FLAG_FIN | TCP_FLAG_PSH));
+ ~(TCP_FLAG_FIN | TCP_FLAG_PSH));
flush |= (__force int)(th->ack_seq ^ th2->ack_seq);
for (i = sizeof(*th); i < thlen; i += 4)
flush |= *(u32 *)((u8 *)th + i) ^
@@ -405,7 +405,7 @@ void tcp_gro_complete(struct sk_buff *skb)
shinfo->gso_segs = NAPI_GRO_CB(skb)->count;
if (th->cwr)
- shinfo->gso_type |= SKB_GSO_TCP_ECN;
+ shinfo->gso_type |= SKB_GSO_TCP_ACCECN;
}
EXPORT_SYMBOL(tcp_gro_complete);
--
2.34.1
^ permalink raw reply related [flat|nested] 56+ messages in thread
* [PATCH net-next 11/44] tcp: AccECN support to tcp_add_backlog
2024-10-15 10:28 [PATCH net-next 00/44] DualPI2, Accurate ECN, TCP Prague patch series chia-yu.chang
` (9 preceding siblings ...)
2024-10-15 10:29 ` [PATCH net-next 10/44] gro: prevent ACE field corruption & better AccECN handling chia-yu.chang
@ 2024-10-15 10:29 ` chia-yu.chang
2024-10-15 10:29 ` [PATCH net-next 12/44] tcp: allow ECN bits in TOS/traffic class chia-yu.chang
` (33 subsequent siblings)
44 siblings, 0 replies; 56+ messages in thread
From: chia-yu.chang @ 2024-10-15 10:29 UTC (permalink / raw)
To: netdev, ij, ncardwell, koen.de_schepper, g.white,
ingemar.s.johansson, mirja.kuehlewind, cheshire, rs.ietf,
Jason_Livingood, vidhi_goel
Cc: Chia-Yu Chang
From: Ilpo Järvinen <ij@kernel.org>
The AE flag needs to be preserved for AccECN.
Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
---
net/ipv4/tcp_ipv4.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 9fe314a59240..d5aa248125f5 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2054,7 +2054,7 @@ bool tcp_add_backlog(struct sock *sk, struct sk_buff *skb,
!((TCP_SKB_CB(tail)->tcp_flags &
TCP_SKB_CB(skb)->tcp_flags) & TCPHDR_ACK) ||
((TCP_SKB_CB(tail)->tcp_flags ^
- TCP_SKB_CB(skb)->tcp_flags) & (TCPHDR_ECE | TCPHDR_CWR)) ||
+ TCP_SKB_CB(skb)->tcp_flags) & (TCPHDR_ECE | TCPHDR_CWR | TCPHDR_AE)) ||
!tcp_skb_can_collapse_rx(tail, skb) ||
thtail->doff != th->doff ||
memcmp(thtail + 1, th + 1, hdrlen - sizeof(*th)))
--
2.34.1
^ permalink raw reply related [flat|nested] 56+ messages in thread
* [PATCH net-next 12/44] tcp: allow ECN bits in TOS/traffic class
2024-10-15 10:28 [PATCH net-next 00/44] DualPI2, Accurate ECN, TCP Prague patch series chia-yu.chang
` (10 preceding siblings ...)
2024-10-15 10:29 ` [PATCH net-next 11/44] tcp: AccECN support to tcp_add_backlog chia-yu.chang
@ 2024-10-15 10:29 ` chia-yu.chang
2024-10-15 10:29 ` [PATCH net-next 13/44] tcp: Pass flags to __tcp_send_ack chia-yu.chang
` (32 subsequent siblings)
44 siblings, 0 replies; 56+ messages in thread
From: chia-yu.chang @ 2024-10-15 10:29 UTC (permalink / raw)
To: netdev, ij, ncardwell, koen.de_schepper, g.white,
ingemar.s.johansson, mirja.kuehlewind, cheshire, rs.ietf,
Jason_Livingood, vidhi_goel
Cc: Chia-Yu Chang
From: Ilpo Järvinen <ij@kernel.org>
An AccECN connection's last ACK cannot currently retain ECT(1), as
the ECN bits are always cleared, causing the packet to be switched
into another service queue.
This patch effectively adds finer-grained filtering of the ECN bits
so that acceptable TW ACKs can retain them.
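A minimal sketch of the intended behaviour (illustration only;
INET_ECN_MASK_ and the enum are local stand-ins): the ECN bits of
the stored TOS/traffic class are cleared only for out-of-window TW
ACKs, while an acceptable TW ACK keeps the ECN bits, e.g. ECT(1) on
an L4S connection.

  #include <stdint.h>

  #define INET_ECN_MASK_ 0x3

  enum tw_status { TW_ACK, TW_ACK_OOW };

  static uint8_t tw_ack_tos(uint8_t tw_tos, enum tw_status status)
  {
          /* only out-of-window ACKs lose the ECN bits */
          if (status == TW_ACK_OOW)
                  tw_tos &= ~INET_ECN_MASK_;
          return tw_tos;
  }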
Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
---
include/net/tcp.h | 3 ++-
net/ipv4/ip_output.c | 3 +--
net/ipv4/tcp_ipv4.c | 23 +++++++++++++++++------
net/ipv4/tcp_minisocks.c | 2 +-
net/ipv6/tcp_ipv6.c | 23 ++++++++++++++++-------
5 files changed, 37 insertions(+), 17 deletions(-)
diff --git a/include/net/tcp.h b/include/net/tcp.h
index ae3f900f17c1..fe8ecaa4f71c 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -417,7 +417,8 @@ enum tcp_tw_status {
TCP_TW_SUCCESS = 0,
TCP_TW_RST = 1,
TCP_TW_ACK = 2,
- TCP_TW_SYN = 3
+ TCP_TW_SYN = 3,
+ TCP_TW_ACK_OOW = 4
};
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 0065b1996c94..2fe7b1df3b90 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -75,7 +75,6 @@
#include <net/checksum.h>
#include <net/gso.h>
#include <net/inetpeer.h>
-#include <net/inet_ecn.h>
#include <net/lwtunnel.h>
#include <net/inet_dscp.h>
#include <linux/bpf-cgroup.h>
@@ -1643,7 +1642,7 @@ void ip_send_unicast_reply(struct sock *sk, const struct sock *orig_sk,
if (IS_ERR(rt))
return;
- inet_sk(sk)->tos = arg->tos & ~INET_ECN_MASK;
+ inet_sk(sk)->tos = arg->tos;
sk->sk_protocol = ip_hdr(skb)->protocol;
sk->sk_bound_dev_if = arg->bound_dev_if;
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index d5aa248125f5..9419e7b492fc 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -66,6 +66,7 @@
#include <net/transp_v6.h>
#include <net/ipv6.h>
#include <net/inet_common.h>
+#include <net/inet_ecn.h>
#include <net/timewait_sock.h>
#include <net/xfrm.h>
#include <net/secure_seq.h>
@@ -887,7 +888,7 @@ static void tcp_v4_send_reset(const struct sock *sk, struct sk_buff *skb,
BUILD_BUG_ON(offsetof(struct sock, sk_bound_dev_if) !=
offsetof(struct inet_timewait_sock, tw_bound_dev_if));
- arg.tos = ip_hdr(skb)->tos;
+ arg.tos = ip_hdr(skb)->tos & ~INET_ECN_MASK;
arg.uid = sock_net_uid(net, sk && sk_fullsock(sk) ? sk : NULL);
local_bh_disable();
local_lock_nested_bh(&ipv4_tcp_sk.bh_lock);
@@ -1033,11 +1034,17 @@ static void tcp_v4_send_ack(const struct sock *sk,
local_bh_enable();
}
-static void tcp_v4_timewait_ack(struct sock *sk, struct sk_buff *skb)
+static void tcp_v4_timewait_ack(struct sock *sk, struct sk_buff *skb,
+ enum tcp_tw_status tw_status)
{
struct inet_timewait_sock *tw = inet_twsk(sk);
struct tcp_timewait_sock *tcptw = tcp_twsk(sk);
struct tcp_key key = {};
+ u8 tos = tw->tw_tos;
+
+ if (tw_status == TCP_TW_ACK_OOW)
+ tos &= ~INET_ECN_MASK;
+
#ifdef CONFIG_TCP_AO
struct tcp_ao_info *ao_info;
@@ -1080,7 +1087,7 @@ static void tcp_v4_timewait_ack(struct sock *sk, struct sk_buff *skb)
READ_ONCE(tcptw->tw_ts_recent),
tw->tw_bound_dev_if, &key,
tw->tw_transparent ? IP_REPLY_ARG_NOSRCCHECK : 0,
- tw->tw_tos,
+ tos,
tw->tw_txhash);
inet_twsk_put(tw);
@@ -1157,7 +1164,7 @@ static void tcp_v4_reqsk_send_ack(const struct sock *sk, struct sk_buff *skb,
READ_ONCE(req->ts_recent),
0, &key,
inet_rsk(req)->no_srccheck ? IP_REPLY_ARG_NOSRCCHECK : 0,
- ip_hdr(skb)->tos,
+ ip_hdr(skb)->tos & ~INET_ECN_MASK,
READ_ONCE(tcp_rsk(req)->txhash));
if (tcp_key_is_ao(&key))
kfree(key.traffic_key);
@@ -2177,6 +2184,7 @@ static void tcp_v4_fill_cb(struct sk_buff *skb, const struct iphdr *iph,
int tcp_v4_rcv(struct sk_buff *skb)
{
struct net *net = dev_net(skb->dev);
+ enum tcp_tw_status tw_status;
enum skb_drop_reason drop_reason;
int sdif = inet_sdif(skb);
int dif = inet_iif(skb);
@@ -2404,7 +2412,9 @@ int tcp_v4_rcv(struct sk_buff *skb)
inet_twsk_put(inet_twsk(sk));
goto csum_error;
}
- switch (tcp_timewait_state_process(inet_twsk(sk), skb, th, &isn)) {
+
+ tw_status = tcp_timewait_state_process(inet_twsk(sk), skb, th, &isn);
+ switch (tw_status) {
case TCP_TW_SYN: {
struct sock *sk2 = inet_lookup_listener(net,
net->ipv4.tcp_death_row.hashinfo,
@@ -2425,7 +2435,8 @@ int tcp_v4_rcv(struct sk_buff *skb)
/* to ACK */
fallthrough;
case TCP_TW_ACK:
- tcp_v4_timewait_ack(sk, skb);
+ case TCP_TW_ACK_OOW:
+ tcp_v4_timewait_ack(sk, skb, tw_status);
break;
case TCP_TW_RST:
tcp_v4_send_reset(sk, skb, SK_RST_REASON_TCP_TIMEWAIT_SOCKET);
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index bd6515ab660f..8fb9f550fdeb 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -44,7 +44,7 @@ tcp_timewait_check_oow_rate_limit(struct inet_timewait_sock *tw,
/* Send ACK. Note, we do not put the bucket,
* it will be released by caller.
*/
- return TCP_TW_ACK;
+ return TCP_TW_ACK_OOW;
}
/* We are rate-limiting, so just release the tw sock and drop skb. */
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 252d3dac3a09..d9551c9cd562 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -997,7 +997,7 @@ static void tcp_v6_send_response(const struct sock *sk, struct sk_buff *skb, u32
if (!IS_ERR(dst)) {
skb_dst_set(buff, dst);
ip6_xmit(ctl_sk, buff, &fl6, fl6.flowi6_mark, NULL,
- tclass & ~INET_ECN_MASK, priority);
+ tclass, priority);
TCP_INC_STATS(net, TCP_MIB_OUTSEGS);
if (rst)
TCP_INC_STATS(net, TCP_MIB_OUTRSTS);
@@ -1133,7 +1133,8 @@ static void tcp_v6_send_reset(const struct sock *sk, struct sk_buff *skb,
trace_tcp_send_reset(sk, skb, reason);
tcp_v6_send_response(sk, skb, seq, ack_seq, 0, 0, 0, oif, 1,
- ipv6_get_dsfield(ipv6h), label, priority, txhash,
+ ipv6_get_dsfield(ipv6h) & ~INET_ECN_MASK,
+ label, priority, txhash,
&key);
#if defined(CONFIG_TCP_MD5SIG) || defined(CONFIG_TCP_AO)
@@ -1153,11 +1154,16 @@ static void tcp_v6_send_ack(const struct sock *sk, struct sk_buff *skb, u32 seq,
tclass, label, priority, txhash, key);
}
-static void tcp_v6_timewait_ack(struct sock *sk, struct sk_buff *skb)
+static void tcp_v6_timewait_ack(struct sock *sk, struct sk_buff *skb,
+ enum tcp_tw_status tw_status)
{
struct inet_timewait_sock *tw = inet_twsk(sk);
struct tcp_timewait_sock *tcptw = tcp_twsk(sk);
struct tcp_key key = {};
+ u8 tclass = tw->tw_tclass;
+
+ if (tw_status == TCP_TW_ACK_OOW)
+ tclass &= ~INET_ECN_MASK;
#ifdef CONFIG_TCP_AO
struct tcp_ao_info *ao_info;
@@ -1201,7 +1207,7 @@ static void tcp_v6_timewait_ack(struct sock *sk, struct sk_buff *skb)
tcptw->tw_rcv_wnd >> tw->tw_rcv_wscale,
tcp_tw_tsval(tcptw),
READ_ONCE(tcptw->tw_ts_recent), tw->tw_bound_dev_if,
- &key, tw->tw_tclass, cpu_to_be32(tw->tw_flowlabel),
+ &key, tclass, cpu_to_be32(tw->tw_flowlabel),
tw->tw_priority, tw->tw_txhash);
#ifdef CONFIG_TCP_AO
@@ -1278,7 +1284,7 @@ static void tcp_v6_reqsk_send_ack(const struct sock *sk, struct sk_buff *skb,
tcp_synack_window(req) >> inet_rsk(req)->rcv_wscale,
tcp_rsk_tsval(tcp_rsk(req)),
READ_ONCE(req->ts_recent), sk->sk_bound_dev_if,
- &key, ipv6_get_dsfield(ipv6_hdr(skb)), 0,
+ &key, ipv6_get_dsfield(ipv6_hdr(skb)) & ~INET_ECN_MASK, 0,
READ_ONCE(sk->sk_priority),
READ_ONCE(tcp_rsk(req)->txhash));
if (tcp_key_is_ao(&key))
@@ -1747,6 +1753,7 @@ static void tcp_v6_fill_cb(struct sk_buff *skb, const struct ipv6hdr *hdr,
INDIRECT_CALLABLE_SCOPE int tcp_v6_rcv(struct sk_buff *skb)
{
+ enum tcp_tw_status tw_status;
enum skb_drop_reason drop_reason;
int sdif = inet6_sdif(skb);
int dif = inet6_iif(skb);
@@ -1968,7 +1975,8 @@ INDIRECT_CALLABLE_SCOPE int tcp_v6_rcv(struct sk_buff *skb)
goto csum_error;
}
- switch (tcp_timewait_state_process(inet_twsk(sk), skb, th, &isn)) {
+ tw_status = tcp_timewait_state_process(inet_twsk(sk), skb, th, &isn);
+ switch (tw_status) {
case TCP_TW_SYN:
{
struct sock *sk2;
@@ -1993,7 +2001,8 @@ INDIRECT_CALLABLE_SCOPE int tcp_v6_rcv(struct sk_buff *skb)
/* to ACK */
fallthrough;
case TCP_TW_ACK:
- tcp_v6_timewait_ack(sk, skb);
+ case TCP_TW_ACK_OOW:
+ tcp_v6_timewait_ack(sk, skb, tw_status);
break;
case TCP_TW_RST:
tcp_v6_send_reset(sk, skb, SK_RST_REASON_TCP_TIMEWAIT_SOCKET);
--
2.34.1
^ permalink raw reply related [flat|nested] 56+ messages in thread
* [PATCH net-next 13/44] tcp: Pass flags to __tcp_send_ack
2024-10-15 10:28 [PATCH net-next 00/44] DualPI2, Accurate ECN, TCP Prague patch series chia-yu.chang
` (11 preceding siblings ...)
2024-10-15 10:29 ` [PATCH net-next 12/44] tcp: allow ECN bits in TOS/traffic class chia-yu.chang
@ 2024-10-15 10:29 ` chia-yu.chang
2024-10-15 10:29 ` [PATCH net-next 14/44] tcp: fast path functions later chia-yu.chang
` (31 subsequent siblings)
44 siblings, 0 replies; 56+ messages in thread
From: chia-yu.chang @ 2024-10-15 10:29 UTC (permalink / raw)
To: netdev, ij, ncardwell, koen.de_schepper, g.white,
ingemar.s.johansson, mirja.kuehlewind, cheshire, rs.ietf,
Jason_Livingood, vidhi_goel
Cc: Chia-Yu Chang
From: Ilpo Järvinen <ij@kernel.org>
Accurate ECN needs to send custom flags to handle IP-ECN field
reflection during the handshake.
Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
---
include/net/tcp.h | 2 +-
net/ipv4/bpf_tcp_ca.c | 2 +-
net/ipv4/tcp_dctcp.h | 2 +-
net/ipv4/tcp_output.c | 6 +++---
4 files changed, 6 insertions(+), 6 deletions(-)
diff --git a/include/net/tcp.h b/include/net/tcp.h
index fe8ecaa4f71c..4d4fce389b20 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -704,7 +704,7 @@ void tcp_send_active_reset(struct sock *sk, gfp_t priority,
enum sk_rst_reason reason);
int tcp_send_synack(struct sock *);
void tcp_push_one(struct sock *, unsigned int mss_now);
-void __tcp_send_ack(struct sock *sk, u32 rcv_nxt);
+void __tcp_send_ack(struct sock *sk, u32 rcv_nxt, u16 flags);
void tcp_send_ack(struct sock *sk);
void tcp_send_delayed_ack(struct sock *sk);
void tcp_send_loss_probe(struct sock *sk);
diff --git a/net/ipv4/bpf_tcp_ca.c b/net/ipv4/bpf_tcp_ca.c
index 554804774628..e01492234b0b 100644
--- a/net/ipv4/bpf_tcp_ca.c
+++ b/net/ipv4/bpf_tcp_ca.c
@@ -121,7 +121,7 @@ static int bpf_tcp_ca_btf_struct_access(struct bpf_verifier_log *log,
BPF_CALL_2(bpf_tcp_send_ack, struct tcp_sock *, tp, u32, rcv_nxt)
{
/* bpf_tcp_ca prog cannot have NULL tp */
- __tcp_send_ack((struct sock *)tp, rcv_nxt);
+ __tcp_send_ack((struct sock *)tp, rcv_nxt, 0);
return 0;
}
diff --git a/net/ipv4/tcp_dctcp.h b/net/ipv4/tcp_dctcp.h
index d69a77cbd0c7..4b0259111d81 100644
--- a/net/ipv4/tcp_dctcp.h
+++ b/net/ipv4/tcp_dctcp.h
@@ -28,7 +28,7 @@ static inline void dctcp_ece_ack_update(struct sock *sk, enum tcp_ca_event evt,
*/
if (inet_csk(sk)->icsk_ack.pending & ICSK_ACK_TIMER) {
dctcp_ece_ack_cwr(sk, *ce_state);
- __tcp_send_ack(sk, *prior_rcv_nxt);
+ __tcp_send_ack(sk, *prior_rcv_nxt, 0);
}
inet_csk(sk)->icsk_ack.pending |= ICSK_ACK_NOW;
}
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index bb83ad43a4e2..556c2da2bc77 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -4232,7 +4232,7 @@ void tcp_send_delayed_ack(struct sock *sk)
}
/* This routine sends an ack and also updates the window. */
-void __tcp_send_ack(struct sock *sk, u32 rcv_nxt)
+void __tcp_send_ack(struct sock *sk, u32 rcv_nxt, u16 flags)
{
struct sk_buff *buff;
@@ -4261,7 +4261,7 @@ void __tcp_send_ack(struct sock *sk, u32 rcv_nxt)
/* Reserve space for headers and prepare control bits. */
skb_reserve(buff, MAX_TCP_HEADER);
- tcp_init_nondata_skb(buff, tcp_acceptable_seq(sk), TCPHDR_ACK);
+ tcp_init_nondata_skb(buff, tcp_acceptable_seq(sk), TCPHDR_ACK | flags);
/* We do not want pure acks influencing TCP Small Queues or fq/pacing
* too much.
@@ -4276,7 +4276,7 @@ EXPORT_SYMBOL_GPL(__tcp_send_ack);
void tcp_send_ack(struct sock *sk)
{
- __tcp_send_ack(sk, tcp_sk(sk)->rcv_nxt);
+ __tcp_send_ack(sk, tcp_sk(sk)->rcv_nxt, 0);
}
/* This routine sends a packet with an out of date sequence
--
2.34.1
^ permalink raw reply related [flat|nested] 56+ messages in thread
* [PATCH net-next 14/44] tcp: fast path functions later
2024-10-15 10:28 [PATCH net-next 00/44] DualPI2, Accurate ECN, TCP Prague patch series chia-yu.chang
` (12 preceding siblings ...)
2024-10-15 10:29 ` [PATCH net-next 13/44] tcp: Pass flags to __tcp_send_ack chia-yu.chang
@ 2024-10-15 10:29 ` chia-yu.chang
2024-10-15 10:29 ` [PATCH net-next 15/44] tcp: AccECN core chia-yu.chang
` (30 subsequent siblings)
44 siblings, 0 replies; 56+ messages in thread
From: chia-yu.chang @ 2024-10-15 10:29 UTC (permalink / raw)
To: netdev, ij, ncardwell, koen.de_schepper, g.white,
ingemar.s.johansson, mirja.kuehlewind, cheshire, rs.ietf,
Jason_Livingood, vidhi_goel
Cc: Chia-Yu Chang
From: Ilpo Järvinen <ij@kernel.org>
The following patch will use tcp_ecn_mode_accecn(),
TCP_ACCECN_CEP_INIT_OFFSET, and TCP_ACCECN_CEP_ACE_MASK in
__tcp_fast_path_on() to build the new prediction flags for AccECN,
so move the fast path helpers later in the header where those
definitions will be available.
No functional changes.
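For orientation, a rough sketch of what that prediction word looks
like once the following AccECN core patch lands (illustration only;
the field placement is inferred from the flag-word layout, and the
function is hypothetical): data offset in the top bits, the
expected ACE value in the AE/CWR/ECE bit positions, the ACK bit,
and the expected receive window.

  #include <stdint.h>

  static uint32_t pred_flags_accecn(uint32_t tcp_header_len,
                                    uint8_t expected_ace,
                                    uint32_t snd_wnd_scaled)
  {
          return (tcp_header_len << 26) |          /* doff in bits 31-28 */
                 ((uint32_t)expected_ace << 22) |  /* ACE in bits 24-22 */
                 (1u << 20) |                      /* ACK flag */
                 (snd_wnd_scaled & 0xffffu);       /* expected window */
  }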
Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
---
include/net/tcp.h | 54 +++++++++++++++++++++++------------------------
1 file changed, 27 insertions(+), 27 deletions(-)
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 4d4fce389b20..7ceff62969e0 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -788,33 +788,6 @@ static inline u32 __tcp_set_rto(const struct tcp_sock *tp)
return usecs_to_jiffies((tp->srtt_us >> 3) + tp->rttvar_us);
}
-static inline void __tcp_fast_path_on(struct tcp_sock *tp, u32 snd_wnd)
-{
- /* mptcp hooks are only on the slow path */
- if (sk_is_mptcp((struct sock *)tp))
- return;
-
- tp->pred_flags = htonl((tp->tcp_header_len << 26) |
- ntohl(TCP_FLAG_ACK) |
- snd_wnd);
-}
-
-static inline void tcp_fast_path_on(struct tcp_sock *tp)
-{
- __tcp_fast_path_on(tp, tp->snd_wnd >> tp->rx_opt.snd_wscale);
-}
-
-static inline void tcp_fast_path_check(struct sock *sk)
-{
- struct tcp_sock *tp = tcp_sk(sk);
-
- if (RB_EMPTY_ROOT(&tp->out_of_order_queue) &&
- tp->rcv_wnd &&
- atomic_read(&sk->sk_rmem_alloc) < sk->sk_rcvbuf &&
- !tp->urg_data)
- tcp_fast_path_on(tp);
-}
-
u32 tcp_delack_max(const struct sock *sk);
/* Compute the actual rto_min value */
@@ -1768,6 +1741,33 @@ static inline bool tcp_paws_reject(const struct tcp_options_received *rx_opt,
return true;
}
+static inline void __tcp_fast_path_on(struct tcp_sock *tp, u32 snd_wnd)
+{
+ /* mptcp hooks are only on the slow path */
+ if (sk_is_mptcp((struct sock *)tp))
+ return;
+
+ tp->pred_flags = htonl((tp->tcp_header_len << 26) |
+ ntohl(TCP_FLAG_ACK) |
+ snd_wnd);
+}
+
+static inline void tcp_fast_path_on(struct tcp_sock *tp)
+{
+ __tcp_fast_path_on(tp, tp->snd_wnd >> tp->rx_opt.snd_wscale);
+}
+
+static inline void tcp_fast_path_check(struct sock *sk)
+{
+ struct tcp_sock *tp = tcp_sk(sk);
+
+ if (RB_EMPTY_ROOT(&tp->out_of_order_queue) &&
+ tp->rcv_wnd &&
+ atomic_read(&sk->sk_rmem_alloc) < sk->sk_rcvbuf &&
+ !tp->urg_data)
+ tcp_fast_path_on(tp);
+}
+
bool tcp_oow_rate_limited(struct net *net, const struct sk_buff *skb,
int mib_idx, u32 *last_oow_ack_time);
--
2.34.1
^ permalink raw reply related [flat|nested] 56+ messages in thread
* [PATCH net-next 15/44] tcp: AccECN core
2024-10-15 10:28 [PATCH net-next 00/44] DualPI2, Accurate ECN, TCP Prague patch series chia-yu.chang
` (13 preceding siblings ...)
2024-10-15 10:29 ` [PATCH net-next 14/44] tcp: fast path functions later chia-yu.chang
@ 2024-10-15 10:29 ` chia-yu.chang
2024-10-15 10:29 ` [PATCH net-next 16/44] net: sysctl: introduce sysctl SYSCTL_FIVE chia-yu.chang
` (29 subsequent siblings)
44 siblings, 0 replies; 56+ messages in thread
From: chia-yu.chang @ 2024-10-15 10:29 UTC (permalink / raw)
To: netdev, ij, ncardwell, koen.de_schepper, g.white,
ingemar.s.johansson, mirja.kuehlewind, cheshire, rs.ietf,
Jason_Livingood, vidhi_goel
Cc: Olivier Tilmans, Chia-Yu Chang
From: Ilpo Järvinen <ij@kernel.org>
This change implements Accurate ECN without negotiation and
AccECN Option (that will be added by later changes). Based on
AccECN specifications:
https://tools.ietf.org/id/draft-ietf-tcpm-accurate-ecn-28.txt
Accurate ECN allows feeding back the number of CE (congestion
experienced) marks accurately to the sender in contrast to
RFC3168 ECN that can only signal one marks-seen-yes/no per RTT.
Congestion control algorithms can take advantage of the accurate
ECN information to fine-tune their congestion response to avoid
drastic rate reduction when only mild congestion is encountered.
With Accurate ECN, tp->received_ce (r.cep in the AccECN spec) keeps
track of how many segments have arrived with a CE mark. Accurate
ECN uses the ACE field (ECE, CWR, AE) to communicate the value back
to the sender, which updates tp->delivered_ce (s.cep) based on the
feedback. This signalling channel is lossy when the ACE field
overflows.
A conservative strategy is selected here to deal with ACE overflow;
however, some strategies using the AccECN option later in the
overall patchset mitigate falsely detected overflows.
The ACE field values on the wire are offset by
TCP_ACCECN_CEP_INIT_OFFSET. Delivered_ce/received_ce count the
real CE marks rather than forcing all downstream users to adapt
to the wire offset.
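A rough userspace sketch of the ACE accounting described above
(illustration only, mirroring the logic rather than the kernel
code; names with a trailing '_' are local stand-ins): the wire
value is de-offset, the CE delta is taken modulo the 3-bit ACE
space, and when more packets were newly delivered than the ACE
field can represent, the delta is expanded to the largest value
congruent to it that still fits the delivered packet count.

  #include <stdint.h>

  #define ACE_MASK_     0x7  /* stand-in for TCP_ACCECN_CEP_ACE_MASK */
  #define ACE_INIT_OFF_ 5    /* stand-in for TCP_ACCECN_CEP_INIT_OFFSET */

  /* ace: 3-bit ACE field from the ACK (AE<<2 | CWR<<1 | ECE)
   * delivered_ce: CE-marked packets accounted so far (s.cep)
   * delivered_pkts: packets newly acked/sacked by this ACK
   * Returns how much to add to delivered_ce.
   */
  static uint32_t accecn_ce_delta(uint8_t ace, uint32_t delivered_ce,
                                  uint32_t delivered_pkts)
  {
          uint32_t corrected = ace - ACE_INIT_OFF_;  /* undo wire offset */
          uint32_t delta = (corrected - delivered_ce) & ACE_MASK_;

          if (delivered_pkts <= ACE_MASK_)
                  return delta;       /* ACE cannot have wrapped */

          /* Possible overflow: assume the field wrapped as many whole
           * times as the newly delivered packets allow (conservative).
           */
          return delivered_pkts - ((delivered_pkts - delta) & ACE_MASK_);
  }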
Co-developed-by: Olivier Tilmans <olivier.tilmans@nokia.com>
Signed-off-by: Olivier Tilmans <olivier.tilmans@nokia.com>
Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
---
include/linux/tcp.h | 3 ++
include/net/tcp.h | 26 ++++++++++
net/ipv4/tcp.c | 4 +-
net/ipv4/tcp_input.c | 113 +++++++++++++++++++++++++++++++++++++-----
net/ipv4/tcp_output.c | 21 +++++++-
5 files changed, 152 insertions(+), 15 deletions(-)
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 6a5e08b937b3..c36e519f3985 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -294,6 +294,9 @@ struct tcp_sock {
u32 snd_up; /* Urgent pointer */
u32 delivered; /* Total data packets delivered incl. rexmits */
u32 delivered_ce; /* Like the above but only ECE marked packets */
+ u32 received_ce; /* Like the above but for received CE marked packets */
+ u8 received_ce_pending:4, /* Not yet transmitted cnt of received_ce */
+ unused2:4;
u32 app_limited; /* limited until "delivered" reaches this val */
u32 rcv_wnd; /* Current receiver window */
/*
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 7ceff62969e0..5ae0d1f9b083 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -413,6 +413,11 @@ static inline void tcp_ecn_mode_set(struct tcp_sock *tp, u8 mode)
tp->ecn_flags |= mode;
}
+static inline u8 tcp_accecn_ace(const struct tcphdr *th)
+{
+ return (th->ae << 2) | (th->cwr << 1) | th->ece;
+}
+
enum tcp_tw_status {
TCP_TW_SUCCESS = 0,
TCP_TW_RST = 1,
@@ -938,6 +943,20 @@ static inline u32 tcp_rsk_tsval(const struct tcp_request_sock *treq)
#define TCPHDR_ACE (TCPHDR_ECE | TCPHDR_CWR | TCPHDR_AE)
#define TCPHDR_SYN_ECN (TCPHDR_SYN | TCPHDR_ECE | TCPHDR_CWR)
+#define TCP_ACCECN_CEP_ACE_MASK 0x7
+#define TCP_ACCECN_ACE_MAX_DELTA 6
+
+/* To avoid/detect middlebox interference, not all counters start at 0.
+ * See draft-ietf-tcpm-accurate-ecn for the latest values.
+ */
+#define TCP_ACCECN_CEP_INIT_OFFSET 5
+
+static inline void tcp_accecn_init_counters(struct tcp_sock *tp)
+{
+ tp->received_ce = 0;
+ tp->received_ce_pending = 0;
+}
+
/* State flags for sacked in struct tcp_skb_cb */
enum tcp_skb_cb_sacked_flags {
TCPCB_SACKED_ACKED = (1 << 0), /* SKB ACK'd by a SACK block */
@@ -1743,11 +1762,18 @@ static inline bool tcp_paws_reject(const struct tcp_options_received *rx_opt,
static inline void __tcp_fast_path_on(struct tcp_sock *tp, u32 snd_wnd)
{
+ u32 ace;
+
/* mptcp hooks are only on the slow path */
if (sk_is_mptcp((struct sock *)tp))
return;
+ ace = tcp_ecn_mode_accecn(tp) ?
+ ((tp->delivered_ce + TCP_ACCECN_CEP_INIT_OFFSET) &
+ TCP_ACCECN_CEP_ACE_MASK) : 0;
+
tp->pred_flags = htonl((tp->tcp_header_len << 26) |
+ (ace << 22) |
ntohl(TCP_FLAG_ACK) |
snd_wnd);
}
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 94546f55385a..499f2a0be036 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -3336,6 +3336,7 @@ int tcp_disconnect(struct sock *sk, int flags)
tp->window_clamp = 0;
tp->delivered = 0;
tp->delivered_ce = 0;
+ tcp_accecn_init_counters(tp);
if (icsk->icsk_ca_ops->release)
icsk->icsk_ca_ops->release(sk);
memset(icsk->icsk_ca_priv, 0, sizeof(icsk->icsk_ca_priv));
@@ -5025,6 +5026,7 @@ static void __init tcp_struct_check(void)
CACHELINE_ASSERT_GROUP_MEMBER(struct tcp_sock, tcp_sock_write_txrx, snd_up);
CACHELINE_ASSERT_GROUP_MEMBER(struct tcp_sock, tcp_sock_write_txrx, delivered);
CACHELINE_ASSERT_GROUP_MEMBER(struct tcp_sock, tcp_sock_write_txrx, delivered_ce);
+ CACHELINE_ASSERT_GROUP_MEMBER(struct tcp_sock, tcp_sock_write_txrx, received_ce);
CACHELINE_ASSERT_GROUP_MEMBER(struct tcp_sock, tcp_sock_write_txrx, app_limited);
CACHELINE_ASSERT_GROUP_MEMBER(struct tcp_sock, tcp_sock_write_txrx, rcv_wnd);
CACHELINE_ASSERT_GROUP_MEMBER(struct tcp_sock, tcp_sock_write_txrx, rx_opt);
@@ -5032,7 +5034,7 @@ static void __init tcp_struct_check(void)
/* 32bit arches with 8byte alignment on u64 fields might need padding
* before tcp_clock_cache.
*/
- CACHELINE_ASSERT_GROUP_SIZE(struct tcp_sock, tcp_sock_write_txrx, 92 + 4);
+ CACHELINE_ASSERT_GROUP_SIZE(struct tcp_sock, tcp_sock_write_txrx, 97 + 7);
/* RX read-write hotpath cache lines */
CACHELINE_ASSERT_GROUP_MEMBER(struct tcp_sock, tcp_sock_write_rx, bytes_received);
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index e8d32a231a9e..fcc6b7a75db8 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -334,14 +334,17 @@ static bool tcp_in_quickack_mode(struct sock *sk)
static void tcp_ecn_queue_cwr(struct tcp_sock *tp)
{
+ /* Do not set CWR if in AccECN mode! */
if (tcp_ecn_mode_rfc3168(tp))
tp->ecn_flags |= TCP_ECN_QUEUE_CWR;
}
static void tcp_ecn_accept_cwr(struct sock *sk, const struct sk_buff *skb)
{
- if (tcp_hdr(skb)->cwr) {
- tcp_sk(sk)->ecn_flags &= ~TCP_ECN_DEMAND_CWR;
+ struct tcp_sock *tp = tcp_sk(sk);
+
+ if (tcp_ecn_mode_rfc3168(tp) && tcp_hdr(skb)->cwr) {
+ tp->ecn_flags &= ~TCP_ECN_DEMAND_CWR;
/* If the sender is telling us it has entered CWR, then its
* cwnd may be very low (even just 1 packet), so we should ACK
@@ -377,17 +380,16 @@ static void tcp_data_ecn_check(struct sock *sk, const struct sk_buff *skb)
if (tcp_ca_needs_ecn(sk))
tcp_ca_event(sk, CA_EVENT_ECN_IS_CE);
- if (!(tp->ecn_flags & TCP_ECN_DEMAND_CWR)) {
+ if (!(tp->ecn_flags & TCP_ECN_DEMAND_CWR) &&
+ tcp_ecn_mode_rfc3168(tp)) {
/* Better not delay acks, sender can have a very low cwnd */
tcp_enter_quickack_mode(sk, 2);
tp->ecn_flags |= TCP_ECN_DEMAND_CWR;
}
- tp->ecn_flags |= TCP_ECN_SEEN;
break;
default:
if (tcp_ca_needs_ecn(sk))
tcp_ca_event(sk, CA_EVENT_ECN_NO_CE);
- tp->ecn_flags |= TCP_ECN_SEEN;
break;
}
}
@@ -421,10 +423,62 @@ static void tcp_count_delivered(struct tcp_sock *tp, u32 delivered,
bool ece_ack)
{
tp->delivered += delivered;
- if (ece_ack)
+ if (tcp_ecn_mode_rfc3168(tp) && ece_ack)
tcp_count_delivered_ce(tp, delivered);
}
+/* Returns the ECN CE delta */
+static u32 __tcp_accecn_process(struct sock *sk, const struct sk_buff *skb,
+ u32 delivered_pkts, int flag)
+{
+ struct tcp_sock *tp = tcp_sk(sk);
+ u32 delta, safe_delta;
+ u32 corrected_ace;
+
+ /* Reordered ACK? (...or uncertain due to lack of data to send and ts) */
+ if (!(flag & (FLAG_FORWARD_PROGRESS | FLAG_TS_PROGRESS)))
+ return 0;
+
+ if (!(flag & FLAG_SLOWPATH)) {
+ /* AccECN counter might overflow on large ACKs */
+ if (delivered_pkts <= TCP_ACCECN_CEP_ACE_MASK)
+ return 0;
+ }
+
+ /* ACE field is not available during handshake */
+ if (flag & FLAG_SYN_ACKED)
+ return 0;
+
+ if (tp->received_ce_pending >= TCP_ACCECN_ACE_MAX_DELTA)
+ inet_csk(sk)->icsk_ack.pending |= ICSK_ACK_NOW;
+
+ corrected_ace = tcp_accecn_ace(tcp_hdr(skb)) - TCP_ACCECN_CEP_INIT_OFFSET;
+ delta = (corrected_ace - tp->delivered_ce) & TCP_ACCECN_CEP_ACE_MASK;
+ if (delivered_pkts <= TCP_ACCECN_CEP_ACE_MASK)
+ return delta;
+
+ safe_delta = delivered_pkts - ((delivered_pkts - delta) & TCP_ACCECN_CEP_ACE_MASK);
+
+ return safe_delta;
+}
+
+static u32 tcp_accecn_process(struct sock *sk, const struct sk_buff *skb,
+ u32 delivered_pkts, int *flag)
+{
+ u32 delta;
+ struct tcp_sock *tp = tcp_sk(sk);
+
+ delta = __tcp_accecn_process(sk, skb, delivered_pkts, *flag);
+ if (delta > 0) {
+ tcp_count_delivered_ce(tp, delta);
+ *flag |= FLAG_ECE;
+ /* Recalculate header predictor */
+ if (tp->pred_flags)
+ tcp_fast_path_on(tp);
+ }
+ return delta;
+}
+
/* Buffer size and advertised window tuning.
*
* 1. Tuning sk->sk_sndbuf, when connection enters established state.
@@ -3912,7 +3966,8 @@ static void tcp_xmit_recovery(struct sock *sk, int rexmit)
}
/* Returns the number of packets newly acked or sacked by the current ACK */
-static u32 tcp_newly_delivered(struct sock *sk, u32 prior_delivered, int flag)
+static u32 tcp_newly_delivered(struct sock *sk, u32 prior_delivered,
+ u32 ecn_count, int flag)
{
const struct net *net = sock_net(sk);
struct tcp_sock *tp = tcp_sk(sk);
@@ -3920,8 +3975,12 @@ static u32 tcp_newly_delivered(struct sock *sk, u32 prior_delivered, int flag)
delivered = tp->delivered - prior_delivered;
NET_ADD_STATS(net, LINUX_MIB_TCPDELIVERED, delivered);
- if (flag & FLAG_ECE)
- NET_ADD_STATS(net, LINUX_MIB_TCPDELIVEREDCE, delivered);
+
+ if (flag & FLAG_ECE) {
+ if (tcp_ecn_mode_rfc3168(tp))
+ ecn_count = delivered;
+ NET_ADD_STATS(net, LINUX_MIB_TCPDELIVEREDCE, ecn_count);
+ }
return delivered;
}
@@ -3942,6 +4001,7 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
u32 delivered = tp->delivered;
u32 lost = tp->lost;
int rexmit = REXMIT_NONE; /* Flag to (re)transmit to recover losses */
+ u32 ecn_count = 0; /* Did we receive ECE/an AccECN ACE update? */
u32 prior_fack;
sack_state.first_sackt = 0;
@@ -4049,6 +4109,9 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
tcp_rack_update_reo_wnd(sk, &rs);
+ if (tcp_ecn_mode_accecn(tp))
+ ecn_count = tcp_accecn_process(sk, skb, tp->delivered - delivered, &flag);
+
tcp_in_ack_event(sk, flag);
if (tp->tlp_high_seq)
@@ -4073,7 +4136,8 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
if ((flag & FLAG_FORWARD_PROGRESS) || !(flag & FLAG_NOT_DUP))
sk_dst_confirm(sk);
- delivered = tcp_newly_delivered(sk, delivered, flag);
+ delivered = tcp_newly_delivered(sk, delivered, ecn_count, flag);
+
lost = tp->lost - lost; /* freshly marked lost */
rs.is_ack_delayed = !!(flag & FLAG_ACK_MAYBE_DELAYED);
tcp_rate_gen(sk, delivered, lost, is_sack_reneg, sack_state.rate);
@@ -4082,12 +4146,14 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
return 1;
no_queue:
+ if (tcp_ecn_mode_accecn(tp))
+ ecn_count = tcp_accecn_process(sk, skb, tp->delivered - delivered, &flag);
tcp_in_ack_event(sk, flag);
/* If data was DSACKed, see if we can undo a cwnd reduction. */
if (flag & FLAG_DSACKING_ACK) {
tcp_fastretrans_alert(sk, prior_snd_una, num_dupack, &flag,
&rexmit);
- tcp_newly_delivered(sk, delivered, flag);
+ tcp_newly_delivered(sk, delivered, ecn_count, flag);
}
/* If this ack opens up a zero window, clear backoff. It was
* being used to time the probes, and is probably far higher than
@@ -4108,7 +4174,7 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
&sack_state);
tcp_fastretrans_alert(sk, prior_snd_una, num_dupack, &flag,
&rexmit);
- tcp_newly_delivered(sk, delivered, flag);
+ tcp_newly_delivered(sk, delivered, ecn_count, flag);
tcp_xmit_recovery(sk, rexmit);
}
@@ -5940,6 +6006,24 @@ static void tcp_urg(struct sock *sk, struct sk_buff *skb, const struct tcphdr *t
}
}
+/* Updates Accurate ECN received counters from the received IP ECN field */
+static void tcp_ecn_received_counters(struct sock *sk, const struct sk_buff *skb)
+{
+ u8 ecnfield = TCP_SKB_CB(skb)->ip_dsfield & INET_ECN_MASK;
+ u8 is_ce = INET_ECN_is_ce(ecnfield);
+ struct tcp_sock *tp = tcp_sk(sk);
+
+ if (!INET_ECN_is_not_ect(ecnfield)) {
+ u32 pcount = is_ce * max_t(u16, 1, skb_shinfo(skb)->gso_segs);
+
+ tp->ecn_flags |= TCP_ECN_SEEN;
+
+ /* ACE counter tracks *all* segments including pure ACKs */
+ tp->received_ce += pcount;
+ tp->received_ce_pending = min(tp->received_ce_pending + pcount, 0xfU);
+ }
+}
+
/* Accept RST for rcv_nxt - 1 after a FIN.
* When tcp connections are abruptly terminated from Mac OSX (via ^C), a
* FIN is sent followed by a RST packet. The RST is sent with the same
@@ -6188,6 +6272,8 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb)
tp->rcv_nxt == tp->rcv_wup)
flag |= __tcp_replace_ts_recent(tp, tstamp_delta);
+ tcp_ecn_received_counters(sk, skb);
+
/* We know that such packets are checksummed
* on entry.
*/
@@ -6231,6 +6317,7 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb)
/* Bulk data transfer: receiver */
skb_dst_drop(skb);
__skb_pull(skb, tcp_header_len);
+ tcp_ecn_received_counters(sk, skb);
eaten = tcp_queue_rcv(sk, skb, &fragstolen);
tcp_event_data_recv(sk, skb);
@@ -6271,6 +6358,8 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb)
return;
step5:
+ tcp_ecn_received_counters(sk, skb);
+
reason = tcp_ack(sk, skb, FLAG_SLOWPATH | FLAG_UPDATE_TS_RECENT);
if ((int)reason < 0) {
reason = -reason;
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 556c2da2bc77..42177f464d0c 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -371,6 +371,17 @@ tcp_ecn_make_synack(const struct request_sock *req, struct tcphdr *th)
th->ece = 1;
}
+static void tcp_accecn_set_ace(struct tcphdr *th, struct tcp_sock *tp)
+{
+ u32 wire_ace;
+
+ wire_ace = tp->received_ce + TCP_ACCECN_CEP_INIT_OFFSET;
+ th->ece = !!(wire_ace & 0x1);
+ th->cwr = !!(wire_ace & 0x2);
+ th->ae = !!(wire_ace & 0x4);
+ tp->received_ce_pending = 0;
+}
+
/* Set up ECN state for a packet on a ESTABLISHED socket that is about to
* be sent.
*/
@@ -379,11 +390,17 @@ static void tcp_ecn_send(struct sock *sk, struct sk_buff *skb,
{
struct tcp_sock *tp = tcp_sk(sk);
- if (tcp_ecn_mode_rfc3168(tp)) {
+ if (!tcp_ecn_mode_any(tp))
+ return;
+
+ INET_ECN_xmit(sk);
+ if (tcp_ecn_mode_accecn(tp)) {
+ tcp_accecn_set_ace(th, tp);
+ skb_shinfo(skb)->gso_type |= SKB_GSO_TCP_ACCECN;
+ } else {
/* Not-retransmitted data segment: set ECT and inject CWR. */
if (skb->len != tcp_header_len &&
!before(TCP_SKB_CB(skb)->seq, tp->snd_nxt)) {
- INET_ECN_xmit(sk);
if (tp->ecn_flags & TCP_ECN_QUEUE_CWR) {
tp->ecn_flags &= ~TCP_ECN_QUEUE_CWR;
th->cwr = 1;
--
2.34.1
^ permalink raw reply related [flat|nested] 56+ messages in thread
* [PATCH net-next 16/44] net: sysctl: introduce sysctl SYSCTL_FIVE
2024-10-15 10:28 [PATCH net-next 00/44] DualPI2, Accurate ECN, TCP Prague patch series chia-yu.chang
` (14 preceding siblings ...)
2024-10-15 10:29 ` [PATCH net-next 15/44] tcp: AccECN core chia-yu.chang
@ 2024-10-15 10:29 ` chia-yu.chang
2024-10-15 10:29 ` [PATCH net-next 17/44] tcp: accecn: AccECN negotiation chia-yu.chang
` (28 subsequent siblings)
44 siblings, 0 replies; 56+ messages in thread
From: chia-yu.chang @ 2024-10-15 10:29 UTC (permalink / raw)
To: netdev, ij, ncardwell, koen.de_schepper, g.white,
ingemar.s.johansson, mirja.kuehlewind, cheshire, rs.ietf,
Jason_Livingood, vidhi_goel
Cc: Chia-Yu Chang
From: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Add SYSCTL_FIVE for new AccECN feedback modes of net.ipv4.tcp_ecn.
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
---
include/linux/sysctl.h | 17 +++++++++--------
kernel/sysctl.c | 2 +-
2 files changed, 10 insertions(+), 9 deletions(-)
diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h
index aa4c6d44aaa0..37c95a70c10e 100644
--- a/include/linux/sysctl.h
+++ b/include/linux/sysctl.h
@@ -37,21 +37,22 @@ struct ctl_table_root;
struct ctl_table_header;
struct ctl_dir;
-/* Keep the same order as in fs/proc/proc_sysctl.c */
+/* Keep the same order as in kernel/sysctl.c */
#define SYSCTL_ZERO ((void *)&sysctl_vals[0])
#define SYSCTL_ONE ((void *)&sysctl_vals[1])
#define SYSCTL_TWO ((void *)&sysctl_vals[2])
#define SYSCTL_THREE ((void *)&sysctl_vals[3])
#define SYSCTL_FOUR ((void *)&sysctl_vals[4])
-#define SYSCTL_ONE_HUNDRED ((void *)&sysctl_vals[5])
-#define SYSCTL_TWO_HUNDRED ((void *)&sysctl_vals[6])
-#define SYSCTL_ONE_THOUSAND ((void *)&sysctl_vals[7])
-#define SYSCTL_THREE_THOUSAND ((void *)&sysctl_vals[8])
-#define SYSCTL_INT_MAX ((void *)&sysctl_vals[9])
+#define SYSCTL_FIVE ((void *)&sysctl_vals[5])
+#define SYSCTL_ONE_HUNDRED ((void *)&sysctl_vals[6])
+#define SYSCTL_TWO_HUNDRED ((void *)&sysctl_vals[7])
+#define SYSCTL_ONE_THOUSAND ((void *)&sysctl_vals[8])
+#define SYSCTL_THREE_THOUSAND ((void *)&sysctl_vals[9])
+#define SYSCTL_INT_MAX ((void *)&sysctl_vals[10])
/* this is needed for the proc_dointvec_minmax for [fs_]overflow UID and GID */
-#define SYSCTL_MAXOLDUID ((void *)&sysctl_vals[10])
-#define SYSCTL_NEG_ONE ((void *)&sysctl_vals[11])
+#define SYSCTL_MAXOLDUID ((void *)&sysctl_vals[11])
+#define SYSCTL_NEG_ONE ((void *)&sysctl_vals[12])
extern const int sysctl_vals[];
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 79e6cb1d5c48..a922b44eaddd 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -82,7 +82,7 @@
#endif
/* shared constants to be used in various sysctls */
-const int sysctl_vals[] = { 0, 1, 2, 3, 4, 100, 200, 1000, 3000, INT_MAX, 65535, -1 };
+const int sysctl_vals[] = { 0, 1, 2, 3, 4, 5, 100, 200, 1000, 3000, INT_MAX, 65535, -1 };
EXPORT_SYMBOL(sysctl_vals);
const unsigned long sysctl_long_vals[] = { 0, 1, LONG_MAX };
--
2.34.1
^ permalink raw reply related [flat|nested] 56+ messages in thread
* [PATCH net-next 17/44] tcp: accecn: AccECN negotiation
2024-10-15 10:28 [PATCH net-next 00/44] DualPI2, Accurate ECN, TCP Prague patch series chia-yu.chang
` (15 preceding siblings ...)
2024-10-15 10:29 ` [PATCH net-next 16/44] net: sysctl: introduce sysctl SYSCTL_FIVE chia-yu.chang
@ 2024-10-15 10:29 ` chia-yu.chang
2024-10-15 19:49 ` Ilpo Järvinen
2024-10-15 10:29 ` [PATCH net-next 18/44] tcp: accecn: add AccECN rx byte counters chia-yu.chang
` (27 subsequent siblings)
44 siblings, 1 reply; 56+ messages in thread
From: chia-yu.chang @ 2024-10-15 10:29 UTC (permalink / raw)
To: netdev, ij, ncardwell, koen.de_schepper, g.white,
ingemar.s.johansson, mirja.kuehlewind, cheshire, rs.ietf,
Jason_Livingood, vidhi_goel
Cc: Olivier Tilmans, Chia-Yu Chang
From: Ilpo Järvinen <ij@kernel.org>
Accurate ECN negotiation parts based on the specification:
https://tools.ietf.org/id/draft-ietf-tcpm-accurate-ecn-28.txt
Accurate ECN is negotiated using ECE, CWR and AE flags in the
TCP header. TCP falls back into using RFC3168 ECN if one of the
ends supports only RFC3168-style ECN.
The AccECN negotiation includes reflecting the IP ECN field value
seen in the SYN and SYN/ACK back using the same bits as the
negotiation, to allow responding to SYN CE marks and to detect ECN
field mangling. CE marks should not occur currently because SYN=1
segments are sent with Non-ECT in the IP ECN field (but a proposal
exists to remove this restriction).
Reflecting the SYN IP ECN field in the SYN/ACK is relatively simple.
Reflecting the SYN/ACK IP ECN field in the final/third ACK of the
handshake is more challenging. The Linux TCP code is not well
prepared for using the final/third ACK as a signalling channel,
which makes things somewhat complicated here.
Co-developed-by: Olivier Tilmans <olivier.tilmans@nokia.com>
Signed-off-by: Olivier Tilmans <olivier.tilmans@nokia.com>
Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Co-developed-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
---
include/linux/tcp.h | 9 ++-
include/net/tcp.h | 80 +++++++++++++++++++-
net/ipv4/syncookies.c | 3 +
net/ipv4/sysctl_net_ipv4.c | 2 +-
net/ipv4/tcp.c | 2 +
net/ipv4/tcp_input.c | 149 +++++++++++++++++++++++++++++++++----
net/ipv4/tcp_ipv4.c | 3 +-
net/ipv4/tcp_minisocks.c | 51 +++++++++++--
net/ipv4/tcp_output.c | 77 +++++++++++++++----
net/ipv6/syncookies.c | 1 +
net/ipv6/tcp_ipv6.c | 1 +
11 files changed, 336 insertions(+), 42 deletions(-)
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index c36e519f3985..4970ce3ee864 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -156,6 +156,10 @@ struct tcp_request_sock {
#if IS_ENABLED(CONFIG_MPTCP)
bool drop_req;
#endif
+ u8 accecn_ok : 1,
+ syn_ect_snt: 2,
+ syn_ect_rcv: 2;
+ u8 accecn_fail_mode:4;
u32 txhash;
u32 rcv_isn;
u32 snt_isn;
@@ -372,7 +376,10 @@ struct tcp_sock {
u8 compressed_ack;
u8 dup_ack_counter:2,
tlp_retrans:1, /* TLP is a retransmission */
- unused:5;
+ syn_ect_snt:2, /* AccECN ECT memory, only */
+ syn_ect_rcv:2, /* ... needed during 3WHS + first seqno */
+ wait_third_ack:1; /* Need 3rd ACK in simultaneous open for AccECN */
+ u8 accecn_fail_mode:4; /* AccECN failure handling */
u8 thin_lto : 1,/* Use linear timeouts for thin streams */
fastopen_connect:1, /* FASTOPEN_CONNECT sockopt */
fastopen_no_cookie:1, /* Allow send/recv SYN+data without a cookie */
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 5ae0d1f9b083..6a387d4b2fa1 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -27,6 +27,7 @@
#include <linux/ktime.h>
#include <linux/indirect_call_wrapper.h>
#include <linux/bits.h>
+#include <linux/bitfield.h>
#include <net/inet_connection_sock.h>
#include <net/inet_timewait_sock.h>
@@ -232,6 +233,37 @@ static_assert((1 << ATO_BITS) > TCP_DELACK_MAX);
#define TCPOLEN_MSS_ALIGNED 4
#define TCPOLEN_EXP_SMC_BASE_ALIGNED 8
+/* tp->accecn_fail_mode */
+#define TCP_ACCECN_ACE_FAIL_SEND BIT(0)
+#define TCP_ACCECN_ACE_FAIL_RECV BIT(1)
+#define TCP_ACCECN_OPT_FAIL_SEND BIT(2)
+#define TCP_ACCECN_OPT_FAIL_RECV BIT(3)
+
+static inline bool tcp_accecn_ace_fail_send(const struct tcp_sock *tp)
+{
+ return tp->accecn_fail_mode & TCP_ACCECN_ACE_FAIL_SEND;
+}
+
+static inline bool tcp_accecn_ace_fail_recv(const struct tcp_sock *tp)
+{
+ return tp->accecn_fail_mode & TCP_ACCECN_ACE_FAIL_RECV;
+}
+
+static inline bool tcp_accecn_opt_fail_send(const struct tcp_sock *tp)
+{
+ return tp->accecn_fail_mode & TCP_ACCECN_OPT_FAIL_SEND;
+}
+
+static inline bool tcp_accecn_opt_fail_recv(const struct tcp_sock *tp)
+{
+ return tp->accecn_fail_mode & TCP_ACCECN_OPT_FAIL_RECV;
+}
+
+static inline void tcp_accecn_fail_mode_set(struct tcp_sock *tp, u8 mode)
+{
+ tp->accecn_fail_mode |= mode;
+}
+
/* Flags in tp->nonagle */
#define TCP_NAGLE_OFF 1 /* Nagle's algo is disabled */
#define TCP_NAGLE_CORK 2 /* Socket is corked */
@@ -418,6 +450,23 @@ static inline u8 tcp_accecn_ace(const struct tcphdr *th)
return (th->ae << 2) | (th->cwr << 1) | th->ece;
}
+/* Infer the ECT value our SYN arrived with from the echoed ACE field */
+static inline int tcp_accecn_extract_syn_ect(u8 ace)
+{
+ if (ace & 0x1)
+ return INET_ECN_ECT_1;
+ if (!(ace & 0x2))
+ return INET_ECN_ECT_0;
+ if (ace & 0x4)
+ return INET_ECN_CE;
+ return INET_ECN_NOT_ECT;
+}
+
+bool tcp_accecn_validate_syn_feedback(struct sock *sk, u8 ace, u8 sent_ect);
+void tcp_accecn_third_ack(struct sock *sk, const struct sk_buff *skb,
+ u8 syn_ect_snt);
+void tcp_ecn_received_counters(struct sock *sk, const struct sk_buff *skb);
+
enum tcp_tw_status {
TCP_TW_SUCCESS = 0,
TCP_TW_RST = 1,
@@ -653,6 +702,15 @@ static inline bool cookie_ecn_ok(const struct net *net, const struct dst_entry *
dst_feature(dst, RTAX_FEATURE_ECN);
}
+/* AccECN specification, 5.1: [...] a server can determine that it
+ * negotiated AccECN as [...] if the ACK contains an ACE field with
+ * the value 0b010 to 0b111 (decimal 2 to 7).
+ */
+static inline bool cookie_accecn_ok(const struct tcphdr *th)
+{
+ return tcp_accecn_ace(th) > 0x1;
+}
+
#if IS_ENABLED(CONFIG_BPF)
static inline bool cookie_bpf_ok(struct sk_buff *skb)
{
@@ -942,6 +1000,7 @@ static inline u32 tcp_rsk_tsval(const struct tcp_request_sock *treq)
#define TCPHDR_ACE (TCPHDR_ECE | TCPHDR_CWR | TCPHDR_AE)
#define TCPHDR_SYN_ECN (TCPHDR_SYN | TCPHDR_ECE | TCPHDR_CWR)
+#define TCPHDR_SYNACK_ACCECN (TCPHDR_SYN | TCPHDR_ACK | TCPHDR_CWR)
#define TCP_ACCECN_CEP_ACE_MASK 0x7
#define TCP_ACCECN_ACE_MAX_DELTA 6
@@ -1023,6 +1082,15 @@ struct tcp_skb_cb {
#define TCP_SKB_CB(__skb) ((struct tcp_skb_cb *)&((__skb)->cb[0]))
+static inline u16 tcp_accecn_reflector_flags(u8 ect)
+{
+ u32 flags = ect + 2;
+
+ if (ect == 3)
+ flags++;
+ return FIELD_PREP(TCPHDR_ACE, flags);
+}
+
extern const struct inet_connection_sock_af_ops ipv4_specific;
#if IS_ENABLED(CONFIG_IPV6)
@@ -1145,7 +1213,10 @@ enum tcp_ca_ack_event_flags {
#define TCP_CONG_NON_RESTRICTED BIT(0)
/* Requires ECN/ECT set on all packets */
#define TCP_CONG_NEEDS_ECN BIT(1)
-#define TCP_CONG_MASK (TCP_CONG_NON_RESTRICTED | TCP_CONG_NEEDS_ECN)
+/* Require successfully negotiated AccECN capability */
+#define TCP_CONG_NEEDS_ACCECN BIT(2)
+#define TCP_CONG_MASK (TCP_CONG_NON_RESTRICTED | TCP_CONG_NEEDS_ECN | \
+ TCP_CONG_NEEDS_ACCECN)
union tcp_cc_info;
@@ -1277,6 +1348,13 @@ static inline bool tcp_ca_needs_ecn(const struct sock *sk)
return icsk->icsk_ca_ops->flags & TCP_CONG_NEEDS_ECN;
}
+static inline bool tcp_ca_needs_accecn(const struct sock *sk)
+{
+ const struct inet_connection_sock *icsk = inet_csk(sk);
+
+ return icsk->icsk_ca_ops->flags & TCP_CONG_NEEDS_ACCECN;
+}
+
static inline void tcp_ca_event(struct sock *sk, const enum tcp_ca_event event)
{
const struct inet_connection_sock *icsk = inet_csk(sk);
diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
index 1948d15f1f28..3bd6274c8bcb 100644
--- a/net/ipv4/syncookies.c
+++ b/net/ipv4/syncookies.c
@@ -401,6 +401,7 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb)
const struct tcphdr *th = tcp_hdr(skb);
struct tcp_sock *tp = tcp_sk(sk);
struct inet_request_sock *ireq;
+ struct tcp_request_sock *treq;
struct net *net = sock_net(sk);
struct request_sock *req;
struct sock *ret = sk;
@@ -427,6 +428,7 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb)
}
ireq = inet_rsk(req);
+ treq = tcp_rsk(req);
sk_rcv_saddr_set(req_to_sk(req), ip_hdr(skb)->daddr);
sk_daddr_set(req_to_sk(req), ip_hdr(skb)->saddr);
@@ -481,6 +483,7 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb)
if (!req->syncookie)
ireq->rcv_wscale = rcv_wscale;
ireq->ecn_ok &= cookie_ecn_ok(net, &rt->dst);
+ treq->accecn_ok = ireq->ecn_ok && cookie_accecn_ok(th);
ret = tcp_get_cookie_sock(sk, skb, req, &rt->dst);
/* ip_queue_xmit() depends on our flow being setup
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index a79b2a52ce01..01fcc6b2045b 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -726,7 +726,7 @@ static struct ctl_table ipv4_net_table[] = {
.mode = 0644,
.proc_handler = proc_dou8vec_minmax,
.extra1 = SYSCTL_ZERO,
- .extra2 = SYSCTL_TWO,
+ .extra2 = SYSCTL_FIVE,
},
{
.procname = "tcp_ecn_fallback",
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 499f2a0be036..f5ceadb43efb 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -3336,6 +3336,8 @@ int tcp_disconnect(struct sock *sk, int flags)
tp->window_clamp = 0;
tp->delivered = 0;
tp->delivered_ce = 0;
+ tp->wait_third_ack = 0;
+ tp->accecn_fail_mode = 0;
tcp_accecn_init_counters(tp);
if (icsk->icsk_ca_ops->release)
icsk->icsk_ca_ops->release(sk);
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index fcc6b7a75db8..0591c605b57a 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -394,14 +394,91 @@ static void tcp_data_ecn_check(struct sock *sk, const struct sk_buff *skb)
}
}
-static void tcp_ecn_rcv_synack(struct tcp_sock *tp, const struct tcphdr *th)
+/* AccECN specification, 3.1.2: If a TCP server that implements AccECN
+ * receives a SYN with the three TCP header flags (AE, CWR and ECE) set
+ * to any combination other than 000, 011 or 111, it MUST negotiate the
+ * use of AccECN as if they had been set to 111.
+ */
+static bool tcp_accecn_syn_requested(const struct tcphdr *th)
+{
+ u8 ace = tcp_accecn_ace(th);
+
+ return ace && ace != 0x3;
+}
+
+/* Check ECN field transition to detect invalid transitions */
+static bool tcp_ect_transition_valid(u8 snt, u8 rcv)
+{
+ if (rcv == snt)
+ return true;
+
+ /* Non-ECT altered to something or something became non-ECT */
+ if (snt == INET_ECN_NOT_ECT || rcv == INET_ECN_NOT_ECT)
+ return false;
+ /* CE -> ECT(0/1)? */
+ if (snt == INET_ECN_CE)
+ return false;
+ return true;
+}
+
+bool tcp_accecn_validate_syn_feedback(struct sock *sk, u8 ace, u8 sent_ect)
{
- if (tcp_ecn_mode_rfc3168(tp) && (!th->ece || th->cwr))
+ u8 ect = tcp_accecn_extract_syn_ect(ace);
+ struct tcp_sock *tp = tcp_sk(sk);
+
+ if (!sock_net(sk)->ipv4.sysctl_tcp_ecn_fallback)
+ return true;
+
+ if (!tcp_ect_transition_valid(sent_ect, ect)) {
+ tcp_accecn_fail_mode_set(tp, TCP_ACCECN_ACE_FAIL_RECV);
+ return false;
+ }
+
+ return true;
+}
+
+/* See Table 2 of the AccECN draft */
+static void tcp_ecn_rcv_synack(struct sock *sk, const struct tcphdr *th,
+ u8 ip_dsfield)
+{
+ struct tcp_sock *tp = tcp_sk(sk);
+ u8 ace = tcp_accecn_ace(th);
+
+ switch (ace) {
+ case 0x0:
+ case 0x7:
tcp_ecn_mode_set(tp, TCP_ECN_DISABLED);
+ break;
+ case 0x1:
+ case 0x5:
+ if (tcp_ecn_mode_pending(tp))
+ /* Downgrade from AccECN, or requested initially */
+ tcp_ecn_mode_set(tp, TCP_ECN_MODE_RFC3168);
+ break;
+ default:
+ tcp_ecn_mode_set(tp, TCP_ECN_MODE_ACCECN);
+ tp->syn_ect_rcv = ip_dsfield & INET_ECN_MASK;
+ if (tcp_accecn_validate_syn_feedback(sk, ace, tp->syn_ect_snt) &&
+ INET_ECN_is_ce(ip_dsfield)) {
+ tp->received_ce++;
+ tp->received_ce_pending++;
+ }
+ break;
+ }
}
-static void tcp_ecn_rcv_syn(struct tcp_sock *tp, const struct tcphdr *th)
+static void tcp_ecn_rcv_syn(struct tcp_sock *tp, const struct tcphdr *th,
+ const struct sk_buff *skb)
{
+ if (tcp_ecn_mode_pending(tp)) {
+ if (!tcp_accecn_syn_requested(th)) {
+ /* Downgrade to classic ECN feedback */
+ tcp_ecn_mode_set(tp, TCP_ECN_MODE_RFC3168);
+ } else {
+ tp->syn_ect_rcv = TCP_SKB_CB(skb)->ip_dsfield & INET_ECN_MASK;
+ tcp_ecn_mode_set(tp, TCP_ECN_MODE_ACCECN);
+ }
+ }
if (tcp_ecn_mode_rfc3168(tp) && (!th->ece || !th->cwr))
tcp_ecn_mode_set(tp, TCP_ECN_DISABLED);
}
@@ -3825,7 +3902,7 @@ bool tcp_oow_rate_limited(struct net *net, const struct sk_buff *skb,
}
/* RFC 5961 7 [ACK Throttling] */
-static void tcp_send_challenge_ack(struct sock *sk)
+static void tcp_send_challenge_ack(struct sock *sk, bool accecn_reflector)
{
struct tcp_sock *tp = tcp_sk(sk);
struct net *net = sock_net(sk);
@@ -3855,7 +3932,8 @@ static void tcp_send_challenge_ack(struct sock *sk)
WRITE_ONCE(net->ipv4.tcp_challenge_count, count - 1);
send_ack:
NET_INC_STATS(net, LINUX_MIB_TCPCHALLENGEACK);
- tcp_send_ack(sk);
+ __tcp_send_ack(sk, tp->rcv_nxt,
+ !accecn_reflector ? 0 : tcp_accecn_reflector_flags(tp->syn_ect_rcv));
}
}
@@ -4022,7 +4100,7 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
/* RFC 5961 5.2 [Blind Data Injection Attack].[Mitigation] */
if (before(ack, prior_snd_una - max_window)) {
if (!(flag & FLAG_NO_CHALLENGE_ACK))
- tcp_send_challenge_ack(sk);
+ tcp_send_challenge_ack(sk, false);
return -SKB_DROP_REASON_TCP_TOO_OLD_ACK;
}
goto old_ack;
@@ -6007,7 +6085,7 @@ static void tcp_urg(struct sock *sk, struct sk_buff *skb, const struct tcphdr *t
}
/* Updates Accurate ECN received counters from the received IP ECN field */
-static void tcp_ecn_received_counters(struct sock *sk, const struct sk_buff *skb)
+void tcp_ecn_received_counters(struct sock *sk, const struct sk_buff *skb)
{
u8 ecnfield = TCP_SKB_CB(skb)->ip_dsfield & INET_ECN_MASK;
u8 is_ce = INET_ECN_is_ce(ecnfield);
@@ -6047,6 +6125,7 @@ static bool tcp_reset_check(const struct sock *sk, const struct sk_buff *skb)
static bool tcp_validate_incoming(struct sock *sk, struct sk_buff *skb,
const struct tcphdr *th, int syn_inerr)
{
+ bool send_accecn_reflector = false;
struct tcp_sock *tp = tcp_sk(sk);
SKB_DR(reason);
@@ -6128,7 +6207,7 @@ static bool tcp_validate_incoming(struct sock *sk, struct sk_buff *skb,
if (tp->syn_fastopen && !tp->data_segs_in &&
sk->sk_state == TCP_ESTABLISHED)
tcp_fastopen_active_disable(sk);
- tcp_send_challenge_ack(sk);
+ tcp_send_challenge_ack(sk, false);
SKB_DR_SET(reason, TCP_RESET);
goto discard;
}
@@ -6139,16 +6218,25 @@ static bool tcp_validate_incoming(struct sock *sk, struct sk_buff *skb,
* RFC 5961 4.2 : Send a challenge ack
*/
if (th->syn) {
+ if (tcp_ecn_mode_accecn(tp))
+ send_accecn_reflector = true;
if (sk->sk_state == TCP_SYN_RECV && sk->sk_socket && th->ack &&
TCP_SKB_CB(skb)->seq + 1 == TCP_SKB_CB(skb)->end_seq &&
TCP_SKB_CB(skb)->seq + 1 == tp->rcv_nxt &&
- TCP_SKB_CB(skb)->ack_seq == tp->snd_nxt)
+ TCP_SKB_CB(skb)->ack_seq == tp->snd_nxt) {
+ if (!tcp_ecn_disabled(tp)) {
+ tp->wait_third_ack = true;
+ __tcp_send_ack(sk, tp->rcv_nxt,
+ !send_accecn_reflector ? 0 :
+ tcp_accecn_reflector_flags(tp->syn_ect_rcv));
+ }
goto pass;
+ }
syn_challenge:
if (syn_inerr)
TCP_INC_STATS(sock_net(sk), TCP_MIB_INERRS);
NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPSYNCHALLENGE);
- tcp_send_challenge_ack(sk);
+ tcp_send_challenge_ack(sk, send_accecn_reflector);
SKB_DR_SET(reason, TCP_INVALID_SYN);
goto discard;
}
@@ -6358,6 +6446,13 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb)
return;
step5:
+ if (unlikely(tp->wait_third_ack)) {
+ if (!tcp_ecn_disabled(tp))
+ tp->wait_third_ack = 0;
+ if (tcp_ecn_mode_accecn(tp))
+ tcp_accecn_third_ack(sk, skb, tp->syn_ect_snt);
+ tcp_fast_path_on(tp);
+ }
tcp_ecn_received_counters(sk, skb);
reason = tcp_ack(sk, skb, FLAG_SLOWPATH | FLAG_UPDATE_TS_RECENT);
@@ -6611,7 +6706,8 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
* state to ESTABLISHED..."
*/
- tcp_ecn_rcv_synack(tp, th);
+ if (tcp_ecn_mode_any(tp))
+ tcp_ecn_rcv_synack(sk, th, TCP_SKB_CB(skb)->ip_dsfield);
tcp_init_wl(tp, TCP_SKB_CB(skb)->seq);
tcp_try_undo_spurious_syn(sk);
@@ -6683,7 +6779,9 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
TCP_DELACK_MAX, TCP_RTO_MAX);
goto consume;
}
- tcp_send_ack(sk);
+ __tcp_send_ack(sk, tp->rcv_nxt,
+ !tcp_ecn_mode_accecn(tp) ? 0 :
+ tcp_accecn_reflector_flags(tp->syn_ect_rcv));
return -1;
}
@@ -6742,7 +6840,7 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
tp->snd_wl1 = TCP_SKB_CB(skb)->seq;
tp->max_window = tp->snd_wnd;
- tcp_ecn_rcv_syn(tp, th);
+ tcp_ecn_rcv_syn(tp, th, skb);
tcp_mtup_init(sk);
tcp_sync_mss(sk, icsk->icsk_pmtu_cookie);
@@ -6925,7 +7023,7 @@ tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb)
}
/* accept old ack during closing */
if ((int)reason < 0) {
- tcp_send_challenge_ack(sk);
+ tcp_send_challenge_ack(sk, false);
reason = -reason;
goto discard;
}
@@ -6972,9 +7070,16 @@ tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb)
tp->lsndtime = tcp_jiffies32;
tcp_initialize_rcv_mss(sk);
- tcp_fast_path_on(tp);
+ if (likely(!tp->wait_third_ack)) {
+ if (tcp_ecn_mode_accecn(tp))
+ tcp_accecn_third_ack(sk, skb, tp->syn_ect_snt);
+ tcp_fast_path_on(tp);
+ }
if (sk->sk_shutdown & SEND_SHUTDOWN)
tcp_shutdown(sk, SEND_SHUTDOWN);
+
+ if (sk->sk_socket && tp->wait_third_ack)
+ goto consume;
break;
case TCP_FIN_WAIT1: {
@@ -7144,6 +7249,14 @@ static void tcp_ecn_create_request(struct request_sock *req,
bool ect, ecn_ok;
u32 ecn_ok_dst;
+ if (tcp_accecn_syn_requested(th) &&
+ (net->ipv4.sysctl_tcp_ecn >= 3 || tcp_ca_needs_accecn(listen_sk))) {
+ inet_rsk(req)->ecn_ok = 1;
+ tcp_rsk(req)->accecn_ok = 1;
+ tcp_rsk(req)->syn_ect_rcv = TCP_SKB_CB(skb)->ip_dsfield & INET_ECN_MASK;
+ return;
+ }
+
if (!th_ecn)
return;
@@ -7151,7 +7264,8 @@ static void tcp_ecn_create_request(struct request_sock *req,
ecn_ok_dst = dst_feature(dst, DST_FEATURE_ECN_MASK);
ecn_ok = READ_ONCE(net->ipv4.sysctl_tcp_ecn) || ecn_ok_dst;
- if (((!ect || th->res1) && ecn_ok) || tcp_ca_needs_ecn(listen_sk) ||
+ if (((!ect || th->res1 || th->ae) && ecn_ok) ||
+ tcp_ca_needs_ecn(listen_sk) ||
(ecn_ok_dst & DST_FEATURE_ECN_CA) ||
tcp_bpf_ca_needs_ecn((struct sock *)req))
inet_rsk(req)->ecn_ok = 1;
@@ -7168,6 +7282,9 @@ static void tcp_openreq_init(struct request_sock *req,
tcp_rsk(req)->rcv_nxt = TCP_SKB_CB(skb)->seq + 1;
tcp_rsk(req)->snt_synack = 0;
tcp_rsk(req)->last_oow_ack_time = 0;
+ tcp_rsk(req)->accecn_ok = 0;
+ tcp_rsk(req)->syn_ect_rcv = 0;
+ tcp_rsk(req)->syn_ect_snt = 0;
req->mss = rx_opt->mss_clamp;
req->ts_recent = rx_opt->saw_tstamp ? rx_opt->rcv_tsval : 0;
ireq->tstamp_ok = rx_opt->tstamp_ok;
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 9419e7b492fc..97df9f36714c 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1182,7 +1182,7 @@ static int tcp_v4_send_synack(const struct sock *sk, struct dst_entry *dst,
enum tcp_synack_type synack_type,
struct sk_buff *syn_skb)
{
- const struct inet_request_sock *ireq = inet_rsk(req);
+ struct inet_request_sock *ireq = inet_rsk(req);
struct flowi4 fl4;
int err = -1;
struct sk_buff *skb;
@@ -1195,6 +1195,7 @@ static int tcp_v4_send_synack(const struct sock *sk, struct dst_entry *dst,
skb = tcp_make_synack(sk, dst, req, foc, synack_type, syn_skb);
if (skb) {
+ tcp_rsk(req)->syn_ect_snt = inet_sk(sk)->tos & INET_ECN_MASK;
__tcp_v4_send_check(skb, ireq->ir_loc_addr, ireq->ir_rmt_addr);
tos = READ_ONCE(inet_sk(sk)->tos);
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index 8fb9f550fdeb..81d42942c335 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -450,12 +450,51 @@ void tcp_openreq_init_rwin(struct request_sock *req,
}
EXPORT_SYMBOL(tcp_openreq_init_rwin);
-static void tcp_ecn_openreq_child(struct tcp_sock *tp,
- const struct request_sock *req)
+void tcp_accecn_third_ack(struct sock *sk, const struct sk_buff *skb,
+ u8 syn_ect_snt)
{
- tcp_ecn_mode_set(tp, inet_rsk(req)->ecn_ok ?
- TCP_ECN_MODE_RFC3168 :
- TCP_ECN_DISABLED);
+ u8 ace = tcp_accecn_ace(tcp_hdr(skb));
+ struct tcp_sock *tp = tcp_sk(sk);
+
+ switch (ace) {
+ case 0x0:
+ tcp_accecn_fail_mode_set(tp, TCP_ACCECN_ACE_FAIL_RECV);
+ break;
+ case 0x7:
+ case 0x5:
+ case 0x1:
+ /* Unused but legal values */
+ break;
+ default:
+ /* Validation only applies to first non-data packet */
+ if (TCP_SKB_CB(skb)->seq == TCP_SKB_CB(skb)->end_seq &&
+ !TCP_SKB_CB(skb)->sacked &&
+ tcp_accecn_validate_syn_feedback(sk, ace, syn_ect_snt)) {
+ if ((tcp_accecn_extract_syn_ect(ace) == INET_ECN_CE) &&
+ !tp->delivered_ce)
+ tp->delivered_ce++;
+ }
+ break;
+ }
+}
+
+static void tcp_ecn_openreq_child(struct sock *sk,
+ const struct request_sock *req,
+ const struct sk_buff *skb)
+{
+ const struct tcp_request_sock *treq = tcp_rsk(req);
+ struct tcp_sock *tp = tcp_sk(sk);
+
+ if (treq->accecn_ok) {
+ tcp_ecn_mode_set(tp, TCP_ECN_MODE_ACCECN);
+ tp->syn_ect_snt = treq->syn_ect_snt;
+ tcp_accecn_third_ack(sk, skb, treq->syn_ect_snt);
+ tcp_ecn_received_counters(sk, skb);
+ } else {
+ tcp_ecn_mode_set(tp, inet_rsk(req)->ecn_ok ?
+ TCP_ECN_MODE_RFC3168 :
+ TCP_ECN_DISABLED);
+ }
}
void tcp_ca_openreq_child(struct sock *sk, const struct dst_entry *dst)
@@ -621,7 +660,7 @@ struct sock *tcp_create_openreq_child(const struct sock *sk,
if (skb->len >= TCP_MSS_DEFAULT + newtp->tcp_header_len)
newicsk->icsk_ack.last_seg_size = skb->len - newtp->tcp_header_len;
newtp->rx_opt.mss_clamp = req->mss;
- tcp_ecn_openreq_child(newtp, req);
+ tcp_ecn_openreq_child(newsk, req, skb);
newtp->fastopen_req = NULL;
RCU_INIT_POINTER(newtp->fastopen_rsk, NULL);
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 42177f464d0c..ebda1b71d489 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -319,7 +319,7 @@ static u16 tcp_select_window(struct sock *sk)
/* Packet ECN state for a SYN-ACK */
static void tcp_ecn_send_synack(struct sock *sk, struct sk_buff *skb)
{
- const struct tcp_sock *tp = tcp_sk(sk);
+ struct tcp_sock *tp = tcp_sk(sk);
TCP_SKB_CB(skb)->tcp_flags &= ~TCPHDR_CWR;
if (tcp_ecn_disabled(tp))
@@ -327,6 +327,12 @@ static void tcp_ecn_send_synack(struct sock *sk, struct sk_buff *skb)
else if (tcp_ca_needs_ecn(sk) ||
tcp_bpf_ca_needs_ecn(sk))
INET_ECN_xmit(sk);
+
+ if (tp->ecn_flags & TCP_ECN_MODE_ACCECN) {
+ TCP_SKB_CB(skb)->tcp_flags &= ~TCPHDR_ACE;
+ TCP_SKB_CB(skb)->tcp_flags |= tcp_accecn_reflector_flags(tp->syn_ect_rcv);
+ tp->syn_ect_snt = inet_sk(sk)->tos & INET_ECN_MASK;
+ }
}
/* Packet ECN state for a SYN. */
@@ -334,8 +340,20 @@ static void tcp_ecn_send_syn(struct sock *sk, struct sk_buff *skb)
{
struct tcp_sock *tp = tcp_sk(sk);
bool bpf_needs_ecn = tcp_bpf_ca_needs_ecn(sk);
- bool use_ecn = READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_ecn) == 1 ||
- tcp_ca_needs_ecn(sk) || bpf_needs_ecn;
+ bool use_ecn, use_accecn;
+ u8 tcp_ecn = READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_ecn);
+
+ /* ============== ==========================
+ * tcp_ecn values Outgoing connections
+ * ============== ==========================
+ * 0,2,5 Do not request ECN
+ * 1,4 Request ECN connection
+ * 3 Request AccECN connection
+ * ============== ==========================
+ */
+ use_accecn = tcp_ecn == 3 || tcp_ca_needs_accecn(sk);
+ use_ecn = tcp_ecn == 1 || tcp_ecn == 4 ||
+ tcp_ca_needs_ecn(sk) || bpf_needs_ecn || use_accecn;
if (!use_ecn) {
const struct dst_entry *dst = __sk_dst_get(sk);
@@ -351,35 +369,58 @@ static void tcp_ecn_send_syn(struct sock *sk, struct sk_buff *skb)
INET_ECN_xmit(sk);
TCP_SKB_CB(skb)->tcp_flags |= TCPHDR_ECE | TCPHDR_CWR;
- tcp_ecn_mode_set(tp, TCP_ECN_MODE_RFC3168);
+ if (use_accecn) {
+ TCP_SKB_CB(skb)->tcp_flags |= TCPHDR_AE;
+ tcp_ecn_mode_set(tp, TCP_ECN_MODE_PENDING);
+ tp->syn_ect_snt = inet_sk(sk)->tos & INET_ECN_MASK;
+ } else {
+ tcp_ecn_mode_set(tp, TCP_ECN_MODE_RFC3168);
+ }
}
}
static void tcp_ecn_clear_syn(struct sock *sk, struct sk_buff *skb)
{
- if (READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_ecn_fallback))
+ if (READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_ecn_fallback)) {
/* tp->ecn_flags are cleared at a later point in time when
* SYN ACK is ultimatively being received.
*/
- TCP_SKB_CB(skb)->tcp_flags &= ~(TCPHDR_ECE | TCPHDR_CWR);
+ TCP_SKB_CB(skb)->tcp_flags &= ~TCPHDR_ACE;
+ }
+}
+
+static void tcp_accecn_echo_syn_ect(struct tcphdr *th, u8 ect)
+{
+ th->ae = !!(ect & INET_ECN_ECT_0);
+ th->cwr = ect != INET_ECN_ECT_0;
+ th->ece = ect == INET_ECN_ECT_1;
}
static void
tcp_ecn_make_synack(const struct request_sock *req, struct tcphdr *th)
{
- if (inet_rsk(req)->ecn_ok)
+ if (tcp_rsk(req)->accecn_ok)
+ tcp_accecn_echo_syn_ect(th, tcp_rsk(req)->syn_ect_rcv);
+ else if (inet_rsk(req)->ecn_ok)
th->ece = 1;
}
-static void tcp_accecn_set_ace(struct tcphdr *th, struct tcp_sock *tp)
+static void tcp_accecn_set_ace(struct tcp_sock *tp, struct sk_buff *skb,
+ struct tcphdr *th)
{
u32 wire_ace;
- wire_ace = tp->received_ce + TCP_ACCECN_CEP_INIT_OFFSET;
- th->ece = !!(wire_ace & 0x1);
- th->cwr = !!(wire_ace & 0x2);
- th->ae = !!(wire_ace & 0x4);
- tp->received_ce_pending = 0;
+ /* The final packet of the 3WHS or anything like it must reflect
+ * the SYN/ACK ECT instead of putting CEP into ACE field, such
+ * cases show up in tcp_flags.
+ */
+ if (likely(!(TCP_SKB_CB(skb)->tcp_flags & TCPHDR_ACE))) {
+ wire_ace = tp->received_ce + TCP_ACCECN_CEP_INIT_OFFSET;
+ th->ece = !!(wire_ace & 0x1);
+ th->cwr = !!(wire_ace & 0x2);
+ th->ae = !!(wire_ace & 0x4);
+ tp->received_ce_pending = 0;
+ }
}
/* Set up ECN state for a packet on a ESTABLISHED socket that is about to
@@ -393,9 +434,10 @@ static void tcp_ecn_send(struct sock *sk, struct sk_buff *skb,
if (!tcp_ecn_mode_any(tp))
return;
- INET_ECN_xmit(sk);
+ if (!tcp_accecn_ace_fail_recv(tp))
+ INET_ECN_xmit(sk);
if (tcp_ecn_mode_accecn(tp)) {
- tcp_accecn_set_ace(th, tp);
+ tcp_accecn_set_ace(tp, skb, th);
skb_shinfo(skb)->gso_type |= SKB_GSO_TCP_ACCECN;
} else {
/* Not-retransmitted data segment: set ECT and inject CWR. */
@@ -3404,7 +3446,10 @@ int __tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb, int segs)
tcp_retrans_try_collapse(sk, skb, avail_wnd);
}
- /* RFC3168, section 6.1.1.1. ECN fallback */
+ /* RFC3168, section 6.1.1.1. ECN fallback
+ * As AccECN uses the same SYN flags (+ AE), this check covers both
+ * cases.
+ */
if ((TCP_SKB_CB(skb)->tcp_flags & TCPHDR_SYN_ECN) == TCPHDR_SYN_ECN)
tcp_ecn_clear_syn(sk, skb);
diff --git a/net/ipv6/syncookies.c b/net/ipv6/syncookies.c
index 9d83eadd308b..50046460ee0b 100644
--- a/net/ipv6/syncookies.c
+++ b/net/ipv6/syncookies.c
@@ -264,6 +264,7 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
if (!req->syncookie)
ireq->rcv_wscale = rcv_wscale;
ireq->ecn_ok &= cookie_ecn_ok(net, dst);
+ tcp_rsk(req)->accecn_ok = ireq->ecn_ok && cookie_accecn_ok(th);
ret = tcp_get_cookie_sock(sk, skb, req, dst);
if (!ret) {
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index d9551c9cd562..6e49f22ce379 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -542,6 +542,7 @@ static int tcp_v6_send_synack(const struct sock *sk, struct dst_entry *dst,
skb = tcp_make_synack(sk, dst, req, foc, synack_type, syn_skb);
if (skb) {
+ tcp_rsk(req)->syn_ect_snt = np->tclass & INET_ECN_MASK;
__tcp_v6_send_check(skb, &ireq->ir_v6_loc_addr,
&ireq->ir_v6_rmt_addr);
--
2.34.1
^ permalink raw reply related [flat|nested] 56+ messages in thread
* [PATCH net-next 18/44] tcp: accecn: add AccECN rx byte counters
2024-10-15 10:28 [PATCH net-next 00/44] DualPI2, Accurate ECN, TCP Prague patch series chia-yu.chang
` (16 preceding siblings ...)
2024-10-15 10:29 ` [PATCH net-next 17/44] tcp: accecn: AccECN negotiation chia-yu.chang
@ 2024-10-15 10:29 ` chia-yu.chang
2024-10-15 10:29 ` [PATCH net-next 19/44] tcp: allow embedding leftover into option padding chia-yu.chang
` (26 subsequent siblings)
44 siblings, 0 replies; 56+ messages in thread
From: chia-yu.chang @ 2024-10-15 10:29 UTC (permalink / raw)
To: netdev, ij, ncardwell, koen.de_schepper, g.white,
ingemar.s.johansson, mirja.kuehlewind, cheshire, rs.ietf,
Jason_Livingood, vidhi_goel
Cc: Chia-Yu Chang
From: Ilpo Järvinen <ij@kernel.org>
These counters track IP ECN field payload byte sums for all
arriving (acceptable) packets. The AccECN option (added by
a later patch in the series) echoes these counters back to
the sender side.
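As a rough sketch of the bookkeeping described above (one payload byte
sum per ECT(1)/ECT(0)/CE codepoint, Non-ECT not counted), in plain
user-space C with made-up names:

#include <stdio.h>

enum { NOT_ECT = 0, ECT_1 = 1, ECT_0 = 2, CE = 3 };

/* Index 0 = ECT(1), 1 = ECT(0), 2 = CE, mirroring received_ecn_bytes[] */
static unsigned int received_ecn_bytes[3];

static void count_rx_segment(unsigned int ecnfield, unsigned int payload_len)
{
	if (ecnfield == NOT_ECT || payload_len == 0)
		return;
	received_ecn_bytes[ecnfield - 1] += payload_len;
}

int main(void)
{
	count_rx_segment(ECT_0, 1448);	/* normal data segment */
	count_rx_segment(CE, 1448);	/* CE-marked data segment */
	count_rx_segment(CE, 0);	/* pure ACK: ACE counts it, this does not */

	printf("ECT(1)=%u ECT(0)=%u CE=%u bytes\n",
	       received_ecn_bytes[ECT_1 - 1],
	       received_ecn_bytes[ECT_0 - 1],
	       received_ecn_bytes[CE - 1]);
	return 0;
}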
Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
---
include/linux/tcp.h | 1 +
include/net/tcp.h | 18 +++++++++++++++++-
net/ipv4/tcp.c | 3 ++-
net/ipv4/tcp_input.c | 12 ++++++++----
net/ipv4/tcp_minisocks.c | 3 ++-
5 files changed, 30 insertions(+), 7 deletions(-)
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 4970ce3ee864..aaf84044e127 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -299,6 +299,7 @@ struct tcp_sock {
u32 delivered; /* Total data packets delivered incl. rexmits */
u32 delivered_ce; /* Like the above but only ECE marked packets */
u32 received_ce; /* Like the above but for received CE marked packets */
+ u32 received_ecn_bytes[3];
u8 received_ce_pending:4, /* Not yet transmitted cnt of received_ce */
unused2:4;
u32 app_limited; /* limited until "delivered" reaches this val */
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 6a387d4b2fa1..56d009723c91 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -465,7 +465,8 @@ static inline int tcp_accecn_extract_syn_ect(u8 ace)
bool tcp_accecn_validate_syn_feedback(struct sock *sk, u8 ace, u8 sent_ect);
void tcp_accecn_third_ack(struct sock *sk, const struct sk_buff *skb,
u8 syn_ect_snt);
-void tcp_ecn_received_counters(struct sock *sk, const struct sk_buff *skb);
+void tcp_ecn_received_counters(struct sock *sk, const struct sk_buff *skb,
+ u32 payload_len);
enum tcp_tw_status {
TCP_TW_SUCCESS = 0,
@@ -1009,11 +1010,26 @@ static inline u32 tcp_rsk_tsval(const struct tcp_request_sock *treq)
* See draft-ietf-tcpm-accurate-ecn for the latest values.
*/
#define TCP_ACCECN_CEP_INIT_OFFSET 5
+#define TCP_ACCECN_E1B_INIT_OFFSET 1
+#define TCP_ACCECN_E0B_INIT_OFFSET 1
+#define TCP_ACCECN_CEB_INIT_OFFSET 0
+
+static inline void __tcp_accecn_init_bytes_counters(int *counter_array)
+{
+ BUILD_BUG_ON(INET_ECN_ECT_1 != 0x1);
+ BUILD_BUG_ON(INET_ECN_ECT_0 != 0x2);
+ BUILD_BUG_ON(INET_ECN_CE != 0x3);
+
+ counter_array[INET_ECN_ECT_1 - 1] = 0;
+ counter_array[INET_ECN_ECT_0 - 1] = 0;
+ counter_array[INET_ECN_CE - 1] = 0;
+}
static inline void tcp_accecn_init_counters(struct tcp_sock *tp)
{
tp->received_ce = 0;
tp->received_ce_pending = 0;
+ __tcp_accecn_init_bytes_counters(tp->received_ecn_bytes);
}
/* State flags for sacked in struct tcp_skb_cb */
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index f5ceadb43efb..39b20901ac6f 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -5029,6 +5029,7 @@ static void __init tcp_struct_check(void)
CACHELINE_ASSERT_GROUP_MEMBER(struct tcp_sock, tcp_sock_write_txrx, delivered);
CACHELINE_ASSERT_GROUP_MEMBER(struct tcp_sock, tcp_sock_write_txrx, delivered_ce);
CACHELINE_ASSERT_GROUP_MEMBER(struct tcp_sock, tcp_sock_write_txrx, received_ce);
+ CACHELINE_ASSERT_GROUP_MEMBER(struct tcp_sock, tcp_sock_write_txrx, received_ecn_bytes);
CACHELINE_ASSERT_GROUP_MEMBER(struct tcp_sock, tcp_sock_write_txrx, app_limited);
CACHELINE_ASSERT_GROUP_MEMBER(struct tcp_sock, tcp_sock_write_txrx, rcv_wnd);
CACHELINE_ASSERT_GROUP_MEMBER(struct tcp_sock, tcp_sock_write_txrx, rx_opt);
@@ -5036,7 +5037,7 @@ static void __init tcp_struct_check(void)
/* 32bit arches with 8byte alignment on u64 fields might need padding
* before tcp_clock_cache.
*/
- CACHELINE_ASSERT_GROUP_SIZE(struct tcp_sock, tcp_sock_write_txrx, 97 + 7);
+ CACHELINE_ASSERT_GROUP_SIZE(struct tcp_sock, tcp_sock_write_txrx, 109 + 3);
/* RX read-write hotpath cache lines */
CACHELINE_ASSERT_GROUP_MEMBER(struct tcp_sock, tcp_sock_write_rx, bytes_received);
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 0591c605b57a..c6b1324caab4 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -6085,7 +6085,8 @@ static void tcp_urg(struct sock *sk, struct sk_buff *skb, const struct tcphdr *t
}
/* Updates Accurate ECN received counters from the received IP ECN field */
-void tcp_ecn_received_counters(struct sock *sk, const struct sk_buff *skb)
+void tcp_ecn_received_counters(struct sock *sk, const struct sk_buff *skb,
+ u32 payload_len)
{
u8 ecnfield = TCP_SKB_CB(skb)->ip_dsfield & INET_ECN_MASK;
u8 is_ce = INET_ECN_is_ce(ecnfield);
@@ -6099,6 +6100,9 @@ void tcp_ecn_received_counters(struct sock *sk, const struct sk_buff *skb)
/* ACE counter tracks *all* segments including pure ACKs */
tp->received_ce += pcount;
tp->received_ce_pending = min(tp->received_ce_pending + pcount, 0xfU);
+
+ if (payload_len > 0)
+ tp->received_ecn_bytes[ecnfield - 1] += payload_len;
}
}
@@ -6360,7 +6364,7 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb)
tp->rcv_nxt == tp->rcv_wup)
flag |= __tcp_replace_ts_recent(tp, tstamp_delta);
- tcp_ecn_received_counters(sk, skb);
+ tcp_ecn_received_counters(sk, skb, 0);
/* We know that such packets are checksummed
* on entry.
@@ -6405,7 +6409,7 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb)
/* Bulk data transfer: receiver */
skb_dst_drop(skb);
__skb_pull(skb, tcp_header_len);
- tcp_ecn_received_counters(sk, skb);
+ tcp_ecn_received_counters(sk, skb, len - tcp_header_len);
eaten = tcp_queue_rcv(sk, skb, &fragstolen);
tcp_event_data_recv(sk, skb);
@@ -6453,7 +6457,7 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb)
tcp_accecn_third_ack(sk, skb, tp->syn_ect_snt);
tcp_fast_path_on(tp);
}
- tcp_ecn_received_counters(sk, skb);
+ tcp_ecn_received_counters(sk, skb, len - th->doff * 4);
reason = tcp_ack(sk, skb, FLAG_SLOWPATH | FLAG_UPDATE_TS_RECENT);
if ((int)reason < 0) {
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index 81d42942c335..ad9ac8e2bfd4 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -486,10 +486,11 @@ static void tcp_ecn_openreq_child(struct sock *sk,
struct tcp_sock *tp = tcp_sk(sk);
if (treq->accecn_ok) {
+ const struct tcphdr *th = (const struct tcphdr *)skb->data;
tcp_ecn_mode_set(tp, TCP_ECN_MODE_ACCECN);
tp->syn_ect_snt = treq->syn_ect_snt;
tcp_accecn_third_ack(sk, skb, treq->syn_ect_snt);
- tcp_ecn_received_counters(sk, skb);
+ tcp_ecn_received_counters(sk, skb, skb->len - th->doff * 4);
} else {
tcp_ecn_mode_set(tp, inet_rsk(req)->ecn_ok ?
TCP_ECN_MODE_RFC3168 :
--
2.34.1
^ permalink raw reply related [flat|nested] 56+ messages in thread
* [PATCH net-next 19/44] tcp: allow embedding leftover into option padding
2024-10-15 10:28 [PATCH net-next 00/44] DualPI2, Accurate ECN, TCP Prague patch series chia-yu.chang
` (17 preceding siblings ...)
2024-10-15 10:29 ` [PATCH net-next 18/44] tcp: accecn: add AccECN rx byte counters chia-yu.chang
@ 2024-10-15 10:29 ` chia-yu.chang
2024-10-15 10:29 ` [PATCH net-next 20/44] tcp: accecn: AccECN needs to know delivered bytes chia-yu.chang
` (25 subsequent siblings)
44 siblings, 0 replies; 56+ messages in thread
From: chia-yu.chang @ 2024-10-15 10:29 UTC (permalink / raw)
To: netdev, ij, ncardwell, koen.de_schepper, g.white,
ingemar.s.johansson, mirja.kuehlewind, cheshire, rs.ietf,
Jason_Livingood, vidhi_goel
Cc: Chia-Yu Chang
From: Ilpo Järvinen <ij@kernel.org>
There is some wasted space in the option usage due to the padding
of 32-bit fields. The AccECN option can take advantage of those
few bytes because its tail often consumes just a few odd bytes.
Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
---
net/ipv4/tcp_output.c | 22 +++++++++++++++++-----
1 file changed, 17 insertions(+), 5 deletions(-)
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index ebda1b71d489..becaf0e2ffce 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -703,6 +703,8 @@ static __be32 *process_tcp_ao_options(struct tcp_sock *tp,
return ptr;
}
+#define NOP_LEFTOVER ((TCPOPT_NOP << 8) | TCPOPT_NOP)
+
/* Write previously computed TCP options to the packet.
*
* Beware: Something in the Internet is very sensitive to the ordering of
@@ -722,7 +724,9 @@ static void tcp_options_write(struct tcphdr *th, struct tcp_sock *tp,
struct tcp_key *key)
{
__be32 *ptr = (__be32 *)(th + 1);
+ u16 leftover_bytes = NOP_LEFTOVER; /* replace next NOPs if avail */
u16 options = opts->options; /* mungable copy */
+ int leftover_size = 2;
if (tcp_key_is_md5(key)) {
*ptr++ = htonl((TCPOPT_NOP << 24) | (TCPOPT_NOP << 16) |
@@ -757,17 +761,22 @@ static void tcp_options_write(struct tcphdr *th, struct tcp_sock *tp,
}
if (unlikely(OPTION_SACK_ADVERTISE & options)) {
- *ptr++ = htonl((TCPOPT_NOP << 24) |
- (TCPOPT_NOP << 16) |
+ *ptr++ = htonl((leftover_bytes << 16) |
(TCPOPT_SACK_PERM << 8) |
TCPOLEN_SACK_PERM);
+ leftover_bytes = NOP_LEFTOVER;
}
if (unlikely(OPTION_WSCALE & options)) {
- *ptr++ = htonl((TCPOPT_NOP << 24) |
+ u8 highbyte = TCPOPT_NOP;
+
+ if (unlikely(leftover_size == 1))
+ highbyte = leftover_bytes >> 8;
+ *ptr++ = htonl((highbyte << 24) |
(TCPOPT_WINDOW << 16) |
(TCPOLEN_WINDOW << 8) |
opts->ws);
+ leftover_bytes = NOP_LEFTOVER;
}
if (unlikely(opts->num_sack_blocks)) {
@@ -775,8 +784,7 @@ static void tcp_options_write(struct tcphdr *th, struct tcp_sock *tp,
tp->duplicate_sack : tp->selective_acks;
int this_sack;
- *ptr++ = htonl((TCPOPT_NOP << 24) |
- (TCPOPT_NOP << 16) |
+ *ptr++ = htonl((leftover_bytes << 16) |
(TCPOPT_SACK << 8) |
(TCPOLEN_SACK_BASE + (opts->num_sack_blocks *
TCPOLEN_SACK_PERBLOCK)));
@@ -788,6 +796,10 @@ static void tcp_options_write(struct tcphdr *th, struct tcp_sock *tp,
}
tp->rx_opt.dsack = 0;
+ } else if (unlikely(leftover_bytes != NOP_LEFTOVER)) {
+ *ptr++ = htonl((leftover_bytes << 16) |
+ (TCPOPT_NOP << 8) |
+ TCPOPT_NOP);
}
if (unlikely(OPTION_FAST_OPEN_COOKIE & options)) {
--
2.34.1
^ permalink raw reply related [flat|nested] 56+ messages in thread
* [PATCH net-next 20/44] tcp: accecn: AccECN needs to know delivered bytes
2024-10-15 10:28 [PATCH net-next 00/44] DualPI2, Accurate ECN, TCP Prague patch series chia-yu.chang
` (18 preceding siblings ...)
2024-10-15 10:29 ` [PATCH net-next 19/44] tcp: allow embedding leftover into option padding chia-yu.chang
@ 2024-10-15 10:29 ` chia-yu.chang
2024-10-15 10:29 ` [PATCH net-next 21/44] tcp: sack option handling improvements chia-yu.chang
` (24 subsequent siblings)
44 siblings, 0 replies; 56+ messages in thread
From: chia-yu.chang @ 2024-10-15 10:29 UTC (permalink / raw)
To: netdev, ij, ncardwell, koen.de_schepper, g.white,
ingemar.s.johansson, mirja.kuehlewind, cheshire, rs.ietf,
Jason_Livingood, vidhi_goel
Cc: Chia-Yu Chang
From: Ilpo Järvinen <ij@kernel.org>
AccECN byte counter estimation requires the number of delivered
bytes, which can be calculated while processing SACK blocks and
the cumulative ACK. The delivered bytes will be used to estimate
the byte counters between AccECN options (on ACKs without the
option).
The non-SACK calculation is quite annoying, inaccurate, and
likely bogus.
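As a rough (and deliberately simplified) user-space sketch of the
accounting: the bytes delivered by one ACK are the advance of the
cumulative ACK point plus the length of every skb newly SACKed above
it. The structure and names below are made up; the real code also has
to avoid double-counting skbs that were SACKed by earlier ACKs.

#include <stdio.h>

struct fake_skb {
	unsigned int seq;
	unsigned int len;
	int newly_sacked;	/* marked by SACK processing of this ACK */
};

static unsigned int delivered_bytes(const struct fake_skb *q, int n,
				    unsigned int prior_snd_una,
				    unsigned int new_snd_una)
{
	unsigned int bytes = new_snd_una - prior_snd_una;

	for (int i = 0; i < n; i++) {
		if (q[i].newly_sacked && q[i].seq >= new_snd_una)
			bytes += q[i].len;
	}
	return bytes;
}

int main(void)
{
	struct fake_skb q[] = {
		{ .seq = 3000, .len = 1000, .newly_sacked = 1 },
		{ .seq = 5000, .len = 1000, .newly_sacked = 0 },
	};

	/* snd_una advances 1000 -> 3000 and one skb above it is SACKed */
	printf("delivered %u bytes\n", delivered_bytes(q, 2, 1000, 3000));
	return 0;
}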
Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
---
net/ipv4/tcp_input.c | 14 ++++++++++++--
1 file changed, 12 insertions(+), 2 deletions(-)
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index c6b1324caab4..f70b65034e45 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -1159,6 +1159,7 @@ struct tcp_sacktag_state {
u64 last_sackt;
u32 reord;
u32 sack_delivered;
+ u32 delivered_bytes;
int flag;
unsigned int mss_now;
struct rate_sample *rate;
@@ -1520,7 +1521,7 @@ static int tcp_match_skb_to_sack(struct sock *sk, struct sk_buff *skb,
static u8 tcp_sacktag_one(struct sock *sk,
struct tcp_sacktag_state *state, u8 sacked,
u32 start_seq, u32 end_seq,
- int dup_sack, int pcount,
+ int dup_sack, int pcount, u32 plen,
u64 xmit_time)
{
struct tcp_sock *tp = tcp_sk(sk);
@@ -1580,6 +1581,7 @@ static u8 tcp_sacktag_one(struct sock *sk,
tp->sacked_out += pcount;
/* Out-of-order packets delivered */
state->sack_delivered += pcount;
+ state->delivered_bytes += plen;
/* Lost marker hint past SACKed? Tweak RFC3517 cnt */
if (tp->lost_skb_hint &&
@@ -1621,7 +1623,7 @@ static bool tcp_shifted_skb(struct sock *sk, struct sk_buff *prev,
* tcp_highest_sack_seq() when skb is highest_sack.
*/
tcp_sacktag_one(sk, state, TCP_SKB_CB(skb)->sacked,
- start_seq, end_seq, dup_sack, pcount,
+ start_seq, end_seq, dup_sack, pcount, skb->len,
tcp_skb_timestamp_us(skb));
tcp_rate_skb_delivered(sk, skb, state->rate);
@@ -1913,6 +1915,7 @@ static struct sk_buff *tcp_sacktag_walk(struct sk_buff *skb, struct sock *sk,
TCP_SKB_CB(skb)->end_seq,
dup_sack,
tcp_skb_pcount(skb),
+ skb->len,
tcp_skb_timestamp_us(skb));
tcp_rate_skb_delivered(sk, skb, state->rate);
if (TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_ACKED)
@@ -3529,6 +3532,8 @@ static int tcp_clean_rtx_queue(struct sock *sk, const struct sk_buff *ack_skb,
if (sacked & TCPCB_SACKED_ACKED) {
tp->sacked_out -= acked_pcount;
+ /* snd_una delta covers these skbs */
+ sack->delivered_bytes -= skb->len;
} else if (tcp_is_sack(tp)) {
tcp_count_delivered(tp, acked_pcount, ece_ack);
if (!tcp_skb_spurious_retrans(tp, skb))
@@ -3632,6 +3637,10 @@ static int tcp_clean_rtx_queue(struct sock *sk, const struct sk_buff *ack_skb,
delta = prior_sacked - tp->sacked_out;
tp->lost_cnt_hint -= min(tp->lost_cnt_hint, delta);
}
+
+ sack->delivered_bytes = (skb ?
+ TCP_SKB_CB(skb)->seq : tp->snd_una) -
+ prior_snd_una;
} else if (skb && rtt_update && sack_rtt_us >= 0 &&
sack_rtt_us > tcp_stamp_us_delta(tp->tcp_mstamp,
tcp_skb_timestamp_us(skb))) {
@@ -4085,6 +4094,7 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
sack_state.first_sackt = 0;
sack_state.rate = &rs;
sack_state.sack_delivered = 0;
+ sack_state.delivered_bytes = 0;
/* We very likely will need to access rtx queue. */
prefetch(sk->tcp_rtx_queue.rb_node);
--
2.34.1
^ permalink raw reply related [flat|nested] 56+ messages in thread
* [PATCH net-next 21/44] tcp: sack option handling improvements
2024-10-15 10:28 [PATCH net-next 00/44] DualPI2, Accurate ECN, TCP Prague patch series chia-yu.chang
` (19 preceding siblings ...)
2024-10-15 10:29 ` [PATCH net-next 20/44] tcp: accecn: AccECN needs to know delivered bytes chia-yu.chang
@ 2024-10-15 10:29 ` chia-yu.chang
2024-10-15 10:29 ` [PATCH net-next 22/44] tcp: accecn: AccECN option chia-yu.chang
` (23 subsequent siblings)
44 siblings, 0 replies; 56+ messages in thread
From: chia-yu.chang @ 2024-10-15 10:29 UTC (permalink / raw)
To: netdev, ij, ncardwell, koen.de_schepper, g.white,
ingemar.s.johansson, mirja.kuehlewind, cheshire, rs.ietf,
Jason_Livingood, vidhi_goel
Cc: Chia-Yu Chang
From: Ilpo Järvinen <ij@kernel.org>
1) Don't return early when the SACK option doesn't fit (the
option-space arithmetic is sketched below). AccECN code will be
placed after this fragment, so no early returns please.
2) Make sure opts->num_sack_blocks is not left undefined. E.g.,
tcp_current_mss() does not memset its opts struct to zero.
AccECN code checks whether the SACK option is present and may even
alter it to make room for the AccECN option when many SACK blocks
are present. Thus, num_sack_blocks always needs to be valid.
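As a rough sketch of that option-space arithmetic (the constants match
the TCP option sizes, the helper is made up); note how it always
produces a defined value instead of returning early:

#include <stdio.h>

#define MAX_TCP_OPTION_SPACE		40
#define TCPOLEN_SACK_BASE_ALIGNED	4
#define TCPOLEN_SACK_PERBLOCK		8

/* How many SACK blocks fit after 'size' bytes of other options;
 * returns 0 (never undefined) when nothing fits, so code placed after
 * this computation can still inspect the result.
 */
static unsigned int sack_blocks_that_fit(unsigned int size,
					 unsigned int eff_sacks)
{
	unsigned int remaining = MAX_TCP_OPTION_SPACE - size;
	unsigned int fit;

	if (remaining < TCPOLEN_SACK_BASE_ALIGNED + TCPOLEN_SACK_PERBLOCK)
		return 0;
	fit = (remaining - TCPOLEN_SACK_BASE_ALIGNED) / TCPOLEN_SACK_PERBLOCK;
	return fit < eff_sacks ? fit : eff_sacks;
}

int main(void)
{
	/* e.g. 12 bytes of timestamps in place, 4 SACK blocks pending */
	printf("%u SACK blocks fit\n", sack_blocks_that_fit(12, 4));
	return 0;
}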
Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
---
net/ipv4/tcp_output.c | 23 ++++++++++++-----------
1 file changed, 12 insertions(+), 11 deletions(-)
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index becaf0e2ffce..d6f16c82eb1b 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1089,17 +1089,18 @@ static unsigned int tcp_established_options(struct sock *sk, struct sk_buff *skb
eff_sacks = tp->rx_opt.num_sacks + tp->rx_opt.dsack;
if (unlikely(eff_sacks)) {
const unsigned int remaining = MAX_TCP_OPTION_SPACE - size;
- if (unlikely(remaining < TCPOLEN_SACK_BASE_ALIGNED +
- TCPOLEN_SACK_PERBLOCK))
- return size;
-
- opts->num_sack_blocks =
- min_t(unsigned int, eff_sacks,
- (remaining - TCPOLEN_SACK_BASE_ALIGNED) /
- TCPOLEN_SACK_PERBLOCK);
-
- size += TCPOLEN_SACK_BASE_ALIGNED +
- opts->num_sack_blocks * TCPOLEN_SACK_PERBLOCK;
+ if (likely(remaining >= TCPOLEN_SACK_BASE_ALIGNED +
+ TCPOLEN_SACK_PERBLOCK)) {
+ opts->num_sack_blocks =
+ min_t(unsigned int, eff_sacks,
+ (remaining - TCPOLEN_SACK_BASE_ALIGNED) /
+ TCPOLEN_SACK_PERBLOCK);
+
+ size += TCPOLEN_SACK_BASE_ALIGNED +
+ opts->num_sack_blocks * TCPOLEN_SACK_PERBLOCK;
+ }
+ } else {
+ opts->num_sack_blocks = 0;
}
if (unlikely(BPF_SOCK_OPS_TEST_FLAG(tp,
--
2.34.1
^ permalink raw reply related [flat|nested] 56+ messages in thread
* [PATCH net-next 22/44] tcp: accecn: AccECN option
2024-10-15 10:28 [PATCH net-next 00/44] DualPI2, Accurate ECN, TCP Prague patch series chia-yu.chang
` (20 preceding siblings ...)
2024-10-15 10:29 ` [PATCH net-next 21/44] tcp: sack option handling improvements chia-yu.chang
@ 2024-10-15 10:29 ` chia-yu.chang
2024-10-16 1:32 ` Jakub Kicinski
2024-10-15 10:29 ` [PATCH net-next 23/44] tcp: accecn: AccECN option send control chia-yu.chang
` (22 subsequent siblings)
44 siblings, 1 reply; 56+ messages in thread
From: chia-yu.chang @ 2024-10-15 10:29 UTC (permalink / raw)
To: netdev, ij, ncardwell, koen.de_schepper, g.white,
ingemar.s.johansson, mirja.kuehlewind, cheshire, rs.ietf,
Jason_Livingood, vidhi_goel
Cc: Chia-Yu Chang
From: Ilpo Järvinen <ij@kernel.org>
Accurate ECN allows echoing back the sum of bytes for each
IP ECN field value in the received packets using the AccECN
option. This change implements the AccECN option tx & rx side
processing without the option send control related features
that are added by a later change.
Based on the specification:
https://tools.ietf.org/id/draft-ietf-tcpm-accurate-ecn-28.txt
(Some features of the spec will be added in later changes
rather than in this one).
A full-length AccECN option is always attempted, but if it does
not fit, the minimum length is selected based on the counters
that have changed since the last update. The AccECN option
(with 24-bit fields) often ends in odd sizes, so the option
write code tries to take advantage of the NOP padding used
for the other TCP options.
The delivered_ecn_bytes counters pair with received_ecn_bytes
similarly to how delivered_ce pairs with received_ce. In
contrast to the ACE field, however, the option is not always
available to update delivered_ecn_bytes. For ACKs without the
AccECN option, the delivered bytes calculated based on the
cumulative ACK+SACK information are assigned to one of the
counters using an estimation heuristic to select the most
likely ECN byte counter. Any estimation error is corrected
when the next AccECN option arrives. The heuristic may get too
confused when there are enough different byte counter deltas
between ACKs with the AccECN option, in which case it just
gives up on updating the counters for a while.
The tcp_ecn_option sysctl can be used to select the option
sending mode for AccECN.
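As a rough user-space sketch of the wire format described above (kind
0xAC or 0xAE, one length byte, then up to three 24-bit byte counters)
and of how a shorter option could stop after the counters that changed
since the last update; the encoder is made up, not the kernel's option
write code:

#include <stdio.h>
#include <stdint.h>

#define TCPOPT_ACCECN0		172	/* 0xAC: Order 0 */
#define TCPOPT_ACCECN1		174	/* 0xAE: Order 1: ECT(1), CE, ECT(0) */
#define TCPOLEN_ACCECN_BASE	2
#define TCPOLEN_ACCECN_PERFIELD	3

/* Write an Order-1 AccECN option carrying 'fields' 24-bit counters
 * (ECT(1) bytes, CE bytes, ECT(0) bytes in that order); returns the
 * option length in bytes.
 */
static unsigned int write_accecn_opt(uint8_t *p, const uint32_t *counters,
				     unsigned int fields)
{
	unsigned int len = TCPOLEN_ACCECN_BASE +
			   fields * TCPOLEN_ACCECN_PERFIELD;

	p[0] = TCPOPT_ACCECN1;
	p[1] = len;
	for (unsigned int i = 0; i < fields; i++) {
		uint32_t v = counters[i] & 0xffffff;	/* 24 bits, wraps */

		p[2 + i * 3] = v >> 16;
		p[3 + i * 3] = v >> 8;
		p[4 + i * 3] = v;
	}
	return len;
}

int main(void)
{
	/* ECT(1), CE, ECT(0) byte counters; only the first two changed
	 * recently, so a minimum-length option stops after two fields.
	 */
	uint32_t counters[3] = { 4345, 1449, 123456 };
	uint8_t buf[TCPOLEN_ACCECN_BASE + 3 * TCPOLEN_ACCECN_PERFIELD];
	unsigned int len = write_accecn_opt(buf, counters, 2);

	printf("option length %u:", len);
	for (unsigned int i = 0; i < len; i++)
		printf(" %02x", buf[i]);
	printf("\n");
	return 0;
}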
Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
---
include/linux/tcp.h | 8 +-
include/net/netns/ipv4.h | 1 +
include/net/tcp.h | 13 +++
include/uapi/linux/tcp.h | 7 ++
net/ipv4/sysctl_net_ipv4.c | 9 ++
net/ipv4/tcp.c | 12 ++-
net/ipv4/tcp_input.c | 164 +++++++++++++++++++++++++++++++++++--
net/ipv4/tcp_ipv4.c | 1 +
net/ipv4/tcp_output.c | 116 ++++++++++++++++++++++++++
9 files changed, 321 insertions(+), 10 deletions(-)
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index aaf84044e127..1d53b184e05e 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -122,8 +122,9 @@ struct tcp_options_received {
smc_ok : 1, /* SMC seen on SYN packet */
snd_wscale : 4, /* Window scaling received from sender */
rcv_wscale : 4; /* Window scaling to send to receiver */
- u8 saw_unknown:1, /* Received unknown option */
- unused:7;
+ u8 accecn:6, /* AccECN index in header, 0=no options */
+ saw_unknown:1, /* Received unknown option */
+ unused:1;
u8 num_sacks; /* Number of SACK blocks */
u16 user_mss; /* mss requested by user in ioctl */
u16 mss_clamp; /* Maximal mss, negotiated at connection setup */
@@ -298,10 +299,13 @@ struct tcp_sock {
u32 snd_up; /* Urgent pointer */
u32 delivered; /* Total data packets delivered incl. rexmits */
u32 delivered_ce; /* Like the above but only ECE marked packets */
+ u32 delivered_ecn_bytes[3];
u32 received_ce; /* Like the above but for received CE marked packets */
u32 received_ecn_bytes[3];
u8 received_ce_pending:4, /* Not yet transmitted cnt of received_ce */
unused2:4;
+ u8 accecn_minlen:2,/* Minimum length of AccECN option sent */
+ estimate_ecnfield:2;/* ECN field for AccECN delivered estimates */
u32 app_limited; /* limited until "delivered" reaches this val */
u32 rcv_wnd; /* Current receiver window */
/*
diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index 3c014170e001..8a186e99917b 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -135,6 +135,7 @@ struct netns_ipv4 {
struct local_ports ip_local_ports;
u8 sysctl_tcp_ecn;
+ u8 sysctl_tcp_ecn_option;
u8 sysctl_tcp_ecn_fallback;
u8 sysctl_ip_default_ttl;
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 56d009723c91..adc520b6eeca 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -202,6 +202,8 @@ static_assert((1 << ATO_BITS) > TCP_DELACK_MAX);
#define TCPOPT_AO 29 /* Authentication Option (RFC5925) */
#define TCPOPT_MPTCP 30 /* Multipath TCP (RFC6824) */
#define TCPOPT_FASTOPEN 34 /* Fast open (RFC7413) */
+#define TCPOPT_ACCECN0 172 /* 0xAC: Accurate ECN Order 0 */
+#define TCPOPT_ACCECN1 174 /* 0xAE: Accurate ECN Order 1 */
#define TCPOPT_EXP 254 /* Experimental */
/* Magic number to be after the option value for sharing TCP
* experimental options. See draft-ietf-tcpm-experimental-options-00.txt
@@ -219,6 +221,7 @@ static_assert((1 << ATO_BITS) > TCP_DELACK_MAX);
#define TCPOLEN_TIMESTAMP 10
#define TCPOLEN_MD5SIG 18
#define TCPOLEN_FASTOPEN_BASE 2
+#define TCPOLEN_ACCECN_BASE 2
#define TCPOLEN_EXP_FASTOPEN_BASE 4
#define TCPOLEN_EXP_SMC_BASE 6
@@ -232,6 +235,13 @@ static_assert((1 << ATO_BITS) > TCP_DELACK_MAX);
#define TCPOLEN_MD5SIG_ALIGNED 20
#define TCPOLEN_MSS_ALIGNED 4
#define TCPOLEN_EXP_SMC_BASE_ALIGNED 8
+#define TCPOLEN_ACCECN_PERFIELD 3
+
+/* Maximum number of byte counters in AccECN option + size */
+#define TCP_ACCECN_NUMFIELDS 3
+#define TCP_ACCECN_MAXSIZE (TCPOLEN_ACCECN_BASE + \
+ TCPOLEN_ACCECN_PERFIELD * \
+ TCP_ACCECN_NUMFIELDS)
/* tp->accecn_fail_mode */
#define TCP_ACCECN_ACE_FAIL_SEND BIT(0)
@@ -1030,6 +1040,9 @@ static inline void tcp_accecn_init_counters(struct tcp_sock *tp)
tp->received_ce = 0;
tp->received_ce_pending = 0;
__tcp_accecn_init_bytes_counters(tp->received_ecn_bytes);
+ __tcp_accecn_init_bytes_counters(tp->delivered_ecn_bytes);
+ tp->accecn_minlen = 0;
+ tp->estimate_ecnfield = 0;
}
/* State flags for sacked in struct tcp_skb_cb */
diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
index 3fe08d7dddaf..8c21fa0463e9 100644
--- a/include/uapi/linux/tcp.h
+++ b/include/uapi/linux/tcp.h
@@ -295,6 +295,13 @@ struct tcp_info {
__u32 tcpi_snd_wnd; /* peer's advertised receive window after
* scaling (bytes)
*/
+ __u32 tcpi_received_ce; /* # of CE marks received */
+ __u32 tcpi_delivered_e1_bytes; /* Accurate ECN byte counters */
+ __u32 tcpi_delivered_e0_bytes;
+ __u32 tcpi_delivered_ce_bytes;
+ __u32 tcpi_received_e1_bytes;
+ __u32 tcpi_received_e0_bytes;
+ __u32 tcpi_received_ce_bytes;
__u32 tcpi_rcv_wnd; /* local advertised receive window after
* scaling (bytes)
*/
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 01fcc6b2045b..0d7c0fea150b 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -728,6 +728,15 @@ static struct ctl_table ipv4_net_table[] = {
.extra1 = SYSCTL_ZERO,
.extra2 = SYSCTL_FIVE,
},
+ {
+ .procname = "tcp_ecn_option",
+ .data = &init_net.ipv4.sysctl_tcp_ecn_option,
+ .maxlen = sizeof(u8),
+ .mode = 0644,
+ .proc_handler = proc_dou8vec_minmax,
+ .extra1 = SYSCTL_ZERO,
+ .extra2 = SYSCTL_TWO,
+ },
{
.procname = "tcp_ecn_fallback",
.data = &init_net.ipv4.sysctl_tcp_ecn_fallback,
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 39b20901ac6f..ea1fbafd4fd9 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -270,6 +270,7 @@
#include <net/icmp.h>
#include <net/inet_common.h>
+#include <net/inet_ecn.h>
#include <net/tcp.h>
#include <net/mptcp.h>
#include <net/proto_memory.h>
@@ -4178,6 +4179,14 @@ void tcp_get_info(struct sock *sk, struct tcp_info *info)
info->tcpi_rehash = tp->plb_rehash + tp->timeout_rehash;
info->tcpi_fastopen_client_fail = tp->fastopen_client_fail;
+ info->tcpi_received_ce = tp->received_ce;
+ info->tcpi_delivered_e1_bytes = tp->delivered_ecn_bytes[INET_ECN_ECT_1 - 1];
+ info->tcpi_delivered_e0_bytes = tp->delivered_ecn_bytes[INET_ECN_ECT_0 - 1];
+ info->tcpi_delivered_ce_bytes = tp->delivered_ecn_bytes[INET_ECN_CE - 1];
+ info->tcpi_received_e1_bytes = tp->received_ecn_bytes[INET_ECN_ECT_1 - 1];
+ info->tcpi_received_e0_bytes = tp->received_ecn_bytes[INET_ECN_ECT_0 - 1];
+ info->tcpi_received_ce_bytes = tp->received_ecn_bytes[INET_ECN_CE - 1];
+
info->tcpi_total_rto = tp->total_rto;
info->tcpi_total_rto_recoveries = tp->total_rto_recoveries;
info->tcpi_total_rto_time = tp->total_rto_time;
@@ -5028,6 +5037,7 @@ static void __init tcp_struct_check(void)
CACHELINE_ASSERT_GROUP_MEMBER(struct tcp_sock, tcp_sock_write_txrx, snd_up);
CACHELINE_ASSERT_GROUP_MEMBER(struct tcp_sock, tcp_sock_write_txrx, delivered);
CACHELINE_ASSERT_GROUP_MEMBER(struct tcp_sock, tcp_sock_write_txrx, delivered_ce);
+ CACHELINE_ASSERT_GROUP_MEMBER(struct tcp_sock, tcp_sock_write_txrx, delivered_ecn_bytes);
CACHELINE_ASSERT_GROUP_MEMBER(struct tcp_sock, tcp_sock_write_txrx, received_ce);
CACHELINE_ASSERT_GROUP_MEMBER(struct tcp_sock, tcp_sock_write_txrx, received_ecn_bytes);
CACHELINE_ASSERT_GROUP_MEMBER(struct tcp_sock, tcp_sock_write_txrx, app_limited);
@@ -5037,7 +5047,7 @@ static void __init tcp_struct_check(void)
/* 32bit arches with 8byte alignment on u64 fields might need padding
* before tcp_clock_cache.
*/
- CACHELINE_ASSERT_GROUP_SIZE(struct tcp_sock, tcp_sock_write_txrx, 109 + 3);
+ CACHELINE_ASSERT_GROUP_SIZE(struct tcp_sock, tcp_sock_write_txrx, 122 + 6);
/* RX read-write hotpath cache lines */
CACHELINE_ASSERT_GROUP_MEMBER(struct tcp_sock, tcp_sock_write_rx, bytes_received);
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index f70b65034e45..6daeced890f7 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -70,6 +70,7 @@
#include <linux/sysctl.h>
#include <linux/kernel.h>
#include <linux/prefetch.h>
+#include <linux/bitops.h>
#include <net/dst.h>
#include <net/tcp.h>
#include <net/proto_memory.h>
@@ -490,6 +491,136 @@ static bool tcp_ecn_rcv_ecn_echo(const struct tcp_sock *tp, const struct tcphdr
return false;
}
+/* Maps IP ECN field ECT/CE code point to AccECN option field number, given
+ * we are sending fields with Accurate ECN Order 1: ECT(1), CE, ECT(0).
+ */
+static u8 tcp_ecnfield_to_accecn_optfield(u8 ecnfield)
+{
+ switch (ecnfield) {
+ case INET_ECN_NOT_ECT:
+ return 0; /* AccECN does not send counts of NOT_ECT */
+ case INET_ECN_ECT_1:
+ return 1;
+ case INET_ECN_CE:
+ return 2;
+ case INET_ECN_ECT_0:
+ return 3;
+ default:
+ WARN_ONCE(1, "bad ECN code point: %d\n", ecnfield);
+ }
+ return 0;
+}
+
+/* Maps IP ECN field ECT/CE code point to AccECN option field value offset.
+ * Some fields do not start from zero, to detect zeroing by middleboxes.
+ */
+static u32 tcp_accecn_field_init_offset(u8 ecnfield)
+{
+ switch (ecnfield) {
+ case INET_ECN_NOT_ECT:
+ return 0; /* AccECN does not send counts of NOT_ECT */
+ case INET_ECN_ECT_1:
+ return TCP_ACCECN_E1B_INIT_OFFSET;
+ case INET_ECN_CE:
+ return TCP_ACCECN_CEB_INIT_OFFSET;
+ case INET_ECN_ECT_0:
+ return TCP_ACCECN_E0B_INIT_OFFSET;
+ default:
+ WARN_ONCE(1, "bad ECN code point: %d\n", ecnfield);
+ }
+ return 0;
+}
+
+/* Maps AccECN option field #nr to IP ECN field ECT/CE bits */
+static unsigned int tcp_accecn_optfield_to_ecnfield(unsigned int optfield, bool order)
+{
+ u8 tmp;
+
+ optfield = order ? 2 - optfield : optfield;
+ tmp = optfield + 2;
+
+ return (tmp + (tmp >> 2)) & INET_ECN_MASK;
+}
+
+/* Handles updating the AccECN option ECT and CE 24-bit byte counters into
+ * the u32 value in tcp_sock. As we're processing TCP options, it is
+ * safe to access (from - 1).
+ */
+static s32 tcp_update_ecn_bytes(u32 *cnt, const char *from, u32 init_offset)
+{
+ u32 truncated = (get_unaligned_be32(from - 1) - init_offset) & 0xFFFFFFU;
+ u32 delta = (truncated - *cnt) & 0xFFFFFFU;
+
+ /* If delta has the highest bit set (24th bit) indicating negative,
+ * sign extend to correct an estimation using sign_extend32(delta, 24 - 1)
+ */
+ delta = sign_extend32(delta, 23);
+ *cnt += delta;
+ return (s32)delta;
+}
+
+/* Returns true if the byte counters can be used */
+static bool tcp_accecn_process_option(struct tcp_sock *tp,
+ const struct sk_buff *skb,
+ u32 delivered_bytes, int flag)
+{
+ u8 estimate_ecnfield = tp->estimate_ecnfield;
+ bool ambiguous_ecn_bytes_incr = false;
+ bool first_changed = false;
+ unsigned int optlen;
+ unsigned char *ptr;
+ bool order1, res;
+ unsigned int i;
+
+ if (!(flag & FLAG_SLOWPATH) || !tp->rx_opt.accecn) {
+ if (estimate_ecnfield) {
+ tp->delivered_ecn_bytes[estimate_ecnfield - 1] += delivered_bytes;
+ return true;
+ }
+ return false;
+ }
+
+ ptr = skb_transport_header(skb) + tp->rx_opt.accecn;
+ optlen = ptr[1] - 2;
+ WARN_ON_ONCE(ptr[0] != TCPOPT_ACCECN0 && ptr[0] != TCPOPT_ACCECN1);
+ order1 = (ptr[0] == TCPOPT_ACCECN1);
+ ptr += 2;
+
+ res = !!estimate_ecnfield;
+ for (i = 0; i < 3; i++) {
+ if (optlen >= TCPOLEN_ACCECN_PERFIELD) {
+ u8 ecnfield = tcp_accecn_optfield_to_ecnfield(i, order1);
+ u32 init_offset = tcp_accecn_field_init_offset(ecnfield);
+ s32 delta;
+
+ delta = tcp_update_ecn_bytes(&tp->delivered_ecn_bytes[ecnfield - 1],
+ ptr, init_offset);
+ if (delta) {
+ if (delta < 0) {
+ res = false;
+ ambiguous_ecn_bytes_incr = true;
+ }
+ if (ecnfield != estimate_ecnfield) {
+ if (!first_changed) {
+ tp->estimate_ecnfield = ecnfield;
+ first_changed = true;
+ } else {
+ res = false;
+ ambiguous_ecn_bytes_incr = true;
+ }
+ }
+ }
+
+ optlen -= TCPOLEN_ACCECN_PERFIELD;
+ ptr += TCPOLEN_ACCECN_PERFIELD;
+ }
+ }
+ if (ambiguous_ecn_bytes_incr)
+ tp->estimate_ecnfield = 0;
+
+ return res;
+}
+
static void tcp_count_delivered_ce(struct tcp_sock *tp, u32 ecn_count)
{
tp->delivered_ce += ecn_count;
@@ -506,7 +637,7 @@ static void tcp_count_delivered(struct tcp_sock *tp, u32 delivered,
/* Returns the ECN CE delta */
static u32 __tcp_accecn_process(struct sock *sk, const struct sk_buff *skb,
- u32 delivered_pkts, int flag)
+ u32 delivered_pkts, u32 delivered_bytes, int flag)
{
struct tcp_sock *tp = tcp_sk(sk);
u32 delta, safe_delta;
@@ -516,6 +647,8 @@ static u32 __tcp_accecn_process(struct sock *sk, const struct sk_buff *skb,
if (!(flag & (FLAG_FORWARD_PROGRESS | FLAG_TS_PROGRESS)))
return 0;
+ tcp_accecn_process_option(tp, skb, delivered_bytes, flag);
+
if (!(flag & FLAG_SLOWPATH)) {
/* AccECN counter might overflow on large ACKs */
if (delivered_pkts <= TCP_ACCECN_CEP_ACE_MASK)
@@ -540,12 +673,13 @@ static u32 __tcp_accecn_process(struct sock *sk, const struct sk_buff *skb,
}
static u32 tcp_accecn_process(struct sock *sk, const struct sk_buff *skb,
- u32 delivered_pkts, int *flag)
+ u32 delivered_pkts, u32 delivered_bytes, int *flag)
{
u32 delta;
struct tcp_sock *tp = tcp_sk(sk);
- delta = __tcp_accecn_process(sk, skb, delivered_pkts, *flag);
+ delta = __tcp_accecn_process(sk, skb, delivered_pkts,
+ delivered_bytes, *flag);
if (delta > 0) {
tcp_count_delivered_ce(tp, delta);
*flag |= FLAG_ECE;
@@ -4198,7 +4332,8 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
tcp_rack_update_reo_wnd(sk, &rs);
if (tcp_ecn_mode_accecn(tp))
- ecn_count = tcp_accecn_process(sk, skb, tp->delivered - delivered, &flag);
+ ecn_count = tcp_accecn_process(sk, skb, tp->delivered - delivered,
+ sack_state.delivered_bytes, &flag);
tcp_in_ack_event(sk, flag);
@@ -4235,7 +4370,8 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
no_queue:
if (tcp_ecn_mode_accecn(tp))
- ecn_count = tcp_accecn_process(sk, skb, tp->delivered - delivered, &flag);
+ ecn_count = tcp_accecn_process(sk, skb, tp->delivered - delivered,
+ sack_state.delivered_bytes, &flag);
tcp_in_ack_event(sk, flag);
/* If data was DSACKed, see if we can undo a cwnd reduction. */
if (flag & FLAG_DSACKING_ACK) {
@@ -4363,6 +4499,7 @@ void tcp_parse_options(const struct net *net,
ptr = (const unsigned char *)(th + 1);
opt_rx->saw_tstamp = 0;
+ opt_rx->accecn = 0;
opt_rx->saw_unknown = 0;
while (length > 0) {
@@ -4454,6 +4591,12 @@ void tcp_parse_options(const struct net *net,
ptr, th->syn, foc, false);
break;
+ case TCPOPT_ACCECN0:
+ case TCPOPT_ACCECN1:
+ /* Save offset of AccECN option in TCP header */
+ opt_rx->accecn = (ptr - 2) - (__u8 *)th;
+ break;
+
case TCPOPT_EXP:
/* Fast Open option shares code 254 using a
* 16 bits magic number.
@@ -4514,11 +4657,14 @@ static bool tcp_fast_parse_options(const struct net *net,
*/
if (th->doff == (sizeof(*th) / 4)) {
tp->rx_opt.saw_tstamp = 0;
+ tp->rx_opt.accecn = 0;
return false;
} else if (tp->rx_opt.tstamp_ok &&
th->doff == ((sizeof(*th) + TCPOLEN_TSTAMP_ALIGNED) / 4)) {
- if (tcp_parse_aligned_timestamp(tp, th))
+ if (tcp_parse_aligned_timestamp(tp, th)) {
+ tp->rx_opt.accecn = 0;
return true;
+ }
}
tcp_parse_options(net, skb, &tp->rx_opt, 1, NULL);
@@ -6111,8 +6257,11 @@ void tcp_ecn_received_counters(struct sock *sk, const struct sk_buff *skb,
tp->received_ce += pcount;
tp->received_ce_pending = min(tp->received_ce_pending + pcount, 0xfU);
- if (payload_len > 0)
+ if (payload_len > 0) {
+ u8 minlen = tcp_ecnfield_to_accecn_optfield(ecnfield);
tp->received_ecn_bytes[ecnfield - 1] += payload_len;
+ tp->accecn_minlen = max_t(u8, tp->accecn_minlen, minlen);
+ }
}
}
@@ -6322,6 +6471,7 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb)
*/
tp->rx_opt.saw_tstamp = 0;
+ tp->rx_opt.accecn = 0;
/* pred_flags is 0xS?10 << 16 + snd_wnd
* if header_prediction is to be made
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 97df9f36714c..e632327f19f8 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -3447,6 +3447,7 @@ static void __net_init tcp_set_hashinfo(struct net *net)
static int __net_init tcp_sk_init(struct net *net)
{
net->ipv4.sysctl_tcp_ecn = 2;
+ net->ipv4.sysctl_tcp_ecn_option = 2;
net->ipv4.sysctl_tcp_ecn_fallback = 1;
net->ipv4.sysctl_tcp_base_mss = TCP_BASE_MSS;
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index d6f16c82eb1b..bddd0b309443 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -487,6 +487,7 @@ static inline bool tcp_urg_mode(const struct tcp_sock *tp)
#define OPTION_SMC BIT(9)
#define OPTION_MPTCP BIT(10)
#define OPTION_AO BIT(11)
+#define OPTION_ACCECN BIT(12)
static void smc_options_write(__be32 *ptr, u16 *options)
{
@@ -508,12 +509,14 @@ struct tcp_out_options {
u16 mss; /* 0 to disable */
u8 ws; /* window scale, 0 to disable */
u8 num_sack_blocks; /* number of SACK blocks to include */
+ u8 num_accecn_fields; /* number of AccECN fields needed */
u8 hash_size; /* bytes in hash_location */
u8 bpf_opt_len; /* length of BPF hdr option */
__u8 *hash_location; /* temporary pointer, overloaded */
__u32 tsval, tsecr; /* need to include OPTION_TS */
struct tcp_fastopen_cookie *fastopen_cookie; /* Fast open cookie */
struct mptcp_out_options mptcp;
+ u32 *ecn_bytes; /* AccECN ECT/CE byte counters */
};
static void mptcp_options_write(struct tcphdr *th, __be32 *ptr,
@@ -760,6 +763,39 @@ static void tcp_options_write(struct tcphdr *th, struct tcp_sock *tp,
*ptr++ = htonl(opts->tsecr);
}
+ if (OPTION_ACCECN & options) {
+ u32 e0b = opts->ecn_bytes[INET_ECN_ECT_0 - 1] + TCP_ACCECN_E0B_INIT_OFFSET;
+ u32 e1b = opts->ecn_bytes[INET_ECN_ECT_1 - 1] + TCP_ACCECN_E1B_INIT_OFFSET;
+ u32 ceb = opts->ecn_bytes[INET_ECN_CE - 1] + TCP_ACCECN_CEB_INIT_OFFSET;
+ u8 len = TCPOLEN_ACCECN_BASE +
+ opts->num_accecn_fields * TCPOLEN_ACCECN_PERFIELD;
+
+ if (opts->num_accecn_fields == 2) {
+ *ptr++ = htonl((TCPOPT_ACCECN1 << 24) | (len << 16) |
+ ((e1b >> 8) & 0xffff));
+ *ptr++ = htonl(((e1b & 0xff) << 24) |
+ (ceb & 0xffffff));
+ } else if (opts->num_accecn_fields == 1) {
+ *ptr++ = htonl((TCPOPT_ACCECN1 << 24) | (len << 16) |
+ ((e1b >> 8) & 0xffff));
+ leftover_bytes = ((e1b & 0xff) << 8) |
+ TCPOPT_NOP;
+ leftover_size = 1;
+ } else if (opts->num_accecn_fields == 0) {
+ leftover_bytes = (TCPOPT_ACCECN1 << 8) | len;
+ leftover_size = 2;
+ } else if (opts->num_accecn_fields == 3) {
+ *ptr++ = htonl((TCPOPT_ACCECN1 << 24) | (len << 16) |
+ ((e1b >> 8) & 0xffff));
+ *ptr++ = htonl(((e1b & 0xff) << 24) |
+ (ceb & 0xffffff));
+ *ptr++ = htonl(((e0b & 0xffffff) << 8) |
+ TCPOPT_NOP);
+ }
+ if (tp)
+ tp->accecn_minlen = 0;
+ }
+
if (unlikely(OPTION_SACK_ADVERTISE & options)) {
*ptr++ = htonl((leftover_bytes << 16) |
(TCPOPT_SACK_PERM << 8) |
@@ -880,6 +916,60 @@ static void mptcp_set_option_cond(const struct request_sock *req,
}
}
+/* Initial values for AccECN option; the order is based on ECN field bits,
+ * similar to received_ecn_bytes. Used for the SYN/ACK AccECN option.
+ */
+u32 synack_ecn_bytes[3] = { 0, 0, 0 };
+
+static u32 tcp_synack_options_combine_saving(struct tcp_out_options *opts)
+{
+ /* How much room is there for combining with the alignment padding? */
+ if ((opts->options & (OPTION_SACK_ADVERTISE | OPTION_TS)) ==
+ OPTION_SACK_ADVERTISE)
+ return 2;
+ else if (opts->options & OPTION_WSCALE)
+ return 1;
+ return 0;
+}
+
+/* Calculates how long an AccECN option will fit into @remaining option space.
+ *
+ * AccECN option can sometimes replace NOPs used for alignment of other
+ * TCP options (up to @max_combine_saving available).
+ *
+ * Only solutions with at least @required AccECN fields are accepted.
+ *
+ * Returns: The size of the AccECN option excluding space repurposed from
+ * the alignment of the other options.
+ */
+static int tcp_options_fit_accecn(struct tcp_out_options *opts, int required,
+ int remaining, int max_combine_saving)
+{
+ int size = TCP_ACCECN_MAXSIZE;
+
+ opts->num_accecn_fields = TCP_ACCECN_NUMFIELDS;
+
+ while (opts->num_accecn_fields >= required) {
+ int leftover_size = size & 0x3;
+ /* Pad to dword if cannot combine */
+ if (leftover_size > max_combine_saving)
+ leftover_size = -((4 - leftover_size) & 0x3);
+
+ if (remaining >= size - leftover_size) {
+ size -= leftover_size;
+ break;
+ }
+
+ opts->num_accecn_fields--;
+ size -= TCPOLEN_ACCECN_PERFIELD;
+ }
+ if (opts->num_accecn_fields < required)
+ return 0;
+
+ opts->options |= OPTION_ACCECN;
+ return size;
+}
+
/* Compute TCP options for SYN packets. This is not the final
* network wire format yet.
*/
@@ -960,6 +1050,16 @@ static unsigned int tcp_syn_options(struct sock *sk, struct sk_buff *skb,
}
}
+ /* Simultaneous open SYN/ACK needs AccECN option but not SYN */
+ if (unlikely((TCP_SKB_CB(skb)->tcp_flags & TCPHDR_ACK) &&
+ tcp_ecn_mode_accecn(tp) &&
+ sock_net(sk)->ipv4.sysctl_tcp_ecn_option &&
+ remaining >= TCPOLEN_ACCECN_BASE)) {
+ opts->ecn_bytes = synack_ecn_bytes;
+ remaining -= tcp_options_fit_accecn(opts, 0, remaining,
+ tcp_synack_options_combine_saving(opts));
+ }
+
bpf_skops_hdr_opt_len(sk, skb, NULL, NULL, 0, opts, &remaining);
return MAX_TCP_OPTION_SPACE - remaining;
@@ -977,6 +1077,7 @@ static unsigned int tcp_synack_options(const struct sock *sk,
{
struct inet_request_sock *ireq = inet_rsk(req);
unsigned int remaining = MAX_TCP_OPTION_SPACE;
+ struct tcp_request_sock *treq = tcp_rsk(req);
if (tcp_key_is_md5(key)) {
opts->options |= OPTION_MD5;
@@ -1033,6 +1134,13 @@ static unsigned int tcp_synack_options(const struct sock *sk,
smc_set_option_cond(tcp_sk(sk), ireq, opts, &remaining);
+ if (treq->accecn_ok && sock_net(sk)->ipv4.sysctl_tcp_ecn_option &&
+ remaining >= TCPOLEN_ACCECN_BASE) {
+ opts->ecn_bytes = synack_ecn_bytes;
+ remaining -= tcp_options_fit_accecn(opts, 0, remaining,
+ tcp_synack_options_combine_saving(opts));
+ }
+
bpf_skops_hdr_opt_len((struct sock *)sk, skb, req, syn_skb,
synack_type, opts, &remaining);
@@ -1103,6 +1211,14 @@ static unsigned int tcp_established_options(struct sock *sk, struct sk_buff *skb
opts->num_sack_blocks = 0;
}
+ if (tcp_ecn_mode_accecn(tp) &&
+ sock_net(sk)->ipv4.sysctl_tcp_ecn_option) {
+ opts->ecn_bytes = tp->received_ecn_bytes;
+ size += tcp_options_fit_accecn(opts, tp->accecn_minlen,
+ MAX_TCP_OPTION_SPACE - size,
+ opts->num_sack_blocks > 0 ? 2 : 0);
+ }
+
if (unlikely(BPF_SOCK_OPS_TEST_FLAG(tp,
BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG))) {
unsigned int remaining = MAX_TCP_OPTION_SPACE - size;
--
2.34.1
* [PATCH net-next 23/44] tcp: accecn: AccECN option send control
2024-10-15 10:28 [PATCH net-next 00/44] DualPI2, Accurate ECN, TCP Prague patch series chia-yu.chang
` (21 preceding siblings ...)
2024-10-15 10:29 ` [PATCH net-next 22/44] tcp: accecn: AccECN option chia-yu.chang
@ 2024-10-15 10:29 ` chia-yu.chang
2024-10-15 10:29 ` [PATCH net-next 24/44] tcp: accecn: AccECN option failure handling chia-yu.chang
` (21 subsequent siblings)
44 siblings, 0 replies; 56+ messages in thread
From: chia-yu.chang @ 2024-10-15 10:29 UTC (permalink / raw)
To: netdev, ij, ncardwell, koen.de_schepper, g.white,
ingemar.s.johansson, mirja.kuehlewind, cheshire, rs.ietf,
Jason_Livingood, vidhi_goel
Cc: Chia-Yu Chang
From: Ilpo Järvinen <ij@kernel.org>
Instead of sending the option in every ACK, limit sending to
those ACKs where the option is necessary:
- Handshake
- "Change-triggered ACK" + the ACK following it. The
2nd ACK is necessary to unambiguously indicate which
of the ECN byte counters in increasing. The first
ACK has two counters increasing due to the ecnfield
edge.
- ACKs with CE to allow CEP delta validations to take
advantage of the option.
- Force the option to be sent at least once per 2^22
bytes. The check is done using the bit edges of the
byte counters (avoiding the need for extra variables);
see the sketch after this list.
- AccECN option beacon to send the option a few times per
RTT even if nothing in the ECN state requires it. The
default is 3 times per RTT, and its period can be set
via sysctl_tcp_ecn_option_beacon.
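For illustration, here is a minimal, self-contained sketch of the
bit-edge check behind the once-per-2^22-bytes rule (plain C with local
names, not the kernel code): XORing the counter value before and after
the update and masking off the low 22 bits is non-zero exactly when the
counter crossed a 2^22-byte boundary, so no separate "bytes since last
option" variable is needed.

#include <stdio.h>

/* Non-zero iff any bit above bit 21 changed, i.e. the byte counter
 * moved into a new 2^22-byte bucket since the previous segment.
 */
static int crossed_edge(unsigned int oldbytes, unsigned int newbytes)
{
	return ((oldbytes ^ newbytes) & ~((1u << 22) - 1)) != 0;
}

int main(void)
{
	printf("%d\n", crossed_edge(4194000, 4194200)); /* 0: same bucket */
	printf("%d\n", crossed_edge(4194200, 4194400)); /* 1: crossed 2^22 */
	return 0;
}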
Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Co-developed-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
---
include/linux/tcp.h | 3 +++
include/net/netns/ipv4.h | 1 +
include/net/tcp.h | 1 +
net/ipv4/sysctl_net_ipv4.c | 9 +++++++++
net/ipv4/tcp.c | 5 ++++-
net/ipv4/tcp_input.c | 31 ++++++++++++++++++++++++++++++-
net/ipv4/tcp_ipv4.c | 1 +
net/ipv4/tcp_minisocks.c | 2 ++
net/ipv4/tcp_output.c | 36 +++++++++++++++++++++++++++++++-----
9 files changed, 82 insertions(+), 7 deletions(-)
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 1d53b184e05e..e4aa10fdc032 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -305,7 +305,10 @@ struct tcp_sock {
u8 received_ce_pending:4, /* Not yet transmitted cnt of received_ce */
unused2:4;
u8 accecn_minlen:2,/* Minimum length of AccECN option sent */
+ prev_ecnfield:2,/* ECN bits from the previous segment */
+ accecn_opt_demand:2,/* Demand AccECN option for n next ACKs */
estimate_ecnfield:2;/* ECN field for AccECN delivered estimates */
+ u64 accecn_opt_tstamp; /* Last AccECN option sent timestamp */
u32 app_limited; /* limited until "delivered" reaches this val */
u32 rcv_wnd; /* Current receiver window */
/*
diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index 8a186e99917b..87880307b68c 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -136,6 +136,7 @@ struct netns_ipv4 {
u8 sysctl_tcp_ecn;
u8 sysctl_tcp_ecn_option;
+ u8 sysctl_tcp_ecn_option_beacon;
u8 sysctl_tcp_ecn_fallback;
u8 sysctl_ip_default_ttl;
diff --git a/include/net/tcp.h b/include/net/tcp.h
index adc520b6eeca..b3cbf9a11dbc 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1042,6 +1042,7 @@ static inline void tcp_accecn_init_counters(struct tcp_sock *tp)
__tcp_accecn_init_bytes_counters(tp->received_ecn_bytes);
__tcp_accecn_init_bytes_counters(tp->delivered_ecn_bytes);
tp->accecn_minlen = 0;
+ tp->accecn_opt_demand = 0;
tp->estimate_ecnfield = 0;
}
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 0d7c0fea150b..987e74a41b09 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -737,6 +737,15 @@ static struct ctl_table ipv4_net_table[] = {
.extra1 = SYSCTL_ZERO,
.extra2 = SYSCTL_TWO,
},
+ {
+ .procname = "tcp_ecn_option_beacon",
+ .data = &init_net.ipv4.sysctl_tcp_ecn_option_beacon,
+ .maxlen = sizeof(u8),
+ .mode = 0644,
+ .proc_handler = proc_dou8vec_minmax,
+ .extra1 = SYSCTL_ZERO,
+ .extra2 = SYSCTL_FOUR,
+ },
{
.procname = "tcp_ecn_fallback",
.data = &init_net.ipv4.sysctl_tcp_ecn_fallback,
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index ea1fbafd4fd9..e59fd2cabe03 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -3340,6 +3340,8 @@ int tcp_disconnect(struct sock *sk, int flags)
tp->wait_third_ack = 0;
tp->accecn_fail_mode = 0;
tcp_accecn_init_counters(tp);
+ tp->prev_ecnfield = 0;
+ tp->accecn_opt_tstamp = 0;
if (icsk->icsk_ca_ops->release)
icsk->icsk_ca_ops->release(sk);
memset(icsk->icsk_ca_priv, 0, sizeof(icsk->icsk_ca_priv));
@@ -5040,6 +5042,7 @@ static void __init tcp_struct_check(void)
CACHELINE_ASSERT_GROUP_MEMBER(struct tcp_sock, tcp_sock_write_txrx, delivered_ecn_bytes);
CACHELINE_ASSERT_GROUP_MEMBER(struct tcp_sock, tcp_sock_write_txrx, received_ce);
CACHELINE_ASSERT_GROUP_MEMBER(struct tcp_sock, tcp_sock_write_txrx, received_ecn_bytes);
+ CACHELINE_ASSERT_GROUP_MEMBER(struct tcp_sock, tcp_sock_write_txrx, accecn_opt_tstamp);
CACHELINE_ASSERT_GROUP_MEMBER(struct tcp_sock, tcp_sock_write_txrx, app_limited);
CACHELINE_ASSERT_GROUP_MEMBER(struct tcp_sock, tcp_sock_write_txrx, rcv_wnd);
CACHELINE_ASSERT_GROUP_MEMBER(struct tcp_sock, tcp_sock_write_txrx, rx_opt);
@@ -5047,7 +5050,7 @@ static void __init tcp_struct_check(void)
/* 32bit arches with 8byte alignment on u64 fields might need padding
* before tcp_clock_cache.
*/
- CACHELINE_ASSERT_GROUP_SIZE(struct tcp_sock, tcp_sock_write_txrx, 122 + 6);
+ CACHELINE_ASSERT_GROUP_SIZE(struct tcp_sock, tcp_sock_write_txrx, 130 + 6);
/* RX read-write hotpath cache lines */
CACHELINE_ASSERT_GROUP_MEMBER(struct tcp_sock, tcp_sock_write_rx, bytes_received);
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 6daeced890f7..14b9a5e63687 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -459,6 +459,7 @@ static void tcp_ecn_rcv_synack(struct sock *sk, const struct tcphdr *th,
default:
tcp_ecn_mode_set(tp, TCP_ECN_MODE_ACCECN);
tp->syn_ect_rcv = ip_dsfield & INET_ECN_MASK;
+ tp->accecn_opt_demand = 2;
if (tcp_accecn_validate_syn_feedback(sk, ace, tp->syn_ect_snt) &&
INET_ECN_is_ce(ip_dsfield)) {
tp->received_ce++;
@@ -477,6 +478,7 @@ static void tcp_ecn_rcv_syn(struct tcp_sock *tp, const struct tcphdr *th,
tcp_ecn_mode_set(tp, TCP_ECN_MODE_RFC3168);
} else {
tp->syn_ect_rcv = TCP_SKB_CB(skb)->ip_dsfield & INET_ECN_MASK;
+ tp->prev_ecnfield = tp->syn_ect_rcv;
tcp_ecn_mode_set(tp, TCP_ECN_MODE_ACCECN);
}
}
@@ -6247,6 +6249,7 @@ void tcp_ecn_received_counters(struct sock *sk, const struct sk_buff *skb,
u8 ecnfield = TCP_SKB_CB(skb)->ip_dsfield & INET_ECN_MASK;
u8 is_ce = INET_ECN_is_ce(ecnfield);
struct tcp_sock *tp = tcp_sk(sk);
+ bool ecn_edge;
if (!INET_ECN_is_not_ect(ecnfield)) {
u32 pcount = is_ce * max_t(u16, 1, skb_shinfo(skb)->gso_segs);
@@ -6259,8 +6262,32 @@ void tcp_ecn_received_counters(struct sock *sk, const struct sk_buff *skb,
if (payload_len > 0) {
u8 minlen = tcp_ecnfield_to_accecn_optfield(ecnfield);
+ u32 oldbytes = tp->received_ecn_bytes[ecnfield - 1];
+
tp->received_ecn_bytes[ecnfield - 1] += payload_len;
tp->accecn_minlen = max_t(u8, tp->accecn_minlen, minlen);
+
+ /* Demand AccECN option at least every 2^22 bytes to
+ * avoid overflowing the ECN byte counters.
+ */
+ if ((tp->received_ecn_bytes[ecnfield - 1] ^ oldbytes) &
+ ~((1 << 22) - 1))
+ tp->accecn_opt_demand = max_t(u8, 1,
+ tp->accecn_opt_demand);
+ }
+ }
+
+ ecn_edge = tp->prev_ecnfield != ecnfield;
+ if (ecn_edge || is_ce) {
+ tp->prev_ecnfield = ecnfield;
+ /* Demand Accurate ECN change-triggered ACKs. Two ACKs are
+ * demanded to unambiguously indicate the ecnfield value
+ * in the latter ACK.
+ */
+ if (tcp_ecn_mode_accecn(tp)) {
+ if (ecn_edge)
+ inet_csk(sk)->icsk_ack.pending |= ICSK_ACK_NOW;
+ tp->accecn_opt_demand = 2;
}
}
}
@@ -6381,8 +6408,10 @@ static bool tcp_validate_incoming(struct sock *sk, struct sk_buff *skb,
* RFC 5961 4.2 : Send a challenge ack
*/
if (th->syn) {
- if (tcp_ecn_mode_accecn(tp))
+ if (tcp_ecn_mode_accecn(tp)) {
send_accecn_reflector = true;
+ tp->accecn_opt_demand = max_t(u8, 1, tp->accecn_opt_demand);
+ }
if (sk->sk_state == TCP_SYN_RECV && sk->sk_socket && th->ack &&
TCP_SKB_CB(skb)->seq + 1 == TCP_SKB_CB(skb)->end_seq &&
TCP_SKB_CB(skb)->seq + 1 == tp->rcv_nxt &&
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index e632327f19f8..21946ac00282 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -3448,6 +3448,7 @@ static int __net_init tcp_sk_init(struct net *net)
{
net->ipv4.sysctl_tcp_ecn = 2;
net->ipv4.sysctl_tcp_ecn_option = 2;
+ net->ipv4.sysctl_tcp_ecn_option_beacon = 3;
net->ipv4.sysctl_tcp_ecn_fallback = 1;
net->ipv4.sysctl_tcp_base_mss = TCP_BASE_MSS;
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index ad9ac8e2bfd4..75baa72849fe 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -490,6 +490,8 @@ static void tcp_ecn_openreq_child(struct sock *sk,
tcp_ecn_mode_set(tp, TCP_ECN_MODE_ACCECN);
tp->syn_ect_snt = treq->syn_ect_snt;
tcp_accecn_third_ack(sk, skb, treq->syn_ect_snt);
+ tp->prev_ecnfield = treq->syn_ect_rcv;
+ tp->accecn_opt_demand = 1;
tcp_ecn_received_counters(sk, skb, skb->len - th->doff * 4);
} else {
tcp_ecn_mode_set(tp, inet_rsk(req)->ecn_ok ?
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index bddd0b309443..22f6cfba5b27 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -792,8 +792,13 @@ static void tcp_options_write(struct tcphdr *th, struct tcp_sock *tp,
*ptr++ = htonl(((e0b & 0xffffff) << 8) |
TCPOPT_NOP);
}
- if (tp)
+
+ if (tp) {
tp->accecn_minlen = 0;
+ tp->accecn_opt_tstamp = tp->tcp_mstamp;
+ if (tp->accecn_opt_demand)
+ tp->accecn_opt_demand--;
+ }
}
if (unlikely(OPTION_SACK_ADVERTISE & options)) {
@@ -970,6 +975,17 @@ static int tcp_options_fit_accecn(struct tcp_out_options *opts, int required,
return size;
}
+static bool tcp_accecn_option_beacon_check(const struct sock *sk)
+{
+ const struct tcp_sock *tp = tcp_sk(sk);
+
+ if (!sock_net(sk)->ipv4.sysctl_tcp_ecn_option_beacon)
+ return false;
+
+ return tcp_stamp_us_delta(tp->tcp_mstamp, tp->accecn_opt_tstamp) *
+ sock_net(sk)->ipv4.sysctl_tcp_ecn_option_beacon >= (tp->srtt_us >> 3);
+}
+
/* Compute TCP options for SYN packets. This is not the final
* network wire format yet.
*/
@@ -1213,10 +1229,15 @@ static unsigned int tcp_established_options(struct sock *sk, struct sk_buff *skb
if (tcp_ecn_mode_accecn(tp) &&
sock_net(sk)->ipv4.sysctl_tcp_ecn_option) {
- opts->ecn_bytes = tp->received_ecn_bytes;
- size += tcp_options_fit_accecn(opts, tp->accecn_minlen,
- MAX_TCP_OPTION_SPACE - size,
- opts->num_sack_blocks > 0 ? 2 : 0);
+ if (sock_net(sk)->ipv4.sysctl_tcp_ecn_option >= 2 ||
+ tp->accecn_opt_demand ||
+ tcp_accecn_option_beacon_check(sk)) {
+ opts->ecn_bytes = tp->received_ecn_bytes;
+ size += tcp_options_fit_accecn(opts, tp->accecn_minlen,
+ MAX_TCP_OPTION_SPACE - size,
+ opts->num_sack_blocks > 0 ?
+ 2 : 0);
+ }
}
if (unlikely(BPF_SOCK_OPS_TEST_FLAG(tp,
@@ -2933,6 +2954,11 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
sent_pkts = 0;
tcp_mstamp_refresh(tp);
+
+ /* AccECN option beacon depends on mstamp, it may change mss */
+ if (tcp_ecn_mode_accecn(tp) && tcp_accecn_option_beacon_check(sk))
+ mss_now = tcp_current_mss(sk);
+
if (!push_one) {
/* Do MTU probing. */
result = tcp_mtu_probe(sk);
--
2.34.1
* [PATCH net-next 24/44] tcp: accecn: AccECN option failure handling
2024-10-15 10:28 [PATCH net-next 00/44] DualPI2, Accurate ECN, TCP Prague patch series chia-yu.chang
` (22 preceding siblings ...)
2024-10-15 10:29 ` [PATCH net-next 23/44] tcp: accecn: AccECN option send control chia-yu.chang
@ 2024-10-15 10:29 ` chia-yu.chang
2024-10-15 10:29 ` [PATCH net-next 25/44] tcp: accecn: AccECN option ceb/cep heuristic chia-yu.chang
` (20 subsequent siblings)
44 siblings, 0 replies; 56+ messages in thread
From: chia-yu.chang @ 2024-10-15 10:29 UTC (permalink / raw)
To: netdev, ij, ncardwell, koen.de_schepper, g.white,
ingemar.s.johansson, mirja.kuehlewind, cheshire, rs.ietf,
Jason_Livingood, vidhi_goel
Cc: Chia-Yu Chang
From: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
AccECN option may fail in various ways, handle these:
- Remove the option from SYN/ACK rexmits to handle blackholes
- If no option arrives in the SYN/ACK, assume the option is not usable
- If an option arrives later, re-enable it
- If the option is zeroed, disable AccECN option processing (a sketch
of the zero-check follows this list)
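As a rough, standalone illustration of the zero-check (mirroring the
logic tcp_accecn_option_init() uses in this patch, but with local
constants and names): an option too short to carry a counter is merely
"empty", while a counter field that decodes to zero is treated as a
middlebox having wiped the option.

#include <stdio.h>

enum { OPT_EMPTY_SEEN, OPT_COUNTER_SEEN, OPT_FAIL_SEEN };

/* Read one 24-bit big-endian counter field. */
static unsigned int be24(const unsigned char *p)
{
	return (p[0] << 16) | (p[1] << 8) | p[2];
}

/* Classify the counter fields that follow the kind/length bytes. */
static int classify_opt(const unsigned char *fields, int len)
{
	if (len < 3)
		return OPT_EMPTY_SEEN;		/* no counter present */
	if (be24(fields) == 0)
		return OPT_FAIL_SEEN;		/* first counter zeroed */
	if (len < 9)
		return OPT_COUNTER_SEEN;	/* fewer than three counters */
	if (be24(fields + 6) == 0)
		return OPT_FAIL_SEEN;		/* last counter zeroed */
	return OPT_COUNTER_SEEN;
}

int main(void)
{
	unsigned char ok[9]    = { 0, 0, 1, 0, 0, 2, 0, 0, 3 };
	unsigned char wiped[9] = { 0 };

	printf("%d %d\n", classify_opt(ok, 9), classify_opt(wiped, 9)); /* 1 2 */
	return 0;
}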
Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
---
include/linux/tcp.h | 6 +++--
include/net/tcp.h | 7 ++++++
net/ipv4/tcp.c | 1 +
net/ipv4/tcp_input.c | 47 +++++++++++++++++++++++++++++++++++-----
net/ipv4/tcp_minisocks.c | 33 ++++++++++++++++++++++++++++
net/ipv4/tcp_output.c | 7 ++++--
6 files changed, 92 insertions(+), 9 deletions(-)
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index e4aa10fdc032..d817a4d1e17c 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -160,7 +160,8 @@ struct tcp_request_sock {
u8 accecn_ok : 1,
syn_ect_snt: 2,
syn_ect_rcv: 2;
- u8 accecn_fail_mode:4;
+ u8 accecn_fail_mode:4,
+ saw_accecn_opt :2;
u32 txhash;
u32 rcv_isn;
u32 snt_isn;
@@ -387,7 +388,8 @@ struct tcp_sock {
syn_ect_snt:2, /* AccECN ECT memory, only */
syn_ect_rcv:2, /* ... needed during 3WHS + first seqno */
wait_third_ack:1; /* Need 3rd ACK in simultaneous open for AccECN */
- u8 accecn_fail_mode:4; /* AccECN failure handling */
+ u8 accecn_fail_mode:4, /* AccECN failure handling */
+ saw_accecn_opt:2; /* An AccECN option was seen */
u8 thin_lto : 1,/* Use linear timeouts for thin streams */
fastopen_connect:1, /* FASTOPEN_CONNECT sockopt */
fastopen_no_cookie:1, /* Allow send/recv SYN+data without a cookie */
diff --git a/include/net/tcp.h b/include/net/tcp.h
index b3cbf9a11dbc..18c6f0ada141 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -274,6 +274,12 @@ static inline void tcp_accecn_fail_mode_set(struct tcp_sock *tp, u8 mode)
tp->accecn_fail_mode |= mode;
}
+/* tp->saw_accecn_opt states */
+#define TCP_ACCECN_OPT_NOT_SEEN 0x0
+#define TCP_ACCECN_OPT_EMPTY_SEEN 0x1
+#define TCP_ACCECN_OPT_COUNTER_SEEN 0x2
+#define TCP_ACCECN_OPT_FAIL_SEEN 0x3
+
/* Flags in tp->nonagle */
#define TCP_NAGLE_OFF 1 /* Nagle's algo is disabled */
#define TCP_NAGLE_CORK 2 /* Socket is corked */
@@ -475,6 +481,7 @@ static inline int tcp_accecn_extract_syn_ect(u8 ace)
bool tcp_accecn_validate_syn_feedback(struct sock *sk, u8 ace, u8 sent_ect);
void tcp_accecn_third_ack(struct sock *sk, const struct sk_buff *skb,
u8 syn_ect_snt);
+u8 tcp_accecn_option_init(const struct sk_buff *skb, u8 opt_offset);
void tcp_ecn_received_counters(struct sock *sk, const struct sk_buff *skb,
u32 payload_len);
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index e59fd2cabe03..7ef69b7265eb 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -3339,6 +3339,7 @@ int tcp_disconnect(struct sock *sk, int flags)
tp->delivered_ce = 0;
tp->wait_third_ack = 0;
tp->accecn_fail_mode = 0;
+ tp->saw_accecn_opt = TCP_ACCECN_OPT_NOT_SEEN;
tcp_accecn_init_counters(tp);
tp->prev_ecnfield = 0;
tp->accecn_opt_tstamp = 0;
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 14b9a5e63687..a8669c407978 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -439,8 +439,8 @@ bool tcp_accecn_validate_syn_feedback(struct sock *sk, u8 ace, u8 sent_ect)
}
/* See Table 2 of the AccECN draft */
-static void tcp_ecn_rcv_synack(struct sock *sk, const struct tcphdr *th,
- u8 ip_dsfield)
+static void tcp_ecn_rcv_synack(struct sock *sk, const struct sk_buff *skb,
+ const struct tcphdr *th, u8 ip_dsfield)
{
struct tcp_sock *tp = tcp_sk(sk);
u8 ace = tcp_accecn_ace(th);
@@ -459,7 +459,14 @@ static void tcp_ecn_rcv_synack(struct sock *sk, const struct tcphdr *th,
default:
tcp_ecn_mode_set(tp, TCP_ECN_MODE_ACCECN);
tp->syn_ect_rcv = ip_dsfield & INET_ECN_MASK;
- tp->accecn_opt_demand = 2;
+ if (tp->rx_opt.accecn &&
+ tp->saw_accecn_opt < TCP_ACCECN_OPT_COUNTER_SEEN) {
+ tp->saw_accecn_opt = tcp_accecn_option_init(skb,
+ tp->rx_opt.accecn);
+ if (tp->saw_accecn_opt == TCP_ACCECN_OPT_FAIL_SEEN)
+ tcp_accecn_fail_mode_set(tp, TCP_ACCECN_OPT_FAIL_RECV);
+ tp->accecn_opt_demand = 2;
+ }
if (tcp_accecn_validate_syn_feedback(sk, ace, tp->syn_ect_snt) &&
INET_ECN_is_ce(ip_dsfield)) {
tp->received_ce++;
@@ -574,7 +581,21 @@ static bool tcp_accecn_process_option(struct tcp_sock *tp,
bool order1, res;
unsigned int i;
+ if (tcp_accecn_opt_fail_recv(tp))
+ return false;
+
if (!(flag & FLAG_SLOWPATH) || !tp->rx_opt.accecn) {
+ if (!tp->saw_accecn_opt) {
+ /* Too late to enable after this point due to
+ * potential counter wraps
+ */
+ if (tp->bytes_sent >= (1 << 23) - 1) {
+ tp->saw_accecn_opt = TCP_ACCECN_OPT_FAIL_SEEN;
+ tcp_accecn_fail_mode_set(tp, TCP_ACCECN_OPT_FAIL_RECV);
+ }
+ return false;
+ }
+
if (estimate_ecnfield) {
tp->delivered_ecn_bytes[estimate_ecnfield - 1] += delivered_bytes;
return true;
@@ -588,6 +609,13 @@ static bool tcp_accecn_process_option(struct tcp_sock *tp,
order1 = (ptr[0] == TCPOPT_ACCECN1);
ptr += 2;
+ if (tp->saw_accecn_opt < TCP_ACCECN_OPT_COUNTER_SEEN) {
+ tp->saw_accecn_opt = tcp_accecn_option_init(skb,
+ tp->rx_opt.accecn);
+ if (tp->saw_accecn_opt == TCP_ACCECN_OPT_FAIL_SEEN)
+ tcp_accecn_fail_mode_set(tp, TCP_ACCECN_OPT_FAIL_RECV);
+ }
+
res = !!estimate_ecnfield;
for (i = 0; i < 3; i++) {
if (optlen >= TCPOLEN_ACCECN_PERFIELD) {
@@ -6410,7 +6438,14 @@ static bool tcp_validate_incoming(struct sock *sk, struct sk_buff *skb,
if (th->syn) {
if (tcp_ecn_mode_accecn(tp)) {
send_accecn_reflector = true;
- tp->accecn_opt_demand = max_t(u8, 1, tp->accecn_opt_demand);
+ if (tp->rx_opt.accecn &&
+ tp->saw_accecn_opt < TCP_ACCECN_OPT_COUNTER_SEEN) {
+ tp->saw_accecn_opt = tcp_accecn_option_init(skb,
+ tp->rx_opt.accecn);
+ if (tp->saw_accecn_opt == TCP_ACCECN_OPT_FAIL_SEEN)
+ tcp_accecn_fail_mode_set(tp, TCP_ACCECN_OPT_FAIL_RECV);
+ tp->accecn_opt_demand = max_t(u8, 1, tp->accecn_opt_demand);
+ }
}
if (sk->sk_state == TCP_SYN_RECV && sk->sk_socket && th->ack &&
TCP_SKB_CB(skb)->seq + 1 == TCP_SKB_CB(skb)->end_seq &&
@@ -6900,7 +6935,7 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
*/
if (tcp_ecn_mode_any(tp))
- tcp_ecn_rcv_synack(sk, th, TCP_SKB_CB(skb)->ip_dsfield);
+ tcp_ecn_rcv_synack(sk, skb, th, TCP_SKB_CB(skb)->ip_dsfield);
tcp_init_wl(tp, TCP_SKB_CB(skb)->seq);
tcp_try_undo_spurious_syn(sk);
@@ -7476,6 +7511,8 @@ static void tcp_openreq_init(struct request_sock *req,
tcp_rsk(req)->snt_synack = 0;
tcp_rsk(req)->last_oow_ack_time = 0;
tcp_rsk(req)->accecn_ok = 0;
+ tcp_rsk(req)->saw_accecn_opt = TCP_ACCECN_OPT_NOT_SEEN;
+ tcp_rsk(req)->accecn_fail_mode = 0;
tcp_rsk(req)->syn_ect_rcv = 0;
tcp_rsk(req)->syn_ect_snt = 0;
req->mss = rx_opt->mss_clamp;
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index 75baa72849fe..cce1816e4244 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -490,6 +490,7 @@ static void tcp_ecn_openreq_child(struct sock *sk,
tcp_ecn_mode_set(tp, TCP_ECN_MODE_ACCECN);
tp->syn_ect_snt = treq->syn_ect_snt;
tcp_accecn_third_ack(sk, skb, treq->syn_ect_snt);
+ tp->saw_accecn_opt = treq->saw_accecn_opt;
tp->prev_ecnfield = treq->syn_ect_rcv;
tp->accecn_opt_demand = 1;
tcp_ecn_received_counters(sk, skb, skb->len - th->doff * 4);
@@ -544,6 +545,30 @@ static void smc_check_reset_syn_req(const struct tcp_sock *oldtp,
#endif
}
+u8 tcp_accecn_option_init(const struct sk_buff *skb, u8 opt_offset)
+{
+ unsigned char *ptr = skb_transport_header(skb) + opt_offset;
+ unsigned int optlen = ptr[1] - 2;
+
+ WARN_ON_ONCE(ptr[0] != TCPOPT_ACCECN0 && ptr[0] != TCPOPT_ACCECN1);
+ ptr += 2;
+
+ /* Detect option zeroing: an AccECN connection "MAY check that the
+ * initial value of the EE0B field or the EE1B field is non-zero"
+ */
+ if (optlen < TCPOLEN_ACCECN_PERFIELD)
+ return TCP_ACCECN_OPT_EMPTY_SEEN;
+ if (get_unaligned_be24(ptr) == 0)
+ return TCP_ACCECN_OPT_FAIL_SEEN;
+ if (optlen < TCPOLEN_ACCECN_PERFIELD * 3)
+ return TCP_ACCECN_OPT_COUNTER_SEEN;
+ ptr += TCPOLEN_ACCECN_PERFIELD * 2;
+ if (get_unaligned_be24(ptr) == 0)
+ return TCP_ACCECN_OPT_FAIL_SEEN;
+
+ return TCP_ACCECN_OPT_COUNTER_SEEN;
+}
+
/* This is not only more efficient than what we used to do, it eliminates
* a lot of code duplication between IPv4/IPv6 SYN recv processing. -DaveM
*
@@ -704,6 +729,7 @@ struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb,
bool own_req;
tmp_opt.saw_tstamp = 0;
+ tmp_opt.accecn = 0;
if (th->doff > (sizeof(struct tcphdr)>>2)) {
tcp_parse_options(sock_net(sk), skb, &tmp_opt, 0, NULL);
@@ -879,6 +905,13 @@ struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb,
if (!(flg & TCP_FLAG_ACK))
return NULL;
+ if (tcp_rsk(req)->accecn_ok && tmp_opt.accecn &&
+ tcp_rsk(req)->saw_accecn_opt < TCP_ACCECN_OPT_COUNTER_SEEN) {
+ tcp_rsk(req)->saw_accecn_opt = tcp_accecn_option_init(skb, tmp_opt.accecn);
+ if (tcp_rsk(req)->saw_accecn_opt == TCP_ACCECN_OPT_FAIL_SEEN)
+ tcp_rsk(req)->accecn_fail_mode |= TCP_ACCECN_OPT_FAIL_RECV;
+ }
+
/* For Fast Open no more processing is needed (sk is the
* child socket).
*/
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 22f6cfba5b27..ee23b08bd750 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1069,6 +1069,7 @@ static unsigned int tcp_syn_options(struct sock *sk, struct sk_buff *skb,
/* Simultaneous open SYN/ACK needs AccECN option but not SYN */
if (unlikely((TCP_SKB_CB(skb)->tcp_flags & TCPHDR_ACK) &&
tcp_ecn_mode_accecn(tp) &&
+ inet_csk(sk)->icsk_retransmits < 2 &&
sock_net(sk)->ipv4.sysctl_tcp_ecn_option &&
remaining >= TCPOLEN_ACCECN_BASE)) {
opts->ecn_bytes = synack_ecn_bytes;
@@ -1151,7 +1152,7 @@ static unsigned int tcp_synack_options(const struct sock *sk,
smc_set_option_cond(tcp_sk(sk), ireq, opts, &remaining);
if (treq->accecn_ok && sock_net(sk)->ipv4.sysctl_tcp_ecn_option &&
- remaining >= TCPOLEN_ACCECN_BASE) {
+ req->num_timeout < 1 && remaining >= TCPOLEN_ACCECN_BASE) {
opts->ecn_bytes = synack_ecn_bytes;
remaining -= tcp_options_fit_accecn(opts, 0, remaining,
tcp_synack_options_combine_saving(opts));
@@ -1228,7 +1229,9 @@ static unsigned int tcp_established_options(struct sock *sk, struct sk_buff *skb
}
if (tcp_ecn_mode_accecn(tp) &&
- sock_net(sk)->ipv4.sysctl_tcp_ecn_option) {
+ sock_net(sk)->ipv4.sysctl_tcp_ecn_option &&
+ tp->saw_accecn_opt &&
+ !tcp_accecn_opt_fail_send(tp)) {
if (sock_net(sk)->ipv4.sysctl_tcp_ecn_option >= 2 ||
tp->accecn_opt_demand ||
tcp_accecn_option_beacon_check(sk)) {
--
2.34.1
* [PATCH net-next 25/44] tcp: accecn: AccECN option ceb/cep heuristic
2024-10-15 10:28 [PATCH net-next 00/44] DualPI2, Accurate ECN, TCP Prague patch series chia-yu.chang
` (23 preceding siblings ...)
2024-10-15 10:29 ` [PATCH net-next 24/44] tcp: accecn: AccECN option failure handling chia-yu.chang
@ 2024-10-15 10:29 ` chia-yu.chang
2024-10-15 10:29 ` [PATCH net-next 26/44] tcp: accecn: AccECN ACE field multi-wrap heuristic chia-yu.chang
` (19 subsequent siblings)
44 siblings, 0 replies; 56+ messages in thread
From: chia-yu.chang @ 2024-10-15 10:29 UTC (permalink / raw)
To: netdev, ij, ncardwell, koen.de_schepper, g.white,
ingemar.s.johansson, mirja.kuehlewind, cheshire, rs.ietf,
Jason_Livingood, vidhi_goel
Cc: Chia-Yu Chang
From: Ilpo Järvinen <ij@kernel.org>
Implement the heuristic algorithm from draft-11 Appendix A.2.2 to
mitigate false ACE field overflows.
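In rough terms, the heuristic cross-checks the packet-based ACE delta
against the CE byte delta carried in the AccECN option. A minimal
sketch with purely illustrative numbers (a safety shift of 1
corresponds to the draft's SAFETY_FACTOR of 2):

#include <stdio.h>

/* delta: CE packet delta assuming no ACE wrap; safe_delta: the delta
 * assuming a wrap; d_ceb: CE byte delta taken from the AccECN option.
 */
static unsigned int pick_ce_delta(unsigned int delta, unsigned int safe_delta,
				  unsigned int d_ceb, unsigned int mss)
{
	if (!d_ceb)
		return delta;		/* no CE bytes seen: trust the small delta */
	if (d_ceb > delta * mss)
		return safe_delta;	/* more CE bytes than delta packets can carry */
	if (d_ceb < (safe_delta * mss) >> 1)
		return delta;		/* far below what a wrap would imply */
	return safe_delta;
}

int main(void)
{
	/* 9000 CE bytes cannot fit in 3 full-sized packets, so the
	 * wrapped (safer) estimate is chosen.
	 */
	printf("%u\n", pick_ce_delta(3, 11, 9000, 1000)); /* 11 */
	return 0;
}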
Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
---
include/net/tcp.h | 1 +
net/ipv4/tcp_input.c | 16 ++++++++++++++--
2 files changed, 15 insertions(+), 2 deletions(-)
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 18c6f0ada141..a2f6b8781f11 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -242,6 +242,7 @@ static_assert((1 << ATO_BITS) > TCP_DELACK_MAX);
#define TCP_ACCECN_MAXSIZE (TCPOLEN_ACCECN_BASE + \
TCPOLEN_ACCECN_PERFIELD * \
TCP_ACCECN_NUMFIELDS)
+#define TCP_ACCECN_SAFETY_SHIFT 1 /* SAFETY_FACTOR in accecn draft */
/* tp->accecn_fail_mode */
#define TCP_ACCECN_ACE_FAIL_SEND BIT(0)
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index a8669c407978..79e901eb5fcf 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -669,15 +669,17 @@ static void tcp_count_delivered(struct tcp_sock *tp, u32 delivered,
static u32 __tcp_accecn_process(struct sock *sk, const struct sk_buff *skb,
u32 delivered_pkts, u32 delivered_bytes, int flag)
{
+ u32 old_ceb = tcp_sk(sk)->delivered_ecn_bytes[INET_ECN_CE - 1];
struct tcp_sock *tp = tcp_sk(sk);
- u32 delta, safe_delta;
+ u32 delta, safe_delta, d_ceb;
+ bool opt_deltas_valid;
u32 corrected_ace;
/* Reordered ACK? (...or uncertain due to lack of data to send and ts) */
if (!(flag & (FLAG_FORWARD_PROGRESS | FLAG_TS_PROGRESS)))
return 0;
- tcp_accecn_process_option(tp, skb, delivered_bytes, flag);
+ opt_deltas_valid = tcp_accecn_process_option(tp, skb, delivered_bytes, flag);
if (!(flag & FLAG_SLOWPATH)) {
/* AccECN counter might overflow on large ACKs */
@@ -699,6 +701,16 @@ static u32 __tcp_accecn_process(struct sock *sk, const struct sk_buff *skb,
safe_delta = delivered_pkts - ((delivered_pkts - delta) & TCP_ACCECN_CEP_ACE_MASK);
+ if (opt_deltas_valid) {
+ d_ceb = tp->delivered_ecn_bytes[INET_ECN_CE - 1] - old_ceb;
+ if (!d_ceb)
+ return delta;
+ if (d_ceb > delta * tp->mss_cache)
+ return safe_delta;
+ if (d_ceb < safe_delta * tp->mss_cache >> TCP_ACCECN_SAFETY_SHIFT)
+ return delta;
+ }
+
return safe_delta;
}
--
2.34.1
* [PATCH net-next 26/44] tcp: accecn: AccECN ACE field multi-wrap heuristic
2024-10-15 10:28 [PATCH net-next 00/44] DualPI2, Accurate ECN, TCP Prague patch series chia-yu.chang
` (24 preceding siblings ...)
2024-10-15 10:29 ` [PATCH net-next 25/44] tcp: accecn: AccECN option ceb/cep heuristic chia-yu.chang
@ 2024-10-15 10:29 ` chia-yu.chang
2024-10-15 10:29 ` [PATCH net-next 27/44] tcp: accecn: try to fit AccECN option with SACK chia-yu.chang
` (18 subsequent siblings)
44 siblings, 0 replies; 56+ messages in thread
From: chia-yu.chang @ 2024-10-15 10:29 UTC (permalink / raw)
To: netdev, ij, ncardwell, koen.de_schepper, g.white,
ingemar.s.johansson, mirja.kuehlewind, cheshire, rs.ietf,
Jason_Livingood, vidhi_goel
Cc: Chia-Yu Chang
From: Ilpo Järvinen <ij@kernel.org>
Armed with the ceb delta from the AccECN option, delivered bytes,
and delivered packets, it is possible to estimate how many times
the ACE field wrapped.
This calculation is necessary only if more than one wrap is
possible. Without SACK, delivered bytes and packets are not always
trustworthy, in which case TCP falls back to the simpler
no-or-all-wraps ceb algorithm.
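With purely illustrative numbers, the estimate scales the CE byte delta
into packets and keeps only whole wraps of the 3-bit ACE counter (hence
the mask of 7), while still capping the result by the safer all-wraps
delta. A self-contained sketch:

#include <stdio.h>

int main(void)
{
	unsigned long long d_ceb = 20000;	/* CE byte delta from the option */
	unsigned int delivered_pkts = 40, delivered_bytes = 60000;
	unsigned int delta = 4, safe_delta = 36;/* no-wrap vs. max-wraps estimates */
	unsigned int ace_mask = 7;		/* ACE field is 3 bits wide */
	unsigned int est_d_cep, wrapped;

	/* Scale CE bytes into an estimated CE packet count, rounding up. */
	est_d_cep = (unsigned int)((d_ceb * delivered_pkts + delivered_bytes - 1) /
				   delivered_bytes);
	/* Keep only whole wraps of the 3-bit ACE counter. */
	wrapped = delta + (est_d_cep & ~ace_mask);

	printf("%u\n", wrapped < safe_delta ? wrapped : safe_delta); /* 12 */
	return 0;
}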
Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
---
net/ipv4/tcp_input.c | 13 +++++++++++++
1 file changed, 13 insertions(+)
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 79e901eb5fcf..ac928359a443 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -705,6 +705,19 @@ static u32 __tcp_accecn_process(struct sock *sk, const struct sk_buff *skb,
d_ceb = tp->delivered_ecn_bytes[INET_ECN_CE - 1] - old_ceb;
if (!d_ceb)
return delta;
+
+ if ((delivered_pkts >= (TCP_ACCECN_CEP_ACE_MASK + 1) * 2) &&
+ (tcp_is_sack(tp) ||
+ ((1 << inet_csk(sk)->icsk_ca_state) & (TCPF_CA_Open | TCPF_CA_CWR)))) {
+ u32 est_d_cep;
+
+ if (delivered_bytes <= d_ceb)
+ return safe_delta;
+
+ est_d_cep = DIV_ROUND_UP_ULL((u64)d_ceb * delivered_pkts, delivered_bytes);
+ return min(safe_delta, delta + (est_d_cep & ~TCP_ACCECN_CEP_ACE_MASK));
+ }
+
if (d_ceb > delta * tp->mss_cache)
return safe_delta;
if (d_ceb < safe_delta * tp->mss_cache >> TCP_ACCECN_SAFETY_SHIFT)
--
2.34.1
* [PATCH net-next 27/44] tcp: accecn: try to fit AccECN option with SACK
2024-10-15 10:28 [PATCH net-next 00/44] DualPI2, Accurate ECN, TCP Prague patch series chia-yu.chang
` (25 preceding siblings ...)
2024-10-15 10:29 ` [PATCH net-next 26/44] tcp: accecn: AccECN ACE field multi-wrap heuristic chia-yu.chang
@ 2024-10-15 10:29 ` chia-yu.chang
2024-10-15 10:29 ` [PATCH net-next 28/44] tcp: try to avoid safer when ACKs are thinned chia-yu.chang
` (17 subsequent siblings)
44 siblings, 0 replies; 56+ messages in thread
From: chia-yu.chang @ 2024-10-15 10:29 UTC (permalink / raw)
To: netdev, ij, ncardwell, koen.de_schepper, g.white,
ingemar.s.johansson, mirja.kuehlewind, cheshire, rs.ietf,
Jason_Livingood, vidhi_goel
Cc: Chia-Yu Chang
From: Ilpo Järvinen <ij@kernel.org>
As SACK blocks tend to eat all option space when there are
many holes, it is useful to compromise on sending many SACK
blocks in every ACK and try to fit the AccECN option in
by reducing the number of SACK blocks. But never go below
two SACK blocks because of the AccECN option.
As the AccECN option is often not put into every ACK, the
space hijack is usually only temporary.
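For a sense of the space arithmetic (using the usual Linux TCP option
sizes, not values introduced by this patch): with timestamps (12 bytes
aligned), the SACK header (4 bytes aligned) and three SACK blocks
(3 x 8 bytes), the 40-byte option space is completely full, so not even
the 11-byte full AccECN option fits. Dropping one SACK block frees
8 bytes, which is exactly enough for an AccECN option carrying two
3-byte counter fields (2 + 2 x 3 = 8), so trading a single block is
normally sufficient.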
Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
---
net/ipv4/tcp_output.c | 14 +++++++++++++-
1 file changed, 13 insertions(+), 1 deletion(-)
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index ee23b08bd750..663cdea1b87b 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -968,8 +968,20 @@ static int tcp_options_fit_accecn(struct tcp_out_options *opts, int required,
opts->num_accecn_fields--;
size -= TCPOLEN_ACCECN_PERFIELD;
}
- if (opts->num_accecn_fields < required)
+ if (opts->num_accecn_fields < required) {
+ if (opts->num_sack_blocks > 2) {
+ /* Try to fit the option by removing one SACK block */
+ opts->num_sack_blocks--;
+ size = tcp_options_fit_accecn(opts, required,
+ remaining + TCPOLEN_SACK_PERBLOCK,
+ max_combine_saving);
+ if (opts->options & OPTION_ACCECN)
+ return size - TCPOLEN_SACK_PERBLOCK;
+
+ opts->num_sack_blocks++;
+ }
return 0;
+ }
opts->options |= OPTION_ACCECN;
return size;
--
2.34.1
* [PATCH net-next 28/44] tcp: try to avoid safer when ACKs are thinned
2024-10-15 10:28 [PATCH net-next 00/44] DualPI2, Accurate ECN, TCP Prague patch series chia-yu.chang
` (26 preceding siblings ...)
2024-10-15 10:29 ` [PATCH net-next 27/44] tcp: accecn: try to fit AccECN option with SACK chia-yu.chang
@ 2024-10-15 10:29 ` chia-yu.chang
2024-10-15 10:29 ` [PATCH net-next 29/44] gro: flushing when CWR is set negatively affects AccECN chia-yu.chang
` (16 subsequent siblings)
44 siblings, 0 replies; 56+ messages in thread
From: chia-yu.chang @ 2024-10-15 10:29 UTC (permalink / raw)
To: netdev, ij, ncardwell, koen.de_schepper, g.white,
ingemar.s.johansson, mirja.kuehlewind, cheshire, rs.ietf,
Jason_Livingood, vidhi_goel
Cc: Chia-Yu Chang
From: Ilpo Järvinen <ij@kernel.org>
Add an EWMA of newly ACKed packets. When ACK thinning occurs,
select between the safer and unsafe cep delta in AccECN processing
based on it. If the number of packets ACKed per ACK tends to be
large, don't conservatively assume ACE field overflow.
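A self-contained sketch of the fixed-point EWMA (weight and precision
of 6 bits as in this patch; the per-ACK packet counts below are made
up) shows how the smoothed packets-per-ACK is tracked and compared
against the threshold of 4:

#include <stdio.h>

#define PKTS_ACKED_WEIGHT 6
#define PKTS_ACKED_PREC   6
#define ACK_COMP_THRESH   4

int main(void)
{
	unsigned int acks[] = { 2, 2, 10, 10, 10, 10 }; /* pkts newly acked per ACK */
	unsigned int ewma = 0;
	unsigned int i;

	for (i = 0; i < sizeof(acks) / sizeof(acks[0]); i++) {
		unsigned int pkts = acks[i];

		if (!ewma)
			ewma = pkts << PKTS_ACKED_PREC;
		else
			ewma = ((ewma << PKTS_ACKED_WEIGHT) - ewma +
				(pkts << PKTS_ACKED_PREC)) >> PKTS_ACKED_WEIGHT;
	}
	/* ewma >> 6 is the smoothed packets-per-ACK; only when it exceeds
	 * ACK_COMP_THRESH is the unsafe (smaller) cep delta preferred.
	 */
	printf("ewma=%u avg~%u above_thresh=%d\n", ewma, ewma >> PKTS_ACKED_PREC,
	       ewma > (ACK_COMP_THRESH << PKTS_ACKED_PREC));
	return 0;
}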
Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
---
include/linux/tcp.h | 1 +
net/ipv4/tcp.c | 4 +++-
net/ipv4/tcp_input.c | 20 +++++++++++++++++++-
3 files changed, 23 insertions(+), 2 deletions(-)
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index d817a4d1e17c..9dbfaa76d721 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -309,6 +309,7 @@ struct tcp_sock {
prev_ecnfield:2,/* ECN bits from the previous segment */
accecn_opt_demand:2,/* Demand AccECN option for n next ACKs */
estimate_ecnfield:2;/* ECN field for AccECN delivered estimates */
+ u16 pkts_acked_ewma;/* EWMA of packets acked for AccECN cep heuristic */
u64 accecn_opt_tstamp; /* Last AccECN option sent timestamp */
u32 app_limited; /* limited until "delivered" reaches this val */
u32 rcv_wnd; /* Current receiver window */
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 7ef69b7265eb..16bf550a619b 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -3343,6 +3343,7 @@ int tcp_disconnect(struct sock *sk, int flags)
tcp_accecn_init_counters(tp);
tp->prev_ecnfield = 0;
tp->accecn_opt_tstamp = 0;
+ tp->pkts_acked_ewma = 0;
if (icsk->icsk_ca_ops->release)
icsk->icsk_ca_ops->release(sk);
memset(icsk->icsk_ca_priv, 0, sizeof(icsk->icsk_ca_priv));
@@ -5043,6 +5044,7 @@ static void __init tcp_struct_check(void)
CACHELINE_ASSERT_GROUP_MEMBER(struct tcp_sock, tcp_sock_write_txrx, delivered_ecn_bytes);
CACHELINE_ASSERT_GROUP_MEMBER(struct tcp_sock, tcp_sock_write_txrx, received_ce);
CACHELINE_ASSERT_GROUP_MEMBER(struct tcp_sock, tcp_sock_write_txrx, received_ecn_bytes);
+ CACHELINE_ASSERT_GROUP_MEMBER(struct tcp_sock, tcp_sock_write_txrx, pkts_acked_ewma);
CACHELINE_ASSERT_GROUP_MEMBER(struct tcp_sock, tcp_sock_write_txrx, accecn_opt_tstamp);
CACHELINE_ASSERT_GROUP_MEMBER(struct tcp_sock, tcp_sock_write_txrx, app_limited);
CACHELINE_ASSERT_GROUP_MEMBER(struct tcp_sock, tcp_sock_write_txrx, rcv_wnd);
@@ -5051,7 +5053,7 @@ static void __init tcp_struct_check(void)
/* 32bit arches with 8byte alignment on u64 fields might need padding
* before tcp_clock_cache.
*/
- CACHELINE_ASSERT_GROUP_SIZE(struct tcp_sock, tcp_sock_write_txrx, 130 + 6);
+ CACHELINE_ASSERT_GROUP_SIZE(struct tcp_sock, tcp_sock_write_txrx, 132 + 4);
/* RX read-write hotpath cache lines */
CACHELINE_ASSERT_GROUP_MEMBER(struct tcp_sock, tcp_sock_write_rx, bytes_received);
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index ac928359a443..b1b6c55ff6e2 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -665,6 +665,10 @@ static void tcp_count_delivered(struct tcp_sock *tp, u32 delivered,
tcp_count_delivered_ce(tp, delivered);
}
+#define PKTS_ACKED_WEIGHT 6
+#define PKTS_ACKED_PREC 6
+#define ACK_COMP_THRESH 4
+
/* Returns the ECN CE delta */
static u32 __tcp_accecn_process(struct sock *sk, const struct sk_buff *skb,
u32 delivered_pkts, u32 delivered_bytes, int flag)
@@ -681,6 +685,19 @@ static u32 __tcp_accecn_process(struct sock *sk, const struct sk_buff *skb,
opt_deltas_valid = tcp_accecn_process_option(tp, skb, delivered_bytes, flag);
+ if (delivered_pkts) {
+ if (!tp->pkts_acked_ewma) {
+ tp->pkts_acked_ewma = delivered_pkts << PKTS_ACKED_PREC;
+ } else {
+ u32 ewma = tp->pkts_acked_ewma;
+
+ ewma = (((ewma << PKTS_ACKED_WEIGHT) - ewma) +
+ (delivered_pkts << PKTS_ACKED_PREC)) >>
+ PKTS_ACKED_WEIGHT;
+ tp->pkts_acked_ewma = min_t(u32, ewma, 0xFFFFU);
+ }
+ }
+
if (!(flag & FLAG_SLOWPATH)) {
/* AccECN counter might overflow on large ACKs */
if (delivered_pkts <= TCP_ACCECN_CEP_ACE_MASK)
@@ -722,7 +739,8 @@ static u32 __tcp_accecn_process(struct sock *sk, const struct sk_buff *skb,
return safe_delta;
if (d_ceb < safe_delta * tp->mss_cache >> TCP_ACCECN_SAFETY_SHIFT)
return delta;
- }
+ } else if (tp->pkts_acked_ewma > (ACK_COMP_THRESH << PKTS_ACKED_PREC))
+ return delta;
return safe_delta;
}
--
2.34.1
* [PATCH net-next 29/44] gro: flushing when CWR is set negatively affects AccECN
2024-10-15 10:28 [PATCH net-next 00/44] DualPI2, Accurate ECN, TCP Prague patch series chia-yu.chang
` (27 preceding siblings ...)
2024-10-15 10:29 ` [PATCH net-next 28/44] tcp: try to avoid safer when ACKs are thinned chia-yu.chang
@ 2024-10-15 10:29 ` chia-yu.chang
2024-10-15 10:29 ` [PATCH net-next 30/44] tcp: accecn: Add ece_delta to rate_sample chia-yu.chang
` (15 subsequent siblings)
44 siblings, 0 replies; 56+ messages in thread
From: chia-yu.chang @ 2024-10-15 10:29 UTC (permalink / raw)
To: netdev, ij, ncardwell, koen.de_schepper, g.white,
ingemar.s.johansson, mirja.kuehlewind, cheshire, rs.ietf,
Jason_Livingood, vidhi_goel
Cc: Chia-Yu Chang
From: Ilpo Järvinen <ij@kernel.org>
As AccECN may keep the CWR bit asserted due to its different
interpretation of the bit, flushing in GRO because of CWR may
effectively disable GRO until the AccECN counter field changes
such that the CWR bit becomes 0.
There is no harm in not immediately forwarding the CWR'ed
segment with RFC3168 ECN.
Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
---
net/ipv4/tcp_offload.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/net/ipv4/tcp_offload.c b/net/ipv4/tcp_offload.c
index f59762d88c38..6286488abeca 100644
--- a/net/ipv4/tcp_offload.c
+++ b/net/ipv4/tcp_offload.c
@@ -327,8 +327,7 @@ struct sk_buff *tcp_gro_receive(struct list_head *head, struct sk_buff *skb,
goto out_check_final;
th2 = tcp_hdr(p);
- flush = (__force int)(flags & TCP_FLAG_CWR);
- flush |= (__force int)((flags ^ tcp_flag_word(th2)) &
+ flush = (__force int)((flags ^ tcp_flag_word(th2)) &
~(TCP_FLAG_FIN | TCP_FLAG_PSH));
flush |= (__force int)(th->ack_seq ^ th2->ack_seq);
for (i = sizeof(*th); i < thlen; i += 4)
--
2.34.1
* [PATCH net-next 30/44] tcp: accecn: Add ece_delta to rate_sample
2024-10-15 10:28 [PATCH net-next 00/44] DualPI2, Accurate ECN, TCP Prague patch series chia-yu.chang
` (28 preceding siblings ...)
2024-10-15 10:29 ` [PATCH net-next 29/44] gro: flushing when CWR is set negatively affects AccECN chia-yu.chang
@ 2024-10-15 10:29 ` chia-yu.chang
2024-10-15 10:29 ` [PATCH net-next 31/44] tcp: L4S ECT(1) identifier for CC modules chia-yu.chang
` (14 subsequent siblings)
44 siblings, 0 replies; 56+ messages in thread
From: chia-yu.chang @ 2024-10-15 10:29 UTC (permalink / raw)
To: netdev, ij, ncardwell, koen.de_schepper, g.white,
ingemar.s.johansson, mirja.kuehlewind, cheshire, rs.ietf,
Jason_Livingood, vidhi_goel
Cc: Olivier Tilmans, Chia-Yu Chang
From: Ilpo Järvinen <ij@kernel.org>
Include the echoed CE count in rate_sample. Replace the local
ecn_count variable with it.
Co-developed-by: Olivier Tilmans <olivier.tilmans@nokia.com>
Signed-off-by: Olivier Tilmans <olivier.tilmans@nokia.com>
Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
---
include/net/tcp.h | 1 +
net/ipv4/tcp_input.c | 32 ++++++++++++++++----------------
2 files changed, 17 insertions(+), 16 deletions(-)
diff --git a/include/net/tcp.h b/include/net/tcp.h
index a2f6b8781f11..822ae5ceb235 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1285,6 +1285,7 @@ struct rate_sample {
int losses; /* number of packets marked lost upon ACK */
u32 acked_sacked; /* number of packets newly (S)ACKed upon ACK */
u32 prior_in_flight; /* in flight before this ACK */
+ u32 ece_delta; /* is this ACK echoing some received CE? */
u32 last_end_seq; /* end_seq of most recently ACKed packet */
bool is_app_limited; /* is sample from packet with bubble in pipe? */
bool is_retrans; /* is sample from retransmission? */
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index b1b6c55ff6e2..bd7430a1e595 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -745,8 +745,9 @@ static u32 __tcp_accecn_process(struct sock *sk, const struct sk_buff *skb,
return safe_delta;
}
-static u32 tcp_accecn_process(struct sock *sk, const struct sk_buff *skb,
- u32 delivered_pkts, u32 delivered_bytes, int *flag)
+static void tcp_accecn_process(struct sock *sk, struct rate_sample *rs,
+ const struct sk_buff *skb,
+ u32 delivered_pkts, u32 delivered_bytes, int *flag)
{
u32 delta;
struct tcp_sock *tp = tcp_sk(sk);
@@ -756,11 +757,11 @@ static u32 tcp_accecn_process(struct sock *sk, const struct sk_buff *skb,
if (delta > 0) {
tcp_count_delivered_ce(tp, delta);
*flag |= FLAG_ECE;
+ rs->ece_delta = delta;
/* Recalculate header predictor */
if (tp->pred_flags)
tcp_fast_path_on(tp);
}
- return delta;
}
/* Buffer size and advertised window tuning.
@@ -4260,8 +4261,8 @@ static void tcp_xmit_recovery(struct sock *sk, int rexmit)
}
/* Returns the number of packets newly acked or sacked by the current ACK */
-static u32 tcp_newly_delivered(struct sock *sk, u32 prior_delivered,
- u32 ecn_count, int flag)
+static u32 tcp_newly_delivered(struct sock *sk, struct rate_sample *rs,
+ u32 prior_delivered, int flag)
{
const struct net *net = sock_net(sk);
struct tcp_sock *tp = tcp_sk(sk);
@@ -4272,8 +4273,8 @@ static u32 tcp_newly_delivered(struct sock *sk, u32 prior_delivered,
if (flag & FLAG_ECE) {
if (tcp_ecn_mode_rfc3168(tp))
- ecn_count = delivered;
- NET_ADD_STATS(net, LINUX_MIB_TCPDELIVEREDCE, ecn_count);
+ rs->ece_delta = delivered;
+ NET_ADD_STATS(net, LINUX_MIB_TCPDELIVEREDCE, rs->ece_delta);
}
return delivered;
@@ -4285,7 +4286,7 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
struct inet_connection_sock *icsk = inet_csk(sk);
struct tcp_sock *tp = tcp_sk(sk);
struct tcp_sacktag_state sack_state;
- struct rate_sample rs = { .prior_delivered = 0 };
+ struct rate_sample rs = { .prior_delivered = 0, .ece_delta = 0 };
u32 prior_snd_una = tp->snd_una;
bool is_sack_reneg = tp->is_sack_reneg;
u32 ack_seq = TCP_SKB_CB(skb)->seq;
@@ -4295,7 +4296,6 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
u32 delivered = tp->delivered;
u32 lost = tp->lost;
int rexmit = REXMIT_NONE; /* Flag to (re)transmit to recover losses */
- u32 ecn_count = 0; /* Did we receive ECE/an AccECN ACE update? */
u32 prior_fack;
sack_state.first_sackt = 0;
@@ -4405,8 +4405,8 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
tcp_rack_update_reo_wnd(sk, &rs);
if (tcp_ecn_mode_accecn(tp))
- ecn_count = tcp_accecn_process(sk, skb, tp->delivered - delivered,
- sack_state.delivered_bytes, &flag);
+ tcp_accecn_process(sk, &rs, skb, tp->delivered - delivered,
+ sack_state.delivered_bytes, &flag);
tcp_in_ack_event(sk, flag);
@@ -4432,7 +4432,7 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
if ((flag & FLAG_FORWARD_PROGRESS) || !(flag & FLAG_NOT_DUP))
sk_dst_confirm(sk);
- delivered = tcp_newly_delivered(sk, delivered, ecn_count, flag);
+ delivered = tcp_newly_delivered(sk, &rs, delivered, flag);
lost = tp->lost - lost; /* freshly marked lost */
rs.is_ack_delayed = !!(flag & FLAG_ACK_MAYBE_DELAYED);
@@ -4443,14 +4443,14 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
no_queue:
if (tcp_ecn_mode_accecn(tp))
- ecn_count = tcp_accecn_process(sk, skb, tp->delivered - delivered,
- sack_state.delivered_bytes, &flag);
+ tcp_accecn_process(sk, &rs, skb, tp->delivered - delivered,
+ sack_state.delivered_bytes, &flag);
tcp_in_ack_event(sk, flag);
/* If data was DSACKed, see if we can undo a cwnd reduction. */
if (flag & FLAG_DSACKING_ACK) {
tcp_fastretrans_alert(sk, prior_snd_una, num_dupack, &flag,
&rexmit);
- tcp_newly_delivered(sk, delivered, ecn_count, flag);
+ tcp_newly_delivered(sk, &rs, delivered, flag);
}
/* If this ack opens up a zero window, clear backoff. It was
* being used to time the probes, and is probably far higher than
@@ -4471,7 +4471,7 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
&sack_state);
tcp_fastretrans_alert(sk, prior_snd_una, num_dupack, &flag,
&rexmit);
- tcp_newly_delivered(sk, delivered, ecn_count, flag);
+ tcp_newly_delivered(sk, &rs, delivered, flag);
tcp_xmit_recovery(sk, rexmit);
}
--
2.34.1
^ permalink raw reply related [flat|nested] 56+ messages in thread
* [PATCH net-next 31/44] tcp: L4S ECT(1) identifier for CC modules
2024-10-15 10:28 [PATCH net-next 00/44] DualPI2, Accurate ECN, TCP Prague patch series chia-yu.chang
` (29 preceding siblings ...)
2024-10-15 10:29 ` [PATCH net-next 30/44] tcp: accecn: Add ece_delta to rate_sample chia-yu.chang
@ 2024-10-15 10:29 ` chia-yu.chang
2024-10-15 10:29 ` [PATCH net-next 32/44] tcp: disable RFC3168 fallback " chia-yu.chang
` (13 subsequent siblings)
44 siblings, 0 replies; 56+ messages in thread
From: chia-yu.chang @ 2024-10-15 10:29 UTC (permalink / raw)
To: netdev, ij, ncardwell, koen.de_schepper, g.white,
ingemar.s.johansson, mirja.kuehlewind, cheshire, rs.ietf,
Jason_Livingood, vidhi_goel
Cc: Chia-Yu Chang, Olivier Tilmans
From: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
When ECN is successfully negotiated for a TCP flow, ECT(0) is always
used in the IP header by default. The L4S service, however, needs to
use ECT(1).
This patch enables congestion control algorithms to control whether
ECT(0) or ECT(1) should be used on a per-segment basis. A new CA
module flag (TCP_CONG_WANTS_ECT_1) defines the behavior expected by
the CA before it has been initialized for the connection. As such, it
implicitly assumes that the CA also has TCP_CONG_NEEDS_ECN set.
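For illustration, a CA module that wants ECT(1) from the start can set
the flag in its tcp_congestion_ops. A minimal sketch, assuming the flag
definitions from this patch (the module name and the reuse of the Reno
helpers are only for the example):

  static struct tcp_congestion_ops ect1_example __read_mostly = {
          .flags          = TCP_CONG_WANTS_ECT_1,
          .name           = "ect1_example",
          .owner          = THIS_MODULE,
          .ssthresh       = tcp_reno_ssthresh,
          .cong_avoid     = tcp_reno_cong_avoid,
          .undo_cwnd      = tcp_reno_undo_cwnd,
  };

As TCP_CONG_WANTS_ECT_1 expands to TCP_CONG_NEEDS_ECN |
TCP_CONG_NEEDS_ACCECN, setting it also requests (Acc)ECN negotiation.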
Co-developed-by: Olivier Tilmans <olivier.tilmans@nokia.com>
Signed-off-by: Olivier Tilmans <olivier.tilmans@nokia.com>
Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
---
include/net/inet_ecn.h | 20 +++++++++++++++++---
include/net/tcp.h | 8 ++++++++
net/ipv4/tcp_cong.c | 9 ++++++---
net/ipv4/tcp_output.c | 7 ++++---
4 files changed, 35 insertions(+), 9 deletions(-)
diff --git a/include/net/inet_ecn.h b/include/net/inet_ecn.h
index ea32393464a2..3c64d32a32b0 100644
--- a/include/net/inet_ecn.h
+++ b/include/net/inet_ecn.h
@@ -51,11 +51,25 @@ static inline __u8 INET_ECN_encapsulate(__u8 outer, __u8 inner)
return outer;
}
+/* Apply either ECT(0) or ECT(1) */
+static inline void __INET_ECN_xmit(struct sock *sk, bool use_ect_1)
+{
+ __u8 ect = use_ect_1 ? INET_ECN_ECT_1 : INET_ECN_ECT_0;
+
+ /* Mask the complete byte in case the connection alternates between
+ * ECT(0) and ECT(1).
+ */
+ inet_sk(sk)->tos &= ~INET_ECN_MASK;
+ inet_sk(sk)->tos |= ect;
+ if (inet6_sk(sk) != NULL) {
+ inet6_sk(sk)->tclass &= ~INET_ECN_MASK;
+ inet6_sk(sk)->tclass |= ect;
+ }
+}
+
static inline void INET_ECN_xmit(struct sock *sk)
{
- inet_sk(sk)->tos |= INET_ECN_ECT_0;
- if (inet6_sk(sk) != NULL)
- inet6_sk(sk)->tclass |= INET_ECN_ECT_0;
+ __INET_ECN_xmit(sk, false);
}
static inline void INET_ECN_dontxmit(struct sock *sk)
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 822ae5ceb235..cecbec887508 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -426,6 +426,7 @@ static inline void tcp_dec_quickack_mode(struct sock *sk)
#define TCP_ECN_DEMAND_CWR BIT(2)
#define TCP_ECN_SEEN BIT(3)
#define TCP_ECN_MODE_ACCECN BIT(4)
+#define TCP_ECN_ECT_1 BIT(5)
#define TCP_ECN_DISABLED 0
#define TCP_ECN_MODE_PENDING (TCP_ECN_MODE_RFC3168|TCP_ECN_MODE_ACCECN)
@@ -1253,6 +1254,8 @@ enum tcp_ca_ack_event_flags {
#define TCP_CONG_NEEDS_ECN BIT(1)
/* Require successfully negotiated AccECN capability */
#define TCP_CONG_NEEDS_ACCECN BIT(2)
+/* Use ECT(1) instead of ECT(0) while the CA is uninitialized */
+#define TCP_CONG_WANTS_ECT_1 (TCP_CONG_NEEDS_ECN | TCP_CONG_NEEDS_ACCECN)
#define TCP_CONG_MASK (TCP_CONG_NON_RESTRICTED | TCP_CONG_NEEDS_ECN | \
TCP_CONG_NEEDS_ACCECN)
@@ -1394,6 +1397,11 @@ static inline bool tcp_ca_needs_accecn(const struct sock *sk)
return icsk->icsk_ca_ops->flags & TCP_CONG_NEEDS_ACCECN;
}
+static inline bool tcp_ca_wants_ect_1(const struct sock *sk)
+{
+ return inet_csk(sk)->icsk_ca_ops->flags & TCP_CONG_WANTS_ECT_1;
+}
+
static inline void tcp_ca_event(struct sock *sk, const enum tcp_ca_event event)
{
const struct inet_connection_sock *icsk = inet_csk(sk);
diff --git a/net/ipv4/tcp_cong.c b/net/ipv4/tcp_cong.c
index 0306d257fa64..7be5fb14428b 100644
--- a/net/ipv4/tcp_cong.c
+++ b/net/ipv4/tcp_cong.c
@@ -227,7 +227,7 @@ void tcp_assign_congestion_control(struct sock *sk)
memset(icsk->icsk_ca_priv, 0, sizeof(icsk->icsk_ca_priv));
if (ca->flags & TCP_CONG_NEEDS_ECN)
- INET_ECN_xmit(sk);
+ __INET_ECN_xmit(sk, tcp_ca_wants_ect_1(sk));
else
INET_ECN_dontxmit(sk);
}
@@ -240,7 +240,10 @@ void tcp_init_congestion_control(struct sock *sk)
if (icsk->icsk_ca_ops->init)
icsk->icsk_ca_ops->init(sk);
if (tcp_ca_needs_ecn(sk))
- INET_ECN_xmit(sk);
+ /* The CA is already initialized, expect it to set the
+ * appropriate flag to select ECT(1).
+ */
+ __INET_ECN_xmit(sk, tcp_sk(sk)->ecn_flags & TCP_ECN_ECT_1);
else
INET_ECN_dontxmit(sk);
icsk->icsk_ca_initialized = 1;
@@ -257,7 +260,7 @@ static void tcp_reinit_congestion_control(struct sock *sk,
memset(icsk->icsk_ca_priv, 0, sizeof(icsk->icsk_ca_priv));
if (ca->flags & TCP_CONG_NEEDS_ECN)
- INET_ECN_xmit(sk);
+ __INET_ECN_xmit(sk, tcp_ca_wants_ect_1(sk));
else
INET_ECN_dontxmit(sk);
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 663cdea1b87b..ec10785f6d00 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -326,7 +326,7 @@ static void tcp_ecn_send_synack(struct sock *sk, struct sk_buff *skb)
TCP_SKB_CB(skb)->tcp_flags &= ~TCPHDR_ECE;
else if (tcp_ca_needs_ecn(sk) ||
tcp_bpf_ca_needs_ecn(sk))
- INET_ECN_xmit(sk);
+ __INET_ECN_xmit(sk, tcp_ca_wants_ect_1(sk));
if (tp->ecn_flags & TCP_ECN_MODE_ACCECN) {
TCP_SKB_CB(skb)->tcp_flags &= ~TCPHDR_ACE;
@@ -366,7 +366,7 @@ static void tcp_ecn_send_syn(struct sock *sk, struct sk_buff *skb)
if (use_ecn) {
if (tcp_ca_needs_ecn(sk) || bpf_needs_ecn)
- INET_ECN_xmit(sk);
+ __INET_ECN_xmit(sk, tcp_ca_wants_ect_1(sk));
TCP_SKB_CB(skb)->tcp_flags |= TCPHDR_ECE | TCPHDR_CWR;
if (use_accecn) {
@@ -435,7 +435,8 @@ static void tcp_ecn_send(struct sock *sk, struct sk_buff *skb,
return;
if (!tcp_accecn_ace_fail_recv(tp))
- INET_ECN_xmit(sk);
+ /* The CCA could change the ECT codepoint on the fly, reset it*/
+ __INET_ECN_xmit(sk, tp->ecn_flags & TCP_ECN_ECT_1);
if (tcp_ecn_mode_accecn(tp)) {
tcp_accecn_set_ace(tp, skb, th);
skb_shinfo(skb)->gso_type |= SKB_GSO_TCP_ACCECN;
--
2.34.1
^ permalink raw reply related [flat|nested] 56+ messages in thread
* [PATCH net-next 32/44] tcp: disable RFC3168 fallback identifier for CC modules
2024-10-15 10:28 [PATCH net-next 00/44] DualPI2, Accurate ECN, TCP Prague patch series chia-yu.chang
` (30 preceding siblings ...)
2024-10-15 10:29 ` [PATCH net-next 31/44] tcp: L4S ECT(1) identifier for CC modules chia-yu.chang
@ 2024-10-15 10:29 ` chia-yu.chang
2024-10-15 10:29 ` [PATCH net-next 33/44] tcp: accecn: handle unexpected AccECN negotiation feedback chia-yu.chang
` (12 subsequent siblings)
44 siblings, 0 replies; 56+ messages in thread
From: chia-yu.chang @ 2024-10-15 10:29 UTC (permalink / raw)
To: netdev, ij, ncardwell, koen.de_schepper, g.white,
ingemar.s.johansson, mirja.kuehlewind, cheshire, rs.ietf,
Jason_Livingood, vidhi_goel
Cc: Chia-Yu Chang
From: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
When AccECN is not successfully negotiated for a TCP flow, the flow
falls back to classic ECN (RFC3168) by default. The L4S service,
however, must fall back to non-ECN instead.
This patch lets a congestion control module indicate that it must not
fall back to classic ECN after an unsuccessful AccECN negotiation. A
new CA module flag (TCP_CONG_NO_FALLBACK_RFC3168) identifies this
behavior expected by the CA.
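The net effect on the handshake can be sketched as follows, assuming
the tcp_ca_no_fallback_rfc3168() helper added by this patch (the
function below is only an illustration, not the kernel code itself):

  /* A classic-ECN handshake only leaves the connection in RFC3168
   * mode if the CA does not forbid that fallback.
   */
  static bool may_use_rfc3168(const struct sock *sk, const struct tcphdr *th)
  {
          return th->ece && th->cwr && !tcp_ca_no_fallback_rfc3168(sk);
  }

A CA opts in simply by adding TCP_CONG_NO_FALLBACK_RFC3168 to the
.flags field of its tcp_congestion_ops.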
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
---
include/net/tcp.h | 11 ++++++++++-
net/ipv4/tcp_input.c | 11 +++++++----
net/ipv4/tcp_minisocks.c | 2 +-
3 files changed, 18 insertions(+), 6 deletions(-)
diff --git a/include/net/tcp.h b/include/net/tcp.h
index cecbec887508..4d055a54c645 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1254,10 +1254,12 @@ enum tcp_ca_ack_event_flags {
#define TCP_CONG_NEEDS_ECN BIT(1)
/* Require successfully negotiated AccECN capability */
#define TCP_CONG_NEEDS_ACCECN BIT(2)
+/* Cannot fallback to RFC3168 during AccECN negotiation */
+#define TCP_CONG_NO_FALLBACK_RFC3168 BIT(3)
/* Use ECT(1) instead of ECT(0) while the CA is uninitialized */
#define TCP_CONG_WANTS_ECT_1 (TCP_CONG_NEEDS_ECN | TCP_CONG_NEEDS_ACCECN)
#define TCP_CONG_MASK (TCP_CONG_NON_RESTRICTED | TCP_CONG_NEEDS_ECN | \
- TCP_CONG_NEEDS_ACCECN)
+ TCP_CONG_NEEDS_ACCECN | TCP_CONG_NO_FALLBACK_RFC3168)
union tcp_cc_info;
@@ -1397,6 +1399,13 @@ static inline bool tcp_ca_needs_accecn(const struct sock *sk)
return icsk->icsk_ca_ops->flags & TCP_CONG_NEEDS_ACCECN;
}
+static inline bool tcp_ca_no_fallback_rfc3168(const struct sock *sk)
+{
+ const struct inet_connection_sock *icsk = inet_csk(sk);
+
+ return icsk->icsk_ca_ops->flags & TCP_CONG_NO_FALLBACK_RFC3168;
+}
+
static inline bool tcp_ca_wants_ect_1(const struct sock *sk)
{
return inet_csk(sk)->icsk_ca_ops->flags & TCP_CONG_WANTS_ECT_1;
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index bd7430a1e595..fb3c3a3e7c56 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -452,7 +452,9 @@ static void tcp_ecn_rcv_synack(struct sock *sk, const struct sk_buff *skb,
break;
case 0x1:
case 0x5:
- if (tcp_ecn_mode_pending(tp))
+ if (tcp_ca_no_fallback_rfc3168(sk))
+ tcp_ecn_mode_set(tp, TCP_ECN_DISABLED);
+ else if (tcp_ecn_mode_pending(tp))
/* Downgrade from AccECN, or requested initially */
tcp_ecn_mode_set(tp, TCP_ECN_MODE_RFC3168);
break;
@@ -476,9 +478,10 @@ static void tcp_ecn_rcv_synack(struct sock *sk, const struct sk_buff *skb,
}
}
-static void tcp_ecn_rcv_syn(struct tcp_sock *tp, const struct tcphdr *th,
+static void tcp_ecn_rcv_syn(struct sock *sk, const struct tcphdr *th,
const struct sk_buff *skb)
{
+ struct tcp_sock *tp = tcp_sk(sk);
if (tcp_ecn_mode_pending(tp)) {
if (!tcp_accecn_syn_requested(th)) {
/* Downgrade to classic ECN feedback */
@@ -489,7 +492,7 @@ static void tcp_ecn_rcv_syn(struct tcp_sock *tp, const struct tcphdr *th,
tcp_ecn_mode_set(tp, TCP_ECN_MODE_ACCECN);
}
}
- if (tcp_ecn_mode_rfc3168(tp) && (!th->ece || !th->cwr))
+ if (tcp_ecn_mode_rfc3168(tp) && (!th->ece || !th->cwr || tcp_ca_no_fallback_rfc3168(sk)))
tcp_ecn_mode_set(tp, TCP_ECN_DISABLED);
}
@@ -7111,7 +7114,7 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
tp->snd_wl1 = TCP_SKB_CB(skb)->seq;
tp->max_window = tp->snd_wnd;
- tcp_ecn_rcv_syn(tp, th, skb);
+ tcp_ecn_rcv_syn(sk, th, skb);
tcp_mtup_init(sk);
tcp_sync_mss(sk, icsk->icsk_pmtu_cookie);
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index cce1816e4244..4037a94fbe59 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -495,7 +495,7 @@ static void tcp_ecn_openreq_child(struct sock *sk,
tp->accecn_opt_demand = 1;
tcp_ecn_received_counters(sk, skb, skb->len - th->doff * 4);
} else {
- tcp_ecn_mode_set(tp, inet_rsk(req)->ecn_ok ?
+ tcp_ecn_mode_set(tp, inet_rsk(req)->ecn_ok && !tcp_ca_no_fallback_rfc3168(sk) ?
TCP_ECN_MODE_RFC3168 :
TCP_ECN_DISABLED);
}
--
2.34.1
^ permalink raw reply related [flat|nested] 56+ messages in thread
* [PATCH net-next 33/44] tcp: accecn: handle unexpected AccECN negotiation feedback
2024-10-15 10:28 [PATCH net-next 00/44] DualPI2, Accurate ECN, TCP Prague patch series chia-yu.chang
` (31 preceding siblings ...)
2024-10-15 10:29 ` [PATCH net-next 32/44] tcp: disable RFC3168 fallback " chia-yu.chang
@ 2024-10-15 10:29 ` chia-yu.chang
2024-10-15 10:29 ` [PATCH net-next 34/44] tcp: accecn: retransmit downgraded SYN in AccECN negotiation chia-yu.chang
` (11 subsequent siblings)
44 siblings, 0 replies; 56+ messages in thread
From: chia-yu.chang @ 2024-10-15 10:29 UTC (permalink / raw)
To: netdev, ij, ncardwell, koen.de_schepper, g.white,
ingemar.s.johansson, mirja.kuehlewind, cheshire, rs.ietf,
Jason_Livingood, vidhi_goel
Cc: Chia-Yu Chang
From: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Based on specification:
https://tools.ietf.org/id/draft-ietf-tcpm-accurate-ecn-28.txt
3.1.2. Backward Compatibility - If a TCP Client has sent a SYN
requesting AccECN feedback with (AE,CWR,ECE) = (1,1,1) then receives
a SYN/ACK with the currently reserved combination (AE,CWR,ECE) =
(1,0,1) but it does not have logic specific to such a combination,
the Client MUST enable AccECN mode as if the SYN/ACK confirmed that
the Server supported AccECN and as if it fed back that the IP-ECN
field on the SYN had arrived unchanged.
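For reference, the ACE value that tcp_ecn_rcv_synack() switches on is
just the three TCP header bits packed together (mirroring what the
tcp_accecn_ace() helper in this series computes), so the reserved
combination quoted above shows up as 0x5. A sketch, assuming the AE
header bit added earlier in the series:

  static inline u8 ace_field(const struct tcphdr *th)
  {
          /* (AE,CWR,ECE) = (1,0,1) -> 0b101 = 0x5 */
          return (th->ae << 2) | (th->cwr << 1) | th->ece;
  }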
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
---
net/ipv4/tcp_input.c | 39 ++++++++++++++++++++++++++-------------
1 file changed, 26 insertions(+), 13 deletions(-)
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index fb3c3a3e7c56..062bb77d886f 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -438,6 +438,21 @@ bool tcp_accecn_validate_syn_feedback(struct sock *sk, u8 ace, u8 sent_ect)
return true;
}
+static void tcp_ecn_rcv_synack_accecn(struct tcp_sock *tp, const struct sk_buff *skb,
+ u8 ip_dsfield)
+{
+ tcp_ecn_mode_set(tp, TCP_ECN_MODE_ACCECN);
+ tp->syn_ect_rcv = ip_dsfield & INET_ECN_MASK;
+ if (tp->rx_opt.accecn &&
+ tp->saw_accecn_opt < TCP_ACCECN_OPT_COUNTER_SEEN) {
+ tp->saw_accecn_opt = tcp_accecn_option_init(skb,
+ tp->rx_opt.accecn);
+ if (tp->saw_accecn_opt == TCP_ACCECN_OPT_FAIL_SEEN)
+ tcp_accecn_fail_mode_set(tp, TCP_ACCECN_OPT_FAIL_RECV);
+ tp->accecn_opt_demand = 2;
+ }
+}
+
/* See Table 2 of the AccECN draft */
static void tcp_ecn_rcv_synack(struct sock *sk, const struct sk_buff *skb,
const struct tcphdr *th, u8 ip_dsfield)
@@ -451,24 +466,22 @@ static void tcp_ecn_rcv_synack(struct sock *sk, const struct sk_buff *skb,
tcp_ecn_mode_set(tp, TCP_ECN_DISABLED);
break;
case 0x1:
- case 0x5:
if (tcp_ca_no_fallback_rfc3168(sk))
tcp_ecn_mode_set(tp, TCP_ECN_DISABLED);
- else if (tcp_ecn_mode_pending(tp))
- /* Downgrade from AccECN, or requested initially */
+ else
tcp_ecn_mode_set(tp, TCP_ECN_MODE_RFC3168);
break;
- default:
- tcp_ecn_mode_set(tp, TCP_ECN_MODE_ACCECN);
- tp->syn_ect_rcv = ip_dsfield & INET_ECN_MASK;
- if (tp->rx_opt.accecn &&
- tp->saw_accecn_opt < TCP_ACCECN_OPT_COUNTER_SEEN) {
- tp->saw_accecn_opt = tcp_accecn_option_init(skb,
- tp->rx_opt.accecn);
- if (tp->saw_accecn_opt == TCP_ACCECN_OPT_FAIL_SEEN)
- tcp_accecn_fail_mode_set(tp, TCP_ACCECN_OPT_FAIL_RECV);
- tp->accecn_opt_demand = 2;
+ case 0x5:
+ if (tcp_ecn_mode_pending(tp)) {
+ tcp_ecn_rcv_synack_accecn(tp, skb, ip_dsfield);
+ if (INET_ECN_is_ce(ip_dsfield)) {
+ tp->received_ce++;
+ tp->received_ce_pending++;
+ }
}
+ break;
+ default:
+ tcp_ecn_rcv_synack_accecn(tp, skb, ip_dsfield);
if (tcp_accecn_validate_syn_feedback(sk, ace, tp->syn_ect_snt) &&
INET_ECN_is_ce(ip_dsfield)) {
tp->received_ce++;
--
2.34.1
^ permalink raw reply related [flat|nested] 56+ messages in thread
* [PATCH net-next 34/44] tcp: accecn: retransmit downgraded SYN in AccECN negotiation
2024-10-15 10:28 [PATCH net-next 00/44] DualPI2, Accurate ECN, TCP Prague patch series chia-yu.chang
` (32 preceding siblings ...)
2024-10-15 10:29 ` [PATCH net-next 33/44] tcp: accecn: handle unexpected AccECN negotiation feedback chia-yu.chang
@ 2024-10-15 10:29 ` chia-yu.chang
2024-10-15 10:29 ` [PATCH net-next 35/44] tcp: move increment of num_retrans chia-yu.chang
` (10 subsequent siblings)
44 siblings, 0 replies; 56+ messages in thread
From: chia-yu.chang @ 2024-10-15 10:29 UTC (permalink / raw)
To: netdev, ij, ncardwell, koen.de_schepper, g.white,
ingemar.s.johansson, mirja.kuehlewind, cheshire, rs.ietf,
Jason_Livingood, vidhi_goel
Cc: Chia-Yu Chang
From: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Based on specification:
https://tools.ietf.org/id/draft-ietf-tcpm-accurate-ecn-28.txt
3.1.4.1. Retransmitted SYNs - If the sender of an AccECN SYN (the TCP
Client) times out before receiving the SYN/ACK, it SHOULD attempt to
negotiate the use of AccECN at least one more time by continuing to
set all three TCP ECN flags (AE,CWR,ECE) = (1,1,1) on the first
retransmitted SYN (using the usual retransmission time-outs). If this
first retransmission also fails to be acknowledged, in deployment
scenarios where AccECN path traversal might be problematic, the TCP
Client SHOULD send subsequent retransmissions of the SYN with the
three TCP-ECN flags cleared (AE,CWR,ECE) = (0,0,0).
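With this change, a client requesting AccECN roughly ends up sending
(a sketch, assuming icsk_retransmits has already been incremented by
the retransmit timer when __tcp_retransmit_skb() runs):

  original SYN            (AE,CWR,ECE) = (1,1,1)
  1st retransmitted SYN   (1,1,1), still trying to negotiate AccECN
  2nd+ retransmitted SYN  (0,0,0), ECN flags cleared as before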
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
---
net/ipv4/tcp_output.c | 14 ++++++++------
1 file changed, 8 insertions(+), 6 deletions(-)
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index ec10785f6d00..ae78ff6784d3 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -3617,12 +3617,14 @@ int __tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb, int segs)
tcp_retrans_try_collapse(sk, skb, avail_wnd);
}
- /* RFC3168, section 6.1.1.1. ECN fallback
- * As AccECN uses the same SYN flags (+ AE), this check covers both
- * cases.
- */
- if ((TCP_SKB_CB(skb)->tcp_flags & TCPHDR_SYN_ECN) == TCPHDR_SYN_ECN)
- tcp_ecn_clear_syn(sk, skb);
+ if (!tcp_ecn_mode_pending(tp) || icsk->icsk_retransmits > 1) {
+ /* RFC3168, section 6.1.1.1. ECN fallback
+ * As AccECN uses the same SYN flags (+ AE), this check covers both
+ * cases.
+ */
+ if ((TCP_SKB_CB(skb)->tcp_flags & TCPHDR_SYN_ECN) == TCPHDR_SYN_ECN)
+ tcp_ecn_clear_syn(sk, skb);
+ }
/* Update global and local TCP statistics. */
segs = tcp_skb_pcount(skb);
--
2.34.1
^ permalink raw reply related [flat|nested] 56+ messages in thread
* [PATCH net-next 35/44] tcp: move increment of num_retrans
2024-10-15 10:28 [PATCH net-next 00/44] DualPI2, Accurate ECN, TCP Prague patch series chia-yu.chang
` (33 preceding siblings ...)
2024-10-15 10:29 ` [PATCH net-next 34/44] tcp: accecn: retransmit downgraded SYN in AccECN negotiation chia-yu.chang
@ 2024-10-15 10:29 ` chia-yu.chang
2024-10-15 10:29 ` [PATCH net-next 36/44] tcp: accecn: retransmit SYN/ACK without AccECN option or non-AccECN SYN/ACK chia-yu.chang
` (9 subsequent siblings)
44 siblings, 0 replies; 56+ messages in thread
From: chia-yu.chang @ 2024-10-15 10:29 UTC (permalink / raw)
To: netdev, ij, ncardwell, koen.de_schepper, g.white,
ingemar.s.johansson, mirja.kuehlewind, cheshire, rs.ietf,
Jason_Livingood, vidhi_goel
Cc: Chia-Yu Chang
From: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Before this patch, num_retrans = 0 both for the first SYN/ACK and for
the first retransmitted SYN/ACK; however, an upcoming change needs to
differentiate between those two cases. This patch moves the increment
of num_retrans before rtx_syn_ack() so the two cases can be
distinguished when building the SYN/ACK.
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
---
net/ipv4/inet_connection_sock.c | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 12e975ed4910..cf9491253ca3 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -906,10 +906,12 @@ static void syn_ack_recalc(struct request_sock *req,
int inet_rtx_syn_ack(const struct sock *parent, struct request_sock *req)
{
- int err = req->rsk_ops->rtx_syn_ack(parent, req);
+ int err;
- if (!err)
- req->num_retrans++;
+ req->num_retrans++;
+ err = req->rsk_ops->rtx_syn_ack(parent, req);
+ if (err)
+ req->num_retrans--;
return err;
}
EXPORT_SYMBOL(inet_rtx_syn_ack);
--
2.34.1
^ permalink raw reply related [flat|nested] 56+ messages in thread
* [PATCH net-next 36/44] tcp: accecn: retransmit SYN/ACK without AccECN option or non-AccECN SYN/ACK
2024-10-15 10:28 [PATCH net-next 00/44] DualPI2, Accurate ECN, TCP Prague patch series chia-yu.chang
` (34 preceding siblings ...)
2024-10-15 10:29 ` [PATCH net-next 35/44] tcp: move increment of num_retrans chia-yu.chang
@ 2024-10-15 10:29 ` chia-yu.chang
2024-10-15 10:29 ` [PATCH net-next 37/44] tcp: accecn: unset ECT if receive or send ACE=0 in AccECN negotiation chia-yu.chang
` (8 subsequent siblings)
44 siblings, 0 replies; 56+ messages in thread
From: chia-yu.chang @ 2024-10-15 10:29 UTC (permalink / raw)
To: netdev, ij, ncardwell, koen.de_schepper, g.white,
ingemar.s.johansson, mirja.kuehlewind, cheshire, rs.ietf,
Jason_Livingood, vidhi_goel
Cc: Chia-Yu Chang
From: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Based on specification:
https://tools.ietf.org/id/draft-ietf-tcpm-accurate-ecn-28.txt
3.2.3.2.2. Testing for Loss of Packets Carrying the AccECN Option -
If the TCP Server has not received an ACK to acknowledge its SYN/ACK
after the normal TCP timeout or it receives a second SYN with a
request for AccECN support, then either the SYN/ACK might just have
been lost, e.g. due to congestion, or a middlebox might be blocking
AccECN Options. To expedite connection setup in deployment scenarios
where AccECN path traversal might be problematic, the TCP Server
SHOULD retransmit the SYN/ACK, but with no AccECN Option. If this
retransmission times out, to expedite connection setup, the TCP
Server SHOULD retransmit the SYN/ACK with (AE,CWR,ECE) = (0,0,0)
and no AccECN Option, but it remains in AccECN feedback mode
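Together with the earlier change that increments num_retrans before
rtx_syn_ack(), the SYN/ACKs sent for an AccECN request roughly become
(a sketch of the intended behaviour):

  initial SYN/ACK             ACE echoes the SYN's IP-ECN field, AccECN option included
  1st retransmitted SYN/ACK   ACE echo kept, AccECN option omitted
  2nd+ retransmitted SYN/ACK  (AE,CWR,ECE) = (0,0,0), no AccECN option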
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
---
net/ipv4/tcp_output.c | 16 +++++++++++-----
1 file changed, 11 insertions(+), 5 deletions(-)
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index ae78ff6784d3..e5c361788a17 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -399,10 +399,16 @@ static void tcp_accecn_echo_syn_ect(struct tcphdr *th, u8 ect)
static void
tcp_ecn_make_synack(const struct request_sock *req, struct tcphdr *th)
{
- if (tcp_rsk(req)->accecn_ok)
- tcp_accecn_echo_syn_ect(th, tcp_rsk(req)->syn_ect_rcv);
- else if (inet_rsk(req)->ecn_ok)
- th->ece = 1;
+ if (req->num_retrans < 1 || req->num_timeout < 1) {
+ if (tcp_rsk(req)->accecn_ok)
+ tcp_accecn_echo_syn_ect(th, tcp_rsk(req)->syn_ect_rcv);
+ else if (inet_rsk(req)->ecn_ok)
+ th->ece = 1;
+ } else if (tcp_rsk(req)->accecn_ok) {
+ th->ae = 0;
+ th->cwr = 0;
+ th->ece = 0;
+ }
}
static void tcp_accecn_set_ace(struct tcp_sock *tp, struct sk_buff *skb,
@@ -1165,7 +1171,7 @@ static unsigned int tcp_synack_options(const struct sock *sk,
smc_set_option_cond(tcp_sk(sk), ireq, opts, &remaining);
if (treq->accecn_ok && sock_net(sk)->ipv4.sysctl_tcp_ecn_option &&
- req->num_timeout < 1 && remaining >= TCPOLEN_ACCECN_BASE) {
+ req->num_retrans < 1 && remaining >= TCPOLEN_ACCECN_BASE) {
opts->ecn_bytes = synack_ecn_bytes;
remaining -= tcp_options_fit_accecn(opts, 0, remaining,
tcp_synack_options_combine_saving(opts));
--
2.34.1
^ permalink raw reply related [flat|nested] 56+ messages in thread
* [PATCH net-next 37/44] tcp: accecn: unset ECT if receive or send ACE=0 in AccECN negotiation
2024-10-15 10:28 [PATCH net-next 00/44] DualPI2, Accurate ECN, TCP Prague patch series chia-yu.chang
` (35 preceding siblings ...)
2024-10-15 10:29 ` [PATCH net-next 36/44] tcp: accecn: retransmit SYN/ACK without AccECN option or non-AccECN SYN/ACK chia-yu.chang
@ 2024-10-15 10:29 ` chia-yu.chang
2024-10-15 10:29 ` [PATCH net-next 38/44] tcp: accecn: fallback outgoing half link to non-AccECN chia-yu.chang
` (7 subsequent siblings)
44 siblings, 0 replies; 56+ messages in thread
From: chia-yu.chang @ 2024-10-15 10:29 UTC (permalink / raw)
To: netdev, ij, ncardwell, koen.de_schepper, g.white,
ingemar.s.johansson, mirja.kuehlewind, cheshire, rs.ietf,
Jason_Livingood, vidhi_goel
Cc: Chia-Yu Chang
From: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Based on specification:
https://tools.ietf.org/id/draft-ietf-tcpm-accurate-ecn-28.txt
3.1.5. Implications of AccECN Mode - A TCP Server in AccECN mode
MUST NOT set ECT on any packet for the rest of the connection, if
it has received or sent at least one valid SYN or Acceptable SYN/ACK
with (AE,CWR,ECE) = (0,0,0) during the handshake.
3.1.5 Implications of AccECN Mode - A host in AccECN mode that is
feeding back the IP-ECN field on a SYN or SYN/ACK: MUST feed back
the IP-ECN field on the latest valid SYN or acceptable
SYN/ACK to arrive.
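A sketch of how the quoted requirements map onto the fail-mode bits
used in this patch:

  ACE = 0 received on a valid SYN            -> TCP_ACCECN_ACE_FAIL_RECV
  (AE,CWR,ECE) = (0,0,0) sent on our own
  downgraded (retransmitted) SYN/ACK         -> TCP_ACCECN_ACE_FAIL_SEND

tcp_ecn_send() then skips re-applying an ECT codepoint to outgoing
packets while either bit is set.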
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
---
net/ipv4/tcp_input.c | 1 +
net/ipv4/tcp_minisocks.c | 27 +++++++++++++++++----------
net/ipv4/tcp_output.c | 7 ++++---
3 files changed, 22 insertions(+), 13 deletions(-)
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 062bb77d886f..e88f449e89e1 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -6497,6 +6497,7 @@ static bool tcp_validate_incoming(struct sock *sk, struct sk_buff *skb,
if (th->syn) {
if (tcp_ecn_mode_accecn(tp)) {
send_accecn_reflector = true;
+ tp->syn_ect_rcv = TCP_SKB_CB(skb)->ip_dsfield & INET_ECN_MASK;
if (tp->rx_opt.accecn &&
tp->saw_accecn_opt < TCP_ACCECN_OPT_COUNTER_SEEN) {
tp->saw_accecn_opt = tcp_accecn_option_init(skb,
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index 4037a94fbe59..301606ff1708 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -775,16 +775,23 @@ struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb,
*/
if (!tcp_oow_rate_limited(sock_net(sk), skb,
LINUX_MIB_TCPACKSKIPPEDSYNRECV,
- &tcp_rsk(req)->last_oow_ack_time) &&
-
- !inet_rtx_syn_ack(sk, req)) {
- unsigned long expires = jiffies;
-
- expires += reqsk_timeout(req, TCP_RTO_MAX);
- if (!fastopen)
- mod_timer_pending(&req->rsk_timer, expires);
- else
- req->rsk_timer.expires = expires;
+ &tcp_rsk(req)->last_oow_ack_time)) {
+ if (tcp_rsk(req)->accecn_ok) {
+ tcp_rsk(req)->syn_ect_rcv = TCP_SKB_CB(skb)->ip_dsfield &
+ INET_ECN_MASK;
+ if (tcp_accecn_ace(tcp_hdr(skb)) == 0x0)
+ tcp_accecn_fail_mode_set(tcp_sk(sk),
+ TCP_ACCECN_ACE_FAIL_RECV);
+ }
+ if (!inet_rtx_syn_ack(sk, req)) {
+ unsigned long expires = jiffies;
+
+ expires += reqsk_timeout(req, TCP_RTO_MAX);
+ if (!fastopen)
+ mod_timer_pending(&req->rsk_timer, expires);
+ else
+ req->rsk_timer.expires = expires;
+ }
}
return NULL;
}
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index e5c361788a17..74ba08a33434 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -397,7 +397,7 @@ static void tcp_accecn_echo_syn_ect(struct tcphdr *th, u8 ect)
}
static void
-tcp_ecn_make_synack(const struct request_sock *req, struct tcphdr *th)
+tcp_ecn_make_synack(struct sock *sk, const struct request_sock *req, struct tcphdr *th)
{
if (req->num_retrans < 1 || req->num_timeout < 1) {
if (tcp_rsk(req)->accecn_ok)
@@ -408,6 +408,7 @@ tcp_ecn_make_synack(const struct request_sock *req, struct tcphdr *th)
th->ae = 0;
th->cwr = 0;
th->ece = 0;
+ tcp_accecn_fail_mode_set(tcp_sk(sk), TCP_ACCECN_ACE_FAIL_SEND);
}
}
@@ -440,7 +441,7 @@ static void tcp_ecn_send(struct sock *sk, struct sk_buff *skb,
if (!tcp_ecn_mode_any(tp))
return;
- if (!tcp_accecn_ace_fail_recv(tp))
+ if (!tcp_accecn_ace_fail_send(tp) && !tcp_accecn_ace_fail_recv(tp))
/* The CCA could change the ECT codepoint on the fly, reset it*/
__INET_ECN_xmit(sk, tp->ecn_flags & TCP_ECN_ECT_1);
if (tcp_ecn_mode_accecn(tp)) {
@@ -4052,7 +4053,7 @@ struct sk_buff *tcp_make_synack(const struct sock *sk, struct dst_entry *dst,
memset(th, 0, sizeof(struct tcphdr));
th->syn = 1;
th->ack = 1;
- tcp_ecn_make_synack(req, th);
+ tcp_ecn_make_synack((struct sock *)sk, req, th);
th->source = htons(ireq->ir_num);
th->dest = ireq->ir_rmt_port;
skb->mark = ireq->ir_mark;
--
2.34.1
^ permalink raw reply related [flat|nested] 56+ messages in thread
* [PATCH net-next 38/44] tcp: accecn: fallback outgoing half link to non-AccECN
2024-10-15 10:28 [PATCH net-next 00/44] DualPI2, Accurate ECN, TCP Prague patch series chia-yu.chang
` (36 preceding siblings ...)
2024-10-15 10:29 ` [PATCH net-next 37/44] tcp: accecn: unset ECT if receive or send ACE=0 in AccECN negotiation chia-yu.chang
@ 2024-10-15 10:29 ` chia-yu.chang
2024-10-15 10:29 ` [PATCH net-next 39/44] tcp: accecn: verify ACE counter in 1st ACK after AccECN negotiation chia-yu.chang
` (6 subsequent siblings)
44 siblings, 0 replies; 56+ messages in thread
From: chia-yu.chang @ 2024-10-15 10:29 UTC (permalink / raw)
To: netdev, ij, ncardwell, koen.de_schepper, g.white,
ingemar.s.johansson, mirja.kuehlewind, cheshire, rs.ietf,
Jason_Livingood, vidhi_goel
Cc: Chia-Yu Chang
From: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Based on specification:
https://tools.ietf.org/id/draft-ietf-tcpm-accurate-ecn-28.txt
3.2.2.1. ACE Field on the ACK of the SYN/ACK - If the Server is in
AccECN mode and in SYN-RCVD state, and if it receives a value of zero
on a pure ACK with SYN=0 and no SACK blocks, for the rest of the
connection the Server MUST NOT set ECT on outgoing packets and MUST
NOT respond to AccECN feedback. Nonetheless, as a Data Receiver it
MUST NOT disable AccECN feedback.
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
---
net/ipv4/tcp_minisocks.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index 301606ff1708..ba7a3300ab9e 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -458,7 +458,10 @@ void tcp_accecn_third_ack(struct sock *sk, const struct sk_buff *skb,
switch (ace) {
case 0x0:
- tcp_accecn_fail_mode_set(tp, TCP_ACCECN_ACE_FAIL_RECV);
+ if (!TCP_SKB_CB(skb)->sacked) {
+ tcp_accecn_fail_mode_set(tp, TCP_ACCECN_ACE_FAIL_RECV |
+ TCP_ACCECN_OPT_FAIL_RECV);
+ }
break;
case 0x7:
case 0x5:
--
2.34.1
^ permalink raw reply related [flat|nested] 56+ messages in thread
* [PATCH net-next 39/44] tcp: accecn: verify ACE counter in 1st ACK after AccECN negotiation
2024-10-15 10:28 [PATCH net-next 00/44] DualPI2, Accurate ECN, TCP Prague patch series chia-yu.chang
` (37 preceding siblings ...)
2024-10-15 10:29 ` [PATCH net-next 38/44] tcp: accecn: fallback outgoing half link to non-AccECN chia-yu.chang
@ 2024-10-15 10:29 ` chia-yu.chang
2024-10-15 10:29 ` [PATCH net-next 40/44] tcp: accecn: stop sending AccECN option when loss ACK with AccECN option chia-yu.chang
` (5 subsequent siblings)
44 siblings, 0 replies; 56+ messages in thread
From: chia-yu.chang @ 2024-10-15 10:29 UTC (permalink / raw)
To: netdev, ij, ncardwell, koen.de_schepper, g.white,
ingemar.s.johansson, mirja.kuehlewind, cheshire, rs.ietf,
Jason_Livingood, vidhi_goel
Cc: Chia-Yu Chang
From: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
After successfully negotiating AccECN mode in the handshake, check the
ACE field of the first data ACK. If it is zero, send non-ECT packets
and disable any response to CE marking feedback for the rest of the
half-connection. The first feedback packet is approximated by checking
that at most one MSS worth of data had been acknowledged before this
ACK (prior_bytes_acked <= tp->mss_cache).
Based on specification:
https://tools.ietf.org/id/draft-ietf-tcpm-accurate-ecn-28.txt
3.2.2.4. Testing for Zeroing of the ACE Field - If AccECN has been
successfully negotiated, the Data Sender MAY check the value of the
ACE counter in the first feedback packet (with or without data) that
arrives after the 3-way handshake. If the value of this ACE field is
found to be zero (0b000), for the remainder of the half-connection
the Data Sender ought to send non-ECN-capable packets and it is
advised not to respond to any feedback of CE markings.
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
---
net/ipv4/tcp_input.c | 25 ++++++++++++++++++++-----
1 file changed, 20 insertions(+), 5 deletions(-)
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index e88f449e89e1..0786e7127064 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -687,7 +687,8 @@ static void tcp_count_delivered(struct tcp_sock *tp, u32 delivered,
/* Returns the ECN CE delta */
static u32 __tcp_accecn_process(struct sock *sk, const struct sk_buff *skb,
- u32 delivered_pkts, u32 delivered_bytes, int flag)
+ u32 delivered_pkts, u32 delivered_bytes,
+ u64 prior_bytes_acked, int flag)
{
u32 old_ceb = tcp_sk(sk)->delivered_ecn_bytes[INET_ECN_CE - 1];
struct tcp_sock *tp = tcp_sk(sk);
@@ -724,6 +725,16 @@ static u32 __tcp_accecn_process(struct sock *sk, const struct sk_buff *skb,
if (flag & FLAG_SYN_ACKED)
return 0;
+ /* Verify ACE!=0 in the 1st data ACK after AccECN negotiation */
+ if ((flag & FLAG_DATA_ACKED) && prior_bytes_acked <= tp->mss_cache) {
+ if (tcp_accecn_ace(tcp_hdr(skb)) == 0x0) {
+ INET_ECN_dontxmit(sk);
+ tcp_accecn_fail_mode_set(tp, TCP_ACCECN_ACE_FAIL_RECV |
+ TCP_ACCECN_OPT_FAIL_RECV);
+ return 0;
+ }
+ }
+
if (tp->received_ce_pending >= TCP_ACCECN_ACE_MAX_DELTA)
inet_csk(sk)->icsk_ack.pending |= ICSK_ACK_NOW;
@@ -763,13 +774,14 @@ static u32 __tcp_accecn_process(struct sock *sk, const struct sk_buff *skb,
static void tcp_accecn_process(struct sock *sk, struct rate_sample *rs,
const struct sk_buff *skb,
- u32 delivered_pkts, u32 delivered_bytes, int *flag)
+ u32 delivered_pkts, u32 delivered_bytes,
+ u64 prior_bytes_acked, int *flag)
{
u32 delta;
struct tcp_sock *tp = tcp_sk(sk);
delta = __tcp_accecn_process(sk, skb, delivered_pkts,
- delivered_bytes, *flag);
+ delivered_bytes, prior_bytes_acked, *flag);
if (delta > 0) {
tcp_count_delivered_ce(tp, delta);
*flag |= FLAG_ECE;
@@ -4303,6 +4315,7 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
struct tcp_sock *tp = tcp_sk(sk);
struct tcp_sacktag_state sack_state;
struct rate_sample rs = { .prior_delivered = 0, .ece_delta = 0 };
+ u64 prior_bytes_acked = tp->bytes_acked;
u32 prior_snd_una = tp->snd_una;
bool is_sack_reneg = tp->is_sack_reneg;
u32 ack_seq = TCP_SKB_CB(skb)->seq;
@@ -4422,7 +4435,8 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
if (tcp_ecn_mode_accecn(tp))
tcp_accecn_process(sk, &rs, skb, tp->delivered - delivered,
- sack_state.delivered_bytes, &flag);
+ sack_state.delivered_bytes,
+ prior_bytes_acked, &flag);
tcp_in_ack_event(sk, flag);
@@ -4460,7 +4474,8 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
no_queue:
if (tcp_ecn_mode_accecn(tp))
tcp_accecn_process(sk, &rs, skb, tp->delivered - delivered,
- sack_state.delivered_bytes, &flag);
+ sack_state.delivered_bytes,
+ prior_bytes_acked, &flag);
tcp_in_ack_event(sk, flag);
/* If data was DSACKed, see if we can undo a cwnd reduction. */
if (flag & FLAG_DSACKING_ACK) {
--
2.34.1
^ permalink raw reply related [flat|nested] 56+ messages in thread
* [PATCH net-next 40/44] tcp: accecn: stop sending AccECN option when loss ACK with AccECN option
2024-10-15 10:28 [PATCH net-next 00/44] DualPI2, Accurate ECN, TCP Prague patch series chia-yu.chang
` (38 preceding siblings ...)
2024-10-15 10:29 ` [PATCH net-next 39/44] tcp: accecn: verify ACE counter in 1st ACK after AccECN negotiation chia-yu.chang
@ 2024-10-15 10:29 ` chia-yu.chang
2024-10-15 10:29 ` [PATCH net-next 41/44] Documentation: networking: Update ECN related sysctls chia-yu.chang
` (4 subsequent siblings)
44 siblings, 0 replies; 56+ messages in thread
From: chia-yu.chang @ 2024-10-15 10:29 UTC (permalink / raw)
To: netdev, ij, ncardwell, koen.de_schepper, g.white,
ingemar.s.johansson, mirja.kuehlewind, cheshire, rs.ietf,
Jason_Livingood, vidhi_goel
Cc: Chia-Yu Chang
From: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Detect spurious retransmissions of data that was acknowledged by a
previously sent ACK carrying the AccECN option, starting from the
second retransmission. Since this might be caused by a middlebox
dropping ACKs with options it does not recognize, disable sending the
AccECN option in all subsequent ACKs.
Based on specification:
https://tools.ietf.org/id/draft-ietf-tcpm-accurate-ecn-28.txt
3.2.3.2.2. Testing for Loss of Packets Carrying the AccECN Option -
If a middlebox is dropping packets with options it does not recognize,
a host that is sending little or no data but mostly pure ACKs will not
inherently detect such losses. Such a host MAY detect loss of ACKs
carrying the AccECN Option by detecting whether the acknowledged data
always reappears as a retransmission. In such cases, the host SHOULD
disable the sending of the AccECN Option for this half-connection.
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
---
include/linux/tcp.h | 3 ++-
include/net/tcp.h | 1 +
net/ipv4/tcp_input.c | 9 +++++++++
net/ipv4/tcp_output.c | 3 +++
4 files changed, 15 insertions(+), 1 deletion(-)
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 9dbfaa76d721..ecc9cfa7210f 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -304,7 +304,8 @@ struct tcp_sock {
u32 received_ce; /* Like the above but for received CE marked packets */
u32 received_ecn_bytes[3];
u8 received_ce_pending:4, /* Not yet transmitted cnt of received_ce */
- unused2:4;
+ accecn_opt_sent:1,/* Sent AccECN option in previous ACK */
+ unused2:3;
u8 accecn_minlen:2,/* Minimum length of AccECN option sent */
prev_ecnfield:2,/* ECN bits from the previous segment */
accecn_opt_demand:2,/* Demand AccECN option for n next ACKs */
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 4d055a54c645..ffb3971105b1 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1050,6 +1050,7 @@ static inline void tcp_accecn_init_counters(struct tcp_sock *tp)
tp->received_ce_pending = 0;
__tcp_accecn_init_bytes_counters(tp->received_ecn_bytes);
__tcp_accecn_init_bytes_counters(tp->delivered_ecn_bytes);
+ tp->accecn_opt_sent = 0;
tp->accecn_minlen = 0;
tp->accecn_opt_demand = 0;
tp->estimate_ecnfield = 0;
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 0786e7127064..74d66c075d6e 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5098,6 +5098,7 @@ static void tcp_dsack_extend(struct sock *sk, u32 seq, u32 end_seq)
static void tcp_rcv_spurious_retrans(struct sock *sk, const struct sk_buff *skb)
{
+ struct tcp_sock *tp = tcp_sk(sk);
/* When the ACK path fails or drops most ACKs, the sender would
* timeout and spuriously retransmit the same segment repeatedly.
* If it seems our ACKs are not reaching the other side,
@@ -5117,6 +5118,14 @@ static void tcp_rcv_spurious_retrans(struct sock *sk, const struct sk_buff *skb)
/* Save last flowlabel after a spurious retrans. */
tcp_save_lrcv_flowlabel(sk, skb);
#endif
+ /* Check DSACK info to detect that the previous ACK carrying the
+ * AccECN option was lost after the second retransmision, and then
+ * stop sending AccECN option in all subsequent ACKs.
+ */
+ if (tcp_ecn_mode_accecn(tp) &&
+ TCP_SKB_CB(skb)->seq == tp->duplicate_sack[0].start_seq &&
+ tp->accecn_opt_sent)
+ tcp_accecn_fail_mode_set(tp, TCP_ACCECN_OPT_FAIL_SEND);
}
static void tcp_send_dupack(struct sock *sk, const struct sk_buff *skb)
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 74ba08a33434..4e00ebf6bd42 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -804,9 +804,12 @@ static void tcp_options_write(struct tcphdr *th, struct tcp_sock *tp,
if (tp) {
tp->accecn_minlen = 0;
tp->accecn_opt_tstamp = tp->tcp_mstamp;
+ tp->accecn_opt_sent = 1;
if (tp->accecn_opt_demand)
tp->accecn_opt_demand--;
}
+ } else if (tp) {
+ tp->accecn_opt_sent = 0;
}
if (unlikely(OPTION_SACK_ADVERTISE & options)) {
--
2.34.1
^ permalink raw reply related [flat|nested] 56+ messages in thread
* [PATCH net-next 41/44] Documentation: networking: Update ECN related sysctls
2024-10-15 10:28 [PATCH net-next 00/44] DualPI2, Accurate ECN, TCP Prague patch series chia-yu.chang
` (39 preceding siblings ...)
2024-10-15 10:29 ` [PATCH net-next 40/44] tcp: accecn: stop sending AccECN option when loss ACK with AccECN option chia-yu.chang
@ 2024-10-15 10:29 ` chia-yu.chang
2024-10-15 10:29 ` [PATCH net-next 42/44] tcp: Add tso_segs() CC callback for TCP Prague chia-yu.chang
` (3 subsequent siblings)
44 siblings, 0 replies; 56+ messages in thread
From: chia-yu.chang @ 2024-10-15 10:29 UTC (permalink / raw)
To: netdev, ij, ncardwell, koen.de_schepper, g.white,
ingemar.s.johansson, mirja.kuehlewind, cheshire, rs.ietf,
Jason_Livingood, vidhi_goel
Cc: Chia-Yu Chang, Bob Briscoe
From: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Clarify that tcp_ecn enables ECN at both the IP and TCP layers,
explain the IP and TCP layers separately, and fix the table (table
headings don't seem to render unless there is text in the first
heading column). Add explanations for tcp_ecn_option and
tcp_ecn_option_beacon.
Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Co-developed-by: Bob Briscoe <research@bobbriscoe.net>
Signed-off-by: Bob Briscoe <research@bobbriscoe.net>
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
---
Documentation/networking/ip-sysctl.rst | 55 ++++++++++++++++++++------
1 file changed, 44 insertions(+), 11 deletions(-)
diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst
index eacf8983e230..de6b57775140 100644
--- a/Documentation/networking/ip-sysctl.rst
+++ b/Documentation/networking/ip-sysctl.rst
@@ -422,23 +422,56 @@ tcp_early_retrans - INTEGER
tcp_ecn - INTEGER
Control use of Explicit Congestion Notification (ECN) by TCP.
- ECN is used only when both ends of the TCP connection indicate
- support for it. This feature is useful in avoiding losses due
- to congestion by allowing supporting routers to signal
- congestion before having to drop packets.
+ ECN is used only when both ends of the TCP connection indicate support
+ for it. This feature is useful in avoiding losses due to congestion by
+ allowing supporting routers to signal congestion before having to drop
+ packets. A host that supports ECN both sends ECN at the IP layer and
+ feeds back ECN at the TCP layer. The highest variant of ECN feedback
+ that both peers support is chosen by the ECN negotiation (Accurate ECN,
+ ECN, or no ECN).
+
+ The highest negotiated variant for incoming connection requests
+ and the highest variant requested by outgoing connection
+ attempts:
+
+ ===== ==================== ====================
+ Value Incoming connections Outgoing connections
+ ===== ==================== ====================
+ 0 No ECN No ECN
+ 1 ECN ECN
+ 2 ECN No ECN
+ 3 AccECN AccECN
+ 4 AccECN ECN
+ 5 AccECN No ECN
+ ===== ==================== ====================
+
+ Default: 2
+
+tcp_ecn_option - INTEGER
+ Control Accurate ECN (AccECN) option sending when AccECN has been
+ successfully negotiated during handshake. Send logic inhibits
+ sending AccECN options regarless of this setting when no AccECN
+ option has been seen for the reverse direction.
Possible values are:
- = =====================================================
- 0 Disable ECN. Neither initiate nor accept ECN.
- 1 Enable ECN when requested by incoming connections and
- also request ECN on outgoing connection attempts.
- 2 Enable ECN when requested by incoming connections
- but do not request ECN on outgoing connections.
- = =====================================================
+ = ============================================================
+ 0 Never send AccECN option. This also disables sending AccECN
+ option in SYN/ACK during handshake.
+ 1 Send AccECN option sparingly according to the minimum option
+ rules outlined in draft-ietf-tcpm-accurate-ecn.
+ 2 Send AccECN option on every packet whenever it fits into TCP
+ option space.
+ = ============================================================
Default: 2
+tcp_ecn_option_beacon - INTEGER
+ Control Accurate ECN (AccECN) option sending frequency per RTT and it
+ takes effect only when tcp_ecn_option is set to 2.
+
+ Default: 3 (AccECN will be send at least 3 times per RTT)
+
tcp_ecn_fallback - BOOLEAN
If the kernel detects that ECN connection misbehaves, enable fall
back to non-ECN. Currently, this knob implements the fallback
--
2.34.1
^ permalink raw reply related [flat|nested] 56+ messages in thread
* [PATCH net-next 42/44] tcp: Add tso_segs() CC callback for TCP Prague
2024-10-15 10:28 [PATCH net-next 00/44] DualPI2, Accurate ECN, TCP Prague patch series chia-yu.chang
` (40 preceding siblings ...)
2024-10-15 10:29 ` [PATCH net-next 41/44] Documentation: networking: Update ECN related sysctls chia-yu.chang
@ 2024-10-15 10:29 ` chia-yu.chang
2024-10-15 10:29 ` [PATCH net-next 43/44] tcp: Add mss_cache_set_by_ca for CC algorithm to set MSS chia-yu.chang
` (2 subsequent siblings)
44 siblings, 0 replies; 56+ messages in thread
From: chia-yu.chang @ 2024-10-15 10:29 UTC (permalink / raw)
To: netdev, ij, ncardwell, koen.de_schepper, g.white,
ingemar.s.johansson, mirja.kuehlewind, cheshire, rs.ietf,
Jason_Livingood, vidhi_goel
Cc: Chia-Yu Chang
From: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
This patch adds a tso_segs() CC callback that lets a CC algorithm
provide an explicit TSO segment count for each data burst, overriding
tcp_tso_autosize().
No functional change.
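A minimal sketch of a CA using the new hook; the sizing policy shown
(a quarter of the current cwnd with a floor of two segments) is only
an illustration, not what TCP Prague does:

  static u32 example_tso_segs(struct sock *sk, u32 mss_now)
  {
          /* Burst at most a quarter of the cwnd per TSO chunk. */
          u32 segs = tcp_snd_cwnd(tcp_sk(sk)) / 4;

          return max_t(u32, segs, 2);
  }

The function is wired up via .tso_segs = example_tso_segs in the CA's
tcp_congestion_ops, and tcp_tso_segs() then prefers it over
tcp_tso_autosize().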
Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
---
include/net/tcp.h | 3 +++
net/ipv4/tcp_output.c | 4 +++-
2 files changed, 6 insertions(+), 1 deletion(-)
diff --git a/include/net/tcp.h b/include/net/tcp.h
index ffb3971105b1..ce7230c1ba5f 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1322,6 +1322,9 @@ struct tcp_congestion_ops {
/* override sysctl_tcp_min_tso_segs */
u32 (*min_tso_segs)(struct sock *sk);
+ /* override tcp_tso_autosize */
+ u32 (*tso_segs)(struct sock *sk, u32 mss_now);
+
/* call when packets are delivered to update cwnd and pacing rate,
* after all the ca_state processing. (optional)
*/
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 4e00ebf6bd42..0f0e79b42941 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2275,7 +2275,9 @@ static u32 tcp_tso_segs(struct sock *sk, unsigned int mss_now)
ca_ops->min_tso_segs(sk) :
READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_min_tso_segs);
- tso_segs = tcp_tso_autosize(sk, mss_now, min_tso);
+ tso_segs = ca_ops->tso_segs ?
+ ca_ops->tso_segs(sk, mss_now) :
+ tcp_tso_autosize(sk, mss_now, min_tso);
return min_t(u32, tso_segs, sk->sk_gso_max_segs);
}
--
2.34.1
^ permalink raw reply related [flat|nested] 56+ messages in thread
* [PATCH net-next 43/44] tcp: Add mss_cache_set_by_ca for CC algorithm to set MSS
2024-10-15 10:28 [PATCH net-next 00/44] DualPI2, Accurate ECN, TCP Prague patch series chia-yu.chang
` (41 preceding siblings ...)
2024-10-15 10:29 ` [PATCH net-next 42/44] tcp: Add tso_segs() CC callback for TCP Prague chia-yu.chang
@ 2024-10-15 10:29 ` chia-yu.chang
2024-10-15 10:29 ` [PATCH net-next 44/44] tcp: Add the TCP Prague congestion control module chia-yu.chang
2024-10-15 10:51 ` [PATCH net-next 00/44] DualPI2, Accurate ECN, TCP Prague patch series Paolo Abeni
44 siblings, 0 replies; 56+ messages in thread
From: chia-yu.chang @ 2024-10-15 10:29 UTC (permalink / raw)
To: netdev, ij, ncardwell, koen.de_schepper, g.white,
ingemar.s.johansson, mirja.kuehlewind, cheshire, rs.ietf,
Jason_Livingood, vidhi_goel
Cc: Chia-Yu Chang
From: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Allow the CC module to set mss_cache to a value smaller than the path
MTU. This is useful for a CC module that maintains an internal
fractional cwnd of less than 2 at very low speed (<100 kbps) and very
low RTT (<1 ms). In this case, the minimum snd_cwnd for the stack
remains at 2, but the CC module limits the pacing rate to ensure that
its internal fractional cwnd takes effect. The CC algorithm can
therefore apply fine-grained control without causing a large rate
sawtooth and delay jitter.
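A minimal sketch of how a CA could use this from its .init callback;
the 600-byte clamp is an arbitrary value for the example:

  static void example_ca_init(struct sock *sk)
  {
          struct tcp_sock *tp = tcp_sk(sk);

          /* Pace with a smaller-than-PMTU segment size and keep
           * tcp_sync_mss()/tcp_current_mss() from re-deriving it
           * from the path MTU.
           */
          tp->mss_cache = min_t(u32, tp->mss_cache, 600);
          tp->mss_cache_set_by_ca = true;
  }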
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
---
include/linux/tcp.h | 3 ++-
net/ipv4/tcp.c | 1 +
net/ipv4/tcp_output.c | 4 ++--
3 files changed, 5 insertions(+), 3 deletions(-)
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index ecc9cfa7210f..add0da4dbedc 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -232,7 +232,8 @@ struct tcp_sock {
repair : 1,
tcp_usec_ts : 1, /* TSval values in usec */
is_sack_reneg:1, /* in recovery from loss with SACK reneg? */
- is_cwnd_limited:1;/* forward progress limited by snd_cwnd? */
+ is_cwnd_limited:1,/* forward progress limited by snd_cwnd? */
+ mss_cache_set_by_ca:1;/* mss_cache set by CA */
__cacheline_group_end(tcp_sock_read_txrx);
/* RX read-mostly hotpath cache lines */
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 16bf550a619b..13db4db1be55 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -456,6 +456,7 @@ void tcp_init_sock(struct sock *sk)
tp->snd_ssthresh = TCP_INFINITE_SSTHRESH;
tp->snd_cwnd_clamp = ~0;
tp->mss_cache = TCP_MSS_DEFAULT;
+ tp->mss_cache_set_by_ca = false;
tp->reordering = READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_reordering);
tcp_assign_congestion_control(sk);
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 0f0e79b42941..d84c3897e932 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2074,7 +2074,7 @@ unsigned int tcp_sync_mss(struct sock *sk, u32 pmtu)
struct inet_connection_sock *icsk = inet_csk(sk);
int mss_now;
- if (icsk->icsk_mtup.search_high > pmtu)
+ if (icsk->icsk_mtup.search_high > pmtu && !tp->mss_cache_set_by_ca)
icsk->icsk_mtup.search_high = pmtu;
mss_now = tcp_mtu_to_mss(sk, pmtu);
@@ -2104,7 +2104,7 @@ unsigned int tcp_current_mss(struct sock *sk)
mss_now = tp->mss_cache;
- if (dst) {
+ if (dst && !tp->mss_cache_set_by_ca) {
u32 mtu = dst_mtu(dst);
if (mtu != inet_csk(sk)->icsk_pmtu_cookie)
mss_now = tcp_sync_mss(sk, mtu);
--
2.34.1
^ permalink raw reply related [flat|nested] 56+ messages in thread
* [PATCH net-next 44/44] tcp: Add the TCP Prague congestion control module
2024-10-15 10:28 [PATCH net-next 00/44] DualPI2, Accurate ECN, TCP Prague patch series chia-yu.chang
` (42 preceding siblings ...)
2024-10-15 10:29 ` [PATCH net-next 43/44] tcp: Add mss_cache_set_by_ca for CC algorithm to set MSS chia-yu.chang
@ 2024-10-15 10:29 ` chia-yu.chang
2024-10-15 10:51 ` [PATCH net-next 00/44] DualPI2, Accurate ECN, TCP Prague patch series Paolo Abeni
44 siblings, 0 replies; 56+ messages in thread
From: chia-yu.chang @ 2024-10-15 10:29 UTC (permalink / raw)
To: netdev, ij, ncardwell, koen.de_schepper, g.white,
ingemar.s.johansson, mirja.kuehlewind, cheshire, rs.ietf,
Jason_Livingood, vidhi_goel
Cc: Chia-Yu Chang, Olivier Tilmans, Bob Briscoe
From: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
TCP Prague evolved from DCTCP and is adapted for use over the public
internet, removing all reasons why the network would need to build a
queue in a bottleneck link to control the rate of this congestion
control. As such, it needs to implement the performance and safety
requirements listed in Appendix A of IETF RFC9331:
https://datatracker.ietf.org/doc/html/rfc9331
This version enhances DCTCP by:
* RTT independence
* Fractional window and increased alpha resolution
* Updated integer arithmetic and fixed-point scaling
* Additive Increase only for ACKs of non-marked packets
* Pacing/TSO sizing
* Pacing below minimum congestion window of 2
* +/- 3% pacing variations per RTT
* Enforce the use of ECT_1, Accurate ECN and ECN++
All the above improvements make Prague, for RTTs under 25ms, very
rate-fair and RTT-independent, assure full or close-to-full link
utilization on a stable network link, and allow the network to control
the rate down to 100kbps without the need to drop packets or build a
queue.
For RTTs from 0us till 25ms and link rates higher than 100kbps, the
resulting rate equation is very close to:
r [Mbps] = 1/p - 1
Above 25ms, a correction factor of 25ms/RTT needs to be applied
(reducing the rate in proportion to the higher RTT).
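A quick worked example of the rate equation (values chosen only for
illustration): with a marking probability p = 1% and an RTT below
25ms, r = 1/0.01 - 1 = 99 Mbps; the same flow at a 100ms RTT is scaled
by 25/100, i.e. roughly 24.75 Mbps.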
Co-developed-by: Olivier Tilmans <olivier.tilmans@nokia.com>
Signed-off-by: Olivier Tilmans <olivier.tilmans@nokia.com>
Co-developed-by: Koen De Schepper <koen.de_schepper@nokia-bell-labs.com>
Signed-off-by: Koen De Schepper <koen.de_schepper@nokia-bell-labs.com>
Signed-off-by: Bob Briscoe <research@bobbriscoe.net>
Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
---
include/uapi/linux/inet_diag.h | 13 +
net/ipv4/Kconfig | 37 ++
net/ipv4/Makefile | 1 +
net/ipv4/tcp_prague.c | 866 +++++++++++++++++++++++++++++++++
4 files changed, 917 insertions(+)
create mode 100644 net/ipv4/tcp_prague.c
diff --git a/include/uapi/linux/inet_diag.h b/include/uapi/linux/inet_diag.h
index 86bb2e8b17c9..69d144ba3627 100644
--- a/include/uapi/linux/inet_diag.h
+++ b/include/uapi/linux/inet_diag.h
@@ -161,6 +161,7 @@ enum {
INET_DIAG_SK_BPF_STORAGES,
INET_DIAG_CGROUP_ID,
INET_DIAG_SOCKOPT,
+ INET_DIAG_PRAGUEINFO,
__INET_DIAG_MAX,
};
@@ -231,9 +232,21 @@ struct tcp_bbr_info {
__u32 bbr_cwnd_gain; /* cwnd gain shifted left 8 bits */
};
+/* INET_DIAG_PRAGUEINFO */
+
+struct tcp_prague_info {
+ __u64 prague_alpha;
+ __u64 prague_frac_cwnd;
+ __u64 prague_rate_bytes;
+ __u32 prague_max_burst;
+ __u32 prague_round;
+ __u32 prague_rtt_target;
+};
+
union tcp_cc_info {
struct tcpvegas_info vegas;
struct tcp_dctcp_info dctcp;
+ struct tcp_prague_info prague;
struct tcp_bbr_info bbr;
};
#endif /* _UAPI_INET_DIAG_H_ */
diff --git a/net/ipv4/Kconfig b/net/ipv4/Kconfig
index 6d2c97f8e9ef..f55438c70579 100644
--- a/net/ipv4/Kconfig
+++ b/net/ipv4/Kconfig
@@ -679,6 +679,39 @@ config TCP_CONG_BBR
AQM schemes that do not provide a delay signal. It requires the fq
("Fair Queue") pacing packet scheduler.
+config TCP_CONG_PRAGUE
+ tristate "TCP Prague"
+ default n
+ help
+ TCP Prague is an enhancement for DCTCP to make DCTCP congestion
+ control deployable into general Internet.
+ TCP-Prague evolved from DCTCP, adapted for the use over the public
+ internet and removing all reasons why the NW would need to build a
+ queue in a bottleneck link to control the rate of this congestion-
+ control. As such, it needs to implement the performance and safety
+ requirements listed in Appendix A of IETF RFC9331:
+ https://datatracker.ietf.org/doc/html/rfc9331
+
+ This version enhances DCTCP by:
+ * RTT independence
+ * Fractional window and increased alpha resolution
+ * Updated integer arithmetics and fixed-point scaling
+ * Only Additive Increase for ACK of non-marked packets
+ * Pacing/tso sizing
+ * Pacing below minimum congestion window of 2
+ * +/- 3% pacing variations per RTT
+ * Enforce the use of ECT_1, Accurate ECN and ECN++
+
+ All above improvements make Prague behave under 25ms very rate fair
+ and RTT independent, assures full or close to full link utilization
+ on a stable network link and allows the NW to control the rate down
+ to 100kbps without the need to drop packets or built a queue.
+ For RTTs from 0us till 25ms and link rates higher than 100kbps, the
+ resulting rate equation is very close to:
+ r [Mbps] = 1/p - 1
+ Above 25ms, a correction factor of 25ms/RTT needs to be applied
+ (reducing the rate proportional to the higher RTT).
+
choice
prompt "Default TCP congestion control"
default DEFAULT_CUBIC
@@ -716,6 +749,9 @@ choice
config DEFAULT_BBR
bool "BBR" if TCP_CONG_BBR=y
+ config DEFAULT_PRAGUE
+ bool "Prague" if TCP_CONG_PRAGUE=y
+
config DEFAULT_RENO
bool "Reno"
endchoice
@@ -740,6 +776,7 @@ config DEFAULT_TCP_CONG
default "dctcp" if DEFAULT_DCTCP
default "cdg" if DEFAULT_CDG
default "bbr" if DEFAULT_BBR
+ default "prague" if DEFAULT_PRAGUE
default "cubic"
config TCP_SIGPOOL
diff --git a/net/ipv4/Makefile b/net/ipv4/Makefile
index ec36d2ec059e..47b1304ffa09 100644
--- a/net/ipv4/Makefile
+++ b/net/ipv4/Makefile
@@ -60,6 +60,7 @@ obj-$(CONFIG_TCP_CONG_SCALABLE) += tcp_scalable.o
obj-$(CONFIG_TCP_CONG_LP) += tcp_lp.o
obj-$(CONFIG_TCP_CONG_YEAH) += tcp_yeah.o
obj-$(CONFIG_TCP_CONG_ILLINOIS) += tcp_illinois.o
+obj-$(CONFIG_TCP_CONG_PRAGUE) += tcp_prague.o
obj-$(CONFIG_TCP_SIGPOOL) += tcp_sigpool.o
obj-$(CONFIG_NET_SOCK_MSG) += tcp_bpf.o
obj-$(CONFIG_BPF_SYSCALL) += udp_bpf.o
diff --git a/net/ipv4/tcp_prague.c b/net/ipv4/tcp_prague.c
new file mode 100644
index 000000000000..6db9386b13af
--- /dev/null
+++ b/net/ipv4/tcp_prague.c
@@ -0,0 +1,866 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright (C) 2024 Nokia
+ *
+ * Author: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
+ * Author: Olivier Tilmans <olivier.tilmans@nokia.com>
+ * Author: Koen De Schepper <koen.de_schepper@nokia-bell-labs.com>
+ * Author: Bob Briscoe <research@bobbriscoe.net>
+ *
+ * TCP Prague congestion control.
+ *
+ * This congestion-control, part of the L4S architecture, achieves low loss,
+ * low latency and scalable throughput when used in combination with AQMs such
+ * as DualPI2, CurvyRED, or even fq_codel with a low ce_threshold for the
+ * L4S flows.
+ *
+ * TCP-Prague evolved from DCTCP, adapted for use over the public
+ * Internet, removing the reasons why the network would need to build a
+ * queue in a bottleneck link to control the rate of this congestion-control.
+ * As such, it needs to implement the performance and safety requirements
+ * listed in Appendix A of IETF RFC9331:
+ * https://datatracker.ietf.org/doc/html/rfc9331
+ *
+ * Notable changes from DCTCP:
+ *
+ * 1/ RTT independence:
+ * Below a minimum target RTT, Prague will operate as if it was experiencing
+ * that target RTT (default=25ms). This enables short RTT flows to co-exist
+ * with long RTT ones (e.g., Edge-DC flows competing vs intercontinental
+ * internet traffic) without causing starvation or saturating the ECN signal,
+ * without the need for Diffserv or bandwidth reservation. It also makes the
+ * lower RTT flows more resilient to the inertia of higher RTT flows.
+ *
+ * This is achieved by scaling cwnd growth during Additive Increase, thus
+ * leaving room for higher RTT flows to grab a larger bandwidth share while at
+ * the same time relieving the pressure on bottleneck link hence lowering the
+ * overall marking probability.
+ *
+ * Given that this slows short RTT flows, this behavior only makes sense for
+ * long-running flows that actually need to share the link--as opposed to,
+ * e.g., RPC traffic. To that end, flows become RTT independent
+ * DEFAULT_RTT_TRANSITION RTTs after slow start (default = 4).
+ *
+ * 2/ Fractional window and increased alpha resolution:
+ * To support slower and more gradual increase of the window, a fractional
+ * window is kept and manipulated from which the socket congestion window is
+ * derived (rounded up to the next integer and capped to at least 2).
+ *
+ * The resolution of alpha has been increased to ensure that a low amount of
+ * marks over high-BDP paths can be accurately taken into account in the
+ * computation.
+ *
+ * Orthogonally, the value of alpha that is kept in the connection state is
+ * stored upscaled, in order to preserve its remainder over the course of its
+ * updates (similarly to how tp->srtt_us is maintained, as opposed to
+ * dctcp->alpha).
+ *
+ * 3/ Updated integer arithmetic and fixed-point scaling
+ * In order to operate with a permanent, (very) low marking probability and a
+ * much larger RTT range, the arithmetic has been updated to track decimal
+ * precision with unbiased rounding, while avoiding capping the integer
+ * parts. This improves the precision, avoiding avalanche effects as
+ * remainders are carried over to the next operations, as well as the
+ * responsiveness, as the AQM at the bottleneck can effectively control the
+ * flow without a drastic increase in marking probability.
+ *
+ * 4/ Additive Increase only for ACKs of non-marked packets
+ * DCTCP disabled the increase for a full RTT when marks were received. Given
+ * that an L4S AQM may apply CE marks on every ACK (e.g., from the PI2
+ * part of dualpi2), instead of the occasional full RTTs of marks that a step
+ * AQM would cause, Prague will increase every RTT, but proportionally to the
+ * non-marked packets. So the total increase over an RTT is proportional to
+ * (1-p)/p. The cwnd is updated for every ACK that reports non-marked
+ * data at the receiver, regardless of the congestion status of the connection
+ * (i.e., it is expected to spend most of its time in TCP_CA_CWR when used
+ * over dualpi2). Note that this is only valid for CE marks. For loss (i.e.,
+ * while in the TCP_CA_LOSS state) the increase is still disabled for one RTT.
+ *
+ * See https://arxiv.org/abs/1904.07605 for more details around saturation.
+ *
+ * 5/ Pacing/TSO sizing
+ * Prague aims to keep queuing delay as low as possible. To that end, it is in
+ * its best interest to pace outgoing segments (i.e., to smooth its traffic),
+ * as well as impose a maximal GSO burst size to avoid instantaneous queue
+ * buildups in the bottleneck link. The GSO burst size is limited so that it
+ * creates at most 250us of latency, assuming the current transmission rate
+ * is the bottleneck rate. For this functionality to be active, the "fq" qdisc needs
+ * to be active on the network interfaces that need to carry Prague flows.
+ * Note this is the "fq" qdisc, not the default "fq_codel" qdisc.
+ *
+ * 6/ Pacing below minimum congestion window of 2
+ * Prague will further reduce the pacing rate when the fractional window
+ * drops below 2 MTUs. This is needed for very low RTT networks to be able to
+ * control flows to low rates without the need for the network to buffer the
+ * 2 packets in flight per active flow. The rate can go down to 100kbps on
+ * any RTT. Below 1Mbps, the packet size will be reduced to make sure we
+ * can still send 2 packets per 25ms, down to 150 bytes at 100kbps.
+ * The real blocking congestion window will still be 2, but as long as ACKs
+ * come in, the pacing rate will block the sending. The fractional window
+ * is also always rounded up to the next integer when assigned to the
+ * blocking congestion window. This makes the pacing rate most of the time
+ * the blocking mechanism. As the fractional window is updated every ACK,
+ * the pacing rate is smoothly increased guaranteeing a non-stepwise rate
+ * increase when the congestion window has a low integer value.
+ *
+ * 7/ +/- 3% pacing variations per RTT
+ * During the first half of every RTT (or 25ms if the RTT is less), the pacing
+ * rate is increased by 3%; during the second half it is decreased by 3%. This
+ * triggers a stable amount of marks every RTT on a step marking AQM when the
+ * link is very stable. It avoids the undesired on/off marking scheme of DCTCP
+ * (one RTT of 100% marks and several RTTs with no marks), which leads to larger
+ * rate variations and unfairness of rate and RTT due to its different rate
+ * to marking probability proportionality:
+ * r ~ 1/p^2
+ *
+ * 8/ Enforce the use of ECT_1, Accurate ECN and ECN++
+ * As per RFC 9331, Prague needs to use ECT_1, Accurate ECN and ECN++
+ * (also ECT_1 on non-data packets like SYNs, pure ACKs, ...). Independently
+ * of the other sysctl configs of the kernel, setting the Prague CC on a
+ * socket will override the system-wide configuration. This
+ * also means that using Prague selectively on a system does not require
+ * any system-wide changes (except using the FQ qdisc on the NICs).
+ *
+ * All the above improvements make Prague very rate fair and RTT independent
+ * for RTTs below 25ms, and ensure full or close to full link utilization on
+ * a stable network link. It allows the network to control the rate down to
+ * 100kbps without the need to drop packets or build a queue. For RTTs
+ * up to 25ms and link rates higher than 100kbps, the resulting
+ * rate equation is very close to:
+ * r [Mbps] = 1/p - 1
+ * or, put the other way around, a flow needs a marking probability p
+ * to be squeezed down to r Mbps:
+ * p = 1 / (r + 1)
+ * So p = 0.5 (50% marks) results in a rate of 1Mbps; equivalently,
+ * 1Mbps needs 50% marks, 99Mbps needs 1% marks, 100kbps needs
+ * 91% marks, etc...
+ * For RTTs above 25ms, a correction factor should be taken into account:
+ * r [Mbps] = (1/p - 1) * 25ms / RTT
+ * with RTT and 25ms expressed in the same unit.
+ */
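+/* For illustration only (not part of the algorithm): with p = 0.02 (2% marks)
+ * the steady-state rate is r = 1/0.02 - 1 = 49 Mbps; at RTT = 50ms the
+ * 25ms/RTT correction halves this to roughly 24.5 Mbps.
+ */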
+
+#define pr_fmt(fmt) "TCP-Prague " fmt
+
+#include <linux/inet_diag.h>
+#include <linux/inet.h>
+#include <linux/math64.h>
+#include <linux/module.h>
+#include <linux/mm.h>
+#include <linux/printk.h>
+#include <net/tcp.h>
+
+#define MIN_CWND_RTT 2U
+#define MIN_CWND_VIRT 2U
+#define MIN_MSS 150U
+#define MINIMUM_RATE 12500ULL /* Minimum rate: 100kbps */
+#define PRAGUE_ALPHA_BITS 24U
+#define PRAGUE_MAX_ALPHA BIT_ULL(PRAGUE_ALPHA_BITS)
+#define CWND_UNIT 20U
+#define ONE_CWND BIT_ULL(CWND_UNIT)
+#define PRAGUE_SHIFT_G 4 /* EWMA gain g = 1/2^4 */
+#define DEFAULT_RTT_TRANSITION 4
+#define MAX_SCALED_RTT (100 * USEC_PER_MSEC)
+#define MTU_SYS 1500UL
+#define RATE_OFFSET 4
+#define OFFSET_UNIT 7
+#define HSRTT_SHIFT 7
+#define RTT2SEC_SHIFT 23
+
+static u32 prague_burst_shift __read_mostly = 12; /* 1/2^12 sec ~=.25ms */
+MODULE_PARM_DESC(prague_burst_shift,
+ "maximal GSO burst duration as a base-2 negative exponent");
+module_param(prague_burst_shift, uint, 0644);
+
+static u32 prague_max_tso_segs __read_mostly; /* Default value for static is 0 */
+MODULE_PARM_DESC(prague_max_tso_segs, "Maximum TSO/GSO segments");
+module_param(prague_max_tso_segs, uint, 0644);
+
+static u32 prague_rtt_target __read_mostly = 25 * USEC_PER_MSEC;
+MODULE_PARM_DESC(prague_rtt_target, "RTT scaling target");
+module_param(prague_rtt_target, uint, 0644);
+
+static int prague_rtt_transition __read_mostly = DEFAULT_RTT_TRANSITION;
+MODULE_PARM_DESC(prague_rtt_transition,
+ "Amount of post-SS rounds to transition to be RTT independent.");
+module_param(prague_rtt_transition, uint, 0644);
+
+static int prague_rate_offset __read_mostly = 4; /* 4/128 ~= 3% */
+MODULE_PARM_DESC(prague_rate_offset,
+ "Pacing rate offset in 1/128 units at each half of RTT_virt");
+module_param(prague_rate_offset, uint, 0644);
+
+static int prague_alpha_mode __read_mostly; /* Default value for static is 0 */
+MODULE_PARM_DESC(prague_alpha_mode,
+ "TCP Prague SS mode (0: Half cwnd at 1st mark; 1: Init alpha 1)");
+module_param(prague_alpha_mode, uint, 0644);
+
+static int prague_cwnd_mode __read_mostly; /* Default value for static is 0 */
+MODULE_PARM_DESC(prague_cwnd_mode,
+ "TCP Prague mode (0: FracWin-base; 1: Rate-base; 2: Switch)");
+module_param(prague_cwnd_mode, uint, 0644);
+
+static int prague_cwnd_transit __read_mostly = 4;
+MODULE_PARM_DESC(prague_cwnd_transit,
+ "CWND mode switching point in term of # of MTU_SYS");
+module_param(prague_cwnd_transit, uint, 0644);
+
+struct prague {
+ u64 cwr_stamp;
+ u64 alpha_stamp; /* EWMA update timestamp */
+ u64 upscaled_alpha; /* Congestion-estimate EWMA */
+ u64 ai_ack_increase; /* AI increase per non-CE ACKed MSS */
+ u32 mtu_cache;
+ u64 hsrtt_us;
+ u64 frac_cwnd; /* internal fractional cwnd */
+ u64 rate_bytes; /* internal pacing rate in bytes */
+ u64 loss_rate_bytes;
+ u32 loss_cwnd;
+ u32 max_tso_burst;
+ u32 old_delivered; /* tp->delivered at round start */
+ u32 old_delivered_ce; /* tp->delivered_ce at round start */
+ u32 next_seq; /* tp->snd_nxt at round start */
+ u32 round; /* Round count since last slow-start exit */
+ u8 saw_ce:1, /* Is there an AQM on the path? */
+ cwnd_mode:1, /* CWND operating mode */
+ in_loss:1; /* In cwnd reduction caused by loss */
+};
+
+/* Fallback struct ops if we fail to negotiate AccECN */
+static struct tcp_congestion_ops prague_reno;
+
+static void __prague_connection_id(struct sock *sk, char *str, size_t len)
+{
+ u16 dport = ntohs(inet_sk(sk)->inet_dport);
+ u16 sport = ntohs(inet_sk(sk)->inet_sport);
+
+ if (sk->sk_family == AF_INET)
+ snprintf(str, len, "%pI4:%u-%pI4:%u", &sk->sk_rcv_saddr, sport,
+ &sk->sk_daddr, dport);
+ else if (sk->sk_family == AF_INET6)
+ snprintf(str, len, "[%pI6c]:%u-[%pI6c]:%u",
+ &sk->sk_v6_rcv_saddr, sport, &sk->sk_v6_daddr, dport);
+}
+
+#define LOG(sk, fmt, ...) do { \
+ char __tmp[2 * (INET6_ADDRSTRLEN + 9) + 1] = {0}; \
+ __prague_connection_id(sk, __tmp, sizeof(__tmp)); \
+ /* pr_fmt expects the connection ID*/ \
+ pr_info("(%s) : " fmt "\n", __tmp, ##__VA_ARGS__); \
+} while (0)
+
+static struct prague *prague_ca(struct sock *sk)
+{
+ return (struct prague *)inet_csk_ca(sk);
+}
+
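+/* RTT independence becomes active once slow start has ended and the flow has
+ * completed prague_rtt_transition rounds since the last slow-start exit.
+ */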
+static bool prague_is_rtt_indep(struct sock *sk)
+{
+ struct prague *ca = prague_ca(sk);
+ struct tcp_sock *tp = tcp_sk(sk);
+
+ return !tcp_in_slow_start(tp) &&
+ ca->round >= prague_rtt_transition;
+}
+
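+/* One end-to-end RTT has elapsed once all data outstanding at the start of
+ * the current round (ca->next_seq) has been acknowledged.
+ */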
+static bool prague_e2e_rtt_elapsed(struct sock *sk)
+{
+ struct prague *ca = prague_ca(sk);
+ struct tcp_sock *tp = tcp_sk(sk);
+
+ return !before(tp->snd_una, ca->next_seq);
+}
+
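+/* Target RTT in the same units as tp->srtt_us, i.e., usec left-shifted by 3. */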
+static u32 prague_target_rtt(struct sock *sk)
+{
+ return prague_rtt_target << 3;
+}
+
+static u32 prague_elapsed_since_alpha_update(struct sock *sk)
+{
+ struct prague *ca = prague_ca(sk);
+ struct tcp_sock *tp = tcp_sk(sk);
+
+ return tcp_stamp_us_delta(tp->tcp_mstamp, ca->alpha_stamp);
+}
+
+static bool prague_target_rtt_elapsed(struct sock *sk)
+{
+ return (prague_target_rtt(sk) >> 3) <=
+ prague_elapsed_since_alpha_update(sk);
+}
+
+/* RTT independence on a step AQM requires the competing flows to converge to the
+ * same alpha, i.e., the EWMA update frequency might no longer be "once every RTT"
+ */
+static bool prague_should_update_ewma(struct sock *sk)
+{
+ return prague_e2e_rtt_elapsed(sk) &&
+ (!prague_is_rtt_indep(sk) || prague_target_rtt_elapsed(sk));
+}
+
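+/* Unscaled AI step: one MSS expressed in fractional-cwnd (CWND_UNIT) units. */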
+static u64 prague_unscaled_ai_ack_increase(struct sock *sk)
+{
+ return 1 << CWND_UNIT;
+}
+
+static u64 prague_rate_scaled_ai_ack_increase(struct sock *sk, u32 rtt)
+{
+ u64 increase;
+ u64 divisor;
+ u64 target;
+
+ target = prague_target_rtt(sk);
+ if (rtt >= target)
+ return prague_unscaled_ai_ack_increase(sk);
+ /* Scale increase to:
+ * - Grow by 1MSS/target RTT
+ * - Take into account the rate ratio of doing cwnd += 1MSS
+ *
+ * Overflows if e2e RTT is > 100ms, hence the cap
+ */
+ increase = (u64)rtt << CWND_UNIT;
+ increase *= rtt;
+ divisor = target * target;
+ increase = DIV64_U64_ROUND_CLOSEST(increase, divisor);
+ return increase;
+}
+
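+/* Derive the socket cwnd from the fractional window: round up to the next
+ * integer, with a floor of MIN_CWND_RTT and a cap of snd_cwnd_clamp.
+ */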
+static u32 prague_frac_cwnd_to_snd_cwnd(struct sock *sk)
+{
+ struct prague *ca = prague_ca(sk);
+ struct tcp_sock *tp = tcp_sk(sk);
+
+ return min_t(u32, max_t(u32, MIN_CWND_RTT,
+ (ca->frac_cwnd + (ONE_CWND - 1)) >> CWND_UNIT), tp->snd_cwnd_clamp);
+}
+
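+/* Virtual RTT: the larger of the smoothed RTT and the target RTT, in
+ * tp->srtt_us units.
+ */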
+static u64 prague_virtual_rtt(struct sock *sk)
+{
+ struct tcp_sock *tp = tcp_sk(sk);
+
+ return max_t(u32, prague_target_rtt(sk), tp->srtt_us);
+}
+
+static u64 prague_pacing_rate_to_max_mtu(struct sock *sk)
+{
+ struct prague *ca = prague_ca(sk);
+ struct tcp_sock *tp = tcp_sk(sk);
+ u64 cwnd_bytes;
+
+ if (prague_is_rtt_indep(sk) && ca->cwnd_mode == 1) {
+ cwnd_bytes = mul_u64_u64_shr(ca->rate_bytes, prague_virtual_rtt(sk),
+ RTT2SEC_SHIFT);
+ } else {
+ u64 target = prague_target_rtt(sk);
+ u64 scaled_cwnd = ca->frac_cwnd;
+ u64 rtt = tp->srtt_us;
+
+ if (rtt < target)
+ scaled_cwnd = div64_u64(scaled_cwnd * target, rtt);
+ cwnd_bytes = mul_u64_u64_shr(scaled_cwnd, tcp_mss_to_mtu(sk, tp->mss_cache),
+ CWND_UNIT);
+ }
+ return DIV_U64_ROUND_UP(cwnd_bytes, MIN_CWND_VIRT);
+}
+
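+/* True once half of the virtual RTT has passed since the last alpha update;
+ * used to alternate the +/- pacing rate offset within a (virtual) RTT.
+ */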
+static bool prague_half_virtual_rtt_elapsed(struct sock *sk)
+{
+ return (prague_virtual_rtt(sk) >> (3 + 1)) <=
+ prague_elapsed_since_alpha_update(sk);
+}
+
+static u64 prague_pacing_rate_to_frac_cwnd(struct sock *sk)
+{
+ struct prague *ca = prague_ca(sk);
+ struct tcp_sock *tp = tcp_sk(sk);
+ u64 rtt;
+ u64 mtu;
+
+ mtu = tcp_mss_to_mtu(sk, tp->mss_cache);
+ rtt = (ca->hsrtt_us >> HSRTT_SHIFT) ?: tp->srtt_us;
+
+ return DIV_U64_ROUND_UP(mul_u64_u64_shr(ca->rate_bytes, rtt,
+ RTT2SEC_SHIFT - CWND_UNIT), mtu);
+}
+
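+/* Derive a pacing rate in bytes per second from the fractional window and the
+ * smoothed RTT, with MINIMUM_RATE as a floor.
+ */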
+static u64 prague_frac_cwnd_to_pacing_rate(struct sock *sk)
+{
+ struct prague *ca = prague_ca(sk);
+ struct tcp_sock *tp = tcp_sk(sk);
+ u64 rate;
+
+ rate = (u64)((u64)USEC_PER_SEC << 3) * tcp_mss_to_mtu(sk, tp->mss_cache);
+ if (tp->srtt_us)
+ rate = div64_u64(rate, tp->srtt_us);
+ return max_t(u64, mul_u64_u64_shr(rate, ca->frac_cwnd, CWND_UNIT),
+ MINIMUM_RATE);
+}
+
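+/* Clamp a candidate MTU between the MTU derived from MIN_MSS and the MTU
+ * cached at initialization.
+ */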
+static u32 prague_valid_mtu(struct sock *sk, u32 mtu)
+{
+ struct prague *ca = prague_ca(sk);
+
+ return max_t(u32, min_t(u32, ca->mtu_cache, mtu), tcp_mss_to_mtu(sk, MIN_MSS));
+}
+
+/* RTT independence will scale the classical 1/W per ACK increase. */
+static void prague_ai_ack_increase(struct sock *sk)
+{
+ struct prague *ca = prague_ca(sk);
+ struct tcp_sock *tp = tcp_sk(sk);
+ u64 increase;
+ u32 rtt;
+
+ rtt = tp->srtt_us;
+ if (ca->round < prague_rtt_transition ||
+ !rtt || rtt > (MAX_SCALED_RTT << 3)) {
+ increase = prague_unscaled_ai_ack_increase(sk);
+ goto exit;
+ }
+
+ increase = prague_rate_scaled_ai_ack_increase(sk, rtt);
+
+exit:
+ WRITE_ONCE(ca->ai_ack_increase, increase);
+}
+
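+/* Compute the socket pacing rate: double the internal rate while the cwnd is
+ * below half of the slow-start threshold, otherwise offset it by +/- the
+ * configured fraction depending on which half of the virtual RTT we are in,
+ * and derive the matching maximal TSO burst size.
+ */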
+static void prague_update_pacing_rate(struct sock *sk)
+{
+ struct prague *ca = prague_ca(sk);
+ struct tcp_sock *tp = tcp_sk(sk);
+ u32 rate_offset = RATE_OFFSET;
+ u64 rate, burst, offset;
+
+ if (prague_rate_offset && prague_rate_offset < ((1 << OFFSET_UNIT) - 1))
+ rate_offset = prague_rate_offset;
+
+ if (tcp_snd_cwnd(tp) < tp->snd_ssthresh / 2) {
+ rate = ca->rate_bytes << 1;
+ } else {
+ offset = mul_u64_u64_shr(rate_offset, ca->rate_bytes, OFFSET_UNIT);
+ if (prague_half_virtual_rtt_elapsed(sk))
+ rate = ca->rate_bytes - offset;
+ else
+ rate = ca->rate_bytes + offset;
+ }
+
+ rate = min_t(u64, rate, sk->sk_max_pacing_rate);
+ burst = div_u64(rate, tcp_mss_to_mtu(sk, tp->mss_cache));
+
+ WRITE_ONCE(prague_ca(sk)->max_tso_burst,
+ max_t(u32, 1, burst >> prague_burst_shift));
+ WRITE_ONCE(sk->sk_pacing_rate, rate);
+}
+
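+/* Start a new observation round: snapshot the delivered counters and snd_nxt,
+ * bump the round counter once out of slow start (saturating to
+ * prague_rtt_transition on wrap), and refresh the per-ACK additive increase.
+ */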
+static void prague_new_round(struct sock *sk)
+{
+ struct prague *ca = prague_ca(sk);
+ struct tcp_sock *tp = tcp_sk(sk);
+
+ ca->next_seq = tp->snd_nxt;
+ ca->old_delivered_ce = tp->delivered_ce;
+ ca->old_delivered = tp->delivered;
+ if (!tcp_in_slow_start(tp)) {
+ ++ca->round;
+ if (!ca->round)
+ ca->round = prague_rtt_transition;
+ }
+ prague_ai_ack_increase(sk);
+}
+
+static void prague_cwnd_changed(struct sock *sk)
+{
+ struct tcp_sock *tp = tcp_sk(sk);
+
+ tp->snd_cwnd_stamp = tcp_jiffies32;
+ prague_ai_ack_increase(sk);
+}
+
+static void prague_update_alpha(struct sock *sk)
+{
+ struct prague *ca = prague_ca(sk);
+ struct tcp_sock *tp = tcp_sk(sk);
+ u64 ecn_segs, alpha, mtu, mtu_used;
+
+ /* Do not update alpha before we have proof that there's an AQM on
+ * the path.
+ */
+ if (unlikely(!ca->saw_ce))
+ goto skip;
+
+ alpha = ca->upscaled_alpha;
+ ecn_segs = tp->delivered_ce - ca->old_delivered_ce;
+ /* We diverge from the original EWMA, i.e.,
+ * alpha = (1 - g) * alpha + g * F
+ * by working with (and storing)
+ * upscaled_alpha = alpha * (1/g) [recall that 0<g<1]
+ *
+ * This enables to carry alpha's residual value to the next EWMA round.
+ *
+ * We first compute F, the fraction of ecn segments.
+ */
+ if (ecn_segs) {
+ u32 acked_segs = tp->delivered - ca->old_delivered;
+
+ ecn_segs <<= PRAGUE_ALPHA_BITS;
+ ecn_segs = div_u64(ecn_segs, max(1U, acked_segs));
+ }
+ alpha = alpha - (alpha >> PRAGUE_SHIFT_G) + ecn_segs;
+ ca->alpha_stamp = tp->tcp_mstamp;
+
+ WRITE_ONCE(ca->upscaled_alpha,
+ min(PRAGUE_MAX_ALPHA << PRAGUE_SHIFT_G, alpha));
+
+ if (prague_is_rtt_indep(sk) && !ca->in_loss) {
+ mtu_used = tcp_mss_to_mtu(sk, tp->mss_cache);
+ mtu = prague_valid_mtu(sk, prague_pacing_rate_to_max_mtu(sk));
+ if (mtu_used != mtu) {
+ ca->frac_cwnd = div_u64(ca->frac_cwnd * mtu_used, mtu);
+ tp->mss_cache_set_by_ca = true;
+ tcp_sync_mss(sk, mtu);
+
+ u64 new_cwnd = prague_frac_cwnd_to_snd_cwnd(sk);
+
+ if (tcp_snd_cwnd(tp) != new_cwnd) {
+ tcp_snd_cwnd_set(tp, new_cwnd);
+ tp->snd_ssthresh = div_u64(tp->snd_ssthresh * mtu_used, mtu);
+ prague_cwnd_changed(sk);
+ }
+ }
+ }
+skip:
+ ca->hsrtt_us = ca->hsrtt_us + tp->srtt_us - (ca->hsrtt_us >> HSRTT_SHIFT);
+ prague_new_round(sk);
+}
+
+static void prague_update_cwnd(struct sock *sk, const struct rate_sample *rs)
+{
+ struct prague *ca = prague_ca(sk);
+ struct tcp_sock *tp = tcp_sk(sk);
+ u64 increase;
+ u64 new_cwnd;
+ u64 mtu_used;
+ u64 divisor;
+ s64 acked;
+
+ acked = rs->acked_sacked;
+ if (rs->ece_delta) {
+ if (rs->ece_delta > acked)
+ LOG(sk, "Received %u marks for %lld acks at %u",
+ rs->ece_delta, acked, tp->snd_una);
+ if (unlikely(!ca->saw_ce) && !prague_alpha_mode)
+ ca->frac_cwnd = (ca->frac_cwnd + 1U) >> 1;
+ ca->saw_ce = 1;
+ acked -= rs->ece_delta;
+ }
+
+ if (acked <= 0 || ca->in_loss || tp->app_limited)
+ goto adjust;
+
+ if (tcp_in_slow_start(tp)) {
+ acked = tcp_slow_start(tp, acked);
+ ca->frac_cwnd = (u64)tcp_snd_cwnd(tp) << CWND_UNIT;
+ if (!acked) {
+ prague_cwnd_changed(sk);
+ return;
+ }
+ }
+
+ if (prague_is_rtt_indep(sk) && ca->cwnd_mode == 1) {
+ mtu_used = tcp_mss_to_mtu(sk, tp->mss_cache);
+ increase = div_u64(((u64)(acked * MTU_SYS)) << RTT2SEC_SHIFT,
+ prague_virtual_rtt(sk));
+ divisor = mtu_used << RTT2SEC_SHIFT;
+ new_cwnd = DIV64_U64_ROUND_UP(ca->rate_bytes * prague_virtual_rtt(sk), divisor);
+ if (likely(new_cwnd))
+ ca->rate_bytes += DIV_U64_ROUND_CLOSEST(increase, new_cwnd);
+ ca->frac_cwnd = max_t(u64, ca->frac_cwnd + acked,
+ prague_pacing_rate_to_frac_cwnd(sk));
+ } else {
+ increase = acked * ca->ai_ack_increase;
+ new_cwnd = ca->frac_cwnd;
+ if (likely(new_cwnd))
+ increase = DIV64_U64_ROUND_CLOSEST((increase << CWND_UNIT), new_cwnd);
+ increase = div_u64(increase * MTU_SYS, tcp_mss_to_mtu(sk, tp->mss_cache));
+ ca->frac_cwnd += max_t(u64, acked, increase);
+
+ u64 rate = prague_frac_cwnd_to_pacing_rate(sk);
+
+ ca->rate_bytes = max_t(u64, ca->rate_bytes + acked, rate);
+ }
+
+adjust:
+ new_cwnd = prague_frac_cwnd_to_snd_cwnd(sk);
+ if (tcp_snd_cwnd(tp) > new_cwnd) {
+ /* Step-wise cwnd decrement */
+ tcp_snd_cwnd_set(tp, tcp_snd_cwnd(tp) - 1);
+ tp->snd_ssthresh = tcp_snd_cwnd(tp);
+ prague_cwnd_changed(sk);
+ } else if (tcp_snd_cwnd(tp) < new_cwnd) {
+ /* Step-wise cwnd increment */
+ tcp_snd_cwnd_set(tp, tcp_snd_cwnd(tp) + 1);
+ prague_cwnd_changed(sk);
+ }
+}
+
+static void prague_ca_open(struct sock *sk)
+{
+ struct prague *ca = prague_ca(sk);
+
+ ca->in_loss = 0;
+}
+
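+/* React to loss: remember the pre-loss state for a potential undo, then halve
+ * either the internal rate or the fractional window depending on the cwnd mode.
+ */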
+static void prague_enter_loss(struct sock *sk)
+{
+ struct prague *ca = prague_ca(sk);
+ struct tcp_sock *tp = tcp_sk(sk);
+
+ ca->loss_cwnd = tcp_snd_cwnd(tp);
+ ca->loss_rate_bytes = ca->rate_bytes;
+ if (prague_is_rtt_indep(sk) && ca->cwnd_mode == 1) {
+ ca->rate_bytes -= (ca->rate_bytes >> 1);
+ ca->frac_cwnd = prague_pacing_rate_to_frac_cwnd(sk);
+ } else {
+ ca->frac_cwnd -= (ca->frac_cwnd >> 1);
+ ca->rate_bytes = prague_frac_cwnd_to_pacing_rate(sk);
+ }
+ ca->in_loss = 1;
+}
+
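+/* Classic CWR reaction scaled by alpha: reduce the rate (or the fractional
+ * window) by alpha/2, at most once per target RTT when RTT independent.
+ */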
+static void prague_enter_cwr(struct sock *sk)
+{
+ struct prague *ca = prague_ca(sk);
+ struct tcp_sock *tp = tcp_sk(sk);
+ u64 reduction;
+ u64 alpha;
+
+ if (prague_is_rtt_indep(sk) &&
+ (prague_target_rtt(sk) >> 3) > tcp_stamp_us_delta(tp->tcp_mstamp, ca->cwr_stamp))
+ return;
+ ca->cwr_stamp = tp->tcp_mstamp;
+ alpha = ca->upscaled_alpha >> PRAGUE_SHIFT_G;
+
+ if (prague_is_rtt_indep(sk) && ca->cwnd_mode == 1) {
+ reduction = mul_u64_u64_shr(ca->rate_bytes, alpha, PRAGUE_ALPHA_BITS + 1);
+ ca->rate_bytes = max_t(u64, ca->rate_bytes - reduction, MINIMUM_RATE);
+ ca->frac_cwnd = prague_pacing_rate_to_frac_cwnd(sk);
+ } else {
+ reduction = (alpha * (ca->frac_cwnd) +
+ /* Unbias the rounding by adding 1/2 */
+ PRAGUE_MAX_ALPHA) >>
+ (PRAGUE_ALPHA_BITS + 1U);
+ ca->frac_cwnd -= reduction;
+ ca->rate_bytes = prague_frac_cwnd_to_pacing_rate(sk);
+ }
+}
+
+static void prague_state(struct sock *sk, u8 new_state)
+{
+ if (new_state == inet_csk(sk)->icsk_ca_state)
+ return;
+
+ switch (new_state) {
+ case TCP_CA_Recovery:
+ prague_enter_loss(sk);
+ break;
+ case TCP_CA_CWR:
+ prague_enter_cwr(sk);
+ break;
+ case TCP_CA_Open:
+ prague_ca_open(sk);
+ break;
+ }
+}
+
+static void prague_cwnd_event(struct sock *sk, enum tcp_ca_event ev)
+{
+ if (ev == CA_EVENT_LOSS)
+ prague_enter_loss(sk);
+}
+
+static u32 prague_cwnd_undo(struct sock *sk)
+{
+ struct prague *ca = prague_ca(sk);
+ struct tcp_sock *tp = tcp_sk(sk);
+
+ /* We may have made some progress since then, account for it. */
+ ca->in_loss = 0;
+ ca->rate_bytes = max(ca->rate_bytes, ca->loss_rate_bytes);
+ ca->frac_cwnd = prague_pacing_rate_to_frac_cwnd(sk);
+ return max(ca->loss_cwnd, tp->snd_cwnd);
+}
+
+static void prague_cong_control(struct sock *sk, u32 ack, int flag, const struct rate_sample *rs)
+{
+ struct prague *ca = prague_ca(sk);
+ struct tcp_sock *tp = tcp_sk(sk);
+
+ prague_update_cwnd(sk, rs);
+ if (prague_should_update_ewma(sk))
+ prague_update_alpha(sk);
+ prague_update_pacing_rate(sk);
+ if (prague_cwnd_mode > 1) {
+ u64 cwnd_bytes = tcp_snd_cwnd(tp) * tcp_mss_to_mtu(sk, tp->mss_cache);
+ u64 cwnd_bytes_transit = prague_cwnd_transit * MTU_SYS;
+
+ if (likely(ca->saw_ce) && cwnd_bytes <= cwnd_bytes_transit)
+ ca->cwnd_mode = 1;
+ else if (unlikely(!ca->saw_ce) || cwnd_bytes > cwnd_bytes_transit)
+ ca->cwnd_mode = 0;
+ }
+}
+
+static u32 prague_ssthresh(struct sock *sk)
+{
+ struct tcp_sock *tp = tcp_sk(sk);
+
+ return tp->snd_ssthresh;
+}
+
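+/* Cap the TSO/GSO burst to the pacing-rate-derived limit, optionally bounded
+ * by the prague_max_tso_segs module parameter.
+ */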
+static u32 prague_tso_segs(struct sock *sk, unsigned int mss_now)
+{
+ u32 tso_segs = prague_ca(sk)->max_tso_burst;
+
+ if (prague_max_tso_segs)
+ tso_segs = min(tso_segs, prague_max_tso_segs);
+
+ return tso_segs;
+}
+
+static size_t prague_get_info(struct sock *sk, u32 ext, int *attr,
+ union tcp_cc_info *info)
+{
+ const struct prague *ca = prague_ca(sk);
+
+ if (ext & (1 << (INET_DIAG_PRAGUEINFO - 1)) ||
+ ext & (1 << (INET_DIAG_VEGASINFO - 1))) {
+ memset(&info->prague, 0, sizeof(info->prague));
+ if (inet_csk(sk)->icsk_ca_ops != &prague_reno) {
+ info->prague.prague_alpha =
+ ca->upscaled_alpha >> PRAGUE_SHIFT_G;
+ info->prague.prague_max_burst = ca->max_tso_burst;
+ info->prague.prague_round = ca->round;
+ info->prague.prague_rate_bytes =
+ READ_ONCE(ca->rate_bytes);
+ info->prague.prague_frac_cwnd =
+ READ_ONCE(ca->frac_cwnd);
+ info->prague.prague_rtt_target =
+ prague_target_rtt(sk);
+ }
+ *attr = INET_DIAG_PRAGUEINFO;
+ return sizeof(info->prague);
+ }
+ return 0;
+}
+
+static void prague_release(struct sock *sk)
+{
+ struct tcp_sock *tp = tcp_sk(sk);
+
+ cmpxchg(&sk->sk_pacing_status, SK_PACING_NEEDED, SK_PACING_NONE);
+ tp->ecn_flags &= ~TCP_ECN_ECT_1;
+ if (!tcp_ecn_mode_any(tp))
+ /* We forced the use of ECN, but failed to negotiate it */
+ INET_ECN_dontxmit(sk);
+
+ LOG(sk, "Released [delivered_ce=%u,received_ce=%u]",
+ tp->delivered_ce, tp->received_ce);
+}
+
+static void prague_init(struct sock *sk)
+{
+ struct prague *ca = prague_ca(sk);
+ struct tcp_sock *tp = tcp_sk(sk);
+
+ if (!tcp_ecn_mode_any(tp) &&
+ sk->sk_state != TCP_LISTEN && sk->sk_state != TCP_CLOSE) {
+ prague_release(sk);
+ LOG(sk, "Switching to pure reno [ecn_status=%u,sk_state=%u]",
+ tcp_ecn_mode_any(tp), sk->sk_state);
+ inet_csk(sk)->icsk_ca_ops = &prague_reno;
+ return;
+ }
+
+ tp->ecn_flags |= TCP_ECN_ECT_1;
+ cmpxchg(&sk->sk_pacing_status, SK_PACING_NONE, SK_PACING_NEEDED);
+ /* If we have an initial RTT estimate, ensure we have an initial pacing
+ * rate to use if net.ipv4.tcp_pace_iw is set.
+ */
+ ca->alpha_stamp = tp->tcp_mstamp;
+ if (!prague_alpha_mode)
+ ca->upscaled_alpha = 0;
+ else
+ ca->upscaled_alpha = PRAGUE_MAX_ALPHA << PRAGUE_SHIFT_G;
+ ca->frac_cwnd = (u64)tcp_snd_cwnd(tp) << CWND_UNIT;
+ ca->max_tso_burst = 1;
+
+ /* rate initialization */
+ if (tp->srtt_us) {
+ ca->rate_bytes = div_u64(((u64)USEC_PER_SEC << 3) *
+ tcp_mss_to_mtu(sk, tp->mss_cache),
+ tp->srtt_us);
+ ca->rate_bytes = max_t(u64, ca->rate_bytes * tcp_snd_cwnd(tp), MINIMUM_RATE);
+ } else {
+ ca->rate_bytes = MINIMUM_RATE;
+ }
+ prague_update_pacing_rate(sk);
+ ca->loss_rate_bytes = 0;
+ ca->round = 0;
+ ca->saw_ce = !!tp->delivered_ce;
+
+ ca->mtu_cache = tcp_mss_to_mtu(sk, tp->mss_cache) ?: MTU_SYS;
+ /* Default to 1us when there is no RTT estimate yet */
+ ca->hsrtt_us = tp->srtt_us ? (((u64)tp->srtt_us) << HSRTT_SHIFT) : (1 << (HSRTT_SHIFT + 3));
+ ca->cwnd_mode = (prague_cwnd_mode <= 1) ? prague_cwnd_mode : 0;
+
+ prague_new_round(sk);
+}
+
+static struct tcp_congestion_ops prague __read_mostly = {
+ .init = prague_init,
+ .release = prague_release,
+ .cong_control = prague_cong_control,
+ .cwnd_event = prague_cwnd_event,
+ .ssthresh = prague_ssthresh,
+ .undo_cwnd = prague_cwnd_undo,
+ .set_state = prague_state,
+ .get_info = prague_get_info,
+ .tso_segs = prague_tso_segs,
+ .flags = TCP_CONG_NEEDS_ECN |
+ TCP_CONG_NEEDS_ACCECN |
+ TCP_CONG_NO_FALLBACK_RFC3168 |
+ TCP_CONG_NON_RESTRICTED,
+ .owner = THIS_MODULE,
+ .name = "prague",
+};
+
+static struct tcp_congestion_ops prague_reno __read_mostly = {
+ .ssthresh = tcp_reno_ssthresh,
+ .cong_avoid = tcp_reno_cong_avoid,
+ .undo_cwnd = tcp_reno_undo_cwnd,
+ .get_info = prague_get_info,
+ .owner = THIS_MODULE,
+ .name = "prague-reno",
+};
+
+static int __init prague_register(void)
+{
+ BUILD_BUG_ON(sizeof(struct prague) > ICSK_CA_PRIV_SIZE);
+ return tcp_register_congestion_control(&prague);
+}
+
+static void __exit prague_unregister(void)
+{
+ tcp_unregister_congestion_control(&prague);
+}
+
+module_init(prague_register);
+module_exit(prague_unregister);
+
+MODULE_DESCRIPTION("TCP Prague");
+MODULE_AUTHOR("Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>");
+MODULE_AUTHOR("Olivier Tilmans <olivier.tilmans@nokia-bell-labs.com>");
+MODULE_AUTHOR("Koen De Schepper <koen.de_schepper@nokia-bell-labs.com>");
+MODULE_AUTHOR("Bob briscoe <research@bobbriscoe.net>");
+
+MODULE_LICENSE("GPL");
+MODULE_VERSION("0.7");
--
2.34.1
^ permalink raw reply related [flat|nested] 56+ messages in thread
* Re: [PATCH net-next 00/44] DualPI2, Accurate ECN, TCP Prague patch series
2024-10-15 10:28 [PATCH net-next 00/44] DualPI2, Accurate ECN, TCP Prague patch series chia-yu.chang
` (43 preceding siblings ...)
2024-10-15 10:29 ` [PATCH net-next 44/44] tcp: Add the TCP Prague congestion control module chia-yu.chang
@ 2024-10-15 10:51 ` Paolo Abeni
2024-10-15 15:14 ` Koen De Schepper (Nokia)
44 siblings, 1 reply; 56+ messages in thread
From: Paolo Abeni @ 2024-10-15 10:51 UTC (permalink / raw)
To: chia-yu.chang, netdev, ij, ncardwell, koen.de_schepper, g.white,
ingemar.s.johansson, mirja.kuehlewind, cheshire, rs.ietf,
Jason_Livingood, vidhi_goel
On 10/15/24 12:28, chia-yu.chang@nokia-bell-labs.com wrote:
> From: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
>
> Hello,
>
> Please find the enclosed patch series covering the L4S (Low Latency,
> Low Loss, and Scalable Throughput) as outlined in IETF RFC9330:
> https://datatracker.ietf.org/doc/html/rfc9330
>
> * 1 patch for DualPI2 (cf. IETF RFC9332
> https://datatracker.ietf.org/doc/html/rfc9332)
> * 40 pataches for Accurate ECN (It implements the AccECN protocol
> in terms of negotiation, feedback, and compliance requirements:
> https://datatracker.ietf.org/doc/html/draft-ietf-tcpm-accurate-ecn-28)
> * 3 patches for TCP Prague (It implements the performance and safety
> requirements listed in Appendix A of IETF RFC9331:
> https://datatracker.ietf.org/doc/html/rfc9331)
>
> Best regagrds,
> Chia-Yu
I haven't looked into the series yet, and I doubt I'll be able to do
that anytime soon, but you must have a good read of the netdev process
before any other action, specifically:
https://elixir.bootlin.com/linux/v6.11.3/source/Documentation/process/maintainer-netdev.rst#L351
and
https://elixir.bootlin.com/linux/v6.11.3/source/Documentation/process/maintainer-netdev.rst#L15
Just to be clear: splitting the series into 3 and posting all of them
together will not be good either.
Thanks,
Paolo
^ permalink raw reply [flat|nested] 56+ messages in thread
* RE: [PATCH net-next 00/44] DualPI2, Accurate ECN, TCP Prague patch series
2024-10-15 10:51 ` [PATCH net-next 00/44] DualPI2, Accurate ECN, TCP Prague patch series Paolo Abeni
@ 2024-10-15 15:14 ` Koen De Schepper (Nokia)
2024-10-15 17:52 ` Eric Dumazet
0 siblings, 1 reply; 56+ messages in thread
From: Koen De Schepper (Nokia) @ 2024-10-15 15:14 UTC (permalink / raw)
To: Paolo Abeni, Chia-Yu Chang (Nokia), netdev@vger.kernel.org,
ij@kernel.org, ncardwell@google.com, g.white@CableLabs.com,
ingemar.s.johansson@ericsson.com, mirja.kuehlewind@ericsson.com,
cheshire@apple.com, rs.ietf@gmx.at, Jason_Livingood@comcast.com,
vidhi_goel@apple.com
We had several internal review rounds that were specifically aimed at making sure it is in line with the processes/guidelines you are referring to.
DualPI2 and TCP-Prague are new modules, mostly in separate files. AccECN unfortunately involves quite a few changes in different files with different functionality; it was split into manageable, smaller incremental chunks according to the guidelines, ending up as 40 patches. The good thing is that they are small and should be easy to process. The series could be split into these 3 features, but that would still leave all the AccECN changes as, preferably, one patch set. On top of that, the 3 TCP-Prague patches rely on the 40 AccECN ones, so preferably we keep them together too...
The 3 features are used and tested in many kernels. Initial development started on 3.16, then 4.x, 5.x, and recently also the 6.x kernels. So the code should be pretty mature (at least from a functionality and stability point of view).
Koen.
-----Original Message-----
From: Paolo Abeni <pabeni@redhat.com>
Sent: Tuesday, October 15, 2024 12:51 PM
To: Chia-Yu Chang (Nokia) <chia-yu.chang@nokia-bell-labs.com>; netdev@vger.kernel.org; ij@kernel.org; ncardwell@google.com; Koen De Schepper (Nokia) <koen.de_schepper@nokia-bell-labs.com>; g.white@CableLabs.com; ingemar.s.johansson@ericsson.com; mirja.kuehlewind@ericsson.com; cheshire@apple.com; rs.ietf@gmx.at; Jason_Livingood@comcast.com; vidhi_goel@apple.com
Subject: Re: [PATCH net-next 00/44] DualPI2, Accurate ECN, TCP Prague patch series
On 10/15/24 12:28, chia-yu.chang@nokia-bell-labs.com wrote:
> From: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
>
> Hello,
>
> Please find the enclosed patch series covering the L4S (Low Latency,
> Low Loss, and Scalable Throughput) as outlined in IETF RFC9330:
> https://datatracker.ietf.org/doc/html/rfc9330
>
> * 1 patch for DualPI2 (cf. IETF RFC9332
> https://datatracker.ietf.org/doc/html/rfc9332)
> * 40 pataches for Accurate ECN (It implements the AccECN protocol
> in terms of negotiation, feedback, and compliance requirements:
>
> https://datatracker.ietf.org/doc/html/draft-ietf-tcpm-accurate-ecn-28)
> * 3 patches for TCP Prague (It implements the performance and safety
> requirements listed in Appendix A of IETF RFC9331:
> https://datatracker.ietf.org/doc/html/rfc9331)
>
> Best regagrds,
> Chia-Yu
I haven't looked into the series yet, and I doubt I'll be able to do that anytime soon, but you must have a good read of the netdev process before any other action, specifically:
https://elixir.bootlin.com/linux/v6.11.3/source/Documentation/process/maintainer-netdev.rst#L351
and
https://elixir.bootlin.com/linux/v6.11.3/source/Documentation/process/maintainer-netdev.rst#L15
Just to be clear: splitting the series into 3 and posting all of them together will not be good either.
Thanks,
Paolo
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH net-next 01/44] sched: Add dualpi2 qdisc
2024-10-15 10:28 ` [PATCH net-next 01/44] sched: Add dualpi2 qdisc chia-yu.chang
@ 2024-10-15 15:30 ` Jamal Hadi Salim
2024-10-15 15:40 ` Jakub Kicinski
0 siblings, 1 reply; 56+ messages in thread
From: Jamal Hadi Salim @ 2024-10-15 15:30 UTC (permalink / raw)
To: chia-yu.chang
Cc: netdev, ij, ncardwell, koen.de_schepper, g.white,
ingemar.s.johansson, mirja.kuehlewind, cheshire, rs.ietf,
Jason_Livingood, vidhi_goel, Olga Albisser, Olivier Tilmans,
Henrik Steen, Bob Briscoe
On Tue, Oct 15, 2024 at 6:31 AM <chia-yu.chang@nokia-bell-labs.com> wrote:
>
> From: Koen De Schepper <koen.de_schepper@nokia-bell-labs.com>
>
> DualPI2 provides L4S-type low latency & loss to traffic that uses a
> scalable congestion controller (e.g. TCP-Prague, DCTCP) without
> degrading the performance of 'classic' traffic (e.g. Reno,
> Cubic etc.). It is intended to be the reference implementation of the
> IETF's DualQ Coupled AQM.
>
> The qdisc provides two queues called low latency and classic. It
> classifies packets based on the ECN field in the IP headers. By
> default it directs non-ECN and ECT(0) into the classic queue and
> ECT(1) and CE into the low latency queue, as per the IETF spec.
>
> Each queue runs its own AQM:
> * The classic AQM is called PI2, which is similar to the PIE AQM but
> more responsive and simpler. Classic traffic requires a decent
> target queue (default 15ms for Internet deployment) to fully
> utilize the link and to avoid high drop rates.
> * The low latency AQM is, by default, a very shallow ECN marking
> threshold (1ms) similar to that used for DCTCP.
>
> The DualQ isolates the low queuing delay of the Low Latency queue
> from the larger delay of the 'Classic' queue. However, from a
> bandwidth perspective, flows in either queue will share out the link
> capacity as if there was just a single queue. This bandwidth pooling
> effect is achieved by coupling together the drop and ECN-marking
> probabilities of the two AQMs.
>
> The PI2 AQM has two main parameters in addition to its target delay.
> All the defaults are suitable for any Internet setting, but it can
> be reconfigured for a Data Centre setting. The integral gain factor
> alpha is used to slowly correct any persistent standing queue error
> from the target delay, while the proportional gain factor beta is
> used to quickly compensate for queue changes (growth or shrinkage).
> Either alpha and beta are given as a parameter, or they can be
> calculated by tc from alternative typical and maximum RTT parameters.
>
> Internally, the output of a linear Proportional Integral (PI)
> controller is used for both queues. This output is squared to
> calculate the drop or ECN-marking probability of the classic queue.
> This counterbalances the square-root rate equation of Reno/Cubic,
> which is the trick that balances flow rates across the queues. For
> the ECN-marking probability of the low latency queue, the output of
> the base AQM is multiplied by a coupling factor. This determines the
> balance between the flow rates in each queue. The default setting
> makes the flow rates roughly equal, which should be generally
> applicable.
>
> If DUALPI2 AQM has detected overload (due to excessive non-responsive
> traffic in either queue), it will switch to signaling congestion
> solely using drop, irrespective of the ECN field. Alternatively, it
> can be configured to limit the drop probability and let the queue
> grow and eventually overflow (like tail-drop).
>
> Additional details can be found in the draft:
> https://datatracker.ietf.org/doc/html/rfc9332
>
> Signed-off-by: Koen De Schepper <koen.de_schepper@nokia-bell-labs.com>
> Co-developed-by: Olga Albisser <olga@albisser.org>
> Signed-off-by: Olga Albisser <olga@albisser.org>
> Co-developed-by: Olivier Tilmans <olivier.tilmans@nokia.com>
> Signed-off-by: Olivier Tilmans <olivier.tilmans@nokia.com>
> Co-developed-by: Henrik Steen <henrist@henrist.net>
> Signed-off-by: Henrik Steen <henrist@henrist.net>
> Signed-off-by: Bob Briscoe <research@bobbriscoe.net>
> Signed-off-by: Ilpo Järvinen <ij@kernel.org>
> Co-developed-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
> Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Most important thing in submissions (if you want reviews) is to make
sure you cc the stakeholders - not everybody keeps track of every
message on the list. Read the upstream howto doc...
> ---
> include/linux/netdevice.h | 1 +
> include/uapi/linux/pkt_sched.h | 34 ++
> net/sched/Kconfig | 20 +
> net/sched/Makefile | 1 +
> net/sched/sch_dualpi2.c | 1046 ++++++++++++++++++++++++++++++++
> 5 files changed, 1102 insertions(+)
> create mode 100644 net/sched/sch_dualpi2.c
>
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 8feaca12655e..bdd7d6262112 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -30,6 +30,7 @@
> #include <asm/byteorder.h>
> #include <asm/local.h>
>
> +#include <linux/netdev_features.h>
> #include <linux/percpu.h>
> #include <linux/rculist.h>
> #include <linux/workqueue.h>
> diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
> index 25a9a47001cd..f2418eabdcb1 100644
> --- a/include/uapi/linux/pkt_sched.h
> +++ b/include/uapi/linux/pkt_sched.h
> @@ -1210,4 +1210,38 @@ enum {
>
> #define TCA_ETS_MAX (__TCA_ETS_MAX - 1)
>
> +/* DUALPI2 */
> +enum {
> + TCA_DUALPI2_UNSPEC,
> + TCA_DUALPI2_LIMIT, /* Packets */
> + TCA_DUALPI2_TARGET, /* us */
> + TCA_DUALPI2_TUPDATE, /* us */
> + TCA_DUALPI2_ALPHA, /* Hz scaled up by 256 */
> + TCA_DUALPI2_BETA, /* HZ scaled up by 256 */
> + TCA_DUALPI2_STEP_THRESH, /* Packets or us */
> + TCA_DUALPI2_STEP_PACKETS, /* Whether STEP_THRESH is in packets */
> + TCA_DUALPI2_COUPLING, /* Coupling factor between queues */
> + TCA_DUALPI2_DROP_OVERLOAD, /* Whether to drop on overload */
> + TCA_DUALPI2_DROP_EARLY, /* Whether to drop on enqueue */
> + TCA_DUALPI2_C_PROTECTION, /* Percentage */
> + TCA_DUALPI2_ECN_MASK, /* L4S queue classification mask */
> + TCA_DUALPI2_SPLIT_GSO, /* Split GSO packets at enqueue */
> + TCA_DUALPI2_PAD,
> + __TCA_DUALPI2_MAX
> +};
> +
> +#define TCA_DUALPI2_MAX (__TCA_DUALPI2_MAX - 1)
> +
> +struct tc_dualpi2_xstats {
> + __u32 prob; /* current probability */
> + __u32 delay_c; /* current delay in C queue */
> + __u32 delay_l; /* current delay in L queue */
> + __s32 credit; /* current c_protection credit */
> + __u32 packets_in_c; /* number of packets enqueued in C queue */
> + __u32 packets_in_l; /* number of packets enqueued in L queue */
> + __u32 maxq; /* maximum queue size */
> + __u32 ecn_mark; /* packets marked with ecn*/
> + __u32 step_marks; /* ECN marks due to the step AQM */
> +};
> +
> #endif
> diff --git a/net/sched/Kconfig b/net/sched/Kconfig
> index 8180d0c12fce..c1421e219040 100644
> --- a/net/sched/Kconfig
> +++ b/net/sched/Kconfig
> @@ -403,6 +403,26 @@ config NET_SCH_ETS
>
> If unsure, say N.
>
> +config NET_SCH_DUALPI2
> + tristate "Dual Queue Proportional Integral Controller Improved with a Square (DUALPI2) scheduler"
> + help
> + Say Y here if you want to use the DualPI2 AQM.
> + This is a combination of the DUALQ Coupled-AQM with a PI2 base-AQM.
> + The PI2 AQM is in turn both an extension and a simplification of the
> + PIE AQM. PI2 makes quite some PIE heuristics unnecessary, while being
> + able to control scalable congestion controls like DCTCP and
> + TCP-Prague. With PI2, both Reno/Cubic can be used in parallel with
> + DCTCP, maintaining window fairness. DUALQ provides latency separation
> + between low latency DCTCP flows and Reno/Cubic flows that need a
> + bigger queue.
> + For more information, please see
> + https://datatracker.ietf.org/doc/html/rfc9332
> +
> + To compile this code as a module, choose M here: the module
> + will be called sch_dualpi2.
> +
> + If unsure, say N.
> +
> menuconfig NET_SCH_DEFAULT
> bool "Allow override default queue discipline"
> help
> diff --git a/net/sched/Makefile b/net/sched/Makefile
> index 82c3f78ca486..1abb06554057 100644
> --- a/net/sched/Makefile
> +++ b/net/sched/Makefile
> @@ -62,6 +62,7 @@ obj-$(CONFIG_NET_SCH_FQ_PIE) += sch_fq_pie.o
> obj-$(CONFIG_NET_SCH_CBS) += sch_cbs.o
> obj-$(CONFIG_NET_SCH_ETF) += sch_etf.o
> obj-$(CONFIG_NET_SCH_TAPRIO) += sch_taprio.o
> +obj-$(CONFIG_NET_SCH_DUALPI2) += sch_dualpi2.o
>
> obj-$(CONFIG_NET_CLS_U32) += cls_u32.o
> obj-$(CONFIG_NET_CLS_ROUTE4) += cls_route.o
> diff --git a/net/sched/sch_dualpi2.c b/net/sched/sch_dualpi2.c
> new file mode 100644
> index 000000000000..18e8934faa4e
> --- /dev/null
> +++ b/net/sched/sch_dualpi2.c
> @@ -0,0 +1,1046 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/* Copyright (C) 2024 Nokia
> + *
> + * Author: Koen De Schepper <koen.de_schepper@nokia-bell-labs.com>
> + * Author: Olga Albisser <olga@albisser.org>
> + * Author: Henrik Steen <henrist@henrist.net>
> + * Author: Olivier Tilmans <olivier.tilmans@nokia-bell-labs.com>
> + * Author: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
> + *
> + * DualPI Improved with a Square (dualpi2):
> + * - Supports congestion controls that comply with the Prague requirements
> + * in RFC9331 (e.g. TCP-Prague)
> + * - Supports coupled dual-queue with PI2 as defined in RFC9332
> + * - Supports ECN L4S-identifier (IP.ECN==0b*1)
> + *
> + * note: DCTCP is not Prague compliant, so DCTCP & DualPI2 can only be
> + * used in DC context; BBRv3 (overwrites bbr) stopped Prague support,
> + * you should use TCP-Prague instead for low latency apps
> + *
> + * References:
> + * - RFC9332: https://datatracker.ietf.org/doc/html/rfc9332
> + * - De Schepper, Koen, et al. "PI 2: A linearized AQM for both classic and
> + * scalable TCP." in proc. ACM CoNEXT'16, 2016.
> + */
> +
> +#include <linux/errno.h>
> +#include <linux/hrtimer.h>
> +#include <linux/if_vlan.h>
> +#include <linux/kernel.h>
> +#include <linux/limits.h>
> +#include <linux/module.h>
> +#include <linux/skbuff.h>
> +#include <linux/types.h>
> +
> +#include <net/gso.h>
> +#include <net/inet_ecn.h>
> +#include <net/pkt_cls.h>
> +#include <net/pkt_sched.h>
> +
> +/* 32b enable to support flows with windows up to ~8.6 * 1e9 packets
> + * i.e., twice the maximal snd_cwnd.
> + * MAX_PROB must be consistent with the RNG in dualpi2_roll().
> + */
> +#define MAX_PROB U32_MAX
> +/* alpha/beta values exchanged over netlink are in units of 256ns */
> +#define ALPHA_BETA_SHIFT 8
> +/* Scaled values of alpha/beta must fit in 32b to avoid overflow in later
> + * computations. Consequently (see and dualpi2_scale_alpha_beta()), their
> + * netlink-provided values can use at most 31b, i.e. be at most (2^23)-1
> + * (~4MHz) as those are given in 1/256th. This enable to tune alpha/beta to
> + * control flows whose maximal RTTs can be in usec up to few secs.
> + */
> +#define ALPHA_BETA_MAX ((1U << 31) - 1)
> +/* Internal alpha/beta are in units of 64ns.
> + * This enables to use all alpha/beta values in the allowed range without loss
> + * of precision due to rounding when scaling them internally, e.g.,
> + * scale_alpha_beta(1) will not round down to 0.
> + */
> +#define ALPHA_BETA_GRANULARITY 6
> +#define ALPHA_BETA_SCALING (ALPHA_BETA_SHIFT - ALPHA_BETA_GRANULARITY)
> +/* We express the weights (wc, wl) in %, i.e., wc + wl = 100 */
> +#define MAX_WC 100
> +
> +struct dualpi2_sched_data {
> + struct Qdisc *l_queue; /* The L4S LL queue */
> + struct Qdisc *sch; /* The classic queue (owner of this struct) */
> +
> + /* Registered tc filters */
> + struct {
> + struct tcf_proto __rcu *filters;
> + struct tcf_block *block;
> + } tcf;
> +
> + struct { /* PI2 parameters */
> + u64 target; /* Target delay in nanoseconds */
> + u32 tupdate;/* Timer frequency in nanoseconds */
> + u32 prob; /* Base PI probability */
> + u32 alpha; /* Gain factor for the integral rate response */
> + u32 beta; /* Gain factor for the proportional response */
> + struct hrtimer timer; /* prob update timer */
> + } pi2;
> +
> + struct { /* Step AQM (L4S queue only) parameters */
> + u32 thresh; /* Step threshold */
> + bool in_packets;/* Whether the step is in packets or time */
> + } step;
> +
> + struct { /* Classic queue starvation protection */
> + s32 credit; /* Credit (sign indicates which queue) */
> + s32 init; /* Reset value of the credit */
> + u8 wc; /* C queue weight (between 0 and MAX_WC) */
> + u8 wl; /* L queue weight (MAX_WC - wc) */
> + } c_protection;
> +
> + /* General dualQ parameters */
> + u8 coupling_factor;/* Coupling factor (k) between both queues */
> + u8 ecn_mask; /* Mask to match L4S packets */
> + bool drop_early; /* Drop at enqueue instead of dequeue if true */
> + bool drop_overload; /* Drop (1) on overload, or overflow (0) */
> + bool split_gso; /* Split aggregated skb (1) or leave as is */
> +
> + /* Statistics */
> + u64 c_head_ts; /* Enqueue timestamp of the classic Q's head */
> + u64 l_head_ts; /* Enqueue timestamp of the L Q's head */
> + u64 last_qdelay; /* Q delay val at the last probability update */
> + u32 packets_in_c; /* Number of packets enqueued in C queue */
> + u32 packets_in_l; /* Number of packets enqueued in L queue */
> + u32 maxq; /* maximum queue size */
> + u32 ecn_mark; /* packets marked with ECN */
> + u32 step_marks; /* ECN marks due to the step AQM */
> +
> + struct { /* Deferred drop statistics */
> + u32 cnt; /* Packets dropped */
> + u32 len; /* Bytes dropped */
> + } deferred_drops;
> +};
> +
> +struct dualpi2_skb_cb {
> + u64 ts; /* Timestamp at enqueue */
> + u8 apply_step:1, /* Can we apply the step threshold */
> + classified:2, /* Packet classification results */
> + ect:2; /* Packet ECT codepoint */
> +};
> +
> +enum dualpi2_classification_results {
> + DUALPI2_C_CLASSIC = 0, /* C queue */
> + DUALPI2_C_L4S = 1, /* L queue (scalable marking/classic drops) */
> + DUALPI2_C_LLLL = 2, /* L queue (no drops/marks) */
> + __DUALPI2_C_MAX /* Keep last*/
> +};
> +
> +static struct dualpi2_skb_cb *dualpi2_skb_cb(struct sk_buff *skb)
> +{
> + qdisc_cb_private_validate(skb, sizeof(struct dualpi2_skb_cb));
> + return (struct dualpi2_skb_cb *)qdisc_skb_cb(skb)->data;
> +}
> +
> +static u64 skb_sojourn_time(struct sk_buff *skb, u64 reference)
> +{
> + return reference - dualpi2_skb_cb(skb)->ts;
> +}
>
better to use dualpi2 instead of skb prefix?
> +static u64 head_enqueue_time(struct Qdisc *q)
> +{
> + struct sk_buff *skb = qdisc_peek_head(q);
> +
> + return skb ? dualpi2_skb_cb(skb)->ts : 0;
> +}
> +
> +static u32 dualpi2_scale_alpha_beta(u32 param)
> +{
> + u64 tmp = ((u64)param * MAX_PROB >> ALPHA_BETA_SCALING);
> +
> + do_div(tmp, NSEC_PER_SEC);
> + return tmp;
> +}
> +
> +static u32 dualpi2_unscale_alpha_beta(u32 param)
> +{
> + u64 tmp = ((u64)param * NSEC_PER_SEC << ALPHA_BETA_SCALING);
> +
> + do_div(tmp, MAX_PROB);
> + return tmp;
> +}
> +
> +static ktime_t next_pi2_timeout(struct dualpi2_sched_data *q)
> +{
> + return ktime_add_ns(ktime_get_ns(), q->pi2.tupdate);
> +}
> +
> +static bool skb_is_l4s(struct sk_buff *skb)
> +{
> + return dualpi2_skb_cb(skb)->classified == DUALPI2_C_L4S;
> +}
> +
> +static bool skb_in_l_queue(struct sk_buff *skb)
> +{
> + return dualpi2_skb_cb(skb)->classified != DUALPI2_C_CLASSIC;
> +}
> +
> +static bool dualpi2_mark(struct dualpi2_sched_data *q, struct sk_buff *skb)
> +{
> + if (INET_ECN_set_ce(skb)) {
> + q->ecn_mark++;
> + return true;
> + }
> + return false;
> +}
> +
> +static void dualpi2_reset_c_protection(struct dualpi2_sched_data *q)
> +{
> + q->c_protection.credit = q->c_protection.init;
> +}
> +
> +/* This computes the initial credit value and WRR weight for the L queue (wl)
> + * from the weight of the C queue (wc).
> + * If wl > wc, the scheduler will start with the L queue when reset.
> + */
> +static void dualpi2_calculate_c_protection(struct Qdisc *sch,
> + struct dualpi2_sched_data *q, u32 wc)
> +{
> + q->c_protection.wc = wc;
> + q->c_protection.wl = MAX_WC - wc;
> + q->c_protection.init = (s32)psched_mtu(qdisc_dev(sch)) *
> + ((int)q->c_protection.wc - (int)q->c_protection.wl);
> + dualpi2_reset_c_protection(q);
> +}
> +
> +static bool dualpi2_roll(u32 prob)
> +{
> + return get_random_u32() <= prob;
> +}
> +
> +/* Packets in the C queue are subject to a marking probability pC, which is the
> + * square of the internal PI2 probability (i.e., have an overall lower mark/drop
> + * probability). If the qdisc is overloaded, ignore ECT values and only drop.
> + *
> + * Note that this marking scheme is also applied to L4S packets during overload.
> + * Return true if packet dropping is required in C queue
> + */
> +static bool dualpi2_classic_marking(struct dualpi2_sched_data *q,
> + struct sk_buff *skb, u32 prob,
> + bool overload)
> +{
> + if (dualpi2_roll(prob) && dualpi2_roll(prob)) {
> + if (overload || dualpi2_skb_cb(skb)->ect == INET_ECN_NOT_ECT)
> + return true;
> + dualpi2_mark(q, skb);
> + }
> + return false;
> +}
> +
> +/* Packets in the L queue are subject to a marking probability pL given by the
> + * internal PI2 probability scaled by the coupling factor.
> + *
> + * On overload (i.e., @local_l_prob is >= 100%):
> + * - if the qdisc is configured to trade losses to preserve latency (i.e.,
> + * @q->drop_overload), apply classic drops first before marking.
> + * - otherwise, preserve the "no loss" property of ECN at the cost of queueing
> + * delay, eventually resulting in taildrop behavior once sch->limit is
> + * reached.
> + * Return true if packet dropping is required in L queue
> + */
> +static bool dualpi2_scalable_marking(struct dualpi2_sched_data *q,
> + struct sk_buff *skb,
> + u64 local_l_prob, u32 prob,
> + bool overload)
> +{
> + if (overload) {
> + /* Apply classic drop */
> + if (!q->drop_overload ||
> + !(dualpi2_roll(prob) && dualpi2_roll(prob)))
> + goto mark;
> + return true;
> + }
> +
> + /* We can safely cut the upper 32b as overload==false */
> + if (dualpi2_roll(local_l_prob)) {
> + /* Non-ECT packets could have classified as L4S by filters. */
> + if (dualpi2_skb_cb(skb)->ect == INET_ECN_NOT_ECT)
> + return true;
> +mark:
> + dualpi2_mark(q, skb);
> + }
> + return false;
> +}
> +
> +/* Decide whether a given packet must be dropped (or marked if ECT), according
> + * to the PI2 probability.
> + *
> + * Never mark/drop if we have a standing queue of less than 2 MTUs.
> + */
> +static bool must_drop(struct Qdisc *sch, struct dualpi2_sched_data *q,
> + struct sk_buff *skb)
> +{
> + u64 local_l_prob;
> + u32 prob;
> + bool overload;
> +
> + if (sch->qstats.backlog < 2 * psched_mtu(qdisc_dev(sch)))
> + return false;
> +
> + prob = READ_ONCE(q->pi2.prob);
> + local_l_prob = (u64)prob * q->coupling_factor;
> + overload = local_l_prob > MAX_PROB;
> +
> + switch (dualpi2_skb_cb(skb)->classified) {
> + case DUALPI2_C_CLASSIC:
> + return dualpi2_classic_marking(q, skb, prob, overload);
> + case DUALPI2_C_L4S:
> + return dualpi2_scalable_marking(q, skb, local_l_prob, prob,
> + overload);
> + default: /* DUALPI2_C_LLLL */
> + return false;
> + }
> +}
> +
> +static void dualpi2_read_ect(struct sk_buff *skb)
> +{
> + struct dualpi2_skb_cb *cb = dualpi2_skb_cb(skb);
> + int wlen = skb_network_offset(skb);
> +
> + switch (skb_protocol(skb, true)) {
> + case htons(ETH_P_IP):
> + wlen += sizeof(struct iphdr);
> + if (!pskb_may_pull(skb, wlen) ||
> + skb_try_make_writable(skb, wlen))
> + goto not_ecn;
> +
> + cb->ect = ipv4_get_dsfield(ip_hdr(skb)) & INET_ECN_MASK;
> + break;
> + case htons(ETH_P_IPV6):
> + wlen += sizeof(struct ipv6hdr);
> + if (!pskb_may_pull(skb, wlen) ||
> + skb_try_make_writable(skb, wlen))
> + goto not_ecn;
> +
> + cb->ect = ipv6_get_dsfield(ipv6_hdr(skb)) & INET_ECN_MASK;
> + break;
> + default:
> + goto not_ecn;
> + }
> + return;
> +
> +not_ecn:
> + /* Non pullable/writable packets can only be dropped hence are
> + * classified as not ECT.
> + */
> + cb->ect = INET_ECN_NOT_ECT;
> +}
> +
> +static int dualpi2_skb_classify(struct dualpi2_sched_data *q,
> + struct sk_buff *skb)
> +{
> + struct dualpi2_skb_cb *cb = dualpi2_skb_cb(skb);
> + struct tcf_result res;
> + struct tcf_proto *fl;
> + int result;
> +
> + dualpi2_read_ect(skb);
> + if (cb->ect & q->ecn_mask) {
> + cb->classified = DUALPI2_C_L4S;
> + return NET_XMIT_SUCCESS;
> + }
> +
> + if (TC_H_MAJ(skb->priority) == q->sch->handle &&
> + TC_H_MIN(skb->priority) < __DUALPI2_C_MAX) {
> + cb->classified = TC_H_MIN(skb->priority);
> + return NET_XMIT_SUCCESS;
> + }
> +
> + fl = rcu_dereference_bh(q->tcf.filters);
> + if (!fl) {
> + cb->classified = DUALPI2_C_CLASSIC;
> + return NET_XMIT_SUCCESS;
> + }
> +
> + result = tcf_classify(skb, NULL, fl, &res, false);
> + if (result >= 0) {
> +#ifdef CONFIG_NET_CLS_ACT
> + switch (result) {
> + case TC_ACT_STOLEN:
> + case TC_ACT_QUEUED:
> + case TC_ACT_TRAP:
> + return NET_XMIT_SUCCESS | __NET_XMIT_STOLEN;
> + case TC_ACT_SHOT:
> + return NET_XMIT_SUCCESS | __NET_XMIT_BYPASS;
> + }
> +#endif
> + cb->classified = TC_H_MIN(res.classid) < __DUALPI2_C_MAX ?
> + TC_H_MIN(res.classid) : DUALPI2_C_CLASSIC;
> + }
> + return NET_XMIT_SUCCESS;
> +}
> +
> +static int dualpi2_enqueue_skb(struct sk_buff *skb, struct Qdisc *sch,
> + struct sk_buff **to_free)
> +{
> + struct dualpi2_sched_data *q = qdisc_priv(sch);
> + struct dualpi2_skb_cb *cb;
> +
> + if (unlikely(qdisc_qlen(sch) >= sch->limit)) {
> + qdisc_qstats_overlimit(sch);
> + if (skb_in_l_queue(skb))
> + qdisc_qstats_overlimit(q->l_queue);
> + return qdisc_drop(skb, sch, to_free);
> + }
> +
> + if (q->drop_early && must_drop(sch, q, skb)) {
> + qdisc_drop(skb, sch, to_free);
> + return NET_XMIT_SUCCESS | __NET_XMIT_BYPASS;
> + }
> +
> + cb = dualpi2_skb_cb(skb);
> + cb->ts = ktime_get_ns();
> +
> + if (qdisc_qlen(sch) > q->maxq)
> + q->maxq = qdisc_qlen(sch);
> +
> + if (skb_in_l_queue(skb)) {
> + /* Only apply the step if a queue is building up */
> + dualpi2_skb_cb(skb)->apply_step =
> + skb_is_l4s(skb) && qdisc_qlen(q->l_queue) > 1;
> + /* Keep the overall qdisc stats consistent */
> + ++sch->q.qlen;
> + qdisc_qstats_backlog_inc(sch, skb);
> + ++q->packets_in_l;
> + if (!q->l_head_ts)
> + q->l_head_ts = cb->ts;
> + return qdisc_enqueue_tail(skb, q->l_queue);
> + }
> + ++q->packets_in_c;
> + if (!q->c_head_ts)
> + q->c_head_ts = cb->ts;
> + return qdisc_enqueue_tail(skb, sch);
> +}
> +
> +/* Optionally, dualpi2 will split GSO skbs into independent skbs and enqueue
> + * each of those individually. This yields the following benefits, at the
> + * expense of CPU usage:
> + * - Finer-grained AQM actions as the sub-packets of a burst no longer share the
> + * same fate (e.g., the random mark/drop probability is applied individually)
> + * - Improved precision of the starvation protection/WRR scheduler at dequeue,
> + * as the size of the dequeued packets will be smaller.
> + */
> +static int dualpi2_qdisc_enqueue(struct sk_buff *skb, struct Qdisc *sch,
> + struct sk_buff **to_free)
> +{
> + struct dualpi2_sched_data *q = qdisc_priv(sch);
> + int err;
> +
> + err = dualpi2_skb_classify(q, skb);
> + if (err != NET_XMIT_SUCCESS) {
> + if (err & __NET_XMIT_BYPASS)
> + qdisc_qstats_drop(sch);
> + __qdisc_drop(skb, to_free);
> + return err;
> + }
> +
> + if (q->split_gso && skb_is_gso(skb)) {
> + netdev_features_t features;
> + struct sk_buff *nskb, *next;
> + int cnt, byte_len, orig_len;
> + int err;
> +
> + features = netif_skb_features(skb);
> + nskb = skb_gso_segment(skb, features & ~NETIF_F_GSO_MASK);
> + if (IS_ERR_OR_NULL(nskb))
> + return qdisc_drop(skb, sch, to_free);
> +
> + cnt = 1;
> + byte_len = 0;
> + orig_len = qdisc_pkt_len(skb);
> + while (nskb) {
> + next = nskb->next;
> + skb_mark_not_on_list(nskb);
> + qdisc_skb_cb(nskb)->pkt_len = nskb->len;
> + dualpi2_skb_cb(nskb)->classified =
> + dualpi2_skb_cb(skb)->classified;
> + dualpi2_skb_cb(nskb)->ect = dualpi2_skb_cb(skb)->ect;
> + err = dualpi2_enqueue_skb(nskb, sch, to_free);
> + if (err == NET_XMIT_SUCCESS) {
> + /* Compute the backlog adjustment that needs
> + * to be propagated in the qdisc tree to reflect
> + * all new skbs successfully enqueued.
> + */
> + ++cnt;
> + byte_len += nskb->len;
> + }
> + nskb = next;
> + }
> + if (err == NET_XMIT_SUCCESS) {
> + /* The caller will add the original skb stats to its
> + * backlog, compensate this.
> + */
> + --cnt;
> + byte_len -= orig_len;
> + }
> + qdisc_tree_reduce_backlog(sch, -cnt, -byte_len);
> + consume_skb(skb);
> + return err;
> + }
> + return dualpi2_enqueue_skb(skb, sch, to_free);
> +}
> +
> +/* Select the queue from which the next packet can be dequeued, ensuring that
> + * neither queue can starve the other with a WRR scheduler.
> + *
> + * The sign of the WRR credit determines the next queue, while the size of
> + * the dequeued packet determines the magnitude of the WRR credit change. If
> + * either queue is empty, the WRR credit is kept unchanged.
> + *
> + * As the dequeued packet can be dropped later, the caller has to perform the
> + * qdisc_bstats_update() calls.
> + */
> +static struct sk_buff *dequeue_packet(struct Qdisc *sch,
> + struct dualpi2_sched_data *q,
> + int *credit_change,
> + u64 now)
> +{
> + struct sk_buff *skb = NULL;
> + int c_len;
> +
> + *credit_change = 0;
> + c_len = qdisc_qlen(sch) - qdisc_qlen(q->l_queue);
> + if (qdisc_qlen(q->l_queue) && (!c_len || q->c_protection.credit <= 0)) {
> + skb = __qdisc_dequeue_head(&q->l_queue->q);
> + WRITE_ONCE(q->l_head_ts, head_enqueue_time(q->l_queue));
> + if (c_len)
> + *credit_change = q->c_protection.wc;
> + qdisc_qstats_backlog_dec(q->l_queue, skb);
> + /* Keep the global queue size consistent */
> + --sch->q.qlen;
> + } else if (c_len) {
> + skb = __qdisc_dequeue_head(&sch->q);
> + WRITE_ONCE(q->c_head_ts, head_enqueue_time(sch));
> + if (qdisc_qlen(q->l_queue))
> + *credit_change = ~((s32)q->c_protection.wl) + 1;
> + } else {
> + dualpi2_reset_c_protection(q);
> + return NULL;
> + }
> + *credit_change *= qdisc_pkt_len(skb);
> + qdisc_qstats_backlog_dec(sch, skb);
> + return skb;
> +}
> +
> +static int do_step_aqm(struct dualpi2_sched_data *q, struct sk_buff *skb,
> + u64 now)
> +{
> + u64 qdelay = 0;
> +
> + if (q->step.in_packets)
> + qdelay = qdisc_qlen(q->l_queue);
> + else
> + qdelay = skb_sojourn_time(skb, now);
> +
> + if (dualpi2_skb_cb(skb)->apply_step && qdelay > q->step.thresh) {
> + if (!dualpi2_skb_cb(skb)->ect)
> + /* Drop this non-ECT packet */
> + return 1;
> + if (dualpi2_mark(q, skb))
> + ++q->step_marks;
> + }
> + qdisc_bstats_update(q->l_queue, skb);
> + return 0;
> +}
> +
> +static void drop_and_retry(struct dualpi2_sched_data *q, struct sk_buff *skb, struct Qdisc *sch)
> +{
> + ++q->deferred_drops.cnt;
> + q->deferred_drops.len += qdisc_pkt_len(skb);
> + consume_skb(skb);
> + qdisc_qstats_drop(sch);
> +}
> +
> +static struct sk_buff *dualpi2_qdisc_dequeue(struct Qdisc *sch)
> +{
> + struct dualpi2_sched_data *q = qdisc_priv(sch);
> + struct sk_buff *skb;
> + int credit_change;
> + u64 now;
> +
> + now = ktime_get_ns();
> +
> + while ((skb = dequeue_packet(sch, q, &credit_change, now))) {
> + if (!q->drop_early && must_drop(sch, q, skb)) {
> + drop_and_retry(q, skb, sch);
> + continue;
> + }
> +
> + if (skb_in_l_queue(skb) && do_step_aqm(q, skb, now)) {
> + qdisc_qstats_drop(q->l_queue);
> + drop_and_retry(q, skb, sch);
> + continue;
> + }
> +
> + q->c_protection.credit += credit_change;
> + qdisc_bstats_update(sch, skb);
> + break;
> + }
> +
> + /* We cannot call qdisc_tree_reduce_backlog() if our qlen is 0,
> + * or HTB crashes.
> + */
> + if (q->deferred_drops.cnt && qdisc_qlen(sch)) {
> + qdisc_tree_reduce_backlog(sch, q->deferred_drops.cnt,
> + q->deferred_drops.len);
> + q->deferred_drops.cnt = 0;
> + q->deferred_drops.len = 0;
> + }
> + return skb;
> +}
> +
> +static s64 __scale_delta(u64 diff)
> +{
> + do_div(diff, 1 << ALPHA_BETA_GRANULARITY);
> + return diff;
> +}
> +
> +static void get_queue_delays(struct dualpi2_sched_data *q, u64 *qdelay_c,
> + u64 *qdelay_l)
> +{
> + u64 now, qc, ql;
> +
> + now = ktime_get_ns();
> + qc = READ_ONCE(q->c_head_ts);
> + ql = READ_ONCE(q->l_head_ts);
> +
> + *qdelay_c = qc ? now - qc : 0;
> + *qdelay_l = ql ? now - ql : 0;
> +}
> +
> +static u32 calculate_probability(struct Qdisc *sch)
> +{
> + struct dualpi2_sched_data *q = qdisc_priv(sch);
> + u32 new_prob;
> + u64 qdelay_c;
> + u64 qdelay_l;
> + u64 qdelay;
> + s64 delta;
> +
> + get_queue_delays(q, &qdelay_c, &qdelay_l);
> + qdelay = max(qdelay_l, qdelay_c);
> + /* Alpha and beta take at most 32b, i.e., the delay difference would
> + * overflow for queuing delay differences > ~4.2sec.
> + */
> + delta = ((s64)qdelay - q->pi2.target) * q->pi2.alpha;
> + delta += ((s64)qdelay - q->last_qdelay) * q->pi2.beta;
> + if (delta > 0) {
> + new_prob = __scale_delta(delta) + q->pi2.prob;
> + if (new_prob < q->pi2.prob)
> + new_prob = MAX_PROB;
> + } else {
> + new_prob = q->pi2.prob - __scale_delta(~delta + 1);
> + if (new_prob > q->pi2.prob)
> + new_prob = 0;
> + }
> + q->last_qdelay = qdelay;
> + /* If we do not drop on overload, ensure we cap the L4S probability to
> + * 100% to keep window fairness when overflowing.
> + */
> + if (!q->drop_overload)
> + return min_t(u32, new_prob, MAX_PROB / q->coupling_factor);
> + return new_prob;
> +}
> +
> +static enum hrtimer_restart dualpi2_timer(struct hrtimer *timer)
> +{
> + struct dualpi2_sched_data *q = from_timer(q, timer, pi2.timer);
> +
> + WRITE_ONCE(q->pi2.prob, calculate_probability(q->sch));
> +
> + hrtimer_set_expires(&q->pi2.timer, next_pi2_timeout(q));
> + return HRTIMER_RESTART;
> +}
> +
> +static const struct nla_policy dualpi2_policy[TCA_DUALPI2_MAX + 1] = {
> + [TCA_DUALPI2_LIMIT] = {.type = NLA_U32},
> + [TCA_DUALPI2_TARGET] = {.type = NLA_U32},
> + [TCA_DUALPI2_TUPDATE] = {.type = NLA_U32},
> + [TCA_DUALPI2_ALPHA] = {.type = NLA_U32},
> + [TCA_DUALPI2_BETA] = {.type = NLA_U32},
> + [TCA_DUALPI2_STEP_THRESH] = {.type = NLA_U32},
> + [TCA_DUALPI2_STEP_PACKETS] = {.type = NLA_U8},
> + [TCA_DUALPI2_COUPLING] = {.type = NLA_U8},
> + [TCA_DUALPI2_DROP_OVERLOAD] = {.type = NLA_U8},
> + [TCA_DUALPI2_DROP_EARLY] = {.type = NLA_U8},
> + [TCA_DUALPI2_C_PROTECTION] = {.type = NLA_U8},
> + [TCA_DUALPI2_ECN_MASK] = {.type = NLA_U8},
> + [TCA_DUALPI2_SPLIT_GSO] = {.type = NLA_U8},
> +};
>
> +static int dualpi2_change(struct Qdisc *sch, struct nlattr *opt,
> + struct netlink_ext_ack *extack)
> +{
> + struct nlattr *tb[TCA_DUALPI2_MAX + 1];
> + struct dualpi2_sched_data *q;
> + int old_backlog;
> + int old_qlen;
> + int err;
> +
> + if (!opt)
> + return -EINVAL;
> + err = nla_parse_nested_deprecated(tb, TCA_DUALPI2_MAX, opt,
> + dualpi2_policy, extack);
Given this is a new qdisc - use normal nla_parse_nested()
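For illustration, the suggested call could look roughly like this (same attribute table and policy as in the patch; a sketch only, not the submitted code):

	err = nla_parse_nested(tb, TCA_DUALPI2_MAX, opt,
			       dualpi2_policy, extack);

nla_parse_nested() performs strict validation, which also combines well with moving the range checks into the policy table (see the comment further below).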
> + if (err < 0)
> + return err;
> +
> + q = qdisc_priv(sch);
> + sch_tree_lock(sch);
> +
> + if (tb[TCA_DUALPI2_LIMIT]) {
> + u32 limit = nla_get_u32(tb[TCA_DUALPI2_LIMIT]);
> +
> + if (!limit) {
> + NL_SET_ERR_MSG_ATTR(extack, tb[TCA_DUALPI2_LIMIT],
> + "limit must be greater than 0.");
> + sch_tree_unlock(sch);
> + return -EINVAL;
> + }
> + sch->limit = limit;
> + }
> +
> + if (tb[TCA_DUALPI2_TARGET])
> + q->pi2.target = (u64)nla_get_u32(tb[TCA_DUALPI2_TARGET]) *
> + NSEC_PER_USEC;
> +
> + if (tb[TCA_DUALPI2_TUPDATE]) {
> + u64 tupdate = nla_get_u32(tb[TCA_DUALPI2_TUPDATE]);
> +
> + if (!tupdate) {
> + NL_SET_ERR_MSG_ATTR(extack, tb[TCA_DUALPI2_TUPDATE],
> + "tupdate cannot be 0us.");
> + sch_tree_unlock(sch);
> + return -EINVAL;
> + }
> + q->pi2.tupdate = tupdate * NSEC_PER_USEC;
> + }
> +
> + if (tb[TCA_DUALPI2_ALPHA]) {
> + u32 alpha = nla_get_u32(tb[TCA_DUALPI2_ALPHA]);
> +
> + if (alpha > ALPHA_BETA_MAX) {
> + NL_SET_ERR_MSG_ATTR(extack, tb[TCA_DUALPI2_ALPHA],
> + "alpha is too large.");
> + sch_tree_unlock(sch);
> + return -EINVAL;
> + }
> + q->pi2.alpha = dualpi2_scale_alpha_beta(alpha);
> + }
You should consider using netlink policies for these checks (for
example, you can check for min/max without replicating code as above).
Applies in quite a few places (and not just for max/min validation)
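For illustration, a sketch of how some of the open-coded range checks could move into the policy table, assuming the ALPHA_BETA_MAX and MAX_WC limits already defined by the patch (not the submitted code):

	static const struct nla_policy dualpi2_policy[TCA_DUALPI2_MAX + 1] = {
		[TCA_DUALPI2_LIMIT]        = NLA_POLICY_MIN(NLA_U32, 1),
		[TCA_DUALPI2_TUPDATE]      = NLA_POLICY_MIN(NLA_U32, 1),
		[TCA_DUALPI2_ALPHA]        = NLA_POLICY_MAX(NLA_U32, ALPHA_BETA_MAX),
		[TCA_DUALPI2_BETA]         = NLA_POLICY_MAX(NLA_U32, ALPHA_BETA_MAX),
		[TCA_DUALPI2_COUPLING]     = NLA_POLICY_MIN(NLA_U8, 1),
		[TCA_DUALPI2_C_PROTECTION] = NLA_POLICY_MAX(NLA_U8, MAX_WC),
		/* remaining attributes unchanged from the patch */
	};

The core then rejects out-of-range values during parsing (with an extack message pointing at the offending attribute), and the corresponding if blocks in dualpi2_change() reduce to plain assignments.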
cheers,
jamal
> +
> + if (tb[TCA_DUALPI2_BETA]) {
> + u32 beta = nla_get_u32(tb[TCA_DUALPI2_BETA]);
> +
> + if (beta > ALPHA_BETA_MAX) {
> + NL_SET_ERR_MSG_ATTR(extack, tb[TCA_DUALPI2_BETA],
> + "beta is too large.");
> + sch_tree_unlock(sch);
> + return -EINVAL;
> + }
> + q->pi2.beta = dualpi2_scale_alpha_beta(beta);
> + }
> +
> + if (tb[TCA_DUALPI2_STEP_THRESH])
> + q->step.thresh = nla_get_u32(tb[TCA_DUALPI2_STEP_THRESH]) *
> + NSEC_PER_USEC;
> +
> + if (tb[TCA_DUALPI2_COUPLING]) {
> + u8 coupling = nla_get_u8(tb[TCA_DUALPI2_COUPLING]);
> +
> + if (!coupling) {
> + NL_SET_ERR_MSG_ATTR(extack, tb[TCA_DUALPI2_COUPLING],
> + "Must use a non-zero coupling.");
> + sch_tree_unlock(sch);
> + return -EINVAL;
> + }
> + q->coupling_factor = coupling;
> + }
> +
> + if (tb[TCA_DUALPI2_STEP_PACKETS])
> + q->step.in_packets = !!nla_get_u8(tb[TCA_DUALPI2_STEP_PACKETS]);
> +
> + if (tb[TCA_DUALPI2_DROP_OVERLOAD])
> + q->drop_overload = !!nla_get_u8(tb[TCA_DUALPI2_DROP_OVERLOAD]);
> +
> + if (tb[TCA_DUALPI2_DROP_EARLY])
> + q->drop_early = !!nla_get_u8(tb[TCA_DUALPI2_DROP_EARLY]);
> +
> + if (tb[TCA_DUALPI2_C_PROTECTION]) {
> + u8 wc = nla_get_u8(tb[TCA_DUALPI2_C_PROTECTION]);
> +
> + if (wc > MAX_WC) {
> + NL_SET_ERR_MSG_ATTR(extack,
> + tb[TCA_DUALPI2_C_PROTECTION],
> + "c_protection must be <= 100.");
> + sch_tree_unlock(sch);
> + return -EINVAL;
> + }
> + dualpi2_calculate_c_protection(sch, q, wc);
> + }
> +
> + if (tb[TCA_DUALPI2_ECN_MASK])
> + q->ecn_mask = nla_get_u8(tb[TCA_DUALPI2_ECN_MASK]);
> +
> + if (tb[TCA_DUALPI2_SPLIT_GSO])
> + q->split_gso = !!nla_get_u8(tb[TCA_DUALPI2_SPLIT_GSO]);
> +
> + old_qlen = qdisc_qlen(sch);
> + old_backlog = sch->qstats.backlog;
> + while (qdisc_qlen(sch) > sch->limit) {
> + struct sk_buff *skb = __qdisc_dequeue_head(&sch->q);
> +
> + qdisc_qstats_backlog_dec(sch, skb);
> + rtnl_qdisc_drop(skb, sch);
> + }
> + qdisc_tree_reduce_backlog(sch, old_qlen - qdisc_qlen(sch),
> + old_backlog - sch->qstats.backlog);
> +
> + sch_tree_unlock(sch);
> + return 0;
> +}
> +
> +/* Default alpha/beta values give a 10dB stability margin with max_rtt=100ms. */
> +static void dualpi2_reset_default(struct dualpi2_sched_data *q)
> +{
> + q->sch->limit = 10000; /* Max 125ms at 1Gbps */
> +
> + q->pi2.target = 15 * NSEC_PER_MSEC;
> + q->pi2.tupdate = 16 * NSEC_PER_MSEC;
> + q->pi2.alpha = dualpi2_scale_alpha_beta(41); /* ~0.16 Hz * 256 */
> + q->pi2.beta = dualpi2_scale_alpha_beta(819); /* ~3.20 Hz * 256 */
> +
> + q->step.thresh = 1 * NSEC_PER_MSEC;
> + q->step.in_packets = false;
> +
> + dualpi2_calculate_c_protection(q->sch, q, 10); /* wc=10%, wl=90% */
> +
> + q->ecn_mask = INET_ECN_ECT_1;
> + q->coupling_factor = 2; /* window fairness for equal RTTs */
> + q->drop_overload = true; /* Preserve latency by dropping */
> + q->drop_early = false; /* PI2 drops on dequeue */
> + q->split_gso = true;
> +}
> +
> +static int dualpi2_init(struct Qdisc *sch, struct nlattr *opt,
> + struct netlink_ext_ack *extack)
> +{
> + struct dualpi2_sched_data *q = qdisc_priv(sch);
> + int err;
> +
> + q->l_queue = qdisc_create_dflt(sch->dev_queue, &pfifo_qdisc_ops,
> + TC_H_MAKE(sch->handle, 1), extack);
> + if (!q->l_queue)
> + return -ENOMEM;
> +
> + err = tcf_block_get(&q->tcf.block, &q->tcf.filters, sch, extack);
> + if (err)
> + return err;
> +
> + q->sch = sch;
> + dualpi2_reset_default(q);
> + hrtimer_init(&q->pi2.timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED);
> + q->pi2.timer.function = dualpi2_timer;
> +
> + if (opt) {
> + err = dualpi2_change(sch, opt, extack);
> +
> + if (err)
> + return err;
> + }
> +
> + hrtimer_start(&q->pi2.timer, next_pi2_timeout(q),
> + HRTIMER_MODE_ABS_PINNED);
> + return 0;
> +}
> +
> +static u32 convert_ns_to_usec(u64 ns)
> +{
> + do_div(ns, NSEC_PER_USEC);
> + return ns;
> +}
> +
> +static int dualpi2_dump(struct Qdisc *sch, struct sk_buff *skb)
> +{
> + struct dualpi2_sched_data *q = qdisc_priv(sch);
> + struct nlattr *opts;
> +
> + opts = nla_nest_start_noflag(skb, TCA_OPTIONS);
> + if (!opts)
> + goto nla_put_failure;
> +
> + if (nla_put_u32(skb, TCA_DUALPI2_LIMIT, sch->limit) ||
> + nla_put_u32(skb, TCA_DUALPI2_TARGET,
> + convert_ns_to_usec(q->pi2.target)) ||
> + nla_put_u32(skb, TCA_DUALPI2_TUPDATE,
> + convert_ns_to_usec(q->pi2.tupdate)) ||
> + nla_put_u32(skb, TCA_DUALPI2_ALPHA,
> + dualpi2_unscale_alpha_beta(q->pi2.alpha)) ||
> + nla_put_u32(skb, TCA_DUALPI2_BETA,
> + dualpi2_unscale_alpha_beta(q->pi2.beta)) ||
> + nla_put_u32(skb, TCA_DUALPI2_STEP_THRESH, q->step.in_packets ?
> + q->step.thresh : convert_ns_to_usec(q->step.thresh)) ||
> + nla_put_u8(skb, TCA_DUALPI2_COUPLING, q->coupling_factor) ||
> + nla_put_u8(skb, TCA_DUALPI2_DROP_OVERLOAD, q->drop_overload) ||
> + nla_put_u8(skb, TCA_DUALPI2_STEP_PACKETS, q->step.in_packets) ||
> + nla_put_u8(skb, TCA_DUALPI2_DROP_EARLY, q->drop_early) ||
> + nla_put_u8(skb, TCA_DUALPI2_C_PROTECTION, q->c_protection.wc) ||
> + nla_put_u8(skb, TCA_DUALPI2_ECN_MASK, q->ecn_mask) ||
> + nla_put_u8(skb, TCA_DUALPI2_SPLIT_GSO, q->split_gso))
> + goto nla_put_failure;
> +
> + return nla_nest_end(skb, opts);
> +
> +nla_put_failure:
> + nla_nest_cancel(skb, opts);
> + return -1;
> +}
> +
> +static int dualpi2_dump_stats(struct Qdisc *sch, struct gnet_dump *d)
> +{
> + struct dualpi2_sched_data *q = qdisc_priv(sch);
> + struct tc_dualpi2_xstats st = {
> + .prob = READ_ONCE(q->pi2.prob),
> + .packets_in_c = q->packets_in_c,
> + .packets_in_l = q->packets_in_l,
> + .maxq = q->maxq,
> + .ecn_mark = q->ecn_mark,
> + .credit = q->c_protection.credit,
> + .step_marks = q->step_marks,
> + };
> + u64 qc, ql;
> +
> + get_queue_delays(q, &qc, &ql);
> + st.delay_l = convert_ns_to_usec(ql);
> + st.delay_c = convert_ns_to_usec(qc);
> + return gnet_stats_copy_app(d, &st, sizeof(st));
> +}
> +
> +static void dualpi2_reset(struct Qdisc *sch)
> +{
> + struct dualpi2_sched_data *q = qdisc_priv(sch);
> +
> + qdisc_reset_queue(sch);
> + qdisc_reset_queue(q->l_queue);
> + q->c_head_ts = 0;
> + q->l_head_ts = 0;
> + q->pi2.prob = 0;
> + q->packets_in_c = 0;
> + q->packets_in_l = 0;
> + q->maxq = 0;
> + q->ecn_mark = 0;
> + q->step_marks = 0;
> + dualpi2_reset_c_protection(q);
> +}
> +
> +static void dualpi2_destroy(struct Qdisc *sch)
> +{
> + struct dualpi2_sched_data *q = qdisc_priv(sch);
> +
> + q->pi2.tupdate = 0;
> + hrtimer_cancel(&q->pi2.timer);
> + if (q->l_queue)
> + qdisc_put(q->l_queue);
> + tcf_block_put(q->tcf.block);
> +}
> +
> +static struct Qdisc *dualpi2_leaf(struct Qdisc *sch, unsigned long arg)
> +{
> + return NULL;
> +}
> +
> +static unsigned long dualpi2_find(struct Qdisc *sch, u32 classid)
> +{
> + return 0;
> +}
> +
> +static unsigned long dualpi2_bind(struct Qdisc *sch, unsigned long parent,
> + u32 classid)
> +{
> + return 0;
> +}
> +
> +static void dualpi2_unbind(struct Qdisc *q, unsigned long cl)
> +{
> +}
> +
> +static struct tcf_block *dualpi2_tcf_block(struct Qdisc *sch, unsigned long cl,
> + struct netlink_ext_ack *extack)
> +{
> + struct dualpi2_sched_data *q = qdisc_priv(sch);
> +
> + if (cl)
> + return NULL;
> + return q->tcf.block;
> +}
> +
> +static void dualpi2_walk(struct Qdisc *sch, struct qdisc_walker *arg)
> +{
> + unsigned int i;
> +
> + if (arg->stop)
> + return;
> +
> + /* We statically define only 2 queues */
> + for (i = 0; i < 2; i++) {
> + if (arg->count < arg->skip) {
> + arg->count++;
> + continue;
> + }
> + if (arg->fn(sch, i + 1, arg) < 0) {
> + arg->stop = 1;
> + break;
> + }
> + arg->count++;
> + }
> +}
> +
> +/* Minimal class support to handle tc filters */
> +static const struct Qdisc_class_ops dualpi2_class_ops = {
> + .leaf = dualpi2_leaf,
> + .find = dualpi2_find,
> + .tcf_block = dualpi2_tcf_block,
> + .bind_tcf = dualpi2_bind,
> + .unbind_tcf = dualpi2_unbind,
> + .walk = dualpi2_walk,
> +};
> +
> +static struct Qdisc_ops dualpi2_qdisc_ops __read_mostly = {
> + .id = "dualpi2",
> + .cl_ops = &dualpi2_class_ops,
> + .priv_size = sizeof(struct dualpi2_sched_data),
> + .enqueue = dualpi2_qdisc_enqueue,
> + .dequeue = dualpi2_qdisc_dequeue,
> + .peek = qdisc_peek_dequeued,
> + .init = dualpi2_init,
> + .destroy = dualpi2_destroy,
> + .reset = dualpi2_reset,
> + .change = dualpi2_change,
> + .dump = dualpi2_dump,
> + .dump_stats = dualpi2_dump_stats,
> + .owner = THIS_MODULE,
> +};
> +
> +static int __init dualpi2_module_init(void)
> +{
> + return register_qdisc(&dualpi2_qdisc_ops);
> +}
> +
> +static void __exit dualpi2_module_exit(void)
> +{
> + unregister_qdisc(&dualpi2_qdisc_ops);
> +}
> +
> +module_init(dualpi2_module_init);
> +module_exit(dualpi2_module_exit);
> +
> +MODULE_DESCRIPTION("Dual Queue with Proportional Integral controller Improved with a Square (dualpi2) scheduler");
> +MODULE_AUTHOR("Koen De Schepper <koen.de_schepper@nokia-bell-labs.com>");
> +MODULE_AUTHOR("Olga Albisser <olga@albisser.org>");
> +MODULE_AUTHOR("Henrik Steen <henrist@henrist.net>");
> +MODULE_AUTHOR("Olivier Tilmans <olivier.tilmans@nokia.com>");
> +MODULE_AUTHOR("Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>");
> +
> +MODULE_LICENSE("GPL");
> +MODULE_VERSION("1.0");
> --
> 2.34.1
>
>
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH net-next 01/44] sched: Add dualpi2 qdisc
2024-10-15 15:30 ` Jamal Hadi Salim
@ 2024-10-15 15:40 ` Jakub Kicinski
0 siblings, 0 replies; 56+ messages in thread
From: Jakub Kicinski @ 2024-10-15 15:40 UTC (permalink / raw)
To: Jamal Hadi Salim, chia-yu.chang
Cc: netdev, ij, ncardwell, koen.de_schepper, g.white,
ingemar.s.johansson, mirja.kuehlewind, cheshire, rs.ietf,
Jason_Livingood, vidhi_goel, Olga Albisser, Olivier Tilmans,
Henrik Steen, Bob Briscoe
On Tue, 15 Oct 2024 11:30:01 -0400 Jamal Hadi Salim wrote:
> > + if (tb[TCA_DUALPI2_ALPHA]) {
> > + u32 alpha = nla_get_u32(tb[TCA_DUALPI2_ALPHA]);
> > +
> > + if (alpha > ALPHA_BETA_MAX) {
> > + NL_SET_ERR_MSG_ATTR(extack, tb[TCA_DUALPI2_ALPHA],
> > + "alpha is too large.");
> > + sch_tree_unlock(sch);
> > + return -EINVAL;
> > + }
> > + q->pi2.alpha = dualpi2_scale_alpha_beta(alpha);
> > + }
>
> You should consider using netlink policies for these checks (for
> example, you can check for min/max without replicating code as above).
> Applies in quite a few places (and not just for max/min validation)
In fact I think we should also start asking for YAML specs.
Donald already added most of the existing TC stuff.
Please extend Documentation/netlink/specs/tc.yaml
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH net-next 00/44] DualPI2, Accurate ECN, TCP Prague patch series
2024-10-15 15:14 ` Koen De Schepper (Nokia)
@ 2024-10-15 17:52 ` Eric Dumazet
2024-10-15 19:30 ` Chia-Yu Chang (Nokia)
0 siblings, 1 reply; 56+ messages in thread
From: Eric Dumazet @ 2024-10-15 17:52 UTC (permalink / raw)
To: Koen De Schepper (Nokia), Paolo Abeni, Chia-Yu Chang (Nokia),
netdev@vger.kernel.org, ij@kernel.org, ncardwell@google.com,
g.white@CableLabs.com, ingemar.s.johansson@ericsson.com,
mirja.kuehlewind@ericsson.com, cheshire@apple.com, rs.ietf@gmx.at,
Jason_Livingood@comcast.com, vidhi_goel@apple.com, edumazet
On 10/15/24 5:14 PM, Koen De Schepper (Nokia) wrote:
> We had several internal review rounds that specifically made sure it is in line with the processes/guidelines you are referring to.
>
> DualPI2 and TCP-Prague are new modules, mostly in a separate file. ACC_ECN unfortunately involves quite a few changes in different files with different functionality; these were split into manageable smaller incremental chunks according to the guidelines, ending up in 40 patches. The good thing is that they are small and should be easily processable. It could be split into these 3 features, but that would still involve all the ACC_ECN, preferably as one patch set. On top of that, the 3 TCP-Prague patches rely on the 40 ACC_ECN, so preferably we keep them together too...
>
> The 3 functions are used and tested in many kernels. Initial development started from 3.16 to 4.x, 5.x and recently also in the 6.x kernels. So, the code should be pretty mature (at least from a functionality and stability point of view).
We want bisection to be able to work all the time. This is a must.
That means that you should be able to split a series in arbitrary chunks.
If you take the first 15 patches, and end up with a kernel that breaks,
then something is wrong.
Make sure to CC edumazet@google.com next time.
Thank you.
> Koen.
>
> -----Original Message-----
> From: Paolo Abeni <pabeni@redhat.com>
> Sent: Tuesday, October 15, 2024 12:51 PM
> To: Chia-Yu Chang (Nokia) <chia-yu.chang@nokia-bell-labs.com>; netdev@vger.kernel.org; ij@kernel.org; ncardwell@google.com; Koen De Schepper (Nokia) <koen.de_schepper@nokia-bell-labs.com>; g.white@CableLabs.com; ingemar.s.johansson@ericsson.com; mirja.kuehlewind@ericsson.com; cheshire@apple.com; rs.ietf@gmx.at; Jason_Livingood@comcast.com; vidhi_goel@apple.com
> Subject: Re: [PATCH net-next 00/44] DualPI2, Accurate ECN, TCP Prague patch series
>
>
>
>
>
> On 10/15/24 12:28, chia-yu.chang@nokia-bell-labs.com wrote:
>> From: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
>>
>> Hello,
>>
>> Please find the enclosed patch series covering the L4S (Low Latency,
>> Low Loss, and Scalable Throughput) as outlined in IETF RFC9330:
>> https://datatracker.ietf.org/doc/html/rfc9330
>>
>> * 1 patch for DualPI2 (cf. IETF RFC9332
>> https://datatracker.ietf.org/doc/html/rfc9332)
>> * 40 pataches for Accurate ECN (It implements the AccECN protocol
>> in terms of negotiation, feedback, and compliance requirements:
>>
>> https://datatracker.ietf.org/doc/html/draft-ietf-tcpm-accurate-ecn-28)
>> * 3 patches for TCP Prague (It implements the performance and safety
>> requirements listed in Appendix A of IETF RFC9331:
>> https://datatracker.ietf.org/doc/html/rfc9331)
>>
>> Best regagrds,
>> Chia-Yu
> I haven't looked into the series yet, and I doubt I'll be able to do that anytime soon, but you must have a good read of the netdev process before any other action, specifically:
>
> https://elixir.bootlin.com/linux/v6.11.3/source/Documentation/process/maintainer-netdev.rst#L351
>
> and
>
> https://elixir.bootlin.com/linux/v6.11.3/source/Documentation/process/maintainer-netdev.rst#L15
>
> Just to be clear: splitting the series into 3 and posting all of them together will not be good either.
>
> Thanks,
>
> Paolo
>
>
^ permalink raw reply [flat|nested] 56+ messages in thread
* RE: [PATCH net-next 00/44] DualPI2, Accurate ECN, TCP Prague patch series
2024-10-15 17:52 ` Eric Dumazet
@ 2024-10-15 19:30 ` Chia-Yu Chang (Nokia)
0 siblings, 0 replies; 56+ messages in thread
From: Chia-Yu Chang (Nokia) @ 2024-10-15 19:30 UTC (permalink / raw)
To: Eric Dumazet, Koen De Schepper (Nokia), Paolo Abeni,
netdev@vger.kernel.org, ij@kernel.org, ncardwell@google.com,
g.white@CableLabs.com, ingemar.s.johansson@ericsson.com,
mirja.kuehlewind@ericsson.com, cheshire@apple.com, rs.ietf@gmx.at,
Jason_Livingood@comcast.com, vidhi_goel@apple.com,
edumazet@google.com
We will split it into several chunks to follow this guideline and make sure Eric is CC'ed.
Thanks.
Chia-Yu
-----Original Message-----
From: Eric Dumazet <eric.dumazet@gmail.com>
Sent: Tuesday, October 15, 2024 7:53 PM
To: Koen De Schepper (Nokia) <koen.de_schepper@nokia-bell-labs.com>; Paolo Abeni <pabeni@redhat.com>; Chia-Yu Chang (Nokia) <chia-yu.chang@nokia-bell-labs.com>; netdev@vger.kernel.org; ij@kernel.org; ncardwell@google.com; g.white@CableLabs.com; ingemar.s.johansson@ericsson.com; mirja.kuehlewind@ericsson.com; cheshire@apple.com; rs.ietf@gmx.at; Jason_Livingood@comcast.com; vidhi_goel@apple.com; edumazet@google.com
Subject: Re: [PATCH net-next 00/44] DualPI2, Accurate ECN, TCP Prague patch series
On 10/15/24 5:14 PM, Koen De Schepper (Nokia) wrote:
> We had several internal review rounds that specifically made sure it is in line with the processes/guidelines you are referring to.
>
> DualPI2 and TCP-Prague are new modules, mostly in a separate file. ACC_ECN unfortunately involves quite a few changes in different files with different functionality; these were split into manageable smaller incremental chunks according to the guidelines, ending up in 40 patches. The good thing is that they are small and should be easily processable. It could be split into these 3 features, but that would still involve all the ACC_ECN, preferably as one patch set. On top of that, the 3 TCP-Prague patches rely on the 40 ACC_ECN, so preferably we keep them together too...
>
> The 3 functions are used and tested in many kernels. Initial development started from 3.16 to 4.x, 5.x and recently also in the 6.x kernels. So, the code should be pretty mature (at least from a functionality and stability point of view).
We want bisection to be able to work all the time. This is a must.
That means that you should be able to split a series in arbitrary chunks.
If you take the first 15 patches, and end up with a kernel that breaks, then something is wrong.
Make sure to CC edumazet@google.com next time.
Thank you.
> Koen.
>
> -----Original Message-----
> From: Paolo Abeni <pabeni@redhat.com>
> Sent: Tuesday, October 15, 2024 12:51 PM
> To: Chia-Yu Chang (Nokia) <chia-yu.chang@nokia-bell-labs.com>;
> netdev@vger.kernel.org; ij@kernel.org; ncardwell@google.com; Koen De
> Schepper (Nokia) <koen.de_schepper@nokia-bell-labs.com>;
> g.white@CableLabs.com; ingemar.s.johansson@ericsson.com;
> mirja.kuehlewind@ericsson.com; cheshire@apple.com; rs.ietf@gmx.at;
> Jason_Livingood@comcast.com; vidhi_goel@apple.com
> Subject: Re: [PATCH net-next 00/44] DualPI2, Accurate ECN, TCP Prague
> patch series
>
>
>
>
>
> On 10/15/24 12:28, chia-yu.chang@nokia-bell-labs.com wrote:
>> From: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
>>
>> Hello,
>>
>> Please find the enclosed patch series covering the L4S (Low Latency,
>> Low Loss, and Scalable Throughput) as outlined in IETF RFC9330:
>> https://datatracker.ietf.org/doc/html/rfc9330
>>
>> * 1 patch for DualPI2 (cf. IETF RFC9332
>> https://datatracker.ietf.org/doc/html/rfc9332)
>> * 40 pataches for Accurate ECN (It implements the AccECN protocol
>> in terms of negotiation, feedback, and compliance requirements:
>>
>> https://datatracker.ietf.org/doc/html/draft-ietf-tcpm-accurate-ecn-28
>> )
>> * 3 patches for TCP Prague (It implements the performance and safety
>> requirements listed in Appendix A of IETF RFC9331:
>> https://datatracker.ietf.org/doc/html/rfc9331)
>>
>> Best regagrds,
>> Chia-Yu
> I haven't looked into the series yet, and I doubt I'll be able to do that anytime soon, but you must have a good read of the netdev process before any other action, specifically:
>
> https://elixir.bootlin.com/linux/v6.11.3/source/Documentation/process/maintainer-netdev.rst#L351
>
> and
>
> https://elixir.bootlin.com/linux/v6.11.3/source/Documentation/process/maintainer-netdev.rst#L15
>
> Just to be clear: splitting the series into 3 and posting all of them together will not be good either.
>
> Thanks,
>
> Paolo
>
>
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH net-next 17/44] tcp: accecn: AccECN negotiation
2024-10-15 10:29 ` [PATCH net-next 17/44] tcp: accecn: AccECN negotiation chia-yu.chang
@ 2024-10-15 19:49 ` Ilpo Järvinen
2024-10-15 20:25 ` Chia-Yu Chang (Nokia)
0 siblings, 1 reply; 56+ messages in thread
From: Ilpo Järvinen @ 2024-10-15 19:49 UTC (permalink / raw)
To: Chia-Yu Chang
Cc: netdev, ncardwell, koen.de_schepper, g.white, ingemar.s.johansson,
mirja.kuehlewind, cheshire, rs.ietf, Jason_Livingood, vidhi_goel,
Olivier Tilmans
On Tue, 15 Oct 2024, chia-yu.chang@nokia-bell-labs.com wrote:
> From: Ilpo Järvinen <ij@kernel.org>
>
> Accurate ECN negotiation parts based on the specification:
> https://tools.ietf.org/id/draft-ietf-tcpm-accurate-ecn-28.txt
>
> Accurate ECN is negotiated using ECE, CWR and AE flags in the
> TCP header. TCP falls back into using RFC3168 ECN if one of the
> ends supports only RFC3168-style ECN.
>
> The AccECN negotiation includes reflecting IP ECN field value
> seen in SYN and SYNACK back using the same bits as negotiation
> to allow responding to SYN CE marks and to detect ECN field
> mangling. CE marks should not occur currently because SYN=1
> segments are sent with Non-ECT in IP ECN field (but proposal
> exists to remove this restriction).
>
> Reflecting SYN IP ECN field in SYNACK is relatively simple.
> Reflecting SYNACK IP ECN field in the final/third ACK of
> the handshake is more challenging. Linux TCP code is not well
> prepared for using the final/third ACK as a signalling channel
> which makes things somewhat complicated here.
>
> Co-developed-by: Olivier Tilmans <olivier.tilmans@nokia.com>
> Signed-off-by: Olivier Tilmans <olivier.tilmans@nokia.com>
> Signed-off-by: Ilpo Järvinen <ij@kernel.org>
> Co-developed-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
> Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
> ---
> include/linux/tcp.h | 9 ++-
> include/net/tcp.h | 80 +++++++++++++++++++-
> net/ipv4/syncookies.c | 3 +
> net/ipv4/sysctl_net_ipv4.c | 2 +-
> net/ipv4/tcp.c | 2 +
> net/ipv4/tcp_input.c | 149 +++++++++++++++++++++++++++++++++----
> net/ipv4/tcp_ipv4.c | 3 +-
> net/ipv4/tcp_minisocks.c | 51 +++++++++++--
> net/ipv4/tcp_output.c | 77 +++++++++++++++----
> net/ipv6/syncookies.c | 1 +
> net/ipv6/tcp_ipv6.c | 1 +
> 11 files changed, 336 insertions(+), 42 deletions(-)
>
> @@ -6358,6 +6446,13 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb)
> return;
>
> step5:
> + if (unlikely(tp->wait_third_ack)) {
> + if (!tcp_ecn_disabled(tp))
> + tp->wait_third_ack = 0;
I don't think the !tcp_ecn_disabled(tp) condition is necessary, and it is
harmful (I think I tried to explain this earlier but it seems there was a
misunderstanding).
A third ACK is a third ACK regardless of ECN mode, and this entire code
block should be skipped on subsequent ACKs after the third ACK. By adding
that ECN mode condition, ->wait_third_ack cannot be set to zero if ECN
mode gets disabled, which is harmful because then this code can never be
skipped.
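To make the suggestion concrete, a sketch of the hunk with the condition dropped (untested, shown only to illustrate the point):

step5:
	if (unlikely(tp->wait_third_ack)) {
		/* The third ACK is the third ACK no matter which ECN mode was
		 * eventually negotiated, so always stop waiting here.
		 */
		tp->wait_third_ack = 0;
		/* ... rest of the block as in the patch ... */
	}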
--
i.
^ permalink raw reply [flat|nested] 56+ messages in thread
* RE: [PATCH net-next 17/44] tcp: accecn: AccECN negotiation
2024-10-15 19:49 ` Ilpo Järvinen
@ 2024-10-15 20:25 ` Chia-Yu Chang (Nokia)
2024-10-15 20:31 ` Ilpo Järvinen
0 siblings, 1 reply; 56+ messages in thread
From: Chia-Yu Chang (Nokia) @ 2024-10-15 20:25 UTC (permalink / raw)
To: Ilpo Järvinen
Cc: netdev@vger.kernel.org, ncardwell@google.com,
Koen De Schepper (Nokia), g.white@CableLabs.com,
ingemar.s.johansson@ericsson.com, mirja.kuehlewind@ericsson.com,
cheshire@apple.com, rs.ietf@gmx.at, Jason_Livingood@comcast.com,
vidhi_goel@apple.com, Olivier Tilmans (Nokia)
-----Original Message-----
From: Ilpo Järvinen <ij@kernel.org>
Sent: Tuesday, October 15, 2024 9:50 PM
To: Chia-Yu Chang (Nokia) <chia-yu.chang@nokia-bell-labs.com>
Cc: netdev@vger.kernel.org; ncardwell@google.com; Koen De Schepper (Nokia) <koen.de_schepper@nokia-bell-labs.com>; g.white@CableLabs.com; ingemar.s.johansson@ericsson.com; mirja.kuehlewind@ericsson.com; cheshire@apple.com; rs.ietf@gmx.at; Jason_Livingood@comcast.com; vidhi_goel@apple.com; Olivier Tilmans (Nokia) <olivier.tilmans@nokia.com>
Subject: Re: [PATCH net-next 17/44] tcp: accecn: AccECN negotiation
On Tue, 15 Oct 2024, chia-yu.chang@nokia-bell-labs.com wrote:
> From: Ilpo Järvinen <ij@kernel.org>
>
> Accurate ECN negotiation parts based on the specification:
> https://tools.ietf.org/id/draft-ietf-tcpm-accurate-ecn-28.txt
>
> Accurate ECN is negotiated using ECE, CWR and AE flags in the TCP
> header. TCP falls back into using RFC3168 ECN if one of the ends
> supports only RFC3168-style ECN.
>
> The AccECN negotiation includes reflecting IP ECN field value seen in
> SYN and SYNACK back using the same bits as negotiation to allow
> responding to SYN CE marks and to detect ECN field mangling. CE marks
> should not occur currently because SYN=1 segments are sent with
> Non-ECT in IP ECN field (but proposal exists to remove this
> restriction).
>
> Reflecting SYN IP ECN field in SYNACK is relatively simple.
> Reflecting SYNACK IP ECN field in the final/third ACK of the handshake
> is more challenging. Linux TCP code is not well prepared for using the
> final/third ACK as a signalling channel which makes things somewhat
> complicated here.
>
> Co-developed-by: Olivier Tilmans <olivier.tilmans@nokia.com>
> Signed-off-by: Olivier Tilmans <olivier.tilmans@nokia.com>
> Signed-off-by: Ilpo Järvinen <ij@kernel.org>
> Co-developed-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
> Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
> ---
> include/linux/tcp.h | 9 ++-
> include/net/tcp.h | 80 +++++++++++++++++++-
> net/ipv4/syncookies.c | 3 +
> net/ipv4/sysctl_net_ipv4.c | 2 +-
> net/ipv4/tcp.c | 2 +
> net/ipv4/tcp_input.c | 149 +++++++++++++++++++++++++++++++++----
> net/ipv4/tcp_ipv4.c | 3 +-
> net/ipv4/tcp_minisocks.c | 51 +++++++++++--
> net/ipv4/tcp_output.c | 77 +++++++++++++++----
> net/ipv6/syncookies.c | 1 +
> net/ipv6/tcp_ipv6.c | 1 +
> 11 files changed, 336 insertions(+), 42 deletions(-)
>
> @@ -6358,6 +6446,13 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb)
> return;
>
> step5:
> + if (unlikely(tp->wait_third_ack)) {
> + if (!tcp_ecn_disabled(tp))
> + tp->wait_third_ack = 0;
I don't think the !tcp_ecn_disabled(tp) condition is necessary, and it is harmful (I think I tried to explain this earlier but it seems there was a misunderstanding).
A third ACK is a third ACK regardless of ECN mode, and this entire code block should be skipped on subsequent ACKs after the third ACK. By adding that ECN mode condition, ->wait_third_ack cannot be set to zero if ECN mode gets disabled, which is harmful because then this code can never be skipped.
--
i.
If you look, the only place I set this flag to 1 is under the same condition, if (!tcp_ecn_disabled(tp)); the original idea is to make it symmetric when setting it back to 0.
Of course it might create a problem if in the future we change the condition under which this flag is set to TRUE; then we would also need to change where it is set back to FALSE. But if this is confusing, I can remove this if condition in the next patches.
Chia-Yu
^ permalink raw reply [flat|nested] 56+ messages in thread
* RE: [PATCH net-next 17/44] tcp: accecn: AccECN negotiation
2024-10-15 20:25 ` Chia-Yu Chang (Nokia)
@ 2024-10-15 20:31 ` Ilpo Järvinen
0 siblings, 0 replies; 56+ messages in thread
From: Ilpo Järvinen @ 2024-10-15 20:31 UTC (permalink / raw)
To: Chia-Yu Chang (Nokia)
Cc: netdev@vger.kernel.org, ncardwell@google.com,
Koen De Schepper (Nokia), g.white@CableLabs.com,
ingemar.s.johansson@ericsson.com, mirja.kuehlewind@ericsson.com,
cheshire@apple.com, rs.ietf@gmx.at, Jason_Livingood@comcast.com,
vidhi_goel@apple.com, Olivier Tilmans (Nokia)
On Tue, 15 Oct 2024, Chia-Yu Chang (Nokia) wrote:
> -----Original Message-----
> From: Ilpo Järvinen <ij@kernel.org>
> Sent: Tuesday, October 15, 2024 9:50 PM
> To: Chia-Yu Chang (Nokia) <chia-yu.chang@nokia-bell-labs.com>
> Cc: netdev@vger.kernel.org; ncardwell@google.com; Koen De Schepper (Nokia) <koen.de_schepper@nokia-bell-labs.com>; g.white@CableLabs.com; ingemar.s.johansson@ericsson.com; mirja.kuehlewind@ericsson.com; cheshire@apple.com; rs.ietf@gmx.at; Jason_Livingood@comcast.com; vidhi_goel@apple.com; Olivier Tilmans (Nokia) <olivier.tilmans@nokia.com>
> Subject: Re: [PATCH net-next 17/44] tcp: accecn: AccECN negotiation
>
>
>
>
>
> On Tue, 15 Oct 2024, chia-yu.chang@nokia-bell-labs.com wrote:
>
> > From: Ilpo Järvinen <ij@kernel.org>
> >
> > Accurate ECN negotiation parts based on the specification:
> > https://tools.ietf.org/id/draft-ietf-tcpm-accurate-ecn-28.txt
> >
> > Accurate ECN is negotiated using ECE, CWR and AE flags in the TCP
> > header. TCP falls back into using RFC3168 ECN if one of the ends
> > supports only RFC3168-style ECN.
> >
> > The AccECN negotiation includes reflecting IP ECN field value seen in
> > SYN and SYNACK back using the same bits as negotiation to allow
> > responding to SYN CE marks and to detect ECN field mangling. CE marks
> > should not occur currently because SYN=1 segments are sent with
> > Non-ECT in IP ECN field (but proposal exists to remove this
> > restriction).
> >
> > Reflecting SYN IP ECN field in SYNACK is relatively simple.
> > Reflecting SYNACK IP ECN field in the final/third ACK of the handshake
> > is more challenging. Linux TCP code is not well prepared for using the
> > final/third ACK as a signalling channel which makes things somewhat
> > complicated here.
> >
> > Co-developed-by: Olivier Tilmans <olivier.tilmans@nokia.com>
> > Signed-off-by: Olivier Tilmans <olivier.tilmans@nokia.com>
> > Signed-off-by: Ilpo Järvinen <ij@kernel.org>
> > Co-developed-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
> > Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
> > ---
> > include/linux/tcp.h | 9 ++-
> > include/net/tcp.h | 80 +++++++++++++++++++-
> > net/ipv4/syncookies.c | 3 +
> > net/ipv4/sysctl_net_ipv4.c | 2 +-
> > net/ipv4/tcp.c | 2 +
> > net/ipv4/tcp_input.c | 149 +++++++++++++++++++++++++++++++++----
> > net/ipv4/tcp_ipv4.c | 3 +-
> > net/ipv4/tcp_minisocks.c | 51 +++++++++++--
> > net/ipv4/tcp_output.c | 77 +++++++++++++++----
> > net/ipv6/syncookies.c | 1 +
> > net/ipv6/tcp_ipv6.c | 1 +
> > 11 files changed, 336 insertions(+), 42 deletions(-)
> >
>
> > @@ -6358,6 +6446,13 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb)
> > return;
> >
> > step5:
> > + if (unlikely(tp->wait_third_ack)) {
> > + if (!tcp_ecn_disabled(tp))
> > + tp->wait_third_ack = 0;
>
> I don't think the !tcp_ecn_disabled(tp) condition is necessary, and it is harmful (I think I tried to explain this earlier but it seems there was a misunderstanding).
>
> A third ACK is a third ACK regardless of ECN mode, and this entire code block should be skipped on subsequent ACKs after the third ACK. By adding that ECN mode condition, ->wait_third_ack cannot be set to zero if ECN mode gets disabled, which is harmful because then this code can never be skipped.
>
> --
> i.
>
> If you look, the only place I set this flag to 1 is under the same
> condition, if (!tcp_ecn_disabled(tp)); the original idea is to make it
> symmetric when setting it back to 0.
> Of course it might create a problem if in the future we change the
> condition under which this flag is set to TRUE; then we would also need
> to change where it is set back to FALSE. But if this is confusing, I can
> remove this if condition in the next patches.
My point is that something can change ECN mode in between, so the symmetry
argument doesn't really work here. You want to make sure wait_third_ack
won't remain set if ECN got disabled before we reach this line.
--
i.
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH net-next 09/44] gso: AccECN support
2024-10-15 10:29 ` [PATCH net-next 09/44] gso: AccECN support chia-yu.chang
@ 2024-10-16 1:31 ` Jakub Kicinski
0 siblings, 0 replies; 56+ messages in thread
From: Jakub Kicinski @ 2024-10-16 1:31 UTC (permalink / raw)
To: chia-yu.chang
Cc: netdev, ij, ncardwell, koen.de_schepper, g.white,
ingemar.s.johansson, mirja.kuehlewind, cheshire, rs.ietf,
Jason_Livingood, vidhi_goel
On Tue, 15 Oct 2024 12:29:05 +0200 chia-yu.chang@nokia-bell-labs.com
wrote:
> From: Ilpo Järvinen <ij@kernel.org>
>
> Handling the CWR flag differs between RFC 3168 ECN and AccECN.
> With RFC 3168 ECN aware TSO (NETIF_F_TSO_ECN) CWR flag is cleared
> starting from 2nd segment which is incompatible how AccECN handles
> the CWR flag. Such super-segments are indicated by SKB_GSO_TCP_ECN.
> With AccECN, CWR flag (or more accurately, the ACE field that also
> includes ECE & AE flags) changes only when new packet(s) with CE
> mark arrives so the flag should not be changed within a super-skb.
> The new skb/feature flags are necessary to prevent such TSO engines
> corrupting AccECN ACE counters by clearing the CWR flag (if the
> CWR handling feature cannot be turned off).
>
> If NIC is completely unaware of RFC3168 ECN (doesn't support
> NETIF_F_TSO_ECN) or its TSO engine can be set to not touch CWR flag
> despite supporting also NETIF_F_TSO_ECN, TSO could be safely used
> with AccECN on such NIC. This should be evaluated per NIC basis
> (not done in this patch series for any NICs).
net/ethtool/common.c:52:35: warning: initializer overrides prior initialization of this subobject [-Winitializer-overrides]
52 | [NETIF_F_FCOE_CRC_BIT] = "tx-checksum-fcoe-crc",
| ^~~~~~~~~~~~~~~~~~~~~~
net/ethtool/common.c:35:30: note: previous initialization is here
35 | [NETIF_F_GSO_ACCECN_BIT] = "tx-tcp-accecn-segmentation",
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
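The warning means the new feature bit and NETIF_F_FCOE_CRC_BIT expand to the same array index, so the later designated initializer silently replaces the new string. A minimal standalone illustration of the mechanism (hypothetical names, not the kernel enum):

	enum { BIT_NEW = 3, BIT_OLD = 3 };	/* accidental value collision */

	static const char *const names[] = {
		[BIT_NEW] = "tx-new-feature",	/* silently overridden below */
		[BIT_OLD] = "tx-old-feature",	/* -Winitializer-overrides fires here */
	};

The fix is to give the new bit a value of its own so each string lands in a distinct slot.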
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH net-next 22/44] tcp: accecn: AccECN option
2024-10-15 10:29 ` [PATCH net-next 22/44] tcp: accecn: AccECN option chia-yu.chang
@ 2024-10-16 1:32 ` Jakub Kicinski
0 siblings, 0 replies; 56+ messages in thread
From: Jakub Kicinski @ 2024-10-16 1:32 UTC (permalink / raw)
To: chia-yu.chang
Cc: netdev, ij, ncardwell, koen.de_schepper, g.white,
ingemar.s.johansson, mirja.kuehlewind, cheshire, rs.ietf,
Jason_Livingood, vidhi_goel
On Tue, 15 Oct 2024 12:29:18 +0200 chia-yu.chang@nokia-bell-labs.com
wrote:
> From: Ilpo Järvinen <ij@kernel.org>
>
> The Accurate ECN allows echoing back the sum of bytes for
> each IP ECN field value in the received packets using
> AccECN option. This change implements AccECN option tx & rx
> side processing without option send control related features
> that are added by a later change.
>
> Based on specification:
> https://tools.ietf.org/id/draft-ietf-tcpm-accurate-ecn-28.txt
> (Some features of the spec will be added in the later changes
> rather than in this one).
>
> A full-length AccECN option is always attempted but if it does
> not fit, the minimum length is selected based on the counters
> that have changed since the last update. The AccECN option
> (with 24-bit fields) often ends in odd sizes so the option
> write code tries to take advantage of some nop used to pad
> the other TCP options.
>
> The delivered_ecn_bytes pairs with received_ecn_bytes similar
> to how delivered_ce pairs with received_ce. In contrast to
> ACE field, however, the option is not always available to update
> delivered_ecn_bytes. For ACK w/o AccECN option, the delivered
> bytes calculated based on the cumulative ACK+SACK information
> are assigned to one of the counters using an estimation
> heuristic to select the most likely ECN byte counter. Any
> estimation error is corrected when the next AccECN option
> arrives. It may occur that the heuristic gets too confused
> when there are enough different byte counter deltas between
> ACKs with the AccECN option in which case the heuristic just
> gives up on updating the counters for a while.
net/ipv4/tcp_output.c:922:5: warning: symbol 'synack_ecn_bytes' was not declared. Should it be static?
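This sparse warning just means the symbol has external linkage but no declaration is visible in a header. If the table is only used inside tcp_output.c, the conventional fix is internal linkage; a hypothetical sketch (the real element type, size, and contents are whatever the patch defines):

	static const u32 synack_ecn_bytes[3] = { 0, 0, 0 };

Otherwise a declaration belongs in a shared header such as include/net/tcp.h.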
^ permalink raw reply [flat|nested] 56+ messages in thread
end of thread, other threads:[~2024-10-16 1:32 UTC | newest]
Thread overview: 56+ messages
2024-10-15 10:28 [PATCH net-next 00/44] DualPI2, Accurate ECN, TCP Prague patch series chia-yu.chang
2024-10-15 10:28 ` [PATCH net-next 01/44] sched: Add dualpi2 qdisc chia-yu.chang
2024-10-15 15:30 ` Jamal Hadi Salim
2024-10-15 15:40 ` Jakub Kicinski
2024-10-15 10:28 ` [PATCH net-next 02/44] tcp: reorganize tcp_in_ack_event() and tcp_count_delivered() chia-yu.chang
2024-10-15 10:28 ` [PATCH net-next 03/44] tcp: create FLAG_TS_PROGRESS chia-yu.chang
2024-10-15 10:29 ` [PATCH net-next 04/44] tcp: use BIT() macro in include/net/tcp.h chia-yu.chang
2024-10-15 10:29 ` [PATCH net-next 05/44] tcp: extend TCP flags to allow AE bit/ACE field chia-yu.chang
2024-10-15 10:29 ` [PATCH net-next 06/44] tcp: reorganize SYN ECN code chia-yu.chang
2024-10-15 10:29 ` [PATCH net-next 07/44] tcp: rework {__,}tcp_ecn_check_ce() -> tcp_data_ecn_check() chia-yu.chang
2024-10-15 10:29 ` [PATCH net-next 08/44] tcp: helpers for ECN mode handling chia-yu.chang
2024-10-15 10:29 ` [PATCH net-next 09/44] gso: AccECN support chia-yu.chang
2024-10-16 1:31 ` Jakub Kicinski
2024-10-15 10:29 ` [PATCH net-next 10/44] gro: prevent ACE field corruption & better AccECN handling chia-yu.chang
2024-10-15 10:29 ` [PATCH net-next 11/44] tcp: AccECN support to tcp_add_backlog chia-yu.chang
2024-10-15 10:29 ` [PATCH net-next 12/44] tcp: allow ECN bits in TOS/traffic class chia-yu.chang
2024-10-15 10:29 ` [PATCH net-next 13/44] tcp: Pass flags to __tcp_send_ack chia-yu.chang
2024-10-15 10:29 ` [PATCH net-next 14/44] tcp: fast path functions later chia-yu.chang
2024-10-15 10:29 ` [PATCH net-next 15/44] tcp: AccECN core chia-yu.chang
2024-10-15 10:29 ` [PATCH net-next 16/44] net: sysctl: introduce sysctl SYSCTL_FIVE chia-yu.chang
2024-10-15 10:29 ` [PATCH net-next 17/44] tcp: accecn: AccECN negotiation chia-yu.chang
2024-10-15 19:49 ` Ilpo Järvinen
2024-10-15 20:25 ` Chia-Yu Chang (Nokia)
2024-10-15 20:31 ` Ilpo Järvinen
2024-10-15 10:29 ` [PATCH net-next 18/44] tcp: accecn: add AccECN rx byte counters chia-yu.chang
2024-10-15 10:29 ` [PATCH net-next 19/44] tcp: allow embedding leftover into option padding chia-yu.chang
2024-10-15 10:29 ` [PATCH net-next 20/44] tcp: accecn: AccECN needs to know delivered bytes chia-yu.chang
2024-10-15 10:29 ` [PATCH net-next 21/44] tcp: sack option handling improvements chia-yu.chang
2024-10-15 10:29 ` [PATCH net-next 22/44] tcp: accecn: AccECN option chia-yu.chang
2024-10-16 1:32 ` Jakub Kicinski
2024-10-15 10:29 ` [PATCH net-next 23/44] tcp: accecn: AccECN option send control chia-yu.chang
2024-10-15 10:29 ` [PATCH net-next 24/44] tcp: accecn: AccECN option failure handling chia-yu.chang
2024-10-15 10:29 ` [PATCH net-next 25/44] tcp: accecn: AccECN option ceb/cep heuristic chia-yu.chang
2024-10-15 10:29 ` [PATCH net-next 26/44] tcp: accecn: AccECN ACE field multi-wrap heuristic chia-yu.chang
2024-10-15 10:29 ` [PATCH net-next 27/44] tcp: accecn: try to fit AccECN option with SACK chia-yu.chang
2024-10-15 10:29 ` [PATCH net-next 28/44] tcp: try to avoid safer when ACKs are thinned chia-yu.chang
2024-10-15 10:29 ` [PATCH net-next 29/44] gro: flushing when CWR is set negatively affects AccECN chia-yu.chang
2024-10-15 10:29 ` [PATCH net-next 30/44] tcp: accecn: Add ece_delta to rate_sample chia-yu.chang
2024-10-15 10:29 ` [PATCH net-next 31/44] tcp: L4S ECT(1) identifier for CC modules chia-yu.chang
2024-10-15 10:29 ` [PATCH net-next 32/44] tcp: disable RFC3168 fallback " chia-yu.chang
2024-10-15 10:29 ` [PATCH net-next 33/44] tcp: accecn: handle unexpected AccECN negotiation feedback chia-yu.chang
2024-10-15 10:29 ` [PATCH net-next 34/44] tcp: accecn: retransmit downgraded SYN in AccECN negotiation chia-yu.chang
2024-10-15 10:29 ` [PATCH net-next 35/44] tcp: move increment of num_retrans chia-yu.chang
2024-10-15 10:29 ` [PATCH net-next 36/44] tcp: accecn: retransmit SYN/ACK without AccECN option or non-AccECN SYN/ACK chia-yu.chang
2024-10-15 10:29 ` [PATCH net-next 37/44] tcp: accecn: unset ECT if receive or send ACE=0 in AccECN negotiaion chia-yu.chang
2024-10-15 10:29 ` [PATCH net-next 38/44] tcp: accecn: fallback outgoing half link to non-AccECN chia-yu.chang
2024-10-15 10:29 ` [PATCH net-next 39/44] tcp: accecn: verify ACE counter in 1st ACK after AccECN negotiation chia-yu.chang
2024-10-15 10:29 ` [PATCH net-next 40/44] tcp: accecn: stop sending AccECN option when loss ACK with AccECN option chia-yu.chang
2024-10-15 10:29 ` [PATCH net-next 41/44] Documentation: networking: Update ECN related sysctls chia-yu.chang
2024-10-15 10:29 ` [PATCH net-next 42/44] tcp: Add tso_segs() CC callback for TCP Prague chia-yu.chang
2024-10-15 10:29 ` [PATCH net-next 43/44] tcp: Add mss_cache_set_by_ca for CC algorithm to set MSS chia-yu.chang
2024-10-15 10:29 ` [PATCH net-next 44/44] tcp: Add the TCP Prague congestion control module chia-yu.chang
2024-10-15 10:51 ` [PATCH net-next 00/44] DualPI2, Accurate ECN, TCP Prague patch series Paolo Abeni
2024-10-15 15:14 ` Koen De Schepper (Nokia)
2024-10-15 17:52 ` Eric Dumazet
2024-10-15 19:30 ` Chia-Yu Chang (Nokia)