netdev.vger.kernel.org archive mirror
* [PATCH net-next 0/6] net: nexthop: Increase weight to u16
@ 2024-08-01 16:23 Petr Machata
  2024-08-01 16:23 ` [PATCH net-next 1/6] net: nexthop: Add flag to assert that NHGRP reserved fields are zero Petr Machata
                   ` (7 more replies)
  0 siblings, 8 replies; 15+ messages in thread
From: Petr Machata @ 2024-08-01 16:23 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	netdev
  Cc: Ido Schimmel, Petr Machata, David Ahern, Donald Sharp, mlxsw

In CLOS networks, as link failures occur at various points in the network,
ECMP weights of the involved nodes are adjusted to compensate. With high
fan-out of the involved nodes, and overall high number of nodes,
a (non-)ECMP weight ratio that we would like to configure does not fit into
8 bits. Instead of, say, 255:254, we might like to configure something like
1000:999. For these deployments, the 8-bit weight may not be enough.

To that end, in this patchset increase the next hop weight from u8 to u16.

Patch #1 adds a flag that indicates whether the reserved fields are zeroed.
This is a follow-up to a recent fix merged in commit 6d745cd0e972 ("net:
nexthop: Initialize all fields in dumped nexthops"). The idea behind this
patch is that there is a strict ordering between the fields actually being
zeroed, the kernel declaring that they are, and the kernel repurposing the
fields. Clients can thus use the flag to tell whether it is safe to
interpret the reserved fields at all.

Patch #2 contains the substantive code changes; its commit message covers
the details.

Patches #3 to #6 add selftests.

Petr Machata (6):
  net: nexthop: Add flag to assert that NHGRP reserved fields are zero
  net: nexthop: Increase weight to u16
  selftests: router_mpath: Sleep after MZ
  selftests: router_mpath_nh: Test 16-bit next hop weights
  selftests: router_mpath_nh_res: Test 16-bit next hop weights
  selftests: fib_nexthops: Test 16-bit next hop weights

 include/net/nexthop.h                         |  4 +-
 include/uapi/linux/nexthop.h                  | 10 +++-
 net/ipv4/nexthop.c                            | 49 ++++++++++------
 tools/testing/selftests/net/fib_nexthops.sh   | 55 +++++++++++++++++-
 tools/testing/selftests/net/forwarding/lib.sh |  7 +++
 .../net/forwarding/router_mpath_nh.sh         | 40 ++++++++++---
 .../net/forwarding/router_mpath_nh_lib.sh     | 13 +++++
 .../net/forwarding/router_mpath_nh_res.sh     | 58 ++++++++++++++++---
 .../net/forwarding/router_multipath.sh        |  2 +
 9 files changed, 201 insertions(+), 37 deletions(-)

-- 
2.45.0


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH net-next 1/6] net: nexthop: Add flag to assert that NHGRP reserved fields are zero
  2024-08-01 16:23 [PATCH net-next 0/6] net: nexthop: Increase weight to u16 Petr Machata
@ 2024-08-01 16:23 ` Petr Machata
  2024-08-05 22:23   ` Jakub Kicinski
  2024-08-01 16:23 ` [PATCH net-next 2/6] net: nexthop: Increase weight to u16 Petr Machata
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 15+ messages in thread
From: Petr Machata @ 2024-08-01 16:23 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	netdev
  Cc: Ido Schimmel, Petr Machata, David Ahern, Donald Sharp, mlxsw

Many unpatched kernel versions out there do not initialize the reserved
fields of struct nexthop_grp. The problem with that is that if those fields
were ever put to use (i.e. stopped being reserved), old kernels would keep
sending random data through them, and new userspace could not rely on their
value.

In this patch, use the existing NHA_OP_FLAGS, which is currently inbound
only, to carry flags back to userspace. Add a flag to indicate that the
reserved fields in struct nexthop_grp are zeroed before dumping. This is
reliant on the actual fix from commit 6d745cd0e972 ("net: nexthop:
Initialize all fields in dumped nexthops").

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
---
 include/uapi/linux/nexthop.h |  3 +++
 net/ipv4/nexthop.c           | 12 +++++++++---
 2 files changed, 12 insertions(+), 3 deletions(-)

diff --git a/include/uapi/linux/nexthop.h b/include/uapi/linux/nexthop.h
index dd8787f9cf39..2ed643207847 100644
--- a/include/uapi/linux/nexthop.h
+++ b/include/uapi/linux/nexthop.h
@@ -33,6 +33,9 @@ enum {
 #define NHA_OP_FLAG_DUMP_STATS		BIT(0)
 #define NHA_OP_FLAG_DUMP_HW_STATS	BIT(1)
 
+/* Response OP_FLAGS. */
+#define NHA_OP_FLAG_RESP_GRP_RESVD_0	BIT(0)
+
 enum {
 	NHA_UNSPEC,
 	NHA_ID,		/* u32; id for nexthop. id == 0 means auto-assign */
diff --git a/net/ipv4/nexthop.c b/net/ipv4/nexthop.c
index 6b9787ee8601..23caa13bf24d 100644
--- a/net/ipv4/nexthop.c
+++ b/net/ipv4/nexthop.c
@@ -865,7 +865,7 @@ static int nla_put_nh_group_stats(struct sk_buff *skb, struct nexthop *nh,
 }
 
 static int nla_put_nh_group(struct sk_buff *skb, struct nexthop *nh,
-			    u32 op_flags)
+			    u32 op_flags, u32 *resp_op_flags)
 {
 	struct nh_group *nhg = rtnl_dereference(nh->nh_grp);
 	struct nexthop_grp *p;
@@ -874,6 +874,8 @@ static int nla_put_nh_group(struct sk_buff *skb, struct nexthop *nh,
 	u16 group_type = 0;
 	int i;
 
+	*resp_op_flags |= NHA_OP_FLAG_RESP_GRP_RESVD_0;
+
 	if (nhg->hash_threshold)
 		group_type = NEXTHOP_GRP_TYPE_MPATH;
 	else if (nhg->resilient)
@@ -934,10 +936,12 @@ static int nh_fill_node(struct sk_buff *skb, struct nexthop *nh,
 
 	if (nh->is_group) {
 		struct nh_group *nhg = rtnl_dereference(nh->nh_grp);
+		u32 resp_op_flags = 0;
 
 		if (nhg->fdb_nh && nla_put_flag(skb, NHA_FDB))
 			goto nla_put_failure;
-		if (nla_put_nh_group(skb, nh, op_flags))
+		if (nla_put_nh_group(skb, nh, op_flags, &resp_op_flags) ||
+		    nla_put_u32(skb, NHA_OP_FLAGS, resp_op_flags))
 			goto nla_put_failure;
 		goto out;
 	}
@@ -1050,7 +1054,9 @@ static size_t nh_nlmsg_size(struct nexthop *nh)
 	sz += nla_total_size(4); /* NHA_ID */
 
 	if (nh->is_group)
-		sz += nh_nlmsg_size_grp(nh);
+		sz += nh_nlmsg_size_grp(nh) +
+		      nla_total_size(4) +	/* NHA_OP_FLAGS */
+		      0;
 	else
 		sz += nh_nlmsg_size_single(nh);
 
-- 
2.45.0



* [PATCH net-next 2/6] net: nexthop: Increase weight to u16
  2024-08-01 16:23 [PATCH net-next 0/6] net: nexthop: Increase weight to u16 Petr Machata
  2024-08-01 16:23 ` [PATCH net-next 1/6] net: nexthop: Add flag to assert that NHGRP reserved fields are zero Petr Machata
@ 2024-08-01 16:23 ` Petr Machata
  2024-08-01 19:39   ` Simon Horman
  2024-08-06 14:02   ` Przemek Kitszel
  2024-08-01 16:23 ` [PATCH net-next 3/6] selftests: router_mpath: Sleep after MZ Petr Machata
                   ` (5 subsequent siblings)
  7 siblings, 2 replies; 15+ messages in thread
From: Petr Machata @ 2024-08-01 16:23 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	netdev
  Cc: Ido Schimmel, Petr Machata, David Ahern, Donald Sharp, mlxsw

In CLOS networks, as link failures occur at various points in the network,
ECMP weights of the involved nodes are adjusted to compensate. With high
fan-out of the involved nodes, and overall high number of nodes,
a (non-)ECMP weight ratio that we would like to configure does not fit into
8 bits. Instead of, say, 255:254, we might like to configure something like
1000:999. For these deployments, the 8-bit weight may not be enough.

To that end, in this patch increase the next hop weight from u8 to u16.

Increasing the width of an integral type can be tricky, because while the
code still compiles, the types may not check out anymore, and numerical
errors come up. To prevent this, the conversion was done in two steps.
First the type was changed from u8 to a single-member structure, which
invalidated all uses of the field. This allowed going through them one by
one and auditing for type correctness. Then the structure was replaced with a
vanilla u16 again. This should ensure that no place was missed.

The UAPI for configuring nexthop group members is that an attribute
NHA_GROUP carries an array of struct nexthop_grp entries:

	struct nexthop_grp {
		__u32	id;	  /* nexthop id - must exist */
		__u8	weight;   /* weight of this nexthop */
		__u8	resvd1;
		__u16	resvd2;
	};

The field resvd1 is currently validated and required to be zero. We can
lift this requirement and carry high-order bits of the weight in the
reserved field:

	struct nexthop_grp {
		__u32	id;	  /* nexthop id - must exist */
		__u8	weight;   /* weight of this nexthop */
		__u8	weight_high;
		__u16	resvd2;
	};

Keeping the fields split this way was chosen in case an existing userspace
makes assumptions about the width of the weight field, and to sidestep any
endianes issues.

The weight field is currently encoded as the weight value minus one,
because weight of 0 is invalid. This same trick is impossible for the new
weight_high field, because zero must mean actual zero. With this in place:

- Old userspace is guaranteed to carry weight_high of 0, therefore
  configuring 8-bit weights as appropriate. When dumping nexthops with a
  16-bit weight, old userspace would only show the lower 8 bits. But
  configuring such nexthops implies the existence of userspace aware of the
  extension in the first place.

- New userspace talking to an old kernel will work as long as it only
  attempts to configure 8-bit weights, where the high-order bits are zero.
  An old kernel will bounce attempts at configuring >8-bit weights.

Renaming reserved fields as they are allocated for some purpose is commonly
done in Linux. Whoever touches a reserved field is doing so at their own
risk. nexthop_grp::resvd1 in particular is currently used by at least
strace; however, strace carries its own copy of the UAPI headers, and the
conversion should be trivial. A helper is provided for decoding the weight
out of the two fields. Forcing a conversion seems preferable to bending
over backwards and introducing anonymous unions or whatever.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
---
 include/net/nexthop.h        |  4 ++--
 include/uapi/linux/nexthop.h |  7 ++++++-
 net/ipv4/nexthop.c           | 37 ++++++++++++++++++++++--------------
 3 files changed, 31 insertions(+), 17 deletions(-)

diff --git a/include/net/nexthop.h b/include/net/nexthop.h
index 68463aebcc05..d9fb44e8b321 100644
--- a/include/net/nexthop.h
+++ b/include/net/nexthop.h
@@ -105,7 +105,7 @@ struct nh_grp_entry_stats {
 struct nh_grp_entry {
 	struct nexthop	*nh;
 	struct nh_grp_entry_stats __percpu	*stats;
-	u8		weight;
+	u16		weight;
 
 	union {
 		struct {
@@ -192,7 +192,7 @@ struct nh_notifier_single_info {
 };
 
 struct nh_notifier_grp_entry_info {
-	u8 weight;
+	u16 weight;
 	struct nh_notifier_single_info nh;
 };
 
diff --git a/include/uapi/linux/nexthop.h b/include/uapi/linux/nexthop.h
index 2ed643207847..3f869a8fc949 100644
--- a/include/uapi/linux/nexthop.h
+++ b/include/uapi/linux/nexthop.h
@@ -16,10 +16,15 @@ struct nhmsg {
 struct nexthop_grp {
 	__u32	id;	  /* nexthop id - must exist */
 	__u8	weight;   /* weight of this nexthop */
-	__u8	resvd1;
+	__u8	weight_high;	/* high order bits of weight */
 	__u16	resvd2;
 };
 
+static inline __u16 nexthop_grp_weight(const struct nexthop_grp *entry)
+{
+	return ((entry->weight_high << 8) | entry->weight) + 1;
+}
+
 enum {
 	NEXTHOP_GRP_TYPE_MPATH,  /* hash-threshold nexthop group
 				  * default type if not specified
diff --git a/net/ipv4/nexthop.c b/net/ipv4/nexthop.c
index 23caa13bf24d..9db10d409074 100644
--- a/net/ipv4/nexthop.c
+++ b/net/ipv4/nexthop.c
@@ -872,6 +872,7 @@ static int nla_put_nh_group(struct sk_buff *skb, struct nexthop *nh,
 	size_t len = nhg->num_nh * sizeof(*p);
 	struct nlattr *nla;
 	u16 group_type = 0;
+	u16 weight;
 	int i;
 
 	*resp_op_flags |= NHA_OP_FLAG_RESP_GRP_RESVD_0;
@@ -890,9 +891,12 @@ static int nla_put_nh_group(struct sk_buff *skb, struct nexthop *nh,
 
 	p = nla_data(nla);
 	for (i = 0; i < nhg->num_nh; ++i) {
+		weight = nhg->nh_entries[i].weight - 1;
+
 		*p++ = (struct nexthop_grp) {
 			.id = nhg->nh_entries[i].nh->id,
-			.weight = nhg->nh_entries[i].weight - 1,
+			.weight = weight,
+			.weight_high = weight >> 8,
 		};
 	}
 
@@ -1286,11 +1290,14 @@ static int nh_check_attr_group(struct net *net,
 
 	nhg = nla_data(tb[NHA_GROUP]);
 	for (i = 0; i < len; ++i) {
-		if (nhg[i].resvd1 || nhg[i].resvd2) {
-			NL_SET_ERR_MSG(extack, "Reserved fields in nexthop_grp must be 0");
+		if (nhg[i].resvd2) {
+			NL_SET_ERR_MSG(extack, "Reserved field in nexthop_grp must be 0");
 			return -EINVAL;
 		}
-		if (nhg[i].weight > 254) {
+		if (nexthop_grp_weight(&nhg[i]) == 0) {
+			/* 0xffff got passed in, representing weight of 0x10000,
+			 * which is too heavy.
+			 */
 			NL_SET_ERR_MSG(extack, "Invalid value for weight");
 			return -EINVAL;
 		}
@@ -1886,9 +1893,9 @@ static void nh_res_table_cancel_upkeep(struct nh_res_table *res_table)
 static void nh_res_group_rebalance(struct nh_group *nhg,
 				   struct nh_res_table *res_table)
 {
-	int prev_upper_bound = 0;
-	int total = 0;
-	int w = 0;
+	u16 prev_upper_bound = 0;
+	u32 total = 0;
+	u32 w = 0;
 	int i;
 
 	INIT_LIST_HEAD(&res_table->uw_nh_entries);
@@ -1898,11 +1905,12 @@ static void nh_res_group_rebalance(struct nh_group *nhg,
 
 	for (i = 0; i < nhg->num_nh; ++i) {
 		struct nh_grp_entry *nhge = &nhg->nh_entries[i];
-		int upper_bound;
+		u16 upper_bound;
+		u64 btw;
 
 		w += nhge->weight;
-		upper_bound = DIV_ROUND_CLOSEST(res_table->num_nh_buckets * w,
-						total);
+		btw = ((u64)res_table->num_nh_buckets) * w;
+		upper_bound = DIV_ROUND_CLOSEST_ULL(btw, total);
 		nhge->res.wants_buckets = upper_bound - prev_upper_bound;
 		prev_upper_bound = upper_bound;
 
@@ -1968,8 +1976,8 @@ static void replace_nexthop_grp_res(struct nh_group *oldg,
 
 static void nh_hthr_group_rebalance(struct nh_group *nhg)
 {
-	int total = 0;
-	int w = 0;
+	u32 total = 0;
+	u32 w = 0;
 	int i;
 
 	for (i = 0; i < nhg->num_nh; ++i)
@@ -1977,7 +1985,7 @@ static void nh_hthr_group_rebalance(struct nh_group *nhg)
 
 	for (i = 0; i < nhg->num_nh; ++i) {
 		struct nh_grp_entry *nhge = &nhg->nh_entries[i];
-		int upper_bound;
+		u32 upper_bound;
 
 		w += nhge->weight;
 		upper_bound = DIV_ROUND_CLOSEST_ULL((u64)w << 31, total) - 1;
@@ -2719,7 +2727,8 @@ static struct nexthop *nexthop_create_group(struct net *net,
 			goto out_no_nh;
 		}
 		nhg->nh_entries[i].nh = nhe;
-		nhg->nh_entries[i].weight = entry[i].weight + 1;
+		nhg->nh_entries[i].weight = nexthop_grp_weight(&entry[i]);
+
 		list_add(&nhg->nh_entries[i].nh_list, &nhe->grp_list);
 		nhg->nh_entries[i].nh_parent = nh;
 	}
-- 
2.45.0



* [PATCH net-next 3/6] selftests: router_mpath: Sleep after MZ
  2024-08-01 16:23 [PATCH net-next 0/6] net: nexthop: Increase weight to u16 Petr Machata
  2024-08-01 16:23 ` [PATCH net-next 1/6] net: nexthop: Add flag to assert that NHGRP reserved fields are zero Petr Machata
  2024-08-01 16:23 ` [PATCH net-next 2/6] net: nexthop: Increase weight to u16 Petr Machata
@ 2024-08-01 16:23 ` Petr Machata
  2024-08-01 16:24 ` [PATCH net-next 4/6] selftests: router_mpath_nh: Test 16-bit next hop weights Petr Machata
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 15+ messages in thread
From: Petr Machata @ 2024-08-01 16:23 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	netdev
  Cc: Ido Schimmel, Petr Machata, David Ahern, Donald Sharp, mlxsw,
	Shuah Khan, linux-kselftest

In the context of an offloaded datapath, it may take a while for the ip
link stats to be updated. This causes the test to fail when MZ_DELAY is too
low. Sleep after the packets are sent to give the link stats time to catch
up.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
---
 tools/testing/selftests/net/forwarding/router_mpath_nh.sh     | 2 ++
 tools/testing/selftests/net/forwarding/router_mpath_nh_res.sh | 2 ++
 tools/testing/selftests/net/forwarding/router_multipath.sh    | 2 ++
 3 files changed, 6 insertions(+)

diff --git a/tools/testing/selftests/net/forwarding/router_mpath_nh.sh b/tools/testing/selftests/net/forwarding/router_mpath_nh.sh
index 2ba44247c60a..c5a30f8f55b5 100755
--- a/tools/testing/selftests/net/forwarding/router_mpath_nh.sh
+++ b/tools/testing/selftests/net/forwarding/router_mpath_nh.sh
@@ -243,6 +243,7 @@ multipath4_test()
 
 	ip vrf exec vrf-h1 $MZ $h1 -q -p 64 -A 192.0.2.2 -B 198.51.100.2 \
 		-d $MZ_DELAY -t udp "sp=1024,dp=0-32768"
+	sleep 1
 
 	t1_rp12=$(link_stats_tx_packets_get $rp12)
 	t1_rp13=$(link_stats_tx_packets_get $rp13)
@@ -276,6 +277,7 @@ multipath6_test()
 
 	$MZ $h1 -6 -q -p 64 -A 2001:db8:1::2 -B 2001:db8:2::2 \
 		-d $MZ_DELAY -t udp "sp=1024,dp=0-32768"
+	sleep 1
 
 	t1_rp12=$(link_stats_tx_packets_get $rp12)
 	t1_rp13=$(link_stats_tx_packets_get $rp13)
diff --git a/tools/testing/selftests/net/forwarding/router_mpath_nh_res.sh b/tools/testing/selftests/net/forwarding/router_mpath_nh_res.sh
index cd9e346436fc..bd35fe8be9aa 100755
--- a/tools/testing/selftests/net/forwarding/router_mpath_nh_res.sh
+++ b/tools/testing/selftests/net/forwarding/router_mpath_nh_res.sh
@@ -244,6 +244,7 @@ multipath4_test()
 
 	ip vrf exec vrf-h1 $MZ $h1 -q -p 64 -A 192.0.2.2 -B 198.51.100.2 \
 		-d $MZ_DELAY -t udp "sp=1024,dp=0-32768"
+	sleep 1
 
 	t1_rp12=$(link_stats_tx_packets_get $rp12)
 	t1_rp13=$(link_stats_tx_packets_get $rp13)
@@ -274,6 +275,7 @@ multipath6_l4_test()
 
 	$MZ $h1 -6 -q -p 64 -A 2001:db8:1::2 -B 2001:db8:2::2 \
 		-d $MZ_DELAY -t udp "sp=1024,dp=0-32768"
+	sleep 1
 
 	t1_rp12=$(link_stats_tx_packets_get $rp12)
 	t1_rp13=$(link_stats_tx_packets_get $rp13)
diff --git a/tools/testing/selftests/net/forwarding/router_multipath.sh b/tools/testing/selftests/net/forwarding/router_multipath.sh
index e2be354167a1..46f365b557b7 100755
--- a/tools/testing/selftests/net/forwarding/router_multipath.sh
+++ b/tools/testing/selftests/net/forwarding/router_multipath.sh
@@ -180,6 +180,7 @@ multipath4_test()
 
        ip vrf exec vrf-h1 $MZ $h1 -q -p 64 -A 192.0.2.2 -B 198.51.100.2 \
 	       -d $MZ_DELAY -t udp "sp=1024,dp=0-32768"
+       sleep 1
 
        t1_rp12=$(link_stats_tx_packets_get $rp12)
        t1_rp13=$(link_stats_tx_packets_get $rp13)
@@ -217,6 +218,7 @@ multipath6_test()
 
        $MZ $h1 -6 -q -p 64 -A 2001:db8:1::2 -B 2001:db8:2::2 \
 	       -d $MZ_DELAY -t udp "sp=1024,dp=0-32768"
+       sleep 1
 
        t1_rp12=$(link_stats_tx_packets_get $rp12)
        t1_rp13=$(link_stats_tx_packets_get $rp13)
-- 
2.45.0



* [PATCH net-next 4/6] selftests: router_mpath_nh: Test 16-bit next hop weights
  2024-08-01 16:23 [PATCH net-next 0/6] net: nexthop: Increase weight to u16 Petr Machata
                   ` (2 preceding siblings ...)
  2024-08-01 16:23 ` [PATCH net-next 3/6] selftests: router_mpath: Sleep after MZ Petr Machata
@ 2024-08-01 16:24 ` Petr Machata
  2024-08-01 16:24 ` [PATCH net-next 5/6] selftests: router_mpath_nh_res: " Petr Machata
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 15+ messages in thread
From: Petr Machata @ 2024-08-01 16:24 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	netdev
  Cc: Ido Schimmel, Petr Machata, David Ahern, Donald Sharp, mlxsw,
	Shuah Khan, linux-kselftest

Add tests that exercise the full 16 bits of NH weight.

To test the 255:65535 ratio, it is necessary to send more packets than in
the other tests. On a debug kernel, that test can take up to a minute,
therefore skip it when KSFT_MACHINE_SLOW is set.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
---
 tools/testing/selftests/net/forwarding/lib.sh |  7 ++++
 .../net/forwarding/router_mpath_nh.sh         | 38 +++++++++++++++----
 .../net/forwarding/router_mpath_nh_lib.sh     | 13 +++++++
 3 files changed, 50 insertions(+), 8 deletions(-)

diff --git a/tools/testing/selftests/net/forwarding/lib.sh b/tools/testing/selftests/net/forwarding/lib.sh
index ff96bb7535ff..cb0fcd6f0293 100644
--- a/tools/testing/selftests/net/forwarding/lib.sh
+++ b/tools/testing/selftests/net/forwarding/lib.sh
@@ -509,6 +509,13 @@ xfail_on_slow()
 	fi
 }
 
+omit_on_slow()
+{
+	if [[ $KSFT_MACHINE_SLOW != yes ]]; then
+		"$@"
+	fi
+}
+
 xfail_on_veth()
 {
 	local dev=$1; shift
diff --git a/tools/testing/selftests/net/forwarding/router_mpath_nh.sh b/tools/testing/selftests/net/forwarding/router_mpath_nh.sh
index c5a30f8f55b5..a7d8399c8d4f 100755
--- a/tools/testing/selftests/net/forwarding/router_mpath_nh.sh
+++ b/tools/testing/selftests/net/forwarding/router_mpath_nh.sh
@@ -40,6 +40,7 @@ ALL_TESTS="
 	ping_ipv4
 	ping_ipv6
 	multipath_test
+	multipath16_test
 	ping_ipv4_blackhole
 	ping_ipv6_blackhole
 	nh_stats_test_v4
@@ -226,9 +227,11 @@ routing_nh_obj()
 
 multipath4_test()
 {
-	local desc="$1"
-	local weight_rp12=$2
-	local weight_rp13=$3
+	local desc=$1; shift
+	local weight_rp12=$1; shift
+	local weight_rp13=$1; shift
+	local ports=${1-sp=1024,dp=0-32768}; shift
+
 	local t0_rp12 t0_rp13 t1_rp12 t1_rp13
 	local packets_rp12 packets_rp13
 
@@ -242,7 +245,7 @@ multipath4_test()
 	t0_rp13=$(link_stats_tx_packets_get $rp13)
 
 	ip vrf exec vrf-h1 $MZ $h1 -q -p 64 -A 192.0.2.2 -B 198.51.100.2 \
-		-d $MZ_DELAY -t udp "sp=1024,dp=0-32768"
+		-d $MZ_DELAY -t udp "$ports"
 	sleep 1
 
 	t1_rp12=$(link_stats_tx_packets_get $rp12)
@@ -259,9 +262,11 @@ multipath4_test()
 
 multipath6_test()
 {
-	local desc="$1"
-	local weight_rp12=$2
-	local weight_rp13=$3
+	local desc=$1; shift
+	local weight_rp12=$1; shift
+	local weight_rp13=$1; shift
+	local ports=${1-sp=1024,dp=0-32768}; shift
+
 	local t0_rp12 t0_rp13 t1_rp12 t1_rp13
 	local packets_rp12 packets_rp13
 
@@ -276,7 +281,7 @@ multipath6_test()
 	t0_rp13=$(link_stats_tx_packets_get $rp13)
 
 	$MZ $h1 -6 -q -p 64 -A 2001:db8:1::2 -B 2001:db8:2::2 \
-		-d $MZ_DELAY -t udp "sp=1024,dp=0-32768"
+		-d $MZ_DELAY -t udp "$ports"
 	sleep 1
 
 	t1_rp12=$(link_stats_tx_packets_get $rp12)
@@ -315,6 +320,23 @@ multipath_test()
 	multipath6_test "Weighted MP 11:45" 11 45
 }
 
+multipath16_test()
+{
+	check_nhgw16 104 || return
+
+	log_info "Running 16-bit IPv4 multipath tests"
+	multipath4_test "65535:65535" 65535 65535
+	multipath4_test "128:512" 128 512
+	omit_on_slow \
+		multipath4_test "255:65535" 255 65535 sp=1024-1026,dp=0-65535
+
+	log_info "Running 16-bit IPv6 multipath tests"
+	multipath6_test "65535:65535" 65535 65535
+	multipath6_test "128:512" 128 512
+	omit_on_slow \
+		multipath6_test "255:65535" 255 65535 sp=1024-1026,dp=0-65535
+}
+
 ping_ipv4_blackhole()
 {
 	RET=0
diff --git a/tools/testing/selftests/net/forwarding/router_mpath_nh_lib.sh b/tools/testing/selftests/net/forwarding/router_mpath_nh_lib.sh
index 2903294d8bca..507b2852dabe 100644
--- a/tools/testing/selftests/net/forwarding/router_mpath_nh_lib.sh
+++ b/tools/testing/selftests/net/forwarding/router_mpath_nh_lib.sh
@@ -117,3 +117,16 @@ __nh_stats_test_v6()
 			       $MZ -6 $h1 -A 2001:db8:1::2 -B 2001:db8:2::2
 	sysctl_restore net.ipv6.fib_multipath_hash_policy
 }
+
+check_nhgw16()
+{
+	local nhid=$1; shift
+
+	ip nexthop replace id 9999 group "$nhid,65535" &>/dev/null
+	if (( $? )); then
+		log_test_skip "16-bit multipath tests" \
+			      "iproute2 or the kernel do not support 16-bit next hop weights"
+		return 1
+	fi
+	ip nexthop del id 9999 ||:
+}
-- 
2.45.0



* [PATCH net-next 5/6] selftests: router_mpath_nh_res: Test 16-bit next hop weights
  2024-08-01 16:23 [PATCH net-next 0/6] net: nexthop: Increase weight to u16 Petr Machata
                   ` (3 preceding siblings ...)
  2024-08-01 16:24 ` [PATCH net-next 4/6] selftests: router_mpath_nh: Test 16-bit next hop weights Petr Machata
@ 2024-08-01 16:24 ` Petr Machata
  2024-08-01 16:24 ` [PATCH net-next 6/6] selftests: fib_nexthops: " Petr Machata
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 15+ messages in thread
From: Petr Machata @ 2024-08-01 16:24 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	netdev
  Cc: Ido Schimmel, Petr Machata, David Ahern, Donald Sharp, mlxsw,
	Shuah Khan, linux-kselftest

Add tests that exercise the full 16 bits of NH weight.

As in the previous patch, omit the 255:65535 test when KSFT_MACHINE_SLOW is
set.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
---
 .../net/forwarding/router_mpath_nh_res.sh     | 56 ++++++++++++++++---
 1 file changed, 48 insertions(+), 8 deletions(-)

diff --git a/tools/testing/selftests/net/forwarding/router_mpath_nh_res.sh b/tools/testing/selftests/net/forwarding/router_mpath_nh_res.sh
index bd35fe8be9aa..88ddae05b39d 100755
--- a/tools/testing/selftests/net/forwarding/router_mpath_nh_res.sh
+++ b/tools/testing/selftests/net/forwarding/router_mpath_nh_res.sh
@@ -40,6 +40,7 @@ ALL_TESTS="
 	ping_ipv4
 	ping_ipv6
 	multipath_test
+	multipath16_test
 	nh_stats_test_v4
 	nh_stats_test_v6
 "
@@ -228,9 +229,11 @@ routing_nh_obj()
 
 multipath4_test()
 {
-	local desc="$1"
-	local weight_rp12=$2
-	local weight_rp13=$3
+	local desc=$1; shift
+	local weight_rp12=$1; shift
+	local weight_rp13=$1; shift
+	local ports=${1-sp=1024,dp=0-32768}; shift
+
 	local t0_rp12 t0_rp13 t1_rp12 t1_rp13
 	local packets_rp12 packets_rp13
 
@@ -243,7 +246,7 @@ multipath4_test()
 	t0_rp13=$(link_stats_tx_packets_get $rp13)
 
 	ip vrf exec vrf-h1 $MZ $h1 -q -p 64 -A 192.0.2.2 -B 198.51.100.2 \
-		-d $MZ_DELAY -t udp "sp=1024,dp=0-32768"
+		-d $MZ_DELAY -t udp "$ports"
 	sleep 1
 
 	t1_rp12=$(link_stats_tx_packets_get $rp12)
@@ -259,9 +262,11 @@ multipath4_test()
 
 multipath6_l4_test()
 {
-	local desc="$1"
-	local weight_rp12=$2
-	local weight_rp13=$3
+	local desc=$1; shift
+	local weight_rp12=$1; shift
+	local weight_rp13=$1; shift
+	local ports=${1-sp=1024,dp=0-32768}; shift
+
 	local t0_rp12 t0_rp13 t1_rp12 t1_rp13
 	local packets_rp12 packets_rp13
 
@@ -274,7 +279,7 @@ multipath6_l4_test()
 	t0_rp13=$(link_stats_tx_packets_get $rp13)
 
 	$MZ $h1 -6 -q -p 64 -A 2001:db8:1::2 -B 2001:db8:2::2 \
-		-d $MZ_DELAY -t udp "sp=1024,dp=0-32768"
+		-d $MZ_DELAY -t udp "$ports"
 	sleep 1
 
 	t1_rp12=$(link_stats_tx_packets_get $rp12)
@@ -373,6 +378,41 @@ multipath_test()
 	ip nexthop replace id 106 group 104,1/105,1 type resilient
 }
 
+multipath16_test()
+{
+	check_nhgw16 104 || return
+
+	log_info "Running 16-bit IPv4 multipath tests"
+	ip nexthop replace id 103 group 101/102 type resilient idle_timer 0
+
+	ip nexthop replace id 103 group 101,65535/102,65535 type resilient
+	multipath4_test "65535:65535" 65535 65535
+
+	ip nexthop replace id 103 group 101,128/102,512 type resilient
+	multipath4_test "128:512" 128 512
+
+	ip nexthop replace id 103 group 101,255/102,65535 type resilient
+	omit_on_slow \
+		multipath4_test "255:65535" 255 65535 sp=1024-1026,dp=0-65535
+
+	ip nexthop replace id 103 group 101,1/102,1 type resilient
+
+	log_info "Running 16-bit IPv6 L4 hash multipath tests"
+	ip nexthop replace id 106 group 104/105 type resilient idle_timer 0
+
+	ip nexthop replace id 106 group 104,65535/105,65535 type resilient
+	multipath6_l4_test "65535:65535" 65535 65535
+
+	ip nexthop replace id 106 group 104,128/105,512 type resilient
+	multipath6_l4_test "128:512" 128 512
+
+	ip nexthop replace id 106 group 104,255/105,65535 type resilient
+	omit_on_slow \
+		multipath6_l4_test "255:65535" 255 65535 sp=1024-1026,dp=0-65535
+
+	ip nexthop replace id 106 group 104,1/105,1 type resilient
+}
+
 nh_stats_test_v4()
 {
 	__nh_stats_test_v4 resilient
-- 
2.45.0



* [PATCH net-next 6/6] selftests: fib_nexthops: Test 16-bit next hop weights
  2024-08-01 16:23 [PATCH net-next 0/6] net: nexthop: Increase weight to u16 Petr Machata
                   ` (4 preceding siblings ...)
  2024-08-01 16:24 ` [PATCH net-next 5/6] selftests: router_mpath_nh_res: " Petr Machata
@ 2024-08-01 16:24 ` Petr Machata
  2024-08-01 22:52 ` [PATCH net-next 0/6] net: nexthop: Increase weight to u16 Jakub Kicinski
  2024-08-05 14:31 ` David Ahern
  7 siblings, 0 replies; 15+ messages in thread
From: Petr Machata @ 2024-08-01 16:24 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	netdev
  Cc: Ido Schimmel, Petr Machata, David Ahern, Donald Sharp, mlxsw,
	Shuah Khan, linux-kselftest

Add tests that attempt to create NH groups that use the full 16 bits of NH
weight.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
---
 tools/testing/selftests/net/fib_nexthops.sh | 55 ++++++++++++++++++++-
 1 file changed, 54 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/net/fib_nexthops.sh b/tools/testing/selftests/net/fib_nexthops.sh
index ac0b2c6a5761..77c83d9508d3 100755
--- a/tools/testing/selftests/net/fib_nexthops.sh
+++ b/tools/testing/selftests/net/fib_nexthops.sh
@@ -78,7 +78,12 @@ log_test()
 	else
 		ret=1
 		nfail=$((nfail+1))
-		printf "TEST: %-60s  [FAIL]\n" "${msg}"
+		if [[ $rc -eq $ksft_skip ]]; then
+			printf "TEST: %-60s  [SKIP]\n" "${msg}"
+		else
+			printf "TEST: %-60s  [FAIL]\n" "${msg}"
+		fi
+
 		if [ "$VERBOSE" = "1" ]; then
 			echo "    rc=$rc, expected $expected"
 		fi
@@ -923,6 +928,29 @@ ipv6_grp_fcnal()
 
 	ipv6_grp_refs
 	log_test $? 0 "Nexthop group replace refcounts"
+
+	#
+	# 16-bit weights.
+	#
+	run_cmd "$IP nexthop add id 62 via 2001:db8:91::2 dev veth1"
+	run_cmd "$IP nexthop add id 63 via 2001:db8:91::3 dev veth1"
+	run_cmd "$IP nexthop add id 64 via 2001:db8:91::4 dev veth1"
+	run_cmd "$IP nexthop add id 65 via 2001:db8:91::5 dev veth1"
+	run_cmd "$IP nexthop add id 66 dev veth1"
+
+	run_cmd "$IP nexthop add id 103 group 62,1000"
+	if [[ $? == 0 ]]; then
+		local GRP="id 103 group 62,254/63,255/64,256/65,257/66,65535"
+		run_cmd "$IP nexthop replace $GRP"
+		check_nexthop "id 103" "$GRP"
+		rc=$?
+	else
+		rc=$ksft_skip
+	fi
+
+	$IP nexthop flush >/dev/null 2>&1
+
+	log_test $rc 0 "16-bit weights"
 }
 
 ipv6_res_grp_fcnal()
@@ -987,6 +1015,31 @@ ipv6_res_grp_fcnal()
 	check_nexthop_bucket "list id 102" \
 		"id 102 index 0 nhid 63 id 102 index 1 nhid 62 id 102 index 2 nhid 62 id 102 index 3 nhid 62"
 	log_test $? 0 "Nexthop buckets updated after replace - nECMP"
+
+	#
+	# 16-bit weights.
+	#
+	run_cmd "$IP nexthop add id 62 via 2001:db8:91::2 dev veth1"
+	run_cmd "$IP nexthop add id 63 via 2001:db8:91::3 dev veth1"
+	run_cmd "$IP nexthop add id 64 via 2001:db8:91::4 dev veth1"
+	run_cmd "$IP nexthop add id 65 via 2001:db8:91::5 dev veth1"
+	run_cmd "$IP nexthop add id 66 dev veth1"
+
+	run_cmd "$IP nexthop add id 103 group 62,1000 type resilient buckets 32"
+	if [[ $? == 0 ]]; then
+		local GRP="id 103 group 62,254/63,255/64,256/65,257/66,65535 $(:
+			  )type resilient buckets 32 idle_timer 0 $(:
+			  )unbalanced_timer 0"
+		run_cmd "$IP nexthop replace $GRP"
+		check_nexthop "id 103" "$GRP unbalanced_time 0"
+		rc=$?
+	else
+		rc=$ksft_skip
+	fi
+
+	$IP nexthop flush >/dev/null 2>&1
+
+	log_test $rc 0 "16-bit weights"
 }
 
 ipv6_fcnal_runtime()
-- 
2.45.0



* Re: [PATCH net-next 2/6] net: nexthop: Increase weight to u16
  2024-08-01 16:23 ` [PATCH net-next 2/6] net: nexthop: Increase weight to u16 Petr Machata
@ 2024-08-01 19:39   ` Simon Horman
  2024-08-05 12:33     ` Petr Machata
  2024-08-06 14:02   ` Przemek Kitszel
  1 sibling, 1 reply; 15+ messages in thread
From: Simon Horman @ 2024-08-01 19:39 UTC (permalink / raw)
  To: Petr Machata
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	netdev, Ido Schimmel, David Ahern, Donald Sharp, mlxsw

On Thu, Aug 01, 2024 at 06:23:58PM +0200, Petr Machata wrote:
> In CLOS networks, as link failures occur at various points in the network,
> ECMP weights of the involved nodes are adjusted to compensate. With high
> fan-out of the involved nodes, and overall high number of nodes,
> a (non-)ECMP weight ratio that we would like to configure does not fit into
> 8 bits. Instead of, say, 255:254, we might like to configure something like
> 1000:999. For these deployments, the 8-bit weight may not be enough.
> 
> To that end, in this patch increase the next hop weight from u8 to u16.
> 
> Increasing the width of an integral type can be tricky, because while the
> code still compiles, the types may not check out anymore, and numerical
> errors come up. To prevent this, the conversion was done in two steps.
> First the type was changed from u8 to a single-member structure, which
> invalidated all uses of the field. This allowed going through them one by
> one and auditing for type correctness. Then the structure was replaced with a
> vanilla u16 again. This should ensure that no place was missed.
> 
> The UAPI for configuring nexthop group members is that an attribute
> NHA_GROUP carries an array of struct nexthop_grp entries:
> 
> 	struct nexthop_grp {
> 		__u32	id;	  /* nexthop id - must exist */
> 		__u8	weight;   /* weight of this nexthop */
> 		__u8	resvd1;
> 		__u16	resvd2;
> 	};
> 
> The field resvd1 is currently validated and required to be zero. We can
> lift this requirement and carry high-order bits of the weight in the
> reserved field:
> 
> 	struct nexthop_grp {
> 		__u32	id;	  /* nexthop id - must exist */
> 		__u8	weight;   /* weight of this nexthop */
> 		__u8	weight_high;
> 		__u16	resvd2;
> 	};
> 
> Keeping the fields split this way was chosen in case an existing userspace
> makes assumptions about the width of the weight field, and to sidestep any
> endianes issues.

nit: endianness

...


* Re: [PATCH net-next 0/6] net: nexthop: Increase weight to u16
  2024-08-01 16:23 [PATCH net-next 0/6] net: nexthop: Increase weight to u16 Petr Machata
                   ` (5 preceding siblings ...)
  2024-08-01 16:24 ` [PATCH net-next 6/6] selftests: fib_nexthops: " Petr Machata
@ 2024-08-01 22:52 ` Jakub Kicinski
  2024-08-05  8:06   ` Petr Machata
  2024-08-05 14:31 ` David Ahern
  7 siblings, 1 reply; 15+ messages in thread
From: Jakub Kicinski @ 2024-08-01 22:52 UTC (permalink / raw)
  To: Petr Machata
  Cc: David S. Miller, Eric Dumazet, Paolo Abeni, netdev, Ido Schimmel,
	David Ahern, Donald Sharp, mlxsw

On Thu, 1 Aug 2024 18:23:56 +0200 Petr Machata wrote:
> Patches #3 to #6 add selftests.

Could you share the iproute2 patches?


* Re: [PATCH net-next 0/6] net: nexthop: Increase weight to u16
  2024-08-01 22:52 ` [PATCH net-next 0/6] net: nexthop: Increase weight to u16 Jakub Kicinski
@ 2024-08-05  8:06   ` Petr Machata
  0 siblings, 0 replies; 15+ messages in thread
From: Petr Machata @ 2024-08-05  8:06 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Petr Machata, David S. Miller, Eric Dumazet, Paolo Abeni, netdev,
	Ido Schimmel, David Ahern, Donald Sharp, mlxsw


Jakub Kicinski <kuba@kernel.org> writes:

> On Thu, 1 Aug 2024 18:23:56 +0200 Petr Machata wrote:
>> Patches #3 to #6 add selftests.
>
> Could you share the iproute2 patches?

https://github.com/pmachata/iproute2/commits/nhgw16/


* Re: [PATCH net-next 2/6] net: nexthop: Increase weight to u16
  2024-08-01 19:39   ` Simon Horman
@ 2024-08-05 12:33     ` Petr Machata
  0 siblings, 0 replies; 15+ messages in thread
From: Petr Machata @ 2024-08-05 12:33 UTC (permalink / raw)
  To: Simon Horman
  Cc: Petr Machata, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, netdev, Ido Schimmel, David Ahern, Donald Sharp,
	mlxsw


Simon Horman <horms@kernel.org> writes:

> On Thu, Aug 01, 2024 at 06:23:58PM +0200, Petr Machata wrote:
>> In CLOS networks, as link failures occur at various points in the network,
>> ECMP weights of the involved nodes are adjusted to compensate. With high
>> fan-out of the involved nodes, and overall high number of nodes,
>> a (non-)ECMP weight ratio that we would like to configure does not fit into
>> 8 bits. Instead of, say, 255:254, we might like to configure something like
>> 1000:999. For these deployments, the 8-bit weight may not be enough.
>> 
>> To that end, in this patch increase the next hop weight from u8 to u16.
>> 
>> Increasing the width of an integral type can be tricky, because while the
>> code still compiles, the types may not check out anymore, and numerical
>> errors come up. To prevent this, the conversion was done in two steps.
>> First the type was changed from u8 to a single-member structure, which
>> invalidated all uses of the field. This allowed going through them one by
>> one and auditing for type correctness. Then the structure was replaced with a
>> vanilla u16 again. This should ensure that no place was missed.
>> 
>> The UAPI for configuring nexthop group members is that an attribute
>> NHA_GROUP carries an array of struct nexthop_grp entries:
>> 
>> 	struct nexthop_grp {
>> 		__u32	id;	  /* nexthop id - must exist */
>> 		__u8	weight;   /* weight of this nexthop */
>> 		__u8	resvd1;
>> 		__u16	resvd2;
>> 	};
>> 
>> The field resvd1 is currently validated and required to be zero. We can
>> lift this requirement and carry high-order bits of the weight in the
>> reserved field:
>> 
>> 	struct nexthop_grp {
>> 		__u32	id;	  /* nexthop id - must exist */
>> 		__u8	weight;   /* weight of this nexthop */
>> 		__u8	weight_high;
>> 		__u16	resvd2;
>> 	};
>> 
>> Keeping the fields split this way was chosen in case an existing userspace
>> makes assumptions about the width of the weight field, and to sidestep any
>> endianes issues.
>
> nit: endianness

Thanks, will fix for v2. For now I'm still waiting to see if there's other
feedback.


* Re: [PATCH net-next 0/6] net: nexthop: Increase weight to u16
  2024-08-01 16:23 [PATCH net-next 0/6] net: nexthop: Increase weight to u16 Petr Machata
                   ` (6 preceding siblings ...)
  2024-08-01 22:52 ` [PATCH net-next 0/6] net: nexthop: Increase weight to u16 Jakub Kicinski
@ 2024-08-05 14:31 ` David Ahern
  7 siblings, 0 replies; 15+ messages in thread
From: David Ahern @ 2024-08-05 14:31 UTC (permalink / raw)
  To: Petr Machata, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, netdev
  Cc: Ido Schimmel, Donald Sharp, mlxsw

On 8/1/24 10:23 AM, Petr Machata wrote:
> In CLOS networks, as link failures occur at various points in the network,
> ECMP weights of the involved nodes are adjusted to compensate. With high
> fan-out of the involved nodes, and overall high number of nodes,
> a (non-)ECMP weight ratio that we would like to configure does not fit into
> 8 bits. Instead of, say, 255:254, we might like to configure something like
> 1000:999. For these deployments, the 8-bit weight may not be enough.
> 
> To that end, in this patchset increase the next hop weight from u8 to u16.
> 
> Patch #1 adds a flag that indicates whether the reserved fields are zeroed.
> This is a follow-up to a new fix merged in commit 6d745cd0e972 ("net:
> nexthop: Initialize all fields in dumped nexthops"). The theory behind this
> patch is that there is a strict ordering between the fields actually being
> zeroed, the kernel declaring that they are, and the kernel repurposing the
> fields. Thus clients can use the flag to tell if it is safe to interpret
> the reserved fields in any way.
> 
> Patch #2 contains the substantial code and the commit message covers the
> details of the changes.
> 
> Patches #3 to #6 add selftests.
> 

LGTM. For the set
Reviewed-by: David Ahern <dsahern@kernel.org>




* Re: [PATCH net-next 1/6] net: nexthop: Add flag to assert that NHGRP reserved fields are zero
  2024-08-01 16:23 ` [PATCH net-next 1/6] net: nexthop: Add flag to assert that NHGRP reserved fields are zero Petr Machata
@ 2024-08-05 22:23   ` Jakub Kicinski
  2024-08-06  9:48     ` Petr Machata
  0 siblings, 1 reply; 15+ messages in thread
From: Jakub Kicinski @ 2024-08-05 22:23 UTC (permalink / raw)
  To: Petr Machata
  Cc: David S. Miller, Eric Dumazet, Paolo Abeni, netdev, Ido Schimmel,
	David Ahern, Donald Sharp, mlxsw

On Thu, 1 Aug 2024 18:23:57 +0200 Petr Machata wrote:
> diff --git a/include/uapi/linux/nexthop.h b/include/uapi/linux/nexthop.h
> index dd8787f9cf39..2ed643207847 100644
> --- a/include/uapi/linux/nexthop.h
> +++ b/include/uapi/linux/nexthop.h
> @@ -33,6 +33,9 @@ enum {
>  #define NHA_OP_FLAG_DUMP_STATS		BIT(0)
>  #define NHA_OP_FLAG_DUMP_HW_STATS	BIT(1)
>  
> +/* Response OP_FLAGS. */
> +#define NHA_OP_FLAG_RESP_GRP_RESVD_0	BIT(0)

I guess these are op flags, so nobody should have a need to store them,
but having bits mean different things in and out make decoding and
binding generation much harder. Let's not do this unless absolutely
necessary. Perhaps you can start defining response flags from the 
"other end", i.e. bit 31? Chances are the two sides will never "meet".


* Re: [PATCH net-next 1/6] net: nexthop: Add flag to assert that NHGRP reserved fields are zero
  2024-08-05 22:23   ` Jakub Kicinski
@ 2024-08-06  9:48     ` Petr Machata
  0 siblings, 0 replies; 15+ messages in thread
From: Petr Machata @ 2024-08-06  9:48 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Petr Machata, David S. Miller, Eric Dumazet, Paolo Abeni, netdev,
	Ido Schimmel, David Ahern, Donald Sharp, mlxsw


Jakub Kicinski <kuba@kernel.org> writes:

> On Thu, 1 Aug 2024 18:23:57 +0200 Petr Machata wrote:
>> diff --git a/include/uapi/linux/nexthop.h b/include/uapi/linux/nexthop.h
>> index dd8787f9cf39..2ed643207847 100644
>> --- a/include/uapi/linux/nexthop.h
>> +++ b/include/uapi/linux/nexthop.h
>> @@ -33,6 +33,9 @@ enum {
>>  #define NHA_OP_FLAG_DUMP_STATS		BIT(0)
>>  #define NHA_OP_FLAG_DUMP_HW_STATS	BIT(1)
>>  
>> +/* Response OP_FLAGS. */
>> +#define NHA_OP_FLAG_RESP_GRP_RESVD_0	BIT(0)
>
> I guess these are op flags, so nobody should have a need to store them,
> but having bits mean different things in and out make decoding and
> binding generation much harder. Let's not do this unless absolutely
> necessary. Perhaps you can start defining response flags from the
> "other end", i.e. bit 31? Chances are the two sides will never "meet".

OK.


* Re: [PATCH net-next 2/6] net: nexthop: Increase weight to u16
  2024-08-01 16:23 ` [PATCH net-next 2/6] net: nexthop: Increase weight to u16 Petr Machata
  2024-08-01 19:39   ` Simon Horman
@ 2024-08-06 14:02   ` Przemek Kitszel
  1 sibling, 0 replies; 15+ messages in thread
From: Przemek Kitszel @ 2024-08-06 14:02 UTC (permalink / raw)
  To: Petr Machata
  Cc: Ido Schimmel, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, netdev, David Ahern, Donald Sharp, mlxsw

On 8/1/24 18:23, Petr Machata wrote:
> In CLOS networks, as link failures occur at various points in the network,
> ECMP weights of the involved nodes are adjusted to compensate. With high
> fan-out of the involved nodes, and overall high number of nodes,
> a (non-)ECMP weight ratio that we would like to configure does not fit into
> 8 bits. Instead of, say, 255:254, we might like to configure something like
> 1000:999. For these deployments, the 8-bit weight may not be enough.
> 
> To that end, in this patch increase the next hop weight from u8 to u16.
> 
> Increasing the width of an integral type can be tricky, because while the
> code still compiles, the types may not check out anymore, and numerical
> errors come up. To prevent this, the conversion was done in two steps.
> First the type was changed from u8 to a single-member structure, which
> invalidated all uses of the field. This allowed going through them one by
> one and auditing for type correctness. Then the structure was replaced with a
> vanilla u16 again. This should ensure that no place was missed.
> 
> The UAPI for configuring nexthop group members is that an attribute
> NHA_GROUP carries an array of struct nexthop_grp entries:
> 
> 	struct nexthop_grp {
> 		__u32	id;	  /* nexthop id - must exist */
> 		__u8	weight;   /* weight of this nexthop */
> 		__u8	resvd1;
> 		__u16	resvd2;
> 	};
> 
> The field resvd1 is currently validated and required to be zero. We can
> lift this requirement and carry high-order bits of the weight in the
> reserved field:
> 
> 	struct nexthop_grp {
> 		__u32	id;	  /* nexthop id - must exist */
> 		__u8	weight;   /* weight of this nexthop */
> 		__u8	weight_high;
> 		__u16	resvd2;
> 	};
> 
> Keeping the fields split this way was chosen in case an existing userspace
> makes assumptions about the width of the weight field, and to sidestep any
> endianes issues.
> 
> The weight field is currently encoded as the weight value minus one,
> because weight of 0 is invalid. This same trick is impossible for the new
> weight_high field, because zero must mean actual zero. With this in place:
> 
> - Old userspace is guaranteed to carry weight_high of 0, therefore
>    configuring 8-bit weights as appropriate. When dumping nexthops with
>    16-bit weight, it would only show the lower 8 bits. But configuring such
>    nexthops implies existence of userspace aware of the extension in the
>    first place.
> 
> - New userspace talking to an old kernel will work as long as it only
>    attempts to configure 8-bit weights, where the high-order bits are zero.
>    Old kernel will bounce attempts at configuring >8-bit weights.
> 
> Renaming reserved fields as they are allocated for some purpose is commonly
> done in Linux. Whoever touches a reserved field is doing so at their own
> risk. nexthop_grp::resvd1 in particular is currently used by at least
> strace, however they carry their own copy of UAPI headers, and the conversion
> should be trivial. A helper is provided for decoding the weight out of the
> two fields. Forcing a conversion seems preferable to bending backwards and
> introducing anonymous unions or whatever.
> 
> Signed-off-by: Petr Machata <petrm@nvidia.com>
> Reviewed-by: Ido Schimmel <idosch@nvidia.com>
> ---
>   include/net/nexthop.h        |  4 ++--
>   include/uapi/linux/nexthop.h |  7 ++++++-
>   net/ipv4/nexthop.c           | 37 ++++++++++++++++++++++--------------
>   3 files changed, 31 insertions(+), 17 deletions(-)
> 

> -		if (nhg[i].weight > 254) {
> +		if (nexthop_grp_weight(&nhg[i]) == 0) {
> +			/* 0xffff got passed in, representing weight of 0x10000,
> +			 * which is too heavy.
> +			 */
>   			NL_SET_ERR_MSG(extack, "Invalid value for weight");
>   			return -EINVAL;
>   		}

code is fine, and I like the decision to just extend the width
(instead of, say, applying the +1 arithmetic only to the lower byte and
then just adding the higher byte), so:

Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>

