netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH RESEND v2 net-next 00/14] ipv6: No RTNL for IPv6 routing table.
@ 2025-04-14 18:14 Kuniyuki Iwashima
  2025-04-14 18:14 ` [PATCH RESEND v2 net-next 01/14] ipv6: Validate RTA_GATEWAY of RTA_MULTIPATH in rtm_to_fib6_config() Kuniyuki Iwashima
                   ` (13 more replies)
  0 siblings, 14 replies; 40+ messages in thread
From: Kuniyuki Iwashima @ 2025-04-14 18:14 UTC (permalink / raw)
  To: David S. Miller, David Ahern, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni
  Cc: Simon Horman, Kuniyuki Iwashima, Kuniyuki Iwashima, netdev

IPv6 routing tables are protected by each table's lock and work in
the interrupt context, which means we basically don't need RTNL to
modify an IPv6 routing table itself.

Currently, the control paths require RTNL because we may need to
perform device and nexthop lookups; we must prevent dev/nexthop from
going away from the netns.

This, however, can be achieved by RCU as well.

If we are in the RCU critical section while adding an IPv6 route,
synchronize_net() in __dev_change_net_namespace() and
unregister_netdevice_many_notify() guarantee that the dev will not be
moved to another netns or removed.

Also, nexthop is guaranteed not to be freed during the RCU grace period.

If we care about a race between nexthop removal and IPv6 route addition,
we can get rid of RTNL from the control paths.

Patch 1 moves a validation for RTA_MULTIPATH earlier.
Patch 2 removes RTNL for SIOCDELRT and RTM_DELROUTE.
Patch 3 ~ 10 moves validation and memory allocation earlier.
Patch 11 prevents a race between two requests for the same table.
Patch 12 & 13 prevents the nexthop race mentioned above.
Patch 14 removes RTNL for SIOCADDRT and RTM_NEWROUTE.

Test:

The script [0] lets each CPU-X create 100000 routes on table-X in a
batch.

On c7a.metal-48xl EC2 instance with 192 CPUs,

without this series:

  $ sudo ./route_test.sh
  start adding routes
  added 19200000 routes (100000 routes * 192 tables).
  total routes: 19200006
  Time elapsed: 191577 milliseconds.

with this series:

  $ sudo ./route_test.sh
  start adding routes
  added 19200000 routes (100000 routes * 192 tables).
  total routes: 19200006
  Time elapsed: 62854 milliseconds.

I changed the number of routes (1000 ~ 100000 per CPU/table) and
consistently saw it finish 3x faster with this series.


[0]
#!/bin/bash

mkdir tmp

NS="test"
ip netns add $NS
ip -n $NS link add veth0 type veth peer veth1
ip -n $NS link set veth0 up
ip -n $NS link set veth1 up

TABLES=()
for i in $(seq $(nproc)); do
    TABLES+=("$i")
done

ROUTES=()
for i in {1..100}; do
    for j in {1..1000}; do
	ROUTES+=("2001:$i:$j::/64")
    done
done

for TABLE in "${TABLES[@]}"; do
    (
	FILE="./tmp/batch-table-$TABLE.txt"
	> $FILE
	for ROUTE in "${ROUTES[@]}"; do
            echo "route add $ROUTE dev veth0 table $TABLE" >> $FILE
	done
    ) &
done

wait

echo "start adding routes"

START_TIME=$(date +%s%3N)
for TABLE in "${TABLES[@]}"; do
    ip -n $NS -6 -batch "./tmp/batch-table-$TABLE.txt" &
done

wait
END_TIME=$(date +%s%3N)
ELAPSED_TIME=$((END_TIME - START_TIME))

echo "added $((${#ROUTES[@]} * ${#TABLES[@]})) routes (${#ROUTES[@]} routes * ${#TABLES[@]} tables)."
echo "total routes: $(ip -n $NS -6 route show table all | wc -l)"  # Just for debug
echo "Time elapsed: ${ELAPSED_TIME} milliseconds."

ip netns del $NS
rm -fr ./tmp/


Changes:
  v2 (RESEND)

  v2: https://lore.kernel.org/netdev/20250409011243.26195-1-kuniyu@amazon.com/
    * Add Patch 12
    * Patch 2
      * Call __ip6_del_rt() under RCU

  v1: https://lore.kernel.org/netdev/20250321040131.21057-1-kuniyu@amazon.com/


Kuniyuki Iwashima (14):
  ipv6: Validate RTA_GATEWAY of RTA_MULTIPATH in rtm_to_fib6_config().
  ipv6: Get rid of RTNL for SIOCDELRT and RTM_DELROUTE.
  ipv6: Move some validation from ip6_route_info_create() to
    rtm_to_fib6_config().
  ipv6: Check GATEWAY in rtm_to_fib6_multipath_config().
  ipv6: Move nexthop_find_by_id() after fib6_info_alloc().
  ipv6: Split ip6_route_info_create().
  ipv6: Preallocate rt->fib6_nh->rt6i_pcpu in ip6_route_info_create().
  ipv6: Preallocate nhc_pcpu_rth_output in ip6_route_info_create().
  ipv6: Don't pass net to ip6_route_info_append().
  ipv6: Factorise ip6_route_multipath_add().
  ipv6: Protect fib6_link_table() with spinlock.
  ipv6: Defer fib6_purge_rt() in fib6_add_rt2node() to fib6_add().
  ipv6: Protect nh->f6i_list with spinlock and flag.
  ipv6: Get rid of RTNL for SIOCADDRT and RTM_NEWROUTE.

 include/net/ip6_fib.h    |   1 +
 include/net/netns/ipv6.h |   1 +
 include/net/nexthop.h    |   2 +
 net/ipv4/fib_semantics.c |  10 +-
 net/ipv4/nexthop.c       |  22 +-
 net/ipv6/ip6_fib.c       |  75 ++++--
 net/ipv6/route.c         | 567 ++++++++++++++++++++++++---------------
 7 files changed, 438 insertions(+), 240 deletions(-)

-- 
2.49.0


^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH RESEND v2 net-next 01/14] ipv6: Validate RTA_GATEWAY of RTA_MULTIPATH in rtm_to_fib6_config().
  2025-04-14 18:14 [PATCH RESEND v2 net-next 00/14] ipv6: No RTNL for IPv6 routing table Kuniyuki Iwashima
@ 2025-04-14 18:14 ` Kuniyuki Iwashima
  2025-04-16  8:22   ` Paolo Abeni
  2025-04-14 18:14 ` [PATCH RESEND v2 net-next 02/14] ipv6: Get rid of RTNL for SIOCDELRT and RTM_DELROUTE Kuniyuki Iwashima
                   ` (12 subsequent siblings)
  13 siblings, 1 reply; 40+ messages in thread
From: Kuniyuki Iwashima @ 2025-04-14 18:14 UTC (permalink / raw)
  To: David S. Miller, David Ahern, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni
  Cc: Simon Horman, Kuniyuki Iwashima, Kuniyuki Iwashima, netdev

We will perform RTM_NEWROUTE and RTM_DELROUTE under RCU, and then
we want to perform some validation out of the RCU scope.

When creating / removing an IPv6 route with RTA_MULTIPATH,
inet6_rtm_newroute() / inet6_rtm_delroute() validates RTA_GATEWAY
in each multipath entry.

Let's do that in rtm_to_fib6_config().

Note that now RTM_DELROUTE returns an error for RTA_MULTIPATH with
0 entries, which was accepted but should result in -EINVAL as
RTM_NEWROUTE.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
 net/ipv6/route.c | 82 +++++++++++++++++++++++++-----------------------
 1 file changed, 43 insertions(+), 39 deletions(-)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 210b84cecc24..237e31f64a4a 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -5050,6 +5050,44 @@ static const struct nla_policy rtm_ipv6_policy[RTA_MAX+1] = {
 	[RTA_FLOWLABEL]		= { .type = NLA_BE32 },
 };
 
+static int rtm_to_fib6_multipath_config(struct fib6_config *cfg,
+					struct netlink_ext_ack *extack)
+{
+	struct rtnexthop *rtnh;
+	int remaining;
+
+	remaining = cfg->fc_mp_len;
+	rtnh = (struct rtnexthop *)cfg->fc_mp;
+
+	if (!rtnh_ok(rtnh, remaining)) {
+		NL_SET_ERR_MSG(extack, "Invalid nexthop configuration - no valid nexthops");
+		return -EINVAL;
+	}
+
+	do {
+		int attrlen = rtnh_attrlen(rtnh);
+
+		if (attrlen > 0) {
+			struct nlattr *nla, *attrs;
+
+			attrs = rtnh_attrs(rtnh);
+			nla = nla_find(attrs, attrlen, RTA_GATEWAY);
+			if (nla) {
+				if (nla_len(nla) < sizeof(cfg->fc_gateway)) {
+					NL_SET_ERR_MSG(extack,
+						       "Invalid IPv6 address in RTA_GATEWAY");
+					return -EINVAL;
+				}
+			}
+		}
+
+		rtnh = rtnh_next(rtnh, &remaining);
+	} while (rtnh_ok(rtnh, remaining));
+
+	return lwtunnel_valid_encap_type_attr(cfg->fc_mp, cfg->fc_mp_len,
+					      extack, true);
+}
+
 static int rtm_to_fib6_config(struct sk_buff *skb, struct nlmsghdr *nlh,
 			      struct fib6_config *cfg,
 			      struct netlink_ext_ack *extack)
@@ -5164,9 +5202,7 @@ static int rtm_to_fib6_config(struct sk_buff *skb, struct nlmsghdr *nlh,
 		cfg->fc_mp = nla_data(tb[RTA_MULTIPATH]);
 		cfg->fc_mp_len = nla_len(tb[RTA_MULTIPATH]);
 
-		err = lwtunnel_valid_encap_type_attr(cfg->fc_mp,
-						     cfg->fc_mp_len,
-						     extack, true);
+		err = rtm_to_fib6_multipath_config(cfg, extack);
 		if (err < 0)
 			goto errout;
 	}
@@ -5286,19 +5322,6 @@ static bool ip6_route_mpath_should_notify(const struct fib6_info *rt)
 	return should_notify;
 }
 
-static int fib6_gw_from_attr(struct in6_addr *gw, struct nlattr *nla,
-			     struct netlink_ext_ack *extack)
-{
-	if (nla_len(nla) < sizeof(*gw)) {
-		NL_SET_ERR_MSG(extack, "Invalid IPv6 address in RTA_GATEWAY");
-		return -EINVAL;
-	}
-
-	*gw = nla_get_in6_addr(nla);
-
-	return 0;
-}
-
 static int ip6_route_multipath_add(struct fib6_config *cfg,
 				   struct netlink_ext_ack *extack)
 {
@@ -5339,18 +5362,11 @@ static int ip6_route_multipath_add(struct fib6_config *cfg,
 
 			nla = nla_find(attrs, attrlen, RTA_GATEWAY);
 			if (nla) {
-				err = fib6_gw_from_attr(&r_cfg.fc_gateway, nla,
-							extack);
-				if (err)
-					goto cleanup;
-
+				r_cfg.fc_gateway = nla_get_in6_addr(nla);
 				r_cfg.fc_flags |= RTF_GATEWAY;
 			}
-			r_cfg.fc_encap = nla_find(attrs, attrlen, RTA_ENCAP);
 
-			/* RTA_ENCAP_TYPE length checked in
-			 * lwtunnel_valid_encap_type_attr
-			 */
+			r_cfg.fc_encap = nla_find(attrs, attrlen, RTA_ENCAP);
 			nla = nla_find(attrs, attrlen, RTA_ENCAP_TYPE);
 			if (nla)
 				r_cfg.fc_encap_type = nla_get_u16(nla);
@@ -5383,12 +5399,6 @@ static int ip6_route_multipath_add(struct fib6_config *cfg,
 		rtnh = rtnh_next(rtnh, &remaining);
 	}
 
-	if (list_empty(&rt6_nh_list)) {
-		NL_SET_ERR_MSG(extack,
-			       "Invalid nexthop configuration - no valid nexthops");
-		return -EINVAL;
-	}
-
 	/* for add and replace send one notification with all nexthops.
 	 * Skip the notification in fib6_add_rt2node and send one with
 	 * the full route when done
@@ -5510,21 +5520,15 @@ static int ip6_route_multipath_del(struct fib6_config *cfg,
 
 			nla = nla_find(attrs, attrlen, RTA_GATEWAY);
 			if (nla) {
-				err = fib6_gw_from_attr(&r_cfg.fc_gateway, nla,
-							extack);
-				if (err) {
-					last_err = err;
-					goto next_rtnh;
-				}
-
+				r_cfg.fc_gateway = nla_get_in6_addr(nla);
 				r_cfg.fc_flags |= RTF_GATEWAY;
 			}
 		}
+
 		err = ip6_route_del(&r_cfg, extack);
 		if (err)
 			last_err = err;
 
-next_rtnh:
 		rtnh = rtnh_next(rtnh, &remaining);
 	}
 
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH RESEND v2 net-next 02/14] ipv6: Get rid of RTNL for SIOCDELRT and RTM_DELROUTE.
  2025-04-14 18:14 [PATCH RESEND v2 net-next 00/14] ipv6: No RTNL for IPv6 routing table Kuniyuki Iwashima
  2025-04-14 18:14 ` [PATCH RESEND v2 net-next 01/14] ipv6: Validate RTA_GATEWAY of RTA_MULTIPATH in rtm_to_fib6_config() Kuniyuki Iwashima
@ 2025-04-14 18:14 ` Kuniyuki Iwashima
  2025-04-16  8:49   ` Paolo Abeni
  2025-04-14 18:14 ` [PATCH RESEND v2 net-next 03/14] ipv6: Move some validation from ip6_route_info_create() to rtm_to_fib6_config() Kuniyuki Iwashima
                   ` (11 subsequent siblings)
  13 siblings, 1 reply; 40+ messages in thread
From: Kuniyuki Iwashima @ 2025-04-14 18:14 UTC (permalink / raw)
  To: David S. Miller, David Ahern, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni
  Cc: Simon Horman, Kuniyuki Iwashima, Kuniyuki Iwashima, netdev

Basically, removing an IPv6 route does not require RTNL because
the IPv6 routing tables are protected by per table lock.

inet6_rtm_delroute() calls nexthop_find_by_id() to check if the
nexthop specified by RTA_NH_ID exists.  nexthop uses rbtree and
the top-down walk can be safely performed under RCU.

ip6_route_del() already relies on RCU and the table lock, but we
need to extend the RCU critical section a bit more to cover
__ip6_del_rt().  For example, nexthop_for_each_fib6_nh() and
inet6_rt_notify() needs RCU.

Let's call nexthop_find_by_id() and __ip6_del_rt() under RCU and
get rid of RTNL from inet6_rtm_delroute() and SIOCDELRT.

Even if the nexthop is removed after rcu_read_unlock() in
inet6_rtm_delroute(), __remove_nexthop_fib() cleans up the routes
tied to the nexthop, and ip6_route_del() returns -ESRCH.  So the
request was at least valid as of nexthop_find_by_id(), and it's just
a matter of timing.

Note that we need to pass false to lwtunnel_valid_encap_type_attr().
The following patches also use the newroute bool.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
v2:
  * Call __ip6_del_rt() under RCU
---
 net/ipv6/route.c | 48 ++++++++++++++++++++++++++++--------------------
 1 file changed, 28 insertions(+), 20 deletions(-)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 237e31f64a4a..abb9c07fc034 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -4124,9 +4124,9 @@ static int ip6_route_del(struct fib6_config *cfg,
 			if (rt->nh) {
 				if (!fib6_info_hold_safe(rt))
 					continue;
-				rcu_read_unlock();
 
-				return __ip6_del_rt(rt, &cfg->fc_nlinfo);
+				err =  __ip6_del_rt(rt, &cfg->fc_nlinfo);
+				break;
 			}
 			if (cfg->fc_nh_id)
 				continue;
@@ -4141,13 +4141,13 @@ static int ip6_route_del(struct fib6_config *cfg,
 				continue;
 			if (!fib6_info_hold_safe(rt))
 				continue;
-			rcu_read_unlock();
 
 			/* if gateway was specified only delete the one hop */
 			if (cfg->fc_flags & RTF_GATEWAY)
-				return __ip6_del_rt(rt, &cfg->fc_nlinfo);
-
-			return __ip6_del_rt_siblings(rt, cfg);
+				err = __ip6_del_rt(rt, &cfg->fc_nlinfo);
+			else
+				err = __ip6_del_rt_siblings(rt, cfg);
+			break;
 		}
 	}
 	rcu_read_unlock();
@@ -4516,19 +4516,20 @@ int ipv6_route_ioctl(struct net *net, unsigned int cmd, struct in6_rtmsg *rtmsg)
 
 	rtmsg_to_fib6_config(net, rtmsg, &cfg);
 
-	rtnl_lock();
 	switch (cmd) {
 	case SIOCADDRT:
+		rtnl_lock();
 		/* Only do the default setting of fc_metric in route adding */
 		if (cfg.fc_metric == 0)
 			cfg.fc_metric = IP6_RT_PRIO_USER;
 		err = ip6_route_add(&cfg, GFP_KERNEL, NULL);
+		rtnl_unlock();
 		break;
 	case SIOCDELRT:
 		err = ip6_route_del(&cfg, NULL);
 		break;
 	}
-	rtnl_unlock();
+
 	return err;
 }
 
@@ -5051,7 +5052,8 @@ static const struct nla_policy rtm_ipv6_policy[RTA_MAX+1] = {
 };
 
 static int rtm_to_fib6_multipath_config(struct fib6_config *cfg,
-					struct netlink_ext_ack *extack)
+					struct netlink_ext_ack *extack,
+					bool newroute)
 {
 	struct rtnexthop *rtnh;
 	int remaining;
@@ -5085,15 +5087,16 @@ static int rtm_to_fib6_multipath_config(struct fib6_config *cfg,
 	} while (rtnh_ok(rtnh, remaining));
 
 	return lwtunnel_valid_encap_type_attr(cfg->fc_mp, cfg->fc_mp_len,
-					      extack, true);
+					      extack, newroute);
 }
 
 static int rtm_to_fib6_config(struct sk_buff *skb, struct nlmsghdr *nlh,
 			      struct fib6_config *cfg,
 			      struct netlink_ext_ack *extack)
 {
-	struct rtmsg *rtm;
+	bool newroute = nlh->nlmsg_type == RTM_NEWROUTE;
 	struct nlattr *tb[RTA_MAX+1];
+	struct rtmsg *rtm;
 	unsigned int pref;
 	int err;
 
@@ -5202,7 +5205,7 @@ static int rtm_to_fib6_config(struct sk_buff *skb, struct nlmsghdr *nlh,
 		cfg->fc_mp = nla_data(tb[RTA_MULTIPATH]);
 		cfg->fc_mp_len = nla_len(tb[RTA_MULTIPATH]);
 
-		err = rtm_to_fib6_multipath_config(cfg, extack);
+		err = rtm_to_fib6_multipath_config(cfg, extack, newroute);
 		if (err < 0)
 			goto errout;
 	}
@@ -5222,7 +5225,7 @@ static int rtm_to_fib6_config(struct sk_buff *skb, struct nlmsghdr *nlh,
 		cfg->fc_encap_type = nla_get_u16(tb[RTA_ENCAP_TYPE]);
 
 		err = lwtunnel_valid_encap_type(cfg->fc_encap_type,
-						extack, true);
+						extack, newroute);
 		if (err < 0)
 			goto errout;
 	}
@@ -5545,15 +5548,20 @@ static int inet6_rtm_delroute(struct sk_buff *skb, struct nlmsghdr *nlh,
 	if (err < 0)
 		return err;
 
-	if (cfg.fc_nh_id &&
-	    !nexthop_find_by_id(sock_net(skb->sk), cfg.fc_nh_id)) {
-		NL_SET_ERR_MSG(extack, "Nexthop id does not exist");
-		return -EINVAL;
+	if (cfg.fc_nh_id) {
+		rcu_read_lock();
+		err = !nexthop_find_by_id(sock_net(skb->sk), cfg.fc_nh_id);
+		rcu_read_unlock();
+
+		if (err) {
+			NL_SET_ERR_MSG(extack, "Nexthop id does not exist");
+			return -EINVAL;
+		}
 	}
 
-	if (cfg.fc_mp)
+	if (cfg.fc_mp) {
 		return ip6_route_multipath_del(&cfg, extack);
-	else {
+	} else {
 		cfg.fc_delete_all_nh = 1;
 		return ip6_route_del(&cfg, extack);
 	}
@@ -6765,7 +6773,7 @@ static const struct rtnl_msg_handler ip6_route_rtnl_msg_handlers[] __initconst_o
 	{.owner = THIS_MODULE, .protocol = PF_INET6, .msgtype = RTM_NEWROUTE,
 	 .doit = inet6_rtm_newroute},
 	{.owner = THIS_MODULE, .protocol = PF_INET6, .msgtype = RTM_DELROUTE,
-	 .doit = inet6_rtm_delroute},
+	 .doit = inet6_rtm_delroute, .flags = RTNL_FLAG_DOIT_UNLOCKED},
 	{.owner = THIS_MODULE, .protocol = PF_INET6, .msgtype = RTM_GETROUTE,
 	 .doit = inet6_rtm_getroute, .flags = RTNL_FLAG_DOIT_UNLOCKED},
 };
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH RESEND v2 net-next 03/14] ipv6: Move some validation from ip6_route_info_create() to rtm_to_fib6_config().
  2025-04-14 18:14 [PATCH RESEND v2 net-next 00/14] ipv6: No RTNL for IPv6 routing table Kuniyuki Iwashima
  2025-04-14 18:14 ` [PATCH RESEND v2 net-next 01/14] ipv6: Validate RTA_GATEWAY of RTA_MULTIPATH in rtm_to_fib6_config() Kuniyuki Iwashima
  2025-04-14 18:14 ` [PATCH RESEND v2 net-next 02/14] ipv6: Get rid of RTNL for SIOCDELRT and RTM_DELROUTE Kuniyuki Iwashima
@ 2025-04-14 18:14 ` Kuniyuki Iwashima
  2025-04-16  8:54   ` Paolo Abeni
  2025-04-14 18:14 ` [PATCH RESEND v2 net-next 04/14] ipv6: Check GATEWAY in rtm_to_fib6_multipath_config() Kuniyuki Iwashima
                   ` (10 subsequent siblings)
  13 siblings, 1 reply; 40+ messages in thread
From: Kuniyuki Iwashima @ 2025-04-14 18:14 UTC (permalink / raw)
  To: David S. Miller, David Ahern, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni
  Cc: Simon Horman, Kuniyuki Iwashima, Kuniyuki Iwashima, netdev

ip6_route_info_create() is called from 3 functions:

  * ip6_route_add()
  * ip6_route_multipath_add()
  * addrconf_f6i_alloc()

addrconf_f6i_alloc() does not need validation for struct fib6_config in
ip6_route_info_create().

ip6_route_multipath_add() calls ip6_route_info_create() for multiple
routes with slightly different fib6_config instances, which is copied
from the base config passed from userspace.  So, we need not validate
the same config repeatedly.

Let's move such validation into rtm_to_fib6_config().

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
 net/ipv6/route.c | 79 +++++++++++++++++++++++++-----------------------
 1 file changed, 42 insertions(+), 37 deletions(-)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index abb9c07fc034..cb29b1f5fb1d 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -3739,38 +3739,6 @@ static struct fib6_info *ip6_route_info_create(struct fib6_config *cfg,
 	int err = -EINVAL;
 	int addr_type;
 
-	/* RTF_PCPU is an internal flag; can not be set by userspace */
-	if (cfg->fc_flags & RTF_PCPU) {
-		NL_SET_ERR_MSG(extack, "Userspace can not set RTF_PCPU");
-		goto out;
-	}
-
-	/* RTF_CACHE is an internal flag; can not be set by userspace */
-	if (cfg->fc_flags & RTF_CACHE) {
-		NL_SET_ERR_MSG(extack, "Userspace can not set RTF_CACHE");
-		goto out;
-	}
-
-	if (cfg->fc_type > RTN_MAX) {
-		NL_SET_ERR_MSG(extack, "Invalid route type");
-		goto out;
-	}
-
-	if (cfg->fc_dst_len > 128) {
-		NL_SET_ERR_MSG(extack, "Invalid prefix length");
-		goto out;
-	}
-	if (cfg->fc_src_len > 128) {
-		NL_SET_ERR_MSG(extack, "Invalid source address length");
-		goto out;
-	}
-#ifndef CONFIG_IPV6_SUBTREES
-	if (cfg->fc_src_len) {
-		NL_SET_ERR_MSG(extack,
-			       "Specifying source address requires IPV6_SUBTREES to be enabled");
-		goto out;
-	}
-#endif
 	if (cfg->fc_nh_id) {
 		nh = nexthop_find_by_id(net, cfg->fc_nh_id);
 		if (!nh) {
@@ -3835,11 +3803,6 @@ static struct fib6_info *ip6_route_info_create(struct fib6_config *cfg,
 	rt->fib6_src.plen = cfg->fc_src_len;
 #endif
 	if (nh) {
-		if (rt->fib6_src.plen) {
-			NL_SET_ERR_MSG(extack, "Nexthops can not be used with source routing");
-			err = -EINVAL;
-			goto out_free;
-		}
 		if (!nexthop_get(nh)) {
 			NL_SET_ERR_MSG(extack, "Nexthop has been deleted");
 			err = -ENOENT;
@@ -5239,6 +5202,48 @@ static int rtm_to_fib6_config(struct sk_buff *skb, struct nlmsghdr *nlh,
 		}
 	}
 
+	if (newroute) {
+		/* RTF_PCPU is an internal flag; can not be set by userspace */
+		if (cfg->fc_flags & RTF_PCPU) {
+			NL_SET_ERR_MSG(extack, "Userspace can not set RTF_PCPU");
+			goto errout;
+		}
+
+		/* RTF_CACHE is an internal flag; can not be set by userspace */
+		if (cfg->fc_flags & RTF_CACHE) {
+			NL_SET_ERR_MSG(extack, "Userspace can not set RTF_CACHE");
+			goto errout;
+		}
+
+		if (cfg->fc_type > RTN_MAX) {
+			NL_SET_ERR_MSG(extack, "Invalid route type");
+			goto errout;
+		}
+
+		if (cfg->fc_dst_len > 128) {
+			NL_SET_ERR_MSG(extack, "Invalid prefix length");
+			goto errout;
+		}
+
+#ifdef CONFIG_IPV6_SUBTREES
+		if (cfg->fc_src_len > 128) {
+			NL_SET_ERR_MSG(extack, "Invalid source address length");
+			goto errout;
+		}
+
+		if (cfg->fc_nh_id &&  cfg->fc_src_len) {
+			NL_SET_ERR_MSG(extack, "Nexthops can not be used with source routing");
+			goto errout;
+		}
+#else
+		if (cfg->fc_src_len) {
+			NL_SET_ERR_MSG(extack,
+				       "Specifying source address requires IPV6_SUBTREES to be enabled");
+			goto errout;
+		}
+#endif
+	}
+
 	err = 0;
 errout:
 	return err;
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH RESEND v2 net-next 04/14] ipv6: Check GATEWAY in rtm_to_fib6_multipath_config().
  2025-04-14 18:14 [PATCH RESEND v2 net-next 00/14] ipv6: No RTNL for IPv6 routing table Kuniyuki Iwashima
                   ` (2 preceding siblings ...)
  2025-04-14 18:14 ` [PATCH RESEND v2 net-next 03/14] ipv6: Move some validation from ip6_route_info_create() to rtm_to_fib6_config() Kuniyuki Iwashima
@ 2025-04-14 18:14 ` Kuniyuki Iwashima
  2025-04-16  9:06   ` Paolo Abeni
  2025-04-14 18:14 ` [PATCH RESEND v2 net-next 05/14] ipv6: Move nexthop_find_by_id() after fib6_info_alloc() Kuniyuki Iwashima
                   ` (9 subsequent siblings)
  13 siblings, 1 reply; 40+ messages in thread
From: Kuniyuki Iwashima @ 2025-04-14 18:14 UTC (permalink / raw)
  To: David S. Miller, David Ahern, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni
  Cc: Simon Horman, Kuniyuki Iwashima, Kuniyuki Iwashima, netdev

In ip6_route_multipath_add(), we call rt6_qualify_for_ecmp() for each
entry.  If it returns false, the request fails.

rt6_qualify_for_ecmp() returns false if either of the conditions below
is true:

  1. f6i->fib6_flags has RTF_ADDRCONF
  2. f6i->nh is not NULL
  3. f6i->fib6_nh->fib_nh_gw_family is AF_UNSPEC

1 is unnecessary because rtm_to_fib6_config() never sets RTF_ADDRCONF
to cfg->fc_flags.

2. is equivalent with cfg->fc_nh_id.

3. can be replaced by checking RTF_GATEWAY in the base and each multipath
entry because AF_INET6 is set to f6i->fib6_nh->fib_nh_gw_family only when
cfg.fc_is_fdb is true or RTF_GATEWAY is set, but the former is always
false.

Let's perform the equivalent checks in rtm_to_fib6_multipath_config().

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
 net/ipv6/route.c | 16 +++++++++-------
 1 file changed, 9 insertions(+), 7 deletions(-)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index cb29b1f5fb1d..7e35187594e6 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -5030,6 +5030,7 @@ static int rtm_to_fib6_multipath_config(struct fib6_config *cfg,
 	}
 
 	do {
+		bool has_gateway = cfg->fc_flags & RTF_GATEWAY;
 		int attrlen = rtnh_attrlen(rtnh);
 
 		if (attrlen > 0) {
@@ -5043,9 +5044,17 @@ static int rtm_to_fib6_multipath_config(struct fib6_config *cfg,
 						       "Invalid IPv6 address in RTA_GATEWAY");
 					return -EINVAL;
 				}
+
+				has_gateway = true;
 			}
 		}
 
+		if (newroute && (cfg->fc_nh_id || !has_gateway)) {
+			NL_SET_ERR_MSG(extack,
+				       "Device only routes can not be added for IPv6 using the multipath API.");
+			return -EINVAL;
+		}
+
 		rtnh = rtnh_next(rtnh, &remaining);
 	} while (rtnh_ok(rtnh, remaining));
 
@@ -5387,13 +5396,6 @@ static int ip6_route_multipath_add(struct fib6_config *cfg,
 			rt = NULL;
 			goto cleanup;
 		}
-		if (!rt6_qualify_for_ecmp(rt)) {
-			err = -EINVAL;
-			NL_SET_ERR_MSG(extack,
-				       "Device only routes can not be added for IPv6 using the multipath API.");
-			fib6_info_release(rt);
-			goto cleanup;
-		}
 
 		rt->fib6_nh->fib_nh_weight = rtnh->rtnh_hops + 1;
 
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH RESEND v2 net-next 05/14] ipv6: Move nexthop_find_by_id() after fib6_info_alloc().
  2025-04-14 18:14 [PATCH RESEND v2 net-next 00/14] ipv6: No RTNL for IPv6 routing table Kuniyuki Iwashima
                   ` (3 preceding siblings ...)
  2025-04-14 18:14 ` [PATCH RESEND v2 net-next 04/14] ipv6: Check GATEWAY in rtm_to_fib6_multipath_config() Kuniyuki Iwashima
@ 2025-04-14 18:14 ` Kuniyuki Iwashima
  2025-04-16  9:13   ` Paolo Abeni
  2025-04-14 18:14 ` [PATCH RESEND v2 net-next 06/14] ipv6: Split ip6_route_info_create() Kuniyuki Iwashima
                   ` (8 subsequent siblings)
  13 siblings, 1 reply; 40+ messages in thread
From: Kuniyuki Iwashima @ 2025-04-14 18:14 UTC (permalink / raw)
  To: David S. Miller, David Ahern, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni
  Cc: Simon Horman, Kuniyuki Iwashima, Kuniyuki Iwashima, netdev

We will get rid of RTNL from RTM_NEWROUTE and SIOCADDRT.

Then, we must perform two lookups for nexthop and dev under RCU
to guarantee their lifetime.

ip6_route_info_create() calls nexthop_find_by_id() first if
RTA_NH_ID is specified, and then allocates struct fib6_info.

nexthop_find_by_id() must be called under RCU, but we do not want
to use GFP_ATOMIC for memory allocation here, which will be likely
to fail in ip6_route_multipath_add().

Let's move nexthop_find_by_id() after the memory allocation so
that we can later split ip6_route_info_create() into two parts:
the sleepable part and the RCU part.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
 net/ipv6/route.c | 34 ++++++++++++++++++----------------
 1 file changed, 18 insertions(+), 16 deletions(-)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 7e35187594e6..808b126de5b3 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -3733,24 +3733,11 @@ static struct fib6_info *ip6_route_info_create(struct fib6_config *cfg,
 {
 	struct net *net = cfg->fc_nlinfo.nl_net;
 	struct fib6_info *rt = NULL;
-	struct nexthop *nh = NULL;
 	struct fib6_table *table;
 	struct fib6_nh *fib6_nh;
-	int err = -EINVAL;
+	int err = -ENOBUFS;
 	int addr_type;
 
-	if (cfg->fc_nh_id) {
-		nh = nexthop_find_by_id(net, cfg->fc_nh_id);
-		if (!nh) {
-			NL_SET_ERR_MSG(extack, "Nexthop id does not exist");
-			goto out;
-		}
-		err = fib6_check_nexthop(nh, cfg, extack);
-		if (err)
-			goto out;
-	}
-
-	err = -ENOBUFS;
 	if (cfg->fc_nlinfo.nlh &&
 	    !(cfg->fc_nlinfo.nlh->nlmsg_flags & NLM_F_CREATE)) {
 		table = fib6_get_table(net, cfg->fc_table);
@@ -3766,7 +3753,7 @@ static struct fib6_info *ip6_route_info_create(struct fib6_config *cfg,
 		goto out;
 
 	err = -ENOMEM;
-	rt = fib6_info_alloc(gfp_flags, !nh);
+	rt = fib6_info_alloc(gfp_flags, !cfg->fc_nh_id);
 	if (!rt)
 		goto out;
 
@@ -3802,12 +3789,27 @@ static struct fib6_info *ip6_route_info_create(struct fib6_config *cfg,
 	ipv6_addr_prefix(&rt->fib6_src.addr, &cfg->fc_src, cfg->fc_src_len);
 	rt->fib6_src.plen = cfg->fc_src_len;
 #endif
-	if (nh) {
+
+	if (cfg->fc_nh_id) {
+		struct nexthop *nh;
+
+		nh = nexthop_find_by_id(net, cfg->fc_nh_id);
+		if (!nh) {
+			err = -EINVAL;
+			NL_SET_ERR_MSG(extack, "Nexthop id does not exist");
+			goto out_free;
+		}
+
+		err = fib6_check_nexthop(nh, cfg, extack);
+		if (err)
+			goto out_free;
+
 		if (!nexthop_get(nh)) {
 			NL_SET_ERR_MSG(extack, "Nexthop has been deleted");
 			err = -ENOENT;
 			goto out_free;
 		}
+
 		rt->nh = nh;
 		fib6_nh = nexthop_fib6_nh(rt->nh);
 	} else {
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH RESEND v2 net-next 06/14] ipv6: Split ip6_route_info_create().
  2025-04-14 18:14 [PATCH RESEND v2 net-next 00/14] ipv6: No RTNL for IPv6 routing table Kuniyuki Iwashima
                   ` (4 preceding siblings ...)
  2025-04-14 18:14 ` [PATCH RESEND v2 net-next 05/14] ipv6: Move nexthop_find_by_id() after fib6_info_alloc() Kuniyuki Iwashima
@ 2025-04-14 18:14 ` Kuniyuki Iwashima
  2025-04-16  9:12   ` Paolo Abeni
  2025-04-14 18:14 ` [PATCH RESEND v2 net-next 07/14] ipv6: Preallocate rt->fib6_nh->rt6i_pcpu in ip6_route_info_create() Kuniyuki Iwashima
                   ` (7 subsequent siblings)
  13 siblings, 1 reply; 40+ messages in thread
From: Kuniyuki Iwashima @ 2025-04-14 18:14 UTC (permalink / raw)
  To: David S. Miller, David Ahern, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni
  Cc: Simon Horman, Kuniyuki Iwashima, Kuniyuki Iwashima, netdev

We will get rid of RTNL from RTM_NEWROUTE and SIOCADDRT and rely
on RCU to guarantee dev and nexthop lifetime.

Then, we want to allocate everything as possible before entering
the RCU section.

The RCU section will start in the middle of ip6_route_info_create(),
and this is problematic for ip6_route_multipath_add() that calls
ip6_route_info_create() multiple times.

Let's split ip6_route_info_create() into two parts; one for memory
allocation and another for nexthop setup.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
 net/ipv6/route.c | 95 +++++++++++++++++++++++++++++++-----------------
 1 file changed, 62 insertions(+), 33 deletions(-)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 808b126de5b3..ce060b59d41a 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -3728,15 +3728,13 @@ void fib6_nh_release_dsts(struct fib6_nh *fib6_nh)
 }
 
 static struct fib6_info *ip6_route_info_create(struct fib6_config *cfg,
-					      gfp_t gfp_flags,
-					      struct netlink_ext_ack *extack)
+					       gfp_t gfp_flags,
+					       struct netlink_ext_ack *extack)
 {
 	struct net *net = cfg->fc_nlinfo.nl_net;
-	struct fib6_info *rt = NULL;
 	struct fib6_table *table;
-	struct fib6_nh *fib6_nh;
-	int err = -ENOBUFS;
-	int addr_type;
+	struct fib6_info *rt;
+	int err;
 
 	if (cfg->fc_nlinfo.nlh &&
 	    !(cfg->fc_nlinfo.nlh->nlmsg_flags & NLM_F_CREATE)) {
@@ -3748,22 +3746,22 @@ static struct fib6_info *ip6_route_info_create(struct fib6_config *cfg,
 	} else {
 		table = fib6_new_table(net, cfg->fc_table);
 	}
+	if (!table) {
+		err = -ENOBUFS;
+		goto err;
+	}
 
-	if (!table)
-		goto out;
-
-	err = -ENOMEM;
 	rt = fib6_info_alloc(gfp_flags, !cfg->fc_nh_id);
-	if (!rt)
-		goto out;
+	if (!rt) {
+		err = -ENOMEM;
+		goto err;
+	}
 
 	rt->fib6_metrics = ip_fib_metrics_init(cfg->fc_mx, cfg->fc_mx_len,
 					       extack);
 	if (IS_ERR(rt->fib6_metrics)) {
 		err = PTR_ERR(rt->fib6_metrics);
-		/* Do not leave garbage there. */
-		rt->fib6_metrics = (struct dst_metrics *)&dst_default_metrics;
-		goto out_free;
+		goto free;
 	}
 
 	if (cfg->fc_flags & RTF_ADDRCONF)
@@ -3771,12 +3769,12 @@ static struct fib6_info *ip6_route_info_create(struct fib6_config *cfg,
 
 	if (cfg->fc_flags & RTF_EXPIRES)
 		fib6_set_expires(rt, jiffies +
-				clock_t_to_jiffies(cfg->fc_expires));
+				 clock_t_to_jiffies(cfg->fc_expires));
 
 	if (cfg->fc_protocol == RTPROT_UNSPEC)
 		cfg->fc_protocol = RTPROT_BOOT;
-	rt->fib6_protocol = cfg->fc_protocol;
 
+	rt->fib6_protocol = cfg->fc_protocol;
 	rt->fib6_table = table;
 	rt->fib6_metric = cfg->fc_metric;
 	rt->fib6_type = cfg->fc_type ? : RTN_UNICAST;
@@ -3789,6 +3787,20 @@ static struct fib6_info *ip6_route_info_create(struct fib6_config *cfg,
 	ipv6_addr_prefix(&rt->fib6_src.addr, &cfg->fc_src, cfg->fc_src_len);
 	rt->fib6_src.plen = cfg->fc_src_len;
 #endif
+	return rt;
+free:
+	kfree(rt);
+err:
+	return ERR_PTR(err);
+}
+
+static int ip6_route_info_create_nh(struct fib6_info *rt,
+				    struct fib6_config *cfg,
+				    struct netlink_ext_ack *extack)
+{
+	struct net *net = cfg->fc_nlinfo.nl_net;
+	struct fib6_nh *fib6_nh;
+	int err;
 
 	if (cfg->fc_nh_id) {
 		struct nexthop *nh;
@@ -3813,9 +3825,11 @@ static struct fib6_info *ip6_route_info_create(struct fib6_config *cfg,
 		rt->nh = nh;
 		fib6_nh = nexthop_fib6_nh(rt->nh);
 	} else {
-		err = fib6_nh_init(net, rt->fib6_nh, cfg, gfp_flags, extack);
+		int addr_type;
+
+		err = fib6_nh_init(net, rt->fib6_nh, cfg, GFP_ATOMIC, extack);
 		if (err)
-			goto out;
+			goto out_release;
 
 		fib6_nh = rt->fib6_nh;
 
@@ -3834,21 +3848,20 @@ static struct fib6_info *ip6_route_info_create(struct fib6_config *cfg,
 		if (!ipv6_chk_addr(net, &cfg->fc_prefsrc, dev, 0)) {
 			NL_SET_ERR_MSG(extack, "Invalid source address");
 			err = -EINVAL;
-			goto out;
+			goto out_release;
 		}
 		rt->fib6_prefsrc.addr = cfg->fc_prefsrc;
 		rt->fib6_prefsrc.plen = 128;
-	} else
-		rt->fib6_prefsrc.plen = 0;
+	}
 
-	return rt;
-out:
+	return 0;
+out_release:
 	fib6_info_release(rt);
-	return ERR_PTR(err);
+	return err;
 out_free:
 	ip_fib_metrics_put(rt->fib6_metrics);
 	kfree(rt);
-	return ERR_PTR(err);
+	return err;
 }
 
 int ip6_route_add(struct fib6_config *cfg, gfp_t gfp_flags,
@@ -3861,6 +3874,10 @@ int ip6_route_add(struct fib6_config *cfg, gfp_t gfp_flags,
 	if (IS_ERR(rt))
 		return PTR_ERR(rt);
 
+	err = ip6_route_info_create_nh(rt, cfg, extack);
+	if (err)
+		return err;
+
 	err = __ip6_ins_rt(rt, &cfg->fc_nlinfo, extack);
 	fib6_info_release(rt);
 
@@ -4584,6 +4601,7 @@ struct fib6_info *addrconf_f6i_alloc(struct net *net,
 		.fc_ignore_dev_down = true,
 	};
 	struct fib6_info *f6i;
+	int err;
 
 	if (anycast) {
 		cfg.fc_type = RTN_ANYCAST;
@@ -4594,14 +4612,19 @@ struct fib6_info *addrconf_f6i_alloc(struct net *net,
 	}
 
 	f6i = ip6_route_info_create(&cfg, gfp_flags, extack);
-	if (!IS_ERR(f6i)) {
-		f6i->dst_nocount = true;
+	if (IS_ERR(f6i))
+		return f6i;
 
-		if (!anycast &&
-		    (READ_ONCE(net->ipv6.devconf_all->disable_policy) ||
-		     READ_ONCE(idev->cnf.disable_policy)))
-			f6i->dst_nopolicy = true;
-	}
+	err = ip6_route_info_create_nh(f6i, &cfg, extack);
+	if (err)
+		return ERR_PTR(err);
+
+	f6i->dst_nocount = true;
+
+	if (!anycast &&
+	    (READ_ONCE(net->ipv6.devconf_all->disable_policy) ||
+	     READ_ONCE(idev->cnf.disable_policy)))
+		f6i->dst_nopolicy = true;
 
 	return f6i;
 }
@@ -5399,6 +5422,12 @@ static int ip6_route_multipath_add(struct fib6_config *cfg,
 			goto cleanup;
 		}
 
+		err = ip6_route_info_create_nh(rt, &r_cfg, extack);
+		if (err) {
+			rt = NULL;
+			goto cleanup;
+		}
+
 		rt->fib6_nh->fib_nh_weight = rtnh->rtnh_hops + 1;
 
 		err = ip6_route_info_append(info->nl_net, &rt6_nh_list,
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH RESEND v2 net-next 07/14] ipv6: Preallocate rt->fib6_nh->rt6i_pcpu in ip6_route_info_create().
  2025-04-14 18:14 [PATCH RESEND v2 net-next 00/14] ipv6: No RTNL for IPv6 routing table Kuniyuki Iwashima
                   ` (5 preceding siblings ...)
  2025-04-14 18:14 ` [PATCH RESEND v2 net-next 06/14] ipv6: Split ip6_route_info_create() Kuniyuki Iwashima
@ 2025-04-14 18:14 ` Kuniyuki Iwashima
  2025-04-16  9:21   ` Paolo Abeni
  2025-04-16  9:23   ` Paolo Abeni
  2025-04-14 18:14 ` [PATCH RESEND v2 net-next 08/14] ipv6: Preallocate nhc_pcpu_rth_output " Kuniyuki Iwashima
                   ` (6 subsequent siblings)
  13 siblings, 2 replies; 40+ messages in thread
From: Kuniyuki Iwashima @ 2025-04-14 18:14 UTC (permalink / raw)
  To: David S. Miller, David Ahern, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni
  Cc: Simon Horman, Kuniyuki Iwashima, Kuniyuki Iwashima, netdev

ip6_route_info_create_nh() will be called under RCU.

Then, fib6_nh_init() is also under RCU, but per-cpu memory allocation
is very likely to fail with GFP_ATOMIC while bluk-adding IPv6 routes
and we would see a bunch of this message in dmesg.

  percpu: allocation failed, size=8 align=8 atomic=1, atomic alloc failed, no space left
  percpu: allocation failed, size=8 align=8 atomic=1, atomic alloc failed, no space left

Let's preallocate rt->fib6_nh->rt6i_pcpu in ip6_route_info_create().

If something fails before the original memory allocation in
fib6_nh_init(), ip6_route_info_create_nh() calls fib6_info_release(),
which releases the preallocated per-cpu memory.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
 net/ipv6/route.c | 25 ++++++++++++++++++++++---
 1 file changed, 22 insertions(+), 3 deletions(-)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index ce060b59d41a..404da01a7502 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -3664,10 +3664,12 @@ int fib6_nh_init(struct net *net, struct fib6_nh *fib6_nh,
 		goto out;
 
 pcpu_alloc:
-	fib6_nh->rt6i_pcpu = alloc_percpu_gfp(struct rt6_info *, gfp_flags);
 	if (!fib6_nh->rt6i_pcpu) {
-		err = -ENOMEM;
-		goto out;
+		fib6_nh->rt6i_pcpu = alloc_percpu_gfp(struct rt6_info *, gfp_flags);
+		if (!fib6_nh->rt6i_pcpu) {
+			err = -ENOMEM;
+			goto out;
+		}
 	}
 
 	fib6_nh->fib_nh_dev = dev;
@@ -3727,6 +3729,15 @@ void fib6_nh_release_dsts(struct fib6_nh *fib6_nh)
 	}
 }
 
+static int fib6_nh_prealloc_percpu(struct fib6_nh *fib6_nh, gfp_t gfp_flags)
+{
+	fib6_nh->rt6i_pcpu = alloc_percpu_gfp(struct rt6_info *, gfp_flags);
+	if (!fib6_nh->rt6i_pcpu)
+		return -ENOMEM;
+
+	return 0;
+}
+
 static struct fib6_info *ip6_route_info_create(struct fib6_config *cfg,
 					       gfp_t gfp_flags,
 					       struct netlink_ext_ack *extack)
@@ -3764,6 +3775,12 @@ static struct fib6_info *ip6_route_info_create(struct fib6_config *cfg,
 		goto free;
 	}
 
+	if (!cfg->fc_nh_id) {
+		err = fib6_nh_prealloc_percpu(&rt->fib6_nh[0], gfp_flags);
+		if (err)
+			goto free_metrics;
+	}
+
 	if (cfg->fc_flags & RTF_ADDRCONF)
 		rt->dst_nocount = true;
 
@@ -3788,6 +3805,8 @@ static struct fib6_info *ip6_route_info_create(struct fib6_config *cfg,
 	rt->fib6_src.plen = cfg->fc_src_len;
 #endif
 	return rt;
+free_metrics:
+	ip_fib_metrics_put(rt->fib6_metrics);
 free:
 	kfree(rt);
 err:
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH RESEND v2 net-next 08/14] ipv6: Preallocate nhc_pcpu_rth_output in ip6_route_info_create().
  2025-04-14 18:14 [PATCH RESEND v2 net-next 00/14] ipv6: No RTNL for IPv6 routing table Kuniyuki Iwashima
                   ` (6 preceding siblings ...)
  2025-04-14 18:14 ` [PATCH RESEND v2 net-next 07/14] ipv6: Preallocate rt->fib6_nh->rt6i_pcpu in ip6_route_info_create() Kuniyuki Iwashima
@ 2025-04-14 18:14 ` Kuniyuki Iwashima
  2025-04-14 18:14 ` [PATCH RESEND v2 net-next 09/14] ipv6: Don't pass net to ip6_route_info_append() Kuniyuki Iwashima
                   ` (5 subsequent siblings)
  13 siblings, 0 replies; 40+ messages in thread
From: Kuniyuki Iwashima @ 2025-04-14 18:14 UTC (permalink / raw)
  To: David S. Miller, David Ahern, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni
  Cc: Simon Horman, Kuniyuki Iwashima, Kuniyuki Iwashima, netdev

ip6_route_info_create_nh() will be called under RCU.

It calls fib_nh_common_init() and allocates nhc->nhc_pcpu_rth_output.

As with the reason for rt->fib6_nh->rt6i_pcpu, we want to avoid
GFP_ATOMIC allocation for nhc->nhc_pcpu_rth_output under RCU.

Let's preallocate it in ip6_route_info_create().

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
 net/ipv4/fib_semantics.c | 10 ++++++----
 net/ipv6/route.c         |  9 +++++++++
 2 files changed, 15 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index f68bb9e34c34..5326f1501af0 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -617,10 +617,12 @@ int fib_nh_common_init(struct net *net, struct fib_nh_common *nhc,
 {
 	int err;
 
-	nhc->nhc_pcpu_rth_output = alloc_percpu_gfp(struct rtable __rcu *,
-						    gfp_flags);
-	if (!nhc->nhc_pcpu_rth_output)
-		return -ENOMEM;
+	if (!nhc->nhc_pcpu_rth_output) {
+		nhc->nhc_pcpu_rth_output = alloc_percpu_gfp(struct rtable __rcu *,
+							    gfp_flags);
+		if (!nhc->nhc_pcpu_rth_output)
+			return -ENOMEM;
+	}
 
 	if (encap) {
 		struct lwtunnel_state *lwtstate;
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 404da01a7502..7b33f46fdae4 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -3731,10 +3731,19 @@ void fib6_nh_release_dsts(struct fib6_nh *fib6_nh)
 
 static int fib6_nh_prealloc_percpu(struct fib6_nh *fib6_nh, gfp_t gfp_flags)
 {
+	struct fib_nh_common *nhc = &fib6_nh->nh_common;
+
 	fib6_nh->rt6i_pcpu = alloc_percpu_gfp(struct rt6_info *, gfp_flags);
 	if (!fib6_nh->rt6i_pcpu)
 		return -ENOMEM;
 
+	nhc->nhc_pcpu_rth_output = alloc_percpu_gfp(struct rtable __rcu *,
+						    gfp_flags);
+	if (!nhc->nhc_pcpu_rth_output) {
+		free_percpu(fib6_nh->rt6i_pcpu);
+		return -ENOMEM;
+	}
+
 	return 0;
 }
 
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH RESEND v2 net-next 09/14] ipv6: Don't pass net to ip6_route_info_append().
  2025-04-14 18:14 [PATCH RESEND v2 net-next 00/14] ipv6: No RTNL for IPv6 routing table Kuniyuki Iwashima
                   ` (7 preceding siblings ...)
  2025-04-14 18:14 ` [PATCH RESEND v2 net-next 08/14] ipv6: Preallocate nhc_pcpu_rth_output " Kuniyuki Iwashima
@ 2025-04-14 18:14 ` Kuniyuki Iwashima
  2025-04-16  9:39   ` Paolo Abeni
  2025-04-14 18:14 ` [PATCH RESEND v2 net-next 10/14] ipv6: Factorise ip6_route_multipath_add() Kuniyuki Iwashima
                   ` (4 subsequent siblings)
  13 siblings, 1 reply; 40+ messages in thread
From: Kuniyuki Iwashima @ 2025-04-14 18:14 UTC (permalink / raw)
  To: David S. Miller, David Ahern, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni
  Cc: Simon Horman, Kuniyuki Iwashima, Kuniyuki Iwashima, netdev

net is not used in ip6_route_info_append() after commit 36f19d5b4f99
("net/ipv6: Remove extra call to ip6_convert_metrics for multipath case").

Let's remove the argument.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
 net/ipv6/route.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 7b33f46fdae4..49eea7e1e2da 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -5317,8 +5317,7 @@ struct rt6_nh {
 	struct list_head next;
 };
 
-static int ip6_route_info_append(struct net *net,
-				 struct list_head *rt6_nh_list,
+static int ip6_route_info_append(struct list_head *rt6_nh_list,
 				 struct fib6_info *rt,
 				 struct fib6_config *r_cfg)
 {
@@ -5458,8 +5457,7 @@ static int ip6_route_multipath_add(struct fib6_config *cfg,
 
 		rt->fib6_nh->fib_nh_weight = rtnh->rtnh_hops + 1;
 
-		err = ip6_route_info_append(info->nl_net, &rt6_nh_list,
-					    rt, &r_cfg);
+		err = ip6_route_info_append(&rt6_nh_list, rt, &r_cfg);
 		if (err) {
 			fib6_info_release(rt);
 			goto cleanup;
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH RESEND v2 net-next 10/14] ipv6: Factorise ip6_route_multipath_add().
  2025-04-14 18:14 [PATCH RESEND v2 net-next 00/14] ipv6: No RTNL for IPv6 routing table Kuniyuki Iwashima
                   ` (8 preceding siblings ...)
  2025-04-14 18:14 ` [PATCH RESEND v2 net-next 09/14] ipv6: Don't pass net to ip6_route_info_append() Kuniyuki Iwashima
@ 2025-04-14 18:14 ` Kuniyuki Iwashima
  2025-04-16  9:57   ` Paolo Abeni
  2025-04-14 18:14 ` [PATCH RESEND v2 net-next 11/14] ipv6: Protect fib6_link_table() with spinlock Kuniyuki Iwashima
                   ` (3 subsequent siblings)
  13 siblings, 1 reply; 40+ messages in thread
From: Kuniyuki Iwashima @ 2025-04-14 18:14 UTC (permalink / raw)
  To: David S. Miller, David Ahern, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni
  Cc: Simon Horman, Kuniyuki Iwashima, Kuniyuki Iwashima, netdev

We will get rid of RTNL from RTM_NEWROUTE and SIOCADDRT and rely
on RCU to guarantee dev and nexthop lifetime.

Then, the RCU section will start before ip6_route_info_create_nh()
in ip6_route_multipath_add(), but ip6_route_info_create() is called
in the same loop and will sleep.

Let's split the loop into ip6_route_mpath_info_create() and
ip6_route_mpath_info_create_nh().

Note that ip6_route_info_append() is now integrated into
ip6_route_mpath_info_create_nh() because we need to call different
free functions for nexthops that passed ip6_route_info_create_nh().

In case of failure, the remaining nexthops that ip6_route_info_create_nh()
has not been called for will be freed by ip6_route_mpath_info_cleanup().

OTOH, if a nexthop passes ip6_route_info_create_nh(), it will be linked
to a local temporary list, which will be spliced back to rt6_nh_list.
In case of failure, these nexthops will be released by fib6_info_release()
in ip6_route_multipath_add().

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
 net/ipv6/route.c | 205 ++++++++++++++++++++++++++++++-----------------
 1 file changed, 130 insertions(+), 75 deletions(-)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 49eea7e1e2da..c026f8fe5f78 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -5315,29 +5315,131 @@ struct rt6_nh {
 	struct fib6_info *fib6_info;
 	struct fib6_config r_cfg;
 	struct list_head next;
+	int weight;
 };
 
-static int ip6_route_info_append(struct list_head *rt6_nh_list,
-				 struct fib6_info *rt,
-				 struct fib6_config *r_cfg)
+static void ip6_route_mpath_info_cleanup(struct list_head *rt6_nh_list)
 {
-	struct rt6_nh *nh;
-	int err = -EEXIST;
+	struct rt6_nh *nh, *nh_next;
 
-	list_for_each_entry(nh, rt6_nh_list, next) {
-		/* check if fib6_info already exists */
-		if (rt6_duplicate_nexthop(nh->fib6_info, rt))
-			return err;
+	list_for_each_entry_safe(nh, nh_next, rt6_nh_list, next) {
+		struct fib6_info *rt = nh->fib6_info;
+
+		if (rt) {
+			free_percpu(rt->fib6_nh->nh_common.nhc_pcpu_rth_output);
+			free_percpu(rt->fib6_nh->rt6i_pcpu);
+			ip_fib_metrics_put(rt->fib6_metrics);
+			kfree(rt);
+		}
+
+		list_del(&nh->next);
+		kfree(nh);
 	}
+}
 
-	nh = kzalloc(sizeof(*nh), GFP_KERNEL);
-	if (!nh)
-		return -ENOMEM;
-	nh->fib6_info = rt;
-	memcpy(&nh->r_cfg, r_cfg, sizeof(*r_cfg));
-	list_add_tail(&nh->next, rt6_nh_list);
+static int ip6_route_mpath_info_create(struct list_head *rt6_nh_list,
+				       struct fib6_config *cfg,
+				       struct netlink_ext_ack *extack)
+{
+	struct rtnexthop *rtnh;
+	int remaining;
+	int err;
+
+	remaining = cfg->fc_mp_len;
+	rtnh = (struct rtnexthop *)cfg->fc_mp;
+
+	/* Parse a Multipath Entry and build a list (rt6_nh_list) of
+	 * fib6_info structs per nexthop
+	 */
+	while (rtnh_ok(rtnh, remaining)) {
+		struct fib6_config r_cfg;
+		struct fib6_info *rt;
+		struct rt6_nh *nh;
+		int attrlen;
+
+		nh = kzalloc(sizeof(*nh), GFP_KERNEL);
+		if (!nh) {
+			err = -ENOMEM;
+			goto err;
+		}
+
+		list_add_tail(&nh->next, rt6_nh_list);
+
+		memcpy(&r_cfg, cfg, sizeof(*cfg));
+		if (rtnh->rtnh_ifindex)
+			r_cfg.fc_ifindex = rtnh->rtnh_ifindex;
+
+		attrlen = rtnh_attrlen(rtnh);
+		if (attrlen > 0) {
+			struct nlattr *nla, *attrs = rtnh_attrs(rtnh);
+
+			nla = nla_find(attrs, attrlen, RTA_GATEWAY);
+			if (nla) {
+				r_cfg.fc_gateway = nla_get_in6_addr(nla);
+				r_cfg.fc_flags |= RTF_GATEWAY;
+			}
+
+			r_cfg.fc_encap = nla_find(attrs, attrlen, RTA_ENCAP);
+			nla = nla_find(attrs, attrlen, RTA_ENCAP_TYPE);
+			if (nla)
+				r_cfg.fc_encap_type = nla_get_u16(nla);
+		}
+
+		r_cfg.fc_flags |= (rtnh->rtnh_flags & RTNH_F_ONLINK);
+
+		rt = ip6_route_info_create(&r_cfg, GFP_KERNEL, extack);
+		if (IS_ERR(rt)) {
+			err = PTR_ERR(rt);
+			goto err;
+		}
+
+		nh->fib6_info = rt;
+		nh->weight = rtnh->rtnh_hops + 1;
+		memcpy(&nh->r_cfg, &r_cfg, sizeof(r_cfg));
+
+		rtnh = rtnh_next(rtnh, &remaining);
+	}
 
 	return 0;
+err:
+	ip6_route_mpath_info_cleanup(rt6_nh_list);
+	return err;
+}
+
+static int ip6_route_mpath_info_create_nh(struct list_head *rt6_nh_list,
+					  struct netlink_ext_ack *extack)
+{
+	struct rt6_nh *nh, *nh_next, *nh_tmp;
+	LIST_HEAD(tmp);
+	int err;
+
+	list_for_each_entry_safe(nh, nh_next, rt6_nh_list, next) {
+		struct fib6_info *rt = nh->fib6_info;
+
+		err = ip6_route_info_create_nh(rt, &nh->r_cfg, extack);
+		if (err) {
+			nh->fib6_info = NULL;
+			goto err;
+		}
+
+		rt->fib6_nh->fib_nh_weight = nh->weight;
+
+		list_move_tail(&nh->next, &tmp);
+
+		list_for_each_entry(nh_tmp, rt6_nh_list, next) {
+			/* check if fib6_info already exists */
+			if (rt6_duplicate_nexthop(nh_tmp->fib6_info, rt)) {
+				err = -EEXIST;
+				goto err;
+			}
+		}
+	}
+out:
+	list_splice(&tmp, rt6_nh_list);
+	return err;
+err:
+	ip6_route_mpath_info_cleanup(rt6_nh_list);
+	goto out;
 }
 
 static void ip6_route_mpath_notify(struct fib6_info *rt,
@@ -5396,75 +5498,28 @@ static int ip6_route_multipath_add(struct fib6_config *cfg,
 {
 	struct fib6_info *rt_notif = NULL, *rt_last = NULL;
 	struct nl_info *info = &cfg->fc_nlinfo;
-	struct fib6_config r_cfg;
-	struct rtnexthop *rtnh;
-	struct fib6_info *rt;
-	struct rt6_nh *err_nh;
 	struct rt6_nh *nh, *nh_safe;
+	LIST_HEAD(rt6_nh_list);
+	struct rt6_nh *err_nh;
 	__u16 nlflags;
-	int remaining;
-	int attrlen;
-	int err = 1;
 	int nhn = 0;
-	int replace = (cfg->fc_nlinfo.nlh &&
-		       (cfg->fc_nlinfo.nlh->nlmsg_flags & NLM_F_REPLACE));
-	LIST_HEAD(rt6_nh_list);
+	int replace;
+	int err;
+
+	replace = (cfg->fc_nlinfo.nlh &&
+		   (cfg->fc_nlinfo.nlh->nlmsg_flags & NLM_F_REPLACE));
 
 	nlflags = replace ? NLM_F_REPLACE : NLM_F_CREATE;
 	if (info->nlh && info->nlh->nlmsg_flags & NLM_F_APPEND)
 		nlflags |= NLM_F_APPEND;
 
-	remaining = cfg->fc_mp_len;
-	rtnh = (struct rtnexthop *)cfg->fc_mp;
-
-	/* Parse a Multipath Entry and build a list (rt6_nh_list) of
-	 * fib6_info structs per nexthop
-	 */
-	while (rtnh_ok(rtnh, remaining)) {
-		memcpy(&r_cfg, cfg, sizeof(*cfg));
-		if (rtnh->rtnh_ifindex)
-			r_cfg.fc_ifindex = rtnh->rtnh_ifindex;
-
-		attrlen = rtnh_attrlen(rtnh);
-		if (attrlen > 0) {
-			struct nlattr *nla, *attrs = rtnh_attrs(rtnh);
-
-			nla = nla_find(attrs, attrlen, RTA_GATEWAY);
-			if (nla) {
-				r_cfg.fc_gateway = nla_get_in6_addr(nla);
-				r_cfg.fc_flags |= RTF_GATEWAY;
-			}
-
-			r_cfg.fc_encap = nla_find(attrs, attrlen, RTA_ENCAP);
-			nla = nla_find(attrs, attrlen, RTA_ENCAP_TYPE);
-			if (nla)
-				r_cfg.fc_encap_type = nla_get_u16(nla);
-		}
-
-		r_cfg.fc_flags |= (rtnh->rtnh_flags & RTNH_F_ONLINK);
-		rt = ip6_route_info_create(&r_cfg, GFP_KERNEL, extack);
-		if (IS_ERR(rt)) {
-			err = PTR_ERR(rt);
-			rt = NULL;
-			goto cleanup;
-		}
-
-		err = ip6_route_info_create_nh(rt, &r_cfg, extack);
-		if (err) {
-			rt = NULL;
-			goto cleanup;
-		}
-
-		rt->fib6_nh->fib_nh_weight = rtnh->rtnh_hops + 1;
-
-		err = ip6_route_info_append(&rt6_nh_list, rt, &r_cfg);
-		if (err) {
-			fib6_info_release(rt);
-			goto cleanup;
-		}
+	err = ip6_route_mpath_info_create(&rt6_nh_list, cfg, extack);
+	if (err)
+		return err;
 
-		rtnh = rtnh_next(rtnh, &remaining);
-	}
+	err = ip6_route_mpath_info_create_nh(&rt6_nh_list, extack);
+	if (err)
+		goto cleanup;
 
 	/* for add and replace send one notification with all nexthops.
 	 * Skip the notification in fib6_add_rt2node and send one with
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH RESEND v2 net-next 11/14] ipv6: Protect fib6_link_table() with spinlock.
  2025-04-14 18:14 [PATCH RESEND v2 net-next 00/14] ipv6: No RTNL for IPv6 routing table Kuniyuki Iwashima
                   ` (9 preceding siblings ...)
  2025-04-14 18:14 ` [PATCH RESEND v2 net-next 10/14] ipv6: Factorise ip6_route_multipath_add() Kuniyuki Iwashima
@ 2025-04-14 18:14 ` Kuniyuki Iwashima
  2025-04-16 10:00   ` Paolo Abeni
  2025-04-14 18:15 ` [PATCH RESEND v2 net-next 12/14] ipv6: Defer fib6_purge_rt() in fib6_add_rt2node() to fib6_add() Kuniyuki Iwashima
                   ` (2 subsequent siblings)
  13 siblings, 1 reply; 40+ messages in thread
From: Kuniyuki Iwashima @ 2025-04-14 18:14 UTC (permalink / raw)
  To: David S. Miller, David Ahern, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni
  Cc: Simon Horman, Kuniyuki Iwashima, Kuniyuki Iwashima, netdev

We will get rid of RTNL from RTM_NEWROUTE and SIOCADDRT.

If the request specifies a new table ID, fib6_new_table() is
called to create a new routing table.

Two concurrent requests could specify the same table ID, so we
need a lock to protect net->ipv6.fib_table_hash[h].

Let's add a spinlock to protect the hash bucket linkage.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
 include/net/netns/ipv6.h |  1 +
 net/ipv6/ip6_fib.c       | 26 +++++++++++++++++++++-----
 2 files changed, 22 insertions(+), 5 deletions(-)

diff --git a/include/net/netns/ipv6.h b/include/net/netns/ipv6.h
index 5f2cfd84570a..47dc70d8100a 100644
--- a/include/net/netns/ipv6.h
+++ b/include/net/netns/ipv6.h
@@ -72,6 +72,7 @@ struct netns_ipv6 {
 	struct rt6_statistics   *rt6_stats;
 	struct timer_list       ip6_fib_timer;
 	struct hlist_head       *fib_table_hash;
+	spinlock_t		fib_table_hash_lock;
 	struct fib6_table       *fib6_main_tbl;
 	struct list_head	fib6_walkers;
 	rwlock_t		fib6_walker_lock;
diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index bf727149fdec..79b672f3fc53 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -249,19 +249,33 @@ static struct fib6_table *fib6_alloc_table(struct net *net, u32 id)
 
 struct fib6_table *fib6_new_table(struct net *net, u32 id)
 {
-	struct fib6_table *tb;
+	struct fib6_table *tb, *new_tb;
 
 	if (id == 0)
 		id = RT6_TABLE_MAIN;
+
 	tb = fib6_get_table(net, id);
 	if (tb)
 		return tb;
 
-	tb = fib6_alloc_table(net, id);
-	if (tb)
-		fib6_link_table(net, tb);
+	new_tb = fib6_alloc_table(net, id);
+	if (!new_tb)
+		return NULL;
+
+	spin_lock_bh(&net->ipv6.fib_table_hash_lock);
+
+	tb = fib6_get_table(net, id);
+	if (unlikely(tb)) {
+		spin_unlock_bh(&net->ipv6.fib_table_hash_lock);
+		kfree(new_tb);
+		return tb;
+	}
 
-	return tb;
+	fib6_link_table(net, new_tb);
+
+	spin_unlock_bh(&net->ipv6.fib_table_hash_lock);
+
+	return new_tb;
 }
 EXPORT_SYMBOL_GPL(fib6_new_table);
 
@@ -2423,6 +2437,8 @@ static int __net_init fib6_net_init(struct net *net)
 	if (!net->ipv6.fib_table_hash)
 		goto out_rt6_stats;
 
+	spin_lock_init(&net->ipv6.fib_table_hash_lock);
+
 	net->ipv6.fib6_main_tbl = kzalloc(sizeof(*net->ipv6.fib6_main_tbl),
 					  GFP_KERNEL);
 	if (!net->ipv6.fib6_main_tbl)
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH RESEND v2 net-next 12/14] ipv6: Defer fib6_purge_rt() in fib6_add_rt2node() to fib6_add().
  2025-04-14 18:14 [PATCH RESEND v2 net-next 00/14] ipv6: No RTNL for IPv6 routing table Kuniyuki Iwashima
                   ` (10 preceding siblings ...)
  2025-04-14 18:14 ` [PATCH RESEND v2 net-next 11/14] ipv6: Protect fib6_link_table() with spinlock Kuniyuki Iwashima
@ 2025-04-14 18:15 ` Kuniyuki Iwashima
  2025-04-16 13:59   ` Paolo Abeni
  2025-04-14 18:15 ` [PATCH RESEND v2 net-next 13/14] ipv6: Protect nh->f6i_list with spinlock and flag Kuniyuki Iwashima
  2025-04-14 18:15 ` [PATCH RESEND v2 net-next 14/14] ipv6: Get rid of RTNL for SIOCADDRT and RTM_NEWROUTE Kuniyuki Iwashima
  13 siblings, 1 reply; 40+ messages in thread
From: Kuniyuki Iwashima @ 2025-04-14 18:15 UTC (permalink / raw)
  To: David S. Miller, David Ahern, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni
  Cc: Simon Horman, Kuniyuki Iwashima, Kuniyuki Iwashima, netdev

The next patch adds per-nexthop spinlock which protects nh->f6i_list.

When rt->nh is not NULL, fib6_add_rt2node() will be called under the lock.
fib6_add_rt2node() could call fib6_purge_rt() for another route, which
could holds another nexthop lock.

Then, deadlock could happen between two nexthops.

Let's defer fib6_purge_rt() after fib6_add_rt2node().

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
 include/net/ip6_fib.h |  1 +
 net/ipv6/ip6_fib.c    | 21 ++++++++++++++-------
 2 files changed, 15 insertions(+), 7 deletions(-)

diff --git a/include/net/ip6_fib.h b/include/net/ip6_fib.h
index 7c87873ae211..88b0dd4d8e09 100644
--- a/include/net/ip6_fib.h
+++ b/include/net/ip6_fib.h
@@ -198,6 +198,7 @@ struct fib6_info {
 					fib6_destroying:1,
 					unused:4;
 
+	struct list_head		purge_link;
 	struct rcu_head			rcu;
 	struct nexthop			*nh;
 	struct fib6_nh			fib6_nh[];
diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index 79b672f3fc53..9e9db5470bbf 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -1083,8 +1083,8 @@ static void fib6_purge_rt(struct fib6_info *rt, struct fib6_node *fn,
  */
 
 static int fib6_add_rt2node(struct fib6_node *fn, struct fib6_info *rt,
-			    struct nl_info *info,
-			    struct netlink_ext_ack *extack)
+			    struct nl_info *info, struct netlink_ext_ack *extack,
+			    struct list_head *purge_list)
 {
 	struct fib6_info *leaf = rcu_dereference_protected(fn->leaf,
 				    lockdep_is_held(&rt->fib6_table->tb6_lock));
@@ -1308,10 +1308,9 @@ static int fib6_add_rt2node(struct fib6_node *fn, struct fib6_info *rt,
 		}
 		nsiblings = iter->fib6_nsiblings;
 		iter->fib6_node = NULL;
-		fib6_purge_rt(iter, fn, info->nl_net);
+		list_add(&iter->purge_link, purge_list);
 		if (rcu_access_pointer(fn->rr_ptr) == iter)
 			fn->rr_ptr = NULL;
-		fib6_info_release(iter);
 
 		if (nsiblings) {
 			/* Replacing an ECMP route, remove all siblings */
@@ -1324,10 +1323,9 @@ static int fib6_add_rt2node(struct fib6_node *fn, struct fib6_info *rt,
 				if (rt6_qualify_for_ecmp(iter)) {
 					*ins = iter->fib6_next;
 					iter->fib6_node = NULL;
-					fib6_purge_rt(iter, fn, info->nl_net);
+					list_add(&iter->purge_link, purge_list);
 					if (rcu_access_pointer(fn->rr_ptr) == iter)
 						fn->rr_ptr = NULL;
-					fib6_info_release(iter);
 					nsiblings--;
 					info->nl_net->ipv6.rt6_stats->fib_rt_entries--;
 				} else {
@@ -1397,6 +1395,7 @@ int fib6_add(struct fib6_node *root, struct fib6_info *rt,
 	     struct nl_info *info, struct netlink_ext_ack *extack)
 {
 	struct fib6_table *table = rt->fib6_table;
+	LIST_HEAD(purge_list);
 	struct fib6_node *fn;
 #ifdef CONFIG_IPV6_SUBTREES
 	struct fib6_node *pn = NULL;
@@ -1499,8 +1498,16 @@ int fib6_add(struct fib6_node *root, struct fib6_info *rt,
 	}
 #endif
 
-	err = fib6_add_rt2node(fn, rt, info, extack);
+	err = fib6_add_rt2node(fn, rt, info, extack, &purge_list);
 	if (!err) {
+		struct fib6_info *iter, *next;
+
+		list_for_each_entry_safe(iter, next, &purge_list, purge_link) {
+			list_del(&iter->purge_link);
+			fib6_purge_rt(iter, fn, info->nl_net);
+			fib6_info_release(iter);
+		}
+
 		if (rt->nh)
 			list_add(&rt->nh_list, &rt->nh->f6i_list);
 		__fib6_update_sernum_upto_root(rt, fib6_new_sernum(info->nl_net));
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH RESEND v2 net-next 13/14] ipv6: Protect nh->f6i_list with spinlock and flag.
  2025-04-14 18:14 [PATCH RESEND v2 net-next 00/14] ipv6: No RTNL for IPv6 routing table Kuniyuki Iwashima
                   ` (11 preceding siblings ...)
  2025-04-14 18:15 ` [PATCH RESEND v2 net-next 12/14] ipv6: Defer fib6_purge_rt() in fib6_add_rt2node() to fib6_add() Kuniyuki Iwashima
@ 2025-04-14 18:15 ` Kuniyuki Iwashima
  2025-04-16 15:17   ` Paolo Abeni
  2025-04-14 18:15 ` [PATCH RESEND v2 net-next 14/14] ipv6: Get rid of RTNL for SIOCADDRT and RTM_NEWROUTE Kuniyuki Iwashima
  13 siblings, 1 reply; 40+ messages in thread
From: Kuniyuki Iwashima @ 2025-04-14 18:15 UTC (permalink / raw)
  To: David S. Miller, David Ahern, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni
  Cc: Simon Horman, Kuniyuki Iwashima, Kuniyuki Iwashima, netdev

We will get rid of RTNL from RTM_NEWROUTE and SIOCADDRT.

Then, we may be going to add a route tied to a dying nexthop.

The nexthop itself is not freed during the RCU grace period, but
if we link a route after __remove_nexthop_fib() is called for the
nexthop, the route will be leaked.

To avoid the race between IPv6 route addition under RCU vs nexthop
deletion under RTNL, let's add a dead flag and protect it and
nh->f6i_list with a spinlock.

__remove_nexthop_fib() acquires the nexthop's spinlock and sets false
to nh->dead, then calls ip6_del_rt() for the linked route one by one
without the spinlock because fib6_purge_rt() acquires it later.

While adding an IPv6 route, fib6_add() acquires the nexthop lock and
checks the dead flag just before inserting the route.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
 include/net/nexthop.h |  2 ++
 net/ipv4/nexthop.c    | 18 +++++++++++++++---
 net/ipv6/ip6_fib.c    | 30 +++++++++++++++++++++++++-----
 3 files changed, 42 insertions(+), 8 deletions(-)

diff --git a/include/net/nexthop.h b/include/net/nexthop.h
index d9fb44e8b321..572e69cda476 100644
--- a/include/net/nexthop.h
+++ b/include/net/nexthop.h
@@ -152,6 +152,8 @@ struct nexthop {
 	u8			protocol;   /* app managing this nh */
 	u8			nh_flags;
 	bool			is_group;
+	bool			dead;
+	spinlock_t		lock;       /* protect dead and f6i_list */
 
 	refcount_t		refcnt;
 	struct rcu_head		rcu;
diff --git a/net/ipv4/nexthop.c b/net/ipv4/nexthop.c
index 467151517023..5e47166512e2 100644
--- a/net/ipv4/nexthop.c
+++ b/net/ipv4/nexthop.c
@@ -541,6 +541,7 @@ static struct nexthop *nexthop_alloc(void)
 		INIT_LIST_HEAD(&nh->f6i_list);
 		INIT_LIST_HEAD(&nh->grp_list);
 		INIT_LIST_HEAD(&nh->fdb_list);
+		spin_lock_init(&nh->lock);
 	}
 	return nh;
 }
@@ -2118,7 +2119,7 @@ static void remove_nexthop_group(struct nexthop *nh, struct nl_info *nlinfo)
 /* not called for nexthop replace */
 static void __remove_nexthop_fib(struct net *net, struct nexthop *nh)
 {
-	struct fib6_info *f6i, *tmp;
+	struct fib6_info *f6i;
 	bool do_flush = false;
 	struct fib_info *fi;
 
@@ -2129,13 +2130,24 @@ static void __remove_nexthop_fib(struct net *net, struct nexthop *nh)
 	if (do_flush)
 		fib_flush(net);
 
-	/* ip6_del_rt removes the entry from this list hence the _safe */
-	list_for_each_entry_safe(f6i, tmp, &nh->f6i_list, nh_list) {
+	spin_lock_bh(&nh->lock);
+
+	nh->dead = true;
+
+	while (!list_empty(&nh->f6i_list)) {
+		f6i = list_first_entry(&nh->f6i_list, typeof(*f6i), nh_list);
+
 		/* __ip6_del_rt does a release, so do a hold here */
 		fib6_info_hold(f6i);
+
+		spin_unlock_bh(&nh->lock);
 		ipv6_stub->ip6_del_rt(net, f6i,
 				      !READ_ONCE(net->ipv4.sysctl_nexthop_compat_mode));
+
+		spin_lock_bh(&nh->lock);
 	}
+
+	spin_unlock_bh(&nh->lock);
 }
 
 static void __remove_nexthop(struct net *net, struct nexthop *nh,
diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index 9e9db5470bbf..dac9596cf532 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -1048,8 +1048,14 @@ static void fib6_purge_rt(struct fib6_info *rt, struct fib6_node *fn,
 	rt6_flush_exceptions(rt);
 	fib6_drop_pcpu_from(rt, table);
 
-	if (rt->nh && !list_empty(&rt->nh_list))
-		list_del_init(&rt->nh_list);
+	if (rt->nh) {
+		spin_lock(&rt->nh->lock);
+
+		if (!list_empty(&rt->nh_list))
+			list_del_init(&rt->nh_list);
+
+		spin_unlock(&rt->nh->lock);
+	}
 
 	if (refcount_read(&rt->fib6_ref) != 1) {
 		/* This route is used as dummy address holder in some split
@@ -1498,7 +1504,23 @@ int fib6_add(struct fib6_node *root, struct fib6_info *rt,
 	}
 #endif
 
-	err = fib6_add_rt2node(fn, rt, info, extack, &purge_list);
+	if (rt->nh) {
+		spin_lock(&rt->nh->lock);
+
+		if (rt->nh->dead) {
+			NL_SET_ERR_MSG(extack, "Nexthop has been deleted");
+			err = -EINVAL;
+		} else {
+			err = fib6_add_rt2node(fn, rt, info, extack, &purge_list);
+			if (!err)
+				list_add(&rt->nh_list, &rt->nh->f6i_list);
+		}
+
+		spin_unlock(&rt->nh->lock);
+	} else {
+		err = fib6_add_rt2node(fn, rt, info, extack, &purge_list);
+	}
+
 	if (!err) {
 		struct fib6_info *iter, *next;
 
@@ -1508,8 +1530,6 @@ int fib6_add(struct fib6_node *root, struct fib6_info *rt,
 			fib6_info_release(iter);
 		}
 
-		if (rt->nh)
-			list_add(&rt->nh_list, &rt->nh->f6i_list);
 		__fib6_update_sernum_upto_root(rt, fib6_new_sernum(info->nl_net));
 
 		if (rt->fib6_flags & RTF_EXPIRES)
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH RESEND v2 net-next 14/14] ipv6: Get rid of RTNL for SIOCADDRT and RTM_NEWROUTE.
  2025-04-14 18:14 [PATCH RESEND v2 net-next 00/14] ipv6: No RTNL for IPv6 routing table Kuniyuki Iwashima
                   ` (12 preceding siblings ...)
  2025-04-14 18:15 ` [PATCH RESEND v2 net-next 13/14] ipv6: Protect nh->f6i_list with spinlock and flag Kuniyuki Iwashima
@ 2025-04-14 18:15 ` Kuniyuki Iwashima
  2025-04-16 15:26   ` Paolo Abeni
  13 siblings, 1 reply; 40+ messages in thread
From: Kuniyuki Iwashima @ 2025-04-14 18:15 UTC (permalink / raw)
  To: David S. Miller, David Ahern, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni
  Cc: Simon Horman, Kuniyuki Iwashima, Kuniyuki Iwashima, netdev

Now we are ready to remove RTNL from SIOCADDRT and RTM_NEWROUTE.

The remaining things to do are

  1. pass false to lwtunnel_valid_encap_type_attr()
  2. use rcu_dereference_rtnl() in fib6_check_nexthop()
  3. place rcu_read_lock() before ip6_route_info_create_nh().

Let's complete the RTNL-free conversion.

When each CPU-X adds 100000 routes on table-X in a batch
concurrently on c7a.metal-48xl EC2 instance with 192 CPUs,

without this series:

  $ sudo ./route_test.sh
  ...
  added 19200000 routes (100000 routes * 192 tables).
  time elapsed: 191577 milliseconds.

with this series:

  $ sudo ./route_test.sh
  ...
  added 19200000 routes (100000 routes * 192 tables).
  time elapsed: 62854 milliseconds.

I changed the number of routes in each table (1000 ~ 100000)
and consistently saw it finish 3x faster with this series.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
 net/ipv4/nexthop.c |  4 ++--
 net/ipv6/route.c   | 18 ++++++++++++------
 2 files changed, 14 insertions(+), 8 deletions(-)

diff --git a/net/ipv4/nexthop.c b/net/ipv4/nexthop.c
index 5e47166512e2..329f33a2be51 100644
--- a/net/ipv4/nexthop.c
+++ b/net/ipv4/nexthop.c
@@ -1556,12 +1556,12 @@ int fib6_check_nexthop(struct nexthop *nh, struct fib6_config *cfg,
 	if (nh->is_group) {
 		struct nh_group *nhg;
 
-		nhg = rtnl_dereference(nh->nh_grp);
+		nhg = rcu_dereference_rtnl(nh->nh_grp);
 		if (nhg->has_v4)
 			goto no_v4_nh;
 		is_fdb_nh = nhg->fdb_nh;
 	} else {
-		nhi = rtnl_dereference(nh->nh_info);
+		nhi = rcu_dereference_rtnl(nh->nh_info);
 		if (nhi->family == AF_INET)
 			goto no_v4_nh;
 		is_fdb_nh = nhi->fdb_nh;
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index c026f8fe5f78..5ccbbfb68e3d 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -3902,12 +3902,16 @@ int ip6_route_add(struct fib6_config *cfg, gfp_t gfp_flags,
 	if (IS_ERR(rt))
 		return PTR_ERR(rt);
 
+	rcu_read_lock();
+
 	err = ip6_route_info_create_nh(rt, cfg, extack);
 	if (err)
-		return err;
+		goto unlock;
 
 	err = __ip6_ins_rt(rt, &cfg->fc_nlinfo, extack);
 	fib6_info_release(rt);
+unlock:
+	rcu_read_unlock();
 
 	return err;
 }
@@ -4528,12 +4532,10 @@ int ipv6_route_ioctl(struct net *net, unsigned int cmd, struct in6_rtmsg *rtmsg)
 
 	switch (cmd) {
 	case SIOCADDRT:
-		rtnl_lock();
 		/* Only do the default setting of fc_metric in route adding */
 		if (cfg.fc_metric == 0)
 			cfg.fc_metric = IP6_RT_PRIO_USER;
 		err = ip6_route_add(&cfg, GFP_KERNEL, NULL);
-		rtnl_unlock();
 		break;
 	case SIOCDELRT:
 		err = ip6_route_del(&cfg, NULL);
@@ -5112,7 +5114,7 @@ static int rtm_to_fib6_multipath_config(struct fib6_config *cfg,
 	} while (rtnh_ok(rtnh, remaining));
 
 	return lwtunnel_valid_encap_type_attr(cfg->fc_mp, cfg->fc_mp_len,
-					      extack, newroute);
+					      extack, false);
 }
 
 static int rtm_to_fib6_config(struct sk_buff *skb, struct nlmsghdr *nlh,
@@ -5250,7 +5252,7 @@ static int rtm_to_fib6_config(struct sk_buff *skb, struct nlmsghdr *nlh,
 		cfg->fc_encap_type = nla_get_u16(tb[RTA_ENCAP_TYPE]);
 
 		err = lwtunnel_valid_encap_type(cfg->fc_encap_type,
-						extack, newroute);
+						extack, false);
 		if (err < 0)
 			goto errout;
 	}
@@ -5517,6 +5519,8 @@ static int ip6_route_multipath_add(struct fib6_config *cfg,
 	if (err)
 		return err;
 
+	rcu_read_lock();
+
 	err = ip6_route_mpath_info_create_nh(&rt6_nh_list, extack);
 	if (err)
 		goto cleanup;
@@ -5608,6 +5612,8 @@ static int ip6_route_multipath_add(struct fib6_config *cfg,
 	}
 
 cleanup:
+	rcu_read_unlock();
+
 	list_for_each_entry_safe(nh, nh_safe, &rt6_nh_list, next) {
 		fib6_info_release(nh->fib6_info);
 		list_del(&nh->next);
@@ -6890,7 +6896,7 @@ static void bpf_iter_unregister(void)
 
 static const struct rtnl_msg_handler ip6_route_rtnl_msg_handlers[] __initconst_or_module = {
 	{.owner = THIS_MODULE, .protocol = PF_INET6, .msgtype = RTM_NEWROUTE,
-	 .doit = inet6_rtm_newroute},
+	 .doit = inet6_rtm_newroute, .flags = RTNL_FLAG_DOIT_UNLOCKED},
 	{.owner = THIS_MODULE, .protocol = PF_INET6, .msgtype = RTM_DELROUTE,
 	 .doit = inet6_rtm_delroute, .flags = RTNL_FLAG_DOIT_UNLOCKED},
 	{.owner = THIS_MODULE, .protocol = PF_INET6, .msgtype = RTM_GETROUTE,
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCH RESEND v2 net-next 01/14] ipv6: Validate RTA_GATEWAY of RTA_MULTIPATH in rtm_to_fib6_config().
  2025-04-14 18:14 ` [PATCH RESEND v2 net-next 01/14] ipv6: Validate RTA_GATEWAY of RTA_MULTIPATH in rtm_to_fib6_config() Kuniyuki Iwashima
@ 2025-04-16  8:22   ` Paolo Abeni
  2025-04-17 17:53     ` Kuniyuki Iwashima
  0 siblings, 1 reply; 40+ messages in thread
From: Paolo Abeni @ 2025-04-16  8:22 UTC (permalink / raw)
  To: Kuniyuki Iwashima, David S. Miller, David Ahern, Eric Dumazet,
	Jakub Kicinski
  Cc: Simon Horman, Kuniyuki Iwashima, netdev

On 4/14/25 8:14 PM, Kuniyuki Iwashima wrote:
> We will perform RTM_NEWROUTE and RTM_DELROUTE under RCU, and then
> we want to perform some validation out of the RCU scope.
> 
> When creating / removing an IPv6 route with RTA_MULTIPATH,
> inet6_rtm_newroute() / inet6_rtm_delroute() validates RTA_GATEWAY
> in each multipath entry.
> 
> Let's do that in rtm_to_fib6_config().
> 
> Note that now RTM_DELROUTE returns an error for RTA_MULTIPATH with
> 0 entries, which was accepted but should result in -EINVAL as
> RTM_NEWROUTE.
> 
> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> ---
>  net/ipv6/route.c | 82 +++++++++++++++++++++++++-----------------------
>  1 file changed, 43 insertions(+), 39 deletions(-)
> 
> diff --git a/net/ipv6/route.c b/net/ipv6/route.c
> index 210b84cecc24..237e31f64a4a 100644
> --- a/net/ipv6/route.c
> +++ b/net/ipv6/route.c
> @@ -5050,6 +5050,44 @@ static const struct nla_policy rtm_ipv6_policy[RTA_MAX+1] = {
>  	[RTA_FLOWLABEL]		= { .type = NLA_BE32 },
>  };
>  
> +static int rtm_to_fib6_multipath_config(struct fib6_config *cfg,
> +					struct netlink_ext_ack *extack)
> +{
> +	struct rtnexthop *rtnh;
> +	int remaining;
> +
> +	remaining = cfg->fc_mp_len;
> +	rtnh = (struct rtnexthop *)cfg->fc_mp;
> +
> +	if (!rtnh_ok(rtnh, remaining)) {
> +		NL_SET_ERR_MSG(extack, "Invalid nexthop configuration - no valid nexthops");
> +		return -EINVAL;
> +	}
> +
> +	do {

I think the initial checks and the loop could be rewritten reducing the
indentation and calling the helper only once with something alike:

	for (i = 0; rtnh_ok(rtnh, remaining);
	     i++, rtnh = rtnh_next(rtnh, &remaining)) {
		int attrlen = rtnh_attrlen(rtnh);
		if (!attrlen)
			continue;
		
		// ...
	}
	if (i == 0) {
		NL_SET_ERR_MSG(extack, "Invalid nexthop configuration - no valid
nexthops");
		// ..

I guess it's a bit subjective, don't repost just for this.

/P


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH RESEND v2 net-next 02/14] ipv6: Get rid of RTNL for SIOCDELRT and RTM_DELROUTE.
  2025-04-14 18:14 ` [PATCH RESEND v2 net-next 02/14] ipv6: Get rid of RTNL for SIOCDELRT and RTM_DELROUTE Kuniyuki Iwashima
@ 2025-04-16  8:49   ` Paolo Abeni
  2025-04-16 18:45     ` Kuniyuki Iwashima
  0 siblings, 1 reply; 40+ messages in thread
From: Paolo Abeni @ 2025-04-16  8:49 UTC (permalink / raw)
  To: Kuniyuki Iwashima, David S. Miller, David Ahern, Eric Dumazet,
	Jakub Kicinski
  Cc: Simon Horman, Kuniyuki Iwashima, netdev



On 4/14/25 8:14 PM, Kuniyuki Iwashima wrote:
> Basically, removing an IPv6 route does not require RTNL because
> the IPv6 routing tables are protected by per table lock.
> 
> inet6_rtm_delroute() calls nexthop_find_by_id() to check if the
> nexthop specified by RTA_NH_ID exists.  nexthop uses rbtree and
> the top-down walk can be safely performed under RCU.
> 
> ip6_route_del() already relies on RCU and the table lock, but we
> need to extend the RCU critical section a bit more to cover
> __ip6_del_rt().  For example, nexthop_for_each_fib6_nh() and
> inet6_rt_notify() needs RCU.

The last statement is not clear to me. I don't see __ip6_del_rt()
calling nexthop_for_each_fib6_nh() or inet6_rt_notify() ?!?

Also after this patch we have this chunk in ip6_route_del():

	table = fib6_get_table(cfg->fc_nlinfo.nl_net, cfg->fc_table);
	if (!table)
		//..

	rcu_read_lock();

which AFAICS should be safe because 'table' is freed only at netns exit
time, but acquiring the rcu lock after grabbing the rcu protected struct
is confusing. It should be good adding a comment or moving the rcu lock
before the lookup (and dropping the RCU lock from fib6_get_table())

Thanks,

Paolo


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH RESEND v2 net-next 03/14] ipv6: Move some validation from ip6_route_info_create() to rtm_to_fib6_config().
  2025-04-14 18:14 ` [PATCH RESEND v2 net-next 03/14] ipv6: Move some validation from ip6_route_info_create() to rtm_to_fib6_config() Kuniyuki Iwashima
@ 2025-04-16  8:54   ` Paolo Abeni
  0 siblings, 0 replies; 40+ messages in thread
From: Paolo Abeni @ 2025-04-16  8:54 UTC (permalink / raw)
  To: Kuniyuki Iwashima, David S. Miller, David Ahern, Eric Dumazet,
	Jakub Kicinski
  Cc: Simon Horman, Kuniyuki Iwashima, netdev



On 4/14/25 8:14 PM, Kuniyuki Iwashima wrote:
> ip6_route_info_create() is called from 3 functions:
> 
>   * ip6_route_add()
>   * ip6_route_multipath_add()
>   * addrconf_f6i_alloc()
> 
> addrconf_f6i_alloc() does not need validation for struct fib6_config in
> ip6_route_info_create().
> 
> ip6_route_multipath_add() calls ip6_route_info_create() for multiple
> routes with slightly different fib6_config instances, which is copied
> from the base config passed from userspace.  So, we need not validate
> the same config repeatedly.
> 
> Let's move such validation into rtm_to_fib6_config().
> 
> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>

Acked-by: Paolo Abeni <pabeni@redhat.com>


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH RESEND v2 net-next 04/14] ipv6: Check GATEWAY in rtm_to_fib6_multipath_config().
  2025-04-14 18:14 ` [PATCH RESEND v2 net-next 04/14] ipv6: Check GATEWAY in rtm_to_fib6_multipath_config() Kuniyuki Iwashima
@ 2025-04-16  9:06   ` Paolo Abeni
  2025-04-16 18:48     ` Kuniyuki Iwashima
  0 siblings, 1 reply; 40+ messages in thread
From: Paolo Abeni @ 2025-04-16  9:06 UTC (permalink / raw)
  To: Kuniyuki Iwashima, David S. Miller, David Ahern, Eric Dumazet,
	Jakub Kicinski
  Cc: Simon Horman, Kuniyuki Iwashima, netdev

On 4/14/25 8:14 PM, Kuniyuki Iwashima wrote:
> In ip6_route_multipath_add(), we call rt6_qualify_for_ecmp() for each
> entry.  If it returns false, the request fails.
> 
> rt6_qualify_for_ecmp() returns false if either of the conditions below
> is true:
> 
>   1. f6i->fib6_flags has RTF_ADDRCONF
>   2. f6i->nh is not NULL
>   3. f6i->fib6_nh->fib_nh_gw_family is AF_UNSPEC
> 
> 1 is unnecessary because rtm_to_fib6_config() never sets RTF_ADDRCONF
> to cfg->fc_flags.
> 
> 2. is equivalent with cfg->fc_nh_id.
> 
> 3. can be replaced by checking RTF_GATEWAY in the base and each multipath
> entry because AF_INET6 is set to f6i->fib6_nh->fib_nh_gw_family only when
> cfg.fc_is_fdb is true or RTF_GATEWAY is set, but the former is always
> false.
> 
> Let's perform the equivalent checks in rtm_to_fib6_multipath_config().

It's unclear to me the 'why'???

Thanks,

Paolo


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH RESEND v2 net-next 06/14] ipv6: Split ip6_route_info_create().
  2025-04-14 18:14 ` [PATCH RESEND v2 net-next 06/14] ipv6: Split ip6_route_info_create() Kuniyuki Iwashima
@ 2025-04-16  9:12   ` Paolo Abeni
  2025-04-16 18:50     ` Kuniyuki Iwashima
  0 siblings, 1 reply; 40+ messages in thread
From: Paolo Abeni @ 2025-04-16  9:12 UTC (permalink / raw)
  To: Kuniyuki Iwashima, David S. Miller, David Ahern, Eric Dumazet,
	Jakub Kicinski
  Cc: Simon Horman, Kuniyuki Iwashima, netdev

On 4/14/25 8:14 PM, Kuniyuki Iwashima wrote:
> We will get rid of RTNL from RTM_NEWROUTE and SIOCADDRT and rely
> on RCU to guarantee dev and nexthop lifetime.
> 
> Then, we want to allocate everything as possible before entering

Then, we want to allocate everything before ...

or

Then, we want to allocate as much as possible before ...

> the RCU section.
> 
> The RCU section will start in the middle of ip6_route_info_create(),
> and this is problematic for ip6_route_multipath_add() that calls
> ip6_route_info_create() multiple times.
> 
> Let's split ip6_route_info_create() into two parts; one for memory
> allocation and another for nexthop setup.
> 
> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>

Acked-by: Paolo Abeni <pabeni@redhat.com>


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH RESEND v2 net-next 05/14] ipv6: Move nexthop_find_by_id() after fib6_info_alloc().
  2025-04-14 18:14 ` [PATCH RESEND v2 net-next 05/14] ipv6: Move nexthop_find_by_id() after fib6_info_alloc() Kuniyuki Iwashima
@ 2025-04-16  9:13   ` Paolo Abeni
  0 siblings, 0 replies; 40+ messages in thread
From: Paolo Abeni @ 2025-04-16  9:13 UTC (permalink / raw)
  To: Kuniyuki Iwashima, David S. Miller, David Ahern, Eric Dumazet,
	Jakub Kicinski
  Cc: Simon Horman, Kuniyuki Iwashima, netdev

On 4/14/25 8:14 PM, Kuniyuki Iwashima wrote:
> We will get rid of RTNL from RTM_NEWROUTE and SIOCADDRT.
> 
> Then, we must perform two lookups for nexthop and dev under RCU
> to guarantee their lifetime.
> 
> ip6_route_info_create() calls nexthop_find_by_id() first if
> RTA_NH_ID is specified, and then allocates struct fib6_info.
> 
> nexthop_find_by_id() must be called under RCU, but we do not want
> to use GFP_ATOMIC for memory allocation here, which will be likely
> to fail in ip6_route_multipath_add().
> 
> Let's move nexthop_find_by_id() after the memory allocation so
> that we can later split ip6_route_info_create() into two parts:
> the sleepable part and the RCU part.
> 
> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>

Acked-by: Paolo Abeni <pabeni@redhat.com>


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH RESEND v2 net-next 07/14] ipv6: Preallocate rt->fib6_nh->rt6i_pcpu in ip6_route_info_create().
  2025-04-14 18:14 ` [PATCH RESEND v2 net-next 07/14] ipv6: Preallocate rt->fib6_nh->rt6i_pcpu in ip6_route_info_create() Kuniyuki Iwashima
@ 2025-04-16  9:21   ` Paolo Abeni
  2025-04-16  9:37     ` Paolo Abeni
  2025-04-16  9:23   ` Paolo Abeni
  1 sibling, 1 reply; 40+ messages in thread
From: Paolo Abeni @ 2025-04-16  9:21 UTC (permalink / raw)
  To: Kuniyuki Iwashima, David S. Miller, David Ahern, Eric Dumazet,
	Jakub Kicinski
  Cc: Simon Horman, Kuniyuki Iwashima, netdev

On 4/14/25 8:14 PM, Kuniyuki Iwashima wrote:
> diff --git a/net/ipv6/route.c b/net/ipv6/route.c
> index ce060b59d41a..404da01a7502 100644
> --- a/net/ipv6/route.c
> +++ b/net/ipv6/route.c
> @@ -3664,10 +3664,12 @@ int fib6_nh_init(struct net *net, struct fib6_nh *fib6_nh,
>  		goto out;
>  
>  pcpu_alloc:
> -	fib6_nh->rt6i_pcpu = alloc_percpu_gfp(struct rt6_info *, gfp_flags);
>  	if (!fib6_nh->rt6i_pcpu) {
> -		err = -ENOMEM;
> -		goto out;
> +		fib6_nh->rt6i_pcpu = alloc_percpu_gfp(struct rt6_info *, gfp_flags);

'rt6i_pcpu' has just been pre-allocated, why we need to try again the
allocation here? And if it's really needed why don't rename the new
helper to something more generic and re-use it here?

Thanks,

Paolo


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH RESEND v2 net-next 07/14] ipv6: Preallocate rt->fib6_nh->rt6i_pcpu in ip6_route_info_create().
  2025-04-14 18:14 ` [PATCH RESEND v2 net-next 07/14] ipv6: Preallocate rt->fib6_nh->rt6i_pcpu in ip6_route_info_create() Kuniyuki Iwashima
  2025-04-16  9:21   ` Paolo Abeni
@ 2025-04-16  9:23   ` Paolo Abeni
  1 sibling, 0 replies; 40+ messages in thread
From: Paolo Abeni @ 2025-04-16  9:23 UTC (permalink / raw)
  To: Kuniyuki Iwashima, David S. Miller, David Ahern, Eric Dumazet,
	Jakub Kicinski
  Cc: Simon Horman, Kuniyuki Iwashima, netdev

On 4/14/25 8:14 PM, Kuniyuki Iwashima wrote:
> ip6_route_info_create_nh() will be called under RCU.
> 
> Then, fib6_nh_init() is also under RCU, but per-cpu memory allocation
> is very likely to fail with GFP_ATOMIC while bluk-adding IPv6 routes

oops, I forgot a very minor nit:               ^^^^ bulk

/P


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH RESEND v2 net-next 07/14] ipv6: Preallocate rt->fib6_nh->rt6i_pcpu in ip6_route_info_create().
  2025-04-16  9:21   ` Paolo Abeni
@ 2025-04-16  9:37     ` Paolo Abeni
  2025-04-16 18:55       ` Kuniyuki Iwashima
  0 siblings, 1 reply; 40+ messages in thread
From: Paolo Abeni @ 2025-04-16  9:37 UTC (permalink / raw)
  To: Kuniyuki Iwashima, David S. Miller, David Ahern, Eric Dumazet,
	Jakub Kicinski
  Cc: Simon Horman, Kuniyuki Iwashima, netdev



On 4/16/25 11:21 AM, Paolo Abeni wrote:
> On 4/14/25 8:14 PM, Kuniyuki Iwashima wrote:
>> diff --git a/net/ipv6/route.c b/net/ipv6/route.c
>> index ce060b59d41a..404da01a7502 100644
>> --- a/net/ipv6/route.c
>> +++ b/net/ipv6/route.c
>> @@ -3664,10 +3664,12 @@ int fib6_nh_init(struct net *net, struct fib6_nh *fib6_nh,
>>  		goto out;
>>  
>>  pcpu_alloc:
>> -	fib6_nh->rt6i_pcpu = alloc_percpu_gfp(struct rt6_info *, gfp_flags);
>>  	if (!fib6_nh->rt6i_pcpu) {
>> -		err = -ENOMEM;
>> -		goto out;
>> +		fib6_nh->rt6i_pcpu = alloc_percpu_gfp(struct rt6_info *, gfp_flags);
> 
> 'rt6i_pcpu' has just been pre-allocated, why we need to try again the
> allocation here? 

Ah, I missed the shared code with ipv4. The next patch clarifies thing
to me. I guess it would be better to either squash the patches or add a
comment here (or in the commit message).

Thanks,

Paolo


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH RESEND v2 net-next 09/14] ipv6: Don't pass net to ip6_route_info_append().
  2025-04-14 18:14 ` [PATCH RESEND v2 net-next 09/14] ipv6: Don't pass net to ip6_route_info_append() Kuniyuki Iwashima
@ 2025-04-16  9:39   ` Paolo Abeni
  0 siblings, 0 replies; 40+ messages in thread
From: Paolo Abeni @ 2025-04-16  9:39 UTC (permalink / raw)
  To: Kuniyuki Iwashima, David S. Miller, David Ahern, Eric Dumazet,
	Jakub Kicinski
  Cc: Simon Horman, Kuniyuki Iwashima, netdev

On 4/14/25 8:14 PM, Kuniyuki Iwashima wrote:
> net is not used in ip6_route_info_append() after commit 36f19d5b4f99
> ("net/ipv6: Remove extra call to ip6_convert_metrics for multipath case").
> 
> Let's remove the argument.
> 
> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>

Acked-by: Paolo Abeni <pabeni@redhat.com>


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH RESEND v2 net-next 10/14] ipv6: Factorise ip6_route_multipath_add().
  2025-04-14 18:14 ` [PATCH RESEND v2 net-next 10/14] ipv6: Factorise ip6_route_multipath_add() Kuniyuki Iwashima
@ 2025-04-16  9:57   ` Paolo Abeni
  2025-04-16 18:58     ` Kuniyuki Iwashima
  0 siblings, 1 reply; 40+ messages in thread
From: Paolo Abeni @ 2025-04-16  9:57 UTC (permalink / raw)
  To: Kuniyuki Iwashima, David S. Miller, David Ahern, Eric Dumazet,
	Jakub Kicinski
  Cc: Simon Horman, Kuniyuki Iwashima, netdev

On 4/14/25 8:14 PM, Kuniyuki Iwashima wrote:
> diff --git a/net/ipv6/route.c b/net/ipv6/route.c
> index 49eea7e1e2da..c026f8fe5f78 100644
> --- a/net/ipv6/route.c
> +++ b/net/ipv6/route.c
> @@ -5315,29 +5315,131 @@ struct rt6_nh {
>  	struct fib6_info *fib6_info;
>  	struct fib6_config r_cfg;
>  	struct list_head next;
> +	int weight;
>  };
>  
> -static int ip6_route_info_append(struct list_head *rt6_nh_list,
> -				 struct fib6_info *rt,
> -				 struct fib6_config *r_cfg)
> +static void ip6_route_mpath_info_cleanup(struct list_head *rt6_nh_list)
>  {
> -	struct rt6_nh *nh;
> -	int err = -EEXIST;
> +	struct rt6_nh *nh, *nh_next;
>  
> -	list_for_each_entry(nh, rt6_nh_list, next) {
> -		/* check if fib6_info already exists */
> -		if (rt6_duplicate_nexthop(nh->fib6_info, rt))
> -			return err;
> +	list_for_each_entry_safe(nh, nh_next, rt6_nh_list, next) {
> +		struct fib6_info *rt = nh->fib6_info;
> +
> +		if (rt) {
> +			free_percpu(rt->fib6_nh->nh_common.nhc_pcpu_rth_output);
> +			free_percpu(rt->fib6_nh->rt6i_pcpu);
> +			ip_fib_metrics_put(rt->fib6_metrics);
> +			kfree(rt);
> +		}
> +
> +		list_del(&nh->next);

Somewhat unrelated, but the field 'next' has IMHO a quite misleading
name, and it would be great to rename it to something else ('list'),
could you please consider including such change?

Thanks,

Paolo



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH RESEND v2 net-next 11/14] ipv6: Protect fib6_link_table() with spinlock.
  2025-04-14 18:14 ` [PATCH RESEND v2 net-next 11/14] ipv6: Protect fib6_link_table() with spinlock Kuniyuki Iwashima
@ 2025-04-16 10:00   ` Paolo Abeni
  0 siblings, 0 replies; 40+ messages in thread
From: Paolo Abeni @ 2025-04-16 10:00 UTC (permalink / raw)
  To: Kuniyuki Iwashima, David S. Miller, David Ahern, Eric Dumazet,
	Jakub Kicinski
  Cc: Simon Horman, Kuniyuki Iwashima, netdev

On 4/14/25 8:14 PM, Kuniyuki Iwashima wrote:
> We will get rid of RTNL from RTM_NEWROUTE and SIOCADDRT.
> 
> If the request specifies a new table ID, fib6_new_table() is
> called to create a new routing table.
> 
> Two concurrent requests could specify the same table ID, so we
> need a lock to protect net->ipv6.fib_table_hash[h].
> 
> Let's add a spinlock to protect the hash bucket linkage.
> 
> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>

Acked-by: Paolo Abeni <pabeni@redhat.com>


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH RESEND v2 net-next 12/14] ipv6: Defer fib6_purge_rt() in fib6_add_rt2node() to fib6_add().
  2025-04-14 18:15 ` [PATCH RESEND v2 net-next 12/14] ipv6: Defer fib6_purge_rt() in fib6_add_rt2node() to fib6_add() Kuniyuki Iwashima
@ 2025-04-16 13:59   ` Paolo Abeni
  0 siblings, 0 replies; 40+ messages in thread
From: Paolo Abeni @ 2025-04-16 13:59 UTC (permalink / raw)
  To: Kuniyuki Iwashima, David S. Miller, David Ahern, Eric Dumazet,
	Jakub Kicinski
  Cc: Simon Horman, Kuniyuki Iwashima, netdev

On 4/14/25 8:15 PM, Kuniyuki Iwashima wrote:
> The next patch adds per-nexthop spinlock which protects nh->f6i_list.
> 
> When rt->nh is not NULL, fib6_add_rt2node() will be called under the lock.
> fib6_add_rt2node() could call fib6_purge_rt() for another route, which
> could holds another nexthop lock.
> 
> Then, deadlock could happen between two nexthops.
> 
> Let's defer fib6_purge_rt() after fib6_add_rt2node().
> 
> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>

Acked-by: Paolo Abeni <pabeni@redhat.com>


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH RESEND v2 net-next 13/14] ipv6: Protect nh->f6i_list with spinlock and flag.
  2025-04-14 18:15 ` [PATCH RESEND v2 net-next 13/14] ipv6: Protect nh->f6i_list with spinlock and flag Kuniyuki Iwashima
@ 2025-04-16 15:17   ` Paolo Abeni
  2025-04-16 19:02     ` Kuniyuki Iwashima
  0 siblings, 1 reply; 40+ messages in thread
From: Paolo Abeni @ 2025-04-16 15:17 UTC (permalink / raw)
  To: Kuniyuki Iwashima, David S. Miller, David Ahern, Eric Dumazet,
	Jakub Kicinski
  Cc: Simon Horman, Kuniyuki Iwashima, netdev

On 4/14/25 8:15 PM, Kuniyuki Iwashima wrote:
> @@ -1498,7 +1504,23 @@ int fib6_add(struct fib6_node *root, struct fib6_info *rt,
>  	}
>  #endif
>  
> -	err = fib6_add_rt2node(fn, rt, info, extack, &purge_list);
> +	if (rt->nh) {
> +		spin_lock(&rt->nh->lock);
> +
> +		if (rt->nh->dead) {
> +			NL_SET_ERR_MSG(extack, "Nexthop has been deleted");
> +			err = -EINVAL;
> +		} else {
> +			err = fib6_add_rt2node(fn, rt, info, extack, &purge_list);
> +			if (!err)
> +				list_add(&rt->nh_list, &rt->nh->f6i_list);
> +		}

Maybe move the new check and list_add inside fib6_add_rt2node() or
bundle all the above in a new helper?

/P


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH RESEND v2 net-next 14/14] ipv6: Get rid of RTNL for SIOCADDRT and RTM_NEWROUTE.
  2025-04-14 18:15 ` [PATCH RESEND v2 net-next 14/14] ipv6: Get rid of RTNL for SIOCADDRT and RTM_NEWROUTE Kuniyuki Iwashima
@ 2025-04-16 15:26   ` Paolo Abeni
  2025-04-16 19:04     ` Kuniyuki Iwashima
  0 siblings, 1 reply; 40+ messages in thread
From: Paolo Abeni @ 2025-04-16 15:26 UTC (permalink / raw)
  To: Kuniyuki Iwashima, David S. Miller, David Ahern, Eric Dumazet,
	Jakub Kicinski
  Cc: Simon Horman, Kuniyuki Iwashima, netdev

On 4/14/25 8:15 PM, Kuniyuki Iwashima wrote:
> @@ -5250,7 +5252,7 @@ static int rtm_to_fib6_config(struct sk_buff *skb, struct nlmsghdr *nlh,
>  		cfg->fc_encap_type = nla_get_u16(tb[RTA_ENCAP_TYPE]);
>  
>  		err = lwtunnel_valid_encap_type(cfg->fc_encap_type,
> -						extack, newroute);
> +						extack, false);

It looks every caller always pass 'false' as last argument to
lwtunnel_valid_encap_type(), so such argument could/should be dropped
(as a follow-up if this is too big)

Thanks,

Paolo


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH RESEND v2 net-next 02/14] ipv6: Get rid of RTNL for SIOCDELRT and RTM_DELROUTE.
  2025-04-16  8:49   ` Paolo Abeni
@ 2025-04-16 18:45     ` Kuniyuki Iwashima
  2025-04-17  6:46       ` Paolo Abeni
  0 siblings, 1 reply; 40+ messages in thread
From: Kuniyuki Iwashima @ 2025-04-16 18:45 UTC (permalink / raw)
  To: pabeni; +Cc: davem, dsahern, edumazet, horms, kuba, kuni1840, kuniyu, netdev

From: Paolo Abeni <pabeni@redhat.com>
Date: Wed, 16 Apr 2025 10:49:53 +0200
> On 4/14/25 8:14 PM, Kuniyuki Iwashima wrote:
> > Basically, removing an IPv6 route does not require RTNL because
> > the IPv6 routing tables are protected by per table lock.
> > 
> > inet6_rtm_delroute() calls nexthop_find_by_id() to check if the
> > nexthop specified by RTA_NH_ID exists.  nexthop uses rbtree and
> > the top-down walk can be safely performed under RCU.
> > 
> > ip6_route_del() already relies on RCU and the table lock, but we
> > need to extend the RCU critical section a bit more to cover
> > __ip6_del_rt().  For example, nexthop_for_each_fib6_nh() and
> > inet6_rt_notify() needs RCU.
> 
> The last statement is not clear to me. I don't see __ip6_del_rt()
> calling nexthop_for_each_fib6_nh() or inet6_rt_notify() ?!?

Thank you for review!

It's burried in the depths, and I noticed this from the v1 test result.
https://lore.kernel.org/netdev/Z91yk90LZy9yJexG@mini-arch/

inet6_rtm_delroute
 ip6_route_del
  __ip6_del_rt
   fib6_del
    fib6_del_route
     fib6_purge_rt
      nexthop_for_each_fib6_nh

inet6_rtm_delroute
 ip6_route_del
  __ip6_del_rt
   fib6_del
    fib6_del_route
     inet6_rt_notify


> 
> Also after this patch we have this chunk in ip6_route_del():
> 
> 	table = fib6_get_table(cfg->fc_nlinfo.nl_net, cfg->fc_table);
> 	if (!table)
> 		//..
> 
> 	rcu_read_lock();
> 
> which AFAICS should be safe because 'table' is freed only at netns exit
> time,

Right, and there are few other functions assuming the same thing.

addrconf_get_prefix_route()
rt6_get_route_info()

> but acquiring the rcu lock after grabbing the rcu protected struct
> is confusing. It should be good adding a comment or moving the rcu lock
> before the lookup (and dropping the RCU lock from fib6_get_table())

There are other callers of fib6_get_table(), so I'd move rcu_read_lock()
before it, and will look into them if we can drop it from fib6_get_table().


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH RESEND v2 net-next 04/14] ipv6: Check GATEWAY in rtm_to_fib6_multipath_config().
  2025-04-16  9:06   ` Paolo Abeni
@ 2025-04-16 18:48     ` Kuniyuki Iwashima
  0 siblings, 0 replies; 40+ messages in thread
From: Kuniyuki Iwashima @ 2025-04-16 18:48 UTC (permalink / raw)
  To: pabeni; +Cc: davem, dsahern, edumazet, horms, kuba, kuni1840, kuniyu, netdev

From: Paolo Abeni <pabeni@redhat.com>
Date: Wed, 16 Apr 2025 11:06:30 +0200
> On 4/14/25 8:14 PM, Kuniyuki Iwashima wrote:
> > In ip6_route_multipath_add(), we call rt6_qualify_for_ecmp() for each
> > entry.  If it returns false, the request fails.
> > 
> > rt6_qualify_for_ecmp() returns false if either of the conditions below
> > is true:
> > 
> >   1. f6i->fib6_flags has RTF_ADDRCONF
> >   2. f6i->nh is not NULL
> >   3. f6i->fib6_nh->fib_nh_gw_family is AF_UNSPEC
> > 
> > 1 is unnecessary because rtm_to_fib6_config() never sets RTF_ADDRCONF
> > to cfg->fc_flags.
> > 
> > 2. is equivalent with cfg->fc_nh_id.
> > 
> > 3. can be replaced by checking RTF_GATEWAY in the base and each multipath
> > entry because AF_INET6 is set to f6i->fib6_nh->fib_nh_gw_family only when
> > cfg.fc_is_fdb is true or RTF_GATEWAY is set, but the former is always
> > false.
> > 
> > Let's perform the equivalent checks in rtm_to_fib6_multipath_config().
> 
> It's unclear to me the 'why'???

It's just because this validation does not need to be done under RCU.
Or are you asking why I didn't use rt6_qualify_for_ecmp() ?

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH RESEND v2 net-next 06/14] ipv6: Split ip6_route_info_create().
  2025-04-16  9:12   ` Paolo Abeni
@ 2025-04-16 18:50     ` Kuniyuki Iwashima
  0 siblings, 0 replies; 40+ messages in thread
From: Kuniyuki Iwashima @ 2025-04-16 18:50 UTC (permalink / raw)
  To: pabeni; +Cc: davem, dsahern, edumazet, horms, kuba, kuni1840, kuniyu, netdev

From: Paolo Abeni <pabeni@redhat.com>
Date: Wed, 16 Apr 2025 11:12:31 +0200
> On 4/14/25 8:14 PM, Kuniyuki Iwashima wrote:
> > We will get rid of RTNL from RTM_NEWROUTE and SIOCADDRT and rely
> > on RCU to guarantee dev and nexthop lifetime.
> > 
> > Then, we want to allocate everything as possible before entering
> 
> Then, we want to allocate everything before ...
> 
> or
> 
> Then, we want to allocate as much as possible before ...

Will use the latter, thanks!

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH RESEND v2 net-next 07/14] ipv6: Preallocate rt->fib6_nh->rt6i_pcpu in ip6_route_info_create().
  2025-04-16  9:37     ` Paolo Abeni
@ 2025-04-16 18:55       ` Kuniyuki Iwashima
  0 siblings, 0 replies; 40+ messages in thread
From: Kuniyuki Iwashima @ 2025-04-16 18:55 UTC (permalink / raw)
  To: pabeni; +Cc: davem, dsahern, edumazet, horms, kuba, kuni1840, kuniyu, netdev

From: Paolo Abeni <pabeni@redhat.com>
Date: Wed, 16 Apr 2025 11:37:54 +0200
> On 4/16/25 11:21 AM, Paolo Abeni wrote:
> > On 4/14/25 8:14 PM, Kuniyuki Iwashima wrote:
> >> diff --git a/net/ipv6/route.c b/net/ipv6/route.c
> >> index ce060b59d41a..404da01a7502 100644
> >> --- a/net/ipv6/route.c
> >> +++ b/net/ipv6/route.c
> >> @@ -3664,10 +3664,12 @@ int fib6_nh_init(struct net *net, struct fib6_nh *fib6_nh,
> >>  		goto out;
> >>  
> >>  pcpu_alloc:
> >> -	fib6_nh->rt6i_pcpu = alloc_percpu_gfp(struct rt6_info *, gfp_flags);
> >>  	if (!fib6_nh->rt6i_pcpu) {
> >> -		err = -ENOMEM;
> >> -		goto out;
> >> +		fib6_nh->rt6i_pcpu = alloc_percpu_gfp(struct rt6_info *, gfp_flags);
> > 
> > 'rt6i_pcpu' has just been pre-allocated, why we need to try again the
> > allocation here? 
> 
> Ah, I missed the shared code with ipv4. The next patch clarifies thing
> to me. I guess it would be better to either squash the patches or add a
> comment here (or in the commit message).

Sure, will add description in the commit message.



From: Paolo Abeni <pabeni@redhat.com>
Date: Wed, 16 Apr 2025 11:23:32 +0200
> On 4/14/25 8:14 PM, Kuniyuki Iwashima wrote:
> > ip6_route_info_create_nh() will be called under RCU.
> > 
> > Then, fib6_nh_init() is also under RCU, but per-cpu memory allocation
> > is very likely to fail with GFP_ATOMIC while bluk-adding IPv6 routes
> 
> oops, I forgot a very minor nit:               ^^^^ bulk

OMG (Oh My Grammarly) :)

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH RESEND v2 net-next 10/14] ipv6: Factorise ip6_route_multipath_add().
  2025-04-16  9:57   ` Paolo Abeni
@ 2025-04-16 18:58     ` Kuniyuki Iwashima
  0 siblings, 0 replies; 40+ messages in thread
From: Kuniyuki Iwashima @ 2025-04-16 18:58 UTC (permalink / raw)
  To: pabeni; +Cc: davem, dsahern, edumazet, horms, kuba, kuni1840, kuniyu, netdev

From: Paolo Abeni <pabeni@redhat.com>
Date: Wed, 16 Apr 2025 11:57:23 +0200
> On 4/14/25 8:14 PM, Kuniyuki Iwashima wrote:
> > diff --git a/net/ipv6/route.c b/net/ipv6/route.c
> > index 49eea7e1e2da..c026f8fe5f78 100644
> > --- a/net/ipv6/route.c
> > +++ b/net/ipv6/route.c
> > @@ -5315,29 +5315,131 @@ struct rt6_nh {
> >  	struct fib6_info *fib6_info;
> >  	struct fib6_config r_cfg;
> >  	struct list_head next;
> > +	int weight;
> >  };
> >  
> > -static int ip6_route_info_append(struct list_head *rt6_nh_list,
> > -				 struct fib6_info *rt,
> > -				 struct fib6_config *r_cfg)
> > +static void ip6_route_mpath_info_cleanup(struct list_head *rt6_nh_list)
> >  {
> > -	struct rt6_nh *nh;
> > -	int err = -EEXIST;
> > +	struct rt6_nh *nh, *nh_next;
> >  
> > -	list_for_each_entry(nh, rt6_nh_list, next) {
> > -		/* check if fib6_info already exists */
> > -		if (rt6_duplicate_nexthop(nh->fib6_info, rt))
> > -			return err;
> > +	list_for_each_entry_safe(nh, nh_next, rt6_nh_list, next) {
> > +		struct fib6_info *rt = nh->fib6_info;
> > +
> > +		if (rt) {
> > +			free_percpu(rt->fib6_nh->nh_common.nhc_pcpu_rth_output);
> > +			free_percpu(rt->fib6_nh->rt6i_pcpu);
> > +			ip_fib_metrics_put(rt->fib6_metrics);
> > +			kfree(rt);
> > +		}
> > +
> > +		list_del(&nh->next);
> 
> Somewhat unrelated, but the field 'next' has IMHO a quite misleading
> name, and it would be great to rename it to something else ('list'),
> could you please consider including such change?

Had the same feeling :)

Will add a patch or post a follow if the series gets over 15.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH RESEND v2 net-next 13/14] ipv6: Protect nh->f6i_list with spinlock and flag.
  2025-04-16 15:17   ` Paolo Abeni
@ 2025-04-16 19:02     ` Kuniyuki Iwashima
  0 siblings, 0 replies; 40+ messages in thread
From: Kuniyuki Iwashima @ 2025-04-16 19:02 UTC (permalink / raw)
  To: pabeni; +Cc: davem, dsahern, edumazet, horms, kuba, kuni1840, kuniyu, netdev

From: Paolo Abeni <pabeni@redhat.com>
Date: Wed, 16 Apr 2025 17:17:13 +0200
> On 4/14/25 8:15 PM, Kuniyuki Iwashima wrote:
> > @@ -1498,7 +1504,23 @@ int fib6_add(struct fib6_node *root, struct fib6_info *rt,
> >  	}
> >  #endif
> >  
> > -	err = fib6_add_rt2node(fn, rt, info, extack, &purge_list);
> > +	if (rt->nh) {
> > +		spin_lock(&rt->nh->lock);
> > +
> > +		if (rt->nh->dead) {
> > +			NL_SET_ERR_MSG(extack, "Nexthop has been deleted");
> > +			err = -EINVAL;
> > +		} else {
> > +			err = fib6_add_rt2node(fn, rt, info, extack, &purge_list);
> > +			if (!err)
> > +				list_add(&rt->nh_list, &rt->nh->f6i_list);
> > +		}
> 
> Maybe move the new check and list_add inside fib6_add_rt2node() or
> bundle all the above in a new helper?

fib6_add_rt2node() has return in many places, so a new hepler makes
more sense to me.

Will add fib6_add_rt2node_nh() and move spin_lock() section there.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH RESEND v2 net-next 14/14] ipv6: Get rid of RTNL for SIOCADDRT and RTM_NEWROUTE.
  2025-04-16 15:26   ` Paolo Abeni
@ 2025-04-16 19:04     ` Kuniyuki Iwashima
  0 siblings, 0 replies; 40+ messages in thread
From: Kuniyuki Iwashima @ 2025-04-16 19:04 UTC (permalink / raw)
  To: pabeni; +Cc: davem, dsahern, edumazet, horms, kuba, kuni1840, kuniyu, netdev

From: Paolo Abeni <pabeni@redhat.com>
Date: Wed, 16 Apr 2025 17:26:49 +0200
> On 4/14/25 8:15 PM, Kuniyuki Iwashima wrote:
> > @@ -5250,7 +5252,7 @@ static int rtm_to_fib6_config(struct sk_buff *skb, struct nlmsghdr *nlh,
> >  		cfg->fc_encap_type = nla_get_u16(tb[RTA_ENCAP_TYPE]);
> >  
> >  		err = lwtunnel_valid_encap_type(cfg->fc_encap_type,
> > -						extack, newroute);
> > +						extack, false);
> 
> It looks every caller always pass 'false' as last argument to
> lwtunnel_valid_encap_type(), so such argument could/should be dropped
> (as a follow-up if this is too big)

Yes, it will be a follow-up.

Thanks again for reviewing!

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH RESEND v2 net-next 02/14] ipv6: Get rid of RTNL for SIOCDELRT and RTM_DELROUTE.
  2025-04-16 18:45     ` Kuniyuki Iwashima
@ 2025-04-17  6:46       ` Paolo Abeni
  2025-04-17 17:45         ` Kuniyuki Iwashima
  0 siblings, 1 reply; 40+ messages in thread
From: Paolo Abeni @ 2025-04-17  6:46 UTC (permalink / raw)
  To: Kuniyuki Iwashima; +Cc: davem, dsahern, edumazet, horms, kuba, kuni1840, netdev

On 4/16/25 8:45 PM, Kuniyuki Iwashima wrote:
> From: Paolo Abeni <pabeni@redhat.com>
>> but acquiring the rcu lock after grabbing the rcu protected struct
>> is confusing. It should be good adding a comment or moving the rcu lock
>> before the lookup (and dropping the RCU lock from fib6_get_table())
> 
> There are other callers of fib6_get_table(), so I'd move rcu_read_lock()
> before it, and will look into them if we can drop it from fib6_get_table().

you could provide a RCU-lockless __fib6_get_table() variant, use it
here, and with time (outside this series) such helper usage could be
extended.

Paolo


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH RESEND v2 net-next 02/14] ipv6: Get rid of RTNL for SIOCDELRT and RTM_DELROUTE.
  2025-04-17  6:46       ` Paolo Abeni
@ 2025-04-17 17:45         ` Kuniyuki Iwashima
  0 siblings, 0 replies; 40+ messages in thread
From: Kuniyuki Iwashima @ 2025-04-17 17:45 UTC (permalink / raw)
  To: pabeni; +Cc: davem, dsahern, edumazet, horms, kuba, kuni1840, kuniyu, netdev

From: Paolo Abeni <pabeni@redhat.com>
Date: Thu, 17 Apr 2025 08:46:01 +0200
> On 4/16/25 8:45 PM, Kuniyuki Iwashima wrote:
> > From: Paolo Abeni <pabeni@redhat.com>
> >> but acquiring the rcu lock after grabbing the rcu protected struct
> >> is confusing. It should be good adding a comment or moving the rcu lock
> >> before the lookup (and dropping the RCU lock from fib6_get_table())
> > 
> > There are other callers of fib6_get_table(), so I'd move rcu_read_lock()
> > before it, and will look into them if we can drop it from fib6_get_table().
> 
> you could provide a RCU-lockless __fib6_get_table() variant, use it
> here, and with time (outside this series) such helper usage could be
> extended.

I think it's worth a separate patch, but now the number of patches
is 17, so I'll add a note in the commit message that we don't need
rcu for fib6_get_table() and will post a follow up series to convert
the lockless version in some places.

Thanks!

---8<---
diff --git a/include/net/ip6_fib.h b/include/net/ip6_fib.h
index 7c87873ae211..e5b28a732796 100644
--- a/include/net/ip6_fib.h
+++ b/include/net/ip6_fib.h
@@ -430,6 +430,7 @@ struct fib6_entry_notifier_info {
  *	exported functions
  */
 
+struct fib6_table *__fib6_get_table(struct net *net, u32 id);
 struct fib6_table *fib6_get_table(struct net *net, u32 id);
 struct fib6_table *fib6_new_table(struct net *net, u32 id);
 struct dst_entry *fib6_rule_lookup(struct net *net, struct flowi6 *fl6,
diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index bf727149fdec..ba3e290a9da4 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -265,27 +265,36 @@ struct fib6_table *fib6_new_table(struct net *net, u32 id)
 }
 EXPORT_SYMBOL_GPL(fib6_new_table);
 
-struct fib6_table *fib6_get_table(struct net *net, u32 id)
+struct fib6_table *__fib6_get_table(struct net *net, u32 id)
 {
-	struct fib6_table *tb;
 	struct hlist_head *head;
-	unsigned int h;
+	struct fib6_table *tb;
 
-	if (id == 0)
+	if (!id)
 		id = RT6_TABLE_MAIN;
-	h = id & (FIB6_TABLE_HASHSZ - 1);
-	rcu_read_lock();
-	head = &net->ipv6.fib_table_hash[h];
-	hlist_for_each_entry_rcu(tb, head, tb6_hlist) {
-		if (tb->tb6_id == id) {
-			rcu_read_unlock();
+
+	head = &net->ipv6.fib_table_hash[id & (FIB6_TABLE_HASHSZ - 1)];
+
+	/* Once allocated, fib6_table is not freed until netns dismantle, so
+	 * RCU is not required (but rcu_dereference_raw() is for data-race).
+	 */
+	hlist_for_each_entry_rcu(tb, head, tb6_hlist, true)
+		if (tb->tb6_id == id)
 			return tb;
-		}
-	}
-	rcu_read_unlock();
 
 	return NULL;
 }
+
+struct fib6_table *fib6_get_table(struct net *net, u32 id)
+{
+	struct fib6_table *tb;
+
+	rcu_read_lock();
+	tb = __fib6_get_table(net, id);
+	rcu_read_unlock();
+
+	return tb;
+}
 EXPORT_SYMBOL_GPL(fib6_get_table);
 
 static void __net_init fib6_tables_init(struct net *net)
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 237e31f64a4a..af311c19dcc6 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -4076,7 +4076,7 @@ static int ip6_route_del(struct fib6_config *cfg,
 	struct fib6_node *fn;
 	int err = -ESRCH;
 
-	table = fib6_get_table(cfg->fc_nlinfo.nl_net, cfg->fc_table);
+	table = __fib6_get_table(cfg->fc_nlinfo.nl_net, cfg->fc_table);
 	if (!table) {
 		NL_SET_ERR_MSG(extack, "FIB table does not exist");
 		return err;
---8<---

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCH RESEND v2 net-next 01/14] ipv6: Validate RTA_GATEWAY of RTA_MULTIPATH in rtm_to_fib6_config().
  2025-04-16  8:22   ` Paolo Abeni
@ 2025-04-17 17:53     ` Kuniyuki Iwashima
  0 siblings, 0 replies; 40+ messages in thread
From: Kuniyuki Iwashima @ 2025-04-17 17:53 UTC (permalink / raw)
  To: pabeni; +Cc: davem, dsahern, edumazet, horms, kuba, kuni1840, kuniyu, netdev

From: Paolo Abeni <pabeni@redhat.com>
Date: Wed, 16 Apr 2025 10:22:33 +0200
> > +static int rtm_to_fib6_multipath_config(struct fib6_config *cfg,
> > +					struct netlink_ext_ack *extack)
> > +{
> > +	struct rtnexthop *rtnh;
> > +	int remaining;
> > +
> > +	remaining = cfg->fc_mp_len;
> > +	rtnh = (struct rtnexthop *)cfg->fc_mp;
> > +
> > +	if (!rtnh_ok(rtnh, remaining)) {
> > +		NL_SET_ERR_MSG(extack, "Invalid nexthop configuration - no valid nexthops");
> > +		return -EINVAL;
> > +	}
> > +
> > +	do {
> 
> I think the initial checks and the loop could be rewritten reducing the
> indentation and calling the helper only once with something alike:
> 
> 	for (i = 0; rtnh_ok(rtnh, remaining);
> 	     i++, rtnh = rtnh_next(rtnh, &remaining)) {
> 		int attrlen = rtnh_attrlen(rtnh);
> 		if (!attrlen)
> 			continue;
> 		
> 		// ...
> 	}
> 	if (i == 0) {
> 		NL_SET_ERR_MSG(extack, "Invalid nexthop configuration - no valid
> nexthops");
> 		// ..
> 
> I guess it's a bit subjective, don't repost just for this.

I couldn't reduce the nest due to patch 4, so I'll keep this as is.

Thanks!

^ permalink raw reply	[flat|nested] 40+ messages in thread

end of thread, other threads:[~2025-04-17 17:54 UTC | newest]

Thread overview: 40+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-04-14 18:14 [PATCH RESEND v2 net-next 00/14] ipv6: No RTNL for IPv6 routing table Kuniyuki Iwashima
2025-04-14 18:14 ` [PATCH RESEND v2 net-next 01/14] ipv6: Validate RTA_GATEWAY of RTA_MULTIPATH in rtm_to_fib6_config() Kuniyuki Iwashima
2025-04-16  8:22   ` Paolo Abeni
2025-04-17 17:53     ` Kuniyuki Iwashima
2025-04-14 18:14 ` [PATCH RESEND v2 net-next 02/14] ipv6: Get rid of RTNL for SIOCDELRT and RTM_DELROUTE Kuniyuki Iwashima
2025-04-16  8:49   ` Paolo Abeni
2025-04-16 18:45     ` Kuniyuki Iwashima
2025-04-17  6:46       ` Paolo Abeni
2025-04-17 17:45         ` Kuniyuki Iwashima
2025-04-14 18:14 ` [PATCH RESEND v2 net-next 03/14] ipv6: Move some validation from ip6_route_info_create() to rtm_to_fib6_config() Kuniyuki Iwashima
2025-04-16  8:54   ` Paolo Abeni
2025-04-14 18:14 ` [PATCH RESEND v2 net-next 04/14] ipv6: Check GATEWAY in rtm_to_fib6_multipath_config() Kuniyuki Iwashima
2025-04-16  9:06   ` Paolo Abeni
2025-04-16 18:48     ` Kuniyuki Iwashima
2025-04-14 18:14 ` [PATCH RESEND v2 net-next 05/14] ipv6: Move nexthop_find_by_id() after fib6_info_alloc() Kuniyuki Iwashima
2025-04-16  9:13   ` Paolo Abeni
2025-04-14 18:14 ` [PATCH RESEND v2 net-next 06/14] ipv6: Split ip6_route_info_create() Kuniyuki Iwashima
2025-04-16  9:12   ` Paolo Abeni
2025-04-16 18:50     ` Kuniyuki Iwashima
2025-04-14 18:14 ` [PATCH RESEND v2 net-next 07/14] ipv6: Preallocate rt->fib6_nh->rt6i_pcpu in ip6_route_info_create() Kuniyuki Iwashima
2025-04-16  9:21   ` Paolo Abeni
2025-04-16  9:37     ` Paolo Abeni
2025-04-16 18:55       ` Kuniyuki Iwashima
2025-04-16  9:23   ` Paolo Abeni
2025-04-14 18:14 ` [PATCH RESEND v2 net-next 08/14] ipv6: Preallocate nhc_pcpu_rth_output " Kuniyuki Iwashima
2025-04-14 18:14 ` [PATCH RESEND v2 net-next 09/14] ipv6: Don't pass net to ip6_route_info_append() Kuniyuki Iwashima
2025-04-16  9:39   ` Paolo Abeni
2025-04-14 18:14 ` [PATCH RESEND v2 net-next 10/14] ipv6: Factorise ip6_route_multipath_add() Kuniyuki Iwashima
2025-04-16  9:57   ` Paolo Abeni
2025-04-16 18:58     ` Kuniyuki Iwashima
2025-04-14 18:14 ` [PATCH RESEND v2 net-next 11/14] ipv6: Protect fib6_link_table() with spinlock Kuniyuki Iwashima
2025-04-16 10:00   ` Paolo Abeni
2025-04-14 18:15 ` [PATCH RESEND v2 net-next 12/14] ipv6: Defer fib6_purge_rt() in fib6_add_rt2node() to fib6_add() Kuniyuki Iwashima
2025-04-16 13:59   ` Paolo Abeni
2025-04-14 18:15 ` [PATCH RESEND v2 net-next 13/14] ipv6: Protect nh->f6i_list with spinlock and flag Kuniyuki Iwashima
2025-04-16 15:17   ` Paolo Abeni
2025-04-16 19:02     ` Kuniyuki Iwashima
2025-04-14 18:15 ` [PATCH RESEND v2 net-next 14/14] ipv6: Get rid of RTNL for SIOCADDRT and RTM_NEWROUTE Kuniyuki Iwashima
2025-04-16 15:26   ` Paolo Abeni
2025-04-16 19:04     ` Kuniyuki Iwashima

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).