* [PATCH v3 net-next 01/15] ipv6: Validate RTA_GATEWAY of RTA_MULTIPATH in rtm_to_fib6_config().
2025-04-18 0:03 [PATCH v3 net-next 00/15] ipv6: No RTNL for IPv6 routing table Kuniyuki Iwashima
@ 2025-04-18 0:03 ` Kuniyuki Iwashima
2025-04-18 0:03 ` [PATCH v3 net-next 02/15] ipv6: Get rid of RTNL for SIOCDELRT and RTM_DELROUTE Kuniyuki Iwashima
` (14 subsequent siblings)
15 siblings, 0 replies; 23+ messages in thread
From: Kuniyuki Iwashima @ 2025-04-18 0:03 UTC (permalink / raw)
To: David S. Miller, David Ahern, Eric Dumazet, Jakub Kicinski,
Paolo Abeni
Cc: Simon Horman, Kuniyuki Iwashima, Kuniyuki Iwashima, netdev
We will perform RTM_NEWROUTE and RTM_DELROUTE under RCU, and then
we want to perform some validation out of the RCU scope.
When creating / removing an IPv6 route with RTA_MULTIPATH,
inet6_rtm_newroute() / inet6_rtm_delroute() validates RTA_GATEWAY
in each multipath entry.
Let's do that in rtm_to_fib6_config().
Note that now RTM_DELROUTE returns an error for RTA_MULTIPATH with
0 entries, which was accepted but should result in -EINVAL as
RTM_NEWROUTE.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
net/ipv6/route.c | 82 +++++++++++++++++++++++++-----------------------
1 file changed, 43 insertions(+), 39 deletions(-)
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index e2c6c0b0684b..51f693581b7c 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -5050,6 +5050,44 @@ static const struct nla_policy rtm_ipv6_policy[RTA_MAX+1] = {
[RTA_FLOWLABEL] = { .type = NLA_BE32 },
};
+static int rtm_to_fib6_multipath_config(struct fib6_config *cfg,
+ struct netlink_ext_ack *extack)
+{
+ struct rtnexthop *rtnh;
+ int remaining;
+
+ remaining = cfg->fc_mp_len;
+ rtnh = (struct rtnexthop *)cfg->fc_mp;
+
+ if (!rtnh_ok(rtnh, remaining)) {
+ NL_SET_ERR_MSG(extack, "Invalid nexthop configuration - no valid nexthops");
+ return -EINVAL;
+ }
+
+ do {
+ int attrlen = rtnh_attrlen(rtnh);
+
+ if (attrlen > 0) {
+ struct nlattr *nla, *attrs;
+
+ attrs = rtnh_attrs(rtnh);
+ nla = nla_find(attrs, attrlen, RTA_GATEWAY);
+ if (nla) {
+ if (nla_len(nla) < sizeof(cfg->fc_gateway)) {
+ NL_SET_ERR_MSG(extack,
+ "Invalid IPv6 address in RTA_GATEWAY");
+ return -EINVAL;
+ }
+ }
+ }
+
+ rtnh = rtnh_next(rtnh, &remaining);
+ } while (rtnh_ok(rtnh, remaining));
+
+ return lwtunnel_valid_encap_type_attr(cfg->fc_mp, cfg->fc_mp_len,
+ extack, true);
+}
+
static int rtm_to_fib6_config(struct sk_buff *skb, struct nlmsghdr *nlh,
struct fib6_config *cfg,
struct netlink_ext_ack *extack)
@@ -5164,9 +5202,7 @@ static int rtm_to_fib6_config(struct sk_buff *skb, struct nlmsghdr *nlh,
cfg->fc_mp = nla_data(tb[RTA_MULTIPATH]);
cfg->fc_mp_len = nla_len(tb[RTA_MULTIPATH]);
- err = lwtunnel_valid_encap_type_attr(cfg->fc_mp,
- cfg->fc_mp_len,
- extack, true);
+ err = rtm_to_fib6_multipath_config(cfg, extack);
if (err < 0)
goto errout;
}
@@ -5286,19 +5322,6 @@ static bool ip6_route_mpath_should_notify(const struct fib6_info *rt)
return should_notify;
}
-static int fib6_gw_from_attr(struct in6_addr *gw, struct nlattr *nla,
- struct netlink_ext_ack *extack)
-{
- if (nla_len(nla) < sizeof(*gw)) {
- NL_SET_ERR_MSG(extack, "Invalid IPv6 address in RTA_GATEWAY");
- return -EINVAL;
- }
-
- *gw = nla_get_in6_addr(nla);
-
- return 0;
-}
-
static int ip6_route_multipath_add(struct fib6_config *cfg,
struct netlink_ext_ack *extack)
{
@@ -5339,18 +5362,11 @@ static int ip6_route_multipath_add(struct fib6_config *cfg,
nla = nla_find(attrs, attrlen, RTA_GATEWAY);
if (nla) {
- err = fib6_gw_from_attr(&r_cfg.fc_gateway, nla,
- extack);
- if (err)
- goto cleanup;
-
+ r_cfg.fc_gateway = nla_get_in6_addr(nla);
r_cfg.fc_flags |= RTF_GATEWAY;
}
- r_cfg.fc_encap = nla_find(attrs, attrlen, RTA_ENCAP);
- /* RTA_ENCAP_TYPE length checked in
- * lwtunnel_valid_encap_type_attr
- */
+ r_cfg.fc_encap = nla_find(attrs, attrlen, RTA_ENCAP);
nla = nla_find(attrs, attrlen, RTA_ENCAP_TYPE);
if (nla)
r_cfg.fc_encap_type = nla_get_u16(nla);
@@ -5383,12 +5399,6 @@ static int ip6_route_multipath_add(struct fib6_config *cfg,
rtnh = rtnh_next(rtnh, &remaining);
}
- if (list_empty(&rt6_nh_list)) {
- NL_SET_ERR_MSG(extack,
- "Invalid nexthop configuration - no valid nexthops");
- return -EINVAL;
- }
-
/* for add and replace send one notification with all nexthops.
* Skip the notification in fib6_add_rt2node and send one with
* the full route when done
@@ -5510,21 +5520,15 @@ static int ip6_route_multipath_del(struct fib6_config *cfg,
nla = nla_find(attrs, attrlen, RTA_GATEWAY);
if (nla) {
- err = fib6_gw_from_attr(&r_cfg.fc_gateway, nla,
- extack);
- if (err) {
- last_err = err;
- goto next_rtnh;
- }
-
+ r_cfg.fc_gateway = nla_get_in6_addr(nla);
r_cfg.fc_flags |= RTF_GATEWAY;
}
}
+
err = ip6_route_del(&r_cfg, extack);
if (err)
last_err = err;
-next_rtnh:
rtnh = rtnh_next(rtnh, &remaining);
}
--
2.49.0
^ permalink raw reply related [flat|nested] 23+ messages in thread* [PATCH v3 net-next 02/15] ipv6: Get rid of RTNL for SIOCDELRT and RTM_DELROUTE.
2025-04-18 0:03 [PATCH v3 net-next 00/15] ipv6: No RTNL for IPv6 routing table Kuniyuki Iwashima
2025-04-18 0:03 ` [PATCH v3 net-next 01/15] ipv6: Validate RTA_GATEWAY of RTA_MULTIPATH in rtm_to_fib6_config() Kuniyuki Iwashima
@ 2025-04-18 0:03 ` Kuniyuki Iwashima
2025-04-18 0:03 ` [PATCH v3 net-next 03/15] ipv6: Move some validation from ip6_route_info_create() to rtm_to_fib6_config() Kuniyuki Iwashima
` (13 subsequent siblings)
15 siblings, 0 replies; 23+ messages in thread
From: Kuniyuki Iwashima @ 2025-04-18 0:03 UTC (permalink / raw)
To: David S. Miller, David Ahern, Eric Dumazet, Jakub Kicinski,
Paolo Abeni
Cc: Simon Horman, Kuniyuki Iwashima, Kuniyuki Iwashima, netdev
Basically, removing an IPv6 route does not require RTNL because
the IPv6 routing tables are protected by per table lock.
inet6_rtm_delroute() calls nexthop_find_by_id() to check if the
nexthop specified by RTA_NH_ID exists. nexthop uses rbtree and
the top-down walk can be safely performed under RCU.
ip6_route_del() already relies on RCU and the table lock, but we
need to extend the RCU critical section a bit more to cover
__ip6_del_rt(). For example, nexthop_for_each_fib6_nh() and
inet6_rt_notify() needs RCU.
Let's call nexthop_find_by_id() and __ip6_del_rt() under RCU and
get rid of RTNL from inet6_rtm_delroute() and SIOCDELRT.
Even if the nexthop is removed after rcu_read_unlock() in
inet6_rtm_delroute(), __remove_nexthop_fib() cleans up the routes
tied to the nexthop, and ip6_route_del() returns -ESRCH. So the
request was at least valid as of nexthop_find_by_id(), and it's just
a matter of timing.
Note that we need to pass false to lwtunnel_valid_encap_type_attr().
The following patches also use the newroute bool.
Note also that fib6_get_table() does not require RCU because once
allocated fib6_table is not freed until netns dismantle. I will
post a follow-up series to convert such callers to RCU-lockless
version. [0]
Link: https://lore.kernel.org/netdev/20250417174557.65721-1-kuniyu@amazon.com/ #[0]
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
v3: Add a note that fib6_get_table() does not require RCU
v2: Call __ip6_del_rt() under RCU
---
net/ipv6/route.c | 48 ++++++++++++++++++++++++++++--------------------
1 file changed, 28 insertions(+), 20 deletions(-)
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 51f693581b7c..4de7abe5ee02 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -4124,9 +4124,9 @@ static int ip6_route_del(struct fib6_config *cfg,
if (rt->nh) {
if (!fib6_info_hold_safe(rt))
continue;
- rcu_read_unlock();
- return __ip6_del_rt(rt, &cfg->fc_nlinfo);
+ err = __ip6_del_rt(rt, &cfg->fc_nlinfo);
+ break;
}
if (cfg->fc_nh_id)
continue;
@@ -4141,13 +4141,13 @@ static int ip6_route_del(struct fib6_config *cfg,
continue;
if (!fib6_info_hold_safe(rt))
continue;
- rcu_read_unlock();
/* if gateway was specified only delete the one hop */
if (cfg->fc_flags & RTF_GATEWAY)
- return __ip6_del_rt(rt, &cfg->fc_nlinfo);
-
- return __ip6_del_rt_siblings(rt, cfg);
+ err = __ip6_del_rt(rt, &cfg->fc_nlinfo);
+ else
+ err = __ip6_del_rt_siblings(rt, cfg);
+ break;
}
}
rcu_read_unlock();
@@ -4516,19 +4516,20 @@ int ipv6_route_ioctl(struct net *net, unsigned int cmd, struct in6_rtmsg *rtmsg)
rtmsg_to_fib6_config(net, rtmsg, &cfg);
- rtnl_lock();
switch (cmd) {
case SIOCADDRT:
+ rtnl_lock();
/* Only do the default setting of fc_metric in route adding */
if (cfg.fc_metric == 0)
cfg.fc_metric = IP6_RT_PRIO_USER;
err = ip6_route_add(&cfg, GFP_KERNEL, NULL);
+ rtnl_unlock();
break;
case SIOCDELRT:
err = ip6_route_del(&cfg, NULL);
break;
}
- rtnl_unlock();
+
return err;
}
@@ -5051,7 +5052,8 @@ static const struct nla_policy rtm_ipv6_policy[RTA_MAX+1] = {
};
static int rtm_to_fib6_multipath_config(struct fib6_config *cfg,
- struct netlink_ext_ack *extack)
+ struct netlink_ext_ack *extack,
+ bool newroute)
{
struct rtnexthop *rtnh;
int remaining;
@@ -5085,15 +5087,16 @@ static int rtm_to_fib6_multipath_config(struct fib6_config *cfg,
} while (rtnh_ok(rtnh, remaining));
return lwtunnel_valid_encap_type_attr(cfg->fc_mp, cfg->fc_mp_len,
- extack, true);
+ extack, newroute);
}
static int rtm_to_fib6_config(struct sk_buff *skb, struct nlmsghdr *nlh,
struct fib6_config *cfg,
struct netlink_ext_ack *extack)
{
- struct rtmsg *rtm;
+ bool newroute = nlh->nlmsg_type == RTM_NEWROUTE;
struct nlattr *tb[RTA_MAX+1];
+ struct rtmsg *rtm;
unsigned int pref;
int err;
@@ -5202,7 +5205,7 @@ static int rtm_to_fib6_config(struct sk_buff *skb, struct nlmsghdr *nlh,
cfg->fc_mp = nla_data(tb[RTA_MULTIPATH]);
cfg->fc_mp_len = nla_len(tb[RTA_MULTIPATH]);
- err = rtm_to_fib6_multipath_config(cfg, extack);
+ err = rtm_to_fib6_multipath_config(cfg, extack, newroute);
if (err < 0)
goto errout;
}
@@ -5222,7 +5225,7 @@ static int rtm_to_fib6_config(struct sk_buff *skb, struct nlmsghdr *nlh,
cfg->fc_encap_type = nla_get_u16(tb[RTA_ENCAP_TYPE]);
err = lwtunnel_valid_encap_type(cfg->fc_encap_type,
- extack, true);
+ extack, newroute);
if (err < 0)
goto errout;
}
@@ -5545,15 +5548,20 @@ static int inet6_rtm_delroute(struct sk_buff *skb, struct nlmsghdr *nlh,
if (err < 0)
return err;
- if (cfg.fc_nh_id &&
- !nexthop_find_by_id(sock_net(skb->sk), cfg.fc_nh_id)) {
- NL_SET_ERR_MSG(extack, "Nexthop id does not exist");
- return -EINVAL;
+ if (cfg.fc_nh_id) {
+ rcu_read_lock();
+ err = !nexthop_find_by_id(sock_net(skb->sk), cfg.fc_nh_id);
+ rcu_read_unlock();
+
+ if (err) {
+ NL_SET_ERR_MSG(extack, "Nexthop id does not exist");
+ return -EINVAL;
+ }
}
- if (cfg.fc_mp)
+ if (cfg.fc_mp) {
return ip6_route_multipath_del(&cfg, extack);
- else {
+ } else {
cfg.fc_delete_all_nh = 1;
return ip6_route_del(&cfg, extack);
}
@@ -6765,7 +6773,7 @@ static const struct rtnl_msg_handler ip6_route_rtnl_msg_handlers[] __initconst_o
{.owner = THIS_MODULE, .protocol = PF_INET6, .msgtype = RTM_NEWROUTE,
.doit = inet6_rtm_newroute},
{.owner = THIS_MODULE, .protocol = PF_INET6, .msgtype = RTM_DELROUTE,
- .doit = inet6_rtm_delroute},
+ .doit = inet6_rtm_delroute, .flags = RTNL_FLAG_DOIT_UNLOCKED},
{.owner = THIS_MODULE, .protocol = PF_INET6, .msgtype = RTM_GETROUTE,
.doit = inet6_rtm_getroute, .flags = RTNL_FLAG_DOIT_UNLOCKED},
};
--
2.49.0
^ permalink raw reply related [flat|nested] 23+ messages in thread* [PATCH v3 net-next 03/15] ipv6: Move some validation from ip6_route_info_create() to rtm_to_fib6_config().
2025-04-18 0:03 [PATCH v3 net-next 00/15] ipv6: No RTNL for IPv6 routing table Kuniyuki Iwashima
2025-04-18 0:03 ` [PATCH v3 net-next 01/15] ipv6: Validate RTA_GATEWAY of RTA_MULTIPATH in rtm_to_fib6_config() Kuniyuki Iwashima
2025-04-18 0:03 ` [PATCH v3 net-next 02/15] ipv6: Get rid of RTNL for SIOCDELRT and RTM_DELROUTE Kuniyuki Iwashima
@ 2025-04-18 0:03 ` Kuniyuki Iwashima
2025-04-29 0:24 ` Lai, Yi
2025-04-18 0:03 ` [PATCH v3 net-next 04/15] ipv6: Check GATEWAY in rtm_to_fib6_multipath_config() Kuniyuki Iwashima
` (12 subsequent siblings)
15 siblings, 1 reply; 23+ messages in thread
From: Kuniyuki Iwashima @ 2025-04-18 0:03 UTC (permalink / raw)
To: David S. Miller, David Ahern, Eric Dumazet, Jakub Kicinski,
Paolo Abeni
Cc: Simon Horman, Kuniyuki Iwashima, Kuniyuki Iwashima, netdev
ip6_route_info_create() is called from 3 functions:
* ip6_route_add()
* ip6_route_multipath_add()
* addrconf_f6i_alloc()
addrconf_f6i_alloc() does not need validation for struct fib6_config in
ip6_route_info_create().
ip6_route_multipath_add() calls ip6_route_info_create() for multiple
routes with slightly different fib6_config instances, which is copied
from the base config passed from userspace. So, we need not validate
the same config repeatedly.
Let's move such validation into rtm_to_fib6_config().
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
---
net/ipv6/route.c | 79 +++++++++++++++++++++++++-----------------------
1 file changed, 42 insertions(+), 37 deletions(-)
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 4de7abe5ee02..23102f37f220 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -3739,38 +3739,6 @@ static struct fib6_info *ip6_route_info_create(struct fib6_config *cfg,
int err = -EINVAL;
int addr_type;
- /* RTF_PCPU is an internal flag; can not be set by userspace */
- if (cfg->fc_flags & RTF_PCPU) {
- NL_SET_ERR_MSG(extack, "Userspace can not set RTF_PCPU");
- goto out;
- }
-
- /* RTF_CACHE is an internal flag; can not be set by userspace */
- if (cfg->fc_flags & RTF_CACHE) {
- NL_SET_ERR_MSG(extack, "Userspace can not set RTF_CACHE");
- goto out;
- }
-
- if (cfg->fc_type > RTN_MAX) {
- NL_SET_ERR_MSG(extack, "Invalid route type");
- goto out;
- }
-
- if (cfg->fc_dst_len > 128) {
- NL_SET_ERR_MSG(extack, "Invalid prefix length");
- goto out;
- }
- if (cfg->fc_src_len > 128) {
- NL_SET_ERR_MSG(extack, "Invalid source address length");
- goto out;
- }
-#ifndef CONFIG_IPV6_SUBTREES
- if (cfg->fc_src_len) {
- NL_SET_ERR_MSG(extack,
- "Specifying source address requires IPV6_SUBTREES to be enabled");
- goto out;
- }
-#endif
if (cfg->fc_nh_id) {
nh = nexthop_find_by_id(net, cfg->fc_nh_id);
if (!nh) {
@@ -3835,11 +3803,6 @@ static struct fib6_info *ip6_route_info_create(struct fib6_config *cfg,
rt->fib6_src.plen = cfg->fc_src_len;
#endif
if (nh) {
- if (rt->fib6_src.plen) {
- NL_SET_ERR_MSG(extack, "Nexthops can not be used with source routing");
- err = -EINVAL;
- goto out_free;
- }
if (!nexthop_get(nh)) {
NL_SET_ERR_MSG(extack, "Nexthop has been deleted");
err = -ENOENT;
@@ -5239,6 +5202,48 @@ static int rtm_to_fib6_config(struct sk_buff *skb, struct nlmsghdr *nlh,
}
}
+ if (newroute) {
+ /* RTF_PCPU is an internal flag; can not be set by userspace */
+ if (cfg->fc_flags & RTF_PCPU) {
+ NL_SET_ERR_MSG(extack, "Userspace can not set RTF_PCPU");
+ goto errout;
+ }
+
+ /* RTF_CACHE is an internal flag; can not be set by userspace */
+ if (cfg->fc_flags & RTF_CACHE) {
+ NL_SET_ERR_MSG(extack, "Userspace can not set RTF_CACHE");
+ goto errout;
+ }
+
+ if (cfg->fc_type > RTN_MAX) {
+ NL_SET_ERR_MSG(extack, "Invalid route type");
+ goto errout;
+ }
+
+ if (cfg->fc_dst_len > 128) {
+ NL_SET_ERR_MSG(extack, "Invalid prefix length");
+ goto errout;
+ }
+
+#ifdef CONFIG_IPV6_SUBTREES
+ if (cfg->fc_src_len > 128) {
+ NL_SET_ERR_MSG(extack, "Invalid source address length");
+ goto errout;
+ }
+
+ if (cfg->fc_nh_id && cfg->fc_src_len) {
+ NL_SET_ERR_MSG(extack, "Nexthops can not be used with source routing");
+ goto errout;
+ }
+#else
+ if (cfg->fc_src_len) {
+ NL_SET_ERR_MSG(extack,
+ "Specifying source address requires IPV6_SUBTREES to be enabled");
+ goto errout;
+ }
+#endif
+ }
+
err = 0;
errout:
return err;
--
2.49.0
^ permalink raw reply related [flat|nested] 23+ messages in thread* Re: [PATCH v3 net-next 03/15] ipv6: Move some validation from ip6_route_info_create() to rtm_to_fib6_config().
2025-04-18 0:03 ` [PATCH v3 net-next 03/15] ipv6: Move some validation from ip6_route_info_create() to rtm_to_fib6_config() Kuniyuki Iwashima
@ 2025-04-29 0:24 ` Lai, Yi
2025-04-29 1:20 ` Kuniyuki Iwashima
0 siblings, 1 reply; 23+ messages in thread
From: Lai, Yi @ 2025-04-29 0:24 UTC (permalink / raw)
To: Kuniyuki Iwashima
Cc: David S. Miller, David Ahern, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Simon Horman, Kuniyuki Iwashima, netdev, yi1.lai,
syzkaller-bugs
Hi Kuniyuki Iwashima,
Greetings!
I used Syzkaller and found that there is KASAN: use-after-free Read in ip6_route_info_create in linux-next tag - next-20250428.
After bisection and the first bad commit is:
"
fa76c1674f2e ipv6: Move some validation from ip6_route_info_create() to rtm_to_fib6_config().
"
All detailed into can be found at:
https://github.com/laifryiee/syzkaller_logs/tree/main/250429_005622_ip6_route_info_create
Syzkaller repro code:
https://github.com/laifryiee/syzkaller_logs/tree/main/250429_005622_ip6_route_info_create/repro.c
Syzkaller repro syscall steps:
https://github.com/laifryiee/syzkaller_logs/tree/main/250429_005622_ip6_route_info_create/repro.prog
Syzkaller report:
https://github.com/laifryiee/syzkaller_logs/tree/main/250429_005622_ip6_route_info_create/repro.report
Kconfig(make olddefconfig):
https://github.com/laifryiee/syzkaller_logs/tree/main/250429_005622_ip6_route_info_create/kconfig_origin
Bisect info:
https://github.com/laifryiee/syzkaller_logs/tree/main/250429_005622_ip6_route_info_create/bisect_info.log
bzImage:
https://github.com/laifryiee/syzkaller_logs/raw/refs/heads/main/250429_005622_ip6_route_info_create/bzImage_33035b665157558254b3c21c3f049fd728e72368
Issue dmesg:
https://github.com/laifryiee/syzkaller_logs/blob/main/250429_005622_ip6_route_info_create/33035b665157558254b3c21c3f049fd728e72368_dmesg.log
"
[ 17.307248] ==================================================================
[ 17.307611] BUG: KASAN: slab-use-after-free in ip6_route_info_create+0xb84/0xc30
[ 17.307993] Read of size 1 at addr ffff8880100b8a94 by task repro/727
[ 17.308291]
[ 17.308389] CPU: 0 UID: 0 PID: 727 Comm: repro Not tainted 6.15.0-rc4-next-20250428-33035b665157 #1 PREEMPT(voluntary)
[ 17.308397] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
[ 17.308405] Call Trace:
[ 17.308412] <TASK>
[ 17.308414] dump_stack_lvl+0xea/0x150
[ 17.308439] print_report+0xce/0x660
[ 17.308469] ? ip6_route_info_create+0xb84/0xc30
[ 17.308475] ? kasan_complete_mode_report_info+0x80/0x200
[ 17.308482] ? ip6_route_info_create+0xb84/0xc30
[ 17.308489] kasan_report+0xd6/0x110
[ 17.308496] ? ip6_route_info_create+0xb84/0xc30
[ 17.308504] __asan_report_load1_noabort+0x18/0x20
[ 17.308509] ip6_route_info_create+0xb84/0xc30
[ 17.308516] ip6_route_add+0x32/0x320
[ 17.308524] ipv6_route_ioctl+0x414/0x5a0
[ 17.308530] ? __pfx_ipv6_route_ioctl+0x10/0x10
[ 17.308539] ? __might_fault+0xf1/0x1b0
[ 17.308556] inet6_ioctl+0x265/0x2b0
[ 17.308568] ? __pfx_inet6_ioctl+0x10/0x10
[ 17.308573] ? do_anonymous_page+0x4b5/0x1b30
[ 17.308579] ? register_lock_class+0x49/0x4b0
[ 17.308597] ? __sanitizer_cov_trace_switch+0x58/0xa0
[ 17.308616] sock_do_ioctl+0xde/0x260
[ 17.308628] ? __pfx_sock_do_ioctl+0x10/0x10
[ 17.308634] ? __lock_acquire+0x410/0x2260
[ 17.308640] ? __lock_acquire+0x410/0x2260
[ 17.308649] ? __sanitizer_cov_trace_switch+0x58/0xa0
[ 17.308656] sock_ioctl+0x23e/0x6a0
[ 17.308665] ? __pfx_sock_ioctl+0x10/0x10
[ 17.308671] ? __this_cpu_preempt_check+0x21/0x30
[ 17.308683] ? seqcount_lockdep_reader_access.constprop.0+0xb4/0xd0
[ 17.308694] ? lockdep_hardirqs_on+0x89/0x110
[ 17.308703] ? trace_hardirqs_on+0x51/0x60
[ 17.308717] ? seqcount_lockdep_reader_access.constprop.0+0xc0/0xd0
[ 17.308723] ? __sanitizer_cov_trace_cmp4+0x1a/0x20
[ 17.308729] ? ktime_get_coarse_real_ts64+0xad/0xf0
[ 17.308737] ? __pfx_sock_ioctl+0x10/0x10
[ 17.308744] __x64_sys_ioctl+0x1bc/0x220
[ 17.308765] x64_sys_call+0x122e/0x2150
[ 17.308774] do_syscall_64+0x6d/0x150
[ 17.308783] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 17.308789] RIP: 0033:0x7f75a8c3ee5d
[ 17.308797] Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 93 af 1b 00 f7 d8 64 89 01 48
[ 17.308803] RSP: 002b:00007ffe7620af68 EFLAGS: 00000206 ORIG_RAX: 0000000000000010
[ 17.308814] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f75a8c3ee5d
[ 17.308818] RDX: 00000000200015c0 RSI: 000000000000890b RDI: 0000000000000003
[ 17.308821] RBP: 00007ffe7620af80 R08: 0000000000000800 R09: 0000000000000800
[ 17.308825] R10: 0000000000000000 R11: 0000000000000206 R12: 00007ffe7620b098
[ 17.308828] R13: 0000000000401136 R14: 0000000000403e08 R15: 00007f75a8fc3000
[ 17.308835] </TASK>
[ 17.308837]
[ 17.320668] Allocated by task 653:
[ 17.320836] kasan_save_stack+0x2c/0x60
[ 17.321028] kasan_save_track+0x18/0x40
[ 17.321217] kasan_save_alloc_info+0x3c/0x50
[ 17.321430] __kasan_slab_alloc+0x62/0x80
[ 17.321627] kmem_cache_alloc_noprof+0x13d/0x430
[ 17.321855] getname_kernel+0x5c/0x390
[ 17.322044] kern_path+0x29/0x90
[ 17.322203] unix_find_other+0x11b/0x880
[ 17.322395] unix_stream_connect+0x4f5/0x1a50
[ 17.322604] __sys_connect_file+0x159/0x1d0
[ 17.322805] __sys_connect+0x176/0x1b0
[ 17.322986] __x64_sys_connect+0x7b/0xc0
[ 17.323180] x64_sys_call+0x1bc7/0x2150
[ 17.323371] do_syscall_64+0x6d/0x150
[ 17.323555] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 17.323793]
[ 17.323876] Freed by task 653:
[ 17.324024] kasan_save_stack+0x2c/0x60
[ 17.324212] kasan_save_track+0x18/0x40
[ 17.324405] kasan_save_free_info+0x3f/0x60
[ 17.324606] __kasan_slab_free+0x3d/0x60
[ 17.324799] kmem_cache_free+0x2ea/0x520
[ 17.324987] putname.part.0+0x132/0x180
[ 17.325175] kern_path+0x74/0x90
[ 17.325335] unix_find_other+0x11b/0x880
[ 17.325526] unix_stream_connect+0x4f5/0x1a50
[ 17.325736] __sys_connect_file+0x159/0x1d0
[ 17.325941] __sys_connect+0x176/0x1b0
[ 17.326122] __x64_sys_connect+0x7b/0xc0
[ 17.326316] x64_sys_call+0x1bc7/0x2150
[ 17.326504] do_syscall_64+0x6d/0x150
[ 17.326687] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 17.326929]
[ 17.327013] The buggy address belongs to the object at ffff8880100b8000
[ 17.327013] which belongs to the cache names_cache of size 4096
[ 17.327572] The buggy address is located 2708 bytes inside of
[ 17.327572] freed 4096-byte region [ffff8880100b8000, ffff8880100b9000)
[ 17.328121]
[ 17.328204] The buggy address belongs to the physical page:
[ 17.328461] page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x100b8
[ 17.328831] head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
[ 17.329184] flags: 0xfffffc0000040(head|node=0|zone=1|lastcpupid=0x1fffff)
[ 17.329508] page_type: f5(slab)
[ 17.329670] raw: 000fffffc0000040 ffff88800d72cdc0 dead000000000100 dead000000000122
[ 17.330022] raw: 0000000000000000 0000000000070007 00000000f5000000 0000000000000000
[ 17.330381] head: 000fffffc0000040 ffff88800d72cdc0 dead000000000100 dead000000000122
[ 17.330738] head: 0000000000000000 0000000000070007 00000000f5000000 0000000000000000
[ 17.331094] head: 000fffffc0000003 ffffea0000402e01 00000000ffffffff 00000000ffffffff
[ 17.331454] head: ffffffffffffffff 0000000000000000 00000000ffffffff 0000000000000008
[ 17.331807] page dumped because: kasan: bad access detected
[ 17.332066]
[ 17.332150] Memory state around the buggy address:
[ 17.332374] ffff8880100b8980: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 17.332703] ffff8880100b8a00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 17.333031] >ffff8880100b8a80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 17.333361] ^
[ 17.333545] ffff8880100b8b00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 17.333874] ffff8880100b8b80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 17.334201] ==================================================================
"
Hope this cound be insightful to you.
Regards,
Yi Lai
---
If you don't need the following environment to reproduce the problem or if you
already have one reproduced environment, please ignore the following information.
How to reproduce:
git clone https://gitlab.com/xupengfe/repro_vm_env.git
cd repro_vm_env
tar -xvf repro_vm_env.tar.gz
cd repro_vm_env; ./start3.sh // it needs qemu-system-x86_64 and I used v7.1.0
// start3.sh will load bzImage_2241ab53cbb5cdb08a6b2d4688feb13971058f65 v6.2-rc5 kernel
// You could change the bzImage_xxx as you want
// Maybe you need to remove line "-drive if=pflash,format=raw,readonly=on,file=./OVMF_CODE.fd \" for different qemu version
You could use below command to log in, there is no password for root.
ssh -p 10023 root@localhost
After login vm(virtual machine) successfully, you could transfer reproduced
binary to the vm by below way, and reproduce the problem in vm:
gcc -pthread -o repro repro.c
scp -P 10023 repro root@localhost:/root/
Get the bzImage for target kernel:
Please use target kconfig and copy it to kernel_src/.config
make olddefconfig
make -jx bzImage //x should equal or less than cpu num your pc has
Fill the bzImage file into above start3.sh to load the target kernel in vm.
Tips:
If you already have qemu-system-x86_64, please ignore below info.
If you want to install qemu v7.1.0 version:
git clone https://github.com/qemu/qemu.git
cd qemu
git checkout -f v7.1.0
mkdir build
cd build
yum install -y ninja-build.x86_64
yum -y install libslirp-devel.x86_64
../configure --target-list=x86_64-softmmu --enable-kvm --enable-vnc --enable-gtk --enable-sdl --enable-usb-redir --enable-slirp
make
make install
On Thu, Apr 17, 2025 at 05:03:44PM -0700, Kuniyuki Iwashima wrote:
> ip6_route_info_create() is called from 3 functions:
>
> * ip6_route_add()
> * ip6_route_multipath_add()
> * addrconf_f6i_alloc()
>
> addrconf_f6i_alloc() does not need validation for struct fib6_config in
> ip6_route_info_create().
>
> ip6_route_multipath_add() calls ip6_route_info_create() for multiple
> routes with slightly different fib6_config instances, which is copied
> from the base config passed from userspace. So, we need not validate
> the same config repeatedly.
>
> Let's move such validation into rtm_to_fib6_config().
>
> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> Acked-by: Paolo Abeni <pabeni@redhat.com>
> ---
> net/ipv6/route.c | 79 +++++++++++++++++++++++++-----------------------
> 1 file changed, 42 insertions(+), 37 deletions(-)
>
> diff --git a/net/ipv6/route.c b/net/ipv6/route.c
> index 4de7abe5ee02..23102f37f220 100644
> --- a/net/ipv6/route.c
> +++ b/net/ipv6/route.c
> @@ -3739,38 +3739,6 @@ static struct fib6_info *ip6_route_info_create(struct fib6_config *cfg,
> int err = -EINVAL;
> int addr_type;
>
> - /* RTF_PCPU is an internal flag; can not be set by userspace */
> - if (cfg->fc_flags & RTF_PCPU) {
> - NL_SET_ERR_MSG(extack, "Userspace can not set RTF_PCPU");
> - goto out;
> - }
> -
> - /* RTF_CACHE is an internal flag; can not be set by userspace */
> - if (cfg->fc_flags & RTF_CACHE) {
> - NL_SET_ERR_MSG(extack, "Userspace can not set RTF_CACHE");
> - goto out;
> - }
> -
> - if (cfg->fc_type > RTN_MAX) {
> - NL_SET_ERR_MSG(extack, "Invalid route type");
> - goto out;
> - }
> -
> - if (cfg->fc_dst_len > 128) {
> - NL_SET_ERR_MSG(extack, "Invalid prefix length");
> - goto out;
> - }
> - if (cfg->fc_src_len > 128) {
> - NL_SET_ERR_MSG(extack, "Invalid source address length");
> - goto out;
> - }
> -#ifndef CONFIG_IPV6_SUBTREES
> - if (cfg->fc_src_len) {
> - NL_SET_ERR_MSG(extack,
> - "Specifying source address requires IPV6_SUBTREES to be enabled");
> - goto out;
> - }
> -#endif
> if (cfg->fc_nh_id) {
> nh = nexthop_find_by_id(net, cfg->fc_nh_id);
> if (!nh) {
> @@ -3835,11 +3803,6 @@ static struct fib6_info *ip6_route_info_create(struct fib6_config *cfg,
> rt->fib6_src.plen = cfg->fc_src_len;
> #endif
> if (nh) {
> - if (rt->fib6_src.plen) {
> - NL_SET_ERR_MSG(extack, "Nexthops can not be used with source routing");
> - err = -EINVAL;
> - goto out_free;
> - }
> if (!nexthop_get(nh)) {
> NL_SET_ERR_MSG(extack, "Nexthop has been deleted");
> err = -ENOENT;
> @@ -5239,6 +5202,48 @@ static int rtm_to_fib6_config(struct sk_buff *skb, struct nlmsghdr *nlh,
> }
> }
>
> + if (newroute) {
> + /* RTF_PCPU is an internal flag; can not be set by userspace */
> + if (cfg->fc_flags & RTF_PCPU) {
> + NL_SET_ERR_MSG(extack, "Userspace can not set RTF_PCPU");
> + goto errout;
> + }
> +
> + /* RTF_CACHE is an internal flag; can not be set by userspace */
> + if (cfg->fc_flags & RTF_CACHE) {
> + NL_SET_ERR_MSG(extack, "Userspace can not set RTF_CACHE");
> + goto errout;
> + }
> +
> + if (cfg->fc_type > RTN_MAX) {
> + NL_SET_ERR_MSG(extack, "Invalid route type");
> + goto errout;
> + }
> +
> + if (cfg->fc_dst_len > 128) {
> + NL_SET_ERR_MSG(extack, "Invalid prefix length");
> + goto errout;
> + }
> +
> +#ifdef CONFIG_IPV6_SUBTREES
> + if (cfg->fc_src_len > 128) {
> + NL_SET_ERR_MSG(extack, "Invalid source address length");
> + goto errout;
> + }
> +
> + if (cfg->fc_nh_id && cfg->fc_src_len) {
> + NL_SET_ERR_MSG(extack, "Nexthops can not be used with source routing");
> + goto errout;
> + }
> +#else
> + if (cfg->fc_src_len) {
> + NL_SET_ERR_MSG(extack,
> + "Specifying source address requires IPV6_SUBTREES to be enabled");
> + goto errout;
> + }
> +#endif
> + }
> +
> err = 0;
> errout:
> return err;
> --
> 2.49.0
>
^ permalink raw reply [flat|nested] 23+ messages in thread* Re: [PATCH v3 net-next 03/15] ipv6: Move some validation from ip6_route_info_create() to rtm_to_fib6_config().
2025-04-29 0:24 ` Lai, Yi
@ 2025-04-29 1:20 ` Kuniyuki Iwashima
0 siblings, 0 replies; 23+ messages in thread
From: Kuniyuki Iwashima @ 2025-04-29 1:20 UTC (permalink / raw)
To: yi1.lai
Cc: davem, dsahern, edumazet, horms, kuba, kuni1840, kuniyu, netdev,
pabeni, syzkaller-bugs, yi1.lai
From: "Lai, Yi" <yi1.lai@linux.intel.com>
Date: Tue, 29 Apr 2025 08:24:08 +0800
> Hi Kuniyuki Iwashima,
>
> Greetings!
>
> I used Syzkaller and found that there is KASAN: use-after-free Read in ip6_route_info_create in linux-next tag - next-20250428.
>
> After bisection and the first bad commit is:
> "
> fa76c1674f2e ipv6: Move some validation from ip6_route_info_create() to rtm_to_fib6_config().
> "
>
> All detailed into can be found at:
> https://github.com/laifryiee/syzkaller_logs/tree/main/250429_005622_ip6_route_info_create
> Syzkaller repro code:
> https://github.com/laifryiee/syzkaller_logs/tree/main/250429_005622_ip6_route_info_create/repro.c
> Syzkaller repro syscall steps:
> https://github.com/laifryiee/syzkaller_logs/tree/main/250429_005622_ip6_route_info_create/repro.prog
> Syzkaller report:
> https://github.com/laifryiee/syzkaller_logs/tree/main/250429_005622_ip6_route_info_create/repro.report
> Kconfig(make olddefconfig):
> https://github.com/laifryiee/syzkaller_logs/tree/main/250429_005622_ip6_route_info_create/kconfig_origin
> Bisect info:
> https://github.com/laifryiee/syzkaller_logs/tree/main/250429_005622_ip6_route_info_create/bisect_info.log
> bzImage:
> https://github.com/laifryiee/syzkaller_logs/raw/refs/heads/main/250429_005622_ip6_route_info_create/bzImage_33035b665157558254b3c21c3f049fd728e72368
> Issue dmesg:
> https://github.com/laifryiee/syzkaller_logs/blob/main/250429_005622_ip6_route_info_create/33035b665157558254b3c21c3f049fd728e72368_dmesg.log
>
> "
> [ 17.307248] ==================================================================
> [ 17.307611] BUG: KASAN: slab-use-after-free in ip6_route_info_create+0xb84/0xc30
> [ 17.307993] Read of size 1 at addr ffff8880100b8a94 by task repro/727
> [ 17.308291]
> [ 17.308389] CPU: 0 UID: 0 PID: 727 Comm: repro Not tainted 6.15.0-rc4-next-20250428-33035b665157 #1 PREEMPT(voluntary)
> [ 17.308397] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
> [ 17.308405] Call Trace:
> [ 17.308412] <TASK>
> [ 17.308414] dump_stack_lvl+0xea/0x150
> [ 17.308439] print_report+0xce/0x660
> [ 17.308469] ? ip6_route_info_create+0xb84/0xc30
> [ 17.308475] ? kasan_complete_mode_report_info+0x80/0x200
> [ 17.308482] ? ip6_route_info_create+0xb84/0xc30
> [ 17.308489] kasan_report+0xd6/0x110
> [ 17.308496] ? ip6_route_info_create+0xb84/0xc30
> [ 17.308504] __asan_report_load1_noabort+0x18/0x20
> [ 17.308509] ip6_route_info_create+0xb84/0xc30
> [ 17.308516] ip6_route_add+0x32/0x320
> [ 17.308524] ipv6_route_ioctl+0x414/0x5a0
Thanks for the report.
It seems I accidentally removed validation from the ioctl path,
not sure why I missed the path...
Will post a fix soon.
^ permalink raw reply [flat|nested] 23+ messages in thread
* [PATCH v3 net-next 04/15] ipv6: Check GATEWAY in rtm_to_fib6_multipath_config().
2025-04-18 0:03 [PATCH v3 net-next 00/15] ipv6: No RTNL for IPv6 routing table Kuniyuki Iwashima
` (2 preceding siblings ...)
2025-04-18 0:03 ` [PATCH v3 net-next 03/15] ipv6: Move some validation from ip6_route_info_create() to rtm_to_fib6_config() Kuniyuki Iwashima
@ 2025-04-18 0:03 ` Kuniyuki Iwashima
2025-04-18 0:03 ` [PATCH v3 net-next 05/15] ipv6: Move nexthop_find_by_id() after fib6_info_alloc() Kuniyuki Iwashima
` (11 subsequent siblings)
15 siblings, 0 replies; 23+ messages in thread
From: Kuniyuki Iwashima @ 2025-04-18 0:03 UTC (permalink / raw)
To: David S. Miller, David Ahern, Eric Dumazet, Jakub Kicinski,
Paolo Abeni
Cc: Simon Horman, Kuniyuki Iwashima, Kuniyuki Iwashima, netdev
In ip6_route_multipath_add(), we call rt6_qualify_for_ecmp() for each
entry. If it returns false, the request fails.
rt6_qualify_for_ecmp() returns false if either of the conditions below
is true:
1. f6i->fib6_flags has RTF_ADDRCONF
2. f6i->nh is not NULL
3. f6i->fib6_nh->fib_nh_gw_family is AF_UNSPEC
1 is unnecessary because rtm_to_fib6_config() never sets RTF_ADDRCONF
to cfg->fc_flags.
2. is equivalent with cfg->fc_nh_id.
3. can be replaced by checking RTF_GATEWAY in the base and each multipath
entry because AF_INET6 is set to f6i->fib6_nh->fib_nh_gw_family only when
cfg.fc_is_fdb is true or RTF_GATEWAY is set, but the former is always
false.
These checks do not require RCU and can be done earlier.
Let's perform the equivalent checks in rtm_to_fib6_multipath_config().
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
v3: Explain the checks do not require RCU.
---
net/ipv6/route.c | 16 +++++++++-------
1 file changed, 9 insertions(+), 7 deletions(-)
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 23102f37f220..5f370c269e64 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -5030,6 +5030,7 @@ static int rtm_to_fib6_multipath_config(struct fib6_config *cfg,
}
do {
+ bool has_gateway = cfg->fc_flags & RTF_GATEWAY;
int attrlen = rtnh_attrlen(rtnh);
if (attrlen > 0) {
@@ -5043,9 +5044,17 @@ static int rtm_to_fib6_multipath_config(struct fib6_config *cfg,
"Invalid IPv6 address in RTA_GATEWAY");
return -EINVAL;
}
+
+ has_gateway = true;
}
}
+ if (newroute && (cfg->fc_nh_id || !has_gateway)) {
+ NL_SET_ERR_MSG(extack,
+ "Device only routes can not be added for IPv6 using the multipath API.");
+ return -EINVAL;
+ }
+
rtnh = rtnh_next(rtnh, &remaining);
} while (rtnh_ok(rtnh, remaining));
@@ -5387,13 +5396,6 @@ static int ip6_route_multipath_add(struct fib6_config *cfg,
rt = NULL;
goto cleanup;
}
- if (!rt6_qualify_for_ecmp(rt)) {
- err = -EINVAL;
- NL_SET_ERR_MSG(extack,
- "Device only routes can not be added for IPv6 using the multipath API.");
- fib6_info_release(rt);
- goto cleanup;
- }
rt->fib6_nh->fib_nh_weight = rtnh->rtnh_hops + 1;
--
2.49.0
^ permalink raw reply related [flat|nested] 23+ messages in thread* [PATCH v3 net-next 05/15] ipv6: Move nexthop_find_by_id() after fib6_info_alloc().
2025-04-18 0:03 [PATCH v3 net-next 00/15] ipv6: No RTNL for IPv6 routing table Kuniyuki Iwashima
` (3 preceding siblings ...)
2025-04-18 0:03 ` [PATCH v3 net-next 04/15] ipv6: Check GATEWAY in rtm_to_fib6_multipath_config() Kuniyuki Iwashima
@ 2025-04-18 0:03 ` Kuniyuki Iwashima
2025-04-18 0:03 ` [PATCH v3 net-next 06/15] ipv6: Split ip6_route_info_create() Kuniyuki Iwashima
` (10 subsequent siblings)
15 siblings, 0 replies; 23+ messages in thread
From: Kuniyuki Iwashima @ 2025-04-18 0:03 UTC (permalink / raw)
To: David S. Miller, David Ahern, Eric Dumazet, Jakub Kicinski,
Paolo Abeni
Cc: Simon Horman, Kuniyuki Iwashima, Kuniyuki Iwashima, netdev
We will get rid of RTNL from RTM_NEWROUTE and SIOCADDRT.
Then, we must perform two lookups for nexthop and dev under RCU
to guarantee their lifetime.
ip6_route_info_create() calls nexthop_find_by_id() first if
RTA_NH_ID is specified, and then allocates struct fib6_info.
nexthop_find_by_id() must be called under RCU, but we do not want
to use GFP_ATOMIC for memory allocation here, which will be likely
to fail in ip6_route_multipath_add().
Let's move nexthop_find_by_id() after the memory allocation so
that we can later split ip6_route_info_create() into two parts:
the sleepable part and the RCU part.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
---
net/ipv6/route.c | 34 ++++++++++++++++++----------------
1 file changed, 18 insertions(+), 16 deletions(-)
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 5f370c269e64..06c5414fc14e 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -3733,24 +3733,11 @@ static struct fib6_info *ip6_route_info_create(struct fib6_config *cfg,
{
struct net *net = cfg->fc_nlinfo.nl_net;
struct fib6_info *rt = NULL;
- struct nexthop *nh = NULL;
struct fib6_table *table;
struct fib6_nh *fib6_nh;
- int err = -EINVAL;
+ int err = -ENOBUFS;
int addr_type;
- if (cfg->fc_nh_id) {
- nh = nexthop_find_by_id(net, cfg->fc_nh_id);
- if (!nh) {
- NL_SET_ERR_MSG(extack, "Nexthop id does not exist");
- goto out;
- }
- err = fib6_check_nexthop(nh, cfg, extack);
- if (err)
- goto out;
- }
-
- err = -ENOBUFS;
if (cfg->fc_nlinfo.nlh &&
!(cfg->fc_nlinfo.nlh->nlmsg_flags & NLM_F_CREATE)) {
table = fib6_get_table(net, cfg->fc_table);
@@ -3766,7 +3753,7 @@ static struct fib6_info *ip6_route_info_create(struct fib6_config *cfg,
goto out;
err = -ENOMEM;
- rt = fib6_info_alloc(gfp_flags, !nh);
+ rt = fib6_info_alloc(gfp_flags, !cfg->fc_nh_id);
if (!rt)
goto out;
@@ -3802,12 +3789,27 @@ static struct fib6_info *ip6_route_info_create(struct fib6_config *cfg,
ipv6_addr_prefix(&rt->fib6_src.addr, &cfg->fc_src, cfg->fc_src_len);
rt->fib6_src.plen = cfg->fc_src_len;
#endif
- if (nh) {
+
+ if (cfg->fc_nh_id) {
+ struct nexthop *nh;
+
+ nh = nexthop_find_by_id(net, cfg->fc_nh_id);
+ if (!nh) {
+ err = -EINVAL;
+ NL_SET_ERR_MSG(extack, "Nexthop id does not exist");
+ goto out_free;
+ }
+
+ err = fib6_check_nexthop(nh, cfg, extack);
+ if (err)
+ goto out_free;
+
if (!nexthop_get(nh)) {
NL_SET_ERR_MSG(extack, "Nexthop has been deleted");
err = -ENOENT;
goto out_free;
}
+
rt->nh = nh;
fib6_nh = nexthop_fib6_nh(rt->nh);
} else {
--
2.49.0
^ permalink raw reply related [flat|nested] 23+ messages in thread* [PATCH v3 net-next 06/15] ipv6: Split ip6_route_info_create().
2025-04-18 0:03 [PATCH v3 net-next 00/15] ipv6: No RTNL for IPv6 routing table Kuniyuki Iwashima
` (4 preceding siblings ...)
2025-04-18 0:03 ` [PATCH v3 net-next 05/15] ipv6: Move nexthop_find_by_id() after fib6_info_alloc() Kuniyuki Iwashima
@ 2025-04-18 0:03 ` Kuniyuki Iwashima
2025-04-18 0:03 ` [PATCH v3 net-next 07/15] ipv6: Preallocate rt->fib6_nh->rt6i_pcpu in ip6_route_info_create() Kuniyuki Iwashima
` (9 subsequent siblings)
15 siblings, 0 replies; 23+ messages in thread
From: Kuniyuki Iwashima @ 2025-04-18 0:03 UTC (permalink / raw)
To: David S. Miller, David Ahern, Eric Dumazet, Jakub Kicinski,
Paolo Abeni
Cc: Simon Horman, Kuniyuki Iwashima, Kuniyuki Iwashima, netdev
We will get rid of RTNL from RTM_NEWROUTE and SIOCADDRT and rely
on RCU to guarantee dev and nexthop lifetime.
Then, we want to allocate as much as possible before entering
the RCU section.
The RCU section will start in the middle of ip6_route_info_create(),
and this is problematic for ip6_route_multipath_add() that calls
ip6_route_info_create() multiple times.
Let's split ip6_route_info_create() into two parts; one for memory
allocation and another for nexthop setup.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
---
v3: Update changelog s/everything as possible/as much as possible/
---
net/ipv6/route.c | 95 +++++++++++++++++++++++++++++++-----------------
1 file changed, 62 insertions(+), 33 deletions(-)
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 06c5414fc14e..7328404c77c1 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -3728,15 +3728,13 @@ void fib6_nh_release_dsts(struct fib6_nh *fib6_nh)
}
static struct fib6_info *ip6_route_info_create(struct fib6_config *cfg,
- gfp_t gfp_flags,
- struct netlink_ext_ack *extack)
+ gfp_t gfp_flags,
+ struct netlink_ext_ack *extack)
{
struct net *net = cfg->fc_nlinfo.nl_net;
- struct fib6_info *rt = NULL;
struct fib6_table *table;
- struct fib6_nh *fib6_nh;
- int err = -ENOBUFS;
- int addr_type;
+ struct fib6_info *rt;
+ int err;
if (cfg->fc_nlinfo.nlh &&
!(cfg->fc_nlinfo.nlh->nlmsg_flags & NLM_F_CREATE)) {
@@ -3748,22 +3746,22 @@ static struct fib6_info *ip6_route_info_create(struct fib6_config *cfg,
} else {
table = fib6_new_table(net, cfg->fc_table);
}
+ if (!table) {
+ err = -ENOBUFS;
+ goto err;
+ }
- if (!table)
- goto out;
-
- err = -ENOMEM;
rt = fib6_info_alloc(gfp_flags, !cfg->fc_nh_id);
- if (!rt)
- goto out;
+ if (!rt) {
+ err = -ENOMEM;
+ goto err;
+ }
rt->fib6_metrics = ip_fib_metrics_init(cfg->fc_mx, cfg->fc_mx_len,
extack);
if (IS_ERR(rt->fib6_metrics)) {
err = PTR_ERR(rt->fib6_metrics);
- /* Do not leave garbage there. */
- rt->fib6_metrics = (struct dst_metrics *)&dst_default_metrics;
- goto out_free;
+ goto free;
}
if (cfg->fc_flags & RTF_ADDRCONF)
@@ -3771,12 +3769,12 @@ static struct fib6_info *ip6_route_info_create(struct fib6_config *cfg,
if (cfg->fc_flags & RTF_EXPIRES)
fib6_set_expires(rt, jiffies +
- clock_t_to_jiffies(cfg->fc_expires));
+ clock_t_to_jiffies(cfg->fc_expires));
if (cfg->fc_protocol == RTPROT_UNSPEC)
cfg->fc_protocol = RTPROT_BOOT;
- rt->fib6_protocol = cfg->fc_protocol;
+ rt->fib6_protocol = cfg->fc_protocol;
rt->fib6_table = table;
rt->fib6_metric = cfg->fc_metric;
rt->fib6_type = cfg->fc_type ? : RTN_UNICAST;
@@ -3789,6 +3787,20 @@ static struct fib6_info *ip6_route_info_create(struct fib6_config *cfg,
ipv6_addr_prefix(&rt->fib6_src.addr, &cfg->fc_src, cfg->fc_src_len);
rt->fib6_src.plen = cfg->fc_src_len;
#endif
+ return rt;
+free:
+ kfree(rt);
+err:
+ return ERR_PTR(err);
+}
+
+static int ip6_route_info_create_nh(struct fib6_info *rt,
+ struct fib6_config *cfg,
+ struct netlink_ext_ack *extack)
+{
+ struct net *net = cfg->fc_nlinfo.nl_net;
+ struct fib6_nh *fib6_nh;
+ int err;
if (cfg->fc_nh_id) {
struct nexthop *nh;
@@ -3813,9 +3825,11 @@ static struct fib6_info *ip6_route_info_create(struct fib6_config *cfg,
rt->nh = nh;
fib6_nh = nexthop_fib6_nh(rt->nh);
} else {
- err = fib6_nh_init(net, rt->fib6_nh, cfg, gfp_flags, extack);
+ int addr_type;
+
+ err = fib6_nh_init(net, rt->fib6_nh, cfg, GFP_ATOMIC, extack);
if (err)
- goto out;
+ goto out_release;
fib6_nh = rt->fib6_nh;
@@ -3834,21 +3848,20 @@ static struct fib6_info *ip6_route_info_create(struct fib6_config *cfg,
if (!ipv6_chk_addr(net, &cfg->fc_prefsrc, dev, 0)) {
NL_SET_ERR_MSG(extack, "Invalid source address");
err = -EINVAL;
- goto out;
+ goto out_release;
}
rt->fib6_prefsrc.addr = cfg->fc_prefsrc;
rt->fib6_prefsrc.plen = 128;
- } else
- rt->fib6_prefsrc.plen = 0;
+ }
- return rt;
-out:
+ return 0;
+out_release:
fib6_info_release(rt);
- return ERR_PTR(err);
+ return err;
out_free:
ip_fib_metrics_put(rt->fib6_metrics);
kfree(rt);
- return ERR_PTR(err);
+ return err;
}
int ip6_route_add(struct fib6_config *cfg, gfp_t gfp_flags,
@@ -3861,6 +3874,10 @@ int ip6_route_add(struct fib6_config *cfg, gfp_t gfp_flags,
if (IS_ERR(rt))
return PTR_ERR(rt);
+ err = ip6_route_info_create_nh(rt, cfg, extack);
+ if (err)
+ return err;
+
err = __ip6_ins_rt(rt, &cfg->fc_nlinfo, extack);
fib6_info_release(rt);
@@ -4584,6 +4601,7 @@ struct fib6_info *addrconf_f6i_alloc(struct net *net,
.fc_ignore_dev_down = true,
};
struct fib6_info *f6i;
+ int err;
if (anycast) {
cfg.fc_type = RTN_ANYCAST;
@@ -4594,14 +4612,19 @@ struct fib6_info *addrconf_f6i_alloc(struct net *net,
}
f6i = ip6_route_info_create(&cfg, gfp_flags, extack);
- if (!IS_ERR(f6i)) {
- f6i->dst_nocount = true;
+ if (IS_ERR(f6i))
+ return f6i;
- if (!anycast &&
- (READ_ONCE(net->ipv6.devconf_all->disable_policy) ||
- READ_ONCE(idev->cnf.disable_policy)))
- f6i->dst_nopolicy = true;
- }
+ err = ip6_route_info_create_nh(f6i, &cfg, extack);
+ if (err)
+ return ERR_PTR(err);
+
+ f6i->dst_nocount = true;
+
+ if (!anycast &&
+ (READ_ONCE(net->ipv6.devconf_all->disable_policy) ||
+ READ_ONCE(idev->cnf.disable_policy)))
+ f6i->dst_nopolicy = true;
return f6i;
}
@@ -5399,6 +5422,12 @@ static int ip6_route_multipath_add(struct fib6_config *cfg,
goto cleanup;
}
+ err = ip6_route_info_create_nh(rt, &r_cfg, extack);
+ if (err) {
+ rt = NULL;
+ goto cleanup;
+ }
+
rt->fib6_nh->fib_nh_weight = rtnh->rtnh_hops + 1;
err = ip6_route_info_append(info->nl_net, &rt6_nh_list,
--
2.49.0
^ permalink raw reply related [flat|nested] 23+ messages in thread* [PATCH v3 net-next 07/15] ipv6: Preallocate rt->fib6_nh->rt6i_pcpu in ip6_route_info_create().
2025-04-18 0:03 [PATCH v3 net-next 00/15] ipv6: No RTNL for IPv6 routing table Kuniyuki Iwashima
` (5 preceding siblings ...)
2025-04-18 0:03 ` [PATCH v3 net-next 06/15] ipv6: Split ip6_route_info_create() Kuniyuki Iwashima
@ 2025-04-18 0:03 ` Kuniyuki Iwashima
2025-04-18 0:03 ` [PATCH v3 net-next 08/15] ipv6: Preallocate nhc_pcpu_rth_output " Kuniyuki Iwashima
` (8 subsequent siblings)
15 siblings, 0 replies; 23+ messages in thread
From: Kuniyuki Iwashima @ 2025-04-18 0:03 UTC (permalink / raw)
To: David S. Miller, David Ahern, Eric Dumazet, Jakub Kicinski,
Paolo Abeni
Cc: Simon Horman, Kuniyuki Iwashima, Kuniyuki Iwashima, netdev
ip6_route_info_create_nh() will be called under RCU.
Then, fib6_nh_init() is also under RCU, but per-cpu memory allocation
is very likely to fail with GFP_ATOMIC while bulk-adding IPv6 routes
and we would see a bunch of this message in dmesg.
percpu: allocation failed, size=8 align=8 atomic=1, atomic alloc failed, no space left
percpu: allocation failed, size=8 align=8 atomic=1, atomic alloc failed, no space left
Let's preallocate rt->fib6_nh->rt6i_pcpu in ip6_route_info_create().
If something fails before the original memory allocation in
fib6_nh_init(), ip6_route_info_create_nh() calls fib6_info_release(),
which releases the preallocated per-cpu memory.
Note that rt->fib6_nh->rt6i_pcpu is not preallocated when called via
ipv6_stub, so we still need alloc_percpu_gfp() in fib6_nh_init().
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
v3: Explain alloc_percpu_gfp() is still needed when called via ipv6_stub.
---
net/ipv6/route.c | 25 ++++++++++++++++++++++---
1 file changed, 22 insertions(+), 3 deletions(-)
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 7328404c77c1..b0ddb73c732e 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -3664,10 +3664,12 @@ int fib6_nh_init(struct net *net, struct fib6_nh *fib6_nh,
goto out;
pcpu_alloc:
- fib6_nh->rt6i_pcpu = alloc_percpu_gfp(struct rt6_info *, gfp_flags);
if (!fib6_nh->rt6i_pcpu) {
- err = -ENOMEM;
- goto out;
+ fib6_nh->rt6i_pcpu = alloc_percpu_gfp(struct rt6_info *, gfp_flags);
+ if (!fib6_nh->rt6i_pcpu) {
+ err = -ENOMEM;
+ goto out;
+ }
}
fib6_nh->fib_nh_dev = dev;
@@ -3727,6 +3729,15 @@ void fib6_nh_release_dsts(struct fib6_nh *fib6_nh)
}
}
+static int fib6_nh_prealloc_percpu(struct fib6_nh *fib6_nh, gfp_t gfp_flags)
+{
+ fib6_nh->rt6i_pcpu = alloc_percpu_gfp(struct rt6_info *, gfp_flags);
+ if (!fib6_nh->rt6i_pcpu)
+ return -ENOMEM;
+
+ return 0;
+}
+
static struct fib6_info *ip6_route_info_create(struct fib6_config *cfg,
gfp_t gfp_flags,
struct netlink_ext_ack *extack)
@@ -3764,6 +3775,12 @@ static struct fib6_info *ip6_route_info_create(struct fib6_config *cfg,
goto free;
}
+ if (!cfg->fc_nh_id) {
+ err = fib6_nh_prealloc_percpu(&rt->fib6_nh[0], gfp_flags);
+ if (err)
+ goto free_metrics;
+ }
+
if (cfg->fc_flags & RTF_ADDRCONF)
rt->dst_nocount = true;
@@ -3788,6 +3805,8 @@ static struct fib6_info *ip6_route_info_create(struct fib6_config *cfg,
rt->fib6_src.plen = cfg->fc_src_len;
#endif
return rt;
+free_metrics:
+ ip_fib_metrics_put(rt->fib6_metrics);
free:
kfree(rt);
err:
--
2.49.0
^ permalink raw reply related [flat|nested] 23+ messages in thread* [PATCH v3 net-next 08/15] ipv6: Preallocate nhc_pcpu_rth_output in ip6_route_info_create().
2025-04-18 0:03 [PATCH v3 net-next 00/15] ipv6: No RTNL for IPv6 routing table Kuniyuki Iwashima
` (6 preceding siblings ...)
2025-04-18 0:03 ` [PATCH v3 net-next 07/15] ipv6: Preallocate rt->fib6_nh->rt6i_pcpu in ip6_route_info_create() Kuniyuki Iwashima
@ 2025-04-18 0:03 ` Kuniyuki Iwashima
2025-04-18 0:03 ` [PATCH v3 net-next 09/15] ipv6: Don't pass net to ip6_route_info_append() Kuniyuki Iwashima
` (7 subsequent siblings)
15 siblings, 0 replies; 23+ messages in thread
From: Kuniyuki Iwashima @ 2025-04-18 0:03 UTC (permalink / raw)
To: David S. Miller, David Ahern, Eric Dumazet, Jakub Kicinski,
Paolo Abeni
Cc: Simon Horman, Kuniyuki Iwashima, Kuniyuki Iwashima, netdev
ip6_route_info_create_nh() will be called under RCU.
It calls fib_nh_common_init() and allocates nhc->nhc_pcpu_rth_output.
As with the reason for rt->fib6_nh->rt6i_pcpu, we want to avoid
GFP_ATOMIC allocation for nhc->nhc_pcpu_rth_output under RCU.
Let's preallocate it in ip6_route_info_create().
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
net/ipv4/fib_semantics.c | 10 ++++++----
net/ipv6/route.c | 9 +++++++++
2 files changed, 15 insertions(+), 4 deletions(-)
diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index f68bb9e34c34..5326f1501af0 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -617,10 +617,12 @@ int fib_nh_common_init(struct net *net, struct fib_nh_common *nhc,
{
int err;
- nhc->nhc_pcpu_rth_output = alloc_percpu_gfp(struct rtable __rcu *,
- gfp_flags);
- if (!nhc->nhc_pcpu_rth_output)
- return -ENOMEM;
+ if (!nhc->nhc_pcpu_rth_output) {
+ nhc->nhc_pcpu_rth_output = alloc_percpu_gfp(struct rtable __rcu *,
+ gfp_flags);
+ if (!nhc->nhc_pcpu_rth_output)
+ return -ENOMEM;
+ }
if (encap) {
struct lwtunnel_state *lwtstate;
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index b0ddb73c732e..ea755027cf61 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -3731,10 +3731,19 @@ void fib6_nh_release_dsts(struct fib6_nh *fib6_nh)
static int fib6_nh_prealloc_percpu(struct fib6_nh *fib6_nh, gfp_t gfp_flags)
{
+ struct fib_nh_common *nhc = &fib6_nh->nh_common;
+
fib6_nh->rt6i_pcpu = alloc_percpu_gfp(struct rt6_info *, gfp_flags);
if (!fib6_nh->rt6i_pcpu)
return -ENOMEM;
+ nhc->nhc_pcpu_rth_output = alloc_percpu_gfp(struct rtable __rcu *,
+ gfp_flags);
+ if (!nhc->nhc_pcpu_rth_output) {
+ free_percpu(fib6_nh->rt6i_pcpu);
+ return -ENOMEM;
+ }
+
return 0;
}
--
2.49.0
^ permalink raw reply related [flat|nested] 23+ messages in thread* [PATCH v3 net-next 09/15] ipv6: Don't pass net to ip6_route_info_append().
2025-04-18 0:03 [PATCH v3 net-next 00/15] ipv6: No RTNL for IPv6 routing table Kuniyuki Iwashima
` (7 preceding siblings ...)
2025-04-18 0:03 ` [PATCH v3 net-next 08/15] ipv6: Preallocate nhc_pcpu_rth_output " Kuniyuki Iwashima
@ 2025-04-18 0:03 ` Kuniyuki Iwashima
2025-04-18 0:03 ` [PATCH v3 net-next 10/15] ipv6: Rename rt6_nh.next to rt6_nh.list Kuniyuki Iwashima
` (6 subsequent siblings)
15 siblings, 0 replies; 23+ messages in thread
From: Kuniyuki Iwashima @ 2025-04-18 0:03 UTC (permalink / raw)
To: David S. Miller, David Ahern, Eric Dumazet, Jakub Kicinski,
Paolo Abeni
Cc: Simon Horman, Kuniyuki Iwashima, Kuniyuki Iwashima, netdev
net is not used in ip6_route_info_append() after commit 36f19d5b4f99
("net/ipv6: Remove extra call to ip6_convert_metrics for multipath case").
Let's remove the argument.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
---
net/ipv6/route.c | 6 ++----
1 file changed, 2 insertions(+), 4 deletions(-)
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index ea755027cf61..af11fcaa5cf3 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -5317,8 +5317,7 @@ struct rt6_nh {
struct list_head next;
};
-static int ip6_route_info_append(struct net *net,
- struct list_head *rt6_nh_list,
+static int ip6_route_info_append(struct list_head *rt6_nh_list,
struct fib6_info *rt,
struct fib6_config *r_cfg)
{
@@ -5458,8 +5457,7 @@ static int ip6_route_multipath_add(struct fib6_config *cfg,
rt->fib6_nh->fib_nh_weight = rtnh->rtnh_hops + 1;
- err = ip6_route_info_append(info->nl_net, &rt6_nh_list,
- rt, &r_cfg);
+ err = ip6_route_info_append(&rt6_nh_list, rt, &r_cfg);
if (err) {
fib6_info_release(rt);
goto cleanup;
--
2.49.0
^ permalink raw reply related [flat|nested] 23+ messages in thread* [PATCH v3 net-next 10/15] ipv6: Rename rt6_nh.next to rt6_nh.list.
2025-04-18 0:03 [PATCH v3 net-next 00/15] ipv6: No RTNL for IPv6 routing table Kuniyuki Iwashima
` (8 preceding siblings ...)
2025-04-18 0:03 ` [PATCH v3 net-next 09/15] ipv6: Don't pass net to ip6_route_info_append() Kuniyuki Iwashima
@ 2025-04-18 0:03 ` Kuniyuki Iwashima
2025-04-18 0:03 ` [PATCH v3 net-next 11/15] ipv6: Factorise ip6_route_multipath_add() Kuniyuki Iwashima
` (5 subsequent siblings)
15 siblings, 0 replies; 23+ messages in thread
From: Kuniyuki Iwashima @ 2025-04-18 0:03 UTC (permalink / raw)
To: David S. Miller, David Ahern, Eric Dumazet, Jakub Kicinski,
Paolo Abeni
Cc: Simon Horman, Kuniyuki Iwashima, Kuniyuki Iwashima, netdev
ip6_route_multipath_add() allocates struct rt6_nh for each config
of multipath routes to link them to a local list rt6_nh_list.
struct rt6_nh.next is the list node of each config, so the name
is quite misleading.
Let's rename it to list.
Suggested-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/netdev/c9bee472-c94e-4878-8cc2-1512b2c54db5@redhat.com/
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
net/ipv6/route.c | 14 +++++++-------
1 file changed, 7 insertions(+), 7 deletions(-)
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index af11fcaa5cf3..05e33d319488 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -5314,7 +5314,7 @@ static int rtm_to_fib6_config(struct sk_buff *skb, struct nlmsghdr *nlh,
struct rt6_nh {
struct fib6_info *fib6_info;
struct fib6_config r_cfg;
- struct list_head next;
+ struct list_head list;
};
static int ip6_route_info_append(struct list_head *rt6_nh_list,
@@ -5324,7 +5324,7 @@ static int ip6_route_info_append(struct list_head *rt6_nh_list,
struct rt6_nh *nh;
int err = -EEXIST;
- list_for_each_entry(nh, rt6_nh_list, next) {
+ list_for_each_entry(nh, rt6_nh_list, list) {
/* check if fib6_info already exists */
if (rt6_duplicate_nexthop(nh->fib6_info, rt))
return err;
@@ -5335,7 +5335,7 @@ static int ip6_route_info_append(struct list_head *rt6_nh_list,
return -ENOMEM;
nh->fib6_info = rt;
memcpy(&nh->r_cfg, r_cfg, sizeof(*r_cfg));
- list_add_tail(&nh->next, rt6_nh_list);
+ list_add_tail(&nh->list, rt6_nh_list);
return 0;
}
@@ -5478,7 +5478,7 @@ static int ip6_route_multipath_add(struct fib6_config *cfg,
info->skip_notify_kernel = 1;
err_nh = NULL;
- list_for_each_entry(nh, &rt6_nh_list, next) {
+ list_for_each_entry(nh, &rt6_nh_list, list) {
err = __ip6_ins_rt(nh->fib6_info, info, extack);
if (err) {
@@ -5546,16 +5546,16 @@ static int ip6_route_multipath_add(struct fib6_config *cfg,
ip6_route_mpath_notify(rt_notif, rt_last, info, nlflags);
/* Delete routes that were already added */
- list_for_each_entry(nh, &rt6_nh_list, next) {
+ list_for_each_entry(nh, &rt6_nh_list, list) {
if (err_nh == nh)
break;
ip6_route_del(&nh->r_cfg, extack);
}
cleanup:
- list_for_each_entry_safe(nh, nh_safe, &rt6_nh_list, next) {
+ list_for_each_entry_safe(nh, nh_safe, &rt6_nh_list, list) {
fib6_info_release(nh->fib6_info);
- list_del(&nh->next);
+ list_del(&nh->list);
kfree(nh);
}
--
2.49.0
^ permalink raw reply related [flat|nested] 23+ messages in thread* [PATCH v3 net-next 11/15] ipv6: Factorise ip6_route_multipath_add().
2025-04-18 0:03 [PATCH v3 net-next 00/15] ipv6: No RTNL for IPv6 routing table Kuniyuki Iwashima
` (9 preceding siblings ...)
2025-04-18 0:03 ` [PATCH v3 net-next 10/15] ipv6: Rename rt6_nh.next to rt6_nh.list Kuniyuki Iwashima
@ 2025-04-18 0:03 ` Kuniyuki Iwashima
2025-04-18 0:03 ` [PATCH v3 net-next 12/15] ipv6: Protect fib6_link_table() with spinlock Kuniyuki Iwashima
` (4 subsequent siblings)
15 siblings, 0 replies; 23+ messages in thread
From: Kuniyuki Iwashima @ 2025-04-18 0:03 UTC (permalink / raw)
To: David S. Miller, David Ahern, Eric Dumazet, Jakub Kicinski,
Paolo Abeni
Cc: Simon Horman, Kuniyuki Iwashima, Kuniyuki Iwashima, netdev
We will get rid of RTNL from RTM_NEWROUTE and SIOCADDRT and rely
on RCU to guarantee dev and nexthop lifetime.
Then, the RCU section will start before ip6_route_info_create_nh()
in ip6_route_multipath_add(), but ip6_route_info_create() is called
in the same loop and will sleep.
Let's split the loop into ip6_route_mpath_info_create() and
ip6_route_mpath_info_create_nh().
Note that ip6_route_info_append() is now integrated into
ip6_route_mpath_info_create_nh() because we need to call different
free functions for nexthops that passed ip6_route_info_create_nh().
In case of failure, the remaining nexthops that ip6_route_info_create_nh()
has not been called for will be freed by ip6_route_mpath_info_cleanup().
OTOH, if a nexthop passes ip6_route_info_create_nh(), it will be linked
to a local temporary list, which will be spliced back to rt6_nh_list.
In case of failure, these nexthops will be released by fib6_info_release()
in ip6_route_multipath_add().
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
net/ipv6/route.c | 205 ++++++++++++++++++++++++++++++-----------------
1 file changed, 130 insertions(+), 75 deletions(-)
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 05e33d319488..c8c1c75268e3 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -5315,29 +5315,131 @@ struct rt6_nh {
struct fib6_info *fib6_info;
struct fib6_config r_cfg;
struct list_head list;
+ int weight;
};
-static int ip6_route_info_append(struct list_head *rt6_nh_list,
- struct fib6_info *rt,
- struct fib6_config *r_cfg)
+static void ip6_route_mpath_info_cleanup(struct list_head *rt6_nh_list)
{
- struct rt6_nh *nh;
- int err = -EEXIST;
+ struct rt6_nh *nh, *nh_next;
- list_for_each_entry(nh, rt6_nh_list, list) {
- /* check if fib6_info already exists */
- if (rt6_duplicate_nexthop(nh->fib6_info, rt))
- return err;
+ list_for_each_entry_safe(nh, nh_next, rt6_nh_list, list) {
+ struct fib6_info *rt = nh->fib6_info;
+
+ if (rt) {
+ free_percpu(rt->fib6_nh->nh_common.nhc_pcpu_rth_output);
+ free_percpu(rt->fib6_nh->rt6i_pcpu);
+ ip_fib_metrics_put(rt->fib6_metrics);
+ kfree(rt);
+ }
+
+ list_del(&nh->list);
+ kfree(nh);
}
+}
- nh = kzalloc(sizeof(*nh), GFP_KERNEL);
- if (!nh)
- return -ENOMEM;
- nh->fib6_info = rt;
- memcpy(&nh->r_cfg, r_cfg, sizeof(*r_cfg));
- list_add_tail(&nh->list, rt6_nh_list);
+static int ip6_route_mpath_info_create(struct list_head *rt6_nh_list,
+ struct fib6_config *cfg,
+ struct netlink_ext_ack *extack)
+{
+ struct rtnexthop *rtnh;
+ int remaining;
+ int err;
+
+ remaining = cfg->fc_mp_len;
+ rtnh = (struct rtnexthop *)cfg->fc_mp;
+
+ /* Parse a Multipath Entry and build a list (rt6_nh_list) of
+ * fib6_info structs per nexthop
+ */
+ while (rtnh_ok(rtnh, remaining)) {
+ struct fib6_config r_cfg;
+ struct fib6_info *rt;
+ struct rt6_nh *nh;
+ int attrlen;
+
+ nh = kzalloc(sizeof(*nh), GFP_KERNEL);
+ if (!nh) {
+ err = -ENOMEM;
+ goto err;
+ }
+
+ list_add_tail(&nh->list, rt6_nh_list);
+
+ memcpy(&r_cfg, cfg, sizeof(*cfg));
+ if (rtnh->rtnh_ifindex)
+ r_cfg.fc_ifindex = rtnh->rtnh_ifindex;
+
+ attrlen = rtnh_attrlen(rtnh);
+ if (attrlen > 0) {
+ struct nlattr *nla, *attrs = rtnh_attrs(rtnh);
+
+ nla = nla_find(attrs, attrlen, RTA_GATEWAY);
+ if (nla) {
+ r_cfg.fc_gateway = nla_get_in6_addr(nla);
+ r_cfg.fc_flags |= RTF_GATEWAY;
+ }
+
+ r_cfg.fc_encap = nla_find(attrs, attrlen, RTA_ENCAP);
+ nla = nla_find(attrs, attrlen, RTA_ENCAP_TYPE);
+ if (nla)
+ r_cfg.fc_encap_type = nla_get_u16(nla);
+ }
+
+ r_cfg.fc_flags |= (rtnh->rtnh_flags & RTNH_F_ONLINK);
+
+ rt = ip6_route_info_create(&r_cfg, GFP_KERNEL, extack);
+ if (IS_ERR(rt)) {
+ err = PTR_ERR(rt);
+ goto err;
+ }
+
+ nh->fib6_info = rt;
+ nh->weight = rtnh->rtnh_hops + 1;
+ memcpy(&nh->r_cfg, &r_cfg, sizeof(r_cfg));
+
+ rtnh = rtnh_next(rtnh, &remaining);
+ }
return 0;
+err:
+ ip6_route_mpath_info_cleanup(rt6_nh_list);
+ return err;
+}
+
+static int ip6_route_mpath_info_create_nh(struct list_head *rt6_nh_list,
+ struct netlink_ext_ack *extack)
+{
+ struct rt6_nh *nh, *nh_next, *nh_tmp;
+ LIST_HEAD(tmp);
+ int err;
+
+ list_for_each_entry_safe(nh, nh_next, rt6_nh_list, list) {
+ struct fib6_info *rt = nh->fib6_info;
+
+ err = ip6_route_info_create_nh(rt, &nh->r_cfg, extack);
+ if (err) {
+ nh->fib6_info = NULL;
+ goto err;
+ }
+
+ rt->fib6_nh->fib_nh_weight = nh->weight;
+
+ list_move_tail(&nh->list, &tmp);
+
+ list_for_each_entry(nh_tmp, rt6_nh_list, list) {
+ /* check if fib6_info already exists */
+ if (rt6_duplicate_nexthop(nh_tmp->fib6_info, rt)) {
+ err = -EEXIST;
+ goto err;
+ }
+ }
+ }
+out:
+ list_splice(&tmp, rt6_nh_list);
+ return err;
+err:
+ ip6_route_mpath_info_cleanup(rt6_nh_list);
+ goto out;
}
static void ip6_route_mpath_notify(struct fib6_info *rt,
@@ -5396,75 +5498,28 @@ static int ip6_route_multipath_add(struct fib6_config *cfg,
{
struct fib6_info *rt_notif = NULL, *rt_last = NULL;
struct nl_info *info = &cfg->fc_nlinfo;
- struct fib6_config r_cfg;
- struct rtnexthop *rtnh;
- struct fib6_info *rt;
- struct rt6_nh *err_nh;
struct rt6_nh *nh, *nh_safe;
+ LIST_HEAD(rt6_nh_list);
+ struct rt6_nh *err_nh;
__u16 nlflags;
- int remaining;
- int attrlen;
- int err = 1;
int nhn = 0;
- int replace = (cfg->fc_nlinfo.nlh &&
- (cfg->fc_nlinfo.nlh->nlmsg_flags & NLM_F_REPLACE));
- LIST_HEAD(rt6_nh_list);
+ int replace;
+ int err;
+
+ replace = (cfg->fc_nlinfo.nlh &&
+ (cfg->fc_nlinfo.nlh->nlmsg_flags & NLM_F_REPLACE));
nlflags = replace ? NLM_F_REPLACE : NLM_F_CREATE;
if (info->nlh && info->nlh->nlmsg_flags & NLM_F_APPEND)
nlflags |= NLM_F_APPEND;
- remaining = cfg->fc_mp_len;
- rtnh = (struct rtnexthop *)cfg->fc_mp;
-
- /* Parse a Multipath Entry and build a list (rt6_nh_list) of
- * fib6_info structs per nexthop
- */
- while (rtnh_ok(rtnh, remaining)) {
- memcpy(&r_cfg, cfg, sizeof(*cfg));
- if (rtnh->rtnh_ifindex)
- r_cfg.fc_ifindex = rtnh->rtnh_ifindex;
-
- attrlen = rtnh_attrlen(rtnh);
- if (attrlen > 0) {
- struct nlattr *nla, *attrs = rtnh_attrs(rtnh);
-
- nla = nla_find(attrs, attrlen, RTA_GATEWAY);
- if (nla) {
- r_cfg.fc_gateway = nla_get_in6_addr(nla);
- r_cfg.fc_flags |= RTF_GATEWAY;
- }
-
- r_cfg.fc_encap = nla_find(attrs, attrlen, RTA_ENCAP);
- nla = nla_find(attrs, attrlen, RTA_ENCAP_TYPE);
- if (nla)
- r_cfg.fc_encap_type = nla_get_u16(nla);
- }
-
- r_cfg.fc_flags |= (rtnh->rtnh_flags & RTNH_F_ONLINK);
- rt = ip6_route_info_create(&r_cfg, GFP_KERNEL, extack);
- if (IS_ERR(rt)) {
- err = PTR_ERR(rt);
- rt = NULL;
- goto cleanup;
- }
-
- err = ip6_route_info_create_nh(rt, &r_cfg, extack);
- if (err) {
- rt = NULL;
- goto cleanup;
- }
-
- rt->fib6_nh->fib_nh_weight = rtnh->rtnh_hops + 1;
-
- err = ip6_route_info_append(&rt6_nh_list, rt, &r_cfg);
- if (err) {
- fib6_info_release(rt);
- goto cleanup;
- }
+ err = ip6_route_mpath_info_create(&rt6_nh_list, cfg, extack);
+ if (err)
+ return err;
- rtnh = rtnh_next(rtnh, &remaining);
- }
+ err = ip6_route_mpath_info_create_nh(&rt6_nh_list, extack);
+ if (err)
+ goto cleanup;
/* for add and replace send one notification with all nexthops.
* Skip the notification in fib6_add_rt2node and send one with
--
2.49.0
^ permalink raw reply related [flat|nested] 23+ messages in thread* [PATCH v3 net-next 12/15] ipv6: Protect fib6_link_table() with spinlock.
2025-04-18 0:03 [PATCH v3 net-next 00/15] ipv6: No RTNL for IPv6 routing table Kuniyuki Iwashima
` (10 preceding siblings ...)
2025-04-18 0:03 ` [PATCH v3 net-next 11/15] ipv6: Factorise ip6_route_multipath_add() Kuniyuki Iwashima
@ 2025-04-18 0:03 ` Kuniyuki Iwashima
2025-04-18 0:03 ` [PATCH v3 net-next 13/15] ipv6: Defer fib6_purge_rt() in fib6_add_rt2node() to fib6_add() Kuniyuki Iwashima
` (3 subsequent siblings)
15 siblings, 0 replies; 23+ messages in thread
From: Kuniyuki Iwashima @ 2025-04-18 0:03 UTC (permalink / raw)
To: David S. Miller, David Ahern, Eric Dumazet, Jakub Kicinski,
Paolo Abeni
Cc: Simon Horman, Kuniyuki Iwashima, Kuniyuki Iwashima, netdev
We will get rid of RTNL from RTM_NEWROUTE and SIOCADDRT.
If the request specifies a new table ID, fib6_new_table() is
called to create a new routing table.
Two concurrent requests could specify the same table ID, so we
need a lock to protect net->ipv6.fib_table_hash[h].
Let's add a spinlock to protect the hash bucket linkage.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
---
include/net/netns/ipv6.h | 1 +
net/ipv6/ip6_fib.c | 26 +++++++++++++++++++++-----
2 files changed, 22 insertions(+), 5 deletions(-)
diff --git a/include/net/netns/ipv6.h b/include/net/netns/ipv6.h
index 5f2cfd84570a..47dc70d8100a 100644
--- a/include/net/netns/ipv6.h
+++ b/include/net/netns/ipv6.h
@@ -72,6 +72,7 @@ struct netns_ipv6 {
struct rt6_statistics *rt6_stats;
struct timer_list ip6_fib_timer;
struct hlist_head *fib_table_hash;
+ spinlock_t fib_table_hash_lock;
struct fib6_table *fib6_main_tbl;
struct list_head fib6_walkers;
rwlock_t fib6_walker_lock;
diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index bf727149fdec..79b672f3fc53 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -249,19 +249,33 @@ static struct fib6_table *fib6_alloc_table(struct net *net, u32 id)
struct fib6_table *fib6_new_table(struct net *net, u32 id)
{
- struct fib6_table *tb;
+ struct fib6_table *tb, *new_tb;
if (id == 0)
id = RT6_TABLE_MAIN;
+
tb = fib6_get_table(net, id);
if (tb)
return tb;
- tb = fib6_alloc_table(net, id);
- if (tb)
- fib6_link_table(net, tb);
+ new_tb = fib6_alloc_table(net, id);
+ if (!new_tb)
+ return NULL;
+
+ spin_lock_bh(&net->ipv6.fib_table_hash_lock);
+
+ tb = fib6_get_table(net, id);
+ if (unlikely(tb)) {
+ spin_unlock_bh(&net->ipv6.fib_table_hash_lock);
+ kfree(new_tb);
+ return tb;
+ }
- return tb;
+ fib6_link_table(net, new_tb);
+
+ spin_unlock_bh(&net->ipv6.fib_table_hash_lock);
+
+ return new_tb;
}
EXPORT_SYMBOL_GPL(fib6_new_table);
@@ -2423,6 +2437,8 @@ static int __net_init fib6_net_init(struct net *net)
if (!net->ipv6.fib_table_hash)
goto out_rt6_stats;
+ spin_lock_init(&net->ipv6.fib_table_hash_lock);
+
net->ipv6.fib6_main_tbl = kzalloc(sizeof(*net->ipv6.fib6_main_tbl),
GFP_KERNEL);
if (!net->ipv6.fib6_main_tbl)
--
2.49.0
^ permalink raw reply related [flat|nested] 23+ messages in thread* [PATCH v3 net-next 13/15] ipv6: Defer fib6_purge_rt() in fib6_add_rt2node() to fib6_add().
2025-04-18 0:03 [PATCH v3 net-next 00/15] ipv6: No RTNL for IPv6 routing table Kuniyuki Iwashima
` (11 preceding siblings ...)
2025-04-18 0:03 ` [PATCH v3 net-next 12/15] ipv6: Protect fib6_link_table() with spinlock Kuniyuki Iwashima
@ 2025-04-18 0:03 ` Kuniyuki Iwashima
2025-04-18 0:03 ` [PATCH v3 net-next 14/15] ipv6: Protect nh->f6i_list with spinlock and flag Kuniyuki Iwashima
` (2 subsequent siblings)
15 siblings, 0 replies; 23+ messages in thread
From: Kuniyuki Iwashima @ 2025-04-18 0:03 UTC (permalink / raw)
To: David S. Miller, David Ahern, Eric Dumazet, Jakub Kicinski,
Paolo Abeni
Cc: Simon Horman, Kuniyuki Iwashima, Kuniyuki Iwashima, netdev
The next patch adds per-nexthop spinlock which protects nh->f6i_list.
When rt->nh is not NULL, fib6_add_rt2node() will be called under the lock.
fib6_add_rt2node() could call fib6_purge_rt() for another route, which
could holds another nexthop lock.
Then, deadlock could happen between two nexthops.
Let's defer fib6_purge_rt() after fib6_add_rt2node().
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
---
include/net/ip6_fib.h | 1 +
net/ipv6/ip6_fib.c | 21 ++++++++++++++-------
2 files changed, 15 insertions(+), 7 deletions(-)
diff --git a/include/net/ip6_fib.h b/include/net/ip6_fib.h
index 7c87873ae211..88b0dd4d8e09 100644
--- a/include/net/ip6_fib.h
+++ b/include/net/ip6_fib.h
@@ -198,6 +198,7 @@ struct fib6_info {
fib6_destroying:1,
unused:4;
+ struct list_head purge_link;
struct rcu_head rcu;
struct nexthop *nh;
struct fib6_nh fib6_nh[];
diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index 79b672f3fc53..9e9db5470bbf 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -1083,8 +1083,8 @@ static void fib6_purge_rt(struct fib6_info *rt, struct fib6_node *fn,
*/
static int fib6_add_rt2node(struct fib6_node *fn, struct fib6_info *rt,
- struct nl_info *info,
- struct netlink_ext_ack *extack)
+ struct nl_info *info, struct netlink_ext_ack *extack,
+ struct list_head *purge_list)
{
struct fib6_info *leaf = rcu_dereference_protected(fn->leaf,
lockdep_is_held(&rt->fib6_table->tb6_lock));
@@ -1308,10 +1308,9 @@ static int fib6_add_rt2node(struct fib6_node *fn, struct fib6_info *rt,
}
nsiblings = iter->fib6_nsiblings;
iter->fib6_node = NULL;
- fib6_purge_rt(iter, fn, info->nl_net);
+ list_add(&iter->purge_link, purge_list);
if (rcu_access_pointer(fn->rr_ptr) == iter)
fn->rr_ptr = NULL;
- fib6_info_release(iter);
if (nsiblings) {
/* Replacing an ECMP route, remove all siblings */
@@ -1324,10 +1323,9 @@ static int fib6_add_rt2node(struct fib6_node *fn, struct fib6_info *rt,
if (rt6_qualify_for_ecmp(iter)) {
*ins = iter->fib6_next;
iter->fib6_node = NULL;
- fib6_purge_rt(iter, fn, info->nl_net);
+ list_add(&iter->purge_link, purge_list);
if (rcu_access_pointer(fn->rr_ptr) == iter)
fn->rr_ptr = NULL;
- fib6_info_release(iter);
nsiblings--;
info->nl_net->ipv6.rt6_stats->fib_rt_entries--;
} else {
@@ -1397,6 +1395,7 @@ int fib6_add(struct fib6_node *root, struct fib6_info *rt,
struct nl_info *info, struct netlink_ext_ack *extack)
{
struct fib6_table *table = rt->fib6_table;
+ LIST_HEAD(purge_list);
struct fib6_node *fn;
#ifdef CONFIG_IPV6_SUBTREES
struct fib6_node *pn = NULL;
@@ -1499,8 +1498,16 @@ int fib6_add(struct fib6_node *root, struct fib6_info *rt,
}
#endif
- err = fib6_add_rt2node(fn, rt, info, extack);
+ err = fib6_add_rt2node(fn, rt, info, extack, &purge_list);
if (!err) {
+ struct fib6_info *iter, *next;
+
+ list_for_each_entry_safe(iter, next, &purge_list, purge_link) {
+ list_del(&iter->purge_link);
+ fib6_purge_rt(iter, fn, info->nl_net);
+ fib6_info_release(iter);
+ }
+
if (rt->nh)
list_add(&rt->nh_list, &rt->nh->f6i_list);
__fib6_update_sernum_upto_root(rt, fib6_new_sernum(info->nl_net));
--
2.49.0
^ permalink raw reply related [flat|nested] 23+ messages in thread* [PATCH v3 net-next 14/15] ipv6: Protect nh->f6i_list with spinlock and flag.
2025-04-18 0:03 [PATCH v3 net-next 00/15] ipv6: No RTNL for IPv6 routing table Kuniyuki Iwashima
` (12 preceding siblings ...)
2025-04-18 0:03 ` [PATCH v3 net-next 13/15] ipv6: Defer fib6_purge_rt() in fib6_add_rt2node() to fib6_add() Kuniyuki Iwashima
@ 2025-04-18 0:03 ` Kuniyuki Iwashima
2025-04-18 0:03 ` [PATCH v3 net-next 15/15] ipv6: Get rid of RTNL for SIOCADDRT and RTM_NEWROUTE Kuniyuki Iwashima
2025-04-24 7:50 ` [PATCH v3 net-next 00/15] ipv6: No RTNL for IPv6 routing table patchwork-bot+netdevbpf
15 siblings, 0 replies; 23+ messages in thread
From: Kuniyuki Iwashima @ 2025-04-18 0:03 UTC (permalink / raw)
To: David S. Miller, David Ahern, Eric Dumazet, Jakub Kicinski,
Paolo Abeni
Cc: Simon Horman, Kuniyuki Iwashima, Kuniyuki Iwashima, netdev
We will get rid of RTNL from RTM_NEWROUTE and SIOCADDRT.
Then, we may be going to add a route tied to a dying nexthop.
The nexthop itself is not freed during the RCU grace period, but
if we link a route after __remove_nexthop_fib() is called for the
nexthop, the route will be leaked.
To avoid the race between IPv6 route addition under RCU vs nexthop
deletion under RTNL, let's add a dead flag and protect it and
nh->f6i_list with a spinlock.
__remove_nexthop_fib() acquires the nexthop's spinlock and sets false
to nh->dead, then calls ip6_del_rt() for the linked route one by one
without the spinlock because fib6_purge_rt() acquires it later.
While adding an IPv6 route, fib6_add() acquires the nexthop lock and
checks the dead flag just before inserting the route.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
v3: Bundle critical section for rt->nh as fib6_add_rt2node_nh()
---
include/net/nexthop.h | 2 ++
net/ipv4/nexthop.c | 18 +++++++++++++++---
net/ipv6/ip6_fib.c | 39 ++++++++++++++++++++++++++++++++++-----
3 files changed, 51 insertions(+), 8 deletions(-)
diff --git a/include/net/nexthop.h b/include/net/nexthop.h
index d9fb44e8b321..572e69cda476 100644
--- a/include/net/nexthop.h
+++ b/include/net/nexthop.h
@@ -152,6 +152,8 @@ struct nexthop {
u8 protocol; /* app managing this nh */
u8 nh_flags;
bool is_group;
+ bool dead;
+ spinlock_t lock; /* protect dead and f6i_list */
refcount_t refcnt;
struct rcu_head rcu;
diff --git a/net/ipv4/nexthop.c b/net/ipv4/nexthop.c
index d9cf06b297d1..6ba6cb1340c1 100644
--- a/net/ipv4/nexthop.c
+++ b/net/ipv4/nexthop.c
@@ -541,6 +541,7 @@ static struct nexthop *nexthop_alloc(void)
INIT_LIST_HEAD(&nh->f6i_list);
INIT_LIST_HEAD(&nh->grp_list);
INIT_LIST_HEAD(&nh->fdb_list);
+ spin_lock_init(&nh->lock);
}
return nh;
}
@@ -2118,7 +2119,7 @@ static void remove_nexthop_group(struct nexthop *nh, struct nl_info *nlinfo)
/* not called for nexthop replace */
static void __remove_nexthop_fib(struct net *net, struct nexthop *nh)
{
- struct fib6_info *f6i, *tmp;
+ struct fib6_info *f6i;
bool do_flush = false;
struct fib_info *fi;
@@ -2129,13 +2130,24 @@ static void __remove_nexthop_fib(struct net *net, struct nexthop *nh)
if (do_flush)
fib_flush(net);
- /* ip6_del_rt removes the entry from this list hence the _safe */
- list_for_each_entry_safe(f6i, tmp, &nh->f6i_list, nh_list) {
+ spin_lock_bh(&nh->lock);
+
+ nh->dead = true;
+
+ while (!list_empty(&nh->f6i_list)) {
+ f6i = list_first_entry(&nh->f6i_list, typeof(*f6i), nh_list);
+
/* __ip6_del_rt does a release, so do a hold here */
fib6_info_hold(f6i);
+
+ spin_unlock_bh(&nh->lock);
ipv6_stub->ip6_del_rt(net, f6i,
!READ_ONCE(net->ipv4.sysctl_nexthop_compat_mode));
+
+ spin_lock_bh(&nh->lock);
}
+
+ spin_unlock_bh(&nh->lock);
}
static void __remove_nexthop(struct net *net, struct nexthop *nh,
diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index 9e9db5470bbf..1f860340690c 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -1048,8 +1048,14 @@ static void fib6_purge_rt(struct fib6_info *rt, struct fib6_node *fn,
rt6_flush_exceptions(rt);
fib6_drop_pcpu_from(rt, table);
- if (rt->nh && !list_empty(&rt->nh_list))
- list_del_init(&rt->nh_list);
+ if (rt->nh) {
+ spin_lock(&rt->nh->lock);
+
+ if (!list_empty(&rt->nh_list))
+ list_del_init(&rt->nh_list);
+
+ spin_unlock(&rt->nh->lock);
+ }
if (refcount_read(&rt->fib6_ref) != 1) {
/* This route is used as dummy address holder in some split
@@ -1341,6 +1347,28 @@ static int fib6_add_rt2node(struct fib6_node *fn, struct fib6_info *rt,
return 0;
}
+static int fib6_add_rt2node_nh(struct fib6_node *fn, struct fib6_info *rt,
+ struct nl_info *info, struct netlink_ext_ack *extack,
+ struct list_head *purge_list)
+{
+ int err;
+
+ spin_lock(&rt->nh->lock);
+
+ if (rt->nh->dead) {
+ NL_SET_ERR_MSG(extack, "Nexthop has been deleted");
+ err = -EINVAL;
+ } else {
+ err = fib6_add_rt2node(fn, rt, info, extack, purge_list);
+ if (!err)
+ list_add(&rt->nh_list, &rt->nh->f6i_list);
+ }
+
+ spin_unlock(&rt->nh->lock);
+
+ return err;
+}
+
static void fib6_start_gc(struct net *net, struct fib6_info *rt)
{
if (!timer_pending(&net->ipv6.ip6_fib_timer) &&
@@ -1498,7 +1526,10 @@ int fib6_add(struct fib6_node *root, struct fib6_info *rt,
}
#endif
- err = fib6_add_rt2node(fn, rt, info, extack, &purge_list);
+ if (rt->nh)
+ err = fib6_add_rt2node_nh(fn, rt, info, extack, &purge_list);
+ else
+ err = fib6_add_rt2node(fn, rt, info, extack, &purge_list);
if (!err) {
struct fib6_info *iter, *next;
@@ -1508,8 +1539,6 @@ int fib6_add(struct fib6_node *root, struct fib6_info *rt,
fib6_info_release(iter);
}
- if (rt->nh)
- list_add(&rt->nh_list, &rt->nh->f6i_list);
__fib6_update_sernum_upto_root(rt, fib6_new_sernum(info->nl_net));
if (rt->fib6_flags & RTF_EXPIRES)
--
2.49.0
^ permalink raw reply related [flat|nested] 23+ messages in thread* [PATCH v3 net-next 15/15] ipv6: Get rid of RTNL for SIOCADDRT and RTM_NEWROUTE.
2025-04-18 0:03 [PATCH v3 net-next 00/15] ipv6: No RTNL for IPv6 routing table Kuniyuki Iwashima
` (13 preceding siblings ...)
2025-04-18 0:03 ` [PATCH v3 net-next 14/15] ipv6: Protect nh->f6i_list with spinlock and flag Kuniyuki Iwashima
@ 2025-04-18 0:03 ` Kuniyuki Iwashima
2025-05-04 9:16 ` Eric Dumazet
2025-04-24 7:50 ` [PATCH v3 net-next 00/15] ipv6: No RTNL for IPv6 routing table patchwork-bot+netdevbpf
15 siblings, 1 reply; 23+ messages in thread
From: Kuniyuki Iwashima @ 2025-04-18 0:03 UTC (permalink / raw)
To: David S. Miller, David Ahern, Eric Dumazet, Jakub Kicinski,
Paolo Abeni
Cc: Simon Horman, Kuniyuki Iwashima, Kuniyuki Iwashima, netdev
Now we are ready to remove RTNL from SIOCADDRT and RTM_NEWROUTE.
The remaining things to do are
1. pass false to lwtunnel_valid_encap_type_attr()
2. use rcu_dereference_rtnl() in fib6_check_nexthop()
3. place rcu_read_lock() before ip6_route_info_create_nh().
Let's complete the RTNL-free conversion.
When each CPU-X adds 100000 routes on table-X in a batch
concurrently on c7a.metal-48xl EC2 instance with 192 CPUs,
without this series:
$ sudo ./route_test.sh
...
added 19200000 routes (100000 routes * 192 tables).
time elapsed: 191577 milliseconds.
with this series:
$ sudo ./route_test.sh
...
added 19200000 routes (100000 routes * 192 tables).
time elapsed: 62854 milliseconds.
I changed the number of routes in each table (1000 ~ 100000)
and consistently saw it finish 3x faster with this series.
Note that now every caller of lwtunnel_valid_encap_type() passes
false as the last argument, and this can be removed later.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
net/ipv4/nexthop.c | 4 ++--
net/ipv6/route.c | 18 ++++++++++++------
2 files changed, 14 insertions(+), 8 deletions(-)
diff --git a/net/ipv4/nexthop.c b/net/ipv4/nexthop.c
index 6ba6cb1340c1..823e4a783d2b 100644
--- a/net/ipv4/nexthop.c
+++ b/net/ipv4/nexthop.c
@@ -1556,12 +1556,12 @@ int fib6_check_nexthop(struct nexthop *nh, struct fib6_config *cfg,
if (nh->is_group) {
struct nh_group *nhg;
- nhg = rtnl_dereference(nh->nh_grp);
+ nhg = rcu_dereference_rtnl(nh->nh_grp);
if (nhg->has_v4)
goto no_v4_nh;
is_fdb_nh = nhg->fdb_nh;
} else {
- nhi = rtnl_dereference(nh->nh_info);
+ nhi = rcu_dereference_rtnl(nh->nh_info);
if (nhi->family == AF_INET)
goto no_v4_nh;
is_fdb_nh = nhi->fdb_nh;
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index c8c1c75268e3..bb46e724db73 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -3902,12 +3902,16 @@ int ip6_route_add(struct fib6_config *cfg, gfp_t gfp_flags,
if (IS_ERR(rt))
return PTR_ERR(rt);
+ rcu_read_lock();
+
err = ip6_route_info_create_nh(rt, cfg, extack);
if (err)
- return err;
+ goto unlock;
err = __ip6_ins_rt(rt, &cfg->fc_nlinfo, extack);
fib6_info_release(rt);
+unlock:
+ rcu_read_unlock();
return err;
}
@@ -4528,12 +4532,10 @@ int ipv6_route_ioctl(struct net *net, unsigned int cmd, struct in6_rtmsg *rtmsg)
switch (cmd) {
case SIOCADDRT:
- rtnl_lock();
/* Only do the default setting of fc_metric in route adding */
if (cfg.fc_metric == 0)
cfg.fc_metric = IP6_RT_PRIO_USER;
err = ip6_route_add(&cfg, GFP_KERNEL, NULL);
- rtnl_unlock();
break;
case SIOCDELRT:
err = ip6_route_del(&cfg, NULL);
@@ -5112,7 +5114,7 @@ static int rtm_to_fib6_multipath_config(struct fib6_config *cfg,
} while (rtnh_ok(rtnh, remaining));
return lwtunnel_valid_encap_type_attr(cfg->fc_mp, cfg->fc_mp_len,
- extack, newroute);
+ extack, false);
}
static int rtm_to_fib6_config(struct sk_buff *skb, struct nlmsghdr *nlh,
@@ -5250,7 +5252,7 @@ static int rtm_to_fib6_config(struct sk_buff *skb, struct nlmsghdr *nlh,
cfg->fc_encap_type = nla_get_u16(tb[RTA_ENCAP_TYPE]);
err = lwtunnel_valid_encap_type(cfg->fc_encap_type,
- extack, newroute);
+ extack, false);
if (err < 0)
goto errout;
}
@@ -5517,6 +5519,8 @@ static int ip6_route_multipath_add(struct fib6_config *cfg,
if (err)
return err;
+ rcu_read_lock();
+
err = ip6_route_mpath_info_create_nh(&rt6_nh_list, extack);
if (err)
goto cleanup;
@@ -5608,6 +5612,8 @@ static int ip6_route_multipath_add(struct fib6_config *cfg,
}
cleanup:
+ rcu_read_unlock();
+
list_for_each_entry_safe(nh, nh_safe, &rt6_nh_list, list) {
fib6_info_release(nh->fib6_info);
list_del(&nh->list);
@@ -6890,7 +6896,7 @@ static void bpf_iter_unregister(void)
static const struct rtnl_msg_handler ip6_route_rtnl_msg_handlers[] __initconst_or_module = {
{.owner = THIS_MODULE, .protocol = PF_INET6, .msgtype = RTM_NEWROUTE,
- .doit = inet6_rtm_newroute},
+ .doit = inet6_rtm_newroute, .flags = RTNL_FLAG_DOIT_UNLOCKED},
{.owner = THIS_MODULE, .protocol = PF_INET6, .msgtype = RTM_DELROUTE,
.doit = inet6_rtm_delroute, .flags = RTNL_FLAG_DOIT_UNLOCKED},
{.owner = THIS_MODULE, .protocol = PF_INET6, .msgtype = RTM_GETROUTE,
--
2.49.0
^ permalink raw reply related [flat|nested] 23+ messages in thread* Re: [PATCH v3 net-next 15/15] ipv6: Get rid of RTNL for SIOCADDRT and RTM_NEWROUTE.
2025-04-18 0:03 ` [PATCH v3 net-next 15/15] ipv6: Get rid of RTNL for SIOCADDRT and RTM_NEWROUTE Kuniyuki Iwashima
@ 2025-05-04 9:16 ` Eric Dumazet
2025-05-04 17:20 ` Kuniyuki Iwashima
0 siblings, 1 reply; 23+ messages in thread
From: Eric Dumazet @ 2025-05-04 9:16 UTC (permalink / raw)
To: Kuniyuki Iwashima
Cc: David S. Miller, David Ahern, Jakub Kicinski, Paolo Abeni,
Simon Horman, Kuniyuki Iwashima, netdev
On Thu, Apr 17, 2025 at 5:10 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
>
> Now we are ready to remove RTNL from SIOCADDRT and RTM_NEWROUTE.
>
> The remaining things to do are
>
> 1. pass false to lwtunnel_valid_encap_type_attr()
> 2. use rcu_dereference_rtnl() in fib6_check_nexthop()
> 3. place rcu_read_lock() before ip6_route_info_create_nh().
>
> Let's complete the RTNL-free conversion.
>
> When each CPU-X adds 100000 routes on table-X in a batch
> concurrently on c7a.metal-48xl EC2 instance with 192 CPUs,
>
> without this series:
>
> $ sudo ./route_test.sh
> ...
> added 19200000 routes (100000 routes * 192 tables).
> time elapsed: 191577 milliseconds.
>
> with this series:
>
> $ sudo ./route_test.sh
> ...
> added 19200000 routes (100000 routes * 192 tables).
> time elapsed: 62854 milliseconds.
>
> I changed the number of routes in each table (1000 ~ 100000)
> and consistently saw it finish 3x faster with this series.
>
> Note that now every caller of lwtunnel_valid_encap_type() passes
> false as the last argument, and this can be removed later.
>
> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> ---
> net/ipv4/nexthop.c | 4 ++--
> net/ipv6/route.c | 18 ++++++++++++------
> 2 files changed, 14 insertions(+), 8 deletions(-)
>
> diff --git a/net/ipv4/nexthop.c b/net/ipv4/nexthop.c
> index 6ba6cb1340c1..823e4a783d2b 100644
> --- a/net/ipv4/nexthop.c
> +++ b/net/ipv4/nexthop.c
> @@ -1556,12 +1556,12 @@ int fib6_check_nexthop(struct nexthop *nh, struct fib6_config *cfg,
> if (nh->is_group) {
> struct nh_group *nhg;
>
> - nhg = rtnl_dereference(nh->nh_grp);
> + nhg = rcu_dereference_rtnl(nh->nh_grp);
Or add a condition about table lock being held ?
> if (nhg->has_v4)
> goto no_v4_nh;
> is_fdb_nh = nhg->fdb_nh;
> } else {
> - nhi = rtnl_dereference(nh->nh_info);
> + nhi = rcu_dereference_rtnl(nh->nh_info);
> if (nhi->family == AF_INET)
> goto no_v4_nh;
> is_fdb_nh = nhi->fdb_nh;
> diff --git a/net/ipv6/route.c b/net/ipv6/route.c
> index c8c1c75268e3..bb46e724db73 100644
> --- a/net/ipv6/route.c
> +++ b/net/ipv6/route.c
> @@ -3902,12 +3902,16 @@ int ip6_route_add(struct fib6_config *cfg, gfp_t gfp_flags,
> if (IS_ERR(rt))
> return PTR_ERR(rt);
>
> + rcu_read_lock();
This looks bogus to me (and syzbot)
Holding rcu_read_lock() from writers is almost always wrong.
In this case, this prevents any GFP_KERNEL allocations to happen
(among other things)
[ BUG: Invalid wait context ]
6.15.0-rc4-syzkaller-00746-g836b313a14a3 #0 Tainted: G W
-----------------------------
syz-executor234/5832 is trying to lock:
ffffffff8e021688 (pcpu_alloc_mutex){+.+.}-{4:4}, at:
pcpu_alloc_noprof+0x284/0x16b0 mm/percpu.c:1782
other info that might help us debug this:
context-{5:5}
1 lock held by syz-executor234/5832:
#0: ffffffff8df3b860 (rcu_read_lock){....}-{1:3}, at: rcu_lock_acquire
include/linux/rcupdate.h:331 [inline]
#0: ffffffff8df3b860 (rcu_read_lock){....}-{1:3}, at: rcu_read_lock
include/linux/rcupdate.h:841 [inline]
#0: ffffffff8df3b860 (rcu_read_lock){....}-{1:3}, at:
ip6_route_add+0x4d/0x2f0 net/ipv6/route.c:3913
stack backtrace:
CPU: 0 UID: 0 PID: 5832 Comm: syz-executor234 Tainted: G W
6.15.0-rc4-syzkaller-00746-g836b313a14a3 #0 PREEMPT(full)
Tainted: [W]=WARN
Hardware name: Google Google Compute Engine/Google Compute Engine,
BIOS Google 04/29/2025
Call Trace:
<TASK>
dump_stack_lvl+0x189/0x250 lib/dump_stack.c:120
print_lock_invalid_wait_context kernel/locking/lockdep.c:4831 [inline]
check_wait_context kernel/locking/lockdep.c:4903 [inline]
__lock_acquire+0xbcf/0xd20 kernel/locking/lockdep.c:5185
lock_acquire+0x120/0x360 kernel/locking/lockdep.c:5866
__mutex_lock_common kernel/locking/mutex.c:601 [inline]
__mutex_lock+0x182/0xe80 kernel/locking/mutex.c:746
pcpu_alloc_noprof+0x284/0x16b0 mm/percpu.c:1782
dst_cache_init+0x37/0xc0 net/core/dst_cache.c:145
ip_tun_build_state+0x193/0x6b0 net/ipv4/ip_tunnel_core.c:687
lwtunnel_build_state+0x381/0x4c0 net/core/lwtunnel.c:137
fib_nh_common_init+0x129/0x460 net/ipv4/fib_semantics.c:635
fib6_nh_init+0x15e4/0x2030 net/ipv6/route.c:3669
ip6_route_info_create_nh+0x139/0x870 net/ipv6/route.c:3866
ip6_route_add+0xf6/0x2f0 net/ipv6/route.c:3915
inet6_rtm_newroute+0x284/0x1c50 net/ipv6/route.c:5732
rtnetlink_rcv_msg+0x7cc/0xb70 net/core/rtnetlink.c:6955
netlink_rcv_skb+0x219/0x490 net/netlink/af_netlink.c:2534
netlink_unicast_kernel net/netlink/af_netlink.c:1313 [inline]
netlink_unicast+0x758/0x8d0 net/netlink/af_netlink.c:1339
netlink_sendmsg+0x805/0xb30 net/netlink/af_netlink.c:1883
sock_sendmsg_nosec net/socket.c:712 [inline]
__sock_sendmsg+0x219/0x270 net/socket.c:727
____sys_sendmsg+0x505/0x830 net/socket.c:2566
___sys_sendmsg+0x21f/0x2a0 net/socket.c:2620
__sys_sendmsg net/socket.c:2652 [inline]
__do_sys_sendmsg net/socket.c:2657 [inline]
__se_sys_sendmsg net/socket.c:2655 [inline]
__x64_sys_sendmsg+0x19b/0x260 net/socket.c:2655
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0xf6/0x210 arch/x86/entry/syscall_64.c:94
^ permalink raw reply [flat|nested] 23+ messages in thread* Re: [PATCH v3 net-next 15/15] ipv6: Get rid of RTNL for SIOCADDRT and RTM_NEWROUTE.
2025-05-04 9:16 ` Eric Dumazet
@ 2025-05-04 17:20 ` Kuniyuki Iwashima
2025-05-04 19:34 ` Eric Dumazet
0 siblings, 1 reply; 23+ messages in thread
From: Kuniyuki Iwashima @ 2025-05-04 17:20 UTC (permalink / raw)
To: edumazet; +Cc: davem, dsahern, horms, kuba, kuni1840, kuniyu, netdev, pabeni
From: Eric Dumazet <edumazet@google.com>
Date: Sun, 4 May 2025 02:16:13 -0700
> On Thu, Apr 17, 2025 at 5:10 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> >
> > Now we are ready to remove RTNL from SIOCADDRT and RTM_NEWROUTE.
> >
> > The remaining things to do are
> >
> > 1. pass false to lwtunnel_valid_encap_type_attr()
> > 2. use rcu_dereference_rtnl() in fib6_check_nexthop()
> > 3. place rcu_read_lock() before ip6_route_info_create_nh().
> >
> > Let's complete the RTNL-free conversion.
> >
> > When each CPU-X adds 100000 routes on table-X in a batch
> > concurrently on c7a.metal-48xl EC2 instance with 192 CPUs,
> >
> > without this series:
> >
> > $ sudo ./route_test.sh
> > ...
> > added 19200000 routes (100000 routes * 192 tables).
> > time elapsed: 191577 milliseconds.
> >
> > with this series:
> >
> > $ sudo ./route_test.sh
> > ...
> > added 19200000 routes (100000 routes * 192 tables).
> > time elapsed: 62854 milliseconds.
> >
> > I changed the number of routes in each table (1000 ~ 100000)
> > and consistently saw it finish 3x faster with this series.
> >
> > Note that now every caller of lwtunnel_valid_encap_type() passes
> > false as the last argument, and this can be removed later.
> >
> > Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> > ---
> > net/ipv4/nexthop.c | 4 ++--
> > net/ipv6/route.c | 18 ++++++++++++------
> > 2 files changed, 14 insertions(+), 8 deletions(-)
> >
> > diff --git a/net/ipv4/nexthop.c b/net/ipv4/nexthop.c
> > index 6ba6cb1340c1..823e4a783d2b 100644
> > --- a/net/ipv4/nexthop.c
> > +++ b/net/ipv4/nexthop.c
> > @@ -1556,12 +1556,12 @@ int fib6_check_nexthop(struct nexthop *nh, struct fib6_config *cfg,
> > if (nh->is_group) {
> > struct nh_group *nhg;
> >
> > - nhg = rtnl_dereference(nh->nh_grp);
> > + nhg = rcu_dereference_rtnl(nh->nh_grp);
>
> Or add a condition about table lock being held ?
I think in this context the caller is more of an rcu reader
than a ipv6 route writer.
>
> > if (nhg->has_v4)
> > goto no_v4_nh;
> > is_fdb_nh = nhg->fdb_nh;
> > } else {
> > - nhi = rtnl_dereference(nh->nh_info);
> > + nhi = rcu_dereference_rtnl(nh->nh_info);
> > if (nhi->family == AF_INET)
> > goto no_v4_nh;
> > is_fdb_nh = nhi->fdb_nh;
> > diff --git a/net/ipv6/route.c b/net/ipv6/route.c
> > index c8c1c75268e3..bb46e724db73 100644
> > --- a/net/ipv6/route.c
> > +++ b/net/ipv6/route.c
> > @@ -3902,12 +3902,16 @@ int ip6_route_add(struct fib6_config *cfg, gfp_t gfp_flags,
> > if (IS_ERR(rt))
> > return PTR_ERR(rt);
> >
> > + rcu_read_lock();
>
> This looks bogus to me (and syzbot)
>
> Holding rcu_read_lock() from writers is almost always wrong.
Yes, but I added it as a reader of netdev and nexthop to guarantee
that they won't go away.
>
> In this case, this prevents any GFP_KERNEL allocations to happen
> (among other things)
Oh, I completely missed this path, thanks!
Fortunately, it seems all ->build_state() except for
ip_tun_build_state() use GFP_ATOMIC.
So, simply changing GFP_KERNEL to GFP_ATOMIC is acceptable ?
lwtunnel_state_alloc
- kzalloc(GFP_ATOMIC)
ip_tun_build_state
- dst_cache_init(GFP_KERNEL)
ip6_tun_build_state / mpls_build_state / xfrmi_build_state
-> no alloc other than lwtunnel_state_alloc()
bpf_build_state
- bpf_parse_prog
- nla_memdup(GFP_ATOMIC)
ila_build_state / ioam6_build_state / rpl_build_state / seg6_build_state
- dst_cache_init(GFP_ATOMIC)
seg6_local_build_state
- seg6_end_dt4_build / seg6_end_dt6_build / seg6_end_dt46_build
-> no alloc other than lwtunnel_state_alloc()
>
> [ BUG: Invalid wait context ]
> 6.15.0-rc4-syzkaller-00746-g836b313a14a3 #0 Tainted: G W
> -----------------------------
> syz-executor234/5832 is trying to lock:
> ffffffff8e021688 (pcpu_alloc_mutex){+.+.}-{4:4}, at:
> pcpu_alloc_noprof+0x284/0x16b0 mm/percpu.c:1782
> other info that might help us debug this:
> context-{5:5}
> 1 lock held by syz-executor234/5832:
> #0: ffffffff8df3b860 (rcu_read_lock){....}-{1:3}, at: rcu_lock_acquire
> include/linux/rcupdate.h:331 [inline]
> #0: ffffffff8df3b860 (rcu_read_lock){....}-{1:3}, at: rcu_read_lock
> include/linux/rcupdate.h:841 [inline]
> #0: ffffffff8df3b860 (rcu_read_lock){....}-{1:3}, at:
> ip6_route_add+0x4d/0x2f0 net/ipv6/route.c:3913
> stack backtrace:
> CPU: 0 UID: 0 PID: 5832 Comm: syz-executor234 Tainted: G W
> 6.15.0-rc4-syzkaller-00746-g836b313a14a3 #0 PREEMPT(full)
> Tainted: [W]=WARN
> Hardware name: Google Google Compute Engine/Google Compute Engine,
> BIOS Google 04/29/2025
> Call Trace:
> <TASK>
> dump_stack_lvl+0x189/0x250 lib/dump_stack.c:120
> print_lock_invalid_wait_context kernel/locking/lockdep.c:4831 [inline]
> check_wait_context kernel/locking/lockdep.c:4903 [inline]
> __lock_acquire+0xbcf/0xd20 kernel/locking/lockdep.c:5185
> lock_acquire+0x120/0x360 kernel/locking/lockdep.c:5866
> __mutex_lock_common kernel/locking/mutex.c:601 [inline]
> __mutex_lock+0x182/0xe80 kernel/locking/mutex.c:746
> pcpu_alloc_noprof+0x284/0x16b0 mm/percpu.c:1782
> dst_cache_init+0x37/0xc0 net/core/dst_cache.c:145
> ip_tun_build_state+0x193/0x6b0 net/ipv4/ip_tunnel_core.c:687
> lwtunnel_build_state+0x381/0x4c0 net/core/lwtunnel.c:137
> fib_nh_common_init+0x129/0x460 net/ipv4/fib_semantics.c:635
> fib6_nh_init+0x15e4/0x2030 net/ipv6/route.c:3669
> ip6_route_info_create_nh+0x139/0x870 net/ipv6/route.c:3866
> ip6_route_add+0xf6/0x2f0 net/ipv6/route.c:3915
> inet6_rtm_newroute+0x284/0x1c50 net/ipv6/route.c:5732
> rtnetlink_rcv_msg+0x7cc/0xb70 net/core/rtnetlink.c:6955
> netlink_rcv_skb+0x219/0x490 net/netlink/af_netlink.c:2534
> netlink_unicast_kernel net/netlink/af_netlink.c:1313 [inline]
> netlink_unicast+0x758/0x8d0 net/netlink/af_netlink.c:1339
> netlink_sendmsg+0x805/0xb30 net/netlink/af_netlink.c:1883
> sock_sendmsg_nosec net/socket.c:712 [inline]
> __sock_sendmsg+0x219/0x270 net/socket.c:727
> ____sys_sendmsg+0x505/0x830 net/socket.c:2566
> ___sys_sendmsg+0x21f/0x2a0 net/socket.c:2620
> __sys_sendmsg net/socket.c:2652 [inline]
> __do_sys_sendmsg net/socket.c:2657 [inline]
> __se_sys_sendmsg net/socket.c:2655 [inline]
> __x64_sys_sendmsg+0x19b/0x260 net/socket.c:2655
> do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
> do_syscall_64+0xf6/0x210 arch/x86/entry/syscall_64.c:94
^ permalink raw reply [flat|nested] 23+ messages in thread* Re: [PATCH v3 net-next 15/15] ipv6: Get rid of RTNL for SIOCADDRT and RTM_NEWROUTE.
2025-05-04 17:20 ` Kuniyuki Iwashima
@ 2025-05-04 19:34 ` Eric Dumazet
2025-05-04 20:11 ` Kuniyuki Iwashima
0 siblings, 1 reply; 23+ messages in thread
From: Eric Dumazet @ 2025-05-04 19:34 UTC (permalink / raw)
To: Kuniyuki Iwashima; +Cc: davem, dsahern, horms, kuba, kuni1840, netdev, pabeni
On Sun, May 4, 2025 at 10:22 AM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
>
> From: Eric Dumazet <edumazet@google.com>
> Date: Sun, 4 May 2025 02:16:13 -0700
> > On Thu, Apr 17, 2025 at 5:10 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> > >
> > > Now we are ready to remove RTNL from SIOCADDRT and RTM_NEWROUTE.
> > >
> > > The remaining things to do are
> > >
> > > 1. pass false to lwtunnel_valid_encap_type_attr()
> > > 2. use rcu_dereference_rtnl() in fib6_check_nexthop()
> > > 3. place rcu_read_lock() before ip6_route_info_create_nh().
> > >
> > > Let's complete the RTNL-free conversion.
> > >
> > > When each CPU-X adds 100000 routes on table-X in a batch
> > > concurrently on c7a.metal-48xl EC2 instance with 192 CPUs,
> > >
> > > without this series:
> > >
> > > $ sudo ./route_test.sh
> > > ...
> > > added 19200000 routes (100000 routes * 192 tables).
> > > time elapsed: 191577 milliseconds.
> > >
> > > with this series:
> > >
> > > $ sudo ./route_test.sh
> > > ...
> > > added 19200000 routes (100000 routes * 192 tables).
> > > time elapsed: 62854 milliseconds.
> > >
> > > I changed the number of routes in each table (1000 ~ 100000)
> > > and consistently saw it finish 3x faster with this series.
> > >
> > > Note that now every caller of lwtunnel_valid_encap_type() passes
> > > false as the last argument, and this can be removed later.
> > >
> > > Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> > > ---
> > > net/ipv4/nexthop.c | 4 ++--
> > > net/ipv6/route.c | 18 ++++++++++++------
> > > 2 files changed, 14 insertions(+), 8 deletions(-)
> > >
> > > diff --git a/net/ipv4/nexthop.c b/net/ipv4/nexthop.c
> > > index 6ba6cb1340c1..823e4a783d2b 100644
> > > --- a/net/ipv4/nexthop.c
> > > +++ b/net/ipv4/nexthop.c
> > > @@ -1556,12 +1556,12 @@ int fib6_check_nexthop(struct nexthop *nh, struct fib6_config *cfg,
> > > if (nh->is_group) {
> > > struct nh_group *nhg;
> > >
> > > - nhg = rtnl_dereference(nh->nh_grp);
> > > + nhg = rcu_dereference_rtnl(nh->nh_grp);
> >
> > Or add a condition about table lock being held ?
>
> I think in this context the caller is more of an rcu reader
> than a ipv6 route writer.
>
>
>
> >
> > > if (nhg->has_v4)
> > > goto no_v4_nh;
> > > is_fdb_nh = nhg->fdb_nh;
> > > } else {
> > > - nhi = rtnl_dereference(nh->nh_info);
> > > + nhi = rcu_dereference_rtnl(nh->nh_info);
> > > if (nhi->family == AF_INET)
> > > goto no_v4_nh;
> > > is_fdb_nh = nhi->fdb_nh;
> > > diff --git a/net/ipv6/route.c b/net/ipv6/route.c
> > > index c8c1c75268e3..bb46e724db73 100644
> > > --- a/net/ipv6/route.c
> > > +++ b/net/ipv6/route.c
> > > @@ -3902,12 +3902,16 @@ int ip6_route_add(struct fib6_config *cfg, gfp_t gfp_flags,
> > > if (IS_ERR(rt))
> > > return PTR_ERR(rt);
> > >
> > > + rcu_read_lock();
> >
> > This looks bogus to me (and syzbot)
> >
> > Holding rcu_read_lock() from writers is almost always wrong.
>
> Yes, but I added it as a reader of netdev and nexthop to guarantee
> that they won't go away.
We have standard ways for preventing this, acquiring a refcount on the objects.
>
>
> >
> > In this case, this prevents any GFP_KERNEL allocations to happen
> > (among other things)
>
> Oh, I completely missed this path, thanks!
>
> Fortunately, it seems all ->build_state() except for
> ip_tun_build_state() use GFP_ATOMIC.
>
> So, simply changing GFP_KERNEL to GFP_ATOMIC is acceptable ?
What protects against writers' mutual exclusion ?
I dislike using GFP_ATOMIC in control paths. This is something that we
must avoid.
This will make admin operations unpredictable. Many scripts would
break under pressure.
>
>
> lwtunnel_state_alloc
> - kzalloc(GFP_ATOMIC)
>
> ip_tun_build_state
> - dst_cache_init(GFP_KERNEL)
>
> ip6_tun_build_state / mpls_build_state / xfrmi_build_state
> -> no alloc other than lwtunnel_state_alloc()
>
> bpf_build_state
> - bpf_parse_prog
> - nla_memdup(GFP_ATOMIC)
>
> ila_build_state / ioam6_build_state / rpl_build_state / seg6_build_state
> - dst_cache_init(GFP_ATOMIC)
>
> seg6_local_build_state
> - seg6_end_dt4_build / seg6_end_dt6_build / seg6_end_dt46_build
> -> no alloc other than lwtunnel_state_alloc()
>
You mean, following the wrong fix done in :
14a0087e7236228d56bfa3fab7084c19fcb513fb ipv6: sr: switch to
GFP_ATOMIC flag to allocate memory during seg6local LWT setup
^ permalink raw reply [flat|nested] 23+ messages in thread* Re: [PATCH v3 net-next 15/15] ipv6: Get rid of RTNL for SIOCADDRT and RTM_NEWROUTE.
2025-05-04 19:34 ` Eric Dumazet
@ 2025-05-04 20:11 ` Kuniyuki Iwashima
0 siblings, 0 replies; 23+ messages in thread
From: Kuniyuki Iwashima @ 2025-05-04 20:11 UTC (permalink / raw)
To: edumazet; +Cc: davem, dsahern, horms, kuba, kuni1840, kuniyu, netdev, pabeni
From: Eric Dumazet <edumazet@google.com>
Date: Sun, 4 May 2025 12:34:49 -0700
> On Sun, May 4, 2025 at 10:22 AM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> >
> > From: Eric Dumazet <edumazet@google.com>
> > Date: Sun, 4 May 2025 02:16:13 -0700
> > > On Thu, Apr 17, 2025 at 5:10 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> > > >
> > > > Now we are ready to remove RTNL from SIOCADDRT and RTM_NEWROUTE.
> > > >
> > > > The remaining things to do are
> > > >
> > > > 1. pass false to lwtunnel_valid_encap_type_attr()
> > > > 2. use rcu_dereference_rtnl() in fib6_check_nexthop()
> > > > 3. place rcu_read_lock() before ip6_route_info_create_nh().
> > > >
> > > > Let's complete the RTNL-free conversion.
> > > >
> > > > When each CPU-X adds 100000 routes on table-X in a batch
> > > > concurrently on c7a.metal-48xl EC2 instance with 192 CPUs,
> > > >
> > > > without this series:
> > > >
> > > > $ sudo ./route_test.sh
> > > > ...
> > > > added 19200000 routes (100000 routes * 192 tables).
> > > > time elapsed: 191577 milliseconds.
> > > >
> > > > with this series:
> > > >
> > > > $ sudo ./route_test.sh
> > > > ...
> > > > added 19200000 routes (100000 routes * 192 tables).
> > > > time elapsed: 62854 milliseconds.
> > > >
> > > > I changed the number of routes in each table (1000 ~ 100000)
> > > > and consistently saw it finish 3x faster with this series.
> > > >
> > > > Note that now every caller of lwtunnel_valid_encap_type() passes
> > > > false as the last argument, and this can be removed later.
> > > >
> > > > Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> > > > ---
> > > > net/ipv4/nexthop.c | 4 ++--
> > > > net/ipv6/route.c | 18 ++++++++++++------
> > > > 2 files changed, 14 insertions(+), 8 deletions(-)
> > > >
> > > > diff --git a/net/ipv4/nexthop.c b/net/ipv4/nexthop.c
> > > > index 6ba6cb1340c1..823e4a783d2b 100644
> > > > --- a/net/ipv4/nexthop.c
> > > > +++ b/net/ipv4/nexthop.c
> > > > @@ -1556,12 +1556,12 @@ int fib6_check_nexthop(struct nexthop *nh, struct fib6_config *cfg,
> > > > if (nh->is_group) {
> > > > struct nh_group *nhg;
> > > >
> > > > - nhg = rtnl_dereference(nh->nh_grp);
> > > > + nhg = rcu_dereference_rtnl(nh->nh_grp);
> > >
> > > Or add a condition about table lock being held ?
> >
> > I think in this context the caller is more of an rcu reader
> > than a ipv6 route writer.
> >
> >
> >
> > >
> > > > if (nhg->has_v4)
> > > > goto no_v4_nh;
> > > > is_fdb_nh = nhg->fdb_nh;
> > > > } else {
> > > > - nhi = rtnl_dereference(nh->nh_info);
> > > > + nhi = rcu_dereference_rtnl(nh->nh_info);
> > > > if (nhi->family == AF_INET)
> > > > goto no_v4_nh;
> > > > is_fdb_nh = nhi->fdb_nh;
> > > > diff --git a/net/ipv6/route.c b/net/ipv6/route.c
> > > > index c8c1c75268e3..bb46e724db73 100644
> > > > --- a/net/ipv6/route.c
> > > > +++ b/net/ipv6/route.c
> > > > @@ -3902,12 +3902,16 @@ int ip6_route_add(struct fib6_config *cfg, gfp_t gfp_flags,
> > > > if (IS_ERR(rt))
> > > > return PTR_ERR(rt);
> > > >
> > > > + rcu_read_lock();
> > >
> > > This looks bogus to me (and syzbot)
> > >
> > > Holding rcu_read_lock() from writers is almost always wrong.
> >
> > Yes, but I added it as a reader of netdev and nexthop to guarantee
> > that they won't go away.
>
> We have standard ways for preventing this, acquiring a refcount on the objects.
> >
> >
> > >
> > > In this case, this prevents any GFP_KERNEL allocations to happen
> > > (among other things)
> >
> > Oh, I completely missed this path, thanks!
> >
> > Fortunately, it seems all ->build_state() except for
> > ip_tun_build_state() use GFP_ATOMIC.
> >
> > So, simply changing GFP_KERNEL to GFP_ATOMIC is acceptable ?
>
> What protects against writers' mutual exclusion ?
>
> I dislike using GFP_ATOMIC in control paths. This is something that we
> must avoid.
>
> This will make admin operations unpredictable. Many scripts would
> break under pressure.
I see. I'll rework this and convert RCU to refcounting of
netdev and nexthop.
Then, I'll change the rcu_dereference_rtnl() above and friends to
check the table lock as you suggested.
>
> >
> >
> > lwtunnel_state_alloc
> > - kzalloc(GFP_ATOMIC)
> >
> > ip_tun_build_state
> > - dst_cache_init(GFP_KERNEL)
> >
> > ip6_tun_build_state / mpls_build_state / xfrmi_build_state
> > -> no alloc other than lwtunnel_state_alloc()
> >
> > bpf_build_state
> > - bpf_parse_prog
> > - nla_memdup(GFP_ATOMIC)
> >
> > ila_build_state / ioam6_build_state / rpl_build_state / seg6_build_state
> > - dst_cache_init(GFP_ATOMIC)
> >
> > seg6_local_build_state
> > - seg6_end_dt4_build / seg6_end_dt6_build / seg6_end_dt46_build
> > -> no alloc other than lwtunnel_state_alloc()
> >
>
> You mean, following the wrong fix done in :
>
> 14a0087e7236228d56bfa3fab7084c19fcb513fb ipv6: sr: switch to
> GFP_ATOMIC flag to allocate memory during seg6local LWT setup
Oh it's last week... somehow I missed this mail..
maybe I filtered it with ipv6: sr.. :/
In the rework, I'll include the revert of that.
Thanks!
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v3 net-next 00/15] ipv6: No RTNL for IPv6 routing table.
2025-04-18 0:03 [PATCH v3 net-next 00/15] ipv6: No RTNL for IPv6 routing table Kuniyuki Iwashima
` (14 preceding siblings ...)
2025-04-18 0:03 ` [PATCH v3 net-next 15/15] ipv6: Get rid of RTNL for SIOCADDRT and RTM_NEWROUTE Kuniyuki Iwashima
@ 2025-04-24 7:50 ` patchwork-bot+netdevbpf
15 siblings, 0 replies; 23+ messages in thread
From: patchwork-bot+netdevbpf @ 2025-04-24 7:50 UTC (permalink / raw)
To: Kuniyuki Iwashima
Cc: davem, dsahern, edumazet, kuba, pabeni, horms, kuni1840, netdev
Hello:
This series was applied to netdev/net-next.git (main)
by Paolo Abeni <pabeni@redhat.com>:
On Thu, 17 Apr 2025 17:03:41 -0700 you wrote:
> IPv6 routing tables are protected by each table's lock and work in
> the interrupt context, which means we basically don't need RTNL to
> modify an IPv6 routing table itself.
>
> Currently, the control paths require RTNL because we may need to
> perform device and nexthop lookups; we must prevent dev/nexthop from
> going away from the netns.
>
> [...]
Here is the summary with links:
- [v3,net-next,01/15] ipv6: Validate RTA_GATEWAY of RTA_MULTIPATH in rtm_to_fib6_config().
https://git.kernel.org/netdev/net-next/c/4cb4861d8c3b
- [v3,net-next,02/15] ipv6: Get rid of RTNL for SIOCDELRT and RTM_DELROUTE.
https://git.kernel.org/netdev/net-next/c/bd11ff421d36
- [v3,net-next,03/15] ipv6: Move some validation from ip6_route_info_create() to rtm_to_fib6_config().
https://git.kernel.org/netdev/net-next/c/fa76c1674f2e
- [v3,net-next,04/15] ipv6: Check GATEWAY in rtm_to_fib6_multipath_config().
https://git.kernel.org/netdev/net-next/c/e6f497955fb6
- [v3,net-next,05/15] ipv6: Move nexthop_find_by_id() after fib6_info_alloc().
https://git.kernel.org/netdev/net-next/c/c9cabe05e450
- [v3,net-next,06/15] ipv6: Split ip6_route_info_create().
https://git.kernel.org/netdev/net-next/c/c4837b9853e5
- [v3,net-next,07/15] ipv6: Preallocate rt->fib6_nh->rt6i_pcpu in ip6_route_info_create().
https://git.kernel.org/netdev/net-next/c/5720a328c3e9
- [v3,net-next,08/15] ipv6: Preallocate nhc_pcpu_rth_output in ip6_route_info_create().
https://git.kernel.org/netdev/net-next/c/d27b9c40dbd6
- [v3,net-next,09/15] ipv6: Don't pass net to ip6_route_info_append().
https://git.kernel.org/netdev/net-next/c/87d5d921eaf2
- [v3,net-next,10/15] ipv6: Rename rt6_nh.next to rt6_nh.list.
https://git.kernel.org/netdev/net-next/c/5a1ccff5c65a
- [v3,net-next,11/15] ipv6: Factorise ip6_route_multipath_add().
https://git.kernel.org/netdev/net-next/c/71c0efb6d12f
- [v3,net-next,12/15] ipv6: Protect fib6_link_table() with spinlock.
https://git.kernel.org/netdev/net-next/c/834d97843e3b
- [v3,net-next,13/15] ipv6: Defer fib6_purge_rt() in fib6_add_rt2node() to fib6_add().
https://git.kernel.org/netdev/net-next/c/accb46b56bc3
- [v3,net-next,14/15] ipv6: Protect nh->f6i_list with spinlock and flag.
https://git.kernel.org/netdev/net-next/c/081efd18326e
- [v3,net-next,15/15] ipv6: Get rid of RTNL for SIOCADDRT and RTM_NEWROUTE.
https://git.kernel.org/netdev/net-next/c/169fd62799e8
You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
^ permalink raw reply [flat|nested] 23+ messages in thread