* [PATCH] ipv6: fib6: fix NULL deref in fib6_walk_continue() on multi-batch dump
@ 2026-06-24 17:11 Pengfei Zhang
  2026-06-24 17:22 ` Eric Dumazet
                   ` (2 more replies)
  0 siblings, 3 replies; 4+ messages in thread
From: Pengfei Zhang @ 2026-06-24 17:11 UTC (permalink / raw)
  To: dsahern, idosch
  Cc: davem, edumazet, kuba, pabeni, horms, netdev, linux-kernel,
	chenzhangqi, baohua, Pengfei Zhang, Pengfei Zhang

From: Pengfei Zhang <zhangpengfei16@xiaomi.com>

inet6_dump_fib() saves its progress in cb->args[1] as a positional
index within the current hash chain.  Between batches the RTNL lock
is released, so a concurrent fib6_new_table() can insert a new table
at the chain head, shifting all existing entries.  The saved index
then lands on a different table, causing fib6_dump_table() to set
w->root to the wrong table while w->node still points into the
previous one.  fib6_walk_continue() dereferences w->node->parent
(NULL) and panics:

  BUG: kernel NULL pointer dereference, address: 0000000000000008
  RIP: 0010:fib6_walk_continue+0x6e/0x170
  Call Trace:
   <TASK>
   fib6_dump_table.isra.0+0xc5/0x240
   inet6_dump_fib+0xf6/0x420
   rtnl_dumpit+0x30/0xa0
   netlink_dump+0x15b/0x460
   netlink_recvmsg+0x1d6/0x2a0
   ____sys_recvmsg+0x17a/0x190

Fix by storing tb->tb6_id in cb->args[1] instead of a positional
index.  On resume, skip entries until the id matches; a concurrent
head-insert can never match the saved id, so the walker always
resumes on the correct table.
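
As a rough user-space illustration of the resume-by-id idea (not the
kernel code itself), the sketch below models one hash chain as a singly
linked list.  The names demo_table, demo_push_head() and
demo_resume_by_id() are hypothetical, and id 0 is assumed to mean "no
resume point", mirroring the s_id test in the patch:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct fib6_table on one hash chain. */
struct demo_table {
	unsigned int id;            /* plays the role of tb->tb6_id */
	struct demo_table *next;
};

/* Insert at the chain head, as a concurrent table creation would. */
static void demo_push_head(struct demo_table **head, struct demo_table *tb)
{
	assert(tb->id != 0);        /* 0 is reserved for "no resume point" */
	tb->next = *head;
	*head = tb;
}

/* Return the table a dump resumes on: skip entries until the saved id
 * matches; with saved_id == 0 start at the first entry. */
static struct demo_table *demo_resume_by_id(struct demo_table *head,
					    unsigned int saved_id)
{
	struct demo_table *tb;

	for (tb = head; tb; tb = tb->next) {
		if (saved_id && tb->id != saved_id)
			continue;
		return tb;
	}
	return NULL;                /* saved table not found on this chain */
}
```

With a chain [A, B] and B's id saved, inserting C at the head still
resumes on B, because the lookup no longer depends on chain position.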

Signed-off-by: Pengfei Zhang <zhangfeionline@gmail.com>
---
The same crash was independently reported in a production environment
(kernel 5.15.137, triggered by ovs-vswitchd issuing RTM_GETROUTE):
  https://lkml.iu.edu/hypermail/linux/kernel/2402.3/02068.html

The crash is probabilistic and occurs in fib6_walk_continue() at the
FWS_U state:

  case FWS_U:
      if (fn == w->root)
          return 0;
      pn = rcu_dereference_protected(fn->parent, 1);
      left = rcu_dereference_protected(pn->left, 1);  /* crash here */
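
The reported fault address (0x8) is consistent with reading pn->left
through a NULL pn on a 64-bit build.  The sketch below assumes a node
layout with parent as the first member followed by left; demo_node and
demo_left_fault_addr() are hypothetical names used only for
illustration:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the first members of a FIB tree node;
 * only the member order (parent, left, right) matters here. */
struct demo_node {
	struct demo_node *parent;
	struct demo_node *left;
	struct demo_node *right;
};

static_assert(offsetof(struct demo_node, parent) == 0,
	      "parent is assumed to be the first member");

/* Address that a load of pn->left would touch, computed without
 * actually dereferencing pn. */
static uintptr_t demo_left_fault_addr(const struct demo_node *pn)
{
	return (uintptr_t)pn + offsetof(struct demo_node, left);
}
```

With pn == NULL the computed address equals the size of one pointer,
which is 8 on a 64-bit kernel.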

The crash dump shows fn->parent is NULL.  At first glance this looks
like fn is a leaf node whose parent was freed, but closer inspection of
the walker state reveals fn->fn_flags has RTN_ROOT set: fn is itself
the root node of a routing table, not a child node.  A root node has no
parent by definition, so fn->parent == NULL is correct for that node.

The real question is why fn != w->root despite fn being a root.  The
answer is that w->root and fn belong to *different* tables: w->node
(which became fn during traversal) still references a node from the
table that was being dumped when the batch suspended, while w->root was
silently redirected to a different table on resume.

This misdirection happens because inet6_dump_fib() uses a positional
index to resume across batches.  Consider a hash slot containing two
tables [A(pos=0), B(pos=1)] where B is large enough to require multiple
batches.  On the first batch, B suspends mid-walk and the loop saves:

  cb->args[1] = e;   /* e=1, position of B in the chain */

The RTNL lock is then released.  At this point a concurrent
fib6_new_table() inserts table C at the chain head via
hlist_add_head_rcu(), making the chain [C(pos=0), A(pos=1), B(pos=2)].

On the next batch, inet6_dump_fib() resumes with s_e=1 and iterates:

  s_e = cb->args[1];   /* s_e = 1 */
  hlist_for_each_entry_rcu(tb, head, tb6_hlist) {
      if (e < s_e)     /* e=0: skip C at pos=0 */
          goto next;
      /* e=1: tb now points to A, not B */
      fib6_dump_table(tb, skb, cb);   /* called with wrong table A */
  next:
      e++;
  }
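
The effect of the head insert on a positional resume can be reproduced
in user space.  This is a minimal sketch under simplifying assumptions
(a plain singly linked list instead of an RCU hlist); demo_table,
demo_push_head() and demo_resume_by_pos() are hypothetical names:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a table on one hash chain. */
struct demo_table {
	char name;
	struct demo_table *next;
};

/* Insert at the chain head, shifting every existing position by one. */
static void demo_push_head(struct demo_table **head, struct demo_table *tb)
{
	assert(tb != NULL);
	tb->next = *head;
	*head = tb;
}

/* Resume the way the old loop did: skip s_e entries, take the next. */
static struct demo_table *demo_resume_by_pos(struct demo_table *head,
					     unsigned int s_e)
{
	struct demo_table *tb;
	unsigned int e = 0;

	for (tb = head; tb; tb = tb->next) {
		if (e < s_e)
			goto next;
		return tb;
next:
		e++;
	}
	return NULL;
}
```

For the chain [A, B], position 1 selects B.  After C is pushed at the
head the same position selects A, which is the wrong table.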

Inside fib6_dump_table(), w->root is unconditionally overwritten
before the resume branch is entered:

  w->root = &table->tb6_root;        /* now A's root              */
  /* ... */
  } else {
      int sernum = READ_ONCE(w->root->fn_sernum);  /* A's sernum  */
      if (cb->args[5] != sernum) {
          /* sernum changed: safe reset, w->node = w->root (A)    */
          w->node = w->root;
      } else {
          /* sernum unchanged: w->node untouched, still in B       */
          w->skip = 0;
      }
      fib6_walk_continue(w);   /* sernum equal: w->root=A, w->node=B */
  }

The sernum guard was intended to detect tree modifications and reset
the walk, but here the two tables happen to share the same fn_sernum
value (a global flush had previously unified them), so the guard does
not fire and w->node is left pointing into B's tree.
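
Why the guard stays silent can be shown with a reduced model.  The
sketch assumes the resume path compares only the saved serial number
with that of the new root; demo_root, demo_walker and demo_resume() are
hypothetical, and node_owner simply records which table w->node lives
in:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins: a table root and the walker state. */
struct demo_root {
	int fn_sernum;
};

struct demo_walker {
	const struct demo_root *root;        /* like w->root             */
	const struct demo_root *node_owner;  /* table that w->node is in */
};

/* Model of the resume branch: root is always overwritten; the node is
 * reset only when the saved serial number differs.  Returns 1 if the
 * walk was reset, 0 if it continues from the old node. */
static int demo_resume(struct demo_walker *w,
		       const struct demo_root *new_root, int saved_sernum)
{
	assert(w != NULL && new_root != NULL);
	w->root = new_root;
	if (saved_sernum != new_root->fn_sernum) {
		w->node_owner = new_root;    /* safe: restart at new root */
		return 1;
	}
	return 0;                            /* node still in old table   */
}
```

When A and B carry the same serial number, resuming on A leaves
node_owner pointing at B while root points at A.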

From this point w->root and w->node belong to different tables.  When
fib6_walk_continue() traverses upward and reaches B's root node
(fn->fn_flags & RTN_ROOT), the exit check fails and the NULL parent is
dereferenced:

  if (fn == w->root)   /* B's root != A's root, check fails */
      return 0;
  pn = fn->parent;     /* B's root has no parent: pn == NULL */
  left = pn->left;     /* NULL deref -> crash */
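
The upward climb with a mismatched root can be modelled as follows.
This is only a sketch: demo_node and demo_climb() are hypothetical, and
the function returns -1 at the point where the kernel walker would
instead dereference the NULL parent:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tree node carrying only the parent link. */
struct demo_node {
	struct demo_node *parent;
};

/* Climb from fn until root is reached.  Returns the number of steps,
 * or -1 if a node without a parent is hit first, i.e. fn and root
 * belong to different trees. */
static int demo_climb(const struct demo_node *fn,
		      const struct demo_node *root)
{
	int steps = 0;

	assert(fn != NULL && root != NULL);
	while (fn != root) {
		if (!fn->parent)
			return -1;  /* the kernel would fault on pn->left here */
		fn = fn->parent;
		steps++;
	}
	return steps;
}
```

Climbing from a node in B terminates normally when root is B's root,
and runs off the top of the tree when root is A's root.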

 net/ipv6/ip6_fib.c | 17 ++++++++---------
 1 file changed, 8 insertions(+), 9 deletions(-)

diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index fc95738de..bda492634 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -636,11 +636,11 @@ static int inet6_dump_fib(struct sk_buff *skb, struct netlink_callback *cb)
 	};
 	const struct nlmsghdr *nlh = cb->nlh;
 	struct net *net = sock_net(skb->sk);
-	unsigned int e = 0, s_e;
 	struct hlist_head *head;
 	struct fib6_walker *w;
 	struct fib6_table *tb;
 	unsigned int h, s_h;
+	u32 s_id;
 	int err = 0;
 
 	rcu_read_lock();
@@ -701,23 +701,22 @@ static int inet6_dump_fib(struct sk_buff *skb, struct netlink_callback *cb)
 	}
 
 	s_h = cb->args[0];
-	s_e = cb->args[1];
+	s_id = cb->args[1];
 
-	for (h = s_h; h < FIB6_TABLE_HASHSZ; h++, s_e = 0) {
-		e = 0;
+	for (h = s_h; h < FIB6_TABLE_HASHSZ; h++, s_id = 0) {
 		head = &net->ipv6.fib_table_hash[h];
 		hlist_for_each_entry_rcu(tb, head, tb6_hlist) {
-			if (e < s_e)
-				goto next;
+			if (s_id && tb->tb6_id != s_id)
+				continue;
+			s_id = 0;
+
+			cb->args[1] = tb->tb6_id;
 			err = fib6_dump_table(tb, skb, cb);
 			if (err != 0)
 				goto out;
-next:
-			e++;
 		}
 	}
 out:
-	cb->args[1] = e;
 	cb->args[0] = h;
 
 unlock:
-- 
2.34.1


* Re: [PATCH] ipv6: fib6: fix NULL deref in fib6_walk_continue() on multi-batch dump
  2026-06-24 17:11 [PATCH] ipv6: fib6: fix NULL deref in fib6_walk_continue() on multi-batch dump Pengfei Zhang
@ 2026-06-24 17:22 ` Eric Dumazet
  2026-06-25  1:23 ` Pengfei Zhang
  2026-06-25  1:23 ` [PATCH v2] " Pengfei Zhang
  2 siblings, 0 replies; 4+ messages in thread
From: Eric Dumazet @ 2026-06-24 17:22 UTC (permalink / raw)
  To: Pengfei Zhang
  Cc: dsahern, idosch, davem, kuba, pabeni, horms, netdev, linux-kernel,
	chenzhangqi, baohua, Pengfei Zhang

On Wed, Jun 24, 2026 at 10:12 AM Pengfei Zhang <zhangfeionline@gmail.com> wrote:
>
> From: Pengfei Zhang <zhangpengfei16@xiaomi.com>
>
> inet6_dump_fib() saves its progress in cb->args[1] as a positional
> index within the current hash chain.  Between batches the RTNL lock
> is released, so a concurrent fib6_new_table() can insert a new table
> at the chain head, shifting all existing entries.  The saved index
> then lands on a different table, causing fib6_dump_table() to set
> w->root to the wrong table while w->node still points into the
> previous one.  fib6_walk_continue() dereferences w->node->parent
> (NULL) and panics:
>
>   BUG: kernel NULL pointer dereference, address: 0000000000000008
>   RIP: 0010:fib6_walk_continue+0x6e/0x170
>   Call Trace:
>    <TASK>
>    fib6_dump_table.isra.0+0xc5/0x240
>    inet6_dump_fib+0xf6/0x420
>    rtnl_dumpit+0x30/0xa0
>    netlink_dump+0x15b/0x460
>    netlink_recvmsg+0x1d6/0x2a0
>    ____sys_recvmsg+0x17a/0x190
>
> Fix by storing tb->tb6_id in cb->args[1] instead of a positional
> index.  On resume, skip entries until the id matches; a concurrent
> head-insert can never match the saved id, so the walker always
> resumes on the correct table.
>
> Signed-off-by: Pengfei Zhang <zhangfeionline@gmail.com>

Patch looks good, but you forgot to add a Fixes: tag.

Perhaps:

Fixes: 1b43af5480c3 ("[IPV6]: Increase number of possible routing tables to 2^32")


* [PATCH v2] ipv6: fib6: fix NULL deref in fib6_walk_continue() on multi-batch dump
  2026-06-24 17:11 [PATCH] ipv6: fib6: fix NULL deref in fib6_walk_continue() on multi-batch dump Pengfei Zhang
  2026-06-24 17:22 ` Eric Dumazet
  2026-06-25  1:23 ` Pengfei Zhang
@ 2026-06-25  1:23 ` Pengfei Zhang
  2 siblings, 0 replies; 4+ messages in thread
From: Pengfei Zhang @ 2026-06-25  1:23 UTC (permalink / raw)
  To: dsahern, idosch
  Cc: davem, edumazet, kuba, pabeni, horms, netdev, linux-kernel,
	chenzhangqi, baohua, Pengfei Zhang, Pengfei Zhang

From: Pengfei Zhang <zhangpengfei16@xiaomi.com>

inet6_dump_fib() saves its progress in cb->args[1] as a positional
index within the current hash chain.  Between batches the RTNL lock
is released, so a concurrent fib6_new_table() can insert a new table
at the chain head, shifting all existing entries.  The saved index
then lands on a different table, causing fib6_dump_table() to set
w->root to the wrong table while w->node still points into the
previous one.  fib6_walk_continue() dereferences w->node->parent
(NULL) and panics:

  BUG: kernel NULL pointer dereference, address: 0000000000000008
  RIP: 0010:fib6_walk_continue+0x6e/0x170
  Call Trace:
   <TASK>
   fib6_dump_table.isra.0+0xc5/0x240
   inet6_dump_fib+0xf6/0x420
   rtnl_dumpit+0x30/0xa0
   netlink_dump+0x15b/0x460
   netlink_recvmsg+0x1d6/0x2a0
   ____sys_recvmsg+0x17a/0x190

Fix by storing tb->tb6_id in cb->args[1] instead of a positional
index.  On resume, skip entries until the id matches; a concurrent
head-insert can never match the saved id, so the walker always
resumes on the correct table.

Fixes: 1b43af5480c3 ("[IPV6]: Increase number of possible routing tables to 2^32")
Signed-off-by: Pengfei Zhang <zhangfeionline@gmail.com>
---
 net/ipv6/ip6_fib.c | 17 ++++++++---------
 1 file changed, 8 insertions(+), 9 deletions(-)

diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index fc95738de..bda492634 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -636,11 +636,11 @@ static int inet6_dump_fib(struct sk_buff *skb, struct netlink_callback *cb)
 	};
 	const struct nlmsghdr *nlh = cb->nlh;
 	struct net *net = sock_net(skb->sk);
-	unsigned int e = 0, s_e;
 	struct hlist_head *head;
 	struct fib6_walker *w;
 	struct fib6_table *tb;
 	unsigned int h, s_h;
+	u32 s_id;
 	int err = 0;
 
 	rcu_read_lock();
@@ -701,23 +701,22 @@ static int inet6_dump_fib(struct sk_buff *skb, struct netlink_callback *cb)
 	}
 
 	s_h = cb->args[0];
-	s_e = cb->args[1];
+	s_id = cb->args[1];
 
-	for (h = s_h; h < FIB6_TABLE_HASHSZ; h++, s_e = 0) {
-		e = 0;
+	for (h = s_h; h < FIB6_TABLE_HASHSZ; h++, s_id = 0) {
 		head = &net->ipv6.fib_table_hash[h];
 		hlist_for_each_entry_rcu(tb, head, tb6_hlist) {
-			if (e < s_e)
-				goto next;
+			if (s_id && tb->tb6_id != s_id)
+				continue;
+			s_id = 0;
+
+			cb->args[1] = tb->tb6_id;
 			err = fib6_dump_table(tb, skb, cb);
 			if (err != 0)
 				goto out;
-next:
-			e++;
 		}
 	}
 out:
-	cb->args[1] = e;
 	cb->args[0] = h;
 
 unlock:
-- 
2.34.1

