From: Pengfei Zhang <zhangfeionline@gmail.com>
To: dsahern@kernel.org, idosch@nvidia.com
Cc: davem@davemloft.net, edumazet@google.com, kuba@kernel.org,
	pabeni@redhat.com, horms@kernel.org, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org, chenzhangqi@xiaomi.com,
	baohua@kernel.org, Pengfei Zhang <zhangpengfei16@xiaomi.com>,
	Pengfei Zhang <zhangfeionline@gmail.com>
Subject: [PATCH] ipv6: fib6: fix NULL deref in fib6_walk_continue() on multi-batch dump
Date: Thu, 25 Jun 2026 01:11:56 +0800
Message-ID: <20260624171156.822055-1-zhangfeionline@gmail.com>

From: Pengfei Zhang <zhangpengfei16@xiaomi.com>

inet6_dump_fib() saves its progress in cb->args[1] as a positional
index within the current hash chain.  Between batches the RTNL lock
is released, so a concurrent fib6_new_table() can insert a new table
at the chain head, shifting all existing entries.  The saved index
then lands on a different table, causing fib6_dump_table() to set
w->root to the wrong table while w->node still points into the
previous one.  When the walk climbs to that table's root,
fib6_walk_continue() dereferences the root's NULL parent pointer
and panics:

  BUG: kernel NULL pointer dereference, address: 0000000000000008
  RIP: 0010:fib6_walk_continue+0x6e/0x170
  Call Trace:
   <TASK>
   fib6_dump_table.isra.0+0xc5/0x240
   inet6_dump_fib+0xf6/0x420
   rtnl_dumpit+0x30/0xa0
   netlink_dump+0x15b/0x460
   netlink_recvmsg+0x1d6/0x2a0
   ____sys_recvmsg+0x17a/0x190

Fix by storing tb->tb6_id in cb->args[1] instead of a positional
index.  On resume, skip entries until the id matches; a concurrent
head-insert can never match the saved id, so the walker always
resumes on the correct table.

Signed-off-by: Pengfei Zhang <zhangfeionline@gmail.com>
---
The same crash was independently reported in a production environment
(kernel 5.15.137, triggered by ovs-vswitchd issuing RTM_GETROUTE):
  https://lkml.iu.edu/hypermail/linux/kernel/2402.3/02068.html

The crash is probabilistic and occurs in fib6_walk_continue() at the
FWS_U state:

  case FWS_U:
      if (fn == w->root)
          return 0;
      pn = rcu_dereference_protected(fn->parent, 1);
      left = rcu_dereference_protected(pn->left, 1);  /* crash here */

The crash dump shows fn->parent is NULL.  At first glance this looks
as if fn were a leaf node whose parent had been freed, but the walker
state shows that fn->fn_flags has RTN_ROOT set: fn is itself the root
node of a routing table, not a child node.  A root node has no parent
by definition, so fn->parent == NULL is correct for that node.

The real question is why fn != w->root despite fn being a root.  The
answer is that w->root and fn belong to *different* tables: w->node
(which became fn during traversal) still references a node from the
table that was being dumped when the batch suspended, while w->root was
silently redirected to a different table on resume.

This misdirection happens because inet6_dump_fib() uses a positional
index to resume across batches.  Consider a hash slot containing two
tables [A(pos=0), B(pos=1)] where B is large enough to require multiple
batches.  On the first batch, B suspends mid-walk and the loop saves:

  cb->args[1] = e;   /* e=1, position of B in the chain */

The RTNL lock is then released.  At this point a concurrent
fib6_new_table() inserts table C at the chain head via
hlist_add_head_rcu(), making the chain [C(pos=0), A(pos=1), B(pos=2)].

On the next batch, inet6_dump_fib() resumes with s_e=1 and iterates:

  s_e = cb->args[1];   /* s_e = 1 */
  hlist_for_each_entry_rcu(tb, head, tb6_hlist) {
      if (e < s_e)     /* e=0: skip C at pos=0 */
          goto next;
      /* e=1: tb now points to A, not B */
      fib6_dump_table(tb, skb, cb);   /* called with wrong table A */
  next:
      e++;
  }
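
The chain shift can be replayed in userspace.  The sketch below is only
an illustration, not kernel code: struct tbl, chain_add_head(),
resume_by_pos(), resume_by_id() and the table ids 1/2/3 are hypothetical
stand-ins for struct fib6_table, hlist_add_head_rcu() and the two resume
schemes.  It assumes A, B and C all hash to the same slot.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct fib6_table: just an id and a chain link. */
struct tbl {
	unsigned int id;
	struct tbl *next;
};

/* Insert at the chain head, as hlist_add_head_rcu() does for a new table. */
static struct tbl *chain_add_head(struct tbl *head, struct tbl *t)
{
	t->next = head;
	return t;
}

/* Old scheme: resume at the saved position within the chain. */
static struct tbl *resume_by_pos(struct tbl *head, unsigned int s_e)
{
	unsigned int e = 0;
	struct tbl *t;

	for (t = head; t; t = t->next, e++)
		if (e >= s_e)
			return t;
	return NULL;
}

/* Patched scheme: resume at the entry whose id matches the saved id. */
static struct tbl *resume_by_id(struct tbl *head, unsigned int s_id)
{
	struct tbl *t;

	for (t = head; t; t = t->next)
		if (!s_id || t->id == s_id)
			return t;
	return NULL;
}

/* Replays [A, B] -> [C, A, B]; returns 1 if both schemes act as described. */
int demo_resume(void)
{
	struct tbl b = { .id = 2, .next = NULL };
	struct tbl a = { .id = 1, .next = &b };
	struct tbl c = { .id = 3, .next = NULL };
	struct tbl *head = &a;

	/* First batch suspended inside B: position 1, id 2. */
	unsigned int saved_pos = 1, saved_id = b.id;

	head = chain_add_head(head, &c);	/* concurrent insert of C */
	assert(head == &c);

	return resume_by_pos(head, saved_pos) == &a &&	/* lands on A */
	       resume_by_id(head, saved_id) == &b;	/* still finds B */
}
```

demo_resume() returns 1: the positional lookup hands back A while the
id lookup still finds B after the head insert.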

Inside fib6_dump_table(), w->root is unconditionally overwritten
before the resume branch is entered:

  w->root = &table->tb6_root;        /* now A's root              */
  /* ... */
  } else {
      int sernum = READ_ONCE(w->root->fn_sernum);  /* A's sernum  */
      if (cb->args[5] != sernum) {
          /* sernum changed: safe reset, w->node = w->root (A)    */
          w->node = w->root;
      } else {
          /* sernum unchanged: w->node untouched, still in B       */
          w->skip = 0;
      }
      fib6_walk_continue(w);   /* sernum equal: w->root=A, w->node=B */
  }

The sernum guard was intended to detect tree modifications and reset
the walk, but here the two tables happen to share the same fn_sernum
value (a global flush had previously unified them), so the guard does
not fire and w->node is left pointing into B's tree.
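
The guard's blind spot can be shown in isolation.  This is a minimal
sketch under simplifying assumptions: struct node, struct walker and
resume_walk() are hypothetical reductions of fib6_node, fib6_walker and
the resume branch of fib6_dump_table(), and the sernum values 7 and 8
are arbitrary.

```c
#include <assert.h>

/* Hypothetical, stripped-down stand-ins for fib6_node and fib6_walker. */
struct node {
	int sernum;
};

struct walker {
	struct node *root;
	struct node *node;
};

/*
 * Mirrors the resume branch: root is overwritten first, then the saved
 * sernum decides whether node is reset.  Returns 1 if the walk was reset.
 */
int resume_walk(struct walker *w, struct node *new_root, int saved_sernum)
{
	w->root = new_root;
	if (saved_sernum != new_root->sernum) {
		w->node = w->root;
		return 1;
	}
	return 0;	/* node left untouched */
}

/* Returns 1 if the guard behaves as described in the text. */
int demo_guard(void)
{
	struct node root_a = { .sernum = 7 }, root_b = { .sernum = 7 };
	struct node in_b = { .sernum = 7 };	/* some node inside B */
	struct walker w = { .root = &root_b, .node = &in_b };
	int saved = root_b.sernum;		/* cb->args[5] at suspend */
	int reset, stale, reset2;

	/* Resume against A with an equal sernum: the guard stays silent. */
	reset = resume_walk(&w, &root_a, saved);
	assert(w.root == &root_a);
	stale = (w.node == &in_b);

	/* Had A carried a different sernum, the walk would restart at A. */
	root_a.sernum = 8;
	reset2 = resume_walk(&w, &root_a, saved);

	return reset == 0 && stale && reset2 == 1 && w.node == &root_a;
}
```

With equal sernums the walker ends up with root in A and node in B;
only a differing sernum forces the safe reset.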

From this point w->root and w->node belong to different tables.  When
fib6_walk_continue() climbs back up and reaches B's root node
(fn->fn_flags & RTN_ROOT), the exit check compares it against A's
root, fails, and the walker dereferences B's NULL parent pointer:

  if (fn == w->root)   /* B's root != A's root, check fails */
      return 0;
  pn = fn->parent;     /* B's root has no parent: pn == NULL */
  left = pn->left;     /* NULL deref -> crash */
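
The upward step can be modelled in a few lines.  This sketch is an
assumption-laden simplification: the real fib6_walk_continue() is a
state machine over left/right children, whereas the hypothetical
climb_to_root() below only models the FWS_U parent walk and returns -1
at the point where the kernel would dereference NULL.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reduction of fib6_node to its parent link. */
struct node {
	struct node *parent;
};

/*
 * Climb from fn until root is reached.  Returns 0 on a normal exit and
 * -1 if a parentless node other than root is hit first.
 */
int climb_to_root(const struct node *fn, const struct node *root)
{
	while (fn != root) {
		if (!fn->parent)
			return -1;	/* kernel: pn->left on pn == NULL */
		fn = fn->parent;
	}
	return 0;
}

/* Returns 1 if matched and mismatched roots behave as described. */
int demo_climb(void)
{
	struct node root_a = { .parent = NULL };
	struct node root_b = { .parent = NULL };
	struct node child_b = { .parent = &root_b };
	int same, mixed;

	assert(child_b.parent == &root_b);

	same = climb_to_root(&child_b, &root_b);	/* root and node in B */
	mixed = climb_to_root(&child_b, &root_a);	/* root in A, node in B */

	return same == 0 && mixed == -1;
}
```

With root and node in the same table the climb stops at the root; with
root pointing at A, the climb runs past B's root and hits the NULL
parent.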

 net/ipv6/ip6_fib.c | 17 ++++++++---------
 1 file changed, 8 insertions(+), 9 deletions(-)

diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index fc95738de..bda492634 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -636,11 +636,11 @@ static int inet6_dump_fib(struct sk_buff *skb, struct netlink_callback *cb)
 	};
 	const struct nlmsghdr *nlh = cb->nlh;
 	struct net *net = sock_net(skb->sk);
-	unsigned int e = 0, s_e;
 	struct hlist_head *head;
 	struct fib6_walker *w;
 	struct fib6_table *tb;
 	unsigned int h, s_h;
+	u32 s_id;
 	int err = 0;
 
 	rcu_read_lock();
@@ -701,23 +701,22 @@ static int inet6_dump_fib(struct sk_buff *skb, struct netlink_callback *cb)
 	}
 
 	s_h = cb->args[0];
-	s_e = cb->args[1];
+	s_id = cb->args[1];
 
-	for (h = s_h; h < FIB6_TABLE_HASHSZ; h++, s_e = 0) {
-		e = 0;
+	for (h = s_h; h < FIB6_TABLE_HASHSZ; h++, s_id = 0) {
 		head = &net->ipv6.fib_table_hash[h];
 		hlist_for_each_entry_rcu(tb, head, tb6_hlist) {
-			if (e < s_e)
-				goto next;
+			if (s_id && tb->tb6_id != s_id)
+				continue;
+			s_id = 0;
+
+			cb->args[1] = tb->tb6_id;
 			err = fib6_dump_table(tb, skb, cb);
 			if (err != 0)
 				goto out;
-next:
-			e++;
 		}
 	}
 out:
-	cb->args[1] = e;
 	cb->args[0] = h;
 
 unlock:
-- 
2.34.1

