Linux kernel -stable discussions
* [PATCH v2 net] net: core: dev: add reprocess depth limit for another_round in __netif_receive_skb_core
@ 2026-05-12  2:21 Yizhou Zhao
  2026-05-14 11:45 ` Paolo Abeni
  2026-05-14 22:11 ` kernel test robot
  0 siblings, 2 replies; 3+ messages in thread
From: Yizhou Zhao @ 2026-05-12  2:21 UTC (permalink / raw)
  To: netdev
  Cc: Yizhou Zhao, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Stanislav Fomichev, Kuniyuki Iwashima,
	Samiullah Khawaja, Hangbin Liu, Krishna Kumar, Yuxiang Yang,
	Xuewei Feng, Qi Li, Ke Xu, stable

In __netif_receive_skb_core(), the another_round label can be reached 
via a TC ingress redirect (bpf_redirect_peer returning -EAGAIN).

Across network namespaces, two BPF programs on peer devices can redirect
packets back and forth indefinitely, creating an unbounded loop that 
monopolizes a CPU core in softirq context. This leads to RCU stalls, 
soft lockups, and system-wide denial of service.

We reproduced it by creating a pair of TC BPF programs across two 
network namespaces that redirect packets to each other, and the RCU 
subsystem detects a stall:

```
[   24.835219] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[   24.835837] rcu: 	(detected by 0, t=21002 jiffies, g=-627, q=2 ncpus=1)
[   24.835959] rcu: All QSes seen, last rcu_preempt kthread activity 21002 (4294691810-4294670808), jiffies_till_next_fqs=3, root ->qsmask 0x0
[   24.836239] rcu: rcu_preempt kthread starved for 21002 jiffies! g-627 f0x2 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=0
[   24.836362] rcu: 	Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior.
[   24.836460] rcu: RCU grace-period kthread stack dump:
[   24.836601] task:rcu_preempt     state:R  running task     stack:15448 pid:15    tgid:15    ppid:2      task_flags:0x208040 flags:0x00080000
[   24.837139] Call Trace:
[   24.837568]  <TASK>
[   24.838008]  __schedule+0x4ed/0xea0
[   24.838934]  schedule+0x22/0xd0
[   24.839023]  schedule_timeout+0x81/0x100
[   24.839095]  ? __pfx_process_timeout+0x10/0x10
[   24.839165]  rcu_gp_fqs_loop+0x11b/0x650
[   24.839226]  ? __pfx_rcu_gp_kthread+0x10/0x10
[   24.839282]  rcu_gp_kthread+0x17e/0x210
[   24.839333]  ? __pfx_rcu_gp_kthread+0x10/0x10
[   24.839383]  kthread+0xdd/0x110
[   24.839433]  ? __pfx_kthread+0x10/0x10
[   24.839481]  ret_from_fork+0x1aa/0x260
[   24.839538]  ? __pfx_kthread+0x10/0x10
[   24.839585]  ret_from_fork_asm+0x1a/0x30
[   24.839686]  </TASK>
......
```

Fix this by adding a depth counter that is incremented each time the flow
is about to jump back to the another_round label. When the counter exceeds
XMIT_RECURSION_LIMIT (8), the packet is dropped. This follows the same
pattern as dev_xmit_recursion(), which protects the TX redirect path with
the same limit.

Reuse SKB_DROP_REASON_TC_RECLASSIFY_LOOP for observability.

Fixes: 9aa1206e8f48 ("bpf: Add redirect_peer helper")
Cc: stable@vger.kernel.org
Reported-by: Yizhou Zhao <zhaoyz24@mails.tsinghua.edu.cn>
Reported-by: Yuxiang Yang <yangyx22@mails.tsinghua.edu.cn>
Reported-by: Xuewei Feng <fengxw06@126.com>
Reported-by: Qi Li <qli01@tsinghua.edu.cn>
Reported-by: Ke Xu <xuke@tsinghua.edu.cn>
Assisted-by: GLM:GLM-5.1
Signed-off-by: Yizhou Zhao <zhaoyz24@mails.tsinghua.edu.cn>
---
Changes in v2:
- Move the check just after `another` is set to true to avoid affecting the fast path
- Reuse SKB_DROP_REASON_TC_RECLASSIFY_LOOP to avoid adding new drop reason
- Link to v1: https://lore.kernel.org/netdev/20260511063005.38134-1-zhaoyz24@mails.tsinghua.edu.cn/
---
 net/core/dev.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 831129f2a..bb9ae92f0 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5958,6 +5958,7 @@ static int __netif_receive_skb_core(struct sk_buff **pskb, bool pfmemalloc,
 	struct net_device *orig_dev;
 	bool deliver_exact = false;
 	int ret = NET_RX_DROP;
+	int redirect_depth = 0;
 	__be16 type;

 	net_timestamp_check(!READ_ONCE(net_hotdata.tstamp_prequeue), skb);
@@ -6031,8 +6032,16 @@ static int __netif_receive_skb_core(struct sk_buff **pskb, bool pfmemalloc,
 		nf_skip_egress(skb, true);
 		skb = sch_handle_ingress(skb, &pt_prev, &ret, orig_dev,
 					 &another);
-		if (another)
+		if (another) {
+			if (unlikely(++redirect_depth > XMIT_RECURSION_LIMIT)) {
+				net_warn_ratelimited(
+					"%s: redirect loop limit reached, dropping (dev=%s)\n",
+					__func__, skb->dev->name);
+				drop_reason = SKB_DROP_REASON_TC_RECLASSIFY_LOOP;
+				goto drop;
+			}
 			goto another_round;
+		}
 		if (!skb)
 			goto out;

--
2.43.0


* Re: [PATCH v2 net] net: core: dev: add reprocess depth limit for another_round in __netif_receive_skb_core
  2026-05-12  2:21 [PATCH v2 net] net: core: dev: add reprocess depth limit for another_round in __netif_receive_skb_core Yizhou Zhao
@ 2026-05-14 11:45 ` Paolo Abeni
  2026-05-14 22:11 ` kernel test robot
  1 sibling, 0 replies; 3+ messages in thread
From: Paolo Abeni @ 2026-05-14 11:45 UTC (permalink / raw)
  To: Yizhou Zhao, netdev
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Simon Horman,
	Stanislav Fomichev, Kuniyuki Iwashima, Samiullah Khawaja,
	Hangbin Liu, Krishna Kumar, Yuxiang Yang, Xuewei Feng, Qi Li,
	Ke Xu, stable

On 5/12/26 4:21 AM, Yizhou Zhao wrote:
> In __netif_receive_skb_core(), the another_round label can be reached 
> via a TC ingress redirect (bpf_redirect_peer returning -EAGAIN).
> 
> Across network namespaces, two BPF programs on peer devices can redirect
> packets back and forth indefinitely, creating an unbounded loop that 
> monopolizes a CPU core in softirq context. This leads to RCU stalls, 
> soft lockups, and system-wide denial of service.
> 
> We reproduced it by creating a pair of TC BPF programs across two 
> network namespaces that redirect packets to each other, and the RCU 
> subsystem detects a stall:
> 
> ```
> [   24.835219] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> [   24.835837] rcu: 	(detected by 0, t=21002 jiffies, g=-627, q=2 ncpus=1)
> [   24.835959] rcu: All QSes seen, last rcu_preempt kthread activity 21002 (4294691810-4294670808), jiffies_till_next_fqs=3, root ->qsmask 0x0
> [   24.836239] rcu: rcu_preempt kthread starved for 21002 jiffies! g-627 f0x2 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=0
> [   24.836362] rcu: 	Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior.
> [   24.836460] rcu: RCU grace-period kthread stack dump:
> [   24.836601] task:rcu_preempt     state:R  running task     stack:15448 pid:15    tgid:15    ppid:2      task_flags:0x208040 flags:0x00080000
> [   24.837139] Call Trace:
> [   24.837568]  <TASK>
> [   24.838008]  __schedule+0x4ed/0xea0
> [   24.838934]  schedule+0x22/0xd0
> [   24.839023]  schedule_timeout+0x81/0x100
> [   24.839095]  ? __pfx_process_timeout+0x10/0x10
> [   24.839165]  rcu_gp_fqs_loop+0x11b/0x650
> [   24.839226]  ? __pfx_rcu_gp_kthread+0x10/0x10
> [   24.839282]  rcu_gp_kthread+0x17e/0x210
> [   24.839333]  ? __pfx_rcu_gp_kthread+0x10/0x10
> [   24.839383]  kthread+0xdd/0x110
> [   24.839433]  ? __pfx_kthread+0x10/0x10
> [   24.839481]  ret_from_fork+0x1aa/0x260
> [   24.839538]  ? __pfx_kthread+0x10/0x10
> [   24.839585]  ret_from_fork_asm+0x1a/0x30
> [   24.839686]  </TASK>
> ......
> ```
> 
> Fix this by adding a depth counter that is incremented each time the flow
> is about to jump back to the another_round label. When the counter exceeds
> XMIT_RECURSION_LIMIT (8), the packet is dropped. This follows the same
> pattern as dev_xmit_recursion(), which protects the TX redirect path with
> the same limit.
> 
> Reuse SKB_DROP_REASON_TC_RECLASSIFY_LOOP for observability.
> 
> Fixes: 9aa1206e8f48 ("bpf: Add redirect_peer helper")
> Cc: stable@vger.kernel.org
> Reported-by: Yizhou Zhao <zhaoyz24@mails.tsinghua.edu.cn>
> Reported-by: Yuxiang Yang <yangyx22@mails.tsinghua.edu.cn>
> Reported-by: Xuewei Feng <fengxw06@126.com>
> Reported-by: Qi Li <qli01@tsinghua.edu.cn>
> Reported-by: Ke Xu <xuke@tsinghua.edu.cn>
> Assisted-by: GLM:GLM-5.1
> Signed-off-by: Yizhou Zhao <zhaoyz24@mails.tsinghua.edu.cn>
> ---
> Changes in v2:
> - Move the check just after `another` is set to true to avoid affecting the fast path
> - Reuse SKB_DROP_REASON_TC_RECLASSIFY_LOOP to avoid adding new drop reason
> - Link to v1: https://lore.kernel.org/netdev/20260511063005.38134-1-zhaoyz24@mails.tsinghua.edu.cn/
> ---
>  net/core/dev.c | 10 +++++++++-
>  1 file changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 831129f2a..bb9ae92f0 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -5958,6 +5958,7 @@ static int __netif_receive_skb_core(struct sk_buff **pskb, bool pfmemalloc,
>  	struct net_device *orig_dev;
>  	bool deliver_exact = false;
>  	int ret = NET_RX_DROP;
> +	int redirect_depth = 0;

As reported by sashiko, the above will cause an unused-variable warning;
it should be protected by an #ifdef CONFIG_NET_INGRESS compiler guard.

Also, please respect the reverse Christmas tree ordering above.

/P



* Re: [PATCH v2 net] net: core: dev: add reprocess depth limit for another_round in __netif_receive_skb_core
  2026-05-12  2:21 [PATCH v2 net] net: core: dev: add reprocess depth limit for another_round in __netif_receive_skb_core Yizhou Zhao
  2026-05-14 11:45 ` Paolo Abeni
@ 2026-05-14 22:11 ` kernel test robot
  1 sibling, 0 replies; 3+ messages in thread
From: kernel test robot @ 2026-05-14 22:11 UTC (permalink / raw)
  To: Yizhou Zhao, netdev
  Cc: oe-kbuild-all, Yizhou Zhao, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Stanislav Fomichev, Kuniyuki Iwashima,
	Samiullah Khawaja, Hangbin Liu, Krishna Kumar, Yuxiang Yang,
	Xuewei Feng, Qi Li, Ke Xu, stable

Hi Yizhou,

kernel test robot noticed the following build warnings:

[auto build test WARNING on net/main]
[also build test WARNING on net-next/main linus/master horms-ipvs/master v7.1-rc3 next-20260508]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Yizhou-Zhao/net-core-dev-add-reprocess-depth-limit-for-another_round-in-__netif_receive_skb_core/20260514-205938
base:   net/main
patch link:    https://lore.kernel.org/r/20260512022127.7818-1-zhaoyz24%40mails.tsinghua.edu.cn
patch subject: [PATCH v2 net] net: core: dev: add reprocess depth limit for another_round in __netif_receive_skb_core
config: openrisc-defconfig (https://download.01.org/0day-ci/archive/20260515/202605150631.QDJOt3V7-lkp@intel.com/config)
compiler: or1k-linux-gcc (GCC) 15.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260515/202605150631.QDJOt3V7-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202605150631.QDJOt3V7-lkp@intel.com/

All warnings (new ones prefixed by >>):

   net/core/dev.c: In function '__netif_receive_skb_core':
>> net/core/dev.c:5982:13: warning: unused variable 'redirect_depth' [-Wunused-variable]
    5982 |         int redirect_depth = 0;
         |             ^~~~~~~~~~~~~~


vim +/redirect_depth +5982 net/core/dev.c

  5971	
  5972	static int __netif_receive_skb_core(struct sk_buff **pskb, bool pfmemalloc,
  5973					    struct packet_type **ppt_prev)
  5974	{
  5975		enum skb_drop_reason drop_reason = SKB_DROP_REASON_UNHANDLED_PROTO;
  5976		struct packet_type *ptype, *pt_prev;
  5977		rx_handler_func_t *rx_handler;
  5978		struct sk_buff *skb = *pskb;
  5979		struct net_device *orig_dev;
  5980		bool deliver_exact = false;
  5981		int ret = NET_RX_DROP;
> 5982		int redirect_depth = 0;
  5983		__be16 type;
  5984	
  5985		net_timestamp_check(!READ_ONCE(net_hotdata.tstamp_prequeue), skb);
  5986	
  5987		trace_netif_receive_skb(skb);
  5988	
  5989		orig_dev = skb->dev;
  5990	
  5991		skb_reset_network_header(skb);
  5992	#if !defined(CONFIG_DEBUG_NET)
  5993		/* We plan to no longer reset the transport header here.
  5994		 * Give some time to fuzzers and dev build to catch bugs
  5995		 * in network stacks.
  5996		 */
  5997		if (!skb_transport_header_was_set(skb))
  5998			skb_reset_transport_header(skb);
  5999	#endif
  6000		skb_reset_mac_len(skb);
  6001	
  6002		pt_prev = NULL;
  6003	
  6004	another_round:
  6005		skb->skb_iif = skb->dev->ifindex;
  6006	
  6007		__this_cpu_inc(softnet_data.processed);
  6008	
  6009		if (static_branch_unlikely(&generic_xdp_needed_key)) {
  6010			int ret2;
  6011	
  6012			migrate_disable();
  6013			ret2 = do_xdp_generic(rcu_dereference(skb->dev->xdp_prog),
  6014					      &skb);
  6015			migrate_enable();
  6016	
  6017			if (ret2 != XDP_PASS) {
  6018				ret = NET_RX_DROP;
  6019				goto out;
  6020			}
  6021		}
  6022	
  6023		if (eth_type_vlan(skb->protocol)) {
  6024			skb = skb_vlan_untag(skb);
  6025			if (unlikely(!skb))
  6026				goto out;
  6027		}
  6028	
  6029		if (skb_skip_tc_classify(skb))
  6030			goto skip_classify;
  6031	
  6032		if (pfmemalloc)
  6033			goto skip_taps;
  6034	
  6035		list_for_each_entry_rcu(ptype, &dev_net_rcu(skb->dev)->ptype_all,
  6036					list) {
  6037			if (unlikely(pt_prev))
  6038				ret = deliver_skb(skb, pt_prev, orig_dev);
  6039			pt_prev = ptype;
  6040		}
  6041	
  6042		list_for_each_entry_rcu(ptype, &skb->dev->ptype_all, list) {
  6043			if (unlikely(pt_prev))
  6044				ret = deliver_skb(skb, pt_prev, orig_dev);
  6045			pt_prev = ptype;
  6046		}
  6047	
  6048	skip_taps:
  6049	#ifdef CONFIG_NET_INGRESS
  6050		if (static_branch_unlikely(&ingress_needed_key)) {
  6051			bool another = false;
  6052	
  6053			nf_skip_egress(skb, true);
  6054			skb = sch_handle_ingress(skb, &pt_prev, &ret, orig_dev,
  6055						 &another);
  6056			if (another) {
  6057				if (unlikely(++redirect_depth > XMIT_RECURSION_LIMIT)) {
  6058					net_warn_ratelimited(
  6059						"%s: redirect loop limit reached, dropping (dev=%s)\n",
  6060						__func__, skb->dev->name);
  6061					drop_reason = SKB_DROP_REASON_TC_RECLASSIFY_LOOP;
  6062					goto drop;
  6063				}
  6064				goto another_round;
  6065			}
  6066			if (!skb)
  6067				goto out;
  6068	
  6069			nf_skip_egress(skb, false);
  6070			if (nf_ingress(skb, &pt_prev, &ret, orig_dev) < 0)
  6071				goto out;
  6072		}
  6073	#endif
  6074		skb_reset_redirect(skb);
  6075	skip_classify:
  6076		if (pfmemalloc && !skb_pfmemalloc_protocol(skb)) {
  6077			drop_reason = SKB_DROP_REASON_PFMEMALLOC;
  6078			goto drop;
  6079		}
  6080	
  6081		if (skb_vlan_tag_present(skb)) {
  6082			if (unlikely(pt_prev)) {
  6083				ret = deliver_skb(skb, pt_prev, orig_dev);
  6084				pt_prev = NULL;
  6085			}
  6086			if (vlan_do_receive(&skb))
  6087				goto another_round;
  6088			else if (unlikely(!skb))
  6089				goto out;
  6090		}
  6091	
  6092		rx_handler = rcu_dereference(skb->dev->rx_handler);
  6093		if (rx_handler) {
  6094			if (unlikely(pt_prev)) {
  6095				ret = deliver_skb(skb, pt_prev, orig_dev);
  6096				pt_prev = NULL;
  6097			}
  6098			switch (rx_handler(&skb)) {
  6099			case RX_HANDLER_CONSUMED:
  6100				ret = NET_RX_SUCCESS;
  6101				goto out;
  6102			case RX_HANDLER_ANOTHER:
  6103				goto another_round;
  6104			case RX_HANDLER_EXACT:
  6105				deliver_exact = true;
  6106				break;
  6107			case RX_HANDLER_PASS:
  6108				break;
  6109			default:
  6110				BUG();
  6111			}
  6112		}
  6113	
  6114		if (unlikely(skb_vlan_tag_present(skb)) && !netdev_uses_dsa(skb->dev)) {
  6115	check_vlan_id:
  6116			if (skb_vlan_tag_get_id(skb)) {
  6117				/* Vlan id is non 0 and vlan_do_receive() above couldn't
  6118				 * find vlan device.
  6119				 */
  6120				skb->pkt_type = PACKET_OTHERHOST;
  6121			} else if (eth_type_vlan(skb->protocol)) {
  6122				/* Outer header is 802.1P with vlan 0, inner header is
  6123				 * 802.1Q or 802.1AD and vlan_do_receive() above could
  6124				 * not find vlan dev for vlan id 0.
  6125				 */
  6126				__vlan_hwaccel_clear_tag(skb);
  6127				skb = skb_vlan_untag(skb);
  6128				if (unlikely(!skb))
  6129					goto out;
  6130				if (vlan_do_receive(&skb))
  6131					/* After stripping off 802.1P header with vlan 0
  6132					 * vlan dev is found for inner header.
  6133					 */
  6134					goto another_round;
  6135				else if (unlikely(!skb))
  6136					goto out;
  6137				else
  6138					/* We have stripped outer 802.1P vlan 0 header.
  6139					 * But could not find vlan dev.
  6140					 * check again for vlan id to set OTHERHOST.
  6141					 */
  6142					goto check_vlan_id;
  6143			}
  6144			/* Note: we might in the future use prio bits
  6145			 * and set skb->priority like in vlan_do_receive()
  6146			 * For the time being, just ignore Priority Code Point
  6147			 */
  6148			__vlan_hwaccel_clear_tag(skb);
  6149		}
  6150	
  6151		type = skb->protocol;
  6152	
  6153		/* deliver only exact match when indicated */
  6154		if (likely(!deliver_exact)) {
  6155			deliver_ptype_list_skb(skb, &pt_prev, orig_dev, type,
  6156					       &ptype_base[ntohs(type) &
  6157							   PTYPE_HASH_MASK]);
  6158	
  6159			/* orig_dev and skb->dev could belong to different netns;
  6160			 * Even in such case we need to traverse only the list
  6161			 * coming from skb->dev, as the ptype owner (packet socket)
  6162			 * will use dev_net(skb->dev) to do namespace filtering.
  6163			 */
  6164			deliver_ptype_list_skb(skb, &pt_prev, orig_dev, type,
  6165					       &dev_net_rcu(skb->dev)->ptype_specific);
  6166		}
  6167	
  6168		deliver_ptype_list_skb(skb, &pt_prev, orig_dev, type,
  6169				       &orig_dev->ptype_specific);
  6170	
  6171		if (unlikely(skb->dev != orig_dev)) {
  6172			deliver_ptype_list_skb(skb, &pt_prev, orig_dev, type,
  6173					       &skb->dev->ptype_specific);
  6174		}
  6175	
  6176		if (pt_prev) {
  6177			*ppt_prev = pt_prev;
  6178		} else {
  6179	drop:
  6180			if (!deliver_exact)
  6181				dev_core_stats_rx_dropped_inc(skb->dev);
  6182			else
  6183				dev_core_stats_rx_nohandler_inc(skb->dev);
  6184	
  6185			kfree_skb_reason(skb, drop_reason);
  6186			/* Jamal, now you will not able to escape explaining
  6187			 * me how you were going to use this. :-)
  6188			 */
  6189			ret = NET_RX_DROP;
  6190		}
  6191	
  6192	out:
  6193		/* The invariant here is that if *ppt_prev is not NULL
  6194		 * then skb should also be non-NULL.
  6195		 *
  6196		 * Apparently *ppt_prev assignment above holds this invariant due to
  6197		 * skb dereferencing near it.
  6198		 */
  6199		*pskb = skb;
  6200		return ret;
  6201	}
  6202	

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

