netdev.vger.kernel.org archive mirror
* Soft lock-ups caused by iptables
@ 2025-11-18 22:17 Hamza Mahfooz
  2025-11-19 14:49 ` Phil Sutter
  0 siblings, 1 reply; 14+ messages in thread
From: Hamza Mahfooz @ 2025-11-18 22:17 UTC (permalink / raw)
  To: netdev
  Cc: Pablo Neira Ayuso, Jozsef Kadlecsik, Florian Westphal,
	Phil Sutter, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, netfilter-devel, coreteam,
	linux-kernel

Hi,

I am able to consistently repro several CPU soft lock-ups that all seem to
end up in either nft_chain_validate(), nft_match_validate(), or
nft_immediate_validate(); see below for examples. Also, this doesn't seem
to be a recent regression since I am able to repro it as far back as
v5.15.184. The repro steps are rather convoluted (involving a config
with ~40k iptables rules and 2 vCPUs), so I am happy to test any
patches. You can find the config I used to build the 6.18 kernel at [1].

[1] https://raw.githubusercontent.com/microsoft/azurelinux/refs/heads/3.0-dev/SPECS/kernel-hwe/config

Trace #1:
 watchdog: BUG: soft lockup - CPU#1 stuck for 27s! [iptables-nft-re:37547]
 Modules linked in: ipt_REJECT nf_reject_ipv4 xt_REDIRECT xt_connmark ip_set_hash_ipport ip_set_hash_net ip_set_list_set xt_statistic xt_nfacct nfnetlink_acct xt_mark xt_MASQUERADE xt_addrtype xt_nat xt_set ip_set_hash_ip ip_set xt_comment tls mptcp_diag xsk_diag vsock_diag tcp_diag udp_diag inet_diag unix_diag af_packet_diag netlink_diag nf_conntrack_netlink nft_masq nft_nat nft_fib_ipv4 nft_fib nft_chain_nat cfg80211 8021q garp mrp binfmt_misc xt_conntrack xt_owner nft_compat nf_tables mlx5_ib ib_uverbs ib_core mlx5_core mousedev intel_rapl_msr hid_generic mlxfw intel_rapl_common hid_hyperv hyperv_fb evdev hid sch_fq_codel dm_multipath nf_nat ebt_ip nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 br_netfilter bridge stp llc fuse drm dmi_sysfs autofs4
 CPU: 1 UID: 0 PID: 37547 Comm: iptables-nft-re Not tainted 6.18.0-rc6+ #1 PREEMPT(none) 
 Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.1 08/23/2024
 RIP: 0010:nft_chain_validate+0xcb/0x110 [nf_tables]
 Code: 10 eb c3 49 8b 07 8b 40 10 49 01 c7 4d 39 fd 74 b5 49 8b 07 48 8b 48 48 48 85 c9 74 e9 4c 89 fe 48 89 df e8 57 35 61 f6 85 c0 <79> d7 5b 41 5c 41 5d 41 5e 41 5f 5d e9 af a8 47 f5 65 48 8b 05 b4
 RSP: 0018:ffffd11885f3b428 EFLAGS: 00000246
 RAX: 0000000000000000 RBX: ffffd11885f3b640 RCX: 0000000000000002
 RDX: 0000000000000002 RSI: 000000000000001f RDI: 0000000000000000
 RBP: ffffd11885f3b450 R08: 0000000000000000 R09: 000000000003abc8
 R10: 000000000000000f R11: 0000000000000007 R12: ffff89bc4c3eb100
 R13: ffff89bc4c3eb1e0 R14: ffff89bc4baf0190 R15: ffff89bc4c3eb1a0
 FS:  00007e16cbfa8080(0000) GS:ffff89bef78f5000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 00007e16cbea2008 CR3: 00000002477e1000 CR4: 0000000000350ef0
 Call Trace:
  <TASK>
  nft_immediate_validate+0x36/0x50 [nf_tables]
  nft_chain_validate+0xc9/0x110 [nf_tables]
  nft_immediate_validate+0x36/0x50 [nf_tables]
  nft_chain_validate+0xc9/0x110 [nf_tables]
  nft_immediate_validate+0x36/0x50 [nf_tables]
  nft_chain_validate+0xc9/0x110 [nf_tables]
  nft_immediate_validate+0x36/0x50 [nf_tables]
  nft_chain_validate+0xc9/0x110 [nf_tables]
  nft_immediate_validate+0x36/0x50 [nf_tables]
  nft_chain_validate+0xc9/0x110 [nf_tables]
  nft_immediate_validate+0x36/0x50 [nf_tables]
  nft_chain_validate+0xc9/0x110 [nf_tables]
  nft_table_validate+0x6b/0xb0 [nf_tables]
  nf_tables_validate+0x8b/0xa0 [nf_tables]
  nf_tables_commit+0x1df/0x1eb0 [nf_tables]
  ? srso_alias_return_thunk+0x5/0xfbef5
  ? kvfree+0x31/0x40
  ? srso_alias_return_thunk+0x5/0xfbef5
  ? nf_tables_newrule+0x9e2/0xc70 [nf_tables]
  nfnetlink_rcv_batch+0x2b0/0x9d0
  ? srso_alias_return_thunk+0x5/0xfbef5
  ? __nla_validate_parse+0x5a/0xcd0
  ? srso_alias_return_thunk+0x5/0xfbef5
  ? apparmor_capable+0xbb/0x1a0
  ? srso_alias_return_thunk+0x5/0xfbef5
  ? security_capable+0x82/0x190
  nfnetlink_rcv+0x120/0x160
  netlink_unicast+0x282/0x3d0
  ? __build_skb+0x52/0x60
  netlink_sendmsg+0x20c/0x440
  ____sys_sendmsg+0x35f/0x390
  ? srso_alias_return_thunk+0x5/0xfbef5
  ___sys_sendmsg+0x85/0xd0
  ? security_capable+0x82/0x190
  ? srso_alias_return_thunk+0x5/0xfbef5
  ? _raw_spin_unlock_bh+0x21/0x30
  ? srso_alias_return_thunk+0x5/0xfbef5
  ? release_sock+0x91/0xb0
  ? srso_alias_return_thunk+0x5/0xfbef5
  ? sk_setsockopt+0x1ac/0x17b0
  ? aa_sk_perm+0x99/0x220
  __sys_sendmsg+0x77/0xe0
  ? audit_reset_context+0x24a/0x320
  __x64_sys_sendmsg+0x21/0x30
  x64_sys_call+0x1b5f/0x20f0
  do_syscall_64+0x72/0x7c0
  ? srso_alias_return_thunk+0x5/0xfbef5
  ? count_memcg_events+0xbd/0x170
  ? srso_alias_return_thunk+0x5/0xfbef5
  ? handle_mm_fault+0xc0/0x300
  ? srso_alias_return_thunk+0x5/0xfbef5
  ? do_user_addr_fault+0x205/0x6c0
  ? srso_alias_return_thunk+0x5/0xfbef5
  ? irqentry_exit+0x3f/0x50
  ? srso_alias_return_thunk+0x5/0xfbef5
  ? exc_page_fault+0x7f/0x170
  entry_SYSCALL_64_after_hwframe+0x76/0x7e
 RIP: 0033:0x7e16cc0d5034
 Code: 15 e9 6d 0d 00 f7 d8 64 89 02 b8 ff ff ff ff eb bf 0f 1f 44 00 00 f3 0f 1e fa 80 3d 15 f0 0d 00 00 74 13 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 4c c3 0f 1f 00 55 48 89 e5 48 83 ec 20 89 55
 RSP: 002b:00007ffd99bb52f8 EFLAGS: 00000202 ORIG_RAX: 000000000000002e
 RAX: ffffffffffffffda RBX: 00007ffd99bb9ea0 RCX: 00007e16cc0d5034
 RDX: 0000000000000000 RSI: 00007ffd99bb63a0 RDI: 0000000000000003
 RBP: 00007ffd99bb69a0 R08: 0000000000000004 R09: 00007e16cc24a440
 R10: 00007ffd99bb638c R11: 0000000000000202 R12: 00000000006a7000
 R13: 00007ffd99bb5300 R14: 0000000000000001 R15: 00007ffd99bb5310
  </TASK>

Trace #2:
 watchdog: BUG: soft lockup - CPU#0 stuck for 26s! [iptables-nft-re:40308]
 Modules linked in: ipt_REJECT nf_reject_ipv4 xt_REDIRECT xt_connmark ip_set_hash_ipport ip_set_hash_net ip_set_list_set xt_statistic xt_nfacct nfnetlink_acct xt_mark xt_MASQUERADE xt_addrtype xt_nat xt_set ip_set_hash_ip ip_set xt_comment tls mptcp_diag xsk_diag vsock_diag tcp_diag udp_diag inet_diag unix_diag af_packet_diag netlink_diag nf_conntrack_netlink nft_masq nft_nat nft_fib_ipv4 nft_fib nft_chain_nat cfg80211 8021q garp mrp binfmt_misc xt_conntrack xt_owner nft_compat nf_tables mlx5_ib ib_uverbs ib_core mlx5_core mousedev intel_rapl_msr hid_generic mlxfw intel_rapl_common hid_hyperv hyperv_fb evdev hid sch_fq_codel dm_multipath nf_nat ebt_ip nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 br_netfilter bridge stp llc fuse drm dmi_sysfs autofs4
 CPU: 0 UID: 0 PID: 40308 Comm: iptables-nft-re Tainted: G             L      6.18.0-rc6+ #1 PREEMPT(none) 
 Tainted: [L]=SOFTLOCKUP
 Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.1 08/23/2024
 RIP: 0010:nft_match_validate+0xa/0xd0 [nft_compat]
 Code: 91 f5 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 55 <48> 89 e5 41 56 41 55 41 54 53 48 8b 06 48 89 fb 4c 8b a8 80 00 00
 RSP: 0018:ffffd11885ef3138 EFLAGS: 00000286
 RAX: ffff89bc4bd909c0 RBX: ffffd11885ef3360 RCX: ffffffffc07f33c0
 RDX: 0000000000000002 RSI: ffff89bc474633a0 RDI: ffffd11885ef3360
 RBP: ffffd11885ef3170 R08: 0000000000000000 R09: 000000000003abc8
 R10: 000000000000000b R11: 0000000000000003 R12: ffff89bc47463300
 R13: ffff89bc474633e0 R14: ffff89bc308e8310 R15: ffff89bc474633a0
 FS:  00007231af246080(0000) GS:ffff89bef77f5000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 00007231af140000 CR3: 000000022fea0000 CR4: 0000000000350ef0
 Call Trace:
  <TASK>
  ? nft_chain_validate+0xc9/0x110 [nf_tables]
  nft_immediate_validate+0x36/0x50 [nf_tables]
  nft_chain_validate+0xc9/0x110 [nf_tables]
  nft_immediate_validate+0x36/0x50 [nf_tables]
  nft_chain_validate+0xc9/0x110 [nf_tables]
  nft_immediate_validate+0x36/0x50 [nf_tables]
  nft_chain_validate+0xc9/0x110 [nf_tables]
  nft_immediate_validate+0x36/0x50 [nf_tables]
  nft_chain_validate+0xc9/0x110 [nf_tables]
  nft_immediate_validate+0x36/0x50 [nf_tables]
  nft_chain_validate+0xc9/0x110 [nf_tables]
  nft_immediate_validate+0x36/0x50 [nf_tables]
  nft_chain_validate+0xc9/0x110 [nf_tables]
  nft_table_validate+0x6b/0xb0 [nf_tables]
  nf_tables_validate+0x8b/0xa0 [nf_tables]
  nf_tables_commit+0x1df/0x1eb0 [nf_tables]
  ? srso_alias_return_thunk+0x5/0xfbef5
  ? kvfree+0x31/0x40
  ? srso_alias_return_thunk+0x5/0xfbef5
  ? nf_tables_newrule+0x9e2/0xc70 [nf_tables]
  nfnetlink_rcv_batch+0x2b0/0x9d0
  ? srso_alias_return_thunk+0x5/0xfbef5
  ? __nla_validate_parse+0x5a/0xcd0
  ? srso_alias_return_thunk+0x5/0xfbef5
  ? apparmor_capable+0xbb/0x1a0
  ? srso_alias_return_thunk+0x5/0xfbef5
  ? security_capable+0x82/0x190
  nfnetlink_rcv+0x120/0x160
  netlink_unicast+0x282/0x3d0
  ? __build_skb+0x52/0x60
  netlink_sendmsg+0x20c/0x440
  ____sys_sendmsg+0x35f/0x390
  ? srso_alias_return_thunk+0x5/0xfbef5
  ___sys_sendmsg+0x85/0xd0
  ? srso_alias_return_thunk+0x5/0xfbef5
  ? apparmor_capable+0xbb/0x1a0
  __sys_sendmsg+0x77/0xe0
  __x64_sys_sendmsg+0x21/0x30
  x64_sys_call+0x1b5f/0x20f0
  do_syscall_64+0x72/0x7c0
  ? srso_alias_return_thunk+0x5/0xfbef5
  ? audit_reset_context+0x24a/0x320
  ? srso_alias_return_thunk+0x5/0xfbef5
  ? __audit_syscall_exit+0xbf/0x100
  ? srso_alias_return_thunk+0x5/0xfbef5
  ? syscall_exit_work+0x120/0x160
  ? srso_alias_return_thunk+0x5/0xfbef5
  ? do_syscall_64+0x1eb/0x7c0
  ? srso_alias_return_thunk+0x5/0xfbef5
  ? apparmor_capable+0xbb/0x1a0
  ? srso_alias_return_thunk+0x5/0xfbef5
  ? security_capable+0x82/0x190
  ? srso_alias_return_thunk+0x5/0xfbef5
  ? _raw_spin_unlock_bh+0x21/0x30
  ? srso_alias_return_thunk+0x5/0xfbef5
  ? release_sock+0x91/0xb0
  ? srso_alias_return_thunk+0x5/0xfbef5
  ? sk_setsockopt+0x1ac/0x17b0
  ? aa_sk_perm+0x99/0x220
  ? _raw_spin_lock_irq+0x1/0x40
  ? srso_alias_return_thunk+0x5/0xfbef5
  ? audit_reset_context+0x24a/0x320
  ? srso_alias_return_thunk+0x5/0xfbef5
  ? __audit_syscall_exit+0xbf/0x100
  ? srso_alias_return_thunk+0x5/0xfbef5
  ? syscall_exit_work+0x120/0x160
  ? srso_alias_return_thunk+0x5/0xfbef5
  ? do_syscall_64+0x1eb/0x7c0
  ? handle_mm_fault+0xc0/0x300
  ? srso_alias_return_thunk+0x5/0xfbef5
  ? do_user_addr_fault+0x205/0x6c0
  ? srso_alias_return_thunk+0x5/0xfbef5
  ? irqentry_exit+0x3f/0x50
  ? srso_alias_return_thunk+0x5/0xfbef5
  ? exc_page_fault+0x7f/0x170
  entry_SYSCALL_64_after_hwframe+0x76/0x7e
 RIP: 0033:0x7231af373034
 Code: 15 e9 6d 0d 00 f7 d8 64 89 02 b8 ff ff ff ff eb bf 0f 1f 44 00 00 f3 0f 1e fa 80 3d 15 f0 0d 00 00 74 13 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 4c c3 0f 1f 00 55 48 89 e5 48 83 ec 20 89 55
 RSP: 002b:00007ffeec7cf3b8 EFLAGS: 00000202 ORIG_RAX: 000000000000002e
 RAX: ffffffffffffffda RBX: 00007ffeec7d3f60 RCX: 00007231af373034
 RDX: 0000000000000000 RSI: 00007ffeec7d0460 RDI: 0000000000000003
 RBP: 00007ffeec7d0a60 R08: 0000000000000004 R09: 00007231af4e8440
 R10: 00007ffeec7d044c R11: 0000000000000202 R12: 00000000006a7000
 R13: 00007ffeec7cf3c0 R14: 0000000000000001 R15: 00007ffeec7cf3d0
  </TASK>

Trace #3:
 watchdog: BUG: soft lockup - CPU#0 stuck for 26s! [iptables-restor:54374]
 Modules linked in: ipt_REJECT nf_reject_ipv4 xt_REDIRECT xt_connmark ip_set_hash_ipport ip_set_hash_net ip_set_list_set xt_statistic xt_nfacct nfnetlink_acct xt_mark xt_MASQUERADE xt_addrtype xt_nat xt_set ip_set_hash_ip ip_set xt_comment tls mptcp_diag xsk_diag vsock_diag tcp_diag udp_diag inet_diag unix_diag af_packet_diag netlink_diag nf_conntrack_netlink nft_masq nft_nat nft_fib_ipv4 nft_fib nft_chain_nat cfg80211 8021q garp mrp binfmt_misc xt_conntrack xt_owner nft_compat nf_tables mlx5_ib ib_uverbs ib_core mlx5_core mousedev intel_rapl_msr hid_generic mlxfw intel_rapl_common hid_hyperv hyperv_fb evdev hid sch_fq_codel dm_multipath nf_nat ebt_ip nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 br_netfilter bridge stp llc fuse drm dmi_sysfs autofs4
 CPU: 0 UID: 0 PID: 54374 Comm: iptables-restor Tainted: G             L      6.18.0-rc6+ #1 PREEMPT(none) 
 Tainted: [L]=SOFTLOCKUP
 Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.1 08/23/2024
 RIP: 0010:nft_immediate_validate+0x4/0x50 [nf_tables]
 Code: b6 56 19 0f b6 f0 48 89 e5 e8 98 34 fe ff 31 c0 5d e9 eb eb 45 f5 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa <0f> 1f 44 00 00 80 7e 18 00 75 0b 8b 46 08 83 c0 04 83 f8 01 76 07
 RSP: 0018:ffffd118889b3260 EFLAGS: 00000286
 RAX: ffffffffc0aac6e0 RBX: ffffd118889b3480 RCX: ffffffffc0ca5530
 RDX: 0000000000000002 RSI: ffff89bc532966f8 RDI: ffffd118889b3480
 RBP: ffffd118889b3290 R08: 0000000000000000 R09: 000000000003abc8
 R10: 000000000000000b R11: 0000000000000003 R12: ffff89bc53296600
 R13: ffff89bc53296718 R14: ffff89bc528ec390 R15: ffff89bc532966f8
 FS:  000073e411a74040(0000) GS:ffff89bef77f5000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 000073e410d42002 CR3: 000000022fe1e000 CR4: 0000000000350ef0
 Call Trace:
  <TASK>
  ? nft_chain_validate+0xc9/0x110 [nf_tables]
  nft_immediate_validate+0x36/0x50 [nf_tables]
  nft_chain_validate+0xc9/0x110 [nf_tables]
  nft_immediate_validate+0x36/0x50 [nf_tables]
  nft_chain_validate+0xc9/0x110 [nf_tables]
  nft_immediate_validate+0x36/0x50 [nf_tables]
  nft_chain_validate+0xc9/0x110 [nf_tables]
  nft_immediate_validate+0x36/0x50 [nf_tables]
  nft_chain_validate+0xc9/0x110 [nf_tables]
  nft_immediate_validate+0x36/0x50 [nf_tables]
  nft_chain_validate+0xc9/0x110 [nf_tables]
  nft_immediate_validate+0x36/0x50 [nf_tables]
  nft_chain_validate+0xc9/0x110 [nf_tables]
  nft_table_validate+0x6b/0xb0 [nf_tables]
  nf_tables_validate+0x8b/0xa0 [nf_tables]
  nf_tables_commit+0x1df/0x1eb0 [nf_tables]
  ? srso_alias_return_thunk+0x5/0xfbef5
  ? kvfree+0x31/0x40
  ? srso_alias_return_thunk+0x5/0xfbef5
  ? nf_tables_newrule+0x9e2/0xc70 [nf_tables]
  nfnetlink_rcv_batch+0x2b0/0x9d0
  ? srso_alias_return_thunk+0x5/0xfbef5
  ? __nla_validate_parse+0x5a/0xcd0
  ? sysvec_hyperv_stimer0+0x93/0xa0
  ? srso_alias_return_thunk+0x5/0xfbef5
  ? srso_alias_return_thunk+0x5/0xfbef5
  ? security_capable+0x82/0x190
  nfnetlink_rcv+0x120/0x160
  netlink_unicast+0x282/0x3d0
  netlink_sendmsg+0x20c/0x440
  ____sys_sendmsg+0x35f/0x390
  ? srso_alias_return_thunk+0x5/0xfbef5
  ___sys_sendmsg+0x85/0xd0
  ? srso_alias_return_thunk+0x5/0xfbef5
  ? audit_reset_context+0x24a/0x320
  ? srso_alias_return_thunk+0x5/0xfbef5
  ? __audit_syscall_exit+0xbf/0x100
  ? srso_alias_return_thunk+0x5/0xfbef5
  ? syscall_exit_work+0x120/0x160
  ? do_syscall_64+0x1eb/0x7c0
  __sys_sendmsg+0x77/0xe0
  ? __entry_text_end+0xdb76/0x101c79
  __x64_sys_sendmsg+0x21/0x30
  x64_sys_call+0x1b5f/0x20f0
  do_syscall_64+0x72/0x7c0
  ? srso_alias_return_thunk+0x5/0xfbef5
  ? __alloc_frozen_pages_noprof+0x165/0x330
  ? srso_alias_return_thunk+0x5/0xfbef5
  ? mod_memcg_lruvec_state+0xc6/0x1d0
  ? srso_alias_return_thunk+0x5/0xfbef5
  ? __lruvec_stat_mod_folio+0x8f/0xe0
  ? srso_alias_return_thunk+0x5/0xfbef5
  ? srso_alias_return_thunk+0x5/0xfbef5
  ? set_ptes.isra.0+0x3b/0x80
  ? srso_alias_return_thunk+0x5/0xfbef5
  ? _raw_spin_unlock+0x12/0x30
  ? srso_alias_return_thunk+0x5/0xfbef5
  ? do_anonymous_page+0x111/0x880
  ? srso_alias_return_thunk+0x5/0xfbef5
  ? ___pte_offset_map+0x20/0x160
  ? srso_alias_return_thunk+0x5/0xfbef5
  ? __handle_mm_fault+0xae2/0xfb0
  ? srso_alias_return_thunk+0x5/0xfbef5
  ? count_memcg_events+0xbd/0x170
  ? srso_alias_return_thunk+0x5/0xfbef5
  ? handle_mm_fault+0xc0/0x300
  ? srso_alias_return_thunk+0x5/0xfbef5
  ? do_user_addr_fault+0x205/0x6c0
  ? srso_alias_return_thunk+0x5/0xfbef5
  ? irqentry_exit+0x3f/0x50
  ? srso_alias_return_thunk+0x5/0xfbef5
  ? exc_page_fault+0x7f/0x170
  entry_SYSCALL_64_after_hwframe+0x76/0x7e
 RIP: 0033:0x73e411b7fbc0
 Code: 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 66 2e 0f 1f 84 00 00 00 00 00 90 80 3d 21 fa 0c 00 00 74 17 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 48 83 ec 28 89 54
 RSP: 002b:00007fffec444358 EFLAGS: 00000202 ORIG_RAX: 000000000000002e
 RAX: ffffffffffffffda RBX: 00007fffec448fb0 RCX: 000073e411b7fbc0
 RDX: 0000000000000000 RSI: 00007fffec445460 RDI: 0000000000000003
 RBP: 00007fffec445ae0 R08: 0000000000000004 R09: 0000000000000000
 R10: 00007fffec44544c R11: 0000000000000202 R12: 0000000004a1d000
 R13: 00007fffec448fd8 R14: 0000000000000007 R15: 00007fffec444360
  </TASK>

* Re: Soft lock-ups caused by iptables
  2025-11-18 22:17 Soft lock-ups caused by iptables Hamza Mahfooz
@ 2025-11-19 14:49 ` Phil Sutter
  2025-11-19 15:58   ` Florian Westphal
  2025-11-19 22:29   ` Hamza Mahfooz
  0 siblings, 2 replies; 14+ messages in thread
From: Phil Sutter @ 2025-11-19 14:49 UTC (permalink / raw)
  To: Hamza Mahfooz
  Cc: netdev, Pablo Neira Ayuso, Jozsef Kadlecsik, Florian Westphal,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, netfilter-devel, coreteam, linux-kernel

Hi,

On Tue, Nov 18, 2025 at 02:17:35PM -0800, Hamza Mahfooz wrote:
> I am able to consistently repro several CPU soft lock-ups that all seem to
> end up in either nft_chain_validate(), nft_match_validate(), or
> nft_immediate_validate(); see below for examples. Also, this doesn't seem
> to be a recent regression since I am able to repro it as far back as
> v5.15.184. The repro steps are rather convoluted (involving a config
> with ~40k iptables rules and 2 vCPUs), so I am happy to test any
> patches. You can find the config I used to build the 6.18 kernel at [1].

Nftables ruleset validation code was refactored in v6.10 with commit
cff3bd012a95 ("netfilter: nf_tables: prefer nft_chain_validate"). This
is also present in v5.15.184, so to judge whether a bug is "new" or
"old", it is better to test genuinely old kernels rather than recent
minor releases of old major ones. :)

Anyway, basically what happens is that nft_chain_validate() iterates
over each rule's expressions calling their 'validate' callback if
present. With nft_immediate, this leads to a recursive call to
nft_chain_validate() if the verdict is a jump/goto call. There is a
recursion limit involved, but chains are potentially revalidated
multiple times to cover all possible flow paths (e.g. with consecutive
rules jumping to the same chain).
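
To illustrate why repeated jumps to the same chain get expensive, here
is a minimal user-space sketch (not kernel code). All names in it
(demo_chain, demo_validate, demo_ladder_visits, DEMO_STACK_LIMIT) are
made up for illustration, and the limit of 16 is only assumed to mirror
NFT_JUMP_STACK_SIZE. It models a ladder of chains in which every chain
holds two rules that both jump to the next chain, and counts how often
a chain body is walked:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_MAX_JUMPS   2
#define DEMO_STACK_LIMIT 16	/* assumption: mirrors NFT_JUMP_STACK_SIZE */

/* Hypothetical stand-in for a chain: one entry per jump/goto rule. */
struct demo_chain {
	const struct demo_chain *jumps[DEMO_MAX_JUMPS];
	size_t njumps;
};

/* Walk a chain and recurse into every jump target, with no caching.
 * Returns 0 on success and -1 once the recursion limit is reached.
 */
static int demo_validate(const struct demo_chain *chain, unsigned int level,
			 unsigned long *visits)
{
	size_t i;

	if (level == DEMO_STACK_LIMIT)
		return -1;

	(*visits)++;
	for (i = 0; i < chain->njumps; i++) {
		if (demo_validate(chain->jumps[i], level + 1, visits) < 0)
			return -1;
	}
	return 0;
}

/* Build a base chain plus 'depth' further chains, where chain i has two
 * rules that both jump to chain i + 1, and count the chain bodies walked.
 */
static unsigned long demo_ladder_visits(unsigned int depth)
{
	struct demo_chain chains[DEMO_STACK_LIMIT];
	unsigned long visits = 0;
	unsigned int i;
	int err;

	assert(depth < DEMO_STACK_LIMIT);

	memset(chains, 0, sizeof(chains));
	for (i = 0; i < depth; i++) {
		chains[i].jumps[0] = &chains[i + 1];
		chains[i].jumps[1] = &chains[i + 1];
		chains[i].njumps = 2;
	}

	err = demo_validate(&chains[0], 0, &visits);
	assert(err == 0);	/* depth is below the limit, so it is never hit */
	(void)err;
	return visits;
}
```

With 'depth' chains below the base chain, the walk covers
2^(depth + 1) - 1 chain bodies, so demo_ladder_visits(10) yields 2047
although only 11 distinct chains exist.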

So, how many --jump/--goto calls does your 40k iptables dump contain? Is
this a (penetration) test or an actual ruleset in use? While it might be
possible to reduce the overhead involved with this chain validation,
maybe you want to consider using ipset (or better, nftables and its
verdict maps) to improve the ruleset in general?

On nftables side, maybe we could annotate chains with a depth value once
validated to skip digging into them again when revisiting from another
jump?

Cheers, Phil

* Re: Soft lock-ups caused by iptables
  2025-11-19 14:49 ` Phil Sutter
@ 2025-11-19 15:58   ` Florian Westphal
  2025-11-19 18:12     ` Phil Sutter
  2025-11-19 22:29   ` Hamza Mahfooz
  1 sibling, 1 reply; 14+ messages in thread
From: Florian Westphal @ 2025-11-19 15:58 UTC (permalink / raw)
  To: Phil Sutter, Hamza Mahfooz, netdev, Pablo Neira Ayuso,
	Jozsef Kadlecsik, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, netfilter-devel, coreteam,
	linux-kernel

Phil Sutter <phil@nwl.cc> wrote:
> On nftables side, maybe we could annotate chains with a depth value once
> validated to skip digging into them again when revisiting from another
> jump?

Yes, but you also need to annotate the type of the last base chain origin,
else you might skip validation of 'chain foo' because its depth value says it's
fine but the new caller is coming from filter, not nat, and chain foo has a
masquerade expression.

* Re: Soft lock-ups caused by iptables
  2025-11-19 15:58   ` Florian Westphal
@ 2025-11-19 18:12     ` Phil Sutter
  2025-11-19 23:10       ` Pablo Neira Ayuso
  0 siblings, 1 reply; 14+ messages in thread
From: Phil Sutter @ 2025-11-19 18:12 UTC (permalink / raw)
  To: Florian Westphal
  Cc: Hamza Mahfooz, netdev, Pablo Neira Ayuso, Jozsef Kadlecsik,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, netfilter-devel, coreteam, linux-kernel

On Wed, Nov 19, 2025 at 04:58:46PM +0100, Florian Westphal wrote:
> Phil Sutter <phil@nwl.cc> wrote:
> > On nftables side, maybe we could annotate chains with a depth value once
> > validated to skip digging into them again when revisiting from another
> > jump?
> 
> Yes, but you also need to annotate the type of the last base chain origin,
> else you might skip validation of 'chain foo' because its depth value says it's
> fine but the new caller is coming from filter, not nat, and chain foo has a
> masquerade expression.

There would need to be masks of valid types and hooks recording the
restrictions imposed on a non-base chain by its rules' expressions.
Maybe this even needs a matrix for cases where some hooks are OK in some
families/types but not others.
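
One way to picture the mask idea is the user-space sketch below. It is
purely illustrative: the enums, struct demo_limits and the helpers are
invented here and are not existing nf_tables definitions, and the
sketch ignores the family/type matrix mentioned above. Each expression
contributes a mask of base chain types and hooks it tolerates, a
non-base chain's restriction is the intersection of those masks, and a
base chain may reach it only if its own type and hook bits survive.
Masquerade is assumed here to be limited to nat/postrouting:

```c
#include <assert.h>

/* Hypothetical stand-ins for base chain types and hooks. */
enum demo_type { DEMO_T_FILTER, DEMO_T_NAT, DEMO_T_ROUTE };
enum demo_hook { DEMO_H_PREROUTING, DEMO_H_INPUT, DEMO_H_FORWARD,
		 DEMO_H_OUTPUT, DEMO_H_POSTROUTING };

#define DEMO_BIT(n)	(1u << (n))
#define DEMO_ANY	(~0u)

/* Restrictions recorded for one expression or one non-base chain. */
struct demo_limits {
	unsigned int types;	/* bitmask of acceptable base chain types */
	unsigned int hooks;	/* bitmask of acceptable hooks */
};

/* Combining restrictions means intersecting both masks. */
static struct demo_limits demo_merge(struct demo_limits a,
				     struct demo_limits b)
{
	a.types &= b.types;
	a.hooks &= b.hooks;
	return a;
}

/* A base chain of the given type and hook may jump to a chain with
 * restrictions 'l' only if both of its bits are still set.
 */
static int demo_base_ok(struct demo_limits l, enum demo_type type,
			enum demo_hook hook)
{
	assert((unsigned int)type < 32u && (unsigned int)hook < 32u);
	return (l.types & DEMO_BIT(type)) && (l.hooks & DEMO_BIT(hook));
}

/* 'chain foo' with one unrestricted expression and one masquerade. */
static struct demo_limits demo_chain_foo(void)
{
	const struct demo_limits unrestricted = { DEMO_ANY, DEMO_ANY };
	const struct demo_limits masq = {
		DEMO_BIT(DEMO_T_NAT), DEMO_BIT(DEMO_H_POSTROUTING)
	};

	return demo_merge(unrestricted, masq);
}
```

In this model a nat/postrouting base chain passes the check for chain
foo, while a filter/input base chain does not, without walking the
rules of chain foo a second time.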

Cheers, Phil

* Re: Soft lock-ups caused by iptables
  2025-11-19 14:49 ` Phil Sutter
  2025-11-19 15:58   ` Florian Westphal
@ 2025-11-19 22:29   ` Hamza Mahfooz
  2025-11-19 23:14     ` Pablo Neira Ayuso
  1 sibling, 1 reply; 14+ messages in thread
From: Hamza Mahfooz @ 2025-11-19 22:29 UTC (permalink / raw)
  To: Phil Sutter
  Cc: Pablo Neira Ayuso, Jozsef Kadlecsik, Florian Westphal,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, netdev, netfilter-devel, coreteam, linux-kernel

On Wed, Nov 19, 2025 at 03:49:57PM +0100, Phil Sutter wrote:
> Nftables ruleset validation code was refactored in v6.10 with commit
> cff3bd012a95 ("netfilter: nf_tables: prefer nft_chain_validate"). This
> is also present in v5.15.184, so to judge whether a bug is "new" or
> "old", it is better to test genuinely old kernels rather than recent
> minor releases of old major ones. :)

FWIW I tried to repro this on v6.6.45 as well and it also suffers from
this issue.

> 
> So, how many --jump/--goto calls does your 40k iptables dump contain? Is
> this a (penetration) test or an actual ruleset in use? While it might be
> possible to reduce the overhead involved with this chain validation,
> maybe you want to consider using ipset (or better, nftables and its
> verdict maps) to improve the ruleset in general?
> 

The vast majority of rules use --jump (see the attached file). My
reproducible setup is a stress test but we have seen something like this
in production as well. Also, I'm not opposed to the idea of trying something
more efficient but I think we should get to the bottom of what's going on
here.

BR,
Hamza

[-- Attachment #2: iptables-save.gz --]
[-- Type: application/gzip, Size: 404920 bytes --]

* Re: Soft lock-ups caused by iptables
  2025-11-19 18:12     ` Phil Sutter
@ 2025-11-19 23:10       ` Pablo Neira Ayuso
  2025-11-20  9:34         ` Florian Westphal
  0 siblings, 1 reply; 14+ messages in thread
From: Pablo Neira Ayuso @ 2025-11-19 23:10 UTC (permalink / raw)
  To: Phil Sutter, Florian Westphal, Hamza Mahfooz, netdev,
	Jozsef Kadlecsik, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, netfilter-devel, coreteam,
	linux-kernel

On Wed, Nov 19, 2025 at 07:12:37PM +0100, Phil Sutter wrote:
> On Wed, Nov 19, 2025 at 04:58:46PM +0100, Florian Westphal wrote:
> > Phil Sutter <phil@nwl.cc> wrote:
> > > On nftables side, maybe we could annotate chains with a depth value once
> > > validated to skip digging into them again when revisiting from another
> > > jump?
> > 
> > > Yes, but you also need to annotate the type of the last base chain origin,
> > > else you might skip validation of 'chain foo' because its depth value says it's
> > > fine but the new caller is coming from filter, not nat, and chain foo has a
> > > masquerade expression.

You could also have chains being called from different levels.

> There would need to be masks of valid types and hooks recording the
> restrictions imposed on a non-base chain by its rules' expressions.
> Maybe this even needs a matrix for cases where some hooks are OK in some
> families/types but not others.

I posted a series to maintain a graph that relates jumps
chain-to-chain, set-to-chain and chain-to-set (both backwards and
forward) to improve validation; I would need to come back to it.

* Re: Soft lock-ups caused by iptables
  2025-11-19 22:29   ` Hamza Mahfooz
@ 2025-11-19 23:14     ` Pablo Neira Ayuso
  0 siblings, 0 replies; 14+ messages in thread
From: Pablo Neira Ayuso @ 2025-11-19 23:14 UTC (permalink / raw)
  To: Hamza Mahfooz
  Cc: Phil Sutter, Jozsef Kadlecsik, Florian Westphal, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman, netdev,
	netfilter-devel, coreteam, linux-kernel

On Wed, Nov 19, 2025 at 02:29:40PM -0800, Hamza Mahfooz wrote:
> On Wed, Nov 19, 2025 at 03:49:57PM +0100, Phil Sutter wrote:
> > Nftables ruleset validation code was refactored in v6.10 with commit
> > cff3bd012a95 ("netfilter: nf_tables: prefer nft_chain_validate"). This
> > is also present in v5.15.184, so to judge whether a bug is "new" or
> > "old", it is better to test genuinely old kernels rather than recent
> > minor releases of old major ones. :)
> 
> FWIW I tried to repro this on v6.6.45 as well and it also suffers from
> this issue.

This example ruleset does not restore, it is missing ipsets.

* Re: Soft lock-ups caused by iptables
  2025-11-19 23:10       ` Pablo Neira Ayuso
@ 2025-11-20  9:34         ` Florian Westphal
  2025-11-20 11:22           ` Phil Sutter
                             ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Florian Westphal @ 2025-11-20  9:34 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: Phil Sutter, Hamza Mahfooz, netdev, Jozsef Kadlecsik,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, netfilter-devel, coreteam, linux-kernel

Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> > > Yes, but you also need to annotate the type of the last base chain origin,
> > > else you might skip validation of 'chain foo' because its depth value says it's
> > > fine but the new caller is coming from filter, not nat, and chain foo has a
> > > masquerade expression.
> 
> You could also have chains being called from different levels.

But that's not an issue.  If you see a jump from c1 to c2, and c2
has been validated for a level of 5, then you need to revalidate
only if c1->depth >= 5.

Do you see any issue with this? (It still lacks annotation for
the calling base chain's type, so this cannot be applied as-is):

netfilter: nf_tables: avoid chain re-validation if possible

Consider:

      input -> j2 -> j3
      input -> j2 -> j3
      input -> j1 -> j2 -> j3

Then the second rule does not need to revalidate j2 and, by extension, j3.

We need to validate it only for rule 3.

This is needed because chain loop detection also ensures we do not
exceed the jump stack: Just because we know that j2 is cycle free, its
last jump might now exceed the allowed stack.  We also need to update
the new largest call depth for all the reachable nodes.

diff --git a/include/net/netfilter/nf_tables.h b/include/net/netfilter/nf_tables.h
--- a/include/net/netfilter/nf_tables.h
+++ b/include/net/netfilter/nf_tables.h
@@ -1109,6 +1109,7 @@ struct nft_rule_blob {
  *	@udlen: user data length
  *	@udata: user data in the chain
  *	@blob_next: rule blob pointer to the next in the chain
+ *	@depth: chain was validated for call level <= depth
  */
 struct nft_chain {
 	struct nft_rule_blob		__rcu *blob_gen_0;
@@ -1128,9 +1129,10 @@ struct nft_chain {
 
 	/* Only used during control plane commit phase: */
 	struct nft_rule_blob		*blob_next;
+	u8				depth;
 };
 
-int nft_chain_validate(const struct nft_ctx *ctx, const struct nft_chain *chain);
+int nft_chain_validate(const struct nft_ctx *ctx, struct nft_chain *chain);
 int nft_setelem_validate(const struct nft_ctx *ctx, struct nft_set *set,
 			 const struct nft_set_iter *iter,
 			 struct nft_elem_priv *elem_priv);
diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
--- a/net/netfilter/nf_tables_api.c
+++ b/net/netfilter/nf_tables_api.c
@@ -4088,15 +4088,26 @@ static void nf_tables_rule_release(const struct nft_ctx *ctx, struct nft_rule *r
  * and set lookups until either the jump limit is hit or all reachable
  * chains have been validated.
  */
-int nft_chain_validate(const struct nft_ctx *ctx, const struct nft_chain *chain)
+int nft_chain_validate(const struct nft_ctx *ctx, struct nft_chain *chain)
 {
 	struct nft_expr *expr, *last;
 	struct nft_rule *rule;
 	int err;
 
+	BUILD_BUG_ON(NFT_JUMP_STACK_SIZE > 255);
 	if (ctx->level == NFT_JUMP_STACK_SIZE)
 		return -EMLINK;
 
+	/* jumps to base chains are not allowed, this is already
+	 * validated by nft_verdict_init().
+	 *
+	 * Chain must be re-validated if we are entering for first
+	 * time or if the current jumpstack usage is higher than on
+	 * previous check.
+	 */
+	if (ctx->level && chain->depth >= ctx->level)
+		return 0;
+
 	list_for_each_entry(rule, &chain->rules, list) {
 		if (fatal_signal_pending(current))
 			return -EINTR;
@@ -4117,6 +4128,10 @@ int nft_chain_validate(const struct nft_ctx *ctx, const struct nft_chain *chain)
 		}
 	}
 
+	/* Chain needs no re-validation if called again
+	 * from a path that doesn't exceed level.
+	 */
+	chain->depth = ctx->level;
 	return 0;
 }
 EXPORT_SYMBOL_GPL(nft_chain_validate);
@@ -4128,7 +4143,7 @@ static int nft_table_validate(struct net *net, const struct nft_table *table)
 		.net	= net,
 		.family	= table->family,
 	};
-	int err;
+	int err = 0;
 
 	list_for_each_entry(chain, &table->chains, list) {
 		if (!nft_is_base_chain(chain))
@@ -4137,12 +4152,16 @@ static int nft_table_validate(struct net *net, const struct nft_table *table)
 		ctx.chain = chain;
 		err = nft_chain_validate(&ctx, chain);
 		if (err < 0)
-			return err;
+			goto err;
 
 		cond_resched();
 	}
 
-	return 0;
+err:
+	list_for_each_entry(chain, &table->chains, list)
+		chain->depth = 0;
+
+	return err;
 }
 
 int nft_setelem_validate(const struct nft_ctx *ctx, struct nft_set *set,
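
As a rough user-space model of the skip logic in the patch above (not
the patch itself), the sketch below replays the three-rule example from
the commit message and counts the chain bodies walked with and without
the depth annotation. The names demo_chain, demo_validate and
demo_example_visits are hypothetical, the limit of 16 is only assumed
to mirror NFT_JUMP_STACK_SIZE, and, like the patch, the model ignores
the type of the calling base chain:

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAX_JUMPS   4
#define DEMO_STACK_LIMIT 16	/* assumption: mirrors NFT_JUMP_STACK_SIZE */

/* Hypothetical stand-in for a chain: one entry per jump rule. */
struct demo_chain {
	struct demo_chain *jumps[DEMO_MAX_JUMPS];
	size_t njumps;
	unsigned char depth;	/* deepest level validated so far, 0 = never */
};

/* Same walk as an uncached validation, plus the skip condition from the
 * patch: a non-base chain already validated at this level or deeper is
 * not walked again.
 */
static int demo_validate(struct demo_chain *chain, unsigned int level,
			 int use_cache, unsigned int *visits)
{
	size_t i;

	if (level == DEMO_STACK_LIMIT)
		return -1;

	if (use_cache && level && chain->depth >= level)
		return 0;

	(*visits)++;
	for (i = 0; i < chain->njumps; i++) {
		if (demo_validate(chain->jumps[i], level + 1,
				  use_cache, visits) < 0)
			return -1;
	}

	if (use_cache)
		chain->depth = (unsigned char)level;
	return 0;
}

/* input -> j2 -> j3, input -> j2 -> j3, input -> j1 -> j2 -> j3 */
static unsigned int demo_example_visits(int use_cache)
{
	struct demo_chain j3 = { { NULL }, 0, 0 };
	struct demo_chain j2 = { { &j3 }, 1, 0 };
	struct demo_chain j1 = { { &j2 }, 1, 0 };
	struct demo_chain input = { { &j2, &j2, &j1 }, 3, 0 };
	unsigned int visits = 0;
	int err;

	err = demo_validate(&input, 0, use_cache, &visits);
	assert(err == 0);	/* the example is far below the limit */
	(void)err;
	return visits;
}
```

demo_example_visits(0) walks 8 chain bodies and demo_example_visits(1)
walks 6: the second jump to j2 is skipped, while the path via j1 is
walked again because it reaches j2 and j3 one level deeper than before.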


* Re: Soft lock-ups caused by iptables
  2025-11-20  9:34         ` Florian Westphal
@ 2025-11-20 11:22           ` Phil Sutter
  2025-11-20 20:38           ` Hamza Mahfooz
  2025-11-20 21:01           ` Pablo Neira Ayuso
  2 siblings, 0 replies; 14+ messages in thread
From: Phil Sutter @ 2025-11-20 11:22 UTC (permalink / raw)
  To: Florian Westphal
  Cc: Pablo Neira Ayuso, Hamza Mahfooz, netdev, Jozsef Kadlecsik,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, netfilter-devel, coreteam, linux-kernel

On Thu, Nov 20, 2025 at 10:34:46AM +0100, Florian Westphal wrote:
> Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> > > > Yes, but you also need to annotate the type of the last base chain origin,
> > > > else you might skip validation of 'chain foo' because its depth value says it's
> > > > fine but the new caller is coming from filter, not nat, and chain foo has a
> > > > masquerade expression.
> > 
> > You could also have chains being called from different levels.
> 
> But that's not an issue.  If you see a jump from c1 to c2, and c2
> has been validated for a level of 5, then you need to revalidate
> only if c1->depth >= 5.
> 
> Do you see any issue with this? (It still lacks annotation for
> the calling base chain's type, so this cannot be applied as-is):

Assuming that we don't allow jumps from one family to another, we may
get by with two bitfields which the validate callbacks fill: one for base
chain types and one for hooks.

The current family would still be validated inside the callback, but
nft_chain_validate_dependency() and nft_chain_validate_hooks() would be
called once (I think) for each base chain after collecting. The callbacks
could also return void and leave the hooks bitmask zeroed to signal
"invalid family".
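
One possible reading of the two-bitfield idea, as a user-space sketch
rather than kernel code. Everything here is hypothetical: the toy_*
names, the bit values, and the choice to narrow the masks with AND are
assumptions made for illustration, not an existing nf_tables interface.

```c
#include <stdbool.h>
#include <stddef.h>

/* Illustrative bit values; these are not the kernel's constants. */
enum { TOY_TYPE_FILTER = 1u << 0, TOY_TYPE_NAT = 1u << 1 };
enum { TOY_HOOK_INPUT = 1u << 0, TOY_HOOK_POSTROUTING = 1u << 1 };

struct toy_masks {
	unsigned int types;	/* base chain types still permitted */
	unsigned int hooks;	/* hooks still permitted; 0 signals "invalid" */
};

/* Hypothetical validate callback: narrow the collected masks to what
 * this expression permits. It returns void; an expression that cannot
 * be used at all would leave the hooks mask zeroed.
 */
static void toy_collect(struct toy_masks *acc, const struct toy_masks *expr)
{
	acc->types &= expr->types;
	acc->hooks &= expr->hooks;
}

/* Check a base chain once, after all reachable expressions were collected. */
static bool toy_base_ok(const struct toy_masks *acc,
			unsigned int type, unsigned int hook)
{
	return (acc->types & type) && (acc->hooks & hook);
}

/* Demo ruleset: one unrestricted and one masquerade-like expression. */
static bool toy_check(unsigned int type, unsigned int hook)
{
	static const struct toy_masks exprs[] = {
		{ .types = ~0u, .hooks = ~0u },
		{ .types = TOY_TYPE_NAT, .hooks = TOY_HOOK_POSTROUTING },
	};
	struct toy_masks acc = { .types = ~0u, .hooks = ~0u };

	for (size_t i = 0; i < sizeof(exprs) / sizeof(exprs[0]); i++)
		toy_collect(&acc, &exprs[i]);

	return toy_base_ok(&acc, type, hook);
}
```

With this model, toy_check(TOY_TYPE_NAT, TOY_HOOK_POSTROUTING) succeeds,
while a filter/input base chain reaching the same expressions is
rejected by a single mask test instead of a second walk.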

> netfilter: nf_tables: avoid chain re-validation if possible

Thanks, Phil


* Re: Soft lock-ups caused by iptables
  2025-11-20  9:34         ` Florian Westphal
  2025-11-20 11:22           ` Phil Sutter
@ 2025-11-20 20:38           ` Hamza Mahfooz
  2025-11-20 20:46             ` Florian Westphal
  2025-11-20 21:07             ` Pablo Neira Ayuso
  2025-11-20 21:01           ` Pablo Neira Ayuso
  2 siblings, 2 replies; 14+ messages in thread
From: Hamza Mahfooz @ 2025-11-20 20:38 UTC (permalink / raw)
  To: Florian Westphal
  Cc: Pablo Neira Ayuso, Phil Sutter, netdev, Jozsef Kadlecsik,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, netfilter-devel, coreteam, linux-kernel

On Thu, Nov 20, 2025 at 10:34:46AM +0100, Florian Westphal wrote:
> netfilter: nf_tables: avoid chain re-validation if possible
> 
> Consider:
> 
>       input -> j2 -> j3
>       input -> j2 -> j3
>       input -> j1 -> j2 -> j3
> 
> Then the second rule does not need to revalidate j2 and, by extension, j3.
> 
> We need to revalidate j2 only for rule 3, where it is entered one level deeper.
> 
> Revalidation is needed there because chain loop detection also ensures we
> do not exceed the jump stack: even though we know that j2 is cycle-free, its
> last jump might now exceed the allowed stack.  We also need to update
> the new largest call depth for all the reachable nodes.
> 
> diff --git a/include/net/netfilter/nf_tables.h b/include/net/netfilter/nf_tables.h
> --- a/include/net/netfilter/nf_tables.h
> +++ b/include/net/netfilter/nf_tables.h
> @@ -1109,6 +1109,7 @@ struct nft_rule_blob {
>   *	@udlen: user data length
>   *	@udata: user data in the chain
>   *	@blob_next: rule blob pointer to the next in the chain
> + *	@depth: chain was validated for call level <= depth
>   */
>  struct nft_chain {
>  	struct nft_rule_blob		__rcu *blob_gen_0;
> @@ -1128,9 +1129,10 @@ struct nft_chain {
>  
>  	/* Only used during control plane commit phase: */
>  	struct nft_rule_blob		*blob_next;
> +	u8				depth;
>  };
>  
> -int nft_chain_validate(const struct nft_ctx *ctx, const struct nft_chain *chain);
> +int nft_chain_validate(const struct nft_ctx *ctx, struct nft_chain *chain);
>  int nft_setelem_validate(const struct nft_ctx *ctx, struct nft_set *set,
>  			 const struct nft_set_iter *iter,
>  			 struct nft_elem_priv *elem_priv);
> diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
> --- a/net/netfilter/nf_tables_api.c
> +++ b/net/netfilter/nf_tables_api.c
> @@ -4088,15 +4088,26 @@ static void nf_tables_rule_release(const struct nft_ctx *ctx, struct nft_rule *r
>   * and set lookups until either the jump limit is hit or all reachable
>   * chains have been validated.
>   */
> -int nft_chain_validate(const struct nft_ctx *ctx, const struct nft_chain *chain)
> +int nft_chain_validate(const struct nft_ctx *ctx, struct nft_chain *chain)
>  {
>  	struct nft_expr *expr, *last;
>  	struct nft_rule *rule;
>  	int err;
>  
> +	BUILD_BUG_ON(NFT_JUMP_STACK_SIZE > 255);
>  	if (ctx->level == NFT_JUMP_STACK_SIZE)
>  		return -EMLINK;
>  
> +	/* jumps to base chains are not allowed, this is already
> +	 * validated by nft_verdict_init().
> +	 *
> +	 * Chain must be re-validated if we are entering for first
> +	 * time or if the current jumpstack usage is higher than on
> +	 * previous check.
> +	 */
> +	if (ctx->level && chain->depth >= ctx->level)
> +		return 0;
> +
>  	list_for_each_entry(rule, &chain->rules, list) {
>  		if (fatal_signal_pending(current))
>  			return -EINTR;
> @@ -4117,6 +4128,10 @@ int nft_chain_validate(const struct nft_ctx *ctx, const struct nft_chain *chain)
>  		}
>  	}
>  
> +	/* Chain needs no re-validation if called again
> +	 * from a path that doesn't exceed level.
> +	 */
> +	chain->depth = ctx->level;
>  	return 0;
>  }
>  EXPORT_SYMBOL_GPL(nft_chain_validate);
> @@ -4128,7 +4143,7 @@ static int nft_table_validate(struct net *net, const struct nft_table *table)
>  		.net	= net,
>  		.family	= table->family,
>  	};
> -	int err;
> +	int err = 0;
>  
>  	list_for_each_entry(chain, &table->chains, list) {
>  		if (!nft_is_base_chain(chain))
> @@ -4137,12 +4152,16 @@ static int nft_table_validate(struct net *net, const struct nft_table *table)
>  		ctx.chain = chain;
>  		err = nft_chain_validate(&ctx, chain);
>  		if (err < 0)
> -			return err;
> +			goto err;
>  
>  		cond_resched();
>  	}
>  
> -	return 0;
> +err:
> +	list_for_each_entry(chain, &table->chains, list)
> +		chain->depth = 0;
> +
> +	return err;
>  }
>  
>  int nft_setelem_validate(const struct nft_ctx *ctx, struct nft_set *set,

FWIW This patch seems to resolve the issue, assuming you intended to
include the following:

diff --git a/include/net/netfilter/nf_tables.h b/include/net/netfilter/nf_tables.h
index fab7dc73f738..e5f7a3b1d946 100644
--- a/include/net/netfilter/nf_tables.h
+++ b/include/net/netfilter/nf_tables.h
@@ -1130,7 +1130,7 @@ struct nft_chain {
	struct nft_rule_blob		*blob_next;
 };
 
-int nft_chain_validate(const struct nft_ctx *ctx, const struct nft_chain *chain);
+int nft_chain_validate(const struct nft_ctx *ctx, struct nft_chain *chain);
 int nft_setelem_validate(const struct nft_ctx *ctx, struct nft_set *set,
 			  const struct nft_set_iter *iter,
 			  struct nft_elem_priv *elem_priv);
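
For illustration, the effect of the depth check in the patch quoted
above can be reproduced in a small user-space model. This is a sketch
under assumptions, not kernel code: toy_chain, toy_validate(),
toy_count_visits() and the layered test graph are all hypothetical, and
TOY_STACK_LIMIT merely stands in for NFT_JUMP_STACK_SIZE. Each layer
holds two chains, and every chain jumps to both chains of the next
layer, so the number of distinct paths doubles per layer.

```c
#include <stddef.h>

#define TOY_STACK_LIMIT	16	/* stands in for NFT_JUMP_STACK_SIZE */
#define TOY_MAX_LAYERS	15	/* deepest layer sits at level 15 < limit */

struct toy_chain {
	struct toy_chain *jumps[2];	/* jump targets; NULL means unused */
	unsigned char depth;		/* deepest call level validated so far */
};

static unsigned long toy_visits;	/* number of full chain walks */

/* Returns 0 on success, -1 if the jump stack limit is reached. */
static int toy_validate(struct toy_chain *c, unsigned int level, int memo)
{
	if (level == TOY_STACK_LIMIT)
		return -1;

	/* Same skip condition as in the patch: already validated for a
	 * call level at least as deep as the current one.
	 */
	if (memo && level && c->depth >= level)
		return 0;

	toy_visits++;
	for (size_t i = 0; i < 2; i++) {
		if (c->jumps[i] &&
		    toy_validate(c->jumps[i], level + 1, memo) < 0)
			return -1;
	}

	c->depth = level;
	return 0;
}

/* Build the layered graph and return how many chain walks were needed. */
static long toy_count_visits(int layers, int memo)
{
	static struct toy_chain base, layer[TOY_MAX_LAYERS][2];

	if (layers < 1 || layers > TOY_MAX_LAYERS)
		return -1;

	base = (struct toy_chain){ .jumps = { &layer[0][0], &layer[0][1] } };
	for (int k = 0; k < layers; k++) {
		for (int s = 0; s < 2; s++) {
			layer[k][s] = (struct toy_chain){ 0 };
			if (k + 1 < layers) {
				layer[k][s].jumps[0] = &layer[k + 1][0];
				layer[k][s].jumps[1] = &layer[k + 1][1];
			}
		}
	}

	toy_visits = 0;
	if (toy_validate(&base, 0, memo) < 0)
		return -1;
	return (long)toy_visits;
}
```

In this model, toy_count_visits(10, 0) walks 2047 chains, while
toy_count_visits(10, 1) walks 21: the base chain plus each of the 20
regular chains exactly once. The model ignores base chain types
entirely, so it says nothing about whether skipping is safe when
several base chains of different types reach the same chain.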


* Re: Soft lock-ups caused by iptables
  2025-11-20 20:38           ` Hamza Mahfooz
@ 2025-11-20 20:46             ` Florian Westphal
  2025-11-20 21:07             ` Pablo Neira Ayuso
  1 sibling, 0 replies; 14+ messages in thread
From: Florian Westphal @ 2025-11-20 20:46 UTC (permalink / raw)
  To: Hamza Mahfooz
  Cc: Pablo Neira Ayuso, Phil Sutter, netdev, Jozsef Kadlecsik,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, netfilter-devel, coreteam, linux-kernel

Hamza Mahfooz <hamzamahfooz@linux.microsoft.com> wrote:
> On Thu, Nov 20, 2025 at 10:34:46AM +0100, Florian Westphal wrote:
> > [...]
> 
> FWIW This patch seems to resolve the issue, assuming you intended to
> include the following:

Thanks for testing.  I will try to make this work universally next week
(this needs more work to keep a bitmask of the base hook types for
which we already validated this).  And we likely need to improve the
existing test coverage: the above patch ought to fail the tests we have.


* Re: Soft lock-ups caused by iptables
  2025-11-20  9:34         ` Florian Westphal
  2025-11-20 11:22           ` Phil Sutter
  2025-11-20 20:38           ` Hamza Mahfooz
@ 2025-11-20 21:01           ` Pablo Neira Ayuso
  2 siblings, 0 replies; 14+ messages in thread
From: Pablo Neira Ayuso @ 2025-11-20 21:01 UTC (permalink / raw)
  To: Florian Westphal
  Cc: Phil Sutter, Hamza Mahfooz, netdev, Jozsef Kadlecsik,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, netfilter-devel, coreteam, linux-kernel

On Thu, Nov 20, 2025 at 10:34:46AM +0100, Florian Westphal wrote:
> Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> > > > Yes, but you also need to annotate the type of the last base chain origin,
> > > > else you might skip validation of 'chain foo' because its depth value says it's
> > > > fine, but the new caller is coming from filter, not nat, and chain foo has a
> > > > masquerade expression.
> > 
> > You could also have chains being called from different levels.
> 
> But that's not an issue.  If you see a jump from c1 to c2, and c2
> has been validated for a level of 5, then you need to revalidate
> only if c1->depth >= 5.

OK, but you could also have jumps to the same chain from both filter and
nat base chains. Does the optimization in your patch work in that case too?

Validation is twofold:

- Search for cycles.
- Ensure each expression can be called from the base chains that can reach it.
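
To make the second point concrete, here is a minimal user-space sketch
of the case in question. It is not kernel code: the toy_* names and
error values are hypothetical, and keeping one depth slot per base chain
type is only one way to key the cache, chosen here because it is easy to
reason about. Chain 'foo' accepts only nat (think of a masquerade
expression) and is reached first from a nat base chain, then from a
filter base chain.

```c
#include <stdbool.h>

#define TOY_STACK_LIMIT	16	/* stands in for NFT_JUMP_STACK_SIZE */
#define TOY_ERR_DEPTH	(-1)	/* stands in for -EMLINK */
#define TOY_ERR_TYPE	(-2)	/* expression not valid for this base type */

enum toy_type { TOY_FILTER, TOY_NAT, TOY_NTYPES };

struct toy_chain {
	struct toy_chain *jump;		/* single jump target or NULL */
	unsigned int allowed_types;	/* bit per toy_type the chain accepts */
	unsigned char depth[TOY_NTYPES]; /* deepest validated level per type */
};

static int toy_validate(struct toy_chain *c, enum toy_type base,
			unsigned int level, bool per_type)
{
	/* Depth-only mode shares slot 0 between all base chain types. */
	unsigned char *depth = &c->depth[per_type ? base : 0];
	int err;

	if (level == TOY_STACK_LIMIT)
		return TOY_ERR_DEPTH;
	if (level && *depth >= level)
		return 0;	/* cache hit: the chain is not walked again */

	if (!(c->allowed_types & (1u << base)))
		return TOY_ERR_TYPE;

	if (c->jump) {
		err = toy_validate(c->jump, base, level + 1, per_type);
		if (err < 0)
			return err;
	}

	*depth = level;
	return 0;
}

/* Validate the nat base chain first, then return the filter result. */
static int toy_demo(bool per_type)
{
	const unsigned int all = (1u << TOY_FILTER) | (1u << TOY_NAT);
	struct toy_chain foo = { .allowed_types = 1u << TOY_NAT };
	struct toy_chain nat_base = { .jump = &foo, .allowed_types = all };
	struct toy_chain filter_base = { .jump = &foo, .allowed_types = all };

	if (toy_validate(&nat_base, TOY_NAT, 0, per_type) < 0)
		return TOY_ERR_DEPTH;
	return toy_validate(&filter_base, TOY_FILTER, 0, per_type);
}
```

In this model toy_demo(false) returns 0: the depth-only cache accepts
the filter base chain because 'foo' was already validated at level 1 via
nat. toy_demo(true) returns TOY_ERR_TYPE, since the cache entry for the
filter type is still empty and 'foo' is walked again.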

> Do you see any issue with this? (it still lacks annotation for
> the calling basechains type, so this cannot be applied as-is):
> 
> [...]


* Re: Soft lock-ups caused by iptables
  2025-11-20 20:38           ` Hamza Mahfooz
  2025-11-20 20:46             ` Florian Westphal
@ 2025-11-20 21:07             ` Pablo Neira Ayuso
  2025-11-21 20:59               ` Hamza Mahfooz
  1 sibling, 1 reply; 14+ messages in thread
From: Pablo Neira Ayuso @ 2025-11-20 21:07 UTC (permalink / raw)
  To: Hamza Mahfooz
  Cc: Florian Westphal, Phil Sutter, netdev, Jozsef Kadlecsik,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, netfilter-devel, coreteam, linux-kernel

Hi,

On Thu, Nov 20, 2025 at 12:38:36PM -0800, Hamza Mahfooz wrote:
[...]
> FWIW This patch seems to resolve the issue, assuming you intended to
> include the following:

Could you also give a try to this small patch:

https://lore.kernel.org/netfilter-devel/aR27zHy5Mp4x-rrL@strlen.de/T/#mc6b8e6b02a4a46a62f443912d8122c8529df0c88
https://patchwork.ozlabs.org/project/netfilter-devel/patch/20251119124205.124376-1-pablo@netfilter.org/
(patchwork.ozlabs.org is a bit slow today)


* Re: Soft lock-ups caused by iptables
  2025-11-20 21:07             ` Pablo Neira Ayuso
@ 2025-11-21 20:59               ` Hamza Mahfooz
  0 siblings, 0 replies; 14+ messages in thread
From: Hamza Mahfooz @ 2025-11-21 20:59 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: Florian Westphal, Phil Sutter, netdev, Jozsef Kadlecsik,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, netfilter-devel, coreteam, linux-kernel

On Thu, Nov 20, 2025 at 10:07:30PM +0100, Pablo Neira Ayuso wrote:
> Could you also give a try to this small patch:
> 
> https://lore.kernel.org/netfilter-devel/aR27zHy5Mp4x-rrL@strlen.de/T/#mc6b8e6b02a4a46a62f443912d8122c8529df0c88
> https://patchwork.ozlabs.org/project/netfilter-devel/patch/20251119124205.124376-1-pablo@netfilter.org/
> (patchwork.ozlabs.org is a bit slow today)

The issue is still reproducible with that patch applied (the stack trace
doesn't seem to be any different either).


Thread overview: 14+ messages
2025-11-18 22:17 Soft lock-ups caused by iptables Hamza Mahfooz
2025-11-19 14:49 ` Phil Sutter
2025-11-19 15:58   ` Florian Westphal
2025-11-19 18:12     ` Phil Sutter
2025-11-19 23:10       ` Pablo Neira Ayuso
2025-11-20  9:34         ` Florian Westphal
2025-11-20 11:22           ` Phil Sutter
2025-11-20 20:38           ` Hamza Mahfooz
2025-11-20 20:46             ` Florian Westphal
2025-11-20 21:07             ` Pablo Neira Ayuso
2025-11-21 20:59               ` Hamza Mahfooz
2025-11-20 21:01           ` Pablo Neira Ayuso
2025-11-19 22:29   ` Hamza Mahfooz
2025-11-19 23:14     ` Pablo Neira Ayuso
