[syzbot] [kernel?] BUG: sleeping function called from invalid context in static_key_slow

All of lore.kernel.org
 help / color / mirror / Atom feed

* [syzbot] [kernel?] BUG: sleeping function called from invalid context in static_key_slow_dec
@ 2024-11-27 21:22 syzbot
  2024-11-29 10:47 ` Florian Westphal
  2024-12-07 11:14 ` [PATCH nf] netfilter: nf_tables: do not defer rule destruction via call_rcu Florian Westphal
  0 siblings, 2 replies; 6+ messages in thread
From: syzbot @ 2024-11-27 21:22 UTC (permalink / raw)
  To: linux-kernel, netdev, peterz, syzkaller-bugs, tglx

Hello,

syzbot found the following issue on:

HEAD commit:    5b366eae7193 stmmac: dwmac-intel-plat: fix call balance of..
git tree:       net
console output: https://syzkaller.appspot.com/x/log.txt?x=14c0ab5f980000
kernel config:  https://syzkaller.appspot.com/x/.config?x=64aa0d9945bd5c1
dashboard link: https://syzkaller.appspot.com/bug?extid=b26935466701e56cfdc2
compiler:       Debian clang version 15.0.6, GNU ld (GNU Binutils for Debian) 2.40

Unfortunately, I don't have any reproducer for this issue yet.

Downloadable assets:
disk image: https://storage.googleapis.com/syzbot-assets/d36e3f2a031f/disk-5b366eae.raw.xz
vmlinux: https://storage.googleapis.com/syzbot-assets/baad85a1bbb6/vmlinux-5b366eae.xz
kernel image: https://storage.googleapis.com/syzbot-assets/4ac0e6e4acdf/bzImage-5b366eae.xz

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+b26935466701e56cfdc2@syzkaller.appspotmail.com

BUG: sleeping function called from invalid context at include/linux/percpu-rwsem.h:49
in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 16, name: ksoftirqd/0
preempt_count: 100, expected: 0
RCU nest depth: 0, expected: 0
1 lock held by ksoftirqd/0/16:
 #0: ffffffff8e937e60 (rcu_callback){....}-{0:0}, at: rcu_lock_acquire include/linux/rcupdate.h:337 [inline]
 #0: ffffffff8e937e60 (rcu_callback){....}-{0:0}, at: rcu_do_batch kernel/rcu/tree.c:2561 [inline]
 #0: ffffffff8e937e60 (rcu_callback){....}-{0:0}, at: rcu_core+0xa37/0x17a0 kernel/rcu/tree.c:2823
Preemption disabled at:
[<ffffffff81578192>] softirq_handle_begin kernel/softirq.c:395 [inline]
[<ffffffff81578192>] handle_softirqs+0x122/0x980 kernel/softirq.c:530
CPU: 0 UID: 0 PID: 16 Comm: ksoftirqd/0 Not tainted 6.12.0-rc6-syzkaller-00203-g5b366eae7193 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/30/2024
Call Trace:
 <TASK>
 __dump_stack lib/dump_stack.c:94 [inline]
 dump_stack_lvl+0x241/0x360 lib/dump_stack.c:120
 __might_resched+0x5d4/0x780 kernel/sched/core.c:8653
 percpu_down_read include/linux/percpu-rwsem.h:49 [inline]
 cpus_read_lock+0x1b/0x150 kernel/cpu.c:490
 __static_key_slow_dec kernel/jump_label.c:320 [inline]
 static_key_slow_dec+0x49/0xa0 kernel/jump_label.c:336
 nf_tables_chain_destroy+0x3c4/0x4f0 net/netfilter/nf_tables_api.c:2160
 __nft_release_basechain_now net/netfilter/nf_tables_api.c:11442 [inline]
 nft_release_basechain_rcu+0x3fc/0x550 net/netfilter/nf_tables_api.c:11454
 rcu_do_batch kernel/rcu/tree.c:2567 [inline]
 rcu_core+0xaaa/0x17a0 kernel/rcu/tree.c:2823
 handle_softirqs+0x2c5/0x980 kernel/softirq.c:554
 run_ksoftirqd+0xca/0x130 kernel/softirq.c:927
 smpboot_thread_fn+0x544/0xa30 kernel/smpboot.c:164
 kthread+0x2f0/0x390 kernel/kthread.c:389
 ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147
 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244
 </TASK>

================================
WARNING: inconsistent lock state
6.12.0-rc6-syzkaller-00203-g5b366eae7193 #0 Tainted: G        W         
--------------------------------
inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-R} usage.
ksoftirqd/0/16 [HC0[0]:SC1[1]:HE1:SE0] takes:
ffffffff8e7d1d90 (cpu_hotplug_lock){+++?}-{0:0}, at: __static_key_slow_dec kernel/jump_label.c:320 [inline]
ffffffff8e7d1d90 (cpu_hotplug_lock){+++?}-{0:0}, at: static_key_slow_dec+0x49/0xa0 kernel/jump_label.c:336
{SOFTIRQ-ON-W} state was registered at:
  lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5825
  percpu_down_write+0x54/0x310 kernel/locking/percpu-rwsem.c:229
  cpus_write_lock kernel/cpu.c:508 [inline]
  _cpu_up+0x76/0x580 kernel/cpu.c:1638
  cpu_up+0x184/0x230 kernel/cpu.c:1722
  cpuhp_bringup_mask+0xdf/0x260 kernel/cpu.c:1788
  cpuhp_bringup_cpus_parallel+0xaf/0x160 kernel/cpu.c:1866
  bringup_nonboot_cpus+0x2b/0x50 kernel/cpu.c:1892
  smp_init+0x34/0x150 kernel/smp.c:1009
  kernel_init_freeable+0x417/0x5d0 init/main.c:1572
  kernel_init+0x1d/0x2b0 init/main.c:1469
  ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147
  ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244
irq event stamp: 1161634
hardirqs last  enabled at (1161634): [<ffffffff8bc76793>] irqentry_exit+0x63/0x90 kernel/entry/common.c:357
hardirqs last disabled at (1161633): [<ffffffff8bc7433e>] sysvec_apic_timer_interrupt+0xe/0xc0 arch/x86/kernel/apic/apic.c:1049
softirqs last  enabled at (1161438): [<ffffffff8157b03a>] run_ksoftirqd+0xca/0x130 kernel/softirq.c:927
softirqs last disabled at (1161447): [<ffffffff8157b03a>] run_ksoftirqd+0xca/0x130 kernel/softirq.c:927

other info that might help us debug this:
 Possible unsafe locking scenario:

       CPU0
       ----
  lock(cpu_hotplug_lock);
  <Interrupt>
    lock(cpu_hotplug_lock);

 *** DEADLOCK ***

1 lock held by ksoftirqd/0/16:
 #0: ffffffff8e937e60 (rcu_callback){....}-{0:0}, at: rcu_lock_acquire include/linux/rcupdate.h:337 [inline]
 #0: ffffffff8e937e60 (rcu_callback){....}-{0:0}, at: rcu_do_batch kernel/rcu/tree.c:2561 [inline]
 #0: ffffffff8e937e60 (rcu_callback){....}-{0:0}, at: rcu_core+0xa37/0x17a0 kernel/rcu/tree.c:2823

stack backtrace:
CPU: 0 UID: 0 PID: 16 Comm: ksoftirqd/0 Tainted: G        W          6.12.0-rc6-syzkaller-00203-g5b366eae7193 #0
Tainted: [W]=WARN
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/30/2024
Call Trace:
 <TASK>
 __dump_stack lib/dump_stack.c:94 [inline]
 dump_stack_lvl+0x241/0x360 lib/dump_stack.c:120
 print_usage_bug+0x62e/0x8b0 kernel/locking/lockdep.c:4038
 valid_state+0x13a/0x1c0 kernel/locking/lockdep.c:4052
 mark_lock_irq+0xbb/0xc20 kernel/locking/lockdep.c:4263
 mark_lock+0x223/0x360 kernel/locking/lockdep.c:4725
 __lock_acquire+0xbf9/0x2050 kernel/locking/lockdep.c:5156
 lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5825
 percpu_down_read include/linux/percpu-rwsem.h:51 [inline]
 cpus_read_lock+0x42/0x150 kernel/cpu.c:490
 __static_key_slow_dec kernel/jump_label.c:320 [inline]
 static_key_slow_dec+0x49/0xa0 kernel/jump_label.c:336
 nf_tables_chain_destroy+0x3c4/0x4f0 net/netfilter/nf_tables_api.c:2160
 __nft_release_basechain_now net/netfilter/nf_tables_api.c:11442 [inline]
 nft_release_basechain_rcu+0x3fc/0x550 net/netfilter/nf_tables_api.c:11454
 rcu_do_batch kernel/rcu/tree.c:2567 [inline]
 rcu_core+0xaaa/0x17a0 kernel/rcu/tree.c:2823
 handle_softirqs+0x2c5/0x980 kernel/softirq.c:554
 run_ksoftirqd+0xca/0x130 kernel/softirq.c:927
 smpboot_thread_fn+0x544/0xa30 kernel/smpboot.c:164
 kthread+0x2f0/0x390 kernel/kthread.c:389
 ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147
 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244
 </TASK>
BUG: scheduling while atomic: ksoftirqd/0/16/0x00000101
INFO: lockdep is turned off.
Modules linked in:
Preemption disabled at:
[<ffffffff81578192>] softirq_handle_begin kernel/softirq.c:395 [inline]
[<ffffffff81578192>] handle_softirqs+0x122/0x980 kernel/softirq.c:530


---
This report is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.

syzbot will keep track of this issue. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.

If the report is already addressed, let syzbot know by replying with:
#syz fix: exact-commit-title

If you want to overwrite report's subsystems, reply with:
#syz set subsystems: new-subsystem
(See the list of subsystem names on the web dashboard)

If the report is a duplicate of another one, reply with:
#syz dup: exact-subject-of-another-report

If you want to undo deduplication, reply with:
#syz undup

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [syzbot] [kernel?] BUG: sleeping function called from invalid context in static_key_slow_dec
  2024-11-27 21:22 [syzbot] [kernel?] BUG: sleeping function called from invalid context in static_key_slow_dec syzbot
@ 2024-11-29 10:47 ` Florian Westphal
  2024-12-07 11:14 ` [PATCH nf] netfilter: nf_tables: do not defer rule destruction via call_rcu Florian Westphal
  1 sibling, 0 replies; 6+ messages in thread
From: Florian Westphal @ 2024-11-29 10:47 UTC (permalink / raw)
  To: syzbot; +Cc: linux-kernel, netdev, peterz, syzkaller-bugs, tglx

syzbot <syzbot+b26935466701e56cfdc2@syzkaller.appspotmail.com> wrote:
> BUG: sleeping function called from invalid context at include/linux/percpu-rwsem.h:49
> in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 16, name: ksoftirqd/0
> preempt_count: 100, expected: 0
> RCU nest depth: 0, expected: 0
> 1 lock held by ksoftirqd/0/16:
>  #0: ffffffff8e937e60 (rcu_callback){....}-{0:0}, at: rcu_lock_acquire include/linux/rcupdate.h:337 [inline]
>  #0: ffffffff8e937e60 (rcu_callback){....}-{0:0}, at: rcu_do_batch kernel/rcu/tree.c:2561 [inline]
>  #0: ffffffff8e937e60 (rcu_callback){....}-{0:0}, at: rcu_core+0xa37/0x17a0 kernel/rcu/tree.c:2823
> Preemption disabled at:
> [<ffffffff81578192>] softirq_handle_begin kernel/softirq.c:395 [inline]
> [<ffffffff81578192>] handle_softirqs+0x122/0x980 kernel/softirq.c:530
> CPU: 0 UID: 0 PID: 16 Comm: ksoftirqd/0 Not tainted 6.12.0-rc6-syzkaller-00203-g5b366eae7193 #0
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/30/2024
> Call Trace:
>  <TASK>
>  __dump_stack lib/dump_stack.c:94 [inline]
>  dump_stack_lvl+0x241/0x360 lib/dump_stack.c:120
>  __might_resched+0x5d4/0x780 kernel/sched/core.c:8653
>  percpu_down_read include/linux/percpu-rwsem.h:49 [inline]
>  cpus_read_lock+0x1b/0x150 kernel/cpu.c:490
>  __static_key_slow_dec kernel/jump_label.c:320 [inline]
>  static_key_slow_dec+0x49/0xa0 kernel/jump_label.c:336
>  nf_tables_chain_destroy+0x3c4/0x4f0 net/netfilter/nf_tables_api.c:2160
>  __nft_release_basechain_now net/netfilter/nf_tables_api.c:11442 [inline]
>  nft_release_basechain_rcu+0x3fc/0x550 net/netfilter/nf_tables_api.c:11454

nf_tables_chain_destroy can sleep via the static key.

I suggest to remove the basechain stats, this was a mistake all along.

Alternative is to defer to work queue or see if replacing the static key
with a deferred static key, that should place the problematic jump
patching to work queue too.

But I'd rather axe all of the basechain stat stuff.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* [PATCH nf] netfilter: nf_tables: do not defer rule destruction via call_rcu
  2024-11-27 21:22 [syzbot] [kernel?] BUG: sleeping function called from invalid context in static_key_slow_dec syzbot
  2024-11-29 10:47 ` Florian Westphal
@ 2024-12-07 11:14 ` Florian Westphal
  2024-12-09 21:27   ` Pablo Neira Ayuso
  1 sibling, 1 reply; 6+ messages in thread
From: Florian Westphal @ 2024-12-07 11:14 UTC (permalink / raw)
  To: netfilter-devel
  Cc: syzkaller-bugs, Florian Westphal, syzbot+b26935466701e56cfdc2

nf_tables_chain_destroy can sleep, it can't be used from call_rcu
callbacks.

Moreover, nf_tables_rule_release() is only safe for error unwinding,
while transaction mutex is held and the to-be-desroyed rule was not
exposed to either dataplane or dumps, as it deactives+frees without
the required synchronize_rcu() in-between.

nft_rule_expr_deactivate() callbacks will change ->use counters
of other chains/sets, see e.g. nft_lookup .deactivate callback, these
must be serialized via transaction mutex.

Also add a few lockdep asserts to make this more explicit.

Calling synchronize_rcu() isn't ideal, but fixing this without is hard
and way more intrusive.  As-is, we can get:

WARNING: .. net/netfilter/nf_tables_api.c:5515 nft_set_destroy+0x..
Workqueue: events nf_tables_trans_destroy_work
RIP: 0010:nft_set_destroy+0x3fe/0x5c0
Call Trace:
 <TASK>
 nf_tables_trans_destroy_work+0x6b7/0xad0
 process_one_work+0x64a/0xce0
 worker_thread+0x613/0x10d0

In case the synchronize_rcu becomes an issue, we can explore alternatives.

One way would be to allocate nft_trans_rule objects + one nft_trans_chain
object, deactivate the rules + the chain and then defer the freeing to the
nft destroy workqueue.  We'd still need to keep the synchronize_rcu path as
a fallback to handle -ENOMEM corner cases though.

Reported-by: syzbot+b26935466701e56cfdc2@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/67478d92.050a0220.253251.0062.GAE@google.com/T/
Fixes: c03d278fdf35 ("netfilter: nf_tables: wait for rcu grace period on net_device removal")
Signed-off-by: Florian Westphal <fw@strlen.de>
---
 net/netfilter/nf_tables_api.c | 31 +++++++++++++++----------------
 1 file changed, 15 insertions(+), 16 deletions(-)

diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
index 21b6f7410a1f..a3b6b6b32f72 100644
--- a/net/netfilter/nf_tables_api.c
+++ b/net/netfilter/nf_tables_api.c
@@ -3987,8 +3987,11 @@ void nf_tables_rule_destroy(const struct nft_ctx *ctx, struct nft_rule *rule)
 	kfree(rule);
 }
 
+/* can only be used if rule is no longer visible to dumps */
 static void nf_tables_rule_release(const struct nft_ctx *ctx, struct nft_rule *rule)
 {
+	lockdep_commit_lock_is_held(ctx->net);
+
 	nft_rule_expr_deactivate(ctx, rule, NFT_TRANS_RELEASE);
 	nf_tables_rule_destroy(ctx, rule);
 }
@@ -5757,6 +5760,8 @@ void nf_tables_deactivate_set(const struct nft_ctx *ctx, struct nft_set *set,
 			      struct nft_set_binding *binding,
 			      enum nft_trans_phase phase)
 {
+	lockdep_commit_lock_is_held(ctx->net);
+
 	switch (phase) {
 	case NFT_TRANS_PREPARE_ERROR:
 		nft_set_trans_unbind(ctx, set);
@@ -11695,19 +11700,6 @@ static void __nft_release_basechain_now(struct nft_ctx *ctx)
 	nf_tables_chain_destroy(ctx->chain);
 }
 
-static void nft_release_basechain_rcu(struct rcu_head *head)
-{
-	struct nft_chain *chain = container_of(head, struct nft_chain, rcu_head);
-	struct nft_ctx ctx = {
-		.family	= chain->table->family,
-		.chain	= chain,
-		.net	= read_pnet(&chain->table->net),
-	};
-
-	__nft_release_basechain_now(&ctx);
-	put_net(ctx.net);
-}
-
 int __nft_release_basechain(struct nft_ctx *ctx)
 {
 	struct nft_rule *rule;
@@ -11722,11 +11714,18 @@ int __nft_release_basechain(struct nft_ctx *ctx)
 	nft_chain_del(ctx->chain);
 	nft_use_dec(&ctx->table->use);
 
-	if (maybe_get_net(ctx->net))
-		call_rcu(&ctx->chain->rcu_head, nft_release_basechain_rcu);
-	else
+	if (!maybe_get_net(ctx->net)) {
 		__nft_release_basechain_now(ctx);
+		return 0;
+	}
+
+	/* wait for ruleset dumps to complete.  Owning chain is no longer in
+	 * lists, so new dumps can't find any of these rules anymore.
+	 */
+	synchronize_rcu();
 
+	__nft_release_basechain_now(ctx);
+	put_net(ctx->net);
 	return 0;
 }
 EXPORT_SYMBOL_GPL(__nft_release_basechain);
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH nf] netfilter: nf_tables: do not defer rule destruction via call_rcu
  2024-12-07 11:14 ` [PATCH nf] netfilter: nf_tables: do not defer rule destruction via call_rcu Florian Westphal
@ 2024-12-09 21:27   ` Pablo Neira Ayuso
  2024-12-09 22:04     ` Florian Westphal
  0 siblings, 1 reply; 6+ messages in thread
From: Pablo Neira Ayuso @ 2024-12-09 21:27 UTC (permalink / raw)
  To: Florian Westphal
  Cc: netfilter-devel, syzkaller-bugs, syzbot+b26935466701e56cfdc2

Hi Florian,

Thanks a lot for your quick fix.

On Sat, Dec 07, 2024 at 12:14:48PM +0100, Florian Westphal wrote:
> nf_tables_chain_destroy can sleep, it can't be used from call_rcu
> callbacks.
> 
> Moreover, nf_tables_rule_release() is only safe for error unwinding,
> while transaction mutex is held and the to-be-desroyed rule was not
> exposed to either dataplane or dumps, as it deactives+frees without
> the required synchronize_rcu() in-between.
> 
> nft_rule_expr_deactivate() callbacks will change ->use counters
> of other chains/sets, see e.g. nft_lookup .deactivate callback, these
> must be serialized via transaction mutex.
> 
> Also add a few lockdep asserts to make this more explicit.
> 
> Calling synchronize_rcu() isn't ideal, but fixing this without is hard
> and way more intrusive.  As-is, we can get:
> 
> WARNING: .. net/netfilter/nf_tables_api.c:5515 nft_set_destroy+0x..

Right, rhash needs this, that is why there is a workqueue to release
objects from nftables commit path.

> Workqueue: events nf_tables_trans_destroy_work
> RIP: 0010:nft_set_destroy+0x3fe/0x5c0
> Call Trace:
>  <TASK>
>  nf_tables_trans_destroy_work+0x6b7/0xad0
>  process_one_work+0x64a/0xce0
>  worker_thread+0x613/0x10d0
> 
> In case the synchronize_rcu becomes an issue, we can explore alternatives.
> 
> One way would be to allocate nft_trans_rule objects + one nft_trans_chain
> object, deactivate the rules + the chain and then defer the freeing to the
> nft destroy workqueue.  We'd still need to keep the synchronize_rcu path as
> a fallback to handle -ENOMEM corner cases though.

I think it can be done _without_ nft_trans objects.

Since the commit mutex is held in this netdev event path: Remove this
basechain, deactivate rules and add basechain to global list protected
with spinlock, it invokes worker. Then, worker zaps this list
basechains, it calls synchronize_rcu() and it destroys rules and then
basechain. No memory allocations needed in this case?

Thanks.

> Reported-by: syzbot+b26935466701e56cfdc2@syzkaller.appspotmail.com
> Closes: https://lore.kernel.org/all/67478d92.050a0220.253251.0062.GAE@google.com/T/
> Fixes: c03d278fdf35 ("netfilter: nf_tables: wait for rcu grace period on net_device removal")
> Signed-off-by: Florian Westphal <fw@strlen.de>
> ---
>  net/netfilter/nf_tables_api.c | 31 +++++++++++++++----------------
>  1 file changed, 15 insertions(+), 16 deletions(-)
> 
> diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
> index 21b6f7410a1f..a3b6b6b32f72 100644
> --- a/net/netfilter/nf_tables_api.c
> +++ b/net/netfilter/nf_tables_api.c
> @@ -3987,8 +3987,11 @@ void nf_tables_rule_destroy(const struct nft_ctx *ctx, struct nft_rule *rule)
>  	kfree(rule);
>  }
>  
> +/* can only be used if rule is no longer visible to dumps */
>  static void nf_tables_rule_release(const struct nft_ctx *ctx, struct nft_rule *rule)
>  {
> +	lockdep_commit_lock_is_held(ctx->net);
> +
>  	nft_rule_expr_deactivate(ctx, rule, NFT_TRANS_RELEASE);
>  	nf_tables_rule_destroy(ctx, rule);
>  }
> @@ -5757,6 +5760,8 @@ void nf_tables_deactivate_set(const struct nft_ctx *ctx, struct nft_set *set,
>  			      struct nft_set_binding *binding,
>  			      enum nft_trans_phase phase)
>  {
> +	lockdep_commit_lock_is_held(ctx->net);
> +
>  	switch (phase) {
>  	case NFT_TRANS_PREPARE_ERROR:
>  		nft_set_trans_unbind(ctx, set);
> @@ -11695,19 +11700,6 @@ static void __nft_release_basechain_now(struct nft_ctx *ctx)
>  	nf_tables_chain_destroy(ctx->chain);
>  }
>  
> -static void nft_release_basechain_rcu(struct rcu_head *head)
> -{
> -	struct nft_chain *chain = container_of(head, struct nft_chain, rcu_head);
> -	struct nft_ctx ctx = {
> -		.family	= chain->table->family,
> -		.chain	= chain,
> -		.net	= read_pnet(&chain->table->net),
> -	};
> -
> -	__nft_release_basechain_now(&ctx);
> -	put_net(ctx.net);
> -}
> -
>  int __nft_release_basechain(struct nft_ctx *ctx)
>  {
>  	struct nft_rule *rule;
> @@ -11722,11 +11714,18 @@ int __nft_release_basechain(struct nft_ctx *ctx)
>  	nft_chain_del(ctx->chain);
>  	nft_use_dec(&ctx->table->use);
>  
> -	if (maybe_get_net(ctx->net))
> -		call_rcu(&ctx->chain->rcu_head, nft_release_basechain_rcu);
> -	else
> +	if (!maybe_get_net(ctx->net)) {
>  		__nft_release_basechain_now(ctx);
> +		return 0;
> +	}
> +
> +	/* wait for ruleset dumps to complete.  Owning chain is no longer in
> +	 * lists, so new dumps can't find any of these rules anymore.
> +	 */
> +	synchronize_rcu();
>  
> +	__nft_release_basechain_now(ctx);
> +	put_net(ctx->net);
>  	return 0;
>  }
>  EXPORT_SYMBOL_GPL(__nft_release_basechain);
> -- 
> 2.47.1
> 
> 

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH nf] netfilter: nf_tables: do not defer rule destruction via call_rcu
  2024-12-09 21:27   ` Pablo Neira Ayuso
@ 2024-12-09 22:04     ` Florian Westphal
  2024-12-11 16:16       ` Pablo Neira Ayuso
  0 siblings, 1 reply; 6+ messages in thread
From: Florian Westphal @ 2024-12-09 22:04 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: Florian Westphal, netfilter-devel, syzkaller-bugs,
	syzbot+b26935466701e56cfdc2

Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> > One way would be to allocate nft_trans_rule objects + one nft_trans_chain
> > object, deactivate the rules + the chain and then defer the freeing to the
> > nft destroy workqueue.  We'd still need to keep the synchronize_rcu path as
> > a fallback to handle -ENOMEM corner cases though.
> 
> I think it can be done _without_ nft_trans objects.
> 
> Since the commit mutex is held in this netdev event path: Remove this
> basechain, deactivate rules and add basechain to global list protected
> with spinlock, it invokes worker. Then, worker zaps this list
> basechains, it calls synchronize_rcu() and it destroys rules and then
> basechain. No memory allocations needed in this case?

I don't think its possible due to netlink dumps.

Its safe for normal commit path because list_del_rcu() gets called on
these objects (making them unreachable for new dumps), then,
synchronize_rcu is called (so all concurrent dumpers are done) and then
we free.

It would work if you add a new list_head to the basechain, then you
could just link the basechain object to some other list and then
call nf_tables_rule_release() AFTER a synchronize_rcu because rules
can't be picked up if the owning chain is no longer reachable.

However, I do wonder how complex it would be.

This is tricky to get right and I'm not sure this patch adds a
noticeable issue, see nft_rcv_nl_event() which has similar pattern,
i.e. synchronize_rcu() is called.

I'm not saying we should not consider async route, but maybe better
to followup in -next?

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH nf] netfilter: nf_tables: do not defer rule destruction via call_rcu
  2024-12-09 22:04     ` Florian Westphal
@ 2024-12-11 16:16       ` Pablo Neira Ayuso
  0 siblings, 0 replies; 6+ messages in thread
From: Pablo Neira Ayuso @ 2024-12-11 16:16 UTC (permalink / raw)
  To: Florian Westphal
  Cc: netfilter-devel, syzkaller-bugs, syzbot+b26935466701e56cfdc2

Hi Florian

On Mon, Dec 09, 2024 at 11:04:01PM +0100, Florian Westphal wrote:
> Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> > > One way would be to allocate nft_trans_rule objects + one nft_trans_chain
> > > object, deactivate the rules + the chain and then defer the freeing to the
> > > nft destroy workqueue.  We'd still need to keep the synchronize_rcu path as
> > > a fallback to handle -ENOMEM corner cases though.
> > 
> > I think it can be done _without_ nft_trans objects.
> > 
> > Since the commit mutex is held in this netdev event path: Remove this
> > basechain, deactivate rules and add basechain to global list protected
> > with spinlock, it invokes worker. Then, worker zaps this list
> > basechains, it calls synchronize_rcu() and it destroys rules and then
> > basechain. No memory allocations needed in this case?
> 
> I don't think its possible due to netlink dumps.
> 
> Its safe for normal commit path because list_del_rcu() gets called on
> these objects (making them unreachable for new dumps), then,
> synchronize_rcu is called (so all concurrent dumpers are done) and then
> we free.
> 
> It would work if you add a new list_head to the basechain, then you
> could just link the basechain object to some other list and then
> call nf_tables_rule_release() AFTER a synchronize_rcu because rules
> can't be picked up if the owning chain is no longer reachable.
> 
> However, I do wonder how complex it would be.

I started a patch, spent several hours in a row, but I need more time
to get this right.

There is also Phil making the line to remove this in nf-next.

> This is tricky to get right and I'm not sure this patch adds a
> noticeable issue, see nft_rcv_nl_event() which has similar pattern,
> i.e. synchronize_rcu() is called.

I can see veth hits synchronize_rcu(), this can be called from
default_device_exit_batch() which calls unregister_netdevice_many().

I need time, but it is difficult. We have to deliver a pull request
very late on wednesday so it can be taken on thursday.

Now the week has three days to fix and test stuff, so upstream
maintainers have a day to tidy up and submit upstream.

> I'm not saying we should not consider async route, but maybe better
> to followup in -next?

Yes, I can follow up in -next, but then Phil just remove this again in
-next.

I can just take this fix to address the existing situation which is
worse than before. Before this patch UAF was possible, although I
never manage to trigger it. Now crash is easy to trigger, I spent too
much time looking at the interaction with layers and I forgot basic
assumption regarding nf_tables in this patch.

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2024-12-11 16:16 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-11-27 21:22 [syzbot] [kernel?] BUG: sleeping function called from invalid context in static_key_slow_dec syzbot
2024-11-29 10:47 ` Florian Westphal
2024-12-07 11:14 ` [PATCH nf] netfilter: nf_tables: do not defer rule destruction via call_rcu Florian Westphal
2024-12-09 21:27   ` Pablo Neira Ayuso
2024-12-09 22:04     ` Florian Westphal
2024-12-11 16:16       ` Pablo Neira Ayuso

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.