All of lore.kernel.org
 help / color / mirror / Atom feed
From: sdf@google.com
To: tong@infragraf.org
Cc: bpf@vger.kernel.org, Alexei Starovoitov <ast@kernel.org>,
	Daniel Borkmann <daniel@iogearbox.net>,
	Andrii Nakryiko <andrii@kernel.org>,
	Martin KaFai Lau <martin.lau@linux.dev>,
	Song Liu <song@kernel.org>, Yonghong Song <yhs@fb.com>,
	John Fastabend <john.fastabend@gmail.com>,
	KP Singh <kpsingh@kernel.org>, Hao Luo <haoluo@google.com>,
	Jiri Olsa <jolsa@kernel.org>, Hou Tao <houtao1@huawei.com>
Subject: Re: [bpf-next v1] bpf: hash map, suppress lockdep warning
Date: Thu, 5 Jan 2023 10:36:58 -0800	[thread overview]
Message-ID: <Y7cYyml13lfWptYi@google.com> (raw)
In-Reply-To: <20230105112749.38421-1-tong@infragraf.org>

On 01/05, tong@infragraf.org wrote:
> From: Tonghao Zhang <tong@infragraf.org>

> The lock may be taken in both NMI and non-NMI contexts.
> There is a lockdep warning (inconsistent lock state), if
> enable lockdep. For performance, this patch doesn't use trylock,
> and disable lockdep temporarily.

> [   82.474075] ================================
> [   82.474076] WARNING: inconsistent lock state
> [   82.474090] 6.1.0+ #48 Tainted: G            E
> [   82.474093] --------------------------------
> [   82.474100] inconsistent {INITIAL USE} -> {IN-NMI} usage.
> [   82.474101] kprobe-load/1740 [HC1[1]:SC0[0]:HE0:SE1] takes:
> [   82.474105] ffff88860a5cf7b0 (&htab->lockdep_key){....}-{2:2}, at:  
> htab_lock_bucket+0x61/0x6c
> [   82.474120] {INITIAL USE} state was registered at:
> [   82.474122]   mark_usage+0x1d/0x11d
> [   82.474130]   __lock_acquire+0x3c9/0x6ed
> [   82.474131]   lock_acquire+0x23d/0x29a
> [   82.474135]   _raw_spin_lock_irqsave+0x43/0x7f
> [   82.474148]   htab_lock_bucket+0x61/0x6c
> [   82.474151]   htab_map_update_elem+0x11e/0x220
> [   82.474155]   bpf_map_update_value+0x267/0x28e
> [   82.474160]   map_update_elem+0x13e/0x17d
> [   82.474164]   __sys_bpf+0x2ae/0xb2e
> [   82.474167]   __do_sys_bpf+0xd/0x15
> [   82.474171]   do_syscall_64+0x6d/0x84
> [   82.474174]   entry_SYSCALL_64_after_hwframe+0x63/0xcd
> [   82.474178] irq event stamp: 1496498
> [   82.474180] hardirqs last  enabled at (1496497): [<ffffffff817eb9d9>]  
> syscall_enter_from_user_mode+0x63/0x8d
> [   82.474184] hardirqs last disabled at (1496498): [<ffffffff817ea6b6>]  
> exc_nmi+0x87/0x109
> [   82.474187] softirqs last  enabled at (1446698): [<ffffffff81a00347>]  
> __do_softirq+0x347/0x387
> [   82.474191] softirqs last disabled at (1446693): [<ffffffff810b9b06>]  
> __irq_exit_rcu+0x67/0xc6
> [   82.474195]
> [   82.474195] other info that might help us debug this:
> [   82.474196]  Possible unsafe locking scenario:
> [   82.474196]
> [   82.474197]        CPU0
> [   82.474198]        ----
> [   82.474198]   lock(&htab->lockdep_key);
> [   82.474200]   <Interrupt>
> [   82.474200]     lock(&htab->lockdep_key);
> [   82.474201]
> [   82.474201]  *** DEADLOCK ***
> [   82.474201]
> [   82.474202] no locks held by kprobe-load/1740.
> [   82.474203]
> [   82.474203] stack backtrace:
> [   82.474205] CPU: 14 PID: 1740 Comm: kprobe-load Tainted: G             
> E      6.1.0+ #48
> [   82.474208] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),  
> BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
> [   82.474213] Call Trace:
> [   82.474218]  <NMI>
> [   82.474224]  dump_stack_lvl+0x57/0x81
> [   82.474228]  lock_acquire+0x1f4/0x29a
> [   82.474233]  ? htab_lock_bucket+0x61/0x6c
> [   82.474237]  ? rcu_read_lock_held_common+0xe/0x38
> [   82.474245]  _raw_spin_lock_irqsave+0x43/0x7f
> [   82.474249]  ? htab_lock_bucket+0x61/0x6c
> [   82.474253]  htab_lock_bucket+0x61/0x6c
> [   82.474257]  htab_map_update_elem+0x11e/0x220
> [   82.474264]  bpf_prog_df326439468c24a9_bpf_prog1+0x41/0x45
> [   82.474276]  bpf_trampoline_6442457183_0+0x43/0x1000
> [   82.474283]  nmi_handle+0x5/0x254
> [   82.474289]  default_do_nmi+0x3d/0xf6
> [   82.474293]  exc_nmi+0xa1/0x109
> [   82.474297]  end_repeat_nmi+0x16/0x67
> [   82.474300] RIP: 0010:cpu_online+0xa/0x12
> [   82.474308] Code: 08 00 00 00 39 c6 0f 43 c6 83 c0 07 83 e0 f8 c3 cc  
> cc cc cc 0f 1f 44 00 00 31 c0 c3 cc cc cc cc 89 ff 48 0f a3 3d 5f 52 75  
> 01 <0f> 92 c0 c3 cc cc cc cc 55 48 89 e5 41 57 49 89 f7 41 56 49 896
> [   82.474310] RSP: 0018:ffffc9000131bd38 EFLAGS: 00000283
> [   82.474313] RAX: ffff88860b85fe78 RBX: 0000000000102cc0 RCX:  
> 0000000000000008
> [   82.474315] RDX: 0000000000000004 RSI: ffff88860b85fe78 RDI:  
> 000000000000000e
> [   82.474316] RBP: 00000000ffffffff R08: 0000000000102cc0 R09:  
> 00000000ffffffff
> [   82.474318] R10: 0000000000000001 R11: 0000000000000000 R12:  
> ffff888100042200
> [   82.474320] R13: 0000000000000004 R14: ffffffff81271dc2 R15:  
> ffff88860b85fe78
> [   82.474322]  ? kvmalloc_node+0x44/0xd2
> [   82.474333]  ? cpu_online+0xa/0x12
> [   82.474338]  ? cpu_online+0xa/0x12
> [   82.474342]  </NMI>
> [   82.474343]  <TASK>
> [   82.474343]  trace_kmalloc+0x7c/0xe6
> [   82.474347]  ? kvmalloc_node+0x44/0xd2
> [   82.474350]  __kmalloc_node+0x9a/0xaf
> [   82.474354]  kvmalloc_node+0x44/0xd2
> [   82.474359]  kvmemdup_bpfptr+0x29/0x66
> [   82.474363]  map_update_elem+0x119/0x17d
> [   82.474370]  __sys_bpf+0x2ae/0xb2e
> [   82.474380]  __do_sys_bpf+0xd/0x15
> [   82.474384]  do_syscall_64+0x6d/0x84
> [   82.474387]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
> [   82.474391] RIP: 0033:0x7fe75d4f752d
> [   82.474394] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa  
> 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f  
> 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 2b 79 2c 00 f7 d8 64 89 018
> [   82.474396] RSP: 002b:00007ffe95d1cd78 EFLAGS: 00000246 ORIG_RAX:  
> 0000000000000141
> [   82.474398] RAX: ffffffffffffffda RBX: 0000000000000000 RCX:  
> 00007fe75d4f752d
> [   82.474400] RDX: 0000000000000078 RSI: 00007ffe95d1cd80 RDI:  
> 0000000000000002
> [   82.474401] RBP: 00007ffe95d1ce30 R08: 0000000000000000 R09:  
> 0000000000000004
> [   82.474403] R10: 00007ffe95d1cd80 R11: 0000000000000246 R12:  
> 00000000004007f0
> [   82.474405] R13: 00007ffe95d1cf10 R14: 0000000000000000 R15:  
> 0000000000000000
> [   82.474412]  </TASK>

> Signed-off-by: Tonghao Zhang <tong@infragraf.org>
> Cc: Alexei Starovoitov <ast@kernel.org>
> Cc: Daniel Borkmann <daniel@iogearbox.net>
> Cc: Andrii Nakryiko <andrii@kernel.org>
> Cc: Martin KaFai Lau <martin.lau@linux.dev>
> Cc: Song Liu <song@kernel.org>
> Cc: Yonghong Song <yhs@fb.com>
> Cc: John Fastabend <john.fastabend@gmail.com>
> Cc: KP Singh <kpsingh@kernel.org>
> Cc: Stanislav Fomichev <sdf@google.com>
> Cc: Hao Luo <haoluo@google.com>
> Cc: Jiri Olsa <jolsa@kernel.org>
> Cc: Hou Tao <houtao1@huawei.com>
> ---
> previous discussion:  
> https://lore.kernel.org/all/20221121100521.56601-2-xiangxia.m.yue@gmail.com/
> ---
>   kernel/bpf/hashtab.c | 16 ++++++++++++++++
>   1 file changed, 16 insertions(+)

> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
> index 974f104f47a0..146433c9bd1a 100644
> --- a/kernel/bpf/hashtab.c
> +++ b/kernel/bpf/hashtab.c
> @@ -161,6 +161,19 @@ static inline int htab_lock_bucket(const struct  
> bpf_htab *htab,
>   		return -EBUSY;
>   	}

> +	/*
> +	 * The lock may be taken in both NMI and non-NMI contexts.
> +	 * There is a lockdep warning (inconsistent lock state), if
> +	 * enable lockdep. The potential deadlock happens when the
> +	 * lock is contended from the same cpu. map_locked rejects
> +	 * concurrent access to the same bucket from the same CPU.
> +	 * When the lock is contended from a remote cpu, we would
> +	 * like the remote cpu to spin and wait, instead of giving
> +	 * up immediately. As this gives better throughput. So replacing
> +	 * the current raw_spin_lock_irqsave() with trylock sacrifices
> +	 * this performance gain. lockdep_off temporarily.

So is this a real issue or not? Which NMI context path leads to taking
this lock? If it can be taken by both nmi and non-nmi context,
will we see a real deadlock?

Also, how temporary is your 'temporarily'? Is there a plan to properly
fix it?

> +	 */
> +	lockdep_off();
>   	raw_spin_lock_irqsave(&b->raw_lock, flags);
>   	*pflags = flags;

> @@ -172,7 +185,10 @@ static inline void htab_unlock_bucket(const struct  
> bpf_htab *htab,
>   				      unsigned long flags)
>   {
>   	hash = hash & min_t(u32, HASHTAB_MAP_LOCK_MASK, htab->n_buckets -1);
> +
>   	raw_spin_unlock_irqrestore(&b->raw_lock, flags);
> +	lockdep_on();
> +
>   	__this_cpu_dec(*(htab->map_locked[hash]));
>   	preempt_enable();
>   }
> --
> 2.27.0


  parent reply	other threads:[~2023-01-05 18:37 UTC|newest]

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-01-05 11:27 [bpf-next v1] bpf: hash map, suppress lockdep warning tong
2023-01-05 12:06 ` Hou Tao
2023-01-05 14:26   ` Tonghao Zhang
2023-01-06  4:01     ` Hou Tao
2023-01-05 18:36 ` sdf [this message]
2023-01-06  2:13 ` Xu Kuohai
2023-01-06  4:19   ` Hou Tao
2023-01-09  6:10     ` Tonghao Zhang
2023-01-09  9:12       ` Hou Tao
2023-01-10  2:01         ` Tonghao Zhang

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Y7cYyml13lfWptYi@google.com \
    --to=sdf@google.com \
    --cc=andrii@kernel.org \
    --cc=ast@kernel.org \
    --cc=bpf@vger.kernel.org \
    --cc=daniel@iogearbox.net \
    --cc=haoluo@google.com \
    --cc=houtao1@huawei.com \
    --cc=john.fastabend@gmail.com \
    --cc=jolsa@kernel.org \
    --cc=kpsingh@kernel.org \
    --cc=martin.lau@linux.dev \
    --cc=song@kernel.org \
    --cc=tong@infragraf.org \
    --cc=yhs@fb.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.