From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 76BBFC4708D for ; Fri, 6 Jan 2023 04:20:14 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231504AbjAFEUM (ORCPT ); Thu, 5 Jan 2023 23:20:12 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:55548 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231795AbjAFETc (ORCPT ); Thu, 5 Jan 2023 23:19:32 -0500 Received: from szxga08-in.huawei.com (szxga08-in.huawei.com [45.249.212.255]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 1852A392CB for ; Thu, 5 Jan 2023 20:19:08 -0800 (PST) Received: from dggpeml500025.china.huawei.com (unknown [172.30.72.57]) by szxga08-in.huawei.com (SkyGuard) with ESMTP id 4Np96V30Dcz16MTq; Fri, 6 Jan 2023 12:17:38 +0800 (CST) Received: from [10.174.176.117] (10.174.176.117) by dggpeml500025.china.huawei.com (7.185.36.35) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.1.2375.34; Fri, 6 Jan 2023 12:19:05 +0800 Subject: Re: [bpf-next v1] bpf: hash map, suppress lockdep warning To: Xu Kuohai , , CC: Alexei Starovoitov , Daniel Borkmann , Andrii Nakryiko , Martin KaFai Lau , Song Liu , Yonghong Song , John Fastabend , KP Singh , Stanislav Fomichev , Hao Luo , Jiri Olsa References: <20230105112749.38421-1-tong@infragraf.org> <6004bbc1-54a8-0141-04af-0a5fba82e6ee@huawei.com> From: Hou Tao Message-ID: Date: Fri, 6 Jan 2023 12:19:05 +0800 User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101 Thunderbird/78.6.0 MIME-Version: 1.0 In-Reply-To: <6004bbc1-54a8-0141-04af-0a5fba82e6ee@huawei.com> Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: 8bit Content-Language: en-US X-Originating-IP: [10.174.176.117] X-ClientProxiedBy: dggems703-chm.china.huawei.com (10.3.19.180) To dggpeml500025.china.huawei.com (7.185.36.35) X-CFilter-Loop: Reflected Precedence: bulk List-ID: X-Mailing-List: bpf@vger.kernel.org Hi, On 1/6/2023 10:13 AM, Xu Kuohai wrote: > On 1/5/2023 7:27 PM, tong@infragraf.org wrote: >> From: Tonghao Zhang >> >> The lock may be taken in both NMI and non-NMI contexts. >> There is a lockdep warning (inconsistent lock state), if >> enable lockdep. For performance, this patch doesn't use trylock, >> and disable lockdep temporarily. >> >> [   82.474075] ================================ >> [   82.474076] WARNING: inconsistent lock state >> [   82.474090] 6.1.0+ #48 Tainted: G            E >> [   82.474093] -------------------------------- >> [   82.474100] inconsistent {INITIAL USE} -> {IN-NMI} usage. >> [   82.474101] kprobe-load/1740 [HC1[1]:SC0[0]:HE0:SE1] takes: >> [   82.474105] ffff88860a5cf7b0 (&htab->lockdep_key){....}-{2:2}, at: >> htab_lock_bucket+0x61/0x6c >> [   82.474120] {INITIAL USE} state was registered at: >> [   82.474122]   mark_usage+0x1d/0x11d >> [   82.474130]   __lock_acquire+0x3c9/0x6ed >> [   82.474131]   lock_acquire+0x23d/0x29a >> [   82.474135]   _raw_spin_lock_irqsave+0x43/0x7f >> [   82.474148]   htab_lock_bucket+0x61/0x6c >> [   82.474151]   htab_map_update_elem+0x11e/0x220 >> [   82.474155]   bpf_map_update_value+0x267/0x28e >> [   82.474160]   map_update_elem+0x13e/0x17d >> [   82.474164]   __sys_bpf+0x2ae/0xb2e >> [   82.474167]   __do_sys_bpf+0xd/0x15 >> [   82.474171]   do_syscall_64+0x6d/0x84 >> [   82.474174]   entry_SYSCALL_64_after_hwframe+0x63/0xcd >> [   82.474178] irq event stamp: 1496498 >> [   82.474180] hardirqs last  enabled at (1496497): [] >> syscall_enter_from_user_mode+0x63/0x8d >> [   82.474184] hardirqs last disabled at (1496498): [] >> exc_nmi+0x87/0x109 >> [   82.474187] softirqs last  enabled at (1446698): [] >> __do_softirq+0x347/0x387 >> [   82.474191] softirqs last disabled at (1446693): [] >> __irq_exit_rcu+0x67/0xc6 >> [   82.474195] >> [   82.474195] other info that might help us debug this: >> [   82.474196]  Possible unsafe locking scenario: >> [   82.474196] >> [   82.474197]        CPU0 >> [   82.474198]        ---- >> [   82.474198]   lock(&htab->lockdep_key); >> [   82.474200]   >> [   82.474200]     lock(&htab->lockdep_key); >> [   82.474201] >> [   82.474201]  *** DEADLOCK *** >> [   82.474201] >> [   82.474202] no locks held by kprobe-load/1740. >> [   82.474203] >> [   82.474203] stack backtrace: >> [   82.474205] CPU: 14 PID: 1740 Comm: kprobe-load Tainted: G            >> E      6.1.0+ #48 >> [   82.474208] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS >> rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014 >> [   82.474213] Call Trace: >> [   82.474218]  >> [   82.474224]  dump_stack_lvl+0x57/0x81 >> [   82.474228]  lock_acquire+0x1f4/0x29a >> [   82.474233]  ? htab_lock_bucket+0x61/0x6c >> [   82.474237]  ? rcu_read_lock_held_common+0xe/0x38 >> [   82.474245]  _raw_spin_lock_irqsave+0x43/0x7f >> [   82.474249]  ? htab_lock_bucket+0x61/0x6c >> [   82.474253]  htab_lock_bucket+0x61/0x6c >> [   82.474257]  htab_map_update_elem+0x11e/0x220 >> [   82.474264]  bpf_prog_df326439468c24a9_bpf_prog1+0x41/0x45 >> [   82.474276]  bpf_trampoline_6442457183_0+0x43/0x1000 >> [   82.474283]  nmi_handle+0x5/0x254 >> [   82.474289]  default_do_nmi+0x3d/0xf6 >> [   82.474293]  exc_nmi+0xa1/0x109 >> [   82.474297]  end_repeat_nmi+0x16/0x67 >> [   82.474300] RIP: 0010:cpu_online+0xa/0x12 >> [   82.474308] Code: 08 00 00 00 39 c6 0f 43 c6 83 c0 07 83 e0 f8 c3 cc cc cc >> cc 0f 1f 44 00 00 31 c0 c3 cc cc cc cc 89 ff 48 0f a3 3d 5f 52 75 01 <0f> 92 >> c0 c3 cc cc cc cc 55 48 89 e5 41 57 49 89 f7 41 56 49 896 >> [   82.474310] RSP: 0018:ffffc9000131bd38 EFLAGS: 00000283 >> [   82.474313] RAX: ffff88860b85fe78 RBX: 0000000000102cc0 RCX: 0000000000000008 >> [   82.474315] RDX: 0000000000000004 RSI: ffff88860b85fe78 RDI: 000000000000000e >> [   82.474316] RBP: 00000000ffffffff R08: 0000000000102cc0 R09: 00000000ffffffff >> [   82.474318] R10: 0000000000000001 R11: 0000000000000000 R12: ffff888100042200 >> [   82.474320] R13: 0000000000000004 R14: ffffffff81271dc2 R15: ffff88860b85fe78 >> [   82.474322]  ? kvmalloc_node+0x44/0xd2 >> [   82.474333]  ? cpu_online+0xa/0x12 >> [   82.474338]  ? cpu_online+0xa/0x12 >> [   82.474342]  >> [   82.474343]  >> [   82.474343]  trace_kmalloc+0x7c/0xe6 >> [   82.474347]  ? kvmalloc_node+0x44/0xd2 >> [   82.474350]  __kmalloc_node+0x9a/0xaf >> [   82.474354]  kvmalloc_node+0x44/0xd2 >> [   82.474359]  kvmemdup_bpfptr+0x29/0x66 >> [   82.474363]  map_update_elem+0x119/0x17d >> [   82.474370]  __sys_bpf+0x2ae/0xb2e >> [   82.474380]  __do_sys_bpf+0xd/0x15 >> [   82.474384]  do_syscall_64+0x6d/0x84 >> [   82.474387]  entry_SYSCALL_64_after_hwframe+0x63/0xcd >> [   82.474391] RIP: 0033:0x7fe75d4f752d >> [   82.474394] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 >> f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d >> 01 f0 ff ff 73 01 c3 48 8b 0d 2b 79 2c 00 f7 d8 64 89 018 >> [   82.474396] RSP: 002b:00007ffe95d1cd78 EFLAGS: 00000246 ORIG_RAX: >> 0000000000000141 >> [   82.474398] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fe75d4f752d >> [   82.474400] RDX: 0000000000000078 RSI: 00007ffe95d1cd80 RDI: 0000000000000002 >> [   82.474401] RBP: 00007ffe95d1ce30 R08: 0000000000000000 R09: 0000000000000004 >> [   82.474403] R10: 00007ffe95d1cd80 R11: 0000000000000246 R12: 00000000004007f0 >> [   82.474405] R13: 00007ffe95d1cf10 R14: 0000000000000000 R15: 0000000000000000 >> [   82.474412]  >> >> Signed-off-by: Tonghao Zhang >> Cc: Alexei Starovoitov >> Cc: Daniel Borkmann >> Cc: Andrii Nakryiko >> Cc: Martin KaFai Lau >> Cc: Song Liu >> Cc: Yonghong Song >> Cc: John Fastabend >> Cc: KP Singh >> Cc: Stanislav Fomichev >> Cc: Hao Luo >> Cc: Jiri Olsa >> Cc: Hou Tao >> --- >> previous discussion: >> https://lore.kernel.org/all/20221121100521.56601-2-xiangxia.m.yue@gmail.com/ >> --- >>   kernel/bpf/hashtab.c | 16 ++++++++++++++++ >>   1 file changed, 16 insertions(+) >> >> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c >> index 974f104f47a0..146433c9bd1a 100644 >> --- a/kernel/bpf/hashtab.c >> +++ b/kernel/bpf/hashtab.c >> @@ -161,6 +161,19 @@ static inline int htab_lock_bucket(const struct bpf_htab >> *htab, >>           return -EBUSY; >>       } >>   +    /* >> +     * The lock may be taken in both NMI and non-NMI contexts. >> +     * There is a lockdep warning (inconsistent lock state), if >> +     * enable lockdep. The potential deadlock happens when the >> +     * lock is contended from the same cpu. map_locked rejects >> +     * concurrent access to the same bucket from the same CPU. >> +     * When the lock is contended from a remote cpu, we would >> +     * like the remote cpu to spin and wait, instead of giving >> +     * up immediately. As this gives better throughput. So replacing >> +     * the current raw_spin_lock_irqsave() with trylock sacrifices >> +     * this performance gain. lockdep_off temporarily. >> +     */ >> +    lockdep_off(); > > seems it's better to define map_locked as a raw spin lock and use > trylock(map_locked[hash]) > to check if this cpu has locked the bucket. Redefining map_locked as per-cpu raw spinlock array seems a good idea, so we don't need to call lock_acquire() and lock_release() manually. But lockdep_off() is still needed here, else there will always be lockdep warning. > >>       raw_spin_lock_irqsave(&b->raw_lock, flags); >>       *pflags = flags; >>   @@ -172,7 +185,10 @@ static inline void htab_unlock_bucket(const struct >> bpf_htab *htab, >>                         unsigned long flags) >>   { >>       hash = hash & min_t(u32, HASHTAB_MAP_LOCK_MASK, htab->n_buckets -1); >> + >>       raw_spin_unlock_irqrestore(&b->raw_lock, flags); >> +    lockdep_on(); >> + >>       __this_cpu_dec(*(htab->map_locked[hash])); >>       preempt_enable(); >>   } > > .