From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 7F2AAC3DA7D for ; Thu, 5 Jan 2023 18:37:28 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230190AbjAESh1 (ORCPT ); Thu, 5 Jan 2023 13:37:27 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:55590 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230493AbjAEShB (ORCPT ); Thu, 5 Jan 2023 13:37:01 -0500 Received: from mail-pf1-x449.google.com (mail-pf1-x449.google.com [IPv6:2607:f8b0:4864:20::449]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 3E4C45B172 for ; Thu, 5 Jan 2023 10:37:00 -0800 (PST) Received: by mail-pf1-x449.google.com with SMTP id u3-20020a056a00124300b0056d4ab0c7cbso17736516pfi.7 for ; Thu, 05 Jan 2023 10:37:00 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=62IrXdFwq97c9jkpK7ggSdEpJlGkmmOdPYkxbYT25WE=; b=ER6aiQq0bg3r5KuUKWWoDnhZUK/GTNb7aF34tackGmTztTynIdtlGNVCGZvkpFpEGm ukBHTA/dqbrThyl4clv09q9VUB7qfZ8omOVwKD8LrhYCI9ug/LK6ClzbLPtFZsJth8WX UJEUN7NrdiLPTw+pAvc4XXARh0Zy+PE0RV+J08kQuw1PzKI2Bt99BW3EUgCMgUYjODBl J/zpjrIwE7dCNA4jidXlbK5So3qww7apIRfPnKeGuFgQzaEkEn9wLRTn4VWSUeAleUeq 8mbx4n1T3j9cZNtnK7ngHbPNMy/SRCEnx8otnEltn5DmhGFt5qG80kOc2j6OmWV73Ama o5zQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=62IrXdFwq97c9jkpK7ggSdEpJlGkmmOdPYkxbYT25WE=; b=iZNDwtG1Q+YxZ/QX4PBHc4LyW9CdkIMsYePA3uLRpCJJgDJ1Iz+1qKzlprl+Wat9SY XF96rG/Wo8v/NjQS7xHmX31yfcYfG1tuw5pLiZ6SuOumNqGgOdlYiFStHCSgKi57MMQe F4GevdUrZXil9OJiYmlBKPZb2C9L3Wsn014f0DvLGE8MWfoElCn8qgbhWtRAG5V7JEGG VAUB5j5wNIp16QPPTGf0LCWa2MIPNNAf3mNXtU0mmnftuLEh32TeseKpNa1ogx1vA1IK Velqsu8WHGbRQf5CFhXxxFsFX//6df8kL/MTW0KTTnHy+ypG8G0uxi+l91c2HOmexMWd TG8Q== X-Gm-Message-State: AFqh2koqyNOt5V1jXAApNfJRdqF5mNF8I1dms9gT+hPZ4y+Cnf84nuWG HGlQ9OlvDB7Un0BCctlIjyskiYs= X-Google-Smtp-Source: AMrXdXs20vXTc5m7kh29JNvZHu7CELlqJ9dN7R5XXiw9Z7gGCkbIgs+6SRpRnvuZOuMlD5isZW8bLiU= X-Received: from sdf.c.googlers.com ([fda3:e722:ac3:cc00:7f:e700:c0a8:5935]) (user=sdf job=sendgmr) by 2002:a05:6a00:e16:b0:581:3204:b6a9 with SMTP id bq22-20020a056a000e1600b005813204b6a9mr2167551pfb.30.1672943819718; Thu, 05 Jan 2023 10:36:59 -0800 (PST) Date: Thu, 5 Jan 2023 10:36:58 -0800 In-Reply-To: <20230105112749.38421-1-tong@infragraf.org> Mime-Version: 1.0 References: <20230105112749.38421-1-tong@infragraf.org> Message-ID: Subject: Re: [bpf-next v1] bpf: hash map, suppress lockdep warning From: sdf@google.com To: tong@infragraf.org Cc: bpf@vger.kernel.org, Alexei Starovoitov , Daniel Borkmann , Andrii Nakryiko , Martin KaFai Lau , Song Liu , Yonghong Song , John Fastabend , KP Singh , Hao Luo , Jiri Olsa , Hou Tao Content-Type: text/plain; charset="UTF-8"; format=flowed; delsp=yes Precedence: bulk List-ID: X-Mailing-List: bpf@vger.kernel.org On 01/05, tong@infragraf.org wrote: > From: Tonghao Zhang > The lock may be taken in both NMI and non-NMI contexts. > There is a lockdep warning (inconsistent lock state), if > enable lockdep. For performance, this patch doesn't use trylock, > and disable lockdep temporarily. > [ 82.474075] ================================ > [ 82.474076] WARNING: inconsistent lock state > [ 82.474090] 6.1.0+ #48 Tainted: G E > [ 82.474093] -------------------------------- > [ 82.474100] inconsistent {INITIAL USE} -> {IN-NMI} usage. > [ 82.474101] kprobe-load/1740 [HC1[1]:SC0[0]:HE0:SE1] takes: > [ 82.474105] ffff88860a5cf7b0 (&htab->lockdep_key){....}-{2:2}, at: > htab_lock_bucket+0x61/0x6c > [ 82.474120] {INITIAL USE} state was registered at: > [ 82.474122] mark_usage+0x1d/0x11d > [ 82.474130] __lock_acquire+0x3c9/0x6ed > [ 82.474131] lock_acquire+0x23d/0x29a > [ 82.474135] _raw_spin_lock_irqsave+0x43/0x7f > [ 82.474148] htab_lock_bucket+0x61/0x6c > [ 82.474151] htab_map_update_elem+0x11e/0x220 > [ 82.474155] bpf_map_update_value+0x267/0x28e > [ 82.474160] map_update_elem+0x13e/0x17d > [ 82.474164] __sys_bpf+0x2ae/0xb2e > [ 82.474167] __do_sys_bpf+0xd/0x15 > [ 82.474171] do_syscall_64+0x6d/0x84 > [ 82.474174] entry_SYSCALL_64_after_hwframe+0x63/0xcd > [ 82.474178] irq event stamp: 1496498 > [ 82.474180] hardirqs last enabled at (1496497): [] > syscall_enter_from_user_mode+0x63/0x8d > [ 82.474184] hardirqs last disabled at (1496498): [] > exc_nmi+0x87/0x109 > [ 82.474187] softirqs last enabled at (1446698): [] > __do_softirq+0x347/0x387 > [ 82.474191] softirqs last disabled at (1446693): [] > __irq_exit_rcu+0x67/0xc6 > [ 82.474195] > [ 82.474195] other info that might help us debug this: > [ 82.474196] Possible unsafe locking scenario: > [ 82.474196] > [ 82.474197] CPU0 > [ 82.474198] ---- > [ 82.474198] lock(&htab->lockdep_key); > [ 82.474200] > [ 82.474200] lock(&htab->lockdep_key); > [ 82.474201] > [ 82.474201] *** DEADLOCK *** > [ 82.474201] > [ 82.474202] no locks held by kprobe-load/1740. > [ 82.474203] > [ 82.474203] stack backtrace: > [ 82.474205] CPU: 14 PID: 1740 Comm: kprobe-load Tainted: G > E 6.1.0+ #48 > [ 82.474208] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), > BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014 > [ 82.474213] Call Trace: > [ 82.474218] > [ 82.474224] dump_stack_lvl+0x57/0x81 > [ 82.474228] lock_acquire+0x1f4/0x29a > [ 82.474233] ? htab_lock_bucket+0x61/0x6c > [ 82.474237] ? rcu_read_lock_held_common+0xe/0x38 > [ 82.474245] _raw_spin_lock_irqsave+0x43/0x7f > [ 82.474249] ? htab_lock_bucket+0x61/0x6c > [ 82.474253] htab_lock_bucket+0x61/0x6c > [ 82.474257] htab_map_update_elem+0x11e/0x220 > [ 82.474264] bpf_prog_df326439468c24a9_bpf_prog1+0x41/0x45 > [ 82.474276] bpf_trampoline_6442457183_0+0x43/0x1000 > [ 82.474283] nmi_handle+0x5/0x254 > [ 82.474289] default_do_nmi+0x3d/0xf6 > [ 82.474293] exc_nmi+0xa1/0x109 > [ 82.474297] end_repeat_nmi+0x16/0x67 > [ 82.474300] RIP: 0010:cpu_online+0xa/0x12 > [ 82.474308] Code: 08 00 00 00 39 c6 0f 43 c6 83 c0 07 83 e0 f8 c3 cc > cc cc cc 0f 1f 44 00 00 31 c0 c3 cc cc cc cc 89 ff 48 0f a3 3d 5f 52 75 > 01 <0f> 92 c0 c3 cc cc cc cc 55 48 89 e5 41 57 49 89 f7 41 56 49 896 > [ 82.474310] RSP: 0018:ffffc9000131bd38 EFLAGS: 00000283 > [ 82.474313] RAX: ffff88860b85fe78 RBX: 0000000000102cc0 RCX: > 0000000000000008 > [ 82.474315] RDX: 0000000000000004 RSI: ffff88860b85fe78 RDI: > 000000000000000e > [ 82.474316] RBP: 00000000ffffffff R08: 0000000000102cc0 R09: > 00000000ffffffff > [ 82.474318] R10: 0000000000000001 R11: 0000000000000000 R12: > ffff888100042200 > [ 82.474320] R13: 0000000000000004 R14: ffffffff81271dc2 R15: > ffff88860b85fe78 > [ 82.474322] ? kvmalloc_node+0x44/0xd2 > [ 82.474333] ? cpu_online+0xa/0x12 > [ 82.474338] ? cpu_online+0xa/0x12 > [ 82.474342] > [ 82.474343] > [ 82.474343] trace_kmalloc+0x7c/0xe6 > [ 82.474347] ? kvmalloc_node+0x44/0xd2 > [ 82.474350] __kmalloc_node+0x9a/0xaf > [ 82.474354] kvmalloc_node+0x44/0xd2 > [ 82.474359] kvmemdup_bpfptr+0x29/0x66 > [ 82.474363] map_update_elem+0x119/0x17d > [ 82.474370] __sys_bpf+0x2ae/0xb2e > [ 82.474380] __do_sys_bpf+0xd/0x15 > [ 82.474384] do_syscall_64+0x6d/0x84 > [ 82.474387] entry_SYSCALL_64_after_hwframe+0x63/0xcd > [ 82.474391] RIP: 0033:0x7fe75d4f752d > [ 82.474394] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa > 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f > 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 2b 79 2c 00 f7 d8 64 89 018 > [ 82.474396] RSP: 002b:00007ffe95d1cd78 EFLAGS: 00000246 ORIG_RAX: > 0000000000000141 > [ 82.474398] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: > 00007fe75d4f752d > [ 82.474400] RDX: 0000000000000078 RSI: 00007ffe95d1cd80 RDI: > 0000000000000002 > [ 82.474401] RBP: 00007ffe95d1ce30 R08: 0000000000000000 R09: > 0000000000000004 > [ 82.474403] R10: 00007ffe95d1cd80 R11: 0000000000000246 R12: > 00000000004007f0 > [ 82.474405] R13: 00007ffe95d1cf10 R14: 0000000000000000 R15: > 0000000000000000 > [ 82.474412] > Signed-off-by: Tonghao Zhang > Cc: Alexei Starovoitov > Cc: Daniel Borkmann > Cc: Andrii Nakryiko > Cc: Martin KaFai Lau > Cc: Song Liu > Cc: Yonghong Song > Cc: John Fastabend > Cc: KP Singh > Cc: Stanislav Fomichev > Cc: Hao Luo > Cc: Jiri Olsa > Cc: Hou Tao > --- > previous discussion: > https://lore.kernel.org/all/20221121100521.56601-2-xiangxia.m.yue@gmail.com/ > --- > kernel/bpf/hashtab.c | 16 ++++++++++++++++ > 1 file changed, 16 insertions(+) > diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c > index 974f104f47a0..146433c9bd1a 100644 > --- a/kernel/bpf/hashtab.c > +++ b/kernel/bpf/hashtab.c > @@ -161,6 +161,19 @@ static inline int htab_lock_bucket(const struct > bpf_htab *htab, > return -EBUSY; > } > + /* > + * The lock may be taken in both NMI and non-NMI contexts. > + * There is a lockdep warning (inconsistent lock state), if > + * enable lockdep. The potential deadlock happens when the > + * lock is contended from the same cpu. map_locked rejects > + * concurrent access to the same bucket from the same CPU. > + * When the lock is contended from a remote cpu, we would > + * like the remote cpu to spin and wait, instead of giving > + * up immediately. As this gives better throughput. So replacing > + * the current raw_spin_lock_irqsave() with trylock sacrifices > + * this performance gain. lockdep_off temporarily. So is this a real issue or not? Which NMI context path leads to taking this lock? If it can be taken by both nmi and non-nmi context, will we see a real deadlock? Also, how temporary is your 'temporarily'? Is there a plan to properly fix it? > + */ > + lockdep_off(); > raw_spin_lock_irqsave(&b->raw_lock, flags); > *pflags = flags; > @@ -172,7 +185,10 @@ static inline void htab_unlock_bucket(const struct > bpf_htab *htab, > unsigned long flags) > { > hash = hash & min_t(u32, HASHTAB_MAP_LOCK_MASK, htab->n_buckets -1); > + > raw_spin_unlock_irqrestore(&b->raw_lock, flags); > + lockdep_on(); > + > __this_cpu_dec(*(htab->map_locked[hash])); > preempt_enable(); > } > -- > 2.27.0