From: Waiman Long
Date: Tue, 29 Nov 2022 22:06:52 -0500
Subject: Re: [net-next] bpf: avoid hashtab deadlock with try_lock
To: Tonghao Zhang, Hou Tao
Cc: Hao Luo, Peter Zijlstra, Ingo Molnar, Will Deacon, netdev@vger.kernel.org, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev, Jiri Olsa, bpf, "houtao1@huawei.com", LKML, Boqun Feng
Message-ID: <9455ff51-098c-87f0-dc83-2303921032a2@redhat.com>

On 11/29/22 21:47, Tonghao Zhang wrote:
> On Wed, Nov 30, 2022 at 9:50 AM Hou Tao wrote:
>> Hi Hao,
>>
>> On 11/30/2022 3:36 AM, Hao Luo wrote:
>>> On Tue, Nov 29, 2022 at 9:32 AM Boqun Feng wrote:
>>>> Just to be clear, I meant to refactor htab_lock_bucket() into a
>>>> try-lock pattern. Also, after a second thought, the suggestion below
>>>> doesn't work. I think the proper way is to make htab_lock_bucket()
>>>> use raw_spin_trylock_irqsave().
>>>>
>>>> Regards,
>>>> Boqun
>>>>
>>> The potential deadlock happens when the lock is contended from the
>>> same cpu.
>>> When the lock is contended from a remote cpu, we would like
>>> the remote cpu to spin and wait instead of giving up immediately,
>>> as this gives better throughput. So replacing the current
>>> raw_spin_lock_irqsave() with a trylock sacrifices this performance
>>> gain.
>>>
>>> I suspect the source of the problem is the 'hash' that we use in
>>> htab_lock_bucket(). The 'hash' is derived from the 'key', and I
>>> wonder whether we should instead use a hash derived from the
>>> 'bucket', for example from the memory address of the 'bucket'.
>>> Different keys may fall into the same bucket but yield different
>>> hashes. If the same bucket can never have two different 'hashes'
>>> here, the map_locked check should behave as intended. Also, because
>>> ->map_locked is per-cpu, execution flows from two different cpus can
>>> both pass it.
>> The warning from lockdep is due to the bucket lock A being used first
>> in a non-NMI context and then the same bucket lock being used in an
>> NMI context, so
> Yes, I tested with lockdep too. We can't take the lock in NMI context
> (only trylock works fine there) if we also use it in non-NMI context;
> otherwise lockdep prints the warning.
>
> * For the deadlock case, we can use either:
>   1. hash & min(HASHTAB_MAP_LOCK_MASK, htab->n_buckets - 1), or
>   2. the hash bucket address.
>
> * For the lockdep warning, we should use an in_nmi() check together
>   with map_locked.
>
> BTW, the patch doesn't work, so we can remove the lock_key:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c50eb518e262fa06bd334e6eec172eaf5d7a5bd9
>
> static inline int htab_lock_bucket(const struct bpf_htab *htab,
>                                    struct bucket *b, u32 hash,
>                                    unsigned long *pflags)
> {
>         unsigned long flags;
>
>         hash = hash & min(HASHTAB_MAP_LOCK_MASK, htab->n_buckets - 1);
>
>         preempt_disable();
>         if (unlikely(__this_cpu_inc_return(*(htab->map_locked[hash])) != 1)) {
>                 __this_cpu_dec(*(htab->map_locked[hash]));
>                 preempt_enable();
>                 return -EBUSY;
>         }
>
>         if (in_nmi()) {
>                 if (!raw_spin_trylock_irqsave(&b->raw_lock, flags))
>                         return -EBUSY;

That is not right. You have to do the same steps as above: decrement the
percpu count and re-enable preemption. So you may want to put all these
busy_out steps after the "return 0" and use "goto busy_out;" to jump there.

>         } else {
>                 raw_spin_lock_irqsave(&b->raw_lock, flags);
>         }
>
>         *pflags = flags;
>         return 0;
> }

BTW, with that change, I believe you can actually remove all the percpu
map_locked count code.

Cheers,
Longman