BPF List
From: Feng Zhou <zhoufeng.zf@bytedance.com>
To: Yonghong Song <yhs@fb.com>,
	Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Alexei Starovoitov <ast@kernel.org>,
	Daniel Borkmann <daniel@iogearbox.net>,
	Andrii Nakryiko <andrii@kernel.org>,
	Martin KaFai Lau <kafai@fb.com>, Song Liu <songliubraving@fb.com>,
	John Fastabend <john.fastabend@gmail.com>,
	KP Singh <kpsingh@kernel.org>,
	Network Development <netdev@vger.kernel.org>,
	bpf <bpf@vger.kernel.org>, LKML <linux-kernel@vger.kernel.org>,
	Xiongchun Duan <duanxiongchun@bytedance.com>,
	Muchun Song <songmuchun@bytedance.com>,
	Dongdong Wang <wangdongdong.6@bytedance.com>,
	Cong Wang <cong.wang@bytedance.com>,
	Chengming Zhou <zhouchengming@bytedance.com>
Subject: Re: [External] Re: [PATCH] bpf: avoid grabbing spin_locks of all cpus when no free elems
Date: Mon, 23 May 2022 10:24:32 +0800
Message-ID: <877ac441-045b-1844-6938-fcaee5eee7f2@bytedance.com>
In-Reply-To: <0f904395-350d-5ee7-152e-93d104742e98@fb.com>

On 2022/5/20 12:45 AM, Yonghong Song wrote:
>
>
> On 5/18/22 8:12 PM, Feng Zhou wrote:
>> On 2022/5/19 4:39 AM, Yonghong Song wrote:
>>>
>>>
>>> On 5/17/22 11:57 PM, Feng Zhou wrote:
>>>> On 2022/5/18 2:32 PM, Alexei Starovoitov wrote:
>>>>> On Tue, May 17, 2022 at 11:27 PM Feng zhou 
>>>>> <zhoufeng.zf@bytedance.com> wrote:
>>>>>> From: Feng Zhou <zhoufeng.zf@bytedance.com>
>>>>>>
>>>>>> We encountered a bad case on a large system with 96 CPUs where
>>>>>> alloc_htab_elem() could take 1 ms. The reason is that once the
>>>>>> preallocated hashtab has no free elems, every update still grabs
>>>>>> the spin_locks of all CPUs. With multiple concurrent updaters, the
>>>>>> lock contention is severe.
>>>>>>
>>>>>> So this patch adds is_empty to pcpu_freelist_head to record whether
>>>>>> a freelist has free elems. If it does, grab the spin_lock; otherwise
>>>>>> move on to the next CPU's freelist.
>>>>>>
>>>>>> Before patch: hash_map performance
>>>>>> ./map_perf_test 1
>>>
>>> Could you explain what the parameter '1' means here?
>>
>> The code is in:
>> samples/bpf/map_perf_test_user.c
>> samples/bpf/map_perf_test_kern.c
>> Parameter '1' is a testcase flag that selects the hash_map
>> performance test.
>> Parameter '2048' selects the hash_map performance test with no free
>> elems (free=0). I added testcase flag '2048' myself to reproduce the
>> problem.
>>
>>>
>>>>>> 0:hash_map_perf pre-alloc 975345 events per sec
>>>>>> 4:hash_map_perf pre-alloc 855367 events per sec
>>>>>> 12:hash_map_perf pre-alloc 860862 events per sec
>>>>>> 8:hash_map_perf pre-alloc 849561 events per sec
>>>>>> 3:hash_map_perf pre-alloc 849074 events per sec
>>>>>> 6:hash_map_perf pre-alloc 847120 events per sec
>>>>>> 10:hash_map_perf pre-alloc 845047 events per sec
>>>>>> 5:hash_map_perf pre-alloc 841266 events per sec
>>>>>> 14:hash_map_perf pre-alloc 849740 events per sec
>>>>>> 2:hash_map_perf pre-alloc 839598 events per sec
>>>>>> 9:hash_map_perf pre-alloc 838695 events per sec
>>>>>> 11:hash_map_perf pre-alloc 845390 events per sec
>>>>>> 7:hash_map_perf pre-alloc 834865 events per sec
>>>>>> 13:hash_map_perf pre-alloc 842619 events per sec
>>>>>> 1:hash_map_perf pre-alloc 804231 events per sec
>>>>>> 15:hash_map_perf pre-alloc 795314 events per sec
>>>>>>
>>>>>> hash_map worst case: no free elems
>>>>>> ./map_perf_test 2048
>>>>>> 6:worse hash_map_perf pre-alloc 28628 events per sec
>>>>>> 5:worse hash_map_perf pre-alloc 28553 events per sec
>>>>>> 11:worse hash_map_perf pre-alloc 28543 events per sec
>>>>>> 3:worse hash_map_perf pre-alloc 28444 events per sec
>>>>>> 1:worse hash_map_perf pre-alloc 28418 events per sec
>>>>>> 7:worse hash_map_perf pre-alloc 28427 events per sec
>>>>>> 13:worse hash_map_perf pre-alloc 28330 events per sec
>>>>>> 14:worse hash_map_perf pre-alloc 28263 events per sec
>>>>>> 9:worse hash_map_perf pre-alloc 28211 events per sec
>>>>>> 15:worse hash_map_perf pre-alloc 28193 events per sec
>>>>>> 12:worse hash_map_perf pre-alloc 28190 events per sec
>>>>>> 10:worse hash_map_perf pre-alloc 28129 events per sec
>>>>>> 8:worse hash_map_perf pre-alloc 28116 events per sec
>>>>>> 4:worse hash_map_perf pre-alloc 27906 events per sec
>>>>>> 2:worse hash_map_perf pre-alloc 27801 events per sec
>>>>>> 0:worse hash_map_perf pre-alloc 27416 events per sec
>>>>>> 3:worse hash_map_perf pre-alloc 28188 events per sec
>>>>>>
>>>>>> ftrace trace
>>>>>>
>>>>>> 0)               |  htab_map_update_elem() {
>>>>>> 0)   0.198 us    |    migrate_disable();
>>>>>> 0)               |    _raw_spin_lock_irqsave() {
>>>>>> 0)   0.157 us    |      preempt_count_add();
>>>>>> 0)   0.538 us    |    }
>>>>>> 0)   0.260 us    |    lookup_elem_raw();
>>>>>> 0)               |    alloc_htab_elem() {
>>>>>> 0)               |      __pcpu_freelist_pop() {
>>>>>> 0)               |        _raw_spin_lock() {
>>>>>> 0)   0.152 us    |          preempt_count_add();
>>>>>> 0)   0.352 us    |          native_queued_spin_lock_slowpath();
>>>>>> 0)   1.065 us    |        }
>>>>>>                   |        ...
>>>>>> 0)               |        _raw_spin_unlock() {
>>>>>> 0)   0.254 us    |          preempt_count_sub();
>>>>>> 0)   0.555 us    |        }
>>>>>> 0) + 25.188 us   |      }
>>>>>> 0) + 25.486 us   |    }
>>>>>> 0)               |    _raw_spin_unlock_irqrestore() {
>>>>>> 0)   0.155 us    |      preempt_count_sub();
>>>>>> 0)   0.454 us    |    }
>>>>>> 0)   0.148 us    |    migrate_enable();
>>>>>> 0) + 28.439 us   |  }
>>>>>>
>>>>>> The test machine has 16 cores, and the update tries to take a
>>>>>> spin_lock 17 times: once for each of the 16 per-CPU freelists and
>>>>>> once for the extralist.
>>>>> Is this with small max_entries and a large number of cpus?
>>>>>
>>>>> If so, a better fix would probably be to artificially
>>>>> bump max_entries to 4x num_cpus.
>>>>> A racy is_empty check still wastes the loop.
>>>>
>>>> The hash_map worst-case test ran on 16 CPUs, with the map's
>>>> max_entries set to 1000.
>>>>
>>>> I constructed this test case to fill the map on purpose and then
>>>> keep updating it, just to reproduce the problem.
>>>>
>>>> The bad case we encountered was on 96 CPUs, with the map's
>>>> max_entries set to 10240.
>>>
>>> For such cases, most likely the map is *almost* full. What is the 
>>> performance if we increase map size, e.g., from 10240 to 16K(16192)?
>>
>> Yes, increasing max_entries can temporarily solve this problem, but
>> once the 16K entries are used up, the problem comes back. This patch
>> tries to fix this corner case.
>
> Okay, if I understand correctly, in your use case, you have lots of 
> different keys and your intention is NOT to capture all the keys in
> the hash table. So given a hash table, it is possible that the hash
> will become full even if you increase the hashtable size.
>
> Maybe you will occasionally delete some keys which will free some
> space but the space will be quickly occupied by the new updates.
>
> For such cases, yes, checking whether the free list is empty
> before taking the lock should be helpful. But I am wondering
> what the rationale behind your use case is.

My use case is to monitor the server's network traffic, using the
five-tuple as the key. When network traffic surges, the hash_map can
fill up.
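
To make the idea discussed in this thread concrete, here is a minimal
userspace sketch of "peek at the freelist before taking its lock". It is
not the patch itself and not kernel code: the struct and function names,
NR_CPUS, the toy spinlock, and the lock counter are all assumptions made
for illustration. It also peeks at the list head pointer rather than a
separate is_empty flag, and it leaves out the extralist.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_CPUS 4 /* hypothetical CPU count for this sketch */

struct node {
	struct node *next;
};

struct freelist_head {
	_Atomic(struct node *) first; /* NULL means this list is empty */
	atomic_bool locked;           /* toy spinlock, stands in for a kernel spinlock */
	unsigned long lock_count;     /* demo only: how often the lock was taken */
};

struct freelist {
	struct freelist_head cpu[NR_CPUS];
};

static void head_lock(struct freelist_head *h)
{
	while (atomic_exchange_explicit(&h->locked, true, memory_order_acquire))
		; /* spin until the previous holder releases the lock */
	h->lock_count++; /* updated only while holding the lock */
}

static void head_unlock(struct freelist_head *h)
{
	atomic_store_explicit(&h->locked, false, memory_order_release);
}

static void freelist_init(struct freelist *fl)
{
	for (int i = 0; i < NR_CPUS; i++) {
		atomic_init(&fl->cpu[i].first, NULL);
		atomic_init(&fl->cpu[i].locked, false);
		fl->cpu[i].lock_count = 0;
	}
}

static void freelist_push(struct freelist *fl, int cpu, struct node *n)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	struct freelist_head *h = &fl->cpu[cpu];

	head_lock(h);
	n->next = atomic_load_explicit(&h->first, memory_order_relaxed);
	atomic_store_explicit(&h->first, n, memory_order_relaxed);
	head_unlock(h);
}

static struct node *freelist_pop(struct freelist *fl, int start_cpu)
{
	assert(start_cpu >= 0 && start_cpu < NR_CPUS);

	for (int i = 0; i < NR_CPUS; i++) {
		struct freelist_head *h = &fl->cpu[(start_cpu + i) % NR_CPUS];
		struct node *n;

		/* Racy peek without the lock: skip lists that look empty. */
		if (!atomic_load_explicit(&h->first, memory_order_relaxed))
			continue;

		/* The list looked non-empty, so re-check under the lock. */
		head_lock(h);
		n = atomic_load_explicit(&h->first, memory_order_relaxed);
		if (n)
			atomic_store_explicit(&h->first, n->next,
					      memory_order_relaxed);
		head_unlock(h);
		if (n)
			return n;
	}
	return NULL; /* every list looked empty or was drained by someone else */
}

/* Demo helper: total lock acquisitions; call only when no thread is active. */
static unsigned long freelist_total_locks(const struct freelist *fl)
{
	unsigned long total = 0;

	for (int i = 0; i < NR_CPUS; i++)
		total += fl->cpu[i].lock_count;
	return total;
}
```

With all lists empty, freelist_pop() in this sketch returns NULL without
taking any lock, which is the behaviour wanted when the map is full. The
peek is racy, so a pop can miss an element that another CPU frees at the
same moment; the caller then sees "no free elems" for that one update.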





