From: Yonghong Song <yhs@fb.com>
To: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Cc: bpf@vger.kernel.org, Alexei Starovoitov <ast@kernel.org>,
Martin KaFai Lau <kafai@fb.com>,
Andrii Nakryiko <andrii@kernel.org>,
Daniel Borkmann <daniel@iogearbox.net>,
"toke@redhat.com" <toke@redhat.com>
Subject: Re: [PATCH bpf v1 2/3] bpf: Don't reinit map value in prealloc_lru_pop
Date: Mon, 8 Aug 2022 16:23:45 -0700 [thread overview]
Message-ID: <1052d7fa-3c36-6995-7455-1287fd8fab90@fb.com> (raw)
In-Reply-To: <CAP01T76W95FnsT26L=f6ErVWvjkxMg92o-XLGqP9zbHLEG1yvw@mail.gmail.com>
On 8/8/22 11:55 AM, Kumar Kartikeya Dwivedi wrote:
> On Mon, 8 Aug 2022 at 18:19, Yonghong Song <yhs@fb.com> wrote:
>>
>>
>>
>> On 8/8/22 4:18 AM, Kumar Kartikeya Dwivedi wrote:
>>> On Mon, 8 Aug 2022 at 08:09, Yonghong Song <yhs@fb.com> wrote:
>>>>
>>>>
>>>>
>>>> On 8/5/22 6:46 PM, Kumar Kartikeya Dwivedi wrote:
>>>>> The LRU map that is preallocated may have its elements reused while
>>>>> another program holds a pointer to it from bpf_map_lookup_elem. Hence,
>>>>> only check_and_free_fields is appropriate when the element is being
>>>>> deleted, as it ensures proper synchronization against concurrent access
>>>>> of the map value. After that, we cannot call check_and_init_map_value
>>>>> again as it may rewrite bpf_spin_lock, bpf_timer, and kptr fields while
>>>>> they can be concurrently accessed from a BPF program.
>>>>
>>>> If I understand correctly, one lru_node gets freed and pushed to free
>>>> list without doing check_and_free_fields().
>>>
>>> I don't think that's true, there is a check_and_free_fields call on
>>> deletion right before bpf_lru_push_free in htab_lru_push_free.
>>> Then once the preallocated items are freed on map destruction, we free
>>> timers and kptrs again, so if someone has access to preallocated items
>>> after freeing e.g. through an earlier lookup, we still release
>>> resources they may have created at the end of map's lifetime.
>>>
>>>> If later the same node is used with bpf_map_update_elem() and
>>>> prealloc_lru_pop() is called, then with this patch,
>>>> check_and_init_map_value() is not called, so the new node may contain
>>>> leftover values for kptr/timer/spin_lock, could this cause a problem?
>>>>
>>>
>>> This can only happen once you touch kptr/timer/spin_lock after
>>> deletion's check_and_free_fields call, but the program obtaining the
>>> new item will see and be able to handle that case. The timer helpers
>>> handle if an existing timer exists, kptr_xchg returns the old pointer
>>> as a reference you must release. For unreferenced kptr, it is marked
>>> as PTR_UNTRUSTED so a corrupted pointer value is possible but not
>>> fatal. If spin_lock is locked on lookup, then the other CPU having
>>> access to deleted-but-now-reallocated item will eventually call
>>> unlock.
>>
>> Thanks for explanation. Originally I think we should clear everything
>> including spin_lock before putting the deleted lru_node to free list.
>> check_and_free_fields() only did for kptr/timer but not spin_lock.
>>
>> But looks like we should not delete spin_lock before pushing the
>> deleted nodes to free list since lookup side may hold a reference
>> to the map value and it may have done a bpf_spin_lock() call.
>> And we should not clear spin_lock fields in check_and_free_fields()
>> and neither in prealloc_lru_pop() in map_update. Otherwise, we
>> may have issues for bpf_spin_unlock() on lookup side.
>>
>> It looks timer and kptr are already been handled for such
>> cases (concurrency between map_lookup() and clearing some map_value
>> fields for timer/kptr).
>>
>
> Yes, I also took a look again at other call sites of
> check_and_init_map_value and everything looks sane.
Sounds good.
>
>>>
>>> It is very much expected, IIUC, that someone else may use-after-free
>>> deleted items of hashtab.c maps in case of preallocation. It can be
>>> considered similar to how SLAB_TYPESAFE_BY_RCU behaves.
>>>
>>>> To address the above rewrite issue, maybe the solution should be
>>>> to push the deleted lru_nodes back to free list only after
>>>> rcu_read_unlock() is done?
>>>
>>> Please correct me if I'm wrong, but I don't think this is a good idea.
>>> Delaying preallocated item reuse for a RCU grace period will greatly
>>> increase the probability of running out of preallocated items under
>>> load, even though technically those items are free for use.
>>
>> Agree. This is not a good idea. It increased life cycle for preallocated
>> item reuse and will have some performance issue and resource consumption
>> issue.
>>
>>>
>>> Side note: I found the problem this patch fixes while reading the
>>> code, because I am running into this exact problem with my WIP skip
>>> list map implementation, where in the preallocated case, to make
>>> things a bit easier for the lockless lookup, I delay reuse of items
>>> until an RCU grace period passes (so that the deleted -> unlinked
>>> transition does not happen during traversal), but I'm easily able to
>>> come up with scenarios (producer/consumer situations) where that leads
>>> to exhaustion of the preallocated memory (and if not that, performance
>>> degradation on updates because pcpu_freelist now needs to search other
>>> CPU's caches more often).
Thinking again. I guess the following scenario is possible:
rcu_read_lock()
v = bpf_map_lookup_elem(&key);
t1 = v->field;
bpf_map_delete_elem(&key);
/* another bpf program triggering bpf_map_update_elem() */
/* the element with 'key' is reused */
/* the value is updated */
t2 = v->field;
...
rcu_read_unlock()
it is possible t1 and t2 may not be the same.
This should be an extremely corner case, not sure how to resolve
this easily without performance degradation.
>>>
>>> BTW, this would be fixed if we could simply use Alexei's per-CPU cache
>>> based memory allocator instead of preallocating items, because then
>>> the per-CPU cache gets replenished when it goes below the watermark
>>> (and also frees stuff back to kernel allocator above the high
>>> watermark, which is great for producer/consumer cases with alloc/free
>>> imbalance), so we can do the RCU delays similar to kmalloc case
>>> without running into the memory exhaustion problem. >>
>> Thanks for explanation. So okay the patch looks good to me.
>>
>>>
>>>>
>>>>>
>>>>> Fixes: 68134668c17f ("bpf: Add map side support for bpf timers.")
>>>>> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
>>
>> Acked-by: Yonghong Song <yhs@fb.com>
>>
>
> Thanks! I'll summarize our discussion in short and add it to the commit log.
next prev parent reply other threads:[~2022-08-08 23:24 UTC|newest]
Thread overview: 14+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-08-06 1:46 [PATCH bpf v1 0/3] Don't reinit map value in prealloc_lru_pop Kumar Kartikeya Dwivedi
2022-08-06 1:46 ` [PATCH bpf v1 1/3] bpf: Allow calling bpf_prog_test kfuncs in tracing programs Kumar Kartikeya Dwivedi
2022-08-06 1:46 ` [PATCH bpf v1 2/3] bpf: Don't reinit map value in prealloc_lru_pop Kumar Kartikeya Dwivedi
2022-08-08 6:09 ` Yonghong Song
2022-08-08 11:18 ` Kumar Kartikeya Dwivedi
2022-08-08 16:19 ` Yonghong Song
2022-08-08 18:55 ` Kumar Kartikeya Dwivedi
2022-08-08 23:23 ` Yonghong Song [this message]
2022-08-09 0:24 ` Kumar Kartikeya Dwivedi
2022-08-09 0:53 ` Kumar Kartikeya Dwivedi
2022-08-09 3:18 ` Alexei Starovoitov
2022-08-09 4:42 ` Kumar Kartikeya Dwivedi
2022-08-06 1:46 ` [PATCH bpf v1 3/3] selftests/bpf: Add test for prealloc_lru_pop bug Kumar Kartikeya Dwivedi
2022-08-08 11:36 ` Kumar Kartikeya Dwivedi
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=1052d7fa-3c36-6995-7455-1287fd8fab90@fb.com \
--to=yhs@fb.com \
--cc=andrii@kernel.org \
--cc=ast@kernel.org \
--cc=bpf@vger.kernel.org \
--cc=daniel@iogearbox.net \
--cc=kafai@fb.com \
--cc=memxor@gmail.com \
--cc=toke@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox