BPF List
 help / color / mirror / Atom feed
From: Yonghong Song <yhs@fb.com>
To: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Cc: bpf@vger.kernel.org, Alexei Starovoitov <ast@kernel.org>,
	Martin KaFai Lau <kafai@fb.com>,
	Andrii Nakryiko <andrii@kernel.org>,
	Daniel Borkmann <daniel@iogearbox.net>,
	"toke@redhat.com" <toke@redhat.com>
Subject: Re: [PATCH bpf v1 2/3] bpf: Don't reinit map value in prealloc_lru_pop
Date: Mon, 8 Aug 2022 16:23:45 -0700	[thread overview]
Message-ID: <1052d7fa-3c36-6995-7455-1287fd8fab90@fb.com> (raw)
In-Reply-To: <CAP01T76W95FnsT26L=f6ErVWvjkxMg92o-XLGqP9zbHLEG1yvw@mail.gmail.com>



On 8/8/22 11:55 AM, Kumar Kartikeya Dwivedi wrote:
> On Mon, 8 Aug 2022 at 18:19, Yonghong Song <yhs@fb.com> wrote:
>>
>>
>>
>> On 8/8/22 4:18 AM, Kumar Kartikeya Dwivedi wrote:
>>> On Mon, 8 Aug 2022 at 08:09, Yonghong Song <yhs@fb.com> wrote:
>>>>
>>>>
>>>>
>>>> On 8/5/22 6:46 PM, Kumar Kartikeya Dwivedi wrote:
>>>>> The LRU map that is preallocated may have its elements reused while
>>>>> another program holds a pointer to it from bpf_map_lookup_elem. Hence,
>>>>> only check_and_free_fields is appropriate when the element is being
>>>>> deleted, as it ensures proper synchronization against concurrent access
>>>>> of the map value. After that, we cannot call check_and_init_map_value
>>>>> again as it may rewrite bpf_spin_lock, bpf_timer, and kptr fields while
>>>>> they can be concurrently accessed from a BPF program.
>>>>
>>>> If I understand correctly, one lru_node gets freed and pushed to free
>>>> list without doing check_and_free_fields().
>>>
>>> I don't think that's true, there is a check_and_free_fields call on
>>> deletion right before bpf_lru_push_free in htab_lru_push_free.
>>> Then once the preallocated items are freed on map destruction, we free
>>> timers and kptrs again, so if someone has access to preallocated items
>>> after freeing e.g. through an earlier lookup, we still release
>>> resources they may have created at the end of map's lifetime.
>>>
>>>> If later the same node is used with bpf_map_update_elem() and
>>>> prealloc_lru_pop() is called, then with this patch,
>>>> check_and_init_map_value() is not called, so the new node may contain
>>>> leftover values for kptr/timer/spin_lock, could this cause a problem?
>>>>
>>>
>>> This can only happen once you touch kptr/timer/spin_lock after
>>> deletion's check_and_free_fields call, but the program obtaining the
>>> new item will see and be able to handle that case. The timer helpers
>>> handle if an existing timer exists, kptr_xchg returns the old pointer
>>> as a reference you must release. For unreferenced kptr, it is marked
>>> as PTR_UNTRUSTED so a corrupted pointer value is possible but not
>>> fatal. If spin_lock is locked on lookup, then the other CPU having
>>> access to deleted-but-now-reallocated item will eventually call
>>> unlock.
>>
>> Thanks for explanation. Originally I think we should clear everything
>> including spin_lock before putting the deleted lru_node to free list.
>> check_and_free_fields() only did for kptr/timer but not spin_lock.
>>
>> But looks like we should not delete spin_lock before pushing the
>> deleted nodes to free list since lookup side may hold a reference
>> to the map value and it may have done a bpf_spin_lock() call.
>> And we should not clear spin_lock fields in check_and_free_fields()
>> and neither in prealloc_lru_pop() in map_update. Otherwise, we
>> may have issues for bpf_spin_unlock() on lookup side.
>>
>> It looks timer and kptr are already been handled for such
>> cases (concurrency between map_lookup() and clearing some map_value
>> fields for timer/kptr).
>>
> 
> Yes, I also took a look again at other call sites of
> check_and_init_map_value and everything looks sane.

Sounds good.

> 
>>>
>>> It is very much expected, IIUC, that someone else may use-after-free
>>> deleted items of hashtab.c maps in case of preallocation. It can be
>>> considered similar to how SLAB_TYPESAFE_BY_RCU behaves.
>>>
>>>> To address the above rewrite issue, maybe the solution should be
>>>> to push the deleted lru_nodes back to free list only after
>>>> rcu_read_unlock() is done?
>>>
>>> Please correct me if I'm wrong, but I don't think this is a good idea.
>>> Delaying preallocated item reuse for a RCU grace period will greatly
>>> increase the probability of running out of preallocated items under
>>> load, even though technically those items are free for use.
>>
>> Agree. This is not a good idea. It increased life cycle for preallocated
>> item reuse and will have some performance issue and resource consumption
>> issue.
>>
>>>
>>> Side note: I found the problem this patch fixes while reading the
>>> code, because I am running into this exact problem with my WIP skip
>>> list map implementation, where in the preallocated case, to make
>>> things a bit easier for the lockless lookup, I delay reuse of items
>>> until an RCU grace period passes (so that the deleted -> unlinked
>>> transition does not happen during traversal), but I'm easily able to
>>> come up with scenarios (producer/consumer situations) where that leads
>>> to exhaustion of the preallocated memory (and if not that, performance
>>> degradation on updates because pcpu_freelist now needs to search other
>>> CPU's caches more often).

Thinking again. I guess the following scenario is possible:

      rcu_read_lock()
         v = bpf_map_lookup_elem(&key);
         t1 = v->field;
         bpf_map_delete_elem(&key);
            /* another bpf program triggering bpf_map_update_elem() */
            /* the element with 'key' is reused */
            /* the value is updated */
         t2 = v->field;
         ...
      rcu_read_unlock()

it is possible t1 and t2 may not be the same.
This should be an extremely corner case, not sure how to resolve
this easily without performance degradation.

>>>
>>> BTW, this would be fixed if we could simply use Alexei's per-CPU cache
>>> based memory allocator instead of preallocating items, because then
>>> the per-CPU cache gets replenished when it goes below the watermark
>>> (and also frees stuff back to kernel allocator above the high
>>> watermark, which is great for producer/consumer cases with alloc/free
>>> imbalance), so we can do the RCU delays similar to kmalloc case
>>> without running into the memory exhaustion problem. >>
>> Thanks for explanation. So okay the patch looks good to me.
>>
>>>
>>>>
>>>>>
>>>>> Fixes: 68134668c17f ("bpf: Add map side support for bpf timers.")
>>>>> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
>>
>> Acked-by: Yonghong Song <yhs@fb.com>
>>
> 
> Thanks! I'll summarize our discussion in short and add it to the commit log.

  reply	other threads:[~2022-08-08 23:24 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-08-06  1:46 [PATCH bpf v1 0/3] Don't reinit map value in prealloc_lru_pop Kumar Kartikeya Dwivedi
2022-08-06  1:46 ` [PATCH bpf v1 1/3] bpf: Allow calling bpf_prog_test kfuncs in tracing programs Kumar Kartikeya Dwivedi
2022-08-06  1:46 ` [PATCH bpf v1 2/3] bpf: Don't reinit map value in prealloc_lru_pop Kumar Kartikeya Dwivedi
2022-08-08  6:09   ` Yonghong Song
2022-08-08 11:18     ` Kumar Kartikeya Dwivedi
2022-08-08 16:19       ` Yonghong Song
2022-08-08 18:55         ` Kumar Kartikeya Dwivedi
2022-08-08 23:23           ` Yonghong Song [this message]
2022-08-09  0:24             ` Kumar Kartikeya Dwivedi
2022-08-09  0:53               ` Kumar Kartikeya Dwivedi
2022-08-09  3:18               ` Alexei Starovoitov
2022-08-09  4:42                 ` Kumar Kartikeya Dwivedi
2022-08-06  1:46 ` [PATCH bpf v1 3/3] selftests/bpf: Add test for prealloc_lru_pop bug Kumar Kartikeya Dwivedi
2022-08-08 11:36   ` Kumar Kartikeya Dwivedi

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1052d7fa-3c36-6995-7455-1287fd8fab90@fb.com \
    --to=yhs@fb.com \
    --cc=andrii@kernel.org \
    --cc=ast@kernel.org \
    --cc=bpf@vger.kernel.org \
    --cc=daniel@iogearbox.net \
    --cc=kafai@fb.com \
    --cc=memxor@gmail.com \
    --cc=toke@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox