From: Philo Lu <lulie@linux.alibaba.com>
To: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Cc: bpf@vger.kernel.org, song@kernel.org, andrii@kernel.org,
	ast@kernel.org, Daniel Borkmann <daniel@iogearbox.net>,
	xuanzhuo@linux.alibaba.com, dust.li@linux.alibaba.com,
	guwen@linux.alibaba.com, alibuda@linux.alibaba.com,
	hengqi@linux.alibaba.com, Nathan Slingerland <slinger@meta.com>,
	"rihams@meta.com" <rihams@meta.com>,
	Alan Maguire <alan.maguire@oracle.com>,
	Dmitry Vyukov <dvyukov@google.com>
Subject: Re: Question about bpf perfbuf/ringbuf: pinned in backend with overwriting
Date: Mon, 18 Dec 2023 20:58:25 +0800	[thread overview]
Message-ID: <82ae47cf-74ae-445a-b00c-e068f49a348a@linux.alibaba.com> (raw)
In-Reply-To: <CACT4Y+bb7DuQXQ=-PRO4FteRz_4OLsRw0tXFKqNiOoT6UOFLaA@mail.gmail.com>



On 2023/12/16 16:50, Dmitry Vyukov wrote:
> On Fri, 15 Dec 2023 at 23:39, Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
>>> On 2023/12/14 07:35, Andrii Nakryiko wrote:
>>>> On Mon, Dec 11, 2023 at 4:39 AM Philo Lu <lulie@linux.alibaba.com> wrote:
>>>>>
>>>>>
>>>>>
>>>>> On 2023/12/9 06:32, Andrii Nakryiko wrote:
>>>>>> On Thu, Dec 7, 2023 at 6:49 AM Alan Maguire <alan.maguire@oracle.com> wrote:
>>>>>>>
>>>>>>> On 07/12/2023 13:15, Philo Lu wrote:
>>>>>>>> Hi all. I have a question when using perfbuf/ringbuf in bpf. I will
>>>>>>>> appreciate it if you give me any advice.
>>>>>>>>
>>>>>>>> Imagine a simple case: the bpf program outputs a log (some tcp
>>>>>>>> statistics) to user space every time a packet is received, and the user
>>>>>>>> actively reads the logs if desired. I do not want to keep a user process
>>>>>>>> alive, waiting for outputs of the buffer. The user can read the buffer as
>>>>>>>> needed. BTW, the order does not matter.
>>>>>>>>
>>>>>>>> To conclude, I hope the buffer performs like relayfs: (1) no need for
>>>>>>>> user process to receive logs, and the user may read at any time (and no
>>>>>>>> wakeup would be better); (2) old data can be overwritten by new ones.
>>>>>>>>
>>>>>>>> Currently, it seems that perfbuf and ringbuf cannot satisfy both: (i)
>>>>>>>> ringbuf: only satisfies (1). However, if data arrive when the buffer is
>>>>>>>> full, the new data will be lost, until the buffer is consumed. (ii)
>>>>>>>> perfbuf: only satisfies (2). But user cannot access the buffer after the
>>>>>>>> process who creates it (including perf_event.rb via mmap) exits.
>>>>>>>> Specifically, I can use BPF_F_PRESERVE_ELEMS flag to keep the
>>>>>>>> perf_events, but I do not know how to get the buffer again in a new
>>>>>>>> process.
>>>>>>>>
>>>>>>>> In my opinion, this can be solved by either of the following: (a) add
>>>>>>>> overwrite support in ringbuf (maybe a new flag for reserve), but we have
>>>>>>>> to address synchronization between kernel and user, especially under
>>>>>>>> variable data size, because when overwriting occurs, kernel has to
>>>>>>>> update the consumer position too; (b) implement map_fd_sys_lookup_elem for
>>>>>>>> perfbuf to expose fds to user space via the map_lookup_elem syscall, and a
>>>>>>>> mechanism is needed to preserve perf_event->rb when the process exits
>>>>>>>> (otherwise the buffer will be freed by perf_mmap_close). I am not sure
>>>>>>>> if they are feasible, and which is better. If not, perhaps we can
>>>>>>>> develop another mechanism to achieve this?
>>>>>>>>
>>>>>>>
>>>>>>> There was an RFC a while back focused on supporting BPF ringbuf
>>>>>>> over-writing [1]; at the time, Andrii noted some potential issues that
>>>>>>> might be exposed by doing multiple ringbuf reserves to overfill the
>>>>>>> buffer within the same program.
>>>>>>>
>>>>>>
>>>>>> Correct. I don't think it's possible to correctly and safely support
>>>>>> overwriting with BPF ringbuf that has variable-sized elements.
>>>>>>
>>>>>> We'll need to implement MPMC ringbuf (probably with fixed sized
>>>>>> element size) to be able to support this.
>>>>>>
>>>>>
>>>>> Thank you very much!
>>>>>
>>>>> If it is indeed difficult with ringbuf, maybe I can implement a new type
>>>>> of bpf map based on the relay interface [1]? e.g., init relay during map
>>>>> creation, write into it with a bpf helper, and then the user can access it
>>>>> in the filesystem. I think it will be a simple but useful map for
>>>>> overwritable data transfer.
>>>>
>>>> I don't know much about relay, tbh. Give it a try, I guess.
>>>> Alternatively, we need better and faster implementation of
>>>> BPF_MAP_TYPE_QUEUE, which seems like the data structure that can
>>>> support overwriting and generally be a fixed element size
>>>> alternative/complement to BPF ringbuf.
>>>>
>>>
>>> Thank you for your reply. I am afraid BPF_MAP_TYPE_QUEUE cannot get rid
>>> of locking overheads with concurrent reading and writing by design, and
>>
>> I disagree, I think [0] from Dmitry Vyukov is one way to implement
>> lock-free BPF_MAP_TYPE_QUEUE. I don't know how easy it would be to
>> implement overwriting support, but it would be worth considering.
>>
>>    [0] https://www.1024cores.net/home/lock-free-algorithms/queues/bounded-mpmc-queue
> 
> 
> I am missing some context here. But note that this queue is not
> formally lock-free. While it's usually faster and more scalable than
> mutex-protected queues, stuck readers and writers will eventually
> block each other. Being stuck for a short time is not a problem because
> the queue allows parallelism for both readers and writers. But if
> threads get stuck for a long time and the queue wraps around so that
> writers try to write to elements being read/written by slow threads,
> they block. Similarly, readers get blocked by slow writers even if
> there are other fully written elements in the queue already.
> The queue is not serializable either, which may be surprising in some cases.
> 
> Adding overwriting support may be an interesting exercise.
> I guess readers could use some variation of a seqlock to deal with
> elements that are being overwritten.
> Writers can already skip over other slow writers. Normally this is
> used w/o wrap-around, but I suspect it can just work with wrap-around
> as well (a writer can skip over a writer stuck on the previous lap).
> Since we overwrite elements, the queue provides only a very weak
> notion of FIFO anyway, so skipping over very old writers may be fine.

Thanks for these hints. The MPMC queue with a seqlock could be an
effective way to improve BPF_MAP_TYPE_QUEUE. But I don't think it
will work well in our case.
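For concreteness, my understanding of the bounded MPMC queue at [0] is
roughly the C11 sketch below: each cell carries a sequence number that
tells producers and consumers whether the cell is theirs on the current
lap. This is only an illustration of the scheme (all names are mine,
not kernel code), but it shows the blocking Dmitry mentions: a writer
stuck between claiming a slot and publishing its sequence stalls the
reader of that cell once the queue wraps around.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Illustrative bounded MPMC queue; capacity must be a power of two. */
struct cell {
    atomic_size_t seq;  /* lap marker: tells who may use this cell */
    int data;
};

struct mpmc {
    struct cell *buf;
    size_t mask;
    atomic_size_t enq_pos;
    atomic_size_t deq_pos;
};

static void mpmc_init(struct mpmc *q, size_t cap /* power of two */)
{
    q->buf = calloc(cap, sizeof(*q->buf));
    q->mask = cap - 1;
    for (size_t i = 0; i < cap; i++)
        atomic_store(&q->buf[i].seq, i);
    atomic_store(&q->enq_pos, 0);
    atomic_store(&q->deq_pos, 0);
}

static bool mpmc_enqueue(struct mpmc *q, int v)
{
    size_t pos = atomic_load(&q->enq_pos);
    for (;;) {
        struct cell *c = &q->buf[pos & q->mask];
        size_t seq = atomic_load(&c->seq);
        intptr_t dif = (intptr_t)seq - (intptr_t)pos;
        if (dif == 0) {
            /* Cell is free on this lap: try to claim the slot. */
            if (atomic_compare_exchange_weak(&q->enq_pos, &pos, pos + 1)) {
                c->data = v;
                atomic_store(&c->seq, pos + 1); /* publish */
                return true;
            }
        } else if (dif < 0) {
            return false; /* full: this cell not yet drained by a reader */
        } else {
            pos = atomic_load(&q->enq_pos); /* lost a race; retry */
        }
    }
}

static bool mpmc_dequeue(struct mpmc *q, int *out)
{
    size_t pos = atomic_load(&q->deq_pos);
    for (;;) {
        struct cell *c = &q->buf[pos & q->mask];
        size_t seq = atomic_load(&c->seq);
        intptr_t dif = (intptr_t)seq - (intptr_t)(pos + 1);
        if (dif == 0) {
            if (atomic_compare_exchange_weak(&q->deq_pos, &pos, pos + 1)) {
                *out = c->data;
                /* Free the cell for the writer's next lap. */
                atomic_store(&c->seq, pos + q->mask + 1);
                return true;
            }
        } else if (dif < 0) {
            return false; /* empty */
        } else {
            pos = atomic_load(&q->deq_pos);
        }
    }
}
```

Note the "full" branch: the writer refuses to overwrite, which is
exactly the behavior we would have to relax for our use case.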

In my opinion, under very frequent writing it will be hard for a reader
to get all elements in one shot (e.g., bpf_map_lookup_batch), because we
use a seqlock and the whole buffer could be large. What's worse, with
overwriting, many elements will be dropped silently before readers get
access to them.
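To make the retry cost concrete, a per-element seqlock read would look
roughly like the sketch below (illustrative only; names are made up,
and a single writer per slot is assumed, as with per-CPU buffers). Each
retry re-copies the element, so under frequent overwriting a batch copy
of a large buffer keeps being invalidated mid-read.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical per-element seqlock: an odd sequence means a writer is
 * mid-update; a sequence that changed across the copy means the copy
 * may be torn and must be retried. */
struct slot {
    atomic_uint seq;   /* even: stable, odd: write in progress */
    char data[64];
};

static void slot_write(struct slot *s, const char *src, size_t len)
{
    unsigned seq = atomic_load(&s->seq);
    atomic_store(&s->seq, seq + 1);              /* mark busy (odd) */
    atomic_thread_fence(memory_order_release);
    memcpy(s->data, src, len);
    atomic_store(&s->seq, seq + 2);              /* stable again (even) */
}

/* Returns the number of retries the read needed; with frequent writers
 * this can grow without bound, which is the batch-read concern above. */
static int slot_read(struct slot *s, char *dst, size_t len)
{
    int retries = 0;
    for (;;) {
        unsigned start = atomic_load(&s->seq);
        if (start & 1) { retries++; continue; }  /* writer active */
        atomic_thread_fence(memory_order_acquire);
        memcpy(dst, s->data, len);
        atomic_thread_fence(memory_order_acquire);
        if (atomic_load(&s->seq) == start)
            return retries;                      /* consistent snapshot */
        retries++;                               /* torn: try again */
    }
}
```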

Basically, I think BPF_MAP_TYPE_QUEUE assumes reliable delivery by
design, and so does ringbuf. But in our case we'd rather catch logs in
time, even at the cost of a few lost or torn records, and this is how
relay behaves.

Anyway, the MPMC queue optimization for BPF_MAP_TYPE_QUEUE is an
interesting topic. I'd like to try it alongside relay if possible.
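For comparison, the drop-oldest semantics I'm after (what relay
provides per sub-buffer) can be sketched with fixed-size records as
below. This is not how relay is implemented, just the behavior: the
writer never blocks, the oldest record is silently overwritten, and a
late reader simply sees the last CAP records.

```c
#include <assert.h>
#include <string.h>

#define CAP 8UL          /* records kept; power of two */

/* Single-writer overwritable ring (think one per CPU); names are
 * illustrative, not an existing kernel interface. */
struct owring {
    unsigned long head;  /* total records ever written */
    int rec[CAP];
};

static void ow_write(struct owring *r, int v)
{
    r->rec[r->head % CAP] = v;  /* oldest record is overwritten */
    r->head++;
}

/* Copy out the last min(head, CAP) records, oldest first. */
static int ow_snapshot(const struct owring *r, int *out)
{
    unsigned long n = r->head < CAP ? r->head : CAP;
    unsigned long start = r->head - n;
    for (unsigned long i = 0; i < n; i++)
        out[i] = r->rec[(start + i) % CAP];
    return (int)n;
}
```

The writer-side cost is one store and one increment, with no consumer
position to coordinate on; that is the property that makes this shape
attractive for always-on logging.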

Thread overview: 17+ messages
2023-12-07 13:15 Question about bpf perfbuf/ringbuf: pinned in backend with overwriting Philo Lu
2023-12-07 14:48 ` Alan Maguire
2023-12-08 22:32   ` Andrii Nakryiko
2023-12-11 12:39     ` Philo Lu
2023-12-13 23:35       ` Andrii Nakryiko
2023-12-15 10:10         ` Philo Lu
2023-12-15 22:39           ` Andrii Nakryiko
2023-12-16  8:50             ` Dmitry Vyukov
2023-12-18 12:58               ` Philo Lu [this message]
2023-12-19 19:25               ` Andrii Nakryiko
2023-12-19  6:23         ` Shung-Hsi Yu
2023-12-19 13:38           ` Steven Rostedt
2023-12-19 17:01             ` Alexei Starovoitov
2023-12-19 17:28             ` Steven Rostedt
2023-12-21 13:00             ` Philo Lu
2023-12-21 14:49               ` Steven Rostedt
2023-12-22 12:25                 ` Philo Lu
