From: Xu Kuohai <xukuohai@huaweicloud.com>
To: Jordan Rome <linux@jordanrome.com>,
Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: bpf <bpf@vger.kernel.org>,
"open list:KERNEL SELFTEST FRAMEWORK"
<linux-kselftest@vger.kernel.org>,
LKML <linux-kernel@vger.kernel.org>,
Alexei Starovoitov <ast@kernel.org>,
Daniel Borkmann <daniel@iogearbox.net>,
Andrii Nakryiko <andrii@kernel.org>,
Martin KaFai Lau <martin.lau@linux.dev>,
Eduard Zingerman <eddyz87@gmail.com>, Yonghong Song <yhs@fb.com>,
Song Liu <song@kernel.org>,
John Fastabend <john.fastabend@gmail.com>,
KP Singh <kpsingh@kernel.org>,
Stanislav Fomichev <sdf@google.com>, Hao Luo <haoluo@google.com>,
Jiri Olsa <jolsa@kernel.org>, Mykola Lysenko <mykolal@fb.com>,
Shuah Khan <shuah@kernel.org>,
Stanislav Fomichev <sdf@fomichev.me>,
Willem de Bruijn <willemb@google.com>,
Jason Xing <kerneljasonxing@gmail.com>,
Paul Chaignon <paul.chaignon@gmail.com>,
Tao Chen <chen.dylane@linux.dev>,
Kumar Kartikeya Dwivedi <memxor@gmail.com>,
Martin Kelly <martin.kelly@crowdstrike.com>
Subject: Re: [PATCH bpf-next 1/4] bpf: Add overwrite mode for bpf ring buffer
Date: Thu, 14 Aug 2025 21:59:52 +0800
Message-ID: <1f1d98bc-2243-44c9-94e3-3594d19ea313@huaweicloud.com>
In-Reply-To: <27a11141-8e0d-498a-bc27-318d108161c8@jordanrome.com>
On 8/13/2025 9:22 PM, Jordan Rome wrote:
>
> On 8/12/25 12:02 AM, Xu Kuohai wrote:
>> On 8/9/2025 5:39 AM, Alexei Starovoitov wrote:
>>> On Sun, Aug 3, 2025 at 7:27 PM Xu Kuohai <xukuohai@huaweicloud.com> wrote:
>>>>
>>>> From: Xu Kuohai <xukuohai@huawei.com>
>>>>
>>>> When the bpf ring buffer is full, new events cannot be recorded until
>>>> the consumer consumes some events to free space. This may cause critical
>>>> events to be discarded, such as in fault diagnostics, where recent events
>>>> are more critical than older ones.
>>>>
>>>> So add overwrite mode for bpf ring buffer. In this mode, the new event
>>>> overwrites the oldest event when the buffer is full.
>>>>
>>>> The scheme is as follows:
>>>>
>>>> 1. producer_pos tracks the next position to write new data. When there
>>>> is enough free space, producer simply moves producer_pos forward to
>>>> make space for the new event.
>>>>
>>>> 2. To avoid waiting for consumer to free space when the buffer is full,
>>>> a new variable overwrite_pos is introduced for producer. overwrite_pos
>>>> tracks the next event to be overwritten (the oldest event committed) in
>>>> the buffer. producer moves it forward to discard the oldest events when
>>>> the buffer is full.
>>>>
>>>> 3. pending_pos tracks the oldest event still being committed. producer ensures
>>>> producer_pos never passes pending_pos when making space for new events.
>>>> So multiple producers never write to the same position at the same time.
>>>>
>>>> 4. producer wakes up consumer every half a round ahead to give it a chance
>>>> to retrieve data. However, for an overwrite-mode ring buffer, users
>>>> typically only care about the ring buffer snapshot before a fault occurs.
>>>> In this case, the producer should commit data with BPF_RB_NO_WAKEUP flag
>>>> to avoid unnecessary wakeups.
>>>
>>> If I understand it correctly the algorithm requires all events to be the same
>>> size otherwise first overwrite might trash the header,
>>> also the producers should use some kind of signaling to
>>> timestamp each event otherwise it all will look out of order to the consumer.
>>>
>>> At the end it looks inferior to the existing perf ring buffer with overwrite.
>>> Since in both cases the out of order needs to be dealt with
>>> in post processing the main advantage of ring buf vs perf buf is gone.
>>
>> No, the advantage is not gone.
>>
>> The ring buffer is still shared by multiple producers. When an event occurs,
>> the producer queues up to acquire the spin lock of the ring buffer before
>> writing the event to it. So events in the ring buffer are always ordered,
>> and no reordering occurs.
>>
>> And events are not required to be the same size. When an overwrite happens,
>> the events being trashed are discarded, and overwrite_pos is moved forward
>> to skip these events until it reaches the first event that is not trashed.
>>
>> To make it clear, here are some example diagrams.
>>
>> 1. Let's say we have a ring buffer with size 4096.
>>
>> At first, {producer,overwrite,pending,consumer}_pos are all set to 0
>>
>>   0       512      1024     1536     2048     2560     3072     3584    4096
>>   +-----------------------------------------------------------------------+
>>   |                                                                       |
>>   |                                                                       |
>>   |                                                                       |
>>   +-----------------------------------------------------------------------+
>>   ^
>>   |
>>   |
>>   producer_pos = 0
>>   overwrite_pos = 0
>>   pending_pos = 0
>>   consumer_pos = 0
>>
>> 2. Reserve event A, size 512.
>>
>> There is enough free space, so A is allocated at offset 0 and producer_pos
>> is moved to 512, the end of A. Since A is not submitted, the BUSY bit is
>> set.
>>
>>   0       512      1024     1536     2048     2560     3072     3584    4096
>>   +-----------------------------------------------------------------------+
>>   |        |                                                              |
>>   |   A    |                                                              |
>>   | [BUSY] |                                                              |
>>   +-----------------------------------------------------------------------+
>>   ^        ^
>>   |        |
>>   |        |
>>   |        producer_pos = 512
>>   |
>>   overwrite_pos = 0
>>   pending_pos = 0
>>   consumer_pos = 0
>>
>>
>> 3. Reserve event B, size 1024.
>>
>> B is allocated at offset 512 with BUSY bit set, and producer_pos is moved
>> to the end of B.
>>
>>   0       512      1024     1536     2048     2560     3072     3584    4096
>>   +-----------------------------------------------------------------------+
>>   |        |                 |                                            |
>>   |   A    |        B        |                                            |
>>   | [BUSY] |     [BUSY]      |                                            |
>>   +-----------------------------------------------------------------------+
>>   ^                          ^
>>   |                          |
>>   |                          |
>>   |                          producer_pos = 1536
>>   |
>>   overwrite_pos = 0
>>   pending_pos = 0
>>   consumer_pos = 0
>>
>> 4. Reserve event C, size 2048.
>>
>> C is allocated at offset 1536 and producer_pos becomes 3584.
>>
>>   0       512      1024     1536     2048     2560     3072     3584    4096
>>   +-----------------------------------------------------------------------+
>>   |        |                 |                                   |        |
>>   |   A    |        B        |                 C                 |        |
>>   | [BUSY] |     [BUSY]      |              [BUSY]               |        |
>>   +-----------------------------------------------------------------------+
>>   ^                                                              ^
>>   |                                                              |
>>   |                                                              |
>>   |                                                              producer_pos = 3584
>>   |
>>   overwrite_pos = 0
>>   pending_pos = 0
>>   consumer_pos = 0
>>
>> 5. Submit event A.
>>
>> The BUSY bit of A is cleared. B becomes the oldest event under writing, so
>> pending_pos is moved to 512, the start of B.
>>
>>   0       512      1024     1536     2048     2560     3072     3584    4096
>>   +-----------------------------------------------------------------------+
>>   |        |                 |                                   |        |
>>   |   A    |        B        |                 C                 |        |
>>   |        |     [BUSY]      |              [BUSY]               |        |
>>   +-----------------------------------------------------------------------+
>>   ^        ^                                                     ^
>>   |        |                                                     |
>>   |        |                                                     |
>>   |        pending_pos = 512                                     producer_pos = 3584
>>   |
>>   overwrite_pos = 0
>>   consumer_pos = 0
>>
>> 6. Submit event B.
>>
>> The BUSY bit of B is cleared, and pending_pos is moved to the start of C,
>> which is the oldest event under writing now.
>>
>>   0       512      1024     1536     2048     2560     3072     3584    4096
>>   +-----------------------------------------------------------------------+
>>   |        |                 |                                   |        |
>>   |   A    |        B        |                 C                 |        |
>>   |        |                 |              [BUSY]               |        |
>>   +-----------------------------------------------------------------------+
>>   ^                          ^                                   ^
>>   |                          |                                   |
>>   |                          |                                   |
>>   |                          pending_pos = 1536                  producer_pos = 3584
>>   |
>>   overwrite_pos = 0
>>   consumer_pos = 0
>>
>> 7. Reserve event D, size 1536 (3 * 512).
>>
>> There are 2048 bytes not under writing between producer_pos and pending_pos,
>> so D is allocated at offset 3584, and producer_pos is moved from 3584 to
>> 5120.
>>
>> Since event D will overwrite all bytes of event A and the beginning 512 bytes
>> of event B, overwrite_pos is moved to the start of event C, the oldest event
>> that is not overwritten.
>>
>>   0       512      1024     1536     2048     2560     3072     3584    4096
>>   +-----------------------------------------------------------------------+
>>   |                 |        |                                   |        |
>>   |      D End      |        |                 C                 | D Begin|
>>   |     [BUSY]      |        |              [BUSY]               | [BUSY] |
>>   +-----------------------------------------------------------------------+
>>   ^                 ^        ^
>>   |                 |        |
>>   |                 |        pending_pos = 1536
>>   |                 |        overwrite_pos = 1536
>>   |                 |
>>   |                 producer_pos=5120
>>   |
>>   consumer_pos = 0
>>
>> 8. Reserve event E, size 1024.
>>
>> Though there are 512 bytes not under writing between producer_pos and
>> pending_pos, E cannot be reserved, as it would overwrite the first 512
>> bytes of event C, which is still under writing.
>>
>> 9. Submit events C and D.
>>
>> pending_pos is moved to the end of D.
>>
>>   0       512      1024     1536     2048     2560     3072     3584    4096
>>   +-----------------------------------------------------------------------+
>>   |                 |        |                                   |        |
>>   |      D End      |        |                 C                 | D Begin|
>>   |                 |        |                                   |        |
>>   +-----------------------------------------------------------------------+
>>   ^                 ^        ^
>>   |                 |        |
>>   |                 |        overwrite_pos = 1536
>>   |                 |
>>   |                 producer_pos=5120
>>   |                 pending_pos=5120
>>   |
>>   consumer_pos = 0
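
The nine steps above can be replayed with a small toy model. This is only a
sketch in Python, not the kernel code from the patch. The class name
ToyOverwriteRing and its methods are made up for illustration, record sizes
are treated as total lengths (header included), and pending_pos is advanced
at submit time simply because that is how the diagrams present it.

```python
from dataclasses import dataclass, field


@dataclass
class ToyOverwriteRing:
    """Toy model of the positions in the diagrams (hypothetical, not kernel code)."""
    size: int
    producer_pos: int = 0
    overwrite_pos: int = 0
    pending_pos: int = 0
    # Maps a record's logical start position to [length, busy].
    records: dict = field(default_factory=dict)

    def reserve(self, length):
        new_prod = self.producer_pos + length
        # Never let the producer run over a record that is still being written.
        if new_prod - self.pending_pos > self.size:
            return None
        # Discard whole committed records, oldest first, until the new one fits.
        while new_prod - self.overwrite_pos > self.size:
            rec_len, _busy = self.records.pop(self.overwrite_pos)
            self.overwrite_pos += rec_len
        start = self.producer_pos
        self.records[start] = [length, True]
        self.producer_pos = new_prod
        return start

    def submit(self, start):
        self.records[start][1] = False
        # Move pending_pos past every record that is no longer busy.
        while (self.pending_pos < self.producer_pos
               and not self.records[self.pending_pos][1]):
            self.pending_pos += self.records[self.pending_pos][0]


def replay_walkthrough():
    rb = ToyOverwriteRing(size=4096)
    a = rb.reserve(512)        # step 2
    b = rb.reserve(1024)       # step 3
    c = rb.reserve(2048)       # step 4
    rb.submit(a)               # step 5
    rb.submit(b)               # step 6
    d = rb.reserve(1536)       # step 7: wraps, discards A and B
    after_d = (rb.producer_pos, rb.overwrite_pos, rb.pending_pos)
    e = rb.reserve(1024)       # step 8: would hit C, which is still busy
    rb.submit(c)               # step 9
    rb.submit(d)
    final = (rb.producer_pos, rb.overwrite_pos, rb.pending_pos)
    return after_d, e, final
```

Calling replay_walkthrough() returns the (producer_pos, overwrite_pos,
pending_pos) triple after step 7, the result of the failed reservation in
step 8, and the triple after step 9, so the values can be compared with the
diagrams. Events A, B and D have different sizes, and overwrite_pos still
lands on a record boundary because it only ever skips whole records.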
>
> These diagrams are very helpful in terms of understanding the flow.
> In part 7 when A is overwritten by D, why doesn't the consumer position move forward to
> point to the beginning of C? If the ring buffer producer guarantees ordering of reserved
> slots then C, in this case, is now the oldest reserved. This speaks to your second patch
> where you say that the consumer resolves conflicts by discarding data that has been
> overwritten but I feel like the simpler thing to do is just move the consumer position.
>
But the consumer may be ahead of overwrite_pos. In this case, moving
consumer_pos back to the oldest event is not correct, as the event has
already been consumed.
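
A minimal sketch of that case, assuming the consumer tracks logical positions
the same way the diagrams do. The helper name next_read_pos is hypothetical
and is not taken from the libbpf patch. It only illustrates that the read
start can be taken as the larger of the two positions, so the consumer skips
past overwritten data but is never pulled back onto data it already read.

```python
def next_read_pos(consumer_pos: int, overwrite_pos: int) -> int:
    """Return where a consumer could resume reading (illustrative only)."""
    # Data before overwrite_pos is gone; data before consumer_pos was already read.
    return max(consumer_pos, overwrite_pos)


# Consumer read nothing before step 7, where A and B were overwritten.
lagging = next_read_pos(consumer_pos=0, overwrite_pos=1536)

# Consumer read A and B right after step 6, while overwrite_pos was still 0.
ahead = next_read_pos(consumer_pos=1536, overwrite_pos=0)
```

In the second call, forcing consumer_pos to overwrite_pos would hand A and B
to the consumer a second time.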