linux-kselftest.vger.kernel.org archive mirror
* [PATCH bpf-next 0/4] Add overwrite mode for bpf ring buffer
@ 2025-08-04  2:20 Xu Kuohai
  2025-08-04  2:20 ` [PATCH bpf-next 1/4] bpf: " Xu Kuohai
                   ` (3 more replies)
  0 siblings, 4 replies; 15+ messages in thread
From: Xu Kuohai @ 2025-08-04  2:20 UTC (permalink / raw)
  To: bpf, linux-kselftest, linux-kernel
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Yonghong Song, Song Liu,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Mykola Lysenko, Shuah Khan, Stanislav Fomichev, Willem de Bruijn,
	Jason Xing, Paul Chaignon, Tao Chen, Kumar Kartikeya Dwivedi,
	Martin Kelly

From: Xu Kuohai <xukuohai@huawei.com>

When the bpf ring buffer is full, new events cannot be recorded until
the consumer consumes some events to free space. This may cause critical
events to be discarded, for example in fault diagnosis, where recent
events are more critical than older ones.

So add an overwrite mode for the bpf ring buffer. In this mode, a new
event overwrites the oldest event when the buffer is full.
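For context, enabling the mode only requires setting the new flag in
map_flags when defining the ring buffer map. A sketch of the
BPF-program-side declaration, mirroring the selftest in this series
(the map name "events" and the size are illustrative):

```c
/* BPF program side: a ring buffer that overwrites the oldest events
 * once full. BPF_F_OVERWRITE is the flag added by this series.
 */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
	__uint(type, BPF_MAP_TYPE_RINGBUF);
	__uint(max_entries, 4096);          /* must be power-of-2 and page-aligned */
	__uint(map_flags, BPF_F_OVERWRITE); /* enable overwrite mode */
} events SEC(".maps");
```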

Xu Kuohai (4):
  bpf: Add overwrite mode for bpf ring buffer
  libbpf: ringbuf: Add overwrite ring buffer process
  selftests/bpf: Add test for overwrite ring buffer
  selftests/bpf/benchs: Add overwrite mode bench for rb-libbpf

 include/uapi/linux/bpf.h                      |   4 +
 kernel/bpf/ringbuf.c                          | 159 +++++++++++++++---
 tools/include/uapi/linux/bpf.h                |   4 +
 tools/lib/bpf/ringbuf.c                       | 103 +++++++++++-
 tools/testing/selftests/bpf/Makefile          |   3 +-
 .../selftests/bpf/benchs/bench_ringbufs.c     |  22 ++-
 .../bpf/benchs/run_bench_ringbufs.sh          |   4 +
 .../selftests/bpf/prog_tests/ringbuf.c        |  74 ++++++++
 .../bpf/progs/test_ringbuf_overwrite.c        |  98 +++++++++++
 9 files changed, 442 insertions(+), 29 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/progs/test_ringbuf_overwrite.c

-- 
2.43.0



* [PATCH bpf-next 1/4] bpf: Add overwrite mode for bpf ring buffer
  2025-08-04  2:20 [PATCH bpf-next 0/4] Add overwrite mode for bpf ring buffer Xu Kuohai
@ 2025-08-04  2:20 ` Xu Kuohai
  2025-08-08 21:39   ` Alexei Starovoitov
  2025-08-04  2:20 ` [PATCH bpf-next 2/4] libbpf: ringbuf: Add overwrite ring buffer process Xu Kuohai
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 15+ messages in thread
From: Xu Kuohai @ 2025-08-04  2:20 UTC (permalink / raw)
  To: bpf, linux-kselftest, linux-kernel
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Yonghong Song, Song Liu,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Mykola Lysenko, Shuah Khan, Stanislav Fomichev, Willem de Bruijn,
	Jason Xing, Paul Chaignon, Tao Chen, Kumar Kartikeya Dwivedi,
	Martin Kelly

From: Xu Kuohai <xukuohai@huawei.com>

When the bpf ring buffer is full, new events cannot be recorded until
the consumer consumes some events to free space. This may cause critical
events to be discarded, for example in fault diagnosis, where recent
events are more critical than older ones.

So add an overwrite mode for the bpf ring buffer. In this mode, a new
event overwrites the oldest event when the buffer is full.

The scheme is as follows:

1. producer_pos tracks the next position to write new data. When there
   is enough free space, the producer simply moves producer_pos forward
   to make space for the new event.

2. To avoid waiting for the consumer to free space when the buffer is
   full, a new variable, overwrite_pos, is introduced for the producer.
   overwrite_pos tracks the next event to be overwritten (the oldest
   committed event) in the buffer. The producer moves it forward to
   discard the oldest events when the buffer is full.

3. pending_pos tracks the oldest event still being committed. The producer
   ensures producer_pos never passes pending_pos when making space for new
   events, so multiple producers never write to the same position at the
   same time.

4. The producer wakes up the consumer every half a round ahead to give it
   a chance to retrieve data. However, for an overwrite-mode ring buffer,
   users typically only care about the ring buffer snapshot taken just
   before a fault occurs. In that case, the producer should commit data
   with the BPF_RB_NO_WAKEUP flag to avoid unnecessary wakeups.

The performance data for overwrite mode will be provided in a follow-up
patch that adds the overwrite mode benchmarks.

A sample of performance data for non-overwrite mode on x86_64 and arm64
CPUs, before and after this patch, is shown below. As the numbers show,
there is no obvious performance regression.

- x86_64 (AMD EPYC 9654)

Before:

Ringbuf, multi-producer contention
==================================
  rb-libbpf nr_prod 1  13.218 ± 0.039M/s (drops 0.000 ± 0.000M/s)
  rb-libbpf nr_prod 2  15.684 ± 0.015M/s (drops 0.000 ± 0.000M/s)
  rb-libbpf nr_prod 3  7.771 ± 0.002M/s (drops 0.000 ± 0.000M/s)
  rb-libbpf nr_prod 4  6.281 ± 0.001M/s (drops 0.000 ± 0.000M/s)
  rb-libbpf nr_prod 8  2.842 ± 0.003M/s (drops 0.000 ± 0.000M/s)
  rb-libbpf nr_prod 12 2.001 ± 0.004M/s (drops 0.000 ± 0.000M/s)
  rb-libbpf nr_prod 16 1.833 ± 0.003M/s (drops 0.000 ± 0.000M/s)
  rb-libbpf nr_prod 20 1.508 ± 0.003M/s (drops 0.000 ± 0.000M/s)
  rb-libbpf nr_prod 24 1.421 ± 0.002M/s (drops 0.000 ± 0.000M/s)
  rb-libbpf nr_prod 28 1.309 ± 0.001M/s (drops 0.000 ± 0.000M/s)
  rb-libbpf nr_prod 32 1.265 ± 0.003M/s (drops 0.000 ± 0.000M/s)
  rb-libbpf nr_prod 36 1.198 ± 0.002M/s (drops 0.000 ± 0.000M/s)
  rb-libbpf nr_prod 40 1.174 ± 0.001M/s (drops 0.000 ± 0.000M/s)
  rb-libbpf nr_prod 44 1.113 ± 0.003M/s (drops 0.000 ± 0.000M/s)
  rb-libbpf nr_prod 48 1.097 ± 0.002M/s (drops 0.000 ± 0.000M/s)
  rb-libbpf nr_prod 52 1.070 ± 0.002M/s (drops 0.000 ± 0.000M/s)

After:

Ringbuf, multi-producer contention
==================================
  rb-libbpf nr_prod 1  13.751 ± 0.673M/s (drops 0.000 ± 0.000M/s)
  rb-libbpf nr_prod 2  15.592 ± 0.008M/s (drops 0.000 ± 0.000M/s)
  rb-libbpf nr_prod 3  7.776 ± 0.002M/s (drops 0.000 ± 0.000M/s)
  rb-libbpf nr_prod 4  6.463 ± 0.002M/s (drops 0.000 ± 0.000M/s)
  rb-libbpf nr_prod 8  2.883 ± 0.003M/s (drops 0.000 ± 0.000M/s)
  rb-libbpf nr_prod 12 2.017 ± 0.003M/s (drops 0.000 ± 0.000M/s)
  rb-libbpf nr_prod 16 1.816 ± 0.004M/s (drops 0.000 ± 0.000M/s)
  rb-libbpf nr_prod 20 1.512 ± 0.003M/s (drops 0.000 ± 0.000M/s)
  rb-libbpf nr_prod 24 1.396 ± 0.002M/s (drops 0.000 ± 0.000M/s)
  rb-libbpf nr_prod 28 1.303 ± 0.002M/s (drops 0.000 ± 0.000M/s)
  rb-libbpf nr_prod 32 1.267 ± 0.002M/s (drops 0.000 ± 0.000M/s)
  rb-libbpf nr_prod 36 1.210 ± 0.002M/s (drops 0.000 ± 0.000M/s)
  rb-libbpf nr_prod 40 1.181 ± 0.002M/s (drops 0.000 ± 0.000M/s)
  rb-libbpf nr_prod 44 1.136 ± 0.002M/s (drops 0.000 ± 0.000M/s)
  rb-libbpf nr_prod 48 1.090 ± 0.001M/s (drops 0.000 ± 0.000M/s)
  rb-libbpf nr_prod 52 1.091 ± 0.002M/s (drops 0.000 ± 0.000M/s)

- arm64 (HiSilicon Kunpeng 920)

Before:

  Ringbuf, multi-producer contention
  ==================================
  rb-libbpf nr_prod 1  11.602 ± 0.423M/s (drops 0.000 ± 0.000M/s)
  rb-libbpf nr_prod 2  9.599 ± 0.007M/s (drops 0.000 ± 0.000M/s)
  rb-libbpf nr_prod 3  6.669 ± 0.008M/s (drops 0.000 ± 0.000M/s)
  rb-libbpf nr_prod 4  4.806 ± 0.002M/s (drops 0.000 ± 0.000M/s)
  rb-libbpf nr_prod 8  3.856 ± 0.002M/s (drops 0.000 ± 0.000M/s)
  rb-libbpf nr_prod 12 3.368 ± 0.003M/s (drops 0.000 ± 0.000M/s)
  rb-libbpf nr_prod 16 3.210 ± 0.007M/s (drops 0.000 ± 0.000M/s)
  rb-libbpf nr_prod 20 3.003 ± 0.007M/s (drops 0.000 ± 0.000M/s)
  rb-libbpf nr_prod 24 2.944 ± 0.007M/s (drops 0.000 ± 0.000M/s)
  rb-libbpf nr_prod 28 2.863 ± 0.008M/s (drops 0.000 ± 0.000M/s)
  rb-libbpf nr_prod 32 2.819 ± 0.007M/s (drops 0.000 ± 0.000M/s)
  rb-libbpf nr_prod 36 2.887 ± 0.008M/s (drops 0.000 ± 0.000M/s)
  rb-libbpf nr_prod 40 2.837 ± 0.008M/s (drops 0.000 ± 0.000M/s)
  rb-libbpf nr_prod 44 2.787 ± 0.012M/s (drops 0.000 ± 0.000M/s)
  rb-libbpf nr_prod 48 2.738 ± 0.010M/s (drops 0.000 ± 0.000M/s)
  rb-libbpf nr_prod 52 2.700 ± 0.007M/s (drops 0.000 ± 0.000M/s)

After:

  Ringbuf, multi-producer contention
  ==================================
  rb-libbpf nr_prod 1  11.614 ± 0.268M/s (drops 0.000 ± 0.000M/s)
  rb-libbpf nr_prod 2  9.917 ± 0.007M/s (drops 0.000 ± 0.000M/s)
  rb-libbpf nr_prod 3  6.920 ± 0.008M/s (drops 0.000 ± 0.000M/s)
  rb-libbpf nr_prod 4  4.803 ± 0.002M/s (drops 0.000 ± 0.000M/s)
  rb-libbpf nr_prod 8  3.898 ± 0.002M/s (drops 0.000 ± 0.000M/s)
  rb-libbpf nr_prod 12 3.426 ± 0.008M/s (drops 0.000 ± 0.000M/s)
  rb-libbpf nr_prod 16 3.320 ± 0.008M/s (drops 0.000 ± 0.000M/s)
  rb-libbpf nr_prod 20 3.029 ± 0.013M/s (drops 0.000 ± 0.000M/s)
  rb-libbpf nr_prod 24 3.068 ± 0.012M/s (drops 0.000 ± 0.000M/s)
  rb-libbpf nr_prod 28 2.890 ± 0.009M/s (drops 0.000 ± 0.000M/s)
  rb-libbpf nr_prod 32 2.950 ± 0.012M/s (drops 0.000 ± 0.000M/s)
  rb-libbpf nr_prod 36 2.812 ± 0.006M/s (drops 0.000 ± 0.000M/s)
  rb-libbpf nr_prod 40 2.834 ± 0.009M/s (drops 0.000 ± 0.000M/s)
  rb-libbpf nr_prod 44 2.803 ± 0.010M/s (drops 0.000 ± 0.000M/s)
  rb-libbpf nr_prod 48 2.766 ± 0.010M/s (drops 0.000 ± 0.000M/s)
  rb-libbpf nr_prod 52 2.754 ± 0.009M/s (drops 0.000 ± 0.000M/s)

Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
---
 include/uapi/linux/bpf.h       |   4 +
 kernel/bpf/ringbuf.c           | 159 +++++++++++++++++++++++++++------
 tools/include/uapi/linux/bpf.h |   4 +
 3 files changed, 141 insertions(+), 26 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 233de8677382..d3b2fd2ae527 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -1430,6 +1430,9 @@ enum {
 
 /* Do not translate kernel bpf_arena pointers to user pointers */
 	BPF_F_NO_USER_CONV	= (1U << 18),
+
+/* bpf ringbuf works in overwrite mode */
+	BPF_F_OVERWRITE		= (1U << 19),
 };
 
 /* Flags for BPF_PROG_QUERY. */
@@ -6215,6 +6218,7 @@ enum {
 	BPF_RB_RING_SIZE = 1,
 	BPF_RB_CONS_POS = 2,
 	BPF_RB_PROD_POS = 3,
+	BPF_RB_OVER_POS = 4,
 };
 
 /* BPF ring buffer constants */
diff --git a/kernel/bpf/ringbuf.c b/kernel/bpf/ringbuf.c
index 719d73299397..6ca41d01f187 100644
--- a/kernel/bpf/ringbuf.c
+++ b/kernel/bpf/ringbuf.c
@@ -13,7 +13,7 @@
 #include <linux/btf_ids.h>
 #include <asm/rqspinlock.h>
 
-#define RINGBUF_CREATE_FLAG_MASK (BPF_F_NUMA_NODE)
+#define RINGBUF_CREATE_FLAG_MASK (BPF_F_NUMA_NODE | BPF_F_OVERWRITE)
 
 /* non-mmap()'able part of bpf_ringbuf (everything up to consumer page) */
 #define RINGBUF_PGOFF \
@@ -27,7 +27,8 @@
 struct bpf_ringbuf {
 	wait_queue_head_t waitq;
 	struct irq_work work;
-	u64 mask;
+	u64 mask:48;
+	u64 overwrite_mode:1;
 	struct page **pages;
 	int nr_pages;
 	rqspinlock_t spinlock ____cacheline_aligned_in_smp;
@@ -72,6 +73,7 @@ struct bpf_ringbuf {
 	 */
 	unsigned long consumer_pos __aligned(PAGE_SIZE);
 	unsigned long producer_pos __aligned(PAGE_SIZE);
+	unsigned long overwrite_pos;  /* next record to overwrite (overwrite mode only) */
 	unsigned long pending_pos;
 	char data[] __aligned(PAGE_SIZE);
 };
@@ -166,7 +168,8 @@ static void bpf_ringbuf_notify(struct irq_work *work)
  * considering that the maximum value of data_sz is (4GB - 1), there
  * will be no overflow, so just note the size limit in the comments.
  */
-static struct bpf_ringbuf *bpf_ringbuf_alloc(size_t data_sz, int numa_node)
+static struct bpf_ringbuf *bpf_ringbuf_alloc(size_t data_sz, int numa_node,
+					     int overwrite_mode)
 {
 	struct bpf_ringbuf *rb;
 
@@ -183,17 +186,25 @@ static struct bpf_ringbuf *bpf_ringbuf_alloc(size_t data_sz, int numa_node)
 	rb->consumer_pos = 0;
 	rb->producer_pos = 0;
 	rb->pending_pos = 0;
+	rb->overwrite_mode = overwrite_mode;
 
 	return rb;
 }
 
 static struct bpf_map *ringbuf_map_alloc(union bpf_attr *attr)
 {
+	int overwrite_mode = 0;
 	struct bpf_ringbuf_map *rb_map;
 
 	if (attr->map_flags & ~RINGBUF_CREATE_FLAG_MASK)
 		return ERR_PTR(-EINVAL);
 
+	if (attr->map_flags & BPF_F_OVERWRITE) {
+		if (attr->map_type == BPF_MAP_TYPE_USER_RINGBUF)
+			return ERR_PTR(-EINVAL);
+		overwrite_mode = 1;
+	}
+
 	if (attr->key_size || attr->value_size ||
 	    !is_power_of_2(attr->max_entries) ||
 	    !PAGE_ALIGNED(attr->max_entries))
@@ -205,7 +216,8 @@ static struct bpf_map *ringbuf_map_alloc(union bpf_attr *attr)
 
 	bpf_map_init_from_attr(&rb_map->map, attr);
 
-	rb_map->rb = bpf_ringbuf_alloc(attr->max_entries, rb_map->map.numa_node);
+	rb_map->rb = bpf_ringbuf_alloc(attr->max_entries, rb_map->map.numa_node,
+				       overwrite_mode);
 	if (!rb_map->rb) {
 		bpf_map_area_free(rb_map);
 		return ERR_PTR(-ENOMEM);
@@ -295,11 +307,16 @@ static int ringbuf_map_mmap_user(struct bpf_map *map, struct vm_area_struct *vma
 
 static unsigned long ringbuf_avail_data_sz(struct bpf_ringbuf *rb)
 {
-	unsigned long cons_pos, prod_pos;
+	unsigned long cons_pos, prod_pos, over_pos;
 
 	cons_pos = smp_load_acquire(&rb->consumer_pos);
 	prod_pos = smp_load_acquire(&rb->producer_pos);
-	return prod_pos - cons_pos;
+
+	if (likely(!rb->overwrite_mode))
+		return prod_pos - cons_pos;
+
+	over_pos = READ_ONCE(rb->overwrite_pos);
+	return min(prod_pos - max(cons_pos, over_pos), rb->mask + 1);
 }
 
 static u32 ringbuf_total_data_sz(const struct bpf_ringbuf *rb)
@@ -402,11 +419,43 @@ bpf_ringbuf_restore_from_rec(struct bpf_ringbuf_hdr *hdr)
 	return (void*)((addr & PAGE_MASK) - off);
 }
 
+
+static bool bpf_ringbuf_has_space(const struct bpf_ringbuf *rb,
+				  unsigned long new_prod_pos,
+				  unsigned long cons_pos,
+				  unsigned long pend_pos)
+{
+	/* no space if the span from the oldest not-yet-committed record
+	 * to the newest record exceeds (ringbuf_size - 1)
+	 */
+	if (new_prod_pos - pend_pos > rb->mask)
+		return false;
+
+	/* ok, we have space in overwrite mode */
+	if (unlikely(rb->overwrite_mode))
+		return true;
+
+	/* no space if producer position advances more than (ringbuf_size - 1)
+	 * ahead of the consumer position when not in overwrite mode
+	 */
+	if (new_prod_pos - cons_pos > rb->mask)
+		return false;
+
+	return true;
+}
+
+static u32 ringbuf_round_up_hdr_len(u32 hdr_len)
+{
+	hdr_len &= ~BPF_RINGBUF_DISCARD_BIT;
+	return round_up(hdr_len + BPF_RINGBUF_HDR_SZ, 8);
+}
+
 static void *__bpf_ringbuf_reserve(struct bpf_ringbuf *rb, u64 size)
 {
-	unsigned long cons_pos, prod_pos, new_prod_pos, pend_pos, flags;
+	unsigned long flags;
 	struct bpf_ringbuf_hdr *hdr;
-	u32 len, pg_off, tmp_size, hdr_len;
+	u32 len, pg_off, hdr_len;
+	unsigned long cons_pos, prod_pos, new_prod_pos, pend_pos, over_pos;
 
 	if (unlikely(size > RINGBUF_MAX_RECORD_SZ))
 		return NULL;
@@ -429,24 +478,39 @@ static void *__bpf_ringbuf_reserve(struct bpf_ringbuf *rb, u64 size)
 		hdr_len = READ_ONCE(hdr->len);
 		if (hdr_len & BPF_RINGBUF_BUSY_BIT)
 			break;
-		tmp_size = hdr_len & ~BPF_RINGBUF_DISCARD_BIT;
-		tmp_size = round_up(tmp_size + BPF_RINGBUF_HDR_SZ, 8);
-		pend_pos += tmp_size;
+		pend_pos += ringbuf_round_up_hdr_len(hdr_len);
 	}
 	rb->pending_pos = pend_pos;
 
-	/* check for out of ringbuf space:
-	 * - by ensuring producer position doesn't advance more than
-	 *   (ringbuf_size - 1) ahead
-	 * - by ensuring oldest not yet committed record until newest
-	 *   record does not span more than (ringbuf_size - 1)
-	 */
-	if (new_prod_pos - cons_pos > rb->mask ||
-	    new_prod_pos - pend_pos > rb->mask) {
+	if (!bpf_ringbuf_has_space(rb, new_prod_pos, cons_pos, pend_pos)) {
 		raw_res_spin_unlock_irqrestore(&rb->spinlock, flags);
 		return NULL;
 	}
 
+	/* In overwrite mode, move overwrite_pos to the next record to be
+	 * overwritten if the ring buffer is full
+	 */
+	if (unlikely(rb->overwrite_mode)) {
+		over_pos = rb->overwrite_pos;
+		while (new_prod_pos - over_pos > rb->mask) {
+			hdr = (void *)rb->data + (over_pos & rb->mask);
+			hdr_len = READ_ONCE(hdr->len);
+			/* since pending_pos is the first record with BUSY
+			 * bit set and overwrite_pos is never bigger than
+			 * pending_pos, no need to check BUSY bit here.
+			 */
+			over_pos += ringbuf_round_up_hdr_len(hdr_len);
+		}
+		/* smp_store_release(&rb->producer_pos, new_prod_pos) at
+		 * the end of this function ensures that when the consumer
+		 * sees the updated rb->producer_pos, it also sees the
+		 * updated rb->overwrite_pos. So when the consumer reads
+		 * overwrite_pos after smp_load_acquire(&rb->producer_pos),
+		 * the overwrite_pos is always valid.
+		 */
+		WRITE_ONCE(rb->overwrite_pos, over_pos);
+	}
+
 	hdr = (void *)rb->data + (prod_pos & rb->mask);
 	pg_off = bpf_ringbuf_rec_pg_off(rb, hdr);
 	hdr->len = size | BPF_RINGBUF_BUSY_BIT;
@@ -479,7 +543,50 @@ const struct bpf_func_proto bpf_ringbuf_reserve_proto = {
 	.arg3_type	= ARG_ANYTHING,
 };
 
-static void bpf_ringbuf_commit(void *sample, u64 flags, bool discard)
+static __always_inline
+bool ringbuf_should_wakeup(const struct bpf_ringbuf *rb,
+			   unsigned long rec_pos,
+			   unsigned long cons_pos,
+			   u32 len, u64 flags)
+{
+	unsigned long rec_end;
+
+	if (flags & BPF_RB_FORCE_WAKEUP)
+		return true;
+
+	if (flags & BPF_RB_NO_WAKEUP)
+		return false;
+
+	/* for non-overwrite mode, if consumer caught up and is waiting for
+	 * our record, notify about new data availability
+	 */
+	if (likely(!rb->overwrite_mode))
+		return cons_pos == rec_pos;
+
+	/* for overwrite mode, to give the consumer a chance to catch up
+	 * before being overwritten, wake up consumer every half a round
+	 * ahead.
+	 */
+	rec_end = rec_pos + ringbuf_round_up_hdr_len(len);
+
+	cons_pos &= (rb->mask >> 1);
+	rec_pos &= (rb->mask >> 1);
+	rec_end &= (rb->mask >> 1);
+
+	if (cons_pos == rec_pos)
+		return true;
+
+	if (rec_pos < cons_pos && cons_pos < rec_end)
+		return true;
+
+	if (rec_end < rec_pos && (cons_pos > rec_pos || cons_pos < rec_end))
+		return true;
+
+	return false;
+}
+
+static __always_inline
+void bpf_ringbuf_commit(void *sample, u64 flags, bool discard)
 {
 	unsigned long rec_pos, cons_pos;
 	struct bpf_ringbuf_hdr *hdr;
@@ -495,15 +602,10 @@ static void bpf_ringbuf_commit(void *sample, u64 flags, bool discard)
 	/* update record header with correct final size prefix */
 	xchg(&hdr->len, new_len);
 
-	/* if consumer caught up and is waiting for our record, notify about
-	 * new data availability
-	 */
 	rec_pos = (void *)hdr - (void *)rb->data;
 	cons_pos = smp_load_acquire(&rb->consumer_pos) & rb->mask;
 
-	if (flags & BPF_RB_FORCE_WAKEUP)
-		irq_work_queue(&rb->work);
-	else if (cons_pos == rec_pos && !(flags & BPF_RB_NO_WAKEUP))
+	if (ringbuf_should_wakeup(rb, rec_pos, cons_pos, new_len, flags))
 		irq_work_queue(&rb->work);
 }
 
@@ -576,6 +678,8 @@ BPF_CALL_2(bpf_ringbuf_query, struct bpf_map *, map, u64, flags)
 		return smp_load_acquire(&rb->consumer_pos);
 	case BPF_RB_PROD_POS:
 		return smp_load_acquire(&rb->producer_pos);
+	case BPF_RB_OVER_POS:
+		return READ_ONCE(rb->overwrite_pos);
 	default:
 		return 0;
 	}
@@ -749,6 +853,9 @@ BPF_CALL_4(bpf_user_ringbuf_drain, struct bpf_map *, map,
 
 	rb = container_of(map, struct bpf_ringbuf_map, map)->rb;
 
+	if (unlikely(rb->overwrite_mode))
+		return -EOPNOTSUPP;
+
 	/* If another consumer is already consuming a sample, wait for them to finish. */
 	if (!atomic_try_cmpxchg(&rb->busy, &busy, 1))
 		return -EBUSY;
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 233de8677382..d3b2fd2ae527 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -1430,6 +1430,9 @@ enum {
 
 /* Do not translate kernel bpf_arena pointers to user pointers */
 	BPF_F_NO_USER_CONV	= (1U << 18),
+
+/* bpf ringbuf works in overwrite mode */
+	BPF_F_OVERWRITE		= (1U << 19),
 };
 
 /* Flags for BPF_PROG_QUERY. */
@@ -6215,6 +6218,7 @@ enum {
 	BPF_RB_RING_SIZE = 1,
 	BPF_RB_CONS_POS = 2,
 	BPF_RB_PROD_POS = 3,
+	BPF_RB_OVER_POS = 4,
 };
 
 /* BPF ring buffer constants */
-- 
2.43.0



* [PATCH bpf-next 2/4] libbpf: ringbuf: Add overwrite ring buffer process
  2025-08-04  2:20 [PATCH bpf-next 0/4] Add overwrite mode for bpf ring buffer Xu Kuohai
  2025-08-04  2:20 ` [PATCH bpf-next 1/4] bpf: " Xu Kuohai
@ 2025-08-04  2:20 ` Xu Kuohai
  2025-08-13 18:21   ` Zvi Effron
                     ` (2 more replies)
  2025-08-04  2:20 ` [PATCH bpf-next 3/4] selftests/bpf: Add test for overwrite ring buffer Xu Kuohai
  2025-08-04  2:21 ` [PATCH bpf-next 4/4] selftests/bpf/benchs: Add overwrite mode bench for rb-libbpf Xu Kuohai
  3 siblings, 3 replies; 15+ messages in thread
From: Xu Kuohai @ 2025-08-04  2:20 UTC (permalink / raw)
  To: bpf, linux-kselftest, linux-kernel
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Yonghong Song, Song Liu,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Mykola Lysenko, Shuah Khan, Stanislav Fomichev, Willem de Bruijn,
	Jason Xing, Paul Chaignon, Tao Chen, Kumar Kartikeya Dwivedi,
	Martin Kelly

From: Xu Kuohai <xukuohai@huawei.com>

In overwrite mode, the producer does not wait for the consumer, so the
consumer is responsible for handling conflicts. An optimistic method is
used to resolve them: the consumer first reads consumer_pos, producer_pos
and overwrite_pos, then computes a read window and copies the data in the
window out of the ring buffer. After copying, it rechecks the positions
to decide whether the data in the copy window has been overwritten by
the producer. If so, it discards the copy and tries again. On success,
the consumer processes the events in the copy.

Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
---
 tools/lib/bpf/ringbuf.c | 103 +++++++++++++++++++++++++++++++++++++++-
 1 file changed, 102 insertions(+), 1 deletion(-)

diff --git a/tools/lib/bpf/ringbuf.c b/tools/lib/bpf/ringbuf.c
index 9702b70da444..9c072af675ff 100644
--- a/tools/lib/bpf/ringbuf.c
+++ b/tools/lib/bpf/ringbuf.c
@@ -27,10 +27,13 @@ struct ring {
 	ring_buffer_sample_fn sample_cb;
 	void *ctx;
 	void *data;
+	void *read_buffer;
 	unsigned long *consumer_pos;
 	unsigned long *producer_pos;
+	unsigned long *overwrite_pos;
 	unsigned long mask;
 	int map_fd;
+	bool overwrite_mode;
 };
 
 struct ring_buffer {
@@ -69,6 +72,9 @@ static void ringbuf_free_ring(struct ring_buffer *rb, struct ring *r)
 		r->producer_pos = NULL;
 	}
 
+	if (r->read_buffer)
+		free(r->read_buffer);
+
 	free(r);
 }
 
@@ -119,6 +125,14 @@ int ring_buffer__add(struct ring_buffer *rb, int map_fd,
 	r->sample_cb = sample_cb;
 	r->ctx = ctx;
 	r->mask = info.max_entries - 1;
+	r->overwrite_mode = info.map_flags & BPF_F_OVERWRITE;
+	if (unlikely(r->overwrite_mode)) {
+		r->read_buffer = malloc(info.max_entries);
+		if (!r->read_buffer) {
+			err = -ENOMEM;
+			goto err_out;
+		}
+	}
 
 	/* Map writable consumer page */
 	tmp = mmap(NULL, rb->page_size, PROT_READ | PROT_WRITE, MAP_SHARED, map_fd, 0);
@@ -148,6 +162,7 @@ int ring_buffer__add(struct ring_buffer *rb, int map_fd,
 		goto err_out;
 	}
 	r->producer_pos = tmp;
+	r->overwrite_pos = r->producer_pos + 1; /* overwrite_pos is next to producer_pos */
 	r->data = tmp + rb->page_size;
 
 	e = &rb->events[rb->ring_cnt];
@@ -232,7 +247,7 @@ static inline int roundup_len(__u32 len)
 	return (len + 7) / 8 * 8;
 }
 
-static int64_t ringbuf_process_ring(struct ring *r, size_t n)
+static int64_t ringbuf_process_normal_ring(struct ring *r, size_t n)
 {
 	int *len_ptr, len, err;
 	/* 64-bit to avoid overflow in case of extreme application behavior */
@@ -278,6 +293,92 @@ static int64_t ringbuf_process_ring(struct ring *r, size_t n)
 	return cnt;
 }
 
+static int64_t ringbuf_process_overwrite_ring(struct ring *r, size_t n)
+{
+	int err;
+	uint32_t *len_ptr, len;
+	/* 64-bit to avoid overflow in case of extreme application behavior */
+	int64_t cnt = 0;
+	size_t size, offset;
+	unsigned long cons_pos, prod_pos, over_pos, tmp_pos;
+	bool got_new_data;
+	void *sample;
+	bool copied;
+
+	size = r->mask + 1;
+
+	cons_pos = smp_load_acquire(r->consumer_pos);
+	do {
+		got_new_data = false;
+
+		/* grab a copy of data */
+		prod_pos = smp_load_acquire(r->producer_pos);
+		do {
+			over_pos = READ_ONCE(*r->overwrite_pos);
+			/* prod_pos may be outdated now */
+			if (over_pos < prod_pos) {
+				tmp_pos = max(cons_pos, over_pos);
+				/* smp_load_acquire(r->producer_pos) before
+				 * READ_ONCE(*r->overwrite_pos) ensures that
+				 * over_pos + r->mask < prod_pos never occurs,
+				 * so size is never larger than r->mask
+				 */
+				size = prod_pos - tmp_pos;
+				if (!size)
+					goto done;
+				memcpy(r->read_buffer,
+				       r->data + (tmp_pos & r->mask), size);
+				copied = true;
+			} else {
+				copied = false;
+			}
+			prod_pos = smp_load_acquire(r->producer_pos);
+		/* retry if data is overwritten by producer */
+		} while (!copied || prod_pos - tmp_pos > r->mask);
+
+		cons_pos = tmp_pos;
+
+		for (offset = 0; offset < size; offset += roundup_len(len)) {
+			len_ptr = r->read_buffer + (offset & r->mask);
+			len = *len_ptr;
+
+			if (len & BPF_RINGBUF_BUSY_BIT)
+				goto done;
+
+			got_new_data = true;
+			cons_pos += roundup_len(len);
+
+			if ((len & BPF_RINGBUF_DISCARD_BIT) == 0) {
+				sample = (void *)len_ptr + BPF_RINGBUF_HDR_SZ;
+				err = r->sample_cb(r->ctx, sample, len);
+				if (err < 0) {
+					/* update consumer pos and bail out */
+					smp_store_release(r->consumer_pos,
+							  cons_pos);
+					return err;
+				}
+				cnt++;
+			}
+
+			if (cnt >= n)
+				goto done;
+		}
+	} while (got_new_data);
+
+done:
+	smp_store_release(r->consumer_pos, cons_pos);
+	return cnt;
+}
+
+static int64_t ringbuf_process_ring(struct ring *r, size_t n)
+{
+	if (likely(!r->overwrite_mode))
+		return ringbuf_process_normal_ring(r, n);
+	else
+		return ringbuf_process_overwrite_ring(r, n);
+}
+
 /* Consume available ring buffer(s) data without event polling, up to n
  * records.
  *
-- 
2.43.0



* [PATCH bpf-next 3/4] selftests/bpf: Add test for overwrite ring buffer
  2025-08-04  2:20 [PATCH bpf-next 0/4] Add overwrite mode for bpf ring buffer Xu Kuohai
  2025-08-04  2:20 ` [PATCH bpf-next 1/4] bpf: " Xu Kuohai
  2025-08-04  2:20 ` [PATCH bpf-next 2/4] libbpf: ringbuf: Add overwrite ring buffer process Xu Kuohai
@ 2025-08-04  2:20 ` Xu Kuohai
  2025-08-04  2:21 ` [PATCH bpf-next 4/4] selftests/bpf/benchs: Add overwrite mode bench for rb-libbpf Xu Kuohai
  3 siblings, 0 replies; 15+ messages in thread
From: Xu Kuohai @ 2025-08-04  2:20 UTC (permalink / raw)
  To: bpf, linux-kselftest, linux-kernel
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Yonghong Song, Song Liu,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Mykola Lysenko, Shuah Khan, Stanislav Fomichev, Willem de Bruijn,
	Jason Xing, Paul Chaignon, Tao Chen, Kumar Kartikeya Dwivedi,
	Martin Kelly

From: Xu Kuohai <xukuohai@huawei.com>

Add a test for the overwrite-mode ring buffer.

Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
---
 tools/testing/selftests/bpf/Makefile          |  3 +-
 .../selftests/bpf/prog_tests/ringbuf.c        | 74 ++++++++++++++
 .../bpf/progs/test_ringbuf_overwrite.c        | 98 +++++++++++++++++++
 3 files changed, 174 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/bpf/progs/test_ringbuf_overwrite.c

diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index 4863106034df..8a3796a2e5f5 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -499,7 +499,8 @@ LINKED_SKELS := test_static_linked.skel.h linked_funcs.skel.h		\
 LSKELS := fentry_test.c fexit_test.c fexit_sleep.c atomics.c 		\
 	trace_printk.c trace_vprintk.c map_ptr_kern.c 			\
 	core_kern.c core_kern_overflow.c test_ringbuf.c			\
-	test_ringbuf_n.c test_ringbuf_map_key.c test_ringbuf_write.c
+	test_ringbuf_n.c test_ringbuf_map_key.c test_ringbuf_write.c    \
+	test_ringbuf_overwrite.c
 
 # Generate both light skeleton and libbpf skeleton for these
 LSKELS_EXTRA := test_ksyms_module.c test_ksyms_weak.c kfunc_call_test.c \
diff --git a/tools/testing/selftests/bpf/prog_tests/ringbuf.c b/tools/testing/selftests/bpf/prog_tests/ringbuf.c
index d1e4cb28a72c..205a51c725a7 100644
--- a/tools/testing/selftests/bpf/prog_tests/ringbuf.c
+++ b/tools/testing/selftests/bpf/prog_tests/ringbuf.c
@@ -17,6 +17,7 @@
 #include "test_ringbuf_n.lskel.h"
 #include "test_ringbuf_map_key.lskel.h"
 #include "test_ringbuf_write.lskel.h"
+#include "test_ringbuf_overwrite.lskel.h"
 
 #define EDONE 7777
 
@@ -497,6 +498,77 @@ static void ringbuf_map_key_subtest(void)
 	test_ringbuf_map_key_lskel__destroy(skel_map_key);
 }
 
+static void ringbuf_overwrite_mode_subtest(void)
+{
+	unsigned long size, len1, len2, len3, len4, len5;
+	unsigned long expect_avail_data, expect_prod_pos, expect_over_pos;
+	struct test_ringbuf_overwrite_lskel *skel;
+	int err;
+
+	skel = test_ringbuf_overwrite_lskel__open();
+	if (!ASSERT_OK_PTR(skel, "skel_open"))
+		return;
+
+	size = 0x1000;
+	len1 = 0x800;
+	len2 = 0x400;
+	len3 = size - len1 - len2 - BPF_RINGBUF_HDR_SZ * 3; /* 0x3e8 */
+	len4 = len3 - 8; /* 0x3e0 */
+	len5 = len3; /* retry with len3 */
+
+	skel->maps.ringbuf.max_entries = size;
+	skel->rodata->LEN1 = len1;
+	skel->rodata->LEN2 = len2;
+	skel->rodata->LEN3 = len3;
+	skel->rodata->LEN4 = len4;
+	skel->rodata->LEN5 = len5;
+
+	skel->bss->pid = getpid();
+
+	err = test_ringbuf_overwrite_lskel__load(skel);
+	if (!ASSERT_OK(err, "skel_load"))
+		goto cleanup;
+
+	err = test_ringbuf_overwrite_lskel__attach(skel);
+	if (!ASSERT_OK(err, "skel_attach"))
+		goto cleanup;
+
+	syscall(__NR_getpgid);
+
+	ASSERT_EQ(skel->bss->reserve1_fail, 0, "reserve 1");
+	ASSERT_EQ(skel->bss->reserve2_fail, 0, "reserve 2");
+	ASSERT_EQ(skel->bss->reserve3_fail, 1, "reserve 3");
+	ASSERT_EQ(skel->bss->reserve4_fail, 0, "reserve 4");
+	ASSERT_EQ(skel->bss->reserve5_fail, 0, "reserve 5");
+
+	CHECK(skel->bss->ring_size != size,
+	      "check_ring_size", "exp %lu, got %lu\n",
+	      size, skel->bss->ring_size);
+
+	expect_avail_data = len2 + len4 + len5 + 3 * BPF_RINGBUF_HDR_SZ;
+	CHECK(skel->bss->avail_data != expect_avail_data,
+	      "check_avail_size", "exp %lu, got %lu\n",
+	      expect_avail_data, skel->bss->avail_data);
+
+	CHECK(skel->bss->cons_pos != 0,
+	      "check_cons_pos", "exp 0, got %lu\n",
+	      skel->bss->cons_pos);
+
+	expect_prod_pos = len1 + len2 + len4 + len5 + 4 * BPF_RINGBUF_HDR_SZ;
+	CHECK(skel->bss->prod_pos != expect_prod_pos,
+	      "check_prod_pos", "exp %lu, got %lu\n",
+	      expect_prod_pos, skel->bss->prod_pos);
+
+	expect_over_pos = len1 + BPF_RINGBUF_HDR_SZ;
+	CHECK(skel->bss->over_pos != expect_over_pos,
+	      "check_over_pos", "exp %lu, got %lu\n",
+	      (unsigned long)expect_over_pos, skel->bss->over_pos);
+
+	test_ringbuf_overwrite_lskel__detach(skel);
+cleanup:
+	test_ringbuf_overwrite_lskel__destroy(skel);
+}
+
 void test_ringbuf(void)
 {
 	if (test__start_subtest("ringbuf"))
@@ -507,4 +579,6 @@ void test_ringbuf(void)
 		ringbuf_map_key_subtest();
 	if (test__start_subtest("ringbuf_write"))
 		ringbuf_write_subtest();
+	if (test__start_subtest("ringbuf_overwrite_mode"))
+		ringbuf_overwrite_mode_subtest();
 }
diff --git a/tools/testing/selftests/bpf/progs/test_ringbuf_overwrite.c b/tools/testing/selftests/bpf/progs/test_ringbuf_overwrite.c
new file mode 100644
index 000000000000..da89ba12a75c
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_ringbuf_overwrite.c
@@ -0,0 +1,98 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (C) 2025. Huawei Technologies Co., Ltd */
+
+#include <linux/bpf.h>
+#include <bpf/bpf_helpers.h>
+#include "bpf_misc.h"
+
+char _license[] SEC("license") = "GPL";
+
+struct {
+	__uint(type, BPF_MAP_TYPE_RINGBUF);
+	__uint(map_flags, BPF_F_OVERWRITE);
+} ringbuf SEC(".maps");
+
+int pid;
+
+const volatile unsigned long LEN1;
+const volatile unsigned long LEN2;
+const volatile unsigned long LEN3;
+const volatile unsigned long LEN4;
+const volatile unsigned long LEN5;
+
+long reserve1_fail = 0;
+long reserve2_fail = 0;
+long reserve3_fail = 0;
+long reserve4_fail = 0;
+long reserve5_fail = 0;
+
+unsigned long avail_data = 0;
+unsigned long ring_size = 0;
+unsigned long cons_pos = 0;
+unsigned long prod_pos = 0;
+unsigned long over_pos = 0;
+
+SEC("fentry/" SYS_PREFIX "sys_getpgid")
+int test_overwrite_ringbuf(void *ctx)
+{
+	char *rec1, *rec2, *rec3, *rec4, *rec5;
+	int cur_pid = bpf_get_current_pid_tgid() >> 32;
+
+	if (cur_pid != pid)
+		return 0;
+
+	rec1 = bpf_ringbuf_reserve(&ringbuf, LEN1, 0);
+	if (!rec1) {
+		reserve1_fail = 1;
+		return 0;
+	}
+
+	rec2 = bpf_ringbuf_reserve(&ringbuf, LEN2, 0);
+	if (!rec2) {
+		bpf_ringbuf_discard(rec1, 0);
+		reserve2_fail = 1;
+		return 0;
+	}
+
+	rec3 = bpf_ringbuf_reserve(&ringbuf, LEN3, 0);
+	/* expect failure */
+	if (!rec3) {
+		reserve3_fail = 1;
+	} else {
+		bpf_ringbuf_discard(rec1, 0);
+		bpf_ringbuf_discard(rec2, 0);
+		bpf_ringbuf_discard(rec3, 0);
+		return 0;
+	}
+
+	rec4 = bpf_ringbuf_reserve(&ringbuf, LEN4, 0);
+	if (!rec4) {
+		reserve4_fail = 1;
+		bpf_ringbuf_discard(rec1, 0);
+		bpf_ringbuf_discard(rec2, 0);
+		return 0;
+	}
+
+	bpf_ringbuf_submit(rec1, 0);
+	bpf_ringbuf_submit(rec2, 0);
+	bpf_ringbuf_submit(rec4, 0);
+
+	rec5 = bpf_ringbuf_reserve(&ringbuf, LEN5, 0);
+	if (!rec5) {
+		reserve5_fail = 1;
+		return 0;
+	}
+
+	for (unsigned long i = 0; i < LEN5; i++)
+		rec5[i] = 0xdd;
+
+	bpf_ringbuf_submit(rec5, 0);
+
+	ring_size = bpf_ringbuf_query(&ringbuf, BPF_RB_RING_SIZE);
+	avail_data = bpf_ringbuf_query(&ringbuf, BPF_RB_AVAIL_DATA);
+	cons_pos = bpf_ringbuf_query(&ringbuf, BPF_RB_CONS_POS);
+	prod_pos = bpf_ringbuf_query(&ringbuf, BPF_RB_PROD_POS);
+	over_pos = bpf_ringbuf_query(&ringbuf, BPF_RB_OVER_POS);
+
+	return 0;
+}
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH bpf-next 4/4] selftests/bpf/benchs: Add overwrite mode bench for rb-libbpf
  2025-08-04  2:20 [PATCH bpf-next 0/4] Add overwrite mode for bpf ring buffer Xu Kuohai
                   ` (2 preceding siblings ...)
  2025-08-04  2:20 ` [PATCH bpf-next 3/4] selftests/bpf: Add test for overwrite ring buffer Xu Kuohai
@ 2025-08-04  2:21 ` Xu Kuohai
  3 siblings, 0 replies; 15+ messages in thread
From: Xu Kuohai @ 2025-08-04  2:21 UTC (permalink / raw)
  To: bpf, linux-kselftest, linux-kernel
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Yonghong Song, Song Liu,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Mykola Lysenko, Shuah Khan, Stanislav Fomichev, Willem de Bruijn,
	Jason Xing, Paul Chaignon, Tao Chen, Kumar Kartikeya Dwivedi,
	Martin Kelly

From: Xu Kuohai <xukuohai@huawei.com>

Add overwrite mode bench for ring buffer.

For reference, below are bench numbers collected from x86_64 and arm64.

- x86_64 (AMD EPYC 9654)

  Ringbuf, multi-producer contention, overwrite mode
  ==================================================
  rb-libbpf nr_prod 1  14.970 ± 0.012M/s (drops 0.000 ± 0.000M/s)
  rb-libbpf nr_prod 2  14.064 ± 0.007M/s (drops 0.000 ± 0.000M/s)
  rb-libbpf nr_prod 3  7.493 ± 0.003M/s (drops 0.000 ± 0.000M/s)
  rb-libbpf nr_prod 4  6.575 ± 0.001M/s (drops 0.000 ± 0.000M/s)
  rb-libbpf nr_prod 8  3.696 ± 0.011M/s (drops 0.000 ± 0.000M/s)
  rb-libbpf nr_prod 12 2.612 ± 0.012M/s (drops 0.000 ± 0.000M/s)
  rb-libbpf nr_prod 16 2.335 ± 0.005M/s (drops 0.000 ± 0.000M/s)
  rb-libbpf nr_prod 20 2.079 ± 0.005M/s (drops 0.000 ± 0.000M/s)
  rb-libbpf nr_prod 24 1.965 ± 0.004M/s (drops 0.000 ± 0.000M/s)
  rb-libbpf nr_prod 28 1.846 ± 0.004M/s (drops 0.000 ± 0.000M/s)
  rb-libbpf nr_prod 32 1.790 ± 0.002M/s (drops 0.000 ± 0.000M/s)
  rb-libbpf nr_prod 36 1.735 ± 0.002M/s (drops 0.000 ± 0.000M/s)
  rb-libbpf nr_prod 40 1.701 ± 0.002M/s (drops 0.000 ± 0.000M/s)
  rb-libbpf nr_prod 44 1.669 ± 0.001M/s (drops 0.000 ± 0.000M/s)
  rb-libbpf nr_prod 48 1.749 ± 0.001M/s (drops 0.000 ± 0.000M/s)
  rb-libbpf nr_prod 52 1.709 ± 0.001M/s (drops 0.000 ± 0.000M/s)

- arm64 (HiSilicon Kunpeng 920)

  Ringbuf, multi-producer contention, overwrite mode
  ==================================================
  rb-libbpf nr_prod 1  10.319 ± 0.231M/s (drops 0.000 ± 0.000M/s)
  rb-libbpf nr_prod 2  9.219 ± 0.006M/s (drops 0.000 ± 0.000M/s)
  rb-libbpf nr_prod 3  6.699 ± 0.013M/s (drops 0.000 ± 0.000M/s)
  rb-libbpf nr_prod 4  4.608 ± 0.001M/s (drops 0.000 ± 0.000M/s)
  rb-libbpf nr_prod 8  3.905 ± 0.001M/s (drops 0.000 ± 0.000M/s)
  rb-libbpf nr_prod 12 3.282 ± 0.004M/s (drops 0.000 ± 0.000M/s)
  rb-libbpf nr_prod 16 3.182 ± 0.008M/s (drops 0.000 ± 0.000M/s)
  rb-libbpf nr_prod 20 3.029 ± 0.006M/s (drops 0.000 ± 0.000M/s)
  rb-libbpf nr_prod 24 3.116 ± 0.004M/s (drops 0.000 ± 0.000M/s)
  rb-libbpf nr_prod 28 2.869 ± 0.005M/s (drops 0.000 ± 0.000M/s)
  rb-libbpf nr_prod 32 3.075 ± 0.010M/s (drops 0.000 ± 0.000M/s)
  rb-libbpf nr_prod 36 2.795 ± 0.003M/s (drops 0.000 ± 0.000M/s)
  rb-libbpf nr_prod 40 2.947 ± 0.005M/s (drops 0.000 ± 0.000M/s)
  rb-libbpf nr_prod 44 2.748 ± 0.006M/s (drops 0.000 ± 0.000M/s)
  rb-libbpf nr_prod 48 2.767 ± 0.003M/s (drops 0.000 ± 0.000M/s)
  rb-libbpf nr_prod 52 2.858 ± 0.002M/s (drops 0.000 ± 0.000M/s)

Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
---
 .../selftests/bpf/benchs/bench_ringbufs.c     | 22 ++++++++++++++++++-
 .../bpf/benchs/run_bench_ringbufs.sh          |  4 ++++
 2 files changed, 25 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/bpf/benchs/bench_ringbufs.c b/tools/testing/selftests/bpf/benchs/bench_ringbufs.c
index e1ee979e6acc..6fdfc61c721b 100644
--- a/tools/testing/selftests/bpf/benchs/bench_ringbufs.c
+++ b/tools/testing/selftests/bpf/benchs/bench_ringbufs.c
@@ -19,6 +19,7 @@ static struct {
 	int ringbuf_sz; /* per-ringbuf, in bytes */
 	bool ringbuf_use_output; /* use slower output API */
 	int perfbuf_sz; /* per-CPU size, in pages */
+	bool overwrite_mode;
 } args = {
 	.back2back = false,
 	.batch_cnt = 500,
@@ -27,6 +28,7 @@ static struct {
 	.ringbuf_sz = 512 * 1024,
 	.ringbuf_use_output = false,
 	.perfbuf_sz = 128,
+	.overwrite_mode = false,
 };
 
 enum {
@@ -35,6 +37,7 @@ enum {
 	ARG_RB_BATCH_CNT = 2002,
 	ARG_RB_SAMPLED = 2003,
 	ARG_RB_SAMPLE_RATE = 2004,
+	ARG_RB_OVERWRITE = 2005,
 };
 
 static const struct argp_option opts[] = {
@@ -43,6 +46,7 @@ static const struct argp_option opts[] = {
 	{ "rb-batch-cnt", ARG_RB_BATCH_CNT, "CNT", 0, "Set BPF-side record batch count"},
 	{ "rb-sampled", ARG_RB_SAMPLED, NULL, 0, "Notification sampling"},
 	{ "rb-sample-rate", ARG_RB_SAMPLE_RATE, "RATE", 0, "Notification sample rate"},
+	{ "rb-overwrite", ARG_RB_OVERWRITE, NULL, 0, "Use overwrite mode"},
 	{},
 };
 
@@ -72,6 +76,9 @@ static error_t parse_arg(int key, char *arg, struct argp_state *state)
 			argp_usage(state);
 		}
 		break;
+	case ARG_RB_OVERWRITE:
+		args.overwrite_mode = true;
+		break;
 	default:
 		return ARGP_ERR_UNKNOWN;
 	}
@@ -104,6 +111,11 @@ static void bufs_validate(void)
 		fprintf(stderr, "back-to-back mode makes sense only for single-producer case!\n");
 		exit(1);
 	}
+
+	if (args.overwrite_mode && strcmp(env.bench_name, "rb-libbpf") != 0) {
+		fprintf(stderr, "rb-overwrite mode only supports rb-libbpf!\n");
+		exit(1);
+	}
 }
 
 static void *bufs_sample_producer(void *input)
@@ -134,6 +146,8 @@ static void ringbuf_libbpf_measure(struct bench_res *res)
 
 static struct ringbuf_bench *ringbuf_setup_skeleton(void)
 {
+	__u32 flags;
+	struct bpf_map *ringbuf;
 	struct ringbuf_bench *skel;
 
 	setup_libbpf();
@@ -151,7 +165,13 @@ static struct ringbuf_bench *ringbuf_setup_skeleton(void)
 		/* record data + header take 16 bytes */
 		skel->rodata->wakeup_data_size = args.sample_rate * 16;
 
-	bpf_map__set_max_entries(skel->maps.ringbuf, args.ringbuf_sz);
+	ringbuf = skel->maps.ringbuf;
+	if (args.overwrite_mode) {
+		flags = bpf_map__map_flags(ringbuf) | BPF_F_OVERWRITE;
+		bpf_map__set_map_flags(ringbuf, flags);
+	}
+
+	bpf_map__set_max_entries(ringbuf, args.ringbuf_sz);
 
 	if (ringbuf_bench__load(skel)) {
 		fprintf(stderr, "failed to load skeleton\n");
diff --git a/tools/testing/selftests/bpf/benchs/run_bench_ringbufs.sh b/tools/testing/selftests/bpf/benchs/run_bench_ringbufs.sh
index 91e3567962ff..4e758bc52b73 100755
--- a/tools/testing/selftests/bpf/benchs/run_bench_ringbufs.sh
+++ b/tools/testing/selftests/bpf/benchs/run_bench_ringbufs.sh
@@ -49,3 +49,7 @@ for b in 1 2 3 4 8 12 16 20 24 28 32 36 40 44 48 52; do
 	summarize "rb-libbpf nr_prod $b" "$($RUN_RB_BENCH -p$b --rb-batch-cnt 50 rb-libbpf)"
 done
 
+header "Ringbuf, multi-producer contention, overwrite mode"
+for b in 1 2 3 4 8 12 16 20 24 28 32 36 40 44 48 52; do
+	summarize "rb-libbpf nr_prod $b" "$($RUN_RB_BENCH -p$b --rb-overwrite --rb-batch-cnt 50 rb-libbpf)"
+done
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH bpf-next 1/4] bpf: Add overwrite mode for bpf ring buffer
  2025-08-04  2:20 ` [PATCH bpf-next 1/4] bpf: " Xu Kuohai
@ 2025-08-08 21:39   ` Alexei Starovoitov
  2025-08-12  4:02     ` Xu Kuohai
  0 siblings, 1 reply; 15+ messages in thread
From: Alexei Starovoitov @ 2025-08-08 21:39 UTC (permalink / raw)
  To: Xu Kuohai
  Cc: bpf, open list:KERNEL SELFTEST FRAMEWORK, LKML,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Yonghong Song, Song Liu,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Mykola Lysenko, Shuah Khan, Stanislav Fomichev, Willem de Bruijn,
	Jason Xing, Paul Chaignon, Tao Chen, Kumar Kartikeya Dwivedi,
	Martin Kelly

On Sun, Aug 3, 2025 at 7:27 PM Xu Kuohai <xukuohai@huaweicloud.com> wrote:
>
> From: Xu Kuohai <xukuohai@huawei.com>
>
> When the bpf ring buffer is full, new events can not be recorded until
> the consumer consumes some events to free space. This may cause critical
> events to be discarded, such as in fault diagnostics, where recent events
> are more critical than older ones.
>
> So add overwrite mode for bpf ring buffer. In this mode, the new event
> overwrites the oldest event when the buffer is full.
>
> The scheme is as follows:
>
> 1. producer_pos tracks the next position to write new data. When there
>    is enough free space, producer simply moves producer_pos forward to
>    make space for the new event.
>
> 2. To avoid waiting for consumer to free space when the buffer is full,
>    a new variable overwrite_pos is introduced for producer. overwrite_pos
>    tracks the next event to be overwritten (the oldest event committed) in
>    the buffer. producer moves it forward to discard the oldest events when
>    the buffer is full.
>
> 3. pending_pos tracks the oldest event under committing. producer ensures
>    producers_pos never passes pending_pos when making space for new events.
>    So multiple producers never write to the same position at the same time.
>
> 4. producer wakes up consumer every half a round ahead to give it a chance
>    to retrieve data. However, for an overwrite-mode ring buffer, users
>    typically only care about the ring buffer snapshot before a fault occurs.
>    In this case, the producer should commit data with BPF_RB_NO_WAKEUP flag
>    to avoid unnecessary wakeups.

If I understand it correctly the algorithm requires all events to be the same
size otherwise first overwrite might trash the header,
also the producers should use some kind of signaling to
timestamp each event otherwise it all will look out of order to the consumer.

At the end it looks inferior to the existing perf ring buffer with overwrite.
Since in both cases the out of order needs to be dealt with
in post processing the main advantage of ring buf vs perf buf is gone.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH bpf-next 1/4] bpf: Add overwrite mode for bpf ring buffer
  2025-08-08 21:39   ` Alexei Starovoitov
@ 2025-08-12  4:02     ` Xu Kuohai
  2025-08-13 13:22       ` Jordan Rome
  0 siblings, 1 reply; 15+ messages in thread
From: Xu Kuohai @ 2025-08-12  4:02 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, open list:KERNEL SELFTEST FRAMEWORK, LKML,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Yonghong Song, Song Liu,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Mykola Lysenko, Shuah Khan, Stanislav Fomichev, Willem de Bruijn,
	Jason Xing, Paul Chaignon, Tao Chen, Kumar Kartikeya Dwivedi,
	Martin Kelly

On 8/9/2025 5:39 AM, Alexei Starovoitov wrote:
> On Sun, Aug 3, 2025 at 7:27 PM Xu Kuohai <xukuohai@huaweicloud.com> wrote:
>>
>> From: Xu Kuohai <xukuohai@huawei.com>
>>
>> When the bpf ring buffer is full, new events can not be recorded until
>> the consumer consumes some events to free space. This may cause critical
>> events to be discarded, such as in fault diagnostics, where recent events
>> are more critical than older ones.
>>
>> So add overwrite mode for bpf ring buffer. In this mode, the new event
>> overwrites the oldest event when the buffer is full.
>>
>> The scheme is as follows:
>>
>> 1. producer_pos tracks the next position to write new data. When there
>>     is enough free space, producer simply moves producer_pos forward to
>>     make space for the new event.
>>
>> 2. To avoid waiting for consumer to free space when the buffer is full,
>>     a new variable overwrite_pos is introduced for producer. overwrite_pos
>>     tracks the next event to be overwritten (the oldest event committed) in
>>     the buffer. producer moves it forward to discard the oldest events when
>>     the buffer is full.
>>
>> 3. pending_pos tracks the oldest event under committing. producer ensures
>>     producer_pos never passes pending_pos when making space for new events.
>>     So multiple producers never write to the same position at the same time.
>>
>> 4. producer wakes up consumer every half a round ahead to give it a chance
>>     to retrieve data. However, for an overwrite-mode ring buffer, users
>>     typically only care about the ring buffer snapshot before a fault occurs.
>>     In this case, the producer should commit data with BPF_RB_NO_WAKEUP flag
>>     to avoid unnecessary wakeups.
> 
> If I understand it correctly the algorithm requires all events to be the same
> size otherwise first overwrite might trash the header,
> also the producers should use some kind of signaling to
> timestamp each event otherwise it all will look out of order to the consumer.
> 
> At the end it looks inferior to the existing perf ring buffer with overwrite.
> Since in both cases the out of order needs to be dealt with
> in post processing the main advantage of ring buf vs perf buf is gone.

No, the advantage is not gone.

The ring buffer is still shared by multiple producers. When an event occurs,
the producer queues up to acquire the spin lock of the ring buffer to write
the event to it. So events in the ring buffer are always ordered; no
out-of-order records occur.

And events are not required to be the same size. When an overwrite happens,
the events being trashed are discarded, and overwrite_pos is moved forward
to skip these events until it reaches the first event that is not trashed.

To make it clear, here are some example diagrams.

1. Let's say we have a ring buffer with size 4096.

    At first, {producer,overwrite,pending,consumer}_pos are all set to 0

    0       512      1024    1536     2048     2560     3072     3584       4096
    +-----------------------------------------------------------------------+
    |                                                                       |
    |                                                                       |
    |                                                                       |
    +-----------------------------------------------------------------------+
    ^
    |
    |
producer_pos = 0
overwrite_pos = 0
pending_pos = 0
consumer_pos = 0

2. Reserve event A, size 512.

    There is enough free space, so A is allocated at offset 0 and producer_pos
    is moved to 512, the end of A. Since A is not submitted, the BUSY bit is
    set.

    0       512      1024    1536     2048     2560     3072     3584       4096
    +-----------------------------------------------------------------------+
    |        |                                                              |
    |   A    |                                                              |
    | [BUSY] |                                                              |
    +-----------------------------------------------------------------------+
    ^        ^
    |        |
    |        |
    |    producer_pos = 512
    |
overwrite_pos = 0
pending_pos = 0
consumer_pos = 0


3. Reserve event B, size 1024.

    B is allocated at offset 512 with BUSY bit set, and producer_pos is moved
    to the end of B.

    0       512      1024    1536     2048     2560     3072     3584       4096
    +-----------------------------------------------------------------------+
    |        |                 |                                            |
    |   A    |        B        |                                            |
    | [BUSY] |      [BUSY]     |                                            |
    +-----------------------------------------------------------------------+
    ^                          ^
    |                          |
    |                          |
    |                   producer_pos = 1536
    |
overwrite_pos = 0
pending_pos = 0
consumer_pos = 0

4. Reserve event C, size 2048.

    C is allocated at offset 1536 and producer_pos becomes 3584.

    0       512      1024    1536     2048     2560     3072     3584       4096
    +-----------------------------------------------------------------------+
    |        |                 |                                   |        |
    |    A   |        B        |                 C                 |        |
    | [BUSY] |      [BUSY]     |               [BUSY]              |        |
    +-----------------------------------------------------------------------+
    ^                                                              ^
    |                                                              |
    |                                                              |
    |                                                    producer_pos = 3584
    |
overwrite_pos = 0
pending_pos = 0
consumer_pos = 0

5. Submit event A.

    The BUSY bit of A is cleared. B becomes the oldest event under writing, so
    pending_pos is moved to 512, the start of B.

    0       512      1024    1536     2048     2560     3072     3584       4096
    +-----------------------------------------------------------------------+
    |        |                 |                                   |        |
    |    A   |        B        |                 C                 |        |
    |        |      [BUSY]     |               [BUSY]              |        |
    +-----------------------------------------------------------------------+
    ^        ^                                                     ^
    |        |                                                     |
    |        |                                                     |
    |   pending_pos = 512                                  producer_pos = 3584
    |
overwrite_pos = 0
consumer_pos = 0

6. Submit event B.

    The BUSY bit of B is cleared, and pending_pos is moved to the start of C,
    which is the oldest event under writing now.

    0       512      1024    1536     2048     2560     3072     3584       4096
    +-----------------------------------------------------------------------+
    |        |                 |                                   |        |
    |    A   |        B        |                 C                 |        |
    |        |                 |               [BUSY]              |        |
    +-----------------------------------------------------------------------+
    ^                          ^                                   ^
    |                          |                                   |
    |                          |                                   |
    |                     pending_pos = 1536               producer_pos = 3584
    |
overwrite_pos = 0
consumer_pos = 0

7. Reserve event D, size 1536 (3 * 512).

    There are 2048 bytes not under writing between producer_pos and pending_pos,
    so D is allocated at offset 3584, and producer_pos is moved from 3584 to
    5120.

    Since event D will overwrite all bytes of event A and the beginning 512 bytes
    of event B, overwrite_pos is moved to the start of event C, the oldest event
    that is not overwritten.

    0       512      1024    1536     2048     2560     3072     3584       4096
    +-----------------------------------------------------------------------+
    |                 |        |                                   |        |
    |      D End      |        |                 C                 | D Begin|
    |      [BUSY]     |        |               [BUSY]              | [BUSY] |
    +-----------------------------------------------------------------------+
    ^                 ^        ^
    |                 |        |
    |                 |   pending_pos = 1536
    |                 |   overwrite_pos = 1536
    |                 |
    |             producer_pos=5120
    |
consumer_pos = 0

8. Reserve event E, size 1024.

    Though there are 512 bytes not under writing between producer_pos and
    pending_pos, E can not be reserved, as it would overwrite the first 512
    bytes of event C, which is still under writing.

9. Submit event C and D.

    pending_pos is moved to the end of D.

    0       512      1024    1536     2048     2560     3072     3584       4096
    +-----------------------------------------------------------------------+
    |                 |        |                                   |        |
    |      D End      |        |                 C                 | D Begin|
    |                 |        |                                   |        |
    +-----------------------------------------------------------------------+
    ^                 ^        ^
    |                 |        |
    |                 |   overwrite_pos = 1536
    |                 |
    |             producer_pos=5120
    |             pending_pos=5120
    |
consumer_pos = 0


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH bpf-next 1/4] bpf: Add overwrite mode for bpf ring buffer
  2025-08-12  4:02     ` Xu Kuohai
@ 2025-08-13 13:22       ` Jordan Rome
  2025-08-14 13:59         ` Xu Kuohai
  0 siblings, 1 reply; 15+ messages in thread
From: Jordan Rome @ 2025-08-13 13:22 UTC (permalink / raw)
  To: Xu Kuohai, Alexei Starovoitov
  Cc: bpf, open list:KERNEL SELFTEST FRAMEWORK, LKML,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Yonghong Song, Song Liu,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Mykola Lysenko, Shuah Khan, Stanislav Fomichev, Willem de Bruijn,
	Jason Xing, Paul Chaignon, Tao Chen, Kumar Kartikeya Dwivedi,
	Martin Kelly


On 8/12/25 12:02 AM, Xu Kuohai wrote:
> On 8/9/2025 5:39 AM, Alexei Starovoitov wrote:
>> On Sun, Aug 3, 2025 at 7:27 PM Xu Kuohai <xukuohai@huaweicloud.com> 
>> wrote:
>>>
>>> From: Xu Kuohai <xukuohai@huawei.com>
>>>
>>> When the bpf ring buffer is full, new events can not be recorded until
>>> the consumer consumes some events to free space. This may cause
>>> critical events to be discarded, such as in fault diagnostics, where
>>> recent events are more critical than older ones.
>>>
>>> So add overwrite mode for bpf ring buffer. In this mode, the new event
>>> overwrites the oldest event when the buffer is full.
>>>
>>> The scheme is as follows:
>>>
>>> 1. producer_pos tracks the next position to write new data. When there
>>>    is enough free space, producer simply moves producer_pos forward to
>>>    make space for the new event.
>>>
>>> 2. To avoid waiting for consumer to free space when the buffer is full,
>>>    a new variable overwrite_pos is introduced for producer.
>>>    overwrite_pos tracks the next event to be overwritten (the oldest
>>>    event committed) in the buffer. producer moves it forward to discard
>>>    the oldest events when the buffer is full.
>>>
>>> 3. pending_pos tracks the oldest event under committing. producer
>>>    ensures producer_pos never passes pending_pos when making space for
>>>    new events. So multiple producers never write to the same position
>>>    at the same time.
>>>
>>> 4. producer wakes up consumer every half a round ahead to give it a
>>>    chance to retrieve data. However, for an overwrite-mode ring buffer,
>>>    users typically only care about the ring buffer snapshot before a
>>>    fault occurs. In this case, the producer should commit data with the
>>>    BPF_RB_NO_WAKEUP flag to avoid unnecessary wakeups.
>>
>> If I understand it correctly the algorithm requires all events to be the
>> same size otherwise first overwrite might trash the header, also the
>> producers should use some kind of signaling to timestamp each event
>> otherwise it all will look out of order to the consumer.
>>
>> At the end it looks inferior to the existing perf ring buffer with
>> overwrite. Since in both cases the out of order needs to be dealt with
>> in post processing the main advantage of ring buf vs perf buf is gone.
>
> No, the advantage is not gone.
>
> The ring buffer is still shared by multiple producers. When an event
> occurs, the producer queues up to acquire the spin lock of the ring
> buffer to write the event to it. So events in the ring buffer are always
> ordered; no out-of-order records occur.
>
> And events are not required to be the same size. When an overwrite
> happens, the events being trashed are discarded, and overwrite_pos is
> moved forward to skip these events until it reaches the first event that
> is not trashed.
>
> To make it clear, here are some example diagrams.
>
> [... nine diagram steps trimmed; see the parent message for the full
> walkthrough ...]

These diagrams are very helpful in terms of understanding the flow.
In part 7, when A is overwritten by D, why doesn't the consumer position
move forward to point to the beginning of C? If the ring buffer producer
guarantees ordering of reserved slots, then C, in this case, is now the
oldest reserved. This speaks to your second patch, where you say that the
consumer resolves conflicts by discarding data that has been overwritten,
but I feel like the simpler thing to do is just move the consumer position.



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH bpf-next 2/4] libbpf: ringbuf: Add overwrite ring buffer process
  2025-08-04  2:20 ` [PATCH bpf-next 2/4] libbpf: ringbuf: Add overwrite ring buffer process Xu Kuohai
@ 2025-08-13 18:21   ` Zvi Effron
  2025-08-14 14:10     ` Xu Kuohai
  2025-08-14 19:34   ` Eduard Zingerman
  2025-08-22 21:23   ` Andrii Nakryiko
  2 siblings, 1 reply; 15+ messages in thread
From: Zvi Effron @ 2025-08-13 18:21 UTC (permalink / raw)
  To: Xu Kuohai
  Cc: bpf, linux-kselftest, linux-kernel, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau,
	Eduard Zingerman, Yonghong Song, Song Liu, John Fastabend,
	KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa, Mykola Lysenko,
	Shuah Khan, Stanislav Fomichev, Willem de Bruijn, Jason Xing,
	Paul Chaignon, Tao Chen, Kumar Kartikeya Dwivedi, Martin Kelly

On Sun, Aug 3, 2025 at 7:27 PM Xu Kuohai <xukuohai@huaweicloud.com> wrote:
>
> From: Xu Kuohai <xukuohai@huawei.com>
>
> In overwrite mode, the producer does not wait for the consumer, so the
> consumer is responsible for handling conflicts. An optimistic method
> is used to resolve the conflicts: the consumer first reads consumer_pos,
> producer_pos and overwrite_pos, then calculates a read window and copies
> data in the window from the ring buffer. After copying, it checks the
> positions to decide if the data in the copy window has been overwritten
> by the producer. If so, it discards the copy and tries again. On
> success, the consumer processes the events in the copy.
>
> Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
> ---
> tools/lib/bpf/ringbuf.c | 103 +++++++++++++++++++++++++++++++++++++++-
> 1 file changed, 102 insertions(+), 1 deletion(-)
>
> diff --git a/tools/lib/bpf/ringbuf.c b/tools/lib/bpf/ringbuf.c
> index 9702b70da444..9c072af675ff 100644
> --- a/tools/lib/bpf/ringbuf.c
> +++ b/tools/lib/bpf/ringbuf.c
> @@ -27,10 +27,13 @@ struct ring {
>  	ring_buffer_sample_fn sample_cb;
>  	void *ctx;
>  	void *data;
> +	void *read_buffer;
>  	unsigned long *consumer_pos;
>  	unsigned long *producer_pos;
> +	unsigned long *overwrite_pos;
>  	unsigned long mask;
>  	int map_fd;
> +	bool overwrite_mode;
>  };
>
>  struct ring_buffer {
> @@ -69,6 +72,9 @@ static void ringbuf_free_ring(struct ring_buffer *rb, struct ring *r)
>  		r->producer_pos = NULL;
>  	}
>
> +	if (r->read_buffer)
> +		free(r->read_buffer);
> +
>  	free(r);
>  }
>
> @@ -119,6 +125,14 @@ int ring_buffer__add(struct ring_buffer *rb, int map_fd,
>  	r->sample_cb = sample_cb;
>  	r->ctx = ctx;
>  	r->mask = info.max_entries - 1;
> +	r->overwrite_mode = info.map_flags & BPF_F_OVERWRITE;
> +	if (unlikely(r->overwrite_mode)) {
> +		r->read_buffer = malloc(info.max_entries);
> +		if (!r->read_buffer) {
> +			err = -ENOMEM;
> +			goto err_out;
> +		}
> +	}
>
>  	/* Map writable consumer page */
>  	tmp = mmap(NULL, rb->page_size, PROT_READ | PROT_WRITE, MAP_SHARED, map_fd, 0);
> @@ -148,6 +162,7 @@ int ring_buffer__add(struct ring_buffer *rb, int map_fd,
>  		goto err_out;
>  	}
>  	r->producer_pos = tmp;
> +	r->overwrite_pos = r->producer_pos + 1; /* overwrite_pos is next to producer_pos */
>  	r->data = tmp + rb->page_size;
>
>  	e = &rb->events[rb->ring_cnt];
> @@ -232,7 +247,7 @@ static inline int roundup_len(__u32 len)
>  	return (len + 7) / 8 * 8;
>  }
>
> -static int64_t ringbuf_process_ring(struct ring *r, size_t n)
> +static int64_t ringbuf_process_normal_ring(struct ring *r, size_t n)
>  {
>  	int *len_ptr, len, err;
>  	/* 64-bit to avoid overflow in case of extreme application behavior */
> @@ -278,6 +293,92 @@ static int64_t ringbuf_process_ring(struct ring *r, size_t n)
>  	return cnt;
>  }
>
> +static int64_t ringbuf_process_overwrite_ring(struct ring *r, size_t n)
> +{
> +
> +	int err;
> +	uint32_t *len_ptr, len;
> +	/* 64-bit to avoid overflow in case of extreme application behavior */
> +	int64_t cnt = 0;
> +	size_t size, offset;
> +	unsigned long cons_pos, prod_pos, over_pos, tmp_pos;
> +	bool got_new_data;
> +	void *sample;
> +	bool copied;
> +
> +	size = r->mask + 1;
> +
> +	cons_pos = smp_load_acquire(r->consumer_pos);
> +	do {
> +		got_new_data = false;
> +
> +		/* grab a copy of data */
> +		prod_pos = smp_load_acquire(r->producer_pos);
> +		do {
> +			over_pos = READ_ONCE(*r->overwrite_pos);
> +			/* prod_pos may be outdated now */
> +			if (over_pos < prod_pos) {
> +				tmp_pos = max(cons_pos, over_pos);
> +				/* smp_load_acquire(r->producer_pos) before
> +				 * READ_ONCE(*r->overwrite_pos) ensures that
> +				 * over_pos + r->mask < prod_pos never occurs,
> +				 * so size is never larger than r->mask
> +				 */
> +				size = prod_pos - tmp_pos;
> +				if (!size)
> +					goto done;
> +				memcpy(r->read_buffer,
> +				       r->data + (tmp_pos & r->mask), size);
> +				copied = true;
> +			} else {
> +				copied = false;
> +			}
> +			prod_pos = smp_load_acquire(r->producer_pos);
> +		/* retry if data is overwritten by producer */
> +		} while (!copied || prod_pos - tmp_pos > r->mask);

This seems to allow for a situation where a call to process the ring can
infinite loop if the producers are producing and overwriting fast enough. That
seems suboptimal to me?

Should there be a timeout or maximum number of attempts or something that
returns -EBUSY or another error to the user?

> +
> +		cons_pos = tmp_pos;
> +
> +		for (offset = 0; offset < size; offset += roundup_len(len)) {
> +			len_ptr = r->read_buffer + (offset & r->mask);
> +			len = *len_ptr;
> +
> +			if (len & BPF_RINGBUF_BUSY_BIT)
> +				goto done;
> +
> +			got_new_data = true;
> +			cons_pos += roundup_len(len);
> +
> +			if ((len & BPF_RINGBUF_DISCARD_BIT) == 0) {
> +				sample = (void *)len_ptr + BPF_RINGBUF_HDR_SZ;
> +				err = r->sample_cb(r->ctx, sample, len);
> +				if (err < 0) {
> +					/* update consumer pos and bail out */
> +					smp_store_release(r->consumer_pos,
> +							  cons_pos);
> +					return err;
> +				}
> +				cnt++;
> +			}
> +
> +			if (cnt >= n)
> +				goto done;
> +		}
> +	} while (got_new_data);
> +
> +done:
> +	smp_store_release(r->consumer_pos, cons_pos);
> +	return cnt;
> +}
> +
> +static int64_t ringbuf_process_ring(struct ring *r, size_t n)
> +{
> +	if (likely(!r->overwrite_mode))
> +		return ringbuf_process_normal_ring(r, n);
> +	else
> +		return ringbuf_process_overwrite_ring(r, n);
> +}
> +
> +
> /* Consume available ring buffer(s) data without event polling, up to n
> * records.
> *
> --
> 2.43.0
>
>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH bpf-next 1/4] bpf: Add overwrite mode for bpf ring buffer
  2025-08-13 13:22       ` Jordan Rome
@ 2025-08-14 13:59         ` Xu Kuohai
  0 siblings, 0 replies; 15+ messages in thread
From: Xu Kuohai @ 2025-08-14 13:59 UTC (permalink / raw)
  To: Jordan Rome, Alexei Starovoitov
  Cc: bpf, open list:KERNEL SELFTEST FRAMEWORK, LKML,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Yonghong Song, Song Liu,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Mykola Lysenko, Shuah Khan, Stanislav Fomichev, Willem de Bruijn,
	Jason Xing, Paul Chaignon, Tao Chen, Kumar Kartikeya Dwivedi,
	Martin Kelly

On 8/13/2025 9:22 PM, Jordan Rome wrote:
> 
> On 8/12/25 12:02 AM, Xu Kuohai wrote:
>> On 8/9/2025 5:39 AM, Alexei Starovoitov wrote:
>>> On Sun, Aug 3, 2025 at 7:27 PM Xu Kuohai <xukuohai@huaweicloud.com> wrote:
>>>>
>>>> From: Xu Kuohai <xukuohai@huawei.com>
>>>>
>>>> When the bpf ring buffer is full, new events cannot be recorded until
>>>> the consumer consumes some events to free space. This may cause critical
>>>> events to be discarded, such as in fault diagnostic, where recent events
>>>> are more critical than older ones.
>>>>
>>>> So add overwrite mode for bpf ring buffer. In this mode, the new event
>>>> overwrites the oldest event when the buffer is full.
>>>>
>>>> The scheme is as follows:
>>>>
>>>> 1. producer_pos tracks the next position to write new data. When there
>>>>     is enough free space, producer simply moves producer_pos forward to
>>>>     make space for the new event.
>>>>
>>>> 2. To avoid waiting for consumer to free space when the buffer is full,
>>>>     a new variable overwrite_pos is introduced for producer. overwrite_pos
>>>>     tracks the next event to be overwritten (the oldest event committed) in
>>>>     the buffer. producer moves it forward to discard the oldest events when
>>>>     the buffer is full.
>>>>
>>>> 3. pending_pos tracks the oldest event under committing. producer ensures
>>>>     producer_pos never passes pending_pos when making space for new events.
>>>>     So multiple producers never write to the same position at the same time.
>>>>
>>>> 4. producer wakes up consumer every half a round ahead to give it a chance
>>>>     to retrieve data. However, for an overwrite-mode ring buffer, users
>>>>     typically only care about the ring buffer snapshot before a fault occurs.
>>>>     In this case, the producer should commit data with BPF_RB_NO_WAKEUP flag
>>>>     to avoid unnecessary wakeups.
>>>
>>> If I understand it correctly the algorithm requires all events to be the same
>>> size otherwise first overwrite might trash the header,
>>> also the producers should use some kind of signaling to
>>> timestamp each event otherwise it all will look out of order to the consumer.
>>>
>>> At the end it looks inferior to the existing perf ring buffer with overwrite.
>>> Since in both cases the out of order needs to be dealt with
>>> in post processing the main advantage of ring buf vs perf buf is gone.
>>
>> No, the advantage is not gone.
>>
>> The ring buffer is still shared by multiple producers. When an event occurs,
>> the producer queues up to acquire the spin lock of the ring buffer to write
>> event to it. So events in the ring buffer are always ordered, no out of order
>> occurs.
>>
>> And events are not required to be the same size. When an overwrite happens,
>> the events being trashed are discarded, and the overwrite_pos is moved forward
>> to skip these events until it reaches the first event that is not trashed.
>>
>> To make it clear, here are some example diagrams.
>>
>> 1. Let's say we have a ring buffer with size 4096.
>>
>>    At first, {producer,overwrite,pending,consumer}_pos are all set to 0
>>
>>    0        512      1024     1536     2048     2560     3072     3584     4096
>>    +-----------------------------------------------------------------------+
>>    |                                                                       |
>>    |                                                                       |
>>    |                                                                       |
>>    +-----------------------------------------------------------------------+
>>    ^
>>    |
>>    |
>> producer_pos = 0
>> overwrite_pos = 0
>> pending_pos = 0
>> consumer_pos = 0
>>
>> 2. Reserve event A, size 512.
>>
>>    There is enough free space, so A is allocated at offset 0 and producer_pos
>>    is moved to 512, the end of A. Since A is not submitted, the BUSY bit is
>>    set.
>>
>>    0        512      1024     1536     2048     2560     3072     3584     4096
>>    +-----------------------------------------------------------------------+
>>    |        |                                                              |
>>    |    A   |                                                              |
>>    | [BUSY] |                                                              |
>>    +-----------------------------------------------------------------------+
>>    ^        ^
>>    |        |
>>    |        |
>>    |    producer_pos = 512
>>    |
>> overwrite_pos = 0
>> pending_pos = 0
>> consumer_pos = 0
>>
>>
>> 3. Reserve event B, size 1024.
>>
>>    B is allocated at offset 512 with BUSY bit set, and producer_pos is moved
>>    to the end of B.
>>
>>    0        512      1024     1536     2048     2560     3072     3584     4096
>>    +-----------------------------------------------------------------------+
>>    |        |                 |                                            |
>>    |    A   |        B        |                                            |
>>    | [BUSY] |      [BUSY]     |                                            |
>>    +-----------------------------------------------------------------------+
>>    ^                          ^
>>    |                          |
>>    |                          |
>>    |                   producer_pos = 1536
>>    |
>> overwrite_pos = 0
>> pending_pos = 0
>> consumer_pos = 0
>>
>> 4. Reserve event C, size 2048.
>>
>>    C is allocated at offset 1536 and producer_pos becomes 3584.
>>
>>    0        512      1024     1536     2048     2560     3072     3584     4096
>>    +-----------------------------------------------------------------------+
>>    |        |                 |                                   |        |
>>    |    A   |        B        |                 C                 |        |
>>    | [BUSY] |      [BUSY]     |              [BUSY]               |        |
>>    +-----------------------------------------------------------------------+
>>    ^                                                              ^
>>    |                                                              |
>>    |                                                              |
>>    |                                                 producer_pos = 3584
>>    |
>> overwrite_pos = 0
>> pending_pos = 0
>> consumer_pos = 0
>>
>> 5. Submit event A.
>>
>>    The BUSY bit of A is cleared. B becomes the oldest event under writing, so
>>    pending_pos is moved to 512, the start of B.
>>
>>    0        512      1024     1536     2048     2560     3072     3584     4096
>>    +-----------------------------------------------------------------------+
>>    |        |                 |                                   |        |
>>    |    A   |        B        |                 C                 |        |
>>    |        |      [BUSY]     |              [BUSY]               |        |
>>    +-----------------------------------------------------------------------+
>>    ^        ^                                                     ^
>>    |        |                                                     |
>>    |        |                                                     |
>>    |   pending_pos = 512                              producer_pos = 3584
>>    |
>> overwrite_pos = 0
>> consumer_pos = 0
>>
>> 6. Submit event B.
>>
>>    The BUSY bit of B is cleared, and pending_pos is moved to the start of C,
>>    which is the oldest event under writing now.
>>
>>    0        512      1024     1536     2048     2560     3072     3584     4096
>>    +-----------------------------------------------------------------------+
>>    |        |                 |                                   |        |
>>    |    A   |        B        |                 C                 |        |
>>    |        |                 |              [BUSY]               |        |
>>    +-----------------------------------------------------------------------+
>>    ^                          ^                                   ^
>>    |                          |                                   |
>>    |                          |                                   |
>>    |                  pending_pos = 1536             producer_pos = 3584
>>    |
>> overwrite_pos = 0
>> consumer_pos = 0
>>
>> 7. Reserve event D, size 1536 (3 * 512).
>>
>>    There are 2048 bytes not under writing between producer_pos and pending_pos,
>>    so D is allocated at offset 3584, and producer_pos is moved from 3584 to
>>    5120.
>>
>>    Since event D will overwrite all bytes of event A and the beginning 512 bytes
>>    of event B, overwrite_pos is moved to the start of event C, the oldest event
>>    that is not overwritten.
>>
>>    0        512      1024     1536     2048     2560     3072     3584     4096
>>    +-----------------------------------------------------------------------+
>>    |                 |        |                                   |        |
>>    |      D End      |        |                 C                 | D Begin|
>>    |      [BUSY]     |        |              [BUSY]               | [BUSY] |
>>    +-----------------------------------------------------------------------+
>>    ^                 ^        ^
>>    |                 |        |
>>    |                 |   pending_pos = 1536
>>    |                 |   overwrite_pos = 1536
>>    |                 |
>>    |             producer_pos = 5120
>>    |
>> consumer_pos = 0
>>
>> 8. Reserve event E, size 1024.
>>
>>    Though there are 512 bytes not under writing between producer_pos and
>>    pending_pos, E cannot be reserved, as it would overwrite the first 512
>>    bytes of event C, which is still under writing.
>>
>> 9. Submit events C and D.
>>
>>    pending_pos is moved to the end of D.
>>
>>    0        512      1024     1536     2048     2560     3072     3584     4096
>>    +-----------------------------------------------------------------------+
>>    |                 |        |                                   |        |
>>    |      D End      |        |                 C                 | D Begin|
>>    |                 |        |                                   |        |
>>    +-----------------------------------------------------------------------+
>>    ^                 ^        ^
>>    |                 |        |
>>    |                 |   overwrite_pos = 1536
>>    |                 |
>>    |             producer_pos = 5120
>>    |             pending_pos = 5120
>>    |
>> consumer_pos = 0
> 
> These diagrams are very helpful in terms of understanding the flow.
> In part 7 when A is overwritten by D, why doesn't the consumer position move forward to
> point to the beginning of C? If the ring buffer producer guarantees ordering of reserved
> slots then C, in this case, is now the oldest reserved. This speaks to your second patch
> where you say that the consumer resolves conflicts by discarding data that has been
> overwritten but I feel like the simpler thing to do is just move the consumer position.
> 

But the consumer may be ahead of overwrite_pos. In this case, moving
consumer_pos back to the oldest event is not correct, as the event has
already been consumed.


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH bpf-next 2/4] libbpf: ringbuf: Add overwrite ring buffer process
  2025-08-13 18:21   ` Zvi Effron
@ 2025-08-14 14:10     ` Xu Kuohai
  0 siblings, 0 replies; 15+ messages in thread
From: Xu Kuohai @ 2025-08-14 14:10 UTC (permalink / raw)
  To: Zvi Effron
  Cc: bpf, linux-kselftest, linux-kernel, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau,
	Eduard Zingerman, Yonghong Song, Song Liu, John Fastabend,
	KP Singh, Hao Luo, Jiri Olsa, Mykola Lysenko, Shuah Khan,
	Stanislav Fomichev, Willem de Bruijn, Jason Xing, Paul Chaignon,
	Tao Chen, Kumar Kartikeya Dwivedi, Martin Kelly

On 8/14/2025 2:21 AM, Zvi Effron wrote:
> On Sun, Aug 3, 2025 at 7:27 PM Xu Kuohai <xukuohai@huaweicloud.com> wrote:
>>
>> From: Xu Kuohai <xukuohai@huawei.com>
>>
>> In overwrite mode, the producer does not wait for the consumer, so the
>> consumer is responsible for handling conflicts. An optimistic method
>> is used to resolve the conflicts: the consumer first reads consumer_pos,
>> producer_pos and overwrite_pos, then calculates a read window and copies
>> data in the window from the ring buffer. After copying, it checks the
>> positions to decide if the data in the copy window has been overwritten
>> by the producer. If so, it discards the copy and tries again. On
>> success, the consumer processes the events in the copy.
>>
>> Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
>> ---
>> tools/lib/bpf/ringbuf.c | 103 +++++++++++++++++++++++++++++++++++++++-
>> 1 file changed, 102 insertions(+), 1 deletion(-)
>>
>> diff --git a/tools/lib/bpf/ringbuf.c b/tools/lib/bpf/ringbuf.c
>> index 9702b70da444..9c072af675ff 100644
>> --- a/tools/lib/bpf/ringbuf.c
>> +++ b/tools/lib/bpf/ringbuf.c
>> @@ -27,10 +27,13 @@ struct ring {
>>  	ring_buffer_sample_fn sample_cb;
>>  	void *ctx;
>>  	void *data;
>> +	void *read_buffer;
>>  	unsigned long *consumer_pos;
>>  	unsigned long *producer_pos;
>> +	unsigned long *overwrite_pos;
>>  	unsigned long mask;
>>  	int map_fd;
>> +	bool overwrite_mode;
>>  };
>>
>>  struct ring_buffer {
>> @@ -69,6 +72,9 @@ static void ringbuf_free_ring(struct ring_buffer *rb, struct ring *r)
>>  		r->producer_pos = NULL;
>>  	}
>>
>> +	if (r->read_buffer)
>> +		free(r->read_buffer);
>> +
>>  	free(r);
>>  }
>>
>> @@ -119,6 +125,14 @@ int ring_buffer__add(struct ring_buffer *rb, int map_fd,
>>  	r->sample_cb = sample_cb;
>>  	r->ctx = ctx;
>>  	r->mask = info.max_entries - 1;
>> +	r->overwrite_mode = info.map_flags & BPF_F_OVERWRITE;
>> +	if (unlikely(r->overwrite_mode)) {
>> +		r->read_buffer = malloc(info.max_entries);
>> +		if (!r->read_buffer) {
>> +			err = -ENOMEM;
>> +			goto err_out;
>> +		}
>> +	}
>>
>>  	/* Map writable consumer page */
>>  	tmp = mmap(NULL, rb->page_size, PROT_READ | PROT_WRITE, MAP_SHARED, map_fd, 0);
>> @@ -148,6 +162,7 @@ int ring_buffer__add(struct ring_buffer *rb, int map_fd,
>>  		goto err_out;
>>  	}
>>  	r->producer_pos = tmp;
>> +	r->overwrite_pos = r->producer_pos + 1; /* overwrite_pos is next to producer_pos */
>>  	r->data = tmp + rb->page_size;
>>
>>  	e = &rb->events[rb->ring_cnt];
>> @@ -232,7 +247,7 @@ static inline int roundup_len(__u32 len)
>>  	return (len + 7) / 8 * 8;
>>  }
>>
>> -static int64_t ringbuf_process_ring(struct ring *r, size_t n)
>> +static int64_t ringbuf_process_normal_ring(struct ring *r, size_t n)
>>  {
>>  	int *len_ptr, len, err;
>>  	/* 64-bit to avoid overflow in case of extreme application behavior */
>> @@ -278,6 +293,92 @@ static int64_t ringbuf_process_ring(struct ring *r, size_t n)
>>  	return cnt;
>>  }
>>
>> +static int64_t ringbuf_process_overwrite_ring(struct ring *r, size_t n)
>> +{
>> +
>> +	int err;
>> +	uint32_t *len_ptr, len;
>> +	/* 64-bit to avoid overflow in case of extreme application behavior */
>> +	int64_t cnt = 0;
>> +	size_t size, offset;
>> +	unsigned long cons_pos, prod_pos, over_pos, tmp_pos;
>> +	bool got_new_data;
>> +	void *sample;
>> +	bool copied;
>> +
>> +	size = r->mask + 1;
>> +
>> +	cons_pos = smp_load_acquire(r->consumer_pos);
>> +	do {
>> +		got_new_data = false;
>> +
>> +		/* grab a copy of data */
>> +		prod_pos = smp_load_acquire(r->producer_pos);
>> +		do {
>> +			over_pos = READ_ONCE(*r->overwrite_pos);
>> +			/* prod_pos may be outdated now */
>> +			if (over_pos < prod_pos) {
>> +				tmp_pos = max(cons_pos, over_pos);
>> +				/* smp_load_acquire(r->producer_pos) before
>> +				 * READ_ONCE(*r->overwrite_pos) ensures that
>> +				 * over_pos + r->mask < prod_pos never occurs,
>> +				 * so size is never larger than r->mask
>> +				 */
>> +				size = prod_pos - tmp_pos;
>> +				if (!size)
>> +					goto done;
>> +				memcpy(r->read_buffer,
>> +				       r->data + (tmp_pos & r->mask), size);
>> +				copied = true;
>> +			} else {
>> +				copied = false;
>> +			}
>> +			prod_pos = smp_load_acquire(r->producer_pos);
>> +		/* retry if data is overwritten by producer */
>> +		} while (!copied || prod_pos - tmp_pos > r->mask);
> 
> This seems to allow for a situation where a call to process the ring can
> infinite loop if the producers are producing and overwriting fast enough. That
> seems suboptimal to me?
> 
> Should there be a timeout or maximum number of attempts or something that
> returns -EBUSY or another error to the user?
> 

Yeah, an infinite loop is a bit unsettling; I will return -EBUSY after a
bounded number of tries.


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH bpf-next 2/4] libbpf: ringbuf: Add overwrite ring buffer process
  2025-08-04  2:20 ` [PATCH bpf-next 2/4] libbpf: ringbuf: Add overwrite ring buffer process Xu Kuohai
  2025-08-13 18:21   ` Zvi Effron
@ 2025-08-14 19:34   ` Eduard Zingerman
  2025-08-14 21:20     ` Zvi Effron
  2025-08-22 21:23   ` Andrii Nakryiko
  2 siblings, 1 reply; 15+ messages in thread
From: Eduard Zingerman @ 2025-08-14 19:34 UTC (permalink / raw)
  To: Xu Kuohai, bpf, linux-kselftest, linux-kernel
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Yonghong Song, Song Liu, John Fastabend,
	KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa, Mykola Lysenko,
	Shuah Khan, Stanislav Fomichev, Willem de Bruijn, Jason Xing,
	Paul Chaignon, Tao Chen, Kumar Kartikeya Dwivedi, Martin Kelly

On Mon, 2025-08-04 at 10:20 +0800, Xu Kuohai wrote:

[...]

> @@ -278,6 +293,92 @@ static int64_t ringbuf_process_ring(struct ring *r, size_t n)
>  	return cnt;
>  }
>  
> +static int64_t ringbuf_process_overwrite_ring(struct ring *r, size_t n)
> +{
> +
> +	int err;
> +	uint32_t *len_ptr, len;
> +	/* 64-bit to avoid overflow in case of extreme application behavior */
> +	int64_t cnt = 0;
> +	size_t size, offset;
> +	unsigned long cons_pos, prod_pos, over_pos, tmp_pos;
> +	bool got_new_data;
> +	void *sample;
> +	bool copied;
> +
> +	size = r->mask + 1;
> +
> +	cons_pos = smp_load_acquire(r->consumer_pos);
> +	do {
> +		got_new_data = false;
> +
> +		/* grab a copy of data */
> +		prod_pos = smp_load_acquire(r->producer_pos);
> +		do {
> +			over_pos = READ_ONCE(*r->overwrite_pos);
> +			/* prod_pos may be outdated now */
> +			if (over_pos < prod_pos) {
> +				tmp_pos = max(cons_pos, over_pos);
> +				/* smp_load_acquire(r->producer_pos) before
> +				 * READ_ONCE(*r->overwrite_pos) ensures that
> +				 * over_pos + r->mask < prod_pos never occurs,
> +				 * so size is never larger than r->mask
> +				 */
> +				size = prod_pos - tmp_pos;
> +				if (!size)
> +					goto done;
> +				memcpy(r->read_buffer,
> +				       r->data + (tmp_pos & r->mask), size);
> +				copied = true;
> +			} else {
> +				copied = false;
> +			}
> +			prod_pos = smp_load_acquire(r->producer_pos);
> +		/* retry if data is overwritten by producer */
> +		} while (!copied || prod_pos - tmp_pos > r->mask);

Could you please elaborate a bit, why this condition is sufficient to
guarantee that r->overwrite_pos had not changed while memcpy() was
executing?

> +
> +		cons_pos = tmp_pos;
> +
> +		for (offset = 0; offset < size; offset += roundup_len(len)) {
> +			len_ptr = r->read_buffer + (offset & r->mask);
> +			len = *len_ptr;
> +
> +			if (len & BPF_RINGBUF_BUSY_BIT)
> +				goto done;
> +
> +			got_new_data = true;
> +			cons_pos += roundup_len(len);
> +
> +			if ((len & BPF_RINGBUF_DISCARD_BIT) == 0) {
> +				sample = (void *)len_ptr + BPF_RINGBUF_HDR_SZ;
> +				err = r->sample_cb(r->ctx, sample, len);
> +				if (err < 0) {
> +					/* update consumer pos and bail out */
> +					smp_store_release(r->consumer_pos,
> +							  cons_pos);
> +					return err;
> +				}
> +				cnt++;
> +			}
> +
> +			if (cnt >= n)
> +				goto done;
> +		}
> +	} while (got_new_data);
> +
> +done:
> +	smp_store_release(r->consumer_pos, cons_pos);
> +	return cnt;
> +}

[...]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH bpf-next 2/4] libbpf: ringbuf: Add overwrite ring buffer process
  2025-08-14 19:34   ` Eduard Zingerman
@ 2025-08-14 21:20     ` Zvi Effron
  0 siblings, 0 replies; 15+ messages in thread
From: Zvi Effron @ 2025-08-14 21:20 UTC (permalink / raw)
  To: Eduard Zingerman
  Cc: Xu Kuohai, bpf, linux-kselftest, linux-kernel, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Yonghong Song,
	Song Liu, John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo,
	Jiri Olsa, Mykola Lysenko, Shuah Khan, Stanislav Fomichev,
	Willem de Bruijn, Jason Xing, Paul Chaignon, Tao Chen,
	Kumar Kartikeya Dwivedi, Martin Kelly

On Thu, Aug 14, 2025 at 12:34 PM Eduard Zingerman <eddyz87@gmail.com> wrote:
>
> On Mon, 2025-08-04 at 10:20 +0800, Xu Kuohai wrote:
>
> [...]
>
> > @@ -278,6 +293,92 @@ static int64_t ringbuf_process_ring(struct ring *r, size_t n)
> >       return cnt;
> >  }
> >
> > +static int64_t ringbuf_process_overwrite_ring(struct ring *r, size_t n)
> > +{
> > +
> > +     int err;
> > +     uint32_t *len_ptr, len;
> > +     /* 64-bit to avoid overflow in case of extreme application behavior */
> > +     int64_t cnt = 0;
> > +     size_t size, offset;
> > +     unsigned long cons_pos, prod_pos, over_pos, tmp_pos;
> > +     bool got_new_data;
> > +     void *sample;
> > +     bool copied;
> > +
> > +     size = r->mask + 1;
> > +
> > +     cons_pos = smp_load_acquire(r->consumer_pos);
> > +     do {
> > +             got_new_data = false;
> > +
> > +             /* grab a copy of data */
> > +             prod_pos = smp_load_acquire(r->producer_pos);
> > +             do {
> > +                     over_pos = READ_ONCE(*r->overwrite_pos);
> > +                     /* prod_pos may be outdated now */
> > +                     if (over_pos < prod_pos) {
> > +                             tmp_pos = max(cons_pos, over_pos);
> > +                             /* smp_load_acquire(r->producer_pos) before
> > +                              * READ_ONCE(*r->overwrite_pos) ensures that
> > +                              * over_pos + r->mask < prod_pos never occurs,
> > +                              * so size is never larger than r->mask
> > +                              */
> > +                             size = prod_pos - tmp_pos;
> > +                             if (!size)
> > +                                     goto done;
> > +                             memcpy(r->read_buffer,
> > +                                    r->data + (tmp_pos & r->mask), size);
> > +                             copied = true;
> > +                     } else {
> > +                             copied = false;
> > +                     }
> > +                     prod_pos = smp_load_acquire(r->producer_pos);
> > +             /* retry if data is overwritten by producer */
> > +             } while (!copied || prod_pos - tmp_pos > r->mask);
>
> Could you please elaborate a bit, why this condition is sufficient to
> guarantee that r->overwrite_pos had not changed while memcpy() was
> executing?
>

It isn't sufficient to guarantee that, but does it need to be? The concern is
that the data being memcpy-ed might have been overwritten, right? This
condition is sufficient to guarantee that can't happen without forcing another
loop iteration.

For the producer to overwrite a memcpy-ed byte, it must have looped around the
entire buffer, so r->producer_pos will be at least r->mask + 1 more than
tmp_pos. The +1 is because r->producer_pos first had to produce the byte
that got overwritten for it to be included in the memcpy, then produce it a
second time to overwrite it.

Since the code rereads r->producer_pos before making the check, if any bytes
have been overwritten, prod_pos - tmp_pos will be at least r->mask + 1, so the
check will return true and the loop will iterate again, and a new memcpy will
be performed.

> > +
> > +             cons_pos = tmp_pos;
> > +
> > +             for (offset = 0; offset < size; offset += roundup_len(len)) {
> > +                     len_ptr = r->read_buffer + (offset & r->mask);
> > +                     len = *len_ptr;
> > +
> > +                     if (len & BPF_RINGBUF_BUSY_BIT)
> > +                             goto done;
> > +
> > +                     got_new_data = true;
> > +                     cons_pos += roundup_len(len);
> > +
> > +                     if ((len & BPF_RINGBUF_DISCARD_BIT) == 0) {
> > +                             sample = (void *)len_ptr + BPF_RINGBUF_HDR_SZ;
> > +                             err = r->sample_cb(r->ctx, sample, len);
> > +                             if (err < 0) {
> > +                                     /* update consumer pos and bail out */
> > +                                     smp_store_release(r->consumer_pos,
> > +                                                       cons_pos);
> > +                                     return err;
> > +                             }
> > +                             cnt++;
> > +                     }
> > +
> > +                     if (cnt >= n)
> > +                             goto done;
> > +             }
> > +     } while (got_new_data);
> > +
> > +done:
> > +     smp_store_release(r->consumer_pos, cons_pos);
> > +     return cnt;
> > +}
>
> [...]
>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH bpf-next 2/4] libbpf: ringbuf: Add overwrite ring buffer process
  2025-08-04  2:20 ` [PATCH bpf-next 2/4] libbpf: ringbuf: Add overwrite ring buffer process Xu Kuohai
  2025-08-13 18:21   ` Zvi Effron
  2025-08-14 19:34   ` Eduard Zingerman
@ 2025-08-22 21:23   ` Andrii Nakryiko
  2025-08-23 14:38     ` Xu Kuohai
  2 siblings, 1 reply; 15+ messages in thread
From: Andrii Nakryiko @ 2025-08-22 21:23 UTC (permalink / raw)
  To: Xu Kuohai
  Cc: bpf, linux-kselftest, linux-kernel, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau,
	Eduard Zingerman, Yonghong Song, Song Liu, John Fastabend,
	KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa, Mykola Lysenko,
	Shuah Khan, Stanislav Fomichev, Willem de Bruijn, Jason Xing,
	Paul Chaignon, Tao Chen, Kumar Kartikeya Dwivedi, Martin Kelly

On Sun, Aug 3, 2025 at 7:27 PM Xu Kuohai <xukuohai@huaweicloud.com> wrote:
>
> From: Xu Kuohai <xukuohai@huawei.com>
>
> In overwrite mode, the producer does not wait for the consumer, so the
> consumer is responsible for handling conflicts. An optimistic method
> is used to resolve the conflicts: the consumer first reads consumer_pos,
> producer_pos and overwrite_pos, then calculates a read window and copies
> data in the window from the ring buffer. After copying, it checks the
> positions to decide if the data in the copy window have been overwritten
> by the producer. If so, it discards the copy and tries again. On
> success, the consumer processes the events in the copy.
>

I don't mind adding BPF_F_OVERWRITE mode to BPF ringbuf (it seems like
this will work fine) itself, but I don't think retrofitting it to this
callback-based libbpf-side API is a good fit.

For one, I don't like that extra memory copy and potentially a huge
allocation that you do. I think for some use cases user logic might be
totally fine with using ringbuf memory directly, even if it can be
overwritten at any point. So it would be unfair to penalize
sophisticated users for such cases. Even if not, I'd say allocating
just enough to hold the record would be a better approach.

Another downside is that the user doesn't really have much visibility
right now into whether any samples were overwritten.

I've been mulling over the idea of adding an iterator-like API for BPF
ringbuf on the libbpf side for a while now. I'm still debating some
API nuances with Eduard, but I think we'll end up adding something
pretty soon. Iterator-based API seems like a much better fit for
overwritable mode here.

But all that is not really overwrite-specific and is broader, so I
think we can proceed with finalizing kernel-side details of overwrite
and not block on libbpf side of things for now, though.

> Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
> ---
>  tools/lib/bpf/ringbuf.c | 103 +++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 102 insertions(+), 1 deletion(-)
>
> diff --git a/tools/lib/bpf/ringbuf.c b/tools/lib/bpf/ringbuf.c
> index 9702b70da444..9c072af675ff 100644
> --- a/tools/lib/bpf/ringbuf.c
> +++ b/tools/lib/bpf/ringbuf.c
> @@ -27,10 +27,13 @@ struct ring {
>         ring_buffer_sample_fn sample_cb;
>         void *ctx;
>         void *data;
> +       void *read_buffer;
>         unsigned long *consumer_pos;
>         unsigned long *producer_pos;
> +       unsigned long *overwrite_pos;
>         unsigned long mask;
>         int map_fd;
> +       bool overwrite_mode;
>  };

[...]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH bpf-next 2/4] libbpf: ringbuf: Add overwrite ring buffer process
  2025-08-22 21:23   ` Andrii Nakryiko
@ 2025-08-23 14:38     ` Xu Kuohai
  0 siblings, 0 replies; 15+ messages in thread
From: Xu Kuohai @ 2025-08-23 14:38 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: bpf, linux-kselftest, linux-kernel, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau,
	Eduard Zingerman, Yonghong Song, Song Liu, John Fastabend,
	KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa, Mykola Lysenko,
	Shuah Khan, Stanislav Fomichev, Willem de Bruijn, Jason Xing,
	Paul Chaignon, Tao Chen, Kumar Kartikeya Dwivedi, Martin Kelly

On 8/23/2025 5:23 AM, Andrii Nakryiko wrote:
> On Sun, Aug 3, 2025 at 7:27 PM Xu Kuohai <xukuohai@huaweicloud.com> wrote:
>>
>> From: Xu Kuohai <xukuohai@huawei.com>
>>
>> In overwrite mode, the producer does not wait for the consumer, so the
>> consumer is responsible for handling conflicts. An optimistic method
>> is used to resolve the conflicts: the consumer first reads consumer_pos,
>> producer_pos and overwrite_pos, then calculates a read window and copies
>> data in the window from the ring buffer. After copying, it checks the
>> positions to decide if the data in the copy window have been overwritten
>> by the producer. If so, it discards the copy and tries again. On
>> success, the consumer processes the events in the copy.
>>
> 
> I don't mind adding BPF_F_OVERWRITE mode to BPF ringbuf (it seems like
> this will work fine) itself, but I don't think retrofitting it to this
> callback-based libbpf-side API is a good fit.
> 
> For one, I don't like that extra memory copy and potentially a huge
> allocation that you do. I think for some use cases user logic might be
> totally fine with using ringbuf memory directly, even if it can be
> overwritten at any point. So it would be unfair to penalize
> sophisticated users for such cases. Even if not, I'd say allocating
> just enough to hold the record would be a better approach.
> 
> Another downside is that the user doesn't really have much visibility
> right now into whether any samples were overwritten.
> 
> I've been mulling over the idea of adding an iterator-like API for BPF
> ringbuf on the libbpf side for a while now. I'm still debating some
> API nuances with Eduard, but I think we'll end up adding something
> pretty soon. Iterator-based API seems like a much better fit for
> overwritable mode here.
> 
> But all that is not really overwrite-specific and is broader, so I
> think we can proceed with finalizing kernel-side details of overwrite
> and not block on libbpf side of things for now, though.
>

Sounds great! Looking forward to the new iterator-based API. Clearly, no
allocation, or only a small one, is better than a huge allocation. I'll
focus on the kernel side until the new API is introduced.

>> Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
>> ---
>>   tools/lib/bpf/ringbuf.c | 103 +++++++++++++++++++++++++++++++++++++++-
>>   1 file changed, 102 insertions(+), 1 deletion(-)
>>
>> diff --git a/tools/lib/bpf/ringbuf.c b/tools/lib/bpf/ringbuf.c
>> index 9702b70da444..9c072af675ff 100644
>> --- a/tools/lib/bpf/ringbuf.c
>> +++ b/tools/lib/bpf/ringbuf.c
>> @@ -27,10 +27,13 @@ struct ring {
>>          ring_buffer_sample_fn sample_cb;
>>          void *ctx;
>>          void *data;
>> +       void *read_buffer;
>>          unsigned long *consumer_pos;
>>          unsigned long *producer_pos;
>> +       unsigned long *overwrite_pos;
>>          unsigned long mask;
>>          int map_fd;
>> +       bool overwrite_mode;
>>   };
> 
> [...]


^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2025-08-23 14:38 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-08-04  2:20 [PATCH bpf-next 0/4] Add overwrite mode for bpf ring buffer Xu Kuohai
2025-08-04  2:20 ` [PATCH bpf-next 1/4] bpf: " Xu Kuohai
2025-08-08 21:39   ` Alexei Starovoitov
2025-08-12  4:02     ` Xu Kuohai
2025-08-13 13:22       ` Jordan Rome
2025-08-14 13:59         ` Xu Kuohai
2025-08-04  2:20 ` [PATCH bpf-next 2/4] libbpf: ringbuf: Add overwrite ring buffer process Xu Kuohai
2025-08-13 18:21   ` Zvi Effron
2025-08-14 14:10     ` Xu Kuohai
2025-08-14 19:34   ` Eduard Zingerman
2025-08-14 21:20     ` Zvi Effron
2025-08-22 21:23   ` Andrii Nakryiko
2025-08-23 14:38     ` Xu Kuohai
2025-08-04  2:20 ` [PATCH bpf-next 3/4] selftests/bpf: Add test for overwrite ring buffer Xu Kuohai
2025-08-04  2:21 ` [PATCH bpf-next 4/4] selftests/bpf/benchs: Add overwrite mode bench for rb-libbpf Xu Kuohai
