netdev.vger.kernel.org archive mirror
* [PATCH v2 bpf-next 0/5] bpf: udp: Exactly-once socket iteration
@ 2025-04-11 17:35 Jordan Rife
  2025-04-11 17:35 ` [PATCH v2 bpf-next 1/5] bpf: udp: Use bpf_udp_iter_batch_item for bpf_udp_iter_state batch items Jordan Rife
                   ` (4 more replies)
  0 siblings, 5 replies; 15+ messages in thread
From: Jordan Rife @ 2025-04-11 17:35 UTC (permalink / raw)
  To: netdev, bpf
  Cc: Jordan Rife, Aditi Ghag, Daniel Borkmann, Martin KaFai Lau,
	Willem de Bruijn, Kuniyuki Iwashima

Both UDP and TCP socket iterators use iter->offset to track progress
through a bucket, which is a measure of the number of matching sockets
from the current bucket that have been seen or processed by the
iterator. On subsequent iterations, if the current bucket has
unprocessed items, we skip at least iter->offset matching items in the
bucket before adding any remaining items to the next batch. However,
iter->offset isn't always an accurate measure of "things already seen"
when the underlying bucket changes between reads, which can lead to
repeated or skipped sockets. For example, if a socket we already
processed is removed from the bucket between reads, resuming at the old
offset skips over an unseen socket. Instead, this series remembers the
cookies of the sockets we haven't seen yet in the current bucket and
resumes from the first cookie in that list that we can find on the next
iteration.

To be more specific, this series replaces struct sock **batch inside
struct bpf_udp_iter_state with union bpf_udp_iter_batch_item *batch,
where union bpf_udp_iter_batch_item can contain either a pointer to a
socket or a socket cookie. During reads, batch contains pointers to all
sockets in the current batch while between reads batch contains all the
cookies of the sockets in the current bucket that have yet to be
processed. On subsequent reads, when iteration resumes,
bpf_iter_udp_batch finds the first saved cookie that matches a socket in
the bucket's socket list and picks up from there to construct the next
batch. On average, assuming it's rare that the next socket disappears
before the next read occurs, we should only need to scan as much as we
did with the offset-based approach to find the starting point. In the
case that the next socket is no longer there, we keep scanning through
the saved cookies list until we find a match. The worst case is when
none of the sockets from last time exist anymore, but again, this should
be rare.
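
To be concrete, the core of the change can be sketched like this
(condensed from patches 1 and 3 below; the helper name here is
illustrative, the real conversion happens in bpf_iter_udp_put_batch):

/* A batch entry holds a held socket pointer while a read is in
 * progress and that socket's cookie between reads.
 */
union bpf_udp_iter_batch_item {
	struct sock *sock;
	__u64 cookie;
};

/* Illustrative helper: before the batch is released, convert the
 * entries we haven't shown yet from socket pointers into cookies so
 * the next read can resume from the first of them that still exists.
 */
static void save_unseen_cookies(struct bpf_udp_iter_state *iter)
{
	unsigned int cur_sk = iter->cur_sk;

	while (cur_sk < iter->end_sk) {
		union bpf_udp_iter_batch_item *item = &iter->batch[cur_sk++];
		__u64 cookie = __sock_gen_cookie(item->sock);

		sock_put(item->sock);
		item->cookie = cookie;
	}
}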

CHANGES
=======
v1 -> v2:
* Drop WARN_ON_ONCE from bpf_iter_udp_realloc_batch (Kuniyuki).
* Fix the memcpy size parameter in bpf_iter_udp_realloc_batch; before,
  it was missing the sizeof(elem) multiplier (Kuniyuki).
* Move "bpf: udp: Propagate ENOMEM up from bpf_iter_udp_batch" to patch
  two in the series (Kuniyuki).

rfc [1] -> v1:
* Use hlist_entry_safe directly to retrieve the first socket in the
  current bucket's linked list instead of immediately breaking from
  udp_portaddr_for_each_entry (Martin).
* Cancel iteration if bpf_iter_udp_realloc_batch() can't grab enough
  memory to contain a full snapshot of the current bucket to prevent
  unwanted skips or repeats [2].

[1]: https://lore.kernel.org/bpf/20250404220221.1665428-1-jordan@jrife.io/
[2]: https://lore.kernel.org/bpf/CABi4-ogUtMrH8-NVB6W8Xg_F_KDLq=yy-yu-tKr2udXE2Mu1Lg@mail.gmail.com/

Jordan Rife (5):
  bpf: udp: Use bpf_udp_iter_batch_item for bpf_udp_iter_state batch
    items
  bpf: udp: Propagate ENOMEM up from bpf_iter_udp_batch
  bpf: udp: Avoid socket skips and repeats during iteration
  selftests/bpf: Return socket cookies from sock_iter_batch progs
  selftests/bpf: Add tests for bucket resume logic in UDP socket
    iterators

 include/linux/udp.h                           |   3 +
 net/ipv4/udp.c                                | 100 +++-
 .../bpf/prog_tests/sock_iter_batch.c          | 451 +++++++++++++++++-
 .../selftests/bpf/progs/bpf_tracing_net.h     |   1 +
 .../selftests/bpf/progs/sock_iter_batch.c     |  24 +-
 5 files changed, 537 insertions(+), 42 deletions(-)

-- 
2.43.0


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH v2 bpf-next 1/5] bpf: udp: Use bpf_udp_iter_batch_item for bpf_udp_iter_state batch items
  2025-04-11 17:35 [PATCH v2 bpf-next 0/5] bpf: udp: Exactly-once socket iteration Jordan Rife
@ 2025-04-11 17:35 ` Jordan Rife
  2025-04-11 17:35 ` [PATCH v2 bpf-next 2/5] bpf: udp: Propagate ENOMEM up from bpf_iter_udp_batch Jordan Rife
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 15+ messages in thread
From: Jordan Rife @ 2025-04-11 17:35 UTC (permalink / raw)
  To: netdev, bpf
  Cc: Jordan Rife, Aditi Ghag, Daniel Borkmann, Martin KaFai Lau,
	Willem de Bruijn, Kuniyuki Iwashima

Prepare for the commit that tracks cookies between iterations by
converting struct sock **batch to union bpf_udp_iter_batch_item *batch
inside struct bpf_udp_iter_state.

Signed-off-by: Jordan Rife <jordan@jrife.io>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
 net/ipv4/udp.c | 18 +++++++++++-------
 1 file changed, 11 insertions(+), 7 deletions(-)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index d0bffcfa56d8..59c3281962b9 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -3384,13 +3384,17 @@ struct bpf_iter__udp {
 	int bucket __aligned(8);
 };
 
+union bpf_udp_iter_batch_item {
+	struct sock *sock;
+};
+
 struct bpf_udp_iter_state {
 	struct udp_iter_state state;
 	unsigned int cur_sk;
 	unsigned int end_sk;
 	unsigned int max_sk;
 	int offset;
-	struct sock **batch;
+	union bpf_udp_iter_batch_item *batch;
 	bool st_bucket_done;
 };
 
@@ -3449,7 +3453,7 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
 				}
 				if (iter->end_sk < iter->max_sk) {
 					sock_hold(sk);
-					iter->batch[iter->end_sk++] = sk;
+					iter->batch[iter->end_sk++].sock = sk;
 				}
 				batch_sks++;
 			}
@@ -3479,7 +3483,7 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
 		goto again;
 	}
 done:
-	return iter->batch[0];
+	return iter->batch[0].sock;
 }
 
 static void *bpf_iter_udp_seq_next(struct seq_file *seq, void *v, loff_t *pos)
@@ -3491,7 +3495,7 @@ static void *bpf_iter_udp_seq_next(struct seq_file *seq, void *v, loff_t *pos)
 	 * done with seq_show(), so unref the iter->cur_sk.
 	 */
 	if (iter->cur_sk < iter->end_sk) {
-		sock_put(iter->batch[iter->cur_sk++]);
+		sock_put(iter->batch[iter->cur_sk++].sock);
 		++iter->offset;
 	}
 
@@ -3499,7 +3503,7 @@ static void *bpf_iter_udp_seq_next(struct seq_file *seq, void *v, loff_t *pos)
 	 * available in the current bucket batch.
 	 */
 	if (iter->cur_sk < iter->end_sk)
-		sk = iter->batch[iter->cur_sk];
+		sk = iter->batch[iter->cur_sk].sock;
 	else
 		/* Prepare a new batch. */
 		sk = bpf_iter_udp_batch(seq);
@@ -3564,7 +3568,7 @@ static int bpf_iter_udp_seq_show(struct seq_file *seq, void *v)
 static void bpf_iter_udp_put_batch(struct bpf_udp_iter_state *iter)
 {
 	while (iter->cur_sk < iter->end_sk)
-		sock_put(iter->batch[iter->cur_sk++]);
+		sock_put(iter->batch[iter->cur_sk++].sock);
 }
 
 static void bpf_iter_udp_seq_stop(struct seq_file *seq, void *v)
@@ -3827,7 +3831,7 @@ DEFINE_BPF_ITER_FUNC(udp, struct bpf_iter_meta *meta,
 static int bpf_iter_udp_realloc_batch(struct bpf_udp_iter_state *iter,
 				      unsigned int new_batch_sz)
 {
-	struct sock **new_batch;
+	union bpf_udp_iter_batch_item *new_batch;
 
 	new_batch = kvmalloc_array(new_batch_sz, sizeof(*new_batch),
 				   GFP_USER | __GFP_NOWARN);
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v2 bpf-next 2/5] bpf: udp: Propagate ENOMEM up from bpf_iter_udp_batch
  2025-04-11 17:35 [PATCH v2 bpf-next 0/5] bpf: udp: Exactly-once socket iteration Jordan Rife
  2025-04-11 17:35 ` [PATCH v2 bpf-next 1/5] bpf: udp: Use bpf_udp_iter_batch_item for bpf_udp_iter_state batch items Jordan Rife
@ 2025-04-11 17:35 ` Jordan Rife
  2025-04-11 22:10   ` Martin KaFai Lau
  2025-04-11 17:35 ` [PATCH v2 bpf-next 3/5] bpf: udp: Avoid socket skips and repeats during iteration Jordan Rife
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 15+ messages in thread
From: Jordan Rife @ 2025-04-11 17:35 UTC (permalink / raw)
  To: netdev, bpf
  Cc: Jordan Rife, Aditi Ghag, Daniel Borkmann, Martin KaFai Lau,
	Willem de Bruijn, Kuniyuki Iwashima

Stop iteration if the current bucket can't be contained in a single
batch to avoid choosing between skipping or repeating sockets. In cases
where none of the saved cookies can be found in the current bucket and
the batch isn't big enough to contain all the sockets in the bucket,
there are really only two choices, neither of which is desirable:

1. Start from the beginning, assuming we haven't seen any sockets in the
   current bucket, in which case we might repeat a socket we've already
   seen.
2. Go to the next bucket to avoid repeating a socket we may have already
   seen, in which case we may accidentally skip a socket that we didn't
   yet visit.

To avoid this tradeoff, enforce the invariant that the batch always
contains a full snapshot of the bucket from last time by returning
-ENOMEM if bpf_iter_udp_realloc_batch() can't grab enough memory to fit
all sockets in the current bucket.

To test this code path, I forced bpf_iter_udp_realloc_batch() to return
-ENOMEM when called from within bpf_iter_udp_batch() and observed that
read() fails in userspace with errno set to ENOMEM. Otherwise, it's a
bit hard to test this scenario.
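
For reference, a userspace reader of the iterator would observe this
new failure mode roughly as follows (illustrative snippet, not part of
the patch; iter_fd is assumed to come from bpf_iter_create()):

#include <errno.h>
#include <stdio.h>
#include <unistd.h>

/* Drain a UDP socket iterator fd and report the -ENOMEM case that this
 * patch now propagates up to read().
 */
static int drain_iter(int iter_fd)
{
	char buf[4096];
	ssize_t n;

	while ((n = read(iter_fd, buf, sizeof(buf))) > 0)
		;	/* consume iterator output */

	if (n < 0 && errno == ENOMEM) {
		fprintf(stderr, "iterator batch allocation failed\n");
		return -1;
	}

	return n < 0 ? -1 : 0;
}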

Link: https://lore.kernel.org/bpf/CABi4-ogUtMrH8-NVB6W8Xg_F_KDLq=yy-yu-tKr2udXE2Mu1Lg@mail.gmail.com/

Signed-off-by: Jordan Rife <jordan@jrife.io>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
 net/ipv4/udp.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 59c3281962b9..1e8ae08d24db 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -3410,6 +3410,7 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
 	unsigned int batch_sks = 0;
 	bool resized = false;
 	struct sock *sk;
+	int err = 0;
 
 	resume_bucket = state->bucket;
 	resume_offset = iter->offset;
@@ -3475,7 +3476,11 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
 		iter->st_bucket_done = true;
 		goto done;
 	}
-	if (!resized && !bpf_iter_udp_realloc_batch(iter, batch_sks * 3 / 2)) {
+	if (!resized) {
+		err = bpf_iter_udp_realloc_batch(iter, batch_sks * 3 / 2);
+		if (err)
+			return ERR_PTR(err);
+
 		resized = true;
 		/* After allocating a larger batch, retry one more time to grab
 		 * the whole bucket.
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v2 bpf-next 3/5] bpf: udp: Avoid socket skips and repeats during iteration
  2025-04-11 17:35 [PATCH v2 bpf-next 0/5] bpf: udp: Exactly-once socket iteration Jordan Rife
  2025-04-11 17:35 ` [PATCH v2 bpf-next 1/5] bpf: udp: Use bpf_udp_iter_batch_item for bpf_udp_iter_state batch items Jordan Rife
  2025-04-11 17:35 ` [PATCH v2 bpf-next 2/5] bpf: udp: Propagate ENOMEM up from bpf_iter_udp_batch Jordan Rife
@ 2025-04-11 17:35 ` Jordan Rife
  2025-04-11 20:12   ` Kuniyuki Iwashima
  2025-04-11 17:35 ` [PATCH v2 bpf-next 4/5] selftests/bpf: Return socket cookies from sock_iter_batch progs Jordan Rife
  2025-04-11 17:35 ` [PATCH v2 bpf-next 5/5] selftests/bpf: Add tests for bucket resume logic in UDP socket iterators Jordan Rife
  4 siblings, 1 reply; 15+ messages in thread
From: Jordan Rife @ 2025-04-11 17:35 UTC (permalink / raw)
  To: netdev, bpf
  Cc: Jordan Rife, Aditi Ghag, Daniel Borkmann, Martin KaFai Lau,
	Willem de Bruijn, Kuniyuki Iwashima

Replace the offset-based approach for tracking progress through a bucket
in the UDP table with one based on socket cookies. Remember the cookies
of unprocessed sockets from the last batch and use this list to
pick up where we left off or, in the case that the next socket
disappears between reads, find the first socket after that point that
still exists in the bucket and resume from there.

In order to make the control flow a bit easier to follow inside
bpf_iter_udp_batch, introduce the udp_portaddr_for_each_entry_from macro
and use this to split bucket processing into two stages: finding the
starting point and adding items to the next batch. Originally, I
implemented this patch inside a single udp_portaddr_for_each_entry loop,
as it was before, but I found the resulting logic a bit messy. Overall,
this version seems more readable.
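
Condensed from the hunk below, the two stages look roughly like this
(locking and the realloc/ENOMEM handling omitted):

	/* Stage 1: pick the starting point. Begin at the head of the
	 * bucket, or, when resuming in the same bucket, at the first
	 * remembered cookie that still exists.
	 */
	sk = hlist_entry_safe(hslot2->head.first, struct sock,
			      __sk_common.skc_portaddr_node);
	if (state->bucket == resume_bucket && end_cookie)
		sk = bpf_iter_udp_resume(sk, &iter->batch[find_cookie],
					 end_cookie - find_cookie);

	/* Stage 2: batch every matching socket from that point onward. */
	udp_portaddr_for_each_entry_from(sk) {
		if (!seq_sk_match(seq, sk))
			continue;
		if (iter->end_sk < iter->max_sk) {
			sock_hold(sk);
			iter->batch[iter->end_sk++].sock = sk;
		}
		batch_sks++;
	}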

Signed-off-by: Jordan Rife <jordan@jrife.io>
---
 include/linux/udp.h |  3 ++
 net/ipv4/udp.c      | 77 ++++++++++++++++++++++++++++++++++-----------
 2 files changed, 62 insertions(+), 18 deletions(-)

diff --git a/include/linux/udp.h b/include/linux/udp.h
index 0807e21cfec9..a69da9c4c1c5 100644
--- a/include/linux/udp.h
+++ b/include/linux/udp.h
@@ -209,6 +209,9 @@ static inline void udp_allow_gso(struct sock *sk)
 #define udp_portaddr_for_each_entry(__sk, list) \
 	hlist_for_each_entry(__sk, list, __sk_common.skc_portaddr_node)
 
+#define udp_portaddr_for_each_entry_from(__sk) \
+	hlist_for_each_entry_from(__sk, __sk_common.skc_portaddr_node)
+
 #define udp_portaddr_for_each_entry_rcu(__sk, list) \
 	hlist_for_each_entry_rcu(__sk, list, __sk_common.skc_portaddr_node)
 
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 1e8ae08d24db..6e7c3c113320 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -93,6 +93,7 @@
 #include <linux/inet.h>
 #include <linux/netdevice.h>
 #include <linux/slab.h>
+#include <linux/sock_diag.h>
 #include <net/tcp_states.h>
 #include <linux/skbuff.h>
 #include <linux/proc_fs.h>
@@ -3386,6 +3387,7 @@ struct bpf_iter__udp {
 
 union bpf_udp_iter_batch_item {
 	struct sock *sock;
+	__u64 cookie;
 };
 
 struct bpf_udp_iter_state {
@@ -3393,27 +3395,43 @@ struct bpf_udp_iter_state {
 	unsigned int cur_sk;
 	unsigned int end_sk;
 	unsigned int max_sk;
-	int offset;
 	union bpf_udp_iter_batch_item *batch;
 	bool st_bucket_done;
 };
 
 static int bpf_iter_udp_realloc_batch(struct bpf_udp_iter_state *iter,
 				      unsigned int new_batch_sz);
+static struct sock *bpf_iter_udp_resume(struct sock *first_sk,
+					union bpf_udp_iter_batch_item *cookies,
+					int n_cookies)
+{
+	struct sock *sk = NULL;
+	int i = 0;
+
+	for (; i < n_cookies; i++) {
+		sk = first_sk;
+		udp_portaddr_for_each_entry_from(sk)
+			if (cookies[i].cookie == atomic64_read(&sk->sk_cookie))
+				goto done;
+	}
+done:
+	return sk;
+}
+
 static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
 {
 	struct bpf_udp_iter_state *iter = seq->private;
 	struct udp_iter_state *state = &iter->state;
+	unsigned int find_cookie, end_cookie = 0;
 	struct net *net = seq_file_net(seq);
-	int resume_bucket, resume_offset;
 	struct udp_table *udptable;
 	unsigned int batch_sks = 0;
 	bool resized = false;
+	int resume_bucket;
 	struct sock *sk;
 	int err = 0;
 
 	resume_bucket = state->bucket;
-	resume_offset = iter->offset;
 
 	/* The current batch is done, so advance the bucket. */
 	if (iter->st_bucket_done)
@@ -3429,6 +3447,8 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
 	 * before releasing the bucket lock. This allows BPF programs that are
 	 * called in seq_show to acquire the bucket lock if needed.
 	 */
+	find_cookie = iter->cur_sk;
+	end_cookie = iter->end_sk;
 	iter->cur_sk = 0;
 	iter->end_sk = 0;
 	iter->st_bucket_done = false;
@@ -3440,18 +3460,26 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
 		if (hlist_empty(&hslot2->head))
 			continue;
 
-		iter->offset = 0;
 		spin_lock_bh(&hslot2->lock);
-		udp_portaddr_for_each_entry(sk, &hslot2->head) {
+		/* Initialize sk to the first socket in hslot2. */
+		sk = hlist_entry_safe(hslot2->head.first, struct sock,
+				      __sk_common.skc_portaddr_node);
+		/* Resume from the first (in iteration order) unseen socket from
+		 * the last batch that still exists in resume_bucket. Most of
+		 * the time this will just be where the last iteration left off
+		 * in resume_bucket unless that socket disappeared between
+		 * reads.
+		 *
+		 * Skip this if end_cookie isn't set; this is the first
+		 * batch, we're on bucket zero, and we want to start from the
+		 * beginning.
+		 */
+		if (state->bucket == resume_bucket && end_cookie)
+			sk = bpf_iter_udp_resume(sk,
+						 &iter->batch[find_cookie],
+						 end_cookie - find_cookie);
+		udp_portaddr_for_each_entry_from(sk) {
 			if (seq_sk_match(seq, sk)) {
-				/* Resume from the last iterated socket at the
-				 * offset in the bucket before iterator was stopped.
-				 */
-				if (state->bucket == resume_bucket &&
-				    iter->offset < resume_offset) {
-					++iter->offset;
-					continue;
-				}
 				if (iter->end_sk < iter->max_sk) {
 					sock_hold(sk);
 					iter->batch[iter->end_sk++].sock = sk;
@@ -3499,10 +3527,8 @@ static void *bpf_iter_udp_seq_next(struct seq_file *seq, void *v, loff_t *pos)
 	/* Whenever seq_next() is called, the iter->cur_sk is
 	 * done with seq_show(), so unref the iter->cur_sk.
 	 */
-	if (iter->cur_sk < iter->end_sk) {
+	if (iter->cur_sk < iter->end_sk)
 		sock_put(iter->batch[iter->cur_sk++].sock);
-		++iter->offset;
-	}
 
 	/* After updating iter->cur_sk, check if there are more sockets
 	 * available in the current bucket batch.
@@ -3572,8 +3598,19 @@ static int bpf_iter_udp_seq_show(struct seq_file *seq, void *v)
 
 static void bpf_iter_udp_put_batch(struct bpf_udp_iter_state *iter)
 {
-	while (iter->cur_sk < iter->end_sk)
-		sock_put(iter->batch[iter->cur_sk++].sock);
+	union bpf_udp_iter_batch_item *item;
+	unsigned int cur_sk = iter->cur_sk;
+	__u64 cookie;
+
+	/* Remember the cookies of the sockets we haven't seen yet, so we can
+	 * pick up where we left off next time around.
+	 */
+	while (cur_sk < iter->end_sk) {
+		item = &iter->batch[cur_sk++];
+		cookie = __sock_gen_cookie(item->sock);
+		sock_put(item->sock);
+		item->cookie = cookie;
+	}
 }
 
 static void bpf_iter_udp_seq_stop(struct seq_file *seq, void *v)
@@ -3844,6 +3881,10 @@ static int bpf_iter_udp_realloc_batch(struct bpf_udp_iter_state *iter,
 		return -ENOMEM;
 
 	bpf_iter_udp_put_batch(iter);
+	/* Make sure the new batch has the cookies of the sockets we haven't
+	 * visited yet.
+	 */
+	memcpy(new_batch, iter->batch, sizeof(*iter->batch) * iter->end_sk);
 	kvfree(iter->batch);
 	iter->batch = new_batch;
 	iter->max_sk = new_batch_sz;
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v2 bpf-next 4/5] selftests/bpf: Return socket cookies from sock_iter_batch progs
  2025-04-11 17:35 [PATCH v2 bpf-next 0/5] bpf: udp: Exactly-once socket iteration Jordan Rife
                   ` (2 preceding siblings ...)
  2025-04-11 17:35 ` [PATCH v2 bpf-next 3/5] bpf: udp: Avoid socket skips and repeats during iteration Jordan Rife
@ 2025-04-11 17:35 ` Jordan Rife
  2025-04-11 17:35 ` [PATCH v2 bpf-next 5/5] selftests/bpf: Add tests for bucket resume logic in UDP socket iterators Jordan Rife
  4 siblings, 0 replies; 15+ messages in thread
From: Jordan Rife @ 2025-04-11 17:35 UTC (permalink / raw)
  To: netdev, bpf
  Cc: Jordan Rife, Aditi Ghag, Daniel Borkmann, Martin KaFai Lau,
	Willem de Bruijn, Kuniyuki Iwashima

Extend the iter_udp_soreuse and iter_tcp_soreuse programs to write the
cookie of the current socket, so that we can track the identity of the
sockets that the iterator has seen so far. Update the existing do_test
function to account for this change to the iterator program output. At
the same time, teach both programs to work with AF_INET as well.
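
With this change, each record the selftests read from the iterator fd
is a packed (bucket index, cookie) pair. An illustrative userspace read
loop (not part of the patch) could look like:

#include <linux/types.h>
#include <stdio.h>
#include <unistd.h>

/* Matches struct iter_out in the selftest changes below: one packed
 * record per socket, written by the iterator program as two
 * bpf_seq_write() calls.
 */
struct iter_out {
	int idx;
	__u64 cookie;
} __attribute__((__packed__));

/* Illustrative only: dump every (bucket, cookie) pair from an iterator fd. */
static void dump_iter(int iter_fd)
{
	struct iter_out out;

	while (read(iter_fd, &out, sizeof(out)) == sizeof(out))
		printf("bucket=%d cookie=%llu\n", out.idx,
		       (unsigned long long)out.cookie);
}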

Signed-off-by: Jordan Rife <jordan@jrife.io>
---
 .../bpf/prog_tests/sock_iter_batch.c          | 33 +++++++++++--------
 .../selftests/bpf/progs/bpf_tracing_net.h     |  1 +
 .../selftests/bpf/progs/sock_iter_batch.c     | 24 +++++++++++---
 3 files changed, 41 insertions(+), 17 deletions(-)

diff --git a/tools/testing/selftests/bpf/prog_tests/sock_iter_batch.c b/tools/testing/selftests/bpf/prog_tests/sock_iter_batch.c
index d56e18b25528..74dbe91806a0 100644
--- a/tools/testing/selftests/bpf/prog_tests/sock_iter_batch.c
+++ b/tools/testing/selftests/bpf/prog_tests/sock_iter_batch.c
@@ -9,12 +9,18 @@
 
 static const int nr_soreuse = 4;
 
+struct iter_out {
+	int idx;
+	__u64 cookie;
+} __packed;
+
 static void do_test(int sock_type, bool onebyone)
 {
 	int err, i, nread, to_read, total_read, iter_fd = -1;
-	int first_idx, second_idx, indices[nr_soreuse];
+	struct iter_out outputs[nr_soreuse];
 	struct bpf_link *link = NULL;
 	struct sock_iter_batch *skel;
+	int first_idx, second_idx;
 	int *fds[2] = {};
 
 	skel = sock_iter_batch__open();
@@ -34,6 +40,7 @@ static void do_test(int sock_type, bool onebyone)
 			goto done;
 		skel->rodata->ports[i] = ntohs(local_port);
 	}
+	skel->rodata->sf = AF_INET6;
 
 	err = sock_iter_batch__load(skel);
 	if (!ASSERT_OK(err, "sock_iter_batch__load"))
@@ -55,38 +62,38 @@ static void do_test(int sock_type, bool onebyone)
 	 * from a bucket and leave one socket out from
 	 * that bucket on purpose.
 	 */
-	to_read = (nr_soreuse - 1) * sizeof(*indices);
+	to_read = (nr_soreuse - 1) * sizeof(*outputs);
 	total_read = 0;
 	first_idx = -1;
 	do {
-		nread = read(iter_fd, indices, onebyone ? sizeof(*indices) : to_read);
-		if (nread <= 0 || nread % sizeof(*indices))
+		nread = read(iter_fd, outputs, onebyone ? sizeof(*outputs) : to_read);
+		if (nread <= 0 || nread % sizeof(*outputs))
 			break;
 		total_read += nread;
 
 		if (first_idx == -1)
-			first_idx = indices[0];
-		for (i = 0; i < nread / sizeof(*indices); i++)
-			ASSERT_EQ(indices[i], first_idx, "first_idx");
+			first_idx = outputs[0].idx;
+		for (i = 0; i < nread / sizeof(*outputs); i++)
+			ASSERT_EQ(outputs[i].idx, first_idx, "first_idx");
 	} while (total_read < to_read);
-	ASSERT_EQ(nread, onebyone ? sizeof(*indices) : to_read, "nread");
+	ASSERT_EQ(nread, onebyone ? sizeof(*outputs) : to_read, "nread");
 	ASSERT_EQ(total_read, to_read, "total_read");
 
 	free_fds(fds[first_idx], nr_soreuse);
 	fds[first_idx] = NULL;
 
 	/* Read the "whole" second bucket */
-	to_read = nr_soreuse * sizeof(*indices);
+	to_read = nr_soreuse * sizeof(*outputs);
 	total_read = 0;
 	second_idx = !first_idx;
 	do {
-		nread = read(iter_fd, indices, onebyone ? sizeof(*indices) : to_read);
-		if (nread <= 0 || nread % sizeof(*indices))
+		nread = read(iter_fd, outputs, onebyone ? sizeof(*outputs) : to_read);
+		if (nread <= 0 || nread % sizeof(*outputs))
 			break;
 		total_read += nread;
 
-		for (i = 0; i < nread / sizeof(*indices); i++)
-			ASSERT_EQ(indices[i], second_idx, "second_idx");
+		for (i = 0; i < nread / sizeof(*outputs); i++)
+			ASSERT_EQ(outputs[i].idx, second_idx, "second_idx");
 	} while (total_read <= to_read);
 	ASSERT_EQ(nread, 0, "nread");
 	/* Both so_reuseport ports should be in different buckets, so
diff --git a/tools/testing/selftests/bpf/progs/bpf_tracing_net.h b/tools/testing/selftests/bpf/progs/bpf_tracing_net.h
index 659694162739..17db400f0e0d 100644
--- a/tools/testing/selftests/bpf/progs/bpf_tracing_net.h
+++ b/tools/testing/selftests/bpf/progs/bpf_tracing_net.h
@@ -128,6 +128,7 @@
 #define sk_refcnt		__sk_common.skc_refcnt
 #define sk_state		__sk_common.skc_state
 #define sk_net			__sk_common.skc_net
+#define sk_rcv_saddr		__sk_common.skc_rcv_saddr
 #define sk_v6_daddr		__sk_common.skc_v6_daddr
 #define sk_v6_rcv_saddr		__sk_common.skc_v6_rcv_saddr
 #define sk_flags		__sk_common.skc_flags
diff --git a/tools/testing/selftests/bpf/progs/sock_iter_batch.c b/tools/testing/selftests/bpf/progs/sock_iter_batch.c
index 96531b0d9d55..8f483337e103 100644
--- a/tools/testing/selftests/bpf/progs/sock_iter_batch.c
+++ b/tools/testing/selftests/bpf/progs/sock_iter_batch.c
@@ -17,6 +17,12 @@ static bool ipv6_addr_loopback(const struct in6_addr *a)
 		a->s6_addr32[2] | (a->s6_addr32[3] ^ bpf_htonl(1))) == 0;
 }
 
+static bool ipv4_addr_loopback(__be32 a)
+{
+	return a == bpf_ntohl(0x7f000001);
+}
+
+volatile const unsigned int sf;
 volatile const __u16 ports[2];
 unsigned int bucket[2];
 
@@ -26,16 +32,20 @@ int iter_tcp_soreuse(struct bpf_iter__tcp *ctx)
 	struct sock *sk = (struct sock *)ctx->sk_common;
 	struct inet_hashinfo *hinfo;
 	unsigned int hash;
+	__u64 sock_cookie;
 	struct net *net;
 	int idx;
 
 	if (!sk)
 		return 0;
 
+	sock_cookie = bpf_get_socket_cookie(sk);
 	sk = bpf_core_cast(sk, struct sock);
-	if (sk->sk_family != AF_INET6 ||
+	if (sk->sk_family != sf ||
 	    sk->sk_state != TCP_LISTEN ||
-	    !ipv6_addr_loopback(&sk->sk_v6_rcv_saddr))
+	    sk->sk_family == AF_INET6 ?
+	    !ipv6_addr_loopback(&sk->sk_v6_rcv_saddr) :
+	    !ipv4_addr_loopback(sk->sk_rcv_saddr))
 		return 0;
 
 	if (sk->sk_num == ports[0])
@@ -52,6 +62,7 @@ int iter_tcp_soreuse(struct bpf_iter__tcp *ctx)
 	hinfo = net->ipv4.tcp_death_row.hashinfo;
 	bucket[idx] = hash & hinfo->lhash2_mask;
 	bpf_seq_write(ctx->meta->seq, &idx, sizeof(idx));
+	bpf_seq_write(ctx->meta->seq, &sock_cookie, sizeof(sock_cookie));
 
 	return 0;
 }
@@ -63,14 +74,18 @@ int iter_udp_soreuse(struct bpf_iter__udp *ctx)
 {
 	struct sock *sk = (struct sock *)ctx->udp_sk;
 	struct udp_table *udptable;
+	__u64 sock_cookie;
 	int idx;
 
 	if (!sk)
 		return 0;
 
+	sock_cookie = bpf_get_socket_cookie(sk);
 	sk = bpf_core_cast(sk, struct sock);
-	if (sk->sk_family != AF_INET6 ||
-	    !ipv6_addr_loopback(&sk->sk_v6_rcv_saddr))
+	if (sk->sk_family != sf ||
+	    sk->sk_family == AF_INET6 ?
+	    !ipv6_addr_loopback(&sk->sk_v6_rcv_saddr) :
+	    !ipv4_addr_loopback(sk->sk_rcv_saddr))
 		return 0;
 
 	if (sk->sk_num == ports[0])
@@ -84,6 +99,7 @@ int iter_udp_soreuse(struct bpf_iter__udp *ctx)
 	udptable = sk->sk_net.net->ipv4.udp_table;
 	bucket[idx] = udp_sk(sk)->udp_portaddr_hash & udptable->mask;
 	bpf_seq_write(ctx->meta->seq, &idx, sizeof(idx));
+	bpf_seq_write(ctx->meta->seq, &sock_cookie, sizeof(sock_cookie));
 
 	return 0;
 }
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v2 bpf-next 5/5] selftests/bpf: Add tests for bucket resume logic in UDP socket iterators
  2025-04-11 17:35 [PATCH v2 bpf-next 0/5] bpf: udp: Exactly-once socket iteration Jordan Rife
                   ` (3 preceding siblings ...)
  2025-04-11 17:35 ` [PATCH v2 bpf-next 4/5] selftests/bpf: Return socket cookies from sock_iter_batch progs Jordan Rife
@ 2025-04-11 17:35 ` Jordan Rife
  2025-04-11 22:32   ` Martin KaFai Lau
  4 siblings, 1 reply; 15+ messages in thread
From: Jordan Rife @ 2025-04-11 17:35 UTC (permalink / raw)
  To: netdev, bpf
  Cc: Jordan Rife, Aditi Ghag, Daniel Borkmann, Martin KaFai Lau,
	Willem de Bruijn, Kuniyuki Iwashima

Introduce a set of tests that exercise various bucket resume scenarios:

* remove_seen resumes iteration after removing a socket from the bucket
  that we've already processed. Before, with the offset-based approach,
  this test would have skipped an unseen socket after resuming
  iteration. With the cookie-based approach, we now see all sockets
  exactly once.
* remove_unseen exercises the condition where the next socket that we
  would have seen is removed from the bucket before we resume iteration.
  This tests the scenario where we need to scan past the first cookie in
  our remembered cookies list to find the socket from which to resume
  iteration.
* remove_all exercises the condition where all sockets we remembered
  were removed from the bucket to make sure iteration terminates and
  returns no more results.
* add_some exercises the condition where a few sockets, but not enough
  to trigger a realloc, are added to the head of the current bucket
  between reads. Before, with the offset-based approach, this test would
  have repeated sockets we've already seen. With the cookie-based
  approach, we now see all sockets exactly once.
* force_realloc exercises the condition that we need to realloc the
  batch on a subsequent read, since more sockets than can be held in the
  current batch array were added to the current bucket. This exercises
  the logic inside bpf_iter_udp_realloc_batch that copies cookies into
  the new batch to make sure nothing is skipped or repeated.

Signed-off-by: Jordan Rife <jordan@jrife.io>
---
 .../bpf/prog_tests/sock_iter_batch.c          | 418 ++++++++++++++++++
 1 file changed, 418 insertions(+)

diff --git a/tools/testing/selftests/bpf/prog_tests/sock_iter_batch.c b/tools/testing/selftests/bpf/prog_tests/sock_iter_batch.c
index 74dbe91806a0..93b992fa5efe 100644
--- a/tools/testing/selftests/bpf/prog_tests/sock_iter_batch.c
+++ b/tools/testing/selftests/bpf/prog_tests/sock_iter_batch.c
@@ -7,6 +7,7 @@
 
 #define TEST_NS "sock_iter_batch_netns"
 
+static const int init_batch_size = 16;
 static const int nr_soreuse = 4;
 
 struct iter_out {
@@ -14,6 +15,422 @@ struct iter_out {
 	__u64 cookie;
 } __packed;
 
+struct sock_count {
+	__u64 cookie;
+	int count;
+};
+
+static int insert(__u64 cookie, struct sock_count counts[], int counts_len)
+{
+	int insert = -1;
+	int i = 0;
+
+	for (; i < counts_len; i++) {
+		if (!counts[i].cookie) {
+			insert = i;
+		} else if (counts[i].cookie == cookie) {
+			insert = i;
+			break;
+		}
+	}
+	if (insert < 0)
+		return insert;
+
+	counts[insert].cookie = cookie;
+	counts[insert].count++;
+
+	return counts[insert].count;
+}
+
+static int read_n(int iter_fd, int n, struct sock_count counts[],
+		  int counts_len)
+{
+	struct iter_out out;
+	int nread = 1;
+	int i = 0;
+
+	for (; nread > 0 && (n < 0 || i < n); i++) {
+		nread = read(iter_fd, &out, sizeof(out));
+		if (!nread || !ASSERT_GE(nread, 1, "nread"))
+			break;
+		ASSERT_GE(insert(out.cookie, counts, counts_len), 0, "insert");
+	}
+
+	ASSERT_TRUE(n < 0 || i == n, "n < 0 || i == n");
+
+	return i;
+}
+
+static __u64 socket_cookie(int fd)
+{
+	__u64 cookie;
+	socklen_t cookie_len = sizeof(cookie);
+	static __u32 duration;	/* for CHECK macro */
+
+	if (CHECK(getsockopt(fd, SOL_SOCKET, SO_COOKIE, &cookie, &cookie_len) < 0,
+		  "getsockopt(SO_COOKIE)", "%s\n", strerror(errno)))
+		return 0;
+	return cookie;
+}
+
+static bool was_seen(int fd, struct sock_count counts[], int counts_len)
+{
+	__u64 cookie = socket_cookie(fd);
+	int i = 0;
+
+	for (; cookie && i < counts_len; i++)
+		if (cookie == counts[i].cookie)
+			return true;
+
+	return false;
+}
+
+static int get_seen_socket(int *fds, struct sock_count counts[], int n)
+{
+	int i = 0;
+
+	for (; i < n; i++)
+		if (was_seen(fds[i], counts, n))
+			return i;
+	return -1;
+}
+
+static int get_nth_socket(int *fds, int fds_len, struct bpf_link *link, int n)
+{
+	int i, nread, iter_fd;
+	int nth_sock_idx = -1;
+	struct iter_out out;
+
+	iter_fd = bpf_iter_create(bpf_link__fd(link));
+	if (!ASSERT_GE(iter_fd, 0, "bpf_iter_create"))
+		return -1;
+
+	for (; n >= 0; n--) {
+		nread = read(iter_fd, &out, sizeof(out));
+		if (!nread || !ASSERT_GE(nread, 1, "nread"))
+			goto done;
+	}
+
+	for (i = 0; i < fds_len && nth_sock_idx < 0; i++)
+		if (fds[i] >= 0 && socket_cookie(fds[i]) == out.cookie)
+			nth_sock_idx = i;
+done:
+	if (iter_fd >= 0)
+		close(iter_fd);
+	return nth_sock_idx;
+}
+
+static int get_seen_count(int fd, struct sock_count counts[], int n)
+{
+	__u64 cookie = socket_cookie(fd);
+	int count = 0;
+	int i = 0;
+
+	for (; cookie && !count && i < n; i++)
+		if (cookie == counts[i].cookie)
+			count = counts[i].count;
+
+	return count;
+}
+
+static void check_n_were_seen_once(int *fds, int fds_len, int n,
+				   struct sock_count counts[], int counts_len)
+{
+	int seen_once = 0;
+	int seen_cnt;
+	int i = 0;
+
+	for (; i < fds_len; i++) {
+		/* Skip any sockets that were closed or that weren't seen
+		 * exactly once.
+		 */
+		if (fds[i] < 0)
+			continue;
+		seen_cnt = get_seen_count(fds[i], counts, counts_len);
+		if (seen_cnt && ASSERT_EQ(seen_cnt, 1, "seen_cnt"))
+			seen_once++;
+	}
+
+	ASSERT_EQ(seen_once, n, "seen_once");
+}
+
+static void remove_seen(int family, int sock_type, const char *addr, __u16 port,
+			int *socks, int socks_len, struct sock_count *counts,
+			int counts_len, struct bpf_link *link, int iter_fd)
+{
+	int close_idx;
+
+	/* Iterate through the first socks_len - 1 sockets. */
+	read_n(iter_fd, socks_len - 1, counts, counts_len);
+
+	/* Make sure we saw socks_len - 1 sockets exactly once. */
+	check_n_were_seen_once(socks, socks_len, socks_len - 1, counts,
+			       counts_len);
+
+	/* Close a socket we've already seen to remove it from the bucket. */
+	close_idx = get_seen_socket(socks, counts, counts_len);
+	if (!ASSERT_GE(close_idx, 0, "close_idx"))
+		return;
+	close(socks[close_idx]);
+	socks[close_idx] = -1;
+
+	/* Iterate through the rest of the sockets. */
+	read_n(iter_fd, -1, counts, counts_len);
+
+	/* Make sure the last socket wasn't skipped and that there were no
+	 * repeats.
+	 */
+	check_n_were_seen_once(socks, socks_len, socks_len - 1, counts,
+			       counts_len);
+}
+
+static void remove_unseen(int family, int sock_type, const char *addr,
+			  __u16 port, int *socks, int socks_len,
+			  struct sock_count *counts, int counts_len,
+			  struct bpf_link *link, int iter_fd)
+{
+	int close_idx;
+
+	/* Iterate through the first socket. */
+	read_n(iter_fd, 1, counts, counts_len);
+
+	/* Make sure we saw a socket from fds. */
+	check_n_were_seen_once(socks, socks_len, 1, counts, counts_len);
+
+	/* Close what would be the next socket in the bucket to exercise the
+	 * condition where we need to skip past the first cookie we remembered.
+	 */
+	close_idx = get_nth_socket(socks, socks_len, link, 1);
+	if (!ASSERT_GE(close_idx, 0, "close_idx"))
+		return;
+	close(socks[close_idx]);
+	socks[close_idx] = -1;
+
+	/* Iterate through the rest of the sockets. */
+	read_n(iter_fd, -1, counts, counts_len);
+
+	/* Make sure the remaining sockets were seen exactly once and that we
+	 * didn't repeat the socket that was already seen.
+	 */
+	check_n_were_seen_once(socks, socks_len, socks_len - 1, counts,
+			       counts_len);
+}
+
+static void remove_all(int family, int sock_type, const char *addr,
+		       __u16 port, int *socks, int socks_len,
+		       struct sock_count *counts, int counts_len,
+		       struct bpf_link *link, int iter_fd)
+{
+	int close_idx, i;
+
+	/* Iterate through the first socket. */
+	read_n(iter_fd, 1, counts, counts_len);
+
+	/* Make sure we saw a socket from fds. */
+	check_n_were_seen_once(socks, socks_len, 1, counts, counts_len);
+
+	/* Close all remaining sockets to exhaust the list of saved cookies and
+	 * exit without putting any sockets into the batch on the next read.
+	 */
+	for (i = 0; i < socks_len - 1; i++) {
+		close_idx = get_nth_socket(socks, socks_len, link, 1);
+		if (!ASSERT_GE(close_idx, 0, "close_idx"))
+			return;
+		close(socks[close_idx]);
+		socks[close_idx] = -1;
+	}
+
+	/* Make sure there are no more sockets returned */
+	ASSERT_EQ(read_n(iter_fd, -1, counts, counts_len), 0, "read_n");
+}
+
+static void add_some(int family, int sock_type, const char *addr, __u16 port,
+		     int *socks, int socks_len, struct sock_count *counts,
+		     int counts_len, struct bpf_link *link, int iter_fd)
+{
+	int *new_socks = NULL;
+
+	/* Iterate through the first socks_len - 1 sockets. */
+	read_n(iter_fd, socks_len - 1, counts, counts_len);
+
+	/* Make sure we saw socks_len - 1 sockets exactly once. */
+	check_n_were_seen_once(socks, socks_len, socks_len - 1, counts,
+			       counts_len);
+
+	/* Double the number of sockets in the bucket. */
+	new_socks = start_reuseport_server(family, sock_type, addr, port, 0,
+					   socks_len);
+	if (!ASSERT_OK_PTR(new_socks, "start_reuseport_server"))
+		goto done;
+
+	/* Iterate through the rest of the sockets. */
+	read_n(iter_fd, -1, counts, counts_len);
+
+	/* Make sure each of the original sockets was seen exactly once. */
+	check_n_were_seen_once(socks, socks_len, socks_len, counts,
+			       counts_len);
+done:
+	if (new_socks)
+		free_fds(new_socks, socks_len);
+}
+
+static void force_realloc(int family, int sock_type, const char *addr,
+			  __u16 port, int *socks, int socks_len,
+			  struct sock_count *counts, int counts_len,
+			  struct bpf_link *link, int iter_fd)
+{
+	int *new_socks = NULL;
+
+	/* Iterate through the first socket just to initialize the batch. */
+	read_n(iter_fd, 1, counts, counts_len);
+
+	/* Double the number of sockets in the bucket to force a realloc on the
+	 * next read.
+	 */
+	new_socks = start_reuseport_server(family, sock_type, addr, port, 0,
+					   socks_len);
+	if (!ASSERT_OK_PTR(new_socks, "start_reuseport_server"))
+		goto done;
+
+	/* Iterate through the rest of the sockets. */
+	read_n(iter_fd, -1, counts, counts_len);
+
+	/* Make sure each socket from the first set was seen exactly once. */
+	check_n_were_seen_once(socks, socks_len, socks_len, counts,
+			       counts_len);
+done:
+	if (new_socks)
+		free_fds(new_socks, socks_len);
+}
+
+struct test_case {
+	void (*test)(int family, int sock_type, const char *addr, __u16 port,
+		     int *socks, int socks_len, struct sock_count *counts,
+		     int counts_len, struct bpf_link *link, int iter_fd);
+	const char *description;
+	int init_socks;
+	int max_socks;
+	int sock_type;
+	int family;
+};
+
+static struct test_case resume_tests[] = {
+	{
+		.description = "udp: resume after removing a seen socket",
+		.init_socks = nr_soreuse,
+		.max_socks = nr_soreuse,
+		.sock_type = SOCK_DGRAM,
+		.family = AF_INET6,
+		.test = remove_seen,
+	},
+	{
+		.description = "udp: resume after removing one unseen socket",
+		.init_socks = nr_soreuse,
+		.max_socks = nr_soreuse,
+		.sock_type = SOCK_DGRAM,
+		.family = AF_INET6,
+		.test = remove_unseen,
+	},
+	{
+		.description = "udp: resume after removing all unseen sockets",
+		.init_socks = nr_soreuse,
+		.max_socks = nr_soreuse,
+		.sock_type = SOCK_DGRAM,
+		.family = AF_INET6,
+		.test = remove_all,
+	},
+	{
+		.description = "udp: resume after adding a few sockets",
+		.init_socks = nr_soreuse,
+		.max_socks = nr_soreuse,
+		.sock_type = SOCK_DGRAM,
+		/* Use AF_INET so that new sockets are added to the head of the
+		 * bucket's list.
+		 */
+		.family = AF_INET,
+		.test = add_some,
+	},
+	{
+		.description = "udp: force a realloc to occur",
+		.init_socks = init_batch_size,
+		.max_socks = init_batch_size * 2,
+		.sock_type = SOCK_DGRAM,
+		/* Use AF_INET6 so that new sockets are added to the tail of the
+		 * bucket's list, needing to be added to the next batch to force
+		 * a realloc.
+		 */
+		.family = AF_INET6,
+		.test = force_realloc,
+	},
+};
+
+static void do_resume_test(struct test_case *tc)
+{
+	static const __u16 port = 10001;
+	struct bpf_link *link = NULL;
+	struct sock_iter_batch *skel;
+	struct sock_count *counts;
+	int err, iter_fd = -1;
+	const char *addr;
+	int *fds;
+
+	counts = calloc(tc->max_socks, sizeof(*counts));
+	if (!counts)
+		return;
+	skel = sock_iter_batch__open();
+	if (!ASSERT_OK_PTR(skel, "sock_iter_batch__open"))
+		return;
+
+	/* Prepare a bucket of sockets in the kernel hashtable */
+	int local_port;
+
+	addr = tc->family == AF_INET6 ? "::1" : "127.0.0.1";
+	fds = start_reuseport_server(tc->family, tc->sock_type, addr, port, 0,
+				     tc->init_socks);
+	if (!ASSERT_OK_PTR(fds, "start_reuseport_server"))
+		goto done;
+	local_port = get_socket_local_port(*fds);
+	if (!ASSERT_GE(local_port, 0, "get_socket_local_port"))
+		goto done;
+	skel->rodata->ports[0] = ntohs(local_port);
+	skel->rodata->sf = tc->family;
+
+	err = sock_iter_batch__load(skel);
+	if (!ASSERT_OK(err, "sock_iter_batch__load"))
+		goto done;
+
+	link = bpf_program__attach_iter(tc->sock_type == SOCK_STREAM ?
+					skel->progs.iter_tcp_soreuse :
+					skel->progs.iter_udp_soreuse,
+					NULL);
+	if (!ASSERT_OK_PTR(link, "bpf_program__attach_iter"))
+		goto done;
+
+	iter_fd = bpf_iter_create(bpf_link__fd(link));
+	if (!ASSERT_GE(iter_fd, 0, "bpf_iter_create"))
+		goto done;
+
+	tc->test(tc->family, tc->sock_type, addr, port, fds, tc->init_socks,
+		 counts, tc->max_socks, link, iter_fd);
+done:
+	free_fds(fds, tc->init_socks);
+	if (iter_fd >= 0)
+		close(iter_fd);
+	bpf_link__destroy(link);
+	sock_iter_batch__destroy(skel);
+}
+
+static void do_resume_tests(void)
+{
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(resume_tests); i++) {
+		if (test__start_subtest(resume_tests[i].description)) {
+			do_resume_test(&resume_tests[i]);
+		}
+	}
+}
+
 static void do_test(int sock_type, bool onebyone)
 {
 	int err, i, nread, to_read, total_read, iter_fd = -1;
@@ -135,6 +552,7 @@ void test_sock_iter_batch(void)
 		do_test(SOCK_DGRAM, true);
 		do_test(SOCK_DGRAM, false);
 	}
+	do_resume_tests();
 	close_netns(nstoken);
 
 done:
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 bpf-next 3/5] bpf: udp: Avoid socket skips and repeats during iteration
  2025-04-11 17:35 ` [PATCH v2 bpf-next 3/5] bpf: udp: Avoid socket skips and repeats during iteration Jordan Rife
@ 2025-04-11 20:12   ` Kuniyuki Iwashima
  0 siblings, 0 replies; 15+ messages in thread
From: Kuniyuki Iwashima @ 2025-04-11 20:12 UTC (permalink / raw)
  To: jordan
  Cc: aditi.ghag, bpf, daniel, kuniyu, martin.lau, netdev,
	willemdebruijn.kernel

From: Jordan Rife <jordan@jrife.io>
Date: Fri, 11 Apr 2025 10:35:43 -0700
> Replace the offset-based approach for tracking progress through a bucket
> in the UDP table with one based on socket cookies. Remember the cookies
> of unprocessed sockets from the last batch and use this list to
> pick up where we left off or, in the case that the next socket
> disappears between reads, find the first socket after that point that
> still exists in the bucket and resume from there.
> 
> In order to make the control flow a bit easier to follow inside
> bpf_iter_udp_batch, introduce the udp_portaddr_for_each_entry_from macro
> and use this to split bucket processing into two stages: finding the
> starting point and adding items to the next batch. Originally, I
> implemented this patch inside a single udp_portaddr_for_each_entry loop,
> as it was before, but I found the resulting logic a bit messy. Overall,
> this version seems more readable.
> 
> Signed-off-by: Jordan Rife <jordan@jrife.io>

Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>

Thanks!

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 bpf-next 2/5] bpf: udp: Propagate ENOMEM up from bpf_iter_udp_batch
  2025-04-11 17:35 ` [PATCH v2 bpf-next 2/5] bpf: udp: Propagate ENOMEM up from bpf_iter_udp_batch Jordan Rife
@ 2025-04-11 22:10   ` Martin KaFai Lau
  2025-04-11 23:31     ` Jordan Rife
  0 siblings, 1 reply; 15+ messages in thread
From: Martin KaFai Lau @ 2025-04-11 22:10 UTC (permalink / raw)
  To: Jordan Rife
  Cc: Aditi Ghag, Daniel Borkmann, Willem de Bruijn, Kuniyuki Iwashima,
	bpf, netdev

On 4/11/25 10:35 AM, Jordan Rife wrote:
> Stop iteration if the current bucket can't be contained in a single
> batch to avoid choosing between skipping or repeating sockets. In cases
> where none of the saved cookies can be found in the current bucket and

Since the cookie change is in the next patch, it will be useful to mention it is 
a prep work for the next patch.

> the batch isn't big enough to contain all the sockets in the bucket,
> there are really only two choices, neither of which is desirable:
> 
> 1. Start from the beginning, assuming we haven't seen any sockets in the
>     current bucket, in which case we might repeat a socket we've already
>     seen.
> 2. Go to the next bucket to avoid repeating a socket we may have already
>     seen, in which case we may accidentally skip a socket that we didn't
>     yet visit.
> 
> To avoid this tradeoff, enforce the invariant that the batch always
> contains a full snapshot of the bucket from last time by returning
> -ENOMEM if bpf_iter_udp_realloc_batch() can't grab enough memory to fit
> all sockets in the current bucket.
> 
> To test this code path, I forced bpf_iter_udp_realloc_batch() to return
> -ENOMEM when called from within bpf_iter_udp_batch() and observed that
> read() fails in userspace with errno set to ENOMEM. Otherwise, it's a
> bit hard to test this scenario.
> 
> Link: https://lore.kernel.org/bpf/CABi4-ogUtMrH8-NVB6W8Xg_F_KDLq=yy-yu-tKr2udXE2Mu1Lg@mail.gmail.com/
> 
> Signed-off-by: Jordan Rife <jordan@jrife.io>
> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> ---
>   net/ipv4/udp.c | 7 ++++++-
>   1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index 59c3281962b9..1e8ae08d24db 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -3410,6 +3410,7 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
>   	unsigned int batch_sks = 0;
>   	bool resized = false;
>   	struct sock *sk;
> +	int err = 0;
>   
>   	resume_bucket = state->bucket;
>   	resume_offset = iter->offset;
> @@ -3475,7 +3476,11 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
>   		iter->st_bucket_done = true;
>   		goto done;
>   	}
> -	if (!resized && !bpf_iter_udp_realloc_batch(iter, batch_sks * 3 / 2)) {
> +	if (!resized) {

The resized == true case will have a similar issue. Meaning the next 
bpf_iter_udp_batch() will end up skipping the remaining sk in that bucket, e.g. 
the partial-bucket batch has been consumed, so cur_sk == end_sk but 
st_bucket_done == false and bpf_iter_udp_resume() returns NULL. It is sort of a 
regression from the current "offset" implementation for this case. Any thought 
on how to make it better?

> +		err = bpf_iter_udp_realloc_batch(iter, batch_sks * 3 / 2);
> +		if (err)
> +			return ERR_PTR(err);
> +
>   		resized = true;
>   		/* After allocating a larger batch, retry one more time to grab
>   		 * the whole bucket.


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 bpf-next 5/5] selftests/bpf: Add tests for bucket resume logic in UDP socket iterators
  2025-04-11 17:35 ` [PATCH v2 bpf-next 5/5] selftests/bpf: Add tests for bucket resume logic in UDP socket iterators Jordan Rife
@ 2025-04-11 22:32   ` Martin KaFai Lau
  2025-04-15  0:04     ` Jordan Rife
  0 siblings, 1 reply; 15+ messages in thread
From: Martin KaFai Lau @ 2025-04-11 22:32 UTC (permalink / raw)
  To: Jordan Rife
  Cc: bpf, netdev, Aditi Ghag, Daniel Borkmann, Willem de Bruijn,
	Kuniyuki Iwashima

On 4/11/25 10:35 AM, Jordan Rife wrote:
> Introduce a set of tests that exercise various bucket resume scenarios:
> 
> * remove_seen resumes iteration after removing a socket from the bucket
>    that we've already processed. Before, with the offset-based approach,
>    this test would have skipped an unseen socket after resuming
>    iteration. With the cookie-based approach, we now see all sockets
>    exactly once.
> * remove_unseen exercises the condition where the next socket that we
>    would have seen is removed from the bucket before we resume iteration.
>    This tests the scenario where we need to scan past the first cookie in
>    our remembered cookies list to find the socket from which to resume
>    iteration.
> * remove_all exercises the condition where all sockets we remembered
>    were removed from the bucket to make sure iteration terminates and
>    returns no more results.
> * add_some exercises the condition where a few, but not enough to
>    trigger a realloc, sockets are added to the head of the current bucket
>    between reads. Before, with the offset-based approach, this test would
>    have repeated sockets we've already seen. With the cookie-based
>    approach, we now see all sockets exactly once.
> * force_realloc exercises the condition that we need to realloc the
>    batch on a subsequent read, since more sockets than can be held in the
>    current batch array were added to the current bucket. This exercises
>    the logic inside bpf_iter_udp_realloc_batch that copies cookies into
>    the new batch to make sure nothing is skipped or repeated.
> 
> Signed-off-by: Jordan Rife <jordan@jrife.io>
> ---
>   .../bpf/prog_tests/sock_iter_batch.c          | 418 ++++++++++++++++++
>   1 file changed, 418 insertions(+)
> 
> diff --git a/tools/testing/selftests/bpf/prog_tests/sock_iter_batch.c b/tools/testing/selftests/bpf/prog_tests/sock_iter_batch.c
> index 74dbe91806a0..93b992fa5efe 100644
> --- a/tools/testing/selftests/bpf/prog_tests/sock_iter_batch.c
> +++ b/tools/testing/selftests/bpf/prog_tests/sock_iter_batch.c
> @@ -7,6 +7,7 @@
>   
>   #define TEST_NS "sock_iter_batch_netns"
>   
> +static const int init_batch_size = 16;
>   static const int nr_soreuse = 4;
>   
>   struct iter_out {
> @@ -14,6 +15,422 @@ struct iter_out {
>   	__u64 cookie;
>   } __packed;
>   
> +struct sock_count {
> +	__u64 cookie;
> +	int count;
> +};
> +
> +static int insert(__u64 cookie, struct sock_count counts[], int counts_len)
> +{
> +	int insert = -1;
> +	int i = 0;
> +
> +	for (; i < counts_len; i++) {
> +		if (!counts[i].cookie) {
> +			insert = i;
> +		} else if (counts[i].cookie == cookie) {
> +			insert = i;
> +			break;
> +		}
> +	}
> +	if (insert < 0)
> +		return insert;
> +
> +	counts[insert].cookie = cookie;
> +	counts[insert].count++;
> +
> +	return counts[insert].count;
> +}
> +
> +static int read_n(int iter_fd, int n, struct sock_count counts[],
> +		  int counts_len)
> +{
> +	struct iter_out out;
> +	int nread = 1;
> +	int i = 0;
> +
> +	for (; nread > 0 && (n < 0 || i < n); i++) {
> +		nread = read(iter_fd, &out, sizeof(out));
> +		if (!nread || !ASSERT_GE(nread, 1, "nread"))

why checks nread >= 1 instead of nread == sizeof(out)?

> +			break;
> +		ASSERT_GE(insert(out.cookie, counts, counts_len), 0, "insert");
> +	}
> +
> +	ASSERT_TRUE(n < 0 || i == n, "n < 0 || i == n");
> +
> +	return i;
> +}
> +
> +static __u64 socket_cookie(int fd)
> +{
> +	__u64 cookie;
> +	socklen_t cookie_len = sizeof(cookie);
> +	static __u32 duration;	/* for CHECK macro */
> +
> +	if (CHECK(getsockopt(fd, SOL_SOCKET, SO_COOKIE, &cookie, &cookie_len) < 0,

ASSERT_OK

> +		  "getsockopt(SO_COOKIE)", "%s\n", strerror(errno)))
> +		return 0;
> +	return cookie;
> +}
> +
> +static bool was_seen(int fd, struct sock_count counts[], int counts_len)
> +{
> +	__u64 cookie = socket_cookie(fd);
> +	int i = 0;
> +
> +	for (; cookie && i < counts_len; i++)
> +		if (cookie == counts[i].cookie)
> +			return true;
> +
> +	return false;
> +}
> +
> +static int get_seen_socket(int *fds, struct sock_count counts[], int n)
> +{
> +	int i = 0;
> +
> +	for (; i < n; i++)
> +		if (was_seen(fds[i], counts, n))
> +			return i;
> +	return -1;
> +}
> +
> +static int get_nth_socket(int *fds, int fds_len, struct bpf_link *link, int n)
> +{
> +	int i, nread, iter_fd;
> +	int nth_sock_idx = -1;
> +	struct iter_out out;
> +
> +	iter_fd = bpf_iter_create(bpf_link__fd(link));
> +	if (!ASSERT_GE(iter_fd, 0, "bpf_iter_create"))

ASSERT_OK_FD

> +		return -1;
> +
> +	for (; n >= 0; n--) {
> +		nread = read(iter_fd, &out, sizeof(out));
> +		if (!nread || !ASSERT_GE(nread, 1, "nread"))
> +			goto done;
> +	}
> +
> +	for (i = 0; i < fds_len && nth_sock_idx < 0; i++)
> +		if (fds[i] >= 0 && socket_cookie(fds[i]) == out.cookie)
> +			nth_sock_idx = i;
> +done:
> +	if (iter_fd >= 0)

Not needed.

> +		close(iter_fd);
> +	return nth_sock_idx;
> +}
> +
> +static int get_seen_count(int fd, struct sock_count counts[], int n)
> +{
> +	__u64 cookie = socket_cookie(fd);
> +	int count = 0;
> +	int i = 0;
> +
> +	for (; cookie && !count && i < n; i++)
> +		if (cookie == counts[i].cookie)
> +			count = counts[i].count;
> +
> +	return count;
> +}
> +
> +static void check_n_were_seen_once(int *fds, int fds_len, int n,
> +				   struct sock_count counts[], int counts_len)
> +{
> +	int seen_once = 0;
> +	int seen_cnt;
> +	int i = 0;
> +
> +	for (; i < fds_len; i++) {
> +		/* Skip any sockets that were closed or that weren't seen
> +		 * exactly once.
> +		 */
> +		if (fds[i] < 0)
> +			continue;
> +		seen_cnt = get_seen_count(fds[i], counts, counts_len);
> +		if (seen_cnt && ASSERT_EQ(seen_cnt, 1, "seen_cnt"))
> +			seen_once++;
> +	}
> +
> +	ASSERT_EQ(seen_once, n, "seen_once");
> +}
> +
> +static void remove_seen(int family, int sock_type, const char *addr, __u16 port,
> +			int *socks, int socks_len, struct sock_count *counts,
> +			int counts_len, struct bpf_link *link, int iter_fd)
> +{
> +	int close_idx;
> +
> +	/* Iterate through the first socks_len - 1 sockets. */
> +	read_n(iter_fd, socks_len - 1, counts, counts_len);
> +
> +	/* Make sure we saw socks_len - 1 sockets exactly once. */
> +	check_n_were_seen_once(socks, socks_len, socks_len - 1, counts,
> +			       counts_len);
> +
> +	/* Close a socket we've already seen to remove it from the bucket. */
> +	close_idx = get_seen_socket(socks, counts, counts_len);
> +	if (!ASSERT_GE(close_idx, 0, "close_idx"))
> +		return;
> +	close(socks[close_idx]);
> +	socks[close_idx] = -1;
> +
> +	/* Iterate through the rest of the sockets. */
> +	read_n(iter_fd, -1, counts, counts_len);
> +
> +	/* Make sure the last socket wasn't skipped and that there were no
> +	 * repeats.
> +	 */
> +	check_n_were_seen_once(socks, socks_len, socks_len - 1, counts,
> +			       counts_len);
> +}
> +
> +static void remove_unseen(int family, int sock_type, const char *addr,
> +			  __u16 port, int *socks, int socks_len,
> +			  struct sock_count *counts, int counts_len,
> +			  struct bpf_link *link, int iter_fd)
> +{
> +	int close_idx;
> +
> +	/* Iterate through the first socket. */
> +	read_n(iter_fd, 1, counts, counts_len);
> +
> +	/* Make sure we saw a socket from fds. */
> +	check_n_were_seen_once(socks, socks_len, 1, counts, counts_len);
> +
> +	/* Close what would be the next socket in the bucket to exercise the
> +	 * condition where we need to skip past the first cookie we remembered.
> +	 */
> +	close_idx = get_nth_socket(socks, socks_len, link, 1);
> +	if (!ASSERT_GE(close_idx, 0, "close_idx"))
> +		return;
> +	close(socks[close_idx]);
> +	socks[close_idx] = -1;
> +
> +	/* Iterate through the rest of the sockets. */
> +	read_n(iter_fd, -1, counts, counts_len);
> +
> +	/* Make sure the remaining sockets were seen exactly once and that we
> +	 * didn't repeat the socket that was already seen.
> +	 */
> +	check_n_were_seen_once(socks, socks_len, socks_len - 1, counts,
> +			       counts_len);
> +}
> +
> +static void remove_all(int family, int sock_type, const char *addr,
> +		       __u16 port, int *socks, int socks_len,
> +		       struct sock_count *counts, int counts_len,
> +		       struct bpf_link *link, int iter_fd)
> +{
> +	int close_idx, i;
> +
> +	/* Iterate through the first socket. */
> +	read_n(iter_fd, 1, counts, counts_len);
> +
> +	/* Make sure we saw a socket from fds. */
> +	check_n_were_seen_once(socks, socks_len, 1, counts, counts_len);
> +
> +	/* Close all remaining sockets to exhaust the list of saved cookies and
> +	 * exit without putting any sockets into the batch on the next read.
> +	 */
> +	for (i = 0; i < socks_len - 1; i++) {
> +		close_idx = get_nth_socket(socks, socks_len, link, 1);
> +		if (!ASSERT_GE(close_idx, 0, "close_idx"))
> +			return;
> +		close(socks[close_idx]);
> +		socks[close_idx] = -1;
> +	}
> +
> +	/* Make sure there are no more sockets returned */
> +	ASSERT_EQ(read_n(iter_fd, -1, counts, counts_len), 0, "read_n");
> +}
> +
> +static void add_some(int family, int sock_type, const char *addr, __u16 port,
> +		     int *socks, int socks_len, struct sock_count *counts,
> +		     int counts_len, struct bpf_link *link, int iter_fd)
> +{
> +	int *new_socks = NULL;
> +
> +	/* Iterate through the first socks_len - 1 sockets. */
> +	read_n(iter_fd, socks_len - 1, counts, counts_len);
> +
> +	/* Make sure we saw socks_len - 1 sockets exactly once. */
> +	check_n_were_seen_once(socks, socks_len, socks_len - 1, counts,
> +			       counts_len);
> +
> +	/* Double the number of sockets in the bucket. */
> +	new_socks = start_reuseport_server(family, sock_type, addr, port, 0,
> +					   socks_len);
> +	if (!ASSERT_OK_PTR(new_socks, "start_reuseport_server"))
> +		goto done;
> +
> +	/* Iterate through the rest of the sockets. */
> +	read_n(iter_fd, -1, counts, counts_len);
> +
> +	/* Make sure each of the original sockets was seen exactly once. */
> +	check_n_were_seen_once(socks, socks_len, socks_len, counts,
> +			       counts_len);
> +done:
> +	if (new_socks)

Not needed.

> +		free_fds(new_socks, socks_len);
> +}
> +
> +static void force_realloc(int family, int sock_type, const char *addr,
> +			  __u16 port, int *socks, int socks_len,
> +			  struct sock_count *counts, int counts_len,
> +			  struct bpf_link *link, int iter_fd)
> +{
> +	int *new_socks = NULL;
> +
> +	/* Iterate through the first socket just to initialize the batch. */
> +	read_n(iter_fd, 1, counts, counts_len);
> +
> +	/* Double the number of sockets in the bucket to force a realloc on the
> +	 * next read.
> +	 */
> +	new_socks = start_reuseport_server(family, sock_type, addr, port, 0,
> +					   socks_len);
> +	if (!ASSERT_OK_PTR(new_socks, "start_reuseport_server"))
> +		goto done;
> +
> +	/* Iterate through the rest of the sockets. */
> +	read_n(iter_fd, -1, counts, counts_len);
> +
> +	/* Make sure each socket from the first set was seen exactly once. */
> +	check_n_were_seen_once(socks, socks_len, socks_len, counts,
> +			       counts_len);
> +done:
> +	if (new_socks)

Not needed.

> +		free_fds(new_socks, socks_len);
> +}
> +
> +struct test_case {
> +	void (*test)(int family, int sock_type, const char *addr, __u16 port,
> +		     int *socks, int socks_len, struct sock_count *counts,
> +		     int counts_len, struct bpf_link *link, int iter_fd);
> +	const char *description;
> +	int init_socks;
> +	int max_socks;
> +	int sock_type;
> +	int family;
> +};
> +
> +static struct test_case resume_tests[] = {
> +	{
> +		.description = "udp: resume after removing a seen socket",
> +		.init_socks = nr_soreuse,
> +		.max_socks = nr_soreuse,
> +		.sock_type = SOCK_DGRAM,
> +		.family = AF_INET6,
> +		.test = remove_seen,
> +	},
> +	{
> +		.description = "udp: resume after removing one unseen socket",
> +		.init_socks = nr_soreuse,
> +		.max_socks = nr_soreuse,
> +		.sock_type = SOCK_DGRAM,
> +		.family = AF_INET6,
> +		.test = remove_unseen,
> +	},
> +	{
> +		.description = "udp: resume after removing all unseen sockets",
> +		.init_socks = nr_soreuse,
> +		.max_socks = nr_soreuse,
> +		.sock_type = SOCK_DGRAM,
> +		.family = AF_INET6,
> +		.test = remove_all,
> +	},
> +	{
> +		.description = "udp: resume after adding a few sockets",
> +		.init_socks = nr_soreuse,
> +		.max_socks = nr_soreuse,
> +		.sock_type = SOCK_DGRAM,
> +		/* Use AF_INET so that new sockets are added to the head of the
> +		 * bucket's list.
> +		 */
> +		.family = AF_INET,
> +		.test = add_some,
> +	},
> +	{
> +		.description = "udp: force a realloc to occur",
> +		.init_socks = init_batch_size,
> +		.max_socks = init_batch_size * 2,
> +		.sock_type = SOCK_DGRAM,
> +		/* Use AF_INET6 so that new sockets are added to the tail of the
> +		 * bucket's list, needing to be added to the next batch to force
> +		 * a realloc.
> +		 */
> +		.family = AF_INET6,
> +		.test = force_realloc,
> +	},
> +};
> +
> +static void do_resume_test(struct test_case *tc)
> +{
> +	static const __u16 port = 10001;
> +	struct bpf_link *link = NULL;
> +	struct sock_iter_batch *skel;
> +	struct sock_count *counts;
> +	int err, iter_fd = -1;
> +	const char *addr;
> +	int *fds;
> +
> +	counts = calloc(tc->max_socks, sizeof(*counts));


free(counts) is missing.

> +	if (!counts)
> +		return;
> +	skel = sock_iter_batch__open();
> +	if (!ASSERT_OK_PTR(skel, "sock_iter_batch__open"))
> +		return;
> +
> +	/* Prepare a bucket of sockets in the kernel hashtable */
> +	int local_port;

Move to the beginning of the function.

> +
> +	addr = tc->family == AF_INET6 ? "::1" : "127.0.0.1";
> +	fds = start_reuseport_server(tc->family, tc->sock_type, addr, port, 0,
> +				     tc->init_socks);
> +	if (!ASSERT_OK_PTR(fds, "start_reuseport_server"))
> +		goto done;
> +	local_port = get_socket_local_port(*fds);
> +	if (!ASSERT_GE(local_port, 0, "get_socket_local_port"))
> +		goto done;
> +	skel->rodata->ports[0] = ntohs(local_port);
> +	skel->rodata->sf = tc->family;
> +
> +	err = sock_iter_batch__load(skel);
> +	if (!ASSERT_OK(err, "sock_iter_batch__load"))
> +		goto done;
> +
> +	link = bpf_program__attach_iter(tc->sock_type == SOCK_STREAM ?
> +					skel->progs.iter_tcp_soreuse :
> +					skel->progs.iter_udp_soreuse,
> +					NULL);
> +	if (!ASSERT_OK_PTR(link, "bpf_program__attach_iter"))
> +		goto done;
> +
> +	iter_fd = bpf_iter_create(bpf_link__fd(link));
> +	if (!ASSERT_GE(iter_fd, 0, "bpf_iter_create"))

Use ASSERT_OK_FD here.

> +		goto done;
> +
> +	tc->test(tc->family, tc->sock_type, addr, port, fds, tc->init_socks,
> +		 counts, tc->max_socks, link, iter_fd);
> +done:
> +	free_fds(fds, tc->init_socks);
> +	if (iter_fd >= 0)
> +		close(iter_fd);
> +	bpf_link__destroy(link);
> +	sock_iter_batch__destroy(skel);
> +}
> +
> +static void do_resume_tests(void)
> +{
> +	int i;
> +
> +	for (i = 0; i < ARRAY_SIZE(resume_tests); i++) {
> +		if (test__start_subtest(resume_tests[i].description)) {
> +			do_resume_test(&resume_tests[i]);
> +		}
> +	}
> +}
> +
>   static void do_test(int sock_type, bool onebyone)
>   {
>   	int err, i, nread, to_read, total_read, iter_fd = -1;
> @@ -135,6 +552,7 @@ void test_sock_iter_batch(void)
>   		do_test(SOCK_DGRAM, true);
>   		do_test(SOCK_DGRAM, false);
>   	}
> +	do_resume_tests();
>   	close_netns(nstoken);
>   
>   done:



* Re: [PATCH v2 bpf-next 2/5] bpf: udp: Propagate ENOMEM up from bpf_iter_udp_batch
  2025-04-11 22:10   ` Martin KaFai Lau
@ 2025-04-11 23:31     ` Jordan Rife
  2025-04-12  3:47       ` Martin KaFai Lau
  0 siblings, 1 reply; 15+ messages in thread
From: Jordan Rife @ 2025-04-11 23:31 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: Aditi Ghag, Daniel Borkmann, Willem de Bruijn, Kuniyuki Iwashima,
	bpf, netdev

> Since the cookie change is in the next patch, it will be useful to mention it is
> a prep work for the next patch.

Sure, will do.

> The resized == true case will have a similar issue. Meaning the next
> bpf_iter_udp_batch() will end up skipping the remaining sk in that bucket, e.g.
> the partial-bucket batch has been consumed, so cur_sk == end_sk but
> st_bucket_done == false and bpf_iter_udp_resume() returns NULL. It is sort of a
> regression from the current "offset" implementation for this case. Any thought
> on how to make it better?

Are you referring to the case where the bucket grows in size so much
between releasing and reacquiring the bucket's lock to where we still
can't fit all sockets into our batch even after a
bpf_iter_udp_realloc_batch()? If so, I think we touched on this a bit
in [1]:

> > Barring the possibility that bpf_iter_udp_realloc_batch() fails to
> > grab more memory (should this basically never happen?), this should
> > ensure that we always grab the full contents of the bucket on the
> > second go around. However, the spin lock on hslot2->lock is released
> > before doing this. Would it not be more accurate to hold onto the lock
> > until after the second attempt, so we know the size isn't changing
> > between the time where we release the lock and the time when we
> > reacquire it post-batch-resize. The bucket size would have to grow by
> > more than 1.5x for the new size to be insufficient, so I may be
> > splitting hairs here, but just something I noticed.
>
> It is a very corner case.
>
> I guess it can with GFP_ATOMIC. I just didn't think it was needed considering
> the key of the hash is addresses+ports. If we have many socks collided on the
> same addresses+ports bucket, that would be a better hashtable problem to solve
> first.
>
> The default batch size is 16 now. On the listening hashtable + SO_REUSEPORT,
> userspace may have one listen sk binding on the same address+port for each
> thread. It is not uncommon to have hundreds of CPU cores now, so it may actually
> need to hit the realloc_batch() path once and then likely will stay at that size
> for the whole hashtable iteration.

Would it be worth using GFP_ATOMIC here so that we can hold onto the
spinlock and guarantee the bucket size doesn't change?

Some alternatives I can imagine are:

1) Loop until iter->end_sk == batch_sks, possibly calling realloc a
couple times. The unbounded loop is a bit worrying; I guess
bpf_iter_udp_batch could "race" if the bucket size keeps growing here.
2) Loop some bounded number of times and return some ERR_PTR(x) if the
loop can't keep up after a few tries so we don't break the invariant
that the batch is always a full snapshot of a bucket.
3) Set some flag in the iterator state, e.g. iter->is_partial,
indicating to the next call to bpf_iter_udp_realloc_batch() that the
last batch was actually partial and that if it can't find any of the
cookies from last time it should start over from the beginning of the
bucket instead of advancing to the next bucket. This may repeat
sockets we've already seen in the worst case, but still better than
skipping them.

I kind of like the GFP_ATOMIC thing if possible.
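
To make that option concrete, here is a rough, not-compiled sketch of what I
am picturing. It assumes bpf_iter_udp_realloc_batch() gains a gfp_t argument;
the two-pass scan is just one way to size the batch while the lock is held and
otherwise follows the existing batching code as far as I can tell:

	spin_lock_bh(&hslot2->lock);
	/* First pass: count the matching sockets while holding the lock. */
	udp_portaddr_for_each_entry(sk, &hslot2->head)
		if (seq_sk_match(seq, sk))
			batch_sks++;

	/* Grow the batch without dropping the lock so the bucket cannot
	 * change before the snapshot below is taken.
	 */
	if (batch_sks > iter->max_sk &&
	    bpf_iter_udp_realloc_batch(iter, batch_sks * 3 / 2, GFP_ATOMIC)) {
		spin_unlock_bh(&hslot2->lock);
		return ERR_PTR(-ENOMEM);
	}

	/* Second pass: take a full snapshot of the bucket. */
	udp_portaddr_for_each_entry(sk, &hslot2->head) {
		if (seq_sk_match(seq, sk)) {
			sock_hold(sk);
			iter->batch[iter->end_sk++].sock = sk;
		}
	}
	spin_unlock_bh(&hslot2->lock);

The downside is the extra walk over the bucket and the GFP_ATOMIC allocation,
but both only matter when the bucket outgrows the current batch.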

[1]: https://lore.kernel.org/bpf/c53cee32-02c0-4c5a-a57d-910b12e73afd@linux.dev/

-Jordan


* Re: [PATCH v2 bpf-next 2/5] bpf: udp: Propagate ENOMEM up from bpf_iter_udp_batch
  2025-04-11 23:31     ` Jordan Rife
@ 2025-04-12  3:47       ` Martin KaFai Lau
  2025-04-14  0:02         ` Jordan Rife
  0 siblings, 1 reply; 15+ messages in thread
From: Martin KaFai Lau @ 2025-04-12  3:47 UTC (permalink / raw)
  To: Jordan Rife
  Cc: Aditi Ghag, Daniel Borkmann, Willem de Bruijn, Kuniyuki Iwashima,
	bpf, netdev

On 4/11/25 4:31 PM, Jordan Rife wrote:
>> The resized == true case will have a similar issue. Meaning the next
>> bpf_iter_udp_batch() will end up skipping the remaining sk in that bucket, e.g.
>> the partial-bucket batch has been consumed, so cur_sk == end_sk but
>> st_bucket_done == false and bpf_iter_udp_resume() returns NULL. It is sort of a
>> regression from the current "offset" implementation for this case. Any thought
>> on how to make it better?
> 
> Are you referring to the case where the bucket grows in size so much
> between releasing and reacquiring the bucket's lock to where we still
> can't fit all sockets into our batch even after a
> bpf_iter_udp_realloc_batch()? If so, I think we touched on this a bit
> in [1]:

Right, and it is also the same as the kvmalloc failure case that this patch is
handling. Let's see if it can be done better without returning an error in
both cases.

> 1) Loop until iter->end_sk == batch_sks, possibly calling realloc a
> couple times. The unbounded loop is a bit worrying; I guess
> bpf_iter_udp_batch could "race" if the bucket size keeps growing here.
> 2) Loop some bounded number of times and return some ERR_PTR(x) if the
> loop can't keep up after a few tries so we don't break the invariant
> that the batch is always a full snapshot of a bucket.
> 3) Set some flag in the iterator state, e.g. iter->is_partial,
> indicating to the next call to bpf_iter_udp_realloc_batch() that the
> last batch was actually partial and that if it can't find any of the
> cookies from last time it should start over from the beginning of the
> bucket instead of advancing to the next bucket. This may repeat
> sockets we've already seen in the worst case, but still better than
> skipping them.

Probably something like (3) but I don't think it needs a new "is_partial". The 
existing "st_bucket_done" should do.

How about, for the "st_bucket_done == false" case, also storing the cookie
before advancing cur_sk in bpf_iter_udp_seq_next()?

not-compiled code:

static void *bpf_iter_udp_seq_next(struct seq_file *seq, void *v, loff_t *pos)
{
	if (iter->cur_sk < iter->end_sk) {
		u64 cookie;

		cookie = iter->st_bucket_done ?
			0 : __sock_gen_cookie(iter->batch[iter->cur_sk].sock);
		sock_put(iter->batch[iter->cur_sk].sock);
		iter->batch[iter->cur_sk++].cookie = cookie;
	}

	/* ... */
}

In bpf_iter_udp_resume(), if it cannot find the first sk from find_cookie to 
end_cookie, then it searches backward from find_cookie to 0. If nothing found, 
then it should start from the beginning of the resume_bucket. Would it work?
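
Roughly, not compiled, where sk_with_cookie() is a made-up helper that walks
the bucket looking for a socket with the given cookie, and the
find_cookie/end_cookie indexing is as described above:

	struct sock *sk = NULL;
	int i;

	/* Try the cookies we had not consumed yet. */
	for (i = find_cookie; !sk && i < end_cookie; i++)
		sk = sk_with_cookie(hslot2, iter->batch[i].cookie);

	/* Fall back to the cookies we had already consumed. */
	for (i = find_cookie - 1; !sk && i >= 0; i--)
		sk = sk_with_cookie(hslot2, iter->batch[i].cookie);

	/* Nothing from the old batch survived: restart from the head of the
	 * resume bucket.
	 */
	if (!sk)
		sk = hlist_entry_safe(hslot2->head.first, struct sock,
				      __sk_common.skc_portaddr_node);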




* Re: [PATCH v2 bpf-next 2/5] bpf: udp: Propagate ENOMEM up from bpf_iter_udp_batch
  2025-04-12  3:47       ` Martin KaFai Lau
@ 2025-04-14  0:02         ` Jordan Rife
  2025-04-14 21:54           ` Martin KaFai Lau
  0 siblings, 1 reply; 15+ messages in thread
From: Jordan Rife @ 2025-04-14  0:02 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: Aditi Ghag, Daniel Borkmann, Willem de Bruijn, Kuniyuki Iwashima,
	bpf, netdev

> static void *bpf_iter_udp_seq_next(struct seq_file *seq, void *v, loff_t *pos)
> {
>         if (iter->cur_sk < iter->end_sk) {
>                 u64 cookie;
>
>                 cookie = iter->st_bucket_done ?
>                         0 : __sock_gen_cookie(iter->batch[iter->cur_sk].sock);
>                 sock_put(iter->batch[iter->cur_sk].sock);
>                 iter->batch[iter->cur_sk++].cookie = cookie;
>         }
>
>         /* ... */
> }
>
> In bpf_iter_udp_resume(), if it cannot find the first sk from find_cookie to
> end_cookie, then it searches backward from find_cookie to 0. If nothing found,
> then it should start from the beginning of the resume_bucket. Would it work?

It seems like the intent here is to avoid repeating sockets that we've
already visited?

This would work if you need to process a bucket in two batches or less,
but it would still be possible to repeat a socket if you have to process
a bucket in more than two batches: during the transition from batch two
to batch three you don't have any context about what you saw in batch
one, so in the worst case where all the cookies we remembered from batch
two are not found, we restart from the beginning of the list where we
might revisit sockets from batch one. I guess you can say this reduces
the probability of repeats but doesn't eliminate it.

e.g.: socket A gets repeated in batch three after two consecutive calls
      to bpf_iter_udp_batch() hit the resized == true case due to heavy
      churn in the current bucket.

|               Thread 1            Thread 2   Batch State    List State
|  -------------------------------  ---------  ------------   ----------
|                                              [_]            A->B
|  bpf_iter_udp_batch()                        "              "
|    spin_lock_bh(&hslot2->lock)               "              "
|    ...                                       [A]            "
|    spin_unlock_bh(&hslot2->lock)             "              "
|                                   add C,D    "              A->B->C->D
|    bpf_iter_udp_realloc_batch(3)             "              "
|    spin_lock_bh(&hslot2->lock)               [A,_,_]        "
|    ...                                       [A,B,C]        "
|    spin_unlock_bh(&hslot2->lock)             "              "
|    resized == true                           "              "
|    return A                                  "              "
|                                   del B,C    "              A->D
|                                   add E,F,G  "              A->D->E->
t                                                             F->G
i  bpf_iter_udp_batch()                        "              "
m    spin_lock_bh(&hslot2->lock)               "              "
e    ...                                       [D,E,F]        "
|    spin_unlock_bh(&hslot2->lock)             "              "
|                                   add H,I,J  "              A->D->E->
|                                                             F->G->H->
|                                                             I->J
|    bpf_iter_udp_realloc_batch(6)             [D,E,F,_,_,_]  "
|    spin_lock_bh(&hslot2->lock)               "              "
|    ...                                       [D,E,F,G,H,I]  "
|    spin_unlock_bh(&hslot2->lock)             "              "
|    resized == true                           "              "
|    return D                                  "              "
|                                   del D,E,   "              A->J
|                                       F,G,                   
|                                       H,I,                   
|  bpf_iter_udp_batch()                        "              "
|    spin_lock_bh(&hslot2->lock)               "              "
|    ...                                       [A,J,_,_,_,_]  "
|                         !!! A IS REPEATED !!! ^
|    spin_unlock_bh(&hslot2->lock)             "              "
|    return A                                  "              "
v

There's a fundamental limitation where if we have to process a bucket in
more than two batches, we can lose context about what we've visited
before, so there's always some edge case like this. The choice is
basically:

(1) Make a best-effort attempt to avoid repeating sockets, and accept
    that repeats can still happen in rare cases. Maybe the chances are
    close enough to zero to never actually happen, although it's hard to
    say; it may be more probable in some scenarios.

or

(2) Guarantee that repeats can't happen by requiring that a bucket
    completely fits into one (or two?) batches: either error out in the
    resized == true case or prevent it altogether by holding onto the
    lock across reallocs with GFP_ATOMIC to prevent races.

All things being equal, (2) is a nice guarantee to have, but I sense
some hesitance to hold onto hslot2->lock any longer than we already are.
Is there a high performance cost I'm not seeing there? I guess there's a
higher chance of lock contention, and with GFP_ATOMIC allocation is more
likely to fail, but reallocs should be fairly rare? Maybe we could
reduce the chance of reallocs during iteration by "right-sizing" the
batch from the start, e.g. on iterator init, allocate the batch size to
be 3/2 * the maximum list length currently in the UDP table, since you
know you'll eventually need it to be that size anyway. Of course, lists
might grow after that point requiring a realloc somewhere along the way,
but it would avoid any reallocs in cases where the lengths are mostly
stable. I'm fine with (1) if that's the only viable option, but I just
wanted to make sure I'm accurately understanding the constraints here.

Thanks!

-Jordan


* Re: [PATCH v2 bpf-next 2/5] bpf: udp: Propagate ENOMEM up from bpf_iter_udp_batch
  2025-04-14  0:02         ` Jordan Rife
@ 2025-04-14 21:54           ` Martin KaFai Lau
  2025-04-14 23:59             ` Jordan Rife
  0 siblings, 1 reply; 15+ messages in thread
From: Martin KaFai Lau @ 2025-04-14 21:54 UTC (permalink / raw)
  To: Jordan Rife
  Cc: Aditi Ghag, Daniel Borkmann, Willem de Bruijn, Kuniyuki Iwashima,
	bpf, netdev

On 4/13/25 5:02 PM, Jordan Rife wrote:
>> static void *bpf_iter_udp_seq_next(struct seq_file *seq, void *v, loff_t *pos)
>> {
>>          if (iter->cur_sk < iter->end_sk) {
>>                  u64 cookie;
>>
>>                  cookie = iter->st_bucket_done ?
>>                          0 : __sock_gen_cookie(iter->batch[iter->cur_sk].sock);
>>                  sock_put(iter->batch[iter->cur_sk].sock);
>>                  iter->batch[iter->cur_sk++].cookie = cookie;
>>          }
>>
>>          /* ... */
>> }
>>
>> In bpf_iter_udp_resume(), if it cannot find the first sk from find_cookie to
>> end_cookie, then it searches backward from find_cookie to 0. If nothing found,
>> then it should start from the beginning of the resume_bucket. Would it work?
> 
> It seems like the intent here is to avoid repeating sockets that we've
> already visited?
> 
> This would work if you need to process a bucket in two batches or less,
> but it would still be possible to repeat a socket if you have to process
> a bucket in more than two batches: during the transition from batch two
> to batch three you don't have any context about what you saw in batch
> one, so in the worst case where all the cookies we remembered from batch
> two are not found, we restart from the beginning of the list where we
> might revisit sockets from batch one. I guess you can say this reduces
> the probability of repeats but doesn't eliminate it.
> 
> e.g.: socket A gets repeated in batch three after two consecutive calls
>        to bpf_iter_udp_batch() hit the resized == true case due to heavy
>        churn in the current bucket.
> 
> |               Thread 1            Thread 2   Batch State    List State
> |  -------------------------------  ---------  ------------   ----------
> |                                              [_]            A->B
> |  bpf_iter_udp_batch()                        "              "
> |    spin_lock_bh(&hslot2->lock)               "              "
> |    ...                                       [A]            "
> |    spin_unlock_bh(&hslot2->lock)             "              "
> |                                   add C,D    "              A->B->C->D
> |    bpf_iter_udp_realloc_batch(3)             "              "
> |    spin_lock_bh(&hslot2->lock)               [A,_,_]        "
> |    ...                                       [A,B,C]        "
> |    spin_unlock_bh(&hslot2->lock)             "              "
> |    resized == true                           "              "
> |    return A                                  "              "
> |                                   del B,C    "              A->D
> |                                   add E,F,G  "              A->D->E->
> t                                                             F->G
> i  bpf_iter_udp_batch()                        "              "
> m    spin_lock_bh(&hslot2->lock)               "              "
> e    ...                                       [D,E,F]        "
> |    spin_unlock_bh(&hslot2->lock)             "              "
> |                                   add H,I,J  "              A->D->E->
> |                                                             F->G->H->
> |                                                             I->J
> |    bpf_iter_udp_realloc_batch(6)             [D,E,F,_,_,_]  "
> |    spin_lock_bh(&hslot2->lock)               "              "
> |    ...                                       [D,E,F,G,H,I]  "
> |    spin_unlock_bh(&hslot2->lock)             "              "
> |    resized == true                           "              "
> |    return D                                  "              "
> |                                   del D,E,   "              A->J
> |                                       F,G,
> |                                       H,I,
> |  bpf_iter_udp_batch()                        "              "
> |    spin_lock_bh(&hslot2->lock)               "              "
> |    ...                                       [A,J,_,_,_,_]  "
> |                         !!! A IS REPEATED !!! ^
> |    spin_unlock_bh(&hslot2->lock)             "              "
> |    return A                                  "              "
> v

Agreed that needing more than one batch for the same bucket can lead to a
repeated-sk situation, like the example you demonstrated above.

> 
> There's a fundamental limitation where if we have to process a bucket in
> more than two batches, we can lose context about what we've visited
> before, so there's always some edge case like this. The choice is
> basically:
> 
> (1) Make a best-effort attempt to avoid repeating sockets, and accept
>      that repeats can still happen in rare cases. Maybe the chances are
>      close enough to zero to never actually happen, although it's hard to
>      say; it may be more probable in some scenarios.
> 
> or
> 
> (2) Guarantee that repeats can't happen by requiring that a bucket
>      completely fits into one (or two?) batches: either error out in the
>      resized == true case or prevent it altogether by holding onto the
>      lock across reallocs with GFP_ATOMIC to prevent races.
> 
> All things being equal, (2) is a nice guarantee to have, but I sense
> some hesitance to hold onto hslot2->lock any longer than we already are.
> Is there a high performance cost I'm not seeing there? I guess there's a
> higher chance of lock contention, and with GFP_ATOMIC allocation is more
> likely to fail, but reallocs should be fairly rare? Maybe we could
> reduce the chance of reallocs during iteration by "right-sizing" the
> batch from the start, e.g. on iterator init, allocate the batch size to
> be 3/2 * the maximum list length currently in the UDP table, since you
> know you'll eventually need it to be that size anyway. Of course, lists
> might grow after that point requiring a realloc somewhere along the way,
> but it would avoid any reallocs in cases where the lengths are mostly
> stable. I'm fine with (1) if that's the only viable option, but I just
> wanted to make sure I'm accurately understanding the constraints here.

I am concerned about introducing a higher (although unlikely) chance of
unnecessary failure for the current use cases that do not care whether an sk
is repeated or not. For example, the bpf prog has already checked the sk
conditions (address/port/tcp-cc...etc) before doing setsockopt or
bpf_sock_destroy.

I may have over-thought this. It is OK to bite the bullet on GFP_ATOMIC, but I
would be more comfortable if it could retry the "resized == true" case a few
times with GFP_USER first before finally resorting to GFP_ATOMIC. Or maybe the
other way around: GFP_ATOMIC first, falling back to GFP_USER. Thoughts?

As for tracking the maximum list length, I am not sure how much it will help
considering it may still change, so it still needs to handle the
resize+realloc situation regardless.


* Re: [PATCH v2 bpf-next 2/5] bpf: udp: Propagate ENOMEM up from bpf_iter_udp_batch
  2025-04-14 21:54           ` Martin KaFai Lau
@ 2025-04-14 23:59             ` Jordan Rife
  0 siblings, 0 replies; 15+ messages in thread
From: Jordan Rife @ 2025-04-14 23:59 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: Aditi Ghag, Daniel Borkmann, Willem de Bruijn, Kuniyuki Iwashima,
	bpf, netdev

> I am concerned about introducing a higher (although unlikely) chance of
> unnecessary failure for the current use cases that do not care whether an
> sk is repeated or not. For example, the bpf prog has already checked the sk
> conditions (address/port/tcp-cc...etc) before doing setsockopt or
> bpf_sock_destroy.
> 
> I may have over-thought this. It is OK to bite the bullet on GFP_ATOMIC,
> but I would be more comfortable if it could retry the "resized == true"
> case a few times with GFP_USER first before finally resorting to
> GFP_ATOMIC. Or maybe the other way around: GFP_ATOMIC first, falling back
> to GFP_USER. Thoughts?

Sure, this sounds like a good balance. I'm leaning towards falling back
to GFP_ATOMIC if trying GFP_USER first hits the resized == true case,
since then most of the time you wouldn't have to hold onto the spin lock
any longer than we already are. Maybe try with GFP_USER two times before
falling back? I can add a new patch to the next version of this series
with a PoC to review.
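
Roughly what I have in mind, very rough and not compiled; the gfp_t argument
to bpf_iter_udp_realloc_batch() and the attempt counter are assumptions for
the sketch:

	/* After filling the batch and dropping hslot2->lock: */
	if (iter->end_sk == batch_sks)
		return iter->end_sk ? iter->batch[0].sock : NULL;

	/* Partial snapshot: the bucket grew while the lock was dropped.
	 * Retry with a bigger batch, using GFP_USER for the first couple of
	 * attempts and GFP_ATOMIC as the last resort (the GFP_ATOMIC attempt
	 * would be done without dropping the lock, which isn't shown here).
	 */
	gfp = attempts++ < 2 ? GFP_USER : GFP_ATOMIC;
	err = bpf_iter_udp_realloc_batch(iter, batch_sks * 3 / 2, gfp);
	if (err)
		return ERR_PTR(err);
	goto again;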

> 
> As for tracking the maximum list length, I am not sure how much it will
> help considering it may still change, so it still needs to handle the
> resize+realloc situation regardless.

Yeah, thinking about this more today, it's not very helpful. Also,
tracking the current longest list length gets a bit messy.

-Jordan


* Re: [PATCH v2 bpf-next 5/5] selftests/bpf: Add tests for bucket resume logic in UDP socket iterators
  2025-04-11 22:32   ` Martin KaFai Lau
@ 2025-04-15  0:04     ` Jordan Rife
  0 siblings, 0 replies; 15+ messages in thread
From: Jordan Rife @ 2025-04-15  0:04 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: bpf, netdev, Aditi Ghag, Daniel Borkmann, Willem de Bruijn,
	Kuniyuki Iwashima

> > +static int read_n(int iter_fd, int n, struct sock_count counts[],
> > +		  int counts_len)
> > +{
> > +	struct iter_out out;
> > +	int nread = 1;
> > +	int i = 0;
> > +
> > +	for (; nread > 0 && (n < 0 || i < n); i++) {
> > +		nread = read(iter_fd, &out, sizeof(out));
> > +		if (!nread || !ASSERT_GE(nread, 1, "nread"))
> 
> why check nread >= 1 instead of nread == sizeof(out)?

Will adjust this check to be more precise and apply the other changes
you mentioned.
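
Probably something like this, untested, treating a zero-length read as the
normal end of iteration and anything short of a full struct iter_out as a
failure:

	nread = read(iter_fd, &out, sizeof(out));
	if (!nread || !ASSERT_EQ(nread, sizeof(out), "nread"))
		break;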

Thanks!

-Jordan


