[RFC PATCH bpf-next 0/3] Exactly-once UDP socket iteration

netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [RFC PATCH bpf-next 0/3] Exactly-once UDP socket iteration
@ 2025-04-04 22:02 Jordan Rife
  2025-04-04 22:02 ` [RFC PATCH bpf-next 1/3] bpf: udp: Use bpf_udp_iter_batch_item for bpf_udp_iter_state batch items Jordan Rife
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Jordan Rife @ 2025-04-04 22:02 UTC (permalink / raw)
  To: netdev, bpf
  Cc: Jordan Rife, Aditi Ghag, Daniel Borkmann, Martin KaFai Lau,
	Willem de Bruijn

Both UDP and TCP socket iterators use iter->offset to track progress
through a bucket, which is a measure of the number of matching sockets
from the current bucket that have been seen or processed by the
iterator. On subsequent iterations, if the current bucket has
unprocessed items, we skip at least iter->offset matching items in the
bucket before adding any remaining items to the next batch. However,
iter->offset isn't always an accurate measure of "things already seen"
when the underlying bucket changes between reads which can lead to
repeated or skipped sockets.

In my original RFC, [1], I proposed a solution that added a new index
field to struct sock_common, but general feedback is that we should
avoid this. After some discussion, Martin suggested using socket cookies
to keep track of what we haven't seen yet in the current bucket. This
series is a follow up from that discussion and implements a PoC of this
approach.

This series replaces struct sock **batch inside struct
bpf_udp_iter_state with union bpf_udp_iter_batch_item *batch, where
union bpf_udp_iter_batch_item can contain either a pointer to a socket
or a socket cookie. During reads, batch contains pointers to all sockets
in the current batch while between reads batch contains all the cookies
of the sockets in the current bucket that have yet to be processed. On
subsequent reads, when iteration resumes, bpf_iter_udp_batch finds the
first saved cookie that matches a socket in the bucket's socket list and
picks up from there to construct the next batch. On average, assuming
it's rare that the next socket disappears before the next read occurs,
we should only need to scan as much as we did with the offset-based
approach to find the starting point. In the case that the next socket
is no longer there, we keep scanning through the saved cookies list
until we find a match. The worst case is when none of the sockets from
last time exist anymore, but again, this should be rare.

[1]: https://lore.kernel.org/bpf/20250313233615.2329869-1-jrife@google.com/

Jordan Rife (3):
  bpf: udp: Use bpf_udp_iter_batch_item for bpf_udp_iter_state batch
    items
  bpf: udp: Avoid socket skips and repeats during iteration
  selftests/bpf: Add tests for bucket resume logic in UDP socket
    iterators

 include/linux/udp.h                           |   3 +
 net/ipv4/udp.c                                |  94 +++-
 .../bpf/prog_tests/sock_iter_batch.c          | 452 +++++++++++++++++-
 .../selftests/bpf/progs/bpf_tracing_net.h     |   1 +
 .../selftests/bpf/progs/sock_iter_batch.c     |  24 +-
 5 files changed, 533 insertions(+), 41 deletions(-)

-- 
2.43.0

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [RFC PATCH bpf-next 1/3] bpf: udp: Use bpf_udp_iter_batch_item for bpf_udp_iter_state batch items
  2025-04-04 22:02 [RFC PATCH bpf-next 0/3] Exactly-once UDP socket iteration Jordan Rife
@ 2025-04-04 22:02 ` Jordan Rife
  2025-04-04 22:02 ` [RFC PATCH bpf-next 2/3] bpf: udp: Avoid socket skips and repeats during iteration Jordan Rife
  2025-04-04 22:02 ` [RFC PATCH bpf-next 3/3] selftests/bpf: Add tests for bucket resume logic in UDP socket iterators Jordan Rife
  2 siblings, 0 replies; 12+ messages in thread
From: Jordan Rife @ 2025-04-04 22:02 UTC (permalink / raw)
  To: netdev, bpf
  Cc: Jordan Rife, Aditi Ghag, Daniel Borkmann, Martin KaFai Lau,
	Willem de Bruijn

Prepare for the next commit that tracks cookies between iterations by
converting struct sock **batch to union bpf_udp_iter_batch_item *batch
inside struct bpf_udp_iter_state.

Signed-off-by: Jordan Rife <jordan@jrife.io>
---
 net/ipv4/udp.c | 18 +++++++++++-------
 1 file changed, 11 insertions(+), 7 deletions(-)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index d0bffcfa56d8..59c3281962b9 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -3384,13 +3384,17 @@ struct bpf_iter__udp {
 	int bucket __aligned(8);
 };
 
+union bpf_udp_iter_batch_item {
+	struct sock *sock;
+};
+
 struct bpf_udp_iter_state {
 	struct udp_iter_state state;
 	unsigned int cur_sk;
 	unsigned int end_sk;
 	unsigned int max_sk;
 	int offset;
-	struct sock **batch;
+	union bpf_udp_iter_batch_item *batch;
 	bool st_bucket_done;
 };
 
@@ -3449,7 +3453,7 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
 				}
 				if (iter->end_sk < iter->max_sk) {
 					sock_hold(sk);
-					iter->batch[iter->end_sk++] = sk;
+					iter->batch[iter->end_sk++].sock = sk;
 				}
 				batch_sks++;
 			}
@@ -3479,7 +3483,7 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
 		goto again;
 	}
 done:
-	return iter->batch[0];
+	return iter->batch[0].sock;
 }
 
 static void *bpf_iter_udp_seq_next(struct seq_file *seq, void *v, loff_t *pos)
@@ -3491,7 +3495,7 @@ static void *bpf_iter_udp_seq_next(struct seq_file *seq, void *v, loff_t *pos)
 	 * done with seq_show(), so unref the iter->cur_sk.
 	 */
 	if (iter->cur_sk < iter->end_sk) {
-		sock_put(iter->batch[iter->cur_sk++]);
+		sock_put(iter->batch[iter->cur_sk++].sock);
 		++iter->offset;
 	}
 
@@ -3499,7 +3503,7 @@ static void *bpf_iter_udp_seq_next(struct seq_file *seq, void *v, loff_t *pos)
 	 * available in the current bucket batch.
 	 */
 	if (iter->cur_sk < iter->end_sk)
-		sk = iter->batch[iter->cur_sk];
+		sk = iter->batch[iter->cur_sk].sock;
 	else
 		/* Prepare a new batch. */
 		sk = bpf_iter_udp_batch(seq);
@@ -3564,7 +3568,7 @@ static int bpf_iter_udp_seq_show(struct seq_file *seq, void *v)
 static void bpf_iter_udp_put_batch(struct bpf_udp_iter_state *iter)
 {
 	while (iter->cur_sk < iter->end_sk)
-		sock_put(iter->batch[iter->cur_sk++]);
+		sock_put(iter->batch[iter->cur_sk++].sock);
 }
 
 static void bpf_iter_udp_seq_stop(struct seq_file *seq, void *v)
@@ -3827,7 +3831,7 @@ DEFINE_BPF_ITER_FUNC(udp, struct bpf_iter_meta *meta,
 static int bpf_iter_udp_realloc_batch(struct bpf_udp_iter_state *iter,
 				      unsigned int new_batch_sz)
 {
-	struct sock **new_batch;
+	union bpf_udp_iter_batch_item *new_batch;
 
 	new_batch = kvmalloc_array(new_batch_sz, sizeof(*new_batch),
 				   GFP_USER | __GFP_NOWARN);
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [RFC PATCH bpf-next 2/3] bpf: udp: Avoid socket skips and repeats during iteration
  2025-04-04 22:02 [RFC PATCH bpf-next 0/3] Exactly-once UDP socket iteration Jordan Rife
  2025-04-04 22:02 ` [RFC PATCH bpf-next 1/3] bpf: udp: Use bpf_udp_iter_batch_item for bpf_udp_iter_state batch items Jordan Rife
@ 2025-04-04 22:02 ` Jordan Rife
  2025-04-04 23:20   ` Kuniyuki Iwashima
  2025-04-07 21:56   ` Martin KaFai Lau
  2025-04-04 22:02 ` [RFC PATCH bpf-next 3/3] selftests/bpf: Add tests for bucket resume logic in UDP socket iterators Jordan Rife
  2 siblings, 2 replies; 12+ messages in thread
From: Jordan Rife @ 2025-04-04 22:02 UTC (permalink / raw)
  To: netdev, bpf
  Cc: Jordan Rife, Aditi Ghag, Daniel Borkmann, Martin KaFai Lau,
	Willem de Bruijn

Replace the offset-based approach for tracking progress through a bucket
in the UDP table with one based on socket cookies. Remember the cookies
of unprocessed sockets from the last batch and use this list to
pick up where we left off or, in the case that the next socket
disappears between reads, find the first socket after that point that
still exists in the bucket and resume from there.

In order to make the control flow a bit easier to follow inside
bpf_iter_udp_batch, introduce the udp_portaddr_for_each_entry_from macro
and use this to split bucket processing into two stages: finding the
starting point and adding items to the next batch. Originally, I
implemented this patch inside a single udp_portaddr_for_each_entry loop,
as it was before, but I found the resulting logic a bit messy. Overall,
this version seems more readable.

Signed-off-by: Jordan Rife <jordan@jrife.io>
---
 include/linux/udp.h |  3 ++
 net/ipv4/udp.c      | 78 ++++++++++++++++++++++++++++++++++-----------
 2 files changed, 63 insertions(+), 18 deletions(-)

diff --git a/include/linux/udp.h b/include/linux/udp.h
index 0807e21cfec9..a69da9c4c1c5 100644
--- a/include/linux/udp.h
+++ b/include/linux/udp.h
@@ -209,6 +209,9 @@ static inline void udp_allow_gso(struct sock *sk)
 #define udp_portaddr_for_each_entry(__sk, list) \
 	hlist_for_each_entry(__sk, list, __sk_common.skc_portaddr_node)
 
+#define udp_portaddr_for_each_entry_from(__sk) \
+	hlist_for_each_entry_from(__sk, __sk_common.skc_portaddr_node)
+
 #define udp_portaddr_for_each_entry_rcu(__sk, list) \
 	hlist_for_each_entry_rcu(__sk, list, __sk_common.skc_portaddr_node)
 
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 59c3281962b9..00cec269c149 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -93,6 +93,7 @@
 #include <linux/inet.h>
 #include <linux/netdevice.h>
 #include <linux/slab.h>
+#include <linux/sock_diag.h>
 #include <net/tcp_states.h>
 #include <linux/skbuff.h>
 #include <linux/proc_fs.h>
@@ -3386,6 +3387,7 @@ struct bpf_iter__udp {
 
 union bpf_udp_iter_batch_item {
 	struct sock *sock;
+	__u64 cookie;
 };
 
 struct bpf_udp_iter_state {
@@ -3393,26 +3395,42 @@ struct bpf_udp_iter_state {
 	unsigned int cur_sk;
 	unsigned int end_sk;
 	unsigned int max_sk;
-	int offset;
 	union bpf_udp_iter_batch_item *batch;
 	bool st_bucket_done;
 };
 
 static int bpf_iter_udp_realloc_batch(struct bpf_udp_iter_state *iter,
 				      unsigned int new_batch_sz);
+static struct sock *bpf_iter_udp_resume(struct sock *first_sk,
+					union bpf_udp_iter_batch_item *cookies,
+					int n_cookies)
+{
+	struct sock *sk = NULL;
+	int i = 0;
+
+	for (; i < n_cookies; i++) {
+		sk = first_sk;
+		udp_portaddr_for_each_entry_from(sk)
+			if (cookies[i].cookie == atomic64_read(&sk->sk_cookie))
+				goto done;
+	}
+done:
+	return sk;
+}
+
 static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
 {
 	struct bpf_udp_iter_state *iter = seq->private;
 	struct udp_iter_state *state = &iter->state;
+	unsigned int find_cookie, end_cookie = 0;
 	struct net *net = seq_file_net(seq);
-	int resume_bucket, resume_offset;
 	struct udp_table *udptable;
 	unsigned int batch_sks = 0;
 	bool resized = false;
+	int resume_bucket;
 	struct sock *sk;
 
 	resume_bucket = state->bucket;
-	resume_offset = iter->offset;
 
 	/* The current batch is done, so advance the bucket. */
 	if (iter->st_bucket_done)
@@ -3428,6 +3446,8 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
 	 * before releasing the bucket lock. This allows BPF programs that are
 	 * called in seq_show to acquire the bucket lock if needed.
 	 */
+	find_cookie = iter->cur_sk;
+	end_cookie = iter->end_sk;
 	iter->cur_sk = 0;
 	iter->end_sk = 0;
 	iter->st_bucket_done = false;
@@ -3439,18 +3459,26 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
 		if (hlist_empty(&hslot2->head))
 			continue;
 
-		iter->offset = 0;
 		spin_lock_bh(&hslot2->lock);
-		udp_portaddr_for_each_entry(sk, &hslot2->head) {
+		/* Initialize sk to the first socket in hslot2. */
+		udp_portaddr_for_each_entry(sk, &hslot2->head)
+			break;
+		/* Resume from the first (in iteration order) unseen socket from
+		 * the last batch that still exists in resume_bucket. Most of
+		 * the time this will just be where the last iteration left off
+		 * in resume_bucket unless that socket disappeared between
+		 * reads.
+		 *
+		 * Skip this if end_cookie isn't set; this is the first
+		 * batch, we're on bucket zero, and we want to start from the
+		 * beginning.
+		 */
+		if (state->bucket == resume_bucket && end_cookie)
+			sk = bpf_iter_udp_resume(sk,
+						 &iter->batch[find_cookie],
+						 end_cookie - find_cookie);
+		udp_portaddr_for_each_entry_from(sk) {
 			if (seq_sk_match(seq, sk)) {
-				/* Resume from the last iterated socket at the
-				 * offset in the bucket before iterator was stopped.
-				 */
-				if (state->bucket == resume_bucket &&
-				    iter->offset < resume_offset) {
-					++iter->offset;
-					continue;
-				}
 				if (iter->end_sk < iter->max_sk) {
 					sock_hold(sk);
 					iter->batch[iter->end_sk++].sock = sk;
@@ -3494,10 +3522,8 @@ static void *bpf_iter_udp_seq_next(struct seq_file *seq, void *v, loff_t *pos)
 	/* Whenever seq_next() is called, the iter->cur_sk is
 	 * done with seq_show(), so unref the iter->cur_sk.
 	 */
-	if (iter->cur_sk < iter->end_sk) {
+	if (iter->cur_sk < iter->end_sk)
 		sock_put(iter->batch[iter->cur_sk++].sock);
-		++iter->offset;
-	}
 
 	/* After updating iter->cur_sk, check if there are more sockets
 	 * available in the current bucket batch.
@@ -3567,8 +3593,19 @@ static int bpf_iter_udp_seq_show(struct seq_file *seq, void *v)
 
 static void bpf_iter_udp_put_batch(struct bpf_udp_iter_state *iter)
 {
-	while (iter->cur_sk < iter->end_sk)
-		sock_put(iter->batch[iter->cur_sk++].sock);
+	union bpf_udp_iter_batch_item *item;
+	unsigned int cur_sk = iter->cur_sk;
+	__u64 cookie;
+
+	/* Remember the cookies of the sockets we haven't seen yet, so we can
+	 * pick up where we left off next time around.
+	 */
+	while (cur_sk < iter->end_sk) {
+		item = &iter->batch[cur_sk++];
+		cookie = __sock_gen_cookie(item->sock);
+		sock_put(item->sock);
+		item->cookie = cookie;
+	}
 }
 
 static void bpf_iter_udp_seq_stop(struct seq_file *seq, void *v)
@@ -3839,6 +3876,11 @@ static int bpf_iter_udp_realloc_batch(struct bpf_udp_iter_state *iter,
 		return -ENOMEM;
 
 	bpf_iter_udp_put_batch(iter);
+	WARN_ON_ONCE(new_batch_sz < iter->max_sk);
+	/* Make sure the new batch has the cookies of the sockets we haven't
+	 * visited yet.
+	 */
+	memcpy(new_batch, iter->batch, iter->end_sk);
 	kvfree(iter->batch);
 	iter->batch = new_batch;
 	iter->max_sk = new_batch_sz;
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [RFC PATCH bpf-next 3/3] selftests/bpf: Add tests for bucket resume logic in UDP socket iterators
  2025-04-04 22:02 [RFC PATCH bpf-next 0/3] Exactly-once UDP socket iteration Jordan Rife
  2025-04-04 22:02 ` [RFC PATCH bpf-next 1/3] bpf: udp: Use bpf_udp_iter_batch_item for bpf_udp_iter_state batch items Jordan Rife
  2025-04-04 22:02 ` [RFC PATCH bpf-next 2/3] bpf: udp: Avoid socket skips and repeats during iteration Jordan Rife
@ 2025-04-04 22:02 ` Jordan Rife
  2 siblings, 0 replies; 12+ messages in thread
From: Jordan Rife @ 2025-04-04 22:02 UTC (permalink / raw)
  To: netdev, bpf
  Cc: Jordan Rife, Aditi Ghag, Daniel Borkmann, Martin KaFai Lau,
	Willem de Bruijn

First, extend the iter_udp_soreuse and iter_tcp_soreuse to write the
cookie of the current socket so that we can track the identity of the
sockets that the iterator has seen so far in the test code. Update the
existing do_test function to account for this change to the iterator
program output.

Next, introduce a set of tests that exercise various bucket resume
scenarios:

* remove_seen resumes iteration after removing a socket from the bucket
  that we've already processed. Before, with the offset-based approach,
  this test would have skipped an unseen socket after resuming
  iteration. With the cookie-based approach, we now see all sockets
  exactly once.
* remove_unseen exercises the condition where the next socket that we
  would have seen is removed from the bucket before we resume iteration.
  This tests the scenario where we need to scan past the first cookie in
  our remembered cookies list to find the socket from which to resume
  iteration.
* remove_all exercises the condition where all sockets we remembered
  were removed from the bucket to make sure iteration terminates and
  returns no more results.
* add_some exercises the condition where a few, but not enough to
  trigger a realloc, sockets are added to the head of the current bucket
  between reads. Before, with the offset-based approach, this test would
  have repeated sockets we've already seen. With the cookie-based
  approach, we now see all sockets exactly once.
* force_realloc exercises the condition that we need to realloc the
  batch on a subsequent read, since more sockets than can be held in the
  current batch array were added to the current bucket. This exercies
  the logic inside bpf_iter_udp_realloc_batch that copies cookies into
  the new batch to make sure nothing is skipped or repeated.

Signed-off-by: Jordan Rife <jordan@jrife.io>
---
 .../bpf/prog_tests/sock_iter_batch.c          | 452 +++++++++++++++++-
 .../selftests/bpf/progs/bpf_tracing_net.h     |   1 +
 .../selftests/bpf/progs/sock_iter_batch.c     |  24 +-
 3 files changed, 460 insertions(+), 17 deletions(-)

diff --git a/tools/testing/selftests/bpf/prog_tests/sock_iter_batch.c b/tools/testing/selftests/bpf/prog_tests/sock_iter_batch.c
index d56e18b25528..dc44115bd078 100644
--- a/tools/testing/selftests/bpf/prog_tests/sock_iter_batch.c
+++ b/tools/testing/selftests/bpf/prog_tests/sock_iter_batch.c
@@ -1,20 +1,444 @@
 // SPDX-License-Identifier: GPL-2.0
 // Copyright (c) 2024 Meta
 
+#include "linux/bpf.h"
 #include <test_progs.h>
 #include "network_helpers.h"
 #include "sock_iter_batch.skel.h"
 
 #define TEST_NS "sock_iter_batch_netns"
 
+static const int init_batch_size = 16;
 static const int nr_soreuse = 4;
 
+struct iter_out {
+	int idx;
+	__u64 cookie;
+} __packed;
+
+struct sock_count {
+	__u64 cookie;
+	int count;
+};
+
+static int insert(__u64 cookie, struct sock_count counts[], int counts_len)
+{
+	int insert = -1;
+	int i = 0;
+
+	for (; i < counts_len; i++) {
+		if (!counts[i].cookie) {
+			insert = i;
+		} else if (counts[i].cookie == cookie) {
+			insert = i;
+			break;
+		}
+	}
+	if (insert < 0)
+		return insert;
+
+	counts[insert].cookie = cookie;
+	counts[insert].count++;
+
+	return counts[insert].count;
+}
+
+static int read_n(int iter_fd, int n, struct sock_count counts[],
+		  int counts_len)
+{
+	struct iter_out out;
+	int nread = 1;
+	int i = 0;
+
+	for (; nread > 0 && (n < 0 || i < n); i++) {
+		nread = read(iter_fd, &out, sizeof(out));
+		if (!nread || !ASSERT_GE(nread, 1, "nread"))
+			break;
+		ASSERT_GE(insert(out.cookie, counts, counts_len), 0, "insert");
+	}
+
+	ASSERT_TRUE(n < 0 || i == n, "n < 0 || i == n");
+
+	return i;
+}
+
+static __u64 socket_cookie(int fd)
+{
+	__u64 cookie;
+	socklen_t cookie_len = sizeof(cookie);
+	static __u32 duration;	/* for CHECK macro */
+
+	if (CHECK(getsockopt(fd, SOL_SOCKET, SO_COOKIE, &cookie, &cookie_len) < 0,
+		  "getsockopt(SO_COOKIE)", "%s\n", strerror(errno)))
+		return 0;
+	return cookie;
+}
+
+static bool was_seen(int fd, struct sock_count counts[], int counts_len)
+{
+	__u64 cookie = socket_cookie(fd);
+	int i = 0;
+
+	for (; cookie && i < counts_len; i++)
+		if (cookie == counts[i].cookie)
+			return true;
+
+	return false;
+}
+
+static int get_seen_socket(int *fds, struct sock_count counts[], int n)
+{
+	int i = 0;
+
+	for (; i < n; i++)
+		if (was_seen(fds[i], counts, n))
+			return i;
+	return -1;
+}
+
+static int get_nth_socket(int *fds, int fds_len, struct bpf_link *link, int n)
+{
+	int i, nread, iter_fd;
+	int nth_sock_idx = -1;
+	struct iter_out out;
+
+	iter_fd = bpf_iter_create(bpf_link__fd(link));
+	if (!ASSERT_GE(iter_fd, 0, "bpf_iter_create"))
+		return -1;
+
+	for (; n >= 0; n--) {
+		nread = read(iter_fd, &out, sizeof(out));
+		if (!nread || !ASSERT_GE(nread, 1, "nread"))
+			goto done;
+	}
+
+	for (i = 0; i < fds_len && nth_sock_idx < 0; i++)
+		if (fds[i] >= 0 && socket_cookie(fds[i]) == out.cookie)
+			nth_sock_idx = i;
+done:
+	if (iter_fd < 0)
+		close(iter_fd);
+	return nth_sock_idx;
+}
+
+static int get_seen_count(int fd, struct sock_count counts[], int n)
+{
+	__u64 cookie = socket_cookie(fd);
+	int count = 0;
+	int i = 0;
+
+	for (; cookie && !count && i < n; i++)
+		if (cookie == counts[i].cookie)
+			count = counts[i].count;
+
+	return count;
+}
+
+static void check_n_were_seen_once(int *fds, int fds_len, int n,
+				   struct sock_count counts[], int counts_len)
+{
+	int seen_once = 0;
+	int seen_cnt;
+	int i = 0;
+
+	for (; i < fds_len; i++) {
+		/* Skip any sockets that were closed or that weren't seen
+		 * exactly once.
+		 */
+		if (fds[i] < 0)
+			continue;
+		seen_cnt = get_seen_count(fds[i], counts, counts_len);
+		if (seen_cnt && ASSERT_EQ(seen_cnt, 1, "seen_cnt"))
+			seen_once++;
+	}
+
+	ASSERT_EQ(seen_once, n, "seen_once");
+}
+
+static void remove_seen(int family, int sock_type, const char *addr, __u16 port,
+			int *socks, int socks_len, struct sock_count *counts,
+			int counts_len, struct bpf_link *link, int iter_fd)
+{
+	int close_idx;
+
+	/* Iterate through the first socks_len - 1 sockets. */
+	read_n(iter_fd, socks_len - 1, counts, counts_len);
+
+	/* Make sure we saw socks_len - 1 sockets exactly once. */
+	check_n_were_seen_once(socks, socks_len, socks_len - 1, counts,
+			       counts_len);
+
+	/* Close a socket we've already seen to remove it from the bucket. */
+	close_idx = get_seen_socket(socks, counts, counts_len);
+	if (!ASSERT_GE(close_idx, 0, "close_idx"))
+		return;
+	close(socks[close_idx]);
+	socks[close_idx] = -1;
+
+	/* Iterate through the rest of the sockets. */
+	read_n(iter_fd, -1, counts, counts_len);
+
+	/* Make sure the last socket wasn't skipped and that there were no
+	 * repeats.
+	 */
+	check_n_were_seen_once(socks, socks_len, socks_len - 1, counts,
+			       counts_len);
+}
+
+static void remove_unseen(int family, int sock_type, const char *addr,
+			  __u16 port, int *socks, int socks_len,
+			  struct sock_count *counts, int counts_len,
+			  struct bpf_link *link, int iter_fd)
+{
+	int close_idx;
+
+	/* Iterate through the first socket. */
+	read_n(iter_fd, 1, counts, counts_len);
+
+	/* Make sure we saw a socket from fds. */
+	check_n_were_seen_once(socks, socks_len, 1, counts, counts_len);
+
+	/* Close what would be the next socket in the bucket to exercise the
+	 * condition where we need to skip past the first cookie we remembered.
+	 */
+	close_idx = get_nth_socket(socks, socks_len, link, 1);
+	if (!ASSERT_GE(close_idx, 0, "close_idx"))
+		return;
+	close(socks[close_idx]);
+	socks[close_idx] = -1;
+
+	/* Iterate through the rest of the sockets. */
+	read_n(iter_fd, -1, counts, counts_len);
+
+	/* Make sure the remaining sockets were seen exactly once and that we
+	 * didn't repeat the socket that was already seen.
+	 */
+	check_n_were_seen_once(socks, socks_len, socks_len - 1, counts,
+			       counts_len);
+}
+
+static void remove_all(int family, int sock_type, const char *addr,
+		       __u16 port, int *socks, int socks_len,
+		       struct sock_count *counts, int counts_len,
+		       struct bpf_link *link, int iter_fd)
+{
+	int close_idx, i;
+
+	/* Iterate through the first socket. */
+	read_n(iter_fd, 1, counts, counts_len);
+
+	/* Make sure we saw a socket from fds. */
+	check_n_were_seen_once(socks, socks_len, 1, counts, counts_len);
+
+	/* Close all remaining sockets to exhaust the list of saved cookies and
+	 * exit without putting any sockets into the batch on the next read.
+	 */
+	for (i = 0; i < socks_len - 1; i++) {
+		close_idx = get_nth_socket(socks, socks_len, link, 1);
+		if (!ASSERT_GE(close_idx, 0, "close_idx"))
+			return;
+		close(socks[close_idx]);
+		socks[close_idx] = -1;
+	}
+
+	/* Make sure there are no more sockets returned */
+	ASSERT_EQ(read_n(iter_fd, -1, counts, counts_len), 0, "read_n");
+}
+
+static void add_some(int family, int sock_type, const char *addr, __u16 port,
+		     int *socks, int socks_len, struct sock_count *counts,
+		     int counts_len, struct bpf_link *link, int iter_fd)
+{
+	int *new_socks = NULL;
+
+	/* Iterate through the first socks_len - 1 sockets. */
+	read_n(iter_fd, socks_len - 1, counts, counts_len);
+
+	/* Make sure we saw socks_len - 1 sockets exactly once. */
+	check_n_were_seen_once(socks, socks_len, socks_len - 1, counts,
+			       counts_len);
+
+	/* Double the number of sockets in the bucket. */
+	new_socks = start_reuseport_server(family, sock_type, addr, port, 0,
+					   socks_len);
+	if (!ASSERT_OK_PTR(new_socks, "start_reuseport_server"))
+		goto done;
+
+	/* Iterate through the rest of the sockets. */
+	read_n(iter_fd, -1, counts, counts_len);
+
+	/* Make sure each of the original sockets was seen exactly once. */
+	check_n_were_seen_once(socks, socks_len, socks_len, counts,
+			       counts_len);
+done:
+	if (new_socks)
+		free_fds(new_socks, socks_len);
+}
+
+static void force_realloc(int family, int sock_type, const char *addr,
+			  __u16 port, int *socks, int socks_len,
+			  struct sock_count *counts, int counts_len,
+			  struct bpf_link *link, int iter_fd)
+{
+	int *new_socks = NULL;
+
+	/* Iterate through the first socket just to initialize the batch. */
+	read_n(iter_fd, 1, counts, counts_len);
+
+	/* Double the number of sockets in the bucket to force a realloc on the
+	 * next read.
+	 */
+	new_socks = start_reuseport_server(family, sock_type, addr, port, 0,
+					   socks_len);
+	if (!ASSERT_OK_PTR(new_socks, "start_reuseport_server"))
+		goto done;
+
+	/* Iterate through the rest of the sockets. */
+	read_n(iter_fd, -1, counts, counts_len);
+
+	/* Make sure each socket from the first set was seen exactly once. */
+	check_n_were_seen_once(socks, socks_len, socks_len, counts,
+			       counts_len);
+done:
+	if (new_socks)
+		free_fds(new_socks, socks_len);
+}
+
+struct test_case {
+	void (*test)(int family, int sock_type, const char *addr, __u16 port,
+		     int *socks, int socks_len, struct sock_count *counts,
+		     int counts_len, struct bpf_link *link, int iter_fd);
+	const char *description;
+	int init_socks;
+	int max_socks;
+	int sock_type;
+	int family;
+};
+
+static struct test_case resume_tests[] = {
+	{
+		.description = "udp: resume after removing a seen socket",
+		.init_socks = nr_soreuse,
+		.max_socks = nr_soreuse,
+		.sock_type = SOCK_DGRAM,
+		.family = AF_INET6,
+		.test = remove_seen,
+	},
+	{
+		.description = "udp: resume after removing one unseen socket",
+		.init_socks = nr_soreuse,
+		.max_socks = nr_soreuse,
+		.sock_type = SOCK_DGRAM,
+		.family = AF_INET6,
+		.test = remove_unseen,
+	},
+	{
+		.description = "udp: resume after removing all unseen sockets",
+		.init_socks = nr_soreuse,
+		.max_socks = nr_soreuse,
+		.sock_type = SOCK_DGRAM,
+		.family = AF_INET6,
+		.test = remove_all,
+	},
+	{
+		.description = "udp: resume after adding a few sockets",
+		.init_socks = nr_soreuse,
+		.max_socks = nr_soreuse,
+		.sock_type = SOCK_DGRAM,
+		/* Use AF_INET so that new sockets are added to the head of the
+		 * bucket's list.
+		 */
+		.family = AF_INET,
+		.test = add_some,
+	},
+	{
+		.description = "udp: force a realloc to occur",
+		.init_socks = init_batch_size,
+		.max_socks = init_batch_size * 2,
+		.sock_type = SOCK_DGRAM,
+		/* Use AF_INET6 so that new sockets are added to the tail of the
+		 * bucket's list, needing to be added to the next batch to force
+		 * a realloc.
+		 */
+		.family = AF_INET6,
+		.test = force_realloc,
+	},
+};
+
+static void do_resume_test(struct test_case *tc)
+{
+	static const __u16 port = 10001;
+	struct bpf_link *link = NULL;
+	struct sock_iter_batch *skel;
+	struct sock_count *counts;
+	int err, iter_fd = -1;
+	const char *addr;
+	int *fds;
+
+	counts = calloc(tc->max_socks, sizeof(*counts));
+	if (!counts)
+		return;
+	skel = sock_iter_batch__open();
+	if (!ASSERT_OK_PTR(skel, "sock_iter_batch__open"))
+		return;
+
+	/* Prepare a bucket of sockets in the kernel hashtable */
+	int local_port;
+
+	addr = tc->family == AF_INET6 ? "::1" : "127.0.0.1";
+	fds = start_reuseport_server(tc->family, tc->sock_type, addr, port, 0,
+				     tc->init_socks);
+	if (!ASSERT_OK_PTR(fds, "start_reuseport_server"))
+		goto done;
+	local_port = get_socket_local_port(*fds);
+	if (!ASSERT_GE(local_port, 0, "get_socket_local_port"))
+		goto done;
+	skel->rodata->ports[0] = ntohs(local_port);
+	skel->rodata->sf = tc->family;
+
+	err = sock_iter_batch__load(skel);
+	if (!ASSERT_OK(err, "sock_iter_batch__load"))
+		goto done;
+
+	link = bpf_program__attach_iter(tc->sock_type == SOCK_STREAM ?
+					skel->progs.iter_tcp_soreuse :
+					skel->progs.iter_udp_soreuse,
+					NULL);
+	if (!ASSERT_OK_PTR(link, "bpf_program__attach_iter"))
+		goto done;
+
+	iter_fd = bpf_iter_create(bpf_link__fd(link));
+	if (!ASSERT_GE(iter_fd, 0, "bpf_iter_create"))
+		goto done;
+
+	tc->test(tc->family, tc->sock_type, addr, port, fds, tc->init_socks,
+		 counts, tc->max_socks, link, iter_fd);
+done:
+	free_fds(fds, tc->init_socks);
+	if (iter_fd < 0)
+		close(iter_fd);
+	bpf_link__destroy(link);
+	sock_iter_batch__destroy(skel);
+}
+
+static void do_resume_tests(void)
+{
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(resume_tests); i++) {
+		if (test__start_subtest(resume_tests[i].description)) {
+			do_resume_test(&resume_tests[i]);
+		}
+	}
+}
+
 static void do_test(int sock_type, bool onebyone)
 {
 	int err, i, nread, to_read, total_read, iter_fd = -1;
-	int first_idx, second_idx, indices[nr_soreuse];
+	struct iter_out outputs[nr_soreuse];
 	struct bpf_link *link = NULL;
 	struct sock_iter_batch *skel;
+	int first_idx, second_idx;
 	int *fds[2] = {};
 
 	skel = sock_iter_batch__open();
@@ -34,6 +458,7 @@ static void do_test(int sock_type, bool onebyone)
 			goto done;
 		skel->rodata->ports[i] = ntohs(local_port);
 	}
+	skel->rodata->sf = AF_INET6;
 
 	err = sock_iter_batch__load(skel);
 	if (!ASSERT_OK(err, "sock_iter_batch__load"))
@@ -55,38 +480,38 @@ static void do_test(int sock_type, bool onebyone)
 	 * from a bucket and leave one socket out from
 	 * that bucket on purpose.
 	 */
-	to_read = (nr_soreuse - 1) * sizeof(*indices);
+	to_read = (nr_soreuse - 1) * sizeof(*outputs);
 	total_read = 0;
 	first_idx = -1;
 	do {
-		nread = read(iter_fd, indices, onebyone ? sizeof(*indices) : to_read);
-		if (nread <= 0 || nread % sizeof(*indices))
+		nread = read(iter_fd, outputs, onebyone ? sizeof(*outputs) : to_read);
+		if (nread <= 0 || nread % sizeof(*outputs))
 			break;
 		total_read += nread;
 
 		if (first_idx == -1)
-			first_idx = indices[0];
-		for (i = 0; i < nread / sizeof(*indices); i++)
-			ASSERT_EQ(indices[i], first_idx, "first_idx");
+			first_idx = outputs[0].idx;
+		for (i = 0; i < nread / sizeof(*outputs); i++)
+			ASSERT_EQ(outputs[i].idx, first_idx, "first_idx");
 	} while (total_read < to_read);
-	ASSERT_EQ(nread, onebyone ? sizeof(*indices) : to_read, "nread");
+	ASSERT_EQ(nread, onebyone ? sizeof(*outputs) : to_read, "nread");
 	ASSERT_EQ(total_read, to_read, "total_read");
 
 	free_fds(fds[first_idx], nr_soreuse);
 	fds[first_idx] = NULL;
 
 	/* Read the "whole" second bucket */
-	to_read = nr_soreuse * sizeof(*indices);
+	to_read = nr_soreuse * sizeof(*outputs);
 	total_read = 0;
 	second_idx = !first_idx;
 	do {
-		nread = read(iter_fd, indices, onebyone ? sizeof(*indices) : to_read);
-		if (nread <= 0 || nread % sizeof(*indices))
+		nread = read(iter_fd, outputs, onebyone ? sizeof(*outputs) : to_read);
+		if (nread <= 0 || nread % sizeof(*outputs))
 			break;
 		total_read += nread;
 
-		for (i = 0; i < nread / sizeof(*indices); i++)
-			ASSERT_EQ(indices[i], second_idx, "second_idx");
+		for (i = 0; i < nread / sizeof(*outputs); i++)
+			ASSERT_EQ(outputs[i].idx, second_idx, "second_idx");
 	} while (total_read <= to_read);
 	ASSERT_EQ(nread, 0, "nread");
 	/* Both so_reuseport ports should be in different buckets, so
@@ -128,6 +553,7 @@ void test_sock_iter_batch(void)
 		do_test(SOCK_DGRAM, true);
 		do_test(SOCK_DGRAM, false);
 	}
+	do_resume_tests();
 	close_netns(nstoken);
 
 done:
diff --git a/tools/testing/selftests/bpf/progs/bpf_tracing_net.h b/tools/testing/selftests/bpf/progs/bpf_tracing_net.h
index 659694162739..17db400f0e0d 100644
--- a/tools/testing/selftests/bpf/progs/bpf_tracing_net.h
+++ b/tools/testing/selftests/bpf/progs/bpf_tracing_net.h
@@ -128,6 +128,7 @@
 #define sk_refcnt		__sk_common.skc_refcnt
 #define sk_state		__sk_common.skc_state
 #define sk_net			__sk_common.skc_net
+#define sk_rcv_saddr		__sk_common.skc_rcv_saddr
 #define sk_v6_daddr		__sk_common.skc_v6_daddr
 #define sk_v6_rcv_saddr		__sk_common.skc_v6_rcv_saddr
 #define sk_flags		__sk_common.skc_flags
diff --git a/tools/testing/selftests/bpf/progs/sock_iter_batch.c b/tools/testing/selftests/bpf/progs/sock_iter_batch.c
index 96531b0d9d55..8f483337e103 100644
--- a/tools/testing/selftests/bpf/progs/sock_iter_batch.c
+++ b/tools/testing/selftests/bpf/progs/sock_iter_batch.c
@@ -17,6 +17,12 @@ static bool ipv6_addr_loopback(const struct in6_addr *a)
 		a->s6_addr32[2] | (a->s6_addr32[3] ^ bpf_htonl(1))) == 0;
 }
 
+static bool ipv4_addr_loopback(__be32 a)
+{
+	return a == bpf_ntohl(0x7f000001);
+}
+
+volatile const unsigned int sf;
 volatile const __u16 ports[2];
 unsigned int bucket[2];
 
@@ -26,16 +32,20 @@ int iter_tcp_soreuse(struct bpf_iter__tcp *ctx)
 	struct sock *sk = (struct sock *)ctx->sk_common;
 	struct inet_hashinfo *hinfo;
 	unsigned int hash;
+	__u64 sock_cookie;
 	struct net *net;
 	int idx;
 
 	if (!sk)
 		return 0;
 
+	sock_cookie = bpf_get_socket_cookie(sk);
 	sk = bpf_core_cast(sk, struct sock);
-	if (sk->sk_family != AF_INET6 ||
+	if (sk->sk_family != sf ||
 	    sk->sk_state != TCP_LISTEN ||
-	    !ipv6_addr_loopback(&sk->sk_v6_rcv_saddr))
+	    sk->sk_family == AF_INET6 ?
+	    !ipv6_addr_loopback(&sk->sk_v6_rcv_saddr) :
+	    !ipv4_addr_loopback(sk->sk_rcv_saddr))
 		return 0;
 
 	if (sk->sk_num == ports[0])
@@ -52,6 +62,7 @@ int iter_tcp_soreuse(struct bpf_iter__tcp *ctx)
 	hinfo = net->ipv4.tcp_death_row.hashinfo;
 	bucket[idx] = hash & hinfo->lhash2_mask;
 	bpf_seq_write(ctx->meta->seq, &idx, sizeof(idx));
+	bpf_seq_write(ctx->meta->seq, &sock_cookie, sizeof(sock_cookie));
 
 	return 0;
 }
@@ -63,14 +74,18 @@ int iter_udp_soreuse(struct bpf_iter__udp *ctx)
 {
 	struct sock *sk = (struct sock *)ctx->udp_sk;
 	struct udp_table *udptable;
+	__u64 sock_cookie;
 	int idx;
 
 	if (!sk)
 		return 0;
 
+	sock_cookie = bpf_get_socket_cookie(sk);
 	sk = bpf_core_cast(sk, struct sock);
-	if (sk->sk_family != AF_INET6 ||
-	    !ipv6_addr_loopback(&sk->sk_v6_rcv_saddr))
+	if (sk->sk_family != sf ||
+	    sk->sk_family == AF_INET6 ?
+	    !ipv6_addr_loopback(&sk->sk_v6_rcv_saddr) :
+	    !ipv4_addr_loopback(sk->sk_rcv_saddr))
 		return 0;
 
 	if (sk->sk_num == ports[0])
@@ -84,6 +99,7 @@ int iter_udp_soreuse(struct bpf_iter__udp *ctx)
 	udptable = sk->sk_net.net->ipv4.udp_table;
 	bucket[idx] = udp_sk(sk)->udp_portaddr_hash & udptable->mask;
 	bpf_seq_write(ctx->meta->seq, &idx, sizeof(idx));
+	bpf_seq_write(ctx->meta->seq, &sock_cookie, sizeof(sock_cookie));
 
 	return 0;
 }
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [RFC PATCH bpf-next 2/3] bpf: udp: Avoid socket skips and repeats during iteration
  2025-04-04 22:02 ` [RFC PATCH bpf-next 2/3] bpf: udp: Avoid socket skips and repeats during iteration Jordan Rife
@ 2025-04-04 23:20   ` Kuniyuki Iwashima
  2025-04-07 23:30     ` Jordan Rife
  2025-04-07 21:56   ` Martin KaFai Lau
  1 sibling, 1 reply; 12+ messages in thread
From: Kuniyuki Iwashima @ 2025-04-04 23:20 UTC (permalink / raw)
  To: jordan
  Cc: aditi.ghag, bpf, daniel, martin.lau, netdev,
	willemdebruijn.kernel, kuniyu

From: Jordan Rife <jordan@jrife.io>
Date: Fri,  4 Apr 2025 15:02:17 -0700
> Replace the offset-based approach for tracking progress through a bucket
> in the UDP table with one based on socket cookies. Remember the cookies
> of unprocessed sockets from the last batch and use this list to
> pick up where we left off or, in the case that the next socket
> disappears between reads, find the first socket after that point that
> still exists in the bucket and resume from there.
> 
> In order to make the control flow a bit easier to follow inside
> bpf_iter_udp_batch, introduce the udp_portaddr_for_each_entry_from macro
> and use this to split bucket processing into two stages: finding the
> starting point and adding items to the next batch. Originally, I
> implemented this patch inside a single udp_portaddr_for_each_entry loop,
> as it was before, but I found the resulting logic a bit messy. Overall,
> this version seems more readable.
> 
> Signed-off-by: Jordan Rife <jordan@jrife.io>
> ---
>  include/linux/udp.h |  3 ++
>  net/ipv4/udp.c      | 78 ++++++++++++++++++++++++++++++++++-----------
>  2 files changed, 63 insertions(+), 18 deletions(-)
> 
> diff --git a/include/linux/udp.h b/include/linux/udp.h
> index 0807e21cfec9..a69da9c4c1c5 100644
> --- a/include/linux/udp.h
> +++ b/include/linux/udp.h
> @@ -209,6 +209,9 @@ static inline void udp_allow_gso(struct sock *sk)
>  #define udp_portaddr_for_each_entry(__sk, list) \
>  	hlist_for_each_entry(__sk, list, __sk_common.skc_portaddr_node)
>  
> +#define udp_portaddr_for_each_entry_from(__sk) \
> +	hlist_for_each_entry_from(__sk, __sk_common.skc_portaddr_node)
> +
>  #define udp_portaddr_for_each_entry_rcu(__sk, list) \
>  	hlist_for_each_entry_rcu(__sk, list, __sk_common.skc_portaddr_node)
>  
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index 59c3281962b9..00cec269c149 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -93,6 +93,7 @@
>  #include <linux/inet.h>
>  #include <linux/netdevice.h>
>  #include <linux/slab.h>
> +#include <linux/sock_diag.h>
>  #include <net/tcp_states.h>
>  #include <linux/skbuff.h>
>  #include <linux/proc_fs.h>
> @@ -3386,6 +3387,7 @@ struct bpf_iter__udp {
>  
>  union bpf_udp_iter_batch_item {
>  	struct sock *sock;
> +	__u64 cookie;
>  };
>  
>  struct bpf_udp_iter_state {
> @@ -3393,26 +3395,42 @@ struct bpf_udp_iter_state {
>  	unsigned int cur_sk;
>  	unsigned int end_sk;
>  	unsigned int max_sk;
> -	int offset;
>  	union bpf_udp_iter_batch_item *batch;
>  	bool st_bucket_done;
>  };
>  
>  static int bpf_iter_udp_realloc_batch(struct bpf_udp_iter_state *iter,
>  				      unsigned int new_batch_sz);
> +static struct sock *bpf_iter_udp_resume(struct sock *first_sk,
> +					union bpf_udp_iter_batch_item *cookies,
> +					int n_cookies)
> +{
> +	struct sock *sk = NULL;
> +	int i = 0;
> +
> +	for (; i < n_cookies; i++) {
> +		sk = first_sk;
> +		udp_portaddr_for_each_entry_from(sk)
> +			if (cookies[i].cookie == atomic64_read(&sk->sk_cookie))
> +				goto done;
> +	}
> +done:
> +	return sk;

We may need to iterate all visited sockets again in this bucket if all
unvisited sockets disappear from the previous iteration.

When the number of the unvisited sockets is small like 1, the duplicated
records will not be rare and rather more often than before ?


> +}
> +
>  static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
>  {
>  	struct bpf_udp_iter_state *iter = seq->private;
>  	struct udp_iter_state *state = &iter->state;
> +	unsigned int find_cookie, end_cookie = 0;
>  	struct net *net = seq_file_net(seq);
> -	int resume_bucket, resume_offset;
>  	struct udp_table *udptable;
>  	unsigned int batch_sks = 0;
>  	bool resized = false;
> +	int resume_bucket;
>  	struct sock *sk;
>  
>  	resume_bucket = state->bucket;
> -	resume_offset = iter->offset;
>  
>  	/* The current batch is done, so advance the bucket. */
>  	if (iter->st_bucket_done)
> @@ -3428,6 +3446,8 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
>  	 * before releasing the bucket lock. This allows BPF programs that are
>  	 * called in seq_show to acquire the bucket lock if needed.
>  	 */
> +	find_cookie = iter->cur_sk;
> +	end_cookie = iter->end_sk;
>  	iter->cur_sk = 0;
>  	iter->end_sk = 0;
>  	iter->st_bucket_done = false;
> @@ -3439,18 +3459,26 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
>  		if (hlist_empty(&hslot2->head))
>  			continue;
>  
> -		iter->offset = 0;
>  		spin_lock_bh(&hslot2->lock);
> -		udp_portaddr_for_each_entry(sk, &hslot2->head) {
> +		/* Initialize sk to the first socket in hslot2. */
> +		udp_portaddr_for_each_entry(sk, &hslot2->head)
> +			break;
> +		/* Resume from the first (in iteration order) unseen socket from
> +		 * the last batch that still exists in resume_bucket. Most of
> +		 * the time this will just be where the last iteration left off
> +		 * in resume_bucket unless that socket disappeared between
> +		 * reads.
> +		 *
> +		 * Skip this if end_cookie isn't set; this is the first
> +		 * batch, we're on bucket zero, and we want to start from the
> +		 * beginning.
> +		 */
> +		if (state->bucket == resume_bucket && end_cookie)
> +			sk = bpf_iter_udp_resume(sk,
> +						 &iter->batch[find_cookie],
> +						 end_cookie - find_cookie);
> +		udp_portaddr_for_each_entry_from(sk) {
>  			if (seq_sk_match(seq, sk)) {
> -				/* Resume from the last iterated socket at the
> -				 * offset in the bucket before iterator was stopped.
> -				 */
> -				if (state->bucket == resume_bucket &&
> -				    iter->offset < resume_offset) {
> -					++iter->offset;
> -					continue;
> -				}
>  				if (iter->end_sk < iter->max_sk) {
>  					sock_hold(sk);
>  					iter->batch[iter->end_sk++].sock = sk;
> @@ -3494,10 +3522,8 @@ static void *bpf_iter_udp_seq_next(struct seq_file *seq, void *v, loff_t *pos)
>  	/* Whenever seq_next() is called, the iter->cur_sk is
>  	 * done with seq_show(), so unref the iter->cur_sk.
>  	 */
> -	if (iter->cur_sk < iter->end_sk) {
> +	if (iter->cur_sk < iter->end_sk)
>  		sock_put(iter->batch[iter->cur_sk++].sock);
> -		++iter->offset;
> -	}
>  
>  	/* After updating iter->cur_sk, check if there are more sockets
>  	 * available in the current bucket batch.
> @@ -3567,8 +3593,19 @@ static int bpf_iter_udp_seq_show(struct seq_file *seq, void *v)
>  
>  static void bpf_iter_udp_put_batch(struct bpf_udp_iter_state *iter)
>  {
> -	while (iter->cur_sk < iter->end_sk)
> -		sock_put(iter->batch[iter->cur_sk++].sock);
> +	union bpf_udp_iter_batch_item *item;
> +	unsigned int cur_sk = iter->cur_sk;
> +	__u64 cookie;
> +
> +	/* Remember the cookies of the sockets we haven't seen yet, so we can
> +	 * pick up where we left off next time around.
> +	 */
> +	while (cur_sk < iter->end_sk) {
> +		item = &iter->batch[cur_sk++];
> +		cookie = __sock_gen_cookie(item->sock);
> +		sock_put(item->sock);
> +		item->cookie = cookie;
> +	}
>  }
>  
>  static void bpf_iter_udp_seq_stop(struct seq_file *seq, void *v)
> @@ -3839,6 +3876,11 @@ static int bpf_iter_udp_realloc_batch(struct bpf_udp_iter_state *iter,
>  		return -ENOMEM;
>  
>  	bpf_iter_udp_put_batch(iter);
> +	WARN_ON_ONCE(new_batch_sz < iter->max_sk);
> +	/* Make sure the new batch has the cookies of the sockets we haven't
> +	 * visited yet.
> +	 */
> +	memcpy(new_batch, iter->batch, iter->end_sk);
>  	kvfree(iter->batch);
>  	iter->batch = new_batch;
>  	iter->max_sk = new_batch_sz;
> -- 
> 2.43.0
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC PATCH bpf-next 2/3] bpf: udp: Avoid socket skips and repeats during iteration
  2025-04-04 22:02 ` [RFC PATCH bpf-next 2/3] bpf: udp: Avoid socket skips and repeats during iteration Jordan Rife
  2025-04-04 23:20   ` Kuniyuki Iwashima
@ 2025-04-07 21:56   ` Martin KaFai Lau
  2025-04-07 23:39     ` Jordan Rife
  1 sibling, 1 reply; 12+ messages in thread
From: Martin KaFai Lau @ 2025-04-07 21:56 UTC (permalink / raw)
  To: Jordan Rife; +Cc: netdev, bpf, Aditi Ghag, Daniel Borkmann, Willem de Bruijn

On 4/4/25 3:02 PM, Jordan Rife wrote:
> +static struct sock *bpf_iter_udp_resume(struct sock *first_sk,
> +					union bpf_udp_iter_batch_item *cookies,
> +					int n_cookies)
> +{
> +	struct sock *sk = NULL;
> +	int i = 0;
> +
> +	for (; i < n_cookies; i++) {
> +		sk = first_sk;
> +		udp_portaddr_for_each_entry_from(sk)
> +			if (cookies[i].cookie == atomic64_read(&sk->sk_cookie))
> +				goto done;
> +	}
> +done:
> +	return sk;
> +}
> +
>   static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
>   {
>   	struct bpf_udp_iter_state *iter = seq->private;
>   	struct udp_iter_state *state = &iter->state;
> +	unsigned int find_cookie, end_cookie = 0;
>   	struct net *net = seq_file_net(seq);
> -	int resume_bucket, resume_offset;
>   	struct udp_table *udptable;
>   	unsigned int batch_sks = 0;
>   	bool resized = false;
> +	int resume_bucket;
>   	struct sock *sk;
>   
>   	resume_bucket = state->bucket;
> -	resume_offset = iter->offset;
>   
>   	/* The current batch is done, so advance the bucket. */
>   	if (iter->st_bucket_done)
> @@ -3428,6 +3446,8 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
>   	 * before releasing the bucket lock. This allows BPF programs that are
>   	 * called in seq_show to acquire the bucket lock if needed.
>   	 */
> +	find_cookie = iter->cur_sk;
> +	end_cookie = iter->end_sk;
>   	iter->cur_sk = 0;
>   	iter->end_sk = 0;
>   	iter->st_bucket_done = false;
> @@ -3439,18 +3459,26 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
>   		if (hlist_empty(&hslot2->head))
>   			continue;
>   
> -		iter->offset = 0;
>   		spin_lock_bh(&hslot2->lock);
> -		udp_portaddr_for_each_entry(sk, &hslot2->head) {
> +		/* Initialize sk to the first socket in hslot2. */
> +		udp_portaddr_for_each_entry(sk, &hslot2->head)
> +			break;

nit. It is to get the first entry? May be directly do
hlist_entry_safe(hslot2->head.first, ... ) instead.

> +		/* Resume from the first (in iteration order) unseen socket from
> +		 * the last batch that still exists in resume_bucket. Most of
> +		 * the time this will just be where the last iteration left off
> +		 * in resume_bucket unless that socket disappeared between
> +		 * reads.
> +		 *
> +		 * Skip this if end_cookie isn't set; this is the first
> +		 * batch, we're on bucket zero, and we want to start from the
> +		 * beginning.
> +		 */
> +		if (state->bucket == resume_bucket && end_cookie)
> +			sk = bpf_iter_udp_resume(sk,
> +						 &iter->batch[find_cookie],
> +						 end_cookie - find_cookie);
> +		udp_portaddr_for_each_entry_from(sk) {
>   			if (seq_sk_match(seq, sk)) {
> -				/* Resume from the last iterated socket at the
> -				 * offset in the bucket before iterator was stopped.
> -				 */
> -				if (state->bucket == resume_bucket &&
> -				    iter->offset < resume_offset) {
> -					++iter->offset;
> -					continue;
> -				}
>   				if (iter->end_sk < iter->max_sk) {
>   					sock_hold(sk);
>   					iter->batch[iter->end_sk++].sock = sk;

I looked at the details for these two functions. The approach looks good to me. 
Thanks for trying it.

This should stop the potential duplicates during stop() and then re-start().

My understanding is that it may or may not batch something newer than the last 
stop(). This behavior should be similar to the current offset approach also. I 
think it is fine. The similar situation is true for the next bucket anyway.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC PATCH bpf-next 2/3] bpf: udp: Avoid socket skips and repeats during iteration
  2025-04-04 23:20   ` Kuniyuki Iwashima
@ 2025-04-07 23:30     ` Jordan Rife
  2025-04-08  0:16       ` Kuniyuki Iwashima
  0 siblings, 1 reply; 12+ messages in thread
From: Jordan Rife @ 2025-04-07 23:30 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: aditi.ghag, bpf, daniel, martin.lau, netdev,
	willemdebruijn.kernel

> We may need to iterate all visited sockets again in this bucket if all
> unvisited sockets disappear from the previous iteration.

If the next socket disappears between iterator stop and start, the
outer loop would need to keep going until it finds a socket from last
time that still exists. In most cases, it seems unlikely that the next
socket will disappear between iterator reads, so in general the outer
loop would only need to iterate once; the common case should perform
the same as before with the offset approach. The worst case indeed
would be if all the sockets disappear between reads. Then you'd have
to scan through all items in the bucket n_cookies times. Again though,
this is hopefully a rare case.

> When the number of the unvisited sockets is small like 1, the duplicated
> records will not be rare and rather more often than before ?

Sorry if I'm missing something, but what's the relationship between
the number of unvisited sockets and rarity of duplicated records?

-Jordan

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC PATCH bpf-next 2/3] bpf: udp: Avoid socket skips and repeats during iteration
  2025-04-07 21:56   ` Martin KaFai Lau
@ 2025-04-07 23:39     ` Jordan Rife
  0 siblings, 0 replies; 12+ messages in thread
From: Jordan Rife @ 2025-04-07 23:39 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: netdev, bpf, Aditi Ghag, Daniel Borkmann, Willem de Bruijn

> nit. It is to get the first entry? May be directly do
> hlist_entry_safe(hslot2->head.first, ... ) instead.

Sure, I can change this and drop the RFC tag for the next iteration of
this series.

> My understanding is that it may or may not batch something newer than the last
> stop(). This behavior should be similar to the current offset approach also. I
> think it is fine. The similar situation is true for the next bucket anyway.

Assuming it's rare that the first unvisited socket disappears between
stop and start, which seems like a reasonable assumption, you should
generally only need to scan through the list once to find that socket
(similar amount of work to offset). Worst case is if every socket from
last time is no longer there. Then you'd end up scanning through the
full list end_cookie - find_cookie times. And yeah, I think the
iterator shouldn't really care if new sockets are seen or not as long
as you see all sockets that were there when you started iterating.

-Jordan

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC PATCH bpf-next 2/3] bpf: udp: Avoid socket skips and repeats during iteration
  2025-04-07 23:30     ` Jordan Rife
@ 2025-04-08  0:16       ` Kuniyuki Iwashima
  2025-04-08  2:39         ` Jordan Rife
  0 siblings, 1 reply; 12+ messages in thread
From: Kuniyuki Iwashima @ 2025-04-08  0:16 UTC (permalink / raw)
  To: jordan
  Cc: aditi.ghag, bpf, daniel, kuniyu, martin.lau, netdev,
	willemdebruijn.kernel

From: Jordan Rife <jordan@jrife.io>
Date: Mon, 7 Apr 2025 16:30:46 -0700
> > We may need to iterate all visited sockets again in this bucket if all
> > unvisited sockets disappear from the previous iteration.
> 
> If the next socket disappears between iterator stop and start, the
> outer loop would need to keep going until it finds a socket from last
> time that still exists. In most cases, it seems unlikely that the next
> socket will disappear between iterator reads, so in general the outer
> loop would only need to iterate once; the common case should perform
> the same as before with the offset approach. The worst case indeed
> would be if all the sockets disappear between reads. Then you'd have
> to scan through all items in the bucket n_cookies times. Again though,
> this is hopefully a rare case.
> 
> > When the number of the unvisited sockets is small like 1, the duplicated
> > records will not be rare and rather more often than before ?
> 
> Sorry if I'm missing something, but what's the relationship between
> the number of unvisited sockets and rarity of duplicated records?

Sorry, I misread the code, and s/duplicated/skipped/.

I was thinking that rarity of such unwanted events depends on how
many unvisited sockets are left before restarting.

Let's say batch has 16 sockets and the iterator stopped at 15,
it's more likely that a single socket disappear.

This should be fine given the batch size normally covers the full
bucket of the hash, and it's unlikely that many sockets are added
in the bucket between stop and restart.

In the worst case, where vmalloc() fails and the batch does not
cover full bucket, say the batch size is 16 but the list length
is 256, if the iterator stops at sk15 and sk16 disappers,
sk17 ~ sk256 will be skipped in the next iteration.

 sk1 -> ... sk15 -> sk16 -> sk17 -> ... -> sk256

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC PATCH bpf-next 2/3] bpf: udp: Avoid socket skips and repeats during iteration
  2025-04-08  0:16       ` Kuniyuki Iwashima
@ 2025-04-08  2:39         ` Jordan Rife
  2025-04-08  5:23           ` Martin KaFai Lau
  0 siblings, 1 reply; 12+ messages in thread
From: Jordan Rife @ 2025-04-08  2:39 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: aditi.ghag, bpf, daniel, martin.lau, netdev,
	willemdebruijn.kernel

> In the worst case, where vmalloc() fails and the batch does not
> cover full bucket, say the batch size is 16 but the list length
> is 256, if the iterator stops at sk15 and sk16 disappers,
> sk17 ~ sk256 will be skipped in the next iteration.
>
> sk1 -> ... sk15 -> sk16 -> sk17 -> ... -> sk256

Ah yes, this is true. Thank you for clarifying, you bring up a good
point. In case vmalloc() fails, the batch size can't cover the whole
bucket in one go, and none of the saved cookies from last time are in
the bucket, there's currently no great option. You'd need to do one of
the following:

1) Start from the beginning of the list, assuming none of the sockets
had been seen so far. This risks repeating sockets you've already
seen, however.
2) Skip the rest of the sockets to avoid repeating sockets you've
already seen. You might skip sockets that you didn't want to skip.

I actually wonder if a third option might be better in this case though:

3) If vmalloc fails, propagate ENOMEM up to userspace and stop
iteration instead of making the tradeoff of possibly repeating or
skipping sockets. seq_read can already return ENOMEM in some cases, so
IMO this feels more correct. WDYT?

-Jordan

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC PATCH bpf-next 2/3] bpf: udp: Avoid socket skips and repeats during iteration
  2025-04-08  2:39         ` Jordan Rife
@ 2025-04-08  5:23           ` Martin KaFai Lau
  2025-04-09  0:11             ` Jordan Rife
  0 siblings, 1 reply; 12+ messages in thread
From: Martin KaFai Lau @ 2025-04-08  5:23 UTC (permalink / raw)
  To: Jordan Rife
  Cc: Kuniyuki Iwashima, aditi.ghag, bpf, daniel, netdev,
	willemdebruijn.kernel

On 4/7/25 7:39 PM, Jordan Rife wrote:
> 3) If vmalloc fails, propagate ENOMEM up to userspace and stop
> iteration instead of making the tradeoff of possibly repeating or
> skipping sockets. seq_read can already return ENOMEM in some cases, so
> IMO this feels more correct. WDYT?

Agree that this is better.
The stop() may need to take care of the start()/next() may fail. Take a look at 
the bpf_seq_read() in bpf_iter.c. Please check.



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC PATCH bpf-next 2/3] bpf: udp: Avoid socket skips and repeats during iteration
  2025-04-08  5:23           ` Martin KaFai Lau
@ 2025-04-09  0:11             ` Jordan Rife
  0 siblings, 0 replies; 12+ messages in thread
From: Jordan Rife @ 2025-04-09  0:11 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: Kuniyuki Iwashima, aditi.ghag, bpf, daniel, netdev,
	willemdebruijn.kernel

> Agree that this is better.
> The stop() may need to take care of the start()/next() may fail. Take a look at
> the bpf_seq_read() in bpf_iter.c. Please check.

Thanks, I will take a look.

-Jordan

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2025-04-09  0:11 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-04-04 22:02 [RFC PATCH bpf-next 0/3] Exactly-once UDP socket iteration Jordan Rife
2025-04-04 22:02 ` [RFC PATCH bpf-next 1/3] bpf: udp: Use bpf_udp_iter_batch_item for bpf_udp_iter_state batch items Jordan Rife
2025-04-04 22:02 ` [RFC PATCH bpf-next 2/3] bpf: udp: Avoid socket skips and repeats during iteration Jordan Rife
2025-04-04 23:20   ` Kuniyuki Iwashima
2025-04-07 23:30     ` Jordan Rife
2025-04-08  0:16       ` Kuniyuki Iwashima
2025-04-08  2:39         ` Jordan Rife
2025-04-08  5:23           ` Martin KaFai Lau
2025-04-09  0:11             ` Jordan Rife
2025-04-07 21:56   ` Martin KaFai Lau
2025-04-07 23:39     ` Jordan Rife
2025-04-04 22:02 ` [RFC PATCH bpf-next 3/3] selftests/bpf: Add tests for bucket resume logic in UDP socket iterators Jordan Rife

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).