* [RFC PATCH bpf-next 0/3] Exactly-once UDP socket iteration
@ 2025-04-04 22:02 Jordan Rife
2025-04-04 22:02 ` [RFC PATCH bpf-next 1/3] bpf: udp: Use bpf_udp_iter_batch_item for bpf_udp_iter_state batch items Jordan Rife
` (2 more replies)
0 siblings, 3 replies; 12+ messages in thread
From: Jordan Rife @ 2025-04-04 22:02 UTC (permalink / raw)
To: netdev, bpf
Cc: Jordan Rife, Aditi Ghag, Daniel Borkmann, Martin KaFai Lau,
Willem de Bruijn
Both UDP and TCP socket iterators use iter->offset to track progress
through a bucket. It counts the matching sockets from the current
bucket that the iterator has already seen or processed. On subsequent
iterations, if the current bucket has unprocessed items, we skip at
least iter->offset matching items in the bucket before adding any
remaining items to the next batch. However, iter->offset isn't always
an accurate measure of "things already seen" when the underlying bucket
changes between reads, which can lead to repeated or skipped sockets.
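As a rough illustration of the skip described above (a userspace sketch, not kernel code), the bucket can be modeled as an ordered array of socket ids. The names struct bucket, bucket_remove and resume_by_offset are hypothetical and exist only for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a hash bucket: an ordered array of socket ids. */
struct bucket {
	int ids[8];
	size_t len;
};

/* Remove the entry at position pos, shifting later entries forward. */
static void bucket_remove(struct bucket *b, size_t pos)
{
	assert(pos < b->len);
	for (size_t i = pos; i + 1 < b->len; i++)
		b->ids[i] = b->ids[i + 1];
	b->len--;
}

/* Offset-based resume: skip `offset` entries and return the next id, or -1. */
static int resume_by_offset(const struct bucket *b, size_t offset)
{
	return offset < b->len ? b->ids[offset] : -1;
}
```

With ids 1, 2, 3, 4 in the bucket and two sockets already seen, resuming at offset 2 yields id 3. If socket 1 is removed between reads, the same offset lands on id 4 and id 3 is never visited.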
In my original RFC [1], I proposed a solution that added a new index
field to struct sock_common, but the general feedback was that we should
avoid this. After some discussion, Martin suggested using socket cookies
to keep track of what we haven't seen yet in the current bucket. This
series is a follow-up to that discussion and implements a PoC of this
approach.
This series replaces struct sock **batch inside struct
bpf_udp_iter_state with union bpf_udp_iter_batch_item *batch, where
union bpf_udp_iter_batch_item can contain either a pointer to a socket
or a socket cookie. During reads, batch contains pointers to all sockets
in the current batch while between reads batch contains all the cookies
of the sockets in the current bucket that have yet to be processed. On
subsequent reads, when iteration resumes, bpf_iter_udp_batch finds the
first saved cookie that matches a socket in the bucket's socket list and
picks up from there to construct the next batch. On average, assuming
it's rare that the next socket disappears before the next read occurs,
we should only need to scan as much as we did with the offset-based
approach to find the starting point. In the case that the next socket
is no longer there, we keep scanning through the saved cookies list
until we find a match. The worst case is when none of the sockets from
last time exist anymore, but again, this should be rare.
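The resume step described above can be sketched in userspace roughly as follows. This is only a model under stated assumptions: struct fake_sock and resume_by_cookie are hypothetical names, the bucket is a plain singly linked list, and locking and reference counting are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace model of a socket in a bucket's list. */
struct fake_sock {
	uint64_t cookie;
	struct fake_sock *next;
};

/*
 * Walk the saved cookies in order. For each one, scan the list starting at
 * `first` and return the matching socket. If no saved cookie is still present
 * in the list, return NULL.
 */
static struct fake_sock *resume_by_cookie(struct fake_sock *first,
					  const uint64_t *cookies,
					  size_t n_cookies)
{
	assert(cookies || n_cookies == 0);
	for (size_t i = 0; i < n_cookies; i++)
		for (struct fake_sock *sk = first; sk; sk = sk->next)
			if (sk->cookie == cookies[i])
				return sk;
	return NULL;
}
```

In the common case the first saved cookie is still in the list, so a single pass finds it. Each additional pass corresponds to one saved socket that disappeared between reads.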
[1]: https://lore.kernel.org/bpf/20250313233615.2329869-1-jrife@google.com/
Jordan Rife (3):
  bpf: udp: Use bpf_udp_iter_batch_item for bpf_udp_iter_state batch
    items
  bpf: udp: Avoid socket skips and repeats during iteration
  selftests/bpf: Add tests for bucket resume logic in UDP socket
    iterators

 include/linux/udp.h                           |   3 +
 net/ipv4/udp.c                                |  94 +++-
 .../bpf/prog_tests/sock_iter_batch.c          | 452 +++++++++++++++++-
 .../selftests/bpf/progs/bpf_tracing_net.h     |   1 +
 .../selftests/bpf/progs/sock_iter_batch.c     |  24 +-
 5 files changed, 533 insertions(+), 41 deletions(-)
--
2.43.0
^ permalink raw reply [flat|nested] 12+ messages in thread
* [RFC PATCH bpf-next 1/3] bpf: udp: Use bpf_udp_iter_batch_item for bpf_udp_iter_state batch items
2025-04-04 22:02 [RFC PATCH bpf-next 0/3] Exactly-once UDP socket iteration Jordan Rife
@ 2025-04-04 22:02 ` Jordan Rife
2025-04-04 22:02 ` [RFC PATCH bpf-next 2/3] bpf: udp: Avoid socket skips and repeats during iteration Jordan Rife
2025-04-04 22:02 ` [RFC PATCH bpf-next 3/3] selftests/bpf: Add tests for bucket resume logic in UDP socket iterators Jordan Rife
2 siblings, 0 replies; 12+ messages in thread
From: Jordan Rife @ 2025-04-04 22:02 UTC (permalink / raw)
To: netdev, bpf
Cc: Jordan Rife, Aditi Ghag, Daniel Borkmann, Martin KaFai Lau,
Willem de Bruijn

Prepare for the next commit that tracks cookies between iterations by
converting struct sock **batch to union bpf_udp_iter_batch_item *batch
inside struct bpf_udp_iter_state.

Signed-off-by: Jordan Rife <jordan@jrife.io>
---
 net/ipv4/udp.c | 18 +++++++++++-------
 1 file changed, 11 insertions(+), 7 deletions(-)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index d0bffcfa56d8..59c3281962b9 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -3384,13 +3384,17 @@ struct bpf_iter__udp {
 	int bucket __aligned(8);
 };
 
+union bpf_udp_iter_batch_item {
+	struct sock *sock;
+};
+
 struct bpf_udp_iter_state {
 	struct udp_iter_state state;
 	unsigned int cur_sk;
 	unsigned int end_sk;
 	unsigned int max_sk;
 	int offset;
-	struct sock **batch;
+	union bpf_udp_iter_batch_item *batch;
 	bool st_bucket_done;
 };
 
@@ -3449,7 +3453,7 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
 				}
 				if (iter->end_sk < iter->max_sk) {
 					sock_hold(sk);
-					iter->batch[iter->end_sk++] = sk;
+					iter->batch[iter->end_sk++].sock = sk;
 				}
 				batch_sks++;
 			}
@@ -3479,7 +3483,7 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
 		goto again;
 	}
 done:
-	return iter->batch[0];
+	return iter->batch[0].sock;
 }
 
 static void *bpf_iter_udp_seq_next(struct seq_file *seq, void *v, loff_t *pos)
@@ -3491,7 +3495,7 @@ static void *bpf_iter_udp_seq_next(struct seq_file *seq, void *v, loff_t *pos)
 	 * done with seq_show(), so unref the iter->cur_sk.
 	 */
 	if (iter->cur_sk < iter->end_sk) {
-		sock_put(iter->batch[iter->cur_sk++]);
+		sock_put(iter->batch[iter->cur_sk++].sock);
 		++iter->offset;
 	}
 
@@ -3499,7 +3503,7 @@ static void *bpf_iter_udp_seq_next(struct seq_file *seq, void *v, loff_t *pos)
 	 * available in the current bucket batch.
 	 */
 	if (iter->cur_sk < iter->end_sk)
-		sk = iter->batch[iter->cur_sk];
+		sk = iter->batch[iter->cur_sk].sock;
 	else
 		/* Prepare a new batch. */
 		sk = bpf_iter_udp_batch(seq);
@@ -3564,7 +3568,7 @@ static int bpf_iter_udp_seq_show(struct seq_file *seq, void *v)
 static void bpf_iter_udp_put_batch(struct bpf_udp_iter_state *iter)
 {
 	while (iter->cur_sk < iter->end_sk)
-		sock_put(iter->batch[iter->cur_sk++]);
+		sock_put(iter->batch[iter->cur_sk++].sock);
 }
 
 static void bpf_iter_udp_seq_stop(struct seq_file *seq, void *v)
@@ -3827,7 +3831,7 @@ DEFINE_BPF_ITER_FUNC(udp, struct bpf_iter_meta *meta,
 static int bpf_iter_udp_realloc_batch(struct bpf_udp_iter_state *iter,
 				      unsigned int new_batch_sz)
 {
-	struct sock **new_batch;
+	union bpf_udp_iter_batch_item *new_batch;
 
 	new_batch = kvmalloc_array(new_batch_sz, sizeof(*new_batch),
 				   GFP_USER | __GFP_NOWARN);
--
2.43.0
* [RFC PATCH bpf-next 2/3] bpf: udp: Avoid socket skips and repeats during iteration
2025-04-04 22:02 [RFC PATCH bpf-next 0/3] Exactly-once UDP socket iteration Jordan Rife
2025-04-04 22:02 ` [RFC PATCH bpf-next 1/3] bpf: udp: Use bpf_udp_iter_batch_item for bpf_udp_iter_state batch items Jordan Rife
@ 2025-04-04 22:02 ` Jordan Rife
2025-04-04 23:20 ` Kuniyuki Iwashima
2025-04-07 21:56 ` Martin KaFai Lau
2025-04-04 22:02 ` [RFC PATCH bpf-next 3/3] selftests/bpf: Add tests for bucket resume logic in UDP socket iterators Jordan Rife
2 siblings, 2 replies; 12+ messages in thread
From: Jordan Rife @ 2025-04-04 22:02 UTC (permalink / raw)
To: netdev, bpf
Cc: Jordan Rife, Aditi Ghag, Daniel Borkmann, Martin KaFai Lau,
Willem de Bruijn

Replace the offset-based approach for tracking progress through a bucket
in the UDP table with one based on socket cookies. Remember the cookies
of unprocessed sockets from the last batch and use this list to
pick up where we left off or, in the case that the next socket
disappears between reads, find the first socket after that point that
still exists in the bucket and resume from there.

In order to make the control flow a bit easier to follow inside
bpf_iter_udp_batch, introduce the udp_portaddr_for_each_entry_from macro
and use this to split bucket processing into two stages: finding the
starting point and adding items to the next batch. Originally, I
implemented this patch inside a single udp_portaddr_for_each_entry loop,
as it was before, but I found the resulting logic a bit messy. Overall,
this version seems more readable.

Signed-off-by: Jordan Rife <jordan@jrife.io>
---
 include/linux/udp.h |  3 ++
 net/ipv4/udp.c      | 78 ++++++++++++++++++++++++++++++++++-----------
 2 files changed, 63 insertions(+), 18 deletions(-)

diff --git a/include/linux/udp.h b/include/linux/udp.h
index 0807e21cfec9..a69da9c4c1c5 100644
--- a/include/linux/udp.h
+++ b/include/linux/udp.h
@@ -209,6 +209,9 @@ static inline void udp_allow_gso(struct sock *sk)
 #define udp_portaddr_for_each_entry(__sk, list) \
 	hlist_for_each_entry(__sk, list, __sk_common.skc_portaddr_node)
 
+#define udp_portaddr_for_each_entry_from(__sk) \
+	hlist_for_each_entry_from(__sk, __sk_common.skc_portaddr_node)
+
 #define udp_portaddr_for_each_entry_rcu(__sk, list) \
 	hlist_for_each_entry_rcu(__sk, list, __sk_common.skc_portaddr_node)
 
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 59c3281962b9..00cec269c149 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -93,6 +93,7 @@
 #include <linux/inet.h>
 #include <linux/netdevice.h>
 #include <linux/slab.h>
+#include <linux/sock_diag.h>
 #include <net/tcp_states.h>
 #include <linux/skbuff.h>
 #include <linux/proc_fs.h>
@@ -3386,6 +3387,7 @@ struct bpf_iter__udp {
 
 union bpf_udp_iter_batch_item {
 	struct sock *sock;
+	__u64 cookie;
 };
 
 struct bpf_udp_iter_state {
@@ -3393,26 +3395,42 @@ struct bpf_udp_iter_state {
 	unsigned int cur_sk;
 	unsigned int end_sk;
 	unsigned int max_sk;
-	int offset;
 	union bpf_udp_iter_batch_item *batch;
 	bool st_bucket_done;
 };
 
 static int bpf_iter_udp_realloc_batch(struct bpf_udp_iter_state *iter,
 				      unsigned int new_batch_sz);
+static struct sock *bpf_iter_udp_resume(struct sock *first_sk,
+					union bpf_udp_iter_batch_item *cookies,
+					int n_cookies)
+{
+	struct sock *sk = NULL;
+	int i = 0;
+
+	for (; i < n_cookies; i++) {
+		sk = first_sk;
+		udp_portaddr_for_each_entry_from(sk)
+			if (cookies[i].cookie == atomic64_read(&sk->sk_cookie))
+				goto done;
+	}
+done:
+	return sk;
+}
+
 static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
 {
 	struct bpf_udp_iter_state *iter = seq->private;
 	struct udp_iter_state *state = &iter->state;
+	unsigned int find_cookie, end_cookie = 0;
 	struct net *net = seq_file_net(seq);
-	int resume_bucket, resume_offset;
 	struct udp_table *udptable;
 	unsigned int batch_sks = 0;
 	bool resized = false;
+	int resume_bucket;
 	struct sock *sk;
 
 	resume_bucket = state->bucket;
-	resume_offset = iter->offset;
 
 	/* The current batch is done, so advance the bucket. */
 	if (iter->st_bucket_done)
@@ -3428,6 +3446,8 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
 	 * before releasing the bucket lock. This allows BPF programs that are
 	 * called in seq_show to acquire the bucket lock if needed.
 	 */
+	find_cookie = iter->cur_sk;
+	end_cookie = iter->end_sk;
 	iter->cur_sk = 0;
 	iter->end_sk = 0;
 	iter->st_bucket_done = false;
@@ -3439,18 +3459,26 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
 		if (hlist_empty(&hslot2->head))
 			continue;
 
-		iter->offset = 0;
 		spin_lock_bh(&hslot2->lock);
-		udp_portaddr_for_each_entry(sk, &hslot2->head) {
+		/* Initialize sk to the first socket in hslot2. */
+		udp_portaddr_for_each_entry(sk, &hslot2->head)
+			break;
+		/* Resume from the first (in iteration order) unseen socket from
+		 * the last batch that still exists in resume_bucket. Most of
+		 * the time this will just be where the last iteration left off
+		 * in resume_bucket unless that socket disappeared between
+		 * reads.
+		 *
+		 * Skip this if end_cookie isn't set; this is the first
+		 * batch, we're on bucket zero, and we want to start from the
+		 * beginning.
+		 */
+		if (state->bucket == resume_bucket && end_cookie)
+			sk = bpf_iter_udp_resume(sk,
+						 &iter->batch[find_cookie],
+						 end_cookie - find_cookie);
+		udp_portaddr_for_each_entry_from(sk) {
 			if (seq_sk_match(seq, sk)) {
-				/* Resume from the last iterated socket at the
-				 * offset in the bucket before iterator was stopped.
-				 */
-				if (state->bucket == resume_bucket &&
-				    iter->offset < resume_offset) {
-					++iter->offset;
-					continue;
-				}
 				if (iter->end_sk < iter->max_sk) {
 					sock_hold(sk);
 					iter->batch[iter->end_sk++].sock = sk;
@@ -3494,10 +3522,8 @@ static void *bpf_iter_udp_seq_next(struct seq_file *seq, void *v, loff_t *pos)
 	/* Whenever seq_next() is called, the iter->cur_sk is
 	 * done with seq_show(), so unref the iter->cur_sk.
 	 */
-	if (iter->cur_sk < iter->end_sk) {
+	if (iter->cur_sk < iter->end_sk)
 		sock_put(iter->batch[iter->cur_sk++].sock);
-		++iter->offset;
-	}
 
 	/* After updating iter->cur_sk, check if there are more sockets
 	 * available in the current bucket batch.
@@ -3567,8 +3593,19 @@ static int bpf_iter_udp_seq_show(struct seq_file *seq, void *v)
 
 static void bpf_iter_udp_put_batch(struct bpf_udp_iter_state *iter)
 {
-	while (iter->cur_sk < iter->end_sk)
-		sock_put(iter->batch[iter->cur_sk++].sock);
+	union bpf_udp_iter_batch_item *item;
+	unsigned int cur_sk = iter->cur_sk;
+	__u64 cookie;
+
+	/* Remember the cookies of the sockets we haven't seen yet, so we can
+	 * pick up where we left off next time around.
+	 */
+	while (cur_sk < iter->end_sk) {
+		item = &iter->batch[cur_sk++];
+		cookie = __sock_gen_cookie(item->sock);
+		sock_put(item->sock);
+		item->cookie = cookie;
+	}
 }
 
 static void bpf_iter_udp_seq_stop(struct seq_file *seq, void *v)
@@ -3839,6 +3876,11 @@ static int bpf_iter_udp_realloc_batch(struct bpf_udp_iter_state *iter,
 		return -ENOMEM;
 
 	bpf_iter_udp_put_batch(iter);
+	WARN_ON_ONCE(new_batch_sz < iter->max_sk);
+	/* Make sure the new batch has the cookies of the sockets we haven't
+	 * visited yet.
+	 */
+	memcpy(new_batch, iter->batch, iter->end_sk);
 	kvfree(iter->batch);
 	iter->batch = new_batch;
 	iter->max_sk = new_batch_sz;
--
2.43.0
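The pointer-to-cookie conversion done when a read stops can be modeled in userspace roughly like this. It is a sketch only: struct fake_sock, union batch_item and save_cookies are hypothetical names, and a plain refcnt field stands in for the reference that sock_put() drops in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace stand-in for struct sock. */
struct fake_sock {
	uint64_t cookie;
	int refcnt;
};

/* A slot holds a socket pointer during a read and a cookie between reads. */
union batch_item {
	struct fake_sock *sock;
	uint64_t cookie;
};

/*
 * Convert the unprocessed slots [cur, end) from pointers to cookies and drop
 * the reference held on each socket.
 */
static void save_cookies(union batch_item *batch, size_t cur, size_t end)
{
	assert(cur <= end);
	for (; cur < end; cur++) {
		struct fake_sock *sk = batch[cur].sock;
		/* Read the cookie first: both union members share storage. */
		uint64_t cookie = sk->cookie;

		sk->refcnt--;
		batch[cur].cookie = cookie;
	}
}
```

Slots before cur are left alone, since those sockets were already shown and released. Only the tail of the batch needs to survive until the next read.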
* Re: [RFC PATCH bpf-next 2/3] bpf: udp: Avoid socket skips and repeats during iteration
2025-04-04 22:02 ` [RFC PATCH bpf-next 2/3] bpf: udp: Avoid socket skips and repeats during iteration Jordan Rife
@ 2025-04-04 23:20 ` Kuniyuki Iwashima
2025-04-07 23:30 ` Jordan Rife
2025-04-07 21:56 ` Martin KaFai Lau
1 sibling, 1 reply; 12+ messages in thread
From: Kuniyuki Iwashima @ 2025-04-04 23:20 UTC (permalink / raw)
To: jordan
Cc: aditi.ghag, bpf, daniel, martin.lau, netdev, willemdebruijn.kernel,
kuniyu

From: Jordan Rife <jordan@jrife.io>
Date: Fri, 4 Apr 2025 15:02:17 -0700
> Replace the offset-based approach for tracking progress through a bucket
> in the UDP table with one based on socket cookies. Remember the cookies
> of unprocessed sockets from the last batch and use this list to
> pick up where we left off or, in the case that the next socket
> disappears between reads, find the first socket after that point that
> still exists in the bucket and resume from there.
>
> In order to make the control flow a bit easier to follow inside
> bpf_iter_udp_batch, introduce the udp_portaddr_for_each_entry_from macro
> and use this to split bucket processing into two stages: finding the
> starting point and adding items to the next batch. Originally, I
> implemented this patch inside a single udp_portaddr_for_each_entry loop,
> as it was before, but I found the resulting logic a bit messy. Overall,
> this version seems more readable.
>
> Signed-off-by: Jordan Rife <jordan@jrife.io>
> ---
>  include/linux/udp.h |  3 ++
>  net/ipv4/udp.c      | 78 ++++++++++++++++++++++++++++++++++-----------
>  2 files changed, 63 insertions(+), 18 deletions(-)
>
> diff --git a/include/linux/udp.h b/include/linux/udp.h
> index 0807e21cfec9..a69da9c4c1c5 100644
> --- a/include/linux/udp.h
> +++ b/include/linux/udp.h
> @@ -209,6 +209,9 @@ static inline void udp_allow_gso(struct sock *sk)
>  #define udp_portaddr_for_each_entry(__sk, list) \
>  	hlist_for_each_entry(__sk, list, __sk_common.skc_portaddr_node)
>
> +#define udp_portaddr_for_each_entry_from(__sk) \
> +	hlist_for_each_entry_from(__sk, __sk_common.skc_portaddr_node)
> +
>  #define udp_portaddr_for_each_entry_rcu(__sk, list) \
>  	hlist_for_each_entry_rcu(__sk, list, __sk_common.skc_portaddr_node)
>
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index 59c3281962b9..00cec269c149 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -93,6 +93,7 @@
>  #include <linux/inet.h>
>  #include <linux/netdevice.h>
>  #include <linux/slab.h>
> +#include <linux/sock_diag.h>
>  #include <net/tcp_states.h>
>  #include <linux/skbuff.h>
>  #include <linux/proc_fs.h>
> @@ -3386,6 +3387,7 @@ struct bpf_iter__udp {
>
>  union bpf_udp_iter_batch_item {
>  	struct sock *sock;
> +	__u64 cookie;
>  };
>
>  struct bpf_udp_iter_state {
> @@ -3393,26 +3395,42 @@ struct bpf_udp_iter_state {
>  	unsigned int cur_sk;
>  	unsigned int end_sk;
>  	unsigned int max_sk;
> -	int offset;
>  	union bpf_udp_iter_batch_item *batch;
>  	bool st_bucket_done;
>  };
>
>  static int bpf_iter_udp_realloc_batch(struct bpf_udp_iter_state *iter,
>  				      unsigned int new_batch_sz);
> +static struct sock *bpf_iter_udp_resume(struct sock *first_sk,
> +					union bpf_udp_iter_batch_item *cookies,
> +					int n_cookies)
> +{
> +	struct sock *sk = NULL;
> +	int i = 0;
> +
> +	for (; i < n_cookies; i++) {
> +		sk = first_sk;
> +		udp_portaddr_for_each_entry_from(sk)
> +			if (cookies[i].cookie == atomic64_read(&sk->sk_cookie))
> +				goto done;
> +	}
> +done:
> +	return sk;

We may need to iterate all visited sockets again in this bucket if all
unvisited sockets disappear from the previous iteration.

When the number of the unvisited sockets is small like 1, the duplicated
records will not be rare and rather more often than before ?

> +}
> +
>  static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
>  {
>  	struct bpf_udp_iter_state *iter = seq->private;
>  	struct udp_iter_state *state = &iter->state;
> +	unsigned int find_cookie, end_cookie = 0;
>  	struct net *net = seq_file_net(seq);
> -	int resume_bucket, resume_offset;
>  	struct udp_table *udptable;
>  	unsigned int batch_sks = 0;
>  	bool resized = false;
> +	int resume_bucket;
>  	struct sock *sk;
>
>  	resume_bucket = state->bucket;
> -	resume_offset = iter->offset;
>
>  	/* The current batch is done, so advance the bucket. */
>  	if (iter->st_bucket_done)
> @@ -3428,6 +3446,8 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
>  	 * before releasing the bucket lock. This allows BPF programs that are
>  	 * called in seq_show to acquire the bucket lock if needed.
>  	 */
> +	find_cookie = iter->cur_sk;
> +	end_cookie = iter->end_sk;
>  	iter->cur_sk = 0;
>  	iter->end_sk = 0;
>  	iter->st_bucket_done = false;
> @@ -3439,18 +3459,26 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
>  		if (hlist_empty(&hslot2->head))
>  			continue;
>
> -		iter->offset = 0;
>  		spin_lock_bh(&hslot2->lock);
> -		udp_portaddr_for_each_entry(sk, &hslot2->head) {
> +		/* Initialize sk to the first socket in hslot2. */
> +		udp_portaddr_for_each_entry(sk, &hslot2->head)
> +			break;
> +		/* Resume from the first (in iteration order) unseen socket from
> +		 * the last batch that still exists in resume_bucket. Most of
> +		 * the time this will just be where the last iteration left off
> +		 * in resume_bucket unless that socket disappeared between
> +		 * reads.
> +		 *
> +		 * Skip this if end_cookie isn't set; this is the first
> +		 * batch, we're on bucket zero, and we want to start from the
> +		 * beginning.
> +		 */
> +		if (state->bucket == resume_bucket && end_cookie)
> +			sk = bpf_iter_udp_resume(sk,
> +						 &iter->batch[find_cookie],
> +						 end_cookie - find_cookie);
> +		udp_portaddr_for_each_entry_from(sk) {
>  			if (seq_sk_match(seq, sk)) {
> -				/* Resume from the last iterated socket at the
> -				 * offset in the bucket before iterator was stopped.
> -				 */
> -				if (state->bucket == resume_bucket &&
> -				    iter->offset < resume_offset) {
> -					++iter->offset;
> -					continue;
> -				}
>  				if (iter->end_sk < iter->max_sk) {
>  					sock_hold(sk);
>  					iter->batch[iter->end_sk++].sock = sk;
> @@ -3494,10 +3522,8 @@ static void *bpf_iter_udp_seq_next(struct seq_file *seq, void *v, loff_t *pos)
>  	/* Whenever seq_next() is called, the iter->cur_sk is
>  	 * done with seq_show(), so unref the iter->cur_sk.
>  	 */
> -	if (iter->cur_sk < iter->end_sk) {
> +	if (iter->cur_sk < iter->end_sk)
>  		sock_put(iter->batch[iter->cur_sk++].sock);
> -		++iter->offset;
> -	}
>
>  	/* After updating iter->cur_sk, check if there are more sockets
>  	 * available in the current bucket batch.
> @@ -3567,8 +3593,19 @@ static int bpf_iter_udp_seq_show(struct seq_file *seq, void *v)
>
>  static void bpf_iter_udp_put_batch(struct bpf_udp_iter_state *iter)
>  {
> -	while (iter->cur_sk < iter->end_sk)
> -		sock_put(iter->batch[iter->cur_sk++].sock);
> +	union bpf_udp_iter_batch_item *item;
> +	unsigned int cur_sk = iter->cur_sk;
> +	__u64 cookie;
> +
> +	/* Remember the cookies of the sockets we haven't seen yet, so we can
> +	 * pick up where we left off next time around.
> +	 */
> +	while (cur_sk < iter->end_sk) {
> +		item = &iter->batch[cur_sk++];
> +		cookie = __sock_gen_cookie(item->sock);
> +		sock_put(item->sock);
> +		item->cookie = cookie;
> +	}
>  }
>
>  static void bpf_iter_udp_seq_stop(struct seq_file *seq, void *v)
> @@ -3839,6 +3876,11 @@ static int bpf_iter_udp_realloc_batch(struct bpf_udp_iter_state *iter,
>  		return -ENOMEM;
>
>  	bpf_iter_udp_put_batch(iter);
> +	WARN_ON_ONCE(new_batch_sz < iter->max_sk);
> +	/* Make sure the new batch has the cookies of the sockets we haven't
> +	 * visited yet.
> +	 */
> +	memcpy(new_batch, iter->batch, iter->end_sk);
>  	kvfree(iter->batch);
>  	iter->batch = new_batch;
>  	iter->max_sk = new_batch_sz;
> --
> 2.43.0
>
* Re: [RFC PATCH bpf-next 2/3] bpf: udp: Avoid socket skips and repeats during iteration
2025-04-04 23:20 ` Kuniyuki Iwashima
@ 2025-04-07 23:30 ` Jordan Rife
2025-04-08 0:16 ` Kuniyuki Iwashima
0 siblings, 1 reply; 12+ messages in thread
From: Jordan Rife @ 2025-04-07 23:30 UTC (permalink / raw)
To: Kuniyuki Iwashima
Cc: aditi.ghag, bpf, daniel, martin.lau, netdev, willemdebruijn.kernel

> We may need to iterate all visited sockets again in this bucket if all
> unvisited sockets disappear from the previous iteration.

If the next socket disappears between iterator stop and start, the
outer loop would need to keep going until it finds a socket from last
time that still exists. In most cases, it seems unlikely that the next
socket will disappear between iterator reads, so in general the outer
loop would only need to iterate once; the common case should perform
the same as before with the offset approach. The worst case indeed
would be if all the sockets disappear between reads. Then you'd have
to scan through all items in the bucket n_cookies times. Again though,
this is hopefully a rare case.

> When the number of the unvisited sockets is small like 1, the duplicated
> records will not be rare and rather more often than before ?

Sorry if I'm missing something, but what's the relationship between
the number of unvisited sockets and rarity of duplicated records?

-Jordan
* Re: [RFC PATCH bpf-next 2/3] bpf: udp: Avoid socket skips and repeats during iteration
2025-04-07 23:30 ` Jordan Rife
@ 2025-04-08 0:16 ` Kuniyuki Iwashima
2025-04-08 2:39 ` Jordan Rife
0 siblings, 1 reply; 12+ messages in thread
From: Kuniyuki Iwashima @ 2025-04-08 0:16 UTC (permalink / raw)
To: jordan
Cc: aditi.ghag, bpf, daniel, kuniyu, martin.lau, netdev,
willemdebruijn.kernel

From: Jordan Rife <jordan@jrife.io>
Date: Mon, 7 Apr 2025 16:30:46 -0700
> > We may need to iterate all visited sockets again in this bucket if all
> > unvisited sockets disappear from the previous iteration.
>
> If the next socket disappears between iterator stop and start, the
> outer loop would need to keep going until it finds a socket from last
> time that still exists. In most cases, it seems unlikely that the next
> socket will disappear between iterator reads, so in general the outer
> loop would only need to iterate once; the common case should perform
> the same as before with the offset approach. The worst case indeed
> would be if all the sockets disappear between reads. Then you'd have
> to scan through all items in the bucket n_cookies times. Again though,
> this is hopefully a rare case.
>
> > When the number of the unvisited sockets is small like 1, the duplicated
> > records will not be rare and rather more often than before ?
>
> Sorry if I'm missing something, but what's the relationship between
> the number of unvisited sockets and rarity of duplicated records?

Sorry, I misread the code, and s/duplicated/skipped/.

I was thinking that the rarity of such unwanted events depends on how
many unvisited sockets are left before restarting. Let's say the batch
has 16 sockets and the iterator stopped at 15; it's more likely that a
single socket disappears.

This should be fine given the batch size normally covers the full bucket
of the hash, and it's unlikely that many sockets are added in the bucket
between stop and restart.

In the worst case, where vmalloc() fails and the batch does not
cover the full bucket, say the batch size is 16 but the list length
is 256, if the iterator stops at sk15 and sk16 disappears,
sk17 ~ sk256 will be skipped in the next iteration.

  sk1 -> ... sk15 -> sk16 -> sk17 -> ... -> sk256
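The worst case described above can be reproduced with a small userspace model. This is a sketch under assumptions: the bucket is an array of cookies in list order, a zero entry marks a removed socket, and visited_after_resume is a hypothetical helper that mimics resuming from a single saved cookie, with a missed lookup meaning the rest of the bucket is not visited.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BUCKET_LEN 256

/*
 * Count the live sockets that would be visited after resuming from
 * saved_cookie. If the cookie is no longer in the bucket, iteration resumes
 * past the end and nothing more is visited.
 */
static size_t visited_after_resume(const uint64_t *bucket, size_t len,
				   uint64_t saved_cookie)
{
	size_t start = len;
	size_t visited = 0;

	assert(saved_cookie != 0);
	for (size_t i = 0; i < len; i++) {
		if (bucket[i] == saved_cookie) {
			start = i;
			break;
		}
	}
	for (size_t i = start; i < len; i++)
		if (bucket[i])
			visited++;
	return visited;
}
```

With cookies 1..256 and a saved cookie of 16, the remaining 241 sockets are visited. Once sk16 is removed, the lookup misses and sk17 through sk256 are all skipped.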
* Re: [RFC PATCH bpf-next 2/3] bpf: udp: Avoid socket skips and repeats during iteration
2025-04-08 0:16 ` Kuniyuki Iwashima
@ 2025-04-08 2:39 ` Jordan Rife
2025-04-08 5:23 ` Martin KaFai Lau
0 siblings, 1 reply; 12+ messages in thread
From: Jordan Rife @ 2025-04-08 2:39 UTC (permalink / raw)
To: Kuniyuki Iwashima
Cc: aditi.ghag, bpf, daniel, martin.lau, netdev, willemdebruijn.kernel

> In the worst case, where vmalloc() fails and the batch does not
> cover the full bucket, say the batch size is 16 but the list length
> is 256, if the iterator stops at sk15 and sk16 disappears,
> sk17 ~ sk256 will be skipped in the next iteration.
>
>   sk1 -> ... sk15 -> sk16 -> sk17 -> ... -> sk256

Ah yes, this is true. Thank you for clarifying; you bring up a good
point. If vmalloc() fails, the batch can't cover the whole bucket in
one go, and none of the saved cookies from last time are in the bucket,
there's currently no great option. You'd need to do one of the
following:

1) Start from the beginning of the list, assuming none of the sockets
had been seen so far. This risks repeating sockets you've already
seen, however.
2) Skip the rest of the sockets to avoid repeating sockets you've
already seen. You might skip sockets that you didn't want to skip.

I actually wonder if a third option might be better in this case though:

3) If vmalloc fails, propagate ENOMEM up to userspace and stop
iteration instead of making the tradeoff of possibly repeating or
skipping sockets. seq_read can already return ENOMEM in some cases, so
IMO this feels more correct. WDYT?

-Jordan
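Option 3 could look something like the following in a userspace model. This is a sketch, not the kernel change: struct batch, batch_grow, alloc_fn and failing_alloc are hypothetical, and an injectable allocator stands in for a vmalloc() failure. The saved cookies are copied with a length in bytes (element count times element size), and the old array is left untouched on failure so the caller can stop and report the error.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical batch of saved cookies. */
struct batch {
	uint64_t *items;
	size_t len;	/* number of saved cookies */
	size_t cap;	/* allocated slots */
};

typedef void *(*alloc_fn)(size_t);

/* Allocator that always fails, used to model an allocation failure. */
static void *failing_alloc(size_t size)
{
	(void)size;
	return NULL;
}

/*
 * Grow the batch to new_cap slots, carrying the saved cookies over.
 * Returns 0 on success or -ENOMEM, in which case the batch is unchanged.
 */
static int batch_grow(struct batch *b, size_t new_cap, alloc_fn alloc)
{
	uint64_t *items;

	assert(alloc != NULL);
	if (new_cap <= b->cap)
		return 0;
	items = alloc(new_cap * sizeof(*items));
	if (!items)
		return -ENOMEM;
	if (b->len)
		memcpy(items, b->items, b->len * sizeof(*items));
	free(b->items);
	b->items = items;
	b->cap = new_cap;
	return 0;
}
```

A caller that sees -ENOMEM would end the read with that error rather than guess where to resume, which avoids both of the tradeoffs in options 1 and 2.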
* Re: [RFC PATCH bpf-next 2/3] bpf: udp: Avoid socket skips and repeats during iteration
2025-04-08 2:39 ` Jordan Rife
@ 2025-04-08 5:23 ` Martin KaFai Lau
2025-04-09 0:11 ` Jordan Rife
0 siblings, 1 reply; 12+ messages in thread
From: Martin KaFai Lau @ 2025-04-08 5:23 UTC (permalink / raw)
To: Jordan Rife
Cc: Kuniyuki Iwashima, aditi.ghag, bpf, daniel, netdev,
willemdebruijn.kernel

On 4/7/25 7:39 PM, Jordan Rife wrote:
> 3) If vmalloc fails, propagate ENOMEM up to userspace and stop
> iteration instead of making the tradeoff of possibly repeating or
> skipping sockets. seq_read can already return ENOMEM in some cases, so
> IMO this feels more correct. WDYT?

Agree that this is better.

The stop() may need to handle the case where start()/next() fails. Take
a look at the bpf_seq_read() in bpf_iter.c. Please check.
* Re: [RFC PATCH bpf-next 2/3] bpf: udp: Avoid socket skips and repeats during iteration
2025-04-08 5:23 ` Martin KaFai Lau
@ 2025-04-09 0:11 ` Jordan Rife
0 siblings, 0 replies; 12+ messages in thread
From: Jordan Rife @ 2025-04-09 0:11 UTC (permalink / raw)
To: Martin KaFai Lau
Cc: Kuniyuki Iwashima, aditi.ghag, bpf, daniel, netdev,
willemdebruijn.kernel

> Agree that this is better.
> The stop() may need to take care of the start()/next() may fail. Take a look at
> the bpf_seq_read() in bpf_iter.c. Please check.

Thanks, I will take a look.

-Jordan
* Re: [RFC PATCH bpf-next 2/3] bpf: udp: Avoid socket skips and repeats during iteration
2025-04-04 22:02 ` [RFC PATCH bpf-next 2/3] bpf: udp: Avoid socket skips and repeats during iteration Jordan Rife
2025-04-04 23:20 ` Kuniyuki Iwashima
@ 2025-04-07 21:56 ` Martin KaFai Lau
2025-04-07 23:39 ` Jordan Rife
1 sibling, 1 reply; 12+ messages in thread
From: Martin KaFai Lau @ 2025-04-07 21:56 UTC (permalink / raw)
To: Jordan Rife; +Cc: netdev, bpf, Aditi Ghag, Daniel Borkmann, Willem de Bruijn

On 4/4/25 3:02 PM, Jordan Rife wrote:
> +static struct sock *bpf_iter_udp_resume(struct sock *first_sk,
> +					union bpf_udp_iter_batch_item *cookies,
> +					int n_cookies)
> +{
> +	struct sock *sk = NULL;
> +	int i = 0;
> +
> +	for (; i < n_cookies; i++) {
> +		sk = first_sk;
> +		udp_portaddr_for_each_entry_from(sk)
> +			if (cookies[i].cookie == atomic64_read(&sk->sk_cookie))
> +				goto done;
> +	}
> +done:
> +	return sk;
> +}
> +
>  static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
>  {
>  	struct bpf_udp_iter_state *iter = seq->private;
>  	struct udp_iter_state *state = &iter->state;
> +	unsigned int find_cookie, end_cookie = 0;
>  	struct net *net = seq_file_net(seq);
> -	int resume_bucket, resume_offset;
>  	struct udp_table *udptable;
>  	unsigned int batch_sks = 0;
>  	bool resized = false;
> +	int resume_bucket;
>  	struct sock *sk;
>
>  	resume_bucket = state->bucket;
> -	resume_offset = iter->offset;
>
>  	/* The current batch is done, so advance the bucket. */
>  	if (iter->st_bucket_done)
> @@ -3428,6 +3446,8 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
>  	 * before releasing the bucket lock. This allows BPF programs that are
>  	 * called in seq_show to acquire the bucket lock if needed.
>  	 */
> +	find_cookie = iter->cur_sk;
> +	end_cookie = iter->end_sk;
>  	iter->cur_sk = 0;
>  	iter->end_sk = 0;
>  	iter->st_bucket_done = false;
> @@ -3439,18 +3459,26 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
>  		if (hlist_empty(&hslot2->head))
>  			continue;
>
> -		iter->offset = 0;
>  		spin_lock_bh(&hslot2->lock);
> -		udp_portaddr_for_each_entry(sk, &hslot2->head) {
> +		/* Initialize sk to the first socket in hslot2. */
> +		udp_portaddr_for_each_entry(sk, &hslot2->head)
> +			break;

nit. It is to get the first entry? May be directly do
hlist_entry_safe(hslot2->head.first, ... ) instead.

> +		/* Resume from the first (in iteration order) unseen socket from
> +		 * the last batch that still exists in resume_bucket. Most of
> +		 * the time this will just be where the last iteration left off
> +		 * in resume_bucket unless that socket disappeared between
> +		 * reads.
> +		 *
> +		 * Skip this if end_cookie isn't set; this is the first
> +		 * batch, we're on bucket zero, and we want to start from the
> +		 * beginning.
> +		 */
> +		if (state->bucket == resume_bucket && end_cookie)
> +			sk = bpf_iter_udp_resume(sk,
> +						 &iter->batch[find_cookie],
> +						 end_cookie - find_cookie);
> +		udp_portaddr_for_each_entry_from(sk) {
>  			if (seq_sk_match(seq, sk)) {
> -				/* Resume from the last iterated socket at the
> -				 * offset in the bucket before iterator was stopped.
> -				 */
> -				if (state->bucket == resume_bucket &&
> -				    iter->offset < resume_offset) {
> -					++iter->offset;
> -					continue;
> -				}
>  				if (iter->end_sk < iter->max_sk) {
>  					sock_hold(sk);
>  					iter->batch[iter->end_sk++].sock = sk;

I looked at the details for these two functions. The approach looks good
to me. Thanks for trying it.

This should stop the potential duplicates during stop() and then
re-start().

My understanding is that it may or may not batch something newer than the last
stop(). This behavior should be similar to the current offset approach also. I
think it is fine. The similar situation is true for the next bucket anyway.
* Re: [RFC PATCH bpf-next 2/3] bpf: udp: Avoid socket skips and repeats during iteration
2025-04-07 21:56 ` Martin KaFai Lau
@ 2025-04-07 23:39 ` Jordan Rife
0 siblings, 0 replies; 12+ messages in thread
From: Jordan Rife @ 2025-04-07 23:39 UTC (permalink / raw)
To: Martin KaFai Lau
Cc: netdev, bpf, Aditi Ghag, Daniel Borkmann, Willem de Bruijn

> nit. It is to get the first entry? May be directly do
> hlist_entry_safe(hslot2->head.first, ... ) instead.

Sure, I can change this and drop the RFC tag for the next iteration of
this series.

> My understanding is that it may or may not batch something newer than the last
> stop(). This behavior should be similar to the current offset approach also. I
> think it is fine. The similar situation is true for the next bucket anyway.

Assuming it's rare that the first unvisited socket disappears between
stop and start, which seems like a reasonable assumption, you should
generally only need to scan through the list once to find that socket
(similar amount of work to offset). Worst case is if every socket from
last time is no longer there. Then you'd end up scanning through the
full list end_cookie - find_cookie times.

And yeah, I think the iterator shouldn't really care if new sockets are
seen or not as long as you see all sockets that were there when you
started iterating.

-Jordan
* [RFC PATCH bpf-next 3/3] selftests/bpf: Add tests for bucket resume logic in UDP socket iterators
2025-04-04 22:02 [RFC PATCH bpf-next 0/3] Exactly-once UDP socket iteration Jordan Rife
2025-04-04 22:02 ` [RFC PATCH bpf-next 1/3] bpf: udp: Use bpf_udp_iter_batch_item for bpf_udp_iter_state batch items Jordan Rife
2025-04-04 22:02 ` [RFC PATCH bpf-next 2/3] bpf: udp: Avoid socket skips and repeats during iteration Jordan Rife
@ 2025-04-04 22:02 ` Jordan Rife
2 siblings, 0 replies; 12+ messages in thread
From: Jordan Rife @ 2025-04-04 22:02 UTC (permalink / raw)
To: netdev, bpf
Cc: Jordan Rife, Aditi Ghag, Daniel Borkmann, Martin KaFai Lau,
Willem de Bruijn

First, extend the iter_udp_soreuse and iter_tcp_soreuse programs to
write the cookie of the current socket so that we can track the identity
of the sockets that the iterator has seen so far in the test code.
Update the existing do_test function to account for this change to the
iterator program output.

Next, introduce a set of tests that exercise various bucket resume
scenarios:

* remove_seen resumes iteration after removing a socket from the bucket
  that we've already processed. Before, with the offset-based approach,
  this test would have skipped an unseen socket after resuming
  iteration. With the cookie-based approach, we now see all sockets
  exactly once.
* remove_unseen exercises the condition where the next socket that we
  would have seen is removed from the bucket before we resume iteration.
  This tests the scenario where we need to scan past the first cookie in
  our remembered cookies list to find the socket from which to resume
  iteration.
* remove_all exercises the condition where all sockets we remembered
  were removed from the bucket to make sure iteration terminates and
  returns no more results.
* add_some exercises the condition where a few, but not enough to
  trigger a realloc, sockets are added to the head of the current bucket
  between reads.
  Before, with the offset-based approach, this test would have repeated
  sockets we've already seen. With the cookie-based approach, we now see
  all sockets exactly once.
* force_realloc exercises the condition that we need to realloc the
  batch on a subsequent read, since more sockets than can be held in the
  current batch array were added to the current bucket. This exercises
  the logic inside bpf_iter_udp_realloc_batch that copies cookies into
  the new batch to make sure nothing is skipped or repeated.

Signed-off-by: Jordan Rife <jordan@jrife.io>
---
 .../bpf/prog_tests/sock_iter_batch.c          | 452 +++++++++++++++++-
 .../selftests/bpf/progs/bpf_tracing_net.h     |   1 +
 .../selftests/bpf/progs/sock_iter_batch.c     |  24 +-
 3 files changed, 460 insertions(+), 17 deletions(-)

diff --git a/tools/testing/selftests/bpf/prog_tests/sock_iter_batch.c b/tools/testing/selftests/bpf/prog_tests/sock_iter_batch.c
index d56e18b25528..dc44115bd078 100644
--- a/tools/testing/selftests/bpf/prog_tests/sock_iter_batch.c
+++ b/tools/testing/selftests/bpf/prog_tests/sock_iter_batch.c
@@ -1,20 +1,444 @@
 // SPDX-License-Identifier: GPL-2.0
 // Copyright (c) 2024 Meta
 
+#include "linux/bpf.h"
 #include <test_progs.h>
 #include "network_helpers.h"
 #include "sock_iter_batch.skel.h"
 
 #define TEST_NS "sock_iter_batch_netns"
 
+static const int init_batch_size = 16;
 static const int nr_soreuse = 4;
 
+struct iter_out {
+	int idx;
+	__u64 cookie;
+} __packed;
+
+struct sock_count {
+	__u64 cookie;
+	int count;
+};
+
+static int insert(__u64 cookie, struct sock_count counts[], int counts_len)
+{
+	int insert = -1;
+	int i = 0;
+
+	for (; i < counts_len; i++) {
+		if (!counts[i].cookie) {
+			insert = i;
+		} else if (counts[i].cookie == cookie) {
+			insert = i;
+			break;
+		}
+	}
+	if (insert < 0)
+		return insert;
+
+	counts[insert].cookie = cookie;
+	counts[insert].count++;
+
+	return counts[insert].count;
+}
+
+static int read_n(int iter_fd, int n, struct sock_count counts[],
+		  int counts_len)
+{
+	struct iter_out out;
+	int nread = 1;
+	int i = 0;
+
+	for (; nread > 0 && (n < 0 || i < n); i++) {
+		nread = read(iter_fd, &out, sizeof(out));
+		if (!nread || !ASSERT_GE(nread, 1, "nread"))
+			break;
+		ASSERT_GE(insert(out.cookie, counts, counts_len), 0, "insert");
+	}
+
+	ASSERT_TRUE(n < 0 || i == n, "n < 0 || i == n");
+
+	return i;
+}
+
+static __u64 socket_cookie(int fd)
+{
+	__u64 cookie;
+	socklen_t cookie_len = sizeof(cookie);
+	static __u32 duration; /* for CHECK macro */
+
+	if (CHECK(getsockopt(fd, SOL_SOCKET, SO_COOKIE, &cookie, &cookie_len) < 0,
+		  "getsockopt(SO_COOKIE)", "%s\n", strerror(errno)))
+		return 0;
+	return cookie;
+}
+
+static bool was_seen(int fd, struct sock_count counts[], int counts_len)
+{
+	__u64 cookie = socket_cookie(fd);
+	int i = 0;
+
+	for (; cookie && i < counts_len; i++)
+		if (cookie == counts[i].cookie)
+			return true;
+
+	return false;
+}
+
+static int get_seen_socket(int *fds, struct sock_count counts[], int n)
+{
+	int i = 0;
+
+	for (; i < n; i++)
+		if (was_seen(fds[i], counts, n))
+			return i;
+	return -1;
+}
+
+static int get_nth_socket(int *fds, int fds_len, struct bpf_link *link, int n)
+{
+	int i, nread, iter_fd;
+	int nth_sock_idx = -1;
+	struct iter_out out;
+
+	iter_fd = bpf_iter_create(bpf_link__fd(link));
+	if (!ASSERT_GE(iter_fd, 0, "bpf_iter_create"))
+		return -1;
+
+	for (; n >= 0; n--) {
+		nread = read(iter_fd, &out, sizeof(out));
+		if (!nread || !ASSERT_GE(nread, 1, "nread"))
+			goto done;
+	}
+
+	for (i = 0; i < fds_len && nth_sock_idx < 0; i++)
+		if (fds[i] >= 0 && socket_cookie(fds[i]) == out.cookie)
+			nth_sock_idx = i;
+done:
+	if (iter_fd >= 0)
+		close(iter_fd);
+	return nth_sock_idx;
+}
+
+static int get_seen_count(int fd, struct sock_count counts[], int n)
+{
+	__u64 cookie = socket_cookie(fd);
+	int count = 0;
+	int i = 0;
+
+	for (; cookie && !count && i < n; i++)
+		if (cookie == counts[i].cookie)
+			count = counts[i].count;
+
+	return count;
+}
+
+static void check_n_were_seen_once(int *fds, int fds_len, int n,
+				   struct sock_count counts[], int counts_len)
+{
+	int seen_once = 0;
+	int seen_cnt;
+	int i = 0;
+
+	for (; i < fds_len; i++) {
+		/* Skip any sockets that were closed or that weren't seen
+		 * exactly once.
+		 */
+		if (fds[i] < 0)
+			continue;
+		seen_cnt = get_seen_count(fds[i], counts, counts_len);
+		if (seen_cnt && ASSERT_EQ(seen_cnt, 1, "seen_cnt"))
+			seen_once++;
+	}
+
+	ASSERT_EQ(seen_once, n, "seen_once");
+}
+
+static void remove_seen(int family, int sock_type, const char *addr, __u16 port,
+			int *socks, int socks_len, struct sock_count *counts,
+			int counts_len, struct bpf_link *link, int iter_fd)
+{
+	int close_idx;
+
+	/* Iterate through the first socks_len - 1 sockets. */
+	read_n(iter_fd, socks_len - 1, counts, counts_len);
+
+	/* Make sure we saw socks_len - 1 sockets exactly once. */
+	check_n_were_seen_once(socks, socks_len, socks_len - 1, counts,
+			       counts_len);
+
+	/* Close a socket we've already seen to remove it from the bucket. */
+	close_idx = get_seen_socket(socks, counts, counts_len);
+	if (!ASSERT_GE(close_idx, 0, "close_idx"))
+		return;
+	close(socks[close_idx]);
+	socks[close_idx] = -1;
+
+	/* Iterate through the rest of the sockets. */
+	read_n(iter_fd, -1, counts, counts_len);
+
+	/* Make sure the last socket wasn't skipped and that there were no
+	 * repeats.
+	 */
+	check_n_were_seen_once(socks, socks_len, socks_len - 1, counts,
+			       counts_len);
+}
+
+static void remove_unseen(int family, int sock_type, const char *addr,
+			  __u16 port, int *socks, int socks_len,
+			  struct sock_count *counts, int counts_len,
+			  struct bpf_link *link, int iter_fd)
+{
+	int close_idx;
+
+	/* Iterate through the first socket. */
+	read_n(iter_fd, 1, counts, counts_len);
+
+	/* Make sure we saw a socket from fds. */
+	check_n_were_seen_once(socks, socks_len, 1, counts, counts_len);
+
+	/* Close what would be the next socket in the bucket to exercise the
+	 * condition where we need to skip past the first cookie we remembered.
+	 */
+	close_idx = get_nth_socket(socks, socks_len, link, 1);
+	if (!ASSERT_GE(close_idx, 0, "close_idx"))
+		return;
+	close(socks[close_idx]);
+	socks[close_idx] = -1;
+
+	/* Iterate through the rest of the sockets. */
+	read_n(iter_fd, -1, counts, counts_len);
+
+	/* Make sure the remaining sockets were seen exactly once and that we
+	 * didn't repeat the socket that was already seen.
+	 */
+	check_n_were_seen_once(socks, socks_len, socks_len - 1, counts,
+			       counts_len);
+}
+
+static void remove_all(int family, int sock_type, const char *addr,
+		       __u16 port, int *socks, int socks_len,
+		       struct sock_count *counts, int counts_len,
+		       struct bpf_link *link, int iter_fd)
+{
+	int close_idx, i;
+
+	/* Iterate through the first socket. */
+	read_n(iter_fd, 1, counts, counts_len);
+
+	/* Make sure we saw a socket from fds. */
+	check_n_were_seen_once(socks, socks_len, 1, counts, counts_len);
+
+	/* Close all remaining sockets to exhaust the list of saved cookies and
+	 * exit without putting any sockets into the batch on the next read.
+	 */
+	for (i = 0; i < socks_len - 1; i++) {
+		close_idx = get_nth_socket(socks, socks_len, link, 1);
+		if (!ASSERT_GE(close_idx, 0, "close_idx"))
+			return;
+		close(socks[close_idx]);
+		socks[close_idx] = -1;
+	}
+
+	/* Make sure there are no more sockets returned */
+	ASSERT_EQ(read_n(iter_fd, -1, counts, counts_len), 0, "read_n");
+}
+
+static void add_some(int family, int sock_type, const char *addr, __u16 port,
+		     int *socks, int socks_len, struct sock_count *counts,
+		     int counts_len, struct bpf_link *link, int iter_fd)
+{
+	int *new_socks = NULL;
+
+	/* Iterate through the first socks_len - 1 sockets. */
+	read_n(iter_fd, socks_len - 1, counts, counts_len);
+
+	/* Make sure we saw socks_len - 1 sockets exactly once. */
+	check_n_were_seen_once(socks, socks_len, socks_len - 1, counts,
+			       counts_len);
+
+	/* Double the number of sockets in the bucket. */
+	new_socks = start_reuseport_server(family, sock_type, addr, port, 0,
+					   socks_len);
+	if (!ASSERT_OK_PTR(new_socks, "start_reuseport_server"))
+		goto done;
+
+	/* Iterate through the rest of the sockets. */
+	read_n(iter_fd, -1, counts, counts_len);
+
+	/* Make sure each of the original sockets was seen exactly once. */
+	check_n_were_seen_once(socks, socks_len, socks_len, counts,
+			       counts_len);
+done:
+	if (new_socks)
+		free_fds(new_socks, socks_len);
+}
+
+static void force_realloc(int family, int sock_type, const char *addr,
+			  __u16 port, int *socks, int socks_len,
+			  struct sock_count *counts, int counts_len,
+			  struct bpf_link *link, int iter_fd)
+{
+	int *new_socks = NULL;
+
+	/* Iterate through the first socket just to initialize the batch. */
+	read_n(iter_fd, 1, counts, counts_len);
+
+	/* Double the number of sockets in the bucket to force a realloc on the
+	 * next read.
+	 */
+	new_socks = start_reuseport_server(family, sock_type, addr, port, 0,
+					   socks_len);
+	if (!ASSERT_OK_PTR(new_socks, "start_reuseport_server"))
+		goto done;
+
+	/* Iterate through the rest of the sockets. */
+	read_n(iter_fd, -1, counts, counts_len);
+
+	/* Make sure each socket from the first set was seen exactly once. */
+	check_n_were_seen_once(socks, socks_len, socks_len, counts,
+			       counts_len);
+done:
+	if (new_socks)
+		free_fds(new_socks, socks_len);
+}
+
+struct test_case {
+	void (*test)(int family, int sock_type, const char *addr, __u16 port,
+		     int *socks, int socks_len, struct sock_count *counts,
+		     int counts_len, struct bpf_link *link, int iter_fd);
+	const char *description;
+	int init_socks;
+	int max_socks;
+	int sock_type;
+	int family;
+};
+
+static struct test_case resume_tests[] = {
+	{
+		.description = "udp: resume after removing a seen socket",
+		.init_socks = nr_soreuse,
+		.max_socks = nr_soreuse,
+		.sock_type = SOCK_DGRAM,
+		.family = AF_INET6,
+		.test = remove_seen,
+	},
+	{
+		.description = "udp: resume after removing one unseen socket",
+		.init_socks = nr_soreuse,
+		.max_socks = nr_soreuse,
+		.sock_type = SOCK_DGRAM,
+		.family = AF_INET6,
+		.test = remove_unseen,
+	},
+	{
+		.description = "udp: resume after removing all unseen sockets",
+		.init_socks = nr_soreuse,
+		.max_socks = nr_soreuse,
+		.sock_type = SOCK_DGRAM,
+		.family = AF_INET6,
+		.test = remove_all,
+	},
+	{
+		.description = "udp: resume after adding a few sockets",
+		.init_socks = nr_soreuse,
+		.max_socks = nr_soreuse,
+		.sock_type = SOCK_DGRAM,
+		/* Use AF_INET so that new sockets are added to the head of the
+		 * bucket's list.
+		 */
+		.family = AF_INET,
+		.test = add_some,
+	},
+	{
+		.description = "udp: force a realloc to occur",
+		.init_socks = init_batch_size,
+		.max_socks = init_batch_size * 2,
+		.sock_type = SOCK_DGRAM,
+		/* Use AF_INET6 so that new sockets are added to the tail of the
+		 * bucket's list, needing to be added to the next batch to force
+		 * a realloc.
+		 */
+		.family = AF_INET6,
+		.test = force_realloc,
+	},
+};
+
+static void do_resume_test(struct test_case *tc)
+{
+	static const __u16 port = 10001;
+	struct bpf_link *link = NULL;
+	struct sock_iter_batch *skel;
+	struct sock_count *counts;
+	int err, iter_fd = -1;
+	const char *addr;
+	int *fds;
+
+	counts = calloc(tc->max_socks, sizeof(*counts));
+	if (!counts)
+		return;
+	skel = sock_iter_batch__open();
+	if (!ASSERT_OK_PTR(skel, "sock_iter_batch__open"))
+		return;
+
+	/* Prepare a bucket of sockets in the kernel hashtable */
+	int local_port;
+
+	addr = tc->family == AF_INET6 ? "::1" : "127.0.0.1";
+	fds = start_reuseport_server(tc->family, tc->sock_type, addr, port, 0,
+				     tc->init_socks);
+	if (!ASSERT_OK_PTR(fds, "start_reuseport_server"))
+		goto done;
+	local_port = get_socket_local_port(*fds);
+	if (!ASSERT_GE(local_port, 0, "get_socket_local_port"))
+		goto done;
+	skel->rodata->ports[0] = ntohs(local_port);
+	skel->rodata->sf = tc->family;
+
+	err = sock_iter_batch__load(skel);
+	if (!ASSERT_OK(err, "sock_iter_batch__load"))
+		goto done;
+
+	link = bpf_program__attach_iter(tc->sock_type == SOCK_STREAM ?
+					skel->progs.iter_tcp_soreuse :
+					skel->progs.iter_udp_soreuse,
+					NULL);
+	if (!ASSERT_OK_PTR(link, "bpf_program__attach_iter"))
+		goto done;
+
+	iter_fd = bpf_iter_create(bpf_link__fd(link));
+	if (!ASSERT_GE(iter_fd, 0, "bpf_iter_create"))
+		goto done;
+
+	tc->test(tc->family, tc->sock_type, addr, port, fds, tc->init_socks,
+		 counts, tc->max_socks, link, iter_fd);
+done:
+	free_fds(fds, tc->init_socks);
+	if (iter_fd >= 0)
+		close(iter_fd);
+	bpf_link__destroy(link);
+	sock_iter_batch__destroy(skel);
+}
+
+static void do_resume_tests(void)
+{
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(resume_tests); i++) {
+		if (test__start_subtest(resume_tests[i].description)) {
+			do_resume_test(&resume_tests[i]);
+		}
+	}
+}
+
 static void do_test(int sock_type, bool onebyone)
 {
 	int err, i, nread, to_read, total_read, iter_fd = -1;
-	int first_idx, second_idx, indices[nr_soreuse];
+	struct iter_out outputs[nr_soreuse];
 	struct bpf_link *link = NULL;
 	struct sock_iter_batch *skel;
+	int first_idx, second_idx;
 	int *fds[2] = {};
 
 	skel = sock_iter_batch__open();
@@ -34,6 +458,7 @@ static void do_test(int sock_type, bool onebyone)
 			goto done;
 		skel->rodata->ports[i] = ntohs(local_port);
 	}
+	skel->rodata->sf = AF_INET6;
 
 	err = sock_iter_batch__load(skel);
 	if (!ASSERT_OK(err, "sock_iter_batch__load"))
@@ -55,38 +480,38 @@ static void do_test(int sock_type, bool onebyone)
 	 * from a bucket and leave one socket out from
 	 * that bucket on purpose.
 	 */
-	to_read = (nr_soreuse - 1) * sizeof(*indices);
+	to_read = (nr_soreuse - 1) * sizeof(*outputs);
 	total_read = 0;
 	first_idx = -1;
 	do {
-		nread = read(iter_fd, indices, onebyone ? sizeof(*indices) : to_read);
-		if (nread <= 0 || nread % sizeof(*indices))
+		nread = read(iter_fd, outputs, onebyone ? sizeof(*outputs) : to_read);
+		if (nread <= 0 || nread % sizeof(*outputs))
 			break;
 		total_read += nread;
 
 		if (first_idx == -1)
-			first_idx = indices[0];
-		for (i = 0; i < nread / sizeof(*indices); i++)
-			ASSERT_EQ(indices[i], first_idx, "first_idx");
+			first_idx = outputs[0].idx;
+		for (i = 0; i < nread / sizeof(*outputs); i++)
+			ASSERT_EQ(outputs[i].idx, first_idx, "first_idx");
 	} while (total_read < to_read);
-	ASSERT_EQ(nread, onebyone ? sizeof(*indices) : to_read, "nread");
+	ASSERT_EQ(nread, onebyone ? sizeof(*outputs) : to_read, "nread");
 	ASSERT_EQ(total_read, to_read, "total_read");
 
 	free_fds(fds[first_idx], nr_soreuse);
 	fds[first_idx] = NULL;
 
 	/* Read the "whole" second bucket */
-	to_read = nr_soreuse * sizeof(*indices);
+	to_read = nr_soreuse * sizeof(*outputs);
 	total_read = 0;
 	second_idx = !first_idx;
 	do {
-		nread = read(iter_fd, indices, onebyone ? sizeof(*indices) : to_read);
-		if (nread <= 0 || nread % sizeof(*indices))
+		nread = read(iter_fd, outputs, onebyone ? sizeof(*outputs) : to_read);
+		if (nread <= 0 || nread % sizeof(*outputs))
 			break;
 		total_read += nread;
 
-		for (i = 0; i < nread / sizeof(*indices); i++)
-			ASSERT_EQ(indices[i], second_idx, "second_idx");
+		for (i = 0; i < nread / sizeof(*outputs); i++)
+			ASSERT_EQ(outputs[i].idx, second_idx, "second_idx");
 	} while (total_read <= to_read);
 	ASSERT_EQ(nread, 0, "nread");
 	/* Both so_reuseport ports should be in different buckets, so
@@ -128,6 +553,7 @@ void test_sock_iter_batch(void)
 		do_test(SOCK_DGRAM, true);
 		do_test(SOCK_DGRAM, false);
 	}
+	do_resume_tests();
 	close_netns(nstoken);
 
 done:
diff --git a/tools/testing/selftests/bpf/progs/bpf_tracing_net.h b/tools/testing/selftests/bpf/progs/bpf_tracing_net.h
index 659694162739..17db400f0e0d 100644
--- a/tools/testing/selftests/bpf/progs/bpf_tracing_net.h
+++ b/tools/testing/selftests/bpf/progs/bpf_tracing_net.h
@@ -128,6 +128,7 @@
 #define sk_refcnt		__sk_common.skc_refcnt
 #define sk_state		__sk_common.skc_state
 #define sk_net			__sk_common.skc_net
+#define sk_rcv_saddr		__sk_common.skc_rcv_saddr
 #define sk_v6_daddr		__sk_common.skc_v6_daddr
 #define sk_v6_rcv_saddr	__sk_common.skc_v6_rcv_saddr
 #define sk_flags		__sk_common.skc_flags
diff --git a/tools/testing/selftests/bpf/progs/sock_iter_batch.c b/tools/testing/selftests/bpf/progs/sock_iter_batch.c
index 96531b0d9d55..8f483337e103 100644
--- a/tools/testing/selftests/bpf/progs/sock_iter_batch.c
+++ b/tools/testing/selftests/bpf/progs/sock_iter_batch.c
@@ -17,6 +17,12 @@ static bool ipv6_addr_loopback(const struct in6_addr *a)
 		a->s6_addr32[2] | (a->s6_addr32[3] ^ bpf_htonl(1))) == 0;
 }
 
+static bool ipv4_addr_loopback(__be32 a)
+{
+	return a == bpf_htonl(0x7f000001);
+}
+
+volatile const unsigned int sf;
 volatile const __u16 ports[2];
 unsigned int bucket[2];
 
@@ -26,16 +32,20 @@ int iter_tcp_soreuse(struct bpf_iter__tcp *ctx)
 	struct sock *sk = (struct sock *)ctx->sk_common;
 	struct inet_hashinfo *hinfo;
 	unsigned int hash;
+	__u64 sock_cookie;
 	struct net *net;
 	int idx;
 
 	if (!sk)
 		return 0;
 
+	sock_cookie = bpf_get_socket_cookie(sk);
 	sk = bpf_core_cast(sk, struct sock);
-	if (sk->sk_family != AF_INET6 ||
+	if (sk->sk_family != sf ||
 	    sk->sk_state != TCP_LISTEN ||
-	    !ipv6_addr_loopback(&sk->sk_v6_rcv_saddr))
+	    (sk->sk_family == AF_INET6 ?
+	     !ipv6_addr_loopback(&sk->sk_v6_rcv_saddr) :
+	     !ipv4_addr_loopback(sk->sk_rcv_saddr)))
 		return 0;
 
 	if (sk->sk_num == ports[0])
@@ -52,6 +62,7 @@ int iter_tcp_soreuse(struct bpf_iter__tcp *ctx)
 	hinfo = net->ipv4.tcp_death_row.hashinfo;
 	bucket[idx] = hash & hinfo->lhash2_mask;
 	bpf_seq_write(ctx->meta->seq, &idx, sizeof(idx));
+	bpf_seq_write(ctx->meta->seq, &sock_cookie, sizeof(sock_cookie));
 
 	return 0;
 }
@@ -63,14 +74,18 @@ int iter_udp_soreuse(struct bpf_iter__udp *ctx)
 {
 	struct sock *sk = (struct sock *)ctx->udp_sk;
 	struct udp_table *udptable;
+	__u64 sock_cookie;
 	int idx;
 
 	if (!sk)
 		return 0;
 
+	sock_cookie = bpf_get_socket_cookie(sk);
 	sk = bpf_core_cast(sk, struct sock);
-	if (sk->sk_family != AF_INET6 ||
-	    !ipv6_addr_loopback(&sk->sk_v6_rcv_saddr))
+	if (sk->sk_family != sf ||
+	    (sk->sk_family == AF_INET6 ?
+	     !ipv6_addr_loopback(&sk->sk_v6_rcv_saddr) :
+	     !ipv4_addr_loopback(sk->sk_rcv_saddr)))
 		return 0;
 
 	if (sk->sk_num == ports[0])
@@ -84,6 +99,7 @@ int iter_udp_soreuse(struct bpf_iter__udp *ctx)
 	udptable = sk->sk_net.net->ipv4.udp_table;
 	bucket[idx] = udp_sk(sk)->udp_portaddr_hash & udptable->mask;
 	bpf_seq_write(ctx->meta->seq, &idx, sizeof(idx));
+	bpf_seq_write(ctx->meta->seq, &sock_cookie, sizeof(sock_cookie));
 
 	return 0;
 }
-- 
2.43.0