All of lore.kernel.org
 help / color / mirror / Atom feed
From: Martin KaFai Lau <martin.lau@linux.dev>
To: Daniel Borkmann <daniel@iogearbox.net>
Cc: 'Alexei Starovoitov ' <ast@kernel.org>,
	'Andrii Nakryiko ' <andrii@kernel.org>,
	netdev@vger.kernel.org, kernel-team@meta.com,
	Aditi Ghag <aditi.ghag@isovalent.com>,
	bpf@vger.kernel.org
Subject: Re: [PATCH bpf 1/2] bpf: Avoid iter->offset making backward progress in bpf_iter_udp
Date: Wed, 20 Dec 2023 20:45:08 -0800	[thread overview]
Message-ID: <9f3697c1-ed15-4a3d-9113-c4437f421bb3@linux.dev> (raw)
In-Reply-To: <fc1b5650-72bb-4b09-bab4-f61b2186f673@linux.dev>

On 12/20/23 11:10 AM, Martin KaFai Lau wrote:
> Good catch. It will unnecessary skip in the following batch/bucket if there is 
> changes in the current batch/bucket.
> 
>  From looking at the loop again, I think it is better not to change the 
> iter->offset during the for loop. Only update iter->offset after the for loop 
> has concluded.
> 
> The non-zero iter->offset is only useful for the first bucket, so does a test on 
> the first bucket (state->bucket == bucket) before skipping sockets. Something 
> like this:
> 
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index 89e5a806b82e..a993f364d6ae 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -3139,6 +3139,7 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
>       struct net *net = seq_file_net(seq);
>       struct udp_table *udptable;
>       unsigned int batch_sks = 0;
> +    int bucket, bucket_offset;
>       bool resized = false;
>       struct sock *sk;
> 
> @@ -3162,14 +3163,14 @@ static struct sock *bpf_iter_udp_batch(struct seq_file 
> *seq)
>       iter->end_sk = 0;
>       iter->st_bucket_done = false;
>       batch_sks = 0;
> +    bucket = state->bucket;
> +    bucket_offset = 0;
> 
>       for (; state->bucket <= udptable->mask; state->bucket++) {
>           struct udp_hslot *hslot2 = &udptable->hash2[state->bucket];
> 
> -        if (hlist_empty(&hslot2->head)) {
> -            iter->offset = 0;
> +        if (hlist_empty(&hslot2->head))
>               continue;
> -        }
> 
>           spin_lock_bh(&hslot2->lock);
>           udp_portaddr_for_each_entry(sk, &hslot2->head) {
> @@ -3177,8 +3178,9 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
>                   /* Resume from the last iterated socket at the
>                    * offset in the bucket before iterator was stopped.
>                    */
> -                if (iter->offset) {
> -                    --iter->offset;
> +                if (state->bucket == bucket &&
> +                    bucket_offset < iter->offset) {
> +                    ++bucket_offset;
>                       continue;
>                   }
>                   if (iter->end_sk < iter->max_sk) {
> @@ -3192,10 +3194,10 @@ static struct sock *bpf_iter_udp_batch(struct seq_file 
> *seq)
> 
>           if (iter->end_sk)
>               break;
> +    }
> 
> -        /* Reset the current bucket's offset before moving to the next bucket. */
> +    if (state->bucket != bucket)
>           iter->offset = 0;
> -    }
> 
>       /* All done: no batch made. */
>       if (!iter->end_sk)

I think I found another bug in the current bpf_iter_udp_batch(). The 
"state->bucket--;" at the end of the batch() function is wrong also. It does not 
need to go back to the previous bucket. After realloc with a larger batch array, 
it should retry on the "state->bucket" as is. I tried to force the bind() to use 
bucket 0 and bind a larger so_reuseport set (24 sockets). WARN_ON(state->bucket 
< 0) triggered.

Going back to this bug (backward progress on --iter->offset), I think it is a 
bit cleaner to always reset iter->offset to 0 and advance iter->offset to the 
resume_offset only when needed. Something like this:

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 89e5a806b82e..184aa966a006 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -3137,16 +3137,18 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
  	struct bpf_udp_iter_state *iter = seq->private;
  	struct udp_iter_state *state = &iter->state;
  	struct net *net = seq_file_net(seq);
+	int resume_bucket, resume_offset;
  	struct udp_table *udptable;
  	unsigned int batch_sks = 0;
  	bool resized = false;
  	struct sock *sk;

+	resume_bucket = state->bucket;
+	resume_offset = iter->offset;
+
  	/* The current batch is done, so advance the bucket. */
-	if (iter->st_bucket_done) {
+	if (iter->st_bucket_done)
  		state->bucket++;
-		iter->offset = 0;
-	}

  	udptable = udp_get_table_seq(seq, net);

@@ -3166,10 +3168,9 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
  	for (; state->bucket <= udptable->mask; state->bucket++) {
  		struct udp_hslot *hslot2 = &udptable->hash2[state->bucket];

-		if (hlist_empty(&hslot2->head)) {
-			iter->offset = 0;
+		iter->offset = 0;
+		if (hlist_empty(&hslot2->head))
  			continue;
-		}

  		spin_lock_bh(&hslot2->lock);
  		udp_portaddr_for_each_entry(sk, &hslot2->head) {
@@ -3177,8 +3178,9 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
  				/* Resume from the last iterated socket at the
  				 * offset in the bucket before iterator was stopped.
  				 */
-				if (iter->offset) {
-					--iter->offset;
+				if (state->bucket == resume_bucket &&
+				    iter->offset < resume_offset) {
+					++iter->offset;
  					continue;
  				}
  				if (iter->end_sk < iter->max_sk) {
@@ -3192,9 +3194,6 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)

  		if (iter->end_sk)
  			break;
-
-		/* Reset the current bucket's offset before moving to the next bucket. */
-		iter->offset = 0;
  	}

  	/* All done: no batch made. */
@@ -3210,10 +3209,6 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
  	}
  	if (!resized && !bpf_iter_udp_realloc_batch(iter, batch_sks * 3 / 2)) {
  		resized = true;
-		/* After allocating a larger batch, retry one more time to grab
-		 * the whole bucket.
-		 */
-		state->bucket--;
  		goto again;
  	}
  done:


  reply	other threads:[~2023-12-21  4:45 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-12-19 19:32 [PATCH bpf 1/2] bpf: Avoid iter->offset making backward progress in bpf_iter_udp Martin KaFai Lau
2023-12-19 19:32 ` [PATCH bpf 2/2] selftests/bpf: Test udp and tcp iter batching Martin KaFai Lau
2023-12-20 14:24 ` [PATCH bpf 1/2] bpf: Avoid iter->offset making backward progress in bpf_iter_udp Daniel Borkmann
2023-12-20 19:10   ` Martin KaFai Lau
2023-12-21  4:45     ` Martin KaFai Lau [this message]
2023-12-21 13:21       ` Daniel Borkmann
2023-12-21 14:58         ` Martin KaFai Lau
2023-12-21 20:27           ` Daniel Borkmann
2023-12-21 22:19             ` Martin KaFai Lau
2024-01-04 20:21           ` Aditi Ghag
2024-01-04 22:27             ` Martin KaFai Lau
2024-01-04 23:38               ` Aditi Ghag
2024-01-05  0:33                 ` Martin KaFai Lau
2024-01-08 23:24                   ` Aditi Ghag

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=9f3697c1-ed15-4a3d-9113-c4437f421bb3@linux.dev \
    --to=martin.lau@linux.dev \
    --cc=aditi.ghag@isovalent.com \
    --cc=andrii@kernel.org \
    --cc=ast@kernel.org \
    --cc=bpf@vger.kernel.org \
    --cc=daniel@iogearbox.net \
    --cc=kernel-team@meta.com \
    --cc=netdev@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.