From: Martin KaFai Lau <martin.lau@linux.dev>
To: Jordan Rife <jordan@jrife.io>
Cc: Aditi Ghag <aditi.ghag@isovalent.com>,
Daniel Borkmann <daniel@iogearbox.net>,
Willem de Bruijn <willemdebruijn.kernel@gmail.com>,
Kuniyuki Iwashima <kuniyu@amazon.com>,
netdev@vger.kernel.org, bpf@vger.kernel.org
Subject: Re: [PATCH v4 bpf-next 2/6] bpf: udp: Make sure iter->batch always contains a full bucket snapshot
Date: Tue, 22 Apr 2025 13:28:45 -0700
Message-ID: <68dfbcbe-1f12-406c-8913-70e3dd5f8154@linux.dev>
In-Reply-To: <fa0a7936-145a-4f47-a858-66e46b829486@linux.dev>

On 4/22/25 1:25 PM, Martin KaFai Lau wrote:
> On 4/22/25 11:02 AM, Jordan Rife wrote:
>>> I found the "if (lock)" change and its related changes make the code
>>> harder to follow. This change mostly handles one special case:
>>> avoiding the lock release when "resizes" reaches the limit.
>>>
>>> Can this one case be specifically handled in the "for(bucket)" loop?
>>>
>>> With this special case handled there, it can alloc exactly "batch_sks"
>>> entries with GFP_ATOMIC. It does not need to put the sks or get the
>>> cookies; it can directly continue from "iter->batch[iter->end_sk - 1].sock".
>>>
>>> Something like this on top of this set. I reset "resizes" on each new
>>> bucket, removed the existing "done" label, and avoided getting the
>>> cookies in the last attempt.
>>>
>>> Untested code and likely still buggy. wdyt?
>>
>> Overall I like it, and at a glance, it seems correct. The code before
>
> Thanks for checking!
>
>> (with the "if (lock)" stuff) made it harder to verify that the
>> lock was always released on every code path. This structure makes it
>> clearer. I'll adopt this for the next version of the series and do
>> a bit more testing to make sure everything's sound.
>>
>>>
>>> #define MAX_REALLOC_ATTEMPTS 2
>>>
>>> static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
>>> {
>>> 	struct bpf_udp_iter_state *iter = seq->private;
>>> 	struct udp_iter_state *state = &iter->state;
>>> 	unsigned int find_cookie, end_cookie = 0;
>>> 	struct net *net = seq_file_net(seq);
>>> 	struct udp_table *udptable;
>>> 	unsigned int batch_sks = 0;
>>> 	int resume_bucket;
>>> 	struct sock *sk;
>>> 	int resizes = 0;
>>> 	int err = 0;
>>>
>>> 	resume_bucket = state->bucket;
>>>
>>> 	/* The current batch is done, so advance the bucket. */
>>> 	if (iter->st_bucket_done)
>>> 		state->bucket++;
>>>
>>> 	udptable = udp_get_table_seq(seq, net);
>>>
>>> again:
>>> 	/* New batch for the next bucket.
>>> 	 * Iterate over the hash table to find a bucket with sockets matching
>>> 	 * the iterator attributes, and return the first matching socket from
>>> 	 * the bucket. The remaining matched sockets from the bucket are
>>> 	 * batched before releasing the bucket lock. This allows BPF programs
>>> 	 * that are called in seq_show to acquire the bucket lock if needed.
>>> 	 */
>>> 	find_cookie = iter->cur_sk;
>>> 	end_cookie = iter->end_sk;
>>> 	iter->cur_sk = 0;
>>> 	iter->end_sk = 0;
>>> 	iter->st_bucket_done = false;
>>> 	batch_sks = 0;
>>>
>>> 	for (; state->bucket <= udptable->mask; state->bucket++) {
>>> 		struct udp_hslot *hslot2 = &udptable->hash2[state->bucket].hslot;
>>>
>>> 		if (hlist_empty(&hslot2->head)) {
>>> 			resizes = 0;
>>> 			continue;
>>> 		}
>>>
>>> 		spin_lock_bh(&hslot2->lock);
>>>
>>> 		/* Initialize sk to the first socket in hslot2. */
>>> 		sk = hlist_entry_safe(hslot2->head.first, struct sock,
>>> 				      __sk_common.skc_portaddr_node);
>>> 		/* Resume from the first (in iteration order) unseen socket
>>> 		 * from the last batch that still exists in resume_bucket.
>>> 		 * Most of the time this will just be where the last iteration
>>> 		 * left off in resume_bucket, unless that socket disappeared
>>> 		 * between reads.
>>> 		 *
>>> 		 * Skip this if end_cookie isn't set; this is the first
>>> 		 * batch, we're on bucket zero, and we want to start from the
>>> 		 * beginning.
>>> 		 */
>>> 		if (state->bucket == resume_bucket && end_cookie)
>>> 			sk = bpf_iter_udp_resume(sk,
>>> 						 &iter->batch[find_cookie],
>>> 						 end_cookie - find_cookie);
>>> last_realloc_retry:
>>> 		udp_portaddr_for_each_entry_from(sk) {
>>> 			if (seq_sk_match(seq, sk)) {
>>> 				if (iter->end_sk < iter->max_sk) {
>>> 					sock_hold(sk);
>>> 					iter->batch[iter->end_sk++].sock = sk;
>>> 				}
>>> 				batch_sks++;
>>> 			}
>>> 		}
>>>
>>> 		if (unlikely(resizes == MAX_REALLOC_ATTEMPTS) &&
>>> 		    iter->end_sk && iter->end_sk != batch_sks) {
>>
>> While iter->end_sk == batch_sks should always be true here after goto
>> last_realloc_retry, I wonder if it's worth adding a sanity check:
>> WARN_*ing and bailing out if we hit this condition twice? Not sure if
>> I'm being overly paranoid here.
>
> hmm... I usually would not go with a WARN if the case is impossible.
>
> The code is quite tricky here, so I think it is ok to have a WARN_ON_ONCE.
>
> Maybe also increment "retries" in this special case so it stops hitting
typo: s/retries/resizes/. Same for the code below.
> this case again for the same bucket. WARN outside of the for loop. Like:
>
>>
>>> 			/* last realloc attempt to batch the whole
>>> 			 * bucket. Keep holding the lock to ensure the
>>> 			 * bucket will not be changed.
>>> 			 */
>>> 			err = bpf_iter_udp_realloc_batch(iter, batch_sks,
>>> 							 GFP_ATOMIC);
>>> 			if (err) {
>>> 				spin_unlock_bh(&hslot2->lock);
>>> 				return ERR_PTR(err);
>>> 			}
>>> 			sk = iter->batch[iter->end_sk - 1].sock;
>>> 			sk = hlist_entry_safe(sk->__sk_common.skc_portaddr_node.next,
>>> 					      struct sock,
>>> 					      __sk_common.skc_portaddr_node);
>>> 			batch_sks = iter->end_sk;
>
> /* and then WARN outside of the for loop */
> retries++;
>
>
>>> 			goto last_realloc_retry;
>>> 		}
>>>
>>> 		spin_unlock_bh(&hslot2->lock);
>>>
>>> 		if (iter->end_sk)
>>> 			break;
>>>
>>> 		/* Got an empty bucket after taking the lock */
>>> 		resizes = 0;
>>> 	}
>>>
>>> 	/* All done: no batch made. */
>>> 	if (!iter->end_sk)
>>> 		return NULL;
>>>
>>> 	if (iter->end_sk == batch_sks) {
>>> 		/* Batching is done for the current bucket; return the first
>>> 		 * socket to be iterated from the batch.
>>> 		 */
>>> 		iter->st_bucket_done = true;
>>> 		return iter->batch[0].sock;
>>> 	}
>
> 	if (WARN_ON_ONCE(retries > MAX_REALLOC_ATTEMPTS))
> 		return iter->batch[0].sock;
>
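To make it concrete, after applying the s/retries/resizes/ fix above, I
picture the end state like this (same caveat as the rest of this sketch:
untested):

	in bpf_iter_udp_batch():

		if (unlikely(resizes == MAX_REALLOC_ATTEMPTS) &&
		    iter->end_sk && iter->end_sk != batch_sks) {
			/* GFP_ATOMIC realloc of exactly batch_sks entries,
			 * as in the snippet above, then:
			 */
			resizes++;
			goto last_realloc_retry;
		}

	and, outside of the for loop:

		if (WARN_ON_ONCE(resizes > MAX_REALLOC_ATTEMPTS))
			return iter->batch[0].sock;

After the last GFP_ATOMIC retry, "resizes" is MAX_REALLOC_ATTEMPTS + 1, so the
special case cannot be re-entered for the same bucket, and the WARN only fires
in the impossible case where the batch is still short after that retry.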
>>>
>>> 	err = bpf_iter_udp_realloc_batch(iter, batch_sks * 3 / 2, GFP_USER);
>>> 	if (err)
>>> 		return ERR_PTR(err);
>>>
>>> 	resizes++;
>>> 	goto again;
>>> }
>>>
>>> static int bpf_iter_udp_realloc_batch(struct bpf_udp_iter_state *iter,
>>> 				      unsigned int new_batch_sz, int flags)
>>> {
>>> 	union bpf_udp_iter_batch_item *new_batch;
>>>
>>> 	new_batch = kvmalloc_array(new_batch_sz, sizeof(*new_batch),
>>> 				   flags | __GFP_NOWARN);
>>> 	if (!new_batch)
>>> 		return -ENOMEM;
>>>
>>> 	if (flags != GFP_ATOMIC)
>>> 		bpf_iter_udp_put_batch(iter);
>>>
>>> 	/* Make sure the new batch has the cookies of the sockets we haven't
>>> 	 * visited yet.
>>> 	 */
>>> 	memcpy(new_batch, iter->batch, sizeof(*iter->batch) * iter->end_sk);
>>> 	kvfree(iter->batch);
>>> 	iter->batch = new_batch;
>>> 	iter->max_sk = new_batch_sz;
>>>
>>> 	return 0;
>>> }
>>
>> Jordan
>
>
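P.S. One more note for anyone following along: the sketch leans on
bpf_iter_udp_resume() without showing it. Roughly, the shape I have in mind is
below. This is only an illustration, not the code from Jordan's series; it
assumes each bpf_udp_iter_batch_item carries the cookie of a
batched-but-not-yet-shown socket, and the NULL fallback is just one possible
policy:

	static struct sock *bpf_iter_udp_resume(struct sock *first_sk,
						union bpf_udp_iter_batch_item *cookies,
						int n_cookies)
	{
		struct sock *sk = NULL;
		int i;

		/* Find the first saved cookie that still has a live socket
		 * in this bucket; that is where the previous read() left
		 * off.
		 */
		for (i = 0; i < n_cookies && !sk; i++) {
			struct sock *cur = first_sk;

			udp_portaddr_for_each_entry_from(cur) {
				if (sock_gen_cookie(cur) == cookies[i].cookie) {
					sk = cur;
					break;
				}
			}
		}

		/* NULL means every unseen socket from the last batch is
		 * gone; the caller's udp_portaddr_for_each_entry_from()
		 * then iterates zero times and moves on to the next bucket.
		 */
		return sk;
	}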
Thread overview: 13+ messages
2025-04-19 15:57 [PATCH v4 bpf-next 0/6] bpf: udp: Exactly-once socket iteration Jordan Rife
2025-04-19 15:57 ` [PATCH v4 bpf-next 1/6] bpf: udp: Make mem flags configurable through bpf_iter_udp_realloc_batch Jordan Rife
2025-04-19 15:57 ` [PATCH v4 bpf-next 2/6] bpf: udp: Make sure iter->batch always contains a full bucket snapshot Jordan Rife
2025-04-22 3:33 ` Martin KaFai Lau
2025-04-22 18:02 ` Jordan Rife
2025-04-22 20:25 ` Martin KaFai Lau
2025-04-22 20:28 ` Martin KaFai Lau [this message]
2025-04-19 15:58 ` [PATCH v4 bpf-next 3/6] bpf: udp: Use bpf_udp_iter_batch_item for bpf_udp_iter_state batch items Jordan Rife
2025-04-19 15:58 ` [PATCH v4 bpf-next 4/6] bpf: udp: Avoid socket skips and repeats during iteration Jordan Rife
2025-04-22 3:47 ` Martin KaFai Lau
2025-04-22 18:03 ` Jordan Rife
2025-04-19 15:58 ` [PATCH v4 bpf-next 5/6] selftests/bpf: Return socket cookies from sock_iter_batch progs Jordan Rife
2025-04-19 15:58 ` [PATCH v4 bpf-next 6/6] selftests/bpf: Add tests for bucket resume logic in UDP socket iterators Jordan Rife