From: Martin KaFai Lau <martin.lau@linux.dev>
To: Jordan Rife <jordan@jrife.io>
Cc: Aditi Ghag <aditi.ghag@isovalent.com>,
Daniel Borkmann <daniel@iogearbox.net>,
Willem de Bruijn <willemdebruijn.kernel@gmail.com>,
Kuniyuki Iwashima <kuniyu@amazon.com>,
bpf@vger.kernel.org, netdev@vger.kernel.org
Subject: Re: [PATCH v2 bpf-next 2/5] bpf: udp: Propagate ENOMEM up from bpf_iter_udp_batch
Date: Mon, 14 Apr 2025 14:54:07 -0700 [thread overview]
Message-ID: <d323d417-3e8b-48af-ae94-bc28469ac0c1@linux.dev> (raw)
In-Reply-To: <Z_xQhm4aLW9UBykJ@t14>
On 4/13/25 5:02 PM, Jordan Rife wrote:
>> static void *bpf_iter_udp_seq_next(struct seq_file *seq, void *v, loff_t *pos)
>> {
>> if (iter->cur_sk < iter->end_sk) {
>> u64 cookie;
>>
>> cookie = iter->st_bucket_done ?
>> 0 : __sock_gen_cookie(iter->batch[iter->cur_sk].sock);
>> sock_put(iter->batch[iter->cur_sk].sock);
>> iter->batch[iter->cur_sk++].cookie = cookie;
>> }
>>
>> /* ... */
>> }
>>
>> In bpf_iter_udp_resume(), if it cannot find the first sk from find_cookie to
>> end_cookie, then it searches backward from find_cookie to 0. If nothing found,
>> then it should start from the beginning of the resume_bucket. Would it work?
>
> It seems like the intent here is to avoid repeating sockets that we've
> already visited?
>
> This would work if you need to process a bucket in two batches or less,
> but it would still be possible to repeat a socket if you have to process
> a bucket in more than two batches: during the transition from batch two
> to batch three you don't have any context about what you saw in batch
> one, so in the worst case where all the cookies we remembered from batch
> two are not found, we restart from the beginning of the list where we
> might revisit sockets from batch one. I guess you can say this reduces
> the probability of repeats but doesn't eliminate it.
>
> e.g.: socket A gets repeated in batch three after two consecutive calls
> to bpf_iter_udp_batch() hit the resized == true case due to heavy
> churn in the current bucket.
>
> | Thread 1 Thread 2 Batch State List State
> | ------------------------------- --------- ------------ ----------
> | [_] A->B
> | bpf_iter_udp_batch() " "
> | spin_lock_bh(&hslot2->lock) " "
> | ... [A] "
> | spin_unlock_bh(&hslot2->lock) " "
> | add C,D " A->B->C->D
> | bpf_iter_udp_realloc_batch(3) " "
> | spin_lock_bh(&hslot2->lock) [A,_,_] "
> | ... [A,B,C] "
> | spin_unlock_bh(&hslot2->lock) " "
> | resized == true " "
> | return A " "
> | del B,C " A->D
> | add E,F,G " A->D->E->
> t F->G
> i bpf_iter_udp_batch() " "
> m spin_lock_bh(&hslot2->lock) " "
> e ... [D,E,F] "
> | spin_unlock_bh(&hslot2->lock) " "
> | add H,I,J " A->D->E->
> | F->G->H->
> | I->J
> | bpf_iter_udp_realloc_batch(6) [D,E,F,_,_,_] "
> | spin_lock_bh(&hslot2->lock) " "
> | ... [D,E,F,G,H,I] "
> | spin_unlock_bh(&hslot2->lock) " "
> | resized == true " "
> | return D " "
> | del D,E, " A->J
> | F,G,
> | H,I,
> | bpf_iter_udp_batch() " "
> | spin_lock_bh(&hslot2->lock) " "
> | ... [A,J,_,_,_,_] "
> | !!! A IS REPEATED !!! ^
> | spin_unlock_bh(&hslot2->lock) " "
> | return A " "
> v
Agree that >1 batches on the same bucket may have a repeated-sk situation, like
the above example you demonstrated.
>
> There's a fundamental limitation where if we have to process a bucket in
> more than two batches, we can lose context about what we've visited
> before, so there's always some edge case like this. The choice is
> basically:
>
> (1) Make a best-effort attempt to avoid repeating sockets, and accept
> that repeats can still happen in rare cases. Maybe the chances are
> close enough to zero to never actually happen, although it's hard to
> say; it may be more probable in some scenarios.
>
> or
>
> (2) Guarantee that repeats can't happen by requiring that a bucket
> completely fits into one (or two?) batches: either error out in the
> resized == true case or prevent it altogether by holding onto the
> lock across reallocs with GFP_ATOMIC to prevent races.
>
> All things being equal, (2) is a nice guarantee to have, but I sense
> some hesitance to hold onto hslot2->lock any longer than we already are.
> Is there a high performance cost I'm not seeing there? I guess there's a
> higher chance of lock contention, and with GFP_ATOMIC allocation is more
> likely to fail, but reallocs should be fairly rare? Maybe we could
> reduce the chance of reallocs during iteration by "right-sizing" the
> batch from the start, e.g. on iterator init, allocate the batch size to
> be 3/2 * the maximum list length currently in the UDP table, since you
> know you'll eventually need it to be that size anyway. Of course, lists
> might grow after that point requiring a realloc somewhere along the way,
> but it would avoid any reallocs in cases where the lengths are mostly
> stable. I'm fine with (1) if that's the only viable option, but I just
> wanted to make sure I'm accurately understanding the constraints here.
I am concerned having higher unnecessary failure chance (although unlikely) for
the current use cases that do not care for a sk repeated or not. For example,
the bpf prog has checked the sk conditions (address/port/tcp-cc...etc) before
doing setsockopt or doing bpf_sock_destory.
I may have over-thought here. ok to bite the bullet on GFP_ATOMIC but I will be
more comfortable if it can retry a few times on the "resized == true" case first
with GFP_USER before finally resort to GFP_ATOMIC. or may be another way around
GFP_ATOMIC fist and falls back to GFP_USER. Thoughts?
For tracking the maximum list length, not sure how much it will help considering
it may still change, so it still needs to handle the resize+realloc situation
regardless.
next prev parent reply other threads:[~2025-04-14 21:54 UTC|newest]
Thread overview: 15+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-04-11 17:35 [PATCH v2 bpf-next 0/5] bpf: udp: Exactly-once socket iteration Jordan Rife
2025-04-11 17:35 ` [PATCH v2 bpf-next 1/5] bpf: udp: Use bpf_udp_iter_batch_item for bpf_udp_iter_state batch items Jordan Rife
2025-04-11 17:35 ` [PATCH v2 bpf-next 2/5] bpf: udp: Propagate ENOMEM up from bpf_iter_udp_batch Jordan Rife
2025-04-11 22:10 ` Martin KaFai Lau
2025-04-11 23:31 ` Jordan Rife
2025-04-12 3:47 ` Martin KaFai Lau
2025-04-14 0:02 ` Jordan Rife
2025-04-14 21:54 ` Martin KaFai Lau [this message]
2025-04-14 23:59 ` Jordan Rife
2025-04-11 17:35 ` [PATCH v2 bpf-next 3/5] bpf: udp: Avoid socket skips and repeats during iteration Jordan Rife
2025-04-11 20:12 ` Kuniyuki Iwashima
2025-04-11 17:35 ` [PATCH v2 bpf-next 4/5] selftests/bpf: Return socket cookies from sock_iter_batch progs Jordan Rife
2025-04-11 17:35 ` [PATCH v2 bpf-next 5/5] selftests/bpf: Add tests for bucket resume logic in UDP socket iterators Jordan Rife
2025-04-11 22:32 ` Martin KaFai Lau
2025-04-15 0:04 ` Jordan Rife
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=d323d417-3e8b-48af-ae94-bc28469ac0c1@linux.dev \
--to=martin.lau@linux.dev \
--cc=aditi.ghag@isovalent.com \
--cc=bpf@vger.kernel.org \
--cc=daniel@iogearbox.net \
--cc=jordan@jrife.io \
--cc=kuniyu@amazon.com \
--cc=netdev@vger.kernel.org \
--cc=willemdebruijn.kernel@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.