public inbox for netdev@vger.kernel.org
From: Jordan Rife <jordan@jrife.io>
To: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Aditi Ghag <aditi.ghag@isovalent.com>,
	Daniel Borkmann <daniel@iogearbox.net>,
	Willem de Bruijn <willemdebruijn.kernel@gmail.com>,
	Kuniyuki Iwashima <kuniyu@amazon.com>,
	bpf@vger.kernel.org, netdev@vger.kernel.org
Subject: Re: [PATCH v2 bpf-next 2/5] bpf: udp: Propagate ENOMEM up from bpf_iter_udp_batch
Date: Sun, 13 Apr 2025 17:02:14 -0700	[thread overview]
Message-ID: <Z_xQhm4aLW9UBykJ@t14> (raw)
In-Reply-To: <daa3f02a-c982-4a7a-afcd-41f5e9b2f79c@linux.dev>

> static void *bpf_iter_udp_seq_next(struct seq_file *seq, void *v, loff_t *pos)
> {
>         struct bpf_udp_iter_state *iter = seq->private;
>
>         if (iter->cur_sk < iter->end_sk) {
>                 u64 cookie;
>
>                 cookie = iter->st_bucket_done ?
>                         0 : __sock_gen_cookie(iter->batch[iter->cur_sk].sock);
>                 sock_put(iter->batch[iter->cur_sk].sock);
>                 iter->batch[iter->cur_sk++].cookie = cookie;
>         }
>
>         /* ... */
> }
>
> In bpf_iter_udp_resume(), if it cannot find the first sk from find_cookie to
> end_cookie, then it searches backward from find_cookie to 0. If nothing found,
> then it should start from the beginning of the resume_bucket. Would it work?
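
If I'm reading the suggestion right, the search order could be sketched
roughly like this. This is a userspace toy, not the kernel code: the
bucket and the remembered cookies are plain arrays, and resume_index()
is a made-up name standing in for the lookup inside
bpf_iter_udp_resume().

```c
#include <stdint.h>

/* Toy model: bucket[] holds the socket cookies currently in the
 * bucket, in list order; cookies[] holds the cookies remembered from
 * the previous batch. Resume search order as proposed: first try the
 * remembered cookies from find..end-1 in order, then walk backward
 * from find-1 down to 0, and fall back to the head of the bucket if
 * none of the remembered sockets still exist.
 */
static int resume_index(const uint64_t *bucket, int blen,
			const uint64_t *cookies, int find, int end)
{
	int c, i;

	for (c = find; c < end; c++)
		for (i = 0; i < blen; i++)
			if (bucket[i] == cookies[c])
				return i;
	for (c = find - 1; c >= 0; c--)
		for (i = 0; i < blen; i++)
			if (bucket[i] == cookies[c])
				return i;
	return 0; /* nothing found: restart from the bucket head */
}
```

So the fallback-to-head case is exactly where previously-visited
sockets can come back.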

It seems like the intent here is to avoid repeating sockets that we've
already visited?

This would work if you need to process a bucket in two batches or less,
but it would still be possible to repeat a socket if a bucket takes
more than two batches: during the transition from batch two to batch
three, you don't have any context about what you saw in batch one. In
the worst case, where none of the cookies remembered from batch two are
found, we restart from the beginning of the list and might revisit
sockets from batch one. So this reduces the probability of repeats but
doesn't eliminate it.

e.g.: socket A gets repeated in batch three after two consecutive calls
      to bpf_iter_udp_batch() hit the resized == true case due to heavy
      churn in the current bucket.

|               Thread 1            Thread 2   Batch State    List State
|  -------------------------------  ---------  ------------   ----------
|                                              [_]            A->B
|  bpf_iter_udp_batch()                        "              "
|    spin_lock_bh(&hslot2->lock)               "              "
|    ...                                       [A]            "
|    spin_unlock_bh(&hslot2->lock)             "              "
|                                   add C,D    "              A->B->C->D
|    bpf_iter_udp_realloc_batch(3)             "              "
|    spin_lock_bh(&hslot2->lock)               [A,_,_]        "
|    ...                                       [A,B,C]        "
|    spin_unlock_bh(&hslot2->lock)             "              "
|    resized == true                           "              "
|    return A                                  "              "
|                                   del B,C    "              A->D
|                                   add E,F,G  "              A->D->E->
t                                                             F->G
i  bpf_iter_udp_batch()                        "              "
m    spin_lock_bh(&hslot2->lock)               "              "
e    ...                                       [D,E,F]        "
|    spin_unlock_bh(&hslot2->lock)             "              "
|                                   add H,I,J  "              A->D->E->
|                                                             F->G->H->
|                                                             I->J
|    bpf_iter_udp_realloc_batch(6)             [D,E,F,_,_,_]  "
|    spin_lock_bh(&hslot2->lock)               "              "
|    ...                                       [D,E,F,G,H,I]  "
|    spin_unlock_bh(&hslot2->lock)             "              "
|    resized == true                           "              "
|    return D                                  "              "
|                                   del D,E,   "              A->J
|                                       F,G,
|                                       H,I
|  bpf_iter_udp_batch()                        "              "
|    spin_lock_bh(&hslot2->lock)               "              "
|    ...                                       [A,J,_,_,_,_]  "
|                         !!! A IS REPEATED !!! ^
|    spin_unlock_bh(&hslot2->lock)             "              "
|    return A                                  "              "
v
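
To make the diagram concrete, here's a toy replay of the same sequence
(a userspace sketch with made-up names; sockets are single chars and
the bucket is a string in list order). It collapses the forward/backward
cookie search into a single membership scan, which doesn't change the
outcome here, since none of batch two's remembered sockets survive the
deletions.

```c
#include <string.h>

/* Resume at the first remembered socket still present in the bucket;
 * if none survive, fall back to the head of the bucket. After the
 * del D..I step above the bucket is "AJ" and batch two remembered
 * "DEFGHI", so the fallback resumes at A, which batch one already
 * returned.
 */
static char resume_sock(const char *bucket, const char *remembered)
{
	const char *c, *hit;

	for (c = remembered; *c; c++) {
		hit = strchr(bucket, *c);
		if (hit)
			return *hit;
	}
	return bucket[0]; /* fall back to the head of the bucket */
}
```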

There's a fundamental limitation where if we have to process a bucket in
more than two batches, we can lose context about what we've visited
before, so there's always some edge case like this. The choice is
basically:

(1) Make a best-effort attempt to avoid repeating sockets, and accept
    that repeats can still happen in rare cases. Maybe the chances are
    close enough to zero to never actually happen, although it's hard to
    say; it may be more probable in some scenarios.

or

(2) Guarantee that repeats can't happen by requiring that a bucket
    completely fits into one (or two?) batches: either error out in the
    resized == true case or prevent it altogether by holding onto the
    lock across reallocs with GFP_ATOMIC to prevent races.

All things being equal, (2) is a nice guarantee to have, but I sense
some hesitance to hold hslot2->lock any longer than we already do.
Is there a high performance cost I'm not seeing there? I guess there's a
higher chance of lock contention, and with GFP_ATOMIC allocation is more
likely to fail, but reallocs should be fairly rare? Maybe we could
reduce the chance of reallocs during iteration by "right-sizing" the
batch from the start: e.g., on iterator init, allocate the batch size
to be 3/2 * the maximum list length currently in the UDP table, since
you know you'll eventually need it to be that size anyway. Of course,
lists might grow after that point, requiring a realloc somewhere along
the way, but it would avoid any reallocs in cases where list lengths
are mostly stable. I'm fine with (1) if that's the only viable option,
but I just
wanted to make sure I'm accurately understanding the constraints here.
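
The right-sizing idea would amount to something like this at iterator
init (a hypothetical helper, not kernel code; bucket_lens[] stands in
for a scan over the UDP table, and the 3/2 headroom is the factor
mentioned above):

```c
/* Hypothetical sketch of "right-sizing" the batch at iterator init:
 * scan every bucket once, take the longest list, and size the batch
 * with 3/2 headroom so that steady-state iteration avoids reallocs
 * unless lists grow afterward.
 */
static unsigned int initial_batch_size(const unsigned int *bucket_lens,
				       unsigned int nbuckets)
{
	unsigned int i, max_len = 0;

	for (i = 0; i < nbuckets; i++)
		if (bucket_lens[i] > max_len)
			max_len = bucket_lens[i];
	return max_len + max_len / 2; /* 3/2 * longest list, rounded down */
}
```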

Thanks!

-Jordan


Thread overview: 15+ messages
2025-04-11 17:35 [PATCH v2 bpf-next 0/5] bpf: udp: Exactly-once socket iteration Jordan Rife
2025-04-11 17:35 ` [PATCH v2 bpf-next 1/5] bpf: udp: Use bpf_udp_iter_batch_item for bpf_udp_iter_state batch items Jordan Rife
2025-04-11 17:35 ` [PATCH v2 bpf-next 2/5] bpf: udp: Propagate ENOMEM up from bpf_iter_udp_batch Jordan Rife
2025-04-11 22:10   ` Martin KaFai Lau
2025-04-11 23:31     ` Jordan Rife
2025-04-12  3:47       ` Martin KaFai Lau
2025-04-14  0:02         ` Jordan Rife [this message]
2025-04-14 21:54           ` Martin KaFai Lau
2025-04-14 23:59             ` Jordan Rife
2025-04-11 17:35 ` [PATCH v2 bpf-next 3/5] bpf: udp: Avoid socket skips and repeats during iteration Jordan Rife
2025-04-11 20:12   ` Kuniyuki Iwashima
2025-04-11 17:35 ` [PATCH v2 bpf-next 4/5] selftests/bpf: Return socket cookies from sock_iter_batch progs Jordan Rife
2025-04-11 17:35 ` [PATCH v2 bpf-next 5/5] selftests/bpf: Add tests for bucket resume logic in UDP socket iterators Jordan Rife
2025-04-11 22:32   ` Martin KaFai Lau
2025-04-15  0:04     ` Jordan Rife
