From: Stanislav Fomichev <sdf@google.com>
To: Aditi Ghag <aditi.ghag@isovalent.com>
Cc: bpf@vger.kernel.org, kafai@fb.com, edumazet@google.com,
Martin KaFai Lau <martin.lau@kernel.org>
Subject: Re: [PATCH v2 bpf-next 1/3] bpf: Implement batching in UDP iterator
Date: Fri, 24 Feb 2023 14:32:16 -0800
Message-ID: <Y/k68KV9GDakrKQ1@google.com>
In-Reply-To: <20230223215311.926899-2-aditi.ghag@isovalent.com>
On 02/23, Aditi Ghag wrote:
> Batch UDP sockets from BPF iterator that allows for overlapping locking
> semantics in BPF/kernel helpers executed in BPF programs. This facilitates
> BPF socket destroy kfunc (introduced by follow-up patches) to execute from
> BPF iterator programs.
> Previously, BPF iterators acquired the sock lock and sockets hash table
> bucket lock while executing BPF programs. This prevented BPF helpers that
> again acquire these locks to be executed from BPF iterators. With the
> batching approach, we acquire a bucket lock, batch all the bucket sockets,
> and then release the bucket lock. This enables BPF or kernel helpers to
> skip sock locking when invoked in the supported BPF contexts.
> The batching logic is similar to the logic implemented in TCP iterator:
> https://lore.kernel.org/bpf/20210701200613.1036157-1-kafai@fb.com/.
> Suggested-by: Martin KaFai Lau <martin.lau@kernel.org>
> Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
> ---
> net/ipv4/udp.c | 224 +++++++++++++++++++++++++++++++++++++++++++++++--
> 1 file changed, 215 insertions(+), 9 deletions(-)
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index c605d171eb2d..2f3978de45f2 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -3152,6 +3152,141 @@ struct bpf_iter__udp {
> int bucket __aligned(8);
> };
> +struct bpf_udp_iter_state {
> + struct udp_iter_state state;
[..]
> + unsigned int cur_sk;
> + unsigned int end_sk;
> + unsigned int max_sk;
> + struct sock **batch;
> + bool st_bucket_done;
Any chance we can generalize some of these across tcp & udp? I haven't
looked too deep, but a lot of things look like a plain copy-paste
from tcp batching. Or not worth it?
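Very roughly, I'm imagining something like the following shared state that
both the tcp and udp iterators could embed (names are made up, untested
sketch, not a concrete proposal):

	struct bpf_sock_iter_batch {
		unsigned int cur_sk;
		unsigned int end_sk;
		unsigned int max_sk;
		struct sock **batch;
		bool st_bucket_done;
	};

plus common realloc/unref helpers operating on it, so only the bucket walk
stays protocol specific. But maybe the duplication is small enough not to
bother.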
> +};
> +
> +static unsigned short seq_file_family(const struct seq_file *seq);
> +static int bpf_iter_udp_realloc_batch(struct bpf_udp_iter_state *iter,
> + unsigned int new_batch_sz);
> +
> +static inline bool seq_sk_match(struct seq_file *seq, const struct sock *sk)
> +{
> + unsigned short family = seq_file_family(seq);
> +
> + /* AF_UNSPEC is used as a match all */
> + return ((family == AF_UNSPEC || family == sk->sk_family) &&
> + net_eq(sock_net(sk), seq_file_net(seq)));
> +}
> +
> +static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
> +{
> + struct bpf_udp_iter_state *iter = seq->private;
> + struct udp_iter_state *state = &iter->state;
> + struct net *net = seq_file_net(seq);
> + struct udp_seq_afinfo *afinfo = state->bpf_seq_afinfo;
> + struct udp_table *udptable;
> + struct sock *first_sk = NULL;
> + struct sock *sk;
> + unsigned int bucket_sks = 0;
> + bool first;
> + bool resized = false;
> +
> + /* The current batch is done, so advance the bucket. */
> + if (iter->st_bucket_done)
> + state->bucket++;
> +
> + udptable = udp_get_table_afinfo(afinfo, net);
> +
> +again:
> + /* New batch for the next bucket.
> + * Iterate over the hash table to find a bucket with sockets matching
> + * the iterator attributes, and return the first matching socket from
> + * the bucket. The remaining matched sockets from the bucket are batched
> + * before releasing the bucket lock. This allows BPF programs that are
> + * called in seq_show to acquire the bucket lock if needed.
> + */
> + iter->cur_sk = 0;
> + iter->end_sk = 0;
> + iter->st_bucket_done = false;
> + first = true;
> +
> + for (; state->bucket <= udptable->mask; state->bucket++) {
> + struct udp_hslot *hslot = &udptable->hash[state->bucket];
> +
> + if (hlist_empty(&hslot->head))
> + continue;
> +
> + spin_lock_bh(&hslot->lock);
> + sk_for_each(sk, &hslot->head) {
> + if (seq_sk_match(seq, sk)) {
> + if (first) {
> + first_sk = sk;
> + first = false;
> + }
> + if (iter->end_sk < iter->max_sk) {
> + sock_hold(sk);
> + iter->batch[iter->end_sk++] = sk;
> + }
> + bucket_sks++;
> + }
> + }
> + spin_unlock_bh(&hslot->lock);
> + if (first_sk)
> + break;
> + }
> +
> + /* All done: no batch made. */
> + if (!first_sk)
> + return NULL;
> +
> + if (iter->end_sk == bucket_sks) {
> + /* Batching is done for the current bucket; return the first
> + * socket to be iterated from the batch.
> + */
> + iter->st_bucket_done = true;
> + return first_sk;
> + }
> + if (!resized && !bpf_iter_udp_realloc_batch(iter, bucket_sks * 3 / 2)) {
> + resized = true;
> + /* Go back to the previous bucket to resize its batch. */
> + state->bucket--;
> + goto again;
> + }
> + return first_sk;
> +}
> +
> +static void *bpf_iter_udp_seq_next(struct seq_file *seq, void *v, loff_t *pos)
> +{
> + struct bpf_udp_iter_state *iter = seq->private;
> + struct sock *sk;
> +
> + /* Whenever seq_next() is called, the iter->cur_sk is
> + * done with seq_show(), so unref the iter->cur_sk.
> + */
> + if (iter->cur_sk < iter->end_sk)
> + sock_put(iter->batch[iter->cur_sk++]);
> +
> + /* After updating iter->cur_sk, check if there are more sockets
> + * available in the current bucket batch.
> + */
> + if (iter->cur_sk < iter->end_sk) {
> + sk = iter->batch[iter->cur_sk];
> + } else {
> + // Prepare a new batch.
> + sk = bpf_iter_udp_batch(seq);
> + }
> +
> + ++*pos;
> + return sk;
> +}
> +
> +static void *bpf_iter_udp_seq_start(struct seq_file *seq, loff_t *pos)
> +{
> + /* bpf iter does not support lseek, so it always
> + * continues from where it was stop()-ped.
> + */
> + if (*pos)
> + return bpf_iter_udp_batch(seq);
> +
> + return SEQ_START_TOKEN;
> +}
> +
> static int udp_prog_seq_show(struct bpf_prog *prog, struct bpf_iter_meta *meta,
> 			     struct udp_sock *udp_sk, uid_t uid, int bucket)
> {
> @@ -3172,18 +3307,34 @@ static int bpf_iter_udp_seq_show(struct seq_file *seq, void *v)
> struct bpf_prog *prog;
> struct sock *sk = v;
> uid_t uid;
> + bool slow;
> + int rc;
> if (v == SEQ_START_TOKEN)
> return 0;
> + slow = lock_sock_fast(sk);
Hm, I missed the fact that we're already using the fast lock in the tcp batching
as well. Should we not use fast locks here? On a loaded system it's
probably fair to pay some backlog processing in the path that goes
over every socket (here)? Martin, WDYT?
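IOW, something like this (untested, just to illustrate what I mean; maybe
there's a reason to prefer the fast variant that I'm missing):

	if (v == SEQ_START_TOKEN)
		return 0;

	lock_sock(sk);

	if (unlikely(sk_unhashed(sk))) {
		rc = SEQ_SKIP;
		goto unlock;
	}

	/* ... run the bpf prog as below ... */

unlock:
	release_sock(sk);
	return rc;

so whatever backlog piled up while we hold the lock gets processed at
release_sock() time.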
> +
> + if (unlikely(sk_unhashed(sk))) {
> + rc = SEQ_SKIP;
> + goto unlock;
> + }
> +
> uid = from_kuid_munged(seq_user_ns(seq), sock_i_uid(sk));
> meta.seq = seq;
> prog = bpf_iter_get_info(&meta, false);
> - return udp_prog_seq_show(prog, &meta, v, uid, state->bucket);
> + rc = udp_prog_seq_show(prog, &meta, v, uid, state->bucket);
> +
> +unlock:
> + unlock_sock_fast(sk, slow);
> + return rc;
> }
> +static void bpf_iter_udp_unref_batch(struct bpf_udp_iter_state *iter);
Why forward declaration? Why not define the function here?
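I.e., can't we just drop the forward declaration and move the definition
(same body as later in the patch) up next to its first user:

	static void bpf_iter_udp_unref_batch(struct bpf_udp_iter_state *iter)
	{
		while (iter->cur_sk < iter->end_sk)
			sock_put(iter->batch[iter->cur_sk++]);
	}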
> +
> static void bpf_iter_udp_seq_stop(struct seq_file *seq, void *v)
> {
> + struct bpf_udp_iter_state *iter = seq->private;
> struct bpf_iter_meta meta;
> struct bpf_prog *prog;
> @@ -3194,15 +3345,31 @@ static void bpf_iter_udp_seq_stop(struct seq_file *seq, void *v)
> (void)udp_prog_seq_show(prog, &meta, v, 0, 0);
> }
> - udp_seq_stop(seq, v);
> + if (iter->cur_sk < iter->end_sk) {
> + bpf_iter_udp_unref_batch(iter);
> + iter->st_bucket_done = false;
> + }
> }
> static const struct seq_operations bpf_iter_udp_seq_ops = {
> - .start = udp_seq_start,
> - .next = udp_seq_next,
> + .start = bpf_iter_udp_seq_start,
> + .next = bpf_iter_udp_seq_next,
> .stop = bpf_iter_udp_seq_stop,
> .show = bpf_iter_udp_seq_show,
> };
> +
> +static unsigned short seq_file_family(const struct seq_file *seq)
> +{
> + const struct udp_seq_afinfo *afinfo;
> +
> + /* BPF iterator: bpf programs to filter sockets. */
> + if (seq->op == &bpf_iter_udp_seq_ops)
> + return AF_UNSPEC;
> +
> + /* Proc fs iterator */
> + afinfo = pde_data(file_inode(seq->file));
> + return afinfo->family;
> +}
> #endif
> const struct seq_operations udp_seq_ops = {
> @@ -3413,9 +3580,38 @@ static struct pernet_operations __net_initdata udp_sysctl_ops = {
> DEFINE_BPF_ITER_FUNC(udp, struct bpf_iter_meta *meta,
> struct udp_sock *udp_sk, uid_t uid, int bucket)
> +static void bpf_iter_udp_unref_batch(struct bpf_udp_iter_state *iter)
> +{
> + while (iter->cur_sk < iter->end_sk)
> + sock_put(iter->batch[iter->cur_sk++]);
> +}
> +
> +static int bpf_iter_udp_realloc_batch(struct bpf_udp_iter_state *iter,
> + unsigned int new_batch_sz)
> +{
> + struct sock **new_batch;
> +
> + new_batch = kvmalloc_array(new_batch_sz, sizeof(*new_batch),
> + GFP_USER | __GFP_NOWARN);
> + if (!new_batch)
> + return -ENOMEM;
> +
> + bpf_iter_udp_unref_batch(iter);
> + kvfree(iter->batch);
> + iter->batch = new_batch;
> + iter->max_sk = new_batch_sz;
> +
> + return 0;
> +}
> +
> +#define INIT_BATCH_SZ 16
> +
> +static void bpf_iter_fini_udp(void *priv_data);
> +
> static int bpf_iter_init_udp(void *priv_data, struct bpf_iter_aux_info *aux)
> {
> - struct udp_iter_state *st = priv_data;
> + struct bpf_udp_iter_state *iter = priv_data;
> + struct udp_iter_state *st = &iter->state;
> struct udp_seq_afinfo *afinfo;
> int ret;
> @@ -3427,24 +3623,34 @@ static int bpf_iter_init_udp(void *priv_data, struct bpf_iter_aux_info *aux)
> afinfo->udp_table = NULL;
> st->bpf_seq_afinfo = afinfo;
> ret = bpf_iter_init_seq_net(priv_data, aux);
> - if (ret)
> + if (ret) {
> kfree(afinfo);
> + return ret;
> + }
> + ret = bpf_iter_udp_realloc_batch(iter, INIT_BATCH_SZ);
> + if (ret) {
> + bpf_iter_fini_seq_net(priv_data);
Leaking afinfo here? Since we are not freeing it from bpf_iter_fini_udp
any more? (why?)
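I.e., presumably the error path wants something like (untested):

	ret = bpf_iter_udp_realloc_batch(iter, INIT_BATCH_SZ);
	if (ret) {
		bpf_iter_fini_seq_net(priv_data);
		kfree(afinfo);
		return ret;
	}

unless bpf_iter_fini_udp is still supposed to free it and I'm misreading the
diff.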
> + return ret;
> + }
> + iter->cur_sk = 0;
> + iter->end_sk = 0;
> +
> return ret;
> }
> static void bpf_iter_fini_udp(void *priv_data)
> {
> - struct udp_iter_state *st = priv_data;
> + struct bpf_udp_iter_state *iter = priv_data;
> - kfree(st->bpf_seq_afinfo);
> bpf_iter_fini_seq_net(priv_data);
> + kfree(iter->batch);
> }
> static const struct bpf_iter_seq_info udp_seq_info = {
> .seq_ops = &bpf_iter_udp_seq_ops,
> .init_seq_private = bpf_iter_init_udp,
> .fini_seq_private = bpf_iter_fini_udp,
> - .seq_priv_size = sizeof(struct udp_iter_state),
> + .seq_priv_size = sizeof(struct bpf_udp_iter_state),
> };
> static struct bpf_iter_reg udp_reg_info = {
> --
> 2.34.1