Re: [PATCH 4/7] bpf: udp: Implement batching for sockets iterator

BPF List
 help / color / mirror / Atom feed

From: Martin KaFai Lau <martin.lau@linux.dev>
To: Aditi Ghag <aditi.ghag@isovalent.com>
Cc: kafai@fb.com, sdf@google.com, edumazet@google.com,
	Martin KaFai Lau <martin.lau@kernel.org>,
	bpf@vger.kernel.org
Subject: Re: [PATCH 4/7] bpf: udp: Implement batching for sockets iterator
Date: Sun, 23 Apr 2023 22:46:38 -0700	[thread overview]
Message-ID: <251b0a3f-47d6-d9ad-0c8e-b707fa286c7c@linux.dev> (raw)
In-Reply-To: <20230418153148.2231644-5-aditi.ghag@isovalent.com>

On 4/18/23 8:31 AM, Aditi Ghag wrote:
> Batch UDP sockets from BPF iterator that allows for overlapping locking
> semantics in BPF/kernel helpers executed in BPF programs.  This facilitates
> BPF socket destroy kfunc (introduced by follow-up patches) to execute from
> BPF iterator programs.
> 
> Previously, BPF iterators acquired the sock lock and sockets hash table
> bucket lock while executing BPF programs. This prevented BPF helpers that
> again acquire these locks to be executed from BPF iterators.  With the
> batching approach, we acquire a bucket lock, batch all the bucket sockets,
> and then release the bucket lock. This enables BPF or kernel helpers to
> skip sock locking when invoked in the supported BPF contexts.
> 
> The batching logic is similar to the logic implemented in TCP iterator:
> https://lore.kernel.org/bpf/20210701200613.1036157-1-kafai@fb.com/.
> 
> Suggested-by: Martin KaFai Lau <martin.lau@kernel.org>
> Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
> ---
>   net/ipv4/udp.c | 209 +++++++++++++++++++++++++++++++++++++++++++++++--
>   1 file changed, 203 insertions(+), 6 deletions(-)
> 
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index 8689ed171776..f1c001641e53 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -3148,6 +3148,145 @@ struct bpf_iter__udp {
>   	int bucket __aligned(8);
>   };
>   
> +struct bpf_udp_iter_state {
> +	struct udp_iter_state state;
> +	unsigned int cur_sk;
> +	unsigned int end_sk;
> +	unsigned int max_sk;
> +	int offset;
> +	struct sock **batch;
> +	bool st_bucket_done;
> +};
> +
> +static int bpf_iter_udp_realloc_batch(struct bpf_udp_iter_state *iter,
> +				      unsigned int new_batch_sz);
> +static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
> +{
> +	struct bpf_udp_iter_state *iter = seq->private;
> +	struct udp_iter_state *state = &iter->state;
> +	struct net *net = seq_file_net(seq);
> +	struct udp_seq_afinfo afinfo;
> +	struct udp_table *udptable;
> +	unsigned int batch_sks = 0;
> +	bool resized = false;
> +	struct sock *sk;
> +
> +	/* The current batch is done, so advance the bucket. */
> +	if (iter->st_bucket_done) {
> +		state->bucket++;
> +		iter->offset = 0;
> +	}
> +
> +	afinfo.family = AF_UNSPEC;
> +	afinfo.udp_table = NULL;
> +	udptable = udp_get_table_afinfo(&afinfo, net);
> +
> +again:
> +	/* New batch for the next bucket.
> +	 * Iterate over the hash table to find a bucket with sockets matching
> +	 * the iterator attributes, and return the first matching socket from
> +	 * the bucket. The remaining matched sockets from the bucket are batched
> +	 * before releasing the bucket lock. This allows BPF programs that are
> +	 * called in seq_show to acquire the bucket lock if needed.
> +	 */
> +	iter->cur_sk = 0;
> +	iter->end_sk = 0;
> +	iter->st_bucket_done = false;
> +	batch_sks = 0;
> +
> +	for (; state->bucket <= udptable->mask; state->bucket++) {
> +		struct udp_hslot *hslot2 = &udptable->hash2[state->bucket];
> +
> +		if (hlist_empty(&hslot2->head)) {
> +			iter->offset = 0;
> +			continue;
> +		}
> +
> +		spin_lock_bh(&hslot2->lock);
> +		udp_portaddr_for_each_entry(sk, &hslot2->head) {
> +			if (seq_sk_match(seq, sk)) {
> +				/* Resume from the last iterated socket at the
> +				 * offset in the bucket before iterator was stopped.
> +				 */
> +				if (iter->offset) {
> +					--iter->offset;
> +					continue;
> +				}
> +				if (iter->end_sk < iter->max_sk) {
> +					sock_hold(sk);
> +					iter->batch[iter->end_sk++] = sk;
> +				}
> +				batch_sks++;
> +			}
> +		}
> +		spin_unlock_bh(&hslot2->lock);
> +
> +		if (iter->end_sk)
> +			break;
> +
> +		/* Reset the current bucket's offset before moving to the next bucket. */
> +		iter->offset = 0;
> +	}
> +
> +	/* All done: no batch made. */
> +	if (!iter->end_sk)
> +		return NULL;
> +
> +	if (iter->end_sk == batch_sks) {
> +		/* Batching is done for the current bucket; return the first
> +		 * socket to be iterated from the batch.
> +		 */
> +		iter->st_bucket_done = true;
> +		goto ret;

nit. "ret" is a variable name in some other changes in this patch. How about 
directly "return iter->batch[0];" here or change the label name to something 
else like "done"?

> +	}
> +	if (!resized && !bpf_iter_udp_realloc_batch(iter, batch_sks * 3 / 2)) {
> +		resized = true;
> +		/* Go back to the previous bucket to resize its batch. */

nit. I found the "to resize its batch" part confusing. The resize has already 
been done in the above bpf_iter_udp_realloc_batch(). How about something like, 
"After getting a larger max_sk, retry one more time to grab the whole bucket" ?

> +		state->bucket--;
> +		goto again;
> +	}
> +ret:
> +	return iter->batch[0];
> +}
> +
> +static void *bpf_iter_udp_seq_next(struct seq_file *seq, void *v, loff_t *pos)
> +{
> +	struct bpf_udp_iter_state *iter = seq->private;
> +	struct sock *sk;
> +
> +	/* Whenever seq_next() is called, the iter->cur_sk is
> +	 * done with seq_show(), so unref the iter->cur_sk.
> +	 */
> +	if (iter->cur_sk < iter->end_sk) {
> +		sock_put(iter->batch[iter->cur_sk++]);
> +		++iter->offset;
> +	}
> +
> +	/* After updating iter->cur_sk, check if there are more sockets
> +	 * available in the current bucket batch.
> +	 */
> +	if (iter->cur_sk < iter->end_sk) {
> +		sk = iter->batch[iter->cur_sk];
> +	} else {
> +		// Prepare a new batch.
> +		sk = bpf_iter_udp_batch(seq);
> +	}

nit. remove "{ }" for one liner if-else statement.

> +
> +	++*pos;
> +	return sk;
> +}
> +
> +static void *bpf_iter_udp_seq_start(struct seq_file *seq, loff_t *pos)
> +{
> +	/* bpf iter does not support lseek, so it always
> +	 * continue from where it was stop()-ped.
> +	 */
> +	if (*pos)
> +		return bpf_iter_udp_batch(seq);
> +
> +	return SEQ_START_TOKEN;
> +}
> +
>   static int udp_prog_seq_show(struct bpf_prog *prog, struct bpf_iter_meta *meta,
>   			     struct udp_sock *udp_sk, uid_t uid, int bucket)
>   {
> @@ -3168,18 +3307,37 @@ static int bpf_iter_udp_seq_show(struct seq_file *seq, void *v)
>   	struct bpf_prog *prog;
>   	struct sock *sk = v;
>   	uid_t uid;
> +	int rc;

nit. be consistent with variable name. "ret" is used in other changes in this patch.

>   
>   	if (v == SEQ_START_TOKEN)
>   		return 0;
>   
> +	lock_sock(sk);
> +
> +	if (unlikely(sk_unhashed(sk))) {
> +		rc = SEQ_SKIP;
> +		goto unlock;
> +	}
> +
>   	uid = from_kuid_munged(seq_user_ns(seq), sock_i_uid(sk));
>   	meta.seq = seq;
>   	prog = bpf_iter_get_info(&meta, false);
> -	return udp_prog_seq_show(prog, &meta, v, uid, state->bucket);
> +	rc = udp_prog_seq_show(prog, &meta, v, uid, state->bucket);
> +
> +unlock:
> +	release_sock(sk);
> +	return rc;
> +}
> +
> +static void bpf_iter_udp_put_batch(struct bpf_udp_iter_state *iter)
> +{
> +	while (iter->cur_sk < iter->end_sk)
> +		sock_put(iter->batch[iter->cur_sk++]);
>   }
>   
>   static void bpf_iter_udp_seq_stop(struct seq_file *seq, void *v)
>   {
> +	struct bpf_udp_iter_state *iter = seq->private;
>   	struct bpf_iter_meta meta;
>   	struct bpf_prog *prog;
>   
> @@ -3190,12 +3348,15 @@ static void bpf_iter_udp_seq_stop(struct seq_file *seq, void *v)
>   			(void)udp_prog_seq_show(prog, &meta, v, 0, 0);
>   	}
>   
> -	udp_seq_stop(seq, v);
> +	if (iter->cur_sk < iter->end_sk) {
> +		bpf_iter_udp_put_batch(iter);
> +		iter->st_bucket_done = false;
> +	}
>   }
>   
>   static const struct seq_operations bpf_iter_udp_seq_ops = {
> -	.start		= udp_seq_start,
> -	.next		= udp_seq_next,
> +	.start		= bpf_iter_udp_seq_start,
> +	.next		= bpf_iter_udp_seq_next,
>   	.stop		= bpf_iter_udp_seq_stop,
>   	.show		= bpf_iter_udp_seq_show,
>   };
> @@ -3424,21 +3585,57 @@ static struct pernet_operations __net_initdata udp_sysctl_ops = {
>   DEFINE_BPF_ITER_FUNC(udp, struct bpf_iter_meta *meta,
>   		     struct udp_sock *udp_sk, uid_t uid, int bucket)
>   
> +static int bpf_iter_udp_realloc_batch(struct bpf_udp_iter_state *iter,
> +				      unsigned int new_batch_sz)
> +{
> +	struct sock **new_batch;
> +
> +	new_batch = kvmalloc_array(new_batch_sz, sizeof(*new_batch),
> +				   GFP_USER | __GFP_NOWARN);
> +	if (!new_batch)
> +		return -ENOMEM;
> +
> +	bpf_iter_udp_put_batch(iter);
> +	kvfree(iter->batch);
> +	iter->batch = new_batch;
> +	iter->max_sk = new_batch_sz;
> +
> +	return 0;
> +}
> +
> +#define INIT_BATCH_SZ 16
> +
>   static int bpf_iter_init_udp(void *priv_data, struct bpf_iter_aux_info *aux)
>   {
> -	return bpf_iter_init_seq_net(priv_data, aux);
> +	struct bpf_udp_iter_state *iter = priv_data;
> +	int ret;
> +
> +	ret = bpf_iter_init_seq_net(priv_data, aux);
> +	if (ret)
> +		return ret;
> +
> +	ret = bpf_iter_udp_realloc_batch(iter, INIT_BATCH_SZ);
> +	if (ret) {
> +		bpf_iter_fini_seq_net(priv_data);
> +		return ret;

nit. remove this "return ret;" statement.

Others lgtm. I will continue with the rest of the patchset tomorrow.

> +	}
> +
> +	return ret;
>   }
>   
>   static void bpf_iter_fini_udp(void *priv_data)
>   {
> +	struct bpf_udp_iter_state *iter = priv_data;
> +
>   	bpf_iter_fini_seq_net(priv_data);
> +	kvfree(iter->batch);
>   }
>   
>   static const struct bpf_iter_seq_info udp_seq_info = {
>   	.seq_ops		= &bpf_iter_udp_seq_ops,
>   	.init_seq_private	= bpf_iter_init_udp,
>   	.fini_seq_private	= bpf_iter_fini_udp,
> -	.seq_priv_size		= sizeof(struct udp_iter_state),
> +	.seq_priv_size		= sizeof(struct bpf_udp_iter_state),
>   };
>   
>   static struct bpf_iter_reg udp_reg_info = {

next prev parent reply	other threads:[~2023-04-24  5:46 UTC|newest]

Thread overview: 26+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-04-18 15:31 [PATCH v6 bpf-next 0/7] bpf: Add socket destroy capability Aditi Ghag
2023-04-18 15:31 ` [PATCH 1/7] bpf: tcp: Avoid taking fast sock lock in iterator Aditi Ghag
2023-04-20  8:55   ` Paolo Abeni
2023-05-03 20:25     ` Aditi Ghag
2023-04-25  5:45   ` Yonghong Song
2023-05-03 20:26     ` Aditi Ghag
2023-04-18 15:31 ` [PATCH 2/7] udp: seq_file: Remove bpf_seq_afinfo from udp_iter_state Aditi Ghag
2023-04-24  0:18   ` Martin KaFai Lau
2023-05-01 22:39     ` Aditi Ghag
2023-04-18 15:31 ` [PATCH 3/7] udp: seq_file: Helper function to match socket attributes Aditi Ghag
2023-04-18 15:31 ` [PATCH 4/7] bpf: udp: Implement batching for sockets iterator Aditi Ghag
2023-04-20 14:07   ` Paolo Abeni
2023-04-24  5:46   ` Martin KaFai Lau [this message]
2023-04-18 15:31 ` [PATCH 5/7] bpf: Add bpf_sock_destroy kfunc Aditi Ghag
2023-04-18 15:31 ` [PATCH 6/7] selftests/bpf: Add helper to get port using getsockname Aditi Ghag
2023-04-18 18:45   ` Stanislav Fomichev
2023-04-24 17:58     ` Martin KaFai Lau
2023-04-18 15:31 ` [PATCH 7/7] selftests/bpf: Test bpf_sock_destroy Aditi Ghag
2023-04-24 19:20   ` Martin KaFai Lau
2023-04-24 22:15 ` [PATCH v6 bpf-next 0/7] bpf: Add socket destroy capability Martin KaFai Lau
2023-05-01 23:32   ` Aditi Ghag
2023-05-01 23:37     ` Aditi Ghag
2023-05-02 23:24       ` Martin KaFai Lau
2023-05-02 22:52     ` Aditi Ghag
2023-05-02 23:40       ` Martin KaFai Lau
2023-05-04 17:32         ` Aditi Ghag

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=251b0a3f-47d6-d9ad-0c8e-b707fa286c7c@linux.dev \
    --to=martin.lau@linux.dev \
    --cc=aditi.ghag@isovalent.com \
    --cc=bpf@vger.kernel.org \
    --cc=edumazet@google.com \
    --cc=kafai@fb.com \
    --cc=martin.lau@kernel.org \
    --cc=sdf@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox