netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Martin KaFai Lau <martin.lau@linux.dev>
To: "D. Wythe" <alibuda@linux.alibaba.com>
Cc: kgraul@linux.ibm.com, wenjia@linux.ibm.com, jaka@linux.ibm.com,
	ast@kernel.org, daniel@iogearbox.net, andrii@kernel.org,
	pabeni@redhat.com, song@kernel.org, sdf@google.com,
	haoluo@google.com, yhs@fb.com, edumazet@google.com,
	john.fastabend@gmail.com, kpsingh@kernel.org, jolsa@kernel.org,
	guwen@linux.alibaba.com, kuba@kernel.org, davem@davemloft.net,
	netdev@vger.kernel.org, linux-s390@vger.kernel.org,
	linux-rdma@vger.kernel.org, bpf@vger.kernel.org,
	dtcccc@linux.alibaba.com
Subject: Re: [PATCH net-next 3/4] net/smc: Introduce smc_bpf_ops
Date: Thu, 24 Oct 2024 17:26:49 -0700	[thread overview]
Message-ID: <74c06b43-095f-414a-b4aa-2addbe610336@linux.dev> (raw)
In-Reply-To: <1729737768-124596-4-git-send-email-alibuda@linux.alibaba.com>

On 10/23/24 7:42 PM, D. Wythe wrote:
> From: "D. Wythe" <alibuda@linux.alibaba.com>
> 
> The introduction of IPPROTO_SMC enables eBPF programs to determine
> whether to use SMC based on the context of socket creation, such as
> network namespaces, PID and comm name, etc.
> 
> As a subsequent enhancement, this patch introduces a new hook for eBPF
> programs that allows decisions on whether to use SMC or not at runtime,
> including but not limited to local/remote IP address or ports. In
> simpler words, this feature allows modifications to syn_smc through eBPF
> programs before the TCP three-way handshake got established.
> 
> Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
> ---
>   include/linux/tcp.h   |   2 +-
>   include/net/smc.h     |  47 +++++++++++
>   include/net/tcp.h     |   6 ++
>   net/ipv4/tcp_input.c  |   3 +-
>   net/ipv4/tcp_output.c |  14 +++-
>   net/smc/Kconfig       |  12 +++
>   net/smc/Makefile      |   1 +
>   net/smc/af_smc.c      |  38 ++++++---
>   net/smc/smc.h         |   4 +
>   net/smc/smc_bpf.c     | 212 ++++++++++++++++++++++++++++++++++++++++++++++++++
>   net/smc/smc_bpf.h     |  34 ++++++++
>   11 files changed, 357 insertions(+), 16 deletions(-)
>   create mode 100644 net/smc/smc_bpf.c
>   create mode 100644 net/smc/smc_bpf.h
> 
> diff --git a/include/linux/tcp.h b/include/linux/tcp.h
> index 6a5e08b..4ef160a 100644
> --- a/include/linux/tcp.h
> +++ b/include/linux/tcp.h
> @@ -478,7 +478,7 @@ struct tcp_sock {
>   #endif
>   #if IS_ENABLED(CONFIG_SMC)
>   	bool	syn_smc;	/* SYN includes SMC */
> -	bool	(*smc_hs_congested)(const struct sock *sk);
> +	struct tcpsmc_ctx *smc;
>   #endif
>   
>   #if defined(CONFIG_TCP_MD5SIG) || defined(CONFIG_TCP_AO)
> diff --git a/include/net/smc.h b/include/net/smc.h
> index db84e4e..34ab2c6 100644
> --- a/include/net/smc.h
> +++ b/include/net/smc.h
> @@ -18,6 +18,8 @@
>   #include "linux/ism.h"
>   
>   struct sock;
> +struct tcp_sock;
> +struct inet_request_sock;
>   
>   #define SMC_MAX_PNETID_LEN	16	/* Max. length of PNET id */
>   
> @@ -97,4 +99,49 @@ struct smcd_dev {
>   	u8 going_away : 1;
>   };
>   
> +/*
> + * This structure is used to store the parameters passed to the member of struct_ops.
> + * Due to the BPF verifier cannot restrict the writing of bit fields, such as limiting
> + * it to only write ireq->smc_ok. Using kfunc can solve this issue, but we don't want
> + * to introduce a kfunc with such a narrow function.

imo, adding kfunc is fine.

> + *
> + * Moreover, using this structure for unified parameters also addresses another
> + * potential issue. Currently, kfunc cannot recognize the calling context
> + * through BPF's existing structure. In the future, we can solve this problem
> + * by passing this ctx to kfunc.

This part I don't understand. How is it different from the "tcp_cubic_kfunc_set" 
allowed in tcp_congestion_ops?

> + */
> +struct smc_bpf_ops_ctx {
> +	struct {
> +		struct tcp_sock *tp;
> +	} set_option;
> +	struct {
> +		const struct tcp_sock *tp;
> +		struct inet_request_sock *ireq;
> +		int smc_ok;
> +	} set_option_cond;
> +};

There is no need to create one single ctx for struct_ops prog. struct_ops prog 
can take >1 args and different ops can take different args.

> +
> +struct smc_bpf_ops {
> +	/* priavte */
> +
> +	struct list_head	list;
> +
> +	/* public */
> +
> +	/* Invoked before computing SMC option for SYN packets.
> +	 * We can control whether to set SMC options by modifying
> +	 * ctx->set_option->tp->syn_smc.
> +	 * This's also the only member that can be modified now.
> +	 * Only member in ctx->set_option is valid for this callback.
> +	 */
> +	void (*set_option)(struct smc_bpf_ops_ctx *ctx);
> +
> +	/* Invoked before Set up SMC options for SYN-ACK packets
> +	 * We can control whether to respond SMC options by modifying
> +	 * ctx->set_option_cond.smc_ok.
> +	 * Only member in ctx->set_option_cond is valid for this callback.
> +	 */
> +	void (*set_option_cond)(struct smc_bpf_ops_ctx *ctx);

The struct smc_bpf_ops already has set_option and set_option_cnd, but...

> +};
> +
>   #endif	/* _SMC_H */
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index 739a9fb..c322443 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -2730,6 +2730,12 @@ static inline void tcp_bpf_rtt(struct sock *sk, long mrtt, u32 srtt)
>   
>   #if IS_ENABLED(CONFIG_SMC)
>   extern struct static_key_false tcp_have_smc;
> +struct tcpsmc_ctx {
> +	/* Invoked before computing SMC option for SYN packets. */
> +	void (*set_option)(struct tcp_sock *tp);
> +	/* Invoked before Set up SMC options for SYN-ACK packets */
> +	void (*set_option_cond)(const struct tcp_sock *tp, struct inet_request_sock *ireq);
> +};

another new struct tcpsmc_ctx has exactly the same functions (at least the same 
name) but different arguments. I don't understand why this duplicate, is it 
because the need to prepare the "struct smc_bpf_ops_ctx"?

The "struct tcpsmc_ctx" should be the "struct smc_bpf_ops" itself.

[ ... ]

> +static int smc_bpf_ops_btf_struct_access(struct bpf_verifier_log *log,
> +					 const struct bpf_reg_state *reg,
> +					 const struct bpf_prog *prog,
> +					 int off, int size)
> +{
> +	const struct btf_member *member;
> +	const char *mname;
> +	int member_idx;
> +
> +	member_idx = prog->expected_attach_type;
> +	if (member_idx >= btf_type_vlen(smc_bpf_ops_type))
> +		goto out_err;
> +
> +	member = &btf_type_member(smc_bpf_ops_type)[member_idx];
> +	mname = btf_str_by_offset(saved_btf, member->name_off);
> +
> +	if (!strcmp(mname, "set_option")) {

btf_member_bit_offset can be used instead of strcmp. Take a look at bpf_tcp_ca.c 
and kernel/sched/ext.c

> +		/* only support to modify tcp_sock->syn_smc */
> +		if (reg->btf_id == tcp_sock_id &&
> +		    off == offsetof(struct tcp_sock, syn_smc) &&
> +		    off + size == offsetofend(struct tcp_sock, syn_smc))
> +			return 0;
> +	} else if (!strcmp(mname, "set_option_cond")) {
> +		/* only support to modify smc_bpf_ops_ctx->smc_ok */
> +		if (reg->btf_id == smc_bpf_ops_ctx_id &&
> +		    off == offsetof(struct smc_bpf_ops_ctx, set_option_cond.smc_ok) &&
> +		    off + size == offsetofend(struct smc_bpf_ops_ctx, set_option_cond.smc_ok))
> +			return 0;
> +	}
> +
> +out_err:
> +	return -EACCES;
> +}
> +
> +static const struct bpf_verifier_ops smc_bpf_verifier_ops = {
> +	.get_func_proto = bpf_base_func_proto,
> +	.is_valid_access = bpf_tracing_btf_ctx_access,
> +	.btf_struct_access = smc_bpf_ops_btf_struct_access,
> +};
> +
> +static struct bpf_struct_ops bpf_smc_bpf_ops = {
> +	.init = smc_bpf_ops_init,
> +	.name = "smc_bpf_ops",
> +	.reg = smc_bpf_ops_reg,
> +	.unreg = smc_bpf_ops_unreg,
> +	.cfi_stubs = &__bpf_smc_bpf_ops,
> +	.verifier_ops = &smc_bpf_verifier_ops,
> +	.init_member = smc_bpf_ops_init_member,
> +	.check_member = smc_bpf_ops_check_member,
> +	.owner = THIS_MODULE,
> +};
> +
> +int smc_bpf_struct_ops_init(void)
> +{
> +	return register_bpf_struct_ops(&bpf_smc_bpf_ops, smc_bpf_ops);
> +}
> +
> +void bpf_smc_set_tcp_option(struct tcp_sock *tp)
> +{
> +	struct smc_bpf_ops_ctx ops_ctx = {};
> +	struct smc_bpf_ops *ops;
> +
> +	ops_ctx.set_option.tp = tp;

All this initialization should be unnecessary. Directly pass tp instead.

> +
> +	rcu_read_lock();
> +	list_for_each_entry_rcu(ops, &smc_bpf_ops_list, list) {

Does it need to have a list (meaning >1) of smc_bpf_ops to act on a sock? The 
ordering expectation is hard to manage.

> +		ops->set_option(&ops_ctx);

A dumb question. This will only affect AF_SMC (or AF_INET[6]/IPPROTO_SMC) 
socket but not the AF_INET[6]/IPPROTO_{TCP,UDP} socket?

pw-bot: cr

> +	}
> +	rcu_read_unlock();
> +}

  reply	other threads:[~2024-10-25  0:27 UTC|newest]

Thread overview: 20+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-10-24  2:42 [PATCH bpf-next 0/4] net/smc: Introduce smc_bpf_ops D. Wythe
2024-10-24  2:42 ` [PATCH bpf-next 1/4] bpf: export necessary sympols for modules D. Wythe
2024-10-24  2:42 ` [PATCH bpf-next 2/4] bpf: allow to access bpf_prog during bpf_struct_access D. Wythe
2024-10-25  9:14   ` kernel test robot
2024-10-25 12:20   ` kernel test robot
2024-10-24  2:42 ` [PATCH net-next 3/4] net/smc: Introduce smc_bpf_ops D. Wythe
2024-10-25  0:26   ` Martin KaFai Lau [this message]
2024-10-25 11:05     ` D. Wythe
2024-10-25 18:30       ` Martin KaFai Lau
2024-10-29  8:53         ` D. Wythe
2024-10-24  2:42 ` [PATCH bpf-next 4/4] bpf/selftests: add simple selftest for bpf_smc_ops D. Wythe
2024-10-24  4:04   ` D. Wythe
2024-10-24  4:49     ` Tianchen Ding
2024-10-24  5:49       ` D. Wythe
2024-11-03 13:01   ` Zhu Yanjun
2024-11-21  2:00     ` D. Wythe
2024-11-25 10:52       ` Zhu Yanjun
2024-11-25 23:32         ` Martin KaFai Lau
2024-11-26  8:29           ` Zhu Yanjun
2024-11-29  4:11           ` D. Wythe

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=74c06b43-095f-414a-b4aa-2addbe610336@linux.dev \
    --to=martin.lau@linux.dev \
    --cc=alibuda@linux.alibaba.com \
    --cc=andrii@kernel.org \
    --cc=ast@kernel.org \
    --cc=bpf@vger.kernel.org \
    --cc=daniel@iogearbox.net \
    --cc=davem@davemloft.net \
    --cc=dtcccc@linux.alibaba.com \
    --cc=edumazet@google.com \
    --cc=guwen@linux.alibaba.com \
    --cc=haoluo@google.com \
    --cc=jaka@linux.ibm.com \
    --cc=john.fastabend@gmail.com \
    --cc=jolsa@kernel.org \
    --cc=kgraul@linux.ibm.com \
    --cc=kpsingh@kernel.org \
    --cc=kuba@kernel.org \
    --cc=linux-rdma@vger.kernel.org \
    --cc=linux-s390@vger.kernel.org \
    --cc=netdev@vger.kernel.org \
    --cc=pabeni@redhat.com \
    --cc=sdf@google.com \
    --cc=song@kernel.org \
    --cc=wenjia@linux.ibm.com \
    --cc=yhs@fb.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).