* [PATCH bpf-next v6 0/5] net/smc: Introduce smc_ops
@ 2025-01-16 7:44 D. Wythe
2025-01-16 7:44 ` [PATCH bpf-next v6 1/5] bpf: export necessary sympols for modules with struct_ops D. Wythe
` (4 more replies)
0 siblings, 5 replies; 17+ messages in thread
From: D. Wythe @ 2025-01-16 7:44 UTC (permalink / raw)
To: kgraul, wenjia, jaka, ast, daniel, andrii, martin.lau, pabeni,
song, sdf, haoluo, yhs, edumazet, john.fastabend, kpsingh, jolsa,
guwen
Cc: kuba, davem, netdev, linux-s390, linux-rdma, bpf
This patch aims to introduce BPF injection capabilities for SMC and
includes a self-test to ensure code stability.
Since the SMC protocol isn't ideal for every situation, especially
short-lived ones, most applications can't guarantee the absence of
such scenarios. Consequently, applications may need specific strategies
to decide whether to use SMC. For example, an application might limit SMC
usage to certain IP addresses or ports.
To maintain the principle of transparent replacement, we want applications
to remain unaffected even if they need specific SMC strategies. In other
words, they should not require recompilation of their code.
Additionally, we need to ensure the scalability of strategy implementation.
While using socket options or sysctl might be straightforward, it could
complicate future expansions.
Fortunately, BPF addresses these concerns effectively. Users can write
their own strategies in eBPF to determine whether to use SMC, and they can
easily modify those strategies in the future.
v2:
1. Rename smc_bpf_ops to smc_ops.
2. Change the scope of smc_ops from global to per netns.
3. Directly pass parameters to ops instead of smc_ops_ctx.
4. Remove struct smc_ops_ctx.
5. Remove exports that are no longer needed.
v3:
1. Remove find_ksym_btf_id_by_prefix_kind.
2. Enhance selftest, introduce a complete ops for filtering smc
connections based on ip pairs and a realistic topology test
to verify it.
v4:
1. Remove unless func: smc_bpf_ops_check_member()
2. Remove unless inline func: smc_ops_find_by_name()
3. Change CONFIG_SMC=y to complete CI testing
4. Change smc_sock to smc_sock___local in test to avoid
compiling failed with CONFIG_SMC=y
5. Improve test cases, remove unnecessary timeouts and multi-thread
test, using network_helpers to start testing between server and
client.
6. Fix issues when the return value of the ops function is neither 0
nor 1.
v5:
1. Fix incorrect CI config from CONFIG_SMC=Y to CONFIG_SMC=y.
v6:
1. Fix some spelling errors and code styles.
2. Fix test failed under s390x.
3. New approach to fix libbpf exceptions.
4. Fix warning & failure when compiling "Introduce generic hook smc_ops".
D. Wythe (5):
bpf: export necessary sympols for modules with struct_ops
net/smc: Introduce generic hook smc_ops
net/smc: bpf: register smc_ops info struct_ops
libbpf: fix error when st-prefix_ops and ops from differ btf
bpf/selftests: add selftest for bpf_smc_ops
include/net/netns/smc.h | 3 +
include/net/smc.h | 53 +++
kernel/bpf/bpf_struct_ops.c | 2 +
kernel/bpf/syscall.c | 1 +
net/ipv4/tcp_output.c | 18 +-
net/smc/Kconfig | 12 +
net/smc/Makefile | 1 +
net/smc/af_smc.c | 11 +
net/smc/smc_ops.c | 131 ++++++
net/smc/smc_ops.h | 33 ++
net/smc/smc_sysctl.c | 94 +++++
tools/lib/bpf/libbpf.c | 41 +-
tools/testing/selftests/bpf/config | 4 +
.../selftests/bpf/prog_tests/test_bpf_smc.c | 397 ++++++++++++++++++
tools/testing/selftests/bpf/progs/bpf_smc.c | 117 ++++++
15 files changed, 896 insertions(+), 22 deletions(-)
create mode 100644 net/smc/smc_ops.c
create mode 100644 net/smc/smc_ops.h
create mode 100644 tools/testing/selftests/bpf/prog_tests/test_bpf_smc.c
create mode 100644 tools/testing/selftests/bpf/progs/bpf_smc.c
--
2.45.0
^ permalink raw reply [flat|nested] 17+ messages in thread
* [PATCH bpf-next v6 1/5] bpf: export necessary sympols for modules with struct_ops
2025-01-16 7:44 [PATCH bpf-next v6 0/5] net/smc: Introduce smc_ops D. Wythe
@ 2025-01-16 7:44 ` D. Wythe
2025-01-16 7:44 ` [PATCH bpf-next v6 2/5] net/smc: Introduce generic hook smc_ops D. Wythe
` (3 subsequent siblings)
4 siblings, 0 replies; 17+ messages in thread
From: D. Wythe @ 2025-01-16 7:44 UTC (permalink / raw)
To: kgraul, wenjia, jaka, ast, daniel, andrii, martin.lau, pabeni,
song, sdf, haoluo, yhs, edumazet, john.fastabend, kpsingh, jolsa,
guwen
Cc: kuba, davem, netdev, linux-s390, linux-rdma, bpf
Exports three necessary symbols for implementing struct_ops with
tristate subsystem.
To hold or release refcnt of struct_ops refcnt by inline funcs
bpf_try_module_get and bpf_module_put which use bpf_struct_ops_get(put)
conditionally.
And to copy obj name from one to the other with effective checks by
bpf_obj_name_cpy.
Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
---
kernel/bpf/bpf_struct_ops.c | 2 ++
kernel/bpf/syscall.c | 1 +
2 files changed, 3 insertions(+)
diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
index 606efe32485a..00c212e0ad39 100644
--- a/kernel/bpf/bpf_struct_ops.c
+++ b/kernel/bpf/bpf_struct_ops.c
@@ -1119,6 +1119,7 @@ bool bpf_struct_ops_get(const void *kdata)
map = __bpf_map_inc_not_zero(&st_map->map, false);
return !IS_ERR(map);
}
+EXPORT_SYMBOL_GPL(bpf_struct_ops_get);
void bpf_struct_ops_put(const void *kdata)
{
@@ -1130,6 +1131,7 @@ void bpf_struct_ops_put(const void *kdata)
bpf_map_put(&st_map->map);
}
+EXPORT_SYMBOL_GPL(bpf_struct_ops_put);
int bpf_struct_ops_supported(const struct bpf_struct_ops *st_ops, u32 moff)
{
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 5684e8ce132d..62238ec989dc 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1167,6 +1167,7 @@ int bpf_obj_name_cpy(char *dst, const char *src, unsigned int size)
return src - orig_src;
}
+EXPORT_SYMBOL_GPL(bpf_obj_name_cpy);
int map_check_no_btf(const struct bpf_map *map,
const struct btf *btf,
--
2.45.0
^ permalink raw reply related [flat|nested] 17+ messages in thread
* [PATCH bpf-next v6 2/5] net/smc: Introduce generic hook smc_ops
2025-01-16 7:44 [PATCH bpf-next v6 0/5] net/smc: Introduce smc_ops D. Wythe
2025-01-16 7:44 ` [PATCH bpf-next v6 1/5] bpf: export necessary sympols for modules with struct_ops D. Wythe
@ 2025-01-16 7:44 ` D. Wythe
2025-01-16 12:22 ` Dust Li
2025-01-17 23:50 ` Martin KaFai Lau
2025-01-16 7:44 ` [PATCH bpf-next v6 3/5] net/smc: bpf: register smc_ops info struct_ops D. Wythe
` (2 subsequent siblings)
4 siblings, 2 replies; 17+ messages in thread
From: D. Wythe @ 2025-01-16 7:44 UTC (permalink / raw)
To: kgraul, wenjia, jaka, ast, daniel, andrii, martin.lau, pabeni,
song, sdf, haoluo, yhs, edumazet, john.fastabend, kpsingh, jolsa,
guwen
Cc: kuba, davem, netdev, linux-s390, linux-rdma, bpf
The introduction of IPPROTO_SMC enables eBPF programs to determine
whether to use SMC based on the context of socket creation, such as
network namespaces, PID and comm name, etc.
As a subsequent enhancement, to introduce a new generic hook that
allows decisions on whether to use SMC or not at runtime, including
but not limited to local/remote IP address or ports.
Moreover, in the future, we can achieve more complex extensions to the
protocol stack by extending this ops.
Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
---
include/net/netns/smc.h | 3 ++
include/net/smc.h | 53 +++++++++++++++++++++++
net/ipv4/tcp_output.c | 18 ++++++--
net/smc/Kconfig | 12 ++++++
net/smc/Makefile | 1 +
net/smc/smc_ops.c | 53 +++++++++++++++++++++++
net/smc/smc_ops.h | 28 ++++++++++++
net/smc/smc_sysctl.c | 94 +++++++++++++++++++++++++++++++++++++++++
8 files changed, 258 insertions(+), 4 deletions(-)
create mode 100644 net/smc/smc_ops.c
create mode 100644 net/smc/smc_ops.h
diff --git a/include/net/netns/smc.h b/include/net/netns/smc.h
index fc752a50f91b..81b3fdb39cd2 100644
--- a/include/net/netns/smc.h
+++ b/include/net/netns/smc.h
@@ -17,6 +17,9 @@ struct netns_smc {
#ifdef CONFIG_SYSCTL
struct ctl_table_header *smc_hdr;
#endif
+#if IS_ENABLED(CONFIG_SMC_OPS)
+ struct smc_ops __rcu *ops;
+#endif /* CONFIG_SMC_OPS */
unsigned int sysctl_autocorking_size;
unsigned int sysctl_smcr_buf_type;
int sysctl_smcr_testlink_time;
diff --git a/include/net/smc.h b/include/net/smc.h
index db84e4e35080..271838591b63 100644
--- a/include/net/smc.h
+++ b/include/net/smc.h
@@ -18,6 +18,8 @@
#include "linux/ism.h"
struct sock;
+struct tcp_sock;
+struct inet_request_sock;
#define SMC_MAX_PNETID_LEN 16 /* Max. length of PNET id */
@@ -97,4 +99,55 @@ struct smcd_dev {
u8 going_away : 1;
};
+#define SMC_OPS_NAME_MAX 16
+
+enum {
+ /* ops can be inherit from init_net */
+ SMC_OPS_FLAG_INHERITABLE = 0x1,
+
+ SMC_OPS_ALL_FLAGS = SMC_OPS_FLAG_INHERITABLE,
+};
+
+struct smc_ops {
+ /* priavte */
+
+ struct list_head list;
+ struct module *owner;
+
+ /* public */
+
+ /* unique name */
+ char name[SMC_OPS_NAME_MAX];
+ int flags;
+
+ /* Invoked before computing SMC option for SYN packets.
+ * We can control whether to set SMC options by returning varios value.
+ * Return 0 to disable SMC, or return any other value to enable it.
+ */
+ int (*set_option)(struct tcp_sock *tp);
+
+ /* Invoked before Set up SMC options for SYN-ACK packets
+ * We can control whether to respond SMC options by returning varios
+ * value. Return 0 to disable SMC, or return any other value to enable
+ * it.
+ */
+ int (*set_option_cond)(const struct tcp_sock *tp,
+ struct inet_request_sock *ireq);
+};
+
+#if IS_ENABLED(CONFIG_SMC_OPS)
+#define smc_call_retops(init_val, sk, func, ...) ({ \
+ typeof(init_val) __ret = (init_val); \
+ struct smc_ops *ops; \
+ rcu_read_lock(); \
+ ops = READ_ONCE(sock_net(sk)->smc.ops); \
+ if (ops && ops->func) \
+ __ret = ops->func(__VA_ARGS__); \
+ rcu_read_unlock(); \
+ !!__ret; \
+})
+#else
+#define smc_call_retops(init_val, ...) (init_val)
+#endif /* CONFIG_SMC_OPS */
+
#endif /* _SMC_H */
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 5485a70b5fe5..f2c397841553 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -40,6 +40,7 @@
#include <net/tcp.h>
#include <net/mptcp.h>
#include <net/proto_memory.h>
+#include <net/smc.h>
#include <linux/compiler.h>
#include <linux/gfp.h>
@@ -759,14 +760,18 @@ static void tcp_options_write(struct tcphdr *th, struct tcp_sock *tp,
mptcp_options_write(th, ptr, tp, opts);
}
-static void smc_set_option(const struct tcp_sock *tp,
+static void smc_set_option(struct tcp_sock *tp,
struct tcp_out_options *opts,
unsigned int *remaining)
{
#if IS_ENABLED(CONFIG_SMC)
+ struct sock *sk = &tp->inet_conn.icsk_inet.sk;
if (static_branch_unlikely(&tcp_have_smc)) {
if (tp->syn_smc) {
- if (*remaining >= TCPOLEN_EXP_SMC_BASE_ALIGNED) {
+ tp->syn_smc = smc_call_retops(1, sk, set_option, tp);
+ /* re-check syn_smc */
+ if (tp->syn_smc &&
+ *remaining >= TCPOLEN_EXP_SMC_BASE_ALIGNED) {
opts->options |= OPTION_SMC;
*remaining -= TCPOLEN_EXP_SMC_BASE_ALIGNED;
}
@@ -776,14 +781,19 @@ static void smc_set_option(const struct tcp_sock *tp,
}
static void smc_set_option_cond(const struct tcp_sock *tp,
- const struct inet_request_sock *ireq,
+ struct inet_request_sock *ireq,
struct tcp_out_options *opts,
unsigned int *remaining)
{
#if IS_ENABLED(CONFIG_SMC)
+ const struct sock *sk = &tp->inet_conn.icsk_inet.sk;
if (static_branch_unlikely(&tcp_have_smc)) {
if (tp->syn_smc && ireq->smc_ok) {
- if (*remaining >= TCPOLEN_EXP_SMC_BASE_ALIGNED) {
+ ireq->smc_ok = smc_call_retops(1, sk, set_option_cond,
+ tp, ireq);
+ /* re-check smc_ok */
+ if (ireq->smc_ok &&
+ *remaining >= TCPOLEN_EXP_SMC_BASE_ALIGNED) {
opts->options |= OPTION_SMC;
*remaining -= TCPOLEN_EXP_SMC_BASE_ALIGNED;
}
diff --git a/net/smc/Kconfig b/net/smc/Kconfig
index ba5e6a2dd2fd..27f35064d04c 100644
--- a/net/smc/Kconfig
+++ b/net/smc/Kconfig
@@ -33,3 +33,15 @@ config SMC_LO
of architecture or hardware.
if unsure, say N.
+
+config SMC_OPS
+ bool "Generic hook for SMC subsystem"
+ depends on SMC && BPF_SYSCALL
+ default n
+ help
+ SMC_OPS enables support to register generic hook via eBPF programs
+ for SMC subsystem. eBPF programs offer much greater flexibility
+ in modifying the behavior of the SMC protocol stack compared
+ to a complete kernel-based approach.
+
+ if unsure, say N.
diff --git a/net/smc/Makefile b/net/smc/Makefile
index 60f1c87d5212..5dd706b2927a 100644
--- a/net/smc/Makefile
+++ b/net/smc/Makefile
@@ -7,3 +7,4 @@ smc-y += smc_cdc.o smc_tx.o smc_rx.o smc_close.o smc_ism.o smc_netlink.o smc_sta
smc-y += smc_tracepoint.o smc_inet.o
smc-$(CONFIG_SYSCTL) += smc_sysctl.o
smc-$(CONFIG_SMC_LO) += smc_loopback.o
+smc-$(CONFIG_SMC_OPS) += smc_ops.o
\ No newline at end of file
diff --git a/net/smc/smc_ops.c b/net/smc/smc_ops.c
new file mode 100644
index 000000000000..86c71f6c5ea6
--- /dev/null
+++ b/net/smc/smc_ops.c
@@ -0,0 +1,53 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Shared Memory Communications over RDMA (SMC-R) and RoCE
+ *
+ * Generic hook for SMC subsystem.
+ *
+ * Copyright IBM Corp. 2016
+ * Copyright (c) 2024, Alibaba Inc.
+ *
+ * Author: D. Wythe <alibuda@linux.alibaba.com>
+ */
+
+#include <linux/rculist.h>
+
+#include "smc_ops.h"
+
+static DEFINE_SPINLOCK(smc_ops_list_lock);
+static LIST_HEAD(smc_ops_list);
+
+int smc_ops_reg(struct smc_ops *ops)
+{
+ int ret = 0;
+
+ spin_lock(&smc_ops_list_lock);
+ /* already exist or duplicate name */
+ if (smc_ops_find_by_name(ops->name))
+ ret = -EEXIST;
+ else
+ list_add_tail_rcu(&ops->list, &smc_ops_list);
+ spin_unlock(&smc_ops_list_lock);
+ return ret;
+}
+
+void smc_ops_unreg(struct smc_ops *ops)
+{
+ spin_lock(&smc_ops_list_lock);
+ list_del_rcu(&ops->list);
+ spin_unlock(&smc_ops_list_lock);
+
+ /* Ensure that all readers to complete */
+ synchronize_rcu();
+}
+
+struct smc_ops *smc_ops_find_by_name(const char *name)
+{
+ struct smc_ops *ops;
+
+ list_for_each_entry_rcu(ops, &smc_ops_list, list) {
+ if (strcmp(ops->name, name) == 0)
+ return ops;
+ }
+ return NULL;
+}
diff --git a/net/smc/smc_ops.h b/net/smc/smc_ops.h
new file mode 100644
index 000000000000..24f094464b45
--- /dev/null
+++ b/net/smc/smc_ops.h
@@ -0,0 +1,28 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Shared Memory Communications over RDMA (SMC-R) and RoCE
+ *
+ * Generic hook for SMC subsystem.
+ *
+ * Copyright IBM Corp. 2016
+ * Copyright (c) 2024, Alibaba Inc.
+ *
+ * Author: D. Wythe <alibuda@linux.alibaba.com>
+ */
+
+#ifndef __SMC_OPS
+#define __SMC_OPS
+
+#include <net/smc.h>
+
+int smc_ops_reg(struct smc_ops *ops);
+void smc_ops_unreg(struct smc_ops *ops);
+
+/* Find ops by the target name, which required to be a c-string.
+ * Return NULL if no such ops was found,otherwise, return a valid ops.
+ *
+ * Note: Caller MUST ensure it's was invoked under rcu_read_lock.
+ */
+struct smc_ops *smc_ops_find_by_name(const char *name);
+
+#endif /* __SMC_OPS */
diff --git a/net/smc/smc_sysctl.c b/net/smc/smc_sysctl.c
index 2fab6456f765..2004241c3045 100644
--- a/net/smc/smc_sysctl.c
+++ b/net/smc/smc_sysctl.c
@@ -18,6 +18,7 @@
#include "smc_core.h"
#include "smc_llc.h"
#include "smc_sysctl.h"
+#include "smc_ops.h"
static int min_sndbuf = SMC_BUF_MIN_SIZE;
static int min_rcvbuf = SMC_BUF_MIN_SIZE;
@@ -30,6 +31,69 @@ static int links_per_lgr_max = SMC_LINKS_ADD_LNK_MAX;
static int conns_per_lgr_min = SMC_CONN_PER_LGR_MIN;
static int conns_per_lgr_max = SMC_CONN_PER_LGR_MAX;
+#if IS_ENABLED(CONFIG_SMC_OPS)
+static int smc_net_replace_smc_ops(struct net *net, const char *name)
+{
+ struct smc_ops *ops = NULL;
+
+ rcu_read_lock();
+ /* null or empty name ask to clear current ops */
+ if (name && name[0]) {
+ ops = smc_ops_find_by_name(name);
+ if (!ops) {
+ rcu_read_unlock();
+ return -EINVAL;
+ }
+ /* no change, just return */
+ if (ops == rcu_dereference(net->smc.ops)) {
+ rcu_read_unlock();
+ return 0;
+ }
+ }
+ if (!ops || bpf_try_module_get(ops, ops->owner)) {
+ /* xhcg */
+ ops = rcu_replace_pointer(net->smc.ops, ops, true);
+ /* release old ops */
+ if (ops)
+ bpf_module_put(ops, ops->owner);
+ } else if (ops) {
+ rcu_read_unlock();
+ return -EBUSY;
+ }
+ rcu_read_unlock();
+ return 0;
+}
+
+static int proc_smc_ops(const struct ctl_table *ctl, int write,
+ void *buffer, size_t *lenp, loff_t *ppos)
+{
+ struct net *net = container_of(ctl->data, struct net, smc.ops);
+ char val[SMC_OPS_NAME_MAX];
+ const struct ctl_table tbl = {
+ .data = val,
+ .maxlen = SMC_OPS_NAME_MAX,
+ };
+ struct smc_ops *ops;
+ int ret;
+
+ rcu_read_lock();
+ ops = rcu_dereference(net->smc.ops);
+ if (ops)
+ memcpy(val, ops->name, sizeof(ops->name));
+ else
+ val[0] = '\0';
+ rcu_read_unlock();
+
+ ret = proc_dostring(&tbl, write, buffer, lenp, ppos);
+ if (ret)
+ return ret;
+
+ if (write)
+ ret = smc_net_replace_smc_ops(net, val);
+ return ret;
+}
+#endif /* CONFIG_SMC_OPS */
+
static struct ctl_table smc_table[] = {
{
.procname = "autocorking_size",
@@ -99,6 +163,15 @@ static struct ctl_table smc_table[] = {
.extra1 = SYSCTL_ZERO,
.extra2 = SYSCTL_ONE,
},
+#if IS_ENABLED(CONFIG_SMC_OPS)
+ {
+ .procname = "ops",
+ .data = &init_net.smc.ops,
+ .mode = 0644,
+ .maxlen = SMC_OPS_NAME_MAX,
+ .proc_handler = proc_smc_ops,
+ },
+#endif /* CONFIG_SMC_OPS */
};
int __net_init smc_sysctl_net_init(struct net *net)
@@ -109,6 +182,20 @@ int __net_init smc_sysctl_net_init(struct net *net)
table = smc_table;
if (!net_eq(net, &init_net)) {
int i;
+#if IS_ENABLED(CONFIG_SMC_OPS)
+ struct smc_ops *ops;
+
+ rcu_read_lock();
+ ops = rcu_dereference(init_net.smc.ops);
+ if (ops && ops->flags & SMC_OPS_FLAG_INHERITABLE) {
+ if (!bpf_try_module_get(ops, ops->owner)) {
+ rcu_read_unlock();
+ return -EBUSY;
+ }
+ rcu_assign_pointer(net->smc.ops, ops);
+ }
+ rcu_read_unlock();
+#endif /* CONFIG_SMC_OPS */
table = kmemdup(table, sizeof(smc_table), GFP_KERNEL);
if (!table)
@@ -139,6 +226,9 @@ int __net_init smc_sysctl_net_init(struct net *net)
if (!net_eq(net, &init_net))
kfree(table);
err_alloc:
+#if IS_ENABLED(CONFIG_SMC_OPS)
+ smc_net_replace_smc_ops(net, NULL);
+#endif /* CONFIG_SMC_OPS */
return -ENOMEM;
}
@@ -148,6 +238,10 @@ void __net_exit smc_sysctl_net_exit(struct net *net)
table = net->smc.smc_hdr->ctl_table_arg;
unregister_net_sysctl_table(net->smc.smc_hdr);
+#if IS_ENABLED(CONFIG_SMC_OPS)
+ smc_net_replace_smc_ops(net, NULL);
+#endif /* CONFIG_SMC_OPS */
+
if (!net_eq(net, &init_net))
kfree(table);
}
--
2.45.0
^ permalink raw reply related [flat|nested] 17+ messages in thread
* [PATCH bpf-next v6 3/5] net/smc: bpf: register smc_ops info struct_ops
2025-01-16 7:44 [PATCH bpf-next v6 0/5] net/smc: Introduce smc_ops D. Wythe
2025-01-16 7:44 ` [PATCH bpf-next v6 1/5] bpf: export necessary sympols for modules with struct_ops D. Wythe
2025-01-16 7:44 ` [PATCH bpf-next v6 2/5] net/smc: Introduce generic hook smc_ops D. Wythe
@ 2025-01-16 7:44 ` D. Wythe
2025-01-16 7:44 ` [PATCH bpf-next v6 4/5] libbpf: fix error when st-prefix_ops and ops from differ btf D. Wythe
2025-01-16 7:44 ` [PATCH bpf-next v6 5/5] bpf/selftests: add selftest for bpf_smc_ops D. Wythe
4 siblings, 0 replies; 17+ messages in thread
From: D. Wythe @ 2025-01-16 7:44 UTC (permalink / raw)
To: kgraul, wenjia, jaka, ast, daniel, andrii, martin.lau, pabeni,
song, sdf, haoluo, yhs, edumazet, john.fastabend, kpsingh, jolsa,
guwen
Cc: kuba, davem, netdev, linux-s390, linux-rdma, bpf
To implement injection capability for smc via struct_ops, so that
user can make their own smc_ops to modify the behavior of smc stack.
Currently, user can write their own implememtion to choose whether to
use SMC or not before TCP 3rd handshake to be comleted. In the future,
users can implement more complex functions on smc by expanding it.
Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
---
net/smc/af_smc.c | 11 +++++++
net/smc/smc_ops.c | 78 +++++++++++++++++++++++++++++++++++++++++++++++
net/smc/smc_ops.h | 5 +++
3 files changed, 94 insertions(+)
diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
index 9d76e902fd77..8af7c5503fdf 100644
--- a/net/smc/af_smc.c
+++ b/net/smc/af_smc.c
@@ -55,6 +55,7 @@
#include "smc_sysctl.h"
#include "smc_loopback.h"
#include "smc_inet.h"
+#include "smc_ops.h"
static DEFINE_MUTEX(smc_server_lgr_pending); /* serialize link group
* creation on server
@@ -3576,8 +3577,18 @@ static int __init smc_init(void)
pr_err("%s: smc_inet_init fails with %d\n", __func__, rc);
goto out_ulp;
}
+
+ rc = smc_bpf_struct_ops_init();
+ if (rc) {
+ pr_err("%s: smc_bpf_struct_ops_init fails with %d\n", __func__,
+ rc);
+ goto out_inet;
+ }
+
static_branch_enable(&tcp_have_smc);
return 0;
+out_inet:
+ smc_inet_exit();
out_ulp:
tcp_unregister_ulp(&smc_ulp_ops);
out_lo:
diff --git a/net/smc/smc_ops.c b/net/smc/smc_ops.c
index 86c71f6c5ea6..adff580a74eb 100644
--- a/net/smc/smc_ops.c
+++ b/net/smc/smc_ops.c
@@ -10,6 +10,9 @@
* Author: D. Wythe <alibuda@linux.alibaba.com>
*/
+#include <linux/bpf_verifier.h>
+#include <linux/bpf.h>
+#include <linux/btf.h>
#include <linux/rculist.h>
#include "smc_ops.h"
@@ -51,3 +54,78 @@ struct smc_ops *smc_ops_find_by_name(const char *name)
}
return NULL;
}
+
+static int __bpf_smc_stub_set_tcp_option(struct tcp_sock *tp) { return 1; }
+static int __bpf_smc_stub_set_tcp_option_cond(const struct tcp_sock *tp,
+ struct inet_request_sock *ireq)
+{
+ return 1;
+}
+
+static struct smc_ops __bpf_smc_bpf_ops = {
+ .set_option = __bpf_smc_stub_set_tcp_option,
+ .set_option_cond = __bpf_smc_stub_set_tcp_option_cond,
+};
+
+static int smc_bpf_ops_init(struct btf *btf) { return 0; }
+
+static int smc_bpf_ops_reg(void *kdata, struct bpf_link *link)
+{
+ return smc_ops_reg(kdata);
+}
+
+static void smc_bpf_ops_unreg(void *kdata, struct bpf_link *link)
+{
+ smc_ops_unreg(kdata);
+}
+
+static int smc_bpf_ops_init_member(const struct btf_type *t,
+ const struct btf_member *member,
+ void *kdata, const void *udata)
+{
+ const struct smc_ops *u_ops;
+ struct smc_ops *k_ops;
+ u32 moff;
+
+ u_ops = (const struct smc_ops *)udata;
+ k_ops = (struct smc_ops *)kdata;
+
+ moff = __btf_member_bit_offset(t, member) / 8;
+ switch (moff) {
+ case offsetof(struct smc_ops, name):
+ if (bpf_obj_name_cpy(k_ops->name, u_ops->name,
+ sizeof(u_ops->name)) <= 0)
+ return -EINVAL;
+ return 1;
+ case offsetof(struct smc_ops, flags):
+ if (u_ops->flags & ~SMC_OPS_ALL_FLAGS)
+ return -EINVAL;
+ k_ops->flags = u_ops->flags;
+ return 1;
+ default:
+ break;
+ }
+
+ return 0;
+}
+
+static const struct bpf_verifier_ops smc_bpf_verifier_ops = {
+ .get_func_proto = bpf_base_func_proto,
+ .is_valid_access = bpf_tracing_btf_ctx_access,
+};
+
+static struct bpf_struct_ops bpf_smc_bpf_ops = {
+ .name = "smc_ops",
+ .init = smc_bpf_ops_init,
+ .reg = smc_bpf_ops_reg,
+ .unreg = smc_bpf_ops_unreg,
+ .cfi_stubs = &__bpf_smc_bpf_ops,
+ .verifier_ops = &smc_bpf_verifier_ops,
+ .init_member = smc_bpf_ops_init_member,
+ .owner = THIS_MODULE,
+};
+
+int smc_bpf_struct_ops_init(void)
+{
+ return register_bpf_struct_ops(&bpf_smc_bpf_ops, smc_ops);
+}
diff --git a/net/smc/smc_ops.h b/net/smc/smc_ops.h
index 24f094464b45..412f225fe6f3 100644
--- a/net/smc/smc_ops.h
+++ b/net/smc/smc_ops.h
@@ -24,5 +24,10 @@ void smc_ops_unreg(struct smc_ops *ops);
* Note: Caller MUST ensure it's was invoked under rcu_read_lock.
*/
struct smc_ops *smc_ops_find_by_name(const char *name);
+#if IS_ENABLED(CONFIG_SMC_OPS)
+int smc_bpf_struct_ops_init(void);
+#else
+static inline int smc_bpf_struct_ops_init(void) { return 0; }
+#endif
#endif /* __SMC_OPS */
--
2.45.0
^ permalink raw reply related [flat|nested] 17+ messages in thread
* [PATCH bpf-next v6 4/5] libbpf: fix error when st-prefix_ops and ops from differ btf
2025-01-16 7:44 [PATCH bpf-next v6 0/5] net/smc: Introduce smc_ops D. Wythe
` (2 preceding siblings ...)
2025-01-16 7:44 ` [PATCH bpf-next v6 3/5] net/smc: bpf: register smc_ops info struct_ops D. Wythe
@ 2025-01-16 7:44 ` D. Wythe
2025-01-17 18:36 ` Andrii Nakryiko
2025-01-16 7:44 ` [PATCH bpf-next v6 5/5] bpf/selftests: add selftest for bpf_smc_ops D. Wythe
4 siblings, 1 reply; 17+ messages in thread
From: D. Wythe @ 2025-01-16 7:44 UTC (permalink / raw)
To: kgraul, wenjia, jaka, ast, daniel, andrii, martin.lau, pabeni,
song, sdf, haoluo, yhs, edumazet, john.fastabend, kpsingh, jolsa,
guwen
Cc: kuba, davem, netdev, linux-s390, linux-rdma, bpf
When a struct_ops named xxx_ops was registered by a module, and
it will be used in both built-in modules and the module itself,
so that the btf_type of xxx_ops will be present in btf_vmlinux
instead of in btf_mod, which means that the btf_type of
bpf_struct_ops_xxx_ops and xxx_ops will not be in the same btf.
Here are four possible case:
+--------+---------------+-------------+---------------------------------+
| | st_ops_xxx_ops| xxx_ops | |
+--------+---------------+-------------+---------------------------------+
| case 0 | btf_vmlinux | bft_vmlinux | be used and reg only in vmlinux |
+--------+---------------+-------------+---------------------------------+
| case 1 | btf_vmlinux | bpf_mod | INVALID |
+--------+---------------+-------------+---------------------------------+
| case 2 | btf_mod | btf_vmlinux | reg in mod but be used both in |
| | | | vmlinux and mod. |
+--------+---------------+-------------+---------------------------------+
| case 3 | btf_mod | btf_mod | be used and reg only in mod |
+--------+---------------+-------------+---------------------------------+
At present, cases 0, 1, and 3 can be correctly identified, because
st_ops_xxx_ops is searched from the same btf with xxx_ops. In order to
handle case 2 correctly without affecting other cases, we cannot simply
change the search method for st_ops_xxx_ops from find_btf_by_prefix_kind()
to find_ksym_btf_id(), because in this way, case 1 will not be
recognized anymore.
To address the issue, we always look for st_ops_xxx_ops first,
figure out the btf, and then look for xxx_ops with the very btf to avoid
such issue.
Fixes: 590a00888250 ("bpf: libbpf: Add STRUCT_OPS support")
Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
---
tools/lib/bpf/libbpf.c | 41 +++++++++++++++++++++++------------------
1 file changed, 23 insertions(+), 18 deletions(-)
diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 66173ddb5a2d..202bc4c1001e 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -1005,14 +1005,33 @@ find_struct_ops_kern_types(struct bpf_object *obj, const char *tname_raw,
const struct btf_member *kern_data_member;
struct btf *btf = NULL;
__s32 kern_vtype_id, kern_type_id;
- char tname[256];
+ char tname[256], stname[256];
+ int ret;
__u32 i;
snprintf(tname, sizeof(tname), "%.*s",
(int)bpf_core_essential_name_len(tname_raw), tname_raw);
- kern_type_id = find_ksym_btf_id(obj, tname, BTF_KIND_STRUCT,
+ ret = snprintf(stname, sizeof(stname), "%s%s", STRUCT_OPS_VALUE_PREFIX,
+ tname);
+ if (ret < 0 || ret >= sizeof(stname))
+ return -ENAMETOOLONG;
+
+ /* Look for the corresponding "map_value" type that will be used
+ * in map_update(BPF_MAP_TYPE_STRUCT_OPS) first, figure out the btf
+ * and the mod_btf.
+ * For example, find "struct bpf_struct_ops_tcp_congestion_ops".
+ */
+ kern_vtype_id = find_ksym_btf_id(obj, stname, BTF_KIND_STRUCT,
&btf, mod_btf);
+ if (kern_vtype_id < 0) {
+ pr_warn("struct_ops init_kern: struct %s is not found in kernel BTF\n",
+ stname);
+ return kern_vtype_id;
+ }
+ kern_vtype = btf__type_by_id(btf, kern_vtype_id);
+
+ kern_type_id = btf__find_by_name_kind(btf, tname, BTF_KIND_STRUCT);
if (kern_type_id < 0) {
pr_warn("struct_ops init_kern: struct %s is not found in kernel BTF\n",
tname);
@@ -1020,20 +1039,6 @@ find_struct_ops_kern_types(struct bpf_object *obj, const char *tname_raw,
}
kern_type = btf__type_by_id(btf, kern_type_id);
- /* Find the corresponding "map_value" type that will be used
- * in map_update(BPF_MAP_TYPE_STRUCT_OPS). For example,
- * find "struct bpf_struct_ops_tcp_congestion_ops" from the
- * btf_vmlinux.
- */
- kern_vtype_id = find_btf_by_prefix_kind(btf, STRUCT_OPS_VALUE_PREFIX,
- tname, BTF_KIND_STRUCT);
- if (kern_vtype_id < 0) {
- pr_warn("struct_ops init_kern: struct %s%s is not found in kernel BTF\n",
- STRUCT_OPS_VALUE_PREFIX, tname);
- return kern_vtype_id;
- }
- kern_vtype = btf__type_by_id(btf, kern_vtype_id);
-
/* Find "struct tcp_congestion_ops" from
* struct bpf_struct_ops_tcp_congestion_ops {
* [ ... ]
@@ -1046,8 +1051,8 @@ find_struct_ops_kern_types(struct bpf_object *obj, const char *tname_raw,
break;
}
if (i == btf_vlen(kern_vtype)) {
- pr_warn("struct_ops init_kern: struct %s data is not found in struct %s%s\n",
- tname, STRUCT_OPS_VALUE_PREFIX, tname);
+ pr_warn("struct_ops init_kern: struct %s data is not found in struct %s\n",
+ tname, stname);
return -EINVAL;
}
--
2.45.0
^ permalink raw reply related [flat|nested] 17+ messages in thread
* [PATCH bpf-next v6 5/5] bpf/selftests: add selftest for bpf_smc_ops
2025-01-16 7:44 [PATCH bpf-next v6 0/5] net/smc: Introduce smc_ops D. Wythe
` (3 preceding siblings ...)
2025-01-16 7:44 ` [PATCH bpf-next v6 4/5] libbpf: fix error when st-prefix_ops and ops from differ btf D. Wythe
@ 2025-01-16 7:44 ` D. Wythe
2025-01-18 0:37 ` Martin KaFai Lau
2025-01-21 6:42 ` Saket Kumar Bhaskar
4 siblings, 2 replies; 17+ messages in thread
From: D. Wythe @ 2025-01-16 7:44 UTC (permalink / raw)
To: kgraul, wenjia, jaka, ast, daniel, andrii, martin.lau, pabeni,
song, sdf, haoluo, yhs, edumazet, john.fastabend, kpsingh, jolsa,
guwen
Cc: kuba, davem, netdev, linux-s390, linux-rdma, bpf
This tests introduces a tiny smc_ops for filtering SMC connections based on
IP pairs, and also adds a realistic topology model to verify this ops.
Also, we can only use SMC loopback under CI test, so an
additional configuration needs to be enabled.
Follow the steps below to run this test.
make -C tools/testing/selftests/bpf
cd tools/testing/selftests/bpf
sudo ./test_progs -t smc
Results shows:
Summary: 1/1 PASSED, 0 SKIPPED, 0 FAILED
Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
---
tools/testing/selftests/bpf/config | 4 +
.../selftests/bpf/prog_tests/test_bpf_smc.c | 397 ++++++++++++++++++
tools/testing/selftests/bpf/progs/bpf_smc.c | 117 ++++++
3 files changed, 518 insertions(+)
create mode 100644 tools/testing/selftests/bpf/prog_tests/test_bpf_smc.c
create mode 100644 tools/testing/selftests/bpf/progs/bpf_smc.c
diff --git a/tools/testing/selftests/bpf/config b/tools/testing/selftests/bpf/config
index c378d5d07e02..fac2f2a9d02f 100644
--- a/tools/testing/selftests/bpf/config
+++ b/tools/testing/selftests/bpf/config
@@ -113,3 +113,7 @@ CONFIG_XDP_SOCKETS=y
CONFIG_XFRM_INTERFACE=y
CONFIG_TCP_CONG_DCTCP=y
CONFIG_TCP_CONG_BBR=y
+CONFIG_INFINIBAND=y
+CONFIG_SMC=y
+CONFIG_SMC_OPS=y
+CONFIG_SMC_LO=y
\ No newline at end of file
diff --git a/tools/testing/selftests/bpf/prog_tests/test_bpf_smc.c b/tools/testing/selftests/bpf/prog_tests/test_bpf_smc.c
new file mode 100644
index 000000000000..1e06325bfbaf
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/test_bpf_smc.c
@@ -0,0 +1,397 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <test_progs.h>
+#include <linux/genetlink.h>
+#include "network_helpers.h"
+#include "bpf_smc.skel.h"
+
+#ifndef IPPROTO_SMC
+#define IPPROTO_SMC 256
+#endif
+
+#define CLIENT_IP "127.0.0.1"
+#define SERVER_IP "127.0.1.0"
+#define SERVER_IP_VIA_RISK_PATH "127.0.2.0"
+
+#define SERVICE_1 11234
+#define SERVICE_2 22345
+#define SERVICE_3 33456
+
+#define TEST_NS "bpf_smc_netns"
+
+static struct netns_obj *test_netns;
+
+struct smc_strat_ip_key {
+ __u32 sip;
+ __u32 dip;
+};
+
+struct smc_strat_ip_value {
+ __u8 mode;
+};
+
+#if defined(__s390x__)
+/* s390x has default seid */
+static bool setup_ueid(void) { return true; }
+static void cleanup_ueid(void) {}
+#else
+enum {
+ SMC_NETLINK_ADD_UEID = 10,
+ SMC_NETLINK_REMOVE_UEID
+};
+
+enum {
+ SMC_NLA_EID_TABLE_UNSPEC,
+ SMC_NLA_EID_TABLE_ENTRY, /* string */
+};
+
+struct msgtemplate {
+ struct nlmsghdr n;
+ struct genlmsghdr g;
+ char buf[1024];
+};
+
+#define GENLMSG_DATA(glh) ((void *)(NLMSG_DATA(glh) + GENL_HDRLEN))
+#define GENLMSG_PAYLOAD(glh) (NLMSG_PAYLOAD(glh, 0) - GENL_HDRLEN)
+#define NLA_DATA(na) ((void *)((char *)(na) + NLA_HDRLEN))
+#define NLA_PAYLOAD(len) ((len) - NLA_HDRLEN)
+
+#define SMC_GENL_FAMILY_NAME "SMC_GEN_NETLINK"
+#define SMC_BPFTEST_UEID "SMC-BPFTEST-UEID"
+
+static uint16_t smc_nl_family_id = -1;
+
+static int send_cmd(int fd, __u16 nlmsg_type, __u32 nlmsg_pid,
+ __u16 nlmsg_flags, __u8 genl_cmd, __u16 nla_type,
+ void *nla_data, int nla_len)
+{
+ struct nlattr *na;
+ struct sockaddr_nl nladdr;
+ int r, buflen;
+ char *buf;
+
+ struct msgtemplate msg = {0};
+
+ msg.n.nlmsg_len = NLMSG_LENGTH(GENL_HDRLEN);
+ msg.n.nlmsg_type = nlmsg_type;
+ msg.n.nlmsg_flags = nlmsg_flags;
+ msg.n.nlmsg_seq = 0;
+ msg.n.nlmsg_pid = nlmsg_pid;
+ msg.g.cmd = genl_cmd;
+ msg.g.version = 1;
+ na = (struct nlattr *) GENLMSG_DATA(&msg);
+ na->nla_type = nla_type;
+ na->nla_len = nla_len + 1 + NLA_HDRLEN;
+ memcpy(NLA_DATA(na), nla_data, nla_len);
+ msg.n.nlmsg_len += NLMSG_ALIGN(na->nla_len);
+
+ buf = (char *) &msg;
+ buflen = msg.n.nlmsg_len;
+ memset(&nladdr, 0, sizeof(nladdr));
+ nladdr.nl_family = AF_NETLINK;
+
+ while ((r = sendto(fd, buf, buflen, 0, (struct sockaddr *) &nladdr,
+ sizeof(nladdr))) < buflen) {
+ if (r > 0) {
+ buf += r;
+ buflen -= r;
+ } else if (errno != EAGAIN) {
+ return -1;
+ }
+ }
+ return 0;
+}
+
+static bool get_smc_nl_family_id(void)
+{
+ struct sockaddr_nl nl_src;
+ struct msgtemplate msg;
+ struct nlattr *nl;
+ int fd, ret;
+ pid_t pid;
+
+ fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_GENERIC);
+ if (!ASSERT_GT(fd, 0, "nl_family socket"))
+ return false;
+
+ pid = getpid();
+
+ memset(&nl_src, 0, sizeof(nl_src));
+ nl_src.nl_family = AF_NETLINK;
+ nl_src.nl_pid = pid;
+
+ ret = bind(fd, (struct sockaddr *) &nl_src, sizeof(nl_src));
+ if (!ASSERT_GE(ret, 0, "nl_family bind"))
+ goto fail;
+
+ ret = send_cmd(fd, GENL_ID_CTRL, pid,
+ NLM_F_REQUEST, CTRL_CMD_GETFAMILY,
+ CTRL_ATTR_FAMILY_NAME, (void *)SMC_GENL_FAMILY_NAME,
+ strlen(SMC_GENL_FAMILY_NAME));
+ if (!ASSERT_EQ(ret, 0, "nl_family query"))
+ goto fail;
+
+ ret = recv(fd, &msg, sizeof(msg), 0);
+ if (!ASSERT_FALSE(msg.n.nlmsg_type == NLMSG_ERROR || (ret < 0) ||
+ !NLMSG_OK((&msg.n), ret), "nl_family response"))
+ goto fail;
+
+ nl = (struct nlattr *) GENLMSG_DATA(&msg);
+ nl = (struct nlattr *) ((char *) nl + NLA_ALIGN(nl->nla_len));
+ if (!ASSERT_EQ(nl->nla_type, CTRL_ATTR_FAMILY_ID, "nl_family nla type"))
+ goto fail;
+
+ smc_nl_family_id = *(uint16_t *) NLA_DATA(nl);
+ close(fd);
+ return true;
+fail:
+ close(fd);
+ return false;
+}
+
+static bool smc_ueid(int op)
+{
+ struct sockaddr_nl nl_src;
+ struct msgtemplate msg;
+ struct nlmsgerr *err;
+ char test_ueid[32];
+ int fd, ret;
+ pid_t pid;
+
+ /* UEID required */
+ memset(test_ueid, '\x20', sizeof(test_ueid));
+ memcpy(test_ueid, SMC_BPFTEST_UEID, strlen(SMC_BPFTEST_UEID));
+ fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_GENERIC);
+ if (!ASSERT_GT(fd, 0, "ueid socket"))
+ return false;
+
+ pid = getpid();
+ memset(&nl_src, 0, sizeof(nl_src));
+ nl_src.nl_family = AF_NETLINK;
+ nl_src.nl_pid = pid;
+
+ ret = bind(fd, (struct sockaddr *) &nl_src, sizeof(nl_src));
+ if (!ASSERT_GE(ret, 0, "ueid bind"))
+ goto fail;
+
+ ret = send_cmd(fd, smc_nl_family_id, pid,
+ NLM_F_REQUEST | NLM_F_ACK, op, SMC_NLA_EID_TABLE_ENTRY,
+ (void *)test_ueid, sizeof(test_ueid));
+ if (!ASSERT_EQ(ret, 0, "ueid cmd"))
+ goto fail;
+
+ ret = recv(fd, &msg, sizeof(msg), 0);
+ if (!ASSERT_FALSE((ret < 0) ||
+ !NLMSG_OK((&msg.n), ret), "ueid response"))
+ goto fail;
+
+ if (msg.n.nlmsg_type == NLMSG_ERROR) {
+ err = NLMSG_DATA(&msg);
+ switch (op) {
+ case SMC_NETLINK_REMOVE_UEID:
+ if (!ASSERT_FALSE((err->error && err->error != -ENOENT),
+ "ueid remove"))
+ goto fail;
+ break;
+ case SMC_NETLINK_ADD_UEID:
+ if (!ASSERT_EQ(err->error, 0, "ueid add"))
+ goto fail;
+ break;
+ default:
+ break;
+ }
+ }
+ close(fd);
+ return true;
+fail:
+ close(fd);
+ return false;
+}
+
+static bool setup_ueid(void)
+{
+ /* get smc nl id */
+ if (!get_smc_nl_family_id())
+ return false;
+ /* clear old ueid for bpftest */
+ smc_ueid(SMC_NETLINK_REMOVE_UEID);
+ /* smc-loopback required ueid */
+ return smc_ueid(SMC_NETLINK_ADD_UEID);
+}
+
+static void cleanup_ueid(void)
+{
+ smc_ueid(SMC_NETLINK_REMOVE_UEID);
+}
+#endif /* __s390x__ */
+
+static bool setup_netns(void)
+{
+ test_netns = netns_new(TEST_NS, true);
+ if (!ASSERT_OK_PTR(test_netns, "open net namespace"))
+ goto fail_netns;
+
+ if (!ASSERT_OK(system("ip addr add 127.0.1.0/8 dev lo"),
+ "add server node"))
+ goto fail_ip;
+
+ if (!ASSERT_OK(system("ip addr add 127.0.2.0/8 dev lo"),
+ "server via risk path"))
+ goto fail_ip;
+
+ return true;
+fail_ip:
+ netns_free(test_netns);
+fail_netns:
+ return false;
+}
+
+static void cleanup_netns(void)
+{
+ netns_free(test_netns);
+ remove_netns(TEST_NS);
+}
+
+static bool setup_smc(void)
+{
+ if (!setup_ueid())
+ return false;
+
+ if (!setup_netns())
+ goto fail_netns;
+
+ return true;
+fail_netns:
+ cleanup_ueid();
+ return false;
+}
+
+static int set_client_addr_cb(int fd, void *opts)
+{
+ const char *src = (const char *)opts;
+ struct sockaddr_in localaddr;
+
+ localaddr.sin_family = AF_INET;
+ localaddr.sin_port = htons(0);
+ localaddr.sin_addr.s_addr = inet_addr(src);
+ return !ASSERT_EQ(bind(fd, &localaddr, sizeof(localaddr)), 0,
+ "client bind");
+}
+
+static void run_link(const char *src, const char *dst, int port)
+{
+ struct network_helper_opts opts = {0};
+ int server, client;
+
+ server = start_server_str(AF_INET, SOCK_STREAM, dst, port, NULL);
+ if (!ASSERT_OK_FD(server, "start service_1"))
+ return;
+
+ opts.proto = IPPROTO_TCP;
+ opts.post_socket_cb = set_client_addr_cb;
+ opts.cb_opts = (void *)src;
+
+ client = connect_to_fd_opts(server, &opts);
+ if (!ASSERT_OK_FD(client, "start connect"))
+ goto fail_client;
+
+ close(client);
+fail_client:
+ close(server);
+}
+
+static void block_link(int map_fd, const char *src, const char *dst)
+{
+ struct smc_strat_ip_value val = { .mode = /* block */ 0 };
+ struct smc_strat_ip_key key = {
+ .sip = inet_addr(src),
+ .dip = inet_addr(dst),
+ };
+
+ bpf_map_update_elem(map_fd, &key, &val, BPF_ANY);
+}
+
+/*
+ * This test describes a real-life service topology as follows:
+ *
+ * +-------------> service_1
+ * link1 | |
+ * +--------------------> server | link 2
+ * | | V
+ * | +-------------> service_2
+ * | link 3
+ * client -------------------> server_via_unsafe_path -> service_3
+ *
+ * Among them,
+ * 1. link-1 is very suitable for using SMC.
+ * 2. link-2 is not suitable for using SMC, because the mode of this link is
+ * kind of short-link services.
+ * 3. link-3 is also not suitable for using SMC, because the RDMA link is
+ * unavailable and needs to go through a long timeout before it can fallback
+ * to TCP.
+ * To achieve this goal, we use a customized SMC ip strategy via smc_ops.
+ */
+static void test_topo(void)
+{
+ struct bpf_smc *skel;
+ int rc, map_fd;
+
+ skel = bpf_smc__open_and_load();
+ if (!ASSERT_OK_PTR(skel, "bpf_smc__open_and_load"))
+ return;
+
+ rc = bpf_smc__attach(skel);
+ if (!ASSERT_EQ(rc, 0, "bpf_smc__attach"))
+ goto fail;
+
+ map_fd = bpf_map__fd(skel->maps.smc_strats_ip);
+ if (!ASSERT_GT(map_fd, 0, "bpf_map__fd"))
+ goto fail;
+
+ /* Mock the process of transparent replacement, since we will modify
+ * protocol to ipproto_smc accropding to it via
+ * fmod_ret/update_socket_protocol.
+ */
+ system("sysctl -w net.smc.ops=linkcheck");
+
+ /* Configure ip strat */
+ block_link(map_fd, CLIENT_IP, SERVER_IP_VIA_RISK_PATH);
+ block_link(map_fd, SERVER_IP, SERVER_IP);
+
+ /* should go with smc */
+ run_link(CLIENT_IP, SERVER_IP, SERVICE_1);
+ /* should go with smc fallback */
+ run_link(SERVER_IP, SERVER_IP, SERVICE_2);
+
+ ASSERT_EQ(skel->bss->smc_cnt, 2, "smc count");
+ ASSERT_EQ(skel->bss->fallback_cnt, 1, "fallback count");
+
+ /* should go with smc */
+ run_link(CLIENT_IP, SERVER_IP, SERVICE_2);
+
+ ASSERT_EQ(skel->bss->smc_cnt, 3, "smc count");
+ ASSERT_EQ(skel->bss->fallback_cnt, 1, "fallback count");
+
+ /* should go with smc fallback */
+ run_link(CLIENT_IP, SERVER_IP_VIA_RISK_PATH, SERVICE_3);
+
+ ASSERT_EQ(skel->bss->smc_cnt, 4, "smc count");
+ ASSERT_EQ(skel->bss->fallback_cnt, 2, "fallback count");
+
+fail:
+ bpf_smc__destroy(skel);
+}
+
+void test_bpf_smc(void)
+{
+ if (!setup_smc()) {
+ printf("setup for smc test failed, test SKIP:\n");
+ test__skip();
+ return;
+ }
+
+ if (test__start_subtest("topo"))
+ test_topo();
+
+ cleanup_ueid();
+ cleanup_netns();
+}
diff --git a/tools/testing/selftests/bpf/progs/bpf_smc.c b/tools/testing/selftests/bpf/progs/bpf_smc.c
new file mode 100644
index 000000000000..38b0490bd875
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/bpf_smc.c
@@ -0,0 +1,117 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include "vmlinux.h"
+
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+#include "bpf_tracing_net.h"
+
+char _license[] SEC("license") = "GPL";
+
+enum {
+ BPF_SMC_LISTEN = 10,
+};
+
+struct smc_sock___local {
+ struct sock sk;
+ struct smc_sock *listen_smc;
+ bool use_fallback;
+} __attribute__((preserve_access_index));
+
+int smc_cnt = 0;
+int fallback_cnt = 0;
+
+SEC("fentry/smc_release")
+int BPF_PROG(bpf_smc_release, struct socket *sock)
+{
+ /* only count from one side (client) */
+ if (sock->sk->__sk_common.skc_state == BPF_SMC_LISTEN)
+ return 0;
+ smc_cnt++;
+ return 0;
+}
+
+SEC("fentry/smc_switch_to_fallback")
+int BPF_PROG(bpf_smc_switch_to_fallback, struct smc_sock___local *smc)
+{
+ /* only count from one side (client) */
+ if (smc && !BPF_CORE_READ(smc, listen_smc))
+ fallback_cnt++;
+ return 0;
+}
+
+/* go with default value if no strat was found */
+bool default_ip_strat_value = true;
+
+struct smc_strat_ip_key {
+ __u32 sip;
+ __u32 dip;
+};
+
+struct smc_strat_ip_value {
+ __u8 mode;
+};
+
+struct {
+ __uint(type, BPF_MAP_TYPE_HASH);
+ __uint(key_size, sizeof(struct smc_strat_ip_key));
+ __uint(value_size, sizeof(struct smc_strat_ip_value));
+ __uint(max_entries, 128);
+ __uint(map_flags, BPF_F_NO_PREALLOC);
+} smc_strats_ip SEC(".maps");
+
+static bool smc_check(__u32 src, __u32 dst)
+{
+ struct smc_strat_ip_value *value;
+ struct smc_strat_ip_key key = {
+ .sip = src,
+ .dip = dst,
+ };
+
+ value = bpf_map_lookup_elem(&smc_strats_ip, &key);
+ return value ? value->mode : default_ip_strat_value;
+}
+
+SEC("fmod_ret/update_socket_protocol")
+int BPF_PROG(smc_run, int family, int type, int protocol)
+{
+ struct task_struct *task;
+
+ if (family != AF_INET && family != AF_INET6)
+ return protocol;
+
+ if ((type & 0xf) != SOCK_STREAM)
+ return protocol;
+
+ if (protocol != 0 && protocol != IPPROTO_TCP)
+ return protocol;
+
+ task = bpf_get_current_task_btf();
+ /* Prevent from affecting other tests */
+ if (!task || !task->nsproxy->net_ns->smc.ops)
+ return protocol;
+
+ return IPPROTO_SMC;
+}
+
+SEC("struct_ops/bpf_smc_set_tcp_option_cond")
+int BPF_PROG(bpf_smc_set_tcp_option_cond, const struct tcp_sock *tp,
+ struct inet_request_sock *ireq)
+{
+ return smc_check(ireq->req.__req_common.skc_daddr,
+ ireq->req.__req_common.skc_rcv_saddr);
+}
+
+SEC("struct_ops/bpf_smc_set_tcp_option")
+int BPF_PROG(bpf_smc_set_tcp_option, struct tcp_sock *tp)
+{
+ return smc_check(tp->inet_conn.icsk_inet.sk.__sk_common.skc_rcv_saddr,
+ tp->inet_conn.icsk_inet.sk.__sk_common.skc_daddr);
+}
+
+SEC(".struct_ops.link")
+struct smc_ops linkcheck = {
+ .name = "linkcheck",
+ .set_option = (void *) bpf_smc_set_tcp_option,
+ .set_option_cond = (void *) bpf_smc_set_tcp_option_cond,
+};
--
2.45.0
^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: [PATCH bpf-next v6 2/5] net/smc: Introduce generic hook smc_ops
2025-01-16 7:44 ` [PATCH bpf-next v6 2/5] net/smc: Introduce generic hook smc_ops D. Wythe
@ 2025-01-16 12:22 ` Dust Li
2025-01-17 23:50 ` Martin KaFai Lau
1 sibling, 0 replies; 17+ messages in thread
From: Dust Li @ 2025-01-16 12:22 UTC (permalink / raw)
To: D. Wythe, kgraul, wenjia, jaka, ast, daniel, andrii, martin.lau,
pabeni, song, sdf, haoluo, yhs, edumazet, john.fastabend, kpsingh,
jolsa, guwen
Cc: kuba, davem, netdev, linux-s390, linux-rdma, bpf
On 2025-01-16 15:44:39, D. Wythe wrote:
>The introduction of IPPROTO_SMC enables eBPF programs to determine
>whether to use SMC based on the context of socket creation, such as
>network namespaces, PID and comm name, etc.
>
>As a subsequent enhancement, to introduce a new generic hook that
>allows decisions on whether to use SMC or not at runtime, including
>but not limited to local/remote IP address or ports.
>
>Moreover, in the future, we can achieve more complex extensions to the
>protocol stack by extending this ops.
>
>Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
>---
> include/net/netns/smc.h | 3 ++
> include/net/smc.h | 53 +++++++++++++++++++++++
> net/ipv4/tcp_output.c | 18 ++++++--
> net/smc/Kconfig | 12 ++++++
> net/smc/Makefile | 1 +
> net/smc/smc_ops.c | 53 +++++++++++++++++++++++
> net/smc/smc_ops.h | 28 ++++++++++++
> net/smc/smc_sysctl.c | 94 +++++++++++++++++++++++++++++++++++++++++
> 8 files changed, 258 insertions(+), 4 deletions(-)
> create mode 100644 net/smc/smc_ops.c
> create mode 100644 net/smc/smc_ops.h
>
>diff --git a/include/net/netns/smc.h b/include/net/netns/smc.h
>index fc752a50f91b..81b3fdb39cd2 100644
>--- a/include/net/netns/smc.h
>+++ b/include/net/netns/smc.h
>@@ -17,6 +17,9 @@ struct netns_smc {
> #ifdef CONFIG_SYSCTL
> struct ctl_table_header *smc_hdr;
> #endif
>+#if IS_ENABLED(CONFIG_SMC_OPS)
>+ struct smc_ops __rcu *ops;
>+#endif /* CONFIG_SMC_OPS */
> unsigned int sysctl_autocorking_size;
> unsigned int sysctl_smcr_buf_type;
> int sysctl_smcr_testlink_time;
>diff --git a/include/net/smc.h b/include/net/smc.h
>index db84e4e35080..271838591b63 100644
>--- a/include/net/smc.h
>+++ b/include/net/smc.h
>@@ -18,6 +18,8 @@
> #include "linux/ism.h"
>
> struct sock;
>+struct tcp_sock;
>+struct inet_request_sock;
>
> #define SMC_MAX_PNETID_LEN 16 /* Max. length of PNET id */
>
>@@ -97,4 +99,55 @@ struct smcd_dev {
> u8 going_away : 1;
> };
>
>+#define SMC_OPS_NAME_MAX 16
>+
>+enum {
>+ /* ops can be inherit from init_net */
>+ SMC_OPS_FLAG_INHERITABLE = 0x1,
>+
>+ SMC_OPS_ALL_FLAGS = SMC_OPS_FLAG_INHERITABLE,
>+};
>+
>+struct smc_ops {
>+ /* priavte */
>+
>+ struct list_head list;
>+ struct module *owner;
>+
>+ /* public */
>+
>+ /* unique name */
>+ char name[SMC_OPS_NAME_MAX];
>+ int flags;
>+
>+ /* Invoked before computing SMC option for SYN packets.
>+ * We can control whether to set SMC options by returning varios value.
>+ * Return 0 to disable SMC, or return any other value to enable it.
>+ */
>+ int (*set_option)(struct tcp_sock *tp);
>+
>+ /* Invoked before Set up SMC options for SYN-ACK packets
>+ * We can control whether to respond SMC options by returning varios
>+ * value. Return 0 to disable SMC, or return any other value to enable
>+ * it.
>+ */
>+ int (*set_option_cond)(const struct tcp_sock *tp,
>+ struct inet_request_sock *ireq);
>+};
>+
>+#if IS_ENABLED(CONFIG_SMC_OPS)
>+#define smc_call_retops(init_val, sk, func, ...) ({ \
>+ typeof(init_val) __ret = (init_val); \
>+ struct smc_ops *ops; \
>+ rcu_read_lock(); \
>+ ops = READ_ONCE(sock_net(sk)->smc.ops); \
>+ if (ops && ops->func) \
>+ __ret = ops->func(__VA_ARGS__); \
>+ rcu_read_unlock(); \
>+ !!__ret; \
>+})
Here you force the return value to be bool by !!ret, what if the
future caller expects the return value to be an integer or other types ?
Best regards,
Dust
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH bpf-next v6 4/5] libbpf: fix error when st-prefix_ops and ops from differ btf
2025-01-16 7:44 ` [PATCH bpf-next v6 4/5] libbpf: fix error when st-prefix_ops and ops from differ btf D. Wythe
@ 2025-01-17 18:36 ` Andrii Nakryiko
2025-01-22 2:43 ` D. Wythe
0 siblings, 1 reply; 17+ messages in thread
From: Andrii Nakryiko @ 2025-01-17 18:36 UTC (permalink / raw)
To: D. Wythe
Cc: kgraul, wenjia, jaka, ast, daniel, andrii, martin.lau, pabeni,
song, sdf, haoluo, yhs, edumazet, john.fastabend, kpsingh, jolsa,
guwen, kuba, davem, netdev, linux-s390, linux-rdma, bpf
On Wed, Jan 15, 2025 at 11:45 PM D. Wythe <alibuda@linux.alibaba.com> wrote:
>
> When a struct_ops named xxx_ops was registered by a module, and
> it will be used in both built-in modules and the module itself,
> so that the btf_type of xxx_ops will be present in btf_vmlinux
> instead of in btf_mod, which means that the btf_type of
> bpf_struct_ops_xxx_ops and xxx_ops will not be in the same btf.
>
> Here are four possible case:
>
> +--------+---------------+-------------+---------------------------------+
> | | st_ops_xxx_ops| xxx_ops | |
> +--------+---------------+-------------+---------------------------------+
> | case 0 | btf_vmlinux | bft_vmlinux | be used and reg only in vmlinux |
> +--------+---------------+-------------+---------------------------------+
> | case 1 | btf_vmlinux | bpf_mod | INVALID |
> +--------+---------------+-------------+---------------------------------+
> | case 2 | btf_mod | btf_vmlinux | reg in mod but be used both in |
> | | | | vmlinux and mod. |
> +--------+---------------+-------------+---------------------------------+
> | case 3 | btf_mod | btf_mod | be used and reg only in mod |
> +--------+---------------+-------------+---------------------------------+
>
> At present, cases 0, 1, and 3 can be correctly identified, because
> st_ops_xxx_ops is searched from the same btf with xxx_ops. In order to
> handle case 2 correctly without affecting other cases, we cannot simply
> change the search method for st_ops_xxx_ops from find_btf_by_prefix_kind()
> to find_ksym_btf_id(), because in this way, case 1 will not be
> recognized anymore.
>
> To address the issue, we always look for st_ops_xxx_ops first,
> figure out the btf, and then look for xxx_ops with the very btf to avoid
> such issue.
>
> Fixes: 590a00888250 ("bpf: libbpf: Add STRUCT_OPS support")
> Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
> ---
> tools/lib/bpf/libbpf.c | 41 +++++++++++++++++++++++------------------
> 1 file changed, 23 insertions(+), 18 deletions(-)
>
Other than a few nits below, LGTM
Acked-by: Andrii Nakryiko <andrii@kernel.org>
> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> index 66173ddb5a2d..202bc4c1001e 100644
> --- a/tools/lib/bpf/libbpf.c
> +++ b/tools/lib/bpf/libbpf.c
> @@ -1005,14 +1005,33 @@ find_struct_ops_kern_types(struct bpf_object *obj, const char *tname_raw,
> const struct btf_member *kern_data_member;
> struct btf *btf = NULL;
> __s32 kern_vtype_id, kern_type_id;
> - char tname[256];
> + char tname[256], stname[256];
> + int ret;
> __u32 i;
>
> snprintf(tname, sizeof(tname), "%.*s",
> (int)bpf_core_essential_name_len(tname_raw), tname_raw);
>
> - kern_type_id = find_ksym_btf_id(obj, tname, BTF_KIND_STRUCT,
> + ret = snprintf(stname, sizeof(stname), "%s%s", STRUCT_OPS_VALUE_PREFIX,
> + tname);
> + if (ret < 0 || ret >= sizeof(stname))
> + return -ENAMETOOLONG;
see preexisting snprintf() above, we don't really handle truncation
errors explicitly, they are extremely unlikely and not expected at
all, and worst case nothing will be found and user will get some
-ENOENT or something like that eventually. I'd drop this extra error
checking and keep it streamlines, similar to tname
> +
> + /* Look for the corresponding "map_value" type that will be used
> + * in map_update(BPF_MAP_TYPE_STRUCT_OPS) first, figure out the btf
> + * and the mod_btf.
> + * For example, find "struct bpf_struct_ops_tcp_congestion_ops".
> + */
> + kern_vtype_id = find_ksym_btf_id(obj, stname, BTF_KIND_STRUCT,
> &btf, mod_btf);
nit: if this fits under 100 characters, keep it single line
> + if (kern_vtype_id < 0) {
> + pr_warn("struct_ops init_kern: struct %s is not found in kernel BTF\n",
> + stname);
same nit about preserving single-line statements as much as possible,
they are much easier to read
> + return kern_vtype_id;
> + }
> + kern_vtype = btf__type_by_id(btf, kern_vtype_id);
> +
> + kern_type_id = btf__find_by_name_kind(btf, tname, BTF_KIND_STRUCT);
> if (kern_type_id < 0) {
> pr_warn("struct_ops init_kern: struct %s is not found in kernel BTF\n",
> tname);
> @@ -1020,20 +1039,6 @@ find_struct_ops_kern_types(struct bpf_object *obj, const char *tname_raw,
> }
> kern_type = btf__type_by_id(btf, kern_type_id);
>
> - /* Find the corresponding "map_value" type that will be used
> - * in map_update(BPF_MAP_TYPE_STRUCT_OPS). For example,
> - * find "struct bpf_struct_ops_tcp_congestion_ops" from the
> - * btf_vmlinux.
> - */
> - kern_vtype_id = find_btf_by_prefix_kind(btf, STRUCT_OPS_VALUE_PREFIX,
> - tname, BTF_KIND_STRUCT);
> - if (kern_vtype_id < 0) {
> - pr_warn("struct_ops init_kern: struct %s%s is not found in kernel BTF\n",
> - STRUCT_OPS_VALUE_PREFIX, tname);
> - return kern_vtype_id;
> - }
> - kern_vtype = btf__type_by_id(btf, kern_vtype_id);
> -
> /* Find "struct tcp_congestion_ops" from
> * struct bpf_struct_ops_tcp_congestion_ops {
> * [ ... ]
> @@ -1046,8 +1051,8 @@ find_struct_ops_kern_types(struct bpf_object *obj, const char *tname_raw,
> break;
> }
> if (i == btf_vlen(kern_vtype)) {
> - pr_warn("struct_ops init_kern: struct %s data is not found in struct %s%s\n",
> - tname, STRUCT_OPS_VALUE_PREFIX, tname);
> + pr_warn("struct_ops init_kern: struct %s data is not found in struct %s\n",
> + tname, stname);
> return -EINVAL;
> }
>
> --
> 2.45.0
>
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH bpf-next v6 2/5] net/smc: Introduce generic hook smc_ops
2025-01-16 7:44 ` [PATCH bpf-next v6 2/5] net/smc: Introduce generic hook smc_ops D. Wythe
2025-01-16 12:22 ` Dust Li
@ 2025-01-17 23:50 ` Martin KaFai Lau
2025-01-22 4:35 ` D. Wythe
1 sibling, 1 reply; 17+ messages in thread
From: Martin KaFai Lau @ 2025-01-17 23:50 UTC (permalink / raw)
To: D. Wythe
Cc: kgraul, wenjia, jaka, ast, daniel, andrii, pabeni, song, sdf,
haoluo, yhs, edumazet, john.fastabend, kpsingh, jolsa, guwen,
kuba, davem, netdev, linux-s390, linux-rdma, bpf
On 1/15/25 11:44 PM, D. Wythe wrote:
> diff --git a/net/smc/smc_sysctl.c b/net/smc/smc_sysctl.c
> index 2fab6456f765..2004241c3045 100644
> --- a/net/smc/smc_sysctl.c
> +++ b/net/smc/smc_sysctl.c
> @@ -18,6 +18,7 @@
> #include "smc_core.h"
> #include "smc_llc.h"
> #include "smc_sysctl.h"
> +#include "smc_ops.h"
>
> static int min_sndbuf = SMC_BUF_MIN_SIZE;
> static int min_rcvbuf = SMC_BUF_MIN_SIZE;
> @@ -30,6 +31,69 @@ static int links_per_lgr_max = SMC_LINKS_ADD_LNK_MAX;
> static int conns_per_lgr_min = SMC_CONN_PER_LGR_MIN;
> static int conns_per_lgr_max = SMC_CONN_PER_LGR_MAX;
>
> +#if IS_ENABLED(CONFIG_SMC_OPS)
> +static int smc_net_replace_smc_ops(struct net *net, const char *name)
> +{
> + struct smc_ops *ops = NULL;
> +
> + rcu_read_lock();
> + /* null or empty name ask to clear current ops */
> + if (name && name[0]) {
> + ops = smc_ops_find_by_name(name);
> + if (!ops) {
> + rcu_read_unlock();
> + return -EINVAL;
> + }
> + /* no change, just return */
> + if (ops == rcu_dereference(net->smc.ops)) {
> + rcu_read_unlock();
> + return 0;
> + }
> + }
> + if (!ops || bpf_try_module_get(ops, ops->owner)) {
> + /* xhcg */
typo. I noticed it only because...
> + ops = rcu_replace_pointer(net->smc.ops, ops, true);
... rcu_replace_pointer() does not align with the above xchg comment. From
looking into rcu_replace_pointer, it is not a xchg. It is also not obvious to me
why it is safe to assume "true" here...
> + /* release old ops */
> + if (ops)
> + bpf_module_put(ops, ops->owner);
... together with a put here on the return value of the rcu_replace_pointer.
> + } else if (ops) {
nit. This looks redundant when looking at the "if (!ops || ..." test above
Also a nit, I would move the bpf_try_module_get() immediately after the above
"if (ops == rcu_dereference(net->smc.ops))" test. This should simplify the later
cases.
> + rcu_read_unlock();
> + return -EBUSY;
> + }
> + rcu_read_unlock();
> + return 0;
> +}
> +
> +static int proc_smc_ops(const struct ctl_table *ctl, int write,
> + void *buffer, size_t *lenp, loff_t *ppos)
> +{
> + struct net *net = container_of(ctl->data, struct net, smc.ops);
> + char val[SMC_OPS_NAME_MAX];
> + const struct ctl_table tbl = {
> + .data = val,
> + .maxlen = SMC_OPS_NAME_MAX,
> + };
> + struct smc_ops *ops;
> + int ret;
> +
> + rcu_read_lock();
> + ops = rcu_dereference(net->smc.ops);
> + if (ops)
> + memcpy(val, ops->name, sizeof(ops->name));
> + else
> + val[0] = '\0';
> + rcu_read_unlock();
> +
> + ret = proc_dostring(&tbl, write, buffer, lenp, ppos);
> + if (ret)
> + return ret;
> +
> + if (write)
> + ret = smc_net_replace_smc_ops(net, val);
> + return ret;
> +}
> +#endif /* CONFIG_SMC_OPS */
> +
> static struct ctl_table smc_table[] = {
> {
> .procname = "autocorking_size",
> @@ -99,6 +163,15 @@ static struct ctl_table smc_table[] = {
> .extra1 = SYSCTL_ZERO,
> .extra2 = SYSCTL_ONE,
> },
> +#if IS_ENABLED(CONFIG_SMC_OPS)
> + {
> + .procname = "ops",
> + .data = &init_net.smc.ops,
> + .mode = 0644,
> + .maxlen = SMC_OPS_NAME_MAX,
> + .proc_handler = proc_smc_ops,
> + },
> +#endif /* CONFIG_SMC_OPS */
> };
>
> int __net_init smc_sysctl_net_init(struct net *net)
> @@ -109,6 +182,20 @@ int __net_init smc_sysctl_net_init(struct net *net)
> table = smc_table;
> if (!net_eq(net, &init_net)) {
> int i;
> +#if IS_ENABLED(CONFIG_SMC_OPS)
> + struct smc_ops *ops;
> +
> + rcu_read_lock();
> + ops = rcu_dereference(init_net.smc.ops);
> + if (ops && ops->flags & SMC_OPS_FLAG_INHERITABLE) {
> + if (!bpf_try_module_get(ops, ops->owner)) {
> + rcu_read_unlock();
> + return -EBUSY;
Not sure if it should count as error when the ops is in the process of
un-register-ing. The next smc_sysctl_net_init will have NULL ops and succeed.
Something for you to consider.
Also, it needs an ack from the SMC maintainer for the SMC specific parts like
the sysctl here.
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH bpf-next v6 5/5] bpf/selftests: add selftest for bpf_smc_ops
2025-01-16 7:44 ` [PATCH bpf-next v6 5/5] bpf/selftests: add selftest for bpf_smc_ops D. Wythe
@ 2025-01-18 0:37 ` Martin KaFai Lau
2025-01-22 1:51 ` D. Wythe
2025-01-21 6:42 ` Saket Kumar Bhaskar
1 sibling, 1 reply; 17+ messages in thread
From: Martin KaFai Lau @ 2025-01-18 0:37 UTC (permalink / raw)
To: D. Wythe
Cc: kgraul, wenjia, jaka, ast, daniel, andrii, pabeni, song, sdf,
haoluo, yhs, edumazet, john.fastabend, kpsingh, jolsa, guwen,
kuba, davem, netdev, linux-s390, linux-rdma, bpf
On 1/15/25 11:44 PM, D. Wythe wrote:
> This tests introduces a tiny smc_ops for filtering SMC connections based on
> IP pairs, and also adds a realistic topology model to verify this ops.
>
> Also, we can only use SMC loopback under CI test, so an
> additional configuration needs to be enabled.
>
> Follow the steps below to run this test.
>
> make -C tools/testing/selftests/bpf
> cd tools/testing/selftests/bpf
> sudo ./test_progs -t smc
>
> Results shows:
> Summary: 1/1 PASSED, 0 SKIPPED, 0 FAILED
>
> Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
> ---
> tools/testing/selftests/bpf/config | 4 +
> .../selftests/bpf/prog_tests/test_bpf_smc.c | 397 ++++++++++++++++++
> tools/testing/selftests/bpf/progs/bpf_smc.c | 117 ++++++
> 3 files changed, 518 insertions(+)
> create mode 100644 tools/testing/selftests/bpf/prog_tests/test_bpf_smc.c
> create mode 100644 tools/testing/selftests/bpf/progs/bpf_smc.c
>
> diff --git a/tools/testing/selftests/bpf/config b/tools/testing/selftests/bpf/config
> index c378d5d07e02..fac2f2a9d02f 100644
> --- a/tools/testing/selftests/bpf/config
> +++ b/tools/testing/selftests/bpf/config
> @@ -113,3 +113,7 @@ CONFIG_XDP_SOCKETS=y
> CONFIG_XFRM_INTERFACE=y
> CONFIG_TCP_CONG_DCTCP=y
> CONFIG_TCP_CONG_BBR=y
> +CONFIG_INFINIBAND=y
> +CONFIG_SMC=y
> +CONFIG_SMC_OPS=y
> +CONFIG_SMC_LO=y
> \ No newline at end of file
> diff --git a/tools/testing/selftests/bpf/prog_tests/test_bpf_smc.c b/tools/testing/selftests/bpf/prog_tests/test_bpf_smc.c
> new file mode 100644
> index 000000000000..1e06325bfbaf
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/prog_tests/test_bpf_smc.c
> @@ -0,0 +1,397 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include <test_progs.h>
> +#include <linux/genetlink.h>
> +#include "network_helpers.h"
> +#include "bpf_smc.skel.h"
> +
> +#ifndef IPPROTO_SMC
> +#define IPPROTO_SMC 256
> +#endif
> +
> +#define CLIENT_IP "127.0.0.1"
> +#define SERVER_IP "127.0.1.0"
> +#define SERVER_IP_VIA_RISK_PATH "127.0.2.0"
> +
> +#define SERVICE_1 11234
> +#define SERVICE_2 22345
> +#define SERVICE_3 33456
> +
> +#define TEST_NS "bpf_smc_netns"
> +
> +static struct netns_obj *test_netns;
> +
> +struct smc_strat_ip_key {
> + __u32 sip;
> + __u32 dip;
> +};
> +
> +struct smc_strat_ip_value {
> + __u8 mode;
> +};
> +
> +#if defined(__s390x__)
> +/* s390x has default seid */
> +static bool setup_ueid(void) { return true; }
> +static void cleanup_ueid(void) {}
> +#else
> +enum {
> + SMC_NETLINK_ADD_UEID = 10,
> + SMC_NETLINK_REMOVE_UEID
> +};
> +
> +enum {
> + SMC_NLA_EID_TABLE_UNSPEC,
> + SMC_NLA_EID_TABLE_ENTRY, /* string */
> +};
> +
> +struct msgtemplate {
> + struct nlmsghdr n;
> + struct genlmsghdr g;
> + char buf[1024];
> +};
> +
> +#define GENLMSG_DATA(glh) ((void *)(NLMSG_DATA(glh) + GENL_HDRLEN))
> +#define GENLMSG_PAYLOAD(glh) (NLMSG_PAYLOAD(glh, 0) - GENL_HDRLEN)
> +#define NLA_DATA(na) ((void *)((char *)(na) + NLA_HDRLEN))
> +#define NLA_PAYLOAD(len) ((len) - NLA_HDRLEN)
> +
> +#define SMC_GENL_FAMILY_NAME "SMC_GEN_NETLINK"
> +#define SMC_BPFTEST_UEID "SMC-BPFTEST-UEID"
> +
> +static uint16_t smc_nl_family_id = -1;
> +
> +static int send_cmd(int fd, __u16 nlmsg_type, __u32 nlmsg_pid,
> + __u16 nlmsg_flags, __u8 genl_cmd, __u16 nla_type,
> + void *nla_data, int nla_len)
> +{
> + struct nlattr *na;
> + struct sockaddr_nl nladdr;
> + int r, buflen;
> + char *buf;
> +
> + struct msgtemplate msg = {0};
> +
> + msg.n.nlmsg_len = NLMSG_LENGTH(GENL_HDRLEN);
> + msg.n.nlmsg_type = nlmsg_type;
> + msg.n.nlmsg_flags = nlmsg_flags;
> + msg.n.nlmsg_seq = 0;
> + msg.n.nlmsg_pid = nlmsg_pid;
> + msg.g.cmd = genl_cmd;
> + msg.g.version = 1;
> + na = (struct nlattr *) GENLMSG_DATA(&msg);
> + na->nla_type = nla_type;
> + na->nla_len = nla_len + 1 + NLA_HDRLEN;
> + memcpy(NLA_DATA(na), nla_data, nla_len);
> + msg.n.nlmsg_len += NLMSG_ALIGN(na->nla_len);
> +
> + buf = (char *) &msg;
> + buflen = msg.n.nlmsg_len;
> + memset(&nladdr, 0, sizeof(nladdr));
> + nladdr.nl_family = AF_NETLINK;
> +
> + while ((r = sendto(fd, buf, buflen, 0, (struct sockaddr *) &nladdr,
> + sizeof(nladdr))) < buflen) {
> + if (r > 0) {
> + buf += r;
> + buflen -= r;
> + } else if (errno != EAGAIN) {
> + return -1;
> + }
> + }
> + return 0;
> +}
> +
> +static bool get_smc_nl_family_id(void)
> +{
> + struct sockaddr_nl nl_src;
> + struct msgtemplate msg;
> + struct nlattr *nl;
> + int fd, ret;
> + pid_t pid;
> +
> + fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_GENERIC);
> + if (!ASSERT_GT(fd, 0, "nl_family socket"))
Should be _GE. or just use ASSERT_OK_FD.
> + return false;
> +
> + pid = getpid();
> +
> + memset(&nl_src, 0, sizeof(nl_src));
> + nl_src.nl_family = AF_NETLINK;
> + nl_src.nl_pid = pid;
> +
> + ret = bind(fd, (struct sockaddr *) &nl_src, sizeof(nl_src));
> + if (!ASSERT_GE(ret, 0, "nl_family bind"))
nit. ASSERT_OK.
> + goto fail;
> +
> + ret = send_cmd(fd, GENL_ID_CTRL, pid,
> + NLM_F_REQUEST, CTRL_CMD_GETFAMILY,
> + CTRL_ATTR_FAMILY_NAME, (void *)SMC_GENL_FAMILY_NAME,
> + strlen(SMC_GENL_FAMILY_NAME));
> + if (!ASSERT_EQ(ret, 0, "nl_family query"))
ASSERT_OK.
> + goto fail;
> +
> + ret = recv(fd, &msg, sizeof(msg), 0);
> + if (!ASSERT_FALSE(msg.n.nlmsg_type == NLMSG_ERROR || (ret < 0) ||
> + !NLMSG_OK((&msg.n), ret), "nl_family response"))
> + goto fail;
> +
> + nl = (struct nlattr *) GENLMSG_DATA(&msg);
> + nl = (struct nlattr *) ((char *) nl + NLA_ALIGN(nl->nla_len));
> + if (!ASSERT_EQ(nl->nla_type, CTRL_ATTR_FAMILY_ID, "nl_family nla type"))
> + goto fail;
> +
> + smc_nl_family_id = *(uint16_t *) NLA_DATA(nl);
> + close(fd);
> + return true;
> +fail:
> + close(fd);
> + return false;
> +}
> +
> +static bool smc_ueid(int op)
> +{
> + struct sockaddr_nl nl_src;
> + struct msgtemplate msg;
> + struct nlmsgerr *err;
> + char test_ueid[32];
> + int fd, ret;
> + pid_t pid;
> +
> + /* UEID required */
> + memset(test_ueid, '\x20', sizeof(test_ueid));
> + memcpy(test_ueid, SMC_BPFTEST_UEID, strlen(SMC_BPFTEST_UEID));
> + fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_GENERIC);
> + if (!ASSERT_GT(fd, 0, "ueid socket"))
ASSERT_OK_FD
> + return false;
> +
> + pid = getpid();
> + memset(&nl_src, 0, sizeof(nl_src));
> + nl_src.nl_family = AF_NETLINK;
> + nl_src.nl_pid = pid;
> +
> + ret = bind(fd, (struct sockaddr *) &nl_src, sizeof(nl_src));
> + if (!ASSERT_GE(ret, 0, "ueid bind"))
ASSERT_OK
> + goto fail;
> +
> + ret = send_cmd(fd, smc_nl_family_id, pid,
> + NLM_F_REQUEST | NLM_F_ACK, op, SMC_NLA_EID_TABLE_ENTRY,
> + (void *)test_ueid, sizeof(test_ueid));
> + if (!ASSERT_EQ(ret, 0, "ueid cmd"))
ASSERT_OK
> + goto fail;
> +
> + ret = recv(fd, &msg, sizeof(msg), 0);
> + if (!ASSERT_FALSE((ret < 0) ||
> + !NLMSG_OK((&msg.n), ret), "ueid response"))
> + goto fail;
> +
> + if (msg.n.nlmsg_type == NLMSG_ERROR) {
> + err = NLMSG_DATA(&msg);
> + switch (op) {
> + case SMC_NETLINK_REMOVE_UEID:
> + if (!ASSERT_FALSE((err->error && err->error != -ENOENT),
> + "ueid remove"))
> + goto fail;
> + break;
> + case SMC_NETLINK_ADD_UEID:
> + if (!ASSERT_EQ(err->error, 0, "ueid add"))
> + goto fail;
> + break;
> + default:
> + break;
> + }
> + }
> + close(fd);
> + return true;
> +fail:
> + close(fd);
> + return false;
> +}
> +
> +static bool setup_ueid(void)
> +{
> + /* get smc nl id */
> + if (!get_smc_nl_family_id())
> + return false;
> + /* clear old ueid for bpftest */
> + smc_ueid(SMC_NETLINK_REMOVE_UEID);
> + /* smc-loopback required ueid */
> + return smc_ueid(SMC_NETLINK_ADD_UEID);
> +}
> +
> +static void cleanup_ueid(void)
> +{
> + smc_ueid(SMC_NETLINK_REMOVE_UEID);
> +}
> +#endif /* __s390x__ */
> +
> +static bool setup_netns(void)
> +{
> + test_netns = netns_new(TEST_NS, true);
> + if (!ASSERT_OK_PTR(test_netns, "open net namespace"))
> + goto fail_netns;
> +
> + if (!ASSERT_OK(system("ip addr add 127.0.1.0/8 dev lo"),
> + "add server node"))
> + goto fail_ip;
> +
> + if (!ASSERT_OK(system("ip addr add 127.0.2.0/8 dev lo"),
> + "server via risk path"))
> + goto fail_ip;
> +
> + return true;
> +fail_ip:
> + netns_free(test_netns);
> +fail_netns:
> + return false;
> +}
> +
> +static void cleanup_netns(void)
> +{
> + netns_free(test_netns);
> + remove_netns(TEST_NS);
> +}
> +
> +static bool setup_smc(void)
> +{
> + if (!setup_ueid())
> + return false;
> +
> + if (!setup_netns())
> + goto fail_netns;
> +
> + return true;
> +fail_netns:
> + cleanup_ueid();
> + return false;
> +}
> +
> +static int set_client_addr_cb(int fd, void *opts)
> +{
> + const char *src = (const char *)opts;
> + struct sockaddr_in localaddr;
> +
> + localaddr.sin_family = AF_INET;
> + localaddr.sin_port = htons(0);
> + localaddr.sin_addr.s_addr = inet_addr(src);
> + return !ASSERT_EQ(bind(fd, &localaddr, sizeof(localaddr)), 0,
> + "client bind");
> +}
> +
> +static void run_link(const char *src, const char *dst, int port)
> +{
> + struct network_helper_opts opts = {0};
> + int server, client;
> +
> + server = start_server_str(AF_INET, SOCK_STREAM, dst, port, NULL);
> + if (!ASSERT_OK_FD(server, "start service_1"))
> + return;
> +
> + opts.proto = IPPROTO_TCP;
> + opts.post_socket_cb = set_client_addr_cb;
> + opts.cb_opts = (void *)src;
> +
> + client = connect_to_fd_opts(server, &opts);
> + if (!ASSERT_OK_FD(client, "start connect"))
> + goto fail_client;
> +
> + close(client);
> +fail_client:
> + close(server);
> +}
> +
> +static void block_link(int map_fd, const char *src, const char *dst)
> +{
> + struct smc_strat_ip_value val = { .mode = /* block */ 0 };
> + struct smc_strat_ip_key key = {
> + .sip = inet_addr(src),
> + .dip = inet_addr(dst),
> + };
> +
> + bpf_map_update_elem(map_fd, &key, &val, BPF_ANY);
> +}
> +
> +/*
> + * This test describes a real-life service topology as follows:
> + *
> + * +-------------> service_1
> + * link1 | |
> + * +--------------------> server | link 2
> + * | | V
> + * | +-------------> service_2
> + * | link 3
> + * client -------------------> server_via_unsafe_path -> service_3
> + *
> + * Among them,
> + * 1. link-1 is very suitable for using SMC.
> + * 2. link-2 is not suitable for using SMC, because the mode of this link is
> + * kind of short-link services.
> + * 3. link-3 is also not suitable for using SMC, because the RDMA link is
> + * unavailable and needs to go through a long timeout before it can fallback
> + * to TCP.
> + * To achieve this goal, we use a customized SMC ip strategy via smc_ops.
> + */
> +static void test_topo(void)
> +{
> + struct bpf_smc *skel;
> + int rc, map_fd;
> +
> + skel = bpf_smc__open_and_load();
> + if (!ASSERT_OK_PTR(skel, "bpf_smc__open_and_load"))
> + return;
> +
> + rc = bpf_smc__attach(skel);
> + if (!ASSERT_EQ(rc, 0, "bpf_smc__attach"))
> + goto fail;
> +
> + map_fd = bpf_map__fd(skel->maps.smc_strats_ip);
> + if (!ASSERT_GT(map_fd, 0, "bpf_map__fd"))
> + goto fail;
> +
> + /* Mock the process of transparent replacement, since we will modify
> + * protocol to ipproto_smc accropding to it via
> + * fmod_ret/update_socket_protocol.
> + */
> + system("sysctl -w net.smc.ops=linkcheck");
> +
> + /* Configure ip strat */
> + block_link(map_fd, CLIENT_IP, SERVER_IP_VIA_RISK_PATH);
> + block_link(map_fd, SERVER_IP, SERVER_IP);
> +
> + /* should go with smc */
> + run_link(CLIENT_IP, SERVER_IP, SERVICE_1);
> + /* should go with smc fallback */
> + run_link(SERVER_IP, SERVER_IP, SERVICE_2);
> +
> + ASSERT_EQ(skel->bss->smc_cnt, 2, "smc count");
> + ASSERT_EQ(skel->bss->fallback_cnt, 1, "fallback count");
> +
> + /* should go with smc */
> + run_link(CLIENT_IP, SERVER_IP, SERVICE_2);
> +
> + ASSERT_EQ(skel->bss->smc_cnt, 3, "smc count");
> + ASSERT_EQ(skel->bss->fallback_cnt, 1, "fallback count");
> +
> + /* should go with smc fallback */
> + run_link(CLIENT_IP, SERVER_IP_VIA_RISK_PATH, SERVICE_3);
> +
> + ASSERT_EQ(skel->bss->smc_cnt, 4, "smc count");
> + ASSERT_EQ(skel->bss->fallback_cnt, 2, "fallback count");
> +
> +fail:
> + bpf_smc__destroy(skel);
> +}
> +
> +void test_bpf_smc(void)
> +{
> + if (!setup_smc()) {
> + printf("setup for smc test failed, test SKIP:\n");
> + test__skip();
> + return;
> + }
> +
> + if (test__start_subtest("topo"))
> + test_topo();
> +
> + cleanup_ueid();
> + cleanup_netns();
> +}
> diff --git a/tools/testing/selftests/bpf/progs/bpf_smc.c b/tools/testing/selftests/bpf/progs/bpf_smc.c
> new file mode 100644
> index 000000000000..38b0490bd875
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/bpf_smc.c
> @@ -0,0 +1,117 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +#include "vmlinux.h"
> +
> +#include <bpf/bpf_helpers.h>
> +#include <bpf/bpf_tracing.h>
> +#include "bpf_tracing_net.h"
> +
> +char _license[] SEC("license") = "GPL";
> +
> +enum {
> + BPF_SMC_LISTEN = 10,
> +};
> +
> +struct smc_sock___local {
> + struct sock sk;
> + struct smc_sock *listen_smc;
> + bool use_fallback;
> +} __attribute__((preserve_access_index));
> +
> +int smc_cnt = 0;
> +int fallback_cnt = 0;
> +
> +SEC("fentry/smc_release")
> +int BPF_PROG(bpf_smc_release, struct socket *sock)
> +{
> + /* only count from one side (client) */
> + if (sock->sk->__sk_common.skc_state == BPF_SMC_LISTEN)
> + return 0;
> + smc_cnt++;
> + return 0;
> +}
> +
> +SEC("fentry/smc_switch_to_fallback")
> +int BPF_PROG(bpf_smc_switch_to_fallback, struct smc_sock___local *smc)
> +{
> + /* only count from one side (client) */
> + if (smc && !BPF_CORE_READ(smc, listen_smc))
It should not need BPF_CORE_READ. smc can be directly read like the above
sock->sk->...
> + fallback_cnt++;
> + return 0;
> +}
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH bpf-next v6 5/5] bpf/selftests: add selftest for bpf_smc_ops
2025-01-16 7:44 ` [PATCH bpf-next v6 5/5] bpf/selftests: add selftest for bpf_smc_ops D. Wythe
2025-01-18 0:37 ` Martin KaFai Lau
@ 2025-01-21 6:42 ` Saket Kumar Bhaskar
2025-01-22 2:46 ` D. Wythe
1 sibling, 1 reply; 17+ messages in thread
From: Saket Kumar Bhaskar @ 2025-01-21 6:42 UTC (permalink / raw)
To: D. Wythe
Cc: kgraul, wenjia, jaka, ast, daniel, andrii, martin.lau, pabeni,
song, sdf, haoluo, yhs, edumazet, john.fastabend, kpsingh, jolsa,
guwen, kuba, davem, netdev, linux-s390, linux-rdma, bpf
On Thu, Jan 16, 2025 at 03:44:42PM +0800, D. Wythe wrote:
> This tests introduces a tiny smc_ops for filtering SMC connections based on
> IP pairs, and also adds a realistic topology model to verify this ops.
>
> Also, we can only use SMC loopback under CI test, so an
> additional configuration needs to be enabled.
>
> Follow the steps below to run this test.
>
> make -C tools/testing/selftests/bpf
> cd tools/testing/selftests/bpf
> sudo ./test_progs -t smc
>
> Results shows:
> Summary: 1/1 PASSED, 0 SKIPPED, 0 FAILED
>
> Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
> ---
> tools/testing/selftests/bpf/config | 4 +
> .../selftests/bpf/prog_tests/test_bpf_smc.c | 397 ++++++++++++++++++
> tools/testing/selftests/bpf/progs/bpf_smc.c | 117 ++++++
> 3 files changed, 518 insertions(+)
> create mode 100644 tools/testing/selftests/bpf/prog_tests/test_bpf_smc.c
> create mode 100644 tools/testing/selftests/bpf/progs/bpf_smc.c
>
> diff --git a/tools/testing/selftests/bpf/config b/tools/testing/selftests/bpf/config
> index c378d5d07e02..fac2f2a9d02f 100644
> --- a/tools/testing/selftests/bpf/config
> +++ b/tools/testing/selftests/bpf/config
> @@ -113,3 +113,7 @@ CONFIG_XDP_SOCKETS=y
> CONFIG_XFRM_INTERFACE=y
> CONFIG_TCP_CONG_DCTCP=y
> CONFIG_TCP_CONG_BBR=y
> +CONFIG_INFINIBAND=y
> +CONFIG_SMC=y
> +CONFIG_SMC_OPS=y
> +CONFIG_SMC_LO=y
> \ No newline at end of file
> diff --git a/tools/testing/selftests/bpf/prog_tests/test_bpf_smc.c b/tools/testing/selftests/bpf/prog_tests/test_bpf_smc.c
> new file mode 100644
> index 000000000000..1e06325bfbaf
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/prog_tests/test_bpf_smc.c
> @@ -0,0 +1,397 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include <test_progs.h>
> +#include <linux/genetlink.h>
> +#include "network_helpers.h"
> +#include "bpf_smc.skel.h"
> +
> +#ifndef IPPROTO_SMC
> +#define IPPROTO_SMC 256
> +#endif
> +
> +#define CLIENT_IP "127.0.0.1"
> +#define SERVER_IP "127.0.1.0"
> +#define SERVER_IP_VIA_RISK_PATH "127.0.2.0"
> +
> +#define SERVICE_1 11234
> +#define SERVICE_2 22345
> +#define SERVICE_3 33456
> +
> +#define TEST_NS "bpf_smc_netns"
> +
> +static struct netns_obj *test_netns;
> +
> +struct smc_strat_ip_key {
> + __u32 sip;
> + __u32 dip;
> +};
> +
> +struct smc_strat_ip_value {
> + __u8 mode;
> +};
> +
> +#if defined(__s390x__)
> +/* s390x has default seid */
> +static bool setup_ueid(void) { return true; }
> +static void cleanup_ueid(void) {}
> +#else
> +enum {
> + SMC_NETLINK_ADD_UEID = 10,
> + SMC_NETLINK_REMOVE_UEID
> +};
> +
> +enum {
> + SMC_NLA_EID_TABLE_UNSPEC,
> + SMC_NLA_EID_TABLE_ENTRY, /* string */
> +};
> +
> +struct msgtemplate {
> + struct nlmsghdr n;
> + struct genlmsghdr g;
> + char buf[1024];
> +};
> +
> +#define GENLMSG_DATA(glh) ((void *)(NLMSG_DATA(glh) + GENL_HDRLEN))
> +#define GENLMSG_PAYLOAD(glh) (NLMSG_PAYLOAD(glh, 0) - GENL_HDRLEN)
> +#define NLA_DATA(na) ((void *)((char *)(na) + NLA_HDRLEN))
> +#define NLA_PAYLOAD(len) ((len) - NLA_HDRLEN)
> +
> +#define SMC_GENL_FAMILY_NAME "SMC_GEN_NETLINK"
> +#define SMC_BPFTEST_UEID "SMC-BPFTEST-UEID"
> +
> +static uint16_t smc_nl_family_id = -1;
> +
> +static int send_cmd(int fd, __u16 nlmsg_type, __u32 nlmsg_pid,
> + __u16 nlmsg_flags, __u8 genl_cmd, __u16 nla_type,
> + void *nla_data, int nla_len)
> +{
> + struct nlattr *na;
> + struct sockaddr_nl nladdr;
> + int r, buflen;
> + char *buf;
> +
> + struct msgtemplate msg = {0};
> +
> + msg.n.nlmsg_len = NLMSG_LENGTH(GENL_HDRLEN);
> + msg.n.nlmsg_type = nlmsg_type;
> + msg.n.nlmsg_flags = nlmsg_flags;
> + msg.n.nlmsg_seq = 0;
> + msg.n.nlmsg_pid = nlmsg_pid;
> + msg.g.cmd = genl_cmd;
> + msg.g.version = 1;
> + na = (struct nlattr *) GENLMSG_DATA(&msg);
> + na->nla_type = nla_type;
> + na->nla_len = nla_len + 1 + NLA_HDRLEN;
> + memcpy(NLA_DATA(na), nla_data, nla_len);
> + msg.n.nlmsg_len += NLMSG_ALIGN(na->nla_len);
> +
> + buf = (char *) &msg;
> + buflen = msg.n.nlmsg_len;
> + memset(&nladdr, 0, sizeof(nladdr));
> + nladdr.nl_family = AF_NETLINK;
> +
> + while ((r = sendto(fd, buf, buflen, 0, (struct sockaddr *) &nladdr,
> + sizeof(nladdr))) < buflen) {
> + if (r > 0) {
> + buf += r;
> + buflen -= r;
> + } else if (errno != EAGAIN) {
> + return -1;
> + }
> + }
> + return 0;
> +}
> +
> +static bool get_smc_nl_family_id(void)
> +{
> + struct sockaddr_nl nl_src;
> + struct msgtemplate msg;
> + struct nlattr *nl;
> + int fd, ret;
> + pid_t pid;
> +
> + fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_GENERIC);
> + if (!ASSERT_GT(fd, 0, "nl_family socket"))
> + return false;
> +
> + pid = getpid();
> +
> + memset(&nl_src, 0, sizeof(nl_src));
> + nl_src.nl_family = AF_NETLINK;
> + nl_src.nl_pid = pid;
> +
> + ret = bind(fd, (struct sockaddr *) &nl_src, sizeof(nl_src));
> + if (!ASSERT_GE(ret, 0, "nl_family bind"))
> + goto fail;
> +
> + ret = send_cmd(fd, GENL_ID_CTRL, pid,
> + NLM_F_REQUEST, CTRL_CMD_GETFAMILY,
> + CTRL_ATTR_FAMILY_NAME, (void *)SMC_GENL_FAMILY_NAME,
> + strlen(SMC_GENL_FAMILY_NAME));
> + if (!ASSERT_EQ(ret, 0, "nl_family query"))
> + goto fail;
> +
> + ret = recv(fd, &msg, sizeof(msg), 0);
> + if (!ASSERT_FALSE(msg.n.nlmsg_type == NLMSG_ERROR || (ret < 0) ||
> + !NLMSG_OK((&msg.n), ret), "nl_family response"))
> + goto fail;
> +
> + nl = (struct nlattr *) GENLMSG_DATA(&msg);
> + nl = (struct nlattr *) ((char *) nl + NLA_ALIGN(nl->nla_len));
> + if (!ASSERT_EQ(nl->nla_type, CTRL_ATTR_FAMILY_ID, "nl_family nla type"))
> + goto fail;
> +
> + smc_nl_family_id = *(uint16_t *) NLA_DATA(nl);
> + close(fd);
> + return true;
> +fail:
> + close(fd);
> + return false;
> +}
> +
> +static bool smc_ueid(int op)
> +{
> + struct sockaddr_nl nl_src;
> + struct msgtemplate msg;
> + struct nlmsgerr *err;
> + char test_ueid[32];
> + int fd, ret;
> + pid_t pid;
> +
> + /* UEID required */
> + memset(test_ueid, '\x20', sizeof(test_ueid));
> + memcpy(test_ueid, SMC_BPFTEST_UEID, strlen(SMC_BPFTEST_UEID));
> + fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_GENERIC);
> + if (!ASSERT_GT(fd, 0, "ueid socket"))
> + return false;
> +
> + pid = getpid();
> + memset(&nl_src, 0, sizeof(nl_src));
> + nl_src.nl_family = AF_NETLINK;
> + nl_src.nl_pid = pid;
> +
> + ret = bind(fd, (struct sockaddr *) &nl_src, sizeof(nl_src));
> + if (!ASSERT_GE(ret, 0, "ueid bind"))
> + goto fail;
> +
> + ret = send_cmd(fd, smc_nl_family_id, pid,
> + NLM_F_REQUEST | NLM_F_ACK, op, SMC_NLA_EID_TABLE_ENTRY,
> + (void *)test_ueid, sizeof(test_ueid));
> + if (!ASSERT_EQ(ret, 0, "ueid cmd"))
> + goto fail;
> +
> + ret = recv(fd, &msg, sizeof(msg), 0);
> + if (!ASSERT_FALSE((ret < 0) ||
> + !NLMSG_OK((&msg.n), ret), "ueid response"))
> + goto fail;
> +
> + if (msg.n.nlmsg_type == NLMSG_ERROR) {
> + err = NLMSG_DATA(&msg);
> + switch (op) {
> + case SMC_NETLINK_REMOVE_UEID:
> + if (!ASSERT_FALSE((err->error && err->error != -ENOENT),
> + "ueid remove"))
> + goto fail;
> + break;
> + case SMC_NETLINK_ADD_UEID:
> + if (!ASSERT_EQ(err->error, 0, "ueid add"))
> + goto fail;
> + break;
> + default:
> + break;
> + }
> + }
> + close(fd);
> + return true;
> +fail:
> + close(fd);
> + return false;
> +}
> +
> +static bool setup_ueid(void)
> +{
> + /* get smc nl id */
> + if (!get_smc_nl_family_id())
> + return false;
> + /* clear old ueid for bpftest */
> + smc_ueid(SMC_NETLINK_REMOVE_UEID);
> + /* smc-loopback required ueid */
> + return smc_ueid(SMC_NETLINK_ADD_UEID);
> +}
> +
> +static void cleanup_ueid(void)
> +{
> + smc_ueid(SMC_NETLINK_REMOVE_UEID);
> +}
> +#endif /* __s390x__ */
> +
> +static bool setup_netns(void)
> +{
> + test_netns = netns_new(TEST_NS, true);
> + if (!ASSERT_OK_PTR(test_netns, "open net namespace"))
> + goto fail_netns;
> +
> + if (!ASSERT_OK(system("ip addr add 127.0.1.0/8 dev lo"),
> + "add server node"))
> + goto fail_ip;
> +
> + if (!ASSERT_OK(system("ip addr add 127.0.2.0/8 dev lo"),
> + "server via risk path"))
> + goto fail_ip;
> +
> + return true;
> +fail_ip:
> + netns_free(test_netns);
> +fail_netns:
> + return false;
> +}
> +
> +static void cleanup_netns(void)
> +{
> + netns_free(test_netns);
> + remove_netns(TEST_NS);
> +}
> +
> +static bool setup_smc(void)
> +{
> + if (!setup_ueid())
> + return false;
> +
> + if (!setup_netns())
> + goto fail_netns;
> +
> + return true;
> +fail_netns:
> + cleanup_ueid();
> + return false;
> +}
> +
> +static int set_client_addr_cb(int fd, void *opts)
> +{
> + const char *src = (const char *)opts;
> + struct sockaddr_in localaddr;
> +
> + localaddr.sin_family = AF_INET;
> + localaddr.sin_port = htons(0);
> + localaddr.sin_addr.s_addr = inet_addr(src);
> + return !ASSERT_EQ(bind(fd, &localaddr, sizeof(localaddr)), 0,
> + "client bind");
> +}
> +
> +static void run_link(const char *src, const char *dst, int port)
> +{
> + struct network_helper_opts opts = {0};
> + int server, client;
> +
> + server = start_server_str(AF_INET, SOCK_STREAM, dst, port, NULL);
> + if (!ASSERT_OK_FD(server, "start service_1"))
> + return;
> +
> + opts.proto = IPPROTO_TCP;
> + opts.post_socket_cb = set_client_addr_cb;
> + opts.cb_opts = (void *)src;
> +
> + client = connect_to_fd_opts(server, &opts);
> + if (!ASSERT_OK_FD(client, "start connect"))
> + goto fail_client;
> +
> + close(client);
> +fail_client:
> + close(server);
> +}
> +
> +static void block_link(int map_fd, const char *src, const char *dst)
> +{
> + struct smc_strat_ip_value val = { .mode = /* block */ 0 };
> + struct smc_strat_ip_key key = {
> + .sip = inet_addr(src),
> + .dip = inet_addr(dst),
> + };
> +
> + bpf_map_update_elem(map_fd, &key, &val, BPF_ANY);
> +}
> +
> +/*
> + * This test describes a real-life service topology as follows:
> + *
> + * +-------------> service_1
> + * link1 | |
> + * +--------------------> server | link 2
> + * | | V
> + * | +-------------> service_2
> + * | link 3
> + * client -------------------> server_via_unsafe_path -> service_3
> + *
> + * Among them,
> + * 1. link-1 is very suitable for using SMC.
> + * 2. link-2 is not suitable for using SMC, because the mode of this link is
> + * kind of short-link services.
> + * 3. link-3 is also not suitable for using SMC, because the RDMA link is
> + * unavailable and needs to go through a long timeout before it can fallback
> + * to TCP.
> + * To achieve this goal, we use a customized SMC ip strategy via smc_ops.
> + */
> +static void test_topo(void)
> +{
> + struct bpf_smc *skel;
> + int rc, map_fd;
> +
> + skel = bpf_smc__open_and_load();
> + if (!ASSERT_OK_PTR(skel, "bpf_smc__open_and_load"))
> + return;
> +
> + rc = bpf_smc__attach(skel);
> + if (!ASSERT_EQ(rc, 0, "bpf_smc__attach"))
> + goto fail;
> +
> + map_fd = bpf_map__fd(skel->maps.smc_strats_ip);
> + if (!ASSERT_GT(map_fd, 0, "bpf_map__fd"))
> + goto fail;
> +
> + /* Mock the process of transparent replacement, since we will modify
> + * protocol to ipproto_smc accropding to it via
> + * fmod_ret/update_socket_protocol.
> + */
> + system("sysctl -w net.smc.ops=linkcheck");
> +
> + /* Configure ip strat */
> + block_link(map_fd, CLIENT_IP, SERVER_IP_VIA_RISK_PATH);
> + block_link(map_fd, SERVER_IP, SERVER_IP);
> +
> + /* should go with smc */
> + run_link(CLIENT_IP, SERVER_IP, SERVICE_1);
> + /* should go with smc fallback */
> + run_link(SERVER_IP, SERVER_IP, SERVICE_2);
> +
> + ASSERT_EQ(skel->bss->smc_cnt, 2, "smc count");
> + ASSERT_EQ(skel->bss->fallback_cnt, 1, "fallback count");
> +
> + /* should go with smc */
> + run_link(CLIENT_IP, SERVER_IP, SERVICE_2);
> +
> + ASSERT_EQ(skel->bss->smc_cnt, 3, "smc count");
> + ASSERT_EQ(skel->bss->fallback_cnt, 1, "fallback count");
> +
> + /* should go with smc fallback */
> + run_link(CLIENT_IP, SERVER_IP_VIA_RISK_PATH, SERVICE_3);
> +
> + ASSERT_EQ(skel->bss->smc_cnt, 4, "smc count");
> + ASSERT_EQ(skel->bss->fallback_cnt, 2, "fallback count");
> +
> +fail:
> + bpf_smc__destroy(skel);
> +}
> +
> +void test_bpf_smc(void)
> +{
> + if (!setup_smc()) {
> + printf("setup for smc test failed, test SKIP:\n");
> + test__skip();
> + return;
> + }
> +
> + if (test__start_subtest("topo"))
> + test_topo();
> +
> + cleanup_ueid();
> + cleanup_netns();
> +}
> diff --git a/tools/testing/selftests/bpf/progs/bpf_smc.c b/tools/testing/selftests/bpf/progs/bpf_smc.c
> new file mode 100644
> index 000000000000..38b0490bd875
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/bpf_smc.c
> @@ -0,0 +1,117 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +#include "vmlinux.h"
> +
> +#include <bpf/bpf_helpers.h>
> +#include <bpf/bpf_tracing.h>
> +#include "bpf_tracing_net.h"
> +
> +char _license[] SEC("license") = "GPL";
> +
> +enum {
> + BPF_SMC_LISTEN = 10,
> +};
> +
> +struct smc_sock___local {
> + struct sock sk;
> + struct smc_sock *listen_smc;
> + bool use_fallback;
> +} __attribute__((preserve_access_index));
> +
> +int smc_cnt = 0;
> +int fallback_cnt = 0;
> +
> +SEC("fentry/smc_release")
> +int BPF_PROG(bpf_smc_release, struct socket *sock)
> +{
> + /* only count from one side (client) */
> + if (sock->sk->__sk_common.skc_state == BPF_SMC_LISTEN)
> + return 0;
> + smc_cnt++;
> + return 0;
> +}
> +
> +SEC("fentry/smc_switch_to_fallback")
> +int BPF_PROG(bpf_smc_switch_to_fallback, struct smc_sock___local *smc)
> +{
> + /* only count from one side (client) */
> + if (smc && !BPF_CORE_READ(smc, listen_smc))
> + fallback_cnt++;
> + return 0;
> +}
> +
> +/* go with default value if no strat was found */
> +bool default_ip_strat_value = true;
> +
> +struct smc_strat_ip_key {
> + __u32 sip;
> + __u32 dip;
> +};
> +
> +struct smc_strat_ip_value {
> + __u8 mode;
> +};
> +
> +struct {
> + __uint(type, BPF_MAP_TYPE_HASH);
> + __uint(key_size, sizeof(struct smc_strat_ip_key));
> + __uint(value_size, sizeof(struct smc_strat_ip_value));
> + __uint(max_entries, 128);
> + __uint(map_flags, BPF_F_NO_PREALLOC);
> +} smc_strats_ip SEC(".maps");
> +
> +static bool smc_check(__u32 src, __u32 dst)
> +{
> + struct smc_strat_ip_value *value;
> + struct smc_strat_ip_key key = {
> + .sip = src,
> + .dip = dst,
> + };
> +
> + value = bpf_map_lookup_elem(&smc_strats_ip, &key);
> + return value ? value->mode : default_ip_strat_value;
> +}
> +
> +SEC("fmod_ret/update_socket_protocol")
> +int BPF_PROG(smc_run, int family, int type, int protocol)
> +{
> + struct task_struct *task;
> +
> + if (family != AF_INET && family != AF_INET6)
> + return protocol;
> +
> + if ((type & 0xf) != SOCK_STREAM)
> + return protocol;
> +
> + if (protocol != 0 && protocol != IPPROTO_TCP)
> + return protocol;
> +
> + task = bpf_get_current_task_btf();
> + /* Prevent from affecting other tests */
> + if (!task || !task->nsproxy->net_ns->smc.ops)
> + return protocol;
> +
> + return IPPROTO_SMC;
> +}
> +
> +SEC("struct_ops/bpf_smc_set_tcp_option_cond")
> +int BPF_PROG(bpf_smc_set_tcp_option_cond, const struct tcp_sock *tp,
> + struct inet_request_sock *ireq)
> +{
> + return smc_check(ireq->req.__req_common.skc_daddr,
> + ireq->req.__req_common.skc_rcv_saddr);
> +}
> +
> +SEC("struct_ops/bpf_smc_set_tcp_option")
> +int BPF_PROG(bpf_smc_set_tcp_option, struct tcp_sock *tp)
> +{
> + return smc_check(tp->inet_conn.icsk_inet.sk.__sk_common.skc_rcv_saddr,
> + tp->inet_conn.icsk_inet.sk.__sk_common.skc_daddr);
> +}
> +
> +SEC(".struct_ops.link")
> +struct smc_ops linkcheck = {
> + .name = "linkcheck",
> + .set_option = (void *) bpf_smc_set_tcp_option,
> + .set_option_cond = (void *) bpf_smc_set_tcp_option_cond,
> +};
Tested this selftest with patches applied on powerpc.
#./test_progs -t bpf_smc
net.smc.ops = linkcheck
#27/1 bpf_smc/topo:OK
#27 bpf_smc:OK
Summary: 1/1 PASSED, 0 SKIPPED, 0 FAILED
Tested-by: Saket Kumar Bhaskar <skb99@linux.ibm.com>
Thanks,
Saket
> --
> 2.45.0
>
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH bpf-next v6 5/5] bpf/selftests: add selftest for bpf_smc_ops
2025-01-18 0:37 ` Martin KaFai Lau
@ 2025-01-22 1:51 ` D. Wythe
0 siblings, 0 replies; 17+ messages in thread
From: D. Wythe @ 2025-01-22 1:51 UTC (permalink / raw)
To: Martin KaFai Lau
Cc: D. Wythe, kgraul, wenjia, jaka, ast, daniel, andrii, pabeni, song,
sdf, haoluo, yhs, edumazet, john.fastabend, kpsingh, jolsa, guwen,
kuba, davem, netdev, linux-s390, linux-rdma, bpf
On Fri, Jan 17, 2025 at 04:37:04PM -0800, Martin KaFai Lau wrote:
> On 1/15/25 11:44 PM, D. Wythe wrote:
> >This tests introduces a tiny smc_ops for filtering SMC connections based on
> >IP pairs, and also adds a realistic topology model to verify this ops.
> >
> >Also, we can only use SMC loopback under CI test, so an
> >additional configuration needs to be enabled.
> >
> >Follow the steps below to run this test.
> >
> >make -C tools/testing/selftests/bpf
> >cd tools/testing/selftests/bpf
> >sudo ./test_progs -t smc
> >
> >Results shows:
> >Summary: 1/1 PASSED, 0 SKIPPED, 0 FAILED
> >
> >Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
> >---
> > tools/testing/selftests/bpf/config | 4 +
> > .../selftests/bpf/prog_tests/test_bpf_smc.c | 397 ++++++++++++++++++
> > tools/testing/selftests/bpf/progs/bpf_smc.c | 117 ++++++
> > 3 files changed, 518 insertions(+)
> > create mode 100644 tools/testing/selftests/bpf/prog_tests/test_bpf_smc.c
> > create mode 100644 tools/testing/selftests/bpf/progs/bpf_smc.c
> >
> >diff --git a/tools/testing/selftests/bpf/config b/tools/testing/selftests/bpf/config
> >index c378d5d07e02..fac2f2a9d02f 100644
> >--- a/tools/testing/selftests/bpf/config
> >+++ b/tools/testing/selftests/bpf/config
> >@@ -113,3 +113,7 @@ CONFIG_XDP_SOCKETS=y
> > CONFIG_XFRM_INTERFACE=y
> > CONFIG_TCP_CONG_DCTCP=y
> > CONFIG_TCP_CONG_BBR=y
> >+CONFIG_INFINIBAND=y
> >+CONFIG_SMC=y
> >+CONFIG_SMC_OPS=y
> >+CONFIG_SMC_LO=y
> >\ No newline at end of file
> >+ int fd, ret;
> >+ pid_t pid;
> >+
> >+ fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_GENERIC);
> >+ if (!ASSERT_GT(fd, 0, "nl_family socket"))
>
> Should be _GE. or just use ASSERT_OK_FD.
>
Take it.
> >+ if (!ASSERT_GE(ret, 0, "nl_family bind"))
>
> nit. ASSERT_OK.
>
> >+ if (!ASSERT_EQ(ret, 0, "nl_family query"))
>
> ASSERT_OK.
>
> >+ if (!ASSERT_GT(fd, 0, "ueid socket"))
>
> ASSERT_OK_FD
>
> >+ return false;
> >+ ret = bind(fd, (struct sockaddr *) &nl_src, sizeof(nl_src));
> >+ if (!ASSERT_GE(ret, 0, "ueid bind"))
>
> ASSERT_OK
>
> >+ goto fail;
> >+ ret = send_cmd(fd, smc_nl_family_id, pid,
> >+ (void *)test_ueid, sizeof(test_ueid));
> >+ if (!ASSERT_EQ(ret, 0, "ueid cmd"))
>
> ASSERT_OK
>
The parts of the assert macro have been all fixed, thanks for your
suggestion.
> >+ goto fail;
> >+
> >+int BPF_PROG(bpf_smc_switch_to_fallback, struct smc_sock___local *smc)
> >+{
> >+ /* only count from one side (client) */
> >+ if (smc && !BPF_CORE_READ(smc, listen_smc))
>
> It should not need BPF_CORE_READ. smc can be directly read like the
> above sock->sk->...
Got it. I'll fix it in next version.
Thanks,
D. Wythe
>
> >+ fallback_cnt++;
> >+ return 0;
> >+}
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH bpf-next v6 4/5] libbpf: fix error when st-prefix_ops and ops from differ btf
2025-01-17 18:36 ` Andrii Nakryiko
@ 2025-01-22 2:43 ` D. Wythe
2025-01-22 17:16 ` Andrii Nakryiko
0 siblings, 1 reply; 17+ messages in thread
From: D. Wythe @ 2025-01-22 2:43 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: D. Wythe, kgraul, wenjia, jaka, ast, daniel, andrii, martin.lau,
pabeni, song, sdf, haoluo, yhs, edumazet, john.fastabend, kpsingh,
jolsa, guwen, kuba, davem, netdev, linux-s390, linux-rdma, bpf
On Fri, Jan 17, 2025 at 10:36:50AM -0800, Andrii Nakryiko wrote:
> On Wed, Jan 15, 2025 at 11:45 PM D. Wythe <alibuda@linux.alibaba.com> wrote:
> >
> > When a struct_ops named xxx_ops was registered by a module, and
> > it will be used in both built-in modules and the module itself,
> > so that the btf_type of xxx_ops will be present in btf_vmlinux
> > instead of in btf_mod, which means that the btf_type of
> > bpf_struct_ops_xxx_ops and xxx_ops will not be in the same btf.
> >
> > Here are four possible case:
> >
> > +--------+---------------+-------------+---------------------------------+
> > | | st_ops_xxx_ops| xxx_ops | |
> > +--------+---------------+-------------+---------------------------------+
> > | case 0 | btf_vmlinux | bft_vmlinux | be used and reg only in vmlinux |
> > +--------+---------------+-------------+---------------------------------+
> > | case 1 | btf_vmlinux | bpf_mod | INVALID |
> > +--------+---------------+-------------+---------------------------------+
> > | case 2 | btf_mod | btf_vmlinux | reg in mod but be used both in |
> > | | | | vmlinux and mod. |
> > +--------+---------------+-------------+---------------------------------+
> > | case 3 | btf_mod | btf_mod | be used and reg only in mod |
> > +--------+---------------+-------------+---------------------------------+
> >
> > At present, cases 0, 1, and 3 can be correctly identified, because
> > + if (ret < 0 || ret >= sizeof(stname))
> > + return -ENAMETOOLONG;
>
> see preexisting snprintf() above, we don't really handle truncation
> errors explicitly, they are extremely unlikely and not expected at
> all, and worst case nothing will be found and user will get some
> -ENOENT or something like that eventually. I'd drop this extra error
> checking and keep it streamlines, similar to tname
>
Sounds reasonable to me. I will remove the explicit error checks in the
next version.
> > +
> > + /* Look for the corresponding "map_value" type that will be used
> > + * in map_update(BPF_MAP_TYPE_STRUCT_OPS) first, figure out the btf
> > + * and the mod_btf.
> > + * For example, find "struct bpf_struct_ops_tcp_congestion_ops".
> > + */
> > + kern_vtype_id = find_ksym_btf_id(obj, stname, BTF_KIND_STRUCT,
> > &btf, mod_btf);
>
> nit: if this fits under 100 characters, keep it single line
>
> > + if (kern_vtype_id < 0) {
> > + pr_warn("struct_ops init_kern: struct %s is not found in kernel BTF\n",
> > + stname);
>
> same nit about preserving single-line statements as much as possible,
> they are much easier to read
None of them exceed 100 lines. Usually, I would check patches with 85 lines limitations,
but since 100 lines is acceptable, we can modify it to a single line here for
better readability.
And thanks very much for your suggestion, I plan to fix these style
issues in next versions with you ack, is this okay for you?
Best wishes,
D. Wythe
>
> > + return kern_vtype_id;
> > + }
> > + kern_vtype = btf__type_by_id(btf, kern_vtype_id);
> > +
> > + kern_type_id = btf__find_by_name_kind(btf, tname, BTF_KIND_STRUCT);
> > if (kern_type_id < 0) {
> > pr_warn("struct_ops init_kern: struct %s is not found in kernel BTF\n",
> > tname);
> > @@ -1020,20 +1039,6 @@ find_struct_ops_kern_types(struct bpf_object *obj, const char *tname_raw,
> > }
> > kern_type = btf__type_by_id(btf, kern_type_id);
> >
> > - /* Find the corresponding "map_value" type that will be used
> > - * in map_update(BPF_MAP_TYPE_STRUCT_OPS). For example,
> > - * find "struct bpf_struct_ops_tcp_congestion_ops" from the
> > - * btf_vmlinux.
> > - */
> > - kern_vtype_id = find_btf_by_prefix_kind(btf, STRUCT_OPS_VALUE_PREFIX,
> > - tname, BTF_KIND_STRUCT);
> > - if (kern_vtype_id < 0) {
> > - pr_warn("struct_ops init_kern: struct %s%s is not found in kernel BTF\n",
> > - STRUCT_OPS_VALUE_PREFIX, tname);
> > - return kern_vtype_id;
> > - }
> > - kern_vtype = btf__type_by_id(btf, kern_vtype_id);
> > -
> > /* Find "struct tcp_congestion_ops" from
> > * struct bpf_struct_ops_tcp_congestion_ops {
> > * [ ... ]
> > @@ -1046,8 +1051,8 @@ find_struct_ops_kern_types(struct bpf_object *obj, const char *tname_raw,
> > break;
> > }
> > if (i == btf_vlen(kern_vtype)) {
> > - pr_warn("struct_ops init_kern: struct %s data is not found in struct %s%s\n",
> > - tname, STRUCT_OPS_VALUE_PREFIX, tname);
> > + pr_warn("struct_ops init_kern: struct %s data is not found in struct %s\n",
> > + tname, stname);
> > return -EINVAL;
> > }
> >
> > --
> > 2.45.0
> >
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH bpf-next v6 5/5] bpf/selftests: add selftest for bpf_smc_ops
2025-01-21 6:42 ` Saket Kumar Bhaskar
@ 2025-01-22 2:46 ` D. Wythe
2025-01-22 4:39 ` Saket Kumar Bhaskar
0 siblings, 1 reply; 17+ messages in thread
From: D. Wythe @ 2025-01-22 2:46 UTC (permalink / raw)
To: Saket Kumar Bhaskar
Cc: D. Wythe, kgraul, wenjia, jaka, ast, daniel, andrii, martin.lau,
pabeni, song, sdf, haoluo, yhs, edumazet, john.fastabend, kpsingh,
jolsa, guwen, kuba, davem, netdev, linux-s390, linux-rdma, bpf
On Tue, Jan 21, 2025 at 12:12:07PM +0530, Saket Kumar Bhaskar wrote:
> On Thu, Jan 16, 2025 at 03:44:42PM +0800, D. Wythe wrote:
> > This tests introduces a tiny smc_ops for filtering SMC connections based on
> > IP pairs, and also adds a realistic topology model to verify this ops.
> >
> > Also, we can only use SMC loopback under CI test, so an
> > additional configuration needs to be enabled.
> >
> > Follow the steps below to run this test.
> >
> > make -C tools/testing/selftests/bpf
> > cd tools/testing/selftests/bpf
> > sudo ./test_progs -t smc
> >
> > Results shows:
> > Summary: 1/1 PASSED, 0 SKIPPED, 0 FAILED
> >
> > Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
> > ---
> > tools/testing/selftests/bpf/config | 4 +
> > .../selftests/bpf/prog_tests/test_bpf_smc.c | 397 ++++++++++++++++++
> > tools/testing/selftests/bpf/progs/bpf_smc.c | 117 ++++++
> > 3 files changed, 518 insertions(+)
> > create mode 100644 tools/testing/selftests/bpf/prog_tests/test_bpf_smc.c
> > create mode 100644 tools/testing/selftests/bpf/progs/bpf_smc.c
> >
> > diff --git a/tools/testing/selftests/bpf/config b/tools/testing/selftests/bpf/config
> > index c378d5d07e02..fac2f2a9d02f 100644
> > --- a/tools/testing/selftests/bpf/config
> > +++ b/tools/testing/selftests/bpf/config
> > @@ -113,3 +113,7 @@ CONFIG_XDP_SOCKETS=y
> > +};
> Tested this selftest with patches applied on powerpc.
>
> #./test_progs -t bpf_smc
>
> net.smc.ops = linkcheck
> #27/1 bpf_smc/topo:OK
> #27 bpf_smc:OK
> Summary: 1/1 PASSED, 0 SKIPPED, 0 FAILED
>
> Tested-by: Saket Kumar Bhaskar <skb99@linux.ibm.com>
>
> Thanks,
> Saket
Hi Saket,
Thanks for your testing. I hope you don't mind if I add your test-by in
the next version.
Best wishes,
D. Wythe
> > --
> > 2.45.0
> >
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH bpf-next v6 2/5] net/smc: Introduce generic hook smc_ops
2025-01-17 23:50 ` Martin KaFai Lau
@ 2025-01-22 4:35 ` D. Wythe
0 siblings, 0 replies; 17+ messages in thread
From: D. Wythe @ 2025-01-22 4:35 UTC (permalink / raw)
To: Martin KaFai Lau
Cc: D. Wythe, kgraul, wenjia, jaka, ast, daniel, andrii, pabeni, song,
sdf, haoluo, yhs, edumazet, john.fastabend, kpsingh, jolsa, guwen,
kuba, davem, netdev, linux-s390, linux-rdma, bpf
On Fri, Jan 17, 2025 at 03:50:48PM -0800, Martin KaFai Lau wrote:
> On 1/15/25 11:44 PM, D. Wythe wrote:
> >diff --git a/net/smc/smc_sysctl.c b/net/smc/smc_sysctl.c
> >index 2fab6456f765..2004241c3045 100644
> >--- a/net/smc/smc_sysctl.c
> >+++ b/net/smc/smc_sysctl.c
> >@@ -18,6 +18,7 @@
> > #include "smc_core.h"
> > #include "smc_llc.h"
> > #include "smc_sysctl.h"
> >+#include "smc_ops.h"
> > static int min_sndbuf = SMC_BUF_MIN_SIZE;
> > static int min_rcvbuf = SMC_BUF_MIN_SIZE;
> >@@ -30,6 +31,69 @@ static int links_per_lgr_max = SMC_LINKS_ADD_LNK_MAX;
> > static int conns_per_lgr_min = SMC_CONN_PER_LGR_MIN;
> > static int conns_per_lgr_max = SMC_CONN_PER_LGR_MAX;
> >+#if IS_ENABLED(CONFIG_SMC_OPS)
> >+static int smc_net_replace_smc_ops(struct net *net, const char *name)
> >+{
> >+ struct smc_ops *ops = NULL;
> >+
> >+ rcu_read_lock();
> >+ /* null or empty name ask to clear current ops */
> >+ if (name && name[0]) {
> >+ ops = smc_ops_find_by_name(name);
> >+ if (!ops) {
> >+ rcu_read_unlock();
> >+ return -EINVAL;
> >+ }
> >+ /* no change, just return */
> >+ if (ops == rcu_dereference(net->smc.ops)) {
> >+ rcu_read_unlock();
> >+ return 0;
> >+ }
> >+ }
> >+ if (!ops || bpf_try_module_get(ops, ops->owner)) {
> >+ /* xhcg */
>
> typo. I noticed it only because...
>
> >+ ops = rcu_replace_pointer(net->smc.ops, ops, true);
>
> ... rcu_replace_pointer() does not align with the above xchg
> comment. From looking into rcu_replace_pointer, it is not a xchg. It
> is also not obvious to me why it is safe to assume "true" here...
>
> >+ /* release old ops */
> >+ if (ops)
> >+ bpf_module_put(ops, ops->owner);
>
> ... together with a put here on the return value of the rcu_replace_pointer.
>
Hi Martin,
This is indeed a very good catch. Initially, I used the xhcg()
for swapping, but later I thought there wouldn't be a situation where
smc_net_replace_smc_ops would be called simultaneously with the same net.
Therefore, I modified it to rcu_replace_pointer, which is also why I assumed
that it was true here, I thought the updates here was prevented. but now I
realize that sysctl might not be mutually exclusive. It seems that this should
be changed back to xhcg().
> >+ } else if (ops) {
>
> nit. This looks redundant when looking at the "if (!ops || ..." test above
> Also a nit, I would move the bpf_try_module_get() immediately after
> the above "if (ops == rcu_dereference(net->smc.ops))" test. This
> should simplify the later cases.
>
This is a very good suggestion. I tried it and the code became very
clean. I'll take it in next version.
> >+ rcu_read_unlock();
> >+ return -EBUSY;
> >+ }
> >+ rcu_read_unlock();
> >+ return 0;
> >+}
> >+
> >+static int proc_smc_ops(const struct ctl_table *ctl, int write,
> >+#if IS_ENABLED(CONFIG_SMC_OPS)
> >+ struct smc_ops *ops;
> >+
> >+ rcu_read_lock();
> >+ ops = rcu_dereference(init_net.smc.ops);
> >+ if (ops && ops->flags & SMC_OPS_FLAG_INHERITABLE) {
> >+ if (!bpf_try_module_get(ops, ops->owner)) {
> >+ rcu_read_unlock();
> >+ return -EBUSY;
>
> Not sure if it should count as error when the ops is in the process
> of un-register-ing. The next smc_sysctl_net_init will have NULL ops
> and succeed. Something for you to consider.
>
It seems more reasonable that no need to prevent net initialization
just because ops is uninstalling... I plan to just skip that error.
>
> Also, it needs an ack from the SMC maintainer for the SMC specific
> parts like the sysctl here.
Got it. I will communicate this matter with the SMC maintainers.
Best wishes,
D. Wythe
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH bpf-next v6 5/5] bpf/selftests: add selftest for bpf_smc_ops
2025-01-22 2:46 ` D. Wythe
@ 2025-01-22 4:39 ` Saket Kumar Bhaskar
0 siblings, 0 replies; 17+ messages in thread
From: Saket Kumar Bhaskar @ 2025-01-22 4:39 UTC (permalink / raw)
To: D. Wythe
Cc: kgraul, wenjia, jaka, ast, daniel, andrii, martin.lau, pabeni,
song, sdf, haoluo, yhs, edumazet, john.fastabend, kpsingh, jolsa,
guwen, kuba, davem, netdev, linux-s390, linux-rdma, bpf
On Wed, Jan 22, 2025 at 10:46:51AM +0800, D. Wythe wrote:
> On Tue, Jan 21, 2025 at 12:12:07PM +0530, Saket Kumar Bhaskar wrote:
> > On Thu, Jan 16, 2025 at 03:44:42PM +0800, D. Wythe wrote:
> > > This tests introduces a tiny smc_ops for filtering SMC connections based on
> > > IP pairs, and also adds a realistic topology model to verify this ops.
> > >
> > > Also, we can only use SMC loopback under CI test, so an
> > > additional configuration needs to be enabled.
> > >
> > > Follow the steps below to run this test.
> > >
> > > make -C tools/testing/selftests/bpf
> > > cd tools/testing/selftests/bpf
> > > sudo ./test_progs -t smc
> > >
> > > Results shows:
> > > Summary: 1/1 PASSED, 0 SKIPPED, 0 FAILED
> > >
> > > Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
> > > ---
> > > tools/testing/selftests/bpf/config | 4 +
> > > .../selftests/bpf/prog_tests/test_bpf_smc.c | 397 ++++++++++++++++++
> > > tools/testing/selftests/bpf/progs/bpf_smc.c | 117 ++++++
> > > 3 files changed, 518 insertions(+)
> > > create mode 100644 tools/testing/selftests/bpf/prog_tests/test_bpf_smc.c
> > > create mode 100644 tools/testing/selftests/bpf/progs/bpf_smc.c
> > >
> > > diff --git a/tools/testing/selftests/bpf/config b/tools/testing/selftests/bpf/config
> > > index c378d5d07e02..fac2f2a9d02f 100644
> > > --- a/tools/testing/selftests/bpf/config
> > > +++ b/tools/testing/selftests/bpf/config
> > > @@ -113,3 +113,7 @@ CONFIG_XDP_SOCKETS=y
> > > +};
> > Tested this selftest with patches applied on powerpc.
> >
> > #./test_progs -t bpf_smc
> >
> > net.smc.ops = linkcheck
> > #27/1 bpf_smc/topo:OK
> > #27 bpf_smc:OK
> > Summary: 1/1 PASSED, 0 SKIPPED, 0 FAILED
> >
> > Tested-by: Saket Kumar Bhaskar <skb99@linux.ibm.com>
> >
> > Thanks,
> > Saket
>
> Hi Saket,
>
> Thanks for your testing. I hope you don't mind if I add your test-by in
> the next version.
>
> Best wishes,
> D. Wythe
>
Fine for me Wythe.
Thanks,
Saket
> > > --
> > > 2.45.0
> > >
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH bpf-next v6 4/5] libbpf: fix error when st-prefix_ops and ops from differ btf
2025-01-22 2:43 ` D. Wythe
@ 2025-01-22 17:16 ` Andrii Nakryiko
0 siblings, 0 replies; 17+ messages in thread
From: Andrii Nakryiko @ 2025-01-22 17:16 UTC (permalink / raw)
To: D. Wythe
Cc: kgraul, wenjia, jaka, ast, daniel, andrii, martin.lau, pabeni,
song, sdf, haoluo, yhs, edumazet, john.fastabend, kpsingh, jolsa,
guwen, kuba, davem, netdev, linux-s390, linux-rdma, bpf
On Tue, Jan 21, 2025 at 6:43 PM D. Wythe <alibuda@linux.alibaba.com> wrote:
>
> On Fri, Jan 17, 2025 at 10:36:50AM -0800, Andrii Nakryiko wrote:
> > On Wed, Jan 15, 2025 at 11:45 PM D. Wythe <alibuda@linux.alibaba.com> wrote:
> > >
> > > When a struct_ops named xxx_ops was registered by a module, and
> > > it will be used in both built-in modules and the module itself,
> > > so that the btf_type of xxx_ops will be present in btf_vmlinux
> > > instead of in btf_mod, which means that the btf_type of
> > > bpf_struct_ops_xxx_ops and xxx_ops will not be in the same btf.
> > >
> > > Here are four possible case:
> > >
> > > +--------+---------------+-------------+---------------------------------+
> > > | | st_ops_xxx_ops| xxx_ops | |
> > > +--------+---------------+-------------+---------------------------------+
> > > | case 0 | btf_vmlinux | bft_vmlinux | be used and reg only in vmlinux |
> > > +--------+---------------+-------------+---------------------------------+
> > > | case 1 | btf_vmlinux | bpf_mod | INVALID |
> > > +--------+---------------+-------------+---------------------------------+
> > > | case 2 | btf_mod | btf_vmlinux | reg in mod but be used both in |
> > > | | | | vmlinux and mod. |
> > > +--------+---------------+-------------+---------------------------------+
> > > | case 3 | btf_mod | btf_mod | be used and reg only in mod |
> > > +--------+---------------+-------------+---------------------------------+
> > >
> > > At present, cases 0, 1, and 3 can be correctly identified, because
> > > + if (ret < 0 || ret >= sizeof(stname))
> > > + return -ENAMETOOLONG;
> >
> > see preexisting snprintf() above, we don't really handle truncation
> > errors explicitly, they are extremely unlikely and not expected at
> > all, and worst case nothing will be found and user will get some
> > -ENOENT or something like that eventually. I'd drop this extra error
> > checking and keep it streamlines, similar to tname
> >
>
> Sounds reasonable to me. I will remove the explicit error checks in the
> next version.
>
> > > +
> > > + /* Look for the corresponding "map_value" type that will be used
> > > + * in map_update(BPF_MAP_TYPE_STRUCT_OPS) first, figure out the btf
> > > + * and the mod_btf.
> > > + * For example, find "struct bpf_struct_ops_tcp_congestion_ops".
> > > + */
> > > + kern_vtype_id = find_ksym_btf_id(obj, stname, BTF_KIND_STRUCT,
> > > &btf, mod_btf);
> >
> > nit: if this fits under 100 characters, keep it single line
> >
> > > + if (kern_vtype_id < 0) {
> > > + pr_warn("struct_ops init_kern: struct %s is not found in kernel BTF\n",
> > > + stname);
> >
> > same nit about preserving single-line statements as much as possible,
> > they are much easier to read
>
> None of them exceed 100 lines. Usually, I would check patches with 85 lines limitations,
> but since 100 lines is acceptable, we can modify it to a single line here for
> better readability.
>
> And thanks very much for your suggestion, I plan to fix these style
> issues in next versions with you ack, is this okay for you?
yep, sgtm
>
> Best wishes,
> D. Wythe
> >
> > > + return kern_vtype_id;
> > > + }
> > > + kern_vtype = btf__type_by_id(btf, kern_vtype_id);
> > > +
> > > + kern_type_id = btf__find_by_name_kind(btf, tname, BTF_KIND_STRUCT);
> > > if (kern_type_id < 0) {
> > > pr_warn("struct_ops init_kern: struct %s is not found in kernel BTF\n",
> > > tname);
> > > @@ -1020,20 +1039,6 @@ find_struct_ops_kern_types(struct bpf_object *obj, const char *tname_raw,
> > > }
> > > kern_type = btf__type_by_id(btf, kern_type_id);
> > >
> > > - /* Find the corresponding "map_value" type that will be used
> > > - * in map_update(BPF_MAP_TYPE_STRUCT_OPS). For example,
> > > - * find "struct bpf_struct_ops_tcp_congestion_ops" from the
> > > - * btf_vmlinux.
> > > - */
> > > - kern_vtype_id = find_btf_by_prefix_kind(btf, STRUCT_OPS_VALUE_PREFIX,
> > > - tname, BTF_KIND_STRUCT);
> > > - if (kern_vtype_id < 0) {
> > > - pr_warn("struct_ops init_kern: struct %s%s is not found in kernel BTF\n",
> > > - STRUCT_OPS_VALUE_PREFIX, tname);
> > > - return kern_vtype_id;
> > > - }
> > > - kern_vtype = btf__type_by_id(btf, kern_vtype_id);
> > > -
> > > /* Find "struct tcp_congestion_ops" from
> > > * struct bpf_struct_ops_tcp_congestion_ops {
> > > * [ ... ]
> > > @@ -1046,8 +1051,8 @@ find_struct_ops_kern_types(struct bpf_object *obj, const char *tname_raw,
> > > break;
> > > }
> > > if (i == btf_vlen(kern_vtype)) {
> > > - pr_warn("struct_ops init_kern: struct %s data is not found in struct %s%s\n",
> > > - tname, STRUCT_OPS_VALUE_PREFIX, tname);
> > > + pr_warn("struct_ops init_kern: struct %s data is not found in struct %s\n",
> > > + tname, stname);
> > > return -EINVAL;
> > > }
> > >
> > > --
> > > 2.45.0
> > >
^ permalink raw reply [flat|nested] 17+ messages in thread
end of thread, other threads:[~2025-01-22 17:17 UTC | newest]
Thread overview: 17+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-01-16 7:44 [PATCH bpf-next v6 0/5] net/smc: Introduce smc_ops D. Wythe
2025-01-16 7:44 ` [PATCH bpf-next v6 1/5] bpf: export necessary sympols for modules with struct_ops D. Wythe
2025-01-16 7:44 ` [PATCH bpf-next v6 2/5] net/smc: Introduce generic hook smc_ops D. Wythe
2025-01-16 12:22 ` Dust Li
2025-01-17 23:50 ` Martin KaFai Lau
2025-01-22 4:35 ` D. Wythe
2025-01-16 7:44 ` [PATCH bpf-next v6 3/5] net/smc: bpf: register smc_ops info struct_ops D. Wythe
2025-01-16 7:44 ` [PATCH bpf-next v6 4/5] libbpf: fix error when st-prefix_ops and ops from differ btf D. Wythe
2025-01-17 18:36 ` Andrii Nakryiko
2025-01-22 2:43 ` D. Wythe
2025-01-22 17:16 ` Andrii Nakryiko
2025-01-16 7:44 ` [PATCH bpf-next v6 5/5] bpf/selftests: add selftest for bpf_smc_ops D. Wythe
2025-01-18 0:37 ` Martin KaFai Lau
2025-01-22 1:51 ` D. Wythe
2025-01-21 6:42 ` Saket Kumar Bhaskar
2025-01-22 2:46 ` D. Wythe
2025-01-22 4:39 ` Saket Kumar Bhaskar
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).