* [PATCH net-next v5 0/3] Introduce IPPROTO_SMC
@ 2024-05-30 9:30 D. Wythe
2024-05-30 9:30 ` [PATCH net-next v5 1/3] net/smc: refactoring initialization of smc sock D. Wythe
` (3 more replies)
0 siblings, 4 replies; 12+ messages in thread
From: D. Wythe @ 2024-05-30 9:30 UTC (permalink / raw)
To: kgraul, wenjia, jaka, wintera, guwen
Cc: kuba, davem, netdev, linux-s390, linux-rdma, tonylu, pabeni,
edumazet
From: "D. Wythe" <alibuda@linux.alibaba.com>
This patch allows to create smc socket via AF_INET,
similar to the following code,
/* create v4 smc sock */
v4 = socket(AF_INET, SOCK_STREAM, IPPROTO_SMC);
/* create v6 smc sock */
v6 = socket(AF_INET6, SOCK_STREAM, IPPROTO_SMC);
There are several reasons why we believe it is appropriate here:
1. For smc sockets, it actually use IPv4 (AF-INET) or IPv6 (AF-INET6)
address. There is no AF_SMC address at all.
2. Create smc socket in the AF_INET(6) path, which allows us to reuse
the infrastructure of AF_INET(6) path, such as common ebpf hooks.
Otherwise, smc have to implement it again in AF_SMC path. Such as:
1. Replace IPPROTO_TCP with IPPROTO_SMC in the socket() syscall
initiated by the user, without the use of LD-PRELOAD.
2. Select whether immediate fallback is required based on peer's port/ip
before connect().
A very significant result is that we can now use eBPF to implement smc_run
instead of LD_PRELOAD, who is completely ineffective in scenarios of static
linking.
Another potential value is that we are attempting to optimize the
performance of fallback socks, where merging socks is an important part,
and it relies on the creation of SMC sockets under the AF_INET path.
(More information :
https://lore.kernel.org/netdev/1699442703-25015-1-git-send-email-alibuda@linux.alibaba.com/T/)
v2 -> v1:
- Code formatting, mainly including alignment and annotation repair.
- move inet_smc proto ops to inet_smc.c, avoiding af_smc.c becoming too bulky.
- Fix the issue where refactoring affects the initialization order.
- Fix compile warning (unused out_inet_prot) while CONFIG_IPV6 was not set.
v3 -> v2:
- Add Alibaba's copyright information to the newfile
v4 -> v3:
- Fix some spelling errors
- Align function naming style with smc_sock_init() to smc_sk_init()
- Reversing the order of the conditional checks on clcsock to make the code more intuitive
v5 -> v4:
- Fix some spelling errors
- Added comment, "/* CONFIG_IPV6 */", after the final #endif directive.
- Rename smc_inet.h and smc_inet.c to smc_inet.h and smc_inet.c
- Encapsulate the initialization and destruction of inet_smc in inet_smc.c,
rather than implementing it directly in af_smc.c.
- Remove useless header files in smc_inet.h
- Make smc_inet_prot_xxx and smc_inet_sock_init() to be static, since it's
only used in smc_inet.c
D. Wythe (3):
net/smc: refactoring initialization of smc sock
net/smc: expose smc proto operations
net/smc: Introduce IPPROTO_SMC
include/uapi/linux/in.h | 2 +
net/smc/Makefile | 2 +-
net/smc/af_smc.c | 162 +++++++++++++++++++++++++--------------------
net/smc/smc.h | 38 +++++++++++
net/smc/smc_inet.c | 170 ++++++++++++++++++++++++++++++++++++++++++++++++
net/smc/smc_inet.h | 22 +++++++
6 files changed, 325 insertions(+), 71 deletions(-)
create mode 100644 net/smc/smc_inet.c
create mode 100644 net/smc/smc_inet.h
--
1.8.3.1
^ permalink raw reply [flat|nested] 12+ messages in thread
* [PATCH net-next v5 1/3] net/smc: refactoring initialization of smc sock
2024-05-30 9:30 [PATCH net-next v5 0/3] Introduce IPPROTO_SMC D. Wythe
@ 2024-05-30 9:30 ` D. Wythe
2024-05-30 9:30 ` [PATCH net-next v5 2/3] net/smc: expose smc proto operations D. Wythe
` (2 subsequent siblings)
3 siblings, 0 replies; 12+ messages in thread
From: D. Wythe @ 2024-05-30 9:30 UTC (permalink / raw)
To: kgraul, wenjia, jaka, wintera, guwen
Cc: kuba, davem, netdev, linux-s390, linux-rdma, tonylu, pabeni,
edumazet
From: "D. Wythe" <alibuda@linux.alibaba.com>
This patch aims to isolate the shared components of SMC socket
allocation by introducing smc_sk_init() for sock initialization
and __smc_create_clcsk() for the initialization of clcsock.
This is in preparation for the subsequent implementation of the
AF_INET version of SMC.
Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
Reviewed-by: Tony Lu <tonylu@linux.alibaba.com>
---
net/smc/af_smc.c | 86 +++++++++++++++++++++++++++++++-------------------------
net/smc/smc.h | 5 ++++
2 files changed, 53 insertions(+), 38 deletions(-)
diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
index e50a286..77a9d58 100644
--- a/net/smc/af_smc.c
+++ b/net/smc/af_smc.c
@@ -361,25 +361,15 @@ static void smc_destruct(struct sock *sk)
return;
}
-static struct sock *smc_sock_alloc(struct net *net, struct socket *sock,
- int protocol)
+void smc_sk_init(struct net *net, struct sock *sk, int protocol)
{
- struct smc_sock *smc;
- struct proto *prot;
- struct sock *sk;
-
- prot = (protocol == SMCPROTO_SMC6) ? &smc_proto6 : &smc_proto;
- sk = sk_alloc(net, PF_SMC, GFP_KERNEL, prot, 0);
- if (!sk)
- return NULL;
+ struct smc_sock *smc = smc_sk(sk);
- sock_init_data(sock, sk); /* sets sk_refcnt to 1 */
sk->sk_state = SMC_INIT;
sk->sk_destruct = smc_destruct;
sk->sk_protocol = protocol;
WRITE_ONCE(sk->sk_sndbuf, 2 * READ_ONCE(net->smc.sysctl_wmem));
WRITE_ONCE(sk->sk_rcvbuf, 2 * READ_ONCE(net->smc.sysctl_rmem));
- smc = smc_sk(sk);
INIT_WORK(&smc->tcp_listen_work, smc_tcp_listen_work);
INIT_WORK(&smc->connect_work, smc_connect_work);
INIT_DELAYED_WORK(&smc->conn.tx_work, smc_tx_work);
@@ -389,6 +379,24 @@ static struct sock *smc_sock_alloc(struct net *net, struct socket *sock,
sk->sk_prot->hash(sk);
mutex_init(&smc->clcsock_release_lock);
smc_init_saved_callbacks(smc);
+ smc->limit_smc_hs = net->smc.limit_smc_hs;
+ smc->use_fallback = false; /* assume rdma capability first */
+ smc->fallback_rsn = 0;
+}
+
+static struct sock *smc_sock_alloc(struct net *net, struct socket *sock,
+ int protocol)
+{
+ struct proto *prot;
+ struct sock *sk;
+
+ prot = (protocol == SMCPROTO_SMC6) ? &smc_proto6 : &smc_proto;
+ sk = sk_alloc(net, PF_SMC, GFP_KERNEL, prot, 0);
+ if (!sk)
+ return NULL;
+
+ sock_init_data(sock, sk); /* sets sk_refcnt to 1 */
+ smc_sk_init(net, sk, protocol);
return sk;
}
@@ -3321,6 +3329,31 @@ static ssize_t smc_splice_read(struct socket *sock, loff_t *ppos,
.splice_read = smc_splice_read,
};
+int smc_create_clcsk(struct net *net, struct sock *sk, int family)
+{
+ struct smc_sock *smc = smc_sk(sk);
+ int rc;
+
+ rc = sock_create_kern(net, family, SOCK_STREAM, IPPROTO_TCP,
+ &smc->clcsock);
+ if (rc) {
+ sk_common_release(sk);
+ return rc;
+ }
+
+ /* smc_clcsock_release() does not wait smc->clcsock->sk's
+ * destruction; its sk_state might not be TCP_CLOSE after
+ * smc->sk is close()d, and TCP timers can be fired later,
+ * which need net ref.
+ */
+ sk = smc->clcsock->sk;
+ __netns_tracker_free(net, &sk->ns_tracker, false);
+ sk->sk_net_refcnt = 1;
+ get_net_track(net, &sk->ns_tracker, GFP_KERNEL);
+ sock_inuse_add(net, 1);
+ return 0;
+}
+
static int __smc_create(struct net *net, struct socket *sock, int protocol,
int kern, struct socket *clcsock)
{
@@ -3346,35 +3379,12 @@ static int __smc_create(struct net *net, struct socket *sock, int protocol,
/* create internal TCP socket for CLC handshake and fallback */
smc = smc_sk(sk);
- smc->use_fallback = false; /* assume rdma capability first */
- smc->fallback_rsn = 0;
-
- /* default behavior from limit_smc_hs in every net namespace */
- smc->limit_smc_hs = net->smc.limit_smc_hs;
rc = 0;
- if (!clcsock) {
- rc = sock_create_kern(net, family, SOCK_STREAM, IPPROTO_TCP,
- &smc->clcsock);
- if (rc) {
- sk_common_release(sk);
- goto out;
- }
-
- /* smc_clcsock_release() does not wait smc->clcsock->sk's
- * destruction; its sk_state might not be TCP_CLOSE after
- * smc->sk is close()d, and TCP timers can be fired later,
- * which need net ref.
- */
- sk = smc->clcsock->sk;
- __netns_tracker_free(net, &sk->ns_tracker, false);
- sk->sk_net_refcnt = 1;
- get_net_track(net, &sk->ns_tracker, GFP_KERNEL);
- sock_inuse_add(net, 1);
- } else {
+ if (clcsock)
smc->clcsock = clcsock;
- }
-
+ else
+ rc = smc_create_clcsk(net, sk, family);
out:
return rc;
}
diff --git a/net/smc/smc.h b/net/smc/smc.h
index 18c8b78..3edec1e 100644
--- a/net/smc/smc.h
+++ b/net/smc/smc.h
@@ -34,6 +34,11 @@
extern struct proto smc_proto;
extern struct proto smc_proto6;
+/* smc sock initialization */
+void smc_sk_init(struct net *net, struct sock *sk, int protocol);
+/* clcsock initialization */
+int smc_create_clcsk(struct net *net, struct sock *sk, int family);
+
#ifdef ATOMIC64_INIT
#define KERNEL_HAS_ATOMIC64
#endif
--
1.8.3.1
^ permalink raw reply related [flat|nested] 12+ messages in thread
* [PATCH net-next v5 2/3] net/smc: expose smc proto operations
2024-05-30 9:30 [PATCH net-next v5 0/3] Introduce IPPROTO_SMC D. Wythe
2024-05-30 9:30 ` [PATCH net-next v5 1/3] net/smc: refactoring initialization of smc sock D. Wythe
@ 2024-05-30 9:30 ` D. Wythe
2024-05-30 9:30 ` [PATCH net-next v5 3/3] net/smc: Introduce IPPROTO_SMC D. Wythe
2024-05-30 10:14 ` [PATCH net-next v5 0/3] " D. Wythe
3 siblings, 0 replies; 12+ messages in thread
From: D. Wythe @ 2024-05-30 9:30 UTC (permalink / raw)
To: kgraul, wenjia, jaka, wintera, guwen
Cc: kuba, davem, netdev, linux-s390, linux-rdma, tonylu, pabeni,
edumazet
From: "D. Wythe" <alibuda@linux.alibaba.com>
Externalize smc proto operations (smc_xxx) to allow
access from files other than af_smc.c
This is in preparation for the subsequent implementation
of the AF_INET version of SMC.
Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
---
net/smc/af_smc.c | 60 ++++++++++++++++++++++++++++----------------------------
net/smc/smc.h | 33 +++++++++++++++++++++++++++++++
2 files changed, 63 insertions(+), 30 deletions(-)
diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
index 77a9d58..8e3ce76 100644
--- a/net/smc/af_smc.c
+++ b/net/smc/af_smc.c
@@ -170,15 +170,15 @@ static bool smc_hs_congested(const struct sock *sk)
return false;
}
-static struct smc_hashinfo smc_v4_hashinfo = {
+struct smc_hashinfo smc_v4_hashinfo = {
.lock = __RW_LOCK_UNLOCKED(smc_v4_hashinfo.lock),
};
-static struct smc_hashinfo smc_v6_hashinfo = {
+struct smc_hashinfo smc_v6_hashinfo = {
.lock = __RW_LOCK_UNLOCKED(smc_v6_hashinfo.lock),
};
-static int smc_hash_sk(struct sock *sk)
+int smc_hash_sk(struct sock *sk)
{
struct smc_hashinfo *h = sk->sk_prot->h.smc_hash;
struct hlist_head *head;
@@ -193,7 +193,7 @@ static int smc_hash_sk(struct sock *sk)
return 0;
}
-static void smc_unhash_sk(struct sock *sk)
+void smc_unhash_sk(struct sock *sk)
{
struct smc_hashinfo *h = sk->sk_prot->h.smc_hash;
@@ -207,7 +207,7 @@ static void smc_unhash_sk(struct sock *sk)
* work which we didn't do because of user hold the sock_lock in the
* BH context
*/
-static void smc_release_cb(struct sock *sk)
+void smc_release_cb(struct sock *sk)
{
struct smc_sock *smc = smc_sk(sk);
@@ -307,7 +307,7 @@ static int __smc_release(struct smc_sock *smc)
return rc;
}
-static int smc_release(struct socket *sock)
+int smc_release(struct socket *sock)
{
struct sock *sk = sock->sk;
struct smc_sock *smc;
@@ -401,8 +401,8 @@ static struct sock *smc_sock_alloc(struct net *net, struct socket *sock,
return sk;
}
-static int smc_bind(struct socket *sock, struct sockaddr *uaddr,
- int addr_len)
+int smc_bind(struct socket *sock, struct sockaddr *uaddr,
+ int addr_len)
{
struct sockaddr_in *addr = (struct sockaddr_in *)uaddr;
struct sock *sk = sock->sk;
@@ -1649,8 +1649,8 @@ static void smc_connect_work(struct work_struct *work)
release_sock(&smc->sk);
}
-static int smc_connect(struct socket *sock, struct sockaddr *addr,
- int alen, int flags)
+int smc_connect(struct socket *sock, struct sockaddr *addr,
+ int alen, int flags)
{
struct sock *sk = sock->sk;
struct smc_sock *smc;
@@ -2631,7 +2631,7 @@ static void smc_clcsock_data_ready(struct sock *listen_clcsock)
read_unlock_bh(&listen_clcsock->sk_callback_lock);
}
-static int smc_listen(struct socket *sock, int backlog)
+int smc_listen(struct socket *sock, int backlog)
{
struct sock *sk = sock->sk;
struct smc_sock *smc;
@@ -2696,8 +2696,8 @@ static int smc_listen(struct socket *sock, int backlog)
return rc;
}
-static int smc_accept(struct socket *sock, struct socket *new_sock,
- struct proto_accept_arg *arg)
+int smc_accept(struct socket *sock, struct socket *new_sock,
+ struct proto_accept_arg *arg)
{
struct sock *sk = sock->sk, *nsk;
DECLARE_WAITQUEUE(wait, current);
@@ -2766,8 +2766,8 @@ static int smc_accept(struct socket *sock, struct socket *new_sock,
return rc;
}
-static int smc_getname(struct socket *sock, struct sockaddr *addr,
- int peer)
+int smc_getname(struct socket *sock, struct sockaddr *addr,
+ int peer)
{
struct smc_sock *smc;
@@ -2780,7 +2780,7 @@ static int smc_getname(struct socket *sock, struct sockaddr *addr,
return smc->clcsock->ops->getname(smc->clcsock, addr, peer);
}
-static int smc_sendmsg(struct socket *sock, struct msghdr *msg, size_t len)
+int smc_sendmsg(struct socket *sock, struct msghdr *msg, size_t len)
{
struct sock *sk = sock->sk;
struct smc_sock *smc;
@@ -2818,8 +2818,8 @@ static int smc_sendmsg(struct socket *sock, struct msghdr *msg, size_t len)
return rc;
}
-static int smc_recvmsg(struct socket *sock, struct msghdr *msg, size_t len,
- int flags)
+int smc_recvmsg(struct socket *sock, struct msghdr *msg, size_t len,
+ int flags)
{
struct sock *sk = sock->sk;
struct smc_sock *smc;
@@ -2868,8 +2868,8 @@ static __poll_t smc_accept_poll(struct sock *parent)
return mask;
}
-static __poll_t smc_poll(struct file *file, struct socket *sock,
- poll_table *wait)
+__poll_t smc_poll(struct file *file, struct socket *sock,
+ poll_table *wait)
{
struct sock *sk = sock->sk;
struct smc_sock *smc;
@@ -2921,7 +2921,7 @@ static __poll_t smc_poll(struct file *file, struct socket *sock,
return mask;
}
-static int smc_shutdown(struct socket *sock, int how)
+int smc_shutdown(struct socket *sock, int how)
{
struct sock *sk = sock->sk;
bool do_shutdown = true;
@@ -3061,8 +3061,8 @@ static int __smc_setsockopt(struct socket *sock, int level, int optname,
return rc;
}
-static int smc_setsockopt(struct socket *sock, int level, int optname,
- sockptr_t optval, unsigned int optlen)
+int smc_setsockopt(struct socket *sock, int level, int optname,
+ sockptr_t optval, unsigned int optlen)
{
struct sock *sk = sock->sk;
struct smc_sock *smc;
@@ -3148,8 +3148,8 @@ static int smc_setsockopt(struct socket *sock, int level, int optname,
return rc;
}
-static int smc_getsockopt(struct socket *sock, int level, int optname,
- char __user *optval, int __user *optlen)
+int smc_getsockopt(struct socket *sock, int level, int optname,
+ char __user *optval, int __user *optlen)
{
struct smc_sock *smc;
int rc;
@@ -3174,8 +3174,8 @@ static int smc_getsockopt(struct socket *sock, int level, int optname,
return rc;
}
-static int smc_ioctl(struct socket *sock, unsigned int cmd,
- unsigned long arg)
+int smc_ioctl(struct socket *sock, unsigned int cmd,
+ unsigned long arg)
{
union smc_host_cursor cons, urg;
struct smc_connection *conn;
@@ -3261,9 +3261,9 @@ static int smc_ioctl(struct socket *sock, unsigned int cmd,
* Note that subsequent recv() calls have to wait till all splice() processing
* completed.
*/
-static ssize_t smc_splice_read(struct socket *sock, loff_t *ppos,
- struct pipe_inode_info *pipe, size_t len,
- unsigned int flags)
+ssize_t smc_splice_read(struct socket *sock, loff_t *ppos,
+ struct pipe_inode_info *pipe, size_t len,
+ unsigned int flags)
{
struct sock *sk = sock->sk;
struct smc_sock *smc;
diff --git a/net/smc/smc.h b/net/smc/smc.h
index 3edec1e..34b781e 100644
--- a/net/smc/smc.h
+++ b/net/smc/smc.h
@@ -34,6 +34,39 @@
extern struct proto smc_proto;
extern struct proto smc_proto6;
+extern struct smc_hashinfo smc_v4_hashinfo;
+extern struct smc_hashinfo smc_v6_hashinfo;
+
+int smc_hash_sk(struct sock *sk);
+void smc_unhash_sk(struct sock *sk);
+void smc_release_cb(struct sock *sk);
+
+int smc_release(struct socket *sock);
+int smc_bind(struct socket *sock, struct sockaddr *uaddr,
+ int addr_len);
+int smc_connect(struct socket *sock, struct sockaddr *addr,
+ int alen, int flags);
+int smc_accept(struct socket *sock, struct socket *new_sock,
+ struct proto_accept_arg *arg);
+int smc_getname(struct socket *sock, struct sockaddr *addr,
+ int peer);
+__poll_t smc_poll(struct file *file, struct socket *sock,
+ poll_table *wait);
+int smc_ioctl(struct socket *sock, unsigned int cmd,
+ unsigned long arg);
+int smc_listen(struct socket *sock, int backlog);
+int smc_shutdown(struct socket *sock, int how);
+int smc_setsockopt(struct socket *sock, int level, int optname,
+ sockptr_t optval, unsigned int optlen);
+int smc_getsockopt(struct socket *sock, int level, int optname,
+ char __user *optval, int __user *optlen);
+int smc_sendmsg(struct socket *sock, struct msghdr *msg, size_t len);
+int smc_recvmsg(struct socket *sock, struct msghdr *msg, size_t len,
+ int flags);
+ssize_t smc_splice_read(struct socket *sock, loff_t *ppos,
+ struct pipe_inode_info *pipe, size_t len,
+ unsigned int flags);
+
/* smc sock initialization */
void smc_sk_init(struct net *net, struct sock *sk, int protocol);
/* clcsock initialization */
--
1.8.3.1
^ permalink raw reply related [flat|nested] 12+ messages in thread
* [PATCH net-next v5 3/3] net/smc: Introduce IPPROTO_SMC
2024-05-30 9:30 [PATCH net-next v5 0/3] Introduce IPPROTO_SMC D. Wythe
2024-05-30 9:30 ` [PATCH net-next v5 1/3] net/smc: refactoring initialization of smc sock D. Wythe
2024-05-30 9:30 ` [PATCH net-next v5 2/3] net/smc: expose smc proto operations D. Wythe
@ 2024-05-30 9:30 ` D. Wythe
2024-06-03 3:49 ` Ratheesh Kannoth
2024-05-30 10:14 ` [PATCH net-next v5 0/3] " D. Wythe
3 siblings, 1 reply; 12+ messages in thread
From: D. Wythe @ 2024-05-30 9:30 UTC (permalink / raw)
To: kgraul, wenjia, jaka, wintera, guwen
Cc: kuba, davem, netdev, linux-s390, linux-rdma, tonylu, pabeni,
edumazet
From: "D. Wythe" <alibuda@linux.alibaba.com>
This patch allows to create smc socket via AF_INET,
similar to the following code,
/* create v4 smc sock */
v4 = socket(AF_INET, SOCK_STREAM, IPPROTO_SMC);
/* create v6 smc sock */
v6 = socket(AF_INET6, SOCK_STREAM, IPPROTO_SMC);
There are several reasons why we believe it is appropriate here:
1. For smc sockets, it actually use IPv4 (AF-INET) or IPv6 (AF-INET6)
address. There is no AF_SMC address at all.
2. Create smc socket in the AF_INET(6) path, which allows us to reuse
the infrastructure of AF_INET(6) path, such as common ebpf hooks.
Otherwise, smc have to implement it again in AF_SMC path.
Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
---
include/uapi/linux/in.h | 2 +
net/smc/Makefile | 2 +-
net/smc/af_smc.c | 16 ++++-
net/smc/smc_inet.c | 170 ++++++++++++++++++++++++++++++++++++++++++++++++
net/smc/smc_inet.h | 22 +++++++
5 files changed, 209 insertions(+), 3 deletions(-)
create mode 100644 net/smc/smc_inet.c
create mode 100644 net/smc/smc_inet.h
diff --git a/include/uapi/linux/in.h b/include/uapi/linux/in.h
index e682ab6..0c6322b 100644
--- a/include/uapi/linux/in.h
+++ b/include/uapi/linux/in.h
@@ -83,6 +83,8 @@ enum {
#define IPPROTO_RAW IPPROTO_RAW
IPPROTO_MPTCP = 262, /* Multipath TCP connection */
#define IPPROTO_MPTCP IPPROTO_MPTCP
+ IPPROTO_SMC = 263, /* Shared Memory Communications */
+#define IPPROTO_SMC IPPROTO_SMC
IPPROTO_MAX
};
#endif
diff --git a/net/smc/Makefile b/net/smc/Makefile
index 2c510d54..60f1c87 100644
--- a/net/smc/Makefile
+++ b/net/smc/Makefile
@@ -4,6 +4,6 @@ obj-$(CONFIG_SMC) += smc.o
obj-$(CONFIG_SMC_DIAG) += smc_diag.o
smc-y := af_smc.o smc_pnet.o smc_ib.o smc_clc.o smc_core.o smc_wr.o smc_llc.o
smc-y += smc_cdc.o smc_tx.o smc_rx.o smc_close.o smc_ism.o smc_netlink.o smc_stats.o
-smc-y += smc_tracepoint.o
+smc-y += smc_tracepoint.o smc_inet.o
smc-$(CONFIG_SYSCTL) += smc_sysctl.o
smc-$(CONFIG_SMC_LO) += smc_loopback.o
diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
index 8e3ce76..743c27e 100644
--- a/net/smc/af_smc.c
+++ b/net/smc/af_smc.c
@@ -54,6 +54,7 @@
#include "smc_tracepoint.h"
#include "smc_sysctl.h"
#include "smc_loopback.h"
+#include "smc_inet.h"
static DEFINE_MUTEX(smc_server_lgr_pending); /* serialize link group
* creation on server
@@ -3593,10 +3594,15 @@ static int __init smc_init(void)
pr_err("%s: tcp_ulp_register fails with %d\n", __func__, rc);
goto out_lo;
}
-
+ rc = smc_inet_init();
+ if (rc) {
+ pr_err("%s: smc_inet_init fails with %d\n", __func__, rc);
+ goto out_ulp;
+ }
static_branch_enable(&tcp_have_smc);
return 0;
-
+out_ulp:
+ tcp_unregister_ulp(&smc_ulp_ops);
out_lo:
smc_loopback_exit();
out_ib:
@@ -3633,6 +3639,7 @@ static int __init smc_init(void)
static void __exit smc_exit(void)
{
static_branch_disable(&tcp_have_smc);
+ smc_inet_exit();
tcp_unregister_ulp(&smc_ulp_ops);
sock_unregister(PF_SMC);
smc_core_exit();
@@ -3660,4 +3667,9 @@ static void __exit smc_exit(void)
MODULE_LICENSE("GPL");
MODULE_ALIAS_NETPROTO(PF_SMC);
MODULE_ALIAS_TCP_ULP("smc");
+/* 263 for IPPROTO_SMC and 1 for SOCK_STREAM */
+MODULE_ALIAS_NET_PF_PROTO_TYPE(PF_INET, 263, 1);
+#if IS_ENABLED(CONFIG_IPV6)
+MODULE_ALIAS_NET_PF_PROTO_TYPE(PF_INET6, 263, 1);
+#endif /* CONFIG_IPV6 */
MODULE_ALIAS_GENL_FAMILY(SMC_GENL_FAMILY_NAME);
diff --git a/net/smc/smc_inet.c b/net/smc/smc_inet.c
new file mode 100644
index 00000000..72e4f61
--- /dev/null
+++ b/net/smc/smc_inet.c
@@ -0,0 +1,170 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Shared Memory Communications over RDMA (SMC-R) and RoCE
+ *
+ * Definitions for the IPPROTO_SMC (socket related)
+ *
+ * Copyright IBM Corp. 2016, 2018
+ * Copyright (c) 2024, Alibaba Inc.
+ *
+ * Author: D. Wythe <alibuda@linux.alibaba.com>
+ */
+
+#include <net/protocol.h>
+#include <net/sock.h>
+
+#include "smc_inet.h"
+#include "smc.h"
+
+static struct proto smc_inet_prot;
+static const struct proto_ops smc_inet_stream_ops;
+static struct inet_protosw smc_inet_protosw;
+
+#if IS_ENABLED(CONFIG_IPV6)
+static struct proto smc_inet6_prot;
+static const struct proto_ops smc_inet6_stream_ops;
+static struct inet_protosw smc_inet6_protosw;
+#endif /* CONFIG_IPV6 */
+
+static int smc_inet_init_sock(struct sock *sk);
+
+static struct proto smc_inet_prot = {
+ .name = "INET_SMC",
+ .owner = THIS_MODULE,
+ .init = smc_inet_init_sock,
+ .hash = smc_hash_sk,
+ .unhash = smc_unhash_sk,
+ .release_cb = smc_release_cb,
+ .obj_size = sizeof(struct smc_sock),
+ .h.smc_hash = &smc_v4_hashinfo,
+ .slab_flags = SLAB_TYPESAFE_BY_RCU,
+};
+
+static const struct proto_ops smc_inet_stream_ops = {
+ .family = PF_INET,
+ .owner = THIS_MODULE,
+ .release = smc_release,
+ .bind = smc_bind,
+ .connect = smc_connect,
+ .socketpair = sock_no_socketpair,
+ .accept = smc_accept,
+ .getname = smc_getname,
+ .poll = smc_poll,
+ .ioctl = smc_ioctl,
+ .listen = smc_listen,
+ .shutdown = smc_shutdown,
+ .setsockopt = smc_setsockopt,
+ .getsockopt = smc_getsockopt,
+ .sendmsg = smc_sendmsg,
+ .recvmsg = smc_recvmsg,
+ .mmap = sock_no_mmap,
+ .splice_read = smc_splice_read,
+};
+
+static struct inet_protosw smc_inet_protosw = {
+ .type = SOCK_STREAM,
+ .protocol = IPPROTO_SMC,
+ .prot = &smc_inet_prot,
+ .ops = &smc_inet_stream_ops,
+ .flags = INET_PROTOSW_ICSK,
+};
+
+#if IS_ENABLED(CONFIG_IPV6)
+static struct proto smc_inet6_prot = {
+ .name = "INET6_SMC",
+ .owner = THIS_MODULE,
+ .init = smc_inet_init_sock,
+ .hash = smc_hash_sk,
+ .unhash = smc_unhash_sk,
+ .release_cb = smc_release_cb,
+ .obj_size = sizeof(struct smc_sock),
+ .h.smc_hash = &smc_v6_hashinfo,
+ .slab_flags = SLAB_TYPESAFE_BY_RCU,
+};
+
+static const struct proto_ops smc_inet6_stream_ops = {
+ .family = PF_INET6,
+ .owner = THIS_MODULE,
+ .release = smc_release,
+ .bind = smc_bind,
+ .connect = smc_connect,
+ .socketpair = sock_no_socketpair,
+ .accept = smc_accept,
+ .getname = smc_getname,
+ .poll = smc_poll,
+ .ioctl = smc_ioctl,
+ .listen = smc_listen,
+ .shutdown = smc_shutdown,
+ .setsockopt = smc_setsockopt,
+ .getsockopt = smc_getsockopt,
+ .sendmsg = smc_sendmsg,
+ .recvmsg = smc_recvmsg,
+ .mmap = sock_no_mmap,
+ .splice_read = smc_splice_read,
+};
+
+static struct inet_protosw smc_inet6_protosw = {
+ .type = SOCK_STREAM,
+ .protocol = IPPROTO_SMC,
+ .prot = &smc_inet6_prot,
+ .ops = &smc_inet6_stream_ops,
+ .flags = INET_PROTOSW_ICSK,
+};
+#endif /* CONFIG_IPV6 */
+
+static int smc_inet_init_sock(struct sock *sk)
+{
+ struct net *net = sock_net(sk);
+
+ /* init common smc sock */
+ smc_sk_init(net, sk, IPPROTO_SMC);
+ /* create clcsock */
+ return smc_create_clcsk(net, sk, sk->sk_family);
+}
+
+int __init smc_inet_init(void)
+{
+ int rc;
+
+ rc = proto_register(&smc_inet_prot, 1);
+ if (rc) {
+ pr_err("%s: proto_register smc_inet_prot fails with %d\n", __func__, rc);
+ return rc;
+ }
+ /* no return value */
+ inet_register_protosw(&smc_inet_protosw);
+
+#if IS_ENABLED(CONFIG_IPV6)
+ rc = proto_register(&smc_inet6_prot, 1);
+ if (rc) {
+ pr_err("%s: proto_register smc_inet6_prot fails with %d\n", __func__, rc);
+ goto out_inet6_prot;
+ }
+ rc = inet6_register_protosw(&smc_inet6_protosw);
+ if (rc) {
+ pr_err("%s: inet6_register_protosw smc_inet6_protosw fails with %d\n",
+ __func__, rc);
+ goto out_inet6_protosw;
+ }
+#endif /* CONFIG_IPV6 */
+
+ return rc;
+#if IS_ENABLED(CONFIG_IPV6)
+out_inet6_protosw:
+ proto_unregister(&smc_inet6_prot);
+out_inet6_prot:
+ inet_unregister_protosw(&smc_inet_protosw);
+ proto_unregister(&smc_inet_prot);
+ return rc;
+#endif /* CONFIG_IPV6 */
+}
+
+void smc_inet_exit(void)
+{
+#if IS_ENABLED(CONFIG_IPV6)
+ inet6_unregister_protosw(&smc_inet6_protosw);
+ proto_unregister(&smc_inet6_prot);
+#endif /* CONFIG_IPV6 */
+ inet_unregister_protosw(&smc_inet_protosw);
+ proto_unregister(&smc_inet_prot);
+}
diff --git a/net/smc/smc_inet.h b/net/smc/smc_inet.h
new file mode 100644
index 00000000..a489c8a
--- /dev/null
+++ b/net/smc/smc_inet.h
@@ -0,0 +1,22 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Shared Memory Communications over RDMA (SMC-R) and RoCE
+ *
+ * Definitions for the IPPROTO_SMC (socket related)
+
+ * Copyright IBM Corp. 2016
+ * Copyright (c) 2024, Alibaba Inc.
+ *
+ * Author: D. Wythe <alibuda@linux.alibaba.com>
+ */
+#ifndef __INET_SMC
+#define __INET_SMC
+
+/* Initialize protocol registration on IPPROTO_SMC,
+ * @return 0 on success
+ */
+int smc_inet_init(void);
+
+void smc_inet_exit(void);
+
+#endif /* __INET_SMC */
--
1.8.3.1
^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: [PATCH net-next v5 0/3] Introduce IPPROTO_SMC
2024-05-30 9:30 [PATCH net-next v5 0/3] Introduce IPPROTO_SMC D. Wythe
` (2 preceding siblings ...)
2024-05-30 9:30 ` [PATCH net-next v5 3/3] net/smc: Introduce IPPROTO_SMC D. Wythe
@ 2024-05-30 10:14 ` D. Wythe
2024-05-31 8:06 ` Wenjia Zhang
2024-06-03 7:48 ` Niklas Schnelle
3 siblings, 2 replies; 12+ messages in thread
From: D. Wythe @ 2024-05-30 10:14 UTC (permalink / raw)
To: kgraul, wenjia, jaka, wintera, guwen
Cc: kuba, davem, netdev, linux-s390, linux-rdma, tonylu, pabeni,
edumazet
On 5/30/24 5:30 PM, D. Wythe wrote:
> From: "D. Wythe" <alibuda@linux.alibaba.com>
>
> This patch allows to create smc socket via AF_INET,
> similar to the following code,
>
> /* create v4 smc sock */
> v4 = socket(AF_INET, SOCK_STREAM, IPPROTO_SMC);
>
> /* create v6 smc sock */
> v6 = socket(AF_INET6, SOCK_STREAM, IPPROTO_SMC);
Welcome everyone to try out the eBPF based version of smc_run during
testing, I have added a separate command called smc_run.bpf,
it was equivalent to normal smc_run but with IPPROTO_SMC via eBPF.
You can obtain the code and more info from:
https://github.com/D-Wythe/smc-tools
Usage:
smc_run.bpf
An eBPF implemented smc_run based on IPPROTO_SMC:
1. Support to transparent replacement based on command (Just like smc_run).
2. Supprot to transparent replacement based on pid configuration. And
supports the inheritance of this capability between parent and child
processes.
3. Support to transparent replacement based on per netns configuration.
smc_run.bpf COMMAND
1. Equivalent to smc_run but with IPPROTO_SMC via eBPF
smc_run.bpf -p pid
1. Add the process with target pid to the map. Afterward, all socket()
calls of the process and its descendant processes will be replaced from
IPPROTO_TCP to IPPROTO_SMC.
2. Mapping will be automatically deleted when process exits.
3. Specifically, COMMAND mode is actually works like following:
smc_run.bpf -p $$
COMMAND
exit
smc_run.bpf -n 1
1. Make all socket() calls of the current netns to be replaced from
IPPROTO_TCP to IPPROTO_SMC.
2. Turn off it by smc_run.bpf -n 0
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH net-next v5 0/3] Introduce IPPROTO_SMC
2024-05-30 10:14 ` [PATCH net-next v5 0/3] " D. Wythe
@ 2024-05-31 8:06 ` Wenjia Zhang
2024-06-03 3:01 ` D. Wythe
2024-06-03 7:48 ` Niklas Schnelle
1 sibling, 1 reply; 12+ messages in thread
From: Wenjia Zhang @ 2024-05-31 8:06 UTC (permalink / raw)
To: D. Wythe, kgraul, jaka, wintera, guwen
Cc: kuba, davem, netdev, linux-s390, linux-rdma, tonylu, pabeni,
edumazet
On 30.05.24 12:14, D. Wythe wrote:
>
>
> On 5/30/24 5:30 PM, D. Wythe wrote:
>> From: "D. Wythe" <alibuda@linux.alibaba.com>
>>
>> This patch allows to create smc socket via AF_INET,
>> similar to the following code,
>>
>> /* create v4 smc sock */
>> v4 = socket(AF_INET, SOCK_STREAM, IPPROTO_SMC);
>>
>> /* create v6 smc sock */
>> v6 = socket(AF_INET6, SOCK_STREAM, IPPROTO_SMC);
>
> Welcome everyone to try out the eBPF based version of smc_run during
> testing, I have added a separate command called smc_run.bpf,
> it was equivalent to normal smc_run but with IPPROTO_SMC via eBPF.
>
> You can obtain the code and more info from:
> https://github.com/D-Wythe/smc-tools
>
> Usage:
>
> smc_run.bpf
> An eBPF implemented smc_run based on IPPROTO_SMC:
>
> 1. Support to transparent replacement based on command (Just like smc_run).
> 2. Supprot to transparent replacement based on pid configuration. And
> supports the inheritance of this capability between parent and child
> processes.
> 3. Support to transparent replacement based on per netns configuration.
>
> smc_run.bpf COMMAND
>
> 1. Equivalent to smc_run but with IPPROTO_SMC via eBPF
>
> smc_run.bpf -p pid
>
> 1. Add the process with target pid to the map. Afterward, all socket()
> calls of the process and its descendant processes will be replaced from
> IPPROTO_TCP to IPPROTO_SMC.
> 2. Mapping will be automatically deleted when process exits.
> 3. Specifically, COMMAND mode is actually works like following:
>
> smc_run.bpf -p $$
> COMMAND
> exit
>
> smc_run.bpf -n 1
>
> 1. Make all socket() calls of the current netns to be replaced from
> IPPROTO_TCP to IPPROTO_SMC.
> 2. Turn off it by smc_run.bpf -n 0
>
>
Hi D. Wythe,
Thank you for the info and description! The code generally looks good to
me, just still some details I need to check again. And I'd like to give
smc_run.bpf a try, and maybe let you know if it works for me next week.
Thanks,
Wenjia
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH net-next v5 0/3] Introduce IPPROTO_SMC
2024-05-31 8:06 ` Wenjia Zhang
@ 2024-06-03 3:01 ` D. Wythe
0 siblings, 0 replies; 12+ messages in thread
From: D. Wythe @ 2024-06-03 3:01 UTC (permalink / raw)
To: Wenjia Zhang, kgraul, jaka, wintera, guwen
Cc: kuba, davem, netdev, linux-s390, linux-rdma, tonylu, pabeni,
edumazet
On 5/31/24 4:06 PM, Wenjia Zhang wrote:
>
>
> On 30.05.24 12:14, D. Wythe wrote:
>>
>>
>> On 5/30/24 5:30 PM, D. Wythe wrote:
>>> From: "D. Wythe" <alibuda@linux.alibaba.com>
>>>
>>> This patch allows to create smc socket via AF_INET,
>>> similar to the following code,
>>>
>>> /* create v4 smc sock */
>>> v4 = socket(AF_INET, SOCK_STREAM, IPPROTO_SMC);
>>>
>>> /* create v6 smc sock */
>>> v6 = socket(AF_INET6, SOCK_STREAM, IPPROTO_SMC);
>>
>> Welcome everyone to try out the eBPF based version of smc_run during
>> testing, I have added a separate command called smc_run.bpf,
>> it was equivalent to normal smc_run but with IPPROTO_SMC via eBPF.
>>
>> You can obtain the code and more info from:
>> https://github.com/D-Wythe/smc-tools
>>
>> Usage:
>>
>> smc_run.bpf
>> An eBPF implemented smc_run based on IPPROTO_SMC:
>>
>> 1. Support to transparent replacement based on command (Just like
>> smc_run).
>> 2. Supprot to transparent replacement based on pid configuration. And
>> supports the inheritance of this capability between parent and child
>> processes.
>> 3. Support to transparent replacement based on per netns configuration.
>>
>> smc_run.bpf COMMAND
>>
>> 1. Equivalent to smc_run but with IPPROTO_SMC via eBPF
>>
>> smc_run.bpf -p pid
>>
>> 1. Add the process with target pid to the map. Afterward, all
>> socket() calls of the process and its descendant processes will be
>> replaced from IPPROTO_TCP to IPPROTO_SMC.
>> 2. Mapping will be automatically deleted when process exits.
>> 3. Specifically, COMMAND mode is actually works like following:
>>
>> smc_run.bpf -p $$
>> COMMAND
>> exit
>>
>> smc_run.bpf -n 1
>>
>> 1. Make all socket() calls of the current netns to be replaced from
>> IPPROTO_TCP to IPPROTO_SMC.
>> 2. Turn off it by smc_run.bpf -n 0
>>
>>
> Hi D. Wythe,
>
> Thank you for the info and description! The code generally looks good
> to me, just still some details I need to check again. And I'd like to
> give smc_run.bpf a try, and maybe let you know if it works for me next
> week.
>
> Thanks,
> Wenjia
Hi Wenjia,
That's okay to us. And if there are any issues regarding the use of
smc_run.bpf, please let me know.
Best wishes,
D. Wythe
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH net-next v5 3/3] net/smc: Introduce IPPROTO_SMC
2024-05-30 9:30 ` [PATCH net-next v5 3/3] net/smc: Introduce IPPROTO_SMC D. Wythe
@ 2024-06-03 3:49 ` Ratheesh Kannoth
2024-06-03 14:43 ` D. Wythe
0 siblings, 1 reply; 12+ messages in thread
From: Ratheesh Kannoth @ 2024-06-03 3:49 UTC (permalink / raw)
To: D. Wythe
Cc: kgraul, wenjia, jaka, wintera, guwen, kuba, davem, netdev,
linux-s390, linux-rdma, tonylu, pabeni, edumazet
On 2024-05-30 at 15:00:40, D. Wythe (alibuda@linux.alibaba.com) wrote:
> From: "D. Wythe" <alibuda@linux.alibaba.com>
>
> +
> +int __init smc_inet_init(void)
> +{
> + int rc;
> +
> + rc = proto_register(&smc_inet_prot, 1);
> + if (rc) {
> + pr_err("%s: proto_register smc_inet_prot fails with %d\n", __func__, rc);
> + return rc;
> + }
> + /* no return value */
> + inet_register_protosw(&smc_inet_protosw);
> +
> +#if IS_ENABLED(CONFIG_IPV6)
> + rc = proto_register(&smc_inet6_prot, 1);
> + if (rc) {
> + pr_err("%s: proto_register smc_inet6_prot fails with %d\n", __func__, rc);
> + goto out_inet6_prot;
> + }
> + rc = inet6_register_protosw(&smc_inet6_protosw);
> + if (rc) {
> + pr_err("%s: inet6_register_protosw smc_inet6_protosw fails with %d\n",
> + __func__, rc);
> + goto out_inet6_protosw;
> + }
> +#endif /* CONFIG_IPV6 */
> +
> + return rc;
> +#if IS_ENABLED(CONFIG_IPV6)
Can you combine this #if with above one ? Any way you need this only in case of ipv6.
Error handling with #if is an hindrance to a good readability.
> +out_inet6_protosw:
> + proto_unregister(&smc_inet6_prot);
> +out_inet6_prot:
> + inet_unregister_protosw(&smc_inet_protosw);
> + proto_unregister(&smc_inet_prot);
> + return rc;
> +#endif /* CONFIG_IPV6 */
> +}
> +
> +void smc_inet_exit(void)
> +{
> +#if IS_ENABLED(CONFIG_IPV6)
> + inet6_unregister_protosw(&smc_inet6_protosw);
> + proto_unregister(&smc_inet6_prot);
> +#endif /* CONFIG_IPV6 */
> + inet_unregister_protosw(&smc_inet_protosw);
> + proto_unregister(&smc_inet_prot);
> +}
> diff --git a/net/smc/smc_inet.h b/net/smc/smc_inet.h
> new file mode 100644
> index 00000000..a489c8a
> --- /dev/null
> +++ b/net/smc/smc_inet.h
> @@ -0,0 +1,22 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * Shared Memory Communications over RDMA (SMC-R) and RoCE
> + *
> + * Definitions for the IPPROTO_SMC (socket related)
> +
> + * Copyright IBM Corp. 2016
> + * Copyright (c) 2024, Alibaba Inc.
> + *
> + * Author: D. Wythe <alibuda@linux.alibaba.com>
> + */
> +#ifndef __INET_SMC
> +#define __INET_SMC
> +
> +/* Initialize protocol registration on IPPROTO_SMC,
> + * @return 0 on success
> + */
> +int smc_inet_init(void);
> +
> +void smc_inet_exit(void);
> +
> +#endif /* __INET_SMC */
> --
> 1.8.3.1
>
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH net-next v5 0/3] Introduce IPPROTO_SMC
2024-05-30 10:14 ` [PATCH net-next v5 0/3] " D. Wythe
2024-05-31 8:06 ` Wenjia Zhang
@ 2024-06-03 7:48 ` Niklas Schnelle
2024-06-03 15:07 ` D. Wythe
1 sibling, 1 reply; 12+ messages in thread
From: Niklas Schnelle @ 2024-06-03 7:48 UTC (permalink / raw)
To: D. Wythe, kgraul, wenjia, jaka, wintera, guwen
Cc: kuba, davem, netdev, linux-s390, linux-rdma, tonylu, pabeni,
edumazet
On Thu, 2024-05-30 at 18:14 +0800, D. Wythe wrote:
>
> On 5/30/24 5:30 PM, D. Wythe wrote:
> > From: "D. Wythe" <alibuda@linux.alibaba.com>
> >
> > This patch allows to create smc socket via AF_INET,
> > similar to the following code,
> >
> > /* create v4 smc sock */
> > v4 = socket(AF_INET, SOCK_STREAM, IPPROTO_SMC);
> >
> > /* create v6 smc sock */
> > v6 = socket(AF_INET6, SOCK_STREAM, IPPROTO_SMC);
>
> Welcome everyone to try out the eBPF based version of smc_run during
> testing, I have added a separate command called smc_run.bpf,
> it was equivalent to normal smc_run but with IPPROTO_SMC via eBPF.
>
> You can obtain the code and more info from:
> https://github.com/D-Wythe/smc-tools
>
> Usage:
>
> smc_run.bpf
> An eBPF implemented smc_run based on IPPROTO_SMC:
>
> 1. Support to transparent replacement based on command (Just like smc_run).
> 2. Supprot to transparent replacement based on pid configuration. And
> supports the inheritance of this capability between parent and child
> processes.
> 3. Support to transparent replacement based on per netns configuration.
>
> smc_run.bpf COMMAND
>
> 1. Equivalent to smc_run but with IPPROTO_SMC via eBPF
>
> smc_run.bpf -p pid
>
> 1. Add the process with target pid to the map. Afterward, all socket()
> calls of the process and its descendant processes will be replaced from
> IPPROTO_TCP to IPPROTO_SMC.
> 2. Mapping will be automatically deleted when process exits.
> 3. Specifically, COMMAND mode is actually works like following:
>
> smc_run.bpf -p $$
> COMMAND
> exit
>
> smc_run.bpf -n 1
>
> 1. Make all socket() calls of the current netns to be replaced from
> IPPROTO_TCP to IPPROTO_SMC.
> 2. Turn off it by smc_run.bpf -n 0
>
>
Hi D. Wythe,
I gave this series plus your smc_run.bpf and SMC_LO based SMC-D a test
run on my Ryzen 3900X workstation and I have to say I'm quite
impressed. I first tried the SMC_LO feature as merged in v6.10-rc1 with
the classic LD_PRELOAD based smc_run and iperf3, and qperf …
tcp_bw/tcp_lat both with normal localhost and between docker
containers. For this to work I of course had to initially set my UEID
as x86_64 unlike s390x doesn't get an SEID set. I used the following
script for this.
#!/usr/bin/sh
machine_id_upper=$(cat /etc/machine-id | tr '[:lower:]' '[:upper:]')
machine_id_suffix=$(echo "$machine_id_upper" | head -c 27)
ueid="MID-$machine_id_suffix"
smcd ueid add "$ueid"
The performance is pretty impressive:
* iperf3 with 12 parallel connections (matching core count) results in
~152 Gbit/s on normal loopback and ~312 Gbit/s with SMC_LO.
* qperf … tcp_bw (single thread) results in ~46 Gbit/s on normal loopback
and ~58 Gbit/s with SMC_LO
* qperf … tcp_lat latency test results in 5-9 us with normal loopback
and around 3-4 us with SMC_LO
Then I applied this series on top of v6.10-rc1 and tried it with your
smc_run.bpf. The performance is of course in-line with the above but
thanks to being able to enable SMC on a per-netns basis I was able to
try a few more thing. First I tried just enabling it in my default
netns and verified that after restarting sshd new ssh connections to
localhost used SMC-D through SMC_LO. Then I started Chrome and
confirmed that its TCP connections also registered with SMC and
successfully fell back to TCP mode. I had no trouble with normal
browsing though I guess especially Google stuff often uses HTTP/3 so
isn't affected. Still nice to see I didn't get breakage.
Secondly I tried smc_run.bpf with docker containers using the following
trick:
docker inspect --format '{{.State.Pid}}' <my_container_name>
34651
nsenter -t 34651 -n smc_run.bpf -n 1
Sadly this only works for commands started in the container after
loading the BPF. So I wonder if you know of a good way to either
automatically execute smc_run.bpf on container start or maybe use it on
the docker daemon such that all namespaces created by docker get the
IPPROTO_SMC override. I'd then definitely consider using SMC-D with
SMC_LO between my home lab containers even if just for bragging rights
;-)
Feel free to add for the IPPROTO_SMC series:
Tested-by: Niklas Schnelle <schnelle@linux.ibm.com>
Thanks,
Niklas
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH net-next v5 3/3] net/smc: Introduce IPPROTO_SMC
2024-06-03 3:49 ` Ratheesh Kannoth
@ 2024-06-03 14:43 ` D. Wythe
0 siblings, 0 replies; 12+ messages in thread
From: D. Wythe @ 2024-06-03 14:43 UTC (permalink / raw)
To: Ratheesh Kannoth
Cc: kgraul, wenjia, jaka, wintera, guwen, kuba, davem, netdev,
linux-s390, linux-rdma, tonylu, pabeni, edumazet
On 6/3/24 11:49 AM, Ratheesh Kannoth wrote:
> On 2024-05-30 at 15:00:40, D. Wythe (alibuda@linux.alibaba.com) wrote:
>> From: "D. Wythe" <alibuda@linux.alibaba.com>
>>
>> +
>> +int __init smc_inet_init(void)
>> +{
>> + int rc;
>> +
>> + rc = proto_register(&smc_inet_prot, 1);
>> + if (rc) {
>> + pr_err("%s: proto_register smc_inet_prot fails with %d\n", __func__, rc);
>> + return rc;
>> + }
>> + /* no return value */
>> + inet_register_protosw(&smc_inet_protosw);
>> +
>> +#if IS_ENABLED(CONFIG_IPV6)
>> + rc = proto_register(&smc_inet6_prot, 1);
>> + if (rc) {
>> + pr_err("%s: proto_register smc_inet6_prot fails with %d\n", __func__, rc);
>> + goto out_inet6_prot;
>> + }
>> + rc = inet6_register_protosw(&smc_inet6_protosw);
>> + if (rc) {
>> + pr_err("%s: inet6_register_protosw smc_inet6_protosw fails with %d\n",
>> + __func__, rc);
>> + goto out_inet6_protosw;
>> + }
>> +#endif /* CONFIG_IPV6 */
>> +
>> + return rc;
>> +#if IS_ENABLED(CONFIG_IPV6)
> Can you combine this #if with above one ? Any way you need this only in case of ipv6.
> Error handling with #if is an hindrance to a good readability.
Hi Ratheesh,
Thanks for your advice. Totally agreed with that.
I'll give it a fix in next version.
Best wishes,
D. Wythe
>> +out_inet6_protosw:
>> + proto_unregister(&smc_inet6_prot);
>> +out_inet6_prot:
>> + inet_unregister_protosw(&smc_inet_protosw);
>> + proto_unregister(&smc_inet_prot);
>> + return rc;
>> +#endif /* CONFIG_IPV6 */
>> +}
>> +
>> +void smc_inet_exit(void)
>> +{
>> +#if IS_ENABLED(CONFIG_IPV6)
>> + inet6_unregister_protosw(&smc_inet6_protosw);
>> + proto_unregister(&smc_inet6_prot);
>> +#endif /* CONFIG_IPV6 */
>> + inet_unregister_protosw(&smc_inet_protosw);
>> + proto_unregister(&smc_inet_prot);
>> +}
>> diff --git a/net/smc/smc_inet.h b/net/smc/smc_inet.h
>> new file mode 100644
>> index 00000000..a489c8a
>> --- /dev/null
>> +++ b/net/smc/smc_inet.h
>> @@ -0,0 +1,22 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +/*
>> + * Shared Memory Communications over RDMA (SMC-R) and RoCE
>> + *
>> + * Definitions for the IPPROTO_SMC (socket related)
>> +
>> + * Copyright IBM Corp. 2016
>> + * Copyright (c) 2024, Alibaba Inc.
>> + *
>> + * Author: D. Wythe <alibuda@linux.alibaba.com>
>> + */
>> +#ifndef __INET_SMC
>> +#define __INET_SMC
>> +
>> +/* Initialize protocol registration on IPPROTO_SMC,
>> + * @return 0 on success
>> + */
>> +int smc_inet_init(void);
>> +
>> +void smc_inet_exit(void);
>> +
>> +#endif /* __INET_SMC */
>> --
>> 1.8.3.1
>>
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH net-next v5 0/3] Introduce IPPROTO_SMC
2024-06-03 7:48 ` Niklas Schnelle
@ 2024-06-03 15:07 ` D. Wythe
2024-06-04 11:32 ` Niklas Schnelle
0 siblings, 1 reply; 12+ messages in thread
From: D. Wythe @ 2024-06-03 15:07 UTC (permalink / raw)
To: Niklas Schnelle, kgraul, wenjia, jaka, wintera, guwen
Cc: kuba, davem, netdev, linux-s390, linux-rdma, tonylu, pabeni,
edumazet
On 6/3/24 3:48 PM, Niklas Schnelle wrote:
> On Thu, 2024-05-30 at 18:14 +0800, D. Wythe wrote:
>> On 5/30/24 5:30 PM, D. Wythe wrote:
>>> From: "D. Wythe" <alibuda@linux.alibaba.com>
>>>
>>> This patch allows to create smc socket via AF_INET,
>>> similar to the following code,
>>>
>>> /* create v4 smc sock */
>>> v4 = socket(AF_INET, SOCK_STREAM, IPPROTO_SMC);
>>>
>>> /* create v6 smc sock */
>>> v6 = socket(AF_INET6, SOCK_STREAM, IPPROTO_SMC);
>> Welcome everyone to try out the eBPF based version of smc_run during
>> testing, I have added a separate command called smc_run.bpf,
>> it was equivalent to normal smc_run but with IPPROTO_SMC via eBPF.
>>
>> You can obtain the code and more info from:
>> https://github.com/D-Wythe/smc-tools
>>
>> Usage:
>>
>> smc_run.bpf
>> An eBPF implemented smc_run based on IPPROTO_SMC:
>>
>> 1. Support to transparent replacement based on command (Just like smc_run).
>> 2. Supprot to transparent replacement based on pid configuration. And
>> supports the inheritance of this capability between parent and child
>> processes.
>> 3. Support to transparent replacement based on per netns configuration.
>>
>> smc_run.bpf COMMAND
>>
>> 1. Equivalent to smc_run but with IPPROTO_SMC via eBPF
>>
>> smc_run.bpf -p pid
>>
>> 1. Add the process with target pid to the map. Afterward, all socket()
>> calls of the process and its descendant processes will be replaced from
>> IPPROTO_TCP to IPPROTO_SMC.
>> 2. Mapping will be automatically deleted when process exits.
>> 3. Specifically, COMMAND mode is actually works like following:
>>
>> smc_run.bpf -p $$
>> COMMAND
>> exit
>>
>> smc_run.bpf -n 1
>>
>> 1. Make all socket() calls of the current netns to be replaced from
>> IPPROTO_TCP to IPPROTO_SMC.
>> 2. Turn off it by smc_run.bpf -n 0
>>
>>
> Hi D. Wythe,
>
> I gave this series plus your smc_run.bpf and SMC_LO based SMC-D a test
> run on my Ryzen 3900X workstation and I have to say I'm quite
> impressed. I first tried the SMC_LO feature as merged in v6.10-rc1 with
> the classic LD_PRELOAD based smc_run and iperf3, and qperf …
> tcp_bw/tcp_lat both with normal localhost and between docker
> containers. For this to work I of course had to initially set my UEID
> as x86_64 unlike s390x doesn't get an SEID set. I used the following
> script for this.
>
>
> #!/usr/bin/sh
> machine_id_upper=$(cat /etc/machine-id | tr '[:lower:]' '[:upper:]')
> machine_id_suffix=$(echo "$machine_id_upper" | head -c 27)
> ueid="MID-$machine_id_suffix"
> smcd ueid add "$ueid"
>
>
> The performance is pretty impressive:
> * iperf3 with 12 parallel connections (matching core count) results in
> ~152 Gbit/s on normal loopback and ~312 Gbit/s with SMC_LO.
> * qperf … tcp_bw (single thread) results in ~46 Gbit/s on normal loopback
> and ~58 Gbit/s with SMC_LO
> * qperf … tcp_lat latency test results in 5-9 us with normal loopback
> and around 3-4 us with SMC_LO
>
> Then I applied this series on top of v6.10-rc1 and tried it with your
> smc_run.bpf. The performance is of course in-line with the above but
> thanks to being able to enable SMC on a per-netns basis I was able to
> try a few more thing. First I tried just enabling it in my default
> netns and verified that after restarting sshd new ssh connections to
> localhost used SMC-D through SMC_LO. Then I started Chrome and
> confirmed that its TCP connections also registered with SMC and
> successfully fell back to TCP mode. I had no trouble with normal
> browsing though I guess especially Google stuff often uses HTTP/3 so
> isn't affected. Still nice to see I didn't get breakage.
>
> Secondly I tried smc_run.bpf with docker containers using the following
> trick:
>
> docker inspect --format '{{.State.Pid}}' <my_container_name>
> 34651
> nsenter -t 34651 -n smc_run.bpf -n 1
>
> Sadly this only works for commands started in the container after
> loading the BPF. So I wonder if you know of a good way to either
> automatically execute smc_run.bpf on container start or maybe use it on
> the docker daemon such that all namespaces created by docker get the
> IPPROTO_SMC override. I'd then definitely consider using SMC-D with
> SMC_LO between my home lab containers even if just for bragging rights
> ;-)
>
> Feel free to add for the IPPROTO_SMC series:
>
> Tested-by: Niklas Schnelle <schnelle@linux.ibm.com>
>
> Thanks,
> Niklas
Hi Niklas ,
Thanks very much for your testing.
Regarding your question, have you ever tried starting the container
using 'smc_run.bpf docker' ?
The smc_run.bpf allows the capability for replacement to be inherited by
descendant processes. This might meet your needs.
However, it should be noted that this scope would no longer be limited
to netns.
If you don't want to replace the docker command and would like to keep
per netns, there are indeed some tricky ways, for example,
we could check current process name when creating new netns to decide if
we should add it to the ebpf-map,
but I think it's not appropriate to include this in smc_run.bpf.
Best wishes,
D. Wythe
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH net-next v5 0/3] Introduce IPPROTO_SMC
2024-06-03 15:07 ` D. Wythe
@ 2024-06-04 11:32 ` Niklas Schnelle
0 siblings, 0 replies; 12+ messages in thread
From: Niklas Schnelle @ 2024-06-04 11:32 UTC (permalink / raw)
To: D. Wythe, kgraul, wenjia, jaka, wintera, guwen
Cc: kuba, davem, netdev, linux-s390, linux-rdma, tonylu, pabeni,
edumazet
On Mon, 2024-06-03 at 23:07 +0800, D. Wythe wrote:
>
> On 6/3/24 3:48 PM, Niklas Schnelle wrote:
> > On Thu, 2024-05-30 at 18:14 +0800, D. Wythe wrote:
> > > On 5/30/24 5:30 PM, D. Wythe wrote:
> > > > From: "D. Wythe" <alibuda@linux.alibaba.com>
> > >
---8<---
> > Hi D. Wythe,
> >
> > I gave this series plus your smc_run.bpf and SMC_LO based SMC-D a test
> > run on my Ryzen 3900X workstation and I have to say I'm quite
> > impressed. I first tried the SMC_LO feature as merged in v6.10-rc1 with
> > the classic LD_PRELOAD based smc_run and iperf3, and qperf …
> > tcp_bw/tcp_lat both with normal localhost and between docker
> > containers. For this to work I of course had to initially set my UEID
> > as x86_64 unlike s390x doesn't get an SEID set. I used the following
> > script for this.
> >
> >
> > #!/usr/bin/sh
> > machine_id_upper=$(cat /etc/machine-id | tr '[:lower:]' '[:upper:]')
> > machine_id_suffix=$(echo "$machine_id_upper" | head -c 27)
> > ueid="MID-$machine_id_suffix"
> > smcd ueid add "$ueid"
> >
> >
> > The performance is pretty impressive:
> > * iperf3 with 12 parallel connections (matching core count) results in
> > ~152 Gbit/s on normal loopback and ~312 Gbit/s with SMC_LO.
> > * qperf … tcp_bw (single thread) results in ~46 Gbit/s on normal loopback
> > and ~58 Gbit/s with SMC_LO
> > * qperf … tcp_lat latency test results in 5-9 us with normal loopback
> > and around 3-4 us with SMC_LO
> >
> > Then I applied this series on top of v6.10-rc1 and tried it with your
> > smc_run.bpf. The performance is of course in-line with the above but
> > thanks to being able to enable SMC on a per-netns basis I was able to
> > try a few more thing. First I tried just enabling it in my default
> > netns and verified that after restarting sshd new ssh connections to
> > localhost used SMC-D through SMC_LO. Then I started Chrome and
> > confirmed that its TCP connections also registered with SMC and
> > successfully fell back to TCP mode. I had no trouble with normal
> > browsing though I guess especially Google stuff often uses HTTP/3 so
> > isn't affected. Still nice to see I didn't get breakage.
> >
> > Secondly I tried smc_run.bpf with docker containers using the following
> > trick:
> >
> > docker inspect --format '{{.State.Pid}}' <my_container_name>
> > 34651
> > nsenter -t 34651 -n smc_run.bpf -n 1
> >
> > Sadly this only works for commands started in the container after
> > loading the BPF. So I wonder if you know of a good way to either
> > automatically execute smc_run.bpf on container start or maybe use it on
> > the docker daemon such that all namespaces created by docker get the
> > IPPROTO_SMC override. I'd then definitely consider using SMC-D with
> > SMC_LO between my home lab containers even if just for bragging rights
> > ;-)
> >
> > Feel free to add for the IPPROTO_SMC series:
> >
> > Tested-by: Niklas Schnelle <schnelle@linux.ibm.com>
> >
> > Thanks,
> > Niklas
>
> Hi Niklas ,
>
> Thanks very much for your testing.
>
> Regarding your question, have you ever tried starting the container
> using 'smc_run.bpf docker' ?
>
> The smc_run.bpf allows the capability for replacement to be inherited by
> descendant processes. This might meet your needs.
> However, it should be noted that this scope would no longer be limited
> to netns.
>
> If you don't want to replace the docker command and would like to keep
> per netns, there are indeed some tricky ways, for example,
> we could check current process name when creating new netns to decide if
> we should add it to the ebpf-map,
> but I think it's not appropriate to include this in smc_run.bpf.
>
> Best wishes,
> D. Wythe
I'll have to try this. I'm guessing that for docker the smc_run would
have to be added for the docker daemon and not the individual docker
commands. For podman on the other hand the individual command might
work as there is no central daemon. And as you said bpf should allow us
to add other policies in the future.
Thanks,
Niklas
^ permalink raw reply [flat|nested] 12+ messages in thread
end of thread, other threads:[~2024-06-04 11:32 UTC | newest]
Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-05-30 9:30 [PATCH net-next v5 0/3] Introduce IPPROTO_SMC D. Wythe
2024-05-30 9:30 ` [PATCH net-next v5 1/3] net/smc: refactoring initialization of smc sock D. Wythe
2024-05-30 9:30 ` [PATCH net-next v5 2/3] net/smc: expose smc proto operations D. Wythe
2024-05-30 9:30 ` [PATCH net-next v5 3/3] net/smc: Introduce IPPROTO_SMC D. Wythe
2024-06-03 3:49 ` Ratheesh Kannoth
2024-06-03 14:43 ` D. Wythe
2024-05-30 10:14 ` [PATCH net-next v5 0/3] " D. Wythe
2024-05-31 8:06 ` Wenjia Zhang
2024-06-03 3:01 ` D. Wythe
2024-06-03 7:48 ` Niklas Schnelle
2024-06-03 15:07 ` D. Wythe
2024-06-04 11:32 ` Niklas Schnelle
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox