[PATCH net-next v3 00/14] net-timestamp: bpf extension to equip applications transparently

* [PATCH net-next v3 00/14] net-timestamp: bpf extension to equip applications transparently
@ 2024-10-28 11:05 Jason Xing
  2024-10-28 11:05 ` [PATCH net-next v3 01/14] net-timestamp: reorganize in skb_tstamp_tx_output() Jason Xing
                   ` (13 more replies)
  0 siblings, 14 replies; 88+ messages in thread
From: Jason Xing @ 2024-10-28 11:05 UTC (permalink / raw)
  To: davem, edumazet, kuba, pabeni, dsahern, willemdebruijn.kernel,
	willemb, ast, daniel, andrii, martin.lau, eddyz87, song,
	yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa, shuah,
	ykolal
  Cc: bpf, netdev, Jason Xing

From: Jason Xing <kernelxing@tencent.com>

A few weeks ago, I planned to extend the SO_TIMESTAMPING feature by using
a tracepoint to print information (say, tstamp) so that we can
transparently equip applications with this feature without requiring any
modification on the user side.

Later, we discussed this at netconf and agreed that we can use bpf for a
better extension, as mainly suggested by John Fastabend and Willem de
Bruijn. After I sent a few series in recent days, Martin KaFai Lau
provided a lot of valuable advice. Many thanks here!

I post this series to see if we have a better solution to extend. My
feeling is that BPF is a good place to let administrators add
timestamping without having to rebuild applications. After this
series, we could implement, step by step, more of the advanced
functions/flags that the SO_TIMESTAMPING feature already has for the
bpf extension.

This approach mostly relies on the existing SO_TIMESTAMPING feature: users
only need to pass certain flags through bpf_setsockopt() to a separate
tsflags. TX timestamps will be printed during the generation
phase. For RX timestamps, we will wait for the moment when recvmsg() is
called, which isn't supported right now.

In this series, I add the fundamental code for both TCP and UDP protocols.

---
v3
Link: https://lore.kernel.org/all/20241012040651.95616-1-kerneljasonxing@gmail.com/
1. support UDP proto by introducing a new generation point.
2. for OPT_ID, introduce sk_tskey_bpf_offset to compute the delta
between the current socket key and bpf socket key. It is designed for
UDP, and also applies to TCP.
3. support bpf_getsockopt()
4. use cgroup static key instead.
5. add one simple bpf selftest to show how it can be used.
6. remove the rx support from v2 because the number of patches could
exceed the limit of one series.

V2
Link: https://lore.kernel.org/all/20241008095109.99918-1-kerneljasonxing@gmail.com/
1. Introduce tsflag requestors so that we are able to extend more in the
future. Besides, it enables TX flags for bpf extension feature separately
without breaking users. It is suggested by Vadim Fedorenko.
2. introduce a static key to control the whole feature. (Willem)
3. Open the gate of bpf_setsockopt for the SO_TIMESTAMPING feature in
some TX/RX cases, not all the cases.

Jason Xing (14):
  net-timestamp: reorganize in skb_tstamp_tx_output()
  net-timestamp: allow two features to work parallelly
  net-timestamp: open gate for bpf_setsockopt/_getsockopt
  net-timestamp: introduce TS_SCHED_OPT_CB to generate dev xmit
    timestamp
  net-timestamp: introduce TS_SW_OPT_CB to generate driver timestamp
  net-timestamp: introduce TS_ACK_OPT_CB to generate tcp acked timestamp
  net-timestamp: add a new triggered point to set sk_tsflags_bpf in UDP
    layer
  net-timestamp: make bpf for tx timestamp work
  net-timestamp: add a common helper to set tskey
  net-timestamp: add basic support with tskey offset
  net-timestamp: support OPT_ID for TCP proto
  net-timestamp: add OPT_ID for UDP proto
  net-timestamp: use static key to control bpf extension
  bpf: add simple bpf tests in the tx path for so_timstamping feature

 include/net/sock.h                            |  14 +-
 include/uapi/linux/bpf.h                      |  18 +++
 include/uapi/linux/net_tstamp.h               |   7 +
 net/core/filter.c                             |   7 +-
 net/core/skbuff.c                             | 114 +++++++++++++++-
 net/core/sock.c                               | 125 ++++++++++++++----
 net/ipv4/ip_output.c                          |  18 ++-
 net/ipv4/tcp.c                                |  19 +++
 net/ipv4/udp.c                                |   4 +-
 net/ipv6/ip6_output.c                         |  18 ++-
 net/mptcp/sockopt.c                           |   2 +-
 net/socket.c                                  |   2 +-
 tools/include/uapi/linux/bpf.h                |  18 +++
 .../bpf/prog_tests/so_timestamping.c          |  98 ++++++++++++++
 .../selftests/bpf/progs/so_timestamping.c     | 123 +++++++++++++++++
 15 files changed, 539 insertions(+), 48 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/so_timestamping.c
 create mode 100644 tools/testing/selftests/bpf/progs/so_timestamping.c

-- 
2.37.3


* [PATCH net-next v3 01/14] net-timestamp: reorganize in skb_tstamp_tx_output()
  2024-10-28 11:05 [PATCH net-next v3 00/14] net-timestamp: bpf extension to equip applications transparently Jason Xing
@ 2024-10-28 11:05 ` Jason Xing
  2024-10-28 11:05 ` [PATCH net-next v3 02/14] net-timestamp: allow two features to work parallelly Jason Xing
                   ` (12 subsequent siblings)
  13 siblings, 0 replies; 88+ messages in thread
From: Jason Xing @ 2024-10-28 11:05 UTC (permalink / raw)
  To: davem, edumazet, kuba, pabeni, dsahern, willemdebruijn.kernel,
	willemb, ast, daniel, andrii, martin.lau, eddyz87, song,
	yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa, shuah,
	ykolal
  Cc: bpf, netdev, Jason Xing

From: Jason Xing <kernelxing@tencent.com>

This is a preparation for the bpf print function added later. This patch
only moves the original generation logic into one function, so that we
can integrate the bpf print easily. No functional changes here.

Signed-off-by: Jason Xing <kernelxing@tencent.com>
---
 net/core/skbuff.c | 22 +++++++++++++++-------
 1 file changed, 15 insertions(+), 7 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 00afeb90c23a..1cf8416f4123 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -5539,18 +5539,15 @@ void skb_complete_tx_timestamp(struct sk_buff *skb,
 }
 EXPORT_SYMBOL_GPL(skb_complete_tx_timestamp);
 
-void __skb_tstamp_tx(struct sk_buff *orig_skb,
-		     const struct sk_buff *ack_skb,
-		     struct skb_shared_hwtstamps *hwtstamps,
-		     struct sock *sk, int tstype)
+static void skb_tstamp_tx_output(struct sk_buff *orig_skb,
+				 const struct sk_buff *ack_skb,
+				 struct skb_shared_hwtstamps *hwtstamps,
+				 struct sock *sk, int tstype)
 {
 	struct sk_buff *skb;
 	bool tsonly, opt_stats = false;
 	u32 tsflags;
 
-	if (!sk)
-		return;
-
 	tsflags = READ_ONCE(sk->sk_tsflags);
 	if (!hwtstamps && !(tsflags & SOF_TIMESTAMPING_OPT_TX_SWHW) &&
 	    skb_shinfo(orig_skb)->tx_flags & SKBTX_IN_PROGRESS)
@@ -5594,6 +5591,17 @@ void __skb_tstamp_tx(struct sk_buff *orig_skb,
 
 	__skb_complete_tx_timestamp(skb, sk, tstype, opt_stats);
 }
+
+void __skb_tstamp_tx(struct sk_buff *orig_skb,
+		     const struct sk_buff *ack_skb,
+		     struct skb_shared_hwtstamps *hwtstamps,
+		     struct sock *sk, int tstype)
+{
+	if (!sk)
+		return;
+
+	skb_tstamp_tx_output(orig_skb, ack_skb, hwtstamps, sk, tstype);
+}
 EXPORT_SYMBOL_GPL(__skb_tstamp_tx);
 
 void skb_tstamp_tx(struct sk_buff *orig_skb,
-- 
2.37.3



* [PATCH net-next v3 02/14] net-timestamp: allow two features to work parallelly
  2024-10-28 11:05 [PATCH net-next v3 00/14] net-timestamp: bpf extension to equip applications transparently Jason Xing
  2024-10-28 11:05 ` [PATCH net-next v3 01/14] net-timestamp: reorganize in skb_tstamp_tx_output() Jason Xing
@ 2024-10-28 11:05 ` Jason Xing
  2024-10-29 23:00   ` Martin KaFai Lau
  2024-11-02 13:43   ` Simon Horman
  2024-10-28 11:05 ` [PATCH net-next v3 03/14] net-timestamp: open gate for bpf_setsockopt/_getsockopt Jason Xing
                   ` (11 subsequent siblings)
  13 siblings, 2 replies; 88+ messages in thread
From: Jason Xing @ 2024-10-28 11:05 UTC (permalink / raw)
  To: davem, edumazet, kuba, pabeni, dsahern, willemdebruijn.kernel,
	willemb, ast, daniel, andrii, martin.lau, eddyz87, song,
	yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa, shuah,
	ykolal
  Cc: bpf, netdev, Jason Xing

From: Jason Xing <kernelxing@tencent.com>

This patch introduces a separate sk_tsflags_bpf for the bpf
extension, which lets the two features work at nearly the
same time.

Each feature finally takes effect on skb_shinfo(skb)->tx_flags
(through, say, tcp_tx_timestamp() for TCP or skb_setup_tx_timestamp()
for other types), so in __skb_tstamp_tx() we are unable to know which
feature is turned on unless we check each feature's own socket
flag field.
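
A minimal userspace sketch of the per-feature check described above,
not the kernel code itself. The struct model_sock, the MODEL_*
constants and the model_* helpers are hypothetical stand-ins for
struct sock, the SCM_TSTAMP_* / SOF_TIMESTAMPING_TX_* values and
sk_tstamp_tx_flags(); the point is only that each requestor's flag
word is tested on its own.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the SCM_TSTAMP_* report types. */
enum model_tstype { MODEL_TSTAMP_SND, MODEL_TSTAMP_SCHED, MODEL_TSTAMP_ACK };

/* Hypothetical stand-ins for the SOF_TIMESTAMPING_TX_* request bits. */
#define MODEL_TX_SOFTWARE (1u << 1)
#define MODEL_TX_SCHED    (1u << 8)
#define MODEL_TX_ACK      (1u << 9)

/* Two independent flag words, one per requestor. */
struct model_sock {
	uint32_t tsflags;     /* set by the application */
	uint32_t tsflags_bpf; /* set by the bpf program */
};

/* Return true if the given flag word asked for this report type. */
static bool model_tstamp_wanted(uint32_t tsflags, enum model_tstype tstype)
{
	switch (tstype) {
	case MODEL_TSTAMP_SCHED:
		return tsflags & MODEL_TX_SCHED;
	case MODEL_TSTAMP_SND:
		return tsflags & MODEL_TX_SOFTWARE;
	case MODEL_TSTAMP_ACK:
		return tsflags & MODEL_TX_ACK;
	}
	return false;
}

/* Each feature consults only its own flag word. */
static bool model_app_wants(const struct model_sock *sk, enum model_tstype t)
{
	return model_tstamp_wanted(sk->tsflags, t);
}

static bool model_bpf_wants(const struct model_sock *sk, enum model_tstype t)
{
	return model_tstamp_wanted(sk->tsflags_bpf, t);
}
```

With tsflags = MODEL_TX_SCHED and tsflags_bpf = MODEL_TX_ACK, only the
application path reports at the SCHED point and only the bpf path
reports at the ACK point.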

Signed-off-by: Jason Xing <kernelxing@tencent.com>
---
 include/net/sock.h |  1 +
 net/core/skbuff.c  | 39 +++++++++++++++++++++++++++++++++++++++
 2 files changed, 40 insertions(+)

diff --git a/include/net/sock.h b/include/net/sock.h
index 7464e9f9f47c..5384f1e49f5c 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -445,6 +445,7 @@ struct sock {
 	u32			sk_reserved_mem;
 	int			sk_forward_alloc;
 	u32			sk_tsflags;
+	u32			sk_tsflags_bpf;
 	__cacheline_group_end(sock_write_rxtx);
 
 	__cacheline_group_begin(sock_write_tx);
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 1cf8416f4123..39309f75e105 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -5539,6 +5539,32 @@ void skb_complete_tx_timestamp(struct sk_buff *skb,
 }
 EXPORT_SYMBOL_GPL(skb_complete_tx_timestamp);
 
+/* This function is used to test if application SO_TIMESTAMPING feature
+ * or bpf SO_TIMESTAMPING feature is loaded by checking its own socket flags.
+ */
+static bool sk_tstamp_tx_flags(struct sock *sk, u32 tsflags, int tstype)
+{
+	u32 testflag;
+
+	switch (tstype) {
+	case SCM_TSTAMP_SCHED:
+		testflag = SOF_TIMESTAMPING_TX_SCHED;
+		break;
+	case SCM_TSTAMP_SND:
+		testflag = SOF_TIMESTAMPING_TX_SOFTWARE;
+		break;
+	case SCM_TSTAMP_ACK:
+		testflag = SOF_TIMESTAMPING_TX_ACK;
+		break;
+	default:
+		return false;
+	}
+	if (tsflags & testflag)
+		return true;
+
+	return false;
+}
+
 static void skb_tstamp_tx_output(struct sk_buff *orig_skb,
 				 const struct sk_buff *ack_skb,
 				 struct skb_shared_hwtstamps *hwtstamps,
@@ -5549,6 +5575,9 @@ static void skb_tstamp_tx_output(struct sk_buff *orig_skb,
 	u32 tsflags;
 
 	tsflags = READ_ONCE(sk->sk_tsflags);
+	if (!sk_tstamp_tx_flags(sk, tsflags, tstype))
+		return;
+
 	if (!hwtstamps && !(tsflags & SOF_TIMESTAMPING_OPT_TX_SWHW) &&
 	    skb_shinfo(orig_skb)->tx_flags & SKBTX_IN_PROGRESS)
 		return;
@@ -5592,6 +5621,15 @@ static void skb_tstamp_tx_output(struct sk_buff *orig_skb,
 	__skb_complete_tx_timestamp(skb, sk, tstype, opt_stats);
 }
 
+static void skb_tstamp_tx_output_bpf(struct sock *sk, int tstype)
+{
+	u32 tsflags;
+
+	tsflags = READ_ONCE(sk->sk_tsflags_bpf);
+	if (!sk_tstamp_tx_flags(sk, tsflags, tstype))
+		return;
+}
+
 void __skb_tstamp_tx(struct sk_buff *orig_skb,
 		     const struct sk_buff *ack_skb,
 		     struct skb_shared_hwtstamps *hwtstamps,
@@ -5600,6 +5638,7 @@ void __skb_tstamp_tx(struct sk_buff *orig_skb,
 	if (!sk)
 		return;
 
+	skb_tstamp_tx_output_bpf(sk, tstype);
 	skb_tstamp_tx_output(orig_skb, ack_skb, hwtstamps, sk, tstype);
 }
 EXPORT_SYMBOL_GPL(__skb_tstamp_tx);
-- 
2.37.3



* [PATCH net-next v3 03/14] net-timestamp: open gate for bpf_setsockopt/_getsockopt
  2024-10-28 11:05 [PATCH net-next v3 00/14] net-timestamp: bpf extension to equip applications transparently Jason Xing
  2024-10-28 11:05 ` [PATCH net-next v3 01/14] net-timestamp: reorganize in skb_tstamp_tx_output() Jason Xing
  2024-10-28 11:05 ` [PATCH net-next v3 02/14] net-timestamp: allow two features to work parallelly Jason Xing
@ 2024-10-28 11:05 ` Jason Xing
  2024-10-29  0:59   ` Willem de Bruijn
  2024-10-30  0:32   ` Martin KaFai Lau
  2024-10-28 11:05 ` [PATCH net-next v3 04/14] net-timestamp: introduce TS_SCHED_OPT_CB to generate dev xmit timestamp Jason Xing
                   ` (10 subsequent siblings)
  13 siblings, 2 replies; 88+ messages in thread
From: Jason Xing @ 2024-10-28 11:05 UTC (permalink / raw)
  To: davem, edumazet, kuba, pabeni, dsahern, willemdebruijn.kernel,
	willemb, ast, daniel, andrii, martin.lau, eddyz87, song,
	yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa, shuah,
	ykolal
  Cc: bpf, netdev, Jason Xing

From: Jason Xing <kernelxing@tencent.com>

For now, we support bpf_setsockopt to set or clear timestamp flags.

Users can use something like this in a bpf program to turn on the feature:
flags = SOF_TIMESTAMPING_TX_SCHED;
bpf_setsockopt(skops, SOL_SOCKET, SO_TIMESTAMPING, &flags, sizeof(flags));
The specific use cases can be seen in the bpf selftest in this series.

Later, I will support each flag one by one based on this.
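
The set path added by this patch rejects any flag outside the bpf mask
and otherwise overwrites the bpf flag word. A minimal userspace sketch
of that rule, assuming the bit positions below mirror the uapi header;
the MODEL_* names and model_set_timestamping_bpf() are hypothetical and
only illustrate the check, they are not the kernel implementation.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-ins for SOF_TIMESTAMPING_* bits. */
#define MODEL_TX_HARDWARE (1u << 0)  /* deliberately not in the bpf mask */
#define MODEL_TX_SOFTWARE (1u << 1)
#define MODEL_SOFTWARE    (1u << 4)
#define MODEL_OPT_ID      (1u << 7)
#define MODEL_TX_SCHED    (1u << 8)
#define MODEL_TX_ACK      (1u << 9)
#define MODEL_OPT_ID_TCP  (1u << 16)

/* Flags the bpf requestor is allowed to set in this sketch. */
#define MODEL_BPF_SUPPORTED_MASK (MODEL_SOFTWARE | \
				  MODEL_TX_SCHED | \
				  MODEL_TX_SOFTWARE | \
				  MODEL_TX_ACK | \
				  MODEL_OPT_ID | \
				  MODEL_OPT_ID_TCP)

/* Validate first, then replace the whole flag word (0 clears it). */
static int model_set_timestamping_bpf(uint32_t *tsflags_bpf, uint32_t flags)
{
	if (flags & ~MODEL_BPF_SUPPORTED_MASK)
		return -EINVAL;

	*tsflags_bpf = flags;
	return 0;
}
```

A rejected request leaves the previous flags untouched, and passing 0
clears them, which matches the "set or clear" wording above.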

Signed-off-by: Jason Xing <kernelxing@tencent.com>
---
 include/net/sock.h              |  4 ++--
 include/uapi/linux/net_tstamp.h |  7 +++++++
 net/core/filter.c               |  7 +++++--
 net/core/sock.c                 | 34 ++++++++++++++++++++++++++-------
 net/ipv4/udp.c                  |  2 +-
 net/mptcp/sockopt.c             |  2 +-
 net/socket.c                    |  2 +-
 7 files changed, 44 insertions(+), 14 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 5384f1e49f5c..062f405c744e 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1775,7 +1775,7 @@ static inline void skb_set_owner_edemux(struct sk_buff *skb, struct sock *sk)
 #endif
 
 int sk_setsockopt(struct sock *sk, int level, int optname,
-		  sockptr_t optval, unsigned int optlen);
+		  sockptr_t optval, unsigned int optlen, bool bpf_timetamping);
 int sock_setsockopt(struct socket *sock, int level, int op,
 		    sockptr_t optval, unsigned int optlen);
 int do_sock_setsockopt(struct socket *sock, bool compat, int level,
@@ -1784,7 +1784,7 @@ int do_sock_getsockopt(struct socket *sock, bool compat, int level,
 		       int optname, sockptr_t optval, sockptr_t optlen);
 
 int sk_getsockopt(struct sock *sk, int level, int optname,
-		  sockptr_t optval, sockptr_t optlen);
+		  sockptr_t optval, sockptr_t optlen, bool bpf_timetamping);
 int sock_gettstamp(struct socket *sock, void __user *userstamp,
 		   bool timeval, bool time32);
 struct sk_buff *sock_alloc_send_pskb(struct sock *sk, unsigned long header_len,
diff --git a/include/uapi/linux/net_tstamp.h b/include/uapi/linux/net_tstamp.h
index 858339d1c1c4..0696699cf964 100644
--- a/include/uapi/linux/net_tstamp.h
+++ b/include/uapi/linux/net_tstamp.h
@@ -49,6 +49,13 @@ enum {
 					 SOF_TIMESTAMPING_TX_SCHED | \
 					 SOF_TIMESTAMPING_TX_ACK)
 
+#define SOF_TIMESTAMPING_BPF_SUPPPORTED_MASK (SOF_TIMESTAMPING_SOFTWARE | \
+					      SOF_TIMESTAMPING_TX_SCHED | \
+					      SOF_TIMESTAMPING_TX_SOFTWARE | \
+					      SOF_TIMESTAMPING_TX_ACK | \
+					      SOF_TIMESTAMPING_OPT_ID | \
+					      SOF_TIMESTAMPING_OPT_ID_TCP)
+
 /**
  * struct so_timestamping - SO_TIMESTAMPING parameter
  *
diff --git a/net/core/filter.c b/net/core/filter.c
index 58761263176c..dc8ecf899ced 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5238,6 +5238,9 @@ static int sol_socket_sockopt(struct sock *sk, int optname,
 		break;
 	case SO_BINDTODEVICE:
 		break;
+	case SO_TIMESTAMPING_NEW:
+	case SO_TIMESTAMPING_OLD:
+		break;
 	default:
 		return -EINVAL;
 	}
@@ -5247,11 +5250,11 @@ static int sol_socket_sockopt(struct sock *sk, int optname,
 			return -EINVAL;
 		return sk_getsockopt(sk, SOL_SOCKET, optname,
 				     KERNEL_SOCKPTR(optval),
-				     KERNEL_SOCKPTR(optlen));
+				     KERNEL_SOCKPTR(optlen), true);
 	}
 
 	return sk_setsockopt(sk, SOL_SOCKET, optname,
-			     KERNEL_SOCKPTR(optval), *optlen);
+			     KERNEL_SOCKPTR(optval), *optlen, true);
 }
 
 static int bpf_sol_tcp_setsockopt(struct sock *sk, int optname,
diff --git a/net/core/sock.c b/net/core/sock.c
index 7f398bd07fb7..7e05748b1a06 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -941,6 +941,19 @@ int sock_set_timestamping(struct sock *sk, int optname,
 	return 0;
 }
 
+static int sock_set_timestamping_bpf(struct sock *sk,
+				     struct so_timestamping timestamping)
+{
+	u32 flags = timestamping.flags;
+
+	if (flags & ~SOF_TIMESTAMPING_BPF_SUPPPORTED_MASK)
+		return -EINVAL;
+
+	WRITE_ONCE(sk->sk_tsflags_bpf, flags);
+
+	return 0;
+}
+
 void sock_set_keepalive(struct sock *sk)
 {
 	lock_sock(sk);
@@ -1159,7 +1172,7 @@ static int sockopt_validate_clockid(__kernel_clockid_t value)
  */
 
 int sk_setsockopt(struct sock *sk, int level, int optname,
-		  sockptr_t optval, unsigned int optlen)
+		  sockptr_t optval, unsigned int optlen, bool bpf_timetamping)
 {
 	struct so_timestamping timestamping;
 	struct socket *sock = sk->sk_socket;
@@ -1409,7 +1422,10 @@ int sk_setsockopt(struct sock *sk, int level, int optname,
 			memset(&timestamping, 0, sizeof(timestamping));
 			timestamping.flags = val;
 		}
-		ret = sock_set_timestamping(sk, optname, timestamping);
+		if (!bpf_timetamping)
+			ret = sock_set_timestamping(sk, optname, timestamping);
+		else
+			ret = sock_set_timestamping_bpf(sk, timestamping);
 		break;
 
 	case SO_RCVLOWAT:
@@ -1626,7 +1642,7 @@ int sock_setsockopt(struct socket *sock, int level, int optname,
 		    sockptr_t optval, unsigned int optlen)
 {
 	return sk_setsockopt(sock->sk, level, optname,
-			     optval, optlen);
+			     optval, optlen, false);
 }
 EXPORT_SYMBOL(sock_setsockopt);
 
@@ -1670,7 +1686,7 @@ static int groups_to_user(sockptr_t dst, const struct group_info *src)
 }
 
 int sk_getsockopt(struct sock *sk, int level, int optname,
-		  sockptr_t optval, sockptr_t optlen)
+		  sockptr_t optval, sockptr_t optlen, bool bpf_timetamping)
 {
 	struct socket *sock = sk->sk_socket;
 
@@ -1793,9 +1809,13 @@ int sk_getsockopt(struct sock *sk, int level, int optname,
 		 * returning the flags when they were set through the same option.
 		 * Don't change the beviour for the old case SO_TIMESTAMPING_OLD.
 		 */
-		if (optname == SO_TIMESTAMPING_OLD || sock_flag(sk, SOCK_TSTAMP_NEW)) {
-			v.timestamping.flags = READ_ONCE(sk->sk_tsflags);
-			v.timestamping.bind_phc = READ_ONCE(sk->sk_bind_phc);
+		if (!bpf_timetamping) {
+			if (optname == SO_TIMESTAMPING_OLD || sock_flag(sk, SOCK_TSTAMP_NEW)) {
+				v.timestamping.flags = READ_ONCE(sk->sk_tsflags);
+				v.timestamping.bind_phc = READ_ONCE(sk->sk_bind_phc);
+			}
+		} else {
+			v.timestamping.flags = READ_ONCE(sk->sk_tsflags_bpf);
 		}
 		break;
 
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 0e24916b39d4..9a20af41e272 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -2679,7 +2679,7 @@ int udp_lib_setsockopt(struct sock *sk, int level, int optname,
 	int is_udplite = IS_UDPLITE(sk);
 
 	if (level == SOL_SOCKET) {
-		err = sk_setsockopt(sk, level, optname, optval, optlen);
+		err = sk_setsockopt(sk, level, optname, optval, optlen, false);
 
 		if (optname == SO_RCVBUF || optname == SO_RCVBUFFORCE) {
 			sockopt_lock_sock(sk);
diff --git a/net/mptcp/sockopt.c b/net/mptcp/sockopt.c
index 505445a9598f..7b12cc2db136 100644
--- a/net/mptcp/sockopt.c
+++ b/net/mptcp/sockopt.c
@@ -306,7 +306,7 @@ static int mptcp_setsockopt_sol_socket(struct mptcp_sock *msk, int optname,
 			return PTR_ERR(ssk);
 		}
 
-		ret = sk_setsockopt(ssk, SOL_SOCKET, optname, optval, optlen);
+		ret = sk_setsockopt(ssk, SOL_SOCKET, optname, optval, optlen, false);
 		if (ret == 0) {
 			if (optname == SO_REUSEPORT)
 				sk->sk_reuseport = ssk->sk_reuseport;
diff --git a/net/socket.c b/net/socket.c
index 9a8e4452b9b2..4bdca39685a6 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -2385,7 +2385,7 @@ int do_sock_getsockopt(struct socket *sock, bool compat, int level,
 
 	ops = READ_ONCE(sock->ops);
 	if (level == SOL_SOCKET) {
-		err = sk_getsockopt(sock->sk, level, optname, optval, optlen);
+		err = sk_getsockopt(sock->sk, level, optname, optval, optlen, false);
 	} else if (unlikely(!ops->getsockopt)) {
 		err = -EOPNOTSUPP;
 	} else {
-- 
2.37.3



* [PATCH net-next v3 04/14] net-timestamp: introduce TS_SCHED_OPT_CB to generate dev xmit timestamp
  2024-10-28 11:05 [PATCH net-next v3 00/14] net-timestamp: bpf extension to equip applications transparently Jason Xing
                   ` (2 preceding siblings ...)
  2024-10-28 11:05 ` [PATCH net-next v3 03/14] net-timestamp: open gate for bpf_setsockopt/_getsockopt Jason Xing
@ 2024-10-28 11:05 ` Jason Xing
  2024-10-29  0:23   ` kernel test robot
                     ` (2 more replies)
  2024-10-28 11:05 ` [PATCH net-next v3 05/14] net-timestamp: introduce TS_SW_OPT_CB to generate driver timestamp Jason Xing
                   ` (9 subsequent siblings)
  13 siblings, 3 replies; 88+ messages in thread
From: Jason Xing @ 2024-10-28 11:05 UTC (permalink / raw)
  To: davem, edumazet, kuba, pabeni, dsahern, willemdebruijn.kernel,
	willemb, ast, daniel, andrii, martin.lau, eddyz87, song,
	yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa, shuah,
	ykolal
  Cc: bpf, netdev, Jason Xing

From: Jason Xing <kernelxing@tencent.com>

Introduce BPF_SOCK_OPS_TS_SCHED_OPT_CB flag so that we can decide to
print timestamps when the skb just passes the dev layer.

Signed-off-by: Jason Xing <kernelxing@tencent.com>
---
 include/uapi/linux/bpf.h       |  5 +++++
 net/core/skbuff.c              | 31 ++++++++++++++++++++++++++++++-
 tools/include/uapi/linux/bpf.h |  5 +++++
 3 files changed, 40 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index e8241b320c6d..324e9e40969c 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -7013,6 +7013,11 @@ enum {
 					 * by the kernel or the
 					 * earlier bpf-progs.
 					 */
+	BPF_SOCK_OPS_TS_SCHED_OPT_CB,	/* Called when skb is passing through
+					 * dev layer when SO_TIMESTAMPING
+					 * feature is on. It indicates the
+					 * recorded timestamp.
+					 */
 };
 
 /* List of TCP states. There is a build check in net/ipv4/tcp.c to detect
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 39309f75e105..e6a5c883bdc6 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -64,6 +64,7 @@
 #include <linux/mpls.h>
 #include <linux/kcov.h>
 #include <linux/iov_iter.h>
+#include <linux/bpf-cgroup.h>
 
 #include <net/protocol.h>
 #include <net/dst.h>
@@ -5621,13 +5622,41 @@ static void skb_tstamp_tx_output(struct sk_buff *orig_skb,
 	__skb_complete_tx_timestamp(skb, sk, tstype, opt_stats);
 }
 
+static void timestamp_call_bpf(struct sock *sk, int op, u32 nargs, u32 *args)
+{
+	struct bpf_sock_ops_kern sock_ops;
+
+	memset(&sock_ops, 0, offsetof(struct bpf_sock_ops_kern, temp));
+	if (sk_fullsock(sk)) {
+		sock_ops.is_fullsock = 1;
+		sock_owned_by_me(sk);
+	}
+
+	sock_ops.sk = sk;
+	sock_ops.op = op;
+	if (nargs > 0)
+		memcpy(sock_ops.args, args, nargs * sizeof(*args));
+
+	BPF_CGROUP_RUN_PROG_SOCK_OPS_SK(&sock_ops, sk);
+}
+
 static void skb_tstamp_tx_output_bpf(struct sock *sk, int tstype)
 {
-	u32 tsflags;
+	u32 tsflags, cb_flag;
 
 	tsflags = READ_ONCE(sk->sk_tsflags_bpf);
 	if (!sk_tstamp_tx_flags(sk, tsflags, tstype))
 		return;
+
+	switch (tstype) {
+	case SCM_TSTAMP_SCHED:
+		cb_flag = BPF_SOCK_OPS_TS_SCHED_OPT_CB;
+		break;
+	default:
+		return;
+	}
+
+	timestamp_call_bpf(sk, cb_flag, 0, NULL);
 }
 
 void __skb_tstamp_tx(struct sk_buff *orig_skb,
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index e8241b320c6d..324e9e40969c 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -7013,6 +7013,11 @@ enum {
 					 * by the kernel or the
 					 * earlier bpf-progs.
 					 */
+	BPF_SOCK_OPS_TS_SCHED_OPT_CB,	/* Called when skb is passing through
+					 * dev layer when SO_TIMESTAMPING
+					 * feature is on. It indicates the
+					 * recorded timestamp.
+					 */
 };
 
 /* List of TCP states. There is a build check in net/ipv4/tcp.c to detect
-- 
2.37.3



* [PATCH net-next v3 05/14] net-timestamp: introduce TS_SW_OPT_CB to generate driver timestamp
  2024-10-28 11:05 [PATCH net-next v3 00/14] net-timestamp: bpf extension to equip applications transparently Jason Xing
                   ` (3 preceding siblings ...)
  2024-10-28 11:05 ` [PATCH net-next v3 04/14] net-timestamp: introduce TS_SCHED_OPT_CB to generate dev xmit timestamp Jason Xing
@ 2024-10-28 11:05 ` Jason Xing
  2024-10-28 11:05 ` [PATCH net-next v3 06/14] net-timestamp: introduce TS_ACK_OPT_CB to generate tcp acked timestamp Jason Xing
                   ` (8 subsequent siblings)
  13 siblings, 0 replies; 88+ messages in thread
From: Jason Xing @ 2024-10-28 11:05 UTC (permalink / raw)
  To: davem, edumazet, kuba, pabeni, dsahern, willemdebruijn.kernel,
	willemb, ast, daniel, andrii, martin.lau, eddyz87, song,
	yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa, shuah,
	ykolal
  Cc: bpf, netdev, Jason Xing

From: Jason Xing <kernelxing@tencent.com>

When the skb is about to be sent from the driver to the nic, we can print
the timestamp by setting BPF_SOCK_OPS_TS_SW_OPT_CB in the bpf program.
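
When a hardware timestamp is present, the diff below hands it to the
bpf program as two u32 arguments, seconds and nanoseconds. A rough
userspace sketch of that split using plain integer arithmetic; the
model_* helpers and MODEL_NSEC_PER_SEC are hypothetical names, and the
reassembly helper is only an assumption about how a consumer might use
the two values.

```c
#include <stdint.h>

#define MODEL_NSEC_PER_SEC 1000000000LL

/* Split a non-negative nanosecond timestamp into {seconds, nanoseconds}.
 * Only the low 32 bits of the seconds value fit into args[0].
 */
static void model_tstamp_to_args(int64_t ns, uint32_t args[2])
{
	args[0] = (uint32_t)(ns / MODEL_NSEC_PER_SEC);
	args[1] = (uint32_t)(ns % MODEL_NSEC_PER_SEC);
}

/* Rebuild a nanosecond value from the two arguments. */
static int64_t model_args_to_ns(const uint32_t args[2])
{
	return (int64_t)args[0] * MODEL_NSEC_PER_SEC + args[1];
}
```

The round trip is exact only while the seconds value fits in 32 bits,
since each argument slot is a u32.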

Signed-off-by: Jason Xing <kernelxing@tencent.com>
---
 include/uapi/linux/bpf.h       |  5 +++++
 net/core/skbuff.c              | 19 ++++++++++++++++---
 tools/include/uapi/linux/bpf.h |  5 +++++
 3 files changed, 26 insertions(+), 3 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 324e9e40969c..b0032e173e65 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -7018,6 +7018,11 @@ enum {
 					 * feature is on. It indicates the
 					 * recorded timestamp.
 					 */
+	BPF_SOCK_OPS_TS_SW_OPT_CB,	/* Called when skb is about to send
+					 * to the nic when SO_TIMESTAMPING
+					 * feature is on. It indicates the
+					 * recorded timestamp.
+					 */
 };
 
 /* List of TCP states. There is a build check in net/ipv4/tcp.c to detect
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index e6a5c883bdc6..e29ab3e45213 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -5640,8 +5640,10 @@ static void timestamp_call_bpf(struct sock *sk, int op, u32 nargs, u32 *args)
 	BPF_CGROUP_RUN_PROG_SOCK_OPS_SK(&sock_ops, sk);
 }
 
-static void skb_tstamp_tx_output_bpf(struct sock *sk, int tstype)
+static void skb_tstamp_tx_output_bpf(struct sock *sk, int tstype,
+				     struct skb_shared_hwtstamps *hwtstamps)
 {
+	u32 args[2] = {0, 0};
 	u32 tsflags, cb_flag;
 
 	tsflags = READ_ONCE(sk->sk_tsflags_bpf);
@@ -5652,11 +5654,22 @@ static void skb_tstamp_tx_output_bpf(struct sock *sk, int tstype)
 	case SCM_TSTAMP_SCHED:
 		cb_flag = BPF_SOCK_OPS_TS_SCHED_OPT_CB;
 		break;
+	case SCM_TSTAMP_SND:
+		cb_flag = BPF_SOCK_OPS_TS_SW_OPT_CB;
+		break;
 	default:
 		return;
 	}
 
-	timestamp_call_bpf(sk, cb_flag, 0, NULL);
+	if (hwtstamps) {
+		struct timespec64 ts;
+
+		ts = ktime_to_timespec64(hwtstamps->hwtstamp);
+		args[0] = ts.tv_sec;
+		args[1] = ts.tv_nsec;
+	}
+
+	timestamp_call_bpf(sk, cb_flag, 2, args);
 }
 
 void __skb_tstamp_tx(struct sk_buff *orig_skb,
@@ -5667,7 +5680,7 @@ void __skb_tstamp_tx(struct sk_buff *orig_skb,
 	if (!sk)
 		return;
 
-	skb_tstamp_tx_output_bpf(sk, tstype);
+	skb_tstamp_tx_output_bpf(sk, tstype, hwtstamps);
 	skb_tstamp_tx_output(orig_skb, ack_skb, hwtstamps, sk, tstype);
 }
 EXPORT_SYMBOL_GPL(__skb_tstamp_tx);
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 324e9e40969c..b0032e173e65 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -7018,6 +7018,11 @@ enum {
 					 * feature is on. It indicates the
 					 * recorded timestamp.
 					 */
+	BPF_SOCK_OPS_TS_SW_OPT_CB,	/* Called when skb is about to send
+					 * to the nic when SO_TIMESTAMPING
+					 * feature is on. It indicates the
+					 * recorded timestamp.
+					 */
 };
 
 /* List of TCP states. There is a build check in net/ipv4/tcp.c to detect
-- 
2.37.3



* [PATCH net-next v3 06/14] net-timestamp: introduce TS_ACK_OPT_CB to generate tcp acked timestamp
  2024-10-28 11:05 [PATCH net-next v3 00/14] net-timestamp: bpf extension to equip applications transparently Jason Xing
                   ` (4 preceding siblings ...)
  2024-10-28 11:05 ` [PATCH net-next v3 05/14] net-timestamp: introduce TS_SW_OPT_CB to generate driver timestamp Jason Xing
@ 2024-10-28 11:05 ` Jason Xing
  2024-10-29  1:03   ` Willem de Bruijn
  2024-10-28 11:05 ` [PATCH net-next v3 07/14] net-timestamp: add a new triggered point to set sk_tsflags_bpf in UDP layer Jason Xing
                   ` (7 subsequent siblings)
  13 siblings, 1 reply; 88+ messages in thread
From: Jason Xing @ 2024-10-28 11:05 UTC (permalink / raw)
  To: davem, edumazet, kuba, pabeni, dsahern, willemdebruijn.kernel,
	willemb, ast, daniel, andrii, martin.lau, eddyz87, song,
	yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa, shuah,
	ykolal
  Cc: bpf, netdev, Jason Xing

From: Jason Xing <kernelxing@tencent.com>

When the last sent skb in each sendmsg() is acknowledged in the TCP layer,
we can print the timestamp by setting BPF_SOCK_OPS_TS_ACK_OPT_CB in
the bpf program.

Signed-off-by: Jason Xing <kernelxing@tencent.com>
---
 include/uapi/linux/bpf.h       | 5 +++++
 net/core/skbuff.c              | 3 +++
 tools/include/uapi/linux/bpf.h | 5 +++++
 3 files changed, 13 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index b0032e173e65..6fc3bd12b650 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -7023,6 +7023,11 @@ enum {
 					 * feature is on. It indicates the
 					 * recorded timestamp.
 					 */
+	BPF_SOCK_OPS_TS_ACK_OPT_CB,	/* Called when all the skbs are
+					 * acknowledged when SO_TIMESTAMPING
+					 * feature is on. It indicates the
+					 * recorded timestamp.
+					 */
 };
 
 /* List of TCP states. There is a build check in net/ipv4/tcp.c to detect
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index e29ab3e45213..8b2a79c0fe1c 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -5657,6 +5657,9 @@ static void skb_tstamp_tx_output_bpf(struct sock *sk, int tstype,
 	case SCM_TSTAMP_SND:
 		cb_flag = BPF_SOCK_OPS_TS_SW_OPT_CB;
 		break;
+	case SCM_TSTAMP_ACK:
+		cb_flag = BPF_SOCK_OPS_TS_ACK_OPT_CB;
+		break;
 	default:
 		return;
 	}
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index b0032e173e65..6fc3bd12b650 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -7023,6 +7023,11 @@ enum {
 					 * feature is on. It indicates the
 					 * recorded timestamp.
 					 */
+	BPF_SOCK_OPS_TS_ACK_OPT_CB,	/* Called when all the skbs are
+					 * acknowledged when SO_TIMESTAMPING
+					 * feature is on. It indicates the
+					 * recorded timestamp.
+					 */
 };
 
 /* List of TCP states. There is a build check in net/ipv4/tcp.c to detect
-- 
2.37.3



* [PATCH net-next v3 07/14] net-timestamp: add a new triggered point to set sk_tsflags_bpf in UDP layer
  2024-10-28 11:05 [PATCH net-next v3 00/14] net-timestamp: bpf extension to equip applications transparently Jason Xing
                   ` (5 preceding siblings ...)
  2024-10-28 11:05 ` [PATCH net-next v3 06/14] net-timestamp: introduce TS_ACK_OPT_CB to generate tcp acked timestamp Jason Xing
@ 2024-10-28 11:05 ` Jason Xing
  2024-10-29  1:07   ` Willem de Bruijn
  2024-10-28 11:05 ` [PATCH net-next v3 08/14] net-timestamp: make bpf for tx timestamp work Jason Xing
                   ` (6 subsequent siblings)
  13 siblings, 1 reply; 88+ messages in thread
From: Jason Xing @ 2024-10-28 11:05 UTC (permalink / raw)
  To: davem, edumazet, kuba, pabeni, dsahern, willemdebruijn.kernel,
	willemb, ast, daniel, andrii, martin.lau, eddyz87, song,
	yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa, shuah,
	ykolal
  Cc: bpf, netdev, Jason Xing

From: Jason Xing <kernelxing@tencent.com>

This patch mirrors how the cmsg feature works: the bpf program is
called on every udp_sendmsg() so that it can check and set
sk_tsflags_bpf before the flags are passed to the cork tsflags.

Signed-off-by: Jason Xing <kernelxing@tencent.com>
---
 include/net/sock.h             | 1 +
 include/uapi/linux/bpf.h       | 3 +++
 net/core/skbuff.c              | 2 +-
 net/ipv4/udp.c                 | 1 +
 tools/include/uapi/linux/bpf.h | 3 +++
 5 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 062f405c744e..cf7fea456455 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -2828,6 +2828,7 @@ static inline bool sk_listener_or_tw(const struct sock *sk)
 }
 
 void sock_enable_timestamp(struct sock *sk, enum sock_flags flag);
+void timestamp_call_bpf(struct sock *sk, int op, u32 nargs, u32 *args);
 int sock_recv_errqueue(struct sock *sk, struct msghdr *msg, int len, int level,
 		       int type);
 
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 6fc3bd12b650..055ffa7c965c 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -7028,6 +7028,9 @@ enum {
 					 * feature is on. It indicates the
 					 * recorded timestamp.
 					 */
+	BPF_SOCK_OPS_TS_UDP_SND_CB,	/* Called when every udp_sendmsg
+					 * syscall is triggered
+					 */
 };
 
 /* List of TCP states. There is a build check in net/ipv4/tcp.c to detect
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 8b2a79c0fe1c..0b571306f7ea 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -5622,7 +5622,7 @@ static void skb_tstamp_tx_output(struct sk_buff *orig_skb,
 	__skb_complete_tx_timestamp(skb, sk, tstype, opt_stats);
 }
 
-static void timestamp_call_bpf(struct sock *sk, int op, u32 nargs, u32 *args)
+void timestamp_call_bpf(struct sock *sk, int op, u32 nargs, u32 *args)
 {
 	struct bpf_sock_ops_kern sock_ops;
 
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 9a20af41e272..e768421abc37 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1264,6 +1264,7 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 	if (!corkreq) {
 		struct inet_cork cork;
 
+		timestamp_call_bpf(sk, BPF_SOCK_OPS_TS_UDP_SND_CB, 0, NULL);
 		skb = ip_make_skb(sk, fl4, getfrag, msg, ulen,
 				  sizeof(struct udphdr), &ipc, &rt,
 				  &cork, msg->msg_flags);
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 6fc3bd12b650..055ffa7c965c 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -7028,6 +7028,9 @@ enum {
 					 * feature is on. It indicates the
 					 * recorded timestamp.
 					 */
+	BPF_SOCK_OPS_TS_UDP_SND_CB,	/* Called when every udp_sendmsg
+					 * syscall is triggered
+					 */
 };
 
 /* List of TCP states. There is a build check in net/ipv4/tcp.c to detect
-- 
2.37.3



* [PATCH net-next v3 08/14] net-timestamp: make bpf for tx timestamp work
  2024-10-28 11:05 [PATCH net-next v3 00/14] net-timestamp: bpf extension to equip applications transparently Jason Xing
                   ` (6 preceding siblings ...)
  2024-10-28 11:05 ` [PATCH net-next v3 07/14] net-timestamp: add a new triggered point to set sk_tsflags_bpf in UDP layer Jason Xing
@ 2024-10-28 11:05 ` Jason Xing
  2024-10-28 11:05 ` [PATCH net-next v3 09/14] net-timestamp: add a common helper to set tskey Jason Xing
                   ` (5 subsequent siblings)
  13 siblings, 0 replies; 88+ messages in thread
From: Jason Xing @ 2024-10-28 11:05 UTC (permalink / raw)
  To: davem, edumazet, kuba, pabeni, dsahern, willemdebruijn.kernel,
	willemb, ast, daniel, andrii, martin.lau, eddyz87, song,
	yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa, shuah,
	ykolal
  Cc: bpf, netdev, Jason Xing

From: Jason Xing <kernelxing@tencent.com>

The previous patches prepared the timestamp generation work, so it is
time to make it finally work for both the TCP and UDP protocols.

This is how I use it in a bpf program:
1) for UDP
case BPF_SOCK_OPS_TS_UDP_SND_CB:
	bpf_setsockopt(...);

2) for TCP
case BPF_SOCK_OPS_TCP_CONNECT_CB:
	bpf_setsockopt(...);

3) common part used to report the timestamp
case BPF_SOCK_OPS_TS_SCHED_OPT_CB:
	dport = bpf_ntohl(skops->remote_port);
	sport = skops->local_port;
	bpf_printk(...);

Signed-off-by: Jason Xing <kernelxing@tencent.com>
---
 include/net/sock.h    |  6 ++++++
 net/ipv4/ip_output.c  |  1 +
 net/ipv4/tcp.c        | 16 ++++++++++++++++
 net/ipv6/ip6_output.c |  1 +
 4 files changed, 24 insertions(+)

diff --git a/include/net/sock.h b/include/net/sock.h
index cf7fea456455..cf687efbea9f 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -2710,6 +2710,12 @@ static inline void sock_tx_timestamp(struct sock *sk,
 	_sock_tx_timestamp(sk, sockc, tx_flags, NULL);
 }
 
+static inline void sock_tx_timestamp_bpf(u32 tsflags, __u8 *tx_flags)
+{
+	if (tsflags)
+		__sock_tx_timestamp(tsflags, tx_flags);
+}
+
 static inline void skb_setup_tx_timestamp(struct sk_buff *skb,
 					  const struct sockcm_cookie *sockc)
 {
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 0065b1996c94..9d94a209057b 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -1332,6 +1332,7 @@ static int ip_setup_cork(struct sock *sk, struct inet_cork *cork,
 	cork->transmit_time = ipc->sockc.transmit_time;
 	cork->tx_flags = 0;
 	sock_tx_timestamp(sk, &ipc->sockc, &cork->tx_flags);
+	sock_tx_timestamp_bpf(READ_ONCE(sk->sk_tsflags_bpf), &cork->tx_flags);
 	if (ipc->sockc.tsflags & SOCKCM_FLAG_TS_OPT_ID) {
 		cork->flags |= IPCORK_TS_OPT_ID;
 		cork->ts_opt_id = ipc->sockc.ts_opt_id;
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 82cc4a5633ce..6b23b4aa3c91 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -477,6 +477,20 @@ void tcp_init_sock(struct sock *sk)
 }
 EXPORT_SYMBOL(tcp_init_sock);
 
+static void tcp_tx_timestamp_bpf(struct sock *sk, struct sk_buff *skb)
+{
+	u32 tsflags = READ_ONCE(sk->sk_tsflags_bpf);
+
+	if (tsflags && skb) {
+		struct skb_shared_info *shinfo = skb_shinfo(skb);
+		struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
+
+		sock_tx_timestamp_bpf(tsflags, &shinfo->tx_flags);
+		if (tsflags & SOF_TIMESTAMPING_TX_ACK)
+			tcb->txstamp_ack = 1;
+	}
+}
+
 static void tcp_tx_timestamp(struct sock *sk, struct sockcm_cookie *sockc)
 {
 	struct sk_buff *skb = tcp_write_queue_tail(sk);
@@ -492,6 +506,8 @@ static void tcp_tx_timestamp(struct sock *sk, struct sockcm_cookie *sockc)
 		if (tsflags & SOF_TIMESTAMPING_TX_RECORD_MASK)
 			shinfo->tskey = TCP_SKB_CB(skb)->seq + skb->len - 1;
 	}
+
+	tcp_tx_timestamp_bpf(sk, skb);
 }
 
 static bool tcp_stream_is_readable(struct sock *sk, int target)
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index f7b4608bb316..230e8d5a792c 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -1402,6 +1402,7 @@ static int ip6_setup_cork(struct sock *sk, struct inet_cork_full *cork,
 	cork->base.tx_flags = 0;
 	cork->base.mark = ipc6->sockc.mark;
 	sock_tx_timestamp(sk, &ipc6->sockc, &cork->base.tx_flags);
+	sock_tx_timestamp_bpf(READ_ONCE(sk->sk_tsflags_bpf), &cork->base.tx_flags);
 	if (ipc6->sockc.tsflags & SOCKCM_FLAG_TS_OPT_ID) {
 		cork->base.flags |= IPCORK_TS_OPT_ID;
 		cork->base.ts_opt_id = ipc6->sockc.ts_opt_id;
-- 
2.37.3



* [PATCH net-next v3 09/14] net-timestamp: add a common helper to set tskey
  2024-10-28 11:05 [PATCH net-next v3 00/14] net-timestamp: bpf extension to equip applications transparently Jason Xing
                   ` (7 preceding siblings ...)
  2024-10-28 11:05 ` [PATCH net-next v3 08/14] net-timestamp: make bpf for tx timestamp work Jason Xing
@ 2024-10-28 11:05 ` Jason Xing
  2024-10-28 11:05 ` [PATCH net-next v3 10/14] net-timestamp: add basic support with tskey offset Jason Xing
                   ` (4 subsequent siblings)
  13 siblings, 0 replies; 88+ messages in thread
From: Jason Xing @ 2024-10-28 11:05 UTC (permalink / raw)
  To: davem, edumazet, kuba, pabeni, dsahern, willemdebruijn.kernel,
	willemb, ast, daniel, andrii, martin.lau, eddyz87, song,
	yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa, shuah,
	ykolal
  Cc: bpf, netdev, Jason Xing

From: Jason Xing <kernelxing@tencent.com>

No functional changes here. Only add a common helper so that we
can use it later for bpf extension easily.

Signed-off-by: Jason Xing <kernelxing@tencent.com>
---
 include/net/sock.h |  1 +
 net/core/sock.c    | 27 +++++++++++++++++++--------
 2 files changed, 20 insertions(+), 8 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index cf687efbea9f..91398b20a4a3 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -2917,6 +2917,7 @@ void sock_def_readable(struct sock *sk);
 
 int sock_bindtoindex(struct sock *sk, int ifindex, bool lock_sk);
 void sock_set_timestamp(struct sock *sk, int optname, bool valbool);
+int sock_set_tskey(struct sock *sk, int val, int bpf_type);
 int sock_set_timestamping(struct sock *sk, int optname,
 			  struct so_timestamping timestamping);
 
diff --git a/net/core/sock.c b/net/core/sock.c
index 7e05748b1a06..42c1aba0b3fe 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -891,21 +891,16 @@ static int sock_timestamping_bind_phc(struct sock *sk, int phc_index)
 	return 0;
 }
 
-int sock_set_timestamping(struct sock *sk, int optname,
-			  struct so_timestamping timestamping)
+int sock_set_tskey(struct sock *sk, int val, int bpf_type)
 {
-	int val = timestamping.flags;
-	int ret;
-
-	if (val & ~SOF_TIMESTAMPING_MASK)
-		return -EINVAL;
+	u32 tsflags = bpf_type ? sk->sk_tsflags_bpf : sk->sk_tsflags;
 
 	if (val & SOF_TIMESTAMPING_OPT_ID_TCP &&
 	    !(val & SOF_TIMESTAMPING_OPT_ID))
 		return -EINVAL;
 
 	if (val & SOF_TIMESTAMPING_OPT_ID &&
-	    !(sk->sk_tsflags & SOF_TIMESTAMPING_OPT_ID)) {
+	    !(tsflags & SOF_TIMESTAMPING_OPT_ID)) {
 		if (sk_is_tcp(sk)) {
 			if ((1 << sk->sk_state) &
 			    (TCPF_CLOSE | TCPF_LISTEN))
@@ -919,6 +914,22 @@ int sock_set_timestamping(struct sock *sk, int optname,
 		}
 	}
 
+	return 0;
+}
+
+int sock_set_timestamping(struct sock *sk, int optname,
+			  struct so_timestamping timestamping)
+{
+	int val = timestamping.flags;
+	int ret;
+
+	if (val & ~SOF_TIMESTAMPING_MASK)
+		return -EINVAL;
+
+	ret = sock_set_tskey(sk, val, 0);
+	if (ret)
+		return ret;
+
 	if (val & SOF_TIMESTAMPING_OPT_STATS &&
 	    !(val & SOF_TIMESTAMPING_OPT_TSONLY))
 		return -EINVAL;
-- 
2.37.3



* [PATCH net-next v3 10/14] net-timestamp: add basic support with tskey offset
  2024-10-28 11:05 [PATCH net-next v3 00/14] net-timestamp: bpf extension to equip applications transparently Jason Xing
                   ` (8 preceding siblings ...)
  2024-10-28 11:05 ` [PATCH net-next v3 09/14] net-timestamp: add a common helper to set tskey Jason Xing
@ 2024-10-28 11:05 ` Jason Xing
  2024-10-29  1:24   ` Willem de Bruijn
  2024-10-30  5:42   ` Martin KaFai Lau
  2024-10-28 11:05 ` [PATCH net-next v3 11/14] net-timestamp: support OPT_ID for TCP proto Jason Xing
                   ` (3 subsequent siblings)
  13 siblings, 2 replies; 88+ messages in thread
From: Jason Xing @ 2024-10-28 11:05 UTC (permalink / raw)
  To: davem, edumazet, kuba, pabeni, dsahern, willemdebruijn.kernel,
	willemb, ast, daniel, andrii, martin.lau, eddyz87, song,
	yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa, shuah,
	ykolal
  Cc: bpf, netdev, Jason Xing

From: Jason Xing <kernelxing@tencent.com>

Use the offset to record the delta between the current socket key
and the bpf socket key.

1. If only the bpf feature is running, the socket key is the bpf socket
key and the offset is zero;
2. If only the traditional feature is running and the bpf feature is
then turned on, the socket key stays with the former while the offset
records the delta between them;
3. If only the bpf feature is running and the application then starts
using it as well, the socket key is re-initialized for the application
and the offset records the delta.
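
The three cases can be traced with a small user-space model. This is only
a sketch for a UDP-style socket, where a fresh key starts at 0. The names
model_sock, model_enable_opt_id(), model_send() and model_bpf_key() are
hypothetical and exist only for illustration; they are not kernel code.
The arithmetic is unsigned 32-bit, so the recorded delta may wrap.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the few struct sock fields involved. */
struct model_sock {
	bool app_opt_id;	/* OPT_ID set through setsockopt() */
	bool bpf_opt_id;	/* OPT_ID set through bpf_setsockopt() */
	uint32_t tskey;		/* models sk_tskey */
	uint32_t bpf_offset;	/* models sk_tskey_bpf_offset */
};

/* Enable OPT_ID for one user; UDP-style, so a fresh key is 0. */
static void model_enable_opt_id(struct model_sock *sk, bool bpf_type)
{
	const uint32_t fresh = 0;

	/* Enabling twice for the same user is not modelled. */
	assert(!(bpf_type ? sk->bpf_opt_id : sk->app_opt_id));

	if (bpf_type && sk->app_opt_id) {
		/* Case 2: the application keeps the key, bpf records a delta. */
		sk->bpf_offset = fresh - sk->tskey;
	} else if (!bpf_type && sk->bpf_opt_id) {
		/* Case 3: record the delta, then re-init the key for the app. */
		sk->bpf_offset = sk->tskey - fresh;
		sk->tskey = fresh;
	} else {
		/* Case 1 (or application only): nothing to reconcile. */
		sk->bpf_offset = 0;
		sk->tskey = fresh;
	}

	if (bpf_type)
		sk->bpf_opt_id = true;
	else
		sk->app_opt_id = true;
}

/* Key stamped on the next packet (what the application would see). */
static uint32_t model_send(struct model_sock *sk)
{
	return sk->tskey++;
}

/* Key the bpf program would see for a packet stamped with skb_tskey. */
static uint32_t model_bpf_key(const struct model_sock *sk, uint32_t skb_tskey)
{
	return skb_tskey + sk->bpf_offset;
}
```

In this model, if the application sends three packets (keys 0, 1, 2)
before bpf enables OPT_ID, the next packet carries key 3 for the
application and key 0 for bpf. In the opposite order, with two packets
sent under bpf only, the next packet carries key 0 for the application
and key 2 for bpf.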

Signed-off-by: Jason Xing <kernelxing@tencent.com>
---
 include/net/sock.h |  1 +
 net/core/skbuff.c  | 15 ++++++++---
 net/core/sock.c    | 66 ++++++++++++++++++++++++++++++++++++++--------
 3 files changed, 68 insertions(+), 14 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 91398b20a4a3..41c6c6f78e55 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -469,6 +469,7 @@ struct sock {
 	unsigned long		sk_pacing_rate; /* bytes per second */
 	atomic_t		sk_zckey;
 	atomic_t		sk_tskey;
+	u32			sk_tskey_bpf_offset;
 	__cacheline_group_end(sock_write_tx);
 
 	__cacheline_group_begin(sock_read_tx);
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 0b571306f7ea..d1739317b97d 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -5641,9 +5641,10 @@ void timestamp_call_bpf(struct sock *sk, int op, u32 nargs, u32 *args)
 }
 
 static void skb_tstamp_tx_output_bpf(struct sock *sk, int tstype,
+				     struct sk_buff *skb,
 				     struct skb_shared_hwtstamps *hwtstamps)
 {
-	u32 args[2] = {0, 0};
+	u32 args[3] = {0, 0, 0};
 	u32 tsflags, cb_flag;
 
 	tsflags = READ_ONCE(sk->sk_tsflags_bpf);
@@ -5672,7 +5673,15 @@ static void skb_tstamp_tx_output_bpf(struct sock *sk, int tstype,
 		args[1] = ts.tv_nsec;
 	}
 
-	timestamp_call_bpf(sk, cb_flag, 2, args);
+	if (tsflags & SOF_TIMESTAMPING_OPT_ID) {
+		args[2] = skb_shinfo(skb)->tskey;
+		if (sk_is_tcp(sk))
+			args[2] -= atomic_read(&sk->sk_tskey);
+		if (sk->sk_tskey_bpf_offset)
+			args[2] += sk->sk_tskey_bpf_offset;
+	}
+
+	timestamp_call_bpf(sk, cb_flag, 3, args);
 }
 
 void __skb_tstamp_tx(struct sk_buff *orig_skb,
@@ -5683,7 +5692,7 @@ void __skb_tstamp_tx(struct sk_buff *orig_skb,
 	if (!sk)
 		return;
 
-	skb_tstamp_tx_output_bpf(sk, tstype, hwtstamps);
+	skb_tstamp_tx_output_bpf(sk, tstype, orig_skb, hwtstamps);
 	skb_tstamp_tx_output(orig_skb, ack_skb, hwtstamps, sk, tstype);
 }
 EXPORT_SYMBOL_GPL(__skb_tstamp_tx);
diff --git a/net/core/sock.c b/net/core/sock.c
index 42c1aba0b3fe..914ec8046f86 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -891,6 +891,49 @@ static int sock_timestamping_bind_phc(struct sock *sk, int phc_index)
 	return 0;
 }
 
+/* Used to track the tskey for bpf extension
+ *
+ * @sk_tskey: bpf extension can use it only when no application uses.
+ *            Application can use it directly regardless of bpf extension.
+ *
+ * There are three strategies:
+ * 1) If we've already set through setsockopt() and here we're going to set
+ *    OPT_ID for bpf use, we will not re-initialize the @sk_tskey and will
+ *    keep the record of delta between the current "key" and previous key.
+ * 2) If we've already set through bpf_setsockopt() and here we're going to
+ *    set for application use, we will record the delta first and then
+ *    override/initialize the @sk_tskey.
+ * 3) other cases, which means only either of them takes effect, so initialize
+ *    everything simply.
+ */
+static long int sock_calculate_tskey_offset(struct sock *sk, int val, int bpf_type)
+{
+	u32 tskey;
+
+	if (sk_is_tcp(sk)) {
+		if ((1 << sk->sk_state) & (TCPF_CLOSE | TCPF_LISTEN))
+			return -EINVAL;
+
+		if (val & SOF_TIMESTAMPING_OPT_ID_TCP)
+			tskey = tcp_sk(sk)->write_seq;
+		else
+			tskey = tcp_sk(sk)->snd_una;
+	} else {
+		tskey = 0;
+	}
+
+	if (bpf_type && (sk->sk_tsflags & SOF_TIMESTAMPING_OPT_ID)) {
+		sk->sk_tskey_bpf_offset = tskey - atomic_read(&sk->sk_tskey);
+		return 0;
+	} else if (!bpf_type && (sk->sk_tsflags_bpf & SOF_TIMESTAMPING_OPT_ID)) {
+		sk->sk_tskey_bpf_offset = atomic_read(&sk->sk_tskey) - tskey;
+	} else {
+		sk->sk_tskey_bpf_offset = 0;
+	}
+
+	return tskey;
+}
+
 int sock_set_tskey(struct sock *sk, int val, int bpf_type)
 {
 	u32 tsflags = bpf_type ? sk->sk_tsflags_bpf : sk->sk_tsflags;
@@ -901,17 +944,13 @@ int sock_set_tskey(struct sock *sk, int val, int bpf_type)
 
 	if (val & SOF_TIMESTAMPING_OPT_ID &&
 	    !(tsflags & SOF_TIMESTAMPING_OPT_ID)) {
-		if (sk_is_tcp(sk)) {
-			if ((1 << sk->sk_state) &
-			    (TCPF_CLOSE | TCPF_LISTEN))
-				return -EINVAL;
-			if (val & SOF_TIMESTAMPING_OPT_ID_TCP)
-				atomic_set(&sk->sk_tskey, tcp_sk(sk)->write_seq);
-			else
-				atomic_set(&sk->sk_tskey, tcp_sk(sk)->snd_una);
-		} else {
-			atomic_set(&sk->sk_tskey, 0);
-		}
+		long int ret;
+
+		ret = sock_calculate_tskey_offset(sk, val, bpf_type);
+		if (ret <= 0)
+			return ret;
+
+		atomic_set(&sk->sk_tskey, ret);
 	}
 
 	return 0;
@@ -956,10 +995,15 @@ static int sock_set_timestamping_bpf(struct sock *sk,
 				     struct so_timestamping timestamping)
 {
 	u32 flags = timestamping.flags;
+	int ret;
 
 	if (flags & ~SOF_TIMESTAMPING_BPF_SUPPPORTED_MASK)
 		return -EINVAL;
 
+	ret = sock_set_tskey(sk, flags, 1);
+	if (ret)
+		return ret;
+
 	WRITE_ONCE(sk->sk_tsflags_bpf, flags);
 
 	return 0;
-- 
2.37.3



* [PATCH net-next v3 11/14] net-timestamp: support OPT_ID for TCP proto
  2024-10-28 11:05 [PATCH net-next v3 00/14] net-timestamp: bpf extension to equip applications transparently Jason Xing
                   ` (9 preceding siblings ...)
  2024-10-28 11:05 ` [PATCH net-next v3 10/14] net-timestamp: add basic support with tskey offset Jason Xing
@ 2024-10-28 11:05 ` Jason Xing
  2024-10-28 11:05 ` [PATCH net-next v3 12/14] net-timestamp: add OPT_ID for UDP proto Jason Xing
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 88+ messages in thread
From: Jason Xing @ 2024-10-28 11:05 UTC (permalink / raw)
  To: davem, edumazet, kuba, pabeni, dsahern, willemdebruijn.kernel,
	willemb, ast, daniel, andrii, martin.lau, eddyz87, song,
	yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa, shuah,
	ykolal
  Cc: bpf, netdev, Jason Xing

From: Jason Xing <kernelxing@tencent.com>

Make OPT_ID work for the TCP protocol: set the skb's tskey when the bpf
tsflags request a TX record.
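
For TCP, the diff below stores the sequence number of the last payload
byte as the tskey, and patch 10 of this series subtracts sk_tskey before
the key is handed to the bpf program. A minimal user-space sketch of that
arithmetic follows. The function names are hypothetical, and the
sk_tskey_bpf_offset adjustment is left out for brevity. Everything is
unsigned 32-bit, so sequence-number wraparound cancels out.

```c
#include <stdint.h>

/* tskey stored in the skb: sequence number of the last payload byte. */
static uint32_t model_tcp_tskey(uint32_t seq, uint32_t len)
{
	return seq + len - 1;
}

/* Key relative to the base that was sampled when OPT_ID was enabled. */
static uint32_t model_tcp_relative_key(uint32_t tskey, uint32_t base)
{
	return tskey - base;
}
```

For example, a 32-byte skb starting at sequence 0xFFFFFFF0, with the
base also at 0xFFFFFFF0, gets tskey 0x0000000F after wraparound, and the
relative key is 31, the byte offset of its last byte.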

Signed-off-by: Jason Xing <kernelxing@tencent.com>
---
 net/ipv4/tcp.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 6b23b4aa3c91..f77dc7a4a98e 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -488,6 +488,8 @@ static void tcp_tx_timestamp_bpf(struct sock *sk, struct sk_buff *skb)
 		sock_tx_timestamp_bpf(tsflags, &shinfo->tx_flags);
 		if (tsflags & SOF_TIMESTAMPING_TX_ACK)
 			tcb->txstamp_ack = 1;
+		if (tsflags & SOF_TIMESTAMPING_TX_RECORD_MASK)
+			shinfo->tskey = TCP_SKB_CB(skb)->seq + skb->len - 1;
 	}
 }
 
-- 
2.37.3



* [PATCH net-next v3 12/14] net-timestamp: add OPT_ID for UDP proto
  2024-10-28 11:05 [PATCH net-next v3 00/14] net-timestamp: bpf extension to equip applications transparently Jason Xing
                   ` (10 preceding siblings ...)
  2024-10-28 11:05 ` [PATCH net-next v3 11/14] net-timestamp: support OPT_ID for TCP proto Jason Xing
@ 2024-10-28 11:05 ` Jason Xing
  2024-10-28 11:05 ` [PATCH net-next v3 13/14] net-timestamp: use static key to control bpf extension Jason Xing
  2024-10-28 11:05 ` [PATCH net-next v3 14/14] bpf: add simple bpf tests in the tx path for so_timstamping feature Jason Xing
  13 siblings, 0 replies; 88+ messages in thread
From: Jason Xing @ 2024-10-28 11:05 UTC (permalink / raw)
  To: davem, edumazet, kuba, pabeni, dsahern, willemdebruijn.kernel,
	willemb, ast, daniel, andrii, martin.lau, eddyz87, song,
	yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa, shuah,
	ykolal
  Cc: bpf, netdev, Jason Xing

From: Jason Xing <kernelxing@tencent.com>

Make OPT_ID work for the UDP protocol: take a tskey from sk_tskey when
the bpf tsflags request OPT_ID and no key is held yet.
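
The key selection in the diff below can be read as the following
user-space model. It is a sketch only: model_udp_sock, model_key and
model_pick_tskey() are hypothetical names, the booleans stand in for the
flag tests in the diff, and the plain counter stands in for the atomic
sk_tskey.

```c
#include <stdbool.h>
#include <stdint.h>

struct model_udp_sock {
	bool app_opt_id;	/* sk_tsflags has SOF_TIMESTAMPING_OPT_ID */
	bool bpf_opt_id;	/* sk_tsflags_bpf has SOF_TIMESTAMPING_OPT_ID */
	uint32_t tskey;		/* models the sk_tskey counter */
};

struct model_key {
	uint32_t tskey;		/* key chosen for this send */
	bool hold_tskey;	/* true if the key came from the counter */
};

static struct model_key model_pick_tskey(struct model_udp_sock *sk,
					 bool any_tstamp,
					 bool cork_has_id, uint32_t cork_id)
{
	struct model_key k = { .tskey = 0, .hold_tskey = false };

	/* Models the SKBTX_ANY_TSTAMP test on cork->tx_flags. */
	if (!any_tstamp)
		return k;

	if (sk->app_opt_id) {
		if (cork_has_id) {
			/* Key supplied per call (IPCORK_TS_OPT_ID). */
			k.tskey = cork_id;
		} else {
			/* atomic_inc_return() - 1: use the old value. */
			k.tskey = sk->tskey++;
			k.hold_tskey = true;
		}
	}
	if (!k.hold_tskey && sk->bpf_opt_id) {
		k.tskey = sk->tskey++;
		k.hold_tskey = true;
	}
	return k;
}
```

When both users have OPT_ID set and no per-call key is supplied, the
counter is advanced only once per send. In this model a per-call key is
replaced by a counter value when bpf also requests OPT_ID, because the
first branch leaves hold_tskey unset; that appears to match the diff as
posted.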

Signed-off-by: Jason Xing <kernelxing@tencent.com>
---
 net/ipv4/ip_output.c  | 16 +++++++++++-----
 net/ipv6/ip6_output.c | 16 +++++++++++-----
 2 files changed, 22 insertions(+), 10 deletions(-)

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 9d94a209057b..45033105b34c 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -1049,11 +1049,17 @@ static int __ip_append_data(struct sock *sk,
 
 	cork->length += length;
 
-	if (cork->tx_flags & SKBTX_ANY_TSTAMP &&
-	    READ_ONCE(sk->sk_tsflags) & SOF_TIMESTAMPING_OPT_ID) {
-		if (cork->flags & IPCORK_TS_OPT_ID) {
-			tskey = cork->ts_opt_id;
-		} else {
+	if (cork->tx_flags & SKBTX_ANY_TSTAMP) {
+		if (READ_ONCE(sk->sk_tsflags) & SOF_TIMESTAMPING_OPT_ID) {
+			if (cork->flags & IPCORK_TS_OPT_ID) {
+				tskey = cork->ts_opt_id;
+			} else {
+				tskey = atomic_inc_return(&sk->sk_tskey) - 1;
+				hold_tskey = true;
+			}
+		}
+		if (!hold_tskey &&
+		    READ_ONCE(sk->sk_tsflags_bpf) & SOF_TIMESTAMPING_OPT_ID) {
 			tskey = atomic_inc_return(&sk->sk_tskey) - 1;
 			hold_tskey = true;
 		}
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 230e8d5a792c..ec956ada7179 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -1547,11 +1547,17 @@ static int __ip6_append_data(struct sock *sk,
 			flags &= ~MSG_SPLICE_PAGES;
 	}
 
-	if (cork->tx_flags & SKBTX_ANY_TSTAMP &&
-	    READ_ONCE(sk->sk_tsflags) & SOF_TIMESTAMPING_OPT_ID) {
-		if (cork->flags & IPCORK_TS_OPT_ID) {
-			tskey = cork->ts_opt_id;
-		} else {
+	if (cork->tx_flags & SKBTX_ANY_TSTAMP) {
+		if (READ_ONCE(sk->sk_tsflags) & SOF_TIMESTAMPING_OPT_ID) {
+			if (cork->flags & IPCORK_TS_OPT_ID) {
+				tskey = cork->ts_opt_id;
+			} else {
+				tskey = atomic_inc_return(&sk->sk_tskey) - 1;
+				hold_tskey = true;
+			}
+		}
+		if (!hold_tskey &&
+		    READ_ONCE(sk->sk_tsflags_bpf) & SOF_TIMESTAMPING_OPT_ID) {
 			tskey = atomic_inc_return(&sk->sk_tskey) - 1;
 			hold_tskey = true;
 		}
-- 
2.37.3



* [PATCH net-next v3 13/14] net-timestamp: use static key to control bpf extension
  2024-10-28 11:05 [PATCH net-next v3 00/14] net-timestamp: bpf extension to equip applications transparently Jason Xing
                   ` (11 preceding siblings ...)
  2024-10-28 11:05 ` [PATCH net-next v3 12/14] net-timestamp: add OPT_ID for UDP proto Jason Xing
@ 2024-10-28 11:05 ` Jason Xing
  2024-10-28 11:05 ` [PATCH net-next v3 14/14] bpf: add simple bpf tests in the tx path for so_timstamping feature Jason Xing
  13 siblings, 0 replies; 88+ messages in thread
From: Jason Xing @ 2024-10-28 11:05 UTC (permalink / raw)
  To: davem, edumazet, kuba, pabeni, dsahern, willemdebruijn.kernel,
	willemb, ast, daniel, andrii, martin.lau, eddyz87, song,
	yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa, shuah,
	ykolal
  Cc: bpf, netdev, Jason Xing

From: Jason Xing <kernelxing@tencent.com>

Use the existing cgroup static key to gate every possible call into
the bpf extension.

Signed-off-by: Jason Xing <kernelxing@tencent.com>
---
 net/core/skbuff.c     | 3 ++-
 net/core/sock.c       | 4 ++--
 net/ipv4/ip_output.c  | 5 +++--
 net/ipv4/tcp.c        | 3 ++-
 net/ipv4/udp.c        | 3 ++-
 net/ipv6/ip6_output.c | 5 +++--
 6 files changed, 14 insertions(+), 9 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index d1739317b97d..2e5af24802ee 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -5692,7 +5692,8 @@ void __skb_tstamp_tx(struct sk_buff *orig_skb,
 	if (!sk)
 		return;
 
-	skb_tstamp_tx_output_bpf(sk, tstype, orig_skb, hwtstamps);
+	if (cgroup_bpf_enabled(CGROUP_SOCK_OPS))
+		skb_tstamp_tx_output_bpf(sk, tstype, orig_skb, hwtstamps);
 	skb_tstamp_tx_output(orig_skb, ack_skb, hwtstamps, sk, tstype);
 }
 EXPORT_SYMBOL_GPL(__skb_tstamp_tx);
diff --git a/net/core/sock.c b/net/core/sock.c
index 914ec8046f86..3a6f7c9b6459 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1479,7 +1479,7 @@ int sk_setsockopt(struct sock *sk, int level, int optname,
 		}
 		if (!bpf_timetamping)
 			ret = sock_set_timestamping(sk, optname, timestamping);
-		else
+		else if (cgroup_bpf_enabled(CGROUP_SOCK_OPS))
 			ret = sock_set_timestamping_bpf(sk, timestamping);
 		break;
 
@@ -1869,7 +1869,7 @@ int sk_getsockopt(struct sock *sk, int level, int optname,
 				v.timestamping.flags = READ_ONCE(sk->sk_tsflags);
 				v.timestamping.bind_phc = READ_ONCE(sk->sk_bind_phc);
 			}
-		} else {
+		} else if (cgroup_bpf_enabled(CGROUP_SOCK_OPS)) {
 			v.timestamping.flags = READ_ONCE(sk->sk_tsflags_bpf);
 		}
 		break;
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 45033105b34c..9678a88714e5 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -1058,7 +1058,7 @@ static int __ip_append_data(struct sock *sk,
 				hold_tskey = true;
 			}
 		}
-		if (!hold_tskey &&
+		if (cgroup_bpf_enabled(CGROUP_SOCK_OPS) && !hold_tskey &&
 		    READ_ONCE(sk->sk_tsflags_bpf) & SOF_TIMESTAMPING_OPT_ID) {
 			tskey = atomic_inc_return(&sk->sk_tskey) - 1;
 			hold_tskey = true;
@@ -1338,7 +1338,8 @@ static int ip_setup_cork(struct sock *sk, struct inet_cork *cork,
 	cork->transmit_time = ipc->sockc.transmit_time;
 	cork->tx_flags = 0;
 	sock_tx_timestamp(sk, &ipc->sockc, &cork->tx_flags);
-	sock_tx_timestamp_bpf(READ_ONCE(sk->sk_tsflags_bpf), &cork->tx_flags);
+	if (cgroup_bpf_enabled(CGROUP_SOCK_OPS))
+		sock_tx_timestamp_bpf(READ_ONCE(sk->sk_tsflags_bpf), &cork->tx_flags);
 	if (ipc->sockc.tsflags & SOCKCM_FLAG_TS_OPT_ID) {
 		cork->flags |= IPCORK_TS_OPT_ID;
 		cork->ts_opt_id = ipc->sockc.ts_opt_id;
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index f77dc7a4a98e..8f42c254bc7e 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -509,7 +509,8 @@ static void tcp_tx_timestamp(struct sock *sk, struct sockcm_cookie *sockc)
 			shinfo->tskey = TCP_SKB_CB(skb)->seq + skb->len - 1;
 	}
 
-	tcp_tx_timestamp_bpf(sk, skb);
+	if (cgroup_bpf_enabled(CGROUP_SOCK_OPS))
+		tcp_tx_timestamp_bpf(sk, skb);
 }
 
 static bool tcp_stream_is_readable(struct sock *sk, int target)
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index e768421abc37..27cf2f8a9409 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1264,7 +1264,8 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 	if (!corkreq) {
 		struct inet_cork cork;
 
-		timestamp_call_bpf(sk, BPF_SOCK_OPS_TS_UDP_SND_CB, 0, NULL);
+		if (cgroup_bpf_enabled(CGROUP_SOCK_OPS))
+			timestamp_call_bpf(sk, BPF_SOCK_OPS_TS_UDP_SND_CB, 0, NULL);
 		skb = ip_make_skb(sk, fl4, getfrag, msg, ulen,
 				  sizeof(struct udphdr), &ipc, &rt,
 				  &cork, msg->msg_flags);
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index ec956ada7179..3a96fb09f068 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -1402,7 +1402,8 @@ static int ip6_setup_cork(struct sock *sk, struct inet_cork_full *cork,
 	cork->base.tx_flags = 0;
 	cork->base.mark = ipc6->sockc.mark;
 	sock_tx_timestamp(sk, &ipc6->sockc, &cork->base.tx_flags);
-	sock_tx_timestamp_bpf(READ_ONCE(sk->sk_tsflags_bpf), &cork->base.tx_flags);
+	if (cgroup_bpf_enabled(CGROUP_SOCK_OPS))
+		sock_tx_timestamp_bpf(READ_ONCE(sk->sk_tsflags_bpf), &cork->base.tx_flags);
 	if (ipc6->sockc.tsflags & SOCKCM_FLAG_TS_OPT_ID) {
 		cork->base.flags |= IPCORK_TS_OPT_ID;
 		cork->base.ts_opt_id = ipc6->sockc.ts_opt_id;
@@ -1556,7 +1557,7 @@ static int __ip6_append_data(struct sock *sk,
 				hold_tskey = true;
 			}
 		}
-		if (!hold_tskey &&
+		if (cgroup_bpf_enabled(CGROUP_SOCK_OPS) && !hold_tskey &&
 		    READ_ONCE(sk->sk_tsflags_bpf) & SOF_TIMESTAMPING_OPT_ID) {
 			tskey = atomic_inc_return(&sk->sk_tskey) - 1;
 			hold_tskey = true;
-- 
2.37.3



* [PATCH net-next v3 14/14] bpf: add simple bpf tests in the tx path for so_timstamping feature
  2024-10-28 11:05 [PATCH net-next v3 00/14] net-timestamp: bpf extension to equip applications transparently Jason Xing
                   ` (12 preceding siblings ...)
  2024-10-28 11:05 ` [PATCH net-next v3 13/14] net-timestamp: use static key to control bpf extension Jason Xing
@ 2024-10-28 11:05 ` Jason Xing
  2024-10-29  1:26   ` Willem de Bruijn
  2024-10-30  5:57   ` Martin KaFai Lau
  13 siblings, 2 replies; 88+ messages in thread
From: Jason Xing @ 2024-10-28 11:05 UTC (permalink / raw)
  To: davem, edumazet, kuba, pabeni, dsahern, willemdebruijn.kernel,
	willemb, ast, daniel, andrii, martin.lau, eddyz87, song,
	yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa, shuah,
	ykolal
  Cc: bpf, netdev, Jason Xing

From: Jason Xing <kernelxing@tencent.com>

Only check that we pass through the three key points (sched, software
tx and ack) after enabling the bpf extension for so_timestamping. At
each point, we can choose whether to print the current timestamp.
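
The sockopt table in the bpf program below expects bpf_getsockopt() to
read back 256 and 66450. Both values are the OR of SOF_TIMESTAMPING_*
bits, which the following sketch recomputes. The MODEL_ names are
hypothetical; the bit positions are mirrored from
include/uapi/linux/net_tstamp.h so that the block builds without kernel
headers.

```c
#include <stdint.h>

/* Bit positions mirrored from include/uapi/linux/net_tstamp.h. */
enum {
	MODEL_SOF_TIMESTAMPING_TX_SOFTWARE = 1 << 1,
	MODEL_SOF_TIMESTAMPING_SOFTWARE    = 1 << 4,
	MODEL_SOF_TIMESTAMPING_OPT_ID      = 1 << 7,
	MODEL_SOF_TIMESTAMPING_TX_SCHED    = 1 << 8,
	MODEL_SOF_TIMESTAMPING_TX_ACK      = 1 << 9,
	MODEL_SOF_TIMESTAMPING_OPT_ID_TCP  = 1 << 16,
};

/* Same flag set as SOF_TIMESTAMPING_BPF_SUPPPORTED_MASK in the test. */
static uint32_t model_bpf_supported_mask(void)
{
	return MODEL_SOF_TIMESTAMPING_SOFTWARE |
	       MODEL_SOF_TIMESTAMPING_TX_SCHED |
	       MODEL_SOF_TIMESTAMPING_TX_SOFTWARE |
	       MODEL_SOF_TIMESTAMPING_TX_ACK |
	       MODEL_SOF_TIMESTAMPING_OPT_ID |
	       MODEL_SOF_TIMESTAMPING_OPT_ID_TCP;
}
```

TX_SCHED alone is 256, and the full mask is 16 + 256 + 2 + 512 + 128 +
65536 = 66450.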

Signed-off-by: Jason Xing <kernelxing@tencent.com>
---
 .../bpf/prog_tests/so_timestamping.c          |  98 ++++++++++++++
 .../selftests/bpf/progs/so_timestamping.c     | 123 ++++++++++++++++++
 2 files changed, 221 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/so_timestamping.c
 create mode 100644 tools/testing/selftests/bpf/progs/so_timestamping.c

diff --git a/tools/testing/selftests/bpf/prog_tests/so_timestamping.c b/tools/testing/selftests/bpf/prog_tests/so_timestamping.c
new file mode 100644
index 000000000000..dfb7588c246d
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/so_timestamping.c
@@ -0,0 +1,98 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2024 Tencent */
+
+#define _GNU_SOURCE
+#include <sched.h>
+#include <linux/socket.h>
+#include <linux/tls.h>
+#include <net/if.h>
+
+#include "test_progs.h"
+#include "cgroup_helpers.h"
+#include "network_helpers.h"
+
+#include "so_timestamping.skel.h"
+
+#define CG_NAME "/so-timestamping-test"
+
+static const char addr4_str[] = "127.0.0.1";
+static const char addr6_str[] = "::1";
+static struct so_timestamping *skel;
+static int cg_fd;
+
+static int create_netns(void)
+{
+	if (!ASSERT_OK(unshare(CLONE_NEWNET), "create netns"))
+		return -1;
+
+	if (!ASSERT_OK(system("ip link set dev lo up"), "set lo up"))
+		return -1;
+
+	return 0;
+}
+
+static void test_tcp(int family)
+{
+	struct so_timestamping__bss *bss = skel->bss;
+	char buf[] = "testing testing";
+	int sfd = -1, cfd = -1;
+	int n;
+
+	memset(bss, 0, sizeof(*bss));
+
+	sfd = start_server(family, SOCK_STREAM,
+			   family == AF_INET6 ? addr6_str : addr4_str, 0, 0);
+	if (!ASSERT_GE(sfd, 0, "start_server"))
+		goto out;
+
+	cfd = connect_to_fd(sfd, 0);
+	if (!ASSERT_GE(cfd, 0, "connect_to_fd_server")) {
+		close(sfd);
+		goto out;
+	}
+
+	n = write(cfd, buf, sizeof(buf));
+	if (!ASSERT_EQ(n, sizeof(buf), "send to server"))
+		goto out;
+
+	ASSERT_EQ(bss->nr_active, 1, "nr_active");
+	ASSERT_EQ(bss->nr_passive, 1, "nr_passive");
+	ASSERT_EQ(bss->nr_sched, 1, "nr_sched");
+	ASSERT_EQ(bss->nr_txsw, 1, "nr_txsw");
+	ASSERT_EQ(bss->nr_ack, 1, "nr_ack");
+
+out:
+	if (sfd >= 0)
+		close(sfd);
+	if (cfd >= 0)
+		close(cfd);
+}
+
+void test_so_timestamping(void)
+{
+	cg_fd = test__join_cgroup(CG_NAME);
+	if (cg_fd < 0)
+		return;
+
+	if (create_netns())
+		goto done;
+
+	skel = so_timestamping__open();
+	if (!ASSERT_OK_PTR(skel, "open skel"))
+		goto done;
+
+	if (!ASSERT_OK(so_timestamping__load(skel), "load skel"))
+		goto done;
+
+	skel->links.skops_sockopt =
+		bpf_program__attach_cgroup(skel->progs.skops_sockopt, cg_fd);
+	if (!ASSERT_OK_PTR(skel->links.skops_sockopt, "attach cgroup"))
+		goto done;
+
+	test_tcp(AF_INET6);
+	test_tcp(AF_INET);
+
+done:
+	so_timestamping__destroy(skel);
+	close(cg_fd);
+}
diff --git a/tools/testing/selftests/bpf/progs/so_timestamping.c b/tools/testing/selftests/bpf/progs/so_timestamping.c
new file mode 100644
index 000000000000..a15317951786
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/so_timestamping.c
@@ -0,0 +1,123 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2024 Tencent */
+
+#include "vmlinux.h"
+#include "bpf_tracing_net.h"
+#include <bpf/bpf_core_read.h>
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+#include "bpf_misc.h"
+
+#define SO_TIMESTAMPING 37
+#define SOF_TIMESTAMPING_BPF_SUPPPORTED_MASK (SOF_TIMESTAMPING_SOFTWARE | \
+					      SOF_TIMESTAMPING_TX_SCHED | \
+					      SOF_TIMESTAMPING_TX_SOFTWARE | \
+					      SOF_TIMESTAMPING_TX_ACK | \
+					      SOF_TIMESTAMPING_OPT_ID | \
+					      SOF_TIMESTAMPING_OPT_ID_TCP)
+
+extern unsigned long CONFIG_HZ __kconfig;
+
+int nr_active;
+int nr_passive;
+int nr_sched;
+int nr_txsw;
+int nr_ack;
+
+struct sockopt_test {
+	int opt;
+	int new;
+	int expected;
+};
+
+static const struct sockopt_test sol_socket_tests[] = {
+	{ .opt = SO_TIMESTAMPING, .new = SOF_TIMESTAMPING_TX_SCHED, .expected = 256, },
+	{ .opt = SO_TIMESTAMPING, .new = SOF_TIMESTAMPING_BPF_SUPPPORTED_MASK, .expected = 66450, },
+	{ .opt = 0, },
+};
+
+struct loop_ctx {
+	void *ctx;
+	struct sock *sk;
+};
+
+static int bpf_test_sockopt_int(void *ctx, struct sock *sk,
+				const struct sockopt_test *t,
+				int level)
+{
+	int tmp, new, expected, opt;
+
+	opt = t->opt;
+	new = t->new;
+	expected = t->expected;
+
+	if (bpf_setsockopt(ctx, level, opt, &new, sizeof(new)))
+		return 1;
+	if (bpf_getsockopt(ctx, level, opt, &tmp, sizeof(tmp)) ||
+	    tmp != expected)
+		return 1;
+
+	return 0;
+}
+
+static int bpf_test_socket_sockopt(__u32 i, struct loop_ctx *lc)
+{
+	const struct sockopt_test *t;
+
+	if (i >= ARRAY_SIZE(sol_socket_tests))
+		return 1;
+
+	t = &sol_socket_tests[i];
+	if (!t->opt)
+		return 1;
+
+	return bpf_test_sockopt_int(lc->ctx, lc->sk, t, SOL_SOCKET);
+}
+
+static int bpf_test_sockopt(void *ctx, struct sock *sk)
+{
+	struct loop_ctx lc = { .ctx = ctx, .sk = sk, };
+	int n;
+
+	n = bpf_loop(ARRAY_SIZE(sol_socket_tests), bpf_test_socket_sockopt, &lc, 0);
+	if (n != ARRAY_SIZE(sol_socket_tests))
+		return -1;
+
+	return 0;
+}
+
+SEC("sockops")
+int skops_sockopt(struct bpf_sock_ops *skops)
+{
+	struct bpf_sock *bpf_sk = skops->sk;
+	struct sock *sk;
+
+	if (!bpf_sk)
+		return 1;
+
+	sk = (struct sock *)bpf_skc_to_tcp_sock(bpf_sk);
+	if (!sk)
+		return 1;
+
+	switch (skops->op) {
+	case BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB:
+		nr_active += !bpf_test_sockopt(skops, sk);
+		break;
+	case BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB:
+		nr_passive += !bpf_test_sockopt(skops, sk);
+		break;
+	case BPF_SOCK_OPS_TS_SCHED_OPT_CB:
+		nr_sched += 1;
+		break;
+	case BPF_SOCK_OPS_TS_SW_OPT_CB:
+		nr_txsw += 1;
+		break;
+	case BPF_SOCK_OPS_TS_ACK_OPT_CB:
+		nr_ack += 1;
+		break;
+	}
+
+	return 1;
+}
+
+char _license[] SEC("license") = "GPL";
-- 
2.37.3



* Re: [PATCH net-next v3 04/14] net-timestamp: introduce TS_SCHED_OPT_CB to generate dev xmit timestamp
  2024-10-28 11:05 ` [PATCH net-next v3 04/14] net-timestamp: introduce TS_SCHED_OPT_CB to generate dev xmit timestamp Jason Xing
@ 2024-10-29  0:23   ` kernel test robot
  2024-10-29  1:02   ` Willem de Bruijn
  2024-10-29  1:04   ` kernel test robot
  2 siblings, 0 replies; 88+ messages in thread
From: kernel test robot @ 2024-10-29  0:23 UTC (permalink / raw)
  To: Jason Xing, davem, edumazet, kuba, pabeni, dsahern,
	willemdebruijn.kernel, willemb, ast, daniel, andrii, martin.lau,
	eddyz87, song, yonghong.song, john.fastabend, kpsingh, sdf,
	haoluo, jolsa, shuah, ykolal
  Cc: oe-kbuild-all, bpf, netdev, Jason Xing

Hi Jason,

kernel test robot noticed the following build errors:

[auto build test ERROR on net-next/main]

url:    https://github.com/intel-lab-lkp/linux/commits/Jason-Xing/net-timestamp-reorganize-in-skb_tstamp_tx_output/20241028-192036
base:   net-next/main
patch link:    https://lore.kernel.org/r/20241028110535.82999-5-kerneljasonxing%40gmail.com
patch subject: [PATCH net-next v3 04/14] net-timestamp: introduce TS_SCHED_OPT_CB to generate dev xmit timestamp
config: arm-randconfig-001-20241029 (https://download.01.org/0day-ci/archive/20241029/202410290828.ZqgMO8Xc-lkp@intel.com/config)
compiler: arm-linux-gnueabi-gcc (GCC) 14.1.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20241029/202410290828.ZqgMO8Xc-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add the following tags:
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202410290828.ZqgMO8Xc-lkp@intel.com/

All errors (new ones prefixed by >>):

   net/core/skbuff.c: In function 'timestamp_call_bpf':
>> net/core/skbuff.c:5640:9: error: implicit declaration of function 'BPF_CGROUP_RUN_PROG_SOCK_OPS_SK'; did you mean 'BPF_CGROUP_RUN_PROG_SOCK_OPS'? [-Wimplicit-function-declaration]
    5640 |         BPF_CGROUP_RUN_PROG_SOCK_OPS_SK(&sock_ops, sk);
         |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
         |         BPF_CGROUP_RUN_PROG_SOCK_OPS


vim +5640 net/core/skbuff.c

  5624	
  5625	static void timestamp_call_bpf(struct sock *sk, int op, u32 nargs, u32 *args)
  5626	{
  5627		struct bpf_sock_ops_kern sock_ops;
  5628	
  5629		memset(&sock_ops, 0, offsetof(struct bpf_sock_ops_kern, temp));
  5630		if (sk_fullsock(sk)) {
  5631			sock_ops.is_fullsock = 1;
  5632			sock_owned_by_me(sk);
  5633		}
  5634	
  5635		sock_ops.sk = sk;
  5636		sock_ops.op = op;
  5637		if (nargs > 0)
  5638			memcpy(sock_ops.args, args, nargs * sizeof(*args));
  5639	
> 5640		BPF_CGROUP_RUN_PROG_SOCK_OPS_SK(&sock_ops, sk);
  5641	}
  5642	
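
One plausible reading of this error, not confirmed by the report itself:
the randconfig builds without CONFIG_CGROUP_BPF, and the header provides
no fallback definition of the macro for that case, so the compiler treats
the name as an undeclared function. The usual remedy is a stub that
evaluates to 0 when the feature is compiled out. A minimal standalone
sketch of that pattern follows; TOY_CGROUP_BPF and every toy_* / TOY_*
name are hypothetical stand-ins, not kernel symbols.

```c
#include <assert.h>

/* TOY_CGROUP_BPF stands in for CONFIG_CGROUP_BPF (hypothetical name). */
#ifdef TOY_CGROUP_BPF
int toy_run_sock_ops(void *ops, void *sk);
#define TOY_RUN_PROG_SOCK_OPS_SK(ops, sk) toy_run_sock_ops(ops, sk)
#else
/* Stub for builds without the feature: evaluates its arguments, yields 0,
 * and lets every caller compile unchanged.
 */
#define TOY_RUN_PROG_SOCK_OPS_SK(ops, sk) ((void)(ops), (void)(sk), 0)
#endif

/* Caller that does not need its own #ifdef thanks to the stub above. */
static inline int toy_timestamp_call(void *ops, void *sk)
{
	return TOY_RUN_PROG_SOCK_OPS_SK(ops, sk);
}
```

With the stub in place the call site stays free of #ifdef blocks. The
alternative is to guard the caller itself on the config option.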

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


* Re: [PATCH net-next v3 03/14] net-timestamp: open gate for bpf_setsockopt/_getsockopt
  2024-10-28 11:05 ` [PATCH net-next v3 03/14] net-timestamp: open gate for bpf_setsockopt/_getsockopt Jason Xing
@ 2024-10-29  0:59   ` Willem de Bruijn
  2024-10-29  1:18     ` Jason Xing
  2024-10-30  0:32   ` Martin KaFai Lau
  1 sibling, 1 reply; 88+ messages in thread
From: Willem de Bruijn @ 2024-10-29  0:59 UTC (permalink / raw)
  To: Jason Xing, davem, edumazet, kuba, pabeni, dsahern,
	willemdebruijn.kernel, willemb, ast, daniel, andrii, martin.lau,
	eddyz87, song, yonghong.song, john.fastabend, kpsingh, sdf,
	haoluo, jolsa, shuah, ykolal
  Cc: bpf, netdev, Jason Xing

Jason Xing wrote:
> From: Jason Xing <kernelxing@tencent.com>
> 
> For now, we support bpf_setsockopt to set or clear timestamps flags.
> 
> Users can use something like this in bpf program to turn on the feature:
> flags = SOF_TIMESTAMPING_TX_SCHED;
> bpf_setsockopt(skops, SOL_SOCKET, SO_TIMESTAMPING, &flags, sizeof(flags));
> The specific use cases can be seen in the bpf selftest in this series.
> 
> Later, I will support each flags one by one based on this.
> 
> Signed-off-by: Jason Xing <kernelxing@tencent.com>
> ---
>  include/net/sock.h              |  4 ++--
>  include/uapi/linux/net_tstamp.h |  7 +++++++
>  net/core/filter.c               |  7 +++++--
>  net/core/sock.c                 | 34 ++++++++++++++++++++++++++-------
>  net/ipv4/udp.c                  |  2 +-
>  net/mptcp/sockopt.c             |  2 +-
>  net/socket.c                    |  2 +-
>  7 files changed, 44 insertions(+), 14 deletions(-)
> 
> diff --git a/include/net/sock.h b/include/net/sock.h
> index 5384f1e49f5c..062f405c744e 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -1775,7 +1775,7 @@ static inline void skb_set_owner_edemux(struct sk_buff *skb, struct sock *sk)
>  #endif
>  
>  int sk_setsockopt(struct sock *sk, int level, int optname,
> -		  sockptr_t optval, unsigned int optlen);
> +		  sockptr_t optval, unsigned int optlen, bool bpf_timetamping);

timestamping, not timetamping

More importantly, is there perhaps a cleaner way to add a BPF
setsockopt than to have to update the existing API and all its
callers?

>  int sock_setsockopt(struct socket *sock, int level, int op,
>  		    sockptr_t optval, unsigned int optlen);
>  int do_sock_setsockopt(struct socket *sock, bool compat, int level,
> @@ -1784,7 +1784,7 @@ int do_sock_getsockopt(struct socket *sock, bool compat, int level,
>  		       int optname, sockptr_t optval, sockptr_t optlen);
>  
>  int sk_getsockopt(struct sock *sk, int level, int optname,
> -		  sockptr_t optval, sockptr_t optlen);
> +		  sockptr_t optval, sockptr_t optlen, bool bpf_timetamping);
>  int sock_gettstamp(struct socket *sock, void __user *userstamp,
>  		   bool timeval, bool time32);
>  struct sk_buff *sock_alloc_send_pskb(struct sock *sk, unsigned long header_len,
> diff --git a/include/uapi/linux/net_tstamp.h b/include/uapi/linux/net_tstamp.h
> index 858339d1c1c4..0696699cf964 100644
> --- a/include/uapi/linux/net_tstamp.h
> +++ b/include/uapi/linux/net_tstamp.h
> @@ -49,6 +49,13 @@ enum {
>  					 SOF_TIMESTAMPING_TX_SCHED | \
>  					 SOF_TIMESTAMPING_TX_ACK)
>  
> +#define SOF_TIMESTAMPING_BPF_SUPPPORTED_MASK (SOF_TIMESTAMPING_SOFTWARE | \
> +					      SOF_TIMESTAMPING_TX_SCHED | \
> +					      SOF_TIMESTAMPING_TX_SOFTWARE | \
> +					      SOF_TIMESTAMPING_TX_ACK | \
> +					      SOF_TIMESTAMPING_OPT_ID | \
> +					      SOF_TIMESTAMPING_OPT_ID_TCP)
> +

We discussed the subtle distinction between OPT_ID and OPT_ID_TCP before.

Basically, OPT_ID_TCP is a fix for OPT_ID on TCP sockets, and should always be
passed. On a new API like this one, we can even require this.

Not super important, only if it does not make the code more complex.
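
To make the suggestion concrete, here is a minimal sketch of a flag check
that rejects bits outside the BPF mask and, on TCP sockets, rejects OPT_ID
without OPT_ID_TCP. The bit values are copied from
include/uapi/linux/net_tstamp.h so that the sketch builds without kernel
headers. bpf_tsflags_valid() is a hypothetical helper, not part of this
series, and limiting the requirement to TCP is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Copies of the uapi bit values, for a standalone build. */
#define SOF_TIMESTAMPING_TX_SOFTWARE	(1u << 1)
#define SOF_TIMESTAMPING_SOFTWARE	(1u << 4)
#define SOF_TIMESTAMPING_OPT_ID		(1u << 7)
#define SOF_TIMESTAMPING_TX_SCHED	(1u << 8)
#define SOF_TIMESTAMPING_TX_ACK		(1u << 9)
#define SOF_TIMESTAMPING_OPT_ID_TCP	(1u << 16)

/* Same composition (and spelling) as the mask in the patch. */
#define SOF_TIMESTAMPING_BPF_SUPPPORTED_MASK (SOF_TIMESTAMPING_SOFTWARE | \
					      SOF_TIMESTAMPING_TX_SCHED | \
					      SOF_TIMESTAMPING_TX_SOFTWARE | \
					      SOF_TIMESTAMPING_TX_ACK | \
					      SOF_TIMESTAMPING_OPT_ID | \
					      SOF_TIMESTAMPING_OPT_ID_TCP)

/* Hypothetical helper: accept only supported bits, and on TCP require
 * OPT_ID_TCP whenever OPT_ID is requested.
 */
static inline bool bpf_tsflags_valid(uint32_t flags, bool is_tcp)
{
	if (flags & ~SOF_TIMESTAMPING_BPF_SUPPPORTED_MASK)
		return false;
	if (is_tcp && (flags & SOF_TIMESTAMPING_OPT_ID) &&
	    !(flags & SOF_TIMESTAMPING_OPT_ID_TCP))
		return false;
	return true;
}
```

The mask evaluates to 16 + 256 + 2 + 512 + 128 + 65536 = 66450, which is
the value the selftest in this series expects back from bpf_getsockopt().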





* Re: [PATCH net-next v3 04/14] net-timestamp: introduce TS_SCHED_OPT_CB to generate dev xmit timestamp
  2024-10-28 11:05 ` [PATCH net-next v3 04/14] net-timestamp: introduce TS_SCHED_OPT_CB to generate dev xmit timestamp Jason Xing
  2024-10-29  0:23   ` kernel test robot
@ 2024-10-29  1:02   ` Willem de Bruijn
  2024-10-29  1:30     ` Jason Xing
  2024-10-29  1:04   ` kernel test robot
  2 siblings, 1 reply; 88+ messages in thread
From: Willem de Bruijn @ 2024-10-29  1:02 UTC (permalink / raw)
  To: Jason Xing, davem, edumazet, kuba, pabeni, dsahern,
	willemdebruijn.kernel, willemb, ast, daniel, andrii, martin.lau,
	eddyz87, song, yonghong.song, john.fastabend, kpsingh, sdf,
	haoluo, jolsa, shuah, ykolal
  Cc: bpf, netdev, Jason Xing

Jason Xing wrote:
> From: Jason Xing <kernelxing@tencent.com>
> 
> Introduce BPF_SOCK_OPS_TS_SCHED_OPT_CB flag so that we can decide to
> print timestamps when the skb just passes the dev layer.
> 
> Signed-off-by: Jason Xing <kernelxing@tencent.com>
> ---
>  include/uapi/linux/bpf.h       |  5 +++++
>  net/core/skbuff.c              | 31 ++++++++++++++++++++++++++++++-
>  tools/include/uapi/linux/bpf.h |  5 +++++
>  3 files changed, 40 insertions(+), 1 deletion(-)
> 
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index e8241b320c6d..324e9e40969c 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -7013,6 +7013,11 @@ enum {
>  					 * by the kernel or the
>  					 * earlier bpf-progs.
>  					 */
> +	BPF_SOCK_OPS_TS_SCHED_OPT_CB,	/* Called when skb is passing through
> +					 * dev layer when SO_TIMESTAMPING
> +					 * feature is on. It indicates the
> +					 * recorded timestamp.
> +					 */
>  };
>  
>  /* List of TCP states. There is a build check in net/ipv4/tcp.c to detect
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 39309f75e105..e6a5c883bdc6 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -64,6 +64,7 @@
>  #include <linux/mpls.h>
>  #include <linux/kcov.h>
>  #include <linux/iov_iter.h>
> +#include <linux/bpf-cgroup.h>
>  
>  #include <net/protocol.h>
>  #include <net/dst.h>
> @@ -5621,13 +5622,41 @@ static void skb_tstamp_tx_output(struct sk_buff *orig_skb,
>  	__skb_complete_tx_timestamp(skb, sk, tstype, opt_stats);
>  }
>  
> +static void timestamp_call_bpf(struct sock *sk, int op, u32 nargs, u32 *args)
> +{
> +	struct bpf_sock_ops_kern sock_ops;
> +
> +	memset(&sock_ops, 0, offsetof(struct bpf_sock_ops_kern, temp));
> +	if (sk_fullsock(sk)) {
> +		sock_ops.is_fullsock = 1;
> +		sock_owned_by_me(sk);

Why this check?

This will usually be false, as timestamps are taken outside the
protocol layers.

> +	}
> +
> +	sock_ops.sk = sk;
> +	sock_ops.op = op;
> +	if (nargs > 0)
> +		memcpy(sock_ops.args, args, nargs * sizeof(*args));
> +
> +	BPF_CGROUP_RUN_PROG_SOCK_OPS_SK(&sock_ops, sk);
> +}
> +


* Re: [PATCH net-next v3 06/14] net-timestamp: introduce TS_ACK_OPT_CB to generate tcp acked timestamp
  2024-10-28 11:05 ` [PATCH net-next v3 06/14] net-timestamp: introduce TS_ACK_OPT_CB to generate tcp acked timestamp Jason Xing
@ 2024-10-29  1:03   ` Willem de Bruijn
  2024-10-29  1:19     ` Jason Xing
  0 siblings, 1 reply; 88+ messages in thread
From: Willem de Bruijn @ 2024-10-29  1:03 UTC (permalink / raw)
  To: Jason Xing, davem, edumazet, kuba, pabeni, dsahern,
	willemdebruijn.kernel, willemb, ast, daniel, andrii, martin.lau,
	eddyz87, song, yonghong.song, john.fastabend, kpsingh, sdf,
	haoluo, jolsa, shuah, ykolal
  Cc: bpf, netdev, Jason Xing

Jason Xing wrote:
> From: Jason Xing <kernelxing@tencent.com>
> 
> When the last sent skb in each sendmsg() is acknowledged in TCP layer,

nit: last byte.

The TCP bytestream has no concept of fixed buffer sizes or skbs.
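
A toy model of the point, assuming the documented SO_TIMESTAMPING behavior
that the OPT_ID counter on stream sockets advances per byte: the ID
reported for a send call is the offset of its last byte, relative to the
position at which OPT_ID was enabled. How the stack later splits the data
into skbs does not enter into it. All toy_* names are illustrative and do
not exist in the kernel.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of a TCP send stream. */
struct toy_stream {
	uint32_t write_seq;	/* sequence number of the next byte to queue */
	uint32_t key_base;	/* write_seq captured when OPT_ID was enabled */
};

static inline void toy_enable_opt_id(struct toy_stream *s)
{
	s->key_base = s->write_seq;
}

/* Queue len bytes (len > 0) and return the ID for this call: the offset of
 * its last byte relative to key_base. Unsigned arithmetic keeps the result
 * correct across sequence number wraparound.
 */
static inline uint32_t toy_send(struct toy_stream *s, uint32_t len)
{
	uint32_t last_byte_seq = s->write_seq + len - 1;

	s->write_seq += len;
	return last_byte_seq - s->key_base;
}
```

Two sends of 10 and 5 bytes yield IDs 9 and 14, regardless of segmentation.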


* Re: [PATCH net-next v3 04/14] net-timestamp: introduce TS_SCHED_OPT_CB to generate dev xmit timestamp
  2024-10-28 11:05 ` [PATCH net-next v3 04/14] net-timestamp: introduce TS_SCHED_OPT_CB to generate dev xmit timestamp Jason Xing
  2024-10-29  0:23   ` kernel test robot
  2024-10-29  1:02   ` Willem de Bruijn
@ 2024-10-29  1:04   ` kernel test robot
  2 siblings, 0 replies; 88+ messages in thread
From: kernel test robot @ 2024-10-29  1:04 UTC (permalink / raw)
  To: Jason Xing, davem, edumazet, kuba, pabeni, dsahern,
	willemdebruijn.kernel, willemb, ast, daniel, andrii, martin.lau,
	eddyz87, song, yonghong.song, john.fastabend, kpsingh, sdf,
	haoluo, jolsa, shuah, ykolal
  Cc: llvm, oe-kbuild-all, bpf, netdev, Jason Xing

Hi Jason,

kernel test robot noticed the following build errors:

[auto build test ERROR on net-next/main]

url:    https://github.com/intel-lab-lkp/linux/commits/Jason-Xing/net-timestamp-reorganize-in-skb_tstamp_tx_output/20241028-192036
base:   net-next/main
patch link:    https://lore.kernel.org/r/20241028110535.82999-5-kerneljasonxing%40gmail.com
patch subject: [PATCH net-next v3 04/14] net-timestamp: introduce TS_SCHED_OPT_CB to generate dev xmit timestamp
config: arm64-randconfig-001-20241029 (https://download.01.org/0day-ci/archive/20241029/202410290852.PLcWZ1Yo-lkp@intel.com/config)
compiler: clang version 17.0.6 (https://github.com/llvm/llvm-project 6009708b4367171ccdbf4b5905cb6a803753fe18)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20241029/202410290852.PLcWZ1Yo-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add the following tags:
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202410290852.PLcWZ1Yo-lkp@intel.com/

All errors (new ones prefixed by >>):

>> net/core/skbuff.c:5640:2: error: call to undeclared function 'BPF_CGROUP_RUN_PROG_SOCK_OPS_SK'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
    5640 |         BPF_CGROUP_RUN_PROG_SOCK_OPS_SK(&sock_ops, sk);
         |         ^
   1 error generated.


vim +/BPF_CGROUP_RUN_PROG_SOCK_OPS_SK +5640 net/core/skbuff.c

  5624	
  5625	static void timestamp_call_bpf(struct sock *sk, int op, u32 nargs, u32 *args)
  5626	{
  5627		struct bpf_sock_ops_kern sock_ops;
  5628	
  5629		memset(&sock_ops, 0, offsetof(struct bpf_sock_ops_kern, temp));
  5630		if (sk_fullsock(sk)) {
  5631			sock_ops.is_fullsock = 1;
  5632			sock_owned_by_me(sk);
  5633		}
  5634	
  5635		sock_ops.sk = sk;
  5636		sock_ops.op = op;
  5637		if (nargs > 0)
  5638			memcpy(sock_ops.args, args, nargs * sizeof(*args));
  5639	
> 5640		BPF_CGROUP_RUN_PROG_SOCK_OPS_SK(&sock_ops, sk);
  5641	}
  5642	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


* Re: [PATCH net-next v3 07/14] net-timestamp: add a new triggered point to set sk_tsflags_bpf in UDP layer
  2024-10-28 11:05 ` [PATCH net-next v3 07/14] net-timestamp: add a new triggered point to set sk_tsflags_bpf in UDP layer Jason Xing
@ 2024-10-29  1:07   ` Willem de Bruijn
  2024-10-29  1:23     ` Jason Xing
  0 siblings, 1 reply; 88+ messages in thread
From: Willem de Bruijn @ 2024-10-29  1:07 UTC (permalink / raw)
  To: Jason Xing, davem, edumazet, kuba, pabeni, dsahern,
	willemdebruijn.kernel, willemb, ast, daniel, andrii, martin.lau,
	eddyz87, song, yonghong.song, john.fastabend, kpsingh, sdf,
	haoluo, jolsa, shuah, ykolal
  Cc: bpf, netdev, Jason Xing

Jason Xing wrote:
> From: Jason Xing <kernelxing@tencent.com>
> 
> This patch behaves like how cmsg feature works, that is to say,
> check and set on each call of udp_sendmsg before passing sk_tsflags_bpf
> to cork tsflags.
> 
> Signed-off-by: Jason Xing <kernelxing@tencent.com>
> ---
>  include/net/sock.h             | 1 +
>  include/uapi/linux/bpf.h       | 3 +++
>  net/core/skbuff.c              | 2 +-
>  net/ipv4/udp.c                 | 1 +
>  tools/include/uapi/linux/bpf.h | 3 +++
>  5 files changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/include/net/sock.h b/include/net/sock.h
> index 062f405c744e..cf7fea456455 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -2828,6 +2828,7 @@ static inline bool sk_listener_or_tw(const struct sock *sk)
>  }
>  
>  void sock_enable_timestamp(struct sock *sk, enum sock_flags flag);
> +void timestamp_call_bpf(struct sock *sk, int op, u32 nargs, u32 *args);
>  int sock_recv_errqueue(struct sock *sk, struct msghdr *msg, int len, int level,
>  		       int type);
>  
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 6fc3bd12b650..055ffa7c965c 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -7028,6 +7028,9 @@ enum {
>  					 * feature is on. It indicates the
>  					 * recorded timestamp.
>  					 */
> +	BPF_SOCK_OPS_TS_UDP_SND_CB,	/* Called when every udp_sendmsg
> +					 * syscall is triggered
> +					 */
>  };
>  
>  /* List of TCP states. There is a build check in net/ipv4/tcp.c to detect
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 8b2a79c0fe1c..0b571306f7ea 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -5622,7 +5622,7 @@ static void skb_tstamp_tx_output(struct sk_buff *orig_skb,
>  	__skb_complete_tx_timestamp(skb, sk, tstype, opt_stats);
>  }
>  
> -static void timestamp_call_bpf(struct sock *sk, int op, u32 nargs, u32 *args)
> +void timestamp_call_bpf(struct sock *sk, int op, u32 nargs, u32 *args)
>  {
>  	struct bpf_sock_ops_kern sock_ops;
>  
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index 9a20af41e272..e768421abc37 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -1264,6 +1264,7 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
>  	if (!corkreq) {
>  		struct inet_cork cork;
>  
> +		timestamp_call_bpf(sk, BPF_SOCK_OPS_TS_UDP_SND_CB, 0, NULL);
>  		skb = ip_make_skb(sk, fl4, getfrag, msg, ulen,
>  				  sizeof(struct udphdr), &ipc, &rt,
>  				  &cork, msg->msg_flags);
> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> index 6fc3bd12b650..055ffa7c965c 100644
> --- a/tools/include/uapi/linux/bpf.h
> +++ b/tools/include/uapi/linux/bpf.h
> @@ -7028,6 +7028,9 @@ enum {
>  					 * feature is on. It indicates the
>  					 * recorded timestamp.
>  					 */
> +	BPF_SOCK_OPS_TS_UDP_SND_CB,	/* Called when every udp_sendmsg
> +					 * syscall is triggered
> +					 */

If adding a timestamp as close to syscall entry as possible, give it a
generic name, not specific to UDP.

And please explain in the commit message the reason for a new
timestamp recording point: with existing timestamping the application
can call clock_gettime before (and optionally after) the send call.
An admin using BPF does not have this option, so needs this as part of
the BPF timestamping API.
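
For reference, a minimal userspace sketch of what an application can
already do without kernel help: bracket the send call with clock_gettime().
struct send_window and timed_send() are hypothetical names. CLOCK_REALTIME
is chosen on the assumption that the values should be comparable with
kernel software timestamps.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/* Result of one instrumented send call. */
struct send_window {
	struct timespec before;	/* taken just before send() */
	struct timespec after;	/* taken just after send() returns */
	ssize_t sent;		/* return value of send() */
};

/* Record wall-clock time on both sides of a single send() call. */
static inline struct send_window timed_send(int fd, const void *buf, size_t len)
{
	struct send_window w;

	clock_gettime(CLOCK_REALTIME, &w.before);
	w.sent = send(fd, buf, len, 0);
	clock_gettime(CLOCK_REALTIME, &w.after);
	return w;
}
```

A BPF program attached by an administrator has no equivalent hook in the
unmodified application, which is the gap the new recording point fills.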


>  };
>  
>  /* List of TCP states. There is a build check in net/ipv4/tcp.c to detect
> -- 
> 2.37.3
> 




* Re: [PATCH net-next v3 03/14] net-timestamp: open gate for bpf_setsockopt/_getsockopt
  2024-10-29  0:59   ` Willem de Bruijn
@ 2024-10-29  1:18     ` Jason Xing
  0 siblings, 0 replies; 88+ messages in thread
From: Jason Xing @ 2024-10-29  1:18 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: davem, edumazet, kuba, pabeni, dsahern, willemb, ast, daniel,
	andrii, martin.lau, eddyz87, song, yonghong.song, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, shuah, ykolal, bpf, netdev,
	Jason Xing

On Tue, Oct 29, 2024 at 8:59 AM Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
>
> Jason Xing wrote:
> > From: Jason Xing <kernelxing@tencent.com>
> >
> > For now, we support bpf_setsockopt to set or clear timestamps flags.
> >
> > Users can use something like this in bpf program to turn on the feature:
> > flags = SOF_TIMESTAMPING_TX_SCHED;
> > bpf_setsockopt(skops, SOL_SOCKET, SO_TIMESTAMPING, &flags, sizeof(flags));
> > The specific use cases can be seen in the bpf selftest in this series.
> >
> > Later, I will support each flags one by one based on this.
> >
> > Signed-off-by: Jason Xing <kernelxing@tencent.com>
> > ---
> >  include/net/sock.h              |  4 ++--
> >  include/uapi/linux/net_tstamp.h |  7 +++++++
> >  net/core/filter.c               |  7 +++++--
> >  net/core/sock.c                 | 34 ++++++++++++++++++++++++++-------
> >  net/ipv4/udp.c                  |  2 +-
> >  net/mptcp/sockopt.c             |  2 +-
> >  net/socket.c                    |  2 +-
> >  7 files changed, 44 insertions(+), 14 deletions(-)
> >
> > diff --git a/include/net/sock.h b/include/net/sock.h
> > index 5384f1e49f5c..062f405c744e 100644
> > --- a/include/net/sock.h
> > +++ b/include/net/sock.h
> > @@ -1775,7 +1775,7 @@ static inline void skb_set_owner_edemux(struct sk_buff *skb, struct sock *sk)
> >  #endif
> >
> >  int sk_setsockopt(struct sock *sk, int level, int optname,
> > -               sockptr_t optval, unsigned int optlen);
> > +               sockptr_t optval, unsigned int optlen, bool bpf_timetamping);
>
> timestamping, not timetamping

Oh, right...

>
> More importantly, is there perhaps a cleaner way to add a BPF
> setsockopt than to have to update the existing API and all its
> callers?

I've thought about that as well. As you may notice, this version
changes the prior implementation [1] that makes the code more clear
from my perspective.

[1]: https://lore.kernel.org/all/20241012040651.95616-3-kerneljasonxing@gmail.com/

The version at that link did not support bpf_setsockopt(). Supporting it
there would have required an awkward modification in sol_socket_sockopt(),
returning earlier than the other SO_xxx cases do. That's why I changed the
approach here.

>
> >  int sock_setsockopt(struct socket *sock, int level, int op,
> >                   sockptr_t optval, unsigned int optlen);
> >  int do_sock_setsockopt(struct socket *sock, bool compat, int level,
> > @@ -1784,7 +1784,7 @@ int do_sock_getsockopt(struct socket *sock, bool compat, int level,
> >                      int optname, sockptr_t optval, sockptr_t optlen);
> >
> >  int sk_getsockopt(struct sock *sk, int level, int optname,
> > -               sockptr_t optval, sockptr_t optlen);
> > +               sockptr_t optval, sockptr_t optlen, bool bpf_timetamping);
> >  int sock_gettstamp(struct socket *sock, void __user *userstamp,
> >                  bool timeval, bool time32);
> >  struct sk_buff *sock_alloc_send_pskb(struct sock *sk, unsigned long header_len,
> > diff --git a/include/uapi/linux/net_tstamp.h b/include/uapi/linux/net_tstamp.h
> > index 858339d1c1c4..0696699cf964 100644
> > --- a/include/uapi/linux/net_tstamp.h
> > +++ b/include/uapi/linux/net_tstamp.h
> > @@ -49,6 +49,13 @@ enum {
> >                                        SOF_TIMESTAMPING_TX_SCHED | \
> >                                        SOF_TIMESTAMPING_TX_ACK)
> >
> > +#define SOF_TIMESTAMPING_BPF_SUPPPORTED_MASK (SOF_TIMESTAMPING_SOFTWARE | \
> > +                                           SOF_TIMESTAMPING_TX_SCHED | \
> > +                                           SOF_TIMESTAMPING_TX_SOFTWARE | \
> > +                                           SOF_TIMESTAMPING_TX_ACK | \
> > +                                           SOF_TIMESTAMPING_OPT_ID | \
> > +                                           SOF_TIMESTAMPING_OPT_ID_TCP)
> > +
>
> We discussed the subtle distinction between OPT_ID and OPT_ID_TCP before.
>
> Basically, OPT_ID_TCP is a fix for OPT_ID on TCP sockets, and should always be
> passed. On a new API like this one, we can even require this.

Good idea. Will do it. Thanks.

>
> Not super important, only if it does not make the code more complex.

I need to ponder on this point more.

Thanks,
Jason


* Re: [PATCH net-next v3 06/14] net-timestamp: introduce TS_ACK_OPT_CB to generate tcp acked timestamp
  2024-10-29  1:03   ` Willem de Bruijn
@ 2024-10-29  1:19     ` Jason Xing
  0 siblings, 0 replies; 88+ messages in thread
From: Jason Xing @ 2024-10-29  1:19 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: davem, edumazet, kuba, pabeni, dsahern, willemb, ast, daniel,
	andrii, martin.lau, eddyz87, song, yonghong.song, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, shuah, ykolal, bpf, netdev,
	Jason Xing

On Tue, Oct 29, 2024 at 9:03 AM Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
>
> Jason Xing wrote:
> > From: Jason Xing <kernelxing@tencent.com>
> >
> > When the last sent skb in each sendmsg() is acknowledged in TCP layer,
>
> nit: last byte.
>
> The TCP bytestream has no concept of fixed buffer sizes or skbs.

Right, right, big mistake of basic theory. Sorry.

Thanks,
Jason


* Re: [PATCH net-next v3 07/14] net-timestamp: add a new triggered point to set sk_tsflags_bpf in UDP layer
  2024-10-29  1:07   ` Willem de Bruijn
@ 2024-10-29  1:23     ` Jason Xing
  2024-10-29  1:33       ` Willem de Bruijn
  0 siblings, 1 reply; 88+ messages in thread
From: Jason Xing @ 2024-10-29  1:23 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: davem, edumazet, kuba, pabeni, dsahern, willemb, ast, daniel,
	andrii, martin.lau, eddyz87, song, yonghong.song, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, shuah, ykolal, bpf, netdev,
	Jason Xing

On Tue, Oct 29, 2024 at 9:07 AM Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
>
> Jason Xing wrote:
> > From: Jason Xing <kernelxing@tencent.com>
> >
> > This patch behaves like how cmsg feature works, that is to say,
> > check and set on each call of udp_sendmsg before passing sk_tsflags_bpf
> > to cork tsflags.
> >
> > Signed-off-by: Jason Xing <kernelxing@tencent.com>
> > ---
> >  include/net/sock.h             | 1 +
> >  include/uapi/linux/bpf.h       | 3 +++
> >  net/core/skbuff.c              | 2 +-
> >  net/ipv4/udp.c                 | 1 +
> >  tools/include/uapi/linux/bpf.h | 3 +++
> >  5 files changed, 9 insertions(+), 1 deletion(-)
> >
> > diff --git a/include/net/sock.h b/include/net/sock.h
> > index 062f405c744e..cf7fea456455 100644
> > --- a/include/net/sock.h
> > +++ b/include/net/sock.h
> > @@ -2828,6 +2828,7 @@ static inline bool sk_listener_or_tw(const struct sock *sk)
> >  }
> >
> >  void sock_enable_timestamp(struct sock *sk, enum sock_flags flag);
> > +void timestamp_call_bpf(struct sock *sk, int op, u32 nargs, u32 *args);
> >  int sock_recv_errqueue(struct sock *sk, struct msghdr *msg, int len, int level,
> >                      int type);
> >
> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > index 6fc3bd12b650..055ffa7c965c 100644
> > --- a/include/uapi/linux/bpf.h
> > +++ b/include/uapi/linux/bpf.h
> > @@ -7028,6 +7028,9 @@ enum {
> >                                        * feature is on. It indicates the
> >                                        * recorded timestamp.
> >                                        */
> > +     BPF_SOCK_OPS_TS_UDP_SND_CB,     /* Called when every udp_sendmsg
> > +                                      * syscall is triggered
> > +                                      */
> >  };
> >
> >  /* List of TCP states. There is a build check in net/ipv4/tcp.c to detect
> > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > index 8b2a79c0fe1c..0b571306f7ea 100644
> > --- a/net/core/skbuff.c
> > +++ b/net/core/skbuff.c
> > @@ -5622,7 +5622,7 @@ static void skb_tstamp_tx_output(struct sk_buff *orig_skb,
> >       __skb_complete_tx_timestamp(skb, sk, tstype, opt_stats);
> >  }
> >
> > -static void timestamp_call_bpf(struct sock *sk, int op, u32 nargs, u32 *args)
> > +void timestamp_call_bpf(struct sock *sk, int op, u32 nargs, u32 *args)
> >  {
> >       struct bpf_sock_ops_kern sock_ops;
> >
> > diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> > index 9a20af41e272..e768421abc37 100644
> > --- a/net/ipv4/udp.c
> > +++ b/net/ipv4/udp.c
> > @@ -1264,6 +1264,7 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
> >       if (!corkreq) {
> >               struct inet_cork cork;
> >
> > +             timestamp_call_bpf(sk, BPF_SOCK_OPS_TS_UDP_SND_CB, 0, NULL);
> >               skb = ip_make_skb(sk, fl4, getfrag, msg, ulen,
> >                                 sizeof(struct udphdr), &ipc, &rt,
> >                                 &cork, msg->msg_flags);
> > diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> > index 6fc3bd12b650..055ffa7c965c 100644
> > --- a/tools/include/uapi/linux/bpf.h
> > +++ b/tools/include/uapi/linux/bpf.h
> > @@ -7028,6 +7028,9 @@ enum {
> >                                        * feature is on. It indicates the
> >                                        * recorded timestamp.
> >                                        */
> > +     BPF_SOCK_OPS_TS_UDP_SND_CB,     /* Called when every udp_sendmsg
> > +                                      * syscall is triggered
> > +                                      */
>
> If adding a timestamp as close to syscall entry as possible, give it a
> generic name, not specific to UDP.

Good suggestion. It will also solve the remaining issue for TCP:
__when__ we should record the user timestamp, which already exists in
the application SO_TIMESTAMPING feature.

>
> And please explain in the commit message the reason for a new
> timestamp recording point: with existing timestamping the application
> can call clock_gettime before (and optionally after) the send call.
> An admin using BPF does not have this option, so needs this as part of
> the BPF timestamping API.

Will revise this part. Thanks for your description!

Thanks,
Jason


* Re: [PATCH net-next v3 10/14] net-timestamp: add basic support with tskey offset
  2024-10-28 11:05 ` [PATCH net-next v3 10/14] net-timestamp: add basic support with tskey offset Jason Xing
@ 2024-10-29  1:24   ` Willem de Bruijn
  2024-10-29  2:41     ` Jason Xing
  2024-10-30  5:42   ` Martin KaFai Lau
  1 sibling, 1 reply; 88+ messages in thread
From: Willem de Bruijn @ 2024-10-29  1:24 UTC (permalink / raw)
  To: Jason Xing, davem, edumazet, kuba, pabeni, dsahern,
	willemdebruijn.kernel, willemb, ast, daniel, andrii, martin.lau,
	eddyz87, song, yonghong.song, john.fastabend, kpsingh, sdf,
	haoluo, jolsa, shuah, ykolal
  Cc: bpf, netdev, Jason Xing

Jason Xing wrote:
> From: Jason Xing <kernelxing@tencent.com>
> 
> Use the offset to record the delta value between current socket key
> and bpf socket key.
> 
> 1. If there is only bpf feature running, the socket key is bpf socket
> key and the offset is zero;
> 2. If there is only traditional feature running, and then bpf feature
> is turned on, the socket key is still used by the former while the offset
> is the delta between them;
> 3. if there is only bpf feature running, and then application uses it,
> the socket key would be re-init for application and the offset is the
> delta.

We need to also figure out the rare conflict when one user sets
OPT_ID | OPT_ID_TCP while the other only uses OPT_ID.

It is so obscure that perhaps we can punt and say that the BPF
program just has to follow the application preference and be aware of
the subtle difference.
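
To illustrate the subtle difference: going by the quoted patch, the
starting key for TCP is write_seq when OPT_ID_TCP is set and snd_una
otherwise. A small userspace sketch follows; the helper names are
hypothetical and this only models the arithmetic, not the kernel code.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical model: which sequence number seeds the key for TCP. */
static uint32_t tcp_tskey_base(uint32_t write_seq, uint32_t snd_una,
			       bool opt_id_tcp)
{
	return opt_id_tcp ? write_seq : snd_una;
}

/* Gap between the two bases: bytes queued but not yet acknowledged
 * at the time OPT_ID is enabled. Unsigned arithmetic handles wrap.
 */
static uint32_t tcp_tskey_mode_gap(uint32_t write_seq, uint32_t snd_una)
{
	return write_seq - snd_una;
}
```

If the two users pick different modes, their keys differ by this gap
whenever data was in flight when the option was enabled.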

> Signed-off-by: Jason Xing <kernelxing@tencent.com>
> ---
>  include/net/sock.h |  1 +
>  net/core/skbuff.c  | 15 ++++++++---
>  net/core/sock.c    | 66 ++++++++++++++++++++++++++++++++++++++--------
>  3 files changed, 68 insertions(+), 14 deletions(-)
> 
> diff --git a/include/net/sock.h b/include/net/sock.h
> index 91398b20a4a3..41c6c6f78e55 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -469,6 +469,7 @@ struct sock {
>  	unsigned long		sk_pacing_rate; /* bytes per second */
>  	atomic_t		sk_zckey;
>  	atomic_t		sk_tskey;
> +	u32			sk_tskey_bpf_offset;
>  	__cacheline_group_end(sock_write_tx);
>  
>  	__cacheline_group_begin(sock_read_tx);
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 0b571306f7ea..d1739317b97d 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -5641,9 +5641,10 @@ void timestamp_call_bpf(struct sock *sk, int op, u32 nargs, u32 *args)
>  }
>  
>  static void skb_tstamp_tx_output_bpf(struct sock *sk, int tstype,
> +				     struct sk_buff *skb,
>  				     struct skb_shared_hwtstamps *hwtstamps)
>  {
> -	u32 args[2] = {0, 0};
> +	u32 args[3] = {0, 0, 0};
>  	u32 tsflags, cb_flag;
>  
>  	tsflags = READ_ONCE(sk->sk_tsflags_bpf);
> @@ -5672,7 +5673,15 @@ static void skb_tstamp_tx_output_bpf(struct sock *sk, int tstype,
>  		args[1] = ts.tv_nsec;
>  	}
>  
> -	timestamp_call_bpf(sk, cb_flag, 2, args);
> +	if (tsflags & SOF_TIMESTAMPING_OPT_ID) {
> +		args[2] = skb_shinfo(skb)->tskey;
> +		if (sk_is_tcp(sk))
> +			args[2] -= atomic_read(&sk->sk_tskey);
> +		if (sk->sk_tskey_bpf_offset)
> +			args[2] += sk->sk_tskey_bpf_offset;
> +	}
> +
> +	timestamp_call_bpf(sk, cb_flag, 3, args);


So the BPF interface is effectively OPT_TSONLY: the packet data is
never shared.

Then OPT_ID should be mandatory, because without it the data is
not actionable: which byte in the bytestream, or which packet in the
case of datagram sockets, does a callback refer to?
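
A sketch of why the key makes the data actionable: with a key, a
consumer can pair each callback with the send it belongs to. The table
below is a hypothetical userspace model (struct tracker, track_send()
and match_callback() are invented names), not something the series
provides.

```c
#include <stdint.h>
#include <stdbool.h>

#define SLOTS 64	/* hypothetical table size */

struct pending {
	uint32_t key;		/* key recorded at send time */
	uint64_t sent_ns;	/* when the send was issued */
	bool used;
};

struct tracker {
	struct pending slot[SLOTS];
};

/* Remember one send under its key. */
static void track_send(struct tracker *t, uint32_t key, uint64_t sent_ns)
{
	struct pending *p = &t->slot[key % SLOTS];

	p->key = key;
	p->sent_ns = sent_ns;
	p->used = true;
}

/* On a timestamp callback, find the matching send and report the delay.
 * Returns false when the key is unknown, i.e. the event is not actionable.
 */
static bool match_callback(const struct tracker *t, uint32_t key,
			   uint64_t cb_ns, uint64_t *delta_ns)
{
	const struct pending *p = &t->slot[key % SLOTS];

	if (!p->used || p->key != key)
		return false;
	*delta_ns = cb_ns - p->sent_ns;
	return true;
}
```

Without a key there is nothing to index the table by, so a callback
could not be attributed to any particular send.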

> +/* Used to track the tskey for bpf extension
> + *
> + * @sk_tskey: bpf extension can use it only when no application uses.
> + *            Application can use it directly regardless of bpf extension.
> + *
> + * There are three strategies:
> + * 1) If we've already set through setsockopt() and here we're going to set
> + *    OPT_ID for bpf use, we will not re-initialize the @sk_tskey and will
> + *    keep the record of delta between the current "key" and previous key.
> + * 2) If we've already set through bpf_setsockopt() and here we're going to
> + *    set for application use, we will record the delta first and then
> + *    override/initialize the @sk_tskey.
> + * 3) other cases, which means only either of them takes effect, so initialize
> + *    everything simplely.
> + */

Please explain in the commit message that these gymnastics are needed
because there can only be one tskey in skb_shared_info.
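
The three strategies in the quoted comment can be modelled in userspace
as follows. This is only a sketch of the intended bookkeeping for a
datagram socket (where a fresh key starts at 0); struct tskey_model and
the model_*() helpers are hypothetical, and the real code uses atomics
and socket state that are left out here.

```c
#include <stdint.h>
#include <stdbool.h>

struct tskey_model {
	uint32_t tskey;		/* the single shared counter (sk_tskey) */
	uint32_t bpf_offset;	/* models sk_tskey_bpf_offset */
	bool app_on;		/* application enabled OPT_ID */
	bool bpf_on;		/* BPF enabled OPT_ID */
};

/* Enable OPT_ID for one of the two users. */
static void model_enable(struct tskey_model *m, bool is_bpf)
{
	const uint32_t fresh = 0;

	if (is_bpf && m->app_on) {
		/* 1) keep the application's counter, remember the delta */
		m->bpf_offset = fresh - m->tskey;
	} else if (!is_bpf && m->bpf_on) {
		/* 2) remember the delta, then restart the counter */
		m->bpf_offset = m->tskey - fresh;
		m->tskey = fresh;
	} else {
		/* 3) only one user: nothing to translate */
		m->bpf_offset = 0;
		m->tskey = fresh;
	}

	if (is_bpf)
		m->bpf_on = true;
	else
		m->app_on = true;
}

/* One send: the packet carries the counter value, which then advances. */
static uint32_t model_send(struct tskey_model *m)
{
	return m->tskey++;
}

/* Key as seen by BPF: the packet key translated by the stored offset. */
static uint32_t model_bpf_key(const struct tskey_model *m, uint32_t pkt_key)
{
	return pkt_key + m->bpf_offset;
}
```

Since the packet carries one key only, the offset is what lets the BPF
side see a sequence that starts from its own enable point.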

> +static long int sock_calculate_tskey_offset(struct sock *sk, int val, int bpf_type)
> +{
> +	u32 tskey;
> +
> +	if (sk_is_tcp(sk)) {
> +		if ((1 << sk->sk_state) & (TCPF_CLOSE | TCPF_LISTEN))
> +			return -EINVAL;
> +
> +		if (val & SOF_TIMESTAMPING_OPT_ID_TCP)
> +			tskey = tcp_sk(sk)->write_seq;
> +		else
> +			tskey = tcp_sk(sk)->snd_una;
> +	} else {
> +		tskey = 0;
> +	}
> +
> +	if (bpf_type && (sk->sk_tsflags & SOF_TIMESTAMPING_OPT_ID)) {
> +		sk->sk_tskey_bpf_offset = tskey - atomic_read(&sk->sk_tskey);
> +		return 0;
> +	} else if (!bpf_type && (sk->sk_tsflags_bpf & SOF_TIMESTAMPING_OPT_ID)) {
> +		sk->sk_tskey_bpf_offset = atomic_read(&sk->sk_tskey) - tskey;
> +	} else {
> +		sk->sk_tskey_bpf_offset = 0;
> +	}
> +
> +	return tskey;
> +}
> +
>  int sock_set_tskey(struct sock *sk, int val, int bpf_type)
>  {
>  	u32 tsflags = bpf_type ? sk->sk_tsflags_bpf : sk->sk_tsflags;
> @@ -901,17 +944,13 @@ int sock_set_tskey(struct sock *sk, int val, int bpf_type)
>  
>  	if (val & SOF_TIMESTAMPING_OPT_ID &&
>  	    !(tsflags & SOF_TIMESTAMPING_OPT_ID)) {
> -		if (sk_is_tcp(sk)) {
> -			if ((1 << sk->sk_state) &
> -			    (TCPF_CLOSE | TCPF_LISTEN))
> -				return -EINVAL;
> -			if (val & SOF_TIMESTAMPING_OPT_ID_TCP)
> -				atomic_set(&sk->sk_tskey, tcp_sk(sk)->write_seq);
> -			else
> -				atomic_set(&sk->sk_tskey, tcp_sk(sk)->snd_una);
> -		} else {
> -			atomic_set(&sk->sk_tskey, 0);
> -		}
> +		long int ret;
> +
> +		ret = sock_calculate_tskey_offset(sk, val, bpf_type);
> +		if (ret <= 0)
> +			return ret;
> +
> +		atomic_set(&sk->sk_tskey, ret);
>  	}
>  
>  	return 0;
> @@ -956,10 +995,15 @@ static int sock_set_timestamping_bpf(struct sock *sk,
>  				     struct so_timestamping timestamping)
>  {
>  	u32 flags = timestamping.flags;
> +	int ret;
>  
>  	if (flags & ~SOF_TIMESTAMPING_BPF_SUPPPORTED_MASK)
>  		return -EINVAL;
>  
> +	ret = sock_set_tskey(sk, flags, 1);
> +	if (ret)
> +		return ret;
> +
>  	WRITE_ONCE(sk->sk_tsflags_bpf, flags);
>  
>  	return 0;

I'm a bit hazy on when this can be called. We can assume that this new
BPF operation cannot race with the existing setsockopt nor with the
datapath that might touch the atomic fields, right?


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH net-next v3 14/14] bpf: add simple bpf tests in the tx path for so_timstamping feature
  2024-10-28 11:05 ` [PATCH net-next v3 14/14] bpf: add simple bpf tests in the tx path for so_timstamping feature Jason Xing
@ 2024-10-29  1:26   ` Willem de Bruijn
  2024-10-29  1:33     ` Jason Xing
  2024-10-30  5:57   ` Martin KaFai Lau
  1 sibling, 1 reply; 88+ messages in thread
From: Willem de Bruijn @ 2024-10-29  1:26 UTC (permalink / raw)
  To: Jason Xing, davem, edumazet, kuba, pabeni, dsahern,
	willemdebruijn.kernel, willemb, ast, daniel, andrii, martin.lau,
	eddyz87, song, yonghong.song, john.fastabend, kpsingh, sdf,
	haoluo, jolsa, shuah, ykolal
  Cc: bpf, netdev, Jason Xing

Jason Xing wrote:
> From: Jason Xing <kernelxing@tencent.com>
> 
> Only check if we pass those three key points after we enable the
> bpf extension for so_timestamping. During each point, we can choose
> whether to print the current timestamp.
> 
> Signed-off-by: Jason Xing <kernelxing@tencent.com>
> ---
>  .../bpf/prog_tests/so_timestamping.c          |  98 ++++++++++++++
>  .../selftests/bpf/progs/so_timestamping.c     | 123 ++++++++++++++++++
>  2 files changed, 221 insertions(+)
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/so_timestamping.c
>  create mode 100644 tools/testing/selftests/bpf/progs/so_timestamping.c
> 
> diff --git a/tools/testing/selftests/bpf/prog_tests/so_timestamping.c b/tools/testing/selftests/bpf/prog_tests/so_timestamping.c
> new file mode 100644
> index 000000000000..dfb7588c246d
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/prog_tests/so_timestamping.c
> @@ -0,0 +1,98 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright (c) 2024 Tencent */
> +
> +#define _GNU_SOURCE
> +#include <sched.h>
> +#include <linux/socket.h>
> +#include <linux/tls.h>
> +#include <net/if.h>
> +
> +#include "test_progs.h"
> +#include "cgroup_helpers.h"
> +#include "network_helpers.h"
> +
> +#include "so_timestamping.skel.h"
> +
> +#define CG_NAME "/so-timestamping-test"
> +
> +static const char addr4_str[] = "127.0.0.1";
> +static const char addr6_str[] = "::1";
> +static struct so_timestamping *skel;
> +static int cg_fd;
> +
> +static int create_netns(void)
> +{
> +	if (!ASSERT_OK(unshare(CLONE_NEWNET), "create netns"))
> +		return -1;
> +
> +	if (!ASSERT_OK(system("ip link set dev lo up"), "set lo up"))
> +		return -1;
> +
> +	return 0;
> +}
> +
> +static void test_tcp(int family)
> +{
> +	struct so_timestamping__bss *bss = skel->bss;
> +	char buf[] = "testing testing";
> +	int sfd = -1, cfd = -1;
> +	int n;
> +
> +	memset(bss, 0, sizeof(*bss));
> +
> +	sfd = start_server(family, SOCK_STREAM,
> +			   family == AF_INET6 ? addr6_str : addr4_str, 0, 0);
> +	if (!ASSERT_GE(sfd, 0, "start_server"))
> +		goto out;
> +
> +	cfd = connect_to_fd(sfd, 0);
> +	if (!ASSERT_GE(cfd, 0, "connect_to_fd_server")) {
> +		close(sfd);
> +		goto out;
> +	}
> +
> +	n = write(cfd, buf, sizeof(buf));
> +	if (!ASSERT_EQ(n, sizeof(buf), "send to server"))
> +		goto out;
> +
> +	ASSERT_EQ(bss->nr_active, 1, "nr_active");
> +	ASSERT_EQ(bss->nr_passive, 1, "nr_passive");
> +	ASSERT_EQ(bss->nr_sched, 1, "nr_sched");
> +	ASSERT_EQ(bss->nr_txsw, 1, "nr_txsw");
> +	ASSERT_EQ(bss->nr_ack, 1, "nr_ack");
> +
> +out:
> +	if (sfd >= 0)
> +		close(sfd);
> +	if (cfd >= 0)
> +		close(cfd);
> +}
> +
> +void test_so_timestamping(void)
> +{
> +	cg_fd = test__join_cgroup(CG_NAME);
> +	if (cg_fd < 0)
> +		return;
> +
> +	if (create_netns())
> +		goto done;
> +
> +	skel = so_timestamping__open();
> +	if (!ASSERT_OK_PTR(skel, "open skel"))
> +		goto done;
> +
> +	if (!ASSERT_OK(so_timestamping__load(skel), "load skel"))
> +		goto done;
> +
> +	skel->links.skops_sockopt =
> +		bpf_program__attach_cgroup(skel->progs.skops_sockopt, cg_fd);
> +	if (!ASSERT_OK_PTR(skel->links.skops_sockopt, "attach cgroup"))
> +		goto done;
> +
> +	test_tcp(AF_INET6);
> +	test_tcp(AF_INET);
> +
> +done:
> +	so_timestamping__destroy(skel);
> +	close(cg_fd);
> +}
> diff --git a/tools/testing/selftests/bpf/progs/so_timestamping.c b/tools/testing/selftests/bpf/progs/so_timestamping.c
> new file mode 100644
> index 000000000000..a15317951786
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/so_timestamping.c
> @@ -0,0 +1,123 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright (c) 2024 Tencent */
> +
> +#include "vmlinux.h"
> +#include "bpf_tracing_net.h"
> +#include <bpf/bpf_core_read.h>
> +#include <bpf/bpf_helpers.h>
> +#include <bpf/bpf_tracing.h>
> +#include "bpf_misc.h"
> +
> +#define SO_TIMESTAMPING 37
> +#define SOF_TIMESTAMPING_BPF_SUPPPORTED_MASK (SOF_TIMESTAMPING_SOFTWARE | \
> +					      SOF_TIMESTAMPING_TX_SCHED | \
> +					      SOF_TIMESTAMPING_TX_SOFTWARE | \
> +					      SOF_TIMESTAMPING_TX_ACK | \
> +					      SOF_TIMESTAMPING_OPT_ID | \
> +					      SOF_TIMESTAMPING_OPT_ID_TCP)
> +
> +extern unsigned long CONFIG_HZ __kconfig;
> +
> +int nr_active;
> +int nr_passive;
> +int nr_sched;
> +int nr_txsw;
> +int nr_ack;
> +
> +struct sockopt_test {
> +	int opt;
> +	int new;
> +	int expected;
> +};
> +
> +static const struct sockopt_test sol_socket_tests[] = {
> +	{ .opt = SO_TIMESTAMPING, .new = SOF_TIMESTAMPING_TX_SCHED, .expected = 256, },
> +	{ .opt = SO_TIMESTAMPING, .new = SOF_TIMESTAMPING_BPF_SUPPPORTED_MASK, .expected = 66450, },
> +	{ .opt = 0, },
> +};
> +
> +struct loop_ctx {
> +	void *ctx;
> +	struct sock *sk;
> +};
> +
> +static int bpf_test_sockopt_int(void *ctx, struct sock *sk,
> +				const struct sockopt_test *t,
> +				int level)
> +{
> +	int tmp, new, expected, opt;
> +
> +	opt = t->opt;
> +	new = t->new;
> +	expected = t->expected;
> +
> +	if (bpf_setsockopt(ctx, level, opt, &new, sizeof(new)))
> +		return 1;
> +	if (bpf_getsockopt(ctx, level, opt, &tmp, sizeof(tmp)) ||
> +	    tmp != expected)
> +		return 1;
> +
> +	return 0;
> +}
> +
> +static int bpf_test_socket_sockopt(__u32 i, struct loop_ctx *lc)
> +{
> +	const struct sockopt_test *t;
> +
> +	if (i >= ARRAY_SIZE(sol_socket_tests))
> +		return 1;
> +
> +	t = &sol_socket_tests[i];
> +	if (!t->opt)
> +		return 1;
> +
> +	return bpf_test_sockopt_int(lc->ctx, lc->sk, t, SOL_SOCKET);
> +}
> +
> +static int bpf_test_sockopt(void *ctx, struct sock *sk)
> +{
> +	struct loop_ctx lc = { .ctx = ctx, .sk = sk, };
> +	int n;
> +
> +	n = bpf_loop(ARRAY_SIZE(sol_socket_tests), bpf_test_socket_sockopt, &lc, 0);
> +	if (n != ARRAY_SIZE(sol_socket_tests))
> +		return -1;
> +
> +	return 0;
> +}
> +
> +SEC("sockops")
> +int skops_sockopt(struct bpf_sock_ops *skops)
> +{
> +	struct bpf_sock *bpf_sk = skops->sk;
> +	struct sock *sk;
> +
> +	if (!bpf_sk)
> +		return 1;
> +
> +	sk = (struct sock *)bpf_skc_to_tcp_sock(bpf_sk);
> +	if (!sk)
> +		return 1;
> +
> +	switch (skops->op) {
> +	case BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB:
> +		nr_active += !bpf_test_sockopt(skops, sk);
> +		break;
> +	case BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB:
> +		nr_passive += !bpf_test_sockopt(skops, sk);
> +		break;
> +	case BPF_SOCK_OPS_TS_SCHED_OPT_CB:
> +		nr_sched += 1;
> +		break;
> +	case BPF_SOCK_OPS_TS_SW_OPT_CB:
> +		nr_txsw += 1;
> +		break;
> +	case BPF_SOCK_OPS_TS_ACK_OPT_CB:
> +		nr_ack += 1;
> +		break;

Perhaps demonstrate what to do with the args on the new
TS_*_OPT_CB callbacks.
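
One possible shape for such a demonstration, sketched as plain C so it
can be checked outside the kernel. It assumes the layout implied by
patch 10/14: args[1] is tv_nsec and args[2] is the key; that args[0]
holds tv_sec is an assumption, since that line is not visible in the
quoted hunk. struct ts_event and decode_ts_args() are hypothetical
names. In a sockops program the same three values would presumably be
read from skops->args[].

```c
#include <stdint.h>

/* Hypothetical decoded form of one timestamp callback. */
struct ts_event {
	uint64_t ns;	/* timestamp in nanoseconds */
	uint32_t key;	/* OPT_ID key identifying the send */
};

/* Rebuild a 64-bit nanosecond value from the 32-bit callback args.
 * Assumed layout: args[0] = seconds, args[1] = nanoseconds, args[2] = key.
 */
static struct ts_event decode_ts_args(const uint32_t args[3])
{
	struct ts_event ev;

	ev.ns = (uint64_t)args[0] * 1000000000ULL + args[1];
	ev.key = args[2];
	return ev;
}
```

A selftest could then assert on ev.key and on the ordering of ev.ns
across the SCHED, SW and ACK callbacks instead of only counting them.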

> +	}
> +
> +	return 1;
> +}
> +
> +char _license[] SEC("license") = "GPL";
> -- 
> 2.37.3
> 



^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH net-next v3 04/14] net-timestamp: introduce TS_SCHED_OPT_CB to generate dev xmit timestamp
  2024-10-29  1:02   ` Willem de Bruijn
@ 2024-10-29  1:30     ` Jason Xing
  0 siblings, 0 replies; 88+ messages in thread
From: Jason Xing @ 2024-10-29  1:30 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: davem, edumazet, kuba, pabeni, dsahern, willemb, ast, daniel,
	andrii, martin.lau, eddyz87, song, yonghong.song, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, shuah, ykolal, bpf, netdev,
	Jason Xing

On Tue, Oct 29, 2024 at 9:02 AM Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
>
> Jason Xing wrote:
> > From: Jason Xing <kernelxing@tencent.com>
> >
> > Introduce BPF_SOCK_OPS_TS_SCHED_OPT_CB flag so that we can decide to
> > print timestamps when the skb just passes the dev layer.
> >
> > Signed-off-by: Jason Xing <kernelxing@tencent.com>
> > ---
> >  include/uapi/linux/bpf.h       |  5 +++++
> >  net/core/skbuff.c              | 31 ++++++++++++++++++++++++++++++-
> >  tools/include/uapi/linux/bpf.h |  5 +++++
> >  3 files changed, 40 insertions(+), 1 deletion(-)
> >
> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > index e8241b320c6d..324e9e40969c 100644
> > --- a/include/uapi/linux/bpf.h
> > +++ b/include/uapi/linux/bpf.h
> > @@ -7013,6 +7013,11 @@ enum {
> >                                        * by the kernel or the
> >                                        * earlier bpf-progs.
> >                                        */
> > +     BPF_SOCK_OPS_TS_SCHED_OPT_CB,   /* Called when skb is passing through
> > +                                      * dev layer when SO_TIMESTAMPING
> > +                                      * feature is on. It indicates the
> > +                                      * recorded timestamp.
> > +                                      */
> >  };
> >
> >  /* List of TCP states. There is a build check in net/ipv4/tcp.c to detect
> > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > index 39309f75e105..e6a5c883bdc6 100644
> > --- a/net/core/skbuff.c
> > +++ b/net/core/skbuff.c
> > @@ -64,6 +64,7 @@
> >  #include <linux/mpls.h>
> >  #include <linux/kcov.h>
> >  #include <linux/iov_iter.h>
> > +#include <linux/bpf-cgroup.h>
> >
> >  #include <net/protocol.h>
> >  #include <net/dst.h>
> > @@ -5621,13 +5622,41 @@ static void skb_tstamp_tx_output(struct sk_buff *orig_skb,
> >       __skb_complete_tx_timestamp(skb, sk, tstype, opt_stats);
> >  }
> >
> > +static void timestamp_call_bpf(struct sock *sk, int op, u32 nargs, u32 *args)
> > +{
> > +     struct bpf_sock_ops_kern sock_ops;
> > +
> > +     memset(&sock_ops, 0, offsetof(struct bpf_sock_ops_kern, temp));
> > +     if (sk_fullsock(sk)) {
> > +             sock_ops.is_fullsock = 1;
> > +             sock_owned_by_me(sk);
>
> Why this check?

I imitated the use of BPF_CGROUP_RUN_PROG_SOCK_OPS.

>
> This will usually be false, as timestamps are taken outside the
> protocol layers.

I will remove this if branch.

Thanks,
Jason

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH net-next v3 07/14] net-timestamp: add a new triggered point to set sk_tsflags_bpf in UDP layer
  2024-10-29  1:23     ` Jason Xing
@ 2024-10-29  1:33       ` Willem de Bruijn
  2024-10-29  3:12         ` Jason Xing
  0 siblings, 1 reply; 88+ messages in thread
From: Willem de Bruijn @ 2024-10-29  1:33 UTC (permalink / raw)
  To: Jason Xing, Willem de Bruijn
  Cc: davem, edumazet, kuba, pabeni, dsahern, willemb, ast, daniel,
	andrii, martin.lau, eddyz87, song, yonghong.song, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, shuah, ykolal, bpf, netdev,
	Jason Xing

Jason Xing wrote:
> On Tue, Oct 29, 2024 at 9:07 AM Willem de Bruijn
> <willemdebruijn.kernel@gmail.com> wrote:
> >
> > Jason Xing wrote:
> > > From: Jason Xing <kernelxing@tencent.com>
> > >
> > > This patch behaves like how cmsg feature works, that is to say,
> > > check and set on each call of udp_sendmsg before passing sk_tsflags_bpf
> > > to cork tsflags.
> > >
> > > Signed-off-by: Jason Xing <kernelxing@tencent.com>
> > > ---
> > >  include/net/sock.h             | 1 +
> > >  include/uapi/linux/bpf.h       | 3 +++
> > >  net/core/skbuff.c              | 2 +-
> > >  net/ipv4/udp.c                 | 1 +
> > >  tools/include/uapi/linux/bpf.h | 3 +++
> > >  5 files changed, 9 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/include/net/sock.h b/include/net/sock.h
> > > index 062f405c744e..cf7fea456455 100644
> > > --- a/include/net/sock.h
> > > +++ b/include/net/sock.h
> > > @@ -2828,6 +2828,7 @@ static inline bool sk_listener_or_tw(const struct sock *sk)
> > >  }
> > >
> > >  void sock_enable_timestamp(struct sock *sk, enum sock_flags flag);
> > > +void timestamp_call_bpf(struct sock *sk, int op, u32 nargs, u32 *args);
> > >  int sock_recv_errqueue(struct sock *sk, struct msghdr *msg, int len, int level,
> > >                      int type);
> > >
> > > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > > index 6fc3bd12b650..055ffa7c965c 100644
> > > --- a/include/uapi/linux/bpf.h
> > > +++ b/include/uapi/linux/bpf.h
> > > @@ -7028,6 +7028,9 @@ enum {
> > >                                        * feature is on. It indicates the
> > >                                        * recorded timestamp.
> > >                                        */
> > > +     BPF_SOCK_OPS_TS_UDP_SND_CB,     /* Called when every udp_sendmsg
> > > +                                      * syscall is triggered
> > > +                                      */
> > >  };
> > >
> > >  /* List of TCP states. There is a build check in net/ipv4/tcp.c to detect
> > > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > > index 8b2a79c0fe1c..0b571306f7ea 100644
> > > --- a/net/core/skbuff.c
> > > +++ b/net/core/skbuff.c
> > > @@ -5622,7 +5622,7 @@ static void skb_tstamp_tx_output(struct sk_buff *orig_skb,
> > >       __skb_complete_tx_timestamp(skb, sk, tstype, opt_stats);
> > >  }
> > >
> > > -static void timestamp_call_bpf(struct sock *sk, int op, u32 nargs, u32 *args)
> > > +void timestamp_call_bpf(struct sock *sk, int op, u32 nargs, u32 *args)
> > >  {
> > >       struct bpf_sock_ops_kern sock_ops;
> > >
> > > diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> > > index 9a20af41e272..e768421abc37 100644
> > > --- a/net/ipv4/udp.c
> > > +++ b/net/ipv4/udp.c
> > > @@ -1264,6 +1264,7 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
> > >       if (!corkreq) {
> > >               struct inet_cork cork;
> > >
> > > +             timestamp_call_bpf(sk, BPF_SOCK_OPS_TS_UDP_SND_CB, 0, NULL);
> > >               skb = ip_make_skb(sk, fl4, getfrag, msg, ulen,
> > >                                 sizeof(struct udphdr), &ipc, &rt,
> > >                                 &cork, msg->msg_flags);
> > > diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> > > index 6fc3bd12b650..055ffa7c965c 100644
> > > --- a/tools/include/uapi/linux/bpf.h
> > > +++ b/tools/include/uapi/linux/bpf.h
> > > @@ -7028,6 +7028,9 @@ enum {
> > >                                        * feature is on. It indicates the
> > >                                        * recorded timestamp.
> > >                                        */
> > > +     BPF_SOCK_OPS_TS_UDP_SND_CB,     /* Called when every udp_sendmsg
> > > +                                      * syscall is triggered
> > > +                                      */
> >
> > If adding a timestamp as close to syscall entry as possible, give it a
> > generic name, not specific to UDP.
> 
> Good suggestion, then it will also solve the remaining issue for TCP type:
> __when__ we should record the user timestamp which exists in the
> application SO_TIMESTAMPING feature.
> 
> >
> > And please explain in the commit message the reason for a new
> > timestamp recording point: with existing timestamping the application
> > can call clock_gettime before (and optionally after) the send call.
> > An admin using BPF does not have this option, so needs this as part of
> > the BPF timestamping API.
> 
> Will revise this part. Thanks for your description!

Actually, I may have misunderstood the intention of this new hook.

I thought it was to record an additional timestamp.

But it is (also?) to program skb_shared_info.tx_flags based on
instructions parsed from cmsg in __sock_cmsg_send.
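
For context, the application-side mechanism being mirrored is the
per-call SO_TIMESTAMPING control message. A minimal userspace sketch of
building one is below; add_tstamp_cmsg() is a hypothetical helper, and
the caller is assumed to pass a suitably aligned control buffer.

```c
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <linux/net_tstamp.h>

/* Attach one SO_TIMESTAMPING cmsg carrying tx flags to msg.
 * Returns 0 on success, -1 if the control buffer is too small.
 */
static int add_tstamp_cmsg(struct msghdr *msg, void *buf, size_t buflen,
			   uint32_t flags)
{
	struct cmsghdr *cmsg;

	if (buflen < CMSG_SPACE(sizeof(flags)))
		return -1;

	memset(buf, 0, buflen);
	msg->msg_control = buf;
	msg->msg_controllen = CMSG_SPACE(sizeof(flags));

	cmsg = CMSG_FIRSTHDR(msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SO_TIMESTAMPING;
	cmsg->cmsg_len = CMSG_LEN(sizeof(flags));
	memcpy(CMSG_DATA(cmsg), &flags, sizeof(flags));
	return 0;
}
```

Passing the resulting msghdr to sendmsg() requests timestamps for that
one call only, which is the per-send behaviour the new hook appears to
reproduce for BPF.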

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH net-next v3 14/14] bpf: add simple bpf tests in the tx path for so_timstamping feature
  2024-10-29  1:26   ` Willem de Bruijn
@ 2024-10-29  1:33     ` Jason Xing
  2024-10-29  1:40       ` Willem de Bruijn
  0 siblings, 1 reply; 88+ messages in thread
From: Jason Xing @ 2024-10-29  1:33 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: davem, edumazet, kuba, pabeni, dsahern, willemb, ast, daniel,
	andrii, martin.lau, eddyz87, song, yonghong.song, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, shuah, ykolal, bpf, netdev,
	Jason Xing

On Tue, Oct 29, 2024 at 9:27 AM Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
>
> Jason Xing wrote:
> > From: Jason Xing <kernelxing@tencent.com>
> >
> > Only check if we pass those three key points after we enable the
> > bpf extension for so_timestamping. During each point, we can choose
> > whether to print the current timestamp.
> >
> > Signed-off-by: Jason Xing <kernelxing@tencent.com>
> > [...]
> > +SEC("sockops")
> > +int skops_sockopt(struct bpf_sock_ops *skops)
> > +{
> > +     struct bpf_sock *bpf_sk = skops->sk;
> > +     struct sock *sk;
> > +
> > +     if (!bpf_sk)
> > +             return 1;
> > +
> > +     sk = (struct sock *)bpf_skc_to_tcp_sock(bpf_sk);
> > +     if (!sk)
> > +             return 1;
> > +
> > +     switch (skops->op) {
> > +     case BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB:
> > +             nr_active += !bpf_test_sockopt(skops, sk);
> > +             break;
> > +     case BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB:
> > +             nr_passive += !bpf_test_sockopt(skops, sk);
> > +             break;
> > +     case BPF_SOCK_OPS_TS_SCHED_OPT_CB:
> > +             nr_sched += 1;
> > +             break;
> > +     case BPF_SOCK_OPS_TS_SW_OPT_CB:
> > +             nr_txsw += 1;
> > +             break;
> > +     case BPF_SOCK_OPS_TS_ACK_OPT_CB:
> > +             nr_ack += 1;
> > +             break;
>
> Perhaps demonstrate what to do with the args on the new
> TS_*_OPT_CB.

Roger that.

Is the current patch too big to review? Should I split it into a few
patches? The series already has 14 patches, though, so splitting could
possibly exceed the maximum limit.

Thanks,
Jason

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH net-next v3 14/14] bpf: add simple bpf tests in the tx path for so_timstamping feature
  2024-10-29  1:33     ` Jason Xing
@ 2024-10-29  1:40       ` Willem de Bruijn
  2024-10-29  3:13         ` Jason Xing
  0 siblings, 1 reply; 88+ messages in thread
From: Willem de Bruijn @ 2024-10-29  1:40 UTC (permalink / raw)
  To: Jason Xing, Willem de Bruijn
  Cc: davem, edumazet, kuba, pabeni, dsahern, willemb, ast, daniel,
	andrii, martin.lau, eddyz87, song, yonghong.song, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, shuah, ykolal, bpf, netdev,
	Jason Xing

Jason Xing wrote:
> On Tue, Oct 29, 2024 at 9:27 AM Willem de Bruijn
> <willemdebruijn.kernel@gmail.com> wrote:
> >
> > Jason Xing wrote:
> > > From: Jason Xing <kernelxing@tencent.com>
> > >
> > > Only check if we pass those three key points after we enable the
> > > bpf extension for so_timestamping. During each point, we can choose
> > > whether to print the current timestamp.
> > >
> > > Signed-off-by: Jason Xing <kernelxing@tencent.com>
> > > ---
> > >  .../bpf/prog_tests/so_timestamping.c          |  98 ++++++++++++++
> > >  .../selftests/bpf/progs/so_timestamping.c     | 123 ++++++++++++++++++
> > >  2 files changed, 221 insertions(+)
> > >  create mode 100644 tools/testing/selftests/bpf/prog_tests/so_timestamping.c
> > >  create mode 100644 tools/testing/selftests/bpf/progs/so_timestamping.c
> > >
> > > diff --git a/tools/testing/selftests/bpf/prog_tests/so_timestamping.c b/tools/testing/selftests/bpf/prog_tests/so_timestamping.c
> > > new file mode 100644
> > > index 000000000000..dfb7588c246d
> > > --- /dev/null
> > > +++ b/tools/testing/selftests/bpf/prog_tests/so_timestamping.c
> > > @@ -0,0 +1,98 @@
> > > +// SPDX-License-Identifier: GPL-2.0
> > > +/* Copyright (c) 2024 Tencent */
> > > +
> > > +#define _GNU_SOURCE
> > > +#include <sched.h>
> > > +#include <linux/socket.h>
> > > +#include <linux/tls.h>
> > > +#include <net/if.h>
> > > +
> > > +#include "test_progs.h"
> > > +#include "cgroup_helpers.h"
> > > +#include "network_helpers.h"
> > > +
> > > +#include "so_timestamping.skel.h"
> > > +
> > > +#define CG_NAME "/so-timestamping-test"
> > > +
> > > +static const char addr4_str[] = "127.0.0.1";
> > > +static const char addr6_str[] = "::1";
> > > +static struct so_timestamping *skel;
> > > +static int cg_fd;
> > > +
> > > +static int create_netns(void)
> > > +{
> > > +     if (!ASSERT_OK(unshare(CLONE_NEWNET), "create netns"))
> > > +             return -1;
> > > +
> > > +     if (!ASSERT_OK(system("ip link set dev lo up"), "set lo up"))
> > > +             return -1;
> > > +
> > > +     return 0;
> > > +}
> > > +
> > > +static void test_tcp(int family)
> > > +{
> > > +     struct so_timestamping__bss *bss = skel->bss;
> > > +     char buf[] = "testing testing";
> > > +     int sfd = -1, cfd = -1;
> > > +     int n;
> > > +
> > > +     memset(bss, 0, sizeof(*bss));
> > > +
> > > +     sfd = start_server(family, SOCK_STREAM,
> > > +                        family == AF_INET6 ? addr6_str : addr4_str, 0, 0);
> > > +     if (!ASSERT_GE(sfd, 0, "start_server"))
> > > +             goto out;
> > > +
> > > +     cfd = connect_to_fd(sfd, 0);
> > > +     if (!ASSERT_GE(cfd, 0, "connect_to_fd_server")) {
> > > +             close(sfd);
> > > +             goto out;
> > > +     }
> > > +
> > > +     n = write(cfd, buf, sizeof(buf));
> > > +     if (!ASSERT_EQ(n, sizeof(buf), "send to server"))
> > > +             goto out;
> > > +
> > > +     ASSERT_EQ(bss->nr_active, 1, "nr_active");
> > > +     ASSERT_EQ(bss->nr_passive, 1, "nr_passive");
> > > +     ASSERT_EQ(bss->nr_sched, 1, "nr_sched");
> > > +     ASSERT_EQ(bss->nr_txsw, 1, "nr_txsw");
> > > +     ASSERT_EQ(bss->nr_ack, 1, "nr_ack");
> > > +
> > > +out:
> > > +     if (sfd >= 0)
> > > +             close(sfd);
> > > +     if (cfd >= 0)
> > > +             close(cfd);
> > > +}
> > > +
> > > +void test_so_timestamping(void)
> > > +{
> > > +     cg_fd = test__join_cgroup(CG_NAME);
> > > +     if (cg_fd < 0)
> > > +             return;
> > > +
> > > +     if (create_netns())
> > > +             goto done;
> > > +
> > > +     skel = so_timestamping__open();
> > > +     if (!ASSERT_OK_PTR(skel, "open skel"))
> > > +             goto done;
> > > +
> > > +     if (!ASSERT_OK(so_timestamping__load(skel), "load skel"))
> > > +             goto done;
> > > +
> > > +     skel->links.skops_sockopt =
> > > +             bpf_program__attach_cgroup(skel->progs.skops_sockopt, cg_fd);
> > > +     if (!ASSERT_OK_PTR(skel->links.skops_sockopt, "attach cgroup"))
> > > +             goto done;
> > > +
> > > +     test_tcp(AF_INET6);
> > > +     test_tcp(AF_INET);
> > > +
> > > +done:
> > > +     so_timestamping__destroy(skel);
> > > +     close(cg_fd);
> > > +}
> > > diff --git a/tools/testing/selftests/bpf/progs/so_timestamping.c b/tools/testing/selftests/bpf/progs/so_timestamping.c
> > > new file mode 100644
> > > index 000000000000..a15317951786
> > > --- /dev/null
> > > +++ b/tools/testing/selftests/bpf/progs/so_timestamping.c
> > > @@ -0,0 +1,123 @@
> > > +// SPDX-License-Identifier: GPL-2.0
> > > +/* Copyright (c) 2024 Tencent */
> > > +
> > > +#include "vmlinux.h"
> > > +#include "bpf_tracing_net.h"
> > > +#include <bpf/bpf_core_read.h>
> > > +#include <bpf/bpf_helpers.h>
> > > +#include <bpf/bpf_tracing.h>
> > > +#include "bpf_misc.h"
> > > +
> > > +#define SO_TIMESTAMPING 37
> > > +#define SOF_TIMESTAMPING_BPF_SUPPPORTED_MASK (SOF_TIMESTAMPING_SOFTWARE | \
> > > +                                           SOF_TIMESTAMPING_TX_SCHED | \
> > > +                                           SOF_TIMESTAMPING_TX_SOFTWARE | \
> > > +                                           SOF_TIMESTAMPING_TX_ACK | \
> > > +                                           SOF_TIMESTAMPING_OPT_ID | \
> > > +                                           SOF_TIMESTAMPING_OPT_ID_TCP)
> > > +
> > > +extern unsigned long CONFIG_HZ __kconfig;
> > > +
> > > +int nr_active;
> > > +int nr_passive;
> > > +int nr_sched;
> > > +int nr_txsw;
> > > +int nr_ack;
> > > +
> > > +struct sockopt_test {
> > > +     int opt;
> > > +     int new;
> > > +     int expected;
> > > +};
> > > +
> > > +static const struct sockopt_test sol_socket_tests[] = {
> > > +     { .opt = SO_TIMESTAMPING, .new = SOF_TIMESTAMPING_TX_SCHED, .expected = 256, },
> > > +     { .opt = SO_TIMESTAMPING, .new = SOF_TIMESTAMPING_BPF_SUPPPORTED_MASK, .expected = 66450, },
> > > +     { .opt = 0, },
> > > +};
> > > +
> > > +struct loop_ctx {
> > > +     void *ctx;
> > > +     struct sock *sk;
> > > +};
> > > +
> > > +static int bpf_test_sockopt_int(void *ctx, struct sock *sk,
> > > +                             const struct sockopt_test *t,
> > > +                             int level)
> > > +{
> > > +     int tmp, new, expected, opt;
> > > +
> > > +     opt = t->opt;
> > > +     new = t->new;
> > > +     expected = t->expected;
> > > +
> > > +     if (bpf_setsockopt(ctx, level, opt, &new, sizeof(new)))
> > > +             return 1;
> > > +     if (bpf_getsockopt(ctx, level, opt, &tmp, sizeof(tmp)) ||
> > > +         tmp != expected)
> > > +             return 1;
> > > +
> > > +     return 0;
> > > +}
> > > +
> > > +static int bpf_test_socket_sockopt(__u32 i, struct loop_ctx *lc)
> > > +{
> > > +     const struct sockopt_test *t;
> > > +
> > > +     if (i >= ARRAY_SIZE(sol_socket_tests))
> > > +             return 1;
> > > +
> > > +     t = &sol_socket_tests[i];
> > > +     if (!t->opt)
> > > +             return 1;
> > > +
> > > +     return bpf_test_sockopt_int(lc->ctx, lc->sk, t, SOL_SOCKET);
> > > +}
> > > +
> > > +static int bpf_test_sockopt(void *ctx, struct sock *sk)
> > > +{
> > > +     struct loop_ctx lc = { .ctx = ctx, .sk = sk, };
> > > +     int n;
> > > +
> > > +     n = bpf_loop(ARRAY_SIZE(sol_socket_tests), bpf_test_socket_sockopt, &lc, 0);
> > > +     if (n != ARRAY_SIZE(sol_socket_tests))
> > > +             return -1;
> > > +
> > > +     return 0;
> > > +}
> > > +
> > > +SEC("sockops")
> > > +int skops_sockopt(struct bpf_sock_ops *skops)
> > > +{
> > > +     struct bpf_sock *bpf_sk = skops->sk;
> > > +     struct sock *sk;
> > > +
> > > +     if (!bpf_sk)
> > > +             return 1;
> > > +
> > > +     sk = (struct sock *)bpf_skc_to_tcp_sock(bpf_sk);
> > > +     if (!sk)
> > > +             return 1;
> > > +
> > > +     switch (skops->op) {
> > > +     case BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB:
> > > +             nr_active += !bpf_test_sockopt(skops, sk);
> > > +             break;
> > > +     case BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB:
> > > +             nr_passive += !bpf_test_sockopt(skops, sk);
> > > +             break;
> > > +     case BPF_SOCK_OPS_TS_SCHED_OPT_CB:
> > > +             nr_sched += 1;
> > > +             break;
> > > +     case BPF_SOCK_OPS_TS_SW_OPT_CB:
> > > +             nr_txsw += 1;
> > > +             break;
> > > +     case BPF_SOCK_OPS_TS_ACK_OPT_CB:
> > > +             nr_ack += 1;
> > > +             break;
> >
> > Perhaps demonstrate what to do with the args on the new
> > TS_*_OPT_CB.
> 
> Roger that.
> 
> I would like to know if the current patch is too big to review? Should
> I split it into a few patches? But this series has 14 patches right
> now which could possibly exceed the maximum limit.

For a test patch, this looks fine to me. They are often longer than
feature patches. But much of it is easy to grasp.
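
As a rough illustration of the suggestion above to demonstrate what to
do with the args on the TS_*_OPT_CB callbacks, here is a minimal
userspace sketch in plain C. It assumes the layout implied by the series:
args[0] carries seconds, args[1] carries nanoseconds, and args[2] carries
the tskey when OPT_ID is set. The names struct ts_event and
decode_ts_args() are hypothetical and do not appear in the patches.

```c
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Hypothetical decoded view of the callback arguments. */
struct ts_event {
	uint64_t ns;    /* timestamp in nanoseconds */
	uint32_t tskey; /* identifies the byte (TCP) or packet (datagram) */
};

/* Assumed layout: args[0] = seconds, args[1] = nanoseconds, args[2] = tskey.
 * Each slot is 32 bits wide, so the seconds value is a truncated view.
 */
static struct ts_event decode_ts_args(const uint32_t args[3])
{
	struct ts_event ev;

	ev.ns = (uint64_t)args[0] * NSEC_PER_SEC + args[1];
	ev.tskey = args[2];
	return ev;
}
```

A selftest could store the decoded value in a global next to the
nr_sched/nr_txsw/nr_ack counters and let userspace check that the SCHED,
SW and ACK timestamps for the same tskey do not go backwards.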


* Re: [PATCH net-next v3 10/14] net-timestamp: add basic support with tskey offset
  2024-10-29  1:24   ` Willem de Bruijn
@ 2024-10-29  2:41     ` Jason Xing
  2024-10-29 15:03       ` Willem de Bruijn
  0 siblings, 1 reply; 88+ messages in thread
From: Jason Xing @ 2024-10-29  2:41 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: davem, edumazet, kuba, pabeni, dsahern, willemb, ast, daniel,
	andrii, martin.lau, eddyz87, song, yonghong.song, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, shuah, ykolal, bpf, netdev,
	Jason Xing

On Tue, Oct 29, 2024 at 9:24 AM Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
>
> Jason Xing wrote:
> > From: Jason Xing <kernelxing@tencent.com>
> >
> > Use the offset to record the delta value between current socket key
> > and bpf socket key.
> >
> > 1. If there is only bpf feature running, the socket key is bpf socket
> > key and the offset is zero;
> > 2. If there is only traditional feature running, and then bpf feature
> > is turned on, the socket key is still used by the former while the offset
> > is the delta between them;
> > 3. if there is only bpf feature running, and then application uses it,
> > the socket key would be re-init for application and the offset is the
> > delta.
>
> We need to also figure out the rare conflict when one user sets
> OPT_ID | OPT_ID_TCP while the other only uses OPT_ID.

I think the current patch handles the case because:
1. sock_calculate_tskey_offset() gets the final key first whether the
OPT_ID_TCP is set or not.
2. we will use that tskey to calculate the delta.

>
> It is so obscure, that perhaps we can punt and say that the BPF
> program just has to follow the application preference and be aware of
> the subtle difference.

Right.

>
> > Signed-off-by: Jason Xing <kernelxing@tencent.com>
> > ---
> >  include/net/sock.h |  1 +
> >  net/core/skbuff.c  | 15 ++++++++---
> >  net/core/sock.c    | 66 ++++++++++++++++++++++++++++++++++++++--------
> >  3 files changed, 68 insertions(+), 14 deletions(-)
> >
> > diff --git a/include/net/sock.h b/include/net/sock.h
> > index 91398b20a4a3..41c6c6f78e55 100644
> > --- a/include/net/sock.h
> > +++ b/include/net/sock.h
> > @@ -469,6 +469,7 @@ struct sock {
> >       unsigned long           sk_pacing_rate; /* bytes per second */
> >       atomic_t                sk_zckey;
> >       atomic_t                sk_tskey;
> > +     u32                     sk_tskey_bpf_offset;
> >       __cacheline_group_end(sock_write_tx);
> >
> >       __cacheline_group_begin(sock_read_tx);
> > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > index 0b571306f7ea..d1739317b97d 100644
> > --- a/net/core/skbuff.c
> > +++ b/net/core/skbuff.c
> > @@ -5641,9 +5641,10 @@ void timestamp_call_bpf(struct sock *sk, int op, u32 nargs, u32 *args)
> >  }
> >
> >  static void skb_tstamp_tx_output_bpf(struct sock *sk, int tstype,
> > +                                  struct sk_buff *skb,
> >                                    struct skb_shared_hwtstamps *hwtstamps)
> >  {
> > -     u32 args[2] = {0, 0};
> > +     u32 args[3] = {0, 0, 0};
> >       u32 tsflags, cb_flag;
> >
> >       tsflags = READ_ONCE(sk->sk_tsflags_bpf);
> > @@ -5672,7 +5673,15 @@ static void skb_tstamp_tx_output_bpf(struct sock *sk, int tstype,
> >               args[1] = ts.tv_nsec;
> >       }
> >
> > -     timestamp_call_bpf(sk, cb_flag, 2, args);
> > +     if (tsflags & SOF_TIMESTAMPING_OPT_ID) {
> > +             args[2] = skb_shinfo(skb)->tskey;
> > +             if (sk_is_tcp(sk))
> > +                     args[2] -= atomic_read(&sk->sk_tskey);
> > +             if (sk->sk_tskey_bpf_offset)
> > +                     args[2] += sk->sk_tskey_bpf_offset;
> > +     }
> > +
> > +     timestamp_call_bpf(sk, cb_flag, 3, args);
>
>
> So the BPF interface is effectively OPT_TSONLY: the packet data is
> never shared.
>
> Then OPT_ID should be mandatory, because it without it the data is
> not actionable: which byte in the bytestream or packet in the case
> of datagram sockets does a callback refer to.

It does make sense. I think I will implement it when bpf_setsockopt() is called.

>
> > +/* Used to track the tskey for bpf extension
> > + *
> > + * @sk_tskey: bpf extension can use it only when no application uses.
> > + *            Application can use it directly regardless of bpf extension.
> > + *
> > + * There are three strategies:
> > + * 1) If we've already set through setsockopt() and here we're going to set
> > + *    OPT_ID for bpf use, we will not re-initialize the @sk_tskey and will
> > + *    keep the record of delta between the current "key" and previous key.
> > + * 2) If we've already set through bpf_setsockopt() and here we're going to
> > + *    set for application use, we will record the delta first and then
> > + *    override/initialize the @sk_tskey.
> > + * 3) other cases, which means only either of them takes effect, so initialize
> > + *    everything simplely.
> > + */
>
> Please explain in the commit message that these gymnastics are needed
> because there can only be one tskey in skb_shared_info.

No problem.

>
> > +static long int sock_calculate_tskey_offset(struct sock *sk, int val, int bpf_type)
> > +{
> > +     u32 tskey;
> > +
> > +     if (sk_is_tcp(sk)) {
> > +             if ((1 << sk->sk_state) & (TCPF_CLOSE | TCPF_LISTEN))
> > +                     return -EINVAL;
> > +
> > +             if (val & SOF_TIMESTAMPING_OPT_ID_TCP)
> > +                     tskey = tcp_sk(sk)->write_seq;
> > +             else
> > +                     tskey = tcp_sk(sk)->snd_una;
> > +     } else {
> > +             tskey = 0;
> > +     }
> > +
> > +     if (bpf_type && (sk->sk_tsflags & SOF_TIMESTAMPING_OPT_ID)) {
> > +             sk->sk_tskey_bpf_offset = tskey - atomic_read(&sk->sk_tskey);
> > +             return 0;
> > +     } else if (!bpf_type && (sk->sk_tsflags_bpf & SOF_TIMESTAMPING_OPT_ID)) {
> > +             sk->sk_tskey_bpf_offset = atomic_read(&sk->sk_tskey) - tskey;
> > +     } else {
> > +             sk->sk_tskey_bpf_offset = 0;
> > +     }
> > +
> > +     return tskey;
> > +}
> > +
> >  int sock_set_tskey(struct sock *sk, int val, int bpf_type)
> >  {
> >       u32 tsflags = bpf_type ? sk->sk_tsflags_bpf : sk->sk_tsflags;
> > @@ -901,17 +944,13 @@ int sock_set_tskey(struct sock *sk, int val, int bpf_type)
> >
> >       if (val & SOF_TIMESTAMPING_OPT_ID &&
> >           !(tsflags & SOF_TIMESTAMPING_OPT_ID)) {
> > -             if (sk_is_tcp(sk)) {
> > -                     if ((1 << sk->sk_state) &
> > -                         (TCPF_CLOSE | TCPF_LISTEN))
> > -                             return -EINVAL;
> > -                     if (val & SOF_TIMESTAMPING_OPT_ID_TCP)
> > -                             atomic_set(&sk->sk_tskey, tcp_sk(sk)->write_seq);
> > -                     else
> > -                             atomic_set(&sk->sk_tskey, tcp_sk(sk)->snd_una);
> > -             } else {
> > -                     atomic_set(&sk->sk_tskey, 0);
> > -             }
> > +             long int ret;
> > +
> > +             ret = sock_calculate_tskey_offset(sk, val, bpf_type);
> > +             if (ret <= 0)
> > +                     return ret;
> > +
> > +             atomic_set(&sk->sk_tskey, ret);
> >       }
> >
> >       return 0;
> > @@ -956,10 +995,15 @@ static int sock_set_timestamping_bpf(struct sock *sk,
> >                                    struct so_timestamping timestamping)
> >  {
> >       u32 flags = timestamping.flags;
> > +     int ret;
> >
> >       if (flags & ~SOF_TIMESTAMPING_BPF_SUPPPORTED_MASK)
> >               return -EINVAL;
> >
> > +     ret = sock_set_tskey(sk, flags, 1);
> > +     if (ret)
> > +             return ret;
> > +
> >       WRITE_ONCE(sk->sk_tsflags_bpf, flags);
> >
> >       return 0;
>
> I'm a bit hazy on when this can be called. We can assume that this new
> BPF operation cannot race with the existing setsockopt nor with the
> datapath that might touch the atomic fields, right?

It surely can race with the existing setsockopt.

1)
if (only existing setsockopt works) {
        then sk->sk_tskey is set through setsockopt, sk_tskey_bpf_offset is 0.
}

2)
if (only bpf setsockopt works) {
        then sk->sk_tskey is set through bpf_setsockopt, and
        sk_tskey_bpf_offset is 0.
}

3)
if (existing setsockopt already started, here we enable the bpf feature) {
        then sk->sk_tskey will not change, but the sk_tskey_bpf_offset
        will be calculated.
}

4)
if (bpf setsockopt already started, here we enable the application feature) {
        then sk->sk_tskey will be re-initialized/overridden by
        setsockopt, and the sk_tskey_bpf_offset will be calculated.
}

Then the skb tskey will use the sk->sk_tskey like before.

Finally, when we are about to print in the bpf extension (if allowed,
which we check by testing sk_tsflags_bpf), we only need to check whether
sk_tskey_bpf_offset is zero. If it is zero, only the bpf program is
running; if not, sk->sk_tskey serves the application feature and we
need to compute the real bpf tskey. Please see
skb_tstamp_tx_output_bpf().

The above makes sure that the two features can work in parallel. It's
honestly a little bit complicated. Before writing this part, I drew a
few pictures to help me understand how it works.
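
The four cases above can be sketched as a small userspace model. This is
only an illustration of the bookkeeping, not the kernel code: struct
tskey_model and model_enable_opt_id() are hypothetical names, the
per-protocol key selection (snd_una, write_seq or 0) is reduced to a
plain 'tskey' parameter, and the state checks and error handling of
sock_set_tskey() are left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace model of the two socket fields discussed above. */
struct tskey_model {
	uint32_t sk_tskey;            /* shared key base, as in sk->sk_tskey */
	uint32_t sk_tskey_bpf_offset; /* delta kept for the bpf view */
	bool app_opt_id;              /* application set OPT_ID via setsockopt() */
	bool bpf_opt_id;              /* bpf set OPT_ID via bpf_setsockopt() */
};

/* Model of enabling OPT_ID. 'tskey' is the fresh key base at this moment
 * (for TCP, snd_una or write_seq; for other sockets, 0).
 */
static void model_enable_opt_id(struct tskey_model *m, uint32_t tskey,
				bool bpf_type)
{
	if (bpf_type && m->app_opt_id) {
		/* case 3: application came first, keep sk_tskey, record delta */
		m->sk_tskey_bpf_offset = tskey - m->sk_tskey;
	} else if (!bpf_type && m->bpf_opt_id) {
		/* case 4: bpf came first, record delta, then override sk_tskey */
		m->sk_tskey_bpf_offset = m->sk_tskey - tskey;
		m->sk_tskey = tskey;
	} else {
		/* cases 1 and 2: only one user, no delta needed */
		m->sk_tskey_bpf_offset = 0;
		m->sk_tskey = tskey;
	}

	if (bpf_type)
		m->bpf_opt_id = true;
	else
		m->app_opt_id = true;
}
```

The subtraction is done on unsigned 32-bit values on purpose, so in
case 4 the delta may wrap around, the same way a u32 field would.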

Thanks,
Jason


* Re: [PATCH net-next v3 07/14] net-timestamp: add a new triggered point to set sk_tsflags_bpf in UDP layer
  2024-10-29  1:33       ` Willem de Bruijn
@ 2024-10-29  3:12         ` Jason Xing
  2024-10-29 15:04           ` Willem de Bruijn
  0 siblings, 1 reply; 88+ messages in thread
From: Jason Xing @ 2024-10-29  3:12 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: davem, edumazet, kuba, pabeni, dsahern, willemb, ast, daniel,
	andrii, martin.lau, eddyz87, song, yonghong.song, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, shuah, ykolal, bpf, netdev,
	Jason Xing

On Tue, Oct 29, 2024 at 9:33 AM Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
>
> Jason Xing wrote:
> > On Tue, Oct 29, 2024 at 9:07 AM Willem de Bruijn
> > <willemdebruijn.kernel@gmail.com> wrote:
> > >
> > > Jason Xing wrote:
> > > > From: Jason Xing <kernelxing@tencent.com>
> > > >
> > > > This patch behaves like how cmsg feature works, that is to say,
> > > > check and set on each call of udp_sendmsg before passing sk_tsflags_bpf
> > > > to cork tsflags.
> > > >
> > > > Signed-off-by: Jason Xing <kernelxing@tencent.com>
> > > > ---
> > > >  include/net/sock.h             | 1 +
> > > >  include/uapi/linux/bpf.h       | 3 +++
> > > >  net/core/skbuff.c              | 2 +-
> > > >  net/ipv4/udp.c                 | 1 +
> > > >  tools/include/uapi/linux/bpf.h | 3 +++
> > > >  5 files changed, 9 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/include/net/sock.h b/include/net/sock.h
> > > > index 062f405c744e..cf7fea456455 100644
> > > > --- a/include/net/sock.h
> > > > +++ b/include/net/sock.h
> > > > @@ -2828,6 +2828,7 @@ static inline bool sk_listener_or_tw(const struct sock *sk)
> > > >  }
> > > >
> > > >  void sock_enable_timestamp(struct sock *sk, enum sock_flags flag);
> > > > +void timestamp_call_bpf(struct sock *sk, int op, u32 nargs, u32 *args);
> > > >  int sock_recv_errqueue(struct sock *sk, struct msghdr *msg, int len, int level,
> > > >                      int type);
> > > >
> > > > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > > > index 6fc3bd12b650..055ffa7c965c 100644
> > > > --- a/include/uapi/linux/bpf.h
> > > > +++ b/include/uapi/linux/bpf.h
> > > > @@ -7028,6 +7028,9 @@ enum {
> > > >                                        * feature is on. It indicates the
> > > >                                        * recorded timestamp.
> > > >                                        */
> > > > +     BPF_SOCK_OPS_TS_UDP_SND_CB,     /* Called when every udp_sendmsg
> > > > +                                      * syscall is triggered
> > > > +                                      */
> > > >  };
> > > >
> > > >  /* List of TCP states. There is a build check in net/ipv4/tcp.c to detect
> > > > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > > > index 8b2a79c0fe1c..0b571306f7ea 100644
> > > > --- a/net/core/skbuff.c
> > > > +++ b/net/core/skbuff.c
> > > > @@ -5622,7 +5622,7 @@ static void skb_tstamp_tx_output(struct sk_buff *orig_skb,
> > > >       __skb_complete_tx_timestamp(skb, sk, tstype, opt_stats);
> > > >  }
> > > >
> > > > -static void timestamp_call_bpf(struct sock *sk, int op, u32 nargs, u32 *args)
> > > > +void timestamp_call_bpf(struct sock *sk, int op, u32 nargs, u32 *args)
> > > >  {
> > > >       struct bpf_sock_ops_kern sock_ops;
> > > >
> > > > diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> > > > index 9a20af41e272..e768421abc37 100644
> > > > --- a/net/ipv4/udp.c
> > > > +++ b/net/ipv4/udp.c
> > > > @@ -1264,6 +1264,7 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
> > > >       if (!corkreq) {
> > > >               struct inet_cork cork;
> > > >
> > > > +             timestamp_call_bpf(sk, BPF_SOCK_OPS_TS_UDP_SND_CB, 0, NULL);
> > > >               skb = ip_make_skb(sk, fl4, getfrag, msg, ulen,
> > > >                                 sizeof(struct udphdr), &ipc, &rt,
> > > >                                 &cork, msg->msg_flags);
> > > > diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> > > > index 6fc3bd12b650..055ffa7c965c 100644
> > > > --- a/tools/include/uapi/linux/bpf.h
> > > > +++ b/tools/include/uapi/linux/bpf.h
> > > > @@ -7028,6 +7028,9 @@ enum {
> > > >                                        * feature is on. It indicates the
> > > >                                        * recorded timestamp.
> > > >                                        */
> > > > +     BPF_SOCK_OPS_TS_UDP_SND_CB,     /* Called when every udp_sendmsg
> > > > +                                      * syscall is triggered
> > > > +                                      */
> > >
> > > If adding a timestamp as close to syscall entry as possible, give it a
> > > generic name, not specific to UDP.
> >
> > Good suggestion, then it will also solve the remaining issue for TCP type:
> > __when__ we should record the user timestamp which exists in the
> > application SO_TIMESTAMPING feature.
> >
> > >
> > > And please explain in the commit message the reason for a new
> > > timestamp recording point: with existing timestamping the application
> > > can call clock_gettime before (and optionally after) the send call.
> > > An admin using BPF does not have this option, so needs this as part of
> > > the BPF timestamping API.
> >
> > Will revise this part. Thanks for your description!
>
> Actually, I may have misunderstood the intention of this new hook.
>
> I thought it was to record an additional timestamp.

I planned to do it after this series. For now, without the new hook,
it will not work for UDP type.

>
> But it is (also?) to program skb_shared_info.tx_flags based on
> instructions parsed from cmsg in __sock_cmsg_send.

I'm not sure I grasp your key point.

For UDP, skb_shared_info.tx_flags will finally be initialized in
__ip_append_data() based on cork->tx_flags.

cork->tx_flags is computed by sock_tx_timestamp() based on
ipc->sockc.tsflags if cmsg feature is turned on.

__sock_tx_timestamp() uses "flags |= xxx" to initialize the
cork->tx_flags, so that the cork->tx_flags will not be completely
overridden by either the cmsg method or bpf program, that is to say,
the cork->tx_flags can combine both of them.

Then another key point is that we do the check to see which one
actually works in sk_tstamp_tx_flags() by testing sk->sk_tsflags or
sk->sk_tsflags_bpf in patch [2/14]. It guarantees that.
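
To make the "flags |= xxx" point concrete, here is a minimal userspace
sketch of how two independent requests end up combined in one tx_flags
byte. The MODEL_* constants and model_sock_tx_timestamp() are
hypothetical stand-ins chosen for illustration; they are not the kernel
definitions, and hardware timestamping is left out.

```c
#include <stdint.h>

/* Local stand-ins for the request bits and the per-skb tx flag bits. */
#define MODEL_SOF_TX_SOFTWARE	(1U << 1)
#define MODEL_SOF_TX_SCHED	(1U << 8)
#define MODEL_SKBTX_SW_TSTAMP	(1U << 1)
#define MODEL_SKBTX_SCHED_TSTAMP (1U << 6)

/* Translate request flags into tx_flags bits. Existing bits are kept,
 * so calling this once per user accumulates both requests.
 */
static void model_sock_tx_timestamp(uint32_t tsflags, uint8_t *tx_flags)
{
	uint8_t flags = *tx_flags;

	if (tsflags & MODEL_SOF_TX_SOFTWARE)
		flags |= MODEL_SKBTX_SW_TSTAMP;
	if (tsflags & MODEL_SOF_TX_SCHED)
		flags |= MODEL_SKBTX_SCHED_TSTAMP;

	*tx_flags = flags;
}
```

Because the bits are only ever OR-ed in, the order in which the cmsg
path and the bpf program contribute does not matter; deciding which user
gets the report is a separate check done later.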

Thanks,
Jason


* Re: [PATCH net-next v3 14/14] bpf: add simple bpf tests in the tx path for so_timstamping feature
  2024-10-29  1:40       ` Willem de Bruijn
@ 2024-10-29  3:13         ` Jason Xing
  0 siblings, 0 replies; 88+ messages in thread
From: Jason Xing @ 2024-10-29  3:13 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: davem, edumazet, kuba, pabeni, dsahern, willemb, ast, daniel,
	andrii, martin.lau, eddyz87, song, yonghong.song, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, shuah, ykolal, bpf, netdev,
	Jason Xing

On Tue, Oct 29, 2024 at 9:40 AM Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
>
> Jason Xing wrote:
> > On Tue, Oct 29, 2024 at 9:27 AM Willem de Bruijn
> > <willemdebruijn.kernel@gmail.com> wrote:
> > >
> > > Jason Xing wrote:
> > > > From: Jason Xing <kernelxing@tencent.com>
> > > >
> > > > Only check if we pass those three key points after we enable the
> > > > bpf extension for so_timestamping. During each point, we can choose
> > > > whether to print the current timestamp.
> > > >
> > > > Signed-off-by: Jason Xing <kernelxing@tencent.com>
> > > > ---
> > > >  .../bpf/prog_tests/so_timestamping.c          |  98 ++++++++++++++
> > > >  .../selftests/bpf/progs/so_timestamping.c     | 123 ++++++++++++++++++
> > > >  2 files changed, 221 insertions(+)
> > > >  create mode 100644 tools/testing/selftests/bpf/prog_tests/so_timestamping.c
> > > >  create mode 100644 tools/testing/selftests/bpf/progs/so_timestamping.c
> > > >
> > > > diff --git a/tools/testing/selftests/bpf/prog_tests/so_timestamping.c b/tools/testing/selftests/bpf/prog_tests/so_timestamping.c
> > > > new file mode 100644
> > > > index 000000000000..dfb7588c246d
> > > > --- /dev/null
> > > > +++ b/tools/testing/selftests/bpf/prog_tests/so_timestamping.c
> > > > @@ -0,0 +1,98 @@
> > > > +// SPDX-License-Identifier: GPL-2.0
> > > > +/* Copyright (c) 2024 Tencent */
> > > > +
> > > > +#define _GNU_SOURCE
> > > > +#include <sched.h>
> > > > +#include <linux/socket.h>
> > > > +#include <linux/tls.h>
> > > > +#include <net/if.h>
> > > > +
> > > > +#include "test_progs.h"
> > > > +#include "cgroup_helpers.h"
> > > > +#include "network_helpers.h"
> > > > +
> > > > +#include "so_timestamping.skel.h"
> > > > +
> > > > +#define CG_NAME "/so-timestamping-test"
> > > > +
> > > > +static const char addr4_str[] = "127.0.0.1";
> > > > +static const char addr6_str[] = "::1";
> > > > +static struct so_timestamping *skel;
> > > > +static int cg_fd;
> > > > +
> > > > +static int create_netns(void)
> > > > +{
> > > > +     if (!ASSERT_OK(unshare(CLONE_NEWNET), "create netns"))
> > > > +             return -1;
> > > > +
> > > > +     if (!ASSERT_OK(system("ip link set dev lo up"), "set lo up"))
> > > > +             return -1;
> > > > +
> > > > +     return 0;
> > > > +}
> > > > +
> > > > +static void test_tcp(int family)
> > > > +{
> > > > +     struct so_timestamping__bss *bss = skel->bss;
> > > > +     char buf[] = "testing testing";
> > > > +     int sfd = -1, cfd = -1;
> > > > +     int n;
> > > > +
> > > > +     memset(bss, 0, sizeof(*bss));
> > > > +
> > > > +     sfd = start_server(family, SOCK_STREAM,
> > > > +                        family == AF_INET6 ? addr6_str : addr4_str, 0, 0);
> > > > +     if (!ASSERT_GE(sfd, 0, "start_server"))
> > > > +             goto out;
> > > > +
> > > > +     cfd = connect_to_fd(sfd, 0);
> > > > +     if (!ASSERT_GE(cfd, 0, "connect_to_fd_server")) {
> > > > +             close(sfd);
> > > > +             goto out;
> > > > +     }
> > > > +
> > > > +     n = write(cfd, buf, sizeof(buf));
> > > > +     if (!ASSERT_EQ(n, sizeof(buf), "send to server"))
> > > > +             goto out;
> > > > +
> > > > +     ASSERT_EQ(bss->nr_active, 1, "nr_active");
> > > > +     ASSERT_EQ(bss->nr_passive, 1, "nr_passive");
> > > > +     ASSERT_EQ(bss->nr_sched, 1, "nr_sched");
> > > > +     ASSERT_EQ(bss->nr_txsw, 1, "nr_txsw");
> > > > +     ASSERT_EQ(bss->nr_ack, 1, "nr_ack");
> > > > +
> > > > +out:
> > > > +     if (sfd >= 0)
> > > > +             close(sfd);
> > > > +     if (cfd >= 0)
> > > > +             close(cfd);
> > > > +}
> > > > +
> > > > +void test_so_timestamping(void)
> > > > +{
> > > > +     cg_fd = test__join_cgroup(CG_NAME);
> > > > +     if (cg_fd < 0)
> > > > +             return;
> > > > +
> > > > +     if (create_netns())
> > > > +             goto done;
> > > > +
> > > > +     skel = so_timestamping__open();
> > > > +     if (!ASSERT_OK_PTR(skel, "open skel"))
> > > > +             goto done;
> > > > +
> > > > +     if (!ASSERT_OK(so_timestamping__load(skel), "load skel"))
> > > > +             goto done;
> > > > +
> > > > +     skel->links.skops_sockopt =
> > > > +             bpf_program__attach_cgroup(skel->progs.skops_sockopt, cg_fd);
> > > > +     if (!ASSERT_OK_PTR(skel->links.skops_sockopt, "attach cgroup"))
> > > > +             goto done;
> > > > +
> > > > +     test_tcp(AF_INET6);
> > > > +     test_tcp(AF_INET);
> > > > +
> > > > +done:
> > > > +     so_timestamping__destroy(skel);
> > > > +     close(cg_fd);
> > > > +}
> > > > diff --git a/tools/testing/selftests/bpf/progs/so_timestamping.c b/tools/testing/selftests/bpf/progs/so_timestamping.c
> > > > new file mode 100644
> > > > index 000000000000..a15317951786
> > > > --- /dev/null
> > > > +++ b/tools/testing/selftests/bpf/progs/so_timestamping.c
> > > > @@ -0,0 +1,123 @@
> > > > +// SPDX-License-Identifier: GPL-2.0
> > > > +/* Copyright (c) 2024 Tencent */
> > > > +
> > > > +#include "vmlinux.h"
> > > > +#include "bpf_tracing_net.h"
> > > > +#include <bpf/bpf_core_read.h>
> > > > +#include <bpf/bpf_helpers.h>
> > > > +#include <bpf/bpf_tracing.h>
> > > > +#include "bpf_misc.h"
> > > > +
> > > > +#define SO_TIMESTAMPING 37
> > > > +#define SOF_TIMESTAMPING_BPF_SUPPPORTED_MASK (SOF_TIMESTAMPING_SOFTWARE | \
> > > > +                                           SOF_TIMESTAMPING_TX_SCHED | \
> > > > +                                           SOF_TIMESTAMPING_TX_SOFTWARE | \
> > > > +                                           SOF_TIMESTAMPING_TX_ACK | \
> > > > +                                           SOF_TIMESTAMPING_OPT_ID | \
> > > > +                                           SOF_TIMESTAMPING_OPT_ID_TCP)
> > > > +
> > > > +extern unsigned long CONFIG_HZ __kconfig;
> > > > +
> > > > +int nr_active;
> > > > +int nr_passive;
> > > > +int nr_sched;
> > > > +int nr_txsw;
> > > > +int nr_ack;
> > > > +
> > > > +struct sockopt_test {
> > > > +     int opt;
> > > > +     int new;
> > > > +     int expected;
> > > > +};
> > > > +
> > > > +static const struct sockopt_test sol_socket_tests[] = {
> > > > +     { .opt = SO_TIMESTAMPING, .new = SOF_TIMESTAMPING_TX_SCHED, .expected = 256, },
> > > > +     { .opt = SO_TIMESTAMPING, .new = SOF_TIMESTAMPING_BPF_SUPPPORTED_MASK, .expected = 66450, },
> > > > +     { .opt = 0, },
> > > > +};
> > > > +
> > > > +struct loop_ctx {
> > > > +     void *ctx;
> > > > +     struct sock *sk;
> > > > +};
> > > > +
> > > > +static int bpf_test_sockopt_int(void *ctx, struct sock *sk,
> > > > +                             const struct sockopt_test *t,
> > > > +                             int level)
> > > > +{
> > > > +     int tmp, new, expected, opt;
> > > > +
> > > > +     opt = t->opt;
> > > > +     new = t->new;
> > > > +     expected = t->expected;
> > > > +
> > > > +     if (bpf_setsockopt(ctx, level, opt, &new, sizeof(new)))
> > > > +             return 1;
> > > > +     if (bpf_getsockopt(ctx, level, opt, &tmp, sizeof(tmp)) ||
> > > > +         tmp != expected)
> > > > +             return 1;
> > > > +
> > > > +     return 0;
> > > > +}
> > > > +
> > > > +static int bpf_test_socket_sockopt(__u32 i, struct loop_ctx *lc)
> > > > +{
> > > > +     const struct sockopt_test *t;
> > > > +
> > > > +     if (i >= ARRAY_SIZE(sol_socket_tests))
> > > > +             return 1;
> > > > +
> > > > +     t = &sol_socket_tests[i];
> > > > +     if (!t->opt)
> > > > +             return 1;
> > > > +
> > > > +     return bpf_test_sockopt_int(lc->ctx, lc->sk, t, SOL_SOCKET);
> > > > +}
> > > > +
> > > > +static int bpf_test_sockopt(void *ctx, struct sock *sk)
> > > > +{
> > > > +     struct loop_ctx lc = { .ctx = ctx, .sk = sk, };
> > > > +     int n;
> > > > +
> > > > +     n = bpf_loop(ARRAY_SIZE(sol_socket_tests), bpf_test_socket_sockopt, &lc, 0);
> > > > +     if (n != ARRAY_SIZE(sol_socket_tests))
> > > > +             return -1;
> > > > +
> > > > +     return 0;
> > > > +}
> > > > +
> > > > +SEC("sockops")
> > > > +int skops_sockopt(struct bpf_sock_ops *skops)
> > > > +{
> > > > +     struct bpf_sock *bpf_sk = skops->sk;
> > > > +     struct sock *sk;
> > > > +
> > > > +     if (!bpf_sk)
> > > > +             return 1;
> > > > +
> > > > +     sk = (struct sock *)bpf_skc_to_tcp_sock(bpf_sk);
> > > > +     if (!sk)
> > > > +             return 1;
> > > > +
> > > > +     switch (skops->op) {
> > > > +     case BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB:
> > > > +             nr_active += !bpf_test_sockopt(skops, sk);
> > > > +             break;
> > > > +     case BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB:
> > > > +             nr_passive += !bpf_test_sockopt(skops, sk);
> > > > +             break;
> > > > +     case BPF_SOCK_OPS_TS_SCHED_OPT_CB:
> > > > +             nr_sched += 1;
> > > > +             break;
> > > > +     case BPF_SOCK_OPS_TS_SW_OPT_CB:
> > > > +             nr_txsw += 1;
> > > > +             break;
> > > > +     case BPF_SOCK_OPS_TS_ACK_OPT_CB:
> > > > +             nr_ack += 1;
> > > > +             break;
> > >
> > > Perhaps demonstrate what to do with the args on the new
> > > TS_*_OPT_CB.
> >
> > Roger that.
> >
> > I would like to know if the current patch is too big to review? Should
> > I split it into a few patches? But this series has 14 patches right
> > now which could possibly exceed the maximum limit.
>
> For a test patch, this looks fine to me. They are often longer than
> feature patches. But much of it is easy to grasp.

Got it, I will do it as you said.

Thanks,
Jason


* Re: [PATCH net-next v3 10/14] net-timestamp: add basic support with tskey offset
  2024-10-29  2:41     ` Jason Xing
@ 2024-10-29 15:03       ` Willem de Bruijn
  2024-10-29 15:50         ` Jason Xing
  0 siblings, 1 reply; 88+ messages in thread
From: Willem de Bruijn @ 2024-10-29 15:03 UTC (permalink / raw)
  To: Jason Xing, Willem de Bruijn
  Cc: davem, edumazet, kuba, pabeni, dsahern, willemb, ast, daniel,
	andrii, martin.lau, eddyz87, song, yonghong.song, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, shuah, ykolal, bpf, netdev,
	Jason Xing

Jason Xing wrote:
> On Tue, Oct 29, 2024 at 9:24 AM Willem de Bruijn
> <willemdebruijn.kernel@gmail.com> wrote:
> >
> > Jason Xing wrote:
> > > From: Jason Xing <kernelxing@tencent.com>
> > >
> > > Use the offset to record the delta value between current socket key
> > > and bpf socket key.
> > >
> > > 1. If there is only bpf feature running, the socket key is bpf socket
> > > key and the offset is zero;
> > > 2. If there is only traditional feature running, and then bpf feature
> > > is turned on, the socket key is still used by the former while the offset
> > > is the delta between them;
> > > 3. if there is only bpf feature running, and then application uses it,
> > > the socket key would be re-init for application and the offset is the
> > > delta.
> >
> > We need to also figure out the rare conflict when one user sets
> > OPT_ID | OPT_ID_TCP while the other only uses OPT_ID.
> 
> I think the current patch handles the case because:
> 1. sock_calculate_tskey_offset() gets the final key first whether the
> OPT_ID_TCP is set or not.
> 2. we will use that tskey to calculate the delta.

Oh yes of course. Great, then this is resolved.

> > > +static long int sock_calculate_tskey_offset(struct sock *sk, int val, int bpf_type)
> > > +{
> > > +     u32 tskey;
> > > +
> > > +     if (sk_is_tcp(sk)) {
> > > +             if ((1 << sk->sk_state) & (TCPF_CLOSE | TCPF_LISTEN))
> > > +                     return -EINVAL;
> > > +
> > > +             if (val & SOF_TIMESTAMPING_OPT_ID_TCP)
> > > +                     tskey = tcp_sk(sk)->write_seq;
> > > +             else
> > > +                     tskey = tcp_sk(sk)->snd_una;
> > > +     } else {
> > > +             tskey = 0;
> > > +     }
> > > +
> > > +     if (bpf_type && (sk->sk_tsflags & SOF_TIMESTAMPING_OPT_ID)) {
> > > +             sk->sk_tskey_bpf_offset = tskey - atomic_read(&sk->sk_tskey);
> > > +             return 0;
> > > +     } else if (!bpf_type && (sk->sk_tsflags_bpf & SOF_TIMESTAMPING_OPT_ID)) {
> > > +             sk->sk_tskey_bpf_offset = atomic_read(&sk->sk_tskey) - tskey;
> > > +     } else {
> > > +             sk->sk_tskey_bpf_offset = 0;
> > > +     }
> > > +
> > > +     return tskey;
> > > +}
> > > +
> > >  int sock_set_tskey(struct sock *sk, int val, int bpf_type)
> > >  {
> > >       u32 tsflags = bpf_type ? sk->sk_tsflags_bpf : sk->sk_tsflags;
> > > @@ -901,17 +944,13 @@ int sock_set_tskey(struct sock *sk, int val, int bpf_type)
> > >
> > >       if (val & SOF_TIMESTAMPING_OPT_ID &&
> > >           !(tsflags & SOF_TIMESTAMPING_OPT_ID)) {
> > > -             if (sk_is_tcp(sk)) {
> > > -                     if ((1 << sk->sk_state) &
> > > -                         (TCPF_CLOSE | TCPF_LISTEN))
> > > -                             return -EINVAL;
> > > -                     if (val & SOF_TIMESTAMPING_OPT_ID_TCP)
> > > -                             atomic_set(&sk->sk_tskey, tcp_sk(sk)->write_seq);
> > > -                     else
> > > -                             atomic_set(&sk->sk_tskey, tcp_sk(sk)->snd_una);
> > > -             } else {
> > > -                     atomic_set(&sk->sk_tskey, 0);
> > > -             }
> > > +             long int ret;
> > > +
> > > +             ret = sock_calculate_tskey_offset(sk, val, bpf_type);
> > > +             if (ret <= 0)
> > > +                     return ret;
> > > +
> > > +             atomic_set(&sk->sk_tskey, ret);
> > >       }
> > >
> > >       return 0;
> > > @@ -956,10 +995,15 @@ static int sock_set_timestamping_bpf(struct sock *sk,
> > >                                    struct so_timestamping timestamping)
> > >  {
> > >       u32 flags = timestamping.flags;
> > > +     int ret;
> > >
> > >       if (flags & ~SOF_TIMESTAMPING_BPF_SUPPPORTED_MASK)
> > >               return -EINVAL;
> > >
> > > +     ret = sock_set_tskey(sk, flags, 1);
> > > +     if (ret)
> > > +             return ret;
> > > +
> > >       WRITE_ONCE(sk->sk_tsflags_bpf, flags);
> > >
> > >       return 0;
> >
> > I'm a bit hazy on when this can be called. We can assume that this new
> > BPF operation cannot race with the existing setsockopt nor with the
> > datapath that might touch the atomic fields, right?
> 
> It surely can race with the existing setsockopt.
> 
> 1)
> if (only existing setsockopt works) {
>         then sk->sk_tskey is set through setsockopt, sk_tskey_bpf_offset is 0.
> }
> 
> 2)
> if (only bpf setsockopt works) {
>         then sk->sk_tskey is set through bpf_setsockopt,
> sk_tskey_bpf_offset is 0.
> }
> 
> 3)
> if (existing setsockopt already started, here we enable the bpf feature) {
>         then sk->sk_tskey will not change, but the sk_tskey_bpf_offset
> will be calculated.
> }
> 
> 4)
> if (bpf setsockopt already started, here we enable the application feature) {
>         then sk->sk_tskey will be re-initialized/overridden by
> setsockopt, and the sk_tskey_bpf_offset will be calculated.
> }
> 
> Then the skb tskey will use the sk->sk_tskey like before.

I mean race as in the setsockopt and bpf setsockopt and datapath
running concurrently.

As long as both variants of setsockopt hold the socket lock, that
won't happen.

The datapath is lockless for UDP, so atomic_inc sk_tskey can race
with calculating the difference. But this is a known issue. A process
that cares should not run setsockopt and send concurrently. So this is
fine too.
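
To make the four cases quoted above concrete, here is a small user-space
model in C. It is only a sketch under stated assumptions: struct model_sock,
model_enable_opt_id() and model_bpf_key() are made-up names, OPT_ID is a
stand-in bit rather than the real SOF_TIMESTAMPING_OPT_ID value, and the
offset is kept as an unsigned 32-bit value because the real field type is
not shown in this thread. The relation "BPF key = shared key + offset" is
inferred from the arithmetic in the patch, not stated in it.

```c
#include <stdint.h>

#define OPT_ID 0x1u /* stand-in bit, not the real SOF_TIMESTAMPING_OPT_ID value */

/* Hypothetical user-space stand-in for the struct sock fields involved. */
struct model_sock {
	uint32_t tsflags;     /* application flags, models sk_tsflags */
	uint32_t tsflags_bpf; /* BPF flags, models sk_tsflags_bpf */
	uint32_t tskey;       /* shared counter, models sk_tskey */
	uint32_t bpf_offset;  /* models sk_tskey_bpf_offset, modulo 2^32 */
};

/*
 * Enable OPT_ID for one user. 'fresh' is the key that user would start
 * from on its own (write_seq or snd_una for TCP, 0 for UDP).
 */
static void model_enable_opt_id(struct model_sock *sk, int bpf_type,
				uint32_t fresh)
{
	uint32_t *mine = bpf_type ? &sk->tsflags_bpf : &sk->tsflags;
	uint32_t other = bpf_type ? sk->tsflags : sk->tsflags_bpf;

	if (*mine & OPT_ID)
		return; /* already enabled for this user, nothing to do */

	if (bpf_type && (other & OPT_ID)) {
		/* case 3: the application keeps the counter, BPF gets a delta */
		sk->bpf_offset = fresh - sk->tskey;
	} else if (!bpf_type && (other & OPT_ID)) {
		/* case 4: the application takes over the counter */
		sk->bpf_offset = sk->tskey - fresh;
		sk->tskey = fresh;
	} else {
		/* cases 1 and 2: a single user, no delta needed */
		sk->bpf_offset = 0;
		sk->tskey = fresh;
	}
	*mine |= OPT_ID;
}

/* Assumed relation: the BPF view of the key is the shared key plus the delta. */
static uint32_t model_bpf_key(const struct model_sock *sk)
{
	return sk->tskey + sk->bpf_offset;
}
```

In this model, whichever user enables OPT_ID second ends up seeing the key
it would have started from on its own, and the first user's numbering is
either left untouched (case 3) or preserved through the offset (case 4).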




* Re: [PATCH net-next v3 07/14] net-timestamp: add a new triggered point to set sk_tsflags_bpf in UDP layer
  2024-10-29  3:12         ` Jason Xing
@ 2024-10-29 15:04           ` Willem de Bruijn
  2024-10-29 15:44             ` Jason Xing
  0 siblings, 1 reply; 88+ messages in thread
From: Willem de Bruijn @ 2024-10-29 15:04 UTC (permalink / raw)
  To: Jason Xing, Willem de Bruijn
  Cc: davem, edumazet, kuba, pabeni, dsahern, willemb, ast, daniel,
	andrii, martin.lau, eddyz87, song, yonghong.song, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, shuah, ykolal, bpf, netdev,
	Jason Xing

Jason Xing wrote:
> On Tue, Oct 29, 2024 at 9:33 AM Willem de Bruijn
> <willemdebruijn.kernel@gmail.com> wrote:
> >
> > Jason Xing wrote:
> > > On Tue, Oct 29, 2024 at 9:07 AM Willem de Bruijn
> > > <willemdebruijn.kernel@gmail.com> wrote:
> > > >
> > > > Jason Xing wrote:
> > > > > From: Jason Xing <kernelxing@tencent.com>
> > > > >
> > > > > This patch behaves like how cmsg feature works, that is to say,
> > > > > check and set on each call of udp_sendmsg before passing sk_tsflags_bpf
> > > > > to cork tsflags.
> > > > >
> > > > > Signed-off-by: Jason Xing <kernelxing@tencent.com>
> > > > > ---
> > > > >  include/net/sock.h             | 1 +
> > > > >  include/uapi/linux/bpf.h       | 3 +++
> > > > >  net/core/skbuff.c              | 2 +-
> > > > >  net/ipv4/udp.c                 | 1 +
> > > > >  tools/include/uapi/linux/bpf.h | 3 +++
> > > > >  5 files changed, 9 insertions(+), 1 deletion(-)
> > > > >
> > > > > diff --git a/include/net/sock.h b/include/net/sock.h
> > > > > index 062f405c744e..cf7fea456455 100644
> > > > > --- a/include/net/sock.h
> > > > > +++ b/include/net/sock.h
> > > > > @@ -2828,6 +2828,7 @@ static inline bool sk_listener_or_tw(const struct sock *sk)
> > > > >  }
> > > > >
> > > > >  void sock_enable_timestamp(struct sock *sk, enum sock_flags flag);
> > > > > +void timestamp_call_bpf(struct sock *sk, int op, u32 nargs, u32 *args);
> > > > >  int sock_recv_errqueue(struct sock *sk, struct msghdr *msg, int len, int level,
> > > > >                      int type);
> > > > >
> > > > > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > > > > index 6fc3bd12b650..055ffa7c965c 100644
> > > > > --- a/include/uapi/linux/bpf.h
> > > > > +++ b/include/uapi/linux/bpf.h
> > > > > @@ -7028,6 +7028,9 @@ enum {
> > > > >                                        * feature is on. It indicates the
> > > > >                                        * recorded timestamp.
> > > > >                                        */
> > > > > +     BPF_SOCK_OPS_TS_UDP_SND_CB,     /* Called when every udp_sendmsg
> > > > > +                                      * syscall is triggered
> > > > > +                                      */
> > > > >  };
> > > > >
> > > > >  /* List of TCP states. There is a build check in net/ipv4/tcp.c to detect
> > > > > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > > > > index 8b2a79c0fe1c..0b571306f7ea 100644
> > > > > --- a/net/core/skbuff.c
> > > > > +++ b/net/core/skbuff.c
> > > > > @@ -5622,7 +5622,7 @@ static void skb_tstamp_tx_output(struct sk_buff *orig_skb,
> > > > >       __skb_complete_tx_timestamp(skb, sk, tstype, opt_stats);
> > > > >  }
> > > > >
> > > > > -static void timestamp_call_bpf(struct sock *sk, int op, u32 nargs, u32 *args)
> > > > > +void timestamp_call_bpf(struct sock *sk, int op, u32 nargs, u32 *args)
> > > > >  {
> > > > >       struct bpf_sock_ops_kern sock_ops;
> > > > >
> > > > > diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> > > > > index 9a20af41e272..e768421abc37 100644
> > > > > --- a/net/ipv4/udp.c
> > > > > +++ b/net/ipv4/udp.c
> > > > > @@ -1264,6 +1264,7 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
> > > > >       if (!corkreq) {
> > > > >               struct inet_cork cork;
> > > > >
> > > > > +             timestamp_call_bpf(sk, BPF_SOCK_OPS_TS_UDP_SND_CB, 0, NULL);
> > > > >               skb = ip_make_skb(sk, fl4, getfrag, msg, ulen,
> > > > >                                 sizeof(struct udphdr), &ipc, &rt,
> > > > >                                 &cork, msg->msg_flags);
> > > > > diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> > > > > index 6fc3bd12b650..055ffa7c965c 100644
> > > > > --- a/tools/include/uapi/linux/bpf.h
> > > > > +++ b/tools/include/uapi/linux/bpf.h
> > > > > @@ -7028,6 +7028,9 @@ enum {
> > > > >                                        * feature is on. It indicates the
> > > > >                                        * recorded timestamp.
> > > > >                                        */
> > > > > +     BPF_SOCK_OPS_TS_UDP_SND_CB,     /* Called when every udp_sendmsg
> > > > > +                                      * syscall is triggered
> > > > > +                                      */
> > > >
> > > > If adding a timestamp as close to syscall entry as possible, give it a
> > > > generic name, not specific to UDP.
> > >
> > > Good suggestion, then it will also solve the remaining issue for TCP type:
> > > __when__ we should record the user timestamp which exists in the
> > > application SO_TIMESTAMPING feature.
> > >
> > > >
> > > > And please explain in the commit message the reason for a new
> > > > timestamp recording point: with existing timestamping the application
> > > > can call clock_gettime before (and optionally after) the send call.
> > > > An admin using BPF does not have this option, so needs this as part of
> > > > the BPF timestamping API.
> > >
> > > Will revise this part. Thanks for your description!
> >
> > Actually, I may have misunderstood the intention of this new hook.
> >
> > I thought it was to record an additional timestamp.
> 
> I planned to do it after this series. For now, without the new hook,
> it will not work for UDP type.

Why not? This is something specific to the SK BPF hooks, I suppose?

As soon as bpf_setsockopt is called, the timestamp callbacks should
start getting called?

> >
> > But it is (also?) to program skb_shared_info.tx_flags based on
> > instructions parsed from cmsg in __sock_cmsg_send.
> 
> I'm not sure if I grasp the key point you said.
> 
> For UDP, skb_shared_info.tx_flags will finally be initialized in
> __ip_append_data() based on cork->tx_flags.
> 
> cork->tx_flags is computed by sock_tx_timestamp() based on
> ipc->sockc.tsflags if cmsg feature is turned on.
> 
> __sock_tx_timestamp() uses "flags |= xxx" to initialize the
> cork->tx_flags, so that the cork->tx_flags will not be completely
> overridden by either the cmsg method or bpf program, that is to say,
> the cork->tx_flags can combine both of them.
> 
> Then another key point is that we do the check to see which one
> actually works in sk_tstamp_tx_flags() by testing sk->sk_tsflags or
> sk->sk_tsflags_bpf in patch [2/14]. It guarantees that.

Ack, thanks. So I was mistaken the second time around.
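
For readers following along, the flag handling described in the quoted
text can be modelled roughly as below. This is a sketch, not kernel code:
the REQ_* and TX_* bit values are illustrative and do not match the real
SOF_TIMESTAMPING_* or SKBTX_* definitions, and model_add_tx_flags() and
model_wants_report() are hypothetical helpers. It shows only the two
points made above: each requester ORs its bits into the per-packet flags,
and the per-user tsflags decide who gets the report.

```c
#include <stdint.h>

/* Illustrative bits only, not the kernel's SOF_TIMESTAMPING_* or SKBTX_* values. */
#define REQ_TX_SOFTWARE 0x1u
#define REQ_TX_SCHED    0x2u
#define TX_SW_TSTAMP    0x1u
#define TX_SCHED_TSTAMP 0x2u

/* OR the bits one requester asks for into the per-packet flags. */
static uint8_t model_add_tx_flags(uint8_t tx_flags, uint32_t tsflags)
{
	if (tsflags & REQ_TX_SOFTWARE)
		tx_flags |= TX_SW_TSTAMP;
	if (tsflags & REQ_TX_SCHED)
		tx_flags |= TX_SCHED_TSTAMP;
	return tx_flags;
}

/* At report time, a user only gets a timestamp it asked for itself. */
static int model_wants_report(uint32_t own_tsflags, uint32_t req_bit)
{
	return (own_tsflags & req_bit) != 0;
}
```

Calling model_add_tx_flags() once with the application's flags and once
with the BPF flags leaves both sets of bits in place, which is the
"flags |= xxx" behaviour; neither call clears what the other one set.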


* Re: [PATCH net-next v3 07/14] net-timestamp: add a new triggered point to set sk_tsflags_bpf in UDP layer
  2024-10-29 15:04           ` Willem de Bruijn
@ 2024-10-29 15:44             ` Jason Xing
  0 siblings, 0 replies; 88+ messages in thread
From: Jason Xing @ 2024-10-29 15:44 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: davem, edumazet, kuba, pabeni, dsahern, willemb, ast, daniel,
	andrii, martin.lau, eddyz87, song, yonghong.song, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, shuah, ykolal, bpf, netdev,
	Jason Xing

On Tue, Oct 29, 2024 at 11:04 PM Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
>
> Jason Xing wrote:
> > On Tue, Oct 29, 2024 at 9:33 AM Willem de Bruijn
> > <willemdebruijn.kernel@gmail.com> wrote:
> > >
> > > Jason Xing wrote:
> > > > On Tue, Oct 29, 2024 at 9:07 AM Willem de Bruijn
> > > > <willemdebruijn.kernel@gmail.com> wrote:
> > > > >
> > > > > Jason Xing wrote:
> > > > > > From: Jason Xing <kernelxing@tencent.com>
> > > > > >
> > > > > > This patch behaves like how cmsg feature works, that is to say,
> > > > > > check and set on each call of udp_sendmsg before passing sk_tsflags_bpf
> > > > > > to cork tsflags.
> > > > > >
> > > > > > Signed-off-by: Jason Xing <kernelxing@tencent.com>
> > > > > > ---
> > > > > >  include/net/sock.h             | 1 +
> > > > > >  include/uapi/linux/bpf.h       | 3 +++
> > > > > >  net/core/skbuff.c              | 2 +-
> > > > > >  net/ipv4/udp.c                 | 1 +
> > > > > >  tools/include/uapi/linux/bpf.h | 3 +++
> > > > > >  5 files changed, 9 insertions(+), 1 deletion(-)
> > > > > >
> > > > > > diff --git a/include/net/sock.h b/include/net/sock.h
> > > > > > index 062f405c744e..cf7fea456455 100644
> > > > > > --- a/include/net/sock.h
> > > > > > +++ b/include/net/sock.h
> > > > > > @@ -2828,6 +2828,7 @@ static inline bool sk_listener_or_tw(const struct sock *sk)
> > > > > >  }
> > > > > >
> > > > > >  void sock_enable_timestamp(struct sock *sk, enum sock_flags flag);
> > > > > > +void timestamp_call_bpf(struct sock *sk, int op, u32 nargs, u32 *args);
> > > > > >  int sock_recv_errqueue(struct sock *sk, struct msghdr *msg, int len, int level,
> > > > > >                      int type);
> > > > > >
> > > > > > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > > > > > index 6fc3bd12b650..055ffa7c965c 100644
> > > > > > --- a/include/uapi/linux/bpf.h
> > > > > > +++ b/include/uapi/linux/bpf.h
> > > > > > @@ -7028,6 +7028,9 @@ enum {
> > > > > >                                        * feature is on. It indicates the
> > > > > >                                        * recorded timestamp.
> > > > > >                                        */
> > > > > > +     BPF_SOCK_OPS_TS_UDP_SND_CB,     /* Called when every udp_sendmsg
> > > > > > +                                      * syscall is triggered
> > > > > > +                                      */
> > > > > >  };
> > > > > >
> > > > > >  /* List of TCP states. There is a build check in net/ipv4/tcp.c to detect
> > > > > > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > > > > > index 8b2a79c0fe1c..0b571306f7ea 100644
> > > > > > --- a/net/core/skbuff.c
> > > > > > +++ b/net/core/skbuff.c
> > > > > > @@ -5622,7 +5622,7 @@ static void skb_tstamp_tx_output(struct sk_buff *orig_skb,
> > > > > >       __skb_complete_tx_timestamp(skb, sk, tstype, opt_stats);
> > > > > >  }
> > > > > >
> > > > > > -static void timestamp_call_bpf(struct sock *sk, int op, u32 nargs, u32 *args)
> > > > > > +void timestamp_call_bpf(struct sock *sk, int op, u32 nargs, u32 *args)
> > > > > >  {
> > > > > >       struct bpf_sock_ops_kern sock_ops;
> > > > > >
> > > > > > diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> > > > > > index 9a20af41e272..e768421abc37 100644
> > > > > > --- a/net/ipv4/udp.c
> > > > > > +++ b/net/ipv4/udp.c
> > > > > > @@ -1264,6 +1264,7 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
> > > > > >       if (!corkreq) {
> > > > > >               struct inet_cork cork;
> > > > > >
> > > > > > +             timestamp_call_bpf(sk, BPF_SOCK_OPS_TS_UDP_SND_CB, 0, NULL);
> > > > > >               skb = ip_make_skb(sk, fl4, getfrag, msg, ulen,
> > > > > >                                 sizeof(struct udphdr), &ipc, &rt,
> > > > > >                                 &cork, msg->msg_flags);
> > > > > > diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> > > > > > index 6fc3bd12b650..055ffa7c965c 100644
> > > > > > --- a/tools/include/uapi/linux/bpf.h
> > > > > > +++ b/tools/include/uapi/linux/bpf.h
> > > > > > @@ -7028,6 +7028,9 @@ enum {
> > > > > >                                        * feature is on. It indicates the
> > > > > >                                        * recorded timestamp.
> > > > > >                                        */
> > > > > > +     BPF_SOCK_OPS_TS_UDP_SND_CB,     /* Called when every udp_sendmsg
> > > > > > +                                      * syscall is triggered
> > > > > > +                                      */
> > > > >
> > > > > If adding a timestamp as close to syscall entry as possible, give it a
> > > > > generic name, not specific to UDP.
> > > >
> > > > Good suggestion, then it will also solve the remaining issue for TCP type:
> > > > __when__ we should record the user timestamp which exists in the
> > > > application SO_TIMESTAMPING feature.
> > > >
> > > > >
> > > > > And please explain in the commit message the reason for a new
> > > > > timestamp recording point: with existing timestamping the application
> > > > > can call clock_gettime before (and optionally after) the send call.
> > > > > An admin using BPF does not have this option, so needs this as part of
> > > > > the BPF timestamping API.
> > > >
> > > > Will revise this part. Thanks for your description!
> > >
> > > Actually, I may have misunderstood the intention of this new hook.
> > >
> > > I thought it was to record an additional timestamp.
> >
> > I planned to do it after this series. For now, without the new hook,
> > it will not work for UDP type.
>
> Why not? This is something specific to the SK BPF hooks, I suppose?

I mean both hooks (one for UDP, one for USR time) are significant.

>
> As soon as bpf_setsockopt is called, the timestamp callbacks should
> start getting called?

Right, but the question is when to trigger the call of
bpf_setsockopt() for the UDP proto. The current patch is trying to
deal with that.

>
> > >
> > > But it is (also?) to program skb_shared_info.tx_flags based on
> > > instructions parsed from cmsg in __sock_cmsg_send.
> >
> > I'm not sure if I grasp the key point you said.
> >
> > For UDP, skb_shared_info.tx_flags will finally be initialized in
> > __ip_append_data() based on cork->tx_flags.
> >
> > cork->tx_flags is computed by sock_tx_timestamp() based on
> > ipc->sockc.tsflags if cmsg feature is turned on.
> >
> > __sock_tx_timestamp() uses "flags |= xxx" to initialize the
> > cork->tx_flags, so that the cork->tx_flags will not be completely
> > overridden by either the cmsg method or bpf program, that is to say,
> > the cork->tx_flags can combine both of them.
> >
> > Then another key point is that we do the check to see which one
> > actually works in sk_tstamp_tx_flags() by testing sk->sk_tsflags or
> > sk->sk_tsflags_bpf in patch [2/14]. It guarantees that.
>
> Ack, thanks. So I was mistaken the second time around.

Thanks for your review :)

Thanks,
Jason


* Re: [PATCH net-next v3 10/14] net-timestamp: add basic support with tskey offset
  2024-10-29 15:03       ` Willem de Bruijn
@ 2024-10-29 15:50         ` Jason Xing
  2024-10-29 19:45           ` Willem de Bruijn
  0 siblings, 1 reply; 88+ messages in thread
From: Jason Xing @ 2024-10-29 15:50 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: davem, edumazet, kuba, pabeni, dsahern, willemb, ast, daniel,
	andrii, martin.lau, eddyz87, song, yonghong.song, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, shuah, ykolal, bpf, netdev,
	Jason Xing

On Tue, Oct 29, 2024 at 11:03 PM Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
>
> Jason Xing wrote:
> > On Tue, Oct 29, 2024 at 9:24 AM Willem de Bruijn
> > <willemdebruijn.kernel@gmail.com> wrote:
> > >
> > > Jason Xing wrote:
> > > > From: Jason Xing <kernelxing@tencent.com>
> > > >
> > > > Use the offset to record the delta value between current socket key
> > > > and bpf socket key.
> > > >
> > > > 1. If there is only bpf feature running, the socket key is bpf socket
> > > > key and the offset is zero;
> > > > 2. If there is only traditional feature running, and then bpf feature
> > > > is turned on, the socket key is still used by the former while the offset
> > > > is the delta between them;
> > > > 3. if there is only bpf feature running, and then application uses it,
> > > > the socket key would be re-init for application and the offset is the
> > > > delta.
> > >
> > > We need to also figure out the rare conflict when one user sets
> > > OPT_ID | OPT_ID_TCP while the other only uses OPT_ID.
> >
> > I think the current patch handles the case because:
> > 1. sock_calculate_tskey_offset() gets the final key first whether the
> > OPT_ID_TCP is set or not.
> > 2. we will use that tskey to calculate the delta.
>
> Oh yes of course. Great, then this is resolved.
>
> > > > +static long int sock_calculate_tskey_offset(struct sock *sk, int val, int bpf_type)
> > > > +{
> > > > +     u32 tskey;
> > > > +
> > > > +     if (sk_is_tcp(sk)) {
> > > > +             if ((1 << sk->sk_state) & (TCPF_CLOSE | TCPF_LISTEN))
> > > > +                     return -EINVAL;
> > > > +
> > > > +             if (val & SOF_TIMESTAMPING_OPT_ID_TCP)
> > > > +                     tskey = tcp_sk(sk)->write_seq;
> > > > +             else
> > > > +                     tskey = tcp_sk(sk)->snd_una;
> > > > +     } else {
> > > > +             tskey = 0;
> > > > +     }
> > > > +
> > > > +     if (bpf_type && (sk->sk_tsflags & SOF_TIMESTAMPING_OPT_ID)) {
> > > > +             sk->sk_tskey_bpf_offset = tskey - atomic_read(&sk->sk_tskey);
> > > > +             return 0;
> > > > +     } else if (!bpf_type && (sk->sk_tsflags_bpf & SOF_TIMESTAMPING_OPT_ID)) {
> > > > +             sk->sk_tskey_bpf_offset = atomic_read(&sk->sk_tskey) - tskey;
> > > > +     } else {
> > > > +             sk->sk_tskey_bpf_offset = 0;
> > > > +     }
> > > > +
> > > > +     return tskey;
> > > > +}
> > > > +
> > > >  int sock_set_tskey(struct sock *sk, int val, int bpf_type)
> > > >  {
> > > >       u32 tsflags = bpf_type ? sk->sk_tsflags_bpf : sk->sk_tsflags;
> > > > @@ -901,17 +944,13 @@ int sock_set_tskey(struct sock *sk, int val, int bpf_type)
> > > >
> > > >       if (val & SOF_TIMESTAMPING_OPT_ID &&
> > > >           !(tsflags & SOF_TIMESTAMPING_OPT_ID)) {
> > > > -             if (sk_is_tcp(sk)) {
> > > > -                     if ((1 << sk->sk_state) &
> > > > -                         (TCPF_CLOSE | TCPF_LISTEN))
> > > > -                             return -EINVAL;
> > > > -                     if (val & SOF_TIMESTAMPING_OPT_ID_TCP)
> > > > -                             atomic_set(&sk->sk_tskey, tcp_sk(sk)->write_seq);
> > > > -                     else
> > > > -                             atomic_set(&sk->sk_tskey, tcp_sk(sk)->snd_una);
> > > > -             } else {
> > > > -                     atomic_set(&sk->sk_tskey, 0);
> > > > -             }
> > > > +             long int ret;
> > > > +
> > > > +             ret = sock_calculate_tskey_offset(sk, val, bpf_type);
> > > > +             if (ret <= 0)
> > > > +                     return ret;
> > > > +
> > > > +             atomic_set(&sk->sk_tskey, ret);
> > > >       }
> > > >
> > > >       return 0;
> > > > @@ -956,10 +995,15 @@ static int sock_set_timestamping_bpf(struct sock *sk,
> > > >                                    struct so_timestamping timestamping)
> > > >  {
> > > >       u32 flags = timestamping.flags;
> > > > +     int ret;
> > > >
> > > >       if (flags & ~SOF_TIMESTAMPING_BPF_SUPPPORTED_MASK)
> > > >               return -EINVAL;
> > > >
> > > > +     ret = sock_set_tskey(sk, flags, 1);
> > > > +     if (ret)
> > > > +             return ret;
> > > > +
> > > >       WRITE_ONCE(sk->sk_tsflags_bpf, flags);
> > > >
> > > >       return 0;
> > >
> > > I'm a bit hazy on when this can be called. We can assume that this new
> > > BPF operation cannot race with the existing setsockopt nor with the
> > > datapath that might touch the atomic fields, right?
> >
> > It surely can race with the existing setsockopt.
> >
> > 1)
> > if (only existing setsockopt works) {
> >         then sk->sk_tskey is set through setsockopt, sk_tskey_bpf_offset is 0.
> > }
> >
> > 2)
> > if (only bpf setsockopt works) {
> >         then sk->sk_tskey is set through bpf_setsockopt,
> > sk_tskey_bpf_offset is 0.
> > }
> >
> > 3)
> > if (existing setsockopt already started, here we enable the bpf feature) {
> >         then sk->sk_tskey will not change, but the sk_tskey_bpf_offset
> > will be calculated.
> > }
> >
> > 4)
> > if (bpf setsockopt already started, here we enable the application feature) {
> >         then sk->sk_tskey will be re-initialized/overridden by
> > setsockopt, and the sk_tskey_bpf_offset will be calculated.
> > }

I will copy the above to the commit message next time in order to
provide a clear design to future readers.

> >
> > Then the skb tskey will use the sk->sk_tskey like before.
>
> I mean race as in the setsockopt and bpf setsockopt and datapath
> running concurrently.
>
> As long as both variants of setsockopt hold the socket lock, that
> won't happen.
>
> The datapath is lockless for UDP, so atomic_inc sk_tskey can race
> with calculating the difference. But this is a known issue. A process
> that cares should not run setsockopt and send concurrently. So this is
> fine too.

Oh, now I see. Thanks for the detailed explanation! So do you feel that
we need to take care of this in the future, I mean, after this series
gets merged...?

Thanks,
Jason


* Re: [PATCH net-next v3 10/14] net-timestamp: add basic support with tskey offset
  2024-10-29 15:50         ` Jason Xing
@ 2024-10-29 19:45           ` Willem de Bruijn
  2024-10-30  3:27             ` Jason Xing
  0 siblings, 1 reply; 88+ messages in thread
From: Willem de Bruijn @ 2024-10-29 19:45 UTC (permalink / raw)
  To: Jason Xing, Willem de Bruijn
  Cc: davem, edumazet, kuba, pabeni, dsahern, willemb, ast, daniel,
	andrii, martin.lau, eddyz87, song, yonghong.song, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, shuah, ykolal, bpf, netdev,
	Jason Xing

> > > > > +static long int sock_calculate_tskey_offset(struct sock *sk, int val, int bpf_type)
> > > > > +{
> > > > > +     u32 tskey;
> > > > > +
> > > > > +     if (sk_is_tcp(sk)) {
> > > > > +             if ((1 << sk->sk_state) & (TCPF_CLOSE | TCPF_LISTEN))
> > > > > +                     return -EINVAL;
> > > > > +
> > > > > +             if (val & SOF_TIMESTAMPING_OPT_ID_TCP)
> > > > > +                     tskey = tcp_sk(sk)->write_seq;
> > > > > +             else
> > > > > +                     tskey = tcp_sk(sk)->snd_una;
> > > > > +     } else {
> > > > > +             tskey = 0;
> > > > > +     }
> > > > > +
> > > > > +     if (bpf_type && (sk->sk_tsflags & SOF_TIMESTAMPING_OPT_ID)) {
> > > > > +             sk->sk_tskey_bpf_offset = tskey - atomic_read(&sk->sk_tskey);
> > > > > +             return 0;
> > > > > +     } else if (!bpf_type && (sk->sk_tsflags_bpf & SOF_TIMESTAMPING_OPT_ID)) {
> > > > > +             sk->sk_tskey_bpf_offset = atomic_read(&sk->sk_tskey) - tskey;
> > > > > +     } else {
> > > > > +             sk->sk_tskey_bpf_offset = 0;
> > > > > +     }
> > > > > +
> > > > > +     return tskey;
> > > > > +}
> > > > > +
> > > > >  int sock_set_tskey(struct sock *sk, int val, int bpf_type)
> > > > >  {
> > > > >       u32 tsflags = bpf_type ? sk->sk_tsflags_bpf : sk->sk_tsflags;
> > > > > @@ -901,17 +944,13 @@ int sock_set_tskey(struct sock *sk, int val, int bpf_type)
> > > > >
> > > > >       if (val & SOF_TIMESTAMPING_OPT_ID &&
> > > > >           !(tsflags & SOF_TIMESTAMPING_OPT_ID)) {
> > > > > -             if (sk_is_tcp(sk)) {
> > > > > -                     if ((1 << sk->sk_state) &
> > > > > -                         (TCPF_CLOSE | TCPF_LISTEN))
> > > > > -                             return -EINVAL;
> > > > > -                     if (val & SOF_TIMESTAMPING_OPT_ID_TCP)
> > > > > -                             atomic_set(&sk->sk_tskey, tcp_sk(sk)->write_seq);
> > > > > -                     else
> > > > > -                             atomic_set(&sk->sk_tskey, tcp_sk(sk)->snd_una);
> > > > > -             } else {
> > > > > -                     atomic_set(&sk->sk_tskey, 0);
> > > > > -             }
> > > > > +             long int ret;
> > > > > +
> > > > > +             ret = sock_calculate_tskey_offset(sk, val, bpf_type);
> > > > > +             if (ret <= 0)
> > > > > +                     return ret;
> > > > > +
> > > > > +             atomic_set(&sk->sk_tskey, ret);
> > > > >       }
> > > > >
> > > > >       return 0;
> > > > > @@ -956,10 +995,15 @@ static int sock_set_timestamping_bpf(struct sock *sk,
> > > > >                                    struct so_timestamping timestamping)
> > > > >  {
> > > > >       u32 flags = timestamping.flags;
> > > > > +     int ret;
> > > > >
> > > > >       if (flags & ~SOF_TIMESTAMPING_BPF_SUPPPORTED_MASK)
> > > > >               return -EINVAL;
> > > > >
> > > > > +     ret = sock_set_tskey(sk, flags, 1);
> > > > > +     if (ret)
> > > > > +             return ret;
> > > > > +
> > > > >       WRITE_ONCE(sk->sk_tsflags_bpf, flags);
> > > > >
> > > > >       return 0;
> > > >
> > > > I'm a bit hazy on when this can be called. We can assume that this new
> > > > BPF operation cannot race with the existing setsockopt nor with the
> > > > datapath that might touch the atomic fields, right?
> > >
> > > It surely can race with the existing setsockopt.
> > >
> > > 1)
> > > if (only existing setsockopt works) {
> > >         then sk->sk_tskey is set through setsockopt, sk_tskey_bpf_offset is 0.
> > > }
> > >
> > > 2)
> > > if (only bpf setsockopt works) {
> > >         then sk->sk_tskey is set through bpf_setsockopt,
> > > sk_tskey_bpf_offset is 0.
> > > }
> > >
> > > 3)
> > > if (existing setsockopt already started, here we enable the bpf feature) {
> > >         then sk->sk_tskey will not change, but the sk_tskey_bpf_offset
> > > will be calculated.
> > > }
> > >
> > > 4)
> > > if (bpf setsockopt already started, here we enable the application feature) {
> > >         then sk->sk_tskey will re-initialized/overridden by
> > > setsockopt, and the sk_tskey_bpf_offset will be calculated.
> > > }
> 
> I will copy the above to the commit message next time in order to
> provide a clear design to future readers.
> 
> > >
> > > Then the skb tskey will use the sk->sk_tskey like before.
> >
> > I mean race as in the setsockopt and bpf setsockopt and datapath
> > running concurrently.
> >
> > As long as both variants of setsockopt hold the socket lock, that
> > won't happen.
> >
> > The datapath is lockless for UDP, so atomic_inc sk_tskey can race
> > with calculating the difference. But this is a known issue. A process
> > that cares should not run setsockopt and send concurrently. So this is
> > fine too.
> 
> Oh, now I see. Thanks for the detailed explanation! So Do you feel if
> we need to take care of this in the future, I mean, after this series
> gets merged...?

If there is a race condition, then that cannot be fixed up later.

But from my admittedly brief analysis, it seems that there is nothing
here that needs to be fixed: control plane operations (setsockopt)
hold the socket lock. A setsockopt that conflicts with a lockless
datapath update will have a slightly ambiguous offset. It is under
the control of, and up to, the user to avoid that if they care.
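
To make the "slightly ambiguous offset" concrete, here is a minimal
userspace sketch using C11 atomics. Everything in it (struct mock_sock,
the mock_* helpers, the bpf_offset field) is a hypothetical model, not
the kernel's struct sock or the code in this series; it only assumes
that the second user records a snapshot of the shared counter when it
is enabled:

```c
#include <stdatomic.h>

/* Hypothetical userspace model; not the kernel's struct sock. */
struct mock_sock {
	atomic_int tskey;	/* shared counter, bumped once per send */
	int bpf_offset;		/* snapshot taken when the second user enables */
};

static void mock_sock_init(struct mock_sock *sk)
{
	atomic_init(&sk->tskey, 0);
	sk->bpf_offset = 0;
}

/* Lockless datapath: each send takes the next key. */
static int mock_send(struct mock_sock *sk)
{
	return atomic_fetch_add(&sk->tskey, 1);
}

/* Control path (would run under the socket lock): snapshot the counter. */
static void mock_enable_second_user(struct mock_sock *sk)
{
	sk->bpf_offset = atomic_load(&sk->tskey);
}

/* Key as seen by the second user, relative to the moment it was enabled. */
static int mock_second_user_key(const struct mock_sock *sk, int key)
{
	return key - sk->bpf_offset;
}
```

In this model a send that runs concurrently with
mock_enable_second_user() can be counted on either side of the
snapshot, so the relative key of that one packet may be off by one.
Serializing the enable call against sends removes the ambiguity.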


* Re: [PATCH net-next v3 02/14] net-timestamp: allow two features to work parallelly
  2024-10-28 11:05 ` [PATCH net-next v3 02/14] net-timestamp: allow two features to work parallelly Jason Xing
@ 2024-10-29 23:00   ` Martin KaFai Lau
  2024-10-30  1:23     ` Jason Xing
  2024-11-02 13:43   ` Simon Horman
  1 sibling, 1 reply; 88+ messages in thread
From: Martin KaFai Lau @ 2024-10-29 23:00 UTC (permalink / raw)
  To: Jason Xing, willemb
  Cc: davem, edumazet, kuba, pabeni, dsahern, willemdebruijn.kernel,
	ast, daniel, andrii, eddyz87, song, yonghong.song, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, shuah, ykolal, bpf, netdev,
	Jason Xing

On 10/28/24 4:05 AM, Jason Xing wrote:
> From: Jason Xing <kernelxing@tencent.com>
> 
> This patch has introduced a separate sk_tsflags_bpf for bpf
> extension, which helps us let two feature work nearly at the
> same time.
> 
> Each feature will finally take effect on skb_shinfo(skb)->tx_flags,
> say, tcp_tx_timestamp() for TCP or skb_setup_tx_timestamp() for
> other types, so in __skb_tstamp_tx() we are unable to know which
> feature is turned on, unless we check each feature's own socket
> flag field.
> 
> Signed-off-by: Jason Xing <kernelxing@tencent.com>
> ---
>   include/net/sock.h |  1 +
>   net/core/skbuff.c  | 39 +++++++++++++++++++++++++++++++++++++++
>   2 files changed, 40 insertions(+)
> 
> diff --git a/include/net/sock.h b/include/net/sock.h
> index 7464e9f9f47c..5384f1e49f5c 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -445,6 +445,7 @@ struct sock {
>   	u32			sk_reserved_mem;
>   	int			sk_forward_alloc;
>   	u32			sk_tsflags;
> +	u32			sk_tsflags_bpf;
>   	__cacheline_group_end(sock_write_rxtx);
>   
>   	__cacheline_group_begin(sock_write_tx);
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 1cf8416f4123..39309f75e105 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -5539,6 +5539,32 @@ void skb_complete_tx_timestamp(struct sk_buff *skb,
>   }
>   EXPORT_SYMBOL_GPL(skb_complete_tx_timestamp);
>   
> +/* This function is used to test if application SO_TIMESTAMPING feature
> + * or bpf SO_TIMESTAMPING feature is loaded by checking its own socket flags.
> + */
> +static bool sk_tstamp_tx_flags(struct sock *sk, u32 tsflags, int tstype)
> +{
> +	u32 testflag;
> +
> +	switch (tstype) {
> +	case SCM_TSTAMP_SCHED:
> +		testflag = SOF_TIMESTAMPING_TX_SCHED;
> +		break;
> +	case SCM_TSTAMP_SND:
> +		testflag = SOF_TIMESTAMPING_TX_SOFTWARE;
> +		break;
> +	case SCM_TSTAMP_ACK:
> +		testflag = SOF_TIMESTAMPING_TX_ACK;
> +		break;
> +	default:
> +		return false;
> +	}
> +	if (tsflags & testflag)
> +		return true;
> +
> +	return false;
> +}
> +
>   static void skb_tstamp_tx_output(struct sk_buff *orig_skb,
>   				 const struct sk_buff *ack_skb,
>   				 struct skb_shared_hwtstamps *hwtstamps,
> @@ -5549,6 +5575,9 @@ static void skb_tstamp_tx_output(struct sk_buff *orig_skb,
>   	u32 tsflags;
>   
>   	tsflags = READ_ONCE(sk->sk_tsflags);
> +	if (!sk_tstamp_tx_flags(sk, tsflags, tstype))

I still don't get this part since v2. How does it work with cmsg only 
SOF_TIMESTAMPING_TX_*?

I tried with "./txtimestamp -6 -c 1 -C -N -L ::1" and it does not return any tx 
time stamp after this patch.

I am likely missing something. Or did v2 conclude that this behavior change
is acceptable?
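
For illustration, a minimal userspace sketch of the check in question.
The MOCK_* names are hypothetical local stand-ins (their values are
assumed to mirror the uapi flags), and the sketch assumes that in the
cmsg-only case the TX recording flags travel with the message while the
socket-wide flags carry only reporting bits:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical local stand-ins for the uapi constants (assumed values). */
#define MOCK_SOF_TX_SOFTWARE	(1u << 1)
#define MOCK_SOF_SOFTWARE	(1u << 4)
#define MOCK_SOF_TX_SCHED	(1u << 8)
#define MOCK_SOF_TX_ACK		(1u << 9)

enum mock_tstype { MOCK_SCM_SND, MOCK_SCM_SCHED, MOCK_SCM_ACK };

/* Same shape as the helper in the patch: only socket-wide flags are tested. */
static bool mock_tstamp_tx_flags(uint32_t tsflags, int tstype)
{
	uint32_t testflag;

	switch (tstype) {
	case MOCK_SCM_SCHED:
		testflag = MOCK_SOF_TX_SCHED;
		break;
	case MOCK_SCM_SND:
		testflag = MOCK_SOF_TX_SOFTWARE;
		break;
	case MOCK_SCM_ACK:
		testflag = MOCK_SOF_TX_ACK;
		break;
	default:
		return false;
	}

	return (tsflags & testflag) != 0;
}
```

With socket-wide flags of MOCK_SOF_SOFTWARE only, the helper returns
false for every tstype, so the early return would skip the timestamp
even though the message itself requested one.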

> +		return;
> +
>   	if (!hwtstamps && !(tsflags & SOF_TIMESTAMPING_OPT_TX_SWHW) &&
>   	    skb_shinfo(orig_skb)->tx_flags & SKBTX_IN_PROGRESS)
>   		return;
> @@ -5592,6 +5621,15 @@ static void skb_tstamp_tx_output(struct sk_buff *orig_skb,
>   	__skb_complete_tx_timestamp(skb, sk, tstype, opt_stats);
>   }
>   
> +static void skb_tstamp_tx_output_bpf(struct sock *sk, int tstype)
> +{
> +	u32 tsflags;
> +
> +	tsflags = READ_ONCE(sk->sk_tsflags_bpf);
> +	if (!sk_tstamp_tx_flags(sk, tsflags, tstype))
> +		return;
> +}
> +
>   void __skb_tstamp_tx(struct sk_buff *orig_skb,
>   		     const struct sk_buff *ack_skb,
>   		     struct skb_shared_hwtstamps *hwtstamps,
> @@ -5600,6 +5638,7 @@ void __skb_tstamp_tx(struct sk_buff *orig_skb,
>   	if (!sk)
>   		return;
>   
> +	skb_tstamp_tx_output_bpf(sk, tstype);
>   	skb_tstamp_tx_output(orig_skb, ack_skb, hwtstamps, sk, tstype);
>   }
>   EXPORT_SYMBOL_GPL(__skb_tstamp_tx);



* Re: [PATCH net-next v3 03/14] net-timestamp: open gate for bpf_setsockopt/_getsockopt
  2024-10-28 11:05 ` [PATCH net-next v3 03/14] net-timestamp: open gate for bpf_setsockopt/_getsockopt Jason Xing
  2024-10-29  0:59   ` Willem de Bruijn
@ 2024-10-30  0:32   ` Martin KaFai Lau
  2024-10-30  1:15     ` Jason Xing
  1 sibling, 1 reply; 88+ messages in thread
From: Martin KaFai Lau @ 2024-10-30  0:32 UTC (permalink / raw)
  To: Jason Xing
  Cc: davem, edumazet, kuba, pabeni, dsahern, willemdebruijn.kernel,
	willemb, ast, daniel, andrii, eddyz87, song, yonghong.song,
	john.fastabend, kpsingh, sdf, haoluo, jolsa, shuah, ykolal, bpf,
	netdev, Jason Xing

On 10/28/24 4:05 AM, Jason Xing wrote:
> From: Jason Xing <kernelxing@tencent.com>
> 
> For now, we support bpf_setsockopt to set or clear timestamps flags.
> 
> Users can use something like this in bpf program to turn on the feature:
> flags = SOF_TIMESTAMPING_TX_SCHED;
> bpf_setsockopt(skops, SOL_SOCKET, SO_TIMESTAMPING, &flags, sizeof(flags));
> The specific use cases can be seen in the bpf selftest in this series.
> 
> Later, I will support each flags one by one based on this.
> 
> Signed-off-by: Jason Xing <kernelxing@tencent.com>
> ---
>   include/net/sock.h              |  4 ++--
>   include/uapi/linux/net_tstamp.h |  7 +++++++
>   net/core/filter.c               |  7 +++++--
>   net/core/sock.c                 | 34 ++++++++++++++++++++++++++-------
>   net/ipv4/udp.c                  |  2 +-
>   net/mptcp/sockopt.c             |  2 +-
>   net/socket.c                    |  2 +-
>   7 files changed, 44 insertions(+), 14 deletions(-)
> 
> diff --git a/include/net/sock.h b/include/net/sock.h
> index 5384f1e49f5c..062f405c744e 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -1775,7 +1775,7 @@ static inline void skb_set_owner_edemux(struct sk_buff *skb, struct sock *sk)
>   #endif
>   
>   int sk_setsockopt(struct sock *sk, int level, int optname,
> -		  sockptr_t optval, unsigned int optlen);
> +		  sockptr_t optval, unsigned int optlen, bool bpf_timetamping);
>   int sock_setsockopt(struct socket *sock, int level, int op,
>   		    sockptr_t optval, unsigned int optlen);
>   int do_sock_setsockopt(struct socket *sock, bool compat, int level,
> @@ -1784,7 +1784,7 @@ int do_sock_getsockopt(struct socket *sock, bool compat, int level,
>   		       int optname, sockptr_t optval, sockptr_t optlen);
>   
>   int sk_getsockopt(struct sock *sk, int level, int optname,
> -		  sockptr_t optval, sockptr_t optlen);
> +		  sockptr_t optval, sockptr_t optlen, bool bpf_timetamping);
>   int sock_gettstamp(struct socket *sock, void __user *userstamp,
>   		   bool timeval, bool time32);
>   struct sk_buff *sock_alloc_send_pskb(struct sock *sk, unsigned long header_len,
> diff --git a/include/uapi/linux/net_tstamp.h b/include/uapi/linux/net_tstamp.h
> index 858339d1c1c4..0696699cf964 100644
> --- a/include/uapi/linux/net_tstamp.h
> +++ b/include/uapi/linux/net_tstamp.h
> @@ -49,6 +49,13 @@ enum {
>   					 SOF_TIMESTAMPING_TX_SCHED | \
>   					 SOF_TIMESTAMPING_TX_ACK)
>   
> +#define SOF_TIMESTAMPING_BPF_SUPPPORTED_MASK (SOF_TIMESTAMPING_SOFTWARE | \

hmm... so we are allowing it but SOF_TIMESTAMPING_SOFTWARE won't do anything 
(meaning set and not-set are both no-op) ?

> +					      SOF_TIMESTAMPING_TX_SCHED | \
> +					      SOF_TIMESTAMPING_TX_SOFTWARE | \
> +					      SOF_TIMESTAMPING_TX_ACK | \
> +					      SOF_TIMESTAMPING_OPT_ID | \
> +					      SOF_TIMESTAMPING_OPT_ID_TCP)
> +
>   /**
>    * struct so_timestamping - SO_TIMESTAMPING parameter
>    *
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 58761263176c..dc8ecf899ced 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -5238,6 +5238,9 @@ static int sol_socket_sockopt(struct sock *sk, int optname,
>   		break;
>   	case SO_BINDTODEVICE:
>   		break;
> +	case SO_TIMESTAMPING_NEW:

How about only allowing bpf_setsockopt(SO_TIMESTAMPING_NEW) instead of
bpf_setsockopt(SO_TIMESTAMPING)? Does it solve the issue reported in v2?

> +	case SO_TIMESTAMPING_OLD:
> +		break;
>   	default:
>   		return -EINVAL;
>   	}
> @@ -5247,11 +5250,11 @@ static int sol_socket_sockopt(struct sock *sk, int optname,
>   			return -EINVAL;
>   		return sk_getsockopt(sk, SOL_SOCKET, optname,
>   				     KERNEL_SOCKPTR(optval),
> -				     KERNEL_SOCKPTR(optlen));
> +				     KERNEL_SOCKPTR(optlen), true);
>   	}
>   
>   	return sk_setsockopt(sk, SOL_SOCKET, optname,
> -			     KERNEL_SOCKPTR(optval), *optlen);
> +			     KERNEL_SOCKPTR(optval), *optlen, true);
>   }
>   
>   static int bpf_sol_tcp_setsockopt(struct sock *sk, int optname,
> diff --git a/net/core/sock.c b/net/core/sock.c
> index 7f398bd07fb7..7e05748b1a06 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -941,6 +941,19 @@ int sock_set_timestamping(struct sock *sk, int optname,
>   	return 0;
>   }
>   
> +static int sock_set_timestamping_bpf(struct sock *sk,
> +				     struct so_timestamping timestamping)
> +{
> +	u32 flags = timestamping.flags;
> +
> +	if (flags & ~SOF_TIMESTAMPING_BPF_SUPPPORTED_MASK)
> +		return -EINVAL;
> +
> +	WRITE_ONCE(sk->sk_tsflags_bpf, flags);

I think it is cleaner to directly "WRITE_ONCE(sk->sk_tsflags_bpf, flags);" in 
sol_socket_sockopt() instead of adding "bool bpf_timestamping" to sk_setsockopt. 
sk_tsflags_bpf is a separate u32 anyway, so not a lot of code to share. The same 
for getsockopt.
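
A rough userspace sketch of what I mean, with the BPF path owning its
own setter and getter instead of threading a bool through the shared
function. All names here (struct mock_sk_flags, the mock_* helpers, the
MOCK_* values) are hypothetical stand-ins, and the plain store stands
where the kernel would use WRITE_ONCE():

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical local stand-ins for the uapi flags (assumed values). */
#define MOCK_SOF_TX_SOFTWARE	(1u << 1)
#define MOCK_SOF_OPT_ID		(1u << 7)
#define MOCK_SOF_TX_SCHED	(1u << 8)
#define MOCK_SOF_TX_ACK		(1u << 9)

#define MOCK_BPF_SUPPORTED_MASK	(MOCK_SOF_TX_SOFTWARE | MOCK_SOF_OPT_ID | \
				 MOCK_SOF_TX_SCHED | MOCK_SOF_TX_ACK)

/* Hypothetical model of the two independent flag words. */
struct mock_sk_flags {
	uint32_t tsflags;	/* owned by the application setsockopt path */
	uint32_t tsflags_bpf;	/* owned by the BPF setsockopt path */
};

/* BPF-only setter: validate, then store; the application word is untouched. */
static int mock_bpf_set_timestamping(struct mock_sk_flags *sk, uint32_t flags)
{
	if (flags & ~MOCK_BPF_SUPPORTED_MASK)
		return -EINVAL;

	sk->tsflags_bpf = flags;	/* kernel code would use WRITE_ONCE() */
	return 0;
}

/* BPF-only getter: report just the BPF word. */
static uint32_t mock_bpf_get_timestamping(const struct mock_sk_flags *sk)
{
	return sk->tsflags_bpf;
}
```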

[ will continue the remaining patches a little later ]

> +
> +	return 0;
> +}
> +
>   void sock_set_keepalive(struct sock *sk)
>   {
>   	lock_sock(sk);
> @@ -1159,7 +1172,7 @@ static int sockopt_validate_clockid(__kernel_clockid_t value)
>    */
>   
>   int sk_setsockopt(struct sock *sk, int level, int optname,
> -		  sockptr_t optval, unsigned int optlen)
> +		  sockptr_t optval, unsigned int optlen, bool bpf_timetamping)
>   {
>   	struct so_timestamping timestamping;
>   	struct socket *sock = sk->sk_socket;
> @@ -1409,7 +1422,10 @@ int sk_setsockopt(struct sock *sk, int level, int optname,
>   			memset(&timestamping, 0, sizeof(timestamping));
>   			timestamping.flags = val;
>   		}
> -		ret = sock_set_timestamping(sk, optname, timestamping);
> +		if (!bpf_timetamping)
> +			ret = sock_set_timestamping(sk, optname, timestamping);
> +		else
> +			ret = sock_set_timestamping_bpf(sk, timestamping);
>   		break;
>   
>   	case SO_RCVLOWAT:
> @@ -1626,7 +1642,7 @@ int sock_setsockopt(struct socket *sock, int level, int optname,
>   		    sockptr_t optval, unsigned int optlen)
>   {
>   	return sk_setsockopt(sock->sk, level, optname,
> -			     optval, optlen);
> +			     optval, optlen, false);
>   }
>   EXPORT_SYMBOL(sock_setsockopt);
>   
> @@ -1670,7 +1686,7 @@ static int groups_to_user(sockptr_t dst, const struct group_info *src)
>   }
>   
>   int sk_getsockopt(struct sock *sk, int level, int optname,
> -		  sockptr_t optval, sockptr_t optlen)
> +		  sockptr_t optval, sockptr_t optlen, bool bpf_timetamping)
>   {
>   	struct socket *sock = sk->sk_socket;
>   
> @@ -1793,9 +1809,13 @@ int sk_getsockopt(struct sock *sk, int level, int optname,
>   		 * returning the flags when they were set through the same option.
>   		 * Don't change the beviour for the old case SO_TIMESTAMPING_OLD.
>   		 */
> -		if (optname == SO_TIMESTAMPING_OLD || sock_flag(sk, SOCK_TSTAMP_NEW)) {
> -			v.timestamping.flags = READ_ONCE(sk->sk_tsflags);
> -			v.timestamping.bind_phc = READ_ONCE(sk->sk_bind_phc);
> +		if (!bpf_timetamping) {
> +			if (optname == SO_TIMESTAMPING_OLD || sock_flag(sk, SOCK_TSTAMP_NEW)) {
> +				v.timestamping.flags = READ_ONCE(sk->sk_tsflags);
> +				v.timestamping.bind_phc = READ_ONCE(sk->sk_bind_phc);
> +			}
> +		} else {
> +			v.timestamping.flags = READ_ONCE(sk->sk_tsflags_bpf);
>   		}
>   		break;
>   
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index 0e24916b39d4..9a20af41e272 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -2679,7 +2679,7 @@ int udp_lib_setsockopt(struct sock *sk, int level, int optname,
>   	int is_udplite = IS_UDPLITE(sk);
>   
>   	if (level == SOL_SOCKET) {
> -		err = sk_setsockopt(sk, level, optname, optval, optlen);
> +		err = sk_setsockopt(sk, level, optname, optval, optlen, false);
>   
>   		if (optname == SO_RCVBUF || optname == SO_RCVBUFFORCE) {
>   			sockopt_lock_sock(sk);
> diff --git a/net/mptcp/sockopt.c b/net/mptcp/sockopt.c
> index 505445a9598f..7b12cc2db136 100644
> --- a/net/mptcp/sockopt.c
> +++ b/net/mptcp/sockopt.c
> @@ -306,7 +306,7 @@ static int mptcp_setsockopt_sol_socket(struct mptcp_sock *msk, int optname,
>   			return PTR_ERR(ssk);
>   		}
>   
> -		ret = sk_setsockopt(ssk, SOL_SOCKET, optname, optval, optlen);
> +		ret = sk_setsockopt(ssk, SOL_SOCKET, optname, optval, optlen, false);
>   		if (ret == 0) {
>   			if (optname == SO_REUSEPORT)
>   				sk->sk_reuseport = ssk->sk_reuseport;
> diff --git a/net/socket.c b/net/socket.c
> index 9a8e4452b9b2..4bdca39685a6 100644
> --- a/net/socket.c
> +++ b/net/socket.c
> @@ -2385,7 +2385,7 @@ int do_sock_getsockopt(struct socket *sock, bool compat, int level,
>   
>   	ops = READ_ONCE(sock->ops);
>   	if (level == SOL_SOCKET) {
> -		err = sk_getsockopt(sock->sk, level, optname, optval, optlen);
> +		err = sk_getsockopt(sock->sk, level, optname, optval, optlen, false);
>   	} else if (unlikely(!ops->getsockopt)) {
>   		err = -EOPNOTSUPP;
>   	} else {



* Re: [PATCH net-next v3 03/14] net-timestamp: open gate for bpf_setsockopt/_getsockopt
  2024-10-30  0:32   ` Martin KaFai Lau
@ 2024-10-30  1:15     ` Jason Xing
  0 siblings, 0 replies; 88+ messages in thread
From: Jason Xing @ 2024-10-30  1:15 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: davem, edumazet, kuba, pabeni, dsahern, willemdebruijn.kernel,
	willemb, ast, daniel, andrii, eddyz87, song, yonghong.song,
	john.fastabend, kpsingh, sdf, haoluo, jolsa, shuah, ykolal, bpf,
	netdev, Jason Xing

On Wed, Oct 30, 2024 at 8:32 AM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 10/28/24 4:05 AM, Jason Xing wrote:
> > From: Jason Xing <kernelxing@tencent.com>
> >
> > For now, we support bpf_setsockopt to set or clear timestamps flags.
> >
> > Users can use something like this in bpf program to turn on the feature:
> > flags = SOF_TIMESTAMPING_TX_SCHED;
> > bpf_setsockopt(skops, SOL_SOCKET, SO_TIMESTAMPING, &flags, sizeof(flags));
> > The specific use cases can be seen in the bpf selftest in this series.
> >
> > Later, I will support each flags one by one based on this.
> >
> > Signed-off-by: Jason Xing <kernelxing@tencent.com>
> > ---
> >   include/net/sock.h              |  4 ++--
> >   include/uapi/linux/net_tstamp.h |  7 +++++++
> >   net/core/filter.c               |  7 +++++--
> >   net/core/sock.c                 | 34 ++++++++++++++++++++++++++-------
> >   net/ipv4/udp.c                  |  2 +-
> >   net/mptcp/sockopt.c             |  2 +-
> >   net/socket.c                    |  2 +-
> >   7 files changed, 44 insertions(+), 14 deletions(-)
> >
> > diff --git a/include/net/sock.h b/include/net/sock.h
> > index 5384f1e49f5c..062f405c744e 100644
> > --- a/include/net/sock.h
> > +++ b/include/net/sock.h
> > @@ -1775,7 +1775,7 @@ static inline void skb_set_owner_edemux(struct sk_buff *skb, struct sock *sk)
> >   #endif
> >
> >   int sk_setsockopt(struct sock *sk, int level, int optname,
> > -               sockptr_t optval, unsigned int optlen);
> > +               sockptr_t optval, unsigned int optlen, bool bpf_timetamping);
> >   int sock_setsockopt(struct socket *sock, int level, int op,
> >                   sockptr_t optval, unsigned int optlen);
> >   int do_sock_setsockopt(struct socket *sock, bool compat, int level,
> > @@ -1784,7 +1784,7 @@ int do_sock_getsockopt(struct socket *sock, bool compat, int level,
> >                      int optname, sockptr_t optval, sockptr_t optlen);
> >
> >   int sk_getsockopt(struct sock *sk, int level, int optname,
> > -               sockptr_t optval, sockptr_t optlen);
> > +               sockptr_t optval, sockptr_t optlen, bool bpf_timetamping);
> >   int sock_gettstamp(struct socket *sock, void __user *userstamp,
> >                  bool timeval, bool time32);
> >   struct sk_buff *sock_alloc_send_pskb(struct sock *sk, unsigned long header_len,
> > diff --git a/include/uapi/linux/net_tstamp.h b/include/uapi/linux/net_tstamp.h
> > index 858339d1c1c4..0696699cf964 100644
> > --- a/include/uapi/linux/net_tstamp.h
> > +++ b/include/uapi/linux/net_tstamp.h
> > @@ -49,6 +49,13 @@ enum {
> >                                        SOF_TIMESTAMPING_TX_SCHED | \
> >                                        SOF_TIMESTAMPING_TX_ACK)
> >
> > +#define SOF_TIMESTAMPING_BPF_SUPPPORTED_MASK (SOF_TIMESTAMPING_SOFTWARE | \
>
> hmm... so we are allowing it but SOF_TIMESTAMPING_SOFTWARE won't do anything
> (meaning set and not-set are both no-op) ?

I was thinking of writing a separate patch to control the output
function by using this flag. Apparently, I didn't do that, so I think
I can remove it from this series.

>
> > +                                           SOF_TIMESTAMPING_TX_SCHED | \
> > +                                           SOF_TIMESTAMPING_TX_SOFTWARE | \
> > +                                           SOF_TIMESTAMPING_TX_ACK | \
> > +                                           SOF_TIMESTAMPING_OPT_ID | \
> > +                                           SOF_TIMESTAMPING_OPT_ID_TCP)
> > +
> >   /**
> >    * struct so_timestamping - SO_TIMESTAMPING parameter
> >    *
> > diff --git a/net/core/filter.c b/net/core/filter.c
> > index 58761263176c..dc8ecf899ced 100644
> > --- a/net/core/filter.c
> > +++ b/net/core/filter.c
> > @@ -5238,6 +5238,9 @@ static int sol_socket_sockopt(struct sock *sk, int optname,
> >               break;
> >       case SO_BINDTODEVICE:
> >               break;
> > +     case SO_TIMESTAMPING_NEW:
>
> How about only allow bpf_setsockopt(SO_TIMESTAMPING_NEW) instead of
> bpf_setsockopt(SO_TIMESTAMPING). Does it solve the issue reported in v2?

No, it doesn't. Sorry, I will handle it in a proper way.

>
> > +     case SO_TIMESTAMPING_OLD:
> > +             break;
> >       default:
> >               return -EINVAL;
> >       }
> > @@ -5247,11 +5250,11 @@ static int sol_socket_sockopt(struct sock *sk, int optname,
> >                       return -EINVAL;
> >               return sk_getsockopt(sk, SOL_SOCKET, optname,
> >                                    KERNEL_SOCKPTR(optval),
> > -                                  KERNEL_SOCKPTR(optlen));
> > +                                  KERNEL_SOCKPTR(optlen), true);
> >       }
> >
> >       return sk_setsockopt(sk, SOL_SOCKET, optname,
> > -                          KERNEL_SOCKPTR(optval), *optlen);
> > +                          KERNEL_SOCKPTR(optval), *optlen, true);
> >   }
> >
> >   static int bpf_sol_tcp_setsockopt(struct sock *sk, int optname,
> > diff --git a/net/core/sock.c b/net/core/sock.c
> > index 7f398bd07fb7..7e05748b1a06 100644
> > --- a/net/core/sock.c
> > +++ b/net/core/sock.c
> > @@ -941,6 +941,19 @@ int sock_set_timestamping(struct sock *sk, int optname,
> >       return 0;
> >   }
> >
> > +static int sock_set_timestamping_bpf(struct sock *sk,
> > +                                  struct so_timestamping timestamping)
> > +{
> > +     u32 flags = timestamping.flags;
> > +
> > +     if (flags & ~SOF_TIMESTAMPING_BPF_SUPPPORTED_MASK)
> > +             return -EINVAL;
> > +
> > +     WRITE_ONCE(sk->sk_tsflags_bpf, flags);
>
> I think it is cleaner to directly "WRITE_ONCE(sk->sk_tsflags_bpf, flags);" in
> sol_socket_sockopt() instead of adding "bool bpf_timestamping" to sk_setsockopt.
> sk_tsflags_bpf is a separate u32 anyway, so not a lot of code to share. The same
> for getsockopt.

As I replied to Willem, I feel that this approach (which is the same as
v2) [1] introduces more duplicated code and returns earlier than the
other SO_xxx cases do. Don't you think that is a bit weird?

[1]: https://lore.kernel.org/all/20241012040651.95616-3-kerneljasonxing@gmail.com/

Sure, I can write it the way v2 does. Which one would you prefer :) ?

>
> [ will continue the remaining patches a little later ]

Thanks!

Thanks,
Jason


* Re: [PATCH net-next v3 02/14] net-timestamp: allow two features to work parallelly
  2024-10-29 23:00   ` Martin KaFai Lau
@ 2024-10-30  1:23     ` Jason Xing
  2024-10-30  1:45       ` Willem de Bruijn
  0 siblings, 1 reply; 88+ messages in thread
From: Jason Xing @ 2024-10-30  1:23 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: willemb, davem, edumazet, kuba, pabeni, dsahern,
	willemdebruijn.kernel, ast, daniel, andrii, eddyz87, song,
	yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa, shuah,
	ykolal, bpf, netdev, Jason Xing

On Wed, Oct 30, 2024 at 7:00 AM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 10/28/24 4:05 AM, Jason Xing wrote:
> > From: Jason Xing <kernelxing@tencent.com>
> >
> > This patch has introduced a separate sk_tsflags_bpf for bpf
> > extension, which helps us let two feature work nearly at the
> > same time.
> >
> > Each feature will finally take effect on skb_shinfo(skb)->tx_flags,
> > say, tcp_tx_timestamp() for TCP or skb_setup_tx_timestamp() for
> > other types, so in __skb_tstamp_tx() we are unable to know which
> > feature is turned on, unless we check each feature's own socket
> > flag field.
> >
> > Signed-off-by: Jason Xing <kernelxing@tencent.com>
> > ---
> >   include/net/sock.h |  1 +
> >   net/core/skbuff.c  | 39 +++++++++++++++++++++++++++++++++++++++
> >   2 files changed, 40 insertions(+)
> >
> > diff --git a/include/net/sock.h b/include/net/sock.h
> > index 7464e9f9f47c..5384f1e49f5c 100644
> > --- a/include/net/sock.h
> > +++ b/include/net/sock.h
> > @@ -445,6 +445,7 @@ struct sock {
> >       u32                     sk_reserved_mem;
> >       int                     sk_forward_alloc;
> >       u32                     sk_tsflags;
> > +     u32                     sk_tsflags_bpf;
> >       __cacheline_group_end(sock_write_rxtx);
> >
> >       __cacheline_group_begin(sock_write_tx);
> > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > index 1cf8416f4123..39309f75e105 100644
> > --- a/net/core/skbuff.c
> > +++ b/net/core/skbuff.c
> > @@ -5539,6 +5539,32 @@ void skb_complete_tx_timestamp(struct sk_buff *skb,
> >   }
> >   EXPORT_SYMBOL_GPL(skb_complete_tx_timestamp);
> >
> > +/* This function is used to test if application SO_TIMESTAMPING feature
> > + * or bpf SO_TIMESTAMPING feature is loaded by checking its own socket flags.
> > + */
> > +static bool sk_tstamp_tx_flags(struct sock *sk, u32 tsflags, int tstype)
> > +{
> > +     u32 testflag;
> > +
> > +     switch (tstype) {
> > +     case SCM_TSTAMP_SCHED:
> > +             testflag = SOF_TIMESTAMPING_TX_SCHED;
> > +             break;
> > +     case SCM_TSTAMP_SND:
> > +             testflag = SOF_TIMESTAMPING_TX_SOFTWARE;
> > +             break;
> > +     case SCM_TSTAMP_ACK:
> > +             testflag = SOF_TIMESTAMPING_TX_ACK;
> > +             break;
> > +     default:
> > +             return false;
> > +     }
> > +     if (tsflags & testflag)
> > +             return true;
> > +
> > +     return false;
> > +}
> > +
> >   static void skb_tstamp_tx_output(struct sk_buff *orig_skb,
> >                                const struct sk_buff *ack_skb,
> >                                struct skb_shared_hwtstamps *hwtstamps,
> > @@ -5549,6 +5575,9 @@ static void skb_tstamp_tx_output(struct sk_buff *orig_skb,
> >       u32 tsflags;
> >
> >       tsflags = READ_ONCE(sk->sk_tsflags);
> > +     if (!sk_tstamp_tx_flags(sk, tsflags, tstype))
>
> I still don't get this part since v2. How does it work with cmsg only
> SOF_TIMESTAMPING_TX_*?
>
> I tried with "./txtimestamp -6 -c 1 -C -N -L ::1" and it does not return any tx
> time stamp after this patch.
>
> I am likely missing something
> or v2 concluded that this behavior change is acceptable?

Sorry, when I submitted this series I accidentally dropped one important
piece, which is similar to what Vadim Fedorenko mentioned in v1 [1]:
adding another member like sk_flags_bpf to handle the cmsg case.

Willem, would it be acceptable to add another field in struct sock to
help us recognise the case where BPF and cmsg work in parallel?

[1]: https://lore.kernel.org/all/662873cb-a897-464e-bdb3-edf01363c3b2@linux.dev/

Thanks,
Jason


* Re: [PATCH net-next v3 02/14] net-timestamp: allow two features to work parallelly
  2024-10-30  1:23     ` Jason Xing
@ 2024-10-30  1:45       ` Willem de Bruijn
  2024-10-30  2:32         ` Jason Xing
  0 siblings, 1 reply; 88+ messages in thread
From: Willem de Bruijn @ 2024-10-30  1:45 UTC (permalink / raw)
  To: Jason Xing, Martin KaFai Lau
  Cc: willemb, davem, edumazet, kuba, pabeni, dsahern,
	willemdebruijn.kernel, ast, daniel, andrii, eddyz87, song,
	yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa, shuah,
	ykolal, bpf, netdev, Jason Xing

Jason Xing wrote:
> On Wed, Oct 30, 2024 at 7:00 AM Martin KaFai Lau <martin.lau@linux.dev> wrote:
> >
> > On 10/28/24 4:05 AM, Jason Xing wrote:
> > > From: Jason Xing <kernelxing@tencent.com>
> > >
> > > This patch has introduced a separate sk_tsflags_bpf for bpf
> > > extension, which helps us let two feature work nearly at the
> > > same time.
> > >
> > > Each feature will finally take effect on skb_shinfo(skb)->tx_flags,
> > > say, tcp_tx_timestamp() for TCP or skb_setup_tx_timestamp() for
> > > other types, so in __skb_tstamp_tx() we are unable to know which
> > > feature is turned on, unless we check each feature's own socket
> > > flag field.
> > >
> > > Signed-off-by: Jason Xing <kernelxing@tencent.com>
> > > ---
> > >   include/net/sock.h |  1 +
> > >   net/core/skbuff.c  | 39 +++++++++++++++++++++++++++++++++++++++
> > >   2 files changed, 40 insertions(+)
> > >
> > > diff --git a/include/net/sock.h b/include/net/sock.h
> > > index 7464e9f9f47c..5384f1e49f5c 100644
> > > --- a/include/net/sock.h
> > > +++ b/include/net/sock.h
> > > @@ -445,6 +445,7 @@ struct sock {
> > >       u32                     sk_reserved_mem;
> > >       int                     sk_forward_alloc;
> > >       u32                     sk_tsflags;
> > > +     u32                     sk_tsflags_bpf;
> > >       __cacheline_group_end(sock_write_rxtx);
> > >
> > >       __cacheline_group_begin(sock_write_tx);
> > > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > > index 1cf8416f4123..39309f75e105 100644
> > > --- a/net/core/skbuff.c
> > > +++ b/net/core/skbuff.c
> > > @@ -5539,6 +5539,32 @@ void skb_complete_tx_timestamp(struct sk_buff *skb,
> > >   }
> > >   EXPORT_SYMBOL_GPL(skb_complete_tx_timestamp);
> > >
> > > +/* This function is used to test if application SO_TIMESTAMPING feature
> > > + * or bpf SO_TIMESTAMPING feature is loaded by checking its own socket flags.
> > > + */
> > > +static bool sk_tstamp_tx_flags(struct sock *sk, u32 tsflags, int tstype)
> > > +{
> > > +     u32 testflag;
> > > +
> > > +     switch (tstype) {
> > > +     case SCM_TSTAMP_SCHED:
> > > +             testflag = SOF_TIMESTAMPING_TX_SCHED;
> > > +             break;
> > > +     case SCM_TSTAMP_SND:
> > > +             testflag = SOF_TIMESTAMPING_TX_SOFTWARE;
> > > +             break;
> > > +     case SCM_TSTAMP_ACK:
> > > +             testflag = SOF_TIMESTAMPING_TX_ACK;
> > > +             break;
> > > +     default:
> > > +             return false;
> > > +     }
> > > +     if (tsflags & testflag)
> > > +             return true;
> > > +
> > > +     return false;
> > > +}
> > > +
> > >   static void skb_tstamp_tx_output(struct sk_buff *orig_skb,
> > >                                const struct sk_buff *ack_skb,
> > >                                struct skb_shared_hwtstamps *hwtstamps,
> > > @@ -5549,6 +5575,9 @@ static void skb_tstamp_tx_output(struct sk_buff *orig_skb,
> > >       u32 tsflags;
> > >
> > >       tsflags = READ_ONCE(sk->sk_tsflags);
> > > +     if (!sk_tstamp_tx_flags(sk, tsflags, tstype))
> >
> > I still don't get this part since v2. How does it work with cmsg only
> > SOF_TIMESTAMPING_TX_*?
> >
> > I tried with "./txtimestamp -6 -c 1 -C -N -L ::1" and it does not return any tx
> > time stamp after this patch.
> >
> > I am likely missing something
> > or v2 concluded that this behavior change is acceptable?
> 
> Sorry, I submitted this series accidentally removing one important
> thing which is similar to what Vadim Fedorenko mentioned in the v1
> [1]:
> adding another member like sk_flags_bpf to handle the cmsg case.
> 
> Willem, would it be acceptable to add another field in struct sock to
> help us recognise the case where BPF and cmsg works parallelly?
> 
> [1]: https://lore.kernel.org/all/662873cb-a897-464e-bdb3-edf01363c3b2@linux.dev/

The current timestamp flags don't need a u32. Maybe just reserve a bit
for this purpose?
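
One way to read this, sketched in userspace C. The bit position and the
MOCK_*/mock_* names are purely hypothetical; nothing here says which bit
or which field would actually be used:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical: the top bit of an existing u32 flags word marks BPF use. */
#define MOCK_TSFLAGS_BPF	(1u << 31)

/* Set the reserved marker bit without disturbing the other flags. */
static uint32_t mock_mark_bpf(uint32_t tsflags)
{
	return tsflags | MOCK_TSFLAGS_BPF;
}

/* Test whether the reserved marker bit is set. */
static bool mock_bpf_enabled(uint32_t tsflags)
{
	return (tsflags & MOCK_TSFLAGS_BPF) != 0;
}

/* Strip the marker so the remaining bits read as ordinary flags. */
static uint32_t mock_user_flags(uint32_t tsflags)
{
	return tsflags & ~MOCK_TSFLAGS_BPF;
}
```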


* Re: [PATCH net-next v3 02/14] net-timestamp: allow two features to work parallelly
  2024-10-30  1:45       ` Willem de Bruijn
@ 2024-10-30  2:32         ` Jason Xing
  2024-10-30  2:47           ` Willem de Bruijn
  0 siblings, 1 reply; 88+ messages in thread
From: Jason Xing @ 2024-10-30  2:32 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Martin KaFai Lau, willemb, davem, edumazet, kuba, pabeni, dsahern,
	ast, daniel, andrii, eddyz87, song, yonghong.song, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, shuah, ykolal, bpf, netdev,
	Jason Xing

On Wed, Oct 30, 2024 at 9:45 AM Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
>
> Jason Xing wrote:
> > On Wed, Oct 30, 2024 at 7:00 AM Martin KaFai Lau <martin.lau@linux.dev> wrote:
> > >
> > > On 10/28/24 4:05 AM, Jason Xing wrote:
> > > > From: Jason Xing <kernelxing@tencent.com>
> > > >
> > > > This patch has introduced a separate sk_tsflags_bpf for bpf
> > > > extension, which helps us let two feature work nearly at the
> > > > same time.
> > > >
> > > > Each feature will finally take effect on skb_shinfo(skb)->tx_flags,
> > > > say, tcp_tx_timestamp() for TCP or skb_setup_tx_timestamp() for
> > > > other types, so in __skb_tstamp_tx() we are unable to know which
> > > > feature is turned on, unless we check each feature's own socket
> > > > flag field.
> > > >
> > > > Signed-off-by: Jason Xing <kernelxing@tencent.com>
> > > > ---
> > > >   include/net/sock.h |  1 +
> > > >   net/core/skbuff.c  | 39 +++++++++++++++++++++++++++++++++++++++
> > > >   2 files changed, 40 insertions(+)
> > > >
> > > > diff --git a/include/net/sock.h b/include/net/sock.h
> > > > index 7464e9f9f47c..5384f1e49f5c 100644
> > > > --- a/include/net/sock.h
> > > > +++ b/include/net/sock.h
> > > > @@ -445,6 +445,7 @@ struct sock {
> > > >       u32                     sk_reserved_mem;
> > > >       int                     sk_forward_alloc;
> > > >       u32                     sk_tsflags;
> > > > +     u32                     sk_tsflags_bpf;
> > > >       __cacheline_group_end(sock_write_rxtx);
> > > >
> > > >       __cacheline_group_begin(sock_write_tx);
> > > > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > > > index 1cf8416f4123..39309f75e105 100644
> > > > --- a/net/core/skbuff.c
> > > > +++ b/net/core/skbuff.c
> > > > @@ -5539,6 +5539,32 @@ void skb_complete_tx_timestamp(struct sk_buff *skb,
> > > >   }
> > > >   EXPORT_SYMBOL_GPL(skb_complete_tx_timestamp);
> > > >
> > > > +/* This function is used to test if application SO_TIMESTAMPING feature
> > > > + * or bpf SO_TIMESTAMPING feature is loaded by checking its own socket flags.
> > > > + */
> > > > +static bool sk_tstamp_tx_flags(struct sock *sk, u32 tsflags, int tstype)
> > > > +{
> > > > +     u32 testflag;
> > > > +
> > > > +     switch (tstype) {
> > > > +     case SCM_TSTAMP_SCHED:
> > > > +             testflag = SOF_TIMESTAMPING_TX_SCHED;
> > > > +             break;
> > > > +     case SCM_TSTAMP_SND:
> > > > +             testflag = SOF_TIMESTAMPING_TX_SOFTWARE;
> > > > +             break;
> > > > +     case SCM_TSTAMP_ACK:
> > > > +             testflag = SOF_TIMESTAMPING_TX_ACK;
> > > > +             break;
> > > > +     default:
> > > > +             return false;
> > > > +     }
> > > > +     if (tsflags & testflag)
> > > > +             return true;
> > > > +
> > > > +     return false;
> > > > +}
> > > > +
> > > >   static void skb_tstamp_tx_output(struct sk_buff *orig_skb,
> > > >                                const struct sk_buff *ack_skb,
> > > >                                struct skb_shared_hwtstamps *hwtstamps,
> > > > @@ -5549,6 +5575,9 @@ static void skb_tstamp_tx_output(struct sk_buff *orig_skb,
> > > >       u32 tsflags;
> > > >
> > > >       tsflags = READ_ONCE(sk->sk_tsflags);
> > > > +     if (!sk_tstamp_tx_flags(sk, tsflags, tstype))
> > >
> > > I still don't get this part since v2. How does it work with cmsg only
> > > SOF_TIMESTAMPING_TX_*?
> > >
> > > I tried with "./txtimestamp -6 -c 1 -C -N -L ::1" and it does not return any tx
> > > time stamp after this patch.
> > >
> > > I am likely missing something
> > > or v2 concluded that this behavior change is acceptable?
> >
> > Sorry, I submitted this series accidentally removing one important
> > thing which is similar to what Vadim Fedorenko mentioned in the v1
> > [1]:
> > adding another member like sk_flags_bpf to handle the cmsg case.
> >
> > Willem, would it be acceptable to add another field in struct sock to
> > help us recognise the case where BPF and cmsg works parallelly?
> >
> > [1]: https://lore.kernel.org/all/662873cb-a897-464e-bdb3-edf01363c3b2@linux.dev/
>
> The current timestamp flags don't need a u32. Maybe just reserve a bit
> for this purpose?

Sure. Good suggestion.

But I think a single bit that says whether sk->sk_tsflags is used by
the normal or the cmsg feature is not enough. The existing
implementation in tcp_sendmsg_locked() doesn't override
sk->sk_tsflags even when the normal and cmsg features are enabled in
parallel; it only overrides sockc.tsflags. Because of that, if users
stop using cmsg at some point, the previously configured normal
SO_TIMESTAMPING continues to work.

How about the following?
For now, sk->sk_tsflags only uses 17 bits (see the last one,
SOF_TIMESTAMPING_OPT_RX_FILTER). The cmsg feature only uses 4 flags
(see SOF_TIMESTAMPING_TX_RECORD_MASK in __sock_cmsg_send()). So we
could reserve the highest four bits for cmsg use for the moment. The
four bits represent the four points where we can record a timestamp
in the tx case.
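
To make the proposed layout concrete, here is a minimal userspace
sketch. It is only an illustration under assumptions: CMSG_SHIFT,
CMSG_MASK and the tsflags_*_cmsg() helpers are hypothetical names that
do not exist in the series, and the enum merely mirrors the
SOF_TIMESTAMPING_TX_* values from include/uapi/linux/net_tstamp.h.

```c
#include <stdint.h>

/* Values mirror SOF_TIMESTAMPING_TX_* in include/uapi/linux/net_tstamp.h. */
enum {
	TX_HARDWARE    = 1 << 0,
	TX_SOFTWARE    = 1 << 1,
	TX_SCHED       = 1 << 8,
	TX_ACK         = 1 << 9,
	TX_RECORD_MASK = TX_HARDWARE | TX_SOFTWARE | TX_SCHED | TX_ACK,
};

/* Hypothetical layout: bits 28..31 of the u32 hold the cmsg request. */
#define CMSG_SHIFT 28
#define CMSG_MASK  (0xfu << CMSG_SHIFT)

/* The four recording flags are not contiguous, so give each a slot 0..3. */
static const uint32_t tx_slot[4] = {
	TX_HARDWARE, TX_SOFTWARE, TX_SCHED, TX_ACK,
};

/* Store the cmsg recording flags in the reserved bits only. */
static uint32_t tsflags_set_cmsg(uint32_t tsflags, uint32_t cmsg_flags)
{
	uint32_t packed = 0;
	int i;

	for (i = 0; i < 4; i++)
		if (cmsg_flags & tx_slot[i])
			packed |= 1u << i;

	return (tsflags & ~CMSG_MASK) | (packed << CMSG_SHIFT);
}

/* Recover the cmsg recording flags from the reserved bits. */
static uint32_t tsflags_get_cmsg(uint32_t tsflags)
{
	uint32_t packed = tsflags >> CMSG_SHIFT;
	uint32_t flags = 0;
	int i;

	for (i = 0; i < 4; i++)
		if (packed & (1u << i))
			flags |= tx_slot[i];

	return flags;
}
```

The low bits, including the flags set through setsockopt(), are left
untouched by tsflags_set_cmsg(), which is the property the proposal
relies on.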

Do you agree on this point?

Thanks,
Jason

* Re: [PATCH net-next v3 02/14] net-timestamp: allow two features to work parallelly
  2024-10-30  2:32         ` Jason Xing
@ 2024-10-30  2:47           ` Willem de Bruijn
  2024-10-30  3:04             ` Jason Xing
  0 siblings, 1 reply; 88+ messages in thread
From: Willem de Bruijn @ 2024-10-30  2:47 UTC (permalink / raw)
  To: Jason Xing, Willem de Bruijn
  Cc: Martin KaFai Lau, willemb, davem, edumazet, kuba, pabeni, dsahern,
	ast, daniel, andrii, eddyz87, song, yonghong.song, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, shuah, ykolal, bpf, netdev,
	Jason Xing

Jason Xing wrote:
> On Wed, Oct 30, 2024 at 9:45 AM Willem de Bruijn
> <willemdebruijn.kernel@gmail.com> wrote:
> >
> > Jason Xing wrote:
> > > On Wed, Oct 30, 2024 at 7:00 AM Martin KaFai Lau <martin.lau@linux.dev> wrote:
> > > >
> > > > On 10/28/24 4:05 AM, Jason Xing wrote:
> > > > > From: Jason Xing <kernelxing@tencent.com>
> > > > >
> > > > > This patch has introduced a separate sk_tsflags_bpf for bpf
> > > > > extension, which helps us let two feature work nearly at the
> > > > > same time.
> > > > >
> > > > > Each feature will finally take effect on skb_shinfo(skb)->tx_flags,
> > > > > say, tcp_tx_timestamp() for TCP or skb_setup_tx_timestamp() for
> > > > > other types, so in __skb_tstamp_tx() we are unable to know which
> > > > > feature is turned on, unless we check each feature's own socket
> > > > > flag field.
> > > > >
> > > > > Signed-off-by: Jason Xing <kernelxing@tencent.com>
> > > > > ---
> > > > >   include/net/sock.h |  1 +
> > > > >   net/core/skbuff.c  | 39 +++++++++++++++++++++++++++++++++++++++
> > > > >   2 files changed, 40 insertions(+)
> > > > >
> > > > > diff --git a/include/net/sock.h b/include/net/sock.h
> > > > > index 7464e9f9f47c..5384f1e49f5c 100644
> > > > > --- a/include/net/sock.h
> > > > > +++ b/include/net/sock.h
> > > > > @@ -445,6 +445,7 @@ struct sock {
> > > > >       u32                     sk_reserved_mem;
> > > > >       int                     sk_forward_alloc;
> > > > >       u32                     sk_tsflags;
> > > > > +     u32                     sk_tsflags_bpf;
> > > > >       __cacheline_group_end(sock_write_rxtx);
> > > > >
> > > > >       __cacheline_group_begin(sock_write_tx);
> > > > > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > > > > index 1cf8416f4123..39309f75e105 100644
> > > > > --- a/net/core/skbuff.c
> > > > > +++ b/net/core/skbuff.c
> > > > > @@ -5539,6 +5539,32 @@ void skb_complete_tx_timestamp(struct sk_buff *skb,
> > > > >   }
> > > > >   EXPORT_SYMBOL_GPL(skb_complete_tx_timestamp);
> > > > >
> > > > > +/* This function is used to test if application SO_TIMESTAMPING feature
> > > > > + * or bpf SO_TIMESTAMPING feature is loaded by checking its own socket flags.
> > > > > + */
> > > > > +static bool sk_tstamp_tx_flags(struct sock *sk, u32 tsflags, int tstype)
> > > > > +{
> > > > > +     u32 testflag;
> > > > > +
> > > > > +     switch (tstype) {
> > > > > +     case SCM_TSTAMP_SCHED:
> > > > > +             testflag = SOF_TIMESTAMPING_TX_SCHED;
> > > > > +             break;
> > > > > +     case SCM_TSTAMP_SND:
> > > > > +             testflag = SOF_TIMESTAMPING_TX_SOFTWARE;
> > > > > +             break;
> > > > > +     case SCM_TSTAMP_ACK:
> > > > > +             testflag = SOF_TIMESTAMPING_TX_ACK;
> > > > > +             break;
> > > > > +     default:
> > > > > +             return false;
> > > > > +     }
> > > > > +     if (tsflags & testflag)
> > > > > +             return true;
> > > > > +
> > > > > +     return false;
> > > > > +}
> > > > > +
> > > > >   static void skb_tstamp_tx_output(struct sk_buff *orig_skb,
> > > > >                                const struct sk_buff *ack_skb,
> > > > >                                struct skb_shared_hwtstamps *hwtstamps,
> > > > > @@ -5549,6 +5575,9 @@ static void skb_tstamp_tx_output(struct sk_buff *orig_skb,
> > > > >       u32 tsflags;
> > > > >
> > > > >       tsflags = READ_ONCE(sk->sk_tsflags);
> > > > > +     if (!sk_tstamp_tx_flags(sk, tsflags, tstype))
> > > >
> > > > I still don't get this part since v2. How does it work with cmsg only
> > > > SOF_TIMESTAMPING_TX_*?
> > > >
> > > > I tried with "./txtimestamp -6 -c 1 -C -N -L ::1" and it does not return any tx
> > > > time stamp after this patch.
> > > >
> > > > I am likely missing something
> > > > or v2 concluded that this behavior change is acceptable?
> > >
> > > Sorry, I submitted this series accidentally removing one important
> > > thing which is similar to what Vadim Fedorenko mentioned in the v1
> > > [1]:
> > > adding another member like sk_flags_bpf to handle the cmsg case.
> > >
> > > Willem, would it be acceptable to add another field in struct sock to
> > > help us recognise the case where BPF and cmsg works parallelly?
> > >
> > > [1]: https://lore.kernel.org/all/662873cb-a897-464e-bdb3-edf01363c3b2@linux.dev/
> >
> > The current timestamp flags don't need a u32. Maybe just reserve a bit
> > for this purpose?
> 
> Sure. Good suggestion.
> 
> But I think only using one bit to reflect whether the sk->sk_tsflags
> is used by normal or cmsg features is not enough. The existing
> implementation in tcp_sendmsg_locked() doesn't override the
> sk->sk_tsflags even the normal and cmsg features enabled parallelly.
> It only overrides sockc.tsflags in tcp_sendmsg_locked(). Based on
> that, even if at some point users suddenly remove the cmsg use and
> then the prior normal SO_TIMESTAMPING continues to work.
> 
> How about this, please see below:
> For now, sk->sk_tsflags only uses 17 bits (see the last one
> SOF_TIMESTAMPING_OPT_RX_FILTER). The cmsg feature only uses 4 flags
> (see SOF_TIMESTAMPING_TX_RECORD_MASK in __sock_cmsg_send()). With that
> said, we could reserve the highest four bits for cmsg use for the
> moment. Four bits represents four points where we can record the
> timestamp in the tx case.
> 
> Do you agree on this point?

I don't follow.

I am probably missing the entire point.

The goal for sockcm fields is to start with the sk field and
optionally override based on cmsg. This is what sockcm_init does for
tsflags.

This information is for the skb, so these are recording flags.

Why does the new datapath need to know whether features are enabled
through setsockopt or on a per-call basis with a cmsg?

The goal was always to keep the reporting flags per socket, but make
the recording flag per packet, mainly for sampling.
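
The "start with the sk field, optionally override based on cmsg"
pattern can be sketched in userspace as follows. This is a simplified
model, not the kernel code: the demo_* names are made up for
illustration, and the enum only mirrors the relevant SOF_TIMESTAMPING_*
values from include/uapi/linux/net_tstamp.h.

```c
#include <errno.h>
#include <stdint.h>

/* Values mirror SOF_TIMESTAMPING_* in include/uapi/linux/net_tstamp.h. */
enum {
	TX_HARDWARE    = 1 << 0,
	TX_SOFTWARE    = 1 << 1,
	OPT_ID         = 1 << 7,	/* a reporting option, not a recording flag */
	TX_SCHED       = 1 << 8,
	TX_ACK         = 1 << 9,
	TX_RECORD_MASK = TX_HARDWARE | TX_SOFTWARE | TX_SCHED | TX_ACK,
};

struct demo_sock {
	uint32_t sk_tsflags;	/* per-socket flags set via setsockopt() */
};

struct demo_sockcm {
	uint32_t tsflags;	/* per-call copy used for this send */
};

/* Start from the per-socket value. */
static void demo_sockcm_init(struct demo_sockcm *sockc,
			     const struct demo_sock *sk)
{
	sockc->tsflags = sk->sk_tsflags;
}

/* Apply a SO_TIMESTAMPING cmsg: only recording flags may be overridden. */
static int demo_cmsg_timestamping(struct demo_sockcm *sockc,
				  uint32_t cmsg_flags)
{
	if (cmsg_flags & ~TX_RECORD_MASK)
		return -EINVAL;

	sockc->tsflags &= ~TX_RECORD_MASK;
	sockc->tsflags |= cmsg_flags;
	return 0;
}
```

In this model the socket's own flags never change; only the per-call
copy does, so the reporting options stay per socket while the
recording flags can differ from one send to the next.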

* Re: [PATCH net-next v3 02/14] net-timestamp: allow two features to work parallelly
  2024-10-30  2:47           ` Willem de Bruijn
@ 2024-10-30  3:04             ` Jason Xing
  2024-10-30  5:37               ` Martin KaFai Lau
  0 siblings, 1 reply; 88+ messages in thread
From: Jason Xing @ 2024-10-30  3:04 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Martin KaFai Lau, willemb, davem, edumazet, kuba, pabeni, dsahern,
	ast, daniel, andrii, eddyz87, song, yonghong.song, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, shuah, ykolal, bpf, netdev,
	Jason Xing

On Wed, Oct 30, 2024 at 10:47 AM Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
>
> Jason Xing wrote:
> > On Wed, Oct 30, 2024 at 9:45 AM Willem de Bruijn
> > <willemdebruijn.kernel@gmail.com> wrote:
> > >
> > > Jason Xing wrote:
> > > > On Wed, Oct 30, 2024 at 7:00 AM Martin KaFai Lau <martin.lau@linux.dev> wrote:
> > > > >
> > > > > On 10/28/24 4:05 AM, Jason Xing wrote:
> > > > > > From: Jason Xing <kernelxing@tencent.com>
> > > > > >
> > > > > > This patch has introduced a separate sk_tsflags_bpf for bpf
> > > > > > extension, which helps us let two feature work nearly at the
> > > > > > same time.
> > > > > >
> > > > > > Each feature will finally take effect on skb_shinfo(skb)->tx_flags,
> > > > > > say, tcp_tx_timestamp() for TCP or skb_setup_tx_timestamp() for
> > > > > > other types, so in __skb_tstamp_tx() we are unable to know which
> > > > > > feature is turned on, unless we check each feature's own socket
> > > > > > flag field.
> > > > > >
> > > > > > Signed-off-by: Jason Xing <kernelxing@tencent.com>
> > > > > > ---
> > > > > >   include/net/sock.h |  1 +
> > > > > >   net/core/skbuff.c  | 39 +++++++++++++++++++++++++++++++++++++++
> > > > > >   2 files changed, 40 insertions(+)
> > > > > >
> > > > > > diff --git a/include/net/sock.h b/include/net/sock.h
> > > > > > index 7464e9f9f47c..5384f1e49f5c 100644
> > > > > > --- a/include/net/sock.h
> > > > > > +++ b/include/net/sock.h
> > > > > > @@ -445,6 +445,7 @@ struct sock {
> > > > > >       u32                     sk_reserved_mem;
> > > > > >       int                     sk_forward_alloc;
> > > > > >       u32                     sk_tsflags;
> > > > > > +     u32                     sk_tsflags_bpf;
> > > > > >       __cacheline_group_end(sock_write_rxtx);
> > > > > >
> > > > > >       __cacheline_group_begin(sock_write_tx);
> > > > > > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > > > > > index 1cf8416f4123..39309f75e105 100644
> > > > > > --- a/net/core/skbuff.c
> > > > > > +++ b/net/core/skbuff.c
> > > > > > @@ -5539,6 +5539,32 @@ void skb_complete_tx_timestamp(struct sk_buff *skb,
> > > > > >   }
> > > > > >   EXPORT_SYMBOL_GPL(skb_complete_tx_timestamp);
> > > > > >
> > > > > > +/* This function is used to test if application SO_TIMESTAMPING feature
> > > > > > + * or bpf SO_TIMESTAMPING feature is loaded by checking its own socket flags.
> > > > > > + */
> > > > > > +static bool sk_tstamp_tx_flags(struct sock *sk, u32 tsflags, int tstype)
> > > > > > +{
> > > > > > +     u32 testflag;
> > > > > > +
> > > > > > +     switch (tstype) {
> > > > > > +     case SCM_TSTAMP_SCHED:
> > > > > > +             testflag = SOF_TIMESTAMPING_TX_SCHED;
> > > > > > +             break;
> > > > > > +     case SCM_TSTAMP_SND:
> > > > > > +             testflag = SOF_TIMESTAMPING_TX_SOFTWARE;
> > > > > > +             break;
> > > > > > +     case SCM_TSTAMP_ACK:
> > > > > > +             testflag = SOF_TIMESTAMPING_TX_ACK;
> > > > > > +             break;
> > > > > > +     default:
> > > > > > +             return false;
> > > > > > +     }
> > > > > > +     if (tsflags & testflag)
> > > > > > +             return true;
> > > > > > +
> > > > > > +     return false;
> > > > > > +}
> > > > > > +
> > > > > >   static void skb_tstamp_tx_output(struct sk_buff *orig_skb,
> > > > > >                                const struct sk_buff *ack_skb,
> > > > > >                                struct skb_shared_hwtstamps *hwtstamps,
> > > > > > @@ -5549,6 +5575,9 @@ static void skb_tstamp_tx_output(struct sk_buff *orig_skb,
> > > > > >       u32 tsflags;
> > > > > >
> > > > > >       tsflags = READ_ONCE(sk->sk_tsflags);
> > > > > > +     if (!sk_tstamp_tx_flags(sk, tsflags, tstype))
> > > > >
> > > > > I still don't get this part since v2. How does it work with cmsg only
> > > > > SOF_TIMESTAMPING_TX_*?
> > > > >
> > > > > I tried with "./txtimestamp -6 -c 1 -C -N -L ::1" and it does not return any tx
> > > > > time stamp after this patch.
> > > > >
> > > > > I am likely missing something
> > > > > or v2 concluded that this behavior change is acceptable?
> > > >
> > > > Sorry, I submitted this series accidentally removing one important
> > > > thing which is similar to what Vadim Fedorenko mentioned in the v1
> > > > [1]:
> > > > adding another member like sk_flags_bpf to handle the cmsg case.
> > > >
> > > > Willem, would it be acceptable to add another field in struct sock to
> > > > help us recognise the case where BPF and cmsg works parallelly?
> > > >
> > > > [1]: https://lore.kernel.org/all/662873cb-a897-464e-bdb3-edf01363c3b2@linux.dev/
> > >
> > > The current timestamp flags don't need a u32. Maybe just reserve a bit
> > > for this purpose?
> >
> > Sure. Good suggestion.
> >
> > But I think only using one bit to reflect whether the sk->sk_tsflags
> > is used by normal or cmsg features is not enough. The existing
> > implementation in tcp_sendmsg_locked() doesn't override the
> > sk->sk_tsflags even the normal and cmsg features enabled parallelly.
> > It only overrides sockc.tsflags in tcp_sendmsg_locked(). Based on
> > that, even if at some point users suddenly remove the cmsg use and
> > then the prior normal SO_TIMESTAMPING continues to work.
> >
> > How about this, please see below:
> > For now, sk->sk_tsflags only uses 17 bits (see the last one
> > SOF_TIMESTAMPING_OPT_RX_FILTER). The cmsg feature only uses 4 flags
> > (see SOF_TIMESTAMPING_TX_RECORD_MASK in __sock_cmsg_send()). With that
> > said, we could reserve the highest four bits for cmsg use for the
> > moment. Four bits represents four points where we can record the
> > timestamp in the tx case.
> >
> > Do you agree on this point?
>
> I don't follow.
>
> I probably miss the entire point.
>
> The goal for sockcm fields is to start with the sk field and
> optionally override based on cmsg. This is what sockcm_init does for
> tsflags.
>
> This information is for the skb, so these are recording flags.
>
> Why does the new datapath need to know whether features are enabled
> through setsockopt or on a per-call basis with a cmsg?
>
> The goal was always to keep the reporting flags per socket, but make
> the recording flag per packet, mainly for sampling.

If a user enables 1) the cmsg feature and 2) the bpf feature at the
same time, we allow each feature to work independently.

How could it work? It relies on the sk_tstamp_tx_flags() function in
the current patch: when we are in __skb_tstamp_tx(), we cannot know
which flags each feature has set without fetching sk->sk_tsflags and
sk->sk_tsflags_bpf. Only then do we know which timestamp to record.
Put simply, when the test "skb_shinfo(skb)->tx_flags &
SKBTX_SCHED_TSTAMP" is true in __skb_tstamp_tx(), we cannot tell
whether it was the cmsg feature that asked for a SCHED timestamp. So
we need those two socket tsflags fields to help us.
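
A minimal userspace sketch of that check, assuming two per-socket flag
fields as in this patch. The demo_* names are hypothetical, and the
enums only mirror the SOF_TIMESTAMPING_TX_* and SCM_TSTAMP_* values
from the uapi headers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Values mirror SOF_TIMESTAMPING_TX_* in include/uapi/linux/net_tstamp.h. */
enum { TX_SOFTWARE = 1 << 1, TX_SCHED = 1 << 8, TX_ACK = 1 << 9 };

/* Values mirror SCM_TSTAMP_* in include/uapi/linux/errqueue.h. */
enum { TSTAMP_SND, TSTAMP_SCHED, TSTAMP_ACK };

struct demo_sock {
	uint32_t sk_tsflags;		/* application SO_TIMESTAMPING */
	uint32_t sk_tsflags_bpf;	/* bpf extension */
};

/* Map a timestamp type to the recording flag that requests it. */
static uint32_t demo_tstype_to_flag(int tstype)
{
	switch (tstype) {
	case TSTAMP_SCHED:
		return TX_SCHED;
	case TSTAMP_SND:
		return TX_SOFTWARE;
	case TSTAMP_ACK:
		return TX_ACK;
	default:
		return 0;
	}
}

/* Did the given feature (application or bpf) ask for this timestamp? */
static bool demo_tstamp_tx_wanted(const struct demo_sock *sk, int tstype,
				  bool bpf)
{
	uint32_t flag = demo_tstype_to_flag(tstype);
	uint32_t tsflags = bpf ? sk->sk_tsflags_bpf : sk->sk_tsflags;

	return (tsflags & flag) != 0;
}
```

Note that a flag requested only through cmsg never reaches
sk->sk_tsflags, so a check of this shape returns false for it. That is
the cmsg-only case raised earlier in the thread.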

Thanks,
Jason

* Re: [PATCH net-next v3 10/14] net-timestamp: add basic support with tskey offset
  2024-10-29 19:45           ` Willem de Bruijn
@ 2024-10-30  3:27             ` Jason Xing
  0 siblings, 0 replies; 88+ messages in thread
From: Jason Xing @ 2024-10-30  3:27 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: davem, edumazet, kuba, pabeni, dsahern, willemb, ast, daniel,
	andrii, martin.lau, eddyz87, song, yonghong.song, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, shuah, ykolal, bpf, netdev,
	Jason Xing

On Wed, Oct 30, 2024 at 3:45 AM Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
>
> > > > > > +static long int sock_calculate_tskey_offset(struct sock *sk, int val, int bpf_type)
> > > > > > +{
> > > > > > +     u32 tskey;
> > > > > > +
> > > > > > +     if (sk_is_tcp(sk)) {
> > > > > > +             if ((1 << sk->sk_state) & (TCPF_CLOSE | TCPF_LISTEN))
> > > > > > +                     return -EINVAL;
> > > > > > +
> > > > > > +             if (val & SOF_TIMESTAMPING_OPT_ID_TCP)
> > > > > > +                     tskey = tcp_sk(sk)->write_seq;
> > > > > > +             else
> > > > > > +                     tskey = tcp_sk(sk)->snd_una;
> > > > > > +     } else {
> > > > > > +             tskey = 0;
> > > > > > +     }
> > > > > > +
> > > > > > +     if (bpf_type && (sk->sk_tsflags & SOF_TIMESTAMPING_OPT_ID)) {
> > > > > > +             sk->sk_tskey_bpf_offset = tskey - atomic_read(&sk->sk_tskey);
> > > > > > +             return 0;
> > > > > > +     } else if (!bpf_type && (sk->sk_tsflags_bpf & SOF_TIMESTAMPING_OPT_ID)) {
> > > > > > +             sk->sk_tskey_bpf_offset = atomic_read(&sk->sk_tskey) - tskey;
> > > > > > +     } else {
> > > > > > +             sk->sk_tskey_bpf_offset = 0;
> > > > > > +     }
> > > > > > +
> > > > > > +     return tskey;
> > > > > > +}
> > > > > > +
> > > > > >  int sock_set_tskey(struct sock *sk, int val, int bpf_type)
> > > > > >  {
> > > > > >       u32 tsflags = bpf_type ? sk->sk_tsflags_bpf : sk->sk_tsflags;
> > > > > > @@ -901,17 +944,13 @@ int sock_set_tskey(struct sock *sk, int val, int bpf_type)
> > > > > >
> > > > > >       if (val & SOF_TIMESTAMPING_OPT_ID &&
> > > > > >           !(tsflags & SOF_TIMESTAMPING_OPT_ID)) {
> > > > > > -             if (sk_is_tcp(sk)) {
> > > > > > -                     if ((1 << sk->sk_state) &
> > > > > > -                         (TCPF_CLOSE | TCPF_LISTEN))
> > > > > > -                             return -EINVAL;
> > > > > > -                     if (val & SOF_TIMESTAMPING_OPT_ID_TCP)
> > > > > > -                             atomic_set(&sk->sk_tskey, tcp_sk(sk)->write_seq);
> > > > > > -                     else
> > > > > > -                             atomic_set(&sk->sk_tskey, tcp_sk(sk)->snd_una);
> > > > > > -             } else {
> > > > > > -                     atomic_set(&sk->sk_tskey, 0);
> > > > > > -             }
> > > > > > +             long int ret;
> > > > > > +
> > > > > > +             ret = sock_calculate_tskey_offset(sk, val, bpf_type);
> > > > > > +             if (ret <= 0)
> > > > > > +                     return ret;
> > > > > > +
> > > > > > +             atomic_set(&sk->sk_tskey, ret);
> > > > > >       }
> > > > > >
> > > > > >       return 0;
> > > > > > @@ -956,10 +995,15 @@ static int sock_set_timestamping_bpf(struct sock *sk,
> > > > > >                                    struct so_timestamping timestamping)
> > > > > >  {
> > > > > >       u32 flags = timestamping.flags;
> > > > > > +     int ret;
> > > > > >
> > > > > >       if (flags & ~SOF_TIMESTAMPING_BPF_SUPPPORTED_MASK)
> > > > > >               return -EINVAL;
> > > > > >
> > > > > > +     ret = sock_set_tskey(sk, flags, 1);
> > > > > > +     if (ret)
> > > > > > +             return ret;
> > > > > > +
> > > > > >       WRITE_ONCE(sk->sk_tsflags_bpf, flags);
> > > > > >
> > > > > >       return 0;
> > > > >
> > > > > I'm a bit hazy on when this can be called. We can assume that this new
> > > > > BPF operation cannot race with the existing setsockopt nor with the
> > > > > datapath that might touch the atomic fields, right?
> > > >
> > > > It surely can race with the existing setsockopt.
> > > >
> > > > 1)
> > > > if (only existing setsockopt works) {
> > > >         then sk->sk_tskey is set through setsockopt, sk_tskey_bpf_offset is 0.
> > > > }
> > > >
> > > > 2)
> > > > if (only bpf setsockopt works) {
> > > >         then sk->sk_tskey is set through bpf_setsockopt,
> > > > sk_tskey_bpf_offset is 0.
> > > > }
> > > >
> > > > 3)
> > > > if (existing setsockopt already started, here we enable the bpf feature) {
> > > >         then sk->sk_tskey will not change, but the sk_tskey_bpf_offset
> > > > will be calculated.
> > > > }
> > > >
> > > > 4)
> > > > if (bpf setsockopt already started, here we enable the application feature) {
> > > >         then sk->sk_tskey will re-initialized/overridden by
> > > > setsockopt, and the sk_tskey_bpf_offset will be calculated.
> > > > }
> >
> > I will copy the above to the commit message next time in order to
> > provide a clear design to future readers.
> >
> > > >
> > > > Then the skb tskey will use the sk->sk_tskey like before.
> > >
> > > I mean race as in the setsockopt and bpf setsockopt and datapath
> > > running concurrently.
> > >
> > > As long as both variants of setsockopt hold the socket lock, that
> > > won't happen.
> > >
> > > The datapath is lockless for UDP, so atomic_inc sk_tskey can race
> > > with calculating the difference. But this is a known issue. A process
> > > that cares should not run setsockopt and send concurrently. So this is
> > > fine too.
> >
> > Oh, now I see. Thanks for the detailed explanation! So Do you feel if
> > we need to take care of this in the future, I mean, after this series
> > gets merged...?
>
> If there is a race condition, then that cannot be fixed up later.
>
> But from my admittedly brief analysis, it seems that there is nothing
> here that needs to be fixed: control plane operations (setsockopt)
> hold the socket lock. A setsockopt that conflicts with a lockless
> datapath update will have a slightly ambiguous offset. It is under
> controlof and up to the user to avoid that if they care.

I got it. Thanks.

* Re: [PATCH net-next v3 02/14] net-timestamp: allow two features to work parallelly
  2024-10-30  3:04             ` Jason Xing
@ 2024-10-30  5:37               ` Martin KaFai Lau
  2024-10-30  6:42                 ` Jason Xing
  0 siblings, 1 reply; 88+ messages in thread
From: Martin KaFai Lau @ 2024-10-30  5:37 UTC (permalink / raw)
  To: Jason Xing, Willem de Bruijn
  Cc: willemb, davem, edumazet, kuba, pabeni, dsahern, ast, daniel,
	andrii, eddyz87, song, yonghong.song, john.fastabend, kpsingh,
	sdf, haoluo, jolsa, shuah, ykolal, bpf, netdev, Jason Xing

On 10/29/24 8:04 PM, Jason Xing wrote:
>>>>>>>    static void skb_tstamp_tx_output(struct sk_buff *orig_skb,
>>>>>>>                                 const struct sk_buff *ack_skb,
>>>>>>>                                 struct skb_shared_hwtstamps *hwtstamps,
>>>>>>> @@ -5549,6 +5575,9 @@ static void skb_tstamp_tx_output(struct sk_buff *orig_skb,
>>>>>>>        u32 tsflags;
>>>>>>>
>>>>>>>        tsflags = READ_ONCE(sk->sk_tsflags);
>>>>>>> +     if (!sk_tstamp_tx_flags(sk, tsflags, tstype))
>>>>>>
>>>>>> I still don't get this part since v2. How does it work with cmsg only
>>>>>> SOF_TIMESTAMPING_TX_*?
>>>>>>
>>>>>> I tried with "./txtimestamp -6 -c 1 -C -N -L ::1" and it does not return any tx
>>>>>> time stamp after this patch.
>>>>>>
>>>>>> I am likely missing something
>>>>>> or v2 concluded that this behavior change is acceptable?
>>>>>
>>>>> Sorry, I submitted this series accidentally removing one important
>>>>> thing which is similar to what Vadim Fedorenko mentioned in the v1
>>>>> [1]:
>>>>> adding another member like sk_flags_bpf to handle the cmsg case.
>>>>>
>>>>> Willem, would it be acceptable to add another field in struct sock to
>>>>> help us recognise the case where BPF and cmsg works parallelly?
>>>>>
>>>>> [1]: https://lore.kernel.org/all/662873cb-a897-464e-bdb3-edf01363c3b2@linux.dev/
>>>>
>>>> The current timestamp flags don't need a u32. Maybe just reserve a bit
>>>> for this purpose?
>>>
>>> Sure. Good suggestion.
>>>
>>> But I think only using one bit to reflect whether the sk->sk_tsflags
>>> is used by normal or cmsg features is not enough. The existing
>>> implementation in tcp_sendmsg_locked() doesn't override the
>>> sk->sk_tsflags even the normal and cmsg features enabled parallelly.
>>> It only overrides sockc.tsflags in tcp_sendmsg_locked(). Based on
>>> that, even if at some point users suddenly remove the cmsg use and
>>> then the prior normal SO_TIMESTAMPING continues to work.
>>>
>>> How about this, please see below:
>>> For now, sk->sk_tsflags only uses 17 bits (see the last one
>>> SOF_TIMESTAMPING_OPT_RX_FILTER). The cmsg feature only uses 4 flags
>>> (see SOF_TIMESTAMPING_TX_RECORD_MASK in __sock_cmsg_send()). With that
>>> said, we could reserve the highest four bits for cmsg use for the
>>> moment. Four bits represents four points where we can record the
>>> timestamp in the tx case.
>>>
>>> Do you agree on this point?
>>
>> I don't follow.
>>
>> I probably miss the entire point.
>>
>> The goal for sockcm fields is to start with the sk field and
>> optionally override based on cmsg. This is what sockcm_init does for
>> tsflags.
>>
>> This information is for the skb, so these are recording flags.
>>
>> Why does the new datapath need to know whether features are enabled
>> through setsockopt or on a per-call basis with a cmsg?
>>
>> The goal was always to keep the reporting flags per socket, but make
>> the recording flag per packet, mainly for sampling.
> 
> If a user uses 1) cmsg feature, 2) bpf feature at the same time, we
> allow each feature to work independently.
> 
> How could it work? It relies on sk_tstamp_tx_flags() function in the
> current patch: when we are in __skb_tstamp_tx(), we cannot know which
> flags in each feature are set without fetching sk->sk_tsflags and
> sk->sk_tsflags_bpf. Then we are able to know what timestamp we want to
> record. To put it in a simple way, we're not sure if the user wants to
> see a SCHED timestamp by using the cmsg feature in __skb_tstamp_tx()
> if we hit this test statement "skb_shinfo(skb)->tx_flags &
> SKBTX_SCHED_TSTAMP)". So we need those two socket tsflag fields to
> help us.

I also don't see how a new bit/integer in a sk can tell whether a given cmsg
turned timestamping on or off. One cmsg may have the tx timestamp on while the
next cmsg has it off.

There is still one unused bit in skb_shinfo(skb)->tx_flags. How about defining
a SKBTX_BPF for everything? imo, fine-grained control over
SOF_TIMESTAMPING_TX_{SCHED,SOFTWARE} is not useful for bpf. Almost all of the
time the bpf program wants all available timestamps (sched, software, and
hwtstamp if the NIC has it). Since bpf runs in the kernel, it is much cheaper
because it does not need to do the skb alloc/clone and queue to the error queue.

I think the bpf prog needs to capture a timestamp at the sendmsg() time, so a 
bpf prog needs to be called at sendmsg(). Then it may as well allow the bpf 
prog@sendmsg() to decide if it needs to set the SKBTX_BPF bit in 
skb_shinfo(skb)->tx_flags or not.

TCP_SKB_CB(skb)->txstamp_ack can also work similarly. There is still an unused
bit in "struct tcp_skb_cb", so maybe add TCP_SKB_CB(skb)->bpf_txstamp_ack.

Then there is no need to control SOF_TIMESTAMPING_TX_* through bpf_setsockopt(). 
It only needs one bpf specific socket option like bpf_setsockopt(SOL_SOCKET, 
BPF_TX_TIMESTAMPING) to guard if the bpf-prog@sendmsg() needs to be called or 
not. There are already other TCP_BPF_IW,TCP_BPF_SNDCWND_CLAMP,... specific 
socket options.

imo, this is a simpler interface and also gives the bpf prog per packet control 
at the same time.
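
A rough userspace model of this idea, as a sketch only: every demo_*
name is hypothetical, the bit value chosen for DEMO_SKBTX_BPF is an
assumption, and the callback merely stands in for a bpf prog invoked at
sendmsg() time.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_SKBTX_SCHED_TSTAMP	(1u << 6)	/* existing per-skb request */
#define DEMO_SKBTX_BPF		(1u << 7)	/* hypothetical spare bit */

struct demo_sock {
	bool bpf_tx_timestamping;	/* set by one bpf-specific sockopt */
};

struct demo_skb {
	uint8_t tx_flags;
};

/* Stand-in for the bpf prog run at sendmsg(): true means "timestamp this". */
typedef bool (*demo_sendmsg_prog)(const struct demo_sock *sk);

static bool demo_prog_always(const struct demo_sock *sk)
{
	(void)sk;
	return true;
}

static bool demo_prog_never(const struct demo_sock *sk)
{
	(void)sk;
	return false;
}

/* sendmsg() path: the socket option guards the call, the prog decides. */
static void demo_sendmsg(const struct demo_sock *sk, struct demo_skb *skb,
			 demo_sendmsg_prog prog)
{
	if (sk->bpf_tx_timestamping && prog && prog(sk))
		skb->tx_flags |= DEMO_SKBTX_BPF;
}

/* Completion path: only the per-skb bit is consulted, no socket flags. */
static bool demo_tstamp_for_bpf(const struct demo_skb *skb)
{
	return (skb->tx_flags & DEMO_SKBTX_BPF) != 0;
}
```

Because the decision is recorded in the skb itself, the later
timestamping code would not need to know whether the application's
flags came from setsockopt() or from a cmsg.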

[ This user space cmsg-only testing has to be in the selftests/bpf to show how 
it can work. ]

> 
> Thanks,
> Jason


* Re: [PATCH net-next v3 10/14] net-timestamp: add basic support with tskey offset
  2024-10-28 11:05 ` [PATCH net-next v3 10/14] net-timestamp: add basic support with tskey offset Jason Xing
  2024-10-29  1:24   ` Willem de Bruijn
@ 2024-10-30  5:42   ` Martin KaFai Lau
  2024-10-30  6:50     ` Jason Xing
  1 sibling, 1 reply; 88+ messages in thread
From: Martin KaFai Lau @ 2024-10-30  5:42 UTC (permalink / raw)
  To: Jason Xing
  Cc: davem, edumazet, kuba, pabeni, dsahern, willemdebruijn.kernel,
	willemb, ast, daniel, andrii, eddyz87, song, yonghong.song,
	john.fastabend, kpsingh, sdf, haoluo, jolsa, shuah, ykolal, bpf,
	netdev, Jason Xing

On 10/28/24 4:05 AM, Jason Xing wrote:
> +/* Used to track the tskey for bpf extension
> + *
> + * @sk_tskey: bpf extension can use it only when no application uses.
> + *            Application can use it directly regardless of bpf extension.
> + *
> + * There are three strategies:
> + * 1) If we've already set through setsockopt() and here we're going to set
> + *    OPT_ID for bpf use, we will not re-initialize the @sk_tskey and will
> + *    keep the record of delta between the current "key" and previous key.
> + * 2) If we've already set through bpf_setsockopt() and here we're going to
> + *    set for application use, we will record the delta first and then
> + *    override/initialize the @sk_tskey.
> + * 3) other cases, which means only either of them takes effect, so initialize
> + *    everything simplely.
> + */
> +static long int sock_calculate_tskey_offset(struct sock *sk, int val, int bpf_type)
> +{
> +	u32 tskey;
> +
> +	if (sk_is_tcp(sk)) {
> +		if ((1 << sk->sk_state) & (TCPF_CLOSE | TCPF_LISTEN))
> +			return -EINVAL;
> +
> +		if (val & SOF_TIMESTAMPING_OPT_ID_TCP)
> +			tskey = tcp_sk(sk)->write_seq;
> +		else
> +			tskey = tcp_sk(sk)->snd_una;
> +	} else {
> +		tskey = 0;
> +	}
> +
> +	if (bpf_type && (sk->sk_tsflags & SOF_TIMESTAMPING_OPT_ID)) {
> +		sk->sk_tskey_bpf_offset = tskey - atomic_read(&sk->sk_tskey);
> +		return 0;
> +	} else if (!bpf_type && (sk->sk_tsflags_bpf & SOF_TIMESTAMPING_OPT_ID)) {
> +		sk->sk_tskey_bpf_offset = atomic_read(&sk->sk_tskey) - tskey;
> +	} else {
> +		sk->sk_tskey_bpf_offset = 0;
> +	}
> +
> +	return tskey;
> +}

Before diving into this route, the bpf prog can peek into the tcp seq no in the 
skb. It can also look at the sk->sk_tskey for UDP socket. Can you explain why 
those are not enough information for the bpf prog?
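
As an illustration of why the sequence number alone may be enough, here is a
minimal userspace C sketch of deriving a byte-offset key from TCP sequence
numbers. The helper names and the choice of base_seq (the sequence number at
the moment the program starts tracking) are assumptions, not code from this
series.

```c
#include <assert.h>
#include <stdint.h>

/* Sequence number of the last byte carried by a segment that starts at
 * 'seq' and holds 'len' payload bytes. TCP sequence space is 32 bits
 * wide, so the addition is allowed to wrap. */
static uint32_t last_byte_seq(uint32_t seq, uint32_t len)
{
	assert(len > 0);
	return seq + len - 1;
}

/* Byte offset of 'seq' relative to 'base_seq'. Unsigned subtraction
 * yields the right distance even when the sequence space has wrapped
 * between the two values. */
static uint32_t seq_to_key(uint32_t seq, uint32_t base_seq)
{
	return seq - base_seq;
}
```

For a 100-byte send starting at sequence 1000 with base 1000, the key of the
last byte is 99. The same arithmetic holds when the send crosses the 32-bit
wrap point.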

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH net-next v3 14/14] bpf: add simple bpf tests in the tx path for so_timstamping feature
  2024-10-28 11:05 ` [PATCH net-next v3 14/14] bpf: add simple bpf tests in the tx path for so_timstamping feature Jason Xing
  2024-10-29  1:26   ` Willem de Bruijn
@ 2024-10-30  5:57   ` Martin KaFai Lau
  2024-10-30  6:54     ` Jason Xing
  1 sibling, 1 reply; 88+ messages in thread
From: Martin KaFai Lau @ 2024-10-30  5:57 UTC (permalink / raw)
  To: Jason Xing
  Cc: davem, edumazet, kuba, pabeni, dsahern, willemdebruijn.kernel,
	willemb, ast, daniel, andrii, eddyz87, song, yonghong.song,
	john.fastabend, kpsingh, sdf, haoluo, jolsa, shuah, ykolal, bpf,
	netdev, Jason Xing

On 10/28/24 4:05 AM, Jason Xing wrote:
> From: Jason Xing <kernelxing@tencent.com>
> 
> Only check if we pass those three key points after we enable the
> bpf extension for so_timestamping. During each point, we can choose
> whether to print the current timestamp.

The bpf prog usually does more than just print. The bpf prog aggregates data 
first before sending all raw data to the user space.

The selftests will be more useful for the reviewer and the future user if they
at least show how to calculate the tx delay between [sendmsg, SCHED],
[SCHED, SND], [SND, ACK].
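
A minimal sketch of that delay calculation, written as plain userspace C rather
than a bpf prog so it can be compiled on its own. The enum, struct, and function
names are hypothetical; a real program would likely read the clock with
bpf_ktime_get_ns() and keep one record per packet key in a bpf map. The sketch
assumes all four timestamps come from the same monotonic clock.

```c
#include <assert.h>
#include <stdint.h>

/* The four points on the tx path named in the review. */
enum tx_point { TX_SENDMSG, TX_SCHED, TX_SND, TX_ACK, TX_POINT_MAX };

/* Hypothetical per-packet record; 0 means "point not seen yet". */
struct tx_record {
	uint64_t ts_ns[TX_POINT_MAX];
};

struct tx_delays {
	uint64_t sendmsg_to_sched_ns;
	uint64_t sched_to_snd_ns;
	uint64_t snd_to_ack_ns;
};

/* Store the timestamp observed at one point of the tx path. */
static void record_point(struct tx_record *rec, enum tx_point p,
			 uint64_t now_ns)
{
	assert(p < TX_POINT_MAX);
	rec->ts_ns[p] = now_ns;
}

/* Returns 0 and fills 'out' once all four points are recorded,
 * and -1 while any of them is still missing. */
static int compute_delays(const struct tx_record *rec, struct tx_delays *out)
{
	for (int i = 0; i < TX_POINT_MAX; i++)
		if (rec->ts_ns[i] == 0)
			return -1;

	out->sendmsg_to_sched_ns = rec->ts_ns[TX_SCHED] - rec->ts_ns[TX_SENDMSG];
	out->sched_to_snd_ns = rec->ts_ns[TX_SND] - rec->ts_ns[TX_SCHED];
	out->snd_to_ack_ns = rec->ts_ns[TX_ACK] - rec->ts_ns[TX_SND];
	return 0;
}
```

Only the three deltas (or a histogram built from them) would then need to reach
user space, instead of every raw timestamp.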

[ ... ]

> +SEC("sockops")
> +int skops_sockopt(struct bpf_sock_ops *skops)
> +{
> +	struct bpf_sock *bpf_sk = skops->sk;
> +	struct sock *sk;
> +
> +	if (!bpf_sk)
> +		return 1;
> +
> +	sk = (struct sock *)bpf_skc_to_tcp_sock(bpf_sk);
> +	if (!sk)
> +		return 1;
> +
> +	switch (skops->op) {
> +	case BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB:
> +		nr_active += !bpf_test_sockopt(skops, sk);
> +		break;
> +	case BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB:
> +		nr_passive += !bpf_test_sockopt(skops, sk);
> +		break;
> +	case BPF_SOCK_OPS_TS_SCHED_OPT_CB:
> +		nr_sched += 1;
> +		break;
> +	case BPF_SOCK_OPS_TS_SW_OPT_CB:
> +		nr_txsw += 1;
> +		break;
> +	case BPF_SOCK_OPS_TS_ACK_OPT_CB:
> +		nr_ack += 1;

> +		break;
> +	}
> +
> +	return 1;
> +}
> +
> +char _license[] SEC("license") = "GPL";


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH net-next v3 02/14] net-timestamp: allow two features to work parallelly
  2024-10-30  5:37               ` Martin KaFai Lau
@ 2024-10-30  6:42                 ` Jason Xing
  2024-10-30 17:15                   ` Willem de Bruijn
  0 siblings, 1 reply; 88+ messages in thread
From: Jason Xing @ 2024-10-30  6:42 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: Willem de Bruijn, willemb, davem, edumazet, kuba, pabeni, dsahern,
	ast, daniel, andrii, eddyz87, song, yonghong.song, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, shuah, ykolal, bpf, netdev,
	Jason Xing

On Wed, Oct 30, 2024 at 1:37 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 10/29/24 8:04 PM, Jason Xing wrote:
> >>>>>>>    static void skb_tstamp_tx_output(struct sk_buff *orig_skb,
> >>>>>>>                                 const struct sk_buff *ack_skb,
> >>>>>>>                                 struct skb_shared_hwtstamps *hwtstamps,
> >>>>>>> @@ -5549,6 +5575,9 @@ static void skb_tstamp_tx_output(struct sk_buff *orig_skb,
> >>>>>>>        u32 tsflags;
> >>>>>>>
> >>>>>>>        tsflags = READ_ONCE(sk->sk_tsflags);
> >>>>>>> +     if (!sk_tstamp_tx_flags(sk, tsflags, tstype))
> >>>>>>
> >>>>>> I still don't get this part since v2. How does it work with cmsg only
> >>>>>> SOF_TIMESTAMPING_TX_*?
> >>>>>>
> >>>>>> I tried with "./txtimestamp -6 -c 1 -C -N -L ::1" and it does not return any tx
> >>>>>> time stamp after this patch.
> >>>>>>
> >>>>>> I am likely missing something
> >>>>>> or v2 concluded that this behavior change is acceptable?
> >>>>>
> >>>>> Sorry, I submitted this series accidentally removing one important
> >>>>> thing which is similar to what Vadim Fedorenko mentioned in the v1
> >>>>> [1]:
> >>>>> adding another member like sk_flags_bpf to handle the cmsg case.
> >>>>>
> >>>>> Willem, would it be acceptable to add another field in struct sock to
> >>>>> help us recognise the case where BPF and cmsg works parallelly?
> >>>>>
> >>>>> [1]: https://lore.kernel.org/all/662873cb-a897-464e-bdb3-edf01363c3b2@linux.dev/
> >>>>
> >>>> The current timestamp flags don't need a u32. Maybe just reserve a bit
> >>>> for this purpose?
> >>>
> >>> Sure. Good suggestion.
> >>>
> >>> But I think only using one bit to reflect whether the sk->sk_tsflags
> >>> is used by normal or cmsg features is not enough. The existing
> >>> implementation in tcp_sendmsg_locked() doesn't override the
> >>> sk->sk_tsflags even the normal and cmsg features enabled parallelly.
> >>> It only overrides sockc.tsflags in tcp_sendmsg_locked(). Based on
> >>> that, even if at some point users suddenly remove the cmsg use and
> >>> then the prior normal SO_TIMESTAMPING continues to work.
> >>>
> >>> How about this, please see below:
> >>> For now, sk->sk_tsflags only uses 17 bits (see the last one
> >>> SOF_TIMESTAMPING_OPT_RX_FILTER). The cmsg feature only uses 4 flags
> >>> (see SOF_TIMESTAMPING_TX_RECORD_MASK in __sock_cmsg_send()). With that
> >>> said, we could reserve the highest four bits for cmsg use for the
> >>> moment. Four bits represents four points where we can record the
> >>> timestamp in the tx case.
> >>>
> >>> Do you agree on this point?
> >>
> >> I don't follow.
> >>
> >> I probably miss the entire point.
> >>
> >> The goal for sockcm fields is to start with the sk field and
> >> optionally override based on cmsg. This is what sockcm_init does for
> >> tsflags.
> >>
> >> This information is for the skb, so these are recording flags.
> >>
> >> Why does the new datapath need to know whether features are enabled
> >> through setsockopt or on a per-call basis with a cmsg?
> >>
> >> The goal was always to keep the reporting flags per socket, but make
> >> the recording flag per packet, mainly for sampling.
> >
> > If a user uses 1) cmsg feature, 2) bpf feature at the same time, we
> > allow each feature to work independently.
> >
> > How could it work? It relies on sk_tstamp_tx_flags() function in the
> > current patch: when we are in __skb_tstamp_tx(), we cannot know which
> > flags in each feature are set without fetching sk->sk_tsflags and
> > sk->sk_tsflags_bpf. Then we are able to know what timestamp we want to
> > record. To put it in a simple way, we're not sure if the user wants to
> > see a SCHED timestamp by using the cmsg feature in __skb_tstamp_tx()
> > if we hit this test statement "skb_shinfo(skb)->tx_flags &
> > SKBTX_SCHED_TSTAMP)". So we need those two socket tsflag fields to
> > help us.
>
> I also don't see how a new bit/integer in a sk can help to tell the per cmsg
> on/off. This cmsg may have tx timestamp on while the next cmsg can have it off.

It's not hard to use it because we can clear every socket cmsg tsflags
when we're done with the check in tcp_sendmsg_locked() if the cmsg feature
is not enabled. Then we can accurately know which timestamp we should
print in the tx path.

>
> There is still one bit in skb_shinfo(skb)->tx_flags. How about define a
> SKBTX_BPF for everything. imo, the fine control on
> SOF_TIMESTAMPING_TX_{SCHED,SOFTWARE} is not useful for bpf. Almost all of the
> time the bpf program wants all available time stamps (sched, software, and
> hwtstamp if the NIC has it).

Sorry, I really doubt that we can lose the fine control. I still
reckon that providing more options to users is a good way to go,
especially for some latency-sensitive applications: enabling one,
two, or three tx flags could lead to different performance. Users
of SO_TIMESTAMPING use the feature very differently, and not all of
them prefer to record everything.

> Since bpf is in the kernel, it is much cheaper
> because it does not need to do skb_alloc/clone and queue to the error queue.
>
> I think the bpf prog needs to capture a timestamp at the sendmsg() time, so a
> bpf prog needs to be called at sendmsg().

Agreed, I planned to implement this after this series.

> Then it may as well allow the bpf
> prog@sendmsg() to decide if it needs to set the SKBTX_BPF bit in
> skb_shinfo(skb)->tx_flags or not.
>
> TCP_SKB_CB(skb)->txstamp_ack can also work similarly. There is still unused bit
> in "struct tcp_skb_cb", so may be adding TCP_SKB_CB(skb)->bpf_txstamp_ack
>
> Then there is no need to control SOF_TIMESTAMPING_TX_* through bpf_setsockopt().
> It only needs one bpf specific socket option like bpf_setsockopt(SOL_SOCKET,
> BPF_TX_TIMESTAMPING) to guard if the bpf-prog@sendmsg() needs to be called or
> not. There are already other TCP_BPF_IW,TCP_BPF_SNDCWND_CLAMP,... specific
> socket options.
>
> imo, this is a simpler interface and also gives the bpf prog per packet control
> at the same time.

Very interesting idea, but the precondition is that we give up the
fine control...

Thanks,
Jason

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH net-next v3 10/14] net-timestamp: add basic support with tskey offset
  2024-10-30  5:42   ` Martin KaFai Lau
@ 2024-10-30  6:50     ` Jason Xing
  2024-10-31  1:17       ` Martin KaFai Lau
  0 siblings, 1 reply; 88+ messages in thread
From: Jason Xing @ 2024-10-30  6:50 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: davem, edumazet, kuba, pabeni, dsahern, willemdebruijn.kernel,
	willemb, ast, daniel, andrii, eddyz87, song, yonghong.song,
	john.fastabend, kpsingh, sdf, haoluo, jolsa, shuah, ykolal, bpf,
	netdev, Jason Xing

On Wed, Oct 30, 2024 at 1:42 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 10/28/24 4:05 AM, Jason Xing wrote:
> > +/* Used to track the tskey for bpf extension
> > + *
> > + * @sk_tskey: bpf extension can use it only when no application uses.
> > + *            Application can use it directly regardless of bpf extension.
> > + *
> > + * There are three strategies:
> > + * 1) If we've already set through setsockopt() and here we're going to set
> > + *    OPT_ID for bpf use, we will not re-initialize the @sk_tskey and will
> > + *    keep the record of delta between the current "key" and previous key.
> > + * 2) If we've already set through bpf_setsockopt() and here we're going to
> > + *    set for application use, we will record the delta first and then
> > + *    override/initialize the @sk_tskey.
> > + * 3) other cases, which means only either of them takes effect, so initialize
> > + *    everything simplely.
> > + */
> > +static long int sock_calculate_tskey_offset(struct sock *sk, int val, int bpf_type)
> > +{
> > +     u32 tskey;
> > +
> > +     if (sk_is_tcp(sk)) {
> > +             if ((1 << sk->sk_state) & (TCPF_CLOSE | TCPF_LISTEN))
> > +                     return -EINVAL;
> > +
> > +             if (val & SOF_TIMESTAMPING_OPT_ID_TCP)
> > +                     tskey = tcp_sk(sk)->write_seq;
> > +             else
> > +                     tskey = tcp_sk(sk)->snd_una;
> > +     } else {
> > +             tskey = 0;
> > +     }
> > +
> > +     if (bpf_type && (sk->sk_tsflags & SOF_TIMESTAMPING_OPT_ID)) {
> > +             sk->sk_tskey_bpf_offset = tskey - atomic_read(&sk->sk_tskey);
> > +             return 0;
> > +     } else if (!bpf_type && (sk->sk_tsflags_bpf & SOF_TIMESTAMPING_OPT_ID)) {
> > +             sk->sk_tskey_bpf_offset = atomic_read(&sk->sk_tskey) - tskey;
> > +     } else {
> > +             sk->sk_tskey_bpf_offset = 0;
> > +     }
> > +
> > +     return tskey;
> > +}
>
> Before diving into this route, the bpf prog can peek into the tcp seq no in the
> skb. It can also look at the sk->sk_tskey for UDP socket. Can you explain why
> those are not enough information for the bpf prog?

Well, it does make sense. It seems we don't need to implement tskey
for this bpf feature...

Since I don't know bpf well enough, could you provide more hints
that I can follow to write a bpf program that prints more information
from the skb? For example, in the last patch of this series, in
tools/testing/selftests/bpf/prog_tests/so_timestamping.c, do we have a
feasible way to do that?

Thanks,
Jason

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH net-next v3 14/14] bpf: add simple bpf tests in the tx path for so_timstamping feature
  2024-10-30  5:57   ` Martin KaFai Lau
@ 2024-10-30  6:54     ` Jason Xing
  0 siblings, 0 replies; 88+ messages in thread
From: Jason Xing @ 2024-10-30  6:54 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: davem, edumazet, kuba, pabeni, dsahern, willemdebruijn.kernel,
	willemb, ast, daniel, andrii, eddyz87, song, yonghong.song,
	john.fastabend, kpsingh, sdf, haoluo, jolsa, shuah, ykolal, bpf,
	netdev, Jason Xing

On Wed, Oct 30, 2024 at 1:58 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 10/28/24 4:05 AM, Jason Xing wrote:
> > From: Jason Xing <kernelxing@tencent.com>
> >
> > Only check if we pass those three key points after we enable the
> > bpf extension for so_timestamping. During each point, we can choose
> > whether to print the current timestamp.
>
> The bpf prog usually does more than just print. The bpf prog aggregates data
> first before sending all raw data to the user space.
>
> The selftests will be more useful for the reviewer and the future user if it can
> at least show how it can calculate the tx delay between [sendmsg, SCHED],
> [SCHED, SND], [SND, ACK].

Got it, I will dig into how to implement it and then post a new
version. Before this, I only used the bpf program to print timestamps
to one file without using those advanced functions (like aggregating
data) in bpf. Let me try :) If you know some good examples of this,
please show me :) Thanks in advance.

Thanks,
Jason

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH net-next v3 02/14] net-timestamp: allow two features to work parallelly
  2024-10-30  6:42                 ` Jason Xing
@ 2024-10-30 17:15                   ` Willem de Bruijn
  2024-10-30 23:54                     ` Jason Xing
  0 siblings, 1 reply; 88+ messages in thread
From: Willem de Bruijn @ 2024-10-30 17:15 UTC (permalink / raw)
  To: Jason Xing, Martin KaFai Lau
  Cc: Willem de Bruijn, willemb, davem, edumazet, kuba, pabeni, dsahern,
	ast, daniel, andrii, eddyz87, song, yonghong.song, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, shuah, ykolal, bpf, netdev,
	Jason Xing

Jason Xing wrote:
> On Wed, Oct 30, 2024 at 1:37 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
> >
> > On 10/29/24 8:04 PM, Jason Xing wrote:
> > >>>>>>>    static void skb_tstamp_tx_output(struct sk_buff *orig_skb,
> > >>>>>>>                                 const struct sk_buff *ack_skb,
> > >>>>>>>                                 struct skb_shared_hwtstamps *hwtstamps,
> > >>>>>>> @@ -5549,6 +5575,9 @@ static void skb_tstamp_tx_output(struct sk_buff *orig_skb,
> > >>>>>>>        u32 tsflags;
> > >>>>>>>
> > >>>>>>>        tsflags = READ_ONCE(sk->sk_tsflags);
> > >>>>>>> +     if (!sk_tstamp_tx_flags(sk, tsflags, tstype))
> > >>>>>>
> > >>>>>> I still don't get this part since v2. How does it work with cmsg only
> > >>>>>> SOF_TIMESTAMPING_TX_*?
> > >>>>>>
> > >>>>>> I tried with "./txtimestamp -6 -c 1 -C -N -L ::1" and it does not return any tx
> > >>>>>> time stamp after this patch.
> > >>>>>>
> > >>>>>> I am likely missing something
> > >>>>>> or v2 concluded that this behavior change is acceptable?
> > >>>>>
> > >>>>> Sorry, I submitted this series accidentally removing one important
> > >>>>> thing which is similar to what Vadim Fedorenko mentioned in the v1
> > >>>>> [1]:
> > >>>>> adding another member like sk_flags_bpf to handle the cmsg case.
> > >>>>>
> > >>>>> Willem, would it be acceptable to add another field in struct sock to
> > >>>>> help us recognise the case where BPF and cmsg works parallelly?
> > >>>>>
> > >>>>> [1]: https://lore.kernel.org/all/662873cb-a897-464e-bdb3-edf01363c3b2@linux.dev/
> > >>>>
> > >>>> The current timestamp flags don't need a u32. Maybe just reserve a bit
> > >>>> for this purpose?
> > >>>
> > >>> Sure. Good suggestion.
> > >>>
> > >>> But I think only using one bit to reflect whether the sk->sk_tsflags
> > >>> is used by normal or cmsg features is not enough. The existing
> > >>> implementation in tcp_sendmsg_locked() doesn't override the
> > >>> sk->sk_tsflags even the normal and cmsg features enabled parallelly.
> > >>> It only overrides sockc.tsflags in tcp_sendmsg_locked(). Based on
> > >>> that, even if at some point users suddenly remove the cmsg use and
> > >>> then the prior normal SO_TIMESTAMPING continues to work.
> > >>>
> > >>> How about this, please see below:
> > >>> For now, sk->sk_tsflags only uses 17 bits (see the last one
> > >>> SOF_TIMESTAMPING_OPT_RX_FILTER). The cmsg feature only uses 4 flags
> > >>> (see SOF_TIMESTAMPING_TX_RECORD_MASK in __sock_cmsg_send()). With that
> > >>> said, we could reserve the highest four bits for cmsg use for the
> > >>> moment. Four bits represents four points where we can record the
> > >>> timestamp in the tx case.
> > >>>
> > >>> Do you agree on this point?
> > >>
> > >> I don't follow.
> > >>
> > >> I probably miss the entire point.
> > >>
> > >> The goal for sockcm fields is to start with the sk field and
> > >> optionally override based on cmsg. This is what sockcm_init does for
> > >> tsflags.
> > >>
> > >> This information is for the skb, so these are recording flags.
> > >>
> > >> Why does the new datapath need to know whether features are enabled
> > >> through setsockopt or on a per-call basis with a cmsg?
> > >>
> > >> The goal was always to keep the reporting flags per socket, but make
> > >> the recording flag per packet, mainly for sampling.
> > >
> > > If a user uses 1) cmsg feature, 2) bpf feature at the same time, we
> > > allow each feature to work independently.
> > >
> > > How could it work? It relies on sk_tstamp_tx_flags() function in the
> > > current patch: when we are in __skb_tstamp_tx(), we cannot know which
> > > flags in each feature are set without fetching sk->sk_tsflags and
> > > sk->sk_tsflags_bpf. Then we are able to know what timestamp we want to
> > > record. To put it in a simple way, we're not sure if the user wants to
> > > see a SCHED timestamp by using the cmsg feature in __skb_tstamp_tx()
> > > if we hit this test statement "skb_shinfo(skb)->tx_flags &
> > > SKBTX_SCHED_TSTAMP)". So we need those two socket tsflag fields to
> > > help us.
> >
> > I also don't see how a new bit/integer in a sk can help to tell the per cmsg
> > on/off. This cmsg may have tx timestamp on while the next cmsg can have it off.
> 
> It's not hard to use it because we can clear every socket cmsg tsflags
> when we're done the check in tcp_sendmsg_locked() if the cmsg feature
> is not enabled. Then we can accurately know which timestamp should we
> print in the tx path.
> 
> >
> > There is still one bit in skb_shinfo(skb)->tx_flags. How about define a
> > SKBTX_BPF for everything. imo, the fine control on
> > SOF_TIMESTAMPING_TX_{SCHED,SOFTWARE} is not useful for bpf. Almost all of the
> > time the bpf program wants all available time stamps (sched, software, and
> > hwtstamp if the NIC has it).

I like the approach of just calling BPF on every hook. Assuming that
the call is very cheap, which AFAIK is true.

In that case we don't need complex branching in C to optionally skip
this step, as we do for reporting to userspace.

All the logic and complexity is in the BPF program itself.

We obviously then let go of the goal to model the BPF API close to the
existing SO_TIMESTAMPING API. Though I advocated for keeping them
aligned, I also think we should just tailor it to what makes most
sense in the BPF space.
 
> Sorry, I really doubt that we can lose the fine control. 

Since BPF is called at each reporting point, no control is lost,
actually.
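
A minimal userspace C sketch of that point: if the hook passes the reporting
point as an argument, the filtering can live in the program. The names below
are hypothetical; the three points loosely correspond to the
BPF_SOCK_OPS_TS_SCHED_OPT_CB, BPF_SOCK_OPS_TS_SW_OPT_CB and
BPF_SOCK_OPS_TS_ACK_OPT_CB ops used in patch 14 of this series.

```c
#include <assert.h>
#include <stdint.h>

/* Reporting points at which the kernel would invoke the program. */
enum report_point { REPORT_SCHED, REPORT_SND, REPORT_ACK, REPORT_MAX };

/* Hypothetical program-owned state. */
struct prog_state {
	uint32_t wanted;		/* bitmask of 1u << report_point */
	uint32_t hits[REPORT_MAX];	/* reports handled per point */
};

/* Models the program body. The kernel side calls it unconditionally at
 * every reporting point; the choice of which points matter is made
 * here, not by per-point flags on the socket. */
static void prog_on_report(struct prog_state *st, enum report_point p)
{
	assert(p < REPORT_MAX);
	if (!(st->wanted & (1u << p)))
		return;		/* the program is not interested in this point */
	st->hits[p]++;
}
```

A program that sets only the SCHED and ACK bits in 'wanted' ignores SND
reports, which gives the same selectivity as separate
SOF_TIMESTAMPING_TX_* flags without extra branching on the kernel side.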

> I still
> reckon that providing more options to users is a good way to go,
> especially for some latency sensitive applications, enabling one or
> two or three tx flags could lead to different performances. For the
> users of SO_TIMESTAMPING, they use the feature very differently. Not
> all users prefer to record everything.
> 
> > Since bpf is in the kernel, it is much cheaper
> > because it does not need to do skb_alloc/clone and queue to the error queue.
> >
> > I think the bpf prog needs to capture a timestamp at the sendmsg() time, so a
> > bpf prog needs to be called at sendmsg().
> 
> Agreed, I planned to implement this after this series.
> 
> > Then it may as well allow the bpf
> > prog@sendmsg() to decide if it needs to set the SKBTX_BPF bit in
> > skb_shinfo(skb)->tx_flags or not.
> >
> > TCP_SKB_CB(skb)->txstamp_ack can also work similarly. There is still unused bit
> > in "struct tcp_skb_cb", so may be adding TCP_SKB_CB(skb)->bpf_txstamp_ack
> >
> > Then there is no need to control SOF_TIMESTAMPING_TX_* through bpf_setsockopt().
> > It only needs one bpf specific socket option like bpf_setsockopt(SOL_SOCKET,
> > BPF_TX_TIMESTAMPING) to guard if the bpf-prog@sendmsg() needs to be called or
> > not. There are already other TCP_BPF_IW,TCP_BPF_SNDCWND_CLAMP,... specific
> > socket options.
> >
> > imo, this is a simpler interface and also gives the bpf prog per packet control
> > at the same time.
> 
> Very interesting idea, but the precondition is that we give up the
> fine control...
> 
> Thanks,
> Jason



^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH net-next v3 02/14] net-timestamp: allow two features to work parallelly
  2024-10-30 17:15                   ` Willem de Bruijn
@ 2024-10-30 23:54                     ` Jason Xing
  2024-10-31  0:13                       ` Jason Xing
  0 siblings, 1 reply; 88+ messages in thread
From: Jason Xing @ 2024-10-30 23:54 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Martin KaFai Lau, willemb, davem, edumazet, kuba, pabeni, dsahern,
	ast, daniel, andrii, eddyz87, song, yonghong.song, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, shuah, ykolal, bpf, netdev,
	Jason Xing

On Thu, Oct 31, 2024 at 1:15 AM Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
>
> Jason Xing wrote:
> > On Wed, Oct 30, 2024 at 1:37 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
> > >
> > > On 10/29/24 8:04 PM, Jason Xing wrote:
> > > >>>>>>>    static void skb_tstamp_tx_output(struct sk_buff *orig_skb,
> > > >>>>>>>                                 const struct sk_buff *ack_skb,
> > > >>>>>>>                                 struct skb_shared_hwtstamps *hwtstamps,
> > > >>>>>>> @@ -5549,6 +5575,9 @@ static void skb_tstamp_tx_output(struct sk_buff *orig_skb,
> > > >>>>>>>        u32 tsflags;
> > > >>>>>>>
> > > >>>>>>>        tsflags = READ_ONCE(sk->sk_tsflags);
> > > >>>>>>> +     if (!sk_tstamp_tx_flags(sk, tsflags, tstype))
> > > >>>>>>
> > > >>>>>> I still don't get this part since v2. How does it work with cmsg only
> > > >>>>>> SOF_TIMESTAMPING_TX_*?
> > > >>>>>>
> > > >>>>>> I tried with "./txtimestamp -6 -c 1 -C -N -L ::1" and it does not return any tx
> > > >>>>>> time stamp after this patch.
> > > >>>>>>
> > > >>>>>> I am likely missing something
> > > >>>>>> or v2 concluded that this behavior change is acceptable?
> > > >>>>>
> > > >>>>> Sorry, I submitted this series accidentally removing one important
> > > >>>>> thing which is similar to what Vadim Fedorenko mentioned in the v1
> > > >>>>> [1]:
> > > >>>>> adding another member like sk_flags_bpf to handle the cmsg case.
> > > >>>>>
> > > >>>>> Willem, would it be acceptable to add another field in struct sock to
> > > >>>>> help us recognise the case where BPF and cmsg works parallelly?
> > > >>>>>
> > > >>>>> [1]: https://lore.kernel.org/all/662873cb-a897-464e-bdb3-edf01363c3b2@linux.dev/
> > > >>>>
> > > >>>> The current timestamp flags don't need a u32. Maybe just reserve a bit
> > > >>>> for this purpose?
> > > >>>
> > > >>> Sure. Good suggestion.
> > > >>>
> > > >>> But I think only using one bit to reflect whether the sk->sk_tsflags
> > > >>> is used by normal or cmsg features is not enough. The existing
> > > >>> implementation in tcp_sendmsg_locked() doesn't override the
> > > >>> sk->sk_tsflags even the normal and cmsg features enabled parallelly.
> > > >>> It only overrides sockc.tsflags in tcp_sendmsg_locked(). Based on
> > > >>> that, even if at some point users suddenly remove the cmsg use and
> > > >>> then the prior normal SO_TIMESTAMPING continues to work.
> > > >>>
> > > >>> How about this, please see below:
> > > >>> For now, sk->sk_tsflags only uses 17 bits (see the last one
> > > >>> SOF_TIMESTAMPING_OPT_RX_FILTER). The cmsg feature only uses 4 flags
> > > >>> (see SOF_TIMESTAMPING_TX_RECORD_MASK in __sock_cmsg_send()). With that
> > > >>> said, we could reserve the highest four bits for cmsg use for the
> > > >>> moment. Four bits represents four points where we can record the
> > > >>> timestamp in the tx case.
> > > >>>
> > > >>> Do you agree on this point?
> > > >>
> > > >> I don't follow.
> > > >>
> > > >> I probably miss the entire point.
> > > >>
> > > >> The goal for sockcm fields is to start with the sk field and
> > > >> optionally override based on cmsg. This is what sockcm_init does for
> > > >> tsflags.
> > > >>
> > > >> This information is for the skb, so these are recording flags.
> > > >>
> > > >> Why does the new datapath need to know whether features are enabled
> > > >> through setsockopt or on a per-call basis with a cmsg?
> > > >>
> > > >> The goal was always to keep the reporting flags per socket, but make
> > > >> the recording flag per packet, mainly for sampling.
> > > >
> > > > If a user uses 1) cmsg feature, 2) bpf feature at the same time, we
> > > > allow each feature to work independently.
> > > >
> > > > How could it work? It relies on sk_tstamp_tx_flags() function in the
> > > > current patch: when we are in __skb_tstamp_tx(), we cannot know which
> > > > flags in each feature are set without fetching sk->sk_tsflags and
> > > > sk->sk_tsflags_bpf. Then we are able to know what timestamp we want to
> > > > record. To put it in a simple way, we're not sure if the user wants to
> > > > see a SCHED timestamp by using the cmsg feature in __skb_tstamp_tx()
> > > > if we hit this test statement "skb_shinfo(skb)->tx_flags &
> > > > SKBTX_SCHED_TSTAMP)". So we need those two socket tsflag fields to
> > > > help us.
> > >
> > > I also don't see how a new bit/integer in a sk can help to tell the per cmsg
> > > on/off. This cmsg may have tx timestamp on while the next cmsg can have it off.
> >
> > It's not hard to use it because we can clear every socket cmsg tsflags
> > when we're done the check in tcp_sendmsg_locked() if the cmsg feature
> > is not enabled. Then we can accurately know which timestamp should we
> > print in the tx path.
> >
> > >
> > > There is still one bit in skb_shinfo(skb)->tx_flags. How about define a
> > > SKBTX_BPF for everything. imo, the fine control on
> > > SOF_TIMESTAMPING_TX_{SCHED,SOFTWARE} is not useful for bpf. Almost all of the
> > > time the bpf program wants all available time stamps (sched, software, and
> > > hwtstamp if the NIC has it).
>
> I like the approach of just calling BPF on every hook. Assuming that
> the call is very cheap, which AFAIK is true.
>
> In that case we don't need complex branching in C to optionally skip
> this step, as we do for reporting to userspace.
>
> All the logic and complexity is in the BPF program itself.
>
> We obviously then let go of the goal to model the BPF API close to the
> existing SO_TIMESTAMPING API. Though I advocated for keeping them
> aligned, I also think we should just tailor it to what makes most
> sense in the BPF space.
>
> > Sorry, I really doubt that we can lose the fine control.
>
> Since BPF is called at each reporting point, no control is lost,
> actually.

Sorry, I still don't get it :( If there is something wrong with my
understanding, please correct me.

BPF is only called on every sock_ops point in this case, like
BPF_SOCK_OPS_TCP_CONNECT_CB, not on every report point of
SO_TIMESTAMPING. If we add a check in __skb_tstamp_tx() to test whether
the skb has SKBTX_BPF set, then bpf will be called at every point. But
that's different from SO_TIMESTAMPING, which is driven by each bit
(SCHED/TX_SOFTWARE) to control each point. My question is: if we use
SKBTX_BPF for everything, how could we control and know when we hit the
SCHED/TX_SOFTWARE/ACK time from the bpf program's perspective? There is
only one bit... It will print everything without the ability to control.

Then if we try the SKBTX_BPF approach, it seems we don't actually
have to add a test statement in __skb_tstamp_tx(). Instead, we
could add it in more places (by only checking the SKBTX_BPF flag), say,
tcp_write_xmit(), right?

I'm not saying I'm opposed to this idea. Instead I think it's very
useful, just a few questions haunting me...

Thanks,
Jason

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH net-next v3 02/14] net-timestamp: allow two features to work parallelly
  2024-10-30 23:54                     ` Jason Xing
@ 2024-10-31  0:13                       ` Jason Xing
  2024-10-31  6:27                         ` Martin KaFai Lau
  0 siblings, 1 reply; 88+ messages in thread
From: Jason Xing @ 2024-10-31  0:13 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Martin KaFai Lau, willemb, davem, edumazet, kuba, pabeni, dsahern,
	ast, daniel, andrii, eddyz87, song, yonghong.song, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, shuah, ykolal, bpf, netdev,
	Jason Xing

On Thu, Oct 31, 2024 at 7:54 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
>
> On Thu, Oct 31, 2024 at 1:15 AM Willem de Bruijn
> <willemdebruijn.kernel@gmail.com> wrote:
> >
> > Jason Xing wrote:
> > > On Wed, Oct 30, 2024 at 1:37 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
> > > >
> > > > On 10/29/24 8:04 PM, Jason Xing wrote:
> > > > >>>>>>>    static void skb_tstamp_tx_output(struct sk_buff *orig_skb,
> > > > >>>>>>>                                 const struct sk_buff *ack_skb,
> > > > >>>>>>>                                 struct skb_shared_hwtstamps *hwtstamps,
> > > > >>>>>>> @@ -5549,6 +5575,9 @@ static void skb_tstamp_tx_output(struct sk_buff *orig_skb,
> > > > >>>>>>>        u32 tsflags;
> > > > >>>>>>>
> > > > >>>>>>>        tsflags = READ_ONCE(sk->sk_tsflags);
> > > > >>>>>>> +     if (!sk_tstamp_tx_flags(sk, tsflags, tstype))
> > > > >>>>>>
> > > > >>>>>> I still don't get this part since v2. How does it work with cmsg only
> > > > >>>>>> SOF_TIMESTAMPING_TX_*?
> > > > >>>>>>
> > > > >>>>>> I tried with "./txtimestamp -6 -c 1 -C -N -L ::1" and it does not return any tx
> > > > >>>>>> time stamp after this patch.
> > > > >>>>>>
> > > > >>>>>> I am likely missing something
> > > > >>>>>> or v2 concluded that this behavior change is acceptable?
> > > > >>>>>
> > > > >>>>> Sorry, I submitted this series accidentally removing one important
> > > > >>>>> thing which is similar to what Vadim Fedorenko mentioned in the v1
> > > > >>>>> [1]:
> > > > >>>>> adding another member like sk_flags_bpf to handle the cmsg case.
> > > > >>>>>
> > > > >>>>> Willem, would it be acceptable to add another field in struct sock to
> > > > >>>>> help us recognise the case where BPF and cmsg works parallelly?
> > > > >>>>>
> > > > >>>>> [1]: https://lore.kernel.org/all/662873cb-a897-464e-bdb3-edf01363c3b2@linux.dev/
> > > > >>>>
> > > > >>>> The current timestamp flags don't need a u32. Maybe just reserve a bit
> > > > >>>> for this purpose?
> > > > >>>
> > > > >>> Sure. Good suggestion.
> > > > >>>
> > > > >>> But I think only using one bit to reflect whether the sk->sk_tsflags
> > > > >>> is used by normal or cmsg features is not enough. The existing
> > > > >>> implementation in tcp_sendmsg_locked() doesn't override the
> > > > >>> sk->sk_tsflags even the normal and cmsg features enabled parallelly.
> > > > >>> It only overrides sockc.tsflags in tcp_sendmsg_locked(). Based on
> > > > >>> that, even if at some point users suddenly remove the cmsg use and
> > > > >>> then the prior normal SO_TIMESTAMPING continues to work.
> > > > >>>
> > > > >>> How about this, please see below:
> > > > >>> For now, sk->sk_tsflags only uses 17 bits (see the last one
> > > > >>> SOF_TIMESTAMPING_OPT_RX_FILTER). The cmsg feature only uses 4 flags
> > > > >>> (see SOF_TIMESTAMPING_TX_RECORD_MASK in __sock_cmsg_send()). With that
> > > > >>> said, we could reserve the highest four bits for cmsg use for the
> > > > >>> moment. Four bits represents four points where we can record the
> > > > >>> timestamp in the tx case.
> > > > >>>
> > > > >>> Do you agree on this point?
> > > > >>
> > > > >> I don't follow.
> > > > >>
> > > > >> I probably miss the entire point.
> > > > >>
> > > > >> The goal for sockcm fields is to start with the sk field and
> > > > >> optionally override based on cmsg. This is what sockcm_init does for
> > > > >> tsflags.
> > > > >>
> > > > >> This information is for the skb, so these are recording flags.
> > > > >>
> > > > >> Why does the new datapath need to know whether features are enabled
> > > > >> through setsockopt or on a per-call basis with a cmsg?
> > > > >>
> > > > >> The goal was always to keep the reporting flags per socket, but make
> > > > >> the recording flag per packet, mainly for sampling.
> > > > >
> > > > > If a user uses 1) cmsg feature, 2) bpf feature at the same time, we
> > > > > allow each feature to work independently.
> > > > >
> > > > > How could it work? It relies on sk_tstamp_tx_flags() function in the
> > > > > current patch: when we are in __skb_tstamp_tx(), we cannot know which
> > > > > flags in each feature are set without fetching sk->sk_tsflags and
> > > > > sk->sk_tsflags_bpf. Then we are able to know what timestamp we want to
> > > > > record. To put it in a simple way, we're not sure if the user wants to
> > > > > see a SCHED timestamp by using the cmsg feature in __skb_tstamp_tx()
> > > > > if we hit this test statement "skb_shinfo(skb)->tx_flags &
> > > > > SKBTX_SCHED_TSTAMP)". So we need those two socket tsflag fields to
> > > > > help us.
> > > >
> > > > I also don't see how a new bit/integer in a sk can help to tell the per cmsg
> > > > on/off. This cmsg may have tx timestamp on while the next cmsg can have it off.
> > >
> > > It's not hard to use it because we can clear every socket cmsg tsflags
> > > when we're done the check in tcp_sendmsg_locked() if the cmsg feature
> > > is not enabled. Then we can accurately know which timestamp should we
> > > print in the tx path.
> > >
> > > >
> > > > There is still one bit in skb_shinfo(skb)->tx_flags. How about define a
> > > > SKBTX_BPF for everything. imo, the fine control on
> > > > SOF_TIMESTAMPING_TX_{SCHED,SOFTWARE} is not useful for bpf. Almost all of the
> > > > time the bpf program wants all available time stamps (sched, software, and
> > > > hwtstamp if the NIC has it).
> >
> > I like the approach of just calling BPF on every hook. Assuming that
> > the call is very cheap, which AFAIK is true.
> >
> > In that case we don't need complex branching in C to optionally skip
> > this step, as we do for reporting to userspace.
> >
> > All the logic and complexity is in the BPF program itself.
> >
> > We obviously then let go of the goal to model the BPF API close to the
> > existing SO_TIMESTAMPING API. Though I advocated for keeping them
> > aligned, I also think we should just tailor it to what makes most
> > sense in the BPF space.
> >
> > > Sorry, I really doubt that we can lose the fine control.
> >
> > Since BPF is called at each reporting point, no control is lost,
> > actually.
>
> Sorry, I still don't get it :( If there is something wrong with my
> understanding, please correct me.
>
> BPF is only called on every sock_opt point in this case, like
> BPF_SOCK_OPS_TCP_CONNECT_CB, not every report point of
> SO_TIMESTAMPING. If we add check to test if skb is set SKBTX_BPF in
> __skb_tstamp_tx(), then at every point bpf will be called. But it's
> different from SO_TIMESTAMPING driven by each bit (SCHED/TX_SOFTWARE)
> to control each point. My question is if we would use SKBTX_BPF for
> everything, how could we control and know when we hit
> SCHED/TX_SOFTWARE/ACK time from the bpf programs' perspective? Only
> one bit... It will print everything without the ability to control.
>
> Then if we try the SKBTX_BPF approach, it seems we don't actually
> insist on adding a test statement in __skb_tstamp_tx(). Instead, we
> could add into more places (by only checking the SKBTX_BPF flag), say,
> tcp_write_xmit(), right?

I realized that we will have some new sock_ops callbacks like
TS_SCHED_OPT_CB in patch 4, so the bpf program can decide whether to
print or not... Each of those callbacks will be invoked regardless of
whether the related flags in the skb are set. So it's pointless to add
more control through skb tsflags at each TS_xx_OPT_CB point.

Am I understanding this correctly? Now, I'm totally fine with this great idea!

Thanks,
Jason


* Re: [PATCH net-next v3 10/14] net-timestamp: add basic support with tskey offset
  2024-10-30  6:50     ` Jason Xing
@ 2024-10-31  1:17       ` Martin KaFai Lau
  2024-10-31  2:41         ` Jason Xing
  0 siblings, 1 reply; 88+ messages in thread
From: Martin KaFai Lau @ 2024-10-31  1:17 UTC (permalink / raw)
  To: Jason Xing
  Cc: davem, edumazet, kuba, pabeni, dsahern, willemdebruijn.kernel,
	willemb, ast, daniel, andrii, eddyz87, song, yonghong.song,
	john.fastabend, kpsingh, sdf, haoluo, jolsa, shuah, ykolal, bpf,
	netdev, Jason Xing

On 10/29/24 11:50 PM, Jason Xing wrote:
> On Wed, Oct 30, 2024 at 1:42 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>>
>> On 10/28/24 4:05 AM, Jason Xing wrote:
>>> +/* Used to track the tskey for bpf extension
>>> + *
>>> + * @sk_tskey: bpf extension can use it only when no application uses.
>>> + *            Application can use it directly regardless of bpf extension.
>>> + *
>>> + * There are three strategies:
>>> + * 1) If we've already set through setsockopt() and here we're going to set
>>> + *    OPT_ID for bpf use, we will not re-initialize the @sk_tskey and will
>>> + *    keep the record of delta between the current "key" and previous key.
>>> + * 2) If we've already set through bpf_setsockopt() and here we're going to
>>> + *    set for application use, we will record the delta first and then
>>> + *    override/initialize the @sk_tskey.
>>> + * 3) other cases, which means only either of them takes effect, so initialize
>>> + *    everything simply.
>>> + */
>>> +static long int sock_calculate_tskey_offset(struct sock *sk, int val, int bpf_type)
>>> +{
>>> +     u32 tskey;
>>> +
>>> +     if (sk_is_tcp(sk)) {
>>> +             if ((1 << sk->sk_state) & (TCPF_CLOSE | TCPF_LISTEN))
>>> +                     return -EINVAL;
>>> +
>>> +             if (val & SOF_TIMESTAMPING_OPT_ID_TCP)
>>> +                     tskey = tcp_sk(sk)->write_seq;
>>> +             else
>>> +                     tskey = tcp_sk(sk)->snd_una;
>>> +     } else {
>>> +             tskey = 0;
>>> +     }
>>> +
>>> +     if (bpf_type && (sk->sk_tsflags & SOF_TIMESTAMPING_OPT_ID)) {
>>> +             sk->sk_tskey_bpf_offset = tskey - atomic_read(&sk->sk_tskey);
>>> +             return 0;
>>> +     } else if (!bpf_type && (sk->sk_tsflags_bpf & SOF_TIMESTAMPING_OPT_ID)) {
>>> +             sk->sk_tskey_bpf_offset = atomic_read(&sk->sk_tskey) - tskey;
>>> +     } else {
>>> +             sk->sk_tskey_bpf_offset = 0;
>>> +     }
>>> +
>>> +     return tskey;
>>> +}
>>
>> Before diving into this route, the bpf prog can peek into the tcp seq no in the
>> skb. It can also look at the sk->sk_tskey for UDP socket. Can you explain why
>> those are not enough information for the bpf prog?
> 
> Well, it does make sense. It seems we don't need to implement tskey
> for this bpf feature...
> 
> Due to lack of enough knowledge of bpf, could you provide more hints
> that I can follow to write a bpf program to print more information
> from the skb? Like in the last patch of this series, in
> tools/testing/selftests/bpf/prog_tests/so_timestamping.c, do we have a
> feasible way to do that?

The bpf-prog@sendmsg() will be run to capture a timestamp for sendmsg().
When running the bpf-prog@sendmsg(), the skb can be set in the "struct
bpf_sock_ops_kern sock_ops;" which is passed to the sockops prog. Take a look at
bpf_skops_write_hdr_opt().

The bpf prog cannot directly access skops->skb now. That is because the sockops
prog sees the uapi "struct bpf_sock_ops" instead of "struct
bpf_sock_ops_kern". The conversion is done in sock_ops_convert_ctx_access. That
is the old, pre-BTF way. I don't want to extend the uapi "struct bpf_sock_ops".

Instead, use bpf_cast_to_kern_ctx((struct bpf_sock_ops *)skops_ctx) to get a
trusted "struct bpf_sock_ops_kern *skops" pointer. Then it can access
skops->skb. AFAIK, tcb->seq should already be available during sendmsg. The prog
should be able to get it from TCP_SKB_CB(skb)->seq with bpf_core_cast. Take
a look at the existing examples of bpf_core_cast.

The same goes for skb->data. It can use bpf_dynptr_from_skb(), which is not
available to skops programs now but should be easy to expose.

The bpf prog wants to calculate the delay between [sendmsg, SCHED], [SCHED,
SND], [SND, ACK]. That is why (at least in my mental model) a key is needed to
correlate the sendmsg, SCHED, SND, and ACK timestamps. The tcp seqno could
serve as that key.

All that said, while looking at tcp_tx_timestamp() again, there is always
"shinfo->tskey = TCP_SKB_CB(skb)->seq + skb->len - 1;". shinfo->tskey can be
used directly as-is by the bpf prog. I think now I am missing why the bpf prog
needs the sk_tskey in the sk?

In the bpf prog, when the SCHED/SND/ACK timestamp comes back, it has to find the
earlier sendmsg timestamp. One option is to store the earlier sendmsg timestamp
in a bpf map keyed by seqno or the shinfo's tskey. Storing in a bpf map
keyed by seqno/tskey is probably what the selftest should do. In the future, we
can consider allowing the rbtree in the bpf sk local storage for searching by
seqno. The shinfo's hwtstamp can also be used if there is a need.
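
The correlation idea above can be sketched in plain userspace C. This is only a
model under stated assumptions: a small fixed-size table stands in for the bpf
map, and the names ts_table, ts_entry, ts_lookup, ts_record, ts_delay and the
ts_stage enum are all hypothetical. A real sockops program would use a BPF hash
map and take its timestamps from the kernel instead.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stage names mirroring the sendmsg/SCHED/SND/ACK points. */
enum ts_stage { TS_SENDMSG, TS_SCHED, TS_SND, TS_ACK, TS_STAGE_MAX };

#define TS_SLOTS 64	/* assumed capacity of the stand-in "map" */

struct ts_entry {
	int used;
	uint32_t key;			/* seqno or shinfo->tskey */
	uint64_t ns[TS_STAGE_MAX];	/* 0 means "not seen yet" */
};

struct ts_table {
	struct ts_entry slot[TS_SLOTS];
};

static void ts_table_init(struct ts_table *t)
{
	memset(t, 0, sizeof(*t));
}

/* Find the entry for key with linear probing, optionally creating it.
 * Entries are never deleted in this model, so probing can stop at the
 * first unused slot.
 */
static struct ts_entry *ts_lookup(struct ts_table *t, uint32_t key, int create)
{
	uint32_t i;

	for (i = 0; i < TS_SLOTS; i++) {
		struct ts_entry *e = &t->slot[(key + i) % TS_SLOTS];

		if (e->used && e->key == key)
			return e;
		if (!e->used) {
			if (!create)
				return NULL;
			e->used = 1;
			e->key = key;
			return e;
		}
	}
	return NULL;	/* table full */
}

/* Record a timestamp. Only the sendmsg stage may create an entry, so a
 * SCHED/SND/ACK report without an earlier sendmsg is rejected with -1.
 */
static int ts_record(struct ts_table *t, uint32_t key,
		     enum ts_stage stage, uint64_t now_ns)
{
	struct ts_entry *e;

	assert(stage < TS_STAGE_MAX);
	e = ts_lookup(t, key, stage == TS_SENDMSG);
	if (!e)
		return -1;
	e->ns[stage] = now_ns;
	return 0;
}

/* Delay between two stages, or 0 if either timestamp is missing. */
static uint64_t ts_delay(const struct ts_entry *e,
			 enum ts_stage from, enum ts_stage to)
{
	if (!e->ns[from] || !e->ns[to])
		return 0;
	return e->ns[to] - e->ns[from];
}
```

In this model the caller records the sendmsg timestamp first, then feeds each
later report with the same key and reads the per-stage delays from the entry.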


* Re: [PATCH net-next v3 10/14] net-timestamp: add basic support with tskey offset
  2024-10-31  1:17       ` Martin KaFai Lau
@ 2024-10-31  2:41         ` Jason Xing
  2024-10-31  3:27           ` Jason Xing
                             ` (2 more replies)
  0 siblings, 3 replies; 88+ messages in thread
From: Jason Xing @ 2024-10-31  2:41 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: davem, edumazet, kuba, pabeni, dsahern, willemdebruijn.kernel,
	willemb, ast, daniel, andrii, eddyz87, song, yonghong.song,
	john.fastabend, kpsingh, sdf, haoluo, jolsa, shuah, ykolal, bpf,
	netdev, Jason Xing

On Thu, Oct 31, 2024 at 9:17 AM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 10/29/24 11:50 PM, Jason Xing wrote:
> > On Wed, Oct 30, 2024 at 1:42 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
> >>
> >> On 10/28/24 4:05 AM, Jason Xing wrote:
> >>> +/* Used to track the tskey for bpf extension
> >>> + *
> >>> + * @sk_tskey: bpf extension can use it only when no application uses.
> >>> + *            Application can use it directly regardless of bpf extension.
> >>> + *
> >>> + * There are three strategies:
> >>> + * 1) If we've already set through setsockopt() and here we're going to set
> >>> + *    OPT_ID for bpf use, we will not re-initialize the @sk_tskey and will
> >>> + *    keep the record of delta between the current "key" and previous key.
> >>> + * 2) If we've already set through bpf_setsockopt() and here we're going to
> >>> + *    set for application use, we will record the delta first and then
> >>> + *    override/initialize the @sk_tskey.
> >>> + * 3) other cases, which means only either of them takes effect, so initialize
> >>> + *    everything simply.
> >>> + */
> >>> +static long int sock_calculate_tskey_offset(struct sock *sk, int val, int bpf_type)
> >>> +{
> >>> +     u32 tskey;
> >>> +
> >>> +     if (sk_is_tcp(sk)) {
> >>> +             if ((1 << sk->sk_state) & (TCPF_CLOSE | TCPF_LISTEN))
> >>> +                     return -EINVAL;
> >>> +
> >>> +             if (val & SOF_TIMESTAMPING_OPT_ID_TCP)
> >>> +                     tskey = tcp_sk(sk)->write_seq;
> >>> +             else
> >>> +                     tskey = tcp_sk(sk)->snd_una;
> >>> +     } else {
> >>> +             tskey = 0;
> >>> +     }
> >>> +
> >>> +     if (bpf_type && (sk->sk_tsflags & SOF_TIMESTAMPING_OPT_ID)) {
> >>> +             sk->sk_tskey_bpf_offset = tskey - atomic_read(&sk->sk_tskey);
> >>> +             return 0;
> >>> +     } else if (!bpf_type && (sk->sk_tsflags_bpf & SOF_TIMESTAMPING_OPT_ID)) {
> >>> +             sk->sk_tskey_bpf_offset = atomic_read(&sk->sk_tskey) - tskey;
> >>> +     } else {
> >>> +             sk->sk_tskey_bpf_offset = 0;
> >>> +     }
> >>> +
> >>> +     return tskey;
> >>> +}
> >>
> >> Before diving into this route, the bpf prog can peek into the tcp seq no in the
> >> skb. It can also look at the sk->sk_tskey for UDP socket. Can you explain why
> >> those are not enough information for the bpf prog?
> >
> > Well, it does make sense. It seems we don't need to implement tskey
> > for this bpf feature...
> >
> > Due to lack of enough knowledge of bpf, could you provide more hints
> > that I can follow to write a bpf program to print more information
> > from the skb? Like in the last patch of this series, in
> > tools/testing/selftests/bpf/prog_tests/so_timestamping.c, do we have a
> > feasible way to do that?
>
> The bpf-prog@sendmsg() will be run to capture a timestamp for sendmsg().
> When running the bpf-prog@sendmsg(), the skb can be set to the "struct
> bpf_sock_ops_kern sock_ops;" which is passed to the sockops prog. Take a look at
> bpf_skops_write_hdr_opt().

Thanks. I see the skb field in struct bpf_sock_ops_kern.

>
> bpf prog cannot directly access the skops->skb now. It is because the sockops
> prog sees the uapi "struct bpf_sock_ops" instead of "struct
> bpf_sock_ops(_kern)". The conversion is done in sock_ops_convert_ctx_access. It
> is an old way before BTF. I don't want to extend the uapi "struct bpf_sock_ops".

Oh, so it seems we cannot go this way, right?

I also noticed a use case that allows users to get the information from one skb:
"int BPF_PROG(trace_netif_receive_skb, struct sk_buff *skb)" in
tools/testing/selftests/bpf/progs/netif_receive_skb.c
But it requires us to add a tracepoint in __skb_tstamp_tx() first.
Two months ago, I was planning to use a tracepoint for people who
find it difficult to deploy bpf.

>
> Instead, use bpf_cast_to_kern_ctx((struct bpf_sock_ops *)skops_ctx) to get a
> trusted "struct bpf_sock_ops(_kern) *skops" pointer. Then it can access the
> skops->skb.

Let me spend some time on it. Thanks.

> afaik, the tcb->seq should be available already during sendmsg. it
> should be able to get it from TCP_SKB_CB(skb)->seq with the bpf_core_cast. Take
> a look at the existing examples of bpf_core_cast.
>
> The same goes for the skb->data. It can use the bpf_dynptr_from_skb(). It is not
> available to skops program now but should be easy to expose.

I wonder what the use of skb->data is here.

>
> The bpf prog wants to calculate the delay between [sendmsg, SCHED], [SCHED,
> SND], [SND, ACK]. It is why (at least in my mental model) a key is needed to
> co-relate the sendmsg, SCHED, SND, and ACK timestamp. The tcp seqno could be
> served as that key.
>
> All that said, while looking at tcp_tx_timestamp() again, there is always
> "shinfo->tskey = TCP_SKB_CB(skb)->seq + skb->len - 1;". shinfo->tskey can be
> used directly as-is by the bpf prog. I think now I am missing why the bpf prog
> needs the sk_tskey in the sk?

As you said, the tcp seqno could be treated as the key, but it leaks
TCP-layer information to users. Please see the commit:
commit 4ed2d765dfaccff5ebdac68e2064b59125033a3b
Author: Willem de Bruijn <willemb@google.com>
Date:   Mon Aug 4 22:11:49 2014 -0400

    net-timestamp: TCP timestamping
...
    - To avoid leaking the absolute seqno to userspace, the offset
    returned in ee_data must always be relative. It is an offset between
    an skb and sk field.

It has to be computed in the kernel before reporting to the user space, I think.
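
To make the "relative offset" point concrete, here is a minimal userspace
sketch. It assumes the value reported to user space is the skb key minus a
per-socket base key. The helper names model_tskey and relative_tskey are
hypothetical and are not kernel functions.

```c
#include <assert.h>
#include <stdint.h>

/* Models the expression quoted from tcp_tx_timestamp(): the key is the
 * sequence number of the last byte carried by the skb. uint32_t
 * arithmetic wraps modulo 2^32, like TCP sequence space.
 */
static uint32_t model_tskey(uint32_t seq, uint32_t len)
{
	assert(len > 0);	/* an empty skb has no last byte */
	return seq + len - 1;
}

/* Hypothetical helper: report the skb key relative to the socket's base
 * key, so the absolute seqno is not exposed. Unsigned subtraction stays
 * correct across the 2^32 wraparound.
 */
static uint32_t relative_tskey(uint32_t skb_tskey, uint32_t sk_tskey)
{
	return skb_tskey - sk_tskey;
}
```

For example, with a base key of 1000 and a 100-byte skb starting at seq 1000,
the absolute key is 1099 while the relative value is 99.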

>
> In the bpf prog, when the SCHED/SND/ACK timestamp comes back, it has to find the
> earlier sendmsg timestamp. One option is to store the earlier sendmsg timestamp
> at the bpf map key-ed by seqno or the shinfo's tskey. Storing in a bpf map
> key-ed by seqno/tskey is probably what the selftest should do. In the future, we
> can consider allowing the rbtree in the bpf sk local storage for searching
> seqno. There is shinfo's hwtstamp that can be used also if there is a need.

Thanks for the information! Let me investigate how the bpf map works...

I wonder whether, for the selftests, it could be much simpler to just
record each timestamp in one of three variables and calculate the delays
at the end, instead of using a bpf map, since we only send one small
packet. I mean, bpf map is really good as far as I know, but I'm a bit
worried that implementing such a function could cause extra work
(implementation and review).


* Re: [PATCH net-next v3 10/14] net-timestamp: add basic support with tskey offset
  2024-10-31  2:41         ` Jason Xing
@ 2024-10-31  3:27           ` Jason Xing
  2024-10-31  5:52           ` Martin KaFai Lau
  2024-10-31 23:50           ` Martin KaFai Lau
  2 siblings, 0 replies; 88+ messages in thread
From: Jason Xing @ 2024-10-31  3:27 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: davem, edumazet, kuba, pabeni, dsahern, willemdebruijn.kernel,
	willemb, ast, daniel, andrii, eddyz87, song, yonghong.song,
	john.fastabend, kpsingh, sdf, haoluo, jolsa, shuah, ykolal, bpf,
	netdev, Jason Xing

On Thu, Oct 31, 2024 at 10:41 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
>
> On Thu, Oct 31, 2024 at 9:17 AM Martin KaFai Lau <martin.lau@linux.dev> wrote:
> >
> > On 10/29/24 11:50 PM, Jason Xing wrote:
> > > On Wed, Oct 30, 2024 at 1:42 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
> > >>
> > >> On 10/28/24 4:05 AM, Jason Xing wrote:
> > >>> +/* Used to track the tskey for bpf extension
> > >>> + *
> > >>> + * @sk_tskey: bpf extension can use it only when no application uses.
> > >>> + *            Application can use it directly regardless of bpf extension.
> > >>> + *
> > >>> + * There are three strategies:
> > >>> + * 1) If we've already set through setsockopt() and here we're going to set
> > >>> + *    OPT_ID for bpf use, we will not re-initialize the @sk_tskey and will
> > >>> + *    keep the record of delta between the current "key" and previous key.
> > >>> + * 2) If we've already set through bpf_setsockopt() and here we're going to
> > >>> + *    set for application use, we will record the delta first and then
> > >>> + *    override/initialize the @sk_tskey.
> > >>> + * 3) other cases, which means only either of them takes effect, so initialize
> > >>> + *    everything simply.
> > >>> + */
> > >>> +static long int sock_calculate_tskey_offset(struct sock *sk, int val, int bpf_type)
> > >>> +{
> > >>> +     u32 tskey;
> > >>> +
> > >>> +     if (sk_is_tcp(sk)) {
> > >>> +             if ((1 << sk->sk_state) & (TCPF_CLOSE | TCPF_LISTEN))
> > >>> +                     return -EINVAL;
> > >>> +
> > >>> +             if (val & SOF_TIMESTAMPING_OPT_ID_TCP)
> > >>> +                     tskey = tcp_sk(sk)->write_seq;
> > >>> +             else
> > >>> +                     tskey = tcp_sk(sk)->snd_una;
> > >>> +     } else {
> > >>> +             tskey = 0;
> > >>> +     }
> > >>> +
> > >>> +     if (bpf_type && (sk->sk_tsflags & SOF_TIMESTAMPING_OPT_ID)) {
> > >>> +             sk->sk_tskey_bpf_offset = tskey - atomic_read(&sk->sk_tskey);
> > >>> +             return 0;
> > >>> +     } else if (!bpf_type && (sk->sk_tsflags_bpf & SOF_TIMESTAMPING_OPT_ID)) {
> > >>> +             sk->sk_tskey_bpf_offset = atomic_read(&sk->sk_tskey) - tskey;
> > >>> +     } else {
> > >>> +             sk->sk_tskey_bpf_offset = 0;
> > >>> +     }
> > >>> +
> > >>> +     return tskey;
> > >>> +}
> > >>
> > >> Before diving into this route, the bpf prog can peek into the tcp seq no in the
> > >> skb. It can also look at the sk->sk_tskey for UDP socket. Can you explain why
> > >> those are not enough information for the bpf prog?
> > >
> > > Well, it does make sense. It seems we don't need to implement tskey
> > > for this bpf feature...
> > >
> > > Due to lack of enough knowledge of bpf, could you provide more hints
> > > that I can follow to write a bpf program to print more information
> > > from the skb? Like in the last patch of this series, in
> > > tools/testing/selftests/bpf/prog_tests/so_timestamping.c, do we have a
> > > feasible way to do that?
> >
> > The bpf-prog@sendmsg() will be run to capture a timestamp for sendmsg().
> > When running the bpf-prog@sendmsg(), the skb can be set to the "struct
> > bpf_sock_ops_kern sock_ops;" which is passed to the sockops prog. Take a look at
> > bpf_skops_write_hdr_opt().
>
> Thanks. I see the skb field in struct bpf_sock_ops_kern.
>
> >
> > bpf prog cannot directly access the skops->skb now. It is because the sockops
> > prog sees the uapi "struct bpf_sock_ops" instead of "struct
> > bpf_sock_ops(_kern)". The conversion is done in sock_ops_convert_ctx_access. It
> > is an old way before BTF. I don't want to extend the uapi "struct bpf_sock_ops".
>
> Oh, so it seems we cannot use this way, right?
>
> I also noticed a use case that allow users to get the information from one skb:
> "int BPF_PROG(trace_netif_receive_skb, struct sk_buff *skb)" in
> tools/testing/selftests/bpf/progs/netif_receive_skb.c
> But it requires us to add the tracepoint in __skb_tstamp_tx() first.
> Two months ago, I was planning to use a tracepoint for some people who
> find it difficult to deploy bpf.
>
> >
> > Instead, use bpf_cast_to_kern_ctx((struct bpf_sock_ops *)skops_ctx) to get a
> > trusted "struct bpf_sock_ops(_kern) *skops" pointer. Then it can access the
> > skops->skb.
>
> Let me spend some time on it. Thanks.
>
> > afaik, the tcb->seq should be available already during sendmsg. it
> > should be able to get it from TCP_SKB_CB(skb)->seq with the bpf_core_cast. Take
> > a look at the existing examples of bpf_core_cast.
> >
> > The same goes for the skb->data. It can use the bpf_dynptr_from_skb(). It is not
> > available to skops program now but should be easy to expose.
>
> I wonder what the use of skb->data is here.
>
> >
> > The bpf prog wants to calculate the delay between [sendmsg, SCHED], [SCHED,
> > SND], [SND, ACK]. It is why (at least in my mental model) a key is needed to
> > co-relate the sendmsg, SCHED, SND, and ACK timestamp. The tcp seqno could be
> > served as that key.
> >
> > All that said, while looking at tcp_tx_timestamp() again, there is always
> > "shinfo->tskey = TCP_SKB_CB(skb)->seq + skb->len - 1;". shinfo->tskey can be
> > used directly as-is by the bpf prog. I think now I am missing why the bpf prog
> > needs the sk_tskey in the sk?
>
> As you said, tcp seqno could be treated as the key, but it leaks the
> information in TCP layer to users. Please see the commit:
> commit 4ed2d765dfaccff5ebdac68e2064b59125033a3b
> Author: Willem de Bruijn <willemb@google.com>
> Date:   Mon Aug 4 22:11:49 2014 -0400
>
>     net-timestamp: TCP timestamping
> ...
>     - To avoid leaking the absolute seqno to userspace, the offset
>     returned in ee_data must always be relative. It is an offset between
>     an skb and sk field.
>
> It has to be computed in the kernel before reporting to the user space, I think.

Well, I'm thinking that since the BPF program can only be used by an
_admin_, there is no risk even if the raw seq is exported to the BPF
program.

Willem, I would like to know your opinions about this point (about
whether we can export the raw seqno or not). Thanks.

Thanks,
Jason


* Re: [PATCH net-next v3 10/14] net-timestamp: add basic support with tskey offset
  2024-10-31  2:41         ` Jason Xing
  2024-10-31  3:27           ` Jason Xing
@ 2024-10-31  5:52           ` Martin KaFai Lau
  2024-10-31  6:16             ` Jason Xing
  2024-10-31 23:50           ` Martin KaFai Lau
  2 siblings, 1 reply; 88+ messages in thread
From: Martin KaFai Lau @ 2024-10-31  5:52 UTC (permalink / raw)
  To: Jason Xing
  Cc: davem, edumazet, kuba, pabeni, dsahern, willemdebruijn.kernel,
	willemb, ast, daniel, andrii, eddyz87, song, yonghong.song,
	john.fastabend, kpsingh, sdf, haoluo, jolsa, shuah, ykolal, bpf,
	netdev, Jason Xing

On 10/30/24 7:41 PM, Jason Xing wrote:

>> All that said, while looking at tcp_tx_timestamp() again, there is always
>> "shinfo->tskey = TCP_SKB_CB(skb)->seq + skb->len - 1;". shinfo->tskey can be
>> used directly as-is by the bpf prog. I think now I am missing why the bpf prog
>> needs the sk_tskey in the sk?
> 
> As you said, tcp seqno could be treated as the key, but it leaks the
> information in TCP layer to users. Please see the commit:

I don't think it is a concern for bpf prog running in the kernel. The sockops 
bpf prog can already read the sk, the skb (which has seqno), and many others.

The bpf prog is not print-only logic. Using the bpf prog only to dump raw
data does not fully utilize its capabilities, e.g. data aggregation. The bpf
prog should aggregate the data first, which here means calculating the delay.
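
One common way to aggregate delays instead of dumping every sample is a
power-of-two histogram. The sketch below is a plain userspace model of that
idea, not code from this series. The names delay_hist, delay_bucket and
delay_hist_add, and the choice of 32 buckets, are assumptions made for
illustration.

```c
#include <assert.h>
#include <stdint.h>

#define DELAY_BUCKETS 32	/* assumed number of histogram buckets */

struct delay_hist {
	uint64_t bucket[DELAY_BUCKETS];
};

/* Bucket index is floor(log2(delay_us)), with 0 and 1 both landing in
 * bucket 0 and very large values clamped to the last bucket.
 */
static unsigned int delay_bucket(uint64_t delay_us)
{
	unsigned int b = 0;

	while (delay_us > 1 && b < DELAY_BUCKETS - 1) {
		delay_us >>= 1;
		b++;
	}
	return b;
}

/* Account one measured delay in the histogram. */
static void delay_hist_add(struct delay_hist *h, uint64_t delay_us)
{
	assert(h);
	h->bucket[delay_bucket(delay_us)]++;
}
```

With this shape, user space only has to read a handful of counters rather than
one record per packet.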



* Re: [PATCH net-next v3 10/14] net-timestamp: add basic support with tskey offset
  2024-10-31  5:52           ` Martin KaFai Lau
@ 2024-10-31  6:16             ` Jason Xing
  0 siblings, 0 replies; 88+ messages in thread
From: Jason Xing @ 2024-10-31  6:16 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: davem, edumazet, kuba, pabeni, dsahern, willemdebruijn.kernel,
	willemb, ast, daniel, andrii, eddyz87, song, yonghong.song,
	john.fastabend, kpsingh, sdf, haoluo, jolsa, shuah, ykolal, bpf,
	netdev, Jason Xing

On Thu, Oct 31, 2024 at 1:52 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 10/30/24 7:41 PM, Jason Xing wrote:
>
> >> All that said, while looking at tcp_tx_timestamp() again, there is always
> >> "shinfo->tskey = TCP_SKB_CB(skb)->seq + skb->len - 1;". shinfo->tskey can be
> >> used directly as-is by the bpf prog. I think now I am missing why the bpf prog
> >> needs the sk_tskey in the sk?
> >
> > As you said, tcp seqno could be treated as the key, but it leaks the
> > information in TCP layer to users. Please see the commit:
>
> I don't think it is a concern for bpf prog running in the kernel. The sockops
> bpf prog can already read the sk, the skb (which has seqno), and many others.
>
> The bpf prog is not a print-only logic. Only using bpf prog to do raw data
> dumping is not fully utilizing its capability, e.g. data aggregation. The bpf
> prog should aggregate the data first which is to calculate the delay here.

Agreed. I forgot BPF is only for admins, so it's a feasible solution. It
saves a lot of effort :)

It looks like things are getting simpler and simpler, and most of the
work could end up being handled by bpf itself. Good news!

Thanks,
Jason


* Re: [PATCH net-next v3 02/14] net-timestamp: allow two features to work parallelly
  2024-10-31  0:13                       ` Jason Xing
@ 2024-10-31  6:27                         ` Martin KaFai Lau
  2024-10-31  7:04                           ` Jason Xing
  0 siblings, 1 reply; 88+ messages in thread
From: Martin KaFai Lau @ 2024-10-31  6:27 UTC (permalink / raw)
  To: Jason Xing, Willem de Bruijn
  Cc: willemb, davem, edumazet, kuba, pabeni, dsahern, ast, daniel,
	andrii, eddyz87, song, yonghong.song, john.fastabend, kpsingh,
	sdf, haoluo, jolsa, shuah, ykolal, bpf, netdev, Jason Xing

On 10/30/24 5:13 PM, Jason Xing wrote:
> I realized that we will have some new sock_opt flags like
> TS_SCHED_OPT_CB in patch 4, so we can control whether to print or
> not... For each sock_opt point, they will be called without caring if
> related flags in skb are set. Well, it's meaningless to add more
> control of skb tsflags at each TS_xx_OPT_CB point.
> 
> Am I understanding in a correct way? Now, I'm totally fine with this great idea!
Yes, I think so.

The sockops prog can choose to ignore any BPF_SOCK_OPS_TS_*_CB. There are only 3:
SCHED, SND, and ACK. If the hwtstamp is available from a NIC, I think it would
be quite wasteful to throw it away. ACK can be controlled by the
TCP_SKB_CB(skb)->bpf_txstamp_ack.

Going back to my earlier bpf_setsockopt(SOL_SOCKET, BPF_TX_TIMESTAMPING)
comment. I think it may as well go back to using the "u8
bpf_sock_ops_cb_flags;" and the bpf_sock_ops_cb_flags_set() helper to
enable/disable the timestamp-related callback hook. Maybe add one
BPF_SOCK_OPS_TX_TIMESTAMPING_CB_FLAG.

For tx, one new hook should be at sendmsg, around
tcp_tx_timestamp (?) for tcp. Another hook is __skb_tstamp_tx(), which should be
similar to your patch. Add a new kfunc to set shinfo->tx_flags |= SKBTX_BPF
and/or TCP_SKB_CB(skb)->bpf_txstamp_ack during sendmsg.

For rx, add one BPF_SOCK_OPS_RX_TIMESTAMPING_CB_FLAG. bpf_sock_ops_cb_flags
needs to move from the tcp_sock to the sock because it will be used by UDP also.
When enabling or disabling this flag, it needs to take care of
net_{enable,disable}_timestamp. The same goes for __sk_destruct().
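
A small userspace model may help picture the flag-gated hooks proposed here. It
assumes a per-socket flag byte, a hook that calls the program only when the TX
flag is enabled, and a program that ignores the callbacks it does not care
about. Everything prefixed with model_ or MODEL_ is hypothetical, including the
bit values. The setter is only loosely modelled on the documented behaviour of
bpf_sock_ops_cb_flags_set(), which returns the bits it could not set.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bit values, chosen only for this sketch. */
#define MODEL_TX_TIMESTAMPING_CB_FLAG	(1U << 0)
#define MODEL_RX_TIMESTAMPING_CB_FLAG	(1U << 1)
#define MODEL_ALL_CB_FLAGS \
	(MODEL_TX_TIMESTAMPING_CB_FLAG | MODEL_RX_TIMESTAMPING_CB_FLAG)

/* Stand-ins for the three TX callbacks: SCHED, SND and ACK. */
enum model_op {
	MODEL_TS_SCHED_CB,
	MODEL_TS_SND_CB,
	MODEL_TS_ACK_CB,
	MODEL_TS_OP_MAX
};

struct model_sock {
	uint8_t cb_flags;			/* stands in for bpf_sock_ops_cb_flags */
	unsigned int seen[MODEL_TS_OP_MAX];	/* callbacks the prog acted on */
};

/* Store the supported bits and return the ones that could not be set. */
static unsigned int model_cb_flags_set(struct model_sock *sk, unsigned int val)
{
	sk->cb_flags = (uint8_t)(val & MODEL_ALL_CB_FLAGS);
	return val & ~MODEL_ALL_CB_FLAGS;
}

/* The "program": it acts on SND and ACK and chooses to ignore SCHED. */
static void model_prog(struct model_sock *sk, enum model_op op)
{
	assert(op < MODEL_TS_OP_MAX);
	switch (op) {
	case MODEL_TS_SND_CB:
	case MODEL_TS_ACK_CB:
		sk->seen[op]++;
		break;
	default:
		break;
	}
}

/* Hook site: no per-timestamp-type branching, just one flag test. */
static void model_tx_hook(struct model_sock *sk, enum model_op op)
{
	if (!(sk->cb_flags & MODEL_TX_TIMESTAMPING_CB_FLAG))
		return;
	model_prog(sk, op);
}
```

The design choice this illustrates is that the kernel side stays simple (a
single flag test per hook) while the filtering policy lives in the program.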


* Re: [PATCH net-next v3 02/14] net-timestamp: allow two features to work parallelly
  2024-10-31  6:27                         ` Martin KaFai Lau
@ 2024-10-31  7:04                           ` Jason Xing
  2024-10-31 12:30                             ` Willem de Bruijn
  0 siblings, 1 reply; 88+ messages in thread
From: Jason Xing @ 2024-10-31  7:04 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: Willem de Bruijn, willemb, davem, edumazet, kuba, pabeni, dsahern,
	ast, daniel, andrii, eddyz87, song, yonghong.song, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, shuah, ykolal, bpf, netdev,
	Jason Xing

On Thu, Oct 31, 2024 at 2:27 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 10/30/24 5:13 PM, Jason Xing wrote:
> > I realized that we will have some new sock_opt flags like
> > TS_SCHED_OPT_CB in patch 4, so we can control whether to print or
> > not... For each sock_opt point, they will be called without caring if
> > related flags in skb are set. Well, it's meaningless to add more
> > control of skb tsflags at each TS_xx_OPT_CB point.
> >
> > Am I understanding in a correct way? Now, I'm totally fine with this great idea!
> Yes, I think so.
>
> The sockops prog can choose to ignore any BPF_SOCK_OPS_TS_*_CB. The are only 3:
> SCHED, SND, and ACK. If the hwtstamp is available from a NIC, I think it would
> be quite wasteful to throw it away. ACK can be controlled by the
> TCP_SKB_CB(skb)->bpf_txstamp_ack.

Right, let me try this:)

> Going back to my earlier bpf_setsockopt(SOL_SOCKET, BPF_TX_TIMESTAMPING)
> comment. I think it may as well go back to use the "u8
> bpf_sock_ops_cb_flags;" and use the bpf_sock_ops_cb_flags_set() helper to
> enable/disable the timestamp related callback hook. May be add one
> BPF_SOCK_OPS_TX_TIMESTAMPING_CB_FLAG.

The bpf_sock_ops_cb_flags field is only used for TCP, right?
If so, it is not suitable for UDP.

I'm thinking of this solution:
1) adding a new SOF_TIMESTAMPING_OPT_BPF flag (in
include/uapi/linux/net_tstamp.h) which can be used by sk->sk_tsflags
2) flags = SOF_TIMESTAMPING_OPT_BPF; bpf_setsockopt(skops,
SOL_SOCKET, SO_TIMESTAMPING, &flags, sizeof(flags));
3) test if sk->sk_tsflags has this new flag in tcp_tx_timestamp() or
in udp_sendmsg()
...

>
> For tx, one new hook should be at the sendmsg and should be around
> tcp_tx_timestamp (?) for tcp. Another hook is __skb_tstamp_tx() which should be

I think there are two points we're supposed to record:
1) the moment tcp/udp_sendmsg() is triggered. It represents the syscall time.
2) another point in tcp_tx_timestamp(). It represents the timestamp of
the last skb in this sendmsg() call.
Users may happen to send a big packet.

> similar to your patch. Add a new kfunc to set shinfo->tx_flags |= SKBTX_BPF
> and/or TCP_SKB_CB(skb)->bpf_txstamp_ack during sendmsg.

Got it.

>
>
> For rx, add one BPF_SOCK_OPS_RX_TIMESTAMPING_CB_FLAG. bpf_sock_ops_cb_flags
> needs to move from the tcp_sock to the sock because it will be used by UDP also.
> When enabling or disabling this flag, it needs to take care of the
> net_{enable,disable}_timestamp. The same for the __sk_destruct() also.
>

If the solution I proposed above is feasible, then we don't have to
move bpf_sock_ops_cb_flags out of tcp_sock, which would bring extra work :)

Thanks,
Jason

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH net-next v3 02/14] net-timestamp: allow two features to work parallelly
  2024-10-31  7:04                           ` Jason Xing
@ 2024-10-31 12:30                             ` Willem de Bruijn
  2024-10-31 13:50                               ` Jason Xing
  0 siblings, 1 reply; 88+ messages in thread
From: Willem de Bruijn @ 2024-10-31 12:30 UTC (permalink / raw)
  To: Jason Xing, Martin KaFai Lau
  Cc: Willem de Bruijn, willemb, davem, edumazet, kuba, pabeni, dsahern,
	ast, daniel, andrii, eddyz87, song, yonghong.song, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, shuah, ykolal, bpf, netdev,
	Jason Xing

Jason Xing wrote:
> On Thu, Oct 31, 2024 at 2:27 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
> >
> > On 10/30/24 5:13 PM, Jason Xing wrote:
> > > I realized that we will have some new sock_opt flags like
> > > TS_SCHED_OPT_CB in patch 4, so we can control whether to print or
> > > not... For each sock_opt point, they will be called without caring if
> > > related flags in skb are set. Well, it's meaningless to add more
> > > control of skb tsflags at each TS_xx_OPT_CB point.
> > >
> > > Am I understanding in a correct way? Now, I'm totally fine with this great idea!
> > Yes, I think so.
> >
> > The sockops prog can choose to ignore any BPF_SOCK_OPS_TS_*_CB. The are only 3:
> > SCHED, SND, and ACK. If the hwtstamp is available from a NIC, I think it would
> > be quite wasteful to throw it away. ACK can be controlled by the
> > TCP_SKB_CB(skb)->bpf_txstamp_ack.
> 
> Right, let me try this:)
> 
> > Going back to my earlier bpf_setsockopt(SOL_SOCKET, BPF_TX_TIMESTAMPING)
> > comment. I think it may as well go back to use the "u8
> > bpf_sock_ops_cb_flags;" and use the bpf_sock_ops_cb_flags_set() helper to
> > enable/disable the timestamp related callback hook. May be add one
> > BPF_SOCK_OPS_TX_TIMESTAMPING_CB_FLAG.
> 
> bpf_sock_ops_cb_flags this flag is only used in TCP condition, right?
> If that is so, it cannot be suitable for UDP.
> 
> I'm thinking of this solution:
> 1) adding a new flag in SOF_TIMESTAMPING_OPT_BPF flag (in
> include/uapi/linux/net_tstamp.h) which can be used by sk->sk_tsflags
> 2) flags =   SOF_TIMESTAMPING_OPT_BPF;    bpf_setsockopt(skops,
> SOL_SOCKET, SO_TIMESTAMPING, &flags, sizeof(flags));
> 3) test if sk->sk_tsflags has this new flag in tcp_tx_timestamp() or
> in udp_sendmsg()
> ...
> 
> >
> > For tx, one new hook should be at the sendmsg and should be around
> > tcp_tx_timestamp (?) for tcp. Another hook is __skb_tstamp_tx() which should be
> 
> I think there are two points we're supposed to record:
> 1) the moment tcp/udp_sendmsg() is triggered. It represents the syscall time.
> 2) another point in tcp_tx_timestamp(). It represents the timestamp of
> the last skb in this sendmsg() call.
> Users may happen to send a big packet.

Err on the side of fewer measurement points. It's always possible to
add more later, but not possible to remove them (depending on whether
BPF infra is ABI).

Overall great suggestion. Thanks a lot for sharing your BPF expertise
on this, Martin.

On using the raw seqno: this data is accessible to anyone who is root in
the namespace (ns_capable) using packet sockets, so as long as this does
not open it up to more than that, it is logically equivalent to the
current setting.

With seqno the BPF program has to be careful that the same seqno can
be retransmitted, so for instance seeing an ACK before a (second) SND
must be anticipated. That is true for SO_TIMESTAMPING today too.
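
To illustrate the retransmit caveat, here is a minimal user-space sketch,
not kernel or BPF code. The struct seq_track and its helpers are
hypothetical names made up for this example. It keeps the first SND
timestamp for a seqno, counts later SNDs as retransmits, and accepts an
ACK that arrives before the second SND.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-seqno tracking state; not an existing kernel struct. */
struct seq_track {
	uint32_t seqno;
	uint32_t snd_count;	/* 1 = original send, >1 = retransmits seen */
	uint64_t first_snd_ns;
	uint64_t ack_ns;
	bool acked;
};

static void seq_track_init(struct seq_track *t, uint32_t seqno)
{
	t->seqno = seqno;
	t->snd_count = 0;
	t->first_snd_ns = 0;
	t->ack_ns = 0;
	t->acked = false;
}

/* SND may fire more than once for the same seqno; keep only the first. */
static void seq_track_on_snd(struct seq_track *t, uint64_t now_ns)
{
	if (t->snd_count == 0)
		t->first_snd_ns = now_ns;
	t->snd_count++;
}

/* ACK may arrive before a later (retransmit) SND; record it once. */
static void seq_track_on_ack(struct seq_track *t, uint64_t now_ns)
{
	if (!t->acked) {
		t->ack_ns = now_ns;
		t->acked = true;
	}
}

/* Returns true and fills *delta_ns only when a sane SND->ACK pair exists. */
static bool seq_track_snd_to_ack(const struct seq_track *t, uint64_t *delta_ns)
{
	if (!t->acked || t->snd_count == 0)
		return false;
	if (t->ack_ns < t->first_snd_ns)
		return false;	/* ACK was seen before any SND we recorded */
	*delta_ns = t->ack_ns - t->first_snd_ns;
	return true;
}
```

With the event order SND, ACK, SND the computed delta still refers to the
first SND, and snd_count tells the caller that a retransmit happened.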

For datagrams (UDP as well as RAW and many non IP protocols), an
alternative still needs to be found.

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH net-next v3 02/14] net-timestamp: allow two features to work parallelly
  2024-10-31 12:30                             ` Willem de Bruijn
@ 2024-10-31 13:50                               ` Jason Xing
  2024-10-31 23:26                                 ` Martin KaFai Lau
  0 siblings, 1 reply; 88+ messages in thread
From: Jason Xing @ 2024-10-31 13:50 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Martin KaFai Lau, willemb, davem, edumazet, kuba, pabeni, dsahern,
	ast, daniel, andrii, eddyz87, song, yonghong.song, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, shuah, ykolal, bpf, netdev,
	Jason Xing

On Thu, Oct 31, 2024 at 8:30 PM Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
>
> Jason Xing wrote:
> > On Thu, Oct 31, 2024 at 2:27 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
> > >
> > > On 10/30/24 5:13 PM, Jason Xing wrote:
> > > > I realized that we will have some new sock_opt flags like
> > > > TS_SCHED_OPT_CB in patch 4, so we can control whether to print or
> > > > not... For each sock_opt point, they will be called without caring if
> > > > related flags in skb are set. Well, it's meaningless to add more
> > > > control of skb tsflags at each TS_xx_OPT_CB point.
> > > >
> > > > Am I understanding in a correct way? Now, I'm totally fine with this great idea!
> > > Yes, I think so.
> > >
> > > The sockops prog can choose to ignore any BPF_SOCK_OPS_TS_*_CB. The are only 3:
> > > SCHED, SND, and ACK. If the hwtstamp is available from a NIC, I think it would
> > > be quite wasteful to throw it away. ACK can be controlled by the
> > > TCP_SKB_CB(skb)->bpf_txstamp_ack.
> >
> > Right, let me try this:)
> >
> > > Going back to my earlier bpf_setsockopt(SOL_SOCKET, BPF_TX_TIMESTAMPING)
> > > comment. I think it may as well go back to use the "u8
> > > bpf_sock_ops_cb_flags;" and use the bpf_sock_ops_cb_flags_set() helper to
> > > enable/disable the timestamp related callback hook. May be add one
> > > BPF_SOCK_OPS_TX_TIMESTAMPING_CB_FLAG.
> >
> > bpf_sock_ops_cb_flags this flag is only used in TCP condition, right?
> > If that is so, it cannot be suitable for UDP.
> >
> > I'm thinking of this solution:
> > 1) adding a new flag in SOF_TIMESTAMPING_OPT_BPF flag (in
> > include/uapi/linux/net_tstamp.h) which can be used by sk->sk_tsflags
> > 2) flags =   SOF_TIMESTAMPING_OPT_BPF;    bpf_setsockopt(skops,
> > SOL_SOCKET, SO_TIMESTAMPING, &flags, sizeof(flags));
> > 3) test if sk->sk_tsflags has this new flag in tcp_tx_timestamp() or
> > in udp_sendmsg()
> > ...
> >
> > >
> > > For tx, one new hook should be at the sendmsg and should be around
> > > tcp_tx_timestamp (?) for tcp. Another hook is __skb_tstamp_tx() which should be
> >
> > I think there are two points we're supposed to record:
> > 1) the moment tcp/udp_sendmsg() is triggered. It represents the syscall time.
> > 2) another point in tcp_tx_timestamp(). It represents the timestamp of
> > the last skb in this sendmsg() call.
> > Users may happen to send a big packet.
>
> Err on the side of fewer measurement points. It's always possible to
> add more later, but not possible to remove them (depending on whether
> BPF infra is ABI).
>
> Overall great suggestion. Thanks a lot for sharing your BPF expertise
> on this, Martin.
>
> On using the raw seqno: this data is accessible to anyone root in
> namespace (ns_capable) using packet sockets, so as long as it does not
> open to more than that, it is logically equivalent to the current
> setting.
>
> With seqno the BPF program has to be careful that the same seqno can
> be retransmitted, so for instance seeing an ACK before a (second) SND
> must be anticipated. That is true for SO_TIMESTAMPING today too.
>
> For datagrams (UDP as well as RAW and many non IP protocols), an
> alternative still needs to be found.

It seems that using the tskey for bpf extension is always correct and
easy to use.

Could we provide the tskey to users and then let users decide the
better way to identify the sendmsg call? We could keep the
traditional use of tskey. Without it, people need to figure out a
good way themselves and may find it difficult to use the bpf extension.

I will keep thinking of alternatives for UDP in the meantime.

Thanks,
Jason

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH net-next v3 02/14] net-timestamp: allow two features to work parallelly
  2024-10-31 13:50                               ` Jason Xing
@ 2024-10-31 23:26                                 ` Martin KaFai Lau
  2024-11-01  7:47                                   ` Jason Xing
  2024-11-01 13:32                                   ` Willem de Bruijn
  0 siblings, 2 replies; 88+ messages in thread
From: Martin KaFai Lau @ 2024-10-31 23:26 UTC (permalink / raw)
  To: Jason Xing, Willem de Bruijn
  Cc: willemb, davem, edumazet, kuba, pabeni, dsahern, ast, daniel,
	andrii, eddyz87, song, yonghong.song, john.fastabend, kpsingh,
	sdf, haoluo, jolsa, shuah, ykolal, bpf, netdev, Jason Xing

On 10/31/24 6:50 AM, Jason Xing wrote:
> On Thu, Oct 31, 2024 at 8:30 PM Willem de Bruijn
> <willemdebruijn.kernel@gmail.com> wrote:
>>
>> Jason Xing wrote:
>>> On Thu, Oct 31, 2024 at 2:27 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>>>>
>>>> On 10/30/24 5:13 PM, Jason Xing wrote:
>>>>> I realized that we will have some new sock_opt flags like
>>>>> TS_SCHED_OPT_CB in patch 4, so we can control whether to print or
>>>>> not... For each sock_opt point, they will be called without caring if
>>>>> related flags in skb are set. Well, it's meaningless to add more
>>>>> control of skb tsflags at each TS_xx_OPT_CB point.
>>>>>
>>>>> Am I understanding in a correct way? Now, I'm totally fine with this great idea!
>>>> Yes, I think so.
>>>>
>>>> The sockops prog can choose to ignore any BPF_SOCK_OPS_TS_*_CB. The are only 3:
>>>> SCHED, SND, and ACK. If the hwtstamp is available from a NIC, I think it would
>>>> be quite wasteful to throw it away. ACK can be controlled by the
>>>> TCP_SKB_CB(skb)->bpf_txstamp_ack.
>>>
>>> Right, let me try this:)
>>>
>>>> Going back to my earlier bpf_setsockopt(SOL_SOCKET, BPF_TX_TIMESTAMPING)
>>>> comment. I think it may as well go back to use the "u8
>>>> bpf_sock_ops_cb_flags;" and use the bpf_sock_ops_cb_flags_set() helper to
>>>> enable/disable the timestamp related callback hook. May be add one
>>>> BPF_SOCK_OPS_TX_TIMESTAMPING_CB_FLAG.
>>>
>>> bpf_sock_ops_cb_flags this flag is only used in TCP condition, right?
>>> If that is so, it cannot be suitable for UDP.
>>>
>>> I'm thinking of this solution:
>>> 1) adding a new flag in SOF_TIMESTAMPING_OPT_BPF flag (in
>>> include/uapi/linux/net_tstamp.h) which can be used by sk->sk_tsflags

Probably not in include/uapi/linux/net_tstamp.h. This flag can only be used by a
bpf prog (meaning it will not be used by a user space syscall). More below.

>>> 2) flags =   SOF_TIMESTAMPING_OPT_BPF;    bpf_setsockopt(skops,
>>> SOL_SOCKET, SO_TIMESTAMPING, &flags, sizeof(flags));
>>> 3) test if sk->sk_tsflags has this new flag in tcp_tx_timestamp() or
>>> in udp_sendmsg()
>>> ...

Not sure how much churn/auditing is needed to ensure that user space cannot
set/clear the SOF_TIMESTAMPING_OPT_BPF bit in sk->sk_tsflags. It may not be much.

Maybe it is cleaner to leave sk->sk_tsflags for user space only and have
a separate field in "struct sock" to track bpf specific needs. More like your
current sk_tsflags_bpf approach, but I was thinking of reusing the
bpf_sock_ops_cb_flags instead. e.g. "BPF_SOCK_OPS_TEST_FLAG(tcp_sk(sk),
BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG)" is used to check if it needs to call a bpf
prog to decide if it needs to add a tcp header option. Here we want to test if it
should call a bpf prog to make a decision on the tx timestamp of a skb.

The bpf_sock_ops_cb_flags can be moved from struct tcp_sock to struct sock. It 
is doable from the bpf side.

All that said, yes, it will add some TCP specific enum flags (e.g.
BPF_SOCK_OPS_RTO_CB_FLAG) to the struct sock which will not be used by
UDP/raw/...etc, so maybe keep your current sk_tsflags_bpf approach but rename
it to sk_bpf_cb_flags in "struct sock" so that it can be reused for other non
tstamp ops in the future? A u8 is probably enough.

This optname is used by the bpf prog only and is not usable by a user space syscall.
If you prefer to stay with bpf_setsockopt (which is fine), it needs a bpf
specific optname like the current TCP_BPF_SOCK_OPS_CB_FLAGS which currently sets
the tp->bpf_sock_ops_cb_flags. Maybe a new SK_BPF_CB_FLAGS optname for setting
the sk->sk_bpf_cb_flags, like bpf_setsockopt(skops_ctx, SOL_SOCKET,
SK_BPF_CB_FLAGS, &val, sizeof(val)), handled in sol_socket_sockopt()
alone without calling into sk_{set,get}sockopt. Add a new enum for the optval
for the sk_bpf_cb_flags:

enum {
	SK_BPF_CB_TX_TIMESTAMPING = (1 << 0),
	SK_BPF_CB_RX_TIMESTAMPING = (1 << 1),
};
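
As a rough illustration of how such a u8 flag field could be set and
tested, here is a small user-space sketch. Everything prefixed with
"fake_" and the SK_BPF_CB_MASK macro are hypothetical stand-ins, not
existing kernel symbols, and the "reject unknown bits" policy is an
assumption rather than something decided in this thread.

```c
#include <stdbool.h>
#include <stdint.h>

/* Flag bits mirroring the enum proposed above. */
enum {
	SK_BPF_CB_TX_TIMESTAMPING = (1 << 0),
	SK_BPF_CB_RX_TIMESTAMPING = (1 << 1),
};

/* Hypothetical mask of all currently defined bits. */
#define SK_BPF_CB_MASK (SK_BPF_CB_TX_TIMESTAMPING | SK_BPF_CB_RX_TIMESTAMPING)

/* Hypothetical stand-in for the relevant part of struct sock. */
struct fake_sock {
	uint8_t sk_bpf_cb_flags;
};

/* Models a possible SK_BPF_CB_FLAGS set handler: reject unknown bits,
 * otherwise replace the whole flag set. Returns 0 or -1 (for -EINVAL). */
static int fake_set_cb_flags(struct fake_sock *sk, int val)
{
	if (val & ~SK_BPF_CB_MASK)
		return -1;
	sk->sk_bpf_cb_flags = (uint8_t)val;
	return 0;
}

/* Models the check a tx/rx path would do before calling the bpf prog. */
static bool fake_test_cb_flag(const struct fake_sock *sk, int flag)
{
	return (sk->sk_bpf_cb_flags & flag) != 0;
}
```

A tx path would then call something like fake_test_cb_flag(sk,
SK_BPF_CB_TX_TIMESTAMPING) before deciding whether to run the bpf prog.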

>>>
>>>>
>>>> For tx, one new hook should be at the sendmsg and should be around
>>>> tcp_tx_timestamp (?) for tcp. Another hook is __skb_tstamp_tx() which should be
>>>
>>> I think there are two points we're supposed to record:
>>> 1) the moment tcp/udp_sendmsg() is triggered. It represents the syscall time.
>>> 2) another point in tcp_tx_timestamp(). It represents the timestamp of
>>> the last skb in this sendmsg() call.
>>> Users may happen to send a big packet.

hmm... a big packet and sendmsg is blocked waiting for memory?

>>
>> Err on the side of fewer measurement points. It's always possible to
>> add more later, but not possible to remove them (depending on whether
>> BPF infra is ABI).

I also think it is better to start with tcp_tx_timestamp() alone first to keep 
the patch set simple now. The selftest prog can use a bpf fentry prog to trace 
the tcp_sendmsg_locked(). This can be revisited later if the bpf fentry prog is 
not enough.

>>
>> Overall great suggestion. Thanks a lot for sharing your BPF expertise
>> on this, Martin.

Thanks!

>>
>> On using the raw seqno: this data is accessible to anyone root in
>> namespace (ns_capable) using packet sockets, so as long as it does not
>> open to more than that, it is logically equivalent to the current
>> setting.
>>
>> With seqno the BPF program has to be careful that the same seqno can
>> be retransmitted, so for instance seeing an ACK before a (second) SND
>> must be anticipated. That is true for SO_TIMESTAMPING today too.

Ah. It will be a very useful comment to add to the selftests bpf prog.

>>
>> For datagrams (UDP as well as RAW and many non IP protocols), an
>> alternative still needs to be found.

In udp/raw/..., I don't know how likely it is that user space has "cork->tx_flags
& SKBTX_ANY_TSTAMP" set but has neither "READ_ONCE(sk->sk_tsflags) &
SOF_TIMESTAMPING_OPT_ID" nor "cork->flags & IPCORK_TS_OPT_ID" set. If it is
unlikely, maybe we can just disallow the bpf prog from directly setting
skb_shinfo(skb)->tskey for this particular skb.

For all other cases, in __ip[6]_append_data, directly call a bpf prog and also 
pass the kernel decided tskey to the bpf prog.

The kernel passed tskey could be 0 (meaning the user space has not used it). The 
bpf prog can give one for the kernel to use. The bpf prog can store the 
sk_tskey_bpf in the bpf_sk_storage now. Meaning no need to add one to the struct 
sock. The bpf prog does not have to start from 0 (e.g. start from U32_MAX 
instead) if it helps.

If the kernel passed tskey is not 0, the bpf prog can just use that one
(assuming the user space is doing something sane, like the value in
SCM_TS_OPT_ID won't be jumping back and forth between 0 and U32_MAX). I hope this
is also very unlikely (?) but the bpf prog can probably detect this and choose
to ignore this sk.
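
A minimal user-space sketch of that key selection, under the assumption
stated above that a kernel passed tskey of 0 means "user space is not
using it". The struct bpf_tskey_state stands in for a value kept in
bpf_sk_storage, and both it and pick_tskey() are hypothetical names.

```c
#include <stdint.h>

/* Hypothetical per-socket state, as it might live in bpf_sk_storage. */
struct bpf_tskey_state {
	uint32_t next_key;
};

static void bpf_tskey_state_init(struct bpf_tskey_state *st)
{
	/* Count down from U32_MAX to stay away from user space keys,
	 * which typically start low and count up. */
	st->next_key = UINT32_MAX;
}

/* Use the kernel decided key when there is one, else hand out our own.
 * Keys handed out here are always in [1, UINT32_MAX], never 0. */
static uint32_t pick_tskey(struct bpf_tskey_state *st, uint32_t kernel_tskey)
{
	uint32_t key;

	if (kernel_tskey != 0)
		return kernel_tskey;

	key = st->next_key;
	st->next_key = (key > 1) ? key - 1 : UINT32_MAX;
	return key;
}
```

Detecting the "insane" case where user space keys jump around is left
out of this sketch.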

To solve the above unsupported corner cases, I think we can allow the bpf prog 
to store something in the shinfo->hwtstamps at the tx path. The bpf-only key 
could be one of the things to store there. Change __ip[6]_append_data to handle 
the shinfo->hwtstamps. I think allowing the bpf prog to write to the 
shinfo->hwtstamps could be considered later when needed.

[ I may be off tomorrow, so reply could be slower. ]

> 
> It seems that using the tskey for bpf extension is always correct and
> easy to use.
> 
> Could we provide the tskey to users and then let users decide the
> better way to identify the call of sendmsg. We could keep the
> traditional use of tskey. If without it, people need to figure out a
> good way and may find it difficult to use the bpf extension.
> 
> I will keep thinking of alternatives for UDP in the meantime.

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH net-next v3 10/14] net-timestamp: add basic support with tskey offset
  2024-10-31  2:41         ` Jason Xing
  2024-10-31  3:27           ` Jason Xing
  2024-10-31  5:52           ` Martin KaFai Lau
@ 2024-10-31 23:50           ` Martin KaFai Lau
  2024-11-01  6:33             ` Jason Xing
  2 siblings, 1 reply; 88+ messages in thread
From: Martin KaFai Lau @ 2024-10-31 23:50 UTC (permalink / raw)
  To: Jason Xing
  Cc: davem, edumazet, kuba, pabeni, dsahern, willemdebruijn.kernel,
	willemb, ast, daniel, andrii, eddyz87, song, yonghong.song,
	john.fastabend, kpsingh, sdf, haoluo, jolsa, shuah, ykolal, bpf,
	netdev, Jason Xing

On 10/30/24 7:41 PM, Jason Xing wrote:
>> bpf prog cannot directly access the skops->skb now. It is because the sockops
>> prog sees the uapi "struct bpf_sock_ops" instead of "struct
>> bpf_sock_ops(_kern)". The conversion is done in sock_ops_convert_ctx_access. It
>> is an old way before BTF. I don't want to extend the uapi "struct bpf_sock_ops".
> 
> Oh, so it seems we cannot use this way, right?

No. don't extend the uapi "struct bpf_sock_ops". Use bpf_cast_to_kern_ctx() instead.

> 
> I also noticed a use case that allow users to get the information from one skb:
> "int BPF_PROG(trace_netif_receive_skb, struct sk_buff *skb)" in
> tools/testing/selftests/bpf/progs/netif_receive_skb.c
> But it requires us to add the tracepoint in __skb_tstamp_tx() first.
> Two months ago, I was planning to use a tracepoint for some people who
> find it difficult to deploy bpf.


It is a tracing prog instead of sockops prog. The verifier allows accessing 
different things based on the program type. This patch set is using the sockops 
bpf prog type which is not a tracing prog. Tracing can do a lot of read-only 
things but here we need write (e.g. bpf_setsockopt), so tracing prog is not 
suitable here.

> 
>>
>> Instead, use bpf_cast_to_kern_ctx((struct bpf_sock_ops *)skops_ctx) to get a
>> trusted "struct bpf_sock_ops(_kern) *skops" pointer. Then it can access the
>> skops->skb.
> 
> Let me spend some time on it. Thanks.

Take a look at the bpf_cast_to_kern_ctx() examples in selftests/bpf. I think 
this can be directly used to get to (struct bpf_sock_ops_kern *)skops->skb. Ping 
back if your selftest bpf prog cannot load.

> 
>> afaik, the tcb->seq should be available already during sendmsg. it
>> should be able to get it from TCP_SKB_CB(skb)->seq with the bpf_core_cast. Take
>> a look at the existing examples of bpf_core_cast.
>>
>> The same goes for the skb->data. It can use the bpf_dynptr_from_skb(). It is not
>> available to skops program now but should be easy to expose.
> I wonder what the use of skb->data is here.

You are right, not needed. I was thinking it may need to parse the tcp header 
from the skb at the rx timestamping. It is not needed. The tcp stack should have 
already parsed it and TCP_SKB_CB can be directly used as long as the sockops 
prog can get to the skops->skb.

>>
>> In the bpf prog, when the SCHED/SND/ACK timestamp comes back, it has to find the
>> earlier sendmsg timestamp. One option is to store the earlier sendmsg timestamp
>> at the bpf map key-ed by seqno or the shinfo's tskey. Storing in a bpf map
>> key-ed by seqno/tskey is probably what the selftest should do. In the future, we
>> can consider allowing the rbtree in the bpf sk local storage for searching
>> seqno. There is shinfo's hwtstamp that can be used also if there is a need.
> 
> Thanks for the information! Let me investigate how the bpf map works...
> 
> I wonder that for the selftests could it be much simpler if we just
> record each timestamp stored in three variables and calculate them at
> last since we only send the small packet once instead of using bpf
> map. I mean, bpf map is really good as far as I know, but I'm a bit
> worried that implementing such a function could cause more extra work
> (implementation and review).

Don't worry about the review side. imo, a selftest prog that is closer to the real
world actually helps the review process. It needs to test the tskey anyway and it
needs to store it somewhere. A bpf map is pretty simple to use. I don't think it
will be much different in terms of complexity either.
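
To give a feel for the bookkeeping involved, here is a user-space sketch
of a table keyed by tskey (or seqno) that stores the sendmsg timestamp
and computes the delta when a later SCHED/SND/ACK timestamp comes back.
It is a plain C model with hypothetical names (ts_map and friends), not
a real BPF map; a selftest would more likely use a BPF hash map.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define TS_MAP_SLOTS 64

struct ts_slot {
	bool used;
	uint32_t key;		/* tskey or seqno */
	uint64_t sendmsg_ns;	/* timestamp taken at sendmsg time */
};

struct ts_map {
	struct ts_slot slots[TS_MAP_SLOTS];
};

static void ts_map_init(struct ts_map *m)
{
	memset(m, 0, sizeof(*m));
}

/* Linear probing lookup; optionally claims the first free slot. */
static struct ts_slot *ts_map_find(struct ts_map *m, uint32_t key, bool create)
{
	for (uint32_t i = 0; i < TS_MAP_SLOTS; i++) {
		struct ts_slot *s = &m->slots[(key + i) % TS_MAP_SLOTS];

		if (s->used && s->key == key)
			return s;
		if (!s->used) {
			if (!create)
				return NULL;
			s->used = true;
			s->key = key;
			return s;
		}
	}
	return NULL;	/* table full or key absent */
}

/* Called at sendmsg time. Returns false if the table is full. */
static bool ts_map_store(struct ts_map *m, uint32_t key, uint64_t sendmsg_ns)
{
	struct ts_slot *s = ts_map_find(m, key, true);

	if (!s)
		return false;
	s->sendmsg_ns = sendmsg_ns;
	return true;
}

/* Called when a SCHED/SND/ACK timestamp arrives for the given key. */
static bool ts_map_delta(struct ts_map *m, uint32_t key, uint64_t now_ns,
			 uint64_t *delta_ns)
{
	struct ts_slot *s = ts_map_find(m, key, false);

	if (!s)
		return false;
	*delta_ns = now_ns - s->sendmsg_ns;
	return true;
}
```

Entries are never deleted here, which keeps the probing logic trivial; a
real prog would need to remove entries once the ACK has been seen.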

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH net-next v3 10/14] net-timestamp: add basic support with tskey offset
  2024-10-31 23:50           ` Martin KaFai Lau
@ 2024-11-01  6:33             ` Jason Xing
  0 siblings, 0 replies; 88+ messages in thread
From: Jason Xing @ 2024-11-01  6:33 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: davem, edumazet, kuba, pabeni, dsahern, willemdebruijn.kernel,
	willemb, ast, daniel, andrii, eddyz87, song, yonghong.song,
	john.fastabend, kpsingh, sdf, haoluo, jolsa, shuah, ykolal, bpf,
	netdev, Jason Xing

On Fri, Nov 1, 2024 at 7:50 AM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 10/30/24 7:41 PM, Jason Xing wrote:
> >> bpf prog cannot directly access the skops->skb now. It is because the sockops
> >> prog sees the uapi "struct bpf_sock_ops" instead of "struct
> >> bpf_sock_ops(_kern)". The conversion is done in sock_ops_convert_ctx_access. It
> >> is an old way before BTF. I don't want to extend the uapi "struct bpf_sock_ops".
> >
> > Oh, so it seems we cannot use this way, right?
>
> No. don't extend the uapi "struct bpf_sock_ops". Use bpf_cast_to_kern_ctx() instead.

Got it!

>
> >
> > I also noticed a use case that allow users to get the information from one skb:
> > "int BPF_PROG(trace_netif_receive_skb, struct sk_buff *skb)" in
> > tools/testing/selftests/bpf/progs/netif_receive_skb.c
> > But it requires us to add the tracepoint in __skb_tstamp_tx() first.
> > Two months ago, I was planning to use a tracepoint for some people who
> > find it difficult to deploy bpf.
>
>
> It is a tracing prog instead of sockops prog. The verifier allows accessing
> different things based on the program type. This patch set is using the sockops
> bpf prog type which is not a tracing prog. Tracing can do a lot of read-only
> things but here we need write (e.g. bpf_setsockopt), so tracing prog is not
> suitable here.

Thanks for the explanation.

>
> >
> >>
> >> Instead, use bpf_cast_to_kern_ctx((struct bpf_sock_ops *)skops_ctx) to get a
> >> trusted "struct bpf_sock_ops(_kern) *skops" pointer. Then it can access the
> >> skops->skb.
> >
> > Let me spend some time on it. Thanks.
>
> Take a look at the bpf_cast_to_kern_ctx() examples in selftests/bpf. I think
> this can be directly used to get to (struct bpf_sock_ops_kern *)skops->skb. Ping
> back if your selftest bpf prog cannot load.

No problem :)

>
> >
> >> afaik, the tcb->seq should be available already during sendmsg. it
> >> should be able to get it from TCP_SKB_CB(skb)->seq with the bpf_core_cast. Take
> >> a look at the existing examples of bpf_core_cast.
> >>
> >> The same goes for the skb->data. It can use the bpf_dynptr_from_skb(). It is not
> >> available to skops program now but should be easy to expose.
>  > I wonder what the use of skb->data is here.
>
> You are right, not needed. I was thinking it may need to parse the tcp header
> from the skb at the rx timestamping. It is not needed. The tcp stack should have
> already parsed it and TCP_SKB_CB can be directly used as long as the sockops
> prog can get to the skops->skb.

Agreed.

>
> >>
> >> In the bpf prog, when the SCHED/SND/ACK timestamp comes back, it has to find the
> >> earlier sendmsg timestamp. One option is to store the earlier sendmsg timestamp
> >> at the bpf map key-ed by seqno or the shinfo's tskey. Storing in a bpf map
> >> key-ed by seqno/tskey is probably what the selftest should do. In the future, we
> >> can consider allowing the rbtree in the bpf sk local storage for searching
> >> seqno. There is shinfo's hwtstamp that can be used also if there is a need.
> >
> > Thanks for the information! Let me investigate how the bpf map works...
> >
> > I wonder that for the selftests could it be much simpler if we just
> > record each timestamp stored in three variables and calculate them at
> > last since we only send the small packet once instead of using bpf
> > map. I mean, bpf map is really good as far as I know, but I'm a bit
> > worried that implementing such a function could cause more extra work
> > (implementation and review).
>
> Don't worry on the review side. imo, a closer to the real world selftest prog is
> actually helping the review process. It needs to test the tskey anyway and it
> needs to store somewhere. bpf map is pretty simple to use. I don't think it will
> have much different in term of complexity also.

Got it, will do it soon :)

Thanks,
Jason

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH net-next v3 02/14] net-timestamp: allow two features to work parallelly
  2024-10-31 23:26                                 ` Martin KaFai Lau
@ 2024-11-01  7:47                                   ` Jason Xing
  2024-11-05  1:50                                     ` Martin KaFai Lau
  2024-11-01 13:32                                   ` Willem de Bruijn
  1 sibling, 1 reply; 88+ messages in thread
From: Jason Xing @ 2024-11-01  7:47 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: Willem de Bruijn, willemb, davem, edumazet, kuba, pabeni, dsahern,
	ast, daniel, andrii, eddyz87, song, yonghong.song, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, shuah, ykolal, bpf, netdev,
	Jason Xing

On Fri, Nov 1, 2024 at 7:26 AM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 10/31/24 6:50 AM, Jason Xing wrote:
> > On Thu, Oct 31, 2024 at 8:30 PM Willem de Bruijn
> > <willemdebruijn.kernel@gmail.com> wrote:
> >>
> >> Jason Xing wrote:
> >>> On Thu, Oct 31, 2024 at 2:27 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
> >>>>
> >>>> On 10/30/24 5:13 PM, Jason Xing wrote:
> >>>>> I realized that we will have some new sock_opt flags like
> >>>>> TS_SCHED_OPT_CB in patch 4, so we can control whether to print or
> >>>>> not... For each sock_opt point, they will be called without caring if
> >>>>> related flags in skb are set. Well, it's meaningless to add more
> >>>>> control of skb tsflags at each TS_xx_OPT_CB point.
> >>>>>
> >>>>> Am I understanding in a correct way? Now, I'm totally fine with this great idea!
> >>>> Yes, I think so.
> >>>>
> >>>> The sockops prog can choose to ignore any BPF_SOCK_OPS_TS_*_CB. The are only 3:
> >>>> SCHED, SND, and ACK. If the hwtstamp is available from a NIC, I think it would
> >>>> be quite wasteful to throw it away. ACK can be controlled by the
> >>>> TCP_SKB_CB(skb)->bpf_txstamp_ack.
> >>>
> >>> Right, let me try this:)
> >>>
> >>>> Going back to my earlier bpf_setsockopt(SOL_SOCKET, BPF_TX_TIMESTAMPING)
> >>>> comment. I think it may as well go back to use the "u8
> >>>> bpf_sock_ops_cb_flags;" and use the bpf_sock_ops_cb_flags_set() helper to
> >>>> enable/disable the timestamp related callback hook. May be add one
> >>>> BPF_SOCK_OPS_TX_TIMESTAMPING_CB_FLAG.
> >>>
> >>> bpf_sock_ops_cb_flags this flag is only used in TCP condition, right?
> >>> If that is so, it cannot be suitable for UDP.
> >>>
> >>> I'm thinking of this solution:
> >>> 1) adding a new flag in SOF_TIMESTAMPING_OPT_BPF flag (in
> >>> include/uapi/linux/net_tstamp.h) which can be used by sk->sk_tsflags
>
> probably not in include/uapi/linux/net_tstamp.h. This flag can only be used by a
> bpf prog (meaning will not be used by user space syscall). More below.
>
> >>> 2) flags =   SOF_TIMESTAMPING_OPT_BPF;    bpf_setsockopt(skops,
> >>> SOL_SOCKET, SO_TIMESTAMPING, &flags, sizeof(flags));
> >>> 3) test if sk->sk_tsflags has this new flag in tcp_tx_timestamp() or
> >>> in udp_sendmsg()
> >>> ...
>
> Not sure how many churns/audits is needed to ensure the user space cannot
> set/clear the SOF_TIMESTAMPING_OPT_BPF bit in sk->sk_tsflags. Could be not much.
>
> May be it is cleaner to leave the sk->sk_tsflags for user space only and having
> a separate field in "struct sock" to track bpf specific needs. More like your
> current sk_tsflags_bpf approach but I was thinking to reuse the
> bpf_sock_ops_cb_flags instead. e.g. "BPF_SOCK_OPS_TEST_FLAG(tcp_sk(sk),
> BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG)" is used to check if it needs to call a bpf
> prog to decide if it needs to add tcp header option. Here we want to test if it
> should call a bpf prog to make a decision on tx timestamp on a skb.
>
> The bpf_sock_ops_cb_flags can be moved from struct tcp_sock to struct sock. It
> is doable from the bpf side.
>
> All that said, but, yes, it will add some TCP specific enum flag (e.g.
> BPF_SOCK_OPS_RTO_CB_FLAG) to the struct sock which will not be used by
> UDP/raw/...etc, so may be keep your current sk_tsflags_bpf approach but rename
> it to sk_bpf_cb_flags in struct "sock" so that it can be reused for other non
> tstamp ops in the future? probably a u8 is enough.

Thanks so much for the details.

>
> This optname is used by the bpf prog only and not usable by user space syscall.
> If it prefers to stay with bpf_setsockopt (which is fine), it needs a bpf
> specific optname like the current TCP_BPF_SOCK_OPS_CB_FLAGS which currently sets
> the tp->bpf_sock_ops_cb_flags. May be a new SK_BPF_CB_FLAGS optname for setting
> the sk->sk_bpf_cb_flags, like bpf_setsockopt(skops_ctx, SOL_SOCKET,

> SK_BPF_CB_FLAGS, &val, sizeof(val)) and handle it in the sol_socket_sockopt()
> alone without calling into sk_{set,get}sockopt. Add a new enum for the optval
> for the sk_bpf_cb_flags:
>
> enum {
>         SK_BPF_CB_TX_TIMESTAMPING = (1 << 0),
>         SK_BPF_CB_RX_TIEMSTAMPING = (1 << 1),
> };

That would require some awkward changes in sol_socket_sockopt() to
retrieve the option value, like what I did in V2 (see
https://lore.kernel.org/all/20241012040651.95616-3-kerneljasonxing@gmail.com/).
That is why this series does the set and get operations in
sk_{set,get}sockopt(): to stay aligned with the other flags.
Handling it in sk_{set,get}sockopt() is not a bad idea and is easy to
implement, I feel.

Overall the suggestion looks good to me. I can give it a try :)

I'm thinking of another approach: using bpf_sock_ops_cb_flags_set()
instead of bpf_setsockopt() when a sockops callback like
BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB is triggered. I could modify
bpf_sock_ops_cb_flags_set() like this:
diff --git a/net/core/filter.c b/net/core/filter.c
index 58761263176c..001140067c1a 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5770,14 +5770,25 @@ BPF_CALL_2(bpf_sock_ops_cb_flags_set, struct bpf_sock_ops_kern *, bpf_sock,
           int, argval)
 {
        struct sock *sk = bpf_sock->sk;
-       int val = argval & BPF_SOCK_OPS_ALL_CB_FLAGS;
+       int val = argval;

-       if (!IS_ENABLED(CONFIG_INET) || !sk_fullsock(sk))
+       if (!IS_ENABLED(CONFIG_INET))
                return -EINVAL;

-       tcp_sk(sk)->bpf_sock_ops_cb_flags = val;
+       if (sk_is_tcp(sk)) {
+               val = argval & BPF_SOCK_OPS_ALL_CB_FLAGS;
+               if (!sk_fullsock(sk))
+                       return -EINVAL;
+
+               tcp_sk(sk)->bpf_sock_ops_cb_flags = val;
+
+               val = argval & (~BPF_SOCK_OPS_ALL_CB_FLAGS);
+       } else {
+               sk->bpf_sock_ops_cb_flags = val;
+               val = argval & (~(SK_BPF_CB_TX_TIMESTAMPING|SK_BPF_CB_RX_TIEMSTAMPING));
+       }

-       return argval & (~BPF_SOCK_OPS_ALL_CB_FLAGS);
+       return val;
 }

The BPF program would call bpf_sock_ops_cb_flags_set(skops,
SK_BPF_CB_FLAGS) to set the flags. Then we can implement a helper
similar to BPF_SOCK_OPS_TEST_FLAG() and use it in tcp_tx_timestamp() to
check whether we are allowed to set shinfo->tx_flags |= SKBTX_BPF.

One advantage of this approach is that bpf_sock_ops_cb_flags_set()
could be extended to more than just TCP in the future. Admittedly,
this will involve more work.
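
To make the set/test idea concrete, here is a rough user-space model
of it. Everything named demo_* or DEMO_* is made up for illustration
and is not kernel code; the only behaviour borrowed from the existing
helper is that the setter returns the bits it did not accept.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins; none of these are the kernel's definitions. */
enum {
	DEMO_CB_TX_TIMESTAMPING = 1 << 0,
	DEMO_CB_RX_TIMESTAMPING = 1 << 1,
	DEMO_CB_MASK            = (1 << 2) - 1,
};

/* Models a socket carrying a small per-socket bpf callback flag field. */
struct demo_sock {
	uint8_t cb_flags;
};

/* Store only the known bits and return the bits that were rejected. */
static int demo_cb_flags_set(struct demo_sock *sk, int argval)
{
	assert(sk != NULL);
	sk->cb_flags = (uint8_t)(argval & DEMO_CB_MASK);
	return argval & ~DEMO_CB_MASK;
}

/* Counterpart of a BPF_SOCK_OPS_TEST_FLAG()-style check on the tx path. */
static bool demo_cb_flag_test(const struct demo_sock *sk, int flag)
{
	return (sk->cb_flags & flag) != 0;
}
```

A caller on the tx path would only mark the skb for bpf timestamping
when demo_cb_flag_test(sk, DEMO_CB_TX_TIMESTAMPING) is true.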

Which way would you prefer?

>
>
> >>>
> >>>>
> >>>> For tx, one new hook should be at the sendmsg and should be around
> >>>> tcp_tx_timestamp (?) for tcp. Another hook is __skb_tstamp_tx() which should be
> >>>
> >>> I think there are two points we're supposed to record:
> >>> 1) the moment tcp/udp_sendmsg() is triggered. It represents the syscall time.
> >>> 2) another point in tcp_tx_timestamp(). It represents the timestamp of
> >>> the last skb in this sendmsg() call.
> >>> Users may happen to send a big packet.
>
> hmm... a big packet and sendmsg is blocked waiting for memory?
>
> >>
> >> Err on the side of fewer measurement points. It's always possible to
> >> add more later, but not possible to remove them (depending on whether
> >> BPF infra is ABI).
>
> I also think it is better to start with tcp_tx_timestamp() alone first to keep
> the patch set simple now. The selftest prog can use a bpf fentry prog to trace
> the tcp_sendmsg_locked(). This can be revisited later if the bpf fentry prog is
> not enough.
>
> >>
> >> Overall great suggestion. Thanks a lot for sharing your BPF expertise
> >> on this, Martin.
>
> Thanks!
>
> >>
> >> On using the raw seqno: this data is accessible to anyone root in
> >> namespace (ns_capable) using packet sockets, so as long as it does not
> >> open to more than that, it is logically equivalent to the current
> >> setting.
> >>
> >> With seqno the BPF program has to be careful that the same seqno can
> >> be retransmitted, so for instance seeing an ACK before a (second) SND
> >> must be anticipated. That is true for SO_TIMESTAMPING today too.
>
> Ah. It will be a very useful comment to add to the selftests bpf prog.
>
> >>
> >> For datagrams (UDP as well as RAW and many non IP protocols), an
> >> alternative still needs to be found.
>
> In udp/raw/..., I don't know how likely is the user space having "cork->tx_flags
> & SKBTX_ANY_TSTAMP" set but has neither "READ_ONCE(sk->sk_tsflags) &
> SOF_TIMESTAMPING_OPT_ID" nor "cork->flags & IPCORK_TS_OPT_ID" set. If it is
> unlikely, may be we can just disallow bpf prog from directly setting
> skb_shinfo(skb)->tskey for this particular skb.
>
> For all other cases, in __ip[6]_append_data, directly call a bpf prog and also
> pass the kernel decided tskey to the bpf prog.

I'm a bit confused here. IIUC, we need to support the tskey like what
we did in this series to handle non TCP cases?

I think I can keep the three tskey-related patches so that both TCP
and non-TCP cases are supported, and then let the bpf program decide
whether to use the tskey.

>
> The kernel passed tskey could be 0 (meaning the user space has not used it). The
> bpf prog can give one for the kernel to use. The bpf prog can store the
> sk_tskey_bpf in the bpf_sk_storage now. Meaning no need to add one to the struct
> sock. The bpf prog does not have to start from 0 (e.g. start from U32_MAX
> instead) if it helps.
>
> If the kernel passed tskey is not 0, the bpf prog can just use that one
> (assuming the user space is doing something sane, like the value in
> SCM_TS_OPT_ID won't be jumping back and front between 0 to U32_MAX). I hope this
> is very unlikely also (?) but the bpf prog can probably detect this and choose
> to ignore this sk.
>
> To solve the above unsupported corner cases, I think we can allow the bpf prog
> to store something in the shinfo->hwtstamps at the tx path. The bpf-only key
> could be one of the things to store there. Change __ip[6]_append_data to handle
> the shinfo->hwtstamps. I think allowing the bpf prog to write to the
> shinfo->hwtsatmps could be considered later when needed.
>
> [ I may be off tomorrow, so reply could be slower. ]

Thanks for your help!

Thanks,
Jason

^ permalink raw reply related	[flat|nested] 88+ messages in thread

* Re: [PATCH net-next v3 02/14] net-timestamp: allow two features to work parallelly
  2024-10-31 23:26                                 ` Martin KaFai Lau
  2024-11-01  7:47                                   ` Jason Xing
@ 2024-11-01 13:32                                   ` Willem de Bruijn
  2024-11-01 16:08                                     ` Jason Xing
  2024-11-05  2:09                                     ` Martin KaFai Lau
  1 sibling, 2 replies; 88+ messages in thread
From: Willem de Bruijn @ 2024-11-01 13:32 UTC (permalink / raw)
  To: Martin KaFai Lau, Jason Xing, Willem de Bruijn
  Cc: willemb, davem, edumazet, kuba, pabeni, dsahern, ast, daniel,
	andrii, eddyz87, song, yonghong.song, john.fastabend, kpsingh,
	sdf, haoluo, jolsa, shuah, ykolal, bpf, netdev, Jason Xing

Martin KaFai Lau wrote:
> On 10/31/24 6:50 AM, Jason Xing wrote:
> > On Thu, Oct 31, 2024 at 8:30 PM Willem de Bruijn
> > <willemdebruijn.kernel@gmail.com> wrote:
> >>
> >> Jason Xing wrote:
> >>> On Thu, Oct 31, 2024 at 2:27 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
> >>>>
> >>>> On 10/30/24 5:13 PM, Jason Xing wrote:
> >>>>> I realized that we will have some new sock_opt flags like
> >>>>> TS_SCHED_OPT_CB in patch 4, so we can control whether to print or
> >>>>> not... For each sock_opt point, they will be called without caring if
> >>>>> related flags in skb are set. Well, it's meaningless to add more
> >>>>> control of skb tsflags at each TS_xx_OPT_CB point.
> >>>>>
> >>>>> Am I understanding in a correct way? Now, I'm totally fine with this great idea!
> >>>> Yes, I think so.
> >>>>
> >>>> The sockops prog can choose to ignore any BPF_SOCK_OPS_TS_*_CB. The are only 3:
> >>>> SCHED, SND, and ACK. If the hwtstamp is available from a NIC, I think it would
> >>>> be quite wasteful to throw it away. ACK can be controlled by the
> >>>> TCP_SKB_CB(skb)->bpf_txstamp_ack.
> >>>
> >>> Right, let me try this:)
> >>>
> >>>> Going back to my earlier bpf_setsockopt(SOL_SOCKET, BPF_TX_TIMESTAMPING)
> >>>> comment. I think it may as well go back to use the "u8
> >>>> bpf_sock_ops_cb_flags;" and use the bpf_sock_ops_cb_flags_set() helper to
> >>>> enable/disable the timestamp related callback hook. May be add one
> >>>> BPF_SOCK_OPS_TX_TIMESTAMPING_CB_FLAG.
> >>>
> >>> bpf_sock_ops_cb_flags this flag is only used in TCP condition, right?
> >>> If that is so, it cannot be suitable for UDP.
> >>>
> >>> I'm thinking of this solution:
> >>> 1) adding a new flag in SOF_TIMESTAMPING_OPT_BPF flag (in
> >>> include/uapi/linux/net_tstamp.h) which can be used by sk->sk_tsflags
> 
> probably not in include/uapi/linux/net_tstamp.h. This flag can only be used by a 
> bpf prog (meaning will not be used by user space syscall). More below.
> 
> >>> 2) flags =   SOF_TIMESTAMPING_OPT_BPF;    bpf_setsockopt(skops,
> >>> SOL_SOCKET, SO_TIMESTAMPING, &flags, sizeof(flags));
> >>> 3) test if sk->sk_tsflags has this new flag in tcp_tx_timestamp() or
> >>> in udp_sendmsg()
> >>> ...
> 
> Not sure how many churns/audits is needed to ensure the user space cannot 
> set/clear the SOF_TIMESTAMPING_OPT_BPF bit in sk->sk_tsflags. Could be not much.

Stores are limited to defined bits by the following check in
sock_set_timestamping():

        if (val & ~SOF_TIMESTAMPING_MASK)
                return -EINVAL;
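
As a rough user-space illustration of that kind of check (the DEMO_*
values and the function name are invented for this sketch; they are
not the kernel's SOF_TIMESTAMPING_* definitions): any bit outside the
defined mask is rejected before the flags are stored, so a bit that is
deliberately left out of the mask cannot be set through this path.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical flag space: four defined bits, everything else undefined. */
#define DEMO_TS_LAST	(1u << 3)
#define DEMO_TS_MASK	((DEMO_TS_LAST - 1) | DEMO_TS_LAST)

/* Reject undefined bits; on success store the new flags and return 0. */
static int demo_set_timestamping(uint32_t *tsflags, uint32_t val)
{
	if (val & ~DEMO_TS_MASK)
		return -EINVAL;

	*tsflags = val;
	return 0;
}
```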
 
> May be it is cleaner to leave the sk->sk_tsflags for user space only and having 
> a separate field in "struct sock" to track bpf specific needs. More like your 
> current sk_tsflags_bpf approach but I was thinking to reuse the 
> bpf_sock_ops_cb_flags instead. e.g. "BPF_SOCK_OPS_TEST_FLAG(tcp_sk(sk), 
> BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG)" is used to check if it needs to call a bpf 
> prog to decide if it needs to add tcp header option. Here we want to test if it 
> should call a bpf prog to make a decision on tx timestamp on a skb.
> 
> The bpf_sock_ops_cb_flags can be moved from struct tcp_sock to struct sock. It 
> is doable from the bpf side.
> 
> All that said, but, yes, it will add some TCP specific enum flag (e.g. 
> BPF_SOCK_OPS_RTO_CB_FLAG) to the struct sock which will not be used by 
> UDP/raw/...etc, so may be keep your current sk_tsflags_bpf approach but rename 
> it to sk_bpf_cb_flags in struct "sock" so that it can be reused for other non 
> tstamp ops in the future? probably a u8 is enough.
> 
> This optname is used by the bpf prog only and not usable by user space syscall. 
> If it prefers to stay with bpf_setsockopt (which is fine), it needs a bpf 
> specific optname like the current TCP_BPF_SOCK_OPS_CB_FLAGS which currently sets 
> the tp->bpf_sock_ops_cb_flags. May be a new SK_BPF_CB_FLAGS optname for setting 
> the sk->sk_bpf_cb_flags, like bpf_setsockopt(skops_ctx, SOL_SOCKET, 
> SK_BPF_CB_FLAGS, &val, sizeof(val)) and handle it in the sol_socket_sockopt() 
> alone without calling into sk_{set,get}sockopt. Add a new enum for the optval 
> for the sk_bpf_cb_flags:
> 
> enum {
> 	SK_BPF_CB_TX_TIMESTAMPING = (1 << 0),
> 	SK_BPF_CB_RX_TIEMSTAMPING = (1 << 1),
> };
> 
> >>
> >> On using the raw seqno: this data is accessible to anyone root in
> >> namespace (ns_capable) using packet sockets, so as long as it does not
> >> open to more than that, it is logically equivalent to the current
> >> setting.
> >>
> >> With seqno the BPF program has to be careful that the same seqno can
> >> be retransmitted, so for instance seeing an ACK before a (second) SND
> >> must be anticipated. That is true for SO_TIMESTAMPING today too.
> 
> Ah. It will be a very useful comment to add to the selftests bpf prog.
> 
> >>
> >> For datagrams (UDP as well as RAW and many non IP protocols), an
> >> alternative still needs to be found.
> 
> In udp/raw/..., I don't know how likely is the user space having "cork->tx_flags 
> & SKBTX_ANY_TSTAMP" set but has neither "READ_ONCE(sk->sk_tsflags) & 
> SOF_TIMESTAMPING_OPT_ID" nor "cork->flags & IPCORK_TS_OPT_ID" set.

This is not something to rely on. OPT_ID was added relatively recently.
Older applications, or any that just use the most straightforward API,
will not set this.

> If it is 
> unlikely, may be we can just disallow bpf prog from directly setting 
> skb_shinfo(skb)->tskey for this particular skb.
> 
> For all other cases, in __ip[6]_append_data, directly call a bpf prog and also 
> pass the kernel decided tskey to the bpf prog.
> 
> The kernel passed tskey could be 0 (meaning the user space has not used it). The 
> bpf prog can give one for the kernel to use. The bpf prog can store the 
> sk_tskey_bpf in the bpf_sk_storage now. Meaning no need to add one to the struct 
> sock. The bpf prog does not have to start from 0 (e.g. start from U32_MAX 
> instead) if it helps.
> 
> If the kernel passed tskey is not 0, the bpf prog can just use that one 
> (assuming the user space is doing something sane, like the value in 
> SCM_TS_OPT_ID won't be jumping back and front between 0 to U32_MAX). I hope this 
> is very unlikely also (?) but the bpf prog can probably detect this and choose 
> to ignore this sk.

If an application uses OPT_ID, it is unlikely to toggle the feature
on and off on a per-packet basis. So in the common case the program
could use the user-set counter, or its own counter if userspace does
not enable the feature. In the rare case that an application does set
an OPT_ID intermittently, the numbering would be erratic. This does
mean that an actively malicious application could mess with admin
measurements.
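
A minimal user-space sketch of that selection logic, assuming (as in
the quoted proposal) that a kernel-passed key of 0 means userspace has
not set one. The demo_* names are hypothetical. Note the limitation:
a genuine user key of 0 cannot be told apart from "not set", and an
application that sets a key only intermittently makes the sequence
jump between the two sources.

```c
#include <stdbool.h>
#include <stdint.h>

/* Program-owned counter, e.g. something kept in per-socket storage. */
struct demo_key_state {
	uint32_t next_key;
};

/*
 * Pick the key for one send: prefer the key supplied by the kernel
 * (non-zero), otherwise hand out the program's own counter value.
 * *from_user reports which source was used.
 */
static uint32_t demo_pick_tskey(struct demo_key_state *st,
				uint32_t kernel_key, bool *from_user)
{
	if (kernel_key != 0) {
		*from_user = true;
		return kernel_key;
	}

	*from_user = false;
	return st->next_key++;
}
```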

> To solve the above unsupported corner cases, I think we can allow the bpf prog 
> to store something in the shinfo->hwtstamps at the tx path. The bpf-only key 
> could be one of the things to store there. Change __ip[6]_append_data to handle 
> the shinfo->hwtstamps. I think allowing the bpf prog to write to the 
> shinfo->hwtsatmps could be considered later when needed.
> 
> [ I may be off tomorrow, so reply could be slower. ]
> 
> > 
> > It seems that using the tskey for bpf extension is always correct and
> > easy to use.
> > 
> > Could we provide the tskey to users and then let users decide the
> > better way to identify the call of sendmsg. We could keep the
> > traditional use of tskey. If without it, people need to figure out a
> > good way and may find it difficult to use the bpf extension.
> > 
> > I will keep thinking of alternatives for UDP in the meantime.



^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH net-next v3 02/14] net-timestamp: allow two features to work parallelly
  2024-11-01 13:32                                   ` Willem de Bruijn
@ 2024-11-01 16:08                                     ` Jason Xing
  2024-11-01 16:39                                       ` Willem de Bruijn
  2024-11-05  2:09                                     ` Martin KaFai Lau
  1 sibling, 1 reply; 88+ messages in thread
From: Jason Xing @ 2024-11-01 16:08 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Martin KaFai Lau, willemb, davem, edumazet, kuba, pabeni, dsahern,
	ast, daniel, andrii, eddyz87, song, yonghong.song, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, shuah, ykolal, bpf, netdev,
	Jason Xing

On Fri, Nov 1, 2024 at 9:32 PM Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
>
> Martin KaFai Lau wrote:
> > On 10/31/24 6:50 AM, Jason Xing wrote:
> > > On Thu, Oct 31, 2024 at 8:30 PM Willem de Bruijn
> > > <willemdebruijn.kernel@gmail.com> wrote:
> > >>
> > >> Jason Xing wrote:
> > >>> On Thu, Oct 31, 2024 at 2:27 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
> > >>>>
> > >>>> On 10/30/24 5:13 PM, Jason Xing wrote:
> > >>>>> I realized that we will have some new sock_opt flags like
> > >>>>> TS_SCHED_OPT_CB in patch 4, so we can control whether to print or
> > >>>>> not... For each sock_opt point, they will be called without caring if
> > >>>>> related flags in skb are set. Well, it's meaningless to add more
> > >>>>> control of skb tsflags at each TS_xx_OPT_CB point.
> > >>>>>
> > >>>>> Am I understanding in a correct way? Now, I'm totally fine with this great idea!
> > >>>> Yes, I think so.
> > >>>>
> > >>>> The sockops prog can choose to ignore any BPF_SOCK_OPS_TS_*_CB. The are only 3:
> > >>>> SCHED, SND, and ACK. If the hwtstamp is available from a NIC, I think it would
> > >>>> be quite wasteful to throw it away. ACK can be controlled by the
> > >>>> TCP_SKB_CB(skb)->bpf_txstamp_ack.
> > >>>
> > >>> Right, let me try this:)
> > >>>
> > >>>> Going back to my earlier bpf_setsockopt(SOL_SOCKET, BPF_TX_TIMESTAMPING)
> > >>>> comment. I think it may as well go back to use the "u8
> > >>>> bpf_sock_ops_cb_flags;" and use the bpf_sock_ops_cb_flags_set() helper to
> > >>>> enable/disable the timestamp related callback hook. May be add one
> > >>>> BPF_SOCK_OPS_TX_TIMESTAMPING_CB_FLAG.
> > >>>
> > >>> bpf_sock_ops_cb_flags this flag is only used in TCP condition, right?
> > >>> If that is so, it cannot be suitable for UDP.
> > >>>
> > >>> I'm thinking of this solution:
> > >>> 1) adding a new flag in SOF_TIMESTAMPING_OPT_BPF flag (in
> > >>> include/uapi/linux/net_tstamp.h) which can be used by sk->sk_tsflags
> >
> > probably not in include/uapi/linux/net_tstamp.h. This flag can only be used by a
> > bpf prog (meaning will not be used by user space syscall). More below.
> >
> > >>> 2) flags =   SOF_TIMESTAMPING_OPT_BPF;    bpf_setsockopt(skops,
> > >>> SOL_SOCKET, SO_TIMESTAMPING, &flags, sizeof(flags));
> > >>> 3) test if sk->sk_tsflags has this new flag in tcp_tx_timestamp() or
> > >>> in udp_sendmsg()
> > >>> ...
> >
> > Not sure how many churns/audits is needed to ensure the user space cannot
> > set/clear the SOF_TIMESTAMPING_OPT_BPF bit in sk->sk_tsflags. Could be not much.
>
> Stores are limited to defined bits with the following in
> sock_set_timestamping
>
>         if (val & ~SOF_TIMESTAMPING_MASK)
>                 return -EINVAL;
>
> > May be it is cleaner to leave the sk->sk_tsflags for user space only and having
> > a separate field in "struct sock" to track bpf specific needs. More like your
> > current sk_tsflags_bpf approach but I was thinking to reuse the
> > bpf_sock_ops_cb_flags instead. e.g. "BPF_SOCK_OPS_TEST_FLAG(tcp_sk(sk),
> > BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG)" is used to check if it needs to call a bpf
> > prog to decide if it needs to add tcp header option. Here we want to test if it
> > should call a bpf prog to make a decision on tx timestamp on a skb.
> >
> > The bpf_sock_ops_cb_flags can be moved from struct tcp_sock to struct sock. It
> > is doable from the bpf side.
> >
> > All that said, but, yes, it will add some TCP specific enum flag (e.g.
> > BPF_SOCK_OPS_RTO_CB_FLAG) to the struct sock which will not be used by
> > UDP/raw/...etc, so may be keep your current sk_tsflags_bpf approach but rename
> > it to sk_bpf_cb_flags in struct "sock" so that it can be reused for other non
> > tstamp ops in the future? probably a u8 is enough.
> >
> > This optname is used by the bpf prog only and not usable by user space syscall.
> > If it prefers to stay with bpf_setsockopt (which is fine), it needs a bpf
> > specific optname like the current TCP_BPF_SOCK_OPS_CB_FLAGS which currently sets
> > the tp->bpf_sock_ops_cb_flags. May be a new SK_BPF_CB_FLAGS optname for setting
> > the sk->sk_bpf_cb_flags, like bpf_setsockopt(skops_ctx, SOL_SOCKET,
> > SK_BPF_CB_FLAGS, &val, sizeof(val)) and handle it in the sol_socket_sockopt()
> > alone without calling into sk_{set,get}sockopt. Add a new enum for the optval
> > for the sk_bpf_cb_flags:
> >
> > enum {
> >       SK_BPF_CB_TX_TIMESTAMPING = (1 << 0),
> >       SK_BPF_CB_RX_TIEMSTAMPING = (1 << 1),
> > };
> >
> > >>
> > >> On using the raw seqno: this data is accessible to anyone root in
> > >> namespace (ns_capable) using packet sockets, so as long as it does not
> > >> open to more than that, it is logically equivalent to the current
> > >> setting.
> > >>
> > >> With seqno the BPF program has to be careful that the same seqno can
> > >> be retransmitted, so for instance seeing an ACK before a (second) SND
> > >> must be anticipated. That is true for SO_TIMESTAMPING today too.
> >
> > Ah. It will be a very useful comment to add to the selftests bpf prog.
> >
> > >>
> > >> For datagrams (UDP as well as RAW and many non IP protocols), an
> > >> alternative still needs to be found.
> >
> > In udp/raw/..., I don't know how likely is the user space having "cork->tx_flags
> > & SKBTX_ANY_TSTAMP" set but has neither "READ_ONCE(sk->sk_tsflags) &
> > SOF_TIMESTAMPING_OPT_ID" nor "cork->flags & IPCORK_TS_OPT_ID" set.
>
> This is not something to rely on. OPT_ID was added relatively recently.
> Older applications, or any that just use the most straightforward API,
> will not set this.
>
> > If it is
> > unlikely, may be we can just disallow bpf prog from directly setting
> > skb_shinfo(skb)->tskey for this particular skb.
> >
> > For all other cases, in __ip[6]_append_data, directly call a bpf prog and also
> > pass the kernel decided tskey to the bpf prog.
> >
> > The kernel passed tskey could be 0 (meaning the user space has not used it). The
> > bpf prog can give one for the kernel to use. The bpf prog can store the
> > sk_tskey_bpf in the bpf_sk_storage now. Meaning no need to add one to the struct
> > sock. The bpf prog does not have to start from 0 (e.g. start from U32_MAX
> > instead) if it helps.
> >
> > If the kernel passed tskey is not 0, the bpf prog can just use that one
> > (assuming the user space is doing something sane, like the value in
> > SCM_TS_OPT_ID won't be jumping back and front between 0 to U32_MAX). I hope this
> > is very unlikely also (?) but the bpf prog can probably detect this and choose
> > to ignore this sk.
>
> If an applications uses OPT_ID, it is unlikely that they will toggle
> the feature on and off on a per-packet basis. So in the common case
> the program could use the user-set counter or use its own if userspace
> does not enable the feature. In the rare case that an application does
> intermittently set an OPT_ID, the numbering would be erratic. This
> does mean that an actively malicious application could mess with admin
> measurements.
>

Sorry, I got lost in this part. What would you recommend I do about
OPT_ID in the next version? Should I keep those three OPT_ID
patches?

Thanks,
Jason

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH net-next v3 02/14] net-timestamp: allow two features to work parallelly
  2024-11-01 16:08                                     ` Jason Xing
@ 2024-11-01 16:39                                       ` Willem de Bruijn
  0 siblings, 0 replies; 88+ messages in thread
From: Willem de Bruijn @ 2024-11-01 16:39 UTC (permalink / raw)
  To: Jason Xing, Willem de Bruijn
  Cc: Martin KaFai Lau, willemb, davem, edumazet, kuba, pabeni, dsahern,
	ast, daniel, andrii, eddyz87, song, yonghong.song, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, shuah, ykolal, bpf, netdev,
	Jason Xing

> > > >>
> > > >> For datagrams (UDP as well as RAW and many non IP protocols), an
> > > >> alternative still needs to be found.
> > >
> > > In udp/raw/..., I don't know how likely is the user space having "cork->tx_flags
> > > & SKBTX_ANY_TSTAMP" set but has neither "READ_ONCE(sk->sk_tsflags) &
> > > SOF_TIMESTAMPING_OPT_ID" nor "cork->flags & IPCORK_TS_OPT_ID" set.
> >
> > This is not something to rely on. OPT_ID was added relatively recently.
> > Older applications, or any that just use the most straightforward API,
> > will not set this.
> >
> > > If it is
> > > unlikely, may be we can just disallow bpf prog from directly setting
> > > skb_shinfo(skb)->tskey for this particular skb.
> > >
> > > For all other cases, in __ip[6]_append_data, directly call a bpf prog and also
> > > pass the kernel decided tskey to the bpf prog.
> > >
> > > The kernel passed tskey could be 0 (meaning the user space has not used it). The
> > > bpf prog can give one for the kernel to use. The bpf prog can store the
> > > sk_tskey_bpf in the bpf_sk_storage now. Meaning no need to add one to the struct
> > > sock. The bpf prog does not have to start from 0 (e.g. start from U32_MAX
> > > instead) if it helps.
> > >
> > > If the kernel passed tskey is not 0, the bpf prog can just use that one
> > > (assuming the user space is doing something sane, like the value in
> > > SCM_TS_OPT_ID won't be jumping back and front between 0 to U32_MAX). I hope this
> > > is very unlikely also (?) but the bpf prog can probably detect this and choose
> > > to ignore this sk.
> >
> > If an applications uses OPT_ID, it is unlikely that they will toggle
> > the feature on and off on a per-packet basis. So in the common case
> > the program could use the user-set counter or use its own if userspace
> > does not enable the feature. In the rare case that an application does
> > intermittently set an OPT_ID, the numbering would be erratic. This
> > does mean that an actively malicious application could mess with admin
> > measurements.
> >
> 
> Sorry, I got lost in this part. What would you recommend I should do
> about OPT_ID in the next move? Should I keep those three OPT_ID
> patches?

I did not offer a suggestion. Just pointed out a constraint.

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH net-next v3 02/14] net-timestamp: allow two features to work parallelly
  2024-10-28 11:05 ` [PATCH net-next v3 02/14] net-timestamp: allow two features to work parallelly Jason Xing
  2024-10-29 23:00   ` Martin KaFai Lau
@ 2024-11-02 13:43   ` Simon Horman
  2024-11-03  0:42     ` Jason Xing
  1 sibling, 1 reply; 88+ messages in thread
From: Simon Horman @ 2024-11-02 13:43 UTC (permalink / raw)
  To: Jason Xing
  Cc: davem, edumazet, kuba, pabeni, dsahern, willemdebruijn.kernel,
	willemb, ast, daniel, andrii, martin.lau, eddyz87, song,
	yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa, shuah,
	ykolal, bpf, netdev, Jason Xing

On Mon, Oct 28, 2024 at 07:05:23PM +0800, Jason Xing wrote:
> From: Jason Xing <kernelxing@tencent.com>
> 
> This patch has introduced a separate sk_tsflags_bpf for bpf
> extension, which helps us let two feature work nearly at the
> same time.
> 
> Each feature will finally take effect on skb_shinfo(skb)->tx_flags,
> say, tcp_tx_timestamp() for TCP or skb_setup_tx_timestamp() for
> other types, so in __skb_tstamp_tx() we are unable to know which
> feature is turned on, unless we check each feature's own socket
> flag field.
> 
> Signed-off-by: Jason Xing <kernelxing@tencent.com>
> ---
>  include/net/sock.h |  1 +
>  net/core/skbuff.c  | 39 +++++++++++++++++++++++++++++++++++++++
>  2 files changed, 40 insertions(+)
> 
> diff --git a/include/net/sock.h b/include/net/sock.h
> index 7464e9f9f47c..5384f1e49f5c 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -445,6 +445,7 @@ struct sock {
>  	u32			sk_reserved_mem;
>  	int			sk_forward_alloc;
>  	u32			sk_tsflags;
> +	u32			sk_tsflags_bpf;

Please add sk_tsflags_bpf to the Kernel doc for this structure.
Likewise for sk_tskey_bpf_offset which is added by a subsequent patch.

>  	__cacheline_group_end(sock_write_rxtx);
>  
>  	__cacheline_group_begin(sock_write_tx);

...

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH net-next v3 02/14] net-timestamp: allow two features to work parallelly
  2024-11-02 13:43   ` Simon Horman
@ 2024-11-03  0:42     ` Jason Xing
  0 siblings, 0 replies; 88+ messages in thread
From: Jason Xing @ 2024-11-03  0:42 UTC (permalink / raw)
  To: Simon Horman
  Cc: davem, edumazet, kuba, pabeni, dsahern, willemdebruijn.kernel,
	willemb, ast, daniel, andrii, martin.lau, eddyz87, song,
	yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa, shuah,
	ykolal, bpf, netdev, Jason Xing

On Sat, Nov 2, 2024 at 9:44 PM Simon Horman <horms@kernel.org> wrote:
>
> On Mon, Oct 28, 2024 at 07:05:23PM +0800, Jason Xing wrote:
> > From: Jason Xing <kernelxing@tencent.com>
> >
> > This patch has introduced a separate sk_tsflags_bpf for bpf
> > extension, which helps us let two feature work nearly at the
> > same time.
> >
> > Each feature will finally take effect on skb_shinfo(skb)->tx_flags,
> > say, tcp_tx_timestamp() for TCP or skb_setup_tx_timestamp() for
> > other types, so in __skb_tstamp_tx() we are unable to know which
> > feature is turned on, unless we check each feature's own socket
> > flag field.
> >
> > Signed-off-by: Jason Xing <kernelxing@tencent.com>
> > ---
> >  include/net/sock.h |  1 +
> >  net/core/skbuff.c  | 39 +++++++++++++++++++++++++++++++++++++++
> >  2 files changed, 40 insertions(+)
> >
> > diff --git a/include/net/sock.h b/include/net/sock.h
> > index 7464e9f9f47c..5384f1e49f5c 100644
> > --- a/include/net/sock.h
> > +++ b/include/net/sock.h
> > @@ -445,6 +445,7 @@ struct sock {
> >       u32                     sk_reserved_mem;
> >       int                     sk_forward_alloc;
> >       u32                     sk_tsflags;
> > +     u32                     sk_tsflags_bpf;
>
> Please add sk_tsflags_bpf to the Kernel doc for this structure.
> Likewise for sk_tskey_bpf_offset which is added by a subsequent patch.

Oh, thanks for reminding me!

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH net-next v3 02/14] net-timestamp: allow two features to work parallelly
  2024-11-01  7:47                                   ` Jason Xing
@ 2024-11-05  1:50                                     ` Martin KaFai Lau
  2024-11-05  3:13                                       ` Jason Xing
  0 siblings, 1 reply; 88+ messages in thread
From: Martin KaFai Lau @ 2024-11-05  1:50 UTC (permalink / raw)
  To: Jason Xing
  Cc: Willem de Bruijn, willemb, davem, edumazet, kuba, pabeni, dsahern,
	ast, daniel, andrii, eddyz87, song, yonghong.song, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, shuah, ykolal, bpf, netdev,
	Jason Xing

On 11/1/24 12:47 AM, Jason Xing wrote:

>> If it prefers to stay with bpf_setsockopt (which is fine), it needs a bpf
>> specific optname like the current TCP_BPF_SOCK_OPS_CB_FLAGS which currently sets
>> the tp->bpf_sock_ops_cb_flags. May be a new SK_BPF_CB_FLAGS optname for setting
>> the sk->sk_bpf_cb_flags, like bpf_setsockopt(skops_ctx, SOL_SOCKET,
> 
>> SK_BPF_CB_FLAGS, &val, sizeof(val)) and handle it in the sol_socket_sockopt()
>> alone without calling into sk_{set,get}sockopt. Add a new enum for the optval
>> for the sk_bpf_cb_flags:
>>
>> enum {
>>          SK_BPF_CB_TX_TIMESTAMPING = (1 << 0),
>>          SK_BPF_CB_RX_TIEMSTAMPING = (1 << 1),
>> };
> 
> Then it will involve more strange modification in sol_socket_sockopt()
> to retrieve the opt value like what I did in V2 (see
> https://lore.kernel.org/all/20241012040651.95616-3-kerneljasonxing@gmail.com/).
> It's the reason why I did set and get operation in
> sk_{set,get}sockopt() in this series to keep align with other flags.
> Handling it in sk_{set,get}sockopt() is not a bad idea and easy to
> implement, I feel.

This will look very different now. It handles a bpf-specific
optname and accesses the bpf-specific field sk->sk_bpf_cb_flags.

I really don't see why it needs to spill over into sk_{set,get}sockopt()
to handle sk->sk_bpf_cb_flags.

I have quickly typed out a small part of the discussion so far.
It is likely buggy and not compile-tested. Pieces are still missing.
The bpf_txstamp_ack will need a few more changes in
tcp_{input,output}.c. Maybe merging it with txstamp_ack into a
2-bit field would be cleaner, not sure.

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 39f1d16f3628..0b4913315854 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -488,6 +488,7 @@ enum {
  
  	/* generate software time stamp when entering packet scheduling */
  	SKBTX_SCHED_TSTAMP = 1 << 6,
+	SKBTX_BPF = 1 << 7,
  };
  
  #define SKBTX_ANY_SW_TSTAMP	(SKBTX_SW_TSTAMP    | \
diff --git a/include/net/sock.h b/include/net/sock.h
index f29c14448938..4ec27c524f49 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -234,6 +234,20 @@ struct sock_common {
  struct bpf_local_storage;
  struct sk_filter;
  
+enum {
+	SK_BPF_CB_TX_TIMESTAMPING = BIT(0),
+	SK_BPF_CB_RX_TIEMSTAMPING = BIT(1),
+	SK_BPF_CB_MASK		= BIT(2) - 1,
+};
+
+#ifdef CONFIG_BPF_SYSCALL
+#define SK_BPF_CB_FLAG_TEST(SK, FLAG) ((SK)->sk_bpf_cb_flags & (FLAG))
+void bpf_skops_tx_timestamping(struct sock *sk, struct sk_buff *skb);
+#else
+#define SK_BPF_CB_FLAG_TEST(SK, FLAG) 0
+static inline void bpf_skops_tx_timestamping(struct sock *sk, struct sk_buff *skb) {}
+#endif
+
  /**
    *	struct sock - network layer representation of sockets
    *	@__sk_common: shared layout with inet_timewait_sock
@@ -444,7 +458,10 @@ struct sock {
  	socket_lock_t		sk_lock;
  	u32			sk_reserved_mem;
  	int			sk_forward_alloc;
-	u32			sk_tsflags;
+	u16			sk_tsflags;
+#ifdef CONFIG_BPF_SYSCALL
+	u16			sk_bpf_cb_flags;
+#endif
  	__cacheline_group_end(sock_write_rxtx);
  
  	__cacheline_group_begin(sock_write_tx);
diff --git a/include/net/tcp.h b/include/net/tcp.h
index d1948d357dad..224b697bae9d 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -961,7 +961,8 @@ struct tcp_skb_cb {
  	__u8		txstamp_ack:1,	/* Record TX timestamp for ack? */
  			eor:1,		/* Is skb MSG_EOR marked? */
  			has_rxtstamp:1,	/* SKB has a RX timestamp	*/
-			unused:5;
+			bpf_txstamp_ack:1,
+			unused:4;
  	__u32		ack_seq;	/* Sequence number ACK'd	*/
  	union {
  		struct {
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index f28b6527e815..2ff7ff0ebdab 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -7014,6 +7014,7 @@ enum {
  					 * by the kernel or the
  					 * earlier bpf-progs.
  					 */
+	BPF_SOCK_OPS_TX_TIMESTAMPING_CB,
  };
  
  /* List of TCP states. There is a build check in net/ipv4/tcp.c to detect
@@ -7080,6 +7081,7 @@ enum {
  	TCP_BPF_SYN_IP		= 1006, /* Copy the IP[46] and TCP header */
  	TCP_BPF_SYN_MAC         = 1007, /* Copy the MAC, IP[46], and TCP header */
  	TCP_BPF_SOCK_OPS_CB_FLAGS = 1008, /* Get or Set TCP sock ops flags */
+	SK_BPF_CB_FLAGS		= 1009,
  };
  
  enum {
diff --git a/net/core/filter.c b/net/core/filter.c
index e31ee8be2de0..81a36e50047b 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5206,6 +5206,19 @@ static const struct bpf_func_proto bpf_get_socket_uid_proto = {
  	.arg1_type      = ARG_PTR_TO_CTX,
  };
  
+static int sk_bpf_cb_flags(struct sock *sk, int sk_bpf_cb_flags, bool getopt)
+{
+	if (getopt)
+		return -EINVAL;
+
+	if (sk_bpf_cb_flags & ~SK_BPF_CB_MASK)
+		return -EINVAL;
+
+	sk->sk_bpf_cb_flags = sk_bpf_cb_flags;
+
+	return 0;
+}
+
  static int sol_socket_sockopt(struct sock *sk, int optname,
  			      char *optval, int *optlen,
  			      bool getopt)
@@ -5222,6 +5235,7 @@ static int sol_socket_sockopt(struct sock *sk, int optname,
  	case SO_MAX_PACING_RATE:
  	case SO_BINDTOIFINDEX:
  	case SO_TXREHASH:
+	case SK_BPF_CB_FLAGS:
  		if (*optlen != sizeof(int))
  			return -EINVAL;
  		break;
@@ -5231,6 +5245,9 @@ static int sol_socket_sockopt(struct sock *sk, int optname,
  		return -EINVAL;
  	}
  
+	if (optname == SK_BPF_CB_FLAGS)
+		return sk_bpf_cb_flags(sk, *(int *)optval, getopt);
+
  	if (getopt) {
  		if (optname == SO_BINDTODEVICE)
  			return -EINVAL;
diff --git a/net/core/sock.c b/net/core/sock.c
index 039be95c40cf..d0406639cee9 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -137,6 +137,7 @@
  #include <linux/sock_diag.h>
  
  #include <linux/filter.h>
+#include <linux/bpf-cgroup.h>
  #include <net/sock_reuseport.h>
  #include <net/bpf_sk_storage.h>
  
@@ -946,6 +947,20 @@ int sock_set_timestamping(struct sock *sk, int optname,
  	return 0;
  }
  
+#ifdef CONFIG_BPF_SYSCALL
+void bpf_skops_tx_timestamping(struct sock *sk, struct sk_buff *skb)
+{
+	struct bpf_sock_ops_kern sock_ops;
+
+	memset(&sock_ops, 0, offsetof(struct bpf_sock_ops_kern, temp));
+	sock_ops.op = BPF_SOCK_OPS_TX_TIMESTAMPING_CB;
+	sock_ops.is_fullsock = 1;
+	sock_ops.sk = sk;
+	sock_ops.skb = skb;
+	__cgroup_bpf_run_filter_sock_ops(sk, &sock_ops, CGROUP_SOCK_OPS);
+}
+#endif
+
  void sock_set_keepalive(struct sock *sk)
  {
  	lock_sock(sk);
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 4f77bd862e95..1e7f2d5fd792 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -491,6 +491,15 @@ static void tcp_tx_timestamp(struct sock *sk, u16 tsflags)
  		if (tsflags & SOF_TIMESTAMPING_TX_RECORD_MASK)
  			shinfo->tskey = TCP_SKB_CB(skb)->seq + skb->len - 1;
  	}
+
+	if (cgroup_bpf_enabled(CGROUP_SOCK_OPS) &&
+	    SK_BPF_CB_FLAG_TEST(sk, SK_BPF_CB_TX_TIMESTAMPING))
+		/* The bpf prog can do:
+		 * shinfo->tx_flags |= SKBTX_BPF,
+		 * tcb->bpf_txstamp_ack = 1,
+		 * shinfo->tskey = TCP_SKB_CB(skb)->seq + skb->len - 1 (if tskey not set)
+		 */
+		bpf_skops_tx_timestamping(sk, skb);
  }
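
A minimal user-space model of the SK_BPF_CB_FLAGS set path in the draft
above. It assumes the store is meant to copy the caller's value into the
socket field. struct model_sock, model_set_cb_flags() and
model_cb_flag_test() are hypothetical names used only for illustration;
they are not kernel APIs.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Flag values copied from the draft patch; the spelling is the draft's own. */
enum {
	SK_BPF_CB_TX_TIMESTAMPING = 1u << 0,
	SK_BPF_CB_RX_TIEMSTAMPING = 1u << 1,
	SK_BPF_CB_MASK            = (1u << 2) - 1,
};

/* Hypothetical stand-in for the single struct sock field used here. */
struct model_sock {
	uint16_t sk_bpf_cb_flags;
};

/* Set-only option: reject reads and unknown bits, then store the value. */
int model_set_cb_flags(struct model_sock *sk, int flags, bool getopt)
{
	if (getopt)
		return -EINVAL;
	if (flags & ~SK_BPF_CB_MASK)
		return -EINVAL;
	sk->sk_bpf_cb_flags = (uint16_t)flags;
	return 0;
}

/* Mirrors SK_BPF_CB_FLAG_TEST(): non-zero when the flag is set. */
int model_cb_flag_test(const struct model_sock *sk, int flag)
{
	return sk->sk_bpf_cb_flags & flag;
}
```

A rejected call leaves the previously stored flags untouched, so a bpf
prog probing an unsupported bit does not disturb the bits already set.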


> 
> Overall the suggestion looks good to me. I can give it a try :)
> 
> I'm thinking of another approach to using bpf_sock_ops_cb_flags_set()
> instead of bpf_setsockopt() when sockops like
> BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB is triggered. I can modify the
> bpf_sock_ops_cb_flags_set like this:
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 58761263176c..001140067c1a 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -5770,14 +5770,25 @@ BPF_CALL_2(bpf_sock_ops_cb_flags_set, struct
> bpf_sock_ops_kern *, bpf_sock,
>             int, argval)
>   {
>          struct sock *sk = bpf_sock->sk;
> -       int val = argval & BPF_SOCK_OPS_ALL_CB_FLAGS;
> +       int val = argval;
> 
> -       if (!IS_ENABLED(CONFIG_INET) || !sk_fullsock(sk))
> +       if (!IS_ENABLED(CONFIG_INET))
>                  return -EINVAL;
> 
> -       tcp_sk(sk)->bpf_sock_ops_cb_flags = val;
> +       if (sk_is_tcp(sk)) {
> +               val = argval & BPF_SOCK_OPS_ALL_CB_FLAGS;
> +               if (!sk_fullsock(sk))
> +                       return -EINVAL;
> +
> +               tcp_sk(sk)->bpf_sock_ops_cb_flags = val;
> +
> +               val = argval & (~BPF_SOCK_OPS_ALL_CB_FLAGS);
> +       } else {
> +               sk->bpf_sock_ops_cb_flags = val;

Why separate the tcp vs non-tcp case? The tcp_sk(sk)->bpf_sock_ops_cb_flags
is running out of bits for tcp-specific callbacks anyway.

Just keep the SK_BPF_CB_{TX,RX}_TIEMSTAMPING in sk->sk_bpf_cb_flags
for all tcp/udp/raw/...

> +               val = argval &
> (~(SK_BPF_CB_TX_TIEMSTAMPING|SK_BPF_CB_RX_TIEMSTAMPING));

imo, we also don't need to return val to tell the caller what
is not supported in the running kernel. BPF CO-RE can
handle this as well, so there is less reason to keep extending the
bpf_sock_ops_cb_flags_set API for non-tcp.

>>>> For datagrams (UDP as well as RAW and many non IP protocols), an
>>>> alternative still needs to be found.
>>
>> In udp/raw/..., I don't know how likely is the user space having "cork->tx_flags
>> & SKBTX_ANY_TSTAMP" set but has neither "READ_ONCE(sk->sk_tsflags) &
>> SOF_TIMESTAMPING_OPT_ID" nor "cork->flags & IPCORK_TS_OPT_ID" set. If it is
>> unlikely, may be we can just disallow bpf prog from directly setting
>> skb_shinfo(skb)->tskey for this particular skb.
>>
>> For all other cases, in __ip[6]_append_data, directly call a bpf prog and also
>> pass the kernel decided tskey to the bpf prog.
> 
> I'm a bit confused here. IIUC, we need to support the tskey like what
> we did in this series to handle non TCP cases?

As with tcp, I don't think the non-tcp cases really need to use sk->sk_tskey
to mark the ID of an skb either. I will comment in another thread.


^ permalink raw reply related	[flat|nested] 88+ messages in thread

* Re: [PATCH net-next v3 02/14] net-timestamp: allow two features to work parallelly
  2024-11-01 13:32                                   ` Willem de Bruijn
  2024-11-01 16:08                                     ` Jason Xing
@ 2024-11-05  2:09                                     ` Martin KaFai Lau
  2024-11-05  6:22                                       ` Jason Xing
  2024-11-05 14:29                                       ` Willem de Bruijn
  1 sibling, 2 replies; 88+ messages in thread
From: Martin KaFai Lau @ 2024-11-05  2:09 UTC (permalink / raw)
  To: Willem de Bruijn, Jason Xing
  Cc: willemb, davem, edumazet, kuba, pabeni, dsahern, ast, daniel,
	andrii, eddyz87, song, yonghong.song, john.fastabend, kpsingh,
	sdf, haoluo, jolsa, shuah, ykolal, bpf, netdev, Jason Xing

On 11/1/24 6:32 AM, Willem de Bruijn wrote:
>> In udp/raw/..., I don't know how likely is the user space having "cork->tx_flags
>> & SKBTX_ANY_TSTAMP" set but has neither "READ_ONCE(sk->sk_tsflags) &
>> SOF_TIMESTAMPING_OPT_ID" nor "cork->flags & IPCORK_TS_OPT_ID" set.
> This is not something to rely on. OPT_ID was added relatively recently.
> Older applications, or any that just use the most straightforward API,
> will not set this.

Good point that the per-cmsg OPT_ID is very new.

The datagram support for SOF_TIMESTAMPING_OPT_ID in sk->sk_tsflags has
been there for quite some time now. Is it a safe assumption that
most applications doing udp tx timestamping have
SOF_TIMESTAMPING_OPT_ID set, in order to be useful?

> 
>> If it is
>> unlikely, may be we can just disallow bpf prog from directly setting
>> skb_shinfo(skb)->tskey for this particular skb.
>>
>> For all other cases, in __ip[6]_append_data, directly call a bpf prog and also
>> pass the kernel decided tskey to the bpf prog.
>>
>> The kernel passed tskey could be 0 (meaning the user space has not used it). The
>> bpf prog can give one for the kernel to use. The bpf prog can store the
>> sk_tskey_bpf in the bpf_sk_storage now. Meaning no need to add one to the struct
>> sock. The bpf prog does not have to start from 0 (e.g. start from U32_MAX
>> instead) if it helps.
>>
>> If the kernel passed tskey is not 0, the bpf prog can just use that one
>> (assuming the user space is doing something sane, like the value in
>> SCM_TS_OPT_ID won't be jumping back and front between 0 to U32_MAX). I hope this
>> is very unlikely also (?) but the bpf prog can probably detect this and choose
>> to ignore this sk.
> If an applications uses OPT_ID, it is unlikely that they will toggle
> the feature on and off on a per-packet basis. So in the common case
> the program could use the user-set counter or use its own if userspace
> does not enable the feature. In the rare case that an application does
> intermittently set an OPT_ID, the numbering would be erratic. This
> does mean that an actively malicious application could mess with admin
> measurements.

That all makes sense. It is reasonable to assume that user space either
has SOF_TIMESTAMPING_OPT_ID always on or always off. When it is off, the bpf
prog can directly provide its own tskey to be used in shinfo->tskey. The bpf
prog can generate the ID itself without using sk->sk_tskey, e.g. by storing an
atomic int in the bpf_sk_storage.
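
The per-socket ID generator mentioned above could look roughly like the
user-space sketch below. struct tskey_storage stands in for a value kept
in bpf_sk_storage, and tskey_storage_init() and tskey_next() are
hypothetical names; an actual bpf prog would use a map value and BPF
atomics rather than C11 <stdatomic.h>.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical per-socket storage, standing in for a bpf_sk_storage value. */
struct tskey_storage {
	_Atomic uint32_t next_key;
};

/* The counter does not have to start at 0; the caller picks the first ID. */
void tskey_storage_init(struct tskey_storage *st, uint32_t first)
{
	atomic_init(&st->next_key, first);
}

/* Return a fresh ID and advance the counter; it wraps modulo 2^32. */
uint32_t tskey_next(struct tskey_storage *st)
{
	return atomic_fetch_add_explicit(&st->next_key, 1,
					 memory_order_relaxed);
}
```

Relaxed ordering is enough here because the counter only has to hand out
distinct values; it does not order any other memory accesses.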

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH net-next v3 02/14] net-timestamp: allow two features to work parallelly
  2024-11-05  1:50                                     ` Martin KaFai Lau
@ 2024-11-05  3:13                                       ` Jason Xing
  0 siblings, 0 replies; 88+ messages in thread
From: Jason Xing @ 2024-11-05  3:13 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: Willem de Bruijn, willemb, davem, edumazet, kuba, pabeni, dsahern,
	ast, daniel, andrii, eddyz87, song, yonghong.song, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, shuah, ykolal, bpf, netdev,
	Jason Xing

On Tue, Nov 5, 2024 at 9:51 AM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 11/1/24 12:47 AM, Jason Xing wrote:
>
> >> If it prefers to stay with bpf_setsockopt (which is fine), it needs a bpf
> >> specific optname like the current TCP_BPF_SOCK_OPS_CB_FLAGS which currently sets
> >> the tp->bpf_sock_ops_cb_flags. May be a new SK_BPF_CB_FLAGS optname for setting
> >> the sk->sk_bpf_cb_flags, like bpf_setsockopt(skops_ctx, SOL_SOCKET,
> >
> >> SK_BPF_CB_FLAGS, &val, sizeof(val)) and handle it in the sol_socket_sockopt()
> >> alone without calling into sk_{set,get}sockopt. Add a new enum for the optval
> >> for the sk_bpf_cb_flags:
> >>
> >> enum {
> >>          SK_BPF_CB_TX_TIMESTAMPING = (1 << 0),
> >>          SK_BPF_CB_RX_TIEMSTAMPING = (1 << 1),
> >> };
> >
> > Then it will involve more strange modification in sol_socket_sockopt()
> > to retrieve the opt value like what I did in V2 (see
> > https://lore.kernel.org/all/20241012040651.95616-3-kerneljasonxing@gmail.com/).
> > It's the reason why I did set and get operation in
> > sk_{set,get}sockopt() in this series to keep align with other flags.
> > Handling it in sk_{set,get}sockopt() is not a bad idea and easy to
> > implement, I feel.
>
> This will look very different now. It is handling bpf specific
> optname and accessing the bpf specific field in sk->sk_bpf_cb_flags.
>
> I really don't see why it needs to spill over to sk_{set,get}sockopt()
> to handle sk->sk_bpf_cb_flags.
>
> I have quickly typed out a small part of discussion so far.
> It is likely buggy and not compiler tested. Pieces are still missing.
> The bpf_tstamp_ack will need a few more changes in the
> tcp_{input,output}.c. May be merging with the tstamp_ack to become
> 2 bits will be cleaner, not sure.
>
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index 39f1d16f3628..0b4913315854 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -488,6 +488,7 @@ enum {
>
>         /* generate software time stamp when entering packet scheduling */
>         SKBTX_SCHED_TSTAMP = 1 << 6,
> +       SKBTX_BPF = 1 << 7,
>   };
>
>   #define SKBTX_ANY_SW_TSTAMP   (SKBTX_SW_TSTAMP    | \
> diff --git a/include/net/sock.h b/include/net/sock.h
> index f29c14448938..4ec27c524f49 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -234,6 +234,20 @@ struct sock_common {
>   struct bpf_local_storage;
>   struct sk_filter;
>
> +enum {
> +       SK_BPF_CB_TX_TIMESTAMPING = BIT(0),
> +       SK_BPF_CB_RX_TIEMSTAMPING = BIT(1),
> +       SK_BPF_CB_MASK          = BIT(2) - 1,
> +};
> +
> +#ifdef CONFIG_BPF_SYSCALL
> +#define SK_BPF_CB_FLAG_TEST(SK, FLAG) ((SK)->sk_bpf_cb_flags & (FLAG))
> +void bpf_skops_tx_timestamping(struct sock *sk, struct sk_buff *skb);
> +#else
> +#define SK_BPF_CB_FLAG_TEST(SK, FLAG)
> +static inline void bpf_skops_timestamping(struct sock *sk, struct sk_buff *skb) {}
> +#endif

Only now do I realize that I misunderstood what you meant in the previous
thread. I thought you were suggesting we need to use bpf_setsockopt.
Sorry.

I completely agree with this approach! Thanks!

> +
>   /**
>     *   struct sock - network layer representation of sockets
>     *   @__sk_common: shared layout with inet_timewait_sock
> @@ -444,7 +458,10 @@ struct sock {
>         socket_lock_t           sk_lock;
>         u32                     sk_reserved_mem;
>         int                     sk_forward_alloc;
> -       u32                     sk_tsflags;
> +       u16                     sk_tsflags;
> +#ifdef CONFIG_BPF_SYSCALL
> +       u16                     sk_bpf_cb_flags;

We cannot use u16 for sk_tsflags because
SOF_TIMESTAMPING_OPT_RX_FILTER already uses bit 17 (1 << 17), which
does not fit in 16 bits. I will handle it.
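
A small sketch of the truncation, assuming the flag value is 1 << 17 as
in include/uapi/linux/net_tstamp.h at the time of this thread. The macro
and both helpers are hypothetical names for illustration only.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed value of SOF_TIMESTAMPING_OPT_RX_FILTER; not taken from a header. */
#define MODEL_SOF_TIMESTAMPING_OPT_RX_FILTER (1u << 17)

/* Store flags in a 16-bit field, as a u16 sk_tsflags would. */
uint16_t store_tsflags_u16(uint32_t flags)
{
	return (uint16_t)flags;	/* bits 16 and above are silently dropped */
}

/* Store flags in a 32-bit field, as the existing u32 sk_tsflags does. */
uint32_t store_tsflags_u32(uint32_t flags)
{
	return flags;
}
```

With a 16-bit field the flag is lost without any error, so a later test
of sk_tsflags for that bit would always see it as unset.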

> +#endif
>         __cacheline_group_end(sock_write_rxtx);
>
>         __cacheline_group_begin(sock_write_tx);
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index d1948d357dad..224b697bae9d 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -961,7 +961,8 @@ struct tcp_skb_cb {
>         __u8            txstamp_ack:1,  /* Record TX timestamp for ack? */
>                         eor:1,          /* Is skb MSG_EOR marked? */
>                         has_rxtstamp:1, /* SKB has a RX timestamp       */
> -                       unused:5;
> +                       bpf_txstamp_ack:1,
> +                       unused:4;
>         __u32           ack_seq;        /* Sequence number ACK'd        */
>         union {
>                 struct {
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index f28b6527e815..2ff7ff0ebdab 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -7014,6 +7014,7 @@ enum {
>                                          * by the kernel or the
>                                          * earlier bpf-progs.
>                                          */
> +       BPF_SOCK_OPS_TX_TIMESTAMPING_CB,
>   };
>
>   /* List of TCP states. There is a build check in net/ipv4/tcp.c to detect
> @@ -7080,6 +7081,7 @@ enum {
>         TCP_BPF_SYN_IP          = 1006, /* Copy the IP[46] and TCP header */
>         TCP_BPF_SYN_MAC         = 1007, /* Copy the MAC, IP[46], and TCP header */
>         TCP_BPF_SOCK_OPS_CB_FLAGS = 1008, /* Get or Set TCP sock ops flags */
> +       SK_BPF_CB_FLAGS         = 1009,
>   };
>
>   enum {
> diff --git a/net/core/filter.c b/net/core/filter.c
> index e31ee8be2de0..81a36e50047b 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -5206,6 +5206,19 @@ static const struct bpf_func_proto bpf_get_socket_uid_proto = {
>         .arg1_type      = ARG_PTR_TO_CTX,
>   };
>
> +static int sk_bpf_cb_flags(struct sock *sk, int sk_bpf_cb_flags, bool getopt)
> +{
> +       if (getopt)
> +               return -EINVAL;
> +
> +       if (sk_bpf_cb_flags & ~SK_BPF_CB_MASK)
> +               return -EINVAL;
> +
> +       sk->sk_bpf_cb_flags = sk->sk_bpf_cb_flags;
> +
> +       return 0;
> +}
> +
>   static int sol_socket_sockopt(struct sock *sk, int optname,
>                               char *optval, int *optlen,
>                               bool getopt)
> @@ -5222,6 +5235,7 @@ static int sol_socket_sockopt(struct sock *sk, int optname,
>         case SO_MAX_PACING_RATE:
>         case SO_BINDTOIFINDEX:
>         case SO_TXREHASH:
> +       case SK_BPF_CB_FLAGS:
>                 if (*optlen != sizeof(int))
>                         return -EINVAL;
>                 break;
> @@ -5231,6 +5245,9 @@ static int sol_socket_sockopt(struct sock *sk, int optname,
>                 return -EINVAL;
>         }
>
> +       if (optname == SK_BPF_CB_FLAGS)
> +               return sk_bpf_cb_flags(sk, *(int *)optval, getopt);
> +
>         if (getopt) {
>                 if (optname == SO_BINDTODEVICE)
>                         return -EINVAL;
> diff --git a/net/core/sock.c b/net/core/sock.c
> index 039be95c40cf..d0406639cee9 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -137,6 +137,7 @@
>   #include <linux/sock_diag.h>
>
>   #include <linux/filter.h>
> +#include <linux/bpf-cgroup.h>
>   #include <net/sock_reuseport.h>
>   #include <net/bpf_sk_storage.h>
>
> @@ -946,6 +947,20 @@ int sock_set_timestamping(struct sock *sk, int optname,
>         return 0;
>   }
>
> +#ifdef CONFIG_BPF_SYSCALL
> +void bpf_skops_tx_timestamping(struct sock *sk, struct sk_buff *skb)
> +{
> +       struct bpf_sock_ops_kern sock_ops;
> +
> +       memset(&sock_ops, 0, offsetof(struct bpf_sock_ops_kern, temp));
> +       sock_ops.op = BPF_SOCK_OPS_TX_TIMESTAMPING_CB;
> +       sock_ops.is_fullsock = 1;
> +       sock_ops.sk = sk;
> +       sock_ops.skb = skb;
> +       __cgroup_bpf_run_filter_sock_ops(sk, &sock_ops, CGROUP_SOCK_OPS);
> +}
> +#endif
> +
>   void sock_set_keepalive(struct sock *sk)
>   {
>         lock_sock(sk);
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 4f77bd862e95..1e7f2d5fd792 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -491,6 +491,15 @@ static void tcp_tx_timestamp(struct sock *sk, u16 tsflags)
>                 if (tsflags & SOF_TIMESTAMPING_TX_RECORD_MASK)
>                         shinfo->tskey = TCP_SKB_CB(skb)->seq + skb->len - 1;
>         }
> +
> +       if (cgroup_bpf_enabled(CGROUP_SOCK_OPS) &&
> +           SK_BPF_CB_FLAG_TEST(sk, SK_BPF_CB_TX_TIMESTAMPING))
> +               /* The bpf prog can do:
> +                * shinfo->tx_flags |= SKBTX_BPF,
> +                * tcb->bpf_txstamp_ack = 1,
> +                * shinfo->tskey = TCP_SKB_CB(skb)->seq + skb->len - 1 (if tskey not set)
> +                */
> +               bpf_skops_tx_timestamping(sk, skb);
>   }
>
>
> >
> > Overall the suggestion looks good to me. I can give it a try :)
> >
> > I'm thinking of another approach to using bpf_sock_ops_cb_flags_set()
> > instead of bpf_setsockopt() when sockops like
> > BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB is triggered. I can modify the
> > bpf_sock_ops_cb_flags_set like this:
> > diff --git a/net/core/filter.c b/net/core/filter.c
> > index 58761263176c..001140067c1a 100644
> > --- a/net/core/filter.c
> > +++ b/net/core/filter.c
> > @@ -5770,14 +5770,25 @@ BPF_CALL_2(bpf_sock_ops_cb_flags_set, struct
> > bpf_sock_ops_kern *, bpf_sock,
> >             int, argval)
> >   {
> >          struct sock *sk = bpf_sock->sk;
> > -       int val = argval & BPF_SOCK_OPS_ALL_CB_FLAGS;
> > +       int val = argval;
> >
> > -       if (!IS_ENABLED(CONFIG_INET) || !sk_fullsock(sk))
> > +       if (!IS_ENABLED(CONFIG_INET))
> >                  return -EINVAL;
> >
> > -       tcp_sk(sk)->bpf_sock_ops_cb_flags = val;
> > +       if (sk_is_tcp(sk)) {
> > +               val = argval & BPF_SOCK_OPS_ALL_CB_FLAGS;
> > +               if (!sk_fullsock(sk))
> > +                       return -EINVAL;
> > +
> > +               tcp_sk(sk)->bpf_sock_ops_cb_flags = val;
> > +
> > +               val = argval & (~BPF_SOCK_OPS_ALL_CB_FLAGS);
> > +       } else {
> > +               sk->bpf_sock_ops_cb_flags = val;
>
> Why separate tcp vs non-tcp case? The tcp_sk(sk)->bpf_sock_ops_cb_flags
> is running out of bits anyway for tcp specific callback.
>
> just keep the SK_BPF_CB_{TX,RX}_TIEMSTAMPING in sk->sk_bpf_cb_flags
> for all tcp/udp/raw/...

Agreed!

>
> > +               val = argval &
> > (~(SK_BPF_CB_TX_TIEMSTAMPING|SK_BPF_CB_RX_TIEMSTAMPING));
>
> imo, we also don't need to return val to tell the caller what
> is not supported in the running kernel. The BPF CO-RE can
> handle this also, so less reason to keep extending the
> bpf_sock_ops_cb_flags_set API for non tcp.
>
> >>>> For datagrams (UDP as well as RAW and many non IP protocols), an
> >>>> alternative still needs to be found.
> >>
> >> In udp/raw/..., I don't know how likely is the user space having "cork->tx_flags
> >> & SKBTX_ANY_TSTAMP" set but has neither "READ_ONCE(sk->sk_tsflags) &
> >> SOF_TIMESTAMPING_OPT_ID" nor "cork->flags & IPCORK_TS_OPT_ID" set. If it is
> >> unlikely, may be we can just disallow bpf prog from directly setting
> >> skb_shinfo(skb)->tskey for this particular skb.
> >>
> >> For all other cases, in __ip[6]_append_data, directly call a bpf prog and also
> >> pass the kernel decided tskey to the bpf prog.
> >
> > I'm a bit confused here. IIUC, we need to support the tskey like what
> > we did in this series to handle non TCP cases?
>
> Like tcp, I don't think it really needs to use the sk->sk_tskey to mark the
> ID of a skb for the non tcp cases also. will comment on another thread.

Fine, I will drop the tskey logic in v4.

Thanks,
Jason

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH net-next v3 02/14] net-timestamp: allow two features to work parallelly
  2024-11-05  2:09                                     ` Martin KaFai Lau
@ 2024-11-05  6:22                                       ` Jason Xing
  2024-11-05 19:22                                         ` Martin KaFai Lau
  2024-11-05 14:29                                       ` Willem de Bruijn
  1 sibling, 1 reply; 88+ messages in thread
From: Jason Xing @ 2024-11-05  6:22 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: Willem de Bruijn, willemb, davem, edumazet, kuba, pabeni, dsahern,
	ast, daniel, andrii, eddyz87, song, yonghong.song, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, shuah, ykolal, bpf, netdev,
	Jason Xing

On Tue, Nov 5, 2024 at 10:09 AM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 11/1/24 6:32 AM, Willem de Bruijn wrote:
> >> In udp/raw/..., I don't know how likely is the user space having "cork->tx_flags
> >> & SKBTX_ANY_TSTAMP" set but has neither "READ_ONCE(sk->sk_tsflags) &
> >> SOF_TIMESTAMPING_OPT_ID" nor "cork->flags & IPCORK_TS_OPT_ID" set.
> > This is not something to rely on. OPT_ID was added relatively recently.
> > Older applications, or any that just use the most straightforward API,
> > will not set this.
>
> Good point that the OPT_ID per cmsg is very new.
>
> The datagram support on SOF_TIMESTAMPING_OPT_ID in sk->sk_tsflags had
> been there for quite some time now. Is it a safe assumption that
> most applications doing udp tx timestamping should have
> the SOF_TIMESTAMPING_OPT_ID set to be useful?
>
> >
> >> If it is
> >> unlikely, may be we can just disallow bpf prog from directly setting
> >> skb_shinfo(skb)->tskey for this particular skb.
> >>
> >> For all other cases, in __ip[6]_append_data, directly call a bpf prog and also
> >> pass the kernel decided tskey to the bpf prog.
> >>
> >> The kernel passed tskey could be 0 (meaning the user space has not used it). The
> >> bpf prog can give one for the kernel to use. The bpf prog can store the
> >> sk_tskey_bpf in the bpf_sk_storage now. Meaning no need to add one to the struct
> >> sock. The bpf prog does not have to start from 0 (e.g. start from U32_MAX
> >> instead) if it helps.
> >>
> >> If the kernel passed tskey is not 0, the bpf prog can just use that one
> >> (assuming the user space is doing something sane, like the value in
> >> SCM_TS_OPT_ID won't be jumping back and front between 0 to U32_MAX). I hope this
> >> is very unlikely also (?) but the bpf prog can probably detect this and choose
> >> to ignore this sk.
> > If an applications uses OPT_ID, it is unlikely that they will toggle
> > the feature on and off on a per-packet basis. So in the common case
> > the program could use the user-set counter or use its own if userspace
> > does not enable the feature. In the rare case that an application does
> > intermittently set an OPT_ID, the numbering would be erratic. This
> > does mean that an actively malicious application could mess with admin
> > measurements.
>
> All make sense. Given it is reasonable to assume the user space should either
> has SOF_TIMESTAMPING_OPT_ID always on or always off. When it is off, the bpf
> prog can directly provide its own tskey to be used in shinfo->tskey. The bpf
> prog can generate the id itself without using the sk->sk_tskey, e.g. store an
> atomic int in the bpf_sk_storage.

I wonder, how can we correlate the key with each skb in the bpf
program for non-TCP sockets without implementing a bpf extension for
SCM_TS_OPT_ID? Every time a timestamp is reported, we cannot know
which sendmsg() the skb belongs to in the non-TCP cases.

Thanks,
Jason

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH net-next v3 02/14] net-timestamp: allow two features to work parallelly
  2024-11-05  2:09                                     ` Martin KaFai Lau
  2024-11-05  6:22                                       ` Jason Xing
@ 2024-11-05 14:29                                       ` Willem de Bruijn
  1 sibling, 0 replies; 88+ messages in thread
From: Willem de Bruijn @ 2024-11-05 14:29 UTC (permalink / raw)
  To: Martin KaFai Lau, Willem de Bruijn, Jason Xing
  Cc: willemb, davem, edumazet, kuba, pabeni, dsahern, ast, daniel,
	andrii, eddyz87, song, yonghong.song, john.fastabend, kpsingh,
	sdf, haoluo, jolsa, shuah, ykolal, bpf, netdev, Jason Xing

Martin KaFai Lau wrote:
> On 11/1/24 6:32 AM, Willem de Bruijn wrote:
> >> In udp/raw/..., I don't know how likely is the user space having "cork->tx_flags
> >> & SKBTX_ANY_TSTAMP" set but has neither "READ_ONCE(sk->sk_tsflags) &
> >> SOF_TIMESTAMPING_OPT_ID" nor "cork->flags & IPCORK_TS_OPT_ID" set.
> > This is not something to rely on. OPT_ID was added relatively recently.
> > Older applications, or any that just use the most straightforward API,
> > will not set this.
> 
> Good point that the OPT_ID per cmsg is very new.
> 
> The datagram support on SOF_TIMESTAMPING_OPT_ID in sk->sk_tsflags had
> been there for quite some time now. Is it a safe assumption that
> most applications doing udp tx timestamping should have
> the SOF_TIMESTAMPING_OPT_ID set to be useful?

I don't think so.

The very first open source code I happen to look at, github.com/ptpd,
already sets SO_TIMESTAMPING without OPT_ID.


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH net-next v3 02/14] net-timestamp: allow two features to work parallelly
  2024-11-05  6:22                                       ` Jason Xing
@ 2024-11-05 19:22                                         ` Martin KaFai Lau
  2024-11-06  0:17                                           ` Jason Xing
  0 siblings, 1 reply; 88+ messages in thread
From: Martin KaFai Lau @ 2024-11-05 19:22 UTC (permalink / raw)
  To: Jason Xing
  Cc: Willem de Bruijn, willemb, davem, edumazet, kuba, pabeni, dsahern,
	ast, daniel, andrii, eddyz87, song, yonghong.song, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, shuah, ykolal, bpf, netdev,
	Jason Xing

On 11/4/24 10:22 PM, Jason Xing wrote:
> On Tue, Nov 5, 2024 at 10:09 AM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>>
>> On 11/1/24 6:32 AM, Willem de Bruijn wrote:
>>>> In udp/raw/..., I don't know how likely is the user space having "cork->tx_flags
>>>> & SKBTX_ANY_TSTAMP" set but has neither "READ_ONCE(sk->sk_tsflags) &
>>>> SOF_TIMESTAMPING_OPT_ID" nor "cork->flags & IPCORK_TS_OPT_ID" set.
>>> This is not something to rely on. OPT_ID was added relatively recently.
>>> Older applications, or any that just use the most straightforward API,
>>> will not set this.
>>
>> Good point that the OPT_ID per cmsg is very new.
>>
>> The datagram support on SOF_TIMESTAMPING_OPT_ID in sk->sk_tsflags had
>> been there for quite some time now. Is it a safe assumption that
>> most applications doing udp tx timestamping should have
>> the SOF_TIMESTAMPING_OPT_ID set to be useful?
>>
>>>
>>>> If it is
>>>> unlikely, may be we can just disallow bpf prog from directly setting
>>>> skb_shinfo(skb)->tskey for this particular skb.
>>>>
>>>> For all other cases, in __ip[6]_append_data, directly call a bpf prog and also
>>>> pass the kernel decided tskey to the bpf prog.
>>>>
>>>> The kernel passed tskey could be 0 (meaning the user space has not used it). The
>>>> bpf prog can give one for the kernel to use. The bpf prog can store the
>>>> sk_tskey_bpf in the bpf_sk_storage now. Meaning no need to add one to the struct
>>>> sock. The bpf prog does not have to start from 0 (e.g. start from U32_MAX
>>>> instead) if it helps.
>>>>
>>>> If the kernel passed tskey is not 0, the bpf prog can just use that one
>>>> (assuming the user space is doing something sane, like the value in
>>>> SCM_TS_OPT_ID won't be jumping back and front between 0 to U32_MAX). I hope this
>>>> is very unlikely also (?) but the bpf prog can probably detect this and choose
>>>> to ignore this sk.
>>> If an applications uses OPT_ID, it is unlikely that they will toggle
>>> the feature on and off on a per-packet basis. So in the common case
>>> the program could use the user-set counter or use its own if userspace
>>> does not enable the feature. In the rare case that an application does
>>> intermittently set an OPT_ID, the numbering would be erratic. This
>>> does mean that an actively malicious application could mess with admin
>>> measurements.
>>
>> All make sense. Given it is reasonable to assume the user space should either
>> has SOF_TIMESTAMPING_OPT_ID always on or always off. When it is off, the bpf
>> prog can directly provide its own tskey to be used in shinfo->tskey. The bpf
>> prog can generate the id itself without using the sk->sk_tskey, e.g. store an
>> atomic int in the bpf_sk_storage.
> 
> I wonder, how can we correlate the key with each skb in the bpf
> program for non-TCP type without implementing a bpf extension for
> SCM_TS_OPT_ID? Every time the timestamp is reported, we cannot know
> which sendmsg() the skb belongs to for non-TCP cases.

SCM_TS_OPT_ID is eventually setting the shinfo->tskey.
If the shinfo->tskey is not set by the user space, the bpf prog can directly set 
the shinfo->tskey. There is no need to use the sk->sk_tskey as the ID generator
either. The bpf prog can have its own id generator.

If the user space has already set the shinfo->tskey (either by sk->sk_tskey or 
SCM_TS_OPT_ID), the bpf prog can just use the user space one.

If there is a weird application that flip-flops between OPT_ID on/off, the bpf
prog will get confused, which is fine. The bpf prog can detect this and choose to
ignore measuring this sk/skb. The bpf prog can also choose to be on the very 
safe side and ignore all skb with SKBTX_ANY_TSTAMP set in txflags but with no 
OPT_ID. The bpf prog can look into the details of the sk and skb to decide what 
makes the most sense for its deployment.

I don't know whether it makes more sense to call the bpf prog to decide the 
shinfo->{tx_flags,tskey} just before the "while (length > 0)" in 
__ip[6]_append_data or it is better to call the bpf prog in ip[6]_setup_cork.
I am admittedly less familiar with this code path than the tcp one.
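
The three cases above (no user key, user key present, flip-flopping
application) can be sketched as a small userspace C model. This is only an
illustration, not kernel or bpf code: struct tskey_state, pick_tskey() and the
TSKEY_* names are hypothetical, and the state struct stands in for whatever the
bpf prog would keep in bpf_sk_storage.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stddef.h>

/* Hypothetical per-socket state; a real bpf prog might keep this in
 * bpf_sk_storage instead of adding a field to struct sock. */
struct tskey_state {
	uint32_t next_id;       /* private ID generator, independent of sk->sk_tskey */
	bool     seen_user_key; /* user space supplied a key at least once */
	bool     seen_no_key;   /* a packet arrived without a user key at least once */
	bool     ignore;        /* socket flip-flopped, stop measuring it */
};

enum tskey_action { TSKEY_USE_USER, TSKEY_USE_OWN, TSKEY_SKIP };

/* Decide which key to use for one packet.
 * user_key_set models "shinfo->tskey was already set by user space"
 * (via sk->sk_tskey or SCM_TS_OPT_ID). */
static enum tskey_action pick_tskey(struct tskey_state *st, bool user_key_set,
				    uint32_t user_key, uint32_t *out)
{
	assert(st != NULL && out != NULL);

	if (user_key_set)
		st->seen_user_key = true;
	else
		st->seen_no_key = true;

	/* An application that mixes both modes makes the numbering erratic,
	 * so this sketch gives up on the socket entirely. */
	if (st->seen_user_key && st->seen_no_key)
		st->ignore = true;
	if (st->ignore)
		return TSKEY_SKIP;

	if (user_key_set) {
		*out = user_key;        /* reuse the user space key */
		return TSKEY_USE_USER;
	}

	*out = st->next_id++;           /* generate our own key */
	return TSKEY_USE_OWN;
}
```

A socket that always supplies its own key gets TSKEY_USE_USER, one that never
does gets consecutive private IDs, and one that mixes both is dropped from
measurement from the first inconsistent packet onward.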


* Re: [PATCH net-next v3 02/14] net-timestamp: allow two features to work parallelly
  2024-11-05 19:22                                         ` Martin KaFai Lau
@ 2024-11-06  0:17                                           ` Jason Xing
  2024-11-06  1:09                                             ` Martin KaFai Lau
  2024-11-06  1:11                                             ` Willem de Bruijn
  0 siblings, 2 replies; 88+ messages in thread
From: Jason Xing @ 2024-11-06  0:17 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: Willem de Bruijn, willemb, davem, edumazet, kuba, pabeni, dsahern,
	ast, daniel, andrii, eddyz87, song, yonghong.song, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, shuah, ykolal, bpf, netdev,
	Jason Xing

On Wed, Nov 6, 2024 at 3:22 AM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 11/4/24 10:22 PM, Jason Xing wrote:
> > On Tue, Nov 5, 2024 at 10:09 AM Martin KaFai Lau <martin.lau@linux.dev> wrote:
> >>
> >> On 11/1/24 6:32 AM, Willem de Bruijn wrote:
> >>>> In udp/raw/..., I don't know how likely is the user space having "cork->tx_flags
> >>>> & SKBTX_ANY_TSTAMP" set but has neither "READ_ONCE(sk->sk_tsflags) &
> >>>> SOF_TIMESTAMPING_OPT_ID" nor "cork->flags & IPCORK_TS_OPT_ID" set.
> >>> This is not something to rely on. OPT_ID was added relatively recently.
> >>> Older applications, or any that just use the most straightforward API,
> >>> will not set this.
> >>
> >> Good point that the OPT_ID per cmsg is very new.
> >>
> >> The datagram support on SOF_TIMESTAMPING_OPT_ID in sk->sk_tsflags had
> >> been there for quite some time now. Is it a safe assumption that
> >> most applications doing udp tx timestamping should have
> >> the SOF_TIMESTAMPING_OPT_ID set to be useful?
> >>
> >>>
> >>>> If it is
> >>>> unlikely, may be we can just disallow bpf prog from directly setting
> >>>> skb_shinfo(skb)->tskey for this particular skb.
> >>>>
> >>>> For all other cases, in __ip[6]_append_data, directly call a bpf prog and also
> >>>> pass the kernel decided tskey to the bpf prog.
> >>>>
> >>>> The kernel passed tskey could be 0 (meaning the user space has not used it). The
> >>>> bpf prog can give one for the kernel to use. The bpf prog can store the
> >>>> sk_tskey_bpf in the bpf_sk_storage now. Meaning no need to add one to the struct
> >>>> sock. The bpf prog does not have to start from 0 (e.g. start from U32_MAX
> >>>> instead) if it helps.
> >>>>
> >>>> If the kernel passed tskey is not 0, the bpf prog can just use that one
> >>>> (assuming the user space is doing something sane, like the value in
> >>>> SCM_TS_OPT_ID won't be jumping back and front between 0 to U32_MAX). I hope this
> >>>> is very unlikely also (?) but the bpf prog can probably detect this and choose
> >>>> to ignore this sk.
> >>> If an applications uses OPT_ID, it is unlikely that they will toggle
> >>> the feature on and off on a per-packet basis. So in the common case
> >>> the program could use the user-set counter or use its own if userspace
> >>> does not enable the feature. In the rare case that an application does
> >>> intermittently set an OPT_ID, the numbering would be erratic. This
> >>> does mean that an actively malicious application could mess with admin
> >>> measurements.
> >>
> >> All make sense. Given it is reasonable to assume the user space should either
> >> has SOF_TIMESTAMPING_OPT_ID always on or always off. When it is off, the bpf
> >> prog can directly provide its own tskey to be used in shinfo->tskey. The bpf
> >> prog can generate the id itself without using the sk->sk_tskey, e.g. store an
> >> atomic int in the bpf_sk_storage.
> >
> > I wonder, how can we correlate the key with each skb in the bpf
> > program for non-TCP type without implementing a bpf extension for
> > SCM_TS_OPT_ID? Every time the timestamp is reported, we cannot know
> > which sendmsg() the skb belongs to for non-TCP cases.
>
> SCM_TS_OPT_ID is eventually setting the shinfo->tskey.
> If the shinfo->tskey is not set by the user space, the bpf prog can directly set
> the shinfo->tskey. There is no need to use the sk->sk_tskey as the ID generator
> also. The bpf prog can have its own id generator.
>
> If the user space has already set the shinfo->tskey (either by sk->sk_tskey or
> SCM_TS_OPT_ID), the bpf prog can just use the user space one.
>
> If there is a weird application that flips flops between OPT_ID on/off, the bpf
> prog will get confused which is fine. The bpf prog can detect this and choose to
> ignore measuring this sk/skb. The bpf prog can also choose to be on the very
> safe side and ignore all skb with SKBTX_ANY_TSTAMP set in txflags but with no
> OPT_ID. The bpf prog can look into the details of the sk and skb to decide what
> makes the most sense for its deployment.
>
> I don't know whether it makes more sense to call the bpf prog to decide the
> shinfo->{tx_flags,tskey} just before the "while (length > 0)" in
> __ip[6]_append_data or it is better to call the bpf prog in ip[6]_setup_cork.
> I admittedly less familiar with this code path than the tcp one.

Now I feel it could be complicated for a software engineer to consider
how they will handle the key if they don't read the kernel code very
carefully. They are facing different situations. Being user-friendly
lets this feature have more chances to get widely used. As I insisted
before, I still would like to know if it is possible that we can try
to introduce sk_tskey_bpf_offset (like patches 10-12) to calculate a
bpf-exclusive tskey for bpf use? Only exporting one key. It will be really
simple and easy-to-use :)

Thanks,
Jason


* Re: [PATCH net-next v3 02/14] net-timestamp: allow two features to work parallelly
  2024-11-06  0:17                                           ` Jason Xing
@ 2024-11-06  1:09                                             ` Martin KaFai Lau
  2024-11-06  2:51                                               ` Jason Xing
  2024-11-06  1:11                                             ` Willem de Bruijn
  1 sibling, 1 reply; 88+ messages in thread
From: Martin KaFai Lau @ 2024-11-06  1:09 UTC (permalink / raw)
  To: Jason Xing
  Cc: Willem de Bruijn, willemb, davem, edumazet, kuba, pabeni, dsahern,
	ast, daniel, andrii, eddyz87, song, yonghong.song, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, shuah, ykolal, bpf, netdev,
	Jason Xing

On 11/5/24 4:17 PM, Jason Xing wrote:
> On Wed, Nov 6, 2024 at 3:22 AM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>>
>> On 11/4/24 10:22 PM, Jason Xing wrote:
>>> On Tue, Nov 5, 2024 at 10:09 AM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>>>>
>>>> On 11/1/24 6:32 AM, Willem de Bruijn wrote:
>>>>>> In udp/raw/..., I don't know how likely is the user space having "cork->tx_flags
>>>>>> & SKBTX_ANY_TSTAMP" set but has neither "READ_ONCE(sk->sk_tsflags) &
>>>>>> SOF_TIMESTAMPING_OPT_ID" nor "cork->flags & IPCORK_TS_OPT_ID" set.
>>>>> This is not something to rely on. OPT_ID was added relatively recently.
>>>>> Older applications, or any that just use the most straightforward API,
>>>>> will not set this.
>>>>
>>>> Good point that the OPT_ID per cmsg is very new.
>>>>
>>>> The datagram support on SOF_TIMESTAMPING_OPT_ID in sk->sk_tsflags had
>>>> been there for quite some time now. Is it a safe assumption that
>>>> most applications doing udp tx timestamping should have
>>>> the SOF_TIMESTAMPING_OPT_ID set to be useful?
>>>>
>>>>>
>>>>>> If it is
>>>>>> unlikely, may be we can just disallow bpf prog from directly setting
>>>>>> skb_shinfo(skb)->tskey for this particular skb.
>>>>>>
>>>>>> For all other cases, in __ip[6]_append_data, directly call a bpf prog and also
>>>>>> pass the kernel decided tskey to the bpf prog.
>>>>>>
>>>>>> The kernel passed tskey could be 0 (meaning the user space has not used it). The
>>>>>> bpf prog can give one for the kernel to use. The bpf prog can store the
>>>>>> sk_tskey_bpf in the bpf_sk_storage now. Meaning no need to add one to the struct
>>>>>> sock. The bpf prog does not have to start from 0 (e.g. start from U32_MAX
>>>>>> instead) if it helps.
>>>>>>
>>>>>> If the kernel passed tskey is not 0, the bpf prog can just use that one
>>>>>> (assuming the user space is doing something sane, like the value in
>>>>>> SCM_TS_OPT_ID won't be jumping back and front between 0 to U32_MAX). I hope this
>>>>>> is very unlikely also (?) but the bpf prog can probably detect this and choose
>>>>>> to ignore this sk.
>>>>> If an applications uses OPT_ID, it is unlikely that they will toggle
>>>>> the feature on and off on a per-packet basis. So in the common case
>>>>> the program could use the user-set counter or use its own if userspace
>>>>> does not enable the feature. In the rare case that an application does
>>>>> intermittently set an OPT_ID, the numbering would be erratic. This
>>>>> does mean that an actively malicious application could mess with admin
>>>>> measurements.
>>>>
>>>> All make sense. Given it is reasonable to assume the user space should either
>>>> has SOF_TIMESTAMPING_OPT_ID always on or always off. When it is off, the bpf
>>>> prog can directly provide its own tskey to be used in shinfo->tskey. The bpf
>>>> prog can generate the id itself without using the sk->sk_tskey, e.g. store an
>>>> atomic int in the bpf_sk_storage.
>>>
>>> I wonder, how can we correlate the key with each skb in the bpf
>>> program for non-TCP type without implementing a bpf extension for
>>> SCM_TS_OPT_ID? Every time the timestamp is reported, we cannot know
>>> which sendmsg() the skb belongs to for non-TCP cases.
>>
>> SCM_TS_OPT_ID is eventually setting the shinfo->tskey.
>> If the shinfo->tskey is not set by the user space, the bpf prog can directly set
>> the shinfo->tskey. There is no need to use the sk->sk_tskey as the ID generator
>> also. The bpf prog can have its own id generator.
>>
>> If the user space has already set the shinfo->tskey (either by sk->sk_tskey or
>> SCM_TS_OPT_ID), the bpf prog can just use the user space one.
>>
>> If there is a weird application that flips flops between OPT_ID on/off, the bpf
>> prog will get confused which is fine. The bpf prog can detect this and choose to
>> ignore measuring this sk/skb. The bpf prog can also choose to be on the very
>> safe side and ignore all skb with SKBTX_ANY_TSTAMP set in txflags but with no
>> OPT_ID. The bpf prog can look into the details of the sk and skb to decide what
>> makes the most sense for its deployment.
>>
>> I don't know whether it makes more sense to call the bpf prog to decide the
>> shinfo->{tx_flags,tskey} just before the "while (length > 0)" in
>> __ip[6]_append_data or it is better to call the bpf prog in ip[6]_setup_cork.
>> I admittedly less familiar with this code path than the tcp one.
> 
> Now I feel it could be complicated for a software engineer to consider
> how they will handle the key if they don't read the kernel code very
> carefully. They are facing different situations. Being user-friendly
> lets this feature have more chances to get widely used. As I insisted
> before, I still would like to know if it is possible that we can try
> to introduce sk_tskey_bpf_offset (like patch 10-12) to calculate a bpf
> exclusive tskey for bpf use? Only exporting one key. It will be really
> simple and easy-to-use :)

imo, there is no need for adding sk_tskey_bpf_offset to sk. Just allow the bpf
prog to decide what the tskey is.

There is no usability issue in bpf prog. It is pretty normal for a bpf prog
author to look at the sk details to make a decision.

Abstracting the sk/skb is not helping the bpf prog and is not the right direction
to go. Over time, there has been case after case where the bpf prog wants to know
more instead of having details abstracted away as if it were running in user
space. e.g. The "struct bpf_sock" abstraction in the uapi/linux/bpf.h does not
scale and we have stopped adding more abstraction this way. The btf (and
PTR_TO_BTF_ID, CO-RE...etc) has been added to allow the bpf prog to learn other
details in sk and skb.

Instead, design a better bpf kfunc to help the bpf prog set the bits/tskey in
the skb. I think this is more important. The tcp tskey is easy. The udp tskey
just needs some care, including a check for whether user space has already set
one. A well-designed bpf kfunc is all it needs.
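
As a rough picture of the contract such a kfunc could offer, here is a
userspace C model. Nothing in it is an existing kernel API: struct
model_shinfo, MODEL_TX_BPF_TSTAMP and model_bpf_set_tstamp() are made-up names,
and the struct only mirrors the tx_flags/tskey fields discussed in this thread.
The one property it tries to show is that the helper never overwrites a key
that user space already set.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stddef.h>

/* Userspace stand-in for the few skb_shared_info fields under discussion. */
struct model_shinfo {
	uint8_t  tx_flags;
	uint32_t tskey;
	bool     tskey_from_user; /* set via sk->sk_tskey or SCM_TS_OPT_ID */
};

#define MODEL_TX_BPF_TSTAMP 0x80 /* hypothetical "bpf wants timestamps" bit */

/* Hypothetical helper contract:
 *   returns 0       if bpf_key was installed as the tskey,
 *   returns 1       if user space already set a key and it was kept,
 *   returns -EINVAL if effective_key is NULL.
 * On success *effective_key holds the key that will be reported. */
static int model_bpf_set_tstamp(struct model_shinfo *sh, uint32_t bpf_key,
				uint32_t *effective_key)
{
	assert(sh != NULL);

	if (!effective_key)
		return -EINVAL;

	sh->tx_flags |= MODEL_TX_BPF_TSTAMP;

	if (!sh->tskey_from_user)
		sh->tskey = bpf_key;    /* no user key: the bpf prog decides */

	*effective_key = sh->tskey;
	return sh->tskey_from_user ? 1 : 0;
}
```

With this shape the bpf prog does not need a separate sk_tskey_bpf_offset: it
either learns the key user space chose or installs its own.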


* Re: [PATCH net-next v3 02/14] net-timestamp: allow two features to work parallelly
  2024-11-06  0:17                                           ` Jason Xing
  2024-11-06  1:09                                             ` Martin KaFai Lau
@ 2024-11-06  1:11                                             ` Willem de Bruijn
  2024-11-06  2:37                                               ` Jason Xing
  1 sibling, 1 reply; 88+ messages in thread
From: Willem de Bruijn @ 2024-11-06  1:11 UTC (permalink / raw)
  To: Jason Xing, Martin KaFai Lau
  Cc: Willem de Bruijn, willemb, davem, edumazet, kuba, pabeni, dsahern,
	ast, daniel, andrii, eddyz87, song, yonghong.song, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, shuah, ykolal, bpf, netdev,
	Jason Xing

Jason Xing wrote:
> On Wed, Nov 6, 2024 at 3:22 AM Martin KaFai Lau <martin.lau@linux.dev> wrote:
> >
> > On 11/4/24 10:22 PM, Jason Xing wrote:
> > > On Tue, Nov 5, 2024 at 10:09 AM Martin KaFai Lau <martin.lau@linux.dev> wrote:
> > >>
> > >> On 11/1/24 6:32 AM, Willem de Bruijn wrote:
> > >>>> In udp/raw/..., I don't know how likely is the user space having "cork->tx_flags
> > >>>> & SKBTX_ANY_TSTAMP" set but has neither "READ_ONCE(sk->sk_tsflags) &
> > >>>> SOF_TIMESTAMPING_OPT_ID" nor "cork->flags & IPCORK_TS_OPT_ID" set.
> > >>> This is not something to rely on. OPT_ID was added relatively recently.
> > >>> Older applications, or any that just use the most straightforward API,
> > >>> will not set this.
> > >>
> > >> Good point that the OPT_ID per cmsg is very new.
> > >>
> > >> The datagram support on SOF_TIMESTAMPING_OPT_ID in sk->sk_tsflags had
> > >> been there for quite some time now. Is it a safe assumption that
> > >> most applications doing udp tx timestamping should have
> > >> the SOF_TIMESTAMPING_OPT_ID set to be useful?
> > >>
> > >>>
> > >>>> If it is
> > >>>> unlikely, may be we can just disallow bpf prog from directly setting
> > >>>> skb_shinfo(skb)->tskey for this particular skb.
> > >>>>
> > >>>> For all other cases, in __ip[6]_append_data, directly call a bpf prog and also
> > >>>> pass the kernel decided tskey to the bpf prog.
> > >>>>
> > >>>> The kernel passed tskey could be 0 (meaning the user space has not used it). The
> > >>>> bpf prog can give one for the kernel to use. The bpf prog can store the
> > >>>> sk_tskey_bpf in the bpf_sk_storage now. Meaning no need to add one to the struct
> > >>>> sock. The bpf prog does not have to start from 0 (e.g. start from U32_MAX
> > >>>> instead) if it helps.
> > >>>>
> > >>>> If the kernel passed tskey is not 0, the bpf prog can just use that one
> > >>>> (assuming the user space is doing something sane, like the value in
> > >>>> SCM_TS_OPT_ID won't be jumping back and front between 0 to U32_MAX). I hope this
> > >>>> is very unlikely also (?) but the bpf prog can probably detect this and choose
> > >>>> to ignore this sk.
> > >>> If an applications uses OPT_ID, it is unlikely that they will toggle
> > >>> the feature on and off on a per-packet basis. So in the common case
> > >>> the program could use the user-set counter or use its own if userspace
> > >>> does not enable the feature. In the rare case that an application does
> > >>> intermittently set an OPT_ID, the numbering would be erratic. This
> > >>> does mean that an actively malicious application could mess with admin
> > >>> measurements.
> > >>
> > >> All make sense. Given it is reasonable to assume the user space should either
> > >> has SOF_TIMESTAMPING_OPT_ID always on or always off. When it is off, the bpf
> > >> prog can directly provide its own tskey to be used in shinfo->tskey. The bpf
> > >> prog can generate the id itself without using the sk->sk_tskey, e.g. store an
> > >> atomic int in the bpf_sk_storage.
> > >
> > > I wonder, how can we correlate the key with each skb in the bpf
> > > program for non-TCP type without implementing a bpf extension for
> > > SCM_TS_OPT_ID? Every time the timestamp is reported, we cannot know
> > > which sendmsg() the skb belongs to for non-TCP cases.
> >
> > SCM_TS_OPT_ID is eventually setting the shinfo->tskey.
> > If the shinfo->tskey is not set by the user space, the bpf prog can directly set
> > the shinfo->tskey. There is no need to use the sk->sk_tskey as the ID generator
> > also. The bpf prog can have its own id generator.
> >
> > If the user space has already set the shinfo->tskey (either by sk->sk_tskey or
> > SCM_TS_OPT_ID), the bpf prog can just use the user space one.
> >
> > If there is a weird application that flips flops between OPT_ID on/off, the bpf
> > prog will get confused which is fine. The bpf prog can detect this and choose to
> > ignore measuring this sk/skb.

That will skew measurement and is under control of the process.

I don't immediately foresee this being used to measure untrusted
processes that would have an incentive to game this.

But the caveat should be stated explicitly.

> > The bpf prog can also choose to be on the very
> > safe side and ignore all skb with SKBTX_ANY_TSTAMP set in txflags but with no
> > OPT_ID. The bpf prog can look into the details of the sk and skb to decide what
> > makes the most sense for its deployment.
> >
> > I don't know whether it makes more sense to call the bpf prog to decide the
> > shinfo->{tx_flags,tskey} just before the "while (length > 0)" in
> > __ip[6]_append_data or it is better to call the bpf prog in ip[6]_setup_cork.
> > I admittedly less familiar with this code path than the tcp one.

Probably the current spot, mainly because no skb exists yet in
ip_setup_cork.
 
> Now I feel it could be complicated for a software engineer to consider
> how they will handle the key if they don't read the kernel code very
> carefully. They are facing different situations. Being user-friendly
> lets this feature have more chances to get widely used. As I insisted
> before, I still would like to know if it is possible that we can try
> to introduce sk_tskey_bpf_offset (like patch 10-12) to calculate a bpf
> exclusive tskey for bpf use? Only exporting one key. It will be really
> simple and easy-to-use :)

That has complications of its own. It also has to deal with the user
enabling/disabling/resetting its key, and with OPT_ID passed by cmsg.
Multiple skbs may be in flight, derived from each of these sources.
A single sk flag can only offset against one of them.

I think Martin's approach is more workable. Use the tskey that is set,
if any. Else, set one. 


* Re: [PATCH net-next v3 02/14] net-timestamp: allow two features to work parallelly
  2024-11-06  1:11                                             ` Willem de Bruijn
@ 2024-11-06  2:37                                               ` Jason Xing
  0 siblings, 0 replies; 88+ messages in thread
From: Jason Xing @ 2024-11-06  2:37 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Martin KaFai Lau, willemb, davem, edumazet, kuba, pabeni, dsahern,
	ast, daniel, andrii, eddyz87, song, yonghong.song, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, shuah, ykolal, bpf, netdev,
	Jason Xing

On Wed, Nov 6, 2024 at 9:11 AM Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
>
> Jason Xing wrote:
> > On Wed, Nov 6, 2024 at 3:22 AM Martin KaFai Lau <martin.lau@linux.dev> wrote:
> > >
> > > On 11/4/24 10:22 PM, Jason Xing wrote:
> > > > On Tue, Nov 5, 2024 at 10:09 AM Martin KaFai Lau <martin.lau@linux.dev> wrote:
> > > >>
> > > >> On 11/1/24 6:32 AM, Willem de Bruijn wrote:
> > > >>>> In udp/raw/..., I don't know how likely is the user space having "cork->tx_flags
> > > >>>> & SKBTX_ANY_TSTAMP" set but has neither "READ_ONCE(sk->sk_tsflags) &
> > > >>>> SOF_TIMESTAMPING_OPT_ID" nor "cork->flags & IPCORK_TS_OPT_ID" set.
> > > >>> This is not something to rely on. OPT_ID was added relatively recently.
> > > >>> Older applications, or any that just use the most straightforward API,
> > > >>> will not set this.
> > > >>
> > > >> Good point that the OPT_ID per cmsg is very new.
> > > >>
> > > >> The datagram support on SOF_TIMESTAMPING_OPT_ID in sk->sk_tsflags had
> > > >> been there for quite some time now. Is it a safe assumption that
> > > >> most applications doing udp tx timestamping should have
> > > >> the SOF_TIMESTAMPING_OPT_ID set to be useful?
> > > >>
> > > >>>
> > > >>>> If it is
> > > >>>> unlikely, may be we can just disallow bpf prog from directly setting
> > > >>>> skb_shinfo(skb)->tskey for this particular skb.
> > > >>>>
> > > >>>> For all other cases, in __ip[6]_append_data, directly call a bpf prog and also
> > > >>>> pass the kernel decided tskey to the bpf prog.
> > > >>>>
> > > >>>> The kernel passed tskey could be 0 (meaning the user space has not used it). The
> > > >>>> bpf prog can give one for the kernel to use. The bpf prog can store the
> > > >>>> sk_tskey_bpf in the bpf_sk_storage now. Meaning no need to add one to the struct
> > > >>>> sock. The bpf prog does not have to start from 0 (e.g. start from U32_MAX
> > > >>>> instead) if it helps.
> > > >>>>
> > > >>>> If the kernel passed tskey is not 0, the bpf prog can just use that one
> > > >>>> (assuming the user space is doing something sane, like the value in
> > > >>>> SCM_TS_OPT_ID won't be jumping back and front between 0 to U32_MAX). I hope this
> > > >>>> is very unlikely also (?) but the bpf prog can probably detect this and choose
> > > >>>> to ignore this sk.
> > > >>> If an applications uses OPT_ID, it is unlikely that they will toggle
> > > >>> the feature on and off on a per-packet basis. So in the common case
> > > >>> the program could use the user-set counter or use its own if userspace
> > > >>> does not enable the feature. In the rare case that an application does
> > > >>> intermittently set an OPT_ID, the numbering would be erratic. This
> > > >>> does mean that an actively malicious application could mess with admin
> > > >>> measurements.
> > > >>
> > > >> All make sense. Given it is reasonable to assume the user space should either
> > > >> has SOF_TIMESTAMPING_OPT_ID always on or always off. When it is off, the bpf
> > > >> prog can directly provide its own tskey to be used in shinfo->tskey. The bpf
> > > >> prog can generate the id itself without using the sk->sk_tskey, e.g. store an
> > > >> atomic int in the bpf_sk_storage.
> > > >
> > > > I wonder, how can we correlate the key with each skb in the bpf
> > > > program for non-TCP type without implementing a bpf extension for
> > > > SCM_TS_OPT_ID? Every time the timestamp is reported, we cannot know
> > > > which sendmsg() the skb belongs to for non-TCP cases.
> > >
> > > SCM_TS_OPT_ID is eventually setting the shinfo->tskey.
> > > If the shinfo->tskey is not set by the user space, the bpf prog can directly set
> > > the shinfo->tskey. There is no need to use the sk->sk_tskey as the ID generator
> > > also. The bpf prog can have its own id generator.
> > >
> > > If the user space has already set the shinfo->tskey (either by sk->sk_tskey or
> > > SCM_TS_OPT_ID), the bpf prog can just use the user space one.
> > >
> > > If there is a weird application that flips flops between OPT_ID on/off, the bpf
> > > prog will get confused which is fine. The bpf prog can detect this and choose to
> > > ignore measuring this sk/skb.
>
> That will skew measurement and is under control of the process.
>
> I don't immediately foresee this being used to measure untrusted
> processes that would have an incentive to game this.
>
> But the caveat should be stated explicitly.
>
> > > The bpf prog can also choose to be on the very
> > > safe side and ignore all skb with SKBTX_ANY_TSTAMP set in txflags but with no
> > > OPT_ID. The bpf prog can look into the details of the sk and skb to decide what
> > > makes the most sense for its deployment.
> > >
> > > I don't know whether it makes more sense to call the bpf prog to decide the
> > > shinfo->{tx_flags,tskey} just before the "while (length > 0)" in
> > > __ip[6]_append_data or it is better to call the bpf prog in ip[6]_setup_cork.
> > > I admittedly less familiar with this code path than the tcp one.
>
> Probably the current spot, mainly because no skb exists yet in
> ip_setup_cork.
>
> > Now I feel it could be complicated for a software engineer to consider
> > how they will handle the key if they don't read the kernel code very
> > carefully. They are facing different situations. Being user-friendly
> > lets this feature have more chances to get widely used. As I insisted
> > before, I still would like to know if it is possible that we can try
> > to introduce sk_tskey_bpf_offset (like patch 10-12) to calculate a bpf
> > exclusive tskey for bpf use? Only exporting one key. It will be really
> > simple and easy-to-use :)
>
> That has complications of its own. It also has to deal with the user
> enabling/disabling/resetting its key, and with OPT_ID passed by cmsg.
> Multiple skbs may be in flight, derived from each of these sources.
> A single sk flag can only offset against one of them.
>
> I think Martin's approach is more workable. Use the tskey that is set,
> if any. Else, set one.

Got it. Thanks!


* Re: [PATCH net-next v3 02/14] net-timestamp: allow two features to work parallelly
  2024-11-06  1:09                                             ` Martin KaFai Lau
@ 2024-11-06  2:51                                               ` Jason Xing
  2024-11-07  1:19                                                 ` Martin KaFai Lau
  0 siblings, 1 reply; 88+ messages in thread
From: Jason Xing @ 2024-11-06  2:51 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: Willem de Bruijn, willemb, davem, edumazet, kuba, pabeni, dsahern,
	ast, daniel, andrii, eddyz87, song, yonghong.song, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, shuah, ykolal, bpf, netdev,
	Jason Xing

On Wed, Nov 6, 2024 at 9:09 AM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 11/5/24 4:17 PM, Jason Xing wrote:
> > On Wed, Nov 6, 2024 at 3:22 AM Martin KaFai Lau <martin.lau@linux.dev> wrote:
> >>
> >> On 11/4/24 10:22 PM, Jason Xing wrote:
> >>> On Tue, Nov 5, 2024 at 10:09 AM Martin KaFai Lau <martin.lau@linux.dev> wrote:
> >>>>
> >>>> On 11/1/24 6:32 AM, Willem de Bruijn wrote:
> >>>>>> In udp/raw/..., I don't know how likely is the user space having "cork->tx_flags
> >>>>>> & SKBTX_ANY_TSTAMP" set but has neither "READ_ONCE(sk->sk_tsflags) &
> >>>>>> SOF_TIMESTAMPING_OPT_ID" nor "cork->flags & IPCORK_TS_OPT_ID" set.
> >>>>> This is not something to rely on. OPT_ID was added relatively recently.
> >>>>> Older applications, or any that just use the most straightforward API,
> >>>>> will not set this.
> >>>>
> >>>> Good point that the OPT_ID per cmsg is very new.
> >>>>
> >>>> The datagram support on SOF_TIMESTAMPING_OPT_ID in sk->sk_tsflags had
> >>>> been there for quite some time now. Is it a safe assumption that
> >>>> most applications doing udp tx timestamping should have
> >>>> the SOF_TIMESTAMPING_OPT_ID set to be useful?
> >>>>
> >>>>>
> >>>>>> If it is
> >>>>>> unlikely, may be we can just disallow bpf prog from directly setting
> >>>>>> skb_shinfo(skb)->tskey for this particular skb.
> >>>>>>
> >>>>>> For all other cases, in __ip[6]_append_data, directly call a bpf prog and also
> >>>>>> pass the kernel decided tskey to the bpf prog.
> >>>>>>
> >>>>>> The kernel passed tskey could be 0 (meaning the user space has not used it). The
> >>>>>> bpf prog can give one for the kernel to use. The bpf prog can store the
> >>>>>> sk_tskey_bpf in the bpf_sk_storage now. Meaning no need to add one to the struct
> >>>>>> sock. The bpf prog does not have to start from 0 (e.g. start from U32_MAX
> >>>>>> instead) if it helps.
> >>>>>>
> >>>>>> If the kernel passed tskey is not 0, the bpf prog can just use that one
> >>>>>> (assuming the user space is doing something sane, like the value in
> >>>>>> SCM_TS_OPT_ID won't be jumping back and front between 0 to U32_MAX). I hope this
> >>>>>> is very unlikely also (?) but the bpf prog can probably detect this and choose
> >>>>>> to ignore this sk.
> >>>>> If an applications uses OPT_ID, it is unlikely that they will toggle
> >>>>> the feature on and off on a per-packet basis. So in the common case
> >>>>> the program could use the user-set counter or use its own if userspace
> >>>>> does not enable the feature. In the rare case that an application does
> >>>>> intermittently set an OPT_ID, the numbering would be erratic. This
> >>>>> does mean that an actively malicious application could mess with admin
> >>>>> measurements.
> >>>>
> >>>> All make sense. Given it is reasonable to assume the user space should either
> >>>> has SOF_TIMESTAMPING_OPT_ID always on or always off. When it is off, the bpf
> >>>> prog can directly provide its own tskey to be used in shinfo->tskey. The bpf
> >>>> prog can generate the id itself without using the sk->sk_tskey, e.g. store an
> >>>> atomic int in the bpf_sk_storage.
> >>>
> >>> I wonder, how can we correlate the key with each skb in the bpf
> >>> program for non-TCP type without implementing a bpf extension for
> >>> SCM_TS_OPT_ID? Every time the timestamp is reported, we cannot know
> >>> which sendmsg() the skb belongs to for non-TCP cases.
> >>
> >> SCM_TS_OPT_ID is eventually setting the shinfo->tskey.
> >> If the shinfo->tskey is not set by the user space, the bpf prog can directly set
> >> the shinfo->tskey. There is no need to use the sk->sk_tskey as the ID generator
> >> also. The bpf prog can have its own id generator.
> >>
> >> If the user space has already set the shinfo->tskey (either by sk->sk_tskey or
> >> SCM_TS_OPT_ID), the bpf prog can just use the user space one.
> >>
> >> If there is a weird application that flips flops between OPT_ID on/off, the bpf
> >> prog will get confused which is fine. The bpf prog can detect this and choose to
> >> ignore measuring this sk/skb. The bpf prog can also choose to be on the very
> >> safe side and ignore all skb with SKBTX_ANY_TSTAMP set in txflags but with no
> >> OPT_ID. The bpf prog can look into the details of the sk and skb to decide what
> >> makes the most sense for its deployment.
> >>
> >> I don't know whether it makes more sense to call the bpf prog to decide the
> >> shinfo->{tx_flags,tskey} just before the "while (length > 0)" in
> >> __ip[6]_append_data or it is better to call the bpf prog in ip[6]_setup_cork.
> >> I admittedly less familiar with this code path than the tcp one.
> >
> > Now I feel it could be complicated for a software engineer to consider
> > how they will handle the key if they don't read the kernel code very
> > carefully. They are facing different situations. Being user-friendly
> > lets this feature have more chances to get widely used. As I insisted
> > before, I still would like to know if it is possible that we can try
> > to introduce sk_tskey_bpf_offset (like patch 10-12) to calculate a bpf
> > exclusive tskey for bpf use? Only exporting one key. It will be really
> > simple and easy-to-use :)
>
> imo, there is no need for adding sk_tskey_bpf_offset to sk. just allow the bpf
> prog to decide what is the tskey.
>
> There is no usability issue in bpf prog. It is pretty normal for a bpf prog
> author to look at the sk details to make decision.
>
> Abstracting the sk/skb is not helping the bpf prog and not the right direction
> to go. Over time, there has been case over case that the bpf prog wants to know
> more instead of being abstracted away like running in the user space. e.g. The
> "struct bpf_sock" abstraction in the uapi/linux/bpf.h does not scale and we have
> stopped adding more abstraction this way. The btf (and PTR_TO_BTF_ID,
> CO-RE...etc) has been added to allow the bpf prog to learn other details in sk
> and skb.
>
> Instead, design a better bpf kfunc to help the bpf prog to set the bits/tskey in
> the skb. I think this is more important. tcp tskey is easy. just need some care
> on the udp tskey and need to check if the user space has already set one.
> A good designed bpf kfunc is all it needs.

Thanks!

Let me confirm again in case I'm missing something important.
1) For tcp, as you said before, bpf prog can extract the seq from the
exported skb, so I don't need to export any key in this case.
2) For udp, if the skb has skb_shinfo(skb)->tskey set, then export the
key; otherwise, export zero to the bpf program.
3) Extend SCM_TS_OPT_ID for the udp/bpf case.

I'm not sure if I should postpone implementing this part until after the
basic framework of this series gets merged. Anyway, I will try this :)

Thanks,
Jason

* Re: [PATCH net-next v3 02/14] net-timestamp: allow two features to work parallelly
  2024-11-06  2:51                                               ` Jason Xing
@ 2024-11-07  1:19                                                 ` Martin KaFai Lau
  2024-11-07  3:31                                                   ` Jason Xing
  0 siblings, 1 reply; 88+ messages in thread
From: Martin KaFai Lau @ 2024-11-07  1:19 UTC (permalink / raw)
  To: Jason Xing
  Cc: Willem de Bruijn, willemb, davem, edumazet, kuba, pabeni, dsahern,
	ast, daniel, andrii, eddyz87, song, yonghong.song, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, shuah, ykolal, bpf, netdev,
	Jason Xing

On 11/5/24 6:51 PM, Jason Xing wrote:
> On Wed, Nov 6, 2024 at 9:09 AM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>>
>> On 11/5/24 4:17 PM, Jason Xing wrote:
>>> On Wed, Nov 6, 2024 at 3:22 AM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>>>>
>>>> On 11/4/24 10:22 PM, Jason Xing wrote:
>>>>> On Tue, Nov 5, 2024 at 10:09 AM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>>>>>>
>>>>>> On 11/1/24 6:32 AM, Willem de Bruijn wrote:
>>>>>>>> In udp/raw/..., I don't know how likely is the user space having "cork->tx_flags
>>>>>>>> & SKBTX_ANY_TSTAMP" set but has neither "READ_ONCE(sk->sk_tsflags) &
>>>>>>>> SOF_TIMESTAMPING_OPT_ID" nor "cork->flags & IPCORK_TS_OPT_ID" set.
>>>>>>> This is not something to rely on. OPT_ID was added relatively recently.
>>>>>>> Older applications, or any that just use the most straightforward API,
>>>>>>> will not set this.
>>>>>>
>>>>>> Good point that the OPT_ID per cmsg is very new.
>>>>>>
>>>>>> The datagram support on SOF_TIMESTAMPING_OPT_ID in sk->sk_tsflags had
>>>>>> been there for quite some time now. Is it a safe assumption that
>>>>>> most applications doing udp tx timestamping should have
>>>>>> the SOF_TIMESTAMPING_OPT_ID set to be useful?
>>>>>>
>>>>>>>
>>>>>>>> If it is
>>>>>>>> unlikely, may be we can just disallow bpf prog from directly setting
>>>>>>>> skb_shinfo(skb)->tskey for this particular skb.
>>>>>>>>
>>>>>>>> For all other cases, in __ip[6]_append_data, directly call a bpf prog and also
>>>>>>>> pass the kernel decided tskey to the bpf prog.
>>>>>>>>
>>>>>>>> The kernel passed tskey could be 0 (meaning the user space has not used it). The
>>>>>>>> bpf prog can give one for the kernel to use. The bpf prog can store the
>>>>>>>> sk_tskey_bpf in the bpf_sk_storage now. Meaning no need to add one to the struct
>>>>>>>> sock. The bpf prog does not have to start from 0 (e.g. start from U32_MAX
>>>>>>>> instead) if it helps.
>>>>>>>>
>>>>>>>> If the kernel passed tskey is not 0, the bpf prog can just use that one
>>>>>>>> (assuming the user space is doing something sane, like the value in
>>>>>>>> SCM_TS_OPT_ID won't be jumping back and front between 0 to U32_MAX). I hope this
>>>>>>>> is very unlikely also (?) but the bpf prog can probably detect this and choose
>>>>>>>> to ignore this sk.
>>>>>>> If an applications uses OPT_ID, it is unlikely that they will toggle
>>>>>>> the feature on and off on a per-packet basis. So in the common case
>>>>>>> the program could use the user-set counter or use its own if userspace
>>>>>>> does not enable the feature. In the rare case that an application does
>>>>>>> intermittently set an OPT_ID, the numbering would be erratic. This
>>>>>>> does mean that an actively malicious application could mess with admin
>>>>>>> measurements.
>>>>>>
>>>>>> All make sense. Given it is reasonable to assume the user space should either
>>>>>> has SOF_TIMESTAMPING_OPT_ID always on or always off. When it is off, the bpf
>>>>>> prog can directly provide its own tskey to be used in shinfo->tskey. The bpf
>>>>>> prog can generate the id itself without using the sk->sk_tskey, e.g. store an
>>>>>> atomic int in the bpf_sk_storage.
>>>>>
>>>>> I wonder, how can we correlate the key with each skb in the bpf
>>>>> program for non-TCP type without implementing a bpf extension for
>>>>> SCM_TS_OPT_ID? Every time the timestamp is reported, we cannot know
>>>>> which sendmsg() the skb belongs to for non-TCP cases.
>>>>
>>>> SCM_TS_OPT_ID is eventually setting the shinfo->tskey.
>>>> If the shinfo->tskey is not set by the user space, the bpf prog can directly set
>>>> the shinfo->tskey. There is no need to use the sk->sk_tskey as the ID generator
>>>> also. The bpf prog can have its own id generator.
>>>>
>>>> If the user space has already set the shinfo->tskey (either by sk->sk_tskey or
>>>> SCM_TS_OPT_ID), the bpf prog can just use the user space one.
>>>>
>>>> If there is a weird application that flips flops between OPT_ID on/off, the bpf
>>>> prog will get confused which is fine. The bpf prog can detect this and choose to
>>>> ignore measuring this sk/skb. The bpf prog can also choose to be on the very
>>>> safe side and ignore all skb with SKBTX_ANY_TSTAMP set in txflags but with no
>>>> OPT_ID. The bpf prog can look into the details of the sk and skb to decide what
>>>> makes the most sense for its deployment.
>>>>
>>>> I don't know whether it makes more sense to call the bpf prog to decide the
>>>> shinfo->{tx_flags,tskey} just before the "while (length > 0)" in
>>>> __ip[6]_append_data or it is better to call the bpf prog in ip[6]_setup_cork.
>>>> I admittedly less familiar with this code path than the tcp one.
>>>
>>> Now I feel it could be complicated for a software engineer to consider
>>> how they will handle the key if they don't read the kernel code very
>>> carefully. They are facing different situations. Being user-friendly
>>> lets this feature have more chances to get widely used. As I insisted
>>> before, I still would like to know if it is possible that we can try
>>> to introduce sk_tskey_bpf_offset (like patch 10-12) to calculate a bpf
>>> exclusive tskey for bpf use? Only exporting one key. It will be really
>>> simple and easy-to-use :)
>>
>> imo, there is no need for adding sk_tskey_bpf_offset to sk. just allow the bpf
>> prog to decide what is the tskey.
>>
>> There is no usability issue in bpf prog. It is pretty normal for a bpf prog
>> author to look at the sk details to make decision.
>>
>> Abstracting the sk/skb is not helping the bpf prog and not the right direction
>> to go. Over time, there has been case over case that the bpf prog wants to know
>> more instead of being abstracted away like running in the user space. e.g. The
>> "struct bpf_sock" abstraction in the uapi/linux/bpf.h does not scale and we have
>> stopped adding more abstraction this way. The btf (and PTR_TO_BTF_ID,
>> CO-RE...etc) has been added to allow the bpf prog to learn other details in sk
>> and skb.
>>
>> Instead, design a better bpf kfunc to help the bpf prog to set the bits/tskey in
>> the skb. I think this is more important. tcp tskey is easy. just need some care
>> on the udp tskey and need to check if the user space has already set one.
>> A good designed bpf kfunc is all it needs.
> 
> Thanks!
> 
> Let me confirm again in case I'm missing something important.
> 1) For tcp, as you said before, bpf prog can extract the seq from the
> exported skb, so I don't need to export any key in this case.
> 2) For udp, if the skb has skb_shinfo(skb)->tskey set, then export the
> key, else, export zero to the bpf program.

A follow up to myself on the earlier bpf kfunc comment. Something like this:

/* ack: request ACK timestamp (tcp only)
 * req_tskey: bpf prog can request to use a particular tskey.
 *            req_tskey should always be 0 for tcp.
 * return: -ve for error. u32 for the tskey that the bpf prog should use.
 *         may be different from the req_tskey (e.g. the user space has
 *         already set one).
 */
__bpf_kfunc s64 bpf_skops_enable_tx_tstamp(struct bpf_sock_ops_kern *skops,
					   bool ack, u32 req_tskey);

/* "not sure" if this kfunc is needed. probably no. I think it is easier to pass
 * true/false in the args[0]. It seems tskey can be 0 in udp, so
 * passing tskey can't tell if the skb/cork/sockcm_cookie has the tskey.
 */
__bpf_kfunc bool bpf_skops_has_tskey(struct bpf_sock_ops_kern *skops);
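
To make the return contract concrete, here is a rough userspace model of it,
not kernel code. Everything named model_* and struct tx_ts_state is made up
for illustration, and the choice of -EINVAL for the rejected cases is an
assumption. The point it shows is why the return type is s64 (every u32 key
plus negative errors fit) and why "key == 0" cannot stand in for "no key".

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the tx timestamping state that lives in the
 * skb/cork/sockcm_cookie. This is NOT a kernel structure.
 */
struct tx_ts_state {
	bool is_tcp;
	bool has_tskey;		/* a key was already chosen (e.g. by user space) */
	uint32_t tskey;		/* 0 is a legal key for udp */
	bool ack;
	bool enabled;
};

/* Model of the proposed contract: negative errno on error, otherwise the
 * key the caller should use, widened to 64 bits so any u32 value fits.
 */
static int64_t model_enable_tx_tstamp(struct tx_ts_state *st, bool ack,
				      uint32_t req_tskey)
{
	if (!st)
		return -EINVAL;
	if (st->is_tcp && req_tskey != 0)
		return -EINVAL;	/* assumed: tcp callers must pass 0 */
	if (!st->is_tcp && ack)
		return -EINVAL;	/* assumed: ACK timestamps are tcp only */

	/* Only take the requested key if nobody has set one yet. */
	if (!st->is_tcp && !st->has_tskey) {
		st->tskey = req_tskey;
		st->has_tskey = true;
	}
	st->ack = ack;
	st->enabled = true;
	return (int64_t)st->tskey;
}

/* Separate presence check, since a key of 0 does not mean "unset". */
static bool model_has_tskey(const struct tx_ts_state *st)
{
	return st->has_tskey;
}

/* Small self-check showing the intended reading of the return value. */
static void model_selfcheck(void)
{
	struct tx_ts_state user_set = { .has_tskey = true, .tskey = 0 };
	struct tx_ts_state unset = { 0 };

	/* User space already picked key 0: the request for 7 is ignored. */
	assert(model_enable_tx_tstamp(&user_set, false, 7) == 0);
	/* Nothing set yet: the requested key is used. */
	assert(model_enable_tx_tstamp(&unset, false, 7) == 7);
}
```

In this model the caller always keys its measurements on the returned value
and never on req_tskey.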

For udp, I don't know whether it will be easier to set the tskey in the 'cork',
the 'sockcm_cookie' or the 'skb'. I guess it depends on where the bpf prog is
called. If it is the skb, the bpf prog may be called repeatedly to do the same
thing in the while loop in __ip[6]_append_data. If it is better to set the
'cork' or 'sockcm_cookie', the cork/sockcm_cookie pointer can be added to
'struct bpf_sock_ops_kern'. sizeof(struct bpf_sock_ops_kern) is already at
64 bytes, so adding one pointer is not ideal. It can probably be put in a union
with syn_skb, but that will need some code audit (so please check).
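
A rough illustration of the union idea, using made-up structs: ctx_before,
ctx_after and cork_model are hypothetical and do not reflect the real layout
of struct bpf_sock_ops_kern. It only shows that overlaying the new pointer on
an existing one keeps the size unchanged, provided the two are never live on
the same code path (which is what the audit would have to confirm).

```c
#include <assert.h>
#include <stddef.h>

struct sk_buff;		/* opaque, only used through pointers here */
struct cork_model;	/* hypothetical stand-in for cork/sockcm_cookie */

/* Hypothetical context before the change. */
struct ctx_before {
	void *sk;
	struct sk_buff *syn_skb;
	struct sk_buff *skb;
};

/* Same context with the new pointer sharing storage with syn_skb. */
struct ctx_after {
	void *sk;
	union {
		struct sk_buff *syn_skb;	/* assumed: SYN path only */
		struct cork_model *cork;	/* assumed: datagram tx path only */
	};
	struct sk_buff *skb;
};

/* Compile-time check that the overlay did not grow the context. */
static_assert(sizeof(struct ctx_after) == sizeof(struct ctx_before),
	      "overlaying the pointer must not grow the context");

/* Runtime view of the same fact, handy in a test. */
static size_t ctx_growth(void)
{
	return sizeof(struct ctx_after) - sizeof(struct ctx_before);
}
```

The cost of this layout is that any path reading syn_skb must be certain the
cork member was not the one written.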


> 3) extend SCM_TS_OPT_ID for the udp/bpf case.

I don't understand. What does it mean to extend SCM_TS_OPT_ID?

> I'm not sure if I should postpone implementing this part after the
> basic framework of this series gets merged. Anyway, I will try this :)

* Re: [PATCH net-next v3 02/14] net-timestamp: allow two features to work parallelly
  2024-11-07  1:19                                                 ` Martin KaFai Lau
@ 2024-11-07  3:31                                                   ` Jason Xing
  2024-11-07 19:05                                                     ` Martin KaFai Lau
  0 siblings, 1 reply; 88+ messages in thread
From: Jason Xing @ 2024-11-07  3:31 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: Willem de Bruijn, willemb, davem, edumazet, kuba, pabeni, dsahern,
	ast, daniel, andrii, eddyz87, song, yonghong.song, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, shuah, ykolal, bpf, netdev,
	Jason Xing

On Thu, Nov 7, 2024 at 9:19 AM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 11/5/24 6:51 PM, Jason Xing wrote:
> > On Wed, Nov 6, 2024 at 9:09 AM Martin KaFai Lau <martin.lau@linux.dev> wrote:
> >>
> >> On 11/5/24 4:17 PM, Jason Xing wrote:
> >>> On Wed, Nov 6, 2024 at 3:22 AM Martin KaFai Lau <martin.lau@linux.dev> wrote:
> >>>>
> >>>> On 11/4/24 10:22 PM, Jason Xing wrote:
> >>>>> On Tue, Nov 5, 2024 at 10:09 AM Martin KaFai Lau <martin.lau@linux.dev> wrote:
> >>>>>>
> >>>>>> On 11/1/24 6:32 AM, Willem de Bruijn wrote:
> >>>>>>>> In udp/raw/..., I don't know how likely is the user space having "cork->tx_flags
> >>>>>>>> & SKBTX_ANY_TSTAMP" set but has neither "READ_ONCE(sk->sk_tsflags) &
> >>>>>>>> SOF_TIMESTAMPING_OPT_ID" nor "cork->flags & IPCORK_TS_OPT_ID" set.
> >>>>>>> This is not something to rely on. OPT_ID was added relatively recently.
> >>>>>>> Older applications, or any that just use the most straightforward API,
> >>>>>>> will not set this.
> >>>>>>
> >>>>>> Good point that the OPT_ID per cmsg is very new.
> >>>>>>
> >>>>>> The datagram support on SOF_TIMESTAMPING_OPT_ID in sk->sk_tsflags had
> >>>>>> been there for quite some time now. Is it a safe assumption that
> >>>>>> most applications doing udp tx timestamping should have
> >>>>>> the SOF_TIMESTAMPING_OPT_ID set to be useful?
> >>>>>>
> >>>>>>>
> >>>>>>>> If it is
> >>>>>>>> unlikely, may be we can just disallow bpf prog from directly setting
> >>>>>>>> skb_shinfo(skb)->tskey for this particular skb.
> >>>>>>>>
> >>>>>>>> For all other cases, in __ip[6]_append_data, directly call a bpf prog and also
> >>>>>>>> pass the kernel decided tskey to the bpf prog.
> >>>>>>>>
> >>>>>>>> The kernel passed tskey could be 0 (meaning the user space has not used it). The
> >>>>>>>> bpf prog can give one for the kernel to use. The bpf prog can store the
> >>>>>>>> sk_tskey_bpf in the bpf_sk_storage now. Meaning no need to add one to the struct
> >>>>>>>> sock. The bpf prog does not have to start from 0 (e.g. start from U32_MAX
> >>>>>>>> instead) if it helps.
> >>>>>>>>
> >>>>>>>> If the kernel passed tskey is not 0, the bpf prog can just use that one
> >>>>>>>> (assuming the user space is doing something sane, like the value in
> >>>>>>>> SCM_TS_OPT_ID won't be jumping back and front between 0 to U32_MAX). I hope this
> >>>>>>>> is very unlikely also (?) but the bpf prog can probably detect this and choose
> >>>>>>>> to ignore this sk.
> >>>>>>> If an applications uses OPT_ID, it is unlikely that they will toggle
> >>>>>>> the feature on and off on a per-packet basis. So in the common case
> >>>>>>> the program could use the user-set counter or use its own if userspace
> >>>>>>> does not enable the feature. In the rare case that an application does
> >>>>>>> intermittently set an OPT_ID, the numbering would be erratic. This
> >>>>>>> does mean that an actively malicious application could mess with admin
> >>>>>>> measurements.
> >>>>>>
> >>>>>> All make sense. Given it is reasonable to assume the user space should either
> >>>>>> has SOF_TIMESTAMPING_OPT_ID always on or always off. When it is off, the bpf
> >>>>>> prog can directly provide its own tskey to be used in shinfo->tskey. The bpf
> >>>>>> prog can generate the id itself without using the sk->sk_tskey, e.g. store an
> >>>>>> atomic int in the bpf_sk_storage.
> >>>>>
> >>>>> I wonder, how can we correlate the key with each skb in the bpf
> >>>>> program for non-TCP type without implementing a bpf extension for
> >>>>> SCM_TS_OPT_ID? Every time the timestamp is reported, we cannot know
> >>>>> which sendmsg() the skb belongs to for non-TCP cases.
> >>>>
> >>>> SCM_TS_OPT_ID is eventually setting the shinfo->tskey.
> >>>> If the shinfo->tskey is not set by the user space, the bpf prog can directly set
> >>>> the shinfo->tskey. There is no need to use the sk->sk_tskey as the ID generator
> >>>> also. The bpf prog can have its own id generator.
> >>>>
> >>>> If the user space has already set the shinfo->tskey (either by sk->sk_tskey or
> >>>> SCM_TS_OPT_ID), the bpf prog can just use the user space one.
> >>>>
> >>>> If there is a weird application that flips flops between OPT_ID on/off, the bpf
> >>>> prog will get confused which is fine. The bpf prog can detect this and choose to
> >>>> ignore measuring this sk/skb. The bpf prog can also choose to be on the very
> >>>> safe side and ignore all skb with SKBTX_ANY_TSTAMP set in txflags but with no
> >>>> OPT_ID. The bpf prog can look into the details of the sk and skb to decide what
> >>>> makes the most sense for its deployment.
> >>>>
> >>>> I don't know whether it makes more sense to call the bpf prog to decide the
> >>>> shinfo->{tx_flags,tskey} just before the "while (length > 0)" in
> >>>> __ip[6]_append_data or it is better to call the bpf prog in ip[6]_setup_cork.
> >>>> I admittedly less familiar with this code path than the tcp one.
> >>>
> >>> Now I feel it could be complicated for a software engineer to consider
> >>> how they will handle the key if they don't read the kernel code very
> >>> carefully. They are facing different situations. Being user-friendly
> >>> lets this feature have more chances to get widely used. As I insisted
> >>> before, I still would like to know if it is possible that we can try
> >>> to introduce sk_tskey_bpf_offset (like patch 10-12) to calculate a bpf
> >>> exclusive tskey for bpf use? Only exporting one key. It will be really
> >>> simple and easy-to-use :)
> >>
> >> imo, there is no need for adding sk_tskey_bpf_offset to sk. just allow the bpf
> >> prog to decide what is the tskey.
> >>
> >> There is no usability issue in bpf prog. It is pretty normal for a bpf prog
> >> author to look at the sk details to make decision.
> >>
> >> Abstracting the sk/skb is not helping the bpf prog and not the right direction
> >> to go. Over time, there has been case over case that the bpf prog wants to know
> >> more instead of being abstracted away like running in the user space. e.g. The
> >> "struct bpf_sock" abstraction in the uapi/linux/bpf.h does not scale and we have
> >> stopped adding more abstraction this way. The btf (and PTR_TO_BTF_ID,
> >> CO-RE...etc) has been added to allow the bpf prog to learn other details in sk
> >> and skb.
> >>
> >> Instead, design a better bpf kfunc to help the bpf prog to set the bits/tskey in
> >> the skb. I think this is more important. tcp tskey is easy. just need some care
> >> on the udp tskey and need to check if the user space has already set one.
> >> A good designed bpf kfunc is all it needs.
> >
> > Thanks!
> >
> > Let me confirm again in case I'm missing something important.
> > 1) For tcp, as you said before, bpf prog can extract the seq from the
> > exported skb, so I don't need to export any key in this case.
> > 2) For udp, if the skb has skb_shinfo(skb)->tskey set, then export the
> > key, else, export zero to the bpf program.
>
> A follow up to myself on the earlier bpf kfunc comment. Something like this:

Thank you so much!

>
> /* ack: request ACK timestamp (tcp only)
>   * req_tskey: bpf prog can request to use a particular tskey.
>   *            req_tskey should always be 0 for tcp.
>   * return: -ve for error. u32 for the tskey that the bpf prog should use.
>   *        may be different from the req_tskey (e.g. the user space has
>   *         already set one).
>   */
> __bpf_kfunc s64 bpf_skops_enable_tx_tstamp(struct bpf_sock_ops_kern *skops,
>                                            bool ack, u32 req_tskey);
>
> /* "not sure" if this kfunc is needed. probably no. I think it is easier to pass
>   * true/false in the args[0]. It seems tskey can be 0 in udp, so

Good idea.

>   * passing tskey can't tell if the skb/cork/sockcm_cookie has the tskey.
>   */
> __bpf_kfunc bool bpf_skops_has_tskey(struct bpf_sock_ops_kern *skops);
>
> For udp, I don't know whether it will be easier to set the tskey in the 'cork'
> or 'sockcm_cookie' or 'skb'. I guess it depends where the bpf prog is called. If
> skb, it seems the bpf prog may be called repetitively for doing the same thing
> in the while loop in __ip[6]_append_data. If it is better to set the 'cork' or
> 'sockcm_cookie', the cork/sockcm_cookie pointer can be added to 'struct
> bpf_sock_ops_kern'. The sizeof(struct bpf_sock_ops_kern) is at 64bytes. Adding
> one pointer is not ideal.... probably it can be union with syn_skb but will need
> some code audit (so please check).

Let me dig into it :)

>
>
> > 3) extend SCM_TS_OPT_ID for the udp/bpf case.
>
> I don't understand. What does it mean to extend SCM_TS_OPT_ID?

Oh, I thought you expected the key to be passed from the bpf program
through the SCM_TS_OPT_ID interface, which isn't supported by bpf. Let
me think more about it first.

Thanks,
Jason

* Re: [PATCH net-next v3 02/14] net-timestamp: allow two features to work parallelly
  2024-11-07  3:31                                                   ` Jason Xing
@ 2024-11-07 19:05                                                     ` Martin KaFai Lau
  0 siblings, 0 replies; 88+ messages in thread
From: Martin KaFai Lau @ 2024-11-07 19:05 UTC (permalink / raw)
  To: Jason Xing
  Cc: Willem de Bruijn, willemb, davem, edumazet, kuba, pabeni, dsahern,
	ast, daniel, andrii, eddyz87, song, yonghong.song, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, shuah, ykolal, bpf, netdev,
	Jason Xing

On 11/6/24 7:31 PM, Jason Xing wrote:
>> /* ack: request ACK timestamp (tcp only)
>>    * req_tskey: bpf prog can request to use a particular tskey.
>>    *            req_tskey should always be 0 for tcp.
>>    * return: -ve for error. u32 for the tskey that the bpf prog should use.
>>    *        may be different from the req_tskey (e.g. the user space has
>>    *         already set one).
>>    */
>> __bpf_kfunc s64 bpf_skops_enable_tx_tstamp(struct bpf_sock_ops_kern *skops,
>>                                             bool ack, u32 req_tskey);
>>

>>
>> For udp, I don't know whether it will be easier to set the tskey in the 'cork'
>> or 'sockcm_cookie' or 'skb'. I guess it depends where the bpf prog is called. If
>> skb, it seems the bpf prog may be called repetitively for doing the same thing
>> in the while loop in __ip[6]_append_data. If it is better to set the 'cork' or
>> 'sockcm_cookie', the cork/sockcm_cookie pointer can be added to 'struct
>> bpf_sock_ops_kern'. The sizeof(struct bpf_sock_ops_kern) is at 64bytes. Adding
>> one pointer is not ideal.... probably it can be union with syn_skb but will need
>> some code audit (so please check).

>>
>>
>>> 3) extend SCM_TS_OPT_ID for the udp/bpf case.
>>
>> I don't understand. What does it mean to extend SCM_TS_OPT_ID?
> 
> Oh, I thought you expect to pass the key from the bpf program through
> using the interface of SCM_TS_OPT_ID feature which isn't supported by
> bpf. Let me think more about it first.

I still don't understand the SCM_TS_OPT_ID part, but no, I don't mean that.

The bpf prog uses the kfunc to directly set the tskey (and tx_flags) in the
skb/cork/sockcm_cookie. The field names for the tskey and tx_flags may differ
depending on whether it is the skb, the cork or the sockcm_cookie.
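
As a sketch of what "one kfunc, different field names underneath" could look
like, here is a small userspace model. The mock_* structs, their field names
and model_set_tx_tstamp() are all hypothetical; the real kernel structures
and names would need to be checked against the tree.

```c
#include <assert.h>
#include <stdint.h>

/* Mock containers. Field names are illustrative only. */
struct mock_shinfo { uint8_t tx_flags; uint32_t tskey; };
struct mock_cork   { uint8_t tx_flags; uint32_t ts_id; };
struct mock_cookie { uint32_t tsflags; uint32_t ts_id; };

enum ts_target_kind { TS_TARGET_SKB, TS_TARGET_CORK, TS_TARGET_COOKIE };

/* Which container the caller is holding at this hook point. */
struct ts_target {
	enum ts_target_kind kind;
	union {
		struct mock_shinfo *shinfo;
		struct mock_cork *cork;
		struct mock_cookie *cookie;
	} u;
};

/* Single entry point: the caller passes flags and a key, the helper hides
 * where they are stored. Returns 0 on success, -1 for an unknown target.
 */
static int model_set_tx_tstamp(struct ts_target *t, uint8_t flags,
			       uint32_t key)
{
	switch (t->kind) {
	case TS_TARGET_SKB:
		t->u.shinfo->tx_flags |= flags;
		t->u.shinfo->tskey = key;
		return 0;
	case TS_TARGET_CORK:
		t->u.cork->tx_flags |= flags;
		t->u.cork->ts_id = key;
		return 0;
	case TS_TARGET_COOKIE:
		t->u.cookie->tsflags |= flags;
		t->u.cookie->ts_id = key;
		return 0;
	}
	return -1;
}

/* Small self-check: the same call works on a cork-backed target. */
static void model_target_selfcheck(void)
{
	struct mock_cork cork = { 0 };
	struct ts_target t = { .kind = TS_TARGET_CORK, .u.cork = &cork };

	assert(model_set_tx_tstamp(&t, 0x1, 42) == 0);
	assert(cork.tx_flags == 0x1 && cork.ts_id == 42);
}
```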

end of thread, other threads:[~2024-11-07 19:06 UTC | newest]

Thread overview: 88+ messages
2024-10-28 11:05 [PATCH net-next v3 00/14] net-timestamp: bpf extension to equip applications transparently Jason Xing
2024-10-28 11:05 ` [PATCH net-next v3 01/14] net-timestamp: reorganize in skb_tstamp_tx_output() Jason Xing
2024-10-28 11:05 ` [PATCH net-next v3 02/14] net-timestamp: allow two features to work parallelly Jason Xing
2024-10-29 23:00   ` Martin KaFai Lau
2024-10-30  1:23     ` Jason Xing
2024-10-30  1:45       ` Willem de Bruijn
2024-10-30  2:32         ` Jason Xing
2024-10-30  2:47           ` Willem de Bruijn
2024-10-30  3:04             ` Jason Xing
2024-10-30  5:37               ` Martin KaFai Lau
2024-10-30  6:42                 ` Jason Xing
2024-10-30 17:15                   ` Willem de Bruijn
2024-10-30 23:54                     ` Jason Xing
2024-10-31  0:13                       ` Jason Xing
2024-10-31  6:27                         ` Martin KaFai Lau
2024-10-31  7:04                           ` Jason Xing
2024-10-31 12:30                             ` Willem de Bruijn
2024-10-31 13:50                               ` Jason Xing
2024-10-31 23:26                                 ` Martin KaFai Lau
2024-11-01  7:47                                   ` Jason Xing
2024-11-05  1:50                                     ` Martin KaFai Lau
2024-11-05  3:13                                       ` Jason Xing
2024-11-01 13:32                                   ` Willem de Bruijn
2024-11-01 16:08                                     ` Jason Xing
2024-11-01 16:39                                       ` Willem de Bruijn
2024-11-05  2:09                                     ` Martin KaFai Lau
2024-11-05  6:22                                       ` Jason Xing
2024-11-05 19:22                                         ` Martin KaFai Lau
2024-11-06  0:17                                           ` Jason Xing
2024-11-06  1:09                                             ` Martin KaFai Lau
2024-11-06  2:51                                               ` Jason Xing
2024-11-07  1:19                                                 ` Martin KaFai Lau
2024-11-07  3:31                                                   ` Jason Xing
2024-11-07 19:05                                                     ` Martin KaFai Lau
2024-11-06  1:11                                             ` Willem de Bruijn
2024-11-06  2:37                                               ` Jason Xing
2024-11-05 14:29                                       ` Willem de Bruijn
2024-11-02 13:43   ` Simon Horman
2024-11-03  0:42     ` Jason Xing
2024-10-28 11:05 ` [PATCH net-next v3 03/14] net-timestamp: open gate for bpf_setsockopt/_getsockopt Jason Xing
2024-10-29  0:59   ` Willem de Bruijn
2024-10-29  1:18     ` Jason Xing
2024-10-30  0:32   ` Martin KaFai Lau
2024-10-30  1:15     ` Jason Xing
2024-10-28 11:05 ` [PATCH net-next v3 04/14] net-timestamp: introduce TS_SCHED_OPT_CB to generate dev xmit timestamp Jason Xing
2024-10-29  0:23   ` kernel test robot
2024-10-29  1:02   ` Willem de Bruijn
2024-10-29  1:30     ` Jason Xing
2024-10-29  1:04   ` kernel test robot
2024-10-28 11:05 ` [PATCH net-next v3 05/14] net-timestamp: introduce TS_SW_OPT_CB to generate driver timestamp Jason Xing
2024-10-28 11:05 ` [PATCH net-next v3 06/14] net-timestamp: introduce TS_ACK_OPT_CB to generate tcp acked timestamp Jason Xing
2024-10-29  1:03   ` Willem de Bruijn
2024-10-29  1:19     ` Jason Xing
2024-10-28 11:05 ` [PATCH net-next v3 07/14] net-timestamp: add a new triggered point to set sk_tsflags_bpf in UDP layer Jason Xing
2024-10-29  1:07   ` Willem de Bruijn
2024-10-29  1:23     ` Jason Xing
2024-10-29  1:33       ` Willem de Bruijn
2024-10-29  3:12         ` Jason Xing
2024-10-29 15:04           ` Willem de Bruijn
2024-10-29 15:44             ` Jason Xing
2024-10-28 11:05 ` [PATCH net-next v3 08/14] net-timestamp: make bpf for tx timestamp work Jason Xing
2024-10-28 11:05 ` [PATCH net-next v3 09/14] net-timestamp: add a common helper to set tskey Jason Xing
2024-10-28 11:05 ` [PATCH net-next v3 10/14] net-timestamp: add basic support with tskey offset Jason Xing
2024-10-29  1:24   ` Willem de Bruijn
2024-10-29  2:41     ` Jason Xing
2024-10-29 15:03       ` Willem de Bruijn
2024-10-29 15:50         ` Jason Xing
2024-10-29 19:45           ` Willem de Bruijn
2024-10-30  3:27             ` Jason Xing
2024-10-30  5:42   ` Martin KaFai Lau
2024-10-30  6:50     ` Jason Xing
2024-10-31  1:17       ` Martin KaFai Lau
2024-10-31  2:41         ` Jason Xing
2024-10-31  3:27           ` Jason Xing
2024-10-31  5:52           ` Martin KaFai Lau
2024-10-31  6:16             ` Jason Xing
2024-10-31 23:50           ` Martin KaFai Lau
2024-11-01  6:33             ` Jason Xing
2024-10-28 11:05 ` [PATCH net-next v3 11/14] net-timestamp: support OPT_ID for TCP proto Jason Xing
2024-10-28 11:05 ` [PATCH net-next v3 12/14] net-timestamp: add OPT_ID for UDP proto Jason Xing
2024-10-28 11:05 ` [PATCH net-next v3 13/14] net-timestamp: use static key to control bpf extension Jason Xing
2024-10-28 11:05 ` [PATCH net-next v3 14/14] bpf: add simple bpf tests in the tx path for so_timstamping feature Jason Xing
2024-10-29  1:26   ` Willem de Bruijn
2024-10-29  1:33     ` Jason Xing
2024-10-29  1:40       ` Willem de Bruijn
2024-10-29  3:13         ` Jason Xing
2024-10-30  5:57   ` Martin KaFai Lau
2024-10-30  6:54     ` Jason Xing
