* [RFC PATCH net-next v6 00/13] net-timestamp: bpf extension to equip applications transparently
@ 2025-01-21 1:28 Jason Xing
2025-01-21 1:28 ` [RFC PATCH net-next v6 01/13] net-timestamp: add support for bpf_setsockopt() Jason Xing
` (12 more replies)
0 siblings, 13 replies; 43+ messages in thread
From: Jason Xing @ 2025-01-21 1:28 UTC (permalink / raw)
To: davem, edumazet, kuba, pabeni, dsahern, willemdebruijn.kernel,
willemb, ast, daniel, andrii, martin.lau, eddyz87, song,
yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa, horms
Cc: bpf, netdev, Jason Xing
"Timestamping is key to debugging network stack latency. With
SO_TIMESTAMPING, bugs that are otherwise incorrectly assumed to be
network issues can be attributed to the kernel." This is extracted
from the talk "SO_TIMESTAMPING: Powering Fleetwide RPC Monitoring"
addressed by Willem de Bruijn at netdevconf 0x17).
A few areas need optimization for easier use and lower performance
impact, which I highlighted and mainly discussed at netconf 2024
with Willem de Bruijn and John Fastabend: uAPI compatibility, extra
system call overhead, and the need for application modification. I
initially solved these issues by writing a kernel module that hooks
various key functions. However, that approach is not suitable for
the next kernel release, so a BPF extension was proposed instead.
Throughout the recent development, Martin KaFai Lau provided
invaluable suggestions about BPF along the way. Many thanks
here!
In this series, I only support the foundational code and tx for TCP.
This approach mostly relies on the existing SO_TIMESTAMPING feature;
users only need to pass certain flags through bpf_setsockopt() to a
separate tsflags field. Please see the last selftest patch in this series.
After this series, we can implement, step by step, more of the advanced
functions/flags already present in the SO_TIMESTAMPING feature for the
bpf extension.
---
RFC v6
In the meantime, any suggestions and reviews are welcome!
Link: https://lore.kernel.org/all/20250112113748.73504-1-kerneljasonxing@gmail.com/
1. handle the safety problems by using the correct method.
2. support bpf_getsockopt.
3. adjust the position of BPF_SOCK_OPS_TS_TCP_SND_CB
4. fix mishandling the hardware timestamp error
5. add more corresponding tests
v5
Link: https://lore.kernel.org/all/20241207173803.90744-1-kerneljasonxing@gmail.com/
1. handle the safety issues when someone tries to call unrelated bpf
helpers.
2. avoid adding direct function call in the hot path like
__dev_queue_xmit()
3. remove reporting the hardware timestamp and tskey since they can be
fetched through the existing helper with the help of
bpf_skops_init_skb(), please see the selftest.
4. add new sendmsg callback in tcp_sendmsg, and introduce tskey_bpf used
by bpf program to correlate tcp_sendmsg with other hook points in patch [13/15].
v4
Link: https://lore.kernel.org/all/20241028110535.82999-1-kerneljasonxing@gmail.com/
1. introduce sk->sk_bpf_cb_flags to let user use bpf_setsockopt() (Martin)
2. introduce SKBTX_BPF to enable the bpf SO_TIMESTAMPING feature (Martin)
3. introduce bpf map in tests (Martin)
4. I choose to make this series as simple as possible, so I only support
most cases in the tx path for TCP protocol.
v3
Link: https://lore.kernel.org/all/20241012040651.95616-1-kerneljasonxing@gmail.com/
1. support UDP proto by introducing a new generation point.
2. for OPT_ID, introducing sk_tskey_bpf_offset to compute the delta
between the current socket key and the bpf socket key. It is designed
for UDP, which also applies to TCP.
3. support bpf_getsockopt()
4. use cgroup static key instead.
5. add one simple bpf selftest to show how it can be used.
6. remove the rx support from v2 because the number of patches could
exceed the limit of one series.
V2
Link: https://lore.kernel.org/all/20241008095109.99918-1-kerneljasonxing@gmail.com/
1. Introduce tsflag requestors so that we are able to extend more in the
future. Besides, it enables TX flags for the bpf extension feature
separately without breaking users. This was suggested by Vadim Fedorenko.
2. introduce a static key to control the whole feature. (Willem)
3. Open the gate of bpf_setsockopt for the SO_TIMESTAMPING feature in
some TX/RX cases, not all the cases.
Jason Xing (13):
net-timestamp: add support for bpf_setsockopt()
net-timestamp: prepare for timestamping callbacks use
bpf: stop UDP sock accessing TCP fields in bpf callbacks
bpf: stop UDP sock accessing TCP fields in sock_op BPF CALLs
net-timestamp: prepare for isolating two modes of SO_TIMESTAMPING
net-timestamp: support SCM_TSTAMP_SCHED for bpf extension
net-timestamp: support sw SCM_TSTAMP_SND for bpf extension
net-timestamp: support hw SCM_TSTAMP_SND for bpf extension
net-timestamp: support SCM_TSTAMP_ACK for bpf extension
net-timestamp: make TCP tx timestamp bpf extension work
net-timestamp: add a new callback in tcp_tx_timestamp()
net-timestamp: introduce cgroup lock to avoid affecting non-bpf cases
bpf: add simple bpf tests in the tx path for so_timestamping feature
include/linux/filter.h | 1 +
include/linux/skbuff.h | 25 +-
include/net/sock.h | 10 +
include/net/tcp.h | 4 +-
include/uapi/linux/bpf.h | 31 +++
net/core/dev.c | 5 +-
net/core/filter.c | 50 +++-
net/core/skbuff.c | 67 +++++-
net/core/sock.c | 13 +
net/dsa/user.c | 2 +-
net/ipv4/tcp.c | 11 +
net/ipv4/tcp_input.c | 9 +-
net/ipv4/tcp_output.c | 8 +
net/socket.c | 2 +-
tools/include/uapi/linux/bpf.h | 24 ++
.../bpf/prog_tests/so_timestamping.c | 98 ++++++++
.../selftests/bpf/progs/so_timestamping.c | 227 ++++++++++++++++++
17 files changed, 563 insertions(+), 24 deletions(-)
create mode 100644 tools/testing/selftests/bpf/prog_tests/so_timestamping.c
create mode 100644 tools/testing/selftests/bpf/progs/so_timestamping.c
--
2.43.5
^ permalink raw reply [flat|nested] 43+ messages in thread
* [RFC PATCH net-next v6 01/13] net-timestamp: add support for bpf_setsockopt()
2025-01-21 1:28 [RFC PATCH net-next v6 00/13] net-timestamp: bpf extension to equip applications transparently Jason Xing
@ 2025-01-21 1:28 ` Jason Xing
2025-01-21 1:28 ` [RFC PATCH net-next v6 02/13] net-timestamp: prepare for timestamping callbacks use Jason Xing
` (11 subsequent siblings)
12 siblings, 0 replies; 43+ messages in thread
From: Jason Xing @ 2025-01-21 1:28 UTC (permalink / raw)
To: davem, edumazet, kuba, pabeni, dsahern, willemdebruijn.kernel,
willemb, ast, daniel, andrii, martin.lau, eddyz87, song,
yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa, horms
Cc: bpf, netdev, Jason Xing
Users can write the following code to enable the bpf extension:
bpf_setsockopt(skops, SOL_SOCKET, SK_BPF_CB_FLAGS, &flags, sizeof(flags));
Signed-off-by: Jason Xing <kerneljasonxing@gmail.com>
---
include/net/sock.h | 3 +++
include/uapi/linux/bpf.h | 8 ++++++++
net/core/filter.c | 23 +++++++++++++++++++++++
tools/include/uapi/linux/bpf.h | 1 +
4 files changed, 35 insertions(+)
diff --git a/include/net/sock.h b/include/net/sock.h
index 8036b3b79cd8..7916982343c6 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -303,6 +303,7 @@ struct sk_filter;
* @sk_stamp: time stamp of last packet received
* @sk_stamp_seq: lock for accessing sk_stamp on 32 bit architectures only
* @sk_tsflags: SO_TIMESTAMPING flags
+ * @sk_bpf_cb_flags: used in bpf_setsockopt()
* @sk_use_task_frag: allow sk_page_frag() to use current->task_frag.
* Sockets that can be used under memory reclaim should
* set this to false.
@@ -445,6 +446,8 @@ struct sock {
u32 sk_reserved_mem;
int sk_forward_alloc;
u32 sk_tsflags;
+#define SK_BPF_CB_FLAG_TEST(SK, FLAG) ((SK)->sk_bpf_cb_flags & (FLAG))
+ u32 sk_bpf_cb_flags;
__cacheline_group_end(sock_write_rxtx);
__cacheline_group_begin(sock_write_tx);
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 4162afc6b5d0..e629e09b0b31 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -6903,6 +6903,13 @@ enum {
BPF_SOCK_OPS_ALL_CB_FLAGS = 0x7F,
};
+/* Definitions for bpf_sk_cb_flags */
+enum {
+ SK_BPF_CB_TX_TIMESTAMPING = 1<<0,
+ SK_BPF_CB_MASK = (SK_BPF_CB_TX_TIMESTAMPING - 1) |
+ SK_BPF_CB_TX_TIMESTAMPING
+};
+
/* List of known BPF sock_ops operators.
* New entries can only be added at the end
*/
@@ -7081,6 +7088,7 @@ enum {
TCP_BPF_SYN_IP = 1006, /* Copy the IP[46] and TCP header */
TCP_BPF_SYN_MAC = 1007, /* Copy the MAC, IP[46], and TCP header */
TCP_BPF_SOCK_OPS_CB_FLAGS = 1008, /* Get or Set TCP sock ops flags */
+ SK_BPF_CB_FLAGS = 1009, /* Used to set socket bpf flags */
};
enum {
diff --git a/net/core/filter.c b/net/core/filter.c
index 5b5996901ccc..8e2715b7ac8a 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5222,6 +5222,25 @@ static const struct bpf_func_proto bpf_get_socket_uid_proto = {
.arg1_type = ARG_PTR_TO_CTX,
};
+static int sk_bpf_set_get_cb_flags(struct sock *sk, char *optval, bool getopt)
+{
+ u32 sk_bpf_cb_flags;
+
+ if (getopt) {
+ *(u32 *)optval = sk->sk_bpf_cb_flags;
+ return 0;
+ }
+
+ sk_bpf_cb_flags = *(u32 *)optval;
+
+ if (sk_bpf_cb_flags & ~SK_BPF_CB_MASK)
+ return -EINVAL;
+
+ sk->sk_bpf_cb_flags = sk_bpf_cb_flags;
+
+ return 0;
+}
+
static int sol_socket_sockopt(struct sock *sk, int optname,
char *optval, int *optlen,
bool getopt)
@@ -5238,6 +5257,7 @@ static int sol_socket_sockopt(struct sock *sk, int optname,
case SO_MAX_PACING_RATE:
case SO_BINDTOIFINDEX:
case SO_TXREHASH:
+ case SK_BPF_CB_FLAGS:
if (*optlen != sizeof(int))
return -EINVAL;
break;
@@ -5247,6 +5267,9 @@ static int sol_socket_sockopt(struct sock *sk, int optname,
return -EINVAL;
}
+ if (optname == SK_BPF_CB_FLAGS)
+ return sk_bpf_set_get_cb_flags(sk, optval, getopt);
+
if (getopt) {
if (optname == SO_BINDTODEVICE)
return -EINVAL;
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 4162afc6b5d0..6b0a5b787b12 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -7081,6 +7081,7 @@ enum {
TCP_BPF_SYN_IP = 1006, /* Copy the IP[46] and TCP header */
TCP_BPF_SYN_MAC = 1007, /* Copy the MAC, IP[46], and TCP header */
TCP_BPF_SOCK_OPS_CB_FLAGS = 1008, /* Get or Set TCP sock ops flags */
+ SK_BPF_CB_FLAGS = 1009, /* Used to set socket bpf flags */
};
enum {
--
2.43.5
* [RFC PATCH net-next v6 02/13] net-timestamp: prepare for timestamping callbacks use
2025-01-21 1:28 [RFC PATCH net-next v6 00/13] net-timestamp: bpf extension to equip applications transparently Jason Xing
2025-01-21 1:28 ` [RFC PATCH net-next v6 01/13] net-timestamp: add support for bpf_setsockopt() Jason Xing
@ 2025-01-21 1:28 ` Jason Xing
2025-01-21 5:08 ` Jason Xing
2025-01-21 1:28 ` [RFC PATCH net-next v6 03/13] bpf: stop UDP sock accessing TCP fields in bpf callbacks Jason Xing
` (10 subsequent siblings)
12 siblings, 1 reply; 43+ messages in thread
From: Jason Xing @ 2025-01-21 1:28 UTC (permalink / raw)
To: davem, edumazet, kuba, pabeni, dsahern, willemdebruijn.kernel,
willemb, ast, daniel, andrii, martin.lau, eddyz87, song,
yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa, horms
Cc: bpf, netdev, Jason Xing
Later patches will introduce four callback points to report information
to user space based on this patch.
As to the skb initialization here, users can follow the three steps
below to fetch the shared info from the exported skb in the bpf
prog:
1. skops_kern = bpf_cast_to_kern_ctx(skops);
2. skb = skops_kern->skb;
3. shinfo = bpf_core_cast(skb->head + skb->end, struct skb_shared_info);
More details can be seen in the last selftest patch of the series.
Signed-off-by: Jason Xing <kerneljasonxing@gmail.com>
---
include/net/sock.h | 7 +++++++
net/core/sock.c | 13 +++++++++++++
2 files changed, 20 insertions(+)
diff --git a/include/net/sock.h b/include/net/sock.h
index 7916982343c6..6f4d54faba92 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -2923,6 +2923,13 @@ int sock_set_timestamping(struct sock *sk, int optname,
struct so_timestamping timestamping);
void sock_enable_timestamps(struct sock *sk);
+#if defined(CONFIG_CGROUP_BPF)
+void bpf_skops_tx_timestamping(struct sock *sk, struct sk_buff *skb, int op);
+#else
+static inline void bpf_skops_tx_timestamping(struct sock *sk, struct sk_buff *skb, int op)
+{
+}
+#endif
void sock_no_linger(struct sock *sk);
void sock_set_keepalive(struct sock *sk);
void sock_set_priority(struct sock *sk, u32 priority);
diff --git a/net/core/sock.c b/net/core/sock.c
index eae2ae70a2e0..e165163521dc 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -948,6 +948,19 @@ int sock_set_timestamping(struct sock *sk, int optname,
return 0;
}
+void bpf_skops_tx_timestamping(struct sock *sk, struct sk_buff *skb, int op)
+{
+ struct bpf_sock_ops_kern sock_ops;
+
+ memset(&sock_ops, 0, offsetof(struct bpf_sock_ops_kern, temp));
+ sock_ops.op = op;
+ sock_ops.is_fullsock = 1;
+ sock_ops.sk = sk;
+ bpf_skops_init_skb(&sock_ops, skb, 0);
+ /* Timestamping bpf extension supports only TCP and UDP full socket */
+ __cgroup_bpf_run_filter_sock_ops(sk, &sock_ops, CGROUP_SOCK_OPS);
+}
+
void sock_set_keepalive(struct sock *sk)
{
lock_sock(sk);
--
2.43.5
* [RFC PATCH net-next v6 03/13] bpf: stop UDP sock accessing TCP fields in bpf callbacks
2025-01-21 1:28 [RFC PATCH net-next v6 00/13] net-timestamp: bpf extension to equip applications transparently Jason Xing
2025-01-21 1:28 ` [RFC PATCH net-next v6 01/13] net-timestamp: add support for bpf_setsockopt() Jason Xing
2025-01-21 1:28 ` [RFC PATCH net-next v6 02/13] net-timestamp: prepare for timestamping callbacks use Jason Xing
@ 2025-01-21 1:28 ` Jason Xing
2025-01-24 23:40 ` Martin KaFai Lau
2025-01-21 1:28 ` [RFC PATCH net-next v6 04/13] bpf: stop UDP sock accessing TCP fields in sock_op BPF CALLs Jason Xing
` (9 subsequent siblings)
12 siblings, 1 reply; 43+ messages in thread
From: Jason Xing @ 2025-01-21 1:28 UTC (permalink / raw)
To: davem, edumazet, kuba, pabeni, dsahern, willemdebruijn.kernel,
willemb, ast, daniel, andrii, martin.lau, eddyz87, song,
yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa, horms
Cc: bpf, netdev, Jason Xing
Apply the new member allow_tcp_access in the existing callbacks
where is_fullsock is set to 1 to stop UDP sockets from accessing
struct tcp_sock; otherwise it could be a catastrophe leading to a panic.
For now, those existing callbacks are used only for TCP. I believe
that in the short run we will have support for UDP timestamping
callbacks.
Signed-off-by: Jason Xing <kerneljasonxing@gmail.com>
---
include/linux/filter.h | 1 +
include/net/tcp.h | 1 +
net/core/filter.c | 8 ++++----
net/ipv4/tcp_input.c | 2 ++
net/ipv4/tcp_output.c | 2 ++
5 files changed, 10 insertions(+), 4 deletions(-)
diff --git a/include/linux/filter.h b/include/linux/filter.h
index a3ea46281595..1b1333a90b4a 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -1508,6 +1508,7 @@ struct bpf_sock_ops_kern {
void *skb_data_end;
u8 op;
u8 is_fullsock;
+ u8 allow_tcp_access;
u8 remaining_opt_len;
u64 temp; /* temp and everything after is not
* initialized to 0 before calling
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 5b2b04835688..293047694710 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -2649,6 +2649,7 @@ static inline int tcp_call_bpf(struct sock *sk, int op, u32 nargs, u32 *args)
memset(&sock_ops, 0, offsetof(struct bpf_sock_ops_kern, temp));
if (sk_fullsock(sk)) {
sock_ops.is_fullsock = 1;
+ sock_ops.allow_tcp_access = 1;
sock_owned_by_me(sk);
}
diff --git a/net/core/filter.c b/net/core/filter.c
index 8e2715b7ac8a..fdd305b4cfbb 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -10381,10 +10381,10 @@ static u32 sock_ops_convert_ctx_access(enum bpf_access_type type,
} \
*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF( \
struct bpf_sock_ops_kern, \
- is_fullsock), \
+ allow_tcp_access), \
fullsock_reg, si->src_reg, \
offsetof(struct bpf_sock_ops_kern, \
- is_fullsock)); \
+ allow_tcp_access)); \
*insn++ = BPF_JMP_IMM(BPF_JEQ, fullsock_reg, 0, jmp); \
if (si->dst_reg == si->src_reg) \
*insn++ = BPF_LDX_MEM(BPF_DW, reg, si->src_reg, \
@@ -10469,10 +10469,10 @@ static u32 sock_ops_convert_ctx_access(enum bpf_access_type type,
temp)); \
*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF( \
struct bpf_sock_ops_kern, \
- is_fullsock), \
+ allow_tcp_access), \
reg, si->dst_reg, \
offsetof(struct bpf_sock_ops_kern, \
- is_fullsock)); \
+ allow_tcp_access)); \
*insn++ = BPF_JMP_IMM(BPF_JEQ, reg, 0, 2); \
*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF( \
struct bpf_sock_ops_kern, sk),\
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index eb82e01da911..77185479ed5e 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -169,6 +169,7 @@ static void bpf_skops_parse_hdr(struct sock *sk, struct sk_buff *skb)
memset(&sock_ops, 0, offsetof(struct bpf_sock_ops_kern, temp));
sock_ops.op = BPF_SOCK_OPS_PARSE_HDR_OPT_CB;
sock_ops.is_fullsock = 1;
+ sock_ops.allow_tcp_access = 1;
sock_ops.sk = sk;
bpf_skops_init_skb(&sock_ops, skb, tcp_hdrlen(skb));
@@ -185,6 +186,7 @@ static void bpf_skops_established(struct sock *sk, int bpf_op,
memset(&sock_ops, 0, offsetof(struct bpf_sock_ops_kern, temp));
sock_ops.op = bpf_op;
sock_ops.is_fullsock = 1;
+ sock_ops.allow_tcp_access = 1;
sock_ops.sk = sk;
/* sk with TCP_REPAIR_ON does not have skb in tcp_finish_connect */
if (skb)
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 0e5b9a654254..695749807c09 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -522,6 +522,7 @@ static void bpf_skops_hdr_opt_len(struct sock *sk, struct sk_buff *skb,
sock_owned_by_me(sk);
sock_ops.is_fullsock = 1;
+ sock_ops.allow_tcp_access = 1;
sock_ops.sk = sk;
}
@@ -567,6 +568,7 @@ static void bpf_skops_write_hdr_opt(struct sock *sk, struct sk_buff *skb,
sock_owned_by_me(sk);
sock_ops.is_fullsock = 1;
+ sock_ops.allow_tcp_access = 1;
sock_ops.sk = sk;
}
--
2.43.5
* [RFC PATCH net-next v6 04/13] bpf: stop UDP sock accessing TCP fields in sock_op BPF CALLs
2025-01-21 1:28 [RFC PATCH net-next v6 00/13] net-timestamp: bpf extension to equip applications transparently Jason Xing
` (2 preceding siblings ...)
2025-01-21 1:28 ` [RFC PATCH net-next v6 03/13] bpf: stop UDP sock accessing TCP fields in bpf callbacks Jason Xing
@ 2025-01-21 1:28 ` Jason Xing
2025-01-25 0:28 ` Martin KaFai Lau
2025-01-21 1:28 ` [RFC PATCH net-next v6 05/13] net-timestamp: prepare for isolating two modes of SO_TIMESTAMPING Jason Xing
` (8 subsequent siblings)
12 siblings, 1 reply; 43+ messages in thread
From: Jason Xing @ 2025-01-21 1:28 UTC (permalink / raw)
To: davem, edumazet, kuba, pabeni, dsahern, willemdebruijn.kernel,
willemb, ast, daniel, andrii, martin.lau, eddyz87, song,
yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa, horms
Cc: bpf, netdev, Jason Xing
In the next round, we will support the UDP protocol for the
SO_TIMESTAMPING bpf extension, so we need to ensure there is no safety
problem, which is usually caused by a UDP socket trying to access
TCP fields.
The approaches can be categorized into two groups:
1. add a TCP protocol check
2. add a sock op check
Signed-off-by: Jason Xing <kerneljasonxing@gmail.com>
---
net/core/filter.c | 19 +++++++++++++++++--
1 file changed, 17 insertions(+), 2 deletions(-)
diff --git a/net/core/filter.c b/net/core/filter.c
index fdd305b4cfbb..934431886876 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5523,6 +5523,11 @@ static int __bpf_setsockopt(struct sock *sk, int level, int optname,
return -EINVAL;
}
+static bool is_locked_tcp_sock_ops(struct bpf_sock_ops_kern *bpf_sock)
+{
+ return bpf_sock->op <= BPF_SOCK_OPS_WRITE_HDR_OPT_CB;
+}
+
static int _bpf_setsockopt(struct sock *sk, int level, int optname,
char *optval, int optlen)
{
@@ -5673,7 +5678,12 @@ static const struct bpf_func_proto bpf_sock_addr_getsockopt_proto = {
BPF_CALL_5(bpf_sock_ops_setsockopt, struct bpf_sock_ops_kern *, bpf_sock,
int, level, int, optname, char *, optval, int, optlen)
{
- return _bpf_setsockopt(bpf_sock->sk, level, optname, optval, optlen);
+ struct sock *sk = bpf_sock->sk;
+
+ if (is_locked_tcp_sock_ops(bpf_sock) && sk_fullsock(sk))
+ sock_owned_by_me(sk);
+
+ return __bpf_setsockopt(sk, level, optname, optval, optlen);
}
static const struct bpf_func_proto bpf_sock_ops_setsockopt_proto = {
@@ -5759,6 +5769,7 @@ BPF_CALL_5(bpf_sock_ops_getsockopt, struct bpf_sock_ops_kern *, bpf_sock,
int, level, int, optname, char *, optval, int, optlen)
{
if (IS_ENABLED(CONFIG_INET) && level == SOL_TCP &&
+ bpf_sock->sk->sk_protocol == IPPROTO_TCP &&
optname >= TCP_BPF_SYN && optname <= TCP_BPF_SYN_MAC) {
int ret, copy_len = 0;
const u8 *start;
@@ -5800,7 +5811,8 @@ BPF_CALL_2(bpf_sock_ops_cb_flags_set, struct bpf_sock_ops_kern *, bpf_sock,
struct sock *sk = bpf_sock->sk;
int val = argval & BPF_SOCK_OPS_ALL_CB_FLAGS;
- if (!IS_ENABLED(CONFIG_INET) || !sk_fullsock(sk))
+ if (!IS_ENABLED(CONFIG_INET) || !sk_fullsock(sk) ||
+ sk->sk_protocol != IPPROTO_TCP)
return -EINVAL;
tcp_sk(sk)->bpf_sock_ops_cb_flags = val;
@@ -7609,6 +7621,9 @@ BPF_CALL_4(bpf_sock_ops_load_hdr_opt, struct bpf_sock_ops_kern *, bpf_sock,
u8 search_kind, search_len, copy_len, magic_len;
int ret;
+ if (!is_locked_tcp_sock_ops(bpf_sock))
+ return -EOPNOTSUPP;
+
/* 2 byte is the minimal option len except TCPOPT_NOP and
* TCPOPT_EOL which are useless for the bpf prog to learn
* and this helper disallow loading them also.
--
2.43.5
* [RFC PATCH net-next v6 05/13] net-timestamp: prepare for isolating two modes of SO_TIMESTAMPING
2025-01-21 1:28 [RFC PATCH net-next v6 00/13] net-timestamp: bpf extension to equip applications transparently Jason Xing
` (3 preceding siblings ...)
2025-01-21 1:28 ` [RFC PATCH net-next v6 04/13] bpf: stop UDP sock accessing TCP fields in sock_op BPF CALLs Jason Xing
@ 2025-01-21 1:28 ` Jason Xing
2025-01-21 1:28 ` [RFC PATCH net-next v6 06/13] net-timestamp: support SCM_TSTAMP_SCHED for bpf extension Jason Xing
` (7 subsequent siblings)
12 siblings, 0 replies; 43+ messages in thread
From: Jason Xing @ 2025-01-21 1:28 UTC (permalink / raw)
To: davem, edumazet, kuba, pabeni, dsahern, willemdebruijn.kernel,
willemb, ast, daniel, andrii, martin.lau, eddyz87, song,
yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa, horms
Cc: bpf, netdev, Jason Xing
No functional changes here. I add skb_enable_app_tstamp() to test
whether the orig_skb matches the usage of application SO_TIMESTAMPING,
and skb_sw_tstamp_tx() to distinguish the software and hardware
timestamps when tsflag is SCM_TSTAMP_SND.
Also, I deliberately distinguish the software and hardware
SCM_TSTAMP_SND timestamps by passing a 'sw' parameter, in order to
avoid the case where the hardware goes wrong and passes a NULL
hwtstamps, which is unlikely but possible. If that happened, the bpf
prog would mistake it for a software timestamp, and it would be
hard to recognize. Let's make the timestamping part more robust.
After this patch, I will soon add checks for bpf SO_TIMESTAMPING.
In this way, we can support the two modes in parallel.
Signed-off-by: Jason Xing <kerneljasonxing@gmail.com>
---
include/linux/skbuff.h | 13 +++++++------
net/core/dev.c | 2 +-
net/core/skbuff.c | 32 ++++++++++++++++++++++++++++++--
net/ipv4/tcp_input.c | 3 ++-
4 files changed, 40 insertions(+), 10 deletions(-)
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index bb2b751d274a..dfc419281cc9 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -39,6 +39,7 @@
#include <net/net_debug.h>
#include <net/dropreason-core.h>
#include <net/netmem.h>
+#include <uapi/linux/errqueue.h>
/**
* DOC: skb checksums
@@ -4533,18 +4534,18 @@ void skb_complete_tx_timestamp(struct sk_buff *skb,
void __skb_tstamp_tx(struct sk_buff *orig_skb, const struct sk_buff *ack_skb,
struct skb_shared_hwtstamps *hwtstamps,
- struct sock *sk, int tstype);
+ struct sock *sk, bool sw, int tstype);
/**
- * skb_tstamp_tx - queue clone of skb with send time stamps
+ * skb_tstamp_tx - queue clone of skb with send HARDWARE timestamps
* @orig_skb: the original outgoing packet
* @hwtstamps: hardware time stamps, may be NULL if not available
*
* If the skb has a socket associated, then this function clones the
* skb (thus sharing the actual data and optional structures), stores
- * the optional hardware time stamping information (if non NULL) or
- * generates a software time stamp (otherwise), then queues the clone
- * to the error queue of the socket. Errors are silently ignored.
+ * the optional hardware time stamping information (if non NULL) then
+ * queues the clone to the error queue of the socket. Errors are
+ * silently ignored.
*/
void skb_tstamp_tx(struct sk_buff *orig_skb,
struct skb_shared_hwtstamps *hwtstamps);
@@ -4565,7 +4566,7 @@ static inline void skb_tx_timestamp(struct sk_buff *skb)
{
skb_clone_tx_timestamp(skb);
if (skb_shinfo(skb)->tx_flags & SKBTX_SW_TSTAMP)
- skb_tstamp_tx(skb, NULL);
+ __skb_tstamp_tx(skb, NULL, NULL, skb->sk, true, SCM_TSTAMP_SND);
}
/**
diff --git a/net/core/dev.c b/net/core/dev.c
index afa2282f2604..d77b8389753e 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4501,7 +4501,7 @@ int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
skb_assert_len(skb);
if (unlikely(skb_shinfo(skb)->tx_flags & SKBTX_SCHED_TSTAMP))
- __skb_tstamp_tx(skb, NULL, NULL, skb->sk, SCM_TSTAMP_SCHED);
+ __skb_tstamp_tx(skb, NULL, NULL, skb->sk, true, SCM_TSTAMP_SCHED);
/* Disable soft irqs for various locks below. Also
* stops preemption for RCU.
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index a441613a1e6c..6042961dfc02 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -5539,10 +5539,35 @@ void skb_complete_tx_timestamp(struct sk_buff *skb,
}
EXPORT_SYMBOL_GPL(skb_complete_tx_timestamp);
+static bool skb_enable_app_tstamp(struct sk_buff *skb, int tstype, bool sw)
+{
+ int flag;
+
+ switch (tstype) {
+ case SCM_TSTAMP_SCHED:
+ flag = SKBTX_SCHED_TSTAMP;
+ break;
+ case SCM_TSTAMP_SND:
+ flag = sw ? SKBTX_SW_TSTAMP : SKBTX_HW_TSTAMP;
+ break;
+ case SCM_TSTAMP_ACK:
+ if (TCP_SKB_CB(skb)->txstamp_ack)
+ return true;
+ fallthrough;
+ default:
+ return false;
+ }
+
+ if (skb_shinfo(skb)->tx_flags & flag)
+ return true;
+
+ return false;
+}
+
void __skb_tstamp_tx(struct sk_buff *orig_skb,
const struct sk_buff *ack_skb,
struct skb_shared_hwtstamps *hwtstamps,
- struct sock *sk, int tstype)
+ struct sock *sk, bool sw, int tstype)
{
struct sk_buff *skb;
bool tsonly, opt_stats = false;
@@ -5551,6 +5576,9 @@ void __skb_tstamp_tx(struct sk_buff *orig_skb,
if (!sk)
return;
+ if (!skb_enable_app_tstamp(orig_skb, tstype, sw))
+ return;
+
tsflags = READ_ONCE(sk->sk_tsflags);
if (!hwtstamps && !(tsflags & SOF_TIMESTAMPING_OPT_TX_SWHW) &&
skb_shinfo(orig_skb)->tx_flags & SKBTX_IN_PROGRESS)
@@ -5599,7 +5627,7 @@ EXPORT_SYMBOL_GPL(__skb_tstamp_tx);
void skb_tstamp_tx(struct sk_buff *orig_skb,
struct skb_shared_hwtstamps *hwtstamps)
{
- return __skb_tstamp_tx(orig_skb, NULL, hwtstamps, orig_skb->sk,
+ return __skb_tstamp_tx(orig_skb, NULL, hwtstamps, orig_skb->sk, false,
SCM_TSTAMP_SND);
}
EXPORT_SYMBOL_GPL(skb_tstamp_tx);
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 77185479ed5e..62252702929d 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3330,7 +3330,8 @@ static void tcp_ack_tstamp(struct sock *sk, struct sk_buff *skb,
if (!before(shinfo->tskey, prior_snd_una) &&
before(shinfo->tskey, tcp_sk(sk)->snd_una)) {
tcp_skb_tsorted_save(skb) {
- __skb_tstamp_tx(skb, ack_skb, NULL, sk, SCM_TSTAMP_ACK);
+ __skb_tstamp_tx(skb, ack_skb, NULL, sk, true,
+ SCM_TSTAMP_ACK);
} tcp_skb_tsorted_restore(skb);
}
}
--
2.43.5
* [RFC PATCH net-next v6 06/13] net-timestamp: support SCM_TSTAMP_SCHED for bpf extension
2025-01-21 1:28 [RFC PATCH net-next v6 00/13] net-timestamp: bpf extension to equip applications transparently Jason Xing
` (4 preceding siblings ...)
2025-01-21 1:28 ` [RFC PATCH net-next v6 05/13] net-timestamp: prepare for isolating two modes of SO_TIMESTAMPING Jason Xing
@ 2025-01-21 1:28 ` Jason Xing
2025-01-25 0:38 ` Martin KaFai Lau
2025-01-21 1:28 ` [RFC PATCH net-next v6 07/13] net-timestamp: support sw SCM_TSTAMP_SND " Jason Xing
` (6 subsequent siblings)
12 siblings, 1 reply; 43+ messages in thread
From: Jason Xing @ 2025-01-21 1:28 UTC (permalink / raw)
To: davem, edumazet, kuba, pabeni, dsahern, willemdebruijn.kernel,
willemb, ast, daniel, andrii, martin.lau, eddyz87, song,
yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa, horms
Cc: bpf, netdev, Jason Xing
Introduce SKBTX_BPF as an indicator telling us whether the skb
should be traced by the bpf prog.
Signed-off-by: Jason Xing <kerneljasonxing@gmail.com>
---
include/linux/skbuff.h | 6 +++++-
include/uapi/linux/bpf.h | 5 +++++
net/core/dev.c | 3 ++-
net/core/skbuff.c | 23 +++++++++++++++++++++++
tools/include/uapi/linux/bpf.h | 5 +++++
5 files changed, 40 insertions(+), 2 deletions(-)
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index dfc419281cc9..35c2e864dd4b 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -490,10 +490,14 @@ enum {
/* generate software time stamp when entering packet scheduling */
SKBTX_SCHED_TSTAMP = 1 << 6,
+
+ /* used for bpf extension when a bpf program is loaded */
+ SKBTX_BPF = 1 << 7,
};
#define SKBTX_ANY_SW_TSTAMP (SKBTX_SW_TSTAMP | \
- SKBTX_SCHED_TSTAMP)
+ SKBTX_SCHED_TSTAMP | \
+ SKBTX_BPF)
#define SKBTX_ANY_TSTAMP (SKBTX_HW_TSTAMP | \
SKBTX_HW_TSTAMP_USE_CYCLES | \
SKBTX_ANY_SW_TSTAMP)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index e629e09b0b31..72f93c6e45c1 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -7022,6 +7022,11 @@ enum {
* by the kernel or the
* earlier bpf-progs.
*/
+ BPF_SOCK_OPS_TS_SCHED_OPT_CB, /* Called when skb is passing through
+ * dev layer when SO_TIMESTAMPING
+ * feature is on. It indicates the
+ * recorded timestamp.
+ */
};
/* List of TCP states. There is a build check in net/ipv4/tcp.c to detect
diff --git a/net/core/dev.c b/net/core/dev.c
index d77b8389753e..4f291459d6b1 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4500,7 +4500,8 @@ int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
skb_reset_mac_header(skb);
skb_assert_len(skb);
- if (unlikely(skb_shinfo(skb)->tx_flags & SKBTX_SCHED_TSTAMP))
+ if (unlikely(skb_shinfo(skb)->tx_flags &
+ (SKBTX_SCHED_TSTAMP | SKBTX_BPF)))
__skb_tstamp_tx(skb, NULL, NULL, skb->sk, true, SCM_TSTAMP_SCHED);
/* Disable soft irqs for various locks below. Also
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 6042961dfc02..d19d577b996f 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -5564,6 +5564,24 @@ static bool skb_enable_app_tstamp(struct sk_buff *skb, int tstype, bool sw)
return false;
}
+static void skb_tstamp_tx_bpf(struct sk_buff *skb, struct sock *sk, int tstype)
+{
+ int op;
+
+ if (!sk)
+ return;
+
+ switch (tstype) {
+ case SCM_TSTAMP_SCHED:
+ op = BPF_SOCK_OPS_TS_SCHED_OPT_CB;
+ break;
+ default:
+ return;
+ }
+
+ bpf_skops_tx_timestamping(sk, skb, op);
+}
+
void __skb_tstamp_tx(struct sk_buff *orig_skb,
const struct sk_buff *ack_skb,
struct skb_shared_hwtstamps *hwtstamps,
@@ -5576,6 +5594,11 @@ void __skb_tstamp_tx(struct sk_buff *orig_skb,
if (!sk)
return;
+ /* bpf extension feature entry */
+ if (skb_shinfo(orig_skb)->tx_flags & SKBTX_BPF)
+ skb_tstamp_tx_bpf(orig_skb, sk, tstype);
+
+ /* application feature entry */
if (!skb_enable_app_tstamp(orig_skb, tstype, sw))
return;
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 6b0a5b787b12..891217ab6d2d 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -7015,6 +7015,11 @@ enum {
* by the kernel or the
* earlier bpf-progs.
*/
+ BPF_SOCK_OPS_TS_SCHED_OPT_CB, /* Called when skb is passing through
+ * dev layer when SO_TIMESTAMPING
+ * feature is on. It indicates the
+ * recorded timestamp.
+ */
};
/* List of TCP states. There is a build check in net/ipv4/tcp.c to detect
--
2.43.5
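The widened condition in __dev_queue_xmit() above can be illustrated with a small standalone C sketch. SKBTX_BPF = 1 << 7 matches the value introduced later in this series; the SKBTX_SCHED_TSTAMP bit position here is only an illustrative stand-in for the kernel's enum:

```c
#include <assert.h>

/* Illustrative flag values: SKBTX_BPF = 1 << 7 matches this series,
 * the SKBTX_SCHED_TSTAMP position is a stand-in for the kernel enum.
 */
#define SKBTX_SCHED_TSTAMP	(1 << 6)
#define SKBTX_BPF		(1 << 7)

/* Mirrors the new condition: the SCM_TSTAMP_SCHED hook now fires when
 * either the application (SKBTX_SCHED_TSTAMP) or a bpf prog (SKBTX_BPF)
 * requested tx timestamping.
 */
static int sched_tstamp_wanted(unsigned char tx_flags)
{
	return (tx_flags & (SKBTX_SCHED_TSTAMP | SKBTX_BPF)) != 0;
}
```

The point of OR-ing the two bits into one mask is that a bpf-only request reaches __skb_tstamp_tx() without adding a second branch to the hot path.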
^ permalink raw reply related [flat|nested] 43+ messages in thread
* [RFC PATCH net-next v6 07/13] net-timestamp: support sw SCM_TSTAMP_SND for bpf extension
2025-01-21 1:28 [RFC PATCH net-next v6 00/13] net-timestamp: bpf extension to equip applications transparently Jason Xing
` (5 preceding siblings ...)
2025-01-21 1:28 ` [RFC PATCH net-next v6 06/13] net-timestamp: support SCM_TSTAMP_SCHED for bpf extension Jason Xing
@ 2025-01-21 1:28 ` Jason Xing
2025-01-25 0:40 ` Martin KaFai Lau
2025-01-21 1:28 ` [RFC PATCH net-next v6 08/13] net-timestamp: support hw " Jason Xing
` (5 subsequent siblings)
12 siblings, 1 reply; 43+ messages in thread
From: Jason Xing @ 2025-01-21 1:28 UTC (permalink / raw)
To: davem, edumazet, kuba, pabeni, dsahern, willemdebruijn.kernel,
willemb, ast, daniel, andrii, martin.lau, eddyz87, song,
yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa, horms
Cc: bpf, netdev, Jason Xing
Support the SCM_TSTAMP_SND case, so the bpf extension gets the software
timestamp when the driver is about to hand the skb to the NIC. Hardware
timestamp support follows in the next patch.
Signed-off-by: Jason Xing <kerneljasonxing@gmail.com>
---
include/linux/skbuff.h | 2 +-
include/uapi/linux/bpf.h | 5 +++++
net/core/skbuff.c | 10 ++++++++--
tools/include/uapi/linux/bpf.h | 5 +++++
4 files changed, 19 insertions(+), 3 deletions(-)
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 35c2e864dd4b..de8d3bd311f5 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -4569,7 +4569,7 @@ void skb_tstamp_tx(struct sk_buff *orig_skb,
static inline void skb_tx_timestamp(struct sk_buff *skb)
{
skb_clone_tx_timestamp(skb);
- if (skb_shinfo(skb)->tx_flags & SKBTX_SW_TSTAMP)
+ if (skb_shinfo(skb)->tx_flags & (SKBTX_SW_TSTAMP | SKBTX_BPF))
__skb_tstamp_tx(skb, NULL, NULL, skb->sk, true, SCM_TSTAMP_SND);
}
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 72f93c6e45c1..a6d761f07f67 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -7027,6 +7027,11 @@ enum {
* feature is on. It indicates the
* recorded timestamp.
*/
+ BPF_SOCK_OPS_TS_SW_OPT_CB, /* Called when skb is about to send
+ * to the nic when SO_TIMESTAMPING
+ * feature is on. It indicates the
+ * recorded timestamp.
+ */
};
/* List of TCP states. There is a build check in net/ipv4/tcp.c to detect
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index d19d577b996f..288eb9869827 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -5564,7 +5564,8 @@ static bool skb_enable_app_tstamp(struct sk_buff *skb, int tstype, bool sw)
return false;
}
-static void skb_tstamp_tx_bpf(struct sk_buff *skb, struct sock *sk, int tstype)
+static void skb_tstamp_tx_bpf(struct sk_buff *skb, struct sock *sk,
+ int tstype, bool sw)
{
int op;
@@ -5575,6 +5576,11 @@ static void skb_tstamp_tx_bpf(struct sk_buff *skb, struct sock *sk, int tstype)
case SCM_TSTAMP_SCHED:
op = BPF_SOCK_OPS_TS_SCHED_OPT_CB;
break;
+ case SCM_TSTAMP_SND:
+ if (!sw)
+ return;
+ op = BPF_SOCK_OPS_TS_SW_OPT_CB;
+ break;
default:
return;
}
@@ -5596,7 +5602,7 @@ void __skb_tstamp_tx(struct sk_buff *orig_skb,
/* bpf extension feature entry */
if (skb_shinfo(orig_skb)->tx_flags & SKBTX_BPF)
- skb_tstamp_tx_bpf(orig_skb, sk, tstype);
+ skb_tstamp_tx_bpf(orig_skb, sk, tstype, sw);
/* application feature entry */
if (!skb_enable_app_tstamp(orig_skb, tstype, sw))
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 891217ab6d2d..73fc0a95c9ca 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -7020,6 +7020,11 @@ enum {
* feature is on. It indicates the
* recorded timestamp.
*/
+ BPF_SOCK_OPS_TS_SW_OPT_CB, /* Called when skb is about to send
+ * to the nic when SO_TIMESTAMPING
+ * feature is on. It indicates the
+ * recorded timestamp.
+ */
};
/* List of TCP states. There is a build check in net/ipv4/tcp.c to detect
--
2.43.5
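The SND handling at this point in the series can be sketched in plain C (the enum values and SCM_TSTAMP_* constants below are illustrative stand-ins, not the real uapi values):

```c
#include <assert.h>

/* Illustrative stand-ins for the uapi values. */
enum {
	BPF_SOCK_OPS_TS_SCHED_OPT_CB = 1,
	BPF_SOCK_OPS_TS_SW_OPT_CB,
};

#define SCM_TSTAMP_SCHED	1
#define SCM_TSTAMP_SND		2

/* Mirrors skb_tstamp_tx_bpf() as of this patch: SND is reported to the
 * bpf callback only for software timestamps (sw != 0); the hardware
 * case returns without a callback until the following patch.  -1 marks
 * "no callback".
 */
static int ts_bpf_op(int tstype, int sw)
{
	switch (tstype) {
	case SCM_TSTAMP_SCHED:
		return BPF_SOCK_OPS_TS_SCHED_OPT_CB;
	case SCM_TSTAMP_SND:
		if (!sw)
			return -1;
		return BPF_SOCK_OPS_TS_SW_OPT_CB;
	default:
		return -1;
	}
}
```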
* [RFC PATCH net-next v6 08/13] net-timestamp: support hw SCM_TSTAMP_SND for bpf extension
2025-01-21 1:28 [RFC PATCH net-next v6 00/13] net-timestamp: bpf extension to equip applications transparently Jason Xing
` (6 preceding siblings ...)
2025-01-21 1:28 ` [RFC PATCH net-next v6 07/13] net-timestamp: support sw SCM_TSTAMP_SND " Jason Xing
@ 2025-01-21 1:28 ` Jason Xing
2025-01-25 0:46 ` Martin KaFai Lau
2025-01-21 1:28 ` [RFC PATCH net-next v6 09/13] net-timestamp: support SCM_TSTAMP_ACK " Jason Xing
` (4 subsequent siblings)
12 siblings, 1 reply; 43+ messages in thread
From: Jason Xing @ 2025-01-21 1:28 UTC (permalink / raw)
To: davem, edumazet, kuba, pabeni, dsahern, willemdebruijn.kernel,
willemb, ast, daniel, andrii, martin.lau, eddyz87, song,
yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa, horms
Cc: bpf, netdev, Jason Xing
Finish the hardware part, so the bpf program can fetch the hardware
timestamp from the skb directly.
To avoid changing the many driver call sites that test SKBTX_HW_TSTAMP,
redefine that name as a mask covering the bpf request as well, and keep
the raw request bit as __SKBTX_HW_TSTAMP for the few places that must
see only a real hardware request.
Signed-off-by: Jason Xing <kerneljasonxing@gmail.com>
---
include/linux/skbuff.h | 4 +++-
include/uapi/linux/bpf.h | 5 +++++
net/core/skbuff.c | 11 ++++++-----
net/dsa/user.c | 2 +-
net/socket.c | 2 +-
tools/include/uapi/linux/bpf.h | 5 +++++
6 files changed, 21 insertions(+), 8 deletions(-)
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index de8d3bd311f5..df2d790ae36b 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -471,7 +471,7 @@ struct skb_shared_hwtstamps {
/* Definitions for tx_flags in struct skb_shared_info */
enum {
/* generate hardware time stamp */
- SKBTX_HW_TSTAMP = 1 << 0,
+ __SKBTX_HW_TSTAMP = 1 << 0,
/* generate software time stamp when queueing packet to NIC */
SKBTX_SW_TSTAMP = 1 << 1,
@@ -495,6 +495,8 @@ enum {
SKBTX_BPF = 1 << 7,
};
+#define SKBTX_HW_TSTAMP (__SKBTX_HW_TSTAMP | SKBTX_BPF)
+
#define SKBTX_ANY_SW_TSTAMP (SKBTX_SW_TSTAMP | \
SKBTX_SCHED_TSTAMP | \
SKBTX_BPF)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index a6d761f07f67..8936e1061e71 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -7032,6 +7032,11 @@ enum {
* feature is on. It indicates the
* recorded timestamp.
*/
+ BPF_SOCK_OPS_TS_HW_OPT_CB, /* Called in hardware phase when
+ * SO_TIMESTAMPING feature is on.
+ * It indicates the recorded
+ * timestamp.
+ */
};
/* List of TCP states. There is a build check in net/ipv4/tcp.c to detect
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 288eb9869827..c769feae5162 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -5548,7 +5548,7 @@ static bool skb_enable_app_tstamp(struct sk_buff *skb, int tstype, bool sw)
flag = SKBTX_SCHED_TSTAMP;
break;
case SCM_TSTAMP_SND:
- flag = sw ? SKBTX_SW_TSTAMP : SKBTX_HW_TSTAMP;
+ flag = sw ? SKBTX_SW_TSTAMP : __SKBTX_HW_TSTAMP;
break;
case SCM_TSTAMP_ACK:
if (TCP_SKB_CB(skb)->txstamp_ack)
@@ -5565,7 +5565,8 @@ static bool skb_enable_app_tstamp(struct sk_buff *skb, int tstype, bool sw)
}
static void skb_tstamp_tx_bpf(struct sk_buff *skb, struct sock *sk,
- int tstype, bool sw)
+ int tstype, bool sw,
+ struct skb_shared_hwtstamps *hwtstamps)
{
int op;
@@ -5577,9 +5578,9 @@ static void skb_tstamp_tx_bpf(struct sk_buff *skb, struct sock *sk,
op = BPF_SOCK_OPS_TS_SCHED_OPT_CB;
break;
case SCM_TSTAMP_SND:
+ op = sw ? BPF_SOCK_OPS_TS_SW_OPT_CB : BPF_SOCK_OPS_TS_HW_OPT_CB;
if (!sw)
- return;
- op = BPF_SOCK_OPS_TS_SW_OPT_CB;
+ *skb_hwtstamps(skb) = *hwtstamps;
break;
default:
return;
@@ -5602,7 +5603,7 @@ void __skb_tstamp_tx(struct sk_buff *orig_skb,
/* bpf extension feature entry */
if (skb_shinfo(orig_skb)->tx_flags & SKBTX_BPF)
- skb_tstamp_tx_bpf(orig_skb, sk, tstype, sw);
+ skb_tstamp_tx_bpf(orig_skb, sk, tstype, sw, hwtstamps);
/* application feature entry */
if (!skb_enable_app_tstamp(orig_skb, tstype, sw))
diff --git a/net/dsa/user.c b/net/dsa/user.c
index 291ab1b4acc4..ae715bf0ae75 100644
--- a/net/dsa/user.c
+++ b/net/dsa/user.c
@@ -897,7 +897,7 @@ static void dsa_skb_tx_timestamp(struct dsa_user_priv *p,
{
struct dsa_switch *ds = p->dp->ds;
- if (!(skb_shinfo(skb)->tx_flags & SKBTX_HW_TSTAMP))
+ if (!(skb_shinfo(skb)->tx_flags & __SKBTX_HW_TSTAMP))
return;
if (!ds->ops->port_txtstamp)
diff --git a/net/socket.c b/net/socket.c
index 262a28b59c7f..70eabb510ce6 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -676,7 +676,7 @@ void __sock_tx_timestamp(__u32 tsflags, __u8 *tx_flags)
u8 flags = *tx_flags;
if (tsflags & SOF_TIMESTAMPING_TX_HARDWARE) {
- flags |= SKBTX_HW_TSTAMP;
+ flags |= __SKBTX_HW_TSTAMP;
/* PTP hardware clocks can provide a free running cycle counter
* as a time base for virtual clocks. Tell driver to use the
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 73fc0a95c9ca..f1583b5814ea 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -7025,6 +7025,11 @@ enum {
* feature is on. It indicates the
* recorded timestamp.
*/
+ BPF_SOCK_OPS_TS_HW_OPT_CB, /* Called in hardware phase when
+ * SO_TIMESTAMPING feature is on.
+ * It indicates the recorded
+ * timestamp.
+ */
};
/* List of TCP states. There is a build check in net/ipv4/tcp.c to detect
--
2.43.5
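The renaming trick in this patch can be demonstrated with a minimal sketch (flag values taken from the diff above):

```c
#include <assert.h>

/* Values from the diff: __SKBTX_HW_TSTAMP is the raw hardware request
 * bit, SKBTX_BPF the bpf request bit, and SKBTX_HW_TSTAMP becomes a
 * mask covering both.
 */
#define __SKBTX_HW_TSTAMP	(1 << 0)
#define SKBTX_BPF		(1 << 7)
#define SKBTX_HW_TSTAMP		(__SKBTX_HW_TSTAMP | SKBTX_BPF)

/* Unchanged callers that test SKBTX_HW_TSTAMP now also match a
 * bpf-only request, without editing each call site ...
 */
static int hw_path_wanted(unsigned char tx_flags)
{
	return (tx_flags & SKBTX_HW_TSTAMP) != 0;
}

/* ... while code that needs a real hardware request only (e.g.
 * dsa_skb_tx_timestamp() and __sock_tx_timestamp() in the diff)
 * switches to the raw bit.
 */
static int driver_hw_requested(unsigned char tx_flags)
{
	return (tx_flags & __SKBTX_HW_TSTAMP) != 0;
}
```

This is why only two call sites needed editing: every other SKBTX_HW_TSTAMP test silently picks up the SKBTX_BPF case through the macro.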
* [RFC PATCH net-next v6 09/13] net-timestamp: support SCM_TSTAMP_ACK for bpf extension
2025-01-21 1:28 [RFC PATCH net-next v6 00/13] net-timestamp: bpf extension to equip applications transparently Jason Xing
` (7 preceding siblings ...)
2025-01-21 1:28 ` [RFC PATCH net-next v6 08/13] net-timestamp: support hw " Jason Xing
@ 2025-01-21 1:28 ` Jason Xing
2025-01-21 1:28 ` [RFC PATCH net-next v6 10/13] net-timestamp: make TCP tx timestamp bpf extension work Jason Xing
` (3 subsequent siblings)
12 siblings, 0 replies; 43+ messages in thread
From: Jason Xing @ 2025-01-21 1:28 UTC (permalink / raw)
To: davem, edumazet, kuba, pabeni, dsahern, willemdebruijn.kernel,
willemb, ast, daniel, andrii, martin.lau, eddyz87, song,
yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa, horms
Cc: bpf, netdev, Jason Xing
Handle the ACK timestamp case. Testing the SKBTX_BPF flag would work,
but introduce a new txstamp_ack_bpf bit instead to avoid cache line
misses in tcp_ack_tstamp(). To be more specific: in most cases
txstamp_ack is zero, so normal flows never touch skb_shinfo() and the
function stays out of the hot-spot profiles. The new txstamp_ack_bpf
member, living in the same tcp_skb_cb word, preserves that behaviour.
Signed-off-by: Jason Xing <kerneljasonxing@gmail.com>
---
include/net/tcp.h | 3 ++-
include/uapi/linux/bpf.h | 5 +++++
net/core/skbuff.c | 3 +++
net/ipv4/tcp_input.c | 3 ++-
net/ipv4/tcp_output.c | 5 +++++
tools/include/uapi/linux/bpf.h | 5 +++++
6 files changed, 22 insertions(+), 2 deletions(-)
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 293047694710..88429e422301 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -959,9 +959,10 @@ struct tcp_skb_cb {
__u8 sacked; /* State flags for SACK. */
__u8 ip_dsfield; /* IPv4 tos or IPv6 dsfield */
__u8 txstamp_ack:1, /* Record TX timestamp for ack? */
+ txstamp_ack_bpf:1, /* ack timestamp for bpf use */
eor:1, /* Is skb MSG_EOR marked? */
has_rxtstamp:1, /* SKB has a RX timestamp */
- unused:5;
+ unused:4;
__u32 ack_seq; /* Sequence number ACK'd */
union {
struct {
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 8936e1061e71..3b9bfc88345c 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -7037,6 +7037,11 @@ enum {
* It indicates the recorded
* timestamp.
*/
+ BPF_SOCK_OPS_TS_ACK_OPT_CB, /* Called when all the skbs are
+ * acknowledged when SO_TIMESTAMPING
+ * feature is on. It indicates the
+ * recorded timestamp.
+ */
};
/* List of TCP states. There is a build check in net/ipv4/tcp.c to detect
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index c769feae5162..33340e0b094f 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -5582,6 +5582,9 @@ static void skb_tstamp_tx_bpf(struct sk_buff *skb, struct sock *sk,
if (!sw)
*skb_hwtstamps(skb) = *hwtstamps;
break;
+ case SCM_TSTAMP_ACK:
+ op = BPF_SOCK_OPS_TS_ACK_OPT_CB;
+ break;
default:
return;
}
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 62252702929d..c8945f5be31b 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3323,7 +3323,8 @@ static void tcp_ack_tstamp(struct sock *sk, struct sk_buff *skb,
const struct skb_shared_info *shinfo;
/* Avoid cache line misses to get skb_shinfo() and shinfo->tx_flags */
- if (likely(!TCP_SKB_CB(skb)->txstamp_ack))
+ if (likely(!TCP_SKB_CB(skb)->txstamp_ack &&
+ !TCP_SKB_CB(skb)->txstamp_ack_bpf))
return;
shinfo = skb_shinfo(skb);
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 695749807c09..fc84ca669b76 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1556,6 +1556,7 @@ static void tcp_adjust_pcount(struct sock *sk, const struct sk_buff *skb, int de
static bool tcp_has_tx_tstamp(const struct sk_buff *skb)
{
return TCP_SKB_CB(skb)->txstamp_ack ||
+ TCP_SKB_CB(skb)->txstamp_ack_bpf ||
(skb_shinfo(skb)->tx_flags & SKBTX_ANY_TSTAMP);
}
@@ -1572,7 +1573,9 @@ static void tcp_fragment_tstamp(struct sk_buff *skb, struct sk_buff *skb2)
shinfo2->tx_flags |= tsflags;
swap(shinfo->tskey, shinfo2->tskey);
TCP_SKB_CB(skb2)->txstamp_ack = TCP_SKB_CB(skb)->txstamp_ack;
+ TCP_SKB_CB(skb2)->txstamp_ack_bpf = TCP_SKB_CB(skb)->txstamp_ack_bpf;
TCP_SKB_CB(skb)->txstamp_ack = 0;
+ TCP_SKB_CB(skb)->txstamp_ack_bpf = 0;
}
}
@@ -3213,6 +3216,8 @@ void tcp_skb_collapse_tstamp(struct sk_buff *skb,
shinfo->tskey = next_shinfo->tskey;
TCP_SKB_CB(skb)->txstamp_ack |=
TCP_SKB_CB(next_skb)->txstamp_ack;
+ TCP_SKB_CB(skb)->txstamp_ack_bpf |=
+ TCP_SKB_CB(next_skb)->txstamp_ack_bpf;
}
}
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index f1583b5814ea..b463aa9c27da 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -7030,6 +7030,11 @@ enum {
* It indicates the recorded
* timestamp.
*/
+ BPF_SOCK_OPS_TS_ACK_OPT_CB, /* Called when all the skbs are
+ * acknowledged when SO_TIMESTAMPING
+ * feature is on. It indicates the
+ * recorded timestamp.
+ */
};
/* List of TCP states. There is a build check in net/ipv4/tcp.c to detect
--
2.43.5
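The bitfield handling above can be sketched as a standalone C model (tcp_skb_cb is reduced to the two bits this patch touches):

```c
#include <assert.h>

/* Reduced model of the tcp_skb_cb bitfields touched by this patch. */
struct cb {
	unsigned char txstamp_ack:1;	/* app requested ack timestamp */
	unsigned char txstamp_ack_bpf:1;/* bpf requested ack timestamp */
};

/* Hot-path check from tcp_ack_tstamp(): skb_shinfo() is only
 * dereferenced when either bit is set, so flows without timestamping
 * skip the extra cache line entirely.
 */
static int needs_shinfo(const struct cb *cb)
{
	return cb->txstamp_ack || cb->txstamp_ack_bpf;
}

/* Mirrors tcp_fragment_tstamp(): the request moves to the tail
 * fragment (skb2) and is cleared on the head, for both bits.
 */
static void fragment_tstamp(struct cb *skb, struct cb *skb2)
{
	skb2->txstamp_ack = skb->txstamp_ack;
	skb2->txstamp_ack_bpf = skb->txstamp_ack_bpf;
	skb->txstamp_ack = 0;
	skb->txstamp_ack_bpf = 0;
}

/* Returns 1 when fragmenting moves a bpf-only request intact. */
static int fragment_moves_bpf_bit(void)
{
	struct cb head = { .txstamp_ack = 0, .txstamp_ack_bpf = 1 };
	struct cb tail = { 0 };

	fragment_tstamp(&head, &tail);
	return !needs_shinfo(&head) && needs_shinfo(&tail);
}
```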
* [RFC PATCH net-next v6 10/13] net-timestamp: make TCP tx timestamp bpf extension work
2025-01-21 1:28 [RFC PATCH net-next v6 00/13] net-timestamp: bpf extension to equip applications transparently Jason Xing
` (8 preceding siblings ...)
2025-01-21 1:28 ` [RFC PATCH net-next v6 09/13] net-timestamp: support SCM_TSTAMP_ACK " Jason Xing
@ 2025-01-21 1:28 ` Jason Xing
2025-01-21 1:28 ` [RFC PATCH net-next v6 11/13] net-timestamp: add a new callback in tcp_tx_timestamp() Jason Xing
` (2 subsequent siblings)
12 siblings, 0 replies; 43+ messages in thread
From: Jason Xing @ 2025-01-21 1:28 UTC (permalink / raw)
To: davem, edumazet, kuba, pabeni, dsahern, willemdebruijn.kernel,
willemb, ast, daniel, andrii, martin.lau, eddyz87, song,
yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa, horms
Cc: bpf, netdev, Jason Xing
Make this part of the feature work end to end. After this patch, users
can use a bpf prog to trace the whole tx path for TCP sockets.
Signed-off-by: Jason Xing <kerneljasonxing@gmail.com>
---
net/ipv4/tcp.c | 9 +++++++++
1 file changed, 9 insertions(+)
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 0d704bda6c41..0a41006b10d1 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -492,6 +492,15 @@ static void tcp_tx_timestamp(struct sock *sk, struct sockcm_cookie *sockc)
if (tsflags & SOF_TIMESTAMPING_TX_RECORD_MASK)
shinfo->tskey = TCP_SKB_CB(skb)->seq + skb->len - 1;
}
+
+ if (SK_BPF_CB_FLAG_TEST(sk, SK_BPF_CB_TX_TIMESTAMPING) && skb) {
+ struct skb_shared_info *shinfo = skb_shinfo(skb);
+ struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
+
+ tcb->txstamp_ack_bpf = 1;
+ shinfo->tx_flags |= SKBTX_BPF;
+ shinfo->tskey = TCP_SKB_CB(skb)->seq + skb->len - 1;
+ }
}
static bool tcp_stream_is_readable(struct sock *sk, int target)
--
2.43.5
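The tskey assignment in tcp_tx_timestamp() above keys every later callback to the byte stream: tskey is the sequence number of the last byte of the sendmsg payload. A minimal sketch:

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors the tskey computation in tcp_tx_timestamp():
 * tskey = TCP_SKB_CB(skb)->seq + skb->len - 1, i.e. the sequence
 * number of the last payload byte.  Like TCP sequence numbers, it
 * wraps modulo 2^32.
 */
static uint32_t tcp_tskey(uint32_t seq, uint32_t len)
{
	return seq + len - 1;
}
```

A SCHED/SW/ACK callback carrying the same tskey therefore refers back to the write that covered that byte, which is what the selftest relies on.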
* [RFC PATCH net-next v6 11/13] net-timestamp: add a new callback in tcp_tx_timestamp()
2025-01-21 1:28 [RFC PATCH net-next v6 00/13] net-timestamp: bpf extension to equip applications transparently Jason Xing
` (9 preceding siblings ...)
2025-01-21 1:28 ` [RFC PATCH net-next v6 10/13] net-timestamp: make TCP tx timestamp bpf extension work Jason Xing
@ 2025-01-21 1:28 ` Jason Xing
2025-01-25 0:50 ` Martin KaFai Lau
2025-01-21 1:29 ` [RFC PATCH net-next v6 12/13] net-timestamp: introduce cgroup lock to avoid affecting non-bpf cases Jason Xing
2025-01-21 1:29 ` [RFC PATCH net-next v6 13/13] bpf: add simple bpf tests in the tx path for so_timestamping feature Jason Xing
12 siblings, 1 reply; 43+ messages in thread
From: Jason Xing @ 2025-01-21 1:28 UTC (permalink / raw)
To: davem, edumazet, kuba, pabeni, dsahern, willemdebruijn.kernel,
willemb, ast, daniel, andrii, martin.lau, eddyz87, song,
yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa, horms
Cc: bpf, netdev, Jason Xing
Introduce a callback to correlate the tcp_sendmsg timestamp with the
other three points (SCHED/SW/ACK). A bpf prog can trace the entry of
tcp_sendmsg_locked() and record the socket address, so that in
tcp_tx_timestamp() the tskey can be correlated with that address.
This is accurate because both run under the protection of the socket
lock. More details can be found in the selftest.
Signed-off-by: Jason Xing <kerneljasonxing@gmail.com>
---
include/uapi/linux/bpf.h | 3 +++
net/ipv4/tcp.c | 1 +
tools/include/uapi/linux/bpf.h | 3 +++
3 files changed, 7 insertions(+)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 3b9bfc88345c..55c74fa18163 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -7042,6 +7042,9 @@ enum {
* feature is on. It indicates the
* recorded timestamp.
*/
+ BPF_SOCK_OPS_TS_TCP_SND_CB, /* Called when every tcp_sendmsg
+ * syscall is triggered
+ */
};
/* List of TCP states. There is a build check in net/ipv4/tcp.c to detect
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 0a41006b10d1..49e489c346ea 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -500,6 +500,7 @@ static void tcp_tx_timestamp(struct sock *sk, struct sockcm_cookie *sockc)
tcb->txstamp_ack_bpf = 1;
shinfo->tx_flags |= SKBTX_BPF;
shinfo->tskey = TCP_SKB_CB(skb)->seq + skb->len - 1;
+ bpf_skops_tx_timestamping(sk, skb, BPF_SOCK_OPS_TS_TCP_SND_CB);
}
}
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index b463aa9c27da..38fc04a7ac20 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -7035,6 +7035,9 @@ enum {
* feature is on. It indicates the
* recorded timestamp.
*/
+ BPF_SOCK_OPS_TS_TCP_SND_CB, /* Called when every tcp_sendmsg
+ * syscall is triggered
+ */
};
/* List of TCP states. There is a build check in net/ipv4/tcp.c to detect
--
2.43.5
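The correlation described above can be modelled without bpf at all. The sketch below is a plain-C stand-in for the hash map used in the selftest: the BPF_SOCK_OPS_TS_TCP_SND_CB callback stores the sendmsg timestamp under the skb's tskey, and the later SCHED callback for the same tskey computes the queueing delay. Sizes and the open-addressing-free slot scheme are illustrative only:

```c
#include <assert.h>
#include <stdint.h>

#define MAP_SLOTS 1024

struct delay_ent {
	uint32_t tskey;
	uint64_t sendmsg_ns;
	int used;
};

static struct delay_ent map[MAP_SLOTS];

static struct delay_ent *map_slot(uint32_t tskey)
{
	return &map[tskey % MAP_SLOTS];
}

/* BPF_SOCK_OPS_TS_TCP_SND_CB: remember when sendmsg ran for this key. */
static void on_tcp_snd_cb(uint32_t tskey, uint64_t now_ns)
{
	struct delay_ent *e = map_slot(tskey);

	e->tskey = tskey;
	e->sendmsg_ns = now_ns;
	e->used = 1;
}

/* BPF_SOCK_OPS_TS_SCHED_OPT_CB: delay since sendmsg, or -1 if the key
 * was never recorded. */
static int64_t on_sched_cb(uint32_t tskey, uint64_t now_ns)
{
	struct delay_ent *e = map_slot(tskey);

	if (!e->used || e->tskey != tskey)
		return -1;
	return (int64_t)(now_ns - e->sendmsg_ns);
}

/* sendmsg at t=1000ns, sched callback at t=1500ns for the same tskey. */
static int64_t demo_sched_delay(void)
{
	on_tcp_snd_cb(1099, 1000);
	return on_sched_cb(1099, 1500);
}
```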
* [RFC PATCH net-next v6 12/13] net-timestamp: introduce cgroup lock to avoid affecting non-bpf cases
2025-01-21 1:28 [RFC PATCH net-next v6 00/13] net-timestamp: bpf extension to equip applications transparently Jason Xing
` (10 preceding siblings ...)
2025-01-21 1:28 ` [RFC PATCH net-next v6 11/13] net-timestamp: add a new callback in tcp_tx_timestamp() Jason Xing
@ 2025-01-21 1:29 ` Jason Xing
2025-01-25 1:09 ` Martin KaFai Lau
2025-01-21 1:29 ` [RFC PATCH net-next v6 13/13] bpf: add simple bpf tests in the tx path for so_timestamping feature Jason Xing
12 siblings, 1 reply; 43+ messages in thread
From: Jason Xing @ 2025-01-21 1:29 UTC (permalink / raw)
To: davem, edumazet, kuba, pabeni, dsahern, willemdebruijn.kernel,
willemb, ast, daniel, andrii, martin.lau, eddyz87, song,
yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa, horms
Cc: bpf, netdev, Jason Xing
Introduce the cgroup_bpf_enabled() guard so that applications which are
not using the bpf timestamping feature are unaffected.
Signed-off-by: Jason Xing <kerneljasonxing@gmail.com>
---
net/core/skbuff.c | 6 ++++--
net/ipv4/tcp.c | 3 ++-
net/ipv4/tcp_input.c | 3 ++-
net/ipv4/tcp_output.c | 3 ++-
4 files changed, 10 insertions(+), 5 deletions(-)
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 33340e0b094f..db5b4b653351 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -5605,11 +5605,13 @@ void __skb_tstamp_tx(struct sk_buff *orig_skb,
return;
/* bpf extension feature entry */
- if (skb_shinfo(orig_skb)->tx_flags & SKBTX_BPF)
+ if (cgroup_bpf_enabled(CGROUP_SOCK_OPS) &&
+ skb_shinfo(orig_skb)->tx_flags & SKBTX_BPF)
skb_tstamp_tx_bpf(orig_skb, sk, tstype, sw, hwtstamps);
/* application feature entry */
- if (!skb_enable_app_tstamp(orig_skb, tstype, sw))
+ if (cgroup_bpf_enabled(CGROUP_SOCK_OPS) &&
+ !skb_enable_app_tstamp(orig_skb, tstype, sw))
return;
tsflags = READ_ONCE(sk->sk_tsflags);
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 49e489c346ea..d88160af00c4 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -493,7 +493,8 @@ static void tcp_tx_timestamp(struct sock *sk, struct sockcm_cookie *sockc)
shinfo->tskey = TCP_SKB_CB(skb)->seq + skb->len - 1;
}
- if (SK_BPF_CB_FLAG_TEST(sk, SK_BPF_CB_TX_TIMESTAMPING) && skb) {
+ if (cgroup_bpf_enabled(CGROUP_SOCK_OPS) &&
+ SK_BPF_CB_FLAG_TEST(sk, SK_BPF_CB_TX_TIMESTAMPING) && skb) {
struct skb_shared_info *shinfo = skb_shinfo(skb);
struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index c8945f5be31b..e30607ba41e5 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3324,7 +3324,8 @@ static void tcp_ack_tstamp(struct sock *sk, struct sk_buff *skb,
/* Avoid cache line misses to get skb_shinfo() and shinfo->tx_flags */
if (likely(!TCP_SKB_CB(skb)->txstamp_ack &&
- !TCP_SKB_CB(skb)->txstamp_ack_bpf))
+ !(cgroup_bpf_enabled(CGROUP_SOCK_OPS) &&
+ TCP_SKB_CB(skb)->txstamp_ack_bpf)))
return;
shinfo = skb_shinfo(skb);
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index fc84ca669b76..483f19c2083e 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1556,7 +1556,8 @@ static void tcp_adjust_pcount(struct sock *sk, const struct sk_buff *skb, int de
static bool tcp_has_tx_tstamp(const struct sk_buff *skb)
{
return TCP_SKB_CB(skb)->txstamp_ack ||
- TCP_SKB_CB(skb)->txstamp_ack_bpf ||
+ (cgroup_bpf_enabled(CGROUP_SOCK_OPS) &&
+ TCP_SKB_CB(skb)->txstamp_ack_bpf) ||
(skb_shinfo(skb)->tx_flags & SKBTX_ANY_TSTAMP);
}
--
2.43.5
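The reason cgroup_bpf_enabled() is placed first in each condition is the short-circuit evaluation of &&: when no CGROUP_SOCK_OPS prog is attached, the per-skb bit (which may sit on a cold cache line) is never read. A sketch with an instrumented flag read, assuming cgroup_bpf_enabled() behaves as a cheap boolean predicate:

```c
#include <assert.h>

static int flag_reads;

/* Stands in for reading txstamp_ack_bpf, counting each access so the
 * short-circuit behaviour is observable. */
static int txstamp_ack_bpf(void)
{
	flag_reads++;
	return 1;
}

/* Mirrors the guarded tests in this patch: the cheap global predicate
 * is evaluated first, so the flag read is skipped when bpf sock_ops is
 * not in use. */
static int bpf_ts_wanted(int cgroup_bpf_enabled)
{
	return cgroup_bpf_enabled && txstamp_ack_bpf();
}
```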
* [RFC PATCH net-next v6 13/13] bpf: add simple bpf tests in the tx path for so_timestamping feature
2025-01-21 1:28 [RFC PATCH net-next v6 00/13] net-timestamp: bpf extension to equip applications transparently Jason Xing
` (11 preceding siblings ...)
2025-01-21 1:29 ` [RFC PATCH net-next v6 12/13] net-timestamp: introduce cgroup lock to avoid affecting non-bpf cases Jason Xing
@ 2025-01-21 1:29 ` Jason Xing
2025-01-25 3:07 ` Martin KaFai Lau
12 siblings, 1 reply; 43+ messages in thread
From: Jason Xing @ 2025-01-21 1:29 UTC (permalink / raw)
To: davem, edumazet, kuba, pabeni, dsahern, willemdebruijn.kernel,
willemb, ast, daniel, andrii, martin.lau, eddyz87, song,
yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa, horms
Cc: bpf, netdev, Jason Xing
Only check that we pass the three key points (SCHED/SW/ACK), plus the
sendmsg callback, after enabling the bpf extension for
so_timestamping. At each point, the prog can choose whether to record
the current timestamp.
Signed-off-by: Jason Xing <kerneljasonxing@gmail.com>
---
.../bpf/prog_tests/so_timestamping.c | 98 ++++++++
.../selftests/bpf/progs/so_timestamping.c | 227 ++++++++++++++++++
2 files changed, 325 insertions(+)
create mode 100644 tools/testing/selftests/bpf/prog_tests/so_timestamping.c
create mode 100644 tools/testing/selftests/bpf/progs/so_timestamping.c
diff --git a/tools/testing/selftests/bpf/prog_tests/so_timestamping.c b/tools/testing/selftests/bpf/prog_tests/so_timestamping.c
new file mode 100644
index 000000000000..bbfa7eb38cfb
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/so_timestamping.c
@@ -0,0 +1,98 @@
+#define _GNU_SOURCE
+#include <sched.h>
+#include <linux/socket.h>
+#include <linux/tls.h>
+#include <net/if.h>
+
+#include "test_progs.h"
+#include "cgroup_helpers.h"
+#include "network_helpers.h"
+
+#include "so_timestamping.skel.h"
+
+#define CG_NAME "/so-timestamping-test"
+
+static const char addr4_str[] = "127.0.0.1";
+static const char addr6_str[] = "::1";
+static struct so_timestamping *skel;
+static int cg_fd;
+
+static int create_netns(void)
+{
+ if (!ASSERT_OK(unshare(CLONE_NEWNET), "create netns"))
+ return -1;
+
+ if (!ASSERT_OK(system("ip link set dev lo up"), "set lo up"))
+ return -1;
+
+ return 0;
+}
+
+static void test_tcp(int family)
+{
+ struct so_timestamping__bss *bss = skel->bss;
+ char buf[] = "testing testing";
+ int sfd = -1, cfd = -1;
+ int n;
+
+ memset(bss, 0, sizeof(*bss));
+
+ sfd = start_server(family, SOCK_STREAM,
+ family == AF_INET6 ? addr6_str : addr4_str, 0, 0);
+ if (!ASSERT_GE(sfd, 0, "start_server"))
+ goto out;
+
+ cfd = connect_to_fd(sfd, 0);
+ if (!ASSERT_GE(cfd, 0, "connect_to_fd_server")) {
+ close(sfd);
+ goto out;
+ }
+
+ n = write(cfd, buf, sizeof(buf));
+ if (!ASSERT_EQ(n, sizeof(buf), "send to server"))
+ goto out;
+
+ ASSERT_EQ(bss->nr_active, 1, "nr_active");
+ ASSERT_EQ(bss->nr_snd, 2, "nr_snd");
+ ASSERT_EQ(bss->nr_sched, 1, "nr_sched");
+ ASSERT_EQ(bss->nr_txsw, 1, "nr_txsw");
+ ASSERT_EQ(bss->nr_ack, 1, "nr_ack");
+
+out:
+ if (sfd >= 0)
+ close(sfd);
+ if (cfd >= 0)
+ close(cfd);
+}
+
+void test_so_timestamping(void)
+{
+ cg_fd = test__join_cgroup(CG_NAME);
+ if (cg_fd < 0)
+ return;
+
+ if (create_netns())
+ goto done;
+
+ skel = so_timestamping__open();
+ if (!ASSERT_OK_PTR(skel, "open skel"))
+ goto done;
+
+ if (!ASSERT_OK(so_timestamping__load(skel), "load skel"))
+ goto done;
+
+ if (!ASSERT_OK(so_timestamping__attach(skel), "attach skel"))
+ goto done;
+
+ skel->links.skops_sockopt =
+ bpf_program__attach_cgroup(skel->progs.skops_sockopt, cg_fd);
+ if (!ASSERT_OK_PTR(skel->links.skops_sockopt, "attach cgroup"))
+ goto done;
+
+ test_tcp(AF_INET6);
+ test_tcp(AF_INET);
+
+done:
+ so_timestamping__destroy(skel);
+ close(cg_fd);
+}
diff --git a/tools/testing/selftests/bpf/progs/so_timestamping.c b/tools/testing/selftests/bpf/progs/so_timestamping.c
new file mode 100644
index 000000000000..f4708e84c243
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/so_timestamping.c
@@ -0,0 +1,227 @@
+#include "vmlinux.h"
+#include "bpf_tracing_net.h"
+#include <bpf/bpf_core_read.h>
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+#include "bpf_misc.h"
+#include "bpf_kfuncs.h"
+
+#define SK_BPF_CB_FLAGS 1009
+#define SK_BPF_CB_TX_TIMESTAMPING 1
+
+int nr_active;
+int nr_snd;
+int nr_passive;
+int nr_sched;
+int nr_txsw;
+int nr_ack;
+
+struct sockopt_test {
+ int opt;
+ int new;
+};
+
+static const struct sockopt_test sol_socket_tests[] = {
+ { .opt = SK_BPF_CB_FLAGS, .new = SK_BPF_CB_TX_TIMESTAMPING, },
+ { .opt = 0, },
+};
+
+struct loop_ctx {
+ void *ctx;
+ const struct sock *sk;
+};
+
+struct sk_stg {
+ __u64 sendmsg_ns; /* record ts when sendmsg is called */
+};
+
+struct {
+ __uint(type, BPF_MAP_TYPE_SK_STORAGE);
+ __uint(map_flags, BPF_F_NO_PREALLOC);
+ __type(key, int);
+ __type(value, struct sk_stg);
+} sk_stg_map SEC(".maps");
+
+
+struct delay_info {
+ u64 sendmsg_ns; /* record ts when sendmsg is called */
+ u32 sched_delay; /* SCHED_OPT_CB - sendmsg_ns */
+ u32 sw_snd_delay; /* SW_OPT_CB - SCHED_OPT_CB */
+ u32 ack_delay; /* ACK_OPT_CB - SW_OPT_CB */
+};
+
+struct {
+ __uint(type, BPF_MAP_TYPE_HASH);
+ __type(key, u32);
+ __type(value, struct delay_info);
+ __uint(max_entries, 1024);
+} time_map SEC(".maps");
+
+static u64 delay_tolerance_nsec = 1000000000; /* 1 second as an example */
+
+static int bpf_test_sockopt_int(void *ctx, const struct sock *sk,
+ const struct sockopt_test *t,
+ int level)
+{
+ int new, opt, tmp;
+
+ opt = t->opt;
+ new = t->new;
+
+ if (bpf_setsockopt(ctx, level, opt, &new, sizeof(new)))
+ return 1;
+
+ if (bpf_getsockopt(ctx, level, opt, &tmp, sizeof(tmp)) ||
+ tmp != new) {
+ return 1;
+ }
+
+ return 0;
+}
+
+static int bpf_test_socket_sockopt(__u32 i, struct loop_ctx *lc)
+{
+ const struct sockopt_test *t;
+
+ if (i >= ARRAY_SIZE(sol_socket_tests))
+ return 1;
+
+ t = &sol_socket_tests[i];
+ if (!t->opt)
+ return 1;
+
+ return bpf_test_sockopt_int(lc->ctx, lc->sk, t, SOL_SOCKET);
+}
+
+static int bpf_test_sockopt(void *ctx, const struct sock *sk)
+{
+ struct loop_ctx lc = { .ctx = ctx, .sk = sk, };
+ int n;
+
+ n = bpf_loop(ARRAY_SIZE(sol_socket_tests), bpf_test_socket_sockopt, &lc, 0);
+ if (n != ARRAY_SIZE(sol_socket_tests))
+ return -1;
+
+ return 0;
+}
+
+static bool bpf_test_delay(struct bpf_sock_ops *skops, const struct sock *sk)
+{
+ struct bpf_sock_ops_kern *skops_kern;
+ u64 timestamp = bpf_ktime_get_ns();
+ struct skb_shared_info *shinfo;
+ struct delay_info dinfo = {0};
+ struct delay_info *val;
+ struct sk_buff *skb;
+ struct sk_stg *stg;
+ u32 delay, tskey;
+ u64 prior_ts;
+
+ skops_kern = bpf_cast_to_kern_ctx(skops);
+ skb = skops_kern->skb;
+ shinfo = bpf_core_cast(skb->head + skb->end, struct skb_shared_info);
+ tskey = shinfo->tskey;
+ if (!tskey)
+ return false;
+
+ if (skops->op == BPF_SOCK_OPS_TS_TCP_SND_CB) {
+ stg = bpf_sk_storage_get(&sk_stg_map, (void *)sk, 0, 0);
+ if (!stg)
+ return false;
+ dinfo.sendmsg_ns = stg->sendmsg_ns;
+ val = &dinfo;
+ goto out;
+ }
+
+ val = bpf_map_lookup_elem(&time_map, &tskey);
+ if (!val)
+ return false;
+
+ switch (skops->op) {
+ case BPF_SOCK_OPS_TS_SCHED_OPT_CB:
+ delay = val->sched_delay = timestamp - val->sendmsg_ns;
+ break;
+ case BPF_SOCK_OPS_TS_SW_OPT_CB:
+ prior_ts = val->sched_delay + val->sendmsg_ns;
+ delay = val->sw_snd_delay = timestamp - prior_ts;
+ break;
+ case BPF_SOCK_OPS_TS_ACK_OPT_CB:
+ prior_ts = val->sw_snd_delay + val->sched_delay + val->sendmsg_ns;
+ delay = val->ack_delay = timestamp - prior_ts;
+ break;
+ }
+
+ if (delay <= 0 || delay >= delay_tolerance_nsec)
+ return false;
+
+ /* Since it's the last one, remove from the map after latency check */
+ if (skops->op == BPF_SOCK_OPS_TS_ACK_OPT_CB) {
+ bpf_map_delete_elem(&time_map, &tskey);
+ return true;
+ }
+
+out:
+ bpf_map_update_elem(&time_map, &tskey, val, BPF_ANY);
+ return true;
+}
+
+SEC("fentry/tcp_sendmsg_locked")
+int BPF_PROG(trace_tcp_sendmsg_locked, struct sock *sk, struct msghdr *msg, size_t size)
+{
+ u64 timestamp = bpf_ktime_get_ns();
+ u32 flag = sk->sk_bpf_cb_flags;
+ struct sk_stg *stg;
+
+ if (!flag)
+ return 0;
+
+ stg = bpf_sk_storage_get(&sk_stg_map, sk, 0,
+ BPF_SK_STORAGE_GET_F_CREATE);
+ if (!stg)
+ return 0;
+
+ stg->sendmsg_ns = timestamp;
+ nr_snd += 1;
+ return 0;
+}
+
+SEC("sockops")
+int skops_sockopt(struct bpf_sock_ops *skops)
+{
+ struct bpf_sock *bpf_sk = skops->sk;
+ const struct sock *sk;
+
+ if (!bpf_sk)
+ return 1;
+
+ sk = (struct sock *)bpf_skc_to_tcp_sock(bpf_sk);
+ if (!sk)
+ return 1;
+
+ switch (skops->op) {
+ case BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB:
+ nr_active += !bpf_test_sockopt(skops, sk);
+ break;
+ case BPF_SOCK_OPS_TS_TCP_SND_CB:
+ if (bpf_test_delay(skops, sk))
+ nr_snd += 1;
+ break;
+ case BPF_SOCK_OPS_TS_SCHED_OPT_CB:
+ if (bpf_test_delay(skops, sk))
+ nr_sched += 1;
+ break;
+ case BPF_SOCK_OPS_TS_SW_OPT_CB:
+ if (bpf_test_delay(skops, sk))
+ nr_txsw += 1;
+ break;
+ case BPF_SOCK_OPS_TS_ACK_OPT_CB:
+ if (bpf_test_delay(skops, sk))
+ nr_ack += 1;
+ break;
+ }
+
+ return 1;
+}
+
+char _license[] SEC("license") = "GPL";
--
2.43.5
* Re: [RFC PATCH net-next v6 02/13] net-timestamp: prepare for timestamping callbacks use
2025-01-21 1:28 ` [RFC PATCH net-next v6 02/13] net-timestamp: prepare for timestamping callbacks use Jason Xing
@ 2025-01-21 5:08 ` Jason Xing
0 siblings, 0 replies; 43+ messages in thread
From: Jason Xing @ 2025-01-21 5:08 UTC (permalink / raw)
To: davem, edumazet, kuba, pabeni, dsahern, willemdebruijn.kernel,
willemb, ast, daniel, andrii, martin.lau, eddyz87, song,
yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa, horms
Cc: bpf, netdev
On Tue, Jan 21, 2025 at 9:29 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
>
> Later, I would introduce four callback points to report information
> to user space based on this patch.
>
> As to skb initialization here, people can follow these three steps
> as below to fetch the shared info from the exported skb in the bpf
> prog:
> 1. skops_kern = bpf_cast_to_kern_ctx(skops);
> 2. skb = skops_kern->skb;
> 3. shinfo = bpf_core_cast(skb->head + skb->end, struct skb_shared_info);
>
> More details can be seen in the last selftest patch of the series.
>
> Signed-off-by: Jason Xing <kerneljasonxing@gmail.com>
> ---
> include/net/sock.h | 7 +++++++
> net/core/sock.c | 13 +++++++++++++
> 2 files changed, 20 insertions(+)
>
> diff --git a/include/net/sock.h b/include/net/sock.h
> index 7916982343c6..6f4d54faba92 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -2923,6 +2923,13 @@ int sock_set_timestamping(struct sock *sk, int optname,
> struct so_timestamping timestamping);
>
> void sock_enable_timestamps(struct sock *sk);
> +#if defined(CONFIG_CGROUP_BPF)
> +void bpf_skops_tx_timestamping(struct sock *sk, struct sk_buff *skb, int op);
> +#else
> +static inline void bpf_skops_tx_timestamping(struct sock *sk, struct sk_buff *skb, int op)
> +{
> +}
> +#endif
> void sock_no_linger(struct sock *sk);
> void sock_set_keepalive(struct sock *sk);
> void sock_set_priority(struct sock *sk, u32 priority);
> diff --git a/net/core/sock.c b/net/core/sock.c
> index eae2ae70a2e0..e165163521dc 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -948,6 +948,19 @@ int sock_set_timestamping(struct sock *sk, int optname,
> return 0;
> }
>
Oops, I accidentally removed the following protection:

#if defined(CONFIG_CGROUP_BPF)
> +void bpf_skops_tx_timestamping(struct sock *sk, struct sk_buff *skb, int op)
> +{
> + struct bpf_sock_ops_kern sock_ops;
> +
> + memset(&sock_ops, 0, offsetof(struct bpf_sock_ops_kern, temp));
> + sock_ops.op = op;
> + sock_ops.is_fullsock = 1;
> + sock_ops.sk = sk;
> + bpf_skops_init_skb(&sock_ops, skb, 0);
> + /* Timestamping bpf extension supports only TCP and UDP full socket */
> + __cgroup_bpf_run_filter_sock_ops(sk, &sock_ops, CGROUP_SOCK_OPS);
> +}
And here:
#endif
I'll add them in the next version.
Thanks,
Jason
> +
> void sock_set_keepalive(struct sock *sk)
> {
> lock_sock(sk);
> --
> 2.43.5
>
* Re: [RFC PATCH net-next v6 03/13] bpf: stop UDP sock accessing TCP fields in bpf callbacks
2025-01-21 1:28 ` [RFC PATCH net-next v6 03/13] bpf: stop UDP sock accessing TCP fields in bpf callbacks Jason Xing
@ 2025-01-24 23:40 ` Martin KaFai Lau
2025-01-25 0:28 ` Jason Xing
0 siblings, 1 reply; 43+ messages in thread
From: Martin KaFai Lau @ 2025-01-24 23:40 UTC (permalink / raw)
To: Jason Xing
Cc: davem, edumazet, kuba, pabeni, dsahern, willemdebruijn.kernel,
willemb, ast, daniel, andrii, eddyz87, song, yonghong.song,
john.fastabend, kpsingh, sdf, haoluo, jolsa, horms, bpf, netdev
On 1/20/25 5:28 PM, Jason Xing wrote:
> Applying the new member allow_tcp_access in the existing callbacks
> where is_fullsock is set to 1 helps us stop a UDP socket from accessing
> struct tcp_sock, or else it could be a catastrophe leading to a panic.
>
> For now, those existing callbacks are used only for TCP. I believe
> that in the short run we will support timestamping UDP callbacks.
The commit message needs adjustment. UDP is not supported yet, so this change
feels like it's unnecessary based on the commit message. However, even without
UDP support, the new timestamping callbacks cannot directly write some fields
because the sk lock is not held, so this change is needed for TCP timestamping
support.
To keep it simple, instead of distinguishing between read and write access, we
disallow all read/write access to the tcp_sock through the older bpf_sock_ops
ctx. The new timestamping callbacks can use newer helpers to read everything
from a sk (e.g. bpf_core_cast), so nothing is lost.
The "allow_tcp_access" flag is added to indicate that the callback site has a
tcp_sock locked. Yes, it will make future UDP support easier because a udp_sock
is not a tcp_sock to begin with.
>
> Signed-off-by: Jason Xing <kerneljasonxing@gmail.com>
> ---
> include/linux/filter.h | 1 +
> include/net/tcp.h | 1 +
> net/core/filter.c | 8 ++++----
> net/ipv4/tcp_input.c | 2 ++
> net/ipv4/tcp_output.c | 2 ++
> 5 files changed, 10 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/filter.h b/include/linux/filter.h
> index a3ea46281595..1b1333a90b4a 100644
> --- a/include/linux/filter.h
> +++ b/include/linux/filter.h
> @@ -1508,6 +1508,7 @@ struct bpf_sock_ops_kern {
> void *skb_data_end;
> u8 op;
> u8 is_fullsock;
> + u8 allow_tcp_access;
It is useful to add a comment here to explain the sockops callback has the
tcp_sock locked when it is set.
> u8 remaining_opt_len;
> u64 temp; /* temp and everything after is not
> * initialized to 0 before calling
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index 5b2b04835688..293047694710 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -2649,6 +2649,7 @@ static inline int tcp_call_bpf(struct sock *sk, int op, u32 nargs, u32 *args)
> memset(&sock_ops, 0, offsetof(struct bpf_sock_ops_kern, temp));
> if (sk_fullsock(sk)) {
> sock_ops.is_fullsock = 1;
> + sock_ops.allow_tcp_access = 1;
> sock_owned_by_me(sk);
> }
>
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 8e2715b7ac8a..fdd305b4cfbb 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -10381,10 +10381,10 @@ static u32 sock_ops_convert_ctx_access(enum bpf_access_type type,
> } \
> *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF( \
> struct bpf_sock_ops_kern, \
> - is_fullsock), \
> + allow_tcp_access), \
> fullsock_reg, si->src_reg, \
> offsetof(struct bpf_sock_ops_kern, \
> - is_fullsock)); \
> + allow_tcp_access)); \
> *insn++ = BPF_JMP_IMM(BPF_JEQ, fullsock_reg, 0, jmp); \
> if (si->dst_reg == si->src_reg) \
> *insn++ = BPF_LDX_MEM(BPF_DW, reg, si->src_reg, \
> @@ -10469,10 +10469,10 @@ static u32 sock_ops_convert_ctx_access(enum bpf_access_type type,
> temp)); \
> *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF( \
> struct bpf_sock_ops_kern, \
> - is_fullsock), \
> + allow_tcp_access), \
> reg, si->dst_reg, \
> offsetof(struct bpf_sock_ops_kern, \
> - is_fullsock)); \
> + allow_tcp_access)); \
> *insn++ = BPF_JMP_IMM(BPF_JEQ, reg, 0, 2); \
> *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF( \
> struct bpf_sock_ops_kern, sk),\
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index eb82e01da911..77185479ed5e 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -169,6 +169,7 @@ static void bpf_skops_parse_hdr(struct sock *sk, struct sk_buff *skb)
> memset(&sock_ops, 0, offsetof(struct bpf_sock_ops_kern, temp));
> sock_ops.op = BPF_SOCK_OPS_PARSE_HDR_OPT_CB;
> sock_ops.is_fullsock = 1;
> + sock_ops.allow_tcp_access = 1;
> sock_ops.sk = sk;
> bpf_skops_init_skb(&sock_ops, skb, tcp_hdrlen(skb));
>
> @@ -185,6 +186,7 @@ static void bpf_skops_established(struct sock *sk, int bpf_op,
> memset(&sock_ops, 0, offsetof(struct bpf_sock_ops_kern, temp));
> sock_ops.op = bpf_op;
> sock_ops.is_fullsock = 1;
> + sock_ops.allow_tcp_access = 1;
> sock_ops.sk = sk;
> /* sk with TCP_REPAIR_ON does not have skb in tcp_finish_connect */
> if (skb)
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 0e5b9a654254..695749807c09 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -522,6 +522,7 @@ static void bpf_skops_hdr_opt_len(struct sock *sk, struct sk_buff *skb,
> sock_owned_by_me(sk);
>
> sock_ops.is_fullsock = 1;
> + sock_ops.allow_tcp_access = 1;
> sock_ops.sk = sk;
> }
>
> @@ -567,6 +568,7 @@ static void bpf_skops_write_hdr_opt(struct sock *sk, struct sk_buff *skb,
> sock_owned_by_me(sk);
>
> sock_ops.is_fullsock = 1;
> + sock_ops.allow_tcp_access = 1;
> sock_ops.sk = sk;
> }
>
* Re: [RFC PATCH net-next v6 04/13] bpf: stop UDP sock accessing TCP fields in sock_op BPF CALLs
2025-01-21 1:28 ` [RFC PATCH net-next v6 04/13] bpf: stop UDP sock accessing TCP fields in sock_op BPF CALLs Jason Xing
@ 2025-01-25 0:28 ` Martin KaFai Lau
2025-01-25 1:15 ` Jason Xing
0 siblings, 1 reply; 43+ messages in thread
From: Martin KaFai Lau @ 2025-01-25 0:28 UTC (permalink / raw)
To: Jason Xing
Cc: davem, edumazet, kuba, pabeni, dsahern, willemdebruijn.kernel,
willemb, ast, daniel, andrii, eddyz87, song, yonghong.song,
john.fastabend, kpsingh, sdf, haoluo, jolsa, horms, bpf, netdev
On 1/20/25 5:28 PM, Jason Xing wrote:
> In the next round, we will support the UDP proto for the SO_TIMESTAMPING
> bpf extension, so we need to ensure there is no safety problem, which
> is usually caused by a UDP socket trying to access TCP fields.
>
> These approaches can be categorized into two groups:
> 1. add TCP protocol check
> 2. add sock op check
Same as patch 3. The commit message needs adjustment. I would combine patch 3
and patch 4 because ...
>
> Signed-off-by: Jason Xing <kerneljasonxing@gmail.com>
> ---
> net/core/filter.c | 19 +++++++++++++++++--
> 1 file changed, 17 insertions(+), 2 deletions(-)
>
> diff --git a/net/core/filter.c b/net/core/filter.c
> index fdd305b4cfbb..934431886876 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -5523,6 +5523,11 @@ static int __bpf_setsockopt(struct sock *sk, int level, int optname,
> return -EINVAL;
> }
>
> +static bool is_locked_tcp_sock_ops(struct bpf_sock_ops_kern *bpf_sock)
> +{
> + return bpf_sock->op <= BPF_SOCK_OPS_WRITE_HDR_OPT_CB;
More bike shedding...
After sleeping on it again, I think it can just test the
bpf_sock->allow_tcp_access instead.
> +}
> +
> static int _bpf_setsockopt(struct sock *sk, int level, int optname,
> char *optval, int optlen)
> {
> @@ -5673,7 +5678,12 @@ static const struct bpf_func_proto bpf_sock_addr_getsockopt_proto = {
> BPF_CALL_5(bpf_sock_ops_setsockopt, struct bpf_sock_ops_kern *, bpf_sock,
> int, level, int, optname, char *, optval, int, optlen)
> {
> - return _bpf_setsockopt(bpf_sock->sk, level, optname, optval, optlen);
> + struct sock *sk = bpf_sock->sk;
> +
> + if (is_locked_tcp_sock_ops(bpf_sock) && sk_fullsock(sk))
afaict, the new timestamping callbacks still can do setsockopt and it is
incorrect. It should be:
if (!bpf_sock->allow_tcp_access)
return -EOPNOTSUPP;
I recalled I have asked in v5 but it may be buried in the long thread, so asking
here again. Please add test(s) to check that the new timestamping callbacks
cannot call setsockopt and read/write to some of the tcp_sock fields through the
bpf_sock_ops.
> + sock_owned_by_me(sk);
Not needed and instead...
> +
> + return __bpf_setsockopt(sk, level, optname, optval, optlen);
keep the original _bpf_setsockopt().
> }
>
> static const struct bpf_func_proto bpf_sock_ops_setsockopt_proto = {
> @@ -5759,6 +5769,7 @@ BPF_CALL_5(bpf_sock_ops_getsockopt, struct bpf_sock_ops_kern *, bpf_sock,
> int, level, int, optname, char *, optval, int, optlen)
> {
> if (IS_ENABLED(CONFIG_INET) && level == SOL_TCP &&
> + bpf_sock->sk->sk_protocol == IPPROTO_TCP &&
> optname >= TCP_BPF_SYN && optname <= TCP_BPF_SYN_MAC) {
No need to allow getsockopt regardless of what SOL_* it is asking for. To keep it
simple, I would just disable both getsockopt and setsockopt for all SOL_* for
the new timestamping callbacks. Nothing is lost, the bpf prog can directly read
the sk.
> int ret, copy_len = 0;
> const u8 *start;
> @@ -5800,7 +5811,8 @@ BPF_CALL_2(bpf_sock_ops_cb_flags_set, struct bpf_sock_ops_kern *, bpf_sock,
> struct sock *sk = bpf_sock->sk;
> int val = argval & BPF_SOCK_OPS_ALL_CB_FLAGS;
>
> - if (!IS_ENABLED(CONFIG_INET) || !sk_fullsock(sk))
> + if (!IS_ENABLED(CONFIG_INET) || !sk_fullsock(sk) ||
> + sk->sk_protocol != IPPROTO_TCP)
Same here. It should disallow this "set" helper for the timestamping callbacks
which do not hold the lock.
> return -EINVAL;
>
> tcp_sk(sk)->bpf_sock_ops_cb_flags = val;
> @@ -7609,6 +7621,9 @@ BPF_CALL_4(bpf_sock_ops_load_hdr_opt, struct bpf_sock_ops_kern *, bpf_sock,
> u8 search_kind, search_len, copy_len, magic_len;
> int ret;
>
> + if (!is_locked_tcp_sock_ops(bpf_sock))
> + return -EOPNOTSUPP;
This is correct, just change it to "!bpf_sock->allow_tcp_access".
All the above changed helpers should use the same test and the same return handling.
> +
> /* 2 byte is the minimal option len except TCPOPT_NOP and
> * TCPOPT_EOL which are useless for the bpf prog to learn
> * and this helper disallow loading them also.
* Re: [RFC PATCH net-next v6 03/13] bpf: stop UDP sock accessing TCP fields in bpf callbacks
2025-01-24 23:40 ` Martin KaFai Lau
@ 2025-01-25 0:28 ` Jason Xing
2025-01-28 1:34 ` Jason Xing
0 siblings, 1 reply; 43+ messages in thread
From: Jason Xing @ 2025-01-25 0:28 UTC (permalink / raw)
To: Martin KaFai Lau
Cc: davem, edumazet, kuba, pabeni, dsahern, willemdebruijn.kernel,
willemb, ast, daniel, andrii, eddyz87, song, yonghong.song,
john.fastabend, kpsingh, sdf, haoluo, jolsa, horms, bpf, netdev
On Sat, Jan 25, 2025 at 7:41 AM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 1/20/25 5:28 PM, Jason Xing wrote:
> > Applying the new member allow_tcp_access in the existing callbacks
> > where is_fullsock is set to 1 helps us stop a UDP socket from accessing
> > struct tcp_sock, or else it could be a catastrophe leading to a panic.
> >
> > For now, those existing callbacks are used only for TCP. I believe
> > that in the short run we will support timestamping UDP callbacks.
>
> The commit message needs adjustment. UDP is not supported yet, so this change
> feels like it's unnecessary based on the commit message. However, even without
> UDP support, the new timestamping callbacks cannot directly write some fields
> because the sk lock is not held, so this change is needed for TCP timestamping
Thanks, I will revise them. But I still want to say that the
timestamping callbacks after this series are all under the protection
of the socket lock.
> support.
>
> To keep it simple, instead of distinguishing between read and write access, we
> disallow all read/write access to the tcp_sock through the older bpf_sock_ops
> ctx. The new timestamping callbacks can use newer helpers to read everything
> from a sk (e.g. bpf_core_cast), so nothing is lost.
>
> The "allow_tcp_access" flag is added to indicate that the callback site has a
> tcp_sock locked. Yes, it will make future UDP support easier because a udp_sock
> is not a tcp_sock to begin with.
I will add them :)
>
> >
> > Signed-off-by: Jason Xing <kerneljasonxing@gmail.com>
> > ---
> > include/linux/filter.h | 1 +
> > include/net/tcp.h | 1 +
> > net/core/filter.c | 8 ++++----
> > net/ipv4/tcp_input.c | 2 ++
> > net/ipv4/tcp_output.c | 2 ++
> > 5 files changed, 10 insertions(+), 4 deletions(-)
> >
> > diff --git a/include/linux/filter.h b/include/linux/filter.h
> > index a3ea46281595..1b1333a90b4a 100644
> > --- a/include/linux/filter.h
> > +++ b/include/linux/filter.h
> > @@ -1508,6 +1508,7 @@ struct bpf_sock_ops_kern {
> > void *skb_data_end;
> > u8 op;
> > u8 is_fullsock;
> > + u8 allow_tcp_access;
>
> It is useful to add a comment here to explain the sockops callback has the
> tcp_sock locked when it is set.
Got it!
Thanks,
Jason
* Re: [RFC PATCH net-next v6 06/13] net-timestamp: support SCM_TSTAMP_SCHED for bpf extension
2025-01-21 1:28 ` [RFC PATCH net-next v6 06/13] net-timestamp: support SCM_TSTAMP_SCHED for bpf extension Jason Xing
@ 2025-01-25 0:38 ` Martin KaFai Lau
2025-01-25 1:16 ` Jason Xing
0 siblings, 1 reply; 43+ messages in thread
From: Martin KaFai Lau @ 2025-01-25 0:38 UTC (permalink / raw)
To: Jason Xing
Cc: davem, edumazet, kuba, pabeni, dsahern, willemdebruijn.kernel,
willemb, ast, daniel, andrii, eddyz87, song, yonghong.song,
john.fastabend, kpsingh, sdf, haoluo, jolsa, horms, bpf, netdev
On 1/20/25 5:28 PM, Jason Xing wrote:
> SKBTX_BPF is introduced as an indicator telling us whether
> the skb should be traced by the bpf prog.
>
> Signed-off-by: Jason Xing <kerneljasonxing@gmail.com>
> ---
> include/linux/skbuff.h | 6 +++++-
> include/uapi/linux/bpf.h | 5 +++++
> net/core/dev.c | 3 ++-
> net/core/skbuff.c | 23 +++++++++++++++++++++++
> tools/include/uapi/linux/bpf.h | 5 +++++
> 5 files changed, 40 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index dfc419281cc9..35c2e864dd4b 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -490,10 +490,14 @@ enum {
>
> /* generate software time stamp when entering packet scheduling */
> SKBTX_SCHED_TSTAMP = 1 << 6,
> +
> + /* used for bpf extension when a bpf program is loaded */
> + SKBTX_BPF = 1 << 7,
> };
>
> #define SKBTX_ANY_SW_TSTAMP (SKBTX_SW_TSTAMP | \
> - SKBTX_SCHED_TSTAMP)
> + SKBTX_SCHED_TSTAMP | \
> + SKBTX_BPF)
> #define SKBTX_ANY_TSTAMP (SKBTX_HW_TSTAMP | \
> SKBTX_HW_TSTAMP_USE_CYCLES | \
> SKBTX_ANY_SW_TSTAMP)
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index e629e09b0b31..72f93c6e45c1 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -7022,6 +7022,11 @@ enum {
> * by the kernel or the
> * earlier bpf-progs.
> */
> + BPF_SOCK_OPS_TS_SCHED_OPT_CB, /* Called when skb is passing through
> + * dev layer when SO_TIMESTAMPING
The "SO_TIMESTAMPING" term is not accurate. I guess you meant
"SK_BPF_CB_TX_TIMESTAMPING"?
Also, may be "Called before skb entering qdisc"?
> + * feature is on. It indicates the
> + * recorded timestamp.
There is no timestamp recorded also.
* Re: [RFC PATCH net-next v6 07/13] net-timestamp: support sw SCM_TSTAMP_SND for bpf extension
2025-01-21 1:28 ` [RFC PATCH net-next v6 07/13] net-timestamp: support sw SCM_TSTAMP_SND " Jason Xing
@ 2025-01-25 0:40 ` Martin KaFai Lau
2025-01-25 1:17 ` Jason Xing
0 siblings, 1 reply; 43+ messages in thread
From: Martin KaFai Lau @ 2025-01-25 0:40 UTC (permalink / raw)
To: Jason Xing
Cc: davem, edumazet, kuba, pabeni, dsahern, willemdebruijn.kernel,
willemb, ast, daniel, andrii, eddyz87, song, yonghong.song,
john.fastabend, kpsingh, sdf, haoluo, jolsa, horms, bpf, netdev
On 1/20/25 5:28 PM, Jason Xing wrote:
> Support SCM_TSTAMP_SND case. Then we will get the software
> timestamp when the driver is about to send the skb. Later, I
> will support the hardware timestamp.
>
> Signed-off-by: Jason Xing <kerneljasonxing@gmail.com>
> ---
> include/linux/skbuff.h | 2 +-
> include/uapi/linux/bpf.h | 5 +++++
> net/core/skbuff.c | 10 ++++++++--
> tools/include/uapi/linux/bpf.h | 5 +++++
> 4 files changed, 19 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index 35c2e864dd4b..de8d3bd311f5 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -4569,7 +4569,7 @@ void skb_tstamp_tx(struct sk_buff *orig_skb,
> static inline void skb_tx_timestamp(struct sk_buff *skb)
> {
> skb_clone_tx_timestamp(skb);
> - if (skb_shinfo(skb)->tx_flags & SKBTX_SW_TSTAMP)
> + if (skb_shinfo(skb)->tx_flags & (SKBTX_SW_TSTAMP | SKBTX_BPF))
> __skb_tstamp_tx(skb, NULL, NULL, skb->sk, true, SCM_TSTAMP_SND);
> }
>
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 72f93c6e45c1..a6d761f07f67 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -7027,6 +7027,11 @@ enum {
> * feature is on. It indicates the
> * recorded timestamp.
> */
> + BPF_SOCK_OPS_TS_SW_OPT_CB, /* Called when skb is about to send
> + * to the nic when SO_TIMESTAMPING
Same comment on the "SO_TIMESTAMPING".
* Re: [RFC PATCH net-next v6 08/13] net-timestamp: support hw SCM_TSTAMP_SND for bpf extension
2025-01-21 1:28 ` [RFC PATCH net-next v6 08/13] net-timestamp: support hw " Jason Xing
@ 2025-01-25 0:46 ` Martin KaFai Lau
2025-01-25 1:18 ` Jason Xing
0 siblings, 1 reply; 43+ messages in thread
From: Martin KaFai Lau @ 2025-01-25 0:46 UTC (permalink / raw)
To: Jason Xing
Cc: davem, edumazet, kuba, pabeni, dsahern, willemdebruijn.kernel,
willemb, ast, daniel, andrii, eddyz87, song, yonghong.song,
john.fastabend, kpsingh, sdf, haoluo, jolsa, horms, bpf, netdev
On 1/20/25 5:28 PM, Jason Xing wrote:
> In this patch, we finish the hardware part. Then the bpf program can
> fetch the hwstamp from the skb directly.
>
> To avoid changing the many driver callers that use SKBTX_HW_TSTAMP,
> use the simple modification in this patch to support reporting the
> hardware timestamp.
>
> Signed-off-by: Jason Xing <kerneljasonxing@gmail.com>
> ---
> include/linux/skbuff.h | 4 +++-
> include/uapi/linux/bpf.h | 5 +++++
> net/core/skbuff.c | 11 ++++++-----
> net/dsa/user.c | 2 +-
> net/socket.c | 2 +-
> tools/include/uapi/linux/bpf.h | 5 +++++
> 6 files changed, 21 insertions(+), 8 deletions(-)
>
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index de8d3bd311f5..df2d790ae36b 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -471,7 +471,7 @@ struct skb_shared_hwtstamps {
> /* Definitions for tx_flags in struct skb_shared_info */
> enum {
> /* generate hardware time stamp */
> - SKBTX_HW_TSTAMP = 1 << 0,
> + __SKBTX_HW_TSTAMP = 1 << 0,
>
> /* generate software time stamp when queueing packet to NIC */
> SKBTX_SW_TSTAMP = 1 << 1,
> @@ -495,6 +495,8 @@ enum {
> SKBTX_BPF = 1 << 7,
> };
>
> +#define SKBTX_HW_TSTAMP (__SKBTX_HW_TSTAMP | SKBTX_BPF)
> +
> #define SKBTX_ANY_SW_TSTAMP (SKBTX_SW_TSTAMP | \
> SKBTX_SCHED_TSTAMP | \
> SKBTX_BPF)
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index a6d761f07f67..8936e1061e71 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -7032,6 +7032,11 @@ enum {
> * feature is on. It indicates the
> * recorded timestamp.
> */
> + BPF_SOCK_OPS_TS_HW_OPT_CB, /* Called in hardware phase when
> + * SO_TIMESTAMPING feature is on.
Same comment on the "SO_TIMESTAMPING".
It will be useful to elaborate more on "hardware phase", like exactly when it
will be called,
> + * It indicates the recorded
> + * timestamp.
and the hwtstamps will be in the skb.
> + */
> };
>
> /* List of TCP states. There is a build check in net/ipv4/tcp.c to detect
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 288eb9869827..c769feae5162 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -5548,7 +5548,7 @@ static bool skb_enable_app_tstamp(struct sk_buff *skb, int tstype, bool sw)
> flag = SKBTX_SCHED_TSTAMP;
> break;
> case SCM_TSTAMP_SND:
> - flag = sw ? SKBTX_SW_TSTAMP : SKBTX_HW_TSTAMP;
> + flag = sw ? SKBTX_SW_TSTAMP : __SKBTX_HW_TSTAMP;
> break;
> case SCM_TSTAMP_ACK:
> if (TCP_SKB_CB(skb)->txstamp_ack)
> @@ -5565,7 +5565,8 @@ static bool skb_enable_app_tstamp(struct sk_buff *skb, int tstype, bool sw)
> }
>
> static void skb_tstamp_tx_bpf(struct sk_buff *skb, struct sock *sk,
> - int tstype, bool sw)
> + int tstype, bool sw,
> + struct skb_shared_hwtstamps *hwtstamps)
> {
> int op;
>
> @@ -5577,9 +5578,9 @@ static void skb_tstamp_tx_bpf(struct sk_buff *skb, struct sock *sk,
> op = BPF_SOCK_OPS_TS_SCHED_OPT_CB;
> break;
> case SCM_TSTAMP_SND:
> + op = sw ? BPF_SOCK_OPS_TS_SW_OPT_CB : BPF_SOCK_OPS_TS_HW_OPT_CB;
> if (!sw)
> - return;
> - op = BPF_SOCK_OPS_TS_SW_OPT_CB;
> + *skb_hwtstamps(skb) = *hwtstamps;
hwtstamps may still be NULL, no?
* Re: [RFC PATCH net-next v6 11/13] net-timestamp: add a new callback in tcp_tx_timestamp()
2025-01-21 1:28 ` [RFC PATCH net-next v6 11/13] net-timestamp: add a new callback in tcp_tx_timestamp() Jason Xing
@ 2025-01-25 0:50 ` Martin KaFai Lau
2025-01-25 1:21 ` Jason Xing
0 siblings, 1 reply; 43+ messages in thread
From: Martin KaFai Lau @ 2025-01-25 0:50 UTC (permalink / raw)
To: Jason Xing
Cc: davem, edumazet, kuba, pabeni, dsahern, willemdebruijn.kernel,
willemb, ast, daniel, andrii, eddyz87, song, yonghong.song,
john.fastabend, kpsingh, sdf, haoluo, jolsa, horms, bpf, netdev
On 1/20/25 5:28 PM, Jason Xing wrote:
> Introduce the callback to correlate the tcp_sendmsg timestamp with the
> other three points (SND/SW/ACK). We can let bpf trace the beginning of
> tcp_sendmsg_locked() and fetch the socket addr, so that in
> tcp_tx_timestamp() we can correlate the tskey with the socket addr.
> It is accurate since they are under the protection of the socket lock.
> More details can be found in the selftest.
>
> Signed-off-by: Jason Xing <kerneljasonxing@gmail.com>
> ---
> include/uapi/linux/bpf.h | 3 +++
> net/ipv4/tcp.c | 1 +
> tools/include/uapi/linux/bpf.h | 3 +++
> 3 files changed, 7 insertions(+)
>
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 3b9bfc88345c..55c74fa18163 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -7042,6 +7042,9 @@ enum {
> * feature is on. It indicates the
> * recorded timestamp.
> */
> + BPF_SOCK_OPS_TS_TCP_SND_CB, /* Called when every tcp_sendmsg
> + * syscall is triggered
> + */
I recall we agreed in v5 to adjust the "TCP_" naming part because it will be
used in UDP also. Like completely remove the "TCP_" from the name?
> };
>
> /* List of TCP states. There is a build check in net/ipv4/tcp.c to detect
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 0a41006b10d1..49e489c346ea 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -500,6 +500,7 @@ static void tcp_tx_timestamp(struct sock *sk, struct sockcm_cookie *sockc)
> tcb->txstamp_ack_bpf = 1;
> shinfo->tx_flags |= SKBTX_BPF;
> shinfo->tskey = TCP_SKB_CB(skb)->seq + skb->len - 1;
> + bpf_skops_tx_timestamping(sk, skb, BPF_SOCK_OPS_TS_TCP_SND_CB);
> }
> }
>
> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> index b463aa9c27da..38fc04a7ac20 100644
> --- a/tools/include/uapi/linux/bpf.h
> +++ b/tools/include/uapi/linux/bpf.h
> @@ -7035,6 +7035,9 @@ enum {
> * feature is on. It indicates the
> * recorded timestamp.
> */
> + BPF_SOCK_OPS_TS_TCP_SND_CB, /* Called when every tcp_sendmsg
> + * syscall is triggered
> + */
> };
>
> /* List of TCP states. There is a build check in net/ipv4/tcp.c to detect
* Re: [RFC PATCH net-next v6 12/13] net-timestamp: introduce cgroup lock to avoid affecting non-bpf cases
2025-01-21 1:29 ` [RFC PATCH net-next v6 12/13] net-timestamp: introduce cgroup lock to avoid affecting non-bpf cases Jason Xing
@ 2025-01-25 1:09 ` Martin KaFai Lau
2025-01-25 1:25 ` Jason Xing
0 siblings, 1 reply; 43+ messages in thread
From: Martin KaFai Lau @ 2025-01-25 1:09 UTC (permalink / raw)
To: Jason Xing
Cc: davem, edumazet, kuba, pabeni, dsahern, willemdebruijn.kernel,
willemb, ast, daniel, andrii, eddyz87, song, yonghong.song,
john.fastabend, kpsingh, sdf, haoluo, jolsa, horms, bpf, netdev
On 1/20/25 5:29 PM, Jason Xing wrote:
> Introduce the lock to avoid affecting applications which
> are not using the timestamping bpf feature.
>
> Signed-off-by: Jason Xing <kerneljasonxing@gmail.com>
> ---
> net/core/skbuff.c | 6 ++++--
> net/ipv4/tcp.c | 3 ++-
> net/ipv4/tcp_input.c | 3 ++-
> net/ipv4/tcp_output.c | 3 ++-
> 4 files changed, 10 insertions(+), 5 deletions(-)
>
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 33340e0b094f..db5b4b653351 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -5605,11 +5605,13 @@ void __skb_tstamp_tx(struct sk_buff *orig_skb,
> return;
>
> /* bpf extension feature entry */
> - if (skb_shinfo(orig_skb)->tx_flags & SKBTX_BPF)
> + if (cgroup_bpf_enabled(CGROUP_SOCK_OPS) &&
I wonder if it is really needed. The caller has just tested the tx_flags.
> + skb_shinfo(orig_skb)->tx_flags & SKBTX_BPF)
> skb_tstamp_tx_bpf(orig_skb, sk, tstype, sw, hwtstamps);
>
> /* application feature entry */
> - if (!skb_enable_app_tstamp(orig_skb, tstype, sw))
> + if (cgroup_bpf_enabled(CGROUP_SOCK_OPS) &&
Same here and this one looks wrong also. The userspace may get something
unexpected in the err queue. The bpf prog may have already detached here after
setting the SKBTX_BPF earlier.
> + !skb_enable_app_tstamp(orig_skb, tstype, sw))
> return;
>
> tsflags = READ_ONCE(sk->sk_tsflags);
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 49e489c346ea..d88160af00c4 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -493,7 +493,8 @@ static void tcp_tx_timestamp(struct sock *sk, struct sockcm_cookie *sockc)
> shinfo->tskey = TCP_SKB_CB(skb)->seq + skb->len - 1;
> }
>
> - if (SK_BPF_CB_FLAG_TEST(sk, SK_BPF_CB_TX_TIMESTAMPING) && skb) {
> + if (cgroup_bpf_enabled(CGROUP_SOCK_OPS) &&
This looks ok considering SK_BPF_CB_FLAG_TEST may get to another cacheline.
> + SK_BPF_CB_FLAG_TEST(sk, SK_BPF_CB_TX_TIMESTAMPING) && skb) {
> struct skb_shared_info *shinfo = skb_shinfo(skb);
> struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
>
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index c8945f5be31b..e30607ba41e5 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -3324,7 +3324,8 @@ static void tcp_ack_tstamp(struct sock *sk, struct sk_buff *skb,
>
> /* Avoid cache line misses to get skb_shinfo() and shinfo->tx_flags */
> if (likely(!TCP_SKB_CB(skb)->txstamp_ack &&
> - !TCP_SKB_CB(skb)->txstamp_ack_bpf))
> + !(cgroup_bpf_enabled(CGROUP_SOCK_OPS) &&
Same here. txstamp_ack has just been tested... txstamp_ack_bpf is the next bit.
> + TCP_SKB_CB(skb)->txstamp_ack_bpf)))
> return;
>
> shinfo = skb_shinfo(skb);
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index fc84ca669b76..483f19c2083e 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -1556,7 +1556,8 @@ static void tcp_adjust_pcount(struct sock *sk, const struct sk_buff *skb, int de
> static bool tcp_has_tx_tstamp(const struct sk_buff *skb)
> {
> return TCP_SKB_CB(skb)->txstamp_ack ||
> - TCP_SKB_CB(skb)->txstamp_ack_bpf ||
> + (cgroup_bpf_enabled(CGROUP_SOCK_OPS) &&
Same here.
> + TCP_SKB_CB(skb)->txstamp_ack_bpf) ||
> (skb_shinfo(skb)->tx_flags & SKBTX_ANY_TSTAMP);
> }
>
* Re: [RFC PATCH net-next v6 04/13] bpf: stop UDP sock accessing TCP fields in sock_op BPF CALLs
2025-01-25 0:28 ` Martin KaFai Lau
@ 2025-01-25 1:15 ` Jason Xing
2025-01-25 1:32 ` Jason Xing
2025-01-25 2:25 ` Martin KaFai Lau
0 siblings, 2 replies; 43+ messages in thread
From: Jason Xing @ 2025-01-25 1:15 UTC (permalink / raw)
To: Martin KaFai Lau
Cc: davem, edumazet, kuba, pabeni, dsahern, willemdebruijn.kernel,
willemb, ast, daniel, andrii, eddyz87, song, yonghong.song,
john.fastabend, kpsingh, sdf, haoluo, jolsa, horms, bpf, netdev
On Sat, Jan 25, 2025 at 8:28 AM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 1/20/25 5:28 PM, Jason Xing wrote:
> > In the next round, we will support the UDP proto for the SO_TIMESTAMPING
> > bpf extension, so we need to ensure there is no safety problem, which
> > is usually caused by a UDP socket trying to access TCP fields.
> >
> > These approaches can be categorized into two groups:
> > 1. add TCP protocol check
> > 2. add sock op check
>
> Same as patch 3. The commit message needs adjustment. I would combine patch 3
> and patch 4 because ...
I wonder if you refer to "squashing" patch 4 into patch 3?
>
> >
> > Signed-off-by: Jason Xing <kerneljasonxing@gmail.com>
> > ---
> > net/core/filter.c | 19 +++++++++++++++++--
> > 1 file changed, 17 insertions(+), 2 deletions(-)
> >
> > diff --git a/net/core/filter.c b/net/core/filter.c
> > index fdd305b4cfbb..934431886876 100644
> > --- a/net/core/filter.c
> > +++ b/net/core/filter.c
> > @@ -5523,6 +5523,11 @@ static int __bpf_setsockopt(struct sock *sk, int level, int optname,
> > return -EINVAL;
> > }
> >
> > +static bool is_locked_tcp_sock_ops(struct bpf_sock_ops_kern *bpf_sock)
> > +{
> > + return bpf_sock->op <= BPF_SOCK_OPS_WRITE_HDR_OPT_CB;
>
> More bike shedding...
>
> After sleeping on it again, I think it can just test the
> bpf_sock->allow_tcp_access instead.
Sorry, I don't think it can work for all the cases because:
1) please see BPF_SOCK_OPS_WRITE_HDR_OPT_CB/BPF_SOCK_OPS_HDR_OPT_LEN_CB:
if req exists, allow_tcp_access is never initialized, so calling a
function like bpf_sock_ops_setsockopt() will be rejected because
allow_tcp_access is zero.
2) tcp_call_bpf() sets allow_tcp_access only when the socket is a
fullsock. As far as I know, all the callers have a fullsock for
now, but in the future they might not.
If we should use allow_tcp_access to test, then the following patch
should be folded into patch 3, right?
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 0e5b9a654254..9cd7d4446617 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -525,6 +525,7 @@ static void bpf_skops_hdr_opt_len(struct sock *sk,
struct sk_buff *skb,
sock_ops.sk = sk;
}
+ sock_ops.allow_tcp_access = 1;
sock_ops.args[0] = bpf_skops_write_hdr_opt_arg0(skb, synack_type);
sock_ops.remaining_opt_len = *remaining;
/* tcp_current_mss() does not pass a skb */
>
>
> > +}
> > +
> > static int _bpf_setsockopt(struct sock *sk, int level, int optname,
> > char *optval, int optlen)
> > {
> > @@ -5673,7 +5678,12 @@ static const struct bpf_func_proto bpf_sock_addr_getsockopt_proto = {
> > BPF_CALL_5(bpf_sock_ops_setsockopt, struct bpf_sock_ops_kern *, bpf_sock,
> > int, level, int, optname, char *, optval, int, optlen)
> > {
> > - return _bpf_setsockopt(bpf_sock->sk, level, optname, optval, optlen);
> > + struct sock *sk = bpf_sock->sk;
> > +
> > + if (is_locked_tcp_sock_ops(bpf_sock) && sk_fullsock(sk))
>
> afaict, the new timestamping callbacks still can do setsockopt and it is
> incorrect. It should be:
>
> if (!bpf_sock->allow_tcp_access)
> return -EOPNOTSUPP;
>
> I recalled I have asked in v5 but it may be buried in the long thread, so asking
> here again. Please add test(s) to check that the new timestamping callbacks
> cannot call setsockopt and read/write to some of the tcp_sock fields through the
> bpf_sock_ops.
>
> > + sock_owned_by_me(sk);
>
> Not needed and instead...
Sorry, I don't follow you here. What I was doing was letting
non-timestamping callbacks be checked by sock_owned_by_me();
if the callback belongs to timestamping, we skip the check.
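The split described here falls out of how the callback enum is laid out: everything up to and including BPF_SOCK_OPS_WRITE_HDR_OPT_CB runs under the socket lock, and the new timestamping callbacks are appended after it. A small standalone sketch (the numeric values below are made up for illustration, not the real uapi ones):

```c
#include <stdbool.h>

/* Illustrative subset of the sock_ops callback enum: locked callbacks
 * come first, the new timestamping callbacks are appended after
 * BPF_SOCK_OPS_WRITE_HDR_OPT_CB. Values are invented for this sketch. */
enum sock_ops_cb {
	BPF_SOCK_OPS_VOID = 0,
	BPF_SOCK_OPS_WRITE_HDR_OPT_CB = 15,	/* last locked callback */
	BPF_SOCK_OPS_TS_SCHED_OPT_CB,		/* timestamping: no lock held */
	BPF_SOCK_OPS_TS_SW_OPT_CB,
	BPF_SOCK_OPS_TS_HW_OPT_CB,
};

/* Mirrors is_locked_tcp_sock_ops() from the patch: a single range
 * compare distinguishes locked from lockless callbacks. */
static bool is_locked_tcp_sock_ops(enum sock_ops_cb op)
{
	return op <= BPF_SOCK_OPS_WRITE_HDR_OPT_CB;
}
```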
>
> > +
> > + return __bpf_setsockopt(sk, level, optname, optval, optlen);
>
> keep the original _bpf_setsockopt().
Oh, I remember we've already assumed/agreed that the timestamping socket
must be a fullsock. I will use it.
>
> > }
> >
> > static const struct bpf_func_proto bpf_sock_ops_setsockopt_proto = {
> > @@ -5759,6 +5769,7 @@ BPF_CALL_5(bpf_sock_ops_getsockopt, struct bpf_sock_ops_kern *, bpf_sock,
> > int, level, int, optname, char *, optval, int, optlen)
> > {
> > if (IS_ENABLED(CONFIG_INET) && level == SOL_TCP &&
> > + bpf_sock->sk->sk_protocol == IPPROTO_TCP &&
> > optname >= TCP_BPF_SYN && optname <= TCP_BPF_SYN_MAC) {
>
> No need to allow getsockopt regardless what SOL_* it is asking. To keep it
> simple, I would just disable both getsockopt and setsockopt for all SOL_* for
Really? I'm shocked because the selftests in this series call
bpf_sock_ops_getsockopt() and bpf_sock_ops_setsockopt() in patch
[13/13]:
...
if (bpf_setsockopt(ctx, level, opt, &new, sizeof(new)))
...
> the new timestamping callbacks. Nothing is lost, the bpf prog can directly read
> the sk.
>
> > int ret, copy_len = 0;
> > const u8 *start;
> > @@ -5800,7 +5811,8 @@ BPF_CALL_2(bpf_sock_ops_cb_flags_set, struct bpf_sock_ops_kern *, bpf_sock,
> > struct sock *sk = bpf_sock->sk;
> > int val = argval & BPF_SOCK_OPS_ALL_CB_FLAGS;
> >
> > - if (!IS_ENABLED(CONFIG_INET) || !sk_fullsock(sk))
> > + if (!IS_ENABLED(CONFIG_INET) || !sk_fullsock(sk) ||
> > + sk->sk_protocol != IPPROTO_TCP)
>
> Same here. It should disallow this "set" helper for the timestamping callbacks
> which do not hold the lock.
>
> > return -EINVAL;
> >
> > tcp_sk(sk)->bpf_sock_ops_cb_flags = val;
> > @@ -7609,6 +7621,9 @@ BPF_CALL_4(bpf_sock_ops_load_hdr_opt, struct bpf_sock_ops_kern *, bpf_sock,
> > u8 search_kind, search_len, copy_len, magic_len;
> > int ret;
> >
> > + if (!is_locked_tcp_sock_ops(bpf_sock))
> > + return -EOPNOTSUPP;
>
> This is correct, just change it to "!bpf_sock->allow_tcp_access".
>
> All the above changed helpers should use the same test and the same return handling.
>
> > +
> > /* 2 byte is the minimal option len except TCPOPT_NOP and
> > * TCPOPT_EOL which are useless for the bpf prog to learn
> > * and this helper disallow loading them also.
>
^ permalink raw reply related [flat|nested] 43+ messages in thread
* Re: [RFC PATCH net-next v6 06/13] net-timestamp: support SCM_TSTAMP_SCHED for bpf extension
2025-01-25 0:38 ` Martin KaFai Lau
@ 2025-01-25 1:16 ` Jason Xing
0 siblings, 0 replies; 43+ messages in thread
From: Jason Xing @ 2025-01-25 1:16 UTC (permalink / raw)
To: Martin KaFai Lau
Cc: davem, edumazet, kuba, pabeni, dsahern, willemdebruijn.kernel,
willemb, ast, daniel, andrii, eddyz87, song, yonghong.song,
john.fastabend, kpsingh, sdf, haoluo, jolsa, horms, bpf, netdev
On Sat, Jan 25, 2025 at 8:38 AM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 1/20/25 5:28 PM, Jason Xing wrote:
> > SKBTX_BPF is introduced as an indicator telling us whether
> > the skb should be traced by the bpf prog.
> >
> > Signed-off-by: Jason Xing <kerneljasonxing@gmail.com>
> > ---
> > include/linux/skbuff.h | 6 +++++-
> > include/uapi/linux/bpf.h | 5 +++++
> > net/core/dev.c | 3 ++-
> > net/core/skbuff.c | 23 +++++++++++++++++++++++
> > tools/include/uapi/linux/bpf.h | 5 +++++
> > 5 files changed, 40 insertions(+), 2 deletions(-)
> >
> > diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> > index dfc419281cc9..35c2e864dd4b 100644
> > --- a/include/linux/skbuff.h
> > +++ b/include/linux/skbuff.h
> > @@ -490,10 +490,14 @@ enum {
> >
> > /* generate software time stamp when entering packet scheduling */
> > SKBTX_SCHED_TSTAMP = 1 << 6,
> > +
> > + /* used for bpf extension when a bpf program is loaded */
> > + SKBTX_BPF = 1 << 7,
> > };
> >
> > #define SKBTX_ANY_SW_TSTAMP (SKBTX_SW_TSTAMP | \
> > - SKBTX_SCHED_TSTAMP)
> > + SKBTX_SCHED_TSTAMP | \
> > + SKBTX_BPF)
> > #define SKBTX_ANY_TSTAMP (SKBTX_HW_TSTAMP | \
> > SKBTX_HW_TSTAMP_USE_CYCLES | \
> > SKBTX_ANY_SW_TSTAMP)
> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > index e629e09b0b31..72f93c6e45c1 100644
> > --- a/include/uapi/linux/bpf.h
> > +++ b/include/uapi/linux/bpf.h
> > @@ -7022,6 +7022,11 @@ enum {
> > * by the kernel or the
> > * earlier bpf-progs.
> > */
> > + BPF_SOCK_OPS_TS_SCHED_OPT_CB, /* Called when skb is passing through
> > + * dev layer when SO_TIMESTAMPING
>
> The "SO_TIMESTAMPING" term is not accurate. I guess you meant
> "SK_BPF_CB_TX_TIMESTAMPING"?
>
> Also, may be "Called before skb entering qdisc"?
>
> > + * feature is on. It indicates the
> > + * recorded timestamp.
>
> There is no timestamp recorded also.
Thanks for correcting me. I'll adjust them.
Thanks,
Jason
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [RFC PATCH net-next v6 07/13] net-timestamp: support sw SCM_TSTAMP_SND for bpf extension
2025-01-25 0:40 ` Martin KaFai Lau
@ 2025-01-25 1:17 ` Jason Xing
0 siblings, 0 replies; 43+ messages in thread
From: Jason Xing @ 2025-01-25 1:17 UTC (permalink / raw)
To: Martin KaFai Lau
Cc: davem, edumazet, kuba, pabeni, dsahern, willemdebruijn.kernel,
willemb, ast, daniel, andrii, eddyz87, song, yonghong.song,
john.fastabend, kpsingh, sdf, haoluo, jolsa, horms, bpf, netdev
On Sat, Jan 25, 2025 at 8:41 AM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 1/20/25 5:28 PM, Jason Xing wrote:
> > Support SCM_TSTAMP_SND case. Then we will get the software
> > timestamp when the driver is about to send the skb. Later, I
> > will support the hardware timestamp.
> >
> > Signed-off-by: Jason Xing <kerneljasonxing@gmail.com>
> > ---
> > include/linux/skbuff.h | 2 +-
> > include/uapi/linux/bpf.h | 5 +++++
> > net/core/skbuff.c | 10 ++++++++--
> > tools/include/uapi/linux/bpf.h | 5 +++++
> > 4 files changed, 19 insertions(+), 3 deletions(-)
> >
> > diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> > index 35c2e864dd4b..de8d3bd311f5 100644
> > --- a/include/linux/skbuff.h
> > +++ b/include/linux/skbuff.h
> > @@ -4569,7 +4569,7 @@ void skb_tstamp_tx(struct sk_buff *orig_skb,
> > static inline void skb_tx_timestamp(struct sk_buff *skb)
> > {
> > skb_clone_tx_timestamp(skb);
> > - if (skb_shinfo(skb)->tx_flags & SKBTX_SW_TSTAMP)
> > + if (skb_shinfo(skb)->tx_flags & (SKBTX_SW_TSTAMP | SKBTX_BPF))
> > __skb_tstamp_tx(skb, NULL, NULL, skb->sk, true, SCM_TSTAMP_SND);
> > }
> >
> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > index 72f93c6e45c1..a6d761f07f67 100644
> > --- a/include/uapi/linux/bpf.h
> > +++ b/include/uapi/linux/bpf.h
> > @@ -7027,6 +7027,11 @@ enum {
> > * feature is on. It indicates the
> > * recorded timestamp.
> > */
> > + BPF_SOCK_OPS_TS_SW_OPT_CB, /* Called when skb is about to send
> > + * to the nic when SO_TIMESTAMPING
>
> Same comment on the "SO_TIMESTAMPING".
Got it. Thanks.
Thanks,
Jason
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [RFC PATCH net-next v6 08/13] net-timestamp: support hw SCM_TSTAMP_SND for bpf extension
2025-01-25 0:46 ` Martin KaFai Lau
@ 2025-01-25 1:18 ` Jason Xing
2025-01-25 1:29 ` Martin KaFai Lau
0 siblings, 1 reply; 43+ messages in thread
From: Jason Xing @ 2025-01-25 1:18 UTC (permalink / raw)
To: Martin KaFai Lau
Cc: davem, edumazet, kuba, pabeni, dsahern, willemdebruijn.kernel,
willemb, ast, daniel, andrii, eddyz87, song, yonghong.song,
john.fastabend, kpsingh, sdf, haoluo, jolsa, horms, bpf, netdev
On Sat, Jan 25, 2025 at 8:47 AM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 1/20/25 5:28 PM, Jason Xing wrote:
> > In this patch, we finish the hardware part. Then bpf program can
> > fetch the hwstamp from skb directly.
> >
> > To avoid changing so many callers using SKBTX_HW_TSTAMP from drivers,
> > use this simple modification like this patch does to support printing
> > hardware timestamp.
> >
> > Signed-off-by: Jason Xing <kerneljasonxing@gmail.com>
> > ---
> > include/linux/skbuff.h | 4 +++-
> > include/uapi/linux/bpf.h | 5 +++++
> > net/core/skbuff.c | 11 ++++++-----
> > net/dsa/user.c | 2 +-
> > net/socket.c | 2 +-
> > tools/include/uapi/linux/bpf.h | 5 +++++
> > 6 files changed, 21 insertions(+), 8 deletions(-)
> >
> > diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> > index de8d3bd311f5..df2d790ae36b 100644
> > --- a/include/linux/skbuff.h
> > +++ b/include/linux/skbuff.h
> > @@ -471,7 +471,7 @@ struct skb_shared_hwtstamps {
> > /* Definitions for tx_flags in struct skb_shared_info */
> > enum {
> > /* generate hardware time stamp */
> > - SKBTX_HW_TSTAMP = 1 << 0,
> > + __SKBTX_HW_TSTAMP = 1 << 0,
> >
> > /* generate software time stamp when queueing packet to NIC */
> > SKBTX_SW_TSTAMP = 1 << 1,
> > @@ -495,6 +495,8 @@ enum {
> > SKBTX_BPF = 1 << 7,
> > };
> >
> > +#define SKBTX_HW_TSTAMP (__SKBTX_HW_TSTAMP | SKBTX_BPF)
> > +
> > #define SKBTX_ANY_SW_TSTAMP (SKBTX_SW_TSTAMP | \
> > SKBTX_SCHED_TSTAMP | \
> > SKBTX_BPF)
> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > index a6d761f07f67..8936e1061e71 100644
> > --- a/include/uapi/linux/bpf.h
> > +++ b/include/uapi/linux/bpf.h
> > @@ -7032,6 +7032,11 @@ enum {
> > * feature is on. It indicates the
> > * recorded timestamp.
> > */
> > + BPF_SOCK_OPS_TS_HW_OPT_CB, /* Called in hardware phase when
> > + * SO_TIMESTAMPING feature is on.
>
> Same comment on the "SO_TIMESTAMPING".
>
> It will be useful to elaborate more on "hardware phase", like exactly when it
> will be called,
Got it.
>
> > + * It indicates the recorded
> > + * timestamp.
>
> and the hwtstamps will be in the skb.
Right.
>
> > + */
> > };
> >
> > /* List of TCP states. There is a build check in net/ipv4/tcp.c to detect
> > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > index 288eb9869827..c769feae5162 100644
> > --- a/net/core/skbuff.c
> > +++ b/net/core/skbuff.c
> > @@ -5548,7 +5548,7 @@ static bool skb_enable_app_tstamp(struct sk_buff *skb, int tstype, bool sw)
> > flag = SKBTX_SCHED_TSTAMP;
> > break;
> > case SCM_TSTAMP_SND:
> > - flag = sw ? SKBTX_SW_TSTAMP : SKBTX_HW_TSTAMP;
> > + flag = sw ? SKBTX_SW_TSTAMP : __SKBTX_HW_TSTAMP;
> > break;
> > case SCM_TSTAMP_ACK:
> > if (TCP_SKB_CB(skb)->txstamp_ack)
> > @@ -5565,7 +5565,8 @@ static bool skb_enable_app_tstamp(struct sk_buff *skb, int tstype, bool sw)
> > }
> >
> > static void skb_tstamp_tx_bpf(struct sk_buff *skb, struct sock *sk,
> > - int tstype, bool sw)
> > + int tstype, bool sw,
> > + struct skb_shared_hwtstamps *hwtstamps)
> > {
> > int op;
> >
> > @@ -5577,9 +5578,9 @@ static void skb_tstamp_tx_bpf(struct sk_buff *skb, struct sock *sk,
> > op = BPF_SOCK_OPS_TS_SCHED_OPT_CB;
> > break;
> > case SCM_TSTAMP_SND:
> > + op = sw ? BPF_SOCK_OPS_TS_SW_OPT_CB : BPF_SOCK_OPS_TS_HW_OPT_CB;
> > if (!sw)
> > - return;
> > - op = BPF_SOCK_OPS_TS_SW_OPT_CB;
> > + *skb_hwtstamps(skb) = *hwtstamps;
>
> hwtstamps may still be NULL, no?
Right, it can be NULL if something goes wrong.
Thanks,
Jason
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [RFC PATCH net-next v6 11/13] net-timestamp: add a new callback in tcp_tx_timestamp()
2025-01-25 0:50 ` Martin KaFai Lau
@ 2025-01-25 1:21 ` Jason Xing
0 siblings, 0 replies; 43+ messages in thread
From: Jason Xing @ 2025-01-25 1:21 UTC (permalink / raw)
To: Martin KaFai Lau
Cc: davem, edumazet, kuba, pabeni, dsahern, willemdebruijn.kernel,
willemb, ast, daniel, andrii, eddyz87, song, yonghong.song,
john.fastabend, kpsingh, sdf, haoluo, jolsa, horms, bpf, netdev
On Sat, Jan 25, 2025 at 8:50 AM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 1/20/25 5:28 PM, Jason Xing wrote:
> > Introduce the callback to correlate tcp_sendmsg timestamp with other
> > three points (SND/SW/ACK). We can let bpf trace the beginning of
> > tcp_sendmsg_locked() and fetch the socket addr, so that in
> > tcp_tx_timestamp() we can correlate the tskey with the socket addr.
> > It is accurate since they are under the protection of the socket lock.
> > More details can be found in the selftest.
> >
> > Signed-off-by: Jason Xing <kerneljasonxing@gmail.com>
> > ---
> > include/uapi/linux/bpf.h | 3 +++
> > net/ipv4/tcp.c | 1 +
> > tools/include/uapi/linux/bpf.h | 3 +++
> > 3 files changed, 7 insertions(+)
> >
> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > index 3b9bfc88345c..55c74fa18163 100644
> > --- a/include/uapi/linux/bpf.h
> > +++ b/include/uapi/linux/bpf.h
> > @@ -7042,6 +7042,9 @@ enum {
> > * feature is on. It indicates the
> > * recorded timestamp.
> > */
> > + BPF_SOCK_OPS_TS_TCP_SND_CB, /* Called when every tcp_sendmsg
> > + * syscall is triggered
> > + */
>
> I recall we agreed in v5 to adjust the "TCP_" naming part because it will be
> used in UDP also. Like completely remove the "TCP_" from the name?
Right. The thing is that, after that discussion, I changed my mind
because I'm not so sure I need this for UDP (I need more time to
think about the UDP case), which can be traced by fentry, sorry.
I will follow your instructions to remove "TCP_".
Thanks,
Jason
>
> > };
> >
> > /* List of TCP states. There is a build check in net/ipv4/tcp.c to detect
> > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> > index 0a41006b10d1..49e489c346ea 100644
> > --- a/net/ipv4/tcp.c
> > +++ b/net/ipv4/tcp.c
> > @@ -500,6 +500,7 @@ static void tcp_tx_timestamp(struct sock *sk, struct sockcm_cookie *sockc)
> > tcb->txstamp_ack_bpf = 1;
> > shinfo->tx_flags |= SKBTX_BPF;
> > shinfo->tskey = TCP_SKB_CB(skb)->seq + skb->len - 1;
> > + bpf_skops_tx_timestamping(sk, skb, BPF_SOCK_OPS_TS_TCP_SND_CB);
> > }
> > }
> >
> > diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> > index b463aa9c27da..38fc04a7ac20 100644
> > --- a/tools/include/uapi/linux/bpf.h
> > +++ b/tools/include/uapi/linux/bpf.h
> > @@ -7035,6 +7035,9 @@ enum {
> > * feature is on. It indicates the
> > * recorded timestamp.
> > */
> > + BPF_SOCK_OPS_TS_TCP_SND_CB, /* Called when every tcp_sendmsg
> > + * syscall is triggered
> > + */
> > };
> >
> > /* List of TCP states. There is a build check in net/ipv4/tcp.c to detect
>
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [RFC PATCH net-next v6 12/13] net-timestamp: introduce cgroup lock to avoid affecting non-bpf cases
2025-01-25 1:09 ` Martin KaFai Lau
@ 2025-01-25 1:25 ` Jason Xing
0 siblings, 0 replies; 43+ messages in thread
From: Jason Xing @ 2025-01-25 1:25 UTC (permalink / raw)
To: Martin KaFai Lau
Cc: davem, edumazet, kuba, pabeni, dsahern, willemdebruijn.kernel,
willemb, ast, daniel, andrii, eddyz87, song, yonghong.song,
john.fastabend, kpsingh, sdf, haoluo, jolsa, horms, bpf, netdev
On Sat, Jan 25, 2025 at 9:09 AM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 1/20/25 5:29 PM, Jason Xing wrote:
> > Introducing the lock to avoid affecting the applications which
> > are not using timestamping bpf feature.
> >
> > Signed-off-by: Jason Xing <kerneljasonxing@gmail.com>
> > ---
> > net/core/skbuff.c | 6 ++++--
> > net/ipv4/tcp.c | 3 ++-
> > net/ipv4/tcp_input.c | 3 ++-
> > net/ipv4/tcp_output.c | 3 ++-
> > 4 files changed, 10 insertions(+), 5 deletions(-)
> >
> > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > index 33340e0b094f..db5b4b653351 100644
> > --- a/net/core/skbuff.c
> > +++ b/net/core/skbuff.c
> > @@ -5605,11 +5605,13 @@ void __skb_tstamp_tx(struct sk_buff *orig_skb,
> > return;
> >
> > /* bpf extension feature entry */
> > - if (skb_shinfo(orig_skb)->tx_flags & SKBTX_BPF)
> > + if (cgroup_bpf_enabled(CGROUP_SOCK_OPS) &&
>
> I wonder if it is really needed. The caller has just tested the tx_flags.
>
> > + skb_shinfo(orig_skb)->tx_flags & SKBTX_BPF)
> > skb_tstamp_tx_bpf(orig_skb, sk, tstype, sw, hwtstamps);
> >
> > /* application feature entry */
> > - if (!skb_enable_app_tstamp(orig_skb, tstype, sw))
> > + if (cgroup_bpf_enabled(CGROUP_SOCK_OPS) &&
>
> Same here and this one looks wrong also. The userspace may get something
> unexpected in the err queue. The bpf prog may have already detached here after
> setting the SKBTX_BPF earlier.
Oh, thanks for spotting this case.
>
> > + !skb_enable_app_tstamp(orig_skb, tstype, sw))
> > return;
> >
> > tsflags = READ_ONCE(sk->sk_tsflags);
> > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> > index 49e489c346ea..d88160af00c4 100644
> > --- a/net/ipv4/tcp.c
> > +++ b/net/ipv4/tcp.c
> > @@ -493,7 +493,8 @@ static void tcp_tx_timestamp(struct sock *sk, struct sockcm_cookie *sockc)
> > shinfo->tskey = TCP_SKB_CB(skb)->seq + skb->len - 1;
> > }
> >
> > - if (SK_BPF_CB_FLAG_TEST(sk, SK_BPF_CB_TX_TIMESTAMPING) && skb) {
> > + if (cgroup_bpf_enabled(CGROUP_SOCK_OPS) &&
>
> This looks ok considering SK_BPF_CB_FLAG_TEST may get to another cacheline.
>
> > + SK_BPF_CB_FLAG_TEST(sk, SK_BPF_CB_TX_TIMESTAMPING) && skb) {
> > struct skb_shared_info *shinfo = skb_shinfo(skb);
> > struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
> >
> > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> > index c8945f5be31b..e30607ba41e5 100644
> > --- a/net/ipv4/tcp_input.c
> > +++ b/net/ipv4/tcp_input.c
> > @@ -3324,7 +3324,8 @@ static void tcp_ack_tstamp(struct sock *sk, struct sk_buff *skb,
> >
> > /* Avoid cache line misses to get skb_shinfo() and shinfo->tx_flags */
> > if (likely(!TCP_SKB_CB(skb)->txstamp_ack &&
> > - !TCP_SKB_CB(skb)->txstamp_ack_bpf))
> > + !(cgroup_bpf_enabled(CGROUP_SOCK_OPS) &&
>
> Same here. txtstamp_ack has just been tested.... txstamp_ack_bpf is the next bit.
>
> > + TCP_SKB_CB(skb)->txstamp_ack_bpf)))
> > return;
> >
> > shinfo = skb_shinfo(skb);
> > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> > index fc84ca669b76..483f19c2083e 100644
> > --- a/net/ipv4/tcp_output.c
> > +++ b/net/ipv4/tcp_output.c
> > @@ -1556,7 +1556,8 @@ static void tcp_adjust_pcount(struct sock *sk, const struct sk_buff *skb, int de
> > static bool tcp_has_tx_tstamp(const struct sk_buff *skb)
> > {
> > return TCP_SKB_CB(skb)->txstamp_ack ||
> > - TCP_SKB_CB(skb)->txstamp_ack_bpf ||
> > + (cgroup_bpf_enabled(CGROUP_SOCK_OPS) &&
>
> Same here.
I thought cgroup_bpf_enabled(), which consumes nearly no
resources, could guard the timestamping use everywhere. I have no
strong preference. I will remove them as you suggested.
Thanks,
Jason
>
> > + TCP_SKB_CB(skb)->txstamp_ack_bpf) ||
> > (skb_shinfo(skb)->tx_flags & SKBTX_ANY_TSTAMP);
> > }
> >
>
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [RFC PATCH net-next v6 08/13] net-timestamp: support hw SCM_TSTAMP_SND for bpf extension
2025-01-25 1:18 ` Jason Xing
@ 2025-01-25 1:29 ` Martin KaFai Lau
2025-01-25 1:35 ` Jason Xing
0 siblings, 1 reply; 43+ messages in thread
From: Martin KaFai Lau @ 2025-01-25 1:29 UTC (permalink / raw)
To: Jason Xing
Cc: davem, edumazet, kuba, pabeni, dsahern, willemdebruijn.kernel,
willemb, ast, daniel, andrii, eddyz87, song, yonghong.song,
john.fastabend, kpsingh, sdf, haoluo, jolsa, horms, bpf, netdev
On 1/24/25 5:18 PM, Jason Xing wrote:
>>> @@ -5577,9 +5578,9 @@ static void skb_tstamp_tx_bpf(struct sk_buff *skb, struct sock *sk,
>>> op = BPF_SOCK_OPS_TS_SCHED_OPT_CB;
>>> break;
>>> case SCM_TSTAMP_SND:
>>> + op = sw ? BPF_SOCK_OPS_TS_SW_OPT_CB : BPF_SOCK_OPS_TS_HW_OPT_CB;
>>> if (!sw)
>>> - return;
>>> - op = BPF_SOCK_OPS_TS_SW_OPT_CB;
>>> + *skb_hwtstamps(skb) = *hwtstamps;
>> hwtstamps may still be NULL, no?
> Right, it can be zero if something wrong happens.
Then it needs a NULL check, no?
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [RFC PATCH net-next v6 04/13] bpf: stop UDP sock accessing TCP fields in sock_op BPF CALLs
2025-01-25 1:15 ` Jason Xing
@ 2025-01-25 1:32 ` Jason Xing
2025-01-25 2:25 ` Martin KaFai Lau
1 sibling, 0 replies; 43+ messages in thread
From: Jason Xing @ 2025-01-25 1:32 UTC (permalink / raw)
To: Martin KaFai Lau
Cc: davem, edumazet, kuba, pabeni, dsahern, willemdebruijn.kernel,
willemb, ast, daniel, andrii, eddyz87, song, yonghong.song,
john.fastabend, kpsingh, sdf, haoluo, jolsa, horms, bpf, netdev
On Sat, Jan 25, 2025 at 9:15 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
>
> On Sat, Jan 25, 2025 at 8:28 AM Martin KaFai Lau <martin.lau@linux.dev> wrote:
> >
> > On 1/20/25 5:28 PM, Jason Xing wrote:
> > > In the next round, we will support the UDP proto for SO_TIMESTAMPING
> > > bpf extension, so we need to ensure there is no safety problem, which
> > > is usually caused by UDP socket trying to access TCP fields.
> > >
> > > These approaches can be categorized into two groups:
> > > 1. add TCP protocol check
> > > 2. add sock op check
> >
> > Same as patch 3. The commit message needs adjustment. I would combine patch 3
> > and patch 4 because ...
>
> I wonder if you refer to "squashing" patch 4 into patch 3?
>
> >
> > >
> > > Signed-off-by: Jason Xing <kerneljasonxing@gmail.com>
> > > ---
> > > net/core/filter.c | 19 +++++++++++++++++--
> > > 1 file changed, 17 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/net/core/filter.c b/net/core/filter.c
> > > index fdd305b4cfbb..934431886876 100644
> > > --- a/net/core/filter.c
> > > +++ b/net/core/filter.c
> > > @@ -5523,6 +5523,11 @@ static int __bpf_setsockopt(struct sock *sk, int level, int optname,
> > > return -EINVAL;
> > > }
> > >
> > > +static bool is_locked_tcp_sock_ops(struct bpf_sock_ops_kern *bpf_sock)
> > > +{
> > > + return bpf_sock->op <= BPF_SOCK_OPS_WRITE_HDR_OPT_CB;
> >
> > More bike shedding...
> >
> > After sleeping on it again, I think it can just test the
> > bpf_sock->allow_tcp_access instead.
>
> Sorry, I don't think it can work for all the cases because:
> 1) please see BPF_SOCK_OPS_WRITE_HDR_OPT_CB/BPF_SOCK_OPS_HDR_OPT_LEN_CB,
> if req exists, there is no allow_tcp_access initialization. Then
> calling some function like bpf_sock_ops_setsockopt will be rejected
> because allow_tcp_access is zero.
> 2) tcp_call_bpf() only set allow_tcp_access only when the socket is
> fullsock. As far as I know, all the callers have a fullsock for
> now, but in the future it might not.
>
> If we should use allow_tcp_access to test, then the following patch
> should be folded into patch 3, right?
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 0e5b9a654254..9cd7d4446617 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -525,6 +525,7 @@ static void bpf_skops_hdr_opt_len(struct sock *sk,
> struct sk_buff *skb,
> sock_ops.sk = sk;
> }
>
> + sock_ops.allow_tcp_access = 1;
> sock_ops.args[0] = bpf_skops_write_hdr_opt_arg0(skb, synack_type);
> sock_ops.remaining_opt_len = *remaining;
> /* tcp_current_mss() does not pass a skb */
>
>
> >
> >
> > > +}
> > > +
> > > static int _bpf_setsockopt(struct sock *sk, int level, int optname,
> > > char *optval, int optlen)
> > > {
> > > @@ -5673,7 +5678,12 @@ static const struct bpf_func_proto bpf_sock_addr_getsockopt_proto = {
> > > BPF_CALL_5(bpf_sock_ops_setsockopt, struct bpf_sock_ops_kern *, bpf_sock,
> > > int, level, int, optname, char *, optval, int, optlen)
> > > {
> > > - return _bpf_setsockopt(bpf_sock->sk, level, optname, optval, optlen);
> > > + struct sock *sk = bpf_sock->sk;
> > > +
> > > + if (is_locked_tcp_sock_ops(bpf_sock) && sk_fullsock(sk))
> >
> > afaict, the new timestamping callbacks still can do setsockopt and it is
> > incorrect. It should be:
> >
> > if (!bpf_sock->allow_tcp_access)
> > return -EOPNOTSUPP;
> >
> > I recalled I have asked in v5 but it may be buried in the long thread, so asking
> > here again. Please add test(s) to check that the new timestamping callbacks
> > cannot call setsockopt and read/write to some of the tcp_sock fields through the
> > bpf_sock_ops.
> >
> > > + sock_owned_by_me(sk);
> >
> > Not needed and instead...
>
> Sorry I don't get you here. What I was doing was letting non
> timestamping callbacks be checked by the sock_owned_by_me() function.
>
> If the callback belongs to timestamping, we will skip the check.
>
> >
> > > +
> > > + return __bpf_setsockopt(sk, level, optname, optval, optlen);
> >
> > keep the original _bpf_setsockopt().
>
> Oh, I remembered we've already assumed/agreed the timestamping socket
> must be full sock. I will use it.
Oh, no. We cannot use it because it will WARN if the socket lock is not held:
static int _bpf_setsockopt(struct sock *sk, int level, int optname,
char *optval, int optlen)
{
if (sk_fullsock(sk))
sock_owned_by_me(sk);
return __bpf_setsockopt(sk, level, optname, optval, optlen);
}
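The guard inside _bpf_setsockopt() quoted above can be modeled in plain userspace C; sock_owned_by_me() is replaced here by a boolean stand-in that simply reports whether the lock-ownership assertion would pass (all names below are illustrative, not the kernel's):

```c
#include <stdbool.h>

/* Simplified stand-in for struct sock */
struct sock_model {
	bool fullsock;
	bool owned;	/* whether the caller holds the socket lock */
};

/* Models the guard in _bpf_setsockopt(): the lock-ownership assertion
 * is only made for fullsocks, and would WARN when the lock is not
 * held. Returns whether the assertion would pass. */
static bool lock_assertion_ok(const struct sock_model *sk)
{
	if (sk->fullsock)
		return sk->owned;	/* sock_owned_by_me() would WARN here */
	return true;			/* non-fullsock: no assertion made */
}
```

This is exactly why an unlocked timestamping callback cannot reuse _bpf_setsockopt() unchanged: it reaches the assertion with a fullsock but without the lock.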
Let me rephrase what I know about the TCP and UDP cases:
1) the sockets are fullsocks.
2) the sockets are protected by the socket lock now, but in the
future they might not be.
So we need to check that it's a fullsock, but we don't expect to
trigger any warnings when the socket is not locked.
Am I right about those two?
Thanks,
Jason
>
> >
> > > }
> > >
> > > static const struct bpf_func_proto bpf_sock_ops_setsockopt_proto = {
> > > @@ -5759,6 +5769,7 @@ BPF_CALL_5(bpf_sock_ops_getsockopt, struct bpf_sock_ops_kern *, bpf_sock,
> > > int, level, int, optname, char *, optval, int, optlen)
> > > {
> > > if (IS_ENABLED(CONFIG_INET) && level == SOL_TCP &&
> > > + bpf_sock->sk->sk_protocol == IPPROTO_TCP &&
> > > optname >= TCP_BPF_SYN && optname <= TCP_BPF_SYN_MAC) {
> >
> > No need to allow getsockopt regardless what SOL_* it is asking. To keep it
> > simple, I would just disable both getsockopt and setsockopt for all SOL_* for
>
> Really? I'm shocked because the selftests in this series call
> bpf_sock_ops_getsockopt() and bpf_sock_ops_setsockopt() in patch
> [13/13]:
> ...
> if (bpf_setsockopt(ctx, level, opt, &new, sizeof(new)))
> ...
>
> > the new timestamping callbacks. Nothing is lost, the bpf prog can directly read
> > the sk.
> >
> > > int ret, copy_len = 0;
> > > const u8 *start;
> > > @@ -5800,7 +5811,8 @@ BPF_CALL_2(bpf_sock_ops_cb_flags_set, struct bpf_sock_ops_kern *, bpf_sock,
> > > struct sock *sk = bpf_sock->sk;
> > > int val = argval & BPF_SOCK_OPS_ALL_CB_FLAGS;
> > >
> > > - if (!IS_ENABLED(CONFIG_INET) || !sk_fullsock(sk))
> > > + if (!IS_ENABLED(CONFIG_INET) || !sk_fullsock(sk) ||
> > > + sk->sk_protocol != IPPROTO_TCP)
> >
> > Same here. It should disallow this "set" helper for the timestamping callbacks
> > which do not hold the lock.
> >
> > > return -EINVAL;
> > >
> > > tcp_sk(sk)->bpf_sock_ops_cb_flags = val;
> > > @@ -7609,6 +7621,9 @@ BPF_CALL_4(bpf_sock_ops_load_hdr_opt, struct bpf_sock_ops_kern *, bpf_sock,
> > > u8 search_kind, search_len, copy_len, magic_len;
> > > int ret;
> > >
> > > + if (!is_locked_tcp_sock_ops(bpf_sock))
> > > + return -EOPNOTSUPP;
> >
> > This is correct, just change it to "!bpf_sock->allow_tcp_access".
> >
> > All the above changed helpers should use the same test and the same return handling.
> >
> > > +
> > > /* 2 byte is the minimal option len except TCPOPT_NOP and
> > > * TCPOPT_EOL which are useless for the bpf prog to learn
> > > * and this helper disallow loading them also.
> >
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [RFC PATCH net-next v6 08/13] net-timestamp: support hw SCM_TSTAMP_SND for bpf extension
2025-01-25 1:29 ` Martin KaFai Lau
@ 2025-01-25 1:35 ` Jason Xing
2025-01-25 2:36 ` Martin KaFai Lau
0 siblings, 1 reply; 43+ messages in thread
From: Jason Xing @ 2025-01-25 1:35 UTC (permalink / raw)
To: Martin KaFai Lau
Cc: davem, edumazet, kuba, pabeni, dsahern, willemdebruijn.kernel,
willemb, ast, daniel, andrii, eddyz87, song, yonghong.song,
john.fastabend, kpsingh, sdf, haoluo, jolsa, horms, bpf, netdev
On Sat, Jan 25, 2025 at 9:30 AM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 1/24/25 5:18 PM, Jason Xing wrote:
> >>> @@ -5577,9 +5578,9 @@ static void skb_tstamp_tx_bpf(struct sk_buff *skb, struct sock *sk,
> >>> op = BPF_SOCK_OPS_TS_SCHED_OPT_CB;
> >>> break;
> >>> case SCM_TSTAMP_SND:
> >>> + op = sw ? BPF_SOCK_OPS_TS_SW_OPT_CB : BPF_SOCK_OPS_TS_HW_OPT_CB;
> >>> if (!sw)
> >>> - return;
> >>> - op = BPF_SOCK_OPS_TS_SW_OPT_CB;
> >>> + *skb_hwtstamps(skb) = *hwtstamps;
> >> hwtstamps may still be NULL, no?
> > Right, it can be zero if something wrong happens.
>
> Then it needs a NULL check, no?
My original intention was to pass whatever we have to userspace, so
the bpf program would be aware of what is happening in the kernel.
Passing NULL hwtstamps through is fine and will not cause any
problem, I think.
Do you mean the default value of skb_hwtstamps() is already zero, so
in this case we don't need to re-initialize it?
Like this:
if (hwtstamps)
        *skb_hwtstamps(skb) = *hwtstamps;
But it looks no different actually.
Thanks,
Jason
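The NULL guard being discussed can be modeled in a few lines of plain C (struct names below are simplified stand-ins for the kernel types, only meant to show the guard):

```c
#include <stddef.h>

/* Simplified stand-in for struct skb_shared_hwtstamps */
struct hwtstamps_model {
	long long hwtstamp;
};

/* Models the check Martin asks for: only copy the driver-supplied
 * timestamps into the skb's shared info when they actually exist. */
static void copy_hwtstamps(struct hwtstamps_model *dst,
			   const struct hwtstamps_model *src)
{
	if (src)	/* hwtstamps may be NULL on the SCM_TSTAMP_SND path */
		*dst = *src;
}
```

Without the guard, `*skb_hwtstamps(skb) = *hwtstamps` dereferences a NULL pointer whenever the software-timestamp path passes NULL hwtstamps.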
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [RFC PATCH net-next v6 04/13] bpf: stop UDP sock accessing TCP fields in sock_op BPF CALLs
2025-01-25 1:15 ` Jason Xing
2025-01-25 1:32 ` Jason Xing
@ 2025-01-25 2:25 ` Martin KaFai Lau
2025-01-25 2:58 ` Jason Xing
2025-01-25 3:12 ` Martin KaFai Lau
1 sibling, 2 replies; 43+ messages in thread
From: Martin KaFai Lau @ 2025-01-25 2:25 UTC (permalink / raw)
To: Jason Xing
Cc: davem, edumazet, kuba, pabeni, dsahern, willemdebruijn.kernel,
willemb, ast, daniel, andrii, eddyz87, song, yonghong.song,
john.fastabend, kpsingh, sdf, haoluo, jolsa, horms, bpf, netdev
On 1/24/25 5:15 PM, Jason Xing wrote:
>>> +static bool is_locked_tcp_sock_ops(struct bpf_sock_ops_kern *bpf_sock)
>>> +{
>>> + return bpf_sock->op <= BPF_SOCK_OPS_WRITE_HDR_OPT_CB;
>>
>> More bike shedding...
>>
>> After sleeping on it again, I think it can just test the
>> bpf_sock->allow_tcp_access instead.
>
> Sorry, I don't think it can work for all the cases because:
> 1) please see BPF_SOCK_OPS_WRITE_HDR_OPT_CB/BPF_SOCK_OPS_HDR_OPT_LEN_CB,
> if req exists, there is no allow_tcp_access initialization. Then
> calling some function like bpf_sock_ops_setsockopt will be rejected
> because allow_tcp_access is zero.
> 2) tcp_call_bpf() only sets allow_tcp_access when the socket is
> fullsock. As far as I know, all the callers have a fullsock for
> now, but in the future they might not.
Note that the existing helpers bpf_sock_ops_cb_flags_set and
bpf_sock_ops_{set,get}sockopt themselves do the sk_fullsock() test and then
return -EINVAL. Whether bpf_sock->sk is a fullsock does not matter to these helpers.
You are right on BPF_SOCK_OPS_WRITE_HDR_OPT_CB/BPF_SOCK_OPS_HDR_OPT_LEN_CB,
but the only helper left for which testing allow_tcp_access is not enough is
bpf_sock_ops_load_hdr_opt(). Potentially, it could test "if
(!bpf_sock->allow_tcp_access && !bpf_sock->syn_skb) { return -EOPNOTSUPP; }".
Agree to stay with the current "bpf_sock->op <= BPF_SOCK_OPS_WRITE_HDR_OPT_CB"
as in this patch. It is cleaner.
>>> @@ -5673,7 +5678,12 @@ static const struct bpf_func_proto bpf_sock_addr_getsockopt_proto = {
>>> BPF_CALL_5(bpf_sock_ops_setsockopt, struct bpf_sock_ops_kern *, bpf_sock,
>>> int, level, int, optname, char *, optval, int, optlen)
>>> {
>>> - return _bpf_setsockopt(bpf_sock->sk, level, optname, optval, optlen);
>>> + struct sock *sk = bpf_sock->sk;
>>> +
>>> + if (is_locked_tcp_sock_ops(bpf_sock) && sk_fullsock(sk))
>>
>> afaict, the new timestamping callbacks still can do setsockopt and it is
>> incorrect. It should be:
>>
>> if (!bpf_sock->allow_tcp_access)
>> return -EOPNOTSUPP;
>>
>> I recalled I have asked in v5 but it may be buried in the long thread, so asking
>> here again. Please add test(s) to check that the new timestamping callbacks
>> cannot call setsockopt and read/write to some of the tcp_sock fields through the
>> bpf_sock_ops.
>>
>>> + sock_owned_by_me(sk);
>>
>> Not needed and instead...
>
> Sorry, I don't get you here. What I was doing was letting the
> non-timestamping callbacks be checked by the sock_owned_by_me() function.
>
> If the callback belongs to timestamping, we will skip the check.
It will skip the sock_owned_by_me() test and
continue to do the following __bpf_setsockopt(), which the new timestamping
callback should not do, no?
It should be just this at the very beginning of bpf_sock_ops_setsockopt:
if (!is_locked_tcp_sock_ops(bpf_sock))
return -EOPNOTSUPP;
>
>>
>>> +
>>> + return __bpf_setsockopt(sk, level, optname, optval, optlen);
>>
>> keep the original _bpf_setsockopt().
>
> Oh, I remembered we've already assumed/agreed that the timestamping socket
> must be a fullsock. I will use it.
>
>>
>>> }
>>>
>>> static const struct bpf_func_proto bpf_sock_ops_setsockopt_proto = {
>>> @@ -5759,6 +5769,7 @@ BPF_CALL_5(bpf_sock_ops_getsockopt, struct bpf_sock_ops_kern *, bpf_sock,
>>> int, level, int, optname, char *, optval, int, optlen)
>>> {
>>> if (IS_ENABLED(CONFIG_INET) && level == SOL_TCP &&
>>> + bpf_sock->sk->sk_protocol == IPPROTO_TCP &&
>>> optname >= TCP_BPF_SYN && optname <= TCP_BPF_SYN_MAC) {
>>
>> No need to allow getsockopt regardless of what SOL_* it is asking. To keep it
>> simple, I would just disable both getsockopt and setsockopt for all SOL_* for
>> the new timestamping callbacks.
>
> Really? I'm shocked because the selftests in this series call
> bpf_sock_ops_getsockopt() and bpf_sock_ops_setsockopt() in patch
> [13/13]:
Yes, really. It may be late Friday for me here. Please double check whether
the bpf_set/getsockopt in your test is called from the new timestamp callback
or only from the existing BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB callback.
Note that I am only asking to disable the set/getsockopt,
bpf_sock_ops_cb_flags_set, and the bpf_sock_ops_load_hdr_opt for the new
timestamping callbacks.
* Re: [RFC PATCH net-next v6 08/13] net-timestamp: support hw SCM_TSTAMP_SND for bpf extension
2025-01-25 1:35 ` Jason Xing
@ 2025-01-25 2:36 ` Martin KaFai Lau
2025-01-25 2:59 ` Jason Xing
0 siblings, 1 reply; 43+ messages in thread
From: Martin KaFai Lau @ 2025-01-25 2:36 UTC (permalink / raw)
To: Jason Xing
Cc: davem, edumazet, kuba, pabeni, dsahern, willemdebruijn.kernel,
willemb, ast, daniel, andrii, eddyz87, song, yonghong.song,
john.fastabend, kpsingh, sdf, haoluo, jolsa, horms, bpf, netdev
On 1/24/25 5:35 PM, Jason Xing wrote:
> On Sat, Jan 25, 2025 at 9:30 AM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>>
>> On 1/24/25 5:18 PM, Jason Xing wrote:
>>>>> @@ -5577,9 +5578,9 @@ static void skb_tstamp_tx_bpf(struct sk_buff *skb, struct sock *sk,
>>>>> op = BPF_SOCK_OPS_TS_SCHED_OPT_CB;
>>>>> break;
>>>>> case SCM_TSTAMP_SND:
>>>>> + op = sw ? BPF_SOCK_OPS_TS_SW_OPT_CB : BPF_SOCK_OPS_TS_HW_OPT_CB;
>>>>> if (!sw)
>>>>> - return;
>>>>> - op = BPF_SOCK_OPS_TS_SW_OPT_CB;
>>>>> + *skb_hwtstamps(skb) = *hwtstamps;
>>>> hwtstamps may still be NULL, no?
>>> Right, it can be NULL if something goes wrong.
>>
>> Then it needs a NULL check, no?
>
> My original intention was to pass whatever we have to userspace, so the bpf
> program will be aware of what is happening in the kernel.
This is fine.
> Passing a NULL hwtstamps along is fine and will not cause any problem, I think.
>
> Do you mean the default value of hwtstamps itself is NULL, so in this
> case we don't need to re-init it to NULL again?
>
> Like this:
> If (*hwtstamps)
if (hwtstamps) instead ?
I don't know. If hwtstamps is NULL, doing *hwtstamps will be bad and oops....
Maybe my brain doesn't work well at the end of Friday. Please check.
> *skb_hwtstamps(skb) = *hwtstamps;
>
> But it looks no different actually.
>
> Thanks,
> Jason
* Re: [RFC PATCH net-next v6 04/13] bpf: stop UDP sock accessing TCP fields in sock_op BPF CALLs
2025-01-25 2:25 ` Martin KaFai Lau
@ 2025-01-25 2:58 ` Jason Xing
2025-01-25 3:12 ` Martin KaFai Lau
1 sibling, 0 replies; 43+ messages in thread
From: Jason Xing @ 2025-01-25 2:58 UTC (permalink / raw)
To: Martin KaFai Lau
Cc: davem, edumazet, kuba, pabeni, dsahern, willemdebruijn.kernel,
willemb, ast, daniel, andrii, eddyz87, song, yonghong.song,
john.fastabend, kpsingh, sdf, haoluo, jolsa, horms, bpf, netdev
On Sat, Jan 25, 2025 at 10:26 AM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 1/24/25 5:15 PM, Jason Xing wrote:
> >>> +static bool is_locked_tcp_sock_ops(struct bpf_sock_ops_kern *bpf_sock)
> >>> +{
> >>> + return bpf_sock->op <= BPF_SOCK_OPS_WRITE_HDR_OPT_CB;
> >>
> >> More bike shedding...
> >>
> >> After sleeping on it again, I think it can just test the
> >> bpf_sock->allow_tcp_access instead.
> >
> > Sorry, I don't think it can work for all the cases because:
> > 1) please see BPF_SOCK_OPS_WRITE_HDR_OPT_CB/BPF_SOCK_OPS_HDR_OPT_LEN_CB,
> > if req exists, there is no allow_tcp_access initialization. Then
> > calling some function like bpf_sock_ops_setsockopt will be rejected
> > because allow_tcp_access is zero.
> > 2) tcp_call_bpf() only sets allow_tcp_access when the socket is
> > fullsock. As far as I know, all the callers have a fullsock for
> > now, but in the future they might not.
>
> Note that the existing helpers bpf_sock_ops_cb_flags_set and
> bpf_sock_ops_{set,get}sockopt themselves do the sk_fullsock() test and then
> return -EINVAL. Whether bpf_sock->sk is a fullsock does not matter to these helpers.
>
> You are right on BPF_SOCK_OPS_WRITE_HDR_OPT_CB/BPF_SOCK_OPS_HDR_OPT_LEN_CB,
> but the only helper left for which testing allow_tcp_access is not enough is
> bpf_sock_ops_load_hdr_opt(). Potentially, it could test "if
> (!bpf_sock->allow_tcp_access && !bpf_sock->syn_skb) { return -EOPNOTSUPP; }".
>
> Agree to stay with the current "bpf_sock->op <= BPF_SOCK_OPS_WRITE_HDR_OPT_CB"
> as in this patch. It is cleaner.
>
> >>> @@ -5673,7 +5678,12 @@ static const struct bpf_func_proto bpf_sock_addr_getsockopt_proto = {
> >>> BPF_CALL_5(bpf_sock_ops_setsockopt, struct bpf_sock_ops_kern *, bpf_sock,
> >>> int, level, int, optname, char *, optval, int, optlen)
> >>> {
> >>> - return _bpf_setsockopt(bpf_sock->sk, level, optname, optval, optlen);
> >>> + struct sock *sk = bpf_sock->sk;
> >>> +
> >>> + if (is_locked_tcp_sock_ops(bpf_sock) && sk_fullsock(sk))
> >>
> >> afaict, the new timestamping callbacks still can do setsockopt and it is
> >> incorrect. It should be:
> >>
> >> if (!bpf_sock->allow_tcp_access)
> >> return -EOPNOTSUPP;
> >>
> >> I recalled I have asked in v5 but it may be buried in the long thread, so asking
> >> here again. Please add test(s) to check that the new timestamping callbacks
> >> cannot call setsockopt and read/write to some of the tcp_sock fields through the
> >> bpf_sock_ops.
> >>
> >>> + sock_owned_by_me(sk);
> >>
> >> Not needed and instead...
> >
> > Sorry, I don't get you here. What I was doing was letting the
> > non-timestamping callbacks be checked by the sock_owned_by_me() function.
> >
> > If the callback belongs to timestamping, we will skip the check.
>
> It will skip the sock_owned_by_me() test and
> continue to do the following __bpf_setsockopt(), which the new timestamping
> callback should not do, no?
>
> It should be just this at the very beginning of bpf_sock_ops_setsockopt:
>
> if (!is_locked_tcp_sock_ops(bpf_sock))
> return -EOPNOTSUPP;
> >
> >>
> >>> +
> >>> + return __bpf_setsockopt(sk, level, optname, optval, optlen);
> >>
> >> keep the original _bpf_setsockopt().
> >
> > Oh, I remembered we've already assumed/agreed that the timestamping socket
> > must be a fullsock. I will use it.
> >
> >>
> >>> }
> >>>
> >>> static const struct bpf_func_proto bpf_sock_ops_setsockopt_proto = {
> >>> @@ -5759,6 +5769,7 @@ BPF_CALL_5(bpf_sock_ops_getsockopt, struct bpf_sock_ops_kern *, bpf_sock,
> >>> int, level, int, optname, char *, optval, int, optlen)
> >>> {
> >>> if (IS_ENABLED(CONFIG_INET) && level == SOL_TCP &&
> >>> + bpf_sock->sk->sk_protocol == IPPROTO_TCP &&
> >>> optname >= TCP_BPF_SYN && optname <= TCP_BPF_SYN_MAC) {
> >>
> >> No need to allow getsockopt regardless of what SOL_* it is asking. To keep it
> >> simple, I would just disable both getsockopt and setsockopt for all SOL_* for
> >> the new timestamping callbacks.
> >
> > Really? I'm shocked because the selftests in this series call
> > bpf_sock_ops_getsockopt() and bpf_sock_ops_setsockopt() in patch
> > [13/13]:
>
> Yes, really. It may be late Friday for me here. Please double check whether
> the bpf_set/getsockopt in your test is called from the new timestamp callback
> or only from the existing BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB callback.
Oh, after reading the above a few times, I finally realized you're
right. I'll do it. Thanks!!
>
> Note that I am only asking to disable the set/getsockopt,
> bpf_sock_ops_cb_flags_set, and the bpf_sock_ops_load_hdr_opt for the new
> timestamping callbacks.
Right, right!
Thanks,
Jason
* Re: [RFC PATCH net-next v6 08/13] net-timestamp: support hw SCM_TSTAMP_SND for bpf extension
2025-01-25 2:36 ` Martin KaFai Lau
@ 2025-01-25 2:59 ` Jason Xing
0 siblings, 0 replies; 43+ messages in thread
From: Jason Xing @ 2025-01-25 2:59 UTC (permalink / raw)
To: Martin KaFai Lau
Cc: davem, edumazet, kuba, pabeni, dsahern, willemdebruijn.kernel,
willemb, ast, daniel, andrii, eddyz87, song, yonghong.song,
john.fastabend, kpsingh, sdf, haoluo, jolsa, horms, bpf, netdev
On Sat, Jan 25, 2025 at 10:37 AM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 1/24/25 5:35 PM, Jason Xing wrote:
> > On Sat, Jan 25, 2025 at 9:30 AM Martin KaFai Lau <martin.lau@linux.dev> wrote:
> >>
> >> On 1/24/25 5:18 PM, Jason Xing wrote:
> >>>>> @@ -5577,9 +5578,9 @@ static void skb_tstamp_tx_bpf(struct sk_buff *skb, struct sock *sk,
> >>>>> op = BPF_SOCK_OPS_TS_SCHED_OPT_CB;
> >>>>> break;
> >>>>> case SCM_TSTAMP_SND:
> >>>>> + op = sw ? BPF_SOCK_OPS_TS_SW_OPT_CB : BPF_SOCK_OPS_TS_HW_OPT_CB;
> >>>>> if (!sw)
> >>>>> - return;
> >>>>> - op = BPF_SOCK_OPS_TS_SW_OPT_CB;
> >>>>> + *skb_hwtstamps(skb) = *hwtstamps;
> >>>> hwtstamps may still be NULL, no?
> >>> Right, it can be NULL if something goes wrong.
> >>
> >> Then it needs a NULL check, no?
> >
> > My original intention was to pass whatever we have to userspace, so the bpf
> > program will be aware of what is happening in the kernel.
>
> This is fine.
>
> > Passing a NULL hwtstamps along is fine and will not cause any problem, I think.
> >
> > Do you mean the default value of hwtstamps itself is NULL, so in this
> > case we don't need to re-init it to NULL again?
> >
> > Like this:
> > If (*hwtstamps)
> if (hwtstamps) instead ?
>
> I don't know. If hwtstamps is NULL, doing *hwtstamps will be bad and oops....
> Maybe my brain doesn't work well at the end of Friday. Please check.
Thanks for your effort, Martin!
I will deliberately inject this error case and then see what will happen.
Thanks,
Jason
* Re: [RFC PATCH net-next v6 13/13] bpf: add simple bpf tests in the tx path for so_timestamping feature
2025-01-21 1:29 ` [RFC PATCH net-next v6 13/13] bpf: add simple bpf tests in the tx path for so_timestamping feature Jason Xing
@ 2025-01-25 3:07 ` Martin KaFai Lau
2025-01-25 3:42 ` Jason Xing
0 siblings, 1 reply; 43+ messages in thread
From: Martin KaFai Lau @ 2025-01-25 3:07 UTC (permalink / raw)
To: Jason Xing
Cc: davem, edumazet, kuba, pabeni, dsahern, willemdebruijn.kernel,
willemb, ast, daniel, andrii, eddyz87, song, yonghong.song,
john.fastabend, kpsingh, sdf, haoluo, jolsa, horms, bpf, netdev
On 1/20/25 5:29 PM, Jason Xing wrote:
> Only check whether we pass those three key points after we enable the
> bpf extension for so_timestamping. At each point, we can choose
> whether to print the current timestamp.
>
> Signed-off-by: Jason Xing <kerneljasonxing@gmail.com>
> ---
> .../bpf/prog_tests/so_timestamping.c | 98 ++++++++
> .../selftests/bpf/progs/so_timestamping.c | 227 ++++++++++++++++++
> 2 files changed, 325 insertions(+)
> create mode 100644 tools/testing/selftests/bpf/prog_tests/so_timestamping.c
> create mode 100644 tools/testing/selftests/bpf/progs/so_timestamping.c
>
> diff --git a/tools/testing/selftests/bpf/prog_tests/so_timestamping.c b/tools/testing/selftests/bpf/prog_tests/so_timestamping.c
> new file mode 100644
> index 000000000000..bbfa7eb38cfb
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/prog_tests/so_timestamping.c
> @@ -0,0 +1,98 @@
> +#define _GNU_SOURCE
> +#include <sched.h>
> +#include <linux/socket.h>
> +#include <linux/tls.h>
> +#include <net/if.h>
> +
> +#include "test_progs.h"
> +#include "cgroup_helpers.h"
> +#include "network_helpers.h"
> +
> +#include "so_timestamping.skel.h"
> +
> +#define CG_NAME "/so-timestamping-test"
> +
> +static const char addr4_str[] = "127.0.0.1";
> +static const char addr6_str[] = "::1";
> +static struct so_timestamping *skel;
> +static int cg_fd;
> +
> +static int create_netns(void)
Reuse the netns_new("so_timestamping_ns", true) from test_progs.c.
> +{
> + if (!ASSERT_OK(unshare(CLONE_NEWNET), "create netns"))
> + return -1;
> +
> + if (!ASSERT_OK(system("ip link set dev lo up"), "set lo up"))
> + return -1;
> +
> + return 0;
> +}
> +
> +static void test_tcp(int family)
> +{
> + struct so_timestamping__bss *bss = skel->bss;
> + char buf[] = "testing testing";
> + int sfd = -1, cfd = -1;
> + int n;
> +
> + memset(bss, 0, sizeof(*bss));
> +
> + sfd = start_server(family, SOCK_STREAM,
> + family == AF_INET6 ? addr6_str : addr4_str, 0, 0);
> + if (!ASSERT_GE(sfd, 0, "start_server"))
nit. ASSERT_OK_FD.
> + goto out;
> +
> + cfd = connect_to_fd(sfd, 0);
> + if (!ASSERT_GE(cfd, 0, "connect_to_fd_server")) {
Same here. ASSERT_OK_FD.
> + close(sfd);
This close is unnecessary. It will cause a double close at "out:" also.
> + goto out;
> + }
> +
> + n = write(cfd, buf, sizeof(buf));
> + if (!ASSERT_EQ(n, sizeof(buf), "send to server"))
> + goto out;
> +
> + ASSERT_EQ(bss->nr_active, 1, "nr_active");
> + ASSERT_EQ(bss->nr_snd, 2, "nr_snd");
> + ASSERT_EQ(bss->nr_sched, 1, "nr_sched");
> + ASSERT_EQ(bss->nr_txsw, 1, "nr_txsw");
> + ASSERT_EQ(bss->nr_ack, 1, "nr_ack");
> +
> +out:
> + if (sfd >= 0)
> + close(sfd);
> + if (cfd >= 0)
> + close(cfd);
> +}
> +
> +void test_so_timestamping(void)
> +{
> + cg_fd = test__join_cgroup(CG_NAME);
> + if (cg_fd < 0)
> + return;
> +
> + if (create_netns())
> + goto done;
> +
> + skel = so_timestamping__open();
nit. so_timestamping__open_and_load()
> + if (!ASSERT_OK_PTR(skel, "open skel"))
> + goto done;
> +
> + if (!ASSERT_OK(so_timestamping__load(skel), "load skel"))
Then this __load() is not needed.
> + goto done;
> +
> + if (!ASSERT_OK(so_timestamping__attach(skel), "attach skel"))
> + goto done;
> +
> + skel->links.skops_sockopt =
> + bpf_program__attach_cgroup(skel->progs.skops_sockopt, cg_fd);
> + if (!ASSERT_OK_PTR(skel->links.skops_sockopt, "attach cgroup"))
> + goto done;
> +
> + test_tcp(AF_INET6);
> + test_tcp(AF_INET);
> +
> +done:
> + so_timestamping__destroy(skel);
> + close(cg_fd);
> +}
> diff --git a/tools/testing/selftests/bpf/progs/so_timestamping.c b/tools/testing/selftests/bpf/progs/so_timestamping.c
> new file mode 100644
> index 000000000000..f4708e84c243
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/so_timestamping.c
> @@ -0,0 +1,227 @@
> +#include "vmlinux.h"
> +#include "bpf_tracing_net.h"
> +#include <bpf/bpf_core_read.h>
> +#include <bpf/bpf_helpers.h>
> +#include <bpf/bpf_tracing.h>
> +//#include <bpf/bpf_core_read.h>
> +#include "bpf_misc.h"
> +#include "bpf_kfuncs.h"
> +
> +#define SK_BPF_CB_FLAGS 1009
> +#define SK_BPF_CB_TX_TIMESTAMPING 1
> +
> +int nr_active;
> +int nr_snd;
> +int nr_passive;
> +int nr_sched;
> +int nr_txsw;
> +int nr_ack;
> +
> +struct sockopt_test {
> + int opt;
> + int new;
> +};
> +
> +static const struct sockopt_test sol_socket_tests[] = {
> + { .opt = SK_BPF_CB_FLAGS, .new = SK_BPF_CB_TX_TIMESTAMPING, },
> + { .opt = 0, },
> +};
> +
> +struct loop_ctx {
> + void *ctx;
> + const struct sock *sk;
> +};
> +
> +struct sk_stg {
> + __u64 sendmsg_ns; /* record ts when sendmsg is called */
> +};
> +
> +struct {
> + __uint(type, BPF_MAP_TYPE_SK_STORAGE);
> + __uint(map_flags, BPF_F_NO_PREALLOC);
> + __type(key, int);
> + __type(value, struct sk_stg);
> +} sk_stg_map SEC(".maps");
> +
> +
> +struct delay_info {
> + u64 sendmsg_ns; /* record ts when sendmsg is called */
> + u32 sched_delay; /* SCHED_OPT_CB - sendmsg_ns */
> + u32 sw_snd_delay; /* SW_OPT_CB - SCHED_OPT_CB */
> + u32 ack_delay; /* ACK_OPT_CB - SW_OPT_CB */
> +};
> +
> +struct {
> + __uint(type, BPF_MAP_TYPE_HASH);
> + __type(key, u32);
> + __type(value, struct delay_info);
> + __uint(max_entries, 1024);
> +} time_map SEC(".maps");
> +
> +static u64 delay_tolerance_nsec = 1000000000; /* 1 second as an example */
> +
> +static int bpf_test_sockopt_int(void *ctx, const struct sock *sk,
> + const struct sockopt_test *t,
> + int level)
> +{
> + int new, opt, tmp;
> +
> + opt = t->opt;
> + new = t->new;
> +
> + if (bpf_setsockopt(ctx, level, opt, &new, sizeof(new)))
> + return 1;
> +
> + if (bpf_getsockopt(ctx, level, opt, &tmp, sizeof(tmp)) ||
> + tmp != new) {
> + return 1;
> + }
> +
> + return 0;
> +}
> +
> +static int bpf_test_socket_sockopt(__u32 i, struct loop_ctx *lc)
> +{
> + const struct sockopt_test *t;
> +
> + if (i >= ARRAY_SIZE(sol_socket_tests))
> + return 1;
> +
> + t = &sol_socket_tests[i];
> + if (!t->opt)
> + return 1;
> +
> + return bpf_test_sockopt_int(lc->ctx, lc->sk, t, SOL_SOCKET);
> +}
> +
> +static int bpf_test_sockopt(void *ctx, const struct sock *sk)
> +{
> + struct loop_ctx lc = { .ctx = ctx, .sk = sk, };
> + int n;
> +
> + n = bpf_loop(ARRAY_SIZE(sol_socket_tests), bpf_test_socket_sockopt, &lc, 0);
> + if (n != ARRAY_SIZE(sol_socket_tests))
> + return -1;
> +
> + return 0;
> +}
> +
> +static bool bpf_test_delay(struct bpf_sock_ops *skops, const struct sock *sk)
> +{
> + struct bpf_sock_ops_kern *skops_kern;
> + u64 timestamp = bpf_ktime_get_ns();
> + struct skb_shared_info *shinfo;
> + struct delay_info dinfo = {0};
> + struct delay_info *val;
> + struct sk_buff *skb;
> + struct sk_stg *stg;
> + u32 delay, tskey;
> + u64 prior_ts;
> +
> + skops_kern = bpf_cast_to_kern_ctx(skops);
> + skb = skops_kern->skb;
> + shinfo = bpf_core_cast(skb->head + skb->end, struct skb_shared_info);
> + tskey = shinfo->tskey;
> + if (!tskey)
> + return false;
> +
> + if (skops->op == BPF_SOCK_OPS_TS_TCP_SND_CB) {
> + stg = bpf_sk_storage_get(&sk_stg_map, (void *)sk, 0, 0);
> + if (!stg)
> + return false;
> + dinfo.sendmsg_ns = stg->sendmsg_ns;
> + val = &dinfo;
Move the map_update here instead.
bpf_map_update_elem(&time_map, &tskey, val, BPF_ANY);
> + goto out;
> + }
> +
> + val = bpf_map_lookup_elem(&time_map, &tskey);
> + if (!val)
> + return false;
> +
> + switch (skops->op) {
> + case BPF_SOCK_OPS_TS_SCHED_OPT_CB:
> + delay = val->sched_delay = timestamp - val->sendmsg_ns;
> + break;
> + case BPF_SOCK_OPS_TS_SW_OPT_CB:
> + prior_ts = val->sched_delay + val->sendmsg_ns;
> + delay = val->sw_snd_delay = timestamp - prior_ts;
> + break;
> + case BPF_SOCK_OPS_TS_ACK_OPT_CB:
> + prior_ts = val->sw_snd_delay + val->sched_delay + val->sendmsg_ns;
> + delay = val->ack_delay = timestamp - prior_ts;
> + break;
> + }
> +
> + if (delay <= 0 || delay >= delay_tolerance_nsec)
Regarding the delay <= 0 check, note that delay was defined as u32.
delay_tolerance_nsec is 1 sec, which could be too short for the bpf CI. Maybe
raise it to something like 10s and only check "if (delay >= delay_tolerance_nsec)". It
will be useful to bump a nr_long_delay++ also and ASSERT on it in userspace.
btw, it is in nsec, is u32 enough?
> + return false;
> +
> + /* Since it's the last one, remove from the map after latency check */
> + if (skops->op == BPF_SOCK_OPS_TS_ACK_OPT_CB) {
> + bpf_map_delete_elem(&time_map, &tskey);
> + return true;
> + }
> +
> +out:
> + bpf_map_update_elem(&time_map, &tskey, val, BPF_ANY);
Then there is no need to do the update_elem here for the other ops.
Overall, I think the set looks good. Only a few things left. Thanks for
revamping the test also. The test should be pretty close to how it will be used.
Please add tests to ensure the new timestamping callbacks cannot use the helpers
that we discussed in the earlier patch and also cannot directly read/write the
sock fields through the bpf_sock_ops.
Please also add some details on what the UDP BPF_SOCK_OPS_TS_TCP_SND_CB (or to be
renamed to BPF_SOCK_OPS_TS_SND_CB ?) will look like. It is the only callback
that I don't yet have a clear idea about for UDP.
Please tag the set for bpf-next. Then the bpf CI can pick it up automatically and
continue testing it whenever other bpf patches land.
[ I will reply other followup later ]
> + return true;
> +}
> +
> +SEC("fentry/tcp_sendmsg_locked")
> +int BPF_PROG(trace_tcp_sendmsg_locked, struct sock *sk, struct msghdr *msg, size_t size)
> +{
> + u64 timestamp = bpf_ktime_get_ns();
> + u32 flag = sk->sk_bpf_cb_flags;
> + struct sk_stg *stg;
> +
> + if (!flag)
> + return 0;
> +
> + stg = bpf_sk_storage_get(&sk_stg_map, sk, 0,
> + BPF_SK_STORAGE_GET_F_CREATE);
> + if (!stg)
> + return 0;
> +
> + stg->sendmsg_ns = timestamp;
> + nr_snd += 1;
> + return 0;
> +}
> +
> +SEC("sockops")
> +int skops_sockopt(struct bpf_sock_ops *skops)
> +{
> + struct bpf_sock *bpf_sk = skops->sk;
> + const struct sock *sk;
> +
> + if (!bpf_sk)
> + return 1;
> +
> + sk = (struct sock *)bpf_skc_to_tcp_sock(bpf_sk);
> + if (!sk)
> + return 1;
> +
> + switch (skops->op) {
> + case BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB:
> + nr_active += !bpf_test_sockopt(skops, sk);
> + break;
> + case BPF_SOCK_OPS_TS_TCP_SND_CB:
> + if (bpf_test_delay(skops, sk))
> + nr_snd += 1;
> + break;
> + case BPF_SOCK_OPS_TS_SCHED_OPT_CB:
> + if (bpf_test_delay(skops, sk))
> + nr_sched += 1;
> + break;
> + case BPF_SOCK_OPS_TS_SW_OPT_CB:
> + if (bpf_test_delay(skops, sk))
> + nr_txsw += 1;
> + break;
> + case BPF_SOCK_OPS_TS_ACK_OPT_CB:
> + if (bpf_test_delay(skops, sk))
> + nr_ack += 1;
> + break;
> + }
> +
> + return 1;
> +}
> +
> +char _license[] SEC("license") = "GPL";
* Re: [RFC PATCH net-next v6 04/13] bpf: stop UDP sock accessing TCP fields in sock_op BPF CALLs
2025-01-25 2:25 ` Martin KaFai Lau
2025-01-25 2:58 ` Jason Xing
@ 2025-01-25 3:12 ` Martin KaFai Lau
2025-01-25 3:43 ` Jason Xing
1 sibling, 1 reply; 43+ messages in thread
From: Martin KaFai Lau @ 2025-01-25 3:12 UTC (permalink / raw)
To: Jason Xing
Cc: davem, edumazet, kuba, pabeni, dsahern, willemdebruijn.kernel,
willemb, ast, daniel, andrii, eddyz87, song, yonghong.song,
john.fastabend, kpsingh, sdf, haoluo, jolsa, horms, bpf, netdev
On 1/24/25 6:25 PM, Martin KaFai Lau wrote:
>>
>> Sorry, I don't think it can work for all the cases because:
>> 1) please see BPF_SOCK_OPS_WRITE_HDR_OPT_CB/BPF_SOCK_OPS_HDR_OPT_LEN_CB,
>> if req exists, there is no allow_tcp_access initialization. Then
>> calling some function like bpf_sock_ops_setsockopt will be rejected
>> because allow_tcp_access is zero.
>> 2) tcp_call_bpf() only sets allow_tcp_access when the socket is
>> fullsock. As far as I know, all the callers have a fullsock for
>> now, but in the future they might not.
>
> Note that the existing helpers bpf_sock_ops_cb_flags_set and
> bpf_sock_ops_{set,get}sockopt themselves do the sk_fullsock() test and then
> return -EINVAL. Whether bpf_sock->sk is a fullsock does not matter to these helpers.
>
> You are right on BPF_SOCK_OPS_WRITE_HDR_OPT_CB/BPF_SOCK_OPS_HDR_OPT_LEN_CB,
> but the only helper left for which testing allow_tcp_access is not enough is
> bpf_sock_ops_load_hdr_opt(). Potentially, it could test "if
> (!bpf_sock->allow_tcp_access && !bpf_sock->syn_skb) { return -EOPNOTSUPP; }".
>
> Agree to stay with the current "bpf_sock->op <= BPF_SOCK_OPS_WRITE_HDR_OPT_CB"
> as in this patch. It is cleaner.
Also ignore my earlier comment on merging patch 3 and 4. Better to keep patch 4 on
its own since it is not reusing the allow_tcp_access test. Instead, stay with
the "bpf_sock->op <= BPF_SOCK_OPS_WRITE_HDR_OPT_CB" test.
* Re: [RFC PATCH net-next v6 13/13] bpf: add simple bpf tests in the tx path for so_timestamping feature
2025-01-25 3:07 ` Martin KaFai Lau
@ 2025-01-25 3:42 ` Jason Xing
2025-01-27 23:49 ` Martin KaFai Lau
0 siblings, 1 reply; 43+ messages in thread
From: Jason Xing @ 2025-01-25 3:42 UTC (permalink / raw)
To: Martin KaFai Lau
Cc: davem, edumazet, kuba, pabeni, dsahern, willemdebruijn.kernel,
willemb, ast, daniel, andrii, eddyz87, song, yonghong.song,
john.fastabend, kpsingh, sdf, haoluo, jolsa, horms, bpf, netdev
On Sat, Jan 25, 2025 at 11:08 AM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 1/20/25 5:29 PM, Jason Xing wrote:
> > Only check whether we pass those three key points after we enable the
> > bpf extension for so_timestamping. At each point, we can choose
> > whether to print the current timestamp.
> >
> > Signed-off-by: Jason Xing <kerneljasonxing@gmail.com>
> > ---
> > .../bpf/prog_tests/so_timestamping.c | 98 ++++++++
> > .../selftests/bpf/progs/so_timestamping.c | 227 ++++++++++++++++++
> > 2 files changed, 325 insertions(+)
> > create mode 100644 tools/testing/selftests/bpf/prog_tests/so_timestamping.c
> > create mode 100644 tools/testing/selftests/bpf/progs/so_timestamping.c
> >
> > diff --git a/tools/testing/selftests/bpf/prog_tests/so_timestamping.c b/tools/testing/selftests/bpf/prog_tests/so_timestamping.c
> > new file mode 100644
> > index 000000000000..bbfa7eb38cfb
> > --- /dev/null
> > +++ b/tools/testing/selftests/bpf/prog_tests/so_timestamping.c
> > @@ -0,0 +1,98 @@
> > +#define _GNU_SOURCE
> > +#include <sched.h>
> > +#include <linux/socket.h>
> > +#include <linux/tls.h>
> > +#include <net/if.h>
> > +
> > +#include "test_progs.h"
> > +#include "cgroup_helpers.h"
> > +#include "network_helpers.h"
> > +
> > +#include "so_timestamping.skel.h"
> > +
> > +#define CG_NAME "/so-timestamping-test"
> > +
> > +static const char addr4_str[] = "127.0.0.1";
> > +static const char addr6_str[] = "::1";
> > +static struct so_timestamping *skel;
> > +static int cg_fd;
> > +
> > +static int create_netns(void)
>
> Reuse the netns_new("so_timestamping_ns", true) from test_progs.c.
>
> > +{
> > + if (!ASSERT_OK(unshare(CLONE_NEWNET), "create netns"))
> > + return -1;
> > +
> > + if (!ASSERT_OK(system("ip link set dev lo up"), "set lo up"))
> > + return -1;
> > +
> > + return 0;
> > +}
> > +
> > +static void test_tcp(int family)
> > +{
> > + struct so_timestamping__bss *bss = skel->bss;
> > + char buf[] = "testing testing";
> > + int sfd = -1, cfd = -1;
> > + int n;
> > +
> > + memset(bss, 0, sizeof(*bss));
> > +
> > + sfd = start_server(family, SOCK_STREAM,
> > + family == AF_INET6 ? addr6_str : addr4_str, 0, 0);
> > + if (!ASSERT_GE(sfd, 0, "start_server"))
>
> nit. ASSERT_OK_FD.
>
> > + goto out;
> > +
> > + cfd = connect_to_fd(sfd, 0);
> > + if (!ASSERT_GE(cfd, 0, "connect_to_fd_server")) {
>
> Same here. ASSERT_OK_FD.
>
> > + close(sfd);
>
> This close is unnecessary. It will cause a double close at "out:" also.
>
> > + goto out;
> > + }
> > +
> > + n = write(cfd, buf, sizeof(buf));
> > + if (!ASSERT_EQ(n, sizeof(buf), "send to server"))
> > + goto out;
> > +
> > + ASSERT_EQ(bss->nr_active, 1, "nr_active");
> > + ASSERT_EQ(bss->nr_snd, 2, "nr_snd");
> > + ASSERT_EQ(bss->nr_sched, 1, "nr_sched");
> > + ASSERT_EQ(bss->nr_txsw, 1, "nr_txsw");
> > + ASSERT_EQ(bss->nr_ack, 1, "nr_ack");
> > +
> > +out:
> > + if (sfd >= 0)
> > + close(sfd);
> > + if (cfd >= 0)
> > + close(cfd);
> > +}
> > +
> > +void test_so_timestamping(void)
> > +{
> > + cg_fd = test__join_cgroup(CG_NAME);
> > + if (cg_fd < 0)
> > + return;
> > +
> > + if (create_netns())
> > + goto done;
> > +
> > + skel = so_timestamping__open();
>
> nit. so_timestamping__open_and_load()
>
> > + if (!ASSERT_OK_PTR(skel, "open skel"))
> > + goto done;
> > +
> > + if (!ASSERT_OK(so_timestamping__load(skel), "load skel"))
>
> Then this __load() is not needed.
>
> > + goto done;
> > +
> > + if (!ASSERT_OK(so_timestamping__attach(skel), "attach skel"))
> > + goto done;
> > +
> > + skel->links.skops_sockopt =
> > + bpf_program__attach_cgroup(skel->progs.skops_sockopt, cg_fd);
> > + if (!ASSERT_OK_PTR(skel->links.skops_sockopt, "attach cgroup"))
> > + goto done;
> > +
> > + test_tcp(AF_INET6);
> > + test_tcp(AF_INET);
> > +
> > +done:
> > + so_timestamping__destroy(skel);
> > + close(cg_fd);
> > +}
> > diff --git a/tools/testing/selftests/bpf/progs/so_timestamping.c b/tools/testing/selftests/bpf/progs/so_timestamping.c
> > new file mode 100644
> > index 000000000000..f4708e84c243
> > --- /dev/null
> > +++ b/tools/testing/selftests/bpf/progs/so_timestamping.c
> > @@ -0,0 +1,227 @@
> > +#include "vmlinux.h"
> > +#include "bpf_tracing_net.h"
> > +#include <bpf/bpf_core_read.h>
> > +#include <bpf/bpf_helpers.h>
> > +#include <bpf/bpf_tracing.h>
> > +//#include <bpf/bpf_core_read.h>
> > +#include "bpf_misc.h"
> > +#include "bpf_kfuncs.h"
> > +
> > +#define SK_BPF_CB_FLAGS 1009
> > +#define SK_BPF_CB_TX_TIMESTAMPING 1
> > +
> > +int nr_active;
> > +int nr_snd;
> > +int nr_passive;
> > +int nr_sched;
> > +int nr_txsw;
> > +int nr_ack;
> > +
> > +struct sockopt_test {
> > + int opt;
> > + int new;
> > +};
> > +
> > +static const struct sockopt_test sol_socket_tests[] = {
> > + { .opt = SK_BPF_CB_FLAGS, .new = SK_BPF_CB_TX_TIMESTAMPING, },
> > + { .opt = 0, },
> > +};
> > +
> > +struct loop_ctx {
> > + void *ctx;
> > + const struct sock *sk;
> > +};
> > +
> > +struct sk_stg {
> > + __u64 sendmsg_ns; /* record ts when sendmsg is called */
> > +};
> > +
> > +struct {
> > + __uint(type, BPF_MAP_TYPE_SK_STORAGE);
> > + __uint(map_flags, BPF_F_NO_PREALLOC);
> > + __type(key, int);
> > + __type(value, struct sk_stg);
> > +} sk_stg_map SEC(".maps");
> > +
> > +
> > +struct delay_info {
> > + u64 sendmsg_ns; /* record ts when sendmsg is called */
> > + u32 sched_delay; /* SCHED_OPT_CB - sendmsg_ns */
> > + u32 sw_snd_delay; /* SW_OPT_CB - SCHED_OPT_CB */
> > + u32 ack_delay; /* ACK_OPT_CB - SW_OPT_CB */
> > +};
> > +
> > +struct {
> > + __uint(type, BPF_MAP_TYPE_HASH);
> > + __type(key, u32);
> > + __type(value, struct delay_info);
> > + __uint(max_entries, 1024);
> > +} time_map SEC(".maps");
> > +
> > +static u64 delay_tolerance_nsec = 1000000000; /* 1 second as an example */
> > +
> > +static int bpf_test_sockopt_int(void *ctx, const struct sock *sk,
> > + const struct sockopt_test *t,
> > + int level)
> > +{
> > + int new, opt, tmp;
> > +
> > + opt = t->opt;
> > + new = t->new;
> > +
> > + if (bpf_setsockopt(ctx, level, opt, &new, sizeof(new)))
> > + return 1;
> > +
> > + if (bpf_getsockopt(ctx, level, opt, &tmp, sizeof(tmp)) ||
> > + tmp != new) {
> > + return 1;
> > + }
> > +
> > + return 0;
> > +}
> > +
> > +static int bpf_test_socket_sockopt(__u32 i, struct loop_ctx *lc)
> > +{
> > + const struct sockopt_test *t;
> > +
> > + if (i >= ARRAY_SIZE(sol_socket_tests))
> > + return 1;
> > +
> > + t = &sol_socket_tests[i];
> > + if (!t->opt)
> > + return 1;
> > +
> > + return bpf_test_sockopt_int(lc->ctx, lc->sk, t, SOL_SOCKET);
> > +}
> > +
> > +static int bpf_test_sockopt(void *ctx, const struct sock *sk)
> > +{
> > + struct loop_ctx lc = { .ctx = ctx, .sk = sk, };
> > + int n;
> > +
> > + n = bpf_loop(ARRAY_SIZE(sol_socket_tests), bpf_test_socket_sockopt, &lc, 0);
> > + if (n != ARRAY_SIZE(sol_socket_tests))
> > + return -1;
> > +
> > + return 0;
> > +}
> > +
> > +static bool bpf_test_delay(struct bpf_sock_ops *skops, const struct sock *sk)
> > +{
> > + struct bpf_sock_ops_kern *skops_kern;
> > + u64 timestamp = bpf_ktime_get_ns();
> > + struct skb_shared_info *shinfo;
> > + struct delay_info dinfo = {0};
> > + struct delay_info *val;
> > + struct sk_buff *skb;
> > + struct sk_stg *stg;
> > + u32 delay, tskey;
> > + u64 prior_ts;
> > +
> > + skops_kern = bpf_cast_to_kern_ctx(skops);
> > + skb = skops_kern->skb;
> > + shinfo = bpf_core_cast(skb->head + skb->end, struct skb_shared_info);
> > + tskey = shinfo->tskey;
> > + if (!tskey)
> > + return false;
> > +
> > + if (skops->op == BPF_SOCK_OPS_TS_TCP_SND_CB) {
> > + stg = bpf_sk_storage_get(&sk_stg_map, (void *)sk, 0, 0);
> > + if (!stg)
> > + return false;
> > + dinfo.sendmsg_ns = stg->sendmsg_ns;
> > + val = &dinfo;
>
> Move the map_update here instead.
>
> bpf_map_update_elem(&time_map, &tskey, val, BPF_ANY);
>
> > + goto out;
> > + }
> > +
> > + val = bpf_map_lookup_elem(&time_map, &tskey);
> > + if (!val)
> > + return false;
> > +
> > + switch (skops->op) {
> > + case BPF_SOCK_OPS_TS_SCHED_OPT_CB:
> > + delay = val->sched_delay = timestamp - val->sendmsg_ns;
> > + break;
> > + case BPF_SOCK_OPS_TS_SW_OPT_CB:
> > + prior_ts = val->sched_delay + val->sendmsg_ns;
> > + delay = val->sw_snd_delay = timestamp - prior_ts;
> > + break;
> > + case BPF_SOCK_OPS_TS_ACK_OPT_CB:
> > + prior_ts = val->sw_snd_delay + val->sched_delay + val->sendmsg_ns;
> > + delay = val->ack_delay = timestamp - prior_ts;
> > + break;
> > + }
> > +
> > + if (delay <= 0 || delay >= delay_tolerance_nsec)
>
> Regarding the delay <= 0 check, note that delay is defined as u32.
>
> delay_tolerance_nsec is 1 sec, which could be too short for the bpf CI. Maybe
> raise it to something like 10s and only check "if (delay >= delay_tolerance_nsec)".
> It would also be useful to bump an nr_long_delay counter here and ASSERT on it
> in userspace.
>
> btw, it is in nsec, is u32 enough?
>
>
> > + return false;
> > +
> > + /* Since it's the last one, remove from the map after latency check */
> > + if (skops->op == BPF_SOCK_OPS_TS_ACK_OPT_CB) {
> > + bpf_map_delete_elem(&time_map, &tskey);
> > + return true;
> > + }
> > +
> > +out:
> > + bpf_map_update_elem(&time_map, &tskey, val, BPF_ANY);
>
> Then there is no need to do the update_elem here for the other ops.
I'm going to adjust all the points you mentioned above. Thanks!
>
> Overall, I think the set looks good. Only a few things left. Thanks for
> revamping the test also. The test should be pretty close to how it will be used.
>
> Please add tests to ensure the new timestamping callbacks cannot use the helpers
> that we discussed in the earlier patch and also cannot directly read/write the
> sock fields through the bpf_sock_ops.
No problem. Will do it in the next respin.
>
> Please also add some details on how the UDP BPF_SOCK_OPS_TS_TCP_SND_CB (or to be
> renamed to BPF_SOCK_OPS_TS_SND_CB ?) will look like. It is the only callback
> that I don't have a clear idea for UDP.
I think I will rename it as you said. But I wonder if I can add more
details about UDP after this series gets merged, which should not be
too late. After this series lands, I will carefully consider and test
how we can use it for UDP sockets.
>
> Please tag the set to bpf-next. Then the bpf CI can pick it up automatically
> and continue testing it whenever other bpf patches land.
Got it!
>
> [ I will reply other followup later ]
Thanks for your work.
>
> > + return true;
> > +}
> > +
> > +SEC("fentry/tcp_sendmsg_locked")
> > +int BPF_PROG(trace_tcp_sendmsg_locked, struct sock *sk, struct msghdr *msg, size_t size)
> > +{
> > + u64 timestamp = bpf_ktime_get_ns();
> > + u32 flag = sk->sk_bpf_cb_flags;
> > + struct sk_stg *stg;
> > +
> > + if (!flag)
> > + return 0;
> > +
> > + stg = bpf_sk_storage_get(&sk_stg_map, sk, 0,
> > + BPF_SK_STORAGE_GET_F_CREATE);
> > + if (!stg)
> > + return 0;
> > +
> > + stg->sendmsg_ns = timestamp;
> > + nr_snd += 1;
> > + return 0;
> > +}
> > +
> > +SEC("sockops")
> > +int skops_sockopt(struct bpf_sock_ops *skops)
> > +{
> > + struct bpf_sock *bpf_sk = skops->sk;
> > + const struct sock *sk;
> > +
> > + if (!bpf_sk)
> > + return 1;
> > +
> > + sk = (struct sock *)bpf_skc_to_tcp_sock(bpf_sk);
> > + if (!sk)
> > + return 1;
> > +
> > + switch (skops->op) {
> > + case BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB:
> > + nr_active += !bpf_test_sockopt(skops, sk);
> > + break;
> > + case BPF_SOCK_OPS_TS_TCP_SND_CB:
> > + if (bpf_test_delay(skops, sk))
> > + nr_snd += 1;
> > + break;
> > + case BPF_SOCK_OPS_TS_SCHED_OPT_CB:
> > + if (bpf_test_delay(skops, sk))
> > + nr_sched += 1;
> > + break;
> > + case BPF_SOCK_OPS_TS_SW_OPT_CB:
> > + if (bpf_test_delay(skops, sk))
> > + nr_txsw += 1;
> > + break;
> > + case BPF_SOCK_OPS_TS_ACK_OPT_CB:
> > + if (bpf_test_delay(skops, sk))
> > + nr_ack += 1;
> > + break;
> > + }
> > +
> > + return 1;
> > +}
> > +
> > +char _license[] SEC("license") = "GPL";
>
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [RFC PATCH net-next v6 04/13] bpf: stop UDP sock accessing TCP fields in sock_op BPF CALLs
2025-01-25 3:12 ` Martin KaFai Lau
@ 2025-01-25 3:43 ` Jason Xing
0 siblings, 0 replies; 43+ messages in thread
From: Jason Xing @ 2025-01-25 3:43 UTC (permalink / raw)
To: Martin KaFai Lau
Cc: davem, edumazet, kuba, pabeni, dsahern, willemdebruijn.kernel,
willemb, ast, daniel, andrii, eddyz87, song, yonghong.song,
john.fastabend, kpsingh, sdf, haoluo, jolsa, horms, bpf, netdev
On Sat, Jan 25, 2025 at 11:12 AM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 1/24/25 6:25 PM, Martin KaFai Lau wrote:
> >>
> >> Sorry, I don't think it can work for all the cases because:
> >> 1) please see BPF_SOCK_OPS_WRITE_HDR_OPT_CB/BPF_SOCK_OPS_HDR_OPT_LEN_CB,
> >> if req exists, there is no allow_tcp_access initialization. Then
> >> calling some function like bpf_sock_ops_setsockopt will be rejected
> >> because allow_tcp_access is zero.
> >> 2) tcp_call_bpf() only set allow_tcp_access only when the socket is
> >> fullsock. As far as I know, all the callers have a fullsock for
> >> now, but in the future they might not.
> >
> > Note that the existing helper bpf_sock_ops_cb_flags_set and
> > bpf_sock_ops_{set,get}sockopt itself have done the sk_fullsock() test and then
> > return -EINVAL. bpf_sock->sk is fullsock or not does not matter to these helpers.
> >
> > You are right on the BPF_SOCK_OPS_WRITE_HDR_OPT_CB/BPF_SOCK_OPS_HDR_OPT_LEN_CB
> > but the only helper left that testing allow_tcp_access is not enough is
> > bpf_sock_ops_load_hdr_opt(). Potentially, it can test "if (!bpf_sock-
> > >allow_tcp_access && !bpf_sock->syn_skb) { return -EOPNOTSUPP; }".
> >
> > Agree to stay with the current "bpf_sock->op <= BPF_SOCK_OPS_WRITE_HDR_OPT_CB"
> > as in this patch. It is cleaner.
>
> Also ignore my earlier comment on merging patch 3 and 4. Better keep patch 4 on
> its own since it is not reusing the allow_tcp_access test. Instead, stay with
> the "bpf_sock->op <= BPF_SOCK_OPS_WRITE_HDR_OPT_CB" test.
Got it!
Thanks,
Jason
* Re: [RFC PATCH net-next v6 13/13] bpf: add simple bpf tests in the tx path for so_timestamping feature
2025-01-25 3:42 ` Jason Xing
@ 2025-01-27 23:49 ` Martin KaFai Lau
2025-01-28 0:19 ` Jason Xing
0 siblings, 1 reply; 43+ messages in thread
From: Martin KaFai Lau @ 2025-01-27 23:49 UTC (permalink / raw)
To: Jason Xing
Cc: davem, edumazet, kuba, pabeni, dsahern, willemdebruijn.kernel,
willemb, ast, daniel, andrii, eddyz87, song, yonghong.song,
john.fastabend, kpsingh, sdf, haoluo, jolsa, horms, bpf, netdev
On 1/24/25 7:42 PM, Jason Xing wrote:
>> Please also add some details on how the UDP BPF_SOCK_OPS_TS_TCP_SND_CB (or to be
>> renamed to BPF_SOCK_OPS_TS_SND_CB ?) will look like. It is the only callback
>> that I don't have a clear idea for UDP.
> I think I will rename it as you said. But I wonder if I can add more
> details about UDP after this series gets merged, which should not be
> too late. After this series lands, I will carefully consider and test
> how we can use it for UDP sockets.
Not asking for a full UDP implementation; having this set stay with TCP is
ok. We have a pretty clear idea of how all the new TS_*_CB callbacks will
work for UDP, except TS_SND_CB.
I am asking for at least a description of where this SND hook will be placed
in UDP and how the delay will be measured from udp_sendmsg(). I haven't
looked, hence the question. It is better to get some visibility first instead
of scrambling to change it after landing in -next.
* Re: [RFC PATCH net-next v6 13/13] bpf: add simple bpf tests in the tx path for so_timestamping feature
2025-01-27 23:49 ` Martin KaFai Lau
@ 2025-01-28 0:19 ` Jason Xing
0 siblings, 0 replies; 43+ messages in thread
From: Jason Xing @ 2025-01-28 0:19 UTC (permalink / raw)
To: Martin KaFai Lau
Cc: davem, edumazet, kuba, pabeni, dsahern, willemdebruijn.kernel,
willemb, ast, daniel, andrii, eddyz87, song, yonghong.song,
john.fastabend, kpsingh, sdf, haoluo, jolsa, horms, bpf, netdev
On Tue, Jan 28, 2025 at 7:49 AM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 1/24/25 7:42 PM, Jason Xing wrote:
> >> Please also add some details on how the UDP BPF_SOCK_OPS_TS_TCP_SND_CB (or to be
> >> renamed to BPF_SOCK_OPS_TS_SND_CB ?) will look like. It is the only callback
> >> that I don't have a clear idea for UDP.
> > I think I will rename it as you said. But I wonder if I can add more
> > details about UDP after this series gets merged, which should not be
> > too late. After this series lands, I will carefully consider and test
> > how we can use it for UDP sockets.
>
> Not asking for a full UDP implementation, having this set staying with TCP is
> ok. We have pretty clear idea on all the new TS_*_CB will work in UDP except the
> TS_SND_CB.
>
> I am asking at least a description on where this SND hook will be in UDP and how
> the delay will be measured from the udp_sendmsg(). I haven't looked, so the
> question. It is better to get some visibility first instead of scrambling to
> change it after landing to -next.
No problem. Let me give it more thoughts :)
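To make the open question a bit more concrete, here is a purely hypothetical
sketch (untested; names borrowed from the TCP selftest in patch 13) of what a
UDP-side mirror of trace_tcp_sendmsg_locked might look like. Where the
matching SND callback would fire, and how a datagram's tskey would be derived,
is exactly what still needs thought:

```c
/* Hypothetical sketch only: assumes a fentry on udp_sendmsg() plays the
 * same role that trace_tcp_sendmsg_locked plays for TCP in the selftest.
 */
SEC("fentry/udp_sendmsg")
int BPF_PROG(trace_udp_sendmsg, struct sock *sk, struct msghdr *msg, size_t len)
{
	u64 timestamp = bpf_ktime_get_ns();
	struct sk_stg *stg;

	if (!sk->sk_bpf_cb_flags)
		return 0;

	stg = bpf_sk_storage_get(&sk_stg_map, sk, 0,
				 BPF_SK_STORAGE_GET_F_CREATE);
	if (!stg)
		return 0;

	/* Record sendmsg entry time; the later TS_*_CB callbacks would
	 * compute their deltas against this, as in the TCP test.
	 */
	stg->sendmsg_ns = timestamp;
	return 0;
}
```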
Thanks,
Jason
* Re: [RFC PATCH net-next v6 03/13] bpf: stop UDP sock accessing TCP fields in bpf callbacks
2025-01-25 0:28 ` Jason Xing
@ 2025-01-28 1:34 ` Jason Xing
0 siblings, 0 replies; 43+ messages in thread
From: Jason Xing @ 2025-01-28 1:34 UTC (permalink / raw)
To: Martin KaFai Lau
Cc: davem, edumazet, kuba, pabeni, dsahern, willemdebruijn.kernel,
willemb, ast, daniel, andrii, eddyz87, song, yonghong.song,
john.fastabend, kpsingh, sdf, haoluo, jolsa, horms, bpf, netdev
On Sat, Jan 25, 2025 at 8:28 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
>
> On Sat, Jan 25, 2025 at 7:41 AM Martin KaFai Lau <martin.lau@linux.dev> wrote:
> >
> > On 1/20/25 5:28 PM, Jason Xing wrote:
> > > Applying the new member allow_tcp_access in the existing callbacks
> > > where is_fullsock is set to 1 can help us stop UDP socket accessing
> > > struct tcp_sock, or else it could be catastrophe leading to panic.
> > >
> > > For now, those existing callbacks are used only for TCP. I believe
> > > in the short run, we will have timestamping UDP callbacks support.
> >
> > The commit message needs adjustment. UDP is not supported yet, so this change
> > feels like it's unnecessary based on the commit message. However, even without
> > UDP support, the new timestamping callbacks cannot directly write some fields
> > because the sk lock is not held, so this change is needed for TCP timestamping
>
> Thanks, and I will revise them. But I still want to say that the
> timestamping callbacks after this series are all under the protection
> of the socket lock.
For the record: I was wrong in my understanding of the socket lock above,
because there remain cases, e.g. the path
i40e_intr()->i40e_ptp_tx_hwtstamp()->skb_tstamp_tx()->__skb_tstamp_tx(),
that are not protected by the socket lock. With that said, directly
accessing tcp_sock is not safe even if the socket type is TCP.
Thanks,
Jason
Thread overview: 43+ messages (newest: 2025-01-28 1:34 UTC)
2025-01-21 1:28 [RFC PATCH net-next v6 00/13] net-timestamp: bpf extension to equip applications transparently Jason Xing
2025-01-21 1:28 ` [RFC PATCH net-next v6 01/13] net-timestamp: add support for bpf_setsockopt() Jason Xing
2025-01-21 1:28 ` [RFC PATCH net-next v6 02/13] net-timestamp: prepare for timestamping callbacks use Jason Xing
2025-01-21 5:08 ` Jason Xing
2025-01-21 1:28 ` [RFC PATCH net-next v6 03/13] bpf: stop UDP sock accessing TCP fields in bpf callbacks Jason Xing
2025-01-24 23:40 ` Martin KaFai Lau
2025-01-25 0:28 ` Jason Xing
2025-01-28 1:34 ` Jason Xing
2025-01-21 1:28 ` [RFC PATCH net-next v6 04/13] bpf: stop UDP sock accessing TCP fields in sock_op BPF CALLs Jason Xing
2025-01-25 0:28 ` Martin KaFai Lau
2025-01-25 1:15 ` Jason Xing
2025-01-25 1:32 ` Jason Xing
2025-01-25 2:25 ` Martin KaFai Lau
2025-01-25 2:58 ` Jason Xing
2025-01-25 3:12 ` Martin KaFai Lau
2025-01-25 3:43 ` Jason Xing
2025-01-21 1:28 ` [RFC PATCH net-next v6 05/13] net-timestamp: prepare for isolating two modes of SO_TIMESTAMPING Jason Xing
2025-01-21 1:28 ` [RFC PATCH net-next v6 06/13] net-timestamp: support SCM_TSTAMP_SCHED for bpf extension Jason Xing
2025-01-25 0:38 ` Martin KaFai Lau
2025-01-25 1:16 ` Jason Xing
2025-01-21 1:28 ` [RFC PATCH net-next v6 07/13] net-timestamp: support sw SCM_TSTAMP_SND " Jason Xing
2025-01-25 0:40 ` Martin KaFai Lau
2025-01-25 1:17 ` Jason Xing
2025-01-21 1:28 ` [RFC PATCH net-next v6 08/13] net-timestamp: support hw " Jason Xing
2025-01-25 0:46 ` Martin KaFai Lau
2025-01-25 1:18 ` Jason Xing
2025-01-25 1:29 ` Martin KaFai Lau
2025-01-25 1:35 ` Jason Xing
2025-01-25 2:36 ` Martin KaFai Lau
2025-01-25 2:59 ` Jason Xing
2025-01-21 1:28 ` [RFC PATCH net-next v6 09/13] net-timestamp: support SCM_TSTAMP_ACK " Jason Xing
2025-01-21 1:28 ` [RFC PATCH net-next v6 10/13] net-timestamp: make TCP tx timestamp bpf extension work Jason Xing
2025-01-21 1:28 ` [RFC PATCH net-next v6 11/13] net-timestamp: add a new callback in tcp_tx_timestamp() Jason Xing
2025-01-25 0:50 ` Martin KaFai Lau
2025-01-25 1:21 ` Jason Xing
2025-01-21 1:29 ` [RFC PATCH net-next v6 12/13] net-timestamp: introduce cgroup lock to avoid affecting non-bpf cases Jason Xing
2025-01-25 1:09 ` Martin KaFai Lau
2025-01-25 1:25 ` Jason Xing
2025-01-21 1:29 ` [RFC PATCH net-next v6 13/13] bpf: add simple bpf tests in the tx path for so_timestamping feature Jason Xing
2025-01-25 3:07 ` Martin KaFai Lau
2025-01-25 3:42 ` Jason Xing
2025-01-27 23:49 ` Martin KaFai Lau
2025-01-28 0:19 ` Jason Xing