netdev.vger.kernel.org archive mirror
* [PATCH bpf-next v8 00/12] net-timestamp: bpf extension to equip applications transparently
@ 2025-02-04 18:30 Jason Xing
  2025-02-04 18:30 ` [PATCH bpf-next v8 01/12] bpf: add support for bpf_setsockopt() Jason Xing
                   ` (11 more replies)
  0 siblings, 12 replies; 66+ messages in thread
From: Jason Xing @ 2025-02-04 18:30 UTC (permalink / raw)
  To: davem, edumazet, kuba, pabeni, dsahern, willemdebruijn.kernel,
	willemb, ast, daniel, andrii, martin.lau, eddyz87, song,
	yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa, horms
  Cc: bpf, netdev, Jason Xing

"Timestamping is key to debugging network stack latency. With
SO_TIMESTAMPING, bugs that are otherwise incorrectly assumed to be
network issues can be attributed to the kernel." This is extracted
from the talk "SO_TIMESTAMPING: Powering Fleetwide RPC Monitoring"
given by Willem de Bruijn at netdevconf 0x17.

There are a few areas that need optimization, with ease of use and
minimal performance impact in mind, which I highlighted and mainly
discussed at netconf 2024 with Willem de Bruijn and John Fastabend:
uAPI compatibility, extra system call overhead, and the need for
application modification. I initially solved these issues by writing
a kernel module that hooks various key functions. However, this
approach is not suitable for the next kernel release, so a BPF
extension was proposed instead. Throughout this period, Martin KaFai
Lau has provided invaluable suggestions about BPF along the way. Many
thanks!

This series only adds the foundational code and tx timestamping
support for TCP. The approach mostly relies on the existing
SO_TIMESTAMPING feature; users only need to pass certain flags through
bpf_setsockopt() to a separate tsflags field. Please see the last
selftest patch in this series.
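
As a rough sketch of what "passing certain flags through
bpf_setsockopt()" looks like from the bpf side (a minimal,
illustrative sockops program, not taken from this series; the program
and section names are assumptions, while SK_BPF_CB_FLAGS and
SK_BPF_CB_TX_TIMESTAMPING are introduced in patch 1):

```c
/* Illustrative sketch: turn on the bpf tx timestamping flag once a
 * TCP connection is established. Assumes a vmlinux.h built from a
 * kernel carrying this series. */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

SEC("sockops")
int enable_tx_tstamp(struct bpf_sock_ops *skops)
{
	int flags = SK_BPF_CB_TX_TIMESTAMPING;

	if (skops->op == BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB)
		/* sets the separate bpf tsflags on the socket */
		bpf_setsockopt(skops, SOL_SOCKET, SK_BPF_CB_FLAGS,
			       &flags, sizeof(flags));
	return 1;
}

char _license[] SEC("license") = "GPL";
```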

---
v8
Link: https://lore.kernel.org/all/20250128084620.57547-1-kerneljasonxing@gmail.com/
1. adjust some commit messages and titles
2. add sk cookie in selftests
3. handle the NULL pointer in hwstamp

v7
Link: https://lore.kernel.org/all/20250121012901.87763-1-kerneljasonxing@gmail.com/
1. target bpf-next tree
2. simply and directly stop the timestamping callbacks from calling a
few BPF CALLs due to safety concerns.
3. add more new testcases and adjust the existing testcases
4. revise some comments of new timestamping callbacks
5. remove a few BPF CGROUP locks

RFC v6
In the meantime, any suggestions and reviews are welcome!
Link: https://lore.kernel.org/all/20250112113748.73504-1-kerneljasonxing@gmail.com/
1. handle those safety problems using the correct method.
2. support bpf_getsockopt.
3. adjust the position of BPF_SOCK_OPS_TS_TCP_SND_CB
4. fix mishandling the hardware timestamp error
5. add more corresponding tests

v5
Link: https://lore.kernel.org/all/20241207173803.90744-1-kerneljasonxing@gmail.com/
1. handle the safety issues when someone tries to call unrelated bpf
helpers.
2. avoid adding a direct function call in hot paths like
__dev_queue_xmit()
3. remove reporting the hardware timestamp and tskey since they can be
fetched through the existing helper with the help of
bpf_skops_init_skb(), please see the selftest.
4. add new sendmsg callback in tcp_sendmsg, and introduce tskey_bpf used
by bpf program to correlate tcp_sendmsg with other hook points in patch [13/15].

v4
Link: https://lore.kernel.org/all/20241028110535.82999-1-kerneljasonxing@gmail.com/
1. introduce sk->sk_bpf_cb_flags to let user use bpf_setsockopt() (Martin)
2. introduce SKBTX_BPF to enable the bpf SO_TIMESTAMPING feature (Martin)
3. introduce bpf map in tests (Martin)
4. I chose to keep this series as simple as possible, so I only
support most cases in the tx path for the TCP protocol.

v3
Link: https://lore.kernel.org/all/20241012040651.95616-1-kerneljasonxing@gmail.com/
1. support UDP proto by introducing a new generation point.
2. for OPT_ID, introduce sk_tskey_bpf_offset to compute the delta
between the current socket key and the bpf socket key. It is designed
for UDP, but also applies to TCP.
3. support bpf_getsockopt()
4. use cgroup static key instead.
5. add one simple bpf selftest to show how it can be used.
6. remove the rx support from v2 because the number of patches could
exceed the limit of one series.

V2
Link: https://lore.kernel.org/all/20241008095109.99918-1-kerneljasonxing@gmail.com/
1. Introduce tsflag requestors so that we are able to extend more in
the future. Besides, this enables TX flags for the bpf extension
feature separately, without breaking users. Suggested by Vadim
Fedorenko.
2. introduce a static key to control the whole feature. (Willem)
3. Open the gate of bpf_setsockopt for the SO_TIMESTAMPING feature in
some TX/RX cases, not all the cases.


Jason Xing (12):
  bpf: add support for bpf_setsockopt()
  bpf: prepare for timestamping callbacks use
  bpf: stop unsafely accessing TCP fields in bpf callbacks
  bpf: stop calling some sock_op BPF CALLs in new timestamping callbacks
  net-timestamp: prepare for isolating two modes of SO_TIMESTAMPING
  bpf: support SCM_TSTAMP_SCHED of SO_TIMESTAMPING
  bpf: support sw SCM_TSTAMP_SND of SO_TIMESTAMPING
  bpf: support hw SCM_TSTAMP_SND of SO_TIMESTAMPING
  bpf: support SCM_TSTAMP_ACK of SO_TIMESTAMPING
  bpf: make TCP tx timestamp bpf extension work
  bpf: add a new callback in tcp_tx_timestamp()
  selftests/bpf: add simple bpf tests in the tx path for timestamping
    feature

 include/linux/filter.h                        |   5 +
 include/linux/skbuff.h                        |  25 +-
 include/net/sock.h                            |  10 +
 include/net/tcp.h                             |   4 +-
 include/uapi/linux/bpf.h                      |  35 ++
 net/core/dev.c                                |   5 +-
 net/core/filter.c                             |  48 ++-
 net/core/skbuff.c                             |  62 +++-
 net/core/sock.c                               |  15 +
 net/dsa/user.c                                |   2 +-
 net/ipv4/tcp.c                                |  11 +
 net/ipv4/tcp_input.c                          |   8 +-
 net/ipv4/tcp_output.c                         |   7 +
 net/socket.c                                  |   2 +-
 tools/include/uapi/linux/bpf.h                |  28 ++
 .../bpf/prog_tests/so_timestamping.c          |  79 +++++
 .../selftests/bpf/progs/so_timestamping.c     | 306 ++++++++++++++++++
 17 files changed, 630 insertions(+), 22 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/so_timestamping.c
 create mode 100644 tools/testing/selftests/bpf/progs/so_timestamping.c

-- 
2.43.5


^ permalink raw reply	[flat|nested] 66+ messages in thread

* [PATCH bpf-next v8 01/12] bpf: add support for bpf_setsockopt()
  2025-02-04 18:30 [PATCH bpf-next v8 00/12] net-timestamp: bpf extension to equip applications transparently Jason Xing
@ 2025-02-04 18:30 ` Jason Xing
  2025-02-05 15:22   ` Willem de Bruijn
  2025-02-04 18:30 ` [PATCH bpf-next v8 02/12] bpf: prepare for timestamping callbacks use Jason Xing
                   ` (10 subsequent siblings)
  11 siblings, 1 reply; 66+ messages in thread
From: Jason Xing @ 2025-02-04 18:30 UTC (permalink / raw)
  To: davem, edumazet, kuba, pabeni, dsahern, willemdebruijn.kernel,
	willemb, ast, daniel, andrii, martin.lau, eddyz87, song,
	yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa, horms
  Cc: bpf, netdev, Jason Xing

Users can write the following code to enable the bpf extension:
int flags = SK_BPF_CB_TX_TIMESTAMPING;
int opts = SK_BPF_CB_FLAGS;
bpf_setsockopt(skops, SOL_SOCKET, opts, &flags, sizeof(flags));
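
To make the accepted and rejected values concrete, here is a
standalone userspace model (illustrative only, not part of the patch)
of the SK_BPF_CB_MASK validation that sk_bpf_set_get_cb_flags() below
performs:

```c
/* Userspace model of the mask check in sk_bpf_set_get_cb_flags():
 * any requested bit outside SK_BPF_CB_MASK is rejected with -EINVAL.
 * The constants mirror the uapi enum added by this patch. */
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define SK_BPF_CB_TX_TIMESTAMPING	(1 << 0)
#define SK_BPF_CB_MASK	((SK_BPF_CB_TX_TIMESTAMPING - 1) | \
			 SK_BPF_CB_TX_TIMESTAMPING)

static int set_cb_flags(uint32_t *sk_bpf_cb_flags, uint32_t requested)
{
	if (requested & ~(uint32_t)SK_BPF_CB_MASK)
		return -EINVAL;	/* unknown flag bit */
	*sk_bpf_cb_flags = requested;
	return 0;
}
```

With only one flag defined, SK_BPF_CB_MASK reduces to
SK_BPF_CB_TX_TIMESTAMPING; the (flag - 1) | flag pattern keeps the
mask covering all lower bits as more flags are added.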

Signed-off-by: Jason Xing <kerneljasonxing@gmail.com>
---
 include/net/sock.h             |  3 +++
 include/uapi/linux/bpf.h       |  8 ++++++++
 net/core/filter.c              | 23 +++++++++++++++++++++++
 tools/include/uapi/linux/bpf.h |  1 +
 4 files changed, 35 insertions(+)

diff --git a/include/net/sock.h b/include/net/sock.h
index 8036b3b79cd8..7916982343c6 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -303,6 +303,7 @@ struct sk_filter;
   *	@sk_stamp: time stamp of last packet received
   *	@sk_stamp_seq: lock for accessing sk_stamp on 32 bit architectures only
   *	@sk_tsflags: SO_TIMESTAMPING flags
+  *	@sk_bpf_cb_flags: used in bpf_setsockopt()
   *	@sk_use_task_frag: allow sk_page_frag() to use current->task_frag.
   *			   Sockets that can be used under memory reclaim should
   *			   set this to false.
@@ -445,6 +446,8 @@ struct sock {
 	u32			sk_reserved_mem;
 	int			sk_forward_alloc;
 	u32			sk_tsflags;
+#define SK_BPF_CB_FLAG_TEST(SK, FLAG) ((SK)->sk_bpf_cb_flags & (FLAG))
+	u32			sk_bpf_cb_flags;
 	__cacheline_group_end(sock_write_rxtx);
 
 	__cacheline_group_begin(sock_write_tx);
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 2acf9b336371..6116eb3d1515 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -6913,6 +6913,13 @@ enum {
 	BPF_SOCK_OPS_ALL_CB_FLAGS       = 0x7F,
 };
 
+/* Definitions for bpf_sk_cb_flags */
+enum {
+	SK_BPF_CB_TX_TIMESTAMPING	= 1<<0,
+	SK_BPF_CB_MASK			= (SK_BPF_CB_TX_TIMESTAMPING - 1) |
+					   SK_BPF_CB_TX_TIMESTAMPING
+};
+
 /* List of known BPF sock_ops operators.
  * New entries can only be added at the end
  */
@@ -7091,6 +7098,7 @@ enum {
 	TCP_BPF_SYN_IP		= 1006, /* Copy the IP[46] and TCP header */
 	TCP_BPF_SYN_MAC         = 1007, /* Copy the MAC, IP[46], and TCP header */
 	TCP_BPF_SOCK_OPS_CB_FLAGS = 1008, /* Get or Set TCP sock ops flags */
+	SK_BPF_CB_FLAGS		= 1009, /* Used to set socket bpf flags */
 };
 
 enum {
diff --git a/net/core/filter.c b/net/core/filter.c
index 2ec162dd83c4..1c6c07507a78 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5222,6 +5222,25 @@ static const struct bpf_func_proto bpf_get_socket_uid_proto = {
 	.arg1_type      = ARG_PTR_TO_CTX,
 };
 
+static int sk_bpf_set_get_cb_flags(struct sock *sk, char *optval, bool getopt)
+{
+	u32 sk_bpf_cb_flags;
+
+	if (getopt) {
+		*(u32 *)optval = sk->sk_bpf_cb_flags;
+		return 0;
+	}
+
+	sk_bpf_cb_flags = *(u32 *)optval;
+
+	if (sk_bpf_cb_flags & ~SK_BPF_CB_MASK)
+		return -EINVAL;
+
+	sk->sk_bpf_cb_flags = sk_bpf_cb_flags;
+
+	return 0;
+}
+
 static int sol_socket_sockopt(struct sock *sk, int optname,
 			      char *optval, int *optlen,
 			      bool getopt)
@@ -5238,6 +5257,7 @@ static int sol_socket_sockopt(struct sock *sk, int optname,
 	case SO_MAX_PACING_RATE:
 	case SO_BINDTOIFINDEX:
 	case SO_TXREHASH:
+	case SK_BPF_CB_FLAGS:
 		if (*optlen != sizeof(int))
 			return -EINVAL;
 		break;
@@ -5247,6 +5267,9 @@ static int sol_socket_sockopt(struct sock *sk, int optname,
 		return -EINVAL;
 	}
 
+	if (optname == SK_BPF_CB_FLAGS)
+		return sk_bpf_set_get_cb_flags(sk, optval, getopt);
+
 	if (getopt) {
 		if (optname == SO_BINDTODEVICE)
 			return -EINVAL;
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 2acf9b336371..70366f74ef4e 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -7091,6 +7091,7 @@ enum {
 	TCP_BPF_SYN_IP		= 1006, /* Copy the IP[46] and TCP header */
 	TCP_BPF_SYN_MAC         = 1007, /* Copy the MAC, IP[46], and TCP header */
 	TCP_BPF_SOCK_OPS_CB_FLAGS = 1008, /* Get or Set TCP sock ops flags */
+	SK_BPF_CB_FLAGS		= 1009, /* Used to set socket bpf flags */
 };
 
 enum {
-- 
2.43.5



* [PATCH bpf-next v8 02/12] bpf: prepare for timestamping callbacks use
  2025-02-04 18:30 [PATCH bpf-next v8 00/12] net-timestamp: bpf extension to equip applications transparently Jason Xing
  2025-02-04 18:30 ` [PATCH bpf-next v8 01/12] bpf: add support for bpf_setsockopt() Jason Xing
@ 2025-02-04 18:30 ` Jason Xing
  2025-02-04 18:30 ` [PATCH bpf-next v8 03/12] bpf: stop unsafely accessing TCP fields in bpf callbacks Jason Xing
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 66+ messages in thread
From: Jason Xing @ 2025-02-04 18:30 UTC (permalink / raw)
  To: davem, edumazet, kuba, pabeni, dsahern, willemdebruijn.kernel,
	willemb, ast, daniel, andrii, martin.lau, eddyz87, song,
	yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa, horms
  Cc: bpf, netdev, Jason Xing

Four callback points that report information to user space will be
introduced later, built on this patch.

As for the skb initialization here, users can follow these three steps
to fetch the shared info from the exported skb in the bpf prog:
1. skops_kern = bpf_cast_to_kern_ctx(skops);
2. skb = skops_kern->skb;
3. shinfo = bpf_core_cast(skb->head + skb->end, struct skb_shared_info);

More details can be seen in the last selftest patch of the series.
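
A sketch of those three steps inside a bpf prog (illustrative only;
the program name is an assumption, and it relies on the
bpf_cast_to_kern_ctx() and bpf_core_cast() kfuncs available in recent
kernels):

```c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_core_read.h>

SEC("sockops")
int read_shinfo(struct bpf_sock_ops *skops)
{
	struct bpf_sock_ops_kern *skops_kern;
	struct skb_shared_info *shinfo;
	struct sk_buff *skb;

	skops_kern = bpf_cast_to_kern_ctx(skops);	/* step 1 */
	skb = skops_kern->skb;				/* step 2 */
	shinfo = bpf_core_cast(skb->head + skb->end,	/* step 3 */
			       struct skb_shared_info);

	/* fields such as shinfo->tskey are now readable */
	return 1;
}

char _license[] SEC("license") = "GPL";
```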

Signed-off-by: Jason Xing <kerneljasonxing@gmail.com>
---
 include/net/sock.h |  7 +++++++
 net/core/sock.c    | 15 +++++++++++++++
 2 files changed, 22 insertions(+)

diff --git a/include/net/sock.h b/include/net/sock.h
index 7916982343c6..6f4d54faba92 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -2923,6 +2923,13 @@ int sock_set_timestamping(struct sock *sk, int optname,
 			  struct so_timestamping timestamping);
 
 void sock_enable_timestamps(struct sock *sk);
+#if defined(CONFIG_CGROUP_BPF)
+void bpf_skops_tx_timestamping(struct sock *sk, struct sk_buff *skb, int op);
+#else
+static inline void bpf_skops_tx_timestamping(struct sock *sk, struct sk_buff *skb, int op)
+{
+}
+#endif
 void sock_no_linger(struct sock *sk);
 void sock_set_keepalive(struct sock *sk);
 void sock_set_priority(struct sock *sk, u32 priority);
diff --git a/net/core/sock.c b/net/core/sock.c
index eae2ae70a2e0..41db6407e360 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -948,6 +948,21 @@ int sock_set_timestamping(struct sock *sk, int optname,
 	return 0;
 }
 
+#if defined(CONFIG_CGROUP_BPF)
+void bpf_skops_tx_timestamping(struct sock *sk, struct sk_buff *skb, int op)
+{
+	struct bpf_sock_ops_kern sock_ops;
+
+	memset(&sock_ops, 0, offsetof(struct bpf_sock_ops_kern, temp));
+	sock_ops.op = op;
+	sock_ops.is_fullsock = 1;
+	sock_ops.sk = sk;
+	bpf_skops_init_skb(&sock_ops, skb, 0);
+	/* Timestamping bpf extension supports only TCP and UDP full socket */
+	__cgroup_bpf_run_filter_sock_ops(sk, &sock_ops, CGROUP_SOCK_OPS);
+}
+#endif
+
 void sock_set_keepalive(struct sock *sk)
 {
 	lock_sock(sk);
-- 
2.43.5



* [PATCH bpf-next v8 03/12] bpf: stop unsafely accessing TCP fields in bpf callbacks
  2025-02-04 18:30 [PATCH bpf-next v8 00/12] net-timestamp: bpf extension to equip applications transparently Jason Xing
  2025-02-04 18:30 ` [PATCH bpf-next v8 01/12] bpf: add support for bpf_setsockopt() Jason Xing
  2025-02-04 18:30 ` [PATCH bpf-next v8 02/12] bpf: prepare for timestamping callbacks use Jason Xing
@ 2025-02-04 18:30 ` Jason Xing
  2025-02-05 15:24   ` Willem de Bruijn
  2025-02-04 18:30 ` [PATCH bpf-next v8 04/12] bpf: stop calling some sock_op BPF CALLs in new timestamping callbacks Jason Xing
                   ` (8 subsequent siblings)
  11 siblings, 1 reply; 66+ messages in thread
From: Jason Xing @ 2025-02-04 18:30 UTC (permalink / raw)
  To: davem, edumazet, kuba, pabeni, dsahern, willemdebruijn.kernel,
	willemb, ast, daniel, andrii, martin.lau, eddyz87, song,
	yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa, horms
  Cc: bpf, netdev, Jason Xing

Add an "allow_tcp_access" flag to indicate that the callback site
holds the tcp_sock lock.

Setting the new allow_tcp_access member in the existing callbacks
where is_fullsock is set to 1 prevents a UDP socket from accessing
struct tcp_sock, and likewise prevents a TCP socket from doing so
without the sk lock held; otherwise it could be catastrophic and lead
to a panic.

To keep it simple, instead of distinguishing between read and write
access, users are denied all read/write access to the tcp_sock through
the older bpf_sock_ops ctx. The new timestamping callbacks can use
newer helpers (e.g. bpf_core_cast) to read everything from an sk, so
nothing is lost.

Signed-off-by: Jason Xing <kerneljasonxing@gmail.com>
---
 include/linux/filter.h | 5 +++++
 include/net/tcp.h      | 1 +
 net/core/filter.c      | 8 ++++----
 net/ipv4/tcp_input.c   | 2 ++
 net/ipv4/tcp_output.c  | 2 ++
 5 files changed, 14 insertions(+), 4 deletions(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index a3ea46281595..1569e9f31a8c 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -1508,6 +1508,11 @@ struct bpf_sock_ops_kern {
 	void	*skb_data_end;
 	u8	op;
 	u8	is_fullsock;
+	u8	allow_tcp_access;	/* Indicate that the callback site
+					 * has a tcp_sock locked. Then it
+					 * would be safe to access struct
+					 * tcp_sock.
+					 */
 	u8	remaining_opt_len;
 	u64	temp;			/* temp and everything after is not
 					 * initialized to 0 before calling
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 5b2b04835688..293047694710 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -2649,6 +2649,7 @@ static inline int tcp_call_bpf(struct sock *sk, int op, u32 nargs, u32 *args)
 	memset(&sock_ops, 0, offsetof(struct bpf_sock_ops_kern, temp));
 	if (sk_fullsock(sk)) {
 		sock_ops.is_fullsock = 1;
+		sock_ops.allow_tcp_access = 1;
 		sock_owned_by_me(sk);
 	}
 
diff --git a/net/core/filter.c b/net/core/filter.c
index 1c6c07507a78..dc0e67c5776a 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -10381,10 +10381,10 @@ static u32 sock_ops_convert_ctx_access(enum bpf_access_type type,
 		}							      \
 		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(			      \
 						struct bpf_sock_ops_kern,     \
-						is_fullsock),		      \
+						allow_tcp_access),	      \
 				      fullsock_reg, si->src_reg,	      \
 				      offsetof(struct bpf_sock_ops_kern,      \
-					       is_fullsock));		      \
+					       allow_tcp_access));	      \
 		*insn++ = BPF_JMP_IMM(BPF_JEQ, fullsock_reg, 0, jmp);	      \
 		if (si->dst_reg == si->src_reg)				      \
 			*insn++ = BPF_LDX_MEM(BPF_DW, reg, si->src_reg,	      \
@@ -10469,10 +10469,10 @@ static u32 sock_ops_convert_ctx_access(enum bpf_access_type type,
 					       temp));			      \
 		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(			      \
 						struct bpf_sock_ops_kern,     \
-						is_fullsock),		      \
+						allow_tcp_access),	      \
 				      reg, si->dst_reg,			      \
 				      offsetof(struct bpf_sock_ops_kern,      \
-					       is_fullsock));		      \
+					       allow_tcp_access));	      \
 		*insn++ = BPF_JMP_IMM(BPF_JEQ, reg, 0, 2);		      \
 		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(			      \
 						struct bpf_sock_ops_kern, sk),\
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index eb82e01da911..77185479ed5e 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -169,6 +169,7 @@ static void bpf_skops_parse_hdr(struct sock *sk, struct sk_buff *skb)
 	memset(&sock_ops, 0, offsetof(struct bpf_sock_ops_kern, temp));
 	sock_ops.op = BPF_SOCK_OPS_PARSE_HDR_OPT_CB;
 	sock_ops.is_fullsock = 1;
+	sock_ops.allow_tcp_access = 1;
 	sock_ops.sk = sk;
 	bpf_skops_init_skb(&sock_ops, skb, tcp_hdrlen(skb));
 
@@ -185,6 +186,7 @@ static void bpf_skops_established(struct sock *sk, int bpf_op,
 	memset(&sock_ops, 0, offsetof(struct bpf_sock_ops_kern, temp));
 	sock_ops.op = bpf_op;
 	sock_ops.is_fullsock = 1;
+	sock_ops.allow_tcp_access = 1;
 	sock_ops.sk = sk;
 	/* sk with TCP_REPAIR_ON does not have skb in tcp_finish_connect */
 	if (skb)
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 0e5b9a654254..695749807c09 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -522,6 +522,7 @@ static void bpf_skops_hdr_opt_len(struct sock *sk, struct sk_buff *skb,
 		sock_owned_by_me(sk);
 
 		sock_ops.is_fullsock = 1;
+		sock_ops.allow_tcp_access = 1;
 		sock_ops.sk = sk;
 	}
 
@@ -567,6 +568,7 @@ static void bpf_skops_write_hdr_opt(struct sock *sk, struct sk_buff *skb,
 		sock_owned_by_me(sk);
 
 		sock_ops.is_fullsock = 1;
+		sock_ops.allow_tcp_access = 1;
 		sock_ops.sk = sk;
 	}
 
-- 
2.43.5



* [PATCH bpf-next v8 04/12] bpf: stop calling some sock_op BPF CALLs in new timestamping callbacks
  2025-02-04 18:30 [PATCH bpf-next v8 00/12] net-timestamp: bpf extension to equip applications transparently Jason Xing
                   ` (2 preceding siblings ...)
  2025-02-04 18:30 ` [PATCH bpf-next v8 03/12] bpf: stop unsafely accessing TCP fields in bpf callbacks Jason Xing
@ 2025-02-04 18:30 ` Jason Xing
  2025-02-05 15:26   ` Willem de Bruijn
  2025-02-04 18:30 ` [PATCH bpf-next v8 05/12] net-timestamp: prepare for isolating two modes of SO_TIMESTAMPING Jason Xing
                   ` (7 subsequent siblings)
  11 siblings, 1 reply; 66+ messages in thread
From: Jason Xing @ 2025-02-04 18:30 UTC (permalink / raw)
  To: davem, edumazet, kuba, pabeni, dsahern, willemdebruijn.kernel,
	willemb, ast, daniel, andrii, martin.lau, eddyz87, song,
	yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa, horms
  Cc: bpf, netdev, Jason Xing

Simply disallow calling bpf_sock_ops_setsockopt/getsockopt,
bpf_sock_ops_cb_flags_set, and bpf_sock_ops_load_hdr_opt from the new
timestamping callbacks, for safety reasons.

Besides, in the next round, UDP support for the SO_TIMESTAMPING bpf
extension will be added, so this also prevents the safety problems
that are usually caused by a UDP socket trying to access TCP fields.
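
From the bpf side, this means a program attached for both the classic
sock_ops callbacks and the new timestamping callbacks should branch on
skops->op before using the gated helpers, or it will simply get
-EOPNOTSUPP back. A hedged sketch (the program name is illustrative;
BPF_SOCK_OPS_TS_SCHED_OPT_CB is added later in this series):

```c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

SEC("sockops")
int shared_prog(struct bpf_sock_ops *skops)
{
	switch (skops->op) {
	case BPF_SOCK_OPS_TS_SCHED_OPT_CB:
		/* new timestamping callback: gated helpers such as
		 * bpf_sock_ops_cb_flags_set() would return
		 * -EOPNOTSUPP here, so stick to read-only helpers. */
		break;
	case BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB:
		/* locked tcp_sock callback: helpers are allowed */
		bpf_sock_ops_cb_flags_set(skops,
					  BPF_SOCK_OPS_RTO_CB_FLAG);
		break;
	}
	return 1;
}

char _license[] SEC("license") = "GPL";
```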

Signed-off-by: Jason Xing <kerneljasonxing@gmail.com>
---
 net/core/filter.c | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/net/core/filter.c b/net/core/filter.c
index dc0e67c5776a..d3395ffe058e 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5523,6 +5523,11 @@ static int __bpf_setsockopt(struct sock *sk, int level, int optname,
 	return -EINVAL;
 }
 
+static bool is_locked_tcp_sock_ops(struct bpf_sock_ops_kern *bpf_sock)
+{
+	return bpf_sock->op <= BPF_SOCK_OPS_WRITE_HDR_OPT_CB;
+}
+
 static int _bpf_setsockopt(struct sock *sk, int level, int optname,
 			   char *optval, int optlen)
 {
@@ -5673,6 +5678,9 @@ static const struct bpf_func_proto bpf_sock_addr_getsockopt_proto = {
 BPF_CALL_5(bpf_sock_ops_setsockopt, struct bpf_sock_ops_kern *, bpf_sock,
 	   int, level, int, optname, char *, optval, int, optlen)
 {
+	if (!is_locked_tcp_sock_ops(bpf_sock))
+		return -EOPNOTSUPP;
+
 	return _bpf_setsockopt(bpf_sock->sk, level, optname, optval, optlen);
 }
 
@@ -5758,6 +5766,9 @@ static int bpf_sock_ops_get_syn(struct bpf_sock_ops_kern *bpf_sock,
 BPF_CALL_5(bpf_sock_ops_getsockopt, struct bpf_sock_ops_kern *, bpf_sock,
 	   int, level, int, optname, char *, optval, int, optlen)
 {
+	if (!is_locked_tcp_sock_ops(bpf_sock))
+		return -EOPNOTSUPP;
+
 	if (IS_ENABLED(CONFIG_INET) && level == SOL_TCP &&
 	    optname >= TCP_BPF_SYN && optname <= TCP_BPF_SYN_MAC) {
 		int ret, copy_len = 0;
@@ -5800,6 +5811,9 @@ BPF_CALL_2(bpf_sock_ops_cb_flags_set, struct bpf_sock_ops_kern *, bpf_sock,
 	struct sock *sk = bpf_sock->sk;
 	int val = argval & BPF_SOCK_OPS_ALL_CB_FLAGS;
 
+	if (!is_locked_tcp_sock_ops(bpf_sock))
+		return -EOPNOTSUPP;
+
 	if (!IS_ENABLED(CONFIG_INET) || !sk_fullsock(sk))
 		return -EINVAL;
 
@@ -7609,6 +7623,9 @@ BPF_CALL_4(bpf_sock_ops_load_hdr_opt, struct bpf_sock_ops_kern *, bpf_sock,
 	u8 search_kind, search_len, copy_len, magic_len;
 	int ret;
 
+	if (!is_locked_tcp_sock_ops(bpf_sock))
+		return -EOPNOTSUPP;
+
 	/* 2 byte is the minimal option len except TCPOPT_NOP and
 	 * TCPOPT_EOL which are useless for the bpf prog to learn
 	 * and this helper disallow loading them also.
-- 
2.43.5



* [PATCH bpf-next v8 05/12] net-timestamp: prepare for isolating two modes of SO_TIMESTAMPING
  2025-02-04 18:30 [PATCH bpf-next v8 00/12] net-timestamp: bpf extension to equip applications transparently Jason Xing
                   ` (3 preceding siblings ...)
  2025-02-04 18:30 ` [PATCH bpf-next v8 04/12] bpf: stop calling some sock_op BPF CALLs in new timestamping callbacks Jason Xing
@ 2025-02-04 18:30 ` Jason Xing
  2025-02-05  1:47   ` Jakub Kicinski
                     ` (2 more replies)
  2025-02-04 18:30 ` [PATCH bpf-next v8 06/12] bpf: support SCM_TSTAMP_SCHED " Jason Xing
                   ` (6 subsequent siblings)
  11 siblings, 3 replies; 66+ messages in thread
From: Jason Xing @ 2025-02-04 18:30 UTC (permalink / raw)
  To: davem, edumazet, kuba, pabeni, dsahern, willemdebruijn.kernel,
	willemb, ast, daniel, andrii, martin.lau, eddyz87, song,
	yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa, horms
  Cc: bpf, netdev, Jason Xing

No functional changes here; only add skb_enable_app_tstamp() to test
whether the orig_skb matches the usage of application SO_TIMESTAMPING
or its bpf extension. This makes it possible to support the two modes
in parallel later in this series.

Also, this patch deliberately distinguishes the software and hardware
SCM_TSTAMP_SND timestamps by passing a 'sw' parameter, to avoid the
case where hardware goes wrong and passes a NULL hwtstamps, which,
although unlikely, can happen. If it did, the bpf prog would end up
considering it a software timestamp, making it hard to recognize.
Let's make the timestamping part more robust.

Signed-off-by: Jason Xing <kerneljasonxing@gmail.com>
---
 include/linux/skbuff.h | 13 +++++++------
 net/core/dev.c         |  2 +-
 net/core/skbuff.c      | 32 ++++++++++++++++++++++++++++++--
 net/ipv4/tcp_input.c   |  3 ++-
 4 files changed, 40 insertions(+), 10 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index bb2b751d274a..dfc419281cc9 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -39,6 +39,7 @@
 #include <net/net_debug.h>
 #include <net/dropreason-core.h>
 #include <net/netmem.h>
+#include <uapi/linux/errqueue.h>
 
 /**
  * DOC: skb checksums
@@ -4533,18 +4534,18 @@ void skb_complete_tx_timestamp(struct sk_buff *skb,
 
 void __skb_tstamp_tx(struct sk_buff *orig_skb, const struct sk_buff *ack_skb,
 		     struct skb_shared_hwtstamps *hwtstamps,
-		     struct sock *sk, int tstype);
+		     struct sock *sk, bool sw, int tstype);
 
 /**
- * skb_tstamp_tx - queue clone of skb with send time stamps
+ * skb_tstamp_tx - queue clone of skb with send HARDWARE timestamps
  * @orig_skb:	the original outgoing packet
  * @hwtstamps:	hardware time stamps, may be NULL if not available
  *
  * If the skb has a socket associated, then this function clones the
  * skb (thus sharing the actual data and optional structures), stores
- * the optional hardware time stamping information (if non NULL) or
- * generates a software time stamp (otherwise), then queues the clone
- * to the error queue of the socket.  Errors are silently ignored.
+ * the optional hardware time stamping information (if non NULL) then
+ * queues the clone to the error queue of the socket.  Errors are
+ * silently ignored.
  */
 void skb_tstamp_tx(struct sk_buff *orig_skb,
 		   struct skb_shared_hwtstamps *hwtstamps);
@@ -4565,7 +4566,7 @@ static inline void skb_tx_timestamp(struct sk_buff *skb)
 {
 	skb_clone_tx_timestamp(skb);
 	if (skb_shinfo(skb)->tx_flags & SKBTX_SW_TSTAMP)
-		skb_tstamp_tx(skb, NULL);
+		__skb_tstamp_tx(skb, NULL, NULL, skb->sk, true, SCM_TSTAMP_SND);
 }
 
 /**
diff --git a/net/core/dev.c b/net/core/dev.c
index afa2282f2604..d77b8389753e 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4501,7 +4501,7 @@ int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
 	skb_assert_len(skb);
 
 	if (unlikely(skb_shinfo(skb)->tx_flags & SKBTX_SCHED_TSTAMP))
-		__skb_tstamp_tx(skb, NULL, NULL, skb->sk, SCM_TSTAMP_SCHED);
+		__skb_tstamp_tx(skb, NULL, NULL, skb->sk, true, SCM_TSTAMP_SCHED);
 
 	/* Disable soft irqs for various locks below. Also
 	 * stops preemption for RCU.
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index a441613a1e6c..6042961dfc02 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -5539,10 +5539,35 @@ void skb_complete_tx_timestamp(struct sk_buff *skb,
 }
 EXPORT_SYMBOL_GPL(skb_complete_tx_timestamp);
 
+static bool skb_enable_app_tstamp(struct sk_buff *skb, int tstype, bool sw)
+{
+	int flag;
+
+	switch (tstype) {
+	case SCM_TSTAMP_SCHED:
+		flag = SKBTX_SCHED_TSTAMP;
+		break;
+	case SCM_TSTAMP_SND:
+		flag = sw ? SKBTX_SW_TSTAMP : SKBTX_HW_TSTAMP;
+		break;
+	case SCM_TSTAMP_ACK:
+		if (TCP_SKB_CB(skb)->txstamp_ack)
+			return true;
+		fallthrough;
+	default:
+		return false;
+	}
+
+	if (skb_shinfo(skb)->tx_flags & flag)
+		return true;
+
+	return false;
+}
+
 void __skb_tstamp_tx(struct sk_buff *orig_skb,
 		     const struct sk_buff *ack_skb,
 		     struct skb_shared_hwtstamps *hwtstamps,
-		     struct sock *sk, int tstype)
+		     struct sock *sk, bool sw, int tstype)
 {
 	struct sk_buff *skb;
 	bool tsonly, opt_stats = false;
@@ -5551,6 +5576,9 @@ void __skb_tstamp_tx(struct sk_buff *orig_skb,
 	if (!sk)
 		return;
 
+	if (!skb_enable_app_tstamp(orig_skb, tstype, sw))
+		return;
+
 	tsflags = READ_ONCE(sk->sk_tsflags);
 	if (!hwtstamps && !(tsflags & SOF_TIMESTAMPING_OPT_TX_SWHW) &&
 	    skb_shinfo(orig_skb)->tx_flags & SKBTX_IN_PROGRESS)
@@ -5599,7 +5627,7 @@ EXPORT_SYMBOL_GPL(__skb_tstamp_tx);
 void skb_tstamp_tx(struct sk_buff *orig_skb,
 		   struct skb_shared_hwtstamps *hwtstamps)
 {
-	return __skb_tstamp_tx(orig_skb, NULL, hwtstamps, orig_skb->sk,
+	return __skb_tstamp_tx(orig_skb, NULL, hwtstamps, orig_skb->sk, false,
 			       SCM_TSTAMP_SND);
 }
 EXPORT_SYMBOL_GPL(skb_tstamp_tx);
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 77185479ed5e..62252702929d 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3330,7 +3330,8 @@ static void tcp_ack_tstamp(struct sock *sk, struct sk_buff *skb,
 	if (!before(shinfo->tskey, prior_snd_una) &&
 	    before(shinfo->tskey, tcp_sk(sk)->snd_una)) {
 		tcp_skb_tsorted_save(skb) {
-			__skb_tstamp_tx(skb, ack_skb, NULL, sk, SCM_TSTAMP_ACK);
+			__skb_tstamp_tx(skb, ack_skb, NULL, sk, true,
+					SCM_TSTAMP_ACK);
 		} tcp_skb_tsorted_restore(skb);
 	}
 }
-- 
2.43.5



* [PATCH bpf-next v8 06/12] bpf: support SCM_TSTAMP_SCHED of SO_TIMESTAMPING
  2025-02-04 18:30 [PATCH bpf-next v8 00/12] net-timestamp: bpf extension to equip applications transparently Jason Xing
                   ` (4 preceding siblings ...)
  2025-02-04 18:30 ` [PATCH bpf-next v8 05/12] net-timestamp: prepare for isolating two modes of SO_TIMESTAMPING Jason Xing
@ 2025-02-04 18:30 ` Jason Xing
  2025-02-05 15:36   ` Willem de Bruijn
  2025-02-04 18:30 ` [PATCH bpf-next v8 07/12] bpf: support sw SCM_TSTAMP_SND " Jason Xing
                   ` (5 subsequent siblings)
  11 siblings, 1 reply; 66+ messages in thread
From: Jason Xing @ 2025-02-04 18:30 UTC (permalink / raw)
  To: davem, edumazet, kuba, pabeni, dsahern, willemdebruijn.kernel,
	willemb, ast, daniel, andrii, martin.lau, eddyz87, song,
	yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa, horms
  Cc: bpf, netdev, Jason Xing

Introduce SKBTX_BPF as an indicator telling us whether the skb should
be traced by the bpf prog.

Signed-off-by: Jason Xing <kerneljasonxing@gmail.com>
---
 include/linux/skbuff.h         |  6 +++++-
 include/uapi/linux/bpf.h       |  4 ++++
 net/core/dev.c                 |  3 ++-
 net/core/skbuff.c              | 20 ++++++++++++++++++++
 tools/include/uapi/linux/bpf.h |  4 ++++
 5 files changed, 35 insertions(+), 2 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index dfc419281cc9..35c2e864dd4b 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -490,10 +490,14 @@ enum {
 
 	/* generate software time stamp when entering packet scheduling */
 	SKBTX_SCHED_TSTAMP = 1 << 6,
+
+	/* used for bpf extension when a bpf program is loaded */
+	SKBTX_BPF = 1 << 7,
 };
 
 #define SKBTX_ANY_SW_TSTAMP	(SKBTX_SW_TSTAMP    | \
-				 SKBTX_SCHED_TSTAMP)
+				 SKBTX_SCHED_TSTAMP | \
+				 SKBTX_BPF)
 #define SKBTX_ANY_TSTAMP	(SKBTX_HW_TSTAMP | \
 				 SKBTX_HW_TSTAMP_USE_CYCLES | \
 				 SKBTX_ANY_SW_TSTAMP)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 6116eb3d1515..30d2c078966b 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -7032,6 +7032,10 @@ enum {
 					 * by the kernel or the
 					 * earlier bpf-progs.
 					 */
+	BPF_SOCK_OPS_TS_SCHED_OPT_CB,	/* Called when skb is passing through
+					 * dev layer when SK_BPF_CB_TX_TIMESTAMPING
+					 * feature is on.
+					 */
 };
 
 /* List of TCP states. There is a build check in net/ipv4/tcp.c to detect
diff --git a/net/core/dev.c b/net/core/dev.c
index d77b8389753e..4f291459d6b1 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4500,7 +4500,8 @@ int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
 	skb_reset_mac_header(skb);
 	skb_assert_len(skb);
 
-	if (unlikely(skb_shinfo(skb)->tx_flags & SKBTX_SCHED_TSTAMP))
+	if (unlikely(skb_shinfo(skb)->tx_flags &
+		     (SKBTX_SCHED_TSTAMP | SKBTX_BPF)))
 		__skb_tstamp_tx(skb, NULL, NULL, skb->sk, true, SCM_TSTAMP_SCHED);
 
 	/* Disable soft irqs for various locks below. Also
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 6042961dfc02..b7261e886529 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -5564,6 +5564,21 @@ static bool skb_enable_app_tstamp(struct sk_buff *skb, int tstype, bool sw)
 	return false;
 }
 
+static void skb_tstamp_tx_bpf(struct sk_buff *skb, struct sock *sk, int tstype)
+{
+	int op;
+
+	switch (tstype) {
+	case SCM_TSTAMP_SCHED:
+		op = BPF_SOCK_OPS_TS_SCHED_OPT_CB;
+		break;
+	default:
+		return;
+	}
+
+	bpf_skops_tx_timestamping(sk, skb, op);
+}
+
 void __skb_tstamp_tx(struct sk_buff *orig_skb,
 		     const struct sk_buff *ack_skb,
 		     struct skb_shared_hwtstamps *hwtstamps,
@@ -5576,6 +5591,11 @@ void __skb_tstamp_tx(struct sk_buff *orig_skb,
 	if (!sk)
 		return;
 
+	/* bpf extension feature entry */
+	if (skb_shinfo(orig_skb)->tx_flags & SKBTX_BPF)
+		skb_tstamp_tx_bpf(orig_skb, sk, tstype);
+
+	/* application feature entry */
 	if (!skb_enable_app_tstamp(orig_skb, tstype, sw))
 		return;
 
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 70366f74ef4e..eed91b7296b7 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -7025,6 +7025,10 @@ enum {
 					 * by the kernel or the
 					 * earlier bpf-progs.
 					 */
+	BPF_SOCK_OPS_TS_SCHED_OPT_CB,	/* Called when skb is passing through
+					 * dev layer when SK_BPF_CB_TX_TIMESTAMPING
+					 * feature is on.
+					 */
 };
 
 /* List of TCP states. There is a build check in net/ipv4/tcp.c to detect
-- 
2.43.5



* [PATCH bpf-next v8 07/12] bpf: support sw SCM_TSTAMP_SND of SO_TIMESTAMPING
  2025-02-04 18:30 [PATCH bpf-next v8 00/12] net-timestamp: bpf extension to equip applications transparently Jason Xing
                   ` (5 preceding siblings ...)
  2025-02-04 18:30 ` [PATCH bpf-next v8 06/12] bpf: support SCM_TSTAMP_SCHED " Jason Xing
@ 2025-02-04 18:30 ` Jason Xing
  2025-02-04 18:30 ` [PATCH bpf-next v8 08/12] bpf: support hw " Jason Xing
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 66+ messages in thread
From: Jason Xing @ 2025-02-04 18:30 UTC (permalink / raw)
  To: davem, edumazet, kuba, pabeni, dsahern, willemdebruijn.kernel,
	willemb, ast, daniel, andrii, martin.lau, eddyz87, song,
	yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa, horms
  Cc: bpf, netdev, Jason Xing

Support the software SCM_TSTAMP_SND case. Users will then get the
software timestamp when the driver is about to send the skb. The
hardware timestamp will be supported in a later patch.

Signed-off-by: Jason Xing <kerneljasonxing@gmail.com>
---
 include/linux/skbuff.h         |  2 +-
 include/uapi/linux/bpf.h       |  4 ++++
 net/core/skbuff.c              | 10 ++++++++--
 tools/include/uapi/linux/bpf.h |  4 ++++
 4 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 35c2e864dd4b..de8d3bd311f5 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -4569,7 +4569,7 @@ void skb_tstamp_tx(struct sk_buff *orig_skb,
 static inline void skb_tx_timestamp(struct sk_buff *skb)
 {
 	skb_clone_tx_timestamp(skb);
-	if (skb_shinfo(skb)->tx_flags & SKBTX_SW_TSTAMP)
+	if (skb_shinfo(skb)->tx_flags & (SKBTX_SW_TSTAMP | SKBTX_BPF))
 		__skb_tstamp_tx(skb, NULL, NULL, skb->sk, true, SCM_TSTAMP_SND);
 }
 
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 30d2c078966b..6a1083bcf779 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -7036,6 +7036,10 @@ enum {
 					 * dev layer when SK_BPF_CB_TX_TIMESTAMPING
 					 * feature is on.
 					 */
+	BPF_SOCK_OPS_TS_SW_OPT_CB,	/* Called when skb is about to send
+					 * to the nic when SK_BPF_CB_TX_TIMESTAMPING
+					 * feature is on.
+					 */
 };
 
 /* List of TCP states. There is a build check in net/ipv4/tcp.c to detect
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index b7261e886529..b22d079e7143 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -5564,7 +5564,8 @@ static bool skb_enable_app_tstamp(struct sk_buff *skb, int tstype, bool sw)
 	return false;
 }
 
-static void skb_tstamp_tx_bpf(struct sk_buff *skb, struct sock *sk, int tstype)
+static void skb_tstamp_tx_bpf(struct sk_buff *skb, struct sock *sk,
+			      int tstype, bool sw)
 {
 	int op;
 
@@ -5572,6 +5573,11 @@ static void skb_tstamp_tx_bpf(struct sk_buff *skb, struct sock *sk, int tstype)
 	case SCM_TSTAMP_SCHED:
 		op = BPF_SOCK_OPS_TS_SCHED_OPT_CB;
 		break;
+	case SCM_TSTAMP_SND:
+		if (!sw)
+			return;
+		op = BPF_SOCK_OPS_TS_SW_OPT_CB;
+		break;
 	default:
 		return;
 	}
@@ -5593,7 +5599,7 @@ void __skb_tstamp_tx(struct sk_buff *orig_skb,
 
 	/* bpf extension feature entry */
 	if (skb_shinfo(orig_skb)->tx_flags & SKBTX_BPF)
-		skb_tstamp_tx_bpf(orig_skb, sk, tstype);
+		skb_tstamp_tx_bpf(orig_skb, sk, tstype, sw);
 
 	/* application feature entry */
 	if (!skb_enable_app_tstamp(orig_skb, tstype, sw))
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index eed91b7296b7..9bd1c7c77b17 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -7029,6 +7029,10 @@ enum {
 					 * dev layer when SK_BPF_CB_TX_TIMESTAMPING
 					 * feature is on.
 					 */
+	BPF_SOCK_OPS_TS_SW_OPT_CB,	/* Called when skb is about to send
+					 * to the nic when SK_BPF_CB_TX_TIMESTAMPING
+					 * feature is on.
+					 */
 };
 
 /* List of TCP states. There is a build check in net/ipv4/tcp.c to detect
-- 
2.43.5



* [PATCH bpf-next v8 08/12] bpf: support hw SCM_TSTAMP_SND of SO_TIMESTAMPING
  2025-02-04 18:30 [PATCH bpf-next v8 00/12] net-timestamp: bpf extension to equip applications transparently Jason Xing
                   ` (6 preceding siblings ...)
  2025-02-04 18:30 ` [PATCH bpf-next v8 07/12] bpf: support sw SCM_TSTAMP_SND " Jason Xing
@ 2025-02-04 18:30 ` Jason Xing
  2025-02-05 15:45   ` Willem de Bruijn
  2025-02-04 18:30 ` [PATCH bpf-next v8 09/12] bpf: support SCM_TSTAMP_ACK " Jason Xing
                   ` (3 subsequent siblings)
  11 siblings, 1 reply; 66+ messages in thread
From: Jason Xing @ 2025-02-04 18:30 UTC (permalink / raw)
  To: davem, edumazet, kuba, pabeni, dsahern, willemdebruijn.kernel,
	willemb, ast, daniel, andrii, martin.lau, eddyz87, song,
	yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa, horms
  Cc: bpf, netdev, Jason Xing

This patch finishes the hardware part. The bpf program can then fetch
the hwtstamp from the skb directly.

To avoid changing the many driver callers that test SKBTX_HW_TSTAMP,
use a simple modification like this patch does: redefine
SKBTX_HW_TSTAMP as a mask that also covers SKBTX_BPF, so that the
hardware timestamp can be reported.

Signed-off-by: Jason Xing <kerneljasonxing@gmail.com>
---
 include/linux/skbuff.h         |  4 +++-
 include/uapi/linux/bpf.h       |  7 +++++++
 net/core/skbuff.c              | 13 +++++++------
 net/dsa/user.c                 |  2 +-
 net/socket.c                   |  2 +-
 tools/include/uapi/linux/bpf.h |  7 +++++++
 6 files changed, 26 insertions(+), 9 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index de8d3bd311f5..df2d790ae36b 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -471,7 +471,7 @@ struct skb_shared_hwtstamps {
 /* Definitions for tx_flags in struct skb_shared_info */
 enum {
 	/* generate hardware time stamp */
-	SKBTX_HW_TSTAMP = 1 << 0,
+	__SKBTX_HW_TSTAMP = 1 << 0,
 
 	/* generate software time stamp when queueing packet to NIC */
 	SKBTX_SW_TSTAMP = 1 << 1,
@@ -495,6 +495,8 @@ enum {
 	SKBTX_BPF = 1 << 7,
 };
 
+#define SKBTX_HW_TSTAMP		(__SKBTX_HW_TSTAMP | SKBTX_BPF)
+
 #define SKBTX_ANY_SW_TSTAMP	(SKBTX_SW_TSTAMP    | \
 				 SKBTX_SCHED_TSTAMP | \
 				 SKBTX_BPF)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 6a1083bcf779..4c3566f623c2 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -7040,6 +7040,13 @@ enum {
 					 * to the nic when SK_BPF_CB_TX_TIMESTAMPING
 					 * feature is on.
 					 */
+	BPF_SOCK_OPS_TS_HW_OPT_CB,	/* Called in hardware phase when
+					 * SK_BPF_CB_TX_TIMESTAMPING feature
+					 * is on. At the same time, hwtstamps
+					 * of skb is initialized as the
+					 * timestamp that hardware just
+					 * generates.
+					 */
 };
 
 /* List of TCP states. There is a build check in net/ipv4/tcp.c to detect
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index b22d079e7143..264435f989ad 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -5548,7 +5548,7 @@ static bool skb_enable_app_tstamp(struct sk_buff *skb, int tstype, bool sw)
 		flag = SKBTX_SCHED_TSTAMP;
 		break;
 	case SCM_TSTAMP_SND:
-		flag = sw ? SKBTX_SW_TSTAMP : SKBTX_HW_TSTAMP;
+		flag = sw ? SKBTX_SW_TSTAMP : __SKBTX_HW_TSTAMP;
 		break;
 	case SCM_TSTAMP_ACK:
 		if (TCP_SKB_CB(skb)->txstamp_ack)
@@ -5565,7 +5565,8 @@ static bool skb_enable_app_tstamp(struct sk_buff *skb, int tstype, bool sw)
 }
 
 static void skb_tstamp_tx_bpf(struct sk_buff *skb, struct sock *sk,
-			      int tstype, bool sw)
+			      int tstype, bool sw,
+			      struct skb_shared_hwtstamps *hwtstamps)
 {
 	int op;
 
@@ -5574,9 +5575,9 @@ static void skb_tstamp_tx_bpf(struct sk_buff *skb, struct sock *sk,
 		op = BPF_SOCK_OPS_TS_SCHED_OPT_CB;
 		break;
 	case SCM_TSTAMP_SND:
-		if (!sw)
-			return;
-		op = BPF_SOCK_OPS_TS_SW_OPT_CB;
+		op = sw ? BPF_SOCK_OPS_TS_SW_OPT_CB : BPF_SOCK_OPS_TS_HW_OPT_CB;
+		if (!sw && hwtstamps)
+			*skb_hwtstamps(skb) = *hwtstamps;
 		break;
 	default:
 		return;
@@ -5599,7 +5600,7 @@ void __skb_tstamp_tx(struct sk_buff *orig_skb,
 
 	/* bpf extension feature entry */
 	if (skb_shinfo(orig_skb)->tx_flags & SKBTX_BPF)
-		skb_tstamp_tx_bpf(orig_skb, sk, tstype, sw);
+		skb_tstamp_tx_bpf(orig_skb, sk, tstype, sw, hwtstamps);
 
 	/* application feature entry */
 	if (!skb_enable_app_tstamp(orig_skb, tstype, sw))
diff --git a/net/dsa/user.c b/net/dsa/user.c
index 291ab1b4acc4..ae715bf0ae75 100644
--- a/net/dsa/user.c
+++ b/net/dsa/user.c
@@ -897,7 +897,7 @@ static void dsa_skb_tx_timestamp(struct dsa_user_priv *p,
 {
 	struct dsa_switch *ds = p->dp->ds;
 
-	if (!(skb_shinfo(skb)->tx_flags & SKBTX_HW_TSTAMP))
+	if (!(skb_shinfo(skb)->tx_flags & __SKBTX_HW_TSTAMP))
 		return;
 
 	if (!ds->ops->port_txtstamp)
diff --git a/net/socket.c b/net/socket.c
index 262a28b59c7f..70eabb510ce6 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -676,7 +676,7 @@ void __sock_tx_timestamp(__u32 tsflags, __u8 *tx_flags)
 	u8 flags = *tx_flags;
 
 	if (tsflags & SOF_TIMESTAMPING_TX_HARDWARE) {
-		flags |= SKBTX_HW_TSTAMP;
+		flags |= __SKBTX_HW_TSTAMP;
 
 		/* PTP hardware clocks can provide a free running cycle counter
 		 * as a time base for virtual clocks. Tell driver to use the
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 9bd1c7c77b17..974b7f61d11f 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -7033,6 +7033,13 @@ enum {
 					 * to the nic when SK_BPF_CB_TX_TIMESTAMPING
 					 * feature is on.
 					 */
+	BPF_SOCK_OPS_TS_HW_OPT_CB,	/* Called in hardware phase when
+					 * SK_BPF_CB_TX_TIMESTAMPING feature
+					 * is on. At the same time, hwtstamps
+					 * of skb is initialized as the
+					 * timestamp that hardware just
+					 * generates.
+					 */
 };
 
 /* List of TCP states. There is a build check in net/ipv4/tcp.c to detect
-- 
2.43.5



* [PATCH bpf-next v8 09/12] bpf: support SCM_TSTAMP_ACK of SO_TIMESTAMPING
  2025-02-04 18:30 [PATCH bpf-next v8 00/12] net-timestamp: bpf extension to equip applications transparently Jason Xing
                   ` (7 preceding siblings ...)
  2025-02-04 18:30 ` [PATCH bpf-next v8 08/12] bpf: support hw " Jason Xing
@ 2025-02-04 18:30 ` Jason Xing
  2025-02-05 15:47   ` Willem de Bruijn
  2025-02-04 18:30 ` [PATCH bpf-next v8 10/12] bpf: make TCP tx timestamp bpf extension work Jason Xing
                   ` (2 subsequent siblings)
  11 siblings, 1 reply; 66+ messages in thread
From: Jason Xing @ 2025-02-04 18:30 UTC (permalink / raw)
  To: davem, edumazet, kuba, pabeni, dsahern, willemdebruijn.kernel,
	willemb, ast, daniel, andrii, martin.lau, eddyz87, song,
	yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa, horms
  Cc: bpf, netdev, Jason Xing

Handle the ACK timestamp case. Testing the SKBTX_BPF flag alone would
work, but a new txstamp_ack_bpf bit is introduced to avoid cache line
misses in tcp_ack_tstamp(). To be more specific, in most cases normal
flows do not access skb_shinfo() because txstamp_ack is zero, so this
function does not show up in hot spot profiles. The new
txstamp_ack_bpf bit works the same way.

Signed-off-by: Jason Xing <kerneljasonxing@gmail.com>
---
 include/net/tcp.h              | 3 ++-
 include/uapi/linux/bpf.h       | 5 +++++
 net/core/skbuff.c              | 3 +++
 net/ipv4/tcp_input.c           | 3 ++-
 net/ipv4/tcp_output.c          | 5 +++++
 tools/include/uapi/linux/bpf.h | 5 +++++
 6 files changed, 22 insertions(+), 2 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 293047694710..88429e422301 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -959,9 +959,10 @@ struct tcp_skb_cb {
 	__u8		sacked;		/* State flags for SACK.	*/
 	__u8		ip_dsfield;	/* IPv4 tos or IPv6 dsfield	*/
 	__u8		txstamp_ack:1,	/* Record TX timestamp for ack? */
+			txstamp_ack_bpf:1,	/* ack timestamp for bpf use */
 			eor:1,		/* Is skb MSG_EOR marked? */
 			has_rxtstamp:1,	/* SKB has a RX timestamp	*/
-			unused:5;
+			unused:4;
 	__u32		ack_seq;	/* Sequence number ACK'd	*/
 	union {
 		struct {
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 4c3566f623c2..800122a8abe5 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -7047,6 +7047,11 @@ enum {
 					 * timestamp that hardware just
 					 * generates.
 					 */
+	BPF_SOCK_OPS_TS_ACK_OPT_CB,	/* Called when all the skbs in the
+					 * same sendmsg call are acked
+					 * when SK_BPF_CB_TX_TIMESTAMPING
+					 * feature is on.
+					 */
 };
 
 /* List of TCP states. There is a build check in net/ipv4/tcp.c to detect
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 264435f989ad..a8463fef574a 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -5579,6 +5579,9 @@ static void skb_tstamp_tx_bpf(struct sk_buff *skb, struct sock *sk,
 		if (!sw && hwtstamps)
 			*skb_hwtstamps(skb) = *hwtstamps;
 		break;
+	case SCM_TSTAMP_ACK:
+		op = BPF_SOCK_OPS_TS_ACK_OPT_CB;
+		break;
 	default:
 		return;
 	}
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 62252702929d..c8945f5be31b 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3323,7 +3323,8 @@ static void tcp_ack_tstamp(struct sock *sk, struct sk_buff *skb,
 	const struct skb_shared_info *shinfo;
 
 	/* Avoid cache line misses to get skb_shinfo() and shinfo->tx_flags */
-	if (likely(!TCP_SKB_CB(skb)->txstamp_ack))
+	if (likely(!TCP_SKB_CB(skb)->txstamp_ack &&
+		   !TCP_SKB_CB(skb)->txstamp_ack_bpf))
 		return;
 
 	shinfo = skb_shinfo(skb);
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 695749807c09..fc84ca669b76 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1556,6 +1556,7 @@ static void tcp_adjust_pcount(struct sock *sk, const struct sk_buff *skb, int de
 static bool tcp_has_tx_tstamp(const struct sk_buff *skb)
 {
 	return TCP_SKB_CB(skb)->txstamp_ack ||
+	       TCP_SKB_CB(skb)->txstamp_ack_bpf ||
 		(skb_shinfo(skb)->tx_flags & SKBTX_ANY_TSTAMP);
 }
 
@@ -1572,7 +1573,9 @@ static void tcp_fragment_tstamp(struct sk_buff *skb, struct sk_buff *skb2)
 		shinfo2->tx_flags |= tsflags;
 		swap(shinfo->tskey, shinfo2->tskey);
 		TCP_SKB_CB(skb2)->txstamp_ack = TCP_SKB_CB(skb)->txstamp_ack;
+		TCP_SKB_CB(skb2)->txstamp_ack_bpf = TCP_SKB_CB(skb)->txstamp_ack_bpf;
 		TCP_SKB_CB(skb)->txstamp_ack = 0;
+		TCP_SKB_CB(skb)->txstamp_ack_bpf = 0;
 	}
 }
 
@@ -3213,6 +3216,8 @@ void tcp_skb_collapse_tstamp(struct sk_buff *skb,
 		shinfo->tskey = next_shinfo->tskey;
 		TCP_SKB_CB(skb)->txstamp_ack |=
 			TCP_SKB_CB(next_skb)->txstamp_ack;
+		TCP_SKB_CB(skb)->txstamp_ack_bpf |=
+			TCP_SKB_CB(next_skb)->txstamp_ack_bpf;
 	}
 }
 
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 974b7f61d11f..06e68d772989 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -7040,6 +7040,11 @@ enum {
 					 * timestamp that hardware just
 					 * generates.
 					 */
+	BPF_SOCK_OPS_TS_ACK_OPT_CB,	/* Called when all the skbs in the
+					 * same sendmsg call are acked
+					 * when SK_BPF_CB_TX_TIMESTAMPING
+					 * feature is on.
+					 */
 };
 
 /* List of TCP states. There is a build check in net/ipv4/tcp.c to detect
-- 
2.43.5



* [PATCH bpf-next v8 10/12] bpf: make TCP tx timestamp bpf extension work
  2025-02-04 18:30 [PATCH bpf-next v8 00/12] net-timestamp: bpf extension to equip applications transparently Jason Xing
                   ` (8 preceding siblings ...)
  2025-02-04 18:30 ` [PATCH bpf-next v8 09/12] bpf: support SCM_TSTAMP_ACK " Jason Xing
@ 2025-02-04 18:30 ` Jason Xing
  2025-02-05  1:57   ` Jakub Kicinski
  2025-02-04 18:30 ` [PATCH bpf-next v8 11/12] bpf: add a new callback in tcp_tx_timestamp() Jason Xing
  2025-02-04 18:30 ` [PATCH bpf-next v8 12/12] selftests/bpf: add simple bpf tests in the tx path for timestamping feature Jason Xing
  11 siblings, 1 reply; 66+ messages in thread
From: Jason Xing @ 2025-02-04 18:30 UTC (permalink / raw)
  To: davem, edumazet, kuba, pabeni, dsahern, willemdebruijn.kernel,
	willemb, ast, daniel, andrii, martin.lau, eddyz87, song,
	yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa, horms
  Cc: bpf, netdev, Jason Xing

After this patch, the bpf prog can fully trace the tx path for TCP
sockets.

Signed-off-by: Jason Xing <kerneljasonxing@gmail.com>
---
 net/ipv4/tcp.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 0d704bda6c41..3df802410ebf 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -492,6 +492,16 @@ static void tcp_tx_timestamp(struct sock *sk, struct sockcm_cookie *sockc)
 		if (tsflags & SOF_TIMESTAMPING_TX_RECORD_MASK)
 			shinfo->tskey = TCP_SKB_CB(skb)->seq + skb->len - 1;
 	}
+
+	if (cgroup_bpf_enabled(CGROUP_SOCK_OPS) &&
+	    SK_BPF_CB_FLAG_TEST(sk, SK_BPF_CB_TX_TIMESTAMPING) && skb) {
+		struct skb_shared_info *shinfo = skb_shinfo(skb);
+		struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
+
+		tcb->txstamp_ack_bpf = 1;
+		shinfo->tx_flags |= SKBTX_BPF;
+		shinfo->tskey = TCP_SKB_CB(skb)->seq + skb->len - 1;
+	}
 }
 
 static bool tcp_stream_is_readable(struct sock *sk, int target)
-- 
2.43.5



* [PATCH bpf-next v8 11/12] bpf: add a new callback in tcp_tx_timestamp()
  2025-02-04 18:30 [PATCH bpf-next v8 00/12] net-timestamp: bpf extension to equip applications transparently Jason Xing
                   ` (9 preceding siblings ...)
  2025-02-04 18:30 ` [PATCH bpf-next v8 10/12] bpf: make TCP tx timestamp bpf extension work Jason Xing
@ 2025-02-04 18:30 ` Jason Xing
  2025-02-05  5:28   ` Jason Xing
  2025-02-04 18:30 ` [PATCH bpf-next v8 12/12] selftests/bpf: add simple bpf tests in the tx path for timestamping feature Jason Xing
  11 siblings, 1 reply; 66+ messages in thread
From: Jason Xing @ 2025-02-04 18:30 UTC (permalink / raw)
  To: davem, edumazet, kuba, pabeni, dsahern, willemdebruijn.kernel,
	willemb, ast, daniel, andrii, martin.lau, eddyz87, song,
	yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa, horms
  Cc: bpf, netdev, Jason Xing

Introduce a callback to correlate the tcp_sendmsg timestamp with the
other points, like SND/SW/ACK. Let the bpf prog trace the beginning of
tcp_sendmsg_locked() and store the sendmsg timestamp in
bpf_sk_storage, so that tcp_tx_timestamp() can correlate that
timestamp with the tskey found at the other sending points.

More details can be found in the selftest:
The selftest uses the bpf_sk_storage to store the sendmsg timestamp at
fentry/tcp_sendmsg_locked and retrieves it back at tcp_tx_timestamp
(i.e. BPF_SOCK_OPS_TS_SND_CB added in this patch).

Signed-off-by: Jason Xing <kerneljasonxing@gmail.com>
---
 include/uapi/linux/bpf.h       | 7 +++++++
 net/ipv4/tcp.c                 | 1 +
 tools/include/uapi/linux/bpf.h | 7 +++++++
 3 files changed, 15 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 800122a8abe5..accb3b314fff 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -7052,6 +7052,13 @@ enum {
 					 * when SK_BPF_CB_TX_TIMESTAMPING
 					 * feature is on.
 					 */
+	BPF_SOCK_OPS_TS_SND_CB,		/* Called when every sendmsg syscall
+					 * is triggered. For TCP, it stays
+					 * in the last send process to
+					 * correlate with tcp_sendmsg timestamp
+					 * with other timestamping callbacks,
+					 * like SND/SW/ACK.
+					 */
 };
 
 /* List of TCP states. There is a build check in net/ipv4/tcp.c to detect
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 3df802410ebf..a2ac57543b6d 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -501,6 +501,7 @@ static void tcp_tx_timestamp(struct sock *sk, struct sockcm_cookie *sockc)
 		tcb->txstamp_ack_bpf = 1;
 		shinfo->tx_flags |= SKBTX_BPF;
 		shinfo->tskey = TCP_SKB_CB(skb)->seq + skb->len - 1;
+		bpf_skops_tx_timestamping(sk, skb, BPF_SOCK_OPS_TS_SND_CB);
 	}
 }
 
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 06e68d772989..384502996cdd 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -7045,6 +7045,13 @@ enum {
 					 * when SK_BPF_CB_TX_TIMESTAMPING
 					 * feature is on.
 					 */
+	BPF_SOCK_OPS_TS_SND_CB,		/* Called when every sendmsg syscall
+					 * is triggered. For TCP, it stays
+					 * in the last send process to
+					 * correlate with tcp_sendmsg timestamp
+					 * with other timestamping callbacks,
+					 * like SND/SW/ACK.
+					 */
 };
 
 /* List of TCP states. There is a build check in net/ipv4/tcp.c to detect
-- 
2.43.5



* [PATCH bpf-next v8 12/12] selftests/bpf: add simple bpf tests in the tx path for timestamping feature
  2025-02-04 18:30 [PATCH bpf-next v8 00/12] net-timestamp: bpf extension to equip applications transparently Jason Xing
                   ` (10 preceding siblings ...)
  2025-02-04 18:30 ` [PATCH bpf-next v8 11/12] bpf: add a new callback in tcp_tx_timestamp() Jason Xing
@ 2025-02-04 18:30 ` Jason Xing
  2025-02-05 15:54   ` Willem de Bruijn
  11 siblings, 1 reply; 66+ messages in thread
From: Jason Xing @ 2025-02-04 18:30 UTC (permalink / raw)
  To: davem, edumazet, kuba, pabeni, dsahern, willemdebruijn.kernel,
	willemb, ast, daniel, andrii, martin.lau, eddyz87, song,
	yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa, horms
  Cc: bpf, netdev, Jason Xing

The bpf prog calculates the latency deltas between the tx points that
the SO_TIMESTAMPING feature already implements. It can be used in the
real world to diagnose behaviour in the tx path.

Also, check the safety restrictions by invoking a few disallowed bpf
calls in bpf_test_access_bpf_calls().

Signed-off-by: Jason Xing <kerneljasonxing@gmail.com>
---
 .../bpf/prog_tests/so_timestamping.c          |  79 +++++
 .../selftests/bpf/progs/so_timestamping.c     | 306 ++++++++++++++++++
 2 files changed, 385 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/so_timestamping.c
 create mode 100644 tools/testing/selftests/bpf/progs/so_timestamping.c

diff --git a/tools/testing/selftests/bpf/prog_tests/so_timestamping.c b/tools/testing/selftests/bpf/prog_tests/so_timestamping.c
new file mode 100644
index 000000000000..1829f93bc52e
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/so_timestamping.c
@@ -0,0 +1,79 @@
+#include "test_progs.h"
+#include "network_helpers.h"
+
+#include "so_timestamping.skel.h"
+
+#define CG_NAME "/so-timestamping-test"
+
+static const char addr4_str[] = "127.0.0.1";
+static const char addr6_str[] = "::1";
+static struct so_timestamping *skel;
+
+static void test_tcp(int family)
+{
+	struct so_timestamping__bss *bss = skel->bss;
+	char buf[] = "testing testing";
+	int sfd = -1, cfd = -1;
+	int n;
+
+	memset(bss, 0, sizeof(*bss));
+
+	sfd = start_server(family, SOCK_STREAM,
+			   family == AF_INET6 ? addr6_str : addr4_str, 0, 0);
+	if (!ASSERT_OK_FD(sfd, "start_server"))
+		goto out;
+
+	cfd = connect_to_fd(sfd, 0);
+	if (!ASSERT_OK_FD(cfd, "connect_to_fd_server"))
+		goto out;
+
+	n = write(cfd, buf, sizeof(buf));
+	if (!ASSERT_EQ(n, sizeof(buf), "send to server"))
+		goto out;
+
+	ASSERT_EQ(bss->nr_active, 1, "nr_active");
+	ASSERT_EQ(bss->nr_snd, 2, "nr_snd");
+	ASSERT_EQ(bss->nr_sched, 1, "nr_sched");
+	ASSERT_EQ(bss->nr_txsw, 1, "nr_txsw");
+	ASSERT_EQ(bss->nr_ack, 1, "nr_ack");
+
+out:
+	if (sfd >= 0)
+		close(sfd);
+	if (cfd >= 0)
+		close(cfd);
+}
+
+void test_so_timestamping(void)
+{
+	struct netns_obj *ns;
+	int cg_fd;
+
+	cg_fd = test__join_cgroup(CG_NAME);
+	if (!ASSERT_OK_FD(cg_fd, "join cgroup"))
+		return;
+
+	ns = netns_new("so_timestamping_ns", true);
+	if (!ASSERT_OK_PTR(ns, "create ns"))
+		goto done;
+
+	skel = so_timestamping__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "open and load skel"))
+		goto done;
+
+	if (!ASSERT_OK(so_timestamping__attach(skel), "attach skel"))
+		goto done;
+
+	skel->links.skops_sockopt =
+		bpf_program__attach_cgroup(skel->progs.skops_sockopt, cg_fd);
+	if (!ASSERT_OK_PTR(skel->links.skops_sockopt, "attach cgroup"))
+		goto done;
+
+	test_tcp(AF_INET6);
+	test_tcp(AF_INET);
+
+done:
+	so_timestamping__destroy(skel);
+	netns_free(ns);
+	close(cg_fd);
+}
diff --git a/tools/testing/selftests/bpf/progs/so_timestamping.c b/tools/testing/selftests/bpf/progs/so_timestamping.c
new file mode 100644
index 000000000000..dc8bbcfd9eb5
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/so_timestamping.c
@@ -0,0 +1,306 @@
+#include "vmlinux.h"
+#include "bpf_tracing_net.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+#include "bpf_misc.h"
+#include "bpf_kfuncs.h"
+#define BPF_PROG_TEST_TCP_HDR_OPTIONS
+#include "test_tcp_hdr_options.h"
+#include <errno.h>
+
+#define SK_BPF_CB_FLAGS 1009
+#define SK_BPF_CB_TX_TIMESTAMPING 1
+
+int nr_active;
+int nr_snd;
+int nr_passive;
+int nr_sched;
+int nr_txsw;
+int nr_ack;
+
+struct sockopt_test {
+	int opt;
+	int new;
+};
+
+static const struct sockopt_test sol_socket_tests[] = {
+	{ .opt = SK_BPF_CB_FLAGS, .new = SK_BPF_CB_TX_TIMESTAMPING, },
+	{ .opt = 0, },
+};
+
+struct loop_ctx {
+	void *ctx;
+	const struct sock *sk;
+};
+
+struct sk_stg {
+	__u64 sendmsg_ns;	/* record ts when sendmsg is called */
+};
+
+struct sk_tskey {
+	u64 cookie;
+	u32 tskey;
+};
+
+struct delay_info {
+	u64 sendmsg_ns;		/* record ts when sendmsg is called */
+	u32 sched_delay;	/* SCHED_OPT_CB - sendmsg_ns */
+	u32 sw_snd_delay;	/* SW_OPT_CB - SCHED_OPT_CB */
+	u32 ack_delay;		/* ACK_OPT_CB - SW_OPT_CB */
+};
+
+struct {
+	__uint(type, BPF_MAP_TYPE_SK_STORAGE);
+	__uint(map_flags, BPF_F_NO_PREALLOC);
+	__type(key, int);
+	__type(value, struct sk_stg);
+} sk_stg_map SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__type(key, struct sk_tskey);
+	__type(value, struct delay_info);
+	__uint(max_entries, 1024);
+} time_map SEC(".maps");
+
+static u64 delay_tolerance_nsec = 10000000000; /* 10 second as an example */
+
+static int bpf_test_sockopt_int(void *ctx, const struct sock *sk,
+				const struct sockopt_test *t,
+				int level)
+{
+	int new, opt, tmp;
+
+	opt = t->opt;
+	new = t->new;
+
+	if (bpf_setsockopt(ctx, level, opt, &new, sizeof(new)))
+		return 1;
+
+	if (bpf_getsockopt(ctx, level, opt, &tmp, sizeof(tmp)) ||
+	    tmp != new)
+		return 1;
+
+	return 0;
+}
+
+static int bpf_test_socket_sockopt(__u32 i, struct loop_ctx *lc)
+{
+	const struct sockopt_test *t;
+
+	if (i >= ARRAY_SIZE(sol_socket_tests))
+		return 1;
+
+	t = &sol_socket_tests[i];
+	if (!t->opt)
+		return 1;
+
+	return bpf_test_sockopt_int(lc->ctx, lc->sk, t, SOL_SOCKET);
+}
+
+static int bpf_test_sockopt(void *ctx, const struct sock *sk)
+{
+	struct loop_ctx lc = { .ctx = ctx, .sk = sk, };
+	int n;
+
+	n = bpf_loop(ARRAY_SIZE(sol_socket_tests), bpf_test_socket_sockopt, &lc, 0);
+	if (n != ARRAY_SIZE(sol_socket_tests))
+		return -1;
+
+	return 0;
+}
+
+static bool bpf_test_access_sockopt(void *ctx)
+{
+	const struct sockopt_test *t;
+	int tmp, ret, i = 0;
+	int level = SOL_SOCKET;
+
+	t = &sol_socket_tests[i];
+
+	for (; t->opt;) {
+		ret = bpf_setsockopt(ctx, level, t->opt, (void *)&t->new, sizeof(t->new));
+		if (ret != -EOPNOTSUPP)
+			return true;
+
+		ret = bpf_getsockopt(ctx, level, t->opt, &tmp, sizeof(tmp));
+		if (ret != -EOPNOTSUPP)
+			return true;
+
+		if (++i >= ARRAY_SIZE(sol_socket_tests))
+			break;
+	}
+
+	return false;
+}
+
+/* Adding a simple test to see if we can get an expected value */
+static bool bpf_test_access_load_hdr_opt(struct bpf_sock_ops *skops)
+{
+	struct tcp_opt reg_opt;
+	int load_flags = 0;
+	int ret;
+
+	reg_opt.kind = TCPOPT_EXP;
+	reg_opt.len = 0;
+	reg_opt.data32 = 0;
+	ret = bpf_load_hdr_opt(skops, &reg_opt, sizeof(reg_opt), load_flags);
+	if (ret != -EOPNOTSUPP)
+		return true;
+
+	return false;
+}
+
+/* Adding a simple test to see if we can get an expected value */
+static bool bpf_test_access_cb_flags_set(struct bpf_sock_ops *skops)
+{
+	int ret;
+
+	ret = bpf_sock_ops_cb_flags_set(skops, 0);
+	if (ret != -EOPNOTSUPP)
+		return true;
+
+	return false;
+}
+
+/* In the timestamping callbacks, we're not allowed to call the following
+ * BPF CALLs for the safety concern. Return false if expected.
+ */
+static bool bpf_test_access_bpf_calls(struct bpf_sock_ops *skops,
+				     const struct sock *sk)
+{
+	if (bpf_test_access_sockopt(skops))
+		return true;
+
+	if (bpf_test_access_load_hdr_opt(skops))
+		return true;
+
+	if (bpf_test_access_cb_flags_set(skops))
+		return true;
+
+	return false;
+}
+
+static bool bpf_test_delay(struct bpf_sock_ops *skops, const struct sock *sk)
+{
+	struct bpf_sock_ops_kern *skops_kern;
+	u64 timestamp = bpf_ktime_get_ns();
+	struct skb_shared_info *shinfo;
+	struct delay_info dinfo = {0};
+	struct sk_tskey key = {0};
+	struct delay_info *val;
+	struct sk_buff *skb;
+	struct sk_stg *stg;
+	u64 prior_ts, delay;
+
+	if (bpf_test_access_bpf_calls(skops, sk))
+		return false;
+
+	skops_kern = bpf_cast_to_kern_ctx(skops);
+	skb = skops_kern->skb;
+	shinfo = bpf_core_cast(skb->head + skb->end, struct skb_shared_info);
+	key.tskey = shinfo->tskey;
+	if (!key.tskey)
+		return false;
+
+	key.cookie = bpf_get_socket_cookie(skops);
+	if (!key.cookie)
+		return false;
+
+	if (skops->op == BPF_SOCK_OPS_TS_SND_CB) {
+		stg = bpf_sk_storage_get(&sk_stg_map, (void *)sk, 0, 0);
+		if (!stg)
+			return false;
+		dinfo.sendmsg_ns = stg->sendmsg_ns;
+		bpf_map_update_elem(&time_map, &key, &dinfo, BPF_ANY);
+		goto out;
+	}
+
+	val = bpf_map_lookup_elem(&time_map, &key);
+	if (!val)
+		return false;
+
+	switch (skops->op) {
+	case BPF_SOCK_OPS_TS_SCHED_OPT_CB:
+		delay = val->sched_delay = timestamp - val->sendmsg_ns;
+		break;
+	case BPF_SOCK_OPS_TS_SW_OPT_CB:
+		prior_ts = val->sched_delay + val->sendmsg_ns;
+		delay = val->sw_snd_delay = timestamp - prior_ts;
+		break;
+	case BPF_SOCK_OPS_TS_ACK_OPT_CB:
+		prior_ts = val->sw_snd_delay + val->sched_delay + val->sendmsg_ns;
+		delay = val->ack_delay = timestamp - prior_ts;
+		break;
+	default:
+		return false;
+	}
+
+	if (delay >= delay_tolerance_nsec)
+		return false;
+
+	/* Since it's the last one, remove from the map after latency check */
+	if (skops->op == BPF_SOCK_OPS_TS_ACK_OPT_CB)
+		bpf_map_delete_elem(&time_map, &key);
+
+out:
+	return true;
+}
+
+SEC("fentry/tcp_sendmsg_locked")
+int BPF_PROG(trace_tcp_sendmsg_locked, struct sock *sk, struct msghdr *msg, size_t size)
+{
+	u64 timestamp = bpf_ktime_get_ns();
+	u32 flag = sk->sk_bpf_cb_flags;
+	struct sk_stg *stg;
+
+	if (!flag)
+		return 0;
+
+	stg = bpf_sk_storage_get(&sk_stg_map, sk, 0,
+				 BPF_SK_STORAGE_GET_F_CREATE);
+	if (!stg)
+		return 0;
+
+	stg->sendmsg_ns = timestamp;
+	nr_snd += 1;
+	return 0;
+}
+
+SEC("sockops")
+int skops_sockopt(struct bpf_sock_ops *skops)
+{
+	struct bpf_sock *bpf_sk = skops->sk;
+	const struct sock *sk;
+
+	if (!bpf_sk)
+		return 1;
+
+	sk = (struct sock *)bpf_skc_to_tcp_sock(bpf_sk);
+	if (!sk)
+		return 1;
+
+	switch (skops->op) {
+	case BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB:
+		nr_active += !bpf_test_sockopt(skops, sk);
+		break;
+	case BPF_SOCK_OPS_TS_SND_CB:
+		if (bpf_test_delay(skops, sk))
+			nr_snd += 1;
+		break;
+	case BPF_SOCK_OPS_TS_SCHED_OPT_CB:
+		if (bpf_test_delay(skops, sk))
+			nr_sched += 1;
+		break;
+	case BPF_SOCK_OPS_TS_SW_OPT_CB:
+		if (bpf_test_delay(skops, sk))
+			nr_txsw += 1;
+		break;
+	case BPF_SOCK_OPS_TS_ACK_OPT_CB:
+		if (bpf_test_delay(skops, sk))
+			nr_ack += 1;
+		break;
+	}
+
+	return 1;
+}
+
+char _license[] SEC("license") = "GPL";
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* Re: [PATCH bpf-next v8 05/12] net-timestamp: prepare for isolating two modes of SO_TIMESTAMPING
  2025-02-04 18:30 ` [PATCH bpf-next v8 05/12] net-timestamp: prepare for isolating two modes of SO_TIMESTAMPING Jason Xing
@ 2025-02-05  1:47   ` Jakub Kicinski
  2025-02-05  2:40     ` Jason Xing
  2025-02-05  1:50   ` Jakub Kicinski
  2025-02-05 15:34   ` Willem de Bruijn
  2 siblings, 1 reply; 66+ messages in thread
From: Jakub Kicinski @ 2025-02-05  1:47 UTC (permalink / raw)
  To: Jason Xing
  Cc: davem, edumazet, pabeni, dsahern, willemdebruijn.kernel, willemb,
	ast, daniel, andrii, martin.lau, eddyz87, song, yonghong.song,
	john.fastabend, kpsingh, sdf, haoluo, jolsa, horms, bpf, netdev

On Wed,  5 Feb 2025 02:30:17 +0800 Jason Xing wrote:
> @@ -4565,7 +4566,7 @@ static inline void skb_tx_timestamp(struct sk_buff *skb)
>  {
>  	skb_clone_tx_timestamp(skb);
>  	if (skb_shinfo(skb)->tx_flags & SKBTX_SW_TSTAMP)
> -		skb_tstamp_tx(skb, NULL);
> +		__skb_tstamp_tx(skb, NULL, NULL, skb->sk, true, SCM_TSTAMP_SND);
>  }

Please move skb_tx_timestamp() to net/core/timestamping.c
You can make skb_clone_tx_timestamp() static, this is its only caller.
This way on balance we won't be adding any non-inlined calls,
and we don't have to drag the linux/errqueue.h include into skbuff.h


* Re: [PATCH bpf-next v8 05/12] net-timestamp: prepare for isolating two modes of SO_TIMESTAMPING
  2025-02-04 18:30 ` [PATCH bpf-next v8 05/12] net-timestamp: prepare for isolating two modes of SO_TIMESTAMPING Jason Xing
  2025-02-05  1:47   ` Jakub Kicinski
@ 2025-02-05  1:50   ` Jakub Kicinski
  2025-02-05 15:34   ` Willem de Bruijn
  2 siblings, 0 replies; 66+ messages in thread
From: Jakub Kicinski @ 2025-02-05  1:50 UTC (permalink / raw)
  To: Jason Xing
  Cc: davem, edumazet, pabeni, dsahern, willemdebruijn.kernel, willemb,
	ast, daniel, andrii, martin.lau, eddyz87, song, yonghong.song,
	john.fastabend, kpsingh, sdf, haoluo, jolsa, horms, bpf, netdev

On Wed,  5 Feb 2025 02:30:17 +0800 Jason Xing wrote:
>  void __skb_tstamp_tx(struct sk_buff *orig_skb,
>  		     const struct sk_buff *ack_skb,
>  		     struct skb_shared_hwtstamps *hwtstamps,
> -		     struct sock *sk, int tstype)
> +		     struct sock *sk, bool sw, int tstype)
>  {
>  	struct sk_buff *skb;
>  	bool tsonly, opt_stats = false;
> @@ -5551,6 +5576,9 @@ void __skb_tstamp_tx(struct sk_buff *orig_skb,
>  	if (!sk)
>  		return;
>  
> +	if (!skb_enable_app_tstamp(orig_skb, tstype, sw))

maybe keep the order of @tstype vs @sw consistent?


* Re: [PATCH bpf-next v8 10/12] bpf: make TCP tx timestamp bpf extension work
  2025-02-04 18:30 ` [PATCH bpf-next v8 10/12] bpf: make TCP tx timestamp bpf extension work Jason Xing
@ 2025-02-05  1:57   ` Jakub Kicinski
  2025-02-05  2:15     ` Jason Xing
  2025-02-05 21:57     ` Martin KaFai Lau
  0 siblings, 2 replies; 66+ messages in thread
From: Jakub Kicinski @ 2025-02-05  1:57 UTC (permalink / raw)
  To: Jason Xing
  Cc: davem, edumazet, pabeni, dsahern, willemdebruijn.kernel, willemb,
	ast, daniel, andrii, martin.lau, eddyz87, song, yonghong.song,
	john.fastabend, kpsingh, sdf, haoluo, jolsa, horms, bpf, netdev

On Wed,  5 Feb 2025 02:30:22 +0800 Jason Xing wrote:
> +	if (cgroup_bpf_enabled(CGROUP_SOCK_OPS) &&
> +	    SK_BPF_CB_FLAG_TEST(sk, SK_BPF_CB_TX_TIMESTAMPING) && skb) {
> +		struct skb_shared_info *shinfo = skb_shinfo(skb);
> +		struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
> +
> +		tcb->txstamp_ack_bpf = 1;
> +		shinfo->tx_flags |= SKBTX_BPF;
> +		shinfo->tskey = TCP_SKB_CB(skb)->seq + skb->len - 1;
> +	}

If BPF program is attached we'll timestamp all skbs? Am I reading this
right?

Wouldn't it be better to let BPF_SOCK_OPS_TS_SND_CB return whether it's
interested in tracing current packet all the way thru the stack?


* Re: [PATCH bpf-next v8 10/12] bpf: make TCP tx timestamp bpf extension work
  2025-02-05  1:57   ` Jakub Kicinski
@ 2025-02-05  2:15     ` Jason Xing
  2025-02-05 21:57     ` Martin KaFai Lau
  1 sibling, 0 replies; 66+ messages in thread
From: Jason Xing @ 2025-02-05  2:15 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: davem, edumazet, pabeni, dsahern, willemdebruijn.kernel, willemb,
	ast, daniel, andrii, martin.lau, eddyz87, song, yonghong.song,
	john.fastabend, kpsingh, sdf, haoluo, jolsa, horms, bpf, netdev

On Wed, Feb 5, 2025 at 9:57 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Wed,  5 Feb 2025 02:30:22 +0800 Jason Xing wrote:
> > +     if (cgroup_bpf_enabled(CGROUP_SOCK_OPS) &&
> > +         SK_BPF_CB_FLAG_TEST(sk, SK_BPF_CB_TX_TIMESTAMPING) && skb) {
> > +             struct skb_shared_info *shinfo = skb_shinfo(skb);
> > +             struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
> > +
> > +             tcb->txstamp_ack_bpf = 1;
> > +             shinfo->tx_flags |= SKBTX_BPF;
> > +             shinfo->tskey = TCP_SKB_CB(skb)->seq + skb->len - 1;
> > +     }
>
> If BPF program is attached we'll timestamp all skbs? Am I reading this
> right?

For now, not really, because tcp_tx_timestamp() gets called only for
the last skb of each sendmsg(). So not all of the skbs will be traced.

>
> Wouldn't it be better to let BPF_SOCK_OPS_TS_SND_CB return whether it's
> interested in tracing current packet all the way thru the stack?

This flag is mainly used to correlate the sendmsg timestamp with the
corresponding tskey; otherwise the skb would traverse the qdisc layer
with no way to match it back to its sendmsg() call.

Thanks,
Jason


* Re: [PATCH bpf-next v8 05/12] net-timestamp: prepare for isolating two modes of SO_TIMESTAMPING
  2025-02-05  1:47   ` Jakub Kicinski
@ 2025-02-05  2:40     ` Jason Xing
  2025-02-05  3:14       ` Jakub Kicinski
  0 siblings, 1 reply; 66+ messages in thread
From: Jason Xing @ 2025-02-05  2:40 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: davem, edumazet, pabeni, dsahern, willemdebruijn.kernel, willemb,
	ast, daniel, andrii, martin.lau, eddyz87, song, yonghong.song,
	john.fastabend, kpsingh, sdf, haoluo, jolsa, horms, bpf, netdev

On Wed, Feb 5, 2025 at 9:47 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Wed,  5 Feb 2025 02:30:17 +0800 Jason Xing wrote:
> > @@ -4565,7 +4566,7 @@ static inline void skb_tx_timestamp(struct sk_buff *skb)
> >  {
> >       skb_clone_tx_timestamp(skb);
> >       if (skb_shinfo(skb)->tx_flags & SKBTX_SW_TSTAMP)
> > -             skb_tstamp_tx(skb, NULL);
> > +             __skb_tstamp_tx(skb, NULL, NULL, skb->sk, true, SCM_TSTAMP_SND);
> >  }
>
> Please move skb_tx_timestamp() to net/core/timestamping.c
> You can make skb_clone_tx_timestamp() static, this is its only caller.

I just tested it and it works after reading your message.

I wonder if we need a separate cleanup after this series about moving
this kind of functions into net/core/timestamping.c, say,
__skb_tstamp_tx()?

Thanks,
Jason

> This way on balance we won't be adding any non-inlined calls,
> and we don't have to drag the linux/errqueue.h include into skbuff.h


* Re: [PATCH bpf-next v8 05/12] net-timestamp: prepare for isolating two modes of SO_TIMESTAMPING
  2025-02-05  2:40     ` Jason Xing
@ 2025-02-05  3:14       ` Jakub Kicinski
  2025-02-05  3:23         ` Jason Xing
  0 siblings, 1 reply; 66+ messages in thread
From: Jakub Kicinski @ 2025-02-05  3:14 UTC (permalink / raw)
  To: Jason Xing
  Cc: davem, edumazet, pabeni, dsahern, willemdebruijn.kernel, willemb,
	ast, daniel, andrii, martin.lau, eddyz87, song, yonghong.song,
	john.fastabend, kpsingh, sdf, haoluo, jolsa, horms, bpf, netdev

On Wed, 5 Feb 2025 10:40:42 +0800 Jason Xing wrote:
> I wonder if we need a separate cleanup after this series about moving
> this kind of functions into net/core/timestamping.c, say,
> __skb_tstamp_tx()?

IMHO no need to go too far, just move the one function as part of this
series. The only motivation is to avoid adding includes to
linux/skbuff.h since skbuff.h is included in something like 8k objects.


* Re: [PATCH bpf-next v8 05/12] net-timestamp: prepare for isolating two modes of SO_TIMESTAMPING
  2025-02-05  3:14       ` Jakub Kicinski
@ 2025-02-05  3:23         ` Jason Xing
  0 siblings, 0 replies; 66+ messages in thread
From: Jason Xing @ 2025-02-05  3:23 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: davem, edumazet, pabeni, dsahern, willemdebruijn.kernel, willemb,
	ast, daniel, andrii, martin.lau, eddyz87, song, yonghong.song,
	john.fastabend, kpsingh, sdf, haoluo, jolsa, horms, bpf, netdev

On Wed, Feb 5, 2025 at 11:14 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Wed, 5 Feb 2025 10:40:42 +0800 Jason Xing wrote:
> > I wonder if we need a separate cleanup after this series about moving
> > this kind of functions into net/core/timestamping.c, say,
> > __skb_tstamp_tx()?
>
> IMHO no need to go too far, just move the one function as part of this
> series. The only motivation is to avoid adding includes to
> linux/skbuff.h since skbuff.h is included in something like 8k objects.

Thanks for clarifying. Will do it in the re-spin.

Thanks,
Jason


* Re: [PATCH bpf-next v8 11/12] bpf: add a new callback in tcp_tx_timestamp()
  2025-02-04 18:30 ` [PATCH bpf-next v8 11/12] bpf: add a new callback in tcp_tx_timestamp() Jason Xing
@ 2025-02-05  5:28   ` Jason Xing
  0 siblings, 0 replies; 66+ messages in thread
From: Jason Xing @ 2025-02-05  5:28 UTC (permalink / raw)
  To: davem, edumazet, kuba, pabeni, dsahern, willemdebruijn.kernel,
	willemb, ast, daniel, andrii, martin.lau, eddyz87, song,
	yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa, horms
  Cc: bpf, netdev

On Wed, Feb 5, 2025 at 2:31 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
>
> Introduce a callback to correlate the tcp_sendmsg timestamp with other
> points, like SND/SW/ACK. Let the bpf prog trace the beginning of
> tcp_sendmsg_locked() and then store the sendmsg timestamp in
> bpf_sk_storage, so that in tcp_tx_timestamp() we can correlate
> the timestamp with the tskey that can be found at the other sending points.
>
> More details can be found in the selftest:
> The selftest uses the bpf_sk_storage to store the sendmsg timestamp at
> fentry/tcp_sendmsg_locked and retrieves it back at tcp_tx_timestamp
> (i.e. BPF_SOCK_OPS_TS_SND_CB added in this patch).
>
> Signed-off-by: Jason Xing <kerneljasonxing@gmail.com>
> ---
>  include/uapi/linux/bpf.h       | 7 +++++++
>  net/ipv4/tcp.c                 | 1 +
>  tools/include/uapi/linux/bpf.h | 7 +++++++
>  3 files changed, 15 insertions(+)
>
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 800122a8abe5..accb3b314fff 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -7052,6 +7052,13 @@ enum {
>                                          * when SK_BPF_CB_TX_TIMESTAMPING
>                                          * feature is on.
>                                          */
> +       BPF_SOCK_OPS_TS_SND_CB,         /* Called for every sendmsg syscall.
> +                                        * For TCP, it runs at the end of the
> +                                        * send process to correlate the
> +                                        * tcp_sendmsg timestamp with other
> +                                        * timestamping callbacks, like
> +                                        * SND/SW/ACK.
> +                                        */
>  };

In case the use of the new flag is buried in many threads, I decided to
rephrase it here to show how UDP would use it:
1. introduce a field ts_opt_id_bpf which works like ts_opt_id[1] to allow
the bpf program to fully take control of the management of tskey.
2. use an fentry hook on udp_sendmsg(), and introduce a callback function
like BPF_SOCK_OPS_TIMEOUT_INIT in the kernel to initialize
ts_opt_id_bpf with the tskey that the bpf prog generates. We can directly use
BPF_SOCK_OPS_TS_SND_CB.
3. modify the SCM_TS_OPT_ID logic to support bpf extension so that the
newly added field ts_opt_id_bpf can be passed to the
skb_shinfo(skb)->tskey in __ip_append_data().

In this way, this approach can also be extended for other protocols.

[1]
commit 4aecca4c76808f3736056d18ff510df80424bc9f
Author: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Date:   Tue Oct 1 05:57:14 2024 -0700

    net_tstamp: add SCM_TS_OPT_ID to provide OPT_ID in control message

    SOF_TIMESTAMPING_OPT_ID socket option flag gives a way to correlate TX
    timestamps and packets sent via socket. Unfortunately, there is no way
    to reliably predict socket timestamp ID value in case of error returned
    by sendmsg. For UDP sockets it's impossible because of lockless
    nature of UDP transmit, several threads may send packets in parallel. In
    case of RAW sockets MSG_MORE option makes things complicated. More
    details are in the conversation [1].
    This patch adds new control message type to give user-space
    software an opportunity to control the mapping between packets and
    values by providing ID with each sendmsg for UDP sockets.
    The documentation is also added in this patch.

    [1] https://lore.kernel.org/netdev/CALCETrU0jB+kg0mhV6A8mrHfTE1D1pr1SD_B9Eaa9aDPfgHdtA@mail.gmail.com/

Thanks,
Jason

>
>  /* List of TCP states. There is a build check in net/ipv4/tcp.c to detect
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 3df802410ebf..a2ac57543b6d 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -501,6 +501,7 @@ static void tcp_tx_timestamp(struct sock *sk, struct sockcm_cookie *sockc)
>                 tcb->txstamp_ack_bpf = 1;
>                 shinfo->tx_flags |= SKBTX_BPF;
>                 shinfo->tskey = TCP_SKB_CB(skb)->seq + skb->len - 1;
> +               bpf_skops_tx_timestamping(sk, skb, BPF_SOCK_OPS_TS_SND_CB);
>         }
>  }
>
> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> index 06e68d772989..384502996cdd 100644
> --- a/tools/include/uapi/linux/bpf.h
> +++ b/tools/include/uapi/linux/bpf.h
> @@ -7045,6 +7045,13 @@ enum {
>                                          * when SK_BPF_CB_TX_TIMESTAMPING
>                                          * feature is on.
>                                          */
> +       BPF_SOCK_OPS_TS_SND_CB,         /* Called for every sendmsg syscall.
> +                                        * For TCP, it runs at the end of the
> +                                        * send process to correlate the
> +                                        * tcp_sendmsg timestamp with other
> +                                        * timestamping callbacks, like
> +                                        * SND/SW/ACK.
> +                                        */
>  };
>
>  /* List of TCP states. There is a build check in net/ipv4/tcp.c to detect
> --
> 2.43.5
>


* Re: [PATCH bpf-next v8 01/12] bpf: add support for bpf_setsockopt()
  2025-02-04 18:30 ` [PATCH bpf-next v8 01/12] bpf: add support for bpf_setsockopt() Jason Xing
@ 2025-02-05 15:22   ` Willem de Bruijn
  2025-02-05 15:34     ` Jason Xing
  0 siblings, 1 reply; 66+ messages in thread
From: Willem de Bruijn @ 2025-02-05 15:22 UTC (permalink / raw)
  To: Jason Xing, davem, edumazet, kuba, pabeni, dsahern,
	willemdebruijn.kernel, willemb, ast, daniel, andrii, martin.lau,
	eddyz87, song, yonghong.song, john.fastabend, kpsingh, sdf,
	haoluo, jolsa, horms
  Cc: bpf, netdev, Jason Xing

Jason Xing wrote:
> Users can write the following code to enable the bpf extension:
> int flags = SK_BPF_CB_TX_TIMESTAMPING;
> int opts = SK_BPF_CB_FLAGS;
> bpf_setsockopt(skops, SOL_SOCKET, opts, &flags, sizeof(flags));
> 
> Signed-off-by: Jason Xing <kerneljasonxing@gmail.com>
> ---
>  include/net/sock.h             |  3 +++
>  include/uapi/linux/bpf.h       |  8 ++++++++
>  net/core/filter.c              | 23 +++++++++++++++++++++++
>  tools/include/uapi/linux/bpf.h |  1 +
>  4 files changed, 35 insertions(+)
> 
> diff --git a/include/net/sock.h b/include/net/sock.h
> index 8036b3b79cd8..7916982343c6 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -303,6 +303,7 @@ struct sk_filter;
>    *	@sk_stamp: time stamp of last packet received
>    *	@sk_stamp_seq: lock for accessing sk_stamp on 32 bit architectures only
>    *	@sk_tsflags: SO_TIMESTAMPING flags
> +  *	@sk_bpf_cb_flags: used in bpf_setsockopt()
>    *	@sk_use_task_frag: allow sk_page_frag() to use current->task_frag.
>    *			   Sockets that can be used under memory reclaim should
>    *			   set this to false.
> @@ -445,6 +446,8 @@ struct sock {
>  	u32			sk_reserved_mem;
>  	int			sk_forward_alloc;
>  	u32			sk_tsflags;
> +#define SK_BPF_CB_FLAG_TEST(SK, FLAG) ((SK)->sk_bpf_cb_flags & (FLAG))
> +	u32			sk_bpf_cb_flags;
>  	__cacheline_group_end(sock_write_rxtx);
>  
>  	__cacheline_group_begin(sock_write_tx);
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 2acf9b336371..6116eb3d1515 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -6913,6 +6913,13 @@ enum {
>  	BPF_SOCK_OPS_ALL_CB_FLAGS       = 0x7F,
>  };
>  
> +/* Definitions for bpf_sk_cb_flags */
> +enum {
> +	SK_BPF_CB_TX_TIMESTAMPING	= 1<<0,
> +	SK_BPF_CB_MASK			= (SK_BPF_CB_TX_TIMESTAMPING - 1) |
> +					   SK_BPF_CB_TX_TIMESTAMPING
> +};
> +
>  /* List of known BPF sock_ops operators.
>   * New entries can only be added at the end
>   */
> @@ -7091,6 +7098,7 @@ enum {
>  	TCP_BPF_SYN_IP		= 1006, /* Copy the IP[46] and TCP header */
>  	TCP_BPF_SYN_MAC         = 1007, /* Copy the MAC, IP[46], and TCP header */
>  	TCP_BPF_SOCK_OPS_CB_FLAGS = 1008, /* Get or Set TCP sock ops flags */
> +	SK_BPF_CB_FLAGS		= 1009, /* Used to set socket bpf flags */
>  };
>  
>  enum {
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 2ec162dd83c4..1c6c07507a78 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -5222,6 +5222,25 @@ static const struct bpf_func_proto bpf_get_socket_uid_proto = {
>  	.arg1_type      = ARG_PTR_TO_CTX,
>  };
>  
> +static int sk_bpf_set_get_cb_flags(struct sock *sk, char *optval, bool getopt)
> +{
> +	u32 sk_bpf_cb_flags;
> +
> +	if (getopt) {
> +		*(u32 *)optval = sk->sk_bpf_cb_flags;
> +		return 0;
> +	}
> +
> +	sk_bpf_cb_flags = *(u32 *)optval;
> +
> +	if (sk_bpf_cb_flags & ~SK_BPF_CB_MASK)
> +		return -EINVAL;
> +
> +	sk->sk_bpf_cb_flags = sk_bpf_cb_flags;

I don't know BPF internals that well:

Is there mutual exclusion between these sol_socket_sockopt calls?
Or do these sk field accesses need WRITE_ONCE/READ_ONCE.

> +
> +	return 0;
> +}
> +
>  static int sol_socket_sockopt(struct sock *sk, int optname,
>  			      char *optval, int *optlen,
>  			      bool getopt)
> @@ -5238,6 +5257,7 @@ static int sol_socket_sockopt(struct sock *sk, int optname,
>  	case SO_MAX_PACING_RATE:
>  	case SO_BINDTOIFINDEX:
>  	case SO_TXREHASH:
> +	case SK_BPF_CB_FLAGS:
>  		if (*optlen != sizeof(int))
>  			return -EINVAL;
>  		break;
> @@ -5247,6 +5267,9 @@ static int sol_socket_sockopt(struct sock *sk, int optname,
>  		return -EINVAL;
>  	}
>  
> +	if (optname == SK_BPF_CB_FLAGS)
> +		return sk_bpf_set_get_cb_flags(sk, optval, getopt);
> +
>  	if (getopt) {
>  		if (optname == SO_BINDTODEVICE)
>  			return -EINVAL;
> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> index 2acf9b336371..70366f74ef4e 100644
> --- a/tools/include/uapi/linux/bpf.h
> +++ b/tools/include/uapi/linux/bpf.h
> @@ -7091,6 +7091,7 @@ enum {
>  	TCP_BPF_SYN_IP		= 1006, /* Copy the IP[46] and TCP header */
>  	TCP_BPF_SYN_MAC         = 1007, /* Copy the MAC, IP[46], and TCP header */
>  	TCP_BPF_SOCK_OPS_CB_FLAGS = 1008, /* Get or Set TCP sock ops flags */
> +	SK_BPF_CB_FLAGS		= 1009, /* Used to set socket bpf flags */
>  };
>  
>  enum {
> -- 
> 2.43.5
> 




* Re: [PATCH bpf-next v8 03/12] bpf: stop unsafely accessing TCP fields in bpf callbacks
  2025-02-04 18:30 ` [PATCH bpf-next v8 03/12] bpf: stop unsafely accessing TCP fields in bpf callbacks Jason Xing
@ 2025-02-05 15:24   ` Willem de Bruijn
  2025-02-05 15:35     ` Jason Xing
  0 siblings, 1 reply; 66+ messages in thread
From: Willem de Bruijn @ 2025-02-05 15:24 UTC (permalink / raw)
  To: Jason Xing, davem, edumazet, kuba, pabeni, dsahern,
	willemdebruijn.kernel, willemb, ast, daniel, andrii, martin.lau,
	eddyz87, song, yonghong.song, john.fastabend, kpsingh, sdf,
	haoluo, jolsa, horms
  Cc: bpf, netdev, Jason Xing

Jason Xing wrote:
> The "allow_tcp_access" flag is added to indicate that the callback
> site has a tcp_sock locked.
> 
> Setting the new member allow_tcp_access in the existing callbacks
> where is_fullsock is set to 1 stops a UDP socket from accessing struct
> tcp_sock, and likewise stops a TCP socket that is not protected by the
> sk lock; otherwise it could be a catastrophe leading to a panic.
> 
> To keep it simple, instead of distinguishing between read and write
> access, users aren't allowed all read/write access to the tcp_sock
> through the older bpf_sock_ops ctx. The new timestamping callbacks
> can use newer helpers to read everything from a sk (e.g. bpf_core_cast),
> so nothing is lost.
> 
> Signed-off-by: Jason Xing <kerneljasonxing@gmail.com>
> ---
>  include/linux/filter.h | 5 +++++
>  include/net/tcp.h      | 1 +
>  net/core/filter.c      | 8 ++++----
>  net/ipv4/tcp_input.c   | 2 ++
>  net/ipv4/tcp_output.c  | 2 ++
>  5 files changed, 14 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/filter.h b/include/linux/filter.h
> index a3ea46281595..1569e9f31a8c 100644
> --- a/include/linux/filter.h
> +++ b/include/linux/filter.h
> @@ -1508,6 +1508,11 @@ struct bpf_sock_ops_kern {
>  	void	*skb_data_end;
>  	u8	op;
>  	u8	is_fullsock;
> +	u8	allow_tcp_access;	/* Indicate that the callback site
> +					 * has a tcp_sock locked. Then it
> +					 * would be safe to access struct
> +					 * tcp_sock.
> +					 */

perhaps no need for explicit documentation if the variable name is
self-documenting: is_locked_tcp_sock

>  	u8	remaining_opt_len;
>  	u64	temp;			/* temp and everything after is not
>  					 * initialized to 0 before calling
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index 5b2b04835688..293047694710 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -2649,6 +2649,7 @@ static inline int tcp_call_bpf(struct sock *sk, int op, u32 nargs, u32 *args)
>  	memset(&sock_ops, 0, offsetof(struct bpf_sock_ops_kern, temp));
>  	if (sk_fullsock(sk)) {
>  		sock_ops.is_fullsock = 1;
> +		sock_ops.allow_tcp_access = 1;
>  		sock_owned_by_me(sk);
>  	}
>  
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 1c6c07507a78..dc0e67c5776a 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -10381,10 +10381,10 @@ static u32 sock_ops_convert_ctx_access(enum bpf_access_type type,
>  		}							      \
>  		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(			      \
>  						struct bpf_sock_ops_kern,     \
> -						is_fullsock),		      \
> +						allow_tcp_access),	      \
>  				      fullsock_reg, si->src_reg,	      \
>  				      offsetof(struct bpf_sock_ops_kern,      \
> -					       is_fullsock));		      \
> +					       allow_tcp_access));	      \
>  		*insn++ = BPF_JMP_IMM(BPF_JEQ, fullsock_reg, 0, jmp);	      \
>  		if (si->dst_reg == si->src_reg)				      \
>  			*insn++ = BPF_LDX_MEM(BPF_DW, reg, si->src_reg,	      \
> @@ -10469,10 +10469,10 @@ static u32 sock_ops_convert_ctx_access(enum bpf_access_type type,
>  					       temp));			      \
>  		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(			      \
>  						struct bpf_sock_ops_kern,     \
> -						is_fullsock),		      \
> +						allow_tcp_access),	      \
>  				      reg, si->dst_reg,			      \
>  				      offsetof(struct bpf_sock_ops_kern,      \
> -					       is_fullsock));		      \
> +					       allow_tcp_access));	      \
>  		*insn++ = BPF_JMP_IMM(BPF_JEQ, reg, 0, 2);		      \
>  		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(			      \
>  						struct bpf_sock_ops_kern, sk),\
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index eb82e01da911..77185479ed5e 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -169,6 +169,7 @@ static void bpf_skops_parse_hdr(struct sock *sk, struct sk_buff *skb)
>  	memset(&sock_ops, 0, offsetof(struct bpf_sock_ops_kern, temp));
>  	sock_ops.op = BPF_SOCK_OPS_PARSE_HDR_OPT_CB;
>  	sock_ops.is_fullsock = 1;
> +	sock_ops.allow_tcp_access = 1;
>  	sock_ops.sk = sk;
>  	bpf_skops_init_skb(&sock_ops, skb, tcp_hdrlen(skb));
>  
> @@ -185,6 +186,7 @@ static void bpf_skops_established(struct sock *sk, int bpf_op,
>  	memset(&sock_ops, 0, offsetof(struct bpf_sock_ops_kern, temp));
>  	sock_ops.op = bpf_op;
>  	sock_ops.is_fullsock = 1;
> +	sock_ops.allow_tcp_access = 1;
>  	sock_ops.sk = sk;
>  	/* sk with TCP_REPAIR_ON does not have skb in tcp_finish_connect */
>  	if (skb)
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 0e5b9a654254..695749807c09 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -522,6 +522,7 @@ static void bpf_skops_hdr_opt_len(struct sock *sk, struct sk_buff *skb,
>  		sock_owned_by_me(sk);
>  
>  		sock_ops.is_fullsock = 1;
> +		sock_ops.allow_tcp_access = 1;
>  		sock_ops.sk = sk;
>  	}
>  
> @@ -567,6 +568,7 @@ static void bpf_skops_write_hdr_opt(struct sock *sk, struct sk_buff *skb,
>  		sock_owned_by_me(sk);
>  
>  		sock_ops.is_fullsock = 1;
> +		sock_ops.allow_tcp_access = 1;
>  		sock_ops.sk = sk;
>  	}
>  
> -- 
> 2.43.5
> 




* Re: [PATCH bpf-next v8 04/12] bpf: stop calling some sock_op BPF CALLs in new timestamping callbacks
  2025-02-04 18:30 ` [PATCH bpf-next v8 04/12] bpf: stop calling some sock_op BPF CALLs in new timestamping callbacks Jason Xing
@ 2025-02-05 15:26   ` Willem de Bruijn
  2025-02-05 15:50     ` Jason Xing
  0 siblings, 1 reply; 66+ messages in thread
From: Willem de Bruijn @ 2025-02-05 15:26 UTC (permalink / raw)
  To: Jason Xing, davem, edumazet, kuba, pabeni, dsahern,
	willemdebruijn.kernel, willemb, ast, daniel, andrii, martin.lau,
	eddyz87, song, yonghong.song, john.fastabend, kpsingh, sdf,
	haoluo, jolsa, horms
  Cc: bpf, netdev, Jason Xing

Jason Xing wrote:
> Simply disallow calling bpf_sock_ops_setsockopt/getsockopt,
> bpf_sock_ops_cb_flags_set, and the bpf_sock_ops_load_hdr_opt for
> the new timestamping callbacks for the safety consideration.

Please reword this: Disallow .. unless this is operating on a locked
TCP socket. Or something along those lines.
 
> Besides, In the next round, the UDP proto for SO_TIMESTAMPING bpf
> extension will be supported, so there should be no safety problem,
> which is usually caused by UDP socket trying to access TCP fields.

Besides is probably the wrong word here: this is not an aside, but
the actual reason for this test, if I follow correctly.

> Signed-off-by: Jason Xing <kerneljasonxing@gmail.com>
> ---
>  net/core/filter.c | 17 +++++++++++++++++
>  1 file changed, 17 insertions(+)
> 
> diff --git a/net/core/filter.c b/net/core/filter.c
> index dc0e67c5776a..d3395ffe058e 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -5523,6 +5523,11 @@ static int __bpf_setsockopt(struct sock *sk, int level, int optname,
>  	return -EINVAL;
>  }
>  
> +static bool is_locked_tcp_sock_ops(struct bpf_sock_ops_kern *bpf_sock)
> +{
> +	return bpf_sock->op <= BPF_SOCK_OPS_WRITE_HDR_OPT_CB;
> +}
> +
>  static int _bpf_setsockopt(struct sock *sk, int level, int optname,
>  			   char *optval, int optlen)
>  {
> @@ -5673,6 +5678,9 @@ static const struct bpf_func_proto bpf_sock_addr_getsockopt_proto = {
>  BPF_CALL_5(bpf_sock_ops_setsockopt, struct bpf_sock_ops_kern *, bpf_sock,
>  	   int, level, int, optname, char *, optval, int, optlen)
>  {
> +	if (!is_locked_tcp_sock_ops(bpf_sock))
> +		return -EOPNOTSUPP;
> +
>  	return _bpf_setsockopt(bpf_sock->sk, level, optname, optval, optlen);
>  }
>  
> @@ -5758,6 +5766,9 @@ static int bpf_sock_ops_get_syn(struct bpf_sock_ops_kern *bpf_sock,
>  BPF_CALL_5(bpf_sock_ops_getsockopt, struct bpf_sock_ops_kern *, bpf_sock,
>  	   int, level, int, optname, char *, optval, int, optlen)
>  {
> +	if (!is_locked_tcp_sock_ops(bpf_sock))
> +		return -EOPNOTSUPP;
> +
>  	if (IS_ENABLED(CONFIG_INET) && level == SOL_TCP &&
>  	    optname >= TCP_BPF_SYN && optname <= TCP_BPF_SYN_MAC) {
>  		int ret, copy_len = 0;
> @@ -5800,6 +5811,9 @@ BPF_CALL_2(bpf_sock_ops_cb_flags_set, struct bpf_sock_ops_kern *, bpf_sock,
>  	struct sock *sk = bpf_sock->sk;
>  	int val = argval & BPF_SOCK_OPS_ALL_CB_FLAGS;
>  
> +	if (!is_locked_tcp_sock_ops(bpf_sock))
> +		return -EOPNOTSUPP;
> +
>  	if (!IS_ENABLED(CONFIG_INET) || !sk_fullsock(sk))
>  		return -EINVAL;
>  
> @@ -7609,6 +7623,9 @@ BPF_CALL_4(bpf_sock_ops_load_hdr_opt, struct bpf_sock_ops_kern *, bpf_sock,
>  	u8 search_kind, search_len, copy_len, magic_len;
>  	int ret;
>  
> +	if (!is_locked_tcp_sock_ops(bpf_sock))
> +		return -EOPNOTSUPP;
> +
>  	/* 2 byte is the minimal option len except TCPOPT_NOP and
>  	 * TCPOPT_EOL which are useless for the bpf prog to learn
>  	 * and this helper disallow loading them also.
> -- 
> 2.43.5
> 




* Re: [PATCH bpf-next v8 05/12] net-timestamp: prepare for isolating two modes of SO_TIMESTAMPING
  2025-02-04 18:30 ` [PATCH bpf-next v8 05/12] net-timestamp: prepare for isolating two modes of SO_TIMESTAMPING Jason Xing
  2025-02-05  1:47   ` Jakub Kicinski
  2025-02-05  1:50   ` Jakub Kicinski
@ 2025-02-05 15:34   ` Willem de Bruijn
  2025-02-05 15:52     ` Jason Xing
  2025-02-06  8:43     ` Jason Xing
  2 siblings, 2 replies; 66+ messages in thread
From: Willem de Bruijn @ 2025-02-05 15:34 UTC (permalink / raw)
  To: Jason Xing, davem, edumazet, kuba, pabeni, dsahern,
	willemdebruijn.kernel, willemb, ast, daniel, andrii, martin.lau,
	eddyz87, song, yonghong.song, john.fastabend, kpsingh, sdf,
	haoluo, jolsa, horms
  Cc: bpf, netdev, Jason Xing

Jason Xing wrote:
> No functional changes here; only add skb_enable_app_tstamp() to test
> whether the orig_skb matches the use of application SO_TIMESTAMPING
> or its bpf extension. This prepares for supporting the two modes in
> parallel later in this series.
> 
> Also, this patch deliberately distinguishes the software and hardware
> SCM_TSTAMP_SND timestamps by passing a 'sw' parameter, in order to
> guard against the unlikely case where hardware goes wrong and passes
> a NULL hwtstamps. If that ever happened, the bpf prog would end up
> treating it as a software timestamp, which would be hard to
> recognize. Let's make the timestamping part more robust.

Disagree. Don't add a crutch that has not shown to be necessary for
all this time.

Just infer hw from hwtstamps != NULL.
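The inference Willem suggests can be sketched outside the kernel. This is a minimal model, not the real API: the struct below is a simplified stand-in for struct skb_shared_hwtstamps.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct skb_shared_hwtstamps. */
struct hwtstamps { long long hwtstamp; };

/* Willem's point: no extra 'sw' parameter is needed, because a
 * non-NULL hwtstamps pointer by itself identifies a hardware
 * timestamp; the software reporting paths all pass NULL. */
bool tstamp_is_hw(const struct hwtstamps *hwtstamps)
{
	return hwtstamps != NULL;
}
```

With this convention, the 'sw' bool threaded through __skb_tstamp_tx() becomes redundant for the reporting decision.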
 
> Signed-off-by: Jason Xing <kerneljasonxing@gmail.com>
> ---
>  include/linux/skbuff.h | 13 +++++++------
>  net/core/dev.c         |  2 +-
>  net/core/skbuff.c      | 32 ++++++++++++++++++++++++++++++--
>  net/ipv4/tcp_input.c   |  3 ++-
>  4 files changed, 40 insertions(+), 10 deletions(-)
> 
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index bb2b751d274a..dfc419281cc9 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -39,6 +39,7 @@
>  #include <net/net_debug.h>
>  #include <net/dropreason-core.h>
>  #include <net/netmem.h>
> +#include <uapi/linux/errqueue.h>
>  
>  /**
>   * DOC: skb checksums
> @@ -4533,18 +4534,18 @@ void skb_complete_tx_timestamp(struct sk_buff *skb,
>  
>  void __skb_tstamp_tx(struct sk_buff *orig_skb, const struct sk_buff *ack_skb,
>  		     struct skb_shared_hwtstamps *hwtstamps,
> -		     struct sock *sk, int tstype);
> +		     struct sock *sk, bool sw, int tstype);
>  
>  /**
> - * skb_tstamp_tx - queue clone of skb with send time stamps
> + * skb_tstamp_tx - queue clone of skb with send HARDWARE timestamps

Unfortunately this cannot be modified to skb_tstamp_tx_hw, as that
would require updating way too many callers.

>   * @orig_skb:	the original outgoing packet
>   * @hwtstamps:	hardware time stamps, may be NULL if not available
>   *
>   * If the skb has a socket associated, then this function clones the
>   * skb (thus sharing the actual data and optional structures), stores
> - * the optional hardware time stamping information (if non NULL) or
> - * generates a software time stamp (otherwise), then queues the clone
> - * to the error queue of the socket.  Errors are silently ignored.
> + * the optional hardware time stamping information (if non NULL) then
> + * queues the clone to the error queue of the socket.  Errors are
> + * silently ignored.
>   */
>  void skb_tstamp_tx(struct sk_buff *orig_skb,
>  		   struct skb_shared_hwtstamps *hwtstamps);
> @@ -4565,7 +4566,7 @@ static inline void skb_tx_timestamp(struct sk_buff *skb)
>  {
>  	skb_clone_tx_timestamp(skb);
>  	if (skb_shinfo(skb)->tx_flags & SKBTX_SW_TSTAMP)
> -		skb_tstamp_tx(skb, NULL);
> +		__skb_tstamp_tx(skb, NULL, NULL, skb->sk, true, SCM_TSTAMP_SND);

If a separate version for software timestamps were needed, I'd suggest
adding a skb_tstamp_tx_sw() wrapper. But see first comment.
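The wrapper shape Willem mentions can be modeled as a toy outside the kernel. All names and types here are simplified stand-ins; only the call shape matters: the full-argument __skb_tstamp_tx() stays internal, and thin wrappers pin down the sw/hw variant so callers never pass the bool directly.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct sk_buff { int sk; };		/* toy stand-in */

enum { SCM_TSTAMP_SND };		/* report type, as in uapi errqueue.h */

bool last_call_was_sw;			/* records which variant ran */

void __skb_tstamp_tx(struct sk_buff *skb, const void *ack_skb,
		     const void *hwtstamps, int sk, bool sw, int tstype)
{
	(void)skb; (void)ack_skb; (void)hwtstamps; (void)sk; (void)tstype;
	last_call_was_sw = sw;
}

/* The wrapper Willem proposes for the software case... */
void skb_tstamp_tx_sw(struct sk_buff *skb)
{
	__skb_tstamp_tx(skb, NULL, NULL, skb->sk, true, SCM_TSTAMP_SND);
}

/* ...while the existing skb_tstamp_tx() keeps serving hw completions. */
void skb_tstamp_tx(struct sk_buff *skb, const void *hwtstamps)
{
	__skb_tstamp_tx(skb, NULL, hwtstamps, skb->sk, false, SCM_TSTAMP_SND);
}
```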

>  }
>  
>  /**
> diff --git a/net/core/dev.c b/net/core/dev.c
> index afa2282f2604..d77b8389753e 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -4501,7 +4501,7 @@ int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
>  	skb_assert_len(skb);
>  
>  	if (unlikely(skb_shinfo(skb)->tx_flags & SKBTX_SCHED_TSTAMP))
> -		__skb_tstamp_tx(skb, NULL, NULL, skb->sk, SCM_TSTAMP_SCHED);
> +		__skb_tstamp_tx(skb, NULL, NULL, skb->sk, true, SCM_TSTAMP_SCHED);
>  
>  	/* Disable soft irqs for various locks below. Also
>  	 * stops preemption for RCU.
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index a441613a1e6c..6042961dfc02 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -5539,10 +5539,35 @@ void skb_complete_tx_timestamp(struct sk_buff *skb,
>  }
>  EXPORT_SYMBOL_GPL(skb_complete_tx_timestamp);
>  
> +static bool skb_enable_app_tstamp(struct sk_buff *skb, int tstype, bool sw)

app is a bit vague. I suggest

skb_tstamp_tx_report_so_timestamping

and

skb_tstamp_tx_report_bpf_timestamping

> +{
> +	int flag;
> +
> +	switch (tstype) {
> +	case SCM_TSTAMP_SCHED:
> +		flag = SKBTX_SCHED_TSTAMP;
> +		break;

Please just have a one line statements in the case directly:

    case SCM_TSTAMP_SCHED:
        return skb_shinfo(skb)->tx_flags & SKBTX_SCHED_TSTAMP;
    case SCM_TSTAMP_SND:
        return skb_shinfo(skb)->tx_flags & (sw ? SKBTX_SW_TSTAMP :
                                                 SKBTX_HW_TSTAMP);
    case SCM_TSTAMP_ACK:
        return TCP_SKB_CB(skb)->txstamp_ack;
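This compact form can be checked in isolation. The sketch below uses the tx_flags bit values quoted later in this thread from include/linux/skbuff.h and the SCM_TSTAMP_* numbering from uapi/linux/errqueue.h; the skb struct is a simplified stand-in for the skb_shinfo()/TCP_SKB_CB() state the helper consults.

```c
#include <assert.h>
#include <stdbool.h>

/* Bit values as in include/linux/skbuff.h. */
#define SKBTX_HW_TSTAMP    (1 << 0)
#define SKBTX_SW_TSTAMP    (1 << 1)
#define SKBTX_SCHED_TSTAMP (1 << 6)

/* Report types, numbered as in uapi/linux/errqueue.h. */
enum { SCM_TSTAMP_SND, SCM_TSTAMP_SCHED, SCM_TSTAMP_ACK };

/* Simplified stand-in for the skb state the helper consults. */
struct fake_skb {
	unsigned char tx_flags;	/* skb_shinfo(skb)->tx_flags */
	bool txstamp_ack;	/* TCP_SKB_CB(skb)->txstamp_ack */
};

/* The one-return-per-case shape Willem suggests. */
bool skb_enable_app_tstamp(const struct fake_skb *skb, int tstype, bool sw)
{
	switch (tstype) {
	case SCM_TSTAMP_SCHED:
		return skb->tx_flags & SKBTX_SCHED_TSTAMP;
	case SCM_TSTAMP_SND:
		return skb->tx_flags & (sw ? SKBTX_SW_TSTAMP :
					     SKBTX_HW_TSTAMP);
	case SCM_TSTAMP_ACK:
		return skb->txstamp_ack;
	default:
		return false;
	}
}
```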

> +	case SCM_TSTAMP_SND:
> +		flag = sw ? SKBTX_SW_TSTAMP : SKBTX_HW_TSTAMP;
> +		break;
> +	case SCM_TSTAMP_ACK:
> +		if (TCP_SKB_CB(skb)->txstamp_ack)
> +			return true;
> +		fallthrough;
> +	default:
> +		return false;
> +	}
> +
> +	if (skb_shinfo(skb)->tx_flags & flag)
> +		return true;
> +
> +	return false;
> +}
> +
>  void __skb_tstamp_tx(struct sk_buff *orig_skb,
>  		     const struct sk_buff *ack_skb,
>  		     struct skb_shared_hwtstamps *hwtstamps,
> -		     struct sock *sk, int tstype)
> +		     struct sock *sk, bool sw, int tstype)
>  {
>  	struct sk_buff *skb;
>  	bool tsonly, opt_stats = false;
> @@ -5551,6 +5576,9 @@ void __skb_tstamp_tx(struct sk_buff *orig_skb,
>  	if (!sk)
>  		return;
>  
> +	if (!skb_enable_app_tstamp(orig_skb, tstype, sw))
> +		return;
> +
>  	tsflags = READ_ONCE(sk->sk_tsflags);
>  	if (!hwtstamps && !(tsflags & SOF_TIMESTAMPING_OPT_TX_SWHW) &&
>  	    skb_shinfo(orig_skb)->tx_flags & SKBTX_IN_PROGRESS)
> @@ -5599,7 +5627,7 @@ EXPORT_SYMBOL_GPL(__skb_tstamp_tx);
>  void skb_tstamp_tx(struct sk_buff *orig_skb,
>  		   struct skb_shared_hwtstamps *hwtstamps)
>  {
> -	return __skb_tstamp_tx(orig_skb, NULL, hwtstamps, orig_skb->sk,
> +	return __skb_tstamp_tx(orig_skb, NULL, hwtstamps, orig_skb->sk, false,
>  			       SCM_TSTAMP_SND);
>  }
>  EXPORT_SYMBOL_GPL(skb_tstamp_tx);
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index 77185479ed5e..62252702929d 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -3330,7 +3330,8 @@ static void tcp_ack_tstamp(struct sock *sk, struct sk_buff *skb,
>  	if (!before(shinfo->tskey, prior_snd_una) &&
>  	    before(shinfo->tskey, tcp_sk(sk)->snd_una)) {
>  		tcp_skb_tsorted_save(skb) {
> -			__skb_tstamp_tx(skb, ack_skb, NULL, sk, SCM_TSTAMP_ACK);
> +			__skb_tstamp_tx(skb, ack_skb, NULL, sk, true,
> +					SCM_TSTAMP_ACK);
>  		} tcp_skb_tsorted_restore(skb);
>  	}
>  }
> -- 
> 2.43.5
> 



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH bpf-next v8 01/12] bpf: add support for bpf_setsockopt()
  2025-02-05 15:22   ` Willem de Bruijn
@ 2025-02-05 15:34     ` Jason Xing
  2025-02-05 20:57       ` Martin KaFai Lau
  2025-02-05 21:25       ` Willem de Bruijn
  0 siblings, 2 replies; 66+ messages in thread
From: Jason Xing @ 2025-02-05 15:34 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: davem, edumazet, kuba, pabeni, dsahern, willemb, ast, daniel,
	andrii, martin.lau, eddyz87, song, yonghong.song, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, horms, bpf, netdev

On Wed, Feb 5, 2025 at 11:22 PM Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
>
> Jason Xing wrote:
> > Users can write the following code to enable the bpf extension:
> > int flags = SK_BPF_CB_TX_TIMESTAMPING;
> > int opts = SK_BPF_CB_FLAGS;
> > bpf_setsockopt(skops, SOL_SOCKET, opts, &flags, sizeof(flags));
> >
> > Signed-off-by: Jason Xing <kerneljasonxing@gmail.com>
> > ---
> >  include/net/sock.h             |  3 +++
> >  include/uapi/linux/bpf.h       |  8 ++++++++
> >  net/core/filter.c              | 23 +++++++++++++++++++++++
> >  tools/include/uapi/linux/bpf.h |  1 +
> >  4 files changed, 35 insertions(+)
> >
> > diff --git a/include/net/sock.h b/include/net/sock.h
> > index 8036b3b79cd8..7916982343c6 100644
> > --- a/include/net/sock.h
> > +++ b/include/net/sock.h
> > @@ -303,6 +303,7 @@ struct sk_filter;
> >    *  @sk_stamp: time stamp of last packet received
> >    *  @sk_stamp_seq: lock for accessing sk_stamp on 32 bit architectures only
> >    *  @sk_tsflags: SO_TIMESTAMPING flags
> > +  *  @sk_bpf_cb_flags: used in bpf_setsockopt()
> >    *  @sk_use_task_frag: allow sk_page_frag() to use current->task_frag.
> >    *                     Sockets that can be used under memory reclaim should
> >    *                     set this to false.
> > @@ -445,6 +446,8 @@ struct sock {
> >       u32                     sk_reserved_mem;
> >       int                     sk_forward_alloc;
> >       u32                     sk_tsflags;
> > +#define SK_BPF_CB_FLAG_TEST(SK, FLAG) ((SK)->sk_bpf_cb_flags & (FLAG))
> > +     u32                     sk_bpf_cb_flags;
> >       __cacheline_group_end(sock_write_rxtx);
> >
> >       __cacheline_group_begin(sock_write_tx);
> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > index 2acf9b336371..6116eb3d1515 100644
> > --- a/include/uapi/linux/bpf.h
> > +++ b/include/uapi/linux/bpf.h
> > @@ -6913,6 +6913,13 @@ enum {
> >       BPF_SOCK_OPS_ALL_CB_FLAGS       = 0x7F,
> >  };
> >
> > +/* Definitions for bpf_sk_cb_flags */
> > +enum {
> > +     SK_BPF_CB_TX_TIMESTAMPING       = 1<<0,
> > +     SK_BPF_CB_MASK                  = (SK_BPF_CB_TX_TIMESTAMPING - 1) |
> > +                                        SK_BPF_CB_TX_TIMESTAMPING
> > +};
> > +
> >  /* List of known BPF sock_ops operators.
> >   * New entries can only be added at the end
> >   */
> > @@ -7091,6 +7098,7 @@ enum {
> >       TCP_BPF_SYN_IP          = 1006, /* Copy the IP[46] and TCP header */
> >       TCP_BPF_SYN_MAC         = 1007, /* Copy the MAC, IP[46], and TCP header */
> >       TCP_BPF_SOCK_OPS_CB_FLAGS = 1008, /* Get or Set TCP sock ops flags */
> > +     SK_BPF_CB_FLAGS         = 1009, /* Used to set socket bpf flags */
> >  };
> >
> >  enum {
> > diff --git a/net/core/filter.c b/net/core/filter.c
> > index 2ec162dd83c4..1c6c07507a78 100644
> > --- a/net/core/filter.c
> > +++ b/net/core/filter.c
> > @@ -5222,6 +5222,25 @@ static const struct bpf_func_proto bpf_get_socket_uid_proto = {
> >       .arg1_type      = ARG_PTR_TO_CTX,
> >  };
> >
> > +static int sk_bpf_set_get_cb_flags(struct sock *sk, char *optval, bool getopt)
> > +{
> > +     u32 sk_bpf_cb_flags;
> > +
> > +     if (getopt) {
> > +             *(u32 *)optval = sk->sk_bpf_cb_flags;
> > +             return 0;
> > +     }
> > +
> > +     sk_bpf_cb_flags = *(u32 *)optval;
> > +
> > +     if (sk_bpf_cb_flags & ~SK_BPF_CB_MASK)
> > +             return -EINVAL;
> > +
> > +     sk->sk_bpf_cb_flags = sk_bpf_cb_flags;
>
> I don't know BPF internals that well:
>
> Is there mutual exclusion between these sol_socket_sockopt calls?
> Or do these sk field accesses need WRITE_ONCE/READ_ONCE.

According to the existing callbacks (like
BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB, which I used in the selftests) in
include/uapi/linux/bpf.h, they run under socket lock protection. And
the intended use of this feature is to set the flag during the 3-way
handshake, which is also protected by the lock. But now that you've
reminded me of this potential data race, just in case a bpf program
doesn't use it as we expect, I will add the annotation in v9.

Thanks,
Jason
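The set path being discussed can be modeled in plain C. This is a sketch of the validation logic from the quoted patch, not the kernel function itself; the function name and pointer-based interface are simplifications.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Values from the patch's uapi additions. */
#define SK_BPF_CB_TX_TIMESTAMPING (1u << 0)
#define SK_BPF_CB_MASK ((SK_BPF_CB_TX_TIMESTAMPING - 1) | \
			SK_BPF_CB_TX_TIMESTAMPING)

/* Simplified model of the set path of sk_bpf_set_get_cb_flags():
 * reject unknown bits, then store.  In the kernel, the store and the
 * corresponding load would become WRITE_ONCE()/READ_ONCE() per the
 * data-race discussion above. */
int set_cb_flags(uint32_t *sk_bpf_cb_flags, uint32_t new_flags)
{
	if (new_flags & ~SK_BPF_CB_MASK)
		return -EINVAL;
	*sk_bpf_cb_flags = new_flags;
	return 0;
}
```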


* Re: [PATCH bpf-next v8 03/12] bpf: stop unsafely accessing TCP fields in bpf callbacks
  2025-02-05 15:24   ` Willem de Bruijn
@ 2025-02-05 15:35     ` Jason Xing
  0 siblings, 0 replies; 66+ messages in thread
From: Jason Xing @ 2025-02-05 15:35 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: davem, edumazet, kuba, pabeni, dsahern, willemb, ast, daniel,
	andrii, martin.lau, eddyz87, song, yonghong.song, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, horms, bpf, netdev

On Wed, Feb 5, 2025 at 11:24 PM Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
>
> Jason Xing wrote:
> > The "allow_tcp_access" flag is added to indicate that the callback
> > site has a tcp_sock locked.
> >
> > Applying the new member allow_tcp_access in the existing callbacks
> > where is_fullsock is set to 1 prevents a UDP socket from accessing
> > struct tcp_sock, and likewise prevents a TCP socket that is not
> > protected by the sk lock from doing so; otherwise it could be a
> > catastrophe leading to a panic.
> >
> > To keep it simple, instead of distinguishing between read and write
> > access, users aren't allowed any read/write access to the tcp_sock
> > through the older bpf_sock_ops ctx. The new timestamping callbacks
> > can use newer helpers to read everything from an sk (e.g. bpf_core_cast),
> > so nothing is lost.
> >
> > Signed-off-by: Jason Xing <kerneljasonxing@gmail.com>
> > ---
> >  include/linux/filter.h | 5 +++++
> >  include/net/tcp.h      | 1 +
> >  net/core/filter.c      | 8 ++++----
> >  net/ipv4/tcp_input.c   | 2 ++
> >  net/ipv4/tcp_output.c  | 2 ++
> >  5 files changed, 14 insertions(+), 4 deletions(-)
> >
> > diff --git a/include/linux/filter.h b/include/linux/filter.h
> > index a3ea46281595..1569e9f31a8c 100644
> > --- a/include/linux/filter.h
> > +++ b/include/linux/filter.h
> > @@ -1508,6 +1508,11 @@ struct bpf_sock_ops_kern {
> >       void    *skb_data_end;
> >       u8      op;
> >       u8      is_fullsock;
> > +     u8      allow_tcp_access;       /* Indicate that the callback site
> > +                                      * has a tcp_sock locked. Then it
> > +                                      * would be safe to access struct
> > +                                      * tcp_sock.
> > +                                      */
>
> perhaps no need for explicit documentation if the variable name is
> self documenting: is_locked_tcp_sock

Good suggestion. I will take it :)

Thanks,
Jason


* Re: [PATCH bpf-next v8 06/12] bpf: support SCM_TSTAMP_SCHED of SO_TIMESTAMPING
  2025-02-04 18:30 ` [PATCH bpf-next v8 06/12] bpf: support SCM_TSTAMP_SCHED " Jason Xing
@ 2025-02-05 15:36   ` Willem de Bruijn
  2025-02-05 15:55     ` Jason Xing
  0 siblings, 1 reply; 66+ messages in thread
From: Willem de Bruijn @ 2025-02-05 15:36 UTC (permalink / raw)
  To: Jason Xing, davem, edumazet, kuba, pabeni, dsahern,
	willemdebruijn.kernel, willemb, ast, daniel, andrii, martin.lau,
	eddyz87, song, yonghong.song, john.fastabend, kpsingh, sdf,
	haoluo, jolsa, horms
  Cc: bpf, netdev, Jason Xing

Jason Xing wrote:
> Introducing SKBTX_BPF is used as an indicator telling us whether
> the skb should be traced by the bpf prog.

Should this say support the SCM_TSTAMP_SCHED case?

Also: imperative mood: Introduce instead of Introducing.

 
> Signed-off-by: Jason Xing <kerneljasonxing@gmail.com>
> ---
>  include/linux/skbuff.h         |  6 +++++-
>  include/uapi/linux/bpf.h       |  4 ++++
>  net/core/dev.c                 |  3 ++-
>  net/core/skbuff.c              | 20 ++++++++++++++++++++
>  tools/include/uapi/linux/bpf.h |  4 ++++
>  5 files changed, 35 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index dfc419281cc9..35c2e864dd4b 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -490,10 +490,14 @@ enum {
>  
>  	/* generate software time stamp when entering packet scheduling */
>  	SKBTX_SCHED_TSTAMP = 1 << 6,
> +
> +	/* used for bpf extension when a bpf program is loaded */
> +	SKBTX_BPF = 1 << 7,
>  };
>  
>  #define SKBTX_ANY_SW_TSTAMP	(SKBTX_SW_TSTAMP    | \
> -				 SKBTX_SCHED_TSTAMP)
> +				 SKBTX_SCHED_TSTAMP | \
> +				 SKBTX_BPF)
>  #define SKBTX_ANY_TSTAMP	(SKBTX_HW_TSTAMP | \
>  				 SKBTX_HW_TSTAMP_USE_CYCLES | \
>  				 SKBTX_ANY_SW_TSTAMP)
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 6116eb3d1515..30d2c078966b 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -7032,6 +7032,10 @@ enum {
>  					 * by the kernel or the
>  					 * earlier bpf-progs.
>  					 */
> +	BPF_SOCK_OPS_TS_SCHED_OPT_CB,	/* Called when skb is passing through
> +					 * dev layer when SK_BPF_CB_TX_TIMESTAMPING
> +					 * feature is on.
> +					 */
>  };
>  
>  /* List of TCP states. There is a build check in net/ipv4/tcp.c to detect
> diff --git a/net/core/dev.c b/net/core/dev.c
> index d77b8389753e..4f291459d6b1 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -4500,7 +4500,8 @@ int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
>  	skb_reset_mac_header(skb);
>  	skb_assert_len(skb);
>  
> -	if (unlikely(skb_shinfo(skb)->tx_flags & SKBTX_SCHED_TSTAMP))
> +	if (unlikely(skb_shinfo(skb)->tx_flags &
> +		     (SKBTX_SCHED_TSTAMP | SKBTX_BPF)))
>  		__skb_tstamp_tx(skb, NULL, NULL, skb->sk, true, SCM_TSTAMP_SCHED);
>  
>  	/* Disable soft irqs for various locks below. Also
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 6042961dfc02..b7261e886529 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -5564,6 +5564,21 @@ static bool skb_enable_app_tstamp(struct sk_buff *skb, int tstype, bool sw)
>  	return false;
>  }
>  
> +static void skb_tstamp_tx_bpf(struct sk_buff *skb, struct sock *sk, int tstype)
> +{
> +	int op;
> +
> +	switch (tstype) {
> +	case SCM_TSTAMP_SCHED:
> +		op = BPF_SOCK_OPS_TS_SCHED_OPT_CB;
> +		break;
> +	default:
> +		return;
> +	}
> +
> +	bpf_skops_tx_timestamping(sk, skb, op);
> +}
> +
>  void __skb_tstamp_tx(struct sk_buff *orig_skb,
>  		     const struct sk_buff *ack_skb,
>  		     struct skb_shared_hwtstamps *hwtstamps,
> @@ -5576,6 +5591,11 @@ void __skb_tstamp_tx(struct sk_buff *orig_skb,
>  	if (!sk)
>  		return;
>  
> +	/* bpf extension feature entry */
> +	if (skb_shinfo(orig_skb)->tx_flags & SKBTX_BPF)
> +		skb_tstamp_tx_bpf(orig_skb, sk, tstype);
> +
> +	/* application feature entry */
>  	if (!skb_enable_app_tstamp(orig_skb, tstype, sw))
>  		return;
>  
> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> index 70366f74ef4e..eed91b7296b7 100644
> --- a/tools/include/uapi/linux/bpf.h
> +++ b/tools/include/uapi/linux/bpf.h
> @@ -7025,6 +7025,10 @@ enum {
>  					 * by the kernel or the
>  					 * earlier bpf-progs.
>  					 */
> +	BPF_SOCK_OPS_TS_SCHED_OPT_CB,	/* Called when skb is passing through
> +					 * dev layer when SK_BPF_CB_TX_TIMESTAMPING
> +					 * feature is on.
> +					 */
>  };
>  
>  /* List of TCP states. There is a build check in net/ipv4/tcp.c to detect
> -- 
> 2.43.5
> 




* Re: [PATCH bpf-next v8 08/12] bpf: support hw SCM_TSTAMP_SND of SO_TIMESTAMPING
  2025-02-04 18:30 ` [PATCH bpf-next v8 08/12] bpf: support hw " Jason Xing
@ 2025-02-05 15:45   ` Willem de Bruijn
  2025-02-05 16:03     ` Jason Xing
  0 siblings, 1 reply; 66+ messages in thread
From: Willem de Bruijn @ 2025-02-05 15:45 UTC (permalink / raw)
  To: Jason Xing, davem, edumazet, kuba, pabeni, dsahern,
	willemdebruijn.kernel, willemb, ast, daniel, andrii, martin.lau,
	eddyz87, song, yonghong.song, john.fastabend, kpsingh, sdf,
	haoluo, jolsa, horms
  Cc: bpf, netdev, Jason Xing

Jason Xing wrote:
> Patch finishes the hardware part.

For consistency with previous patches, and to make sense of this
commit message on its own, when stumbling on it, e.g., through
git blame, replace the above with

Support hw SCM_TSTAMP_SND case. 

> Then bpf program can fetch the
> hwstamp from skb directly.
> 
> To avoid changing so many callers using SKBTX_HW_TSTAMP from drivers,
> use this simple modification like this patch does to support printing
> hardware timestamp.

Which simple modification? Please state explicitly.
 
> Signed-off-by: Jason Xing <kerneljasonxing@gmail.com>
> ---
>  include/linux/skbuff.h         |  4 +++-
>  include/uapi/linux/bpf.h       |  7 +++++++
>  net/core/skbuff.c              | 13 +++++++------
>  net/dsa/user.c                 |  2 +-
>  net/socket.c                   |  2 +-
>  tools/include/uapi/linux/bpf.h |  7 +++++++
>  6 files changed, 26 insertions(+), 9 deletions(-)
> 
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index de8d3bd311f5..df2d790ae36b 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -471,7 +471,7 @@ struct skb_shared_hwtstamps {
>  /* Definitions for tx_flags in struct skb_shared_info */
>  enum {
>  	/* generate hardware time stamp */
> -	SKBTX_HW_TSTAMP = 1 << 0,
> +	__SKBTX_HW_TSTAMP = 1 << 0,

Perhaps we can have a more descriptive name. SKBTX_HW_TSTAMP_NOBPF?
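For context, the trick the patch uses (quoted just below) can be isolated: the raw bit becomes __SKBTX_HW_TSTAMP (or SKBTX_HW_TSTAMP_NOBPF, per the suggestion above), and the old name turns into a composite mask, so every existing `tx_flags & SKBTX_HW_TSTAMP` test also fires for bpf-only skbs without touching the drivers. A minimal sketch:

```c
#include <assert.h>

/* Bit values as in the patch. */
#define __SKBTX_HW_TSTAMP (1 << 0)
#define SKBTX_BPF         (1 << 7)
#define SKBTX_HW_TSTAMP   (__SKBTX_HW_TSTAMP | SKBTX_BPF)

/* Models any legacy caller that tests the old name. */
int matches_hw_tstamp(unsigned char tx_flags)
{
	return (tx_flags & SKBTX_HW_TSTAMP) != 0;
}
```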

>  
>  	/* generate software time stamp when queueing packet to NIC */
>  	SKBTX_SW_TSTAMP = 1 << 1,
> @@ -495,6 +495,8 @@ enum {
>  	SKBTX_BPF = 1 << 7,
>  };
>  
> +#define SKBTX_HW_TSTAMP		(__SKBTX_HW_TSTAMP | SKBTX_BPF)
> +
>  #define SKBTX_ANY_SW_TSTAMP	(SKBTX_SW_TSTAMP    | \
>  				 SKBTX_SCHED_TSTAMP | \
>  				 SKBTX_BPF)
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 6a1083bcf779..4c3566f623c2 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -7040,6 +7040,13 @@ enum {
>  					 * to the nic when SK_BPF_CB_TX_TIMESTAMPING
>  					 * feature is on.
>  					 */
> +	BPF_SOCK_OPS_TS_HW_OPT_CB,	/* Called in hardware phase when
> +					 * SK_BPF_CB_TX_TIMESTAMPING feature
> +					 * is on. At the same time, hwtstamps
> +					 * of skb is initialized as the
> +					 * timestamp that hardware just
> +					 * generates.

"hwtstamp of skb is initialized" is this something new? Or are you
just conveying that when this callback is called, skb_hwtstamps(skb)
is non-zero? If the latter, drop, because the wording is confusing.

> +					 */
>  };
>  
>  /* List of TCP states. There is a build check in net/ipv4/tcp.c to detect
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index b22d079e7143..264435f989ad 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -5548,7 +5548,7 @@ static bool skb_enable_app_tstamp(struct sk_buff *skb, int tstype, bool sw)
>  		flag = SKBTX_SCHED_TSTAMP;
>  		break;
>  	case SCM_TSTAMP_SND:
> -		flag = sw ? SKBTX_SW_TSTAMP : SKBTX_HW_TSTAMP;
> +		flag = sw ? SKBTX_SW_TSTAMP : __SKBTX_HW_TSTAMP;
>  		break;
>  	case SCM_TSTAMP_ACK:
>  		if (TCP_SKB_CB(skb)->txstamp_ack)
> @@ -5565,7 +5565,8 @@ static bool skb_enable_app_tstamp(struct sk_buff *skb, int tstype, bool sw)
>  }
>  
>  static void skb_tstamp_tx_bpf(struct sk_buff *skb, struct sock *sk,
> -			      int tstype, bool sw)
> +			      int tstype, bool sw,
> +			      struct skb_shared_hwtstamps *hwtstamps)
>  {
>  	int op;
>  
> @@ -5574,9 +5575,9 @@ static void skb_tstamp_tx_bpf(struct sk_buff *skb, struct sock *sk,
>  		op = BPF_SOCK_OPS_TS_SCHED_OPT_CB;
>  		break;
>  	case SCM_TSTAMP_SND:
> -		if (!sw)
> -			return;
> -		op = BPF_SOCK_OPS_TS_SW_OPT_CB;
> +		op = sw ? BPF_SOCK_OPS_TS_SW_OPT_CB : BPF_SOCK_OPS_TS_HW_OPT_CB;
> +		if (!sw && hwtstamps)
> +			*skb_hwtstamps(skb) = *hwtstamps;

Isn't this called by drivers that have actually set skb_hwtstamps?
>  		break;
>  	default:
>  		return;
> @@ -5599,7 +5600,7 @@ void __skb_tstamp_tx(struct sk_buff *orig_skb,
>  
>  	/* bpf extension feature entry */
>  	if (skb_shinfo(orig_skb)->tx_flags & SKBTX_BPF)
> -		skb_tstamp_tx_bpf(orig_skb, sk, tstype, sw);
> +		skb_tstamp_tx_bpf(orig_skb, sk, tstype, sw, hwtstamps);
>  
>  	/* application feature entry */
>  	if (!skb_enable_app_tstamp(orig_skb, tstype, sw))
> diff --git a/net/dsa/user.c b/net/dsa/user.c
> index 291ab1b4acc4..ae715bf0ae75 100644
> --- a/net/dsa/user.c
> +++ b/net/dsa/user.c
> @@ -897,7 +897,7 @@ static void dsa_skb_tx_timestamp(struct dsa_user_priv *p,
>  {
>  	struct dsa_switch *ds = p->dp->ds;
>  
> -	if (!(skb_shinfo(skb)->tx_flags & SKBTX_HW_TSTAMP))
> +	if (!(skb_shinfo(skb)->tx_flags & __SKBTX_HW_TSTAMP))
>  		return;
>  
>  	if (!ds->ops->port_txtstamp)
> diff --git a/net/socket.c b/net/socket.c
> index 262a28b59c7f..70eabb510ce6 100644
> --- a/net/socket.c
> +++ b/net/socket.c
> @@ -676,7 +676,7 @@ void __sock_tx_timestamp(__u32 tsflags, __u8 *tx_flags)
>  	u8 flags = *tx_flags;
>  
>  	if (tsflags & SOF_TIMESTAMPING_TX_HARDWARE) {
> -		flags |= SKBTX_HW_TSTAMP;
> +		flags |= __SKBTX_HW_TSTAMP;
>  
>  		/* PTP hardware clocks can provide a free running cycle counter
>  		 * as a time base for virtual clocks. Tell driver to use the
> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> index 9bd1c7c77b17..974b7f61d11f 100644
> --- a/tools/include/uapi/linux/bpf.h
> +++ b/tools/include/uapi/linux/bpf.h
> @@ -7033,6 +7033,13 @@ enum {
>  					 * to the nic when SK_BPF_CB_TX_TIMESTAMPING
>  					 * feature is on.
>  					 */
> +	BPF_SOCK_OPS_TS_HW_OPT_CB,	/* Called in hardware phase when
> +					 * SK_BPF_CB_TX_TIMESTAMPING feature
> +					 * is on. At the same time, hwtstamps
> +					 * of skb is initialized as the
> +					 * timestamp that hardware just
> +					 * generates.
> +					 */
>  };
>  
>  /* List of TCP states. There is a build check in net/ipv4/tcp.c to detect
> -- 
> 2.43.5
> 




* Re: [PATCH bpf-next v8 09/12] bpf: support SCM_TSTAMP_ACK of SO_TIMESTAMPING
  2025-02-04 18:30 ` [PATCH bpf-next v8 09/12] bpf: support SCM_TSTAMP_ACK " Jason Xing
@ 2025-02-05 15:47   ` Willem de Bruijn
  2025-02-05 16:06     ` Jason Xing
  0 siblings, 1 reply; 66+ messages in thread
From: Willem de Bruijn @ 2025-02-05 15:47 UTC (permalink / raw)
  To: Jason Xing, davem, edumazet, kuba, pabeni, dsahern,
	willemdebruijn.kernel, willemb, ast, daniel, andrii, martin.lau,
	eddyz87, song, yonghong.song, john.fastabend, kpsingh, sdf,
	haoluo, jolsa, horms
  Cc: bpf, netdev, Jason Xing

Jason Xing wrote:
> Handle the ACK timestamp case. Actually testing SKBTX_BPF flag
> can work, but Introducing a new txstamp_ack_bpf to avoid cache

repeat comment: s/Introducing/introduce

> line misses in tcp_ack_tstamp() is needed. To be more specific,
> in most cases, normal flows would not access skb_shinfo as
> txstamp_ack is zero, so that this function won't appear in the
> hot spot lists. Introducing a new member txstamp_ack_bpf works
> similarly.
> 
> Signed-off-by: Jason Xing <kerneljasonxing@gmail.com>
> ---
>  include/net/tcp.h              | 3 ++-
>  include/uapi/linux/bpf.h       | 5 +++++
>  net/core/skbuff.c              | 3 +++
>  net/ipv4/tcp_input.c           | 3 ++-
>  net/ipv4/tcp_output.c          | 5 +++++
>  tools/include/uapi/linux/bpf.h | 5 +++++
>  6 files changed, 22 insertions(+), 2 deletions(-)
> 
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index 293047694710..88429e422301 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -959,9 +959,10 @@ struct tcp_skb_cb {
>  	__u8		sacked;		/* State flags for SACK.	*/
>  	__u8		ip_dsfield;	/* IPv4 tos or IPv6 dsfield	*/
>  	__u8		txstamp_ack:1,	/* Record TX timestamp for ack? */
> +			txstamp_ack_bpf:1,	/* ack timestamp for bpf use */
>  			eor:1,		/* Is skb MSG_EOR marked? */
>  			has_rxtstamp:1,	/* SKB has a RX timestamp	*/
> -			unused:5;
> +			unused:4;
>  	__u32		ack_seq;	/* Sequence number ACK'd	*/
>  	union {
>  		struct {
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 4c3566f623c2..800122a8abe5 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -7047,6 +7047,11 @@ enum {
>  					 * timestamp that hardware just
>  					 * generates.
>  					 */
> +	BPF_SOCK_OPS_TS_ACK_OPT_CB,	/* Called when all the skbs in the
> +					 * same sendmsg call are acked
> +					 * when SK_BPF_CB_TX_TIMESTAMPING
> +					 * feature is on.
> +					 */
>  };
>  
>  /* List of TCP states. There is a build check in net/ipv4/tcp.c to detect
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 264435f989ad..a8463fef574a 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -5579,6 +5579,9 @@ static void skb_tstamp_tx_bpf(struct sk_buff *skb, struct sock *sk,
>  		if (!sw && hwtstamps)
>  			*skb_hwtstamps(skb) = *hwtstamps;
>  		break;
> +	case SCM_TSTAMP_ACK:
> +		op = BPF_SOCK_OPS_TS_ACK_OPT_CB;
> +		break;
>  	default:
>  		return;
>  	}
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index 62252702929d..c8945f5be31b 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -3323,7 +3323,8 @@ static void tcp_ack_tstamp(struct sock *sk, struct sk_buff *skb,
>  	const struct skb_shared_info *shinfo;
>  
>  	/* Avoid cache line misses to get skb_shinfo() and shinfo->tx_flags */
> -	if (likely(!TCP_SKB_CB(skb)->txstamp_ack))
> +	if (likely(!TCP_SKB_CB(skb)->txstamp_ack &&
> +		   !TCP_SKB_CB(skb)->txstamp_ack_bpf))

Here and elsewhere: instead of requiring multiple tests, how about
extending txstamp_ack to a two-bit field, so that a single branch
suffices.
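The two-bit idea can be sketched as follows. The bit assignments and names are hypothetical, not from the patch; the point is that widening the field lets one branch cover both requesters.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit assignments for a widened txstamp_ack field. */
#define TXSTAMP_ACK_SO	0x1	/* classic SO_TIMESTAMPING request */
#define TXSTAMP_ACK_BPF	0x2	/* bpf extension request */

struct fake_tcp_skb_cb {
	unsigned char txstamp_ack : 2;	/* was a 1-bit flag */
};

bool wants_ack_tstamp(const struct fake_tcp_skb_cb *cb)
{
	return cb->txstamp_ack;	/* single test covers both requesters */
}
```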

>  		return;
>  
>  	shinfo = skb_shinfo(skb);
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 695749807c09..fc84ca669b76 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -1556,6 +1556,7 @@ static void tcp_adjust_pcount(struct sock *sk, const struct sk_buff *skb, int de
>  static bool tcp_has_tx_tstamp(const struct sk_buff *skb)
>  {
>  	return TCP_SKB_CB(skb)->txstamp_ack ||
> +	       TCP_SKB_CB(skb)->txstamp_ack_bpf ||
>  		(skb_shinfo(skb)->tx_flags & SKBTX_ANY_TSTAMP);
>  }
>  
> @@ -1572,7 +1573,9 @@ static void tcp_fragment_tstamp(struct sk_buff *skb, struct sk_buff *skb2)
>  		shinfo2->tx_flags |= tsflags;
>  		swap(shinfo->tskey, shinfo2->tskey);
>  		TCP_SKB_CB(skb2)->txstamp_ack = TCP_SKB_CB(skb)->txstamp_ack;
> +		TCP_SKB_CB(skb2)->txstamp_ack_bpf = TCP_SKB_CB(skb)->txstamp_ack_bpf;
>  		TCP_SKB_CB(skb)->txstamp_ack = 0;
> +		TCP_SKB_CB(skb)->txstamp_ack_bpf = 0;
>  	}
>  }
>  
> @@ -3213,6 +3216,8 @@ void tcp_skb_collapse_tstamp(struct sk_buff *skb,
>  		shinfo->tskey = next_shinfo->tskey;
>  		TCP_SKB_CB(skb)->txstamp_ack |=
>  			TCP_SKB_CB(next_skb)->txstamp_ack;
> +		TCP_SKB_CB(skb)->txstamp_ack_bpf |=
> +			TCP_SKB_CB(next_skb)->txstamp_ack_bpf;
>  	}
>  }
>  
> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> index 974b7f61d11f..06e68d772989 100644
> --- a/tools/include/uapi/linux/bpf.h
> +++ b/tools/include/uapi/linux/bpf.h
> @@ -7040,6 +7040,11 @@ enum {
>  					 * timestamp that hardware just
>  					 * generates.
>  					 */
> +	BPF_SOCK_OPS_TS_ACK_OPT_CB,	/* Called when all the skbs in the
> +					 * same sendmsg call are acked
> +					 * when SK_BPF_CB_TX_TIMESTAMPING
> +					 * feature is on.
> +					 */
>  };
>  
>  /* List of TCP states. There is a build check in net/ipv4/tcp.c to detect
> -- 
> 2.43.5
> 




* Re: [PATCH bpf-next v8 04/12] bpf: stop calling some sock_op BPF CALLs in new timestamping callbacks
  2025-02-05 15:26   ` Willem de Bruijn
@ 2025-02-05 15:50     ` Jason Xing
  0 siblings, 0 replies; 66+ messages in thread
From: Jason Xing @ 2025-02-05 15:50 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: davem, edumazet, kuba, pabeni, dsahern, willemb, ast, daniel,
	andrii, martin.lau, eddyz87, song, yonghong.song, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, horms, bpf, netdev

On Wed, Feb 5, 2025 at 11:26 PM Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
>
> Jason Xing wrote:
> > Simply disallow calling bpf_sock_ops_setsockopt/getsockopt,
> > bpf_sock_ops_cb_flags_set, and the bpf_sock_ops_load_hdr_opt for
> > the new timestamping callbacks for the safety consideration.
>
> Please reword this: Disallow .. unless this is operating on a locked
> TCP socket. Or something along those lines.

Will adjust it.

>
> > Besides, In the next round, the UDP proto for SO_TIMESTAMPING bpf
> > extension will be supported, so there should be no safety problem,
> > which is usually caused by UDP socket trying to access TCP fields.
>
> Besides is probably the wrong word here: this is not an aside, but
> the actual reason for this test, if I follow correctly.

Right, will fix it. Thanks.

>
> > Signed-off-by: Jason Xing <kerneljasonxing@gmail.com>
> > ---
> >  net/core/filter.c | 17 +++++++++++++++++
> >  1 file changed, 17 insertions(+)
> >
> > diff --git a/net/core/filter.c b/net/core/filter.c
> > index dc0e67c5776a..d3395ffe058e 100644
> > --- a/net/core/filter.c
> > +++ b/net/core/filter.c
> > @@ -5523,6 +5523,11 @@ static int __bpf_setsockopt(struct sock *sk, int level, int optname,
> >       return -EINVAL;
> >  }
> >
> > +static bool is_locked_tcp_sock_ops(struct bpf_sock_ops_kern *bpf_sock)
> > +{
> > +     return bpf_sock->op <= BPF_SOCK_OPS_WRITE_HDR_OPT_CB;
> > +}
> > +
> >  static int _bpf_setsockopt(struct sock *sk, int level, int optname,
> >                          char *optval, int optlen)
> >  {
> > @@ -5673,6 +5678,9 @@ static const struct bpf_func_proto bpf_sock_addr_getsockopt_proto = {
> >  BPF_CALL_5(bpf_sock_ops_setsockopt, struct bpf_sock_ops_kern *, bpf_sock,
> >          int, level, int, optname, char *, optval, int, optlen)
> >  {
> > +     if (!is_locked_tcp_sock_ops(bpf_sock))
> > +             return -EOPNOTSUPP;
> > +
> >       return _bpf_setsockopt(bpf_sock->sk, level, optname, optval, optlen);
> >  }
> >
> > @@ -5758,6 +5766,9 @@ static int bpf_sock_ops_get_syn(struct bpf_sock_ops_kern *bpf_sock,
> >  BPF_CALL_5(bpf_sock_ops_getsockopt, struct bpf_sock_ops_kern *, bpf_sock,
> >          int, level, int, optname, char *, optval, int, optlen)
> >  {
> > +     if (!is_locked_tcp_sock_ops(bpf_sock))
> > +             return -EOPNOTSUPP;
> > +
> >       if (IS_ENABLED(CONFIG_INET) && level == SOL_TCP &&
> >           optname >= TCP_BPF_SYN && optname <= TCP_BPF_SYN_MAC) {
> >               int ret, copy_len = 0;
> > @@ -5800,6 +5811,9 @@ BPF_CALL_2(bpf_sock_ops_cb_flags_set, struct bpf_sock_ops_kern *, bpf_sock,
> >       struct sock *sk = bpf_sock->sk;
> >       int val = argval & BPF_SOCK_OPS_ALL_CB_FLAGS;
> >
> > +     if (!is_locked_tcp_sock_ops(bpf_sock))
> > +             return -EOPNOTSUPP;
> > +
> >       if (!IS_ENABLED(CONFIG_INET) || !sk_fullsock(sk))
> >               return -EINVAL;
> >
> > @@ -7609,6 +7623,9 @@ BPF_CALL_4(bpf_sock_ops_load_hdr_opt, struct bpf_sock_ops_kern *, bpf_sock,
> >       u8 search_kind, search_len, copy_len, magic_len;
> >       int ret;
> >
> > +     if (!is_locked_tcp_sock_ops(bpf_sock))
> > +             return -EOPNOTSUPP;
> > +
> >       /* 2 byte is the minimal option len except TCPOPT_NOP and
> >        * TCPOPT_EOL which are useless for the bpf prog to learn
> >        * and this helper disallow loading them also.
> > --
> > 2.43.5
> >
>
>


* Re: [PATCH bpf-next v8 05/12] net-timestamp: prepare for isolating two modes of SO_TIMESTAMPING
  2025-02-05 15:34   ` Willem de Bruijn
@ 2025-02-05 15:52     ` Jason Xing
  2025-02-06  8:43     ` Jason Xing
  1 sibling, 0 replies; 66+ messages in thread
From: Jason Xing @ 2025-02-05 15:52 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: davem, edumazet, kuba, pabeni, dsahern, willemb, ast, daniel,
	andrii, martin.lau, eddyz87, song, yonghong.song, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, horms, bpf, netdev

On Wed, Feb 5, 2025 at 11:34 PM Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
>
> Jason Xing wrote:
> > No functional changes here, only add skb_enable_app_tstamp() to test
> > if the orig_skb matches the usage of application SO_TIMESTAMPING
> > or its bpf extension. And it's good to support two modes in
> > parallel later in this series.
> >
> > Also, this patch deliberately distinguish the software and
> > hardware SCM_TSTAMP_SND timestamp by passing 'sw' parameter in order
> > to avoid such a case where hardware may go wrong and pass a NULL
> > hwstamps, which is even though unlikely to happen. If it really
> > happens, bpf prog will finally consider it as a software timestamp.
> > It will be hardly recognized. Let's make the timestamping part
> > more robust.
>
> Disagree. Don't add a crutch that has not shown to be necessary for
> all this time.
>
> Just infer hw from hwtstamps != NULL.
>
> > Signed-off-by: Jason Xing <kerneljasonxing@gmail.com>
> > ---
> >  include/linux/skbuff.h | 13 +++++++------
> >  net/core/dev.c         |  2 +-
> >  net/core/skbuff.c      | 32 ++++++++++++++++++++++++++++++--
> >  net/ipv4/tcp_input.c   |  3 ++-
> >  4 files changed, 40 insertions(+), 10 deletions(-)
> >
> > diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> > index bb2b751d274a..dfc419281cc9 100644
> > --- a/include/linux/skbuff.h
> > +++ b/include/linux/skbuff.h
> > @@ -39,6 +39,7 @@
> >  #include <net/net_debug.h>
> >  #include <net/dropreason-core.h>
> >  #include <net/netmem.h>
> > +#include <uapi/linux/errqueue.h>
> >
> >  /**
> >   * DOC: skb checksums
> > @@ -4533,18 +4534,18 @@ void skb_complete_tx_timestamp(struct sk_buff *skb,
> >
> >  void __skb_tstamp_tx(struct sk_buff *orig_skb, const struct sk_buff *ack_skb,
> >                    struct skb_shared_hwtstamps *hwtstamps,
> > -                  struct sock *sk, int tstype);
> > +                  struct sock *sk, bool sw, int tstype);
> >
> >  /**
> > - * skb_tstamp_tx - queue clone of skb with send time stamps
> > + * skb_tstamp_tx - queue clone of skb with send HARDWARE timestamps
>
> Unfortunately this cannot be modified to skb_tstamp_tx_hw, as that
> would require updating way too many callers.
>
> >   * @orig_skb:        the original outgoing packet
> >   * @hwtstamps:       hardware time stamps, may be NULL if not available
> >   *
> >   * If the skb has a socket associated, then this function clones the
> >   * skb (thus sharing the actual data and optional structures), stores
> > - * the optional hardware time stamping information (if non NULL) or
> > - * generates a software time stamp (otherwise), then queues the clone
> > - * to the error queue of the socket.  Errors are silently ignored.
> > + * the optional hardware time stamping information (if non NULL) then
> > + * queues the clone to the error queue of the socket.  Errors are
> > + * silently ignored.
> >   */
> >  void skb_tstamp_tx(struct sk_buff *orig_skb,
> >                  struct skb_shared_hwtstamps *hwtstamps);
> > @@ -4565,7 +4566,7 @@ static inline void skb_tx_timestamp(struct sk_buff *skb)
> >  {
> >       skb_clone_tx_timestamp(skb);
> >       if (skb_shinfo(skb)->tx_flags & SKBTX_SW_TSTAMP)
> > -             skb_tstamp_tx(skb, NULL);
> > +             __skb_tstamp_tx(skb, NULL, NULL, skb->sk, true, SCM_TSTAMP_SND);
>
> If a separate version for software timestamps were needed, I'd suggest
> adding a skb_tstamp_tx_sw() wrapper. But see first comment.
>
> >  }
> >
> >  /**
> > diff --git a/net/core/dev.c b/net/core/dev.c
> > index afa2282f2604..d77b8389753e 100644
> > --- a/net/core/dev.c
> > +++ b/net/core/dev.c
> > @@ -4501,7 +4501,7 @@ int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
> >       skb_assert_len(skb);
> >
> >       if (unlikely(skb_shinfo(skb)->tx_flags & SKBTX_SCHED_TSTAMP))
> > -             __skb_tstamp_tx(skb, NULL, NULL, skb->sk, SCM_TSTAMP_SCHED);
> > +             __skb_tstamp_tx(skb, NULL, NULL, skb->sk, true, SCM_TSTAMP_SCHED);
> >
> >       /* Disable soft irqs for various locks below. Also
> >        * stops preemption for RCU.
> > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > index a441613a1e6c..6042961dfc02 100644
> > --- a/net/core/skbuff.c
> > +++ b/net/core/skbuff.c
> > @@ -5539,10 +5539,35 @@ void skb_complete_tx_timestamp(struct sk_buff *skb,
> >  }
> >  EXPORT_SYMBOL_GPL(skb_complete_tx_timestamp);
> >
> > +static bool skb_enable_app_tstamp(struct sk_buff *skb, int tstype, bool sw)
>
> app is a bit vague. I suggest
>
> skb_tstamp_tx_report_so_timestamping
>
> and
>
> skb_tstamp_tx_report_bpf_timestamping

Good names. I like them.

>
> > +{
> > +     int flag;
> > +
> > +     switch (tstype) {
> > +     case SCM_TSTAMP_SCHED:
> > +             flag = SKBTX_SCHED_TSTAMP;
> > +             break;
>
> Please just have one-line statements in the case directly:
>
>     case SCM_TSTAMP_SCHED:
>         return skb_shinfo(skb)->tx_flags & SKBTX_SCHED_TSTAMP;
>     case SCM_TSTAMP_SND:
>         return skb_shinfo(skb)->tx_flags & (sw ? SKBTX_SW_TSTAMP :
>                                                  SKBTX_HW_TSTAMP);
>     case SCM_TSTAMP_ACK:
>         return TCP_SKB_CB(skb)->txstamp_ack;
>

Thanks for the re-arrangement!

> > +     case SCM_TSTAMP_SND:
> > +             flag = sw ? SKBTX_SW_TSTAMP : SKBTX_HW_TSTAMP;
> > +             break;
> > +     case SCM_TSTAMP_ACK:
> > +             if (TCP_SKB_CB(skb)->txstamp_ack)
> > +                     return true;
> > +             fallthrough;
> > +     default:
> > +             return false;
> > +     }
> > +
> > +     if (skb_shinfo(skb)->tx_flags & flag)
> > +             return true;
> > +
> > +     return false;
> > +}
> > +
> >  void __skb_tstamp_tx(struct sk_buff *orig_skb,
> >                    const struct sk_buff *ack_skb,
> >                    struct skb_shared_hwtstamps *hwtstamps,
> > -                  struct sock *sk, int tstype)
> > +                  struct sock *sk, bool sw, int tstype)
> >  {
> >       struct sk_buff *skb;
> >       bool tsonly, opt_stats = false;
> > @@ -5551,6 +5576,9 @@ void __skb_tstamp_tx(struct sk_buff *orig_skb,
> >       if (!sk)
> >               return;
> >
> > +     if (!skb_enable_app_tstamp(orig_skb, tstype, sw))
> > +             return;
> > +
> >       tsflags = READ_ONCE(sk->sk_tsflags);
> >       if (!hwtstamps && !(tsflags & SOF_TIMESTAMPING_OPT_TX_SWHW) &&
> >           skb_shinfo(orig_skb)->tx_flags & SKBTX_IN_PROGRESS)
> > @@ -5599,7 +5627,7 @@ EXPORT_SYMBOL_GPL(__skb_tstamp_tx);
> >  void skb_tstamp_tx(struct sk_buff *orig_skb,
> >                  struct skb_shared_hwtstamps *hwtstamps)
> >  {
> > -     return __skb_tstamp_tx(orig_skb, NULL, hwtstamps, orig_skb->sk,
> > +     return __skb_tstamp_tx(orig_skb, NULL, hwtstamps, orig_skb->sk, false,
> >                              SCM_TSTAMP_SND);
> >  }
> >  EXPORT_SYMBOL_GPL(skb_tstamp_tx);
> > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> > index 77185479ed5e..62252702929d 100644
> > --- a/net/ipv4/tcp_input.c
> > +++ b/net/ipv4/tcp_input.c
> > @@ -3330,7 +3330,8 @@ static void tcp_ack_tstamp(struct sock *sk, struct sk_buff *skb,
> >       if (!before(shinfo->tskey, prior_snd_una) &&
> >           before(shinfo->tskey, tcp_sk(sk)->snd_una)) {
> >               tcp_skb_tsorted_save(skb) {
> > -                     __skb_tstamp_tx(skb, ack_skb, NULL, sk, SCM_TSTAMP_ACK);
> > +                     __skb_tstamp_tx(skb, ack_skb, NULL, sk, true,
> > +                                     SCM_TSTAMP_ACK);
> >               } tcp_skb_tsorted_restore(skb);
> >       }
> >  }
> > --
> > 2.43.5
> >
>
>


* Re: [PATCH bpf-next v8 12/12] selftests/bpf: add simple bpf tests in the tx path for timestamping feature
  2025-02-04 18:30 ` [PATCH bpf-next v8 12/12] selftests/bpf: add simple bpf tests in the tx path for timestamping feature Jason Xing
@ 2025-02-05 15:54   ` Willem de Bruijn
  2025-02-05 16:08     ` Jason Xing
  0 siblings, 1 reply; 66+ messages in thread
From: Willem de Bruijn @ 2025-02-05 15:54 UTC (permalink / raw)
  To: Jason Xing, davem, edumazet, kuba, pabeni, dsahern,
	willemdebruijn.kernel, willemb, ast, daniel, andrii, martin.lau,
	eddyz87, song, yonghong.song, john.fastabend, kpsingh, sdf,
	haoluo, jolsa, horms
  Cc: bpf, netdev, Jason Xing

Jason Xing wrote:
> Bpf prog calculates a couple of latency delta between each tx points
> which SO_TIMESTAMPING feature has already implemented. It can be used
> in the real world to diagnose the behaviour in the tx path.
> 
> Also, check the safety issues by accessing a few bpf calls in
> bpf_test_access_bpf_calls().
> 
> Signed-off-by: Jason Xing <kerneljasonxing@gmail.com>

> +static bool bpf_test_delay(struct bpf_sock_ops *skops, const struct sock *sk)
> +{
> +	struct bpf_sock_ops_kern *skops_kern;
> +	u64 timestamp = bpf_ktime_get_ns();
> +	struct skb_shared_info *shinfo;
> +	struct delay_info dinfo = {0};
> +	struct sk_tskey key = {0};
> +	struct delay_info *val;
> +	struct sk_buff *skb;
> +	struct sk_stg *stg;
> +	u64 prior_ts, delay;
> +
> +	if (bpf_test_access_bpf_calls(skops, sk))
> +		return false;
> +
> +	skops_kern = bpf_cast_to_kern_ctx(skops);
> +	skb = skops_kern->skb;
> +	shinfo = bpf_core_cast(skb->head + skb->end, struct skb_shared_info);
> +	key.tskey = shinfo->tskey;
> +	if (!key.tskey)
> +		return false;
> +
> +	key.cookie = bpf_get_socket_cookie(skops);
> +	if (!key.cookie)
> +		return false;
> +
> +	if (skops->op == BPF_SOCK_OPS_TS_SND_CB) {
> +		stg = bpf_sk_storage_get(&sk_stg_map, (void *)sk, 0, 0);
> +		if (!stg)
> +			return false;
> +		dinfo.sendmsg_ns = stg->sendmsg_ns;
> +		bpf_map_update_elem(&time_map, &key, &dinfo, BPF_ANY);
> +		goto out;
> +	}
> +
> +	val = bpf_map_lookup_elem(&time_map, &key);
> +	if (!val)
> +		return false;
> +
> +	switch (skops->op) {
> +	case BPF_SOCK_OPS_TS_SCHED_OPT_CB:
> +		delay = val->sched_delay = timestamp - val->sendmsg_ns;
> +		break;

For a test this is fine. But just a reminder that in general a packet
may pass through multiple qdiscs. For instance with bonding or tunnel
virtual devices in the egress path.

> +	case BPF_SOCK_OPS_TS_SW_OPT_CB:
> +		prior_ts = val->sched_delay + val->sendmsg_ns;
> +		delay = val->sw_snd_delay = timestamp - prior_ts;
> +		break;
> +	case BPF_SOCK_OPS_TS_ACK_OPT_CB:
> +		prior_ts = val->sw_snd_delay + val->sched_delay + val->sendmsg_ns;
> +		delay = val->ack_delay = timestamp - prior_ts;
> +		break;

Similar to the above: fine for a test, but in practice be aware that
packets may be resent, in which case an ACK might precede a repeat
SCHED and SND. And erroneous or malicious peers may also just never
send an ACK. So this can never be relied on in production settings,
e.g., as the only signal to clear an entry from a map (as done in the
branch below).

> +	}
> +
> +	if (delay >= delay_tolerance_nsec)
> +		return false;
> +
> +	/* Since it's the last one, remove from the map after latency check */
> +	if (skops->op == BPF_SOCK_OPS_TS_ACK_OPT_CB)
> +		bpf_map_delete_elem(&time_map, &key);
> +
> +out:
> +	return true;
> +}
> +


* Re: [PATCH bpf-next v8 06/12] bpf: support SCM_TSTAMP_SCHED of SO_TIMESTAMPING
  2025-02-05 15:36   ` Willem de Bruijn
@ 2025-02-05 15:55     ` Jason Xing
  0 siblings, 0 replies; 66+ messages in thread
From: Jason Xing @ 2025-02-05 15:55 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: davem, edumazet, kuba, pabeni, dsahern, willemb, ast, daniel,
	andrii, martin.lau, eddyz87, song, yonghong.song, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, horms, bpf, netdev

On Wed, Feb 5, 2025 at 11:36 PM Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
>
> Jason Xing wrote:
> > Introducing SKBTX_BPF is used as an indicator telling us whether
> > the skb should be traced by the bpf prog.
>
> Should this say support the SCM_TSTAMP_SCHED case?

Will do it.

>
> Also: imperative mood: Introduce instead of Introducing.

Oh, sorry, I have to take an English lesson :S Apparently I didn't
know the difference :( Will adjust accordingly.

>
>
> > Signed-off-by: Jason Xing <kerneljasonxing@gmail.com>
> > ---
> >  include/linux/skbuff.h         |  6 +++++-
> >  include/uapi/linux/bpf.h       |  4 ++++
> >  net/core/dev.c                 |  3 ++-
> >  net/core/skbuff.c              | 20 ++++++++++++++++++++
> >  tools/include/uapi/linux/bpf.h |  4 ++++
> >  5 files changed, 35 insertions(+), 2 deletions(-)
> >
> > diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> > index dfc419281cc9..35c2e864dd4b 100644
> > --- a/include/linux/skbuff.h
> > +++ b/include/linux/skbuff.h
> > @@ -490,10 +490,14 @@ enum {
> >
> >       /* generate software time stamp when entering packet scheduling */
> >       SKBTX_SCHED_TSTAMP = 1 << 6,
> > +
> > +     /* used for bpf extension when a bpf program is loaded */
> > +     SKBTX_BPF = 1 << 7,
> >  };
> >
> >  #define SKBTX_ANY_SW_TSTAMP  (SKBTX_SW_TSTAMP    | \
> > -                              SKBTX_SCHED_TSTAMP)
> > +                              SKBTX_SCHED_TSTAMP | \
> > +                              SKBTX_BPF)
> >  #define SKBTX_ANY_TSTAMP     (SKBTX_HW_TSTAMP | \
> >                                SKBTX_HW_TSTAMP_USE_CYCLES | \
> >                                SKBTX_ANY_SW_TSTAMP)
> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > index 6116eb3d1515..30d2c078966b 100644
> > --- a/include/uapi/linux/bpf.h
> > +++ b/include/uapi/linux/bpf.h
> > @@ -7032,6 +7032,10 @@ enum {
> >                                        * by the kernel or the
> >                                        * earlier bpf-progs.
> >                                        */
> > +     BPF_SOCK_OPS_TS_SCHED_OPT_CB,   /* Called when skb is passing through
> > +                                      * dev layer when SK_BPF_CB_TX_TIMESTAMPING
> > +                                      * feature is on.
> > +                                      */
> >  };
> >
> >  /* List of TCP states. There is a build check in net/ipv4/tcp.c to detect
> > diff --git a/net/core/dev.c b/net/core/dev.c
> > index d77b8389753e..4f291459d6b1 100644
> > --- a/net/core/dev.c
> > +++ b/net/core/dev.c
> > @@ -4500,7 +4500,8 @@ int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
> >       skb_reset_mac_header(skb);
> >       skb_assert_len(skb);
> >
> > -     if (unlikely(skb_shinfo(skb)->tx_flags & SKBTX_SCHED_TSTAMP))
> > +     if (unlikely(skb_shinfo(skb)->tx_flags &
> > +                  (SKBTX_SCHED_TSTAMP | SKBTX_BPF)))
> >               __skb_tstamp_tx(skb, NULL, NULL, skb->sk, true, SCM_TSTAMP_SCHED);
> >
> >       /* Disable soft irqs for various locks below. Also
> > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > index 6042961dfc02..b7261e886529 100644
> > --- a/net/core/skbuff.c
> > +++ b/net/core/skbuff.c
> > @@ -5564,6 +5564,21 @@ static bool skb_enable_app_tstamp(struct sk_buff *skb, int tstype, bool sw)
> >       return false;
> >  }
> >
> > +static void skb_tstamp_tx_bpf(struct sk_buff *skb, struct sock *sk, int tstype)
> > +{
> > +     int op;
> > +
> > +     switch (tstype) {
> > +     case SCM_TSTAMP_SCHED:
> > +             op = BPF_SOCK_OPS_TS_SCHED_OPT_CB;
> > +             break;
> > +     default:
> > +             return;
> > +     }
> > +
> > +     bpf_skops_tx_timestamping(sk, skb, op);
> > +}
> > +
> >  void __skb_tstamp_tx(struct sk_buff *orig_skb,
> >                    const struct sk_buff *ack_skb,
> >                    struct skb_shared_hwtstamps *hwtstamps,
> > @@ -5576,6 +5591,11 @@ void __skb_tstamp_tx(struct sk_buff *orig_skb,
> >       if (!sk)
> >               return;
> >
> > +     /* bpf extension feature entry */
> > +     if (skb_shinfo(orig_skb)->tx_flags & SKBTX_BPF)
> > +             skb_tstamp_tx_bpf(orig_skb, sk, tstype);
> > +
> > +     /* application feature entry */
> >       if (!skb_enable_app_tstamp(orig_skb, tstype, sw))
> >               return;
> >
> > diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> > index 70366f74ef4e..eed91b7296b7 100644
> > --- a/tools/include/uapi/linux/bpf.h
> > +++ b/tools/include/uapi/linux/bpf.h
> > @@ -7025,6 +7025,10 @@ enum {
> >                                        * by the kernel or the
> >                                        * earlier bpf-progs.
> >                                        */
> > +     BPF_SOCK_OPS_TS_SCHED_OPT_CB,   /* Called when skb is passing through
> > +                                      * dev layer when SK_BPF_CB_TX_TIMESTAMPING
> > +                                      * feature is on.
> > +                                      */
> >  };
> >
> >  /* List of TCP states. There is a build check in net/ipv4/tcp.c to detect
> > --
> > 2.43.5
> >
>
>


* Re: [PATCH bpf-next v8 08/12] bpf: support hw SCM_TSTAMP_SND of SO_TIMESTAMPING
  2025-02-05 15:45   ` Willem de Bruijn
@ 2025-02-05 16:03     ` Jason Xing
  2025-02-10 22:39       ` Martin KaFai Lau
  0 siblings, 1 reply; 66+ messages in thread
From: Jason Xing @ 2025-02-05 16:03 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: davem, edumazet, kuba, pabeni, dsahern, willemb, ast, daniel,
	andrii, martin.lau, eddyz87, song, yonghong.song, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, horms, bpf, netdev

On Wed, Feb 5, 2025 at 11:45 PM Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
>
> Jason Xing wrote:
> > Patch finishes the hardware part.
>
> For consistency with previous patches, and to make sense of this
> commit message on its own, when stumbling on it, e.g., through
> git blame, replace the above with
>
> Support hw SCM_TSTAMP_SND case.
>
> > Then bpf program can fetch the
> > hwstamp from skb directly.
> >
> > To avoid changing so many callers using SKBTX_HW_TSTAMP from drivers,
> > use this simple modification like this patch does to support printing
> > hardware timestamp.
>
> Which simple modification? Please state explicitly.
>
> > Signed-off-by: Jason Xing <kerneljasonxing@gmail.com>
> > ---
> >  include/linux/skbuff.h         |  4 +++-
> >  include/uapi/linux/bpf.h       |  7 +++++++
> >  net/core/skbuff.c              | 13 +++++++------
> >  net/dsa/user.c                 |  2 +-
> >  net/socket.c                   |  2 +-
> >  tools/include/uapi/linux/bpf.h |  7 +++++++
> >  6 files changed, 26 insertions(+), 9 deletions(-)
> >
> > diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> > index de8d3bd311f5..df2d790ae36b 100644
> > --- a/include/linux/skbuff.h
> > +++ b/include/linux/skbuff.h
> > @@ -471,7 +471,7 @@ struct skb_shared_hwtstamps {
> >  /* Definitions for tx_flags in struct skb_shared_info */
> >  enum {
> >       /* generate hardware time stamp */
> > -     SKBTX_HW_TSTAMP = 1 << 0,
> > +     __SKBTX_HW_TSTAMP = 1 << 0,
>
> Perhaps we can have a more descriptive name. SKBTX_HW_TSTAMP_NOBPF?

Good suggestion.

>
> >
> >       /* generate software time stamp when queueing packet to NIC */
> >       SKBTX_SW_TSTAMP = 1 << 1,
> > @@ -495,6 +495,8 @@ enum {
> >       SKBTX_BPF = 1 << 7,
> >  };
> >
> > +#define SKBTX_HW_TSTAMP              (__SKBTX_HW_TSTAMP | SKBTX_BPF)
> > +
> >  #define SKBTX_ANY_SW_TSTAMP  (SKBTX_SW_TSTAMP    | \
> >                                SKBTX_SCHED_TSTAMP | \
> >                                SKBTX_BPF)
> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > index 6a1083bcf779..4c3566f623c2 100644
> > --- a/include/uapi/linux/bpf.h
> > +++ b/include/uapi/linux/bpf.h
> > @@ -7040,6 +7040,13 @@ enum {
> >                                        * to the nic when SK_BPF_CB_TX_TIMESTAMPING
> >                                        * feature is on.
> >                                        */
> > +     BPF_SOCK_OPS_TS_HW_OPT_CB,      /* Called in hardware phase when
> > +                                      * SK_BPF_CB_TX_TIMESTAMPING feature
> > +                                      * is on. At the same time, hwtstamps
> > +                                      * of skb is initialized as the
> > +                                      * timestamp that hardware just
> > +                                      * generates.
>
> "hwtstamp of skb is initialized" is this something new? Or are you
> just conveying that when this callback is called, skb_hwtstamps(skb)
> is non-zero? If the latter, drop, because the wording is confusing.

I will update it.

>
> > +                                      */
> >  };
> >
> >  /* List of TCP states. There is a build check in net/ipv4/tcp.c to detect
> > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > index b22d079e7143..264435f989ad 100644
> > --- a/net/core/skbuff.c
> > +++ b/net/core/skbuff.c
> > @@ -5548,7 +5548,7 @@ static bool skb_enable_app_tstamp(struct sk_buff *skb, int tstype, bool sw)
> >               flag = SKBTX_SCHED_TSTAMP;
> >               break;
> >       case SCM_TSTAMP_SND:
> > -             flag = sw ? SKBTX_SW_TSTAMP : SKBTX_HW_TSTAMP;
> > +             flag = sw ? SKBTX_SW_TSTAMP : __SKBTX_HW_TSTAMP;
> >               break;
> >       case SCM_TSTAMP_ACK:
> >               if (TCP_SKB_CB(skb)->txstamp_ack)
> > @@ -5565,7 +5565,8 @@ static bool skb_enable_app_tstamp(struct sk_buff *skb, int tstype, bool sw)
> >  }
> >
> >  static void skb_tstamp_tx_bpf(struct sk_buff *skb, struct sock *sk,
> > -                           int tstype, bool sw)
> > +                           int tstype, bool sw,
> > +                           struct skb_shared_hwtstamps *hwtstamps)
> >  {
> >       int op;
> >
> > @@ -5574,9 +5575,9 @@ static void skb_tstamp_tx_bpf(struct sk_buff *skb, struct sock *sk,
> >               op = BPF_SOCK_OPS_TS_SCHED_OPT_CB;
> >               break;
> >       case SCM_TSTAMP_SND:
> > -             if (!sw)
> > -                     return;
> > -             op = BPF_SOCK_OPS_TS_SW_OPT_CB;
> > +             op = sw ? BPF_SOCK_OPS_TS_SW_OPT_CB : BPF_SOCK_OPS_TS_HW_OPT_CB;
> > +             if (!sw && hwtstamps)
> > +                     *skb_hwtstamps(skb) = *hwtstamps;
>
> Isn't this called by drivers that have actually set skb_hwtstamps?

Oops, somehow my mind has gone blank :( Will remove it. Thanks for
correcting me!

> >               break;
> >       default:
> >               return;
> > @@ -5599,7 +5600,7 @@ void __skb_tstamp_tx(struct sk_buff *orig_skb,
> >
> >       /* bpf extension feature entry */
> >       if (skb_shinfo(orig_skb)->tx_flags & SKBTX_BPF)
> > -             skb_tstamp_tx_bpf(orig_skb, sk, tstype, sw);
> > +             skb_tstamp_tx_bpf(orig_skb, sk, tstype, sw, hwtstamps);
> >
> >       /* application feature entry */
> >       if (!skb_enable_app_tstamp(orig_skb, tstype, sw))
> > diff --git a/net/dsa/user.c b/net/dsa/user.c
> > index 291ab1b4acc4..ae715bf0ae75 100644
> > --- a/net/dsa/user.c
> > +++ b/net/dsa/user.c
> > @@ -897,7 +897,7 @@ static void dsa_skb_tx_timestamp(struct dsa_user_priv *p,
> >  {
> >       struct dsa_switch *ds = p->dp->ds;
> >
> > -     if (!(skb_shinfo(skb)->tx_flags & SKBTX_HW_TSTAMP))
> > +     if (!(skb_shinfo(skb)->tx_flags & __SKBTX_HW_TSTAMP))
> >               return;
> >
> >       if (!ds->ops->port_txtstamp)
> > diff --git a/net/socket.c b/net/socket.c
> > index 262a28b59c7f..70eabb510ce6 100644
> > --- a/net/socket.c
> > +++ b/net/socket.c
> > @@ -676,7 +676,7 @@ void __sock_tx_timestamp(__u32 tsflags, __u8 *tx_flags)
> >       u8 flags = *tx_flags;
> >
> >       if (tsflags & SOF_TIMESTAMPING_TX_HARDWARE) {
> > -             flags |= SKBTX_HW_TSTAMP;
> > +             flags |= __SKBTX_HW_TSTAMP;
> >
> >               /* PTP hardware clocks can provide a free running cycle counter
> >                * as a time base for virtual clocks. Tell driver to use the
> > diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> > index 9bd1c7c77b17..974b7f61d11f 100644
> > --- a/tools/include/uapi/linux/bpf.h
> > +++ b/tools/include/uapi/linux/bpf.h
> > @@ -7033,6 +7033,13 @@ enum {
> >                                        * to the nic when SK_BPF_CB_TX_TIMESTAMPING
> >                                        * feature is on.
> >                                        */
> > +     BPF_SOCK_OPS_TS_HW_OPT_CB,      /* Called in hardware phase when
> > +                                      * SK_BPF_CB_TX_TIMESTAMPING feature
> > +                                      * is on. At the same time, hwtstamps
> > +                                      * of skb is initialized as the
> > +                                      * timestamp that hardware just
> > +                                      * generates.
> > +                                      */
> >  };
> >
> >  /* List of TCP states. There is a build check in net/ipv4/tcp.c to detect
> > --
> > 2.43.5
> >
>
>


* Re: [PATCH bpf-next v8 09/12] bpf: support SCM_TSTAMP_ACK of SO_TIMESTAMPING
  2025-02-05 15:47   ` Willem de Bruijn
@ 2025-02-05 16:06     ` Jason Xing
  2025-02-05 21:25       ` Willem de Bruijn
  0 siblings, 1 reply; 66+ messages in thread
From: Jason Xing @ 2025-02-05 16:06 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: davem, edumazet, kuba, pabeni, dsahern, willemb, ast, daniel,
	andrii, martin.lau, eddyz87, song, yonghong.song, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, horms, bpf, netdev

On Wed, Feb 5, 2025 at 11:47 PM Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
>
> Jason Xing wrote:
> > Handle the ACK timestamp case. Actually testing SKBTX_BPF flag
> > can work, but Introducing a new txstamp_ack_bpf to avoid cache
>
> repeat comment: s/Introducing/introduce
>
> > line misses in tcp_ack_tstamp() is needed. To be more specific,
> > in most cases, normal flows would not access skb_shinfo as
> > txstamp_ack is zero, so that this function won't appear in the
> > hot spot lists. Introducing a new member txstamp_ack_bpf works
> > similarly.
> >
> > Signed-off-by: Jason Xing <kerneljasonxing@gmail.com>
> > ---
> >  include/net/tcp.h              | 3 ++-
> >  include/uapi/linux/bpf.h       | 5 +++++
> >  net/core/skbuff.c              | 3 +++
> >  net/ipv4/tcp_input.c           | 3 ++-
> >  net/ipv4/tcp_output.c          | 5 +++++
> >  tools/include/uapi/linux/bpf.h | 5 +++++
> >  6 files changed, 22 insertions(+), 2 deletions(-)
> >
> > diff --git a/include/net/tcp.h b/include/net/tcp.h
> > index 293047694710..88429e422301 100644
> > --- a/include/net/tcp.h
> > +++ b/include/net/tcp.h
> > @@ -959,9 +959,10 @@ struct tcp_skb_cb {
> >       __u8            sacked;         /* State flags for SACK.        */
> >       __u8            ip_dsfield;     /* IPv4 tos or IPv6 dsfield     */
> >       __u8            txstamp_ack:1,  /* Record TX timestamp for ack? */
> > +                     txstamp_ack_bpf:1,      /* ack timestamp for bpf use */
> >                       eor:1,          /* Is skb MSG_EOR marked? */
> >                       has_rxtstamp:1, /* SKB has a RX timestamp       */
> > -                     unused:5;
> > +                     unused:4;
> >       __u32           ack_seq;        /* Sequence number ACK'd        */
> >       union {
> >               struct {
> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > index 4c3566f623c2..800122a8abe5 100644
> > --- a/include/uapi/linux/bpf.h
> > +++ b/include/uapi/linux/bpf.h
> > @@ -7047,6 +7047,11 @@ enum {
> >                                        * timestamp that hardware just
> >                                        * generates.
> >                                        */
> > +     BPF_SOCK_OPS_TS_ACK_OPT_CB,     /* Called when all the skbs in the
> > +                                      * same sendmsg call are acked
> > +                                      * when SK_BPF_CB_TX_TIMESTAMPING
> > +                                      * feature is on.
> > +                                      */
> >  };
> >
> >  /* List of TCP states. There is a build check in net/ipv4/tcp.c to detect
> > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > index 264435f989ad..a8463fef574a 100644
> > --- a/net/core/skbuff.c
> > +++ b/net/core/skbuff.c
> > @@ -5579,6 +5579,9 @@ static void skb_tstamp_tx_bpf(struct sk_buff *skb, struct sock *sk,
> >               if (!sw && hwtstamps)
> >                       *skb_hwtstamps(skb) = *hwtstamps;
> >               break;
> > +     case SCM_TSTAMP_ACK:
> > +             op = BPF_SOCK_OPS_TS_ACK_OPT_CB;
> > +             break;
> >       default:
> >               return;
> >       }
> > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> > index 62252702929d..c8945f5be31b 100644
> > --- a/net/ipv4/tcp_input.c
> > +++ b/net/ipv4/tcp_input.c
> > @@ -3323,7 +3323,8 @@ static void tcp_ack_tstamp(struct sock *sk, struct sk_buff *skb,
> >       const struct skb_shared_info *shinfo;
> >
> >       /* Avoid cache line misses to get skb_shinfo() and shinfo->tx_flags */
> > -     if (likely(!TCP_SKB_CB(skb)->txstamp_ack))
> > +     if (likely(!TCP_SKB_CB(skb)->txstamp_ack &&
> > +                !TCP_SKB_CB(skb)->txstamp_ack_bpf))
>
> Here and elsewhere: instead of requiring multiple tests, how about
> extending txstamp_ack to a two-bit field, so that a single branch
> suffices.

It should work. Let me assume 1 stands for SO_TIMESTAMPING and 2 for the bpf extension?

>
> >               return;
> >
> >       shinfo = skb_shinfo(skb);
> > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> > index 695749807c09..fc84ca669b76 100644
> > --- a/net/ipv4/tcp_output.c
> > +++ b/net/ipv4/tcp_output.c
> > @@ -1556,6 +1556,7 @@ static void tcp_adjust_pcount(struct sock *sk, const struct sk_buff *skb, int de
> >  static bool tcp_has_tx_tstamp(const struct sk_buff *skb)
> >  {
> >       return TCP_SKB_CB(skb)->txstamp_ack ||
> > +            TCP_SKB_CB(skb)->txstamp_ack_bpf ||
> >               (skb_shinfo(skb)->tx_flags & SKBTX_ANY_TSTAMP);
> >  }
> >
> > @@ -1572,7 +1573,9 @@ static void tcp_fragment_tstamp(struct sk_buff *skb, struct sk_buff *skb2)
> >               shinfo2->tx_flags |= tsflags;
> >               swap(shinfo->tskey, shinfo2->tskey);
> >               TCP_SKB_CB(skb2)->txstamp_ack = TCP_SKB_CB(skb)->txstamp_ack;
> > +             TCP_SKB_CB(skb2)->txstamp_ack_bpf = TCP_SKB_CB(skb)->txstamp_ack_bpf;
> >               TCP_SKB_CB(skb)->txstamp_ack = 0;
> > +             TCP_SKB_CB(skb)->txstamp_ack_bpf = 0;
> >       }
> >  }
> >
> > @@ -3213,6 +3216,8 @@ void tcp_skb_collapse_tstamp(struct sk_buff *skb,
> >               shinfo->tskey = next_shinfo->tskey;
> >               TCP_SKB_CB(skb)->txstamp_ack |=
> >                       TCP_SKB_CB(next_skb)->txstamp_ack;
> > +             TCP_SKB_CB(skb)->txstamp_ack_bpf |=
> > +                     TCP_SKB_CB(next_skb)->txstamp_ack_bpf;
> >       }
> >  }
> >
> > diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> > index 974b7f61d11f..06e68d772989 100644
> > --- a/tools/include/uapi/linux/bpf.h
> > +++ b/tools/include/uapi/linux/bpf.h
> > @@ -7040,6 +7040,11 @@ enum {
> >                                        * timestamp that hardware just
> >                                        * generates.
> >                                        */
> > +     BPF_SOCK_OPS_TS_ACK_OPT_CB,     /* Called when all the skbs in the
> > +                                      * same sendmsg call are acked
> > +                                      * when SK_BPF_CB_TX_TIMESTAMPING
> > +                                      * feature is on.
> > +                                      */
> >  };
> >
> >  /* List of TCP states. There is a build check in net/ipv4/tcp.c to detect
> > --
> > 2.43.5
> >
>
>

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH bpf-next v8 12/12] selftests/bpf: add simple bpf tests in the tx path for timestamping feature
  2025-02-05 15:54   ` Willem de Bruijn
@ 2025-02-05 16:08     ` Jason Xing
  2025-02-06  1:28       ` Martin KaFai Lau
  0 siblings, 1 reply; 66+ messages in thread
From: Jason Xing @ 2025-02-05 16:08 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: davem, edumazet, kuba, pabeni, dsahern, willemb, ast, daniel,
	andrii, martin.lau, eddyz87, song, yonghong.song, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, horms, bpf, netdev

On Wed, Feb 5, 2025 at 11:54 PM Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
>
> Jason Xing wrote:
> > Bpf prog calculates a couple of latency delta between each tx points
> > which SO_TIMESTAMPING feature has already implemented. It can be used
> > in the real world to diagnose the behaviour in the tx path.
> >
> > Also, check the safety issues by accessing a few bpf calls in
> > bpf_test_access_bpf_calls().
> >
> > Signed-off-by: Jason Xing <kerneljasonxing@gmail.com>
>
> > +static bool bpf_test_delay(struct bpf_sock_ops *skops, const struct sock *sk)
> > +{
> > +     struct bpf_sock_ops_kern *skops_kern;
> > +     u64 timestamp = bpf_ktime_get_ns();
> > +     struct skb_shared_info *shinfo;
> > +     struct delay_info dinfo = {0};
> > +     struct sk_tskey key = {0};
> > +     struct delay_info *val;
> > +     struct sk_buff *skb;
> > +     struct sk_stg *stg;
> > +     u64 prior_ts, delay;
> > +
> > +     if (bpf_test_access_bpf_calls(skops, sk))
> > +             return false;
> > +
> > +     skops_kern = bpf_cast_to_kern_ctx(skops);
> > +     skb = skops_kern->skb;
> > +     shinfo = bpf_core_cast(skb->head + skb->end, struct skb_shared_info);
> > +     key.tskey = shinfo->tskey;
> > +     if (!key.tskey)
> > +             return false;
> > +
> > +     key.cookie = bpf_get_socket_cookie(skops);
> > +     if (!key.cookie)
> > +             return false;
> > +
> > +     if (skops->op == BPF_SOCK_OPS_TS_SND_CB) {
> > +             stg = bpf_sk_storage_get(&sk_stg_map, (void *)sk, 0, 0);
> > +             if (!stg)
> > +                     return false;
> > +             dinfo.sendmsg_ns = stg->sendmsg_ns;
> > +             bpf_map_update_elem(&time_map, &key, &dinfo, BPF_ANY);
> > +             goto out;
> > +     }
> > +
> > +     val = bpf_map_lookup_elem(&time_map, &key);
> > +     if (!val)
> > +             return false;
> > +
> > +     switch (skops->op) {
> > +     case BPF_SOCK_OPS_TS_SCHED_OPT_CB:
> > +             delay = val->sched_delay = timestamp - val->sendmsg_ns;
> > +             break;
>
> For a test this is fine. But just a reminder that in general a packet
> may pass through multiple qdiscs. For instance with bonding or tunnel
> virtual devices in the egress path.

Right, I've seen this in production (two qdisc timestamps because of
bonding).

>
> > +     case BPF_SOCK_OPS_TS_SW_OPT_CB:
> > +             prior_ts = val->sched_delay + val->sendmsg_ns;
> > +             delay = val->sw_snd_delay = timestamp - prior_ts;
> > +             break;
> > +     case BPF_SOCK_OPS_TS_ACK_OPT_CB:
> > +             prior_ts = val->sw_snd_delay + val->sched_delay + val->sendmsg_ns;
> > +             delay = val->ack_delay = timestamp - prior_ts;
> > +             break;
>
> Similar to the above: fine for a test, but in practice be aware that
> packets may be resent, in which case an ACK might precede a repeat
> SCHED and SND. And erroneous or malicious peers may also just never
> send an ACK. So this can never be relied on in production settings,
> e.g., as the only signal to clear an entry from a map (as done in the
> branch below).

Agreed. In production, what we actually do is print all the timestamps
and let an agent parse them.

>
> > +     }
> > +
> > +     if (delay >= delay_tolerance_nsec)
> > +             return false;
> > +
> > +     /* Since it's the last one, remove from the map after latency check */
> > +     if (skops->op == BPF_SOCK_OPS_TS_ACK_OPT_CB)
> > +             bpf_map_delete_elem(&time_map, &key);
> > +
> > +out:
> > +     return true;
> > +}
> > +


* Re: [PATCH bpf-next v8 01/12] bpf: add support for bpf_setsockopt()
  2025-02-05 15:34     ` Jason Xing
@ 2025-02-05 20:57       ` Martin KaFai Lau
  2025-02-05 21:25       ` Willem de Bruijn
  1 sibling, 0 replies; 66+ messages in thread
From: Martin KaFai Lau @ 2025-02-05 20:57 UTC (permalink / raw)
  To: Jason Xing, Willem de Bruijn
  Cc: davem, edumazet, kuba, pabeni, dsahern, willemb, ast, daniel,
	andrii, eddyz87, song, yonghong.song, john.fastabend, kpsingh,
	sdf, haoluo, jolsa, horms, bpf, netdev

On 2/5/25 7:34 AM, Jason Xing wrote:
> On Wed, Feb 5, 2025 at 11:22 PM Willem de Bruijn
> <willemdebruijn.kernel@gmail.com> wrote:
>>
>> Jason Xing wrote:
>>> Users can write the following code to enable the bpf extension:
>>> int flags = SK_BPF_CB_TX_TIMESTAMPING;
>>> int opts = SK_BPF_CB_FLAGS;
>>> bpf_setsockopt(skops, SOL_SOCKET, opts, &flags, sizeof(flags));
>>>
>>> Signed-off-by: Jason Xing <kerneljasonxing@gmail.com>
>>> ---
>>>   include/net/sock.h             |  3 +++
>>>   include/uapi/linux/bpf.h       |  8 ++++++++
>>>   net/core/filter.c              | 23 +++++++++++++++++++++++
>>>   tools/include/uapi/linux/bpf.h |  1 +
>>>   4 files changed, 35 insertions(+)
>>>
>>> diff --git a/include/net/sock.h b/include/net/sock.h
>>> index 8036b3b79cd8..7916982343c6 100644
>>> --- a/include/net/sock.h
>>> +++ b/include/net/sock.h
>>> @@ -303,6 +303,7 @@ struct sk_filter;
>>>     *  @sk_stamp: time stamp of last packet received
>>>     *  @sk_stamp_seq: lock for accessing sk_stamp on 32 bit architectures only
>>>     *  @sk_tsflags: SO_TIMESTAMPING flags
>>> +  *  @sk_bpf_cb_flags: used in bpf_setsockopt()
>>>     *  @sk_use_task_frag: allow sk_page_frag() to use current->task_frag.
>>>     *                     Sockets that can be used under memory reclaim should
>>>     *                     set this to false.
>>> @@ -445,6 +446,8 @@ struct sock {
>>>        u32                     sk_reserved_mem;
>>>        int                     sk_forward_alloc;
>>>        u32                     sk_tsflags;
>>> +#define SK_BPF_CB_FLAG_TEST(SK, FLAG) ((SK)->sk_bpf_cb_flags & (FLAG))
>>> +     u32                     sk_bpf_cb_flags;
>>>        __cacheline_group_end(sock_write_rxtx);
>>>
>>>        __cacheline_group_begin(sock_write_tx);
>>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>>> index 2acf9b336371..6116eb3d1515 100644
>>> --- a/include/uapi/linux/bpf.h
>>> +++ b/include/uapi/linux/bpf.h
>>> @@ -6913,6 +6913,13 @@ enum {
>>>        BPF_SOCK_OPS_ALL_CB_FLAGS       = 0x7F,
>>>   };
>>>
>>> +/* Definitions for bpf_sk_cb_flags */
>>> +enum {
>>> +     SK_BPF_CB_TX_TIMESTAMPING       = 1<<0,
>>> +     SK_BPF_CB_MASK                  = (SK_BPF_CB_TX_TIMESTAMPING - 1) |
>>> +                                        SK_BPF_CB_TX_TIMESTAMPING
>>> +};
>>> +
>>>   /* List of known BPF sock_ops operators.
>>>    * New entries can only be added at the end
>>>    */
>>> @@ -7091,6 +7098,7 @@ enum {
>>>        TCP_BPF_SYN_IP          = 1006, /* Copy the IP[46] and TCP header */
>>>        TCP_BPF_SYN_MAC         = 1007, /* Copy the MAC, IP[46], and TCP header */
>>>        TCP_BPF_SOCK_OPS_CB_FLAGS = 1008, /* Get or Set TCP sock ops flags */
>>> +     SK_BPF_CB_FLAGS         = 1009, /* Used to set socket bpf flags */
>>>   };
>>>
>>>   enum {
>>> diff --git a/net/core/filter.c b/net/core/filter.c
>>> index 2ec162dd83c4..1c6c07507a78 100644
>>> --- a/net/core/filter.c
>>> +++ b/net/core/filter.c
>>> @@ -5222,6 +5222,25 @@ static const struct bpf_func_proto bpf_get_socket_uid_proto = {
>>>        .arg1_type      = ARG_PTR_TO_CTX,
>>>   };
>>>
>>> +static int sk_bpf_set_get_cb_flags(struct sock *sk, char *optval, bool getopt)
>>> +{
>>> +     u32 sk_bpf_cb_flags;
>>> +
>>> +     if (getopt) {
>>> +             *(u32 *)optval = sk->sk_bpf_cb_flags;
>>> +             return 0;
>>> +     }
>>> +
>>> +     sk_bpf_cb_flags = *(u32 *)optval;
>>> +
>>> +     if (sk_bpf_cb_flags & ~SK_BPF_CB_MASK)
>>> +             return -EINVAL;
>>> +
>>> +     sk->sk_bpf_cb_flags = sk_bpf_cb_flags;
>>
>> I don't know BPF internals that well:
>>
>> Is there mutual exclusion between these sol_socket_sockopt calls?

Yep. There is a sock_owned_by_me() in the earlier code path of sol_socket_sockopt().

Another reader is in patch 11's tcp_tx_timestamp(), which should also hold
the sock lock.

>> Or do these sk field accesses need WRITE_ONCE/READ_ONCE.

> this potential data race issue, just in case bpf program doesn't use
> it as we expect, I think I will add the this annotation in v9.

Jason, there should not be a data race issue in sol_socket_sockopt(). A bpf 
program cannot use sol_socket_sockopt() without holding the lock. Patch 4 was 
added exactly to ensure that.

The situation is similar to the existing tcp_sk(sk)->bpf_sock_ops_cb_flags. It 
is also a plain access, so it is clear that all reads/writes happen under 
sock_owned_by_me().


* Re: [PATCH bpf-next v8 01/12] bpf: add support for bpf_setsockopt()
  2025-02-05 15:34     ` Jason Xing
  2025-02-05 20:57       ` Martin KaFai Lau
@ 2025-02-05 21:25       ` Willem de Bruijn
  1 sibling, 0 replies; 66+ messages in thread
From: Willem de Bruijn @ 2025-02-05 21:25 UTC (permalink / raw)
  To: Jason Xing, Willem de Bruijn
  Cc: davem, edumazet, kuba, pabeni, dsahern, willemb, ast, daniel,
	andrii, martin.lau, eddyz87, song, yonghong.song, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, horms, bpf, netdev

Jason Xing wrote:
> On Wed, Feb 5, 2025 at 11:22 PM Willem de Bruijn
> <willemdebruijn.kernel@gmail.com> wrote:
> >
> > Jason Xing wrote:
> > > Users can write the following code to enable the bpf extension:
> > > int flags = SK_BPF_CB_TX_TIMESTAMPING;
> > > int opts = SK_BPF_CB_FLAGS;
> > > bpf_setsockopt(skops, SOL_SOCKET, opts, &flags, sizeof(flags));
> > >
> > > Signed-off-by: Jason Xing <kerneljasonxing@gmail.com>
> > > ---
> > >  include/net/sock.h             |  3 +++
> > >  include/uapi/linux/bpf.h       |  8 ++++++++
> > >  net/core/filter.c              | 23 +++++++++++++++++++++++
> > >  tools/include/uapi/linux/bpf.h |  1 +
> > >  4 files changed, 35 insertions(+)
> > >
> > > diff --git a/include/net/sock.h b/include/net/sock.h
> > > index 8036b3b79cd8..7916982343c6 100644
> > > --- a/include/net/sock.h
> > > +++ b/include/net/sock.h
> > > @@ -303,6 +303,7 @@ struct sk_filter;
> > >    *  @sk_stamp: time stamp of last packet received
> > >    *  @sk_stamp_seq: lock for accessing sk_stamp on 32 bit architectures only
> > >    *  @sk_tsflags: SO_TIMESTAMPING flags
> > > +  *  @sk_bpf_cb_flags: used in bpf_setsockopt()
> > >    *  @sk_use_task_frag: allow sk_page_frag() to use current->task_frag.
> > >    *                     Sockets that can be used under memory reclaim should
> > >    *                     set this to false.
> > > @@ -445,6 +446,8 @@ struct sock {
> > >       u32                     sk_reserved_mem;
> > >       int                     sk_forward_alloc;
> > >       u32                     sk_tsflags;
> > > +#define SK_BPF_CB_FLAG_TEST(SK, FLAG) ((SK)->sk_bpf_cb_flags & (FLAG))
> > > +     u32                     sk_bpf_cb_flags;
> > >       __cacheline_group_end(sock_write_rxtx);
> > >
> > >       __cacheline_group_begin(sock_write_tx);
> > > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > > index 2acf9b336371..6116eb3d1515 100644
> > > --- a/include/uapi/linux/bpf.h
> > > +++ b/include/uapi/linux/bpf.h
> > > @@ -6913,6 +6913,13 @@ enum {
> > >       BPF_SOCK_OPS_ALL_CB_FLAGS       = 0x7F,
> > >  };
> > >
> > > +/* Definitions for bpf_sk_cb_flags */
> > > +enum {
> > > +     SK_BPF_CB_TX_TIMESTAMPING       = 1<<0,
> > > +     SK_BPF_CB_MASK                  = (SK_BPF_CB_TX_TIMESTAMPING - 1) |
> > > +                                        SK_BPF_CB_TX_TIMESTAMPING
> > > +};
> > > +
> > >  /* List of known BPF sock_ops operators.
> > >   * New entries can only be added at the end
> > >   */
> > > @@ -7091,6 +7098,7 @@ enum {
> > >       TCP_BPF_SYN_IP          = 1006, /* Copy the IP[46] and TCP header */
> > >       TCP_BPF_SYN_MAC         = 1007, /* Copy the MAC, IP[46], and TCP header */
> > >       TCP_BPF_SOCK_OPS_CB_FLAGS = 1008, /* Get or Set TCP sock ops flags */
> > > +     SK_BPF_CB_FLAGS         = 1009, /* Used to set socket bpf flags */
> > >  };
> > >
> > >  enum {
> > > diff --git a/net/core/filter.c b/net/core/filter.c
> > > index 2ec162dd83c4..1c6c07507a78 100644
> > > --- a/net/core/filter.c
> > > +++ b/net/core/filter.c
> > > @@ -5222,6 +5222,25 @@ static const struct bpf_func_proto bpf_get_socket_uid_proto = {
> > >       .arg1_type      = ARG_PTR_TO_CTX,
> > >  };
> > >
> > > +static int sk_bpf_set_get_cb_flags(struct sock *sk, char *optval, bool getopt)
> > > +{
> > > +     u32 sk_bpf_cb_flags;
> > > +
> > > +     if (getopt) {
> > > +             *(u32 *)optval = sk->sk_bpf_cb_flags;
> > > +             return 0;
> > > +     }
> > > +
> > > +     sk_bpf_cb_flags = *(u32 *)optval;
> > > +
> > > +     if (sk_bpf_cb_flags & ~SK_BPF_CB_MASK)
> > > +             return -EINVAL;
> > > +
> > > +     sk->sk_bpf_cb_flags = sk_bpf_cb_flags;
> >
> > I don't know BPF internals that well:
> >
> > Is there mutual exclusion between these sol_socket_sockopt calls?
> > Or do these sk field accesses need WRITE_ONCE/READ_ONCE.
> 
> According to the existing callbacks (like
> BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB which I used in the selftests) in
> include/uapi/linux/bpf.h, they are under the socket lock protection.
> And the correct use of this feature is to set during the 3-way
> handshake that also is protected by lock. But after you remind me of
> this potential data race issue, just in case bpf program doesn't use
> it as we expect, I think I will add the this annotation in v9.

Let's not add instrumentation defensively where not needed. Doing so
confuses future readers who will assume that it was needed and cannot
see why.

Either leave it for now. Or if it is needed for lockless UDP, add it
now, but then add an explicit comment that it is for that use case.




* Re: [PATCH bpf-next v8 09/12] bpf: support SCM_TSTAMP_ACK of SO_TIMESTAMPING
  2025-02-05 16:06     ` Jason Xing
@ 2025-02-05 21:25       ` Willem de Bruijn
  0 siblings, 0 replies; 66+ messages in thread
From: Willem de Bruijn @ 2025-02-05 21:25 UTC (permalink / raw)
  To: Jason Xing, Willem de Bruijn
  Cc: davem, edumazet, kuba, pabeni, dsahern, willemb, ast, daniel,
	andrii, martin.lau, eddyz87, song, yonghong.song, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, horms, bpf, netdev

Jason Xing wrote:
> On Wed, Feb 5, 2025 at 11:47 PM Willem de Bruijn
> <willemdebruijn.kernel@gmail.com> wrote:
> >
> > Jason Xing wrote:
> > > Handle the ACK timestamp case. Actually testing SKBTX_BPF flag
> > > can work, but Introducing a new txstamp_ack_bpf to avoid cache
> >
> > repeat comment: s/Introducing/introduce
> >
> > > line misses in tcp_ack_tstamp() is needed. To be more specific,
> > > in most cases, normal flows would not access skb_shinfo as
> > > txstamp_ack is zero, so that this function won't appear in the
> > > hot spot lists. Introducing a new member txstamp_ack_bpf works
> > > similarly.
> > >
> > > Signed-off-by: Jason Xing <kerneljasonxing@gmail.com>
> > > ---
> > >  include/net/tcp.h              | 3 ++-
> > >  include/uapi/linux/bpf.h       | 5 +++++
> > >  net/core/skbuff.c              | 3 +++
> > >  net/ipv4/tcp_input.c           | 3 ++-
> > >  net/ipv4/tcp_output.c          | 5 +++++
> > >  tools/include/uapi/linux/bpf.h | 5 +++++
> > >  6 files changed, 22 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/include/net/tcp.h b/include/net/tcp.h
> > > index 293047694710..88429e422301 100644
> > > --- a/include/net/tcp.h
> > > +++ b/include/net/tcp.h
> > > @@ -959,9 +959,10 @@ struct tcp_skb_cb {
> > >       __u8            sacked;         /* State flags for SACK.        */
> > >       __u8            ip_dsfield;     /* IPv4 tos or IPv6 dsfield     */
> > >       __u8            txstamp_ack:1,  /* Record TX timestamp for ack? */
> > > +                     txstamp_ack_bpf:1,      /* ack timestamp for bpf use */
> > >                       eor:1,          /* Is skb MSG_EOR marked? */
> > >                       has_rxtstamp:1, /* SKB has a RX timestamp       */
> > > -                     unused:5;
> > > +                     unused:4;
> > >       __u32           ack_seq;        /* Sequence number ACK'd        */
> > >       union {
> > >               struct {
> > > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > > index 4c3566f623c2..800122a8abe5 100644
> > > --- a/include/uapi/linux/bpf.h
> > > +++ b/include/uapi/linux/bpf.h
> > > @@ -7047,6 +7047,11 @@ enum {
> > >                                        * timestamp that hardware just
> > >                                        * generates.
> > >                                        */
> > > +     BPF_SOCK_OPS_TS_ACK_OPT_CB,     /* Called when all the skbs in the
> > > +                                      * same sendmsg call are acked
> > > +                                      * when SK_BPF_CB_TX_TIMESTAMPING
> > > +                                      * feature is on.
> > > +                                      */
> > >  };
> > >
> > >  /* List of TCP states. There is a build check in net/ipv4/tcp.c to detect
> > > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > > index 264435f989ad..a8463fef574a 100644
> > > --- a/net/core/skbuff.c
> > > +++ b/net/core/skbuff.c
> > > @@ -5579,6 +5579,9 @@ static void skb_tstamp_tx_bpf(struct sk_buff *skb, struct sock *sk,
> > >               if (!sw && hwtstamps)
> > >                       *skb_hwtstamps(skb) = *hwtstamps;
> > >               break;
> > > +     case SCM_TSTAMP_ACK:
> > > +             op = BPF_SOCK_OPS_TS_ACK_OPT_CB;
> > > +             break;
> > >       default:
> > >               return;
> > >       }
> > > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> > > index 62252702929d..c8945f5be31b 100644
> > > --- a/net/ipv4/tcp_input.c
> > > +++ b/net/ipv4/tcp_input.c
> > > @@ -3323,7 +3323,8 @@ static void tcp_ack_tstamp(struct sock *sk, struct sk_buff *skb,
> > >       const struct skb_shared_info *shinfo;
> > >
> > >       /* Avoid cache line misses to get skb_shinfo() and shinfo->tx_flags */
> > > -     if (likely(!TCP_SKB_CB(skb)->txstamp_ack))
> > > +     if (likely(!TCP_SKB_CB(skb)->txstamp_ack &&
> > > +                !TCP_SKB_CB(skb)->txstamp_ack_bpf))
> >
> > Here and elsewhere: instead of requiring multiple tests, how about
> > extending txstamp_ack to a two-bit field, so that a single branch
> > suffices.
> 
> It should work. Let me assume 1 stands for so_timestamping, 2 bpf extension?

Sounds good


* Re: [PATCH bpf-next v8 10/12] bpf: make TCP tx timestamp bpf extension work
  2025-02-05  1:57   ` Jakub Kicinski
  2025-02-05  2:15     ` Jason Xing
@ 2025-02-05 21:57     ` Martin KaFai Lau
  2025-02-06  0:12       ` Jason Xing
  1 sibling, 1 reply; 66+ messages in thread
From: Martin KaFai Lau @ 2025-02-05 21:57 UTC (permalink / raw)
  To: Jakub Kicinski, Jason Xing
  Cc: davem, edumazet, pabeni, dsahern, willemdebruijn.kernel, willemb,
	ast, daniel, andrii, eddyz87, song, yonghong.song, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, horms, bpf, netdev

On 2/4/25 5:57 PM, Jakub Kicinski wrote:
> On Wed,  5 Feb 2025 02:30:22 +0800 Jason Xing wrote:
>> +	if (cgroup_bpf_enabled(CGROUP_SOCK_OPS) &&
>> +	    SK_BPF_CB_FLAG_TEST(sk, SK_BPF_CB_TX_TIMESTAMPING) && skb) {
>> +		struct skb_shared_info *shinfo = skb_shinfo(skb);
>> +		struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
>> +
>> +		tcb->txstamp_ack_bpf = 1;
>> +		shinfo->tx_flags |= SKBTX_BPF;
>> +		shinfo->tskey = TCP_SKB_CB(skb)->seq + skb->len - 1;
>> +	}
> 
> If BPF program is attached we'll timestamp all skbs? Am I reading this
> right?

If the attached bpf program explicitly turns on the SK_BPF_CB_TX_TIMESTAMPING 
bit of a sock, then all skbs of this sock will be tx-timestamped.

> 
> Wouldn't it be better to let BPF_SOCK_OPS_TS_SND_CB return whether it's
> interested in tracing current packet all the way thru the stack?

I like this idea. It can give the BPF prog a chance to do skb sampling on a 
particular socket.

The return value of BPF_SOCK_OPS_TS_SND_CB (or of any cgroup BPF prog) already 
has another usage, and that value is currently enforced by the verifier. It is 
better not to overload it further.

I would also prefer not to add more use cases to skops->reply, which is a union 
with args[4], such that later progs (in the cgrp prog array) may lose the args 
value.

Jason, instead of always setting SKBTX_BPF and txstamp_ack_bpf in the kernel, a 
new BPF kfunc can be added so that the BPF prog can call it to selectively set 
SKBTX_BPF and txstamp_ack_bpf in some skb.


* Re: [PATCH bpf-next v8 10/12] bpf: make TCP tx timestamp bpf extension work
  2025-02-05 21:57     ` Martin KaFai Lau
@ 2025-02-06  0:12       ` Jason Xing
  2025-02-06  0:42         ` Jason Xing
  2025-02-06  0:47         ` Martin KaFai Lau
  0 siblings, 2 replies; 66+ messages in thread
From: Jason Xing @ 2025-02-06  0:12 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: Jakub Kicinski, davem, edumazet, pabeni, dsahern,
	willemdebruijn.kernel, willemb, ast, daniel, andrii, eddyz87,
	song, yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa,
	horms, bpf, netdev

On Thu, Feb 6, 2025 at 5:57 AM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 2/4/25 5:57 PM, Jakub Kicinski wrote:
> > On Wed,  5 Feb 2025 02:30:22 +0800 Jason Xing wrote:
> >> +    if (cgroup_bpf_enabled(CGROUP_SOCK_OPS) &&
> >> +        SK_BPF_CB_FLAG_TEST(sk, SK_BPF_CB_TX_TIMESTAMPING) && skb) {
> >> +            struct skb_shared_info *shinfo = skb_shinfo(skb);
> >> +            struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
> >> +
> >> +            tcb->txstamp_ack_bpf = 1;
> >> +            shinfo->tx_flags |= SKBTX_BPF;
> >> +            shinfo->tskey = TCP_SKB_CB(skb)->seq + skb->len - 1;
> >> +    }
> >
> > If BPF program is attached we'll timestamp all skbs? Am I reading this
> > right?
>
> If the attached bpf program explicitly turns on the SK_BPF_CB_TX_TIMESTAMPING
> bit of a sock, then all skbs of this sock will be tx timestamp-ed.

Martin, I'm afraid it's not quite what you expect. Only the last
portion of the sendmsg will enter the above function, which means that
if the sendmsg size is large, only the last skb will have SKBTX_BPF set
and be timestamped.

>
> >
> > Wouldn't it be better to let BPF_SOCK_OPS_TS_SND_CB return whether it's
> > interested in tracing current packet all the way thru the stack?
>
> I like this idea. It can give the BPF prog a chance to do skb sampling on a
> particular socket.
>
> The return value of BPF_SOCK_OPS_TS_SND_CB (or any cgroup BPF prog return value)
> already has another usage, which its return value is currently enforced by the
> verifier. It is better not to convolute it further.
>
> I don't prefer to add more use cases to skops->reply either, which is an union
> of args[4], such that later progs (in the cgrp prog array) may lose the args value.
>
> Jason, instead of always setting SKBTX_BPF and txstamp_ack_bpf in the kernel, a
> new BPF kfunc can be added so that the BPF prog can call it to selectively set
> SKBTX_BPF and txstamp_ack_bpf in some skb.

Agreed. At netdev 0x19 I have an explicit plan to share our company's
experience of tracing all completed skbs through a kernel module. It's
how we use this in production, especially for debugging and diagnosis.

I'm not knowledgeable enough about BPF, so I'd like to know whether
there are some functions that I can take as good examples?

It's a good standalone feature; can I handle it after this series?

Thanks,
Jason


* Re: [PATCH bpf-next v8 10/12] bpf: make TCP tx timestamp bpf extension work
  2025-02-06  0:12       ` Jason Xing
@ 2025-02-06  0:42         ` Jason Xing
  2025-02-06  0:47         ` Martin KaFai Lau
  1 sibling, 0 replies; 66+ messages in thread
From: Jason Xing @ 2025-02-06  0:42 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: Jakub Kicinski, davem, edumazet, pabeni, dsahern,
	willemdebruijn.kernel, willemb, ast, daniel, andrii, eddyz87,
	song, yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa,
	horms, bpf, netdev

On Thu, Feb 6, 2025 at 8:12 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
>
> On Thu, Feb 6, 2025 at 5:57 AM Martin KaFai Lau <martin.lau@linux.dev> wrote:
> >
> > On 2/4/25 5:57 PM, Jakub Kicinski wrote:
> > > On Wed,  5 Feb 2025 02:30:22 +0800 Jason Xing wrote:
> > >> +    if (cgroup_bpf_enabled(CGROUP_SOCK_OPS) &&
> > >> +        SK_BPF_CB_FLAG_TEST(sk, SK_BPF_CB_TX_TIMESTAMPING) && skb) {
> > >> +            struct skb_shared_info *shinfo = skb_shinfo(skb);
> > >> +            struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
> > >> +
> > >> +            tcb->txstamp_ack_bpf = 1;
> > >> +            shinfo->tx_flags |= SKBTX_BPF;
> > >> +            shinfo->tskey = TCP_SKB_CB(skb)->seq + skb->len - 1;
> > >> +    }
> > >
> > > If BPF program is attached we'll timestamp all skbs? Am I reading this
> > > right?
> >
> > If the attached bpf program explicitly turns on the SK_BPF_CB_TX_TIMESTAMPING
> > bit of a sock, then all skbs of this sock will be tx timestamp-ed.
>
> Martin, I'm afraid it's not like what you expect. Only the last
> portion of the sendmsg will enter the above function which means if
> the size of sendmsg is large, only the last skb will be set SKBTX_BPF
> and be timestamped.

A long time ago, SO_TIMESTAMPING was mostly used to determine in which
layer a latency issue happens, especially to rule out the many cases
caused by the application itself[1].

Thanks to bpf, we can pay more attention to kernel behaviour, even to
tiny delays introduced by flow control, say, BQL or fair queueing in a
qdisc, which can be noticed by this bpf extension (for sure, it will
need more work, not now).

[1]
https://netdevconf.info/0x17/sessions/talk/so_timestamping-powering-fleetwide-rpc-monitoring.html
quoting Willem: "With SO_TIMESTAMPING, bugs that are otherwise
incorrectly assumed to be network issues can be attributed to the
kernel. It can isolate transmission, reception and even scheduling
sources."

>
> >
> > >
> > > Wouldn't it be better to let BPF_SOCK_OPS_TS_SND_CB return whether it's
> > > interested in tracing current packet all the way thru the stack?
> >
> > I like this idea. It can give the BPF prog a chance to do skb sampling on a
> > particular socket.
> >
> > The return value of BPF_SOCK_OPS_TS_SND_CB (or any cgroup BPF prog return value)
> > already has another usage, which its return value is currently enforced by the
> > verifier. It is better not to convolute it further.
> >
> > I don't prefer to add more use cases to skops->reply either, which is an union
> > of args[4], such that later progs (in the cgrp prog array) may lose the args value.
> >
> > Jason, instead of always setting SKBTX_BPF and txstamp_ack_bpf in the kernel, a
> > new BPF kfunc can be added so that the BPF prog can call it to selectively set
> > SKBTX_BPF and txstamp_ack_bpf in some skb.
>
> Agreed because at netdev 0x19 I have an explicit plan to share the
> experience from our company about how to trace all the skbs which were
> completed through a kernel module. It's how we use in production
> especially for debug or diagnose use.

I'm not sure whether you can see this link[2], because Jamal is still
working on publishing it officially. We can wait if it's not accessible
to you yet.

[2]: https://0x19.netdevconf.info/paper/5?cap=05arRrN3AEg11M

Thanks,
Jason


* Re: [PATCH bpf-next v8 10/12] bpf: make TCP tx timestamp bpf extension work
  2025-02-06  0:12       ` Jason Xing
  2025-02-06  0:42         ` Jason Xing
@ 2025-02-06  0:47         ` Martin KaFai Lau
  2025-02-06  1:05           ` Jason Xing
  1 sibling, 1 reply; 66+ messages in thread
From: Martin KaFai Lau @ 2025-02-06  0:47 UTC (permalink / raw)
  To: Jason Xing
  Cc: Jakub Kicinski, davem, edumazet, pabeni, dsahern,
	willemdebruijn.kernel, willemb, ast, daniel, andrii, eddyz87,
	song, yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa,
	horms, bpf, netdev

On 2/5/25 4:12 PM, Jason Xing wrote:
> On Thu, Feb 6, 2025 at 5:57 AM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>>
>> On 2/4/25 5:57 PM, Jakub Kicinski wrote:
>>> On Wed,  5 Feb 2025 02:30:22 +0800 Jason Xing wrote:
>>>> +    if (cgroup_bpf_enabled(CGROUP_SOCK_OPS) &&
>>>> +        SK_BPF_CB_FLAG_TEST(sk, SK_BPF_CB_TX_TIMESTAMPING) && skb) {
>>>> +            struct skb_shared_info *shinfo = skb_shinfo(skb);
>>>> +            struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
>>>> +
>>>> +            tcb->txstamp_ack_bpf = 1;
>>>> +            shinfo->tx_flags |= SKBTX_BPF;
>>>> +            shinfo->tskey = TCP_SKB_CB(skb)->seq + skb->len - 1;
>>>> +    }
>>>
>>> If BPF program is attached we'll timestamp all skbs? Am I reading this
>>> right?
>>
>> If the attached bpf program explicitly turns on the SK_BPF_CB_TX_TIMESTAMPING
>> bit of a sock, then all skbs of this sock will be tx timestamp-ed.
> 
> Martin, I'm afraid it's not like what you expect. Only the last
> portion of the sendmsg will enter the above function which means if
> the size of sendmsg is large, only the last skb will be set SKBTX_BPF
> and be timestamped.

Sure. The last skb of a large msg, and more skbs for small msgs (or with MSG_EOR).

My point is, attaching a bpf prog alone is not enough. The 
SK_BPF_CB_TX_TIMESTAMPING bit still needs to be turned on.

> 
>>
>>>
>>> Wouldn't it be better to let BPF_SOCK_OPS_TS_SND_CB return whether it's
>>> interested in tracing current packet all the way thru the stack?
>>
>> I like this idea. It can give the BPF prog a chance to do skb sampling on a
>> particular socket.
>>
>> The return value of BPF_SOCK_OPS_TS_SND_CB (or any cgroup BPF prog return value)
>> already has another usage, which its return value is currently enforced by the
>> verifier. It is better not to convolute it further.
>>
>> I don't prefer to add more use cases to skops->reply either, which is an union
>> of args[4], such that later progs (in the cgrp prog array) may lose the args value.
>>
>> Jason, instead of always setting SKBTX_BPF and txstamp_ack_bpf in the kernel, a
>> new BPF kfunc can be added so that the BPF prog can call it to selectively set
>> SKBTX_BPF and txstamp_ack_bpf in some skb.
> 
> Agreed because at netdev 0x19 I have an explicit plan to share the
> experience from our company about how to trace all the skbs which were
> completed through a kernel module. It's how we use in production
> especially for debug or diagnose use.

This is fine. The bpf prog can still do that by calling the kfunc. I don't see 
why moving the bit setting into a kfunc would make the whole set stop working.
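A rough sketch of what this could look like from the BPF prog side; the kfunc name bpf_sock_ops_enable_tx_tstamp() and the 1-in-16 sampling rule are purely illustrative assumptions, not part of the series:

```c
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>

/* Hypothetical kfunc: name and signature are illustrative only; the
 * real one would be defined by a follow-up kernel patch. */
extern void bpf_sock_ops_enable_tx_tstamp(struct bpf_sock_ops *skops) __ksym;

SEC("sockops")
int sample_tx_skbs(struct bpf_sock_ops *skops)
{
	static __u32 cnt;

	/* BPF_SOCK_OPS_TS_SND_CB fires for the tail skb of a sendmsg;
	 * opt only roughly 1 in 16 of them into SKBTX_BPF timestamping
	 * instead of marking every skb in the kernel by default. */
	if (skops->op == BPF_SOCK_OPS_TS_SND_CB &&
	    (__sync_fetch_and_add(&cnt, 1) & 0xf) == 0)
		bpf_sock_ops_enable_tx_tstamp(skops);
	return 1;
}

char _license[] SEC("license") = "GPL";
```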

> I'm not knowledgeable enough about BPF, so I'd like to know if there
> are some functions that I can take as good examples?
> 
> I think it's a standalone and good feature, can I handle it after this series?

Unfortunately, no. Once the default is on, this cannot be changed.

I think Jakub's suggestion to allow the bpf prog to selectively choose which skbs 
to timestamp is useful, so I suggested a way to do it.


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH bpf-next v8 10/12] bpf: make TCP tx timestamp bpf extension work
  2025-02-06  0:47         ` Martin KaFai Lau
@ 2025-02-06  1:05           ` Jason Xing
  2025-02-06  2:39             ` Jason Xing
  0 siblings, 1 reply; 66+ messages in thread
From: Jason Xing @ 2025-02-06  1:05 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: Jakub Kicinski, davem, edumazet, pabeni, dsahern,
	willemdebruijn.kernel, willemb, ast, daniel, andrii, eddyz87,
	song, yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa,
	horms, bpf, netdev

On Thu, Feb 6, 2025 at 8:47 AM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 2/5/25 4:12 PM, Jason Xing wrote:
> > On Thu, Feb 6, 2025 at 5:57 AM Martin KaFai Lau <martin.lau@linux.dev> wrote:
> >>
> >> On 2/4/25 5:57 PM, Jakub Kicinski wrote:
> >>> On Wed,  5 Feb 2025 02:30:22 +0800 Jason Xing wrote:
> >>>> +    if (cgroup_bpf_enabled(CGROUP_SOCK_OPS) &&
> >>>> +        SK_BPF_CB_FLAG_TEST(sk, SK_BPF_CB_TX_TIMESTAMPING) && skb) {
> >>>> +            struct skb_shared_info *shinfo = skb_shinfo(skb);
> >>>> +            struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
> >>>> +
> >>>> +            tcb->txstamp_ack_bpf = 1;
> >>>> +            shinfo->tx_flags |= SKBTX_BPF;
> >>>> +            shinfo->tskey = TCP_SKB_CB(skb)->seq + skb->len - 1;
> >>>> +    }
> >>>
> >>> If BPF program is attached we'll timestamp all skbs? Am I reading this
> >>> right?
> >>
> >> If the attached bpf program explicitly turns on the SK_BPF_CB_TX_TIMESTAMPING
> >> bit of a sock, then all skbs of this sock will be tx timestamp-ed.
> >
> > Martin, I'm afraid it's not like what you expect. Only the last
> > portion of the sendmsg will enter the above function which means if
> > the size of sendmsg is large, only the last skb will be set SKBTX_BPF
> > and be timestamped.
>
> Sure. The last skb of a large msg and more skb of small msg (or MSG_EOR).
>
> My point is, only attaching a bpf alone is not enough. The
> SK_BPF_CB_TX_TIMESTAMPING still needs to be turned on.

Right.

>
> >
> >>
> >>>
> >>> Wouldn't it be better to let BPF_SOCK_OPS_TS_SND_CB return whether it's
> >>> interested in tracing current packet all the way thru the stack?
> >>
> >> I like this idea. It can give the BPF prog a chance to do skb sampling on a
> >> particular socket.
> >>
> >> The return value of BPF_SOCK_OPS_TS_SND_CB (or any cgroup BPF prog return value)
> >> already has another usage, which its return value is currently enforced by the
> >> verifier. It is better not to convolute it further.
> >>
> >> I don't prefer to add more use cases to skops->reply either, which is an union
> >> of args[4], such that later progs (in the cgrp prog array) may lose the args value.
> >>
> >> Jason, instead of always setting SKBTX_BPF and txstamp_ack_bpf in the kernel, a
> >> new BPF kfunc can be added so that the BPF prog can call it to selectively set
> >> SKBTX_BPF and txstamp_ack_bpf in some skb.
> >
> > Agreed because at netdev 0x19 I have an explicit plan to share the
> > experience from our company about how to trace all the skbs which were
> > completed through a kernel module. It's how we use in production
> > especially for debug or diagnose use.
>
> This is fine. The bpf prog can still do that by calling the kfunc. I don't see
> why move the bit setting into kfunc makes the whole set won't work.
>
> > I'm not knowledgeable enough about BPF, so I'd like to know if there
> > are some functions that I can take as good examples?
> >
> > I think it's a standalone and good feature, can I handle it after this series?
>
> Unfortunately, no. Once the default is on, this cannot be changed.
>
> I think Jakub's suggestion to allow bpf prog selectively choose skb to timestamp
> is useful, so I suggested a way to do it.

Sorry, but I don't want to postpone this series any longer (blame me
for delaying it almost 4 months). I only want to focus on the
extension for SO_TIMESTAMPING so that we can move on quickly, with
small changes per series.

Couldn't selectively sampling skbs, or sampling all of them, be an
optional feature for users instead of mandatory behavior?

There are two kinds of monitoring in production: 1) daily monitoring,
and 2) diagnostic monitoring (I'm not sure I'm translating that
correctly). The former is obviously a light-weight feature, and I
don't think we need to trace that many skbs for it; only the last skb
is enough, which is what was done at Google, because even the
selective feature[1] is a little bit heavy. I have received complaints
from a few latency-sensitive customers asking whether we could reduce
the monitoring in the kernel because, as I mentioned before, many
issues are caused by the application itself instead of the kernel.

[1] The selective feature consists of two parts: selectively
collecting all the skbs in a certain period, or selectively collecting
exactly what SO_TIMESTAMPING does in a certain period. It might
need a full discussion, I reckon.

Thanks,
Jason

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH bpf-next v8 12/12] selftests/bpf: add simple bpf tests in the tx path for timestamping feature
  2025-02-05 16:08     ` Jason Xing
@ 2025-02-06  1:28       ` Martin KaFai Lau
  2025-02-06  2:14         ` Jason Xing
  0 siblings, 1 reply; 66+ messages in thread
From: Martin KaFai Lau @ 2025-02-06  1:28 UTC (permalink / raw)
  To: Jason Xing, Willem de Bruijn
  Cc: davem, edumazet, kuba, pabeni, dsahern, willemb, ast, daniel,
	andrii, eddyz87, song, yonghong.song, john.fastabend, kpsingh,
	sdf, haoluo, jolsa, horms, bpf, netdev

On 2/5/25 8:08 AM, Jason Xing wrote:
>>> +     switch (skops->op) {
>>> +     case BPF_SOCK_OPS_TS_SCHED_OPT_CB:
>>> +             delay = val->sched_delay = timestamp - val->sendmsg_ns;
>>> +             break;
>>
>> For a test this is fine. But just a reminder that in general a packet
>> may pass through multiple qdiscs. For instance with bonding or tunnel
>> virtual devices in the egress path.
> 
> Right, I've seen this in production (two times qdisc timestamps
> because of bonding).
> 
>>
>>> +     case BPF_SOCK_OPS_TS_SW_OPT_CB:
>>> +             prior_ts = val->sched_delay + val->sendmsg_ns;
>>> +             delay = val->sw_snd_delay = timestamp - prior_ts;
>>> +             break;
>>> +     case BPF_SOCK_OPS_TS_ACK_OPT_CB:
>>> +             prior_ts = val->sw_snd_delay + val->sched_delay + val->sendmsg_ns;
>>> +             delay = val->ack_delay = timestamp - prior_ts;
>>> +             break;
>>
>> Similar to the above: fine for a test, but in practice be aware that
>> packets may be resent, in which case an ACK might precede a repeat
>> SCHED and SND. And erroneous or malicious peers may also just never
>> send an ACK. So this can never be relied on in production settings,
>> e.g., as the only signal to clear an entry from a map (as done in the
>> branch below).

All good points. I think all these notes should be added as comments to the test.

I think this will be a good start as a test, and some followups can 
address those cases.

> 
> Agreed. In production, actually what we do is print all the timestamps
> and let an agent parse them.

The BPF program that runs in the kernel can provide its own user interface that 
best fits its environment. If a raw printing interface is sufficient, that works 
well and is simple on the BPF program side. If the production system cannot 
afford the raw printing cost, the bpf prog can perform some aggregation first.

The BPF program should be able to detect when an outgoing skb is re-transmitted 
and act accordingly. A BPF timer can be used to retire entries for which no ACK 
has been received.

Potentially, this data can be aggregated into the individual bpf_sk_storage or 
into a BPF map keyed by a particular IP address prefix.

I just want to highlight this here so that people referencing this thread in 
the future can look to it for implementation ideas.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH bpf-next v8 12/12] selftests/bpf: add simple bpf tests in the tx path for timestamping feature
  2025-02-06  1:28       ` Martin KaFai Lau
@ 2025-02-06  2:14         ` Jason Xing
  0 siblings, 0 replies; 66+ messages in thread
From: Jason Xing @ 2025-02-06  2:14 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: Willem de Bruijn, davem, edumazet, kuba, pabeni, dsahern, willemb,
	ast, daniel, andrii, eddyz87, song, yonghong.song, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, horms, bpf, netdev

On Thu, Feb 6, 2025 at 9:28 AM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 2/5/25 8:08 AM, Jason Xing wrote:
> >>> +     switch (skops->op) {
> >>> +     case BPF_SOCK_OPS_TS_SCHED_OPT_CB:
> >>> +             delay = val->sched_delay = timestamp - val->sendmsg_ns;
> >>> +             break;
> >>
> >> For a test this is fine. But just a reminder that in general a packet
> >> may pass through multiple qdiscs. For instance with bonding or tunnel
> >> virtual devices in the egress path.
> >
> > Right, I've seen this in production (two times qdisc timestamps
> > because of bonding).
> >
> >>
> >>> +     case BPF_SOCK_OPS_TS_SW_OPT_CB:
> >>> +             prior_ts = val->sched_delay + val->sendmsg_ns;
> >>> +             delay = val->sw_snd_delay = timestamp - prior_ts;
> >>> +             break;
> >>> +     case BPF_SOCK_OPS_TS_ACK_OPT_CB:
> >>> +             prior_ts = val->sw_snd_delay + val->sched_delay + val->sendmsg_ns;
> >>> +             delay = val->ack_delay = timestamp - prior_ts;
> >>> +             break;
> >>
> >> Similar to the above: fine for a test, but in practice be aware that
> >> packets may be resent, in which case an ACK might precede a repeat
> >> SCHED and SND. And erroneous or malicious peers may also just never
> >> send an ACK. So this can never be relied on in production settings,
> >> e.g., as the only signal to clear an entry from a map (as done in the
> >> branch below).
>
> All good points. I think all these notes should be added as comment to the test.

Got it, I will add them in the commit message.

> I think as a test, this will be a good start and can use some followup to
> address the cases.

Good idea.

>
> >
> > Agreed. In production, actually what we do is print all the timestamps
> > and let an agent parse them.
>
> The BPF program that runs in the kernel can provide its own user interface that
> best fits its environment. If a raw printing interface is sufficient, that works
> well and is simple on the BPF program side. If the production system cannot
> afford the raw printing cost, the bpf prog can perform some aggregation first.
>
> The BPF program should be able to detect when an outgoing skb is re-transmitted
> and act accordingly. There is BPF timer to retire entries for which no ACK has
> been received.

Oh, this is the first time I have heard of the BPF timer.

>
> Potentially, this data can be aggregated into the individual bpf_sk_storage or
> using a BPF map keyed by a particular IP address prefix.
>
> I just want to highlight here for people in the future referencing this thread
> to look for implementation ideas.

Thanks, I think they are useful! I will copy more of this description
into the commit message.

Thanks,
Jason

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH bpf-next v8 10/12] bpf: make TCP tx timestamp bpf extension work
  2025-02-06  1:05           ` Jason Xing
@ 2025-02-06  2:39             ` Jason Xing
  2025-02-06  2:56               ` Willem de Bruijn
  0 siblings, 1 reply; 66+ messages in thread
From: Jason Xing @ 2025-02-06  2:39 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: Jakub Kicinski, davem, edumazet, pabeni, dsahern,
	willemdebruijn.kernel, willemb, ast, daniel, andrii, eddyz87,
	song, yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa,
	horms, bpf, netdev

On Thu, Feb 6, 2025 at 9:05 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
>
> On Thu, Feb 6, 2025 at 8:47 AM Martin KaFai Lau <martin.lau@linux.dev> wrote:
> >
> > On 2/5/25 4:12 PM, Jason Xing wrote:
> > > On Thu, Feb 6, 2025 at 5:57 AM Martin KaFai Lau <martin.lau@linux.dev> wrote:
> > >>
> > >> On 2/4/25 5:57 PM, Jakub Kicinski wrote:
> > >>> On Wed,  5 Feb 2025 02:30:22 +0800 Jason Xing wrote:
> > >>>> +    if (cgroup_bpf_enabled(CGROUP_SOCK_OPS) &&
> > >>>> +        SK_BPF_CB_FLAG_TEST(sk, SK_BPF_CB_TX_TIMESTAMPING) && skb) {
> > >>>> +            struct skb_shared_info *shinfo = skb_shinfo(skb);
> > >>>> +            struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
> > >>>> +
> > >>>> +            tcb->txstamp_ack_bpf = 1;
> > >>>> +            shinfo->tx_flags |= SKBTX_BPF;
> > >>>> +            shinfo->tskey = TCP_SKB_CB(skb)->seq + skb->len - 1;
> > >>>> +    }
> > >>>
> > >>> If BPF program is attached we'll timestamp all skbs? Am I reading this
> > >>> right?
> > >>
> > >> If the attached bpf program explicitly turns on the SK_BPF_CB_TX_TIMESTAMPING
> > >> bit of a sock, then all skbs of this sock will be tx timestamp-ed.
> > >
> > > Martin, I'm afraid it's not like what you expect. Only the last
> > > portion of the sendmsg will enter the above function which means if
> > > the size of sendmsg is large, only the last skb will be set SKBTX_BPF
> > > and be timestamped.
> >
> > Sure. The last skb of a large msg and more skb of small msg (or MSG_EOR).
> >
> > My point is, only attaching a bpf alone is not enough. The
> > SK_BPF_CB_TX_TIMESTAMPING still needs to be turned on.
>
> Right.
>
> >
> > >
> > >>
> > >>>
> > >>> Wouldn't it be better to let BPF_SOCK_OPS_TS_SND_CB return whether it's
> > >>> interested in tracing current packet all the way thru the stack?
> > >>
> > >> I like this idea. It can give the BPF prog a chance to do skb sampling on a
> > >> particular socket.
> > >>
> > >> The return value of BPF_SOCK_OPS_TS_SND_CB (or any cgroup BPF prog return value)
> > >> already has another usage, which its return value is currently enforced by the
> > >> verifier. It is better not to convolute it further.
> > >>
> > >> I don't prefer to add more use cases to skops->reply either, which is an union
> > >> of args[4], such that later progs (in the cgrp prog array) may lose the args value.
> > >>
> > >> Jason, instead of always setting SKBTX_BPF and txstamp_ack_bpf in the kernel, a
> > >> new BPF kfunc can be added so that the BPF prog can call it to selectively set
> > >> SKBTX_BPF and txstamp_ack_bpf in some skb.
> > >
> > > Agreed because at netdev 0x19 I have an explicit plan to share the
> > > experience from our company about how to trace all the skbs which were
> > > completed through a kernel module. It's how we use in production
> > > especially for debug or diagnose use.
> >
> > This is fine. The bpf prog can still do that by calling the kfunc. I don't see
> > why move the bit setting into kfunc makes the whole set won't work.
> >
> > > I'm not knowledgeable enough about BPF, so I'd like to know if there
> > > are some functions that I can take as good examples?
> > >
> > > I think it's a standalone and good feature, can I handle it after this series?
> >
> > Unfortunately, no. Once the default is on, this cannot be changed.
> >
> > I think Jakub's suggestion to allow bpf prog selectively choose skb to timestamp
> > is useful, so I suggested a way to do it.
>
> Because, sorry, I don't want to postpone this series any longer (blame
> on me for delaying almost 4 months), only wanting to focus on the
> extension for SO_TIMESTAMPING so that we can quickly move on with
> small changes per series.
>
> Selectively sampling the skbs or sampling all the skbs could be an
> optional good choice/feature for users instead of mandatory?
>
> There are two kinds of monitoring in production: 1) daily monitoring,
> 2) diagnostic monitoring which I'm not sure if I translate in the
> right way. For the former that is obviously a light-weight feature, I
> think we don't need to trace that many skbs, only the last skb is
> enough which was done in Google because even the selective feature[1]
> is a little bit heavy. I received some complaints from a few
> latency-sensitive customers to ask us if we can reduce the monitoring
> in the kernel because as I mentioned before many issues are caused by
> the application itself instead of kernel.
>
> [1] selective feature consists of two parts, only selectively
> collecting all the skbs in a certain period or selectively collecting
> exactly like what SO_TIMESTAMPING does in a certain period. It might
> need a full discussion, I reckon.

I presume you are referring to the former. It works like the cmsg
feature, which is a good example of selective sampling. It would be
better to check the value of reply in the BPF_SOCK_OPS_TS_SND_CB
callback, which runs nearly at the very beginning of each sendmsg
syscall, because I have a hunch that we will add more hook points
before the skb enters the qdisc.

I think we can split the whole idea into two parts: for now, since
the current series implements the same function as SO_TIMESTAMPING
does, I will implement the selective sampling feature in this series.
Later, once we finish tracing all the skbs, we can add the
corresponding selective sampling feature for that as well.

But the default mode would be exactly the same as SO_TIMESTAMPING,
instead of requiring the bpf prog to enable the sampling feature. Does
that make sense to you?

With that said, the patch looks like this:
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 1f528e63bc71..73909dad7ed4 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -497,11 +497,14 @@ static void tcp_tx_timestamp(struct sock *sk, struct sockcm_cookie *sockc)
            SK_BPF_CB_FLAG_TEST(sk, SK_BPF_CB_TX_TIMESTAMPING) && skb) {
                struct skb_shared_info *shinfo = skb_shinfo(skb);
                struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
+               bool enable_sample = true;

-               tcb->txstamp_ack_bpf = 1;
-               shinfo->tx_flags |= SKBTX_BPF;
-               shinfo->tskey = TCP_SKB_CB(skb)->seq + skb->len - 1;
-               bpf_skops_tx_timestamping(sk, skb, BPF_SOCK_OPS_TS_SND_CB);
+               enable_sample = bpf_skops_tx_timestamping(sk, skb,
+                                                         BPF_SOCK_OPS_TS_SND_CB);
+               if (enable_sample) {
+                       tcb->txstamp_ack_bpf = 1;
+                       shinfo->tx_flags |= SKBTX_BPF;
+                       shinfo->tskey = TCP_SKB_CB(skb)->seq + skb->len - 1;
+               }
        }
 }

Thanks,
Jason

^ permalink raw reply related	[flat|nested] 66+ messages in thread

* Re: [PATCH bpf-next v8 10/12] bpf: make TCP tx timestamp bpf extension work
  2025-02-06  2:39             ` Jason Xing
@ 2025-02-06  2:56               ` Willem de Bruijn
  2025-02-06  3:09                 ` Jason Xing
  0 siblings, 1 reply; 66+ messages in thread
From: Willem de Bruijn @ 2025-02-06  2:56 UTC (permalink / raw)
  To: Jason Xing, Martin KaFai Lau
  Cc: Jakub Kicinski, davem, edumazet, pabeni, dsahern,
	willemdebruijn.kernel, willemb, ast, daniel, andrii, eddyz87,
	song, yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa,
	horms, bpf, netdev

Jason Xing wrote:
> On Thu, Feb 6, 2025 at 9:05 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
> >
> > On Thu, Feb 6, 2025 at 8:47 AM Martin KaFai Lau <martin.lau@linux.dev> wrote:
> > >
> > > On 2/5/25 4:12 PM, Jason Xing wrote:
> > > > On Thu, Feb 6, 2025 at 5:57 AM Martin KaFai Lau <martin.lau@linux.dev> wrote:
> > > >>
> > > >> On 2/4/25 5:57 PM, Jakub Kicinski wrote:
> > > >>> On Wed,  5 Feb 2025 02:30:22 +0800 Jason Xing wrote:
> > > >>>> +    if (cgroup_bpf_enabled(CGROUP_SOCK_OPS) &&
> > > >>>> +        SK_BPF_CB_FLAG_TEST(sk, SK_BPF_CB_TX_TIMESTAMPING) && skb) {
> > > >>>> +            struct skb_shared_info *shinfo = skb_shinfo(skb);
> > > >>>> +            struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
> > > >>>> +
> > > >>>> +            tcb->txstamp_ack_bpf = 1;
> > > >>>> +            shinfo->tx_flags |= SKBTX_BPF;
> > > >>>> +            shinfo->tskey = TCP_SKB_CB(skb)->seq + skb->len - 1;
> > > >>>> +    }
> > > >>>
> > > >>> If BPF program is attached we'll timestamp all skbs? Am I reading this
> > > >>> right?
> > > >>
> > > >> If the attached bpf program explicitly turns on the SK_BPF_CB_TX_TIMESTAMPING
> > > >> bit of a sock, then all skbs of this sock will be tx timestamp-ed.
> > > >
> > > > Martin, I'm afraid it's not like what you expect. Only the last
> > > > portion of the sendmsg will enter the above function which means if
> > > > the size of sendmsg is large, only the last skb will be set SKBTX_BPF
> > > > and be timestamped.
> > >
> > > Sure. The last skb of a large msg and more skb of small msg (or MSG_EOR).
> > >
> > > My point is, only attaching a bpf alone is not enough. The
> > > SK_BPF_CB_TX_TIMESTAMPING still needs to be turned on.
> >
> > Right.
> >
> > >
> > > >
> > > >>
> > > >>>
> > > >>> Wouldn't it be better to let BPF_SOCK_OPS_TS_SND_CB return whether it's
> > > >>> interested in tracing current packet all the way thru the stack?
> > > >>
> > > >> I like this idea. It can give the BPF prog a chance to do skb sampling on a
> > > >> particular socket.
> > > >>
> > > >> The return value of BPF_SOCK_OPS_TS_SND_CB (or any cgroup BPF prog return value)
> > > >> already has another usage, which its return value is currently enforced by the
> > > >> verifier. It is better not to convolute it further.
> > > >>
> > > >> I don't prefer to add more use cases to skops->reply either, which is an union
> > > >> of args[4], such that later progs (in the cgrp prog array) may lose the args value.
> > > >>
> > > >> Jason, instead of always setting SKBTX_BPF and txstamp_ack_bpf in the kernel, a
> > > >> new BPF kfunc can be added so that the BPF prog can call it to selectively set
> > > >> SKBTX_BPF and txstamp_ack_bpf in some skb.
> > > >
> > > > Agreed because at netdev 0x19 I have an explicit plan to share the
> > > > experience from our company about how to trace all the skbs which were
> > > > completed through a kernel module. It's how we use in production
> > > > especially for debug or diagnose use.
> > >
> > > This is fine. The bpf prog can still do that by calling the kfunc. I don't see
> > > why move the bit setting into kfunc makes the whole set won't work.
> > >
> > > > I'm not knowledgeable enough about BPF, so I'd like to know if there
> > > > are some functions that I can take as good examples?
> > > >
> > > > I think it's a standalone and good feature, can I handle it after this series?
> > >
> > > Unfortunately, no. Once the default is on, this cannot be changed.
> > >
> > > I think Jakub's suggestion to allow bpf prog selectively choose skb to timestamp
> > > is useful, so I suggested a way to do it.
> >
> > Because, sorry, I don't want to postpone this series any longer (blame
> > on me for delaying almost 4 months), only wanting to focus on the
> > extension for SO_TIMESTAMPING so that we can quickly move on with
> > small changes per series.
> >
> > Selectively sampling the skbs or sampling all the skbs could be an
> > optional good choice/feature for users instead of mandatory?
> >
> > There are two kinds of monitoring in production: 1) daily monitoring,
> > 2) diagnostic monitoring which I'm not sure if I translate in the
> > right way. For the former that is obviously a light-weight feature, I
> > think we don't need to trace that many skbs, only the last skb is
> > enough which was done in Google because even the selective feature[1]
> > is a little bit heavy. I received some complaints from a few
> > latency-sensitive customers to ask us if we can reduce the monitoring
> > in the kernel because as I mentioned before many issues are caused by
> > the application itself instead of kernel.
> >
> > [1] selective feature consists of two parts, only selectively
> > collecting all the skbs in a certain period or selectively collecting
> > exactly like what SO_TIMESTAMPING does in a certain period. It might
> > need a full discussion, I reckon.
> 
> I presume you might refer to the former. It works like the cmsg
> feature which can be a good selectively sampling example. It would be
> better to check the value of reply in the BPF_SOCK_OPS_TS_SND_CB
> callback which is nearly the very beginning of each sendmsg syscall
> because I have a hunch we will add more hook points before skb enters
> the qdisc.
> 
> I think we can split the whole idea into two parts: for now, because
> of the current series implementing the same function as SO_TIMETAMPING
> does, I will implement the selective sample feature in the series.
> After someday we finish tracing all the skb, then we will add the
> corresponding selective sample feature.

Are you saying that you will include selective sampling now or want to
postpone it?

Jakub brought up a great point. Our continuous deployment would not be
feasible without sampling. It is indeed implemented using cmsg.

I think it should be included from the initial patch series.
 
> But the default mode is the exact same as SO_TIMESTAMPING instead of
> asking bpf prog to enable the sample feature. Does it make sense to
> you?
> 
> With that said, the patch looks like this:
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 1f528e63bc71..73909dad7ed4 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -497,11 +497,14 @@ static void tcp_tx_timestamp(struct sock *sk,
> struct sockcm_cookie *sockc)
>             SK_BPF_CB_FLAG_TEST(sk, SK_BPF_CB_TX_TIMESTAMPING) && skb) {
>                 struct skb_shared_info *shinfo = skb_shinfo(skb);
>                 struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
> +               bool enable_sample = true;
> 
> -               tcb->txstamp_ack_bpf = 1;
> -               shinfo->tx_flags |= SKBTX_BPF;
> -               shinfo->tskey = TCP_SKB_CB(skb)->seq + skb->len - 1;
> -               bpf_skops_tx_timestamping(sk, skb, BPF_SOCK_OPS_TS_SND_CB);
> +               enable_sample = bpf_skops_tx_timestamping(sk, skb,
> BPF_SOCK_OPS_TS_SND_CB);
> +               if (enable_sample) {
> +                       tcb->txstamp_ack_bpf = 1;
> +                       shinfo->tx_flags |= SKBTX_BPF;
> +                       shinfo->tskey = TCP_SKB_CB(skb)->seq + skb->len - 1;
> +               }
>         }
>  }
> 
> Thanks,
> Jason



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH bpf-next v8 10/12] bpf: make TCP tx timestamp bpf extension work
  2025-02-06  2:56               ` Willem de Bruijn
@ 2025-02-06  3:09                 ` Jason Xing
  2025-02-06  3:25                   ` Willem de Bruijn
  0 siblings, 1 reply; 66+ messages in thread
From: Jason Xing @ 2025-02-06  3:09 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Martin KaFai Lau, Jakub Kicinski, davem, edumazet, pabeni,
	dsahern, willemb, ast, daniel, andrii, eddyz87, song,
	yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa, horms,
	bpf, netdev

On Thu, Feb 6, 2025 at 10:56 AM Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
>
> Jason Xing wrote:
> > On Thu, Feb 6, 2025 at 9:05 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
> > >
> > > On Thu, Feb 6, 2025 at 8:47 AM Martin KaFai Lau <martin.lau@linux.dev> wrote:
> > > >
> > > > On 2/5/25 4:12 PM, Jason Xing wrote:
> > > > > On Thu, Feb 6, 2025 at 5:57 AM Martin KaFai Lau <martin.lau@linux.dev> wrote:
> > > > >>
> > > > >> On 2/4/25 5:57 PM, Jakub Kicinski wrote:
> > > > >>> On Wed,  5 Feb 2025 02:30:22 +0800 Jason Xing wrote:
> > > > >>>> +    if (cgroup_bpf_enabled(CGROUP_SOCK_OPS) &&
> > > > >>>> +        SK_BPF_CB_FLAG_TEST(sk, SK_BPF_CB_TX_TIMESTAMPING) && skb) {
> > > > >>>> +            struct skb_shared_info *shinfo = skb_shinfo(skb);
> > > > >>>> +            struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
> > > > >>>> +
> > > > >>>> +            tcb->txstamp_ack_bpf = 1;
> > > > >>>> +            shinfo->tx_flags |= SKBTX_BPF;
> > > > >>>> +            shinfo->tskey = TCP_SKB_CB(skb)->seq + skb->len - 1;
> > > > >>>> +    }
> > > > >>>
> > > > >>> If BPF program is attached we'll timestamp all skbs? Am I reading this
> > > > >>> right?
> > > > >>
> > > > >> If the attached bpf program explicitly turns on the SK_BPF_CB_TX_TIMESTAMPING
> > > > >> bit of a sock, then all skbs of this sock will be tx timestamp-ed.
> > > > >
> > > > > Martin, I'm afraid it's not like what you expect. Only the last
> > > > > portion of the sendmsg will enter the above function which means if
> > > > > the size of sendmsg is large, only the last skb will be set SKBTX_BPF
> > > > > and be timestamped.
> > > >
> > > > Sure. The last skb of a large msg and more skb of small msg (or MSG_EOR).
> > > >
> > > > My point is, only attaching a bpf alone is not enough. The
> > > > SK_BPF_CB_TX_TIMESTAMPING still needs to be turned on.
> > >
> > > Right.
> > >
> > > >
> > > > >
> > > > >>
> > > > >>>
> > > > >>> Wouldn't it be better to let BPF_SOCK_OPS_TS_SND_CB return whether it's
> > > > >>> interested in tracing current packet all the way thru the stack?
> > > > >>
> > > > >> I like this idea. It can give the BPF prog a chance to do skb sampling on a
> > > > >> particular socket.
> > > > >>
> > > > >> The return value of BPF_SOCK_OPS_TS_SND_CB (or any cgroup BPF prog return value)
> > > > >> already has another usage, which its return value is currently enforced by the
> > > > >> verifier. It is better not to convolute it further.
> > > > >>
> > > > >> I don't prefer to add more use cases to skops->reply either, which is an union
> > > > >> of args[4], such that later progs (in the cgrp prog array) may lose the args value.
> > > > >>
> > > > >> Jason, instead of always setting SKBTX_BPF and txstamp_ack_bpf in the kernel, a
> > > > >> new BPF kfunc can be added so that the BPF prog can call it to selectively set
> > > > >> SKBTX_BPF and txstamp_ack_bpf in some skb.
> > > > >
> > > > > Agreed because at netdev 0x19 I have an explicit plan to share the
> > > > > experience from our company about how to trace all the skbs which were
> > > > > completed through a kernel module. It's how we use in production
> > > > > especially for debug or diagnose use.
> > > >
> > > > This is fine. The bpf prog can still do that by calling the kfunc. I don't see
> > > > why move the bit setting into kfunc makes the whole set won't work.
> > > >
> > > > > I'm not knowledgeable enough about BPF, so I'd like to know if there
> > > > > are some functions that I can take as good examples?
> > > > >
> > > > > I think it's a standalone and good feature, can I handle it after this series?
> > > >
> > > > Unfortunately, no. Once the default is on, this cannot be changed.
> > > >
> > > > I think Jakub's suggestion to allow bpf prog selectively choose skb to timestamp
> > > > is useful, so I suggested a way to do it.
> > >
> > > Because, sorry, I don't want to postpone this series any longer (blame
> > > on me for delaying almost 4 months), only wanting to focus on the
> > > extension for SO_TIMESTAMPING so that we can quickly move on with
> > > small changes per series.
> > >
> > > Selectively sampling the skbs or sampling all the skbs could be an
> > > optional good choice/feature for users instead of mandatory?
> > >
> > > There are two kinds of monitoring in production: 1) daily monitoring,
> > > 2) diagnostic monitoring which I'm not sure if I translate in the
> > > right way. For the former that is obviously a light-weight feature, I
> > > think we don't need to trace that many skbs, only the last skb is
> > > enough which was done in Google because even the selective feature[1]
> > > is a little bit heavy. I received some complaints from a few
> > > latency-sensitive customers to ask us if we can reduce the monitoring
> > > in the kernel because as I mentioned before many issues are caused by
> > > the application itself instead of kernel.
> > >
> > > [1] selective feature consists of two parts, only selectively
> > > collecting all the skbs in a certain period or selectively collecting
> > > exactly like what SO_TIMESTAMPING does in a certain period. It might
> > > need a full discussion, I reckon.
> >
> > I presume you might refer to the former. It works like the cmsg
> > feature which can be a good selectively sampling example. It would be
> > better to check the value of reply in the BPF_SOCK_OPS_TS_SND_CB
> > callback which is nearly the very beginning of each sendmsg syscall
> > because I have a hunch we will add more hook points before skb enters
> > the qdisc.
> >
> > I think we can split the whole idea into two parts: for now, because
> > of the current series implementing the same function as SO_TIMESTAMPING
> > does, I will implement the selective sample feature in the series.
> > After someday we finish tracing all the skb, then we will add the
> > corresponding selective sample feature.
>
> Are you saying that you will include selective sampling now or want to
> postpone it?

A few months ago, I planned to do it after this series. Since you all
ask, it's not complex to have it included in this series :)

Selective sampling has two kinds of meaning like I mentioned above, so
in the next re-spin I will implement the cmsg feature for bpf
extension in this series. I'm doing the test right now. And leave
another selective sampling small feature until the feature of tracing
all the skbs is implemented if possible.

>
> Jakub brought up a great point. Our continuous deployment would not be
> feasible without sampling. Indeed implemented using cmsg.

Right, right. I just realized that I misunderstood what Jakub offered.

>
> I think it should be included from the initial patch series.

I agree to include this in this series. Like what I wrote in the
previous thread, it should be simple :) And it will be manifested in
the selftests as well.
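For reference, the existing SO_TIMESTAMPING uAPI already supports this kind
of per-call selection through a control message on sendmsg(): the reporting
flags are enabled once with setsockopt(), and each sendmsg() can carry the
generation flags for just that call. A rough userspace sketch (the helper
names are made up for illustration, not part of this series):

```c
/* Per-call tx timestamp request via a SO_TIMESTAMPING cmsg.
 * Reporting flags (e.g. SOF_TIMESTAMPING_SOFTWARE | SOF_TIMESTAMPING_OPT_ID)
 * must still be enabled once via setsockopt(); the cmsg only carries the
 * per-call generation flags. Helper names here are illustrative.
 */
#include <linux/net_tstamp.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Fill in a SO_TIMESTAMPING cmsg asking for a software tx timestamp. */
static struct cmsghdr *set_tx_tstamp_cmsg(struct msghdr *msg,
					  void *cbuf, size_t cbuf_len)
{
	__u32 tsflags = SOF_TIMESTAMPING_TX_SOFTWARE;
	struct cmsghdr *cm;

	memset(cbuf, 0, cbuf_len);
	msg->msg_control = cbuf;
	msg->msg_controllen = cbuf_len;
	cm = CMSG_FIRSTHDR(msg);
	cm->cmsg_level = SOL_SOCKET;
	cm->cmsg_type = SO_TIMESTAMPING;
	cm->cmsg_len = CMSG_LEN(sizeof(tsflags));
	memcpy(CMSG_DATA(cm), &tsflags, sizeof(tsflags));
	return cm;
}

/* Send one buffer; ask for a tx timestamp only when @sample is set. */
static ssize_t send_sampled(int fd, const void *buf, size_t len, int sample)
{
	struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
	struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1 };
	char cbuf[CMSG_SPACE(sizeof(__u32))];

	if (sample)
		set_tx_tstamp_cmsg(&msg, cbuf, sizeof(cbuf));
	return sendmsg(fd, &msg, 0);
}
```

The bpf-side analogue discussed in this thread would make the same per-call
choice from BPF_SOCK_OPS_TS_SND_CB instead of from userspace.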

Thanks,
Jason

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH bpf-next v8 10/12] bpf: make TCP tx timestamp bpf extension work
  2025-02-06  3:09                 ` Jason Xing
@ 2025-02-06  3:25                   ` Willem de Bruijn
  2025-02-06  3:41                     ` Jason Xing
  0 siblings, 1 reply; 66+ messages in thread
From: Willem de Bruijn @ 2025-02-06  3:25 UTC (permalink / raw)
  To: Jason Xing, Willem de Bruijn
  Cc: Martin KaFai Lau, Jakub Kicinski, davem, edumazet, pabeni,
	dsahern, willemb, ast, daniel, andrii, eddyz87, song,
	yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa, horms,
	bpf, netdev

> > > I think we can split the whole idea into two parts: for now, because
> > > of the current series implementing the same function as SO_TIMESTAMPING
> > > does, I will implement the selective sample feature in the series.
> > > After someday we finish tracing all the skb, then we will add the
> > > corresponding selective sample feature.
> >
> > Are you saying that you will include selective sampling now or want to
> > postpone it?
> 
> A few months ago, I planned to do it after this series. Since you all
> ask, it's not complex to have it included in this series :)
> 
> Selective sampling has two kinds of meaning like I mentioned above, so
> in the next re-spin I will implement the cmsg feature for bpf
> extension in this series. 

Great thanks.

> I'm doing the test right now. And leave
> another selective sampling small feature until the feature of tracing
> all the skbs is implemented if possible.

Can you elaborate on this other feature?
 
> >
> > Jakub brought up a great point. Our continuous deployment would not be
> > feasible without sampling. Indeed implemented using cmsg.
> 
> Right, right. I just realized that I misunderstood what Jakub offered.
> 
> >
> > I think it should be included from the initial patch series.
> 
> I agree to include this in this series. Like what I wrote in the
> previous thread, it should be simple :) And it will be manifested in
> the selftests as well.
> 
> Thanks,
> Jason




* Re: [PATCH bpf-next v8 10/12] bpf: make TCP tx timestamp bpf extension work
  2025-02-06  3:25                   ` Willem de Bruijn
@ 2025-02-06  3:41                     ` Jason Xing
  2025-02-06  6:12                       ` Martin KaFai Lau
  0 siblings, 1 reply; 66+ messages in thread
From: Jason Xing @ 2025-02-06  3:41 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Martin KaFai Lau, Jakub Kicinski, davem, edumazet, pabeni,
	dsahern, willemb, ast, daniel, andrii, eddyz87, song,
	yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa, horms,
	bpf, netdev

On Thu, Feb 6, 2025 at 11:25 AM Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
>
> > > > I think we can split the whole idea into two parts: for now, because
> > > > of the current series implementing the same function as SO_TIMESTAMPING
> > > > does, I will implement the selective sample feature in the series.
> > > > After someday we finish tracing all the skb, then we will add the
> > > > corresponding selective sample feature.
> > >
> > > Are you saying that you will include selective sampling now or want to
> > > postpone it?
> >
> > A few months ago, I planned to do it after this series. Since you all
> > ask, it's not complex to have it included in this series :)
> >
> > Selective sampling has two kinds of meaning like I mentioned above, so
> > in the next re-spin I will implement the cmsg feature for bpf
> > extension in this series.
>
> Great thanks.

I have to rephrase a bit in case Martin visits here soon: I will
compare two approaches 1) reply value, 2) bpf kfunc and then see which
way is better.

>
> > I'm doing the test right now. And leave
> > another selective sampling small feature until the feature of tracing
> > all the skbs is implemented if possible.
>
> Can you elaborate on this other feature?

Do you recall one day I asked your opinion privately about whether we
can trace _all the skbs_ (not just the last skb from each sendmsg) to
get better insight into kernel behaviour? I can also see a couple of
latency issues in the kernel. If that is approved, then corresponding
selective sampling should be supported. That's what I was trying to
describe.

The advantage of relying on the timestamping feature is that we can
isolate normal flows from monitored flows, so normal flows won't be
affected by enabling the monitoring feature. That is unlike the many
open source monitoring applications I've dug into: they usually hook
the hot path directly, like __tcp_transmit_skb() or dev_queue_xmit(),
which inevitably influences normal flows and causes performance
degradation to some extent. I noticed that after conducting some tests
a few months ago. The principle behind bpf fentry is to replace some
instructions at the very beginning of the hooked function, so every
call, even from normal flows, entering the monitored function gets
affected.
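By contrast, the gating in this series is a single per-socket flag test on
the transmit path. A tiny mock of that check (the struct and function names
below are illustrative, mirroring the SK_BPF_CB_FLAG_TEST() snippet quoted
earlier in the thread):

```c
/* Mock of the per-socket opt-in check. In the real series this is
 * SK_BPF_CB_FLAG_TEST(sk, SK_BPF_CB_TX_TIMESTAMPING) on struct sock;
 * the names below are illustrative only.
 */
#define SK_BPF_CB_TX_TIMESTAMPING	(1 << 0)

struct mock_sock {
	unsigned int bpf_cb_flags;	/* set via bpf_setsockopt() */
};

/* Unmonitored sockets pay only this one flag test and bail out. */
static int sk_wants_bpf_tx_tstamp(const struct mock_sock *sk)
{
	return sk->bpf_cb_flags & SK_BPF_CB_TX_TIMESTAMPING;
}
```

Sockets that never opted in take the early-out branch, whereas an fentry on
the function itself would run for every caller.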

Thanks,
Jason

>
> > >
> > > Jakub brought up a great point. Our continuous deployment would not be
> > > feasible without sampling. Indeed implemented using cmsg.
> >
> > Right, right. I just realized that I misunderstood what Jakub offered.
> >
> > >
> > > I think it should be included from the initial patch series.
> >
> > I agree to include this in this series. Like what I wrote in the
> > previous thread, it should be simple :) And it will be manifested in
> > the selftests as well.
> >
> > Thanks,
> > Jason
>
>


* Re: [PATCH bpf-next v8 10/12] bpf: make TCP tx timestamp bpf extension work
  2025-02-06  3:41                     ` Jason Xing
@ 2025-02-06  6:12                       ` Martin KaFai Lau
  2025-02-06  6:56                         ` Jason Xing
  0 siblings, 1 reply; 66+ messages in thread
From: Martin KaFai Lau @ 2025-02-06  6:12 UTC (permalink / raw)
  To: Jason Xing
  Cc: Willem de Bruijn, Jakub Kicinski, davem, edumazet, pabeni,
	dsahern, willemb, ast, daniel, andrii, eddyz87, song,
	yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa, horms,
	bpf, netdev

On 2/5/25 7:41 PM, Jason Xing wrote:
> On Thu, Feb 6, 2025 at 11:25 AM Willem de Bruijn
> <willemdebruijn.kernel@gmail.com> wrote:
>>
>>>>> I think we can split the whole idea into two parts: for now, because
>>>>> of the current series implementing the same function as SO_TIMESTAMPING
>>>>> does, I will implement the selective sample feature in the series.
>>>>> After someday we finish tracing all the skb, then we will add the
>>>>> corresponding selective sample feature.
>>>>
>>>> Are you saying that you will include selective sampling now or want to
>>>> postpone it?
>>>
>>> A few months ago, I planned to do it after this series. Since you all
>>> ask, it's not complex to have it included in this series :)
>>>
>>> Selective sampling has two kinds of meaning like I mentioned above, so
>>> in the next re-spin I will implement the cmsg feature for bpf
>>> extension in this series.
>>
>> Great thanks.
> 
> I have to rephrase a bit in case Martin visits here soon: I will
> compare two approaches 1) reply value, 2) bpf kfunc and then see which
> way is better.

I have already explained in detail why 1) the reply value from the bpf prog 
won't work. Please go back to that reply, which has the context.

> 
>>
>>> I'm doing the test right now. And leave
>>> another selective sampling small feature until the feature of tracing
>>> all the skbs is implemented if possible.
>>
>> Can you elaborate on this other feature?
> 
> Do you recall one day I asked your opinion privately about whether we
> can trace _all the skbs_ (not just the last skb from each sendmsg) to
> get better insight into kernel behaviour? I can also see a couple of
> latency issues in the kernel. If that is approved, then corresponding
> selective sampling should be supported. That's what I was trying to
> describe.
> 
> The advantage of relying on the timestamping feature is that we can
> isolate normal flows from monitored flows, so normal flows won't be
> affected by enabling the monitoring feature. That is unlike the many
> open source monitoring applications I've dug into: they usually hook
> the hot path directly, like __tcp_transmit_skb() or dev_queue_xmit(),
> which inevitably influences normal flows and causes performance
> degradation to some extent. I noticed that after conducting some tests
> a few months ago. The principle behind bpf fentry is to replace some
> instructions at the very beginning of the hooked function, so every
> call, even from normal flows, entering the monitored function gets
> affected.

I sort of guessed this while stalled in traffic... :/

I was not asking to be able to selectively sample all skbs of a large msg. That 
will be a separate topic. If we really wanted to support this case (tbh, I am 
not convinced) in the future, there is all the more reason the default behavior 
should be "off" now, for consistency.

The comment was on the existing tcp_tx_timestamp(). First focus on allowing 
selective tracking of the skb that the current tcp_tx_timestamp() also tracks, 
because it is the best understood use case. This will allow the bpf prog to 
select which tcp_sendmsg call it should track/sample. Perhaps the bpf prog will 
limit tracking to X packets and then stop there. Perhaps the bpf prog will only 
allocate X sample slots in the bpf_sk_storage to track packets. There are many 
reasons a bpf prog may want to sample and stop tracking at some point, even in 
the current tcp_tx_timestamp().
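Just to illustrate (the state layout and names below are made up; in a real
bpf prog the state would live in bpf_sk_storage and the decision would gate
the kfunc call that marks the skb), such budgeted sampling could look like:

```c
/* Sketch of budgeted sampling: sample every Nth tracked sendmsg until a
 * fixed budget is spent, then stop. Plain C mock of bpf prog logic;
 * not kernel code.
 */
struct sample_state {
	unsigned int seen;	/* sendmsg calls observed on this socket */
	unsigned int every;	/* sample one call out of this many */
	unsigned int budget;	/* remaining samples before stopping */
};

static int should_sample(struct sample_state *st)
{
	if (!st->budget)
		return 0;		/* budget spent: stop tracking */
	if (++st->seen % st->every)
		return 0;		/* not this call's turn */
	st->budget--;
	return 1;			/* mark this sendmsg's last skb */
}
```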



* Re: [PATCH bpf-next v8 10/12] bpf: make TCP tx timestamp bpf extension work
  2025-02-06  6:12                       ` Martin KaFai Lau
@ 2025-02-06  6:56                         ` Jason Xing
  2025-02-07  2:07                           ` Martin KaFai Lau
  0 siblings, 1 reply; 66+ messages in thread
From: Jason Xing @ 2025-02-06  6:56 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: Willem de Bruijn, Jakub Kicinski, davem, edumazet, pabeni,
	dsahern, willemb, ast, daniel, andrii, eddyz87, song,
	yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa, horms,
	bpf, netdev

On Thu, Feb 6, 2025 at 2:12 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 2/5/25 7:41 PM, Jason Xing wrote:
> > On Thu, Feb 6, 2025 at 11:25 AM Willem de Bruijn
> > <willemdebruijn.kernel@gmail.com> wrote:
> >>
> >>>>> I think we can split the whole idea into two parts: for now, because
> > >>> of the current series implementing the same function as SO_TIMESTAMPING
> >>>>> does, I will implement the selective sample feature in the series.
> >>>>> After someday we finish tracing all the skb, then we will add the
> >>>>> corresponding selective sample feature.
> >>>>
> >>>> Are you saying that you will include selective sampling now or want to
> >>>> postpone it?
> >>>
> >>> A few months ago, I planned to do it after this series. Since you all
> >>> ask, it's not complex to have it included in this series :)
> >>>
> >>> Selective sampling has two kinds of meaning like I mentioned above, so
> >>> in the next re-spin I will implement the cmsg feature for bpf
> >>> extension in this series.
> >>
> >> Great thanks.
> >
> > I have to rephrase a bit in case Martin visits here soon: I will
> > compare two approaches 1) reply value, 2) bpf kfunc and then see which
> > way is better.
>
> I have already explained in detail why 1) the reply value from the bpf prog
> won't work. Please go back to that reply, which has the context.

Yes, of course I saw this, but I said I need to implement and dig more
into this on my own. One of my replies includes a little code snippet
regarding reply value approach. I didn't expect you to misunderstand
that I would choose reply value, so I rephrase it like above :)

>
> >
> >>
> >>> I'm doing the test right now. And leave
> >>> another selective sampling small feature until the feature of tracing
> >>> all the skbs is implemented if possible.
> >>
> >> Can you elaborate on this other feature?
> >
> > Do you recall one day I asked your opinion privately about whether we
> > can trace _all the skbs_ (not just the last skb from each sendmsg) to
> > get better insight into kernel behaviour? I can also see a couple of
> > latency issues in the kernel. If that is approved, then corresponding
> > selective sampling should be supported. That's what I was trying to
> > describe.
> >
> > The advantage of relying on the timestamping feature is that we can
> > isolate normal flows from monitored flows, so normal flows won't be
> > affected by enabling the monitoring feature. That is unlike the many
> > open source monitoring applications I've dug into: they usually hook
> > the hot path directly, like __tcp_transmit_skb() or dev_queue_xmit(),
> > which inevitably influences normal flows and causes performance
> > degradation to some extent. I noticed that after conducting some tests
> > a few months ago. The principle behind bpf fentry is to replace some
> > instructions at the very beginning of the hooked function, so every
> > call, even from normal flows, entering the monitored function gets
> > affected.
>
> I sort of guessed this while stalled in traffic... :/
>
> I was not asking to be able to selectively sample all skbs of a large msg. That
> will be a separate topic. If we really wanted to support this case (tbh, I am
> not convinced) in the future, there is all the more reason the default behavior
> should be "off" now, for consistency.

Yep, another topic. At that time, I particularly noticed that Jakub
said "all skbs" which you agreed with, so I felt reluctant...

> The comment was on the existing tcp_tx_timestamp(). First focus on allowing
> selective tracking of the skb that the current tcp_tx_timestamp() also tracks,
> because it is the best understood use case. This will allow the bpf prog to
> select which tcp_sendmsg call it should track/sample. Perhaps the bpf prog will
> limit tracking to X packets and then stop there. Perhaps the bpf prog will only
> allocate X sample slots in the bpf_sk_storage to track packets. There are many
> reasons a bpf prog may want to sample and stop tracking at some point, even in
> the current tcp_tx_timestamp().

Completely agreed, this is also what I did in my kernel module. Willem
once mentioned that Google also uses the sample feature, IIRC. So for
sure I will complete it soon in this series. Thanks for your
information, BTW. I will quote it :)

---
More discussions/suggestions are welcome since I've already proposed
_tracing all the skbs_. The idea behind this is to let this bpf
extension give us enough information about kernel (especially
stack) behaviour.

The goals are:
1) far fewer side effects on normal flows from the so_timestamping
feature, compared to the usual bpf-based trace methods.
2) deep insight into where those latencies exactly come from. I've
encountered TSQ limits (please see tcp_write_xmit(); there are more
controls) with the virtio_net driver, so maybe hooking
tcp_write_xmit() is needed as well.
3) tracing only the last skb from each sendmsg might not be enough if
the latency arises in the kernel. If one of the skbs is missed, we
will never know what happened to it.

Based on the above, I'd like to make it into an optional choice
provided to users if they want to take a deep look inside the kernel.
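To make point 3 concrete: the patch quoted earlier computes
shinfo->tskey = TCP_SKB_CB(skb)->seq + skb->len - 1, i.e. the sequence
number of the skb's last byte, and only the final skb of a sendmsg is
marked. A toy model (the helper name is made up, not kernel code) of why
the earlier skbs of a large send stay invisible:

```c
/* Toy model: a sendmsg split into mss-sized skbs. Only the last skb is
 * marked for timestamping, so the only tskey reported is the sequence
 * number of the final byte. Not kernel code; illustration only.
 */
static unsigned int last_skb_tskey(unsigned int start_seq,
				   unsigned int total, unsigned int mss)
{
	unsigned int seq = start_seq, left = total;

	while (left > mss) {		/* earlier skbs: not timestamped */
		seq += mss;
		left -= mss;
	}
	return seq + left - 1;		/* tskey = seq of the last byte */
}
```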

Thanks,
Jason


* Re: [PATCH bpf-next v8 05/12] net-timestamp: prepare for isolating two modes of SO_TIMESTAMPING
  2025-02-05 15:34   ` Willem de Bruijn
  2025-02-05 15:52     ` Jason Xing
@ 2025-02-06  8:43     ` Jason Xing
  2025-02-06 10:22       ` Jason Xing
  2025-02-06 16:13       ` Willem de Bruijn
  1 sibling, 2 replies; 66+ messages in thread
From: Jason Xing @ 2025-02-06  8:43 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: davem, edumazet, kuba, pabeni, dsahern, willemb, ast, daniel,
	andrii, martin.lau, eddyz87, song, yonghong.song, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, horms, bpf, netdev

On Wed, Feb 5, 2025 at 11:34 PM Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
>
> Jason Xing wrote:
> > No functional changes here, only add skb_enable_app_tstamp() to test
> > if the orig_skb matches the usage of application SO_TIMESTAMPING
> > or its bpf extension. And it's good to support two modes in
> > parallel later in this series.
> >
> > Also, this patch deliberately distinguish the software and
> > hardware SCM_TSTAMP_SND timestamp by passing 'sw' parameter in order
> > to avoid such a case where hardware may go wrong and pass a NULL
> > hwstamps, which is even though unlikely to happen. If it really
> > happens, bpf prog will finally consider it as a software timestamp.
> > It will be hardly recognized. Let's make the timestamping part
> > more robust.
>
> Disagree. Don't add a crutch that has not shown to be necessary for
> all this time.
>
> Just infer hw from hwtstamps != NULL.

I can surely modify this part as you said, but may I ask why? I cannot
find a good reason to absolutely trust the hardware behaviour. If that
corner case happens, it would be very hard to trace the root cause...

>
> > Signed-off-by: Jason Xing <kerneljasonxing@gmail.com>
> > ---
> >  include/linux/skbuff.h | 13 +++++++------
> >  net/core/dev.c         |  2 +-
> >  net/core/skbuff.c      | 32 ++++++++++++++++++++++++++++++--
> >  net/ipv4/tcp_input.c   |  3 ++-
> >  4 files changed, 40 insertions(+), 10 deletions(-)
> >
> > diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> > index bb2b751d274a..dfc419281cc9 100644
> > --- a/include/linux/skbuff.h
> > +++ b/include/linux/skbuff.h
> > @@ -39,6 +39,7 @@
> >  #include <net/net_debug.h>
> >  #include <net/dropreason-core.h>
> >  #include <net/netmem.h>
> > +#include <uapi/linux/errqueue.h>
> >
> >  /**
> >   * DOC: skb checksums
> > @@ -4533,18 +4534,18 @@ void skb_complete_tx_timestamp(struct sk_buff *skb,
> >
> >  void __skb_tstamp_tx(struct sk_buff *orig_skb, const struct sk_buff *ack_skb,
> >                    struct skb_shared_hwtstamps *hwtstamps,
> > -                  struct sock *sk, int tstype);
> > +                  struct sock *sk, bool sw, int tstype);
> >
> >  /**
> > - * skb_tstamp_tx - queue clone of skb with send time stamps
> > + * skb_tstamp_tx - queue clone of skb with send HARDWARE timestamps
>
> Unfortunately this cannot be modified to skb_tstamp_tx_hw, as that
> would require updating way too many callers.

I didn't change the name, only the description and usage of
skb_tstamp_tx(). It is always called in the hardware timestamping
case, except from skb_tx_timestamp(), which is modified.

Thanks,
Jason


* Re: [PATCH bpf-next v8 05/12] net-timestamp: prepare for isolating two modes of SO_TIMESTAMPING
  2025-02-06  8:43     ` Jason Xing
@ 2025-02-06 10:22       ` Jason Xing
  2025-02-06 16:13       ` Willem de Bruijn
  1 sibling, 0 replies; 66+ messages in thread
From: Jason Xing @ 2025-02-06 10:22 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: davem, edumazet, kuba, pabeni, dsahern, willemb, ast, daniel,
	andrii, martin.lau, eddyz87, song, yonghong.song, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, horms, bpf, netdev

On Thu, Feb 6, 2025 at 4:43 PM Jason Xing <kerneljasonxing@gmail.com> wrote:
>
> On Wed, Feb 5, 2025 at 11:34 PM Willem de Bruijn
> <willemdebruijn.kernel@gmail.com> wrote:
> >
> > Jason Xing wrote:
> > > No functional changes here, only add skb_enable_app_tstamp() to test
> > > if the orig_skb matches the usage of application SO_TIMESTAMPING
> > > or its bpf extension. And it's good to support two modes in
> > > parallel later in this series.
> > >
> > > Also, this patch deliberately distinguish the software and
> > > hardware SCM_TSTAMP_SND timestamp by passing 'sw' parameter in order
> > > to avoid such a case where hardware may go wrong and pass a NULL
> > > hwstamps, which is even though unlikely to happen. If it really
> > > happens, bpf prog will finally consider it as a software timestamp.
> > > It will be hardly recognized. Let's make the timestamping part
> > > more robust.
> >
> > Disagree. Don't add a crutch that has not shown to be necessary for
> > all this time.
> >
> > Just infer hw from hwtstamps != NULL.
>
> I can surely modify this part as you said, but may I ask why? I cannot
> find a good reason to absolutely trust the hardware behaviour. If that
> corner case happens, it would be very hard to trace the root cause...

No offense, just curious. I can keep the same approach as
SO_TIMESTAMPING since you disagree. I have no strong preference
because I found it simpler after rewriting this part.

I will simplify this patch in v9 :)

Thanks,
Jason


* Re: [PATCH bpf-next v8 05/12] net-timestamp: prepare for isolating two modes of SO_TIMESTAMPING
  2025-02-06  8:43     ` Jason Xing
  2025-02-06 10:22       ` Jason Xing
@ 2025-02-06 16:13       ` Willem de Bruijn
  2025-02-07  0:22         ` Jason Xing
  1 sibling, 1 reply; 66+ messages in thread
From: Willem de Bruijn @ 2025-02-06 16:13 UTC (permalink / raw)
  To: Jason Xing, Willem de Bruijn
  Cc: davem, edumazet, kuba, pabeni, dsahern, willemb, ast, daniel,
	andrii, martin.lau, eddyz87, song, yonghong.song, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, horms, bpf, netdev

Jason Xing wrote:
> On Wed, Feb 5, 2025 at 11:34 PM Willem de Bruijn
> <willemdebruijn.kernel@gmail.com> wrote:
> >
> > Jason Xing wrote:
> > > No functional changes here, only add skb_enable_app_tstamp() to test
> > > if the orig_skb matches the usage of application SO_TIMESTAMPING
> > > or its bpf extension. And it's good to support two modes in
> > > parallel later in this series.
> > >
> > > Also, this patch deliberately distinguish the software and
> > > hardware SCM_TSTAMP_SND timestamp by passing 'sw' parameter in order
> > > to avoid such a case where hardware may go wrong and pass a NULL
> > > hwstamps, which is even though unlikely to happen. If it really
> > > happens, bpf prog will finally consider it as a software timestamp.
> > > It will be hardly recognized. Let's make the timestamping part
> > > more robust.
> >
> > Disagree. Don't add a crutch that has not shown to be necessary for
> > all this time.
> >
> > Just infer hw from hwtstamps != NULL.
> 
> I can surely modify this part as you said, but may I ask why? I cannot
> find a good reason to absolutely trust the hardware behaviour. If that
> corner case happens, it would be very hard to trace the root cause...

A NULL pointer exception is easy to find.

It's not a hardware bug, but a driver bug. Given the small number of
drivers implementing this API, it could even be found through code
inspection.

As a general rule of thumb we don't add protection mechanisms to paper
over bugs elsewhere in the kernel. But detect and fix the bugs. An
exception to the general rule is when buggy code is hard to find. That
is not the case here.


* Re: [PATCH bpf-next v8 05/12] net-timestamp: prepare for isolating two modes of SO_TIMESTAMPING
  2025-02-06 16:13       ` Willem de Bruijn
@ 2025-02-07  0:22         ` Jason Xing
  0 siblings, 0 replies; 66+ messages in thread
From: Jason Xing @ 2025-02-07  0:22 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: davem, edumazet, kuba, pabeni, dsahern, willemb, ast, daniel,
	andrii, martin.lau, eddyz87, song, yonghong.song, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, horms, bpf, netdev

On Fri, Feb 7, 2025 at 12:13 AM Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
>
> Jason Xing wrote:
> > On Wed, Feb 5, 2025 at 11:34 PM Willem de Bruijn
> > <willemdebruijn.kernel@gmail.com> wrote:
> > >
> > > Jason Xing wrote:
> > > > No functional changes here, only add skb_enable_app_tstamp() to test
> > > > if the orig_skb matches the usage of application SO_TIMESTAMPING
> > > > or its bpf extension. And it's good to support two modes in
> > > > parallel later in this series.
> > > >
> > > > Also, this patch deliberately distinguish the software and
> > > > hardware SCM_TSTAMP_SND timestamp by passing 'sw' parameter in order
> > > > to avoid such a case where hardware may go wrong and pass a NULL
> > > > hwstamps, which is even though unlikely to happen. If it really
> > > > happens, bpf prog will finally consider it as a software timestamp.
> > > > It will be hardly recognized. Let's make the timestamping part
> > > > more robust.
> > >
> > > Disagree. Don't add a crutch that has not shown to be necessary for
> > > all this time.
> > >
> > > Just infer hw from hwtstamps != NULL.
> >
> > I can surely modify this part as you said, but may I ask why? I cannot
> > find a good reason to absolutely trust the hardware behaviour. If that
> > corner case happens, it would be very hard to trace the root cause...
>
> A NULL pointer exception is easy to find.
>
> It's not a hardware bug, but a driver bug. Given the small number of
> drivers implementing this API, it could even be found through code
> inspection.
>
> As a general rule of thumb we don't add protection mechanisms to paper
> over bugs elsewhere in the kernel. But detect and fix the bugs. An
> exception to the general rule is when buggy code is hard to find. That
> is not the case here.

Thanks for the explanation.

Thanks,
Jason


* Re: [PATCH bpf-next v8 10/12] bpf: make TCP tx timestamp bpf extension work
  2025-02-06  6:56                         ` Jason Xing
@ 2025-02-07  2:07                           ` Martin KaFai Lau
  2025-02-07  2:18                             ` Jason Xing
  2025-02-07 13:34                             ` Jason Xing
  0 siblings, 2 replies; 66+ messages in thread
From: Martin KaFai Lau @ 2025-02-07  2:07 UTC (permalink / raw)
  To: Jason Xing
  Cc: Willem de Bruijn, Jakub Kicinski, davem, edumazet, pabeni,
	dsahern, willemb, ast, daniel, andrii, eddyz87, song,
	yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa, horms,
	bpf, netdev

On 2/5/25 10:56 PM, Jason Xing wrote:
>>> I have to rephrase a bit in case Martin visits here soon: I will
>>> compare two approaches 1) reply value, 2) bpf kfunc and then see which
>>> way is better.
>>
>> I have already explained in detail why 1) the reply value from the bpf prog
>> won't work. Please go back to that reply, which has the context.
> 
> Yes, of course I saw this, but I said I need to implement and dig more
> into this on my own. One of my replies includes a little code snippet
> regarding reply value approach. I didn't expect you to misunderstand
> that I would choose reply value, so I rephrase it like above :)

I did see the code snippet, which is incomplete, so I have to guess. afaik, it 
is not going to work. I was hoping to save some time by heading off a detour 
down the reply-value path in case my earlier message was missed. I will stay 
quiet and wait for v9, to avoid extending this long thread further.


* Re: [PATCH bpf-next v8 10/12] bpf: make TCP tx timestamp bpf extension work
  2025-02-07  2:07                           ` Martin KaFai Lau
@ 2025-02-07  2:18                             ` Jason Xing
  2025-02-07 12:07                               ` Jason Xing
  2025-02-07 13:34                             ` Jason Xing
  1 sibling, 1 reply; 66+ messages in thread
From: Jason Xing @ 2025-02-07  2:18 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: Willem de Bruijn, Jakub Kicinski, davem, edumazet, pabeni,
	dsahern, willemb, ast, daniel, andrii, eddyz87, song,
	yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa, horms,
	bpf, netdev

On Fri, Feb 7, 2025 at 10:07 AM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 2/5/25 10:56 PM, Jason Xing wrote:
> >>> I have to rephrase a bit in case Martin visits here soon: I will
> >>> compare two approaches 1) reply value, 2) bpf kfunc and then see which
> >>> way is better.
> >>
> >> I have already explained in details why the 1) reply value from the bpf prog
> >> won't work. Please go back to that reply which has the context.
> >
> > Yes, of course I saw this, but I said I need to implement and dig more
> > into this on my own. One of my replies includes a little code snippet
> > regarding reply value approach. I didn't expect you to misunderstand
> > that I would choose reply value, so I rephrase it like above :)
>
> I did see the code snippet which is incomplete, so I have to guess. afaik, it is
> not going to work. I was hoping to save some time without detouring to the
> reply-value path in case my earlier message was missed. I will stay quiet and
> wait for v9 first then to avoid extending this long thread further.

I see. I'm grateful that you pointed out the right path. I'm still
investigating to find a good existing example in the selftests and to
figure out how to support kfuncs.

Thanks,
Jason


* Re: [PATCH bpf-next v8 10/12] bpf: make TCP tx timestamp bpf extension work
  2025-02-07  2:18                             ` Jason Xing
@ 2025-02-07 12:07                               ` Jason Xing
  2025-02-08  2:11                                 ` Martin KaFai Lau
  0 siblings, 1 reply; 66+ messages in thread
From: Jason Xing @ 2025-02-07 12:07 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: Willem de Bruijn, Jakub Kicinski, davem, edumazet, pabeni,
	dsahern, willemb, ast, daniel, andrii, eddyz87, song,
	yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa, horms,
	bpf, netdev

On Fri, Feb 7, 2025 at 10:18 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
>
> On Fri, Feb 7, 2025 at 10:07 AM Martin KaFai Lau <martin.lau@linux.dev> wrote:
> >
> > On 2/5/25 10:56 PM, Jason Xing wrote:
> > >>> I have to rephrase a bit in case Martin visits here soon: I will
> > >>> compare two approaches 1) reply value, 2) bpf kfunc and then see which
> > >>> way is better.
> > >>
> > >> I have already explained in details why the 1) reply value from the bpf prog
> > >> won't work. Please go back to that reply which has the context.
> > >
> > > Yes, of course I saw this, but I said I need to implement and dig more
> > > into this on my own. One of my replies includes a little code snippet
> > > regarding reply value approach. I didn't expect you to misunderstand
> > > that I would choose reply value, so I rephrase it like above :)
> >
> > I did see the code snippet which is incomplete, so I have to guess. afaik, it is
> > not going to work. I was hoping to save some time without detouring to the
> > reply-value path in case my earlier message was missed. I will stay quiet and
> > wait for v9 first then to avoid extending this long thread further.
>
> I see. I'm grateful that you point out the right path. I'm still
> investigating to find a good existing example in selftests and how to
> support kfunc.

Martin, sorry to revive this thread.

It's a little hard for me to find a proper example to follow. I tried
to call a __bpf_kfunc in the BPF_SOCK_OPS_TS_SND_CB callback and failed
because the kfunc was not recognized in the sock_ops case. Later, I
tried a kprobe to hook a function, say tcp_tx_timestamp_bpf(), passed
the skb parameter to the kfunc, and got an error.

Here is the code snippet:
1) net/ipv4/tcp.c
+__bpf_kfunc static void tcp_init_tx_timestamp(struct sk_buff *skb)
+{
+       struct skb_shared_info *shinfo = skb_shinfo(skb);
+       struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
+
+       printk(KERN_ERR "jason: %d, %d\n\n", tcb->txstamp_ack,
shinfo->tx_flags);
+       /*
+       tcb->txstamp_ack = 2;
+       shinfo->tx_flags |= SKBTX_BPF;
+       shinfo->tskey = TCP_SKB_CB(skb)->seq + skb->len - 1;
+       */
+}
Note: I skipped copying some code, like the BTF_ID_FLAGS part...

2) bpf prog
SEC("kprobe/tcp_tx_timestamp_bpf") // I wrote a new function/wrapper to hook
int BPF_KPROBE(kprobe__tcp_tx_timestamp_bpf, struct sock *sk, struct
sk_buff *skb)
{
        tcp_init_tx_timestamp(skb);
        return 0;
}

Then running the bpf prog, I got the following message:
; tcp_init_tx_timestamp(skb); @ so_timestamping.c:281
1: (85) call tcp_init_tx_timestamp#120682
arg#0 pointer type STRUCT sk_buff must point to scalar, or struct with scalar
processed 2 insns (limit 1000000) max_states_per_insn 0 total_states 0
peak_states 0 mark_read 0
-- END PROG LOAD LOG --
libbpf: prog 'kprobe__tcp_tx_timestamp_bpf': failed to load: -22
libbpf: failed to load object 'so_timestamping'
libbpf: failed to load BPF skeleton 'so_timestamping': -22
test_so_timestamping:FAIL:open and load skel unexpected error: -22

If I don't pass any parameter to the kfunc, it works.

Should we add __bpf_kfunc support for sock_ops?

Please enlighten me more about this. Thanks in advance!

Thanks,
Jason


* Re: [PATCH bpf-next v8 10/12] bpf: make TCP tx timestamp bpf extension work
  2025-02-07  2:07                           ` Martin KaFai Lau
  2025-02-07  2:18                             ` Jason Xing
@ 2025-02-07 13:34                             ` Jason Xing
  1 sibling, 0 replies; 66+ messages in thread
From: Jason Xing @ 2025-02-07 13:34 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: Willem de Bruijn, Jakub Kicinski, davem, edumazet, pabeni,
	dsahern, willemb, ast, daniel, andrii, eddyz87, song,
	yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa, horms,
	bpf, netdev

On Fri, Feb 7, 2025 at 10:07 AM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 2/5/25 10:56 PM, Jason Xing wrote:
> >>> I have to rephrase a bit in case Martin visits here soon: I will
> >>> compare two approaches 1) reply value, 2) bpf kfunc and then see which
> >>> way is better.
> >>
> >> I have already explained in details why the 1) reply value from the bpf prog
> >> won't work. Please go back to that reply which has the context.
> >
> > Yes, of course I saw this, but I said I need to implement and dig more
> > into this on my own. One of my replies includes a little code snippet
> > regarding reply value approach. I didn't expect you to misunderstand
> > that I would choose reply value, so I rephrase it like above :)
>
> I did see the code snippet which is incomplete, so I have to guess. afaik, it is
> not going to work. I was hoping to save some time without detouring to the
> reply-value path in case my earlier message was missed. I will stay quiet and
> wait for v9 first then to avoid extending this long thread further.

FYI, the code I adjusted works, though it's a little ugly.

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index ad4f056aff22..44b4f8655668 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -498,10 +498,13 @@ static void tcp_tx_timestamp(struct sock *sk,
struct sockcm_cookie *sockc)
                struct skb_shared_info *shinfo = skb_shinfo(skb);
                struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);

-               tcb->txstamp_ack = 2;
-               shinfo->tx_flags |= SKBTX_BPF;
                shinfo->tskey = TCP_SKB_CB(skb)->seq + skb->len - 1;
-               bpf_skops_tx_timestamping(sk, skb, BPF_SOCK_OPS_TS_SND_CB);
+               if (bpf_skops_tx_timestamping(sk, skb,
BPF_SOCK_OPS_TS_SND_CB)) {
+                       tcb->txstamp_ack = 2;
+                       shinfo->tx_flags |= SKBTX_BPF;
+               } else {
+                       shinfo->tskey = 0;
+               }
        }
 }

I'm not sure whether it meets your requirement. The reason I resorted
to this method is that my attempt to use a kfunc failed and I struggled
to read a lot of the BTF code :(

So please provide more hints so that I can start again. Thanks.

Thanks,
Jason


* Re: [PATCH bpf-next v8 10/12] bpf: make TCP tx timestamp bpf extension work
  2025-02-07 12:07                               ` Jason Xing
@ 2025-02-08  2:11                                 ` Martin KaFai Lau
  2025-02-08  6:53                                   ` Jason Xing
  0 siblings, 1 reply; 66+ messages in thread
From: Martin KaFai Lau @ 2025-02-08  2:11 UTC (permalink / raw)
  To: Jason Xing
  Cc: Willem de Bruijn, Jakub Kicinski, davem, edumazet, pabeni,
	dsahern, willemb, ast, daniel, andrii, eddyz87, song,
	yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa, horms,
	bpf, netdev

On 2/7/25 4:07 AM, Jason Xing wrote:
> On Fri, Feb 7, 2025 at 10:18 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
>>
>> On Fri, Feb 7, 2025 at 10:07 AM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>>>
>>> On 2/5/25 10:56 PM, Jason Xing wrote:
>>>>>> I have to rephrase a bit in case Martin visits here soon: I will
>>>>>> compare two approaches 1) reply value, 2) bpf kfunc and then see which
>>>>>> way is better.
>>>>>
>>>>> I have already explained in details why the 1) reply value from the bpf prog
>>>>> won't work. Please go back to that reply which has the context.
>>>>
>>>> Yes, of course I saw this, but I said I need to implement and dig more
>>>> into this on my own. One of my replies includes a little code snippet
>>>> regarding reply value approach. I didn't expect you to misunderstand
>>>> that I would choose reply value, so I rephrase it like above :)
>>>
>>> I did see the code snippet which is incomplete, so I have to guess. afaik, it is
>>> not going to work. I was hoping to save some time without detouring to the
>>> reply-value path in case my earlier message was missed. I will stay quiet and
>>> wait for v9 first then to avoid extending this long thread further.
>>
>> I see. I'm grateful that you point out the right path. I'm still
>> investigating to find a good existing example in selftests and how to
>> support kfunc.
> 
> Martin, sorry to revive this thread.
> 
> It's a little bit hard for me to find a proper example to follow. I
> tried to call __bpf_kfunc in the BPF_SOCK_OPS_TS_SND_CB callback and
> then failed because kfunc is not supported in the sock_ops case.
> Later, I tried to kprobe to hook a function, say,
> tcp_tx_timestamp_bpf(), passed the skb parameter to the kfunc and then
> got an error.
> 
> Here is code snippet:
> 1) net/ipv4/tcp.c
> +__bpf_kfunc static void tcp_init_tx_timestamp(struct sk_buff *skb)
> +{
> +       struct skb_shared_info *shinfo = skb_shinfo(skb);
> +       struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
> +
> +       printk(KERN_ERR "jason: %d, %d\n\n", tcb->txstamp_ack,
> shinfo->tx_flags);
> +       /*
> +       tcb->txstamp_ack = 2;
> +       shinfo->tx_flags |= SKBTX_BPF;
> +       shinfo->tskey = TCP_SKB_CB(skb)->seq + skb->len - 1;
> +       */
> +}
> Note: I skipped copying some codes like BTF_ID_FLAGS...

This part is missing, so I can only guess again. The BTF_ID_FLAGS and
kfunc registration part is likely where things went wrong when adding
the new kfunc for the sock_ops program. There are kfunc examples for
netdev-related bpf progs in filter.c, e.g. bpf_sock_addr_set_sun_path.

[ The same goes for another, later message where the changes in
   bpf_skops_tx_timestamping are missing, so I won't comment there. ]

> 
> 2) bpf prog
> SEC("kprobe/tcp_tx_timestamp_bpf") // I wrote a new function/wrapper to hook
> int BPF_KPROBE(kprobe__tcp_tx_timestamp_bpf, struct sock *sk, struct
> sk_buff *skb)
> {
>          tcp_init_tx_timestamp(skb);
>          return 0;
> }
> 
> Then running the bpf prog, I got the following message:
> ; tcp_init_tx_timestamp(skb); @ so_timestamping.c:281
> 1: (85) call tcp_init_tx_timestamp#120682
> arg#0 pointer type STRUCT sk_buff must point to scalar, or struct with scalar
> processed 2 insns (limit 1000000) max_states_per_insn 0 total_states 0
> peak_states 0 mark_read 0
> -- END PROG LOAD LOG --
> libbpf: prog 'kprobe__tcp_tx_timestamp_bpf': failed to load: -22
> libbpf: failed to load object 'so_timestamping'
> libbpf: failed to load BPF skeleton 'so_timestamping': -22
> test_so_timestamping:FAIL:open and load skel unexpected error: -22
> 
> If I don't pass any parameter in the kfunc, it can work.
> 
> Should we support the sock_ops for __bpf_kfunc?

sock_ops does support kfuncs. The patch 12 selftest uses
bpf_cast_to_kern_ctx(), which is a kfunc:

--------8<--------
BTF_KFUNCS_START(common_btf_ids)
BTF_ID_FLAGS(func, bpf_cast_to_kern_ctx, KF_FASTCALL)
-------->8--------

It's just that the new kfunc was not registered in the right place, so
the verifier cannot find it.

Untested code on top of your v8, so it doesn't include your latest
changes on the txstamp_ack_bpf bits, etc.

diff --git i/kernel/bpf/btf.c w/kernel/bpf/btf.c
index 9433b6467bbe..740210f883dc 100644
--- i/kernel/bpf/btf.c
+++ w/kernel/bpf/btf.c
@@ -8522,6 +8522,7 @@ static int bpf_prog_type_to_kfunc_hook(enum bpf_prog_type prog_type)
  	case BPF_PROG_TYPE_CGROUP_SOCK_ADDR:
  	case BPF_PROG_TYPE_CGROUP_SOCKOPT:
  	case BPF_PROG_TYPE_CGROUP_SYSCTL:
+	case BPF_PROG_TYPE_SOCK_OPS:
  		return BTF_KFUNC_HOOK_CGROUP;
  	case BPF_PROG_TYPE_SCHED_ACT:
  		return BTF_KFUNC_HOOK_SCHED_ACT;
diff --git i/net/core/filter.c w/net/core/filter.c
index d3395ffe058e..3bad67eb5c9e 100644
--- i/net/core/filter.c
+++ w/net/core/filter.c
@@ -12102,6 +12102,30 @@ __bpf_kfunc int bpf_sk_assign_tcp_reqsk(struct __sk_buff *s, struct sock *sk,
  #endif
  }
  
+enum {
+	BPF_SOCK_OPS_TX_TSTAMP_TCP_ACK = 1 << 0,
+};
+
+__bpf_kfunc int bpf_sock_ops_enable_tx_tstamp(struct bpf_sock_ops_kern *skops, int flags)
+{
+	struct sk_buff *skb;
+
+	if (skops->op != BPF_SOCK_OPS_TS_SND_CB)
+		return -EOPNOTSUPP;
+
+	if (flags & ~BPF_SOCK_OPS_TX_TSTAMP_TCP_ACK)
+		return -EINVAL;
+
+	skb = skops->skb;
+	/* [REMOVE THIS COMMENT]: sk_is_tcp check will be needed in the future */
+	if (flags & BPF_SOCK_OPS_TX_TSTAMP_TCP_ACK)
+		TCP_SKB_CB(skb)->txstamp_ack_bpf = 1;
+	skb_shinfo(skb)->tx_flags |= SKBTX_BPF;
+	skb_shinfo(skb)->tskey = TCP_SKB_CB(skb)->seq + skb->len - 1;
+
+	return 0;
+}
+
  __bpf_kfunc_end_defs();
  
  int bpf_dynptr_from_skb_rdonly(struct __sk_buff *skb, u64 flags,
@@ -12135,6 +12159,10 @@ BTF_KFUNCS_START(bpf_kfunc_check_set_tcp_reqsk)
  BTF_ID_FLAGS(func, bpf_sk_assign_tcp_reqsk, KF_TRUSTED_ARGS)
  BTF_KFUNCS_END(bpf_kfunc_check_set_tcp_reqsk)
  
+BTF_KFUNCS_START(bpf_kfunc_check_set_sock_ops)
+BTF_ID_FLAGS(func, bpf_sock_ops_enable_tx_tstamp, KF_TRUSTED_ARGS)
+BTF_KFUNCS_END(bpf_kfunc_check_set_sock_ops)
+
  static const struct btf_kfunc_id_set bpf_kfunc_set_skb = {
  	.owner = THIS_MODULE,
  	.set = &bpf_kfunc_check_set_skb,
@@ -12155,6 +12183,11 @@ static const struct btf_kfunc_id_set bpf_kfunc_set_tcp_reqsk = {
  	.set = &bpf_kfunc_check_set_tcp_reqsk,
  };
  
+static const struct btf_kfunc_id_set bpf_kfunc_set_sock_ops = {
+	.owner = THIS_MODULE,
+	.set = &bpf_kfunc_check_set_sock_ops,
+};
+
  static int __init bpf_kfunc_init(void)
  {
  	int ret;
@@ -12173,6 +12206,7 @@ static int __init bpf_kfunc_init(void)
  	ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_XDP, &bpf_kfunc_set_xdp);
  	ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_CGROUP_SOCK_ADDR,
  					       &bpf_kfunc_set_sock_addr);
+	ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_SOCK_OPS, &bpf_kfunc_set_sock_ops);
  	return ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_SCHED_CLS, &bpf_kfunc_set_tcp_reqsk);
  }
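As a quick model of the semantics in the kfunc above (unknown flag bits are rejected, and tskey marks the skb's last payload byte), here is a small sketch. This is plain Python for illustration only, not kernel code; the names mirror the diff, and a dict stands in for the skb fields:

```python
BPF_SOCK_OPS_TX_TSTAMP_TCP_ACK = 1 << 0
EINVAL, EOPNOTSUPP = 22, 95

def enable_tx_tstamp(skb, flags, op="BPF_SOCK_OPS_TS_SND_CB"):
    """Toy model of bpf_sock_ops_enable_tx_tstamp() from the diff above."""
    if op != "BPF_SOCK_OPS_TS_SND_CB":
        return -EOPNOTSUPP          # only callable from the send callback
    if flags & ~BPF_SOCK_OPS_TX_TSTAMP_TCP_ACK:
        return -EINVAL              # reject unknown flag bits
    if flags & BPF_SOCK_OPS_TX_TSTAMP_TCP_ACK:
        skb["txstamp_ack_bpf"] = 1  # request the TCP ACK timestamp too
    skb["SKBTX_BPF"] = True
    # tskey identifies the last byte of this skb's payload;
    # the TCP sequence space wraps at 2^32.
    skb["tskey"] = (skb["seq"] + skb["len"] - 1) & 0xFFFFFFFF
    return 0
```

For example, an skb starting at seq 1000 with 100 bytes of payload gets tskey 1099, and passing an undefined flag bit fails with -EINVAL, matching the checks in the diff.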


* Re: [PATCH bpf-next v8 10/12] bpf: make TCP tx timestamp bpf extension work
  2025-02-08  2:11                                 ` Martin KaFai Lau
@ 2025-02-08  6:53                                   ` Jason Xing
  0 siblings, 0 replies; 66+ messages in thread
From: Jason Xing @ 2025-02-08  6:53 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: Willem de Bruijn, Jakub Kicinski, davem, edumazet, pabeni,
	dsahern, willemb, ast, daniel, andrii, eddyz87, song,
	yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa, horms,
	bpf, netdev

On Sat, Feb 8, 2025 at 10:11 AM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 2/7/25 4:07 AM, Jason Xing wrote:
> > On Fri, Feb 7, 2025 at 10:18 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
> >>
> >> On Fri, Feb 7, 2025 at 10:07 AM Martin KaFai Lau <martin.lau@linux.dev> wrote:
> >>>
> >>> On 2/5/25 10:56 PM, Jason Xing wrote:
> >>>>>> I have to rephrase a bit in case Martin visits here soon: I will
> >>>>>> compare two approaches 1) reply value, 2) bpf kfunc and then see which
> >>>>>> way is better.
> >>>>>
> >>>>> I have already explained in details why the 1) reply value from the bpf prog
> >>>>> won't work. Please go back to that reply which has the context.
> >>>>
> >>>> Yes, of course I saw this, but I said I need to implement and dig more
> >>>> into this on my own. One of my replies includes a little code snippet
> >>>> regarding reply value approach. I didn't expect you to misunderstand
> >>>> that I would choose reply value, so I rephrase it like above :)
> >>>
> >>> I did see the code snippet which is incomplete, so I have to guess. afaik, it is
> >>> not going to work. I was hoping to save some time without detouring to the
> >>> reply-value path in case my earlier message was missed. I will stay quiet and
> >>> wait for v9 first then to avoid extending this long thread further.
> >>
> >> I see. I'm grateful that you point out the right path. I'm still
> >> investigating to find a good existing example in selftests and how to
> >> support kfunc.
> >
> > Martin, sorry to revive this thread.
> >
> > It's a little bit hard for me to find a proper example to follow. I
> > tried to call __bpf_kfunc in the BPF_SOCK_OPS_TS_SND_CB callback and
> > then failed because kfunc is not supported in the sock_ops case.
> > Later, I tried to kprobe to hook a function, say,
> > tcp_tx_timestamp_bpf(), passed the skb parameter to the kfunc and then
> > got an error.
> >
> > Here is code snippet:
> > 1) net/ipv4/tcp.c
> > +__bpf_kfunc static void tcp_init_tx_timestamp(struct sk_buff *skb)
> > +{
> > +       struct skb_shared_info *shinfo = skb_shinfo(skb);
> > +       struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
> > +
> > +       printk(KERN_ERR "jason: %d, %d\n\n", tcb->txstamp_ack,
> > shinfo->tx_flags);
> > +       /*
> > +       tcb->txstamp_ack = 2;
> > +       shinfo->tx_flags |= SKBTX_BPF;
> > +       shinfo->tskey = TCP_SKB_CB(skb)->seq + skb->len - 1;
> > +       */
> > +}
> > Note: I skipped copying some codes like BTF_ID_FLAGS...
>
> This part is missing, so I can only guess again. This BTF_ID_FLAGS
> and the kfunc registration part went wrong when trying to add the
> new kfunc for the sock_ops program. There are kfunc examples for
> netdev related bpf prog in filter.c. e.g. bpf_sock_addr_set_sun_path.
>
> [ The same goes for another later message where the changes in
>    bpf_skops_tx_timestamping is missing, so I won't comment there. ]
>
> >
> > 2) bpf prog
> > SEC("kprobe/tcp_tx_timestamp_bpf") // I wrote a new function/wrapper to hook
> > int BPF_KPROBE(kprobe__tcp_tx_timestamp_bpf, struct sock *sk, struct
> > sk_buff *skb)
> > {
> >          tcp_init_tx_timestamp(skb);
> >          return 0;
> > }
> >
> > Then running the bpf prog, I got the following message:
> > ; tcp_init_tx_timestamp(skb); @ so_timestamping.c:281
> > 1: (85) call tcp_init_tx_timestamp#120682
> > arg#0 pointer type STRUCT sk_buff must point to scalar, or struct with scalar
> > processed 2 insns (limit 1000000) max_states_per_insn 0 total_states 0
> > peak_states 0 mark_read 0
> > -- END PROG LOAD LOG --
> > libbpf: prog 'kprobe__tcp_tx_timestamp_bpf': failed to load: -22
> > libbpf: failed to load object 'so_timestamping'
> > libbpf: failed to load BPF skeleton 'so_timestamping': -22
> > test_so_timestamping:FAIL:open and load skel unexpected error: -22
> >
> > If I don't pass any parameter in the kfunc, it can work.
> >
> > Should we support the sock_ops for __bpf_kfunc?
>
> sock_ops does support kfunc. The patch 12 selftest is using the
> bpf_cast_to_kern_ctx() and it is a kfunc:
>
> --------8<--------
> BTF_KFUNCS_START(common_btf_ids)
> BTF_ID_FLAGS(func, bpf_cast_to_kern_ctx, KF_FASTCALL)
> -------->8--------
>
> It just the new kfunc is not registered at the right place, so the verifier
> cannot find it.
>
> Untested code on top of your v8, so I don't have your latest
> changes on the txstamp_ack_bpf bits...etc.

Thanks for sharing your deep understanding of BPF. It's working now!
Many thanks.

>
> diff --git i/kernel/bpf/btf.c w/kernel/bpf/btf.c
> index 9433b6467bbe..740210f883dc 100644
> --- i/kernel/bpf/btf.c
> +++ w/kernel/bpf/btf.c
> @@ -8522,6 +8522,7 @@ static int bpf_prog_type_to_kfunc_hook(enum bpf_prog_type prog_type)
>         case BPF_PROG_TYPE_CGROUP_SOCK_ADDR:
>         case BPF_PROG_TYPE_CGROUP_SOCKOPT:
>         case BPF_PROG_TYPE_CGROUP_SYSCTL:
> +       case BPF_PROG_TYPE_SOCK_OPS:

The above line is exactly what I was missing (before this, I had no
clue how to write this part), and its absence was why my whole kfunc
feature didn't work.

>                 return BTF_KFUNC_HOOK_CGROUP;
>         case BPF_PROG_TYPE_SCHED_ACT:
>                 return BTF_KFUNC_HOOK_SCHED_ACT;
> diff --git i/net/core/filter.c w/net/core/filter.c
> index d3395ffe058e..3bad67eb5c9e 100644
> --- i/net/core/filter.c
> +++ w/net/core/filter.c
> @@ -12102,6 +12102,30 @@ __bpf_kfunc int bpf_sk_assign_tcp_reqsk(struct __sk_buff *s, struct sock *sk,
>   #endif
>   }
>
> +enum {
> +       BPF_SOCK_OPS_TX_TSTAMP_TCP_ACK = 1 << 0,
> +};

Could I remove this flag since we have BPF_SOCK_OPS_TS_ACK_OPT_CB to
control whether to report or not?


> +
> +__bpf_kfunc int bpf_sock_ops_enable_tx_tstamp(struct bpf_sock_ops_kern *skops, int flags)
> +{
> +       struct sk_buff *skb;
> +
> +       if (skops->op != BPF_SOCK_OPS_TS_SND_CB)
> +               return -EOPNOTSUPP;
> +
> +       if (flags & ~BPF_SOCK_OPS_TX_TSTAMP_TCP_ACK)
> +               return -EINVAL;
> +
> +       skb = skops->skb;
> +       /* [REMOVE THIS COMMENT]: sk_is_tcp check will be needed in the future */
> +       if (flags & BPF_SOCK_OPS_TX_TSTAMP_TCP_ACK)
> +               TCP_SKB_CB(skb)->txstamp_ack_bpf = 1;
> +       skb_shinfo(skb)->tx_flags |= SKBTX_BPF;
> +       skb_shinfo(skb)->tskey = TCP_SKB_CB(skb)->seq + skb->len - 1;
> +
> +       return 0;
> +}
> +
>   __bpf_kfunc_end_defs();
>
>   int bpf_dynptr_from_skb_rdonly(struct __sk_buff *skb, u64 flags,
> @@ -12135,6 +12159,10 @@ BTF_KFUNCS_START(bpf_kfunc_check_set_tcp_reqsk)
>   BTF_ID_FLAGS(func, bpf_sk_assign_tcp_reqsk, KF_TRUSTED_ARGS)
>   BTF_KFUNCS_END(bpf_kfunc_check_set_tcp_reqsk)
>
> +BTF_KFUNCS_START(bpf_kfunc_check_set_sock_ops)
> +BTF_ID_FLAGS(func, bpf_sock_ops_enable_tx_tstamp, KF_TRUSTED_ARGS)
> +BTF_KFUNCS_END(bpf_kfunc_check_set_sock_ops)
> +
>   static const struct btf_kfunc_id_set bpf_kfunc_set_skb = {
>         .owner = THIS_MODULE,
>         .set = &bpf_kfunc_check_set_skb,
> @@ -12155,6 +12183,11 @@ static const struct btf_kfunc_id_set bpf_kfunc_set_tcp_reqsk = {
>         .set = &bpf_kfunc_check_set_tcp_reqsk,
>   };
>
> +static const struct btf_kfunc_id_set bpf_kfunc_set_sock_ops = {
> +       .owner = THIS_MODULE,
> +       .set = &bpf_kfunc_check_set_sock_ops,
> +};
> +
>   static int __init bpf_kfunc_init(void)
>   {
>         int ret;
> @@ -12173,6 +12206,7 @@ static int __init bpf_kfunc_init(void)
>         ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_XDP, &bpf_kfunc_set_xdp);
>         ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_CGROUP_SOCK_ADDR,
>                                                &bpf_kfunc_set_sock_addr);
> +       ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_SOCK_OPS, &bpf_kfunc_set_sock_ops);
>         return ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_SCHED_CLS, &bpf_kfunc_set_tcp_reqsk);
>   }


* Re: [PATCH bpf-next v8 08/12] bpf: support hw SCM_TSTAMP_SND of SO_TIMESTAMPING
  2025-02-05 16:03     ` Jason Xing
@ 2025-02-10 22:39       ` Martin KaFai Lau
  2025-02-11  0:00         ` Jason Xing
  0 siblings, 1 reply; 66+ messages in thread
From: Martin KaFai Lau @ 2025-02-10 22:39 UTC (permalink / raw)
  To: Jason Xing, Willem de Bruijn
  Cc: davem, edumazet, kuba, pabeni, dsahern, willemb, ast, daniel,
	andrii, eddyz87, song, yonghong.song, john.fastabend, kpsingh,
	sdf, haoluo, jolsa, horms, bpf, netdev

On 2/5/25 8:03 AM, Jason Xing wrote:
>>> @@ -5574,9 +5575,9 @@ static void skb_tstamp_tx_bpf(struct sk_buff *skb, struct sock *sk,
>>>                op = BPF_SOCK_OPS_TS_SCHED_OPT_CB;
>>>                break;
>>>        case SCM_TSTAMP_SND:
>>> -             if (!sw)
>>> -                     return;
>>> -             op = BPF_SOCK_OPS_TS_SW_OPT_CB;
>>> +             op = sw ? BPF_SOCK_OPS_TS_SW_OPT_CB : BPF_SOCK_OPS_TS_HW_OPT_CB;
>>> +             if (!sw && hwtstamps)
>>> +                     *skb_hwtstamps(skb) = *hwtstamps;
>> Isn't this called by drivers that have actually set skb_hwtstamps?
> Oops, somehow my mind has gone blank 🙁 Will remove it. Thanks for
> correcting me!

I just noticed I missed this thread when reviewing v9.

I looked at a few drivers, e.g. mlx5e_consume_skb(). They do not necessarily 
set skb_hwtstamps(skb) before calling skb_tstamp_tx(). __skb_tstamp_tx() also 
sets skb_hwtstamps(skb) after testing "if (hwtstamps)", so I think this 
assignment is still needed here?
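For what it's worth, the behavior under discussion (copying the driver-supplied timestamps into the skb only for a hardware completion where they were passed as an argument) can be modeled in a few lines. This is plain Python for illustration only, not kernel code; dicts stand in for the skb timestamp structures:

```python
def ts_snd_op(skb_hwtstamps, hwtstamps, sw):
    """Toy model of the SCM_TSTAMP_SND branch discussed above.

    'hwtstamps' is the driver-supplied argument: it may be None, or it
    may carry the timestamp when the driver did not pre-fill the skb.
    """
    op = "BPF_SOCK_OPS_TS_SW_OPT_CB" if sw else "BPF_SOCK_OPS_TS_HW_OPT_CB"
    if not sw and hwtstamps is not None:
        # mirrors *skb_hwtstamps(skb) = *hwtstamps in the patch:
        # the copy is needed because some drivers pass the timestamp
        # as an argument instead of setting it on the skb first.
        skb_hwtstamps.update(hwtstamps)
    return op
```

So a hardware completion selects the HW callback and fills in the skb's timestamps when the driver only passed them as an argument, while the software path leaves them alone.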


* Re: [PATCH bpf-next v8 08/12] bpf: support hw SCM_TSTAMP_SND of SO_TIMESTAMPING
  2025-02-10 22:39       ` Martin KaFai Lau
@ 2025-02-11  0:00         ` Jason Xing
  0 siblings, 0 replies; 66+ messages in thread
From: Jason Xing @ 2025-02-11  0:00 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: Willem de Bruijn, davem, edumazet, kuba, pabeni, dsahern, willemb,
	ast, daniel, andrii, eddyz87, song, yonghong.song, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, horms, bpf, netdev

On Tue, Feb 11, 2025 at 6:40 AM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 2/5/25 8:03 AM, Jason Xing wrote:
> >>> @@ -5574,9 +5575,9 @@ static void skb_tstamp_tx_bpf(struct sk_buff *skb, struct sock *sk,
> >>>                op = BPF_SOCK_OPS_TS_SCHED_OPT_CB;
> >>>                break;
> >>>        case SCM_TSTAMP_SND:
> >>> -             if (!sw)
> >>> -                     return;
> >>> -             op = BPF_SOCK_OPS_TS_SW_OPT_CB;
> >>> +             op = sw ? BPF_SOCK_OPS_TS_SW_OPT_CB : BPF_SOCK_OPS_TS_HW_OPT_CB;
> >>> +             if (!sw && hwtstamps)
> >>> +                     *skb_hwtstamps(skb) = *hwtstamps;
> >> Isn't this called by drivers that have actually set skb_hwtstamps?
> > Oops, somehow my mind has gone blank 🙁 Will remove it. Thanks for
> > correcting me!
>
> I just noticed I missed this thread when reviewing v9.
>
> I looked at a few drivers, e.g. the mlx5e_consume_skb(). It does not necessarily

There are indeed many drivers that behave as you said:
1. xgbe_tx_tstamp()
2. aq_ptp_tx_hwtstamp()
3. bnx2x_ptp_task
4. i40e_ptp_tx_hwtstamp
...

I suspect I checked this a long time ago and then lost track of it in
v9; after all, we've discussed this many times...

> set the skb_hwtstamps(skb) before calling skb_tstamp_tx(). The __skb_tstamp_tx()
> is also setting skb_hwtstamps(skb) after testing "if (hwtstamps)", so I think

That assignment applies to the cloned or newly allocated skb rather
than the orig_skb passed in from the driver side.

> this assignment is still needed here?

Right.

Thanks,
Jason


end of thread, other threads:[~2025-02-11  0:01 UTC | newest]

Thread overview: 66+ messages
2025-02-04 18:30 [PATCH bpf-next v8 00/12] net-timestamp: bpf extension to equip applications transparently Jason Xing
2025-02-04 18:30 ` [PATCH bpf-next v8 01/12] bpf: add support for bpf_setsockopt() Jason Xing
2025-02-05 15:22   ` Willem de Bruijn
2025-02-05 15:34     ` Jason Xing
2025-02-05 20:57       ` Martin KaFai Lau
2025-02-05 21:25       ` Willem de Bruijn
2025-02-04 18:30 ` [PATCH bpf-next v8 02/12] bpf: prepare for timestamping callbacks use Jason Xing
2025-02-04 18:30 ` [PATCH bpf-next v8 03/12] bpf: stop unsafely accessing TCP fields in bpf callbacks Jason Xing
2025-02-05 15:24   ` Willem de Bruijn
2025-02-05 15:35     ` Jason Xing
2025-02-04 18:30 ` [PATCH bpf-next v8 04/12] bpf: stop calling some sock_op BPF CALLs in new timestamping callbacks Jason Xing
2025-02-05 15:26   ` Willem de Bruijn
2025-02-05 15:50     ` Jason Xing
2025-02-04 18:30 ` [PATCH bpf-next v8 05/12] net-timestamp: prepare for isolating two modes of SO_TIMESTAMPING Jason Xing
2025-02-05  1:47   ` Jakub Kicinski
2025-02-05  2:40     ` Jason Xing
2025-02-05  3:14       ` Jakub Kicinski
2025-02-05  3:23         ` Jason Xing
2025-02-05  1:50   ` Jakub Kicinski
2025-02-05 15:34   ` Willem de Bruijn
2025-02-05 15:52     ` Jason Xing
2025-02-06  8:43     ` Jason Xing
2025-02-06 10:22       ` Jason Xing
2025-02-06 16:13       ` Willem de Bruijn
2025-02-07  0:22         ` Jason Xing
2025-02-04 18:30 ` [PATCH bpf-next v8 06/12] bpf: support SCM_TSTAMP_SCHED " Jason Xing
2025-02-05 15:36   ` Willem de Bruijn
2025-02-05 15:55     ` Jason Xing
2025-02-04 18:30 ` [PATCH bpf-next v8 07/12] bpf: support sw SCM_TSTAMP_SND " Jason Xing
2025-02-04 18:30 ` [PATCH bpf-next v8 08/12] bpf: support hw " Jason Xing
2025-02-05 15:45   ` Willem de Bruijn
2025-02-05 16:03     ` Jason Xing
2025-02-10 22:39       ` Martin KaFai Lau
2025-02-11  0:00         ` Jason Xing
2025-02-04 18:30 ` [PATCH bpf-next v8 09/12] bpf: support SCM_TSTAMP_ACK " Jason Xing
2025-02-05 15:47   ` Willem de Bruijn
2025-02-05 16:06     ` Jason Xing
2025-02-05 21:25       ` Willem de Bruijn
2025-02-04 18:30 ` [PATCH bpf-next v8 10/12] bpf: make TCP tx timestamp bpf extension work Jason Xing
2025-02-05  1:57   ` Jakub Kicinski
2025-02-05  2:15     ` Jason Xing
2025-02-05 21:57     ` Martin KaFai Lau
2025-02-06  0:12       ` Jason Xing
2025-02-06  0:42         ` Jason Xing
2025-02-06  0:47         ` Martin KaFai Lau
2025-02-06  1:05           ` Jason Xing
2025-02-06  2:39             ` Jason Xing
2025-02-06  2:56               ` Willem de Bruijn
2025-02-06  3:09                 ` Jason Xing
2025-02-06  3:25                   ` Willem de Bruijn
2025-02-06  3:41                     ` Jason Xing
2025-02-06  6:12                       ` Martin KaFai Lau
2025-02-06  6:56                         ` Jason Xing
2025-02-07  2:07                           ` Martin KaFai Lau
2025-02-07  2:18                             ` Jason Xing
2025-02-07 12:07                               ` Jason Xing
2025-02-08  2:11                                 ` Martin KaFai Lau
2025-02-08  6:53                                   ` Jason Xing
2025-02-07 13:34                             ` Jason Xing
2025-02-04 18:30 ` [PATCH bpf-next v8 11/12] bpf: add a new callback in tcp_tx_timestamp() Jason Xing
2025-02-05  5:28   ` Jason Xing
2025-02-04 18:30 ` [PATCH bpf-next v8 12/12] selftests/bpf: add simple bpf tests in the tx path for timestamping feature Jason Xing
2025-02-05 15:54   ` Willem de Bruijn
2025-02-05 16:08     ` Jason Xing
2025-02-06  1:28       ` Martin KaFai Lau
2025-02-06  2:14         ` Jason Xing
