[PATCH net-next] net_tstamp: add SOF_TIMESTAMPING_OPT_ID

netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH net-next] net_tstamp: add SOF_TIMESTAMPING_OPT_ID_TCP
@ 2022-12-05 23:09 Willem de Bruijn
  2022-12-06  0:34 ` Soheil Hassas Yeganeh
  2022-12-06 20:22 ` Jakub Kicinski
  0 siblings, 2 replies; 6+ messages in thread
From: Willem de Bruijn @ 2022-12-05 23:09 UTC (permalink / raw)
  To: netdev; +Cc: davem, kuba, edumazet, pabeni, soheil, Willem de Bruijn

From: Willem de Bruijn <willemb@google.com>

Add an option to initialize SOF_TIMESTAMPING_OPT_ID for TCP from
write_seq sockets instead of snd_una.

Intuitively the contract is that the counter is zero after the
setsockopt, so that the next write N results in a notification for
last byte N - 1.

On idle sockets snd_una == write_seq so this holds for both. But on
sockets with data in transmission, snd_una depends on the ACK response
from the peer. A process cannot learn this in a race free manner
(ioctl SIOCOUTQ is one racy approach).

write_seq is a better starting point because based on the seqno of
data written by the process only.

But the existing behavior may already be relied upon. So make the new
behavior optional behind a flag.

The new timestamp flag necessitates increasing sk_tsflags to 32 bits.
Move the field in struct sock to avoid growing the socket (for some
common CONFIG variants). The UAPI interface so_timestamping.flags is
already int, so 32 bits wide.

Reported-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Willem de Bruijn <willemb@google.com>

---

Alternative solutions are

* make the change unconditionally: a one line change.
* make the condition a (per netns) sysctl instead of flag
* make SOF_TIMESTAMPING_OPT_ID_TCP not a modifier of, but alternative
  to SOF_TIMESTAMPING_OPT_ID. That requires also updating all existing
  code that now tests OPT_ID to test a new OPT_ID_MASK.

Weighing the variants, this seemed the best option to me.
---
 Documentation/networking/timestamping.rst | 19 +++++++++++++++++++
 include/net/sock.h                        |  6 +++---
 include/uapi/linux/net_tstamp.h           |  3 ++-
 net/core/sock.c                           |  9 ++++++++-
 net/ethtool/common.c                      |  1 +
 5 files changed, 33 insertions(+), 5 deletions(-)

diff --git a/Documentation/networking/timestamping.rst b/Documentation/networking/timestamping.rst
index be4eb1242057..578f24731be5 100644
--- a/Documentation/networking/timestamping.rst
+++ b/Documentation/networking/timestamping.rst
@@ -192,6 +192,25 @@ SOF_TIMESTAMPING_OPT_ID:
   among all possibly concurrently outstanding timestamp requests for
   that socket.
 
+SOF_TIMESTAMPING_OPT_ID_TCP:
+  Pass this modifier along with SOF_TIMESTAMPING_OPT_ID for new TCP
+  timestamping applications. SOF_TIMESTAMPING_OPT_ID defines how the
+  counter increments for stream sockets, but its starting point is
+  not entirely trivial. This modifier option changes that point.
+
+  A reasonable expectation is that the counter is reset to zero with
+  the system call, so that a subsequent write() of N bytes generates
+  a timestamp with counter N-1. SOF_TIMESTAMPING_OPT_ID_TCP
+  implements this behavior under all conditions.
+
+  SOF_TIMESTAMPING_OPT_ID without modifier often reports the same,
+  especially when the socket option is set when no data is in
+  transmission. If data is being transmitted, it may be off by the
+  length of the output queue (SIOCOUTQ) due to being based on snd_una
+  rather than write_seq. That offset depends on factors outside of
+  process control, including network RTT and peer response time. The
+  difference is subtle and unlikely to be noticed when confiugred at
+  initial socket creation. But .._OPT_ID behavior is more predictable.
 
 SOF_TIMESTAMPING_OPT_CMSG:
   Support recv() cmsg for all timestamped packets. Control messages
diff --git a/include/net/sock.h b/include/net/sock.h
index 6d207e7c4ad0..ecea3dcc2217 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -503,10 +503,10 @@ struct sock {
 #if BITS_PER_LONG==32
 	seqlock_t		sk_stamp_seq;
 #endif
-	u16			sk_tsflags;
-	u8			sk_shutdown;
 	atomic_t		sk_tskey;
 	atomic_t		sk_zckey;
+	u32			sk_tsflags;
+	u8			sk_shutdown;
 
 	u8			sk_clockid;
 	u8			sk_txtime_deadline_mode : 1,
@@ -1899,7 +1899,7 @@ static inline void sock_replace_proto(struct sock *sk, struct proto *proto)
 struct sockcm_cookie {
 	u64 transmit_time;
 	u32 mark;
-	u16 tsflags;
+	u32 tsflags;
 };
 
 static inline void sockcm_init(struct sockcm_cookie *sockc,
diff --git a/include/uapi/linux/net_tstamp.h b/include/uapi/linux/net_tstamp.h
index 55501e5e7ac8..a2c66b3d7f0f 100644
--- a/include/uapi/linux/net_tstamp.h
+++ b/include/uapi/linux/net_tstamp.h
@@ -31,8 +31,9 @@ enum {
 	SOF_TIMESTAMPING_OPT_PKTINFO = (1<<13),
 	SOF_TIMESTAMPING_OPT_TX_SWHW = (1<<14),
 	SOF_TIMESTAMPING_BIND_PHC = (1 << 15),
+	SOF_TIMESTAMPING_OPT_ID_TCP = (1 << 16),
 
-	SOF_TIMESTAMPING_LAST = SOF_TIMESTAMPING_BIND_PHC,
+	SOF_TIMESTAMPING_LAST = SOF_TIMESTAMPING_OPT_ID_TCP,
 	SOF_TIMESTAMPING_MASK = (SOF_TIMESTAMPING_LAST - 1) |
 				 SOF_TIMESTAMPING_LAST
 };
diff --git a/net/core/sock.c b/net/core/sock.c
index 4571914a4aa8..b0ab841e0aed 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -901,13 +901,20 @@ int sock_set_timestamping(struct sock *sk, int optname,
 	if (val & ~SOF_TIMESTAMPING_MASK)
 		return -EINVAL;
 
+	if (val & SOF_TIMESTAMPING_OPT_ID_TCP &&
+	    !(val & SOF_TIMESTAMPING_OPT_ID))
+		return -EINVAL;
+
 	if (val & SOF_TIMESTAMPING_OPT_ID &&
 	    !(sk->sk_tsflags & SOF_TIMESTAMPING_OPT_ID)) {
 		if (sk_is_tcp(sk)) {
 			if ((1 << sk->sk_state) &
 			    (TCPF_CLOSE | TCPF_LISTEN))
 				return -EINVAL;
-			atomic_set(&sk->sk_tskey, tcp_sk(sk)->snd_una);
+			if (val & SOF_TIMESTAMPING_OPT_ID_TCP)
+				atomic_set(&sk->sk_tskey, tcp_sk(sk)->write_seq);
+			else
+				atomic_set(&sk->sk_tskey, tcp_sk(sk)->snd_una);
 		} else {
 			atomic_set(&sk->sk_tskey, 0);
 		}
diff --git a/net/ethtool/common.c b/net/ethtool/common.c
index 21cfe8557205..6f399afc2ff2 100644
--- a/net/ethtool/common.c
+++ b/net/ethtool/common.c
@@ -417,6 +417,7 @@ const char sof_timestamping_names[][ETH_GSTRING_LEN] = {
 	[const_ilog2(SOF_TIMESTAMPING_OPT_PKTINFO)]  = "option-pktinfo",
 	[const_ilog2(SOF_TIMESTAMPING_OPT_TX_SWHW)]  = "option-tx-swhw",
 	[const_ilog2(SOF_TIMESTAMPING_BIND_PHC)]     = "bind-phc",
+	[const_ilog2(SOF_TIMESTAMPING_OPT_ID_TCP)]   = "option-id-tcp",
 };
 static_assert(ARRAY_SIZE(sof_timestamping_names) == __SOF_TIMESTAMPING_CNT);
 
-- 
2.39.0.rc0.267.gcb52ba06e7-goog


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH net-next] net_tstamp: add SOF_TIMESTAMPING_OPT_ID_TCP
  2022-12-05 23:09 [PATCH net-next] net_tstamp: add SOF_TIMESTAMPING_OPT_ID_TCP Willem de Bruijn
@ 2022-12-06  0:34 ` Soheil Hassas Yeganeh
  2022-12-06 20:22 ` Jakub Kicinski
  1 sibling, 0 replies; 6+ messages in thread
From: Soheil Hassas Yeganeh @ 2022-12-06  0:34 UTC (permalink / raw)
  To: Willem de Bruijn; +Cc: netdev, davem, kuba, edumazet, pabeni, Willem de Bruijn

On Mon, Dec 5, 2022 at 6:09 PM Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
>
> From: Willem de Bruijn <willemb@google.com>
>
> Add an option to initialize SOF_TIMESTAMPING_OPT_ID for TCP from
> write_seq sockets instead of snd_una.
>
> Intuitively the contract is that the counter is zero after the
> setsockopt, so that the next write N results in a notification for
> last byte N - 1.
>
> On idle sockets snd_una == write_seq so this holds for both. But on
> sockets with data in transmission, snd_una depends on the ACK response
> from the peer. A process cannot learn this in a race free manner
> (ioctl SIOCOUTQ is one racy approach).
>
> write_seq is a better starting point because based on the seqno of
> data written by the process only.
>
> But the existing behavior may already be relied upon. So make the new
> behavior optional behind a flag.
>
> The new timestamp flag necessitates increasing sk_tsflags to 32 bits.
> Move the field in struct sock to avoid growing the socket (for some
> common CONFIG variants). The UAPI interface so_timestamping.flags is
> already int, so 32 bits wide.
>
> Reported-by: Jakub Kicinski <kuba@kernel.org>
> Signed-off-by: Willem de Bruijn <willemb@google.com>

Acked-by: Soheil Hassas Yeganeh <soheil@google.com>

Thanks for the fix!

> ---
>
> Alternative solutions are
>
> * make the change unconditionally: a one line change.
> * make the condition a (per netns) sysctl instead of flag
> * make SOF_TIMESTAMPING_OPT_ID_TCP not a modifier of, but alternative
>   to SOF_TIMESTAMPING_OPT_ID. That requires also updating all existing
>   code that now tests OPT_ID to test a new OPT_ID_MASK.
>
> Weighing the variants, this seemed the best option to me.
> ---
>  Documentation/networking/timestamping.rst | 19 +++++++++++++++++++
>  include/net/sock.h                        |  6 +++---
>  include/uapi/linux/net_tstamp.h           |  3 ++-
>  net/core/sock.c                           |  9 ++++++++-
>  net/ethtool/common.c                      |  1 +
>  5 files changed, 33 insertions(+), 5 deletions(-)
>
> diff --git a/Documentation/networking/timestamping.rst b/Documentation/networking/timestamping.rst
> index be4eb1242057..578f24731be5 100644
> --- a/Documentation/networking/timestamping.rst
> +++ b/Documentation/networking/timestamping.rst
> @@ -192,6 +192,25 @@ SOF_TIMESTAMPING_OPT_ID:
>    among all possibly concurrently outstanding timestamp requests for
>    that socket.
>
> +SOF_TIMESTAMPING_OPT_ID_TCP:
> +  Pass this modifier along with SOF_TIMESTAMPING_OPT_ID for new TCP
> +  timestamping applications. SOF_TIMESTAMPING_OPT_ID defines how the
> +  counter increments for stream sockets, but its starting point is
> +  not entirely trivial. This modifier option changes that point.
> +
> +  A reasonable expectation is that the counter is reset to zero with
> +  the system call, so that a subsequent write() of N bytes generates
> +  a timestamp with counter N-1. SOF_TIMESTAMPING_OPT_ID_TCP
> +  implements this behavior under all conditions.
> +
> +  SOF_TIMESTAMPING_OPT_ID without modifier often reports the same,
> +  especially when the socket option is set when no data is in
> +  transmission. If data is being transmitted, it may be off by the
> +  length of the output queue (SIOCOUTQ) due to being based on snd_una
> +  rather than write_seq. That offset depends on factors outside of
> +  process control, including network RTT and peer response time. The
> +  difference is subtle and unlikely to be noticed when confiugred at
> +  initial socket creation. But .._OPT_ID behavior is more predictable.
>
>  SOF_TIMESTAMPING_OPT_CMSG:
>    Support recv() cmsg for all timestamped packets. Control messages
> diff --git a/include/net/sock.h b/include/net/sock.h
> index 6d207e7c4ad0..ecea3dcc2217 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -503,10 +503,10 @@ struct sock {
>  #if BITS_PER_LONG==32
>         seqlock_t               sk_stamp_seq;
>  #endif
> -       u16                     sk_tsflags;
> -       u8                      sk_shutdown;
>         atomic_t                sk_tskey;
>         atomic_t                sk_zckey;
> +       u32                     sk_tsflags;
> +       u8                      sk_shutdown;
>
>         u8                      sk_clockid;
>         u8                      sk_txtime_deadline_mode : 1,
> @@ -1899,7 +1899,7 @@ static inline void sock_replace_proto(struct sock *sk, struct proto *proto)
>  struct sockcm_cookie {
>         u64 transmit_time;
>         u32 mark;
> -       u16 tsflags;
> +       u32 tsflags;
>  };
>
>  static inline void sockcm_init(struct sockcm_cookie *sockc,
> diff --git a/include/uapi/linux/net_tstamp.h b/include/uapi/linux/net_tstamp.h
> index 55501e5e7ac8..a2c66b3d7f0f 100644
> --- a/include/uapi/linux/net_tstamp.h
> +++ b/include/uapi/linux/net_tstamp.h
> @@ -31,8 +31,9 @@ enum {
>         SOF_TIMESTAMPING_OPT_PKTINFO = (1<<13),
>         SOF_TIMESTAMPING_OPT_TX_SWHW = (1<<14),
>         SOF_TIMESTAMPING_BIND_PHC = (1 << 15),
> +       SOF_TIMESTAMPING_OPT_ID_TCP = (1 << 16),
>
> -       SOF_TIMESTAMPING_LAST = SOF_TIMESTAMPING_BIND_PHC,
> +       SOF_TIMESTAMPING_LAST = SOF_TIMESTAMPING_OPT_ID_TCP,
>         SOF_TIMESTAMPING_MASK = (SOF_TIMESTAMPING_LAST - 1) |
>                                  SOF_TIMESTAMPING_LAST
>  };
> diff --git a/net/core/sock.c b/net/core/sock.c
> index 4571914a4aa8..b0ab841e0aed 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -901,13 +901,20 @@ int sock_set_timestamping(struct sock *sk, int optname,
>         if (val & ~SOF_TIMESTAMPING_MASK)
>                 return -EINVAL;
>
> +       if (val & SOF_TIMESTAMPING_OPT_ID_TCP &&
> +           !(val & SOF_TIMESTAMPING_OPT_ID))
> +               return -EINVAL;
> +
>         if (val & SOF_TIMESTAMPING_OPT_ID &&
>             !(sk->sk_tsflags & SOF_TIMESTAMPING_OPT_ID)) {
>                 if (sk_is_tcp(sk)) {
>                         if ((1 << sk->sk_state) &
>                             (TCPF_CLOSE | TCPF_LISTEN))
>                                 return -EINVAL;
> -                       atomic_set(&sk->sk_tskey, tcp_sk(sk)->snd_una);
> +                       if (val & SOF_TIMESTAMPING_OPT_ID_TCP)
> +                               atomic_set(&sk->sk_tskey, tcp_sk(sk)->write_seq);
> +                       else
> +                               atomic_set(&sk->sk_tskey, tcp_sk(sk)->snd_una);
>                 } else {
>                         atomic_set(&sk->sk_tskey, 0);
>                 }
> diff --git a/net/ethtool/common.c b/net/ethtool/common.c
> index 21cfe8557205..6f399afc2ff2 100644
> --- a/net/ethtool/common.c
> +++ b/net/ethtool/common.c
> @@ -417,6 +417,7 @@ const char sof_timestamping_names[][ETH_GSTRING_LEN] = {
>         [const_ilog2(SOF_TIMESTAMPING_OPT_PKTINFO)]  = "option-pktinfo",
>         [const_ilog2(SOF_TIMESTAMPING_OPT_TX_SWHW)]  = "option-tx-swhw",
>         [const_ilog2(SOF_TIMESTAMPING_BIND_PHC)]     = "bind-phc",
> +       [const_ilog2(SOF_TIMESTAMPING_OPT_ID_TCP)]   = "option-id-tcp",
>  };
>  static_assert(ARRAY_SIZE(sof_timestamping_names) == __SOF_TIMESTAMPING_CNT);
>
> --
> 2.39.0.rc0.267.gcb52ba06e7-goog
>

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH net-next] net_tstamp: add SOF_TIMESTAMPING_OPT_ID_TCP
  2022-12-05 23:09 [PATCH net-next] net_tstamp: add SOF_TIMESTAMPING_OPT_ID_TCP Willem de Bruijn
  2022-12-06  0:34 ` Soheil Hassas Yeganeh
@ 2022-12-06 20:22 ` Jakub Kicinski
  2022-12-06 20:46   ` Willem de Bruijn
  1 sibling, 1 reply; 6+ messages in thread
From: Jakub Kicinski @ 2022-12-06 20:22 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: netdev, davem, edumazet, pabeni, soheil, Willem de Bruijn

On Mon,  5 Dec 2022 18:09:25 -0500 Willem de Bruijn wrote:
> Add an option to initialize SOF_TIMESTAMPING_OPT_ID for TCP from
> write_seq sockets instead of snd_una.
> 
> Intuitively the contract is that the counter is zero after the
> setsockopt, so that the next write N results in a notification for
> last byte N - 1.
> 
> On idle sockets snd_una == write_seq so this holds for both. But on
> sockets with data in transmission, snd_una depends on the ACK response
> from the peer. A process cannot learn this in a race free manner
> (ioctl SIOCOUTQ is one racy approach).

We can't just copy back the value of 

	tcp_sk(sk)->snd_una - tcp_sk(sk)->write_seq

to the user if the input of setsockopt is large enough (ie. extend the
struct, if len >= sizeof(new struct) -> user is asking to get this?
Or even add a bit somewhere that requests a copy back?

Highly unlikely to break anything, I reckon? But whether setsockopt()
can copy back is not 100% clear to me...

> write_seq is a better starting point because based on the seqno of
> data written by the process only.
> 
> But the existing behavior may already be relied upon. So make the new
> behavior optional behind a flag.
> 
> The new timestamp flag necessitates increasing sk_tsflags to 32 bits.
> Move the field in struct sock to avoid growing the socket (for some
> common CONFIG variants). The UAPI interface so_timestamping.flags is
> already int, so 32 bits wide.
> 
> Reported-by: Jakub Kicinski <kuba@kernel.org>

Reported-by: Sotirios Delimanolis <sotodel@meta.com>

I'm just a bad human information router.

> Signed-off-by: Willem de Bruijn <willemb@google.com>
> 
> ---
> 
> Alternative solutions are
> 
> * make the change unconditionally: a one line change.
> * make the condition a (per netns) sysctl instead of flag
> * make SOF_TIMESTAMPING_OPT_ID_TCP not a modifier of, but alternative
>   to SOF_TIMESTAMPING_OPT_ID. That requires also updating all existing
>   code that now tests OPT_ID to test a new OPT_ID_MASK.

 * copy back the SIOCOUTQ

;)

> Weighing the variants, this seemed the best option to me.
> ---
>  Documentation/networking/timestamping.rst | 19 +++++++++++++++++++
>  include/net/sock.h                        |  6 +++---
>  include/uapi/linux/net_tstamp.h           |  3 ++-
>  net/core/sock.c                           |  9 ++++++++-
>  net/ethtool/common.c                      |  1 +
>  5 files changed, 33 insertions(+), 5 deletions(-)
> 
> diff --git a/Documentation/networking/timestamping.rst b/Documentation/networking/timestamping.rst
> index be4eb1242057..578f24731be5 100644
> --- a/Documentation/networking/timestamping.rst
> +++ b/Documentation/networking/timestamping.rst
> @@ -192,6 +192,25 @@ SOF_TIMESTAMPING_OPT_ID:
>    among all possibly concurrently outstanding timestamp requests for
>    that socket.
>  
> +SOF_TIMESTAMPING_OPT_ID_TCP:
> +  Pass this modifier along with SOF_TIMESTAMPING_OPT_ID for new TCP
> +  timestamping applications. SOF_TIMESTAMPING_OPT_ID defines how the
> +  counter increments for stream sockets, but its starting point is
> +  not entirely trivial. This modifier option changes that point.
> +
> +  A reasonable expectation is that the counter is reset to zero with
> +  the system call, so that a subsequent write() of N bytes generates
> +  a timestamp with counter N-1. SOF_TIMESTAMPING_OPT_ID_TCP
> +  implements this behavior under all conditions.
> +
> +  SOF_TIMESTAMPING_OPT_ID without modifier often reports the same,
> +  especially when the socket option is set when no data is in
> +  transmission. If data is being transmitted, it may be off by the
> +  length of the output queue (SIOCOUTQ) due to being based on snd_una
> +  rather than write_seq. That offset depends on factors outside of
> +  process control, including network RTT and peer response time. The
> +  difference is subtle and unlikely to be noticed when confiugred at
> +  initial socket creation. But .._OPT_ID behavior is more predictable.

I reckon this needs to be more informative. Say how exactly they differ
(written vs queued for transmission). And I'd add to
SOF_TIMESTAMPING_OPT_ID docs a note to "see also .._OPT_ID_TCP version".

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH net-next] net_tstamp: add SOF_TIMESTAMPING_OPT_ID_TCP
  2022-12-06 20:22 ` Jakub Kicinski
@ 2022-12-06 20:46   ` Willem de Bruijn
  2022-12-06 20:58     ` Jakub Kicinski
  0 siblings, 1 reply; 6+ messages in thread
From: Willem de Bruijn @ 2022-12-06 20:46 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: Willem de Bruijn, netdev, davem, edumazet, pabeni, soheil

On Tue, Dec 6, 2022 at 3:22 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Mon,  5 Dec 2022 18:09:25 -0500 Willem de Bruijn wrote:
> > Add an option to initialize SOF_TIMESTAMPING_OPT_ID for TCP from
> > write_seq sockets instead of snd_una.
> >
> > Intuitively the contract is that the counter is zero after the
> > setsockopt, so that the next write N results in a notification for
> > last byte N - 1.
> >
> > On idle sockets snd_una == write_seq so this holds for both. But on
> > sockets with data in transmission, snd_una depends on the ACK response
> > from the peer. A process cannot learn this in a race free manner
> > (ioctl SIOCOUTQ is one racy approach).
>
> We can't just copy back the value of
>
>         tcp_sk(sk)->snd_una - tcp_sk(sk)->write_seq
>
> to the user if the input of setsockopt is large enough (ie. extend the
> struct, if len >= sizeof(new struct) -> user is asking to get this?
> Or even add a bit somewhere that requests a copy back?

We could, but indeed then we first need a way to signal that the
kernel is new enough to actually write something meaningful back that
is safe to read.

And if we change the kernel API and applications, I find this a
somewhat hacky approach: why program the slightly wrong thing and
return the offset to patch it up in userspace, if we can just program
the right thing to begin with?

> Highly unlikely to break anything, I reckon? But whether setsockopt()
> can copy back is not 100% clear to me...
>
> > write_seq is a better starting point because based on the seqno of
> > data written by the process only.
> >
> > But the existing behavior may already be relied upon. So make the new
> > behavior optional behind a flag.
> >
> > The new timestamp flag necessitates increasing sk_tsflags to 32 bits.
> > Move the field in struct sock to avoid growing the socket (for some
> > common CONFIG variants). The UAPI interface so_timestamping.flags is
> > already int, so 32 bits wide.
> >
> > Reported-by: Jakub Kicinski <kuba@kernel.org>
>
> Reported-by: Sotirios Delimanolis <sotodel@meta.com>
>
> I'm just a bad human information router.
>
> > Signed-off-by: Willem de Bruijn <willemb@google.com>
> >
> > ---
> >
> > Alternative solutions are
> >
> > * make the change unconditionally: a one line change.
> > * make the condition a (per netns) sysctl instead of flag
> > * make SOF_TIMESTAMPING_OPT_ID_TCP not a modifier of, but alternative
> >   to SOF_TIMESTAMPING_OPT_ID. That requires also updating all existing
> >   code that now tests OPT_ID to test a new OPT_ID_MASK.
>
>  * copy back the SIOCOUTQ
>
> ;)
>
> > Weighing the variants, this seemed the best option to me.
> > ---
> >  Documentation/networking/timestamping.rst | 19 +++++++++++++++++++
> >  include/net/sock.h                        |  6 +++---
> >  include/uapi/linux/net_tstamp.h           |  3 ++-
> >  net/core/sock.c                           |  9 ++++++++-
> >  net/ethtool/common.c                      |  1 +
> >  5 files changed, 33 insertions(+), 5 deletions(-)
> >
> > diff --git a/Documentation/networking/timestamping.rst b/Documentation/networking/timestamping.rst
> > index be4eb1242057..578f24731be5 100644
> > --- a/Documentation/networking/timestamping.rst
> > +++ b/Documentation/networking/timestamping.rst
> > @@ -192,6 +192,25 @@ SOF_TIMESTAMPING_OPT_ID:
> >    among all possibly concurrently outstanding timestamp requests for
> >    that socket.
> >
> > +SOF_TIMESTAMPING_OPT_ID_TCP:
> > +  Pass this modifier along with SOF_TIMESTAMPING_OPT_ID for new TCP
> > +  timestamping applications. SOF_TIMESTAMPING_OPT_ID defines how the
> > +  counter increments for stream sockets, but its starting point is
> > +  not entirely trivial. This modifier option changes that point.
> > +
> > +  A reasonable expectation is that the counter is reset to zero with
> > +  the system call, so that a subsequent write() of N bytes generates
> > +  a timestamp with counter N-1. SOF_TIMESTAMPING_OPT_ID_TCP
> > +  implements this behavior under all conditions.
> > +
> > +  SOF_TIMESTAMPING_OPT_ID without modifier often reports the same,
> > +  especially when the socket option is set when no data is in
> > +  transmission. If data is being transmitted, it may be off by the
> > +  length of the output queue (SIOCOUTQ) due to being based on snd_una
> > +  rather than write_seq. That offset depends on factors outside of
> > +  process control, including network RTT and peer response time. The
> > +  difference is subtle and unlikely to be noticed when confiugred at

note to self: confiugred -> configured

> > +  initial socket creation. But .._OPT_ID behavior is more predictable.
>
> I reckon this needs to be more informative. Say how exactly they differ
> (written vs queued for transmission). And I'd add to
> SOF_TIMESTAMPING_OPT_ID docs a note to "see also .._OPT_ID_TCP version".

Will do. Assuming we're good with this approach.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH net-next] net_tstamp: add SOF_TIMESTAMPING_OPT_ID_TCP
  2022-12-06 20:46   ` Willem de Bruijn
@ 2022-12-06 20:58     ` Jakub Kicinski
  2022-12-06 21:21       ` Willem de Bruijn
  0 siblings, 1 reply; 6+ messages in thread
From: Jakub Kicinski @ 2022-12-06 20:58 UTC (permalink / raw)
  To: Willem de Bruijn; +Cc: netdev, davem, edumazet, pabeni, soheil

On Tue, 6 Dec 2022 15:46:25 -0500 Willem de Bruijn wrote:
> > We can't just copy back the value of
> >
> >         tcp_sk(sk)->snd_una - tcp_sk(sk)->write_seq
> >
> > to the user if the input of setsockopt is large enough (ie. extend the
> > struct, if len >= sizeof(new struct) -> user is asking to get this?
> > Or even add a bit somewhere that requests a copy back?  
> 
> We could, but indeed then we first need a way to signal that the
> kernel is new enough to actually write something meaningful back that
> is safe to read.

It should be sufficient to init the memory to -1. 
(I guess I'm not helping my own "this is less hacky" argument? :)

> And if we change the kernel API and applications, I find this a
> somewhat hacky approach: why program the slightly wrong thing and
> return the offset to patch it up in userspace, if we can just program
> the right thing to begin with?

Ah, so you'd also switch all your apps to use this new bit?

What wasn't clear to me whether this is a
 1 - we clearly did the wrong thing
or
 2 - there is a legit use case for un-packetized(?) data not being
     counted

In case of (1) we should make it clearer that the new bit is in fact
a "fixed" version of the functionality.
For (2) we can view this as an extension of the existing functionality
so combining in the same bit with write back seems natural (and TBH 
I like the single syscall probing path more than try-one-then-the-other,
but that's 100% subjective).

Anyway, don't wanna waste too much of your time. If you prefer to keep
as is the doc change is good enough for me.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH net-next] net_tstamp: add SOF_TIMESTAMPING_OPT_ID_TCP
  2022-12-06 20:58     ` Jakub Kicinski
@ 2022-12-06 21:21       ` Willem de Bruijn
  0 siblings, 0 replies; 6+ messages in thread
From: Willem de Bruijn @ 2022-12-06 21:21 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: Willem de Bruijn, netdev, davem, edumazet, pabeni, soheil

On Tue, Dec 6, 2022 at 3:58 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Tue, 6 Dec 2022 15:46:25 -0500 Willem de Bruijn wrote:
> > > We can't just copy back the value of
> > >
> > >         tcp_sk(sk)->snd_una - tcp_sk(sk)->write_seq
> > >
> > > to the user if the input of setsockopt is large enough (ie. extend the
> > > struct, if len >= sizeof(new struct) -> user is asking to get this?
> > > Or even add a bit somewhere that requests a copy back?
> >
> > We could, but indeed then we first need a way to signal that the
> > kernel is new enough to actually write something meaningful back that
> > is safe to read.
>
> It should be sufficient to init the memory to -1.
> (I guess I'm not helping my own "this is less hacky" argument? :)
>
> > And if we change the kernel API and applications, I find this a
> > somewhat hacky approach: why program the slightly wrong thing and
> > return the offset to patch it up in userspace, if we can just program
> > the right thing to begin with?
>
> Ah, so you'd also switch all your apps to use this new bit?
>
> What wasn't clear to me whether this is a
>  1 - we clearly did the wrong thing
> or
>  2 - there is a legit use case for un-packetized(?) data not being
>      counted
>
> In case of (1) we should make it clearer that the new bit is in fact
> a "fixed" version of the functionality.
> For (2) we can view this as an extension of the existing functionality
> so combining in the same bit with write back seems natural (and TBH
> I like the single syscall probing path more than try-one-then-the-other,
> but that's 100% subjective).
>
> Anyway, don't wanna waste too much of your time. If you prefer to keep
> as is the doc change is good enough for me.

It's definitely 1. I'll be more explicit in the documentation and
commit message about that.

I would have just made the one line 's/snd_una/write_seq/' change if I
could be certain that no existing code relies on the current behavior.
But I already came across it in one test and production. It's too
risky.

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2022-12-06 21:22 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2022-12-05 23:09 [PATCH net-next] net_tstamp: add SOF_TIMESTAMPING_OPT_ID_TCP Willem de Bruijn
2022-12-06  0:34 ` Soheil Hassas Yeganeh
2022-12-06 20:22 ` Jakub Kicinski
2022-12-06 20:46   ` Willem de Bruijn
2022-12-06 20:58     ` Jakub Kicinski
2022-12-06 21:21       ` Willem de Bruijn

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).