* [PATCH net-next 3/4] Documentation: net: net.core.txrehash is not specific to listening sockets
2023-04-27 13:45 Antoine Tenart
@ 2023-04-27 13:45 ` Antoine Tenart
0 siblings, 0 replies; 21+ messages in thread
From: Antoine Tenart @ 2023-04-27 13:45 UTC (permalink / raw)
To: davem, kuba, pabeni, edumazet; +Cc: Antoine Tenart, netdev
The net.core.txrehash documentation mentions this knob is for listening
sockets only, while sk_rethink_txhash can be called on SYN and RTO
retransmits on all TCP sockets.
Remove the listening socket part.
Signed-off-by: Antoine Tenart <atenart@kernel.org>
---
Documentation/admin-guide/sysctl/net.rst | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/Documentation/admin-guide/sysctl/net.rst b/Documentation/admin-guide/sysctl/net.rst
index 466c560b0c30..4877563241f3 100644
--- a/Documentation/admin-guide/sysctl/net.rst
+++ b/Documentation/admin-guide/sysctl/net.rst
@@ -386,8 +386,8 @@ Default : 0 (for compatibility reasons)
txrehash
--------
-Controls default hash rethink behaviour on listening socket when SO_TXREHASH
-option is set to SOCK_TXREHASH_DEFAULT (i. e. not overridden by setsockopt).
+Controls default hash rethink behaviour on socket when SO_TXREHASH option is set
+to SOCK_TXREHASH_DEFAULT (i. e. not overridden by setsockopt).
If set to 1 (default), hash rethink is performed on listening socket.
If set to 0, hash rethink is not performed.
--
2.40.0
^ permalink raw reply related [flat|nested] 21+ messages in thread
* [PATCH net-next 0/4] net: tcp: make txhash use consistent for IPv4
@ 2023-05-11 9:34 Antoine Tenart
2023-05-11 9:34 ` [PATCH net-next 1/4] net: tcp: make the txhash available in TIME_WAIT sockets for IPv4 too Antoine Tenart
` (5 more replies)
0 siblings, 6 replies; 21+ messages in thread
From: Antoine Tenart @ 2023-05-11 9:34 UTC (permalink / raw)
To: davem, kuba, pabeni, edumazet; +Cc: Antoine Tenart, netdev
Hello,
Series is divided in two parts. First two commits make the txhash (used
for the skb hash in TCP) to be consistent for all IPv4/TCP packets (IPv6
doesn't have the same issue). Last two commits improve doc/comment
hash-related parts.
One example is when using OvS with dp_hash, which uses skb->hash, to
select a path. We'd like packets from the same flow to be consistent, as
well as the hash being stable over time when using net.core.txrehash=0.
Same applies for kernel ECMP which also can use skb->hash.
IMHO the series makes sense in net-next, but we could argue (some)
commits be seen as fixes and I can resend if necessary.
Thanks!
Antoine
Antoine Tenart (4):
net: tcp: make the txhash available in TIME_WAIT sockets for IPv4 too
net: ipv4: use consistent txhash in TIME_WAIT and SYN_RECV
Documentation: net: net.core.txrehash is not specific to listening
sockets
net: skbuff: fix l4_hash comment
Documentation/admin-guide/sysctl/net.rst | 4 ++--
include/linux/skbuff.h | 4 ++--
include/net/ip.h | 2 +-
net/ipv4/ip_output.c | 4 +++-
net/ipv4/tcp_ipv4.c | 14 +++++++++-----
net/ipv4/tcp_minisocks.c | 2 +-
6 files changed, 18 insertions(+), 12 deletions(-)
--
2.40.1
^ permalink raw reply [flat|nested] 21+ messages in thread
* [PATCH net-next 1/4] net: tcp: make the txhash available in TIME_WAIT sockets for IPv4 too
2023-05-11 9:34 [PATCH net-next 0/4] net: tcp: make txhash use consistent for IPv4 Antoine Tenart
@ 2023-05-11 9:34 ` Antoine Tenart
2023-05-11 9:34 ` [PATCH net-next 2/4] net: ipv4: use consistent txhash in TIME_WAIT and SYN_RECV Antoine Tenart
` (4 subsequent siblings)
5 siblings, 0 replies; 21+ messages in thread
From: Antoine Tenart @ 2023-05-11 9:34 UTC (permalink / raw)
To: davem, kuba, pabeni, edumazet; +Cc: Antoine Tenart, netdev
Commit c67b85558ff2 ("ipv6: tcp: send consistent autoflowlabel in
TIME_WAIT state") made the socket txhash also available in TIME_WAIT
sockets but for IPv6 only. Make it available for IPv4 too as we'll use
it in later commits.
Signed-off-by: Antoine Tenart <atenart@kernel.org>
---
net/ipv4/tcp_minisocks.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index dac0d62120e6..04fc328727e6 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -303,6 +303,7 @@ void tcp_time_wait(struct sock *sk, int state, int timeo)
tcptw->tw_ts_offset = tp->tsoffset;
tcptw->tw_last_oow_ack_time = 0;
tcptw->tw_tx_delay = tp->tcp_tx_delay;
+ tw->tw_txhash = sk->sk_txhash;
#if IS_ENABLED(CONFIG_IPV6)
if (tw->tw_family == PF_INET6) {
struct ipv6_pinfo *np = inet6_sk(sk);
@@ -311,7 +312,6 @@ void tcp_time_wait(struct sock *sk, int state, int timeo)
tw->tw_v6_rcv_saddr = sk->sk_v6_rcv_saddr;
tw->tw_tclass = np->tclass;
tw->tw_flowlabel = be32_to_cpu(np->flow_label & IPV6_FLOWLABEL_MASK);
- tw->tw_txhash = sk->sk_txhash;
tw->tw_ipv6only = sk->sk_ipv6only;
}
#endif
--
2.40.1
^ permalink raw reply related [flat|nested] 21+ messages in thread
* [PATCH net-next 2/4] net: ipv4: use consistent txhash in TIME_WAIT and SYN_RECV
2023-05-11 9:34 [PATCH net-next 0/4] net: tcp: make txhash use consistent for IPv4 Antoine Tenart
2023-05-11 9:34 ` [PATCH net-next 1/4] net: tcp: make the txhash available in TIME_WAIT sockets for IPv4 too Antoine Tenart
@ 2023-05-11 9:34 ` Antoine Tenart
2023-05-11 9:34 ` [PATCH net-next 3/4] Documentation: net: net.core.txrehash is not specific to listening sockets Antoine Tenart
` (3 subsequent siblings)
5 siblings, 0 replies; 21+ messages in thread
From: Antoine Tenart @ 2023-05-11 9:34 UTC (permalink / raw)
To: davem, kuba, pabeni, edumazet; +Cc: Antoine Tenart, netdev
When using IPv4/TCP, skb->hash comes from sk->sk_txhash except in
TIME_WAIT and SYN_RECV where it's not set in the reply skb from
ip_send_unicast_reply. Those packets will have a mismatched hash with
others from the same flow as their hashes will be 0. IPv6 does not have
the same issue as the hash is set from the socket txhash in those cases.
This commits sets the hash in the reply skb from ip_send_unicast_reply,
which makes the IPv4 code behaving like IPv6.
Signed-off-by: Antoine Tenart <atenart@kernel.org>
---
include/net/ip.h | 2 +-
net/ipv4/ip_output.c | 4 +++-
net/ipv4/tcp_ipv4.c | 14 +++++++++-----
3 files changed, 13 insertions(+), 7 deletions(-)
diff --git a/include/net/ip.h b/include/net/ip.h
index c3fffaa92d6e..749735171e2c 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -280,7 +280,7 @@ void ip_send_unicast_reply(struct sock *sk, struct sk_buff *skb,
const struct ip_options *sopt,
__be32 daddr, __be32 saddr,
const struct ip_reply_arg *arg,
- unsigned int len, u64 transmit_time);
+ unsigned int len, u64 transmit_time, u32 txhash);
#define IP_INC_STATS(net, field) SNMP_INC_STATS64((net)->mib.ip_statistics, field)
#define __IP_INC_STATS(net, field) __SNMP_INC_STATS64((net)->mib.ip_statistics, field)
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 61892268e8a6..a1bead441026 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -1692,7 +1692,7 @@ void ip_send_unicast_reply(struct sock *sk, struct sk_buff *skb,
const struct ip_options *sopt,
__be32 daddr, __be32 saddr,
const struct ip_reply_arg *arg,
- unsigned int len, u64 transmit_time)
+ unsigned int len, u64 transmit_time, u32 txhash)
{
struct ip_options_data replyopts;
struct ipcm_cookie ipc;
@@ -1755,6 +1755,8 @@ void ip_send_unicast_reply(struct sock *sk, struct sk_buff *skb,
arg->csum));
nskb->ip_summed = CHECKSUM_NONE;
nskb->mono_delivery_time = !!transmit_time;
+ if (txhash)
+ skb_set_hash(nskb, txhash, PKT_HASH_TYPE_L4);
ip_push_pending_frames(sk, &fl4);
}
out:
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 39bda2b1066e..8fd4b548d448 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -692,6 +692,7 @@ static void tcp_v4_send_reset(const struct sock *sk, struct sk_buff *skb)
u64 transmit_time = 0;
struct sock *ctl_sk;
struct net *net;
+ u32 txhash = 0;
/* Never send a reset in response to a reset. */
if (th->rst)
@@ -829,12 +830,14 @@ static void tcp_v4_send_reset(const struct sock *sk, struct sk_buff *skb)
inet_twsk(sk)->tw_priority : sk->sk_priority;
transmit_time = tcp_transmit_time(sk);
xfrm_sk_clone_policy(ctl_sk, sk);
+ txhash = (sk->sk_state == TCP_TIME_WAIT) ?
+ inet_twsk(sk)->tw_txhash : sk->sk_txhash;
}
ip_send_unicast_reply(ctl_sk,
skb, &TCP_SKB_CB(skb)->header.h4.opt,
ip_hdr(skb)->saddr, ip_hdr(skb)->daddr,
&arg, arg.iov[0].iov_len,
- transmit_time);
+ transmit_time, txhash);
ctl_sk->sk_mark = 0;
xfrm_sk_free_policy(ctl_sk);
@@ -857,7 +860,7 @@ static void tcp_v4_send_ack(const struct sock *sk,
struct sk_buff *skb, u32 seq, u32 ack,
u32 win, u32 tsval, u32 tsecr, int oif,
struct tcp_md5sig_key *key,
- int reply_flags, u8 tos)
+ int reply_flags, u8 tos, u32 txhash)
{
const struct tcphdr *th = tcp_hdr(skb);
struct {
@@ -933,7 +936,7 @@ static void tcp_v4_send_ack(const struct sock *sk,
skb, &TCP_SKB_CB(skb)->header.h4.opt,
ip_hdr(skb)->saddr, ip_hdr(skb)->daddr,
&arg, arg.iov[0].iov_len,
- transmit_time);
+ transmit_time, txhash);
ctl_sk->sk_mark = 0;
sock_net_set(ctl_sk, &init_net);
@@ -954,7 +957,8 @@ static void tcp_v4_timewait_ack(struct sock *sk, struct sk_buff *skb)
tw->tw_bound_dev_if,
tcp_twsk_md5_key(tcptw),
tw->tw_transparent ? IP_REPLY_ARG_NOSRCCHECK : 0,
- tw->tw_tos
+ tw->tw_tos,
+ tw->tw_txhash
);
inet_twsk_put(tw);
@@ -987,7 +991,7 @@ static void tcp_v4_reqsk_send_ack(const struct sock *sk, struct sk_buff *skb,
0,
tcp_md5_do_lookup(sk, l3index, addr, AF_INET),
inet_rsk(req)->no_srccheck ? IP_REPLY_ARG_NOSRCCHECK : 0,
- ip_hdr(skb)->tos);
+ ip_hdr(skb)->tos, tcp_rsk(req)->txhash);
}
/*
--
2.40.1
^ permalink raw reply related [flat|nested] 21+ messages in thread
* [PATCH net-next 3/4] Documentation: net: net.core.txrehash is not specific to listening sockets
2023-05-11 9:34 [PATCH net-next 0/4] net: tcp: make txhash use consistent for IPv4 Antoine Tenart
2023-05-11 9:34 ` [PATCH net-next 1/4] net: tcp: make the txhash available in TIME_WAIT sockets for IPv4 too Antoine Tenart
2023-05-11 9:34 ` [PATCH net-next 2/4] net: ipv4: use consistent txhash in TIME_WAIT and SYN_RECV Antoine Tenart
@ 2023-05-11 9:34 ` Antoine Tenart
2023-05-11 9:34 ` [PATCH net-next 4/4] net: skbuff: fix l4_hash comment Antoine Tenart
` (2 subsequent siblings)
5 siblings, 0 replies; 21+ messages in thread
From: Antoine Tenart @ 2023-05-11 9:34 UTC (permalink / raw)
To: davem, kuba, pabeni, edumazet; +Cc: Antoine Tenart, netdev
The net.core.txrehash documentation mentions this knob is for listening
sockets only, while sk_rethink_txhash can be called on SYN and RTO
retransmits on all TCP sockets.
Remove the listening socket part.
Signed-off-by: Antoine Tenart <atenart@kernel.org>
---
Documentation/admin-guide/sysctl/net.rst | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/Documentation/admin-guide/sysctl/net.rst b/Documentation/admin-guide/sysctl/net.rst
index 466c560b0c30..4877563241f3 100644
--- a/Documentation/admin-guide/sysctl/net.rst
+++ b/Documentation/admin-guide/sysctl/net.rst
@@ -386,8 +386,8 @@ Default : 0 (for compatibility reasons)
txrehash
--------
-Controls default hash rethink behaviour on listening socket when SO_TXREHASH
-option is set to SOCK_TXREHASH_DEFAULT (i. e. not overridden by setsockopt).
+Controls default hash rethink behaviour on socket when SO_TXREHASH option is set
+to SOCK_TXREHASH_DEFAULT (i. e. not overridden by setsockopt).
If set to 1 (default), hash rethink is performed on listening socket.
If set to 0, hash rethink is not performed.
--
2.40.1
^ permalink raw reply related [flat|nested] 21+ messages in thread
* [PATCH net-next 4/4] net: skbuff: fix l4_hash comment
2023-05-11 9:34 [PATCH net-next 0/4] net: tcp: make txhash use consistent for IPv4 Antoine Tenart
` (2 preceding siblings ...)
2023-05-11 9:34 ` [PATCH net-next 3/4] Documentation: net: net.core.txrehash is not specific to listening sockets Antoine Tenart
@ 2023-05-11 9:34 ` Antoine Tenart
2023-05-11 12:10 ` Dumitru Ceara
2023-05-11 10:24 ` [PATCH net-next 0/4] net: tcp: make txhash use consistent for IPv4 Eric Dumazet
2023-05-11 11:59 ` Ilya Maximets
5 siblings, 1 reply; 21+ messages in thread
From: Antoine Tenart @ 2023-05-11 9:34 UTC (permalink / raw)
To: davem, kuba, pabeni, edumazet; +Cc: Antoine Tenart, netdev
Since commit 877d1f6291f8 ("net: Set sk_txhash from a random number")
sk->sk_txhash is not a canonical 4-tuple hash. sk->sk_txhash is
used in the TCP Tx path to populate skb->hash, with skb->l4_hash=1.
With this, skb->l4_hash does not always indicate the hash is a
"canonical 4-tuple hash over transport ports" but rather a hash from L4
layer to provide a uniform distribution over flows. Reword the comment
accordingly, to avoid misunderstandings.
Signed-off-by: Antoine Tenart <atenart@kernel.org>
---
include/linux/skbuff.h | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 738776ab8838..f54c84193b23 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -791,8 +791,8 @@ typedef unsigned char *sk_buff_data_t;
* @active_extensions: active extensions (skb_ext_id types)
* @ndisc_nodetype: router type (from link layer)
* @ooo_okay: allow the mapping of a socket to a queue to be changed
- * @l4_hash: indicate hash is a canonical 4-tuple hash over transport
- * ports.
+ * @l4_hash: indicate hash is from layer 4 and provides a uniform
+ * distribution over flows.
* @sw_hash: indicates hash was computed in software stack
* @wifi_acked_valid: wifi_acked was set
* @wifi_acked: whether frame was acked on wifi or not
--
2.40.1
^ permalink raw reply related [flat|nested] 21+ messages in thread
* Re: [PATCH net-next 0/4] net: tcp: make txhash use consistent for IPv4
2023-05-11 9:34 [PATCH net-next 0/4] net: tcp: make txhash use consistent for IPv4 Antoine Tenart
` (3 preceding siblings ...)
2023-05-11 9:34 ` [PATCH net-next 4/4] net: skbuff: fix l4_hash comment Antoine Tenart
@ 2023-05-11 10:24 ` Eric Dumazet
2023-05-11 11:55 ` Antoine Tenart
2023-05-11 11:59 ` Ilya Maximets
5 siblings, 1 reply; 21+ messages in thread
From: Eric Dumazet @ 2023-05-11 10:24 UTC (permalink / raw)
To: Antoine Tenart; +Cc: davem, kuba, pabeni, netdev
On Thu, May 11, 2023 at 11:35 AM Antoine Tenart <atenart@kernel.org> wrote:
>
> Hello,
>
> Series is divided in two parts. First two commits make the txhash (used
> for the skb hash in TCP) to be consistent for all IPv4/TCP packets (IPv6
> doesn't have the same issue). Last two commits improve doc/comment
> hash-related parts.
>
> One example is when using OvS with dp_hash, which uses skb->hash, to
> select a path. We'd like packets from the same flow to be consistent, as
> well as the hash being stable over time when using net.core.txrehash=0.
> Same applies for kernel ECMP which also can use skb->hash.
>
SGTM, thanks.
Reviewed-by: Eric Dumazet <edumazet@google.com>
FYI while reviewing your patches, I found that I have to send this fix:
I suggest we hold your patch series a bit before this reaches net-next tree,
to avoid merge conflicts.
Bug was added in commit f6c0f5d209fa ("tcp: honor SO_PRIORITY in
TIME_WAIT state")
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 39bda2b1066e1d607a59fb79c6305d0ca30cb28d..06d2573685ca993a3a0a89807f09d7b5c153cc72
100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -829,6 +829,9 @@ static void tcp_v4_send_reset(const struct sock
*sk, struct sk_buff *skb)
inet_twsk(sk)->tw_priority : sk->sk_priority;
transmit_time = tcp_transmit_time(sk);
xfrm_sk_clone_policy(ctl_sk, sk);
+ } else {
+ ctl_sk->sk_mark = 0;
+ ctl_sk->sk_priority = 0;
}
ip_send_unicast_reply(ctl_sk,
skb, &TCP_SKB_CB(skb)->header.h4.opt,
@@ -836,7 +839,6 @@ static void tcp_v4_send_reset(const struct sock
*sk, struct sk_buff *skb)
&arg, arg.iov[0].iov_len,
transmit_time);
- ctl_sk->sk_mark = 0;
xfrm_sk_free_policy(ctl_sk);
sock_net_set(ctl_sk, &init_net);
__TCP_INC_STATS(net, TCP_MIB_OUTSEGS);
@@ -935,7 +937,6 @@ static void tcp_v4_send_ack(const struct sock *sk,
&arg, arg.iov[0].iov_len,
transmit_time);
- ctl_sk->sk_mark = 0;
sock_net_set(ctl_sk, &init_net);
__TCP_INC_STATS(net, TCP_MIB_OUTSEGS);
local_bh_enable();
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH net-next 0/4] net: tcp: make txhash use consistent for IPv4
2023-05-11 10:24 ` [PATCH net-next 0/4] net: tcp: make txhash use consistent for IPv4 Eric Dumazet
@ 2023-05-11 11:55 ` Antoine Tenart
0 siblings, 0 replies; 21+ messages in thread
From: Antoine Tenart @ 2023-05-11 11:55 UTC (permalink / raw)
To: Eric Dumazet; +Cc: davem, kuba, pabeni, netdev
Quoting Eric Dumazet (2023-05-11 12:24:15)
> On Thu, May 11, 2023 at 11:35 AM Antoine Tenart <atenart@kernel.org> wrote:
> >
> > Series is divided in two parts. First two commits make the txhash (used
> > for the skb hash in TCP) to be consistent for all IPv4/TCP packets (IPv6
> > doesn't have the same issue). Last two commits improve doc/comment
> > hash-related parts.
> >
> > One example is when using OvS with dp_hash, which uses skb->hash, to
> > select a path. We'd like packets from the same flow to be consistent, as
> > well as the hash being stable over time when using net.core.txrehash=0.
> > Same applies for kernel ECMP which also can use skb->hash.
> >
>
> SGTM, thanks.
>
> Reviewed-by: Eric Dumazet <edumazet@google.com>
>
> FYI while reviewing your patches, I found that I have to send this fix:
>
> I suggest we hold your patch series a bit before this reaches net-next tree,
> to avoid merge conflicts.
Sure, no problem. Thanks for the review!
> Bug was added in commit f6c0f5d209fa ("tcp: honor SO_PRIORITY in
> TIME_WAIT state")
>
>
> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index 39bda2b1066e1d607a59fb79c6305d0ca30cb28d..06d2573685ca993a3a0a89807f09d7b5c153cc72
> 100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -829,6 +829,9 @@ static void tcp_v4_send_reset(const struct sock
> *sk, struct sk_buff *skb)
> inet_twsk(sk)->tw_priority : sk->sk_priority;
> transmit_time = tcp_transmit_time(sk);
> xfrm_sk_clone_policy(ctl_sk, sk);
> + } else {
> + ctl_sk->sk_mark = 0;
> + ctl_sk->sk_priority = 0;
> }
> ip_send_unicast_reply(ctl_sk,
> skb, &TCP_SKB_CB(skb)->header.h4.opt,
> @@ -836,7 +839,6 @@ static void tcp_v4_send_reset(const struct sock
> *sk, struct sk_buff *skb)
> &arg, arg.iov[0].iov_len,
> transmit_time);
>
> - ctl_sk->sk_mark = 0;
> xfrm_sk_free_policy(ctl_sk);
> sock_net_set(ctl_sk, &init_net);
> __TCP_INC_STATS(net, TCP_MIB_OUTSEGS);
> @@ -935,7 +937,6 @@ static void tcp_v4_send_ack(const struct sock *sk,
> &arg, arg.iov[0].iov_len,
> transmit_time);
>
> - ctl_sk->sk_mark = 0;
> sock_net_set(ctl_sk, &init_net);
> __TCP_INC_STATS(net, TCP_MIB_OUTSEGS);
> local_bh_enable();
>
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH net-next 0/4] net: tcp: make txhash use consistent for IPv4
2023-05-11 9:34 [PATCH net-next 0/4] net: tcp: make txhash use consistent for IPv4 Antoine Tenart
` (4 preceding siblings ...)
2023-05-11 10:24 ` [PATCH net-next 0/4] net: tcp: make txhash use consistent for IPv4 Eric Dumazet
@ 2023-05-11 11:59 ` Ilya Maximets
5 siblings, 0 replies; 21+ messages in thread
From: Ilya Maximets @ 2023-05-11 11:59 UTC (permalink / raw)
To: Antoine Tenart, davem, kuba, pabeni, edumazet; +Cc: netdev, i.maximets
On 5/11/23 11:34, Antoine Tenart wrote:
> Hello,
>
> Series is divided in two parts. First two commits make the txhash (used
> for the skb hash in TCP) to be consistent for all IPv4/TCP packets (IPv6
> doesn't have the same issue). Last two commits improve doc/comment
> hash-related parts.
>
> One example is when using OvS with dp_hash, which uses skb->hash, to
> select a path. We'd like packets from the same flow to be consistent, as
> well as the hash being stable over time when using net.core.txrehash=0.
> Same applies for kernel ECMP which also can use skb->hash.
FWIW, same also applies to seg6_flowlabel that is used for flowlable
based load balancing, because seg6_make_flowlabel() is using skb hash.
Best regards, Ilya Maximets.
>
> IMHO the series makes sense in net-next, but we could argue (some)
> commits be seen as fixes and I can resend if necessary.
>
> Thanks!
> Antoine
>
> Antoine Tenart (4):
> net: tcp: make the txhash available in TIME_WAIT sockets for IPv4 too
> net: ipv4: use consistent txhash in TIME_WAIT and SYN_RECV
> Documentation: net: net.core.txrehash is not specific to listening
> sockets
> net: skbuff: fix l4_hash comment
>
> Documentation/admin-guide/sysctl/net.rst | 4 ++--
> include/linux/skbuff.h | 4 ++--
> include/net/ip.h | 2 +-
> net/ipv4/ip_output.c | 4 +++-
> net/ipv4/tcp_ipv4.c | 14 +++++++++-----
> net/ipv4/tcp_minisocks.c | 2 +-
> 6 files changed, 18 insertions(+), 12 deletions(-)
>
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH net-next 4/4] net: skbuff: fix l4_hash comment
2023-05-11 9:34 ` [PATCH net-next 4/4] net: skbuff: fix l4_hash comment Antoine Tenart
@ 2023-05-11 12:10 ` Dumitru Ceara
2023-05-11 12:33 ` Eric Dumazet
0 siblings, 1 reply; 21+ messages in thread
From: Dumitru Ceara @ 2023-05-11 12:10 UTC (permalink / raw)
To: Antoine Tenart, davem, kuba, pabeni, edumazet; +Cc: netdev, Ilya Maximets
Hi Antoine,
On 5/11/23 11:34, Antoine Tenart wrote:
> Since commit 877d1f6291f8 ("net: Set sk_txhash from a random number")
> sk->sk_txhash is not a canonical 4-tuple hash. sk->sk_txhash is
> used in the TCP Tx path to populate skb->hash, with skb->l4_hash=1.
> With this, skb->l4_hash does not always indicate the hash is a
> "canonical 4-tuple hash over transport ports" but rather a hash from L4
> layer to provide a uniform distribution over flows. Reword the comment
> accordingly, to avoid misunderstandings.
But AFAIU the hash used to be a canonical 4-tuple hash and was used as
such by other components, e.g., OvS:
https://elixir.bootlin.com/linux/latest/source/net/openvswitch/actions.c#L1069
It seems to me at least unfortunate that semantics change without
considering other users. The fact that we now fix the documentation
makes it seem like OvS was wrong to use the skb hash. However, before
877d1f6291f8 ("net: Set sk_txhash from a random number") it was OK for
OvS to use the skb hash as a canonical 4-tuple hash.
Best regards,
Dumitru
>
> Signed-off-by: Antoine Tenart <atenart@kernel.org>
> ---
> include/linux/skbuff.h | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index 738776ab8838..f54c84193b23 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -791,8 +791,8 @@ typedef unsigned char *sk_buff_data_t;
> * @active_extensions: active extensions (skb_ext_id types)
> * @ndisc_nodetype: router type (from link layer)
> * @ooo_okay: allow the mapping of a socket to a queue to be changed
> - * @l4_hash: indicate hash is a canonical 4-tuple hash over transport
> - * ports.
> + * @l4_hash: indicate hash is from layer 4 and provides a uniform
> + * distribution over flows.
> * @sw_hash: indicates hash was computed in software stack
> * @wifi_acked_valid: wifi_acked was set
> * @wifi_acked: whether frame was acked on wifi or not
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH net-next 4/4] net: skbuff: fix l4_hash comment
2023-05-11 12:10 ` Dumitru Ceara
@ 2023-05-11 12:33 ` Eric Dumazet
2023-05-11 13:00 ` Dumitru Ceara
0 siblings, 1 reply; 21+ messages in thread
From: Eric Dumazet @ 2023-05-11 12:33 UTC (permalink / raw)
To: Dumitru Ceara; +Cc: Antoine Tenart, davem, kuba, pabeni, netdev, Ilya Maximets
On Thu, May 11, 2023 at 2:10 PM Dumitru Ceara <dceara@redhat.com> wrote:
>
> Hi Antoine,
>
> On 5/11/23 11:34, Antoine Tenart wrote:
> > Since commit 877d1f6291f8 ("net: Set sk_txhash from a random number")
> > sk->sk_txhash is not a canonical 4-tuple hash. sk->sk_txhash is
> > used in the TCP Tx path to populate skb->hash, with skb->l4_hash=1.
> > With this, skb->l4_hash does not always indicate the hash is a
> > "canonical 4-tuple hash over transport ports" but rather a hash from L4
> > layer to provide a uniform distribution over flows. Reword the comment
> > accordingly, to avoid misunderstandings.
>
> But AFAIU the hash used to be a canonical 4-tuple hash and was used as
> such by other components, e.g., OvS:
>
> https://elixir.bootlin.com/linux/latest/source/net/openvswitch/actions.c#L1069
>
> It seems to me at least unfortunate that semantics change without
> considering other users. The fact that we now fix the documentation
> makes it seem like OvS was wrong to use the skb hash. However, before
> 877d1f6291f8 ("net: Set sk_txhash from a random number") it was OK for
> OvS to use the skb hash as a canonical 4-tuple hash.
>
I do not think we can undo stuff that was done back in 2015
Has anyone complained ?
Note that skb->hash has never been considered as canonical, for obvious reasons.
> Best regards,
> Dumitru
>
> >
> > Signed-off-by: Antoine Tenart <atenart@kernel.org>
> > ---
> > include/linux/skbuff.h | 4 ++--
> > 1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> > index 738776ab8838..f54c84193b23 100644
> > --- a/include/linux/skbuff.h
> > +++ b/include/linux/skbuff.h
> > @@ -791,8 +791,8 @@ typedef unsigned char *sk_buff_data_t;
> > * @active_extensions: active extensions (skb_ext_id types)
> > * @ndisc_nodetype: router type (from link layer)
> > * @ooo_okay: allow the mapping of a socket to a queue to be changed
> > - * @l4_hash: indicate hash is a canonical 4-tuple hash over transport
> > - * ports.
> > + * @l4_hash: indicate hash is from layer 4 and provides a uniform
> > + * distribution over flows.
> > * @sw_hash: indicates hash was computed in software stack
> > * @wifi_acked_valid: wifi_acked was set
> > * @wifi_acked: whether frame was acked on wifi or not
>
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH net-next 4/4] net: skbuff: fix l4_hash comment
2023-05-11 12:33 ` Eric Dumazet
@ 2023-05-11 13:00 ` Dumitru Ceara
2023-05-11 17:54 ` Ilya Maximets
0 siblings, 1 reply; 21+ messages in thread
From: Dumitru Ceara @ 2023-05-11 13:00 UTC (permalink / raw)
To: Eric Dumazet; +Cc: Antoine Tenart, davem, kuba, pabeni, netdev, Ilya Maximets
On 5/11/23 14:33, Eric Dumazet wrote:
> On Thu, May 11, 2023 at 2:10 PM Dumitru Ceara <dceara@redhat.com> wrote:
>>
>> Hi Antoine,
>>
>> On 5/11/23 11:34, Antoine Tenart wrote:
>>> Since commit 877d1f6291f8 ("net: Set sk_txhash from a random number")
>>> sk->sk_txhash is not a canonical 4-tuple hash. sk->sk_txhash is
>>> used in the TCP Tx path to populate skb->hash, with skb->l4_hash=1.
>>> With this, skb->l4_hash does not always indicate the hash is a
>>> "canonical 4-tuple hash over transport ports" but rather a hash from L4
>>> layer to provide a uniform distribution over flows. Reword the comment
>>> accordingly, to avoid misunderstandings.
>>
>> But AFAIU the hash used to be a canonical 4-tuple hash and was used as
>> such by other components, e.g., OvS:
>>
>> https://elixir.bootlin.com/linux/latest/source/net/openvswitch/actions.c#L1069
>>
>> It seems to me at least unfortunate that semantics change without
>> considering other users. The fact that we now fix the documentation
>> makes it seem like OvS was wrong to use the skb hash. However, before
>> 877d1f6291f8 ("net: Set sk_txhash from a random number") it was OK for
>> OvS to use the skb hash as a canonical 4-tuple hash.
>>
>
> I do not think we can undo stuff that was done back in 2015
>
I understand. I guess I was kind of grasping at straws in the hope of
getting a canonical 4-tuple hash.
> Has anyone complained ?
>
It did go unnoticed for a while but recently we started getting
(indirect) reports due to the hash changing.
This one is from an upstream OVN (OvS) user:
https://github.com/ovn-org/ovn/issues/112
This is from an OpenShift (also running OVN/OvS) user:
https://issues.redhat.com/browse/OCPBUGS-7406
> Note that skb->hash has never been considered as canonical, for obvious reasons.
>
>
>> Best regards,
>> Dumitru
>>
>>>
>>> Signed-off-by: Antoine Tenart <atenart@kernel.org>
>>> ---
>>> include/linux/skbuff.h | 4 ++--
>>> 1 file changed, 2 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
>>> index 738776ab8838..f54c84193b23 100644
>>> --- a/include/linux/skbuff.h
>>> +++ b/include/linux/skbuff.h
>>> @@ -791,8 +791,8 @@ typedef unsigned char *sk_buff_data_t;
>>> * @active_extensions: active extensions (skb_ext_id types)
>>> * @ndisc_nodetype: router type (from link layer)
>>> * @ooo_okay: allow the mapping of a socket to a queue to be changed
>>> - * @l4_hash: indicate hash is a canonical 4-tuple hash over transport
>>> - * ports.
>>> + * @l4_hash: indicate hash is from layer 4 and provides a uniform
>>> + * distribution over flows.
>>> * @sw_hash: indicates hash was computed in software stack
>>> * @wifi_acked_valid: wifi_acked was set
>>> * @wifi_acked: whether frame was acked on wifi or not
>>
>
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH net-next 4/4] net: skbuff: fix l4_hash comment
2023-05-11 13:00 ` Dumitru Ceara
@ 2023-05-11 17:54 ` Ilya Maximets
2023-05-11 20:50 ` Dumitru Ceara
0 siblings, 1 reply; 21+ messages in thread
From: Ilya Maximets @ 2023-05-11 17:54 UTC (permalink / raw)
To: Dumitru Ceara, Eric Dumazet
Cc: i.maximets, Antoine Tenart, davem, kuba, pabeni, netdev
On 5/11/23 15:00, Dumitru Ceara wrote:
> On 5/11/23 14:33, Eric Dumazet wrote:
>> On Thu, May 11, 2023 at 2:10 PM Dumitru Ceara <dceara@redhat.com> wrote:
>>>
>>> Hi Antoine,
>>>
>>> On 5/11/23 11:34, Antoine Tenart wrote:
>>>> Since commit 877d1f6291f8 ("net: Set sk_txhash from a random number")
>>>> sk->sk_txhash is not a canonical 4-tuple hash. sk->sk_txhash is
>>>> used in the TCP Tx path to populate skb->hash, with skb->l4_hash=1.
>>>> With this, skb->l4_hash does not always indicate the hash is a
>>>> "canonical 4-tuple hash over transport ports" but rather a hash from L4
>>>> layer to provide a uniform distribution over flows. Reword the comment
>>>> accordingly, to avoid misunderstandings.
>>>
>>> But AFAIU the hash used to be a canonical 4-tuple hash and was used as
>>> such by other components, e.g., OvS:
>>>
>>> https://elixir.bootlin.com/linux/latest/source/net/openvswitch/actions.c#L1069
>>>
>>> It seems to me at least unfortunate that semantics change without
>>> considering other users. The fact that we now fix the documentation
>>> makes it seem like OvS was wrong to use the skb hash. However, before
>>> 877d1f6291f8 ("net: Set sk_txhash from a random number") it was OK for
>>> OvS to use the skb hash as a canonical 4-tuple hash.
>>>
>>
>> I do not think we can undo stuff that was done back in 2015
>>
>
> I understand. I guess I was kind of grasping at straws in the hope of
> getting a canonical 4-tuple hash.
>
>> Has anyone complained ?
>>
>
> It did go unnoticed for a while but recently we started getting
> (indirect) reports due to the hash changing.
>
> This one is from an upstream OVN (OvS) user:
> https://github.com/ovn-org/ovn/issues/112
>
> This is from an OpenShift (also running OVN/OvS) user:
> https://issues.redhat.com/browse/OCPBUGS-7406
>
>> Note that skb->hash has never been considered as canonical, for obvious reasons.
I guess, the other point here is that it's not an L4 hash either.
It's a random number. So, the documentation will still not be
correct even after the change proposed in this patch.
One other solution to the problem might be to stop setting l4_hash
flag while it's a random number.
One way to not break everything doing that will be to introduce a
new flag, e.g. 'rnd_hash' that will be a hash that is "not related
to packet fields, but provides a uniform distribution over flows".
skb_get_hash() then may return the current hash if it's any of
l4, rnd or sw. That should preserve the current logic across
the kernel code.
But having a new flag, we could introduce a new helper, for example
skb_get_stable_hash() or skb_get_hash_nonrandom() or something like
that, that will be equal to the current version of skb_get_hash(),
i.e. not take the random hash into account.
Affected subsystems (OVS, ECMP, SRv6) can be changed to use that
new function. This way these subsystems will get a software hash
based on the real packet fields, if it was originally random.
This will also preserve ability to use hash provided by the HW,
since it is not normally random.
With that, we'll also not need to have in the API something that has
'L4' in the name and in the docs, but has no relation to packet fields.
It can be argued that the description in the doc doesn't mean that
this hash is computed using L4 packet fields, but it's confusing
regardless and getting overlooked while creating new code, as it
shown by the issues in multiple substystems.
Hope this makes some sense.
Dumitru also had some alternative ideas on how to provide a stable
hash to subsystems that need it, but I'll leave it to him.
Best regards, Ilya Maximets.
>>
>>
>>> Best regards,
>>> Dumitru
>>>
>>>>
>>>> Signed-off-by: Antoine Tenart <atenart@kernel.org>
>>>> ---
>>>> include/linux/skbuff.h | 4 ++--
>>>> 1 file changed, 2 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
>>>> index 738776ab8838..f54c84193b23 100644
>>>> --- a/include/linux/skbuff.h
>>>> +++ b/include/linux/skbuff.h
>>>> @@ -791,8 +791,8 @@ typedef unsigned char *sk_buff_data_t;
>>>> * @active_extensions: active extensions (skb_ext_id types)
>>>> * @ndisc_nodetype: router type (from link layer)
>>>> * @ooo_okay: allow the mapping of a socket to a queue to be changed
>>>> - * @l4_hash: indicate hash is a canonical 4-tuple hash over transport
>>>> - * ports.
>>>> + * @l4_hash: indicate hash is from layer 4 and provides a uniform
>>>> + * distribution over flows.
>>>> * @sw_hash: indicates hash was computed in software stack
>>>> * @wifi_acked_valid: wifi_acked was set
>>>> * @wifi_acked: whether frame was acked on wifi or not
>>>
>>
>
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH net-next 4/4] net: skbuff: fix l4_hash comment
2023-05-11 17:54 ` Ilya Maximets
@ 2023-05-11 20:50 ` Dumitru Ceara
2023-05-15 8:12 ` Antoine Tenart
0 siblings, 1 reply; 21+ messages in thread
From: Dumitru Ceara @ 2023-05-11 20:50 UTC (permalink / raw)
To: Ilya Maximets, Eric Dumazet; +Cc: Antoine Tenart, davem, kuba, pabeni, netdev
On 5/11/23 19:54, Ilya Maximets wrote:
> On 5/11/23 15:00, Dumitru Ceara wrote:
>> On 5/11/23 14:33, Eric Dumazet wrote:
>>> On Thu, May 11, 2023 at 2:10 PM Dumitru Ceara <dceara@redhat.com> wrote:
>>>>
>>>> Hi Antoine,
>>>>
>>>> On 5/11/23 11:34, Antoine Tenart wrote:
>>>>> Since commit 877d1f6291f8 ("net: Set sk_txhash from a random number")
>>>>> sk->sk_txhash is not a canonical 4-tuple hash. sk->sk_txhash is
>>>>> used in the TCP Tx path to populate skb->hash, with skb->l4_hash=1.
>>>>> With this, skb->l4_hash does not always indicate the hash is a
>>>>> "canonical 4-tuple hash over transport ports" but rather a hash from L4
>>>>> layer to provide a uniform distribution over flows. Reword the comment
>>>>> accordingly, to avoid misunderstandings.
>>>>
>>>> But AFAIU the hash used to be a canonical 4-tuple hash and was used as
>>>> such by other components, e.g., OvS:
>>>>
>>>> https://elixir.bootlin.com/linux/latest/source/net/openvswitch/actions.c#L1069
>>>>
>>>> It seems to me at least unfortunate that semantics change without
>>>> considering other users. The fact that we now fix the documentation
>>>> makes it seem like OvS was wrong to use the skb hash. However, before
>>>> 877d1f6291f8 ("net: Set sk_txhash from a random number") it was OK for
>>>> OvS to use the skb hash as a canonical 4-tuple hash.
>>>>
>>>
>>> I do not think we can undo stuff that was done back in 2015
>>>
>>
>> I understand. I guess I was kind of grasping at straws in the hope of
>> getting a canonical 4-tuple hash.
>>
>>> Has anyone complained ?
>>>
>>
>> It did go unnoticed for a while but recently we started getting
>> (indirect) reports due to the hash changing.
>>
>> This one is from an upstream OVN (OvS) user:
>> https://github.com/ovn-org/ovn/issues/112
>>
>> This is from an OpenShift (also running OVN/OvS) user:
>> https://issues.redhat.com/browse/OCPBUGS-7406
>>
I just realized we need a bit more context here. It started being a
visible problem after 265f94ff54d6 ("net: Recompute sk_txhash on
negative routing advice") and also after 3acf3ec3f4b0 ("tcp: Change
txhash on every SYN and RTO retransmit") when retransmits started
changing the txhash and implicitly the hash used by OvS.
>>> Note that skb->hash has never been considered as canonical, for obvious reasons.
>
> I guess, the other point here is that it's not an L4 hash either.
>
> It's a random number. So, the documentation will still not be
> correct even after the change proposed in this patch.
>
>
> One other solution to the problem might be to stop setting l4_hash
> flag while it's a random number.
>
> One way to not break everything doing that will be to introduce a
> new flag, e.g. 'rnd_hash' that will be a hash that is "not related
> to packet fields, but provides a uniform distribution over flows".
>
> skb_get_hash() then may return the current hash if it's any of
> l4, rnd or sw. That should preserve the current logic across
> the kernel code.
> But having a new flag, we could introduce a new helper, for example
> skb_get_stable_hash() or skb_get_hash_nonrandom() or something like
> that, that will be equal to the current version of skb_get_hash(),
> i.e. not take the random hash into account.
>
> Affected subsystems (OVS, ECMP, SRv6) can be changed to use that
> new function. This way these subsystems will get a software hash
> based on the real packet fields, if it was originally random.
> This will also preserve ability to use hash provided by the HW,
> since it is not normally random.
>
> With that, we'll also not need to have in the API something that has
> 'L4' in the name and in the docs, but has no relation to packet fields.
> It can be argued that the description in the doc doesn't mean that
> this hash is computed using L4 packet fields, but it's confusing
> regardless and getting overlooked while creating new code, as it
> shown by the issues in multiple substystems.
>
> Hope this makes some sense.
>
>
> Dumitru also had some alternative ideas on how to provide a stable
> hash to subsystems that need it, but I'll leave it to him.
>
What I had in mind is not really a stable hash but a "good enough
alternative". It's probably "good enough" (at least for OvS/OVN) if the
hash used by OvS doesn't change throughout the lifetime of a TCP session.
Would it be possible to save the original (random) hash that was
generated for a locally terminated TCP session? E.g., a new field in
'struct sock'. It would be in essence a random tag associated to the
session that doesn't change throughout the lifetime of the session.
Unlike sk->sk_txhash which changes on retransmit/negative routing advice.
That means OvS doesn't have to compute a stable hash every time it
processes a packet, It would just access this value through
skb->sk->good_name_for_this_new_tag. The advantage is that it gives the
appearance of a canonical 4-tuple hash throughout the lifetime of a
session and it doesn't affect any of the use cases that required
877d1f6291f8 ("net: Set sk_txhash from a random number").
I probably missed relevant things but I thought it might be worth
sharing in case the idea has some value.
Regards,
Dumitru
> Best regards, Ilya Maximets.
>
>>>
>>>
>>>> Best regards,
>>>> Dumitru
>>>>
>>>>>
>>>>> Signed-off-by: Antoine Tenart <atenart@kernel.org>
>>>>> ---
>>>>> include/linux/skbuff.h | 4 ++--
>>>>> 1 file changed, 2 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
>>>>> index 738776ab8838..f54c84193b23 100644
>>>>> --- a/include/linux/skbuff.h
>>>>> +++ b/include/linux/skbuff.h
>>>>> @@ -791,8 +791,8 @@ typedef unsigned char *sk_buff_data_t;
>>>>> * @active_extensions: active extensions (skb_ext_id types)
>>>>> * @ndisc_nodetype: router type (from link layer)
>>>>> * @ooo_okay: allow the mapping of a socket to a queue to be changed
>>>>> - * @l4_hash: indicate hash is a canonical 4-tuple hash over transport
>>>>> - * ports.
>>>>> + * @l4_hash: indicate hash is from layer 4 and provides a uniform
>>>>> + * distribution over flows.
>>>>> * @sw_hash: indicates hash was computed in software stack
>>>>> * @wifi_acked_valid: wifi_acked was set
>>>>> * @wifi_acked: whether frame was acked on wifi or not
>>>>
>>>
>>
>
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH net-next 4/4] net: skbuff: fix l4_hash comment
2023-05-11 20:50 ` Dumitru Ceara
@ 2023-05-15 8:12 ` Antoine Tenart
2023-05-15 18:23 ` Ilya Maximets
0 siblings, 1 reply; 21+ messages in thread
From: Antoine Tenart @ 2023-05-15 8:12 UTC (permalink / raw)
To: Dumitru Ceara, Eric Dumazet, Ilya Maximets; +Cc: davem, kuba, pabeni, netdev
Quoting Dumitru Ceara (2023-05-11 22:50:32)
> On 5/11/23 19:54, Ilya Maximets wrote:
> >>> Note that skb->hash has never been considered as canonical, for obvious reasons.
> >
> > I guess, the other point here is that it's not an L4 hash either.
> >
> > It's a random number. So, the documentation will still not be
> > correct even after the change proposed in this patch.
The proposed changed is "indicate hash is from layer 4 and provides a
uniform distribution over flows", which does not describe *how* the hash
is computed but *where* it comes from. This matches "random number set
by TCP" and changes in how hashes are computed won't affect the comment,
so we'll not end up in the same situation.
> > One way to not break everything doing that will be to introduce a
> > new flag, e.g. 'rnd_hash' that will be a hash that is "not related
> > to packet fields, but provides a uniform distribution over flows".
> >
> > skb_get_hash() then may return the current hash if it's any of
> > l4, rnd or sw. That should preserve the current logic across
> > the kernel code.
> > But having a new flag, we could introduce a new helper, for example
> > skb_get_stable_hash() or skb_get_hash_nonrandom() or something like
> > that, that will be equal to the current version of skb_get_hash(),
> > i.e. not take the random hash into account.
> >
> > Affected subsystems (OVS, ECMP, SRv6) can be changed to use that
> > new function. This way these subsystems will get a software hash
> > based on the real packet fields, if it was originally random.
> > This will also preserve ability to use hash provided by the HW,
> > since it is not normally random.
But then the whole point of txrehash would be dismissed, if ECMP and
others stop using the hash provided by TCP. This needs to be a
conditional setting, to make the skb hash to be stable over time only
when needed. That way both scenario are supported.
> What I had in mind is not really a stable hash but a "good enough
> alternative". It's probably "good enough" (at least for OvS/OVN) if the
> hash used by OvS doesn't change throughout the lifetime of a TCP session.
So what's important is not how the hash is computed but the fact that
it should be stable over time when requested. Isn't exactly what
net.core.txrehash=0 does? If there are some bugs they should be fixed.
On top of this, if OvS needs to additionally provide a canonical
4/5-tuple hash because not only the stability over time is needed but
also the method is important, it needs to compute its own hash. As part
of such potential series ways to cache the result can be explored.
Numbers would help too. (This can be discussed here, that's fine, but I
thought it's important to distinguish the two topics).
Antoine
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH net-next 4/4] net: skbuff: fix l4_hash comment
2023-05-15 8:12 ` Antoine Tenart
@ 2023-05-15 18:23 ` Ilya Maximets
2023-05-16 7:36 ` Antoine Tenart
0 siblings, 1 reply; 21+ messages in thread
From: Ilya Maximets @ 2023-05-15 18:23 UTC (permalink / raw)
To: Antoine Tenart, Dumitru Ceara, Eric Dumazet
Cc: i.maximets, davem, kuba, pabeni, netdev
On 5/15/23 10:12, Antoine Tenart wrote:
> Quoting Dumitru Ceara (2023-05-11 22:50:32)
>> On 5/11/23 19:54, Ilya Maximets wrote:
>>>>> Note that skb->hash has never been considered as canonical, for obvious reasons.
>>>
>>> I guess, the other point here is that it's not an L4 hash either.
>>>
>>> It's a random number. So, the documentation will still not be
>>> correct even after the change proposed in this patch.
>
> The proposed changed is "indicate hash is from layer 4 and provides a
> uniform distribution over flows", which does not describe *how* the hash
> is computed but *where* it comes from. This matches "random number set
> by TCP" and changes in how hashes are computed won't affect the comment,
> so we'll not end up in the same situation.
I respectfully disagree, "is from layer 4" and "random number" do not
match for me. So, "where it comes from" argument is not applicable.
Random numbers come from random number generator, and not "from layer 4".
Unless by "from layer 4" you mean "from the code that handles layer 4
packet processing". But that seems very confusing to me. And it is
definitely not the first thing that comes to mind while reading the
documentation.
>
>>> One way to not break everything doing that will be to introduce a
>>> new flag, e.g. 'rnd_hash' that will be a hash that is "not related
>>> to packet fields, but provides a uniform distribution over flows".
>>>
>>> skb_get_hash() then may return the current hash if it's any of
>>> l4, rnd or sw. That should preserve the current logic across
>>> the kernel code.
>>> But having a new flag, we could introduce a new helper, for example
>>> skb_get_stable_hash() or skb_get_hash_nonrandom() or something like
>>> that, that will be equal to the current version of skb_get_hash(),
>>> i.e. not take the random hash into account.
>>>
>>> Affected subsystems (OVS, ECMP, SRv6) can be changed to use that
>>> new function. This way these subsystems will get a software hash
>>> based on the real packet fields, if it was originally random.
>>> This will also preserve ability to use hash provided by the HW,
>>> since it is not normally random.
>
> But then the whole point of txrehash would be dismissed, if ECMP and
> others stop using the hash provided by TCP.
I guess ECMP is not a good example as it doesn't use skb hash for locally
generated traffic. However, the argument about defeating the purpose of
rehash doesn't stand either for the same reason. The same next hop will
be chosen for the same flow always, because the hash will be re-generated
from a 5-tuple. But anyway...
In OVS and SRv6 cases we're also talking about load balancing, so unlike
ECMP, final destinations may be different based on the hash. In this case
there is an issue.
> This needs to be a
> conditional setting, to make the skb hash to be stable over time only
> when needed. That way both scenario are supported.
>
>> What I had in mind is not really a stable hash but a "good enough
>> alternative". It's probably "good enough" (at least for OvS/OVN) if the
>> hash used by OvS doesn't change throughout the lifetime of a TCP session.
>
> So what's important is not how the hash is computed but the fact that
> it should be stable over time when requested. Isn't exactly what
> net.core.txrehash=0 does? If there are some bugs they should be fixed.
Not asking to disable re-hash. Asking to provide a way to distinguish
changing random numbers from a more stable hash, so we can avoid
recalculating it on every packet.
Of course, we can change the code to re-calculate every time, but that
sounds wasteful if it's already done by the HW, for example.
> On top of this, if OvS needs to additionally provide a canonical
> 4/5-tuple hash because not only the stability over time is needed but
> also the method is important, it needs to compute its own hash
For the OVS_HASH_ALG_L4, the exact method is not really important.
For OVS_HASH_ALG_SYM_L4, the hash has to symmetric, but this hashing
algorithm is not implemented in the OVS kernel module today.
> As part
> of such potential series ways to cache the result can be explored.
> Numbers would help too. (This can be discussed here, that's fine, but I
> thought it's important to distinguish the two topics).
I agree that topics are fairly different. The rest of the set is probably
fine (I didn't review). I'm just not sure why we need to change the comment
from one incorrect sentence to another similarly incorrect and confusing.
Also making it look like subsystems that use it in a previously documented
meaning are at fault. It's just not fair. I think, the issue should be
fixed first.
On the other note,
I don't think that the rest of the patch set should be held back by that
though. So, maybe dropping this one patch from the set might be an option
for now?
Best regards, Ilya Maximets.
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH net-next 4/4] net: skbuff: fix l4_hash comment
2023-05-15 18:23 ` Ilya Maximets
@ 2023-05-16 7:36 ` Antoine Tenart
2023-05-16 21:25 ` Ilya Maximets
0 siblings, 1 reply; 21+ messages in thread
From: Antoine Tenart @ 2023-05-16 7:36 UTC (permalink / raw)
To: Dumitru Ceara, Eric Dumazet, Ilya Maximets
Cc: i.maximets, davem, kuba, pabeni, netdev
Quoting Ilya Maximets (2023-05-15 20:23:28)
> On 5/15/23 10:12, Antoine Tenart wrote:
> > Quoting Dumitru Ceara (2023-05-11 22:50:32)
> >> On 5/11/23 19:54, Ilya Maximets wrote:
> >>>>> Note that skb->hash has never been considered as canonical, for obvious reasons.
> >>>
> >>> I guess, the other point here is that it's not an L4 hash either.
> >>>
> >>> It's a random number. So, the documentation will still not be
> >>> correct even after the change proposed in this patch.
> >
> > The proposed changed is "indicate hash is from layer 4 and provides a
> > uniform distribution over flows", which does not describe *how* the hash
> > is computed but *where* it comes from. This matches "random number set
> > by TCP" and changes in how hashes are computed won't affect the comment,
> > so we'll not end up in the same situation.
>
> I respectfully disagree, "is from layer 4" and "random number" do not
> match for me. So, "where it comes from" argument is not applicable.
> Random numbers come from random number generator, and not "from layer 4".
>
> Unless by "from layer 4" you mean "from the code that handles layer 4
> packet processing". But that seems very confusing to me. And it is
> definitely not the first thing that comes to mind while reading the
> documentation.
Yes that is what I meant, but if that is still confusing then this is
not improving things so let's try something better. I intentionally did
not mention how the hash is computed because it's easy to forget to
update the documentation when the exact logic is changed. What's
important here IMHO is to mention what the hash provides.
What about "indicates hash was set by layer 4 stack and provides a
uniform distribution over flows"? Or/and we should we also add a
disclaimer like "no guarantee on how the hash was computed"?
> Also making it look like subsystems that use it in a previously
> documented meaning are at fault. It's just not fair.
I never said that nor it is what I think, I'm sorry if my message was
misleading.
Antoine
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH net-next 4/4] net: skbuff: fix l4_hash comment
2023-05-16 7:36 ` Antoine Tenart
@ 2023-05-16 21:25 ` Ilya Maximets
2023-05-17 12:05 ` Antoine Tenart
0 siblings, 1 reply; 21+ messages in thread
From: Ilya Maximets @ 2023-05-16 21:25 UTC (permalink / raw)
To: Antoine Tenart, Dumitru Ceara, Eric Dumazet
Cc: i.maximets, davem, kuba, pabeni, netdev
On 5/16/23 09:36, Antoine Tenart wrote:
> Quoting Ilya Maximets (2023-05-15 20:23:28)
>> On 5/15/23 10:12, Antoine Tenart wrote:
>>> Quoting Dumitru Ceara (2023-05-11 22:50:32)
>>>> On 5/11/23 19:54, Ilya Maximets wrote:
>>>>>>> Note that skb->hash has never been considered as canonical, for obvious reasons.
>>>>>
>>>>> I guess, the other point here is that it's not an L4 hash either.
>>>>>
>>>>> It's a random number. So, the documentation will still not be
>>>>> correct even after the change proposed in this patch.
>>>
>>> The proposed changed is "indicate hash is from layer 4 and provides a
>>> uniform distribution over flows", which does not describe *how* the hash
>>> is computed but *where* it comes from. This matches "random number set
>>> by TCP" and changes in how hashes are computed won't affect the comment,
>>> so we'll not end up in the same situation.
>>
>> I respectfully disagree, "is from layer 4" and "random number" do not
>> match for me. So, "where it comes from" argument is not applicable.
>> Random numbers come from random number generator, and not "from layer 4".
>>
>> Unless by "from layer 4" you mean "from the code that handles layer 4
>> packet processing". But that seems very confusing to me. And it is
>> definitely not the first thing that comes to mind while reading the
>> documentation.
>
> Yes that is what I meant, but if that is still confusing then this is
> not improving things so let's try something better. I intentionally did
> not mention how the hash is computed because it's easy to forget to
> update the documentation when the exact logic is changed. What's
> important here IMHO is to mention what the hash provides.
>
> What about "indicates hash was set by layer 4 stack and provides a
> uniform distribution over flows"? Or/and we should we also add a
> disclaimer like "no guarantee on how the hash was computed"?
I'm still not sure this is correct. Is a NIC driver part of layer 4 stack?
Also it still doesn't make a lot of sense to change to comment without
changing the code.
And there are lots of other inconsistencies around skb hash. The following
is probably the most colorful that I found:
TCP code in tcp_make_synack() calls:
skb_set_hash(skb, tcp_rsk(req)->txhash, PKT_HASH_TYPE_L4);
Where 'txhash' is a random number. But it calls it PKT_HASH_TYPE_L4.
Going to the definition we can find this [1]:
/*
* Packet hash types specify the type of hash in skb_set_hash.
*
* Hash types refer to the protocol layer addresses which are used to
* construct a packet's hash. The hashes are used to differentiate or identify
* flows of the protocol layer for the hash type. Hash types are either
* layer-2 (L2), layer-3 (L3), or layer-4 (L4).
*
* Properties of hashes:
*
* 1) Two packets in different flows have different hash values
* 2) Two packets in the same flow should have the same hash value
Now this directly contradicts to the hash being not stable and not being
computed form actual packet fields.
Later in the same file:
enum pkt_hash_types {
PKT_HASH_TYPE_NONE, /* Undefined type */
PKT_HASH_TYPE_L2, /* Input: src_MAC, dest_MAC */
PKT_HASH_TYPE_L3, /* Input: src_IP, dst_IP */
PKT_HASH_TYPE_L4, /* Input: src_IP, dst_IP, src_port, dst_port */
};
Here we see that PKT_HASH_TYPE_L4 supposed to use particular fields
as an input.
It's kind of pointless having all that documented if the l4_hash flag
is a random number and none of the kernel subsystems are able to use
it in a way it is documented.
[1] https://elixir.bootlin.com/linux/v6.4-rc1/source/include/linux/skbuff.h#L1419
Best regards, Ilya Maximets.
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH net-next 4/4] net: skbuff: fix l4_hash comment
2023-05-16 21:25 ` Ilya Maximets
@ 2023-05-17 12:05 ` Antoine Tenart
2023-05-17 23:00 ` Ilya Maximets
0 siblings, 1 reply; 21+ messages in thread
From: Antoine Tenart @ 2023-05-17 12:05 UTC (permalink / raw)
To: Dumitru Ceara, Eric Dumazet, Ilya Maximets
Cc: i.maximets, davem, kuba, pabeni, netdev
Quoting Ilya Maximets (2023-05-16 23:25:19)
> On 5/16/23 09:36, Antoine Tenart wrote:
> >
> > What about "indicates hash was set by layer 4 stack and provides a
> > uniform distribution over flows"? Or/and we should we also add a
> > disclaimer like "no guarantee on how the hash was computed"?
>
> I'm still not sure this is correct. Is a NIC driver part of layer 4
> stack?
Offloading logic with L4 fields for csum, RSS, etc; we can argue it does
something at L4. What about this: "Provides a uniform distribution over
L4 flows"? I does look better than the previous proposal IMHO.
> And there are lots of other inconsistencies around skb hash. The following
> is probably the most colorful that I found:
>
> skb_set_hash(skb, tcp_rsk(req)->txhash, PKT_HASH_TYPE_L4);
> * Hash types refer to the protocol layer addresses which are used to
> * construct a packet's hash. The hashes are used to differentiate or identify
> * flows of the protocol layer for the hash type. Hash types are either
> * layer-2 (L2), layer-3 (L3), or layer-4 (L4).
> *
> * Properties of hashes:
> *
> * 1) Two packets in different flows have different hash values
> * 2) Two packets in the same flow should have the same hash value
> enum pkt_hash_types {
> PKT_HASH_TYPE_L4, /* Input: src_IP, dst_IP, src_port, dst_port */
> };
>
> Here we see that PKT_HASH_TYPE_L4 supposed to use particular fields
> as an input.
If we strictly follow the above, do all NIC provide a L4 hash using only
the above fields (src_IP, dst_IP, src_port, dst_port)? Having a quick
look I'm pretty sure no, both 4 and 5-tuple can be used. What is
important is at what level the distribution is.
So yes strictly speaking the above PKT_HASH_TYPE_L4 use can be a little
surprising, but to me it's a shortcut or a missing update. For perfect
correctness we could use
__skb_set_hash(skb, tcp_rsk(req)->txhash, false, true) FWIW.
Even l4_hash w/o taking the rnd case into account does not guarantee a
stable hash for the lifetime of a flow; what happens if packets from the
same flow are received on two NICs using different keys and/or algs?
Being computed from L4 fields does not mean it is stable. If the stable
property is needed, the hash has to be computed locally. And then comes
the other topic of caching it for reuse and potential sharing across
different consumers, sure.
Now, I'll let some time to give a chance for others to chime in.
Thanks,
Antoine
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH net-next 4/4] net: skbuff: fix l4_hash comment
2023-05-17 12:05 ` Antoine Tenart
@ 2023-05-17 23:00 ` Ilya Maximets
2023-05-23 15:25 ` Antoine Tenart
0 siblings, 1 reply; 21+ messages in thread
From: Ilya Maximets @ 2023-05-17 23:00 UTC (permalink / raw)
To: Antoine Tenart, Dumitru Ceara, Eric Dumazet
Cc: i.maximets, davem, kuba, pabeni, netdev
On 5/17/23 14:05, Antoine Tenart wrote:
> Quoting Ilya Maximets (2023-05-16 23:25:19)
>> On 5/16/23 09:36, Antoine Tenart wrote:
>>>
>>> What about "indicates hash was set by layer 4 stack and provides a
>>> uniform distribution over flows"? Or/and we should we also add a
>>> disclaimer like "no guarantee on how the hash was computed"?
>>
>> I'm still not sure this is correct. Is a NIC driver part of layer 4
>> stack?
>
> Offloading logic with L4 fields for csum, RSS, etc; we can argue it does
> something at L4. What about this: "Provides a uniform distribution over
> L4 flows"? I does look better than the previous proposal IMHO.
>
>> And there are lots of other inconsistencies around skb hash. The following
>> is probably the most colorful that I found:
>>
>> skb_set_hash(skb, tcp_rsk(req)->txhash, PKT_HASH_TYPE_L4);
>
>> * Hash types refer to the protocol layer addresses which are used to
>> * construct a packet's hash. The hashes are used to differentiate or identify
>> * flows of the protocol layer for the hash type. Hash types are either
>> * layer-2 (L2), layer-3 (L3), or layer-4 (L4).
>> *
>> * Properties of hashes:
>> *
>> * 1) Two packets in different flows have different hash values
>> * 2) Two packets in the same flow should have the same hash value
>
>> enum pkt_hash_types {
>> PKT_HASH_TYPE_L4, /* Input: src_IP, dst_IP, src_port, dst_port */
>> };
>>
>> Here we see that PKT_HASH_TYPE_L4 supposed to use particular fields
>> as an input.
>
> If we strictly follow the above, do all NIC provide a L4 hash using only
> the above fields (src_IP, dst_IP, src_port, dst_port)? Having a quick
> look I'm pretty sure no, both 4 and 5-tuple can be used. What is
> important is at what level the distribution is.
I would read the above as 'at least these fields'. In the continuation
of the comment above it's, for example, allowed to set L3 when it is, in
fact, L4.
>
> So yes strictly speaking the above PKT_HASH_TYPE_L4 use can be a little
> surprising, but to me it's a shortcut or a missing update. For perfect
> correctness we could use
> __skb_set_hash(skb, tcp_rsk(req)->txhash, false, true) FWIW.
Only after the documentation change.
>
> Even l4_hash w/o taking the rnd case into account does not guarantee a
> stable hash for the lifetime of a flow; what happens if packets from the
> same flow are received on two NICs using different keys and/or algs?
Following the same logic we can't really say that it "provides a uniform
distribution over L4 flows" either. The fact that L4 fields were used
to calculate the hash, doesn't mean the hash function is any good.
So, the same way as stability can't be part of the definition, the
uniform distribution can't be as well. And the 'l4' part of the field
name looses the meaning completely. So, we will end up with:
* @l4_hash: Some number is stored in the 'hash' field.
At this point the is no reason to keep the flag at all.
In practice though, such setups are unlikley to be common. Surely, some
adequate distribution and some stability for hashes is implied.
Not the best possible distribution and not the absolute stability, but
good enough to rely on in many cases. And that's why the quoted comment
defines "Properties of hashes" this way.
> Being computed from L4 fields does not mean it is stable. If the stable
> property is needed, the hash has to be computed locally. And then comes
> the other topic of caching it for reuse and potential sharing across
> different consumers, sure.
>
> Now, I'll let some time to give a chance for others to chime in.
Sure.
>
> Thanks,
> Antoine
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH net-next 4/4] net: skbuff: fix l4_hash comment
2023-05-17 23:00 ` Ilya Maximets
@ 2023-05-23 15:25 ` Antoine Tenart
0 siblings, 0 replies; 21+ messages in thread
From: Antoine Tenart @ 2023-05-23 15:25 UTC (permalink / raw)
To: Dumitru Ceara, Eric Dumazet, Ilya Maximets
Cc: i.maximets, davem, kuba, pabeni, netdev
Quoting Ilya Maximets (2023-05-18 01:00:40)
> On 5/17/23 14:05, Antoine Tenart wrote:
> >
> > Even l4_hash w/o taking the rnd case into account does not guarantee a
> > stable hash for the lifetime of a flow; what happens if packets from the
> > same flow are received on two NICs using different keys and/or algs?
>
> Following the same logic we can't really say that it "provides a uniform
> distribution over L4 flows" either. The fact that L4 fields were used
> to calculate the hash, doesn't mean the hash function is any good.
Well drivers need to either trust the h/w in some ways or not use what
is provided if it's broken; or we can't be sure of anything. It's not
the same as an example where a valid setup can't guarantee a property by
design.
> > Now, I'll let some time to give a chance for others to chime in.
>
> Sure.
I don't think we'll get more guidance and we failed to come to an
agreement so let's keep this as-is for now; I'll send a v2 w/o this
documentation change.
As for a way forward and the stability need, IMHO the hash needs to be
computed where it is used (with potential cache, not reusing skb->hash)
but if you feel this should be addressed at this level a patch doing so
might at least get others to comment (both ways).
Thanks,
Antoine
^ permalink raw reply [flat|nested] 21+ messages in thread
end of thread, other threads:[~2023-05-23 15:25 UTC | newest]
Thread overview: 21+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-05-11 9:34 [PATCH net-next 0/4] net: tcp: make txhash use consistent for IPv4 Antoine Tenart
2023-05-11 9:34 ` [PATCH net-next 1/4] net: tcp: make the txhash available in TIME_WAIT sockets for IPv4 too Antoine Tenart
2023-05-11 9:34 ` [PATCH net-next 2/4] net: ipv4: use consistent txhash in TIME_WAIT and SYN_RECV Antoine Tenart
2023-05-11 9:34 ` [PATCH net-next 3/4] Documentation: net: net.core.txrehash is not specific to listening sockets Antoine Tenart
2023-05-11 9:34 ` [PATCH net-next 4/4] net: skbuff: fix l4_hash comment Antoine Tenart
2023-05-11 12:10 ` Dumitru Ceara
2023-05-11 12:33 ` Eric Dumazet
2023-05-11 13:00 ` Dumitru Ceara
2023-05-11 17:54 ` Ilya Maximets
2023-05-11 20:50 ` Dumitru Ceara
2023-05-15 8:12 ` Antoine Tenart
2023-05-15 18:23 ` Ilya Maximets
2023-05-16 7:36 ` Antoine Tenart
2023-05-16 21:25 ` Ilya Maximets
2023-05-17 12:05 ` Antoine Tenart
2023-05-17 23:00 ` Ilya Maximets
2023-05-23 15:25 ` Antoine Tenart
2023-05-11 10:24 ` [PATCH net-next 0/4] net: tcp: make txhash use consistent for IPv4 Eric Dumazet
2023-05-11 11:55 ` Antoine Tenart
2023-05-11 11:59 ` Ilya Maximets
-- strict thread matches above, loose matches on Subject: below --
2023-04-27 13:45 Antoine Tenart
2023-04-27 13:45 ` [PATCH net-next 3/4] Documentation: net: net.core.txrehash is not specific to listening sockets Antoine Tenart
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).