* [RFC PATCH 0/7] tun: Introduce virtio-net hashing feature
@ 2023-10-08 5:20 Akihiko Odaki
2023-10-08 5:20 ` [RFC PATCH 1/7] net: skbuff: Add tun_vnet_hash flag Akihiko Odaki
` (7 more replies)
0 siblings, 8 replies; 41+ messages in thread
From: Akihiko Odaki @ 2023-10-08 5:20 UTC (permalink / raw)
Cc: Willem de Bruijn, Jason Wang, Michael S. Tsirkin,
	Xuan Zhuo, Shuah Khan
virtio-net has two uses for hashes: one is RSS and the other is hash
reporting. Conventionally, the hash calculation was done by the VMM.
However, computing the hash after the queue was chosen defeats the
purpose of RSS.

Another approach is to use an eBPF steering program. This approach has
another downside: it cannot report the calculated hash due to the
restrictive nature of eBPF.

Introduce code to compute the hashes in the kernel in order to overcome
these challenges.
An alternative solution is to extend the eBPF steering program so that it
will be able to report to the userspace, but it makes little sense to
allow implementing different hashing algorithms with eBPF since the hash
value reported by virtio-net is strictly defined by the specification.

An implementation of this alternative solution was submitted as RFC
patches in the past:
https://lore.kernel.org/lkml/20210112194143.1494-1-yuri.benditovich@daynix.com/

QEMU patched to use this new feature is available at:
https://github.com/daynix/qemu/tree/akihikodaki/rss

The QEMU patches will soon be submitted upstream as RFCs too.
Akihiko Odaki (7):
net: skbuff: Add tun_vnet_hash flag
net/core: Ensure qdisc_skb_cb will not be overwritten
net: sched: Add members to qdisc_skb_cb
virtio_net: Add functions for hashing
tun: Introduce virtio-net hashing feature
selftest: tun: Add tests for virtio-net hashing
vhost_net: Support VIRTIO_NET_F_HASH_REPORT
drivers/net/tun.c | 187 ++++++++-
drivers/vhost/net.c | 16 +-
include/linux/skbuff.h | 2 +
include/linux/virtio_net.h | 157 ++++++++
include/net/sch_generic.h | 12 +-
include/uapi/linux/if_tun.h | 16 +
net/core/dev.c | 1 +
tools/testing/selftests/net/Makefile | 2 +-
tools/testing/selftests/net/tun.c | 578 ++++++++++++++++++++++++++-
9 files changed, 933 insertions(+), 38 deletions(-)
--
2.42.0
^ permalink raw reply	[flat|nested] 41+ messages in thread

* [RFC PATCH 1/7] net: skbuff: Add tun_vnet_hash flag
  2023-10-08 5:20 [RFC PATCH 0/7] tun: Introduce virtio-net hashing feature Akihiko Odaki
@ 2023-10-08 5:20 ` Akihiko Odaki
  2023-10-08 18:39   ` Willem de Bruijn
  2023-10-08 5:20 ` [RFC PATCH 2/7] net/core: Ensure qdisc_skb_cb will not be overwritten Akihiko Odaki
  ` (6 subsequent siblings)
  7 siblings, 1 reply; 41+ messages in thread
From: Akihiko Odaki @ 2023-10-08 5:20 UTC (permalink / raw)
Cc: Willem de Bruijn, Jason Wang, Michael S. Tsirkin,
	Xuan Zhuo, Shuah Khan

tun_vnet_hash can use this flag to indicate it stored virtio-net hash
cache to cb.

Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
---
 include/linux/skbuff.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 4174c4b82d13..e638f157c13c 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -837,6 +837,7 @@ typedef unsigned char *sk_buff_data_t;
 *	@truesize: Buffer size
 *	@users: User count - see {datagram,tcp}.c
 *	@extensions: allocated extensions, valid if active_extensions is nonzero
+ *	@tun_vnet_hash: tun stored virtio-net hash cache to cb
 */

 struct sk_buff {
@@ -989,6 +990,7 @@ struct sk_buff {
 #if IS_ENABLED(CONFIG_IP_SCTP)
	__u8			csum_not_inet:1;
 #endif
+	__u8			tun_vnet_hash:1;

 #if defined(CONFIG_NET_SCHED) || defined(CONFIG_NET_XGRESS)
	__u16			tc_index;	/* traffic control index */
--
2.42.0

^ permalink raw reply related	[flat|nested] 41+ messages in thread
* Re: [RFC PATCH 1/7] net: skbuff: Add tun_vnet_hash flag
  2023-10-08 5:20 ` [RFC PATCH 1/7] net: skbuff: Add tun_vnet_hash flag Akihiko Odaki
@ 2023-10-08 18:39   ` Willem de Bruijn
  2023-10-08 19:52     ` Akihiko Odaki
  0 siblings, 1 reply; 41+ messages in thread
From: Willem de Bruijn @ 2023-10-08 18:39 UTC (permalink / raw)
To: Akihiko Odaki
Cc: Jason Wang, Michael S. Tsirkin, netdev, linux-kernel, kvm,
	virtualization, linux-kselftest, bpf, davem, kuba, ast, daniel,
	andrii, kafai, songliubraving, yhs, john.fastabend, kpsingh,
	rdunlap, willemb, gustavoars, herbert, steffen.klassert, nogikh,
	pablo, decui, cai, jakub, elver, pabeni, Yuri Benditovich

On Sun, Oct 8, 2023 at 7:22 AM Akihiko Odaki <akihiko.odaki@daynix.com> wrote:
>
> tun_vnet_hash can use this flag to indicate it stored virtio-net hash
> cache to cb.
>
> Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
> ---
>  include/linux/skbuff.h | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index 4174c4b82d13..e638f157c13c 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -837,6 +837,7 @@ typedef unsigned char *sk_buff_data_t;
>  *	@truesize: Buffer size
>  *	@users: User count - see {datagram,tcp}.c
>  *	@extensions: allocated extensions, valid if active_extensions is nonzero
> + *	@tun_vnet_hash: tun stored virtio-net hash cache to cb
>  */
>
>  struct sk_buff {
> @@ -989,6 +990,7 @@ struct sk_buff {
>  #if IS_ENABLED(CONFIG_IP_SCTP)
> 	__u8			csum_not_inet:1;
>  #endif
> +	__u8			tun_vnet_hash:1;

sk_buff space is very limited.

No need to extend it, especially for code that stays within a single
subsystem (tun).

To a lesser extent the same point applies to the qdisc_skb_cb.

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: [RFC PATCH 1/7] net: skbuff: Add tun_vnet_hash flag
  2023-10-08 18:39   ` Willem de Bruijn
@ 2023-10-08 19:52     ` Akihiko Odaki
  0 siblings, 0 replies; 41+ messages in thread
From: Akihiko Odaki @ 2023-10-08 19:52 UTC (permalink / raw)
To: Willem de Bruijn
Cc: Jason Wang, Michael S. Tsirkin, netdev, linux-kernel, kvm,
	virtualization, linux-kselftest, bpf, davem, kuba, ast, daniel,
	andrii, kafai, songliubraving, yhs, john.fastabend, kpsingh,
	rdunlap, willemb, gustavoars, herbert, steffen.klassert, nogikh,
	pablo, decui, cai, jakub, elver, pabeni, Yuri Benditovich

On 2023/10/09 3:39, Willem de Bruijn wrote:
> On Sun, Oct 8, 2023 at 7:22 AM Akihiko Odaki <akihiko.odaki@daynix.com> wrote:
>>
>> tun_vnet_hash can use this flag to indicate it stored virtio-net hash
>> cache to cb.
>>
>> Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
>> ---
>>   include/linux/skbuff.h | 2 ++
>>   1 file changed, 2 insertions(+)
>>
>> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
>> index 4174c4b82d13..e638f157c13c 100644
>> --- a/include/linux/skbuff.h
>> +++ b/include/linux/skbuff.h
>> @@ -837,6 +837,7 @@ typedef unsigned char *sk_buff_data_t;
>>   *	@truesize: Buffer size
>>   *	@users: User count - see {datagram,tcp}.c
>>   *	@extensions: allocated extensions, valid if active_extensions is nonzero
>> + *	@tun_vnet_hash: tun stored virtio-net hash cache to cb
>>   */
>>
>>   struct sk_buff {
>> @@ -989,6 +990,7 @@ struct sk_buff {
>>   #if IS_ENABLED(CONFIG_IP_SCTP)
>> 	__u8			csum_not_inet:1;
>>   #endif
>> +	__u8			tun_vnet_hash:1;
>
> sk_buff space is very limited.
>
> No need to extend it, especially for code that stays within a single
> subsystem (tun).
>
> To a lesser extent the same point applies to the qdisc_skb_cb.

I had to extend sk_buff because it does not stay in tun but moves back
and forth between qdisc and tun. The new members of sk_buff and
qdisc_skb_cb are stored by tun's ndo_select_queue(). The control will go
back to qdisc after ndo_select_queue() function finishes. Eventually
tun's ndo_start_xmit() will be called by qdisc and consumes the stored
members. qdisc is required to keep the stored members intact.

tun_vnet_hash is a bit special. It is put into sk_buff because
ndo_select_queue() is not always called and it may be left
uninitialized. ndo_start_xmit() may read some garbage from cb's old user
if it is put into cb.

^ permalink raw reply	[flat|nested] 41+ messages in thread
* [RFC PATCH 2/7] net/core: Ensure qdisc_skb_cb will not be overwritten
  2023-10-08 5:20 [RFC PATCH 0/7] tun: Introduce virtio-net hashing feature Akihiko Odaki
  2023-10-08 5:20 ` [RFC PATCH 1/7] net: skbuff: Add tun_vnet_hash flag Akihiko Odaki
@ 2023-10-08 5:20 ` Akihiko Odaki
  2023-10-08 5:20 ` [RFC PATCH 3/7] net: sched: Add members to qdisc_skb_cb Akihiko Odaki
  ` (5 subsequent siblings)
  7 siblings, 0 replies; 41+ messages in thread
From: Akihiko Odaki @ 2023-10-08 5:20 UTC (permalink / raw)
Cc: Willem de Bruijn, Jason Wang, Michael S. Tsirkin,
	Xuan Zhuo, Shuah Khan

Later qdisc_skb_cb will be used to deliver virtio-net hash information
for tun.

Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
---
 net/core/dev.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/core/dev.c b/net/core/dev.c
index 85df22f05c38..305fb465cc1e 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3622,6 +3622,7 @@ static struct sk_buff *validate_xmit_skb(struct sk_buff *skb, struct net_device
	if (netif_needs_gso(skb, features)) {
		struct sk_buff *segs;

+		BUILD_BUG_ON(sizeof(struct qdisc_skb_cb) > SKB_GSO_CB_OFFSET);
		segs = skb_gso_segment(skb, features);
		if (IS_ERR(segs)) {
			goto out_kfree_skb;
--
2.42.0

^ permalink raw reply related	[flat|nested] 41+ messages in thread
* [RFC PATCH 3/7] net: sched: Add members to qdisc_skb_cb
  2023-10-08 5:20 [RFC PATCH 0/7] tun: Introduce virtio-net hashing feature Akihiko Odaki
  2023-10-08 5:20 ` [RFC PATCH 1/7] net: skbuff: Add tun_vnet_hash flag Akihiko Odaki
  2023-10-08 5:20 ` [RFC PATCH 2/7] net/core: Ensure qdisc_skb_cb will not be overwritten Akihiko Odaki
@ 2023-10-08 5:20 ` Akihiko Odaki
  2023-10-08 5:20 ` [RFC PATCH 4/7] virtio_net: Add functions for hashing Akihiko Odaki
  ` (4 subsequent siblings)
  7 siblings, 0 replies; 41+ messages in thread
From: Akihiko Odaki @ 2023-10-08 5:20 UTC (permalink / raw)
Cc: Willem de Bruijn, Jason Wang, Michael S. Tsirkin,
	Xuan Zhuo, Shuah Khan

The new members will be used to deliver virtio-net hash information by
tun. The other members are also reordered so that the overall size will
not change.

Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
---
 include/net/sch_generic.h | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index f232512505f8..9dfdc63859c7 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -430,13 +430,15 @@ struct tcf_proto {
 };

 struct qdisc_skb_cb {
-	struct {
-		unsigned int		pkt_len;
-		u16			slave_dev_queue_mapping;
-		u16			tc_classid;
-	};
 #define QDISC_CB_PRIV_LEN 20
	unsigned char		data[QDISC_CB_PRIV_LEN];
+	unsigned int		pkt_len;
+	u16			slave_dev_queue_mapping;
+	union {
+		u16		tc_classid;
+		u16		tun_vnet_hash_report;
+	};
+	u32			tun_vnet_hash_value;
 };

 typedef void tcf_chain_head_change_t(struct tcf_proto *tp_head, void *priv);
--
2.42.0

^ permalink raw reply related	[flat|nested] 41+ messages in thread
* [RFC PATCH 4/7] virtio_net: Add functions for hashing
  2023-10-08 5:20 [RFC PATCH 0/7] tun: Introduce virtio-net hashing feature Akihiko Odaki
  ` (2 preceding siblings ...)
  2023-10-08 5:20 ` [RFC PATCH 3/7] net: sched: Add members to qdisc_skb_cb Akihiko Odaki
@ 2023-10-08 5:20 ` Akihiko Odaki
  2023-10-08 5:20 ` [RFC PATCH 5/7] tun: Introduce virtio-net hashing feature Akihiko Odaki
  ` (3 subsequent siblings)
  7 siblings, 0 replies; 41+ messages in thread
From: Akihiko Odaki @ 2023-10-08 5:20 UTC (permalink / raw)
Cc: Willem de Bruijn, Jason Wang, Michael S. Tsirkin,
	Xuan Zhuo, Shuah Khan

They are useful to implement VIRTIO_NET_F_RSS and
VIRTIO_NET_F_HASH_REPORT.

Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
---
 include/linux/virtio_net.h | 157 +++++++++++++++++++++++++++++++++++++
 1 file changed, 157 insertions(+)

diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h
index 7b4dd69555e4..f05781ddc261 100644
--- a/include/linux/virtio_net.h
+++ b/include/linux/virtio_net.h
@@ -7,6 +7,143 @@
 #include <uapi/linux/udp.h>
 #include <uapi/linux/virtio_net.h>

+struct virtio_net_hash {
+	u32 value;
+	u16 report;
+};
+
+struct virtio_net_toeplitz_state {
+	u32 hash;
+	u32 key_buffer;
+	const u32 *key;
+};
+
+#define VIRTIO_NET_SUPPORTED_HASH_TYPES (VIRTIO_NET_RSS_HASH_TYPE_IPv4 | \
+					 VIRTIO_NET_RSS_HASH_TYPE_TCPv4 | \
+					 VIRTIO_NET_RSS_HASH_TYPE_UDPv4 | \
+					 VIRTIO_NET_RSS_HASH_TYPE_IPv6 | \
+					 VIRTIO_NET_RSS_HASH_TYPE_TCPv6 | \
+					 VIRTIO_NET_RSS_HASH_TYPE_UDPv6)
+
+static inline void virtio_net_toeplitz(struct virtio_net_toeplitz_state *state,
+				       const u32 *input, size_t len)
+{
+	u32 key;
+
+	while (len) {
+		state->key++;
+		key = ntohl(*state->key);
+
+		for (u32 bit = BIT(31); bit; bit >>= 1) {
+			if (*input & bit)
+				state->hash ^= state->key_buffer;
+
+			state->key_buffer =
+				(state->key_buffer << 1) | !!(key & bit);
+		}
+
+		input++;
+		len--;
+	}
+}
+
+static inline u8 virtio_net_hash_key_length(u32 types)
+{
+	size_t len = 0;
+
+	if (types & VIRTIO_NET_HASH_REPORT_IPv4)
+		len = max(len,
+			  sizeof(struct flow_dissector_key_ipv4_addrs));
+
+	if (types &
+	    (VIRTIO_NET_HASH_REPORT_TCPv4 | VIRTIO_NET_HASH_REPORT_UDPv4))
+		len = max(len,
+			  sizeof(struct flow_dissector_key_ipv4_addrs) +
+			  sizeof(struct flow_dissector_key_ports));
+
+	if (types & VIRTIO_NET_HASH_REPORT_IPv6)
+		len = max(len,
+			  sizeof(struct flow_dissector_key_ipv6_addrs));
+
+	if (types &
+	    (VIRTIO_NET_HASH_REPORT_TCPv6 | VIRTIO_NET_HASH_REPORT_UDPv6))
+		len = max(len,
+			  sizeof(struct flow_dissector_key_ipv6_addrs) +
+			  sizeof(struct flow_dissector_key_ports));
+
+	return 4 + len;
+}
+
+static inline void virtio_net_hash(const struct sk_buff *skb,
+				   u32 types, const u32 *key,
+				   struct virtio_net_hash *hash)
+{
+	u16 report = VIRTIO_NET_HASH_REPORT_NONE;
+	struct virtio_net_toeplitz_state toeplitz_state = {
+		.key_buffer = ntohl(*key),
+		.key = key
+	};
+	struct flow_keys flow;
+
+	if (!skb_flow_dissect_flow_keys(skb, &flow, 0))
+		return;
+
+	switch (flow.basic.n_proto) {
+	case htons(ETH_P_IP):
+		if (flow.basic.ip_proto == IPPROTO_TCP &&
+		    (types & VIRTIO_NET_RSS_HASH_TYPE_TCPv4)) {
+			report = VIRTIO_NET_HASH_REPORT_TCPv4;
+			virtio_net_toeplitz(&toeplitz_state,
+					    (u32 *)&flow.addrs.v4addrs,
+					    sizeof(flow.addrs.v4addrs) / 4);
+			virtio_net_toeplitz(&toeplitz_state, &flow.ports.ports,
+					    1);
+		} else if (flow.basic.ip_proto == IPPROTO_UDP &&
+			   (types & VIRTIO_NET_RSS_HASH_TYPE_UDPv4)) {
+			report = VIRTIO_NET_HASH_REPORT_UDPv4;
+			virtio_net_toeplitz(&toeplitz_state,
+					    (u32 *)&flow.addrs.v4addrs,
+					    sizeof(flow.addrs.v4addrs) / 4);
+			virtio_net_toeplitz(&toeplitz_state, &flow.ports.ports,
+					    1);
+		} else if (types & VIRTIO_NET_RSS_HASH_TYPE_IPv4) {
+			report = VIRTIO_NET_HASH_REPORT_IPv4;
+			virtio_net_toeplitz(&toeplitz_state,
+					    (u32 *)&flow.addrs.v4addrs,
+					    sizeof(flow.addrs.v4addrs) / 4);
+		}
+		break;
+
+	case htons(ETH_P_IPV6):
+		if (flow.basic.ip_proto == IPPROTO_TCP &&
+		    (types & VIRTIO_NET_RSS_HASH_TYPE_TCPv6)) {
+			report = VIRTIO_NET_HASH_REPORT_TCPv6;
+			virtio_net_toeplitz(&toeplitz_state,
+					    (u32 *)&flow.addrs.v6addrs,
+					    sizeof(flow.addrs.v6addrs) / 4);
+			virtio_net_toeplitz(&toeplitz_state, &flow.ports.ports,
+					    1);
+		} else if (flow.basic.ip_proto == IPPROTO_UDP &&
+			   (types & VIRTIO_NET_RSS_HASH_TYPE_UDPv6)) {
+			report = VIRTIO_NET_HASH_REPORT_UDPv6;
+			virtio_net_toeplitz(&toeplitz_state,
+					    (u32 *)&flow.addrs.v6addrs,
+					    sizeof(flow.addrs.v6addrs) / 4);
+			virtio_net_toeplitz(&toeplitz_state, &flow.ports.ports,
+					    1);
+		} else if (types & VIRTIO_NET_RSS_HASH_TYPE_IPv6) {
+			report = VIRTIO_NET_HASH_REPORT_IPv6;
+			virtio_net_toeplitz(&toeplitz_state,
+					    (u32 *)&flow.addrs.v6addrs,
+					    sizeof(flow.addrs.v6addrs) / 4);
+		}
+		break;
+	}
+
+	hash->value = toeplitz_state.hash;
+	hash->report = report;
+}
+
 static inline bool virtio_net_hdr_match_proto(__be16 protocol, __u8 gso_type)
 {
	switch (gso_type & ~VIRTIO_NET_HDR_GSO_ECN) {
@@ -216,4 +353,24 @@ static inline int virtio_net_hdr_from_skb(const struct sk_buff *skb,
	return 0;
 }

+static inline int virtio_net_hdr_v1_hash_from_skb(const struct sk_buff *skb,
+						  struct virtio_net_hdr_v1_hash *hdr,
+						  bool has_data_valid,
+						  int vlan_hlen,
+						  const struct virtio_net_hash *hash)
+{
+	int ret;
+
+	memset(hdr, 0, sizeof(*hdr));
+
+	ret = virtio_net_hdr_from_skb(skb, (struct virtio_net_hdr *)hdr,
+				      true, has_data_valid, vlan_hlen);
+	if (!ret) {
+		hdr->hash_value = cpu_to_le32(hash->value);
+		hdr->hash_report = cpu_to_le16(hash->report);
+	}
+
+	return ret;
+}
+
 #endif /* _LINUX_VIRTIO_NET_H */
--
2.42.0

^ permalink raw reply related	[flat|nested] 41+ messages in thread
* [RFC PATCH 5/7] tun: Introduce virtio-net hashing feature
  2023-10-08 5:20 [RFC PATCH 0/7] tun: Introduce virtio-net hashing feature Akihiko Odaki
  ` (3 preceding siblings ...)
  2023-10-08 5:20 ` [RFC PATCH 4/7] virtio_net: Add functions for hashing Akihiko Odaki
@ 2023-10-08 5:20 ` Akihiko Odaki
  2023-10-08 19:07   ` Willem de Bruijn
  ` (2 more replies)
  2023-10-08 5:20 ` [RFC PATCH 6/7] selftest: tun: Add tests for virtio-net hashing Akihiko Odaki
  ` (2 subsequent siblings)
  7 siblings, 3 replies; 41+ messages in thread
From: Akihiko Odaki @ 2023-10-08 5:20 UTC (permalink / raw)
Cc: Willem de Bruijn, Jason Wang, Michael S. Tsirkin,
	Xuan Zhuo, Shuah Khan

virtio-net has two uses for hashes: one is RSS and the other is hash
reporting. Conventionally, the hash calculation was done by the VMM.
However, computing the hash after the queue was chosen defeats the
purpose of RSS.

Another approach is to use an eBPF steering program. This approach has
another downside: it cannot report the calculated hash due to the
restrictive nature of eBPF.

Introduce code to compute the hashes in the kernel in order to overcome
these challenges. An alternative solution is to extend the eBPF steering
program so that it will be able to report to the userspace, but it makes
little sense to allow implementing different hashing algorithms with
eBPF since the hash value reported by virtio-net is strictly defined by
the specification.

The hash value already stored in sk_buff is not used and is computed
independently since it may have been computed in a way not conformant
with the specification.

Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
---
 drivers/net/tun.c           | 187 ++++++++++++++++++++++++++++++++----
 include/uapi/linux/if_tun.h |  16 +++
 2 files changed, 182 insertions(+), 21 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 89ab9efe522c..561a573cd008 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -171,6 +171,9 @@ struct tun_prog {
	struct bpf_prog *prog;
 };

+#define TUN_VNET_HASH_MAX_KEY_SIZE 40
+#define TUN_VNET_HASH_MAX_INDIRECTION_TABLE_LENGTH 128
+
 /* Since the socket were moved to tun_file, to preserve the behavior of persist
  * device, socket filter, sndbuf and vnet header size were restore when the
  * file were attached to a persist device.
@@ -209,6 +212,9 @@ struct tun_struct {
	struct tun_prog __rcu *steering_prog;
	struct tun_prog __rcu *filter_prog;
	struct ethtool_link_ksettings link_ksettings;
+	struct tun_vnet_hash vnet_hash;
+	u16 vnet_hash_indirection_table[TUN_VNET_HASH_MAX_INDIRECTION_TABLE_LENGTH];
+	u32 vnet_hash_key[TUN_VNET_HASH_MAX_KEY_SIZE / 4];
	/* init args */
	struct file *file;
	struct ifreq *ifr;
@@ -219,6 +225,13 @@ struct veth {
	__be16 h_vlan_TCI;
 };

+static const struct tun_vnet_hash_cap tun_vnet_hash_cap = {
+	.max_indirection_table_length =
+		TUN_VNET_HASH_MAX_INDIRECTION_TABLE_LENGTH,
+
+	.types = VIRTIO_NET_SUPPORTED_HASH_TYPES
+};
+
 static void tun_flow_init(struct tun_struct *tun);
 static void tun_flow_uninit(struct tun_struct *tun);

@@ -320,10 +333,16 @@ static long tun_set_vnet_be(struct tun_struct *tun, int __user *argp)
	if (get_user(be, argp))
		return -EFAULT;

-	if (be)
+	if (be) {
+		if (!(tun->flags & TUN_VNET_LE) &&
+		    (tun->vnet_hash.flags & TUN_VNET_HASH_REPORT)) {
+			return -EINVAL;
+		}
+
		tun->flags |= TUN_VNET_BE;
-	else
+	} else {
		tun->flags &= ~TUN_VNET_BE;
+	}

	return 0;
 }
@@ -558,15 +577,47 @@ static u16 tun_ebpf_select_queue(struct tun_struct *tun, struct sk_buff *skb)
	return ret % numqueues;
 }

+static u16 tun_vnet_select_queue(struct tun_struct *tun, struct sk_buff *skb)
+{
+	u32 value = qdisc_skb_cb(skb)->tun_vnet_hash_value;
+	u32 numqueues;
+	u32 index;
+	u16 queue;
+
+	numqueues = READ_ONCE(tun->numqueues);
+	if (!numqueues)
+		return 0;
+
+	index = value & READ_ONCE(tun->vnet_hash.indirection_table_mask);
+	queue = READ_ONCE(tun->vnet_hash_indirection_table[index]);
+	if (!queue)
+		queue = READ_ONCE(tun->vnet_hash.unclassified_queue);
+
+	return queue % numqueues;
+}
+
 static u16 tun_select_queue(struct net_device *dev, struct sk_buff *skb,
			    struct net_device *sb_dev)
 {
	struct tun_struct *tun = netdev_priv(dev);
+	u8 vnet_hash_flags = READ_ONCE(tun->vnet_hash.flags);
+	struct virtio_net_hash hash;
	u16 ret;

+	if (vnet_hash_flags & (TUN_VNET_HASH_RSS | TUN_VNET_HASH_REPORT)) {
+		virtio_net_hash(skb, READ_ONCE(tun->vnet_hash.types),
+				tun->vnet_hash_key, &hash);
+
+		skb->tun_vnet_hash = true;
+		qdisc_skb_cb(skb)->tun_vnet_hash_value = hash.value;
+		qdisc_skb_cb(skb)->tun_vnet_hash_report = hash.report;
+	}
+
	rcu_read_lock();
	if (rcu_dereference(tun->steering_prog))
		ret = tun_ebpf_select_queue(tun, skb);
+	else if (vnet_hash_flags & TUN_VNET_HASH_RSS)
+		ret = tun_vnet_select_queue(tun, skb);
	else
		ret = tun_automq_select_queue(tun, skb);
	rcu_read_unlock();
@@ -2088,10 +2139,15 @@ static ssize_t tun_put_user(struct tun_struct *tun,
			    struct iov_iter *iter)
 {
	struct tun_pi pi = { 0, skb->protocol };
+	struct virtio_net_hash vnet_hash = {
+		.value = qdisc_skb_cb(skb)->tun_vnet_hash_value,
+		.report = qdisc_skb_cb(skb)->tun_vnet_hash_report
+	};
	ssize_t total;
	int vlan_offset = 0;
	int vlan_hlen = 0;
	int vnet_hdr_sz = 0;
+	size_t vnet_hdr_content_sz;

	if (skb_vlan_tag_present(skb))
		vlan_hlen = VLAN_HLEN;
@@ -2116,31 +2172,49 @@ static ssize_t tun_put_user(struct tun_struct *tun,
	}

	if (vnet_hdr_sz) {
-		struct virtio_net_hdr gso;
+		union {
+			struct virtio_net_hdr hdr;
+			struct virtio_net_hdr_v1_hash v1_hash_hdr;
+		} hdr;
+		int ret;

		if (iov_iter_count(iter) < vnet_hdr_sz)
			return -EINVAL;

-		if (virtio_net_hdr_from_skb(skb, &gso,
-					    tun_is_little_endian(tun), true,
-					    vlan_hlen)) {
+		if ((READ_ONCE(tun->vnet_hash.flags) & TUN_VNET_HASH_REPORT) &&
+		    vnet_hdr_sz >= sizeof(hdr.v1_hash_hdr) &&
+		    skb->tun_vnet_hash) {
+			vnet_hdr_content_sz = sizeof(hdr.v1_hash_hdr);
+			ret = virtio_net_hdr_v1_hash_from_skb(skb,
+							      &hdr.v1_hash_hdr,
+							      true,
+							      vlan_hlen,
+							      &vnet_hash);
+		} else {
+			vnet_hdr_content_sz = sizeof(hdr.hdr);
+			ret = virtio_net_hdr_from_skb(skb, &hdr.hdr,
+						      tun_is_little_endian(tun),
+						      true, vlan_hlen);
+		}
+
+		if (ret) {
			struct skb_shared_info *sinfo = skb_shinfo(skb);
			pr_err("unexpected GSO type: "
			       "0x%x, gso_size %d, hdr_len %d\n",
-			       sinfo->gso_type, tun16_to_cpu(tun, gso.gso_size),
-			       tun16_to_cpu(tun, gso.hdr_len));
+			       sinfo->gso_type, tun16_to_cpu(tun, hdr.hdr.gso_size),
+			       tun16_to_cpu(tun, hdr.hdr.hdr_len));
			print_hex_dump(KERN_ERR, "tun: ",
				       DUMP_PREFIX_NONE,
				       16, 1, skb->head,
-				       min((int)tun16_to_cpu(tun, gso.hdr_len), 64), true);
+				       min((int)tun16_to_cpu(tun, hdr.hdr.hdr_len), 64), true);
			WARN_ON_ONCE(1);
			return -EINVAL;
		}

-		if (copy_to_iter(&gso, sizeof(gso), iter) != sizeof(gso))
+		if (copy_to_iter(&hdr, vnet_hdr_content_sz, iter) != vnet_hdr_content_sz)
			return -EFAULT;

-		iov_iter_advance(iter, vnet_hdr_sz - sizeof(gso));
+		iov_iter_advance(iter, vnet_hdr_sz - vnet_hdr_content_sz);
	}

	if (vlan_hlen) {
@@ -3007,24 +3081,27 @@ static int tun_set_queue(struct file *file, struct ifreq *ifr)
	return ret;
 }

-static int tun_set_ebpf(struct tun_struct *tun, struct tun_prog __rcu **prog_p,
-			void __user *data)
+static struct bpf_prog *tun_set_ebpf(struct tun_struct *tun,
+				     struct tun_prog __rcu **prog_p,
+				     void __user *data)
 {
	struct bpf_prog *prog;
	int fd;
+	int ret;

	if (copy_from_user(&fd, data, sizeof(fd)))
-		return -EFAULT;
+		return ERR_PTR(-EFAULT);

	if (fd == -1) {
		prog = NULL;
	} else {
		prog = bpf_prog_get_type(fd, BPF_PROG_TYPE_SOCKET_FILTER);
		if (IS_ERR(prog))
-			return PTR_ERR(prog);
+			return prog;
	}

-	return __tun_set_ebpf(tun, prog_p, prog);
+	ret = __tun_set_ebpf(tun, prog_p, prog);
+	return ret ? ERR_PTR(ret) : prog;
 }

 /* Return correct value for tun->dev->addr_len based on tun->dev->type. */
@@ -3082,6 +3159,11 @@ static long __tun_chr_ioctl(struct file *file, unsigned int cmd,
	int le;
	int ret;
	bool do_notify = false;
+	struct bpf_prog *bpf_ret;
+	struct tun_vnet_hash vnet_hash;
+	u16 vnet_hash_indirection_table[TUN_VNET_HASH_MAX_INDIRECTION_TABLE_LENGTH];
+	u8 vnet_hash_key[TUN_VNET_HASH_MAX_KEY_SIZE];
+	size_t len;

	if (cmd == TUNSETIFF || cmd == TUNSETQUEUE ||
	    (_IOC_TYPE(cmd) == SOCK_IOC_TYPE && cmd != SIOCGSKNS)) {
@@ -3295,7 +3377,10 @@ static long __tun_chr_ioctl(struct file *file, unsigned int cmd,
			ret = -EFAULT;
			break;
		}
-		if (vnet_hdr_sz < (int)sizeof(struct virtio_net_hdr)) {
+		if (vnet_hdr_sz <
+		    (int)((tun->vnet_hash.flags & TUN_VNET_HASH_REPORT) ?
+			  sizeof(struct virtio_net_hdr_v1_hash) :
+			  sizeof(struct virtio_net_hdr))) {
			ret = -EINVAL;
			break;
		}
@@ -3314,10 +3399,16 @@ static long __tun_chr_ioctl(struct file *file, unsigned int cmd,
			ret = -EFAULT;
			break;
		}
-		if (le)
+		if (le) {
			tun->flags |= TUN_VNET_LE;
-		else
+		} else {
+			if (!tun_legacy_is_little_endian(tun)) {
+				ret = -EINVAL;
+				break;
+			}
+
			tun->flags &= ~TUN_VNET_LE;
+		}
		break;

	case TUNGETVNETBE:
@@ -3360,11 +3451,17 @@ static long __tun_chr_ioctl(struct file *file, unsigned int cmd,
		break;

	case TUNSETSTEERINGEBPF:
-		ret = tun_set_ebpf(tun, &tun->steering_prog, argp);
+		bpf_ret = tun_set_ebpf(tun, &tun->steering_prog, argp);
+		if (IS_ERR(bpf_ret))
+			ret = PTR_ERR(bpf_ret);
+		else if (bpf_ret)
+			tun->vnet_hash.flags &= ~TUN_VNET_HASH_RSS;
		break;

	case TUNSETFILTEREBPF:
-		ret = tun_set_ebpf(tun, &tun->filter_prog, argp);
+		bpf_ret = tun_set_ebpf(tun, &tun->filter_prog, argp);
+		if (IS_ERR(bpf_ret))
+			ret = PTR_ERR(bpf_ret);
		break;

	case TUNSETCARRIER:
@@ -3382,6 +3479,54 @@ static long __tun_chr_ioctl(struct file *file, unsigned int cmd,
		ret = open_related_ns(&net->ns, get_net_ns);
		break;

+	case TUNGETVNETHASHCAP:
+		if (copy_to_user(argp, &tun_vnet_hash_cap,
+				 sizeof(tun_vnet_hash_cap)))
+			ret = -EFAULT;
+		break;
+
+	case TUNSETVNETHASH:
+		len = sizeof(vnet_hash);
+		if (copy_from_user(&vnet_hash, argp, len)) {
+			ret = -EFAULT;
+			break;
+		}
+
+		if (((vnet_hash.flags & TUN_VNET_HASH_REPORT) &&
+		     (tun->vnet_hdr_sz < sizeof(struct virtio_net_hdr_v1_hash) ||
+		      !tun_is_little_endian(tun))) ||
+		    vnet_hash.indirection_table_mask >=
+		    TUN_VNET_HASH_MAX_INDIRECTION_TABLE_LENGTH) {
+			ret = -EINVAL;
+			break;
+		}
+
+		argp = (u8 __user *)argp + len;
+		len = (vnet_hash.indirection_table_mask + 1) * 2;
+		if (copy_from_user(vnet_hash_indirection_table, argp, len)) {
+			ret = -EFAULT;
+			break;
+		}
+
+		argp = (u8 __user *)argp + len;
+		len = virtio_net_hash_key_length(vnet_hash.types);
+
+		if (copy_from_user(vnet_hash_key, argp, len)) {
+			ret = -EFAULT;
+			break;
+		}
+
+		tun->vnet_hash = vnet_hash;
+		memcpy(tun->vnet_hash_indirection_table,
+		       vnet_hash_indirection_table,
+		       (vnet_hash.indirection_table_mask + 1) * 2);
+		memcpy(tun->vnet_hash_key, vnet_hash_key, len);
+
+		if (vnet_hash.flags & TUN_VNET_HASH_RSS)
+			__tun_set_ebpf(tun, &tun->steering_prog, NULL);
+
+		break;
+
	default:
		ret = -EINVAL;
		break;

diff --git a/include/uapi/linux/if_tun.h b/include/uapi/linux/if_tun.h
index 287cdc81c939..dc591cd897c8 100644
--- a/include/uapi/linux/if_tun.h
+++ b/include/uapi/linux/if_tun.h
@@ -61,6 +61,8 @@
 #define TUNSETFILTEREBPF _IOR('T', 225, int)
 #define TUNSETCARRIER _IOW('T', 226, int)
 #define TUNGETDEVNETNS _IO('T', 227)
+#define TUNGETVNETHASHCAP _IO('T', 228)
+#define TUNSETVNETHASH _IOW('T', 229, unsigned int)

 /* TUNSETIFF ifr flags */
 #define IFF_TUN		0x0001
@@ -115,4 +117,18 @@ struct tun_filter {
	__u8   addr[][ETH_ALEN];
 };

+struct tun_vnet_hash_cap {
+	__u16 max_indirection_table_length;
+	__u32 types;
+};
+
+#define TUN_VNET_HASH_RSS	0x01
+#define TUN_VNET_HASH_REPORT	0x02
+struct tun_vnet_hash {
+	__u8 flags;
+	__u32 types;
+	__u16 indirection_table_mask;
+	__u16 unclassified_queue;
+};
+
 #endif /* _UAPI__IF_TUN_H */
--
2.42.0

^ permalink raw reply related	[flat|nested] 41+ messages in thread
* Re: [RFC PATCH 5/7] tun: Introduce virtio-net hashing feature
  2023-10-08 5:20 ` [RFC PATCH 5/7] tun: Introduce virtio-net hashing feature Akihiko Odaki
@ 2023-10-08 19:07   ` Willem de Bruijn
  2023-10-08 20:04     ` Akihiko Odaki
  2023-10-09 8:13   ` Willem de Bruijn
  2023-10-09 11:38   ` Michael S. Tsirkin
  2 siblings, 1 reply; 41+ messages in thread
From: Willem de Bruijn @ 2023-10-08 19:07 UTC (permalink / raw)
To: Akihiko Odaki
Cc: Jason Wang, Michael S. Tsirkin, netdev, linux-kernel, kvm,
	virtualization, linux-kselftest, bpf, davem, kuba, ast, daniel,
	andrii, kafai, songliubraving, yhs, john.fastabend, kpsingh,
	rdunlap, willemb, gustavoars, herbert, steffen.klassert, nogikh,
	pablo, decui, jakub, elver, pabeni, Yuri Benditovich

On Sun, Oct 8, 2023 at 7:22 AM Akihiko Odaki <akihiko.odaki@daynix.com> wrote:
>
> virtio-net has two uses for hashes: one is RSS and the other is hash
> reporting. Conventionally, the hash calculation was done by the VMM.
> However, computing the hash after the queue was chosen defeats the
> purpose of RSS.
>
> Another approach is to use an eBPF steering program. This approach has
> another downside: it cannot report the calculated hash due to the
> restrictive nature of eBPF.
>
> Introduce code to compute the hashes in the kernel in order to overcome
> these challenges. An alternative solution is to extend the eBPF steering
> program so that it will be able to report to the userspace, but it makes
> little sense to allow implementing different hashing algorithms with
> eBPF since the hash value reported by virtio-net is strictly defined by
> the specification.
>
> The hash value already stored in sk_buff is not used and is computed
> independently since it may have been computed in a way not conformant
> with the specification.
>
> Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
> ---

> +static const struct tun_vnet_hash_cap tun_vnet_hash_cap = {
> +	.max_indirection_table_length =
> +		TUN_VNET_HASH_MAX_INDIRECTION_TABLE_LENGTH,
> +
> +	.types = VIRTIO_NET_SUPPORTED_HASH_TYPES
> +};

No need to have explicit capabilities exchange like this? Tun either
supports all or none.

>	case TUNSETSTEERINGEBPF:
> -		ret = tun_set_ebpf(tun, &tun->steering_prog, argp);
> +		bpf_ret = tun_set_ebpf(tun, &tun->steering_prog, argp);
> +		if (IS_ERR(bpf_ret))
> +			ret = PTR_ERR(bpf_ret);
> +		else if (bpf_ret)
> +			tun->vnet_hash.flags &= ~TUN_VNET_HASH_RSS;

Don't make one feature disable another.

TUNSETSTEERINGEBPF and TUNSETVNETHASH are mutually exclusive
functions. If one is enabled the other call should fail, with EBUSY
for instance.

> +	case TUNSETVNETHASH:
> +		len = sizeof(vnet_hash);
> +		if (copy_from_user(&vnet_hash, argp, len)) {
> +			ret = -EFAULT;
> +			break;
> +		}
> +
> +		if (((vnet_hash.flags & TUN_VNET_HASH_REPORT) &&
> +		     (tun->vnet_hdr_sz < sizeof(struct virtio_net_hdr_v1_hash) ||
> +		      !tun_is_little_endian(tun))) ||
> +		    vnet_hash.indirection_table_mask >=
> +		    TUN_VNET_HASH_MAX_INDIRECTION_TABLE_LENGTH) {
> +			ret = -EINVAL;
> +			break;
> +		}
> +
> +		argp = (u8 __user *)argp + len;
> +		len = (vnet_hash.indirection_table_mask + 1) * 2;
> +		if (copy_from_user(vnet_hash_indirection_table, argp, len)) {
> +			ret = -EFAULT;
> +			break;
> +		}
> +
> +		argp = (u8 __user *)argp + len;
> +		len = virtio_net_hash_key_length(vnet_hash.types);
> +
> +		if (copy_from_user(vnet_hash_key, argp, len)) {
> +			ret = -EFAULT;
> +			break;
> +		}

Probably easier and less error-prone to define a fixed size control
struct with the max indirection table size.

Btw: please trim the CC: list considerably on future patches.

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: [RFC PATCH 5/7] tun: Introduce virtio-net hashing feature 2023-10-08 19:07 ` Willem de Bruijn @ 2023-10-08 20:04 ` Akihiko Odaki 2023-10-08 20:08 ` Willem de Bruijn 0 siblings, 1 reply; 41+ messages in thread From: Akihiko Odaki @ 2023-10-08 20:04 UTC (permalink / raw) To: Willem de Bruijn Cc: Jason Wang, Michael S. Tsirkin, netdev, linux-kernel, kvm, virtualization, linux-kselftest, bpf, davem, kuba, ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend, kpsingh, rdunlap, willemb, gustavoars, herbert, steffen.klassert, nogikh, pablo, decui, jakub, elver, pabeni, Yuri Benditovich On 2023/10/09 4:07, Willem de Bruijn wrote: > On Sun, Oct 8, 2023 at 7:22 AM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: >> >> virtio-net have two usage of hashes: one is RSS and another is hash >> reporting. Conventionally the hash calculation was done by the VMM. >> However, computing the hash after the queue was chosen defeats the >> purpose of RSS. >> >> Another approach is to use eBPF steering program. This approach has >> another downside: it cannot report the calculated hash due to the >> restrictive nature of eBPF. >> >> Introduce the code to compute hashes to the kernel in order to overcome >> thse challenges. An alternative solution is to extend the eBPF steering >> program so that it will be able to report to the userspace, but it makes >> little sense to allow to implement different hashing algorithms with >> eBPF since the hash value reported by virtio-net is strictly defined by >> the specification. >> >> The hash value already stored in sk_buff is not used and computed >> independently since it may have been computed in a way not conformant >> with the specification. 
>> >> Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com> >> --- > >> +static const struct tun_vnet_hash_cap tun_vnet_hash_cap = { >> + .max_indirection_table_length = >> + TUN_VNET_HASH_MAX_INDIRECTION_TABLE_LENGTH, >> + >> + .types = VIRTIO_NET_SUPPORTED_HASH_TYPES >> +}; > > No need to have explicit capabilities exchange like this? Tun either > supports all or none. tun does not support VIRTIO_NET_RSS_HASH_TYPE_IP_EX, VIRTIO_NET_RSS_HASH_TYPE_TCP_EX, and VIRTIO_NET_RSS_HASH_TYPE_UDP_EX. It is because the flow dissector does not support IPv6 extensions. The specification is also vague, and does not tell how many TLVs should be consumed at most when interpreting destination option header so I chose to avoid adding code for these hash types to the flow dissector. I doubt anyone will complain about it since nobody complains for Linux. I'm also adding this so that we can extend it later. max_indirection_table_length may grow for systems with 128+ CPUs, or types may have other bits for new protocols in the future. > >> case TUNSETSTEERINGEBPF: >> - ret = tun_set_ebpf(tun, &tun->steering_prog, argp); >> + bpf_ret = tun_set_ebpf(tun, &tun->steering_prog, argp); >> + if (IS_ERR(bpf_ret)) >> + ret = PTR_ERR(bpf_ret); >> + else if (bpf_ret) >> + tun->vnet_hash.flags &= ~TUN_VNET_HASH_RSS; > > Don't make one feature disable another. > > TUNSETSTEERINGEBPF and TUNSETVNETHASH are mutually exclusive > functions. If one is enabled the other call should fail, with EBUSY > for instance. 
> >> + case TUNSETVNETHASH: >> + len = sizeof(vnet_hash); >> + if (copy_from_user(&vnet_hash, argp, len)) { >> + ret = -EFAULT; >> + break; >> + } >> + >> + if (((vnet_hash.flags & TUN_VNET_HASH_REPORT) && >> + (tun->vnet_hdr_sz < sizeof(struct virtio_net_hdr_v1_hash) || >> + !tun_is_little_endian(tun))) || >> + vnet_hash.indirection_table_mask >= >> + TUN_VNET_HASH_MAX_INDIRECTION_TABLE_LENGTH) { >> + ret = -EINVAL; >> + break; >> + } >> + >> + argp = (u8 __user *)argp + len; >> + len = (vnet_hash.indirection_table_mask + 1) * 2; >> + if (copy_from_user(vnet_hash_indirection_table, argp, len)) { >> + ret = -EFAULT; >> + break; >> + } >> + >> + argp = (u8 __user *)argp + len; >> + len = virtio_net_hash_key_length(vnet_hash.types); >> + >> + if (copy_from_user(vnet_hash_key, argp, len)) { >> + ret = -EFAULT; >> + break; >> + } > > Probably easier and less error-prone to define a fixed size control > struct with the max indirection table size. I made its size variable because the indirection table and key may grow in the future as I wrote above. > > Btw: please trim the CC: list considerably on future patches. I'll do so in the next version with the TUNSETSTEERINGEBPF change you proposed. ^ permalink raw reply [flat|nested] 41+ messages in thread
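For illustration, the variable-length layout under debate (a fixed header, then the indirection table, then the hash key, all in one contiguous user buffer) implies the following length arithmetic on the userspace side. The struct fields and sizes here are assumptions inferred from the quoted RFC code, not the final uapi:

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed mirror of the RFC's fixed header; the real uapi
 * struct may differ in field order and padding. */
struct tun_vnet_hash {
	uint32_t flags;
	uint32_t types;
	uint16_t indirection_table_mask;
};

/* Total length of the contiguous TUNSETVNETHASH argument:
 * the fixed header, then (mask + 1) 16-bit indirection table
 * entries, then the hash key (40 bytes for a Toeplitz key). */
static size_t tun_vnet_hash_arg_len(const struct tun_vnet_hash *h,
				    size_t key_len)
{
	return sizeof(*h) +
	       ((size_t)h->indirection_table_mask + 1) * sizeof(uint16_t) +
	       key_len;
}
```

The variable-length form keeps the ioctl ABI stable if the maximum table length or key length grows later, at the cost of the multi-step copy_from_user() parsing criticized above.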
* Re: [RFC PATCH 5/7] tun: Introduce virtio-net hashing feature 2023-10-08 20:04 ` Akihiko Odaki @ 2023-10-08 20:08 ` Willem de Bruijn 2023-10-08 20:46 ` Akihiko Odaki 0 siblings, 1 reply; 41+ messages in thread From: Willem de Bruijn @ 2023-10-08 20:08 UTC (permalink / raw) To: Akihiko Odaki Cc: Jason Wang, Michael S. Tsirkin, netdev, linux-kernel, kvm, virtualization, linux-kselftest, bpf, davem, kuba, ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend, kpsingh, rdunlap, willemb, gustavoars, herbert, steffen.klassert, nogikh, pablo, decui, jakub, elver, pabeni, Yuri Benditovich On Sun, Oct 8, 2023 at 10:04 PM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: > > On 2023/10/09 4:07, Willem de Bruijn wrote: > > On Sun, Oct 8, 2023 at 7:22 AM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: > >> > >> virtio-net have two usage of hashes: one is RSS and another is hash > >> reporting. Conventionally the hash calculation was done by the VMM. > >> However, computing the hash after the queue was chosen defeats the > >> purpose of RSS. > >> > >> Another approach is to use eBPF steering program. This approach has > >> another downside: it cannot report the calculated hash due to the > >> restrictive nature of eBPF. > >> > >> Introduce the code to compute hashes to the kernel in order to overcome > >> thse challenges. An alternative solution is to extend the eBPF steering > >> program so that it will be able to report to the userspace, but it makes > >> little sense to allow to implement different hashing algorithms with > >> eBPF since the hash value reported by virtio-net is strictly defined by > >> the specification. > >> > >> The hash value already stored in sk_buff is not used and computed > >> independently since it may have been computed in a way not conformant > >> with the specification. 
> >> > >> Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com> > >> --- > > > >> +static const struct tun_vnet_hash_cap tun_vnet_hash_cap = { > >> + .max_indirection_table_length = > >> + TUN_VNET_HASH_MAX_INDIRECTION_TABLE_LENGTH, > >> + > >> + .types = VIRTIO_NET_SUPPORTED_HASH_TYPES > >> +}; > > > > No need to have explicit capabilities exchange like this? Tun either > > supports all or none. > > tun does not support VIRTIO_NET_RSS_HASH_TYPE_IP_EX, > VIRTIO_NET_RSS_HASH_TYPE_TCP_EX, and VIRTIO_NET_RSS_HASH_TYPE_UDP_EX. > > It is because the flow dissector does not support IPv6 extensions. The > specification is also vague, and does not tell how many TLVs should be > consumed at most when interpreting destination option header so I chose > to avoid adding code for these hash types to the flow dissector. I doubt > anyone will complain about it since nobody complains for Linux. > > I'm also adding this so that we can extend it later. > max_indirection_table_length may grow for systems with 128+ CPUs, or > types may have other bits for new protocols in the future. > > > > >> case TUNSETSTEERINGEBPF: > >> - ret = tun_set_ebpf(tun, &tun->steering_prog, argp); > >> + bpf_ret = tun_set_ebpf(tun, &tun->steering_prog, argp); > >> + if (IS_ERR(bpf_ret)) > >> + ret = PTR_ERR(bpf_ret); > >> + else if (bpf_ret) > >> + tun->vnet_hash.flags &= ~TUN_VNET_HASH_RSS; > > > > Don't make one feature disable another. > > > > TUNSETSTEERINGEBPF and TUNSETVNETHASH are mutually exclusive > > functions. If one is enabled the other call should fail, with EBUSY > > for instance. 
> > > >> + case TUNSETVNETHASH: > >> + len = sizeof(vnet_hash); > >> + if (copy_from_user(&vnet_hash, argp, len)) { > >> + ret = -EFAULT; > >> + break; > >> + } > >> + > >> + if (((vnet_hash.flags & TUN_VNET_HASH_REPORT) && > >> + (tun->vnet_hdr_sz < sizeof(struct virtio_net_hdr_v1_hash) || > >> + !tun_is_little_endian(tun))) || > >> + vnet_hash.indirection_table_mask >= > >> + TUN_VNET_HASH_MAX_INDIRECTION_TABLE_LENGTH) { > >> + ret = -EINVAL; > >> + break; > >> + } > >> + > >> + argp = (u8 __user *)argp + len; > >> + len = (vnet_hash.indirection_table_mask + 1) * 2; > >> + if (copy_from_user(vnet_hash_indirection_table, argp, len)) { > >> + ret = -EFAULT; > >> + break; > >> + } > >> + > >> + argp = (u8 __user *)argp + len; > >> + len = virtio_net_hash_key_length(vnet_hash.types); > >> + > >> + if (copy_from_user(vnet_hash_key, argp, len)) { > >> + ret = -EFAULT; > >> + break; > >> + } > > > > Probably easier and less error-prone to define a fixed size control > > struct with the max indirection table size. > > I made its size variable because the indirection table and key may grow > in the future as I wrote above. > > > > > Btw: please trim the CC: list considerably on future patches. > > I'll do so in the next version with the TUNSETSTEERINGEBPF change you > proposed. To be clear: please don't just resubmit with that one change. The skb and cb issues are quite fundamental issues that need to be resolved. I'd like to understand why adjusting the existing BPF feature for this exact purpose cannot be amended to return the key it produced. As you point out, the C flow dissector is insufficient. The BPF flow dissector does not have this problem. The same argument would go for the pre-existing BPF steering program. ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [RFC PATCH 5/7] tun: Introduce virtio-net hashing feature 2023-10-08 20:08 ` Willem de Bruijn @ 2023-10-08 20:46 ` Akihiko Odaki 2023-10-09 8:04 ` Willem de Bruijn 0 siblings, 1 reply; 41+ messages in thread From: Akihiko Odaki @ 2023-10-08 20:46 UTC (permalink / raw) To: Willem de Bruijn Cc: Jason Wang, Michael S. Tsirkin, netdev, linux-kernel, kvm, virtualization, linux-kselftest, bpf, davem, kuba, ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend, kpsingh, rdunlap, willemb, gustavoars, herbert, steffen.klassert, nogikh, pablo, decui, jakub, elver, pabeni, Yuri Benditovich On 2023/10/09 5:08, Willem de Bruijn wrote: > On Sun, Oct 8, 2023 at 10:04 PM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: >> >> On 2023/10/09 4:07, Willem de Bruijn wrote: >>> On Sun, Oct 8, 2023 at 7:22 AM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: >>>> >>>> virtio-net have two usage of hashes: one is RSS and another is hash >>>> reporting. Conventionally the hash calculation was done by the VMM. >>>> However, computing the hash after the queue was chosen defeats the >>>> purpose of RSS. >>>> >>>> Another approach is to use eBPF steering program. This approach has >>>> another downside: it cannot report the calculated hash due to the >>>> restrictive nature of eBPF. >>>> >>>> Introduce the code to compute hashes to the kernel in order to overcome >>>> thse challenges. An alternative solution is to extend the eBPF steering >>>> program so that it will be able to report to the userspace, but it makes >>>> little sense to allow to implement different hashing algorithms with >>>> eBPF since the hash value reported by virtio-net is strictly defined by >>>> the specification. >>>> >>>> The hash value already stored in sk_buff is not used and computed >>>> independently since it may have been computed in a way not conformant >>>> with the specification. 
>>>> >>>> Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com> >>>> --- >>> >>>> +static const struct tun_vnet_hash_cap tun_vnet_hash_cap = { >>>> + .max_indirection_table_length = >>>> + TUN_VNET_HASH_MAX_INDIRECTION_TABLE_LENGTH, >>>> + >>>> + .types = VIRTIO_NET_SUPPORTED_HASH_TYPES >>>> +}; >>> >>> No need to have explicit capabilities exchange like this? Tun either >>> supports all or none. >> >> tun does not support VIRTIO_NET_RSS_HASH_TYPE_IP_EX, >> VIRTIO_NET_RSS_HASH_TYPE_TCP_EX, and VIRTIO_NET_RSS_HASH_TYPE_UDP_EX. >> >> It is because the flow dissector does not support IPv6 extensions. The >> specification is also vague, and does not tell how many TLVs should be >> consumed at most when interpreting destination option header so I chose >> to avoid adding code for these hash types to the flow dissector. I doubt >> anyone will complain about it since nobody complains for Linux. >> >> I'm also adding this so that we can extend it later. >> max_indirection_table_length may grow for systems with 128+ CPUs, or >> types may have other bits for new protocols in the future. >> >>> >>>> case TUNSETSTEERINGEBPF: >>>> - ret = tun_set_ebpf(tun, &tun->steering_prog, argp); >>>> + bpf_ret = tun_set_ebpf(tun, &tun->steering_prog, argp); >>>> + if (IS_ERR(bpf_ret)) >>>> + ret = PTR_ERR(bpf_ret); >>>> + else if (bpf_ret) >>>> + tun->vnet_hash.flags &= ~TUN_VNET_HASH_RSS; >>> >>> Don't make one feature disable another. >>> >>> TUNSETSTEERINGEBPF and TUNSETVNETHASH are mutually exclusive >>> functions. If one is enabled the other call should fail, with EBUSY >>> for instance. 
>>> >>>> + case TUNSETVNETHASH: >>>> + len = sizeof(vnet_hash); >>>> + if (copy_from_user(&vnet_hash, argp, len)) { >>>> + ret = -EFAULT; >>>> + break; >>>> + } >>>> + >>>> + if (((vnet_hash.flags & TUN_VNET_HASH_REPORT) && >>>> + (tun->vnet_hdr_sz < sizeof(struct virtio_net_hdr_v1_hash) || >>>> + !tun_is_little_endian(tun))) || >>>> + vnet_hash.indirection_table_mask >= >>>> + TUN_VNET_HASH_MAX_INDIRECTION_TABLE_LENGTH) { >>>> + ret = -EINVAL; >>>> + break; >>>> + } >>>> + >>>> + argp = (u8 __user *)argp + len; >>>> + len = (vnet_hash.indirection_table_mask + 1) * 2; >>>> + if (copy_from_user(vnet_hash_indirection_table, argp, len)) { >>>> + ret = -EFAULT; >>>> + break; >>>> + } >>>> + >>>> + argp = (u8 __user *)argp + len; >>>> + len = virtio_net_hash_key_length(vnet_hash.types); >>>> + >>>> + if (copy_from_user(vnet_hash_key, argp, len)) { >>>> + ret = -EFAULT; >>>> + break; >>>> + } >>> >>> Probably easier and less error-prone to define a fixed size control >>> struct with the max indirection table size. >> >> I made its size variable because the indirection table and key may grow >> in the future as I wrote above. >> >>> >>> Btw: please trim the CC: list considerably on future patches. >> >> I'll do so in the next version with the TUNSETSTEERINGEBPF change you >> proposed. > > To be clear: please don't just resubmit with that one change. > > The skb and cb issues are quite fundamental issues that need to be resolved. > > I'd like to understand why adjusting the existing BPF feature for this > exact purpose cannot be amended to return the key it produced. eBPF steering program is not designed for this particular problem in my understanding. It was introduced to derive hash values with an understanding of application-specific semantics of packets instead of generic IP/TCP/UDP semantics. This problem is rather different in terms that the hash derivation is strictly defined by virtio-net. 
I don't think it makes sense to introduce the complexity of BPF when you always run the same code. It can utilize the existing flow dissector and also make it easier to use for the userspace by implementing this in the kernel. > > As you point out, the C flow dissector is insufficient. The BPF flow > dissector does not have this problem. The same argument would go for > the pre-existing BPF steering program. It is possible to extend the C flow dissector just as it is possible to implement a BPF flow dissector. The more serious problem is that virtio-net specification (and Microsoft RSS it follows) does not tell how to implement IPv6 extension support. ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [RFC PATCH 5/7] tun: Introduce virtio-net hashing feature 2023-10-08 20:46 ` Akihiko Odaki @ 2023-10-09 8:04 ` Willem de Bruijn 2023-10-09 8:57 ` Akihiko Odaki 0 siblings, 1 reply; 41+ messages in thread From: Willem de Bruijn @ 2023-10-09 8:04 UTC (permalink / raw) To: Akihiko Odaki Cc: Jason Wang, Michael S. Tsirkin, netdev, linux-kernel, kvm, virtualization, linux-kselftest, bpf, davem, kuba, ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend, kpsingh, rdunlap, willemb, gustavoars, herbert, steffen.klassert, nogikh, pablo, decui, jakub, elver, pabeni, Yuri Benditovich On Sun, Oct 8, 2023 at 3:46 PM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: > > On 2023/10/09 5:08, Willem de Bruijn wrote: > > On Sun, Oct 8, 2023 at 10:04 PM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: > >> > >> On 2023/10/09 4:07, Willem de Bruijn wrote: > >>> On Sun, Oct 8, 2023 at 7:22 AM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: > >>>> > >>>> virtio-net have two usage of hashes: one is RSS and another is hash > >>>> reporting. Conventionally the hash calculation was done by the VMM. > >>>> However, computing the hash after the queue was chosen defeats the > >>>> purpose of RSS. > >>>> > >>>> Another approach is to use eBPF steering program. This approach has > >>>> another downside: it cannot report the calculated hash due to the > >>>> restrictive nature of eBPF. > >>>> > >>>> Introduce the code to compute hashes to the kernel in order to overcome > >>>> thse challenges. An alternative solution is to extend the eBPF steering > >>>> program so that it will be able to report to the userspace, but it makes > >>>> little sense to allow to implement different hashing algorithms with > >>>> eBPF since the hash value reported by virtio-net is strictly defined by > >>>> the specification. 
> >>>> > >>>> The hash value already stored in sk_buff is not used and computed > >>>> independently since it may have been computed in a way not conformant > >>>> with the specification. > >>>> > >>>> Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com> > >>>> --- > >>> > >>>> +static const struct tun_vnet_hash_cap tun_vnet_hash_cap = { > >>>> + .max_indirection_table_length = > >>>> + TUN_VNET_HASH_MAX_INDIRECTION_TABLE_LENGTH, > >>>> + > >>>> + .types = VIRTIO_NET_SUPPORTED_HASH_TYPES > >>>> +}; > >>> > >>> No need to have explicit capabilities exchange like this? Tun either > >>> supports all or none. > >> > >> tun does not support VIRTIO_NET_RSS_HASH_TYPE_IP_EX, > >> VIRTIO_NET_RSS_HASH_TYPE_TCP_EX, and VIRTIO_NET_RSS_HASH_TYPE_UDP_EX. > >> > >> It is because the flow dissector does not support IPv6 extensions. The > >> specification is also vague, and does not tell how many TLVs should be > >> consumed at most when interpreting destination option header so I chose > >> to avoid adding code for these hash types to the flow dissector. I doubt > >> anyone will complain about it since nobody complains for Linux. > >> > >> I'm also adding this so that we can extend it later. > >> max_indirection_table_length may grow for systems with 128+ CPUs, or > >> types may have other bits for new protocols in the future. > >> > >>> > >>>> case TUNSETSTEERINGEBPF: > >>>> - ret = tun_set_ebpf(tun, &tun->steering_prog, argp); > >>>> + bpf_ret = tun_set_ebpf(tun, &tun->steering_prog, argp); > >>>> + if (IS_ERR(bpf_ret)) > >>>> + ret = PTR_ERR(bpf_ret); > >>>> + else if (bpf_ret) > >>>> + tun->vnet_hash.flags &= ~TUN_VNET_HASH_RSS; > >>> > >>> Don't make one feature disable another. > >>> > >>> TUNSETSTEERINGEBPF and TUNSETVNETHASH are mutually exclusive > >>> functions. If one is enabled the other call should fail, with EBUSY > >>> for instance. 
> >>> > >>>> + case TUNSETVNETHASH: > >>>> + len = sizeof(vnet_hash); > >>>> + if (copy_from_user(&vnet_hash, argp, len)) { > >>>> + ret = -EFAULT; > >>>> + break; > >>>> + } > >>>> + > >>>> + if (((vnet_hash.flags & TUN_VNET_HASH_REPORT) && > >>>> + (tun->vnet_hdr_sz < sizeof(struct virtio_net_hdr_v1_hash) || > >>>> + !tun_is_little_endian(tun))) || > >>>> + vnet_hash.indirection_table_mask >= > >>>> + TUN_VNET_HASH_MAX_INDIRECTION_TABLE_LENGTH) { > >>>> + ret = -EINVAL; > >>>> + break; > >>>> + } > >>>> + > >>>> + argp = (u8 __user *)argp + len; > >>>> + len = (vnet_hash.indirection_table_mask + 1) * 2; > >>>> + if (copy_from_user(vnet_hash_indirection_table, argp, len)) { > >>>> + ret = -EFAULT; > >>>> + break; > >>>> + } > >>>> + > >>>> + argp = (u8 __user *)argp + len; > >>>> + len = virtio_net_hash_key_length(vnet_hash.types); > >>>> + > >>>> + if (copy_from_user(vnet_hash_key, argp, len)) { > >>>> + ret = -EFAULT; > >>>> + break; > >>>> + } > >>> > >>> Probably easier and less error-prone to define a fixed size control > >>> struct with the max indirection table size. > >> > >> I made its size variable because the indirection table and key may grow > >> in the future as I wrote above. > >> > >>> > >>> Btw: please trim the CC: list considerably on future patches. > >> > >> I'll do so in the next version with the TUNSETSTEERINGEBPF change you > >> proposed. > > > > To be clear: please don't just resubmit with that one change. > > > > The skb and cb issues are quite fundamental issues that need to be resolved. > > > > I'd like to understand why adjusting the existing BPF feature for this > > exact purpose cannot be amended to return the key it produced. > > eBPF steering program is not designed for this particular problem in my > understanding. It was introduced to derive hash values with an > understanding of application-specific semantics of packets instead of > generic IP/TCP/UDP semantics. 
> > This problem is rather different in terms that the hash derivation is > strictly defined by virtio-net. I don't think it makes sense to > introduce the complexity of BPF when you always run the same code. > > It can utilize the existing flow dissector and also make it easier to > use for the userspace by implementing this in the kernel. Ok. There does appear to be overlap in functionality. But it might be easier to deploy to just have standard Toeplitz available without having to compile and load an eBPF program. As for the sk_buff and cb[] changes. The first is really not needed. sk_buff simply would not scale if every edge case needs a few bits. For the control block, generally it is not safe to use that across layers. In this case, between qdisc enqueue of a given device and ndo_start_xmit of that device, I suppose it is. Though uncommon. I wonder if there is any precedent. The data will have to be stored in the skb somewhere. A simpler option is just skb->hash? This code would use skb_get_hash, if it would always produce a Toeplitz hash, anyway. > > > > As you point out, the C flow dissector is insufficient. The BPF flow > > dissector does not have this problem. The same argument would go for > > the pre-existing BPF steering program. > It is possible to extend the C flow dissector just as it is possible to > implement a BPF flow dissector. The more serious problem is that > virtio-net specification (and Microsoft RSS it follows) does not tell > how to implement IPv6 extension support. ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [RFC PATCH 5/7] tun: Introduce virtio-net hashing feature 2023-10-09 8:04 ` Willem de Bruijn @ 2023-10-09 8:57 ` Akihiko Odaki 2023-10-09 9:57 ` Willem de Bruijn 0 siblings, 1 reply; 41+ messages in thread From: Akihiko Odaki @ 2023-10-09 8:57 UTC (permalink / raw) To: Willem de Bruijn Cc: Jason Wang, Michael S. Tsirkin, netdev, linux-kernel, kvm, virtualization, linux-kselftest, bpf, davem, kuba, ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend, kpsingh, rdunlap, willemb, gustavoars, herbert, steffen.klassert, nogikh, pablo, decui, jakub, elver, pabeni, Yuri Benditovich On 2023/10/09 17:04, Willem de Bruijn wrote: > On Sun, Oct 8, 2023 at 3:46 PM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: >> >> On 2023/10/09 5:08, Willem de Bruijn wrote: >>> On Sun, Oct 8, 2023 at 10:04 PM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: >>>> >>>> On 2023/10/09 4:07, Willem de Bruijn wrote: >>>>> On Sun, Oct 8, 2023 at 7:22 AM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: >>>>>> >>>>>> virtio-net have two usage of hashes: one is RSS and another is hash >>>>>> reporting. Conventionally the hash calculation was done by the VMM. >>>>>> However, computing the hash after the queue was chosen defeats the >>>>>> purpose of RSS. >>>>>> >>>>>> Another approach is to use eBPF steering program. This approach has >>>>>> another downside: it cannot report the calculated hash due to the >>>>>> restrictive nature of eBPF. >>>>>> >>>>>> Introduce the code to compute hashes to the kernel in order to overcome >>>>>> thse challenges. An alternative solution is to extend the eBPF steering >>>>>> program so that it will be able to report to the userspace, but it makes >>>>>> little sense to allow to implement different hashing algorithms with >>>>>> eBPF since the hash value reported by virtio-net is strictly defined by >>>>>> the specification. 
>>>>>> >>>>>> The hash value already stored in sk_buff is not used and computed >>>>>> independently since it may have been computed in a way not conformant >>>>>> with the specification. >>>>>> >>>>>> Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com> >>>>>> --- >>>>> >>>>>> +static const struct tun_vnet_hash_cap tun_vnet_hash_cap = { >>>>>> + .max_indirection_table_length = >>>>>> + TUN_VNET_HASH_MAX_INDIRECTION_TABLE_LENGTH, >>>>>> + >>>>>> + .types = VIRTIO_NET_SUPPORTED_HASH_TYPES >>>>>> +}; >>>>> >>>>> No need to have explicit capabilities exchange like this? Tun either >>>>> supports all or none. >>>> >>>> tun does not support VIRTIO_NET_RSS_HASH_TYPE_IP_EX, >>>> VIRTIO_NET_RSS_HASH_TYPE_TCP_EX, and VIRTIO_NET_RSS_HASH_TYPE_UDP_EX. >>>> >>>> It is because the flow dissector does not support IPv6 extensions. The >>>> specification is also vague, and does not tell how many TLVs should be >>>> consumed at most when interpreting destination option header so I chose >>>> to avoid adding code for these hash types to the flow dissector. I doubt >>>> anyone will complain about it since nobody complains for Linux. >>>> >>>> I'm also adding this so that we can extend it later. >>>> max_indirection_table_length may grow for systems with 128+ CPUs, or >>>> types may have other bits for new protocols in the future. >>>> >>>>> >>>>>> case TUNSETSTEERINGEBPF: >>>>>> - ret = tun_set_ebpf(tun, &tun->steering_prog, argp); >>>>>> + bpf_ret = tun_set_ebpf(tun, &tun->steering_prog, argp); >>>>>> + if (IS_ERR(bpf_ret)) >>>>>> + ret = PTR_ERR(bpf_ret); >>>>>> + else if (bpf_ret) >>>>>> + tun->vnet_hash.flags &= ~TUN_VNET_HASH_RSS; >>>>> >>>>> Don't make one feature disable another. >>>>> >>>>> TUNSETSTEERINGEBPF and TUNSETVNETHASH are mutually exclusive >>>>> functions. If one is enabled the other call should fail, with EBUSY >>>>> for instance. 
>>>>> >>>>>> + case TUNSETVNETHASH: >>>>>> + len = sizeof(vnet_hash); >>>>>> + if (copy_from_user(&vnet_hash, argp, len)) { >>>>>> + ret = -EFAULT; >>>>>> + break; >>>>>> + } >>>>>> + >>>>>> + if (((vnet_hash.flags & TUN_VNET_HASH_REPORT) && >>>>>> + (tun->vnet_hdr_sz < sizeof(struct virtio_net_hdr_v1_hash) || >>>>>> + !tun_is_little_endian(tun))) || >>>>>> + vnet_hash.indirection_table_mask >= >>>>>> + TUN_VNET_HASH_MAX_INDIRECTION_TABLE_LENGTH) { >>>>>> + ret = -EINVAL; >>>>>> + break; >>>>>> + } >>>>>> + >>>>>> + argp = (u8 __user *)argp + len; >>>>>> + len = (vnet_hash.indirection_table_mask + 1) * 2; >>>>>> + if (copy_from_user(vnet_hash_indirection_table, argp, len)) { >>>>>> + ret = -EFAULT; >>>>>> + break; >>>>>> + } >>>>>> + >>>>>> + argp = (u8 __user *)argp + len; >>>>>> + len = virtio_net_hash_key_length(vnet_hash.types); >>>>>> + >>>>>> + if (copy_from_user(vnet_hash_key, argp, len)) { >>>>>> + ret = -EFAULT; >>>>>> + break; >>>>>> + } >>>>> >>>>> Probably easier and less error-prone to define a fixed size control >>>>> struct with the max indirection table size. >>>> >>>> I made its size variable because the indirection table and key may grow >>>> in the future as I wrote above. >>>> >>>>> >>>>> Btw: please trim the CC: list considerably on future patches. >>>> >>>> I'll do so in the next version with the TUNSETSTEERINGEBPF change you >>>> proposed. >>> >>> To be clear: please don't just resubmit with that one change. >>> >>> The skb and cb issues are quite fundamental issues that need to be resolved. >>> >>> I'd like to understand why adjusting the existing BPF feature for this >>> exact purpose cannot be amended to return the key it produced. >> >> eBPF steering program is not designed for this particular problem in my >> understanding. It was introduced to derive hash values with an >> understanding of application-specific semantics of packets instead of >> generic IP/TCP/UDP semantics. 
>> >> This problem is rather different in terms that the hash derivation is >> strictly defined by virtio-net. I don't think it makes sense to >> introduce the complexity of BPF when you always run the same code. >> >> It can utilize the existing flow dissector and also make it easier to >> use for the userspace by implementing this in the kernel. > > Ok. There does appear to be overlap in functionality. But it might be > easier to deploy to just have standard Toeplitz available without > having to compile and load an eBPF program. > > As for the sk_buff and cb[] changes. The first is really not needed. > sk_buff simply would not scale if every edge case needs a few bits. An alternative is to move the bit to cb[] and clear it for every code paths that lead to ndo_start_xmit(), but I'm worried that it is error-prone. I think we can put the bit in sk_buff for now. We can implement the alternative when we are short of bits. > > For the control block, generally it is not safe to use that across > layers. In this case, between qdisc enqueue of a given device and > ndo_start_xmit of that device, I suppose it is. Though uncommon. I > wonder if there is any precedent. That's one of the reasons modifying qdisc_skb_cb instead of creating another struct embedding it as bpf_skb_data_end and tc_skb_cb do. This clarifies that it is qdisc's responsibility to keep these data intact. > > The data will have to be stored in the skb somewhere. A simpler option > is just skb->hash? This code would use skb_get_hash, if it would > always produce a Toeplitz hash, anyway. We still need tun_vnet_hash_report to identify hash type. skb_get_hash() uses Siphash instead of Toeplitz, and the configuration of Toeplitz relies on tun's internal state. It is still possible to store a hash calculated with tun's own logic to skb->hash, but someone may later overwrite it with __skb_get_hash() though it's unlikely. ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [RFC PATCH 5/7] tun: Introduce virtio-net hashing feature 2023-10-09 8:57 ` Akihiko Odaki @ 2023-10-09 9:57 ` Willem de Bruijn 2023-10-09 10:01 ` Akihiko Odaki 0 siblings, 1 reply; 41+ messages in thread From: Willem de Bruijn @ 2023-10-09 9:57 UTC (permalink / raw) To: Akihiko Odaki Cc: Jason Wang, Michael S. Tsirkin, netdev, linux-kernel, kvm, virtualization, linux-kselftest, bpf, davem, kuba, ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend, kpsingh, rdunlap, willemb, gustavoars, herbert, steffen.klassert, nogikh, pablo, decui, jakub, elver, pabeni, Yuri Benditovich On Mon, Oct 9, 2023 at 3:57 AM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: > > On 2023/10/09 17:04, Willem de Bruijn wrote: > > On Sun, Oct 8, 2023 at 3:46 PM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: > >> > >> On 2023/10/09 5:08, Willem de Bruijn wrote: > >>> On Sun, Oct 8, 2023 at 10:04 PM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: > >>>> > >>>> On 2023/10/09 4:07, Willem de Bruijn wrote: > >>>>> On Sun, Oct 8, 2023 at 7:22 AM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: > >>>>>> > >>>>>> virtio-net have two usage of hashes: one is RSS and another is hash > >>>>>> reporting. Conventionally the hash calculation was done by the VMM. > >>>>>> However, computing the hash after the queue was chosen defeats the > >>>>>> purpose of RSS. > >>>>>> > >>>>>> Another approach is to use eBPF steering program. This approach has > >>>>>> another downside: it cannot report the calculated hash due to the > >>>>>> restrictive nature of eBPF. > >>>>>> > >>>>>> Introduce the code to compute hashes to the kernel in order to overcome > >>>>>> thse challenges. An alternative solution is to extend the eBPF steering > >>>>>> program so that it will be able to report to the userspace, but it makes > >>>>>> little sense to allow to implement different hashing algorithms with > >>>>>> eBPF since the hash value reported by virtio-net is strictly defined by > >>>>>> the specification. 
> >>>>>> > >>>>>> The hash value already stored in sk_buff is not used and computed > >>>>>> independently since it may have been computed in a way not conformant > >>>>>> with the specification. > >>>>>> > >>>>>> Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com> > >>>>>> --- > >>>>> > >>>>>> +static const struct tun_vnet_hash_cap tun_vnet_hash_cap = { > >>>>>> + .max_indirection_table_length = > >>>>>> + TUN_VNET_HASH_MAX_INDIRECTION_TABLE_LENGTH, > >>>>>> + > >>>>>> + .types = VIRTIO_NET_SUPPORTED_HASH_TYPES > >>>>>> +}; > >>>>> > >>>>> No need to have explicit capabilities exchange like this? Tun either > >>>>> supports all or none. > >>>> > >>>> tun does not support VIRTIO_NET_RSS_HASH_TYPE_IP_EX, > >>>> VIRTIO_NET_RSS_HASH_TYPE_TCP_EX, and VIRTIO_NET_RSS_HASH_TYPE_UDP_EX. > >>>> > >>>> It is because the flow dissector does not support IPv6 extensions. The > >>>> specification is also vague, and does not tell how many TLVs should be > >>>> consumed at most when interpreting destination option header so I chose > >>>> to avoid adding code for these hash types to the flow dissector. I doubt > >>>> anyone will complain about it since nobody complains for Linux. > >>>> > >>>> I'm also adding this so that we can extend it later. > >>>> max_indirection_table_length may grow for systems with 128+ CPUs, or > >>>> types may have other bits for new protocols in the future. > >>>> > >>>>> > >>>>>> case TUNSETSTEERINGEBPF: > >>>>>> - ret = tun_set_ebpf(tun, &tun->steering_prog, argp); > >>>>>> + bpf_ret = tun_set_ebpf(tun, &tun->steering_prog, argp); > >>>>>> + if (IS_ERR(bpf_ret)) > >>>>>> + ret = PTR_ERR(bpf_ret); > >>>>>> + else if (bpf_ret) > >>>>>> + tun->vnet_hash.flags &= ~TUN_VNET_HASH_RSS; > >>>>> > >>>>> Don't make one feature disable another. > >>>>> > >>>>> TUNSETSTEERINGEBPF and TUNSETVNETHASH are mutually exclusive > >>>>> functions. If one is enabled the other call should fail, with EBUSY > >>>>> for instance. 
> >>>>> > >>>>>> + case TUNSETVNETHASH: > >>>>>> + len = sizeof(vnet_hash); > >>>>>> + if (copy_from_user(&vnet_hash, argp, len)) { > >>>>>> + ret = -EFAULT; > >>>>>> + break; > >>>>>> + } > >>>>>> + > >>>>>> + if (((vnet_hash.flags & TUN_VNET_HASH_REPORT) && > >>>>>> + (tun->vnet_hdr_sz < sizeof(struct virtio_net_hdr_v1_hash) || > >>>>>> + !tun_is_little_endian(tun))) || > >>>>>> + vnet_hash.indirection_table_mask >= > >>>>>> + TUN_VNET_HASH_MAX_INDIRECTION_TABLE_LENGTH) { > >>>>>> + ret = -EINVAL; > >>>>>> + break; > >>>>>> + } > >>>>>> + > >>>>>> + argp = (u8 __user *)argp + len; > >>>>>> + len = (vnet_hash.indirection_table_mask + 1) * 2; > >>>>>> + if (copy_from_user(vnet_hash_indirection_table, argp, len)) { > >>>>>> + ret = -EFAULT; > >>>>>> + break; > >>>>>> + } > >>>>>> + > >>>>>> + argp = (u8 __user *)argp + len; > >>>>>> + len = virtio_net_hash_key_length(vnet_hash.types); > >>>>>> + > >>>>>> + if (copy_from_user(vnet_hash_key, argp, len)) { > >>>>>> + ret = -EFAULT; > >>>>>> + break; > >>>>>> + } > >>>>> > >>>>> Probably easier and less error-prone to define a fixed size control > >>>>> struct with the max indirection table size. > >>>> > >>>> I made its size variable because the indirection table and key may grow > >>>> in the future as I wrote above. > >>>> > >>>>> > >>>>> Btw: please trim the CC: list considerably on future patches. > >>>> > >>>> I'll do so in the next version with the TUNSETSTEERINGEBPF change you > >>>> proposed. > >>> > >>> To be clear: please don't just resubmit with that one change. > >>> > >>> The skb and cb issues are quite fundamental issues that need to be resolved. > >>> > >>> I'd like to understand why adjusting the existing BPF feature for this > >>> exact purpose cannot be amended to return the key it produced. > >> > >> eBPF steering program is not designed for this particular problem in my > >> understanding. 
It was introduced to derive hash values with an > >> understanding of application-specific semantics of packets instead of > >> generic IP/TCP/UDP semantics. > >> > >> This problem is rather different in terms that the hash derivation is > >> strictly defined by virtio-net. I don't think it makes sense to > >> introduce the complexity of BPF when you always run the same code. > >> > >> It can utilize the existing flow dissector and also make it easier to > >> use for the userspace by implementing this in the kernel. > > > > Ok. There does appear to be overlap in functionality. But it might be > > easier to deploy to just have standard Toeplitz available without > > having to compile and load an eBPF program. > > > > As for the sk_buff and cb[] changes. The first is really not needed. > > sk_buff simply would not scale if every edge case needs a few bits. > > An alternative is to move the bit to cb[] and clear it for every code > paths that lead to ndo_start_xmit(), but I'm worried that it is error-prone. > > I think we can put the bit in sk_buff for now. We can implement the > alternative when we are short of bits. I disagree. sk_buff fields add a cost to every code path. They cannot be added for every edge case. > > > > > For the control block, generally it is not safe to use that across > > layers. In this case, between qdisc enqueue of a given device and > > ndo_start_xmit of that device, I suppose it is. Though uncommon. I > > wonder if there is any precedent. > > That's one of the reasons modifying qdisc_skb_cb instead of creating > another struct embedding it as bpf_skb_data_end and tc_skb_cb do. This > clarifies that it is qdisc's responsibility to keep these data intact. > > > > > The data will have to be stored in the skb somewhere. A simpler option > > is just skb->hash? This code would use skb_get_hash, if it would > > always produce a Toeplitz hash, anyway. > > We still need tun_vnet_hash_report to identify hash type. 
> > skb_get_hash() uses Siphash instead of Toeplitz, and the configuration > of Toeplitz relies on tun's internal state. It is still possible to > store a hash calculated with tun's own logic to skb->hash, but someone > may later overwrite it with __skb_get_hash() though it's unlikely. That won't happen in this data path. Fair point on the hash type. ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [RFC PATCH 5/7] tun: Introduce virtio-net hashing feature 2023-10-09 9:57 ` Willem de Bruijn @ 2023-10-09 10:01 ` Akihiko Odaki 2023-10-09 10:06 ` Willem de Bruijn 0 siblings, 1 reply; 41+ messages in thread From: Akihiko Odaki @ 2023-10-09 10:01 UTC (permalink / raw) To: Willem de Bruijn Cc: Jason Wang, Michael S. Tsirkin, netdev, linux-kernel, kvm, virtualization, linux-kselftest, bpf, davem, kuba, ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend, kpsingh, rdunlap, willemb, gustavoars, herbert, steffen.klassert, nogikh, pablo, decui, jakub, elver, pabeni, Yuri Benditovich On 2023/10/09 18:57, Willem de Bruijn wrote: > On Mon, Oct 9, 2023 at 3:57 AM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: >> >> On 2023/10/09 17:04, Willem de Bruijn wrote: >>> On Sun, Oct 8, 2023 at 3:46 PM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: >>>> >>>> On 2023/10/09 5:08, Willem de Bruijn wrote: >>>>> On Sun, Oct 8, 2023 at 10:04 PM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: >>>>>> >>>>>> On 2023/10/09 4:07, Willem de Bruijn wrote: >>>>>>> On Sun, Oct 8, 2023 at 7:22 AM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: >>>>>>>> >>>>>>>> virtio-net have two usage of hashes: one is RSS and another is hash >>>>>>>> reporting. Conventionally the hash calculation was done by the VMM. >>>>>>>> However, computing the hash after the queue was chosen defeats the >>>>>>>> purpose of RSS. >>>>>>>> >>>>>>>> Another approach is to use eBPF steering program. This approach has >>>>>>>> another downside: it cannot report the calculated hash due to the >>>>>>>> restrictive nature of eBPF. >>>>>>>> >>>>>>>> Introduce the code to compute hashes to the kernel in order to overcome >>>>>>>> thse challenges. 
An alternative solution is to extend the eBPF steering >>>>>>>> program so that it will be able to report to the userspace, but it makes >>>>>>>> little sense to allow to implement different hashing algorithms with >>>>>>>> eBPF since the hash value reported by virtio-net is strictly defined by >>>>>>>> the specification. >>>>>>>> >>>>>>>> The hash value already stored in sk_buff is not used and computed >>>>>>>> independently since it may have been computed in a way not conformant >>>>>>>> with the specification. >>>>>>>> >>>>>>>> Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com> >>>>>>>> --- >>>>>>> >>>>>>>> +static const struct tun_vnet_hash_cap tun_vnet_hash_cap = { >>>>>>>> + .max_indirection_table_length = >>>>>>>> + TUN_VNET_HASH_MAX_INDIRECTION_TABLE_LENGTH, >>>>>>>> + >>>>>>>> + .types = VIRTIO_NET_SUPPORTED_HASH_TYPES >>>>>>>> +}; >>>>>>> >>>>>>> No need to have explicit capabilities exchange like this? Tun either >>>>>>> supports all or none. >>>>>> >>>>>> tun does not support VIRTIO_NET_RSS_HASH_TYPE_IP_EX, >>>>>> VIRTIO_NET_RSS_HASH_TYPE_TCP_EX, and VIRTIO_NET_RSS_HASH_TYPE_UDP_EX. >>>>>> >>>>>> It is because the flow dissector does not support IPv6 extensions. The >>>>>> specification is also vague, and does not tell how many TLVs should be >>>>>> consumed at most when interpreting destination option header so I chose >>>>>> to avoid adding code for these hash types to the flow dissector. I doubt >>>>>> anyone will complain about it since nobody complains for Linux. >>>>>> >>>>>> I'm also adding this so that we can extend it later. >>>>>> max_indirection_table_length may grow for systems with 128+ CPUs, or >>>>>> types may have other bits for new protocols in the future. 
>>>>>> >>>>>>> >>>>>>>> case TUNSETSTEERINGEBPF: >>>>>>>> - ret = tun_set_ebpf(tun, &tun->steering_prog, argp); >>>>>>>> + bpf_ret = tun_set_ebpf(tun, &tun->steering_prog, argp); >>>>>>>> + if (IS_ERR(bpf_ret)) >>>>>>>> + ret = PTR_ERR(bpf_ret); >>>>>>>> + else if (bpf_ret) >>>>>>>> + tun->vnet_hash.flags &= ~TUN_VNET_HASH_RSS; >>>>>>> >>>>>>> Don't make one feature disable another. >>>>>>> >>>>>>> TUNSETSTEERINGEBPF and TUNSETVNETHASH are mutually exclusive >>>>>>> functions. If one is enabled the other call should fail, with EBUSY >>>>>>> for instance. >>>>>>> >>>>>>>> + case TUNSETVNETHASH: >>>>>>>> + len = sizeof(vnet_hash); >>>>>>>> + if (copy_from_user(&vnet_hash, argp, len)) { >>>>>>>> + ret = -EFAULT; >>>>>>>> + break; >>>>>>>> + } >>>>>>>> + >>>>>>>> + if (((vnet_hash.flags & TUN_VNET_HASH_REPORT) && >>>>>>>> + (tun->vnet_hdr_sz < sizeof(struct virtio_net_hdr_v1_hash) || >>>>>>>> + !tun_is_little_endian(tun))) || >>>>>>>> + vnet_hash.indirection_table_mask >= >>>>>>>> + TUN_VNET_HASH_MAX_INDIRECTION_TABLE_LENGTH) { >>>>>>>> + ret = -EINVAL; >>>>>>>> + break; >>>>>>>> + } >>>>>>>> + >>>>>>>> + argp = (u8 __user *)argp + len; >>>>>>>> + len = (vnet_hash.indirection_table_mask + 1) * 2; >>>>>>>> + if (copy_from_user(vnet_hash_indirection_table, argp, len)) { >>>>>>>> + ret = -EFAULT; >>>>>>>> + break; >>>>>>>> + } >>>>>>>> + >>>>>>>> + argp = (u8 __user *)argp + len; >>>>>>>> + len = virtio_net_hash_key_length(vnet_hash.types); >>>>>>>> + >>>>>>>> + if (copy_from_user(vnet_hash_key, argp, len)) { >>>>>>>> + ret = -EFAULT; >>>>>>>> + break; >>>>>>>> + } >>>>>>> >>>>>>> Probably easier and less error-prone to define a fixed size control >>>>>>> struct with the max indirection table size. >>>>>> >>>>>> I made its size variable because the indirection table and key may grow >>>>>> in the future as I wrote above. >>>>>> >>>>>>> >>>>>>> Btw: please trim the CC: list considerably on future patches. 
>>>>>> >>>>>> I'll do so in the next version with the TUNSETSTEERINGEBPF change you >>>>>> proposed. >>>>> >>>>> To be clear: please don't just resubmit with that one change. >>>>> >>>>> The skb and cb issues are quite fundamental issues that need to be resolved. >>>>> >>>>> I'd like to understand why adjusting the existing BPF feature for this >>>>> exact purpose cannot be amended to return the key it produced. >>>> >>>> eBPF steering program is not designed for this particular problem in my >>>> understanding. It was introduced to derive hash values with an >>>> understanding of application-specific semantics of packets instead of >>>> generic IP/TCP/UDP semantics. >>>> >>>> This problem is rather different in terms that the hash derivation is >>>> strictly defined by virtio-net. I don't think it makes sense to >>>> introduce the complexity of BPF when you always run the same code. >>>> >>>> It can utilize the existing flow dissector and also make it easier to >>>> use for the userspace by implementing this in the kernel. >>> >>> Ok. There does appear to be overlap in functionality. But it might be >>> easier to deploy to just have standard Toeplitz available without >>> having to compile and load an eBPF program. >>> >>> As for the sk_buff and cb[] changes. The first is really not needed. >>> sk_buff simply would not scale if every edge case needs a few bits. >> >> An alternative is to move the bit to cb[] and clear it for every code >> paths that lead to ndo_start_xmit(), but I'm worried that it is error-prone. >> >> I think we can put the bit in sk_buff for now. We can implement the >> alternative when we are short of bits. > > I disagree. sk_buff fields add a cost to every code path. They cannot > be added for every edge case. It only takes an unused bit and does not grow the sk_buff size so I think it has practically no cost for now. ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [RFC PATCH 5/7] tun: Introduce virtio-net hashing feature 2023-10-09 10:01 ` Akihiko Odaki @ 2023-10-09 10:06 ` Willem de Bruijn 2023-10-09 10:12 ` Akihiko Odaki 0 siblings, 1 reply; 41+ messages in thread From: Willem de Bruijn @ 2023-10-09 10:06 UTC (permalink / raw) To: Akihiko Odaki Cc: Jason Wang, Michael S. Tsirkin, netdev, linux-kernel, kvm, virtualization, linux-kselftest, bpf, davem, kuba, ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend, kpsingh, rdunlap, willemb, gustavoars, herbert, steffen.klassert, nogikh, pablo, decui, jakub, elver, pabeni, Yuri Benditovich On Mon, Oct 9, 2023 at 3:02 AM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: > > On 2023/10/09 18:57, Willem de Bruijn wrote: > > On Mon, Oct 9, 2023 at 3:57 AM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: > >> > >> On 2023/10/09 17:04, Willem de Bruijn wrote: > >>> On Sun, Oct 8, 2023 at 3:46 PM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: > >>>> > >>>> On 2023/10/09 5:08, Willem de Bruijn wrote: > >>>>> On Sun, Oct 8, 2023 at 10:04 PM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: > >>>>>> > >>>>>> On 2023/10/09 4:07, Willem de Bruijn wrote: > >>>>>>> On Sun, Oct 8, 2023 at 7:22 AM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: > >>>>>>>> > >>>>>>>> virtio-net have two usage of hashes: one is RSS and another is hash > >>>>>>>> reporting. Conventionally the hash calculation was done by the VMM. > >>>>>>>> However, computing the hash after the queue was chosen defeats the > >>>>>>>> purpose of RSS. > >>>>>>>> > >>>>>>>> Another approach is to use eBPF steering program. This approach has > >>>>>>>> another downside: it cannot report the calculated hash due to the > >>>>>>>> restrictive nature of eBPF. > >>>>>>>> > >>>>>>>> Introduce the code to compute hashes to the kernel in order to overcome > >>>>>>>> thse challenges. 
An alternative solution is to extend the eBPF steering > >>>>>>>> program so that it will be able to report to the userspace, but it makes > >>>>>>>> little sense to allow to implement different hashing algorithms with > >>>>>>>> eBPF since the hash value reported by virtio-net is strictly defined by > >>>>>>>> the specification. > >>>>>>>> > >>>>>>>> The hash value already stored in sk_buff is not used and computed > >>>>>>>> independently since it may have been computed in a way not conformant > >>>>>>>> with the specification. > >>>>>>>> > >>>>>>>> Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com> > >>>>>>>> --- > >>>>>>> > >>>>>>>> +static const struct tun_vnet_hash_cap tun_vnet_hash_cap = { > >>>>>>>> + .max_indirection_table_length = > >>>>>>>> + TUN_VNET_HASH_MAX_INDIRECTION_TABLE_LENGTH, > >>>>>>>> + > >>>>>>>> + .types = VIRTIO_NET_SUPPORTED_HASH_TYPES > >>>>>>>> +}; > >>>>>>> > >>>>>>> No need to have explicit capabilities exchange like this? Tun either > >>>>>>> supports all or none. > >>>>>> > >>>>>> tun does not support VIRTIO_NET_RSS_HASH_TYPE_IP_EX, > >>>>>> VIRTIO_NET_RSS_HASH_TYPE_TCP_EX, and VIRTIO_NET_RSS_HASH_TYPE_UDP_EX. > >>>>>> > >>>>>> It is because the flow dissector does not support IPv6 extensions. The > >>>>>> specification is also vague, and does not tell how many TLVs should be > >>>>>> consumed at most when interpreting destination option header so I chose > >>>>>> to avoid adding code for these hash types to the flow dissector. I doubt > >>>>>> anyone will complain about it since nobody complains for Linux. > >>>>>> > >>>>>> I'm also adding this so that we can extend it later. > >>>>>> max_indirection_table_length may grow for systems with 128+ CPUs, or > >>>>>> types may have other bits for new protocols in the future. 
> >>>>>> > >>>>>>> > >>>>>>>> case TUNSETSTEERINGEBPF: > >>>>>>>> - ret = tun_set_ebpf(tun, &tun->steering_prog, argp); > >>>>>>>> + bpf_ret = tun_set_ebpf(tun, &tun->steering_prog, argp); > >>>>>>>> + if (IS_ERR(bpf_ret)) > >>>>>>>> + ret = PTR_ERR(bpf_ret); > >>>>>>>> + else if (bpf_ret) > >>>>>>>> + tun->vnet_hash.flags &= ~TUN_VNET_HASH_RSS; > >>>>>>> > >>>>>>> Don't make one feature disable another. > >>>>>>> > >>>>>>> TUNSETSTEERINGEBPF and TUNSETVNETHASH are mutually exclusive > >>>>>>> functions. If one is enabled the other call should fail, with EBUSY > >>>>>>> for instance. > >>>>>>> > >>>>>>>> + case TUNSETVNETHASH: > >>>>>>>> + len = sizeof(vnet_hash); > >>>>>>>> + if (copy_from_user(&vnet_hash, argp, len)) { > >>>>>>>> + ret = -EFAULT; > >>>>>>>> + break; > >>>>>>>> + } > >>>>>>>> + > >>>>>>>> + if (((vnet_hash.flags & TUN_VNET_HASH_REPORT) && > >>>>>>>> + (tun->vnet_hdr_sz < sizeof(struct virtio_net_hdr_v1_hash) || > >>>>>>>> + !tun_is_little_endian(tun))) || > >>>>>>>> + vnet_hash.indirection_table_mask >= > >>>>>>>> + TUN_VNET_HASH_MAX_INDIRECTION_TABLE_LENGTH) { > >>>>>>>> + ret = -EINVAL; > >>>>>>>> + break; > >>>>>>>> + } > >>>>>>>> + > >>>>>>>> + argp = (u8 __user *)argp + len; > >>>>>>>> + len = (vnet_hash.indirection_table_mask + 1) * 2; > >>>>>>>> + if (copy_from_user(vnet_hash_indirection_table, argp, len)) { > >>>>>>>> + ret = -EFAULT; > >>>>>>>> + break; > >>>>>>>> + } > >>>>>>>> + > >>>>>>>> + argp = (u8 __user *)argp + len; > >>>>>>>> + len = virtio_net_hash_key_length(vnet_hash.types); > >>>>>>>> + > >>>>>>>> + if (copy_from_user(vnet_hash_key, argp, len)) { > >>>>>>>> + ret = -EFAULT; > >>>>>>>> + break; > >>>>>>>> + } > >>>>>>> > >>>>>>> Probably easier and less error-prone to define a fixed size control > >>>>>>> struct with the max indirection table size. > >>>>>> > >>>>>> I made its size variable because the indirection table and key may grow > >>>>>> in the future as I wrote above. 
> >>>>>> > >>>>>>> > >>>>>>> Btw: please trim the CC: list considerably on future patches. > >>>>>> > >>>>>> I'll do so in the next version with the TUNSETSTEERINGEBPF change you > >>>>>> proposed. > >>>>> > >>>>> To be clear: please don't just resubmit with that one change. > >>>>> > >>>>> The skb and cb issues are quite fundamental issues that need to be resolved. > >>>>> > >>>>> I'd like to understand why adjusting the existing BPF feature for this > >>>>> exact purpose cannot be amended to return the key it produced. > >>>> > >>>> eBPF steering program is not designed for this particular problem in my > >>>> understanding. It was introduced to derive hash values with an > >>>> understanding of application-specific semantics of packets instead of > >>>> generic IP/TCP/UDP semantics. > >>>> > >>>> This problem is rather different in terms that the hash derivation is > >>>> strictly defined by virtio-net. I don't think it makes sense to > >>>> introduce the complexity of BPF when you always run the same code. > >>>> > >>>> It can utilize the existing flow dissector and also make it easier to > >>>> use for the userspace by implementing this in the kernel. > >>> > >>> Ok. There does appear to be overlap in functionality. But it might be > >>> easier to deploy to just have standard Toeplitz available without > >>> having to compile and load an eBPF program. > >>> > >>> As for the sk_buff and cb[] changes. The first is really not needed. > >>> sk_buff simply would not scale if every edge case needs a few bits. > >> > >> An alternative is to move the bit to cb[] and clear it for every code > >> paths that lead to ndo_start_xmit(), but I'm worried that it is error-prone. > >> > >> I think we can put the bit in sk_buff for now. We can implement the > >> alternative when we are short of bits. > > > > I disagree. sk_buff fields add a cost to every code path. They cannot > > be added for every edge case. 
> > It only takes an unused bit and does not grow the sk_buff size so I > think it has practically no cost for now. The problem is that that thinking leads to death by a thousand cuts. "for now" forces the cost of having to think hard how to avoid growing sk_buff onto the next person. Let's do it right from the start. ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [RFC PATCH 5/7] tun: Introduce virtio-net hashing feature 2023-10-09 10:06 ` Willem de Bruijn @ 2023-10-09 10:12 ` Akihiko Odaki 2023-10-09 10:44 ` Willem de Bruijn 0 siblings, 1 reply; 41+ messages in thread From: Akihiko Odaki @ 2023-10-09 10:12 UTC (permalink / raw) To: Willem de Bruijn Cc: Jason Wang, Michael S. Tsirkin, netdev, linux-kernel, kvm, virtualization, linux-kselftest, bpf, davem, kuba, ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend, kpsingh, rdunlap, willemb, gustavoars, herbert, steffen.klassert, nogikh, pablo, decui, jakub, elver, pabeni, Yuri Benditovich On 2023/10/09 19:06, Willem de Bruijn wrote: > On Mon, Oct 9, 2023 at 3:02 AM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: >> >> On 2023/10/09 18:57, Willem de Bruijn wrote: >>> On Mon, Oct 9, 2023 at 3:57 AM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: >>>> >>>> On 2023/10/09 17:04, Willem de Bruijn wrote: >>>>> On Sun, Oct 8, 2023 at 3:46 PM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: >>>>>> >>>>>> On 2023/10/09 5:08, Willem de Bruijn wrote: >>>>>>> On Sun, Oct 8, 2023 at 10:04 PM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: >>>>>>>> >>>>>>>> On 2023/10/09 4:07, Willem de Bruijn wrote: >>>>>>>>> On Sun, Oct 8, 2023 at 7:22 AM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: >>>>>>>>>> >>>>>>>>>> virtio-net have two usage of hashes: one is RSS and another is hash >>>>>>>>>> reporting. Conventionally the hash calculation was done by the VMM. >>>>>>>>>> However, computing the hash after the queue was chosen defeats the >>>>>>>>>> purpose of RSS. >>>>>>>>>> >>>>>>>>>> Another approach is to use eBPF steering program. This approach has >>>>>>>>>> another downside: it cannot report the calculated hash due to the >>>>>>>>>> restrictive nature of eBPF. >>>>>>>>>> >>>>>>>>>> Introduce the code to compute hashes to the kernel in order to overcome >>>>>>>>>> thse challenges. 
An alternative solution is to extend the eBPF steering >>>>>>>>>> program so that it will be able to report to the userspace, but it makes >>>>>>>>>> little sense to allow to implement different hashing algorithms with >>>>>>>>>> eBPF since the hash value reported by virtio-net is strictly defined by >>>>>>>>>> the specification. >>>>>>>>>> >>>>>>>>>> The hash value already stored in sk_buff is not used and computed >>>>>>>>>> independently since it may have been computed in a way not conformant >>>>>>>>>> with the specification. >>>>>>>>>> >>>>>>>>>> Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com> >>>>>>>>>> --- >>>>>>>>> >>>>>>>>>> +static const struct tun_vnet_hash_cap tun_vnet_hash_cap = { >>>>>>>>>> + .max_indirection_table_length = >>>>>>>>>> + TUN_VNET_HASH_MAX_INDIRECTION_TABLE_LENGTH, >>>>>>>>>> + >>>>>>>>>> + .types = VIRTIO_NET_SUPPORTED_HASH_TYPES >>>>>>>>>> +}; >>>>>>>>> >>>>>>>>> No need to have explicit capabilities exchange like this? Tun either >>>>>>>>> supports all or none. >>>>>>>> >>>>>>>> tun does not support VIRTIO_NET_RSS_HASH_TYPE_IP_EX, >>>>>>>> VIRTIO_NET_RSS_HASH_TYPE_TCP_EX, and VIRTIO_NET_RSS_HASH_TYPE_UDP_EX. >>>>>>>> >>>>>>>> It is because the flow dissector does not support IPv6 extensions. The >>>>>>>> specification is also vague, and does not tell how many TLVs should be >>>>>>>> consumed at most when interpreting destination option header so I chose >>>>>>>> to avoid adding code for these hash types to the flow dissector. I doubt >>>>>>>> anyone will complain about it since nobody complains for Linux. >>>>>>>> >>>>>>>> I'm also adding this so that we can extend it later. >>>>>>>> max_indirection_table_length may grow for systems with 128+ CPUs, or >>>>>>>> types may have other bits for new protocols in the future. 
>>>>>>>> >>>>>>>>> >>>>>>>>>> case TUNSETSTEERINGEBPF: >>>>>>>>>> - ret = tun_set_ebpf(tun, &tun->steering_prog, argp); >>>>>>>>>> + bpf_ret = tun_set_ebpf(tun, &tun->steering_prog, argp); >>>>>>>>>> + if (IS_ERR(bpf_ret)) >>>>>>>>>> + ret = PTR_ERR(bpf_ret); >>>>>>>>>> + else if (bpf_ret) >>>>>>>>>> + tun->vnet_hash.flags &= ~TUN_VNET_HASH_RSS; >>>>>>>>> >>>>>>>>> Don't make one feature disable another. >>>>>>>>> >>>>>>>>> TUNSETSTEERINGEBPF and TUNSETVNETHASH are mutually exclusive >>>>>>>>> functions. If one is enabled the other call should fail, with EBUSY >>>>>>>>> for instance. >>>>>>>>> >>>>>>>>>> + case TUNSETVNETHASH: >>>>>>>>>> + len = sizeof(vnet_hash); >>>>>>>>>> + if (copy_from_user(&vnet_hash, argp, len)) { >>>>>>>>>> + ret = -EFAULT; >>>>>>>>>> + break; >>>>>>>>>> + } >>>>>>>>>> + >>>>>>>>>> + if (((vnet_hash.flags & TUN_VNET_HASH_REPORT) && >>>>>>>>>> + (tun->vnet_hdr_sz < sizeof(struct virtio_net_hdr_v1_hash) || >>>>>>>>>> + !tun_is_little_endian(tun))) || >>>>>>>>>> + vnet_hash.indirection_table_mask >= >>>>>>>>>> + TUN_VNET_HASH_MAX_INDIRECTION_TABLE_LENGTH) { >>>>>>>>>> + ret = -EINVAL; >>>>>>>>>> + break; >>>>>>>>>> + } >>>>>>>>>> + >>>>>>>>>> + argp = (u8 __user *)argp + len; >>>>>>>>>> + len = (vnet_hash.indirection_table_mask + 1) * 2; >>>>>>>>>> + if (copy_from_user(vnet_hash_indirection_table, argp, len)) { >>>>>>>>>> + ret = -EFAULT; >>>>>>>>>> + break; >>>>>>>>>> + } >>>>>>>>>> + >>>>>>>>>> + argp = (u8 __user *)argp + len; >>>>>>>>>> + len = virtio_net_hash_key_length(vnet_hash.types); >>>>>>>>>> + >>>>>>>>>> + if (copy_from_user(vnet_hash_key, argp, len)) { >>>>>>>>>> + ret = -EFAULT; >>>>>>>>>> + break; >>>>>>>>>> + } >>>>>>>>> >>>>>>>>> Probably easier and less error-prone to define a fixed size control >>>>>>>>> struct with the max indirection table size. >>>>>>>> >>>>>>>> I made its size variable because the indirection table and key may grow >>>>>>>> in the future as I wrote above. 
>>>>>>>> >>>>>>>>> >>>>>>>>> Btw: please trim the CC: list considerably on future patches. >>>>>>>> >>>>>>>> I'll do so in the next version with the TUNSETSTEERINGEBPF change you >>>>>>>> proposed. >>>>>>> >>>>>>> To be clear: please don't just resubmit with that one change. >>>>>>> >>>>>>> The skb and cb issues are quite fundamental issues that need to be resolved. >>>>>>> >>>>>>> I'd like to understand why adjusting the existing BPF feature for this >>>>>>> exact purpose cannot be amended to return the key it produced. >>>>>> >>>>>> eBPF steering program is not designed for this particular problem in my >>>>>> understanding. It was introduced to derive hash values with an >>>>>> understanding of application-specific semantics of packets instead of >>>>>> generic IP/TCP/UDP semantics. >>>>>> >>>>>> This problem is rather different in terms that the hash derivation is >>>>>> strictly defined by virtio-net. I don't think it makes sense to >>>>>> introduce the complexity of BPF when you always run the same code. >>>>>> >>>>>> It can utilize the existing flow dissector and also make it easier to >>>>>> use for the userspace by implementing this in the kernel. >>>>> >>>>> Ok. There does appear to be overlap in functionality. But it might be >>>>> easier to deploy to just have standard Toeplitz available without >>>>> having to compile and load an eBPF program. >>>>> >>>>> As for the sk_buff and cb[] changes. The first is really not needed. >>>>> sk_buff simply would not scale if every edge case needs a few bits. >>>> >>>> An alternative is to move the bit to cb[] and clear it for every code >>>> paths that lead to ndo_start_xmit(), but I'm worried that it is error-prone. >>>> >>>> I think we can put the bit in sk_buff for now. We can implement the >>>> alternative when we are short of bits. >>> >>> I disagree. sk_buff fields add a cost to every code path. They cannot >>> be added for every edge case. 
>> >> It only takes an unused bit and does not grow the sk_buff size so I >> think it has practically no cost for now. > > The problem is that that thinking leads to death by a thousand cuts. > > "for now" forces the cost of having to think hard how to avoid growing > sk_buff onto the next person. Let's do it right from the start. I see. Earlier I described an alternative: move the bit to cb[] and clear it in all code paths that lead to ndo_start_xmit(). Does that sound good to you? ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [RFC PATCH 5/7] tun: Introduce virtio-net hashing feature 2023-10-09 10:12 ` Akihiko Odaki @ 2023-10-09 10:44 ` Willem de Bruijn 2023-10-10 1:52 ` Akihiko Odaki 0 siblings, 1 reply; 41+ messages in thread From: Willem de Bruijn @ 2023-10-09 10:44 UTC (permalink / raw) To: Akihiko Odaki Cc: Jason Wang, Michael S. Tsirkin, netdev, linux-kernel, kvm, virtualization, linux-kselftest, bpf, davem, kuba, ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend, kpsingh, rdunlap, willemb, gustavoars, herbert, steffen.klassert, nogikh, pablo, decui, jakub, elver, pabeni, Yuri Benditovich On Mon, Oct 9, 2023 at 3:12 AM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: > > On 2023/10/09 19:06, Willem de Bruijn wrote: > > On Mon, Oct 9, 2023 at 3:02 AM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: > >> > >> On 2023/10/09 18:57, Willem de Bruijn wrote: > >>> On Mon, Oct 9, 2023 at 3:57 AM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: > >>>> > >>>> On 2023/10/09 17:04, Willem de Bruijn wrote: > >>>>> On Sun, Oct 8, 2023 at 3:46 PM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: > >>>>>> > >>>>>> On 2023/10/09 5:08, Willem de Bruijn wrote: > >>>>>>> On Sun, Oct 8, 2023 at 10:04 PM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: > >>>>>>>> > >>>>>>>> On 2023/10/09 4:07, Willem de Bruijn wrote: > >>>>>>>>> On Sun, Oct 8, 2023 at 7:22 AM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: > >>>>>>>>>> > >>>>>>>>>> virtio-net have two usage of hashes: one is RSS and another is hash > >>>>>>>>>> reporting. Conventionally the hash calculation was done by the VMM. > >>>>>>>>>> However, computing the hash after the queue was chosen defeats the > >>>>>>>>>> purpose of RSS. > >>>>>>>>>> > >>>>>>>>>> Another approach is to use eBPF steering program. This approach has > >>>>>>>>>> another downside: it cannot report the calculated hash due to the > >>>>>>>>>> restrictive nature of eBPF. 
> >>>>>>>>>> > >>>>>>>>>> Introduce the code to compute hashes to the kernel in order to overcome > >>>>>>>>>> thse challenges. An alternative solution is to extend the eBPF steering > >>>>>>>>>> program so that it will be able to report to the userspace, but it makes > >>>>>>>>>> little sense to allow to implement different hashing algorithms with > >>>>>>>>>> eBPF since the hash value reported by virtio-net is strictly defined by > >>>>>>>>>> the specification. > >>>>>>>>>> > >>>>>>>>>> The hash value already stored in sk_buff is not used and computed > >>>>>>>>>> independently since it may have been computed in a way not conformant > >>>>>>>>>> with the specification. > >>>>>>>>>> > >>>>>>>>>> Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com> > >>>>>>>>>> --- > >>>>>>>>> > >>>>>>>>>> +static const struct tun_vnet_hash_cap tun_vnet_hash_cap = { > >>>>>>>>>> + .max_indirection_table_length = > >>>>>>>>>> + TUN_VNET_HASH_MAX_INDIRECTION_TABLE_LENGTH, > >>>>>>>>>> + > >>>>>>>>>> + .types = VIRTIO_NET_SUPPORTED_HASH_TYPES > >>>>>>>>>> +}; > >>>>>>>>> > >>>>>>>>> No need to have explicit capabilities exchange like this? Tun either > >>>>>>>>> supports all or none. > >>>>>>>> > >>>>>>>> tun does not support VIRTIO_NET_RSS_HASH_TYPE_IP_EX, > >>>>>>>> VIRTIO_NET_RSS_HASH_TYPE_TCP_EX, and VIRTIO_NET_RSS_HASH_TYPE_UDP_EX. > >>>>>>>> > >>>>>>>> It is because the flow dissector does not support IPv6 extensions. The > >>>>>>>> specification is also vague, and does not tell how many TLVs should be > >>>>>>>> consumed at most when interpreting destination option header so I chose > >>>>>>>> to avoid adding code for these hash types to the flow dissector. I doubt > >>>>>>>> anyone will complain about it since nobody complains for Linux. > >>>>>>>> > >>>>>>>> I'm also adding this so that we can extend it later. > >>>>>>>> max_indirection_table_length may grow for systems with 128+ CPUs, or > >>>>>>>> types may have other bits for new protocols in the future. 
> >>>>>>>> > >>>>>>>>> > >>>>>>>>>> case TUNSETSTEERINGEBPF: > >>>>>>>>>> - ret = tun_set_ebpf(tun, &tun->steering_prog, argp); > >>>>>>>>>> + bpf_ret = tun_set_ebpf(tun, &tun->steering_prog, argp); > >>>>>>>>>> + if (IS_ERR(bpf_ret)) > >>>>>>>>>> + ret = PTR_ERR(bpf_ret); > >>>>>>>>>> + else if (bpf_ret) > >>>>>>>>>> + tun->vnet_hash.flags &= ~TUN_VNET_HASH_RSS; > >>>>>>>>> > >>>>>>>>> Don't make one feature disable another. > >>>>>>>>> > >>>>>>>>> TUNSETSTEERINGEBPF and TUNSETVNETHASH are mutually exclusive > >>>>>>>>> functions. If one is enabled the other call should fail, with EBUSY > >>>>>>>>> for instance. > >>>>>>>>> > >>>>>>>>>> + case TUNSETVNETHASH: > >>>>>>>>>> + len = sizeof(vnet_hash); > >>>>>>>>>> + if (copy_from_user(&vnet_hash, argp, len)) { > >>>>>>>>>> + ret = -EFAULT; > >>>>>>>>>> + break; > >>>>>>>>>> + } > >>>>>>>>>> + > >>>>>>>>>> + if (((vnet_hash.flags & TUN_VNET_HASH_REPORT) && > >>>>>>>>>> + (tun->vnet_hdr_sz < sizeof(struct virtio_net_hdr_v1_hash) || > >>>>>>>>>> + !tun_is_little_endian(tun))) || > >>>>>>>>>> + vnet_hash.indirection_table_mask >= > >>>>>>>>>> + TUN_VNET_HASH_MAX_INDIRECTION_TABLE_LENGTH) { > >>>>>>>>>> + ret = -EINVAL; > >>>>>>>>>> + break; > >>>>>>>>>> + } > >>>>>>>>>> + > >>>>>>>>>> + argp = (u8 __user *)argp + len; > >>>>>>>>>> + len = (vnet_hash.indirection_table_mask + 1) * 2; > >>>>>>>>>> + if (copy_from_user(vnet_hash_indirection_table, argp, len)) { > >>>>>>>>>> + ret = -EFAULT; > >>>>>>>>>> + break; > >>>>>>>>>> + } > >>>>>>>>>> + > >>>>>>>>>> + argp = (u8 __user *)argp + len; > >>>>>>>>>> + len = virtio_net_hash_key_length(vnet_hash.types); > >>>>>>>>>> + > >>>>>>>>>> + if (copy_from_user(vnet_hash_key, argp, len)) { > >>>>>>>>>> + ret = -EFAULT; > >>>>>>>>>> + break; > >>>>>>>>>> + } > >>>>>>>>> > >>>>>>>>> Probably easier and less error-prone to define a fixed size control > >>>>>>>>> struct with the max indirection table size. 
> >>>>>>>> > >>>>>>>> I made its size variable because the indirection table and key may grow > >>>>>>>> in the future as I wrote above. > >>>>>>>> > >>>>>>>>> > >>>>>>>>> Btw: please trim the CC: list considerably on future patches. > >>>>>>>> > >>>>>>>> I'll do so in the next version with the TUNSETSTEERINGEBPF change you > >>>>>>>> proposed. > >>>>>>> > >>>>>>> To be clear: please don't just resubmit with that one change. > >>>>>>> > >>>>>>> The skb and cb issues are quite fundamental issues that need to be resolved. > >>>>>>> > >>>>>>> I'd like to understand why adjusting the existing BPF feature for this > >>>>>>> exact purpose cannot be amended to return the key it produced. > >>>>>> > >>>>>> eBPF steering program is not designed for this particular problem in my > >>>>>> understanding. It was introduced to derive hash values with an > >>>>>> understanding of application-specific semantics of packets instead of > >>>>>> generic IP/TCP/UDP semantics. > >>>>>> > >>>>>> This problem is rather different in terms that the hash derivation is > >>>>>> strictly defined by virtio-net. I don't think it makes sense to > >>>>>> introduce the complexity of BPF when you always run the same code. > >>>>>> > >>>>>> It can utilize the existing flow dissector and also make it easier to > >>>>>> use for the userspace by implementing this in the kernel. > >>>>> > >>>>> Ok. There does appear to be overlap in functionality. But it might be > >>>>> easier to deploy to just have standard Toeplitz available without > >>>>> having to compile and load an eBPF program. > >>>>> > >>>>> As for the sk_buff and cb[] changes. The first is really not needed. > >>>>> sk_buff simply would not scale if every edge case needs a few bits. > >>>> > >>>> An alternative is to move the bit to cb[] and clear it for every code > >>>> paths that lead to ndo_start_xmit(), but I'm worried that it is error-prone. > >>>> > >>>> I think we can put the bit in sk_buff for now. 
We can implement the
> >>>> alternative when we are short of bits.
> >>>
> >>> I disagree. sk_buff fields add a cost to every code path. They cannot
> >>> be added for every edge case.
> >>
> >> It only takes an unused bit and does not grow the sk_buff size so I
> >> think it has practically no cost for now.
> >
> > The problem is that that thinking leads to death by a thousand cuts.
> >
> > "for now" forces the cost of having to think hard how to avoid growing
> > sk_buff onto the next person. Let's do it right from the start.
>
> I see. I described an alternative to move the bit to cb[] and clear it
> in all code paths that leads to ndo_start_xmit() earlier. Does that
> sound good to you?

If you use the control block to pass information between
__dev_queue_xmit on the tun device and tun_net_xmit, using gso_skb_cb,
the field can be left undefined in all non-tun paths. tun_select_queue
can initialize.

I would still use skb->hash to encode the hash. The hash type of that
field is not strictly defined. It can be siphash from ___skb_get_hash
or a device hash, which most likely also uses Toeplitz. Then you also
don't run into the problem of growing the struct size.

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: [RFC PATCH 5/7] tun: Introduce virtio-net hashing feature 2023-10-09 10:44 ` Willem de Bruijn @ 2023-10-10 1:52 ` Akihiko Odaki 2023-10-10 5:45 ` Jason Wang 0 siblings, 1 reply; 41+ messages in thread From: Akihiko Odaki @ 2023-10-10 1:52 UTC (permalink / raw) To: Willem de Bruijn Cc: Jason Wang, Michael S. Tsirkin, netdev, linux-kernel, kvm, virtualization, linux-kselftest, bpf, davem, kuba, ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend, kpsingh, rdunlap, willemb, gustavoars, herbert, steffen.klassert, nogikh, pablo, decui, jakub, elver, pabeni, Yuri Benditovich On 2023/10/09 19:44, Willem de Bruijn wrote: > On Mon, Oct 9, 2023 at 3:12 AM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: >> >> On 2023/10/09 19:06, Willem de Bruijn wrote: >>> On Mon, Oct 9, 2023 at 3:02 AM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: >>>> >>>> On 2023/10/09 18:57, Willem de Bruijn wrote: >>>>> On Mon, Oct 9, 2023 at 3:57 AM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: >>>>>> >>>>>> On 2023/10/09 17:04, Willem de Bruijn wrote: >>>>>>> On Sun, Oct 8, 2023 at 3:46 PM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: >>>>>>>> >>>>>>>> On 2023/10/09 5:08, Willem de Bruijn wrote: >>>>>>>>> On Sun, Oct 8, 2023 at 10:04 PM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: >>>>>>>>>> >>>>>>>>>> On 2023/10/09 4:07, Willem de Bruijn wrote: >>>>>>>>>>> On Sun, Oct 8, 2023 at 7:22 AM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: >>>>>>>>>>>> >>>>>>>>>>>> virtio-net have two usage of hashes: one is RSS and another is hash >>>>>>>>>>>> reporting. Conventionally the hash calculation was done by the VMM. >>>>>>>>>>>> However, computing the hash after the queue was chosen defeats the >>>>>>>>>>>> purpose of RSS. >>>>>>>>>>>> >>>>>>>>>>>> Another approach is to use eBPF steering program. This approach has >>>>>>>>>>>> another downside: it cannot report the calculated hash due to the >>>>>>>>>>>> restrictive nature of eBPF. 
>>>>>>>>>>>> >>>>>>>>>>>> Introduce the code to compute hashes to the kernel in order to overcome >>>>>>>>>>>> thse challenges. An alternative solution is to extend the eBPF steering >>>>>>>>>>>> program so that it will be able to report to the userspace, but it makes >>>>>>>>>>>> little sense to allow to implement different hashing algorithms with >>>>>>>>>>>> eBPF since the hash value reported by virtio-net is strictly defined by >>>>>>>>>>>> the specification. >>>>>>>>>>>> >>>>>>>>>>>> The hash value already stored in sk_buff is not used and computed >>>>>>>>>>>> independently since it may have been computed in a way not conformant >>>>>>>>>>>> with the specification. >>>>>>>>>>>> >>>>>>>>>>>> Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com> >>>>>>>>>>>> --- >>>>>>>>>>> >>>>>>>>>>>> +static const struct tun_vnet_hash_cap tun_vnet_hash_cap = { >>>>>>>>>>>> + .max_indirection_table_length = >>>>>>>>>>>> + TUN_VNET_HASH_MAX_INDIRECTION_TABLE_LENGTH, >>>>>>>>>>>> + >>>>>>>>>>>> + .types = VIRTIO_NET_SUPPORTED_HASH_TYPES >>>>>>>>>>>> +}; >>>>>>>>>>> >>>>>>>>>>> No need to have explicit capabilities exchange like this? Tun either >>>>>>>>>>> supports all or none. >>>>>>>>>> >>>>>>>>>> tun does not support VIRTIO_NET_RSS_HASH_TYPE_IP_EX, >>>>>>>>>> VIRTIO_NET_RSS_HASH_TYPE_TCP_EX, and VIRTIO_NET_RSS_HASH_TYPE_UDP_EX. >>>>>>>>>> >>>>>>>>>> It is because the flow dissector does not support IPv6 extensions. The >>>>>>>>>> specification is also vague, and does not tell how many TLVs should be >>>>>>>>>> consumed at most when interpreting destination option header so I chose >>>>>>>>>> to avoid adding code for these hash types to the flow dissector. I doubt >>>>>>>>>> anyone will complain about it since nobody complains for Linux. >>>>>>>>>> >>>>>>>>>> I'm also adding this so that we can extend it later. >>>>>>>>>> max_indirection_table_length may grow for systems with 128+ CPUs, or >>>>>>>>>> types may have other bits for new protocols in the future. 
>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>>> case TUNSETSTEERINGEBPF: >>>>>>>>>>>> - ret = tun_set_ebpf(tun, &tun->steering_prog, argp); >>>>>>>>>>>> + bpf_ret = tun_set_ebpf(tun, &tun->steering_prog, argp); >>>>>>>>>>>> + if (IS_ERR(bpf_ret)) >>>>>>>>>>>> + ret = PTR_ERR(bpf_ret); >>>>>>>>>>>> + else if (bpf_ret) >>>>>>>>>>>> + tun->vnet_hash.flags &= ~TUN_VNET_HASH_RSS; >>>>>>>>>>> >>>>>>>>>>> Don't make one feature disable another. >>>>>>>>>>> >>>>>>>>>>> TUNSETSTEERINGEBPF and TUNSETVNETHASH are mutually exclusive >>>>>>>>>>> functions. If one is enabled the other call should fail, with EBUSY >>>>>>>>>>> for instance. >>>>>>>>>>> >>>>>>>>>>>> + case TUNSETVNETHASH: >>>>>>>>>>>> + len = sizeof(vnet_hash); >>>>>>>>>>>> + if (copy_from_user(&vnet_hash, argp, len)) { >>>>>>>>>>>> + ret = -EFAULT; >>>>>>>>>>>> + break; >>>>>>>>>>>> + } >>>>>>>>>>>> + >>>>>>>>>>>> + if (((vnet_hash.flags & TUN_VNET_HASH_REPORT) && >>>>>>>>>>>> + (tun->vnet_hdr_sz < sizeof(struct virtio_net_hdr_v1_hash) || >>>>>>>>>>>> + !tun_is_little_endian(tun))) || >>>>>>>>>>>> + vnet_hash.indirection_table_mask >= >>>>>>>>>>>> + TUN_VNET_HASH_MAX_INDIRECTION_TABLE_LENGTH) { >>>>>>>>>>>> + ret = -EINVAL; >>>>>>>>>>>> + break; >>>>>>>>>>>> + } >>>>>>>>>>>> + >>>>>>>>>>>> + argp = (u8 __user *)argp + len; >>>>>>>>>>>> + len = (vnet_hash.indirection_table_mask + 1) * 2; >>>>>>>>>>>> + if (copy_from_user(vnet_hash_indirection_table, argp, len)) { >>>>>>>>>>>> + ret = -EFAULT; >>>>>>>>>>>> + break; >>>>>>>>>>>> + } >>>>>>>>>>>> + >>>>>>>>>>>> + argp = (u8 __user *)argp + len; >>>>>>>>>>>> + len = virtio_net_hash_key_length(vnet_hash.types); >>>>>>>>>>>> + >>>>>>>>>>>> + if (copy_from_user(vnet_hash_key, argp, len)) { >>>>>>>>>>>> + ret = -EFAULT; >>>>>>>>>>>> + break; >>>>>>>>>>>> + } >>>>>>>>>>> >>>>>>>>>>> Probably easier and less error-prone to define a fixed size control >>>>>>>>>>> struct with the max indirection table size. 
>>>>>>>>>> >>>>>>>>>> I made its size variable because the indirection table and key may grow >>>>>>>>>> in the future as I wrote above. >>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> Btw: please trim the CC: list considerably on future patches. >>>>>>>>>> >>>>>>>>>> I'll do so in the next version with the TUNSETSTEERINGEBPF change you >>>>>>>>>> proposed. >>>>>>>>> >>>>>>>>> To be clear: please don't just resubmit with that one change. >>>>>>>>> >>>>>>>>> The skb and cb issues are quite fundamental issues that need to be resolved. >>>>>>>>> >>>>>>>>> I'd like to understand why adjusting the existing BPF feature for this >>>>>>>>> exact purpose cannot be amended to return the key it produced. >>>>>>>> >>>>>>>> eBPF steering program is not designed for this particular problem in my >>>>>>>> understanding. It was introduced to derive hash values with an >>>>>>>> understanding of application-specific semantics of packets instead of >>>>>>>> generic IP/TCP/UDP semantics. >>>>>>>> >>>>>>>> This problem is rather different in terms that the hash derivation is >>>>>>>> strictly defined by virtio-net. I don't think it makes sense to >>>>>>>> introduce the complexity of BPF when you always run the same code. >>>>>>>> >>>>>>>> It can utilize the existing flow dissector and also make it easier to >>>>>>>> use for the userspace by implementing this in the kernel. >>>>>>> >>>>>>> Ok. There does appear to be overlap in functionality. But it might be >>>>>>> easier to deploy to just have standard Toeplitz available without >>>>>>> having to compile and load an eBPF program. >>>>>>> >>>>>>> As for the sk_buff and cb[] changes. The first is really not needed. >>>>>>> sk_buff simply would not scale if every edge case needs a few bits. >>>>>> >>>>>> An alternative is to move the bit to cb[] and clear it for every code >>>>>> paths that lead to ndo_start_xmit(), but I'm worried that it is error-prone. >>>>>> >>>>>> I think we can put the bit in sk_buff for now. 
We can implement the
>>>>>> alternative when we are short of bits.
>>>>>
>>>>> I disagree. sk_buff fields add a cost to every code path. They cannot
>>>>> be added for every edge case.
>>>>
>>>> It only takes an unused bit and does not grow the sk_buff size so I
>>>> think it has practically no cost for now.
>>>
>>> The problem is that that thinking leads to death by a thousand cuts.
>>>
>>> "for now" forces the cost of having to think hard how to avoid growing
>>> sk_buff onto the next person. Let's do it right from the start.
>>
>> I see. I described an alternative to move the bit to cb[] and clear it
>> in all code paths that leads to ndo_start_xmit() earlier. Does that
>> sound good to you?
>
> If you use the control block to pass information between
> __dev_queue_xmit on the tun device and tun_net_xmit, using gso_skb_cb,
> the field can be left undefined in all non-tun paths. tun_select_queue
> can initialize.

The problem is that tun_select_queue() is not always called.
netdev_core_pick_tx() ensures dev->real_num_tx_queues != 1 before
calling it, but this variable may change later and result in a race
condition. Another case is XDP with a predefined queue.

>
> I would still use skb->hash to encode the hash. The hash type of that
> field is not strictly defined. It can be siphash from ___skb_get_hash
> or a device hash, which most likely also uses Toeplitz. Then you also
> don't run into the problem of growing the struct size.

I'm concerned exactly because it's not strictly defined. Someone may
decide to overwrite it later if we are not cautious enough. qdisc_skb_cb
also has sufficient space to contain both of the hash value and type.

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: [RFC PATCH 5/7] tun: Introduce virtio-net hashing feature 2023-10-10 1:52 ` Akihiko Odaki @ 2023-10-10 5:45 ` Jason Wang 2023-10-10 5:51 ` Akihiko Odaki 0 siblings, 1 reply; 41+ messages in thread From: Jason Wang @ 2023-10-10 5:45 UTC (permalink / raw) To: Akihiko Odaki Cc: Willem de Bruijn, Michael S. Tsirkin, netdev, linux-kernel, kvm, virtualization, linux-kselftest, bpf, davem, kuba, ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend, kpsingh, rdunlap, willemb, gustavoars, herbert, steffen.klassert, nogikh, pablo, decui, jakub, elver, pabeni, Yuri Benditovich On Tue, Oct 10, 2023 at 9:52 AM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: > > On 2023/10/09 19:44, Willem de Bruijn wrote: > > On Mon, Oct 9, 2023 at 3:12 AM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: > >> > >> On 2023/10/09 19:06, Willem de Bruijn wrote: > >>> On Mon, Oct 9, 2023 at 3:02 AM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: > >>>> > >>>> On 2023/10/09 18:57, Willem de Bruijn wrote: > >>>>> On Mon, Oct 9, 2023 at 3:57 AM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: > >>>>>> > >>>>>> On 2023/10/09 17:04, Willem de Bruijn wrote: > >>>>>>> On Sun, Oct 8, 2023 at 3:46 PM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: > >>>>>>>> > >>>>>>>> On 2023/10/09 5:08, Willem de Bruijn wrote: > >>>>>>>>> On Sun, Oct 8, 2023 at 10:04 PM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: > >>>>>>>>>> > >>>>>>>>>> On 2023/10/09 4:07, Willem de Bruijn wrote: > >>>>>>>>>>> On Sun, Oct 8, 2023 at 7:22 AM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: > >>>>>>>>>>>> > >>>>>>>>>>>> virtio-net have two usage of hashes: one is RSS and another is hash > >>>>>>>>>>>> reporting. Conventionally the hash calculation was done by the VMM. > >>>>>>>>>>>> However, computing the hash after the queue was chosen defeats the > >>>>>>>>>>>> purpose of RSS. > >>>>>>>>>>>> > >>>>>>>>>>>> Another approach is to use eBPF steering program. 
This approach has > >>>>>>>>>>>> another downside: it cannot report the calculated hash due to the > >>>>>>>>>>>> restrictive nature of eBPF. > >>>>>>>>>>>> > >>>>>>>>>>>> Introduce the code to compute hashes to the kernel in order to overcome > >>>>>>>>>>>> thse challenges. An alternative solution is to extend the eBPF steering > >>>>>>>>>>>> program so that it will be able to report to the userspace, but it makes > >>>>>>>>>>>> little sense to allow to implement different hashing algorithms with > >>>>>>>>>>>> eBPF since the hash value reported by virtio-net is strictly defined by > >>>>>>>>>>>> the specification. > >>>>>>>>>>>> > >>>>>>>>>>>> The hash value already stored in sk_buff is not used and computed > >>>>>>>>>>>> independently since it may have been computed in a way not conformant > >>>>>>>>>>>> with the specification. > >>>>>>>>>>>> > >>>>>>>>>>>> Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com> > >>>>>>>>>>>> --- > >>>>>>>>>>> > >>>>>>>>>>>> +static const struct tun_vnet_hash_cap tun_vnet_hash_cap = { > >>>>>>>>>>>> + .max_indirection_table_length = > >>>>>>>>>>>> + TUN_VNET_HASH_MAX_INDIRECTION_TABLE_LENGTH, > >>>>>>>>>>>> + > >>>>>>>>>>>> + .types = VIRTIO_NET_SUPPORTED_HASH_TYPES > >>>>>>>>>>>> +}; > >>>>>>>>>>> > >>>>>>>>>>> No need to have explicit capabilities exchange like this? Tun either > >>>>>>>>>>> supports all or none. > >>>>>>>>>> > >>>>>>>>>> tun does not support VIRTIO_NET_RSS_HASH_TYPE_IP_EX, > >>>>>>>>>> VIRTIO_NET_RSS_HASH_TYPE_TCP_EX, and VIRTIO_NET_RSS_HASH_TYPE_UDP_EX. > >>>>>>>>>> > >>>>>>>>>> It is because the flow dissector does not support IPv6 extensions. The > >>>>>>>>>> specification is also vague, and does not tell how many TLVs should be > >>>>>>>>>> consumed at most when interpreting destination option header so I chose > >>>>>>>>>> to avoid adding code for these hash types to the flow dissector. I doubt > >>>>>>>>>> anyone will complain about it since nobody complains for Linux. 
> >>>>>>>>>> > >>>>>>>>>> I'm also adding this so that we can extend it later. > >>>>>>>>>> max_indirection_table_length may grow for systems with 128+ CPUs, or > >>>>>>>>>> types may have other bits for new protocols in the future. > >>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>>> case TUNSETSTEERINGEBPF: > >>>>>>>>>>>> - ret = tun_set_ebpf(tun, &tun->steering_prog, argp); > >>>>>>>>>>>> + bpf_ret = tun_set_ebpf(tun, &tun->steering_prog, argp); > >>>>>>>>>>>> + if (IS_ERR(bpf_ret)) > >>>>>>>>>>>> + ret = PTR_ERR(bpf_ret); > >>>>>>>>>>>> + else if (bpf_ret) > >>>>>>>>>>>> + tun->vnet_hash.flags &= ~TUN_VNET_HASH_RSS; > >>>>>>>>>>> > >>>>>>>>>>> Don't make one feature disable another. > >>>>>>>>>>> > >>>>>>>>>>> TUNSETSTEERINGEBPF and TUNSETVNETHASH are mutually exclusive > >>>>>>>>>>> functions. If one is enabled the other call should fail, with EBUSY > >>>>>>>>>>> for instance. > >>>>>>>>>>> > >>>>>>>>>>>> + case TUNSETVNETHASH: > >>>>>>>>>>>> + len = sizeof(vnet_hash); > >>>>>>>>>>>> + if (copy_from_user(&vnet_hash, argp, len)) { > >>>>>>>>>>>> + ret = -EFAULT; > >>>>>>>>>>>> + break; > >>>>>>>>>>>> + } > >>>>>>>>>>>> + > >>>>>>>>>>>> + if (((vnet_hash.flags & TUN_VNET_HASH_REPORT) && > >>>>>>>>>>>> + (tun->vnet_hdr_sz < sizeof(struct virtio_net_hdr_v1_hash) || > >>>>>>>>>>>> + !tun_is_little_endian(tun))) || > >>>>>>>>>>>> + vnet_hash.indirection_table_mask >= > >>>>>>>>>>>> + TUN_VNET_HASH_MAX_INDIRECTION_TABLE_LENGTH) { > >>>>>>>>>>>> + ret = -EINVAL; > >>>>>>>>>>>> + break; > >>>>>>>>>>>> + } > >>>>>>>>>>>> + > >>>>>>>>>>>> + argp = (u8 __user *)argp + len; > >>>>>>>>>>>> + len = (vnet_hash.indirection_table_mask + 1) * 2; > >>>>>>>>>>>> + if (copy_from_user(vnet_hash_indirection_table, argp, len)) { > >>>>>>>>>>>> + ret = -EFAULT; > >>>>>>>>>>>> + break; > >>>>>>>>>>>> + } > >>>>>>>>>>>> + > >>>>>>>>>>>> + argp = (u8 __user *)argp + len; > >>>>>>>>>>>> + len = virtio_net_hash_key_length(vnet_hash.types); > >>>>>>>>>>>> + > >>>>>>>>>>>> + if 
(copy_from_user(vnet_hash_key, argp, len)) { > >>>>>>>>>>>> + ret = -EFAULT; > >>>>>>>>>>>> + break; > >>>>>>>>>>>> + } > >>>>>>>>>>> > >>>>>>>>>>> Probably easier and less error-prone to define a fixed size control > >>>>>>>>>>> struct with the max indirection table size. > >>>>>>>>>> > >>>>>>>>>> I made its size variable because the indirection table and key may grow > >>>>>>>>>> in the future as I wrote above. > >>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> Btw: please trim the CC: list considerably on future patches. > >>>>>>>>>> > >>>>>>>>>> I'll do so in the next version with the TUNSETSTEERINGEBPF change you > >>>>>>>>>> proposed. > >>>>>>>>> > >>>>>>>>> To be clear: please don't just resubmit with that one change. > >>>>>>>>> > >>>>>>>>> The skb and cb issues are quite fundamental issues that need to be resolved. > >>>>>>>>> > >>>>>>>>> I'd like to understand why adjusting the existing BPF feature for this > >>>>>>>>> exact purpose cannot be amended to return the key it produced. > >>>>>>>> > >>>>>>>> eBPF steering program is not designed for this particular problem in my > >>>>>>>> understanding. It was introduced to derive hash values with an > >>>>>>>> understanding of application-specific semantics of packets instead of > >>>>>>>> generic IP/TCP/UDP semantics. > >>>>>>>> > >>>>>>>> This problem is rather different in terms that the hash derivation is > >>>>>>>> strictly defined by virtio-net. I don't think it makes sense to > >>>>>>>> introduce the complexity of BPF when you always run the same code. > >>>>>>>> > >>>>>>>> It can utilize the existing flow dissector and also make it easier to > >>>>>>>> use for the userspace by implementing this in the kernel. > >>>>>>> > >>>>>>> Ok. There does appear to be overlap in functionality. But it might be > >>>>>>> easier to deploy to just have standard Toeplitz available without > >>>>>>> having to compile and load an eBPF program. > >>>>>>> > >>>>>>> As for the sk_buff and cb[] changes. 
The first is really not needed. > >>>>>>> sk_buff simply would not scale if every edge case needs a few bits. > >>>>>> > >>>>>> An alternative is to move the bit to cb[] and clear it for every code > >>>>>> paths that lead to ndo_start_xmit(), but I'm worried that it is error-prone. > >>>>>> > >>>>>> I think we can put the bit in sk_buff for now. We can implement the > >>>>>> alternative when we are short of bits. > >>>>> > >>>>> I disagree. sk_buff fields add a cost to every code path. They cannot > >>>>> be added for every edge case. > >>>> > >>>> It only takes an unused bit and does not grow the sk_buff size so I > >>>> think it has practically no cost for now. > >>> > >>> The problem is that that thinking leads to death by a thousand cuts. > >>> > >>> "for now" forces the cost of having to think hard how to avoid growing > >>> sk_buff onto the next person. Let's do it right from the start. > >> > >> I see. I described an alternative to move the bit to cb[] and clear it > >> in all code paths that leads to ndo_start_xmit() earlier. Does that > >> sound good to you? > > > > If you use the control block to pass information between > > __dev_queue_xmit on the tun device and tun_net_xmit, using gso_skb_cb, > > the field can be left undefined in all non-tun paths. tun_select_queue > > can initialize. > > The problem is that tun_select_queue() is not always called. > netdev_core_pick_tx() ensures dev->real_num_tx_queues != 1 before > calling it, but this variable may change later and result in a race > condition. Another case is that XDP with predefined queue. > > > > > I would still use skb->hash to encode the hash. That hash type of that > > field is not strictly defined. It can be siphash from ___skb_get_hash > > or a device hash, which most likely also uses Toeplitz. Then you also > > don't run into the problem of growing the struct size. > > I'm concerned exactly because it's not strictly defined. 
Someone may
> decide to overwrite it later if we are not cautious enough. qdisc_skb_cb
> also has sufficient space to contain both of the hash value and type.

How about using skb extensions?

Thanks

>

^ permalink raw reply	[flat|nested] 41+ messages in thread
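A rough sketch of what the skb-extension route suggested here might look like. This is hypothetical, non-compilable illustration only: the extension id `SKB_EXT_TUN_VNET_HASH` and the payload struct are invented for this sketch, while `skb_ext_add()` and `skb_ext_find()` are the existing kernel API (a new id would also need a size entry in net/core/skbuff.c).

```c
/* Hypothetical payload carried as an skb extension. */
struct tun_vnet_hash_ext {
	u32 value;	/* computed virtio-net hash */
	u16 report;	/* VIRTIO_NET_HASH_REPORT_* type */
};

/* Steering path: attach the computed hash to the skb. */
static void tun_store_hash(struct sk_buff *skb, u32 value, u16 report)
{
	struct tun_vnet_hash_ext *ext =
		skb_ext_add(skb, SKB_EXT_TUN_VNET_HASH);

	if (ext) {
		ext->value = value;
		ext->report = report;
	}
}

/* Transmit path: consume the hash, if one was attached. */
static const struct tun_vnet_hash_ext *tun_find_hash(const struct sk_buff *skb)
{
	return skb_ext_find(skb, SKB_EXT_TUN_VNET_HASH);
}
```

Unlike a bit in sk_buff or cb[], an extension is reference-counted per skb and simply absent on every non-tun path, which addresses both objections raised above.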
* Re: [RFC PATCH 5/7] tun: Introduce virtio-net hashing feature 2023-10-10 5:45 ` Jason Wang @ 2023-10-10 5:51 ` Akihiko Odaki 2023-10-10 6:00 ` Jason Wang 0 siblings, 1 reply; 41+ messages in thread From: Akihiko Odaki @ 2023-10-10 5:51 UTC (permalink / raw) To: Jason Wang Cc: Willem de Bruijn, Michael S. Tsirkin, netdev, linux-kernel, kvm, virtualization, linux-kselftest, bpf, davem, kuba, ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend, kpsingh, rdunlap, willemb, gustavoars, herbert, steffen.klassert, nogikh, pablo, decui, jakub, elver, pabeni, Yuri Benditovich On 2023/10/10 14:45, Jason Wang wrote: > On Tue, Oct 10, 2023 at 9:52 AM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: >> >> On 2023/10/09 19:44, Willem de Bruijn wrote: >>> On Mon, Oct 9, 2023 at 3:12 AM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: >>>> >>>> On 2023/10/09 19:06, Willem de Bruijn wrote: >>>>> On Mon, Oct 9, 2023 at 3:02 AM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: >>>>>> >>>>>> On 2023/10/09 18:57, Willem de Bruijn wrote: >>>>>>> On Mon, Oct 9, 2023 at 3:57 AM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: >>>>>>>> >>>>>>>> On 2023/10/09 17:04, Willem de Bruijn wrote: >>>>>>>>> On Sun, Oct 8, 2023 at 3:46 PM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: >>>>>>>>>> >>>>>>>>>> On 2023/10/09 5:08, Willem de Bruijn wrote: >>>>>>>>>>> On Sun, Oct 8, 2023 at 10:04 PM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: >>>>>>>>>>>> >>>>>>>>>>>> On 2023/10/09 4:07, Willem de Bruijn wrote: >>>>>>>>>>>>> On Sun, Oct 8, 2023 at 7:22 AM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>> virtio-net have two usage of hashes: one is RSS and another is hash >>>>>>>>>>>>>> reporting. Conventionally the hash calculation was done by the VMM. >>>>>>>>>>>>>> However, computing the hash after the queue was chosen defeats the >>>>>>>>>>>>>> purpose of RSS. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Another approach is to use eBPF steering program. 
This approach has >>>>>>>>>>>>>> another downside: it cannot report the calculated hash due to the >>>>>>>>>>>>>> restrictive nature of eBPF. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Introduce the code to compute hashes to the kernel in order to overcome >>>>>>>>>>>>>> thse challenges. An alternative solution is to extend the eBPF steering >>>>>>>>>>>>>> program so that it will be able to report to the userspace, but it makes >>>>>>>>>>>>>> little sense to allow to implement different hashing algorithms with >>>>>>>>>>>>>> eBPF since the hash value reported by virtio-net is strictly defined by >>>>>>>>>>>>>> the specification. >>>>>>>>>>>>>> >>>>>>>>>>>>>> The hash value already stored in sk_buff is not used and computed >>>>>>>>>>>>>> independently since it may have been computed in a way not conformant >>>>>>>>>>>>>> with the specification. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com> >>>>>>>>>>>>>> --- >>>>>>>>>>>>> >>>>>>>>>>>>>> +static const struct tun_vnet_hash_cap tun_vnet_hash_cap = { >>>>>>>>>>>>>> + .max_indirection_table_length = >>>>>>>>>>>>>> + TUN_VNET_HASH_MAX_INDIRECTION_TABLE_LENGTH, >>>>>>>>>>>>>> + >>>>>>>>>>>>>> + .types = VIRTIO_NET_SUPPORTED_HASH_TYPES >>>>>>>>>>>>>> +}; >>>>>>>>>>>>> >>>>>>>>>>>>> No need to have explicit capabilities exchange like this? Tun either >>>>>>>>>>>>> supports all or none. >>>>>>>>>>>> >>>>>>>>>>>> tun does not support VIRTIO_NET_RSS_HASH_TYPE_IP_EX, >>>>>>>>>>>> VIRTIO_NET_RSS_HASH_TYPE_TCP_EX, and VIRTIO_NET_RSS_HASH_TYPE_UDP_EX. >>>>>>>>>>>> >>>>>>>>>>>> It is because the flow dissector does not support IPv6 extensions. The >>>>>>>>>>>> specification is also vague, and does not tell how many TLVs should be >>>>>>>>>>>> consumed at most when interpreting destination option header so I chose >>>>>>>>>>>> to avoid adding code for these hash types to the flow dissector. I doubt >>>>>>>>>>>> anyone will complain about it since nobody complains for Linux. 
>>>>>>>>>>>> >>>>>>>>>>>> I'm also adding this so that we can extend it later. >>>>>>>>>>>> max_indirection_table_length may grow for systems with 128+ CPUs, or >>>>>>>>>>>> types may have other bits for new protocols in the future. >>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>>> case TUNSETSTEERINGEBPF: >>>>>>>>>>>>>> - ret = tun_set_ebpf(tun, &tun->steering_prog, argp); >>>>>>>>>>>>>> + bpf_ret = tun_set_ebpf(tun, &tun->steering_prog, argp); >>>>>>>>>>>>>> + if (IS_ERR(bpf_ret)) >>>>>>>>>>>>>> + ret = PTR_ERR(bpf_ret); >>>>>>>>>>>>>> + else if (bpf_ret) >>>>>>>>>>>>>> + tun->vnet_hash.flags &= ~TUN_VNET_HASH_RSS; >>>>>>>>>>>>> >>>>>>>>>>>>> Don't make one feature disable another. >>>>>>>>>>>>> >>>>>>>>>>>>> TUNSETSTEERINGEBPF and TUNSETVNETHASH are mutually exclusive >>>>>>>>>>>>> functions. If one is enabled the other call should fail, with EBUSY >>>>>>>>>>>>> for instance. >>>>>>>>>>>>> >>>>>>>>>>>>>> + case TUNSETVNETHASH: >>>>>>>>>>>>>> + len = sizeof(vnet_hash); >>>>>>>>>>>>>> + if (copy_from_user(&vnet_hash, argp, len)) { >>>>>>>>>>>>>> + ret = -EFAULT; >>>>>>>>>>>>>> + break; >>>>>>>>>>>>>> + } >>>>>>>>>>>>>> + >>>>>>>>>>>>>> + if (((vnet_hash.flags & TUN_VNET_HASH_REPORT) && >>>>>>>>>>>>>> + (tun->vnet_hdr_sz < sizeof(struct virtio_net_hdr_v1_hash) || >>>>>>>>>>>>>> + !tun_is_little_endian(tun))) || >>>>>>>>>>>>>> + vnet_hash.indirection_table_mask >= >>>>>>>>>>>>>> + TUN_VNET_HASH_MAX_INDIRECTION_TABLE_LENGTH) { >>>>>>>>>>>>>> + ret = -EINVAL; >>>>>>>>>>>>>> + break; >>>>>>>>>>>>>> + } >>>>>>>>>>>>>> + >>>>>>>>>>>>>> + argp = (u8 __user *)argp + len; >>>>>>>>>>>>>> + len = (vnet_hash.indirection_table_mask + 1) * 2; >>>>>>>>>>>>>> + if (copy_from_user(vnet_hash_indirection_table, argp, len)) { >>>>>>>>>>>>>> + ret = -EFAULT; >>>>>>>>>>>>>> + break; >>>>>>>>>>>>>> + } >>>>>>>>>>>>>> + >>>>>>>>>>>>>> + argp = (u8 __user *)argp + len; >>>>>>>>>>>>>> + len = virtio_net_hash_key_length(vnet_hash.types); >>>>>>>>>>>>>> + >>>>>>>>>>>>>> + if 
(copy_from_user(vnet_hash_key, argp, len)) { >>>>>>>>>>>>>> + ret = -EFAULT; >>>>>>>>>>>>>> + break; >>>>>>>>>>>>>> + } >>>>>>>>>>>>> >>>>>>>>>>>>> Probably easier and less error-prone to define a fixed size control >>>>>>>>>>>>> struct with the max indirection table size. >>>>>>>>>>>> >>>>>>>>>>>> I made its size variable because the indirection table and key may grow >>>>>>>>>>>> in the future as I wrote above. >>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> Btw: please trim the CC: list considerably on future patches. >>>>>>>>>>>> >>>>>>>>>>>> I'll do so in the next version with the TUNSETSTEERINGEBPF change you >>>>>>>>>>>> proposed. >>>>>>>>>>> >>>>>>>>>>> To be clear: please don't just resubmit with that one change. >>>>>>>>>>> >>>>>>>>>>> The skb and cb issues are quite fundamental issues that need to be resolved. >>>>>>>>>>> >>>>>>>>>>> I'd like to understand why adjusting the existing BPF feature for this >>>>>>>>>>> exact purpose cannot be amended to return the key it produced. >>>>>>>>>> >>>>>>>>>> eBPF steering program is not designed for this particular problem in my >>>>>>>>>> understanding. It was introduced to derive hash values with an >>>>>>>>>> understanding of application-specific semantics of packets instead of >>>>>>>>>> generic IP/TCP/UDP semantics. >>>>>>>>>> >>>>>>>>>> This problem is rather different in terms that the hash derivation is >>>>>>>>>> strictly defined by virtio-net. I don't think it makes sense to >>>>>>>>>> introduce the complexity of BPF when you always run the same code. >>>>>>>>>> >>>>>>>>>> It can utilize the existing flow dissector and also make it easier to >>>>>>>>>> use for the userspace by implementing this in the kernel. >>>>>>>>> >>>>>>>>> Ok. There does appear to be overlap in functionality. But it might be >>>>>>>>> easier to deploy to just have standard Toeplitz available without >>>>>>>>> having to compile and load an eBPF program. >>>>>>>>> >>>>>>>>> As for the sk_buff and cb[] changes. 
The first is really not needed. >>>>>>>>> sk_buff simply would not scale if every edge case needs a few bits. >>>>>>>> >>>>>>>> An alternative is to move the bit to cb[] and clear it for every code >>>>>>>> paths that lead to ndo_start_xmit(), but I'm worried that it is error-prone. >>>>>>>> >>>>>>>> I think we can put the bit in sk_buff for now. We can implement the >>>>>>>> alternative when we are short of bits. >>>>>>> >>>>>>> I disagree. sk_buff fields add a cost to every code path. They cannot >>>>>>> be added for every edge case. >>>>>> >>>>>> It only takes an unused bit and does not grow the sk_buff size so I >>>>>> think it has practically no cost for now. >>>>> >>>>> The problem is that that thinking leads to death by a thousand cuts. >>>>> >>>>> "for now" forces the cost of having to think hard how to avoid growing >>>>> sk_buff onto the next person. Let's do it right from the start. >>>> >>>> I see. I described an alternative to move the bit to cb[] and clear it >>>> in all code paths that leads to ndo_start_xmit() earlier. Does that >>>> sound good to you? >>> >>> If you use the control block to pass information between >>> __dev_queue_xmit on the tun device and tun_net_xmit, using gso_skb_cb, >>> the field can be left undefined in all non-tun paths. tun_select_queue >>> can initialize. >> >> The problem is that tun_select_queue() is not always called. >> netdev_core_pick_tx() ensures dev->real_num_tx_queues != 1 before >> calling it, but this variable may change later and result in a race >> condition. Another case is that XDP with predefined queue. >> >>> >>> I would still use skb->hash to encode the hash. That hash type of that >>> field is not strictly defined. It can be siphash from ___skb_get_hash >>> or a device hash, which most likely also uses Toeplitz. Then you also >>> don't run into the problem of growing the struct size. >> >> I'm concerned exactly because it's not strictly defined. 
Someone may >> decide to overwrite it later if we are not cautious enough. qdisc_skb_cb >> also has sufficient space to contain both the hash value and type. > > How about using skb extensions? I think it will work. I'll try it in the next version. ^ permalink raw reply [flat|nested] 41+ messages in thread
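For readers following the hashing detail in this thread: the hash derivation the patch moves into the kernel is the plain Toeplitz algorithm that the virtio-net specification inherits from Microsoft RSS. The following is a minimal userspace sketch of that algorithm — my own illustration, not code from the patch; the function name is assumed, and the comment's key-length requirement follows from how the 32-bit window slides over the key.

```c
#include <stddef.h>
#include <stdint.h>

/* Toeplitz hash as defined by the virtio-net / Microsoft RSS
 * specification. key must be at least len + 4 bytes long so that a
 * 32-bit window can slide over the entire input. */
static uint32_t toeplitz_hash(const uint8_t *key, const uint8_t *in,
			      size_t len)
{
	/* Initial 32-bit window: the leftmost 32 bits of the key. */
	uint32_t window = ((uint32_t)key[0] << 24) |
			  ((uint32_t)key[1] << 16) |
			  ((uint32_t)key[2] << 8) |
			  (uint32_t)key[3];
	uint32_t hash = 0;
	size_t i;
	int bit;

	for (i = 0; i < len; i++) {
		for (bit = 7; bit >= 0; bit--) {
			/* For each set input bit, XOR in the current
			 * 32-bit window of the key. */
			if (in[i] & (1u << bit))
				hash ^= window;
			/* Slide the window left one bit, pulling in the
			 * next key bit. */
			window <<= 1;
			if (key[i + 4] & (1u << bit))
				window |= 1;
		}
	}
	return hash;
}
```

For an IPv4/TCP flow the input is the concatenation of source address, destination address, source port, and destination port, which is what makes the result checkable against the RSS verification suite.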
* Re: [RFC PATCH 5/7] tun: Introduce virtio-net hashing feature 2023-10-10 5:51 ` Akihiko Odaki @ 2023-10-10 6:00 ` Jason Wang 2023-10-10 6:19 ` Akihiko Odaki 0 siblings, 1 reply; 41+ messages in thread From: Jason Wang @ 2023-10-10 6:00 UTC (permalink / raw) To: Akihiko Odaki Cc: Willem de Bruijn, Michael S. Tsirkin, netdev, linux-kernel, kvm, virtualization, linux-kselftest, bpf, davem, kuba, ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend, kpsingh, rdunlap, willemb, gustavoars, herbert, steffen.klassert, nogikh, pablo, decui, jakub, elver, pabeni, Yuri Benditovich On Tue, Oct 10, 2023 at 1:51 PM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: > > On 2023/10/10 14:45, Jason Wang wrote: > > On Tue, Oct 10, 2023 at 9:52 AM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: > >> > >> On 2023/10/09 19:44, Willem de Bruijn wrote: > >>> On Mon, Oct 9, 2023 at 3:12 AM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: > >>>> > >>>> On 2023/10/09 19:06, Willem de Bruijn wrote: > >>>>> On Mon, Oct 9, 2023 at 3:02 AM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: > >>>>>> > >>>>>> On 2023/10/09 18:57, Willem de Bruijn wrote: > >>>>>>> On Mon, Oct 9, 2023 at 3:57 AM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: > >>>>>>>> > >>>>>>>> On 2023/10/09 17:04, Willem de Bruijn wrote: > >>>>>>>>> On Sun, Oct 8, 2023 at 3:46 PM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: > >>>>>>>>>> > >>>>>>>>>> On 2023/10/09 5:08, Willem de Bruijn wrote: > >>>>>>>>>>> On Sun, Oct 8, 2023 at 10:04 PM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: > >>>>>>>>>>>> > >>>>>>>>>>>> On 2023/10/09 4:07, Willem de Bruijn wrote: > >>>>>>>>>>>>> On Sun, Oct 8, 2023 at 7:22 AM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> virtio-net have two usage of hashes: one is RSS and another is hash > >>>>>>>>>>>>>> reporting. Conventionally the hash calculation was done by the VMM. 
> >>>>>>>>>>>>>> However, computing the hash after the queue was chosen defeats the > >>>>>>>>>>>>>> purpose of RSS. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Another approach is to use eBPF steering program. This approach has > >>>>>>>>>>>>>> another downside: it cannot report the calculated hash due to the > >>>>>>>>>>>>>> restrictive nature of eBPF. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Introduce the code to compute hashes to the kernel in order to overcome > >>>>>>>>>>>>>> these challenges. An alternative solution is to extend the eBPF steering > >>>>>>>>>>>>>> program so that it will be able to report to the userspace, but it makes > >>>>>>>>>>>>>> little sense to allow to implement different hashing algorithms with > >>>>>>>>>>>>>> eBPF since the hash value reported by virtio-net is strictly defined by > >>>>>>>>>>>>>> the specification. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> The hash value already stored in sk_buff is not used and computed > >>>>>>>>>>>>>> independently since it may have been computed in a way not conformant > >>>>>>>>>>>>>> with the specification. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com> > >>>>>>>>>>>>>> --- > >>>>>>>>>>>>> > >>>>>>>>>>>>>> +static const struct tun_vnet_hash_cap tun_vnet_hash_cap = { > >>>>>>>>>>>>>> + .max_indirection_table_length = > >>>>>>>>>>>>>> + TUN_VNET_HASH_MAX_INDIRECTION_TABLE_LENGTH, > >>>>>>>>>>>>>> + > >>>>>>>>>>>>>> + .types = VIRTIO_NET_SUPPORTED_HASH_TYPES > >>>>>>>>>>>>>> +}; > >>>>>>>>>>>>> > >>>>>>>>>>>>> No need to have explicit capabilities exchange like this? Tun either > >>>>>>>>>>>>> supports all or none. > >>>>>>>>>>>> > >>>>>>>>>>>> tun does not support VIRTIO_NET_RSS_HASH_TYPE_IP_EX, > >>>>>>>>>>>> VIRTIO_NET_RSS_HASH_TYPE_TCP_EX, and VIRTIO_NET_RSS_HASH_TYPE_UDP_EX. > >>>>>>>>>>>> > >>>>>>>>>>>> It is because the flow dissector does not support IPv6 extensions. 
The > >>>>>>>>>>>> specification is also vague, and does not tell how many TLVs should be > >>>>>>>>>>>> consumed at most when interpreting destination option header so I chose > >>>>>>>>>>>> to avoid adding code for these hash types to the flow dissector. I doubt > >>>>>>>>>>>> anyone will complain about it since nobody complains for Linux. > >>>>>>>>>>>> > >>>>>>>>>>>> I'm also adding this so that we can extend it later. > >>>>>>>>>>>> max_indirection_table_length may grow for systems with 128+ CPUs, or > >>>>>>>>>>>> types may have other bits for new protocols in the future. > >>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>>>> case TUNSETSTEERINGEBPF: > >>>>>>>>>>>>>> - ret = tun_set_ebpf(tun, &tun->steering_prog, argp); > >>>>>>>>>>>>>> + bpf_ret = tun_set_ebpf(tun, &tun->steering_prog, argp); > >>>>>>>>>>>>>> + if (IS_ERR(bpf_ret)) > >>>>>>>>>>>>>> + ret = PTR_ERR(bpf_ret); > >>>>>>>>>>>>>> + else if (bpf_ret) > >>>>>>>>>>>>>> + tun->vnet_hash.flags &= ~TUN_VNET_HASH_RSS; > >>>>>>>>>>>>> > >>>>>>>>>>>>> Don't make one feature disable another. > >>>>>>>>>>>>> > >>>>>>>>>>>>> TUNSETSTEERINGEBPF and TUNSETVNETHASH are mutually exclusive > >>>>>>>>>>>>> functions. If one is enabled the other call should fail, with EBUSY > >>>>>>>>>>>>> for instance. 
> >>>>>>>>>>>>> > >>>>>>>>>>>>>> + case TUNSETVNETHASH: > >>>>>>>>>>>>>> + len = sizeof(vnet_hash); > >>>>>>>>>>>>>> + if (copy_from_user(&vnet_hash, argp, len)) { > >>>>>>>>>>>>>> + ret = -EFAULT; > >>>>>>>>>>>>>> + break; > >>>>>>>>>>>>>> + } > >>>>>>>>>>>>>> + > >>>>>>>>>>>>>> + if (((vnet_hash.flags & TUN_VNET_HASH_REPORT) && > >>>>>>>>>>>>>> + (tun->vnet_hdr_sz < sizeof(struct virtio_net_hdr_v1_hash) || > >>>>>>>>>>>>>> + !tun_is_little_endian(tun))) || > >>>>>>>>>>>>>> + vnet_hash.indirection_table_mask >= > >>>>>>>>>>>>>> + TUN_VNET_HASH_MAX_INDIRECTION_TABLE_LENGTH) { > >>>>>>>>>>>>>> + ret = -EINVAL; > >>>>>>>>>>>>>> + break; > >>>>>>>>>>>>>> + } > >>>>>>>>>>>>>> + > >>>>>>>>>>>>>> + argp = (u8 __user *)argp + len; > >>>>>>>>>>>>>> + len = (vnet_hash.indirection_table_mask + 1) * 2; > >>>>>>>>>>>>>> + if (copy_from_user(vnet_hash_indirection_table, argp, len)) { > >>>>>>>>>>>>>> + ret = -EFAULT; > >>>>>>>>>>>>>> + break; > >>>>>>>>>>>>>> + } > >>>>>>>>>>>>>> + > >>>>>>>>>>>>>> + argp = (u8 __user *)argp + len; > >>>>>>>>>>>>>> + len = virtio_net_hash_key_length(vnet_hash.types); > >>>>>>>>>>>>>> + > >>>>>>>>>>>>>> + if (copy_from_user(vnet_hash_key, argp, len)) { > >>>>>>>>>>>>>> + ret = -EFAULT; > >>>>>>>>>>>>>> + break; > >>>>>>>>>>>>>> + } > >>>>>>>>>>>>> > >>>>>>>>>>>>> Probably easier and less error-prone to define a fixed size control > >>>>>>>>>>>>> struct with the max indirection table size. > >>>>>>>>>>>> > >>>>>>>>>>>> I made its size variable because the indirection table and key may grow > >>>>>>>>>>>> in the future as I wrote above. > >>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>>> Btw: please trim the CC: list considerably on future patches. > >>>>>>>>>>>> > >>>>>>>>>>>> I'll do so in the next version with the TUNSETSTEERINGEBPF change you > >>>>>>>>>>>> proposed. > >>>>>>>>>>> > >>>>>>>>>>> To be clear: please don't just resubmit with that one change. 
> >>>>>>>>>>> > >>>>>>>>>>> The skb and cb issues are quite fundamental issues that need to be resolved. > >>>>>>>>>>> > >>>>>>>>>>> I'd like to understand why adjusting the existing BPF feature for this > >>>>>>>>>>> exact purpose cannot be amended to return the key it produced. > >>>>>>>>>> > >>>>>>>>>> eBPF steering program is not designed for this particular problem in my > >>>>>>>>>> understanding. It was introduced to derive hash values with an > >>>>>>>>>> understanding of application-specific semantics of packets instead of > >>>>>>>>>> generic IP/TCP/UDP semantics. > >>>>>>>>>> > >>>>>>>>>> This problem is rather different in terms that the hash derivation is > >>>>>>>>>> strictly defined by virtio-net. I don't think it makes sense to > >>>>>>>>>> introduce the complexity of BPF when you always run the same code. > >>>>>>>>>> > >>>>>>>>>> It can utilize the existing flow dissector and also make it easier to > >>>>>>>>>> use for the userspace by implementing this in the kernel. > >>>>>>>>> > >>>>>>>>> Ok. There does appear to be overlap in functionality. But it might be > >>>>>>>>> easier to deploy to just have standard Toeplitz available without > >>>>>>>>> having to compile and load an eBPF program. > >>>>>>>>> > >>>>>>>>> As for the sk_buff and cb[] changes. The first is really not needed. > >>>>>>>>> sk_buff simply would not scale if every edge case needs a few bits. > >>>>>>>> > >>>>>>>> An alternative is to move the bit to cb[] and clear it for every code > >>>>>>>> paths that lead to ndo_start_xmit(), but I'm worried that it is error-prone. > >>>>>>>> > >>>>>>>> I think we can put the bit in sk_buff for now. We can implement the > >>>>>>>> alternative when we are short of bits. > >>>>>>> > >>>>>>> I disagree. sk_buff fields add a cost to every code path. They cannot > >>>>>>> be added for every edge case. > >>>>>> > >>>>>> It only takes an unused bit and does not grow the sk_buff size so I > >>>>>> think it has practically no cost for now. 
> >>>>> > >>>>> The problem is that that thinking leads to death by a thousand cuts. > >>>>> > >>>>> "for now" forces the cost of having to think hard how to avoid growing > >>>>> sk_buff onto the next person. Let's do it right from the start. > >>>> > >>>> I see. I described an alternative to move the bit to cb[] and clear it > >>>> in all code paths that leads to ndo_start_xmit() earlier. Does that > >>>> sound good to you? > >>> > >>> If you use the control block to pass information between > >>> __dev_queue_xmit on the tun device and tun_net_xmit, using gso_skb_cb, > >>> the field can be left undefined in all non-tun paths. tun_select_queue > >>> can initialize. > >> > >> The problem is that tun_select_queue() is not always called. > >> netdev_core_pick_tx() ensures dev->real_num_tx_queues != 1 before > >> calling it, but this variable may change later and result in a race > >> condition. Another case is that XDP with predefined queue. > >> > >>> > >>> I would still use skb->hash to encode the hash. That hash type of that > >>> field is not strictly defined. It can be siphash from ___skb_get_hash > >>> or a device hash, which most likely also uses Toeplitz. Then you also > >>> don't run into the problem of growing the struct size. > >> > >> I'm concerned exactly because it's not strictly defined. Someone may > >> decide to overwrite it later if we are not cautious enough. qdisc_skb_cb > >> also has sufficient space to contain both of the hash value and type. > > > > How about using skb extensions? > > I think it will work. I'll try it in the next version. Btw, I still think using eBPF for hash might be better. Though the hashing rule is defined in the spec, it may be extended in the future. For example, several extensions has been proposed: 1) RSS context 2) encapsulated packet hashing Thanks > ^ permalink raw reply [flat|nested] 41+ messages in thread
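As background for the steering debate above: once the Toeplitz hash is available, the RSS queue selection — whether done in the kernel or in an eBPF steering program — reduces to a masked lookup in the indirection table. A sketch using the mask convention from the quoted TUNSETVNETHASH code (the function name is my own; the kernel code operates on its internal state rather than plain arrays):

```c
#include <stdint.h>

/* RSS queue selection: the indirection table length is a power of two,
 * so "hash % length" reduces to masking with (length - 1), which is
 * exactly the indirection_table_mask from the quoted ioctl code. */
static uint16_t rss_select_queue(uint32_t hash,
				 const uint16_t *indirection_table,
				 uint16_t indirection_table_mask)
{
	return indirection_table[hash & indirection_table_mask];
}
```

This also shows why the table length is communicated as a mask in the uAPI: it removes a division from the per-packet path and makes the power-of-two constraint explicit.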
* Re: [RFC PATCH 5/7] tun: Introduce virtio-net hashing feature 2023-10-10 6:00 ` Jason Wang @ 2023-10-10 6:19 ` Akihiko Odaki 2023-10-11 3:18 ` Jason Wang 0 siblings, 1 reply; 41+ messages in thread From: Akihiko Odaki @ 2023-10-10 6:19 UTC (permalink / raw) To: Jason Wang Cc: Willem de Bruijn, Michael S. Tsirkin, netdev, linux-kernel, kvm, virtualization, linux-kselftest, bpf, davem, kuba, ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend, kpsingh, rdunlap, willemb, gustavoars, herbert, steffen.klassert, nogikh, pablo, decui, jakub, elver, pabeni, Yuri Benditovich On 2023/10/10 15:00, Jason Wang wrote: > On Tue, Oct 10, 2023 at 1:51 PM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: >> >> On 2023/10/10 14:45, Jason Wang wrote: >>> On Tue, Oct 10, 2023 at 9:52 AM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: >>>> >>>> On 2023/10/09 19:44, Willem de Bruijn wrote: >>>>> On Mon, Oct 9, 2023 at 3:12 AM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: >>>>>> >>>>>> On 2023/10/09 19:06, Willem de Bruijn wrote: >>>>>>> On Mon, Oct 9, 2023 at 3:02 AM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: >>>>>>>> >>>>>>>> On 2023/10/09 18:57, Willem de Bruijn wrote: >>>>>>>>> On Mon, Oct 9, 2023 at 3:57 AM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: >>>>>>>>>> >>>>>>>>>> On 2023/10/09 17:04, Willem de Bruijn wrote: >>>>>>>>>>> On Sun, Oct 8, 2023 at 3:46 PM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: >>>>>>>>>>>> >>>>>>>>>>>> On 2023/10/09 5:08, Willem de Bruijn wrote: >>>>>>>>>>>>> On Sun, Oct 8, 2023 at 10:04 PM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>> On 2023/10/09 4:07, Willem de Bruijn wrote: >>>>>>>>>>>>>>> On Sun, Oct 8, 2023 at 7:22 AM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> virtio-net have two usage of hashes: one is RSS and another is hash >>>>>>>>>>>>>>>> reporting. Conventionally the hash calculation was done by the VMM. 
>>>>>>>>>>>>>>>> However, computing the hash after the queue was chosen defeats the >>>>>>>>>>>>>>>> purpose of RSS. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Another approach is to use eBPF steering program. This approach has >>>>>>>>>>>>>>>> another downside: it cannot report the calculated hash due to the >>>>>>>>>>>>>>>> restrictive nature of eBPF. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Introduce the code to compute hashes to the kernel in order to overcome >>>>>>>>>>>>>>>> thse challenges. An alternative solution is to extend the eBPF steering >>>>>>>>>>>>>>>> program so that it will be able to report to the userspace, but it makes >>>>>>>>>>>>>>>> little sense to allow to implement different hashing algorithms with >>>>>>>>>>>>>>>> eBPF since the hash value reported by virtio-net is strictly defined by >>>>>>>>>>>>>>>> the specification. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> The hash value already stored in sk_buff is not used and computed >>>>>>>>>>>>>>>> independently since it may have been computed in a way not conformant >>>>>>>>>>>>>>>> with the specification. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com> >>>>>>>>>>>>>>>> --- >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> +static const struct tun_vnet_hash_cap tun_vnet_hash_cap = { >>>>>>>>>>>>>>>> + .max_indirection_table_length = >>>>>>>>>>>>>>>> + TUN_VNET_HASH_MAX_INDIRECTION_TABLE_LENGTH, >>>>>>>>>>>>>>>> + >>>>>>>>>>>>>>>> + .types = VIRTIO_NET_SUPPORTED_HASH_TYPES >>>>>>>>>>>>>>>> +}; >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> No need to have explicit capabilities exchange like this? Tun either >>>>>>>>>>>>>>> supports all or none. >>>>>>>>>>>>>> >>>>>>>>>>>>>> tun does not support VIRTIO_NET_RSS_HASH_TYPE_IP_EX, >>>>>>>>>>>>>> VIRTIO_NET_RSS_HASH_TYPE_TCP_EX, and VIRTIO_NET_RSS_HASH_TYPE_UDP_EX. >>>>>>>>>>>>>> >>>>>>>>>>>>>> It is because the flow dissector does not support IPv6 extensions. 
The >>>>>>>>>>>>>> specification is also vague, and does not tell how many TLVs should be >>>>>>>>>>>>>> consumed at most when interpreting destination option header so I chose >>>>>>>>>>>>>> to avoid adding code for these hash types to the flow dissector. I doubt >>>>>>>>>>>>>> anyone will complain about it since nobody complains for Linux. >>>>>>>>>>>>>> >>>>>>>>>>>>>> I'm also adding this so that we can extend it later. >>>>>>>>>>>>>> max_indirection_table_length may grow for systems with 128+ CPUs, or >>>>>>>>>>>>>> types may have other bits for new protocols in the future. >>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> case TUNSETSTEERINGEBPF: >>>>>>>>>>>>>>>> - ret = tun_set_ebpf(tun, &tun->steering_prog, argp); >>>>>>>>>>>>>>>> + bpf_ret = tun_set_ebpf(tun, &tun->steering_prog, argp); >>>>>>>>>>>>>>>> + if (IS_ERR(bpf_ret)) >>>>>>>>>>>>>>>> + ret = PTR_ERR(bpf_ret); >>>>>>>>>>>>>>>> + else if (bpf_ret) >>>>>>>>>>>>>>>> + tun->vnet_hash.flags &= ~TUN_VNET_HASH_RSS; >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Don't make one feature disable another. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> TUNSETSTEERINGEBPF and TUNSETVNETHASH are mutually exclusive >>>>>>>>>>>>>>> functions. If one is enabled the other call should fail, with EBUSY >>>>>>>>>>>>>>> for instance. 
>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> + case TUNSETVNETHASH: >>>>>>>>>>>>>>>> + len = sizeof(vnet_hash); >>>>>>>>>>>>>>>> + if (copy_from_user(&vnet_hash, argp, len)) { >>>>>>>>>>>>>>>> + ret = -EFAULT; >>>>>>>>>>>>>>>> + break; >>>>>>>>>>>>>>>> + } >>>>>>>>>>>>>>>> + >>>>>>>>>>>>>>>> + if (((vnet_hash.flags & TUN_VNET_HASH_REPORT) && >>>>>>>>>>>>>>>> + (tun->vnet_hdr_sz < sizeof(struct virtio_net_hdr_v1_hash) || >>>>>>>>>>>>>>>> + !tun_is_little_endian(tun))) || >>>>>>>>>>>>>>>> + vnet_hash.indirection_table_mask >= >>>>>>>>>>>>>>>> + TUN_VNET_HASH_MAX_INDIRECTION_TABLE_LENGTH) { >>>>>>>>>>>>>>>> + ret = -EINVAL; >>>>>>>>>>>>>>>> + break; >>>>>>>>>>>>>>>> + } >>>>>>>>>>>>>>>> + >>>>>>>>>>>>>>>> + argp = (u8 __user *)argp + len; >>>>>>>>>>>>>>>> + len = (vnet_hash.indirection_table_mask + 1) * 2; >>>>>>>>>>>>>>>> + if (copy_from_user(vnet_hash_indirection_table, argp, len)) { >>>>>>>>>>>>>>>> + ret = -EFAULT; >>>>>>>>>>>>>>>> + break; >>>>>>>>>>>>>>>> + } >>>>>>>>>>>>>>>> + >>>>>>>>>>>>>>>> + argp = (u8 __user *)argp + len; >>>>>>>>>>>>>>>> + len = virtio_net_hash_key_length(vnet_hash.types); >>>>>>>>>>>>>>>> + >>>>>>>>>>>>>>>> + if (copy_from_user(vnet_hash_key, argp, len)) { >>>>>>>>>>>>>>>> + ret = -EFAULT; >>>>>>>>>>>>>>>> + break; >>>>>>>>>>>>>>>> + } >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Probably easier and less error-prone to define a fixed size control >>>>>>>>>>>>>>> struct with the max indirection table size. >>>>>>>>>>>>>> >>>>>>>>>>>>>> I made its size variable because the indirection table and key may grow >>>>>>>>>>>>>> in the future as I wrote above. >>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Btw: please trim the CC: list considerably on future patches. >>>>>>>>>>>>>> >>>>>>>>>>>>>> I'll do so in the next version with the TUNSETSTEERINGEBPF change you >>>>>>>>>>>>>> proposed. >>>>>>>>>>>>> >>>>>>>>>>>>> To be clear: please don't just resubmit with that one change. 
>>>>>>>>>>>>> >>>>>>>>>>>>> The skb and cb issues are quite fundamental issues that need to be resolved. >>>>>>>>>>>>> >>>>>>>>>>>>> I'd like to understand why adjusting the existing BPF feature for this >>>>>>>>>>>>> exact purpose cannot be amended to return the key it produced. >>>>>>>>>>>> >>>>>>>>>>>> eBPF steering program is not designed for this particular problem in my >>>>>>>>>>>> understanding. It was introduced to derive hash values with an >>>>>>>>>>>> understanding of application-specific semantics of packets instead of >>>>>>>>>>>> generic IP/TCP/UDP semantics. >>>>>>>>>>>> >>>>>>>>>>>> This problem is rather different in terms that the hash derivation is >>>>>>>>>>>> strictly defined by virtio-net. I don't think it makes sense to >>>>>>>>>>>> introduce the complexity of BPF when you always run the same code. >>>>>>>>>>>> >>>>>>>>>>>> It can utilize the existing flow dissector and also make it easier to >>>>>>>>>>>> use for the userspace by implementing this in the kernel. >>>>>>>>>>> >>>>>>>>>>> Ok. There does appear to be overlap in functionality. But it might be >>>>>>>>>>> easier to deploy to just have standard Toeplitz available without >>>>>>>>>>> having to compile and load an eBPF program. >>>>>>>>>>> >>>>>>>>>>> As for the sk_buff and cb[] changes. The first is really not needed. >>>>>>>>>>> sk_buff simply would not scale if every edge case needs a few bits. >>>>>>>>>> >>>>>>>>>> An alternative is to move the bit to cb[] and clear it for every code >>>>>>>>>> paths that lead to ndo_start_xmit(), but I'm worried that it is error-prone. >>>>>>>>>> >>>>>>>>>> I think we can put the bit in sk_buff for now. We can implement the >>>>>>>>>> alternative when we are short of bits. >>>>>>>>> >>>>>>>>> I disagree. sk_buff fields add a cost to every code path. They cannot >>>>>>>>> be added for every edge case. >>>>>>>> >>>>>>>> It only takes an unused bit and does not grow the sk_buff size so I >>>>>>>> think it has practically no cost for now. 
>>>>>>> >>>>>>> The problem is that that thinking leads to death by a thousand cuts. >>>>>>> >>>>>>> "for now" forces the cost of having to think hard how to avoid growing >>>>>>> sk_buff onto the next person. Let's do it right from the start. >>>>>> >>>>>> I see. I described an alternative to move the bit to cb[] and clear it >>>>>> in all code paths that leads to ndo_start_xmit() earlier. Does that >>>>>> sound good to you? >>>>> >>>>> If you use the control block to pass information between >>>>> __dev_queue_xmit on the tun device and tun_net_xmit, using gso_skb_cb, >>>>> the field can be left undefined in all non-tun paths. tun_select_queue >>>>> can initialize. >>>> >>>> The problem is that tun_select_queue() is not always called. >>>> netdev_core_pick_tx() ensures dev->real_num_tx_queues != 1 before >>>> calling it, but this variable may change later and result in a race >>>> condition. Another case is that XDP with predefined queue. >>>> >>>>> >>>>> I would still use skb->hash to encode the hash. That hash type of that >>>>> field is not strictly defined. It can be siphash from ___skb_get_hash >>>>> or a device hash, which most likely also uses Toeplitz. Then you also >>>>> don't run into the problem of growing the struct size. >>>> >>>> I'm concerned exactly because it's not strictly defined. Someone may >>>> decide to overwrite it later if we are not cautious enough. qdisc_skb_cb >>>> also has sufficient space to contain both of the hash value and type. >>> >>> How about using skb extensions? >> >> I think it will work. I'll try it in the next version. > > Btw, I still think using eBPF for hash might be better. > > Though the hashing rule is defined in the spec, it may be extended in > the future. For example, several extensions has been proposed: > > 1) RSS context > 2) encapsulated packet hashing Looking at the proposals, I'm now more inclined to extend the BPF steering program. 
Yuri, who wrote the RFC patches to extend the BPF steering program, also raised a concern that it may become hard to implement virtio-net extensions in the future. It is much easier to deploy a new BPF program to support extensions since it will be included in QEMU and can be deployed at once without touching other kernel code. I was still not sure how likely such an extension would emerge, especially when the hardware RSS capability has not evolved for a decade or so. But those proposals show that there is more demand for new virtio-net features. ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [RFC PATCH 5/7] tun: Introduce virtio-net hashing feature 2023-10-10 6:19 ` Akihiko Odaki @ 2023-10-11 3:18 ` Jason Wang 2023-10-11 3:57 ` Akihiko Odaki 0 siblings, 1 reply; 41+ messages in thread From: Jason Wang @ 2023-10-11 3:18 UTC (permalink / raw) To: Akihiko Odaki Cc: Willem de Bruijn, Michael S. Tsirkin, netdev, linux-kernel, kvm, virtualization, linux-kselftest, bpf, davem, kuba, ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend, kpsingh, rdunlap, willemb, gustavoars, herbert, steffen.klassert, nogikh, pablo, decui, jakub, elver, pabeni, Yuri Benditovich On Tue, Oct 10, 2023 at 2:19 PM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: > > On 2023/10/10 15:00, Jason Wang wrote: > > On Tue, Oct 10, 2023 at 1:51 PM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: > >> > >> On 2023/10/10 14:45, Jason Wang wrote: > >>> On Tue, Oct 10, 2023 at 9:52 AM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: > >>>> > >>>> On 2023/10/09 19:44, Willem de Bruijn wrote: > >>>>> On Mon, Oct 9, 2023 at 3:12 AM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: > >>>>>> > >>>>>> On 2023/10/09 19:06, Willem de Bruijn wrote: > >>>>>>> On Mon, Oct 9, 2023 at 3:02 AM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: > >>>>>>>> > >>>>>>>> On 2023/10/09 18:57, Willem de Bruijn wrote: > >>>>>>>>> On Mon, Oct 9, 2023 at 3:57 AM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: > >>>>>>>>>> > >>>>>>>>>> On 2023/10/09 17:04, Willem de Bruijn wrote: > >>>>>>>>>>> On Sun, Oct 8, 2023 at 3:46 PM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: > >>>>>>>>>>>> > >>>>>>>>>>>> On 2023/10/09 5:08, Willem de Bruijn wrote: > >>>>>>>>>>>>> On Sun, Oct 8, 2023 at 10:04 PM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> On 2023/10/09 4:07, Willem de Bruijn wrote: > >>>>>>>>>>>>>>> On Sun, Oct 8, 2023 at 7:22 AM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> virtio-net have two usage of hashes: one is RSS and another is 
hash > >>>>>>>>>>>>>>>> reporting. Conventionally the hash calculation was done by the VMM. > >>>>>>>>>>>>>>>> However, computing the hash after the queue was chosen defeats the > >>>>>>>>>>>>>>>> purpose of RSS. > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> Another approach is to use eBPF steering program. This approach has > >>>>>>>>>>>>>>>> another downside: it cannot report the calculated hash due to the > >>>>>>>>>>>>>>>> restrictive nature of eBPF. > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> Introduce the code to compute hashes to the kernel in order to overcome > >>>>>>>>>>>>>>>> thse challenges. An alternative solution is to extend the eBPF steering > >>>>>>>>>>>>>>>> program so that it will be able to report to the userspace, but it makes > >>>>>>>>>>>>>>>> little sense to allow to implement different hashing algorithms with > >>>>>>>>>>>>>>>> eBPF since the hash value reported by virtio-net is strictly defined by > >>>>>>>>>>>>>>>> the specification. > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> The hash value already stored in sk_buff is not used and computed > >>>>>>>>>>>>>>>> independently since it may have been computed in a way not conformant > >>>>>>>>>>>>>>>> with the specification. > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com> > >>>>>>>>>>>>>>>> --- > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> +static const struct tun_vnet_hash_cap tun_vnet_hash_cap = { > >>>>>>>>>>>>>>>> + .max_indirection_table_length = > >>>>>>>>>>>>>>>> + TUN_VNET_HASH_MAX_INDIRECTION_TABLE_LENGTH, > >>>>>>>>>>>>>>>> + > >>>>>>>>>>>>>>>> + .types = VIRTIO_NET_SUPPORTED_HASH_TYPES > >>>>>>>>>>>>>>>> +}; > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> No need to have explicit capabilities exchange like this? Tun either > >>>>>>>>>>>>>>> supports all or none. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> tun does not support VIRTIO_NET_RSS_HASH_TYPE_IP_EX, > >>>>>>>>>>>>>> VIRTIO_NET_RSS_HASH_TYPE_TCP_EX, and VIRTIO_NET_RSS_HASH_TYPE_UDP_EX. 
> >>>>>>>>>>>>>> > >>>>>>>>>>>>>> It is because the flow dissector does not support IPv6 extensions. The > >>>>>>>>>>>>>> specification is also vague, and does not tell how many TLVs should be > >>>>>>>>>>>>>> consumed at most when interpreting destination option header so I chose > >>>>>>>>>>>>>> to avoid adding code for these hash types to the flow dissector. I doubt > >>>>>>>>>>>>>> anyone will complain about it since nobody complains for Linux. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> I'm also adding this so that we can extend it later. > >>>>>>>>>>>>>> max_indirection_table_length may grow for systems with 128+ CPUs, or > >>>>>>>>>>>>>> types may have other bits for new protocols in the future. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> case TUNSETSTEERINGEBPF: > >>>>>>>>>>>>>>>> - ret = tun_set_ebpf(tun, &tun->steering_prog, argp); > >>>>>>>>>>>>>>>> + bpf_ret = tun_set_ebpf(tun, &tun->steering_prog, argp); > >>>>>>>>>>>>>>>> + if (IS_ERR(bpf_ret)) > >>>>>>>>>>>>>>>> + ret = PTR_ERR(bpf_ret); > >>>>>>>>>>>>>>>> + else if (bpf_ret) > >>>>>>>>>>>>>>>> + tun->vnet_hash.flags &= ~TUN_VNET_HASH_RSS; > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> Don't make one feature disable another. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> TUNSETSTEERINGEBPF and TUNSETVNETHASH are mutually exclusive > >>>>>>>>>>>>>>> functions. If one is enabled the other call should fail, with EBUSY > >>>>>>>>>>>>>>> for instance. 
> >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> + case TUNSETVNETHASH: > >>>>>>>>>>>>>>>> + len = sizeof(vnet_hash); > >>>>>>>>>>>>>>>> + if (copy_from_user(&vnet_hash, argp, len)) { > >>>>>>>>>>>>>>>> + ret = -EFAULT; > >>>>>>>>>>>>>>>> + break; > >>>>>>>>>>>>>>>> + } > >>>>>>>>>>>>>>>> + > >>>>>>>>>>>>>>>> + if (((vnet_hash.flags & TUN_VNET_HASH_REPORT) && > >>>>>>>>>>>>>>>> + (tun->vnet_hdr_sz < sizeof(struct virtio_net_hdr_v1_hash) || > >>>>>>>>>>>>>>>> + !tun_is_little_endian(tun))) || > >>>>>>>>>>>>>>>> + vnet_hash.indirection_table_mask >= > >>>>>>>>>>>>>>>> + TUN_VNET_HASH_MAX_INDIRECTION_TABLE_LENGTH) { > >>>>>>>>>>>>>>>> + ret = -EINVAL; > >>>>>>>>>>>>>>>> + break; > >>>>>>>>>>>>>>>> + } > >>>>>>>>>>>>>>>> + > >>>>>>>>>>>>>>>> + argp = (u8 __user *)argp + len; > >>>>>>>>>>>>>>>> + len = (vnet_hash.indirection_table_mask + 1) * 2; > >>>>>>>>>>>>>>>> + if (copy_from_user(vnet_hash_indirection_table, argp, len)) { > >>>>>>>>>>>>>>>> + ret = -EFAULT; > >>>>>>>>>>>>>>>> + break; > >>>>>>>>>>>>>>>> + } > >>>>>>>>>>>>>>>> + > >>>>>>>>>>>>>>>> + argp = (u8 __user *)argp + len; > >>>>>>>>>>>>>>>> + len = virtio_net_hash_key_length(vnet_hash.types); > >>>>>>>>>>>>>>>> + > >>>>>>>>>>>>>>>> + if (copy_from_user(vnet_hash_key, argp, len)) { > >>>>>>>>>>>>>>>> + ret = -EFAULT; > >>>>>>>>>>>>>>>> + break; > >>>>>>>>>>>>>>>> + } > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> Probably easier and less error-prone to define a fixed size control > >>>>>>>>>>>>>>> struct with the max indirection table size. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> I made its size variable because the indirection table and key may grow > >>>>>>>>>>>>>> in the future as I wrote above. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> Btw: please trim the CC: list considerably on future patches. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> I'll do so in the next version with the TUNSETSTEERINGEBPF change you > >>>>>>>>>>>>>> proposed. > >>>>>>>>>>>>> > >>>>>>>>>>>>> To be clear: please don't just resubmit with that one change. 
> >>>>>>>>>>>>> > >>>>>>>>>>>>> The skb and cb issues are quite fundamental issues that need to be resolved. > >>>>>>>>>>>>> > >>>>>>>>>>>>> I'd like to understand why adjusting the existing BPF feature for this > >>>>>>>>>>>>> exact purpose cannot be amended to return the key it produced. > >>>>>>>>>>>> > >>>>>>>>>>>> eBPF steering program is not designed for this particular problem in my > >>>>>>>>>>>> understanding. It was introduced to derive hash values with an > >>>>>>>>>>>> understanding of application-specific semantics of packets instead of > >>>>>>>>>>>> generic IP/TCP/UDP semantics. > >>>>>>>>>>>> > >>>>>>>>>>>> This problem is rather different in terms that the hash derivation is > >>>>>>>>>>>> strictly defined by virtio-net. I don't think it makes sense to > >>>>>>>>>>>> introduce the complexity of BPF when you always run the same code. > >>>>>>>>>>>> > >>>>>>>>>>>> It can utilize the existing flow dissector and also make it easier to > >>>>>>>>>>>> use for the userspace by implementing this in the kernel. > >>>>>>>>>>> > >>>>>>>>>>> Ok. There does appear to be overlap in functionality. But it might be > >>>>>>>>>>> easier to deploy to just have standard Toeplitz available without > >>>>>>>>>>> having to compile and load an eBPF program. > >>>>>>>>>>> > >>>>>>>>>>> As for the sk_buff and cb[] changes. The first is really not needed. > >>>>>>>>>>> sk_buff simply would not scale if every edge case needs a few bits. > >>>>>>>>>> > >>>>>>>>>> An alternative is to move the bit to cb[] and clear it for every code > >>>>>>>>>> paths that lead to ndo_start_xmit(), but I'm worried that it is error-prone. > >>>>>>>>>> > >>>>>>>>>> I think we can put the bit in sk_buff for now. We can implement the > >>>>>>>>>> alternative when we are short of bits. > >>>>>>>>> > >>>>>>>>> I disagree. sk_buff fields add a cost to every code path. They cannot > >>>>>>>>> be added for every edge case. 
> >>>>>>>> > >>>>>>>> It only takes an unused bit and does not grow the sk_buff size so I > >>>>>>>> think it has practically no cost for now. > >>>>>>> > >>>>>>> The problem is that that thinking leads to death by a thousand cuts. > >>>>>>> > >>>>>>> "for now" forces the cost of having to think hard how to avoid growing > >>>>>>> sk_buff onto the next person. Let's do it right from the start. > >>>>>> > >>>>>> I see. I described an alternative to move the bit to cb[] and clear it > >>>>>> in all code paths that leads to ndo_start_xmit() earlier. Does that > >>>>>> sound good to you? > >>>>> > >>>>> If you use the control block to pass information between > >>>>> __dev_queue_xmit on the tun device and tun_net_xmit, using gso_skb_cb, > >>>>> the field can be left undefined in all non-tun paths. tun_select_queue > >>>>> can initialize. > >>>> > >>>> The problem is that tun_select_queue() is not always called. > >>>> netdev_core_pick_tx() ensures dev->real_num_tx_queues != 1 before > >>>> calling it, but this variable may change later and result in a race > >>>> condition. Another case is that XDP with predefined queue. > >>>> > >>>>> > >>>>> I would still use skb->hash to encode the hash. That hash type of that > >>>>> field is not strictly defined. It can be siphash from ___skb_get_hash > >>>>> or a device hash, which most likely also uses Toeplitz. Then you also > >>>>> don't run into the problem of growing the struct size. > >>>> > >>>> I'm concerned exactly because it's not strictly defined. Someone may > >>>> decide to overwrite it later if we are not cautious enough. qdisc_skb_cb > >>>> also has sufficient space to contain both of the hash value and type. > >>> > >>> How about using skb extensions? > >> > >> I think it will work. I'll try it in the next version. > > > > Btw, I still think using eBPF for hash might be better. > > > > Though the hashing rule is defined in the spec, it may be extended in > > the future. 
For example, several extensions have been proposed:
> >
> > 1) RSS context
> > 2) encapsulated packet hashing
>
> Looking at the proposals, I'm now more inclined to extend the BPF
> steering program.

Just to make sure we are on the same page.

If the eBPF program needs to access skb extensions, it would not be a
steering program anymore (not a filter).

Or do you mean it is a dedicated eBPF program that calculates the hash?

> Yuri, who wrote the RFC patches to extend the BPF steering program, also
> raised a concern that it may become hard to implement virtio-net
> extensions in the future. It is much easier to deploy a new BPF program
> to support extensions since it will be included in QEMU and can be
> deployed at once without involving other kernel components.
>
> I was still not sure how likely such an extension would emerge, especially
> when hardware RSS capabilities have not evolved for a decade or so. But
> those proposals show that there is more demand for new features in
> virtio-net.

It's not only RSS: if you track virtio development, flow directors are
also being proposed.

Thanks

^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [RFC PATCH 5/7] tun: Introduce virtio-net hashing feature 2023-10-11 3:18 ` Jason Wang @ 2023-10-11 3:57 ` Akihiko Odaki 0 siblings, 0 replies; 41+ messages in thread From: Akihiko Odaki @ 2023-10-11 3:57 UTC (permalink / raw) To: Jason Wang Cc: Willem de Bruijn, Michael S. Tsirkin, netdev, linux-kernel, kvm, virtualization, linux-kselftest, bpf, davem, kuba, ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend, kpsingh, rdunlap, willemb, gustavoars, herbert, steffen.klassert, nogikh, pablo, decui, jakub, elver, pabeni, Yuri Benditovich On 2023/10/11 12:18, Jason Wang wrote: > On Tue, Oct 10, 2023 at 2:19 PM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: >> >> On 2023/10/10 15:00, Jason Wang wrote: >>> On Tue, Oct 10, 2023 at 1:51 PM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: >>>> >>>> On 2023/10/10 14:45, Jason Wang wrote: >>>>> On Tue, Oct 10, 2023 at 9:52 AM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: >>>>>> >>>>>> On 2023/10/09 19:44, Willem de Bruijn wrote: >>>>>>> On Mon, Oct 9, 2023 at 3:12 AM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: >>>>>>>> >>>>>>>> On 2023/10/09 19:06, Willem de Bruijn wrote: >>>>>>>>> On Mon, Oct 9, 2023 at 3:02 AM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: >>>>>>>>>> >>>>>>>>>> On 2023/10/09 18:57, Willem de Bruijn wrote: >>>>>>>>>>> On Mon, Oct 9, 2023 at 3:57 AM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: >>>>>>>>>>>> >>>>>>>>>>>> On 2023/10/09 17:04, Willem de Bruijn wrote: >>>>>>>>>>>>> On Sun, Oct 8, 2023 at 3:46 PM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>> On 2023/10/09 5:08, Willem de Bruijn wrote: >>>>>>>>>>>>>>> On Sun, Oct 8, 2023 at 10:04 PM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> On 2023/10/09 4:07, Willem de Bruijn wrote: >>>>>>>>>>>>>>>>> On Sun, Oct 8, 2023 at 7:22 AM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> virtio-net have two usage of hashes: one is RSS and 
another is hash >>>>>>>>>>>>>>>>>> reporting. Conventionally the hash calculation was done by the VMM. >>>>>>>>>>>>>>>>>> However, computing the hash after the queue was chosen defeats the >>>>>>>>>>>>>>>>>> purpose of RSS. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Another approach is to use eBPF steering program. This approach has >>>>>>>>>>>>>>>>>> another downside: it cannot report the calculated hash due to the >>>>>>>>>>>>>>>>>> restrictive nature of eBPF. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Introduce the code to compute hashes to the kernel in order to overcome >>>>>>>>>>>>>>>>>> thse challenges. An alternative solution is to extend the eBPF steering >>>>>>>>>>>>>>>>>> program so that it will be able to report to the userspace, but it makes >>>>>>>>>>>>>>>>>> little sense to allow to implement different hashing algorithms with >>>>>>>>>>>>>>>>>> eBPF since the hash value reported by virtio-net is strictly defined by >>>>>>>>>>>>>>>>>> the specification. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> The hash value already stored in sk_buff is not used and computed >>>>>>>>>>>>>>>>>> independently since it may have been computed in a way not conformant >>>>>>>>>>>>>>>>>> with the specification. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com> >>>>>>>>>>>>>>>>>> --- >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> +static const struct tun_vnet_hash_cap tun_vnet_hash_cap = { >>>>>>>>>>>>>>>>>> + .max_indirection_table_length = >>>>>>>>>>>>>>>>>> + TUN_VNET_HASH_MAX_INDIRECTION_TABLE_LENGTH, >>>>>>>>>>>>>>>>>> + >>>>>>>>>>>>>>>>>> + .types = VIRTIO_NET_SUPPORTED_HASH_TYPES >>>>>>>>>>>>>>>>>> +}; >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> No need to have explicit capabilities exchange like this? Tun either >>>>>>>>>>>>>>>>> supports all or none. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> tun does not support VIRTIO_NET_RSS_HASH_TYPE_IP_EX, >>>>>>>>>>>>>>>> VIRTIO_NET_RSS_HASH_TYPE_TCP_EX, and VIRTIO_NET_RSS_HASH_TYPE_UDP_EX. 
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> It is because the flow dissector does not support IPv6 extensions. The >>>>>>>>>>>>>>>> specification is also vague, and does not tell how many TLVs should be >>>>>>>>>>>>>>>> consumed at most when interpreting destination option header so I chose >>>>>>>>>>>>>>>> to avoid adding code for these hash types to the flow dissector. I doubt >>>>>>>>>>>>>>>> anyone will complain about it since nobody complains for Linux. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> I'm also adding this so that we can extend it later. >>>>>>>>>>>>>>>> max_indirection_table_length may grow for systems with 128+ CPUs, or >>>>>>>>>>>>>>>> types may have other bits for new protocols in the future. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> case TUNSETSTEERINGEBPF: >>>>>>>>>>>>>>>>>> - ret = tun_set_ebpf(tun, &tun->steering_prog, argp); >>>>>>>>>>>>>>>>>> + bpf_ret = tun_set_ebpf(tun, &tun->steering_prog, argp); >>>>>>>>>>>>>>>>>> + if (IS_ERR(bpf_ret)) >>>>>>>>>>>>>>>>>> + ret = PTR_ERR(bpf_ret); >>>>>>>>>>>>>>>>>> + else if (bpf_ret) >>>>>>>>>>>>>>>>>> + tun->vnet_hash.flags &= ~TUN_VNET_HASH_RSS; >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Don't make one feature disable another. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> TUNSETSTEERINGEBPF and TUNSETVNETHASH are mutually exclusive >>>>>>>>>>>>>>>>> functions. If one is enabled the other call should fail, with EBUSY >>>>>>>>>>>>>>>>> for instance. 
>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> + case TUNSETVNETHASH: >>>>>>>>>>>>>>>>>> + len = sizeof(vnet_hash); >>>>>>>>>>>>>>>>>> + if (copy_from_user(&vnet_hash, argp, len)) { >>>>>>>>>>>>>>>>>> + ret = -EFAULT; >>>>>>>>>>>>>>>>>> + break; >>>>>>>>>>>>>>>>>> + } >>>>>>>>>>>>>>>>>> + >>>>>>>>>>>>>>>>>> + if (((vnet_hash.flags & TUN_VNET_HASH_REPORT) && >>>>>>>>>>>>>>>>>> + (tun->vnet_hdr_sz < sizeof(struct virtio_net_hdr_v1_hash) || >>>>>>>>>>>>>>>>>> + !tun_is_little_endian(tun))) || >>>>>>>>>>>>>>>>>> + vnet_hash.indirection_table_mask >= >>>>>>>>>>>>>>>>>> + TUN_VNET_HASH_MAX_INDIRECTION_TABLE_LENGTH) { >>>>>>>>>>>>>>>>>> + ret = -EINVAL; >>>>>>>>>>>>>>>>>> + break; >>>>>>>>>>>>>>>>>> + } >>>>>>>>>>>>>>>>>> + >>>>>>>>>>>>>>>>>> + argp = (u8 __user *)argp + len; >>>>>>>>>>>>>>>>>> + len = (vnet_hash.indirection_table_mask + 1) * 2; >>>>>>>>>>>>>>>>>> + if (copy_from_user(vnet_hash_indirection_table, argp, len)) { >>>>>>>>>>>>>>>>>> + ret = -EFAULT; >>>>>>>>>>>>>>>>>> + break; >>>>>>>>>>>>>>>>>> + } >>>>>>>>>>>>>>>>>> + >>>>>>>>>>>>>>>>>> + argp = (u8 __user *)argp + len; >>>>>>>>>>>>>>>>>> + len = virtio_net_hash_key_length(vnet_hash.types); >>>>>>>>>>>>>>>>>> + >>>>>>>>>>>>>>>>>> + if (copy_from_user(vnet_hash_key, argp, len)) { >>>>>>>>>>>>>>>>>> + ret = -EFAULT; >>>>>>>>>>>>>>>>>> + break; >>>>>>>>>>>>>>>>>> + } >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Probably easier and less error-prone to define a fixed size control >>>>>>>>>>>>>>>>> struct with the max indirection table size. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> I made its size variable because the indirection table and key may grow >>>>>>>>>>>>>>>> in the future as I wrote above. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Btw: please trim the CC: list considerably on future patches. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> I'll do so in the next version with the TUNSETSTEERINGEBPF change you >>>>>>>>>>>>>>>> proposed. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> To be clear: please don't just resubmit with that one change. 
>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> The skb and cb issues are quite fundamental issues that need to be resolved. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> I'd like to understand why adjusting the existing BPF feature for this >>>>>>>>>>>>>>> exact purpose cannot be amended to return the key it produced. >>>>>>>>>>>>>> >>>>>>>>>>>>>> eBPF steering program is not designed for this particular problem in my >>>>>>>>>>>>>> understanding. It was introduced to derive hash values with an >>>>>>>>>>>>>> understanding of application-specific semantics of packets instead of >>>>>>>>>>>>>> generic IP/TCP/UDP semantics. >>>>>>>>>>>>>> >>>>>>>>>>>>>> This problem is rather different in terms that the hash derivation is >>>>>>>>>>>>>> strictly defined by virtio-net. I don't think it makes sense to >>>>>>>>>>>>>> introduce the complexity of BPF when you always run the same code. >>>>>>>>>>>>>> >>>>>>>>>>>>>> It can utilize the existing flow dissector and also make it easier to >>>>>>>>>>>>>> use for the userspace by implementing this in the kernel. >>>>>>>>>>>>> >>>>>>>>>>>>> Ok. There does appear to be overlap in functionality. But it might be >>>>>>>>>>>>> easier to deploy to just have standard Toeplitz available without >>>>>>>>>>>>> having to compile and load an eBPF program. >>>>>>>>>>>>> >>>>>>>>>>>>> As for the sk_buff and cb[] changes. The first is really not needed. >>>>>>>>>>>>> sk_buff simply would not scale if every edge case needs a few bits. >>>>>>>>>>>> >>>>>>>>>>>> An alternative is to move the bit to cb[] and clear it for every code >>>>>>>>>>>> paths that lead to ndo_start_xmit(), but I'm worried that it is error-prone. >>>>>>>>>>>> >>>>>>>>>>>> I think we can put the bit in sk_buff for now. We can implement the >>>>>>>>>>>> alternative when we are short of bits. >>>>>>>>>>> >>>>>>>>>>> I disagree. sk_buff fields add a cost to every code path. They cannot >>>>>>>>>>> be added for every edge case. 
>>>>>>>>>> >>>>>>>>>> It only takes an unused bit and does not grow the sk_buff size so I >>>>>>>>>> think it has practically no cost for now. >>>>>>>>> >>>>>>>>> The problem is that that thinking leads to death by a thousand cuts. >>>>>>>>> >>>>>>>>> "for now" forces the cost of having to think hard how to avoid growing >>>>>>>>> sk_buff onto the next person. Let's do it right from the start. >>>>>>>> >>>>>>>> I see. I described an alternative to move the bit to cb[] and clear it >>>>>>>> in all code paths that leads to ndo_start_xmit() earlier. Does that >>>>>>>> sound good to you? >>>>>>> >>>>>>> If you use the control block to pass information between >>>>>>> __dev_queue_xmit on the tun device and tun_net_xmit, using gso_skb_cb, >>>>>>> the field can be left undefined in all non-tun paths. tun_select_queue >>>>>>> can initialize. >>>>>> >>>>>> The problem is that tun_select_queue() is not always called. >>>>>> netdev_core_pick_tx() ensures dev->real_num_tx_queues != 1 before >>>>>> calling it, but this variable may change later and result in a race >>>>>> condition. Another case is that XDP with predefined queue. >>>>>> >>>>>>> >>>>>>> I would still use skb->hash to encode the hash. That hash type of that >>>>>>> field is not strictly defined. It can be siphash from ___skb_get_hash >>>>>>> or a device hash, which most likely also uses Toeplitz. Then you also >>>>>>> don't run into the problem of growing the struct size. >>>>>> >>>>>> I'm concerned exactly because it's not strictly defined. Someone may >>>>>> decide to overwrite it later if we are not cautious enough. qdisc_skb_cb >>>>>> also has sufficient space to contain both of the hash value and type. >>>>> >>>>> How about using skb extensions? >>>> >>>> I think it will work. I'll try it in the next version. >>> >>> Btw, I still think using eBPF for hash might be better. >>> >>> Though the hashing rule is defined in the spec, it may be extended in >>> the future. 
For example, several extensions has been proposed: >>> >>> 1) RSS context >>> 2) encapsulated packet hashing >> >> Looking at the proposals, I'm now more inclined to extend the BPF >> steering program. > > Just to make sure we are at the same page. > > If the eBPF program needs to access skb extensions, it would not be a > steering program anymore (not a filter). > > Or do you mean it is a dedicated eBPF program that calculates the hash? I think the BPF program should be a steering program but extended for hash reporting. Since we need a hash reporting feature that is not present in a socket filter, the BPF program should have a dedicated bpf_prog_type (not BPF_PROG_TYPE_SOCKET_FILTER). However, as its functionality is the superset of the conventional steering program, I'm planning to use the existing TUNSETSTEERINGEBPF ioctl to set it. ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [RFC PATCH 5/7] tun: Introduce virtio-net hashing feature
2023-10-08 5:20 ` [RFC PATCH 5/7] tun: Introduce virtio-net hashing feature Akihiko Odaki
2023-10-08 19:07 ` Willem de Bruijn
@ 2023-10-09 8:13 ` Willem de Bruijn
2023-10-09 8:44 ` Akihiko Odaki
2023-10-09 11:38 ` Michael S. Tsirkin
2 siblings, 1 reply; 41+ messages in thread
From: Willem de Bruijn @ 2023-10-09 8:13 UTC (permalink / raw)
To: Akihiko Odaki
Cc: Jason Wang, Michael S. Tsirkin, netdev, linux-kernel, kvm,
virtualization, linux-kselftest, bpf, davem, kuba, ast, daniel,
andrii, kafai, songliubraving, yhs, john.fastabend, kpsingh,
rdunlap, willemb, gustavoars, herbert, steffen.klassert, nogikh,
pablo, decui, cai, jakub, elver, pabeni, Yuri Benditovich

On Sun, Oct 8, 2023 at 12:22 AM Akihiko Odaki <akihiko.odaki@daynix.com> wrote:
>
> virtio-net has two uses of hashes: one is RSS and another is hash
> reporting. Conventionally the hash calculation was done by the VMM.
> However, computing the hash after the queue was chosen defeats the
> purpose of RSS.
>
> Another approach is to use an eBPF steering program. This approach has
> another downside: it cannot report the calculated hash due to the
> restrictive nature of eBPF.
>
> Introduce the code to compute hashes to the kernel in order to overcome
> these challenges. An alternative solution is to extend the eBPF steering
> program so that it will be able to report to the userspace, but it makes
> little sense to allow implementing different hashing algorithms with
> eBPF since the hash value reported by virtio-net is strictly defined by
> the specification.
>
> The hash value already stored in sk_buff is not used and is computed
> independently since it may have been computed in a way not conformant
> with the specification.
> > Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com> > @@ -2116,31 +2172,49 @@ static ssize_t tun_put_user(struct tun_struct *tun, > } > > if (vnet_hdr_sz) { > - struct virtio_net_hdr gso; > + union { > + struct virtio_net_hdr hdr; > + struct virtio_net_hdr_v1_hash v1_hash_hdr; > + } hdr; > + int ret; > > if (iov_iter_count(iter) < vnet_hdr_sz) > return -EINVAL; > > - if (virtio_net_hdr_from_skb(skb, &gso, > - tun_is_little_endian(tun), true, > - vlan_hlen)) { > + if ((READ_ONCE(tun->vnet_hash.flags) & TUN_VNET_HASH_REPORT) && > + vnet_hdr_sz >= sizeof(hdr.v1_hash_hdr) && > + skb->tun_vnet_hash) { Isn't vnet_hdr_sz guaranteed to be >= hdr.v1_hash_hdr, by virtue of the set hash ioctl failing otherwise? Such checks should be limited to control path where possible > + vnet_hdr_content_sz = sizeof(hdr.v1_hash_hdr); > + ret = virtio_net_hdr_v1_hash_from_skb(skb, > + &hdr.v1_hash_hdr, > + true, > + vlan_hlen, > + &vnet_hash); > + } else { > + vnet_hdr_content_sz = sizeof(hdr.hdr); > + ret = virtio_net_hdr_from_skb(skb, &hdr.hdr, > + tun_is_little_endian(tun), > + true, vlan_hlen); > + } > + ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [RFC PATCH 5/7] tun: Introduce virtio-net hashing feature 2023-10-09 8:13 ` Willem de Bruijn @ 2023-10-09 8:44 ` Akihiko Odaki 2023-10-09 9:54 ` Willem de Bruijn 2023-10-09 11:50 ` Michael S. Tsirkin 0 siblings, 2 replies; 41+ messages in thread From: Akihiko Odaki @ 2023-10-09 8:44 UTC (permalink / raw) To: Willem de Bruijn Cc: Jason Wang, Michael S. Tsirkin, netdev, linux-kernel, kvm, virtualization, linux-kselftest, bpf, davem, kuba, ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend, kpsingh, rdunlap, willemb, gustavoars, herbert, steffen.klassert, nogikh, pablo, decui, cai, jakub, elver, pabeni, Yuri Benditovich On 2023/10/09 17:13, Willem de Bruijn wrote: > On Sun, Oct 8, 2023 at 12:22 AM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: >> >> virtio-net have two usage of hashes: one is RSS and another is hash >> reporting. Conventionally the hash calculation was done by the VMM. >> However, computing the hash after the queue was chosen defeats the >> purpose of RSS. >> >> Another approach is to use eBPF steering program. This approach has >> another downside: it cannot report the calculated hash due to the >> restrictive nature of eBPF. >> >> Introduce the code to compute hashes to the kernel in order to overcome >> thse challenges. An alternative solution is to extend the eBPF steering >> program so that it will be able to report to the userspace, but it makes >> little sense to allow to implement different hashing algorithms with >> eBPF since the hash value reported by virtio-net is strictly defined by >> the specification. >> >> The hash value already stored in sk_buff is not used and computed >> independently since it may have been computed in a way not conformant >> with the specification. 
>> >> Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com> > >> @@ -2116,31 +2172,49 @@ static ssize_t tun_put_user(struct tun_struct *tun, >> } >> >> if (vnet_hdr_sz) { >> - struct virtio_net_hdr gso; >> + union { >> + struct virtio_net_hdr hdr; >> + struct virtio_net_hdr_v1_hash v1_hash_hdr; >> + } hdr; >> + int ret; >> >> if (iov_iter_count(iter) < vnet_hdr_sz) >> return -EINVAL; >> >> - if (virtio_net_hdr_from_skb(skb, &gso, >> - tun_is_little_endian(tun), true, >> - vlan_hlen)) { >> + if ((READ_ONCE(tun->vnet_hash.flags) & TUN_VNET_HASH_REPORT) && >> + vnet_hdr_sz >= sizeof(hdr.v1_hash_hdr) && >> + skb->tun_vnet_hash) { > > Isn't vnet_hdr_sz guaranteed to be >= hdr.v1_hash_hdr, by virtue of > the set hash ioctl failing otherwise? > > Such checks should be limited to control path where possible There is a potential race since tun->vnet_hash.flags and vnet_hdr_sz are not read at once. ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [RFC PATCH 5/7] tun: Introduce virtio-net hashing feature 2023-10-09 8:44 ` Akihiko Odaki @ 2023-10-09 9:54 ` Willem de Bruijn 2023-10-09 10:05 ` Akihiko Odaki 2023-10-09 11:50 ` Michael S. Tsirkin 1 sibling, 1 reply; 41+ messages in thread From: Willem de Bruijn @ 2023-10-09 9:54 UTC (permalink / raw) To: Akihiko Odaki Cc: Jason Wang, Michael S. Tsirkin, netdev, linux-kernel, kvm, virtualization, linux-kselftest, bpf, davem, kuba, ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend, kpsingh, rdunlap, willemb, gustavoars, herbert, steffen.klassert, nogikh, pablo, decui, cai, jakub, elver, pabeni, Yuri Benditovich On Mon, Oct 9, 2023 at 3:44 AM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: > > On 2023/10/09 17:13, Willem de Bruijn wrote: > > On Sun, Oct 8, 2023 at 12:22 AM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: > >> > >> virtio-net have two usage of hashes: one is RSS and another is hash > >> reporting. Conventionally the hash calculation was done by the VMM. > >> However, computing the hash after the queue was chosen defeats the > >> purpose of RSS. > >> > >> Another approach is to use eBPF steering program. This approach has > >> another downside: it cannot report the calculated hash due to the > >> restrictive nature of eBPF. > >> > >> Introduce the code to compute hashes to the kernel in order to overcome > >> thse challenges. An alternative solution is to extend the eBPF steering > >> program so that it will be able to report to the userspace, but it makes > >> little sense to allow to implement different hashing algorithms with > >> eBPF since the hash value reported by virtio-net is strictly defined by > >> the specification. > >> > >> The hash value already stored in sk_buff is not used and computed > >> independently since it may have been computed in a way not conformant > >> with the specification. 
> >> > >> Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com> > > > >> @@ -2116,31 +2172,49 @@ static ssize_t tun_put_user(struct tun_struct *tun, > >> } > >> > >> if (vnet_hdr_sz) { > >> - struct virtio_net_hdr gso; > >> + union { > >> + struct virtio_net_hdr hdr; > >> + struct virtio_net_hdr_v1_hash v1_hash_hdr; > >> + } hdr; > >> + int ret; > >> > >> if (iov_iter_count(iter) < vnet_hdr_sz) > >> return -EINVAL; > >> > >> - if (virtio_net_hdr_from_skb(skb, &gso, > >> - tun_is_little_endian(tun), true, > >> - vlan_hlen)) { > >> + if ((READ_ONCE(tun->vnet_hash.flags) & TUN_VNET_HASH_REPORT) && > >> + vnet_hdr_sz >= sizeof(hdr.v1_hash_hdr) && > >> + skb->tun_vnet_hash) { > > > > Isn't vnet_hdr_sz guaranteed to be >= hdr.v1_hash_hdr, by virtue of > > the set hash ioctl failing otherwise? > > > > Such checks should be limited to control path where possible > > There is a potential race since tun->vnet_hash.flags and vnet_hdr_sz are > not read at once. It should not be possible to downgrade the hdr_sz once v1 is selected. ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [RFC PATCH 5/7] tun: Introduce virtio-net hashing feature 2023-10-09 9:54 ` Willem de Bruijn @ 2023-10-09 10:05 ` Akihiko Odaki 2023-10-09 10:07 ` Willem de Bruijn 0 siblings, 1 reply; 41+ messages in thread From: Akihiko Odaki @ 2023-10-09 10:05 UTC (permalink / raw) To: Willem de Bruijn Cc: Jason Wang, Michael S. Tsirkin, netdev, linux-kernel, kvm, virtualization, linux-kselftest, bpf, davem, kuba, ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend, kpsingh, rdunlap, willemb, gustavoars, herbert, steffen.klassert, nogikh, pablo, decui, cai, jakub, elver, pabeni, Yuri Benditovich On 2023/10/09 18:54, Willem de Bruijn wrote: > On Mon, Oct 9, 2023 at 3:44 AM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: >> >> On 2023/10/09 17:13, Willem de Bruijn wrote: >>> On Sun, Oct 8, 2023 at 12:22 AM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: >>>> >>>> virtio-net have two usage of hashes: one is RSS and another is hash >>>> reporting. Conventionally the hash calculation was done by the VMM. >>>> However, computing the hash after the queue was chosen defeats the >>>> purpose of RSS. >>>> >>>> Another approach is to use eBPF steering program. This approach has >>>> another downside: it cannot report the calculated hash due to the >>>> restrictive nature of eBPF. >>>> >>>> Introduce the code to compute hashes to the kernel in order to overcome >>>> thse challenges. An alternative solution is to extend the eBPF steering >>>> program so that it will be able to report to the userspace, but it makes >>>> little sense to allow to implement different hashing algorithms with >>>> eBPF since the hash value reported by virtio-net is strictly defined by >>>> the specification. >>>> >>>> The hash value already stored in sk_buff is not used and computed >>>> independently since it may have been computed in a way not conformant >>>> with the specification. 
>>>> >>>> Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com> >>> >>>> @@ -2116,31 +2172,49 @@ static ssize_t tun_put_user(struct tun_struct *tun, >>>> } >>>> >>>> if (vnet_hdr_sz) { >>>> - struct virtio_net_hdr gso; >>>> + union { >>>> + struct virtio_net_hdr hdr; >>>> + struct virtio_net_hdr_v1_hash v1_hash_hdr; >>>> + } hdr; >>>> + int ret; >>>> >>>> if (iov_iter_count(iter) < vnet_hdr_sz) >>>> return -EINVAL; >>>> >>>> - if (virtio_net_hdr_from_skb(skb, &gso, >>>> - tun_is_little_endian(tun), true, >>>> - vlan_hlen)) { >>>> + if ((READ_ONCE(tun->vnet_hash.flags) & TUN_VNET_HASH_REPORT) && >>>> + vnet_hdr_sz >= sizeof(hdr.v1_hash_hdr) && >>>> + skb->tun_vnet_hash) { >>> >>> Isn't vnet_hdr_sz guaranteed to be >= hdr.v1_hash_hdr, by virtue of >>> the set hash ioctl failing otherwise? >>> >>> Such checks should be limited to control path where possible >> >> There is a potential race since tun->vnet_hash.flags and vnet_hdr_sz are >> not read at once. > > It should not be possible to downgrade the hdr_sz once v1 is selected. I see nothing that prevents shrinking the header size. tun->vnet_hash.flags is read after vnet_hdr_sz so the race can happen even for the case the header size grows though this can be fixed by reordering the two reads. ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [RFC PATCH 5/7] tun: Introduce virtio-net hashing feature 2023-10-09 10:05 ` Akihiko Odaki @ 2023-10-09 10:07 ` Willem de Bruijn 2023-10-09 10:11 ` Akihiko Odaki 0 siblings, 1 reply; 41+ messages in thread From: Willem de Bruijn @ 2023-10-09 10:07 UTC (permalink / raw) To: Akihiko Odaki Cc: Jason Wang, Michael S. Tsirkin, netdev, linux-kernel, kvm, virtualization, linux-kselftest, bpf, davem, kuba, ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend, kpsingh, rdunlap, willemb, gustavoars, herbert, steffen.klassert, nogikh, pablo, decui, cai, jakub, elver, pabeni, Yuri Benditovich On Mon, Oct 9, 2023 at 3:05 AM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: > > > > On 2023/10/09 18:54, Willem de Bruijn wrote: > > On Mon, Oct 9, 2023 at 3:44 AM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: > >> > >> On 2023/10/09 17:13, Willem de Bruijn wrote: > >>> On Sun, Oct 8, 2023 at 12:22 AM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: > >>>> > >>>> virtio-net have two usage of hashes: one is RSS and another is hash > >>>> reporting. Conventionally the hash calculation was done by the VMM. > >>>> However, computing the hash after the queue was chosen defeats the > >>>> purpose of RSS. > >>>> > >>>> Another approach is to use eBPF steering program. This approach has > >>>> another downside: it cannot report the calculated hash due to the > >>>> restrictive nature of eBPF. > >>>> > >>>> Introduce the code to compute hashes to the kernel in order to overcome > >>>> thse challenges. An alternative solution is to extend the eBPF steering > >>>> program so that it will be able to report to the userspace, but it makes > >>>> little sense to allow to implement different hashing algorithms with > >>>> eBPF since the hash value reported by virtio-net is strictly defined by > >>>> the specification. 
> >>>> > >>>> The hash value already stored in sk_buff is not used and computed > >>>> independently since it may have been computed in a way not conformant > >>>> with the specification. > >>>> > >>>> Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com> > >>> > >>>> @@ -2116,31 +2172,49 @@ static ssize_t tun_put_user(struct tun_struct *tun, > >>>> } > >>>> > >>>> if (vnet_hdr_sz) { > >>>> - struct virtio_net_hdr gso; > >>>> + union { > >>>> + struct virtio_net_hdr hdr; > >>>> + struct virtio_net_hdr_v1_hash v1_hash_hdr; > >>>> + } hdr; > >>>> + int ret; > >>>> > >>>> if (iov_iter_count(iter) < vnet_hdr_sz) > >>>> return -EINVAL; > >>>> > >>>> - if (virtio_net_hdr_from_skb(skb, &gso, > >>>> - tun_is_little_endian(tun), true, > >>>> - vlan_hlen)) { > >>>> + if ((READ_ONCE(tun->vnet_hash.flags) & TUN_VNET_HASH_REPORT) && > >>>> + vnet_hdr_sz >= sizeof(hdr.v1_hash_hdr) && > >>>> + skb->tun_vnet_hash) { > >>> > >>> Isn't vnet_hdr_sz guaranteed to be >= hdr.v1_hash_hdr, by virtue of > >>> the set hash ioctl failing otherwise? > >>> > >>> Such checks should be limited to control path where possible > >> > >> There is a potential race since tun->vnet_hash.flags and vnet_hdr_sz are > >> not read at once. > > > > It should not be possible to downgrade the hdr_sz once v1 is selected. > > I see nothing that prevents shrinking the header size. > > tun->vnet_hash.flags is read after vnet_hdr_sz so the race can happen > even for the case the header size grows though this can be fixed by > reordering the two reads. One option is to fail any control path that tries to re-negotiate header size once this hash option is enabled? There is no practical reason to allow feature re-negotiation at any arbitrary time. ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [RFC PATCH 5/7] tun: Introduce virtio-net hashing feature 2023-10-09 10:07 ` Willem de Bruijn @ 2023-10-09 10:11 ` Akihiko Odaki 2023-10-09 10:32 ` Willem de Bruijn 0 siblings, 1 reply; 41+ messages in thread From: Akihiko Odaki @ 2023-10-09 10:11 UTC (permalink / raw) To: Willem de Bruijn Cc: Jason Wang, Michael S. Tsirkin, netdev, linux-kernel, kvm, virtualization, linux-kselftest, bpf, davem, kuba, ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend, kpsingh, rdunlap, willemb, gustavoars, herbert, steffen.klassert, nogikh, pablo, decui, cai, jakub, elver, pabeni, Yuri Benditovich On 2023/10/09 19:07, Willem de Bruijn wrote: > On Mon, Oct 9, 2023 at 3:05 AM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: >> >> >> >> On 2023/10/09 18:54, Willem de Bruijn wrote: >>> On Mon, Oct 9, 2023 at 3:44 AM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: >>>> >>>> On 2023/10/09 17:13, Willem de Bruijn wrote: >>>>> On Sun, Oct 8, 2023 at 12:22 AM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: >>>>>> >>>>>> virtio-net have two usage of hashes: one is RSS and another is hash >>>>>> reporting. Conventionally the hash calculation was done by the VMM. >>>>>> However, computing the hash after the queue was chosen defeats the >>>>>> purpose of RSS. >>>>>> >>>>>> Another approach is to use eBPF steering program. This approach has >>>>>> another downside: it cannot report the calculated hash due to the >>>>>> restrictive nature of eBPF. >>>>>> >>>>>> Introduce the code to compute hashes to the kernel in order to overcome >>>>>> thse challenges. An alternative solution is to extend the eBPF steering >>>>>> program so that it will be able to report to the userspace, but it makes >>>>>> little sense to allow to implement different hashing algorithms with >>>>>> eBPF since the hash value reported by virtio-net is strictly defined by >>>>>> the specification. 
>>>>>> >>>>>> The hash value already stored in sk_buff is not used and computed >>>>>> independently since it may have been computed in a way not conformant >>>>>> with the specification. >>>>>> >>>>>> Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com> >>>>> >>>>>> @@ -2116,31 +2172,49 @@ static ssize_t tun_put_user(struct tun_struct *tun, >>>>>> } >>>>>> >>>>>> if (vnet_hdr_sz) { >>>>>> - struct virtio_net_hdr gso; >>>>>> + union { >>>>>> + struct virtio_net_hdr hdr; >>>>>> + struct virtio_net_hdr_v1_hash v1_hash_hdr; >>>>>> + } hdr; >>>>>> + int ret; >>>>>> >>>>>> if (iov_iter_count(iter) < vnet_hdr_sz) >>>>>> return -EINVAL; >>>>>> >>>>>> - if (virtio_net_hdr_from_skb(skb, &gso, >>>>>> - tun_is_little_endian(tun), true, >>>>>> - vlan_hlen)) { >>>>>> + if ((READ_ONCE(tun->vnet_hash.flags) & TUN_VNET_HASH_REPORT) && >>>>>> + vnet_hdr_sz >= sizeof(hdr.v1_hash_hdr) && >>>>>> + skb->tun_vnet_hash) { >>>>> >>>>> Isn't vnet_hdr_sz guaranteed to be >= hdr.v1_hash_hdr, by virtue of >>>>> the set hash ioctl failing otherwise? >>>>> >>>>> Such checks should be limited to control path where possible >>>> >>>> There is a potential race since tun->vnet_hash.flags and vnet_hdr_sz are >>>> not read at once. >>> >>> It should not be possible to downgrade the hdr_sz once v1 is selected. >> >> I see nothing that prevents shrinking the header size. >> >> tun->vnet_hash.flags is read after vnet_hdr_sz so the race can happen >> even for the case the header size grows though this can be fixed by >> reordering the two reads. > > One option is to fail any control path that tries to re-negotiate > header size once this hash option is enabled? > > There is no practical reason to allow feature re-negotiation at any > arbitrary time. I think it's a bit awkward interface design since tun allows to reconfigure any of its parameters, but it's certainly possible. ^ permalink raw reply [flat|nested] 41+ messages in thread
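Willem's suggestion above — failing any attempt to renegotiate the header size once hash reporting is enabled — can be sketched as a plain control-path check. This is a userspace illustration, not the actual tun internals; the struct and constant names are stand-ins (20 is the size of `struct virtio_net_hdr_v1_hash`: the 12-byte v1 header plus `hash_value`, `hash_report`, and padding):

```c
#include <errno.h>

/* Illustrative stand-ins for the kernel structures and constants. */
#define TUN_VNET_HASH_REPORT 0x02
#define VNET_HDR_V1_HASH_SZ 20 /* sizeof(struct virtio_net_hdr_v1_hash) */

struct tun_state {
	unsigned int vnet_hash_flags;
	int vnet_hdr_sz;
};

/* Reject shrinking the header below the v1_hash size while hash
 * reporting is enabled, so the data path can never observe a
 * (flags, vnet_hdr_sz) pair where reporting is on but the header
 * is too small to carry the hash. */
int set_vnet_hdr_sz(struct tun_state *tun, int new_sz)
{
	if ((tun->vnet_hash_flags & TUN_VNET_HASH_REPORT) &&
	    new_sz < VNET_HDR_V1_HASH_SZ)
		return -EINVAL;
	tun->vnet_hdr_sz = new_sz;
	return 0;
}
```

This mirrors what the posted patch already does for TUNSETVNETHDRSZ; the open question in the thread is whether the converse paths (enabling reporting, endianness changes) should be locked down the same way.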
* Re: [RFC PATCH 5/7] tun: Introduce virtio-net hashing feature 2023-10-09 10:11 ` Akihiko Odaki @ 2023-10-09 10:32 ` Willem de Bruijn 0 siblings, 0 replies; 41+ messages in thread From: Willem de Bruijn @ 2023-10-09 10:32 UTC (permalink / raw) To: Akihiko Odaki Cc: Jason Wang, Michael S. Tsirkin, netdev, linux-kernel, kvm, virtualization, linux-kselftest, bpf, davem, kuba, ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend, kpsingh, rdunlap, willemb, gustavoars, herbert, steffen.klassert, nogikh, pablo, decui, cai, jakub, elver, pabeni, Yuri Benditovich On Mon, Oct 9, 2023 at 3:11 AM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: > > On 2023/10/09 19:07, Willem de Bruijn wrote: > > On Mon, Oct 9, 2023 at 3:05 AM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: > >> > >> > >> > >> On 2023/10/09 18:54, Willem de Bruijn wrote: > >>> On Mon, Oct 9, 2023 at 3:44 AM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: > >>>> > >>>> On 2023/10/09 17:13, Willem de Bruijn wrote: > >>>>> On Sun, Oct 8, 2023 at 12:22 AM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: > >>>>>> > >>>>>> virtio-net have two usage of hashes: one is RSS and another is hash > >>>>>> reporting. Conventionally the hash calculation was done by the VMM. > >>>>>> However, computing the hash after the queue was chosen defeats the > >>>>>> purpose of RSS. > >>>>>> > >>>>>> Another approach is to use eBPF steering program. This approach has > >>>>>> another downside: it cannot report the calculated hash due to the > >>>>>> restrictive nature of eBPF. > >>>>>> > >>>>>> Introduce the code to compute hashes to the kernel in order to overcome > >>>>>> thse challenges. An alternative solution is to extend the eBPF steering > >>>>>> program so that it will be able to report to the userspace, but it makes > >>>>>> little sense to allow to implement different hashing algorithms with > >>>>>> eBPF since the hash value reported by virtio-net is strictly defined by > >>>>>> the specification. 
> >>>>>> > >>>>>> The hash value already stored in sk_buff is not used and computed > >>>>>> independently since it may have been computed in a way not conformant > >>>>>> with the specification. > >>>>>> > >>>>>> Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com> > >>>>> > >>>>>> @@ -2116,31 +2172,49 @@ static ssize_t tun_put_user(struct tun_struct *tun, > >>>>>> } > >>>>>> > >>>>>> if (vnet_hdr_sz) { > >>>>>> - struct virtio_net_hdr gso; > >>>>>> + union { > >>>>>> + struct virtio_net_hdr hdr; > >>>>>> + struct virtio_net_hdr_v1_hash v1_hash_hdr; > >>>>>> + } hdr; > >>>>>> + int ret; > >>>>>> > >>>>>> if (iov_iter_count(iter) < vnet_hdr_sz) > >>>>>> return -EINVAL; > >>>>>> > >>>>>> - if (virtio_net_hdr_from_skb(skb, &gso, > >>>>>> - tun_is_little_endian(tun), true, > >>>>>> - vlan_hlen)) { > >>>>>> + if ((READ_ONCE(tun->vnet_hash.flags) & TUN_VNET_HASH_REPORT) && > >>>>>> + vnet_hdr_sz >= sizeof(hdr.v1_hash_hdr) && > >>>>>> + skb->tun_vnet_hash) { > >>>>> > >>>>> Isn't vnet_hdr_sz guaranteed to be >= hdr.v1_hash_hdr, by virtue of > >>>>> the set hash ioctl failing otherwise? > >>>>> > >>>>> Such checks should be limited to control path where possible > >>>> > >>>> There is a potential race since tun->vnet_hash.flags and vnet_hdr_sz are > >>>> not read at once. > >>> > >>> It should not be possible to downgrade the hdr_sz once v1 is selected. > >> > >> I see nothing that prevents shrinking the header size. > >> > >> tun->vnet_hash.flags is read after vnet_hdr_sz so the race can happen > >> even for the case the header size grows though this can be fixed by > >> reordering the two reads. > > > > One option is to fail any control path that tries to re-negotiate > > header size once this hash option is enabled? > > > > There is no practical reason to allow feature re-negotiation at any > > arbitrary time. > > I think it's a bit awkward interface design since tun allows to > reconfigure any of its parameters, but it's certainly possible. 
If this would be the only exception to that rule, and this is the only place that needs a datapath check, then it's fine to leave as is. In general, this runtime configurability serves little purpose but to help syzbot exercise code paths no real application would attempt. But I won't ask to diverge from whatever tun already does. We just have to be more careful about the possible races it brings. ^ permalink raw reply [flat|nested] 41+ messages in thread
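The race Akihiko describes — `vnet_hdr_sz` read first, `tun->vnet_hash.flags` read second, with an ioctl landing in between — can be shown deterministically in userspace by injecting the "concurrent" update between the two reads. All names here are hypothetical; the point is only that the snapshot the data path acts on (new flag, stale size) never existed as actual device state, which is why the patch keeps the `vnet_hdr_sz >= sizeof(hdr.v1_hash_hdr)` recheck on the data path:

```c
struct cfg {
	unsigned int flags;
	int hdr_sz;
};

struct snapshot {
	unsigned int flags;
	int hdr_sz;
};

/* Read the two fields in the order the patch does, with a hook to
 * simulate a configuration change landing between the two reads. */
struct snapshot racy_read(struct cfg *c, void (*interleave)(struct cfg *))
{
	struct snapshot s;

	s.hdr_sz = c->hdr_sz;	/* first read */
	if (interleave)
		interleave(c);	/* "concurrent" TUNSETVNETHASH + TUNSETVNETHDRSZ */
	s.flags = c->flags;	/* second read */
	return s;
}

/* The interleaved ioctl pair: grow the header, then enable reporting. */
void enable_report(struct cfg *c)
{
	c->hdr_sz = 20;
	c->flags |= 0x02;
}
```

With the reporting flag set but the stale 12-byte size in hand, the data-path size check makes the writer fall back to the legacy header instead of overrunning the buffer — consistent, if not pretty.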
* Re: [RFC PATCH 5/7] tun: Introduce virtio-net hashing feature 2023-10-09 8:44 ` Akihiko Odaki 2023-10-09 9:54 ` Willem de Bruijn @ 2023-10-09 11:50 ` Michael S. Tsirkin 2023-10-10 2:34 ` Akihiko Odaki 1 sibling, 1 reply; 41+ messages in thread From: Michael S. Tsirkin @ 2023-10-09 11:50 UTC (permalink / raw) To: Akihiko Odaki Cc: Willem de Bruijn, Jason Wang, netdev, linux-kernel, kvm, virtualization, linux-kselftest, bpf, davem, kuba, ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend, kpsingh, rdunlap, willemb, gustavoars, herbert, steffen.klassert, nogikh, pablo, decui, cai, jakub, elver, pabeni, Yuri Benditovich On Mon, Oct 09, 2023 at 05:44:20PM +0900, Akihiko Odaki wrote: > On 2023/10/09 17:13, Willem de Bruijn wrote: > > On Sun, Oct 8, 2023 at 12:22 AM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: > > > > > > virtio-net have two usage of hashes: one is RSS and another is hash > > > reporting. Conventionally the hash calculation was done by the VMM. > > > However, computing the hash after the queue was chosen defeats the > > > purpose of RSS. > > > > > > Another approach is to use eBPF steering program. This approach has > > > another downside: it cannot report the calculated hash due to the > > > restrictive nature of eBPF. > > > > > > Introduce the code to compute hashes to the kernel in order to overcome > > > thse challenges. An alternative solution is to extend the eBPF steering > > > program so that it will be able to report to the userspace, but it makes > > > little sense to allow to implement different hashing algorithms with > > > eBPF since the hash value reported by virtio-net is strictly defined by > > > the specification. > > > > > > The hash value already stored in sk_buff is not used and computed > > > independently since it may have been computed in a way not conformant > > > with the specification. 
> > > > > > Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com> > > > > > @@ -2116,31 +2172,49 @@ static ssize_t tun_put_user(struct tun_struct *tun, > > > } > > > > > > if (vnet_hdr_sz) { > > > - struct virtio_net_hdr gso; > > > + union { > > > + struct virtio_net_hdr hdr; > > > + struct virtio_net_hdr_v1_hash v1_hash_hdr; > > > + } hdr; > > > + int ret; > > > > > > if (iov_iter_count(iter) < vnet_hdr_sz) > > > return -EINVAL; > > > > > > - if (virtio_net_hdr_from_skb(skb, &gso, > > > - tun_is_little_endian(tun), true, > > > - vlan_hlen)) { > > > + if ((READ_ONCE(tun->vnet_hash.flags) & TUN_VNET_HASH_REPORT) && > > > + vnet_hdr_sz >= sizeof(hdr.v1_hash_hdr) && > > > + skb->tun_vnet_hash) { > > > > Isn't vnet_hdr_sz guaranteed to be >= hdr.v1_hash_hdr, by virtue of > > the set hash ioctl failing otherwise? > > > > Such checks should be limited to control path where possible > > There is a potential race since tun->vnet_hash.flags and vnet_hdr_sz are not > read at once. And then it's a complete mess and you get inconsistent behaviour with packets getting sent all over the place, right? So maybe keep a pointer to this struct so it can be changed atomically then. Maybe even something with rcu I donnu. -- MST ^ permalink raw reply [flat|nested] 41+ messages in thread
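Michael's "keep a pointer to this struct so it can be changed atomically, maybe with RCU" amounts to publishing all hash parameters behind one pointer, so a reader dereferences once and sees a mutually consistent set. A minimal userspace analogue with C11 atomics (in the kernel this would be `rcu_dereference()`/`rcu_assign_pointer()` with the old copy freed via `kfree_rcu()`; names here are illustrative):

```c
#include <stdatomic.h>
#include <stdlib.h>

/* Every parameter a reader needs lives in one heap object; updates
 * build a new object and swing the pointer, never mutate in place. */
struct vnet_hash_cfg {
	unsigned int flags;
	unsigned int types;
	/* key, indirection table, ... */
};

static _Atomic(struct vnet_hash_cfg *) active_cfg;

/* rcu_dereference() analogue: one load yields a consistent snapshot. */
struct vnet_hash_cfg *read_cfg(void)
{
	return atomic_load_explicit(&active_cfg, memory_order_acquire);
}

/* rcu_assign_pointer() analogue; real kernel code would defer the
 * free until readers are guaranteed to have moved on. */
void publish_cfg(struct vnet_hash_cfg *newc)
{
	struct vnet_hash_cfg *old =
		atomic_exchange_explicit(&active_cfg, newc,
					 memory_order_acq_rel);
	free(old);
}
```

This removes the multi-field race entirely for the hash parameters; as Akihiko notes downthread, `vnet_hdr_sz` lives outside this struct, so its data-path check would still be needed.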
* Re: [RFC PATCH 5/7] tun: Introduce virtio-net hashing feature 2023-10-09 11:50 ` Michael S. Tsirkin @ 2023-10-10 2:34 ` Akihiko Odaki 0 siblings, 0 replies; 41+ messages in thread From: Akihiko Odaki @ 2023-10-10 2:34 UTC (permalink / raw) To: Michael S. Tsirkin Cc: Willem de Bruijn, Jason Wang, netdev, linux-kernel, kvm, virtualization, linux-kselftest, bpf, davem, kuba, ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend, kpsingh, rdunlap, willemb, gustavoars, herbert, steffen.klassert, nogikh, pablo, decui, cai, jakub, elver, pabeni, Yuri Benditovich On 2023/10/09 20:50, Michael S. Tsirkin wrote: > On Mon, Oct 09, 2023 at 05:44:20PM +0900, Akihiko Odaki wrote: >> On 2023/10/09 17:13, Willem de Bruijn wrote: >>> On Sun, Oct 8, 2023 at 12:22 AM Akihiko Odaki <akihiko.odaki@daynix.com> wrote: >>>> >>>> virtio-net have two usage of hashes: one is RSS and another is hash >>>> reporting. Conventionally the hash calculation was done by the VMM. >>>> However, computing the hash after the queue was chosen defeats the >>>> purpose of RSS. >>>> >>>> Another approach is to use eBPF steering program. This approach has >>>> another downside: it cannot report the calculated hash due to the >>>> restrictive nature of eBPF. >>>> >>>> Introduce the code to compute hashes to the kernel in order to overcome >>>> thse challenges. An alternative solution is to extend the eBPF steering >>>> program so that it will be able to report to the userspace, but it makes >>>> little sense to allow to implement different hashing algorithms with >>>> eBPF since the hash value reported by virtio-net is strictly defined by >>>> the specification. >>>> >>>> The hash value already stored in sk_buff is not used and computed >>>> independently since it may have been computed in a way not conformant >>>> with the specification. 
>>>> >>>> Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com> >>> >>>> @@ -2116,31 +2172,49 @@ static ssize_t tun_put_user(struct tun_struct *tun, >>>> } >>>> >>>> if (vnet_hdr_sz) { >>>> - struct virtio_net_hdr gso; >>>> + union { >>>> + struct virtio_net_hdr hdr; >>>> + struct virtio_net_hdr_v1_hash v1_hash_hdr; >>>> + } hdr; >>>> + int ret; >>>> >>>> if (iov_iter_count(iter) < vnet_hdr_sz) >>>> return -EINVAL; >>>> >>>> - if (virtio_net_hdr_from_skb(skb, &gso, >>>> - tun_is_little_endian(tun), true, >>>> - vlan_hlen)) { >>>> + if ((READ_ONCE(tun->vnet_hash.flags) & TUN_VNET_HASH_REPORT) && >>>> + vnet_hdr_sz >= sizeof(hdr.v1_hash_hdr) && >>>> + skb->tun_vnet_hash) { >>> >>> Isn't vnet_hdr_sz guaranteed to be >= hdr.v1_hash_hdr, by virtue of >>> the set hash ioctl failing otherwise? >>> >>> Such checks should be limited to control path where possible >> >> There is a potential race since tun->vnet_hash.flags and vnet_hdr_sz are not >> read at once. > > And then it's a complete mess and you get inconsistent > behaviour with packets getting sent all over the place, right? > So maybe keep a pointer to this struct so it can be > changed atomically then. Maybe even something with rcu I donnu. I think it's a good idea to use RCU for the vnet_hash members, but vnet_hdr_sz is something not specific to vnet_hash so this check will be still necessary. ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [RFC PATCH 5/7] tun: Introduce virtio-net hashing feature 2023-10-08 5:20 ` [RFC PATCH 5/7] tun: Introduce virtio-net hashing feature Akihiko Odaki 2023-10-08 19:07 ` Willem de Bruijn 2023-10-09 8:13 ` Willem de Bruijn @ 2023-10-09 11:38 ` Michael S. Tsirkin 2023-10-10 2:32 ` Akihiko Odaki 2 siblings, 1 reply; 41+ messages in thread From: Michael S. Tsirkin @ 2023-10-09 11:38 UTC (permalink / raw) To: Willem de Bruijn, Jason Wang, Michael S. Tsirkin, netdev, linux-kernel, kvm, virtualization, linux-kselftest, bpf, davem, kuba, ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend, kpsingh, rdunlap, willemb, gustavoars, herbert, steffen.klassert, nogikh, pablo, decui, cai, jakub, elver, pabeni, Yuri Benditovich, Akihiko Odaki Cc: xuanzhuo, shuah Akihiko Odaki sorry about reposts. Having an email with two "@" in the CC list: xuanzhuo@linux.alibaba.comshuah@kernel.org tripped up mutt's reply-all for me and made it send to you only. I am guessing you meant two addresses: xuanzhuo@linux.alibaba.com and shuah@kernel.org. On Sun, Oct 08, 2023 at 02:20:49PM +0900, Akihiko Odaki wrote: > virtio-net have two usage of hashes: one is RSS and another is hash > reporting. Conventionally the hash calculation was done by the VMM. > However, computing the hash after the queue was chosen defeats the > purpose of RSS. > > Another approach is to use eBPF steering program. This approach has > another downside: it cannot report the calculated hash due to the > restrictive nature of eBPF. > > Introduce the code to compute hashes to the kernel in order to overcome > thse challenges. An alternative solution is to extend the eBPF steering > program so that it will be able to report to the userspace, but it makes > little sense to allow to implement different hashing algorithms with > eBPF since the hash value reported by virtio-net is strictly defined by > the specification. 
> > The hash value already stored in sk_buff is not used and computed > independently since it may have been computed in a way not conformant > with the specification. > > Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com> > --- > drivers/net/tun.c | 187 ++++++++++++++++++++++++++++++++---- > include/uapi/linux/if_tun.h | 16 +++ > 2 files changed, 182 insertions(+), 21 deletions(-) > > diff --git a/drivers/net/tun.c b/drivers/net/tun.c > index 89ab9efe522c..561a573cd008 100644 > --- a/drivers/net/tun.c > +++ b/drivers/net/tun.c > @@ -171,6 +171,9 @@ struct tun_prog { > struct bpf_prog *prog; > }; > > +#define TUN_VNET_HASH_MAX_KEY_SIZE 40 > +#define TUN_VNET_HASH_MAX_INDIRECTION_TABLE_LENGTH 128 > + where do these come from? > /* Since the socket were moved to tun_file, to preserve the behavior of persist > * device, socket filter, sndbuf and vnet header size were restore when the > * file were attached to a persist device. > @@ -209,6 +212,9 @@ struct tun_struct { > struct tun_prog __rcu *steering_prog; > struct tun_prog __rcu *filter_prog; > struct ethtool_link_ksettings link_ksettings; > + struct tun_vnet_hash vnet_hash; > + u16 vnet_hash_indirection_table[TUN_VNET_HASH_MAX_INDIRECTION_TABLE_LENGTH]; > + u32 vnet_hash_key[TUN_VNET_HASH_MAX_KEY_SIZE / 4]; That's quite a lot of data to add in this struct, and will be used by a small minority of users. Are you pushing any hot data out of cache with this? Why not allocate these as needed? 
> /* init args */ > struct file *file; > struct ifreq *ifr; > @@ -219,6 +225,13 @@ struct veth { > __be16 h_vlan_TCI; > }; > > +static const struct tun_vnet_hash_cap tun_vnet_hash_cap = { > + .max_indirection_table_length = > + TUN_VNET_HASH_MAX_INDIRECTION_TABLE_LENGTH, > + > + .types = VIRTIO_NET_SUPPORTED_HASH_TYPES > +}; > + > static void tun_flow_init(struct tun_struct *tun); > static void tun_flow_uninit(struct tun_struct *tun); > > @@ -320,10 +333,16 @@ static long tun_set_vnet_be(struct tun_struct *tun, int __user *argp) > if (get_user(be, argp)) > return -EFAULT; > > - if (be) > + if (be) { > + if (!(tun->flags & TUN_VNET_LE) && > + (tun->vnet_hash.flags & TUN_VNET_HASH_REPORT)) { > + return -EINVAL; > + } > + > tun->flags |= TUN_VNET_BE; > - else > + } else { > tun->flags &= ~TUN_VNET_BE; > + } > > return 0; > } > @@ -558,15 +577,47 @@ static u16 tun_ebpf_select_queue(struct tun_struct *tun, struct sk_buff *skb) > return ret % numqueues; > } > > +static u16 tun_vnet_select_queue(struct tun_struct *tun, struct sk_buff *skb) > +{ > + u32 value = qdisc_skb_cb(skb)->tun_vnet_hash_value; > + u32 numqueues; > + u32 index; > + u16 queue; > + > + numqueues = READ_ONCE(tun->numqueues); > + if (!numqueues) > + return 0; > + > + index = value & READ_ONCE(tun->vnet_hash.indirection_table_mask); > + queue = READ_ONCE(tun->vnet_hash_indirection_table[index]); > + if (!queue) > + queue = READ_ONCE(tun->vnet_hash.unclassified_queue); Apparently 0 is an illegal queue value? You are making this part of UAPI better document things like this. 
> + > + return queue % numqueues; > +} > + > static u16 tun_select_queue(struct net_device *dev, struct sk_buff *skb, > struct net_device *sb_dev) > { > struct tun_struct *tun = netdev_priv(dev); > + u8 vnet_hash_flags = READ_ONCE(tun->vnet_hash.flags); > + struct virtio_net_hash hash; > u16 ret; > > + if (vnet_hash_flags & (TUN_VNET_HASH_RSS | TUN_VNET_HASH_REPORT)) { > + virtio_net_hash(skb, READ_ONCE(tun->vnet_hash.types), > + tun->vnet_hash_key, &hash); > + What are all these READ_ONCE things doing? E.g. you seem to be very content to have tun->vnet_hash.types read twice apparently? This is volatile which basically breaks all compiler's attempts to optimize code - needs to be used judiciously. > + skb->tun_vnet_hash = true; > + qdisc_skb_cb(skb)->tun_vnet_hash_value = hash.value; > + qdisc_skb_cb(skb)->tun_vnet_hash_report = hash.report; > + } > + > rcu_read_lock(); > if (rcu_dereference(tun->steering_prog)) > ret = tun_ebpf_select_queue(tun, skb); > + else if (vnet_hash_flags & TUN_VNET_HASH_RSS) > + ret = tun_vnet_select_queue(tun, skb); > else > ret = tun_automq_select_queue(tun, skb); > rcu_read_unlock(); > @@ -2088,10 +2139,15 @@ static ssize_t tun_put_user(struct tun_struct *tun, > struct iov_iter *iter) > { > struct tun_pi pi = { 0, skb->protocol }; > + struct virtio_net_hash vnet_hash = { > + .value = qdisc_skb_cb(skb)->tun_vnet_hash_value, > + .report = qdisc_skb_cb(skb)->tun_vnet_hash_report > + }; > ssize_t total; > int vlan_offset = 0; > int vlan_hlen = 0; > int vnet_hdr_sz = 0; > + size_t vnet_hdr_content_sz; > > if (skb_vlan_tag_present(skb)) > vlan_hlen = VLAN_HLEN; > @@ -2116,31 +2172,49 @@ static ssize_t tun_put_user(struct tun_struct *tun, > } > > if (vnet_hdr_sz) { > - struct virtio_net_hdr gso; > + union { > + struct virtio_net_hdr hdr; > + struct virtio_net_hdr_v1_hash v1_hash_hdr; > + } hdr; > + int ret; > > if (iov_iter_count(iter) < vnet_hdr_sz) > return -EINVAL; > > - if (virtio_net_hdr_from_skb(skb, &gso, > - tun_is_little_endian(tun), 
true, > - vlan_hlen)) { > + if ((READ_ONCE(tun->vnet_hash.flags) & TUN_VNET_HASH_REPORT) && > + vnet_hdr_sz >= sizeof(hdr.v1_hash_hdr) && > + skb->tun_vnet_hash) { > + vnet_hdr_content_sz = sizeof(hdr.v1_hash_hdr); > + ret = virtio_net_hdr_v1_hash_from_skb(skb, > + &hdr.v1_hash_hdr, > + true, > + vlan_hlen, > + &vnet_hash); > + } else { > + vnet_hdr_content_sz = sizeof(hdr.hdr); > + ret = virtio_net_hdr_from_skb(skb, &hdr.hdr, > + tun_is_little_endian(tun), > + true, vlan_hlen); > + } > + > + if (ret) { > struct skb_shared_info *sinfo = skb_shinfo(skb); > pr_err("unexpected GSO type: " > "0x%x, gso_size %d, hdr_len %d\n", > - sinfo->gso_type, tun16_to_cpu(tun, gso.gso_size), > - tun16_to_cpu(tun, gso.hdr_len)); > + sinfo->gso_type, tun16_to_cpu(tun, hdr.hdr.gso_size), > + tun16_to_cpu(tun, hdr.hdr.hdr_len)); > print_hex_dump(KERN_ERR, "tun: ", > DUMP_PREFIX_NONE, > 16, 1, skb->head, > - min((int)tun16_to_cpu(tun, gso.hdr_len), 64), true); > + min((int)tun16_to_cpu(tun, hdr.hdr.hdr_len), 64), true); > WARN_ON_ONCE(1); > return -EINVAL; > } > > - if (copy_to_iter(&gso, sizeof(gso), iter) != sizeof(gso)) > + if (copy_to_iter(&hdr, vnet_hdr_content_sz, iter) != vnet_hdr_content_sz) > return -EFAULT; > > - iov_iter_advance(iter, vnet_hdr_sz - sizeof(gso)); > + iov_iter_advance(iter, vnet_hdr_sz - vnet_hdr_content_sz); > } > > if (vlan_hlen) { > @@ -3007,24 +3081,27 @@ static int tun_set_queue(struct file *file, struct ifreq *ifr) > return ret; > } > > -static int tun_set_ebpf(struct tun_struct *tun, struct tun_prog __rcu **prog_p, > - void __user *data) > +static struct bpf_prog *tun_set_ebpf(struct tun_struct *tun, > + struct tun_prog __rcu **prog_p, > + void __user *data) > { > struct bpf_prog *prog; > int fd; > + int ret; > > if (copy_from_user(&fd, data, sizeof(fd))) > - return -EFAULT; > + return ERR_PTR(-EFAULT); > > if (fd == -1) { > prog = NULL; > } else { > prog = bpf_prog_get_type(fd, BPF_PROG_TYPE_SOCKET_FILTER); > if (IS_ERR(prog)) > - return PTR_ERR(prog); 
> + return prog; > } > > - return __tun_set_ebpf(tun, prog_p, prog); > + ret = __tun_set_ebpf(tun, prog_p, prog); > + return ret ? ERR_PTR(ret) : prog; > } > > /* Return correct value for tun->dev->addr_len based on tun->dev->type. */ > @@ -3082,6 +3159,11 @@ static long __tun_chr_ioctl(struct file *file, unsigned int cmd, > int le; > int ret; > bool do_notify = false; > + struct bpf_prog *bpf_ret; > + struct tun_vnet_hash vnet_hash; > + u16 vnet_hash_indirection_table[TUN_VNET_HASH_MAX_INDIRECTION_TABLE_LENGTH]; > + u8 vnet_hash_key[TUN_VNET_HASH_MAX_KEY_SIZE]; > + size_t len; > > if (cmd == TUNSETIFF || cmd == TUNSETQUEUE || > (_IOC_TYPE(cmd) == SOCK_IOC_TYPE && cmd != SIOCGSKNS)) { > @@ -3295,7 +3377,10 @@ static long __tun_chr_ioctl(struct file *file, unsigned int cmd, > ret = -EFAULT; > break; > } > - if (vnet_hdr_sz < (int)sizeof(struct virtio_net_hdr)) { > + if (vnet_hdr_sz < > + (int)((tun->vnet_hash.flags & TUN_VNET_HASH_REPORT) ? > + sizeof(struct virtio_net_hdr_v1_hash) : > + sizeof(struct virtio_net_hdr))) { > ret = -EINVAL; > break; > } > @@ -3314,10 +3399,16 @@ static long __tun_chr_ioctl(struct file *file, unsigned int cmd, > ret = -EFAULT; > break; > } > - if (le) > + if (le) { > tun->flags |= TUN_VNET_LE; > - else > + } else { > + if (!tun_legacy_is_little_endian(tun)) { > + ret = -EINVAL; > + break; > + } > + > tun->flags &= ~TUN_VNET_LE; > + } > break; > > case TUNGETVNETBE: > @@ -3360,11 +3451,17 @@ static long __tun_chr_ioctl(struct file *file, unsigned int cmd, > break; > > case TUNSETSTEERINGEBPF: > - ret = tun_set_ebpf(tun, &tun->steering_prog, argp); > + bpf_ret = tun_set_ebpf(tun, &tun->steering_prog, argp); > + if (IS_ERR(bpf_ret)) > + ret = PTR_ERR(bpf_ret); > + else if (bpf_ret) > + tun->vnet_hash.flags &= ~TUN_VNET_HASH_RSS; what is this doing? 
> break; > > case TUNSETFILTEREBPF: > - ret = tun_set_ebpf(tun, &tun->filter_prog, argp); > + bpf_ret = tun_set_ebpf(tun, &tun->filter_prog, argp); > + if (IS_ERR(bpf_ret)) > + ret = PTR_ERR(bpf_ret); > break; > > case TUNSETCARRIER: > @@ -3382,6 +3479,54 @@ static long __tun_chr_ioctl(struct file *file, unsigned int cmd, > ret = open_related_ns(&net->ns, get_net_ns); > break; > > + case TUNGETVNETHASHCAP: > + if (copy_to_user(argp, &tun_vnet_hash_cap, > + sizeof(tun_vnet_hash_cap))) > + ret = -EFAULT; > + break; > + > + case TUNSETVNETHASH: > + len = sizeof(vnet_hash); > + if (copy_from_user(&vnet_hash, argp, len)) { > + ret = -EFAULT; > + break; > + } what if flags has some bits set you don't know how to handle? should these be ignored as now or cause a failure? these decisions all affect uapi. > + > + if (((vnet_hash.flags & TUN_VNET_HASH_REPORT) && > + (tun->vnet_hdr_sz < sizeof(struct virtio_net_hdr_v1_hash) || > + !tun_is_little_endian(tun))) || > + vnet_hash.indirection_table_mask >= > + TUN_VNET_HASH_MAX_INDIRECTION_TABLE_LENGTH) { > + ret = -EINVAL; > + break; > + } Given this is later used to index within an array one has to be very careful about spectre things here, which this code isn't. > + > + argp = (u8 __user *)argp + len; > + len = (vnet_hash.indirection_table_mask + 1) * 2; comment pointer math tricks like this extensively please. 
> + if (copy_from_user(vnet_hash_indirection_table, argp, len)) { > + ret = -EFAULT; > + break; > + } > + > + argp = (u8 __user *)argp + len; > + len = virtio_net_hash_key_length(vnet_hash.types); > + > + if (copy_from_user(vnet_hash_key, argp, len)) { > + ret = -EFAULT; > + break; > + } > + > + tun->vnet_hash = vnet_hash; > + memcpy(tun->vnet_hash_indirection_table, > + vnet_hash_indirection_table, > + (vnet_hash.indirection_table_mask + 1) * 2); > + memcpy(tun->vnet_hash_key, vnet_hash_key, len); > + > + if (vnet_hash.flags & TUN_VNET_HASH_RSS) > + __tun_set_ebpf(tun, &tun->steering_prog, NULL); > + > + break; > + > default: > ret = -EINVAL; > break; > diff --git a/include/uapi/linux/if_tun.h b/include/uapi/linux/if_tun.h > index 287cdc81c939..dc591cd897c8 100644 > --- a/include/uapi/linux/if_tun.h > +++ b/include/uapi/linux/if_tun.h > @@ -61,6 +61,8 @@ > #define TUNSETFILTEREBPF _IOR('T', 225, int) > #define TUNSETCARRIER _IOW('T', 226, int) > #define TUNGETDEVNETNS _IO('T', 227) > +#define TUNGETVNETHASHCAP _IO('T', 228) > +#define TUNSETVNETHASH _IOW('T', 229, unsigned int) > > /* TUNSETIFF ifr flags */ > #define IFF_TUN 0x0001 > @@ -115,4 +117,18 @@ struct tun_filter { > __u8 addr[][ETH_ALEN]; > }; > > +struct tun_vnet_hash_cap { > + __u16 max_indirection_table_length; > + __u32 types; > +}; > + There's hidden padding in this struct - not good, copy will leak kernel info out. > +#define TUN_VNET_HASH_RSS 0x01 > +#define TUN_VNET_HASH_REPORT 0x02 Do you intend to add more flags down the road? How will userspace know what is supported? > +struct tun_vnet_hash { > + __u8 flags; > + __u32 types; > + __u16 indirection_table_mask; > + __u16 unclassified_queue; > +}; > + Padding here too. Best avoided. In any case, document UAPI please. > #endif /* _UAPI__IF_TUN_H */ > -- > 2.42.0 ^ permalink raw reply [flat|nested] 41+ messages in thread
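The hidden-padding problem Michael flags can be made concrete with `offsetof`/`sizeof`. On common ABIs the posted layout gets a 3-byte hole after `flags` (the `__u32` forces 4-byte alignment), and copying the whole struct to userspace would leak whatever the kernel left in that hole. A sketch of both layouts using stdint types (the uapi headers use `__u8`/`__u32` etc.; this is a userspace illustration):

```c
#include <stdint.h>
#include <stddef.h>

/* Layout as posted: 3 bytes of implicit padding follow 'flags',
 * which a whole-struct copy_to_user() would leak uninitialized. */
struct tun_vnet_hash_posted {
	uint8_t flags;
	/* 3-byte hole here on common ABIs */
	uint32_t types;
	uint16_t indirection_table_mask;
	uint16_t unclassified_queue;
};

/* One way to avoid hidden holes: order members largest-first and
 * make the remaining padding explicit so it can be zeroed (and
 * later reused for new fields) deliberately. */
struct tun_vnet_hash_packed {
	uint32_t types;
	uint16_t indirection_table_mask;
	uint16_t unclassified_queue;
	uint8_t flags;
	uint8_t pad[3]; /* explicit, memset-able */
};
```

The same reasoning applies to `struct tun_vnet_hash_cap`, where a `__u16` followed by a `__u32` leaves a 2-byte hole.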
* Re: [RFC PATCH 5/7] tun: Introduce virtio-net hashing feature 2023-10-09 11:38 ` Michael S. Tsirkin @ 2023-10-10 2:32 ` Akihiko Odaki 0 siblings, 0 replies; 41+ messages in thread From: Akihiko Odaki @ 2023-10-10 2:32 UTC (permalink / raw) To: Michael S. Tsirkin, Willem de Bruijn, Jason Wang, netdev, linux-kernel, kvm, virtualization, linux-kselftest, bpf, davem, kuba, ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend, kpsingh, rdunlap, willemb, gustavoars, herbert, steffen.klassert, nogikh, pablo, decui, cai, jakub, elver, pabeni, Yuri Benditovich Cc: xuanzhuo, shuah On 2023/10/09 20:38, Michael S. Tsirkin wrote: > Akihiko Odaki sorry about reposts. > Having an email with two "@" in the CC list: > xuanzhuo@linux.alibaba.comshuah@kernel.org > tripped up mutt's reply-all for me and made it send to you only. > > I am guessing you meant two addresses: xuanzhuo@linux.alibaba.com > and shuah@kernel.org. Oops, I'll send the future versions with corrected addresses (and fewer CCs). > > > On Sun, Oct 08, 2023 at 02:20:49PM +0900, Akihiko Odaki wrote: >> virtio-net have two usage of hashes: one is RSS and another is hash >> reporting. Conventionally the hash calculation was done by the VMM. >> However, computing the hash after the queue was chosen defeats the >> purpose of RSS. >> >> Another approach is to use eBPF steering program. This approach has >> another downside: it cannot report the calculated hash due to the >> restrictive nature of eBPF. >> >> Introduce the code to compute hashes to the kernel in order to overcome >> thse challenges. An alternative solution is to extend the eBPF steering >> program so that it will be able to report to the userspace, but it makes >> little sense to allow to implement different hashing algorithms with >> eBPF since the hash value reported by virtio-net is strictly defined by >> the specification. 
>> >> The hash value already stored in sk_buff is not used and computed >> independently since it may have been computed in a way not conformant >> with the specification. >> >> Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com> >> --- >> drivers/net/tun.c | 187 ++++++++++++++++++++++++++++++++---- >> include/uapi/linux/if_tun.h | 16 +++ >> 2 files changed, 182 insertions(+), 21 deletions(-) >> >> diff --git a/drivers/net/tun.c b/drivers/net/tun.c >> index 89ab9efe522c..561a573cd008 100644 >> --- a/drivers/net/tun.c >> +++ b/drivers/net/tun.c >> @@ -171,6 +171,9 @@ struct tun_prog { >> struct bpf_prog *prog; >> }; >> >> +#define TUN_VNET_HASH_MAX_KEY_SIZE 40 >> +#define TUN_VNET_HASH_MAX_INDIRECTION_TABLE_LENGTH 128 >> + > > where do these come from? TUN_VNET_HASH_MAX_KEY_SIZE is the total size of IPv6 addresses and TCP/UDP ports. TUN_VNET_HASH_MAX_INDIRECTION_TABLE_LENGTH is from the following statement in VIRTIO Version 1.2 clause 5.1.4.1: > The device MUST set rss_max_indirection_table_length to at least 128, > if it offers VIRTIO_NET_F_RSS. > >> /* Since the socket were moved to tun_file, to preserve the behavior of persist >> * device, socket filter, sndbuf and vnet header size were restore when the >> * file were attached to a persist device. >> @@ -209,6 +212,9 @@ struct tun_struct { >> struct tun_prog __rcu *steering_prog; >> struct tun_prog __rcu *filter_prog; >> struct ethtool_link_ksettings link_ksettings; >> + struct tun_vnet_hash vnet_hash; >> + u16 vnet_hash_indirection_table[TUN_VNET_HASH_MAX_INDIRECTION_TABLE_LENGTH]; >> + u32 vnet_hash_key[TUN_VNET_HASH_MAX_KEY_SIZE / 4]; > > That's quite a lot of data to add in this struct, and will be used > by a small minority of users. Are you pushing any hot data out of cache > with this? Why not allocate these as needed? I put this here just because it's convenient. It would be better to have a separate structure. 
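On the provenance of `TUN_VNET_HASH_MAX_KEY_SIZE`: the hash virtio-net specifies is the Toeplitz function, and 40 bytes is the conventional RSS key length — the longest input (IPv6 source and destination addresses plus the two ports, 36 bytes) plus the 4 bytes needed to seed the initial 32-bit window. A self-contained sketch of the algorithm, checked against the first entry of the Microsoft RSS verification-suite vectors; this illustrates the spec-defined function, not the patch's actual implementation:

```c
#include <stdint.h>
#include <stddef.h>

/* Toeplitz hash as specified for RSS / virtio-net hash reporting:
 * for every set bit of the input (MSB first), XOR the current
 * 32-bit window of the key into the result, then slide the window
 * one bit further along the key. */
uint32_t toeplitz_hash(const uint8_t *key, size_t key_len,
		       const uint8_t *data, size_t data_len)
{
	uint32_t hash = 0;
	uint32_t window = ((uint32_t)key[0] << 24) | ((uint32_t)key[1] << 16) |
			  ((uint32_t)key[2] << 8) | key[3];
	size_t bit = 32; /* index of the next key bit to shift in */

	for (size_t i = 0; i < data_len; i++) {
		for (int b = 7; b >= 0; b--) {
			if (data[i] & (1u << b))
				hash ^= window;
			window <<= 1;
			if (bit < key_len * 8 &&
			    (key[bit / 8] & (0x80u >> (bit % 8))))
				window |= 1;
			bit++;
		}
	}
	return hash;
}
```

Since the window consumes input-bits-plus-32 bits of key, a 36-byte maximum input is exactly what makes 40 the minimum sufficient key size.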
> >> /* init args */ >> struct file *file; >> struct ifreq *ifr; >> @@ -219,6 +225,13 @@ struct veth { >> __be16 h_vlan_TCI; >> }; >> >> +static const struct tun_vnet_hash_cap tun_vnet_hash_cap = { >> + .max_indirection_table_length = >> + TUN_VNET_HASH_MAX_INDIRECTION_TABLE_LENGTH, >> + >> + .types = VIRTIO_NET_SUPPORTED_HASH_TYPES >> +}; >> + >> static void tun_flow_init(struct tun_struct *tun); >> static void tun_flow_uninit(struct tun_struct *tun); >> >> @@ -320,10 +333,16 @@ static long tun_set_vnet_be(struct tun_struct *tun, int __user *argp) >> if (get_user(be, argp)) >> return -EFAULT; >> >> - if (be) >> + if (be) { >> + if (!(tun->flags & TUN_VNET_LE) && >> + (tun->vnet_hash.flags & TUN_VNET_HASH_REPORT)) { >> + return -EINVAL; >> + } >> + >> tun->flags |= TUN_VNET_BE; >> - else >> + } else { >> tun->flags &= ~TUN_VNET_BE; >> + } >> >> return 0; >> } >> @@ -558,15 +577,47 @@ static u16 tun_ebpf_select_queue(struct tun_struct *tun, struct sk_buff *skb) >> return ret % numqueues; >> } >> >> +static u16 tun_vnet_select_queue(struct tun_struct *tun, struct sk_buff *skb) >> +{ >> + u32 value = qdisc_skb_cb(skb)->tun_vnet_hash_value; >> + u32 numqueues; >> + u32 index; >> + u16 queue; >> + >> + numqueues = READ_ONCE(tun->numqueues); >> + if (!numqueues) >> + return 0; >> + >> + index = value & READ_ONCE(tun->vnet_hash.indirection_table_mask); >> + queue = READ_ONCE(tun->vnet_hash_indirection_table[index]); >> + if (!queue) >> + queue = READ_ONCE(tun->vnet_hash.unclassified_queue); > > Apparently 0 is an illegal queue value? You are making this part > of UAPI better document things like this. Well, I don't think so. I naively copied this from QEMU's eBPF steering program but I think this should check tun_vnet_hash_report == VIRTIO_NET_HASH_REPORT_NONE. I'll fix it. 
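The fix Akihiko agrees to above — treating "unclassified" by checking the hash report type rather than overloading queue 0 — can be sketched as the steering step in isolation. Names mirror the patch but this is a userspace illustration; as Michael notes elsewhere in the thread, the kernel version should also clamp the user-controlled table index with `array_index_nospec()` to avoid Spectre-v1 speculation past the bounds check:

```c
#include <stdint.h>

#define VIRTIO_NET_HASH_REPORT_NONE 0

/* RSS steering: mask the Toeplitz hash into the indirection table;
 * packets the hash function could not classify (report == NONE) go
 * to the configured "unclassified" queue instead. */
uint16_t rss_select_queue(uint32_t hash, uint16_t report,
			  const uint16_t *table, uint16_t mask,
			  uint16_t unclassified_queue, uint32_t numqueues)
{
	uint16_t queue;

	if (!numqueues)
		return 0;

	if (report == VIRTIO_NET_HASH_REPORT_NONE)
		queue = unclassified_queue;
	else
		queue = table[hash & mask];

	return queue % numqueues;
}
```

Compared with the posted `if (!queue)` test, this keeps queue 0 usable as an ordinary indirection-table entry.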
> >> + >> + return queue % numqueues; >> +} >> + >> static u16 tun_select_queue(struct net_device *dev, struct sk_buff *skb, >> struct net_device *sb_dev) >> { >> struct tun_struct *tun = netdev_priv(dev); >> + u8 vnet_hash_flags = READ_ONCE(tun->vnet_hash.flags); >> + struct virtio_net_hash hash; >> u16 ret; >> >> + if (vnet_hash_flags & (TUN_VNET_HASH_RSS | TUN_VNET_HASH_REPORT)) { >> + virtio_net_hash(skb, READ_ONCE(tun->vnet_hash.types), >> + tun->vnet_hash_key, &hash); >> + > > What are all these READ_ONCE things doing? > E.g. you seem to be very content to have tun->vnet_hash.types read twice apparently? > This is volatile which basically breaks all compiler's attempts > to optimize code - needs to be used judiciously. I'll replace them with one RCU dereference once I extract them into a structure. > > > >> + skb->tun_vnet_hash = true; >> + qdisc_skb_cb(skb)->tun_vnet_hash_value = hash.value; >> + qdisc_skb_cb(skb)->tun_vnet_hash_report = hash.report; >> + } >> + >> rcu_read_lock(); >> if (rcu_dereference(tun->steering_prog)) >> ret = tun_ebpf_select_queue(tun, skb); >> + else if (vnet_hash_flags & TUN_VNET_HASH_RSS) >> + ret = tun_vnet_select_queue(tun, skb); >> else >> ret = tun_automq_select_queue(tun, skb); >> rcu_read_unlock(); >> @@ -2088,10 +2139,15 @@ static ssize_t tun_put_user(struct tun_struct *tun, >> struct iov_iter *iter) >> { >> struct tun_pi pi = { 0, skb->protocol }; >> + struct virtio_net_hash vnet_hash = { >> + .value = qdisc_skb_cb(skb)->tun_vnet_hash_value, >> + .report = qdisc_skb_cb(skb)->tun_vnet_hash_report >> + }; >> ssize_t total; >> int vlan_offset = 0; >> int vlan_hlen = 0; >> int vnet_hdr_sz = 0; >> + size_t vnet_hdr_content_sz; >> >> if (skb_vlan_tag_present(skb)) >> vlan_hlen = VLAN_HLEN; >> @@ -2116,31 +2172,49 @@ static ssize_t tun_put_user(struct tun_struct *tun, >> } >> >> if (vnet_hdr_sz) { >> - struct virtio_net_hdr gso; >> + union { >> + struct virtio_net_hdr hdr; >> + struct virtio_net_hdr_v1_hash v1_hash_hdr; >> + } 
hdr; >> + int ret; >> >> if (iov_iter_count(iter) < vnet_hdr_sz) >> return -EINVAL; >> >> - if (virtio_net_hdr_from_skb(skb, &gso, >> - tun_is_little_endian(tun), true, >> - vlan_hlen)) { >> + if ((READ_ONCE(tun->vnet_hash.flags) & TUN_VNET_HASH_REPORT) && >> + vnet_hdr_sz >= sizeof(hdr.v1_hash_hdr) && >> + skb->tun_vnet_hash) { >> + vnet_hdr_content_sz = sizeof(hdr.v1_hash_hdr); >> + ret = virtio_net_hdr_v1_hash_from_skb(skb, >> + &hdr.v1_hash_hdr, >> + true, >> + vlan_hlen, >> + &vnet_hash); >> + } else { >> + vnet_hdr_content_sz = sizeof(hdr.hdr); >> + ret = virtio_net_hdr_from_skb(skb, &hdr.hdr, >> + tun_is_little_endian(tun), >> + true, vlan_hlen); >> + } >> + >> + if (ret) { >> struct skb_shared_info *sinfo = skb_shinfo(skb); >> pr_err("unexpected GSO type: " >> "0x%x, gso_size %d, hdr_len %d\n", >> - sinfo->gso_type, tun16_to_cpu(tun, gso.gso_size), >> - tun16_to_cpu(tun, gso.hdr_len)); >> + sinfo->gso_type, tun16_to_cpu(tun, hdr.hdr.gso_size), >> + tun16_to_cpu(tun, hdr.hdr.hdr_len)); >> print_hex_dump(KERN_ERR, "tun: ", >> DUMP_PREFIX_NONE, >> 16, 1, skb->head, >> - min((int)tun16_to_cpu(tun, gso.hdr_len), 64), true); >> + min((int)tun16_to_cpu(tun, hdr.hdr.hdr_len), 64), true); >> WARN_ON_ONCE(1); >> return -EINVAL; >> } >> >> - if (copy_to_iter(&gso, sizeof(gso), iter) != sizeof(gso)) >> + if (copy_to_iter(&hdr, vnet_hdr_content_sz, iter) != vnet_hdr_content_sz) >> return -EFAULT; >> >> - iov_iter_advance(iter, vnet_hdr_sz - sizeof(gso)); >> + iov_iter_advance(iter, vnet_hdr_sz - vnet_hdr_content_sz); >> } >> >> if (vlan_hlen) { >> @@ -3007,24 +3081,27 @@ static int tun_set_queue(struct file *file, struct ifreq *ifr) >> return ret; >> } >> >> -static int tun_set_ebpf(struct tun_struct *tun, struct tun_prog __rcu **prog_p, >> - void __user *data) >> +static struct bpf_prog *tun_set_ebpf(struct tun_struct *tun, >> + struct tun_prog __rcu **prog_p, >> + void __user *data) >> { >> struct bpf_prog *prog; >> int fd; >> + int ret; >> >> if (copy_from_user(&fd, 
data, sizeof(fd))) >> - return -EFAULT; >> + return ERR_PTR(-EFAULT); >> >> if (fd == -1) { >> prog = NULL; >> } else { >> prog = bpf_prog_get_type(fd, BPF_PROG_TYPE_SOCKET_FILTER); >> if (IS_ERR(prog)) >> - return PTR_ERR(prog); >> + return prog; >> } >> >> - return __tun_set_ebpf(tun, prog_p, prog); >> + ret = __tun_set_ebpf(tun, prog_p, prog); >> + return ret ? ERR_PTR(ret) : prog; >> } >> >> /* Return correct value for tun->dev->addr_len based on tun->dev->type. */ >> @@ -3082,6 +3159,11 @@ static long __tun_chr_ioctl(struct file *file, unsigned int cmd, >> int le; >> int ret; >> bool do_notify = false; >> + struct bpf_prog *bpf_ret; >> + struct tun_vnet_hash vnet_hash; >> + u16 vnet_hash_indirection_table[TUN_VNET_HASH_MAX_INDIRECTION_TABLE_LENGTH]; >> + u8 vnet_hash_key[TUN_VNET_HASH_MAX_KEY_SIZE]; >> + size_t len; >> >> if (cmd == TUNSETIFF || cmd == TUNSETQUEUE || >> (_IOC_TYPE(cmd) == SOCK_IOC_TYPE && cmd != SIOCGSKNS)) { >> @@ -3295,7 +3377,10 @@ static long __tun_chr_ioctl(struct file *file, unsigned int cmd, >> ret = -EFAULT; >> break; >> } >> - if (vnet_hdr_sz < (int)sizeof(struct virtio_net_hdr)) { >> + if (vnet_hdr_sz < >> + (int)((tun->vnet_hash.flags & TUN_VNET_HASH_REPORT) ? 
>> + sizeof(struct virtio_net_hdr_v1_hash) : >> + sizeof(struct virtio_net_hdr))) { >> ret = -EINVAL; >> break; >> } >> @@ -3314,10 +3399,16 @@ static long __tun_chr_ioctl(struct file *file, unsigned int cmd, >> ret = -EFAULT; >> break; >> } >> - if (le) >> + if (le) { >> tun->flags |= TUN_VNET_LE; >> - else >> + } else { >> + if (!tun_legacy_is_little_endian(tun)) { >> + ret = -EINVAL; >> + break; >> + } >> + >> tun->flags &= ~TUN_VNET_LE; >> + } >> break; >> >> case TUNGETVNETBE: >> @@ -3360,11 +3451,17 @@ static long __tun_chr_ioctl(struct file *file, unsigned int cmd, >> break; >> >> case TUNSETSTEERINGEBPF: >> - ret = tun_set_ebpf(tun, &tun->steering_prog, argp); >> + bpf_ret = tun_set_ebpf(tun, &tun->steering_prog, argp); >> + if (IS_ERR(bpf_ret)) >> + ret = PTR_ERR(bpf_ret); >> + else if (bpf_ret) >> + tun->vnet_hash.flags &= ~TUN_VNET_HASH_RSS; > > what is this doing? It's unsetting RSS because steering eBPF is incompatible with RSS. Willem proposed to return EBUSY instead so I'll do so in the future. > >> break; >> >> case TUNSETFILTEREBPF: >> - ret = tun_set_ebpf(tun, &tun->filter_prog, argp); >> + bpf_ret = tun_set_ebpf(tun, &tun->filter_prog, argp); >> + if (IS_ERR(bpf_ret)) >> + ret = PTR_ERR(bpf_ret); >> break; >> >> case TUNSETCARRIER: >> @@ -3382,6 +3479,54 @@ static long __tun_chr_ioctl(struct file *file, unsigned int cmd, >> ret = open_related_ns(&net->ns, get_net_ns); >> break; >> >> + case TUNGETVNETHASHCAP: >> + if (copy_to_user(argp, &tun_vnet_hash_cap, >> + sizeof(tun_vnet_hash_cap))) >> + ret = -EFAULT; >> + break; >> + >> + case TUNSETVNETHASH: >> + len = sizeof(vnet_hash); >> + if (copy_from_user(&vnet_hash, argp, len)) { >> + ret = -EFAULT; >> + break; >> + } > > > what if flags has some bits set you don't know how to handle? > should these be ignored as now or cause a failure? > these decisions all affect uapi. Currently they are ignored. 
> >> + >> + if (((vnet_hash.flags & TUN_VNET_HASH_REPORT) && >> + (tun->vnet_hdr_sz < sizeof(struct virtio_net_hdr_v1_hash) || >> + !tun_is_little_endian(tun))) || >> + vnet_hash.indirection_table_mask >= >> + TUN_VNET_HASH_MAX_INDIRECTION_TABLE_LENGTH) { >> + ret = -EINVAL; >> + break; >> + } > > Given this is later used to index within an array one has to > be very careful about spectre things here, which this code isn't. I have totally forgotten that. Thanks for pointing this out. > > >> + >> + argp = (u8 __user *)argp + len; >> + len = (vnet_hash.indirection_table_mask + 1) * 2; > > comment pointer math tricks like this extensively please. Ok, I will. > >> + if (copy_from_user(vnet_hash_indirection_table, argp, len)) { >> + ret = -EFAULT; >> + break; >> + } >> + >> + argp = (u8 __user *)argp + len; >> + len = virtio_net_hash_key_length(vnet_hash.types); >> + >> + if (copy_from_user(vnet_hash_key, argp, len)) { >> + ret = -EFAULT; >> + break; >> + } >> + >> + tun->vnet_hash = vnet_hash; >> + memcpy(tun->vnet_hash_indirection_table, >> + vnet_hash_indirection_table, >> + (vnet_hash.indirection_table_mask + 1) * 2); >> + memcpy(tun->vnet_hash_key, vnet_hash_key, len); >> + >> + if (vnet_hash.flags & TUN_VNET_HASH_RSS) >> + __tun_set_ebpf(tun, &tun->steering_prog, NULL); >> + >> + break; >> + >> default: >> ret = -EINVAL; >> break; >> diff --git a/include/uapi/linux/if_tun.h b/include/uapi/linux/if_tun.h >> index 287cdc81c939..dc591cd897c8 100644 >> --- a/include/uapi/linux/if_tun.h >> +++ b/include/uapi/linux/if_tun.h >> @@ -61,6 +61,8 @@ >> #define TUNSETFILTEREBPF _IOR('T', 225, int) >> #define TUNSETCARRIER _IOW('T', 226, int) >> #define TUNGETDEVNETNS _IO('T', 227) >> +#define TUNGETVNETHASHCAP _IO('T', 228) >> +#define TUNSETVNETHASH _IOW('T', 229, unsigned int) >> >> /* TUNSETIFF ifr flags */ >> #define IFF_TUN 0x0001 >> @@ -115,4 +117,18 @@ struct tun_filter { >> __u8 addr[][ETH_ALEN]; >> }; >> >> +struct tun_vnet_hash_cap { >> + __u16 
max_indirection_table_length; >> + __u32 types; >> +}; >> + > > There's hidden padding in this struct - not good, copy > will leak kernel info out. I'll add explicit padding. > > > >> +#define TUN_VNET_HASH_RSS 0x01 >> +#define TUN_VNET_HASH_REPORT 0x02 > > > Do you intend to add more flags down the road? > How will userspace know what is supported? I don't intend to do so, but it may be better to return EINVAL when an unknown flag is set. > >> +struct tun_vnet_hash { >> + __u8 flags; >> + __u32 types; >> + __u16 indirection_table_mask; >> + __u16 unclassified_queue; >> +}; >> + > > Padding here too. Best avoided. It may be easier to just grow flags into a u32. > > In any case, document UAPI please. Sure, I will. ^ permalink raw reply [flat|nested] 41+ messages in thread
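On the Spectre point raised for the indirection-table lookup above: the kernel's usual remedy is array_index_nospec(), which clamps a user-controlled index with a branchless mask. A userspace sketch of that mask, modeled on the kernel's generic array_index_mask_nospec() (illustrative; assumes an arithmetic right shift on negative long, as mainstream compilers provide):

```c
#include <assert.h>
#include <stddef.h>

/* Return an all-ones mask when index < size and zero otherwise,
 * without a conditional branch the CPU could speculate past.
 * index & mask is then always within bounds, even when the bounds
 * check itself is misspeculated. */
static size_t index_mask_nospec(size_t index, size_t size)
{
	/* size - 1 - index wraps (sets the top bit) exactly when
	 * index >= size; the arithmetic shift then smears the inverted
	 * top bit across the whole word. */
	return (size_t)(~(long)(index | (size - 1 - index)) >>
			(8 * sizeof(long) - 1));
}
```

Applying `index &= index_mask_nospec(index, table_len);` before the table load keeps the access in bounds even under misspeculation.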
* [RFC PATCH 6/7] selftest: tun: Add tests for virtio-net hashing 2023-10-08 5:20 [RFC PATCH 0/7] tun: Introduce virtio-net hashing feature Akihiko Odaki ` (4 preceding siblings ...) 2023-10-08 5:20 ` [RFC PATCH 5/7] tun: Introduce virtio-net hashing feature Akihiko Odaki @ 2023-10-08 5:20 ` Akihiko Odaki 2023-10-20 1:52 ` kernel test robot 2023-10-08 5:20 ` [RFC PATCH 7/7] vhost_net: Support VIRTIO_NET_F_HASH_REPORT Akihiko Odaki 2023-10-08 18:36 ` [RFC PATCH 0/7] tun: Introduce virtio-net hashing feature Willem de Bruijn 7 siblings, 1 reply; 41+ messages in thread From: Akihiko Odaki @ 2023-10-08 5:20 UTC (permalink / raw) Cc: Willem de Bruijn, Jason Wang, Michael S. Tsirkin, Xuan Zhuo Shuah Khan The added tests confirm tun can perform RSS and hash reporting, and reject invalid configurations for them. Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com> --- tools/testing/selftests/net/Makefile | 2 +- tools/testing/selftests/net/tun.c | 578 ++++++++++++++++++++++++++- 2 files changed, 572 insertions(+), 8 deletions(-) diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile index 8b017070960d..253a683073d9 100644 --- a/tools/testing/selftests/net/Makefile +++ b/tools/testing/selftests/net/Makefile @@ -2,7 +2,7 @@ # Makefile for net selftests CFLAGS = -Wall -Wl,--no-as-needed -O2 -g -CFLAGS += -I../../../../usr/include/ $(KHDR_INCLUDES) +CFLAGS += -I../../../../usr/include/ -I../../../include/ $(KHDR_INCLUDES) # Additional include paths needed by kselftest.h CFLAGS += -I../ diff --git a/tools/testing/selftests/net/tun.c b/tools/testing/selftests/net/tun.c index fa83918b62d1..862652fb4ed4 100644 --- a/tools/testing/selftests/net/tun.c +++ b/tools/testing/selftests/net/tun.c @@ -2,21 +2,39 @@ #define _GNU_SOURCE +#include <endian.h> #include <errno.h> #include <fcntl.h> +#include <stddef.h> #include <stdio.h> #include <stdlib.h> #include <string.h> #include <unistd.h> -#include <linux/if.h> +#include <net/if.h> +#include 
<netinet/ip.h> +#include <sys/ioctl.h> +#include <sys/socket.h> +#include <linux/compiler.h> +#include <linux/icmp.h> +#include <linux/if_arp.h> #include <linux/if_tun.h> +#include <linux/ipv6.h> #include <linux/netlink.h> #include <linux/rtnetlink.h> -#include <sys/ioctl.h> -#include <sys/socket.h> +#include <linux/sockios.h> +#include <linux/tcp.h> +#include <linux/udp.h> +#include <linux/virtio_net.h> #include "../kselftest_harness.h" +#define TUN_HWADDR_SOURCE { 0x02, 0x00, 0x00, 0x00, 0x00, 0x00 } + +#define TUN_HWADDR_DEST { 0x02, 0x00, 0x00, 0x00, 0x00, 0x01 } + +#define TUN_IPADDR_SOURCE htonl((172 << 24) | (17 << 16) | 0) +#define TUN_IPADDR_DEST htonl((172 << 24) | (17 << 16) | 1) + static int tun_attach(int fd, char *dev) { struct ifreq ifr; @@ -39,7 +57,7 @@ static int tun_detach(int fd, char *dev) return ioctl(fd, TUNSETQUEUE, (void *) &ifr); } -static int tun_alloc(char *dev) +static int tun_alloc(char *dev, short flags) { struct ifreq ifr; int fd, err; @@ -52,7 +70,8 @@ static int tun_alloc(char *dev) memset(&ifr, 0, sizeof(ifr)); strcpy(ifr.ifr_name, dev); - ifr.ifr_flags = IFF_TAP | IFF_NAPI | IFF_MULTI_QUEUE; + ifr.ifr_flags = flags | IFF_TAP | IFF_NAPI | IFF_NO_PI | + IFF_MULTI_QUEUE; err = ioctl(fd, TUNSETIFF, (void *) &ifr); if (err < 0) { @@ -64,6 +83,40 @@ static int tun_alloc(char *dev) return fd; } +static bool tun_add_to_bridge(int local_fd, const char *name) +{ + struct ifreq ifreq = { + .ifr_name = "xbridge", + .ifr_ifindex = if_nametoindex(name) + }; + + if (!ifreq.ifr_ifindex) { + perror("if_nametoindex"); + return false; + } + + if (ioctl(local_fd, SIOCBRADDIF, &ifreq)) { + perror("SIOCBRADDIF"); + return false; + } + + return true; +} + +static bool tun_set_flags(int local_fd, const char *name, short flags) +{ + struct ifreq ifreq = { .ifr_flags = flags }; + + strcpy(ifreq.ifr_name, name); + + if (ioctl(local_fd, SIOCSIFFLAGS, &ifreq)) { + perror("SIOCSIFFLAGS"); + return false; + } + + return true; +} + static int tun_delete(char 
*dev) { struct { @@ -102,6 +155,153 @@ static int tun_delete(char *dev) return ret; } +static uint32_t tun_sum(const void *buf, size_t len) +{ + const uint16_t *sbuf = buf; + uint32_t sum = 0; + + while (len > 1) { + sum += *sbuf++; + len -= 2; + } + + if (len) + sum += *(uint8_t *)sbuf; + + return sum; +} + +static uint16_t tun_build_ip_check(uint32_t sum) +{ + return ~((sum & 0xffff) + (sum >> 16)); +} + +static uint32_t tun_build_ip_pseudo_sum(const void *iphdr) +{ + uint16_t tot_len = ntohs(((struct iphdr *)iphdr)->tot_len); + + return tun_sum((char *)iphdr + offsetof(struct iphdr, saddr), 8) + + htons(((struct iphdr *)iphdr)->protocol) + + htons(tot_len - sizeof(struct iphdr)); +} + +static uint32_t tun_build_ipv6_pseudo_sum(const void *ipv6hdr) +{ + return tun_sum((char *)ipv6hdr + offsetof(struct ipv6hdr, saddr), 32) + + ((struct ipv6hdr *)ipv6hdr)->payload_len + + htons(((struct ipv6hdr *)ipv6hdr)->nexthdr); +} + +static void tun_build_ethhdr(struct ethhdr *ethhdr, uint16_t proto) +{ + *ethhdr = (struct ethhdr) { + .h_dest = TUN_HWADDR_DEST, + .h_source = TUN_HWADDR_SOURCE, + .h_proto = htons(proto) + }; +} + +static void tun_build_iphdr(void *dest, uint16_t len, uint8_t protocol) +{ + struct iphdr iphdr = { + .ihl = sizeof(iphdr) / 4, + .version = 4, + .tot_len = htons(sizeof(iphdr) + len), + .ttl = 255, + .protocol = protocol, + .saddr = TUN_IPADDR_SOURCE, + .daddr = TUN_IPADDR_DEST + }; + + iphdr.check = tun_build_ip_check(tun_sum(&iphdr, sizeof(iphdr))); + memcpy(dest, &iphdr, sizeof(iphdr)); +} + +static void tun_build_ipv6hdr(void *dest, uint16_t len, uint8_t protocol) +{ + struct ipv6hdr ipv6hdr = { + .version = 6, + .payload_len = htons(len), + .nexthdr = protocol, + .saddr = { + .s6_addr32 = { + htonl(0xffff0000), 0, 0, TUN_IPADDR_SOURCE + } + }, + .daddr = { + .s6_addr32 = { + htonl(0xffff0000), 0, 0, TUN_IPADDR_DEST + } + }, + }; + + memcpy(dest, &ipv6hdr, sizeof(ipv6hdr)); +} + +static void tun_build_tcphdr(void *dest, uint32_t sum) +{ + struct 
tcphdr tcphdr = { + .source = htons(9), + .dest = htons(9), + .fin = 1, + .doff = sizeof(tcphdr) / 4, + }; + uint32_t tcp_sum = tun_sum(&tcphdr, sizeof(tcphdr)); + + tcphdr.check = tun_build_ip_check(sum + tcp_sum); + memcpy(dest, &tcphdr, sizeof(tcphdr)); +} + +static void tun_build_udphdr(void *dest, uint32_t sum) +{ + struct udphdr udphdr = { + .source = htons(9), + .dest = htons(9), + .len = htons(sizeof(udphdr)), + }; + uint32_t udp_sum = tun_sum(&udphdr, sizeof(udphdr)); + + udphdr.check = tun_build_ip_check(sum + udp_sum); + memcpy(dest, &udphdr, sizeof(udphdr)); +} + +static bool tun_vnet_hash_check(int source_fd, const int *dest_fds, + const void *buffer, size_t len, + uint16_t report, uint32_t value) +{ + size_t read_len = sizeof(struct virtio_net_hdr_v1_hash) + len; + struct virtio_net_hdr_v1_hash *read_buffer; + int ret; + + if (write(source_fd, buffer, len) != len) { + perror("write"); + return false; + } + + read_buffer = malloc(read_len); + if (!read_buffer) { + perror("malloc"); + return false; + } + + ret = read(dest_fds[value & 1], read_buffer, read_len); + if (ret != read_len) { + perror("read"); + free(read_buffer); + return false; + } + + if (read_buffer->hash_value != htole32(value) || + read_buffer->hash_report != htole16(report) || + memcmp(read_buffer + 1, buffer, len)) { + free(read_buffer); + return false; + } + + free(read_buffer); + return true; +} + FIXTURE(tun) { char ifname[IFNAMSIZ]; @@ -112,10 +312,10 @@ FIXTURE_SETUP(tun) { memset(self->ifname, 0, sizeof(self->ifname)); - self->fd = tun_alloc(self->ifname); + self->fd = tun_alloc(self->ifname, 0); ASSERT_GE(self->fd, 0); - self->fd2 = tun_alloc(self->ifname); + self->fd2 = tun_alloc(self->ifname, 0); ASSERT_GE(self->fd2, 0); } @@ -159,4 +359,368 @@ TEST_F(tun, reattach_close_delete) { EXPECT_EQ(tun_delete(self->ifname), 0); } +FIXTURE(tun_vnet_hash) +{ + int local_fd; + int source_fd; + int dest_fds[2]; +}; + +FIXTURE_SETUP(tun_vnet_hash) +{ + static const struct { + struct 
tun_vnet_hash hdr; + uint16_t indirection_table[ARRAY_SIZE(self->dest_fds)]; + uint8_t key[40]; + } vnet_hash = { + .hdr = { + .flags = TUN_VNET_HASH_REPORT | TUN_VNET_HASH_RSS, + .types = VIRTIO_NET_RSS_HASH_TYPE_IPv4 | + VIRTIO_NET_RSS_HASH_TYPE_TCPv4 | + VIRTIO_NET_RSS_HASH_TYPE_UDPv4 | + VIRTIO_NET_RSS_HASH_TYPE_IPv6 | + VIRTIO_NET_RSS_HASH_TYPE_TCPv6 | + VIRTIO_NET_RSS_HASH_TYPE_UDPv6, + .indirection_table_mask = 1 + }, + .indirection_table = { 0, 1 }, + .key = { + 0x6d, 0x5a, 0x56, 0xda, 0x25, 0x5b, 0x0e, 0xc2, + 0x41, 0x67, 0x25, 0x3d, 0x43, 0xa3, 0x8f, 0xb0, + 0xd0, 0xca, 0x2b, 0xcb, 0xae, 0x7b, 0x30, 0xb4, + 0x77, 0xcb, 0x2d, 0xa3, 0x80, 0x30, 0xf2, 0x0c, + 0x6a, 0x42, 0xb7, 0x3b, 0xbe, 0xac, 0x01, 0xfa + } + }; + + struct { + struct virtio_net_hdr_v1_hash vnet_hdr; + struct ethhdr ethhdr; + struct arphdr arphdr; + unsigned char sender_hwaddr[6]; + uint32_t sender_ipaddr; + unsigned char target_hwaddr[6]; + uint32_t target_ipaddr; + } __packed packet = { + .ethhdr = { + .h_source = TUN_HWADDR_SOURCE, + .h_dest = { 0xff, 0xff, 0xff, 0xff, 0xff, 0xff }, + .h_proto = htons(ETH_P_ARP) + }, + .arphdr = { + .ar_hrd = htons(ARPHRD_ETHER), + .ar_pro = htons(ETH_P_IP), + .ar_hln = ETH_ALEN, + .ar_pln = 4, + .ar_op = htons(ARPOP_REQUEST) + }, + .sender_hwaddr = TUN_HWADDR_DEST, + .sender_ipaddr = TUN_IPADDR_DEST, + .target_ipaddr = TUN_IPADDR_DEST + }; + + char ifname[IFNAMSIZ]; + int i; + + self->local_fd = socket(AF_LOCAL, SOCK_STREAM, 0); + ASSERT_GE(self->local_fd, 0); + + ASSERT_EQ(ioctl(self->local_fd, SIOCBRADDBR, "xbridge"), 0); + ASSERT_TRUE(tun_set_flags(self->local_fd, "xbridge", IFF_UP)); + + ifname[0] = 0; + self->source_fd = tun_alloc(ifname, 0); + ASSERT_GE(self->source_fd, 0); + ASSERT_TRUE(tun_add_to_bridge(self->local_fd, ifname)); + ASSERT_TRUE(tun_set_flags(self->local_fd, ifname, IFF_UP)); + + ifname[0] = 0; + for (size_t i = 0; i < ARRAY_SIZE(self->dest_fds); i++) { + self->dest_fds[i] = tun_alloc(ifname, IFF_VNET_HDR); + 
ASSERT_GE(self->dest_fds[i], 0); + } + + ASSERT_TRUE(tun_add_to_bridge(self->local_fd, ifname)); + i = sizeof(struct virtio_net_hdr_v1_hash); + ASSERT_EQ(ioctl(self->dest_fds[0], TUNSETVNETHDRSZ, &i), 0); + i = 1; + ASSERT_EQ(ioctl(self->dest_fds[0], TUNSETVNETLE, &i), 0); + ASSERT_TRUE(tun_set_flags(self->local_fd, ifname, IFF_UP)); + ASSERT_EQ(write(self->dest_fds[0], &packet, sizeof(packet)), + sizeof(packet)); + ASSERT_EQ(ioctl(self->dest_fds[0], TUNSETVNETHASH, &vnet_hash), 0); +} + +FIXTURE_TEARDOWN(tun_vnet_hash) +{ + ASSERT_TRUE(tun_set_flags(self->local_fd, "xbridge", 0)); + EXPECT_EQ(ioctl(self->local_fd, SIOCBRDELBR, "xbridge"), 0); + EXPECT_EQ(close(self->source_fd), 0); + + for (size_t i = 0; i < ARRAY_SIZE(self->dest_fds); i++) + EXPECT_EQ(close(self->dest_fds[i]), 0); + + EXPECT_EQ(close(self->local_fd), 0); +} + +TEST_F(tun_vnet_hash, ipv4) +{ + struct { + struct ethhdr ethhdr; + struct iphdr iphdr; + } __packed packet; + + tun_build_ethhdr(&packet.ethhdr, ETH_P_IP); + tun_build_iphdr(&packet.iphdr, 0, 253); + + EXPECT_TRUE(tun_vnet_hash_check(self->source_fd, self->dest_fds, + &packet, sizeof(packet), + VIRTIO_NET_HASH_REPORT_IPv4, + 0x9246d590)); +} + +TEST_F(tun_vnet_hash, tcpv4) +{ + struct { + struct ethhdr ethhdr; + struct iphdr iphdr; + struct tcphdr tcphdr; + } __packed packet; + + tun_build_ethhdr(&packet.ethhdr, ETH_P_IP); + tun_build_iphdr(&packet.iphdr, sizeof(struct tcphdr), IPPROTO_TCP); + + tun_build_tcphdr(&packet.tcphdr, + tun_build_ip_pseudo_sum(&packet.iphdr)); + + EXPECT_TRUE(tun_vnet_hash_check(self->source_fd, self->dest_fds, + &packet, sizeof(packet), + VIRTIO_NET_HASH_REPORT_TCPv4, + 0xfad3f31a)); +} + +TEST_F(tun_vnet_hash, udpv4) +{ + struct { + struct ethhdr ethhdr; + struct iphdr iphdr; + struct udphdr udphdr; + } __packed packet; + + tun_build_ethhdr(&packet.ethhdr, ETH_P_IP); + tun_build_iphdr(&packet.iphdr, sizeof(struct udphdr), IPPROTO_UDP); + + tun_build_udphdr(&packet.udphdr, + 
tun_build_ip_pseudo_sum(&packet.iphdr)); + + EXPECT_TRUE(tun_vnet_hash_check(self->source_fd, self->dest_fds, + &packet, sizeof(packet), + VIRTIO_NET_HASH_REPORT_UDPv4, + 0xfad3f31a)); +} + +TEST_F(tun_vnet_hash, ipv6) +{ + struct { + struct ethhdr ethhdr; + struct ipv6hdr ipv6hdr; + } __packed packet; + + tun_build_ethhdr(&packet.ethhdr, ETH_P_IPV6); + tun_build_ipv6hdr(&packet.ipv6hdr, 0, 253); + + EXPECT_TRUE(tun_vnet_hash_check(self->source_fd, self->dest_fds, + &packet, sizeof(packet), + VIRTIO_NET_HASH_REPORT_IPv6, + 0x6b7835b3)); +} + +TEST_F(tun_vnet_hash, tcpv6) +{ + struct { + struct ethhdr ethhdr; + struct ipv6hdr ipv6hdr; + struct tcphdr tcphdr; + } __packed packet; + + tun_build_ethhdr(&packet.ethhdr, ETH_P_IPV6); + tun_build_ipv6hdr(&packet.ipv6hdr, sizeof(struct tcphdr), IPPROTO_TCP); + + tun_build_tcphdr(&packet.tcphdr, + tun_build_ipv6_pseudo_sum(&packet.ipv6hdr)); + + EXPECT_TRUE(tun_vnet_hash_check(self->source_fd, self->dest_fds, + &packet, sizeof(packet), + VIRTIO_NET_HASH_REPORT_TCPv6, + 0x6c6717)); +} + +TEST_F(tun_vnet_hash, udpv6) +{ + struct { + struct ethhdr ethhdr; + struct ipv6hdr ipv6hdr; + struct udphdr udphdr; + } __packed packet; + + tun_build_ethhdr(&packet.ethhdr, ETH_P_IPV6); + tun_build_ipv6hdr(&packet.ipv6hdr, sizeof(struct udphdr), IPPROTO_UDP); + + tun_build_udphdr(&packet.udphdr, + tun_build_ipv6_pseudo_sum(&packet.ipv6hdr)); + + EXPECT_TRUE(tun_vnet_hash_check(self->source_fd, self->dest_fds, + &packet, sizeof(packet), + VIRTIO_NET_HASH_REPORT_UDPv6, + 0x6c6717)); +} + +FIXTURE(tun_vnet_hash_config) +{ + int fd; +}; + +FIXTURE_SETUP(tun_vnet_hash_config) +{ + char ifname[IFNAMSIZ]; + + ifname[0] = 0; + self->fd = tun_alloc(ifname, 0); + ASSERT_GE(self->fd, 0); +} + +FIXTURE_TEARDOWN(tun_vnet_hash_config) +{ + EXPECT_EQ(close(self->fd), 0); +} + +TEST_F(tun_vnet_hash_config, cap) +{ + struct tun_vnet_hash_cap cap; + + ASSERT_EQ(ioctl(self->fd, TUNGETVNETHASHCAP, &cap), 0); + EXPECT_EQ(cap.max_indirection_table_length, 128); 
+ EXPECT_EQ(cap.types, + VIRTIO_NET_RSS_HASH_TYPE_IPv4 | + VIRTIO_NET_RSS_HASH_TYPE_TCPv4 | + VIRTIO_NET_RSS_HASH_TYPE_UDPv4 | + VIRTIO_NET_RSS_HASH_TYPE_IPv6 | + VIRTIO_NET_RSS_HASH_TYPE_TCPv6 | + VIRTIO_NET_RSS_HASH_TYPE_UDPv6); +} + +TEST_F(tun_vnet_hash_config, insufficient_hdr_sz) +{ + static const struct tun_vnet_hash vnet_hash = { + .flags = TUN_VNET_HASH_REPORT + }; + int i; + + i = 1; + ASSERT_EQ(ioctl(self->fd, TUNSETVNETLE, &i), 0); + + ASSERT_EQ(ioctl(self->fd, TUNSETVNETHASH, &vnet_hash), -1); + EXPECT_EQ(errno, EINVAL); +} + +TEST_F(tun_vnet_hash_config, shrink_hdr_sz) +{ + static const struct tun_vnet_hash vnet_hash = { + .flags = TUN_VNET_HASH_REPORT + }; + int i; + + i = sizeof(struct virtio_net_hdr_v1_hash); + ASSERT_EQ(ioctl(self->fd, TUNSETVNETHDRSZ, &i), 0); + + i = 1; + ASSERT_EQ(ioctl(self->fd, TUNSETVNETLE, &i), 0); + + ASSERT_EQ(ioctl(self->fd, TUNSETVNETHASH, &vnet_hash), 0); + + i = sizeof(struct virtio_net_hdr); + ASSERT_EQ(ioctl(self->fd, TUNSETVNETHDRSZ, &i), -1); + EXPECT_EQ(errno, EINVAL); +} + +TEST_F(tun_vnet_hash_config, too_big_indirection_table) +{ + static const struct tun_vnet_hash vnet_hash = { + .flags = TUN_VNET_HASH_REPORT, + .indirection_table_mask = 255 + }; + int i; + + i = sizeof(struct virtio_net_hdr_v1_hash); + ASSERT_EQ(ioctl(self->fd, TUNSETVNETHDRSZ, &i), 0); + + i = 1; + ASSERT_EQ(ioctl(self->fd, TUNSETVNETLE, &i), 0); + + ASSERT_EQ(ioctl(self->fd, TUNSETVNETHASH, &vnet_hash), -1); + EXPECT_EQ(errno, EINVAL); +} + +TEST_F(tun_vnet_hash_config, set_be_early) +{ + static const struct tun_vnet_hash vnet_hash = { + .flags = TUN_VNET_HASH_REPORT + }; + int i; + + i = 1; + if (ioctl(self->fd, TUNSETVNETBE, &i)) + return; + + i = sizeof(struct virtio_net_hdr_v1_hash); + ASSERT_EQ(ioctl(self->fd, TUNSETVNETHDRSZ, &i), 0); + + ASSERT_EQ(ioctl(self->fd, TUNSETVNETHASH, &vnet_hash), -1); + EXPECT_EQ(errno, EINVAL); +} + +TEST_F(tun_vnet_hash_config, set_be_later) +{ + static const struct tun_vnet_hash vnet_hash = { + .flags 
= TUN_VNET_HASH_REPORT + }; + int i; + + i = sizeof(struct virtio_net_hdr_v1_hash); + ASSERT_EQ(ioctl(self->fd, TUNSETVNETHDRSZ, &i), 0); + + if (ioctl(self->fd, TUNSETVNETHASH, &vnet_hash)) + return; + + i = 1; + ASSERT_EQ(ioctl(self->fd, TUNSETVNETBE, &i), -1); + EXPECT_EQ(errno, EINVAL); +} + +TEST_F(tun_vnet_hash_config, unset_le_later) +{ + static const struct tun_vnet_hash vnet_hash = { + .flags = TUN_VNET_HASH_REPORT + }; + int i; + + i = sizeof(struct virtio_net_hdr_v1_hash); + ASSERT_EQ(ioctl(self->fd, TUNSETVNETHDRSZ, &i), 0); + + i = 1; + ioctl(self->fd, TUNSETVNETBE, &i); + + if (!ioctl(self->fd, TUNSETVNETHASH, &vnet_hash)) + return; + + i = 1; + ASSERT_EQ(ioctl(self->fd, TUNSETVNETLE, &i), 0); + + ASSERT_EQ(ioctl(self->fd, TUNSETVNETHASH, &vnet_hash), 0); + + i = 0; + ASSERT_EQ(ioctl(self->fd, TUNSETVNETLE, &i), -1); + EXPECT_EQ(errno, EINVAL); +} + TEST_HARNESS_MAIN -- 2.42.0 ^ permalink raw reply related [flat|nested] 41+ messages in thread
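The tun_sum()/tun_build_ip_check() helpers in the selftest above implement the standard Internet ones-complement checksum, and a correctly checksummed header folds back to zero, which makes the helpers easy to self-check. A standalone sketch (function names here are illustrative):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Ones-complement sum over 16-bit words, carries accumulated in a
 * 32-bit register, mirroring the selftest's tun_sum().  Loads are
 * host-endian, as in the selftest. */
static uint32_t csum_partial(const void *buf, size_t len)
{
	const uint8_t *p = buf;
	uint32_t sum = 0;

	while (len > 1) {
		uint16_t word;

		memcpy(&word, p, sizeof(word));
		sum += word;
		p += 2;
		len -= 2;
	}
	if (len)
		sum += *p;	/* trailing odd byte */
	return sum;
}

/* Fold the carries back in and complement, as tun_build_ip_check()
 * does. */
static uint16_t csum_fold(uint32_t sum)
{
	return (uint16_t)~((sum & 0xffff) + (sum >> 16));
}
```

Because the ones-complement sum is byte-order independent up to a byte swap, re-summing a header that carries its own checksum folds to zero on either endianness.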
* Re: [RFC PATCH 6/7] selftest: tun: Add tests for virtio-net hashing 2023-10-08 5:20 ` [RFC PATCH 6/7] selftest: tun: Add tests for virtio-net hashing Akihiko Odaki @ 2023-10-20 1:52 ` kernel test robot 0 siblings, 0 replies; 41+ messages in thread From: kernel test robot @ 2023-10-20 1:52 UTC (permalink / raw) To: Akihiko Odaki Cc: oe-lkp, lkp, netdev, Willem de Bruijn, Jason Wang, Michael S. Tsirkin, linux-kernel, kvm, virtualization, linux-kselftest, bpf, davem, kuba, ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend, kpsingh, rdunlap, willemb, gustavoars, herbert, steffen.klassert, nogikh, pablo, decui, cai, jakub, elver, pabeni, xuanzhuo@linux.alibaba.com, shuah, Akihiko Odaki, oliver.sang Hi Akihiko Odaki, sorry for sending this again; the previous one had a problem that dropped most of the CC list. Hello, kernel test robot noticed "kernel-selftests.net.make.fail" on: commit: c04079dfb34c2f534f013408b12218c14b286b7d ("[RFC PATCH 6/7] selftest: tun: Add tests for virtio-net hashing") url: https://github.com/intel-lab-lkp/linux/commits/Akihiko-Odaki/net-skbuff-Add-tun_vnet_hash-flag/20231008-133245 base: https://git.kernel.org/cgit/linux/kernel/git/shuah/linux-kselftest.git next patch link: https://lore.kernel.org/all/20231008052101.144422-7-akihiko.odaki@daynix.com/ patch subject: [RFC PATCH 6/7] selftest: tun: Add tests for virtio-net hashing in testcase: kernel-selftests version: kernel-selftests-x86_64-60acb023-1_20230329 with the following parameters: group: net test: fcnal-test.sh atomic_test: ipv6_runtime compiler: gcc-12 test machine: 36 threads 1 socket Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz (Cascade Lake) with 32G memory (please refer to attached dmesg/kmsg for the entire log/backtrace) If you fix the issue in a separate patch/commit (i.e. 
not just a new version of the same patch/commit), kindly add following tags | Reported-by: kernel test robot <oliver.sang@intel.com> | Closes: https://lore.kernel.org/oe-lkp/202310192236.fde97031-oliver.sang@intel.com KERNEL SELFTESTS: linux_headers_dir is /usr/src/linux-headers-x86_64-rhel-8.3-kselftests-c04079dfb34c2f534f013408b12218c14b286b7d 2023-10-14 17:54:16 mount --bind /lib/modules/6.6.0-rc2-00023-gc04079dfb34c/kernel/lib /usr/src/perf_selftests-x86_64-rhel-8.3-kselftests-c04079dfb34c2f534f013408b12218c14b286b7d/lib make: Entering directory '/usr/src/perf_selftests-x86_64-rhel-8.3-kselftests-c04079dfb34c2f534f013408b12218c14b286b7d/tools/bpf/resolve_btfids' ... gcc -Wall -Wl,--no-as-needed -O2 -g -I../../../../usr/include/ -I../../../include/ -isystem /usr/src/perf_selftests-x86_64-rhel-8.3-kselftests-c04079dfb34c2f534f013408b12218c14b286b7d/usr/include -I../ txtimestamp.c -o /usr/src/perf_selftests-x86_64-rhel-8.3-kselftests-c04079dfb34c2f534f013408b12218c14b286b7d/tools/testing/selftests/net/txtimestamp reuseport_bpf.c: In function ‘attach_cbpf’: reuseport_bpf.c:133:28: error: array type has incomplete element type ‘struct sock_filter’ 133 | struct sock_filter code[] = { | ^~~~ reuseport_bpf.c:139:29: error: ‘BPF_A’ undeclared (first use in this function); did you mean ‘BPF_H’? 
  139 |                 { BPF_RET | BPF_A, 0, 0, 0 },
      |                             ^~~~~
      |                             BPF_H
reuseport_bpf.c:139:29: note: each undeclared identifier is reported only once for each function it appears in
reuseport_bpf.c:141:16: error: variable ‘p’ has initializer but incomplete type
  141 |         struct sock_fprog p = {
      |                ^~~~~~~~~~
reuseport_bpf.c:142:18: error: ‘struct sock_fprog’ has no member named ‘len’
  142 |                 .len = ARRAY_SIZE(code),
      |                  ^~~
In file included from reuseport_bpf.c:27:
../kselftest.h:56:25: warning: excess elements in struct initializer
   56 | #define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]))
      |                         ^
reuseport_bpf.c:142:24: note: in expansion of macro ‘ARRAY_SIZE’
  142 |                 .len = ARRAY_SIZE(code),
      |                        ^~~~~~~~~~
../kselftest.h:56:25: note: (near initialization for ‘p’)
   56 | #define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]))
      |                         ^
reuseport_bpf.c:142:24: note: in expansion of macro ‘ARRAY_SIZE’
  142 |                 .len = ARRAY_SIZE(code),
      |                        ^~~~~~~~~~
reuseport_bpf.c:143:18: error: ‘struct sock_fprog’ has no member named ‘filter’
  143 |                 .filter = code,
      |                  ^~~~~~
reuseport_bpf.c:143:27: warning: excess elements in struct initializer
  143 |                 .filter = code,
      |                           ^~~~
reuseport_bpf.c:143:27: note: (near initialization for ‘p’)
reuseport_bpf.c:141:27: error: storage size of ‘p’ isn’t known
  141 |         struct sock_fprog p = {
      |                           ^
reuseport_bpf.c:141:27: warning: unused variable ‘p’ [-Wunused-variable]
reuseport_bpf.c:133:28: warning: unused variable ‘code’ [-Wunused-variable]
  133 |         struct sock_filter code[] = {
      |                            ^~~~
reuseport_bpf.c: In function ‘test_filter_no_reuseport’:
reuseport_bpf.c:346:28: error: array type has incomplete element type ‘struct sock_filter’
  346 |         struct sock_filter ccode[] = {{ BPF_RET | BPF_A, 0, 0, 0 }};
      |                            ^~~~~
reuseport_bpf.c:346:51: error: ‘BPF_A’ undeclared (first use in this function); did you mean ‘BPF_H’?
  346 |         struct sock_filter ccode[] = {{ BPF_RET | BPF_A, 0, 0, 0 }};
      |                                                   ^~~~~
      |                                                   BPF_H
reuseport_bpf.c:348:27: error: storage size of ‘cprog’ isn’t known
  348 |         struct sock_fprog cprog;
      |                           ^~~~~
reuseport_bpf.c:348:27: warning: unused variable ‘cprog’ [-Wunused-variable]
reuseport_bpf.c:346:28: warning: unused variable ‘ccode’ [-Wunused-variable]
  346 |         struct sock_filter ccode[] = {{ BPF_RET | BPF_A, 0, 0, 0 }};
      |                            ^~~~~
make: *** [../lib.mk:181: /usr/src/perf_selftests-x86_64-rhel-8.3-kselftests-c04079dfb34c2f534f013408b12218c14b286b7d/tools/testing/selftests/net/reuseport_bpf] Error 1
make: *** Waiting for unfinished jobs....
reuseport_bpf_cpu.c: In function ‘attach_bpf’:
reuseport_bpf_cpu.c:79:28: error: array type has incomplete element type ‘struct sock_filter’
   79 |         struct sock_filter code[] = {
      |                            ^~~~
reuseport_bpf_cpu.c:81:52: error: ‘SKF_AD_OFF’ undeclared (first use in this function)
   81 |                 { BPF_LD | BPF_W | BPF_ABS, 0, 0, SKF_AD_OFF + SKF_AD_CPU },
      |                                                   ^~~~~~~~~~
reuseport_bpf_cpu.c:81:52: note: each undeclared identifier is reported only once for each function it appears in
reuseport_bpf_cpu.c:81:65: error: ‘SKF_AD_CPU’ undeclared (first use in this function)
   81 |                 { BPF_LD | BPF_W | BPF_ABS, 0, 0, SKF_AD_OFF + SKF_AD_CPU },
      |                                                                ^~~~~~~~~~
reuseport_bpf_cpu.c:83:29: error: ‘BPF_A’ undeclared (first use in this function); did you mean ‘BPF_H’?
   83 |                 { BPF_RET | BPF_A, 0, 0, 0 },
      |                             ^~~~~
      |                             BPF_H
reuseport_bpf_cpu.c:85:16: error: variable ‘p’ has initializer but incomplete type
   85 |         struct sock_fprog p = {
      |                ^~~~~~~~~~
reuseport_bpf_cpu.c:86:18: error: ‘struct sock_fprog’ has no member named ‘len’
   86 |                 .len = 2,
      |                  ^~~
reuseport_bpf_cpu.c:86:24: warning: excess elements in struct initializer
   86 |                 .len = 2,
      |                        ^
reuseport_bpf_cpu.c:86:24: note: (near initialization for ‘p’)
reuseport_bpf_cpu.c:87:18: error: ‘struct sock_fprog’ has no member named ‘filter’
   87 |                 .filter = code,
      |                  ^~~~~~
reuseport_bpf_cpu.c:87:27: warning: excess elements in struct initializer
   87 |                 .filter = code,
      |                           ^~~~
reuseport_bpf_cpu.c:87:27: note: (near initialization for ‘p’)
reuseport_bpf_cpu.c:85:27: error: storage size of ‘p’ isn’t known
   85 |         struct sock_fprog p = {
      |                           ^
reuseport_bpf_cpu.c:85:27: warning: unused variable ‘p’ [-Wunused-variable]
reuseport_bpf_cpu.c:79:28: warning: unused variable ‘code’ [-Wunused-variable]
   79 |         struct sock_filter code[] = {
      |                            ^~~~
make: *** [../lib.mk:181: /usr/src/perf_selftests-x86_64-rhel-8.3-kselftests-c04079dfb34c2f534f013408b12218c14b286b7d/tools/testing/selftests/net/reuseport_bpf_cpu] Error 1
In file included from psock_fanout.c:55:
psock_lib.h: In function ‘pair_udp_setfilter’:
psock_lib.h:52:28: error: array type has incomplete element type ‘struct sock_filter’
   52 |         struct sock_filter bpf_filter[] = {
      |                            ^~~~~~~~~~
psock_lib.h:65:27: error: storage size of ‘bpf_prog’ isn’t known
   65 |         struct sock_fprog bpf_prog;
      |                           ^~~~~~~~
psock_lib.h:65:27: warning: unused variable ‘bpf_prog’ [-Wunused-variable]
psock_lib.h:52:28: warning: unused variable ‘bpf_filter’ [-Wunused-variable]
   52 |         struct sock_filter bpf_filter[] = {
      |                            ^~~~~~~~~~
psock_fanout.c: In function ‘sock_fanout_set_cbpf’:
psock_fanout.c:114:28: error: array type has incomplete element type ‘struct sock_filter’
  114 |         struct sock_filter bpf_filter[] = {
      |                            ^~~~~~~~~~
psock_fanout.c:115:17: warning: implicit declaration of function ‘BPF_STMT’; did you mean ‘BPF_STX’? [-Wimplicit-function-declaration]
  115 |                 BPF_STMT(BPF_LD | BPF_B | BPF_ABS, 80),        /* ldb [80] */
      |                 ^~~~~~~~
      |                 BPF_STX
psock_fanout.c:116:36: error: ‘BPF_A’ undeclared (first use in this function); did you mean ‘BPF_X’?
  116 |                 BPF_STMT(BPF_RET | BPF_A, 0),                  /* ret A */
      |                                    ^~~~~
      |                                    BPF_X
psock_fanout.c:116:36: note: each undeclared identifier is reported only once for each function it appears in
psock_fanout.c:118:27: error: storage size of ‘bpf_prog’ isn’t known
  118 |         struct sock_fprog bpf_prog;
      |                           ^~~~~~~~
psock_fanout.c:118:27: warning: unused variable ‘bpf_prog’ [-Wunused-variable]
psock_fanout.c:114:28: warning: unused variable ‘bpf_filter’ [-Wunused-variable]
  114 |         struct sock_filter bpf_filter[] = {
      |                            ^~~~~~~~~~
make: *** [../lib.mk:181: /usr/src/perf_selftests-x86_64-rhel-8.3-kselftests-c04079dfb34c2f534f013408b12218c14b286b7d/tools/testing/selftests/net/psock_fanout] Error 1
In file included from psock_tpacket.c:47:
psock_lib.h: In function ‘pair_udp_setfilter’:
psock_lib.h:52:28: error: array type has incomplete element type ‘struct sock_filter’
   52 |         struct sock_filter bpf_filter[] = {
      |                            ^~~~~~~~~~
psock_lib.h:65:27: error: storage size of ‘bpf_prog’ isn’t known
   65 |         struct sock_fprog bpf_prog;
      |                           ^~~~~~~~
psock_lib.h:65:27: warning: unused variable ‘bpf_prog’ [-Wunused-variable]
psock_lib.h:52:28: warning: unused variable ‘bpf_filter’ [-Wunused-variable]
   52 |         struct sock_filter bpf_filter[] = {
      |                            ^~~~~~~~~~
make: *** [../lib.mk:181: /usr/src/perf_selftests-x86_64-rhel-8.3-kselftests-c04079dfb34c2f534f013408b12218c14b286b7d/tools/testing/selftests/net/psock_tpacket] Error 1
In file included from psock_snd.c:32:
psock_lib.h: In function ‘pair_udp_setfilter’:
psock_lib.h:52:28: error: array type has incomplete element type ‘struct sock_filter’
   52 |         struct sock_filter bpf_filter[] = {
      |                            ^~~~~~~~~~
psock_lib.h:65:27: error: storage size of ‘bpf_prog’ isn’t known
   65 |         struct sock_fprog bpf_prog;
      |                           ^~~~~~~~
psock_lib.h:65:27: warning: unused variable ‘bpf_prog’ [-Wunused-variable]
psock_lib.h:52:28: warning: unused variable ‘bpf_filter’ [-Wunused-variable]
   52 |         struct sock_filter bpf_filter[] = {
      |                            ^~~~~~~~~~
make: *** [../lib.mk:181: /usr/src/perf_selftests-x86_64-rhel-8.3-kselftests-c04079dfb34c2f534f013408b12218c14b286b7d/tools/testing/selftests/net/psock_snd] Error 1
make: Leaving directory '/usr/src/perf_selftests-x86_64-rhel-8.3-kselftests-c04079dfb34c2f534f013408b12218c14b286b7d/tools/testing/selftests/net'

The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20231019/202310192236.fde97031-oliver.sang@intel.com

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
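Every identifier the compiler rejects above (`struct sock_filter`, `struct sock_fprog`, `BPF_STMT`, `BPF_A`, `SKF_AD_OFF`, `SKF_AD_CPU`) is a classic-BPF UAPI definition, which suggests `<linux/filter.h>` (or a header that previously pulled it in) is no longer in scope in these selftests. As an illustrative sketch (not a proposed fix for the series), the two-instruction CPU-steering program that reuseport_bpf_cpu.c builds declares cleanly once that header is included:

```c
#include <assert.h>
#include <linux/filter.h>	/* struct sock_filter, struct sock_fprog,
				 * BPF_STMT(), BPF_A, SKF_AD_OFF, SKF_AD_CPU */

/* Classic BPF: load the CPU number the packet arrived on, return it.
 * This is the same program reuseport_bpf_cpu.c fails to build above;
 * the returned value selects the SO_REUSEPORT group member. */
static struct sock_filter cpu_code[] = {
	BPF_STMT(BPF_LD | BPF_W | BPF_ABS, SKF_AD_OFF + SKF_AD_CPU),
	BPF_STMT(BPF_RET | BPF_A, 0),
};

static struct sock_fprog cpu_prog = {
	.len = sizeof(cpu_code) / sizeof(cpu_code[0]),
	.filter = cpu_code,
};
```

Such a filter is typically attached with setsockopt(fd, SOL_SOCKET, SO_ATTACH_REUSEPORT_CBPF, &cpu_prog, sizeof(cpu_prog)).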
* [RFC PATCH 7/7] vhost_net: Support VIRTIO_NET_F_HASH_REPORT
  2023-10-08  5:20 [RFC PATCH 0/7] tun: Introduce virtio-net hashing feature Akihiko Odaki
                   ` (5 preceding siblings ...)
  2023-10-08  5:20 ` [RFC PATCH 6/7] selftest: tun: Add tests for virtio-net hashing Akihiko Odaki
@ 2023-10-08  5:20 ` Akihiko Odaki
  2023-10-08 18:36 ` [RFC PATCH 0/7] tun: Introduce virtio-net hashing feature Willem de Bruijn
  7 siblings, 0 replies; 41+ messages in thread
From: Akihiko Odaki @ 2023-10-08  5:20 UTC (permalink / raw)
  Cc: Willem de Bruijn, Jason Wang, Michael S. Tsirkin, Xuan Zhuo,
	Shuah Khan

VIRTIO_NET_F_HASH_REPORT allows reporting hash values calculated on
the host. When VHOST_NET_F_VIRTIO_NET_HDR is employed, it will report
no hash values (i.e., the hash_report member is always set to
VIRTIO_NET_HASH_REPORT_NONE). Otherwise, the values reported by the
underlying socket will be reported.

VIRTIO_NET_F_HASH_REPORT requires VIRTIO_F_VERSION_1.

Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
---
 drivers/vhost/net.c | 16 ++++++++++++----
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index f2ed7167c848..6a31d450fae2 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -73,6 +73,7 @@ enum {
 	VHOST_NET_FEATURES = VHOST_FEATURES |
 			     (1ULL << VHOST_NET_F_VIRTIO_NET_HDR) |
 			     (1ULL << VIRTIO_NET_F_MRG_RXBUF) |
+			     (1ULL << VIRTIO_NET_F_HASH_REPORT) |
 			     (1ULL << VIRTIO_F_ACCESS_PLATFORM) |
 			     (1ULL << VIRTIO_F_RING_RESET)
 };
@@ -1634,10 +1635,13 @@ static int vhost_net_set_features(struct vhost_net *n, u64 features)
 	size_t vhost_hlen, sock_hlen, hdr_len;
 	int i;
 
-	hdr_len = (features & ((1ULL << VIRTIO_NET_F_MRG_RXBUF) |
-			       (1ULL << VIRTIO_F_VERSION_1))) ?
-			sizeof(struct virtio_net_hdr_mrg_rxbuf) :
-			sizeof(struct virtio_net_hdr);
+	if (features & (1ULL << VIRTIO_NET_F_HASH_REPORT))
+		hdr_len = sizeof(struct virtio_net_hdr_v1_hash);
+	else if (features & ((1ULL << VIRTIO_NET_F_MRG_RXBUF) |
+			     (1ULL << VIRTIO_F_VERSION_1)))
+		hdr_len = sizeof(struct virtio_net_hdr_mrg_rxbuf);
+	else
+		hdr_len = sizeof(struct virtio_net_hdr);
 	if (features & (1 << VHOST_NET_F_VIRTIO_NET_HDR)) {
 		/* vhost provides vnet_hdr */
 		vhost_hlen = hdr_len;
@@ -1718,6 +1722,10 @@ static long vhost_net_ioctl(struct file *f, unsigned int ioctl,
 			return -EFAULT;
 		if (features & ~VHOST_NET_FEATURES)
 			return -EOPNOTSUPP;
+		if ((features & ((1ULL << VIRTIO_F_VERSION_1) |
+				 (1ULL << VIRTIO_NET_F_HASH_REPORT))) ==
+		    (1ULL << VIRTIO_NET_F_HASH_REPORT))
+			return -EINVAL;
 		return vhost_net_set_features(n, features);
 	case VHOST_GET_BACKEND_FEATURES:
 		features = VHOST_NET_BACKEND_FEATURES;
-- 
2.42.0
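The header-length selection and the ioctl-time validity check in the patch above can be mirrored in a small userspace sketch. The feature-bit positions and structure sizes below are hard-coded stand-ins taken from the virtio specification (the kernel gets them from the UAPI headers), so treat them as illustrative assumptions, not the patch itself:

```c
#include <stdint.h>
#include <stddef.h>

/* Assumed feature bit positions, per the virtio specification. */
#define VIRTIO_NET_F_MRG_RXBUF   15
#define VIRTIO_F_VERSION_1       32
#define VIRTIO_NET_F_HASH_REPORT 57

/* Mirrors the hdr_len selection in vhost_net_set_features():
 * HASH_REPORT implies the largest header, then MRG_RXBUF/VERSION_1,
 * then the legacy header.  Sizes are the on-the-wire byte counts of
 * virtio_net_hdr_v1_hash (20), virtio_net_hdr_mrg_rxbuf (12), and
 * virtio_net_hdr (10). */
static size_t hdr_len_for(uint64_t features)
{
	if (features & (1ULL << VIRTIO_NET_F_HASH_REPORT))
		return 20;
	if (features & ((1ULL << VIRTIO_NET_F_MRG_RXBUF) |
			(1ULL << VIRTIO_F_VERSION_1)))
		return 12;
	return 10;
}

/* Mirrors the VHOST_SET_FEATURES check: HASH_REPORT without
 * VERSION_1 is the one rejected combination (-EINVAL). */
static int features_valid(uint64_t features)
{
	uint64_t mask = (1ULL << VIRTIO_F_VERSION_1) |
			(1ULL << VIRTIO_NET_F_HASH_REPORT);

	return (features & mask) != (1ULL << VIRTIO_NET_F_HASH_REPORT);
}
```

The ordering matters: a device offering both HASH_REPORT and MRG_RXBUF must use the hashed header, which is why the HASH_REPORT test comes first.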
* Re: [RFC PATCH 0/7] tun: Introduce virtio-net hashing feature
  2023-10-08  5:20 [RFC PATCH 0/7] tun: Introduce virtio-net hashing feature Akihiko Odaki
                   ` (6 preceding siblings ...)
  2023-10-08  5:20 ` [RFC PATCH 7/7] vhost_net: Support VIRTIO_NET_F_HASH_REPORT Akihiko Odaki
@ 2023-10-08 18:36 ` Willem de Bruijn
  7 siblings, 0 replies; 41+ messages in thread
From: Willem de Bruijn @ 2023-10-08 18:36 UTC (permalink / raw)
  To: Akihiko Odaki
  Cc: Jason Wang, Michael S. Tsirkin, netdev, linux-kernel, kvm,
	virtualization, linux-kselftest, bpf, davem, kuba, ast, daniel,
	andrii, kafai, songliubraving, yhs, john.fastabend, kpsingh,
	rdunlap, willemb, gustavoars, herbert, steffen.klassert, nogikh,
	pablo, decui, cai, jakub, elver, pabeni, Yuri Benditovich

On Sun, Oct 8, 2023 at 7:21 AM Akihiko Odaki <akihiko.odaki@daynix.com> wrote:
>
> virtio-net has two uses for hashes: one is RSS and the other is hash
> reporting. Conventionally the hash calculation was done by the VMM.
> However, computing the hash after the queue has been chosen defeats the
> purpose of RSS.
>
> Another approach is to use an eBPF steering program. This approach has
> another downside: it cannot report the calculated hash due to the
> restrictive nature of eBPF.
>
> Introduce code to compute hashes in the kernel in order to overcome
> these challenges.
>
> An alternative solution is to extend the eBPF steering program so that it
> will be able to report to the userspace, but it makes little sense to
> allow implementing different hashing algorithms with eBPF since the hash
> value reported by virtio-net is strictly defined by the specification.

But using the existing BPF steering may have the benefit of requiring
a lot less new code. There is ample precedent for BPF programs that
work this way. The flow dissector comes to mind.
end of thread, other threads:[~2023-10-20  1:52 UTC | newest]

Thread overview: 41+ messages (download: mbox.gz / follow: Atom feed
-- links below jump to the message on this page --
2023-10-08  5:20 [RFC PATCH 0/7] tun: Introduce virtio-net hashing feature Akihiko Odaki
2023-10-08  5:20 ` [RFC PATCH 1/7] net: skbuff: Add tun_vnet_hash flag Akihiko Odaki
2023-10-08 18:39   ` Willem de Bruijn
2023-10-08 19:52     ` Akihiko Odaki
2023-10-08  5:20 ` [RFC PATCH 2/7] net/core: Ensure qdisc_skb_cb will not be overwritten Akihiko Odaki
2023-10-08  5:20 ` [RFC PATCH 3/7] net: sched: Add members to qdisc_skb_cb Akihiko Odaki
2023-10-08  5:20 ` [RFC PATCH 4/7] virtio_net: Add functions for hashing Akihiko Odaki
2023-10-08  5:20 ` [RFC PATCH 5/7] tun: Introduce virtio-net hashing feature Akihiko Odaki
2023-10-08 19:07   ` Willem de Bruijn
2023-10-08 20:04     ` Akihiko Odaki
2023-10-08 20:08       ` Willem de Bruijn
2023-10-08 20:46         ` Akihiko Odaki
2023-10-09  8:04           ` Willem de Bruijn
2023-10-09  8:57             ` Akihiko Odaki
2023-10-09  9:57               ` Willem de Bruijn
2023-10-09 10:01                 ` Akihiko Odaki
2023-10-09 10:06                   ` Willem de Bruijn
2023-10-09 10:12                     ` Akihiko Odaki
2023-10-09 10:44                       ` Willem de Bruijn
2023-10-10  1:52                         ` Akihiko Odaki
2023-10-10  5:45                           ` Jason Wang
2023-10-10  5:51                             ` Akihiko Odaki
2023-10-10  6:00                               ` Jason Wang
2023-10-10  6:19                                 ` Akihiko Odaki
2023-10-11  3:18                                   ` Jason Wang
2023-10-11  3:57                                     ` Akihiko Odaki
2023-10-09  8:13           ` Willem de Bruijn
2023-10-09  8:44             ` Akihiko Odaki
2023-10-09  9:54               ` Willem de Bruijn
2023-10-09 10:05                 ` Akihiko Odaki
2023-10-09 10:07                   ` Willem de Bruijn
2023-10-09 10:11                     ` Akihiko Odaki
2023-10-09 10:32                       ` Willem de Bruijn
2023-10-09 11:50                         ` Michael S. Tsirkin
2023-10-10  2:34                           ` Akihiko Odaki
2023-10-09 11:38           ` Michael S. Tsirkin
2023-10-10  2:32             ` Akihiko Odaki
2023-10-08  5:20 ` [RFC PATCH 6/7] selftest: tun: Add tests for virtio-net hashing Akihiko Odaki
2023-10-20  1:52   ` kernel test robot
2023-10-08  5:20 ` [RFC PATCH 7/7] vhost_net: Support VIRTIO_NET_F_HASH_REPORT Akihiko Odaki
2023-10-08 18:36 ` [RFC PATCH 0/7] tun: Introduce virtio-net hashing feature Willem de Bruijn