netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH net-next v12 00/10] tun: Introduce virtio-net hashing feature
@ 2025-05-30  4:50 Akihiko Odaki
  2025-05-30  4:50 ` [PATCH net-next v12 01/10] virtio_net: Add functions for hashing Akihiko Odaki
                   ` (10 more replies)
  0 siblings, 11 replies; 39+ messages in thread
From: Akihiko Odaki @ 2025-05-30  4:50 UTC (permalink / raw)
  To: Jonathan Corbet, Willem de Bruijn, Jason Wang, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Michael S. Tsirkin,
	Xuan Zhuo, Shuah Khan, linux-doc, linux-kernel, netdev, kvm,
	virtualization, linux-kselftest, Yuri Benditovich,
	Andrew Melnychenko, Stephen Hemminger, gur.stavi, Lei Yang,
	Simon Horman, Akihiko Odaki

NOTE: I'm leaving Daynix Computing Ltd., for which I worked on this
patch series, by the end of this month.

While net-next is closed, this is the last chance for me to send another
version so let me send the local changes now.

Please contact Yuri Benditovich, who is CCed on this email, for anything
about this series.

virtio-net have two usage of hashes: one is RSS and another is hash
reporting. Conventionally the hash calculation was done by the VMM.
However, computing the hash after the queue was chosen defeats the
purpose of RSS.

Another approach is to use eBPF steering program. This approach has
another downside: it cannot report the calculated hash due to the
restrictive nature of eBPF.

Introduce the code to compute hashes to the kernel in order to overcome
thse challenges.

An alternative solution is to extend the eBPF steering program so that it
will be able to report to the userspace, but it is based on context
rewrites, which is in feature freeze. We can adopt kfuncs, but they will
not be UAPIs. We opt to ioctl to align with other relevant UAPIs (KVM
and vhost_net).

The patches for QEMU to use this new feature was submitted as RFC and
is available at:
https://patchew.org/QEMU/20250530-hash-v5-0-343d7d7a8200@daynix.com/

This work was presented at LPC 2024:
https://lpc.events/event/18/contributions/1963/

V1 -> V2:
  Changed to introduce a new BPF program type.

Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
---
Changes in v12:
- Updated tools/testing/selftests/net/config.
- Split TUNSETVNETHASH.
- Link to v11: https://lore.kernel.org/r/20250317-rss-v11-0-4cacca92f31f@daynix.com

Changes in v11:
- Added the missing code to free vnet_hash in patch
  "tap: Introduce virtio-net hash feature".
- Link to v10: https://lore.kernel.org/r/20250313-rss-v10-0-3185d73a9af0@daynix.com

Changes in v10:
- Split common code and TUN/TAP-specific code into separate patches.
- Reverted a spurious style change in patch "tun: Introduce virtio-net
  hash feature".
- Added a comment explaining disable_ipv6 in tests.
- Used AF_PACKET for patch "selftest: tun: Add tests for
  virtio-net hashing". I also added the usage of FIXTURE_VARIANT() as
  the testing function now needs access to more variant-specific
  variables.
- Corrected the message of patch "selftest: tun: Add tests for
  virtio-net hashing"; it mentioned validation of configuration but
  it is not scope of this patch.
- Expanded the description of patch "selftest: tun: Add tests for
  virtio-net hashing".
- Added patch "tun: Allow steering eBPF program to fall back".
- Changed to handle TUNGETVNETHASHCAP before taking the rtnl lock.
- Removed redundant tests for tun_vnet_ioctl().
- Added patch "selftest: tap: Add tests for virtio-net ioctls".
- Added a design explanation of ioctls for extensibility and migration.
- Removed a few branches in patch
  "vhost/net: Support VIRTIO_NET_F_HASH_REPORT".
- Link to v9: https://lore.kernel.org/r/20250307-rss-v9-0-df76624025eb@daynix.com

Changes in v9:
- Added a missing return statement in patch
  "tun: Introduce virtio-net hash feature".
- Link to v8: https://lore.kernel.org/r/20250306-rss-v8-0-7ab4f56ff423@daynix.com

Changes in v8:
- Disabled IPv6 to eliminate noises in tests.
- Added a branch in tap to avoid unnecessary dissection when hash
  reporting is disabled.
- Removed unnecessary rtnl_lock().
- Extracted code to handle new ioctls into separate functions to avoid
  adding extra NULL checks to the code handling other ioctls.
- Introduced variable named "fd" to __tun_chr_ioctl().
- s/-/=/g in a patch message to avoid confusing Git.
- Link to v7: https://lore.kernel.org/r/20250228-rss-v7-0-844205cbbdd6@daynix.com

Changes in v7:
- Ensured to set hash_report to VIRTIO_NET_HASH_REPORT_NONE for
  VHOST_NET_F_VIRTIO_NET_HDR.
- s/4/sizeof(u32)/ in patch "virtio_net: Add functions for hashing".
- Added tap_skb_cb type.
- Rebased.
- Link to v6: https://lore.kernel.org/r/20250109-rss-v6-0-b1c90ad708f6@daynix.com

Changes in v6:
- Extracted changes to fill vnet header holes into another series.
- Squashed patches "skbuff: Introduce SKB_EXT_TUN_VNET_HASH", "tun:
  Introduce virtio-net hash reporting feature", and "tun: Introduce
  virtio-net RSS" into patch "tun: Introduce virtio-net hash feature".
- Dropped the RFC tag.
- Link to v5: https://lore.kernel.org/r/20241008-rss-v5-0-f3cf68df005d@daynix.com

Changes in v5:
- Fixed a compilation error with CONFIG_TUN_VNET_CROSS_LE.
- Optimized the calculation of the hash value according to:
  https://git.dpdk.org/dpdk/commit/?id=3fb1ea032bd6ff8317af5dac9af901f1f324cab4
- Added patch "tun: Unify vnet implementation".
- Dropped patch "tap: Pad virtio header with zero".
- Added patch "selftest: tun: Test vnet ioctls without device".
- Reworked selftests to skip for older kernels.
- Documented the case when the underlying device is deleted and packets
  have queue_mapping set by TC.
- Reordered test harness arguments.
- Added code to handle fragmented packets.
- Link to v4: https://lore.kernel.org/r/20240924-rss-v4-0-84e932ec0e6c@daynix.com

Changes in v4:
- Moved tun_vnet_hash_ext to if_tun.h.
- Renamed virtio_net_toeplitz() to virtio_net_toeplitz_calc().
- Replaced htons() with cpu_to_be16().
- Changed virtio_net_hash_rss() to return void.
- Reordered variable declarations in virtio_net_hash_rss().
- Removed virtio_net_hdr_v1_hash_from_skb().
- Updated messages of "tap: Pad virtio header with zero" and
  "tun: Pad virtio header with zero".
- Fixed vnet_hash allocation size.
- Ensured to free vnet_hash when destructing tun_struct.
- Link to v3: https://lore.kernel.org/r/20240915-rss-v3-0-c630015db082@daynix.com

Changes in v3:
- Reverted back to add ioctl.
- Split patch "tun: Introduce virtio-net hashing feature" into
  "tun: Introduce virtio-net hash reporting feature" and
  "tun: Introduce virtio-net RSS".
- Changed to reuse hash values computed for automq instead of performing
  RSS hashing when hash reporting is requested but RSS is not.
- Extracted relevant data from struct tun_struct to keep it minimal.
- Added kernel-doc.
- Changed to allow calling TUNGETVNETHASHCAP before TUNSETIFF.
- Initialized num_buffers with 1.
- Added a test case for unclassified packets.
- Fixed error handling in tests.
- Changed tests to verify that the queue index will not overflow.
- Rebased.
- Link to v2: https://lore.kernel.org/r/20231015141644.260646-1-akihiko.odaki@daynix.com

---
Akihiko Odaki (10):
      virtio_net: Add functions for hashing
      net: flow_dissector: Export flow_keys_dissector_symmetric
      tun: Allow steering eBPF program to fall back
      tun: Add common virtio-net hash feature code
      tun: Introduce virtio-net hash feature
      tap: Introduce virtio-net hash feature
      selftest: tun: Test vnet ioctls without device
      selftest: tun: Add tests for virtio-net hashing
      selftest: tap: Add tests for virtio-net ioctls
      vhost/net: Support VIRTIO_NET_F_HASH_REPORT

 Documentation/networking/tuntap.rst  |   7 +
 drivers/net/Kconfig                  |   1 +
 drivers/net/ipvlan/ipvtap.c          |   2 +-
 drivers/net/macvtap.c                |   2 +-
 drivers/net/tap.c                    |  80 +++++-
 drivers/net/tun.c                    |  92 +++++--
 drivers/net/tun_vnet.h               | 165 +++++++++++-
 drivers/vhost/net.c                  |  68 ++---
 include/linux/if_tap.h               |   4 +-
 include/linux/skbuff.h               |   3 +
 include/linux/virtio_net.h           | 188 ++++++++++++++
 include/net/flow_dissector.h         |   1 +
 include/uapi/linux/if_tun.h          |  80 ++++++
 net/core/flow_dissector.c            |   3 +-
 net/core/skbuff.c                    |   4 +
 tools/testing/selftests/net/Makefile |   2 +-
 tools/testing/selftests/net/config   |   1 +
 tools/testing/selftests/net/tap.c    | 131 +++++++++-
 tools/testing/selftests/net/tun.c    | 485 ++++++++++++++++++++++++++++++++++-
 19 files changed, 1234 insertions(+), 85 deletions(-)
---
base-commit: 5cb8274d66c611b7889565c418a8158517810f9b
change-id: 20240403-rss-e737d89efa77

Best regards,
-- 
Akihiko Odaki <akihiko.odaki@daynix.com>


^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH net-next v12 01/10] virtio_net: Add functions for hashing
  2025-05-30  4:50 [PATCH net-next v12 00/10] tun: Introduce virtio-net hashing feature Akihiko Odaki
@ 2025-05-30  4:50 ` Akihiko Odaki
  2025-06-03  3:19   ` Jason Wang
  2025-05-30  4:50 ` [PATCH net-next v12 02/10] net: flow_dissector: Export flow_keys_dissector_symmetric Akihiko Odaki
                   ` (9 subsequent siblings)
  10 siblings, 1 reply; 39+ messages in thread
From: Akihiko Odaki @ 2025-05-30  4:50 UTC (permalink / raw)
  To: Jonathan Corbet, Willem de Bruijn, Jason Wang, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Michael S. Tsirkin,
	Xuan Zhuo, Shuah Khan, linux-doc, linux-kernel, netdev, kvm,
	virtualization, linux-kselftest, Yuri Benditovich,
	Andrew Melnychenko, Stephen Hemminger, gur.stavi, Lei Yang,
	Simon Horman, Akihiko Odaki

They are useful to implement VIRTIO_NET_F_RSS and
VIRTIO_NET_F_HASH_REPORT.

Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
Tested-by: Lei Yang <leiyang@redhat.com>
---
 include/linux/virtio_net.h | 188 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 188 insertions(+)

diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h
index 02a9f4dc594d..426f33b4b824 100644
--- a/include/linux/virtio_net.h
+++ b/include/linux/virtio_net.h
@@ -9,6 +9,194 @@
 #include <uapi/linux/tcp.h>
 #include <uapi/linux/virtio_net.h>
 
+struct virtio_net_hash {
+	u32 value;
+	u16 report;
+};
+
+struct virtio_net_toeplitz_state {
+	u32 hash;
+	const u32 *key;
+};
+
+#define VIRTIO_NET_SUPPORTED_HASH_TYPES (VIRTIO_NET_RSS_HASH_TYPE_IPv4 | \
+					 VIRTIO_NET_RSS_HASH_TYPE_TCPv4 | \
+					 VIRTIO_NET_RSS_HASH_TYPE_UDPv4 | \
+					 VIRTIO_NET_RSS_HASH_TYPE_IPv6 | \
+					 VIRTIO_NET_RSS_HASH_TYPE_TCPv6 | \
+					 VIRTIO_NET_RSS_HASH_TYPE_UDPv6)
+
+#define VIRTIO_NET_RSS_MAX_KEY_SIZE 40
+
+static inline void virtio_net_toeplitz_convert_key(u32 *input, size_t len)
+{
+	while (len >= sizeof(*input)) {
+		*input = be32_to_cpu((__force __be32)*input);
+		input++;
+		len -= sizeof(*input);
+	}
+}
+
+static inline void virtio_net_toeplitz_calc(struct virtio_net_toeplitz_state *state,
+					    const __be32 *input, size_t len)
+{
+	while (len >= sizeof(*input)) {
+		for (u32 map = be32_to_cpu(*input); map; map &= (map - 1)) {
+			u32 i = ffs(map);
+
+			state->hash ^= state->key[0] << (32 - i) |
+				       (u32)((u64)state->key[1] >> i);
+		}
+
+		state->key++;
+		input++;
+		len -= sizeof(*input);
+	}
+}
+
+static inline u8 virtio_net_hash_key_length(u32 types)
+{
+	size_t len = 0;
+
+	if (types & VIRTIO_NET_HASH_REPORT_IPv4)
+		len = max(len,
+			  sizeof(struct flow_dissector_key_ipv4_addrs));
+
+	if (types &
+	    (VIRTIO_NET_HASH_REPORT_TCPv4 | VIRTIO_NET_HASH_REPORT_UDPv4))
+		len = max(len,
+			  sizeof(struct flow_dissector_key_ipv4_addrs) +
+			  sizeof(struct flow_dissector_key_ports));
+
+	if (types & VIRTIO_NET_HASH_REPORT_IPv6)
+		len = max(len,
+			  sizeof(struct flow_dissector_key_ipv6_addrs));
+
+	if (types &
+	    (VIRTIO_NET_HASH_REPORT_TCPv6 | VIRTIO_NET_HASH_REPORT_UDPv6))
+		len = max(len,
+			  sizeof(struct flow_dissector_key_ipv6_addrs) +
+			  sizeof(struct flow_dissector_key_ports));
+
+	return len + sizeof(u32);
+}
+
+static inline u32 virtio_net_hash_report(u32 types,
+					 const struct flow_keys_basic *keys)
+{
+	switch (keys->basic.n_proto) {
+	case cpu_to_be16(ETH_P_IP):
+		if (!(keys->control.flags & FLOW_DIS_IS_FRAGMENT)) {
+			if (keys->basic.ip_proto == IPPROTO_TCP &&
+			    (types & VIRTIO_NET_RSS_HASH_TYPE_TCPv4))
+				return VIRTIO_NET_HASH_REPORT_TCPv4;
+
+			if (keys->basic.ip_proto == IPPROTO_UDP &&
+			    (types & VIRTIO_NET_RSS_HASH_TYPE_UDPv4))
+				return VIRTIO_NET_HASH_REPORT_UDPv4;
+		}
+
+		if (types & VIRTIO_NET_RSS_HASH_TYPE_IPv4)
+			return VIRTIO_NET_HASH_REPORT_IPv4;
+
+		return VIRTIO_NET_HASH_REPORT_NONE;
+
+	case cpu_to_be16(ETH_P_IPV6):
+		if (!(keys->control.flags & FLOW_DIS_IS_FRAGMENT)) {
+			if (keys->basic.ip_proto == IPPROTO_TCP &&
+			    (types & VIRTIO_NET_RSS_HASH_TYPE_TCPv6))
+				return VIRTIO_NET_HASH_REPORT_TCPv6;
+
+			if (keys->basic.ip_proto == IPPROTO_UDP &&
+			    (types & VIRTIO_NET_RSS_HASH_TYPE_UDPv6))
+				return VIRTIO_NET_HASH_REPORT_UDPv6;
+		}
+
+		if (types & VIRTIO_NET_RSS_HASH_TYPE_IPv6)
+			return VIRTIO_NET_HASH_REPORT_IPv6;
+
+		return VIRTIO_NET_HASH_REPORT_NONE;
+
+	default:
+		return VIRTIO_NET_HASH_REPORT_NONE;
+	}
+}
+
+static inline void virtio_net_hash_rss(const struct sk_buff *skb,
+				       u32 types, const u32 *key,
+				       struct virtio_net_hash *hash)
+{
+	struct virtio_net_toeplitz_state toeplitz_state = { .key = key };
+	struct flow_keys flow;
+	struct flow_keys_basic flow_basic;
+	u16 report;
+
+	if (!skb_flow_dissect_flow_keys(skb, &flow, 0)) {
+		hash->report = VIRTIO_NET_HASH_REPORT_NONE;
+		return;
+	}
+
+	flow_basic = (struct flow_keys_basic) {
+		.control = flow.control,
+		.basic = flow.basic
+	};
+
+	report = virtio_net_hash_report(types, &flow_basic);
+
+	switch (report) {
+	case VIRTIO_NET_HASH_REPORT_IPv4:
+		virtio_net_toeplitz_calc(&toeplitz_state,
+					 (__be32 *)&flow.addrs.v4addrs,
+					 sizeof(flow.addrs.v4addrs));
+		break;
+
+	case VIRTIO_NET_HASH_REPORT_TCPv4:
+		virtio_net_toeplitz_calc(&toeplitz_state,
+					 (__be32 *)&flow.addrs.v4addrs,
+					 sizeof(flow.addrs.v4addrs));
+		virtio_net_toeplitz_calc(&toeplitz_state, &flow.ports.ports,
+					 sizeof(flow.ports.ports));
+		break;
+
+	case VIRTIO_NET_HASH_REPORT_UDPv4:
+		virtio_net_toeplitz_calc(&toeplitz_state,
+					 (__be32 *)&flow.addrs.v4addrs,
+					 sizeof(flow.addrs.v4addrs));
+		virtio_net_toeplitz_calc(&toeplitz_state, &flow.ports.ports,
+					 sizeof(flow.ports.ports));
+		break;
+
+	case VIRTIO_NET_HASH_REPORT_IPv6:
+		virtio_net_toeplitz_calc(&toeplitz_state,
+					 (__be32 *)&flow.addrs.v6addrs,
+					 sizeof(flow.addrs.v6addrs));
+		break;
+
+	case VIRTIO_NET_HASH_REPORT_TCPv6:
+		virtio_net_toeplitz_calc(&toeplitz_state,
+					 (__be32 *)&flow.addrs.v6addrs,
+					 sizeof(flow.addrs.v6addrs));
+		virtio_net_toeplitz_calc(&toeplitz_state, &flow.ports.ports,
+					 sizeof(flow.ports.ports));
+		break;
+
+	case VIRTIO_NET_HASH_REPORT_UDPv6:
+		virtio_net_toeplitz_calc(&toeplitz_state,
+					 (__be32 *)&flow.addrs.v6addrs,
+					 sizeof(flow.addrs.v6addrs));
+		virtio_net_toeplitz_calc(&toeplitz_state, &flow.ports.ports,
+					 sizeof(flow.ports.ports));
+		break;
+
+	default:
+		hash->report = VIRTIO_NET_HASH_REPORT_NONE;
+		return;
+	}
+
+	hash->value = toeplitz_state.hash;
+	hash->report = report;
+}
+
 static inline bool virtio_net_hdr_match_proto(__be16 protocol, __u8 gso_type)
 {
 	switch (gso_type & ~VIRTIO_NET_HDR_GSO_ECN) {

-- 
2.49.0


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH net-next v12 02/10] net: flow_dissector: Export flow_keys_dissector_symmetric
  2025-05-30  4:50 [PATCH net-next v12 00/10] tun: Introduce virtio-net hashing feature Akihiko Odaki
  2025-05-30  4:50 ` [PATCH net-next v12 01/10] virtio_net: Add functions for hashing Akihiko Odaki
@ 2025-05-30  4:50 ` Akihiko Odaki
  2025-06-03  3:20   ` Jason Wang
  2025-05-30  4:50 ` [PATCH net-next v12 03/10] tun: Allow steering eBPF program to fall back Akihiko Odaki
                   ` (8 subsequent siblings)
  10 siblings, 1 reply; 39+ messages in thread
From: Akihiko Odaki @ 2025-05-30  4:50 UTC (permalink / raw)
  To: Jonathan Corbet, Willem de Bruijn, Jason Wang, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Michael S. Tsirkin,
	Xuan Zhuo, Shuah Khan, linux-doc, linux-kernel, netdev, kvm,
	virtualization, linux-kselftest, Yuri Benditovich,
	Andrew Melnychenko, Stephen Hemminger, gur.stavi, Lei Yang,
	Simon Horman, Akihiko Odaki

flow_keys_dissector_symmetric is useful to derive a symmetric hash
and to know its source such as IPv4, IPv6, TCP, and UDP.

Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
Tested-by: Lei Yang <leiyang@redhat.com>
---
 include/net/flow_dissector.h | 1 +
 net/core/flow_dissector.c    | 3 ++-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/net/flow_dissector.h b/include/net/flow_dissector.h
index ced79dc8e856..d01c1ec77b7d 100644
--- a/include/net/flow_dissector.h
+++ b/include/net/flow_dissector.h
@@ -423,6 +423,7 @@ __be32 flow_get_u32_src(const struct flow_keys *flow);
 __be32 flow_get_u32_dst(const struct flow_keys *flow);
 
 extern struct flow_dissector flow_keys_dissector;
+extern struct flow_dissector flow_keys_dissector_symmetric;
 extern struct flow_dissector flow_keys_basic_dissector;
 
 /* struct flow_keys_digest:
diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
index 9cd8de6bebb5..32c7ee31330c 100644
--- a/net/core/flow_dissector.c
+++ b/net/core/flow_dissector.c
@@ -1862,7 +1862,8 @@ void make_flow_keys_digest(struct flow_keys_digest *digest,
 }
 EXPORT_SYMBOL(make_flow_keys_digest);
 
-static struct flow_dissector flow_keys_dissector_symmetric __read_mostly;
+struct flow_dissector flow_keys_dissector_symmetric __read_mostly;
+EXPORT_SYMBOL(flow_keys_dissector_symmetric);
 
 u32 __skb_get_hash_symmetric_net(const struct net *net, const struct sk_buff *skb)
 {

-- 
2.49.0


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH net-next v12 03/10] tun: Allow steering eBPF program to fall back
  2025-05-30  4:50 [PATCH net-next v12 00/10] tun: Introduce virtio-net hashing feature Akihiko Odaki
  2025-05-30  4:50 ` [PATCH net-next v12 01/10] virtio_net: Add functions for hashing Akihiko Odaki
  2025-05-30  4:50 ` [PATCH net-next v12 02/10] net: flow_dissector: Export flow_keys_dissector_symmetric Akihiko Odaki
@ 2025-05-30  4:50 ` Akihiko Odaki
  2025-06-04  1:27   ` Jason Wang
  2025-05-30  4:50 ` [PATCH net-next v12 04/10] tun: Add common virtio-net hash feature code Akihiko Odaki
                   ` (7 subsequent siblings)
  10 siblings, 1 reply; 39+ messages in thread
From: Akihiko Odaki @ 2025-05-30  4:50 UTC (permalink / raw)
  To: Jonathan Corbet, Willem de Bruijn, Jason Wang, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Michael S. Tsirkin,
	Xuan Zhuo, Shuah Khan, linux-doc, linux-kernel, netdev, kvm,
	virtualization, linux-kselftest, Yuri Benditovich,
	Andrew Melnychenko, Stephen Hemminger, gur.stavi, Lei Yang,
	Simon Horman, Akihiko Odaki

This clarifies a steering eBPF program takes precedence over the other
steering algorithms.

Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
---
 Documentation/networking/tuntap.rst |  7 +++++++
 drivers/net/tun.c                   | 28 +++++++++++++++++-----------
 include/uapi/linux/if_tun.h         |  9 +++++++++
 3 files changed, 33 insertions(+), 11 deletions(-)

diff --git a/Documentation/networking/tuntap.rst b/Documentation/networking/tuntap.rst
index 4d7087f727be..86b4ae8caa8a 100644
--- a/Documentation/networking/tuntap.rst
+++ b/Documentation/networking/tuntap.rst
@@ -206,6 +206,13 @@ enable is true we enable it, otherwise we disable it::
       return ioctl(fd, TUNSETQUEUE, (void *)&ifr);
   }
 
+3.4 Reference
+-------------
+
+``linux/if_tun.h`` defines the interface described below:
+
+.. kernel-doc:: include/uapi/linux/if_tun.h
+
 Universal TUN/TAP device driver Frequently Asked Question
 =========================================================
 
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index d8f4d3e996a7..9133ab9ed3f5 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -476,21 +476,29 @@ static u16 tun_automq_select_queue(struct tun_struct *tun, struct sk_buff *skb)
 	return txq;
 }
 
-static u16 tun_ebpf_select_queue(struct tun_struct *tun, struct sk_buff *skb)
+static bool tun_ebpf_select_queue(struct tun_struct *tun, struct sk_buff *skb,
+				  u16 *ret)
 {
 	struct tun_prog *prog;
 	u32 numqueues;
-	u16 ret = 0;
+	u32 prog_ret;
+
+	prog = rcu_dereference(tun->steering_prog);
+	if (!prog)
+		return false;
 
 	numqueues = READ_ONCE(tun->numqueues);
-	if (!numqueues)
-		return 0;
+	if (!numqueues) {
+		*ret = 0;
+		return true;
+	}
 
-	prog = rcu_dereference(tun->steering_prog);
-	if (prog)
-		ret = bpf_prog_run_clear_cb(prog->prog, skb);
+	prog_ret = bpf_prog_run_clear_cb(prog->prog, skb);
+	if (prog_ret == TUN_STEERINGEBPF_FALLBACK)
+		return false;
 
-	return ret % numqueues;
+	*ret = (u16)prog_ret % numqueues;
+	return true;
 }
 
 static u16 tun_select_queue(struct net_device *dev, struct sk_buff *skb,
@@ -500,9 +508,7 @@ static u16 tun_select_queue(struct net_device *dev, struct sk_buff *skb,
 	u16 ret;
 
 	rcu_read_lock();
-	if (rcu_dereference(tun->steering_prog))
-		ret = tun_ebpf_select_queue(tun, skb);
-	else
+	if (!tun_ebpf_select_queue(tun, skb, &ret))
 		ret = tun_automq_select_queue(tun, skb);
 	rcu_read_unlock();
 
diff --git a/include/uapi/linux/if_tun.h b/include/uapi/linux/if_tun.h
index 287cdc81c939..980de74724fc 100644
--- a/include/uapi/linux/if_tun.h
+++ b/include/uapi/linux/if_tun.h
@@ -115,4 +115,13 @@ struct tun_filter {
 	__u8   addr[][ETH_ALEN];
 };
 
+/**
+ * define TUN_STEERINGEBPF_FALLBACK - A steering eBPF return value to fall back
+ *
+ * A steering eBPF program may return this value to fall back to the steering
+ * algorithm that should have been used if the program was not set. This allows
+ * selectively overriding the steering decision.
+ */
+#define TUN_STEERINGEBPF_FALLBACK -1
+
 #endif /* _UAPI__IF_TUN_H */

-- 
2.49.0


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH net-next v12 04/10] tun: Add common virtio-net hash feature code
  2025-05-30  4:50 [PATCH net-next v12 00/10] tun: Introduce virtio-net hashing feature Akihiko Odaki
                   ` (2 preceding siblings ...)
  2025-05-30  4:50 ` [PATCH net-next v12 03/10] tun: Allow steering eBPF program to fall back Akihiko Odaki
@ 2025-05-30  4:50 ` Akihiko Odaki
  2025-06-04  1:53   ` Jason Wang
  2025-05-30  4:50 ` [PATCH net-next v12 05/10] tun: Introduce virtio-net hash feature Akihiko Odaki
                   ` (6 subsequent siblings)
  10 siblings, 1 reply; 39+ messages in thread
From: Akihiko Odaki @ 2025-05-30  4:50 UTC (permalink / raw)
  To: Jonathan Corbet, Willem de Bruijn, Jason Wang, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Michael S. Tsirkin,
	Xuan Zhuo, Shuah Khan, linux-doc, linux-kernel, netdev, kvm,
	virtualization, linux-kselftest, Yuri Benditovich,
	Andrew Melnychenko, Stephen Hemminger, gur.stavi, Lei Yang,
	Simon Horman, Akihiko Odaki

Add common code required for the features being added to TUN and TAP.
They will be enabled for each of them in following patches.

Added Features
==============

Hash reporting
--------------

Allow the guest to reuse the hash value to make receive steering
consistent between the host and guest, and to save hash computation.

Receive Side Scaling (RSS)
--------------------------

RSS is a receive steering algorithm that can be negotiated to use with
virtio_net. Conventionally the hash calculation was done by the VMM.
However, computing the hash after the queue was chosen defeats the
purpose of RSS.

Another approach is to use eBPF steering program. This approach has
another downside: it cannot report the calculated hash due to the
restrictive nature of eBPF steering program.

Introduce the code to perform RSS to the kernel in order to overcome
thse challenges. An alternative solution is to extend the eBPF steering
program so that it will be able to report to the userspace, but I didn't
opt for it because extending the current mechanism of eBPF steering
program as is because it relies on legacy context rewriting, and
introducing kfunc-based eBPF will result in non-UAPI dependency while
the other relevant virtualization APIs such as KVM and vhost_net are
UAPIs.

Added ioctls
============

They are designed to make extensibility and VM migration compatible.
This change only adds the implementation and does not expose them to
the userspace.

TUNGETVNETHASHTYPES
-------------------

This ioctl tells supported hash types. It is useful to check if a VM can
be migrated to the current host.

TUNSETVNETREPORTINGAUTOMQ, TUNSETVNETREPORTINGRSS, and TUNSETVNETRSS
--------------------------------------------------------------------

These ioctls configures a steering algorithm and, if needed, hash
reporting.

Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
Tested-by: Lei Yang <leiyang@redhat.com>
---
 drivers/net/tap.c           |  10 ++-
 drivers/net/tun.c           |  12 +++-
 drivers/net/tun_vnet.h      | 165 +++++++++++++++++++++++++++++++++++++++++---
 include/uapi/linux/if_tun.h |  71 +++++++++++++++++++
 4 files changed, 244 insertions(+), 14 deletions(-)

diff --git a/drivers/net/tap.c b/drivers/net/tap.c
index d4ece538f1b2..25c60ff2d3f2 100644
--- a/drivers/net/tap.c
+++ b/drivers/net/tap.c
@@ -179,6 +179,11 @@ static void tap_put_queue(struct tap_queue *q)
 	sock_put(&q->sk);
 }
 
+static const struct virtio_net_hash *tap_find_hash(const struct sk_buff *skb)
+{
+	return NULL;
+}
+
 /*
  * Select a queue based on the rxq of the device on which this packet
  * arrived. If the incoming device is not mq, calculate a flow hash
@@ -711,11 +716,12 @@ static ssize_t tap_put_user(struct tap_queue *q,
 	int total;
 
 	if (q->flags & IFF_VNET_HDR) {
-		struct virtio_net_hdr vnet_hdr;
+		struct virtio_net_hdr_v1_hash vnet_hdr;
 
 		vnet_hdr_len = READ_ONCE(q->vnet_hdr_sz);
 
-		ret = tun_vnet_hdr_from_skb(q->flags, NULL, skb, &vnet_hdr);
+		ret = tun_vnet_hdr_from_skb(vnet_hdr_len, q->flags, NULL, skb,
+					    tap_find_hash, &vnet_hdr);
 		if (ret)
 			return ret;
 
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 9133ab9ed3f5..03d47799e9bd 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -451,6 +451,11 @@ static inline void tun_flow_save_rps_rxhash(struct tun_flow_entry *e, u32 hash)
 		e->rps_rxhash = hash;
 }
 
+static const struct virtio_net_hash *tun_find_hash(const struct sk_buff *skb)
+{
+	return NULL;
+}
+
 /* We try to identify a flow through its rxhash. The reason that
  * we do not check rxq no. is because some cards(e.g 82599), chooses
  * the rxq based on the txq where the last packet of the flow comes. As
@@ -1993,7 +1998,7 @@ static ssize_t tun_put_user_xdp(struct tun_struct *tun,
 	ssize_t ret;
 
 	if (tun->flags & IFF_VNET_HDR) {
-		struct virtio_net_hdr gso = { 0 };
+		struct virtio_net_hdr_v1_hash gso = { 0 };
 
 		vnet_hdr_sz = READ_ONCE(tun->vnet_hdr_sz);
 		ret = tun_vnet_hdr_put(vnet_hdr_sz, iter, &gso);
@@ -2046,9 +2051,10 @@ static ssize_t tun_put_user(struct tun_struct *tun,
 	}
 
 	if (vnet_hdr_sz) {
-		struct virtio_net_hdr gso;
+		struct virtio_net_hdr_v1_hash gso;
 
-		ret = tun_vnet_hdr_from_skb(tun->flags, tun->dev, skb, &gso);
+		ret = tun_vnet_hdr_from_skb(vnet_hdr_sz, tun->flags, tun->dev,
+					    skb, tun_find_hash, &gso);
 		if (ret)
 			return ret;
 
diff --git a/drivers/net/tun_vnet.h b/drivers/net/tun_vnet.h
index 58b9ac7a5fc4..45d0533efc8d 100644
--- a/drivers/net/tun_vnet.h
+++ b/drivers/net/tun_vnet.h
@@ -6,6 +6,17 @@
 #define TUN_VNET_LE     0x80000000
 #define TUN_VNET_BE     0x40000000
 
+typedef struct virtio_net_hash *(*tun_vnet_hash_add)(struct sk_buff *);
+typedef const struct virtio_net_hash *(*tun_vnet_hash_find)(const struct sk_buff *);
+
+struct tun_vnet_hash {
+	bool report;
+	bool rss;
+	struct tun_vnet_rss common;
+	u32 rss_key[VIRTIO_NET_RSS_MAX_KEY_SIZE];
+	u16 rss_indirection_table[];
+};
+
 static inline bool tun_vnet_legacy_is_little_endian(unsigned int flags)
 {
 	bool be = IS_ENABLED(CONFIG_TUN_VNET_CROSS_LE) &&
@@ -107,6 +118,128 @@ static inline long tun_vnet_ioctl(int *vnet_hdr_sz, unsigned int *flags,
 	}
 }
 
+static inline long tun_vnet_ioctl_gethashtypes(u32 __user *argp)
+{
+	return put_user(VIRTIO_NET_SUPPORTED_HASH_TYPES, argp) ? -EFAULT : 0;
+}
+
+static inline long tun_vnet_ioctl_sethash(struct tun_vnet_hash __rcu **hashp,
+					  unsigned int cmd,
+					  void __user *argp)
+{
+	struct tun_vnet_rss common;
+	struct tun_vnet_hash *hash;
+	size_t indirection_table_size;
+	size_t key_size;
+	size_t size;
+
+	switch (cmd) {
+	case TUNSETVNETREPORTINGAUTOMQ:
+		if (get_user(common.hash_types, (u32 __user *)argp))
+			return -EFAULT;
+
+		if (common.hash_types) {
+			hash = kzalloc(sizeof(*hash), GFP_KERNEL);
+			if (!hash)
+				return -ENOMEM;
+
+			hash->report = true;
+			hash->common.hash_types = common.hash_types;
+		} else {
+			hash = NULL;
+		}
+		break;
+
+	case TUNSETVNETREPORTINGRSS:
+	case TUNSETVNETRSS:
+		if (copy_from_user(&common, argp, sizeof(common)))
+			return -EFAULT;
+		argp = (struct tun_vnet_rss __user *)argp + 1;
+
+		indirection_table_size = ((size_t)common.indirection_table_mask + 1) * 2;
+		key_size = virtio_net_hash_key_length(common.hash_types);
+		size = struct_size(hash, rss_indirection_table,
+				   (size_t)common.indirection_table_mask + 1);
+
+		hash = kmalloc(size, GFP_KERNEL);
+		if (!hash)
+			return -ENOMEM;
+
+		if (copy_from_user(hash->rss_indirection_table,
+				   argp, indirection_table_size)) {
+			kfree(hash);
+			return -EFAULT;
+		}
+		argp = (u16 __user *)argp + common.indirection_table_mask + 1;
+
+		if (copy_from_user(hash->rss_key, argp, key_size)) {
+			kfree(hash);
+			return -EFAULT;
+		}
+
+		virtio_net_toeplitz_convert_key(hash->rss_key, key_size);
+		hash->report = cmd == TUNSETVNETREPORTINGRSS;
+		hash->rss = true;
+		hash->common = common;
+		break;
+
+	default:
+		return -EINVAL;
+	}
+
+	kfree_rcu_mightsleep(rcu_replace_pointer_rtnl(*hashp, hash));
+	return 0;
+}
+
+static inline void tun_vnet_hash_report(const struct tun_vnet_hash *hash,
+					struct sk_buff *skb,
+					const struct flow_keys_basic *keys,
+					u32 value,
+					tun_vnet_hash_add vnet_hash_add)
+{
+	struct virtio_net_hash *report;
+
+	if (!hash || !hash->report)
+		return;
+
+	report = vnet_hash_add(skb);
+	if (!report)
+		return;
+
+	*report = (struct virtio_net_hash) {
+		.report = virtio_net_hash_report(hash->common.hash_types, keys),
+		.value = value
+	};
+}
+
+static inline u16 tun_vnet_rss_select_queue(u32 numqueues,
+					    const struct tun_vnet_hash *hash,
+					    struct sk_buff *skb,
+					    tun_vnet_hash_add vnet_hash_add)
+{
+	struct virtio_net_hash *report;
+	struct virtio_net_hash ret;
+	u16 index;
+
+	if (!numqueues)
+		return 0;
+
+	virtio_net_hash_rss(skb, hash->common.hash_types, hash->rss_key, &ret);
+
+	if (!ret.report)
+		return hash->common.unclassified_queue % numqueues;
+
+	if (hash->report) {
+		report = vnet_hash_add(skb);
+		if (report)
+			*report = ret;
+	}
+
+	index = ret.value & hash->common.indirection_table_mask;
+
+	return hash->rss_indirection_table[index] % numqueues;
+}
+
 static inline int tun_vnet_hdr_get(int sz, unsigned int flags,
 				   struct iov_iter *from,
 				   struct virtio_net_hdr *hdr)
@@ -135,15 +268,17 @@ static inline int tun_vnet_hdr_get(int sz, unsigned int flags,
 }
 
 static inline int tun_vnet_hdr_put(int sz, struct iov_iter *iter,
-				   const struct virtio_net_hdr *hdr)
+				   const struct virtio_net_hdr_v1_hash *hdr)
 {
+	int content_sz = MIN(sizeof(*hdr), sz);
+
 	if (unlikely(iov_iter_count(iter) < sz))
 		return -EINVAL;
 
-	if (unlikely(copy_to_iter(hdr, sizeof(*hdr), iter) != sizeof(*hdr)))
+	if (unlikely(copy_to_iter(hdr, content_sz, iter) != content_sz))
 		return -EFAULT;
 
-	if (iov_iter_zero(sz - sizeof(*hdr), iter) != sz - sizeof(*hdr))
+	if (iov_iter_zero(sz - content_sz, iter) != sz - content_sz)
 		return -EFAULT;
 
 	return 0;
@@ -155,26 +290,38 @@ static inline int tun_vnet_hdr_to_skb(unsigned int flags, struct sk_buff *skb,
 	return virtio_net_hdr_to_skb(skb, hdr, tun_vnet_is_little_endian(flags));
 }
 
-static inline int tun_vnet_hdr_from_skb(unsigned int flags,
+static inline int tun_vnet_hdr_from_skb(int sz, unsigned int flags,
 					const struct net_device *dev,
 					const struct sk_buff *skb,
-					struct virtio_net_hdr *hdr)
+					tun_vnet_hash_find vnet_hash_find,
+					struct virtio_net_hdr_v1_hash *hdr)
 {
 	int vlan_hlen = skb_vlan_tag_present(skb) ? VLAN_HLEN : 0;
+	const struct virtio_net_hash *report = sz < sizeof(struct virtio_net_hdr_v1_hash) ?
+					       NULL : vnet_hash_find(skb);
+
+	*hdr = (struct virtio_net_hdr_v1_hash) {
+		.hash_report = VIRTIO_NET_HASH_REPORT_NONE
+	};
+
+	if (report) {
+		hdr->hash_value = cpu_to_le32(report->value);
+		hdr->hash_report = cpu_to_le16(report->report);
+	}
 
-	if (virtio_net_hdr_from_skb(skb, hdr,
+	if (virtio_net_hdr_from_skb(skb, (struct virtio_net_hdr *)hdr,
 				    tun_vnet_is_little_endian(flags), true,
 				    vlan_hlen)) {
 		struct skb_shared_info *sinfo = skb_shinfo(skb);
 
 		if (net_ratelimit()) {
 			netdev_err(dev, "unexpected GSO type: 0x%x, gso_size %d, hdr_len %d\n",
-				   sinfo->gso_type, tun_vnet16_to_cpu(flags, hdr->gso_size),
-				   tun_vnet16_to_cpu(flags, hdr->hdr_len));
+				   sinfo->gso_type, tun_vnet16_to_cpu(flags, hdr->hdr.gso_size),
+				   tun_vnet16_to_cpu(flags, hdr->hdr.hdr_len));
 			print_hex_dump(KERN_ERR, "tun: ",
 				       DUMP_PREFIX_NONE,
 				       16, 1, skb->head,
-				       min(tun_vnet16_to_cpu(flags, hdr->hdr_len), 64), true);
+				       min(tun_vnet16_to_cpu(flags, hdr->hdr.hdr_len), 64), true);
 		}
 		WARN_ON_ONCE(1);
 		return -EINVAL;
diff --git a/include/uapi/linux/if_tun.h b/include/uapi/linux/if_tun.h
index 980de74724fc..fe4b984d3bbb 100644
--- a/include/uapi/linux/if_tun.h
+++ b/include/uapi/linux/if_tun.h
@@ -62,6 +62,62 @@
 #define TUNSETCARRIER _IOW('T', 226, int)
 #define TUNGETDEVNETNS _IO('T', 227)
 
+/**
+ * define TUNGETVNETHASHTYPES - ioctl to get supported virtio_net hashing types
+ *
+ * The argument is a pointer to __u32 which will store the supported virtio_net
+ * hashing types.
+ */
+#define TUNGETVNETHASHTYPES _IOR('T', 228, __u32)
+
+/**
+ * define TUNSETVNETREPORTINGAUTOMQ - ioctl to enable automq with hash reporting
+ *
+ * Disable RSS and enable automatic receive steering with hash reporting.
+ *
+ * The argument is a pointer to __u32 that contains a bitmask of hash types
+ * allowed to be reported.
+ *
+ * This ioctl results in %EBADFD if the underlying device is deleted. It affects
+ * all queues attached to the same device.
+ *
+ * This ioctl currently has no effect on XDP packets and packets with
+ * queue_mapping set by TC.
+ */
+#define TUNSETVNETREPORTINGAUTOMQ _IOR('T', 229, __u32)
+
+/**
+ * define TUNSETVNETREPORTINGRSS - ioctl to enable RSS with hash reporting
+ *
+ * Disable automatic receive steering and enable RSS with hash reporting.
+ *
+ * This ioctl results in %EBADFD if the underlying device is deleted. It affects
+ * all queues attached to the same device.
+ *
+ * This ioctl currently has no effect on XDP packets and packets with
+ * queue_mapping set by TC.
+ */
+#define TUNSETVNETREPORTINGRSS _IOR('T', 230, struct tun_vnet_rss)
+
+/**
+ * define TUNSETVNETRSS - ioctl to enable RSS without hash reporting
+ *
+ * Disable automatic receive steering and enable RSS without hash reporting.
+ *
+ * The argument is a pointer to the compound of the following in order:
+ *
+ * 1. &struct tun_vnet_rss
+ * 3. Indirection table
+ * 4. Key
+ *
+ * This ioctl results in %EBADFD if the underlying device is deleted. It affects
+ * all queues attached to the same device.
+ *
+ * This ioctl currently has no effect on XDP packets and packets with
+ * queue_mapping set by TC.
+ */
+#define TUNSETVNETRSS _IOR('T', 231, struct tun_vnet_rss)
+
 /* TUNSETIFF ifr flags */
 #define IFF_TUN		0x0001
 #define IFF_TAP		0x0002
@@ -124,4 +180,19 @@ struct tun_filter {
  */
 #define TUN_STEERINGEBPF_FALLBACK -1
 
+/**
+ * struct tun_vnet_rss - virtio_net RSS configuration
+ * @hash_types:
+ *		Bitmask of allowed hash types
+ * @indirection_table_mask:
+ *		Bitmask to be applied to the indirection table index
+ * @unclassified_queue:
+ *		The index of the queue to place unclassified packets in
+ */
+struct tun_vnet_rss {
+	__u32 hash_types;
+	__u16 indirection_table_mask;
+	__u16 unclassified_queue;
+};
+
 #endif /* _UAPI__IF_TUN_H */

-- 
2.49.0


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH net-next v12 05/10] tun: Introduce virtio-net hash feature
  2025-05-30  4:50 [PATCH net-next v12 00/10] tun: Introduce virtio-net hashing feature Akihiko Odaki
                   ` (3 preceding siblings ...)
  2025-05-30  4:50 ` [PATCH net-next v12 04/10] tun: Add common virtio-net hash feature code Akihiko Odaki
@ 2025-05-30  4:50 ` Akihiko Odaki
  2025-05-30  4:50 ` [PATCH net-next v12 06/10] tap: " Akihiko Odaki
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 39+ messages in thread
From: Akihiko Odaki @ 2025-05-30  4:50 UTC (permalink / raw)
  To: Jonathan Corbet, Willem de Bruijn, Jason Wang, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Michael S. Tsirkin,
	Xuan Zhuo, Shuah Khan, linux-doc, linux-kernel, netdev, kvm,
	virtualization, linux-kselftest, Yuri Benditovich,
	Andrew Melnychenko, Stephen Hemminger, gur.stavi, Lei Yang,
	Simon Horman, Akihiko Odaki

Add ioctls and storage required for the virtio-net hash feature to TUN.

Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
---
 drivers/net/Kconfig    |  1 +
 drivers/net/tun.c      | 56 ++++++++++++++++++++++++++++++++++++++++++--------
 include/linux/skbuff.h |  3 +++
 net/core/skbuff.c      |  4 ++++
 4 files changed, 56 insertions(+), 8 deletions(-)

diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index 1fd5acdc73c6..aecfd244dd83 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -395,6 +395,7 @@ config TUN
 	tristate "Universal TUN/TAP device driver support"
 	depends on INET
 	select CRC32
+	select SKB_EXTENSIONS
 	help
 	  TUN/TAP provides packet reception and transmission for user space
 	  programs.  It can be viewed as a simple Point-to-Point or Ethernet
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 03d47799e9bd..0a34db248e03 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -209,6 +209,7 @@ struct tun_struct {
 	struct bpf_prog __rcu *xdp_prog;
 	struct tun_prog __rcu *steering_prog;
 	struct tun_prog __rcu *filter_prog;
+	struct tun_vnet_hash __rcu *vnet_hash;
 	struct ethtool_link_ksettings link_ksettings;
 	/* init args */
 	struct file *file;
@@ -451,9 +452,14 @@ static inline void tun_flow_save_rps_rxhash(struct tun_flow_entry *e, u32 hash)
 		e->rps_rxhash = hash;
 }
 
+static struct virtio_net_hash *tun_add_hash(struct sk_buff *skb)
+{
+	return skb_ext_add(skb, SKB_EXT_TUN_VNET_HASH);
+}
+
 static const struct virtio_net_hash *tun_find_hash(const struct sk_buff *skb)
 {
-	return NULL;
+	return skb_ext_find(skb, SKB_EXT_TUN_VNET_HASH);
 }
 
 /* We try to identify a flow through its rxhash. The reason that
@@ -462,14 +468,21 @@ static const struct virtio_net_hash *tun_find_hash(const struct sk_buff *skb)
  * the userspace application move between processors, we may get a
  * different rxq no. here.
  */
-static u16 tun_automq_select_queue(struct tun_struct *tun, struct sk_buff *skb)
+static u16 tun_automq_select_queue(struct tun_struct *tun,
+				   const struct tun_vnet_hash *vnet_hash,
+				   struct sk_buff *skb)
 {
+	struct flow_keys keys;
+	struct flow_keys_basic keys_basic;
 	struct tun_flow_entry *e;
 	u32 txq, numqueues;
 
 	numqueues = READ_ONCE(tun->numqueues);
 
-	txq = __skb_get_hash_symmetric(skb);
+	memset(&keys, 0, sizeof(keys));
+	skb_flow_dissect(skb, &flow_keys_dissector_symmetric, &keys, 0);
+
+	txq = flow_hash_from_keys(&keys);
 	e = tun_flow_find(&tun->flows[tun_hashfn(txq)], txq);
 	if (e) {
 		tun_flow_save_rps_rxhash(e, txq);
@@ -478,6 +491,13 @@ static u16 tun_automq_select_queue(struct tun_struct *tun, struct sk_buff *skb)
 		txq = reciprocal_scale(txq, numqueues);
 	}
 
+	keys_basic = (struct flow_keys_basic) {
+		.control = keys.control,
+		.basic = keys.basic
+	};
+	tun_vnet_hash_report(vnet_hash, skb, &keys_basic, skb->l4_hash ? skb->hash : txq,
+			     tun_add_hash);
+
 	return txq;
 }
 
@@ -513,8 +533,15 @@ static u16 tun_select_queue(struct net_device *dev, struct sk_buff *skb,
 	u16 ret;
 
 	rcu_read_lock();
-	if (!tun_ebpf_select_queue(tun, skb, &ret))
-		ret = tun_automq_select_queue(tun, skb);
+	if (!tun_ebpf_select_queue(tun, skb, &ret)) {
+		struct tun_vnet_hash *vnet_hash = rcu_dereference(tun->vnet_hash);
+
+		if (vnet_hash && vnet_hash->rss)
+			ret = tun_vnet_rss_select_queue(READ_ONCE(tun->numqueues), vnet_hash,
+							skb, tun_add_hash);
+		else
+			ret = tun_automq_select_queue(tun, vnet_hash, skb);
+	}
 	rcu_read_unlock();
 
 	return ret;
@@ -2235,6 +2262,7 @@ static void tun_free_netdev(struct net_device *dev)
 	security_tun_dev_free_security(tun->security);
 	__tun_set_ebpf(tun, &tun->steering_prog, NULL);
 	__tun_set_ebpf(tun, &tun->filter_prog, NULL);
+	kfree_rcu_mightsleep(rcu_access_pointer(tun->vnet_hash));
 }
 
 static void tun_setup(struct net_device *dev)
@@ -3014,16 +3042,22 @@ static long __tun_chr_ioctl(struct file *file, unsigned int cmd,
 	} else {
 		memset(&ifr, 0, sizeof(ifr));
 	}
-	if (cmd == TUNGETFEATURES) {
+	switch (cmd) {
+	case TUNGETFEATURES:
 		/* Currently this just means: "what IFF flags are valid?".
 		 * This is needed because we never checked for invalid flags on
 		 * TUNSETIFF.
 		 */
 		return put_user(IFF_TUN | IFF_TAP | IFF_NO_CARRIER |
 				TUN_FEATURES, (unsigned int __user*)argp);
-	} else if (cmd == TUNSETQUEUE) {
+
+	case TUNSETQUEUE:
 		return tun_set_queue(file, &ifr);
-	} else if (cmd == SIOCGSKNS) {
+
+	case TUNGETVNETHASHTYPES:
+		return tun_vnet_ioctl_gethashtypes(argp);
+
+	case SIOCGSKNS:
 		if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
 			return -EPERM;
 		return open_related_ns(&net->ns, get_net_ns);
@@ -3264,6 +3298,12 @@ static long __tun_chr_ioctl(struct file *file, unsigned int cmd,
 		ret = open_related_ns(&net->ns, get_net_ns);
 		break;
 
+	case TUNSETVNETREPORTINGAUTOMQ:
+	case TUNSETVNETREPORTINGRSS:
+	case TUNSETVNETRSS:
+		ret = tun_vnet_ioctl_sethash(&tun->vnet_hash, cmd, argp);
+		break;
+
 	default:
 		ret = tun_vnet_ioctl(&tun->vnet_hdr_sz, &tun->flags, cmd, argp);
 		break;
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index bb2b751d274a..cdd793f1c360 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -4842,6 +4842,9 @@ enum skb_ext_id {
 #endif
 #if IS_ENABLED(CONFIG_MCTP_FLOWS)
 	SKB_EXT_MCTP,
+#endif
+#if IS_ENABLED(CONFIG_TUN)
+	SKB_EXT_TUN_VNET_HASH,
 #endif
 	SKB_EXT_NUM, /* must be last */
 };
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index b1c81687e9d8..75d48217a20f 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -64,6 +64,7 @@
 #include <linux/mpls.h>
 #include <linux/kcov.h>
 #include <linux/iov_iter.h>
+#include <linux/virtio_net.h>
 
 #include <net/protocol.h>
 #include <net/dst.h>
@@ -4969,6 +4970,9 @@ static const u8 skb_ext_type_len[] = {
 #if IS_ENABLED(CONFIG_MCTP_FLOWS)
 	[SKB_EXT_MCTP] = SKB_EXT_CHUNKSIZEOF(struct mctp_flow),
 #endif
+#if IS_ENABLED(CONFIG_TUN)
+	[SKB_EXT_TUN_VNET_HASH] = SKB_EXT_CHUNKSIZEOF(struct virtio_net_hash),
+#endif
 };
 
 static __always_inline unsigned int skb_ext_total_length(void)

-- 
2.49.0


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH net-next v12 06/10] tap: Introduce virtio-net hash feature
  2025-05-30  4:50 [PATCH net-next v12 00/10] tun: Introduce virtio-net hashing feature Akihiko Odaki
                   ` (4 preceding siblings ...)
  2025-05-30  4:50 ` [PATCH net-next v12 05/10] tun: Introduce virtio-net hash feature Akihiko Odaki
@ 2025-05-30  4:50 ` Akihiko Odaki
  2025-05-30  4:50 ` [PATCH net-next v12 07/10] selftest: tun: Test vnet ioctls without device Akihiko Odaki
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 39+ messages in thread
From: Akihiko Odaki @ 2025-05-30  4:50 UTC (permalink / raw)
  To: Jonathan Corbet, Willem de Bruijn, Jason Wang, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Michael S. Tsirkin,
	Xuan Zhuo, Shuah Khan, linux-doc, linux-kernel, netdev, kvm,
	virtualization, linux-kselftest, Yuri Benditovich,
	Andrew Melnychenko, Stephen Hemminger, gur.stavi, Lei Yang,
	Simon Horman, Akihiko Odaki

Add ioctls and storage required for the virtio-net hash feature to TAP.

Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
---
 drivers/net/ipvlan/ipvtap.c |  2 +-
 drivers/net/macvtap.c       |  2 +-
 drivers/net/tap.c           | 72 +++++++++++++++++++++++++++++++++++++++++----
 include/linux/if_tap.h      |  4 ++-
 4 files changed, 71 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ipvlan/ipvtap.c b/drivers/net/ipvlan/ipvtap.c
index 1afc4c47be73..305438abf7ae 100644
--- a/drivers/net/ipvlan/ipvtap.c
+++ b/drivers/net/ipvlan/ipvtap.c
@@ -114,7 +114,7 @@ static void ipvtap_dellink(struct net_device *dev,
 	struct ipvtap_dev *vlan = netdev_priv(dev);
 
 	netdev_rx_handler_unregister(dev);
-	tap_del_queues(&vlan->tap);
+	tap_del(&vlan->tap);
 	ipvlan_link_delete(dev, head);
 }
 
diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
index 29a5929d48e5..e72144d05ef4 100644
--- a/drivers/net/macvtap.c
+++ b/drivers/net/macvtap.c
@@ -122,7 +122,7 @@ static void macvtap_dellink(struct net_device *dev,
 	struct macvtap_dev *vlantap = netdev_priv(dev);
 
 	netdev_rx_handler_unregister(dev);
-	tap_del_queues(&vlantap->tap);
+	tap_del(&vlantap->tap);
 	macvlan_dellink(dev, head);
 }
 
diff --git a/drivers/net/tap.c b/drivers/net/tap.c
index 25c60ff2d3f2..15f056d7f632 100644
--- a/drivers/net/tap.c
+++ b/drivers/net/tap.c
@@ -49,6 +49,10 @@ struct major_info {
 	struct list_head next;
 };
 
+struct tap_skb_cb {
+	struct virtio_net_hash hash;
+};
+
 #define GOODCOPY_LEN 128
 
 static const struct proto_ops tap_socket_ops;
@@ -179,9 +183,20 @@ static void tap_put_queue(struct tap_queue *q)
 	sock_put(&q->sk);
 }
 
+static struct tap_skb_cb *tap_skb_cb(const struct sk_buff *skb)
+{
+	BUILD_BUG_ON(sizeof(skb->cb) < sizeof(struct tap_skb_cb));
+	return (struct tap_skb_cb *)skb->cb;
+}
+
+static struct virtio_net_hash *tap_add_hash(struct sk_buff *skb)
+{
+	return &tap_skb_cb(skb)->hash;
+}
+
 static const struct virtio_net_hash *tap_find_hash(const struct sk_buff *skb)
 {
-	return NULL;
+	return &tap_skb_cb(skb)->hash;
 }
 
 /*
@@ -194,6 +209,7 @@ static const struct virtio_net_hash *tap_find_hash(const struct sk_buff *skb)
 static struct tap_queue *tap_get_queue(struct tap_dev *tap,
 				       struct sk_buff *skb)
 {
+	struct flow_keys_basic keys_basic;
 	struct tap_queue *queue = NULL;
 	/* Access to taps array is protected by rcu, but access to numvtaps
 	 * isn't. Below we use it to lookup a queue, but treat it as a hint
@@ -201,17 +217,47 @@ static struct tap_queue *tap_get_queue(struct tap_dev *tap,
 	 * racing against queue removal.
 	 */
 	int numvtaps = READ_ONCE(tap->numvtaps);
+	struct tun_vnet_hash *vnet_hash = rcu_dereference(tap->vnet_hash);
 	__u32 rxq;
 
+	*tap_skb_cb(skb) = (struct tap_skb_cb) {
+		.hash = { .report = VIRTIO_NET_HASH_REPORT_NONE }
+	};
+
 	if (!numvtaps)
 		goto out;
 
 	if (numvtaps == 1)
 		goto single;
 
+	if (vnet_hash) {
+		if (vnet_hash->rss) {
+			rxq = tun_vnet_rss_select_queue(numvtaps, vnet_hash, skb, tap_add_hash);
+			queue = rcu_dereference(tap->taps[rxq]);
+			goto out;
+		}
+
+		if (!skb->l4_hash && !skb->sw_hash) {
+			struct flow_keys keys;
+
+			skb_flow_dissect_flow_keys(skb, &keys, FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL);
+			rxq = flow_hash_from_keys(&keys);
+			keys_basic = (struct flow_keys_basic) {
+				.control = keys.control,
+				.basic = keys.basic
+			};
+		} else {
+			skb_flow_dissect_flow_keys_basic(NULL, skb, &keys_basic, NULL, 0, 0, 0,
+							 FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL);
+			rxq = skb->hash;
+		}
+	} else {
+		rxq = skb_get_hash(skb);
+	}
+
 	/* Check if we can use flow to select a queue */
-	rxq = skb_get_hash(skb);
 	if (rxq) {
+		tun_vnet_hash_report(vnet_hash, skb, &keys_basic, rxq, tap_add_hash);
 		queue = rcu_dereference(tap->taps[rxq % numvtaps]);
 		goto out;
 	}
@@ -234,10 +280,10 @@ static struct tap_queue *tap_get_queue(struct tap_dev *tap,
 
 /*
  * The net_device is going away, give up the reference
- * that it holds on all queues and safely set the pointer
- * from the queues to NULL.
+ * that it holds on all queues, safely set the pointer
+ * from the queues to NULL, and free vnet_hash.
  */
-void tap_del_queues(struct tap_dev *tap)
+void tap_del(struct tap_dev *tap)
 {
 	struct tap_queue *q, *tmp;
 
@@ -254,8 +300,10 @@ void tap_del_queues(struct tap_dev *tap)
 	BUG_ON(tap->numqueues);
 	/* guarantee that any future tap_set_queue will fail */
 	tap->numvtaps = MAX_TAP_QUEUES;
+
+	kfree_rcu_mightsleep(rtnl_dereference(tap->vnet_hash));
 }
-EXPORT_SYMBOL_GPL(tap_del_queues);
+EXPORT_SYMBOL_GPL(tap_del);
 
 rx_handler_result_t tap_handle_frame(struct sk_buff **pskb)
 {
@@ -998,6 +1046,18 @@ static long tap_ioctl(struct file *file, unsigned int cmd,
 		rtnl_unlock();
 		return ret;
 
+	case TUNGETVNETHASHTYPES:
+		return tun_vnet_ioctl_gethashtypes(argp);
+
+	case TUNSETVNETREPORTINGAUTOMQ:
+	case TUNSETVNETREPORTINGRSS:
+	case TUNSETVNETRSS:
+		rtnl_lock();
+		tap = rtnl_dereference(q->tap);
+		ret = tap ? tun_vnet_ioctl_sethash(&tap->vnet_hash, cmd, argp) : -EBADFD;
+		rtnl_unlock();
+		return ret;
+
 	case SIOCGIFHWADDR:
 		rtnl_lock();
 		tap = tap_get_tap_dev(q);
diff --git a/include/linux/if_tap.h b/include/linux/if_tap.h
index 553552fa635c..6647a7a9e956 100644
--- a/include/linux/if_tap.h
+++ b/include/linux/if_tap.h
@@ -31,6 +31,7 @@ static inline struct ptr_ring *tap_get_ptr_ring(struct file *f)
 #define MAX_TAP_QUEUES 256
 
 struct tap_queue;
+struct tun_vnet_hash;
 
 struct tap_dev {
 	struct net_device	*dev;
@@ -43,6 +44,7 @@ struct tap_dev {
 	int			numqueues;
 	netdev_features_t	tap_features;
 	int			minor;
+	struct tun_vnet_hash __rcu *vnet_hash;
 
 	void (*update_features)(struct tap_dev *tap, netdev_features_t features);
 	void (*count_tx_dropped)(struct tap_dev *tap);
@@ -74,7 +76,7 @@ struct tap_queue {
 };
 
 rx_handler_result_t tap_handle_frame(struct sk_buff **pskb);
-void tap_del_queues(struct tap_dev *tap);
+void tap_del(struct tap_dev *tap);
 int tap_get_minor(dev_t major, struct tap_dev *tap);
 void tap_free_minor(dev_t major, struct tap_dev *tap);
 int tap_queue_resize(struct tap_dev *tap);

-- 
2.49.0


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH net-next v12 07/10] selftest: tun: Test vnet ioctls without device
  2025-05-30  4:50 [PATCH net-next v12 00/10] tun: Introduce virtio-net hashing feature Akihiko Odaki
                   ` (5 preceding siblings ...)
  2025-05-30  4:50 ` [PATCH net-next v12 06/10] tap: " Akihiko Odaki
@ 2025-05-30  4:50 ` Akihiko Odaki
  2025-05-30  4:50 ` [PATCH net-next v12 08/10] selftest: tun: Add tests for virtio-net hashing Akihiko Odaki
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 39+ messages in thread
From: Akihiko Odaki @ 2025-05-30  4:50 UTC (permalink / raw)
  To: Jonathan Corbet, Willem de Bruijn, Jason Wang, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Michael S. Tsirkin,
	Xuan Zhuo, Shuah Khan, linux-doc, linux-kernel, netdev, kvm,
	virtualization, linux-kselftest, Yuri Benditovich,
	Andrew Melnychenko, Stephen Hemminger, gur.stavi, Lei Yang,
	Simon Horman, Akihiko Odaki

Ensure that vnet ioctls result in EBADFD when the underlying device is
deleted.

Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
Tested-by: Lei Yang <leiyang@redhat.com>
---
 tools/testing/selftests/net/tun.c | 39 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 39 insertions(+)

diff --git a/tools/testing/selftests/net/tun.c b/tools/testing/selftests/net/tun.c
index fa83918b62d1..41747e1728a6 100644
--- a/tools/testing/selftests/net/tun.c
+++ b/tools/testing/selftests/net/tun.c
@@ -12,6 +12,7 @@
 #include <linux/if_tun.h>
 #include <linux/netlink.h>
 #include <linux/rtnetlink.h>
+#include <linux/virtio_net.h>
 #include <sys/ioctl.h>
 #include <sys/socket.h>
 
@@ -159,4 +160,42 @@ TEST_F(tun, reattach_close_delete) {
 	EXPECT_EQ(tun_delete(self->ifname), 0);
 }
 
+FIXTURE(tun_deleted)
+{
+	char ifname[IFNAMSIZ];
+	int fd;
+};
+
+FIXTURE_SETUP(tun_deleted)
+{
+	self->ifname[0] = 0;
+	self->fd = tun_alloc(self->ifname);
+	ASSERT_LE(0, self->fd);
+
+	ASSERT_EQ(0, tun_delete(self->ifname))
+		EXPECT_EQ(0, close(self->fd));
+}
+
+FIXTURE_TEARDOWN(tun_deleted)
+{
+	EXPECT_EQ(0, close(self->fd));
+}
+
+TEST_F(tun_deleted, getvnethdrsz)
+{
+	ASSERT_EQ(-1, ioctl(self->fd, TUNGETVNETHDRSZ));
+	EXPECT_EQ(EBADFD, errno);
+}
+
+TEST_F(tun_deleted, getvnethashtypes)
+{
+	uint32_t hash_types;
+	int ret = ioctl(self->fd, TUNGETVNETHASHTYPES, &hash_types);
+
+	if (ret == -1 && errno == EBADFD)
+		SKIP(return, "TUNGETVNETHASHTYPES not supported");
+
+	EXPECT_FALSE(ret);
+}
+
 TEST_HARNESS_MAIN

-- 
2.49.0


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH net-next v12 08/10] selftest: tun: Add tests for virtio-net hashing
  2025-05-30  4:50 [PATCH net-next v12 00/10] tun: Introduce virtio-net hashing feature Akihiko Odaki
                   ` (6 preceding siblings ...)
  2025-05-30  4:50 ` [PATCH net-next v12 07/10] selftest: tun: Test vnet ioctls without device Akihiko Odaki
@ 2025-05-30  4:50 ` Akihiko Odaki
  2025-05-30  4:50 ` [PATCH net-next v12 09/10] selftest: tap: Add tests for virtio-net ioctls Akihiko Odaki
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 39+ messages in thread
From: Akihiko Odaki @ 2025-05-30  4:50 UTC (permalink / raw)
  To: Jonathan Corbet, Willem de Bruijn, Jason Wang, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Michael S. Tsirkin,
	Xuan Zhuo, Shuah Khan, linux-doc, linux-kernel, netdev, kvm,
	virtualization, linux-kselftest, Yuri Benditovich,
	Andrew Melnychenko, Stephen Hemminger, gur.stavi, Lei Yang,
	Simon Horman, Akihiko Odaki

The added tests confirm tun can perform RSS for all supported hash types
to select the receive queue and report hash values.

Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
Tested-by: Lei Yang <leiyang@redhat.com>
---
 tools/testing/selftests/net/Makefile |   2 +-
 tools/testing/selftests/net/tun.c    | 448 ++++++++++++++++++++++++++++++++++-
 2 files changed, 440 insertions(+), 10 deletions(-)

diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile
index 8f32b4f01aee..0e0c751a4691 100644
--- a/tools/testing/selftests/net/Makefile
+++ b/tools/testing/selftests/net/Makefile
@@ -124,6 +124,6 @@ $(OUTPUT)/reuseport_bpf_numa: LDLIBS += -lnuma
 $(OUTPUT)/tcp_mmap: LDLIBS += -lpthread -lcrypto
 $(OUTPUT)/tcp_inq: LDLIBS += -lpthread
 $(OUTPUT)/bind_bhash: LDLIBS += -lpthread
-$(OUTPUT)/io_uring_zerocopy_tx: CFLAGS += -I../../../include/
+$(OUTPUT)/io_uring_zerocopy_tx $(OUTPUT)/tun: CFLAGS += -I../../../include/
 
 include bpf.mk
diff --git a/tools/testing/selftests/net/tun.c b/tools/testing/selftests/net/tun.c
index 41747e1728a6..79ae65ae934a 100644
--- a/tools/testing/selftests/net/tun.c
+++ b/tools/testing/selftests/net/tun.c
@@ -2,22 +2,38 @@
 
 #define _GNU_SOURCE
 
+#include <endian.h>
 #include <errno.h>
 #include <fcntl.h>
+#include <sched.h>
+#include <stddef.h>
 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>
 #include <unistd.h>
-#include <linux/if.h>
+#include <net/if.h>
+#include <netinet/ip.h>
+#include <sys/ioctl.h>
+#include <sys/socket.h>
+#include <linux/compiler.h>
+#include <linux/icmp.h>
+#include <linux/if_arp.h>
 #include <linux/if_tun.h>
+#include <linux/ipv6.h>
 #include <linux/netlink.h>
 #include <linux/rtnetlink.h>
+#include <linux/sockios.h>
+#include <linux/tcp.h>
+#include <linux/udp.h>
 #include <linux/virtio_net.h>
-#include <sys/ioctl.h>
-#include <sys/socket.h>
 
 #include "../kselftest_harness.h"
 
+#define TUN_HWADDR_SOURCE { 0x02, 0x00, 0x00, 0x00, 0x00, 0x00 }
+#define TUN_HWADDR_DEST { 0x02, 0x00, 0x00, 0x00, 0x00, 0x01 }
+#define TUN_IPADDR_SOURCE htonl((172 << 24) | (17 << 16) | 0)
+#define TUN_IPADDR_DEST htonl((172 << 24) | (17 << 16) | 1)
+
 static int tun_attach(int fd, char *dev)
 {
 	struct ifreq ifr;
@@ -40,7 +56,7 @@ static int tun_detach(int fd, char *dev)
 	return ioctl(fd, TUNSETQUEUE, (void *) &ifr);
 }
 
-static int tun_alloc(char *dev)
+static int tun_alloc(char *dev, short flags)
 {
 	struct ifreq ifr;
 	int fd, err;
@@ -53,7 +69,8 @@ static int tun_alloc(char *dev)
 
 	memset(&ifr, 0, sizeof(ifr));
 	strcpy(ifr.ifr_name, dev);
-	ifr.ifr_flags = IFF_TAP | IFF_NAPI | IFF_MULTI_QUEUE;
+	ifr.ifr_flags = flags | IFF_TAP | IFF_NAPI | IFF_NO_PI |
+			IFF_MULTI_QUEUE;
 
 	err = ioctl(fd, TUNSETIFF, (void *) &ifr);
 	if (err < 0) {
@@ -65,6 +82,20 @@ static int tun_alloc(char *dev)
 	return fd;
 }
 
+static bool tun_set_flags(int local_fd, const char *name, short flags)
+{
+	struct ifreq ifreq = { .ifr_flags = flags };
+
+	strcpy(ifreq.ifr_name, name);
+
+	if (ioctl(local_fd, SIOCSIFFLAGS, &ifreq)) {
+		perror("SIOCSIFFLAGS");
+		return false;
+	}
+
+	return true;
+}
+
 static int tun_delete(char *dev)
 {
 	struct {
@@ -103,6 +134,107 @@ static int tun_delete(char *dev)
 	return ret;
 }
 
+static uint32_t tun_sum(const void *buf, size_t len)
+{
+	const uint16_t *sbuf = buf;
+	uint32_t sum = 0;
+
+	while (len > 1) {
+		sum += *sbuf++;
+		len -= 2;
+	}
+
+	if (len)
+		sum += *(uint8_t *)sbuf;
+
+	return sum;
+}
+
+static uint16_t tun_build_ip_check(uint32_t sum)
+{
+	return ~((sum & 0xffff) + (sum >> 16));
+}
+
+static uint32_t tun_build_ip_pseudo_sum(const void *iphdr)
+{
+	uint16_t tot_len = ntohs(((struct iphdr *)iphdr)->tot_len);
+
+	return tun_sum((char *)iphdr + offsetof(struct iphdr, saddr), 8) +
+	       htons(((struct iphdr *)iphdr)->protocol) +
+	       htons(tot_len - sizeof(struct iphdr));
+}
+
+static uint32_t tun_build_ipv6_pseudo_sum(const void *ipv6hdr)
+{
+	return tun_sum((char *)ipv6hdr + offsetof(struct ipv6hdr, saddr), 32) +
+	       ((struct ipv6hdr *)ipv6hdr)->payload_len +
+	       htons(((struct ipv6hdr *)ipv6hdr)->nexthdr);
+}
+
+static void tun_build_iphdr(void *dest, uint16_t len, uint8_t protocol)
+{
+	struct iphdr iphdr = {
+		.ihl = sizeof(iphdr) / 4,
+		.version = 4,
+		.tot_len = htons(sizeof(iphdr) + len),
+		.ttl = 255,
+		.protocol = protocol,
+		.saddr = TUN_IPADDR_SOURCE,
+		.daddr = TUN_IPADDR_DEST
+	};
+
+	iphdr.check = tun_build_ip_check(tun_sum(&iphdr, sizeof(iphdr)));
+	memcpy(dest, &iphdr, sizeof(iphdr));
+}
+
+static void tun_build_ipv6hdr(void *dest, uint16_t len, uint8_t protocol)
+{
+	struct ipv6hdr ipv6hdr = {
+		.version = 6,
+		.payload_len = htons(len),
+		.nexthdr = protocol,
+		.saddr = {
+			.s6_addr32 = {
+				htonl(0xffff0000), 0, 0, TUN_IPADDR_SOURCE
+			}
+		},
+		.daddr = {
+			.s6_addr32 = {
+				htonl(0xffff0000), 0, 0, TUN_IPADDR_DEST
+			}
+		},
+	};
+
+	memcpy(dest, &ipv6hdr, sizeof(ipv6hdr));
+}
+
+static void tun_build_tcphdr(void *dest, uint32_t sum)
+{
+	struct tcphdr tcphdr = {
+		.source = htons(9),
+		.dest = htons(9),
+		.fin = 1,
+		.doff = sizeof(tcphdr) / 4,
+	};
+	uint32_t tcp_sum = tun_sum(&tcphdr, sizeof(tcphdr));
+
+	tcphdr.check = tun_build_ip_check(sum + tcp_sum);
+	memcpy(dest, &tcphdr, sizeof(tcphdr));
+}
+
+static void tun_build_udphdr(void *dest, uint32_t sum)
+{
+	struct udphdr udphdr = {
+		.source = htons(9),
+		.dest = htons(9),
+		.len = htons(sizeof(udphdr)),
+	};
+	uint32_t udp_sum = tun_sum(&udphdr, sizeof(udphdr));
+
+	udphdr.check = tun_build_ip_check(sum + udp_sum);
+	memcpy(dest, &udphdr, sizeof(udphdr));
+}
+
 FIXTURE(tun)
 {
 	char ifname[IFNAMSIZ];
@@ -113,10 +245,10 @@ FIXTURE_SETUP(tun)
 {
 	memset(self->ifname, 0, sizeof(self->ifname));
 
-	self->fd = tun_alloc(self->ifname);
+	self->fd = tun_alloc(self->ifname, 0);
 	ASSERT_GE(self->fd, 0);
 
-	self->fd2 = tun_alloc(self->ifname);
+	self->fd2 = tun_alloc(self->ifname, 0);
 	ASSERT_GE(self->fd2, 0);
 }
 
@@ -169,7 +301,7 @@ FIXTURE(tun_deleted)
 FIXTURE_SETUP(tun_deleted)
 {
 	self->ifname[0] = 0;
-	self->fd = tun_alloc(self->ifname);
+	self->fd = tun_alloc(self->ifname, 0);
 	ASSERT_LE(0, self->fd);
 
 	ASSERT_EQ(0, tun_delete(self->ifname))
@@ -198,4 +330,302 @@ TEST_F(tun_deleted, getvnethashtypes)
 	EXPECT_FALSE(ret);
 }
 
-TEST_HARNESS_MAIN
+FIXTURE(tun_vnet_rss)
+{
+	int dest_fds[3];
+	unsigned int dest_ifindex;
+	int source_fd;
+	char buffer[sizeof(struct ipv6hdr) + sizeof(struct tcphdr)];
+	uint16_t len;
+};
+
+FIXTURE_VARIANT(tun_vnet_rss)
+{
+	uint16_t eth_p;
+	uint8_t ipproto;
+	uint8_t flags;
+	uint16_t hash_report;
+	uint32_t hash_value;
+};
+
+FIXTURE_VARIANT_ADD(tun_vnet_rss, unclassified)
+{
+	.eth_p = ETH_P_LOOPBACK
+};
+
+FIXTURE_VARIANT_ADD(tun_vnet_rss, ipv4)
+{
+	.eth_p = ETH_P_IP,
+	.ipproto = 253,
+	.hash_report = VIRTIO_NET_HASH_REPORT_IPv4,
+	.hash_value = 0x6e45d952
+};
+
+FIXTURE_VARIANT_ADD(tun_vnet_rss, tcpv4)
+{
+	.eth_p = ETH_P_IP,
+	.ipproto = IPPROTO_TCP,
+	.hash_report = VIRTIO_NET_HASH_REPORT_TCPv4,
+	.hash_value = 0xfb63539a
+};
+
+FIXTURE_VARIANT_ADD(tun_vnet_rss, udpv4)
+{
+	.eth_p = ETH_P_IP,
+	.ipproto = IPPROTO_UDP,
+	.hash_report = VIRTIO_NET_HASH_REPORT_UDPv4,
+	.hash_value = 0xfb63539a
+};
+
+FIXTURE_VARIANT_ADD(tun_vnet_rss, ipv6)
+{
+	.eth_p = ETH_P_IPV6,
+	.ipproto = 253,
+	.hash_report = VIRTIO_NET_HASH_REPORT_IPv6,
+	.hash_value = 0xd6eb560f
+};
+
+FIXTURE_VARIANT_ADD(tun_vnet_rss, tcpv6)
+{
+	.eth_p = ETH_P_IPV6,
+	.ipproto = IPPROTO_TCP,
+	.hash_report = VIRTIO_NET_HASH_REPORT_TCPv6,
+	.hash_value = 0xc2b9f251
+};
+
+FIXTURE_VARIANT_ADD(tun_vnet_rss, udpv6)
+{
+	.eth_p = ETH_P_IPV6,
+	.ipproto = IPPROTO_UDP,
+	.hash_report = VIRTIO_NET_HASH_REPORT_UDPv6,
+	.hash_value = 0xc2b9f251
+};
+
+FIXTURE_SETUP(tun_vnet_rss)
+{
+	static const struct {
+		struct tun_vnet_rss hdr;
+		uint16_t indirection_table[2];
+		uint8_t key[40];
+	} vnet_rss = {
+		.hdr = {
+			.hash_types = VIRTIO_NET_RSS_HASH_TYPE_IPv4 |
+				      VIRTIO_NET_RSS_HASH_TYPE_TCPv4 |
+				      VIRTIO_NET_RSS_HASH_TYPE_UDPv4 |
+				      VIRTIO_NET_RSS_HASH_TYPE_IPv6 |
+				      VIRTIO_NET_RSS_HASH_TYPE_TCPv6 |
+				      VIRTIO_NET_RSS_HASH_TYPE_UDPv6,
+			.indirection_table_mask = 1,
+			.unclassified_queue = 5
+		},
+		.indirection_table = { 3, 4 },
+		.key = {
+			0x6d, 0x5a, 0x56, 0xda, 0x25, 0x5b, 0x0e, 0xc2,
+			0x41, 0x67, 0x25, 0x3d, 0x43, 0xa3, 0x8f, 0xb0,
+			0xd0, 0xca, 0x2b, 0xcb, 0xae, 0x7b, 0x30, 0xb4,
+			0x77, 0xcb, 0x2d, 0xa3, 0x80, 0x30, 0xf2, 0x0c,
+			0x6a, 0x42, 0xb7, 0x3b, 0xbe, 0xac, 0x01, 0xfa
+		}
+	};
+
+	struct {
+		struct virtio_net_hdr_v1_hash vnet_hdr;
+		struct ethhdr ethhdr;
+		struct arphdr arphdr;
+		unsigned char sender_hwaddr[6];
+		uint32_t sender_ipaddr;
+		unsigned char target_hwaddr[6];
+		uint32_t target_ipaddr;
+	} __packed packet = {
+		.ethhdr = {
+			.h_source = TUN_HWADDR_SOURCE,
+			.h_dest = { 0xff, 0xff, 0xff, 0xff, 0xff, 0xff },
+			.h_proto = htons(ETH_P_ARP)
+		},
+		.arphdr = {
+			.ar_hrd = htons(ARPHRD_ETHER),
+			.ar_pro = htons(ETH_P_IP),
+			.ar_hln = ETH_ALEN,
+			.ar_pln = 4,
+			.ar_op = htons(ARPOP_REQUEST)
+		},
+		.sender_hwaddr = TUN_HWADDR_DEST,
+		.sender_ipaddr = TUN_IPADDR_DEST,
+		.target_ipaddr = TUN_IPADDR_DEST
+	};
+
+	char dest_ifname[IFNAMSIZ] = "";
+	int i, local_fd;
+	uint16_t ip_payload_len;
+	uint32_t hash_types;
+	uint32_t ip_sum;
+
+	self->dest_fds[0] = tun_alloc(dest_ifname, IFF_VNET_HDR);
+	ASSERT_LE(0, self->dest_fds[0]) {
+		EXPECT_EQ(0, close(self->dest_fds[0]));
+	}
+
+	i = ioctl(self->dest_fds[0], TUNGETVNETHASHTYPES, &hash_types);
+	if (i == -1 && errno == EINVAL) {
+		EXPECT_EQ(0, close(self->dest_fds[0]));
+		SKIP(return, "TUNGETVNETHASHTYPES not supported");
+	}
+
+	ASSERT_EQ(0, i)
+		EXPECT_EQ(0, close(self->dest_fds[0]));
+
+	if ((hash_types & vnet_rss.hdr.hash_types) != vnet_rss.hdr.hash_types) {
+		EXPECT_EQ(0, close(self->dest_fds[0]));
+		SKIP(return, "Lacks some hash type support");
+	}
+
+	self->dest_ifindex = if_nametoindex(dest_ifname);
+	ASSERT_TRUE(self->dest_ifindex)
+		EXPECT_EQ(0, close(self->dest_fds[0]));
+
+	i = sizeof(struct virtio_net_hdr_v1_hash);
+	ASSERT_EQ(0, ioctl(self->dest_fds[0], TUNSETVNETHDRSZ, &i))
+		EXPECT_EQ(0, close(self->dest_fds[0]));
+
+	i = 1;
+	ASSERT_EQ(0, ioctl(self->dest_fds[0], TUNSETVNETLE, &i))
+		EXPECT_EQ(0, close(self->dest_fds[0]));
+
+	local_fd = socket(AF_LOCAL, SOCK_STREAM, 0);
+	ASSERT_LE(0, local_fd)
+		EXPECT_EQ(0, close(self->dest_fds[0]));
+
+	i = tun_set_flags(local_fd, dest_ifname, IFF_UP);
+	EXPECT_EQ(0, close(local_fd));
+	ASSERT_TRUE(i)
+		EXPECT_EQ(0, close(self->dest_fds[0]));
+
+	ASSERT_EQ(sizeof(packet),
+		  write(self->dest_fds[0], &packet, sizeof(packet)))
+		EXPECT_EQ(0, close(self->dest_fds[0]));
+
+	ASSERT_EQ(0, ioctl(self->dest_fds[0], TUNSETVNETREPORTINGRSS, &vnet_rss))
+		EXPECT_EQ(0, close(self->dest_fds[0]));
+
+	self->source_fd = socket(AF_PACKET, SOCK_DGRAM, 0);
+	ASSERT_LE(0, self->source_fd)
+		EXPECT_EQ(0, close(self->dest_fds[0]));
+
+	for (i = 1; i < ARRAY_SIZE(self->dest_fds); i++) {
+		self->dest_fds[i] = tun_alloc(dest_ifname, IFF_VNET_HDR);
+		ASSERT_LE(0, self->dest_fds[i]) {
+			while (i) {
+				i--;
+				EXPECT_EQ(0, close(self->dest_fds[i]));
+			}
+		}
+	}
+
+	if (variant->eth_p == ETH_P_IP || variant->eth_p == ETH_P_IPV6) {
+		switch (variant->ipproto) {
+		case IPPROTO_TCP:
+			ip_payload_len = sizeof(struct tcphdr);
+			break;
+
+		case IPPROTO_UDP:
+			ip_payload_len = sizeof(struct udphdr);
+			break;
+
+		default:
+			ip_payload_len = 0;
+		}
+
+		switch (variant->eth_p) {
+		case ETH_P_IP:
+			tun_build_iphdr(self->buffer, ip_payload_len, variant->ipproto);
+			self->len = sizeof(struct iphdr);
+			ip_sum = tun_build_ip_pseudo_sum(self->buffer);
+			break;
+
+		case ETH_P_IPV6:
+			tun_build_ipv6hdr(self->buffer, ip_payload_len, variant->ipproto);
+			self->len = sizeof(struct ipv6hdr);
+			ip_sum = tun_build_ipv6_pseudo_sum(self->buffer);
+			break;
+		}
+
+		switch (variant->ipproto) {
+		case IPPROTO_TCP:
+			tun_build_tcphdr(self->buffer + self->len, ip_sum);
+			break;
+
+		case IPPROTO_UDP:
+			tun_build_udphdr(self->buffer + self->len, ip_sum);
+			break;
+		}
+
+		self->len += ip_payload_len;
+	}
+}
+
+FIXTURE_TEARDOWN(tun_vnet_rss)
+{
+	EXPECT_EQ(0, close(self->source_fd));
+
+	for (size_t i = 0; i < ARRAY_SIZE(self->dest_fds); i++)
+		EXPECT_EQ(0, close(self->dest_fds[i]));
+}
+
+TEST_F(tun_vnet_rss, rx)
+{
+	size_t len = sizeof(struct virtio_net_hdr_v1_hash) + ETH_HLEN + self->len;
+	struct {
+		struct virtio_net_hdr_v1_hash vnet_hdr;
+		struct ethhdr ethhdr;
+		char payload[sizeof(self->buffer)];
+	} buffer;
+	struct virtio_net_hdr_v1_hash hdr = {
+		.hdr = { .flags = variant->flags },
+		.hash_value = htole32(variant->hash_value),
+		.hash_report = htole16(variant->hash_report)
+	};
+	int txq = variant->hash_report ? variant->hash_value & 1 : 2;
+	struct sockaddr_ll ll = {
+		.sll_family = AF_PACKET,
+		.sll_addr = TUN_HWADDR_DEST,
+		.sll_halen = ETH_ALEN,
+		.sll_ifindex = self->dest_ifindex,
+		.sll_protocol = htons(variant->eth_p),
+	};
+
+	EXPECT_EQ(self->len,
+		  sendto(self->source_fd, self->buffer, self->len, 0,
+			 (struct sockaddr *)&ll, sizeof(ll)));
+	EXPECT_EQ(len, read(self->dest_fds[txq], &buffer, len));
+	ASSERT_FALSE(memcmp(&buffer, &hdr, sizeof(hdr)));
+	ASSERT_FALSE(memcmp(buffer.payload, self->buffer, self->len));
+}
+
+int main(int argc, char **argv)
+{
+	FILE *file;
+
+	if (unshare(CLONE_NEWNET)) {
+		perror("unshare");
+		return KSFT_FAIL;
+	}
+
+	/* Disable IPv6 to eliminate IPv6 Neighbor Discovery messages. */
+	file = fopen("/proc/sys/net/ipv6/conf/default/disable_ipv6", "w");
+	if (file) {
+		if (fputc('1', file) != '1') {
+			perror("fputc");
+			return KSFT_FAIL;
+		}
+
+		if (fclose(file)) {
+			perror("fclose");
+			return KSFT_FAIL;
+		}
+	} else if (errno != ENOENT) {
+		perror("fopen");
+		return KSFT_FAIL;
+	}
+
+	return test_harness_run(argc, argv);
+}

-- 
2.49.0


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH net-next v12 09/10] selftest: tap: Add tests for virtio-net ioctls
  2025-05-30  4:50 [PATCH net-next v12 00/10] tun: Introduce virtio-net hashing feature Akihiko Odaki
                   ` (7 preceding siblings ...)
  2025-05-30  4:50 ` [PATCH net-next v12 08/10] selftest: tun: Add tests for virtio-net hashing Akihiko Odaki
@ 2025-05-30  4:50 ` Akihiko Odaki
  2025-05-30  4:50 ` [PATCH net-next v12 10/10] vhost/net: Support VIRTIO_NET_F_HASH_REPORT Akihiko Odaki
  2025-07-16  8:55 ` [PATCH net-next v12 00/10] tun: Introduce virtio-net hashing feature Michael S. Tsirkin
  10 siblings, 0 replies; 39+ messages in thread
From: Akihiko Odaki @ 2025-05-30  4:50 UTC (permalink / raw)
  To: Jonathan Corbet, Willem de Bruijn, Jason Wang, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Michael S. Tsirkin,
	Xuan Zhuo, Shuah Khan, linux-doc, linux-kernel, netdev, kvm,
	virtualization, linux-kselftest, Yuri Benditovich,
	Andrew Melnychenko, Stephen Hemminger, gur.stavi, Lei Yang,
	Simon Horman, Akihiko Odaki

They only test the ioctls are wired up to the implementation common with
tun as it is already tested for tun.

Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
---
 tools/testing/selftests/net/config |   1 +
 tools/testing/selftests/net/tap.c  | 131 ++++++++++++++++++++++++++++++++++++-
 2 files changed, 129 insertions(+), 3 deletions(-)

diff --git a/tools/testing/selftests/net/config b/tools/testing/selftests/net/config
index 61e5116987f3..00cb1e65b392 100644
--- a/tools/testing/selftests/net/config
+++ b/tools/testing/selftests/net/config
@@ -99,6 +99,7 @@ CONFIG_IPV6_IOAM6_LWTUNNEL=y
 CONFIG_CRYPTO_SM4_GENERIC=y
 CONFIG_AMT=m
 CONFIG_TUN=y
+CONFIG_TUN_VNET_CROSS_LE=y
 CONFIG_VXLAN=m
 CONFIG_IP_SCTP=m
 CONFIG_NETFILTER_XT_MATCH_POLICY=m
diff --git a/tools/testing/selftests/net/tap.c b/tools/testing/selftests/net/tap.c
index 247c3b3ac1c9..0decbc338965 100644
--- a/tools/testing/selftests/net/tap.c
+++ b/tools/testing/selftests/net/tap.c
@@ -387,9 +387,6 @@ FIXTURE_TEARDOWN(tap)
 	if (self->fd != -1)
 		close(self->fd);
 
-	ret = dev_delete(param_dev_tap_name);
-	EXPECT_EQ(ret, 0);
-
 	ret = dev_delete(param_dev_dummy_name);
 	EXPECT_EQ(ret, 0);
 }
@@ -431,4 +428,132 @@ TEST_F(tap, test_packet_crash_tap_invalid_eth_proto)
 	ASSERT_EQ(errno, EINVAL);
 }
 
+TEST_F(tap, test_vnethdrsz)
+{
+	int sz = sizeof(struct virtio_net_hdr_v1_hash);
+
+	ASSERT_FALSE(dev_delete(param_dev_tap_name));
+
+	ASSERT_FALSE(ioctl(self->fd, TUNSETVNETHDRSZ, &sz));
+	sz = 0;
+	ASSERT_FALSE(ioctl(self->fd, TUNGETVNETHDRSZ, &sz));
+	EXPECT_EQ(sizeof(struct virtio_net_hdr_v1_hash), sz);
+}
+
+TEST_F(tap, test_vnetle)
+{
+	int le = 1;
+
+	ASSERT_FALSE(dev_delete(param_dev_tap_name));
+
+	ASSERT_FALSE(ioctl(self->fd, TUNSETVNETLE, &le));
+	le = 0;
+	ASSERT_FALSE(ioctl(self->fd, TUNGETVNETLE, &le));
+	EXPECT_EQ(1, le);
+}
+
+TEST_F(tap, test_vnetbe)
+{
+	int be = 1;
+	int ret;
+
+	ASSERT_FALSE(dev_delete(param_dev_tap_name));
+
+	ret = ioctl(self->fd, TUNSETVNETBE, &be);
+	if (ret == -1 && errno == EINVAL)
+		SKIP(return, "TUNSETVNETBE not supported");
+
+	ASSERT_FALSE(ret);
+	be = 0;
+	ASSERT_FALSE(ioctl(self->fd, TUNGETVNETBE, &be));
+	EXPECT_EQ(1, be);
+}
+
+TEST_F(tap, test_getvnethashtypes)
+{
+	uint32_t hash_types;
+	int ret;
+
+	ASSERT_FALSE(dev_delete(param_dev_tap_name));
+
+	ret = ioctl(self->fd, TUNGETVNETHASHTYPES, &hash_types);
+	if (ret == -1 && errno == EINVAL)
+		SKIP(return, "TUNGETVNETHASHTYPES not supported");
+
+	EXPECT_FALSE(ret);
+}
+
+FIXTURE(tap_setvnethash)
+{
+	int fd;
+};
+
+FIXTURE_VARIANT(tap_setvnethash)
+{
+	unsigned int cmd;
+};
+
+FIXTURE_VARIANT_ADD(tap_setvnethash, reportingautomq)
+{
+	.cmd = TUNSETVNETREPORTINGAUTOMQ
+};
+
+FIXTURE_VARIANT_ADD(tap_setvnethash, reportingrss)
+{
+	.cmd = TUNSETVNETREPORTINGRSS
+};
+
+FIXTURE_VARIANT_ADD(tap_setvnethash, rss)
+{
+	.cmd = TUNSETVNETRSS
+};
+
+FIXTURE_SETUP(tap_setvnethash)
+{
+	int ret;
+
+	ret = dev_create(param_dev_dummy_name, "dummy", NULL, NULL);
+	ASSERT_FALSE(ret);
+
+	ret = dev_create(param_dev_tap_name, "macvtap", macvtap_fill_rtattr,
+			 NULL);
+	ASSERT_FALSE(ret)
+		EXPECT_FALSE(dev_delete(param_dev_dummy_name));
+
+	self->fd = opentap(param_dev_tap_name);
+	ASSERT_LT(0, self->fd)
+		EXPECT_FALSE(dev_delete(param_dev_dummy_name));
+}
+
+FIXTURE_TEARDOWN(tap_setvnethash)
+{
+	EXPECT_FALSE(close(self->fd));
+	EXPECT_FALSE(dev_delete(param_dev_dummy_name));
+}
+
+TEST_F(tap_setvnethash, test_alive)
+{
+	struct tun_vnet_rss rss = { .hash_types = 0 };
+	int ret;
+
+	ret = ioctl(self->fd, variant->cmd, &rss);
+
+	if (ret == -1 && errno == EINVAL)
+		SKIP(return, "not supported");
+
+	EXPECT_FALSE(ret);
+}
+
+TEST_F(tap_setvnethash, test_deleted)
+{
+	ASSERT_FALSE(dev_delete(param_dev_tap_name));
+
+	ASSERT_EQ(-1, ioctl(self->fd, variant->cmd));
+
+	if (errno == EINVAL)
+		SKIP(return, "not supported");
+
+	EXPECT_EQ(EBADFD, errno);
+}
+
 TEST_HARNESS_MAIN

-- 
2.49.0


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH net-next v12 10/10] vhost/net: Support VIRTIO_NET_F_HASH_REPORT
  2025-05-30  4:50 [PATCH net-next v12 00/10] tun: Introduce virtio-net hashing feature Akihiko Odaki
                   ` (8 preceding siblings ...)
  2025-05-30  4:50 ` [PATCH net-next v12 09/10] selftest: tap: Add tests for virtio-net ioctls Akihiko Odaki
@ 2025-05-30  4:50 ` Akihiko Odaki
  2025-07-16  8:55 ` [PATCH net-next v12 00/10] tun: Introduce virtio-net hashing feature Michael S. Tsirkin
  10 siblings, 0 replies; 39+ messages in thread
From: Akihiko Odaki @ 2025-05-30  4:50 UTC (permalink / raw)
  To: Jonathan Corbet, Willem de Bruijn, Jason Wang, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Michael S. Tsirkin,
	Xuan Zhuo, Shuah Khan, linux-doc, linux-kernel, netdev, kvm,
	virtualization, linux-kselftest, Yuri Benditovich,
	Andrew Melnychenko, Stephen Hemminger, gur.stavi, Lei Yang,
	Simon Horman, Akihiko Odaki

VIRTIO_NET_F_HASH_REPORT allows to report hash values calculated on the
host. When VHOST_NET_F_VIRTIO_NET_HDR is employed, it will report no
hash values (i.e., the hash_report member is always set to
VIRTIO_NET_HASH_REPORT_NONE). Otherwise, the values reported by the
underlying socket will be reported.

VIRTIO_NET_F_HASH_REPORT requires VIRTIO_F_VERSION_1.

Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
Tested-by: Lei Yang <leiyang@redhat.com>
---
 drivers/vhost/net.c | 68 +++++++++++++++++++++++++++--------------------------
 1 file changed, 35 insertions(+), 33 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index b9b9e9d40951..fc5b43e43a06 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -73,6 +73,7 @@ enum {
 	VHOST_NET_FEATURES = VHOST_FEATURES |
 			 (1ULL << VHOST_NET_F_VIRTIO_NET_HDR) |
 			 (1ULL << VIRTIO_NET_F_MRG_RXBUF) |
+			 (1ULL << VIRTIO_NET_F_HASH_REPORT) |
 			 (1ULL << VIRTIO_F_ACCESS_PLATFORM) |
 			 (1ULL << VIRTIO_F_RING_RESET)
 };
@@ -1097,10 +1098,6 @@ static void handle_rx(struct vhost_net *net)
 		.msg_controllen = 0,
 		.msg_flags = MSG_DONTWAIT,
 	};
-	struct virtio_net_hdr hdr = {
-		.flags = 0,
-		.gso_type = VIRTIO_NET_HDR_GSO_NONE
-	};
 	size_t total_len = 0;
 	int err, mergeable;
 	s16 headcount;
@@ -1174,11 +1171,15 @@ static void handle_rx(struct vhost_net *net)
 		/* We don't need to be notified again. */
 		iov_iter_init(&msg.msg_iter, ITER_DEST, vq->iov, in, vhost_len);
 		fixup = msg.msg_iter;
-		if (unlikely((vhost_hlen))) {
-			/* We will supply the header ourselves
-			 * TODO: support TSO.
-			 */
-			iov_iter_advance(&msg.msg_iter, vhost_hlen);
+		/*
+		 * Supply virtio_net_hdr if VHOST_NET_F_VIRTIO_NET_HDR
+		 * TODO: support TSO.
+		 */
+		if (unlikely(vhost_hlen) &&
+		    iov_iter_zero(vhost_hlen, &msg.msg_iter) != vhost_hlen) {
+			vq_err(vq, "Unable to write vnet_hdr at addr %p\n",
+			       vq->iov->iov_base);
+			goto out;
 		}
 		err = sock->ops->recvmsg(sock, &msg,
 					 sock_len, MSG_DONTWAIT | MSG_TRUNC);
@@ -1191,30 +1192,24 @@ static void handle_rx(struct vhost_net *net)
 			vhost_discard_vq_desc(vq, headcount);
 			continue;
 		}
-		/* Supply virtio_net_hdr if VHOST_NET_F_VIRTIO_NET_HDR */
-		if (unlikely(vhost_hlen)) {
-			if (copy_to_iter(&hdr, sizeof(hdr),
-					 &fixup) != sizeof(hdr)) {
-				vq_err(vq, "Unable to write vnet_hdr "
-				       "at addr %p\n", vq->iov->iov_base);
-				goto out;
-			}
-		} else {
-			/* Header came from socket; we'll need to patch
-			 * ->num_buffers over if VIRTIO_NET_F_MRG_RXBUF
-			 */
-			iov_iter_advance(&fixup, sizeof(hdr));
-		}
 		/* TODO: Should check and handle checksum. */
 
+		/*
+		 * We'll need to patch ->num_buffers over if
+		 * VIRTIO_NET_F_MRG_RXBUF or VIRTIO_F_VERSION_1
+		 */
 		num_buffers = cpu_to_vhost16(vq, headcount);
-		if (likely(set_num_buffers) &&
-		    copy_to_iter(&num_buffers, sizeof num_buffers,
-				 &fixup) != sizeof num_buffers) {
-			vq_err(vq, "Failed num_buffers write");
-			vhost_discard_vq_desc(vq, headcount);
-			goto out;
+		if (likely(set_num_buffers)) {
+			iov_iter_advance(&fixup, offsetof(struct virtio_net_hdr_v1, num_buffers));
+
+			if (copy_to_iter(&num_buffers, sizeof(num_buffers),
+					 &fixup) != sizeof(num_buffers)) {
+				vq_err(vq, "Failed num_buffers write");
+				vhost_discard_vq_desc(vq, headcount);
+				goto out;
+			}
 		}
+
 		nvq->done_idx += headcount;
 		if (nvq->done_idx > VHOST_NET_BATCH)
 			vhost_net_signal_used(nvq);
@@ -1607,10 +1602,13 @@ static int vhost_net_set_features(struct vhost_net *n, u64 features)
 	size_t vhost_hlen, sock_hlen, hdr_len;
 	int i;
 
-	hdr_len = (features & ((1ULL << VIRTIO_NET_F_MRG_RXBUF) |
-			       (1ULL << VIRTIO_F_VERSION_1))) ?
-			sizeof(struct virtio_net_hdr_mrg_rxbuf) :
-			sizeof(struct virtio_net_hdr);
+	if (features & (1ULL << VIRTIO_NET_F_HASH_REPORT))
+		hdr_len = sizeof(struct virtio_net_hdr_v1_hash);
+	else if (features & ((1ULL << VIRTIO_NET_F_MRG_RXBUF) |
+			     (1ULL << VIRTIO_F_VERSION_1)))
+		hdr_len = sizeof(struct virtio_net_hdr_mrg_rxbuf);
+	else
+		hdr_len = sizeof(struct virtio_net_hdr);
 	if (features & (1 << VHOST_NET_F_VIRTIO_NET_HDR)) {
 		/* vhost provides vnet_hdr */
 		vhost_hlen = hdr_len;
@@ -1691,6 +1689,10 @@ static long vhost_net_ioctl(struct file *f, unsigned int ioctl,
 			return -EFAULT;
 		if (features & ~VHOST_NET_FEATURES)
 			return -EOPNOTSUPP;
+		if ((features & ((1ULL << VIRTIO_F_VERSION_1) |
+				 (1ULL << VIRTIO_NET_F_HASH_REPORT))) ==
+		    (1ULL << VIRTIO_NET_F_HASH_REPORT))
+			return -EINVAL;
 		return vhost_net_set_features(n, features);
 	case VHOST_GET_BACKEND_FEATURES:
 		features = VHOST_NET_BACKEND_FEATURES;

-- 
2.49.0


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next v12 01/10] virtio_net: Add functions for hashing
  2025-05-30  4:50 ` [PATCH net-next v12 01/10] virtio_net: Add functions for hashing Akihiko Odaki
@ 2025-06-03  3:19   ` Jason Wang
  2025-06-03  5:31     ` Akihiko Odaki
  0 siblings, 1 reply; 39+ messages in thread
From: Jason Wang @ 2025-06-03  3:19 UTC (permalink / raw)
  To: Akihiko Odaki
  Cc: Jonathan Corbet, Willem de Bruijn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Michael S. Tsirkin, Xuan Zhuo,
	Shuah Khan, linux-doc, linux-kernel, netdev, kvm, virtualization,
	linux-kselftest, Yuri Benditovich, Andrew Melnychenko,
	Stephen Hemminger, gur.stavi, Lei Yang, Simon Horman

On Fri, May 30, 2025 at 12:50 PM Akihiko Odaki <akihiko.odaki@daynix.com> wrote:
>
> They are useful to implement VIRTIO_NET_F_RSS and
> VIRTIO_NET_F_HASH_REPORT.
>
> Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
> Tested-by: Lei Yang <leiyang@redhat.com>
> ---
>  include/linux/virtio_net.h | 188 +++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 188 insertions(+)
>
> diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h
> index 02a9f4dc594d..426f33b4b824 100644
> --- a/include/linux/virtio_net.h
> +++ b/include/linux/virtio_net.h
> @@ -9,6 +9,194 @@
>  #include <uapi/linux/tcp.h>
>  #include <uapi/linux/virtio_net.h>
>
> +struct virtio_net_hash {
> +       u32 value;
> +       u16 report;
> +};
> +
> +struct virtio_net_toeplitz_state {
> +       u32 hash;
> +       const u32 *key;
> +};
> +
> +#define VIRTIO_NET_SUPPORTED_HASH_TYPES (VIRTIO_NET_RSS_HASH_TYPE_IPv4 | \
> +                                        VIRTIO_NET_RSS_HASH_TYPE_TCPv4 | \
> +                                        VIRTIO_NET_RSS_HASH_TYPE_UDPv4 | \
> +                                        VIRTIO_NET_RSS_HASH_TYPE_IPv6 | \
> +                                        VIRTIO_NET_RSS_HASH_TYPE_TCPv6 | \
> +                                        VIRTIO_NET_RSS_HASH_TYPE_UDPv6)
> +
> +#define VIRTIO_NET_RSS_MAX_KEY_SIZE 40
> +
> +static inline void virtio_net_toeplitz_convert_key(u32 *input, size_t len)
> +{
> +       while (len >= sizeof(*input)) {
> +               *input = be32_to_cpu((__force __be32)*input);
> +               input++;
> +               len -= sizeof(*input);
> +       }
> +}
> +
> +static inline void virtio_net_toeplitz_calc(struct virtio_net_toeplitz_state *state,
> +                                           const __be32 *input, size_t len)
> +{
> +       while (len >= sizeof(*input)) {
> +               for (u32 map = be32_to_cpu(*input); map; map &= (map - 1)) {
> +                       u32 i = ffs(map);
> +
> +                       state->hash ^= state->key[0] << (32 - i) |
> +                                      (u32)((u64)state->key[1] >> i);
> +               }
> +
> +               state->key++;
> +               input++;
> +               len -= sizeof(*input);
> +       }
> +}
> +
> +static inline u8 virtio_net_hash_key_length(u32 types)
> +{
> +       size_t len = 0;
> +
> +       if (types & VIRTIO_NET_HASH_REPORT_IPv4)
> +               len = max(len,
> +                         sizeof(struct flow_dissector_key_ipv4_addrs));
> +
> +       if (types &
> +           (VIRTIO_NET_HASH_REPORT_TCPv4 | VIRTIO_NET_HASH_REPORT_UDPv4))
> +               len = max(len,
> +                         sizeof(struct flow_dissector_key_ipv4_addrs) +
> +                         sizeof(struct flow_dissector_key_ports));
> +
> +       if (types & VIRTIO_NET_HASH_REPORT_IPv6)
> +               len = max(len,
> +                         sizeof(struct flow_dissector_key_ipv6_addrs));
> +
> +       if (types &
> +           (VIRTIO_NET_HASH_REPORT_TCPv6 | VIRTIO_NET_HASH_REPORT_UDPv6))
> +               len = max(len,
> +                         sizeof(struct flow_dissector_key_ipv6_addrs) +
> +                         sizeof(struct flow_dissector_key_ports));
> +
> +       return len + sizeof(u32);
> +}
> +
> +static inline u32 virtio_net_hash_report(u32 types,
> +                                        const struct flow_keys_basic *keys)
> +{
> +       switch (keys->basic.n_proto) {
> +       case cpu_to_be16(ETH_P_IP):
> +               if (!(keys->control.flags & FLOW_DIS_IS_FRAGMENT)) {
> +                       if (keys->basic.ip_proto == IPPROTO_TCP &&
> +                           (types & VIRTIO_NET_RSS_HASH_TYPE_TCPv4))
> +                               return VIRTIO_NET_HASH_REPORT_TCPv4;
> +
> +                       if (keys->basic.ip_proto == IPPROTO_UDP &&
> +                           (types & VIRTIO_NET_RSS_HASH_TYPE_UDPv4))
> +                               return VIRTIO_NET_HASH_REPORT_UDPv4;
> +               }
> +
> +               if (types & VIRTIO_NET_RSS_HASH_TYPE_IPv4)
> +                       return VIRTIO_NET_HASH_REPORT_IPv4;
> +
> +               return VIRTIO_NET_HASH_REPORT_NONE;
> +
> +       case cpu_to_be16(ETH_P_IPV6):
> +               if (!(keys->control.flags & FLOW_DIS_IS_FRAGMENT)) {
> +                       if (keys->basic.ip_proto == IPPROTO_TCP &&
> +                           (types & VIRTIO_NET_RSS_HASH_TYPE_TCPv6))
> +                               return VIRTIO_NET_HASH_REPORT_TCPv6;
> +
> +                       if (keys->basic.ip_proto == IPPROTO_UDP &&
> +                           (types & VIRTIO_NET_RSS_HASH_TYPE_UDPv6))
> +                               return VIRTIO_NET_HASH_REPORT_UDPv6;
> +               }
> +
> +               if (types & VIRTIO_NET_RSS_HASH_TYPE_IPv6)
> +                       return VIRTIO_NET_HASH_REPORT_IPv6;
> +
> +               return VIRTIO_NET_HASH_REPORT_NONE;
> +
> +       default:
> +               return VIRTIO_NET_HASH_REPORT_NONE;
> +       }
> +}
> +
> +static inline void virtio_net_hash_rss(const struct sk_buff *skb,
> +                                      u32 types, const u32 *key,
> +                                      struct virtio_net_hash *hash)
> +{
> +       struct virtio_net_toeplitz_state toeplitz_state = { .key = key };
> +       struct flow_keys flow;
> +       struct flow_keys_basic flow_basic;
> +       u16 report;
> +
> +       if (!skb_flow_dissect_flow_keys(skb, &flow, 0)) {
> +               hash->report = VIRTIO_NET_HASH_REPORT_NONE;
> +               return;
> +       }
> +
> +       flow_basic = (struct flow_keys_basic) {
> +               .control = flow.control,
> +               .basic = flow.basic
> +       };
> +
> +       report = virtio_net_hash_report(types, &flow_basic);
> +
> +       switch (report) {
> +       case VIRTIO_NET_HASH_REPORT_IPv4:
> +               virtio_net_toeplitz_calc(&toeplitz_state,
> +                                        (__be32 *)&flow.addrs.v4addrs,
> +                                        sizeof(flow.addrs.v4addrs));
> +               break;
> +
> +       case VIRTIO_NET_HASH_REPORT_TCPv4:
> +               virtio_net_toeplitz_calc(&toeplitz_state,
> +                                        (__be32 *)&flow.addrs.v4addrs,
> +                                        sizeof(flow.addrs.v4addrs));
> +               virtio_net_toeplitz_calc(&toeplitz_state, &flow.ports.ports,
> +                                        sizeof(flow.ports.ports));
> +               break;
> +
> +       case VIRTIO_NET_HASH_REPORT_UDPv4:
> +               virtio_net_toeplitz_calc(&toeplitz_state,
> +                                        (__be32 *)&flow.addrs.v4addrs,
> +                                        sizeof(flow.addrs.v4addrs));
> +               virtio_net_toeplitz_calc(&toeplitz_state, &flow.ports.ports,
> +                                        sizeof(flow.ports.ports));
> +               break;
> +
> +       case VIRTIO_NET_HASH_REPORT_IPv6:
> +               virtio_net_toeplitz_calc(&toeplitz_state,
> +                                        (__be32 *)&flow.addrs.v6addrs,
> +                                        sizeof(flow.addrs.v6addrs));
> +               break;
> +
> +       case VIRTIO_NET_HASH_REPORT_TCPv6:
> +               virtio_net_toeplitz_calc(&toeplitz_state,
> +                                        (__be32 *)&flow.addrs.v6addrs,
> +                                        sizeof(flow.addrs.v6addrs));
> +               virtio_net_toeplitz_calc(&toeplitz_state, &flow.ports.ports,
> +                                        sizeof(flow.ports.ports));
> +               break;
> +
> +       case VIRTIO_NET_HASH_REPORT_UDPv6:
> +               virtio_net_toeplitz_calc(&toeplitz_state,
> +                                        (__be32 *)&flow.addrs.v6addrs,
> +                                        sizeof(flow.addrs.v6addrs));
> +               virtio_net_toeplitz_calc(&toeplitz_state, &flow.ports.ports,
> +                                        sizeof(flow.ports.ports));
> +               break;
> +
> +       default:
> +               hash->report = VIRTIO_NET_HASH_REPORT_NONE;
> +               return;

So I still think we need a comment here to explain why this is not an
issue if the device can report HASH_XXX_EX. Or we need to add the
support, since this is the code from the driver side, I don't think we
need to worry about the device implementation issues.

For the issue of the number of options, does the spec forbid fallback
to VIRTIO_NET_HASH_REPORT_NONE? If not, we can do that.

Thanks

> +       }
> +
> +       hash->value = toeplitz_state.hash;
> +       hash->report = report;
> +}
> +
>  static inline bool virtio_net_hdr_match_proto(__be16 protocol, __u8 gso_type)
>  {
>         switch (gso_type & ~VIRTIO_NET_HDR_GSO_ECN) {
>
> --
> 2.49.0
>


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next v12 02/10] net: flow_dissector: Export flow_keys_dissector_symmetric
  2025-05-30  4:50 ` [PATCH net-next v12 02/10] net: flow_dissector: Export flow_keys_dissector_symmetric Akihiko Odaki
@ 2025-06-03  3:20   ` Jason Wang
  0 siblings, 0 replies; 39+ messages in thread
From: Jason Wang @ 2025-06-03  3:20 UTC (permalink / raw)
  To: Akihiko Odaki
  Cc: Jonathan Corbet, Willem de Bruijn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Michael S. Tsirkin, Xuan Zhuo,
	Shuah Khan, linux-doc, linux-kernel, netdev, kvm, virtualization,
	linux-kselftest, Yuri Benditovich, Andrew Melnychenko,
	Stephen Hemminger, gur.stavi, Lei Yang, Simon Horman

On Fri, May 30, 2025 at 12:50 PM Akihiko Odaki <akihiko.odaki@daynix.com> wrote:
>
> flow_keys_dissector_symmetric is useful to derive a symmetric hash
> and to know its source such as IPv4, IPv6, TCP, and UDP.
>
> Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
> Tested-by: Lei Yang <leiyang@redhat.com>
> ---

Acked-by: Jason Wang <jasowang@redhat.com>

Thanks


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next v12 01/10] virtio_net: Add functions for hashing
  2025-06-03  3:19   ` Jason Wang
@ 2025-06-03  5:31     ` Akihiko Odaki
  2025-06-04  1:18       ` Jason Wang
  0 siblings, 1 reply; 39+ messages in thread
From: Akihiko Odaki @ 2025-06-03  5:31 UTC (permalink / raw)
  To: Jason Wang
  Cc: Jonathan Corbet, Willem de Bruijn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Michael S. Tsirkin, Xuan Zhuo,
	Shuah Khan, linux-doc, linux-kernel, netdev, kvm, virtualization,
	linux-kselftest, Yuri Benditovich, Andrew Melnychenko,
	Stephen Hemminger, gur.stavi, Lei Yang, Simon Horman

On 2025/06/03 12:19, Jason Wang wrote:
> On Fri, May 30, 2025 at 12:50 PM Akihiko Odaki <akihiko.odaki@daynix.com> wrote:
>>
>> They are useful to implement VIRTIO_NET_F_RSS and
>> VIRTIO_NET_F_HASH_REPORT.
>>
>> Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
>> Tested-by: Lei Yang <leiyang@redhat.com>
>> ---
>>   include/linux/virtio_net.h | 188 +++++++++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 188 insertions(+)
>>
>> diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h
>> index 02a9f4dc594d..426f33b4b824 100644
>> --- a/include/linux/virtio_net.h
>> +++ b/include/linux/virtio_net.h
>> @@ -9,6 +9,194 @@
>>   #include <uapi/linux/tcp.h>
>>   #include <uapi/linux/virtio_net.h>
>>
>> +struct virtio_net_hash {
>> +       u32 value;
>> +       u16 report;
>> +};
>> +
>> +struct virtio_net_toeplitz_state {
>> +       u32 hash;
>> +       const u32 *key;
>> +};
>> +
>> +#define VIRTIO_NET_SUPPORTED_HASH_TYPES (VIRTIO_NET_RSS_HASH_TYPE_IPv4 | \
>> +                                        VIRTIO_NET_RSS_HASH_TYPE_TCPv4 | \
>> +                                        VIRTIO_NET_RSS_HASH_TYPE_UDPv4 | \
>> +                                        VIRTIO_NET_RSS_HASH_TYPE_IPv6 | \
>> +                                        VIRTIO_NET_RSS_HASH_TYPE_TCPv6 | \
>> +                                        VIRTIO_NET_RSS_HASH_TYPE_UDPv6)
>> +
>> +#define VIRTIO_NET_RSS_MAX_KEY_SIZE 40
>> +
>> +static inline void virtio_net_toeplitz_convert_key(u32 *input, size_t len)
>> +{
>> +       while (len >= sizeof(*input)) {
>> +               *input = be32_to_cpu((__force __be32)*input);
>> +               input++;
>> +               len -= sizeof(*input);
>> +       }
>> +}
>> +
>> +static inline void virtio_net_toeplitz_calc(struct virtio_net_toeplitz_state *state,
>> +                                           const __be32 *input, size_t len)
>> +{
>> +       while (len >= sizeof(*input)) {
>> +               for (u32 map = be32_to_cpu(*input); map; map &= (map - 1)) {
>> +                       u32 i = ffs(map);
>> +
>> +                       state->hash ^= state->key[0] << (32 - i) |
>> +                                      (u32)((u64)state->key[1] >> i);
>> +               }
>> +
>> +               state->key++;
>> +               input++;
>> +               len -= sizeof(*input);
>> +       }
>> +}
>> +
>> +static inline u8 virtio_net_hash_key_length(u32 types)
>> +{
>> +       size_t len = 0;
>> +
>> +       if (types & VIRTIO_NET_HASH_REPORT_IPv4)
>> +               len = max(len,
>> +                         sizeof(struct flow_dissector_key_ipv4_addrs));
>> +
>> +       if (types &
>> +           (VIRTIO_NET_HASH_REPORT_TCPv4 | VIRTIO_NET_HASH_REPORT_UDPv4))
>> +               len = max(len,
>> +                         sizeof(struct flow_dissector_key_ipv4_addrs) +
>> +                         sizeof(struct flow_dissector_key_ports));
>> +
>> +       if (types & VIRTIO_NET_HASH_REPORT_IPv6)
>> +               len = max(len,
>> +                         sizeof(struct flow_dissector_key_ipv6_addrs));
>> +
>> +       if (types &
>> +           (VIRTIO_NET_HASH_REPORT_TCPv6 | VIRTIO_NET_HASH_REPORT_UDPv6))
>> +               len = max(len,
>> +                         sizeof(struct flow_dissector_key_ipv6_addrs) +
>> +                         sizeof(struct flow_dissector_key_ports));
>> +
>> +       return len + sizeof(u32);
>> +}
>> +
>> +static inline u32 virtio_net_hash_report(u32 types,
>> +                                        const struct flow_keys_basic *keys)
>> +{
>> +       switch (keys->basic.n_proto) {
>> +       case cpu_to_be16(ETH_P_IP):
>> +               if (!(keys->control.flags & FLOW_DIS_IS_FRAGMENT)) {
>> +                       if (keys->basic.ip_proto == IPPROTO_TCP &&
>> +                           (types & VIRTIO_NET_RSS_HASH_TYPE_TCPv4))
>> +                               return VIRTIO_NET_HASH_REPORT_TCPv4;
>> +
>> +                       if (keys->basic.ip_proto == IPPROTO_UDP &&
>> +                           (types & VIRTIO_NET_RSS_HASH_TYPE_UDPv4))
>> +                               return VIRTIO_NET_HASH_REPORT_UDPv4;
>> +               }
>> +
>> +               if (types & VIRTIO_NET_RSS_HASH_TYPE_IPv4)
>> +                       return VIRTIO_NET_HASH_REPORT_IPv4;
>> +
>> +               return VIRTIO_NET_HASH_REPORT_NONE;
>> +
>> +       case cpu_to_be16(ETH_P_IPV6):
>> +               if (!(keys->control.flags & FLOW_DIS_IS_FRAGMENT)) {
>> +                       if (keys->basic.ip_proto == IPPROTO_TCP &&
>> +                           (types & VIRTIO_NET_RSS_HASH_TYPE_TCPv6))
>> +                               return VIRTIO_NET_HASH_REPORT_TCPv6;
>> +
>> +                       if (keys->basic.ip_proto == IPPROTO_UDP &&
>> +                           (types & VIRTIO_NET_RSS_HASH_TYPE_UDPv6))
>> +                               return VIRTIO_NET_HASH_REPORT_UDPv6;
>> +               }
>> +
>> +               if (types & VIRTIO_NET_RSS_HASH_TYPE_IPv6)
>> +                       return VIRTIO_NET_HASH_REPORT_IPv6;
>> +
>> +               return VIRTIO_NET_HASH_REPORT_NONE;
>> +
>> +       default:
>> +               return VIRTIO_NET_HASH_REPORT_NONE;
>> +       }
>> +}
>> +
>> +static inline void virtio_net_hash_rss(const struct sk_buff *skb,
>> +                                      u32 types, const u32 *key,
>> +                                      struct virtio_net_hash *hash)
>> +{
>> +       struct virtio_net_toeplitz_state toeplitz_state = { .key = key };
>> +       struct flow_keys flow;
>> +       struct flow_keys_basic flow_basic;
>> +       u16 report;
>> +
>> +       if (!skb_flow_dissect_flow_keys(skb, &flow, 0)) {
>> +               hash->report = VIRTIO_NET_HASH_REPORT_NONE;
>> +               return;
>> +       }
>> +
>> +       flow_basic = (struct flow_keys_basic) {
>> +               .control = flow.control,
>> +               .basic = flow.basic
>> +       };
>> +
>> +       report = virtio_net_hash_report(types, &flow_basic);
>> +
>> +       switch (report) {
>> +       case VIRTIO_NET_HASH_REPORT_IPv4:
>> +               virtio_net_toeplitz_calc(&toeplitz_state,
>> +                                        (__be32 *)&flow.addrs.v4addrs,
>> +                                        sizeof(flow.addrs.v4addrs));
>> +               break;
>> +
>> +       case VIRTIO_NET_HASH_REPORT_TCPv4:
>> +               virtio_net_toeplitz_calc(&toeplitz_state,
>> +                                        (__be32 *)&flow.addrs.v4addrs,
>> +                                        sizeof(flow.addrs.v4addrs));
>> +               virtio_net_toeplitz_calc(&toeplitz_state, &flow.ports.ports,
>> +                                        sizeof(flow.ports.ports));
>> +               break;
>> +
>> +       case VIRTIO_NET_HASH_REPORT_UDPv4:
>> +               virtio_net_toeplitz_calc(&toeplitz_state,
>> +                                        (__be32 *)&flow.addrs.v4addrs,
>> +                                        sizeof(flow.addrs.v4addrs));
>> +               virtio_net_toeplitz_calc(&toeplitz_state, &flow.ports.ports,
>> +                                        sizeof(flow.ports.ports));
>> +               break;
>> +
>> +       case VIRTIO_NET_HASH_REPORT_IPv6:
>> +               virtio_net_toeplitz_calc(&toeplitz_state,
>> +                                        (__be32 *)&flow.addrs.v6addrs,
>> +                                        sizeof(flow.addrs.v6addrs));
>> +               break;
>> +
>> +       case VIRTIO_NET_HASH_REPORT_TCPv6:
>> +               virtio_net_toeplitz_calc(&toeplitz_state,
>> +                                        (__be32 *)&flow.addrs.v6addrs,
>> +                                        sizeof(flow.addrs.v6addrs));
>> +               virtio_net_toeplitz_calc(&toeplitz_state, &flow.ports.ports,
>> +                                        sizeof(flow.ports.ports));
>> +               break;
>> +
>> +       case VIRTIO_NET_HASH_REPORT_UDPv6:
>> +               virtio_net_toeplitz_calc(&toeplitz_state,
>> +                                        (__be32 *)&flow.addrs.v6addrs,
>> +                                        sizeof(flow.addrs.v6addrs));
>> +               virtio_net_toeplitz_calc(&toeplitz_state, &flow.ports.ports,
>> +                                        sizeof(flow.ports.ports));
>> +               break;
>> +
>> +       default:
>> +               hash->report = VIRTIO_NET_HASH_REPORT_NONE;
>> +               return;
> 
> So I still think we need a comment here to explain why this is not an
> issue if the device can report HASH_XXX_EX. Or we need to add the
> support, since this is the code from the driver side, I don't think we
> need to worry about the device implementation issues.

This is on the device side, and don't report HASH_TYPE_XXX_EX.

> 
> For the issue of the number of options, does the spec forbid fallback
> to VIRTIO_NET_HASH_REPORT_NONE? If not, we can do that.

5.1.6.4.3.4 "IPv6 packets with extension header" says:
 > If VIRTIO_NET_HASH_TYPE_TCP_EX is set and the packet has a TCPv6
 > header, the hash is calculated over the following fields:
 > - Home address from the home address option in the IPv6 destination
 >   options header. If the extension header is not present, use the
 >   Source IPv6 address.
 > - IPv6 address that is contained in the Routing-Header-Type-2 from the
 >   associated extension header. If the extension header is not present,
 >   use the Destination IPv6 address.
 > - Source TCP port
 > - Destination TCP port

Therefore, if VIRTIO_NET_HASH_TYPE_TCP_EX is set, the packet has a TCPv6 
and an home address option in the IPv6 destination options header is 
present, the hash is calculated over the home address. If the hash is 
not calculated over the home address in such a case, the device is 
contradicting with this section and violating the spec. The same goes 
for the other HASH_TYPE_XXX_EX types and Routing-Header-Type-2.

Regards,
Akihiko Odaki

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next v12 01/10] virtio_net: Add functions for hashing
  2025-06-03  5:31     ` Akihiko Odaki
@ 2025-06-04  1:18       ` Jason Wang
  2025-06-04  7:20         ` Akihiko Odaki
  0 siblings, 1 reply; 39+ messages in thread
From: Jason Wang @ 2025-06-04  1:18 UTC (permalink / raw)
  To: Akihiko Odaki
  Cc: Jonathan Corbet, Willem de Bruijn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Michael S. Tsirkin, Xuan Zhuo,
	Shuah Khan, linux-doc, linux-kernel, netdev, kvm, virtualization,
	linux-kselftest, Yuri Benditovich, Andrew Melnychenko,
	Stephen Hemminger, gur.stavi, Lei Yang, Simon Horman

On Tue, Jun 3, 2025 at 1:31 PM Akihiko Odaki <akihiko.odaki@daynix.com> wrote:
>
> On 2025/06/03 12:19, Jason Wang wrote:
> > On Fri, May 30, 2025 at 12:50 PM Akihiko Odaki <akihiko.odaki@daynix.com> wrote:
> >>
> >> They are useful to implement VIRTIO_NET_F_RSS and
> >> VIRTIO_NET_F_HASH_REPORT.
> >>
> >> Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
> >> Tested-by: Lei Yang <leiyang@redhat.com>
> >> ---
> >>   include/linux/virtio_net.h | 188 +++++++++++++++++++++++++++++++++++++++++++++
> >>   1 file changed, 188 insertions(+)
> >>
> >> diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h
> >> index 02a9f4dc594d..426f33b4b824 100644
> >> --- a/include/linux/virtio_net.h
> >> +++ b/include/linux/virtio_net.h
> >> @@ -9,6 +9,194 @@
> >>   #include <uapi/linux/tcp.h>
> >>   #include <uapi/linux/virtio_net.h>
> >>
> >> +struct virtio_net_hash {
> >> +       u32 value;
> >> +       u16 report;
> >> +};
> >> +
> >> +struct virtio_net_toeplitz_state {
> >> +       u32 hash;
> >> +       const u32 *key;
> >> +};
> >> +
> >> +#define VIRTIO_NET_SUPPORTED_HASH_TYPES (VIRTIO_NET_RSS_HASH_TYPE_IPv4 | \
> >> +                                        VIRTIO_NET_RSS_HASH_TYPE_TCPv4 | \
> >> +                                        VIRTIO_NET_RSS_HASH_TYPE_UDPv4 | \
> >> +                                        VIRTIO_NET_RSS_HASH_TYPE_IPv6 | \
> >> +                                        VIRTIO_NET_RSS_HASH_TYPE_TCPv6 | \
> >> +                                        VIRTIO_NET_RSS_HASH_TYPE_UDPv6)
> >> +
> >> +#define VIRTIO_NET_RSS_MAX_KEY_SIZE 40
> >> +
> >> +static inline void virtio_net_toeplitz_convert_key(u32 *input, size_t len)
> >> +{
> >> +       while (len >= sizeof(*input)) {
> >> +               *input = be32_to_cpu((__force __be32)*input);
> >> +               input++;
> >> +               len -= sizeof(*input);
> >> +       }
> >> +}
> >> +
> >> +static inline void virtio_net_toeplitz_calc(struct virtio_net_toeplitz_state *state,
> >> +                                           const __be32 *input, size_t len)
> >> +{
> >> +       while (len >= sizeof(*input)) {
> >> +               for (u32 map = be32_to_cpu(*input); map; map &= (map - 1)) {
> >> +                       u32 i = ffs(map);
> >> +
> >> +                       state->hash ^= state->key[0] << (32 - i) |
> >> +                                      (u32)((u64)state->key[1] >> i);
> >> +               }
> >> +
> >> +               state->key++;
> >> +               input++;
> >> +               len -= sizeof(*input);
> >> +       }
> >> +}
> >> +
> >> +static inline u8 virtio_net_hash_key_length(u32 types)
> >> +{
> >> +       size_t len = 0;
> >> +
> >> +       if (types & VIRTIO_NET_HASH_REPORT_IPv4)
> >> +               len = max(len,
> >> +                         sizeof(struct flow_dissector_key_ipv4_addrs));
> >> +
> >> +       if (types &
> >> +           (VIRTIO_NET_HASH_REPORT_TCPv4 | VIRTIO_NET_HASH_REPORT_UDPv4))
> >> +               len = max(len,
> >> +                         sizeof(struct flow_dissector_key_ipv4_addrs) +
> >> +                         sizeof(struct flow_dissector_key_ports));
> >> +
> >> +       if (types & VIRTIO_NET_HASH_REPORT_IPv6)
> >> +               len = max(len,
> >> +                         sizeof(struct flow_dissector_key_ipv6_addrs));
> >> +
> >> +       if (types &
> >> +           (VIRTIO_NET_HASH_REPORT_TCPv6 | VIRTIO_NET_HASH_REPORT_UDPv6))
> >> +               len = max(len,
> >> +                         sizeof(struct flow_dissector_key_ipv6_addrs) +
> >> +                         sizeof(struct flow_dissector_key_ports));
> >> +
> >> +       return len + sizeof(u32);
> >> +}
> >> +
> >> +static inline u32 virtio_net_hash_report(u32 types,
> >> +                                        const struct flow_keys_basic *keys)
> >> +{
> >> +       switch (keys->basic.n_proto) {
> >> +       case cpu_to_be16(ETH_P_IP):
> >> +               if (!(keys->control.flags & FLOW_DIS_IS_FRAGMENT)) {
> >> +                       if (keys->basic.ip_proto == IPPROTO_TCP &&
> >> +                           (types & VIRTIO_NET_RSS_HASH_TYPE_TCPv4))
> >> +                               return VIRTIO_NET_HASH_REPORT_TCPv4;
> >> +
> >> +                       if (keys->basic.ip_proto == IPPROTO_UDP &&
> >> +                           (types & VIRTIO_NET_RSS_HASH_TYPE_UDPv4))
> >> +                               return VIRTIO_NET_HASH_REPORT_UDPv4;
> >> +               }
> >> +
> >> +               if (types & VIRTIO_NET_RSS_HASH_TYPE_IPv4)
> >> +                       return VIRTIO_NET_HASH_REPORT_IPv4;
> >> +
> >> +               return VIRTIO_NET_HASH_REPORT_NONE;
> >> +
> >> +       case cpu_to_be16(ETH_P_IPV6):
> >> +               if (!(keys->control.flags & FLOW_DIS_IS_FRAGMENT)) {
> >> +                       if (keys->basic.ip_proto == IPPROTO_TCP &&
> >> +                           (types & VIRTIO_NET_RSS_HASH_TYPE_TCPv6))
> >> +                               return VIRTIO_NET_HASH_REPORT_TCPv6;
> >> +
> >> +                       if (keys->basic.ip_proto == IPPROTO_UDP &&
> >> +                           (types & VIRTIO_NET_RSS_HASH_TYPE_UDPv6))
> >> +                               return VIRTIO_NET_HASH_REPORT_UDPv6;
> >> +               }
> >> +
> >> +               if (types & VIRTIO_NET_RSS_HASH_TYPE_IPv6)
> >> +                       return VIRTIO_NET_HASH_REPORT_IPv6;
> >> +
> >> +               return VIRTIO_NET_HASH_REPORT_NONE;
> >> +
> >> +       default:
> >> +               return VIRTIO_NET_HASH_REPORT_NONE;
> >> +       }
> >> +}
> >> +
> >> +static inline void virtio_net_hash_rss(const struct sk_buff *skb,
> >> +                                      u32 types, const u32 *key,
> >> +                                      struct virtio_net_hash *hash)
> >> +{
> >> +       struct virtio_net_toeplitz_state toeplitz_state = { .key = key };
> >> +       struct flow_keys flow;
> >> +       struct flow_keys_basic flow_basic;
> >> +       u16 report;
> >> +
> >> +       if (!skb_flow_dissect_flow_keys(skb, &flow, 0)) {
> >> +               hash->report = VIRTIO_NET_HASH_REPORT_NONE;
> >> +               return;
> >> +       }
> >> +
> >> +       flow_basic = (struct flow_keys_basic) {
> >> +               .control = flow.control,
> >> +               .basic = flow.basic
> >> +       };
> >> +
> >> +       report = virtio_net_hash_report(types, &flow_basic);
> >> +
> >> +       switch (report) {
> >> +       case VIRTIO_NET_HASH_REPORT_IPv4:
> >> +               virtio_net_toeplitz_calc(&toeplitz_state,
> >> +                                        (__be32 *)&flow.addrs.v4addrs,
> >> +                                        sizeof(flow.addrs.v4addrs));
> >> +               break;
> >> +
> >> +       case VIRTIO_NET_HASH_REPORT_TCPv4:
> >> +               virtio_net_toeplitz_calc(&toeplitz_state,
> >> +                                        (__be32 *)&flow.addrs.v4addrs,
> >> +                                        sizeof(flow.addrs.v4addrs));
> >> +               virtio_net_toeplitz_calc(&toeplitz_state, &flow.ports.ports,
> >> +                                        sizeof(flow.ports.ports));
> >> +               break;
> >> +
> >> +       case VIRTIO_NET_HASH_REPORT_UDPv4:
> >> +               virtio_net_toeplitz_calc(&toeplitz_state,
> >> +                                        (__be32 *)&flow.addrs.v4addrs,
> >> +                                        sizeof(flow.addrs.v4addrs));
> >> +               virtio_net_toeplitz_calc(&toeplitz_state, &flow.ports.ports,
> >> +                                        sizeof(flow.ports.ports));
> >> +               break;
> >> +
> >> +       case VIRTIO_NET_HASH_REPORT_IPv6:
> >> +               virtio_net_toeplitz_calc(&toeplitz_state,
> >> +                                        (__be32 *)&flow.addrs.v6addrs,
> >> +                                        sizeof(flow.addrs.v6addrs));
> >> +               break;
> >> +
> >> +       case VIRTIO_NET_HASH_REPORT_TCPv6:
> >> +               virtio_net_toeplitz_calc(&toeplitz_state,
> >> +                                        (__be32 *)&flow.addrs.v6addrs,
> >> +                                        sizeof(flow.addrs.v6addrs));
> >> +               virtio_net_toeplitz_calc(&toeplitz_state, &flow.ports.ports,
> >> +                                        sizeof(flow.ports.ports));
> >> +               break;
> >> +
> >> +       case VIRTIO_NET_HASH_REPORT_UDPv6:
> >> +               virtio_net_toeplitz_calc(&toeplitz_state,
> >> +                                        (__be32 *)&flow.addrs.v6addrs,
> >> +                                        sizeof(flow.addrs.v6addrs));
> >> +               virtio_net_toeplitz_calc(&toeplitz_state, &flow.ports.ports,
> >> +                                        sizeof(flow.ports.ports));
> >> +               break;
> >> +
> >> +       default:
> >> +               hash->report = VIRTIO_NET_HASH_REPORT_NONE;
> >> +               return;
> >
> > So I still think we need a comment here to explain why this is not an
> > issue if the device can report HASH_XXX_EX. Or we need to add the
> > support, since this is the code from the driver side, I don't think we
> > need to worry about the device implementation issues.
>
> This is on the device side, and don't report HASH_TYPE_XXX_EX.
>
> >
> > For the issue of the number of options, does the spec forbid fallback
> > to VIRTIO_NET_HASH_REPORT_NONE? If not, we can do that.
>
> 5.1.6.4.3.4 "IPv6 packets with extension header" says:
>  > If VIRTIO_NET_HASH_TYPE_TCP_EX is set and the packet has a TCPv6
>  > header, the hash is calculated over the following fields:
>  > - Home address from the home address option in the IPv6 destination
>  >   options header. If the extension header is not present, use the
>  >   Source IPv6 address.
>  > - IPv6 address that is contained in the Routing-Header-Type-2 from the
>  >   associated extension header. If the extension header is not present,
>  >   use the Destination IPv6 address.
>  > - Source TCP port
>  > - Destination TCP port
>
> Therefore, if VIRTIO_NET_HASH_TYPE_TCP_EX is set, the packet has a TCPv6
> and an home address option in the IPv6 destination options header is
> present, the hash is calculated over the home address. If the hash is
> not calculated over the home address in such a case, the device is
> contradicting with this section and violating the spec. The same goes
> for the other HASH_TYPE_XXX_EX types and Routing-Header-Type-2.

Just to make sure we are one the same page. I meant:

1) If the hash is not calculated over the home address (in the case of
IPv6 destination destination), it can still report
VIRTIO_NET_RSS_HASH_TYPE_IPv6. This is what you implemented in your
series. So the device can simply fallback to e.g TCPv6 if it can't
understand all or part of the IPv6 options.
2) the VIRTIO_NET_SUPPORTED_HASH_TYPES is not checked against the
tun_vnet_ioctl_sethash(), so userspace may set
VIRTIO_NET_HASH_TYPE_TCP_EX regardless of what has been returned by
tun_vnet_ioctl_gethashtypes(). In this case they won't get
VIRTIO_NET_HASH_TYPE_TCP_EX.
3) implementing part of the hash types might complicate the migration
or at least we need to describe the expectations of libvirt or other
management in this case. For example, do we plan to have a dedicated
Qemu command line like:

-device virtio-net-pci,hash_report=on,supported_hash_types=X,Y,Z?

Thanks

>
> Regards,
> Akihiko Odaki
>


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next v12 03/10] tun: Allow steering eBPF program to fall back
  2025-05-30  4:50 ` [PATCH net-next v12 03/10] tun: Allow steering eBPF program to fall back Akihiko Odaki
@ 2025-06-04  1:27   ` Jason Wang
  2025-06-04  7:24     ` Akihiko Odaki
  0 siblings, 1 reply; 39+ messages in thread
From: Jason Wang @ 2025-06-04  1:27 UTC (permalink / raw)
  To: Akihiko Odaki
  Cc: Jonathan Corbet, Willem de Bruijn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Michael S. Tsirkin, Xuan Zhuo,
	Shuah Khan, linux-doc, linux-kernel, netdev, kvm, virtualization,
	linux-kselftest, Yuri Benditovich, Andrew Melnychenko,
	Stephen Hemminger, gur.stavi, Lei Yang, Simon Horman

On Fri, May 30, 2025 at 12:50 PM Akihiko Odaki <akihiko.odaki@daynix.com> wrote:
>
> This clarifies a steering eBPF program takes precedence over the other
> steering algorithms.

Let's give an example on the use case for this.

>
> Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
> ---
>  Documentation/networking/tuntap.rst |  7 +++++++
>  drivers/net/tun.c                   | 28 +++++++++++++++++-----------
>  include/uapi/linux/if_tun.h         |  9 +++++++++
>  3 files changed, 33 insertions(+), 11 deletions(-)
>
> diff --git a/Documentation/networking/tuntap.rst b/Documentation/networking/tuntap.rst
> index 4d7087f727be..86b4ae8caa8a 100644
> --- a/Documentation/networking/tuntap.rst
> +++ b/Documentation/networking/tuntap.rst
> @@ -206,6 +206,13 @@ enable is true we enable it, otherwise we disable it::
>        return ioctl(fd, TUNSETQUEUE, (void *)&ifr);
>    }
>
> +3.4 Reference
> +-------------
> +
> +``linux/if_tun.h`` defines the interface described below:
> +
> +.. kernel-doc:: include/uapi/linux/if_tun.h
> +
>  Universal TUN/TAP device driver Frequently Asked Question
>  =========================================================
>
> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> index d8f4d3e996a7..9133ab9ed3f5 100644
> --- a/drivers/net/tun.c
> +++ b/drivers/net/tun.c
> @@ -476,21 +476,29 @@ static u16 tun_automq_select_queue(struct tun_struct *tun, struct sk_buff *skb)
>         return txq;
>  }
>
> -static u16 tun_ebpf_select_queue(struct tun_struct *tun, struct sk_buff *skb)
> +static bool tun_ebpf_select_queue(struct tun_struct *tun, struct sk_buff *skb,
> +                                 u16 *ret)
>  {
>         struct tun_prog *prog;
>         u32 numqueues;
> -       u16 ret = 0;
> +       u32 prog_ret;
> +
> +       prog = rcu_dereference(tun->steering_prog);
> +       if (!prog)
> +               return false;
>
>         numqueues = READ_ONCE(tun->numqueues);
> -       if (!numqueues)
> -               return 0;
> +       if (!numqueues) {
> +               *ret = 0;
> +               return true;
> +       }
>
> -       prog = rcu_dereference(tun->steering_prog);
> -       if (prog)
> -               ret = bpf_prog_run_clear_cb(prog->prog, skb);
> +       prog_ret = bpf_prog_run_clear_cb(prog->prog, skb);
> +       if (prog_ret == TUN_STEERINGEBPF_FALLBACK)
> +               return false;

This seems to break the uAPI. So I think we need a new ioctl to enable
the behaviour

>
> -       return ret % numqueues;
> +       *ret = (u16)prog_ret % numqueues;
> +       return true;
>  }
>
>  static u16 tun_select_queue(struct net_device *dev, struct sk_buff *skb,
> @@ -500,9 +508,7 @@ static u16 tun_select_queue(struct net_device *dev, struct sk_buff *skb,
>         u16 ret;
>
>         rcu_read_lock();
> -       if (rcu_dereference(tun->steering_prog))
> -               ret = tun_ebpf_select_queue(tun, skb);
> -       else
> +       if (!tun_ebpf_select_queue(tun, skb, &ret))
>                 ret = tun_automq_select_queue(tun, skb);
>         rcu_read_unlock();
>
> diff --git a/include/uapi/linux/if_tun.h b/include/uapi/linux/if_tun.h
> index 287cdc81c939..980de74724fc 100644
> --- a/include/uapi/linux/if_tun.h
> +++ b/include/uapi/linux/if_tun.h
> @@ -115,4 +115,13 @@ struct tun_filter {
>         __u8   addr[][ETH_ALEN];
>  };
>
> +/**
> + * define TUN_STEERINGEBPF_FALLBACK - A steering eBPF return value to fall back
> + *
> + * A steering eBPF program may return this value to fall back to the steering
> + * algorithm that should have been used if the program was not set. This allows
> + * selectively overriding the steering decision.
> + */
> +#define TUN_STEERINGEBPF_FALLBACK -1

Not a native speaker, consider it works more like XDP_PASS, would it
be better to use "TUN_STERRING_PASS"?

Thanks

> +
>  #endif /* _UAPI__IF_TUN_H */
>
> --
> 2.49.0
>


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next v12 04/10] tun: Add common virtio-net hash feature code
  2025-05-30  4:50 ` [PATCH net-next v12 04/10] tun: Add common virtio-net hash feature code Akihiko Odaki
@ 2025-06-04  1:53   ` Jason Wang
  2025-06-04  8:42     ` Akihiko Odaki
  0 siblings, 1 reply; 39+ messages in thread
From: Jason Wang @ 2025-06-04  1:53 UTC (permalink / raw)
  To: Akihiko Odaki
  Cc: Jonathan Corbet, Willem de Bruijn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Michael S. Tsirkin, Xuan Zhuo,
	Shuah Khan, linux-doc, linux-kernel, netdev, kvm, virtualization,
	linux-kselftest, Yuri Benditovich, Andrew Melnychenko,
	Stephen Hemminger, gur.stavi, Lei Yang, Simon Horman

On Fri, May 30, 2025 at 12:50 PM Akihiko Odaki <akihiko.odaki@daynix.com> wrote:
>
> Add common code required for the features being added to TUN and TAP.
> They will be enabled for each of them in following patches.
>
> Added Features
> ==============
>
> Hash reporting
> --------------
>
> Allow the guest to reuse the hash value to make receive steering
> consistent between the host and guest, and to save hash computation.
>
> Receive Side Scaling (RSS)
> --------------------------
>
> RSS is a receive steering algorithm that can be negotiated to use with
> virtio_net. Conventionally the hash calculation was done by the VMM.
> However, computing the hash after the queue was chosen defeats the
> purpose of RSS.
>
> Another approach is to use eBPF steering program. This approach has
> another downside: it cannot report the calculated hash due to the
> restrictive nature of eBPF steering program.
>
> Introduce the code to perform RSS to the kernel in order to overcome
> thse challenges. An alternative solution is to extend the eBPF steering
> program so that it will be able to report to the userspace, but I didn't
> opt for it because extending the current mechanism of eBPF steering
> program as is because it relies on legacy context rewriting, and
> introducing kfunc-based eBPF will result in non-UAPI dependency while
> the other relevant virtualization APIs such as KVM and vhost_net are
> UAPIs.
>
> Added ioctls
> ============
>
> They are designed to make extensibility and VM migration compatible.
> This change only adds the implementation and does not expose them to
> the userspace.
>
> TUNGETVNETHASHTYPES
> -------------------
>
> This ioctl tells supported hash types. It is useful to check if a VM can
> be migrated to the current host.
>
> TUNSETVNETREPORTINGAUTOMQ, TUNSETVNETREPORTINGRSS, and TUNSETVNETRSS
> --------------------------------------------------------------------
>
> These ioctls configures a steering algorithm and, if needed, hash
> reporting.
>
> Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
> Tested-by: Lei Yang <leiyang@redhat.com>
> ---
>  drivers/net/tap.c           |  10 ++-
>  drivers/net/tun.c           |  12 +++-
>  drivers/net/tun_vnet.h      | 165 +++++++++++++++++++++++++++++++++++++++++---
>  include/uapi/linux/if_tun.h |  71 +++++++++++++++++++
>  4 files changed, 244 insertions(+), 14 deletions(-)
>
> diff --git a/drivers/net/tap.c b/drivers/net/tap.c
> index d4ece538f1b2..25c60ff2d3f2 100644
> --- a/drivers/net/tap.c
> +++ b/drivers/net/tap.c
> @@ -179,6 +179,11 @@ static void tap_put_queue(struct tap_queue *q)
>         sock_put(&q->sk);
>  }
>
> +static const struct virtio_net_hash *tap_find_hash(const struct sk_buff *skb)
> +{
> +       return NULL;
> +}
> +
>  /*
>   * Select a queue based on the rxq of the device on which this packet
>   * arrived. If the incoming device is not mq, calculate a flow hash
> @@ -711,11 +716,12 @@ static ssize_t tap_put_user(struct tap_queue *q,
>         int total;
>
>         if (q->flags & IFF_VNET_HDR) {
> -               struct virtio_net_hdr vnet_hdr;
> +               struct virtio_net_hdr_v1_hash vnet_hdr;
>
>                 vnet_hdr_len = READ_ONCE(q->vnet_hdr_sz);
>
> -               ret = tun_vnet_hdr_from_skb(q->flags, NULL, skb, &vnet_hdr);
> +               ret = tun_vnet_hdr_from_skb(vnet_hdr_len, q->flags, NULL, skb,
> +                                           tap_find_hash, &vnet_hdr);
>                 if (ret)
>                         return ret;
>
> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> index 9133ab9ed3f5..03d47799e9bd 100644
> --- a/drivers/net/tun.c
> +++ b/drivers/net/tun.c
> @@ -451,6 +451,11 @@ static inline void tun_flow_save_rps_rxhash(struct tun_flow_entry *e, u32 hash)
>                 e->rps_rxhash = hash;
>  }
>
> +static const struct virtio_net_hash *tun_find_hash(const struct sk_buff *skb)
> +{
> +       return NULL;
> +}
> +
>  /* We try to identify a flow through its rxhash. The reason that
>   * we do not check rxq no. is because some cards(e.g 82599), chooses
>   * the rxq based on the txq where the last packet of the flow comes. As
> @@ -1993,7 +1998,7 @@ static ssize_t tun_put_user_xdp(struct tun_struct *tun,
>         ssize_t ret;
>
>         if (tun->flags & IFF_VNET_HDR) {
> -               struct virtio_net_hdr gso = { 0 };
> +               struct virtio_net_hdr_v1_hash gso = { 0 };
>
>                 vnet_hdr_sz = READ_ONCE(tun->vnet_hdr_sz);
>                 ret = tun_vnet_hdr_put(vnet_hdr_sz, iter, &gso);
> @@ -2046,9 +2051,10 @@ static ssize_t tun_put_user(struct tun_struct *tun,
>         }
>
>         if (vnet_hdr_sz) {
> -               struct virtio_net_hdr gso;
> +               struct virtio_net_hdr_v1_hash gso;
>
> -               ret = tun_vnet_hdr_from_skb(tun->flags, tun->dev, skb, &gso);
> +               ret = tun_vnet_hdr_from_skb(vnet_hdr_sz, tun->flags, tun->dev,
> +                                           skb, tun_find_hash, &gso);
>                 if (ret)
>                         return ret;
>
> diff --git a/drivers/net/tun_vnet.h b/drivers/net/tun_vnet.h
> index 58b9ac7a5fc4..45d0533efc8d 100644
> --- a/drivers/net/tun_vnet.h
> +++ b/drivers/net/tun_vnet.h
> @@ -6,6 +6,17 @@
>  #define TUN_VNET_LE     0x80000000
>  #define TUN_VNET_BE     0x40000000
>
> +typedef struct virtio_net_hash *(*tun_vnet_hash_add)(struct sk_buff *);
> +typedef const struct virtio_net_hash *(*tun_vnet_hash_find)(const struct sk_buff *);
> +
> +struct tun_vnet_hash {
> +       bool report;
> +       bool rss;
> +       struct tun_vnet_rss common;
> +       u32 rss_key[VIRTIO_NET_RSS_MAX_KEY_SIZE];
> +       u16 rss_indirection_table[];
> +};
> +
>  static inline bool tun_vnet_legacy_is_little_endian(unsigned int flags)
>  {
>         bool be = IS_ENABLED(CONFIG_TUN_VNET_CROSS_LE) &&
> @@ -107,6 +118,128 @@ static inline long tun_vnet_ioctl(int *vnet_hdr_sz, unsigned int *flags,
>         }
>  }
>
> +static inline long tun_vnet_ioctl_gethashtypes(u32 __user *argp)
> +{
> +       return put_user(VIRTIO_NET_SUPPORTED_HASH_TYPES, argp) ? -EFAULT : 0;
> +}
> +
> +static inline long tun_vnet_ioctl_sethash(struct tun_vnet_hash __rcu **hashp,
> +                                         unsigned int cmd,
> +                                         void __user *argp)
> +{
> +       struct tun_vnet_rss common;
> +       struct tun_vnet_hash *hash;
> +       size_t indirection_table_size;
> +       size_t key_size;
> +       size_t size;
> +
> +       switch (cmd) {
> +       case TUNSETVNETREPORTINGAUTOMQ:
> +               if (get_user(common.hash_types, (u32 __user *)argp))
> +                       return -EFAULT;
> +
> +               if (common.hash_types) {
> +                       hash = kzalloc(sizeof(*hash), GFP_KERNEL);
> +                       if (!hash)
> +                               return -ENOMEM;
> +
> +                       hash->report = true;
> +                       hash->common.hash_types = common.hash_types;
> +               } else {
> +                       hash = NULL;
> +               }
> +               break;
> +
> +       case TUNSETVNETREPORTINGRSS:
> +       case TUNSETVNETRSS:

So the above three shows unnecessary design redundancy as well as a
burden for the future extension.  Why not simply have

1) TUNSETVNET_RSS
2) TUNSETVNET_HASH_REPORT
?

Which maps to

 #define VIRTIO_NET_CTRL_MQ_RSS_CONFIG          1 (for configurable
receive steering)
 #define VIRTIO_NET_CTRL_MQ_HASH_CONFIG         2 (for configurable
hash calculation)

It would be always easier to start with what spec had or at least we
need to explain why we choose a different design here or in the
changelog to ease our life.

One day we would have tun_select_queue_algorithm_x() we don't have to
duplicate the ioctls once again here like TUNSETVNETREPORTINGXYZ

> +               if (copy_from_user(&common, argp, sizeof(common)))
> +                       return -EFAULT;
> +               argp = (struct tun_vnet_rss __user *)argp + 1;
> +
> +               indirection_table_size = ((size_t)common.indirection_table_mask + 1) * 2;
> +               key_size = virtio_net_hash_key_length(common.hash_types);
> +               size = struct_size(hash, rss_indirection_table,
> +                                  (size_t)common.indirection_table_mask + 1);
> +
> +               hash = kmalloc(size, GFP_KERNEL);
> +               if (!hash)
> +                       return -ENOMEM;
> +
> +               if (copy_from_user(hash->rss_indirection_table,
> +                                  argp, indirection_table_size)) {
> +                       kfree(hash);
> +                       return -EFAULT;
> +               }
> +               argp = (u16 __user *)argp + common.indirection_table_mask + 1;
> +
> +               if (copy_from_user(hash->rss_key, argp, key_size)) {
> +                       kfree(hash);
> +                       return -EFAULT;
> +               }
> +
> +               virtio_net_toeplitz_convert_key(hash->rss_key, key_size);
> +               hash->report = cmd == TUNSETVNETREPORTINGRSS;

At least, if this is the only difference why not simply code this into
the ioctl itself other than having a very similar command?

> +               hash->rss = true;
> +               hash->common = common;
> +               break;
> +
> +       default:
> +               return -EINVAL;
> +       }
> +
> +       kfree_rcu_mightsleep(rcu_replace_pointer_rtnl(*hashp, hash));
> +       return 0;
> +}
> +
> +static inline void tun_vnet_hash_report(const struct tun_vnet_hash *hash,
> +                                       struct sk_buff *skb,
> +                                       const struct flow_keys_basic *keys,
> +                                       u32 value,
> +                                       tun_vnet_hash_add vnet_hash_add)
> +{
> +       struct virtio_net_hash *report;
> +
> +       if (!hash || !hash->report)
> +               return;
> +
> +       report = vnet_hash_add(skb);
> +       if (!report)
> +               return;
> +
> +       *report = (struct virtio_net_hash) {
> +               .report = virtio_net_hash_report(hash->common.hash_types, keys),
> +               .value = value
> +       };
> +}
> +
> +static inline u16 tun_vnet_rss_select_queue(u32 numqueues,
> +                                           const struct tun_vnet_hash *hash,
> +                                           struct sk_buff *skb,
> +                                           tun_vnet_hash_add vnet_hash_add)
> +{
> +       struct virtio_net_hash *report;
> +       struct virtio_net_hash ret;
> +       u16 index;
> +
> +       if (!numqueues)
> +               return 0;
> +
> +       virtio_net_hash_rss(skb, hash->common.hash_types, hash->rss_key, &ret);
> +
> +       if (!ret.report)
> +               return hash->common.unclassified_queue % numqueues;
> +
> +       if (hash->report) {
> +               report = vnet_hash_add(skb);
> +               if (report)
> +                       *report = ret;
> +       }
> +
> +       index = ret.value & hash->common.indirection_table_mask;
> +
> +       return hash->rss_indirection_table[index] % numqueues;
> +}
> +
>  static inline int tun_vnet_hdr_get(int sz, unsigned int flags,
>                                    struct iov_iter *from,
>                                    struct virtio_net_hdr *hdr)
> @@ -135,15 +268,17 @@ static inline int tun_vnet_hdr_get(int sz, unsigned int flags,
>  }
>
>  static inline int tun_vnet_hdr_put(int sz, struct iov_iter *iter,
> -                                  const struct virtio_net_hdr *hdr)
> +                                  const struct virtio_net_hdr_v1_hash *hdr)
>  {
> +       int content_sz = MIN(sizeof(*hdr), sz);
> +
>         if (unlikely(iov_iter_count(iter) < sz))
>                 return -EINVAL;
>
> -       if (unlikely(copy_to_iter(hdr, sizeof(*hdr), iter) != sizeof(*hdr)))
> +       if (unlikely(copy_to_iter(hdr, content_sz, iter) != content_sz))
>                 return -EFAULT;
>
> -       if (iov_iter_zero(sz - sizeof(*hdr), iter) != sz - sizeof(*hdr))
> +       if (iov_iter_zero(sz - content_sz, iter) != sz - content_sz)
>                 return -EFAULT;
>
>         return 0;
> @@ -155,26 +290,38 @@ static inline int tun_vnet_hdr_to_skb(unsigned int flags, struct sk_buff *skb,
>         return virtio_net_hdr_to_skb(skb, hdr, tun_vnet_is_little_endian(flags));
>  }
>
> -static inline int tun_vnet_hdr_from_skb(unsigned int flags,
> +static inline int tun_vnet_hdr_from_skb(int sz, unsigned int flags,
>                                         const struct net_device *dev,
>                                         const struct sk_buff *skb,
> -                                       struct virtio_net_hdr *hdr)
> +                                       tun_vnet_hash_find vnet_hash_find,
> +                                       struct virtio_net_hdr_v1_hash *hdr)
>  {
>         int vlan_hlen = skb_vlan_tag_present(skb) ? VLAN_HLEN : 0;
> +       const struct virtio_net_hash *report = sz < sizeof(struct virtio_net_hdr_v1_hash) ?
> +                                              NULL : vnet_hash_find(skb);
> +
> +       *hdr = (struct virtio_net_hdr_v1_hash) {
> +               .hash_report = VIRTIO_NET_HASH_REPORT_NONE
> +       };
> +
> +       if (report) {
> +               hdr->hash_value = cpu_to_le32(report->value);
> +               hdr->hash_report = cpu_to_le16(report->report);
> +       }
>
> -       if (virtio_net_hdr_from_skb(skb, hdr,
> +       if (virtio_net_hdr_from_skb(skb, (struct virtio_net_hdr *)hdr,
>                                     tun_vnet_is_little_endian(flags), true,
>                                     vlan_hlen)) {
>                 struct skb_shared_info *sinfo = skb_shinfo(skb);
>
>                 if (net_ratelimit()) {
>                         netdev_err(dev, "unexpected GSO type: 0x%x, gso_size %d, hdr_len %d\n",
> -                                  sinfo->gso_type, tun_vnet16_to_cpu(flags, hdr->gso_size),
> -                                  tun_vnet16_to_cpu(flags, hdr->hdr_len));
> +                                  sinfo->gso_type, tun_vnet16_to_cpu(flags, hdr->hdr.gso_size),
> +                                  tun_vnet16_to_cpu(flags, hdr->hdr.hdr_len));
>                         print_hex_dump(KERN_ERR, "tun: ",
>                                        DUMP_PREFIX_NONE,
>                                        16, 1, skb->head,
> -                                      min(tun_vnet16_to_cpu(flags, hdr->hdr_len), 64), true);
> +                                      min(tun_vnet16_to_cpu(flags, hdr->hdr.hdr_len), 64), true);
>                 }
>                 WARN_ON_ONCE(1);
>                 return -EINVAL;
> diff --git a/include/uapi/linux/if_tun.h b/include/uapi/linux/if_tun.h
> index 980de74724fc..fe4b984d3bbb 100644
> --- a/include/uapi/linux/if_tun.h
> +++ b/include/uapi/linux/if_tun.h
> @@ -62,6 +62,62 @@
>  #define TUNSETCARRIER _IOW('T', 226, int)
>  #define TUNGETDEVNETNS _IO('T', 227)
>
> +/**
> + * define TUNGETVNETHASHTYPES - ioctl to get supported virtio_net hashing types
> + *
> + * The argument is a pointer to __u32 which will store the supported virtio_net
> + * hashing types.
> + */
> +#define TUNGETVNETHASHTYPES _IOR('T', 228, __u32)
> +
> +/**
> + * define TUNSETVNETREPORTINGAUTOMQ - ioctl to enable automq with hash reporting
> + *
> + * Disable RSS and enable automatic receive steering with hash reporting.
> + *
> + * The argument is a pointer to __u32 that contains a bitmask of hash types
> + * allowed to be reported.
> + *
> + * This ioctl results in %EBADFD if the underlying device is deleted. It affects
> + * all queues attached to the same device.
> + *
> + * This ioctl currently has no effect on XDP packets and packets with
> + * queue_mapping set by TC.
> + */
> +#define TUNSETVNETREPORTINGAUTOMQ _IOR('T', 229, __u32)
> +
> +/**
> + * define TUNSETVNETREPORTINGRSS - ioctl to enable RSS with hash reporting
> + *
> + * Disable automatic receive steering and enable RSS with hash reporting.

This is unnecessary, e.g one day will have select_queue_xyz(), we
don't want to say "Disable automatic receive steering and xyz ..."

> + *
> + * This ioctl results in %EBADFD if the underlying device is deleted. It affects
> + * all queues attached to the same device.
> + *
> + * This ioctl currently has no effect on XDP packets and packets with
> + * queue_mapping set by TC.
> + */
> +#define TUNSETVNETREPORTINGRSS _IOR('T', 230, struct tun_vnet_rss)
> +
> +/**
> + * define TUNSETVNETRSS - ioctl to enable RSS without hash reporting
> + *
> + * Disable automatic receive steering and enable RSS without hash reporting.

Same here.

> + *
> + * The argument is a pointer to the compound of the following in order:
> + *
> + * 1. &struct tun_vnet_rss
> + * 3. Indirection table
> + * 4. Key
> + *
> + * This ioctl results in %EBADFD if the underlying device is deleted. It affects
> + * all queues attached to the same device.
> + *
> + * This ioctl currently has no effect on XDP packets and packets with
> + * queue_mapping set by TC.
> + */
> +#define TUNSETVNETRSS _IOR('T', 231, struct tun_vnet_rss)
> +
>  /* TUNSETIFF ifr flags */
>  #define IFF_TUN                0x0001
>  #define IFF_TAP                0x0002
> @@ -124,4 +180,19 @@ struct tun_filter {
>   */
>  #define TUN_STEERINGEBPF_FALLBACK -1
>
> +/**
> + * struct tun_vnet_rss - virtio_net RSS configuration
> + * @hash_types:
> + *             Bitmask of allowed hash types
> + * @indirection_table_mask:
> + *             Bitmask to be applied to the indirection table index
> + * @unclassified_queue:
> + *             The index of the queue to place unclassified packets in
> + */
> +struct tun_vnet_rss {
> +       __u32 hash_types;
> +       __u16 indirection_table_mask;
> +       __u16 unclassified_queue;
> +};
> +
>  #endif /* _UAPI__IF_TUN_H */
>
> --
> 2.49.0
>

Thanks


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next v12 01/10] virtio_net: Add functions for hashing
  2025-06-04  1:18       ` Jason Wang
@ 2025-06-04  7:20         ` Akihiko Odaki
  2025-06-05  1:53           ` Jason Wang
  0 siblings, 1 reply; 39+ messages in thread
From: Akihiko Odaki @ 2025-06-04  7:20 UTC (permalink / raw)
  To: Jason Wang
  Cc: Jonathan Corbet, Willem de Bruijn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Michael S. Tsirkin, Xuan Zhuo,
	Shuah Khan, linux-doc, linux-kernel, netdev, kvm, virtualization,
	linux-kselftest, Yuri Benditovich, Andrew Melnychenko,
	Stephen Hemminger, gur.stavi, Lei Yang, Simon Horman

On 2025/06/04 10:18, Jason Wang wrote:
> On Tue, Jun 3, 2025 at 1:31 PM Akihiko Odaki <akihiko.odaki@daynix.com> wrote:
>>
>> On 2025/06/03 12:19, Jason Wang wrote:
>>> On Fri, May 30, 2025 at 12:50 PM Akihiko Odaki <akihiko.odaki@daynix.com> wrote:
>>>>
>>>> They are useful to implement VIRTIO_NET_F_RSS and
>>>> VIRTIO_NET_F_HASH_REPORT.
>>>>
>>>> Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
>>>> Tested-by: Lei Yang <leiyang@redhat.com>
>>>> ---
>>>>    include/linux/virtio_net.h | 188 +++++++++++++++++++++++++++++++++++++++++++++
>>>>    1 file changed, 188 insertions(+)
>>>>
>>>> diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h
>>>> index 02a9f4dc594d..426f33b4b824 100644
>>>> --- a/include/linux/virtio_net.h
>>>> +++ b/include/linux/virtio_net.h
>>>> @@ -9,6 +9,194 @@
>>>>    #include <uapi/linux/tcp.h>
>>>>    #include <uapi/linux/virtio_net.h>
>>>>
>>>> +struct virtio_net_hash {
>>>> +       u32 value;
>>>> +       u16 report;
>>>> +};
>>>> +
>>>> +struct virtio_net_toeplitz_state {
>>>> +       u32 hash;
>>>> +       const u32 *key;
>>>> +};
>>>> +
>>>> +#define VIRTIO_NET_SUPPORTED_HASH_TYPES (VIRTIO_NET_RSS_HASH_TYPE_IPv4 | \
>>>> +                                        VIRTIO_NET_RSS_HASH_TYPE_TCPv4 | \
>>>> +                                        VIRTIO_NET_RSS_HASH_TYPE_UDPv4 | \
>>>> +                                        VIRTIO_NET_RSS_HASH_TYPE_IPv6 | \
>>>> +                                        VIRTIO_NET_RSS_HASH_TYPE_TCPv6 | \
>>>> +                                        VIRTIO_NET_RSS_HASH_TYPE_UDPv6)
>>>> +
>>>> +#define VIRTIO_NET_RSS_MAX_KEY_SIZE 40
>>>> +
>>>> +static inline void virtio_net_toeplitz_convert_key(u32 *input, size_t len)
>>>> +{
>>>> +       while (len >= sizeof(*input)) {
>>>> +               *input = be32_to_cpu((__force __be32)*input);
>>>> +               input++;
>>>> +               len -= sizeof(*input);
>>>> +       }
>>>> +}
>>>> +
>>>> +static inline void virtio_net_toeplitz_calc(struct virtio_net_toeplitz_state *state,
>>>> +                                           const __be32 *input, size_t len)
>>>> +{
>>>> +       while (len >= sizeof(*input)) {
>>>> +               for (u32 map = be32_to_cpu(*input); map; map &= (map - 1)) {
>>>> +                       u32 i = ffs(map);
>>>> +
>>>> +                       state->hash ^= state->key[0] << (32 - i) |
>>>> +                                      (u32)((u64)state->key[1] >> i);
>>>> +               }
>>>> +
>>>> +               state->key++;
>>>> +               input++;
>>>> +               len -= sizeof(*input);
>>>> +       }
>>>> +}
>>>> +
>>>> +static inline u8 virtio_net_hash_key_length(u32 types)
>>>> +{
>>>> +       size_t len = 0;
>>>> +
>>>> +       if (types & VIRTIO_NET_HASH_REPORT_IPv4)
>>>> +               len = max(len,
>>>> +                         sizeof(struct flow_dissector_key_ipv4_addrs));
>>>> +
>>>> +       if (types &
>>>> +           (VIRTIO_NET_HASH_REPORT_TCPv4 | VIRTIO_NET_HASH_REPORT_UDPv4))
>>>> +               len = max(len,
>>>> +                         sizeof(struct flow_dissector_key_ipv4_addrs) +
>>>> +                         sizeof(struct flow_dissector_key_ports));
>>>> +
>>>> +       if (types & VIRTIO_NET_HASH_REPORT_IPv6)
>>>> +               len = max(len,
>>>> +                         sizeof(struct flow_dissector_key_ipv6_addrs));
>>>> +
>>>> +       if (types &
>>>> +           (VIRTIO_NET_HASH_REPORT_TCPv6 | VIRTIO_NET_HASH_REPORT_UDPv6))
>>>> +               len = max(len,
>>>> +                         sizeof(struct flow_dissector_key_ipv6_addrs) +
>>>> +                         sizeof(struct flow_dissector_key_ports));
>>>> +
>>>> +       return len + sizeof(u32);
>>>> +}
>>>> +
>>>> +static inline u32 virtio_net_hash_report(u32 types,
>>>> +                                        const struct flow_keys_basic *keys)
>>>> +{
>>>> +       switch (keys->basic.n_proto) {
>>>> +       case cpu_to_be16(ETH_P_IP):
>>>> +               if (!(keys->control.flags & FLOW_DIS_IS_FRAGMENT)) {
>>>> +                       if (keys->basic.ip_proto == IPPROTO_TCP &&
>>>> +                           (types & VIRTIO_NET_RSS_HASH_TYPE_TCPv4))
>>>> +                               return VIRTIO_NET_HASH_REPORT_TCPv4;
>>>> +
>>>> +                       if (keys->basic.ip_proto == IPPROTO_UDP &&
>>>> +                           (types & VIRTIO_NET_RSS_HASH_TYPE_UDPv4))
>>>> +                               return VIRTIO_NET_HASH_REPORT_UDPv4;
>>>> +               }
>>>> +
>>>> +               if (types & VIRTIO_NET_RSS_HASH_TYPE_IPv4)
>>>> +                       return VIRTIO_NET_HASH_REPORT_IPv4;
>>>> +
>>>> +               return VIRTIO_NET_HASH_REPORT_NONE;
>>>> +
>>>> +       case cpu_to_be16(ETH_P_IPV6):
>>>> +               if (!(keys->control.flags & FLOW_DIS_IS_FRAGMENT)) {
>>>> +                       if (keys->basic.ip_proto == IPPROTO_TCP &&
>>>> +                           (types & VIRTIO_NET_RSS_HASH_TYPE_TCPv6))
>>>> +                               return VIRTIO_NET_HASH_REPORT_TCPv6;
>>>> +
>>>> +                       if (keys->basic.ip_proto == IPPROTO_UDP &&
>>>> +                           (types & VIRTIO_NET_RSS_HASH_TYPE_UDPv6))
>>>> +                               return VIRTIO_NET_HASH_REPORT_UDPv6;
>>>> +               }
>>>> +
>>>> +               if (types & VIRTIO_NET_RSS_HASH_TYPE_IPv6)
>>>> +                       return VIRTIO_NET_HASH_REPORT_IPv6;
>>>> +
>>>> +               return VIRTIO_NET_HASH_REPORT_NONE;
>>>> +
>>>> +       default:
>>>> +               return VIRTIO_NET_HASH_REPORT_NONE;
>>>> +       }
>>>> +}
>>>> +
>>>> +static inline void virtio_net_hash_rss(const struct sk_buff *skb,
>>>> +                                      u32 types, const u32 *key,
>>>> +                                      struct virtio_net_hash *hash)
>>>> +{
>>>> +       struct virtio_net_toeplitz_state toeplitz_state = { .key = key };
>>>> +       struct flow_keys flow;
>>>> +       struct flow_keys_basic flow_basic;
>>>> +       u16 report;
>>>> +
>>>> +       if (!skb_flow_dissect_flow_keys(skb, &flow, 0)) {
>>>> +               hash->report = VIRTIO_NET_HASH_REPORT_NONE;
>>>> +               return;
>>>> +       }
>>>> +
>>>> +       flow_basic = (struct flow_keys_basic) {
>>>> +               .control = flow.control,
>>>> +               .basic = flow.basic
>>>> +       };
>>>> +
>>>> +       report = virtio_net_hash_report(types, &flow_basic);
>>>> +
>>>> +       switch (report) {
>>>> +       case VIRTIO_NET_HASH_REPORT_IPv4:
>>>> +               virtio_net_toeplitz_calc(&toeplitz_state,
>>>> +                                        (__be32 *)&flow.addrs.v4addrs,
>>>> +                                        sizeof(flow.addrs.v4addrs));
>>>> +               break;
>>>> +
>>>> +       case VIRTIO_NET_HASH_REPORT_TCPv4:
>>>> +               virtio_net_toeplitz_calc(&toeplitz_state,
>>>> +                                        (__be32 *)&flow.addrs.v4addrs,
>>>> +                                        sizeof(flow.addrs.v4addrs));
>>>> +               virtio_net_toeplitz_calc(&toeplitz_state, &flow.ports.ports,
>>>> +                                        sizeof(flow.ports.ports));
>>>> +               break;
>>>> +
>>>> +       case VIRTIO_NET_HASH_REPORT_UDPv4:
>>>> +               virtio_net_toeplitz_calc(&toeplitz_state,
>>>> +                                        (__be32 *)&flow.addrs.v4addrs,
>>>> +                                        sizeof(flow.addrs.v4addrs));
>>>> +               virtio_net_toeplitz_calc(&toeplitz_state, &flow.ports.ports,
>>>> +                                        sizeof(flow.ports.ports));
>>>> +               break;
>>>> +
>>>> +       case VIRTIO_NET_HASH_REPORT_IPv6:
>>>> +               virtio_net_toeplitz_calc(&toeplitz_state,
>>>> +                                        (__be32 *)&flow.addrs.v6addrs,
>>>> +                                        sizeof(flow.addrs.v6addrs));
>>>> +               break;
>>>> +
>>>> +       case VIRTIO_NET_HASH_REPORT_TCPv6:
>>>> +               virtio_net_toeplitz_calc(&toeplitz_state,
>>>> +                                        (__be32 *)&flow.addrs.v6addrs,
>>>> +                                        sizeof(flow.addrs.v6addrs));
>>>> +               virtio_net_toeplitz_calc(&toeplitz_state, &flow.ports.ports,
>>>> +                                        sizeof(flow.ports.ports));
>>>> +               break;
>>>> +
>>>> +       case VIRTIO_NET_HASH_REPORT_UDPv6:
>>>> +               virtio_net_toeplitz_calc(&toeplitz_state,
>>>> +                                        (__be32 *)&flow.addrs.v6addrs,
>>>> +                                        sizeof(flow.addrs.v6addrs));
>>>> +               virtio_net_toeplitz_calc(&toeplitz_state, &flow.ports.ports,
>>>> +                                        sizeof(flow.ports.ports));
>>>> +               break;
>>>> +
>>>> +       default:
>>>> +               hash->report = VIRTIO_NET_HASH_REPORT_NONE;
>>>> +               return;
>>>
>>> So I still think we need a comment here to explain why this is not an
>>> issue if the device can report HASH_XXX_EX. Or we need to add the
>>> support, since this is the code from the driver side, I don't think we
>>> need to worry about the device implementation issues.
>>
>> This is on the device side, and don't report HASH_TYPE_XXX_EX.
>>
>>>
>>> For the issue of the number of options, does the spec forbid fallback
>>> to VIRTIO_NET_HASH_REPORT_NONE? If not, we can do that.
>>
>> 5.1.6.4.3.4 "IPv6 packets with extension header" says:
>>   > If VIRTIO_NET_HASH_TYPE_TCP_EX is set and the packet has a TCPv6
>>   > header, the hash is calculated over the following fields:
>>   > - Home address from the home address option in the IPv6 destination
>>   >   options header. If the extension header is not present, use the
>>   >   Source IPv6 address.
>>   > - IPv6 address that is contained in the Routing-Header-Type-2 from the
>>   >   associated extension header. If the extension header is not present,
>>   >   use the Destination IPv6 address.
>>   > - Source TCP port
>>   > - Destination TCP port
>>
>> Therefore, if VIRTIO_NET_HASH_TYPE_TCP_EX is set, the packet has a TCPv6
>> and an home address option in the IPv6 destination options header is
>> present, the hash is calculated over the home address. If the hash is
>> not calculated over the home address in such a case, the device is
>> contradicting with this section and violating the spec. The same goes
>> for the other HASH_TYPE_XXX_EX types and Routing-Header-Type-2.
> 
> Just to make sure we are one the same page. I meant:
> 
> 1) If the hash is not calculated over the home address (in the case of
> IPv6 destination destination), it can still report
> VIRTIO_NET_RSS_HASH_TYPE_IPv6. This is what you implemented in your
> series. So the device can simply fallback to e.g TCPv6 if it can't
> understand all or part of the IPv6 options.

The spec says it can fallback if "the extension header is not present", 
not if the device can't understand the extension header.

> 2) the VIRTIO_NET_SUPPORTED_HASH_TYPES is not checked against the
> tun_vnet_ioctl_sethash(), so userspace may set
> VIRTIO_NET_HASH_TYPE_TCP_EX regardless of what has been returned by
> tun_vnet_ioctl_gethashtypes(). In this case they won't get
> VIRTIO_NET_HASH_TYPE_TCP_EX.

That's right. It's the responsibility of the userspace to set only the 
supported hash types.

> 3) implementing part of the hash types might complicate the migration
> or at least we need to describe the expectations of libvirt or other
> management in this case. For example, do we plan to have a dedicated
> Qemu command line like:
> 
> -device virtio-net-pci,hash_report=on,supported_hash_types=X,Y,Z?

I posted a patch series to implement such a command line for vDPA[1]. 
The patch series that wires this tuntap feature up[2] reuses the 
infrastructure so it doesn't bring additional complexity.

[1] 
https://lore.kernel.org/qemu-devel/20250530-vdpa-v1-0-5af4109b1c19@daynix.com/
[2] 
https://lore.kernel.org/qemu-devel/20250530-hash-v5-0-343d7d7a8200@daynix.com/

Regards,
Akihiko Odaki

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next v12 03/10] tun: Allow steering eBPF program to fall back
  2025-06-04  1:27   ` Jason Wang
@ 2025-06-04  7:24     ` Akihiko Odaki
  2025-06-05  1:55       ` Jason Wang
  0 siblings, 1 reply; 39+ messages in thread
From: Akihiko Odaki @ 2025-06-04  7:24 UTC (permalink / raw)
  To: Jason Wang
  Cc: Jonathan Corbet, Willem de Bruijn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Michael S. Tsirkin, Xuan Zhuo,
	Shuah Khan, linux-doc, linux-kernel, netdev, kvm, virtualization,
	linux-kselftest, Yuri Benditovich, Andrew Melnychenko,
	Stephen Hemminger, gur.stavi, Lei Yang, Simon Horman

On 2025/06/04 10:27, Jason Wang wrote:
> On Fri, May 30, 2025 at 12:50 PM Akihiko Odaki <akihiko.odaki@daynix.com> wrote:
>>
>> This clarifies a steering eBPF program takes precedence over the other
>> steering algorithms.
> 
> Let's give an example on the use case for this.
> 
>>
>> Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
>> ---
>>   Documentation/networking/tuntap.rst |  7 +++++++
>>   drivers/net/tun.c                   | 28 +++++++++++++++++-----------
>>   include/uapi/linux/if_tun.h         |  9 +++++++++
>>   3 files changed, 33 insertions(+), 11 deletions(-)
>>
>> diff --git a/Documentation/networking/tuntap.rst b/Documentation/networking/tuntap.rst
>> index 4d7087f727be..86b4ae8caa8a 100644
>> --- a/Documentation/networking/tuntap.rst
>> +++ b/Documentation/networking/tuntap.rst
>> @@ -206,6 +206,13 @@ enable is true we enable it, otherwise we disable it::
>>         return ioctl(fd, TUNSETQUEUE, (void *)&ifr);
>>     }
>>
>> +3.4 Reference
>> +-------------
>> +
>> +``linux/if_tun.h`` defines the interface described below:
>> +
>> +.. kernel-doc:: include/uapi/linux/if_tun.h
>> +
>>   Universal TUN/TAP device driver Frequently Asked Question
>>   =========================================================
>>
>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
>> index d8f4d3e996a7..9133ab9ed3f5 100644
>> --- a/drivers/net/tun.c
>> +++ b/drivers/net/tun.c
>> @@ -476,21 +476,29 @@ static u16 tun_automq_select_queue(struct tun_struct *tun, struct sk_buff *skb)
>>          return txq;
>>   }
>>
>> -static u16 tun_ebpf_select_queue(struct tun_struct *tun, struct sk_buff *skb)
>> +static bool tun_ebpf_select_queue(struct tun_struct *tun, struct sk_buff *skb,
>> +                                 u16 *ret)
>>   {
>>          struct tun_prog *prog;
>>          u32 numqueues;
>> -       u16 ret = 0;
>> +       u32 prog_ret;
>> +
>> +       prog = rcu_dereference(tun->steering_prog);
>> +       if (!prog)
>> +               return false;
>>
>>          numqueues = READ_ONCE(tun->numqueues);
>> -       if (!numqueues)
>> -               return 0;
>> +       if (!numqueues) {
>> +               *ret = 0;
>> +               return true;
>> +       }
>>
>> -       prog = rcu_dereference(tun->steering_prog);
>> -       if (prog)
>> -               ret = bpf_prog_run_clear_cb(prog->prog, skb);
>> +       prog_ret = bpf_prog_run_clear_cb(prog->prog, skb);
>> +       if (prog_ret == TUN_STEERINGEBPF_FALLBACK)
>> +               return false;
> 
> This seems to break the uAPI. So I think we need a new ioctl to enable
> the behaviour

I assumed it is fine to repurpose one of the 32-bit integer values since 
32-bit integer is too big to specify the queue number, but it may not be 
fine. I don't have a concrete use case either.

Perhaps it is safer to note that TUNSETSTEERINGEBPF takes precedence 
over TUNSETVNETRSS to allow such an extension in the future (but without 
implementing one now).

> 
>>
>> -       return ret % numqueues;
>> +       *ret = (u16)prog_ret % numqueues;
>> +       return true;
>>   }
>>
>>   static u16 tun_select_queue(struct net_device *dev, struct sk_buff *skb,
>> @@ -500,9 +508,7 @@ static u16 tun_select_queue(struct net_device *dev, struct sk_buff *skb,
>>          u16 ret;
>>
>>          rcu_read_lock();
>> -       if (rcu_dereference(tun->steering_prog))
>> -               ret = tun_ebpf_select_queue(tun, skb);
>> -       else
>> +       if (!tun_ebpf_select_queue(tun, skb, &ret))
>>                  ret = tun_automq_select_queue(tun, skb);
>>          rcu_read_unlock();
>>
>> diff --git a/include/uapi/linux/if_tun.h b/include/uapi/linux/if_tun.h
>> index 287cdc81c939..980de74724fc 100644
>> --- a/include/uapi/linux/if_tun.h
>> +++ b/include/uapi/linux/if_tun.h
>> @@ -115,4 +115,13 @@ struct tun_filter {
>>          __u8   addr[][ETH_ALEN];
>>   };
>>
>> +/**
>> + * define TUN_STEERINGEBPF_FALLBACK - A steering eBPF return value to fall back
>> + *
>> + * A steering eBPF program may return this value to fall back to the steering
>> + * algorithm that should have been used if the program was not set. This allows
>> + * selectively overriding the steering decision.
>> + */
>> +#define TUN_STEERINGEBPF_FALLBACK -1
> 
> Not a native speaker, consider it works more like XDP_PASS, would it
> be better to use "TUN_STERRING_PASS"?

That sounds indeed better to me.

Regards,
Akihiko Odaki

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next v12 04/10] tun: Add common virtio-net hash feature code
  2025-06-04  1:53   ` Jason Wang
@ 2025-06-04  8:42     ` Akihiko Odaki
  2025-06-05  2:46       ` Jason Wang
  0 siblings, 1 reply; 39+ messages in thread
From: Akihiko Odaki @ 2025-06-04  8:42 UTC (permalink / raw)
  To: Jason Wang
  Cc: Jonathan Corbet, Willem de Bruijn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Michael S. Tsirkin, Xuan Zhuo,
	Shuah Khan, linux-doc, linux-kernel, netdev, kvm, virtualization,
	linux-kselftest, Yuri Benditovich, Andrew Melnychenko,
	Stephen Hemminger, gur.stavi, Lei Yang, Simon Horman

On 2025/06/04 10:53, Jason Wang wrote:
> On Fri, May 30, 2025 at 12:50 PM Akihiko Odaki <akihiko.odaki@daynix.com> wrote:
>>
>> Add common code required for the features being added to TUN and TAP.
>> They will be enabled for each of them in following patches.
>>
>> Added Features
>> ==============
>>
>> Hash reporting
>> --------------
>>
>> Allow the guest to reuse the hash value to make receive steering
>> consistent between the host and guest, and to save hash computation.
>>
>> Receive Side Scaling (RSS)
>> --------------------------
>>
>> RSS is a receive steering algorithm that can be negotiated to use with
>> virtio_net. Conventionally the hash calculation was done by the VMM.
>> However, computing the hash after the queue was chosen defeats the
>> purpose of RSS.
>>
>> Another approach is to use eBPF steering program. This approach has
>> another downside: it cannot report the calculated hash due to the
>> restrictive nature of eBPF steering program.
>>
>> Introduce the code to perform RSS to the kernel in order to overcome
>> thse challenges. An alternative solution is to extend the eBPF steering
>> program so that it will be able to report to the userspace, but I didn't
>> opt for it because extending the current mechanism of eBPF steering
>> program as is because it relies on legacy context rewriting, and
>> introducing kfunc-based eBPF will result in non-UAPI dependency while
>> the other relevant virtualization APIs such as KVM and vhost_net are
>> UAPIs.
>>
>> Added ioctls
>> ============
>>
>> They are designed to make extensibility and VM migration compatible.
>> This change only adds the implementation and does not expose them to
>> the userspace.
>>
>> TUNGETVNETHASHTYPES
>> -------------------
>>
>> This ioctl tells supported hash types. It is useful to check if a VM can
>> be migrated to the current host.
>>
>> TUNSETVNETREPORTINGAUTOMQ, TUNSETVNETREPORTINGRSS, and TUNSETVNETRSS
>> --------------------------------------------------------------------
>>
>> These ioctls configures a steering algorithm and, if needed, hash
>> reporting.
>>
>> Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
>> Tested-by: Lei Yang <leiyang@redhat.com>
>> ---
>>   drivers/net/tap.c           |  10 ++-
>>   drivers/net/tun.c           |  12 +++-
>>   drivers/net/tun_vnet.h      | 165 +++++++++++++++++++++++++++++++++++++++++---
>>   include/uapi/linux/if_tun.h |  71 +++++++++++++++++++
>>   4 files changed, 244 insertions(+), 14 deletions(-)
>>
>> diff --git a/drivers/net/tap.c b/drivers/net/tap.c
>> index d4ece538f1b2..25c60ff2d3f2 100644
>> --- a/drivers/net/tap.c
>> +++ b/drivers/net/tap.c
>> @@ -179,6 +179,11 @@ static void tap_put_queue(struct tap_queue *q)
>>          sock_put(&q->sk);
>>   }
>>
>> +static const struct virtio_net_hash *tap_find_hash(const struct sk_buff *skb)
>> +{
>> +       return NULL;
>> +}
>> +
>>   /*
>>    * Select a queue based on the rxq of the device on which this packet
>>    * arrived. If the incoming device is not mq, calculate a flow hash
>> @@ -711,11 +716,12 @@ static ssize_t tap_put_user(struct tap_queue *q,
>>          int total;
>>
>>          if (q->flags & IFF_VNET_HDR) {
>> -               struct virtio_net_hdr vnet_hdr;
>> +               struct virtio_net_hdr_v1_hash vnet_hdr;
>>
>>                  vnet_hdr_len = READ_ONCE(q->vnet_hdr_sz);
>>
>> -               ret = tun_vnet_hdr_from_skb(q->flags, NULL, skb, &vnet_hdr);
>> +               ret = tun_vnet_hdr_from_skb(vnet_hdr_len, q->flags, NULL, skb,
>> +                                           tap_find_hash, &vnet_hdr);
>>                  if (ret)
>>                          return ret;
>>
>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
>> index 9133ab9ed3f5..03d47799e9bd 100644
>> --- a/drivers/net/tun.c
>> +++ b/drivers/net/tun.c
>> @@ -451,6 +451,11 @@ static inline void tun_flow_save_rps_rxhash(struct tun_flow_entry *e, u32 hash)
>>                  e->rps_rxhash = hash;
>>   }
>>
>> +static const struct virtio_net_hash *tun_find_hash(const struct sk_buff *skb)
>> +{
>> +       return NULL;
>> +}
>> +
>>   /* We try to identify a flow through its rxhash. The reason that
>>    * we do not check rxq no. is because some cards(e.g 82599), chooses
>>    * the rxq based on the txq where the last packet of the flow comes. As
>> @@ -1993,7 +1998,7 @@ static ssize_t tun_put_user_xdp(struct tun_struct *tun,
>>          ssize_t ret;
>>
>>          if (tun->flags & IFF_VNET_HDR) {
>> -               struct virtio_net_hdr gso = { 0 };
>> +               struct virtio_net_hdr_v1_hash gso = { 0 };
>>
>>                  vnet_hdr_sz = READ_ONCE(tun->vnet_hdr_sz);
>>                  ret = tun_vnet_hdr_put(vnet_hdr_sz, iter, &gso);
>> @@ -2046,9 +2051,10 @@ static ssize_t tun_put_user(struct tun_struct *tun,
>>          }
>>
>>          if (vnet_hdr_sz) {
>> -               struct virtio_net_hdr gso;
>> +               struct virtio_net_hdr_v1_hash gso;
>>
>> -               ret = tun_vnet_hdr_from_skb(tun->flags, tun->dev, skb, &gso);
>> +               ret = tun_vnet_hdr_from_skb(vnet_hdr_sz, tun->flags, tun->dev,
>> +                                           skb, tun_find_hash, &gso);
>>                  if (ret)
>>                          return ret;
>>
>> diff --git a/drivers/net/tun_vnet.h b/drivers/net/tun_vnet.h
>> index 58b9ac7a5fc4..45d0533efc8d 100644
>> --- a/drivers/net/tun_vnet.h
>> +++ b/drivers/net/tun_vnet.h
>> @@ -6,6 +6,17 @@
>>   #define TUN_VNET_LE     0x80000000
>>   #define TUN_VNET_BE     0x40000000
>>
>> +typedef struct virtio_net_hash *(*tun_vnet_hash_add)(struct sk_buff *);
>> +typedef const struct virtio_net_hash *(*tun_vnet_hash_find)(const struct sk_buff *);
>> +
>> +struct tun_vnet_hash {
>> +       bool report;
>> +       bool rss;
>> +       struct tun_vnet_rss common;
>> +       u32 rss_key[VIRTIO_NET_RSS_MAX_KEY_SIZE];
>> +       u16 rss_indirection_table[];
>> +};
>> +
>>   static inline bool tun_vnet_legacy_is_little_endian(unsigned int flags)
>>   {
>>          bool be = IS_ENABLED(CONFIG_TUN_VNET_CROSS_LE) &&
>> @@ -107,6 +118,128 @@ static inline long tun_vnet_ioctl(int *vnet_hdr_sz, unsigned int *flags,
>>          }
>>   }
>>
>> +static inline long tun_vnet_ioctl_gethashtypes(u32 __user *argp)
>> +{
>> +       return put_user(VIRTIO_NET_SUPPORTED_HASH_TYPES, argp) ? -EFAULT : 0;
>> +}
>> +
>> +static inline long tun_vnet_ioctl_sethash(struct tun_vnet_hash __rcu **hashp,
>> +                                         unsigned int cmd,
>> +                                         void __user *argp)
>> +{
>> +       struct tun_vnet_rss common;
>> +       struct tun_vnet_hash *hash;
>> +       size_t indirection_table_size;
>> +       size_t key_size;
>> +       size_t size;
>> +
>> +       switch (cmd) {
>> +       case TUNSETVNETREPORTINGAUTOMQ:
>> +               if (get_user(common.hash_types, (u32 __user *)argp))
>> +                       return -EFAULT;
>> +
>> +               if (common.hash_types) {
>> +                       hash = kzalloc(sizeof(*hash), GFP_KERNEL);
>> +                       if (!hash)
>> +                               return -ENOMEM;
>> +
>> +                       hash->report = true;
>> +                       hash->common.hash_types = common.hash_types;
>> +               } else {
>> +                       hash = NULL;
>> +               }
>> +               break;
>> +
>> +       case TUNSETVNETREPORTINGRSS:
>> +       case TUNSETVNETRSS:
> 
> So the above three shows unnecessary design redundancy as well as a
> burden for the future extension.  Why not simply have
> 
> 1) TUNSETVNET_RSS
> 2) TUNSETVNET_HASH_REPORT
> ?
> 
> Which maps to
> 
>   #define VIRTIO_NET_CTRL_MQ_RSS_CONFIG          1 (for configurable
> receive steering)
>   #define VIRTIO_NET_CTRL_MQ_HASH_CONFIG         2 (for configurable
> hash calculation)
> 
> It would be always easier to start with what spec had or at least we
> need to explain why we choose a different design here or in the
> changelog to ease our life.

TUNSETVNETREPORTINGAUTOMQ maps to VIRTIO_NET_CTRL_MQ_HASH_CONFIG.

TUNSETVNETREPORTINGRSS and TUNSETVNETRSS map to 
VIRTIO_NET_CTRL_MQ_RSS_CONFIG. We have two ioctls here because 
VIRTIO_NET_CTRL_MQ_RSS_CONFIG behaves differently depending on whether 
VIRTIO_NET_F_HASH_REPORT is negotiated or not; it also enables hash 
reporting if the feature is negotiated.

> 
> One day we would have tun_select_queue_algorithm_x() we don't have to
> duplicate the ioctls once again here like TUNSETVNETREPORTINGXYZ

5.1.6.5.6.4 Hash calculation says:
>  If VIRTIO_NET_F_HASH_REPORT was negotiated and the device uses
> automatic receive steering, the device MUST support a command to
> configure hash calculation parameters.
>
> The driver provides parameters for hash calculation as follows:
>
> class VIRTIO_NET_CTRL_MQ, command VIRTIO_NET_CTRL_MQ_HASH_CONFIG.

VIRTIO_NET_CTRL_MQ_HASH_CONFIG is for automatic receive steering and not 
for RSS (or other steering algorithms if the spec gets any in the future).

> 
>> +               if (copy_from_user(&common, argp, sizeof(common)))
>> +                       return -EFAULT;
>> +               argp = (struct tun_vnet_rss __user *)argp + 1;
>> +
>> +               indirection_table_size = ((size_t)common.indirection_table_mask + 1) * 2;
>> +               key_size = virtio_net_hash_key_length(common.hash_types);
>> +               size = struct_size(hash, rss_indirection_table,
>> +                                  (size_t)common.indirection_table_mask + 1);
>> +
>> +               hash = kmalloc(size, GFP_KERNEL);
>> +               if (!hash)
>> +                       return -ENOMEM;
>> +
>> +               if (copy_from_user(hash->rss_indirection_table,
>> +                                  argp, indirection_table_size)) {
>> +                       kfree(hash);
>> +                       return -EFAULT;
>> +               }
>> +               argp = (u16 __user *)argp + common.indirection_table_mask + 1;
>> +
>> +               if (copy_from_user(hash->rss_key, argp, key_size)) {
>> +                       kfree(hash);
>> +                       return -EFAULT;
>> +               }
>> +
>> +               virtio_net_toeplitz_convert_key(hash->rss_key, key_size);
>> +               hash->report = cmd == TUNSETVNETREPORTINGRSS;
> 
> At least, if this is the only difference why not simply code this into
> the ioctl itself other than having a very similar command?

It is what the previous version did. Either is fine I guess; the only 
practical difference would be the size of the configuration struct is 
smaller with this approach.

> 
>> +               hash->rss = true;
>> +               hash->common = common;
>> +               break;
>> +
>> +       default:
>> +               return -EINVAL;
>> +       }
>> +
>> +       kfree_rcu_mightsleep(rcu_replace_pointer_rtnl(*hashp, hash));
>> +       return 0;
>> +}
>> +
>> +static inline void tun_vnet_hash_report(const struct tun_vnet_hash *hash,
>> +                                       struct sk_buff *skb,
>> +                                       const struct flow_keys_basic *keys,
>> +                                       u32 value,
>> +                                       tun_vnet_hash_add vnet_hash_add)
>> +{
>> +       struct virtio_net_hash *report;
>> +
>> +       if (!hash || !hash->report)
>> +               return;
>> +
>> +       report = vnet_hash_add(skb);
>> +       if (!report)
>> +               return;
>> +
>> +       *report = (struct virtio_net_hash) {
>> +               .report = virtio_net_hash_report(hash->common.hash_types, keys),
>> +               .value = value
>> +       };
>> +}
>> +
>> +static inline u16 tun_vnet_rss_select_queue(u32 numqueues,
>> +                                           const struct tun_vnet_hash *hash,
>> +                                           struct sk_buff *skb,
>> +                                           tun_vnet_hash_add vnet_hash_add)
>> +{
>> +       struct virtio_net_hash *report;
>> +       struct virtio_net_hash ret;
>> +       u16 index;
>> +
>> +       if (!numqueues)
>> +               return 0;
>> +
>> +       virtio_net_hash_rss(skb, hash->common.hash_types, hash->rss_key, &ret);
>> +
>> +       if (!ret.report)
>> +               return hash->common.unclassified_queue % numqueues;
>> +
>> +       if (hash->report) {
>> +               report = vnet_hash_add(skb);
>> +               if (report)
>> +                       *report = ret;
>> +       }
>> +
>> +       index = ret.value & hash->common.indirection_table_mask;
>> +
>> +       return hash->rss_indirection_table[index] % numqueues;
>> +}
>> +
>>   static inline int tun_vnet_hdr_get(int sz, unsigned int flags,
>>                                     struct iov_iter *from,
>>                                     struct virtio_net_hdr *hdr)
>> @@ -135,15 +268,17 @@ static inline int tun_vnet_hdr_get(int sz, unsigned int flags,
>>   }
>>
>>   static inline int tun_vnet_hdr_put(int sz, struct iov_iter *iter,
>> -                                  const struct virtio_net_hdr *hdr)
>> +                                  const struct virtio_net_hdr_v1_hash *hdr)
>>   {
>> +       int content_sz = MIN(sizeof(*hdr), sz);
>> +
>>          if (unlikely(iov_iter_count(iter) < sz))
>>                  return -EINVAL;
>>
>> -       if (unlikely(copy_to_iter(hdr, sizeof(*hdr), iter) != sizeof(*hdr)))
>> +       if (unlikely(copy_to_iter(hdr, content_sz, iter) != content_sz))
>>                  return -EFAULT;
>>
>> -       if (iov_iter_zero(sz - sizeof(*hdr), iter) != sz - sizeof(*hdr))
>> +       if (iov_iter_zero(sz - content_sz, iter) != sz - content_sz)
>>                  return -EFAULT;
>>
>>          return 0;
>> @@ -155,26 +290,38 @@ static inline int tun_vnet_hdr_to_skb(unsigned int flags, struct sk_buff *skb,
>>          return virtio_net_hdr_to_skb(skb, hdr, tun_vnet_is_little_endian(flags));
>>   }
>>
>> -static inline int tun_vnet_hdr_from_skb(unsigned int flags,
>> +static inline int tun_vnet_hdr_from_skb(int sz, unsigned int flags,
>>                                          const struct net_device *dev,
>>                                          const struct sk_buff *skb,
>> -                                       struct virtio_net_hdr *hdr)
>> +                                       tun_vnet_hash_find vnet_hash_find,
>> +                                       struct virtio_net_hdr_v1_hash *hdr)
>>   {
>>          int vlan_hlen = skb_vlan_tag_present(skb) ? VLAN_HLEN : 0;
>> +       const struct virtio_net_hash *report = sz < sizeof(struct virtio_net_hdr_v1_hash) ?
>> +                                              NULL : vnet_hash_find(skb);
>> +
>> +       *hdr = (struct virtio_net_hdr_v1_hash) {
>> +               .hash_report = VIRTIO_NET_HASH_REPORT_NONE
>> +       };
>> +
>> +       if (report) {
>> +               hdr->hash_value = cpu_to_le32(report->value);
>> +               hdr->hash_report = cpu_to_le16(report->report);
>> +       }
>>
>> -       if (virtio_net_hdr_from_skb(skb, hdr,
>> +       if (virtio_net_hdr_from_skb(skb, (struct virtio_net_hdr *)hdr,
>>                                      tun_vnet_is_little_endian(flags), true,
>>                                      vlan_hlen)) {
>>                  struct skb_shared_info *sinfo = skb_shinfo(skb);
>>
>>                  if (net_ratelimit()) {
>>                          netdev_err(dev, "unexpected GSO type: 0x%x, gso_size %d, hdr_len %d\n",
>> -                                  sinfo->gso_type, tun_vnet16_to_cpu(flags, hdr->gso_size),
>> -                                  tun_vnet16_to_cpu(flags, hdr->hdr_len));
>> +                                  sinfo->gso_type, tun_vnet16_to_cpu(flags, hdr->hdr.gso_size),
>> +                                  tun_vnet16_to_cpu(flags, hdr->hdr.hdr_len));
>>                          print_hex_dump(KERN_ERR, "tun: ",
>>                                         DUMP_PREFIX_NONE,
>>                                         16, 1, skb->head,
>> -                                      min(tun_vnet16_to_cpu(flags, hdr->hdr_len), 64), true);
>> +                                      min(tun_vnet16_to_cpu(flags, hdr->hdr.hdr_len), 64), true);
>>                  }
>>                  WARN_ON_ONCE(1);
>>                  return -EINVAL;
>> diff --git a/include/uapi/linux/if_tun.h b/include/uapi/linux/if_tun.h
>> index 980de74724fc..fe4b984d3bbb 100644
>> --- a/include/uapi/linux/if_tun.h
>> +++ b/include/uapi/linux/if_tun.h
>> @@ -62,6 +62,62 @@
>>   #define TUNSETCARRIER _IOW('T', 226, int)
>>   #define TUNGETDEVNETNS _IO('T', 227)
>>
>> +/**
>> + * define TUNGETVNETHASHTYPES - ioctl to get supported virtio_net hashing types
>> + *
>> + * The argument is a pointer to __u32 which will store the supported virtio_net
>> + * hashing types.
>> + */
>> +#define TUNGETVNETHASHTYPES _IOR('T', 228, __u32)
>> +
>> +/**
>> + * define TUNSETVNETREPORTINGAUTOMQ - ioctl to enable automq with hash reporting
>> + *
>> + * Disable RSS and enable automatic receive steering with hash reporting.
>> + *
>> + * The argument is a pointer to __u32 that contains a bitmask of hash types
>> + * allowed to be reported.
>> + *
>> + * This ioctl results in %EBADFD if the underlying device is deleted. It affects
>> + * all queues attached to the same device.
>> + *
>> + * This ioctl currently has no effect on XDP packets and packets with
>> + * queue_mapping set by TC.
>> + */
>> +#define TUNSETVNETREPORTINGAUTOMQ _IOR('T', 229, __u32)
>> +
>> +/**
>> + * define TUNSETVNETREPORTINGRSS - ioctl to enable RSS with hash reporting
>> + *
>> + * Disable automatic receive steering and enable RSS with hash reporting.
> 
> This is unnecessary, e.g one day will have select_queue_xyz(), we
> don't want to say "Disable automatic receive steering and xyz ..."

It is still something better to be documented as its behavior is 
somewhat complicated.

Concretely, this ioctl disables automatic receive steering but doesn't 
disable steering by eBPF, which is implied by TUN_STEERINGEBPF_FALLBACK.

Perhaps it may be rephrased to tell it disables "the other receive 
steering mechanism except eBPF".

Regards,
Akihiko Odaki

> 
>> + *
>> + * This ioctl results in %EBADFD if the underlying device is deleted. It affects
>> + * all queues attached to the same device.
>> + *
>> + * This ioctl currently has no effect on XDP packets and packets with
>> + * queue_mapping set by TC.
>> + */
>> +#define TUNSETVNETREPORTINGRSS _IOR('T', 230, struct tun_vnet_rss)
>> +
>> +/**
>> + * define TUNSETVNETRSS - ioctl to enable RSS without hash reporting
>> + *
>> + * Disable automatic receive steering and enable RSS without hash reporting.
> 
> Same here.
> 
>> + *
>> + * The argument is a pointer to the compound of the following in order:
>> + *
>> + * 1. &struct tun_vnet_rss
>> + * 3. Indirection table
>> + * 4. Key
>> + *
>> + * This ioctl results in %EBADFD if the underlying device is deleted. It affects
>> + * all queues attached to the same device.
>> + *
>> + * This ioctl currently has no effect on XDP packets and packets with
>> + * queue_mapping set by TC.
>> + */
>> +#define TUNSETVNETRSS _IOR('T', 231, struct tun_vnet_rss)
>> +
>>   /* TUNSETIFF ifr flags */
>>   #define IFF_TUN                0x0001
>>   #define IFF_TAP                0x0002
>> @@ -124,4 +180,19 @@ struct tun_filter {
>>    */
>>   #define TUN_STEERINGEBPF_FALLBACK -1
>>
>> +/**
>> + * struct tun_vnet_rss - virtio_net RSS configuration
>> + * @hash_types:
>> + *             Bitmask of allowed hash types
>> + * @indirection_table_mask:
>> + *             Bitmask to be applied to the indirection table index
>> + * @unclassified_queue:
>> + *             The index of the queue to place unclassified packets in
>> + */
>> +struct tun_vnet_rss {
>> +       __u32 hash_types;
>> +       __u16 indirection_table_mask;
>> +       __u16 unclassified_queue;
>> +};
>> +
>>   #endif /* _UAPI__IF_TUN_H */
>>
>> --
>> 2.49.0
>>
> 
> Thanks
> 


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next v12 01/10] virtio_net: Add functions for hashing
  2025-06-04  7:20         ` Akihiko Odaki
@ 2025-06-05  1:53           ` Jason Wang
  2025-06-05  7:57             ` Akihiko Odaki
  0 siblings, 1 reply; 39+ messages in thread
From: Jason Wang @ 2025-06-05  1:53 UTC (permalink / raw)
  To: Akihiko Odaki
  Cc: Jonathan Corbet, Willem de Bruijn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Michael S. Tsirkin, Xuan Zhuo,
	Shuah Khan, linux-doc, linux-kernel, netdev, kvm, virtualization,
	linux-kselftest, Yuri Benditovich, Andrew Melnychenko,
	Stephen Hemminger, gur.stavi, Lei Yang, Simon Horman

On Wed, Jun 4, 2025 at 3:20 PM Akihiko Odaki <akihiko.odaki@daynix.com> wrote:
>
> On 2025/06/04 10:18, Jason Wang wrote:
> > On Tue, Jun 3, 2025 at 1:31 PM Akihiko Odaki <akihiko.odaki@daynix.com> wrote:
> >>
> >> On 2025/06/03 12:19, Jason Wang wrote:
> >>> On Fri, May 30, 2025 at 12:50 PM Akihiko Odaki <akihiko.odaki@daynix.com> wrote:
> >>>>
> >>>> They are useful to implement VIRTIO_NET_F_RSS and
> >>>> VIRTIO_NET_F_HASH_REPORT.
> >>>>
> >>>> Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
> >>>> Tested-by: Lei Yang <leiyang@redhat.com>
> >>>> ---
> >>>>    include/linux/virtio_net.h | 188 +++++++++++++++++++++++++++++++++++++++++++++
> >>>>    1 file changed, 188 insertions(+)
> >>>>
> >>>> diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h
> >>>> index 02a9f4dc594d..426f33b4b824 100644
> >>>> --- a/include/linux/virtio_net.h
> >>>> +++ b/include/linux/virtio_net.h
> >>>> @@ -9,6 +9,194 @@
> >>>>    #include <uapi/linux/tcp.h>
> >>>>    #include <uapi/linux/virtio_net.h>
> >>>>
> >>>> +struct virtio_net_hash {
> >>>> +       u32 value;
> >>>> +       u16 report;
> >>>> +};
> >>>> +
> >>>> +struct virtio_net_toeplitz_state {
> >>>> +       u32 hash;
> >>>> +       const u32 *key;
> >>>> +};
> >>>> +
> >>>> +#define VIRTIO_NET_SUPPORTED_HASH_TYPES (VIRTIO_NET_RSS_HASH_TYPE_IPv4 | \
> >>>> +                                        VIRTIO_NET_RSS_HASH_TYPE_TCPv4 | \
> >>>> +                                        VIRTIO_NET_RSS_HASH_TYPE_UDPv4 | \
> >>>> +                                        VIRTIO_NET_RSS_HASH_TYPE_IPv6 | \
> >>>> +                                        VIRTIO_NET_RSS_HASH_TYPE_TCPv6 | \
> >>>> +                                        VIRTIO_NET_RSS_HASH_TYPE_UDPv6)
> >>>> +
> >>>> +#define VIRTIO_NET_RSS_MAX_KEY_SIZE 40
> >>>> +
> >>>> +static inline void virtio_net_toeplitz_convert_key(u32 *input, size_t len)
> >>>> +{
> >>>> +       while (len >= sizeof(*input)) {
> >>>> +               *input = be32_to_cpu((__force __be32)*input);
> >>>> +               input++;
> >>>> +               len -= sizeof(*input);
> >>>> +       }
> >>>> +}
> >>>> +
> >>>> +static inline void virtio_net_toeplitz_calc(struct virtio_net_toeplitz_state *state,
> >>>> +                                           const __be32 *input, size_t len)
> >>>> +{
> >>>> +       while (len >= sizeof(*input)) {
> >>>> +               for (u32 map = be32_to_cpu(*input); map; map &= (map - 1)) {
> >>>> +                       u32 i = ffs(map);
> >>>> +
> >>>> +                       state->hash ^= state->key[0] << (32 - i) |
> >>>> +                                      (u32)((u64)state->key[1] >> i);
> >>>> +               }
> >>>> +
> >>>> +               state->key++;
> >>>> +               input++;
> >>>> +               len -= sizeof(*input);
> >>>> +       }
> >>>> +}
> >>>> +
> >>>> +static inline u8 virtio_net_hash_key_length(u32 types)
> >>>> +{
> >>>> +       size_t len = 0;
> >>>> +
> >>>> +       if (types & VIRTIO_NET_HASH_REPORT_IPv4)
> >>>> +               len = max(len,
> >>>> +                         sizeof(struct flow_dissector_key_ipv4_addrs));
> >>>> +
> >>>> +       if (types &
> >>>> +           (VIRTIO_NET_HASH_REPORT_TCPv4 | VIRTIO_NET_HASH_REPORT_UDPv4))
> >>>> +               len = max(len,
> >>>> +                         sizeof(struct flow_dissector_key_ipv4_addrs) +
> >>>> +                         sizeof(struct flow_dissector_key_ports));
> >>>> +
> >>>> +       if (types & VIRTIO_NET_HASH_REPORT_IPv6)
> >>>> +               len = max(len,
> >>>> +                         sizeof(struct flow_dissector_key_ipv6_addrs));
> >>>> +
> >>>> +       if (types &
> >>>> +           (VIRTIO_NET_HASH_REPORT_TCPv6 | VIRTIO_NET_HASH_REPORT_UDPv6))
> >>>> +               len = max(len,
> >>>> +                         sizeof(struct flow_dissector_key_ipv6_addrs) +
> >>>> +                         sizeof(struct flow_dissector_key_ports));
> >>>> +
> >>>> +       return len + sizeof(u32);
> >>>> +}
> >>>> +
> >>>> +static inline u32 virtio_net_hash_report(u32 types,
> >>>> +                                        const struct flow_keys_basic *keys)
> >>>> +{
> >>>> +       switch (keys->basic.n_proto) {
> >>>> +       case cpu_to_be16(ETH_P_IP):
> >>>> +               if (!(keys->control.flags & FLOW_DIS_IS_FRAGMENT)) {
> >>>> +                       if (keys->basic.ip_proto == IPPROTO_TCP &&
> >>>> +                           (types & VIRTIO_NET_RSS_HASH_TYPE_TCPv4))
> >>>> +                               return VIRTIO_NET_HASH_REPORT_TCPv4;
> >>>> +
> >>>> +                       if (keys->basic.ip_proto == IPPROTO_UDP &&
> >>>> +                           (types & VIRTIO_NET_RSS_HASH_TYPE_UDPv4))
> >>>> +                               return VIRTIO_NET_HASH_REPORT_UDPv4;
> >>>> +               }
> >>>> +
> >>>> +               if (types & VIRTIO_NET_RSS_HASH_TYPE_IPv4)
> >>>> +                       return VIRTIO_NET_HASH_REPORT_IPv4;
> >>>> +
> >>>> +               return VIRTIO_NET_HASH_REPORT_NONE;
> >>>> +
> >>>> +       case cpu_to_be16(ETH_P_IPV6):
> >>>> +               if (!(keys->control.flags & FLOW_DIS_IS_FRAGMENT)) {
> >>>> +                       if (keys->basic.ip_proto == IPPROTO_TCP &&
> >>>> +                           (types & VIRTIO_NET_RSS_HASH_TYPE_TCPv6))
> >>>> +                               return VIRTIO_NET_HASH_REPORT_TCPv6;
> >>>> +
> >>>> +                       if (keys->basic.ip_proto == IPPROTO_UDP &&
> >>>> +                           (types & VIRTIO_NET_RSS_HASH_TYPE_UDPv6))
> >>>> +                               return VIRTIO_NET_HASH_REPORT_UDPv6;
> >>>> +               }
> >>>> +
> >>>> +               if (types & VIRTIO_NET_RSS_HASH_TYPE_IPv6)
> >>>> +                       return VIRTIO_NET_HASH_REPORT_IPv6;
> >>>> +
> >>>> +               return VIRTIO_NET_HASH_REPORT_NONE;
> >>>> +
> >>>> +       default:
> >>>> +               return VIRTIO_NET_HASH_REPORT_NONE;
> >>>> +       }
> >>>> +}
> >>>> +
> >>>> +static inline void virtio_net_hash_rss(const struct sk_buff *skb,
> >>>> +                                      u32 types, const u32 *key,
> >>>> +                                      struct virtio_net_hash *hash)
> >>>> +{
> >>>> +       struct virtio_net_toeplitz_state toeplitz_state = { .key = key };
> >>>> +       struct flow_keys flow;
> >>>> +       struct flow_keys_basic flow_basic;
> >>>> +       u16 report;
> >>>> +
> >>>> +       if (!skb_flow_dissect_flow_keys(skb, &flow, 0)) {
> >>>> +               hash->report = VIRTIO_NET_HASH_REPORT_NONE;
> >>>> +               return;
> >>>> +       }
> >>>> +
> >>>> +       flow_basic = (struct flow_keys_basic) {
> >>>> +               .control = flow.control,
> >>>> +               .basic = flow.basic
> >>>> +       };
> >>>> +
> >>>> +       report = virtio_net_hash_report(types, &flow_basic);
> >>>> +
> >>>> +       switch (report) {
> >>>> +       case VIRTIO_NET_HASH_REPORT_IPv4:
> >>>> +               virtio_net_toeplitz_calc(&toeplitz_state,
> >>>> +                                        (__be32 *)&flow.addrs.v4addrs,
> >>>> +                                        sizeof(flow.addrs.v4addrs));
> >>>> +               break;
> >>>> +
> >>>> +       case VIRTIO_NET_HASH_REPORT_TCPv4:
> >>>> +               virtio_net_toeplitz_calc(&toeplitz_state,
> >>>> +                                        (__be32 *)&flow.addrs.v4addrs,
> >>>> +                                        sizeof(flow.addrs.v4addrs));
> >>>> +               virtio_net_toeplitz_calc(&toeplitz_state, &flow.ports.ports,
> >>>> +                                        sizeof(flow.ports.ports));
> >>>> +               break;
> >>>> +
> >>>> +       case VIRTIO_NET_HASH_REPORT_UDPv4:
> >>>> +               virtio_net_toeplitz_calc(&toeplitz_state,
> >>>> +                                        (__be32 *)&flow.addrs.v4addrs,
> >>>> +                                        sizeof(flow.addrs.v4addrs));
> >>>> +               virtio_net_toeplitz_calc(&toeplitz_state, &flow.ports.ports,
> >>>> +                                        sizeof(flow.ports.ports));
> >>>> +               break;
> >>>> +
> >>>> +       case VIRTIO_NET_HASH_REPORT_IPv6:
> >>>> +               virtio_net_toeplitz_calc(&toeplitz_state,
> >>>> +                                        (__be32 *)&flow.addrs.v6addrs,
> >>>> +                                        sizeof(flow.addrs.v6addrs));
> >>>> +               break;
> >>>> +
> >>>> +       case VIRTIO_NET_HASH_REPORT_TCPv6:
> >>>> +               virtio_net_toeplitz_calc(&toeplitz_state,
> >>>> +                                        (__be32 *)&flow.addrs.v6addrs,
> >>>> +                                        sizeof(flow.addrs.v6addrs));
> >>>> +               virtio_net_toeplitz_calc(&toeplitz_state, &flow.ports.ports,
> >>>> +                                        sizeof(flow.ports.ports));
> >>>> +               break;
> >>>> +
> >>>> +       case VIRTIO_NET_HASH_REPORT_UDPv6:
> >>>> +               virtio_net_toeplitz_calc(&toeplitz_state,
> >>>> +                                        (__be32 *)&flow.addrs.v6addrs,
> >>>> +                                        sizeof(flow.addrs.v6addrs));
> >>>> +               virtio_net_toeplitz_calc(&toeplitz_state, &flow.ports.ports,
> >>>> +                                        sizeof(flow.ports.ports));
> >>>> +               break;
> >>>> +
> >>>> +       default:
> >>>> +               hash->report = VIRTIO_NET_HASH_REPORT_NONE;
> >>>> +               return;
> >>>
> >>> So I still think we need a comment here to explain why this is not an
> >>> issue if the device can report HASH_XXX_EX. Or we need to add the
> >>> support, since this is the code from the driver side, I don't think we
> >>> need to worry about the device implementation issues.
> >>
> >> This is on the device side, and don't report HASH_TYPE_XXX_EX.
> >>
> >>>
> >>> For the issue of the number of options, does the spec forbid fallback
> >>> to VIRTIO_NET_HASH_REPORT_NONE? If not, we can do that.
> >>
> >> 5.1.6.4.3.4 "IPv6 packets with extension header" says:
> >>   > If VIRTIO_NET_HASH_TYPE_TCP_EX is set and the packet has a TCPv6
> >>   > header, the hash is calculated over the following fields:
> >>   > - Home address from the home address option in the IPv6 destination
> >>   >   options header. If the extension header is not present, use the
> >>   >   Source IPv6 address.
> >>   > - IPv6 address that is contained in the Routing-Header-Type-2 from the
> >>   >   associated extension header. If the extension header is not present,
> >>   >   use the Destination IPv6 address.
> >>   > - Source TCP port
> >>   > - Destination TCP port
> >>
> >> Therefore, if VIRTIO_NET_HASH_TYPE_TCP_EX is set, the packet has a TCPv6
> >> and an home address option in the IPv6 destination options header is
> >> present, the hash is calculated over the home address. If the hash is
> >> not calculated over the home address in such a case, the device is
> >> contradicting with this section and violating the spec. The same goes
> >> for the other HASH_TYPE_XXX_EX types and Routing-Header-Type-2.
> >
> > Just to make sure we are one the same page. I meant:
> >
> > 1) If the hash is not calculated over the home address (in the case of
> > IPv6 destination destination), it can still report
> > VIRTIO_NET_RSS_HASH_TYPE_IPv6. This is what you implemented in your
> > series. So the device can simply fallback to e.g TCPv6 if it can't
> > understand all or part of the IPv6 options.
>
> The spec says it can fallback if "the extension header is not present",
> not if the device can't understand the extension header.

I don't think so,

1) spec had a condition beforehand:

"""
If VIRTIO_NET_HASH_TYPE_TCP_EX is set and the packet has a TCPv6
header, the hash is calculated over the following fields:
...
If the extension header is not present ...
"""

So the device can choose not to set VIRTIO_NET_HASH_TYPE_TCP_EX as
spec doesn't say device MUST set VIRTIO_NET_HASH_TYPE_TCP_EX if ...

2) implementation wise, since device has limited resources, we can't
expect the device can parse arbitrary number of ipv6 options

3) if 1) and 2) not the case, we need fix the spec otherwise implement
a spec compliant device is impractical

>
> > 2) the VIRTIO_NET_SUPPORTED_HASH_TYPES is not checked against the
> > tun_vnet_ioctl_sethash(), so userspace may set
> > VIRTIO_NET_HASH_TYPE_TCP_EX regardless of what has been returned by
> > tun_vnet_ioctl_gethashtypes(). In this case they won't get
> > VIRTIO_NET_HASH_TYPE_TCP_EX.
>
> That's right. It's the responsibility of the userspace to set only the
> supported hash types.

Well, the kernel should filter out the unsupported one to have a
robust uAPI. Otherwise, we give green light to the buggy userspace
which will have unexpected results.

>
> > 3) implementing part of the hash types might complicate the migration
> > or at least we need to describe the expectations of libvirt or other
> > management in this case. For example, do we plan to have a dedicated
> > Qemu command line like:
> >
> > -device virtio-net-pci,hash_report=on,supported_hash_types=X,Y,Z?
>
> I posted a patch series to implement such a command line for vDPA[1].
> The patch series that wires this tuntap feature up[2] reuses the
> infrastructure so it doesn't bring additional complexity.
>
> [1]
> https://lore.kernel.org/qemu-devel/20250530-vdpa-v1-0-5af4109b1c19@daynix.com/
> [2]
> https://lore.kernel.org/qemu-devel/20250530-hash-v5-0-343d7d7a8200@daynix.com/

I meant, if we implement a full hash report feature, it means a single
hash cmdline option is more than sufficient and so compatibility code
can just turn it off when dealing with machine types. This is much
more simpler than

1) having both hash as well as supported_hash_features
2) dealing both hash as well as supported_hash_features in compatibility codes
3) libvirt will be happy

For [1], it seems it introduces a per has type option, this seems to
be a burden to the management layer as it need to learn new option
everytime a new hash type is supported

Thanks

>
> Regards,
> Akihiko Odaki
>


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next v12 03/10] tun: Allow steering eBPF program to fall back
  2025-06-04  7:24     ` Akihiko Odaki
@ 2025-06-05  1:55       ` Jason Wang
  0 siblings, 0 replies; 39+ messages in thread
From: Jason Wang @ 2025-06-05  1:55 UTC (permalink / raw)
  To: Akihiko Odaki
  Cc: Jonathan Corbet, Willem de Bruijn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Michael S. Tsirkin, Xuan Zhuo,
	Shuah Khan, linux-doc, linux-kernel, netdev, kvm, virtualization,
	linux-kselftest, Yuri Benditovich, Andrew Melnychenko,
	Stephen Hemminger, gur.stavi, Lei Yang, Simon Horman

On Wed, Jun 4, 2025 at 3:24 PM Akihiko Odaki <akihiko.odaki@daynix.com> wrote:
>
> On 2025/06/04 10:27, Jason Wang wrote:
> > On Fri, May 30, 2025 at 12:50 PM Akihiko Odaki <akihiko.odaki@daynix.com> wrote:
> >>
> >> This clarifies a steering eBPF program takes precedence over the other
> >> steering algorithms.
> >
> > Let's give an example on the use case for this.
> >
> >>
> >> Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
> >> ---
> >>   Documentation/networking/tuntap.rst |  7 +++++++
> >>   drivers/net/tun.c                   | 28 +++++++++++++++++-----------
> >>   include/uapi/linux/if_tun.h         |  9 +++++++++
> >>   3 files changed, 33 insertions(+), 11 deletions(-)
> >>
> >> diff --git a/Documentation/networking/tuntap.rst b/Documentation/networking/tuntap.rst
> >> index 4d7087f727be..86b4ae8caa8a 100644
> >> --- a/Documentation/networking/tuntap.rst
> >> +++ b/Documentation/networking/tuntap.rst
> >> @@ -206,6 +206,13 @@ enable is true we enable it, otherwise we disable it::
> >>         return ioctl(fd, TUNSETQUEUE, (void *)&ifr);
> >>     }
> >>
> >> +3.4 Reference
> >> +-------------
> >> +
> >> +``linux/if_tun.h`` defines the interface described below:
> >> +
> >> +.. kernel-doc:: include/uapi/linux/if_tun.h
> >> +
> >>   Universal TUN/TAP device driver Frequently Asked Question
> >>   =========================================================
> >>
> >> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> >> index d8f4d3e996a7..9133ab9ed3f5 100644
> >> --- a/drivers/net/tun.c
> >> +++ b/drivers/net/tun.c
> >> @@ -476,21 +476,29 @@ static u16 tun_automq_select_queue(struct tun_struct *tun, struct sk_buff *skb)
> >>          return txq;
> >>   }
> >>
> >> -static u16 tun_ebpf_select_queue(struct tun_struct *tun, struct sk_buff *skb)
> >> +static bool tun_ebpf_select_queue(struct tun_struct *tun, struct sk_buff *skb,
> >> +                                 u16 *ret)
> >>   {
> >>          struct tun_prog *prog;
> >>          u32 numqueues;
> >> -       u16 ret = 0;
> >> +       u32 prog_ret;
> >> +
> >> +       prog = rcu_dereference(tun->steering_prog);
> >> +       if (!prog)
> >> +               return false;
> >>
> >>          numqueues = READ_ONCE(tun->numqueues);
> >> -       if (!numqueues)
> >> -               return 0;
> >> +       if (!numqueues) {
> >> +               *ret = 0;
> >> +               return true;
> >> +       }
> >>
> >> -       prog = rcu_dereference(tun->steering_prog);
> >> -       if (prog)
> >> -               ret = bpf_prog_run_clear_cb(prog->prog, skb);
> >> +       prog_ret = bpf_prog_run_clear_cb(prog->prog, skb);
> >> +       if (prog_ret == TUN_STEERINGEBPF_FALLBACK)
> >> +               return false;
> >
> > This seems to break the uAPI. So I think we need a new ioctl to enable
> > the behaviour
>
> I assumed it is fine to repurpose one of the 32-bit integer values since
> 32-bit integer is too big to specify the queue number, but it may not be
> fine. I don't have a concrete use case either.
>
> Perhaps it is safer to note that TUNSETSTEERINGEBPF takes precedence
> over TUNSETVNETRSS to allow such an extension in the future (but without
> implementing one now).

Yes, that's my original point. Let's start from something that is
simpler and safer.

Thanks


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next v12 04/10] tun: Add common virtio-net hash feature code
  2025-06-04  8:42     ` Akihiko Odaki
@ 2025-06-05  2:46       ` Jason Wang
  2025-06-05  8:18         ` Akihiko Odaki
  0 siblings, 1 reply; 39+ messages in thread
From: Jason Wang @ 2025-06-05  2:46 UTC (permalink / raw)
  To: Akihiko Odaki
  Cc: Jonathan Corbet, Willem de Bruijn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Michael S. Tsirkin, Xuan Zhuo,
	Shuah Khan, linux-doc, linux-kernel, netdev, kvm, virtualization,
	linux-kselftest, Yuri Benditovich, Andrew Melnychenko,
	Stephen Hemminger, gur.stavi, Lei Yang, Simon Horman

On Wed, Jun 4, 2025 at 4:42 PM Akihiko Odaki <akihiko.odaki@daynix.com> wrote:
>
> On 2025/06/04 10:53, Jason Wang wrote:
> > On Fri, May 30, 2025 at 12:50 PM Akihiko Odaki <akihiko.odaki@daynix.com> wrote:
> >>
> >> Add common code required for the features being added to TUN and TAP.
> >> They will be enabled for each of them in following patches.
> >>
> >> Added Features
> >> ==============
> >>
> >> Hash reporting
> >> --------------
> >>
> >> Allow the guest to reuse the hash value to make receive steering
> >> consistent between the host and guest, and to save hash computation.
> >>
> >> Receive Side Scaling (RSS)
> >> --------------------------
> >>
> >> RSS is a receive steering algorithm that can be negotiated to use with
> >> virtio_net. Conventionally the hash calculation was done by the VMM.
> >> However, computing the hash after the queue was chosen defeats the
> >> purpose of RSS.
> >>
> >> Another approach is to use eBPF steering program. This approach has
> >> another downside: it cannot report the calculated hash due to the
> >> restrictive nature of eBPF steering program.
> >>
> >> Introduce the code to perform RSS to the kernel in order to overcome
> >> thse challenges. An alternative solution is to extend the eBPF steering
> >> program so that it will be able to report to the userspace, but I didn't
> >> opt for it because extending the current mechanism of eBPF steering
> >> program as is because it relies on legacy context rewriting, and
> >> introducing kfunc-based eBPF will result in non-UAPI dependency while
> >> the other relevant virtualization APIs such as KVM and vhost_net are
> >> UAPIs.
> >>
> >> Added ioctls
> >> ============
> >>
> >> They are designed to make extensibility and VM migration compatible.
> >> This change only adds the implementation and does not expose them to
> >> the userspace.
> >>
> >> TUNGETVNETHASHTYPES
> >> -------------------
> >>
> >> This ioctl tells supported hash types. It is useful to check if a VM can
> >> be migrated to the current host.
> >>
> >> TUNSETVNETREPORTINGAUTOMQ, TUNSETVNETREPORTINGRSS, and TUNSETVNETRSS
> >> --------------------------------------------------------------------
> >>
> >> These ioctls configures a steering algorithm and, if needed, hash
> >> reporting.
> >>
> >> Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
> >> Tested-by: Lei Yang <leiyang@redhat.com>
> >> ---
> >>   drivers/net/tap.c           |  10 ++-
> >>   drivers/net/tun.c           |  12 +++-
> >>   drivers/net/tun_vnet.h      | 165 +++++++++++++++++++++++++++++++++++++++++---
> >>   include/uapi/linux/if_tun.h |  71 +++++++++++++++++++
> >>   4 files changed, 244 insertions(+), 14 deletions(-)
> >>
> >> diff --git a/drivers/net/tap.c b/drivers/net/tap.c
> >> index d4ece538f1b2..25c60ff2d3f2 100644
> >> --- a/drivers/net/tap.c
> >> +++ b/drivers/net/tap.c
> >> @@ -179,6 +179,11 @@ static void tap_put_queue(struct tap_queue *q)
> >>          sock_put(&q->sk);
> >>   }
> >>
> >> +static const struct virtio_net_hash *tap_find_hash(const struct sk_buff *skb)
> >> +{
> >> +       return NULL;
> >> +}
> >> +
> >>   /*
> >>    * Select a queue based on the rxq of the device on which this packet
> >>    * arrived. If the incoming device is not mq, calculate a flow hash
> >> @@ -711,11 +716,12 @@ static ssize_t tap_put_user(struct tap_queue *q,
> >>          int total;
> >>
> >>          if (q->flags & IFF_VNET_HDR) {
> >> -               struct virtio_net_hdr vnet_hdr;
> >> +               struct virtio_net_hdr_v1_hash vnet_hdr;
> >>
> >>                  vnet_hdr_len = READ_ONCE(q->vnet_hdr_sz);
> >>
> >> -               ret = tun_vnet_hdr_from_skb(q->flags, NULL, skb, &vnet_hdr);
> >> +               ret = tun_vnet_hdr_from_skb(vnet_hdr_len, q->flags, NULL, skb,
> >> +                                           tap_find_hash, &vnet_hdr);
> >>                  if (ret)
> >>                          return ret;
> >>
> >> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> >> index 9133ab9ed3f5..03d47799e9bd 100644
> >> --- a/drivers/net/tun.c
> >> +++ b/drivers/net/tun.c
> >> @@ -451,6 +451,11 @@ static inline void tun_flow_save_rps_rxhash(struct tun_flow_entry *e, u32 hash)
> >>                  e->rps_rxhash = hash;
> >>   }
> >>
> >> +static const struct virtio_net_hash *tun_find_hash(const struct sk_buff *skb)
> >> +{
> >> +       return NULL;
> >> +}
> >> +
> >>   /* We try to identify a flow through its rxhash. The reason that
> >>    * we do not check rxq no. is because some cards(e.g 82599), chooses
> >>    * the rxq based on the txq where the last packet of the flow comes. As
> >> @@ -1993,7 +1998,7 @@ static ssize_t tun_put_user_xdp(struct tun_struct *tun,
> >>          ssize_t ret;
> >>
> >>          if (tun->flags & IFF_VNET_HDR) {
> >> -               struct virtio_net_hdr gso = { 0 };
> >> +               struct virtio_net_hdr_v1_hash gso = { 0 };
> >>
> >>                  vnet_hdr_sz = READ_ONCE(tun->vnet_hdr_sz);
> >>                  ret = tun_vnet_hdr_put(vnet_hdr_sz, iter, &gso);
> >> @@ -2046,9 +2051,10 @@ static ssize_t tun_put_user(struct tun_struct *tun,
> >>          }
> >>
> >>          if (vnet_hdr_sz) {
> >> -               struct virtio_net_hdr gso;
> >> +               struct virtio_net_hdr_v1_hash gso;
> >>
> >> -               ret = tun_vnet_hdr_from_skb(tun->flags, tun->dev, skb, &gso);
> >> +               ret = tun_vnet_hdr_from_skb(vnet_hdr_sz, tun->flags, tun->dev,
> >> +                                           skb, tun_find_hash, &gso);
> >>                  if (ret)
> >>                          return ret;
> >>
> >> diff --git a/drivers/net/tun_vnet.h b/drivers/net/tun_vnet.h
> >> index 58b9ac7a5fc4..45d0533efc8d 100644
> >> --- a/drivers/net/tun_vnet.h
> >> +++ b/drivers/net/tun_vnet.h
> >> @@ -6,6 +6,17 @@
> >>   #define TUN_VNET_LE     0x80000000
> >>   #define TUN_VNET_BE     0x40000000
> >>
> >> +typedef struct virtio_net_hash *(*tun_vnet_hash_add)(struct sk_buff *);
> >> +typedef const struct virtio_net_hash *(*tun_vnet_hash_find)(const struct sk_buff *);
> >> +
> >> +struct tun_vnet_hash {
> >> +       bool report;
> >> +       bool rss;
> >> +       struct tun_vnet_rss common;
> >> +       u32 rss_key[VIRTIO_NET_RSS_MAX_KEY_SIZE];
> >> +       u16 rss_indirection_table[];
> >> +};
> >> +
> >>   static inline bool tun_vnet_legacy_is_little_endian(unsigned int flags)
> >>   {
> >>          bool be = IS_ENABLED(CONFIG_TUN_VNET_CROSS_LE) &&
> >> @@ -107,6 +118,128 @@ static inline long tun_vnet_ioctl(int *vnet_hdr_sz, unsigned int *flags,
> >>          }
> >>   }
> >>
> >> +static inline long tun_vnet_ioctl_gethashtypes(u32 __user *argp)
> >> +{
> >> +       return put_user(VIRTIO_NET_SUPPORTED_HASH_TYPES, argp) ? -EFAULT : 0;
> >> +}
> >> +
> >> +static inline long tun_vnet_ioctl_sethash(struct tun_vnet_hash __rcu **hashp,
> >> +                                         unsigned int cmd,
> >> +                                         void __user *argp)
> >> +{
> >> +       struct tun_vnet_rss common;
> >> +       struct tun_vnet_hash *hash;
> >> +       size_t indirection_table_size;
> >> +       size_t key_size;
> >> +       size_t size;
> >> +
> >> +       switch (cmd) {
> >> +       case TUNSETVNETREPORTINGAUTOMQ:
> >> +               if (get_user(common.hash_types, (u32 __user *)argp))
> >> +                       return -EFAULT;
> >> +
> >> +               if (common.hash_types) {
> >> +                       hash = kzalloc(sizeof(*hash), GFP_KERNEL);
> >> +                       if (!hash)
> >> +                               return -ENOMEM;
> >> +
> >> +                       hash->report = true;
> >> +                       hash->common.hash_types = common.hash_types;
> >> +               } else {
> >> +                       hash = NULL;
> >> +               }
> >> +               break;
> >> +
> >> +       case TUNSETVNETREPORTINGRSS:
> >> +       case TUNSETVNETRSS:
> >
> > So the above three shows unnecessary design redundancy as well as a
> > burden for the future extension.  Why not simply have
> >
> > 1) TUNSETVNET_RSS
> > 2) TUNSETVNET_HASH_REPORT
> > ?
> >
> > Which maps to
> >
> >   #define VIRTIO_NET_CTRL_MQ_RSS_CONFIG          1 (for configurable
> > receive steering)
> >   #define VIRTIO_NET_CTRL_MQ_HASH_CONFIG         2 (for configurable
> > hash calculation)
> >
> > It would be always easier to start with what spec had or at least we
> > need to explain why we choose a different design here or in the
> > changelog to ease our life.
>
> TUNSETVNETREPORTINGAUTOMQ maps to VIRTIO_NET_CTRL_MQ_HASH_CONFIG.

It's not:

VIRTIO_NET_CTRL_MQ_HASH_CONFIG uses:

struct virtio_net_hash_config {
    le32 hash_types;
    le16 reserved[4];
    u8 hash_key_length;
    u8 hash_key_data[hash_key_length];
};

but TUNSETVNETREPORTINGAUTOMQ only accepts hash_types without others:

>
> TUNSETVNETREPORTINGRSS and TUNSETVNETRSS map to
> VIRTIO_NET_CTRL_MQ_RSS_CONFIG.

I think we've already had a discussion about this.

Reusing virtio-net uAPI is much better instead of having a tun
specific one considering tun may need to support more virtio commands
in the future. Or maybe it's the time to introduce a transport for the
virtio control virtqueue uAPI in tuntap to avoid inventing new uAPI
endlessly.

What's more I see:

struct tun_vnet_rss {
        __u32 hash_types;
        __u16 indirection_table_mask;
        __u16 unclassified_queue;
};

struct tun_vnet_hash {
        bool report;
        bool rss;
        struct tun_vnet_rss common;
        u32 rss_key[VIRTIO_NET_RSS_MAX_KEY_SIZE];
        u16 rss_indirection_table[];
};

As I pointed out in the past, let's just decouple the rss from hash,
everything would be much simpler, or you need to explain why you
couple this somewhere.

For example:

1) why is the tun_vnet_hash not part of the uAPI but tun_vnet_rss, or
how could userspace know what kind of format it would use for
TUNSETVNETREPORTINGRSS?
2) what's the advantages of embedding rss specific stuff into hash
report structure
3) what's the advantages of not using virtio-net uAPI

More issues:

1) max_tx_vq is ignored, so do you expect the userspace to intercept
this and switch to using TUNSETQUEUE? This seems like a burden as TUN
can simply accept it and do the attaching/detaching by itself
2) the rx depends on the indirection table, so userspace need to
intercept the indirection and change the rx queue numbers accordingly
3) RSS allows a async TX/RX queue, this is not supported by TUN now,
no matter if we decide to let userspace to intercept max_tx_vq or not,
we need to implement it first

Things would be much more simpler, if kernel can do 1) and 2).

> We have two ioctls here because
> VIRTIO_NET_CTRL_MQ_RSS_CONFIG behaves differently depending on whether
> VIRTIO_NET_F_HASH_REPORT is negotiated or not;

It wouldn't be a problem if you do 1:1 mapping between virtio commands
and TUN uAPI, otherwise it should have a bug somewhere.

> it also enables hash
> reporting if the feature is negotiated.

Again, starting from virtio-net command is easier, a strong
justification is needed to explain why we choose another for tun/tap.

>
> >
> > One day we would have tun_select_queue_algorithm_x() we don't have to
> > duplicate the ioctls once again here like TUNSETVNETREPORTINGXYZ
>
> 5.1.6.5.6.4 Hash calculation says:
> >  If VIRTIO_NET_F_HASH_REPORT was negotiated and the device uses
> > automatic receive steering, the device MUST support a command to
> > configure hash calculation parameters.
> >
> > The driver provides parameters for hash calculation as follows:
> >
> > class VIRTIO_NET_CTRL_MQ, command VIRTIO_NET_CTRL_MQ_HASH_CONFIG.
>
> VIRTIO_NET_CTRL_MQ_HASH_CONFIG is for automatic receive steering and not
> for RSS (or other steering algorithms if the spec gets any in the future).

I'm not sure but the spec needs some tweaking. For example, I don't
expect there would be a dedicated hash config command for flow filters
in the future. I think the reason why spec says like this is that
virtio-net only supports automatic receive steering.

Note that macvtap doesn't implement automatic receive steering.

>
> >
> >> +               if (copy_from_user(&common, argp, sizeof(common)))
> >> +                       return -EFAULT;
> >> +               argp = (struct tun_vnet_rss __user *)argp + 1;
> >> +
> >> +               indirection_table_size = ((size_t)common.indirection_table_mask + 1) * 2;
> >> +               key_size = virtio_net_hash_key_length(common.hash_types);
> >> +               size = struct_size(hash, rss_indirection_table,
> >> +                                  (size_t)common.indirection_table_mask + 1);
> >> +
> >> +               hash = kmalloc(size, GFP_KERNEL);
> >> +               if (!hash)
> >> +                       return -ENOMEM;
> >> +
> >> +               if (copy_from_user(hash->rss_indirection_table,
> >> +                                  argp, indirection_table_size)) {
> >> +                       kfree(hash);
> >> +                       return -EFAULT;
> >> +               }
> >> +               argp = (u16 __user *)argp + common.indirection_table_mask + 1;
> >> +
> >> +               if (copy_from_user(hash->rss_key, argp, key_size)) {
> >> +                       kfree(hash);
> >> +                       return -EFAULT;
> >> +               }
> >> +
> >> +               virtio_net_toeplitz_convert_key(hash->rss_key, key_size);
> >> +               hash->report = cmd == TUNSETVNETREPORTINGRSS;
> >
> > At least, if this is the only difference why not simply code this into
> > the ioctl itself other than having a very similar command?
>
> It is what the previous version did. Either is fine I guess; the only
> practical difference would be the size of the configuration struct is
> smaller with this approach.
>
> >
> >> +               hash->rss = true;
> >> +               hash->common = common;
> >> +               break;
> >> +
> >> +       default:
> >> +               return -EINVAL;
> >> +       }
> >> +
> >> +       kfree_rcu_mightsleep(rcu_replace_pointer_rtnl(*hashp, hash));
> >> +       return 0;
> >> +}
> >> +
> >> +static inline void tun_vnet_hash_report(const struct tun_vnet_hash *hash,
> >> +                                       struct sk_buff *skb,
> >> +                                       const struct flow_keys_basic *keys,
> >> +                                       u32 value,
> >> +                                       tun_vnet_hash_add vnet_hash_add)
> >> +{
> >> +       struct virtio_net_hash *report;
> >> +
> >> +       if (!hash || !hash->report)
> >> +               return;
> >> +
> >> +       report = vnet_hash_add(skb);
> >> +       if (!report)
> >> +               return;
> >> +
> >> +       *report = (struct virtio_net_hash) {
> >> +               .report = virtio_net_hash_report(hash->common.hash_types, keys),
> >> +               .value = value
> >> +       };
> >> +}
> >> +
> >> +static inline u16 tun_vnet_rss_select_queue(u32 numqueues,
> >> +                                           const struct tun_vnet_hash *hash,
> >> +                                           struct sk_buff *skb,
> >> +                                           tun_vnet_hash_add vnet_hash_add)
> >> +{
> >> +       struct virtio_net_hash *report;
> >> +       struct virtio_net_hash ret;
> >> +       u16 index;
> >> +
> >> +       if (!numqueues)
> >> +               return 0;
> >> +
> >> +       virtio_net_hash_rss(skb, hash->common.hash_types, hash->rss_key, &ret);
> >> +
> >> +       if (!ret.report)
> >> +               return hash->common.unclassified_queue % numqueues;
> >> +
> >> +       if (hash->report) {
> >> +               report = vnet_hash_add(skb);
> >> +               if (report)
> >> +                       *report = ret;
> >> +       }
> >> +
> >> +       index = ret.value & hash->common.indirection_table_mask;
> >> +
> >> +       return hash->rss_indirection_table[index] % numqueues;
> >> +}
> >> +
> >>   static inline int tun_vnet_hdr_get(int sz, unsigned int flags,
> >>                                     struct iov_iter *from,
> >>                                     struct virtio_net_hdr *hdr)
> >> @@ -135,15 +268,17 @@ static inline int tun_vnet_hdr_get(int sz, unsigned int flags,
> >>   }
> >>
> >>   static inline int tun_vnet_hdr_put(int sz, struct iov_iter *iter,
> >> -                                  const struct virtio_net_hdr *hdr)
> >> +                                  const struct virtio_net_hdr_v1_hash *hdr)
> >>   {
> >> +       int content_sz = MIN(sizeof(*hdr), sz);
> >> +
> >>          if (unlikely(iov_iter_count(iter) < sz))
> >>                  return -EINVAL;
> >>
> >> -       if (unlikely(copy_to_iter(hdr, sizeof(*hdr), iter) != sizeof(*hdr)))
> >> +       if (unlikely(copy_to_iter(hdr, content_sz, iter) != content_sz))
> >>                  return -EFAULT;
> >>
> >> -       if (iov_iter_zero(sz - sizeof(*hdr), iter) != sz - sizeof(*hdr))
> >> +       if (iov_iter_zero(sz - content_sz, iter) != sz - content_sz)
> >>                  return -EFAULT;
> >>
> >>          return 0;
> >> @@ -155,26 +290,38 @@ static inline int tun_vnet_hdr_to_skb(unsigned int flags, struct sk_buff *skb,
> >>          return virtio_net_hdr_to_skb(skb, hdr, tun_vnet_is_little_endian(flags));
> >>   }
> >>
> >> -static inline int tun_vnet_hdr_from_skb(unsigned int flags,
> >> +static inline int tun_vnet_hdr_from_skb(int sz, unsigned int flags,
> >>                                          const struct net_device *dev,
> >>                                          const struct sk_buff *skb,
> >> -                                       struct virtio_net_hdr *hdr)
> >> +                                       tun_vnet_hash_find vnet_hash_find,
> >> +                                       struct virtio_net_hdr_v1_hash *hdr)
> >>   {
> >>          int vlan_hlen = skb_vlan_tag_present(skb) ? VLAN_HLEN : 0;
> >> +       const struct virtio_net_hash *report = sz < sizeof(struct virtio_net_hdr_v1_hash) ?
> >> +                                              NULL : vnet_hash_find(skb);
> >> +
> >> +       *hdr = (struct virtio_net_hdr_v1_hash) {
> >> +               .hash_report = VIRTIO_NET_HASH_REPORT_NONE
> >> +       };
> >> +
> >> +       if (report) {
> >> +               hdr->hash_value = cpu_to_le32(report->value);
> >> +               hdr->hash_report = cpu_to_le16(report->report);
> >> +       }
> >>
> >> -       if (virtio_net_hdr_from_skb(skb, hdr,
> >> +       if (virtio_net_hdr_from_skb(skb, (struct virtio_net_hdr *)hdr,
> >>                                      tun_vnet_is_little_endian(flags), true,
> >>                                      vlan_hlen)) {
> >>                  struct skb_shared_info *sinfo = skb_shinfo(skb);
> >>
> >>                  if (net_ratelimit()) {
> >>                          netdev_err(dev, "unexpected GSO type: 0x%x, gso_size %d, hdr_len %d\n",
> >> -                                  sinfo->gso_type, tun_vnet16_to_cpu(flags, hdr->gso_size),
> >> -                                  tun_vnet16_to_cpu(flags, hdr->hdr_len));
> >> +                                  sinfo->gso_type, tun_vnet16_to_cpu(flags, hdr->hdr.gso_size),
> >> +                                  tun_vnet16_to_cpu(flags, hdr->hdr.hdr_len));
> >>                          print_hex_dump(KERN_ERR, "tun: ",
> >>                                         DUMP_PREFIX_NONE,
> >>                                         16, 1, skb->head,
> >> -                                      min(tun_vnet16_to_cpu(flags, hdr->hdr_len), 64), true);
> >> +                                      min(tun_vnet16_to_cpu(flags, hdr->hdr.hdr_len), 64), true);
> >>                  }
> >>                  WARN_ON_ONCE(1);
> >>                  return -EINVAL;
> >> diff --git a/include/uapi/linux/if_tun.h b/include/uapi/linux/if_tun.h
> >> index 980de74724fc..fe4b984d3bbb 100644
> >> --- a/include/uapi/linux/if_tun.h
> >> +++ b/include/uapi/linux/if_tun.h
> >> @@ -62,6 +62,62 @@
> >>   #define TUNSETCARRIER _IOW('T', 226, int)
> >>   #define TUNGETDEVNETNS _IO('T', 227)
> >>
> >> +/**
> >> + * define TUNGETVNETHASHTYPES - ioctl to get supported virtio_net hashing types
> >> + *
> >> + * The argument is a pointer to __u32 which will store the supported virtio_net
> >> + * hashing types.
> >> + */
> >> +#define TUNGETVNETHASHTYPES _IOR('T', 228, __u32)
> >> +
> >> +/**
> >> + * define TUNSETVNETREPORTINGAUTOMQ - ioctl to enable automq with hash reporting
> >> + *
> >> + * Disable RSS and enable automatic receive steering with hash reporting.
> >> + *
> >> + * The argument is a pointer to __u32 that contains a bitmask of hash types
> >> + * allowed to be reported.
> >> + *
> >> + * This ioctl results in %EBADFD if the underlying device is deleted. It affects
> >> + * all queues attached to the same device.
> >> + *
> >> + * This ioctl currently has no effect on XDP packets and packets with
> >> + * queue_mapping set by TC.
> >> + */
> >> +#define TUNSETVNETREPORTINGAUTOMQ _IOR('T', 229, __u32)
> >> +
> >> +/**
> >> + * define TUNSETVNETREPORTINGRSS - ioctl to enable RSS with hash reporting
> >> + *
> >> + * Disable automatic receive steering and enable RSS with hash reporting.
> >
> > This is unnecessary, e.g one day will have select_queue_xyz(), we
> > don't want to say "Disable automatic receive steering and xyz ..."
>
> It is still something better to be documented as its behavior is
> somewhat complicated.

This is a hint of uAPI design issue.

>
> Concretely, this ioctl disables automatic receive steering but doesn't
> disable steering by eBPF, which is implied by TUN_STEERINGEBPF_FALLBACK.

It would be simpler:

1) not having TUN_STEERINGEBPF_FALLBACK
2) the steering algorithm depends on the last uAPI call

>
> Perhaps it may be rephrased to tell it disables "the other receive
> steering mechanism except eBPF".
>
> Regards,
> Akihiko Odaki
>
> >

Thanks


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next v12 01/10] virtio_net: Add functions for hashing
  2025-06-05  1:53           ` Jason Wang
@ 2025-06-05  7:57             ` Akihiko Odaki
  2025-06-06  0:48               ` Jason Wang
  0 siblings, 1 reply; 39+ messages in thread
From: Akihiko Odaki @ 2025-06-05  7:57 UTC (permalink / raw)
  To: Jason Wang
  Cc: Jonathan Corbet, Willem de Bruijn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Michael S. Tsirkin, Xuan Zhuo,
	Shuah Khan, linux-doc, linux-kernel, netdev, kvm, virtualization,
	linux-kselftest, Yuri Benditovich, Andrew Melnychenko,
	Stephen Hemminger, gur.stavi, Lei Yang, Simon Horman

On 2025/06/05 10:53, Jason Wang wrote:
> On Wed, Jun 4, 2025 at 3:20 PM Akihiko Odaki <akihiko.odaki@daynix.com> wrote:
>>
>> On 2025/06/04 10:18, Jason Wang wrote:
>>> On Tue, Jun 3, 2025 at 1:31 PM Akihiko Odaki <akihiko.odaki@daynix.com> wrote:
>>>>
>>>> On 2025/06/03 12:19, Jason Wang wrote:
>>>>> On Fri, May 30, 2025 at 12:50 PM Akihiko Odaki <akihiko.odaki@daynix.com> wrote:
>>>>>>
>>>>>> They are useful to implement VIRTIO_NET_F_RSS and
>>>>>> VIRTIO_NET_F_HASH_REPORT.
>>>>>>
>>>>>> Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
>>>>>> Tested-by: Lei Yang <leiyang@redhat.com>
>>>>>> ---
>>>>>>     include/linux/virtio_net.h | 188 +++++++++++++++++++++++++++++++++++++++++++++
>>>>>>     1 file changed, 188 insertions(+)
>>>>>>
>>>>>> diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h
>>>>>> index 02a9f4dc594d..426f33b4b824 100644
>>>>>> --- a/include/linux/virtio_net.h
>>>>>> +++ b/include/linux/virtio_net.h
>>>>>> @@ -9,6 +9,194 @@
>>>>>>     #include <uapi/linux/tcp.h>
>>>>>>     #include <uapi/linux/virtio_net.h>
>>>>>>
>>>>>> +struct virtio_net_hash {
>>>>>> +       u32 value;
>>>>>> +       u16 report;
>>>>>> +};
>>>>>> +
>>>>>> +struct virtio_net_toeplitz_state {
>>>>>> +       u32 hash;
>>>>>> +       const u32 *key;
>>>>>> +};
>>>>>> +
>>>>>> +#define VIRTIO_NET_SUPPORTED_HASH_TYPES (VIRTIO_NET_RSS_HASH_TYPE_IPv4 | \
>>>>>> +                                        VIRTIO_NET_RSS_HASH_TYPE_TCPv4 | \
>>>>>> +                                        VIRTIO_NET_RSS_HASH_TYPE_UDPv4 | \
>>>>>> +                                        VIRTIO_NET_RSS_HASH_TYPE_IPv6 | \
>>>>>> +                                        VIRTIO_NET_RSS_HASH_TYPE_TCPv6 | \
>>>>>> +                                        VIRTIO_NET_RSS_HASH_TYPE_UDPv6)
>>>>>> +
>>>>>> +#define VIRTIO_NET_RSS_MAX_KEY_SIZE 40
>>>>>> +
>>>>>> +static inline void virtio_net_toeplitz_convert_key(u32 *input, size_t len)
>>>>>> +{
>>>>>> +       while (len >= sizeof(*input)) {
>>>>>> +               *input = be32_to_cpu((__force __be32)*input);
>>>>>> +               input++;
>>>>>> +               len -= sizeof(*input);
>>>>>> +       }
>>>>>> +}
>>>>>> +
>>>>>> +static inline void virtio_net_toeplitz_calc(struct virtio_net_toeplitz_state *state,
>>>>>> +                                           const __be32 *input, size_t len)
>>>>>> +{
>>>>>> +       while (len >= sizeof(*input)) {
>>>>>> +               for (u32 map = be32_to_cpu(*input); map; map &= (map - 1)) {
>>>>>> +                       u32 i = ffs(map);
>>>>>> +
>>>>>> +                       state->hash ^= state->key[0] << (32 - i) |
>>>>>> +                                      (u32)((u64)state->key[1] >> i);
>>>>>> +               }
>>>>>> +
>>>>>> +               state->key++;
>>>>>> +               input++;
>>>>>> +               len -= sizeof(*input);
>>>>>> +       }
>>>>>> +}
>>>>>> +
>>>>>> +static inline u8 virtio_net_hash_key_length(u32 types)
>>>>>> +{
>>>>>> +       size_t len = 0;
>>>>>> +
>>>>>> +       if (types & VIRTIO_NET_HASH_REPORT_IPv4)
>>>>>> +               len = max(len,
>>>>>> +                         sizeof(struct flow_dissector_key_ipv4_addrs));
>>>>>> +
>>>>>> +       if (types &
>>>>>> +           (VIRTIO_NET_HASH_REPORT_TCPv4 | VIRTIO_NET_HASH_REPORT_UDPv4))
>>>>>> +               len = max(len,
>>>>>> +                         sizeof(struct flow_dissector_key_ipv4_addrs) +
>>>>>> +                         sizeof(struct flow_dissector_key_ports));
>>>>>> +
>>>>>> +       if (types & VIRTIO_NET_HASH_REPORT_IPv6)
>>>>>> +               len = max(len,
>>>>>> +                         sizeof(struct flow_dissector_key_ipv6_addrs));
>>>>>> +
>>>>>> +       if (types &
>>>>>> +           (VIRTIO_NET_HASH_REPORT_TCPv6 | VIRTIO_NET_HASH_REPORT_UDPv6))
>>>>>> +               len = max(len,
>>>>>> +                         sizeof(struct flow_dissector_key_ipv6_addrs) +
>>>>>> +                         sizeof(struct flow_dissector_key_ports));
>>>>>> +
>>>>>> +       return len + sizeof(u32);
>>>>>> +}
>>>>>> +
>>>>>> +static inline u32 virtio_net_hash_report(u32 types,
>>>>>> +                                        const struct flow_keys_basic *keys)
>>>>>> +{
>>>>>> +       switch (keys->basic.n_proto) {
>>>>>> +       case cpu_to_be16(ETH_P_IP):
>>>>>> +               if (!(keys->control.flags & FLOW_DIS_IS_FRAGMENT)) {
>>>>>> +                       if (keys->basic.ip_proto == IPPROTO_TCP &&
>>>>>> +                           (types & VIRTIO_NET_RSS_HASH_TYPE_TCPv4))
>>>>>> +                               return VIRTIO_NET_HASH_REPORT_TCPv4;
>>>>>> +
>>>>>> +                       if (keys->basic.ip_proto == IPPROTO_UDP &&
>>>>>> +                           (types & VIRTIO_NET_RSS_HASH_TYPE_UDPv4))
>>>>>> +                               return VIRTIO_NET_HASH_REPORT_UDPv4;
>>>>>> +               }
>>>>>> +
>>>>>> +               if (types & VIRTIO_NET_RSS_HASH_TYPE_IPv4)
>>>>>> +                       return VIRTIO_NET_HASH_REPORT_IPv4;
>>>>>> +
>>>>>> +               return VIRTIO_NET_HASH_REPORT_NONE;
>>>>>> +
>>>>>> +       case cpu_to_be16(ETH_P_IPV6):
>>>>>> +               if (!(keys->control.flags & FLOW_DIS_IS_FRAGMENT)) {
>>>>>> +                       if (keys->basic.ip_proto == IPPROTO_TCP &&
>>>>>> +                           (types & VIRTIO_NET_RSS_HASH_TYPE_TCPv6))
>>>>>> +                               return VIRTIO_NET_HASH_REPORT_TCPv6;
>>>>>> +
>>>>>> +                       if (keys->basic.ip_proto == IPPROTO_UDP &&
>>>>>> +                           (types & VIRTIO_NET_RSS_HASH_TYPE_UDPv6))
>>>>>> +                               return VIRTIO_NET_HASH_REPORT_UDPv6;
>>>>>> +               }
>>>>>> +
>>>>>> +               if (types & VIRTIO_NET_RSS_HASH_TYPE_IPv6)
>>>>>> +                       return VIRTIO_NET_HASH_REPORT_IPv6;
>>>>>> +
>>>>>> +               return VIRTIO_NET_HASH_REPORT_NONE;
>>>>>> +
>>>>>> +       default:
>>>>>> +               return VIRTIO_NET_HASH_REPORT_NONE;
>>>>>> +       }
>>>>>> +}
>>>>>> +
>>>>>> +static inline void virtio_net_hash_rss(const struct sk_buff *skb,
>>>>>> +                                      u32 types, const u32 *key,
>>>>>> +                                      struct virtio_net_hash *hash)
>>>>>> +{
>>>>>> +       struct virtio_net_toeplitz_state toeplitz_state = { .key = key };
>>>>>> +       struct flow_keys flow;
>>>>>> +       struct flow_keys_basic flow_basic;
>>>>>> +       u16 report;
>>>>>> +
>>>>>> +       if (!skb_flow_dissect_flow_keys(skb, &flow, 0)) {
>>>>>> +               hash->report = VIRTIO_NET_HASH_REPORT_NONE;
>>>>>> +               return;
>>>>>> +       }
>>>>>> +
>>>>>> +       flow_basic = (struct flow_keys_basic) {
>>>>>> +               .control = flow.control,
>>>>>> +               .basic = flow.basic
>>>>>> +       };
>>>>>> +
>>>>>> +       report = virtio_net_hash_report(types, &flow_basic);
>>>>>> +
>>>>>> +       switch (report) {
>>>>>> +       case VIRTIO_NET_HASH_REPORT_IPv4:
>>>>>> +               virtio_net_toeplitz_calc(&toeplitz_state,
>>>>>> +                                        (__be32 *)&flow.addrs.v4addrs,
>>>>>> +                                        sizeof(flow.addrs.v4addrs));
>>>>>> +               break;
>>>>>> +
>>>>>> +       case VIRTIO_NET_HASH_REPORT_TCPv4:
>>>>>> +               virtio_net_toeplitz_calc(&toeplitz_state,
>>>>>> +                                        (__be32 *)&flow.addrs.v4addrs,
>>>>>> +                                        sizeof(flow.addrs.v4addrs));
>>>>>> +               virtio_net_toeplitz_calc(&toeplitz_state, &flow.ports.ports,
>>>>>> +                                        sizeof(flow.ports.ports));
>>>>>> +               break;
>>>>>> +
>>>>>> +       case VIRTIO_NET_HASH_REPORT_UDPv4:
>>>>>> +               virtio_net_toeplitz_calc(&toeplitz_state,
>>>>>> +                                        (__be32 *)&flow.addrs.v4addrs,
>>>>>> +                                        sizeof(flow.addrs.v4addrs));
>>>>>> +               virtio_net_toeplitz_calc(&toeplitz_state, &flow.ports.ports,
>>>>>> +                                        sizeof(flow.ports.ports));
>>>>>> +               break;
>>>>>> +
>>>>>> +       case VIRTIO_NET_HASH_REPORT_IPv6:
>>>>>> +               virtio_net_toeplitz_calc(&toeplitz_state,
>>>>>> +                                        (__be32 *)&flow.addrs.v6addrs,
>>>>>> +                                        sizeof(flow.addrs.v6addrs));
>>>>>> +               break;
>>>>>> +
>>>>>> +       case VIRTIO_NET_HASH_REPORT_TCPv6:
>>>>>> +               virtio_net_toeplitz_calc(&toeplitz_state,
>>>>>> +                                        (__be32 *)&flow.addrs.v6addrs,
>>>>>> +                                        sizeof(flow.addrs.v6addrs));
>>>>>> +               virtio_net_toeplitz_calc(&toeplitz_state, &flow.ports.ports,
>>>>>> +                                        sizeof(flow.ports.ports));
>>>>>> +               break;
>>>>>> +
>>>>>> +       case VIRTIO_NET_HASH_REPORT_UDPv6:
>>>>>> +               virtio_net_toeplitz_calc(&toeplitz_state,
>>>>>> +                                        (__be32 *)&flow.addrs.v6addrs,
>>>>>> +                                        sizeof(flow.addrs.v6addrs));
>>>>>> +               virtio_net_toeplitz_calc(&toeplitz_state, &flow.ports.ports,
>>>>>> +                                        sizeof(flow.ports.ports));
>>>>>> +               break;
>>>>>> +
>>>>>> +       default:
>>>>>> +               hash->report = VIRTIO_NET_HASH_REPORT_NONE;
>>>>>> +               return;
>>>>>
>>>>> So I still think we need a comment here to explain why this is not an
>>>>> issue if the device can report HASH_XXX_EX. Or we need to add the
>>>>> support, since this is the code from the driver side, I don't think we
>>>>> need to worry about the device implementation issues.
>>>>
>>>> This is on the device side, and don't report HASH_TYPE_XXX_EX.
>>>>
>>>>>
>>>>> For the issue of the number of options, does the spec forbid fallback
>>>>> to VIRTIO_NET_HASH_REPORT_NONE? If not, we can do that.
>>>>
>>>> 5.1.6.4.3.4 "IPv6 packets with extension header" says:
>>>>    > If VIRTIO_NET_HASH_TYPE_TCP_EX is set and the packet has a TCPv6
>>>>    > header, the hash is calculated over the following fields:
>>>>    > - Home address from the home address option in the IPv6 destination
>>>>    >   options header. If the extension header is not present, use the
>>>>    >   Source IPv6 address.
>>>>    > - IPv6 address that is contained in the Routing-Header-Type-2 from the
>>>>    >   associated extension header. If the extension header is not present,
>>>>    >   use the Destination IPv6 address.
>>>>    > - Source TCP port
>>>>    > - Destination TCP port
>>>>
>>>> Therefore, if VIRTIO_NET_HASH_TYPE_TCP_EX is set, the packet has a TCPv6
>>>> and an home address option in the IPv6 destination options header is
>>>> present, the hash is calculated over the home address. If the hash is
>>>> not calculated over the home address in such a case, the device is
>>>> contradicting with this section and violating the spec. The same goes
>>>> for the other HASH_TYPE_XXX_EX types and Routing-Header-Type-2.
>>>
>>> Just to make sure we are one the same page. I meant:
>>>
>>> 1) If the hash is not calculated over the home address (in the case of
>>> IPv6 destination destination), it can still report
>>> VIRTIO_NET_RSS_HASH_TYPE_IPv6. This is what you implemented in your
>>> series. So the device can simply fallback to e.g TCPv6 if it can't
>>> understand all or part of the IPv6 options.
>>
>> The spec says it can fallback if "the extension header is not present",
>> not if the device can't understand the extension header.
> 
> I don't think so,
> 
> 1) spec had a condition beforehand:
> 
> """
> If VIRTIO_NET_HASH_TYPE_TCP_EX is set and the packet has a TCPv6
> header, the hash is calculated over the following fields:
> ...
> If the extension header is not present ...
> """
> 
> So the device can choose not to set VIRTIO_NET_HASH_TYPE_TCP_EX as
> spec doesn't say device MUST set VIRTIO_NET_HASH_TYPE_TCP_EX if ...
> 
> 2) implementation wise, since device has limited resources, we can't
> expect the device can parse arbitrary number of ipv6 options
> 
> 3) if 1) and 2) not the case, we need fix the spec otherwise implement
> a spec compliant device is impractical

The statement is preceded by the following:
 >  The device calculates the hash on IPv4 packets according to
 > ’Enabled hash types’ bitmask as follows:

The 'Enabled hash types' bitmask is specified by the device.

I think the spec needs amendment.

I wonder if there are any people interested in the feature though. 
Looking at virtnet_set_hashflow() in drivers/net/virtio_net.c, the 
driver of Linux does not let users configure HASH_TYPE_XXX_EX. I suppose 
Windows supports HASH_TYPE_XXX_EX, but those who care network 
performance most would use Linux so HASH_TYPE_XXX_EX support without 
Linux driver's support may not be useful.

> 
>>
>>> 2) the VIRTIO_NET_SUPPORTED_HASH_TYPES is not checked against the
>>> tun_vnet_ioctl_sethash(), so userspace may set
>>> VIRTIO_NET_HASH_TYPE_TCP_EX regardless of what has been returned by
>>> tun_vnet_ioctl_gethashtypes(). In this case they won't get
>>> VIRTIO_NET_HASH_TYPE_TCP_EX.
>>
>> That's right. It's the responsibility of the userspace to set only the
>> supported hash types.
> 
> Well, the kernel should filter out the unsupported one to have a
> robust uAPI. Otherwise, we give green light to the buggy userspace
> which will have unexpected results.

My reasoning was that it may be fine for some use cases other than VM 
(e.g., DPDK); in such a use case, it is fine as long as the UAPI works 
in the best-effort basis.

For example, suppose a userspace program that processes TCP packets; the 
program can enable: HASH_TYPE_IPv4, HASH_TYPE_TCPv4, HASH_TYPE_IPv6, and 
HASH_TYPE_TCPv6. Ideally, the kernel should support all the hash types, 
but, even if e.g., HASH_TYPE_TCPv6 is not available, it will fall back 
to HASH_TYPE_IPv6, which still does something good and may be acceptable.

That said, for a use case that involves VM and implements virtio-net 
(e.g., QEMU), setting an unsupported hash type here is definitely a bug. 
Catching the bug may outweigh the extra trouble for other use cases.

> 
>>
>>> 3) implementing part of the hash types might complicate the migration
>>> or at least we need to describe the expectations of libvirt or other
>>> management in this case. For example, do we plan to have a dedicated
>>> Qemu command line like:
>>>
>>> -device virtio-net-pci,hash_report=on,supported_hash_types=X,Y,Z?
>>
>> I posted a patch series to implement such a command line for vDPA[1].
>> The patch series that wires this tuntap feature up[2] reuses the
>> infrastructure so it doesn't bring additional complexity.
>>
>> [1]
>> https://lore.kernel.org/qemu-devel/20250530-vdpa-v1-0-5af4109b1c19@daynix.com/
>> [2]
>> https://lore.kernel.org/qemu-devel/20250530-hash-v5-0-343d7d7a8200@daynix.com/
> 
> I meant, if we implement a full hash report feature, it means a single
> hash cmdline option is more than sufficient and so compatibility code
> can just turn it off when dealing with machine types. This is much
> more simpler than
> 
> 1) having both hash as well as supported_hash_features
> 2) dealing both hash as well as supported_hash_features in compatibility codes
> 3) libvirt will be happy
> 
> For [1], it seems it introduces a per has type option, this seems to
> be a burden to the management layer as it need to learn new option
> everytime a new hash type is supported

Even with the command line you proposed (supported_hash_types=X,Y,Z), it 
is still necessary to know the values the supported_hash_types property 
accepts (X.Y,Z), so I don't think it makes difference.

The burden to the management layer is already present for features, so 
it is an existing problem (or its mere extension).

This problem was discussed in the following thread in the past, but no 
solution is implemented yet, and probably solving it will be difficult.
https://lore.kernel.org/qemu-devel/20230731223148.1002258-5-yuri.benditovich@daynix.com/

Regards,
Akihiko Odaki

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next v12 04/10] tun: Add common virtio-net hash feature code
  2025-06-05  2:46       ` Jason Wang
@ 2025-06-05  8:18         ` Akihiko Odaki
  2025-06-06  1:01           ` Jason Wang
  0 siblings, 1 reply; 39+ messages in thread
From: Akihiko Odaki @ 2025-06-05  8:18 UTC (permalink / raw)
  To: Jason Wang
  Cc: Jonathan Corbet, Willem de Bruijn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Michael S. Tsirkin, Xuan Zhuo,
	Shuah Khan, linux-doc, linux-kernel, netdev, kvm, virtualization,
	linux-kselftest, Yuri Benditovich, Andrew Melnychenko,
	Stephen Hemminger, gur.stavi, Lei Yang, Simon Horman

On 2025/06/05 11:46, Jason Wang wrote:
> On Wed, Jun 4, 2025 at 4:42 PM Akihiko Odaki <akihiko.odaki@daynix.com> wrote:
>>
>> On 2025/06/04 10:53, Jason Wang wrote:
>>> On Fri, May 30, 2025 at 12:50 PM Akihiko Odaki <akihiko.odaki@daynix.com> wrote:
>>>>
>>>> Add common code required for the features being added to TUN and TAP.
>>>> They will be enabled for each of them in following patches.
>>>>
>>>> Added Features
>>>> ==============
>>>>
>>>> Hash reporting
>>>> --------------
>>>>
>>>> Allow the guest to reuse the hash value to make receive steering
>>>> consistent between the host and guest, and to save hash computation.
>>>>
>>>> Receive Side Scaling (RSS)
>>>> --------------------------
>>>>
>>>> RSS is a receive steering algorithm that can be negotiated to use with
>>>> virtio_net. Conventionally the hash calculation was done by the VMM.
>>>> However, computing the hash after the queue was chosen defeats the
>>>> purpose of RSS.
>>>>
>>>> Another approach is to use eBPF steering program. This approach has
>>>> another downside: it cannot report the calculated hash due to the
>>>> restrictive nature of eBPF steering program.
>>>>
>>>> Introduce the code to perform RSS to the kernel in order to overcome
>>>> thse challenges. An alternative solution is to extend the eBPF steering
>>>> program so that it will be able to report to the userspace, but I didn't
>>>> opt for it because extending the current mechanism of eBPF steering
>>>> program as is because it relies on legacy context rewriting, and
>>>> introducing kfunc-based eBPF will result in non-UAPI dependency while
>>>> the other relevant virtualization APIs such as KVM and vhost_net are
>>>> UAPIs.
>>>>
>>>> Added ioctls
>>>> ============
>>>>
>>>> They are designed to make extensibility and VM migration compatible.
>>>> This change only adds the implementation and does not expose them to
>>>> the userspace.
>>>>
>>>> TUNGETVNETHASHTYPES
>>>> -------------------
>>>>
>>>> This ioctl tells supported hash types. It is useful to check if a VM can
>>>> be migrated to the current host.
>>>>
>>>> TUNSETVNETREPORTINGAUTOMQ, TUNSETVNETREPORTINGRSS, and TUNSETVNETRSS
>>>> --------------------------------------------------------------------
>>>>
>>>> These ioctls configures a steering algorithm and, if needed, hash
>>>> reporting.
>>>>
>>>> Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
>>>> Tested-by: Lei Yang <leiyang@redhat.com>
>>>> ---
>>>>    drivers/net/tap.c           |  10 ++-
>>>>    drivers/net/tun.c           |  12 +++-
>>>>    drivers/net/tun_vnet.h      | 165 +++++++++++++++++++++++++++++++++++++++++---
>>>>    include/uapi/linux/if_tun.h |  71 +++++++++++++++++++
>>>>    4 files changed, 244 insertions(+), 14 deletions(-)
>>>>
>>>> diff --git a/drivers/net/tap.c b/drivers/net/tap.c
>>>> index d4ece538f1b2..25c60ff2d3f2 100644
>>>> --- a/drivers/net/tap.c
>>>> +++ b/drivers/net/tap.c
>>>> @@ -179,6 +179,11 @@ static void tap_put_queue(struct tap_queue *q)
>>>>           sock_put(&q->sk);
>>>>    }
>>>>
>>>> +static const struct virtio_net_hash *tap_find_hash(const struct sk_buff *skb)
>>>> +{
>>>> +       return NULL;
>>>> +}
>>>> +
>>>>    /*
>>>>     * Select a queue based on the rxq of the device on which this packet
>>>>     * arrived. If the incoming device is not mq, calculate a flow hash
>>>> @@ -711,11 +716,12 @@ static ssize_t tap_put_user(struct tap_queue *q,
>>>>           int total;
>>>>
>>>>           if (q->flags & IFF_VNET_HDR) {
>>>> -               struct virtio_net_hdr vnet_hdr;
>>>> +               struct virtio_net_hdr_v1_hash vnet_hdr;
>>>>
>>>>                   vnet_hdr_len = READ_ONCE(q->vnet_hdr_sz);
>>>>
>>>> -               ret = tun_vnet_hdr_from_skb(q->flags, NULL, skb, &vnet_hdr);
>>>> +               ret = tun_vnet_hdr_from_skb(vnet_hdr_len, q->flags, NULL, skb,
>>>> +                                           tap_find_hash, &vnet_hdr);
>>>>                   if (ret)
>>>>                           return ret;
>>>>
>>>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
>>>> index 9133ab9ed3f5..03d47799e9bd 100644
>>>> --- a/drivers/net/tun.c
>>>> +++ b/drivers/net/tun.c
>>>> @@ -451,6 +451,11 @@ static inline void tun_flow_save_rps_rxhash(struct tun_flow_entry *e, u32 hash)
>>>>                   e->rps_rxhash = hash;
>>>>    }
>>>>
>>>> +static const struct virtio_net_hash *tun_find_hash(const struct sk_buff *skb)
>>>> +{
>>>> +       return NULL;
>>>> +}
>>>> +
>>>>    /* We try to identify a flow through its rxhash. The reason that
>>>>     * we do not check rxq no. is because some cards(e.g 82599), chooses
>>>>     * the rxq based on the txq where the last packet of the flow comes. As
>>>> @@ -1993,7 +1998,7 @@ static ssize_t tun_put_user_xdp(struct tun_struct *tun,
>>>>           ssize_t ret;
>>>>
>>>>           if (tun->flags & IFF_VNET_HDR) {
>>>> -               struct virtio_net_hdr gso = { 0 };
>>>> +               struct virtio_net_hdr_v1_hash gso = { 0 };
>>>>
>>>>                   vnet_hdr_sz = READ_ONCE(tun->vnet_hdr_sz);
>>>>                   ret = tun_vnet_hdr_put(vnet_hdr_sz, iter, &gso);
>>>> @@ -2046,9 +2051,10 @@ static ssize_t tun_put_user(struct tun_struct *tun,
>>>>           }
>>>>
>>>>           if (vnet_hdr_sz) {
>>>> -               struct virtio_net_hdr gso;
>>>> +               struct virtio_net_hdr_v1_hash gso;
>>>>
>>>> -               ret = tun_vnet_hdr_from_skb(tun->flags, tun->dev, skb, &gso);
>>>> +               ret = tun_vnet_hdr_from_skb(vnet_hdr_sz, tun->flags, tun->dev,
>>>> +                                           skb, tun_find_hash, &gso);
>>>>                   if (ret)
>>>>                           return ret;
>>>>
>>>> diff --git a/drivers/net/tun_vnet.h b/drivers/net/tun_vnet.h
>>>> index 58b9ac7a5fc4..45d0533efc8d 100644
>>>> --- a/drivers/net/tun_vnet.h
>>>> +++ b/drivers/net/tun_vnet.h
>>>> @@ -6,6 +6,17 @@
>>>>    #define TUN_VNET_LE     0x80000000
>>>>    #define TUN_VNET_BE     0x40000000
>>>>
>>>> +typedef struct virtio_net_hash *(*tun_vnet_hash_add)(struct sk_buff *);
>>>> +typedef const struct virtio_net_hash *(*tun_vnet_hash_find)(const struct sk_buff *);
>>>> +
>>>> +struct tun_vnet_hash {
>>>> +       bool report;
>>>> +       bool rss;
>>>> +       struct tun_vnet_rss common;
>>>> +       u32 rss_key[VIRTIO_NET_RSS_MAX_KEY_SIZE];
>>>> +       u16 rss_indirection_table[];
>>>> +};
>>>> +
>>>>    static inline bool tun_vnet_legacy_is_little_endian(unsigned int flags)
>>>>    {
>>>>           bool be = IS_ENABLED(CONFIG_TUN_VNET_CROSS_LE) &&
>>>> @@ -107,6 +118,128 @@ static inline long tun_vnet_ioctl(int *vnet_hdr_sz, unsigned int *flags,
>>>>           }
>>>>    }
>>>>
>>>> +static inline long tun_vnet_ioctl_gethashtypes(u32 __user *argp)
>>>> +{
>>>> +       return put_user(VIRTIO_NET_SUPPORTED_HASH_TYPES, argp) ? -EFAULT : 0;
>>>> +}
>>>> +
>>>> +static inline long tun_vnet_ioctl_sethash(struct tun_vnet_hash __rcu **hashp,
>>>> +                                         unsigned int cmd,
>>>> +                                         void __user *argp)
>>>> +{
>>>> +       struct tun_vnet_rss common;
>>>> +       struct tun_vnet_hash *hash;
>>>> +       size_t indirection_table_size;
>>>> +       size_t key_size;
>>>> +       size_t size;
>>>> +
>>>> +       switch (cmd) {
>>>> +       case TUNSETVNETREPORTINGAUTOMQ:
>>>> +               if (get_user(common.hash_types, (u32 __user *)argp))
>>>> +                       return -EFAULT;
>>>> +
>>>> +               if (common.hash_types) {
>>>> +                       hash = kzalloc(sizeof(*hash), GFP_KERNEL);
>>>> +                       if (!hash)
>>>> +                               return -ENOMEM;
>>>> +
>>>> +                       hash->report = true;
>>>> +                       hash->common.hash_types = common.hash_types;
>>>> +               } else {
>>>> +                       hash = NULL;
>>>> +               }
>>>> +               break;
>>>> +
>>>> +       case TUNSETVNETREPORTINGRSS:
>>>> +       case TUNSETVNETRSS:
>>>
>>> So the above three shows unnecessary design redundancy as well as a
>>> burden for the future extension.  Why not simply have
>>>
>>> 1) TUNSETVNET_RSS
>>> 2) TUNSETVNET_HASH_REPORT
>>> ?
>>>
>>> Which maps to
>>>
>>>    #define VIRTIO_NET_CTRL_MQ_RSS_CONFIG          1 (for configurable
>>> receive steering)
>>>    #define VIRTIO_NET_CTRL_MQ_HASH_CONFIG         2 (for configurable
>>> hash calculation)
>>>
>>> It would be always easier to start with what spec had or at least we
>>> need to explain why we choose a different design here or in the
>>> changelog to ease our life.
>>
>> TUNSETVNETREPORTINGAUTOMQ maps to VIRTIO_NET_CTRL_MQ_HASH_CONFIG.
> 
> It's not:
> 
> VIRTIO_NET_CTRL_MQ_HASH_CONFIG uses:
> 
> struct virtio_net_hash_config {
>      le32 hash_types;
>      le16 reserved[4];
>      u8 hash_key_length;
>      u8 hash_key_data[hash_key_length];
> };
> 
> but TUNSETVNETREPORTINGAUTOMQ only accepts hash_types without others:

The others are not present because the spec doesn't specify what to do 
with them and the kernel doesn't use them either.

> 
>>
>> TUNSETVNETREPORTINGRSS and TUNSETVNETRSS map to
>> VIRTIO_NET_CTRL_MQ_RSS_CONFIG.
> 
> I think we've already had a discussion about this.
> 
> Reusing virtio-net uAPI is much better instead of having a tun
> specific one considering tun may need to support more virtio commands
> in the future. Or maybe it's the time to introduce a transport for the
> virtio control virtqueue uAPI in tuntap to avoid inventing new uAPI
> endlessly.
> 
> What's more I see:
> 
> struct tun_vnet_rss {
>          __u32 hash_types;
>          __u16 indirection_table_mask;
>          __u16 unclassified_queue;
> };
> 
> struct tun_vnet_hash {
>          bool report;
>          bool rss;
>          struct tun_vnet_rss common;
>          u32 rss_key[VIRTIO_NET_RSS_MAX_KEY_SIZE];
>          u16 rss_indirection_table[];
> };
> 
> As I pointed out in the past, let's just decouple the rss from hash,
> everything would be much simpler, or you need to explain why you
> couple this somewhere.
> 
> For example:
> 
> 1) why is the tun_vnet_hash not part of the uAPI but tun_vnet_rss, or
> how could userspace know what kind of format it would use for
> TUNSETVNETREPORTINGRSS?

That was the previous version.

> 2) what's the advantages of embedding rss specific stuff into hash
> report structure

Because the hash types field in struct tun_vnet_rss is used by hash 
reporting too.

> 3) what's the advantages of not using virtio-net uAPI

1. The use cases that don't involve VM will be simplified; programs for 
such a use case will not need to convert endian or to fill fileds the 
kernel doesn't care.
2. It aligns with existing UAPIs that operate in native endian and don't 
use virtio-net data structures like TUNSETOFFLOAD and TUNSETVNETHDRSZ.

> 
> More issues:
> 
> 1) max_tx_vq is ignored, so do you expect the userspace to intercept
> this and switch to using TUNSETQUEUE? This seems like a burden as TUN
> can simply accept it and do the attaching/detaching by itself
> 2) the rx depends on the indirection table, so userspace need to
> intercept the indirection and change the rx queue numbers accordingly
> 3) RSS allows a async TX/RX queue, this is not supported by TUN now,
> no matter if we decide to let userspace to intercept max_tx_vq or not,
> we need to implement it first
 > > Things would be much more simpler, if kernel can do 1) and 2).

Attaching and detaching queues is not possible for the kernel because it 
doesn't know what file descriptors that map to queues will be used by 
the userspace.

The following patch does 2) for QEMU:
https://lore.kernel.org/qemu-devel/20250322-vq-v2-1-cee0aafe6404@daynix.com/

For 3), the patch for QEMU takes the maximum of TX and RX queue numbers 
to derive the number of queue pairs.

> 
>> We have two ioctls here because
>> VIRTIO_NET_CTRL_MQ_RSS_CONFIG behaves differently depending on whether
>> VIRTIO_NET_F_HASH_REPORT is negotiated or not;
> 
> It wouldn't be a problem if you do 1:1 mapping between virtio commands
> and TUN uAPI, otherwise it should have a bug somewhere.

Speaking of 1:1 mapping, it is possible to map VIRTIO_NET_F_HASH_REPORT 
into another ioctl. It may help add another receive steering algorithm 
in the future by not requiring two ioctls (TUNSETVNETREPORTING_X and 
TUNSETVNET_X).

> 
>> it also enables hash
>> reporting if the feature is negotiated.
> 
> Again, starting from virtio-net command is easier, a strong
> justification is needed to explain why we choose another for tun/tap.
> 
>>
>>>
>>> One day we would have tun_select_queue_algorithm_x() we don't have to
>>> duplicate the ioctls once again here like TUNSETVNETREPORTINGXYZ
>>
>> 5.1.6.5.6.4 Hash calculation says:
>>>   If VIRTIO_NET_F_HASH_REPORT was negotiated and the device uses
>>> automatic receive steering, the device MUST support a command to
>>> configure hash calculation parameters.
>>>
>>> The driver provides parameters for hash calculation as follows:
>>>
>>> class VIRTIO_NET_CTRL_MQ, command VIRTIO_NET_CTRL_MQ_HASH_CONFIG.
>>
>> VIRTIO_NET_CTRL_MQ_HASH_CONFIG is for automatic receive steering and not
>> for RSS (or other steering algorithms if the spec gets any in the future).
> 
> I'm not sure but the spec needs some tweaking. For example, I don't
> expect there would be a dedicated hash config command for flow filters
> in the future. I think the reason why spec says like this is that
> virtio-net only supports automatic receive steering.
> 
> Note that macvtap doesn't implement automatic receive steering.

QEMU advertises VIRTIO_NET_F_CTRL_VQ for macvtap too, so it should have 
implemented it. I think QEMU with macvtap still compliant to the spec.

"5.1.6.5.6 Automatic receive steering in multiqueue mode" says:

 > After the driver transmitted a packet of a flow on transmitqX, the
 > device SHOULD cause incoming packets for that flow to be steered to
 > receiveqX.

It is "SHOULD", so it is still compliant if the device doesn't properly 
respect the flow.

> 
>>
>>>
>>>> +               if (copy_from_user(&common, argp, sizeof(common)))
>>>> +                       return -EFAULT;
>>>> +               argp = (struct tun_vnet_rss __user *)argp + 1;
>>>> +
>>>> +               indirection_table_size = ((size_t)common.indirection_table_mask + 1) * 2;
>>>> +               key_size = virtio_net_hash_key_length(common.hash_types);
>>>> +               size = struct_size(hash, rss_indirection_table,
>>>> +                                  (size_t)common.indirection_table_mask + 1);
>>>> +
>>>> +               hash = kmalloc(size, GFP_KERNEL);
>>>> +               if (!hash)
>>>> +                       return -ENOMEM;
>>>> +
>>>> +               if (copy_from_user(hash->rss_indirection_table,
>>>> +                                  argp, indirection_table_size)) {
>>>> +                       kfree(hash);
>>>> +                       return -EFAULT;
>>>> +               }
>>>> +               argp = (u16 __user *)argp + common.indirection_table_mask + 1;
>>>> +
>>>> +               if (copy_from_user(hash->rss_key, argp, key_size)) {
>>>> +                       kfree(hash);
>>>> +                       return -EFAULT;
>>>> +               }
>>>> +
>>>> +               virtio_net_toeplitz_convert_key(hash->rss_key, key_size);
>>>> +               hash->report = cmd == TUNSETVNETREPORTINGRSS;
>>>
>>> At least, if this is the only difference why not simply code this into
>>> the ioctl itself other than having a very similar command?
>>
>> It is what the previous version did. Either is fine I guess; the only
>> practical difference would be the size of the configuration struct is
>> smaller with this approach.
>>
>>>
>>>> +               hash->rss = true;
>>>> +               hash->common = common;
>>>> +               break;
>>>> +
>>>> +       default:
>>>> +               return -EINVAL;
>>>> +       }
>>>> +
>>>> +       kfree_rcu_mightsleep(rcu_replace_pointer_rtnl(*hashp, hash));
>>>> +       return 0;
>>>> +}
>>>> +
>>>> +static inline void tun_vnet_hash_report(const struct tun_vnet_hash *hash,
>>>> +                                       struct sk_buff *skb,
>>>> +                                       const struct flow_keys_basic *keys,
>>>> +                                       u32 value,
>>>> +                                       tun_vnet_hash_add vnet_hash_add)
>>>> +{
>>>> +       struct virtio_net_hash *report;
>>>> +
>>>> +       if (!hash || !hash->report)
>>>> +               return;
>>>> +
>>>> +       report = vnet_hash_add(skb);
>>>> +       if (!report)
>>>> +               return;
>>>> +
>>>> +       *report = (struct virtio_net_hash) {
>>>> +               .report = virtio_net_hash_report(hash->common.hash_types, keys),
>>>> +               .value = value
>>>> +       };
>>>> +}
>>>> +
>>>> +static inline u16 tun_vnet_rss_select_queue(u32 numqueues,
>>>> +                                           const struct tun_vnet_hash *hash,
>>>> +                                           struct sk_buff *skb,
>>>> +                                           tun_vnet_hash_add vnet_hash_add)
>>>> +{
>>>> +       struct virtio_net_hash *report;
>>>> +       struct virtio_net_hash ret;
>>>> +       u16 index;
>>>> +
>>>> +       if (!numqueues)
>>>> +               return 0;
>>>> +
>>>> +       virtio_net_hash_rss(skb, hash->common.hash_types, hash->rss_key, &ret);
>>>> +
>>>> +       if (!ret.report)
>>>> +               return hash->common.unclassified_queue % numqueues;
>>>> +
>>>> +       if (hash->report) {
>>>> +               report = vnet_hash_add(skb);
>>>> +               if (report)
>>>> +                       *report = ret;
>>>> +       }
>>>> +
>>>> +       index = ret.value & hash->common.indirection_table_mask;
>>>> +
>>>> +       return hash->rss_indirection_table[index] % numqueues;
>>>> +}
>>>> +
>>>>    static inline int tun_vnet_hdr_get(int sz, unsigned int flags,
>>>>                                      struct iov_iter *from,
>>>>                                      struct virtio_net_hdr *hdr)
>>>> @@ -135,15 +268,17 @@ static inline int tun_vnet_hdr_get(int sz, unsigned int flags,
>>>>    }
>>>>
>>>>    static inline int tun_vnet_hdr_put(int sz, struct iov_iter *iter,
>>>> -                                  const struct virtio_net_hdr *hdr)
>>>> +                                  const struct virtio_net_hdr_v1_hash *hdr)
>>>>    {
>>>> +       int content_sz = MIN(sizeof(*hdr), sz);
>>>> +
>>>>           if (unlikely(iov_iter_count(iter) < sz))
>>>>                   return -EINVAL;
>>>>
>>>> -       if (unlikely(copy_to_iter(hdr, sizeof(*hdr), iter) != sizeof(*hdr)))
>>>> +       if (unlikely(copy_to_iter(hdr, content_sz, iter) != content_sz))
>>>>                   return -EFAULT;
>>>>
>>>> -       if (iov_iter_zero(sz - sizeof(*hdr), iter) != sz - sizeof(*hdr))
>>>> +       if (iov_iter_zero(sz - content_sz, iter) != sz - content_sz)
>>>>                   return -EFAULT;
>>>>
>>>>           return 0;
>>>> @@ -155,26 +290,38 @@ static inline int tun_vnet_hdr_to_skb(unsigned int flags, struct sk_buff *skb,
>>>>           return virtio_net_hdr_to_skb(skb, hdr, tun_vnet_is_little_endian(flags));
>>>>    }
>>>>
>>>> -static inline int tun_vnet_hdr_from_skb(unsigned int flags,
>>>> +static inline int tun_vnet_hdr_from_skb(int sz, unsigned int flags,
>>>>                                           const struct net_device *dev,
>>>>                                           const struct sk_buff *skb,
>>>> -                                       struct virtio_net_hdr *hdr)
>>>> +                                       tun_vnet_hash_find vnet_hash_find,
>>>> +                                       struct virtio_net_hdr_v1_hash *hdr)
>>>>    {
>>>>           int vlan_hlen = skb_vlan_tag_present(skb) ? VLAN_HLEN : 0;
>>>> +       const struct virtio_net_hash *report = sz < sizeof(struct virtio_net_hdr_v1_hash) ?
>>>> +                                              NULL : vnet_hash_find(skb);
>>>> +
>>>> +       *hdr = (struct virtio_net_hdr_v1_hash) {
>>>> +               .hash_report = VIRTIO_NET_HASH_REPORT_NONE
>>>> +       };
>>>> +
>>>> +       if (report) {
>>>> +               hdr->hash_value = cpu_to_le32(report->value);
>>>> +               hdr->hash_report = cpu_to_le16(report->report);
>>>> +       }
>>>>
>>>> -       if (virtio_net_hdr_from_skb(skb, hdr,
>>>> +       if (virtio_net_hdr_from_skb(skb, (struct virtio_net_hdr *)hdr,
>>>>                                       tun_vnet_is_little_endian(flags), true,
>>>>                                       vlan_hlen)) {
>>>>                   struct skb_shared_info *sinfo = skb_shinfo(skb);
>>>>
>>>>                   if (net_ratelimit()) {
>>>>                           netdev_err(dev, "unexpected GSO type: 0x%x, gso_size %d, hdr_len %d\n",
>>>> -                                  sinfo->gso_type, tun_vnet16_to_cpu(flags, hdr->gso_size),
>>>> -                                  tun_vnet16_to_cpu(flags, hdr->hdr_len));
>>>> +                                  sinfo->gso_type, tun_vnet16_to_cpu(flags, hdr->hdr.gso_size),
>>>> +                                  tun_vnet16_to_cpu(flags, hdr->hdr.hdr_len));
>>>>                           print_hex_dump(KERN_ERR, "tun: ",
>>>>                                          DUMP_PREFIX_NONE,
>>>>                                          16, 1, skb->head,
>>>> -                                      min(tun_vnet16_to_cpu(flags, hdr->hdr_len), 64), true);
>>>> +                                      min(tun_vnet16_to_cpu(flags, hdr->hdr.hdr_len), 64), true);
>>>>                   }
>>>>                   WARN_ON_ONCE(1);
>>>>                   return -EINVAL;
>>>> diff --git a/include/uapi/linux/if_tun.h b/include/uapi/linux/if_tun.h
>>>> index 980de74724fc..fe4b984d3bbb 100644
>>>> --- a/include/uapi/linux/if_tun.h
>>>> +++ b/include/uapi/linux/if_tun.h
>>>> @@ -62,6 +62,62 @@
>>>>    #define TUNSETCARRIER _IOW('T', 226, int)
>>>>    #define TUNGETDEVNETNS _IO('T', 227)
>>>>
>>>> +/**
>>>> + * define TUNGETVNETHASHTYPES - ioctl to get supported virtio_net hashing types
>>>> + *
>>>> + * The argument is a pointer to __u32 which will store the supported virtio_net
>>>> + * hashing types.
>>>> + */
>>>> +#define TUNGETVNETHASHTYPES _IOR('T', 228, __u32)
>>>> +
>>>> +/**
>>>> + * define TUNSETVNETREPORTINGAUTOMQ - ioctl to enable automq with hash reporting
>>>> + *
>>>> + * Disable RSS and enable automatic receive steering with hash reporting.
>>>> + *
>>>> + * The argument is a pointer to __u32 that contains a bitmask of hash types
>>>> + * allowed to be reported.
>>>> + *
>>>> + * This ioctl results in %EBADFD if the underlying device is deleted. It affects
>>>> + * all queues attached to the same device.
>>>> + *
>>>> + * This ioctl currently has no effect on XDP packets and packets with
>>>> + * queue_mapping set by TC.
>>>> + */
>>>> +#define TUNSETVNETREPORTINGAUTOMQ _IOR('T', 229, __u32)
>>>> +
>>>> +/**
>>>> + * define TUNSETVNETREPORTINGRSS - ioctl to enable RSS with hash reporting
>>>> + *
>>>> + * Disable automatic receive steering and enable RSS with hash reporting.
>>>
>>> This is unnecessary, e.g one day will have select_queue_xyz(), we
>>> don't want to say "Disable automatic receive steering and xyz ..."
>>
>> It is still something better to be documented as its behavior is
>> somewhat complicated.
> 
> This is a hint of uAPI design issue.
> 
>>
>> Concretely, this ioctl disables automatic receive steering but doesn't
>> disable steering by eBPF, which is implied by TUN_STEERINGEBPF_FALLBACK.
> 
> It would be simpler:
> 
> 1) not having TUN_STEERINGEBPF_FALLBACK
> 2) the steering algorithm depends on the last uAPI call

What will TUNSETSTEERINGEBPF with NULL mean in that case? It currently 
switches the steering algorithm to automq.

Regards,
Akihiko Odaki

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next v12 01/10] virtio_net: Add functions for hashing
  2025-06-05  7:57             ` Akihiko Odaki
@ 2025-06-06  0:48               ` Jason Wang
  2025-06-06  9:10                 ` Akihiko Odaki
  0 siblings, 1 reply; 39+ messages in thread
From: Jason Wang @ 2025-06-06  0:48 UTC (permalink / raw)
  To: Akihiko Odaki, Michael S. Tsirkin
  Cc: Jonathan Corbet, Willem de Bruijn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Xuan Zhuo, Shuah Khan, linux-doc,
	linux-kernel, netdev, kvm, virtualization, linux-kselftest,
	Yuri Benditovich, Andrew Melnychenko, Stephen Hemminger,
	gur.stavi, Lei Yang, Simon Horman

On Thu, Jun 5, 2025 at 3:58 PM Akihiko Odaki <akihiko.odaki@daynix.com> wrote:
>
> On 2025/06/05 10:53, Jason Wang wrote:
> > On Wed, Jun 4, 2025 at 3:20 PM Akihiko Odaki <akihiko.odaki@daynix.com> wrote:
> >>
> >> On 2025/06/04 10:18, Jason Wang wrote:
> >>> On Tue, Jun 3, 2025 at 1:31 PM Akihiko Odaki <akihiko.odaki@daynix.com> wrote:
> >>>>
> >>>> On 2025/06/03 12:19, Jason Wang wrote:
> >>>>> On Fri, May 30, 2025 at 12:50 PM Akihiko Odaki <akihiko.odaki@daynix.com> wrote:
> >>>>>>
> >>>>>> They are useful to implement VIRTIO_NET_F_RSS and
> >>>>>> VIRTIO_NET_F_HASH_REPORT.
> >>>>>>
> >>>>>> Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
> >>>>>> Tested-by: Lei Yang <leiyang@redhat.com>
> >>>>>> ---
> >>>>>>     include/linux/virtio_net.h | 188 +++++++++++++++++++++++++++++++++++++++++++++
> >>>>>>     1 file changed, 188 insertions(+)
> >>>>>>
> >>>>>> diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h
> >>>>>> index 02a9f4dc594d..426f33b4b824 100644
> >>>>>> --- a/include/linux/virtio_net.h
> >>>>>> +++ b/include/linux/virtio_net.h
> >>>>>> @@ -9,6 +9,194 @@
> >>>>>>     #include <uapi/linux/tcp.h>
> >>>>>>     #include <uapi/linux/virtio_net.h>
> >>>>>>
> >>>>>> +struct virtio_net_hash {
> >>>>>> +       u32 value;
> >>>>>> +       u16 report;
> >>>>>> +};
> >>>>>> +
> >>>>>> +struct virtio_net_toeplitz_state {
> >>>>>> +       u32 hash;
> >>>>>> +       const u32 *key;
> >>>>>> +};
> >>>>>> +
> >>>>>> +#define VIRTIO_NET_SUPPORTED_HASH_TYPES (VIRTIO_NET_RSS_HASH_TYPE_IPv4 | \
> >>>>>> +                                        VIRTIO_NET_RSS_HASH_TYPE_TCPv4 | \
> >>>>>> +                                        VIRTIO_NET_RSS_HASH_TYPE_UDPv4 | \
> >>>>>> +                                        VIRTIO_NET_RSS_HASH_TYPE_IPv6 | \
> >>>>>> +                                        VIRTIO_NET_RSS_HASH_TYPE_TCPv6 | \
> >>>>>> +                                        VIRTIO_NET_RSS_HASH_TYPE_UDPv6)
> >>>>>> +
> >>>>>> +#define VIRTIO_NET_RSS_MAX_KEY_SIZE 40
> >>>>>> +
> >>>>>> +static inline void virtio_net_toeplitz_convert_key(u32 *input, size_t len)
> >>>>>> +{
> >>>>>> +       while (len >= sizeof(*input)) {
> >>>>>> +               *input = be32_to_cpu((__force __be32)*input);
> >>>>>> +               input++;
> >>>>>> +               len -= sizeof(*input);
> >>>>>> +       }
> >>>>>> +}
> >>>>>> +
> >>>>>> +static inline void virtio_net_toeplitz_calc(struct virtio_net_toeplitz_state *state,
> >>>>>> +                                           const __be32 *input, size_t len)
> >>>>>> +{
> >>>>>> +       while (len >= sizeof(*input)) {
> >>>>>> +               for (u32 map = be32_to_cpu(*input); map; map &= (map - 1)) {
> >>>>>> +                       u32 i = ffs(map);
> >>>>>> +
> >>>>>> +                       state->hash ^= state->key[0] << (32 - i) |
> >>>>>> +                                      (u32)((u64)state->key[1] >> i);
> >>>>>> +               }
> >>>>>> +
> >>>>>> +               state->key++;
> >>>>>> +               input++;
> >>>>>> +               len -= sizeof(*input);
> >>>>>> +       }
> >>>>>> +}
> >>>>>> +
> >>>>>> +static inline u8 virtio_net_hash_key_length(u32 types)
> >>>>>> +{
> >>>>>> +       size_t len = 0;
> >>>>>> +
> >>>>>> +       if (types & VIRTIO_NET_HASH_REPORT_IPv4)
> >>>>>> +               len = max(len,
> >>>>>> +                         sizeof(struct flow_dissector_key_ipv4_addrs));
> >>>>>> +
> >>>>>> +       if (types &
> >>>>>> +           (VIRTIO_NET_HASH_REPORT_TCPv4 | VIRTIO_NET_HASH_REPORT_UDPv4))
> >>>>>> +               len = max(len,
> >>>>>> +                         sizeof(struct flow_dissector_key_ipv4_addrs) +
> >>>>>> +                         sizeof(struct flow_dissector_key_ports));
> >>>>>> +
> >>>>>> +       if (types & VIRTIO_NET_HASH_REPORT_IPv6)
> >>>>>> +               len = max(len,
> >>>>>> +                         sizeof(struct flow_dissector_key_ipv6_addrs));
> >>>>>> +
> >>>>>> +       if (types &
> >>>>>> +           (VIRTIO_NET_HASH_REPORT_TCPv6 | VIRTIO_NET_HASH_REPORT_UDPv6))
> >>>>>> +               len = max(len,
> >>>>>> +                         sizeof(struct flow_dissector_key_ipv6_addrs) +
> >>>>>> +                         sizeof(struct flow_dissector_key_ports));
> >>>>>> +
> >>>>>> +       return len + sizeof(u32);
> >>>>>> +}
> >>>>>> +
> >>>>>> +static inline u32 virtio_net_hash_report(u32 types,
> >>>>>> +                                        const struct flow_keys_basic *keys)
> >>>>>> +{
> >>>>>> +       switch (keys->basic.n_proto) {
> >>>>>> +       case cpu_to_be16(ETH_P_IP):
> >>>>>> +               if (!(keys->control.flags & FLOW_DIS_IS_FRAGMENT)) {
> >>>>>> +                       if (keys->basic.ip_proto == IPPROTO_TCP &&
> >>>>>> +                           (types & VIRTIO_NET_RSS_HASH_TYPE_TCPv4))
> >>>>>> +                               return VIRTIO_NET_HASH_REPORT_TCPv4;
> >>>>>> +
> >>>>>> +                       if (keys->basic.ip_proto == IPPROTO_UDP &&
> >>>>>> +                           (types & VIRTIO_NET_RSS_HASH_TYPE_UDPv4))
> >>>>>> +                               return VIRTIO_NET_HASH_REPORT_UDPv4;
> >>>>>> +               }
> >>>>>> +
> >>>>>> +               if (types & VIRTIO_NET_RSS_HASH_TYPE_IPv4)
> >>>>>> +                       return VIRTIO_NET_HASH_REPORT_IPv4;
> >>>>>> +
> >>>>>> +               return VIRTIO_NET_HASH_REPORT_NONE;
> >>>>>> +
> >>>>>> +       case cpu_to_be16(ETH_P_IPV6):
> >>>>>> +               if (!(keys->control.flags & FLOW_DIS_IS_FRAGMENT)) {
> >>>>>> +                       if (keys->basic.ip_proto == IPPROTO_TCP &&
> >>>>>> +                           (types & VIRTIO_NET_RSS_HASH_TYPE_TCPv6))
> >>>>>> +                               return VIRTIO_NET_HASH_REPORT_TCPv6;
> >>>>>> +
> >>>>>> +                       if (keys->basic.ip_proto == IPPROTO_UDP &&
> >>>>>> +                           (types & VIRTIO_NET_RSS_HASH_TYPE_UDPv6))
> >>>>>> +                               return VIRTIO_NET_HASH_REPORT_UDPv6;
> >>>>>> +               }
> >>>>>> +
> >>>>>> +               if (types & VIRTIO_NET_RSS_HASH_TYPE_IPv6)
> >>>>>> +                       return VIRTIO_NET_HASH_REPORT_IPv6;
> >>>>>> +
> >>>>>> +               return VIRTIO_NET_HASH_REPORT_NONE;
> >>>>>> +
> >>>>>> +       default:
> >>>>>> +               return VIRTIO_NET_HASH_REPORT_NONE;
> >>>>>> +       }
> >>>>>> +}
> >>>>>> +
> >>>>>> +static inline void virtio_net_hash_rss(const struct sk_buff *skb,
> >>>>>> +                                      u32 types, const u32 *key,
> >>>>>> +                                      struct virtio_net_hash *hash)
> >>>>>> +{
> >>>>>> +       struct virtio_net_toeplitz_state toeplitz_state = { .key = key };
> >>>>>> +       struct flow_keys flow;
> >>>>>> +       struct flow_keys_basic flow_basic;
> >>>>>> +       u16 report;
> >>>>>> +
> >>>>>> +       if (!skb_flow_dissect_flow_keys(skb, &flow, 0)) {
> >>>>>> +               hash->report = VIRTIO_NET_HASH_REPORT_NONE;
> >>>>>> +               return;
> >>>>>> +       }
> >>>>>> +
> >>>>>> +       flow_basic = (struct flow_keys_basic) {
> >>>>>> +               .control = flow.control,
> >>>>>> +               .basic = flow.basic
> >>>>>> +       };
> >>>>>> +
> >>>>>> +       report = virtio_net_hash_report(types, &flow_basic);
> >>>>>> +
> >>>>>> +       switch (report) {
> >>>>>> +       case VIRTIO_NET_HASH_REPORT_IPv4:
> >>>>>> +               virtio_net_toeplitz_calc(&toeplitz_state,
> >>>>>> +                                        (__be32 *)&flow.addrs.v4addrs,
> >>>>>> +                                        sizeof(flow.addrs.v4addrs));
> >>>>>> +               break;
> >>>>>> +
> >>>>>> +       case VIRTIO_NET_HASH_REPORT_TCPv4:
> >>>>>> +               virtio_net_toeplitz_calc(&toeplitz_state,
> >>>>>> +                                        (__be32 *)&flow.addrs.v4addrs,
> >>>>>> +                                        sizeof(flow.addrs.v4addrs));
> >>>>>> +               virtio_net_toeplitz_calc(&toeplitz_state, &flow.ports.ports,
> >>>>>> +                                        sizeof(flow.ports.ports));
> >>>>>> +               break;
> >>>>>> +
> >>>>>> +       case VIRTIO_NET_HASH_REPORT_UDPv4:
> >>>>>> +               virtio_net_toeplitz_calc(&toeplitz_state,
> >>>>>> +                                        (__be32 *)&flow.addrs.v4addrs,
> >>>>>> +                                        sizeof(flow.addrs.v4addrs));
> >>>>>> +               virtio_net_toeplitz_calc(&toeplitz_state, &flow.ports.ports,
> >>>>>> +                                        sizeof(flow.ports.ports));
> >>>>>> +               break;
> >>>>>> +
> >>>>>> +       case VIRTIO_NET_HASH_REPORT_IPv6:
> >>>>>> +               virtio_net_toeplitz_calc(&toeplitz_state,
> >>>>>> +                                        (__be32 *)&flow.addrs.v6addrs,
> >>>>>> +                                        sizeof(flow.addrs.v6addrs));
> >>>>>> +               break;
> >>>>>> +
> >>>>>> +       case VIRTIO_NET_HASH_REPORT_TCPv6:
> >>>>>> +               virtio_net_toeplitz_calc(&toeplitz_state,
> >>>>>> +                                        (__be32 *)&flow.addrs.v6addrs,
> >>>>>> +                                        sizeof(flow.addrs.v6addrs));
> >>>>>> +               virtio_net_toeplitz_calc(&toeplitz_state, &flow.ports.ports,
> >>>>>> +                                        sizeof(flow.ports.ports));
> >>>>>> +               break;
> >>>>>> +
> >>>>>> +       case VIRTIO_NET_HASH_REPORT_UDPv6:
> >>>>>> +               virtio_net_toeplitz_calc(&toeplitz_state,
> >>>>>> +                                        (__be32 *)&flow.addrs.v6addrs,
> >>>>>> +                                        sizeof(flow.addrs.v6addrs));
> >>>>>> +               virtio_net_toeplitz_calc(&toeplitz_state, &flow.ports.ports,
> >>>>>> +                                        sizeof(flow.ports.ports));
> >>>>>> +               break;
> >>>>>> +
> >>>>>> +       default:
> >>>>>> +               hash->report = VIRTIO_NET_HASH_REPORT_NONE;
> >>>>>> +               return;
> >>>>>
> >>>>> So I still think we need a comment here to explain why this is not an
> >>>>> issue if the device can report HASH_XXX_EX. Or we need to add the
> >>>>> support, since this is the code from the driver side, I don't think we
> >>>>> need to worry about the device implementation issues.
> >>>>
> >>>> This is on the device side, and don't report HASH_TYPE_XXX_EX.
> >>>>
> >>>>>
> >>>>> For the issue of the number of options, does the spec forbid fallback
> >>>>> to VIRTIO_NET_HASH_REPORT_NONE? If not, we can do that.
> >>>>
> >>>> 5.1.6.4.3.4 "IPv6 packets with extension header" says:
> >>>>    > If VIRTIO_NET_HASH_TYPE_TCP_EX is set and the packet has a TCPv6
> >>>>    > header, the hash is calculated over the following fields:
> >>>>    > - Home address from the home address option in the IPv6 destination
> >>>>    >   options header. If the extension header is not present, use the
> >>>>    >   Source IPv6 address.
> >>>>    > - IPv6 address that is contained in the Routing-Header-Type-2 from the
> >>>>    >   associated extension header. If the extension header is not present,
> >>>>    >   use the Destination IPv6 address.
> >>>>    > - Source TCP port
> >>>>    > - Destination TCP port
> >>>>
> >>>> Therefore, if VIRTIO_NET_HASH_TYPE_TCP_EX is set, the packet has a TCPv6
> >>>> and an home address option in the IPv6 destination options header is
> >>>> present, the hash is calculated over the home address. If the hash is
> >>>> not calculated over the home address in such a case, the device is
> >>>> contradicting with this section and violating the spec. The same goes
> >>>> for the other HASH_TYPE_XXX_EX types and Routing-Header-Type-2.
> >>>
> >>> Just to make sure we are one the same page. I meant:
> >>>
> >>> 1) If the hash is not calculated over the home address (in the case of
> >>> IPv6 destination destination), it can still report
> >>> VIRTIO_NET_RSS_HASH_TYPE_IPv6. This is what you implemented in your
> >>> series. So the device can simply fallback to e.g TCPv6 if it can't
> >>> understand all or part of the IPv6 options.
> >>
> >> The spec says it can fallback if "the extension header is not present",
> >> not if the device can't understand the extension header.
> >
> > I don't think so,
> >
> > 1) spec had a condition beforehand:
> >
> > """
> > If VIRTIO_NET_HASH_TYPE_TCP_EX is set and the packet has a TCPv6
> > header, the hash is calculated over the following fields:
> > ...
> > If the extension header is not present ...
> > """
> >
> > So the device can choose not to set VIRTIO_NET_HASH_TYPE_TCP_EX as
> > spec doesn't say device MUST set VIRTIO_NET_HASH_TYPE_TCP_EX if ...
> >
> > 2) implementation wise, since device has limited resources, we can't
> > expect the device can parse arbitrary number of ipv6 options
> >
> > 3) if 1) and 2) not the case, we need fix the spec otherwise implement
> > a spec compliant device is impractical
>
> The statement is preceded by the following:
>  >  The device calculates the hash on IPv4 packets according to
>  > ’Enabled hash types’ bitmask as follows:
>
> The 'Enabled hash types' bitmask is specified by the device.
>
> I think the spec needs amendment.

Michael, can you help to clarify here?

>
> I wonder if there are any people interested in the feature though.
> Looking at virtnet_set_hashflow() in drivers/net/virtio_net.c, the
> driver of Linux does not let users configure HASH_TYPE_XXX_EX. I suppose
> Windows supports HASH_TYPE_XXX_EX, but those who care network
> performance most would use Linux so HASH_TYPE_XXX_EX support without
> Linux driver's support may not be useful.

It might be still interesting for example for the hardware virtio
vendors to support windows etc.

>
> >
> >>
> >>> 2) the VIRTIO_NET_SUPPORTED_HASH_TYPES is not checked against the
> >>> tun_vnet_ioctl_sethash(), so userspace may set
> >>> VIRTIO_NET_HASH_TYPE_TCP_EX regardless of what has been returned by
> >>> tun_vnet_ioctl_gethashtypes(). In this case they won't get
> >>> VIRTIO_NET_HASH_TYPE_TCP_EX.
> >>
> >> That's right. It's the responsibility of the userspace to set only the
> >> supported hash types.
> >
> > Well, the kernel should filter out the unsupported one to have a
> > robust uAPI. Otherwise, we give green light to the buggy userspace
> > which will have unexpected results.
>
> My reasoning was that it may be fine for some use cases other than VM
> (e.g., DPDK); in such a use case, it is fine as long as the UAPI works
> in the best-effort basis.

Best-effort might increase the chance for user visisable changes after
migration.

>
> For example, suppose a userspace program that processes TCP packets; the
> program can enable: HASH_TYPE_IPv4, HASH_TYPE_TCPv4, HASH_TYPE_IPv6, and
> HASH_TYPE_TCPv6. Ideally, the kernel should support all the hash types,
> but, even if e.g., HASH_TYPE_TCPv6 is not available,

For "available" did you mean it is not supported by the device?

> it will fall back
> to HASH_TYPE_IPv6, which still does something good and may be acceptable.

This fallback is exactly the same as I said above, let
VIRTIO_NET_HASH_TYPE_TCP_EX to fallback.

My point is that, the implementation should either:

1) allow fallback so it can claim to support all hash types

or

2) don't allow fallback so it can only support a part of the hash types

If we're doing something in the middle, for example, allow part of the
type to fallback.

>
> That said, for a use case that involves VM and implements virtio-net
> (e.g., QEMU), setting an unsupported hash type here is definitely a bug.
> Catching the bug may outweigh the extra trouble for other use cases.
>
> >
> >>
> >>> 3) implementing part of the hash types might complicate the migration
> >>> or at least we need to describe the expectations of libvirt or other
> >>> management in this case. For example, do we plan to have a dedicated
> >>> Qemu command line like:
> >>>
> >>> -device virtio-net-pci,hash_report=on,supported_hash_types=X,Y,Z?
> >>
> >> I posted a patch series to implement such a command line for vDPA[1].
> >> The patch series that wires this tuntap feature up[2] reuses the
> >> infrastructure so it doesn't bring additional complexity.
> >>
> >> [1]
> >> https://lore.kernel.org/qemu-devel/20250530-vdpa-v1-0-5af4109b1c19@daynix.com/
> >> [2]
> >> https://lore.kernel.org/qemu-devel/20250530-hash-v5-0-343d7d7a8200@daynix.com/
> >
> > I meant, if we implement a full hash report feature, it means a single
> > hash cmdline option is more than sufficient and so compatibility code
> > can just turn it off when dealing with machine types. This is much
> > more simpler than
> >
> > 1) having both hash as well as supported_hash_features
> > 2) dealing both hash as well as supported_hash_features in compatibility codes
> > 3) libvirt will be happy
> >
> > For [1], it seems it introduces a per has type option, this seems to
> > be a burden to the management layer as it need to learn new option
> > everytime a new hash type is supported
>
> Even with the command line you proposed (supported_hash_types=X,Y,Z), it
> is still necessary to know the values the supported_hash_types property
> accepts (X.Y,Z), so I don't think it makes difference.

It could be a uint32_t.

>
> The burden to the management layer is already present for features, so
> it is an existing problem (or its mere extension).

Yes, but since this feature is new it's better to try our best to avoid that.

>
> This problem was discussed in the following thread in the past, but no
> solution is implemented yet, and probably solving it will be difficult.
> https://lore.kernel.org/qemu-devel/20230731223148.1002258-5-yuri.benditovich@daynix.com/

It's a similar issue but not the same, it looks more like a discussion
on whether the fallback from vhost-net to qemu works for missing
features etc.

Thanks

>
> Regards,
> Akihiko Odaki
>


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next v12 04/10] tun: Add common virtio-net hash feature code
  2025-06-05  8:18         ` Akihiko Odaki
@ 2025-06-06  1:01           ` Jason Wang
  2025-06-06  9:26             ` Akihiko Odaki
  0 siblings, 1 reply; 39+ messages in thread
From: Jason Wang @ 2025-06-06  1:01 UTC (permalink / raw)
  To: Akihiko Odaki
  Cc: Jonathan Corbet, Willem de Bruijn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Michael S. Tsirkin, Xuan Zhuo,
	Shuah Khan, linux-doc, linux-kernel, netdev, kvm, virtualization,
	linux-kselftest, Yuri Benditovich, Andrew Melnychenko,
	Stephen Hemminger, gur.stavi, Lei Yang, Simon Horman

On Thu, Jun 5, 2025 at 4:18 PM Akihiko Odaki <akihiko.odaki@daynix.com> wrote:
>
> On 2025/06/05 11:46, Jason Wang wrote:
> > On Wed, Jun 4, 2025 at 4:42 PM Akihiko Odaki <akihiko.odaki@daynix.com> wrote:
> >>
> >> On 2025/06/04 10:53, Jason Wang wrote:
> >>> On Fri, May 30, 2025 at 12:50 PM Akihiko Odaki <akihiko.odaki@daynix.com> wrote:
> >>>>
> >>>> Add common code required for the features being added to TUN and TAP.
> >>>> They will be enabled for each of them in following patches.
> >>>>
> >>>> Added Features
> >>>> ==============
> >>>>
> >>>> Hash reporting
> >>>> --------------
> >>>>
> >>>> Allow the guest to reuse the hash value to make receive steering
> >>>> consistent between the host and guest, and to save hash computation.
> >>>>
> >>>> Receive Side Scaling (RSS)
> >>>> --------------------------
> >>>>
> >>>> RSS is a receive steering algorithm that can be negotiated to use with
> >>>> virtio_net. Conventionally the hash calculation was done by the VMM.
> >>>> However, computing the hash after the queue was chosen defeats the
> >>>> purpose of RSS.
> >>>>
> >>>> Another approach is to use eBPF steering program. This approach has
> >>>> another downside: it cannot report the calculated hash due to the
> >>>> restrictive nature of eBPF steering program.
> >>>>
> >>>> Introduce the code to perform RSS to the kernel in order to overcome
> >>>> thse challenges. An alternative solution is to extend the eBPF steering
> >>>> program so that it will be able to report to the userspace, but I didn't
> >>>> opt for it because extending the current mechanism of eBPF steering
> >>>> program as is because it relies on legacy context rewriting, and
> >>>> introducing kfunc-based eBPF will result in non-UAPI dependency while
> >>>> the other relevant virtualization APIs such as KVM and vhost_net are
> >>>> UAPIs.
> >>>>
> >>>> Added ioctls
> >>>> ============
> >>>>
> >>>> They are designed to make extensibility and VM migration compatible.
> >>>> This change only adds the implementation and does not expose them to
> >>>> the userspace.
> >>>>
> >>>> TUNGETVNETHASHTYPES
> >>>> -------------------
> >>>>
> >>>> This ioctl tells supported hash types. It is useful to check if a VM can
> >>>> be migrated to the current host.
> >>>>
> >>>> TUNSETVNETREPORTINGAUTOMQ, TUNSETVNETREPORTINGRSS, and TUNSETVNETRSS
> >>>> --------------------------------------------------------------------
> >>>>
> >>>> These ioctls configures a steering algorithm and, if needed, hash
> >>>> reporting.
> >>>>
> >>>> Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
> >>>> Tested-by: Lei Yang <leiyang@redhat.com>
> >>>> ---
> >>>>    drivers/net/tap.c           |  10 ++-
> >>>>    drivers/net/tun.c           |  12 +++-
> >>>>    drivers/net/tun_vnet.h      | 165 +++++++++++++++++++++++++++++++++++++++++---
> >>>>    include/uapi/linux/if_tun.h |  71 +++++++++++++++++++
> >>>>    4 files changed, 244 insertions(+), 14 deletions(-)
> >>>>
> >>>> diff --git a/drivers/net/tap.c b/drivers/net/tap.c
> >>>> index d4ece538f1b2..25c60ff2d3f2 100644
> >>>> --- a/drivers/net/tap.c
> >>>> +++ b/drivers/net/tap.c
> >>>> @@ -179,6 +179,11 @@ static void tap_put_queue(struct tap_queue *q)
> >>>>           sock_put(&q->sk);
> >>>>    }
> >>>>
> >>>> +static const struct virtio_net_hash *tap_find_hash(const struct sk_buff *skb)
> >>>> +{
> >>>> +       return NULL;
> >>>> +}
> >>>> +
> >>>>    /*
> >>>>     * Select a queue based on the rxq of the device on which this packet
> >>>>     * arrived. If the incoming device is not mq, calculate a flow hash
> >>>> @@ -711,11 +716,12 @@ static ssize_t tap_put_user(struct tap_queue *q,
> >>>>           int total;
> >>>>
> >>>>           if (q->flags & IFF_VNET_HDR) {
> >>>> -               struct virtio_net_hdr vnet_hdr;
> >>>> +               struct virtio_net_hdr_v1_hash vnet_hdr;
> >>>>
> >>>>                   vnet_hdr_len = READ_ONCE(q->vnet_hdr_sz);
> >>>>
> >>>> -               ret = tun_vnet_hdr_from_skb(q->flags, NULL, skb, &vnet_hdr);
> >>>> +               ret = tun_vnet_hdr_from_skb(vnet_hdr_len, q->flags, NULL, skb,
> >>>> +                                           tap_find_hash, &vnet_hdr);
> >>>>                   if (ret)
> >>>>                           return ret;
> >>>>
> >>>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> >>>> index 9133ab9ed3f5..03d47799e9bd 100644
> >>>> --- a/drivers/net/tun.c
> >>>> +++ b/drivers/net/tun.c
> >>>> @@ -451,6 +451,11 @@ static inline void tun_flow_save_rps_rxhash(struct tun_flow_entry *e, u32 hash)
> >>>>                   e->rps_rxhash = hash;
> >>>>    }
> >>>>
> >>>> +static const struct virtio_net_hash *tun_find_hash(const struct sk_buff *skb)
> >>>> +{
> >>>> +       return NULL;
> >>>> +}
> >>>> +
> >>>>    /* We try to identify a flow through its rxhash. The reason that
> >>>>     * we do not check rxq no. is because some cards(e.g 82599), chooses
> >>>>     * the rxq based on the txq where the last packet of the flow comes. As
> >>>> @@ -1993,7 +1998,7 @@ static ssize_t tun_put_user_xdp(struct tun_struct *tun,
> >>>>           ssize_t ret;
> >>>>
> >>>>           if (tun->flags & IFF_VNET_HDR) {
> >>>> -               struct virtio_net_hdr gso = { 0 };
> >>>> +               struct virtio_net_hdr_v1_hash gso = { 0 };
> >>>>
> >>>>                   vnet_hdr_sz = READ_ONCE(tun->vnet_hdr_sz);
> >>>>                   ret = tun_vnet_hdr_put(vnet_hdr_sz, iter, &gso);
> >>>> @@ -2046,9 +2051,10 @@ static ssize_t tun_put_user(struct tun_struct *tun,
> >>>>           }
> >>>>
> >>>>           if (vnet_hdr_sz) {
> >>>> -               struct virtio_net_hdr gso;
> >>>> +               struct virtio_net_hdr_v1_hash gso;
> >>>>
> >>>> -               ret = tun_vnet_hdr_from_skb(tun->flags, tun->dev, skb, &gso);
> >>>> +               ret = tun_vnet_hdr_from_skb(vnet_hdr_sz, tun->flags, tun->dev,
> >>>> +                                           skb, tun_find_hash, &gso);
> >>>>                   if (ret)
> >>>>                           return ret;
> >>>>
> >>>> diff --git a/drivers/net/tun_vnet.h b/drivers/net/tun_vnet.h
> >>>> index 58b9ac7a5fc4..45d0533efc8d 100644
> >>>> --- a/drivers/net/tun_vnet.h
> >>>> +++ b/drivers/net/tun_vnet.h
> >>>> @@ -6,6 +6,17 @@
> >>>>    #define TUN_VNET_LE     0x80000000
> >>>>    #define TUN_VNET_BE     0x40000000
> >>>>
> >>>> +typedef struct virtio_net_hash *(*tun_vnet_hash_add)(struct sk_buff *);
> >>>> +typedef const struct virtio_net_hash *(*tun_vnet_hash_find)(const struct sk_buff *);
> >>>> +
> >>>> +struct tun_vnet_hash {
> >>>> +       bool report;
> >>>> +       bool rss;
> >>>> +       struct tun_vnet_rss common;
> >>>> +       u32 rss_key[VIRTIO_NET_RSS_MAX_KEY_SIZE];
> >>>> +       u16 rss_indirection_table[];
> >>>> +};
> >>>> +
> >>>>    static inline bool tun_vnet_legacy_is_little_endian(unsigned int flags)
> >>>>    {
> >>>>           bool be = IS_ENABLED(CONFIG_TUN_VNET_CROSS_LE) &&
> >>>> @@ -107,6 +118,128 @@ static inline long tun_vnet_ioctl(int *vnet_hdr_sz, unsigned int *flags,
> >>>>           }
> >>>>    }
> >>>>
> >>>> +static inline long tun_vnet_ioctl_gethashtypes(u32 __user *argp)
> >>>> +{
> >>>> +       return put_user(VIRTIO_NET_SUPPORTED_HASH_TYPES, argp) ? -EFAULT : 0;
> >>>> +}
> >>>> +
> >>>> +static inline long tun_vnet_ioctl_sethash(struct tun_vnet_hash __rcu **hashp,
> >>>> +                                         unsigned int cmd,
> >>>> +                                         void __user *argp)
> >>>> +{
> >>>> +       struct tun_vnet_rss common;
> >>>> +       struct tun_vnet_hash *hash;
> >>>> +       size_t indirection_table_size;
> >>>> +       size_t key_size;
> >>>> +       size_t size;
> >>>> +
> >>>> +       switch (cmd) {
> >>>> +       case TUNSETVNETREPORTINGAUTOMQ:
> >>>> +               if (get_user(common.hash_types, (u32 __user *)argp))
> >>>> +                       return -EFAULT;
> >>>> +
> >>>> +               if (common.hash_types) {
> >>>> +                       hash = kzalloc(sizeof(*hash), GFP_KERNEL);
> >>>> +                       if (!hash)
> >>>> +                               return -ENOMEM;
> >>>> +
> >>>> +                       hash->report = true;
> >>>> +                       hash->common.hash_types = common.hash_types;
> >>>> +               } else {
> >>>> +                       hash = NULL;
> >>>> +               }
> >>>> +               break;
> >>>> +
> >>>> +       case TUNSETVNETREPORTINGRSS:
> >>>> +       case TUNSETVNETRSS:
> >>>
> >>> So the above three shows unnecessary design redundancy as well as a
> >>> burden for the future extension.  Why not simply have
> >>>
> >>> 1) TUNSETVNET_RSS
> >>> 2) TUNSETVNET_HASH_REPORT
> >>> ?
> >>>
> >>> Which maps to
> >>>
> >>>    #define VIRTIO_NET_CTRL_MQ_RSS_CONFIG          1 (for configurable
> >>> receive steering)
> >>>    #define VIRTIO_NET_CTRL_MQ_HASH_CONFIG         2 (for configurable
> >>> hash calculation)
> >>>
> >>> It would be always easier to start with what spec had or at least we
> >>> need to explain why we choose a different design here or in the
> >>> changelog to ease our life.
> >>
> >> TUNSETVNETREPORTINGAUTOMQ maps to VIRTIO_NET_CTRL_MQ_HASH_CONFIG.
> >
> > It's not:
> >
> > VIRTIO_NET_CTRL_MQ_HASH_CONFIG uses:
> >
> > struct virtio_net_hash_config {
> >      le32 hash_types;
> >      le16 reserved[4];
> >      u8 hash_key_length;
> >      u8 hash_key_data[hash_key_length];
> > };
> >
> > but TUNSETVNETREPORTINGAUTOMQ only accepts hash_types without others:
>
> The others are not present because the spec doesn't specify what to do
> with them and the kernel doesn't use them either.

Did you mean the hash_key_length and hash_key_data? Note that we have
drivers other than the Linux ones as well.

>
> >
> >>
> >> TUNSETVNETREPORTINGRSS and TUNSETVNETRSS map to
> >> VIRTIO_NET_CTRL_MQ_RSS_CONFIG.
> >
> > I think we've already had a discussion about this.
> >
> > Reusing virtio-net uAPI is much better instead of having a tun
> > specific one considering tun may need to support more virtio commands
> > in the future. Or maybe it's the time to introduce a transport for the
> > virtio control virtqueue uAPI in tuntap to avoid inventing new uAPI
> > endlessly.
> >
> > What's more I see:
> >
> > struct tun_vnet_rss {
> >          __u32 hash_types;
> >          __u16 indirection_table_mask;
> >          __u16 unclassified_queue;
> > };
> >
> > struct tun_vnet_hash {
> >          bool report;
> >          bool rss;
> >          struct tun_vnet_rss common;
> >          u32 rss_key[VIRTIO_NET_RSS_MAX_KEY_SIZE];
> >          u16 rss_indirection_table[];
> > };
> >
> > As I pointed out in the past, let's just decouple the rss from hash,
> > everything would be much simpler, or you need to explain why you
> > couple this somewhere.
> >
> > For example:
> >
> > 1) why is the tun_vnet_hash not part of the uAPI but tun_vnet_rss, or
> > how could userspace know what kind of format it would use for
> > TUNSETVNETREPORTINGRSS?
>
> That was the previous version.
>
> > 2) what's the advantages of embedding rss specific stuff into hash
> > report structure
>
> Because the hash types field in struct tun_vnet_rss is used by hash
> reporting too.
>
> > 3) what's the advantages of not using virtio-net uAPI
>
> 1. The use cases that don't involve VM will be simplified; programs for
> such a use case will not need to convert endian or to fill fileds the
> kernel doesn't care.

Well, virtio_net_hdr is used by packet socket as well. Considering the
complexity of designing a new uAPI, it's still better.

Or maybe you can clarify which field that kernel doesn't care about?
In this case TUN/TAP is basically the device datapath, if some of the
fields don't make sense, it's a bug of the spec.

> 2. It aligns with existing UAPIs that operate in native endian and don't
> use virtio-net data structures like TUNSETOFFLOAD and TUNSETVNETHDRSZ.

For those two examples, it would be used by guests directly. This is
different from RSS stuff.

With native endian, you need an endian conversation that converts le to native.

>
> >
> > More issues:
> >
> > 1) max_tx_vq is ignored, so do you expect the userspace to intercept
> > this and switch to using TUNSETQUEUE? This seems like a burden as TUN
> > can simply accept it and do the attaching/detaching by itself
> > 2) the rx depends on the indirection table, so userspace need to
> > intercept the indirection and change the rx queue numbers accordingly
> > 3) RSS allows a async TX/RX queue, this is not supported by TUN now,
> > no matter if we decide to let userspace to intercept max_tx_vq or not,
> > we need to implement it first
>  > > Things would be much more simpler, if kernel can do 1) and 2).
>
> Attaching and detaching queues is not possible for the kernel because it
> doesn't know what file descriptors that map to queues will be used by
> the userspace.

The kernel knows, tfile has a queue_index part.

>
> The following patch does 2) for QEMU:
> https://lore.kernel.org/qemu-devel/20250322-vq-v2-1-cee0aafe6404@daynix.com/

See below point, form the view of the kernel, it's still a queue pair
not async TX/RX queue.

>
> For 3), the patch for QEMU takes the maximum of TX and RX queue numbers
> to derive the number of queue pairs.
>
> >
> >> We have two ioctls here because
> >> VIRTIO_NET_CTRL_MQ_RSS_CONFIG behaves differently depending on whether
> >> VIRTIO_NET_F_HASH_REPORT is negotiated or not;
> >
> > It wouldn't be a problem if you do 1:1 mapping between virtio commands
> > and TUN uAPI, otherwise it should have a bug somewhere.
>
> Speaking of 1:1 mapping, it is possible to map VIRTIO_NET_F_HASH_REPORT
> into another ioctl. It may help add another receive steering algorithm
> in the future by not requiring two ioctls (TUNSETVNETREPORTING_X and
> TUNSETVNET_X).

Yes and as I pointed out, virtio_net_hash_config should not be
specific to automq, it can work for other steering algorithm as well.

>
> >
> >> it also enables hash
> >> reporting if the feature is negotiated.
> >
> > Again, starting from virtio-net command is easier, a strong
> > justification is needed to explain why we choose another for tun/tap.
> >
> >>
> >>>
> >>> One day we would have tun_select_queue_algorithm_x() we don't have to
> >>> duplicate the ioctls once again here like TUNSETVNETREPORTINGXYZ
> >>
> >> 5.1.6.5.6.4 Hash calculation says:
> >>>   If VIRTIO_NET_F_HASH_REPORT was negotiated and the device uses
> >>> automatic receive steering, the device MUST support a command to
> >>> configure hash calculation parameters.
> >>>
> >>> The driver provides parameters for hash calculation as follows:
> >>>
> >>> class VIRTIO_NET_CTRL_MQ, command VIRTIO_NET_CTRL_MQ_HASH_CONFIG.
> >>
> >> VIRTIO_NET_CTRL_MQ_HASH_CONFIG is for automatic receive steering and not
> >> for RSS (or other steering algorithms if the spec gets any in the future).
> >
> > I'm not sure but the spec needs some tweaking. For example, I don't
> > expect there would be a dedicated hash config command for flow filters
> > in the future. I think the reason why spec says like this is that
> > virtio-net only supports automatic receive steering.
> >
> > Note that macvtap doesn't implement automatic receive steering.
>
> QEMU advertises VIRTIO_NET_F_CTRL_VQ for macvtap too, so it should have
> implemented it. I think QEMU with macvtap still compliant to the spec.

Compliant, but automatic traffic steering is the best effort as well.

Nope, TUN/TAP implements a flow cache that can steer tx based on rx.
Macvtap simply uses hash here.

>
> "5.1.6.5.6 Automatic receive steering in multiqueue mode" says:
>
>  > After the driver transmitted a packet of a flow on transmitqX, the
>  > device SHOULD cause incoming packets for that flow to be steered to
>  > receiveqX.
>
> It is "SHOULD", so it is still compliant if the device doesn't properly
> respect the flow.

Yes, a quality of implementation, or it's impractical to support a
correct steering for this device as limited resources and mailious
users can do syn flood etc.

>
> >
> >>
> >>>
> >>>> +               if (copy_from_user(&common, argp, sizeof(common)))
> >>>> +                       return -EFAULT;
> >>>> +               argp = (struct tun_vnet_rss __user *)argp + 1;
> >>>> +
> >>>> +               indirection_table_size = ((size_t)common.indirection_table_mask + 1) * 2;
> >>>> +               key_size = virtio_net_hash_key_length(common.hash_types);
> >>>> +               size = struct_size(hash, rss_indirection_table,
> >>>> +                                  (size_t)common.indirection_table_mask + 1);
> >>>> +
> >>>> +               hash = kmalloc(size, GFP_KERNEL);
> >>>> +               if (!hash)
> >>>> +                       return -ENOMEM;
> >>>> +
> >>>> +               if (copy_from_user(hash->rss_indirection_table,
> >>>> +                                  argp, indirection_table_size)) {
> >>>> +                       kfree(hash);
> >>>> +                       return -EFAULT;
> >>>> +               }
> >>>> +               argp = (u16 __user *)argp + common.indirection_table_mask + 1;
> >>>> +
> >>>> +               if (copy_from_user(hash->rss_key, argp, key_size)) {
> >>>> +                       kfree(hash);
> >>>> +                       return -EFAULT;
> >>>> +               }
> >>>> +
> >>>> +               virtio_net_toeplitz_convert_key(hash->rss_key, key_size);
> >>>> +               hash->report = cmd == TUNSETVNETREPORTINGRSS;
> >>>
> >>> At least, if this is the only difference why not simply code this into
> >>> the ioctl itself other than having a very similar command?
> >>
> >> It is what the previous version did. Either is fine I guess; the only
> >> practical difference would be the size of the configuration struct is
> >> smaller with this approach.
> >>
> >>>
> >>>> +               hash->rss = true;
> >>>> +               hash->common = common;
> >>>> +               break;
> >>>> +
> >>>> +       default:
> >>>> +               return -EINVAL;
> >>>> +       }
> >>>> +
> >>>> +       kfree_rcu_mightsleep(rcu_replace_pointer_rtnl(*hashp, hash));
> >>>> +       return 0;
> >>>> +}
> >>>> +
> >>>> +static inline void tun_vnet_hash_report(const struct tun_vnet_hash *hash,
> >>>> +                                       struct sk_buff *skb,
> >>>> +                                       const struct flow_keys_basic *keys,
> >>>> +                                       u32 value,
> >>>> +                                       tun_vnet_hash_add vnet_hash_add)
> >>>> +{
> >>>> +       struct virtio_net_hash *report;
> >>>> +
> >>>> +       if (!hash || !hash->report)
> >>>> +               return;
> >>>> +
> >>>> +       report = vnet_hash_add(skb);
> >>>> +       if (!report)
> >>>> +               return;
> >>>> +
> >>>> +       *report = (struct virtio_net_hash) {
> >>>> +               .report = virtio_net_hash_report(hash->common.hash_types, keys),
> >>>> +               .value = value
> >>>> +       };
> >>>> +}
> >>>> +
> >>>> +static inline u16 tun_vnet_rss_select_queue(u32 numqueues,
> >>>> +                                           const struct tun_vnet_hash *hash,
> >>>> +                                           struct sk_buff *skb,
> >>>> +                                           tun_vnet_hash_add vnet_hash_add)
> >>>> +{
> >>>> +       struct virtio_net_hash *report;
> >>>> +       struct virtio_net_hash ret;
> >>>> +       u16 index;
> >>>> +
> >>>> +       if (!numqueues)
> >>>> +               return 0;
> >>>> +
> >>>> +       virtio_net_hash_rss(skb, hash->common.hash_types, hash->rss_key, &ret);
> >>>> +
> >>>> +       if (!ret.report)
> >>>> +               return hash->common.unclassified_queue % numqueues;
> >>>> +
> >>>> +       if (hash->report) {
> >>>> +               report = vnet_hash_add(skb);
> >>>> +               if (report)
> >>>> +                       *report = ret;
> >>>> +       }
> >>>> +
> >>>> +       index = ret.value & hash->common.indirection_table_mask;
> >>>> +
> >>>> +       return hash->rss_indirection_table[index] % numqueues;
> >>>> +}
> >>>> +
> >>>>    static inline int tun_vnet_hdr_get(int sz, unsigned int flags,
> >>>>                                      struct iov_iter *from,
> >>>>                                      struct virtio_net_hdr *hdr)
> >>>> @@ -135,15 +268,17 @@ static inline int tun_vnet_hdr_get(int sz, unsigned int flags,
> >>>>    }
> >>>>
> >>>>    static inline int tun_vnet_hdr_put(int sz, struct iov_iter *iter,
> >>>> -                                  const struct virtio_net_hdr *hdr)
> >>>> +                                  const struct virtio_net_hdr_v1_hash *hdr)
> >>>>    {
> >>>> +       int content_sz = MIN(sizeof(*hdr), sz);
> >>>> +
> >>>>           if (unlikely(iov_iter_count(iter) < sz))
> >>>>                   return -EINVAL;
> >>>>
> >>>> -       if (unlikely(copy_to_iter(hdr, sizeof(*hdr), iter) != sizeof(*hdr)))
> >>>> +       if (unlikely(copy_to_iter(hdr, content_sz, iter) != content_sz))
> >>>>                   return -EFAULT;
> >>>>
> >>>> -       if (iov_iter_zero(sz - sizeof(*hdr), iter) != sz - sizeof(*hdr))
> >>>> +       if (iov_iter_zero(sz - content_sz, iter) != sz - content_sz)
> >>>>                   return -EFAULT;
> >>>>
> >>>>           return 0;
> >>>> @@ -155,26 +290,38 @@ static inline int tun_vnet_hdr_to_skb(unsigned int flags, struct sk_buff *skb,
> >>>>           return virtio_net_hdr_to_skb(skb, hdr, tun_vnet_is_little_endian(flags));
> >>>>    }
> >>>>
> >>>> -static inline int tun_vnet_hdr_from_skb(unsigned int flags,
> >>>> +static inline int tun_vnet_hdr_from_skb(int sz, unsigned int flags,
> >>>>                                           const struct net_device *dev,
> >>>>                                           const struct sk_buff *skb,
> >>>> -                                       struct virtio_net_hdr *hdr)
> >>>> +                                       tun_vnet_hash_find vnet_hash_find,
> >>>> +                                       struct virtio_net_hdr_v1_hash *hdr)
> >>>>    {
> >>>>           int vlan_hlen = skb_vlan_tag_present(skb) ? VLAN_HLEN : 0;
> >>>> +       const struct virtio_net_hash *report = sz < sizeof(struct virtio_net_hdr_v1_hash) ?
> >>>> +                                              NULL : vnet_hash_find(skb);
> >>>> +
> >>>> +       *hdr = (struct virtio_net_hdr_v1_hash) {
> >>>> +               .hash_report = VIRTIO_NET_HASH_REPORT_NONE
> >>>> +       };
> >>>> +
> >>>> +       if (report) {
> >>>> +               hdr->hash_value = cpu_to_le32(report->value);
> >>>> +               hdr->hash_report = cpu_to_le16(report->report);
> >>>> +       }
> >>>>
> >>>> -       if (virtio_net_hdr_from_skb(skb, hdr,
> >>>> +       if (virtio_net_hdr_from_skb(skb, (struct virtio_net_hdr *)hdr,
> >>>>                                       tun_vnet_is_little_endian(flags), true,
> >>>>                                       vlan_hlen)) {
> >>>>                   struct skb_shared_info *sinfo = skb_shinfo(skb);
> >>>>
> >>>>                   if (net_ratelimit()) {
> >>>>                           netdev_err(dev, "unexpected GSO type: 0x%x, gso_size %d, hdr_len %d\n",
> >>>> -                                  sinfo->gso_type, tun_vnet16_to_cpu(flags, hdr->gso_size),
> >>>> -                                  tun_vnet16_to_cpu(flags, hdr->hdr_len));
> >>>> +                                  sinfo->gso_type, tun_vnet16_to_cpu(flags, hdr->hdr.gso_size),
> >>>> +                                  tun_vnet16_to_cpu(flags, hdr->hdr.hdr_len));
> >>>>                           print_hex_dump(KERN_ERR, "tun: ",
> >>>>                                          DUMP_PREFIX_NONE,
> >>>>                                          16, 1, skb->head,
> >>>> -                                      min(tun_vnet16_to_cpu(flags, hdr->hdr_len), 64), true);
> >>>> +                                      min(tun_vnet16_to_cpu(flags, hdr->hdr.hdr_len), 64), true);
> >>>>                   }
> >>>>                   WARN_ON_ONCE(1);
> >>>>                   return -EINVAL;
> >>>> diff --git a/include/uapi/linux/if_tun.h b/include/uapi/linux/if_tun.h
> >>>> index 980de74724fc..fe4b984d3bbb 100644
> >>>> --- a/include/uapi/linux/if_tun.h
> >>>> +++ b/include/uapi/linux/if_tun.h
> >>>> @@ -62,6 +62,62 @@
> >>>>    #define TUNSETCARRIER _IOW('T', 226, int)
> >>>>    #define TUNGETDEVNETNS _IO('T', 227)
> >>>>
> >>>> +/**
> >>>> + * define TUNGETVNETHASHTYPES - ioctl to get supported virtio_net hashing types
> >>>> + *
> >>>> + * The argument is a pointer to __u32 which will store the supported virtio_net
> >>>> + * hashing types.
> >>>> + */
> >>>> +#define TUNGETVNETHASHTYPES _IOR('T', 228, __u32)
> >>>> +
> >>>> +/**
> >>>> + * define TUNSETVNETREPORTINGAUTOMQ - ioctl to enable automq with hash reporting
> >>>> + *
> >>>> + * Disable RSS and enable automatic receive steering with hash reporting.
> >>>> + *
> >>>> + * The argument is a pointer to __u32 that contains a bitmask of hash types
> >>>> + * allowed to be reported.
> >>>> + *
> >>>> + * This ioctl results in %EBADFD if the underlying device is deleted. It affects
> >>>> + * all queues attached to the same device.
> >>>> + *
> >>>> + * This ioctl currently has no effect on XDP packets and packets with
> >>>> + * queue_mapping set by TC.
> >>>> + */
> >>>> +#define TUNSETVNETREPORTINGAUTOMQ _IOR('T', 229, __u32)
> >>>> +
> >>>> +/**
> >>>> + * define TUNSETVNETREPORTINGRSS - ioctl to enable RSS with hash reporting
> >>>> + *
> >>>> + * Disable automatic receive steering and enable RSS with hash reporting.
> >>>
> >>> This is unnecessary, e.g one day will have select_queue_xyz(), we
> >>> don't want to say "Disable automatic receive steering and xyz ..."
> >>
> >> It is still something better to be documented as its behavior is
> >> somewhat complicated.
> >
> > This is a hint of uAPI design issue.
> >
> >>
> >> Concretely, this ioctl disables automatic receive steering but doesn't
> >> disable steering by eBPF, which is implied by TUN_STEERINGEBPF_FALLBACK.
> >
> > It would be simpler:
> >
> > 1) not having TUN_STEERINGEBPF_FALLBACK
> > 2) the steering algorithm depends on the last uAPI call
>
> What will TUNSETSTEERINGEBPF with NULL mean in that case? It currently
> switches the steering algorithm to automq.

A stackwise semantic then?

Thanks

>
> Regards,
> Akihiko Odaki
>


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next v12 01/10] virtio_net: Add functions for hashing
  2025-06-06  0:48               ` Jason Wang
@ 2025-06-06  9:10                 ` Akihiko Odaki
  2025-06-17  3:28                   ` Jason Wang
  0 siblings, 1 reply; 39+ messages in thread
From: Akihiko Odaki @ 2025-06-06  9:10 UTC (permalink / raw)
  To: Jason Wang, Michael S. Tsirkin
  Cc: Jonathan Corbet, Willem de Bruijn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Xuan Zhuo, Shuah Khan, linux-doc,
	linux-kernel, netdev, kvm, virtualization, linux-kselftest,
	Yuri Benditovich, Andrew Melnychenko, Stephen Hemminger,
	gur.stavi, Lei Yang, Simon Horman

On 2025/06/06 9:48, Jason Wang wrote:
> On Thu, Jun 5, 2025 at 3:58 PM Akihiko Odaki <akihiko.odaki@daynix.com> wrote:
>>
>> On 2025/06/05 10:53, Jason Wang wrote:
>>> On Wed, Jun 4, 2025 at 3:20 PM Akihiko Odaki <akihiko.odaki@daynix.com> wrote:
>>>>
>>>> On 2025/06/04 10:18, Jason Wang wrote:
>>>>> On Tue, Jun 3, 2025 at 1:31 PM Akihiko Odaki <akihiko.odaki@daynix.com> wrote:
>>>>>>
>>>>>> On 2025/06/03 12:19, Jason Wang wrote:
>>>>>>> On Fri, May 30, 2025 at 12:50 PM Akihiko Odaki <akihiko.odaki@daynix.com> wrote:
>>>>>>>>
>>>>>>>> They are useful to implement VIRTIO_NET_F_RSS and
>>>>>>>> VIRTIO_NET_F_HASH_REPORT.
>>>>>>>>
>>>>>>>> Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
>>>>>>>> Tested-by: Lei Yang <leiyang@redhat.com>
>>>>>>>> ---
>>>>>>>>      include/linux/virtio_net.h | 188 +++++++++++++++++++++++++++++++++++++++++++++
>>>>>>>>      1 file changed, 188 insertions(+)
>>>>>>>>
>>>>>>>> diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h
>>>>>>>> index 02a9f4dc594d..426f33b4b824 100644
>>>>>>>> --- a/include/linux/virtio_net.h
>>>>>>>> +++ b/include/linux/virtio_net.h
>>>>>>>> @@ -9,6 +9,194 @@
>>>>>>>>      #include <uapi/linux/tcp.h>
>>>>>>>>      #include <uapi/linux/virtio_net.h>
>>>>>>>>
>>>>>>>> +struct virtio_net_hash {
>>>>>>>> +       u32 value;
>>>>>>>> +       u16 report;
>>>>>>>> +};
>>>>>>>> +
>>>>>>>> +struct virtio_net_toeplitz_state {
>>>>>>>> +       u32 hash;
>>>>>>>> +       const u32 *key;
>>>>>>>> +};
>>>>>>>> +
>>>>>>>> +#define VIRTIO_NET_SUPPORTED_HASH_TYPES (VIRTIO_NET_RSS_HASH_TYPE_IPv4 | \
>>>>>>>> +                                        VIRTIO_NET_RSS_HASH_TYPE_TCPv4 | \
>>>>>>>> +                                        VIRTIO_NET_RSS_HASH_TYPE_UDPv4 | \
>>>>>>>> +                                        VIRTIO_NET_RSS_HASH_TYPE_IPv6 | \
>>>>>>>> +                                        VIRTIO_NET_RSS_HASH_TYPE_TCPv6 | \
>>>>>>>> +                                        VIRTIO_NET_RSS_HASH_TYPE_UDPv6)
>>>>>>>> +
>>>>>>>> +#define VIRTIO_NET_RSS_MAX_KEY_SIZE 40
>>>>>>>> +
>>>>>>>> +static inline void virtio_net_toeplitz_convert_key(u32 *input, size_t len)
>>>>>>>> +{
>>>>>>>> +       while (len >= sizeof(*input)) {
>>>>>>>> +               *input = be32_to_cpu((__force __be32)*input);
>>>>>>>> +               input++;
>>>>>>>> +               len -= sizeof(*input);
>>>>>>>> +       }
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +static inline void virtio_net_toeplitz_calc(struct virtio_net_toeplitz_state *state,
>>>>>>>> +                                           const __be32 *input, size_t len)
>>>>>>>> +{
>>>>>>>> +       while (len >= sizeof(*input)) {
>>>>>>>> +               for (u32 map = be32_to_cpu(*input); map; map &= (map - 1)) {
>>>>>>>> +                       u32 i = ffs(map);
>>>>>>>> +
>>>>>>>> +                       state->hash ^= state->key[0] << (32 - i) |
>>>>>>>> +                                      (u32)((u64)state->key[1] >> i);
>>>>>>>> +               }
>>>>>>>> +
>>>>>>>> +               state->key++;
>>>>>>>> +               input++;
>>>>>>>> +               len -= sizeof(*input);
>>>>>>>> +       }
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +static inline u8 virtio_net_hash_key_length(u32 types)
>>>>>>>> +{
>>>>>>>> +       size_t len = 0;
>>>>>>>> +
>>>>>>>> +       if (types & VIRTIO_NET_HASH_REPORT_IPv4)
>>>>>>>> +               len = max(len,
>>>>>>>> +                         sizeof(struct flow_dissector_key_ipv4_addrs));
>>>>>>>> +
>>>>>>>> +       if (types &
>>>>>>>> +           (VIRTIO_NET_HASH_REPORT_TCPv4 | VIRTIO_NET_HASH_REPORT_UDPv4))
>>>>>>>> +               len = max(len,
>>>>>>>> +                         sizeof(struct flow_dissector_key_ipv4_addrs) +
>>>>>>>> +                         sizeof(struct flow_dissector_key_ports));
>>>>>>>> +
>>>>>>>> +       if (types & VIRTIO_NET_HASH_REPORT_IPv6)
>>>>>>>> +               len = max(len,
>>>>>>>> +                         sizeof(struct flow_dissector_key_ipv6_addrs));
>>>>>>>> +
>>>>>>>> +       if (types &
>>>>>>>> +           (VIRTIO_NET_HASH_REPORT_TCPv6 | VIRTIO_NET_HASH_REPORT_UDPv6))
>>>>>>>> +               len = max(len,
>>>>>>>> +                         sizeof(struct flow_dissector_key_ipv6_addrs) +
>>>>>>>> +                         sizeof(struct flow_dissector_key_ports));
>>>>>>>> +
>>>>>>>> +       return len + sizeof(u32);
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +static inline u32 virtio_net_hash_report(u32 types,
>>>>>>>> +                                        const struct flow_keys_basic *keys)
>>>>>>>> +{
>>>>>>>> +       switch (keys->basic.n_proto) {
>>>>>>>> +       case cpu_to_be16(ETH_P_IP):
>>>>>>>> +               if (!(keys->control.flags & FLOW_DIS_IS_FRAGMENT)) {
>>>>>>>> +                       if (keys->basic.ip_proto == IPPROTO_TCP &&
>>>>>>>> +                           (types & VIRTIO_NET_RSS_HASH_TYPE_TCPv4))
>>>>>>>> +                               return VIRTIO_NET_HASH_REPORT_TCPv4;
>>>>>>>> +
>>>>>>>> +                       if (keys->basic.ip_proto == IPPROTO_UDP &&
>>>>>>>> +                           (types & VIRTIO_NET_RSS_HASH_TYPE_UDPv4))
>>>>>>>> +                               return VIRTIO_NET_HASH_REPORT_UDPv4;
>>>>>>>> +               }
>>>>>>>> +
>>>>>>>> +               if (types & VIRTIO_NET_RSS_HASH_TYPE_IPv4)
>>>>>>>> +                       return VIRTIO_NET_HASH_REPORT_IPv4;
>>>>>>>> +
>>>>>>>> +               return VIRTIO_NET_HASH_REPORT_NONE;
>>>>>>>> +
>>>>>>>> +       case cpu_to_be16(ETH_P_IPV6):
>>>>>>>> +               if (!(keys->control.flags & FLOW_DIS_IS_FRAGMENT)) {
>>>>>>>> +                       if (keys->basic.ip_proto == IPPROTO_TCP &&
>>>>>>>> +                           (types & VIRTIO_NET_RSS_HASH_TYPE_TCPv6))
>>>>>>>> +                               return VIRTIO_NET_HASH_REPORT_TCPv6;
>>>>>>>> +
>>>>>>>> +                       if (keys->basic.ip_proto == IPPROTO_UDP &&
>>>>>>>> +                           (types & VIRTIO_NET_RSS_HASH_TYPE_UDPv6))
>>>>>>>> +                               return VIRTIO_NET_HASH_REPORT_UDPv6;
>>>>>>>> +               }
>>>>>>>> +
>>>>>>>> +               if (types & VIRTIO_NET_RSS_HASH_TYPE_IPv6)
>>>>>>>> +                       return VIRTIO_NET_HASH_REPORT_IPv6;
>>>>>>>> +
>>>>>>>> +               return VIRTIO_NET_HASH_REPORT_NONE;
>>>>>>>> +
>>>>>>>> +       default:
>>>>>>>> +               return VIRTIO_NET_HASH_REPORT_NONE;
>>>>>>>> +       }
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +static inline void virtio_net_hash_rss(const struct sk_buff *skb,
>>>>>>>> +                                      u32 types, const u32 *key,
>>>>>>>> +                                      struct virtio_net_hash *hash)
>>>>>>>> +{
>>>>>>>> +       struct virtio_net_toeplitz_state toeplitz_state = { .key = key };
>>>>>>>> +       struct flow_keys flow;
>>>>>>>> +       struct flow_keys_basic flow_basic;
>>>>>>>> +       u16 report;
>>>>>>>> +
>>>>>>>> +       if (!skb_flow_dissect_flow_keys(skb, &flow, 0)) {
>>>>>>>> +               hash->report = VIRTIO_NET_HASH_REPORT_NONE;
>>>>>>>> +               return;
>>>>>>>> +       }
>>>>>>>> +
>>>>>>>> +       flow_basic = (struct flow_keys_basic) {
>>>>>>>> +               .control = flow.control,
>>>>>>>> +               .basic = flow.basic
>>>>>>>> +       };
>>>>>>>> +
>>>>>>>> +       report = virtio_net_hash_report(types, &flow_basic);
>>>>>>>> +
>>>>>>>> +       switch (report) {
>>>>>>>> +       case VIRTIO_NET_HASH_REPORT_IPv4:
>>>>>>>> +               virtio_net_toeplitz_calc(&toeplitz_state,
>>>>>>>> +                                        (__be32 *)&flow.addrs.v4addrs,
>>>>>>>> +                                        sizeof(flow.addrs.v4addrs));
>>>>>>>> +               break;
>>>>>>>> +
>>>>>>>> +       case VIRTIO_NET_HASH_REPORT_TCPv4:
>>>>>>>> +               virtio_net_toeplitz_calc(&toeplitz_state,
>>>>>>>> +                                        (__be32 *)&flow.addrs.v4addrs,
>>>>>>>> +                                        sizeof(flow.addrs.v4addrs));
>>>>>>>> +               virtio_net_toeplitz_calc(&toeplitz_state, &flow.ports.ports,
>>>>>>>> +                                        sizeof(flow.ports.ports));
>>>>>>>> +               break;
>>>>>>>> +
>>>>>>>> +       case VIRTIO_NET_HASH_REPORT_UDPv4:
>>>>>>>> +               virtio_net_toeplitz_calc(&toeplitz_state,
>>>>>>>> +                                        (__be32 *)&flow.addrs.v4addrs,
>>>>>>>> +                                        sizeof(flow.addrs.v4addrs));
>>>>>>>> +               virtio_net_toeplitz_calc(&toeplitz_state, &flow.ports.ports,
>>>>>>>> +                                        sizeof(flow.ports.ports));
>>>>>>>> +               break;
>>>>>>>> +
>>>>>>>> +       case VIRTIO_NET_HASH_REPORT_IPv6:
>>>>>>>> +               virtio_net_toeplitz_calc(&toeplitz_state,
>>>>>>>> +                                        (__be32 *)&flow.addrs.v6addrs,
>>>>>>>> +                                        sizeof(flow.addrs.v6addrs));
>>>>>>>> +               break;
>>>>>>>> +
>>>>>>>> +       case VIRTIO_NET_HASH_REPORT_TCPv6:
>>>>>>>> +               virtio_net_toeplitz_calc(&toeplitz_state,
>>>>>>>> +                                        (__be32 *)&flow.addrs.v6addrs,
>>>>>>>> +                                        sizeof(flow.addrs.v6addrs));
>>>>>>>> +               virtio_net_toeplitz_calc(&toeplitz_state, &flow.ports.ports,
>>>>>>>> +                                        sizeof(flow.ports.ports));
>>>>>>>> +               break;
>>>>>>>> +
>>>>>>>> +       case VIRTIO_NET_HASH_REPORT_UDPv6:
>>>>>>>> +               virtio_net_toeplitz_calc(&toeplitz_state,
>>>>>>>> +                                        (__be32 *)&flow.addrs.v6addrs,
>>>>>>>> +                                        sizeof(flow.addrs.v6addrs));
>>>>>>>> +               virtio_net_toeplitz_calc(&toeplitz_state, &flow.ports.ports,
>>>>>>>> +                                        sizeof(flow.ports.ports));
>>>>>>>> +               break;
>>>>>>>> +
>>>>>>>> +       default:
>>>>>>>> +               hash->report = VIRTIO_NET_HASH_REPORT_NONE;
>>>>>>>> +               return;
>>>>>>>
>>>>>>> So I still think we need a comment here to explain why this is not an
>>>>>>> issue if the device can report HASH_XXX_EX. Or we need to add the
>>>>>>> support, since this is the code from the driver side, I don't think we
>>>>>>> need to worry about the device implementation issues.
>>>>>>
>>>>>> This is on the device side, and don't report HASH_TYPE_XXX_EX.
>>>>>>
>>>>>>>
>>>>>>> For the issue of the number of options, does the spec forbid fallback
>>>>>>> to VIRTIO_NET_HASH_REPORT_NONE? If not, we can do that.
>>>>>>
>>>>>> 5.1.6.4.3.4 "IPv6 packets with extension header" says:
>>>>>>     > If VIRTIO_NET_HASH_TYPE_TCP_EX is set and the packet has a TCPv6
>>>>>>     > header, the hash is calculated over the following fields:
>>>>>>     > - Home address from the home address option in the IPv6 destination
>>>>>>     >   options header. If the extension header is not present, use the
>>>>>>     >   Source IPv6 address.
>>>>>>     > - IPv6 address that is contained in the Routing-Header-Type-2 from the
>>>>>>     >   associated extension header. If the extension header is not present,
>>>>>>     >   use the Destination IPv6 address.
>>>>>>     > - Source TCP port
>>>>>>     > - Destination TCP port
>>>>>>
>>>>>> Therefore, if VIRTIO_NET_HASH_TYPE_TCP_EX is set, the packet has a TCPv6
>>>>>> and an home address option in the IPv6 destination options header is
>>>>>> present, the hash is calculated over the home address. If the hash is
>>>>>> not calculated over the home address in such a case, the device is
>>>>>> contradicting with this section and violating the spec. The same goes
>>>>>> for the other HASH_TYPE_XXX_EX types and Routing-Header-Type-2.
>>>>>
>>>>> Just to make sure we are one the same page. I meant:
>>>>>
>>>>> 1) If the hash is not calculated over the home address (in the case of
>>>>> IPv6 destination destination), it can still report
>>>>> VIRTIO_NET_RSS_HASH_TYPE_IPv6. This is what you implemented in your
>>>>> series. So the device can simply fallback to e.g TCPv6 if it can't
>>>>> understand all or part of the IPv6 options.
>>>>
>>>> The spec says it can fallback if "the extension header is not present",
>>>> not if the device can't understand the extension header.
>>>
>>> I don't think so,
>>>
>>> 1) spec had a condition beforehand:
>>>
>>> """
>>> If VIRTIO_NET_HASH_TYPE_TCP_EX is set and the packet has a TCPv6
>>> header, the hash is calculated over the following fields:
>>> ...
>>> If the extension header is not present ...
>>> """
>>>
>>> So the device can choose not to set VIRTIO_NET_HASH_TYPE_TCP_EX as
>>> spec doesn't say device MUST set VIRTIO_NET_HASH_TYPE_TCP_EX if ...
>>>
>>> 2) implementation wise, since device has limited resources, we can't
>>> expect the device can parse arbitrary number of ipv6 options
>>>
>>> 3) if 1) and 2) not the case, we need fix the spec otherwise implement
>>> a spec compliant device is impractical
>>
>> The statement is preceded by the following:
>>   >  The device calculates the hash on IPv4 packets according to
>>   > ’Enabled hash types’ bitmask as follows:
>>
>> The 'Enabled hash types' bitmask is specified by the device.
>>
>> I think the spec needs amendment.
> 
> Michael, can you help to clarify here?
> 
>>
>> I wonder if there are any people interested in the feature though.
>> Looking at virtnet_set_hashflow() in drivers/net/virtio_net.c, the
>> driver of Linux does not let users configure HASH_TYPE_XXX_EX. I suppose
>> Windows supports HASH_TYPE_XXX_EX, but those who care network
>> performance most would use Linux so HASH_TYPE_XXX_EX support without
>> Linux driver's support may not be useful.
> 
> It might be still interesting for example for the hardware virtio
> vendors to support windows etc.

I don't know if Windows needs them for e.g., device/driver certification 
so surveying Windows makes sense.

> 
>>
>>>
>>>>
>>>>> 2) the VIRTIO_NET_SUPPORTED_HASH_TYPES is not checked against the
>>>>> tun_vnet_ioctl_sethash(), so userspace may set
>>>>> VIRTIO_NET_HASH_TYPE_TCP_EX regardless of what has been returned by
>>>>> tun_vnet_ioctl_gethashtypes(). In this case they won't get
>>>>> VIRTIO_NET_HASH_TYPE_TCP_EX.
>>>>
>>>> That's right. It's the responsibility of the userspace to set only the
>>>> supported hash types.
>>>
>>> Well, the kernel should filter out the unsupported one to have a
>>> robust uAPI. Otherwise, we give green light to the buggy userspace
>>> which will have unexpected results.
>>
>> My reasoning was that it may be fine for some use cases other than VM
>> (e.g., DPDK); in such a use case, it is fine as long as the UAPI works
>> in the best-effort basis.
> 
> Best-effort might increase the chance for user visisable changes after
> migration.

It is a trade-off between catching a migration bug for VMM and making 
life a bit easier for userspace programs other than VMM.

> 
>>
>> For example, suppose a userspace program that processes TCP packets; the
>> program can enable: HASH_TYPE_IPv4, HASH_TYPE_TCPv4, HASH_TYPE_IPv6, and
>> HASH_TYPE_TCPv6. Ideally, the kernel should support all the hash types,
>> but, even if e.g., HASH_TYPE_TCPv6 is not available,
> 
> For "available" did you mean it is not supported by the device?
> 
>> it will fall back
>> to HASH_TYPE_IPv6, which still does something good and may be acceptable.
> 
> This fallback is exactly the same as I said above, let
> VIRTIO_NET_HASH_TYPE_TCP_EX to fallback.
> 
> My point is that, the implementation should either:
> 
> 1) allow fallback so it can claim to support all hash types
> 
> or
> 
> 2) don't allow fallback so it can only support a part of the hash types
> 
> If we're doing something in the middle, for example, allow part of the
> type to fallback.

1) or the middle will make it unsuitable for VM because it violates the 
virtio spec. 2) makes sense though the trade-off I mentioned should be 
taken into consideration.

> 
>>
>> That said, for a use case that involves VM and implements virtio-net
>> (e.g., QEMU), setting an unsupported hash type here is definitely a bug.
>> Catching the bug may outweigh the extra trouble for other use cases.
>>
>>>
>>>>
>>>>> 3) implementing part of the hash types might complicate the migration
>>>>> or at least we need to describe the expectations of libvirt or other
>>>>> management in this case. For example, do we plan to have a dedicated
>>>>> Qemu command line like:
>>>>>
>>>>> -device virtio-net-pci,hash_report=on,supported_hash_types=X,Y,Z?
>>>>
>>>> I posted a patch series to implement such a command line for vDPA[1].
>>>> The patch series that wires this tuntap feature up[2] reuses the
>>>> infrastructure so it doesn't bring additional complexity.
>>>>
>>>> [1]
>>>> https://lore.kernel.org/qemu-devel/20250530-vdpa-v1-0-5af4109b1c19@daynix.com/
>>>> [2]
>>>> https://lore.kernel.org/qemu-devel/20250530-hash-v5-0-343d7d7a8200@daynix.com/
>>>
>>> I meant, if we implement a full hash report feature, it means a single
>>> hash cmdline option is more than sufficient and so compatibility code
>>> can just turn it off when dealing with machine types. This is much
>>> more simpler than
>>>
>>> 1) having both hash as well as supported_hash_features
>>> 2) dealing both hash as well as supported_hash_features in compatibility codes
>>> 3) libvirt will be happy
>>>
>>> For [1], it seems it introduces a per has type option, this seems to
>>> be a burden to the management layer as it need to learn new option
>>> everytime a new hash type is supported
>>
>> Even with the command line you proposed (supported_hash_types=X,Y,Z), it
>> is still necessary to know the values the supported_hash_types property
>> accepts (X.Y,Z), so I don't think it makes difference.
> 
> It could be a uint32_t.

The management layer will need to know what bits are accepted even with 
uint32_t.

> 
>>
>> The burden to the management layer is already present for features, so
>> it is an existing problem (or its mere extension).
> 
> Yes, but since this feature is new it's better to try our best to avoid that.
> 
>>
>> This problem was discussed in the following thread in the past, but no
>> solution is implemented yet, and probably solving it will be difficult.
>> https://lore.kernel.org/qemu-devel/20230731223148.1002258-5-yuri.benditovich@daynix.com/
> 
> It's a similar issue but not the same, it looks more like a discussion
> on whether the fallback from vhost-net to qemu works for missing
> features etc.

Perhaps we may be able to do better since this feature is new as you say 
and we don't have to worry much about breaking change. I don't have an 
idea for that yet.

Regards,
Akihiko Odaki

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next v12 04/10] tun: Add common virtio-net hash feature code
  2025-06-06  1:01           ` Jason Wang
@ 2025-06-06  9:26             ` Akihiko Odaki
  2025-06-17  3:39               ` Jason Wang
  0 siblings, 1 reply; 39+ messages in thread
From: Akihiko Odaki @ 2025-06-06  9:26 UTC (permalink / raw)
  To: Jason Wang
  Cc: Jonathan Corbet, Willem de Bruijn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Michael S. Tsirkin, Xuan Zhuo,
	Shuah Khan, linux-doc, linux-kernel, netdev, kvm, virtualization,
	linux-kselftest, Yuri Benditovich, Andrew Melnychenko,
	Stephen Hemminger, gur.stavi, Lei Yang, Simon Horman

On 2025/06/06 10:01, Jason Wang wrote:
> On Thu, Jun 5, 2025 at 4:18 PM Akihiko Odaki <akihiko.odaki@daynix.com> wrote:
>>
>> On 2025/06/05 11:46, Jason Wang wrote:
>>> On Wed, Jun 4, 2025 at 4:42 PM Akihiko Odaki <akihiko.odaki@daynix.com> wrote:
>>>>
>>>> On 2025/06/04 10:53, Jason Wang wrote:
>>>>> On Fri, May 30, 2025 at 12:50 PM Akihiko Odaki <akihiko.odaki@daynix.com> wrote:
>>>>>>
>>>>>> Add common code required for the features being added to TUN and TAP.
>>>>>> They will be enabled for each of them in following patches.
>>>>>>
>>>>>> Added Features
>>>>>> ==============
>>>>>>
>>>>>> Hash reporting
>>>>>> --------------
>>>>>>
>>>>>> Allow the guest to reuse the hash value to make receive steering
>>>>>> consistent between the host and guest, and to save hash computation.
>>>>>>
>>>>>> Receive Side Scaling (RSS)
>>>>>> --------------------------
>>>>>>
>>>>>> RSS is a receive steering algorithm that can be negotiated to use with
>>>>>> virtio_net. Conventionally the hash calculation was done by the VMM.
>>>>>> However, computing the hash after the queue was chosen defeats the
>>>>>> purpose of RSS.
>>>>>>
>>>>>> Another approach is to use eBPF steering program. This approach has
>>>>>> another downside: it cannot report the calculated hash due to the
>>>>>> restrictive nature of eBPF steering program.
>>>>>>
>>>>>> Introduce the code to perform RSS to the kernel in order to overcome
>>>>>> thse challenges. An alternative solution is to extend the eBPF steering
>>>>>> program so that it will be able to report to the userspace, but I didn't
>>>>>> opt for it because extending the current mechanism of eBPF steering
>>>>>> program as is because it relies on legacy context rewriting, and
>>>>>> introducing kfunc-based eBPF will result in non-UAPI dependency while
>>>>>> the other relevant virtualization APIs such as KVM and vhost_net are
>>>>>> UAPIs.
>>>>>>
>>>>>> Added ioctls
>>>>>> ============
>>>>>>
>>>>>> They are designed to make extensibility and VM migration compatible.
>>>>>> This change only adds the implementation and does not expose them to
>>>>>> the userspace.
>>>>>>
>>>>>> TUNGETVNETHASHTYPES
>>>>>> -------------------
>>>>>>
>>>>>> This ioctl tells supported hash types. It is useful to check if a VM can
>>>>>> be migrated to the current host.
>>>>>>
>>>>>> TUNSETVNETREPORTINGAUTOMQ, TUNSETVNETREPORTINGRSS, and TUNSETVNETRSS
>>>>>> --------------------------------------------------------------------
>>>>>>
>>>>>> These ioctls configures a steering algorithm and, if needed, hash
>>>>>> reporting.
>>>>>>
>>>>>> Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
>>>>>> Tested-by: Lei Yang <leiyang@redhat.com>
>>>>>> ---
>>>>>>     drivers/net/tap.c           |  10 ++-
>>>>>>     drivers/net/tun.c           |  12 +++-
>>>>>>     drivers/net/tun_vnet.h      | 165 +++++++++++++++++++++++++++++++++++++++++---
>>>>>>     include/uapi/linux/if_tun.h |  71 +++++++++++++++++++
>>>>>>     4 files changed, 244 insertions(+), 14 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/net/tap.c b/drivers/net/tap.c
>>>>>> index d4ece538f1b2..25c60ff2d3f2 100644
>>>>>> --- a/drivers/net/tap.c
>>>>>> +++ b/drivers/net/tap.c
>>>>>> @@ -179,6 +179,11 @@ static void tap_put_queue(struct tap_queue *q)
>>>>>>            sock_put(&q->sk);
>>>>>>     }
>>>>>>
>>>>>> +static const struct virtio_net_hash *tap_find_hash(const struct sk_buff *skb)
>>>>>> +{
>>>>>> +       return NULL;
>>>>>> +}
>>>>>> +
>>>>>>     /*
>>>>>>      * Select a queue based on the rxq of the device on which this packet
>>>>>>      * arrived. If the incoming device is not mq, calculate a flow hash
>>>>>> @@ -711,11 +716,12 @@ static ssize_t tap_put_user(struct tap_queue *q,
>>>>>>            int total;
>>>>>>
>>>>>>            if (q->flags & IFF_VNET_HDR) {
>>>>>> -               struct virtio_net_hdr vnet_hdr;
>>>>>> +               struct virtio_net_hdr_v1_hash vnet_hdr;
>>>>>>
>>>>>>                    vnet_hdr_len = READ_ONCE(q->vnet_hdr_sz);
>>>>>>
>>>>>> -               ret = tun_vnet_hdr_from_skb(q->flags, NULL, skb, &vnet_hdr);
>>>>>> +               ret = tun_vnet_hdr_from_skb(vnet_hdr_len, q->flags, NULL, skb,
>>>>>> +                                           tap_find_hash, &vnet_hdr);
>>>>>>                    if (ret)
>>>>>>                            return ret;
>>>>>>
>>>>>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
>>>>>> index 9133ab9ed3f5..03d47799e9bd 100644
>>>>>> --- a/drivers/net/tun.c
>>>>>> +++ b/drivers/net/tun.c
>>>>>> @@ -451,6 +451,11 @@ static inline void tun_flow_save_rps_rxhash(struct tun_flow_entry *e, u32 hash)
>>>>>>                    e->rps_rxhash = hash;
>>>>>>     }
>>>>>>
>>>>>> +static const struct virtio_net_hash *tun_find_hash(const struct sk_buff *skb)
>>>>>> +{
>>>>>> +       return NULL;
>>>>>> +}
>>>>>> +
>>>>>>     /* We try to identify a flow through its rxhash. The reason that
>>>>>>      * we do not check rxq no. is because some cards(e.g 82599), chooses
>>>>>>      * the rxq based on the txq where the last packet of the flow comes. As
>>>>>> @@ -1993,7 +1998,7 @@ static ssize_t tun_put_user_xdp(struct tun_struct *tun,
>>>>>>            ssize_t ret;
>>>>>>
>>>>>>            if (tun->flags & IFF_VNET_HDR) {
>>>>>> -               struct virtio_net_hdr gso = { 0 };
>>>>>> +               struct virtio_net_hdr_v1_hash gso = { 0 };
>>>>>>
>>>>>>                    vnet_hdr_sz = READ_ONCE(tun->vnet_hdr_sz);
>>>>>>                    ret = tun_vnet_hdr_put(vnet_hdr_sz, iter, &gso);
>>>>>> @@ -2046,9 +2051,10 @@ static ssize_t tun_put_user(struct tun_struct *tun,
>>>>>>            }
>>>>>>
>>>>>>            if (vnet_hdr_sz) {
>>>>>> -               struct virtio_net_hdr gso;
>>>>>> +               struct virtio_net_hdr_v1_hash gso;
>>>>>>
>>>>>> -               ret = tun_vnet_hdr_from_skb(tun->flags, tun->dev, skb, &gso);
>>>>>> +               ret = tun_vnet_hdr_from_skb(vnet_hdr_sz, tun->flags, tun->dev,
>>>>>> +                                           skb, tun_find_hash, &gso);
>>>>>>                    if (ret)
>>>>>>                            return ret;
>>>>>>
>>>>>> diff --git a/drivers/net/tun_vnet.h b/drivers/net/tun_vnet.h
>>>>>> index 58b9ac7a5fc4..45d0533efc8d 100644
>>>>>> --- a/drivers/net/tun_vnet.h
>>>>>> +++ b/drivers/net/tun_vnet.h
>>>>>> @@ -6,6 +6,17 @@
>>>>>>     #define TUN_VNET_LE     0x80000000
>>>>>>     #define TUN_VNET_BE     0x40000000
>>>>>>
>>>>>> +typedef struct virtio_net_hash *(*tun_vnet_hash_add)(struct sk_buff *);
>>>>>> +typedef const struct virtio_net_hash *(*tun_vnet_hash_find)(const struct sk_buff *);
>>>>>> +
>>>>>> +struct tun_vnet_hash {
>>>>>> +       bool report;
>>>>>> +       bool rss;
>>>>>> +       struct tun_vnet_rss common;
>>>>>> +       u32 rss_key[VIRTIO_NET_RSS_MAX_KEY_SIZE];
>>>>>> +       u16 rss_indirection_table[];
>>>>>> +};
>>>>>> +
>>>>>>     static inline bool tun_vnet_legacy_is_little_endian(unsigned int flags)
>>>>>>     {
>>>>>>            bool be = IS_ENABLED(CONFIG_TUN_VNET_CROSS_LE) &&
>>>>>> @@ -107,6 +118,128 @@ static inline long tun_vnet_ioctl(int *vnet_hdr_sz, unsigned int *flags,
>>>>>>            }
>>>>>>     }
>>>>>>
>>>>>> +static inline long tun_vnet_ioctl_gethashtypes(u32 __user *argp)
>>>>>> +{
>>>>>> +       return put_user(VIRTIO_NET_SUPPORTED_HASH_TYPES, argp) ? -EFAULT : 0;
>>>>>> +}
>>>>>> +
>>>>>> +static inline long tun_vnet_ioctl_sethash(struct tun_vnet_hash __rcu **hashp,
>>>>>> +                                         unsigned int cmd,
>>>>>> +                                         void __user *argp)
>>>>>> +{
>>>>>> +       struct tun_vnet_rss common;
>>>>>> +       struct tun_vnet_hash *hash;
>>>>>> +       size_t indirection_table_size;
>>>>>> +       size_t key_size;
>>>>>> +       size_t size;
>>>>>> +
>>>>>> +       switch (cmd) {
>>>>>> +       case TUNSETVNETREPORTINGAUTOMQ:
>>>>>> +               if (get_user(common.hash_types, (u32 __user *)argp))
>>>>>> +                       return -EFAULT;
>>>>>> +
>>>>>> +               if (common.hash_types) {
>>>>>> +                       hash = kzalloc(sizeof(*hash), GFP_KERNEL);
>>>>>> +                       if (!hash)
>>>>>> +                               return -ENOMEM;
>>>>>> +
>>>>>> +                       hash->report = true;
>>>>>> +                       hash->common.hash_types = common.hash_types;
>>>>>> +               } else {
>>>>>> +                       hash = NULL;
>>>>>> +               }
>>>>>> +               break;
>>>>>> +
>>>>>> +       case TUNSETVNETREPORTINGRSS:
>>>>>> +       case TUNSETVNETRSS:
>>>>>
>>>>> So the above three shows unnecessary design redundancy as well as a
>>>>> burden for the future extension.  Why not simply have
>>>>>
>>>>> 1) TUNSETVNET_RSS
>>>>> 2) TUNSETVNET_HASH_REPORT
>>>>> ?
>>>>>
>>>>> Which maps to
>>>>>
>>>>>     #define VIRTIO_NET_CTRL_MQ_RSS_CONFIG          1 (for configurable
>>>>> receive steering)
>>>>>     #define VIRTIO_NET_CTRL_MQ_HASH_CONFIG         2 (for configurable
>>>>> hash calculation)
>>>>>
>>>>> It would be always easier to start with what spec had or at least we
>>>>> need to explain why we choose a different design here or in the
>>>>> changelog to ease our life.
>>>>
>>>> TUNSETVNETREPORTINGAUTOMQ maps to VIRTIO_NET_CTRL_MQ_HASH_CONFIG.
>>>
>>> It's not:
>>>
>>> VIRTIO_NET_CTRL_MQ_HASH_CONFIG uses:
>>>
>>> struct virtio_net_hash_config {
>>>       le32 hash_types;
>>>       le16 reserved[4];
>>>       u8 hash_key_length;
>>>       u8 hash_key_data[hash_key_length];
>>> };
>>>
>>> but TUNSETVNETREPORTINGAUTOMQ only accepts hash_types without others:
>>
>> The others are not present because the spec doesn't specify what to do
>> with them and the kernel doesn't use them either.
> 
> Did you mean the hash_key_length and hash_key_data? Note that we have
> drivers other than the Linux ones as well.

And reserved. Drivers can set whatever to these fields. It is not 
specified how these fields should be used.

> 
>>
>>>
>>>>
>>>> TUNSETVNETREPORTINGRSS and TUNSETVNETRSS map to
>>>> VIRTIO_NET_CTRL_MQ_RSS_CONFIG.
>>>
>>> I think we've already had a discussion about this.
>>>
>>> Reusing virtio-net uAPI is much better instead of having a tun
>>> specific one considering tun may need to support more virtio commands
>>> in the future. Or maybe it's the time to introduce a transport for the
>>> virtio control virtqueue uAPI in tuntap to avoid inventing new uAPI
>>> endlessly.
>>>
>>> What's more I see:
>>>
>>> struct tun_vnet_rss {
>>>           __u32 hash_types;
>>>           __u16 indirection_table_mask;
>>>           __u16 unclassified_queue;
>>> };
>>>
>>> struct tun_vnet_hash {
>>>           bool report;
>>>           bool rss;
>>>           struct tun_vnet_rss common;
>>>           u32 rss_key[VIRTIO_NET_RSS_MAX_KEY_SIZE];
>>>           u16 rss_indirection_table[];
>>> };
>>>
>>> As I pointed out in the past, let's just decouple the rss from hash,
>>> everything would be much simpler, or you need to explain why you
>>> couple this somewhere.
>>>
>>> For example:
>>>
>>> 1) why is the tun_vnet_hash not part of the uAPI but tun_vnet_rss, or
>>> how could userspace know what kind of format it would use for
>>> TUNSETVNETREPORTINGRSS?
>>
>> That was the previous version.
>>
>>> 2) what's the advantages of embedding rss specific stuff into hash
>>> report structure
>>
>> Because the hash types field in struct tun_vnet_rss is used by hash
>> reporting too.
>>
>>> 3) what's the advantages of not using virtio-net uAPI
>>
>> 1. The use cases that don't involve VM will be simplified; programs for
>> such a use case will not need to convert endian or to fill fileds the
>> kernel doesn't care.
> 
> Well, virtio_net_hdr is used by packet socket as well. Considering the
> complexity of designing a new uAPI, it's still better.

This patch series also reuses the datapath, following the prior examples.

> 
> Or maybe you can clarify which field that kernel doesn't care about?
> In this case TUN/TAP is basically the device datapath, if some of the
> fields don't make sense, it's a bug of the spec.

reserved, hash_key_length, and hash_key_data.

> 
>> 2. It aligns with existing UAPIs that operate in native endian and don't
>> use virtio-net data structures like TUNSETOFFLOAD and TUNSETVNETHDRSZ.
> 
> For those two examples, it would be used by guests directly. This is
> different from RSS stuff.

They are mediated by the VMM, which is no different from RSS.

> 
> With native endian, you need an endian conversation that converts le to native.

That's true, but QEMU does so anyway to validate the configuration, to 
attach/detach queues, and to share the data structures with userspace 
RSS implementations. I expect other VMMs will do so too.

> 
>>
>>>
>>> More issues:
>>>
>>> 1) max_tx_vq is ignored, so do you expect the userspace to intercept
>>> this and switch to using TUNSETQUEUE? This seems like a burden as TUN
>>> can simply accept it and do the attaching/detaching by itself
>>> 2) the rx depends on the indirection table, so userspace need to
>>> intercept the indirection and change the rx queue numbers accordingly
>>> 3) RSS allows a async TX/RX queue, this is not supported by TUN now,
>>> no matter if we decide to let userspace to intercept max_tx_vq or not,
>>> we need to implement it first
>>   > > Things would be much more simpler, if kernel can do 1) and 2).
>>
>> Attaching and detaching queues is not possible for the kernel because it
>> doesn't know what file descriptors that map to queues will be used by
>> the userspace.
> 
> The kernel knows, tfile has a queue_index part.

queue_index is set with TUNSETQUEUE so we need the ioctl.

> 
>>
>> The following patch does 2) for QEMU:
>> https://lore.kernel.org/qemu-devel/20250322-vq-v2-1-cee0aafe6404@daynix.com/
> 
> See below point, form the view of the kernel, it's still a queue pair
> not async TX/RX queue.
> 
>>
>> For 3), the patch for QEMU takes the maximum of TX and RX queue numbers
>> to derive the number of queue pairs.
>>
>>>
>>>> We have two ioctls here because
>>>> VIRTIO_NET_CTRL_MQ_RSS_CONFIG behaves differently depending on whether
>>>> VIRTIO_NET_F_HASH_REPORT is negotiated or not;
>>>
>>> It wouldn't be a problem if you do 1:1 mapping between virtio commands
>>> and TUN uAPI, otherwise it should have a bug somewhere.
>>
>> Speaking of 1:1 mapping, it is possible to map VIRTIO_NET_F_HASH_REPORT
>> into another ioctl. It may help add another receive steering algorithm
>> in the future by not requiring two ioctls (TUNSETVNETREPORTING_X and
>> TUNSETVNET_X).
> 
> Yes and as I pointed out, virtio_net_hash_config should not be
> specific to automq, it can work for other steering algorithm as well.

That's not what the virtio spec says, so it will not be 1:1 mapping 
between virtio commands and TUN uAPI.

> 
>>
>>>
>>>> it also enables hash
>>>> reporting if the feature is negotiated.
>>>
>>> Again, starting from virtio-net command is easier, a strong
>>> justification is needed to explain why we choose another for tun/tap.
>>>
>>>>
>>>>>
>>>>> One day we would have tun_select_queue_algorithm_x() we don't have to
>>>>> duplicate the ioctls once again here like TUNSETVNETREPORTINGXYZ
>>>>
>>>> 5.1.6.5.6.4 Hash calculation says:
>>>>>    If VIRTIO_NET_F_HASH_REPORT was negotiated and the device uses
>>>>> automatic receive steering, the device MUST support a command to
>>>>> configure hash calculation parameters.
>>>>>
>>>>> The driver provides parameters for hash calculation as follows:
>>>>>
>>>>> class VIRTIO_NET_CTRL_MQ, command VIRTIO_NET_CTRL_MQ_HASH_CONFIG.
>>>>
>>>> VIRTIO_NET_CTRL_MQ_HASH_CONFIG is for automatic receive steering and not
>>>> for RSS (or other steering algorithms if the spec gets any in the future).
>>>
>>> I'm not sure but the spec needs some tweaking. For example, I don't
>>> expect there would be a dedicated hash config command for flow filters
>>> in the future. I think the reason why spec says like this is that
>>> virtio-net only supports automatic receive steering.
>>>
>>> Note that macvtap doesn't implement automatic receive steering.
>>
>> QEMU advertises VIRTIO_NET_F_CTRL_VQ for macvtap too, so it should have
>> implemented it. I think QEMU with macvtap still compliant to the spec.
> 
> Compliant, but automatic traffic steering is the best effort as well.
> 
> Nope, TUN/TAP implements a flow cache that can steer tx based on rx.
> Macvtap simply uses hash here.
> 
>>
>> "5.1.6.5.6 Automatic receive steering in multiqueue mode" says:
>>
>>   > After the driver transmitted a packet of a flow on transmitqX, the
>>   > device SHOULD cause incoming packets for that flow to be steered to
>>   > receiveqX.
>>
>> It is "SHOULD", so it is still compliant if the device doesn't properly
>> respect the flow.
> 
> Yes, a quality of implementation, or it's impractical to support a
> correct steering for this device as limited resources and mailious
> users can do syn flood etc.
> 
>>
>>>
>>>>
>>>>>
>>>>>> +               if (copy_from_user(&common, argp, sizeof(common)))
>>>>>> +                       return -EFAULT;
>>>>>> +               argp = (struct tun_vnet_rss __user *)argp + 1;
>>>>>> +
>>>>>> +               indirection_table_size = ((size_t)common.indirection_table_mask + 1) * 2;
>>>>>> +               key_size = virtio_net_hash_key_length(common.hash_types);
>>>>>> +               size = struct_size(hash, rss_indirection_table,
>>>>>> +                                  (size_t)common.indirection_table_mask + 1);
>>>>>> +
>>>>>> +               hash = kmalloc(size, GFP_KERNEL);
>>>>>> +               if (!hash)
>>>>>> +                       return -ENOMEM;
>>>>>> +
>>>>>> +               if (copy_from_user(hash->rss_indirection_table,
>>>>>> +                                  argp, indirection_table_size)) {
>>>>>> +                       kfree(hash);
>>>>>> +                       return -EFAULT;
>>>>>> +               }
>>>>>> +               argp = (u16 __user *)argp + common.indirection_table_mask + 1;
>>>>>> +
>>>>>> +               if (copy_from_user(hash->rss_key, argp, key_size)) {
>>>>>> +                       kfree(hash);
>>>>>> +                       return -EFAULT;
>>>>>> +               }
>>>>>> +
>>>>>> +               virtio_net_toeplitz_convert_key(hash->rss_key, key_size);
>>>>>> +               hash->report = cmd == TUNSETVNETREPORTINGRSS;
>>>>>
>>>>> At least, if this is the only difference why not simply code this into
>>>>> the ioctl itself other than having a very similar command?
>>>>
>>>> It is what the previous version did. Either is fine I guess; the only
>>>> practical difference would be the size of the configuration struct is
>>>> smaller with this approach.
>>>>
>>>>>
>>>>>> +               hash->rss = true;
>>>>>> +               hash->common = common;
>>>>>> +               break;
>>>>>> +
>>>>>> +       default:
>>>>>> +               return -EINVAL;
>>>>>> +       }
>>>>>> +
>>>>>> +       kfree_rcu_mightsleep(rcu_replace_pointer_rtnl(*hashp, hash));
>>>>>> +       return 0;
>>>>>> +}
>>>>>> +
>>>>>> +static inline void tun_vnet_hash_report(const struct tun_vnet_hash *hash,
>>>>>> +                                       struct sk_buff *skb,
>>>>>> +                                       const struct flow_keys_basic *keys,
>>>>>> +                                       u32 value,
>>>>>> +                                       tun_vnet_hash_add vnet_hash_add)
>>>>>> +{
>>>>>> +       struct virtio_net_hash *report;
>>>>>> +
>>>>>> +       if (!hash || !hash->report)
>>>>>> +               return;
>>>>>> +
>>>>>> +       report = vnet_hash_add(skb);
>>>>>> +       if (!report)
>>>>>> +               return;
>>>>>> +
>>>>>> +       *report = (struct virtio_net_hash) {
>>>>>> +               .report = virtio_net_hash_report(hash->common.hash_types, keys),
>>>>>> +               .value = value
>>>>>> +       };
>>>>>> +}
>>>>>> +
>>>>>> +static inline u16 tun_vnet_rss_select_queue(u32 numqueues,
>>>>>> +                                           const struct tun_vnet_hash *hash,
>>>>>> +                                           struct sk_buff *skb,
>>>>>> +                                           tun_vnet_hash_add vnet_hash_add)
>>>>>> +{
>>>>>> +       struct virtio_net_hash *report;
>>>>>> +       struct virtio_net_hash ret;
>>>>>> +       u16 index;
>>>>>> +
>>>>>> +       if (!numqueues)
>>>>>> +               return 0;
>>>>>> +
>>>>>> +       virtio_net_hash_rss(skb, hash->common.hash_types, hash->rss_key, &ret);
>>>>>> +
>>>>>> +       if (!ret.report)
>>>>>> +               return hash->common.unclassified_queue % numqueues;
>>>>>> +
>>>>>> +       if (hash->report) {
>>>>>> +               report = vnet_hash_add(skb);
>>>>>> +               if (report)
>>>>>> +                       *report = ret;
>>>>>> +       }
>>>>>> +
>>>>>> +       index = ret.value & hash->common.indirection_table_mask;
>>>>>> +
>>>>>> +       return hash->rss_indirection_table[index] % numqueues;
>>>>>> +}
>>>>>> +
>>>>>>     static inline int tun_vnet_hdr_get(int sz, unsigned int flags,
>>>>>>                                       struct iov_iter *from,
>>>>>>                                       struct virtio_net_hdr *hdr)
>>>>>> @@ -135,15 +268,17 @@ static inline int tun_vnet_hdr_get(int sz, unsigned int flags,
>>>>>>     }
>>>>>>
>>>>>>     static inline int tun_vnet_hdr_put(int sz, struct iov_iter *iter,
>>>>>> -                                  const struct virtio_net_hdr *hdr)
>>>>>> +                                  const struct virtio_net_hdr_v1_hash *hdr)
>>>>>>     {
>>>>>> +       int content_sz = MIN(sizeof(*hdr), sz);
>>>>>> +
>>>>>>            if (unlikely(iov_iter_count(iter) < sz))
>>>>>>                    return -EINVAL;
>>>>>>
>>>>>> -       if (unlikely(copy_to_iter(hdr, sizeof(*hdr), iter) != sizeof(*hdr)))
>>>>>> +       if (unlikely(copy_to_iter(hdr, content_sz, iter) != content_sz))
>>>>>>                    return -EFAULT;
>>>>>>
>>>>>> -       if (iov_iter_zero(sz - sizeof(*hdr), iter) != sz - sizeof(*hdr))
>>>>>> +       if (iov_iter_zero(sz - content_sz, iter) != sz - content_sz)
>>>>>>                    return -EFAULT;
>>>>>>
>>>>>>            return 0;
>>>>>> @@ -155,26 +290,38 @@ static inline int tun_vnet_hdr_to_skb(unsigned int flags, struct sk_buff *skb,
>>>>>>            return virtio_net_hdr_to_skb(skb, hdr, tun_vnet_is_little_endian(flags));
>>>>>>     }
>>>>>>
>>>>>> -static inline int tun_vnet_hdr_from_skb(unsigned int flags,
>>>>>> +static inline int tun_vnet_hdr_from_skb(int sz, unsigned int flags,
>>>>>>                                            const struct net_device *dev,
>>>>>>                                            const struct sk_buff *skb,
>>>>>> -                                       struct virtio_net_hdr *hdr)
>>>>>> +                                       tun_vnet_hash_find vnet_hash_find,
>>>>>> +                                       struct virtio_net_hdr_v1_hash *hdr)
>>>>>>     {
>>>>>>            int vlan_hlen = skb_vlan_tag_present(skb) ? VLAN_HLEN : 0;
>>>>>> +       const struct virtio_net_hash *report = sz < sizeof(struct virtio_net_hdr_v1_hash) ?
>>>>>> +                                              NULL : vnet_hash_find(skb);
>>>>>> +
>>>>>> +       *hdr = (struct virtio_net_hdr_v1_hash) {
>>>>>> +               .hash_report = VIRTIO_NET_HASH_REPORT_NONE
>>>>>> +       };
>>>>>> +
>>>>>> +       if (report) {
>>>>>> +               hdr->hash_value = cpu_to_le32(report->value);
>>>>>> +               hdr->hash_report = cpu_to_le16(report->report);
>>>>>> +       }
>>>>>>
>>>>>> -       if (virtio_net_hdr_from_skb(skb, hdr,
>>>>>> +       if (virtio_net_hdr_from_skb(skb, (struct virtio_net_hdr *)hdr,
>>>>>>                                        tun_vnet_is_little_endian(flags), true,
>>>>>>                                        vlan_hlen)) {
>>>>>>                    struct skb_shared_info *sinfo = skb_shinfo(skb);
>>>>>>
>>>>>>                    if (net_ratelimit()) {
>>>>>>                            netdev_err(dev, "unexpected GSO type: 0x%x, gso_size %d, hdr_len %d\n",
>>>>>> -                                  sinfo->gso_type, tun_vnet16_to_cpu(flags, hdr->gso_size),
>>>>>> -                                  tun_vnet16_to_cpu(flags, hdr->hdr_len));
>>>>>> +                                  sinfo->gso_type, tun_vnet16_to_cpu(flags, hdr->hdr.gso_size),
>>>>>> +                                  tun_vnet16_to_cpu(flags, hdr->hdr.hdr_len));
>>>>>>                            print_hex_dump(KERN_ERR, "tun: ",
>>>>>>                                           DUMP_PREFIX_NONE,
>>>>>>                                           16, 1, skb->head,
>>>>>> -                                      min(tun_vnet16_to_cpu(flags, hdr->hdr_len), 64), true);
>>>>>> +                                      min(tun_vnet16_to_cpu(flags, hdr->hdr.hdr_len), 64), true);
>>>>>>                    }
>>>>>>                    WARN_ON_ONCE(1);
>>>>>>                    return -EINVAL;
>>>>>> diff --git a/include/uapi/linux/if_tun.h b/include/uapi/linux/if_tun.h
>>>>>> index 980de74724fc..fe4b984d3bbb 100644
>>>>>> --- a/include/uapi/linux/if_tun.h
>>>>>> +++ b/include/uapi/linux/if_tun.h
>>>>>> @@ -62,6 +62,62 @@
>>>>>>     #define TUNSETCARRIER _IOW('T', 226, int)
>>>>>>     #define TUNGETDEVNETNS _IO('T', 227)
>>>>>>
>>>>>> +/**
>>>>>> + * define TUNGETVNETHASHTYPES - ioctl to get supported virtio_net hashing types
>>>>>> + *
>>>>>> + * The argument is a pointer to __u32 which will store the supported virtio_net
>>>>>> + * hashing types.
>>>>>> + */
>>>>>> +#define TUNGETVNETHASHTYPES _IOR('T', 228, __u32)
>>>>>> +
>>>>>> +/**
>>>>>> + * define TUNSETVNETREPORTINGAUTOMQ - ioctl to enable automq with hash reporting
>>>>>> + *
>>>>>> + * Disable RSS and enable automatic receive steering with hash reporting.
>>>>>> + *
>>>>>> + * The argument is a pointer to __u32 that contains a bitmask of hash types
>>>>>> + * allowed to be reported.
>>>>>> + *
>>>>>> + * This ioctl results in %EBADFD if the underlying device is deleted. It affects
>>>>>> + * all queues attached to the same device.
>>>>>> + *
>>>>>> + * This ioctl currently has no effect on XDP packets and packets with
>>>>>> + * queue_mapping set by TC.
>>>>>> + */
>>>>>> +#define TUNSETVNETREPORTINGAUTOMQ _IOR('T', 229, __u32)
>>>>>> +
>>>>>> +/**
>>>>>> + * define TUNSETVNETREPORTINGRSS - ioctl to enable RSS with hash reporting
>>>>>> + *
>>>>>> + * Disable automatic receive steering and enable RSS with hash reporting.
>>>>>
>>>>> This is unnecessary, e.g one day will have select_queue_xyz(), we
>>>>> don't want to say "Disable automatic receive steering and xyz ..."
>>>>
>>>> It is still something better to be documented as its behavior is
>>>> somewhat complicated.
>>>
>>> This is a hint of uAPI design issue.
>>>
>>>>
>>>> Concretely, this ioctl disables automatic receive steering but doesn't
>>>> disable steering by eBPF, which is implied by TUN_STEERINGEBPF_FALLBACK.
>>>
>>> It would be simpler:
>>>
>>> 1) not having TUN_STEERINGEBPF_FALLBACK
>>> 2) the steering algorithm depends on the last uAPI call
>>
>> What will TUNSETSTEERINGEBPF with NULL mean in that case? It currently
>> switches the steering algorithm to automq.
> 
> A stackwise semantic then?

Can you clarify the semantics with an example of a set of ioctls? 
Perhaps it is an easy way to demonstrate an alternative design idea.

Regards,
Akihiko Odaki

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next v12 01/10] virtio_net: Add functions for hashing
  2025-06-06  9:10                 ` Akihiko Odaki
@ 2025-06-17  3:28                   ` Jason Wang
  2025-06-21  5:50                     ` Akihiko Odaki
  2025-06-22 17:40                     ` Yuri Benditovich
  0 siblings, 2 replies; 39+ messages in thread
From: Jason Wang @ 2025-06-17  3:28 UTC (permalink / raw)
  To: Akihiko Odaki, Yuri Benditovich
  Cc: Michael S. Tsirkin, Jonathan Corbet, Willem de Bruijn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Xuan Zhuo, Shuah Khan, linux-doc, linux-kernel, netdev, kvm,
	virtualization, linux-kselftest, Andrew Melnychenko,
	Stephen Hemminger, gur.stavi, Lei Yang, Simon Horman

On Fri, Jun 6, 2025 at 5:10 PM Akihiko Odaki <akihiko.odaki@daynix.com> wrote:
>
> On 2025/06/06 9:48, Jason Wang wrote:
> > On Thu, Jun 5, 2025 at 3:58 PM Akihiko Odaki <akihiko.odaki@daynix.com> wrote:
> >>
> >> On 2025/06/05 10:53, Jason Wang wrote:
> >>> On Wed, Jun 4, 2025 at 3:20 PM Akihiko Odaki <akihiko.odaki@daynix.com> wrote:
> >>>>
> >>>> On 2025/06/04 10:18, Jason Wang wrote:
> >>>>> On Tue, Jun 3, 2025 at 1:31 PM Akihiko Odaki <akihiko.odaki@daynix.com> wrote:
> >>>>>>
> >>>>>> On 2025/06/03 12:19, Jason Wang wrote:
> >>>>>>> On Fri, May 30, 2025 at 12:50 PM Akihiko Odaki <akihiko.odaki@daynix.com> wrote:
> >>>>>>>>
> >>>>>>>> They are useful to implement VIRTIO_NET_F_RSS and
> >>>>>>>> VIRTIO_NET_F_HASH_REPORT.
> >>>>>>>>
> >>>>>>>> Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
> >>>>>>>> Tested-by: Lei Yang <leiyang@redhat.com>
> >>>>>>>> ---
> >>>>>>>>      include/linux/virtio_net.h | 188 +++++++++++++++++++++++++++++++++++++++++++++
> >>>>>>>>      1 file changed, 188 insertions(+)
> >>>>>>>>
> >>>>>>>> diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h
> >>>>>>>> index 02a9f4dc594d..426f33b4b824 100644
> >>>>>>>> --- a/include/linux/virtio_net.h
> >>>>>>>> +++ b/include/linux/virtio_net.h
> >>>>>>>> @@ -9,6 +9,194 @@
> >>>>>>>>      #include <uapi/linux/tcp.h>
> >>>>>>>>      #include <uapi/linux/virtio_net.h>
> >>>>>>>>
> >>>>>>>> +struct virtio_net_hash {
> >>>>>>>> +       u32 value;
> >>>>>>>> +       u16 report;
> >>>>>>>> +};
> >>>>>>>> +
> >>>>>>>> +struct virtio_net_toeplitz_state {
> >>>>>>>> +       u32 hash;
> >>>>>>>> +       const u32 *key;
> >>>>>>>> +};
> >>>>>>>> +
> >>>>>>>> +#define VIRTIO_NET_SUPPORTED_HASH_TYPES (VIRTIO_NET_RSS_HASH_TYPE_IPv4 | \
> >>>>>>>> +                                        VIRTIO_NET_RSS_HASH_TYPE_TCPv4 | \
> >>>>>>>> +                                        VIRTIO_NET_RSS_HASH_TYPE_UDPv4 | \
> >>>>>>>> +                                        VIRTIO_NET_RSS_HASH_TYPE_IPv6 | \
> >>>>>>>> +                                        VIRTIO_NET_RSS_HASH_TYPE_TCPv6 | \
> >>>>>>>> +                                        VIRTIO_NET_RSS_HASH_TYPE_UDPv6)
> >>>>>>>> +
> >>>>>>>> +#define VIRTIO_NET_RSS_MAX_KEY_SIZE 40
> >>>>>>>> +
> >>>>>>>> +static inline void virtio_net_toeplitz_convert_key(u32 *input, size_t len)
> >>>>>>>> +{
> >>>>>>>> +       while (len >= sizeof(*input)) {
> >>>>>>>> +               *input = be32_to_cpu((__force __be32)*input);
> >>>>>>>> +               input++;
> >>>>>>>> +               len -= sizeof(*input);
> >>>>>>>> +       }
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> +static inline void virtio_net_toeplitz_calc(struct virtio_net_toeplitz_state *state,
> >>>>>>>> +                                           const __be32 *input, size_t len)
> >>>>>>>> +{
> >>>>>>>> +       while (len >= sizeof(*input)) {
> >>>>>>>> +               for (u32 map = be32_to_cpu(*input); map; map &= (map - 1)) {
> >>>>>>>> +                       u32 i = ffs(map);
> >>>>>>>> +
> >>>>>>>> +                       state->hash ^= state->key[0] << (32 - i) |
> >>>>>>>> +                                      (u32)((u64)state->key[1] >> i);
> >>>>>>>> +               }
> >>>>>>>> +
> >>>>>>>> +               state->key++;
> >>>>>>>> +               input++;
> >>>>>>>> +               len -= sizeof(*input);
> >>>>>>>> +       }
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> +static inline u8 virtio_net_hash_key_length(u32 types)
> >>>>>>>> +{
> >>>>>>>> +       size_t len = 0;
> >>>>>>>> +
> >>>>>>>> +       if (types & VIRTIO_NET_HASH_REPORT_IPv4)
> >>>>>>>> +               len = max(len,
> >>>>>>>> +                         sizeof(struct flow_dissector_key_ipv4_addrs));
> >>>>>>>> +
> >>>>>>>> +       if (types &
> >>>>>>>> +           (VIRTIO_NET_HASH_REPORT_TCPv4 | VIRTIO_NET_HASH_REPORT_UDPv4))
> >>>>>>>> +               len = max(len,
> >>>>>>>> +                         sizeof(struct flow_dissector_key_ipv4_addrs) +
> >>>>>>>> +                         sizeof(struct flow_dissector_key_ports));
> >>>>>>>> +
> >>>>>>>> +       if (types & VIRTIO_NET_HASH_REPORT_IPv6)
> >>>>>>>> +               len = max(len,
> >>>>>>>> +                         sizeof(struct flow_dissector_key_ipv6_addrs));
> >>>>>>>> +
> >>>>>>>> +       if (types &
> >>>>>>>> +           (VIRTIO_NET_HASH_REPORT_TCPv6 | VIRTIO_NET_HASH_REPORT_UDPv6))
> >>>>>>>> +               len = max(len,
> >>>>>>>> +                         sizeof(struct flow_dissector_key_ipv6_addrs) +
> >>>>>>>> +                         sizeof(struct flow_dissector_key_ports));
> >>>>>>>> +
> >>>>>>>> +       return len + sizeof(u32);
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> +static inline u32 virtio_net_hash_report(u32 types,
> >>>>>>>> +                                        const struct flow_keys_basic *keys)
> >>>>>>>> +{
> >>>>>>>> +       switch (keys->basic.n_proto) {
> >>>>>>>> +       case cpu_to_be16(ETH_P_IP):
> >>>>>>>> +               if (!(keys->control.flags & FLOW_DIS_IS_FRAGMENT)) {
> >>>>>>>> +                       if (keys->basic.ip_proto == IPPROTO_TCP &&
> >>>>>>>> +                           (types & VIRTIO_NET_RSS_HASH_TYPE_TCPv4))
> >>>>>>>> +                               return VIRTIO_NET_HASH_REPORT_TCPv4;
> >>>>>>>> +
> >>>>>>>> +                       if (keys->basic.ip_proto == IPPROTO_UDP &&
> >>>>>>>> +                           (types & VIRTIO_NET_RSS_HASH_TYPE_UDPv4))
> >>>>>>>> +                               return VIRTIO_NET_HASH_REPORT_UDPv4;
> >>>>>>>> +               }
> >>>>>>>> +
> >>>>>>>> +               if (types & VIRTIO_NET_RSS_HASH_TYPE_IPv4)
> >>>>>>>> +                       return VIRTIO_NET_HASH_REPORT_IPv4;
> >>>>>>>> +
> >>>>>>>> +               return VIRTIO_NET_HASH_REPORT_NONE;
> >>>>>>>> +
> >>>>>>>> +       case cpu_to_be16(ETH_P_IPV6):
> >>>>>>>> +               if (!(keys->control.flags & FLOW_DIS_IS_FRAGMENT)) {
> >>>>>>>> +                       if (keys->basic.ip_proto == IPPROTO_TCP &&
> >>>>>>>> +                           (types & VIRTIO_NET_RSS_HASH_TYPE_TCPv6))
> >>>>>>>> +                               return VIRTIO_NET_HASH_REPORT_TCPv6;
> >>>>>>>> +
> >>>>>>>> +                       if (keys->basic.ip_proto == IPPROTO_UDP &&
> >>>>>>>> +                           (types & VIRTIO_NET_RSS_HASH_TYPE_UDPv6))
> >>>>>>>> +                               return VIRTIO_NET_HASH_REPORT_UDPv6;
> >>>>>>>> +               }
> >>>>>>>> +
> >>>>>>>> +               if (types & VIRTIO_NET_RSS_HASH_TYPE_IPv6)
> >>>>>>>> +                       return VIRTIO_NET_HASH_REPORT_IPv6;
> >>>>>>>> +
> >>>>>>>> +               return VIRTIO_NET_HASH_REPORT_NONE;
> >>>>>>>> +
> >>>>>>>> +       default:
> >>>>>>>> +               return VIRTIO_NET_HASH_REPORT_NONE;
> >>>>>>>> +       }
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> +static inline void virtio_net_hash_rss(const struct sk_buff *skb,
> >>>>>>>> +                                      u32 types, const u32 *key,
> >>>>>>>> +                                      struct virtio_net_hash *hash)
> >>>>>>>> +{
> >>>>>>>> +       struct virtio_net_toeplitz_state toeplitz_state = { .key = key };
> >>>>>>>> +       struct flow_keys flow;
> >>>>>>>> +       struct flow_keys_basic flow_basic;
> >>>>>>>> +       u16 report;
> >>>>>>>> +
> >>>>>>>> +       if (!skb_flow_dissect_flow_keys(skb, &flow, 0)) {
> >>>>>>>> +               hash->report = VIRTIO_NET_HASH_REPORT_NONE;
> >>>>>>>> +               return;
> >>>>>>>> +       }
> >>>>>>>> +
> >>>>>>>> +       flow_basic = (struct flow_keys_basic) {
> >>>>>>>> +               .control = flow.control,
> >>>>>>>> +               .basic = flow.basic
> >>>>>>>> +       };
> >>>>>>>> +
> >>>>>>>> +       report = virtio_net_hash_report(types, &flow_basic);
> >>>>>>>> +
> >>>>>>>> +       switch (report) {
> >>>>>>>> +       case VIRTIO_NET_HASH_REPORT_IPv4:
> >>>>>>>> +               virtio_net_toeplitz_calc(&toeplitz_state,
> >>>>>>>> +                                        (__be32 *)&flow.addrs.v4addrs,
> >>>>>>>> +                                        sizeof(flow.addrs.v4addrs));
> >>>>>>>> +               break;
> >>>>>>>> +
> >>>>>>>> +       case VIRTIO_NET_HASH_REPORT_TCPv4:
> >>>>>>>> +               virtio_net_toeplitz_calc(&toeplitz_state,
> >>>>>>>> +                                        (__be32 *)&flow.addrs.v4addrs,
> >>>>>>>> +                                        sizeof(flow.addrs.v4addrs));
> >>>>>>>> +               virtio_net_toeplitz_calc(&toeplitz_state, &flow.ports.ports,
> >>>>>>>> +                                        sizeof(flow.ports.ports));
> >>>>>>>> +               break;
> >>>>>>>> +
> >>>>>>>> +       case VIRTIO_NET_HASH_REPORT_UDPv4:
> >>>>>>>> +               virtio_net_toeplitz_calc(&toeplitz_state,
> >>>>>>>> +                                        (__be32 *)&flow.addrs.v4addrs,
> >>>>>>>> +                                        sizeof(flow.addrs.v4addrs));
> >>>>>>>> +               virtio_net_toeplitz_calc(&toeplitz_state, &flow.ports.ports,
> >>>>>>>> +                                        sizeof(flow.ports.ports));
> >>>>>>>> +               break;
> >>>>>>>> +
> >>>>>>>> +       case VIRTIO_NET_HASH_REPORT_IPv6:
> >>>>>>>> +               virtio_net_toeplitz_calc(&toeplitz_state,
> >>>>>>>> +                                        (__be32 *)&flow.addrs.v6addrs,
> >>>>>>>> +                                        sizeof(flow.addrs.v6addrs));
> >>>>>>>> +               break;
> >>>>>>>> +
> >>>>>>>> +       case VIRTIO_NET_HASH_REPORT_TCPv6:
> >>>>>>>> +               virtio_net_toeplitz_calc(&toeplitz_state,
> >>>>>>>> +                                        (__be32 *)&flow.addrs.v6addrs,
> >>>>>>>> +                                        sizeof(flow.addrs.v6addrs));
> >>>>>>>> +               virtio_net_toeplitz_calc(&toeplitz_state, &flow.ports.ports,
> >>>>>>>> +                                        sizeof(flow.ports.ports));
> >>>>>>>> +               break;
> >>>>>>>> +
> >>>>>>>> +       case VIRTIO_NET_HASH_REPORT_UDPv6:
> >>>>>>>> +               virtio_net_toeplitz_calc(&toeplitz_state,
> >>>>>>>> +                                        (__be32 *)&flow.addrs.v6addrs,
> >>>>>>>> +                                        sizeof(flow.addrs.v6addrs));
> >>>>>>>> +               virtio_net_toeplitz_calc(&toeplitz_state, &flow.ports.ports,
> >>>>>>>> +                                        sizeof(flow.ports.ports));
> >>>>>>>> +               break;
> >>>>>>>> +
> >>>>>>>> +       default:
> >>>>>>>> +               hash->report = VIRTIO_NET_HASH_REPORT_NONE;
> >>>>>>>> +               return;
> >>>>>>>
> >>>>>>> So I still think we need a comment here to explain why this is not an
> >>>>>>> issue if the device can report HASH_XXX_EX. Or we need to add the
> >>>>>>> support, since this is the code from the driver side, I don't think we
> >>>>>>> need to worry about the device implementation issues.
> >>>>>>
> >>>>>> This is on the device side, and don't report HASH_TYPE_XXX_EX.
> >>>>>>
> >>>>>>>
> >>>>>>> For the issue of the number of options, does the spec forbid fallback
> >>>>>>> to VIRTIO_NET_HASH_REPORT_NONE? If not, we can do that.
> >>>>>>
> >>>>>> 5.1.6.4.3.4 "IPv6 packets with extension header" says:
> >>>>>>     > If VIRTIO_NET_HASH_TYPE_TCP_EX is set and the packet has a TCPv6
> >>>>>>     > header, the hash is calculated over the following fields:
> >>>>>>     > - Home address from the home address option in the IPv6 destination
> >>>>>>     >   options header. If the extension header is not present, use the
> >>>>>>     >   Source IPv6 address.
> >>>>>>     > - IPv6 address that is contained in the Routing-Header-Type-2 from the
> >>>>>>     >   associated extension header. If the extension header is not present,
> >>>>>>     >   use the Destination IPv6 address.
> >>>>>>     > - Source TCP port
> >>>>>>     > - Destination TCP port
> >>>>>>
> >>>>>> Therefore, if VIRTIO_NET_HASH_TYPE_TCP_EX is set, the packet has a TCPv6
> >>>>>> and an home address option in the IPv6 destination options header is
> >>>>>> present, the hash is calculated over the home address. If the hash is
> >>>>>> not calculated over the home address in such a case, the device is
> >>>>>> contradicting with this section and violating the spec. The same goes
> >>>>>> for the other HASH_TYPE_XXX_EX types and Routing-Header-Type-2.
> >>>>>
> >>>>> Just to make sure we are one the same page. I meant:
> >>>>>
> >>>>> 1) If the hash is not calculated over the home address (in the case of
> >>>>> IPv6 destination destination), it can still report
> >>>>> VIRTIO_NET_RSS_HASH_TYPE_IPv6. This is what you implemented in your
> >>>>> series. So the device can simply fallback to e.g TCPv6 if it can't
> >>>>> understand all or part of the IPv6 options.
> >>>>
> >>>> The spec says it can fallback if "the extension header is not present",
> >>>> not if the device can't understand the extension header.
> >>>
> >>> I don't think so,
> >>>
> >>> 1) spec had a condition beforehand:
> >>>
> >>> """
> >>> If VIRTIO_NET_HASH_TYPE_TCP_EX is set and the packet has a TCPv6
> >>> header, the hash is calculated over the following fields:
> >>> ...
> >>> If the extension header is not present ...
> >>> """
> >>>
> >>> So the device can choose not to set VIRTIO_NET_HASH_TYPE_TCP_EX as
> >>> spec doesn't say device MUST set VIRTIO_NET_HASH_TYPE_TCP_EX if ...
> >>>
> >>> 2) implementation wise, since device has limited resources, we can't
> >>> expect the device can parse arbitrary number of ipv6 options
> >>>
> >>> 3) if 1) and 2) not the case, we need fix the spec otherwise implement
> >>> a spec compliant device is impractical
> >>
> >> The statement is preceded by the following:
> >>   >  The device calculates the hash on IPv4 packets according to
> >>   > ’Enabled hash types’ bitmask as follows:
> >>
> >> The 'Enabled hash types' bitmask is specified by the device.
> >>
> >> I think the spec needs amendment.
> >
> > Michael, can you help to clarify here?
> >
> >>
> >> I wonder if there are any people interested in the feature though.
> >> Looking at virtnet_set_hashflow() in drivers/net/virtio_net.c, the
> >> driver of Linux does not let users configure HASH_TYPE_XXX_EX. I suppose
> >> Windows supports HASH_TYPE_XXX_EX, but those who care network
> >> performance most would use Linux so HASH_TYPE_XXX_EX support without
> >> Linux driver's support may not be useful.
> >
> > It might be still interesting for example for the hardware virtio
> > vendors to support windows etc.
>
> I don't know if Windows needs them for e.g., device/driver certification
> so surveying Windows makes sense.

Yuri, can you help to clarify this?

>
> >
> >>
> >>>
> >>>>
> >>>>> 2) the VIRTIO_NET_SUPPORTED_HASH_TYPES is not checked against the
> >>>>> tun_vnet_ioctl_sethash(), so userspace may set
> >>>>> VIRTIO_NET_HASH_TYPE_TCP_EX regardless of what has been returned by
> >>>>> tun_vnet_ioctl_gethashtypes(). In this case they won't get
> >>>>> VIRTIO_NET_HASH_TYPE_TCP_EX.
> >>>>
> >>>> That's right. It's the responsibility of the userspace to set only the
> >>>> supported hash types.
> >>>
> >>> Well, the kernel should filter out the unsupported one to have a
> >>> robust uAPI. Otherwise, we give green light to the buggy userspace
> >>> which will have unexpected results.
> >>
> >> My reasoning was that it may be fine for some use cases other than VM
> >> (e.g., DPDK); in such a use case, it is fine as long as the UAPI works
> >> in the best-effort basis.
> >
> > Best-effort might increase the chance for user visisable changes after
> > migration.
>
> It is a trade-off between catching a migration bug for VMM and making
> life a bit easier for userspace programs other than VMM.

My understanding is to avoid breaking the migration compatibility as
much as possible as fixing that would be complicated or even
impossible.

>
> >
> >>
> >> For example, suppose a userspace program that processes TCP packets; the
> >> program can enable: HASH_TYPE_IPv4, HASH_TYPE_TCPv4, HASH_TYPE_IPv6, and
> >> HASH_TYPE_TCPv6. Ideally, the kernel should support all the hash types,
> >> but, even if e.g., HASH_TYPE_TCPv6 is not available,
> >
> > For "available" did you mean it is not supported by the device?
> >
> >> it will fall back
> >> to HASH_TYPE_IPv6, which still does something good and may be acceptable.
> >
> > This fallback is exactly the same as I said above, let
> > VIRTIO_NET_HASH_TYPE_TCP_EX to fallback.
> >
> > My point is that, the implementation should either:
> >
> > 1) allow fallback so it can claim to support all hash types
> >
> > or
> >
> > 2) don't allow fallback so it can only support a part of the hash types
> >
> > If we're doing something in the middle, for example, allow part of the
> > type to fallback.
>
> 1) or the middle will make it unsuitable for VM because it violates the
> virtio spec. 2) makes sense though the trade-off I mentioned should be
> taken into consideration.
>
> >
> >>
> >> That said, for a use case that involves VM and implements virtio-net
> >> (e.g., QEMU), setting an unsupported hash type here is definitely a bug.
> >> Catching the bug may outweigh the extra trouble for other use cases.
> >>
> >>>
> >>>>
> >>>>> 3) implementing part of the hash types might complicate the migration
> >>>>> or at least we need to describe the expectations of libvirt or other
> >>>>> management in this case. For example, do we plan to have a dedicated
> >>>>> Qemu command line like:
> >>>>>
> >>>>> -device virtio-net-pci,hash_report=on,supported_hash_types=X,Y,Z?
> >>>>
> >>>> I posted a patch series to implement such a command line for vDPA[1].
> >>>> The patch series that wires this tuntap feature up[2] reuses the
> >>>> infrastructure so it doesn't bring additional complexity.
> >>>>
> >>>> [1]
> >>>> https://lore.kernel.org/qemu-devel/20250530-vdpa-v1-0-5af4109b1c19@daynix.com/
> >>>> [2]
> >>>> https://lore.kernel.org/qemu-devel/20250530-hash-v5-0-343d7d7a8200@daynix.com/
> >>>
> >>> I meant, if we implement a full hash report feature, it means a single
> >>> hash cmdline option is more than sufficient and so compatibility code
> >>> can just turn it off when dealing with machine types. This is much
> >>> more simpler than
> >>>
> >>> 1) having both hash as well as supported_hash_features
> >>> 2) dealing both hash as well as supported_hash_features in compatibility codes
> >>> 3) libvirt will be happy
> >>>
> >>> For [1], it seems it introduces a per has type option, this seems to
> >>> be a burden to the management layer as it need to learn new option
> >>> everytime a new hash type is supported
> >>
> >> Even with the command line you proposed (supported_hash_types=X,Y,Z), it
> >> is still necessary to know the values the supported_hash_types property
> >> accepts (X.Y,Z), so I don't think it makes difference.
> >
> > It could be a uint32_t.
>
> The management layer will need to know what bits are accepted even with
> uint32_t.

Ease the management, basically it would be used by debugging or
machine type only.

>
> >
> >>
> >> The burden to the management layer is already present for features, so
> >> it is an existing problem (or its mere extension).
> >
> > Yes, but since this feature is new it's better to try our best to avoid that.
> >
> >>
> >> This problem was discussed in the following thread in the past, but no
> >> solution is implemented yet, and probably solving it will be difficult.
> >> https://lore.kernel.org/qemu-devel/20230731223148.1002258-5-yuri.benditovich@daynix.com/
> >
> > It's a similar issue but not the same, it looks more like a discussion
> > on whether the fallback from vhost-net to qemu works for missing
> > features etc.
>
> Perhaps we may be able to do better since this feature is new as you say
> and we don't have to worry much about breaking change. I don't have an
> idea for that yet.

Right.

>
> Regards,
> Akihiko Odaki
>

Thanks


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next v12 04/10] tun: Add common virtio-net hash feature code
  2025-06-06  9:26             ` Akihiko Odaki
@ 2025-06-17  3:39               ` Jason Wang
  2025-06-19 17:01                 ` Akihiko Odaki
  0 siblings, 1 reply; 39+ messages in thread
From: Jason Wang @ 2025-06-17  3:39 UTC (permalink / raw)
  To: Akihiko Odaki
  Cc: Jonathan Corbet, Willem de Bruijn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Michael S. Tsirkin, Xuan Zhuo,
	Shuah Khan, linux-doc, linux-kernel, netdev, kvm, virtualization,
	linux-kselftest, Yuri Benditovich, Andrew Melnychenko,
	Stephen Hemminger, gur.stavi, Lei Yang, Simon Horman

On Fri, Jun 6, 2025 at 5:27 PM Akihiko Odaki <akihiko.odaki@daynix.com> wrote:
>
> On 2025/06/06 10:01, Jason Wang wrote:
> > On Thu, Jun 5, 2025 at 4:18 PM Akihiko Odaki <akihiko.odaki@daynix.com> wrote:
> >>
> >> On 2025/06/05 11:46, Jason Wang wrote:
> >>> On Wed, Jun 4, 2025 at 4:42 PM Akihiko Odaki <akihiko.odaki@daynix.com> wrote:
> >>>>
> >>>> On 2025/06/04 10:53, Jason Wang wrote:
> >>>>> On Fri, May 30, 2025 at 12:50 PM Akihiko Odaki <akihiko.odaki@daynix.com> wrote:
> >>>>>>
> >>>>>> Add common code required for the features being added to TUN and TAP.
> >>>>>> They will be enabled for each of them in following patches.
> >>>>>>
> >>>>>> Added Features
> >>>>>> ==============
> >>>>>>
> >>>>>> Hash reporting
> >>>>>> --------------
> >>>>>>
> >>>>>> Allow the guest to reuse the hash value to make receive steering
> >>>>>> consistent between the host and guest, and to save hash computation.
> >>>>>>
> >>>>>> Receive Side Scaling (RSS)
> >>>>>> --------------------------
> >>>>>>
> >>>>>> RSS is a receive steering algorithm that can be negotiated to use with
> >>>>>> virtio_net. Conventionally the hash calculation was done by the VMM.
> >>>>>> However, computing the hash after the queue was chosen defeats the
> >>>>>> purpose of RSS.
> >>>>>>
> >>>>>> Another approach is to use eBPF steering program. This approach has
> >>>>>> another downside: it cannot report the calculated hash due to the
> >>>>>> restrictive nature of eBPF steering program.
> >>>>>>
> >>>>>> Introduce the code to perform RSS to the kernel in order to overcome
> >>>>>> thse challenges. An alternative solution is to extend the eBPF steering
> >>>>>> program so that it will be able to report to the userspace, but I didn't
> >>>>>> opt for it because extending the current mechanism of eBPF steering
> >>>>>> program as is because it relies on legacy context rewriting, and
> >>>>>> introducing kfunc-based eBPF will result in non-UAPI dependency while
> >>>>>> the other relevant virtualization APIs such as KVM and vhost_net are
> >>>>>> UAPIs.
> >>>>>>
> >>>>>> Added ioctls
> >>>>>> ============
> >>>>>>
> >>>>>> They are designed to make extensibility and VM migration compatible.
> >>>>>> This change only adds the implementation and does not expose them to
> >>>>>> the userspace.
> >>>>>>
> >>>>>> TUNGETVNETHASHTYPES
> >>>>>> -------------------
> >>>>>>
> >>>>>> This ioctl tells supported hash types. It is useful to check if a VM can
> >>>>>> be migrated to the current host.
> >>>>>>
> >>>>>> TUNSETVNETREPORTINGAUTOMQ, TUNSETVNETREPORTINGRSS, and TUNSETVNETRSS
> >>>>>> --------------------------------------------------------------------
> >>>>>>
> >>>>>> These ioctls configures a steering algorithm and, if needed, hash
> >>>>>> reporting.
> >>>>>>
> >>>>>> Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
> >>>>>> Tested-by: Lei Yang <leiyang@redhat.com>
> >>>>>> ---
> >>>>>>     drivers/net/tap.c           |  10 ++-
> >>>>>>     drivers/net/tun.c           |  12 +++-
> >>>>>>     drivers/net/tun_vnet.h      | 165 +++++++++++++++++++++++++++++++++++++++++---
> >>>>>>     include/uapi/linux/if_tun.h |  71 +++++++++++++++++++
> >>>>>>     4 files changed, 244 insertions(+), 14 deletions(-)
> >>>>>>
> >>>>>> diff --git a/drivers/net/tap.c b/drivers/net/tap.c
> >>>>>> index d4ece538f1b2..25c60ff2d3f2 100644
> >>>>>> --- a/drivers/net/tap.c
> >>>>>> +++ b/drivers/net/tap.c
> >>>>>> @@ -179,6 +179,11 @@ static void tap_put_queue(struct tap_queue *q)
> >>>>>>            sock_put(&q->sk);
> >>>>>>     }
> >>>>>>
> >>>>>> +static const struct virtio_net_hash *tap_find_hash(const struct sk_buff *skb)
> >>>>>> +{
> >>>>>> +       return NULL;
> >>>>>> +}
> >>>>>> +
> >>>>>>     /*
> >>>>>>      * Select a queue based on the rxq of the device on which this packet
> >>>>>>      * arrived. If the incoming device is not mq, calculate a flow hash
> >>>>>> @@ -711,11 +716,12 @@ static ssize_t tap_put_user(struct tap_queue *q,
> >>>>>>            int total;
> >>>>>>
> >>>>>>            if (q->flags & IFF_VNET_HDR) {
> >>>>>> -               struct virtio_net_hdr vnet_hdr;
> >>>>>> +               struct virtio_net_hdr_v1_hash vnet_hdr;
> >>>>>>
> >>>>>>                    vnet_hdr_len = READ_ONCE(q->vnet_hdr_sz);
> >>>>>>
> >>>>>> -               ret = tun_vnet_hdr_from_skb(q->flags, NULL, skb, &vnet_hdr);
> >>>>>> +               ret = tun_vnet_hdr_from_skb(vnet_hdr_len, q->flags, NULL, skb,
> >>>>>> +                                           tap_find_hash, &vnet_hdr);
> >>>>>>                    if (ret)
> >>>>>>                            return ret;
> >>>>>>
> >>>>>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> >>>>>> index 9133ab9ed3f5..03d47799e9bd 100644
> >>>>>> --- a/drivers/net/tun.c
> >>>>>> +++ b/drivers/net/tun.c
> >>>>>> @@ -451,6 +451,11 @@ static inline void tun_flow_save_rps_rxhash(struct tun_flow_entry *e, u32 hash)
> >>>>>>                    e->rps_rxhash = hash;
> >>>>>>     }
> >>>>>>
> >>>>>> +static const struct virtio_net_hash *tun_find_hash(const struct sk_buff *skb)
> >>>>>> +{
> >>>>>> +       return NULL;
> >>>>>> +}
> >>>>>> +
> >>>>>>     /* We try to identify a flow through its rxhash. The reason that
> >>>>>>      * we do not check rxq no. is because some cards(e.g 82599), chooses
> >>>>>>      * the rxq based on the txq where the last packet of the flow comes. As
> >>>>>> @@ -1993,7 +1998,7 @@ static ssize_t tun_put_user_xdp(struct tun_struct *tun,
> >>>>>>            ssize_t ret;
> >>>>>>
> >>>>>>            if (tun->flags & IFF_VNET_HDR) {
> >>>>>> -               struct virtio_net_hdr gso = { 0 };
> >>>>>> +               struct virtio_net_hdr_v1_hash gso = { 0 };
> >>>>>>
> >>>>>>                    vnet_hdr_sz = READ_ONCE(tun->vnet_hdr_sz);
> >>>>>>                    ret = tun_vnet_hdr_put(vnet_hdr_sz, iter, &gso);
> >>>>>> @@ -2046,9 +2051,10 @@ static ssize_t tun_put_user(struct tun_struct *tun,
> >>>>>>            }
> >>>>>>
> >>>>>>            if (vnet_hdr_sz) {
> >>>>>> -               struct virtio_net_hdr gso;
> >>>>>> +               struct virtio_net_hdr_v1_hash gso;
> >>>>>>
> >>>>>> -               ret = tun_vnet_hdr_from_skb(tun->flags, tun->dev, skb, &gso);
> >>>>>> +               ret = tun_vnet_hdr_from_skb(vnet_hdr_sz, tun->flags, tun->dev,
> >>>>>> +                                           skb, tun_find_hash, &gso);
> >>>>>>                    if (ret)
> >>>>>>                            return ret;
> >>>>>>
> >>>>>> diff --git a/drivers/net/tun_vnet.h b/drivers/net/tun_vnet.h
> >>>>>> index 58b9ac7a5fc4..45d0533efc8d 100644
> >>>>>> --- a/drivers/net/tun_vnet.h
> >>>>>> +++ b/drivers/net/tun_vnet.h
> >>>>>> @@ -6,6 +6,17 @@
> >>>>>>     #define TUN_VNET_LE     0x80000000
> >>>>>>     #define TUN_VNET_BE     0x40000000
> >>>>>>
> >>>>>> +typedef struct virtio_net_hash *(*tun_vnet_hash_add)(struct sk_buff *);
> >>>>>> +typedef const struct virtio_net_hash *(*tun_vnet_hash_find)(const struct sk_buff *);
> >>>>>> +
> >>>>>> +struct tun_vnet_hash {
> >>>>>> +       bool report;
> >>>>>> +       bool rss;
> >>>>>> +       struct tun_vnet_rss common;
> >>>>>> +       u32 rss_key[VIRTIO_NET_RSS_MAX_KEY_SIZE];
> >>>>>> +       u16 rss_indirection_table[];
> >>>>>> +};
> >>>>>> +
> >>>>>>     static inline bool tun_vnet_legacy_is_little_endian(unsigned int flags)
> >>>>>>     {
> >>>>>>            bool be = IS_ENABLED(CONFIG_TUN_VNET_CROSS_LE) &&
> >>>>>> @@ -107,6 +118,128 @@ static inline long tun_vnet_ioctl(int *vnet_hdr_sz, unsigned int *flags,
> >>>>>>            }
> >>>>>>     }
> >>>>>>
> >>>>>> +static inline long tun_vnet_ioctl_gethashtypes(u32 __user *argp)
> >>>>>> +{
> >>>>>> +       return put_user(VIRTIO_NET_SUPPORTED_HASH_TYPES, argp) ? -EFAULT : 0;
> >>>>>> +}
> >>>>>> +
> >>>>>> +static inline long tun_vnet_ioctl_sethash(struct tun_vnet_hash __rcu **hashp,
> >>>>>> +                                         unsigned int cmd,
> >>>>>> +                                         void __user *argp)
> >>>>>> +{
> >>>>>> +       struct tun_vnet_rss common;
> >>>>>> +       struct tun_vnet_hash *hash;
> >>>>>> +       size_t indirection_table_size;
> >>>>>> +       size_t key_size;
> >>>>>> +       size_t size;
> >>>>>> +
> >>>>>> +       switch (cmd) {
> >>>>>> +       case TUNSETVNETREPORTINGAUTOMQ:
> >>>>>> +               if (get_user(common.hash_types, (u32 __user *)argp))
> >>>>>> +                       return -EFAULT;
> >>>>>> +
> >>>>>> +               if (common.hash_types) {
> >>>>>> +                       hash = kzalloc(sizeof(*hash), GFP_KERNEL);
> >>>>>> +                       if (!hash)
> >>>>>> +                               return -ENOMEM;
> >>>>>> +
> >>>>>> +                       hash->report = true;
> >>>>>> +                       hash->common.hash_types = common.hash_types;
> >>>>>> +               } else {
> >>>>>> +                       hash = NULL;
> >>>>>> +               }
> >>>>>> +               break;
> >>>>>> +
> >>>>>> +       case TUNSETVNETREPORTINGRSS:
> >>>>>> +       case TUNSETVNETRSS:
> >>>>>
> >>>>> So the above three shows unnecessary design redundancy as well as a
> >>>>> burden for the future extension.  Why not simply have
> >>>>>
> >>>>> 1) TUNSETVNET_RSS
> >>>>> 2) TUNSETVNET_HASH_REPORT
> >>>>> ?
> >>>>>
> >>>>> Which maps to
> >>>>>
> >>>>>     #define VIRTIO_NET_CTRL_MQ_RSS_CONFIG          1 (for configurable
> >>>>> receive steering)
> >>>>>     #define VIRTIO_NET_CTRL_MQ_HASH_CONFIG         2 (for configurable
> >>>>> hash calculation)
> >>>>>
> >>>>> It would be always easier to start with what spec had or at least we
> >>>>> need to explain why we choose a different design here or in the
> >>>>> changelog to ease our life.
> >>>>
> >>>> TUNSETVNETREPORTINGAUTOMQ maps to VIRTIO_NET_CTRL_MQ_HASH_CONFIG.
> >>>
> >>> It's not:
> >>>
> >>> VIRTIO_NET_CTRL_MQ_HASH_CONFIG uses:
> >>>
> >>> struct virtio_net_hash_config {
> >>>       le32 hash_types;
> >>>       le16 reserved[4];
> >>>       u8 hash_key_length;
> >>>       u8 hash_key_data[hash_key_length];
> >>> };
> >>>
> >>> but TUNSETVNETREPORTINGAUTOMQ only accepts hash_types without others:
> >>
> >> The others are not present because the spec doesn't specify what to do
> >> with them and the kernel doesn't use them either.
> >
> > Did you mean the hash_key_length and hash_key_data? Note that we have
> > drivers other than the Linux ones as well.
>
> And reserved. Drivers can set whatever to these fields. It is not
> specified how these fields should be used.
>
> >
> >>
> >>>
> >>>>
> >>>> TUNSETVNETREPORTINGRSS and TUNSETVNETRSS map to
> >>>> VIRTIO_NET_CTRL_MQ_RSS_CONFIG.
> >>>
> >>> I think we've already had a discussion about this.
> >>>
> >>> Reusing virtio-net uAPI is much better instead of having a tun
> >>> specific one considering tun may need to support more virtio commands
> >>> in the future. Or maybe it's the time to introduce a transport for the
> >>> virtio control virtqueue uAPI in tuntap to avoid inventing new uAPI
> >>> endlessly.
> >>>
> >>> What's more I see:
> >>>
> >>> struct tun_vnet_rss {
> >>>           __u32 hash_types;
> >>>           __u16 indirection_table_mask;
> >>>           __u16 unclassified_queue;
> >>> };
> >>>
> >>> struct tun_vnet_hash {
> >>>           bool report;
> >>>           bool rss;
> >>>           struct tun_vnet_rss common;
> >>>           u32 rss_key[VIRTIO_NET_RSS_MAX_KEY_SIZE];
> >>>           u16 rss_indirection_table[];
> >>> };
> >>>
> >>> As I pointed out in the past, let's just decouple the rss from hash,
> >>> everything would be much simpler, or you need to explain why you
> >>> couple this somewhere.
> >>>
> >>> For example:
> >>>
> >>> 1) why is the tun_vnet_hash not part of the uAPI but tun_vnet_rss, or
> >>> how could userspace know what kind of format it would use for
> >>> TUNSETVNETREPORTINGRSS?
> >>
> >> That was the previous version.
> >>
> >>> 2) what's the advantages of embedding rss specific stuff into hash
> >>> report structure
> >>
> >> Because the hash types field in struct tun_vnet_rss is used by hash
> >> reporting too.
> >>
> >>> 3) what's the advantages of not using virtio-net uAPI
> >>
> >> 1. The use cases that don't involve VM will be simplified; programs for
> >> such a use case will not need to convert endian or to fill fileds the
> >> kernel doesn't care.
> >
> > Well, virtio_net_hdr is used by packet socket as well. Considering the
> > complexity of designing a new uAPI, it's still better.
>
> This patch series also reuses the datapath, following the prior examples.
>
> >
> > Or maybe you can clarify which field that kernel doesn't care about?
> > In this case TUN/TAP is basically the device datapath, if some of the
> > fields don't make sense, it's a bug of the spec.
>
> reserved, hash_key_length, and hash_key_data.

I may miss something when RSS is not negotiated, hash_key_length, and
hash_key_data is necessary, otherwise how could we calculate the hash?

>
> >
> >> 2. It aligns with existing UAPIs that operate in native endian and don't
> >> use virtio-net data structures like TUNSETOFFLOAD and TUNSETVNETHDRSZ.
> >
> > For those two examples, it would be used by guests directly. This is
> > different from RSS stuff.
>
> They are mediated by the VMM, which is no different from RSS.

Not necessarily, e,g Qemu support vDPA control virtqueue passthrough.

>
> >
> > With native endian, you need an endian conversation that converts le to native.
>
> That's true, but QEMU does so anyway to validate the configuration, to
> attach/detach queues, and to share the data structures with userspace
> RSS implementations. I expect other VMMs will do so too.

See above.

>
> >
> >>
> >>>
> >>> More issues:
> >>>
> >>> 1) max_tx_vq is ignored, so do you expect the userspace to intercept
> >>> this and switch to using TUNSETQUEUE? This seems like a burden as TUN
> >>> can simply accept it and do the attaching/detaching by itself
> >>> 2) the rx depends on the indirection table, so userspace need to
> >>> intercept the indirection and change the rx queue numbers accordingly
> >>> 3) RSS allows a async TX/RX queue, this is not supported by TUN now,
> >>> no matter if we decide to let userspace to intercept max_tx_vq or not,
> >>> we need to implement it first
> >>   > > Things would be much more simpler, if kernel can do 1) and 2).
> >>
> >> Attaching and detaching queues is not possible for the kernel because it
> >> doesn't know what file descriptors that map to queues will be used by
> >> the userspace.
> >
> > The kernel knows, tfile has a queue_index part.
>
> queue_index is set with TUNSETQUEUE so we need the ioctl.

queue_index would be reshuffle during attaching/detaching since the
netdev core forbids sparse active queue indices.

But I don't see it's an issue since we are talking about introducing
new uAPI here.

If it doesn't work, it doesn't change the point, a new uAPI is needed
since RSS requires async tx/rx queue numbers, current TUN only allows
combined queue pairs.

>
> >
> >>
> >> The following patch does 2) for QEMU:
> >> https://lore.kernel.org/qemu-devel/20250322-vq-v2-1-cee0aafe6404@daynix.com/
> >
> > See below point, form the view of the kernel, it's still a queue pair
> > not async TX/RX queue.
> >
> >>
> >> For 3), the patch for QEMU takes the maximum of TX and RX queue numbers
> >> to derive the number of queue pairs.
> >>
> >>>
> >>>> We have two ioctls here because
> >>>> VIRTIO_NET_CTRL_MQ_RSS_CONFIG behaves differently depending on whether
> >>>> VIRTIO_NET_F_HASH_REPORT is negotiated or not;
> >>>
> >>> It wouldn't be a problem if you do 1:1 mapping between virtio commands
> >>> and TUN uAPI, otherwise it should have a bug somewhere.
> >>
> >> Speaking of 1:1 mapping, it is possible to map VIRTIO_NET_F_HASH_REPORT
> >> into another ioctl. It may help add another receive steering algorithm
> >> in the future by not requiring two ioctls (TUNSETVNETREPORTING_X and
> >> TUNSETVNET_X).
> >
> > Yes and as I pointed out, virtio_net_hash_config should not be
> > specific to automq, it can work for other steering algorithm as well.
>
> That's not what the virtio spec says, so it will not be 1:1 mapping
> between virtio commands and TUN uAPI.

That's only because the spec only supports RSS and AUTOMQ so far. Or
do we expect a new virtio_net_XXX_hash_config for the new steering
algorithm?

>
> >
> >>
> >>>
> >>>> it also enables hash
> >>>> reporting if the feature is negotiated.
> >>>
> >>> Again, starting from virtio-net command is easier, a strong
> >>> justification is needed to explain why we choose another for tun/tap.
> >>>
> >>>>
> >>>>>
> >>>>> One day we would have tun_select_queue_algorithm_x() we don't have to
> >>>>> duplicate the ioctls once again here like TUNSETVNETREPORTINGXYZ
> >>>>
> >>>> 5.1.6.5.6.4 Hash calculation says:
> >>>>>    If VIRTIO_NET_F_HASH_REPORT was negotiated and the device uses
> >>>>> automatic receive steering, the device MUST support a command to
> >>>>> configure hash calculation parameters.
> >>>>>
> >>>>> The driver provides parameters for hash calculation as follows:
> >>>>>
> >>>>> class VIRTIO_NET_CTRL_MQ, command VIRTIO_NET_CTRL_MQ_HASH_CONFIG.
> >>>>
> >>>> VIRTIO_NET_CTRL_MQ_HASH_CONFIG is for automatic receive steering and not
> >>>> for RSS (or other steering algorithms if the spec gets any in the future).
> >>>
> >>> I'm not sure but the spec needs some tweaking. For example, I don't
> >>> expect there would be a dedicated hash config command for flow filters
> >>> in the future. I think the reason why spec says like this is that
> >>> virtio-net only supports automatic receive steering.
> >>>
> >>> Note that macvtap doesn't implement automatic receive steering.
> >>
> >> QEMU advertises VIRTIO_NET_F_CTRL_VQ for macvtap too, so it should have
> >> implemented it. I think QEMU with macvtap still compliant to the spec.
> >
> > Compliant, but automatic traffic steering is the best effort as well.
> >
> > Nope, TUN/TAP implements a flow cache that can steer tx based on rx.
> > Macvtap simply uses hash here.
> >
> >>
> >> "5.1.6.5.6 Automatic receive steering in multiqueue mode" says:
> >>
> >>   > After the driver transmitted a packet of a flow on transmitqX, the
> >>   > device SHOULD cause incoming packets for that flow to be steered to
> >>   > receiveqX.
> >>
> >> It is "SHOULD", so it is still compliant if the device doesn't properly
> >> respect the flow.
> >
> > Yes, a quality of implementation, or it's impractical to support a
> > correct steering for this device as limited resources and mailious
> > users can do syn flood etc.
> >
> >>
> >>>
> >>>>
> >>>>>
> >>>>>> +               if (copy_from_user(&common, argp, sizeof(common)))
> >>>>>> +                       return -EFAULT;
> >>>>>> +               argp = (struct tun_vnet_rss __user *)argp + 1;
> >>>>>> +
> >>>>>> +               indirection_table_size = ((size_t)common.indirection_table_mask + 1) * 2;
> >>>>>> +               key_size = virtio_net_hash_key_length(common.hash_types);
> >>>>>> +               size = struct_size(hash, rss_indirection_table,
> >>>>>> +                                  (size_t)common.indirection_table_mask + 1);
> >>>>>> +
> >>>>>> +               hash = kmalloc(size, GFP_KERNEL);
> >>>>>> +               if (!hash)
> >>>>>> +                       return -ENOMEM;
> >>>>>> +
> >>>>>> +               if (copy_from_user(hash->rss_indirection_table,
> >>>>>> +                                  argp, indirection_table_size)) {
> >>>>>> +                       kfree(hash);
> >>>>>> +                       return -EFAULT;
> >>>>>> +               }
> >>>>>> +               argp = (u16 __user *)argp + common.indirection_table_mask + 1;
> >>>>>> +
> >>>>>> +               if (copy_from_user(hash->rss_key, argp, key_size)) {
> >>>>>> +                       kfree(hash);
> >>>>>> +                       return -EFAULT;
> >>>>>> +               }
> >>>>>> +
> >>>>>> +               virtio_net_toeplitz_convert_key(hash->rss_key, key_size);
> >>>>>> +               hash->report = cmd == TUNSETVNETREPORTINGRSS;
> >>>>>
> >>>>> At least, if this is the only difference why not simply code this into
> >>>>> the ioctl itself other than having a very similar command?
> >>>>
> >>>> It is what the previous version did. Either is fine I guess; the only
> >>>> practical difference would be the size of the configuration struct is
> >>>> smaller with this approach.
> >>>>
> >>>>>
> >>>>>> +               hash->rss = true;
> >>>>>> +               hash->common = common;
> >>>>>> +               break;
> >>>>>> +
> >>>>>> +       default:
> >>>>>> +               return -EINVAL;
> >>>>>> +       }
> >>>>>> +
> >>>>>> +       kfree_rcu_mightsleep(rcu_replace_pointer_rtnl(*hashp, hash));
> >>>>>> +       return 0;
> >>>>>> +}
> >>>>>> +
> >>>>>> +static inline void tun_vnet_hash_report(const struct tun_vnet_hash *hash,
> >>>>>> +                                       struct sk_buff *skb,
> >>>>>> +                                       const struct flow_keys_basic *keys,
> >>>>>> +                                       u32 value,
> >>>>>> +                                       tun_vnet_hash_add vnet_hash_add)
> >>>>>> +{
> >>>>>> +       struct virtio_net_hash *report;
> >>>>>> +
> >>>>>> +       if (!hash || !hash->report)
> >>>>>> +               return;
> >>>>>> +
> >>>>>> +       report = vnet_hash_add(skb);
> >>>>>> +       if (!report)
> >>>>>> +               return;
> >>>>>> +
> >>>>>> +       *report = (struct virtio_net_hash) {
> >>>>>> +               .report = virtio_net_hash_report(hash->common.hash_types, keys),
> >>>>>> +               .value = value
> >>>>>> +       };
> >>>>>> +}
> >>>>>> +
> >>>>>> +static inline u16 tun_vnet_rss_select_queue(u32 numqueues,
> >>>>>> +                                           const struct tun_vnet_hash *hash,
> >>>>>> +                                           struct sk_buff *skb,
> >>>>>> +                                           tun_vnet_hash_add vnet_hash_add)
> >>>>>> +{
> >>>>>> +       struct virtio_net_hash *report;
> >>>>>> +       struct virtio_net_hash ret;
> >>>>>> +       u16 index;
> >>>>>> +
> >>>>>> +       if (!numqueues)
> >>>>>> +               return 0;
> >>>>>> +
> >>>>>> +       virtio_net_hash_rss(skb, hash->common.hash_types, hash->rss_key, &ret);
> >>>>>> +
> >>>>>> +       if (!ret.report)
> >>>>>> +               return hash->common.unclassified_queue % numqueues;
> >>>>>> +
> >>>>>> +       if (hash->report) {
> >>>>>> +               report = vnet_hash_add(skb);
> >>>>>> +               if (report)
> >>>>>> +                       *report = ret;
> >>>>>> +       }
> >>>>>> +
> >>>>>> +       index = ret.value & hash->common.indirection_table_mask;
> >>>>>> +
> >>>>>> +       return hash->rss_indirection_table[index] % numqueues;
> >>>>>> +}
> >>>>>> +
> >>>>>>     static inline int tun_vnet_hdr_get(int sz, unsigned int flags,
> >>>>>>                                       struct iov_iter *from,
> >>>>>>                                       struct virtio_net_hdr *hdr)
> >>>>>> @@ -135,15 +268,17 @@ static inline int tun_vnet_hdr_get(int sz, unsigned int flags,
> >>>>>>     }
> >>>>>>
> >>>>>>     static inline int tun_vnet_hdr_put(int sz, struct iov_iter *iter,
> >>>>>> -                                  const struct virtio_net_hdr *hdr)
> >>>>>> +                                  const struct virtio_net_hdr_v1_hash *hdr)
> >>>>>>     {
> >>>>>> +       int content_sz = MIN(sizeof(*hdr), sz);
> >>>>>> +
> >>>>>>            if (unlikely(iov_iter_count(iter) < sz))
> >>>>>>                    return -EINVAL;
> >>>>>>
> >>>>>> -       if (unlikely(copy_to_iter(hdr, sizeof(*hdr), iter) != sizeof(*hdr)))
> >>>>>> +       if (unlikely(copy_to_iter(hdr, content_sz, iter) != content_sz))
> >>>>>>                    return -EFAULT;
> >>>>>>
> >>>>>> -       if (iov_iter_zero(sz - sizeof(*hdr), iter) != sz - sizeof(*hdr))
> >>>>>> +       if (iov_iter_zero(sz - content_sz, iter) != sz - content_sz)
> >>>>>>                    return -EFAULT;
> >>>>>>
> >>>>>>            return 0;
> >>>>>> @@ -155,26 +290,38 @@ static inline int tun_vnet_hdr_to_skb(unsigned int flags, struct sk_buff *skb,
> >>>>>>            return virtio_net_hdr_to_skb(skb, hdr, tun_vnet_is_little_endian(flags));
> >>>>>>     }
> >>>>>>
> >>>>>> -static inline int tun_vnet_hdr_from_skb(unsigned int flags,
> >>>>>> +static inline int tun_vnet_hdr_from_skb(int sz, unsigned int flags,
> >>>>>>                                            const struct net_device *dev,
> >>>>>>                                            const struct sk_buff *skb,
> >>>>>> -                                       struct virtio_net_hdr *hdr)
> >>>>>> +                                       tun_vnet_hash_find vnet_hash_find,
> >>>>>> +                                       struct virtio_net_hdr_v1_hash *hdr)
> >>>>>>     {
> >>>>>>            int vlan_hlen = skb_vlan_tag_present(skb) ? VLAN_HLEN : 0;
> >>>>>> +       const struct virtio_net_hash *report = sz < sizeof(struct virtio_net_hdr_v1_hash) ?
> >>>>>> +                                              NULL : vnet_hash_find(skb);
> >>>>>> +
> >>>>>> +       *hdr = (struct virtio_net_hdr_v1_hash) {
> >>>>>> +               .hash_report = VIRTIO_NET_HASH_REPORT_NONE
> >>>>>> +       };
> >>>>>> +
> >>>>>> +       if (report) {
> >>>>>> +               hdr->hash_value = cpu_to_le32(report->value);
> >>>>>> +               hdr->hash_report = cpu_to_le16(report->report);
> >>>>>> +       }
> >>>>>>
> >>>>>> -       if (virtio_net_hdr_from_skb(skb, hdr,
> >>>>>> +       if (virtio_net_hdr_from_skb(skb, (struct virtio_net_hdr *)hdr,
> >>>>>>                                        tun_vnet_is_little_endian(flags), true,
> >>>>>>                                        vlan_hlen)) {
> >>>>>>                    struct skb_shared_info *sinfo = skb_shinfo(skb);
> >>>>>>
> >>>>>>                    if (net_ratelimit()) {
> >>>>>>                            netdev_err(dev, "unexpected GSO type: 0x%x, gso_size %d, hdr_len %d\n",
> >>>>>> -                                  sinfo->gso_type, tun_vnet16_to_cpu(flags, hdr->gso_size),
> >>>>>> -                                  tun_vnet16_to_cpu(flags, hdr->hdr_len));
> >>>>>> +                                  sinfo->gso_type, tun_vnet16_to_cpu(flags, hdr->hdr.gso_size),
> >>>>>> +                                  tun_vnet16_to_cpu(flags, hdr->hdr.hdr_len));
> >>>>>>                            print_hex_dump(KERN_ERR, "tun: ",
> >>>>>>                                           DUMP_PREFIX_NONE,
> >>>>>>                                           16, 1, skb->head,
> >>>>>> -                                      min(tun_vnet16_to_cpu(flags, hdr->hdr_len), 64), true);
> >>>>>> +                                      min(tun_vnet16_to_cpu(flags, hdr->hdr.hdr_len), 64), true);
> >>>>>>                    }
> >>>>>>                    WARN_ON_ONCE(1);
> >>>>>>                    return -EINVAL;
> >>>>>> diff --git a/include/uapi/linux/if_tun.h b/include/uapi/linux/if_tun.h
> >>>>>> index 980de74724fc..fe4b984d3bbb 100644
> >>>>>> --- a/include/uapi/linux/if_tun.h
> >>>>>> +++ b/include/uapi/linux/if_tun.h
> >>>>>> @@ -62,6 +62,62 @@
> >>>>>>     #define TUNSETCARRIER _IOW('T', 226, int)
> >>>>>>     #define TUNGETDEVNETNS _IO('T', 227)
> >>>>>>
> >>>>>> +/**
> >>>>>> + * define TUNGETVNETHASHTYPES - ioctl to get supported virtio_net hashing types
> >>>>>> + *
> >>>>>> + * The argument is a pointer to __u32 which will store the supported virtio_net
> >>>>>> + * hashing types.
> >>>>>> + */
> >>>>>> +#define TUNGETVNETHASHTYPES _IOR('T', 228, __u32)
> >>>>>> +
> >>>>>> +/**
> >>>>>> + * define TUNSETVNETREPORTINGAUTOMQ - ioctl to enable automq with hash reporting
> >>>>>> + *
> >>>>>> + * Disable RSS and enable automatic receive steering with hash reporting.
> >>>>>> + *
> >>>>>> + * The argument is a pointer to __u32 that contains a bitmask of hash types
> >>>>>> + * allowed to be reported.
> >>>>>> + *
> >>>>>> + * This ioctl results in %EBADFD if the underlying device is deleted. It affects
> >>>>>> + * all queues attached to the same device.
> >>>>>> + *
> >>>>>> + * This ioctl currently has no effect on XDP packets and packets with
> >>>>>> + * queue_mapping set by TC.
> >>>>>> + */
> >>>>>> +#define TUNSETVNETREPORTINGAUTOMQ _IOR('T', 229, __u32)
> >>>>>> +
> >>>>>> +/**
> >>>>>> + * define TUNSETVNETREPORTINGRSS - ioctl to enable RSS with hash reporting
> >>>>>> + *
> >>>>>> + * Disable automatic receive steering and enable RSS with hash reporting.
> >>>>>
> >>>>> This is unnecessary, e.g one day will have select_queue_xyz(), we
> >>>>> don't want to say "Disable automatic receive steering and xyz ..."
> >>>>
> >>>> It is still something better to be documented as its behavior is
> >>>> somewhat complicated.
> >>>
> >>> This is a hint of uAPI design issue.
> >>>
> >>>>
> >>>> Concretely, this ioctl disables automatic receive steering but doesn't
> >>>> disable steering by eBPF, which is implied by TUN_STEERINGEBPF_FALLBACK.
> >>>
> >>> It would be simpler:
> >>>
> >>> 1) not having TUN_STEERINGEBPF_FALLBACK
> >>> 2) the steering algorithm depends on the last uAPI call
> >>
> >> What will TUNSETSTEERINGEBPF with NULL mean in that case? It currently
> >> switches the steering algorithm to automq.
> >
> > A stackwise semantic then?
>
> Can you clarify the semantics with an example of a set of ioctls?
> Perhaps it is an easy way to demonstrate an alternative design idea.

Consider user do:

1) TUNSETQUEUE /* enable automq, push */
2) TUNSETSTEERINGEBPF /* enable steering ebp, push */
3) TUNSETETTERINGEBPF to NULL /* disable steering ebpf, pop */

Automq is in the stack top, so TUN will use that.

Thanks

>
> Regards,
> Akihiko Odaki
>


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next v12 04/10] tun: Add common virtio-net hash feature code
  2025-06-17  3:39               ` Jason Wang
@ 2025-06-19 17:01                 ` Akihiko Odaki
  2025-06-23  7:59                   ` Jason Wang
  0 siblings, 1 reply; 39+ messages in thread
From: Akihiko Odaki @ 2025-06-19 17:01 UTC (permalink / raw)
  To: Jason Wang
  Cc: Jonathan Corbet, Willem de Bruijn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Michael S. Tsirkin, Xuan Zhuo,
	Shuah Khan, linux-doc, linux-kernel, netdev, kvm, virtualization,
	linux-kselftest, Yuri Benditovich, Andrew Melnychenko,
	Stephen Hemminger, gur.stavi, Lei Yang, Simon Horman

On 2025/06/17 12:39, Jason Wang wrote:
> On Fri, Jun 6, 2025 at 5:27 PM Akihiko Odaki <akihiko.odaki@daynix.com> wrote:
>>
>> On 2025/06/06 10:01, Jason Wang wrote:
>>> On Thu, Jun 5, 2025 at 4:18 PM Akihiko Odaki <akihiko.odaki@daynix.com> wrote:
>>>>
>>>> On 2025/06/05 11:46, Jason Wang wrote:
>>>>> On Wed, Jun 4, 2025 at 4:42 PM Akihiko Odaki <akihiko.odaki@daynix.com> wrote:
>>>>>>
>>>>>> On 2025/06/04 10:53, Jason Wang wrote:
>>>>>>> On Fri, May 30, 2025 at 12:50 PM Akihiko Odaki <akihiko.odaki@daynix.com> wrote:
>>>>>>>>
>>>>>>>> Add common code required for the features being added to TUN and TAP.
>>>>>>>> They will be enabled for each of them in following patches.
>>>>>>>>
>>>>>>>> Added Features
>>>>>>>> ==============
>>>>>>>>
>>>>>>>> Hash reporting
>>>>>>>> --------------
>>>>>>>>
>>>>>>>> Allow the guest to reuse the hash value to make receive steering
>>>>>>>> consistent between the host and guest, and to save hash computation.
>>>>>>>>
>>>>>>>> Receive Side Scaling (RSS)
>>>>>>>> --------------------------
>>>>>>>>
>>>>>>>> RSS is a receive steering algorithm that can be negotiated to use with
>>>>>>>> virtio_net. Conventionally the hash calculation was done by the VMM.
>>>>>>>> However, computing the hash after the queue was chosen defeats the
>>>>>>>> purpose of RSS.
>>>>>>>>
>>>>>>>> Another approach is to use eBPF steering program. This approach has
>>>>>>>> another downside: it cannot report the calculated hash due to the
>>>>>>>> restrictive nature of eBPF steering program.
>>>>>>>>
>>>>>>>> Introduce the code to perform RSS to the kernel in order to overcome
>>>>>>>> thse challenges. An alternative solution is to extend the eBPF steering
>>>>>>>> program so that it will be able to report to the userspace, but I didn't
>>>>>>>> opt for it because extending the current mechanism of eBPF steering
>>>>>>>> program as is because it relies on legacy context rewriting, and
>>>>>>>> introducing kfunc-based eBPF will result in non-UAPI dependency while
>>>>>>>> the other relevant virtualization APIs such as KVM and vhost_net are
>>>>>>>> UAPIs.
>>>>>>>>
>>>>>>>> Added ioctls
>>>>>>>> ============
>>>>>>>>
>>>>>>>> They are designed to make extensibility and VM migration compatible.
>>>>>>>> This change only adds the implementation and does not expose them to
>>>>>>>> the userspace.
>>>>>>>>
>>>>>>>> TUNGETVNETHASHTYPES
>>>>>>>> -------------------
>>>>>>>>
>>>>>>>> This ioctl tells supported hash types. It is useful to check if a VM can
>>>>>>>> be migrated to the current host.
>>>>>>>>
>>>>>>>> TUNSETVNETREPORTINGAUTOMQ, TUNSETVNETREPORTINGRSS, and TUNSETVNETRSS
>>>>>>>> --------------------------------------------------------------------
>>>>>>>>
>>>>>>>> These ioctls configures a steering algorithm and, if needed, hash
>>>>>>>> reporting.
>>>>>>>>
>>>>>>>> Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
>>>>>>>> Tested-by: Lei Yang <leiyang@redhat.com>
>>>>>>>> ---
>>>>>>>>      drivers/net/tap.c           |  10 ++-
>>>>>>>>      drivers/net/tun.c           |  12 +++-
>>>>>>>>      drivers/net/tun_vnet.h      | 165 +++++++++++++++++++++++++++++++++++++++++---
>>>>>>>>      include/uapi/linux/if_tun.h |  71 +++++++++++++++++++
>>>>>>>>      4 files changed, 244 insertions(+), 14 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/drivers/net/tap.c b/drivers/net/tap.c
>>>>>>>> index d4ece538f1b2..25c60ff2d3f2 100644
>>>>>>>> --- a/drivers/net/tap.c
>>>>>>>> +++ b/drivers/net/tap.c
>>>>>>>> @@ -179,6 +179,11 @@ static void tap_put_queue(struct tap_queue *q)
>>>>>>>>             sock_put(&q->sk);
>>>>>>>>      }
>>>>>>>>
>>>>>>>> +static const struct virtio_net_hash *tap_find_hash(const struct sk_buff *skb)
>>>>>>>> +{
>>>>>>>> +       return NULL;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>>      /*
>>>>>>>>       * Select a queue based on the rxq of the device on which this packet
>>>>>>>>       * arrived. If the incoming device is not mq, calculate a flow hash
>>>>>>>> @@ -711,11 +716,12 @@ static ssize_t tap_put_user(struct tap_queue *q,
>>>>>>>>             int total;
>>>>>>>>
>>>>>>>>             if (q->flags & IFF_VNET_HDR) {
>>>>>>>> -               struct virtio_net_hdr vnet_hdr;
>>>>>>>> +               struct virtio_net_hdr_v1_hash vnet_hdr;
>>>>>>>>
>>>>>>>>                     vnet_hdr_len = READ_ONCE(q->vnet_hdr_sz);
>>>>>>>>
>>>>>>>> -               ret = tun_vnet_hdr_from_skb(q->flags, NULL, skb, &vnet_hdr);
>>>>>>>> +               ret = tun_vnet_hdr_from_skb(vnet_hdr_len, q->flags, NULL, skb,
>>>>>>>> +                                           tap_find_hash, &vnet_hdr);
>>>>>>>>                     if (ret)
>>>>>>>>                             return ret;
>>>>>>>>
>>>>>>>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
>>>>>>>> index 9133ab9ed3f5..03d47799e9bd 100644
>>>>>>>> --- a/drivers/net/tun.c
>>>>>>>> +++ b/drivers/net/tun.c
>>>>>>>> @@ -451,6 +451,11 @@ static inline void tun_flow_save_rps_rxhash(struct tun_flow_entry *e, u32 hash)
>>>>>>>>                     e->rps_rxhash = hash;
>>>>>>>>      }
>>>>>>>>
>>>>>>>> +static const struct virtio_net_hash *tun_find_hash(const struct sk_buff *skb)
>>>>>>>> +{
>>>>>>>> +       return NULL;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>>      /* We try to identify a flow through its rxhash. The reason that
>>>>>>>>       * we do not check rxq no. is because some cards(e.g 82599), chooses
>>>>>>>>       * the rxq based on the txq where the last packet of the flow comes. As
>>>>>>>> @@ -1993,7 +1998,7 @@ static ssize_t tun_put_user_xdp(struct tun_struct *tun,
>>>>>>>>             ssize_t ret;
>>>>>>>>
>>>>>>>>             if (tun->flags & IFF_VNET_HDR) {
>>>>>>>> -               struct virtio_net_hdr gso = { 0 };
>>>>>>>> +               struct virtio_net_hdr_v1_hash gso = { 0 };
>>>>>>>>
>>>>>>>>                     vnet_hdr_sz = READ_ONCE(tun->vnet_hdr_sz);
>>>>>>>>                     ret = tun_vnet_hdr_put(vnet_hdr_sz, iter, &gso);
>>>>>>>> @@ -2046,9 +2051,10 @@ static ssize_t tun_put_user(struct tun_struct *tun,
>>>>>>>>             }
>>>>>>>>
>>>>>>>>             if (vnet_hdr_sz) {
>>>>>>>> -               struct virtio_net_hdr gso;
>>>>>>>> +               struct virtio_net_hdr_v1_hash gso;
>>>>>>>>
>>>>>>>> -               ret = tun_vnet_hdr_from_skb(tun->flags, tun->dev, skb, &gso);
>>>>>>>> +               ret = tun_vnet_hdr_from_skb(vnet_hdr_sz, tun->flags, tun->dev,
>>>>>>>> +                                           skb, tun_find_hash, &gso);
>>>>>>>>                     if (ret)
>>>>>>>>                             return ret;
>>>>>>>>
>>>>>>>> diff --git a/drivers/net/tun_vnet.h b/drivers/net/tun_vnet.h
>>>>>>>> index 58b9ac7a5fc4..45d0533efc8d 100644
>>>>>>>> --- a/drivers/net/tun_vnet.h
>>>>>>>> +++ b/drivers/net/tun_vnet.h
>>>>>>>> @@ -6,6 +6,17 @@
>>>>>>>>      #define TUN_VNET_LE     0x80000000
>>>>>>>>      #define TUN_VNET_BE     0x40000000
>>>>>>>>
>>>>>>>> +typedef struct virtio_net_hash *(*tun_vnet_hash_add)(struct sk_buff *);
>>>>>>>> +typedef const struct virtio_net_hash *(*tun_vnet_hash_find)(const struct sk_buff *);
>>>>>>>> +
>>>>>>>> +struct tun_vnet_hash {
>>>>>>>> +       bool report;
>>>>>>>> +       bool rss;
>>>>>>>> +       struct tun_vnet_rss common;
>>>>>>>> +       u32 rss_key[VIRTIO_NET_RSS_MAX_KEY_SIZE];
>>>>>>>> +       u16 rss_indirection_table[];
>>>>>>>> +};
>>>>>>>> +
>>>>>>>>      static inline bool tun_vnet_legacy_is_little_endian(unsigned int flags)
>>>>>>>>      {
>>>>>>>>             bool be = IS_ENABLED(CONFIG_TUN_VNET_CROSS_LE) &&
>>>>>>>> @@ -107,6 +118,128 @@ static inline long tun_vnet_ioctl(int *vnet_hdr_sz, unsigned int *flags,
>>>>>>>>             }
>>>>>>>>      }
>>>>>>>>
>>>>>>>> +static inline long tun_vnet_ioctl_gethashtypes(u32 __user *argp)
>>>>>>>> +{
>>>>>>>> +       return put_user(VIRTIO_NET_SUPPORTED_HASH_TYPES, argp) ? -EFAULT : 0;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +static inline long tun_vnet_ioctl_sethash(struct tun_vnet_hash __rcu **hashp,
>>>>>>>> +                                         unsigned int cmd,
>>>>>>>> +                                         void __user *argp)
>>>>>>>> +{
>>>>>>>> +       struct tun_vnet_rss common;
>>>>>>>> +       struct tun_vnet_hash *hash;
>>>>>>>> +       size_t indirection_table_size;
>>>>>>>> +       size_t key_size;
>>>>>>>> +       size_t size;
>>>>>>>> +
>>>>>>>> +       switch (cmd) {
>>>>>>>> +       case TUNSETVNETREPORTINGAUTOMQ:
>>>>>>>> +               if (get_user(common.hash_types, (u32 __user *)argp))
>>>>>>>> +                       return -EFAULT;
>>>>>>>> +
>>>>>>>> +               if (common.hash_types) {
>>>>>>>> +                       hash = kzalloc(sizeof(*hash), GFP_KERNEL);
>>>>>>>> +                       if (!hash)
>>>>>>>> +                               return -ENOMEM;
>>>>>>>> +
>>>>>>>> +                       hash->report = true;
>>>>>>>> +                       hash->common.hash_types = common.hash_types;
>>>>>>>> +               } else {
>>>>>>>> +                       hash = NULL;
>>>>>>>> +               }
>>>>>>>> +               break;
>>>>>>>> +
>>>>>>>> +       case TUNSETVNETREPORTINGRSS:
>>>>>>>> +       case TUNSETVNETRSS:
>>>>>>>
>>>>>>> So the above three shows unnecessary design redundancy as well as a
>>>>>>> burden for the future extension.  Why not simply have
>>>>>>>
>>>>>>> 1) TUNSETVNET_RSS
>>>>>>> 2) TUNSETVNET_HASH_REPORT
>>>>>>> ?
>>>>>>>
>>>>>>> Which maps to
>>>>>>>
>>>>>>>      #define VIRTIO_NET_CTRL_MQ_RSS_CONFIG          1 (for configurable
>>>>>>> receive steering)
>>>>>>>      #define VIRTIO_NET_CTRL_MQ_HASH_CONFIG         2 (for configurable
>>>>>>> hash calculation)
>>>>>>>
>>>>>>> It would be always easier to start with what spec had or at least we
>>>>>>> need to explain why we choose a different design here or in the
>>>>>>> changelog to ease our life.
>>>>>>
>>>>>> TUNSETVNETREPORTINGAUTOMQ maps to VIRTIO_NET_CTRL_MQ_HASH_CONFIG.
>>>>>
>>>>> It's not:
>>>>>
>>>>> VIRTIO_NET_CTRL_MQ_HASH_CONFIG uses:
>>>>>
>>>>> struct virtio_net_hash_config {
>>>>>        le32 hash_types;
>>>>>        le16 reserved[4];
>>>>>        u8 hash_key_length;
>>>>>        u8 hash_key_data[hash_key_length];
>>>>> };
>>>>>
>>>>> but TUNSETVNETREPORTINGAUTOMQ only accepts hash_types without others:
>>>>
>>>> The others are not present because the spec doesn't specify what to do
>>>> with them and the kernel doesn't use them either.
>>>
>>> Did you mean the hash_key_length and hash_key_data? Note that we have
>>> drivers other than the Linux ones as well.
>>
>> And reserved. Drivers can set whatever to these fields. It is not
>> specified how these fields should be used.
>>
>>>
>>>>
>>>>>
>>>>>>
>>>>>> TUNSETVNETREPORTINGRSS and TUNSETVNETRSS map to
>>>>>> VIRTIO_NET_CTRL_MQ_RSS_CONFIG.
>>>>>
>>>>> I think we've already had a discussion about this.
>>>>>
>>>>> Reusing virtio-net uAPI is much better instead of having a tun
>>>>> specific one considering tun may need to support more virtio commands
>>>>> in the future. Or maybe it's the time to introduce a transport for the
>>>>> virtio control virtqueue uAPI in tuntap to avoid inventing new uAPI
>>>>> endlessly.
>>>>>
>>>>> What's more I see:
>>>>>
>>>>> struct tun_vnet_rss {
>>>>>            __u32 hash_types;
>>>>>            __u16 indirection_table_mask;
>>>>>            __u16 unclassified_queue;
>>>>> };
>>>>>
>>>>> struct tun_vnet_hash {
>>>>>            bool report;
>>>>>            bool rss;
>>>>>            struct tun_vnet_rss common;
>>>>>            u32 rss_key[VIRTIO_NET_RSS_MAX_KEY_SIZE];
>>>>>            u16 rss_indirection_table[];
>>>>> };
>>>>>
>>>>> As I pointed out in the past, let's just decouple the rss from hash,
>>>>> everything would be much simpler, or you need to explain why you
>>>>> couple this somewhere.
>>>>>
>>>>> For example:
>>>>>
>>>>> 1) why is the tun_vnet_hash not part of the uAPI but tun_vnet_rss, or
>>>>> how could userspace know what kind of format it would use for
>>>>> TUNSETVNETREPORTINGRSS?
>>>>
>>>> That was the previous version.
>>>>
>>>>> 2) what's the advantages of embedding rss specific stuff into hash
>>>>> report structure
>>>>
>>>> Because the hash types field in struct tun_vnet_rss is used by hash
>>>> reporting too.
>>>>
>>>>> 3) what's the advantages of not using virtio-net uAPI
>>>>
>>>> 1. The use cases that don't involve VM will be simplified; programs for
>>>> such a use case will not need to convert endian or to fill fileds the
>>>> kernel doesn't care.
>>>
>>> Well, virtio_net_hdr is used by packet socket as well. Considering the
>>> complexity of designing a new uAPI, it's still better.
>>
>> This patch series also reuses the datapath, following the prior examples.
>>
>>>
>>> Or maybe you can clarify which field that kernel doesn't care about?
>>> In this case TUN/TAP is basically the device datapath, if some of the
>>> fields don't make sense, it's a bug of the spec.
>>
>> reserved, hash_key_length, and hash_key_data.
> 
> I may miss something when RSS is not negotiated, hash_key_length, and
> hash_key_data is necessary, otherwise how could we calculate the hash?

I was not clear that I was referring to the fields of struct 
virtio_net_hash_config. struct virtio_net_rss_config provides 
hash_key_length and hash_key_data for RSS.

> 
>>
>>>
>>>> 2. It aligns with existing UAPIs that operate in native endian and don't
>>>> use virtio-net data structures like TUNSETOFFLOAD and TUNSETVNETHDRSZ.
>>>
>>> For those two examples, it would be used by guests directly. This is
>>> different from RSS stuff.
>>
>> They are mediated by the VMM, which is no different from RSS.
> 
> Not necessarily, e,g Qemu support vDPA control virtqueue passthrough.

TUNSETOFFLOAD and TUNSETVNETHDRSZ are not used with vDPA, are they?

> 
>>
>>>
>>> With native endian, you need an endian conversation that converts le to native.
>>
>> That's true, but QEMU does so anyway to validate the configuration, to
>> attach/detach queues, and to share the data structures with userspace
>> RSS implementations. I expect other VMMs will do so too.
> 
> See above.
> 
>>
>>>
>>>>
>>>>>
>>>>> More issues:
>>>>>
>>>>> 1) max_tx_vq is ignored, so do you expect the userspace to intercept
>>>>> this and switch to using TUNSETQUEUE? This seems like a burden as TUN
>>>>> can simply accept it and do the attaching/detaching by itself
>>>>> 2) the rx depends on the indirection table, so userspace need to
>>>>> intercept the indirection and change the rx queue numbers accordingly
>>>>> 3) RSS allows a async TX/RX queue, this is not supported by TUN now,
>>>>> no matter if we decide to let userspace to intercept max_tx_vq or not,
>>>>> we need to implement it first
>>>>    > > Things would be much more simpler, if kernel can do 1) and 2).
>>>>
>>>> Attaching and detaching queues is not possible for the kernel because it
>>>> doesn't know what file descriptors that map to queues will be used by
>>>> the userspace.
>>>
>>> The kernel knows, tfile has a queue_index part.
>>
>> queue_index is set with TUNSETQUEUE so we need the ioctl.
> 
> queue_index would be reshuffle during attaching/detaching since the
> netdev core forbids sparse active queue indices.
> 
> But I don't see it's an issue since we are talking about introducing
> new uAPI here.

If queue_index is not usable, how can we pick queues for the new UAPI?

> 
> If it doesn't work, it doesn't change the point, a new uAPI is needed
> since RSS requires async tx/rx queue numbers, current TUN only allows
> combined queue pairs.

TUN can have more tx/rx queues than the guest requests, so a VMM can 
take the maximum of TX and RX queue numbers to derive the number of 
queue pairs when the guest reqeusts async tx/rx queue numbers.

> 
>>
>>>
>>>>
>>>> The following patch does 2) for QEMU:
>>>> https://lore.kernel.org/qemu-devel/20250322-vq-v2-1-cee0aafe6404@daynix.com/
>>>
>>> See below point, form the view of the kernel, it's still a queue pair
>>> not async TX/RX queue.
>>>
>>>>
>>>> For 3), the patch for QEMU takes the maximum of TX and RX queue numbers
>>>> to derive the number of queue pairs.
>>>>
>>>>>
>>>>>> We have two ioctls here because
>>>>>> VIRTIO_NET_CTRL_MQ_RSS_CONFIG behaves differently depending on whether
>>>>>> VIRTIO_NET_F_HASH_REPORT is negotiated or not;
>>>>>
>>>>> It wouldn't be a problem if you do 1:1 mapping between virtio commands
>>>>> and TUN uAPI, otherwise it should have a bug somewhere.
>>>>
>>>> Speaking of 1:1 mapping, it is possible to map VIRTIO_NET_F_HASH_REPORT
>>>> into another ioctl. It may help add another receive steering algorithm
>>>> in the future by not requiring two ioctls (TUNSETVNETREPORTING_X and
>>>> TUNSETVNET_X).
>>>
>>> Yes and as I pointed out, virtio_net_hash_config should not be
>>> specific to automq, it can work for other steering algorithm as well.
>>
>> That's not what the virtio spec says, so it will not be 1:1 mapping
>> between virtio commands and TUN uAPI.
> 
> That's only because the spec only supports RSS and AUTOMQ so far. Or
> do we expect a new virtio_net_XXX_hash_config for the new steering
> algorithm?

No. RSS only needs struct virtio_net_rss_config for hash configuration, 
and a new steering algorithm will only need one struct. For example, if 
the spec is to gain siphash, we will need to add struct 
virtio_net_siphash_config.

struct virtio_net_hash_config is only for automq.

> 
>>
>>>
>>>>
>>>>>
>>>>>> it also enables hash
>>>>>> reporting if the feature is negotiated.
>>>>>
>>>>> Again, starting from virtio-net command is easier, a strong
>>>>> justification is needed to explain why we choose another for tun/tap.
>>>>>
>>>>>>
>>>>>>>
>>>>>>> One day we would have tun_select_queue_algorithm_x() we don't have to
>>>>>>> duplicate the ioctls once again here like TUNSETVNETREPORTINGXYZ
>>>>>>
>>>>>> 5.1.6.5.6.4 Hash calculation says:
>>>>>>>     If VIRTIO_NET_F_HASH_REPORT was negotiated and the device uses
>>>>>>> automatic receive steering, the device MUST support a command to
>>>>>>> configure hash calculation parameters.
>>>>>>>
>>>>>>> The driver provides parameters for hash calculation as follows:
>>>>>>>
>>>>>>> class VIRTIO_NET_CTRL_MQ, command VIRTIO_NET_CTRL_MQ_HASH_CONFIG.
>>>>>>
>>>>>> VIRTIO_NET_CTRL_MQ_HASH_CONFIG is for automatic receive steering and not
>>>>>> for RSS (or other steering algorithms if the spec gets any in the future).
>>>>>
>>>>> I'm not sure but the spec needs some tweaking. For example, I don't
>>>>> expect there would be a dedicated hash config command for flow filters
>>>>> in the future. I think the reason why spec says like this is that
>>>>> virtio-net only supports automatic receive steering.
>>>>>
>>>>> Note that macvtap doesn't implement automatic receive steering.
>>>>
>>>> QEMU advertises VIRTIO_NET_F_CTRL_VQ for macvtap too, so it should have
>>>> implemented it. I think QEMU with macvtap still compliant to the spec.
>>>
>>> Compliant, but automatic traffic steering is the best effort as well.
>>>
>>> Nope, TUN/TAP implements a flow cache that can steer tx based on rx.
>>> Macvtap simply uses hash here.
>>>
>>>>
>>>> "5.1.6.5.6 Automatic receive steering in multiqueue mode" says:
>>>>
>>>>    > After the driver transmitted a packet of a flow on transmitqX, the
>>>>    > device SHOULD cause incoming packets for that flow to be steered to
>>>>    > receiveqX.
>>>>
>>>> It is "SHOULD", so it is still compliant if the device doesn't properly
>>>> respect the flow.
>>>
>>> Yes, a quality of implementation, or it's impractical to support a
>>> correct steering for this device as limited resources and mailious
>>> users can do syn flood etc.
>>>
>>>>
>>>>>
>>>>>>
>>>>>>>
>>>>>>>> +               if (copy_from_user(&common, argp, sizeof(common)))
>>>>>>>> +                       return -EFAULT;
>>>>>>>> +               argp = (struct tun_vnet_rss __user *)argp + 1;
>>>>>>>> +
>>>>>>>> +               indirection_table_size = ((size_t)common.indirection_table_mask + 1) * 2;
>>>>>>>> +               key_size = virtio_net_hash_key_length(common.hash_types);
>>>>>>>> +               size = struct_size(hash, rss_indirection_table,
>>>>>>>> +                                  (size_t)common.indirection_table_mask + 1);
>>>>>>>> +
>>>>>>>> +               hash = kmalloc(size, GFP_KERNEL);
>>>>>>>> +               if (!hash)
>>>>>>>> +                       return -ENOMEM;
>>>>>>>> +
>>>>>>>> +               if (copy_from_user(hash->rss_indirection_table,
>>>>>>>> +                                  argp, indirection_table_size)) {
>>>>>>>> +                       kfree(hash);
>>>>>>>> +                       return -EFAULT;
>>>>>>>> +               }
>>>>>>>> +               argp = (u16 __user *)argp + common.indirection_table_mask + 1;
>>>>>>>> +
>>>>>>>> +               if (copy_from_user(hash->rss_key, argp, key_size)) {
>>>>>>>> +                       kfree(hash);
>>>>>>>> +                       return -EFAULT;
>>>>>>>> +               }
>>>>>>>> +
>>>>>>>> +               virtio_net_toeplitz_convert_key(hash->rss_key, key_size);
>>>>>>>> +               hash->report = cmd == TUNSETVNETREPORTINGRSS;
>>>>>>>
>>>>>>> At least, if this is the only difference why not simply code this into
>>>>>>> the ioctl itself other than having a very similar command?
>>>>>>
>>>>>> It is what the previous version did. Either is fine I guess; the only
>>>>>> practical difference would be the size of the configuration struct is
>>>>>> smaller with this approach.
>>>>>>
>>>>>>>
>>>>>>>> +               hash->rss = true;
>>>>>>>> +               hash->common = common;
>>>>>>>> +               break;
>>>>>>>> +
>>>>>>>> +       default:
>>>>>>>> +               return -EINVAL;
>>>>>>>> +       }
>>>>>>>> +
>>>>>>>> +       kfree_rcu_mightsleep(rcu_replace_pointer_rtnl(*hashp, hash));
>>>>>>>> +       return 0;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +static inline void tun_vnet_hash_report(const struct tun_vnet_hash *hash,
>>>>>>>> +                                       struct sk_buff *skb,
>>>>>>>> +                                       const struct flow_keys_basic *keys,
>>>>>>>> +                                       u32 value,
>>>>>>>> +                                       tun_vnet_hash_add vnet_hash_add)
>>>>>>>> +{
>>>>>>>> +       struct virtio_net_hash *report;
>>>>>>>> +
>>>>>>>> +       if (!hash || !hash->report)
>>>>>>>> +               return;
>>>>>>>> +
>>>>>>>> +       report = vnet_hash_add(skb);
>>>>>>>> +       if (!report)
>>>>>>>> +               return;
>>>>>>>> +
>>>>>>>> +       *report = (struct virtio_net_hash) {
>>>>>>>> +               .report = virtio_net_hash_report(hash->common.hash_types, keys),
>>>>>>>> +               .value = value
>>>>>>>> +       };
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +static inline u16 tun_vnet_rss_select_queue(u32 numqueues,
>>>>>>>> +                                           const struct tun_vnet_hash *hash,
>>>>>>>> +                                           struct sk_buff *skb,
>>>>>>>> +                                           tun_vnet_hash_add vnet_hash_add)
>>>>>>>> +{
>>>>>>>> +       struct virtio_net_hash *report;
>>>>>>>> +       struct virtio_net_hash ret;
>>>>>>>> +       u16 index;
>>>>>>>> +
>>>>>>>> +       if (!numqueues)
>>>>>>>> +               return 0;
>>>>>>>> +
>>>>>>>> +       virtio_net_hash_rss(skb, hash->common.hash_types, hash->rss_key, &ret);
>>>>>>>> +
>>>>>>>> +       if (!ret.report)
>>>>>>>> +               return hash->common.unclassified_queue % numqueues;
>>>>>>>> +
>>>>>>>> +       if (hash->report) {
>>>>>>>> +               report = vnet_hash_add(skb);
>>>>>>>> +               if (report)
>>>>>>>> +                       *report = ret;
>>>>>>>> +       }
>>>>>>>> +
>>>>>>>> +       index = ret.value & hash->common.indirection_table_mask;
>>>>>>>> +
>>>>>>>> +       return hash->rss_indirection_table[index] % numqueues;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>>      static inline int tun_vnet_hdr_get(int sz, unsigned int flags,
>>>>>>>>                                        struct iov_iter *from,
>>>>>>>>                                        struct virtio_net_hdr *hdr)
>>>>>>>> @@ -135,15 +268,17 @@ static inline int tun_vnet_hdr_get(int sz, unsigned int flags,
>>>>>>>>      }
>>>>>>>>
>>>>>>>>      static inline int tun_vnet_hdr_put(int sz, struct iov_iter *iter,
>>>>>>>> -                                  const struct virtio_net_hdr *hdr)
>>>>>>>> +                                  const struct virtio_net_hdr_v1_hash *hdr)
>>>>>>>>      {
>>>>>>>> +       int content_sz = MIN(sizeof(*hdr), sz);
>>>>>>>> +
>>>>>>>>             if (unlikely(iov_iter_count(iter) < sz))
>>>>>>>>                     return -EINVAL;
>>>>>>>>
>>>>>>>> -       if (unlikely(copy_to_iter(hdr, sizeof(*hdr), iter) != sizeof(*hdr)))
>>>>>>>> +       if (unlikely(copy_to_iter(hdr, content_sz, iter) != content_sz))
>>>>>>>>                     return -EFAULT;
>>>>>>>>
>>>>>>>> -       if (iov_iter_zero(sz - sizeof(*hdr), iter) != sz - sizeof(*hdr))
>>>>>>>> +       if (iov_iter_zero(sz - content_sz, iter) != sz - content_sz)
>>>>>>>>                     return -EFAULT;
>>>>>>>>
>>>>>>>>             return 0;
>>>>>>>> @@ -155,26 +290,38 @@ static inline int tun_vnet_hdr_to_skb(unsigned int flags, struct sk_buff *skb,
>>>>>>>>             return virtio_net_hdr_to_skb(skb, hdr, tun_vnet_is_little_endian(flags));
>>>>>>>>      }
>>>>>>>>
>>>>>>>> -static inline int tun_vnet_hdr_from_skb(unsigned int flags,
>>>>>>>> +static inline int tun_vnet_hdr_from_skb(int sz, unsigned int flags,
>>>>>>>>                                             const struct net_device *dev,
>>>>>>>>                                             const struct sk_buff *skb,
>>>>>>>> -                                       struct virtio_net_hdr *hdr)
>>>>>>>> +                                       tun_vnet_hash_find vnet_hash_find,
>>>>>>>> +                                       struct virtio_net_hdr_v1_hash *hdr)
>>>>>>>>      {
>>>>>>>>             int vlan_hlen = skb_vlan_tag_present(skb) ? VLAN_HLEN : 0;
>>>>>>>> +       const struct virtio_net_hash *report = sz < sizeof(struct virtio_net_hdr_v1_hash) ?
>>>>>>>> +                                              NULL : vnet_hash_find(skb);
>>>>>>>> +
>>>>>>>> +       *hdr = (struct virtio_net_hdr_v1_hash) {
>>>>>>>> +               .hash_report = VIRTIO_NET_HASH_REPORT_NONE
>>>>>>>> +       };
>>>>>>>> +
>>>>>>>> +       if (report) {
>>>>>>>> +               hdr->hash_value = cpu_to_le32(report->value);
>>>>>>>> +               hdr->hash_report = cpu_to_le16(report->report);
>>>>>>>> +       }
>>>>>>>>
>>>>>>>> -       if (virtio_net_hdr_from_skb(skb, hdr,
>>>>>>>> +       if (virtio_net_hdr_from_skb(skb, (struct virtio_net_hdr *)hdr,
>>>>>>>>                                         tun_vnet_is_little_endian(flags), true,
>>>>>>>>                                         vlan_hlen)) {
>>>>>>>>                     struct skb_shared_info *sinfo = skb_shinfo(skb);
>>>>>>>>
>>>>>>>>                     if (net_ratelimit()) {
>>>>>>>>                             netdev_err(dev, "unexpected GSO type: 0x%x, gso_size %d, hdr_len %d\n",
>>>>>>>> -                                  sinfo->gso_type, tun_vnet16_to_cpu(flags, hdr->gso_size),
>>>>>>>> -                                  tun_vnet16_to_cpu(flags, hdr->hdr_len));
>>>>>>>> +                                  sinfo->gso_type, tun_vnet16_to_cpu(flags, hdr->hdr.gso_size),
>>>>>>>> +                                  tun_vnet16_to_cpu(flags, hdr->hdr.hdr_len));
>>>>>>>>                             print_hex_dump(KERN_ERR, "tun: ",
>>>>>>>>                                            DUMP_PREFIX_NONE,
>>>>>>>>                                            16, 1, skb->head,
>>>>>>>> -                                      min(tun_vnet16_to_cpu(flags, hdr->hdr_len), 64), true);
>>>>>>>> +                                      min(tun_vnet16_to_cpu(flags, hdr->hdr.hdr_len), 64), true);
>>>>>>>>                     }
>>>>>>>>                     WARN_ON_ONCE(1);
>>>>>>>>                     return -EINVAL;
>>>>>>>> diff --git a/include/uapi/linux/if_tun.h b/include/uapi/linux/if_tun.h
>>>>>>>> index 980de74724fc..fe4b984d3bbb 100644
>>>>>>>> --- a/include/uapi/linux/if_tun.h
>>>>>>>> +++ b/include/uapi/linux/if_tun.h
>>>>>>>> @@ -62,6 +62,62 @@
>>>>>>>>      #define TUNSETCARRIER _IOW('T', 226, int)
>>>>>>>>      #define TUNGETDEVNETNS _IO('T', 227)
>>>>>>>>
>>>>>>>> +/**
>>>>>>>> + * define TUNGETVNETHASHTYPES - ioctl to get supported virtio_net hashing types
>>>>>>>> + *
>>>>>>>> + * The argument is a pointer to __u32 which will store the supported virtio_net
>>>>>>>> + * hashing types.
>>>>>>>> + */
>>>>>>>> +#define TUNGETVNETHASHTYPES _IOR('T', 228, __u32)
>>>>>>>> +
>>>>>>>> +/**
>>>>>>>> + * define TUNSETVNETREPORTINGAUTOMQ - ioctl to enable automq with hash reporting
>>>>>>>> + *
>>>>>>>> + * Disable RSS and enable automatic receive steering with hash reporting.
>>>>>>>> + *
>>>>>>>> + * The argument is a pointer to __u32 that contains a bitmask of hash types
>>>>>>>> + * allowed to be reported.
>>>>>>>> + *
>>>>>>>> + * This ioctl results in %EBADFD if the underlying device is deleted. It affects
>>>>>>>> + * all queues attached to the same device.
>>>>>>>> + *
>>>>>>>> + * This ioctl currently has no effect on XDP packets and packets with
>>>>>>>> + * queue_mapping set by TC.
>>>>>>>> + */
>>>>>>>> +#define TUNSETVNETREPORTINGAUTOMQ _IOR('T', 229, __u32)
>>>>>>>> +
>>>>>>>> +/**
>>>>>>>> + * define TUNSETVNETREPORTINGRSS - ioctl to enable RSS with hash reporting
>>>>>>>> + *
>>>>>>>> + * Disable automatic receive steering and enable RSS with hash reporting.
>>>>>>>
>>>>>>> This is unnecessary, e.g one day will have select_queue_xyz(), we
>>>>>>> don't want to say "Disable automatic receive steering and xyz ..."
>>>>>>
>>>>>> It is still something better to be documented as its behavior is
>>>>>> somewhat complicated.
>>>>>
>>>>> This is a hint of uAPI design issue.
>>>>>
>>>>>>
>>>>>> Concretely, this ioctl disables automatic receive steering but doesn't
>>>>>> disable steering by eBPF, which is implied by TUN_STEERINGEBPF_FALLBACK.
>>>>>
>>>>> It would be simpler:
>>>>>
>>>>> 1) not having TUN_STEERINGEBPF_FALLBACK
>>>>> 2) the steering algorithm depends on the last uAPI call
>>>>
>>>> What will TUNSETSTEERINGEBPF with NULL mean in that case? It currently
>>>> switches the steering algorithm to automq.
>>>
>>> A stackwise semantic then?
>>
>> Can you clarify the semantics with an example of a set of ioctls?
>> Perhaps it is an easy way to demonstrate an alternative design idea.
> 
> Consider user do:
> 
> 1) TUNSETQUEUE /* enable automq, push */
> 2) TUNSETSTEERINGEBPF /* enable steering ebp, push */
> 3) TUNSETETTERINGEBPF to NULL /* disable steering ebpf, pop */
> 
> Automq is in the stack top, so TUN will use that.

In that case, what will happen if the user does:

1) TUNSETQUEUE
2) TUNSETVNETRSS
3) TUNSETETTERINGEBPF to NULL

Regards,
Akihiko Odaki

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next v12 01/10] virtio_net: Add functions for hashing
  2025-06-17  3:28                   ` Jason Wang
@ 2025-06-21  5:50                     ` Akihiko Odaki
  2025-06-22 17:40                     ` Yuri Benditovich
  1 sibling, 0 replies; 39+ messages in thread
From: Akihiko Odaki @ 2025-06-21  5:50 UTC (permalink / raw)
  To: Jason Wang, Yuri Benditovich
  Cc: Michael S. Tsirkin, Jonathan Corbet, Willem de Bruijn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Xuan Zhuo, Shuah Khan, linux-doc, linux-kernel, netdev, kvm,
	virtualization, linux-kselftest, Andrew Melnychenko,
	Stephen Hemminger, gur.stavi, Lei Yang, Simon Horman

On 2025/06/17 12:28, Jason Wang wrote:
> On Fri, Jun 6, 2025 at 5:10 PM Akihiko Odaki <akihiko.odaki@daynix.com> wrote:
>>
>> On 2025/06/06 9:48, Jason Wang wrote:
>>> On Thu, Jun 5, 2025 at 3:58 PM Akihiko Odaki <akihiko.odaki@daynix.com> wrote:
>>>>
>>>> On 2025/06/05 10:53, Jason Wang wrote:
>>>>> On Wed, Jun 4, 2025 at 3:20 PM Akihiko Odaki <akihiko.odaki@daynix.com> wrote:
>>>>>>
>>>>>> On 2025/06/04 10:18, Jason Wang wrote:
>>>>>>> On Tue, Jun 3, 2025 at 1:31 PM Akihiko Odaki <akihiko.odaki@daynix.com> wrote:
>>>>>>>>
>>>>>>>> On 2025/06/03 12:19, Jason Wang wrote:
>>>>>>>>> On Fri, May 30, 2025 at 12:50 PM Akihiko Odaki <akihiko.odaki@daynix.com> wrote:
>>>>>>>>>>
>>>>>>>>>> They are useful to implement VIRTIO_NET_F_RSS and
>>>>>>>>>> VIRTIO_NET_F_HASH_REPORT.
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
>>>>>>>>>> Tested-by: Lei Yang <leiyang@redhat.com>
>>>>>>>>>> ---
>>>>>>>>>>       include/linux/virtio_net.h | 188 +++++++++++++++++++++++++++++++++++++++++++++
>>>>>>>>>>       1 file changed, 188 insertions(+)
>>>>>>>>>>
>>>>>>>>>> diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h
>>>>>>>>>> index 02a9f4dc594d..426f33b4b824 100644
>>>>>>>>>> --- a/include/linux/virtio_net.h
>>>>>>>>>> +++ b/include/linux/virtio_net.h
>>>>>>>>>> @@ -9,6 +9,194 @@
>>>>>>>>>>       #include <uapi/linux/tcp.h>
>>>>>>>>>>       #include <uapi/linux/virtio_net.h>
>>>>>>>>>>
>>>>>>>>>> +struct virtio_net_hash {
>>>>>>>>>> +       u32 value;
>>>>>>>>>> +       u16 report;
>>>>>>>>>> +};
>>>>>>>>>> +
>>>>>>>>>> +struct virtio_net_toeplitz_state {
>>>>>>>>>> +       u32 hash;
>>>>>>>>>> +       const u32 *key;
>>>>>>>>>> +};
>>>>>>>>>> +
>>>>>>>>>> +#define VIRTIO_NET_SUPPORTED_HASH_TYPES (VIRTIO_NET_RSS_HASH_TYPE_IPv4 | \
>>>>>>>>>> +                                        VIRTIO_NET_RSS_HASH_TYPE_TCPv4 | \
>>>>>>>>>> +                                        VIRTIO_NET_RSS_HASH_TYPE_UDPv4 | \
>>>>>>>>>> +                                        VIRTIO_NET_RSS_HASH_TYPE_IPv6 | \
>>>>>>>>>> +                                        VIRTIO_NET_RSS_HASH_TYPE_TCPv6 | \
>>>>>>>>>> +                                        VIRTIO_NET_RSS_HASH_TYPE_UDPv6)
>>>>>>>>>> +
>>>>>>>>>> +#define VIRTIO_NET_RSS_MAX_KEY_SIZE 40
>>>>>>>>>> +
>>>>>>>>>> +static inline void virtio_net_toeplitz_convert_key(u32 *input, size_t len)
>>>>>>>>>> +{
>>>>>>>>>> +       while (len >= sizeof(*input)) {
>>>>>>>>>> +               *input = be32_to_cpu((__force __be32)*input);
>>>>>>>>>> +               input++;
>>>>>>>>>> +               len -= sizeof(*input);
>>>>>>>>>> +       }
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +static inline void virtio_net_toeplitz_calc(struct virtio_net_toeplitz_state *state,
>>>>>>>>>> +                                           const __be32 *input, size_t len)
>>>>>>>>>> +{
>>>>>>>>>> +       while (len >= sizeof(*input)) {
>>>>>>>>>> +               for (u32 map = be32_to_cpu(*input); map; map &= (map - 1)) {
>>>>>>>>>> +                       u32 i = ffs(map);
>>>>>>>>>> +
>>>>>>>>>> +                       state->hash ^= state->key[0] << (32 - i) |
>>>>>>>>>> +                                      (u32)((u64)state->key[1] >> i);
>>>>>>>>>> +               }
>>>>>>>>>> +
>>>>>>>>>> +               state->key++;
>>>>>>>>>> +               input++;
>>>>>>>>>> +               len -= sizeof(*input);
>>>>>>>>>> +       }
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +static inline u8 virtio_net_hash_key_length(u32 types)
>>>>>>>>>> +{
>>>>>>>>>> +       size_t len = 0;
>>>>>>>>>> +
>>>>>>>>>> +       if (types & VIRTIO_NET_HASH_REPORT_IPv4)
>>>>>>>>>> +               len = max(len,
>>>>>>>>>> +                         sizeof(struct flow_dissector_key_ipv4_addrs));
>>>>>>>>>> +
>>>>>>>>>> +       if (types &
>>>>>>>>>> +           (VIRTIO_NET_HASH_REPORT_TCPv4 | VIRTIO_NET_HASH_REPORT_UDPv4))
>>>>>>>>>> +               len = max(len,
>>>>>>>>>> +                         sizeof(struct flow_dissector_key_ipv4_addrs) +
>>>>>>>>>> +                         sizeof(struct flow_dissector_key_ports));
>>>>>>>>>> +
>>>>>>>>>> +       if (types & VIRTIO_NET_HASH_REPORT_IPv6)
>>>>>>>>>> +               len = max(len,
>>>>>>>>>> +                         sizeof(struct flow_dissector_key_ipv6_addrs));
>>>>>>>>>> +
>>>>>>>>>> +       if (types &
>>>>>>>>>> +           (VIRTIO_NET_HASH_REPORT_TCPv6 | VIRTIO_NET_HASH_REPORT_UDPv6))
>>>>>>>>>> +               len = max(len,
>>>>>>>>>> +                         sizeof(struct flow_dissector_key_ipv6_addrs) +
>>>>>>>>>> +                         sizeof(struct flow_dissector_key_ports));
>>>>>>>>>> +
>>>>>>>>>> +       return len + sizeof(u32);
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +static inline u32 virtio_net_hash_report(u32 types,
>>>>>>>>>> +                                        const struct flow_keys_basic *keys)
>>>>>>>>>> +{
>>>>>>>>>> +       switch (keys->basic.n_proto) {
>>>>>>>>>> +       case cpu_to_be16(ETH_P_IP):
>>>>>>>>>> +               if (!(keys->control.flags & FLOW_DIS_IS_FRAGMENT)) {
>>>>>>>>>> +                       if (keys->basic.ip_proto == IPPROTO_TCP &&
>>>>>>>>>> +                           (types & VIRTIO_NET_RSS_HASH_TYPE_TCPv4))
>>>>>>>>>> +                               return VIRTIO_NET_HASH_REPORT_TCPv4;
>>>>>>>>>> +
>>>>>>>>>> +                       if (keys->basic.ip_proto == IPPROTO_UDP &&
>>>>>>>>>> +                           (types & VIRTIO_NET_RSS_HASH_TYPE_UDPv4))
>>>>>>>>>> +                               return VIRTIO_NET_HASH_REPORT_UDPv4;
>>>>>>>>>> +               }
>>>>>>>>>> +
>>>>>>>>>> +               if (types & VIRTIO_NET_RSS_HASH_TYPE_IPv4)
>>>>>>>>>> +                       return VIRTIO_NET_HASH_REPORT_IPv4;
>>>>>>>>>> +
>>>>>>>>>> +               return VIRTIO_NET_HASH_REPORT_NONE;
>>>>>>>>>> +
>>>>>>>>>> +       case cpu_to_be16(ETH_P_IPV6):
>>>>>>>>>> +               if (!(keys->control.flags & FLOW_DIS_IS_FRAGMENT)) {
>>>>>>>>>> +                       if (keys->basic.ip_proto == IPPROTO_TCP &&
>>>>>>>>>> +                           (types & VIRTIO_NET_RSS_HASH_TYPE_TCPv6))
>>>>>>>>>> +                               return VIRTIO_NET_HASH_REPORT_TCPv6;
>>>>>>>>>> +
>>>>>>>>>> +                       if (keys->basic.ip_proto == IPPROTO_UDP &&
>>>>>>>>>> +                           (types & VIRTIO_NET_RSS_HASH_TYPE_UDPv6))
>>>>>>>>>> +                               return VIRTIO_NET_HASH_REPORT_UDPv6;
>>>>>>>>>> +               }
>>>>>>>>>> +
>>>>>>>>>> +               if (types & VIRTIO_NET_RSS_HASH_TYPE_IPv6)
>>>>>>>>>> +                       return VIRTIO_NET_HASH_REPORT_IPv6;
>>>>>>>>>> +
>>>>>>>>>> +               return VIRTIO_NET_HASH_REPORT_NONE;
>>>>>>>>>> +
>>>>>>>>>> +       default:
>>>>>>>>>> +               return VIRTIO_NET_HASH_REPORT_NONE;
>>>>>>>>>> +       }
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +static inline void virtio_net_hash_rss(const struct sk_buff *skb,
>>>>>>>>>> +                                      u32 types, const u32 *key,
>>>>>>>>>> +                                      struct virtio_net_hash *hash)
>>>>>>>>>> +{
>>>>>>>>>> +       struct virtio_net_toeplitz_state toeplitz_state = { .key = key };
>>>>>>>>>> +       struct flow_keys flow;
>>>>>>>>>> +       struct flow_keys_basic flow_basic;
>>>>>>>>>> +       u16 report;
>>>>>>>>>> +
>>>>>>>>>> +       if (!skb_flow_dissect_flow_keys(skb, &flow, 0)) {
>>>>>>>>>> +               hash->report = VIRTIO_NET_HASH_REPORT_NONE;
>>>>>>>>>> +               return;
>>>>>>>>>> +       }
>>>>>>>>>> +
>>>>>>>>>> +       flow_basic = (struct flow_keys_basic) {
>>>>>>>>>> +               .control = flow.control,
>>>>>>>>>> +               .basic = flow.basic
>>>>>>>>>> +       };
>>>>>>>>>> +
>>>>>>>>>> +       report = virtio_net_hash_report(types, &flow_basic);
>>>>>>>>>> +
>>>>>>>>>> +       switch (report) {
>>>>>>>>>> +       case VIRTIO_NET_HASH_REPORT_IPv4:
>>>>>>>>>> +               virtio_net_toeplitz_calc(&toeplitz_state,
>>>>>>>>>> +                                        (__be32 *)&flow.addrs.v4addrs,
>>>>>>>>>> +                                        sizeof(flow.addrs.v4addrs));
>>>>>>>>>> +               break;
>>>>>>>>>> +
>>>>>>>>>> +       case VIRTIO_NET_HASH_REPORT_TCPv4:
>>>>>>>>>> +               virtio_net_toeplitz_calc(&toeplitz_state,
>>>>>>>>>> +                                        (__be32 *)&flow.addrs.v4addrs,
>>>>>>>>>> +                                        sizeof(flow.addrs.v4addrs));
>>>>>>>>>> +               virtio_net_toeplitz_calc(&toeplitz_state, &flow.ports.ports,
>>>>>>>>>> +                                        sizeof(flow.ports.ports));
>>>>>>>>>> +               break;
>>>>>>>>>> +
>>>>>>>>>> +       case VIRTIO_NET_HASH_REPORT_UDPv4:
>>>>>>>>>> +               virtio_net_toeplitz_calc(&toeplitz_state,
>>>>>>>>>> +                                        (__be32 *)&flow.addrs.v4addrs,
>>>>>>>>>> +                                        sizeof(flow.addrs.v4addrs));
>>>>>>>>>> +               virtio_net_toeplitz_calc(&toeplitz_state, &flow.ports.ports,
>>>>>>>>>> +                                        sizeof(flow.ports.ports));
>>>>>>>>>> +               break;
>>>>>>>>>> +
>>>>>>>>>> +       case VIRTIO_NET_HASH_REPORT_IPv6:
>>>>>>>>>> +               virtio_net_toeplitz_calc(&toeplitz_state,
>>>>>>>>>> +                                        (__be32 *)&flow.addrs.v6addrs,
>>>>>>>>>> +                                        sizeof(flow.addrs.v6addrs));
>>>>>>>>>> +               break;
>>>>>>>>>> +
>>>>>>>>>> +       case VIRTIO_NET_HASH_REPORT_TCPv6:
>>>>>>>>>> +               virtio_net_toeplitz_calc(&toeplitz_state,
>>>>>>>>>> +                                        (__be32 *)&flow.addrs.v6addrs,
>>>>>>>>>> +                                        sizeof(flow.addrs.v6addrs));
>>>>>>>>>> +               virtio_net_toeplitz_calc(&toeplitz_state, &flow.ports.ports,
>>>>>>>>>> +                                        sizeof(flow.ports.ports));
>>>>>>>>>> +               break;
>>>>>>>>>> +
>>>>>>>>>> +       case VIRTIO_NET_HASH_REPORT_UDPv6:
>>>>>>>>>> +               virtio_net_toeplitz_calc(&toeplitz_state,
>>>>>>>>>> +                                        (__be32 *)&flow.addrs.v6addrs,
>>>>>>>>>> +                                        sizeof(flow.addrs.v6addrs));
>>>>>>>>>> +               virtio_net_toeplitz_calc(&toeplitz_state, &flow.ports.ports,
>>>>>>>>>> +                                        sizeof(flow.ports.ports));
>>>>>>>>>> +               break;
>>>>>>>>>> +
>>>>>>>>>> +       default:
>>>>>>>>>> +               hash->report = VIRTIO_NET_HASH_REPORT_NONE;
>>>>>>>>>> +               return;
>>>>>>>>>
>>>>>>>>> So I still think we need a comment here to explain why this is not an
>>>>>>>>> issue if the device can report HASH_XXX_EX. Or we need to add the
>>>>>>>>> support, since this is the code from the driver side, I don't think we
>>>>>>>>> need to worry about the device implementation issues.
>>>>>>>>
>>>>>>>> This is on the device side, and don't report HASH_TYPE_XXX_EX.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> For the issue of the number of options, does the spec forbid fallback
>>>>>>>>> to VIRTIO_NET_HASH_REPORT_NONE? If not, we can do that.
>>>>>>>>
>>>>>>>> 5.1.6.4.3.4 "IPv6 packets with extension header" says:
>>>>>>>>      > If VIRTIO_NET_HASH_TYPE_TCP_EX is set and the packet has a TCPv6
>>>>>>>>      > header, the hash is calculated over the following fields:
>>>>>>>>      > - Home address from the home address option in the IPv6 destination
>>>>>>>>      >   options header. If the extension header is not present, use the
>>>>>>>>      >   Source IPv6 address.
>>>>>>>>      > - IPv6 address that is contained in the Routing-Header-Type-2 from the
>>>>>>>>      >   associated extension header. If the extension header is not present,
>>>>>>>>      >   use the Destination IPv6 address.
>>>>>>>>      > - Source TCP port
>>>>>>>>      > - Destination TCP port
>>>>>>>>
>>>>>>>> Therefore, if VIRTIO_NET_HASH_TYPE_TCP_EX is set, the packet has a TCPv6
>>>>>>>> and an home address option in the IPv6 destination options header is
>>>>>>>> present, the hash is calculated over the home address. If the hash is
>>>>>>>> not calculated over the home address in such a case, the device is
>>>>>>>> contradicting with this section and violating the spec. The same goes
>>>>>>>> for the other HASH_TYPE_XXX_EX types and Routing-Header-Type-2.
>>>>>>>
>>>>>>> Just to make sure we are one the same page. I meant:
>>>>>>>
>>>>>>> 1) If the hash is not calculated over the home address (in the case of
>>>>>>> IPv6 destination destination), it can still report
>>>>>>> VIRTIO_NET_RSS_HASH_TYPE_IPv6. This is what you implemented in your
>>>>>>> series. So the device can simply fallback to e.g TCPv6 if it can't
>>>>>>> understand all or part of the IPv6 options.
>>>>>>
>>>>>> The spec says it can fallback if "the extension header is not present",
>>>>>> not if the device can't understand the extension header.
>>>>>
>>>>> I don't think so,
>>>>>
>>>>> 1) spec had a condition beforehand:
>>>>>
>>>>> """
>>>>> If VIRTIO_NET_HASH_TYPE_TCP_EX is set and the packet has a TCPv6
>>>>> header, the hash is calculated over the following fields:
>>>>> ...
>>>>> If the extension header is not present ...
>>>>> """
>>>>>
>>>>> So the device can choose not to set VIRTIO_NET_HASH_TYPE_TCP_EX as
>>>>> spec doesn't say device MUST set VIRTIO_NET_HASH_TYPE_TCP_EX if ...
>>>>>
>>>>> 2) implementation wise, since device has limited resources, we can't
>>>>> expect the device can parse arbitrary number of ipv6 options
>>>>>
>>>>> 3) if 1) and 2) not the case, we need fix the spec otherwise implement
>>>>> a spec compliant device is impractical
>>>>
>>>> The statement is preceded by the following:
>>>>    >  The device calculates the hash on IPv4 packets according to
>>>>    > ’Enabled hash types’ bitmask as follows:
>>>>
>>>> The 'Enabled hash types' bitmask is specified by the device.
>>>>
>>>> I think the spec needs amendment.
>>>
>>> Michael, can you help to clarify here?
>>>
>>>>
>>>> I wonder if there are any people interested in the feature though.
>>>> Looking at virtnet_set_hashflow() in drivers/net/virtio_net.c, the
>>>> driver of Linux does not let users configure HASH_TYPE_XXX_EX. I suppose
>>>> Windows supports HASH_TYPE_XXX_EX, but those who care network
>>>> performance most would use Linux so HASH_TYPE_XXX_EX support without
>>>> Linux driver's support may not be useful.
>>>
>>> It might be still interesting for example for the hardware virtio
>>> vendors to support windows etc.
>>
>> I don't know if Windows needs them for e.g., device/driver certification
>> so surveying Windows makes sense.
> 
> Yuri, can you help to clarify this?
> 
>>
>>>
>>>>
>>>>>
>>>>>>
>>>>>>> 2) the VIRTIO_NET_SUPPORTED_HASH_TYPES is not checked against the
>>>>>>> tun_vnet_ioctl_sethash(), so userspace may set
>>>>>>> VIRTIO_NET_HASH_TYPE_TCP_EX regardless of what has been returned by
>>>>>>> tun_vnet_ioctl_gethashtypes(). In this case they won't get
>>>>>>> VIRTIO_NET_HASH_TYPE_TCP_EX.
>>>>>>
>>>>>> That's right. It's the responsibility of the userspace to set only the
>>>>>> supported hash types.
>>>>>
>>>>> Well, the kernel should filter out the unsupported one to have a
>>>>> robust uAPI. Otherwise, we give green light to the buggy userspace
>>>>> which will have unexpected results.
>>>>
>>>> My reasoning was that it may be fine for some use cases other than VM
>>>> (e.g., DPDK); in such a use case, it is fine as long as the UAPI works
>>>> in the best-effort basis.
>>>
>>> Best-effort might increase the chance for user visisable changes after
>>> migration.
>>
>> It is a trade-off between catching a migration bug for VMM and making
>> life a bit easier for userspace programs other than VMM.
> 
> My understanding is to avoid breaking the migration compatibility as
> much as possible as fixing that would be complicated or even
> impossible.

That makes a point.

> 
>>
>>>
>>>>
>>>> For example, suppose a userspace program that processes TCP packets; the
>>>> program can enable: HASH_TYPE_IPv4, HASH_TYPE_TCPv4, HASH_TYPE_IPv6, and
>>>> HASH_TYPE_TCPv6. Ideally, the kernel should support all the hash types,
>>>> but, even if e.g., HASH_TYPE_TCPv6 is not available,
>>>
>>> For "available" did you mean it is not supported by the device?
>>>
>>>> it will fall back
>>>> to HASH_TYPE_IPv6, which still does something good and may be acceptable.
>>>
>>> This fallback is exactly the same as I said above, let
>>> VIRTIO_NET_HASH_TYPE_TCP_EX to fallback.
>>>
>>> My point is that, the implementation should either:
>>>
>>> 1) allow fallback so it can claim to support all hash types
>>>
>>> or
>>>
>>> 2) don't allow fallback so it can only support a part of the hash types
>>>
>>> If we're doing something in the middle, for example, allow part of the
>>> type to fallback.
>>
>> 1) or the middle will make it unsuitable for VM because it violates the
>> virtio spec. 2) makes sense though the trade-off I mentioned should be
>> taken into consideration.
>>
>>>
>>>>
>>>> That said, for a use case that involves VM and implements virtio-net
>>>> (e.g., QEMU), setting an unsupported hash type here is definitely a bug.
>>>> Catching the bug may outweigh the extra trouble for other use cases.
>>>>
>>>>>
>>>>>>
>>>>>>> 3) implementing part of the hash types might complicate the migration
>>>>>>> or at least we need to describe the expectations of libvirt or other
>>>>>>> management in this case. For example, do we plan to have a dedicated
>>>>>>> Qemu command line like:
>>>>>>>
>>>>>>> -device virtio-net-pci,hash_report=on,supported_hash_types=X,Y,Z?
>>>>>>
>>>>>> I posted a patch series to implement such a command line for vDPA[1].
>>>>>> The patch series that wires this tuntap feature up[2] reuses the
>>>>>> infrastructure so it doesn't bring additional complexity.
>>>>>>
>>>>>> [1]
>>>>>> https://lore.kernel.org/qemu-devel/20250530-vdpa-v1-0-5af4109b1c19@daynix.com/
>>>>>> [2]
>>>>>> https://lore.kernel.org/qemu-devel/20250530-hash-v5-0-343d7d7a8200@daynix.com/
>>>>>
>>>>> I meant, if we implement a full hash report feature, it means a single
>>>>> hash cmdline option is more than sufficient and so compatibility code
>>>>> can just turn it off when dealing with machine types. This is much
>>>>> more simpler than
>>>>>
>>>>> 1) having both hash as well as supported_hash_features
>>>>> 2) dealing both hash as well as supported_hash_features in compatibility codes
>>>>> 3) libvirt will be happy
>>>>>
>>>>> For [1], it seems it introduces a per has type option, this seems to
>>>>> be a burden to the management layer as it need to learn new option
>>>>> everytime a new hash type is supported
>>>>
>>>> Even with the command line you proposed (supported_hash_types=X,Y,Z), it
>>>> is still necessary to know the values the supported_hash_types property
>>>> accepts (X.Y,Z), so I don't think it makes difference.
>>>
>>> It could be a uint32_t.
>>
>> The management layer will need to know what bits are accepted even with
>> uint32_t.
> 
> Ease the management, basically it would be used by debugging or
> machine type only.

Users need to be able to specify hash types for the reasons same with 
virtio features (via domain XML for libvirt). In particular:

1) To ensure migration compatibility in case that hosts have different 
sets of supported hash types. This kind of situation can happen when 
vDPA devices have different capabilities or when the spec gains new hash 
types and a new version of kernel/tuntap starts to support it.

2) For debugging as you mentioned.

It makes sense to let users specify hash types with mnemonics instead of 
cryptic bitmasks. 2) is not an end-user requirement so it may be 
tolerable to expose bitmasks, but it is less so for 1). The management 
layer needs to have a list of hash types to support mnemonics whether 
QOM properties are uint32_t or not.

We can have an interface different from the one for features without 
worrying about backward compatibility, but this aspect of hash types is 
exactly same with features and does not motivate to do so.

Regards,
Akihiko Odaki

> 
>>
>>>
>>>>
>>>> The burden to the management layer is already present for features, so
>>>> it is an existing problem (or its mere extension).
>>>
>>> Yes, but since this feature is new it's better to try our best to avoid that.
>>>
>>>>
>>>> This problem was discussed in the following thread in the past, but no
>>>> solution is implemented yet, and probably solving it will be difficult.
>>>> https://lore.kernel.org/qemu-devel/20230731223148.1002258-5-yuri.benditovich@daynix.com/
>>>
>>> It's a similar issue but not the same, it looks more like a discussion
>>> on whether the fallback from vhost-net to qemu works for missing
>>> features etc.
>>
>> Perhaps we may be able to do better since this feature is new as you say
>> and we don't have to worry much about breaking change. I don't have an
>> idea for that yet.
> 
> Right.
> 
>>
>> Regards,
>> Akihiko Odaki
>>
> 
> Thanks
> 


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next v12 01/10] virtio_net: Add functions for hashing
  2025-06-17  3:28                   ` Jason Wang
  2025-06-21  5:50                     ` Akihiko Odaki
@ 2025-06-22 17:40                     ` Yuri Benditovich
  2025-06-23  8:07                       ` Jason Wang
  1 sibling, 1 reply; 39+ messages in thread
From: Yuri Benditovich @ 2025-06-22 17:40 UTC (permalink / raw)
  To: Jason Wang
  Cc: Akihiko Odaki, Michael S. Tsirkin, Jonathan Corbet,
	Willem de Bruijn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Xuan Zhuo, Shuah Khan, linux-doc, linux-kernel,
	netdev, kvm, virtualization, linux-kselftest, Andrew Melnychenko,
	Stephen Hemminger, gur.stavi, Lei Yang, Simon Horman

> Yuri, can you help to clarify this?

I see here several questions:
1. Whether it is ok for the device not to indicate support for XXX_EX hash type?
- I think, yes (strictly speaking, it was better to test that before
submitting the patches )
2. Is it possible that the guest will enable some XXX_EX hash type if
the device does not indicate that it is supported?
- No (I think this is part of the spec)
3. What to do if we migrate between systems with different
capabilities of hash support/reporting/whatever
- IMO, at this moment such case should be excluded and only mechanism
we have for that is the compatible machine version
- in some future the change of device capabilities can be communicated
to the driver and _probably_ the driver might be able to communicate
the change of device capabilities to the OS
4. Does it make sense to have fine configuration of hash types mask
via command-line?
- IMO, no. This would require the user to have too much knowledge
about RSS internals

Please let me know if I missed something.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next v12 04/10] tun: Add common virtio-net hash feature code
  2025-06-19 17:01                 ` Akihiko Odaki
@ 2025-06-23  7:59                   ` Jason Wang
  0 siblings, 0 replies; 39+ messages in thread
From: Jason Wang @ 2025-06-23  7:59 UTC (permalink / raw)
  To: Akihiko Odaki
  Cc: Jonathan Corbet, Willem de Bruijn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Michael S. Tsirkin, Xuan Zhuo,
	Shuah Khan, linux-doc, linux-kernel, netdev, kvm, virtualization,
	linux-kselftest, Yuri Benditovich, Andrew Melnychenko,
	Stephen Hemminger, gur.stavi, Lei Yang, Simon Horman

On Fri, Jun 20, 2025 at 1:01 AM Akihiko Odaki <akihiko.odaki@daynix.com> wrote:
>
> On 2025/06/17 12:39, Jason Wang wrote:
> > On Fri, Jun 6, 2025 at 5:27 PM Akihiko Odaki <akihiko.odaki@daynix.com> wrote:
> >>
> >> On 2025/06/06 10:01, Jason Wang wrote:
> >>> On Thu, Jun 5, 2025 at 4:18 PM Akihiko Odaki <akihiko.odaki@daynix.com> wrote:
> >>>>
> >>>> On 2025/06/05 11:46, Jason Wang wrote:
> >>>>> On Wed, Jun 4, 2025 at 4:42 PM Akihiko Odaki <akihiko.odaki@daynix.com> wrote:
> >>>>>>
> >>>>>> On 2025/06/04 10:53, Jason Wang wrote:
> >>>>>>> On Fri, May 30, 2025 at 12:50 PM Akihiko Odaki <akihiko.odaki@daynix.com> wrote:
> >>>>>>>>
> >>>>>>>> Add common code required for the features being added to TUN and TAP.
> >>>>>>>> They will be enabled for each of them in following patches.
> >>>>>>>>
> >>>>>>>> Added Features
> >>>>>>>> ==============
> >>>>>>>>
> >>>>>>>> Hash reporting
> >>>>>>>> --------------
> >>>>>>>>
> >>>>>>>> Allow the guest to reuse the hash value to make receive steering
> >>>>>>>> consistent between the host and guest, and to save hash computation.
> >>>>>>>>
> >>>>>>>> Receive Side Scaling (RSS)
> >>>>>>>> --------------------------
> >>>>>>>>
> >>>>>>>> RSS is a receive steering algorithm that can be negotiated to use with
> >>>>>>>> virtio_net. Conventionally the hash calculation was done by the VMM.
> >>>>>>>> However, computing the hash after the queue was chosen defeats the
> >>>>>>>> purpose of RSS.
> >>>>>>>>
> >>>>>>>> Another approach is to use eBPF steering program. This approach has
> >>>>>>>> another downside: it cannot report the calculated hash due to the
> >>>>>>>> restrictive nature of eBPF steering program.
> >>>>>>>>
> >>>>>>>> Introduce the code to perform RSS to the kernel in order to overcome
> >>>>>>>> thse challenges. An alternative solution is to extend the eBPF steering
> >>>>>>>> program so that it will be able to report to the userspace, but I didn't
> >>>>>>>> opt for it because extending the current mechanism of eBPF steering
> >>>>>>>> program as is because it relies on legacy context rewriting, and
> >>>>>>>> introducing kfunc-based eBPF will result in non-UAPI dependency while
> >>>>>>>> the other relevant virtualization APIs such as KVM and vhost_net are
> >>>>>>>> UAPIs.
> >>>>>>>>
> >>>>>>>> Added ioctls
> >>>>>>>> ============
> >>>>>>>>
> >>>>>>>> They are designed to make extensibility and VM migration compatible.
> >>>>>>>> This change only adds the implementation and does not expose them to
> >>>>>>>> the userspace.
> >>>>>>>>
> >>>>>>>> TUNGETVNETHASHTYPES
> >>>>>>>> -------------------
> >>>>>>>>
> >>>>>>>> This ioctl tells supported hash types. It is useful to check if a VM can
> >>>>>>>> be migrated to the current host.
> >>>>>>>>
> >>>>>>>> TUNSETVNETREPORTINGAUTOMQ, TUNSETVNETREPORTINGRSS, and TUNSETVNETRSS
> >>>>>>>> --------------------------------------------------------------------
> >>>>>>>>
> >>>>>>>> These ioctls configures a steering algorithm and, if needed, hash
> >>>>>>>> reporting.
> >>>>>>>>
> >>>>>>>> Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
> >>>>>>>> Tested-by: Lei Yang <leiyang@redhat.com>
> >>>>>>>> ---
> >>>>>>>>      drivers/net/tap.c           |  10 ++-
> >>>>>>>>      drivers/net/tun.c           |  12 +++-
> >>>>>>>>      drivers/net/tun_vnet.h      | 165 +++++++++++++++++++++++++++++++++++++++++---
> >>>>>>>>      include/uapi/linux/if_tun.h |  71 +++++++++++++++++++
> >>>>>>>>      4 files changed, 244 insertions(+), 14 deletions(-)
> >>>>>>>>
> >>>>>>>> diff --git a/drivers/net/tap.c b/drivers/net/tap.c
> >>>>>>>> index d4ece538f1b2..25c60ff2d3f2 100644
> >>>>>>>> --- a/drivers/net/tap.c
> >>>>>>>> +++ b/drivers/net/tap.c
> >>>>>>>> @@ -179,6 +179,11 @@ static void tap_put_queue(struct tap_queue *q)
> >>>>>>>>             sock_put(&q->sk);
> >>>>>>>>      }
> >>>>>>>>
> >>>>>>>> +static const struct virtio_net_hash *tap_find_hash(const struct sk_buff *skb)
> >>>>>>>> +{
> >>>>>>>> +       return NULL;
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>>      /*
> >>>>>>>>       * Select a queue based on the rxq of the device on which this packet
> >>>>>>>>       * arrived. If the incoming device is not mq, calculate a flow hash
> >>>>>>>> @@ -711,11 +716,12 @@ static ssize_t tap_put_user(struct tap_queue *q,
> >>>>>>>>             int total;
> >>>>>>>>
> >>>>>>>>             if (q->flags & IFF_VNET_HDR) {
> >>>>>>>> -               struct virtio_net_hdr vnet_hdr;
> >>>>>>>> +               struct virtio_net_hdr_v1_hash vnet_hdr;
> >>>>>>>>
> >>>>>>>>                     vnet_hdr_len = READ_ONCE(q->vnet_hdr_sz);
> >>>>>>>>
> >>>>>>>> -               ret = tun_vnet_hdr_from_skb(q->flags, NULL, skb, &vnet_hdr);
> >>>>>>>> +               ret = tun_vnet_hdr_from_skb(vnet_hdr_len, q->flags, NULL, skb,
> >>>>>>>> +                                           tap_find_hash, &vnet_hdr);
> >>>>>>>>                     if (ret)
> >>>>>>>>                             return ret;
> >>>>>>>>
> >>>>>>>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> >>>>>>>> index 9133ab9ed3f5..03d47799e9bd 100644
> >>>>>>>> --- a/drivers/net/tun.c
> >>>>>>>> +++ b/drivers/net/tun.c
> >>>>>>>> @@ -451,6 +451,11 @@ static inline void tun_flow_save_rps_rxhash(struct tun_flow_entry *e, u32 hash)
> >>>>>>>>                     e->rps_rxhash = hash;
> >>>>>>>>      }
> >>>>>>>>
> >>>>>>>> +static const struct virtio_net_hash *tun_find_hash(const struct sk_buff *skb)
> >>>>>>>> +{
> >>>>>>>> +       return NULL;
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>>      /* We try to identify a flow through its rxhash. The reason that
> >>>>>>>>       * we do not check rxq no. is because some cards(e.g 82599), chooses
> >>>>>>>>       * the rxq based on the txq where the last packet of the flow comes. As
> >>>>>>>> @@ -1993,7 +1998,7 @@ static ssize_t tun_put_user_xdp(struct tun_struct *tun,
> >>>>>>>>             ssize_t ret;
> >>>>>>>>
> >>>>>>>>             if (tun->flags & IFF_VNET_HDR) {
> >>>>>>>> -               struct virtio_net_hdr gso = { 0 };
> >>>>>>>> +               struct virtio_net_hdr_v1_hash gso = { 0 };
> >>>>>>>>
> >>>>>>>>                     vnet_hdr_sz = READ_ONCE(tun->vnet_hdr_sz);
> >>>>>>>>                     ret = tun_vnet_hdr_put(vnet_hdr_sz, iter, &gso);
> >>>>>>>> @@ -2046,9 +2051,10 @@ static ssize_t tun_put_user(struct tun_struct *tun,
> >>>>>>>>             }
> >>>>>>>>
> >>>>>>>>             if (vnet_hdr_sz) {
> >>>>>>>> -               struct virtio_net_hdr gso;
> >>>>>>>> +               struct virtio_net_hdr_v1_hash gso;
> >>>>>>>>
> >>>>>>>> -               ret = tun_vnet_hdr_from_skb(tun->flags, tun->dev, skb, &gso);
> >>>>>>>> +               ret = tun_vnet_hdr_from_skb(vnet_hdr_sz, tun->flags, tun->dev,
> >>>>>>>> +                                           skb, tun_find_hash, &gso);
> >>>>>>>>                     if (ret)
> >>>>>>>>                             return ret;
> >>>>>>>>
> >>>>>>>> diff --git a/drivers/net/tun_vnet.h b/drivers/net/tun_vnet.h
> >>>>>>>> index 58b9ac7a5fc4..45d0533efc8d 100644
> >>>>>>>> --- a/drivers/net/tun_vnet.h
> >>>>>>>> +++ b/drivers/net/tun_vnet.h
> >>>>>>>> @@ -6,6 +6,17 @@
> >>>>>>>>      #define TUN_VNET_LE     0x80000000
> >>>>>>>>      #define TUN_VNET_BE     0x40000000
> >>>>>>>>
> >>>>>>>> +typedef struct virtio_net_hash *(*tun_vnet_hash_add)(struct sk_buff *);
> >>>>>>>> +typedef const struct virtio_net_hash *(*tun_vnet_hash_find)(const struct sk_buff *);
> >>>>>>>> +
> >>>>>>>> +struct tun_vnet_hash {
> >>>>>>>> +       bool report;
> >>>>>>>> +       bool rss;
> >>>>>>>> +       struct tun_vnet_rss common;
> >>>>>>>> +       u32 rss_key[VIRTIO_NET_RSS_MAX_KEY_SIZE];
> >>>>>>>> +       u16 rss_indirection_table[];
> >>>>>>>> +};
> >>>>>>>> +
> >>>>>>>>      static inline bool tun_vnet_legacy_is_little_endian(unsigned int flags)
> >>>>>>>>      {
> >>>>>>>>             bool be = IS_ENABLED(CONFIG_TUN_VNET_CROSS_LE) &&
> >>>>>>>> @@ -107,6 +118,128 @@ static inline long tun_vnet_ioctl(int *vnet_hdr_sz, unsigned int *flags,
> >>>>>>>>             }
> >>>>>>>>      }
> >>>>>>>>
> >>>>>>>> +static inline long tun_vnet_ioctl_gethashtypes(u32 __user *argp)
> >>>>>>>> +{
> >>>>>>>> +       return put_user(VIRTIO_NET_SUPPORTED_HASH_TYPES, argp) ? -EFAULT : 0;
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> +static inline long tun_vnet_ioctl_sethash(struct tun_vnet_hash __rcu **hashp,
> >>>>>>>> +                                         unsigned int cmd,
> >>>>>>>> +                                         void __user *argp)
> >>>>>>>> +{
> >>>>>>>> +       struct tun_vnet_rss common;
> >>>>>>>> +       struct tun_vnet_hash *hash;
> >>>>>>>> +       size_t indirection_table_size;
> >>>>>>>> +       size_t key_size;
> >>>>>>>> +       size_t size;
> >>>>>>>> +
> >>>>>>>> +       switch (cmd) {
> >>>>>>>> +       case TUNSETVNETREPORTINGAUTOMQ:
> >>>>>>>> +               if (get_user(common.hash_types, (u32 __user *)argp))
> >>>>>>>> +                       return -EFAULT;
> >>>>>>>> +
> >>>>>>>> +               if (common.hash_types) {
> >>>>>>>> +                       hash = kzalloc(sizeof(*hash), GFP_KERNEL);
> >>>>>>>> +                       if (!hash)
> >>>>>>>> +                               return -ENOMEM;
> >>>>>>>> +
> >>>>>>>> +                       hash->report = true;
> >>>>>>>> +                       hash->common.hash_types = common.hash_types;
> >>>>>>>> +               } else {
> >>>>>>>> +                       hash = NULL;
> >>>>>>>> +               }
> >>>>>>>> +               break;
> >>>>>>>> +
> >>>>>>>> +       case TUNSETVNETREPORTINGRSS:
> >>>>>>>> +       case TUNSETVNETRSS:
> >>>>>>>
> >>>>>>> So the above three shows unnecessary design redundancy as well as a
> >>>>>>> burden for the future extension.  Why not simply have
> >>>>>>>
> >>>>>>> 1) TUNSETVNET_RSS
> >>>>>>> 2) TUNSETVNET_HASH_REPORT
> >>>>>>> ?
> >>>>>>>
> >>>>>>> Which maps to
> >>>>>>>
> >>>>>>>      #define VIRTIO_NET_CTRL_MQ_RSS_CONFIG          1 (for configurable
> >>>>>>> receive steering)
> >>>>>>>      #define VIRTIO_NET_CTRL_MQ_HASH_CONFIG         2 (for configurable
> >>>>>>> hash calculation)
> >>>>>>>
> >>>>>>> It would be always easier to start with what spec had or at least we
> >>>>>>> need to explain why we choose a different design here or in the
> >>>>>>> changelog to ease our life.
> >>>>>>
> >>>>>> TUNSETVNETREPORTINGAUTOMQ maps to VIRTIO_NET_CTRL_MQ_HASH_CONFIG.
> >>>>>
> >>>>> It's not:
> >>>>>
> >>>>> VIRTIO_NET_CTRL_MQ_HASH_CONFIG uses:
> >>>>>
> >>>>> struct virtio_net_hash_config {
> >>>>>        le32 hash_types;
> >>>>>        le16 reserved[4];
> >>>>>        u8 hash_key_length;
> >>>>>        u8 hash_key_data[hash_key_length];
> >>>>> };
> >>>>>
> >>>>> but TUNSETVNETREPORTINGAUTOMQ only accepts hash_types without others:
> >>>>
> >>>> The others are not present because the spec doesn't specify what to do
> >>>> with them and the kernel doesn't use them either.
> >>>
> >>> Did you mean the hash_key_length and hash_key_data? Note that we have
> >>> drivers other than the Linux ones as well.
> >>
> >> And reserved. Drivers can set whatever to these fields. It is not
> >> specified how these fields should be used.
> >>
> >>>
> >>>>
> >>>>>
> >>>>>>
> >>>>>> TUNSETVNETREPORTINGRSS and TUNSETVNETRSS map to
> >>>>>> VIRTIO_NET_CTRL_MQ_RSS_CONFIG.
> >>>>>
> >>>>> I think we've already had a discussion about this.
> >>>>>
> >>>>> Reusing virtio-net uAPI is much better instead of having a tun
> >>>>> specific one considering tun may need to support more virtio commands
> >>>>> in the future. Or maybe it's the time to introduce a transport for the
> >>>>> virtio control virtqueue uAPI in tuntap to avoid inventing new uAPI
> >>>>> endlessly.
> >>>>>
> >>>>> What's more I see:
> >>>>>
> >>>>> struct tun_vnet_rss {
> >>>>>            __u32 hash_types;
> >>>>>            __u16 indirection_table_mask;
> >>>>>            __u16 unclassified_queue;
> >>>>> };
> >>>>>
> >>>>> struct tun_vnet_hash {
> >>>>>            bool report;
> >>>>>            bool rss;
> >>>>>            struct tun_vnet_rss common;
> >>>>>            u32 rss_key[VIRTIO_NET_RSS_MAX_KEY_SIZE];
> >>>>>            u16 rss_indirection_table[];
> >>>>> };
> >>>>>
> >>>>> As I pointed out in the past, let's just decouple the rss from hash,
> >>>>> everything would be much simpler, or you need to explain why you
> >>>>> couple this somewhere.
> >>>>>
> >>>>> For example:
> >>>>>
> >>>>> 1) why is the tun_vnet_hash not part of the uAPI but tun_vnet_rss, or
> >>>>> how could userspace know what kind of format it would use for
> >>>>> TUNSETVNETREPORTINGRSS?
> >>>>
> >>>> That was the previous version.
> >>>>
> >>>>> 2) what's the advantages of embedding rss specific stuff into hash
> >>>>> report structure
> >>>>
> >>>> Because the hash types field in struct tun_vnet_rss is used by hash
> >>>> reporting too.
> >>>>
> >>>>> 3) what's the advantages of not using virtio-net uAPI
> >>>>
> >>>> 1. The use cases that don't involve VM will be simplified; programs for
> >>>> such a use case will not need to convert endian or to fill fileds the
> >>>> kernel doesn't care.
> >>>
> >>> Well, virtio_net_hdr is used by packet socket as well. Considering the
> >>> complexity of designing a new uAPI, it's still better.
> >>
> >> This patch series also reuses the datapath, following the prior examples.
> >>
> >>>
> >>> Or maybe you can clarify which field that kernel doesn't care about?
> >>> In this case TUN/TAP is basically the device datapath, if some of the
> >>> fields don't make sense, it's a bug of the spec.
> >>
> >> reserved, hash_key_length, and hash_key_data.
> >
> > I may miss something when RSS is not negotiated, hash_key_length, and
> > hash_key_data is necessary, otherwise how could we calculate the hash?
>
> I was not clear that I was referring to the fields of struct
> virtio_net_hash_config. struct virtio_net_rss_config provides
> hash_key_length and hash_key_data for RSS.
>
> >
> >>
> >>>
> >>>> 2. It aligns with existing UAPIs that operate in native endian and don't
> >>>> use virtio-net data structures like TUNSETOFFLOAD and TUNSETVNETHDRSZ.
> >>>
> >>> For those two examples, it would be used by guests directly. This is
> >>> different from RSS stuff.
> >>
> >> They are mediated by the VMM, which is no different from RSS.
> >
> > Not necessarily, e,g Qemu support vDPA control virtqueue passthrough.
>
> TUNSETOFFLOAD and TUNSETVNETHDRSZ are not used with vDPA, are they?

They aren't but I mean hash/rss config.

>
> >
> >>
> >>>
> >>> With native endian, you need an endian conversation that converts le to native.
> >>
> >> That's true, but QEMU does so anyway to validate the configuration, to
> >> attach/detach queues, and to share the data structures with userspace
> >> RSS implementations. I expect other VMMs will do so too.
> >
> > See above.
> >
> >>
> >>>
> >>>>
> >>>>>
> >>>>> More issues:
> >>>>>
> >>>>> 1) max_tx_vq is ignored, so do you expect the userspace to intercept
> >>>>> this and switch to using TUNSETQUEUE? This seems like a burden as TUN
> >>>>> can simply accept it and do the attaching/detaching by itself
> >>>>> 2) the rx depends on the indirection table, so userspace need to
> >>>>> intercept the indirection and change the rx queue numbers accordingly
> >>>>> 3) RSS allows a async TX/RX queue, this is not supported by TUN now,
> >>>>> no matter if we decide to let userspace to intercept max_tx_vq or not,
> >>>>> we need to implement it first
> >>>>    > > Things would be much more simpler, if kernel can do 1) and 2).
> >>>>
> >>>> Attaching and detaching queues is not possible for the kernel because it
> >>>> doesn't know what file descriptors that map to queues will be used by
> >>>> the userspace.
> >>>
> >>> The kernel knows, tfile has a queue_index part.
> >>
> >> queue_index is set with TUNSETQUEUE so we need the ioctl.
> >
> > queue_index would be reshuffle during attaching/detaching since the
> > netdev core forbids sparse active queue indices.
> >
> > But I don't see it's an issue since we are talking about introducing
> > new uAPI here.
>
> If queue_index is not usable, how can we pick queues for the new UAPI?

But queue_index is contiguous, so we can use the first several ones.

>
> >
> > If it doesn't work, it doesn't change the point, a new uAPI is needed
> > since RSS requires async tx/rx queue numbers, current TUN only allows
> > combined queue pairs.
>
> TUN can have more tx/rx queues than the guest requests, so a VMM can
> take the maximum of TX and RX queue numbers to derive the number of
> queue pairs when the guest reqeusts async tx/rx queue numbers.

I think we are talking about the same issue, I mean anyhow you need a
new uAPI to tell the kernel it needs to start async tx/rx. This is not
supported in the current kernel.

>
> >
> >>
> >>>
> >>>>
> >>>> The following patch does 2) for QEMU:
> >>>> https://lore.kernel.org/qemu-devel/20250322-vq-v2-1-cee0aafe6404@daynix.com/
> >>>
> >>> See below point, form the view of the kernel, it's still a queue pair
> >>> not async TX/RX queue.
> >>>
> >>>>
> >>>> For 3), the patch for QEMU takes the maximum of TX and RX queue numbers
> >>>> to derive the number of queue pairs.
> >>>>
> >>>>>
> >>>>>> We have two ioctls here because
> >>>>>> VIRTIO_NET_CTRL_MQ_RSS_CONFIG behaves differently depending on whether
> >>>>>> VIRTIO_NET_F_HASH_REPORT is negotiated or not;
> >>>>>
> >>>>> It wouldn't be a problem if you do 1:1 mapping between virtio commands
> >>>>> and TUN uAPI, otherwise it should have a bug somewhere.
> >>>>
> >>>> Speaking of 1:1 mapping, it is possible to map VIRTIO_NET_F_HASH_REPORT
> >>>> into another ioctl. It may help add another receive steering algorithm
> >>>> in the future by not requiring two ioctls (TUNSETVNETREPORTING_X and
> >>>> TUNSETVNET_X).
> >>>
> >>> Yes and as I pointed out, virtio_net_hash_config should not be
> >>> specific to automq, it can work for other steering algorithm as well.
> >>
> >> That's not what the virtio spec says, so it will not be 1:1 mapping
> >> between virtio commands and TUN uAPI.
> >
> > That's only because the spec only supports RSS and AUTOMQ so far. Or
> > do we expect a new virtio_net_XXX_hash_config for the new steering
> > algorithm?
>
> No. RSS only needs struct virtio_net_rss_config for hash configuration,
> and a new steering algorithm will only need one struct. For example, if
> the spec is to gain siphash, we will need to add struct
> virtio_net_siphash_config.

I think there're several things. We should decouple hash from the steering.

New steering algorithm doesn't necessarily hash or new hash algorithm.

>
> struct virtio_net_hash_config is only for automq.
>
> >
> >>
> >>>
> >>>>
> >>>>>
> >>>>>> it also enables hash
> >>>>>> reporting if the feature is negotiated.
> >>>>>
> >>>>> Again, starting from virtio-net command is easier, a strong
> >>>>> justification is needed to explain why we choose another for tun/tap.
> >>>>>
> >>>>>>
> >>>>>>>
> >>>>>>> One day we would have tun_select_queue_algorithm_x() we don't have to
> >>>>>>> duplicate the ioctls once again here like TUNSETVNETREPORTINGXYZ
> >>>>>>
> >>>>>> 5.1.6.5.6.4 Hash calculation says:
> >>>>>>>     If VIRTIO_NET_F_HASH_REPORT was negotiated and the device uses
> >>>>>>> automatic receive steering, the device MUST support a command to
> >>>>>>> configure hash calculation parameters.
> >>>>>>>
> >>>>>>> The driver provides parameters for hash calculation as follows:
> >>>>>>>
> >>>>>>> class VIRTIO_NET_CTRL_MQ, command VIRTIO_NET_CTRL_MQ_HASH_CONFIG.
> >>>>>>
> >>>>>> VIRTIO_NET_CTRL_MQ_HASH_CONFIG is for automatic receive steering and not
> >>>>>> for RSS (or other steering algorithms if the spec gets any in the future).
> >>>>>
> >>>>> I'm not sure but the spec needs some tweaking. For example, I don't
> >>>>> expect there would be a dedicated hash config command for flow filters
> >>>>> in the future. I think the reason why spec says like this is that
> >>>>> virtio-net only supports automatic receive steering.
> >>>>>
> >>>>> Note that macvtap doesn't implement automatic receive steering.
> >>>>
> >>>> QEMU advertises VIRTIO_NET_F_CTRL_VQ for macvtap too, so it should have
> >>>> implemented it. I think QEMU with macvtap still compliant to the spec.
> >>>
> >>> Compliant, but automatic traffic steering is the best effort as well.
> >>>
> >>> Nope, TUN/TAP implements a flow cache that can steer tx based on rx.
> >>> Macvtap simply uses hash here.
> >>>
> >>>>
> >>>> "5.1.6.5.6 Automatic receive steering in multiqueue mode" says:
> >>>>
> >>>>    > After the driver transmitted a packet of a flow on transmitqX, the
> >>>>    > device SHOULD cause incoming packets for that flow to be steered to
> >>>>    > receiveqX.
> >>>>
> >>>> It is "SHOULD", so it is still compliant if the device doesn't properly
> >>>> respect the flow.
> >>>
> >>> Yes, a quality of implementation, or it's impractical to support a
> >>> correct steering for this device as limited resources and mailious
> >>> users can do syn flood etc.
> >>>
> >>>>
> >>>>>
> >>>>>>
> >>>>>>>
> >>>>>>>> +               if (copy_from_user(&common, argp, sizeof(common)))
> >>>>>>>> +                       return -EFAULT;
> >>>>>>>> +               argp = (struct tun_vnet_rss __user *)argp + 1;
> >>>>>>>> +
> >>>>>>>> +               indirection_table_size = ((size_t)common.indirection_table_mask + 1) * 2;
> >>>>>>>> +               key_size = virtio_net_hash_key_length(common.hash_types);
> >>>>>>>> +               size = struct_size(hash, rss_indirection_table,
> >>>>>>>> +                                  (size_t)common.indirection_table_mask + 1);
> >>>>>>>> +
> >>>>>>>> +               hash = kmalloc(size, GFP_KERNEL);
> >>>>>>>> +               if (!hash)
> >>>>>>>> +                       return -ENOMEM;
> >>>>>>>> +
> >>>>>>>> +               if (copy_from_user(hash->rss_indirection_table,
> >>>>>>>> +                                  argp, indirection_table_size)) {
> >>>>>>>> +                       kfree(hash);
> >>>>>>>> +                       return -EFAULT;
> >>>>>>>> +               }
> >>>>>>>> +               argp = (u16 __user *)argp + common.indirection_table_mask + 1;
> >>>>>>>> +
> >>>>>>>> +               if (copy_from_user(hash->rss_key, argp, key_size)) {
> >>>>>>>> +                       kfree(hash);
> >>>>>>>> +                       return -EFAULT;
> >>>>>>>> +               }
> >>>>>>>> +
> >>>>>>>> +               virtio_net_toeplitz_convert_key(hash->rss_key, key_size);
> >>>>>>>> +               hash->report = cmd == TUNSETVNETREPORTINGRSS;
> >>>>>>>
> >>>>>>> At least, if this is the only difference why not simply code this into
> >>>>>>> the ioctl itself other than having a very similar command?
> >>>>>>
> >>>>>> It is what the previous version did. Either is fine I guess; the only
> >>>>>> practical difference would be the size of the configuration struct is
> >>>>>> smaller with this approach.
> >>>>>>
> >>>>>>>
> >>>>>>>> +               hash->rss = true;
> >>>>>>>> +               hash->common = common;
> >>>>>>>> +               break;
> >>>>>>>> +
> >>>>>>>> +       default:
> >>>>>>>> +               return -EINVAL;
> >>>>>>>> +       }
> >>>>>>>> +
> >>>>>>>> +       kfree_rcu_mightsleep(rcu_replace_pointer_rtnl(*hashp, hash));
> >>>>>>>> +       return 0;
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> +static inline void tun_vnet_hash_report(const struct tun_vnet_hash *hash,
> >>>>>>>> +                                       struct sk_buff *skb,
> >>>>>>>> +                                       const struct flow_keys_basic *keys,
> >>>>>>>> +                                       u32 value,
> >>>>>>>> +                                       tun_vnet_hash_add vnet_hash_add)
> >>>>>>>> +{
> >>>>>>>> +       struct virtio_net_hash *report;
> >>>>>>>> +
> >>>>>>>> +       if (!hash || !hash->report)
> >>>>>>>> +               return;
> >>>>>>>> +
> >>>>>>>> +       report = vnet_hash_add(skb);
> >>>>>>>> +       if (!report)
> >>>>>>>> +               return;
> >>>>>>>> +
> >>>>>>>> +       *report = (struct virtio_net_hash) {
> >>>>>>>> +               .report = virtio_net_hash_report(hash->common.hash_types, keys),
> >>>>>>>> +               .value = value
> >>>>>>>> +       };
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> +static inline u16 tun_vnet_rss_select_queue(u32 numqueues,
> >>>>>>>> +                                           const struct tun_vnet_hash *hash,
> >>>>>>>> +                                           struct sk_buff *skb,
> >>>>>>>> +                                           tun_vnet_hash_add vnet_hash_add)
> >>>>>>>> +{
> >>>>>>>> +       struct virtio_net_hash *report;
> >>>>>>>> +       struct virtio_net_hash ret;
> >>>>>>>> +       u16 index;
> >>>>>>>> +
> >>>>>>>> +       if (!numqueues)
> >>>>>>>> +               return 0;
> >>>>>>>> +
> >>>>>>>> +       virtio_net_hash_rss(skb, hash->common.hash_types, hash->rss_key, &ret);
> >>>>>>>> +
> >>>>>>>> +       if (!ret.report)
> >>>>>>>> +               return hash->common.unclassified_queue % numqueues;
> >>>>>>>> +
> >>>>>>>> +       if (hash->report) {
> >>>>>>>> +               report = vnet_hash_add(skb);
> >>>>>>>> +               if (report)
> >>>>>>>> +                       *report = ret;
> >>>>>>>> +       }
> >>>>>>>> +
> >>>>>>>> +       index = ret.value & hash->common.indirection_table_mask;
> >>>>>>>> +
> >>>>>>>> +       return hash->rss_indirection_table[index] % numqueues;
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>>      static inline int tun_vnet_hdr_get(int sz, unsigned int flags,
> >>>>>>>>                                        struct iov_iter *from,
> >>>>>>>>                                        struct virtio_net_hdr *hdr)
> >>>>>>>> @@ -135,15 +268,17 @@ static inline int tun_vnet_hdr_get(int sz, unsigned int flags,
> >>>>>>>>      }
> >>>>>>>>
> >>>>>>>>      static inline int tun_vnet_hdr_put(int sz, struct iov_iter *iter,
> >>>>>>>> -                                  const struct virtio_net_hdr *hdr)
> >>>>>>>> +                                  const struct virtio_net_hdr_v1_hash *hdr)
> >>>>>>>>      {
> >>>>>>>> +       int content_sz = MIN(sizeof(*hdr), sz);
> >>>>>>>> +
> >>>>>>>>             if (unlikely(iov_iter_count(iter) < sz))
> >>>>>>>>                     return -EINVAL;
> >>>>>>>>
> >>>>>>>> -       if (unlikely(copy_to_iter(hdr, sizeof(*hdr), iter) != sizeof(*hdr)))
> >>>>>>>> +       if (unlikely(copy_to_iter(hdr, content_sz, iter) != content_sz))
> >>>>>>>>                     return -EFAULT;
> >>>>>>>>
> >>>>>>>> -       if (iov_iter_zero(sz - sizeof(*hdr), iter) != sz - sizeof(*hdr))
> >>>>>>>> +       if (iov_iter_zero(sz - content_sz, iter) != sz - content_sz)
> >>>>>>>>                     return -EFAULT;
> >>>>>>>>
> >>>>>>>>             return 0;
> >>>>>>>> @@ -155,26 +290,38 @@ static inline int tun_vnet_hdr_to_skb(unsigned int flags, struct sk_buff *skb,
> >>>>>>>>             return virtio_net_hdr_to_skb(skb, hdr, tun_vnet_is_little_endian(flags));
> >>>>>>>>      }
> >>>>>>>>
> >>>>>>>> -static inline int tun_vnet_hdr_from_skb(unsigned int flags,
> >>>>>>>> +static inline int tun_vnet_hdr_from_skb(int sz, unsigned int flags,
> >>>>>>>>                                             const struct net_device *dev,
> >>>>>>>>                                             const struct sk_buff *skb,
> >>>>>>>> -                                       struct virtio_net_hdr *hdr)
> >>>>>>>> +                                       tun_vnet_hash_find vnet_hash_find,
> >>>>>>>> +                                       struct virtio_net_hdr_v1_hash *hdr)
> >>>>>>>>      {
> >>>>>>>>             int vlan_hlen = skb_vlan_tag_present(skb) ? VLAN_HLEN : 0;
> >>>>>>>> +       const struct virtio_net_hash *report = sz < sizeof(struct virtio_net_hdr_v1_hash) ?
> >>>>>>>> +                                              NULL : vnet_hash_find(skb);
> >>>>>>>> +
> >>>>>>>> +       *hdr = (struct virtio_net_hdr_v1_hash) {
> >>>>>>>> +               .hash_report = VIRTIO_NET_HASH_REPORT_NONE
> >>>>>>>> +       };
> >>>>>>>> +
> >>>>>>>> +       if (report) {
> >>>>>>>> +               hdr->hash_value = cpu_to_le32(report->value);
> >>>>>>>> +               hdr->hash_report = cpu_to_le16(report->report);
> >>>>>>>> +       }
> >>>>>>>>
> >>>>>>>> -       if (virtio_net_hdr_from_skb(skb, hdr,
> >>>>>>>> +       if (virtio_net_hdr_from_skb(skb, (struct virtio_net_hdr *)hdr,
> >>>>>>>>                                         tun_vnet_is_little_endian(flags), true,
> >>>>>>>>                                         vlan_hlen)) {
> >>>>>>>>                     struct skb_shared_info *sinfo = skb_shinfo(skb);
> >>>>>>>>
> >>>>>>>>                     if (net_ratelimit()) {
> >>>>>>>>                             netdev_err(dev, "unexpected GSO type: 0x%x, gso_size %d, hdr_len %d\n",
> >>>>>>>> -                                  sinfo->gso_type, tun_vnet16_to_cpu(flags, hdr->gso_size),
> >>>>>>>> -                                  tun_vnet16_to_cpu(flags, hdr->hdr_len));
> >>>>>>>> +                                  sinfo->gso_type, tun_vnet16_to_cpu(flags, hdr->hdr.gso_size),
> >>>>>>>> +                                  tun_vnet16_to_cpu(flags, hdr->hdr.hdr_len));
> >>>>>>>>                             print_hex_dump(KERN_ERR, "tun: ",
> >>>>>>>>                                            DUMP_PREFIX_NONE,
> >>>>>>>>                                            16, 1, skb->head,
> >>>>>>>> -                                      min(tun_vnet16_to_cpu(flags, hdr->hdr_len), 64), true);
> >>>>>>>> +                                      min(tun_vnet16_to_cpu(flags, hdr->hdr.hdr_len), 64), true);
> >>>>>>>>                     }
> >>>>>>>>                     WARN_ON_ONCE(1);
> >>>>>>>>                     return -EINVAL;
> >>>>>>>> diff --git a/include/uapi/linux/if_tun.h b/include/uapi/linux/if_tun.h
> >>>>>>>> index 980de74724fc..fe4b984d3bbb 100644
> >>>>>>>> --- a/include/uapi/linux/if_tun.h
> >>>>>>>> +++ b/include/uapi/linux/if_tun.h
> >>>>>>>> @@ -62,6 +62,62 @@
> >>>>>>>>      #define TUNSETCARRIER _IOW('T', 226, int)
> >>>>>>>>      #define TUNGETDEVNETNS _IO('T', 227)
> >>>>>>>>
> >>>>>>>> +/**
> >>>>>>>> + * define TUNGETVNETHASHTYPES - ioctl to get supported virtio_net hashing types
> >>>>>>>> + *
> >>>>>>>> + * The argument is a pointer to __u32 which will store the supported virtio_net
> >>>>>>>> + * hashing types.
> >>>>>>>> + */
> >>>>>>>> +#define TUNGETVNETHASHTYPES _IOR('T', 228, __u32)
> >>>>>>>> +
> >>>>>>>> +/**
> >>>>>>>> + * define TUNSETVNETREPORTINGAUTOMQ - ioctl to enable automq with hash reporting
> >>>>>>>> + *
> >>>>>>>> + * Disable RSS and enable automatic receive steering with hash reporting.
> >>>>>>>> + *
> >>>>>>>> + * The argument is a pointer to __u32 that contains a bitmask of hash types
> >>>>>>>> + * allowed to be reported.
> >>>>>>>> + *
> >>>>>>>> + * This ioctl results in %EBADFD if the underlying device is deleted. It affects
> >>>>>>>> + * all queues attached to the same device.
> >>>>>>>> + *
> >>>>>>>> + * This ioctl currently has no effect on XDP packets and packets with
> >>>>>>>> + * queue_mapping set by TC.
> >>>>>>>> + */
> >>>>>>>> +#define TUNSETVNETREPORTINGAUTOMQ _IOR('T', 229, __u32)
> >>>>>>>> +
> >>>>>>>> +/**
> >>>>>>>> + * define TUNSETVNETREPORTINGRSS - ioctl to enable RSS with hash reporting
> >>>>>>>> + *
> >>>>>>>> + * Disable automatic receive steering and enable RSS with hash reporting.
> >>>>>>>
> >>>>>>> This is unnecessary, e.g one day will have select_queue_xyz(), we
> >>>>>>> don't want to say "Disable automatic receive steering and xyz ..."
> >>>>>>
> >>>>>> It is still something better to be documented as its behavior is
> >>>>>> somewhat complicated.
> >>>>>
> >>>>> This is a hint of uAPI design issue.
> >>>>>
> >>>>>>
> >>>>>> Concretely, this ioctl disables automatic receive steering but doesn't
> >>>>>> disable steering by eBPF, which is implied by TUN_STEERINGEBPF_FALLBACK.
> >>>>>
> >>>>> It would be simpler:
> >>>>>
> >>>>> 1) not having TUN_STEERINGEBPF_FALLBACK
> >>>>> 2) the steering algorithm depends on the last uAPI call
> >>>>
> >>>> What will TUNSETSTEERINGEBPF with NULL mean in that case? It currently
> >>>> switches the steering algorithm to automq.
> >>>
> >>> A stackwise semantic then?
> >>
> >> Can you clarify the semantics with an example of a set of ioctls?
> >> Perhaps it is an easy way to demonstrate an alternative design idea.
> >
> > Consider user do:
> >
> > 1) TUNSETQUEUE /* enable automq, push */
> > 2) TUNSETSTEERINGEBPF /* enable steering ebp, push */
> > 3) TUNSETETTERINGEBPF to NULL /* disable steering ebpf, pop */
> >
> > Automq is in the stack top, so TUN will use that.
>
> In that case, what will happen if the user does:
>
> 1) TUNSETQUEUE
> 2) TUNSETVNETRSS
> 3) TUNSETETTERINGEBPF to NULL

RSS I guess since 3) will cause an error?

Or we can introduce new uAPI for setting a exclusively steering algorithm.

Thanks

>
> Regards,
> Akihiko Odaki
>


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next v12 01/10] virtio_net: Add functions for hashing
  2025-06-22 17:40                     ` Yuri Benditovich
@ 2025-06-23  8:07                       ` Jason Wang
  2025-06-23 14:28                         ` Yuri Benditovich
  0 siblings, 1 reply; 39+ messages in thread
From: Jason Wang @ 2025-06-23  8:07 UTC (permalink / raw)
  To: Yuri Benditovich
  Cc: Akihiko Odaki, Michael S. Tsirkin, Jonathan Corbet,
	Willem de Bruijn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Xuan Zhuo, Shuah Khan, linux-doc, linux-kernel,
	netdev, kvm, virtualization, linux-kselftest, Andrew Melnychenko,
	Stephen Hemminger, gur.stavi, Lei Yang, Simon Horman

On Mon, Jun 23, 2025 at 1:40 AM Yuri Benditovich
<yuri.benditovich@daynix.com> wrote:
>
> > Yuri, can you help to clarify this?
>
> I see here several questions:
> 1. Whether it is ok for the device not to indicate support for XXX_EX hash type?
> - I think, yes (strictly speaking, it was better to test that before
> submitting the patches )
> 2. Is it possible that the guest will enable some XXX_EX hash type if
> the device does not indicate that it is supported?
> - No (I think this is part of the spec)

There's another question, is the device allowed to fallback to
VIRTIO_NET_HASH_TYPE_IPv6 if it fails to parse extensions?

> 3. What to do if we migrate between systems with different
> capabilities of hash support/reporting/whatever
> - IMO, at this moment such case should be excluded and only mechanism
> we have for that is the compatible machine version
> - in some future the change of device capabilities can be communicated
> to the driver and _probably_ the driver might be able to communicate
> the change of device capabilities to the OS

Are you suggesting implementing all hash types? Note that Akihiko
raises the issue that in the actual implementation there should be a
limitation of the maximum number of options. If such a limitation is
different between src and dst, the difference could be noticed by the
guest.

> 4. Does it make sense to have fine configuration of hash types mask
> via command-line?
> - IMO, no. This would require the user to have too much knowledge
> about RSS internals
>
> Please let me know if I missed something.
>

Thanks


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next v12 01/10] virtio_net: Add functions for hashing
  2025-06-23  8:07                       ` Jason Wang
@ 2025-06-23 14:28                         ` Yuri Benditovich
  2025-06-24  0:49                           ` Jason Wang
  0 siblings, 1 reply; 39+ messages in thread
From: Yuri Benditovich @ 2025-06-23 14:28 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, Jonathan Corbet, Willem de Bruijn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Xuan Zhuo, Shuah Khan, linux-doc, linux-kernel, netdev, kvm,
	virtualization, linux-kselftest, Andrew Melnychenko,
	Stephen Hemminger, gur.stavi, Lei Yang, Simon Horman,
	Akihiko Odaki

On Mon, Jun 23, 2025 at 11:07 AM Jason Wang <jasowang@redhat.com> wrote:
>
> On Mon, Jun 23, 2025 at 1:40 AM Yuri Benditovich
> <yuri.benditovich@daynix.com> wrote:
> >
> > > Yuri, can you help to clarify this?
> >
> > I see here several questions:
> > 1. Whether it is ok for the device not to indicate support for XXX_EX hash type?
> > - I think, yes (strictly speaking, it was better to test that before
> > submitting the patches )
> > 2. Is it possible that the guest will enable some XXX_EX hash type if
> > the device does not indicate that it is supported?
> > - No (I think this is part of the spec)
>
> There's another question, is the device allowed to fallback to
> VIRTIO_NET_HASH_TYPE_IPv6 if it fails to parse extensions?
MSFT expectations for that are at
https://learn.microsoft.com/en-us/windows-hardware/drivers/network/rss-hashing-types
If I read them correctly, the answer is "no"
BTW, my personal opinion is that placing all these things with hash
calculations into kernel instead of ebpf does not make too much sense.

>
> > 3. What to do if we migrate between systems with different
> > capabilities of hash support/reporting/whatever
> > - IMO, at this moment such case should be excluded and only mechanism
> > we have for that is the compatible machine version
> > - in some future the change of device capabilities can be communicated
> > to the driver and _probably_ the driver might be able to communicate
> > the change of device capabilities to the OS
>
> Are you suggesting implementing all hash types? Note that Akihiko
> raises the issue that in the actual implementation there should be a
> limitation of the maximum number of options. If such a limitation is
> different between src and dst, the difference could be noticed by the
> guest.
>
> > 4. Does it make sense to have fine configuration of hash types mask
> > via command-line?
> > - IMO, no. This would require the user to have too much knowledge
> > about RSS internals
> >
> > Please let me know if I missed something.
> >
>
> Thanks
>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next v12 01/10] virtio_net: Add functions for hashing
  2025-06-23 14:28                         ` Yuri Benditovich
@ 2025-06-24  0:49                           ` Jason Wang
  0 siblings, 0 replies; 39+ messages in thread
From: Jason Wang @ 2025-06-24  0:49 UTC (permalink / raw)
  To: Yuri Benditovich
  Cc: Michael S. Tsirkin, Jonathan Corbet, Willem de Bruijn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Xuan Zhuo, Shuah Khan, linux-doc, linux-kernel, netdev, kvm,
	virtualization, linux-kselftest, Andrew Melnychenko,
	Stephen Hemminger, gur.stavi, Lei Yang, Simon Horman,
	Akihiko Odaki

On Mon, Jun 23, 2025 at 10:28 PM Yuri Benditovich
<yuri.benditovich@daynix.com> wrote:
>
> On Mon, Jun 23, 2025 at 11:07 AM Jason Wang <jasowang@redhat.com> wrote:
> >
> > On Mon, Jun 23, 2025 at 1:40 AM Yuri Benditovich
> > <yuri.benditovich@daynix.com> wrote:
> > >
> > > > Yuri, can you help to clarify this?
> > >
> > > I see here several questions:
> > > 1. Whether it is ok for the device not to indicate support for XXX_EX hash type?
> > > - I think, yes (strictly speaking, it was better to test that before
> > > submitting the patches )
> > > 2. Is it possible that the guest will enable some XXX_EX hash type if
> > > the device does not indicate that it is supported?
> > > - No (I think this is part of the spec)
> >
> > There's another question, is the device allowed to fallback to
> > VIRTIO_NET_HASH_TYPE_IPv6 if it fails to parse extensions?
> MSFT expectations for that are at
> https://learn.microsoft.com/en-us/windows-hardware/drivers/network/rss-hashing-types
> If I read them correctly, the answer is "no"

Ok, so I guess it implies the implementation should be ready to deal
with arbitrary length of ipv6 options.

> BTW, my personal opinion is that placing all these things with hash
> calculations into kernel instead of ebpf does not make too much sense.

If I remember correctly, we tried to enable it via eBPF, but failed
due to the rejection of eBPF maintainers.

Maybe we can revisit the idea. But anyhow the hardcoded logic might
still be useful as eBPF is not guaranteed to work in all cases.

Thanks

>
> >
> > > 3. What to do if we migrate between systems with different
> > > capabilities of hash support/reporting/whatever
> > > - IMO, at this moment such case should be excluded and only mechanism
> > > we have for that is the compatible machine version
> > > - in some future the change of device capabilities can be communicated
> > > to the driver and _probably_ the driver might be able to communicate
> > > the change of device capabilities to the OS
> >
> > Are you suggesting implementing all hash types? Note that Akihiko
> > raises the issue that in the actual implementation there should be a
> > limitation of the maximum number of options. If such a limitation is
> > different between src and dst, the difference could be noticed by the
> > guest.
> >
> > > 4. Does it make sense to have fine configuration of hash types mask
> > > via command-line?
> > > - IMO, no. This would require the user to have too much knowledge
> > > about RSS internals
> > >
> > > Please let me know if I missed something.
> > >
> >
> > Thanks
> >
>


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next v12 00/10] tun: Introduce virtio-net hashing feature
  2025-05-30  4:50 [PATCH net-next v12 00/10] tun: Introduce virtio-net hashing feature Akihiko Odaki
                   ` (9 preceding siblings ...)
  2025-05-30  4:50 ` [PATCH net-next v12 10/10] vhost/net: Support VIRTIO_NET_F_HASH_REPORT Akihiko Odaki
@ 2025-07-16  8:55 ` Michael S. Tsirkin
  10 siblings, 0 replies; 39+ messages in thread
From: Michael S. Tsirkin @ 2025-07-16  8:55 UTC (permalink / raw)
  To: Akihiko Odaki
  Cc: Jonathan Corbet, Willem de Bruijn, Jason Wang, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Xuan Zhuo, Shuah Khan,
	linux-doc, linux-kernel, netdev, kvm, virtualization,
	linux-kselftest, Yuri Benditovich, Andrew Melnychenko,
	Stephen Hemminger, gur.stavi, Lei Yang, Simon Horman

On Fri, May 30, 2025 at 01:50:04PM +0900, Akihiko Odaki wrote:
> NOTE: I'm leaving Daynix Computing Ltd., for which I worked on this
> patch series, by the end of this month.
> 
> While net-next is closed, this is the last chance for me to send another
> version so let me send the local changes now.
> 
> Please contact Yuri Benditovich, who is CCed on this email, for anything
> about this series.
> 
> virtio-net have two usage of hashes: one is RSS and another is hash
> reporting. Conventionally the hash calculation was done by the VMM.
> However, computing the hash after the queue was chosen defeats the
> purpose of RSS.
> 
> Another approach is to use eBPF steering program. This approach has
> another downside: it cannot report the calculated hash due to the
> restrictive nature of eBPF.
> 
> Introduce the code to compute hashes to the kernel in order to overcome
> thse challenges.
> 
> An alternative solution is to extend the eBPF steering program so that it
> will be able to report to the userspace, but it is based on context
> rewrites, which is in feature freeze. We can adopt kfuncs, but they will
> not be UAPIs. We opt to ioctl to align with other relevant UAPIs (KVM
> and vhost_net).
> 
> The patches for QEMU to use this new feature was submitted as RFC and
> is available at:
> https://patchew.org/QEMU/20250530-hash-v5-0-343d7d7a8200@daynix.com/
> 
> This work was presented at LPC 2024:
> https://lpc.events/event/18/contributions/1963/


It's been a while, do you intend to post v13?

> V1 -> V2:
>   Changed to introduce a new BPF program type.
> 
> Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
> ---
> Changes in v12:
> - Updated tools/testing/selftests/net/config.
> - Split TUNSETVNETHASH.
> - Link to v11: https://lore.kernel.org/r/20250317-rss-v11-0-4cacca92f31f@daynix.com
> 
> Changes in v11:
> - Added the missing code to free vnet_hash in patch
>   "tap: Introduce virtio-net hash feature".
> - Link to v10: https://lore.kernel.org/r/20250313-rss-v10-0-3185d73a9af0@daynix.com
> 
> Changes in v10:
> - Split common code and TUN/TAP-specific code into separate patches.
> - Reverted a spurious style change in patch "tun: Introduce virtio-net
>   hash feature".
> - Added a comment explaining disable_ipv6 in tests.
> - Used AF_PACKET for patch "selftest: tun: Add tests for
>   virtio-net hashing". I also added the usage of FIXTURE_VARIANT() as
>   the testing function now needs access to more variant-specific
>   variables.
> - Corrected the message of patch "selftest: tun: Add tests for
>   virtio-net hashing"; it mentioned validation of configuration but
>   it is not scope of this patch.
> - Expanded the description of patch "selftest: tun: Add tests for
>   virtio-net hashing".
> - Added patch "tun: Allow steering eBPF program to fall back".
> - Changed to handle TUNGETVNETHASHCAP before taking the rtnl lock.
> - Removed redundant tests for tun_vnet_ioctl().
> - Added patch "selftest: tap: Add tests for virtio-net ioctls".
> - Added a design explanation of ioctls for extensibility and migration.
> - Removed a few branches in patch
>   "vhost/net: Support VIRTIO_NET_F_HASH_REPORT".
> - Link to v9: https://lore.kernel.org/r/20250307-rss-v9-0-df76624025eb@daynix.com
> 
> Changes in v9:
> - Added a missing return statement in patch
>   "tun: Introduce virtio-net hash feature".
> - Link to v8: https://lore.kernel.org/r/20250306-rss-v8-0-7ab4f56ff423@daynix.com
> 
> Changes in v8:
> - Disabled IPv6 to eliminate noises in tests.
> - Added a branch in tap to avoid unnecessary dissection when hash
>   reporting is disabled.
> - Removed unnecessary rtnl_lock().
> - Extracted code to handle new ioctls into separate functions to avoid
>   adding extra NULL checks to the code handling other ioctls.
> - Introduced variable named "fd" to __tun_chr_ioctl().
> - s/-/=/g in a patch message to avoid confusing Git.
> - Link to v7: https://lore.kernel.org/r/20250228-rss-v7-0-844205cbbdd6@daynix.com
> 
> Changes in v7:
> - Ensured to set hash_report to VIRTIO_NET_HASH_REPORT_NONE for
>   VHOST_NET_F_VIRTIO_NET_HDR.
> - s/4/sizeof(u32)/ in patch "virtio_net: Add functions for hashing".
> - Added tap_skb_cb type.
> - Rebased.
> - Link to v6: https://lore.kernel.org/r/20250109-rss-v6-0-b1c90ad708f6@daynix.com
> 
> Changes in v6:
> - Extracted changes to fill vnet header holes into another series.
> - Squashed patches "skbuff: Introduce SKB_EXT_TUN_VNET_HASH", "tun:
>   Introduce virtio-net hash reporting feature", and "tun: Introduce
>   virtio-net RSS" into patch "tun: Introduce virtio-net hash feature".
> - Dropped the RFC tag.
> - Link to v5: https://lore.kernel.org/r/20241008-rss-v5-0-f3cf68df005d@daynix.com
> 
> Changes in v5:
> - Fixed a compilation error with CONFIG_TUN_VNET_CROSS_LE.
> - Optimized the calculation of the hash value according to:
>   https://git.dpdk.org/dpdk/commit/?id=3fb1ea032bd6ff8317af5dac9af901f1f324cab4
> - Added patch "tun: Unify vnet implementation".
> - Dropped patch "tap: Pad virtio header with zero".
> - Added patch "selftest: tun: Test vnet ioctls without device".
> - Reworked selftests to skip for older kernels.
> - Documented the case when the underlying device is deleted and packets
>   have queue_mapping set by TC.
> - Reordered test harness arguments.
> - Added code to handle fragmented packets.
> - Link to v4: https://lore.kernel.org/r/20240924-rss-v4-0-84e932ec0e6c@daynix.com
> 
> Changes in v4:
> - Moved tun_vnet_hash_ext to if_tun.h.
> - Renamed virtio_net_toeplitz() to virtio_net_toeplitz_calc().
> - Replaced htons() with cpu_to_be16().
> - Changed virtio_net_hash_rss() to return void.
> - Reordered variable declarations in virtio_net_hash_rss().
> - Removed virtio_net_hdr_v1_hash_from_skb().
> - Updated messages of "tap: Pad virtio header with zero" and
>   "tun: Pad virtio header with zero".
> - Fixed vnet_hash allocation size.
> - Ensured to free vnet_hash when destructing tun_struct.
> - Link to v3: https://lore.kernel.org/r/20240915-rss-v3-0-c630015db082@daynix.com
> 
> Changes in v3:
> - Reverted back to add ioctl.
> - Split patch "tun: Introduce virtio-net hashing feature" into
>   "tun: Introduce virtio-net hash reporting feature" and
>   "tun: Introduce virtio-net RSS".
> - Changed to reuse hash values computed for automq instead of performing
>   RSS hashing when hash reporting is requested but RSS is not.
> - Extracted relevant data from struct tun_struct to keep it minimal.
> - Added kernel-doc.
> - Changed to allow calling TUNGETVNETHASHCAP before TUNSETIFF.
> - Initialized num_buffers with 1.
> - Added a test case for unclassified packets.
> - Fixed error handling in tests.
> - Changed tests to verify that the queue index will not overflow.
> - Rebased.
> - Link to v2: https://lore.kernel.org/r/20231015141644.260646-1-akihiko.odaki@daynix.com
> 
> ---
> Akihiko Odaki (10):
>       virtio_net: Add functions for hashing
>       net: flow_dissector: Export flow_keys_dissector_symmetric
>       tun: Allow steering eBPF program to fall back
>       tun: Add common virtio-net hash feature code
>       tun: Introduce virtio-net hash feature
>       tap: Introduce virtio-net hash feature
>       selftest: tun: Test vnet ioctls without device
>       selftest: tun: Add tests for virtio-net hashing
>       selftest: tap: Add tests for virtio-net ioctls
>       vhost/net: Support VIRTIO_NET_F_HASH_REPORT
> 
>  Documentation/networking/tuntap.rst  |   7 +
>  drivers/net/Kconfig                  |   1 +
>  drivers/net/ipvlan/ipvtap.c          |   2 +-
>  drivers/net/macvtap.c                |   2 +-
>  drivers/net/tap.c                    |  80 +++++-
>  drivers/net/tun.c                    |  92 +++++--
>  drivers/net/tun_vnet.h               | 165 +++++++++++-
>  drivers/vhost/net.c                  |  68 ++---
>  include/linux/if_tap.h               |   4 +-
>  include/linux/skbuff.h               |   3 +
>  include/linux/virtio_net.h           | 188 ++++++++++++++
>  include/net/flow_dissector.h         |   1 +
>  include/uapi/linux/if_tun.h          |  80 ++++++
>  net/core/flow_dissector.c            |   3 +-
>  net/core/skbuff.c                    |   4 +
>  tools/testing/selftests/net/Makefile |   2 +-
>  tools/testing/selftests/net/config   |   1 +
>  tools/testing/selftests/net/tap.c    | 131 +++++++++-
>  tools/testing/selftests/net/tun.c    | 485 ++++++++++++++++++++++++++++++++++-
>  19 files changed, 1234 insertions(+), 85 deletions(-)
> ---
> base-commit: 5cb8274d66c611b7889565c418a8158517810f9b
> change-id: 20240403-rss-e737d89efa77
> 
> Best regards,
> -- 
> Akihiko Odaki <akihiko.odaki@daynix.com>


^ permalink raw reply	[flat|nested] 39+ messages in thread

end of thread, other threads:[~2025-07-16  8:55 UTC | newest]

Thread overview: 39+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-05-30  4:50 [PATCH net-next v12 00/10] tun: Introduce virtio-net hashing feature Akihiko Odaki
2025-05-30  4:50 ` [PATCH net-next v12 01/10] virtio_net: Add functions for hashing Akihiko Odaki
2025-06-03  3:19   ` Jason Wang
2025-06-03  5:31     ` Akihiko Odaki
2025-06-04  1:18       ` Jason Wang
2025-06-04  7:20         ` Akihiko Odaki
2025-06-05  1:53           ` Jason Wang
2025-06-05  7:57             ` Akihiko Odaki
2025-06-06  0:48               ` Jason Wang
2025-06-06  9:10                 ` Akihiko Odaki
2025-06-17  3:28                   ` Jason Wang
2025-06-21  5:50                     ` Akihiko Odaki
2025-06-22 17:40                     ` Yuri Benditovich
2025-06-23  8:07                       ` Jason Wang
2025-06-23 14:28                         ` Yuri Benditovich
2025-06-24  0:49                           ` Jason Wang
2025-05-30  4:50 ` [PATCH net-next v12 02/10] net: flow_dissector: Export flow_keys_dissector_symmetric Akihiko Odaki
2025-06-03  3:20   ` Jason Wang
2025-05-30  4:50 ` [PATCH net-next v12 03/10] tun: Allow steering eBPF program to fall back Akihiko Odaki
2025-06-04  1:27   ` Jason Wang
2025-06-04  7:24     ` Akihiko Odaki
2025-06-05  1:55       ` Jason Wang
2025-05-30  4:50 ` [PATCH net-next v12 04/10] tun: Add common virtio-net hash feature code Akihiko Odaki
2025-06-04  1:53   ` Jason Wang
2025-06-04  8:42     ` Akihiko Odaki
2025-06-05  2:46       ` Jason Wang
2025-06-05  8:18         ` Akihiko Odaki
2025-06-06  1:01           ` Jason Wang
2025-06-06  9:26             ` Akihiko Odaki
2025-06-17  3:39               ` Jason Wang
2025-06-19 17:01                 ` Akihiko Odaki
2025-06-23  7:59                   ` Jason Wang
2025-05-30  4:50 ` [PATCH net-next v12 05/10] tun: Introduce virtio-net hash feature Akihiko Odaki
2025-05-30  4:50 ` [PATCH net-next v12 06/10] tap: " Akihiko Odaki
2025-05-30  4:50 ` [PATCH net-next v12 07/10] selftest: tun: Test vnet ioctls without device Akihiko Odaki
2025-05-30  4:50 ` [PATCH net-next v12 08/10] selftest: tun: Add tests for virtio-net hashing Akihiko Odaki
2025-05-30  4:50 ` [PATCH net-next v12 09/10] selftest: tap: Add tests for virtio-net ioctls Akihiko Odaki
2025-05-30  4:50 ` [PATCH net-next v12 10/10] vhost/net: Support VIRTIO_NET_F_HASH_REPORT Akihiko Odaki
2025-07-16  8:55 ` [PATCH net-next v12 00/10] tun: Introduce virtio-net hashing feature Michael S. Tsirkin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).