[PATCH net-next 0/3] icmp: Add RFC 5837 support

* [PATCH net-next 0/3] icmp: Add RFC 5837 support
@ 2025-10-22  6:53 Ido Schimmel
  2025-10-22  6:53 ` [PATCH net-next 1/3] ipv4: " Ido Schimmel
                   ` (4 more replies)
  0 siblings, 5 replies; 21+ messages in thread
From: Ido Schimmel @ 2025-10-22  6:53 UTC
  To: netdev
  Cc: davem, kuba, pabeni, edumazet, horms, dsahern, petrm, willemb,
	daniel, fw, ishaangandhi, rbonica, tom, Ido Schimmel

tl;dr
=====

This patchset extends certain ICMP error messages (e.g., "Time
Exceeded") with incoming interface information in accordance with RFC
5837 [1]. This is required for more meaningful traceroute results in
unnumbered networks. Like other ICMP settings, the feature is controlled
via a per-{netns, address family} sysctl. The interface and the
implementation are designed to support more ICMP extensions.

Motivation
==========

Over the years, the kernel was extended with the ability to derive the
source IP of ICMP error messages from the interface that received the
datagram which elicited the ICMP error [2][3][4]. This is especially
important for "Time Exceeded" messages as it allows traceroute users to
trace the actual packet path along the network.

The above scheme does not work in unnumbered networks. In these
networks, only the loopback / VRF interface is assigned a global IP
address while router interfaces are assigned IPv6 link-local addresses.
As such, ICMP error messages are generated with a source IP derived from
the loopback / VRF interface, making it impossible to trace the actual
packet path when parallel links exist between routers.

The problem can be solved by implementing the solution proposed by RFC
4884 [5] and RFC 5837. The former defines an ICMP extension structure
that can be appended to selected ICMP messages and carry extension
objects. The latter defines an extension object called the "Interface
Information Object" (IIO) that can carry interface information (e.g.,
name, index, MTU) about interfaces with certain roles such as the
interface that received the datagram which elicited the ICMP error.
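
For illustration, the c-type byte of an IIO can be decoded as sketched
below. The bit positions follow the ICMP_EXT_CTYPE_IIO_* macros added in
patch 1/3; the struct and function names are hypothetical and not part
of the patchset.

```c
#include <stdbool.h>
#include <stdint.h>

/* C-type layout of the Interface Information Object, mirroring the
 * ICMP_EXT_CTYPE_IIO_* macros from patch 1/3.
 */
#define IIO_CTYPE_ROLE_SHIFT	6
#define IIO_CTYPE_IFINDEX	(1u << 3)
#define IIO_CTYPE_IPADDR	(1u << 2)
#define IIO_CTYPE_NAME		(1u << 1)
#define IIO_CTYPE_MTU		(1u << 0)

/* Hypothetical decoded view of the c-type byte. */
struct iio_ctype {
	unsigned int role;	/* 0 = incoming IP interface */
	bool has_ifindex;
	bool has_ipaddr;
	bool has_name;
	bool has_mtu;
};

static struct iio_ctype iio_ctype_decode(uint8_t ctype)
{
	struct iio_ctype c = {
		.role = ctype >> IIO_CTYPE_ROLE_SHIFT,
		.has_ifindex = (ctype & IIO_CTYPE_IFINDEX) != 0,
		.has_ipaddr = (ctype & IIO_CTYPE_IPADDR) != 0,
		.has_name = (ctype & IIO_CTYPE_NAME) != 0,
		.has_mtu = (ctype & IIO_CTYPE_MTU) != 0,
	};

	return c;
}
```

With this layout, a c-type of 0x0b decodes to role 0 with the ifIndex,
name and MTU present and no address sub-object.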

The payload of the datagram that elicited the error (potentially padded
/ trimmed) along with the ICMP extension structure will be queued to the
error queue of the originating socket, thereby allowing traceroute
applications to parse and display the information encoded in the ICMP
extension structure. Example:

 # traceroute6 -e 2001:db8:1::3
 traceroute to 2001:db8:1::3 (2001:db8:1::3), 30 hops max, 80 byte packets
  1  2001:db8:1::2 (2001:db8:1::2) <INC:11,"eth1",mtu=1500>  0.214 ms  0.171 ms  0.162 ms
  2  2001:db8:1::3 (2001:db8:1::3) <INC:12,"eth2",mtu=1500>  0.154 ms  0.135 ms  0.127 ms

 # traceroute -e 192.0.2.3
 traceroute to 192.0.2.3 (192.0.2.3), 30 hops max, 60 byte packets
  1  192.0.2.2 (192.0.2.2) <INC:11,"eth1",mtu=1500>  0.191 ms  0.148 ms  0.144 ms
  2  192.0.2.3 (192.0.2.3) <INC:12,"eth2",mtu=1500>  0.137 ms  0.122 ms  0.114 ms

Implementation
==============

As previously stated, the feature is controlled via a per-{netns,
address family} sysctl. Specifically, the sysctl is a bitmask in which
each bit controls the addition of a different ICMP extension to ICMP
error messages. Currently, only a single bit is supported, which appends
the incoming interface information.
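
The bitmask semantics can be sketched as follows. This is a simplified
userspace model, not the kernel code: the enum mirrors the
ICMP_ERR_EXT_* constants from patch 1/3, while the helper names are
hypothetical. The sysctl handler in the patches enforces the limit as a
0..max range check, which should be equivalent as long as the maximum
has all defined bits set.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions; mirrors the ICMP_ERR_EXT_* enum in patch 1/3. */
enum {
	EXT_IIO_IIF,	/* RFC 5837 incoming IP interface */
	EXT_COUNT,	/* number of defined extensions */
};

/* All currently defined bits set: 0x01 while EXT_COUNT == 1. */
static uint8_t ext_mask_all(void)
{
	return (uint8_t)((1u << EXT_COUNT) - 1);
}

/* A value is acceptable if it sets no undefined bits. */
static bool ext_mask_valid(unsigned int mask)
{
	return (mask & ~(unsigned int)ext_mask_all()) == 0;
}

/* Test whether a given extension is enabled in the mask. */
static bool ext_enabled(uint8_t mask, unsigned int ext)
{
	return (mask & (1u << ext)) != 0;
}
```

Adding a new extension would then amount to adding a constant above
EXT_COUNT, which widens the accepted mask by one bit.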

Key points:

1. Global knob vs finer control. I am not aware of users who require
finer control, but it is possible that some users will want to avoid
appending ICMP extensions when the packet is sent out of a specific
interface (e.g., the management interface) or to a specific subnet. This
can be accomplished via a tc-bpf program that trims the ICMP extension
structure. An example program can be found here [6].

2. Split implementation between IPv4 / IPv6. While the two
implementations are currently similar, there are some differences
between the address families. In addition, some extensions (e.g., RFC
8883 [7]) are IPv6-specific. Given the above and given that the
implementation is not very complex, it makes sense to keep both
implementations separate.

3. Compatibility with legacy applications. RFC 4884 from 2007 extended
certain ICMP messages with a length field that encodes the length of the
"original datagram" field, so that applications will be able to tell
where the "original datagram" ends and where the ICMP extension
structure starts.

Before the introduction of the IP{,6}_RECVERR_RFC4884 socket options
[8][9] in 2020, it was impossible for applications to know where the
ICMP extension structure starts, and to this day some applications
assume that it starts at offset 128, which is the minimum length of the
"original datagram" field as specified by RFC 4884.

Therefore, in order to be compatible with both legacy and modern
applications, the datagram that elicited the ICMP error is trimmed /
padded to 128 bytes before appending the ICMP extension structure.

This behavior is specifically called out by RFC 4884: "Those wishing to
be backward compatible with non-compliant TRACEROUTE implementations
will include exactly 128 octets" [10].

Note that in 128 bytes we should be able to include enough headers for
the originating node to match the ICMP error message with the relevant
socket. For example, the following headers will be present in the
"original datagram" field when a VXLAN encapsulated IPv6 packet elicits
an ICMP error in an IPv6 underlay: IPv6 (40) | UDP (8) | VXLAN (8) | Eth
(14) | IPv6 (40) | UDP (8). Overall, 118 bytes.

If the 128 bytes limit proves to be insufficient for some use case, we
can consider dedicating a new bit in the previously mentioned sysctl to
allow for more bytes to be included in the "original datagram" field.

4. Extensibility. This patchset adds partial support for a single ICMP
extension. However, the interface and the implementation should be able
to support more extensions, if needed. Examples:

* More interface information objects as part of RFC 5837. We should be
  able to derive the outgoing interface information and nexthop IP from
  the dst entry attached to the packet that elicited the error.

* Node identification object (e.g., hostname / loopback IP) [11].

* Extended Information object which encodes aggregate header limits as
  part of RFC 8883.
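
To illustrate key point 3 from the receiver's side, the sketch below
locates the extension structure using the RFC 4884 length field and
checks its version and checksum. It is a minimal userspace model under
stated assumptions: "payload" is taken to be the bytes following the
8-byte ICMP header, the length unit is 4 bytes for ICMPv4 and 8 bytes
for ICMPv6, and all function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define EXT_VERSION	2
#define EXT_HDR_LEN	4	/* version/reserved (2) + checksum (2) */

/* Internet checksum over len bytes, read as big-endian 16-bit words. */
static uint16_t inet_csum(const uint8_t *p, size_t len)
{
	uint32_t sum = 0;

	for (; len > 1; p += 2, len -= 2)
		sum += (uint32_t)p[0] << 8 | p[1];
	if (len)
		sum += (uint32_t)p[0] << 8;
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);

	return (uint16_t)~sum;
}

/* Return the offset of the extension structure within the payload, or
 * -1 if it is absent or malformed. With the padding / trimming done by
 * this patchset the offset is expected to be 128 for both families.
 */
static long icmp_ext_offset(const uint8_t *payload, size_t payload_len,
			    uint8_t len_field, unsigned int unit)
{
	size_t off = (size_t)len_field * unit;

	if (!len_field || off + EXT_HDR_LEN > payload_len)
		return -1;

	/* The version is carried in the upper 4 bits of the first byte. */
	if (payload[off] >> 4 != EXT_VERSION)
		return -1;

	/* RFC 4884 allows an all-zero checksum to mean "not transmitted".
	 * Otherwise, summing the structure with the checksum in place
	 * must yield zero.
	 */
	if ((payload[off + 2] | payload[off + 3]) &&
	    inet_csum(payload + off, payload_len - off))
		return -1;

	return (long)off;
}
```

A legacy application that ignores the length field and looks at offset
128 would find the same structure, which is the reason for fixing the
"original datagram" field at 128 bytes.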

A previous proposal from Ishaan Gandhi and Ron Bonica is available here
[12].

Testing
=======

The existing traceroute selftest is extended to test that ICMP
extensions are reported correctly when enabled. Both address families
are tested, each with different packet sizes, in order to make sure that
trimming / padding works correctly.
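
The trimming / padding behavior being exercised can be modeled in
userspace as below. This is only a sketch of the intended result on a
flat buffer, not the skb manipulation done in the patches, and the
function name is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ORIG_DGRAM_LEN	128	/* minimum length per RFC 4884 */

/* Copy at most 128 bytes of the offending datagram into out and
 * zero-fill whatever remains, so that the extension structure always
 * starts at offset 128. Returns the number of datagram bytes kept.
 */
static size_t orig_dgram_fit(uint8_t out[ORIG_DGRAM_LEN],
			     const uint8_t *dgram, size_t len)
{
	size_t copy = len < ORIG_DGRAM_LEN ? len : ORIG_DGRAM_LEN;

	memcpy(out, dgram, copy);
	memset(out + copy, 0, ORIG_DGRAM_LEN - copy);

	return copy;
}
```

A 60 byte probe is therefore padded with 68 zero bytes, while a probe
larger than 128 bytes is cut at 128.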

[1] https://datatracker.ietf.org/doc/html/rfc5837
[2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=1c2fb7f93cb20621772bf304f3dba0849942e5db
[3] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=fac6fce9bdb59837bb89930c3a92f5e0d1482f0b
[4] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=4a8c416602d97a4e2073ed563d4d4c7627de19cf
[5] https://datatracker.ietf.org/doc/html/rfc4884
[6] https://gist.github.com/idosch/5013448cdb5e9e060e6bfdc8b433577c
[7] https://datatracker.ietf.org/doc/html/rfc8883
[8] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=eba75c587e811d3249c8bd50d22bb2266ccd3c0f
[9] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=01370434df85eb76ecb1527a4466013c4aca2436
[10] https://datatracker.ietf.org/doc/html/rfc4884#section-5.3
[11] https://datatracker.ietf.org/doc/html/draft-ietf-intarea-extended-icmp-nodeid-04
[12] https://lore.kernel.org/netdev/20210317221959.4410-1-ishaangandhi@gmail.com/

Ido Schimmel (3):
  ipv4: icmp: Add RFC 5837 support
  ipv6: icmp: Add RFC 5837 support
  selftests: traceroute: Add ICMP extensions tests

 Documentation/networking/ip-sysctl.rst    |  34 +++
 include/linux/icmp.h                      |  32 +++
 include/net/netns/ipv4.h                  |   1 +
 include/net/netns/ipv6.h                  |   1 +
 net/core/dev.c                            |   1 +
 net/ipv4/icmp.c                           | 190 ++++++++++++++-
 net/ipv4/sysctl_net_ipv4.c                |  11 +
 net/ipv6/af_inet6.c                       |   1 +
 net/ipv6/icmp.c                           | 213 +++++++++++++++-
 tools/testing/selftests/net/traceroute.sh | 280 ++++++++++++++++++++++
 10 files changed, 761 insertions(+), 3 deletions(-)

-- 
2.51.0


* [PATCH net-next 1/3] ipv4: icmp: Add RFC 5837 support
  2025-10-22  6:53 [PATCH net-next 0/3] icmp: Add RFC 5837 support Ido Schimmel
@ 2025-10-22  6:53 ` Ido Schimmel
  2025-10-22 22:00   ` Willem de Bruijn
  2025-10-22  6:53 ` [PATCH net-next 2/3] ipv6: " Ido Schimmel
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 21+ messages in thread
From: Ido Schimmel @ 2025-10-22  6:53 UTC
  To: netdev
  Cc: davem, kuba, pabeni, edumazet, horms, dsahern, petrm, willemb,
	daniel, fw, ishaangandhi, rbonica, tom, Ido Schimmel

Add the ability to append the incoming IP interface information to
ICMPv4 error messages in accordance with RFC 5837 and RFC 4884. This is
required for more meaningful traceroute results in unnumbered networks.

The feature is disabled by default and controlled via a new sysctl
("net.ipv4.icmp_errors_extension_mask") which accepts a bitmask of ICMP
extensions to append to ICMP error messages. Currently, only a single
value is supported, but the interface and the implementation should be
able to support more extensions, if needed.

Clone the skb and copy the relevant data portions before modifying the
skb as the caller of __icmp_send() still owns the skb after the function
returns. This should be fine since by default ICMP error messages are
rate limited to 1000 per second and no more than 1 per second per
specific host.

Trim or pad the packet to 128 bytes before appending the ICMP extension
structure in order to be compatible with legacy applications that assume
that the ICMP extension structure always starts at this offset (the
minimum length specified by RFC 4884).

Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
---
 Documentation/networking/ip-sysctl.rst |  17 +++
 include/linux/icmp.h                   |  32 +++++
 include/net/netns/ipv4.h               |   1 +
 net/ipv4/icmp.c                        | 190 ++++++++++++++++++++++++-
 net/ipv4/sysctl_net_ipv4.c             |  11 ++
 5 files changed, 250 insertions(+), 1 deletion(-)

diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst
index a06cb99d66dc..ece1187ba0f1 100644
--- a/Documentation/networking/ip-sysctl.rst
+++ b/Documentation/networking/ip-sysctl.rst
@@ -1796,6 +1796,23 @@ icmp_errors_use_inbound_ifaddr - BOOLEAN
 
 	Default: 0 (disabled)
 
+icmp_errors_extension_mask - UNSIGNED INTEGER
+	Bitmask of ICMP extensions to append to ICMPv4 error messages
+	("Destination Unreachable", "Time Exceeded" and "Parameter Problem").
+	The original datagram is trimmed / padded to 128 bytes in order to be
+	compatible with applications that do not comply with RFC 4884.
+
+	Possible extensions are:
+
+	==== ==============================================================
+	0x01 Incoming IP interface information according to RFC 5837.
+	     Extension will include the index, IPv4 address (if present),
+	     name and MTU of the IP interface that received the datagram
+	     which elicited the ICMP error.
+	==== ==============================================================
+
+	Default: 0x00 (no extensions)
+
 igmp_max_memberships - INTEGER
 	Change the maximum number of multicast groups we can subscribe to.
 	Default: 20
diff --git a/include/linux/icmp.h b/include/linux/icmp.h
index 0af4d210ee31..043ec5d9c882 100644
--- a/include/linux/icmp.h
+++ b/include/linux/icmp.h
@@ -40,4 +40,36 @@ void ip_icmp_error_rfc4884(const struct sk_buff *skb,
 			   struct sock_ee_data_rfc4884 *out,
 			   int thlen, int off);
 
+/* RFC 4884 */
+#define ICMP_EXT_ORIG_DGRAM_MIN_LEN	128
+#define ICMP_EXT_VERSION_2		2
+
+/* ICMP Extension Object Classes */
+#define ICMP_EXT_OBJ_CLASS_IIO		2	/* RFC 5837 */
+
+/* Interface Information Object - RFC 5837 */
+enum {
+	ICMP_EXT_CTYPE_IIO_ROLE_IIF,
+};
+
+#define ICMP_EXT_CTYPE_IIO_ROLE(ROLE)	((ROLE) << 6)
+#define ICMP_EXT_CTYPE_IIO_MTU		BIT(0)
+#define ICMP_EXT_CTYPE_IIO_NAME		BIT(1)
+#define ICMP_EXT_CTYPE_IIO_IPADDR	BIT(2)
+#define ICMP_EXT_CTYPE_IIO_IFINDEX	BIT(3)
+
+struct icmp_ext_iio_name_subobj {
+	u8 len;
+	char name[IFNAMSIZ];
+};
+
+enum {
+	/* RFC 5837 - Incoming IP Interface Role */
+	ICMP_ERR_EXT_IIO_IIF,
+	/* Add new constants above. Used by "icmp_errors_extension_mask"
+	 * sysctl.
+	 */
+	ICMP_ERR_EXT_COUNT,
+};
+
 #endif	/* _LINUX_ICMP_H */
diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index 34eb3aecb3f2..0e96c90e56c6 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -135,6 +135,7 @@ struct netns_ipv4 {
 	u8 sysctl_icmp_echo_ignore_broadcasts;
 	u8 sysctl_icmp_ignore_bogus_error_responses;
 	u8 sysctl_icmp_errors_use_inbound_ifaddr;
+	u8 sysctl_icmp_errors_extension_mask;
 	int sysctl_icmp_ratelimit;
 	int sysctl_icmp_ratemask;
 	int sysctl_icmp_msgs_per_sec;
diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
index 1b7fb5d935ed..44c4deb9d9da 100644
--- a/net/ipv4/icmp.c
+++ b/net/ipv4/icmp.c
@@ -582,6 +582,184 @@ static struct rtable *icmp_route_lookup(struct net *net, struct flowi4 *fl4,
 	return ERR_PTR(err);
 }
 
+struct icmp_ext_iio_addr4_subobj {
+	__be16 afi;
+	__be16 reserved;
+	__be32 addr4;
+};
+
+static unsigned int icmp_ext_iio_len(void)
+{
+	return sizeof(struct icmp_extobj_hdr) +
+		/* ifIndex */
+		sizeof(__be32) +
+		/* Interface Address Sub-Object */
+		sizeof(struct icmp_ext_iio_addr4_subobj) +
+		/* Interface Name Sub-Object. Length must be a multiple of 4
+		 * bytes.
+		 */
+		ALIGN(sizeof(struct icmp_ext_iio_name_subobj), 4) +
+		/* MTU */
+		sizeof(__be32);
+}
+
+static unsigned int icmp_ext_max_len(u8 ext_objs)
+{
+	unsigned int ext_max_len;
+
+	ext_max_len = sizeof(struct icmp_ext_hdr);
+
+	if (ext_objs & BIT(ICMP_ERR_EXT_IIO_IIF))
+		ext_max_len += icmp_ext_iio_len();
+
+	return ext_max_len;
+}
+
+static __be32 icmp_ext_iio_addr4_find(const struct net_device *dev)
+{
+	struct in_device *in_dev;
+	struct in_ifaddr *ifa;
+
+	in_dev = __in_dev_get_rcu(dev);
+	if (!in_dev)
+		return 0;
+
+	/* It is unclear from RFC 5837 which IP address should be chosen, but
+	 * it makes sense to choose a global unicast address.
+	 */
+	in_dev_for_each_ifa_rcu(ifa, in_dev) {
+		if (READ_ONCE(ifa->ifa_flags) & IFA_F_SECONDARY)
+			continue;
+		if (ifa->ifa_scope != RT_SCOPE_UNIVERSE ||
+		    ipv4_is_multicast(ifa->ifa_address))
+			continue;
+		return ifa->ifa_address;
+	}
+
+	return 0;
+}
+
+static void icmp_ext_iio_iif_append(struct net *net, struct sk_buff *skb,
+				    int iif)
+{
+	struct icmp_ext_iio_name_subobj *name_subobj;
+	struct icmp_extobj_hdr *objh;
+	struct net_device *dev;
+	__be32 data;
+
+	if (!iif)
+		return;
+
+	objh = skb_put(skb, sizeof(*objh));
+	objh->class_num = ICMP_EXT_OBJ_CLASS_IIO;
+	objh->class_type = ICMP_EXT_CTYPE_IIO_ROLE(ICMP_EXT_CTYPE_IIO_ROLE_IIF);
+
+	data = htonl(iif);
+	skb_put_data(skb, &data, sizeof(__be32));
+	objh->class_type |= ICMP_EXT_CTYPE_IIO_IFINDEX;
+
+	rcu_read_lock();
+
+	dev = dev_get_by_index_rcu(net, iif);
+	if (!dev)
+		goto out;
+
+	data = icmp_ext_iio_addr4_find(dev);
+	if (data) {
+		struct icmp_ext_iio_addr4_subobj *addr4_subobj;
+
+		addr4_subobj = skb_put_zero(skb, sizeof(*addr4_subobj));
+		addr4_subobj->afi = htons(ICMP_AFI_IP);
+		addr4_subobj->addr4 = data;
+		objh->class_type |= ICMP_EXT_CTYPE_IIO_IPADDR;
+	}
+
+	name_subobj = skb_put_zero(skb, ALIGN(sizeof(*name_subobj), 4));
+	name_subobj->len = ALIGN(sizeof(*name_subobj), 4);
+	netdev_copy_name(dev, name_subobj->name);
+	objh->class_type |= ICMP_EXT_CTYPE_IIO_NAME;
+
+	data = htonl(READ_ONCE(dev->mtu));
+	skb_put_data(skb, &data, sizeof(__be32));
+	objh->class_type |= ICMP_EXT_CTYPE_IIO_MTU;
+
+out:
+	rcu_read_unlock();
+	objh->length = htons(skb_tail_pointer(skb) - (unsigned char *)objh);
+}
+
+static void icmp_ext_objs_append(struct net *net, struct sk_buff *skb,
+				 u8 ext_objs, int iif)
+{
+	if (ext_objs & BIT(ICMP_ERR_EXT_IIO_IIF))
+		icmp_ext_iio_iif_append(net, skb, iif);
+}
+
+static struct sk_buff *
+icmp_ext_append(struct net *net, struct sk_buff *skb_in, struct icmphdr *icmph,
+		unsigned int room, int iif)
+{
+	unsigned int payload_len, ext_max_len, ext_len;
+	struct icmp_ext_hdr *ext_hdr;
+	struct sk_buff *skb;
+	u8 ext_objs;
+	int nhoff;
+
+	switch (icmph->type) {
+	case ICMP_DEST_UNREACH:
+	case ICMP_TIME_EXCEEDED:
+	case ICMP_PARAMETERPROB:
+		break;
+	default:
+		return NULL;
+	}
+
+	ext_objs = READ_ONCE(net->ipv4.sysctl_icmp_errors_extension_mask);
+	if (!ext_objs)
+		return NULL;
+
+	ext_max_len = icmp_ext_max_len(ext_objs);
+	if (ICMP_EXT_ORIG_DGRAM_MIN_LEN + ext_max_len > room)
+		return NULL;
+
+	skb = skb_clone(skb_in, GFP_ATOMIC);
+	if (!skb)
+		return NULL;
+
+	nhoff = skb_network_offset(skb);
+	payload_len = min(skb->len - nhoff, ICMP_EXT_ORIG_DGRAM_MIN_LEN);
+
+	if (!pskb_network_may_pull(skb, payload_len))
+		goto free_skb;
+
+	if (pskb_trim(skb, nhoff + ICMP_EXT_ORIG_DGRAM_MIN_LEN) ||
+	    __skb_put_padto(skb, nhoff + ICMP_EXT_ORIG_DGRAM_MIN_LEN, false))
+		goto free_skb;
+
+	if (pskb_expand_head(skb, 0, ext_max_len, GFP_ATOMIC))
+		goto free_skb;
+
+	ext_hdr = skb_put_zero(skb, sizeof(*ext_hdr));
+	ext_hdr->version = ICMP_EXT_VERSION_2;
+
+	icmp_ext_objs_append(net, skb, ext_objs, iif);
+
+	/* Do not send an empty extension structure. */
+	ext_len = skb_tail_pointer(skb) - (unsigned char *)ext_hdr;
+	if (ext_len == sizeof(*ext_hdr))
+		goto free_skb;
+
+	ext_hdr->checksum = ip_compute_csum(ext_hdr, ext_len);
+	/* The length of the original datagram in 32-bit words (RFC 4884). */
+	icmph->un.reserved[1] = ICMP_EXT_ORIG_DGRAM_MIN_LEN / sizeof(u32);
+
+	return skb;
+
+free_skb:
+	consume_skb(skb);
+	return NULL;
+}
+
 /*
  *	Send an ICMP message in response to a situation
  *
@@ -601,6 +779,7 @@ void __icmp_send(struct sk_buff *skb_in, int type, int code, __be32 info,
 	struct icmp_bxm icmp_param;
 	struct rtable *rt = skb_rtable(skb_in);
 	bool apply_ratelimit = false;
+	struct sk_buff *ext_skb;
 	struct ipcm_cookie ipc;
 	struct flowi4 fl4;
 	__be32 saddr;
@@ -770,7 +949,12 @@ void __icmp_send(struct sk_buff *skb_in, int type, int code, __be32 info,
 	if (room <= (int)sizeof(struct iphdr))
 		goto ende;
 
-	icmp_param.data_len = skb_in->len - icmp_param.offset;
+	ext_skb = icmp_ext_append(net, skb_in, &icmp_param.data.icmph, room,
+				  parm->iif);
+	if (ext_skb)
+		icmp_param.skb = ext_skb;
+
+	icmp_param.data_len = icmp_param.skb->len - icmp_param.offset;
 	if (icmp_param.data_len > room)
 		icmp_param.data_len = room;
 	icmp_param.head_len = sizeof(struct icmphdr);
@@ -785,6 +969,9 @@ void __icmp_send(struct sk_buff *skb_in, int type, int code, __be32 info,
 	trace_icmp_send(skb_in, type, code);
 
 	icmp_push_reply(sk, &icmp_param, &fl4, &ipc, &rt);
+
+	if (ext_skb)
+		consume_skb(ext_skb);
 ende:
 	ip_rt_put(rt);
 out_unlock:
@@ -1502,6 +1689,7 @@ static int __net_init icmp_sk_init(struct net *net)
 	net->ipv4.sysctl_icmp_ratelimit = 1 * HZ;
 	net->ipv4.sysctl_icmp_ratemask = 0x1818;
 	net->ipv4.sysctl_icmp_errors_use_inbound_ifaddr = 0;
+	net->ipv4.sysctl_icmp_errors_extension_mask = 0;
 	net->ipv4.sysctl_icmp_msgs_per_sec = 1000;
 	net->ipv4.sysctl_icmp_msgs_burst = 50;
 
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 24dbc603cc44..0c7c8f9041cb 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -48,6 +48,8 @@ static int tcp_plb_max_rounds = 31;
 static int tcp_plb_max_cong_thresh = 256;
 static unsigned int tcp_tw_reuse_delay_max = TCP_PAWS_MSL * MSEC_PER_SEC;
 static int tcp_ecn_mode_max = 2;
+static u32 icmp_errors_extension_mask_all =
+	GENMASK_U8(ICMP_ERR_EXT_COUNT - 1, 0);
 
 /* obsolete */
 static int sysctl_tcp_low_latency __read_mostly;
@@ -674,6 +676,15 @@ static struct ctl_table ipv4_net_table[] = {
 		.extra1		= SYSCTL_ZERO,
 		.extra2		= SYSCTL_ONE
 	},
+	{
+		.procname	= "icmp_errors_extension_mask",
+		.data		= &init_net.ipv4.sysctl_icmp_errors_extension_mask,
+		.maxlen		= sizeof(u8),
+		.mode		= 0644,
+		.proc_handler	= proc_dou8vec_minmax,
+		.extra1		= SYSCTL_ZERO,
+		.extra2		= &icmp_errors_extension_mask_all,
+	},
 	{
 		.procname	= "icmp_ratelimit",
 		.data		= &init_net.ipv4.sysctl_icmp_ratelimit,
-- 
2.51.0



* [PATCH net-next 2/3] ipv6: icmp: Add RFC 5837 support
  2025-10-22  6:53 [PATCH net-next 0/3] icmp: Add RFC 5837 support Ido Schimmel
  2025-10-22  6:53 ` [PATCH net-next 1/3] ipv4: " Ido Schimmel
@ 2025-10-22  6:53 ` Ido Schimmel
  2025-10-22  6:53 ` [PATCH net-next 3/3] selftests: traceroute: Add ICMP extensions tests Ido Schimmel
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 21+ messages in thread
From: Ido Schimmel @ 2025-10-22  6:53 UTC
  To: netdev
  Cc: davem, kuba, pabeni, edumazet, horms, dsahern, petrm, willemb,
	daniel, fw, ishaangandhi, rbonica, tom, Ido Schimmel

Add the ability to append the incoming IP interface information to
ICMPv6 error messages in accordance with RFC 5837 and RFC 4884. This is
required for more meaningful traceroute results in unnumbered networks.

The feature is disabled by default and controlled via a new sysctl
("net.ipv6.icmp.errors_extension_mask") which accepts a bitmask of ICMP
extensions to append to ICMP error messages. Currently, only a single
value is supported, but the interface and the implementation should be
able to support more extensions, if needed.

Clone the skb and copy the relevant data portions before modifying the
skb as the caller of icmp6_send() still owns the skb after the function
returns. This should be fine since by default ICMP error messages are
rate limited to 1000 per second and no more than 1 per second per
specific host.

Trim or pad the packet to 128 bytes before appending the ICMP extension
structure in order to be compatible with legacy applications that assume
that the ICMP extension structure always starts at this offset (the
minimum length specified by RFC 4884).

Since commit 20e1954fe238 ("ipv6: RFC 4884 partial support for SIT/GRE
tunnels") it is possible for icmp6_send() to be called with an skb that
already contains ICMP extensions. This can happen when we receive an
ICMPv4 message with extensions from a tunnel and translate it to an
ICMPv6 message towards an IPv6 host in the overlay network. I could not
find an RFC that supports this behavior, but it makes sense to not
overwrite the original extensions that were appended to the packet.
Therefore, avoid appending extensions if the length field in the
provided ICMPv6 header is already filled.

Export netdev_copy_name() using EXPORT_IPV6_MOD_GPL() to make it
available to IPv6 when it is built as a module.

Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
---
 Documentation/networking/ip-sysctl.rst |  17 ++
 include/net/netns/ipv6.h               |   1 +
 net/core/dev.c                         |   1 +
 net/ipv6/af_inet6.c                    |   1 +
 net/ipv6/icmp.c                        | 213 ++++++++++++++++++++++++-
 5 files changed, 231 insertions(+), 2 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst
index ece1187ba0f1..7cd35bfd39e6 100644
--- a/Documentation/networking/ip-sysctl.rst
+++ b/Documentation/networking/ip-sysctl.rst
@@ -3279,6 +3279,23 @@ error_anycast_as_unicast - BOOLEAN
 
 	Default: 0 (disabled)
 
+errors_extension_mask - UNSIGNED INTEGER
+	Bitmask of ICMP extensions to append to ICMPv6 error messages
+	("Destination Unreachable" and "Time Exceeded"). The original datagram
+	is trimmed / padded to 128 bytes in order to be compatible with
+	applications that do not comply with RFC 4884.
+
+	Possible extensions are:
+
+	==== ==============================================================
+	0x01 Incoming IP interface information according to RFC 5837.
+	     Extension will include the index, IPv6 address (if present),
+	     name and MTU of the IP interface that received the datagram
+	     which elicited the ICMP error.
+	==== ==============================================================
+
+	Default: 0x00 (no extensions)
+
 xfrm6_gc_thresh - INTEGER
 	(Obsolete since linux-4.14)
 	The threshold at which we will start garbage collecting for IPv6
diff --git a/include/net/netns/ipv6.h b/include/net/netns/ipv6.h
index 47dc70d8100a..08d2ecc96e2b 100644
--- a/include/net/netns/ipv6.h
+++ b/include/net/netns/ipv6.h
@@ -56,6 +56,7 @@ struct netns_sysctl_ipv6 {
 	u8 skip_notify_on_dev_down;
 	u8 fib_notify_on_flag_change;
 	u8 icmpv6_error_anycast_as_unicast;
+	u8 icmpv6_errors_extension_mask;
 };
 
 struct netns_ipv6 {
diff --git a/net/core/dev.c b/net/core/dev.c
index 378c2d010faf..e6cc0fbc5e2a 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1163,6 +1163,7 @@ void netdev_copy_name(struct net_device *dev, char *name)
 		strscpy(name, dev->name, IFNAMSIZ);
 	} while (read_seqretry(&netdev_rename_lock, seq));
 }
+EXPORT_IPV6_MOD_GPL(netdev_copy_name);
 
 /**
  *	netdev_get_name - get a netdevice name, knowing its ifindex.
diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
index 1b0314644e0c..44d7de1eec4f 100644
--- a/net/ipv6/af_inet6.c
+++ b/net/ipv6/af_inet6.c
@@ -960,6 +960,7 @@ static int __net_init inet6_net_init(struct net *net)
 	net->ipv6.sysctl.icmpv6_echo_ignore_multicast = 0;
 	net->ipv6.sysctl.icmpv6_echo_ignore_anycast = 0;
 	net->ipv6.sysctl.icmpv6_error_anycast_as_unicast = 0;
+	net->ipv6.sysctl.icmpv6_errors_extension_mask = 0;
 
 	/* By default, rate limit error messages.
 	 * Except for pmtu discovery, it would break it.
diff --git a/net/ipv6/icmp.c b/net/ipv6/icmp.c
index 56c974cf75d1..b2e958a23d4d 100644
--- a/net/ipv6/icmp.c
+++ b/net/ipv6/icmp.c
@@ -444,6 +444,192 @@ static int icmp6_iif(const struct sk_buff *skb)
 	return icmp6_dev(skb)->ifindex;
 }
 
+struct icmp6_ext_iio_addr6_subobj {
+	__be16 afi;
+	__be16 reserved;
+	struct in6_addr addr6;
+};
+
+static unsigned int icmp6_ext_iio_len(void)
+{
+	return sizeof(struct icmp_extobj_hdr) +
+		/* ifIndex */
+		sizeof(__be32) +
+		/* Interface Address Sub-Object */
+		sizeof(struct icmp6_ext_iio_addr6_subobj) +
+		/* Interface Name Sub-Object. Length must be a multiple of 4
+		 * bytes.
+		 */
+		ALIGN(sizeof(struct icmp_ext_iio_name_subobj), 4) +
+		/* MTU */
+		sizeof(__be32);
+}
+
+static unsigned int icmp6_ext_max_len(u8 ext_objs)
+{
+	unsigned int ext_max_len;
+
+	ext_max_len = sizeof(struct icmp_ext_hdr);
+
+	if (ext_objs & BIT(ICMP_ERR_EXT_IIO_IIF))
+		ext_max_len += icmp6_ext_iio_len();
+
+	return ext_max_len;
+}
+
+static struct in6_addr *icmp6_ext_iio_addr6_find(const struct net_device *dev)
+{
+	struct inet6_dev *in6_dev;
+	struct inet6_ifaddr *ifa;
+
+	in6_dev = __in6_dev_get(dev);
+	if (!in6_dev)
+		return NULL;
+
+	/* It is unclear from RFC 5837 which IP address should be chosen, but
+	 * it makes sense to choose a global unicast address.
+	 */
+	list_for_each_entry_rcu(ifa, &in6_dev->addr_list, if_list) {
+		if (ifa->flags & (IFA_F_TENTATIVE | IFA_F_DADFAILED))
+			continue;
+		if (ipv6_addr_type(&ifa->addr) != IPV6_ADDR_UNICAST ||
+		    ipv6_addr_src_scope(&ifa->addr) != IPV6_ADDR_SCOPE_GLOBAL)
+			continue;
+		return &ifa->addr;
+	}
+
+	return NULL;
+}
+
+static void icmp6_ext_iio_iif_append(struct net *net, struct sk_buff *skb,
+				     int iif)
+{
+	struct icmp_ext_iio_name_subobj *name_subobj;
+	struct icmp_extobj_hdr *objh;
+	struct net_device *dev;
+	struct in6_addr *addr6;
+	__be32 data;
+
+	if (!iif)
+		return;
+
+	objh = skb_put(skb, sizeof(*objh));
+	objh->class_num = ICMP_EXT_OBJ_CLASS_IIO;
+	objh->class_type = ICMP_EXT_CTYPE_IIO_ROLE(ICMP_EXT_CTYPE_IIO_ROLE_IIF);
+
+	data = htonl(iif);
+	skb_put_data(skb, &data, sizeof(__be32));
+	objh->class_type |= ICMP_EXT_CTYPE_IIO_IFINDEX;
+
+	rcu_read_lock();
+
+	dev = dev_get_by_index_rcu(net, iif);
+	if (!dev)
+		goto out;
+
+	addr6 = icmp6_ext_iio_addr6_find(dev);
+	if (addr6) {
+		struct icmp6_ext_iio_addr6_subobj *addr6_subobj;
+
+		addr6_subobj = skb_put_zero(skb, sizeof(*addr6_subobj));
+		addr6_subobj->afi = htons(ICMP_AFI_IP6);
+		addr6_subobj->addr6 = *addr6;
+		objh->class_type |= ICMP_EXT_CTYPE_IIO_IPADDR;
+	}
+
+	name_subobj = skb_put_zero(skb, ALIGN(sizeof(*name_subobj), 4));
+	name_subobj->len = ALIGN(sizeof(*name_subobj), 4);
+	netdev_copy_name(dev, name_subobj->name);
+	objh->class_type |= ICMP_EXT_CTYPE_IIO_NAME;
+
+	data = htonl(READ_ONCE(dev->mtu));
+	skb_put_data(skb, &data, sizeof(__be32));
+	objh->class_type |= ICMP_EXT_CTYPE_IIO_MTU;
+
+out:
+	rcu_read_unlock();
+	objh->length = htons(skb_tail_pointer(skb) - (unsigned char *)objh);
+}
+
+static void icmp6_ext_objs_append(struct net *net, struct sk_buff *skb,
+				  u8 ext_objs, int iif)
+{
+	if (ext_objs & BIT(ICMP_ERR_EXT_IIO_IIF))
+		icmp6_ext_iio_iif_append(net, skb, iif);
+}
+
+static struct sk_buff *
+icmp6_ext_append(struct net *net, struct sk_buff *skb_in,
+		 struct icmp6hdr *icmp6h, unsigned int room, int iif)
+{
+	unsigned int payload_len, ext_max_len, ext_len;
+	struct icmp_ext_hdr *ext_hdr;
+	struct sk_buff *skb;
+	u8 ext_objs;
+	int nhoff;
+
+	switch (icmp6h->icmp6_type) {
+	case ICMPV6_DEST_UNREACH:
+	case ICMPV6_TIME_EXCEED:
+		break;
+	default:
+		return NULL;
+	}
+
+	/* Do not overwrite existing extensions. This can happen when we
+	 * receive an ICMPv4 message with extensions from a tunnel and
+	 * translate it to an ICMPv6 message towards an IPv6 host in the
+	 * overlay network.
+	 */
+	if (icmp6h->icmp6_datagram_len)
+		return NULL;
+
+	ext_objs = READ_ONCE(net->ipv6.sysctl.icmpv6_errors_extension_mask);
+	if (!ext_objs)
+		return NULL;
+
+	ext_max_len = icmp6_ext_max_len(ext_objs);
+	if (ICMP_EXT_ORIG_DGRAM_MIN_LEN + ext_max_len > room)
+		return NULL;
+
+	skb = skb_clone(skb_in, GFP_ATOMIC);
+	if (!skb)
+		return NULL;
+
+	nhoff = skb_network_offset(skb);
+	payload_len = min(skb->len - nhoff, ICMP_EXT_ORIG_DGRAM_MIN_LEN);
+
+	if (!pskb_network_may_pull(skb, payload_len))
+		goto free_skb;
+
+	if (pskb_trim(skb, nhoff + ICMP_EXT_ORIG_DGRAM_MIN_LEN) ||
+	    __skb_put_padto(skb, nhoff + ICMP_EXT_ORIG_DGRAM_MIN_LEN, false))
+		goto free_skb;
+
+	if (pskb_expand_head(skb, 0, ext_max_len, GFP_ATOMIC))
+		goto free_skb;
+
+	ext_hdr = skb_put_zero(skb, sizeof(*ext_hdr));
+	ext_hdr->version = ICMP_EXT_VERSION_2;
+
+	icmp6_ext_objs_append(net, skb, ext_objs, iif);
+
+	/* Do not send an empty extension structure. */
+	ext_len = skb_tail_pointer(skb) - (unsigned char *)ext_hdr;
+	if (ext_len == sizeof(*ext_hdr))
+		goto free_skb;
+
+	ext_hdr->checksum = ip_compute_csum(ext_hdr, ext_len);
+	/* The length of the original datagram in 64-bit words (RFC 4884). */
+	icmp6h->icmp6_datagram_len = ICMP_EXT_ORIG_DGRAM_MIN_LEN / sizeof(u64);
+
+	return skb;
+
+free_skb:
+	consume_skb(skb);
+	return NULL;
+}
+
 /*
  *	Send an ICMP message in response to a packet in error
  */
@@ -458,7 +644,9 @@ void icmp6_send(struct sk_buff *skb, u8 type, u8 code, __u32 info,
 	struct ipv6_pinfo *np;
 	const struct in6_addr *saddr = NULL;
 	bool apply_ratelimit = false;
+	struct sk_buff *ext_skb;
 	struct dst_entry *dst;
+	unsigned int room;
 	struct icmp6hdr tmp_hdr;
 	struct flowi6 fl6;
 	struct icmpv6_msg msg;
@@ -612,8 +800,13 @@ void icmp6_send(struct sk_buff *skb, u8 type, u8 code, __u32 info,
 	msg.offset = skb_network_offset(skb);
 	msg.type = type;
 
-	len = skb->len - msg.offset;
-	len = min_t(unsigned int, len, IPV6_MIN_MTU - sizeof(struct ipv6hdr) - sizeof(struct icmp6hdr));
+	room = IPV6_MIN_MTU - sizeof(struct ipv6hdr) - sizeof(struct icmp6hdr);
+	ext_skb = icmp6_ext_append(net, skb, &tmp_hdr, room, parm->iif);
+	if (ext_skb)
+		msg.skb = ext_skb;
+
+	len = msg.skb->len - msg.offset;
+	len = min_t(unsigned int, len, room);
 	if (len < 0) {
 		net_dbg_ratelimited("icmp: len problem [%pI6c > %pI6c]\n",
 				    &hdr->saddr, &hdr->daddr);
@@ -635,6 +828,8 @@ void icmp6_send(struct sk_buff *skb, u8 type, u8 code, __u32 info,
 	}
 
 out_dst_release:
+	if (ext_skb)
+		consume_skb(ext_skb);
 	dst_release(dst);
 out_unlock:
 	icmpv6_xmit_unlock(sk);
@@ -1171,6 +1366,10 @@ int icmpv6_err_convert(u8 type, u8 code, int *err)
 EXPORT_SYMBOL(icmpv6_err_convert);
 
 #ifdef CONFIG_SYSCTL
+
+static u32 icmpv6_errors_extension_mask_all =
+	GENMASK_U8(ICMP_ERR_EXT_COUNT - 1, 0);
+
 static struct ctl_table ipv6_icmp_table_template[] = {
 	{
 		.procname	= "ratelimit",
@@ -1216,6 +1415,15 @@ static struct ctl_table ipv6_icmp_table_template[] = {
 		.extra1		= SYSCTL_ZERO,
 		.extra2		= SYSCTL_ONE,
 	},
+	{
+		.procname	= "errors_extension_mask",
+		.data		= &init_net.ipv6.sysctl.icmpv6_errors_extension_mask,
+		.maxlen		= sizeof(u8),
+		.mode		= 0644,
+		.proc_handler	= proc_dou8vec_minmax,
+		.extra1		= SYSCTL_ZERO,
+		.extra2		= &icmpv6_errors_extension_mask_all,
+	},
 };
 
 struct ctl_table * __net_init ipv6_icmp_sysctl_init(struct net *net)
@@ -1233,6 +1441,7 @@ struct ctl_table * __net_init ipv6_icmp_sysctl_init(struct net *net)
 		table[3].data = &net->ipv6.sysctl.icmpv6_echo_ignore_anycast;
 		table[4].data = &net->ipv6.sysctl.icmpv6_ratemask_ptr;
 		table[5].data = &net->ipv6.sysctl.icmpv6_error_anycast_as_unicast;
+		table[6].data = &net->ipv6.sysctl.icmpv6_errors_extension_mask;
 	}
 	return table;
 }
-- 
2.51.0



* [PATCH net-next 3/3] selftests: traceroute: Add ICMP extensions tests
  2025-10-22  6:53 [PATCH net-next 0/3] icmp: Add RFC 5837 support Ido Schimmel
  2025-10-22  6:53 ` [PATCH net-next 1/3] ipv4: " Ido Schimmel
  2025-10-22  6:53 ` [PATCH net-next 2/3] ipv6: " Ido Schimmel
@ 2025-10-22  6:53 ` Ido Schimmel
  2025-10-22 22:12   ` Willem de Bruijn
  2025-10-22 13:26 ` [PATCH net-next 0/3] icmp: Add RFC 5837 support Jakub Kicinski
  2025-10-22 17:29 ` David Ahern
  4 siblings, 1 reply; 21+ messages in thread
From: Ido Schimmel @ 2025-10-22  6:53 UTC (permalink / raw)
  To: netdev
  Cc: davem, kuba, pabeni, edumazet, horms, dsahern, petrm, willemb,
	daniel, fw, ishaangandhi, rbonica, tom, Ido Schimmel

Test that ICMP extensions are reported correctly when enabled and not
reported when disabled. Test both IPv4 and IPv6, with different packet
sizes, to make sure that trimming / padding works correctly.

Disable ICMP rate limiting (defaults to 1 per second per target) so that
the kernel always generates ICMP errors when needed.

Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
---
 tools/testing/selftests/net/traceroute.sh | 280 ++++++++++++++++++++++
 1 file changed, 280 insertions(+)

diff --git a/tools/testing/selftests/net/traceroute.sh b/tools/testing/selftests/net/traceroute.sh
index dbb34c7e09ce..a57c61bd0b25 100755
--- a/tools/testing/selftests/net/traceroute.sh
+++ b/tools/testing/selftests/net/traceroute.sh
@@ -59,6 +59,8 @@ create_ns()
 	ip netns exec ${ns} ip -6 ro add unreachable default metric 8192
 
 	ip netns exec ${ns} sysctl -qw net.ipv4.ip_forward=1
+	ip netns exec ${ns} sysctl -qw net.ipv4.icmp_ratelimit=0
+	ip netns exec ${ns} sysctl -qw net.ipv6.icmp.ratelimit=0
 	ip netns exec ${ns} sysctl -qw net.ipv6.conf.all.keep_addr_on_down=1
 	ip netns exec ${ns} sysctl -qw net.ipv6.conf.all.forwarding=1
 	ip netns exec ${ns} sysctl -qw net.ipv6.conf.default.forwarding=1
@@ -297,6 +299,142 @@ run_traceroute6_vrf()
 	cleanup_traceroute6_vrf
 }
 
+################################################################################
+# traceroute6 with ICMP extensions test
+#
+# Verify that in this scenario
+#
+# ----                          ----                          ----
+# |H1|--------------------------|R1|--------------------------|H2|
+# ----            N1            ----            N2            ----
+#
+# ICMP extensions are correctly reported. The loopback interfaces on all the
+# nodes are assigned global addresses and the interfaces connecting the nodes
+# are assigned IPv6 link-local addresses.
+
+cleanup_traceroute6_ext()
+{
+	cleanup_all_ns
+}
+
+setup_traceroute6_ext()
+{
+	# Start clean
+	cleanup_traceroute6_ext
+
+	setup_ns h1 r1 h2
+	create_ns "$h1"
+	create_ns "$r1"
+	create_ns "$h2"
+
+	# Setup N1
+	connect_ns "$h1" eth1 - fe80::1/64 "$r1" eth1 - fe80::2/64
+	# Setup N2
+	connect_ns "$r1" eth2 - fe80::3/64 "$h2" eth2 - fe80::4/64
+
+	# Setup H1
+	ip -n "$h1" address add 2001:db8:1::1/128 dev lo
+	ip -n "$h1" route add ::/0 nexthop via fe80::2 dev eth1
+
+	# Setup R1
+	ip -n "$r1" address add 2001:db8:1::2/128 dev lo
+	ip -n "$r1" route add 2001:db8:1::1/128 nexthop via fe80::1 dev eth1
+	ip -n "$r1" route add 2001:db8:1::3/128 nexthop via fe80::4 dev eth2
+
+	# Setup H2
+	ip -n "$h2" address add 2001:db8:1::3/128 dev lo
+	ip -n "$h2" route add ::/0 nexthop via fe80::3 dev eth2
+
+	# Prime the network
+	ip netns exec "$h1" ping6 -c5 2001:db8:1::3 >/dev/null 2>&1
+}
+
+traceroute6_ext_iio_iif_test()
+{
+	local r1_ifindex h2_ifindex
+	local pkt_len=$1; shift
+
+	# Test that incoming interface info is not appended by default.
+	run_cmd "$h1" "traceroute6 -e 2001:db8:1::3 $pkt_len | grep INC"
+	check_fail $? "Incoming interface info appended by default when it should not be"
+
+	# Test that the extension is appended when enabled.
+	run_cmd "$r1" "sysctl -w net.ipv6.icmp.errors_extension_mask=0x01"
+	check_err $? "Failed to enable incoming interface info extension on R1"
+
+	run_cmd "$h1" "traceroute6 -e 2001:db8:1::3 $pkt_len | grep INC"
+	check_err $? "Incoming interface info not appended after enable"
+
+	# Test that the extension is not appended when disabled.
+	run_cmd "$r1" "sysctl -w net.ipv6.icmp.errors_extension_mask=0x00"
+	check_err $? "Failed to disable incoming interface info extension on R1"
+
+	run_cmd "$h1" "traceroute6 -e 2001:db8:1::3 $pkt_len | grep INC"
+	check_fail $? "Incoming interface info appended after disable"
+
+	# Test that the extension is sent correctly from both R1 and H2.
+	run_cmd "$r1" "sysctl -w net.ipv6.icmp.errors_extension_mask=0x01"
+	r1_ifindex=$(ip -n "$r1" -j link show dev eth1 | jq '.[]["ifindex"]')
+	run_cmd "$h1" "traceroute6 -e 2001:db8:1::3 $pkt_len | grep '<INC:$r1_ifindex,\"eth1\",mtu=1500>'"
+	check_err $? "Wrong incoming interface info reported from R1"
+
+	run_cmd "$h2" "sysctl -w net.ipv6.icmp.errors_extension_mask=0x01"
+	h2_ifindex=$(ip -n "$h2" -j link show dev eth2 | jq '.[]["ifindex"]')
+	run_cmd "$h1" "traceroute6 -e 2001:db8:1::3 $pkt_len | grep '<INC:$h2_ifindex,\"eth2\",mtu=1500>'"
+	check_err $? "Wrong incoming interface info reported from H2"
+
+	# Add a global address on the incoming interface of R1 and check that
+	# it is reported.
+	run_cmd "$r1" "ip address add 2001:db8:100::1/64 dev eth1 nodad"
+	run_cmd "$h1" "traceroute6 -e 2001:db8:1::3 $pkt_len | grep '<INC:$r1_ifindex,2001:db8:100::1,\"eth1\",mtu=1500>'"
+	check_err $? "Wrong incoming interface info reported from R1 after address addition"
+	run_cmd "$r1" "ip address del 2001:db8:100::1/64 dev eth1"
+
+	# Change name and MTU and make sure the result is still correct.
+	run_cmd "$r1" "ip link set dev eth1 name eth1tag mtu 1501"
+	run_cmd "$h1" "traceroute6 -e 2001:db8:1::3 $pkt_len | grep '<INC:$r1_ifindex,\"eth1tag\",mtu=1501>'"
+	check_err $? "Wrong incoming interface info reported from R1 after name and MTU change"
+	run_cmd "$r1" "ip link set dev eth1tag name eth1 mtu 1500"
+
+	run_cmd "$r1" "sysctl -w net.ipv6.icmp.errors_extension_mask=0x00"
+	run_cmd "$h2" "sysctl -w net.ipv6.icmp.errors_extension_mask=0x00"
+}
+
+run_traceroute6_ext()
+{
+	if ! traceroute6 --help 2>&1 | grep -q "\--extensions"; then
+		log_test_skip "traceroute6 too old, missing ICMP extensions support"
+		return
+	fi
+
+	setup_traceroute6_ext
+
+	RET=0
+
+	## General ICMP extensions tests
+
+	# Test that ICMP extensions are disabled by default.
+	run_cmd "$h1" "sysctl net.ipv6.icmp.errors_extension_mask | grep \"= 0$\""
+	check_err $? "ICMP extensions are not disabled by default"
+
+	# Test that unsupported values are rejected.
+	run_cmd "$h1" "sysctl -w net.ipv6.icmp.errors_extension_mask=0x80"
+	check_fail $? "Unsupported sysctl value was not rejected"
+
+	## Extension-specific tests
+
+	# Incoming interface info test. Test with various packet sizes,
+	# including the default one.
+	traceroute6_ext_iio_iif_test
+	traceroute6_ext_iio_iif_test 127
+	traceroute6_ext_iio_iif_test 128
+	traceroute6_ext_iio_iif_test 129
+
+	log_test "IPv6 traceroute with ICMP extensions"
+
+	cleanup_traceroute6_ext
+}
+
 ################################################################################
 # traceroute test
 #
@@ -437,6 +575,145 @@ run_traceroute_vrf()
 	cleanup_traceroute_vrf
 }
 
+################################################################################
+# traceroute with ICMP extensions test
+#
+# Verify that in this scenario
+#
+# ----                          ----                          ----
+# |H1|--------------------------|R1|--------------------------|H2|
+# ----            N1            ----            N2            ----
+#
+# ICMP extensions are correctly reported. The loopback interfaces on all the
+# nodes are assigned global addresses and the interfaces connecting the nodes
+# are assigned IPv6 link-local addresses.
+
+cleanup_traceroute_ext()
+{
+	cleanup_all_ns
+}
+
+setup_traceroute_ext()
+{
+	# Start clean
+	cleanup_traceroute_ext
+
+	setup_ns h1 r1 h2
+	create_ns "$h1"
+	create_ns "$r1"
+	create_ns "$h2"
+
+	# Setup N1
+	connect_ns "$h1" eth1 - fe80::1/64 "$r1" eth1 - fe80::2/64
+	# Setup N2
+	connect_ns "$r1" eth2 - fe80::3/64 "$h2" eth2 - fe80::4/64
+
+	# Setup H1
+	ip -n "$h1" address add 192.0.2.1/32 dev lo
+	ip -n "$h1" route add 0.0.0.0/0 nexthop via inet6 fe80::2 dev eth1
+
+	# Setup R1
+	ip -n "$r1" address add 192.0.2.2/32 dev lo
+	ip -n "$r1" route add 192.0.2.1/32 nexthop via inet6 fe80::1 dev eth1
+	ip -n "$r1" route add 192.0.2.3/32 nexthop via inet6 fe80::4 dev eth2
+
+	# Setup H2
+	ip -n "$h2" address add 192.0.2.3/32 dev lo
+	ip -n "$h2" route add 0.0.0.0/0 nexthop via inet6 fe80::3 dev eth2
+
+	# Prime the network
+	ip netns exec "$h1" ping -c5 192.0.2.3 >/dev/null 2>&1
+}
+
+traceroute_ext_iio_iif_test()
+{
+	local r1_ifindex h2_ifindex
+	local pkt_len=$1; shift
+
+	# Test that incoming interface info is not appended by default.
+	run_cmd "$h1" "traceroute -e 192.0.2.3 $pkt_len | grep INC"
+	check_fail $? "Incoming interface info appended by default when it should not be"
+
+	# Test that the extension is appended when enabled.
+	run_cmd "$r1" "sysctl -w net.ipv4.icmp_errors_extension_mask=0x01"
+	check_err $? "Failed to enable incoming interface info extension on R1"
+
+	run_cmd "$h1" "traceroute -e 192.0.2.3 $pkt_len | grep INC"
+	check_err $? "Incoming interface info not appended after enable"
+
+	# Test that the extension is not appended when disabled.
+	run_cmd "$r1" "sysctl -w net.ipv4.icmp_errors_extension_mask=0x00"
+	check_err $? "Failed to disable incoming interface info extension on R1"
+
+	run_cmd "$h1" "traceroute -e 192.0.2.3 $pkt_len | grep INC"
+	check_fail $? "Incoming interface info appended after disable"
+
+	# Test that the extension is sent correctly from both R1 and H2.
+	run_cmd "$r1" "sysctl -w net.ipv4.icmp_errors_extension_mask=0x01"
+	r1_ifindex=$(ip -n "$r1" -j link show dev eth1 | jq '.[]["ifindex"]')
+	run_cmd "$h1" "traceroute -e 192.0.2.3 $pkt_len | grep '<INC:$r1_ifindex,\"eth1\",mtu=1500>'"
+	check_err $? "Wrong incoming interface info reported from R1"
+
+	run_cmd "$h2" "sysctl -w net.ipv4.icmp_errors_extension_mask=0x01"
+	h2_ifindex=$(ip -n "$h2" -j link show dev eth2 | jq '.[]["ifindex"]')
+	run_cmd "$h1" "traceroute -e 192.0.2.3 $pkt_len | grep '<INC:$h2_ifindex,\"eth2\",mtu=1500>'"
+	check_err $? "Wrong incoming interface info reported from H2"
+
+	# Add a global address on the incoming interface of R1 and check that
+	# it is reported.
+	run_cmd "$r1" "ip address add 198.51.100.1/24 dev eth1"
+	run_cmd "$h1" "traceroute -e 192.0.2.3 $pkt_len | grep '<INC:$r1_ifindex,198.51.100.1,\"eth1\",mtu=1500>'"
+	check_err $? "Wrong incoming interface info reported from R1 after address addition"
+	run_cmd "$r1" "ip address del 198.51.100.1/24 dev eth1"
+
+	# Change name and MTU and make sure the result is still correct.
+	# Re-add the route towards H1 since it was deleted when we removed the
+	# last IPv4 address from eth1 on R1.
+	run_cmd "$r1" "ip route add 192.0.2.1/32 nexthop via inet6 fe80::1 dev eth1"
+	run_cmd "$r1" "ip link set dev eth1 name eth1tag mtu 1501"
+	run_cmd "$h1" "traceroute -e 192.0.2.3 $pkt_len | grep '<INC:$r1_ifindex,\"eth1tag\",mtu=1501>'"
+	check_err $? "Wrong incoming interface info reported from R1 after name and MTU change"
+	run_cmd "$r1" "ip link set dev eth1tag name eth1 mtu 1500"
+
+	run_cmd "$r1" "sysctl -w net.ipv4.icmp_errors_extension_mask=0x00"
+	run_cmd "$h2" "sysctl -w net.ipv4.icmp_errors_extension_mask=0x00"
+}
+
+run_traceroute_ext()
+{
+	if ! traceroute --help 2>&1 | grep -q "\--extensions"; then
+		log_test_skip "traceroute too old, missing ICMP extensions support"
+		return
+	fi
+
+	setup_traceroute_ext
+
+	RET=0
+
+	## General ICMP extensions tests
+
+	# Test that ICMP extensions are disabled by default.
+	run_cmd "$h1" "sysctl net.ipv4.icmp_errors_extension_mask | grep \"= 0$\""
+	check_err $? "ICMP extensions are not disabled by default"
+
+	# Test that unsupported values are rejected.
+	run_cmd "$h1" "sysctl -w net.ipv4.icmp_errors_extension_mask=0x80"
+	check_fail $? "Unsupported sysctl value was not rejected"
+
+	## Extension-specific tests
+
+	# Incoming interface info test. Test with various packet sizes,
+	# including the default one.
+	traceroute_ext_iio_iif_test
+	traceroute_ext_iio_iif_test 127
+	traceroute_ext_iio_iif_test 128
+	traceroute_ext_iio_iif_test 129
+
+	log_test "IPv4 traceroute with ICMP extensions"
+
+	cleanup_traceroute_ext
+}
+
 ################################################################################
 # Run tests
 
@@ -444,8 +721,10 @@ run_tests()
 {
 	run_traceroute6
 	run_traceroute6_vrf
+	run_traceroute6_ext
 	run_traceroute
 	run_traceroute_vrf
+	run_traceroute_ext
 }
 
 ################################################################################
@@ -462,6 +741,7 @@ done
 
 require_command traceroute6
 require_command traceroute
+require_command jq
 
 run_tests
 
-- 
2.51.0



* Re: [PATCH net-next 0/3] icmp: Add RFC 5837 support
  2025-10-22  6:53 [PATCH net-next 0/3] icmp: Add RFC 5837 support Ido Schimmel
                   ` (2 preceding siblings ...)
  2025-10-22  6:53 ` [PATCH net-next 3/3] selftests: traceroute: Add ICMP extensions tests Ido Schimmel
@ 2025-10-22 13:26 ` Jakub Kicinski
  2025-10-22 13:58   ` Ido Schimmel
  2025-10-22 17:29 ` David Ahern
  4 siblings, 1 reply; 21+ messages in thread
From: Jakub Kicinski @ 2025-10-22 13:26 UTC (permalink / raw)
  To: Ido Schimmel
  Cc: netdev, davem, pabeni, edumazet, horms, dsahern, petrm, willemb,
	daniel, fw, ishaangandhi, rbonica, tom

On Wed, 22 Oct 2025 09:53:46 +0300 Ido Schimmel wrote:
> Testing
> =======
> 
> The existing traceroute selftest is extended to test that ICMP
> extensions are reported correctly when enabled. Both address families
> are tested and with different packet sizes in order to make sure that
> trimming / padding works correctly.

Do we need to update traceroute to make the test pass?


* Re: [PATCH net-next 0/3] icmp: Add RFC 5837 support
  2025-10-22 13:26 ` [PATCH net-next 0/3] icmp: Add RFC 5837 support Jakub Kicinski
@ 2025-10-22 13:58   ` Ido Schimmel
  2025-10-22 15:10     ` Jakub Kicinski
  0 siblings, 1 reply; 21+ messages in thread
From: Ido Schimmel @ 2025-10-22 13:58 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: netdev, davem, pabeni, edumazet, horms, dsahern, petrm, willemb,
	daniel, fw, ishaangandhi, rbonica, tom

On Wed, Oct 22, 2025 at 06:26:35AM -0700, Jakub Kicinski wrote:
> On Wed, 22 Oct 2025 09:53:46 +0300 Ido Schimmel wrote:
> > Testing
> > =======
> > 
> > The existing traceroute selftest is extended to test that ICMP
> > extensions are reported correctly when enabled. Both address families
> > are tested and with different packet sizes in order to make sure that
> > trimming / padding works correctly.
> 
> Do we need to update traceroute to make the test pass?

It shouldn't be necessary. There is a check to skip the test if
traceroute doesn't have the required functionality. I'm testing with
version 2.1.6 on Fedora 42.

If it's failing, can you please run the test with '-v' and paste the
output? I will try to see what's wrong. I didn't see any failures on my
end with both regular and debug configs.


* Re: [PATCH net-next 0/3] icmp: Add RFC 5837 support
  2025-10-22 13:58   ` Ido Schimmel
@ 2025-10-22 15:10     ` Jakub Kicinski
  2025-10-22 15:35       ` Ido Schimmel
  0 siblings, 1 reply; 21+ messages in thread
From: Jakub Kicinski @ 2025-10-22 15:10 UTC (permalink / raw)
  To: Ido Schimmel
  Cc: netdev, davem, pabeni, edumazet, horms, dsahern, petrm, willemb,
	daniel, fw, ishaangandhi, rbonica, tom

On Wed, 22 Oct 2025 16:58:45 +0300 Ido Schimmel wrote:
> On Wed, Oct 22, 2025 at 06:26:35AM -0700, Jakub Kicinski wrote:
> > On Wed, 22 Oct 2025 09:53:46 +0300 Ido Schimmel wrote:  
> > > Testing
> > > =======
> > > 
> > > The existing traceroute selftest is extended to test that ICMP
> > > extensions are reported correctly when enabled. Both address families
> > > are tested and with different packet sizes in order to make sure that
> > > trimming / padding works correctly.  
> > 
> > Do we need to update traceroute to make the test pass?  
> 
> It shouldn't be necessary. There is a check to skip the test if
> traceroute doesn't have the required functionality. I'm testing with
> version 2.1.6 on Fedora 42.
> 
> If it's failing, can you please run the test with '-v' and paste the
> output? I will try to see what's wrong. I didn't see any failures on my
> end with both regular and debug configs.

bash-5.2# traceroute -V
Modern traceroute for Linux, version 2.1.3
Copyright (c) 2016  Dmitry Butskoy,   License: GPL v2 or any later


# 19.86 [+19.86] TEST: IPv6 traceroute                                               [ OK ]
# 42.27 [+22.42] TEST: IPv6 traceroute with VRF                                      [ OK ]
# 74.83 [+32.55] TEST: IPv6 traceroute with ICMP extensions                          [FAIL]
# 74.83 [+0.00] Wrong incoming interface info reported from R1 after name and MTU change
# 92.09 [+17.26] TEST: IPv4 traceroute                                               [ OK ]
# 109.25 [+17.16] TEST: IPv4 traceroute with VRF                                      [ OK ]
# 143.04 [+33.79] TEST: IPv4 traceroute with ICMP extensions                          [FAIL]
# 143.04 [+0.00] Wrong incoming interface info reported from R1 after name and MTU change
not ok 1 selftests: net: traceroute.sh # exit=1


* Re: [PATCH net-next 0/3] icmp: Add RFC 5837 support
  2025-10-22 15:10     ` Jakub Kicinski
@ 2025-10-22 15:35       ` Ido Schimmel
  2025-10-23  0:38         ` Jakub Kicinski
  0 siblings, 1 reply; 21+ messages in thread
From: Ido Schimmel @ 2025-10-22 15:35 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: netdev, davem, pabeni, edumazet, horms, dsahern, petrm, willemb,
	daniel, fw, ishaangandhi, rbonica, tom

On Wed, Oct 22, 2025 at 08:10:04AM -0700, Jakub Kicinski wrote:
> On Wed, 22 Oct 2025 16:58:45 +0300 Ido Schimmel wrote:
> > On Wed, Oct 22, 2025 at 06:26:35AM -0700, Jakub Kicinski wrote:
> > > On Wed, 22 Oct 2025 09:53:46 +0300 Ido Schimmel wrote:  
> > > > Testing
> > > > =======
> > > > 
> > > > The existing traceroute selftest is extended to test that ICMP
> > > > extensions are reported correctly when enabled. Both address families
> > > > are tested and with different packet sizes in order to make sure that
> > > > trimming / padding works correctly.  
> > > 
> > > Do we need to update traceroute to make the test pass?  
> > 
> > It shouldn't be necessary. There is a check to skip the test if
> > traceroute doesn't have the required functionality. I'm testing with
> > version 2.1.6 on Fedora 42.
> > 
> > If it's failing, can you please run the test with '-v' and paste the
> > output? I will try to see what's wrong. I didn't see any failures on my
> > end with both regular and debug configs.
> 
> bash-5.2# traceroute -V
> Modern traceroute for Linux, version 2.1.3
> Copyright (c) 2016  Dmitry Butskoy,   License: GPL v2 or any later

It seems my check was not enough. I only checked that traceroute has the
'-e' option, but while version 2.1.3 supports ICMP extensions, it does
not support those defined in RFC 5837. For that you need at least
version 2.1.5.

I will change the test to require at least version 2.1.5. Can you please
update traceroute in the CI and see if it helps?
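
A minimal sketch of what such a version check could look like, assuming
that "traceroute -V" prints "version X.Y.Z" as in the output quoted above
and that "sort -V" (GNU coreutils) is available. The helper names
version_ge and traceroute_version are hypothetical and not part of the
posted patch:

```shell
#!/bin/bash
# Hypothetical helpers, not part of the posted patch.

# Return 0 if version $1 is greater than or equal to version $2.
# Relies on "sort -V" (version sort): the smaller version sorts first.
version_ge()
{
	[ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}

# Read "traceroute -V" output on stdin and print the "X.Y.Z" that follows
# the word "version" (assumed output format).
traceroute_version()
{
	sed -n 's/.*version \([0-9][0-9.]*\).*/\1/p' | head -n1
}

# Possible use in the selftest, replacing the "--extensions" grep:
#   ver=$(traceroute -V 2>&1 | traceroute_version)
#   if ! version_ge "$ver" 2.1.5; then
#           log_test_skip "traceroute too old, missing ICMP extensions support"
#           return
#   fi
```

Comparing with "sort -V" rather than a plain string comparison keeps a
version such as 2.1.10 ordered after 2.1.5.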


* Re: [PATCH net-next 0/3] icmp: Add RFC 5837 support
  2025-10-22  6:53 [PATCH net-next 0/3] icmp: Add RFC 5837 support Ido Schimmel
                   ` (3 preceding siblings ...)
  2025-10-22 13:26 ` [PATCH net-next 0/3] icmp: Add RFC 5837 support Jakub Kicinski
@ 2025-10-22 17:29 ` David Ahern
  2025-10-23 11:19   ` Ido Schimmel
  2025-10-23 13:39   ` Willem de Bruijn
  4 siblings, 2 replies; 21+ messages in thread
From: David Ahern @ 2025-10-22 17:29 UTC (permalink / raw)
  To: Ido Schimmel, netdev
  Cc: davem, kuba, pabeni, edumazet, horms, petrm, willemb, daniel, fw,
	ishaangandhi, rbonica, tom

On 10/22/25 12:53 AM, Ido Schimmel wrote:
> Testing
> =======
> 
> The existing traceroute selftest is extended to test that ICMP
> extensions are reported correctly when enabled. Both address families
> are tested and with different packet sizes in order to make sure that
> trimming / padding works correctly.
> 

For the set:
Reviewed-by: David Ahern <dsahern@kernel.org>

Did you try testing this with older kernel versions on the receiving
side of the ICMP packet? I.e., making sure older code does not have a bad
reaction to the extra data in the ICMP message.


* Re: [PATCH net-next 1/3] ipv4: icmp: Add RFC 5837 support
  2025-10-22  6:53 ` [PATCH net-next 1/3] ipv4: " Ido Schimmel
@ 2025-10-22 22:00   ` Willem de Bruijn
  2025-10-23  8:35     ` Ido Schimmel
  0 siblings, 1 reply; 21+ messages in thread
From: Willem de Bruijn @ 2025-10-22 22:00 UTC (permalink / raw)
  To: Ido Schimmel, netdev
  Cc: davem, kuba, pabeni, edumazet, horms, dsahern, petrm, willemb,
	daniel, fw, ishaangandhi, rbonica, tom, Ido Schimmel

Ido Schimmel wrote:
> Add the ability to append the incoming IP interface information to
> ICMPv4 error messages in accordance with RFC 5837 and RFC 4884. This is
> required for more meaningful traceroute results in unnumbered networks.
> 
> The feature is disabled by default and controlled via a new sysctl
> ("net.ipv4.icmp_errors_extension_mask") which accepts a bitmask of ICMP
> extensions to append to ICMP error messages. Currently, only a single
> value is supported, but the interface and the implementation should be
> able to support more extensions, if needed.
> 
> Clone the skb and copy the relevant data portions before modifying the
> skb as the caller of __icmp_send() still owns the skb after the function
> returns. This should be fine since by default ICMP error messages are
> rate limited to 1000 per second and no more than 1 per second per
> specific host.
> 
> Trim or pad the packet to 128 bytes before appending the ICMP extension
> structure in order to be compatible with legacy applications that assume
> that the ICMP extension structure always starts at this offset (the
> minimum length specified by RFC 4884).
> 
> Reviewed-by: Petr Machata <petrm@nvidia.com>
> Signed-off-by: Ido Schimmel <idosch@nvidia.com>
> ---
>  Documentation/networking/ip-sysctl.rst |  17 +++
>  include/linux/icmp.h                   |  32 +++++
>  include/net/netns/ipv4.h               |   1 +
>  net/ipv4/icmp.c                        | 190 ++++++++++++++++++++++++-
>  net/ipv4/sysctl_net_ipv4.c             |  11 ++
>  5 files changed, 250 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst
> index a06cb99d66dc..ece1187ba0f1 100644
> --- a/Documentation/networking/ip-sysctl.rst
> +++ b/Documentation/networking/ip-sysctl.rst
> @@ -1796,6 +1796,23 @@ icmp_errors_use_inbound_ifaddr - BOOLEAN
>  
>  	Default: 0 (disabled)
>  
> +icmp_errors_extension_mask - UNSIGNED INTEGER
> +	Bitmask of ICMP extensions to append to ICMPv4 error messages
> +	("Destination Unreachable", "Time Exceeded" and "Parameter Problem").
> +	The original datagram is trimmed / padded to 128 bytes in order to be
> +	compatible with applications that do not comply with RFC 4884.
> +
> +	Possible extensions are:
> +
> +	==== ==============================================================
> +	0x01 Incoming IP interface information according to RFC 5837.
> +	     Extension will include the index, IPv4 address (if present),
> +	     name and MTU of the IP interface that received the datagram
> +	     which elicited the ICMP error.
> +	==== ==============================================================
> +
> +	Default: 0x00 (no extensions)
> +
>  igmp_max_memberships - INTEGER
>  	Change the maximum number of multicast groups we can subscribe to.
>  	Default: 20
> diff --git a/include/linux/icmp.h b/include/linux/icmp.h
> index 0af4d210ee31..043ec5d9c882 100644
> --- a/include/linux/icmp.h
> +++ b/include/linux/icmp.h
> @@ -40,4 +40,36 @@ void ip_icmp_error_rfc4884(const struct sk_buff *skb,
>  			   struct sock_ee_data_rfc4884 *out,
>  			   int thlen, int off);
>  
> +/* RFC 4884 */
> +#define ICMP_EXT_ORIG_DGRAM_MIN_LEN	128
> +#define ICMP_EXT_VERSION_2		2
> +
> +/* ICMP Extension Object Classes */
> +#define ICMP_EXT_OBJ_CLASS_IIO		2	/* RFC 5837 */
> +
> +/* Interface Information Object - RFC 5837 */
> +enum {
> +	ICMP_EXT_CTYPE_IIO_ROLE_IIF,
> +};
> +
> +#define ICMP_EXT_CTYPE_IIO_ROLE(ROLE)	((ROLE) << 6)
> +#define ICMP_EXT_CTYPE_IIO_MTU		BIT(0)
> +#define ICMP_EXT_CTYPE_IIO_NAME		BIT(1)
> +#define ICMP_EXT_CTYPE_IIO_IPADDR	BIT(2)
> +#define ICMP_EXT_CTYPE_IIO_IFINDEX	BIT(3)
> +
> +struct icmp_ext_iio_name_subobj {
> +	u8 len;
> +	char name[IFNAMSIZ];
> +};
> +
> +enum {
> +	/* RFC 5837 - Incoming IP Interface Role */
> +	ICMP_ERR_EXT_IIO_IIF,
> +	/* Add new constants above. Used by "icmp_errors_extension_mask"
> +	 * sysctl.
> +	 */
> +	ICMP_ERR_EXT_COUNT,
> +};
> +
>  #endif	/* _LINUX_ICMP_H */
> diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
> index 34eb3aecb3f2..0e96c90e56c6 100644
> --- a/include/net/netns/ipv4.h
> +++ b/include/net/netns/ipv4.h
> @@ -135,6 +135,7 @@ struct netns_ipv4 {
>  	u8 sysctl_icmp_echo_ignore_broadcasts;
>  	u8 sysctl_icmp_ignore_bogus_error_responses;
>  	u8 sysctl_icmp_errors_use_inbound_ifaddr;
> +	u8 sysctl_icmp_errors_extension_mask;
>  	int sysctl_icmp_ratelimit;
>  	int sysctl_icmp_ratemask;
>  	int sysctl_icmp_msgs_per_sec;
> diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
> index 1b7fb5d935ed..44c4deb9d9da 100644
> --- a/net/ipv4/icmp.c
> +++ b/net/ipv4/icmp.c
> @@ -582,6 +582,184 @@ static struct rtable *icmp_route_lookup(struct net *net, struct flowi4 *fl4,
>  	return ERR_PTR(err);
>  }
>  
> +struct icmp_ext_iio_addr4_subobj {
> +	__be16 afi;
> +	__be16 reserved;
> +	__be32 addr4;
> +};
> +
> +static unsigned int icmp_ext_iio_len(void)
> +{
> +	return sizeof(struct icmp_extobj_hdr) +
> +		/* ifIndex */
> +		sizeof(__be32) +
> +		/* Interface Address Sub-Object */
> +		sizeof(struct icmp_ext_iio_addr4_subobj) +
> +		/* Interface Name Sub-Object. Length must be a multiple of 4
> +		 * bytes.
> +		 */
> +		ALIGN(sizeof(struct icmp_ext_iio_name_subobj), 4) +
> +		/* MTU */
> +		sizeof(__be32);
> +}
> +
> +static unsigned int icmp_ext_max_len(u8 ext_objs)
> +{
> +	unsigned int ext_max_len;
> +
> +	ext_max_len = sizeof(struct icmp_ext_hdr);
> +
> +	if (ext_objs & BIT(ICMP_ERR_EXT_IIO_IIF))
> +		ext_max_len += icmp_ext_iio_len();
> +
> +	return ext_max_len;
> +}
> +
> +static __be32 icmp_ext_iio_addr4_find(const struct net_device *dev)
> +{
> +	struct in_device *in_dev;
> +	struct in_ifaddr *ifa;
> +
> +	in_dev = __in_dev_get_rcu(dev);
> +	if (!in_dev)
> +		return 0;
> +
> +	/* It is unclear from RFC 5837 which IP address should be chosen, but
> +	 * it makes sense to choose a global unicast address.

Is it possible for no such address to exist, and in that case should
one of the backup options be considered?

> +	 */
> +	in_dev_for_each_ifa_rcu(ifa, in_dev) {
> +		if (READ_ONCE(ifa->ifa_flags) & IFA_F_SECONDARY)
> +			continue;
> +		if (ifa->ifa_scope != RT_SCOPE_UNIVERSE ||
> +		    ipv4_is_multicast(ifa->ifa_address))
> +			continue;
> +		return ifa->ifa_address;
> +	}
> +
> +	return 0;
> +}
> +
> +static void icmp_ext_iio_iif_append(struct net *net, struct sk_buff *skb,
> +				    int iif)
> +{
> +	struct icmp_ext_iio_name_subobj *name_subobj;
> +	struct icmp_extobj_hdr *objh;
> +	struct net_device *dev;
> +	__be32 data;
> +
> +	if (!iif)
> +		return;
> +

Might be good to add a comment that field order is prescribed by the RFC.

> +	objh = skb_put(skb, sizeof(*objh));
> +	objh->class_num = ICMP_EXT_OBJ_CLASS_IIO;
> +	objh->class_type = ICMP_EXT_CTYPE_IIO_ROLE(ICMP_EXT_CTYPE_IIO_ROLE_IIF);
> +
> +	data = htonl(iif);
> +	skb_put_data(skb, &data, sizeof(__be32));
> +	objh->class_type |= ICMP_EXT_CTYPE_IIO_IFINDEX;
> +
> +	rcu_read_lock();
> +
> +	dev = dev_get_by_index_rcu(net, iif);
> +	if (!dev)
> +		goto out;
> +
> +	data = icmp_ext_iio_addr4_find(dev);
> +	if (data) {
> +		struct icmp_ext_iio_addr4_subobj *addr4_subobj;
> +
> +		addr4_subobj = skb_put_zero(skb, sizeof(*addr4_subobj));
> +		addr4_subobj->afi = htons(ICMP_AFI_IP);
> +		addr4_subobj->addr4 = data;
> +		objh->class_type |= ICMP_EXT_CTYPE_IIO_IPADDR;
> +	}
> +
> +	name_subobj = skb_put_zero(skb, ALIGN(sizeof(*name_subobj), 4));
> +	name_subobj->len = ALIGN(sizeof(*name_subobj), 4);
> +	netdev_copy_name(dev, name_subobj->name);
> +	objh->class_type |= ICMP_EXT_CTYPE_IIO_NAME;
> +
> +	data = htonl(READ_ONCE(dev->mtu));
> +	skb_put_data(skb, &data, sizeof(__be32));
> +	objh->class_type |= ICMP_EXT_CTYPE_IIO_MTU;
> +
> +out:
> +	rcu_read_unlock();
> +	objh->length = htons(skb_tail_pointer(skb) - (unsigned char *)objh);
> +}
> +
> +static void icmp_ext_objs_append(struct net *net, struct sk_buff *skb,
> +				 u8 ext_objs, int iif)
> +{
> +	if (ext_objs & BIT(ICMP_ERR_EXT_IIO_IIF))
> +		icmp_ext_iio_iif_append(net, skb, iif);
> +}
> +
> +static struct sk_buff *
> +icmp_ext_append(struct net *net, struct sk_buff *skb_in, struct icmphdr *icmph,
> +		unsigned int room, int iif)
> +{
> +	unsigned int payload_len, ext_max_len, ext_len;
> +	struct icmp_ext_hdr *ext_hdr;
> +	struct sk_buff *skb;
> +	u8 ext_objs;
> +	int nhoff;
> +
> +	switch (icmph->type) {
> +	case ICMP_DEST_UNREACH:
> +	case ICMP_TIME_EXCEEDED:
> +	case ICMP_PARAMETERPROB:
> +		break;
> +	default:
> +		return NULL;
> +	}
> +
> +	ext_objs = READ_ONCE(net->ipv4.sysctl_icmp_errors_extension_mask);
> +	if (!ext_objs)
> +		return NULL;
> +
> +	ext_max_len = icmp_ext_max_len(ext_objs);
> +	if (ICMP_EXT_ORIG_DGRAM_MIN_LEN + ext_max_len > room)
> +		return NULL;
> +
> +	skb = skb_clone(skb_in, GFP_ATOMIC);
> +	if (!skb)
> +		return NULL;
> +
> +	nhoff = skb_network_offset(skb);
> +	payload_len = min(skb->len - nhoff, ICMP_EXT_ORIG_DGRAM_MIN_LEN);
> +
> +	if (!pskb_network_may_pull(skb, payload_len))
> +		goto free_skb;
> +
> +	if (pskb_trim(skb, nhoff + ICMP_EXT_ORIG_DGRAM_MIN_LEN) ||
> +	    __skb_put_padto(skb, nhoff + ICMP_EXT_ORIG_DGRAM_MIN_LEN, false))
> +		goto free_skb;
> +
> +	if (pskb_expand_head(skb, 0, ext_max_len, GFP_ATOMIC))
> +		goto free_skb;
> +
> +	ext_hdr = skb_put_zero(skb, sizeof(*ext_hdr));
> +	ext_hdr->version = ICMP_EXT_VERSION_2;
> +
> +	icmp_ext_objs_append(net, skb, ext_objs, iif);
> +
> +	/* Do not send an empty extension structure. */
> +	ext_len = skb_tail_pointer(skb) - (unsigned char *)ext_hdr;
> +	if (ext_len == sizeof(*ext_hdr))
> +		goto free_skb;
> +
> +	ext_hdr->checksum = ip_compute_csum(ext_hdr, ext_len);
> +	/* The length of the original datagram in 32-bit words (RFC 4884). */
> +	icmph->un.reserved[1] = ICMP_EXT_ORIG_DGRAM_MIN_LEN / sizeof(u32);
> +
> +	return skb;
> +
> +free_skb:
> +	consume_skb(skb);
> +	return NULL;
> +}
> +
>  /*
>   *	Send an ICMP message in response to a situation
>   *
> @@ -601,6 +779,7 @@ void __icmp_send(struct sk_buff *skb_in, int type, int code, __be32 info,
>  	struct icmp_bxm icmp_param;
>  	struct rtable *rt = skb_rtable(skb_in);
>  	bool apply_ratelimit = false;
> +	struct sk_buff *ext_skb;
>  	struct ipcm_cookie ipc;
>  	struct flowi4 fl4;
>  	__be32 saddr;
> @@ -770,7 +949,12 @@ void __icmp_send(struct sk_buff *skb_in, int type, int code, __be32 info,
>  	if (room <= (int)sizeof(struct iphdr))
>  		goto ende;
>  
> -	icmp_param.data_len = skb_in->len - icmp_param.offset;
> +	ext_skb = icmp_ext_append(net, skb_in, &icmp_param.data.icmph, room,
> +				  parm->iif);
> +	if (ext_skb)
> +		icmp_param.skb = ext_skb;
> +
> +	icmp_param.data_len = icmp_param.skb->len - icmp_param.offset;
>  	if (icmp_param.data_len > room)
>  		icmp_param.data_len = room;
>  	icmp_param.head_len = sizeof(struct icmphdr);
> @@ -785,6 +969,9 @@ void __icmp_send(struct sk_buff *skb_in, int type, int code, __be32 info,
>  	trace_icmp_send(skb_in, type, code);
>  
>  	icmp_push_reply(sk, &icmp_param, &fl4, &ipc, &rt);
> +
> +	if (ext_skb)
> +		consume_skb(ext_skb);
>  ende:
>  	ip_rt_put(rt);
>  out_unlock:
> @@ -1502,6 +1689,7 @@ static int __net_init icmp_sk_init(struct net *net)
>  	net->ipv4.sysctl_icmp_ratelimit = 1 * HZ;
>  	net->ipv4.sysctl_icmp_ratemask = 0x1818;
>  	net->ipv4.sysctl_icmp_errors_use_inbound_ifaddr = 0;
> +	net->ipv4.sysctl_icmp_errors_extension_mask = 0;
>  	net->ipv4.sysctl_icmp_msgs_per_sec = 1000;
>  	net->ipv4.sysctl_icmp_msgs_burst = 50;
>  
> diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
> index 24dbc603cc44..0c7c8f9041cb 100644
> --- a/net/ipv4/sysctl_net_ipv4.c
> +++ b/net/ipv4/sysctl_net_ipv4.c
> @@ -48,6 +48,8 @@ static int tcp_plb_max_rounds = 31;
>  static int tcp_plb_max_cong_thresh = 256;
>  static unsigned int tcp_tw_reuse_delay_max = TCP_PAWS_MSL * MSEC_PER_SEC;
>  static int tcp_ecn_mode_max = 2;
> +static u32 icmp_errors_extension_mask_all =
> +	GENMASK_U8(ICMP_ERR_EXT_COUNT - 1, 0);
>  
>  /* obsolete */
>  static int sysctl_tcp_low_latency __read_mostly;
> @@ -674,6 +676,15 @@ static struct ctl_table ipv4_net_table[] = {
>  		.extra1		= SYSCTL_ZERO,
>  		.extra2		= SYSCTL_ONE
>  	},
> +	{
> +		.procname	= "icmp_errors_extension_mask",
> +		.data		= &init_net.ipv4.sysctl_icmp_errors_extension_mask,
> +		.maxlen		= sizeof(u8),
> +		.mode		= 0644,
> +		.proc_handler	= proc_dou8vec_minmax,
> +		.extra1		= SYSCTL_ZERO,
> +		.extra2		= &icmp_errors_extension_mask_all,
> +	},
>  	{
>  		.procname	= "icmp_ratelimit",
>  		.data		= &init_net.ipv4.sysctl_icmp_ratelimit,
> -- 
> 2.51.0
> 




* Re: [PATCH net-next 3/3] selftests: traceroute: Add ICMP extensions tests
  2025-10-22  6:53 ` [PATCH net-next 3/3] selftests: traceroute: Add ICMP extensions tests Ido Schimmel
@ 2025-10-22 22:12   ` Willem de Bruijn
  2025-10-23  9:09     ` Ido Schimmel
  2025-10-23 15:39     ` Ido Schimmel
  0 siblings, 2 replies; 21+ messages in thread
From: Willem de Bruijn @ 2025-10-22 22:12 UTC (permalink / raw)
  To: Ido Schimmel, netdev
  Cc: davem, kuba, pabeni, edumazet, horms, dsahern, petrm, willemb,
	daniel, fw, ishaangandhi, rbonica, tom, Ido Schimmel

Ido Schimmel wrote:
> Test that ICMP extensions are reported correctly when enabled and not
> reported when disabled. Test both IPv4 and IPv6 and using different
> packet sizes, to make sure trimming / padding works correctly.
> 
> Disable ICMP rate limiting (defaults to 1 per-second per-target) so that
> the kernel will always generate ICMP errors when needed.

This reminds me that when I added SOL_IP/IP_RECVERR_4884, the selftest
was not integrated into kselftests. Commit eba75c587e81 points to

https://github.com/wdebruij/kerneltools/blob/master/tests/recv_icmp_v2.c

It might be useful to verify that the kernel recv path that parses
RFC 4884 compliant ICMP messages correctly handles these RFC 4884
messages.

But traceroute parsing the data is sufficient validation that packet
generation is compliant with the RFCs.

> Reviewed-by: Petr Machata <petrm@nvidia.com>
> Signed-off-by: Ido Schimmel <idosch@nvidia.com>
> ---
>  tools/testing/selftests/net/traceroute.sh | 280 ++++++++++++++++++++++
>  1 file changed, 280 insertions(+)
> 
> diff --git a/tools/testing/selftests/net/traceroute.sh b/tools/testing/selftests/net/traceroute.sh
> index dbb34c7e09ce..a57c61bd0b25 100755
> --- a/tools/testing/selftests/net/traceroute.sh
> +++ b/tools/testing/selftests/net/traceroute.sh
> @@ -59,6 +59,8 @@ create_ns()
>  	ip netns exec ${ns} ip -6 ro add unreachable default metric 8192
>  
>  	ip netns exec ${ns} sysctl -qw net.ipv4.ip_forward=1
> +	ip netns exec ${ns} sysctl -qw net.ipv4.icmp_ratelimit=0
> +	ip netns exec ${ns} sysctl -qw net.ipv6.icmp.ratelimit=0
>  	ip netns exec ${ns} sysctl -qw net.ipv6.conf.all.keep_addr_on_down=1
>  	ip netns exec ${ns} sysctl -qw net.ipv6.conf.all.forwarding=1
>  	ip netns exec ${ns} sysctl -qw net.ipv6.conf.default.forwarding=1
> @@ -297,6 +299,142 @@ run_traceroute6_vrf()
>  	cleanup_traceroute6_vrf
>  }
>  
> +################################################################################
> +# traceroute6 with ICMP extensions test
> +#
> +# Verify that in this scenario
> +#
> +# ----                          ----                          ----
> +# |H1|--------------------------|R1|--------------------------|H2|
> +# ----            N1            ----            N2            ----
> +#
> +# ICMP extensions are correctly reported. The loopback interfaces on all the
> +# nodes are assigned global addresses and the interfaces connecting the nodes
> +# are assigned IPv6 link-local addresses.
> +
> +cleanup_traceroute6_ext()
> +{
> +	cleanup_all_ns
> +}
> +
> +setup_traceroute6_ext()
> +{
> +	# Start clean
> +	cleanup_traceroute6_ext
> +
> +	setup_ns h1 r1 h2
> +	create_ns "$h1"
> +	create_ns "$r1"
> +	create_ns "$h2"
> +
> +	# Setup N1
> +	connect_ns "$h1" eth1 - fe80::1/64 "$r1" eth1 - fe80::2/64
> +	# Setup N2
> +	connect_ns "$r1" eth2 - fe80::3/64 "$h2" eth2 - fe80::4/64
> +
> +	# Setup H1
> +	ip -n "$h1" address add 2001:db8:1::1/128 dev lo

nodad or not needed in this lo special case?

> +	ip -n "$h1" route add ::/0 nexthop via fe80::2 dev eth1
> +
> +	# Setup R1
> +	ip -n "$r1" address add 2001:db8:1::2/128 dev lo
> +	ip -n "$r1" route add 2001:db8:1::1/128 nexthop via fe80::1 dev eth1
> +	ip -n "$r1" route add 2001:db8:1::3/128 nexthop via fe80::4 dev eth2
> +
> +	# Setup H2
> +	ip -n "$h2" address add 2001:db8:1::3/128 dev lo
> +	ip -n "$h2" route add ::/0 nexthop via fe80::3 dev eth2
> +
> +	# Prime the network
> +	ip netns exec "$h1" ping6 -c5 2001:db8:1::3 >/dev/null 2>&1
> +}
> +
> +traceroute6_ext_iio_iif_test()
> +{
> +	local r1_ifindex h2_ifindex
> +	local pkt_len=$1; shift
> +
> +	# Test that incoming interface info is not appended by default.
> +	run_cmd "$h1" "traceroute6 -e 2001:db8:1::3 $pkt_len | grep INC"
> +	check_fail $? "Incoming interface info appended by default when should not"
> +
> +	# Test that the extension is appended when enabled.
> +	run_cmd "$r1" "sysctl -w net.ipv6.icmp.errors_extension_mask=0x01"
> +	check_err $? "Failed to enable incoming interface info extension on R1"
> +
> +	run_cmd "$h1" "traceroute6 -e 2001:db8:1::3 $pkt_len | grep INC"
> +	check_err $? "Incoming interface info not appended after enable"
> +
> +	# Test that the extension is not appended when disabled.
> +	run_cmd "$r1" "sysctl -w net.ipv6.icmp.errors_extension_mask=0x00"
> +	check_err $? "Failed to disable incoming interface info extension on R1"
> +
> +	run_cmd "$h1" "traceroute6 -e 2001:db8:1::3 $pkt_len | grep INC"
> +	check_fail $? "Incoming interface info appended after disable"
> +
> +	# Test that the extension is sent correctly from both R1 and H2.
> +	run_cmd "$r1" "sysctl -w net.ipv6.icmp.errors_extension_mask=0x01"
> +	r1_ifindex=$(ip -n "$r1" -j link show dev eth1 | jq '.[]["ifindex"]')
> +	run_cmd "$h1" "traceroute6 -e 2001:db8:1::3 $pkt_len | grep '<INC:$r1_ifindex,\"eth1\",mtu=1500>'"
> +	check_err $? "Wrong incoming interface info reported from R1"
> +
> +	run_cmd "$h2" "sysctl -w net.ipv6.icmp.errors_extension_mask=0x01"
> +	h2_ifindex=$(ip -n "$h2" -j link show dev eth2 | jq '.[]["ifindex"]')
> +	run_cmd "$h1" "traceroute6 -e 2001:db8:1::3 $pkt_len | grep '<INC:$h2_ifindex,\"eth2\",mtu=1500>'"
> +	check_err $? "Wrong incoming interface info reported from H2"
> +
> +	# Add a global address on the incoming interface of R1 and check that
> +	# it is reported.
> +	run_cmd "$r1" "ip address add 2001:db8:100::1/64 dev eth1 nodad"
> +	run_cmd "$h1" "traceroute6 -e 2001:db8:1::3 $pkt_len | grep '<INC:$r1_ifindex,2001:db8:100::1,\"eth1\",mtu=1500>'"
> +	check_err $? "Wrong incoming interface info reported from R1 after address addition"
> +	run_cmd "$r1" "ip address del 2001:db8:100::1/64 dev eth1"
> +
> +	# Change name and MTU and make sure the result is still correct.
> +	run_cmd "$r1" "ip link set dev eth1 name eth1tag mtu 1501"
> +	run_cmd "$h1" "traceroute6 -e 2001:db8:1::3 $pkt_len | grep '<INC:$r1_ifindex,\"eth1tag\",mtu=1501>'"
> +	check_err $? "Wrong incoming interface info reported from R1 after name and MTU change"
> +	run_cmd "$r1" "ip link set dev eth1tag name eth1 mtu 1500"
> +
> +	run_cmd "$r1" "sysctl -w net.ipv6.icmp.errors_extension_mask=0x00"
> +	run_cmd "$h2" "sysctl -w net.ipv6.icmp.errors_extension_mask=0x00"
> +}
> +
> +run_traceroute6_ext()
> +{
> +	if ! traceroute6 --help 2>&1 | grep -q "\--extensions"; then
> +		log_test_skip "traceroute6 too old, missing ICMP extensions support"
> +		return
> +	fi
> +
> +	setup_traceroute6_ext
> +
> +	RET=0
> +
> +	## General ICMP extensions tests
> +
> +	# Test that ICMP extensions are disabled by default.
> +	run_cmd "$h1" "sysctl net.ipv6.icmp.errors_extension_mask | grep \"= 0$\""
> +	check_err $? "ICMP extensions are not disabled by default"
> +
> +	# Test that unsupported values are rejected.
> +	run_cmd "$h1" "sysctl -w net.ipv6.icmp.errors_extension_mask=0x80"
> +	check_fail $? "Unsupported sysctl value was not rejected"
> +
> +	## Extension-specific tests
> +
> +	# Incoming interface info test. Test with various packet sizes,
> +	# including the default one.
> +	traceroute6_ext_iio_iif_test
> +	traceroute6_ext_iio_iif_test 127
> +	traceroute6_ext_iio_iif_test 128
> +	traceroute6_ext_iio_iif_test 129
> +
> +	log_test "IPv6 traceroute with ICMP extensions"
> +
> +	cleanup_traceroute6_ext
> +}
> +
>  ################################################################################
>  # traceroute test
>  #
> @@ -437,6 +575,145 @@ run_traceroute_vrf()
>  	cleanup_traceroute_vrf
>  }
>  
> +################################################################################
> +# traceroute with ICMP extensions test
> +#
> +# Verify that in this scenario
> +#
> +# ----                          ----                          ----
> +# |H1|--------------------------|R1|--------------------------|H2|
> +# ----            N1            ----            N2            ----
> +#
> +# ICMP extensions are correctly reported. The loopback interfaces on all the
> +# nodes are assigned global addresses and the interfaces connecting the nodes
> +# are assigned IPv6 link-local addresses.
> +
> +cleanup_traceroute_ext()
> +{
> +	cleanup_all_ns
> +}
> +
> +setup_traceroute_ext()
> +{
> +	# Start clean
> +	cleanup_traceroute_ext
> +
> +	setup_ns h1 r1 h2
> +	create_ns "$h1"
> +	create_ns "$r1"
> +	create_ns "$h2"
> +
> +	# Setup N1
> +	connect_ns "$h1" eth1 - fe80::1/64 "$r1" eth1 - fe80::2/64
> +	# Setup N2
> +	connect_ns "$r1" eth2 - fe80::3/64 "$h2" eth2 - fe80::4/64

Stray IPv6 addresses in this IPv4 test?

As a matter of fact, is it feasible to merge the IPv4 and IPv6 tests
with some basic variables like $TRACEROUTE, $SYSCTL_PATH and $ADDR?

(I appreciate that you spent more time looking at that, fine to leave
if it is not practical to do so.)

> +
> +	# Setup H1
> +	ip -n "$h1" address add 192.0.2.1/32 dev lo
> +	ip -n "$h1" route add 0.0.0.0/0 nexthop via inet6 fe80::2 dev eth1
> +
> +	# Setup R1
> +	ip -n "$r1" address add 192.0.2.2/32 dev lo
> +	ip -n "$r1" route add 192.0.2.1/32 nexthop via inet6 fe80::1 dev eth1
> +	ip -n "$r1" route add 192.0.2.3/32 nexthop via inet6 fe80::4 dev eth2
> +
> +	# Setup H2
> +	ip -n "$h2" address add 192.0.2.3/32 dev lo
> +	ip -n "$h2" route add 0.0.0.0/0 nexthop via inet6 fe80::3 dev eth2
> +
> +	# Prime the network
> +	ip netns exec "$h1" ping -c5 192.0.2.3 >/dev/null 2>&1
> +}
> +
> +traceroute_ext_iio_iif_test()
> +{
> +	local r1_ifindex h2_ifindex
> +	local pkt_len=$1; shift
> +
> +	# Test that incoming interface info is not appended by default.
> +	run_cmd "$h1" "traceroute -e 192.0.2.3 $pkt_len | grep INC"
> +	check_fail $? "Incoming interface info appended by default when should not"
> +
> +	# Test that the extension is appended when enabled.
> +	run_cmd "$r1" "sysctl -w net.ipv4.icmp_errors_extension_mask=0x01"
> +	check_err $? "Failed to enable incoming interface info extension on R1"
> +
> +	run_cmd "$h1" "traceroute -e 192.0.2.3 $pkt_len | grep INC"
> +	check_err $? "Incoming interface info not appended after enable"
> +
> +	# Test that the extension is not appended when disabled.
> +	run_cmd "$r1" "sysctl -w net.ipv4.icmp_errors_extension_mask=0x00"
> +	check_err $? "Failed to disable incoming interface info extension on R1"
> +
> +	run_cmd "$h1" "traceroute -e 192.0.2.3 $pkt_len | grep INC"
> +	check_fail $? "Incoming interface info appended after disable"
> +
> +	# Test that the extension is sent correctly from both R1 and H2.
> +	run_cmd "$r1" "sysctl -w net.ipv4.icmp_errors_extension_mask=0x01"
> +	r1_ifindex=$(ip -n "$r1" -j link show dev eth1 | jq '.[]["ifindex"]')
> +	run_cmd "$h1" "traceroute -e 192.0.2.3 $pkt_len | grep '<INC:$r1_ifindex,\"eth1\",mtu=1500>'"
> +	check_err $? "Wrong incoming interface info reported from R1"
> +
> +	run_cmd "$h2" "sysctl -w net.ipv4.icmp_errors_extension_mask=0x01"
> +	h2_ifindex=$(ip -n "$h2" -j link show dev eth2 | jq '.[]["ifindex"]')
> +	run_cmd "$h1" "traceroute -e 192.0.2.3 $pkt_len | grep '<INC:$h2_ifindex,\"eth2\",mtu=1500>'"
> +	check_err $? "Wrong incoming interface info reported from H2"
> +
> +	# Add a global address on the incoming interface of R1 and check that
> +	# it is reported.
> +	run_cmd "$r1" "ip address add 198.51.100.1/24 dev eth1"
> +	run_cmd "$h1" "traceroute -e 192.0.2.3 $pkt_len | grep '<INC:$r1_ifindex,198.51.100.1,\"eth1\",mtu=1500>'"
> +	check_err $? "Wrong incoming interface info reported from R1 after address addition"
> +	run_cmd "$r1" "ip address del 198.51.100.1/24 dev eth1"
> +
> +	# Change name and MTU and make sure the result is still correct.
> +	# Re-add the route towards H1 since it was deleted when we removed the
> +	# last IPv4 address from eth1 on R1.
> +	run_cmd "$r1" "ip route add 192.0.2.1/32 nexthop via inet6 fe80::1 dev eth1"
> +	run_cmd "$r1" "ip link set dev eth1 name eth1tag mtu 1501"
> +	run_cmd "$h1" "traceroute -e 192.0.2.3 $pkt_len | grep '<INC:$r1_ifindex,\"eth1tag\",mtu=1501>'"
> +	check_err $? "Wrong incoming interface info reported from R1 after name and MTU change"
> +	run_cmd "$r1" "ip link set dev eth1tag name eth1 mtu 1500"
> +
> +	run_cmd "$r1" "sysctl -w net.ipv4.icmp_errors_extension_mask=0x00"
> +	run_cmd "$h2" "sysctl -w net.ipv4.icmp_errors_extension_mask=0x00"
> +}
> +
> +run_traceroute_ext()
> +{
> +	if ! traceroute --help 2>&1 | grep -q "\--extensions"; then
> +		log_test_skip "traceroute too old, missing ICMP extensions support"
> +		return
> +	fi
> +
> +	setup_traceroute_ext
> +
> +	RET=0
> +
> +	## General ICMP extensions tests
> +
> +	# Test that ICMP extensions are disabled by default.
> +	run_cmd "$h1" "sysctl net.ipv4.icmp_errors_extension_mask | grep \"= 0$\""
> +	check_err $? "ICMP extensions are not disabled by default"
> +
> +	# Test that unsupported values are rejected.
> +	run_cmd "$h1" "sysctl -w net.ipv4.icmp_errors_extension_mask=0x80"
> +	check_fail $? "Unsupported sysctl value was not rejected"
> +
> +	## Extension-specific tests
> +
> +	# Incoming interface info test. Test with various packet sizes,
> +	# including the default one.
> +	traceroute_ext_iio_iif_test
> +	traceroute_ext_iio_iif_test 127
> +	traceroute_ext_iio_iif_test 128
> +	traceroute_ext_iio_iif_test 129
> +
> +	log_test "IPv4 traceroute with ICMP extensions"
> +
> +	cleanup_traceroute_ext
> +}
> +
>  ################################################################################
>  # Run tests
>  
> @@ -444,8 +721,10 @@ run_tests()
>  {
>  	run_traceroute6
>  	run_traceroute6_vrf
> +	run_traceroute6_ext
>  	run_traceroute
>  	run_traceroute_vrf
> +	run_traceroute_ext
>  }
>  
>  ################################################################################
> @@ -462,6 +741,7 @@ done
>  
>  require_command traceroute6
>  require_command traceroute
> +require_command jq
>  
>  run_tests
>  
> -- 
> 2.51.0
> 




* Re: [PATCH net-next 0/3] icmp: Add RFC 5837 support
  2025-10-22 15:35       ` Ido Schimmel
@ 2025-10-23  0:38         ` Jakub Kicinski
  2025-10-24  1:48           ` Jakub Kicinski
  0 siblings, 1 reply; 21+ messages in thread
From: Jakub Kicinski @ 2025-10-23  0:38 UTC (permalink / raw)
  To: Ido Schimmel
  Cc: netdev, davem, pabeni, edumazet, horms, dsahern, petrm, willemb,
	daniel, fw, ishaangandhi, rbonica, tom

On Wed, 22 Oct 2025 18:35:23 +0300 Ido Schimmel wrote:
> I will change the test to require at least version 2.1.5. Can you please
> update traceroute in the CI and see if it helps?

Will do, but I'm a little behind on everything, so it may be tomorrow
or even Friday. So ignore NIPA for now; worst case we'll follow up.


* Re: [PATCH net-next 1/3] ipv4: icmp: Add RFC 5837 support
  2025-10-22 22:00   ` Willem de Bruijn
@ 2025-10-23  8:35     ` Ido Schimmel
  0 siblings, 0 replies; 21+ messages in thread
From: Ido Schimmel @ 2025-10-23  8:35 UTC (permalink / raw)
  To: Willem de Bruijn, rbonica
  Cc: netdev, davem, kuba, pabeni, edumazet, horms, dsahern, petrm,
	willemb, daniel, fw, ishaangandhi, tom

On Wed, Oct 22, 2025 at 06:00:28PM -0400, Willem de Bruijn wrote:
> Ido Schimmel wrote:
> > +static __be32 icmp_ext_iio_addr4_find(const struct net_device *dev)
> > +{
> > +	struct in_device *in_dev;
> > +	struct in_ifaddr *ifa;
> > +
> > +	in_dev = __in_dev_get_rcu(dev);
> > +	if (!in_dev)
> > +		return 0;
> > +
> > +	/* It is unclear from RFC 5837 which IP address should be chosen, but
> > +	 * it makes sense to choose a global unicast address.
> 
> Is it possible for no such address to exist, and in that case should
> one of the backup options be considered?

It is possible for no such address to exist, but I believe this
extension is mainly going to be used in unnumbered networks where router
interfaces do not have IPv4 addresses assigned to them. At least that is
the use case I'm interested in supporting. So, while we can try to fall
back to backup options, I think that in the common case they are not
going to exist.

Ron, do you have any opinion on this? I read RFC 5837 again and could
not find any hints about which IP address to choose. I suspect that is
because the IP address is the least interesting sub-object.

> 
> > +	 */
> > +	in_dev_for_each_ifa_rcu(ifa, in_dev) {
> > +		if (READ_ONCE(ifa->ifa_flags) & IFA_F_SECONDARY)
> > +			continue;
> > +		if (ifa->ifa_scope != RT_SCOPE_UNIVERSE ||
> > +		    ipv4_is_multicast(ifa->ifa_address))
> > +			continue;
> > +		return ifa->ifa_address;
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +static void icmp_ext_iio_iif_append(struct net *net, struct sk_buff *skb,
> > +				    int iif)
> > +{
> > +	struct icmp_ext_iio_name_subobj *name_subobj;
> > +	struct icmp_extobj_hdr *objh;
> > +	struct net_device *dev;
> > +	__be32 data;
> > +
> > +	if (!iif)
> > +		return;
> > +
> 
> Might be good to add a comment that field order is prescribed by the RFC.

Good idea. Will add. Looking at the traceroute code, I expect the
selftest to fail if someone changes this order, but it's good to be
explicit about it.
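
To make the ordering concrete, here is a minimal userspace sketch of how a
receiver could derive the sub-object order from the C-Type byte. The bit
values follow my reading of RFC 5837, section 4.1 (bit 0 is the most
significant bit). The macro, enum and function names are made up for
illustration and are not the kernel's definitions from this series.

```c
#include <stddef.h>
#include <stdint.h>

/* C-Type bits of the Interface Information Object, per RFC 5837.
 * These names are illustrative, not the ones used by the patch.
 */
#define IIO_CTYPE_ROLE_MASK	0xC0	/* bits 0-1: interface role */
#define IIO_CTYPE_IFINDEX	0x08	/* bit 4 */
#define IIO_CTYPE_IPADDR	0x04	/* bit 5 */
#define IIO_CTYPE_NAME		0x02	/* bit 6 */
#define IIO_CTYPE_MTU		0x01	/* bit 7 */

enum iio_subobj {
	IIO_SUB_IFINDEX,
	IIO_SUB_IPADDR,
	IIO_SUB_NAME,
	IIO_SUB_MTU,
};

/* Fill 'out' with the sub-objects present in the object, in the order
 * they appear on the wire: ifIndex, IP address, name, MTU.
 * Returns the number of entries written (at most 4).
 */
static size_t iio_subobj_order(uint8_t ctype, enum iio_subobj out[4])
{
	size_t n = 0;

	if (ctype & IIO_CTYPE_IFINDEX)
		out[n++] = IIO_SUB_IFINDEX;
	if (ctype & IIO_CTYPE_IPADDR)
		out[n++] = IIO_SUB_IPADDR;
	if (ctype & IIO_CTYPE_NAME)
		out[n++] = IIO_SUB_NAME;
	if (ctype & IIO_CTYPE_MTU)
		out[n++] = IIO_SUB_MTU;

	return n;
}

/* Interface role: 0 is the incoming IP interface. */
static unsigned int iio_role(uint8_t ctype)
{
	return (ctype & IIO_CTYPE_ROLE_MASK) >> 6;
}
```

With only ifIndex, name and MTU set (C-Type 0x0b), the order is ifIndex,
name, MTU, which matches the '<INC:ifindex,"eth1",mtu=1500>' lines that the
selftest greps for.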


* Re: [PATCH net-next 3/3] selftests: traceroute: Add ICMP extensions tests
  2025-10-22 22:12   ` Willem de Bruijn
@ 2025-10-23  9:09     ` Ido Schimmel
  2025-10-23 15:39     ` Ido Schimmel
  1 sibling, 0 replies; 21+ messages in thread
From: Ido Schimmel @ 2025-10-23  9:09 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: netdev, davem, kuba, pabeni, edumazet, horms, dsahern, petrm,
	willemb, daniel, fw, ishaangandhi, rbonica, tom

On Wed, Oct 22, 2025 at 06:12:13PM -0400, Willem de Bruijn wrote:
> Ido Schimmel wrote:
> > Test that ICMP extensions are reported correctly when enabled and not
> > reported when disabled. Test both IPv4 and IPv6 and using different
> > packet sizes, to make sure trimming / padding works correctly.
> > 
> > Disable ICMP rate limiting (defaults to 1 per-second per-target) so that
> > the kernel will always generate ICMP errors when needed.
> 
> This reminds me that when I added SOL_IP/IP_RECVERR_4884, the selftest
> was not integrated into kselftests. Commit eba75c587e81 points to
> 
> https://github.com/wdebruij/kerneltools/blob/master/tests/recv_icmp_v2.c

Yes, I saw that :)

> 
> It might be useful to verify that the kernel recv path that parses
> RFC 4884 compliant ICMP messages correctly handles these RFC 4884
> messages.
> 
> But traceroute parsing the data is sufficient validation that packet
> generation is compliant with the RFCs.

We plan to:

1. Add RFC 5837 support to tracepath using the socket options that you
added (instead of assuming that the ICMP extensions are at a fixed
offset like traceroute does).

2. Add a kernel selftest for these socket options. If you want to do
that yourself now that the kernel can generate ICMP extensions (assuming
the patches are accepted), that's fine too.

I already verified that traceroute, wireshark and tcpdump correctly
parse the ICMP messages generated by this series, so I don't expect to
encounter any problems when we integrate this with tracepath.
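
For reference, the check a userspace parser would typically run on the
extension structure before walking the objects can be sketched as follows.
This is only an illustration based on my reading of RFC 4884 (version 2 in
the top nibble, 16-bit Internet checksum over the whole structure, all-zero
checksum meaning "not transmitted"). The function names are hypothetical
and this is not what traceroute or tracepath actually do.

```c
#include <stddef.h>
#include <stdint.h>

/* One's complement Internet checksum over 'len' bytes. The result is in
 * host order and corresponds to the big-endian 16-bit field on the wire.
 */
static uint16_t inet_csum(const uint8_t *buf, size_t len)
{
	uint32_t sum = 0;
	size_t i;

	for (i = 0; i + 1 < len; i += 2)
		sum += ((uint32_t)buf[i] << 8) | buf[i + 1];
	if (len & 1)
		sum += (uint32_t)buf[len - 1] << 8;

	/* Fold the carries back into the low 16 bits. */
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);

	return (uint16_t)~sum;
}

/* 'ext' points at the 4-byte extension header, 'len' covers the header
 * and all the objects that follow. Returns 1 if the structure looks
 * valid, 0 otherwise.
 */
static int icmp_ext_hdr_ok(const uint8_t *ext, size_t len)
{
	if (len < 4)
		return 0;
	if ((ext[0] >> 4) != 2)		/* version must be 2 */
		return 0;
	if (ext[2] == 0 && ext[3] == 0)	/* no checksum transmitted */
		return 1;

	/* A correct checksum makes the sum over the whole structure fold
	 * to 0xffff, so its complement is zero.
	 */
	return inet_csum(ext, len) == 0;
}
```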

> 
> > Reviewed-by: Petr Machata <petrm@nvidia.com>
> > Signed-off-by: Ido Schimmel <idosch@nvidia.com>
> > ---
> >  tools/testing/selftests/net/traceroute.sh | 280 ++++++++++++++++++++++
> >  1 file changed, 280 insertions(+)
> > 
> > diff --git a/tools/testing/selftests/net/traceroute.sh b/tools/testing/selftests/net/traceroute.sh
> > index dbb34c7e09ce..a57c61bd0b25 100755
> > --- a/tools/testing/selftests/net/traceroute.sh
> > +++ b/tools/testing/selftests/net/traceroute.sh
> > @@ -59,6 +59,8 @@ create_ns()
> >  	ip netns exec ${ns} ip -6 ro add unreachable default metric 8192
> >  
> >  	ip netns exec ${ns} sysctl -qw net.ipv4.ip_forward=1
> > +	ip netns exec ${ns} sysctl -qw net.ipv4.icmp_ratelimit=0
> > +	ip netns exec ${ns} sysctl -qw net.ipv6.icmp.ratelimit=0
> >  	ip netns exec ${ns} sysctl -qw net.ipv6.conf.all.keep_addr_on_down=1
> >  	ip netns exec ${ns} sysctl -qw net.ipv6.conf.all.forwarding=1
> >  	ip netns exec ${ns} sysctl -qw net.ipv6.conf.default.forwarding=1
> > @@ -297,6 +299,142 @@ run_traceroute6_vrf()
> >  	cleanup_traceroute6_vrf
> >  }
> >  
> > +################################################################################
> > +# traceroute6 with ICMP extensions test
> > +#
> > +# Verify that in this scenario
> > +#
> > +# ----                          ----                          ----
> > +# |H1|--------------------------|R1|--------------------------|H2|
> > +# ----            N1            ----            N2            ----
> > +#
> > +# ICMP extensions are correctly reported. The loopback interfaces on all the
> > +# nodes are assigned global addresses and the interfaces connecting the nodes
> > +# are assigned IPv6 link-local addresses.
> > +
> > +cleanup_traceroute6_ext()
> > +{
> > +	cleanup_all_ns
> > +}
> > +
> > +setup_traceroute6_ext()
> > +{
> > +	# Start clean
> > +	cleanup_traceroute6_ext
> > +
> > +	setup_ns h1 r1 h2
> > +	create_ns "$h1"
> > +	create_ns "$r1"
> > +	create_ns "$h2"
> > +
> > +	# Setup N1
> > +	connect_ns "$h1" eth1 - fe80::1/64 "$r1" eth1 - fe80::2/64
> > +	# Setup N2
> > +	connect_ns "$r1" eth2 - fe80::3/64 "$h2" eth2 - fe80::4/64
> > +
> > +	# Setup H1
> > +	ip -n "$h1" address add 2001:db8:1::1/128 dev lo
> 
> nodad or not needed in this lo special case?

I believe IFF_LOOPBACK is equivalent to IFA_F_NODAD. See the check at
the beginning of addrconf_dad_begin().

> 
> > +	ip -n "$h1" route add ::/0 nexthop via fe80::2 dev eth1
> > +
> > +	# Setup R1
> > +	ip -n "$r1" address add 2001:db8:1::2/128 dev lo
> > +	ip -n "$r1" route add 2001:db8:1::1/128 nexthop via fe80::1 dev eth1
> > +	ip -n "$r1" route add 2001:db8:1::3/128 nexthop via fe80::4 dev eth2
> > +
> > +	# Setup H2
> > +	ip -n "$h2" address add 2001:db8:1::3/128 dev lo
> > +	ip -n "$h2" route add ::/0 nexthop via fe80::3 dev eth2
> > +
> > +	# Prime the network
> > +	ip netns exec "$h1" ping6 -c5 2001:db8:1::3 >/dev/null 2>&1
> > +}

[...]

> > +################################################################################
> > +# traceroute with ICMP extensions test
> > +#
> > +# Verify that in this scenario
> > +#
> > +# ----                          ----                          ----
> > +# |H1|--------------------------|R1|--------------------------|H2|
> > +# ----            N1            ----            N2            ----
> > +#
> > +# ICMP extensions are correctly reported. The loopback interfaces on all the
> > +# nodes are assigned global addresses and the interfaces connecting the nodes
> > +# are assigned IPv6 link-local addresses.
> > +
> > +cleanup_traceroute_ext()
> > +{
> > +	cleanup_all_ns
> > +}
> > +
> > +setup_traceroute_ext()
> > +{
> > +	# Start clean
> > +	cleanup_traceroute_ext
> > +
> > +	setup_ns h1 r1 h2
> > +	create_ns "$h1"
> > +	create_ns "$r1"
> > +	create_ns "$h2"
> > +
> > +	# Setup N1
> > +	connect_ns "$h1" eth1 - fe80::1/64 "$r1" eth1 - fe80::2/64
> > +	# Setup N2
> > +	connect_ns "$r1" eth2 - fe80::3/64 "$h2" eth2 - fe80::4/64
> 
> Stray IPv6 addresses in this IPv4 test?

No, that's intentional :) The use case I'm interested in supporting is
an unnumbered network where router interfaces are only assigned IPv6
link-local addresses and IPv4 routes are configured with IPv6 nexthops.
In these networks only the loopback / VRF interface is configured with
an IPv4 address.

In fact, there are networks where nodes do not have an IPv4 address at
all. In these networks ICMP messages will be generated with a source IP
of 192.0.0.8 (see INADDR_DUMMY in __icmp_send()). That's one motivation
for the Node Identification Object which we might support in the future:
https://datatracker.ietf.org/doc/html/draft-ietf-intarea-extended-icmp-nodeid-04

> As a matter of fact, is it feasible to merge the IPv4 and IPv6 tests
> with some basic variables like $TRACEROUTE, $SYSCTL_PATH and $ADDR?
> 
> (I appreciate that you spent more time looking at that, fine to leave
> if it is not practical to do so.)

Will look into it.

Thanks!


* Re: [PATCH net-next 0/3] icmp: Add RFC 5837 support
  2025-10-22 17:29 ` David Ahern
@ 2025-10-23 11:19   ` Ido Schimmel
  2025-10-23 13:39   ` Willem de Bruijn
  1 sibling, 0 replies; 21+ messages in thread
From: Ido Schimmel @ 2025-10-23 11:19 UTC (permalink / raw)
  To: David Ahern
  Cc: netdev, davem, kuba, pabeni, edumazet, horms, petrm, willemb,
	daniel, fw, ishaangandhi, rbonica, tom

On Wed, Oct 22, 2025 at 11:29:20AM -0600, David Ahern wrote:
> For the set:
> Reviewed-by: David Ahern <dsahern@kernel.org>

Thanks!

> Did you try testing this with an older kernel versions on the receiving
> side of the icmp packet? ie., making sure older code does not have a bad
> reaction to the extra data in the icmp.

I didn't touch the ICMP receive path so I don't expect any regressions
there. The only open question there is if the kernel will be able to
correctly parse the ICMP extensions when the IP{,6}_RECVERR_RFC4884
socket options are used, but like I told Willem, I don't expect any
problems given that traceroute, wireshark and tcpdump parse these
packets just fine [1].

[1] https://lore.kernel.org/netdev/aPnw2PkF3ZMP9EJr@shredder/
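
For anyone who wants to try the socket options mentioned above, a rough
sketch of enabling them on a socket follows. The fallback values 26 and 31
are the ones I believe the UAPI headers use, so treat them as assumptions
if your headers differ. The struct and function names are hypothetical.
The sketch also turns on IP_RECVERR / IPV6_RECVERR, on the assumption that
the error queue is needed for the RFC 4884 data to be delivered.

```c
#include <errno.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Older libc headers may not define these; values assumed from the
 * Linux UAPI headers.
 */
#ifndef IP_RECVERR_RFC4884
#define IP_RECVERR_RFC4884 26
#endif
#ifndef IPV6_RECVERR_RFC4884
#define IPV6_RECVERR_RFC4884 31
#endif

struct rfc4884_sockopt {
	int level;		/* IPPROTO_IP or IPPROTO_IPV6 */
	int recverr;		/* IP_RECVERR or IPV6_RECVERR */
	int recverr_rfc4884;	/* IP{,V6}_RECVERR_RFC4884 */
};

/* Pick the option names for the given address family. */
static int rfc4884_sockopt_for(int family, struct rfc4884_sockopt *opt)
{
	switch (family) {
	case AF_INET:
		opt->level = IPPROTO_IP;
		opt->recverr = IP_RECVERR;
		opt->recverr_rfc4884 = IP_RECVERR_RFC4884;
		return 0;
	case AF_INET6:
		opt->level = IPPROTO_IPV6;
		opt->recverr = IPV6_RECVERR;
		opt->recverr_rfc4884 = IPV6_RECVERR_RFC4884;
		return 0;
	default:
		return -EAFNOSUPPORT;
	}
}

/* Enable error queue reporting with RFC 4884 parsing on 'fd'.
 * Returns 0 on success or a negative errno value.
 */
static int rfc4884_enable(int fd, int family)
{
	struct rfc4884_sockopt opt;
	int one = 1;
	int err;

	err = rfc4884_sockopt_for(family, &opt);
	if (err)
		return err;

	if (setsockopt(fd, opt.level, opt.recverr, &one, sizeof(one)) ||
	    setsockopt(fd, opt.level, opt.recverr_rfc4884, &one, sizeof(one)))
		return -errno;

	return 0;
}
```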


* Re: [PATCH net-next 0/3] icmp: Add RFC 5837 support
  2025-10-22 17:29 ` David Ahern
  2025-10-23 11:19   ` Ido Schimmel
@ 2025-10-23 13:39   ` Willem de Bruijn
  1 sibling, 0 replies; 21+ messages in thread
From: Willem de Bruijn @ 2025-10-23 13:39 UTC (permalink / raw)
  To: David Ahern, Ido Schimmel, netdev
  Cc: davem, kuba, pabeni, edumazet, horms, petrm, willemb, daniel, fw,
	ishaangandhi, rbonica, tom

David Ahern wrote:
> On 10/22/25 12:53 AM, Ido Schimmel wrote:
> > Testing
> > =======
> > 
> > The existing traceroute selftest is extended to test that ICMP
> > extensions are reported correctly when enabled. Both address families
> > are tested and with different packet sizes in order to make sure that
> > trimming / padding works correctly.
> > 
> 
> For the set:
> Reviewed-by: David Ahern <dsahern@kernel.org>

Same

Reviewed-by: Willem de Bruijn <willemb@google.com>

Based on the answers to my questions, no changes are strictly needed.


* Re: [PATCH net-next 3/3] selftests: traceroute: Add ICMP extensions tests
  2025-10-22 22:12   ` Willem de Bruijn
  2025-10-23  9:09     ` Ido Schimmel
@ 2025-10-23 15:39     ` Ido Schimmel
  2025-10-23 21:25       ` Willem de Bruijn
  1 sibling, 1 reply; 21+ messages in thread
From: Ido Schimmel @ 2025-10-23 15:39 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: netdev, davem, kuba, pabeni, edumazet, horms, dsahern, petrm,
	willemb, daniel, fw, ishaangandhi, rbonica, tom

On Wed, Oct 22, 2025 at 06:12:13PM -0400, Willem de Bruijn wrote:
> Ido Schimmel wrote:
> > Test that ICMP extensions are reported correctly when enabled and not
> > reported when disabled. Test both IPv4 and IPv6 and using different
> > packet sizes, to make sure trimming / padding works correctly.
> > 
> > Disable ICMP rate limiting (defaults to 1 per-second per-target) so that
> > the kernel will always generate ICMP errors when needed.
> 
> This reminds me that when I added SOL_IP/IP_RECVERR_4884, the selftest
> was not integrated into kselftests. Commit eba75c587e81 points to
> 
> https://github.com/wdebruij/kerneltools/blob/master/tests/recv_icmp_v2.c
> 
> It might be useful to verify that the kernel recv path that parses
> RFC 4884 compliant ICMP messages correctly handles these RFC 4884
> messages.

FYI, I just ran this test with this series and it seems fine:

# sysctl -wq net.ipv4.icmp_errors_extension_mask=0x0
# sysctl -wq net.ipv6.icmp.errors_extension_mask=0x0
# ./recv_icmp_v2 

TEST(10, 0, 0)
len=0 ee_info=0x0, ee_data=0x0 rfc4884=(0, 0x0, 0)

TEST(10, 41, 31)
len=0 ee_info=0x0, ee_data=0x0 rfc4884=(0, 0x0, 0)

TEST(2, 0, 0)
len=0 ee_info=0x0, ee_data=0x0 rfc4884=(0, 0x0, 0)

TEST(2, 0, 26)
len=0 ee_info=0x0, ee_data=0x0 rfc4884=(0, 0x0, 0)
OK
# echo $?
0
# sysctl -wq net.ipv4.icmp_errors_extension_mask=0x1
# sysctl -wq net.ipv6.icmp.errors_extension_mask=0x1
# ./recv_icmp_v2 

TEST(10, 0, 0)
len=0 ee_info=0x10000000, ee_data=0x0 rfc4884=(0, 0x0, 0)

TEST(10, 41, 31)
len=0 ee_info=0x10000000, ee_data=0x50 rfc4884=(80, 0x0, 0)

TEST(2, 0, 0)
len=0 ee_info=0x0, ee_data=0x0 rfc4884=(0, 0x0, 0)

TEST(2, 0, 26)
len=0 ee_info=0x0, ee_data=0x64 rfc4884=(100, 0x0, 0)
OK
# echo $?
0

When the extensions are enabled and the RFC4884 socket options are used,
the offset to the extension structure relative to the beginning of the
UDP payload seems correct. In both cases the "original datagram" field
is 128 and if we remove the size of the headers from it we get the
offset to the extension structure:

IPv4: 128 - ipv4_hdr - udp_hdr = 128 - 20 - 8 = 100
IPv6: 128 - ipv6_hdr - udp_hdr = 128 - 40 - 8 = 80

In both cases SO_EE_RFC4884_FLAG_INVALID is not set.
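
The arithmetic above can be replayed as a small shell sketch. It assumes a 128-byte "original datagram" field, an IPv4 header without options, an IPv6 header without extension headers, and an 8-byte UDP header; the helper name ext_offset is made up for illustration and is not part of recv_icmp_v2 or the selftest:

```shell
#!/bin/sh
# Sketch only: recompute the offset of the ICMP extension structure
# relative to the start of the UDP payload, as in the text above.

orig_datagram_len=128   # padded "original datagram" length, from the thread
udp_hdr=8               # UDP header length in bytes

# ext_offset IP_HDR_LEN
# Print the extension structure offset for the given IP header length
# (assumes no IPv4 options and no IPv6 extension headers).
ext_offset() {
    echo $((orig_datagram_len - $1 - udp_hdr))
}

# Print both decimal and hex so the values can be compared with ee_data.
printf 'IPv4: %d (0x%x)\n' "$(ext_offset 20)" "$(ext_offset 20)"
printf 'IPv6: %d (0x%x)\n' "$(ext_offset 40)" "$(ext_offset 40)"
```

The hex values line up with the ee_data fields reported by the test run (0x64 for AF_INET, 0x50 for AF_INET6).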


* Re: [PATCH net-next 3/3] selftests: traceroute: Add ICMP extensions tests
  2025-10-23 15:39     ` Ido Schimmel
@ 2025-10-23 21:25       ` Willem de Bruijn
  0 siblings, 0 replies; 21+ messages in thread
From: Willem de Bruijn @ 2025-10-23 21:25 UTC (permalink / raw)
  To: Ido Schimmel, Willem de Bruijn
  Cc: netdev, davem, kuba, pabeni, edumazet, horms, dsahern, petrm,
	willemb, daniel, fw, ishaangandhi, rbonica, tom

Ido Schimmel wrote:
> On Wed, Oct 22, 2025 at 06:12:13PM -0400, Willem de Bruijn wrote:
> > Ido Schimmel wrote:
> > > Test that ICMP extensions are reported correctly when enabled and not
> > > reported when disabled. Test both IPv4 and IPv6 and using different
> > > packet sizes, to make sure trimming / padding works correctly.
> > > 
> > > Disable ICMP rate limiting (defaults to 1 per-second per-target) so that
> > > the kernel will always generate ICMP errors when needed.
> > 
> > This reminds me that when I added SOL_IP/IP_RECVERR_4884, the selftest
> > was not integrated into kselftests. Commit eba75c587e81 points to
> > 
> > https://github.com/wdebruij/kerneltools/blob/master/tests/recv_icmp_v2.c
> > 
> > It might be useful to verify that the kernel recv path that parses
> > RFC 4884 compliant ICMP messages correctly handles these RFC 4884
> > messages.
> 
> FYI, I just ran this test with this series and it seems fine:
> 
> # sysctl -wq net.ipv4.icmp_errors_extension_mask=0x0
> # sysctl -wq net.ipv6.icmp.errors_extension_mask=0x0
> # ./recv_icmp_v2 
> 
> TEST(10, 0, 0)
> len=0 ee_info=0x0, ee_data=0x0 rfc4884=(0, 0x0, 0)
> 
> TEST(10, 41, 31)
> len=0 ee_info=0x0, ee_data=0x0 rfc4884=(0, 0x0, 0)
> 
> TEST(2, 0, 0)
> len=0 ee_info=0x0, ee_data=0x0 rfc4884=(0, 0x0, 0)
> 
> TEST(2, 0, 26)
> len=0 ee_info=0x0, ee_data=0x0 rfc4884=(0, 0x0, 0)
> OK
> # echo $?
> 0
> # sysctl -wq net.ipv4.icmp_errors_extension_mask=0x1
> # sysctl -wq net.ipv6.icmp.errors_extension_mask=0x1
> # ./recv_icmp_v2 
> 
> TEST(10, 0, 0)
> len=0 ee_info=0x10000000, ee_data=0x0 rfc4884=(0, 0x0, 0)
> 
> TEST(10, 41, 31)
> len=0 ee_info=0x10000000, ee_data=0x50 rfc4884=(80, 0x0, 0)
> 
> TEST(2, 0, 0)
> len=0 ee_info=0x0, ee_data=0x0 rfc4884=(0, 0x0, 0)
> 
> TEST(2, 0, 26)
> len=0 ee_info=0x0, ee_data=0x64 rfc4884=(100, 0x0, 0)
> OK
> # echo $?
> 0
> 
> When the extensions are enabled and the RFC4884 socket options are used,
> the offset to the extension structure relative to the beginning of the
> UDP payload seems correct. In both cases the "original datagram" field
> is 128 and if we remove the size of the headers from it we get the
> offset to the extension structure:
> 
> IPv4: 128 - ipv4_hdr - udp_hdr = 128 - 20 - 8 = 100
> IPv6: 128 - ipv6_hdr - udp_hdr = 128 - 40 - 8 = 80
> 
> In both cases SO_EE_RFC4884_FLAG_INVALID is not set.

Oh excellent. Thanks for running that.


* Re: [PATCH net-next 0/3] icmp: Add RFC 5837 support
  2025-10-23  0:38         ` Jakub Kicinski
@ 2025-10-24  1:48           ` Jakub Kicinski
  2025-10-24 14:50             ` Ido Schimmel
  0 siblings, 1 reply; 21+ messages in thread
From: Jakub Kicinski @ 2025-10-24  1:48 UTC (permalink / raw)
  To: Ido Schimmel
  Cc: netdev, davem, pabeni, edumazet, horms, dsahern, petrm, willemb,
	daniel, fw, ishaangandhi, rbonica, tom

On Wed, 22 Oct 2025 17:38:43 -0700 Jakub Kicinski wrote:
> On Wed, 22 Oct 2025 18:35:23 +0300 Ido Schimmel wrote:
> > I will change the test to require at least version 2.1.5. Can you please
> > update traceroute in the CI and see if it helps?  
> 
> Will do but I'm a little behind on everything, so it may be tomorrow 
> or even Friday. So ignore NIPA for now, worst case we'll follow up.

Updated now, next run should have it updated.
Current one has traceroute updated but not traceroute6
https://netdev-3.bots.linux.dev/vmksft-net/results/354402/27-traceroute-sh/stdout

This doesn't sound related to the traceroute version tho:

# 34.51 [+6.48] TEST: IPv4 traceroute with ICMP extensions                          [FAIL]
# 34.51 [+0.00] Unsupported sysctl value was not rejected


* Re: [PATCH net-next 0/3] icmp: Add RFC 5837 support
  2025-10-24  1:48           ` Jakub Kicinski
@ 2025-10-24 14:50             ` Ido Schimmel
  2025-10-24 15:04               ` Jakub Kicinski
  0 siblings, 1 reply; 21+ messages in thread
From: Ido Schimmel @ 2025-10-24 14:50 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: netdev, davem, pabeni, edumazet, horms, dsahern, petrm, willemb,
	daniel, fw, ishaangandhi, rbonica, tom

On Thu, Oct 23, 2025 at 06:48:57PM -0700, Jakub Kicinski wrote:
> On Wed, 22 Oct 2025 17:38:43 -0700 Jakub Kicinski wrote:
> > On Wed, 22 Oct 2025 18:35:23 +0300 Ido Schimmel wrote:
> > > I will change the test to require at least version 2.1.5. Can you please
> > > update traceroute in the CI and see if it helps?  
> > 
> > Will do but I'm a little behind on everything, so it may be tomorrow 
> > or even Friday. So ignore NIPA for now, worst case we'll follow up.
> 
> Updated now, next run should have it updated.

Thanks!

> Current one has traceroute updated but not traceroute6
> https://netdev-3.bots.linux.dev/vmksft-net/results/354402/27-traceroute-sh/stdout
> 
> This doesn't sound related to the traceroute version tho:
> 
> # 34.51 [+6.48] TEST: IPv4 traceroute with ICMP extensions                          [FAIL]
> # 34.51 [+0.00] Unsupported sysctl value was not rejected

Hmm, which sysctl version do you have?

I have:

$ sysctl --version
sysctl from procps-ng 4.0.4

I just compiled 3.3.17 and I get:

# /home/idosch/tmp/procps-v3.3.17/sysctl -wq net.ipv4.icmp_errors_extension_mask=0x80
sysctl: setting key "net.ipv4.icmp_errors_extension_mask": Invalid argument
# echo $?
0

Which can explain the failure.

Please let me know if that's the issue and if you want me to replace
this with:

echo 0x80 > /proc/sys/net/ipv4/icmp_errors_extension_mask
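
The point of the replacement is that the shell sees the failed write directly instead of relying on the exit status of the sysctl binary, which is 0 with procps-ng 3.3.17 even though the write was refused. A minimal sketch of such a check, assuming a hypothetical helper named write_rejected (not taken from the selftest) and a shell whose echo reports a failed write through its exit status, as bash does:

```shell
#!/bin/sh
# Sketch only: check that a write is rejected without depending on the
# exit status of the sysctl binary.

# write_rejected PATH VALUE
# Return 0 if writing VALUE to PATH fails, 1 if the write succeeds.
write_rejected() {
    # Group the command so that both redirection errors and write errors
    # are silenced; only the exit status matters here.
    if { echo "$2" > "$1"; } 2>/dev/null; then
        return 1
    fi
    return 0
}

# Intended use (needs root and a kernel with this series applied):
#
#   write_rejected /proc/sys/net/ipv4/icmp_errors_extension_mask 0x80 ||
#       echo "Unsupported sysctl value was not rejected"
```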


* Re: [PATCH net-next 0/3] icmp: Add RFC 5837 support
  2025-10-24 14:50             ` Ido Schimmel
@ 2025-10-24 15:04               ` Jakub Kicinski
  0 siblings, 0 replies; 21+ messages in thread
From: Jakub Kicinski @ 2025-10-24 15:04 UTC (permalink / raw)
  To: Ido Schimmel
  Cc: netdev, davem, pabeni, edumazet, horms, dsahern, petrm, willemb,
	daniel, fw, ishaangandhi, rbonica, tom

On Fri, 24 Oct 2025 17:50:56 +0300 Ido Schimmel wrote:
> On Thu, Oct 23, 2025 at 06:48:57PM -0700, Jakub Kicinski wrote:
> > Current one has traceroute updated but not traceroute6
> > https://netdev-3.bots.linux.dev/vmksft-net/results/354402/27-traceroute-sh/stdout
> > 
> > This doesn't sound related to the traceroute version tho:
> > 
> > # 34.51 [+6.48] TEST: IPv4 traceroute with ICMP extensions                          [FAIL]
> > # 34.51 [+0.00] Unsupported sysctl value was not rejected  
> 
> Hmm, which sysctl version do you have?
> 
> I have:
> 
> $ sysctl --version
> sysctl from procps-ng 4.0.4
> 
> I just compiled 3.3.17 and I get:

$ sysctl --version
sysctl from procps-ng 3.3.17

Amazon Linux 🤷️

> # /home/idosch/tmp/procps-v3.3.17/sysctl -wq net.ipv4.icmp_errors_extension_mask=0x80
> sysctl: setting key "net.ipv4.icmp_errors_extension_mask": Invalid argument
> # echo $?
> 0
> 
> Which can explain the failure.
> 
> Please let me know if that's the issue and if you want me to replace
> this with:
> 
> echo 0x80 > /proc/sys/net/ipv4/icmp_errors_extension_mask

Let's go with the echo, if you don't mind.


Thread overview: 21+ messages
2025-10-22  6:53 [PATCH net-next 0/3] icmp: Add RFC 5837 support Ido Schimmel
2025-10-22  6:53 ` [PATCH net-next 1/3] ipv4: " Ido Schimmel
2025-10-22 22:00   ` Willem de Bruijn
2025-10-23  8:35     ` Ido Schimmel
2025-10-22  6:53 ` [PATCH net-next 2/3] ipv6: " Ido Schimmel
2025-10-22  6:53 ` [PATCH net-next 3/3] selftests: traceroute: Add ICMP extensions tests Ido Schimmel
2025-10-22 22:12   ` Willem de Bruijn
2025-10-23  9:09     ` Ido Schimmel
2025-10-23 15:39     ` Ido Schimmel
2025-10-23 21:25       ` Willem de Bruijn
2025-10-22 13:26 ` [PATCH net-next 0/3] icmp: Add RFC 5837 support Jakub Kicinski
2025-10-22 13:58   ` Ido Schimmel
2025-10-22 15:10     ` Jakub Kicinski
2025-10-22 15:35       ` Ido Schimmel
2025-10-23  0:38         ` Jakub Kicinski
2025-10-24  1:48           ` Jakub Kicinski
2025-10-24 14:50             ` Ido Schimmel
2025-10-24 15:04               ` Jakub Kicinski
2025-10-22 17:29 ` David Ahern
2025-10-23 11:19   ` Ido Schimmel
2025-10-23 13:39   ` Willem de Bruijn
