* [PATCH net-next v2 1/3] ipv4: icmp: Add RFC 5837 support
2025-10-27 8:22 [PATCH net-next v2 0/3] icmp: Add RFC 5837 support Ido Schimmel
@ 2025-10-27 8:22 ` Ido Schimmel
2025-10-27 8:35 ` Eric Dumazet
2025-11-07 17:33 ` David 'equinox' Lamparter
2025-10-27 8:22 ` [PATCH net-next v2 2/3] ipv6: " Ido Schimmel
` (3 subsequent siblings)
4 siblings, 2 replies; 15+ messages in thread
From: Ido Schimmel @ 2025-10-27 8:22 UTC (permalink / raw)
To: netdev
Cc: davem, kuba, pabeni, edumazet, horms, dsahern, petrm, willemb,
daniel, fw, ishaangandhi, rbonica, tom, Ido Schimmel
Add the ability to append the incoming IP interface information to
ICMPv4 error messages in accordance with RFC 5837 and RFC 4884. This is
required for more meaningful traceroute results in unnumbered networks.
The feature is disabled by default and controlled via a new sysctl
("net.ipv4.icmp_errors_extension_mask") which accepts a bitmask of ICMP
extensions to append to ICMP error messages. Currently, only a single
value is supported, but the interface and the implementation should be
able to support more extensions, if needed.
Clone the skb and copy the relevant data portions before modifying the
skb as the caller of __icmp_send() still owns the skb after the function
returns. This should be fine since by default ICMP error messages are
rate limited to 1000 per second and no more than 1 per second per
specific host.
Trim or pad the packet to 128 bytes before appending the ICMP extension
structure in order to be compatible with legacy applications that assume
that the ICMP extension structure always starts at this offset (the
minimum length specified by RFC 4884).
Reviewed-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
---
Notes:
v2:
* Add a comment about field ordering.
Documentation/networking/ip-sysctl.rst | 17 +++
include/linux/icmp.h | 32 +++++
include/net/netns/ipv4.h | 1 +
net/ipv4/icmp.c | 191 ++++++++++++++++++++++++-
net/ipv4/sysctl_net_ipv4.c | 11 ++
5 files changed, 251 insertions(+), 1 deletion(-)
diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst
index a06cb99d66dc..ece1187ba0f1 100644
--- a/Documentation/networking/ip-sysctl.rst
+++ b/Documentation/networking/ip-sysctl.rst
@@ -1796,6 +1796,23 @@ icmp_errors_use_inbound_ifaddr - BOOLEAN
Default: 0 (disabled)
+icmp_errors_extension_mask - UNSIGNED INTEGER
+ Bitmask of ICMP extensions to append to ICMPv4 error messages
+ ("Destination Unreachable", "Time Exceeded" and "Parameter Problem").
+ The original datagram is trimmed / padded to 128 bytes in order to be
+ compatible with applications that do not comply with RFC 4884.
+
+ Possible extensions are:
+
+ ==== ==============================================================
+ 0x01 Incoming IP interface information according to RFC 5837.
+ Extension will include the index, IPv4 address (if present),
+ name and MTU of the IP interface that received the datagram
+ which elicited the ICMP error.
+ ==== ==============================================================
+
+ Default: 0x00 (no extensions)
+
igmp_max_memberships - INTEGER
Change the maximum number of multicast groups we can subscribe to.
Default: 20
diff --git a/include/linux/icmp.h b/include/linux/icmp.h
index 0af4d210ee31..043ec5d9c882 100644
--- a/include/linux/icmp.h
+++ b/include/linux/icmp.h
@@ -40,4 +40,36 @@ void ip_icmp_error_rfc4884(const struct sk_buff *skb,
struct sock_ee_data_rfc4884 *out,
int thlen, int off);
+/* RFC 4884 */
+#define ICMP_EXT_ORIG_DGRAM_MIN_LEN 128
+#define ICMP_EXT_VERSION_2 2
+
+/* ICMP Extension Object Classes */
+#define ICMP_EXT_OBJ_CLASS_IIO 2 /* RFC 5837 */
+
+/* Interface Information Object - RFC 5837 */
+enum {
+ ICMP_EXT_CTYPE_IIO_ROLE_IIF,
+};
+
+#define ICMP_EXT_CTYPE_IIO_ROLE(ROLE) ((ROLE) << 6)
+#define ICMP_EXT_CTYPE_IIO_MTU BIT(0)
+#define ICMP_EXT_CTYPE_IIO_NAME BIT(1)
+#define ICMP_EXT_CTYPE_IIO_IPADDR BIT(2)
+#define ICMP_EXT_CTYPE_IIO_IFINDEX BIT(3)
+
+struct icmp_ext_iio_name_subobj {
+ u8 len;
+ char name[IFNAMSIZ];
+};
+
+enum {
+ /* RFC 5837 - Incoming IP Interface Role */
+ ICMP_ERR_EXT_IIO_IIF,
+ /* Add new constants above. Used by "icmp_errors_extension_mask"
+ * sysctl.
+ */
+ ICMP_ERR_EXT_COUNT,
+};
+
#endif /* _LINUX_ICMP_H */
diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index 34eb3aecb3f2..0e96c90e56c6 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -135,6 +135,7 @@ struct netns_ipv4 {
u8 sysctl_icmp_echo_ignore_broadcasts;
u8 sysctl_icmp_ignore_bogus_error_responses;
u8 sysctl_icmp_errors_use_inbound_ifaddr;
+ u8 sysctl_icmp_errors_extension_mask;
int sysctl_icmp_ratelimit;
int sysctl_icmp_ratemask;
int sysctl_icmp_msgs_per_sec;
diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
index 1b7fb5d935ed..4abbec2f47ef 100644
--- a/net/ipv4/icmp.c
+++ b/net/ipv4/icmp.c
@@ -582,6 +582,185 @@ static struct rtable *icmp_route_lookup(struct net *net, struct flowi4 *fl4,
return ERR_PTR(err);
}
+struct icmp_ext_iio_addr4_subobj {
+ __be16 afi;
+ __be16 reserved;
+ __be32 addr4;
+};
+
+static unsigned int icmp_ext_iio_len(void)
+{
+ return sizeof(struct icmp_extobj_hdr) +
+ /* ifIndex */
+ sizeof(__be32) +
+ /* Interface Address Sub-Object */
+ sizeof(struct icmp_ext_iio_addr4_subobj) +
+ /* Interface Name Sub-Object. Length must be a multiple of 4
+ * bytes.
+ */
+ ALIGN(sizeof(struct icmp_ext_iio_name_subobj), 4) +
+ /* MTU */
+ sizeof(__be32);
+}
+
+static unsigned int icmp_ext_max_len(u8 ext_objs)
+{
+ unsigned int ext_max_len;
+
+ ext_max_len = sizeof(struct icmp_ext_hdr);
+
+ if (ext_objs & BIT(ICMP_ERR_EXT_IIO_IIF))
+ ext_max_len += icmp_ext_iio_len();
+
+ return ext_max_len;
+}
+
+static __be32 icmp_ext_iio_addr4_find(const struct net_device *dev)
+{
+ struct in_device *in_dev;
+ struct in_ifaddr *ifa;
+
+ in_dev = __in_dev_get_rcu(dev);
+ if (!in_dev)
+ return 0;
+
+ /* It is unclear from RFC 5837 which IP address should be chosen, but
+ * it makes sense to choose a global unicast address.
+ */
+ in_dev_for_each_ifa_rcu(ifa, in_dev) {
+ if (READ_ONCE(ifa->ifa_flags) & IFA_F_SECONDARY)
+ continue;
+ if (ifa->ifa_scope != RT_SCOPE_UNIVERSE ||
+ ipv4_is_multicast(ifa->ifa_address))
+ continue;
+ return ifa->ifa_address;
+ }
+
+ return 0;
+}
+
+static void icmp_ext_iio_iif_append(struct net *net, struct sk_buff *skb,
+ int iif)
+{
+ struct icmp_ext_iio_name_subobj *name_subobj;
+ struct icmp_extobj_hdr *objh;
+ struct net_device *dev;
+ __be32 data;
+
+ if (!iif)
+ return;
+
+ /* Add the fields in the order specified by RFC 5837. */
+ objh = skb_put(skb, sizeof(*objh));
+ objh->class_num = ICMP_EXT_OBJ_CLASS_IIO;
+ objh->class_type = ICMP_EXT_CTYPE_IIO_ROLE(ICMP_EXT_CTYPE_IIO_ROLE_IIF);
+
+ data = htonl(iif);
+ skb_put_data(skb, &data, sizeof(__be32));
+ objh->class_type |= ICMP_EXT_CTYPE_IIO_IFINDEX;
+
+ rcu_read_lock();
+
+ dev = dev_get_by_index_rcu(net, iif);
+ if (!dev)
+ goto out;
+
+ data = icmp_ext_iio_addr4_find(dev);
+ if (data) {
+ struct icmp_ext_iio_addr4_subobj *addr4_subobj;
+
+ addr4_subobj = skb_put_zero(skb, sizeof(*addr4_subobj));
+ addr4_subobj->afi = htons(ICMP_AFI_IP);
+ addr4_subobj->addr4 = data;
+ objh->class_type |= ICMP_EXT_CTYPE_IIO_IPADDR;
+ }
+
+ name_subobj = skb_put_zero(skb, ALIGN(sizeof(*name_subobj), 4));
+ name_subobj->len = ALIGN(sizeof(*name_subobj), 4);
+ netdev_copy_name(dev, name_subobj->name);
+ objh->class_type |= ICMP_EXT_CTYPE_IIO_NAME;
+
+ data = htonl(READ_ONCE(dev->mtu));
+ skb_put_data(skb, &data, sizeof(__be32));
+ objh->class_type |= ICMP_EXT_CTYPE_IIO_MTU;
+
+out:
+ rcu_read_unlock();
+ objh->length = htons(skb_tail_pointer(skb) - (unsigned char *)objh);
+}
+
+static void icmp_ext_objs_append(struct net *net, struct sk_buff *skb,
+ u8 ext_objs, int iif)
+{
+ if (ext_objs & BIT(ICMP_ERR_EXT_IIO_IIF))
+ icmp_ext_iio_iif_append(net, skb, iif);
+}
+
+static struct sk_buff *
+icmp_ext_append(struct net *net, struct sk_buff *skb_in, struct icmphdr *icmph,
+ unsigned int room, int iif)
+{
+ unsigned int payload_len, ext_max_len, ext_len;
+ struct icmp_ext_hdr *ext_hdr;
+ struct sk_buff *skb;
+ u8 ext_objs;
+ int nhoff;
+
+ switch (icmph->type) {
+ case ICMP_DEST_UNREACH:
+ case ICMP_TIME_EXCEEDED:
+ case ICMP_PARAMETERPROB:
+ break;
+ default:
+ return NULL;
+ }
+
+ ext_objs = READ_ONCE(net->ipv4.sysctl_icmp_errors_extension_mask);
+ if (!ext_objs)
+ return NULL;
+
+ ext_max_len = icmp_ext_max_len(ext_objs);
+ if (ICMP_EXT_ORIG_DGRAM_MIN_LEN + ext_max_len > room)
+ return NULL;
+
+ skb = skb_clone(skb_in, GFP_ATOMIC);
+ if (!skb)
+ return NULL;
+
+ nhoff = skb_network_offset(skb);
+ payload_len = min(skb->len - nhoff, ICMP_EXT_ORIG_DGRAM_MIN_LEN);
+
+ if (!pskb_network_may_pull(skb, payload_len))
+ goto free_skb;
+
+ if (pskb_trim(skb, nhoff + ICMP_EXT_ORIG_DGRAM_MIN_LEN) ||
+ __skb_put_padto(skb, nhoff + ICMP_EXT_ORIG_DGRAM_MIN_LEN, false))
+ goto free_skb;
+
+ if (pskb_expand_head(skb, 0, ext_max_len, GFP_ATOMIC))
+ goto free_skb;
+
+ ext_hdr = skb_put_zero(skb, sizeof(*ext_hdr));
+ ext_hdr->version = ICMP_EXT_VERSION_2;
+
+ icmp_ext_objs_append(net, skb, ext_objs, iif);
+
+ /* Do not send an empty extension structure. */
+ ext_len = skb_tail_pointer(skb) - (unsigned char *)ext_hdr;
+ if (ext_len == sizeof(*ext_hdr))
+ goto free_skb;
+
+ ext_hdr->checksum = ip_compute_csum(ext_hdr, ext_len);
+ /* The length of the original datagram in 32-bit words (RFC 4884). */
+ icmph->un.reserved[1] = ICMP_EXT_ORIG_DGRAM_MIN_LEN / sizeof(u32);
+
+ return skb;
+
+free_skb:
+ consume_skb(skb);
+ return NULL;
+}
+
/*
* Send an ICMP message in response to a situation
*
@@ -601,6 +780,7 @@ void __icmp_send(struct sk_buff *skb_in, int type, int code, __be32 info,
struct icmp_bxm icmp_param;
struct rtable *rt = skb_rtable(skb_in);
bool apply_ratelimit = false;
+ struct sk_buff *ext_skb;
struct ipcm_cookie ipc;
struct flowi4 fl4;
__be32 saddr;
@@ -770,7 +950,12 @@ void __icmp_send(struct sk_buff *skb_in, int type, int code, __be32 info,
if (room <= (int)sizeof(struct iphdr))
goto ende;
- icmp_param.data_len = skb_in->len - icmp_param.offset;
+ ext_skb = icmp_ext_append(net, skb_in, &icmp_param.data.icmph, room,
+ parm->iif);
+ if (ext_skb)
+ icmp_param.skb = ext_skb;
+
+ icmp_param.data_len = icmp_param.skb->len - icmp_param.offset;
if (icmp_param.data_len > room)
icmp_param.data_len = room;
icmp_param.head_len = sizeof(struct icmphdr);
@@ -785,6 +970,9 @@ void __icmp_send(struct sk_buff *skb_in, int type, int code, __be32 info,
trace_icmp_send(skb_in, type, code);
icmp_push_reply(sk, &icmp_param, &fl4, &ipc, &rt);
+
+ if (ext_skb)
+ consume_skb(ext_skb);
ende:
ip_rt_put(rt);
out_unlock:
@@ -1502,6 +1690,7 @@ static int __net_init icmp_sk_init(struct net *net)
net->ipv4.sysctl_icmp_ratelimit = 1 * HZ;
net->ipv4.sysctl_icmp_ratemask = 0x1818;
net->ipv4.sysctl_icmp_errors_use_inbound_ifaddr = 0;
+ net->ipv4.sysctl_icmp_errors_extension_mask = 0;
net->ipv4.sysctl_icmp_msgs_per_sec = 1000;
net->ipv4.sysctl_icmp_msgs_burst = 50;
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 24dbc603cc44..0c7c8f9041cb 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -48,6 +48,8 @@ static int tcp_plb_max_rounds = 31;
static int tcp_plb_max_cong_thresh = 256;
static unsigned int tcp_tw_reuse_delay_max = TCP_PAWS_MSL * MSEC_PER_SEC;
static int tcp_ecn_mode_max = 2;
+static u32 icmp_errors_extension_mask_all =
+ GENMASK_U8(ICMP_ERR_EXT_COUNT - 1, 0);
/* obsolete */
static int sysctl_tcp_low_latency __read_mostly;
@@ -674,6 +676,15 @@ static struct ctl_table ipv4_net_table[] = {
.extra1 = SYSCTL_ZERO,
.extra2 = SYSCTL_ONE
},
+ {
+ .procname = "icmp_errors_extension_mask",
+ .data = &init_net.ipv4.sysctl_icmp_errors_extension_mask,
+ .maxlen = sizeof(u8),
+ .mode = 0644,
+ .proc_handler = proc_dou8vec_minmax,
+ .extra1 = SYSCTL_ZERO,
+ .extra2 = &icmp_errors_extension_mask_all,
+ },
{
.procname = "icmp_ratelimit",
.data = &init_net.ipv4.sysctl_icmp_ratelimit,
--
2.51.0
^ permalink raw reply related [flat|nested] 15+ messages in thread* Re: [PATCH net-next v2 1/3] ipv4: icmp: Add RFC 5837 support
2025-10-27 8:22 ` [PATCH net-next v2 1/3] ipv4: " Ido Schimmel
@ 2025-10-27 8:35 ` Eric Dumazet
2025-11-07 17:33 ` David 'equinox' Lamparter
1 sibling, 0 replies; 15+ messages in thread
From: Eric Dumazet @ 2025-10-27 8:35 UTC (permalink / raw)
To: Ido Schimmel
Cc: netdev, davem, kuba, pabeni, horms, dsahern, petrm, willemb,
daniel, fw, ishaangandhi, rbonica, tom
On Mon, Oct 27, 2025 at 1:24 AM Ido Schimmel <idosch@nvidia.com> wrote:
>
> Add the ability to append the incoming IP interface information to
> ICMPv4 error messages in accordance with RFC 5837 and RFC 4884. This is
> required for more meaningful traceroute results in unnumbered networks.
>
> The feature is disabled by default and controlled via a new sysctl
> ("net.ipv4.icmp_errors_extension_mask") which accepts a bitmask of ICMP
> extensions to append to ICMP error messages. Currently, only a single
> value is supported, but the interface and the implementation should be
> able to support more extensions, if needed.
>
> Clone the skb and copy the relevant data portions before modifying the
> skb as the caller of __icmp_send() still owns the skb after the function
> returns. This should be fine since by default ICMP error messages are
> rate limited to 1000 per second and no more than 1 per second per
> specific host.
>
> Trim or pad the packet to 128 bytes before appending the ICMP extension
> structure in order to be compatible with legacy applications that assume
> that the ICMP extension structure always starts at this offset (the
> minimum length specified by RFC 4884).
>
> Reviewed-by: Petr Machata <petrm@nvidia.com>
> Reviewed-by: David Ahern <dsahern@kernel.org>
> Reviewed-by: Willem de Bruijn <willemb@google.com>
> Signed-off-by: Ido Schimmel <idosch@nvidia.com>
> ---
>
> Notes:
> v2:
> * Add a comment about field ordering.
>
> Documentation/networking/ip-sysctl.rst | 17 +++
> include/linux/icmp.h | 32 +++++
> include/net/netns/ipv4.h | 1 +
> net/ipv4/icmp.c | 191 ++++++++++++++++++++++++-
> net/ipv4/sysctl_net_ipv4.c | 11 ++
> 5 files changed, 251 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst
> index a06cb99d66dc..ece1187ba0f1 100644
> --- a/Documentation/networking/ip-sysctl.rst
> +++ b/Documentation/networking/ip-sysctl.rst
> @@ -1796,6 +1796,23 @@ icmp_errors_use_inbound_ifaddr - BOOLEAN
>
> Default: 0 (disabled)
>
> +icmp_errors_extension_mask - UNSIGNED INTEGER
> + Bitmask of ICMP extensions to append to ICMPv4 error messages
> + ("Destination Unreachable", "Time Exceeded" and "Parameter Problem").
> + The original datagram is trimmed / padded to 128 bytes in order to be
> + compatible with applications that do not comply with RFC 4884.
> +
> + Possible extensions are:
> +
> + ==== ==============================================================
> + 0x01 Incoming IP interface information according to RFC 5837.
> + Extension will include the index, IPv4 address (if present),
> + name and MTU of the IP interface that received the datagram
> + which elicited the ICMP error.
> + ==== ==============================================================
> +
> + Default: 0x00 (no extensions)
> +
> igmp_max_memberships - INTEGER
> Change the maximum number of multicast groups we can subscribe to.
> Default: 20
> diff --git a/include/linux/icmp.h b/include/linux/icmp.h
> index 0af4d210ee31..043ec5d9c882 100644
> --- a/include/linux/icmp.h
> +++ b/include/linux/icmp.h
> @@ -40,4 +40,36 @@ void ip_icmp_error_rfc4884(const struct sk_buff *skb,
> struct sock_ee_data_rfc4884 *out,
> int thlen, int off);
>
> +/* RFC 4884 */
> +#define ICMP_EXT_ORIG_DGRAM_MIN_LEN 128
> +#define ICMP_EXT_VERSION_2 2
> +
> +/* ICMP Extension Object Classes */
> +#define ICMP_EXT_OBJ_CLASS_IIO 2 /* RFC 5837 */
> +
> +/* Interface Information Object - RFC 5837 */
> +enum {
> + ICMP_EXT_CTYPE_IIO_ROLE_IIF,
> +};
> +
> +#define ICMP_EXT_CTYPE_IIO_ROLE(ROLE) ((ROLE) << 6)
> +#define ICMP_EXT_CTYPE_IIO_MTU BIT(0)
> +#define ICMP_EXT_CTYPE_IIO_NAME BIT(1)
> +#define ICMP_EXT_CTYPE_IIO_IPADDR BIT(2)
> +#define ICMP_EXT_CTYPE_IIO_IFINDEX BIT(3)
> +
> +struct icmp_ext_iio_name_subobj {
> + u8 len;
> + char name[IFNAMSIZ];
> +};
> +
> +enum {
> + /* RFC 5837 - Incoming IP Interface Role */
> + ICMP_ERR_EXT_IIO_IIF,
> + /* Add new constants above. Used by "icmp_errors_extension_mask"
> + * sysctl.
> + */
> + ICMP_ERR_EXT_COUNT,
> +};
> +
> #endif /* _LINUX_ICMP_H */
> diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
> index 34eb3aecb3f2..0e96c90e56c6 100644
> --- a/include/net/netns/ipv4.h
> +++ b/include/net/netns/ipv4.h
> @@ -135,6 +135,7 @@ struct netns_ipv4 {
> u8 sysctl_icmp_echo_ignore_broadcasts;
> u8 sysctl_icmp_ignore_bogus_error_responses;
> u8 sysctl_icmp_errors_use_inbound_ifaddr;
> + u8 sysctl_icmp_errors_extension_mask;
> int sysctl_icmp_ratelimit;
> int sysctl_icmp_ratemask;
> int sysctl_icmp_msgs_per_sec;
> diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
> index 1b7fb5d935ed..4abbec2f47ef 100644
> --- a/net/ipv4/icmp.c
> +++ b/net/ipv4/icmp.c
> @@ -582,6 +582,185 @@ static struct rtable *icmp_route_lookup(struct net *net, struct flowi4 *fl4,
> return ERR_PTR(err);
> }
>
> +struct icmp_ext_iio_addr4_subobj {
> + __be16 afi;
> + __be16 reserved;
> + __be32 addr4;
> +};
> +
> +static unsigned int icmp_ext_iio_len(void)
> +{
> + return sizeof(struct icmp_extobj_hdr) +
> + /* ifIndex */
> + sizeof(__be32) +
> + /* Interface Address Sub-Object */
> + sizeof(struct icmp_ext_iio_addr4_subobj) +
> + /* Interface Name Sub-Object. Length must be a multiple of 4
> + * bytes.
> + */
> + ALIGN(sizeof(struct icmp_ext_iio_name_subobj), 4) +
> + /* MTU */
> + sizeof(__be32);
> +}
> +
> +static unsigned int icmp_ext_max_len(u8 ext_objs)
> +{
> + unsigned int ext_max_len;
> +
> + ext_max_len = sizeof(struct icmp_ext_hdr);
> +
> + if (ext_objs & BIT(ICMP_ERR_EXT_IIO_IIF))
> + ext_max_len += icmp_ext_iio_len();
> +
> + return ext_max_len;
> +}
> +
> +static __be32 icmp_ext_iio_addr4_find(const struct net_device *dev)
> +{
> + struct in_device *in_dev;
> + struct in_ifaddr *ifa;
> +
> + in_dev = __in_dev_get_rcu(dev);
> + if (!in_dev)
> + return 0;
> +
> + /* It is unclear from RFC 5837 which IP address should be chosen, but
> + * it makes sense to choose a global unicast address.
> + */
> + in_dev_for_each_ifa_rcu(ifa, in_dev) {
> + if (READ_ONCE(ifa->ifa_flags) & IFA_F_SECONDARY)
> + continue;
> + if (ifa->ifa_scope != RT_SCOPE_UNIVERSE ||
> + ipv4_is_multicast(ifa->ifa_address))
> + continue;
> + return ifa->ifa_address;
> + }
> +
> + return 0;
> +}
> +
> +static void icmp_ext_iio_iif_append(struct net *net, struct sk_buff *skb,
> + int iif)
> +{
> + struct icmp_ext_iio_name_subobj *name_subobj;
> + struct icmp_extobj_hdr *objh;
> + struct net_device *dev;
> + __be32 data;
> +
> + if (!iif)
> + return;
> +
> + /* Add the fields in the order specified by RFC 5837. */
> + objh = skb_put(skb, sizeof(*objh));
> + objh->class_num = ICMP_EXT_OBJ_CLASS_IIO;
> + objh->class_type = ICMP_EXT_CTYPE_IIO_ROLE(ICMP_EXT_CTYPE_IIO_ROLE_IIF);
> +
> + data = htonl(iif);
> + skb_put_data(skb, &data, sizeof(__be32));
> + objh->class_type |= ICMP_EXT_CTYPE_IIO_IFINDEX;
> +
> + rcu_read_lock();
> +
> + dev = dev_get_by_index_rcu(net, iif);
> + if (!dev)
> + goto out;
> +
> + data = icmp_ext_iio_addr4_find(dev);
> + if (data) {
> + struct icmp_ext_iio_addr4_subobj *addr4_subobj;
> +
> + addr4_subobj = skb_put_zero(skb, sizeof(*addr4_subobj));
> + addr4_subobj->afi = htons(ICMP_AFI_IP);
> + addr4_subobj->addr4 = data;
> + objh->class_type |= ICMP_EXT_CTYPE_IIO_IPADDR;
> + }
> +
> + name_subobj = skb_put_zero(skb, ALIGN(sizeof(*name_subobj), 4));
> + name_subobj->len = ALIGN(sizeof(*name_subobj), 4);
> + netdev_copy_name(dev, name_subobj->name);
> + objh->class_type |= ICMP_EXT_CTYPE_IIO_NAME;
> +
> + data = htonl(READ_ONCE(dev->mtu));
> + skb_put_data(skb, &data, sizeof(__be32));
> + objh->class_type |= ICMP_EXT_CTYPE_IIO_MTU;
> +
> +out:
> + rcu_read_unlock();
> + objh->length = htons(skb_tail_pointer(skb) - (unsigned char *)objh);
> +}
> +
> +static void icmp_ext_objs_append(struct net *net, struct sk_buff *skb,
> + u8 ext_objs, int iif)
> +{
> + if (ext_objs & BIT(ICMP_ERR_EXT_IIO_IIF))
> + icmp_ext_iio_iif_append(net, skb, iif);
> +}
> +
> +static struct sk_buff *
> +icmp_ext_append(struct net *net, struct sk_buff *skb_in, struct icmphdr *icmph,
> + unsigned int room, int iif)
> +{
> + unsigned int payload_len, ext_max_len, ext_len;
> + struct icmp_ext_hdr *ext_hdr;
> + struct sk_buff *skb;
> + u8 ext_objs;
> + int nhoff;
> +
> + switch (icmph->type) {
> + case ICMP_DEST_UNREACH:
> + case ICMP_TIME_EXCEEDED:
> + case ICMP_PARAMETERPROB:
> + break;
> + default:
> + return NULL;
> + }
> +
> + ext_objs = READ_ONCE(net->ipv4.sysctl_icmp_errors_extension_mask);
> + if (!ext_objs)
> + return NULL;
> +
> + ext_max_len = icmp_ext_max_len(ext_objs);
> + if (ICMP_EXT_ORIG_DGRAM_MIN_LEN + ext_max_len > room)
> + return NULL;
> +
> + skb = skb_clone(skb_in, GFP_ATOMIC);
> + if (!skb)
> + return NULL;
> +
> + nhoff = skb_network_offset(skb);
> + payload_len = min(skb->len - nhoff, ICMP_EXT_ORIG_DGRAM_MIN_LEN);
> +
> + if (!pskb_network_may_pull(skb, payload_len))
> + goto free_skb;
> +
> + if (pskb_trim(skb, nhoff + ICMP_EXT_ORIG_DGRAM_MIN_LEN) ||
> + __skb_put_padto(skb, nhoff + ICMP_EXT_ORIG_DGRAM_MIN_LEN, false))
> + goto free_skb;
> +
> + if (pskb_expand_head(skb, 0, ext_max_len, GFP_ATOMIC))
> + goto free_skb;
> +
> + ext_hdr = skb_put_zero(skb, sizeof(*ext_hdr));
> + ext_hdr->version = ICMP_EXT_VERSION_2;
> +
> + icmp_ext_objs_append(net, skb, ext_objs, iif);
> +
> + /* Do not send an empty extension structure. */
> + ext_len = skb_tail_pointer(skb) - (unsigned char *)ext_hdr;
> + if (ext_len == sizeof(*ext_hdr))
> + goto free_skb;
> +
> + ext_hdr->checksum = ip_compute_csum(ext_hdr, ext_len);
> + /* The length of the original datagram in 32-bit words (RFC 4884). */
> + icmph->un.reserved[1] = ICMP_EXT_ORIG_DGRAM_MIN_LEN / sizeof(u32);
> +
> + return skb;
> +
> +free_skb:
> + consume_skb(skb);
> + return NULL;
> +}
> +
> /*
> * Send an ICMP message in response to a situation
> *
> @@ -601,6 +780,7 @@ void __icmp_send(struct sk_buff *skb_in, int type, int code, __be32 info,
> struct icmp_bxm icmp_param;
> struct rtable *rt = skb_rtable(skb_in);
> bool apply_ratelimit = false;
> + struct sk_buff *ext_skb;
> struct ipcm_cookie ipc;
> struct flowi4 fl4;
> __be32 saddr;
> @@ -770,7 +950,12 @@ void __icmp_send(struct sk_buff *skb_in, int type, int code, __be32 info,
> if (room <= (int)sizeof(struct iphdr))
> goto ende;
>
> - icmp_param.data_len = skb_in->len - icmp_param.offset;
> + ext_skb = icmp_ext_append(net, skb_in, &icmp_param.data.icmph, room,
> + parm->iif);
> + if (ext_skb)
> + icmp_param.skb = ext_skb;
> +
> + icmp_param.data_len = icmp_param.skb->len - icmp_param.offset;
> if (icmp_param.data_len > room)
> icmp_param.data_len = room;
> icmp_param.head_len = sizeof(struct icmphdr);
> @@ -785,6 +970,9 @@ void __icmp_send(struct sk_buff *skb_in, int type, int code, __be32 info,
> trace_icmp_send(skb_in, type, code);
>
> icmp_push_reply(sk, &icmp_param, &fl4, &ipc, &rt);
> +
> + if (ext_skb)
> + consume_skb(ext_skb);
nit : if (ext_skb) is not needed, consume_skb(NULL) is ok.
No need to resend.
Reviewed-by: Eric Dumazet <edumazet@google.com>
^ permalink raw reply [flat|nested] 15+ messages in thread* Re: [PATCH net-next v2 1/3] ipv4: icmp: Add RFC 5837 support
2025-10-27 8:22 ` [PATCH net-next v2 1/3] ipv4: " Ido Schimmel
2025-10-27 8:35 ` Eric Dumazet
@ 2025-11-07 17:33 ` David 'equinox' Lamparter
2025-11-08 16:01 ` Ido Schimmel
1 sibling, 1 reply; 15+ messages in thread
From: David 'equinox' Lamparter @ 2025-11-07 17:33 UTC (permalink / raw)
To: Ido Schimmel
Cc: netdev, davem, kuba, pabeni, edumazet, horms, dsahern, petrm,
willemb, daniel, fw, ishaangandhi, rbonica, tom
On Mon, Oct 27, 2025 at 10:22:30AM +0200, Ido Schimmel wrote:
> +/* ICMP Extension Object Classes */
> +#define ICMP_EXT_OBJ_CLASS_IIO 2 /* RFC 5837 */
> +
> +/* Interface Information Object - RFC 5837 */
> +enum {
> + ICMP_EXT_CTYPE_IIO_ROLE_IIF,
> +};
...
> +static __be32 icmp_ext_iio_addr4_find(const struct net_device *dev)
> +{
> + struct in_device *in_dev;
> + struct in_ifaddr *ifa;
> +
> + in_dev = __in_dev_get_rcu(dev);
> + if (!in_dev)
> + return 0;
> +
> + /* It is unclear from RFC 5837 which IP address should be chosen, but
> + * it makes sense to choose a global unicast address.
> + */
> + in_dev_for_each_ifa_rcu(ifa, in_dev) {
> + if (READ_ONCE(ifa->ifa_flags) & IFA_F_SECONDARY)
> + continue;
> + if (ifa->ifa_scope != RT_SCOPE_UNIVERSE ||
> + ipv4_is_multicast(ifa->ifa_address))
> + continue;
> + return ifa->ifa_address;
For 5837, this should be an address identifying the interface. This
sets up a rather tricky situation if there's a /32 configured on the
interface in the context of unnumbered operation. Arguably, in that
case class 5 (node info) should be used rather than class 2 (interface
info). Class 5 also allows sticking an IPv6 address in an ICMPv4 reply.
I would argue the logic here should be an order of preference:
1. any global non-/32 address on the interface, in a class 2 object
2. any global /32 on the interface, in a class 5 object
3. any global IPv6 on the interface, in a class 5 object
4. any global address from any interface in the VRF, preferring
loopback, in a class 5 object (addrsel logic, really)
[class 5 is draft-ietf-intarea-extended-icmp-nodeid]
+ analog for IPv6
(cf. my other mail in the thread)
Cheers,
-equi
^ permalink raw reply [flat|nested] 15+ messages in thread* Re: [PATCH net-next v2 1/3] ipv4: icmp: Add RFC 5837 support
2025-11-07 17:33 ` David 'equinox' Lamparter
@ 2025-11-08 16:01 ` Ido Schimmel
0 siblings, 0 replies; 15+ messages in thread
From: Ido Schimmel @ 2025-11-08 16:01 UTC (permalink / raw)
To: David 'equinox' Lamparter
Cc: netdev, davem, kuba, pabeni, edumazet, horms, dsahern, petrm,
willemb, daniel, fw, ishaangandhi, rbonica, tom
On Fri, Nov 07, 2025 at 06:33:30PM +0100, David 'equinox' Lamparter wrote:
> On Mon, Oct 27, 2025 at 10:22:30AM +0200, Ido Schimmel wrote:
> > +/* ICMP Extension Object Classes */
> > +#define ICMP_EXT_OBJ_CLASS_IIO 2 /* RFC 5837 */
> > +
> > +/* Interface Information Object - RFC 5837 */
> > +enum {
> > + ICMP_EXT_CTYPE_IIO_ROLE_IIF,
> > +};
>
> ...
>
> > +static __be32 icmp_ext_iio_addr4_find(const struct net_device *dev)
> > +{
> > + struct in_device *in_dev;
> > + struct in_ifaddr *ifa;
> > +
> > + in_dev = __in_dev_get_rcu(dev);
> > + if (!in_dev)
> > + return 0;
> > +
> > + /* It is unclear from RFC 5837 which IP address should be chosen, but
> > + * it makes sense to choose a global unicast address.
> > + */
> > + in_dev_for_each_ifa_rcu(ifa, in_dev) {
> > + if (READ_ONCE(ifa->ifa_flags) & IFA_F_SECONDARY)
> > + continue;
> > + if (ifa->ifa_scope != RT_SCOPE_UNIVERSE ||
> > + ipv4_is_multicast(ifa->ifa_address))
> > + continue;
> > + return ifa->ifa_address;
>
> For 5837, this should be an address identifying the interface. This
> sets up a rather tricky situation if there's a /32 configured on the
> interface in the context of unnumbered operation. Arguably, in that
> case class 5 (node info) should be used rather than class 2 (interface
> info). Class 5 also allows sticking an IPv6 address in an ICMPv4 reply.
This patchset does not add support for class 5 objects and the above
code is strictly about choosing an address for the "IP Address
Sub-Object" in a class 2 object. Support for class 5 objects can be
added in a different patchset. An administrator can then choose to
include both objects in ICMP error messages.
The kernel does not avoid using a /32 as the source IP of ICMP error
messages, so I don't see a reason to avoid using such an address in the
"IP Address Sub-Object".
>
> I would argue the logic here should be an order of preference:
>
> 1. any global non-/32 address on the interface, in a class 2 object
> 2. any global /32 on the interface, in a class 5 object
> 3. any global IPv6 on the interface, in a class 5 object
> 4. any global address from any interface in the VRF, preferring
> loopback, in a class 5 object (addrsel logic, really)
>
> [class 5 is draft-ietf-intarea-extended-icmp-nodeid]
>
> + analog for IPv6
>
> (cf. my other mail in the thread)
>
> Cheers,
>
>
> -equi
^ permalink raw reply [flat|nested] 15+ messages in thread
* [PATCH net-next v2 2/3] ipv6: icmp: Add RFC 5837 support
2025-10-27 8:22 [PATCH net-next v2 0/3] icmp: Add RFC 5837 support Ido Schimmel
2025-10-27 8:22 ` [PATCH net-next v2 1/3] ipv4: " Ido Schimmel
@ 2025-10-27 8:22 ` Ido Schimmel
2025-10-27 8:39 ` Eric Dumazet
2025-10-27 8:22 ` [PATCH net-next v2 3/3] selftests: traceroute: Add ICMP extensions tests Ido Schimmel
` (2 subsequent siblings)
4 siblings, 1 reply; 15+ messages in thread
From: Ido Schimmel @ 2025-10-27 8:22 UTC (permalink / raw)
To: netdev
Cc: davem, kuba, pabeni, edumazet, horms, dsahern, petrm, willemb,
daniel, fw, ishaangandhi, rbonica, tom, Ido Schimmel
Add the ability to append the incoming IP interface information to
ICMPv6 error messages in accordance with RFC 5837 and RFC 4884. This is
required for more meaningful traceroute results in unnumbered networks.
The feature is disabled by default and controlled via a new sysctl
("net.ipv6.icmp.errors_extension_mask") which accepts a bitmask of ICMP
extensions to append to ICMP error messages. Currently, only a single
value is supported, but the interface and the implementation should be
able to support more extensions, if needed.
Clone the skb and copy the relevant data portions before modifying the
skb as the caller of icmp6_send() still owns the skb after the function
returns. This should be fine since by default ICMP error messages are
rate limited to 1000 per second and no more than 1 per second per
specific host.
Trim or pad the packet to 128 bytes before appending the ICMP extension
structure in order to be compatible with legacy applications that assume
that the ICMP extension structure always starts at this offset (the
minimum length specified by RFC 4884).
Since commit 20e1954fe238 ("ipv6: RFC 4884 partial support for SIT/GRE
tunnels") it is possible for icmp6_send() to be called with an skb that
already contains ICMP extensions. This can happen when we receive an
ICMPv4 message with extensions from a tunnel and translate it to an
ICMPv6 message towards an IPv6 host in the overlay network. I could not
find an RFC that supports this behavior, but it makes sense to not
overwrite the original extensions that were appended to the packet.
Therefore, avoid appending extensions if the length field in the
provided ICMPv6 header is already filled.
Export netdev_copy_name() using EXPORT_IPV6_MOD_GPL() to make it
available to IPv6 when it is built as a module.
Reviewed-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
---
Notes:
v2:
* Add a comment about field ordering.
Documentation/networking/ip-sysctl.rst | 17 ++
include/net/netns/ipv6.h | 1 +
net/core/dev.c | 1 +
net/ipv6/af_inet6.c | 1 +
net/ipv6/icmp.c | 214 ++++++++++++++++++++++++-
5 files changed, 232 insertions(+), 2 deletions(-)
diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst
index ece1187ba0f1..7cd35bfd39e6 100644
--- a/Documentation/networking/ip-sysctl.rst
+++ b/Documentation/networking/ip-sysctl.rst
@@ -3279,6 +3279,23 @@ error_anycast_as_unicast - BOOLEAN
Default: 0 (disabled)
+errors_extension_mask - UNSIGNED INTEGER
+ Bitmask of ICMP extensions to append to ICMPv6 error messages
+ ("Destination Unreachable" and "Time Exceeded"). The original datagram
+ is trimmed / padded to 128 bytes in order to be compatible with
+ applications that do not comply with RFC 4884.
+
+ Possible extensions are:
+
+ ==== ==============================================================
+ 0x01 Incoming IP interface information according to RFC 5837.
+ Extension will include the index, IPv6 address (if present),
+ name and MTU of the IP interface that received the datagram
+ which elicited the ICMP error.
+ ==== ==============================================================
+
+ Default: 0x00 (no extensions)
+
xfrm6_gc_thresh - INTEGER
(Obsolete since linux-4.14)
The threshold at which we will start garbage collecting for IPv6
diff --git a/include/net/netns/ipv6.h b/include/net/netns/ipv6.h
index 47dc70d8100a..08d2ecc96e2b 100644
--- a/include/net/netns/ipv6.h
+++ b/include/net/netns/ipv6.h
@@ -56,6 +56,7 @@ struct netns_sysctl_ipv6 {
u8 skip_notify_on_dev_down;
u8 fib_notify_on_flag_change;
u8 icmpv6_error_anycast_as_unicast;
+ u8 icmpv6_errors_extension_mask;
};
struct netns_ipv6 {
diff --git a/net/core/dev.c b/net/core/dev.c
index 378c2d010faf..e6cc0fbc5e2a 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1163,6 +1163,7 @@ void netdev_copy_name(struct net_device *dev, char *name)
strscpy(name, dev->name, IFNAMSIZ);
} while (read_seqretry(&netdev_rename_lock, seq));
}
+EXPORT_IPV6_MOD_GPL(netdev_copy_name);
/**
* netdev_get_name - get a netdevice name, knowing its ifindex.
diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
index 1b0314644e0c..44d7de1eec4f 100644
--- a/net/ipv6/af_inet6.c
+++ b/net/ipv6/af_inet6.c
@@ -960,6 +960,7 @@ static int __net_init inet6_net_init(struct net *net)
net->ipv6.sysctl.icmpv6_echo_ignore_multicast = 0;
net->ipv6.sysctl.icmpv6_echo_ignore_anycast = 0;
net->ipv6.sysctl.icmpv6_error_anycast_as_unicast = 0;
+ net->ipv6.sysctl.icmpv6_errors_extension_mask = 0;
/* By default, rate limit error messages.
* Except for pmtu discovery, it would break it.
diff --git a/net/ipv6/icmp.c b/net/ipv6/icmp.c
index 56c974cf75d1..5d2f90babaa5 100644
--- a/net/ipv6/icmp.c
+++ b/net/ipv6/icmp.c
@@ -444,6 +444,193 @@ static int icmp6_iif(const struct sk_buff *skb)
return icmp6_dev(skb)->ifindex;
}
+struct icmp6_ext_iio_addr6_subobj {
+ __be16 afi;
+ __be16 reserved;
+ struct in6_addr addr6;
+};
+
+static unsigned int icmp6_ext_iio_len(void)
+{
+ return sizeof(struct icmp_extobj_hdr) +
+ /* ifIndex */
+ sizeof(__be32) +
+ /* Interface Address Sub-Object */
+ sizeof(struct icmp6_ext_iio_addr6_subobj) +
+ /* Interface Name Sub-Object. Length must be a multiple of 4
+ * bytes.
+ */
+ ALIGN(sizeof(struct icmp_ext_iio_name_subobj), 4) +
+ /* MTU */
+ sizeof(__be32);
+}
+
+static unsigned int icmp6_ext_max_len(u8 ext_objs)
+{
+ unsigned int ext_max_len;
+
+ ext_max_len = sizeof(struct icmp_ext_hdr);
+
+ if (ext_objs & BIT(ICMP_ERR_EXT_IIO_IIF))
+ ext_max_len += icmp6_ext_iio_len();
+
+ return ext_max_len;
+}
+
+static struct in6_addr *icmp6_ext_iio_addr6_find(const struct net_device *dev)
+{
+ struct inet6_dev *in6_dev;
+ struct inet6_ifaddr *ifa;
+
+ in6_dev = __in6_dev_get(dev);
+ if (!in6_dev)
+ return NULL;
+
+ /* It is unclear from RFC 5837 which IP address should be chosen, but
+ * it makes sense to choose a global unicast address.
+ */
+ list_for_each_entry_rcu(ifa, &in6_dev->addr_list, if_list) {
+ if (ifa->flags & (IFA_F_TENTATIVE | IFA_F_DADFAILED))
+ continue;
+ if (ipv6_addr_type(&ifa->addr) != IPV6_ADDR_UNICAST ||
+ ipv6_addr_src_scope(&ifa->addr) != IPV6_ADDR_SCOPE_GLOBAL)
+ continue;
+ return &ifa->addr;
+ }
+
+ return NULL;
+}
+
+static void icmp6_ext_iio_iif_append(struct net *net, struct sk_buff *skb,
+ int iif)
+{
+ struct icmp_ext_iio_name_subobj *name_subobj;
+ struct icmp_extobj_hdr *objh;
+ struct net_device *dev;
+ struct in6_addr *addr6;
+ __be32 data;
+
+ if (!iif)
+ return;
+
+ /* Add the fields in the order specified by RFC 5837. */
+ objh = skb_put(skb, sizeof(*objh));
+ objh->class_num = ICMP_EXT_OBJ_CLASS_IIO;
+ objh->class_type = ICMP_EXT_CTYPE_IIO_ROLE(ICMP_EXT_CTYPE_IIO_ROLE_IIF);
+
+ data = htonl(iif);
+ skb_put_data(skb, &data, sizeof(__be32));
+ objh->class_type |= ICMP_EXT_CTYPE_IIO_IFINDEX;
+
+ rcu_read_lock();
+
+ dev = dev_get_by_index_rcu(net, iif);
+ if (!dev)
+ goto out;
+
+ addr6 = icmp6_ext_iio_addr6_find(dev);
+ if (addr6) {
+ struct icmp6_ext_iio_addr6_subobj *addr6_subobj;
+
+ addr6_subobj = skb_put_zero(skb, sizeof(*addr6_subobj));
+ addr6_subobj->afi = htons(ICMP_AFI_IP6);
+ addr6_subobj->addr6 = *addr6;
+ objh->class_type |= ICMP_EXT_CTYPE_IIO_IPADDR;
+ }
+
+ name_subobj = skb_put_zero(skb, ALIGN(sizeof(*name_subobj), 4));
+ name_subobj->len = ALIGN(sizeof(*name_subobj), 4);
+ netdev_copy_name(dev, name_subobj->name);
+ objh->class_type |= ICMP_EXT_CTYPE_IIO_NAME;
+
+ data = htonl(READ_ONCE(dev->mtu));
+ skb_put_data(skb, &data, sizeof(__be32));
+ objh->class_type |= ICMP_EXT_CTYPE_IIO_MTU;
+
+out:
+ rcu_read_unlock();
+ objh->length = htons(skb_tail_pointer(skb) - (unsigned char *)objh);
+}
+
+static void icmp6_ext_objs_append(struct net *net, struct sk_buff *skb,
+ u8 ext_objs, int iif)
+{
+ if (ext_objs & BIT(ICMP_ERR_EXT_IIO_IIF))
+ icmp6_ext_iio_iif_append(net, skb, iif);
+}
+
+static struct sk_buff *
+icmp6_ext_append(struct net *net, struct sk_buff *skb_in,
+ struct icmp6hdr *icmp6h, unsigned int room, int iif)
+{
+ unsigned int payload_len, ext_max_len, ext_len;
+ struct icmp_ext_hdr *ext_hdr;
+ struct sk_buff *skb;
+ u8 ext_objs;
+ int nhoff;
+
+ switch (icmp6h->icmp6_type) {
+ case ICMPV6_DEST_UNREACH:
+ case ICMPV6_TIME_EXCEED:
+ break;
+ default:
+ return NULL;
+ }
+
+ /* Do not overwrite existing extensions. This can happen when we
+ * receive an ICMPv4 message with extensions from a tunnel and
+ * translate it to an ICMPv6 message towards an IPv6 host in the
+ * overlay network.
+ */
+ if (icmp6h->icmp6_datagram_len)
+ return NULL;
+
+ ext_objs = READ_ONCE(net->ipv6.sysctl.icmpv6_errors_extension_mask);
+ if (!ext_objs)
+ return NULL;
+
+ ext_max_len = icmp6_ext_max_len(ext_objs);
+ if (ICMP_EXT_ORIG_DGRAM_MIN_LEN + ext_max_len > room)
+ return NULL;
+
+ skb = skb_clone(skb_in, GFP_ATOMIC);
+ if (!skb)
+ return NULL;
+
+ nhoff = skb_network_offset(skb);
+ payload_len = min(skb->len - nhoff, ICMP_EXT_ORIG_DGRAM_MIN_LEN);
+
+ if (!pskb_network_may_pull(skb, payload_len))
+ goto free_skb;
+
+ if (pskb_trim(skb, nhoff + ICMP_EXT_ORIG_DGRAM_MIN_LEN) ||
+ __skb_put_padto(skb, nhoff + ICMP_EXT_ORIG_DGRAM_MIN_LEN, false))
+ goto free_skb;
+
+ if (pskb_expand_head(skb, 0, ext_max_len, GFP_ATOMIC))
+ goto free_skb;
+
+ ext_hdr = skb_put_zero(skb, sizeof(*ext_hdr));
+ ext_hdr->version = ICMP_EXT_VERSION_2;
+
+ icmp6_ext_objs_append(net, skb, ext_objs, iif);
+
+ /* Do not send an empty extension structure. */
+ ext_len = skb_tail_pointer(skb) - (unsigned char *)ext_hdr;
+ if (ext_len == sizeof(*ext_hdr))
+ goto free_skb;
+
+ ext_hdr->checksum = ip_compute_csum(ext_hdr, ext_len);
+ /* The length of the original datagram in 64-bit words (RFC 4884). */
+ icmp6h->icmp6_datagram_len = ICMP_EXT_ORIG_DGRAM_MIN_LEN / sizeof(u64);
+
+ return skb;
+
+free_skb:
+ consume_skb(skb);
+ return NULL;
+}
+
/*
* Send an ICMP message in response to a packet in error
*/
@@ -458,7 +645,9 @@ void icmp6_send(struct sk_buff *skb, u8 type, u8 code, __u32 info,
struct ipv6_pinfo *np;
const struct in6_addr *saddr = NULL;
bool apply_ratelimit = false;
+ struct sk_buff *ext_skb;
struct dst_entry *dst;
+ unsigned int room;
struct icmp6hdr tmp_hdr;
struct flowi6 fl6;
struct icmpv6_msg msg;
@@ -612,8 +801,13 @@ void icmp6_send(struct sk_buff *skb, u8 type, u8 code, __u32 info,
msg.offset = skb_network_offset(skb);
msg.type = type;
- len = skb->len - msg.offset;
- len = min_t(unsigned int, len, IPV6_MIN_MTU - sizeof(struct ipv6hdr) - sizeof(struct icmp6hdr));
+ room = IPV6_MIN_MTU - sizeof(struct ipv6hdr) - sizeof(struct icmp6hdr);
+ ext_skb = icmp6_ext_append(net, skb, &tmp_hdr, room, parm->iif);
+ if (ext_skb)
+ msg.skb = ext_skb;
+
+ len = msg.skb->len - msg.offset;
+ len = min_t(unsigned int, len, room);
if (len < 0) {
net_dbg_ratelimited("icmp: len problem [%pI6c > %pI6c]\n",
&hdr->saddr, &hdr->daddr);
@@ -635,6 +829,8 @@ void icmp6_send(struct sk_buff *skb, u8 type, u8 code, __u32 info,
}
out_dst_release:
+ if (ext_skb)
+ consume_skb(ext_skb);
dst_release(dst);
out_unlock:
icmpv6_xmit_unlock(sk);
@@ -1171,6 +1367,10 @@ int icmpv6_err_convert(u8 type, u8 code, int *err)
EXPORT_SYMBOL(icmpv6_err_convert);
#ifdef CONFIG_SYSCTL
+
+static u32 icmpv6_errors_extension_mask_all =
+ GENMASK_U8(ICMP_ERR_EXT_COUNT - 1, 0);
+
static struct ctl_table ipv6_icmp_table_template[] = {
{
.procname = "ratelimit",
@@ -1216,6 +1416,15 @@ static struct ctl_table ipv6_icmp_table_template[] = {
.extra1 = SYSCTL_ZERO,
.extra2 = SYSCTL_ONE,
},
+ {
+ .procname = "errors_extension_mask",
+ .data = &init_net.ipv6.sysctl.icmpv6_errors_extension_mask,
+ .maxlen = sizeof(u8),
+ .mode = 0644,
+ .proc_handler = proc_dou8vec_minmax,
+ .extra1 = SYSCTL_ZERO,
+ .extra2 = &icmpv6_errors_extension_mask_all,
+ },
};
struct ctl_table * __net_init ipv6_icmp_sysctl_init(struct net *net)
@@ -1233,6 +1442,7 @@ struct ctl_table * __net_init ipv6_icmp_sysctl_init(struct net *net)
table[3].data = &net->ipv6.sysctl.icmpv6_echo_ignore_anycast;
table[4].data = &net->ipv6.sysctl.icmpv6_ratemask_ptr;
table[5].data = &net->ipv6.sysctl.icmpv6_error_anycast_as_unicast;
+ table[6].data = &net->ipv6.sysctl.icmpv6_errors_extension_mask;
}
return table;
}
--
2.51.0
^ permalink raw reply related [flat|nested] 15+ messages in thread* Re: [PATCH net-next v2 2/3] ipv6: icmp: Add RFC 5837 support
2025-10-27 8:22 ` [PATCH net-next v2 2/3] ipv6: " Ido Schimmel
@ 2025-10-27 8:39 ` Eric Dumazet
0 siblings, 0 replies; 15+ messages in thread
From: Eric Dumazet @ 2025-10-27 8:39 UTC (permalink / raw)
To: Ido Schimmel
Cc: netdev, davem, kuba, pabeni, horms, dsahern, petrm, willemb,
daniel, fw, ishaangandhi, rbonica, tom
On Mon, Oct 27, 2025 at 1:24 AM Ido Schimmel <idosch@nvidia.com> wrote:
>
> Add the ability to append the incoming IP interface information to
> ICMPv6 error messages in accordance with RFC 5837 and RFC 4884. This is
> required for more meaningful traceroute results in unnumbered networks.
>
> The feature is disabled by default and controlled via a new sysctl
> ("net.ipv6.icmp.errors_extension_mask") which accepts a bitmask of ICMP
> extensions to append to ICMP error messages. Currently, only a single
> value is supported, but the interface and the implementation should be
> able to support more extensions, if needed.
>
> Clone the skb and copy the relevant data portions before modifying the
> skb as the caller of icmp6_send() still owns the skb after the function
> returns. This should be fine since by default ICMP error messages are
> rate limited to 1000 per second and no more than 1 per second per
> specific host.
>
> Trim or pad the packet to 128 bytes before appending the ICMP extension
> structure in order to be compatible with legacy applications that assume
> that the ICMP extension structure always starts at this offset (the
> minimum length specified by RFC 4884).
>
> Since commit 20e1954fe238 ("ipv6: RFC 4884 partial support for SIT/GRE
> tunnels") it is possible for icmp6_send() to be called with an skb that
> already contains ICMP extensions. This can happen when we receive an
> ICMPv4 message with extensions from a tunnel and translate it to an
> ICMPv6 message towards an IPv6 host in the overlay network. I could not
> find an RFC that supports this behavior, but it makes sense to not
> overwrite the original extensions that were appended to the packet.
> Therefore, avoid appending extensions if the length field in the
> provided ICMPv6 header is already filled.
>
> Export netdev_copy_name() using EXPORT_IPV6_MOD_GPL() to make it
> available to IPv6 when it is built as a module.
>
> Reviewed-by: Petr Machata <petrm@nvidia.com>
> Reviewed-by: David Ahern <dsahern@kernel.org>
> Reviewed-by: Willem de Bruijn <willemb@google.com>
> Signed-off-by: Ido Schimmel <idosch@nvidia.com>
> ---
Same small remark about consume_skb().
Reviewed-by: Eric Dumazet <edumazet@google.com>
^ permalink raw reply [flat|nested] 15+ messages in thread
* [PATCH net-next v2 3/3] selftests: traceroute: Add ICMP extensions tests
2025-10-27 8:22 [PATCH net-next v2 0/3] icmp: Add RFC 5837 support Ido Schimmel
2025-10-27 8:22 ` [PATCH net-next v2 1/3] ipv4: " Ido Schimmel
2025-10-27 8:22 ` [PATCH net-next v2 2/3] ipv6: " Ido Schimmel
@ 2025-10-27 8:22 ` Ido Schimmel
2025-10-29 1:04 ` [PATCH net-next v2 0/3] icmp: Add RFC 5837 support Jakub Kicinski
2025-10-30 1:50 ` patchwork-bot+netdevbpf
4 siblings, 0 replies; 15+ messages in thread
From: Ido Schimmel @ 2025-10-27 8:22 UTC (permalink / raw)
To: netdev
Cc: davem, kuba, pabeni, edumazet, horms, dsahern, petrm, willemb,
daniel, fw, ishaangandhi, rbonica, tom, Ido Schimmel
Test that ICMP extensions are reported correctly when enabled and not
reported when disabled. Test both IPv4 and IPv6 and using different
packet sizes, to make sure trimming / padding works correctly.
Disable ICMP rate limiting (defaults to 1 per-second per-target) so that
the kernel will always generate ICMP errors when needed.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
---
Notes:
v2:
* Use "echo" instead of "sysctl" when we are testing the return value.
* Skip if traceroute version is older than 2.1.5.
tools/testing/selftests/net/traceroute.sh | 313 ++++++++++++++++++++++
1 file changed, 313 insertions(+)
diff --git a/tools/testing/selftests/net/traceroute.sh b/tools/testing/selftests/net/traceroute.sh
index dbb34c7e09ce..a7c6ab8a0347 100755
--- a/tools/testing/selftests/net/traceroute.sh
+++ b/tools/testing/selftests/net/traceroute.sh
@@ -36,6 +36,35 @@ run_cmd()
return $rc
}
+__check_traceroute_version()
+{
+ local cmd=$1; shift
+ local req_ver=$1; shift
+ local ver
+
+ req_ver=$(echo "$req_ver" | sed 's/\.//g')
+ ver=$($cmd -V 2>&1 | grep -Eo '[0-9]+.[0-9]+.[0-9]+' | sed 's/\.//g')
+ if [[ $ver -lt $req_ver ]]; then
+ return 1
+ else
+ return 0
+ fi
+}
+
+check_traceroute6_version()
+{
+ local req_ver=$1; shift
+
+ __check_traceroute_version traceroute6 "$req_ver"
+}
+
+check_traceroute_version()
+{
+ local req_ver=$1; shift
+
+ __check_traceroute_version traceroute "$req_ver"
+}
+
################################################################################
# create namespaces and interconnects
@@ -59,6 +88,8 @@ create_ns()
ip netns exec ${ns} ip -6 ro add unreachable default metric 8192
ip netns exec ${ns} sysctl -qw net.ipv4.ip_forward=1
+ ip netns exec ${ns} sysctl -qw net.ipv4.icmp_ratelimit=0
+ ip netns exec ${ns} sysctl -qw net.ipv6.icmp.ratelimit=0
ip netns exec ${ns} sysctl -qw net.ipv6.conf.all.keep_addr_on_down=1
ip netns exec ${ns} sysctl -qw net.ipv6.conf.all.forwarding=1
ip netns exec ${ns} sysctl -qw net.ipv6.conf.default.forwarding=1
@@ -297,6 +328,144 @@ run_traceroute6_vrf()
cleanup_traceroute6_vrf
}
+################################################################################
+# traceroute6 with ICMP extensions test
+#
+# Verify that in this scenario
+#
+# ---- ---- ----
+# |H1|--------------------------|R1|--------------------------|H2|
+# ---- N1 ---- N2 ----
+#
+# ICMP extensions are correctly reported. The loopback interfaces on all the
+# nodes are assigned global addresses and the interfaces connecting the nodes
+# are assigned IPv6 link-local addresses.
+
+cleanup_traceroute6_ext()
+{
+ cleanup_all_ns
+}
+
+setup_traceroute6_ext()
+{
+ # Start clean
+ cleanup_traceroute6_ext
+
+ setup_ns h1 r1 h2
+ create_ns "$h1"
+ create_ns "$r1"
+ create_ns "$h2"
+
+ # Setup N1
+ connect_ns "$h1" eth1 - fe80::1/64 "$r1" eth1 - fe80::2/64
+ # Setup N2
+ connect_ns "$r1" eth2 - fe80::3/64 "$h2" eth2 - fe80::4/64
+
+ # Setup H1
+ ip -n "$h1" address add 2001:db8:1::1/128 dev lo
+ ip -n "$h1" route add ::/0 nexthop via fe80::2 dev eth1
+
+ # Setup R1
+ ip -n "$r1" address add 2001:db8:1::2/128 dev lo
+ ip -n "$r1" route add 2001:db8:1::1/128 nexthop via fe80::1 dev eth1
+ ip -n "$r1" route add 2001:db8:1::3/128 nexthop via fe80::4 dev eth2
+
+ # Setup H2
+ ip -n "$h2" address add 2001:db8:1::3/128 dev lo
+ ip -n "$h2" route add ::/0 nexthop via fe80::3 dev eth2
+
+ # Prime the network
+ ip netns exec "$h1" ping6 -c5 2001:db8:1::3 >/dev/null 2>&1
+}
+
+traceroute6_ext_iio_iif_test()
+{
+ local r1_ifindex h2_ifindex
+ local pkt_len=$1; shift
+
+ # Test that incoming interface info is not appended by default.
+ run_cmd "$h1" "traceroute6 -e 2001:db8:1::3 $pkt_len | grep INC"
+ check_fail $? "Incoming interface info appended by default when should not"
+
+ # Test that the extension is appended when enabled.
+ run_cmd "$r1" "bash -c \"echo 0x01 > /proc/sys/net/ipv6/icmp/errors_extension_mask\""
+ check_err $? "Failed to enable incoming interface info extension on R1"
+
+ run_cmd "$h1" "traceroute6 -e 2001:db8:1::3 $pkt_len | grep INC"
+ check_err $? "Incoming interface info not appended after enable"
+
+ # Test that the extension is not appended when disabled.
+ run_cmd "$r1" "bash -c \"echo 0x00 > /proc/sys/net/ipv6/icmp/errors_extension_mask\""
+ check_err $? "Failed to disable incoming interface info extension on R1"
+
+ run_cmd "$h1" "traceroute6 -e 2001:db8:1::3 $pkt_len | grep INC"
+ check_fail $? "Incoming interface info appended after disable"
+
+ # Test that the extension is sent correctly from both R1 and H2.
+ run_cmd "$r1" "sysctl -w net.ipv6.icmp.errors_extension_mask=0x01"
+ r1_ifindex=$(ip -n "$r1" -j link show dev eth1 | jq '.[]["ifindex"]')
+ run_cmd "$h1" "traceroute6 -e 2001:db8:1::3 $pkt_len | grep '<INC:$r1_ifindex,\"eth1\",mtu=1500>'"
+ check_err $? "Wrong incoming interface info reported from R1"
+
+ run_cmd "$h2" "sysctl -w net.ipv6.icmp.errors_extension_mask=0x01"
+ h2_ifindex=$(ip -n "$h2" -j link show dev eth2 | jq '.[]["ifindex"]')
+ run_cmd "$h1" "traceroute6 -e 2001:db8:1::3 $pkt_len | grep '<INC:$h2_ifindex,\"eth2\",mtu=1500>'"
+ check_err $? "Wrong incoming interface info reported from H2"
+
+ # Add a global address on the incoming interface of R1 and check that
+ # it is reported.
+ run_cmd "$r1" "ip address add 2001:db8:100::1/64 dev eth1 nodad"
+ run_cmd "$h1" "traceroute6 -e 2001:db8:1::3 $pkt_len | grep '<INC:$r1_ifindex,2001:db8:100::1,\"eth1\",mtu=1500>'"
+ check_err $? "Wrong incoming interface info reported from R1 after address addition"
+ run_cmd "$r1" "ip address del 2001:db8:100::1/64 dev eth1"
+
+ # Change name and MTU and make sure the result is still correct.
+ run_cmd "$r1" "ip link set dev eth1 name eth1tag mtu 1501"
+ run_cmd "$h1" "traceroute6 -e 2001:db8:1::3 $pkt_len | grep '<INC:$r1_ifindex,\"eth1tag\",mtu=1501>'"
+ check_err $? "Wrong incoming interface info reported from R1 after name and MTU change"
+ run_cmd "$r1" "ip link set dev eth1tag name eth1 mtu 1500"
+
+ run_cmd "$r1" "sysctl -w net.ipv6.icmp.errors_extension_mask=0x00"
+ run_cmd "$h2" "sysctl -w net.ipv6.icmp.errors_extension_mask=0x00"
+}
+
+run_traceroute6_ext()
+{
+ # Need at least version 2.1.5 for RFC 5837 support.
+ if ! check_traceroute6_version 2.1.5; then
+ log_test_skip "traceroute6 too old, missing ICMP extensions support"
+ return
+ fi
+
+ setup_traceroute6_ext
+
+ RET=0
+
+ ## General ICMP extensions tests
+
+ # Test that ICMP extensions are disabled by default.
+ run_cmd "$h1" "sysctl net.ipv6.icmp.errors_extension_mask | grep \"= 0$\""
+ check_err $? "ICMP extensions are not disabled by default"
+
+ # Test that unsupported values are rejected. Do not use "sysctl" as
+ # older versions do not return an error code upon failure.
+ run_cmd "$h1" "bash -c \"echo 0x80 > /proc/sys/net/ipv6/icmp/errors_extension_mask\""
+ check_fail $? "Unsupported sysctl value was not rejected"
+
+ ## Extension-specific tests
+
+ # Incoming interface info test. Test with various packet sizes,
+ # including the default one.
+ traceroute6_ext_iio_iif_test
+ traceroute6_ext_iio_iif_test 127
+ traceroute6_ext_iio_iif_test 128
+ traceroute6_ext_iio_iif_test 129
+
+ log_test "IPv6 traceroute with ICMP extensions"
+
+ cleanup_traceroute6_ext
+}
+
################################################################################
# traceroute test
#
@@ -437,6 +606,147 @@ run_traceroute_vrf()
cleanup_traceroute_vrf
}
+################################################################################
+# traceroute with ICMP extensions test
+#
+# Verify that in this scenario
+#
+# ---- ---- ----
+# |H1|--------------------------|R1|--------------------------|H2|
+# ---- N1 ---- N2 ----
+#
+# ICMP extensions are correctly reported. The loopback interfaces on all the
+# nodes are assigned global addresses and the interfaces connecting the nodes
+# are assigned IPv6 link-local addresses.
+
+cleanup_traceroute_ext()
+{
+ cleanup_all_ns
+}
+
+setup_traceroute_ext()
+{
+ # Start clean
+ cleanup_traceroute_ext
+
+ setup_ns h1 r1 h2
+ create_ns "$h1"
+ create_ns "$r1"
+ create_ns "$h2"
+
+ # Setup N1
+ connect_ns "$h1" eth1 - fe80::1/64 "$r1" eth1 - fe80::2/64
+ # Setup N2
+ connect_ns "$r1" eth2 - fe80::3/64 "$h2" eth2 - fe80::4/64
+
+ # Setup H1
+ ip -n "$h1" address add 192.0.2.1/32 dev lo
+ ip -n "$h1" route add 0.0.0.0/0 nexthop via inet6 fe80::2 dev eth1
+
+ # Setup R1
+ ip -n "$r1" address add 192.0.2.2/32 dev lo
+ ip -n "$r1" route add 192.0.2.1/32 nexthop via inet6 fe80::1 dev eth1
+ ip -n "$r1" route add 192.0.2.3/32 nexthop via inet6 fe80::4 dev eth2
+
+ # Setup H2
+ ip -n "$h2" address add 192.0.2.3/32 dev lo
+ ip -n "$h2" route add 0.0.0.0/0 nexthop via inet6 fe80::3 dev eth2
+
+ # Prime the network
+ ip netns exec "$h1" ping -c5 192.0.2.3 >/dev/null 2>&1
+}
+
+traceroute_ext_iio_iif_test()
+{
+ local r1_ifindex h2_ifindex
+ local pkt_len=$1; shift
+
+ # Test that incoming interface info is not appended by default.
+ run_cmd "$h1" "traceroute -e 192.0.2.3 $pkt_len | grep INC"
+ check_fail $? "Incoming interface info appended by default when should not"
+
+ # Test that the extension is appended when enabled.
+ run_cmd "$r1" "bash -c \"echo 0x01 > /proc/sys/net/ipv4/icmp_errors_extension_mask\""
+ check_err $? "Failed to enable incoming interface info extension on R1"
+
+ run_cmd "$h1" "traceroute -e 192.0.2.3 $pkt_len | grep INC"
+ check_err $? "Incoming interface info not appended after enable"
+
+ # Test that the extension is not appended when disabled.
+ run_cmd "$r1" "bash -c \"echo 0x00 > /proc/sys/net/ipv4/icmp_errors_extension_mask\""
+ check_err $? "Failed to disable incoming interface info extension on R1"
+
+ run_cmd "$h1" "traceroute -e 192.0.2.3 $pkt_len | grep INC"
+ check_fail $? "Incoming interface info appended after disable"
+
+ # Test that the extension is sent correctly from both R1 and H2.
+ run_cmd "$r1" "sysctl -w net.ipv4.icmp_errors_extension_mask=0x01"
+ r1_ifindex=$(ip -n "$r1" -j link show dev eth1 | jq '.[]["ifindex"]')
+ run_cmd "$h1" "traceroute -e 192.0.2.3 $pkt_len | grep '<INC:$r1_ifindex,\"eth1\",mtu=1500>'"
+ check_err $? "Wrong incoming interface info reported from R1"
+
+ run_cmd "$h2" "sysctl -w net.ipv4.icmp_errors_extension_mask=0x01"
+ h2_ifindex=$(ip -n "$h2" -j link show dev eth2 | jq '.[]["ifindex"]')
+ run_cmd "$h1" "traceroute -e 192.0.2.3 $pkt_len | grep '<INC:$h2_ifindex,\"eth2\",mtu=1500>'"
+ check_err $? "Wrong incoming interface info reported from H2"
+
+ # Add a global address on the incoming interface of R1 and check that
+ # it is reported.
+ run_cmd "$r1" "ip address add 198.51.100.1/24 dev eth1"
+ run_cmd "$h1" "traceroute -e 192.0.2.3 $pkt_len | grep '<INC:$r1_ifindex,198.51.100.1,\"eth1\",mtu=1500>'"
+ check_err $? "Wrong incoming interface info reported from R1 after address addition"
+ run_cmd "$r1" "ip address del 198.51.100.1/24 dev eth1"
+
+ # Change name and MTU and make sure the result is still correct.
+ # Re-add the route towards H1 since it was deleted when we removed the
+ # last IPv4 address from eth1 on R1.
+ run_cmd "$r1" "ip route add 192.0.2.1/32 nexthop via inet6 fe80::1 dev eth1"
+ run_cmd "$r1" "ip link set dev eth1 name eth1tag mtu 1501"
+ run_cmd "$h1" "traceroute -e 192.0.2.3 $pkt_len | grep '<INC:$r1_ifindex,\"eth1tag\",mtu=1501>'"
+ check_err $? "Wrong incoming interface info reported from R1 after name and MTU change"
+ run_cmd "$r1" "ip link set dev eth1tag name eth1 mtu 1500"
+
+ run_cmd "$r1" "sysctl -w net.ipv4.icmp_errors_extension_mask=0x00"
+ run_cmd "$h2" "sysctl -w net.ipv4.icmp_errors_extension_mask=0x00"
+}
+
+run_traceroute_ext()
+{
+ # Need at least version 2.1.5 for RFC 5837 support.
+ if ! check_traceroute_version 2.1.5; then
+ log_test_skip "traceroute too old, missing ICMP extensions support"
+ return
+ fi
+
+ setup_traceroute_ext
+
+ RET=0
+
+ ## General ICMP extensions tests
+
+ # Test that ICMP extensions are disabled by default.
+ run_cmd "$h1" "sysctl net.ipv4.icmp_errors_extension_mask | grep \"= 0$\""
+ check_err $? "ICMP extensions are not disabled by default"
+
+ # Test that unsupported values are rejected. Do not use "sysctl" as
+ # older versions do not return an error code upon failure.
+ run_cmd "$h1" "bash -c \"echo 0x80 > /proc/sys/net/ipv4/icmp_errors_extension_mask\""
+ check_fail $? "Unsupported sysctl value was not rejected"
+
+ ## Extension-specific tests
+
+ # Incoming interface info test. Test with various packet sizes,
+ # including the default one.
+ traceroute_ext_iio_iif_test
+ traceroute_ext_iio_iif_test 127
+ traceroute_ext_iio_iif_test 128
+ traceroute_ext_iio_iif_test 129
+
+ log_test "IPv4 traceroute with ICMP extensions"
+
+ cleanup_traceroute_ext
+}
+
################################################################################
# Run tests
@@ -444,8 +754,10 @@ run_tests()
{
run_traceroute6
run_traceroute6_vrf
+ run_traceroute6_ext
run_traceroute
run_traceroute_vrf
+ run_traceroute_ext
}
################################################################################
@@ -462,6 +774,7 @@ done
require_command traceroute6
require_command traceroute
+require_command jq
run_tests
--
2.51.0
^ permalink raw reply related [flat|nested] 15+ messages in thread* Re: [PATCH net-next v2 0/3] icmp: Add RFC 5837 support
2025-10-27 8:22 [PATCH net-next v2 0/3] icmp: Add RFC 5837 support Ido Schimmel
` (2 preceding siblings ...)
2025-10-27 8:22 ` [PATCH net-next v2 3/3] selftests: traceroute: Add ICMP extensions tests Ido Schimmel
@ 2025-10-29 1:04 ` Jakub Kicinski
2025-10-29 9:54 ` Ido Schimmel
2025-10-30 1:38 ` Justin Iurman
2025-10-30 1:50 ` patchwork-bot+netdevbpf
4 siblings, 2 replies; 15+ messages in thread
From: Jakub Kicinski @ 2025-10-29 1:04 UTC (permalink / raw)
To: Ido Schimmel
Cc: netdev, davem, pabeni, edumazet, horms, dsahern, petrm, willemb,
daniel, fw, ishaangandhi, rbonica, tom, Justin Iurman
On Mon, 27 Oct 2025 10:22:29 +0200 Ido Schimmel wrote:
> This patchset extends certain ICMP error messages (e.g., "Time
> Exceeded") with incoming interface information in accordance with RFC
> 5837 [1]. This is required for more meaningful traceroute results in
> unnumbered networks. Like other ICMP settings, the feature is controlled
> via a per-{netns, address family} sysctl. The interface and the
> implementation are designed to support more ICMP extensions.
Is there supposed to be any relation between the ICMP message attrs
and what's provided via IOAM? For interface ID in IOAM we have
the ioam6_id attr instead of ifindex.
Would it make sense to add some info about relation to IOAM to the
commit msg (or even docs?). Or is it obvious to folks more familiar
with IP RFCs than I am?
cc: Justin
^ permalink raw reply [flat|nested] 15+ messages in thread* Re: [PATCH net-next v2 0/3] icmp: Add RFC 5837 support
2025-10-29 1:04 ` [PATCH net-next v2 0/3] icmp: Add RFC 5837 support Jakub Kicinski
@ 2025-10-29 9:54 ` Ido Schimmel
2025-10-30 1:31 ` Jakub Kicinski
2025-10-30 1:38 ` Justin Iurman
1 sibling, 1 reply; 15+ messages in thread
From: Ido Schimmel @ 2025-10-29 9:54 UTC (permalink / raw)
To: Jakub Kicinski
Cc: netdev, davem, pabeni, edumazet, horms, dsahern, petrm, willemb,
daniel, fw, ishaangandhi, rbonica, tom, Justin Iurman
On Tue, Oct 28, 2025 at 06:04:32PM -0700, Jakub Kicinski wrote:
> On Mon, 27 Oct 2025 10:22:29 +0200 Ido Schimmel wrote:
> > This patchset extends certain ICMP error messages (e.g., "Time
> > Exceeded") with incoming interface information in accordance with RFC
> > 5837 [1]. This is required for more meaningful traceroute results in
> > unnumbered networks. Like other ICMP settings, the feature is controlled
> > via a per-{netns, address family} sysctl. The interface and the
> > implementation are designed to support more ICMP extensions.
>
> Is there supposed to be any relation between the ICMP message attrs
> and what's provided via IOAM? For interface ID in IOAM we have
> the ioam6_id attr instead of ifindex.
RFC 5837 precedes IOAM and I don't see any references from IOAM to RFC
5837. RFC 5837 is pretty clear about the interface index that should be
provided:
"The ifIndex of the interface of interest MAY be included. This is the
32-bit ifIndex assigned to the interface by the device as specified by
the Interfaces Group MIB [RFC2863]".
> Would it make sense to add some info about relation to IOAM to the
> commit msg (or even docs?). Or is it obvious to folks more familiar
> with IP RFCs than I am?
I'm pretty sure that when people are told that the message contains "the
32-bit ifIndex" they don't think about "the wide IOAM id of this
interface".
^ permalink raw reply [flat|nested] 15+ messages in thread* Re: [PATCH net-next v2 0/3] icmp: Add RFC 5837 support
2025-10-29 9:54 ` Ido Schimmel
@ 2025-10-30 1:31 ` Jakub Kicinski
2025-11-07 17:20 ` David 'equinox' Lamparter
0 siblings, 1 reply; 15+ messages in thread
From: Jakub Kicinski @ 2025-10-30 1:31 UTC (permalink / raw)
To: Ido Schimmel
Cc: netdev, davem, pabeni, edumazet, horms, dsahern, petrm, willemb,
daniel, fw, ishaangandhi, rbonica, tom, Justin Iurman
On Wed, 29 Oct 2025 11:54:43 +0200 Ido Schimmel wrote:
> > Is there supposed to be any relation between the ICMP message attrs
> > and what's provided via IOAM? For interface ID in IOAM we have
> > the ioam6_id attr instead of ifindex.
>
> RFC 5837 precedes IOAM and I don't see any references from IOAM to RFC
> 5837. RFC 5837 is pretty clear about the interface index that should be
> provided:
>
> "The ifIndex of the interface of interest MAY be included. This is the
> 32-bit ifIndex assigned to the interface by the device as specified by
> the Interfaces Group MIB [RFC2863]".
Makes sense, thanks. And we have another 4 weeks to change our mind,
in case someone from IETF pipes up..
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH net-next v2 0/3] icmp: Add RFC 5837 support
2025-10-30 1:31 ` Jakub Kicinski
@ 2025-11-07 17:20 ` David 'equinox' Lamparter
2025-11-08 15:39 ` Ido Schimmel
0 siblings, 1 reply; 15+ messages in thread
From: David 'equinox' Lamparter @ 2025-11-07 17:20 UTC (permalink / raw)
To: Ido Schimmel
Cc: Jakub Kicinski, netdev, davem, pabeni, edumazet, horms, dsahern,
petrm, willemb, daniel, fw, ishaangandhi, rbonica, tom,
Justin Iurman
On Wed, Oct 29, 2025 at 06:31:43PM -0700, Jakub Kicinski wrote:
> On Wed, 29 Oct 2025 11:54:43 +0200 Ido Schimmel wrote:
> > > Is there supposed to be any relation between the ICMP message attrs
> > > and what's provided via IOAM? For interface ID in IOAM we have
> > > the ioam6_id attr instead of ifindex.
> >
> > RFC 5837 precedes IOAM and I don't see any references from IOAM to RFC
> > 5837. RFC 5837 is pretty clear about the interface index that should be
> > provided:
> >
> > "The ifIndex of the interface of interest MAY be included. This is the
> > 32-bit ifIndex assigned to the interface by the device as specified by
> > the Interfaces Group MIB [RFC2863]".
>
> Makes sense, thanks. And we have another 4 weeks to change our mind,
> in case someone from IETF pipes up..
The IETF is in fact doing draft-ietf-intarea-extended-icmp-nodeid, which
is past last call. The good news is that it's extremely similar,
different class value but same C-Type bitmask, the main distinction is
that 5837 had forbidden the use of "cross-address-family" addresses.
Note that for unnumbered networks, 5837 is wrong - it's
interface/nexthop information. But the interface has no address, the
node does. draft-ietf-intarea-extended-icmp-nodeid is about node
information and the correct thing to use for that case.
The good news is that the draft is past last call, IANA values have been
assigned, there's a bunch of text bashing going on but it's well into
the publishing process.
-David
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH net-next v2 0/3] icmp: Add RFC 5837 support
2025-11-07 17:20 ` David 'equinox' Lamparter
@ 2025-11-08 15:39 ` Ido Schimmel
0 siblings, 0 replies; 15+ messages in thread
From: Ido Schimmel @ 2025-11-08 15:39 UTC (permalink / raw)
To: David 'equinox' Lamparter
Cc: Jakub Kicinski, netdev, davem, pabeni, edumazet, horms, dsahern,
petrm, willemb, daniel, fw, ishaangandhi, rbonica, tom,
Justin Iurman
On Fri, Nov 07, 2025 at 06:20:57PM +0100, David 'equinox' Lamparter wrote:
> The IETF is in fact doing draft-ietf-intarea-extended-icmp-nodeid, which
> is past last call. The good news is that it's extremely similar,
> different class value but same C-Type bitmask, the main distinction is
> that 5837 had forbidden the use of "cross-address-family" addresses.
I mentioned the node ID extension in both the cover letter and in my
reply to Willem:
https://lore.kernel.org/netdev/20251027082232.232571-1-idosch@nvidia.com/
https://lore.kernel.org/netdev/aPnw2PkF3ZMP9EJr@shredder/
> Note that for unnumbered networks, 5837 is wrong - it's
> interface/nexthop information. But the interface has no address, the
> node does. draft-ietf-intarea-extended-icmp-nodeid is about node
> information and the correct thing to use for that case.
RFC 5837 and draft-ietf-intarea-extended-icmp-nodeid solve different
problems.
The motivating use case for this work is a deployment where router /
infrastructure interfaces are only assigned IPv6 link-local addresses
and the loopback devices are assigned global IPv4 and IPv6 addresses. As
such, each node will generate ICMPv4 error messages with a source IP
that uniquely identifies the node.
draft-ietf-intarea-extended-icmp-nodeid is needed in cases where nodes
completely lack IPv4 addresses. In these cases all the nodes will
generate ICMPv4 error messages with the same source IP of 192.0.0.8
("IPv4 dummy address").
While this patchset does not add support for the node ID extension (as I
don't currently have a use case for it), the implementation does not
preclude it. After it is implemented, an administrator can then choose
to include both the incoming interface information and the node
identification in ICMP error messages.
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH net-next v2 0/3] icmp: Add RFC 5837 support
2025-10-29 1:04 ` [PATCH net-next v2 0/3] icmp: Add RFC 5837 support Jakub Kicinski
2025-10-29 9:54 ` Ido Schimmel
@ 2025-10-30 1:38 ` Justin Iurman
1 sibling, 0 replies; 15+ messages in thread
From: Justin Iurman @ 2025-10-30 1:38 UTC (permalink / raw)
To: Jakub Kicinski, Ido Schimmel
Cc: netdev, davem, pabeni, edumazet, horms, dsahern, petrm, willemb,
daniel, fw, ishaangandhi, rbonica, tom
On 10/29/25 02:04, Jakub Kicinski wrote:
> On Mon, 27 Oct 2025 10:22:29 +0200 Ido Schimmel wrote:
>> This patchset extends certain ICMP error messages (e.g., "Time
>> Exceeded") with incoming interface information in accordance with RFC
>> 5837 [1]. This is required for more meaningful traceroute results in
>> unnumbered networks. Like other ICMP settings, the feature is controlled
>> via a per-{netns, address family} sysctl. The interface and the
>> implementation are designed to support more ICMP extensions.
>
> Is there supposed to be any relation between the ICMP message attrs
> and what's provided via IOAM? For interface ID in IOAM we have
> the ioam6_id attr instead of ifindex.
>
> Would it make sense to add some info about relation to IOAM to the
> commit msg (or even docs?). Or is it obvious to folks more familiar
> with IP RFCs than I am?
>
> cc: Justin
I concur with what Ido said in his reply. There is no direct relation
between them, unfortunately. The interface ID in IOAM context could be
totally different, although one could see the benefit of having the same
value.
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH net-next v2 0/3] icmp: Add RFC 5837 support
2025-10-27 8:22 [PATCH net-next v2 0/3] icmp: Add RFC 5837 support Ido Schimmel
` (3 preceding siblings ...)
2025-10-29 1:04 ` [PATCH net-next v2 0/3] icmp: Add RFC 5837 support Jakub Kicinski
@ 2025-10-30 1:50 ` patchwork-bot+netdevbpf
4 siblings, 0 replies; 15+ messages in thread
From: patchwork-bot+netdevbpf @ 2025-10-30 1:50 UTC (permalink / raw)
To: Ido Schimmel
Cc: netdev, davem, kuba, pabeni, edumazet, horms, dsahern, petrm,
willemb, daniel, fw, ishaangandhi, rbonica, tom
Hello:
This series was applied to netdev/net-next.git (main)
by Jakub Kicinski <kuba@kernel.org>:
On Mon, 27 Oct 2025 10:22:29 +0200 you wrote:
> tl;dr
> =====
>
> This patchset extends certain ICMP error messages (e.g., "Time
> Exceeded") with incoming interface information in accordance with RFC
> 5837 [1]. This is required for more meaningful traceroute results in
> unnumbered networks. Like other ICMP settings, the feature is controlled
> via a per-{netns, address family} sysctl. The interface and the
> implementation are designed to support more ICMP extensions.
>
> [...]
Here is the summary with links:
- [net-next,v2,1/3] ipv4: icmp: Add RFC 5837 support
https://git.kernel.org/netdev/net-next/c/f0e7036fc9cb
- [net-next,v2,2/3] ipv6: icmp: Add RFC 5837 support
https://git.kernel.org/netdev/net-next/c/d12d04d221f8
- [net-next,v2,3/3] selftests: traceroute: Add ICMP extensions tests
https://git.kernel.org/netdev/net-next/c/02da59575183
You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
^ permalink raw reply [flat|nested] 15+ messages in thread