Netdev List
 help / color / mirror / Atom feed
* [PATCH v2] net: phy: leds: Don't make our own link speed names
From: Kyle Roeschley @ 2018-11-08 21:03 UTC (permalink / raw)
  To: Andrew Lunn, Florian Fainelli
  Cc: David S . Miller, netdev, linux-kernel, Kyle Roeschley

The phy core provides a handy phy_speed_to_str() helper, so use that
instead of doing our own formatting of the different known link speeds.
To do this, increase PHY_LED_TRIGGER_SPEED_SUFFIX_SIZE to 11 so we can fit
'Unsupported' if necessary.

Signed-off-by: Kyle Roeschley <kyle.roeschley@ni.com>
---

v2: Increase PHY_LED_TRIGGER_SPEED_SUFFIX_SIZE to fit 'Unsupported'.

 drivers/net/phy/phy_led_triggers.c | 13 +------------
 include/linux/phy_led_triggers.h   |  2 +-
 2 files changed, 2 insertions(+), 13 deletions(-)

diff --git a/drivers/net/phy/phy_led_triggers.c b/drivers/net/phy/phy_led_triggers.c
index 491efc1bf5c4..2827eb413c9c 100644
--- a/drivers/net/phy/phy_led_triggers.c
+++ b/drivers/net/phy/phy_led_triggers.c
@@ -77,20 +77,9 @@ static int phy_led_trigger_register(struct phy_device *phy,
 				    struct phy_led_trigger *plt,
 				    unsigned int speed)
 {
-	char name_suffix[PHY_LED_TRIGGER_SPEED_SUFFIX_SIZE];
-
 	plt->speed = speed;
-
-	if (speed < SPEED_1000)
-		snprintf(name_suffix, sizeof(name_suffix), "%dMbps", speed);
-	else if (speed == SPEED_2500)
-		snprintf(name_suffix, sizeof(name_suffix), "2.5Gbps");
-	else
-		snprintf(name_suffix, sizeof(name_suffix), "%dGbps",
-			 DIV_ROUND_CLOSEST(speed, 1000));
-
 	phy_led_trigger_format_name(phy, plt->name, sizeof(plt->name),
-				    name_suffix);
+				    phy_speed_to_str(speed));
 	plt->trigger.name = plt->name;
 
 	return led_trigger_register(&plt->trigger);
diff --git a/include/linux/phy_led_triggers.h b/include/linux/phy_led_triggers.h
index b37b05bfd1a6..4587ce362535 100644
--- a/include/linux/phy_led_triggers.h
+++ b/include/linux/phy_led_triggers.h
@@ -20,7 +20,7 @@ struct phy_device;
 #include <linux/leds.h>
 #include <linux/phy.h>
 
-#define PHY_LED_TRIGGER_SPEED_SUFFIX_SIZE	10
+#define PHY_LED_TRIGGER_SPEED_SUFFIX_SIZE	11
 
 #define PHY_LINK_LED_TRIGGER_NAME_SIZE (MII_BUS_ID_SIZE + \
 				       FIELD_SIZEOF(struct mdio_device, addr)+\
-- 
2.19.1

^ permalink raw reply related

* RE: [PATCH v2 2/3] dt-bindings: can: rcar_can: Add r8a774a1 support
From: Fabrizio Castro @ 2018-11-08 11:25 UTC (permalink / raw)
  To: Fabrizio Castro, Wolfgang Grandegger, Marc Kleine-Budde,
	Rob Herring, Mark Rutland
  Cc: David S. Miller, Sergei Shtylyov, linux-can@vger.kernel.org,
	linux-kernel@vger.kernel.org, netdev@vger.kernel.org,
	devicetree@vger.kernel.org, Simon Horman, Geert Uytterhoeven,
	Chris Paterson, Biju Das, linux-renesas-soc@vger.kernel.org
In-Reply-To: <1536576195-11520-3-git-send-email-fabrizio.castro@bp.renesas.com>

Dear All,

Who is the best person to take this patch?

Thanks,
Fab

> From: Fabrizio Castro <fabrizio.castro@bp.renesas.com>
> Sent: 10 September 2018 11:43
> Subject: [PATCH v2 2/3] dt-bindings: can: rcar_can: Add r8a774a1 support
>
> Document RZ/G2M (r8a774a1) SoC specific bindings.
>
> Signed-off-by: Fabrizio Castro <fabrizio.castro@bp.renesas.com>
> Signed-off-by: Chris Paterson <Chris.Paterson2@renesas.com>
> Reviewed-by: Biju Das <biju.das@bp.renesas.com>
> ---
> v1->v2:
> * dropped "renesas,rzg-gen2-can" and fixed "clocks" property description
>   as per Geert's comments.
>
> This patch applies on top of next-20180910.
>
>  Documentation/devicetree/bindings/net/can/rcar_can.txt | 18 +++++++++++++-----
>  1 file changed, 13 insertions(+), 5 deletions(-)
>
> diff --git a/Documentation/devicetree/bindings/net/can/rcar_can.txt b/Documentation/devicetree/bindings/net/can/rcar_can.txt
> index 94a7f33..f3b160c 100644
> --- a/Documentation/devicetree/bindings/net/can/rcar_can.txt
> +++ b/Documentation/devicetree/bindings/net/can/rcar_can.txt
> @@ -4,6 +4,7 @@ Renesas R-Car CAN controller Device Tree Bindings
>  Required properties:
>  - compatible: "renesas,can-r8a7743" if CAN controller is a part of R8A7743 SoC.
>        "renesas,can-r8a7745" if CAN controller is a part of R8A7745 SoC.
> +      "renesas,can-r8a774a1" if CAN controller is a part of R8A774A1 SoC.
>        "renesas,can-r8a7778" if CAN controller is a part of R8A7778 SoC.
>        "renesas,can-r8a7779" if CAN controller is a part of R8A7779 SoC.
>        "renesas,can-r8a7790" if CAN controller is a part of R8A7790 SoC.
> @@ -16,15 +17,21 @@ Required properties:
>        "renesas,rcar-gen1-can" for a generic R-Car Gen1 compatible device.
>        "renesas,rcar-gen2-can" for a generic R-Car Gen2 or RZ/G1
>        compatible device.
> -      "renesas,rcar-gen3-can" for a generic R-Car Gen3 compatible device.
> +      "renesas,rcar-gen3-can" for a generic R-Car Gen3 or RZ/G2
> +      compatible device.
>        When compatible with the generic version, nodes must list the
>        SoC-specific version corresponding to the platform first
>        followed by the generic version.
>
>  - reg: physical base address and size of the R-Car CAN register map.
>  - interrupts: interrupt specifier for the sole interrupt.
> -- clocks: phandles and clock specifiers for 3 CAN clock inputs.
> -- clock-names: 3 clock input name strings: "clkp1", "clkp2", "can_clk".
> +- clocks: phandles and clock specifiers for 2 CAN clock inputs for RZ/G2
> +  devices.
> +  phandles and clock specifiers for 3 CAN clock inputs for every other
> +  SoC.
> +- clock-names: 2 clock input name strings for RZ/G2: "clkp1", "can_clk".
> +       3 clock input name strings for every other SoC: "clkp1", "clkp2",
> +       "can_clk".
>  - pinctrl-0: pin control group to be used for this controller.
>  - pinctrl-names: must be "default".
>
> @@ -41,8 +48,9 @@ using the below properties:
>  Optional properties:
>  - renesas,can-clock-select: R-Car CAN Clock Source Select. Valid values are:
>      <0x0> (default) : Peripheral clock (clkp1)
> -    <0x1> : Peripheral clock (clkp2)
> -    <0x3> : Externally input clock
> +    <0x1> : Peripheral clock (clkp2) (not supported by
> +    RZ/G2 devices)
> +    <0x3> : External input clock
>
>  Example
>  -------
> --
> 2.7.4




Renesas Electronics Europe Ltd, Dukes Meadow, Millboard Road, Bourne End, Buckinghamshire, SL8 5FH, UK. Registered in England & Wales under Registered No. 04586709.

^ permalink raw reply

* [PATCH iproute2 net-next v2 2/2] iplink_geneve: Add DF configuration
From: Stefano Brivio @ 2018-11-08 11:21 UTC (permalink / raw)
  To: David Ahern; +Cc: Stephen Hemminger, netdev
In-Reply-To: <cover.1541641480.git.sbrivio@redhat.com>

Allow to set the DF bit behaviour for outgoing IPv4 packets: it can be
always on, inherited from the inner header, or, by default, always off,
which is the current behaviour.

v2:
- Indicate in the man page what DF refers to, using RFC 791 wording
  (David Ahern)

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 include/uapi/linux/if_link.h |  9 +++++++++
 ip/iplink_geneve.c           | 29 +++++++++++++++++++++++++++++
 man/man8/ip-link.8.in        | 14 ++++++++++++++
 3 files changed, 52 insertions(+)

diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index 4caf683ce546..183ca7527178 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -563,10 +563,19 @@ enum {
 	IFLA_GENEVE_UDP_ZERO_CSUM6_RX,
 	IFLA_GENEVE_LABEL,
 	IFLA_GENEVE_TTL_INHERIT,
+	IFLA_GENEVE_DF,
 	__IFLA_GENEVE_MAX
 };
 #define IFLA_GENEVE_MAX	(__IFLA_GENEVE_MAX - 1)
 
+enum ifla_geneve_df {
+	GENEVE_DF_UNSET = 0,
+	GENEVE_DF_SET,
+	GENEVE_DF_INHERIT,
+	__GENEVE_DF_END,
+	GENEVE_DF_MAX = __GENEVE_DF_END - 1,
+};
+
 /* PPP section */
 enum {
 	IFLA_PPP_UNSPEC,
diff --git a/ip/iplink_geneve.c b/ip/iplink_geneve.c
index c417842b2a5b..1872b74c5d70 100644
--- a/ip/iplink_geneve.c
+++ b/ip/iplink_geneve.c
@@ -24,6 +24,7 @@ static void print_explain(FILE *f)
 		"                  remote ADDR\n"
 		"                  [ ttl TTL ]\n"
 		"                  [ tos TOS ]\n"
+		"                  [ df DF ]\n"
 		"                  [ flowlabel LABEL ]\n"
 		"                  [ dstport PORT ]\n"
 		"                  [ [no]external ]\n"
@@ -35,6 +36,7 @@ static void print_explain(FILE *f)
 		"       ADDR  := IP_ADDRESS\n"
 		"       TOS   := { NUMBER | inherit }\n"
 		"       TTL   := { 1..255 | auto | inherit }\n"
+		"       DF    := { unset | set | inherit }\n"
 		"       LABEL := 0-1048575\n"
 	);
 }
@@ -115,6 +117,22 @@ static int geneve_parse_opt(struct link_util *lu, int argc, char **argv,
 				tos = uval;
 			} else
 				tos = 1;
+		} else if (!matches(*argv, "df")) {
+			enum ifla_geneve_df df;
+
+			NEXT_ARG();
+			check_duparg(&attrs, IFLA_GENEVE_DF, "df", *argv);
+			if (strcmp(*argv, "unset") == 0)
+				df = GENEVE_DF_UNSET;
+			else if (strcmp(*argv, "set") == 0)
+				df = GENEVE_DF_SET;
+			else if (strcmp(*argv, "inherit") == 0)
+				df = GENEVE_DF_INHERIT;
+			else
+				invarg("DF must be 'unset', 'set' or 'inherit'",
+				       *argv);
+
+			addattr8(n, 1024, IFLA_GENEVE_DF, df);
 		} else if (!matches(*argv, "label") ||
 			   !matches(*argv, "flowlabel")) {
 			__u32 uval;
@@ -287,6 +305,17 @@ static void geneve_print_opt(struct link_util *lu, FILE *f, struct rtattr *tb[])
 			print_string(PRINT_FP, NULL, "tos %s ", "inherit");
 	}
 
+	if (tb[IFLA_GENEVE_DF]) {
+		enum ifla_geneve_df df = rta_getattr_u8(tb[IFLA_GENEVE_DF]);
+
+		if (df == GENEVE_DF_UNSET)
+			print_string(PRINT_JSON, "df", "df %s ", "unset");
+		else if (df == GENEVE_DF_SET)
+			print_string(PRINT_ANY, "df", "df %s ", "set");
+		else if (df == GENEVE_DF_INHERIT)
+			print_string(PRINT_ANY, "df", "df %s ", "inherit");
+	}
+
 	if (tb[IFLA_GENEVE_LABEL]) {
 		__u32 label = rta_getattr_u32(tb[IFLA_GENEVE_LABEL]);
 
diff --git a/man/man8/ip-link.8.in b/man/man8/ip-link.8.in
index a94cf4f19f1e..73d37c190fff 100644
--- a/man/man8/ip-link.8.in
+++ b/man/man8/ip-link.8.in
@@ -1180,6 +1180,8 @@ the following additional arguments are supported:
 ] [
 .BI tos " TOS "
 ] [
+.BI df " DF "
+] [
 .BI flowlabel " FLOWLABEL "
 ] [
 .BI dstport " PORT"
@@ -1212,6 +1214,18 @@ ttl. Default option is "0".
 .BI tos " TOS"
 - specifies the TOS value to use in outgoing packets.
 
+.sp
+.BI df " DF"
+- specifies the usage of the Don't Fragment flag (DF) bit in outgoing packets
+with IPv4 headers. The value
+.B inherit
+causes the bit to be copied from the original IP header. The values
+.B unset
+and
+.B set
+cause the bit to be always unset or always set, respectively. By default, the
+bit is not set.
+
 .sp
 .BI flowlabel " FLOWLABEL"
 - specifies the flow label to use in outgoing packets.
-- 
2.19.1

^ permalink raw reply related

* [PATCH iproute2 net-next v2 1/2] iplink_vxlan: Add DF configuration
From: Stefano Brivio @ 2018-11-08 11:21 UTC (permalink / raw)
  To: David Ahern; +Cc: Stephen Hemminger, netdev
In-Reply-To: <cover.1541641480.git.sbrivio@redhat.com>

Allow to set the DF bit behaviour for outgoing IPv4 packets: it can be
always on, inherited from the inner header, or, by default, always off,
which is the current behaviour.

v2:
- Indicate in the man page what DF refers to, using RFC 791 wording
  (David Ahern)

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 include/uapi/linux/if_link.h |  9 +++++++++
 ip/iplink_vxlan.c            | 29 +++++++++++++++++++++++++++++
 man/man8/ip-link.8.in        | 14 ++++++++++++++
 3 files changed, 52 insertions(+)

diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index 9c254603ebda..4caf683ce546 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -530,6 +530,7 @@ enum {
 	IFLA_VXLAN_LABEL,
 	IFLA_VXLAN_GPE,
 	IFLA_VXLAN_TTL_INHERIT,
+	IFLA_VXLAN_DF,
 	__IFLA_VXLAN_MAX
 };
 #define IFLA_VXLAN_MAX	(__IFLA_VXLAN_MAX - 1)
@@ -539,6 +540,14 @@ struct ifla_vxlan_port_range {
 	__be16	high;
 };
 
+enum ifla_vxlan_df {
+	VXLAN_DF_UNSET = 0,
+	VXLAN_DF_SET,
+	VXLAN_DF_INHERIT,
+	__VXLAN_DF_END,
+	VXLAN_DF_MAX = __VXLAN_DF_END - 1,
+};
+
 /* GENEVE section */
 enum {
 	IFLA_GENEVE_UNSPEC,
diff --git a/ip/iplink_vxlan.c b/ip/iplink_vxlan.c
index 7fc0e2b4eb06..86afbe1334f0 100644
--- a/ip/iplink_vxlan.c
+++ b/ip/iplink_vxlan.c
@@ -31,6 +31,7 @@ static void print_explain(FILE *f)
 		"                 [ local ADDR ]\n"
 		"                 [ ttl TTL ]\n"
 		"                 [ tos TOS ]\n"
+		"                 [ df DF ]\n"
 		"                 [ flowlabel LABEL ]\n"
 		"                 [ dev PHYS_DEV ]\n"
 		"                 [ dstport PORT ]\n"
@@ -52,6 +53,7 @@ static void print_explain(FILE *f)
 		"       ADDR  := { IP_ADDRESS | any }\n"
 		"       TOS   := { NUMBER | inherit }\n"
 		"       TTL   := { 1..255 | auto | inherit }\n"
+		"       DF    := { unset | set | inherit }\n"
 		"       LABEL := 0-1048575\n"
 	);
 }
@@ -170,6 +172,22 @@ static int vxlan_parse_opt(struct link_util *lu, int argc, char **argv,
 			} else
 				tos = 1;
 			addattr8(n, 1024, IFLA_VXLAN_TOS, tos);
+		} else if (!matches(*argv, "df")) {
+			enum ifla_vxlan_df df;
+
+			NEXT_ARG();
+			check_duparg(&attrs, IFLA_VXLAN_DF, "df", *argv);
+			if (strcmp(*argv, "unset") == 0)
+				df = VXLAN_DF_UNSET;
+			else if (strcmp(*argv, "set") == 0)
+				df = VXLAN_DF_SET;
+			else if (strcmp(*argv, "inherit") == 0)
+				df = VXLAN_DF_INHERIT;
+			else
+				invarg("DF must be 'unset', 'set' or 'inherit'",
+				       *argv);
+
+			addattr8(n, 1024, IFLA_VXLAN_DF, df);
 		} else if (!matches(*argv, "label") ||
 			   !matches(*argv, "flowlabel")) {
 			__u32 uval;
@@ -538,6 +556,17 @@ static void vxlan_print_opt(struct link_util *lu, FILE *f, struct rtattr *tb[])
 			print_string(PRINT_FP, NULL, "ttl %s ", "auto");
 	}
 
+	if (tb[IFLA_VXLAN_DF]) {
+		enum ifla_vxlan_df df = rta_getattr_u8(tb[IFLA_VXLAN_DF]);
+
+		if (df == VXLAN_DF_UNSET)
+			print_string(PRINT_JSON, "df", "df %s ", "unset");
+		else if (df == VXLAN_DF_SET)
+			print_string(PRINT_ANY, "df", "df %s ", "set");
+		else if (df == VXLAN_DF_INHERIT)
+			print_string(PRINT_ANY, "df", "df %s ", "inherit");
+	}
+
 	if (tb[IFLA_VXLAN_LABEL]) {
 		__u32 label = rta_getattr_u32(tb[IFLA_VXLAN_LABEL]);
 
diff --git a/man/man8/ip-link.8.in b/man/man8/ip-link.8.in
index 5132f514b279..a94cf4f19f1e 100644
--- a/man/man8/ip-link.8.in
+++ b/man/man8/ip-link.8.in
@@ -496,6 +496,8 @@ the following additional arguments are supported:
 ] [
 .BI tos " TOS "
 ] [
+.BI df " DF "
+] [
 .BI flowlabel " FLOWLABEL "
 ] [
 .BI dstport " PORT "
@@ -565,6 +567,18 @@ parameter.
 .BI tos " TOS"
 - specifies the TOS value to use in outgoing packets.
 
+.sp
+.BI df " DF"
+- specifies the usage of the Don't Fragment flag (DF) bit in outgoing packets
+with IPv4 headers. The value
+.B inherit
+causes the bit to be copied from the original IP header. The values
+.B unset
+and
+.B set
+cause the bit to be always unset or always set, respectively. By default, the
+bit is not set.
+
 .sp
 .BI flowlabel " FLOWLABEL"
 - specifies the flow label to use in outgoing packets.
-- 
2.19.1

^ permalink raw reply related

* [PATCH iproute2 net-next v2 0/2] Add DF configuration for VXLAN and GENEVE link types
From: Stefano Brivio @ 2018-11-08 11:21 UTC (permalink / raw)
  To: David Ahern; +Cc: Stephen Hemminger, netdev

This series adds configuration of the DF bit in outgoing IPv4 packets for
VXLAN and GENEVE link types.

Stefano Brivio (2):
  iplink_vxlan: Add DF configuration
  iplink_geneve: Add DF configuration

 include/uapi/linux/if_link.h | 18 ++++++++++++++++++
 ip/iplink_geneve.c           | 29 +++++++++++++++++++++++++++++
 ip/iplink_vxlan.c            | 29 +++++++++++++++++++++++++++++
 man/man8/ip-link.8.in        | 28 ++++++++++++++++++++++++++++
 4 files changed, 104 insertions(+)

-- 
2.19.1

^ permalink raw reply

* [PATCH net-next v2 11/11] selftests: pmtu: Introduce FoU and GUE PMTU exceptions tests
From: Stefano Brivio @ 2018-11-08 11:19 UTC (permalink / raw)
  To: David S. Miller
  Cc: Sabrina Dubroca, Xin Long, Stephen Hemminger, Jiri Benc,
	David Ahern, netdev
In-Reply-To: <cover.1541675666.git.sbrivio@redhat.com>

Introduce eight tests, for FoU and GUE, with IPv4 and IPv6 payload,
on IPv4 and IPv6 transport, that check that PMTU exceptions are created
with the right value when exceeding the MTU on a link of the path.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
---
v2: no changes

 tools/testing/selftests/net/pmtu.sh | 179 ++++++++++++++++++++++++++++
 1 file changed, 179 insertions(+)

diff --git a/tools/testing/selftests/net/pmtu.sh b/tools/testing/selftests/net/pmtu.sh
index 14d20782ccb7..e2c94e47707c 100755
--- a/tools/testing/selftests/net/pmtu.sh
+++ b/tools/testing/selftests/net/pmtu.sh
@@ -59,6 +59,14 @@
 #	Same as pmtu_ipv6_vxlan6_exception, but using a GENEVE tunnel instead of
 #	VXLAN
 #
+# - pmtu_ipv{4,6}_fou{4,6}_exception
+#	Same as pmtu_ipv4_vxlan4, but using a direct IPv4/IPv6 encapsulation
+#	(FoU) over IPv4/IPv6, instead of VXLAN
+#
+# - pmtu_ipv{4,6}_fou{4,6}_exception
+#	Same as pmtu_ipv4_vxlan4, but using a generic UDP IPv4/IPv6
+#	encapsulation (GUE) over IPv4/IPv6, instead of VXLAN
+#
 # - pmtu_vti4_exception
 #	Set up vti tunnel on top of veth, with xfrm states and policies, in two
 #	namespaces with matching endpoints. Check that route exception is not
@@ -113,6 +121,14 @@ tests="
 	pmtu_ipv6_geneve4_exception	IPv6 over geneve4: PMTU exceptions
 	pmtu_ipv4_geneve6_exception	IPv4 over geneve6: PMTU exceptions
 	pmtu_ipv6_geneve6_exception	IPv6 over geneve6: PMTU exceptions
+	pmtu_ipv4_fou4_exception	IPv4 over fou4: PMTU exceptions
+	pmtu_ipv6_fou4_exception	IPv6 over fou4: PMTU exceptions
+	pmtu_ipv4_fou6_exception	IPv4 over fou6: PMTU exceptions
+	pmtu_ipv6_fou6_exception	IPv6 over fou6: PMTU exceptions
+	pmtu_ipv4_gue4_exception	IPv4 over gue4: PMTU exceptions
+	pmtu_ipv6_gue4_exception	IPv6 over gue4: PMTU exceptions
+	pmtu_ipv4_gue6_exception	IPv4 over gue6: PMTU exceptions
+	pmtu_ipv6_gue6_exception	IPv6 over gue6: PMTU exceptions
 	pmtu_vti6_exception		vti6: PMTU exceptions
 	pmtu_vti4_exception		vti4: PMTU exceptions
 	pmtu_vti4_default_mtu		vti4: default MTU assignment
@@ -200,6 +216,89 @@ nsname() {
 	eval echo \$NS_$1
 }
 
+setup_fou_or_gue() {
+	outer="${1}"
+	inner="${2}"
+	encap="${3}"
+
+	if [ "${outer}" = "4" ]; then
+		modprobe fou || return 2
+		a_addr="${prefix4}.${a_r1}.1"
+		b_addr="${prefix4}.${b_r1}.1"
+		if [ "${inner}" = "4" ]; then
+			type="ipip"
+			ipproto="4"
+		else
+			type="sit"
+			ipproto="41"
+		fi
+	else
+		modprobe fou6 || return 2
+		a_addr="${prefix6}:${a_r1}::1"
+		b_addr="${prefix6}:${b_r1}::1"
+		if [ "${inner}" = "4" ]; then
+			type="ip6tnl"
+			mode="mode ipip6"
+			ipproto="4 -6"
+		else
+			type="ip6tnl"
+			mode="mode ip6ip6"
+			ipproto="41 -6"
+		fi
+	fi
+
+	${ns_a} ip fou add port 5555 ipproto ${ipproto} || return 2
+	${ns_a} ip link add ${encap}_a type ${type} ${mode} local ${a_addr} remote ${b_addr} encap ${encap} encap-sport auto encap-dport 5556 || return 2
+
+	${ns_b} ip fou add port 5556 ipproto ${ipproto}
+	${ns_b} ip link add ${encap}_b type ${type} ${mode} local ${b_addr} remote ${a_addr} encap ${encap} encap-sport auto encap-dport 5555
+
+	if [ "${inner}" = "4" ]; then
+		${ns_a} ip addr add ${tunnel4_a_addr}/${tunnel4_mask} dev ${encap}_a
+		${ns_b} ip addr add ${tunnel4_b_addr}/${tunnel4_mask} dev ${encap}_b
+	else
+		${ns_a} ip addr add ${tunnel6_a_addr}/${tunnel6_mask} dev ${encap}_a
+		${ns_b} ip addr add ${tunnel6_b_addr}/${tunnel6_mask} dev ${encap}_b
+	fi
+
+	${ns_a} ip link set ${encap}_a up
+	${ns_b} ip link set ${encap}_b up
+
+	sleep 1
+}
+
+setup_fou44() {
+	setup_fou_or_gue 4 4 fou
+}
+
+setup_fou46() {
+	setup_fou_or_gue 4 6 fou
+}
+
+setup_fou64() {
+	setup_fou_or_gue 6 4 fou
+}
+
+setup_fou66() {
+	setup_fou_or_gue 6 6 fou
+}
+
+setup_gue44() {
+	setup_fou_or_gue 4 4 gue
+}
+
+setup_gue46() {
+	setup_fou_or_gue 4 6 gue
+}
+
+setup_gue64() {
+	setup_fou_or_gue 6 4 gue
+}
+
+setup_gue66() {
+	setup_fou_or_gue 6 6 gue
+}
+
 setup_namespaces() {
 	for n in ${NS_A} ${NS_B} ${NS_R1} ${NS_R2}; do
 		ip netns add ${n} || return 1
@@ -627,6 +726,86 @@ test_pmtu_ipv6_geneve6_exception() {
 	test_pmtu_ipvX_over_vxlanY_or_geneveY_exception geneve 6 6
 }
 
+test_pmtu_ipvX_over_fouY_or_gueY() {
+	inner_family=${1}
+	outer_family=${2}
+	encap=${3}
+	ll_mtu=4000
+
+	setup namespaces routing ${encap}${outer_family}${inner_family} || return 2
+	trace "${ns_a}" ${encap}_a   "${ns_b}"  ${encap}_b \
+	      "${ns_a}" veth_A-R1    "${ns_r1}" veth_R1-A \
+	      "${ns_b}" veth_B-R1    "${ns_r1}" veth_R1-B
+
+	if [ ${inner_family} -eq 4 ]; then
+		ping=ping
+		dst=${tunnel4_b_addr}
+	else
+		ping=${ping6}
+		dst=${tunnel6_b_addr}
+	fi
+
+	if [ "${encap}" = "gue" ]; then
+		encap_overhead=4
+	else
+		encap_overhead=0
+	fi
+
+	if [ ${outer_family} -eq 4 ]; then
+		#                      IPv4 header   UDP header
+		exp_mtu=$((${ll_mtu} - 20          - 8         - ${encap_overhead}))
+	else
+		#                      IPv6 header   Option 4   UDP header
+		exp_mtu=$((${ll_mtu} - 40          - 8        - 8       - ${encap_overhead}))
+	fi
+
+	# Create route exception by exceeding link layer MTU
+	mtu "${ns_a}"  veth_A-R1 $((${ll_mtu} + 1000))
+	mtu "${ns_r1}" veth_R1-A $((${ll_mtu} + 1000))
+	mtu "${ns_b}"  veth_B-R1 ${ll_mtu}
+	mtu "${ns_r1}" veth_R1-B ${ll_mtu}
+
+	mtu "${ns_a}" ${encap}_a $((${ll_mtu} + 1000))
+	mtu "${ns_b}" ${encap}_b $((${ll_mtu} + 1000))
+	${ns_a} ${ping} -q -M want -i 0.1 -w 2 -s $((${ll_mtu} + 500)) ${dst} > /dev/null
+
+	# Check that exception was created
+	pmtu="$(route_get_dst_pmtu_from_exception "${ns_a}" ${dst})"
+	check_pmtu_value ${exp_mtu} "${pmtu}" "exceeding link layer MTU on ${encap} interface"
+}
+
+test_pmtu_ipv4_fou4_exception() {
+	test_pmtu_ipvX_over_fouY_or_gueY 4 4 fou
+}
+
+test_pmtu_ipv6_fou4_exception() {
+	test_pmtu_ipvX_over_fouY_or_gueY 6 4 fou
+}
+
+test_pmtu_ipv4_fou6_exception() {
+	test_pmtu_ipvX_over_fouY_or_gueY 4 6 fou
+}
+
+test_pmtu_ipv6_fou6_exception() {
+	test_pmtu_ipvX_over_fouY_or_gueY 6 6 fou
+}
+
+test_pmtu_ipv4_gue4_exception() {
+	test_pmtu_ipvX_over_fouY_or_gueY 4 4 gue
+}
+
+test_pmtu_ipv6_gue4_exception() {
+	test_pmtu_ipvX_over_fouY_or_gueY 6 4 gue
+}
+
+test_pmtu_ipv4_gue6_exception() {
+	test_pmtu_ipvX_over_fouY_or_gueY 4 6 gue
+}
+
+test_pmtu_ipv6_gue6_exception() {
+	test_pmtu_ipvX_over_fouY_or_gueY 6 6 gue
+}
+
 test_pmtu_vti4_exception() {
 	setup namespaces veth vti4 xfrm4 || return 2
 	trace "${ns_a}" veth_a    "${ns_b}" veth_b \
-- 
2.19.1

^ permalink raw reply related

* [PATCH net-next v2 10/11] fou, fou6: ICMP error handlers for FoU and GUE
From: Stefano Brivio @ 2018-11-08 11:19 UTC (permalink / raw)
  To: David S. Miller
  Cc: Sabrina Dubroca, Xin Long, Stephen Hemminger, Jiri Benc,
	David Ahern, netdev
In-Reply-To: <cover.1541675666.git.sbrivio@redhat.com>

As the destination port in FoU and GUE receiving sockets doesn't
necessarily match the remote destination port, we can't associate errors
to the encapsulating tunnels with a socket lookup -- we need to blindly
try them instead. This means we don't even know if we are handling errors
for FoU or GUE without digging into the packets.

Hence, implement a single handler for both, one for IPv4 and one for IPv6,
that will check whether the packet that generated the ICMP error used a
direct IP encapsulation or if it had a GUE header, and send the error to
the matching protocol handler, if any.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
---
v2: no changes

 net/ipv4/fou.c      | 68 +++++++++++++++++++++++++++++++++++++++++
 net/ipv4/protocol.c |  1 +
 net/ipv6/fou6.c     | 74 +++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 143 insertions(+)

diff --git a/net/ipv4/fou.c b/net/ipv4/fou.c
index 500a59906b87..0d0ad19ecb87 100644
--- a/net/ipv4/fou.c
+++ b/net/ipv4/fou.c
@@ -3,6 +3,7 @@
 #include <linux/socket.h>
 #include <linux/skbuff.h>
 #include <linux/ip.h>
+#include <linux/icmp.h>
 #include <linux/udp.h>
 #include <linux/types.h>
 #include <linux/kernel.h>
@@ -1003,15 +1004,82 @@ static int gue_build_header(struct sk_buff *skb, struct ip_tunnel_encap *e,
 	return 0;
 }
 
+static int gue_err_proto_handler(int proto, struct sk_buff *skb, u32 info)
+{
+	const struct net_protocol *ipprot = rcu_dereference(inet_protos[proto]);
+
+	if (ipprot && ipprot->err_handler) {
+		if (!ipprot->err_handler(skb, info))
+			return 0;
+	}
+
+	return -ENOENT;
+}
+
+static int gue_err(struct sk_buff *skb, u32 info)
+{
+	int transport_offset = skb_transport_offset(skb);
+	struct guehdr *guehdr;
+	size_t optlen;
+	int ret;
+
+	if (skb->len < sizeof(struct udphdr) + sizeof(struct guehdr))
+		return -EINVAL;
+
+	guehdr = (struct guehdr *)&udp_hdr(skb)[1];
+
+	switch (guehdr->version) {
+	case 0: /* Full GUE header present */
+		break;
+	case 1: {
+		/* Direct encasulation of IPv4 or IPv6 */
+		skb_set_transport_header(skb, -(int)sizeof(struct icmphdr));
+
+		switch (((struct iphdr *)guehdr)->version) {
+		case 4:
+			ret = gue_err_proto_handler(IPPROTO_IPIP, skb, info);
+			goto out;
+#if IS_ENABLED(CONFIG_IPV6)
+		case 6:
+			ret = gue_err_proto_handler(IPPROTO_IPV6, skb, info);
+			goto out;
+#endif
+		default:
+			ret = -EOPNOTSUPP;
+			goto out;
+		}
+	}
+	default: /* Undefined version */
+		return -EOPNOTSUPP;
+	}
+
+	if (guehdr->control)
+		return -ENOENT;
+
+	optlen = guehdr->hlen << 2;
+
+	if (validate_gue_flags(guehdr, optlen))
+		return -EINVAL;
+
+	skb_set_transport_header(skb, -(int)sizeof(struct icmphdr));
+	ret = gue_err_proto_handler(guehdr->proto_ctype, skb, info);
+
+out:
+	skb_set_transport_header(skb, transport_offset);
+	return ret;
+}
+
 
 static const struct ip_tunnel_encap_ops fou_iptun_ops = {
 	.encap_hlen = fou_encap_hlen,
 	.build_header = fou_build_header,
+	.err_handler = gue_err,
 };
 
 static const struct ip_tunnel_encap_ops gue_iptun_ops = {
 	.encap_hlen = gue_encap_hlen,
 	.build_header = gue_build_header,
+	.err_handler = gue_err,
 };
 
 static int ip_tunnel_encap_add_fou_ops(void)
diff --git a/net/ipv4/protocol.c b/net/ipv4/protocol.c
index 32a691b7ce2c..92d249e053be 100644
--- a/net/ipv4/protocol.c
+++ b/net/ipv4/protocol.c
@@ -29,6 +29,7 @@
 #include <net/protocol.h>
 
 struct net_protocol __rcu *inet_protos[MAX_INET_PROTOS] __read_mostly;
+EXPORT_SYMBOL(inet_protos);
 const struct net_offload __rcu *inet_offloads[MAX_INET_PROTOS] __read_mostly;
 EXPORT_SYMBOL(inet_offloads);
 
diff --git a/net/ipv6/fou6.c b/net/ipv6/fou6.c
index 6de3c04b0f30..bd675c61deb1 100644
--- a/net/ipv6/fou6.c
+++ b/net/ipv6/fou6.c
@@ -4,6 +4,7 @@
 #include <linux/skbuff.h>
 #include <linux/ip.h>
 #include <linux/udp.h>
+#include <linux/icmpv6.h>
 #include <linux/types.h>
 #include <linux/kernel.h>
 #include <net/fou.h>
@@ -69,14 +70,87 @@ static int gue6_build_header(struct sk_buff *skb, struct ip_tunnel_encap *e,
 	return 0;
 }
 
+static int gue6_err_proto_handler(int proto, struct sk_buff *skb,
+				  struct inet6_skb_parm *opt,
+				  u8 type, u8 code, int offset, u32 info)
+{
+	const struct inet6_protocol *ipprot;
+
+	ipprot = rcu_dereference(inet6_protos[proto]);
+	if (ipprot && ipprot->err_handler) {
+		if (!ipprot->err_handler(skb, opt, type, code, offset, info))
+			return 0;
+	}
+
+	return -ENOENT;
+}
+
+static int gue6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
+		    u8 type, u8 code, int offset, __be32 info)
+{
+	int transport_offset = skb_transport_offset(skb);
+	struct guehdr *guehdr;
+	size_t optlen;
+	int ret;
+
+	if (skb->len < sizeof(struct udphdr) + sizeof(struct guehdr))
+		return -EINVAL;
+
+	guehdr = (struct guehdr *)&udp_hdr(skb)[1];
+
+	switch (guehdr->version) {
+	case 0: /* Full GUE header present */
+		break;
+	case 1: {
+		/* Direct encasulation of IPv4 or IPv6 */
+		skb_set_transport_header(skb, -(int)sizeof(struct icmp6hdr));
+
+		switch (((struct iphdr *)guehdr)->version) {
+		case 4:
+			ret = gue6_err_proto_handler(IPPROTO_IPIP, skb, opt,
+						     type, code, offset, info);
+			goto out;
+		case 6:
+			ret = gue6_err_proto_handler(IPPROTO_IPV6, skb, opt,
+						     type, code, offset, info);
+			goto out;
+		default:
+			ret = -EOPNOTSUPP;
+			goto out;
+		}
+	}
+	default: /* Undefined version */
+		return -EOPNOTSUPP;
+	}
+
+	if (guehdr->control)
+		return -ENOENT;
+
+	optlen = guehdr->hlen << 2;
+
+	if (validate_gue_flags(guehdr, optlen))
+		return -EINVAL;
+
+	skb_set_transport_header(skb, -(int)sizeof(struct icmp6hdr));
+	ret = gue6_err_proto_handler(guehdr->proto_ctype, skb,
+				     opt, type, code, offset, info);
+
+out:
+	skb_set_transport_header(skb, transport_offset);
+	return ret;
+}
+
+
 static const struct ip6_tnl_encap_ops fou_ip6tun_ops = {
 	.encap_hlen = fou_encap_hlen,
 	.build_header = fou6_build_header,
+	.err_handler = gue6_err,
 };
 
 static const struct ip6_tnl_encap_ops gue_ip6tun_ops = {
 	.encap_hlen = gue_encap_hlen,
 	.build_header = gue6_build_header,
+	.err_handler = gue6_err,
 };
 
 static int ip6_tnl_encap_add_fou_ops(void)
-- 
2.19.1

^ permalink raw reply related

* [PATCH net-next v2 09/11] udp: Support for error handlers of tunnels with arbitrary destination port
From: Stefano Brivio @ 2018-11-08 11:19 UTC (permalink / raw)
  To: David S. Miller
  Cc: Sabrina Dubroca, Xin Long, Stephen Hemminger, Jiri Benc,
	David Ahern, netdev
In-Reply-To: <cover.1541675666.git.sbrivio@redhat.com>

ICMP error handling is currently not possible for UDP tunnels not
employing a receiving socket with local destination port matching the
remote one, because we have no way to look them up.

Add an err_handler tunnel encapsulation operation that can be exported by
tunnels in order to pass the error to the protocol implementing the
encapsulation. We can't easily use a lookup function as we did for VXLAN
and GENEVE, as protocol error handlers, which would be in turn called by
implementations of this new operation, handle the errors themselves,
together with the tunnel lookup.

Without a socket, we can't be sure which encapsulation error handler is
the appropriate one: encapsulation handlers (the ones for FoU and GUE
introduced in the next patch, e.g.) will need to check the new error codes
returned by protocol handlers to figure out if errors match the given
encapsulation, and, in turn, report this error back, so that we can try
all of them in __udp{4,6}_lib_err_encap_no_sk() until we have a match.

v2:
- Name all arguments in err_handler prototypes (David Miller)

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
---
 include/net/ip6_tunnel.h |  2 ++
 include/net/ip_tunnels.h |  1 +
 net/ipv4/udp.c           | 70 +++++++++++++++++++++++++++----------
 net/ipv6/udp.c           | 75 +++++++++++++++++++++++++++++++---------
 4 files changed, 113 insertions(+), 35 deletions(-)

diff --git a/include/net/ip6_tunnel.h b/include/net/ip6_tunnel.h
index 236e40ba06bf..69b4bcf880c9 100644
--- a/include/net/ip6_tunnel.h
+++ b/include/net/ip6_tunnel.h
@@ -69,6 +69,8 @@ struct ip6_tnl_encap_ops {
 	size_t (*encap_hlen)(struct ip_tunnel_encap *e);
 	int (*build_header)(struct sk_buff *skb, struct ip_tunnel_encap *e,
 			    u8 *protocol, struct flowi6 *fl6);
+	int (*err_handler)(struct sk_buff *skb, struct inet6_skb_parm *opt,
+			   u8 type, u8 code, int offset, __be32 info);
 };
 
 #ifdef CONFIG_INET
diff --git a/include/net/ip_tunnels.h b/include/net/ip_tunnels.h
index b0d022ff6ea1..db6b2218a2ad 100644
--- a/include/net/ip_tunnels.h
+++ b/include/net/ip_tunnels.h
@@ -311,6 +311,7 @@ struct ip_tunnel_encap_ops {
 	size_t (*encap_hlen)(struct ip_tunnel_encap *e);
 	int (*build_header)(struct sk_buff *skb, struct ip_tunnel_encap *e,
 			    u8 *protocol, struct flowi4 *fl4);
+	int (*err_handler)(struct sk_buff *skb, u32 info);
 };
 
 #define MAX_IPTUN_ENCAP_OPS 8
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index a505ee5eb92c..6f8890c5bc7e 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -105,6 +105,7 @@
 #include <net/net_namespace.h>
 #include <net/icmp.h>
 #include <net/inet_hashtables.h>
+#include <net/ip_tunnels.h>
 #include <net/route.h>
 #include <net/checksum.h>
 #include <net/xfrm.h>
@@ -590,6 +591,26 @@ void udp_encap_enable(void)
 }
 EXPORT_SYMBOL(udp_encap_enable);
 
+/* Handler for tunnels with arbitrary destination ports: no socket lookup, go
+ * through error handlers in encapsulations looking for a match.
+ */
+static int __udp4_lib_err_encap_no_sk(struct sk_buff *skb, u32 info)
+{
+	int i;
+
+	for (i = 0; i < MAX_IPTUN_ENCAP_OPS; i++) {
+		int (*handler)(struct sk_buff *skb, u32 info);
+
+		if (!iptun_encaps[i])
+			continue;
+		handler = rcu_dereference(iptun_encaps[i]->err_handler);
+		if (handler && !handler(skb, info))
+			return 0;
+	}
+
+	return -ENOENT;
+}
+
 /* Try to match ICMP errors to UDP tunnels by looking up a socket without
  * reversing source and destination port: this will match tunnels that force the
  * same destination port on both endpoints (e.g. VXLAN, GENEVE). Note that
@@ -597,28 +618,25 @@ EXPORT_SYMBOL(udp_encap_enable);
  * different destination ports on endpoints, in this case we won't be able to
  * trace ICMP messages back to them.
  *
+ * If this doesn't match any socket, probe tunnels with arbitrary destination
+ * ports (e.g. FoU, GUE): there, the receiving socket is useless, as the port
+ * we've sent packets to won't necessarily match the local destination port.
+ *
  * Then ask the tunnel implementation to match the error against a valid
  * association.
  *
- * Return the socket if we have a match.
+ * Return an error if we can't find a match, the socket if we need further
+ * processing, zero otherwise.
  */
 static struct sock *__udp4_lib_err_encap(struct net *net,
 					 const struct iphdr *iph,
 					 struct udphdr *uh,
 					 struct udp_table *udptable,
-					 struct sk_buff *skb)
+					 struct sk_buff *skb, u32 info)
 {
-	int (*lookup)(struct sock *sk, struct sk_buff *skb);
 	int network_offset, transport_offset;
-	struct udp_sock *up;
 	struct sock *sk;
 
-	sk = __udp4_lib_lookup(net, iph->daddr, uh->source,
-			       iph->saddr, uh->dest, skb->dev->ifindex, 0,
-			       udptable, NULL);
-	if (!sk)
-		return NULL;
-
 	network_offset = skb_network_offset(skb);
 	transport_offset = skb_transport_offset(skb);
 
@@ -628,10 +646,20 @@ static struct sock *__udp4_lib_err_encap(struct net *net,
 	/* Transport header needs to point to the UDP header */
 	skb_set_transport_header(skb, iph->ihl << 2);
 
-	up = udp_sk(sk);
-	lookup = READ_ONCE(up->encap_err_lookup);
-	if (!lookup || lookup(sk, skb))
-		sk = NULL;
+	sk = __udp4_lib_lookup(net, iph->daddr, uh->source,
+			       iph->saddr, uh->dest, skb->dev->ifindex, 0,
+			       udptable, NULL);
+	if (sk) {
+		int (*lookup)(struct sock *sk, struct sk_buff *skb);
+		struct udp_sock *up = udp_sk(sk);
+
+		lookup = READ_ONCE(up->encap_err_lookup);
+		if (!lookup || lookup(sk, skb))
+			sk = NULL;
+	}
+
+	if (!sk)
+		sk = ERR_PTR(__udp4_lib_err_encap_no_sk(skb, info));
 
 	skb_set_transport_header(skb, transport_offset);
 	skb_set_network_header(skb, network_offset);
@@ -668,13 +696,19 @@ int __udp4_lib_err(struct sk_buff *skb, u32 info, struct udp_table *udptable)
 			       inet_sdif(skb), udptable, NULL);
 	if (!sk) {
 		/* No socket for error: try tunnels before discarding */
-		if (static_branch_unlikely(&udp_encap_needed_key))
-			sk = __udp4_lib_err_encap(net, iph, uh, udptable, skb);
+		sk = ERR_PTR(-ENOENT);
+		if (static_branch_unlikely(&udp_encap_needed_key)) {
+			sk = __udp4_lib_err_encap(net, iph, uh, udptable, skb,
+						  info);
+			if (!sk)
+				return 0;
+		}
 
-		if (!sk) {
+		if (IS_ERR(sk)) {
 			__ICMP_INC_STATS(net, ICMP_MIB_INERRORS);
-			return -ENOENT;
+			return PTR_ERR(sk);
 		}
+
 		tunnel = true;
 	}
 
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 61316ec48b51..0c0cb1611aef 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -45,6 +45,7 @@
 #include <net/raw.h>
 #include <net/tcp_states.h>
 #include <net/ip6_checksum.h>
+#include <net/ip6_tunnel.h>
 #include <net/xfrm.h>
 #include <net/inet_hashtables.h>
 #include <net/inet6_hashtables.h>
@@ -469,6 +470,29 @@ void udpv6_encap_enable(void)
 }
 EXPORT_SYMBOL(udpv6_encap_enable);
 
+/* Handler for tunnels with arbitrary destination ports: no socket lookup, go
+ * through error handlers in encapsulations looking for a match.
+ */
+static int __udp6_lib_err_encap_no_sk(struct sk_buff *skb,
+				      struct inet6_skb_parm *opt,
+				      u8 type, u8 code, int offset, u32 info)
+{
+	int i;
+
+	for (i = 0; i < MAX_IPTUN_ENCAP_OPS; i++) {
+		int (*handler)(struct sk_buff *skb, struct inet6_skb_parm *opt,
+			       u8 type, u8 code, int offset, u32 info);
+
+		if (!ip6tun_encaps[i])
+			continue;
+		handler = rcu_dereference(ip6tun_encaps[i]->err_handler);
+		if (handler && !handler(skb, opt, type, code, offset, info))
+			return 0;
+	}
+
+	return -ENOENT;
+}
+
 /* Try to match ICMP errors to UDP tunnels by looking up a socket without
  * reversing source and destination port: this will match tunnels that force the
  * same destination port on both endpoints (e.g. VXLAN, GENEVE). Note that
@@ -476,28 +500,27 @@ EXPORT_SYMBOL(udpv6_encap_enable);
  * different destination ports on endpoints, in this case we won't be able to
  * trace ICMP messages back to them.
  *
+ * If this doesn't match any socket, probe tunnels with arbitrary destination
+ * ports (e.g. FoU, GUE): there, the receiving socket is useless, as the port
+ * we've sent packets to won't necessarily match the local destination port.
+ *
  * Then ask the tunnel implementation to match the error against a valid
  * association.
  *
- * Return the socket if we have a match.
+ * Return an error if we can't find a match, the socket if we need further
+ * processing, zero otherwise.
  */
 static struct sock *__udp6_lib_err_encap(struct net *net,
 					 const struct ipv6hdr *hdr, int offset,
 					 struct udphdr *uh,
 					 struct udp_table *udptable,
-					 struct sk_buff *skb)
+					 struct sk_buff *skb,
+					 struct inet6_skb_parm *opt,
+					 u8 type, u8 code, __be32 info)
 {
-	int (*lookup)(struct sock *sk, struct sk_buff *skb);
 	int network_offset, transport_offset;
-	struct udp_sock *up;
 	struct sock *sk;
 
-	sk = __udp6_lib_lookup(net, &hdr->daddr, uh->source,
-			       &hdr->saddr, uh->dest,
-			       inet6_iif(skb), 0, udptable, skb);
-	if (!sk)
-		return NULL;
-
 	network_offset = skb_network_offset(skb);
 	transport_offset = skb_transport_offset(skb);
 
@@ -507,13 +530,26 @@ static struct sock *__udp6_lib_err_encap(struct net *net,
 	/* Transport header needs to point to the UDP header */
 	skb_set_transport_header(skb, offset);
 
-	up = udp_sk(sk);
-	lookup = READ_ONCE(up->encap_err_lookup);
-	if (!lookup || lookup(sk, skb))
-		sk = NULL;
+	sk = __udp6_lib_lookup(net, &hdr->daddr, uh->source,
+			       &hdr->saddr, uh->dest,
+			       inet6_iif(skb), 0, udptable, skb);
+	if (sk) {
+		int (*lookup)(struct sock *sk, struct sk_buff *skb);
+		struct udp_sock *up = udp_sk(sk);
+
+		lookup = READ_ONCE(up->encap_err_lookup);
+		if (!lookup || lookup(sk, skb))
+			sk = NULL;
+	}
+
+	if (!sk) {
+		sk = ERR_PTR(__udp6_lib_err_encap_no_sk(skb, opt, type, code,
+							offset, info));
+	}
 
 	skb_set_transport_header(skb, transport_offset);
 	skb_set_network_header(skb, network_offset);
+
 	return sk;
 }
 
@@ -536,16 +572,21 @@ int __udp6_lib_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
 			       inet6_iif(skb), inet6_sdif(skb), udptable, skb);
 	if (!sk) {
 		/* No socket for error: try tunnels before discarding */
+		sk = ERR_PTR(-ENOENT);
 		if (static_branch_unlikely(&udpv6_encap_needed_key)) {
 			sk = __udp6_lib_err_encap(net, hdr, offset, uh,
-						  udptable, skb);
+						  udptable, skb,
+						  opt, type, code, info);
+			if (!sk)
+				return 0;
 		}
 
-		if (!sk) {
+		if (IS_ERR(sk)) {
 			__ICMP6_INC_STATS(net, __in6_dev_get(skb->dev),
 					  ICMP6_MIB_INERRORS);
-			return -ENOENT;
+			return PTR_ERR(sk);
 		}
+
 		tunnel = true;
 	}
 
-- 
2.19.1

^ permalink raw reply related

* [PATCH net-next v2 08/11] net: Convert protocol error handlers from void to int
From: Stefano Brivio @ 2018-11-08 11:19 UTC (permalink / raw)
  To: David S. Miller
  Cc: Sabrina Dubroca, Xin Long, Stephen Hemminger, Jiri Benc,
	David Ahern, netdev
In-Reply-To: <cover.1541675666.git.sbrivio@redhat.com>

We'll need this to handle ICMP errors for tunnels without a sending socket
(i.e. FoU and GUE). There, we might have to look up different types of IP
tunnels, registered as network protocols, before we get a match, so we
want this for the error handlers of IPPROTO_IPIP and IPPROTO_IPV6 in both
inet_protos and inet6_protos. These error codes will be used in the next
patch.

For consistency, return sensible error codes in protocol error handlers
whenever handlers can't handle errors because, even if valid, they don't
match a protocol or any of its states.

This has no effect on existing error handling paths.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
---
v2: no changes

 include/net/icmp.h        |  2 +-
 include/net/protocol.h    |  9 ++++++--
 include/net/sctp/sctp.h   |  2 +-
 include/net/tcp.h         |  2 +-
 include/net/udp.h         |  2 +-
 net/dccp/ipv4.c           | 13 +++++++----
 net/dccp/ipv6.c           | 13 +++++++----
 net/ipv4/gre_demux.c      |  9 ++++++--
 net/ipv4/icmp.c           |  6 +++--
 net/ipv4/ip_gre.c         | 48 ++++++++++++++++++++-------------------
 net/ipv4/ipip.c           | 14 ++++++------
 net/ipv4/tcp_ipv4.c       | 22 ++++++++++--------
 net/ipv4/tunnel4.c        | 18 ++++++++++-----
 net/ipv4/udp.c            | 10 ++++----
 net/ipv4/udp_impl.h       |  2 +-
 net/ipv4/udplite.c        |  4 ++--
 net/ipv4/xfrm4_protocol.c | 18 ++++++++++-----
 net/ipv6/icmp.c           |  4 +++-
 net/ipv6/ip6_gre.c        | 18 ++++++++-------
 net/ipv6/tcp_ipv6.c       | 13 +++++++----
 net/ipv6/tunnel6.c        | 12 ++++++----
 net/ipv6/udp.c            | 18 +++++++--------
 net/ipv6/udp_impl.h       |  4 ++--
 net/ipv6/udplite.c        |  5 ++--
 net/ipv6/xfrm6_protocol.c | 18 ++++++++++-----
 net/sctp/input.c          |  5 ++--
 net/sctp/ipv6.c           |  7 ++++--
 27 files changed, 177 insertions(+), 121 deletions(-)

diff --git a/include/net/icmp.h b/include/net/icmp.h
index 3ef2743a8eec..6ac3a5bd0117 100644
--- a/include/net/icmp.h
+++ b/include/net/icmp.h
@@ -41,7 +41,7 @@ struct net;
 
 void icmp_send(struct sk_buff *skb_in, int type, int code, __be32 info);
 int icmp_rcv(struct sk_buff *skb);
-void icmp_err(struct sk_buff *skb, u32 info);
+int icmp_err(struct sk_buff *skb, u32 info);
 int icmp_init(void);
 void icmp_out_count(struct net *net, unsigned char type);
 
diff --git a/include/net/protocol.h b/include/net/protocol.h
index 4fc75f7ae23b..92b3eaad6088 100644
--- a/include/net/protocol.h
+++ b/include/net/protocol.h
@@ -42,7 +42,10 @@ struct net_protocol {
 	int			(*early_demux)(struct sk_buff *skb);
 	int			(*early_demux_handler)(struct sk_buff *skb);
 	int			(*handler)(struct sk_buff *skb);
-	void			(*err_handler)(struct sk_buff *skb, u32 info);
+
+	/* This returns an error if we weren't able to handle the error. */
+	int			(*err_handler)(struct sk_buff *skb, u32 info);
+
 	unsigned int		no_policy:1,
 				netns_ok:1,
 				/* does the protocol do more stringent
@@ -58,10 +61,12 @@ struct inet6_protocol {
 	void    (*early_demux_handler)(struct sk_buff *skb);
 	int	(*handler)(struct sk_buff *skb);
 
-	void	(*err_handler)(struct sk_buff *skb,
+	/* This returns an error if we weren't able to handle the error. */
+	int	(*err_handler)(struct sk_buff *skb,
 			       struct inet6_skb_parm *opt,
 			       u8 type, u8 code, int offset,
 			       __be32 info);
+
 	unsigned int	flags;	/* INET6_PROTO_xxx */
 };
 
diff --git a/include/net/sctp/sctp.h b/include/net/sctp/sctp.h
index 8c2caa370e0f..9a3b48a35e90 100644
--- a/include/net/sctp/sctp.h
+++ b/include/net/sctp/sctp.h
@@ -151,7 +151,7 @@ int sctp_primitive_RECONF(struct net *net, struct sctp_association *asoc,
  * sctp/input.c
  */
 int sctp_rcv(struct sk_buff *skb);
-void sctp_v4_err(struct sk_buff *skb, u32 info);
+int sctp_v4_err(struct sk_buff *skb, u32 info);
 void sctp_hash_endpoint(struct sctp_endpoint *);
 void sctp_unhash_endpoint(struct sctp_endpoint *);
 struct sock *sctp_err_lookup(struct net *net, int family, struct sk_buff *,
diff --git a/include/net/tcp.h b/include/net/tcp.h
index a18914d20486..4743836bed2e 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -313,7 +313,7 @@ extern struct proto tcp_prot;
 
 void tcp_tasklet_init(void);
 
-void tcp_v4_err(struct sk_buff *skb, u32);
+int tcp_v4_err(struct sk_buff *skb, u32);
 
 void tcp_shutdown(struct sock *sk, int how);
 
diff --git a/include/net/udp.h b/include/net/udp.h
index eccca2325ee6..fd6d948755c8 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -283,7 +283,7 @@ bool udp_sk_rx_dst_set(struct sock *sk, struct dst_entry *dst);
 int udp_get_port(struct sock *sk, unsigned short snum,
 		 int (*saddr_cmp)(const struct sock *,
 				  const struct sock *));
-void udp_err(struct sk_buff *, u32);
+int udp_err(struct sk_buff *, u32);
 int udp_abort(struct sock *sk, int err);
 int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len);
 int udp_push_pending_frames(struct sock *sk);
diff --git a/net/dccp/ipv4.c b/net/dccp/ipv4.c
index 8e08cea6f178..26a21d97b6b0 100644
--- a/net/dccp/ipv4.c
+++ b/net/dccp/ipv4.c
@@ -231,7 +231,7 @@ EXPORT_SYMBOL(dccp_req_err);
  * check at all. A more general error queue to queue errors for later handling
  * is probably better.
  */
-static void dccp_v4_err(struct sk_buff *skb, u32 info)
+static int dccp_v4_err(struct sk_buff *skb, u32 info)
 {
 	const struct iphdr *iph = (struct iphdr *)skb->data;
 	const u8 offset = iph->ihl << 2;
@@ -259,16 +259,18 @@ static void dccp_v4_err(struct sk_buff *skb, u32 info)
 				       inet_iif(skb), 0);
 	if (!sk) {
 		__ICMP_INC_STATS(net, ICMP_MIB_INERRORS);
-		return;
+		return -ENOENT;
 	}
 
 	if (sk->sk_state == DCCP_TIME_WAIT) {
 		inet_twsk_put(inet_twsk(sk));
-		return;
+		return 0;
 	}
 	seq = dccp_hdr_seq(dh);
-	if (sk->sk_state == DCCP_NEW_SYN_RECV)
-		return dccp_req_err(sk, seq);
+	if (sk->sk_state == DCCP_NEW_SYN_RECV) {
+		dccp_req_err(sk, seq);
+		return 0;
+	}
 
 	bh_lock_sock(sk);
 	/* If too many ICMPs get dropped on busy
@@ -357,6 +359,7 @@ static void dccp_v4_err(struct sk_buff *skb, u32 info)
 out:
 	bh_unlock_sock(sk);
 	sock_put(sk);
+	return 0;
 }
 
 static inline __sum16 dccp_v4_csum_finish(struct sk_buff *skb,
diff --git a/net/dccp/ipv6.c b/net/dccp/ipv6.c
index 6344f1b18a6a..d5740bad5b18 100644
--- a/net/dccp/ipv6.c
+++ b/net/dccp/ipv6.c
@@ -68,7 +68,7 @@ static inline __u64 dccp_v6_init_sequence(struct sk_buff *skb)
 
 }
 
-static void dccp_v6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
+static int dccp_v6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
 			u8 type, u8 code, int offset, __be32 info)
 {
 	const struct ipv6hdr *hdr = (const struct ipv6hdr *)skb->data;
@@ -96,16 +96,18 @@ static void dccp_v6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
 	if (!sk) {
 		__ICMP6_INC_STATS(net, __in6_dev_get(skb->dev),
 				  ICMP6_MIB_INERRORS);
-		return;
+		return -ENOENT;
 	}
 
 	if (sk->sk_state == DCCP_TIME_WAIT) {
 		inet_twsk_put(inet_twsk(sk));
-		return;
+		return 0;
 	}
 	seq = dccp_hdr_seq(dh);
-	if (sk->sk_state == DCCP_NEW_SYN_RECV)
-		return dccp_req_err(sk, seq);
+	if (sk->sk_state == DCCP_NEW_SYN_RECV) {
+		dccp_req_err(sk, seq);
+		return 0;
+	}
 
 	bh_lock_sock(sk);
 	if (sock_owned_by_user(sk))
@@ -183,6 +185,7 @@ static void dccp_v6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
 out:
 	bh_unlock_sock(sk);
 	sock_put(sk);
+	return 0;
 }
 
 
diff --git a/net/ipv4/gre_demux.c b/net/ipv4/gre_demux.c
index 7efe740c06eb..a4bf22ee3aed 100644
--- a/net/ipv4/gre_demux.c
+++ b/net/ipv4/gre_demux.c
@@ -151,20 +151,25 @@ static int gre_rcv(struct sk_buff *skb)
 	return NET_RX_DROP;
 }
 
-static void gre_err(struct sk_buff *skb, u32 info)
+static int gre_err(struct sk_buff *skb, u32 info)
 {
 	const struct gre_protocol *proto;
 	const struct iphdr *iph = (const struct iphdr *)skb->data;
 	u8 ver = skb->data[(iph->ihl<<2) + 1]&0x7f;
+	int err = 0;
 
 	if (ver >= GREPROTO_MAX)
-		return;
+		return -EINVAL;
 
 	rcu_read_lock();
 	proto = rcu_dereference(gre_proto[ver]);
 	if (proto && proto->err_handler)
 		proto->err_handler(skb, info);
+	else
+		err = -EPROTONOSUPPORT;
 	rcu_read_unlock();
+
+	return err;
 }
 
 static const struct net_protocol net_gre_protocol = {
diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
index d832beed6e3a..065997f414e6 100644
--- a/net/ipv4/icmp.c
+++ b/net/ipv4/icmp.c
@@ -1079,7 +1079,7 @@ int icmp_rcv(struct sk_buff *skb)
 	goto drop;
 }
 
-void icmp_err(struct sk_buff *skb, u32 info)
+int icmp_err(struct sk_buff *skb, u32 info)
 {
 	struct iphdr *iph = (struct iphdr *)skb->data;
 	int offset = iph->ihl<<2;
@@ -1094,13 +1094,15 @@ void icmp_err(struct sk_buff *skb, u32 info)
 	 */
 	if (icmph->type != ICMP_ECHOREPLY) {
 		ping_err(skb, offset, info);
-		return;
+		return 0;
 	}
 
 	if (type == ICMP_DEST_UNREACH && code == ICMP_FRAG_NEEDED)
 		ipv4_update_pmtu(skb, net, info, 0, IPPROTO_ICMP);
 	else if (type == ICMP_REDIRECT)
 		ipv4_redirect(skb, net, 0, IPPROTO_ICMP);
+
+	return 0;
 }
 
 /*
diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c
index 2c67af644e64..76a9a5f7a40e 100644
--- a/net/ipv4/ip_gre.c
+++ b/net/ipv4/ip_gre.c
@@ -121,8 +121,8 @@ static unsigned int ipgre_net_id __read_mostly;
 static unsigned int gre_tap_net_id __read_mostly;
 static unsigned int erspan_net_id __read_mostly;
 
-static void ipgre_err(struct sk_buff *skb, u32 info,
-		      const struct tnl_ptk_info *tpi)
+static int ipgre_err(struct sk_buff *skb, u32 info,
+		     const struct tnl_ptk_info *tpi)
 {
 
 	/* All the routers (except for Linux) return only
@@ -146,17 +146,32 @@ static void ipgre_err(struct sk_buff *skb, u32 info,
 	unsigned int data_len = 0;
 	struct ip_tunnel *t;
 
+	if (tpi->proto == htons(ETH_P_TEB))
+		itn = net_generic(net, gre_tap_net_id);
+	else if (tpi->proto == htons(ETH_P_ERSPAN) ||
+		 tpi->proto == htons(ETH_P_ERSPAN2))
+		itn = net_generic(net, erspan_net_id);
+	else
+		itn = net_generic(net, ipgre_net_id);
+
+	iph = (const struct iphdr *)(icmp_hdr(skb) + 1);
+	t = ip_tunnel_lookup(itn, skb->dev->ifindex, tpi->flags,
+			     iph->daddr, iph->saddr, tpi->key);
+
+	if (!t)
+		return -ENOENT;
+
 	switch (type) {
 	default:
 	case ICMP_PARAMETERPROB:
-		return;
+		return 0;
 
 	case ICMP_DEST_UNREACH:
 		switch (code) {
 		case ICMP_SR_FAILED:
 		case ICMP_PORT_UNREACH:
 			/* Impossible event. */
-			return;
+			return 0;
 		default:
 			/* All others are translated to HOST_UNREACH.
 			   rfc2003 contains "deep thoughts" about NET_UNREACH,
@@ -168,7 +183,7 @@ static void ipgre_err(struct sk_buff *skb, u32 info,
 
 	case ICMP_TIME_EXCEEDED:
 		if (code != ICMP_EXC_TTL)
-			return;
+			return 0;
 		data_len = icmp_hdr(skb)->un.reserved[1] * 4; /* RFC 4884 4.1 */
 		break;
 
@@ -176,40 +191,27 @@ static void ipgre_err(struct sk_buff *skb, u32 info,
 		break;
 	}
 
-	if (tpi->proto == htons(ETH_P_TEB))
-		itn = net_generic(net, gre_tap_net_id);
-	else if (tpi->proto == htons(ETH_P_ERSPAN) ||
-		 tpi->proto == htons(ETH_P_ERSPAN2))
-		itn = net_generic(net, erspan_net_id);
-	else
-		itn = net_generic(net, ipgre_net_id);
-
-	iph = (const struct iphdr *)(icmp_hdr(skb) + 1);
-	t = ip_tunnel_lookup(itn, skb->dev->ifindex, tpi->flags,
-			     iph->daddr, iph->saddr, tpi->key);
-
-	if (!t)
-		return;
-
 #if IS_ENABLED(CONFIG_IPV6)
        if (tpi->proto == htons(ETH_P_IPV6) &&
            !ip6_err_gen_icmpv6_unreach(skb, iph->ihl * 4 + tpi->hdr_len,
 				       type, data_len))
-               return;
+               return 0;
 #endif
 
 	if (t->parms.iph.daddr == 0 ||
 	    ipv4_is_multicast(t->parms.iph.daddr))
-		return;
+		return 0;
 
 	if (t->parms.iph.ttl == 0 && type == ICMP_TIME_EXCEEDED)
-		return;
+		return 0;
 
 	if (time_before(jiffies, t->err_time + IPTUNNEL_ERR_TIMEO))
 		t->err_count++;
 	else
 		t->err_count = 1;
 	t->err_time = jiffies;
+
+	return 0;
 }
 
 static void gre_err(struct sk_buff *skb, u32 info)
diff --git a/net/ipv4/ipip.c b/net/ipv4/ipip.c
index e65287c27e3d..57c5dd283a2c 100644
--- a/net/ipv4/ipip.c
+++ b/net/ipv4/ipip.c
@@ -140,6 +140,13 @@ static int ipip_err(struct sk_buff *skb, u32 info)
 	struct ip_tunnel *t;
 	int err = 0;
 
+	t = ip_tunnel_lookup(itn, skb->dev->ifindex, TUNNEL_NO_KEY,
+			     iph->daddr, iph->saddr, 0);
+	if (!t) {
+		err = -ENOENT;
+		goto out;
+	}
+
 	switch (type) {
 	case ICMP_DEST_UNREACH:
 		switch (code) {
@@ -167,13 +174,6 @@ static int ipip_err(struct sk_buff *skb, u32 info)
 		goto out;
 	}
 
-	t = ip_tunnel_lookup(itn, skb->dev->ifindex, TUNNEL_NO_KEY,
-			     iph->daddr, iph->saddr, 0);
-	if (!t) {
-		err = -ENOENT;
-		goto out;
-	}
-
 	if (type == ICMP_DEST_UNREACH && code == ICMP_FRAG_NEEDED) {
 		ipv4_update_pmtu(skb, net, info, t->parms.link, iph->protocol);
 		goto out;
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index de47038afdf0..a336787d75e5 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -423,7 +423,7 @@ EXPORT_SYMBOL(tcp_req_err);
  *
  */
 
-void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
+int tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
 {
 	const struct iphdr *iph = (const struct iphdr *)icmp_skb->data;
 	struct tcphdr *th = (struct tcphdr *)(icmp_skb->data + (iph->ihl << 2));
@@ -446,20 +446,21 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
 				       inet_iif(icmp_skb), 0);
 	if (!sk) {
 		__ICMP_INC_STATS(net, ICMP_MIB_INERRORS);
-		return;
+		return -ENOENT;
 	}
 	if (sk->sk_state == TCP_TIME_WAIT) {
 		inet_twsk_put(inet_twsk(sk));
-		return;
+		return 0;
 	}
 	seq = ntohl(th->seq);
-	if (sk->sk_state == TCP_NEW_SYN_RECV)
-		return tcp_req_err(sk, seq,
-				  type == ICMP_PARAMETERPROB ||
-				  type == ICMP_TIME_EXCEEDED ||
-				  (type == ICMP_DEST_UNREACH &&
-				   (code == ICMP_NET_UNREACH ||
-				    code == ICMP_HOST_UNREACH)));
+	if (sk->sk_state == TCP_NEW_SYN_RECV) {
+		tcp_req_err(sk, seq, type == ICMP_PARAMETERPROB ||
+				     type == ICMP_TIME_EXCEEDED ||
+				     (type == ICMP_DEST_UNREACH &&
+				      (code == ICMP_NET_UNREACH ||
+				       code == ICMP_HOST_UNREACH)));
+		return 0;
+	}
 
 	bh_lock_sock(sk);
 	/* If too many ICMPs get dropped on busy
@@ -613,6 +614,7 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
 out:
 	bh_unlock_sock(sk);
 	sock_put(sk);
+	return 0;
 }
 
 void __tcp_v4_send_check(struct sk_buff *skb, __be32 saddr, __be32 daddr)
diff --git a/net/ipv4/tunnel4.c b/net/ipv4/tunnel4.c
index c0630013c1ae..33bf8e9c8663 100644
--- a/net/ipv4/tunnel4.c
+++ b/net/ipv4/tunnel4.c
@@ -149,34 +149,40 @@ static int tunnelmpls4_rcv(struct sk_buff *skb)
 }
 #endif
 
-static void tunnel4_err(struct sk_buff *skb, u32 info)
+static int tunnel4_err(struct sk_buff *skb, u32 info)
 {
 	struct xfrm_tunnel *handler;
 
 	for_each_tunnel_rcu(tunnel4_handlers, handler)
 		if (!handler->err_handler(skb, info))
-			break;
+			return 0;
+
+	return -ENOENT;
 }
 
 #if IS_ENABLED(CONFIG_IPV6)
-static void tunnel64_err(struct sk_buff *skb, u32 info)
+static int tunnel64_err(struct sk_buff *skb, u32 info)
 {
 	struct xfrm_tunnel *handler;
 
 	for_each_tunnel_rcu(tunnel64_handlers, handler)
 		if (!handler->err_handler(skb, info))
-			break;
+			return 0;
+
+	return -ENOENT;
 }
 #endif
 
 #if IS_ENABLED(CONFIG_MPLS)
-static void tunnelmpls4_err(struct sk_buff *skb, u32 info)
+static int tunnelmpls4_err(struct sk_buff *skb, u32 info)
 {
 	struct xfrm_tunnel *handler;
 
 	for_each_tunnel_rcu(tunnelmpls4_handlers, handler)
 		if (!handler->err_handler(skb, info))
-			break;
+			return 0;
+
+	return -ENOENT;
 }
 #endif
 
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index ce759b61f6cd..a505ee5eb92c 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -650,7 +650,7 @@ static struct sock *__udp4_lib_err_encap(struct net *net,
  * to find the appropriate port.
  */
 
-void __udp4_lib_err(struct sk_buff *skb, u32 info, struct udp_table *udptable)
+int __udp4_lib_err(struct sk_buff *skb, u32 info, struct udp_table *udptable)
 {
 	struct inet_sock *inet;
 	const struct iphdr *iph = (const struct iphdr *)skb->data;
@@ -673,7 +673,7 @@ void __udp4_lib_err(struct sk_buff *skb, u32 info, struct udp_table *udptable)
 
 		if (!sk) {
 			__ICMP_INC_STATS(net, ICMP_MIB_INERRORS);
-			return;
+			return -ENOENT;
 		}
 		tunnel = true;
 	}
@@ -731,12 +731,12 @@ void __udp4_lib_err(struct sk_buff *skb, u32 info, struct udp_table *udptable)
 	sk->sk_err = err;
 	sk->sk_error_report(sk);
 out:
-	return;
+	return 0;
 }
 
-void udp_err(struct sk_buff *skb, u32 info)
+int udp_err(struct sk_buff *skb, u32 info)
 {
-	__udp4_lib_err(skb, info, &udp_table);
+	return __udp4_lib_err(skb, info, &udp_table);
 }
 
 /*
diff --git a/net/ipv4/udp_impl.h b/net/ipv4/udp_impl.h
index e7d18b140287..322672655419 100644
--- a/net/ipv4/udp_impl.h
+++ b/net/ipv4/udp_impl.h
@@ -7,7 +7,7 @@
 #include <net/inet_common.h>
 
 int __udp4_lib_rcv(struct sk_buff *, struct udp_table *, int);
-void __udp4_lib_err(struct sk_buff *, u32, struct udp_table *);
+int __udp4_lib_err(struct sk_buff *, u32, struct udp_table *);
 
 int udp_v4_get_port(struct sock *sk, unsigned short snum);
 
diff --git a/net/ipv4/udplite.c b/net/ipv4/udplite.c
index 8545457752fb..39c7f17d916f 100644
--- a/net/ipv4/udplite.c
+++ b/net/ipv4/udplite.c
@@ -25,9 +25,9 @@ static int udplite_rcv(struct sk_buff *skb)
 	return __udp4_lib_rcv(skb, &udplite_table, IPPROTO_UDPLITE);
 }
 
-static void udplite_err(struct sk_buff *skb, u32 info)
+static int udplite_err(struct sk_buff *skb, u32 info)
 {
-	__udp4_lib_err(skb, info, &udplite_table);
+	return __udp4_lib_err(skb, info, &udplite_table);
 }
 
 static const struct net_protocol udplite_protocol = {
diff --git a/net/ipv4/xfrm4_protocol.c b/net/ipv4/xfrm4_protocol.c
index 8dd0e6ab8606..35c54865dc42 100644
--- a/net/ipv4/xfrm4_protocol.c
+++ b/net/ipv4/xfrm4_protocol.c
@@ -106,13 +106,15 @@ static int xfrm4_esp_rcv(struct sk_buff *skb)
 	return 0;
 }
 
-static void xfrm4_esp_err(struct sk_buff *skb, u32 info)
+static int xfrm4_esp_err(struct sk_buff *skb, u32 info)
 {
 	struct xfrm4_protocol *handler;
 
 	for_each_protocol_rcu(esp4_handlers, handler)
 		if (!handler->err_handler(skb, info))
-			break;
+			return 0;
+
+	return -ENOENT;
 }
 
 static int xfrm4_ah_rcv(struct sk_buff *skb)
@@ -132,13 +134,15 @@ static int xfrm4_ah_rcv(struct sk_buff *skb)
 	return 0;
 }
 
-static void xfrm4_ah_err(struct sk_buff *skb, u32 info)
+static int xfrm4_ah_err(struct sk_buff *skb, u32 info)
 {
 	struct xfrm4_protocol *handler;
 
 	for_each_protocol_rcu(ah4_handlers, handler)
 		if (!handler->err_handler(skb, info))
-			break;
+			return 0;
+
+	return -ENOENT;
 }
 
 static int xfrm4_ipcomp_rcv(struct sk_buff *skb)
@@ -158,13 +162,15 @@ static int xfrm4_ipcomp_rcv(struct sk_buff *skb)
 	return 0;
 }
 
-static void xfrm4_ipcomp_err(struct sk_buff *skb, u32 info)
+static int xfrm4_ipcomp_err(struct sk_buff *skb, u32 info)
 {
 	struct xfrm4_protocol *handler;
 
 	for_each_protocol_rcu(ipcomp4_handlers, handler)
 		if (!handler->err_handler(skb, info))
-			break;
+			return 0;
+
+	return -ENOENT;
 }
 
 static const struct net_protocol esp4_protocol = {
diff --git a/net/ipv6/icmp.c b/net/ipv6/icmp.c
index c9c53ade55c3..5d7aa2c2770c 100644
--- a/net/ipv6/icmp.c
+++ b/net/ipv6/icmp.c
@@ -84,7 +84,7 @@ static inline struct sock *icmpv6_sk(struct net *net)
 	return net->ipv6.icmp_sk[smp_processor_id()];
 }
 
-static void icmpv6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
+static int icmpv6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
 		       u8 type, u8 code, int offset, __be32 info)
 {
 	/* icmpv6_notify checks 8 bytes can be pulled, icmp6hdr is 8 bytes */
@@ -100,6 +100,8 @@ static void icmpv6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
 	if (!(type & ICMPV6_INFOMSG_MASK))
 		if (icmp6->icmp6_type == ICMPV6_ECHO_REQUEST)
 			ping_err(skb, offset, ntohl(info));
+
+	return 0;
 }
 
 static int icmpv6_rcv(struct sk_buff *skb);
diff --git a/net/ipv6/ip6_gre.c b/net/ipv6/ip6_gre.c
index 515adbdba1d2..81b69bcee714 100644
--- a/net/ipv6/ip6_gre.c
+++ b/net/ipv6/ip6_gre.c
@@ -423,7 +423,7 @@ static void ip6gre_tunnel_uninit(struct net_device *dev)
 }
 
 
-static void ip6gre_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
+static int ip6gre_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
 		       u8 type, u8 code, int offset, __be32 info)
 {
 	struct net *net = dev_net(skb->dev);
@@ -433,13 +433,13 @@ static void ip6gre_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
 
 	if (gre_parse_header(skb, &tpi, NULL, htons(ETH_P_IPV6),
 			     offset) < 0)
-		return;
+		return -EINVAL;
 
 	ipv6h = (const struct ipv6hdr *)skb->data;
 	t = ip6gre_tunnel_lookup(skb->dev, &ipv6h->daddr, &ipv6h->saddr,
 				 tpi.key, tpi.proto);
 	if (!t)
-		return;
+		return -ENOENT;
 
 	switch (type) {
 		struct ipv6_tlv_tnl_enc_lim *tel;
@@ -449,14 +449,14 @@ static void ip6gre_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
 				    t->parms.name);
 		if (code != ICMPV6_PORT_UNREACH)
 			break;
-		return;
+		return 0;
 	case ICMPV6_TIME_EXCEED:
 		if (code == ICMPV6_EXC_HOPLIMIT) {
 			net_dbg_ratelimited("%s: Too small hop limit or routing loop in tunnel!\n",
 					    t->parms.name);
 			break;
 		}
-		return;
+		return 0;
 	case ICMPV6_PARAMPROB:
 		teli = 0;
 		if (code == ICMPV6_HDR_FIELD)
@@ -472,14 +472,14 @@ static void ip6gre_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
 			net_dbg_ratelimited("%s: Recipient unable to parse tunneled packet!\n",
 					    t->parms.name);
 		}
-		return;
+		return 0;
 	case ICMPV6_PKT_TOOBIG:
 		ip6_update_pmtu(skb, net, info, 0, 0, sock_net_uid(net, NULL));
-		return;
+		return 0;
 	case NDISC_REDIRECT:
 		ip6_redirect(skb, net, skb->dev->ifindex, 0,
 			     sock_net_uid(net, NULL));
-		return;
+		return 0;
 	}
 
 	if (time_before(jiffies, t->err_time + IP6TUNNEL_ERR_TIMEO))
@@ -487,6 +487,8 @@ static void ip6gre_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
 	else
 		t->err_count = 1;
 	t->err_time = jiffies;
+
+	return 0;
 }
 
 static int ip6gre_rcv(struct sk_buff *skb, const struct tnl_ptk_info *tpi)
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 03e6b7a2bc53..a3f559162521 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -349,7 +349,7 @@ static void tcp_v6_mtu_reduced(struct sock *sk)
 	}
 }
 
-static void tcp_v6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
+static int tcp_v6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
 		u8 type, u8 code, int offset, __be32 info)
 {
 	const struct ipv6hdr *hdr = (const struct ipv6hdr *)skb->data;
@@ -371,17 +371,19 @@ static void tcp_v6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
 	if (!sk) {
 		__ICMP6_INC_STATS(net, __in6_dev_get(skb->dev),
 				  ICMP6_MIB_INERRORS);
-		return;
+		return -ENOENT;
 	}
 
 	if (sk->sk_state == TCP_TIME_WAIT) {
 		inet_twsk_put(inet_twsk(sk));
-		return;
+		return 0;
 	}
 	seq = ntohl(th->seq);
 	fatal = icmpv6_err_convert(type, code, &err);
-	if (sk->sk_state == TCP_NEW_SYN_RECV)
-		return tcp_req_err(sk, seq, fatal);
+	if (sk->sk_state == TCP_NEW_SYN_RECV) {
+		tcp_req_err(sk, seq, fatal);
+		return 0;
+	}
 
 	bh_lock_sock(sk);
 	if (sock_owned_by_user(sk) && type != ICMPV6_PKT_TOOBIG)
@@ -467,6 +469,7 @@ static void tcp_v6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
 out:
 	bh_unlock_sock(sk);
 	sock_put(sk);
+	return 0;
 }
 
 
diff --git a/net/ipv6/tunnel6.c b/net/ipv6/tunnel6.c
index dae25cad05cd..1991dede7367 100644
--- a/net/ipv6/tunnel6.c
+++ b/net/ipv6/tunnel6.c
@@ -134,24 +134,28 @@ static int tunnel46_rcv(struct sk_buff *skb)
 	return 0;
 }
 
-static void tunnel6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
+static int tunnel6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
 			u8 type, u8 code, int offset, __be32 info)
 {
 	struct xfrm6_tunnel *handler;
 
 	for_each_tunnel_rcu(tunnel6_handlers, handler)
 		if (!handler->err_handler(skb, opt, type, code, offset, info))
-			break;
+			return 0;
+
+	return -ENOENT;
 }
 
-static void tunnel46_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
+static int tunnel46_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
 			 u8 type, u8 code, int offset, __be32 info)
 {
 	struct xfrm6_tunnel *handler;
 
 	for_each_tunnel_rcu(tunnel46_handlers, handler)
 		if (!handler->err_handler(skb, opt, type, code, offset, info))
-			break;
+			return 0;
+
+	return -ENOENT;
 }
 
 static const struct inet6_protocol tunnel6_protocol = {
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 1216c920f945..61316ec48b51 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -517,9 +517,9 @@ static struct sock *__udp6_lib_err_encap(struct net *net,
 	return sk;
 }
 
-void __udp6_lib_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
-		    u8 type, u8 code, int offset, __be32 info,
-		    struct udp_table *udptable)
+int __udp6_lib_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
+		   u8 type, u8 code, int offset, __be32 info,
+		   struct udp_table *udptable)
 {
 	struct ipv6_pinfo *np;
 	const struct ipv6hdr *hdr = (const struct ipv6hdr *)skb->data;
@@ -544,7 +544,7 @@ void __udp6_lib_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
 		if (!sk) {
 			__ICMP6_INC_STATS(net, __in6_dev_get(skb->dev),
 					  ICMP6_MIB_INERRORS);
-			return;
+			return -ENOENT;
 		}
 		tunnel = true;
 	}
@@ -583,7 +583,7 @@ void __udp6_lib_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
 	sk->sk_err = err;
 	sk->sk_error_report(sk);
 out:
-	return;
+	return 0;
 }
 
 static int __udpv6_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
@@ -614,11 +614,11 @@ static int __udpv6_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
 	return 0;
 }
 
-static __inline__ void udpv6_err(struct sk_buff *skb,
-				 struct inet6_skb_parm *opt, u8 type,
-				 u8 code, int offset, __be32 info)
+static __inline__ int udpv6_err(struct sk_buff *skb,
+				struct inet6_skb_parm *opt, u8 type,
+				u8 code, int offset, __be32 info)
 {
-	__udp6_lib_err(skb, opt, type, code, offset, info, &udp_table);
+	return __udp6_lib_err(skb, opt, type, code, offset, info, &udp_table);
 }
 
 static int udpv6_queue_rcv_one_skb(struct sock *sk, struct sk_buff *skb)
diff --git a/net/ipv6/udp_impl.h b/net/ipv6/udp_impl.h
index 7903e21c178b..5730e6503cb4 100644
--- a/net/ipv6/udp_impl.h
+++ b/net/ipv6/udp_impl.h
@@ -9,8 +9,8 @@
 #include <net/transp_v6.h>
 
 int __udp6_lib_rcv(struct sk_buff *, struct udp_table *, int);
-void __udp6_lib_err(struct sk_buff *, struct inet6_skb_parm *, u8, u8, int,
-		    __be32, struct udp_table *);
+int __udp6_lib_err(struct sk_buff *, struct inet6_skb_parm *, u8, u8, int,
+		   __be32, struct udp_table *);
 
 int udp_v6_get_port(struct sock *sk, unsigned short snum);
 
diff --git a/net/ipv6/udplite.c b/net/ipv6/udplite.c
index 5000ad6878e6..a125aebc29e5 100644
--- a/net/ipv6/udplite.c
+++ b/net/ipv6/udplite.c
@@ -20,11 +20,12 @@ static int udplitev6_rcv(struct sk_buff *skb)
 	return __udp6_lib_rcv(skb, &udplite_table, IPPROTO_UDPLITE);
 }
 
-static void udplitev6_err(struct sk_buff *skb,
+static int udplitev6_err(struct sk_buff *skb,
 			  struct inet6_skb_parm *opt,
 			  u8 type, u8 code, int offset, __be32 info)
 {
-	__udp6_lib_err(skb, opt, type, code, offset, info, &udplite_table);
+	return __udp6_lib_err(skb, opt, type, code, offset, info,
+			      &udplite_table);
 }
 
 static const struct inet6_protocol udplitev6_protocol = {
diff --git a/net/ipv6/xfrm6_protocol.c b/net/ipv6/xfrm6_protocol.c
index b2dc8ce49378..cc979b702c89 100644
--- a/net/ipv6/xfrm6_protocol.c
+++ b/net/ipv6/xfrm6_protocol.c
@@ -80,14 +80,16 @@ static int xfrm6_esp_rcv(struct sk_buff *skb)
 	return 0;
 }
 
-static void xfrm6_esp_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
+static int xfrm6_esp_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
 			  u8 type, u8 code, int offset, __be32 info)
 {
 	struct xfrm6_protocol *handler;
 
 	for_each_protocol_rcu(esp6_handlers, handler)
 		if (!handler->err_handler(skb, opt, type, code, offset, info))
-			break;
+			return 0;
+
+	return -ENOENT;
 }
 
 static int xfrm6_ah_rcv(struct sk_buff *skb)
@@ -107,14 +109,16 @@ static int xfrm6_ah_rcv(struct sk_buff *skb)
 	return 0;
 }
 
-static void xfrm6_ah_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
+static int xfrm6_ah_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
 			 u8 type, u8 code, int offset, __be32 info)
 {
 	struct xfrm6_protocol *handler;
 
 	for_each_protocol_rcu(ah6_handlers, handler)
 		if (!handler->err_handler(skb, opt, type, code, offset, info))
-			break;
+			return 0;
+
+	return -ENOENT;
 }
 
 static int xfrm6_ipcomp_rcv(struct sk_buff *skb)
@@ -134,14 +138,16 @@ static int xfrm6_ipcomp_rcv(struct sk_buff *skb)
 	return 0;
 }
 
-static void xfrm6_ipcomp_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
+static int xfrm6_ipcomp_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
 			     u8 type, u8 code, int offset, __be32 info)
 {
 	struct xfrm6_protocol *handler;
 
 	for_each_protocol_rcu(ipcomp6_handlers, handler)
 		if (!handler->err_handler(skb, opt, type, code, offset, info))
-			break;
+			return 0;
+
+	return -ENOENT;
 }
 
 static const struct inet6_protocol esp6_protocol = {
diff --git a/net/sctp/input.c b/net/sctp/input.c
index 5c36a99882ed..7ab08a5b36dc 100644
--- a/net/sctp/input.c
+++ b/net/sctp/input.c
@@ -574,7 +574,7 @@ void sctp_err_finish(struct sock *sk, struct sctp_transport *t)
  * is probably better.
  *
  */
-void sctp_v4_err(struct sk_buff *skb, __u32 info)
+int sctp_v4_err(struct sk_buff *skb, __u32 info)
 {
 	const struct iphdr *iph = (const struct iphdr *)skb->data;
 	const int ihlen = iph->ihl * 4;
@@ -599,7 +599,7 @@ void sctp_v4_err(struct sk_buff *skb, __u32 info)
 	skb->transport_header = savesctp;
 	if (!sk) {
 		__ICMP_INC_STATS(net, ICMP_MIB_INERRORS);
-		return;
+		return -ENOENT;
 	}
 	/* Warning:  The sock lock is held.  Remember to call
 	 * sctp_err_finish!
@@ -653,6 +653,7 @@ void sctp_v4_err(struct sk_buff *skb, __u32 info)
 
 out_unlock:
 	sctp_err_finish(sk, transport);
+	return 0;
 }
 
 /*
diff --git a/net/sctp/ipv6.c b/net/sctp/ipv6.c
index fc6c5e4bffa5..6e27c62646e9 100644
--- a/net/sctp/ipv6.c
+++ b/net/sctp/ipv6.c
@@ -138,7 +138,7 @@ static struct notifier_block sctp_inet6addr_notifier = {
 };
 
 /* ICMP error handler. */
-static void sctp_v6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
+static int sctp_v6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
 			u8 type, u8 code, int offset, __be32 info)
 {
 	struct inet6_dev *idev;
@@ -147,7 +147,7 @@ static void sctp_v6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
 	struct sctp_transport *transport;
 	struct ipv6_pinfo *np;
 	__u16 saveip, savesctp;
-	int err;
+	int err, ret = 0;
 	struct net *net = dev_net(skb->dev);
 
 	idev = in6_dev_get(skb->dev);
@@ -163,6 +163,7 @@ static void sctp_v6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
 	skb->transport_header = savesctp;
 	if (!sk) {
 		__ICMP6_INC_STATS(net, idev, ICMP6_MIB_INERRORS);
+		ret = -ENOENT;
 		goto out;
 	}
 
@@ -202,6 +203,8 @@ static void sctp_v6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
 out:
 	if (likely(idev != NULL))
 		in6_dev_put(idev);
+
+	return ret;
 }
 
 static int sctp_v6_xmit(struct sk_buff *skb, struct sctp_transport *transport)
-- 
2.19.1

^ permalink raw reply related

* [PATCH net-next v2 07/11] selftests: pmtu: Introduce tests for IPv4/IPv6 over GENEVE over IPv4/IPv6
From: Stefano Brivio @ 2018-11-08 11:19 UTC (permalink / raw)
  To: David S. Miller
  Cc: Sabrina Dubroca, Xin Long, Stephen Hemminger, Jiri Benc,
	David Ahern, netdev
In-Reply-To: <cover.1541675666.git.sbrivio@redhat.com>

Use a router between endpoints, implemented via namespaces, set a low MTU
between router and destination endpoint, exceed it and check PMTU value in
route exceptions.

v2:
- Introduce IPv4 tests right away, if iproute2 doesn't support the 'df'
  link option they will be skipped (David Ahern)

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
---
 tools/testing/selftests/net/pmtu.sh | 117 ++++++++++++++++++++--------
 1 file changed, 86 insertions(+), 31 deletions(-)

diff --git a/tools/testing/selftests/net/pmtu.sh b/tools/testing/selftests/net/pmtu.sh
index 33cba295ad45..14d20782ccb7 100755
--- a/tools/testing/selftests/net/pmtu.sh
+++ b/tools/testing/selftests/net/pmtu.sh
@@ -43,6 +43,22 @@
 # - pmtu_ipv6_vxlan6_exception
 #	Same as pmtu_ipv4_vxlan6_exception, but send IPv6 packets from A to B
 #
+# - pmtu_ipv4_geneve4_exception
+#	Same as pmtu_ipv4_vxlan4_exception, but using a GENEVE tunnel instead of
+#	VXLAN
+#
+# - pmtu_ipv6_geneve4_exception
+#	Same as pmtu_ipv6_vxlan4_exception, but using a GENEVE tunnel instead of
+#	VXLAN
+#
+# - pmtu_ipv4_geneve6_exception
+#	Same as pmtu_ipv4_vxlan6_exception, but using a GENEVE tunnel instead of
+#	VXLAN
+#
+# - pmtu_ipv6_geneve6_exception
+#	Same as pmtu_ipv6_vxlan6_exception, but using a GENEVE tunnel instead of
+#	VXLAN
+#
 # - pmtu_vti4_exception
 #	Set up vti tunnel on top of veth, with xfrm states and policies, in two
 #	namespaces with matching endpoints. Check that route exception is not
@@ -93,6 +109,10 @@ tests="
 	pmtu_ipv6_vxlan4_exception	IPv6 over vxlan4: PMTU exceptions
 	pmtu_ipv4_vxlan6_exception	IPv4 over vxlan6: PMTU exceptions
 	pmtu_ipv6_vxlan6_exception	IPv6 over vxlan6: PMTU exceptions
+	pmtu_ipv4_geneve4_exception	IPv4 over geneve4: PMTU exceptions
+	pmtu_ipv6_geneve4_exception	IPv6 over geneve4: PMTU exceptions
+	pmtu_ipv4_geneve6_exception	IPv4 over geneve6: PMTU exceptions
+	pmtu_ipv6_geneve6_exception	IPv6 over geneve6: PMTU exceptions
 	pmtu_vti6_exception		vti6: PMTU exceptions
 	pmtu_vti4_exception		vti4: PMTU exceptions
 	pmtu_vti4_default_mtu		vti4: default MTU assignment
@@ -230,32 +250,50 @@ setup_vti6() {
 	setup_vti 6 ${veth6_a_addr} ${veth6_b_addr} ${tunnel6_a_addr} ${tunnel6_b_addr} ${tunnel6_mask}
 }
 
-setup_vxlan() {
-	a_addr="${1}"
-	b_addr="${2}"
-	opts="${3}"
+setup_vxlan_or_geneve() {
+	type="${1}"
+	a_addr="${2}"
+	b_addr="${3}"
+	opts="${4}"
+
+	if [ "${type}" = "vxlan" ]; then
+		opts="${opts} ttl 64 dstport 4789"
+		opts_a="local ${a_addr}"
+		opts_b="local ${b_addr}"
+	else
+		opts_a=""
+		opts_b=""
+	fi
 
-	${ns_a} ip link add vxlan_a type vxlan id 1 local ${a_addr} remote ${b_addr} ttl 64 dstport 4789 ${opts} || return 1
-	${ns_b} ip link add vxlan_b type vxlan id 1 local ${b_addr} remote ${a_addr} ttl 64 dstport 4789 ${opts}
+	${ns_a} ip link add ${type}_a type ${type} id 1 ${opts_a} remote ${b_addr} ${opts} || return 1
+	${ns_b} ip link add ${type}_b type ${type} id 1 ${opts_b} remote ${a_addr} ${opts}
 
-	${ns_a} ip addr add ${tunnel4_a_addr}/${tunnel4_mask}   dev vxlan_a
-	${ns_b} ip addr add ${tunnel4_b_addr}/${tunnel4_mask}   dev vxlan_b
+	${ns_a} ip addr add ${tunnel4_a_addr}/${tunnel4_mask} dev ${type}_a
+	${ns_b} ip addr add ${tunnel4_b_addr}/${tunnel4_mask} dev ${type}_b
 
-	${ns_a} ip addr add ${tunnel6_a_addr}/${tunnel6_mask}   dev vxlan_a
-	${ns_b} ip addr add ${tunnel6_b_addr}/${tunnel6_mask}   dev vxlan_b
+	${ns_a} ip addr add ${tunnel6_a_addr}/${tunnel6_mask} dev ${type}_a
+	${ns_b} ip addr add ${tunnel6_b_addr}/${tunnel6_mask} dev ${type}_b
 
-	${ns_a} ip link set vxlan_a up
-	${ns_b} ip link set vxlan_b up
+	${ns_a} ip link set ${type}_a up
+	${ns_b} ip link set ${type}_b up
 
 	sleep 1
 }
 
+setup_geneve4() {
+	setup_vxlan_or_geneve geneve ${prefix4}.${a_r1}.1  ${prefix4}.${b_r1}.1  "df set"
+}
+
 setup_vxlan4() {
-	setup_vxlan ${prefix4}.${a_r1}.1 ${prefix4}.${b_r1}.1 "df set"
+	setup_vxlan_or_geneve vxlan  ${prefix4}.${a_r1}.1  ${prefix4}.${b_r1}.1  "df set"
+}
+
+setup_geneve6() {
+	setup_vxlan_or_geneve geneve ${prefix6}:${a_r1}::1 ${prefix6}:${b_r1}::1
 }
 
 setup_vxlan6() {
-	setup_vxlan ${prefix6}:${a_r1}::1 ${prefix6}:${b_r1}::1 ""
+	setup_vxlan_or_geneve vxlan  ${prefix6}:${a_r1}::1 ${prefix6}:${b_r1}::1
 }
 
 setup_xfrm() {
@@ -514,22 +552,23 @@ test_pmtu_ipv6_exception() {
 	test_pmtu_ipvX 6
 }
 
-test_pmtu_ipvX_over_vxlanY_exception() {
-	family=${1}
-	outer_family=${2}
+test_pmtu_ipvX_over_vxlanY_or_geneveY_exception() {
+	type=${1}
+	family=${2}
+	outer_family=${3}
 	ll_mtu=4000
 
 	if [ ${outer_family} -eq 4 ]; then
-		setup namespaces routing vxlan4 || return 2
-		#                      IPv4 header   UDP header   VXLAN header   Ethernet header
-		exp_mtu=$((${ll_mtu} - 20          - 8          - 8            - 14))
+		setup namespaces routing ${type}4 || return 2
+		#                      IPv4 header   UDP header   VXLAN/GENEVE header   Ethernet header
+		exp_mtu=$((${ll_mtu} - 20          - 8          - 8                   - 14))
 	else
-		setup namespaces routing vxlan6 || return 2
-		#                      IPv6 header   UDP header   VXLAN header   Ethernet header
-		exp_mtu=$((${ll_mtu} - 40          - 8          - 8            - 14))
+		setup namespaces routing ${type}6 || return 2
+		#                      IPv6 header   UDP header   VXLAN/GENEVE header   Ethernet header
+		exp_mtu=$((${ll_mtu} - 40          - 8          - 8                   - 14))
 	fi
 
-	trace "${ns_a}" vxlan_a      "${ns_b}"  vxlan_b \
+	trace "${ns_a}" ${type}_a    "${ns_b}"  ${type}_b \
 	      "${ns_a}" veth_A-R1    "${ns_r1}" veth_R1-A \
 	      "${ns_b}" veth_B-R1    "${ns_r1}" veth_R1-B
 
@@ -547,29 +586,45 @@ test_pmtu_ipvX_over_vxlanY_exception() {
 	mtu "${ns_b}"  veth_B-R1 ${ll_mtu}
 	mtu "${ns_r1}" veth_R1-B ${ll_mtu}
 
-	mtu "${ns_a}" vxlan_a $((${ll_mtu} + 1000))
-	mtu "${ns_b}" vxlan_b $((${ll_mtu} + 1000))
+	mtu "${ns_a}" ${type}_a $((${ll_mtu} + 1000))
+	mtu "${ns_b}" ${type}_b $((${ll_mtu} + 1000))
 	${ns_a} ${ping} -q -M want -i 0.1 -w 2 -s $((${ll_mtu} + 500)) ${dst} > /dev/null
 
 	# Check that exception was created
 	pmtu="$(route_get_dst_pmtu_from_exception "${ns_a}" ${dst})"
-	check_pmtu_value ${exp_mtu} "${pmtu}" "exceeding link layer MTU on VXLAN interface"
+	check_pmtu_value ${exp_mtu} "${pmtu}" "exceeding link layer MTU on ${type} interface"
 }
 
 test_pmtu_ipv4_vxlan4_exception() {
-	test_pmtu_ipvX_over_vxlanY_exception 4 4
+	test_pmtu_ipvX_over_vxlanY_or_geneveY_exception vxlan  4 4
 }
 
 test_pmtu_ipv6_vxlan4_exception() {
-	test_pmtu_ipvX_over_vxlanY_exception 6 4
+	test_pmtu_ipvX_over_vxlanY_or_geneveY_exception vxlan  6 4
+}
+
+test_pmtu_ipv4_geneve4_exception() {
+	test_pmtu_ipvX_over_vxlanY_or_geneveY_exception geneve 4 4
+}
+
+test_pmtu_ipv6_geneve4_exception() {
+	test_pmtu_ipvX_over_vxlanY_or_geneveY_exception geneve 6 4
 }
 
 test_pmtu_ipv4_vxlan6_exception() {
-	test_pmtu_ipvX_over_vxlanY_exception 4 6
+	test_pmtu_ipvX_over_vxlanY_or_geneveY_exception vxlan  4 6
 }
 
 test_pmtu_ipv6_vxlan6_exception() {
-	test_pmtu_ipvX_over_vxlanY_exception 6 6
+	test_pmtu_ipvX_over_vxlanY_or_geneveY_exception vxlan  6 6
+}
+
+test_pmtu_ipv4_geneve6_exception() {
+	test_pmtu_ipvX_over_vxlanY_or_geneveY_exception geneve 4 6
+}
+
+test_pmtu_ipv6_geneve6_exception() {
+	test_pmtu_ipvX_over_vxlanY_or_geneveY_exception geneve 6 6
 }
 
 test_pmtu_vti4_exception() {
-- 
2.19.1

^ permalink raw reply related

* [PATCH net-next v2 06/11] geneve: Allow configuration of DF behaviour
From: Stefano Brivio @ 2018-11-08 11:19 UTC (permalink / raw)
  To: David S. Miller
  Cc: Sabrina Dubroca, Xin Long, Stephen Hemminger, Jiri Benc,
	David Ahern, netdev
In-Reply-To: <cover.1541675666.git.sbrivio@redhat.com>

draft-ietf-nvo3-geneve-08 says:

   It is strongly RECOMMENDED that Path MTU Discovery ([RFC1191],
   [RFC1981]) be used by setting the DF bit in the IP header when Geneve
   packets are transmitted over IPv4 (this is the default with IPv6).

Now that ICMP error handling is working for GENEVE, we can comply with
this recommendation.

Make this configurable, though, to avoid breaking existing setups. By
default, DF won't be set. It can be set or inherited from inner IPv4
packets. If it's configured to be inherited and we are encapsulating IPv6,
it will be set.

This only applies to non-lwt tunnels: if an external control plane is
used, tunnel key will still control the DF flag.

v2:
- DF behaviour configuration only applies for non-lwt tunnels, apply DF
  setting only if (!geneve->collect_md) in geneve_xmit_skb()
  (Stephen Hemminger)

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
---
 drivers/net/geneve.c         | 55 ++++++++++++++++++++++++++++++------
 include/uapi/linux/if_link.h |  9 ++++++
 2 files changed, 56 insertions(+), 8 deletions(-)

diff --git a/drivers/net/geneve.c b/drivers/net/geneve.c
index 4dc457489770..7c53e06b31c3 100644
--- a/drivers/net/geneve.c
+++ b/drivers/net/geneve.c
@@ -70,6 +70,7 @@ struct geneve_dev {
 	bool		   collect_md;
 	bool		   use_udp6_rx_checksums;
 	bool		   ttl_inherit;
+	enum ifla_geneve_df df;
 };
 
 struct geneve_sock {
@@ -875,8 +876,8 @@ static int geneve_xmit_skb(struct sk_buff *skb, struct net_device *dev,
 	struct rtable *rt;
 	struct flowi4 fl4;
 	__u8 tos, ttl;
+	__be16 df = 0;
 	__be16 sport;
-	__be16 df;
 	int err;
 
 	rt = geneve_get_v4_rt(skb, dev, gs4, &fl4, info);
@@ -890,6 +891,8 @@ static int geneve_xmit_skb(struct sk_buff *skb, struct net_device *dev,
 	if (geneve->collect_md) {
 		tos = ip_tunnel_ecn_encap(key->tos, ip_hdr(skb), skb);
 		ttl = key->ttl;
+
+		df = key->tun_flags & TUNNEL_DONT_FRAGMENT ? htons(IP_DF) : 0;
 	} else {
 		tos = ip_tunnel_ecn_encap(fl4.flowi4_tos, ip_hdr(skb), skb);
 		if (geneve->ttl_inherit)
@@ -897,8 +900,22 @@ static int geneve_xmit_skb(struct sk_buff *skb, struct net_device *dev,
 		else
 			ttl = key->ttl;
 		ttl = ttl ? : ip4_dst_hoplimit(&rt->dst);
+
+		if (geneve->df == GENEVE_DF_SET) {
+			df = htons(IP_DF);
+		} else if (geneve->df == GENEVE_DF_INHERIT) {
+			struct ethhdr *eth = eth_hdr(skb);
+
+			if (ntohs(eth->h_proto) == ETH_P_IPV6) {
+				df = htons(IP_DF);
+			} else if (ntohs(eth->h_proto) == ETH_P_IP) {
+				struct iphdr *iph = ip_hdr(skb);
+
+				if (iph->frag_off & htons(IP_DF))
+					df = htons(IP_DF);
+			}
+		}
 	}
-	df = key->tun_flags & TUNNEL_DONT_FRAGMENT ? htons(IP_DF) : 0;
 
 	err = geneve_build_skb(&rt->dst, skb, info, xnet, sizeof(struct iphdr));
 	if (unlikely(err))
@@ -1145,6 +1162,7 @@ static const struct nla_policy geneve_policy[IFLA_GENEVE_MAX + 1] = {
 	[IFLA_GENEVE_UDP_ZERO_CSUM6_TX]	= { .type = NLA_U8 },
 	[IFLA_GENEVE_UDP_ZERO_CSUM6_RX]	= { .type = NLA_U8 },
 	[IFLA_GENEVE_TTL_INHERIT]	= { .type = NLA_U8 },
+	[IFLA_GENEVE_DF]		= { .type = NLA_U8 },
 };
 
 static int geneve_validate(struct nlattr *tb[], struct nlattr *data[],
@@ -1180,6 +1198,16 @@ static int geneve_validate(struct nlattr *tb[], struct nlattr *data[],
 		}
 	}
 
+	if (data[IFLA_GENEVE_DF]) {
+		enum ifla_geneve_df df = nla_get_u8(data[IFLA_GENEVE_DF]);
+
+		if (df < 0 || df > GENEVE_DF_MAX) {
+			NL_SET_ERR_MSG_ATTR(extack, tb[IFLA_GENEVE_DF],
+					    "Invalid DF attribute");
+			return -EINVAL;
+		}
+	}
+
 	return 0;
 }
 
@@ -1225,7 +1253,7 @@ static int geneve_configure(struct net *net, struct net_device *dev,
 			    struct netlink_ext_ack *extack,
 			    const struct ip_tunnel_info *info,
 			    bool metadata, bool ipv6_rx_csum,
-			    bool ttl_inherit)
+			    bool ttl_inherit, enum ifla_geneve_df df)
 {
 	struct geneve_net *gn = net_generic(net, geneve_net_id);
 	struct geneve_dev *t, *geneve = netdev_priv(dev);
@@ -1275,6 +1303,7 @@ static int geneve_configure(struct net *net, struct net_device *dev,
 	geneve->collect_md = metadata;
 	geneve->use_udp6_rx_checksums = ipv6_rx_csum;
 	geneve->ttl_inherit = ttl_inherit;
+	geneve->df = df;
 
 	err = register_netdevice(dev);
 	if (err)
@@ -1294,7 +1323,7 @@ static int geneve_nl2info(struct nlattr *tb[], struct nlattr *data[],
 			  struct netlink_ext_ack *extack,
 			  struct ip_tunnel_info *info, bool *metadata,
 			  bool *use_udp6_rx_checksums, bool *ttl_inherit,
-			  bool changelink)
+			  enum ifla_geneve_df *df, bool changelink)
 {
 	int attrtype;
 
@@ -1382,6 +1411,9 @@ static int geneve_nl2info(struct nlattr *tb[], struct nlattr *data[],
 	if (data[IFLA_GENEVE_TOS])
 		info->key.tos = nla_get_u8(data[IFLA_GENEVE_TOS]);
 
+	if (data[IFLA_GENEVE_DF])
+		*df = nla_get_u8(data[IFLA_GENEVE_DF]);
+
 	if (data[IFLA_GENEVE_LABEL]) {
 		info->key.label = nla_get_be32(data[IFLA_GENEVE_LABEL]) &
 				  IPV6_FLOWLABEL_MASK;
@@ -1500,6 +1532,7 @@ static int geneve_newlink(struct net *net, struct net_device *dev,
 			  struct nlattr *tb[], struct nlattr *data[],
 			  struct netlink_ext_ack *extack)
 {
+	enum ifla_geneve_df df = GENEVE_DF_UNSET;
 	bool use_udp6_rx_checksums = false;
 	struct ip_tunnel_info info;
 	bool ttl_inherit = false;
@@ -1508,12 +1541,12 @@ static int geneve_newlink(struct net *net, struct net_device *dev,
 
 	init_tnl_info(&info, GENEVE_UDP_PORT);
 	err = geneve_nl2info(tb, data, extack, &info, &metadata,
-			     &use_udp6_rx_checksums, &ttl_inherit, false);
+			     &use_udp6_rx_checksums, &ttl_inherit, &df, false);
 	if (err)
 		return err;
 
 	err = geneve_configure(net, dev, extack, &info, metadata,
-			       use_udp6_rx_checksums, ttl_inherit);
+			       use_udp6_rx_checksums, ttl_inherit, df);
 	if (err)
 		return err;
 
@@ -1576,6 +1609,7 @@ static int geneve_changelink(struct net_device *dev, struct nlattr *tb[],
 	struct ip_tunnel_info info;
 	bool metadata;
 	bool use_udp6_rx_checksums;
+	enum ifla_geneve_df df;
 	bool ttl_inherit;
 	int err;
 
@@ -1591,7 +1625,7 @@ static int geneve_changelink(struct net_device *dev, struct nlattr *tb[],
 	use_udp6_rx_checksums = geneve->use_udp6_rx_checksums;
 	ttl_inherit = geneve->ttl_inherit;
 	err = geneve_nl2info(tb, data, extack, &info, &metadata,
-			     &use_udp6_rx_checksums, &ttl_inherit, true);
+			     &use_udp6_rx_checksums, &ttl_inherit, &df, true);
 	if (err)
 		return err;
 
@@ -1624,6 +1658,7 @@ static size_t geneve_get_size(const struct net_device *dev)
 		nla_total_size(sizeof(struct in6_addr)) + /* IFLA_GENEVE_REMOTE{6} */
 		nla_total_size(sizeof(__u8)) +  /* IFLA_GENEVE_TTL */
 		nla_total_size(sizeof(__u8)) +  /* IFLA_GENEVE_TOS */
+		nla_total_size(sizeof(__u8)) +	/* IFLA_GENEVE_DF */
 		nla_total_size(sizeof(__be32)) +  /* IFLA_GENEVE_LABEL */
 		nla_total_size(sizeof(__be16)) +  /* IFLA_GENEVE_PORT */
 		nla_total_size(0) +	 /* IFLA_GENEVE_COLLECT_METADATA */
@@ -1672,6 +1707,9 @@ static int geneve_fill_info(struct sk_buff *skb, const struct net_device *dev)
 	    nla_put_be32(skb, IFLA_GENEVE_LABEL, info->key.label))
 		goto nla_put_failure;
 
+	if (nla_put_u8(skb, IFLA_GENEVE_DF, geneve->df))
+		goto nla_put_failure;
+
 	if (nla_put_be16(skb, IFLA_GENEVE_PORT, info->key.tp_dst))
 		goto nla_put_failure;
 
@@ -1723,7 +1761,8 @@ struct net_device *geneve_dev_create_fb(struct net *net, const char *name,
 		return dev;
 
 	init_tnl_info(&info, dst_port);
-	err = geneve_configure(net, dev, NULL, &info, true, true, false);
+	err = geneve_configure(net, dev, NULL, &info,
+			       true, true, false, GENEVE_DF_UNSET);
 	if (err) {
 		free_netdev(dev);
 		return ERR_PTR(err);
diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index efc588949431..f42c069d81db 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -566,10 +566,19 @@ enum {
 	IFLA_GENEVE_UDP_ZERO_CSUM6_RX,
 	IFLA_GENEVE_LABEL,
 	IFLA_GENEVE_TTL_INHERIT,
+	IFLA_GENEVE_DF,
 	__IFLA_GENEVE_MAX
 };
 #define IFLA_GENEVE_MAX	(__IFLA_GENEVE_MAX - 1)
 
+enum ifla_geneve_df {
+	GENEVE_DF_UNSET = 0,
+	GENEVE_DF_SET,
+	GENEVE_DF_INHERIT,
+	__GENEVE_DF_END,
+	GENEVE_DF_MAX = __GENEVE_DF_END - 1,
+};
+
 /* PPP section */
 enum {
 	IFLA_PPP_UNSPEC,
-- 
2.19.1

^ permalink raw reply related

* [PATCH net-next v2 05/11] geneve: ICMP error lookup handler
From: Stefano Brivio @ 2018-11-08 11:19 UTC (permalink / raw)
  To: David S. Miller
  Cc: Sabrina Dubroca, Xin Long, Stephen Hemminger, Jiri Benc,
	David Ahern, netdev
In-Reply-To: <cover.1541675666.git.sbrivio@redhat.com>

Export an encap_err_lookup() operation to match an ICMP error against a
valid VNI.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
---
v2: no changes

 drivers/net/geneve.c | 52 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 52 insertions(+)

diff --git a/drivers/net/geneve.c b/drivers/net/geneve.c
index fbfc13d81f66..4dc457489770 100644
--- a/drivers/net/geneve.c
+++ b/drivers/net/geneve.c
@@ -387,6 +387,57 @@ static int geneve_udp_encap_recv(struct sock *sk, struct sk_buff *skb)
 	return 0;
 }
 
+/* Callback from net/ipv{4,6}/udp.c to check that we have a tunnel for errors */
+static int geneve_udp_encap_err_lookup(struct sock *sk, struct sk_buff *skb)
+{
+	struct genevehdr *geneveh;
+	struct geneve_sock *gs;
+	u8 zero_vni[3] = { 0 };
+	u8 *vni = zero_vni;
+
+	if (skb->len < GENEVE_BASE_HLEN)
+		return -EINVAL;
+
+	geneveh = geneve_hdr(skb);
+	if (geneveh->ver != GENEVE_VER)
+		return -EINVAL;
+
+	if (geneveh->proto_type != htons(ETH_P_TEB))
+		return -EINVAL;
+
+	gs = rcu_dereference_sk_user_data(sk);
+	if (!gs)
+		return -ENOENT;
+
+	if (geneve_get_sk_family(gs) == AF_INET) {
+		struct iphdr *iph = ip_hdr(skb);
+		__be32 addr4 = 0;
+
+		if (!gs->collect_md) {
+			vni = geneve_hdr(skb)->vni;
+			addr4 = iph->daddr;
+		}
+
+		return geneve_lookup(gs, addr4, vni) ? 0 : -ENOENT;
+	}
+
+#if IS_ENABLED(CONFIG_IPV6)
+	if (geneve_get_sk_family(gs) == AF_INET6) {
+		struct ipv6hdr *ip6h = ipv6_hdr(skb);
+		struct in6_addr addr6 = { 0 };
+
+		if (!gs->collect_md) {
+			vni = geneve_hdr(skb)->vni;
+			addr6 = ip6h->daddr;
+		}
+
+		return geneve6_lookup(gs, addr6, vni) ? 0 : -ENOENT;
+	}
+#endif
+
+	return -EPFNOSUPPORT;
+}
+
 static struct socket *geneve_create_sock(struct net *net, bool ipv6,
 					 __be16 port, bool ipv6_rx_csum)
 {
@@ -544,6 +595,7 @@ static struct geneve_sock *geneve_socket_create(struct net *net, __be16 port,
 	tunnel_cfg.gro_receive = geneve_gro_receive;
 	tunnel_cfg.gro_complete = geneve_gro_complete;
 	tunnel_cfg.encap_rcv = geneve_udp_encap_recv;
+	tunnel_cfg.encap_err_lookup = geneve_udp_encap_err_lookup;
 	tunnel_cfg.encap_destroy = NULL;
 	setup_udp_tunnel_sock(net, sock, &tunnel_cfg);
 	list_add(&gs->list, &gn->sock_list);
-- 
2.19.1

^ permalink raw reply related

* [PATCH net-next v2 04/11] selftests: pmtu: Introduce tests for IPv4/IPv6 over VXLAN over IPv4/IPv6
From: Stefano Brivio @ 2018-11-08 11:19 UTC (permalink / raw)
  To: David S. Miller
  Cc: Sabrina Dubroca, Xin Long, Stephen Hemminger, Jiri Benc,
	David Ahern, netdev
In-Reply-To: <cover.1541675666.git.sbrivio@redhat.com>

Use a router between endpoints, implemented via namespaces, set a low MTU
between router and destination endpoint, exceed it and check PMTU value in
route exceptions.

v2:
- Change all occurrences of VxLAN to VXLAN (Jiri Benc)
- Introduce IPv4 tests right away, if iproute2 doesn't support the 'df'
  link option they will be skipped (David Ahern)

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
---
 tools/testing/selftests/net/pmtu.sh | 143 ++++++++++++++++++++++++----
 1 file changed, 125 insertions(+), 18 deletions(-)

diff --git a/tools/testing/selftests/net/pmtu.sh b/tools/testing/selftests/net/pmtu.sh
index a369d616b390..33cba295ad45 100755
--- a/tools/testing/selftests/net/pmtu.sh
+++ b/tools/testing/selftests/net/pmtu.sh
@@ -26,6 +26,23 @@
 # - pmtu_ipv6
 #	Same as pmtu_ipv4, except for locked PMTU tests, using IPv6
 #
+# - pmtu_ipv4_vxlan4_exception
+#	Set up the same network topology as pmtu_ipv4, create a VXLAN tunnel
+#	over IPv4 between A and B, routed via R1. On the link between R1 and B,
+#	set a MTU lower than the VXLAN MTU and the MTU on the link between A and
+#	R1. Send IPv4 packets, exceeding the MTU between R1 and B, over VXLAN
+#	from A to B and check that the PMTU exception is created with the right
+#	value on A
+#
+# - pmtu_ipv6_vxlan4_exception
+#	Same as pmtu_ipv4_vxlan4_exception, but send IPv6 packets from A to B
+#
+# - pmtu_ipv4_vxlan6_exception
+#	Same as pmtu_ipv4_vxlan4_exception, but use IPv6 transport from A to B
+#
+# - pmtu_ipv6_vxlan6_exception
+#	Same as pmtu_ipv4_vxlan6_exception, but send IPv6 packets from A to B
+#
 # - pmtu_vti4_exception
 #	Set up vti tunnel on top of veth, with xfrm states and policies, in two
 #	namespaces with matching endpoints. Check that route exception is not
@@ -72,6 +89,10 @@ which ping6 > /dev/null 2>&1 && ping6=$(which ping6) || ping6=$(which ping)
 tests="
 	pmtu_ipv4_exception		ipv4: PMTU exceptions
 	pmtu_ipv6_exception		ipv6: PMTU exceptions
+	pmtu_ipv4_vxlan4_exception	IPv4 over vxlan4: PMTU exceptions
+	pmtu_ipv6_vxlan4_exception	IPv6 over vxlan4: PMTU exceptions
+	pmtu_ipv4_vxlan6_exception	IPv4 over vxlan6: PMTU exceptions
+	pmtu_ipv6_vxlan6_exception	IPv6 over vxlan6: PMTU exceptions
 	pmtu_vti6_exception		vti6: PMTU exceptions
 	pmtu_vti4_exception		vti4: PMTU exceptions
 	pmtu_vti4_default_mtu		vti4: default MTU assignment
@@ -95,8 +116,8 @@ ns_r2="ip netns exec ${NS_R2}"
 # Addresses are:
 # - IPv4: PREFIX4.SEGMENT.ID (/24)
 # - IPv6: PREFIX6:SEGMENT::ID (/64)
-prefix4="192.168"
-prefix6="fd00"
+prefix4="10.0"
+prefix6="fc00"
 a_r1=1
 a_r2=2
 b_r1=3
@@ -129,12 +150,12 @@ veth6_a_addr="fd00:1::a"
 veth6_b_addr="fd00:1::b"
 veth6_mask="64"
 
-vti4_a_addr="192.168.2.1"
-vti4_b_addr="192.168.2.2"
-vti4_mask="24"
-vti6_a_addr="fd00:2::a"
-vti6_b_addr="fd00:2::b"
-vti6_mask="64"
+tunnel4_a_addr="192.168.2.1"
+tunnel4_b_addr="192.168.2.2"
+tunnel4_mask="24"
+tunnel6_a_addr="fd00:2::a"
+tunnel6_b_addr="fd00:2::b"
+tunnel6_mask="64"
 
 dummy6_0_addr="fc00:1000::0"
 dummy6_1_addr="fc00:1001::0"
@@ -202,11 +223,39 @@ setup_vti() {
 }
 
 setup_vti4() {
-	setup_vti 4 ${veth4_a_addr} ${veth4_b_addr} ${vti4_a_addr} ${vti4_b_addr} ${vti4_mask}
+	setup_vti 4 ${veth4_a_addr} ${veth4_b_addr} ${tunnel4_a_addr} ${tunnel4_b_addr} ${tunnel4_mask}
 }
 
 setup_vti6() {
-	setup_vti 6 ${veth6_a_addr} ${veth6_b_addr} ${vti6_a_addr} ${vti6_b_addr} ${vti6_mask}
+	setup_vti 6 ${veth6_a_addr} ${veth6_b_addr} ${tunnel6_a_addr} ${tunnel6_b_addr} ${tunnel6_mask}
+}
+
+setup_vxlan() {
+	a_addr="${1}"
+	b_addr="${2}"
+	opts="${3}"
+
+	${ns_a} ip link add vxlan_a type vxlan id 1 local ${a_addr} remote ${b_addr} ttl 64 dstport 4789 ${opts} || return 1
+	${ns_b} ip link add vxlan_b type vxlan id 1 local ${b_addr} remote ${a_addr} ttl 64 dstport 4789 ${opts}
+
+	${ns_a} ip addr add ${tunnel4_a_addr}/${tunnel4_mask}   dev vxlan_a
+	${ns_b} ip addr add ${tunnel4_b_addr}/${tunnel4_mask}   dev vxlan_b
+
+	${ns_a} ip addr add ${tunnel6_a_addr}/${tunnel6_mask}   dev vxlan_a
+	${ns_b} ip addr add ${tunnel6_b_addr}/${tunnel6_mask}   dev vxlan_b
+
+	${ns_a} ip link set vxlan_a up
+	${ns_b} ip link set vxlan_b up
+
+	sleep 1
+}
+
+setup_vxlan4() {
+	setup_vxlan ${prefix4}.${a_r1}.1 ${prefix4}.${b_r1}.1 "df set"
+}
+
+setup_vxlan6() {
+	setup_vxlan ${prefix6}:${a_r1}::1 ${prefix6}:${b_r1}::1 ""
 }
 
 setup_xfrm() {
@@ -465,6 +514,64 @@ test_pmtu_ipv6_exception() {
 	test_pmtu_ipvX 6
 }
 
+test_pmtu_ipvX_over_vxlanY_exception() {
+	family=${1}
+	outer_family=${2}
+	ll_mtu=4000
+
+	if [ ${outer_family} -eq 4 ]; then
+		setup namespaces routing vxlan4 || return 2
+		#                      IPv4 header   UDP header   VXLAN header   Ethernet header
+		exp_mtu=$((${ll_mtu} - 20          - 8          - 8            - 14))
+	else
+		setup namespaces routing vxlan6 || return 2
+		#                      IPv6 header   UDP header   VXLAN header   Ethernet header
+		exp_mtu=$((${ll_mtu} - 40          - 8          - 8            - 14))
+	fi
+
+	trace "${ns_a}" vxlan_a      "${ns_b}"  vxlan_b \
+	      "${ns_a}" veth_A-R1    "${ns_r1}" veth_R1-A \
+	      "${ns_b}" veth_B-R1    "${ns_r1}" veth_R1-B
+
+	if [ ${family} -eq 4 ]; then
+		ping=ping
+		dst=${tunnel4_b_addr}
+	else
+		ping=${ping6}
+		dst=${tunnel6_b_addr}
+	fi
+
+	# Create route exception by exceeding link layer MTU
+	mtu "${ns_a}"  veth_A-R1 $((${ll_mtu} + 1000))
+	mtu "${ns_r1}" veth_R1-A $((${ll_mtu} + 1000))
+	mtu "${ns_b}"  veth_B-R1 ${ll_mtu}
+	mtu "${ns_r1}" veth_R1-B ${ll_mtu}
+
+	mtu "${ns_a}" vxlan_a $((${ll_mtu} + 1000))
+	mtu "${ns_b}" vxlan_b $((${ll_mtu} + 1000))
+	${ns_a} ${ping} -q -M want -i 0.1 -w 2 -s $((${ll_mtu} + 500)) ${dst} > /dev/null
+
+	# Check that exception was created
+	pmtu="$(route_get_dst_pmtu_from_exception "${ns_a}" ${dst})"
+	check_pmtu_value ${exp_mtu} "${pmtu}" "exceeding link layer MTU on VXLAN interface"
+}
+
+test_pmtu_ipv4_vxlan4_exception() {
+	test_pmtu_ipvX_over_vxlanY_exception 4 4
+}
+
+test_pmtu_ipv6_vxlan4_exception() {
+	test_pmtu_ipvX_over_vxlanY_exception 6 4
+}
+
+test_pmtu_ipv4_vxlan6_exception() {
+	test_pmtu_ipvX_over_vxlanY_exception 4 6
+}
+
+test_pmtu_ipv6_vxlan6_exception() {
+	test_pmtu_ipvX_over_vxlanY_exception 6 6
+}
+
 test_pmtu_vti4_exception() {
 	setup namespaces veth vti4 xfrm4 || return 2
 	trace "${ns_a}" veth_a    "${ns_b}" veth_b \
@@ -484,14 +591,14 @@ test_pmtu_vti4_exception() {
 
 	# Send DF packet without exceeding link layer MTU, check that no
 	# exception is created
-	${ns_a} ping -q -M want -i 0.1 -w 2 -s ${ping_payload} ${vti4_b_addr} > /dev/null
-	pmtu="$(route_get_dst_pmtu_from_exception "${ns_a}" ${vti4_b_addr})"
+	${ns_a} ping -q -M want -i 0.1 -w 2 -s ${ping_payload} ${tunnel4_b_addr} > /dev/null
+	pmtu="$(route_get_dst_pmtu_from_exception "${ns_a}" ${tunnel4_b_addr})"
 	check_pmtu_value "" "${pmtu}" "sending packet smaller than PMTU (IP payload length ${esp_payload_rfc4106})" || return 1
 
 	# Now exceed link layer MTU by one byte, check that exception is created
 	# with the right PMTU value
-	${ns_a} ping -q -M want -i 0.1 -w 2 -s $((ping_payload + 1)) ${vti4_b_addr} > /dev/null
-	pmtu="$(route_get_dst_pmtu_from_exception "${ns_a}" ${vti4_b_addr})"
+	${ns_a} ping -q -M want -i 0.1 -w 2 -s $((ping_payload + 1)) ${tunnel4_b_addr} > /dev/null
+	pmtu="$(route_get_dst_pmtu_from_exception "${ns_a}" ${tunnel4_b_addr})"
 	check_pmtu_value "${esp_payload_rfc4106}" "${pmtu}" "exceeding PMTU (IP payload length $((esp_payload_rfc4106 + 1)))"
 }
 
@@ -506,20 +613,20 @@ test_pmtu_vti6_exception() {
 	mtu "${ns_b}" veth_b 4000
 	mtu "${ns_a}" vti6_a 5000
 	mtu "${ns_b}" vti6_b 5000
-	${ns_a} ${ping6} -q -i 0.1 -w 2 -s 60000 ${vti6_b_addr} > /dev/null
+	${ns_a} ${ping6} -q -i 0.1 -w 2 -s 60000 ${tunnel6_b_addr} > /dev/null
 
 	# Check that exception was created
-	pmtu="$(route_get_dst_pmtu_from_exception "${ns_a}" ${vti6_b_addr})"
+	pmtu="$(route_get_dst_pmtu_from_exception "${ns_a}" ${tunnel6_b_addr})"
 	check_pmtu_value any "${pmtu}" "creating tunnel exceeding link layer MTU" || return 1
 
 	# Decrease tunnel MTU, check for PMTU decrease in route exception
 	mtu "${ns_a}" vti6_a 3000
-	pmtu="$(route_get_dst_pmtu_from_exception "${ns_a}" ${vti6_b_addr})"
+	pmtu="$(route_get_dst_pmtu_from_exception "${ns_a}" ${tunnel6_b_addr})"
 	check_pmtu_value "3000" "${pmtu}" "decreasing tunnel MTU" || fail=1
 
 	# Increase tunnel MTU, check for PMTU increase in route exception
 	mtu "${ns_a}" vti6_a 9000
-	pmtu="$(route_get_dst_pmtu_from_exception "${ns_a}" ${vti6_b_addr})"
+	pmtu="$(route_get_dst_pmtu_from_exception "${ns_a}" ${tunnel6_b_addr})"
 	check_pmtu_value "9000" "${pmtu}" "increasing tunnel MTU" || fail=1
 
 	return ${fail}
-- 
2.19.1

^ permalink raw reply related

* [PATCH net-next v2 03/11] vxlan: Allow configuration of DF behaviour
From: Stefano Brivio @ 2018-11-08 11:19 UTC (permalink / raw)
  To: David S. Miller
  Cc: Sabrina Dubroca, Xin Long, Stephen Hemminger, Jiri Benc,
	David Ahern, netdev
In-Reply-To: <cover.1541675666.git.sbrivio@redhat.com>

Allow users to set the IPv4 DF bit in outgoing packets, or to inherit its
value from the IPv4 inner header. If the encapsulated protocol is IPv6 and
DF is configured to be inherited, always set it.

For IPv4, inheriting DF from the inner header was probably intended from
the very beginning judging by the comment to vxlan_xmit(), but it wasn't
actually implemented -- also because it would have done more harm than
good, without handling for ICMP Fragmentation Needed messages.

According to RFC 7348, "Path MTU discovery MAY be used". An expired RFC
draft, draft-saum-nvo3-pmtud-over-vxlan-05, whose purpose was to describe
PMTUD implementation, says that "is a MUST that Vxlan gateways [...]
SHOULD set the DF-bit [...]", whatever that means.

Given this background, the only sane option is probably to let the user
decide, and keep the current behaviour as default.

This only applies to non-lwt tunnels: if an external control plane is
used, tunnel key will still control the DF flag.

v2:
- DF behaviour configuration only applies for non-lwt tunnels, move DF
  setting to if (!info) block in vxlan_xmit_one() (Stephen Hemminger)

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
---
 drivers/net/vxlan.c          | 29 ++++++++++++++++++++++++++++-
 include/net/vxlan.h          |  1 +
 include/uapi/linux/if_link.h |  9 +++++++++
 3 files changed, 38 insertions(+), 1 deletion(-)

diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index 0851af6733f3..c3e65e78f015 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -2278,13 +2278,24 @@ static void vxlan_xmit_one(struct sk_buff *skb, struct net_device *dev,
 			goto tx_error;
 		}
 
-		/* Bypass encapsulation if the destination is local */
 		if (!info) {
+			/* Bypass encapsulation if the destination is local */
 			err = encap_bypass_if_local(skb, dev, vxlan, dst,
 						    dst_port, ifindex, vni,
 						    &rt->dst, rt->rt_flags);
 			if (err)
 				goto out_unlock;
+
+			if (vxlan->cfg.df == VXLAN_DF_SET) {
+				df = htons(IP_DF);
+			} else if (vxlan->cfg.df == VXLAN_DF_INHERIT) {
+				struct ethhdr *eth = eth_hdr(skb);
+
+				if (ntohs(eth->h_proto) == ETH_P_IPV6 ||
+				    (ntohs(eth->h_proto) == ETH_P_IP &&
+				     old_iph->frag_off & htons(IP_DF)))
+					df = htons(IP_DF);
+			}
 		} else if (info->key.tun_flags & TUNNEL_DONT_FRAGMENT) {
 			df = htons(IP_DF);
 		}
@@ -2837,6 +2848,7 @@ static const struct nla_policy vxlan_policy[IFLA_VXLAN_MAX + 1] = {
 	[IFLA_VXLAN_GPE]	= { .type = NLA_FLAG, },
 	[IFLA_VXLAN_REMCSUM_NOPARTIAL]	= { .type = NLA_FLAG },
 	[IFLA_VXLAN_TTL_INHERIT]	= { .type = NLA_FLAG },
+	[IFLA_VXLAN_DF]		= { .type = NLA_U8 },
 };
 
 static int vxlan_validate(struct nlattr *tb[], struct nlattr *data[],
@@ -2893,6 +2905,16 @@ static int vxlan_validate(struct nlattr *tb[], struct nlattr *data[],
 		}
 	}
 
+	if (data[IFLA_VXLAN_DF]) {
+		enum ifla_vxlan_df df = nla_get_u8(data[IFLA_VXLAN_DF]);
+
+		if (df < 0 || df > VXLAN_DF_MAX) {
+			NL_SET_ERR_MSG_ATTR(extack, tb[IFLA_VXLAN_DF],
+					    "Invalid DF attribute");
+			return -EINVAL;
+		}
+	}
+
 	return 0;
 }
 
@@ -3538,6 +3560,9 @@ static int vxlan_nl2conf(struct nlattr *tb[], struct nlattr *data[],
 		conf->mtu = nla_get_u32(tb[IFLA_MTU]);
 	}
 
+	if (data[IFLA_VXLAN_DF])
+		conf->df = nla_get_u8(data[IFLA_VXLAN_DF]);
+
 	return 0;
 }
 
@@ -3630,6 +3655,7 @@ static size_t vxlan_get_size(const struct net_device *dev)
 		nla_total_size(sizeof(__u8)) +	/* IFLA_VXLAN_TTL */
 		nla_total_size(sizeof(__u8)) +	/* IFLA_VXLAN_TTL_INHERIT */
 		nla_total_size(sizeof(__u8)) +	/* IFLA_VXLAN_TOS */
+		nla_total_size(sizeof(__u8)) +	/* IFLA_VXLAN_DF */
 		nla_total_size(sizeof(__be32)) + /* IFLA_VXLAN_LABEL */
 		nla_total_size(sizeof(__u8)) +	/* IFLA_VXLAN_LEARNING */
 		nla_total_size(sizeof(__u8)) +	/* IFLA_VXLAN_PROXY */
@@ -3696,6 +3722,7 @@ static int vxlan_fill_info(struct sk_buff *skb, const struct net_device *dev)
 	    nla_put_u8(skb, IFLA_VXLAN_TTL_INHERIT,
 		       !!(vxlan->cfg.flags & VXLAN_F_TTL_INHERIT)) ||
 	    nla_put_u8(skb, IFLA_VXLAN_TOS, vxlan->cfg.tos) ||
+	    nla_put_u8(skb, IFLA_VXLAN_DF, vxlan->cfg.df) ||
 	    nla_put_be32(skb, IFLA_VXLAN_LABEL, vxlan->cfg.label) ||
 	    nla_put_u8(skb, IFLA_VXLAN_LEARNING,
 			!!(vxlan->cfg.flags & VXLAN_F_LEARN)) ||
diff --git a/include/net/vxlan.h b/include/net/vxlan.h
index 03431c148e16..ec999c49df1f 100644
--- a/include/net/vxlan.h
+++ b/include/net/vxlan.h
@@ -216,6 +216,7 @@ struct vxlan_config {
 	unsigned long		age_interval;
 	unsigned int		addrmax;
 	bool			no_share;
+	enum ifla_vxlan_df	df;
 };
 
 struct vxlan_dev_node {
diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index 1debfa42cba1..efc588949431 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -533,6 +533,7 @@ enum {
 	IFLA_VXLAN_LABEL,
 	IFLA_VXLAN_GPE,
 	IFLA_VXLAN_TTL_INHERIT,
+	IFLA_VXLAN_DF,
 	__IFLA_VXLAN_MAX
 };
 #define IFLA_VXLAN_MAX	(__IFLA_VXLAN_MAX - 1)
@@ -542,6 +543,14 @@ struct ifla_vxlan_port_range {
 	__be16	high;
 };
 
+enum ifla_vxlan_df {
+	VXLAN_DF_UNSET = 0,
+	VXLAN_DF_SET,
+	VXLAN_DF_INHERIT,
+	__VXLAN_DF_END,
+	VXLAN_DF_MAX = __VXLAN_DF_END - 1,
+};
+
 /* GENEVE section */
 enum {
 	IFLA_GENEVE_UNSPEC,
-- 
2.19.1

^ permalink raw reply related

* [PATCH net-next v2 02/11] vxlan: ICMP error lookup handler
From: Stefano Brivio @ 2018-11-08 11:19 UTC (permalink / raw)
  To: David S. Miller
  Cc: Sabrina Dubroca, Xin Long, Stephen Hemminger, Jiri Benc,
	David Ahern, netdev
In-Reply-To: <cover.1541675666.git.sbrivio@redhat.com>

Export an encap_err_lookup() operation to match an ICMP error against a
valid VNI.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
---
v2: no changes

 drivers/net/vxlan.c | 29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)

diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index ae969f806d56..0851af6733f3 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -1552,6 +1552,34 @@ static int vxlan_rcv(struct sock *sk, struct sk_buff *skb)
 	return 0;
 }
 
+/* Callback from net/ipv{4,6}/udp.c to check that we have a VNI for errors */
+static int vxlan_err_lookup(struct sock *sk, struct sk_buff *skb)
+{
+	struct vxlan_dev *vxlan;
+	struct vxlan_sock *vs;
+	struct vxlanhdr *hdr;
+	__be32 vni;
+
+	if (skb->len < VXLAN_HLEN)
+		return -EINVAL;
+
+	hdr = vxlan_hdr(skb);
+
+	if (!(hdr->vx_flags & VXLAN_HF_VNI))
+		return -EINVAL;
+
+	vs = rcu_dereference_sk_user_data(sk);
+	if (!vs)
+		return -ENOENT;
+
+	vni = vxlan_vni(hdr->vx_vni);
+	vxlan = vxlan_vs_find_vni(vs, skb->dev->ifindex, vni);
+	if (!vxlan)
+		return -ENOENT;
+
+	return 0;
+}
+
 static int arp_reduce(struct net_device *dev, struct sk_buff *skb, __be32 vni)
 {
 	struct vxlan_dev *vxlan = netdev_priv(dev);
@@ -2948,6 +2976,7 @@ static struct vxlan_sock *vxlan_socket_create(struct net *net, bool ipv6,
 	tunnel_cfg.sk_user_data = vs;
 	tunnel_cfg.encap_type = 1;
 	tunnel_cfg.encap_rcv = vxlan_rcv;
+	tunnel_cfg.encap_err_lookup = vxlan_err_lookup;
 	tunnel_cfg.encap_destroy = NULL;
 	tunnel_cfg.gro_receive = vxlan_gro_receive;
 	tunnel_cfg.gro_complete = vxlan_gro_complete;
-- 
2.19.1

^ permalink raw reply related

* [PATCH net-next v2 01/11] udp: Handle ICMP errors for tunnels with same destination port on both endpoints
From: Stefano Brivio @ 2018-11-08 11:19 UTC (permalink / raw)
  To: David S. Miller
  Cc: Sabrina Dubroca, Xin Long, Stephen Hemminger, Jiri Benc,
	David Ahern, netdev
In-Reply-To: <cover.1541675666.git.sbrivio@redhat.com>

For both IPv4 and IPv6, if we can't match errors to a socket, try
tunnels before ignoring them. Look up a socket with the original source
and destination ports as found in the UDP packet inside the ICMP payload,
this will work for tunnels that force the same destination port for both
endpoints, i.e. VXLAN and GENEVE.

Actually, lwtunnels could break this assumption if they are configured by
an external control plane to have different destination ports on the
endpoints: in this case, we won't be able to trace ICMP messages back to
them.

For IPv6 redirect messages, call ip6_redirect() directly with the output
interface argument set to the interface we received the packet from (as
it's the very interface we should build the exception on), otherwise the
new nexthop will be rejected. There's no such need for IPv4.

Tunnels can now export an encap_err_lookup() operation that indicates a
match. Pass the packet to the lookup function, and if the tunnel driver
reports a matching association, continue with regular ICMP error handling.

v2:
- Added newline between network and transport header sets in
  __udp{4,6}_lib_err_encap() (David Miller)
- Removed redundant skb_reset_network_header(skb); in
  __udp4_lib_err_encap()
- Removed redundant reassignment of iph in __udp4_lib_err_encap()
  (Sabrina Dubroca)
- Edited comment to __udp{4,6}_lib_err_encap() to reflect the fact this
  won't work with lwtunnels configured to use asymmetric ports. By the way,
  it's VXLAN, not VxLAN (Jiri Benc)

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
---
 include/linux/udp.h      |  1 +
 include/net/udp_tunnel.h |  3 ++
 net/ipv4/udp.c           | 79 +++++++++++++++++++++++++++++++----
 net/ipv4/udp_tunnel.c    |  1 +
 net/ipv6/udp.c           | 89 +++++++++++++++++++++++++++++++++++-----
 5 files changed, 153 insertions(+), 20 deletions(-)

diff --git a/include/linux/udp.h b/include/linux/udp.h
index 0a9c54e76305..2725c83395bf 100644
--- a/include/linux/udp.h
+++ b/include/linux/udp.h
@@ -77,6 +77,7 @@ struct udp_sock {
 	 * For encapsulation sockets.
 	 */
 	int (*encap_rcv)(struct sock *sk, struct sk_buff *skb);
+	int (*encap_err_lookup)(struct sock *sk, struct sk_buff *skb);
 	void (*encap_destroy)(struct sock *sk);
 
 	/* GRO functions for UDP socket */
diff --git a/include/net/udp_tunnel.h b/include/net/udp_tunnel.h
index 3fbe56430e3b..dc8d804af3b4 100644
--- a/include/net/udp_tunnel.h
+++ b/include/net/udp_tunnel.h
@@ -64,6 +64,8 @@ static inline int udp_sock_create(struct net *net,
 }
 
 typedef int (*udp_tunnel_encap_rcv_t)(struct sock *sk, struct sk_buff *skb);
+typedef int (*udp_tunnel_encap_err_lookup_t)(struct sock *sk,
+					     struct sk_buff *skb);
 typedef void (*udp_tunnel_encap_destroy_t)(struct sock *sk);
 typedef struct sk_buff *(*udp_tunnel_gro_receive_t)(struct sock *sk,
 						    struct list_head *head,
@@ -76,6 +78,7 @@ struct udp_tunnel_sock_cfg {
 	/* Used for setting up udp_sock fields, see udp.h for details */
 	__u8  encap_type;
 	udp_tunnel_encap_rcv_t encap_rcv;
+	udp_tunnel_encap_err_lookup_t encap_err_lookup;
 	udp_tunnel_encap_destroy_t encap_destroy;
 	udp_tunnel_gro_receive_t gro_receive;
 	udp_tunnel_gro_complete_t gro_complete;
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 3488650b90ac..ce759b61f6cd 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -583,6 +583,62 @@ static inline bool __udp_is_mcast_sock(struct net *net, struct sock *sk,
 	return true;
 }
 
+DEFINE_STATIC_KEY_FALSE(udp_encap_needed_key);
+void udp_encap_enable(void)
+{
+	static_branch_enable(&udp_encap_needed_key);
+}
+EXPORT_SYMBOL(udp_encap_enable);
+
+/* Try to match ICMP errors to UDP tunnels by looking up a socket without
+ * reversing source and destination port: this will match tunnels that force the
+ * same destination port on both endpoints (e.g. VXLAN, GENEVE). Note that
+ * lwtunnels might actually break this assumption by being configured with
+ * different destination ports on endpoints, in this case we won't be able to
+ * trace ICMP messages back to them.
+ *
+ * Then ask the tunnel implementation to match the error against a valid
+ * association.
+ *
+ * Return the socket if we have a match.
+ */
+static struct sock *__udp4_lib_err_encap(struct net *net,
+					 const struct iphdr *iph,
+					 struct udphdr *uh,
+					 struct udp_table *udptable,
+					 struct sk_buff *skb)
+{
+	int (*lookup)(struct sock *sk, struct sk_buff *skb);
+	int network_offset, transport_offset;
+	struct udp_sock *up;
+	struct sock *sk;
+
+	sk = __udp4_lib_lookup(net, iph->daddr, uh->source,
+			       iph->saddr, uh->dest, skb->dev->ifindex, 0,
+			       udptable, NULL);
+	if (!sk)
+		return NULL;
+
+	network_offset = skb_network_offset(skb);
+	transport_offset = skb_transport_offset(skb);
+
+	/* Network header needs to point to the outer IPv4 header inside ICMP */
+	skb_reset_network_header(skb);
+
+	/* Transport header needs to point to the UDP header */
+	skb_set_transport_header(skb, iph->ihl << 2);
+
+	up = udp_sk(sk);
+	lookup = READ_ONCE(up->encap_err_lookup);
+	if (!lookup || lookup(sk, skb))
+		sk = NULL;
+
+	skb_set_transport_header(skb, transport_offset);
+	skb_set_network_header(skb, network_offset);
+
+	return sk;
+}
+
 /*
  * This routine is called by the ICMP module when it gets some
  * sort of error condition.  If err < 0 then the socket should
@@ -601,6 +657,7 @@ void __udp4_lib_err(struct sk_buff *skb, u32 info, struct udp_table *udptable)
 	struct udphdr *uh = (struct udphdr *)(skb->data+(iph->ihl<<2));
 	const int type = icmp_hdr(skb)->type;
 	const int code = icmp_hdr(skb)->code;
+	bool tunnel = false;
 	struct sock *sk;
 	int harderr;
 	int err;
@@ -610,8 +667,15 @@ void __udp4_lib_err(struct sk_buff *skb, u32 info, struct udp_table *udptable)
 			       iph->saddr, uh->source, skb->dev->ifindex,
 			       inet_sdif(skb), udptable, NULL);
 	if (!sk) {
-		__ICMP_INC_STATS(net, ICMP_MIB_INERRORS);
-		return;	/* No socket for error */
+		/* No socket for error: try tunnels before discarding */
+		if (static_branch_unlikely(&udp_encap_needed_key))
+			sk = __udp4_lib_err_encap(net, iph, uh, udptable, skb);
+
+		if (!sk) {
+			__ICMP_INC_STATS(net, ICMP_MIB_INERRORS);
+			return;
+		}
+		tunnel = true;
 	}
 
 	err = 0;
@@ -654,6 +718,10 @@ void __udp4_lib_err(struct sk_buff *skb, u32 info, struct udp_table *udptable)
 	 *      RFC1122: OK.  Passes ICMP errors back to application, as per
 	 *	4.1.3.3.
 	 */
+	if (tunnel) {
+		/* ...not for tunnels though: we don't have a sending socket */
+		goto out;
+	}
 	if (!inet->recverr) {
 		if (!harderr || sk->sk_state != TCP_ESTABLISHED)
 			goto out;
@@ -1891,13 +1959,6 @@ static int __udp_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
 	return 0;
 }
 
-DEFINE_STATIC_KEY_FALSE(udp_encap_needed_key);
-void udp_encap_enable(void)
-{
-	static_branch_enable(&udp_encap_needed_key);
-}
-EXPORT_SYMBOL(udp_encap_enable);
-
 /* returns:
  *  -1: error
  *   0: success
diff --git a/net/ipv4/udp_tunnel.c b/net/ipv4/udp_tunnel.c
index 6539ff15e9a3..d0c412fc56ad 100644
--- a/net/ipv4/udp_tunnel.c
+++ b/net/ipv4/udp_tunnel.c
@@ -68,6 +68,7 @@ void setup_udp_tunnel_sock(struct net *net, struct socket *sock,
 
 	udp_sk(sk)->encap_type = cfg->encap_type;
 	udp_sk(sk)->encap_rcv = cfg->encap_rcv;
+	udp_sk(sk)->encap_err_lookup = cfg->encap_err_lookup;
 	udp_sk(sk)->encap_destroy = cfg->encap_destroy;
 	udp_sk(sk)->gro_receive = cfg->gro_receive;
 	udp_sk(sk)->gro_complete = cfg->gro_complete;
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index c55698d19d68..1216c920f945 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -462,6 +462,61 @@ int udpv6_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 	goto try_again;
 }
 
+DEFINE_STATIC_KEY_FALSE(udpv6_encap_needed_key);
+void udpv6_encap_enable(void)
+{
+	static_branch_enable(&udpv6_encap_needed_key);
+}
+EXPORT_SYMBOL(udpv6_encap_enable);
+
+/* Try to match ICMP errors to UDP tunnels by looking up a socket without
+ * reversing source and destination port: this will match tunnels that force the
+ * same destination port on both endpoints (e.g. VXLAN, GENEVE). Note that
+ * lwtunnels might actually break this assumption by being configured with
+ * different destination ports on endpoints, in this case we won't be able to
+ * trace ICMP messages back to them.
+ *
+ * Then ask the tunnel implementation to match the error against a valid
+ * association.
+ *
+ * Return the socket if we have a match.
+ */
+static struct sock *__udp6_lib_err_encap(struct net *net,
+					 const struct ipv6hdr *hdr, int offset,
+					 struct udphdr *uh,
+					 struct udp_table *udptable,
+					 struct sk_buff *skb)
+{
+	int (*lookup)(struct sock *sk, struct sk_buff *skb);
+	int network_offset, transport_offset;
+	struct udp_sock *up;
+	struct sock *sk;
+
+	sk = __udp6_lib_lookup(net, &hdr->daddr, uh->source,
+			       &hdr->saddr, uh->dest,
+			       inet6_iif(skb), 0, udptable, skb);
+	if (!sk)
+		return NULL;
+
+	network_offset = skb_network_offset(skb);
+	transport_offset = skb_transport_offset(skb);
+
+	/* Network header needs to point to the outer IPv6 header inside ICMP */
+	skb_reset_network_header(skb);
+
+	/* Transport header needs to point to the UDP header */
+	skb_set_transport_header(skb, offset);
+
+	up = udp_sk(sk);
+	lookup = READ_ONCE(up->encap_err_lookup);
+	if (!lookup || lookup(sk, skb))
+		sk = NULL;
+
+	skb_set_transport_header(skb, transport_offset);
+	skb_set_network_header(skb, network_offset);
+	return sk;
+}
+
 void __udp6_lib_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
 		    u8 type, u8 code, int offset, __be32 info,
 		    struct udp_table *udptable)
@@ -471,6 +526,7 @@ void __udp6_lib_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
 	const struct in6_addr *saddr = &hdr->saddr;
 	const struct in6_addr *daddr = &hdr->daddr;
 	struct udphdr *uh = (struct udphdr *)(skb->data+offset);
+	bool tunnel = false;
 	struct sock *sk;
 	int harderr;
 	int err;
@@ -479,9 +535,18 @@ void __udp6_lib_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
 	sk = __udp6_lib_lookup(net, daddr, uh->dest, saddr, uh->source,
 			       inet6_iif(skb), inet6_sdif(skb), udptable, skb);
 	if (!sk) {
-		__ICMP6_INC_STATS(net, __in6_dev_get(skb->dev),
-				  ICMP6_MIB_INERRORS);
-		return;
+		/* No socket for error: try tunnels before discarding */
+		if (static_branch_unlikely(&udpv6_encap_needed_key)) {
+			sk = __udp6_lib_err_encap(net, hdr, offset, uh,
+						  udptable, skb);
+		}
+
+		if (!sk) {
+			__ICMP6_INC_STATS(net, __in6_dev_get(skb->dev),
+					  ICMP6_MIB_INERRORS);
+			return;
+		}
+		tunnel = true;
 	}
 
 	harderr = icmpv6_err_convert(type, code, &err);
@@ -495,10 +560,19 @@ void __udp6_lib_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
 			harderr = 1;
 	}
 	if (type == NDISC_REDIRECT) {
-		ip6_sk_redirect(skb, sk);
+		if (tunnel) {
+			ip6_redirect(skb, sock_net(sk), inet6_iif(skb),
+				     sk->sk_mark, sk->sk_uid);
+		} else {
+			ip6_sk_redirect(skb, sk);
+		}
 		goto out;
 	}
 
+	/* Tunnels don't have an application socket: don't pass errors back */
+	if (tunnel)
+		goto out;
+
 	if (!np->recverr) {
 		if (!harderr || sk->sk_state != TCP_ESTABLISHED)
 			goto out;
@@ -547,13 +621,6 @@ static __inline__ void udpv6_err(struct sk_buff *skb,
 	__udp6_lib_err(skb, opt, type, code, offset, info, &udp_table);
 }
 
-DEFINE_STATIC_KEY_FALSE(udpv6_encap_needed_key);
-void udpv6_encap_enable(void)
-{
-	static_branch_enable(&udpv6_encap_needed_key);
-}
-EXPORT_SYMBOL(udpv6_encap_enable);
-
 static int udpv6_queue_rcv_one_skb(struct sock *sk, struct sk_buff *skb)
 {
 	struct udp_sock *up = udp_sk(sk);
-- 
2.19.1

^ permalink raw reply related

* [PATCH net-next v2 00/11] ICMP error handling for UDP tunnels
From: Stefano Brivio @ 2018-11-08 11:19 UTC (permalink / raw)
  To: David S. Miller
  Cc: Sabrina Dubroca, Xin Long, Stephen Hemminger, Jiri Benc,
	David Ahern, netdev

This series introduces ICMP error handling for UDP tunnels and
encapsulations and related selftests. We need to handle ICMP errors to
support PMTU discovery and route redirection -- this support is entirely
missing right now:

- patch 1/11 adds a socket lookup for UDP tunnels that use, by design,
  the same destination port on both endpoints -- i.e. VXLAN and GENEVE
- patches 2/11 to 7/11 are specific to VxLAN and GENEVE
- patches 8/11 and 9/11 add infrastructure for lookup of encapsulations
  where sent packets cannot be matched via receiving socket lookup, i.e.
  FoU and GUE
- patches 10/11 and 11/11 are specific to FoU and GUE

v2: changes are listed in the single patches

Stefano Brivio (11):
  udp: Handle ICMP errors for tunnels with same destination port on both
    endpoints
  vxlan: ICMP error lookup handler
  vxlan: Allow configuration of DF behaviour
  selftests: pmtu: Introduce tests for IPv4/IPv6 over VXLAN over
    IPv4/IPv6
  geneve: ICMP error lookup handler
  geneve: Allow configuration of DF behaviour
  selftests: pmtu: Introduce tests for IPv4/IPv6 over GENEVE over
    IPv4/IPv6
  net: Convert protocol error handlers from void to int
  udp: Support for error handlers of tunnels with arbitrary destination
    port
  fou, fou6: ICMP error handlers for FoU and GUE
  selftests: pmtu: Introduce FoU and GUE PMTU exceptions tests

 drivers/net/geneve.c                | 107 +++++++-
 drivers/net/vxlan.c                 |  58 ++++-
 include/linux/udp.h                 |   1 +
 include/net/icmp.h                  |   2 +-
 include/net/ip6_tunnel.h            |   2 +
 include/net/ip_tunnels.h            |   1 +
 include/net/protocol.h              |   9 +-
 include/net/sctp/sctp.h             |   2 +-
 include/net/tcp.h                   |   2 +-
 include/net/udp.h                   |   2 +-
 include/net/udp_tunnel.h            |   3 +
 include/net/vxlan.h                 |   1 +
 include/uapi/linux/if_link.h        |  18 ++
 net/dccp/ipv4.c                     |  13 +-
 net/dccp/ipv6.c                     |  13 +-
 net/ipv4/fou.c                      |  68 +++++
 net/ipv4/gre_demux.c                |   9 +-
 net/ipv4/icmp.c                     |   6 +-
 net/ipv4/ip_gre.c                   |  48 ++--
 net/ipv4/ipip.c                     |  14 +-
 net/ipv4/protocol.c                 |   1 +
 net/ipv4/tcp_ipv4.c                 |  22 +-
 net/ipv4/tunnel4.c                  |  18 +-
 net/ipv4/udp.c                      | 121 ++++++++-
 net/ipv4/udp_impl.h                 |   2 +-
 net/ipv4/udp_tunnel.c               |   1 +
 net/ipv4/udplite.c                  |   4 +-
 net/ipv4/xfrm4_protocol.c           |  18 +-
 net/ipv6/fou6.c                     |  74 ++++++
 net/ipv6/icmp.c                     |   4 +-
 net/ipv6/ip6_gre.c                  |  18 +-
 net/ipv6/tcp_ipv6.c                 |  13 +-
 net/ipv6/tunnel6.c                  |  12 +-
 net/ipv6/udp.c                      | 146 +++++++++--
 net/ipv6/udp_impl.h                 |   4 +-
 net/ipv6/udplite.c                  |   5 +-
 net/ipv6/xfrm6_protocol.c           |  18 +-
 net/sctp/input.c                    |   5 +-
 net/sctp/ipv6.c                     |   7 +-
 tools/testing/selftests/net/pmtu.sh | 377 ++++++++++++++++++++++++++--
 40 files changed, 1083 insertions(+), 166 deletions(-)

-- 
2.19.1

^ permalink raw reply

* Re: [PATCH v3 bpf-next 4/4] bpftool: support loading flow dissector
From: Quentin Monnet @ 2018-11-08 11:16 UTC (permalink / raw)
  To: Stanislav Fomichev, netdev, linux-kselftest, ast, daniel, shuah,
	jakub.kicinski
  Cc: guro, jiong.wang, bhole_prashant_q7, john.fastabend, jbenc,
	treeze.taeung, yhs, osk, sandipan
In-Reply-To: <20181108053957.205681-5-sdf@google.com>

Hi Stanislav, thanks for the changes! More comments below.

2018-11-07 21:39 UTC-0800 ~ Stanislav Fomichev <sdf@google.com>
> This commit adds support for loading/attaching/detaching flow
> dissector program. The structure of the flow dissector program is
> assumed to be the same as in the selftests:
> 
> * flow_dissector section with the main entry point
> * a bunch of tail call progs
> * a jmp_table map that is populated with the tail call progs
> 
> When `bpftool load` is called with a flow_dissector prog (i.e. when the
> first section is flow_dissector of 'type flow_dissector' argument is
> passed), we load and pin all the programs/maps. User is responsible to
> construct the jump table for the tail calls.
> 
> The last argument of `bpftool attach` is made optional for this use
> case.
> 
> Example:
> bpftool prog load tools/testing/selftests/bpf/bpf_flow.o \
> 	/sys/fs/bpf/flow type flow_dissector
> 
> bpftool map update pinned /sys/fs/bpf/flow/jmp_table \
>          key 0 0 0 0 \
>          value pinned /sys/fs/bpf/flow/IP
> 
> bpftool map update pinned /sys/fs/bpf/flow/jmp_table \
>          key 1 0 0 0 \
>          value pinned /sys/fs/bpf/flow/IPV6
> 
> bpftool map update pinned /sys/fs/bpf/flow/jmp_table \
>          key 2 0 0 0 \
>          value pinned /sys/fs/bpf/flow/IPV6OP
> 
> bpftool map update pinned /sys/fs/bpf/flow/jmp_table \
>          key 3 0 0 0 \
>          value pinned /sys/fs/bpf/flow/IPV6FR
> 
> bpftool map update pinned /sys/fs/bpf/flow/jmp_table \
>          key 4 0 0 0 \
>          value pinned /sys/fs/bpf/flow/MPLS
> 
> bpftool map update pinned /sys/fs/bpf/flow/jmp_table \
>          key 5 0 0 0 \
>          value pinned /sys/fs/bpf/flow/VLAN
> 
> bpftool prog attach pinned /sys/fs/bpf/flow/flow_dissector flow_dissector
> 
> Tested by using the above lines to load the prog in
> the test_flow_dissector.sh selftest.
> 
> Signed-off-by: Stanislav Fomichev <sdf@google.com>
> ---
>   .../bpftool/Documentation/bpftool-prog.rst    |  36 ++++--
>   tools/bpf/bpftool/bash-completion/bpftool     |   6 +-
>   tools/bpf/bpftool/common.c                    |  30 ++---
>   tools/bpf/bpftool/main.h                      |   1 +
>   tools/bpf/bpftool/prog.c                      | 112 +++++++++++++-----
>   5 files changed, 126 insertions(+), 59 deletions(-)
> 
> diff --git a/tools/bpf/bpftool/Documentation/bpftool-prog.rst b/tools/bpf/bpftool/Documentation/bpftool-prog.rst
> index ac4e904b10fb..0374634c3087 100644
> --- a/tools/bpf/bpftool/Documentation/bpftool-prog.rst
> +++ b/tools/bpf/bpftool/Documentation/bpftool-prog.rst
> @@ -15,7 +15,8 @@ SYNOPSIS
>   	*OPTIONS* := { { **-j** | **--json** } [{ **-p** | **--pretty** }] | { **-f** | **--bpffs** } }
>   
>   	*COMMANDS* :=
> -	{ **show** | **list** | **dump xlated** | **dump jited** | **pin** | **load** | **help** }
> +	{ **show** | **list** | **dump xlated** | **dump jited** | **pin** | **load**
> +	| **loadall** | **help** }
>   
>   MAP COMMANDS
>   =============
> @@ -24,9 +25,9 @@ MAP COMMANDS
>   |	**bpftool** **prog dump xlated** *PROG* [{**file** *FILE* | **opcodes** | **visual**}]
>   |	**bpftool** **prog dump jited**  *PROG* [{**file** *FILE* | **opcodes**}]
>   |	**bpftool** **prog pin** *PROG* *FILE*
> -|	**bpftool** **prog load** *OBJ* *FILE* [**type** *TYPE*] [**map** {**idx** *IDX* | **name** *NAME*} *MAP*] [**dev** *NAME*]
> -|       **bpftool** **prog attach** *PROG* *ATTACH_TYPE* *MAP*
> -|       **bpftool** **prog detach** *PROG* *ATTACH_TYPE* *MAP*
> +|	**bpftool** **prog { load | loadall }** *OBJ* *FILE* [**type** *TYPE*] [**map** {**idx** *IDX* | **name** *NAME*} *MAP*] [**dev** *NAME*]
> +|       **bpftool** **prog attach** *PROG* *ATTACH_TYPE* [*MAP*]
> +|       **bpftool** **prog detach** *PROG* *ATTACH_TYPE* [*MAP*]
>   |	**bpftool** **prog help**
>   |
>   |	*MAP* := { **id** *MAP_ID* | **pinned** *FILE* }
> @@ -39,7 +40,9 @@ MAP COMMANDS
>   |		**cgroup/bind4** | **cgroup/bind6** | **cgroup/post_bind4** | **cgroup/post_bind6** |
>   |		**cgroup/connect4** | **cgroup/connect6** | **cgroup/sendmsg4** | **cgroup/sendmsg6**
>   |	}
> -|       *ATTACH_TYPE* := { **msg_verdict** | **skb_verdict** | **skb_parse** }
> +|       *ATTACH_TYPE* := {
> +|		**msg_verdict** | **skb_verdict** | **skb_parse** | **flow_dissector**
> +|	}
>   
>   
>   DESCRIPTION
> @@ -79,8 +82,11 @@ DESCRIPTION
>   		  contain a dot character ('.'), which is reserved for future
>   		  extensions of *bpffs*.
>   
> -	**bpftool prog load** *OBJ* *FILE* [**type** *TYPE*] [**map** {**idx** *IDX* | **name** *NAME*} *MAP*] [**dev** *NAME*]
> +	**bpftool prog { load | loadall }** *OBJ* *FILE* [**type** *TYPE*] [**map** {**idx** *IDX* | **name** *NAME*} *MAP*] [**dev** *NAME*]
>   		  Load bpf program from binary *OBJ* and pin as *FILE*.
> +		  **bpftool prog load** will pin only the first bpf program
> +		  from the *OBJ*, **bpftool prog loadall** will pin all maps
> +		  and programs from the *OBJ*.

This could be improved regarding maps: with "bpftool prog load" I think 
we also load and pin all maps, but your description implies this is only 
the case with "loadall"

>   		  **type** is optional, if not specified program type will be
>   		  inferred from section names.
>   		  By default bpftool will create new maps as declared in the ELF
> @@ -97,13 +103,17 @@ DESCRIPTION
>   		  contain a dot character ('.'), which is reserved for future
>   		  extensions of *bpffs*.
>   
> -        **bpftool prog attach** *PROG* *ATTACH_TYPE* *MAP*
> -                  Attach bpf program *PROG* (with type specified by *ATTACH_TYPE*)
> -                  to the map *MAP*.
> -
> -        **bpftool prog detach** *PROG* *ATTACH_TYPE* *MAP*
> -                  Detach bpf program *PROG* (with type specified by *ATTACH_TYPE*)
> -                  from the map *MAP*.
> +        **bpftool prog attach** *PROG* *ATTACH_TYPE* [*MAP*]
> +                  Attach bpf program *PROG* (with type specified by
> +                  *ATTACH_TYPE*). Most *ATTACH_TYPEs* require a *MAP*
> +                  parameter, with the exception of *flow_dissector* which is
> +                  attached to current networking name space.
> +
> +        **bpftool prog detach** *PROG* *ATTACH_TYPE* [*MAP*]
> +                  Detach bpf program *PROG* (with type specified by
> +                  *ATTACH_TYPE*). Most *ATTACH_TYPEs* require a *MAP*
> +                  parameter, with the exception of *flow_dissector* which is
> +                  detached from the current networking name space.

While at it could you please fix those two paragraphs to use tabs for 
indentation, as the rest of the doc? Thanks!

>   
>   	**bpftool prog help**
>   		  Print short help message.
> diff --git a/tools/bpf/bpftool/bash-completion/bpftool b/tools/bpf/bpftool/bash-completion/bpftool
> index 3f78e6404589..ad0fc919f7ec 100644
> --- a/tools/bpf/bpftool/bash-completion/bpftool
> +++ b/tools/bpf/bpftool/bash-completion/bpftool
> @@ -243,7 +243,7 @@ _bpftool()
>       # Completion depends on object and command in use
>       case $object in
>           prog)
> -            if [[ $command != "load" ]]; then
> +            if [[ $command != "load" && $command != "loadall" ]]; then
>                   case $prev in
>                       id)
>                           _bpftool_get_prog_ids
> @@ -299,7 +299,7 @@ _bpftool()
>                       fi
>   
>                       if [[ ${#words[@]} == 6 ]]; then
> -                        COMPREPLY=( $( compgen -W "msg_verdict skb_verdict skb_parse" -- "$cur" ) )
> +                        COMPREPLY=( $( compgen -W "msg_verdict skb_verdict skb_parse flow_dissector" -- "$cur" ) )
>                           return 0
>                       fi
>   
> @@ -309,7 +309,7 @@ _bpftool()
>                       fi
>                       return 0
>                       ;;
> -                load)
> +                load|loadall)
>                       local obj
>   
>                       if [[ ${#words[@]} -lt 6 ]]; then

You also want to update completion for the program types, at line 341 or 
so. Feel free to split that list on several lines, by the way :).

> diff --git a/tools/bpf/bpftool/common.c b/tools/bpf/bpftool/common.c
> index 25af85304ebe..f671a921dec5 100644
> --- a/tools/bpf/bpftool/common.c
> +++ b/tools/bpf/bpftool/common.c
> @@ -169,34 +169,24 @@ int open_obj_pinned_any(char *path, enum bpf_obj_type exp_type)
>   	return fd;
>   }
>   
> -int do_pin_fd(int fd, const char *name)
> +int mount_bpffs_for_pin(const char *name)
>   {
>   	char err_str[ERR_MAX_LEN];
>   	char *file;
>   	char *dir;
>   	int err = 0;
>   
> -	err = bpf_obj_pin(fd, name);
> -	if (!err)
> -		goto out;
> -
>   	file = malloc(strlen(name) + 1);
>   	strcpy(file, name);
>   	dir = dirname(file);
>   
> -	if (errno != EPERM || is_bpffs(dir)) {
> -		p_err("can't pin the object (%s): %s", name, strerror(errno));
> +	if (is_bpffs(dir)) {
> +		/* nothing to do if already mounted */
>   		goto out_free;
>   	}

Nitpick: unnecessary brackets.

>   
> -	/* Attempt to mount bpffs, then retry pinning. */
>   	err = mnt_bpffs(dir, err_str, ERR_MAX_LEN);
> -	if (!err) {
> -		err = bpf_obj_pin(fd, name);
> -		if (err)
> -			p_err("can't pin the object (%s): %s", name,
> -			      strerror(errno));
> -	} else {
> +	if (err) {
>   		err_str[ERR_MAX_LEN - 1] = '\0';
>   		p_err("can't mount BPF file system to pin the object (%s): %s",
>   		      name, err_str);
> @@ -204,10 +194,20 @@ int do_pin_fd(int fd, const char *name)
>   
>   out_free:
>   	free(file);
> -out:
>   	return err;
>   }
>   
> +int do_pin_fd(int fd, const char *name)
> +{
> +	int err;
> +
> +	err = mount_bpffs_for_pin(name);
> +	if (err)
> +		return err;
> +
> +	return bpf_obj_pin(fd, name);
> +}
> +
>   int do_pin_any(int argc, char **argv, int (*get_fd_by_id)(__u32))
>   {
>   	unsigned int id;
> diff --git a/tools/bpf/bpftool/main.h b/tools/bpf/bpftool/main.h
> index 28322ace2856..1383824c9baf 100644
> --- a/tools/bpf/bpftool/main.h
> +++ b/tools/bpf/bpftool/main.h
> @@ -129,6 +129,7 @@ const char *get_fd_type_name(enum bpf_obj_type type);
>   char *get_fdinfo(int fd, const char *key);
>   int open_obj_pinned(char *path);
>   int open_obj_pinned_any(char *path, enum bpf_obj_type exp_type);
> +int mount_bpffs_for_pin(const char *name);
>   int do_pin_any(int argc, char **argv, int (*get_fd_by_id)(__u32));
>   int do_pin_fd(int fd, const char *name);
>   
> diff --git a/tools/bpf/bpftool/prog.c b/tools/bpf/bpftool/prog.c
> index 5302ee282409..a4346dd673b1 100644
> --- a/tools/bpf/bpftool/prog.c
> +++ b/tools/bpf/bpftool/prog.c
> @@ -81,6 +81,7 @@ static const char * const attach_type_strings[] = {
>   	[BPF_SK_SKB_STREAM_PARSER] = "stream_parser",
>   	[BPF_SK_SKB_STREAM_VERDICT] = "stream_verdict",
>   	[BPF_SK_MSG_VERDICT] = "msg_verdict",
> +	[BPF_FLOW_DISSECTOR] = "flow_dissector",
>   	[__MAX_BPF_ATTACH_TYPE] = NULL,
>   };
>   
> @@ -724,10 +725,11 @@ int map_replace_compar(const void *p1, const void *p2)
>   static int do_attach(int argc, char **argv)
>   {
>   	enum bpf_attach_type attach_type;
> -	int err, mapfd, progfd;
> +	int err, progfd;
> +	int mapfd = 0;
>   
> -	if (!REQ_ARGS(5)) {
> -		p_err("too few parameters for map attach");
> +	if (!REQ_ARGS(3)) {
> +		p_err("too few parameters for attach");
>   		return -EINVAL;
>   	}
>   
> @@ -740,11 +742,17 @@ static int do_attach(int argc, char **argv)
>   		p_err("invalid attach type");
>   		return -EINVAL;
>   	}
> -	NEXT_ARG();
> +	if (attach_type != BPF_FLOW_DISSECTOR) {
> +		NEXT_ARG();
> +		if (!REQ_ARGS(2)) {
> +			p_err("too few parameters for map attach");
> +			return -EINVAL;
> +		}
>   
> -	mapfd = map_parse_fd(&argc, &argv);
> -	if (mapfd < 0)
> -		return mapfd;
> +		mapfd = map_parse_fd(&argc, &argv);
> +		if (mapfd < 0)
> +			return mapfd;
> +	}
>   
>   	err = bpf_prog_attach(progfd, mapfd, attach_type, 0);
>   	if (err) {
> @@ -760,10 +768,11 @@ static int do_attach(int argc, char **argv)
>   static int do_detach(int argc, char **argv)
>   {
>   	enum bpf_attach_type attach_type;
> -	int err, mapfd, progfd;
> +	int err, progfd;
> +	int mapfd = 0;
>   
> -	if (!REQ_ARGS(5)) {
> -		p_err("too few parameters for map detach");
> +	if (!REQ_ARGS(3)) {
> +		p_err("too few parameters for detach");
>   		return -EINVAL;
>   	}
>   
> @@ -776,11 +785,17 @@ static int do_detach(int argc, char **argv)
>   		p_err("invalid attach type");
>   		return -EINVAL;
>   	}
> -	NEXT_ARG();
> +	if (attach_type != BPF_FLOW_DISSECTOR) {
> +		NEXT_ARG();
> +		if (!REQ_ARGS(2)) {
> +			p_err("too few parameters for map detach");
> +			return -EINVAL;
> +		}

Would that make sense to factor argument checks or parsing for 
do_attach() and do_detach() to some extent? In order to reduce the 
number of attach-type-based exceptions to add in the code if we have 
other attach types that do not take maps in the future.

>   
> -	mapfd = map_parse_fd(&argc, &argv);
> -	if (mapfd < 0)
> -		return mapfd;
> +		mapfd = map_parse_fd(&argc, &argv);
> +		if (mapfd < 0)
> +			return mapfd;
> +	}
>   
>   	err = bpf_prog_detach2(progfd, mapfd, attach_type);
>   	if (err) {
> @@ -792,15 +807,16 @@ static int do_detach(int argc, char **argv)
>   		jsonw_null(json_wtr);
>   	return 0;
>   }
> -static int do_load(int argc, char **argv)
> +
> +static int load_with_options(int argc, char **argv, bool first_prog_only)
>   {
>   	enum bpf_attach_type expected_attach_type;
>   	struct bpf_object_open_attr attr = {
>   		.prog_type	= BPF_PROG_TYPE_UNSPEC,
>   	};
>   	struct map_replace *map_replace = NULL;
> +	struct bpf_program *prog = NULL, *pos;
>   	unsigned int old_map_fds = 0;
> -	struct bpf_program *prog;
>   	struct bpf_object *obj;
>   	struct bpf_map *map;
>   	const char *pinfile;
> @@ -918,14 +934,20 @@ static int do_load(int argc, char **argv)
>   		goto err_free_reuse_maps;
>   	}
>   
> -	prog = bpf_program__next(NULL, obj);
> -	if (!prog) {
> -		p_err("object file doesn't contain any bpf program");
> -		goto err_close_obj;
> +	if (first_prog_only) {
> +		prog = bpf_program__next(NULL, obj);
> +		if (!prog) {
> +			p_err("object file doesn't contain any bpf program");
> +			goto err_close_obj;
> +		}
>   	}
>   
> -	bpf_program__set_ifindex(prog, ifindex);
>   	if (attr.prog_type == BPF_PROG_TYPE_UNSPEC) {
> +		if (!prog) {
> +			p_err("can not guess program type when loading all programs\n");
> +			goto err_close_obj;
> +		}
> +
>   		const char *sec_name = bpf_program__title(prog, false);
>   
>   		err = libbpf_prog_type_by_name(sec_name, &attr.prog_type,
> @@ -936,8 +958,13 @@ static int do_load(int argc, char **argv)
>   			goto err_close_obj;
>   		}
>   	}
> -	bpf_program__set_type(prog, attr.prog_type);
> -	bpf_program__set_expected_attach_type(prog, expected_attach_type);
> +
> +	bpf_object__for_each_program(pos, obj) {
> +		bpf_program__set_ifindex(pos, ifindex);
> +		bpf_program__set_type(pos, attr.prog_type);
> +		bpf_program__set_expected_attach_type(pos,
> +						      expected_attach_type);
> +	}

I still believe you can have programs of different types here, and be 
able to load them. I tried it and managed to have it working fine. If no 
type is provided from command line we can retrieve types for each 
program from its section name. If a type is provided on the command 
line, we can do the same, but I am not sure we should do it, or impose 
that type for all programs instead.

>   
>   	qsort(map_replace, old_map_fds, sizeof(*map_replace),
>   	      map_replace_compar);
> @@ -1001,9 +1028,25 @@ static int do_load(int argc, char **argv)
>   		goto err_close_obj;
>   	}
>   
> -	if (do_pin_fd(bpf_program__fd(prog), pinfile))
> +	err = mount_bpffs_for_pin(pinfile);
> +	if (err)
>   		goto err_close_obj;
>   
> +	if (prog) {

Nit: Maybe "if (first_prog_only) {" instead? If I understand correctly, 
at this stage it should be equivalent, but in my opinion it would make 
it easier to understand why we have two cases here.

> +		err = bpf_obj_pin(bpf_program__fd(prog), pinfile);
> +		if (err) {
> +			p_err("failed to pin program %s",
> +			      bpf_program__title(prog, false));
> +			goto err_close_obj;
> +		}
> +	} else {
> +		err = bpf_object__pin(obj, pinfile);
> +		if (err) {
> +			p_err("failed to pin all programs");
> +			goto err_close_obj;
> +		}
> +	}
> +
>   	if (json_output)
>   		jsonw_null(json_wtr);
>   
> @@ -1023,6 +1066,16 @@ static int do_load(int argc, char **argv)
>   	return -1;
>   }
>   
> +static int do_load(int argc, char **argv)
> +{
> +	return load_with_options(argc, argv, true);
> +}
> +
> +static int do_loadall(int argc, char **argv)
> +{
> +	return load_with_options(argc, argv, false);
> +}
> +
>   static int do_help(int argc, char **argv)
>   {
>   	if (json_output) {
> @@ -1035,10 +1088,11 @@ static int do_help(int argc, char **argv)
>   		"       %s %s dump xlated PROG [{ file FILE | opcodes | visual }]\n"
>   		"       %s %s dump jited  PROG [{ file FILE | opcodes }]\n"
>   		"       %s %s pin   PROG FILE\n"
> -		"       %s %s load  OBJ  FILE [type TYPE] [dev NAME] \\\n"
> +		"       %s %s { load | loadall } OBJ  FILE \\\n"
> +		"                         [type TYPE] [dev NAME] \\\n"
>   		"                         [map { idx IDX | name NAME } MAP]\n"
> -		"       %s %s attach PROG ATTACH_TYPE MAP\n"
> -		"       %s %s detach PROG ATTACH_TYPE MAP\n"
> +		"       %s %s attach PROG ATTACH_TYPE [MAP]\n"
> +		"       %s %s detach PROG ATTACH_TYPE [MAP]\n"
>   		"       %s %s help\n"
>   		"\n"
>   		"       " HELP_SPEC_MAP "\n"
> @@ -1050,7 +1104,8 @@ static int do_help(int argc, char **argv)
>   		"                 cgroup/bind4 | cgroup/bind6 | cgroup/post_bind4 |\n"
>   		"                 cgroup/post_bind6 | cgroup/connect4 | cgroup/connect6 |\n"
>   		"                 cgroup/sendmsg4 | cgroup/sendmsg6 }\n"
> -		"       ATTACH_TYPE := { msg_verdict | skb_verdict | skb_parse }\n"
> +		"       ATTACH_TYPE := { msg_verdict | skb_verdict | skb_parse |\n"
> +		"                        flow_dissector }\n"
>   		"       " HELP_SPEC_OPTIONS "\n"
>   		"",
>   		bin_name, argv[-2], bin_name, argv[-2], bin_name, argv[-2],
> @@ -1067,6 +1122,7 @@ static const struct cmd cmds[] = {
>   	{ "dump",	do_dump },
>   	{ "pin",	do_pin },
>   	{ "load",	do_load },
> +	{ "loadall",	do_loadall },
>   	{ "attach",	do_attach },
>   	{ "detach",	do_detach },
>   	{ 0 }
> 

^ permalink raw reply

* Re: [net-next 12/12] igc: Clean up code
From: Joe Perches @ 2018-11-08 11:00 UTC (permalink / raw)
  To: Jeff Kirsher, davem; +Cc: Sasha Neftin, netdev, nhorman, sassmann
In-Reply-To: <20181107224830.9737-13-jeffrey.t.kirsher@intel.com>

On Wed, 2018-11-07 at 14:48 -0800, Jeff Kirsher wrote:
> From: Sasha Neftin <sasha.neftin@intel.com>
> 
> Address few community comments.
> Remove unused code, will be added per demand.
> Remove blank lines and unneeded includes.
> 
> Signed-off-by: Sasha Neftin <sasha.neftin@intel.com>
> Tested-by: Aaron Brown <aaron.f.brown@intel.com>
> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
> ---
>  drivers/net/ethernet/intel/igc/igc.h      |  9 ---------
>  drivers/net/ethernet/intel/igc/igc_main.c | 15 ---------------
>  2 files changed, 24 deletions(-)
> 
> diff --git a/drivers/net/ethernet/intel/igc/igc.h b/drivers/net/ethernet/intel/igc/igc.h
[]
>  #define IGC_ERR(args...) pr_err("igc: " args)

This is used just once and should probably be removed.

maybe:

---
 drivers/net/ethernet/intel/igc/igc_main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/igc/igc_main.c b/drivers/net/ethernet/intel/igc/igc_main.c
index 9d85707e8a81..b58542b20623 100644
--- a/drivers/net/ethernet/intel/igc/igc_main.c
+++ b/drivers/net/ethernet/intel/igc/igc_main.c
@@ -3551,7 +3551,7 @@ static int igc_probe(struct pci_dev *pdev,
 			err = dma_set_coherent_mask(&pdev->dev,
 						    DMA_BIT_MASK(32));
 			if (err) {
-				IGC_ERR("Wrong DMA configuration, aborting\n");
+				dev_err(&pdev->dev, "igc: Wrong DMA configuration, aborting\n");
 				goto err_dma;
 			}
 		}

^ permalink raw reply related

* [PATCH v2 net-next 3/4] net: ethernet: ti: cpsw: fix vlan mcast
From: Ivan Khoronzhuk @ 2018-11-08 20:27 UTC (permalink / raw)
  To: grygorii.strashko, davem
  Cc: linux-omap, netdev, linux-kernel, alexander.h.duyck, bjorn,
	Ivan Khoronzhuk
In-Reply-To: <20181108202757.30110-1-ivan.khoronzhuk@linaro.org>

At this moment, mcast addresses are added for real device only
(reserved vlans for dual-emac mode), even if a mcast address was added
for some vlan only, thus ALE doesn't have corresponding vlan mcast
entries after vlan socket joined multicast group. So ALE drops vlan
frames with mcast addresses intended for vlans and potentially can
receive mcast frames for base ndev. That's not correct. So, fix it by
creating only vlan/mcast entries as requested. Patch doesn't use any
additional lists and is based on device mc address list and cpsw ALE
table entries.

Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
---
 drivers/net/ethernet/ti/cpsw.c | 169 +++++++++++++++++++++++++++------
 1 file changed, 138 insertions(+), 31 deletions(-)

diff --git a/drivers/net/ethernet/ti/cpsw.c b/drivers/net/ethernet/ti/cpsw.c
index 500f7ed8c58c..0b18634d336c 100644
--- a/drivers/net/ethernet/ti/cpsw.c
+++ b/drivers/net/ethernet/ti/cpsw.c
@@ -570,21 +570,6 @@ static inline int cpsw_get_slave_port(u32 slave_num)
 	return slave_num + 1;
 }
 
-static void cpsw_add_mcast(struct cpsw_priv *priv, const u8 *addr)
-{
-	struct cpsw_common *cpsw = priv->cpsw;
-
-	if (cpsw->data.dual_emac) {
-		struct cpsw_slave *slave = cpsw->slaves + priv->emac_port;
-
-		cpsw_ale_add_mcast(cpsw->ale, addr, ALE_PORT_HOST,
-				   ALE_VLAN, slave->port_vlan, 0);
-		return;
-	}
-
-	cpsw_ale_add_mcast(cpsw->ale, addr, ALE_ALL_PORTS, 0, 0, 0);
-}
-
 static void cpsw_set_promiscious(struct net_device *ndev, bool enable)
 {
 	struct cpsw_common *cpsw = ndev_to_cpsw(ndev);
@@ -640,7 +625,7 @@ static void cpsw_set_promiscious(struct net_device *ndev, bool enable)
 
 			/* Clear all mcast from ALE */
 			cpsw_ale_flush_multicast(ale, ALE_ALL_PORTS, -1);
-			__dev_mc_unsync(ndev, NULL);
+			__hw_addr_ref_unsync_dev(&ndev->mc, ndev, NULL);
 
 			/* Flood All Unicast Packets to Host port */
 			cpsw_ale_control_set(ale, 0, ALE_P0_UNI_FLOOD, 1);
@@ -661,29 +646,148 @@ static void cpsw_set_promiscious(struct net_device *ndev, bool enable)
 	}
 }
 
-static int cpsw_add_mc_addr(struct net_device *ndev, const u8 *addr)
+struct addr_sync_ctx {
+	struct net_device *ndev;
+	const u8 *addr;		/* address to be synched */
+	int consumed;		/* number of address instances */
+	int flush;		/* flush flag */
+};
+
+/**
+ * cpsw_set_mc - adds multicast entry to the table if it's not added or deletes
+ * if it's not deleted
+ * @ndev: device to sync
+ * @addr: address to be added or deleted
+ * @vid: vlan id, if vid < 0 set/unset address for real device
+ * @add: add address if the flag is set or remove otherwise
+ */
+static int cpsw_set_mc(struct net_device *ndev, const u8 *addr,
+		       int vid, int add)
 {
 	struct cpsw_priv *priv = netdev_priv(ndev);
+	struct cpsw_common *cpsw = priv->cpsw;
+	int mask, flags, ret;
+
+	if (vid < 0) {
+		if (cpsw->data.dual_emac)
+			vid = cpsw->slaves[priv->emac_port].port_vlan;
+		else
+			vid = 0;
+	}
+
+	mask = cpsw->data.dual_emac ? ALE_PORT_HOST : ALE_ALL_PORTS;
+	flags = vid ? ALE_VLAN : 0;
+
+	if (add)
+		ret = cpsw_ale_add_mcast(cpsw->ale, addr, mask, flags, vid, 0);
+	else
+		ret = cpsw_ale_del_mcast(cpsw->ale, addr, 0, flags, vid);
+
+	return ret;
+}
+
+static int cpsw_update_vlan_mc(struct net_device *vdev, int vid, void *ctx)
+{
+	struct addr_sync_ctx *sync_ctx = ctx;
+	struct netdev_hw_addr *ha;
+	int found = 0, ret = 0;
+
+	if (!vdev || !(vdev->flags & IFF_UP))
+		return 0;
+
+	/* vlan address is relevant if its sync_cnt != 0 */
+	netdev_for_each_mc_addr(ha, vdev) {
+		if (ether_addr_equal(ha->addr, sync_ctx->addr)) {
+			found = ha->sync_cnt;
+			break;
+		}
+	}
+
+	if (found)
+		sync_ctx->consumed++;
+
+	if (sync_ctx->flush) {
+		if (!found)
+			cpsw_set_mc(sync_ctx->ndev, sync_ctx->addr, vid, 0);
+		return 0;
+	}
+
+	if (found)
+		ret = cpsw_set_mc(sync_ctx->ndev, sync_ctx->addr, vid, 1);
+
+	return ret;
+}
+
+static int cpsw_add_mc_addr(struct net_device *ndev, const u8 *addr, int num)
+{
+	struct addr_sync_ctx sync_ctx;
+	int ret;
+
+	sync_ctx.consumed = 0;
+	sync_ctx.addr = addr;
+	sync_ctx.ndev = ndev;
+	sync_ctx.flush = 0;
+
+	ret = vlan_for_each(ndev, cpsw_update_vlan_mc, &sync_ctx);
+	if (sync_ctx.consumed < num && !ret)
+		ret = cpsw_set_mc(ndev, addr, -1, 1);
+
+	return ret;
+}
+
+static int cpsw_del_mc_addr(struct net_device *ndev, const u8 *addr, int num)
+{
+	struct addr_sync_ctx sync_ctx;
+
+	sync_ctx.consumed = 0;
+	sync_ctx.addr = addr;
+	sync_ctx.ndev = ndev;
+	sync_ctx.flush = 1;
+
+	vlan_for_each(ndev, cpsw_update_vlan_mc, &sync_ctx);
+	if (sync_ctx.consumed == num)
+		cpsw_set_mc(ndev, addr, -1, 0);
 
-	cpsw_add_mcast(priv, addr);
 	return 0;
 }
 
-static int cpsw_del_mc_addr(struct net_device *ndev, const u8 *addr)
+static int cpsw_purge_vlan_mc(struct net_device *vdev, int vid, void *ctx)
 {
-	struct cpsw_priv *priv = netdev_priv(ndev);
-	struct cpsw_common *cpsw = priv->cpsw;
-	int vid, flags;
+	struct addr_sync_ctx *sync_ctx = ctx;
+	struct netdev_hw_addr *ha;
+	int found = 0;
 
-	if (cpsw->data.dual_emac) {
-		vid = cpsw->slaves[priv->emac_port].port_vlan;
-		flags = ALE_VLAN;
-	} else {
-		vid = 0;
-		flags = 0;
+	if (!vdev || !(vdev->flags & IFF_UP))
+		return 0;
+
+	/* vlan address is relevant if its sync_cnt != 0 */
+	netdev_for_each_mc_addr(ha, vdev) {
+		if (ether_addr_equal(ha->addr, sync_ctx->addr)) {
+			found = ha->sync_cnt;
+			break;
+		}
 	}
 
-	cpsw_ale_del_mcast(cpsw->ale, addr, 0, flags, vid);
+	if (!found)
+		return 0;
+
+	sync_ctx->consumed++;
+	cpsw_set_mc(sync_ctx->ndev, sync_ctx->addr, vid, 0);
+	return 0;
+}
+
+static int cpsw_purge_all_mc(struct net_device *ndev, const u8 *addr, int num)
+{
+	struct addr_sync_ctx sync_ctx;
+
+	sync_ctx.addr = addr;
+	sync_ctx.ndev = ndev;
+	sync_ctx.consumed = 0;
+
+	vlan_for_each(ndev, cpsw_purge_vlan_mc, &sync_ctx);
+	if (sync_ctx.consumed < num)
+		cpsw_set_mc(ndev, addr, -1, 0);
+
 	return 0;
 }
 
@@ -704,7 +808,9 @@ static void cpsw_ndo_set_rx_mode(struct net_device *ndev)
 	/* Restore allmulti on vlans if necessary */
 	cpsw_ale_set_allmulti(cpsw->ale, ndev->flags & IFF_ALLMULTI);
 
-	__dev_mc_sync(ndev, cpsw_add_mc_addr, cpsw_del_mc_addr);
+	/* add/remove mcast address either for real netdev or for vlan */
+	__hw_addr_ref_sync_dev(&ndev->mc, ndev, cpsw_add_mc_addr,
+			       cpsw_del_mc_addr);
 }
 
 static void cpsw_intr_enable(struct cpsw_common *cpsw)
@@ -1964,7 +2070,7 @@ static int cpsw_ndo_stop(struct net_device *ndev)
 	struct cpsw_common *cpsw = priv->cpsw;
 
 	cpsw_info(priv, ifdown, "shutting down cpsw device\n");
-	__dev_mc_unsync(priv->ndev, cpsw_del_mc_addr);
+	__hw_addr_ref_unsync_dev(&ndev->mc, ndev, cpsw_purge_all_mc);
 	netif_tx_stop_all_queues(priv->ndev);
 	netif_carrier_off(priv->ndev);
 
@@ -2415,6 +2521,7 @@ static int cpsw_ndo_vlan_rx_kill_vid(struct net_device *ndev,
 				  HOST_PORT_NUM, ALE_VLAN, vid);
 	ret |= cpsw_ale_del_mcast(cpsw->ale, priv->ndev->broadcast,
 				  0, ALE_VLAN, vid);
+	ret |= cpsw_ale_flush_multicast(cpsw->ale, 0, vid);
 err:
 	pm_runtime_put(cpsw->dev);
 	return ret;
-- 
2.17.1

^ permalink raw reply related

* [PATCH v2 net-next 2/4] net: 8021q: vlan_core: allow use list of vlans for real device
From: Ivan Khoronzhuk @ 2018-11-08 20:27 UTC (permalink / raw)
  To: grygorii.strashko, davem
  Cc: linux-omap, netdev, linux-kernel, alexander.h.duyck, bjorn,
	Ivan Khoronzhuk
In-Reply-To: <20181108202757.30110-1-ivan.khoronzhuk@linaro.org>

It's redundancy for the drivers to hold the list of vlans when
absolutely the same list exists in vlan core. In most cases it's
needed only to traverse the vlan devices, their vids and sync some
settings with h/w, so add API to simplify this.

At least some of these drivers also can benefit:
grep "for_each.*vid" -r drivers/net/ethernet/

drivers/net/ethernet/hisilicon/hns3/hns3_enet.c:
drivers/net/ethernet/synopsys/dwc-xlgmac-hw.c:
drivers/net/ethernet/qlogic/qlge/qlge_main.c:
drivers/net/ethernet/qlogic/qlcnic/qlcnic_main.c:
drivers/net/ethernet/via/via-rhine.c:
drivers/net/ethernet/via/via-velocity.c:
drivers/net/ethernet/intel/igb/igb_main.c:
drivers/net/ethernet/intel/ice/ice_main.c:
drivers/net/ethernet/intel/e1000/e1000_main.c:
drivers/net/ethernet/intel/i40e/i40e_main.c:
drivers/net/ethernet/intel/e1000e/netdev.c:
drivers/net/ethernet/intel/igbvf/netdev.c:
drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c:
drivers/net/ethernet/intel/ixgb/ixgb_main.c:
drivers/net/ethernet/intel/ixgbe/ixgbe_main.c:
drivers/net/ethernet/amd/xgbe/xgbe-dev.c:
drivers/net/ethernet/emulex/benet/be_main.c:
drivers/net/ethernet/neterion/vxge/vxge-main.c:
drivers/net/ethernet/adaptec/starfire.c:
drivers/net/ethernet/brocade/bna/bnad.c:

Reviewed-by: Grygorii Strashko <grygorii.strashko@ti.com>
Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
---
 include/linux/if_vlan.h | 11 +++++++++++
 net/8021q/vlan_core.c   | 27 +++++++++++++++++++++++++++
 2 files changed, 38 insertions(+)

diff --git a/include/linux/if_vlan.h b/include/linux/if_vlan.h
index 03b08ffded07..1be5230921b5 100644
--- a/include/linux/if_vlan.h
+++ b/include/linux/if_vlan.h
@@ -133,6 +133,9 @@ struct vlan_pcpu_stats {
 
 extern struct net_device *__vlan_find_dev_deep_rcu(struct net_device *real_dev,
 					       __be16 vlan_proto, u16 vlan_id);
+extern int vlan_for_each(struct net_device *dev,
+			 int (*action)(struct net_device *dev, int vid,
+				       void *arg), void *arg);
 extern struct net_device *vlan_dev_real_dev(const struct net_device *dev);
 extern u16 vlan_dev_vlan_id(const struct net_device *dev);
 extern __be16 vlan_dev_vlan_proto(const struct net_device *dev);
@@ -236,6 +239,14 @@ __vlan_find_dev_deep_rcu(struct net_device *real_dev,
 	return NULL;
 }
 
+static inline int
+vlan_for_each(struct net_device *dev,
+	      int (*action)(struct net_device *dev, int vid, void *arg),
+	      void *arg)
+{
+	return 0;
+}
+
 static inline struct net_device *vlan_dev_real_dev(const struct net_device *dev)
 {
 	BUG();
diff --git a/net/8021q/vlan_core.c b/net/8021q/vlan_core.c
index 4f60e86f4b8d..6308b5427a66 100644
--- a/net/8021q/vlan_core.c
+++ b/net/8021q/vlan_core.c
@@ -223,6 +223,33 @@ static int vlan_kill_rx_filter_info(struct net_device *dev, __be16 proto, u16 vi
 		return -ENODEV;
 }
 
+int vlan_for_each(struct net_device *dev,
+		  int (*action)(struct net_device *dev, int vid, void *arg),
+		  void *arg)
+{
+	struct vlan_vid_info *vid_info;
+	struct vlan_info *vlan_info;
+	struct net_device *vdev;
+	int ret;
+
+	ASSERT_RTNL();
+
+	vlan_info = rtnl_dereference(dev->vlan_info);
+	if (!vlan_info)
+		return 0;
+
+	list_for_each_entry(vid_info, &vlan_info->vid_list, list) {
+		vdev = vlan_group_get_device(&vlan_info->grp, vid_info->proto,
+					     vid_info->vid);
+		ret = action(vdev, vid_info->vid, arg);
+		if (ret)
+			return ret;
+	}
+
+	return 0;
+}
+EXPORT_SYMBOL(vlan_for_each);
+
 int vlan_filter_push_vids(struct vlan_info *vlan_info, __be16 proto)
 {
 	struct net_device *real_dev = vlan_info->real_dev;
-- 
2.17.1

^ permalink raw reply related

* [PATCH v2 net-next 0/4] net: ethernet: ti: cpsw: fix vlan mcast
From: Ivan Khoronzhuk @ 2018-11-08 20:27 UTC (permalink / raw)
  To: grygorii.strashko, davem
  Cc: linux-omap, netdev, linux-kernel, alexander.h.duyck, bjorn,
	Ivan Khoronzhuk

The cpsw holds separate mcast entires for vlan entries. At this moment
driver adds only not vlan mcast addresses, omitting vlan/mcast entries.
As result mcast for vlans doesn't work. It can be fixed by adding same
mcast entries for every created vlan, but this patchseries uses more
sophisticated way and allows to create mcast entries only for vlans
that really require it. Generic functions from this series can be
reused for fixing vlan and macvlan unicast.

Simple example of ALE table before and after this series, having same
mcast entries as for vlan 100 as for real device (reserved vlan 2),
and one mcast address only for vlan 100 - 01:1b:19:00:00:00.

<---- Before this patchset ---->
vlan , vid = 2, untag_force = 0x5, reg_mcast = 0x5, mem_list = 0x5
mcast, vid = 2, addr = ff:ff:ff:ff:ff:ff, port_mask = 0x1
ucast, vid = 2, addr = 74:da:ea:47:7d:9d, persistant, port_num = 0x0
vlan , vid = 0, untag_force = 0x7, reg_mcast = 0x0, mem_list = 0x7
mcast, vid = 2, addr = 33:33:00:00:00:01, port_mask = 0x1
mcast, vid = 2, addr = 01:00:5e:00:00:01, port_mask = 0x1
vlan , vid = 1, untag_force = 0x3, reg_mcast = 0x3, mem_list = 0x3
mcast, vid = 1, addr = ff:ff:ff:ff:ff:ff, port_mask = 0x1
ucast, vid = 1, addr = 74:da:ea:47:7d:9c, persistant, port_num = 0x0
mcast, vid = 1, addr = 33:33:00:00:00:01, port_mask = 0x1
mcast, vid = 1, addr = 01:00:5e:00:00:01, port_mask = 0x1
mcast, vid = 2, addr = 01:80:c2:00:00:00, port_mask = 0x1
mcast, vid = 2, addr = 01:80:c2:00:00:03, port_mask = 0x1
mcast, vid = 2, addr = 01:80:c2:00:00:0e, port_mask = 0x1
mcast, vid = 1, addr = 01:80:c2:00:00:00, port_mask = 0x1
mcast, vid = 1, addr = 01:80:c2:00:00:03, port_mask = 0x1
mcast, vid = 1, addr = 01:80:c2:00:00:0e, port_mask = 0x1
mcast, vid = 2, addr = 33:33:ff:47:7d:9d, port_mask = 0x1
mcast, vid = 2, addr = 33:33:00:00:00:fb, port_mask = 0x1
mcast, vid = 2, addr = 33:33:00:01:00:03, port_mask = 0x1
mcast, vid = 1, addr = 33:33:ff:47:7d:9c, port_mask = 0x1
mcast, vid = 1, addr = 33:33:00:00:00:fb, port_mask = 0x1
mcast, vid = 1, addr = 33:33:00:01:00:03, port_mask = 0x1
mcast, vid = 1, addr = 01:00:5e:00:00:fb, port_mask = 0x1
mcast, vid = 1, addr = 01:00:5e:00:00:fc, port_mask = 0x1
vlan , vid = 100, untag_force = 0x0, reg_mcast = 0x5, mem_list = 0x5
ucast, vid = 100, addr = 74:da:ea:47:7d:9d, persistant, port_num = 0x0
mcast, vid = 100, addr = ff:ff:ff:ff:ff:ff, port_mask = 0x1
mcast, vid = 2, addr = 01:1b:19:00:00:00, port_mask = 0x1
			 ^^^
 Here mcast entry (ptpl2), has to be added only for vlan 100
 but added for reserved vlan 2...that's not enough.

<---- After this patchset ---->
vlan , vid = 2, untag_force = 0x5, reg_mcast = 0x5, mem_list = 0x5
mcast, vid = 2, addr = ff:ff:ff:ff:ff:ff, port_mask = 0x1
ucast, vid = 2, addr = 74:da:ea:47:7d:9d, persistant, port_num = 0x0
vlan , vid = 0, untag_force = 0x7, reg_mcast = 0x0, mem_list = 0x7
mcast, vid = 2, addr = 33:33:00:00:00:01, port_mask = 0x1
mcast, vid = 2, addr = 01:00:5e:00:00:01, port_mask = 0x1
vlan , vid = 1, untag_force = 0x3, reg_mcast = 0x3, mem_list = 0x3
mcast, vid = 1, addr = ff:ff:ff:ff:ff:ff, port_mask = 0x1
ucast, vid = 1, addr = 74:da:ea:47:7d:9c, persistant, port_num = 0x0
mcast, vid = 1, addr = 33:33:00:00:00:01, port_mask = 0x1
mcast, vid = 1, addr = 01:00:5e:00:00:01, port_mask = 0x1
mcast, vid = 2, addr = 01:80:c2:00:00:00, port_mask = 0x1
mcast, vid = 2, addr = 01:80:c2:00:00:03, port_mask = 0x1
mcast, vid = 2, addr = 01:80:c2:00:00:0e, port_mask = 0x1
mcast, vid = 1, addr = 01:80:c2:00:00:00, port_mask = 0x1
mcast, vid = 1, addr = 01:80:c2:00:00:03, port_mask = 0x1
mcast, vid = 1, addr = 01:80:c2:00:00:0e, port_mask = 0x1
mcast, vid = 2, addr = 33:33:ff:47:7d:9d, port_mask = 0x1
mcast, vid = 1, addr = 33:33:ff:47:7d:9c, port_mask = 0x1
mcast, vid = 2, addr = 33:33:00:00:00:fb, port_mask = 0x1
mcast, vid = 2, addr = 33:33:00:01:00:03, port_mask = 0x1
mcast, vid = 1, addr = 33:33:00:00:00:fb, port_mask = 0x1
mcast, vid = 1, addr = 33:33:00:01:00:03, port_mask = 0x1
vlan , vid = 100, untag_force = 0x0, reg_mcast = 0x5, mem_list = 0x5
ucast, vid = 100, addr = 74:da:ea:47:7d:9d, persistant, port_num = 0x0
mcast, vid = 100, addr = ff:ff:ff:ff:ff:ff, port_mask = 0x1
mcast, vid = 100, addr = 33:33:00:00:00:01, port_mask = 0x1
mcast, vid = 100, addr = 01:00:5e:00:00:01, port_mask = 0x1
mcast, vid = 100, addr = 33:33:ff:47:7d:9d, port_mask = 0x1
mcast, vid = 100, addr = 01:80:c2:00:00:00, port_mask = 0x1
mcast, vid = 100, addr = 01:80:c2:00:00:03, port_mask = 0x1
mcast, vid = 100, addr = 01:80:c2:00:00:0e, port_mask = 0x1
mcast, vid = 100, addr = 33:33:00:00:00:fb, port_mask = 0x1
mcast, vid = 100, addr = 33:33:00:01:00:03, port_mask = 0x1
mcast, vid = 100, addr = 01:1b:19:00:00:00, port_mask = 0x1
			 ^^^
    Here mcast entry (ptpl2), is added only for vlan 100
    as it should be.

Based on net-next/master

v2..v1:
  net: ethernet: ti: cpsw: fix vlan mcast
	- removed limit for legacy switch cpsw mode

Ivan Khoronzhuk (4):
  net: core: dev_addr_lists: add auxiliary func to handle reference
    address updates
  net: 8021q: vlan_core: allow use list of vlans for real device
  net: ethernet: ti: cpsw: fix vlan mcast
  net: ethernet: ti: cpsw: fix vlan configuration while down/up

 drivers/net/ethernet/ti/cpsw.c | 186 +++++++++++++++++++++++++++------
 include/linux/if_vlan.h        |  11 ++
 include/linux/netdevice.h      |  10 ++
 net/8021q/vlan_core.c          |  27 +++++
 net/core/dev_addr_lists.c      |  97 +++++++++++++++++
 5 files changed, 300 insertions(+), 31 deletions(-)

-- 
2.17.1

^ permalink raw reply

* Re: [PATCH net-next 05/10] ipv6: factor out protocol delivery helper
From: Sergei Shtylyov @ 2018-11-08 10:29 UTC (permalink / raw)
  To: Paolo Abeni, netdev
  Cc: David S. Miller, Willem de Bruijn, Steffen Klassert,
	Subash Abhinov Kasiviswanathan
In-Reply-To: <0c668ae869c111bbfce9c9ba04e5c6ab2ce3c10a.camel@redhat.com>

On 11/8/2018 1:13 PM, Paolo Abeni wrote:

>>> So that we can re-use it at the UDP level in the next patch
>>>
>>> rfc v3 -> v1:
>>>    - add the helper declaration into the ipv6 header
>>>
>>> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
>>> ---
>>>    include/net/ipv6.h   |  2 ++
>>>    net/ipv6/ip6_input.c | 28 ++++++++++++++++------------
>>>    2 files changed, 18 insertions(+), 12 deletions(-)
>>>
>>> diff --git a/include/net/ipv6.h b/include/net/ipv6.h
>>> index 829650540780..daf80863d3a5 100644
>>> --- a/include/net/ipv6.h
>>> +++ b/include/net/ipv6.h
>>
>> [...]
>>> @@ -319,28 +319,26 @@ void ipv6_list_rcv(struct list_head *head, struct packet_type *pt,
>>>    /*
>>>     *	Deliver the packet to the host
>>>     */
>>> -
>>> -
>>> -static int ip6_input_finish(struct net *net, struct sock *sk, struct sk_buff *skb)
>>> +void ip6_protocol_deliver_rcu(struct net *net, struct sk_buff *skb, int nexthdr,
>>> +			      bool have_final)
>>>    {
>>>    	const struct inet6_protocol *ipprot;
>>>    	struct inet6_dev *idev;
>>>    	unsigned int nhoff;
>>> -	int nexthdr;
>>>    	bool raw;
>>> -	bool have_final = false;
>>>    
>>>    	/*
>>>    	 *	Parse extension headers
>>>    	 */
>>>    
>>> -	rcu_read_lock();
>>>    resubmit:
>>>    	idev = ip6_dst_idev(skb_dst(skb));
>>> -	if (!pskb_pull(skb, skb_transport_offset(skb)))
>>> -		goto discard;
>>>    	nhoff = IP6CB(skb)->nhoff;
>>> -	nexthdr = skb_network_header(skb)[nhoff];
>>> +	if (!have_final) {
>>
>>      Haven't you removed this variable above?
>>
>>> +		if (!pskb_pull(skb, skb_transport_offset(skb)))
>>> +			goto discard;
>>> +		nexthdr = skb_network_header(skb)[nhoff];
>>
>>      And this?
> 
> Thanks for reviewing.
> 
> Both local variables are now function arguments (see the function
> signature, far above in this chunk).

    Oops, my bad... I'm sorry. :-)

> Cheers,
> 
> Paolo

MBR, Sergei

^ permalink raw reply

* Re: [PATCH net-next 05/10] ipv6: factor out protocol delivery helper
From: Paolo Abeni @ 2018-11-08 10:13 UTC (permalink / raw)
  To: Sergei Shtylyov, netdev
  Cc: David S. Miller, Willem de Bruijn, Steffen Klassert,
	Subash Abhinov Kasiviswanathan
In-Reply-To: <34872c3b-eba8-d7eb-22da-ee74b52b0610@cogentembedded.com>

Hi,

On Thu, 2018-11-08 at 12:01 +0300, Sergei Shtylyov wrote:
> On 11/7/2018 2:38 PM, Paolo Abeni wrote:
> 
> > So that we can re-use it at the UDP level in the next patch
> > 
> > rfc v3 -> v1:
> >   - add the helper declaration into the ipv6 header
> > 
> > Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> > ---
> >   include/net/ipv6.h   |  2 ++
> >   net/ipv6/ip6_input.c | 28 ++++++++++++++++------------
> >   2 files changed, 18 insertions(+), 12 deletions(-)
> > 
> > diff --git a/include/net/ipv6.h b/include/net/ipv6.h
> > index 829650540780..daf80863d3a5 100644
> > --- a/include/net/ipv6.h
> > +++ b/include/net/ipv6.h
> 
> [...]
> > @@ -319,28 +319,26 @@ void ipv6_list_rcv(struct list_head *head, struct packet_type *pt,
> >   /*
> >    *	Deliver the packet to the host
> >    */
> > -
> > -
> > -static int ip6_input_finish(struct net *net, struct sock *sk, struct sk_buff *skb)
> > +void ip6_protocol_deliver_rcu(struct net *net, struct sk_buff *skb, int nexthdr,
> > +			      bool have_final)
> >   {
> >   	const struct inet6_protocol *ipprot;
> >   	struct inet6_dev *idev;
> >   	unsigned int nhoff;
> > -	int nexthdr;
> >   	bool raw;
> > -	bool have_final = false;
> >   
> >   	/*
> >   	 *	Parse extension headers
> >   	 */
> >   
> > -	rcu_read_lock();
> >   resubmit:
> >   	idev = ip6_dst_idev(skb_dst(skb));
> > -	if (!pskb_pull(skb, skb_transport_offset(skb)))
> > -		goto discard;
> >   	nhoff = IP6CB(skb)->nhoff;
> > -	nexthdr = skb_network_header(skb)[nhoff];
> > +	if (!have_final) {
> 
>     Haven't you removed this variable above?
> 
> > +		if (!pskb_pull(skb, skb_transport_offset(skb)))
> > +			goto discard;
> > +		nexthdr = skb_network_header(skb)[nhoff];
> 
>     And this?

Thanks for reviewing. 

Both local variables are now function arguments (see the function
signature, far above in this chunk).

Cheers,

Paolo

^ permalink raw reply

* Re: [PATCH] net: phy: leds: Don't make our own link speed names
From: Kyle Roeschley @ 2018-11-08 19:32 UTC (permalink / raw)
  To: Florian Fainelli
  Cc: Andrew Lunn, David S . Miller, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <fa2751d5-a15c-6414-9b80-15f54fd9f135@gmail.com>

On Thu, Nov 08, 2018 at 09:59:53AM -0800, Florian Fainelli wrote:
> On 11/8/18 9:51 AM, Kyle Roeschley wrote:
> > The phy core provides a handy phy_speed_to_str() helper, so use that
> > instead of doing our own formatting of the different known link speeds.
> 
> In case the speed is not supported, phy_speed_to_str() would return
> "Unsupported (update phy-core.c)" which is bigger (by 21 characters)
> than name_suffix.
> 
> If you bumped name_suffix/PHY_LED_TRIGGER_SPEED_SUFFIX_SIZE to 11
> characters, that would be just large enough to accommodate for the
> "Unsupported" part of the string and that might be an acceptable
> solution in between.
> 

Thanks for catching that, I'll send a v2.

-- 
Kyle Roeschley
Software Engineer
National Instruments

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox