Netdev List


* Re: [PATCH net] selftests: fix timestamping Makefile
From: shuah @ 2019-02-13 18:17 UTC (permalink / raw)
  To: Deepa Dinamani; +Cc: willemb, netdev, linux-kselftest, shuah
In-Reply-To: <20190213170914.11991-1-deepa.kernel@gmail.com>

On 2/13/19 10:09 AM, Deepa Dinamani wrote:
> The clean target in the makefile conflicts with the generic
> kselftests lib.mk, and fails to properly remove the compiled
> test programs.
> 
> Remove the redundant rule; TEST_GEN_FILES will already be
> removed by the CLEAN macro in lib.mk.
> 
> Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
> ---
> 
> * Changes since v1: as per review comments
> 
>   tools/testing/selftests/networking/timestamping/Makefile | 3 ---
>   1 file changed, 3 deletions(-)
> 
> diff --git a/tools/testing/selftests/networking/timestamping/Makefile b/tools/testing/selftests/networking/timestamping/Makefile
> index 9050eeea5f5f..1de8bd8ccf5d 100644
> --- a/tools/testing/selftests/networking/timestamping/Makefile
> +++ b/tools/testing/selftests/networking/timestamping/Makefile
> @@ -9,6 +9,3 @@ all: $(TEST_PROGS)
>   top_srcdir = ../../../../..
>   KSFT_KHDR_INSTALL := 1
>   include ../../lib.mk
> -
> -clean:
> -	rm -fr $(TEST_GEN_FILES)
> 

Thanks for the patch.

Acked-by: Shuah Khan <shuah@kernel.org>

thanks,
-- Shuah

^ permalink raw reply

* [PATCH v2 bpf-next 2/2] tools: sync uapi/linux/if_link.h header
From: Andrii Nakryiko @ 2019-02-13 18:25 UTC (permalink / raw)
  To: andrii.nakryiko, netdev, kernel-team, yhs, ast, kafai, daniel,
	david.laight, acme
  Cc: Andrii Nakryiko
In-Reply-To: <20190213182554.2763867-1-andriin@fb.com>

Syncing if_link.h that got out of sync.

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
---
 tools/include/uapi/linux/if_link.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/tools/include/uapi/linux/if_link.h b/tools/include/uapi/linux/if_link.h
index d6533828123a..5b225ff63b48 100644
--- a/tools/include/uapi/linux/if_link.h
+++ b/tools/include/uapi/linux/if_link.h
@@ -925,6 +925,7 @@ enum {
 enum {
 	LINK_XSTATS_TYPE_UNSPEC,
 	LINK_XSTATS_TYPE_BRIDGE,
+	LINK_XSTATS_TYPE_BOND,
 	__LINK_XSTATS_TYPE_MAX
 };
 #define LINK_XSTATS_TYPE_MAX (__LINK_XSTATS_TYPE_MAX - 1)
-- 
2.17.1


^ permalink raw reply related

* [PATCH v2 bpf-next 0/2] tools/bpf: smaller clean ups
From: Andrii Nakryiko @ 2019-02-13 18:25 UTC (permalink / raw)
  To: andrii.nakryiko, netdev, kernel-team, yhs, ast, kafai, daniel,
	david.laight, acme
  Cc: Andrii Nakryiko

This patchset replaces bzero() with memset() and syncs if_link.h header
to suppress unsynchronized headers warning.

Andrii Nakryiko (2):
  tools/bpf: replace bzero with memset
  tools: sync uapi/linux/if_link.h header

 tools/include/uapi/linux/if_link.h |  1 +
 tools/lib/bpf/bpf.c                | 48 +++++++++++++++---------------
 tools/lib/bpf/btf.c                |  5 ++--
 tools/lib/bpf/libbpf.c             |  5 ++--
 4 files changed, 29 insertions(+), 30 deletions(-)

-- 
2.17.1


^ permalink raw reply

* [PATCH v2 bpf-next 1/2] tools/bpf: replace bzero with memset
From: Andrii Nakryiko @ 2019-02-13 18:25 UTC (permalink / raw)
  To: andrii.nakryiko, netdev, kernel-team, yhs, ast, kafai, daniel,
	david.laight, acme
  Cc: Andrii Nakryiko
In-Reply-To: <20190213182554.2763867-1-andriin@fb.com>

bzero() call is deprecated and superseded by memset().

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Reported-by: David Laight <david.laight@aculab.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
---
 tools/lib/bpf/bpf.c    | 48 +++++++++++++++++++++---------------------
 tools/lib/bpf/btf.c    |  5 ++---
 tools/lib/bpf/libbpf.c |  5 ++---
 3 files changed, 28 insertions(+), 30 deletions(-)

diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
index a5261f39e2bd..9cd015574e83 100644
--- a/tools/lib/bpf/bpf.c
+++ b/tools/lib/bpf/bpf.c
@@ -22,7 +22,7 @@
  */
 
 #include <stdlib.h>
-#include <strings.h>
+#include <string.h>
 #include <memory.h>
 #include <unistd.h>
 #include <asm/unistd.h>
@@ -228,7 +228,7 @@ int bpf_load_program_xattr(const struct bpf_load_program_attr *load_attr,
 
 	name_len = load_attr->name ? strlen(load_attr->name) : 0;
 
-	bzero(&attr, sizeof(attr));
+	memset(&attr, 0, sizeof(attr));
 	attr.prog_type = load_attr->prog_type;
 	attr.expected_attach_type = load_attr->expected_attach_type;
 	attr.insn_cnt = (__u32)load_attr->insns_cnt;
@@ -340,7 +340,7 @@ int bpf_verify_program(enum bpf_prog_type type, const struct bpf_insn *insns,
 {
 	union bpf_attr attr;
 
-	bzero(&attr, sizeof(attr));
+	memset(&attr, 0, sizeof(attr));
 	attr.prog_type = type;
 	attr.insn_cnt = (__u32)insns_cnt;
 	attr.insns = ptr_to_u64(insns);
@@ -360,7 +360,7 @@ int bpf_map_update_elem(int fd, const void *key, const void *value,
 {
 	union bpf_attr attr;
 
-	bzero(&attr, sizeof(attr));
+	memset(&attr, 0, sizeof(attr));
 	attr.map_fd = fd;
 	attr.key = ptr_to_u64(key);
 	attr.value = ptr_to_u64(value);
@@ -373,7 +373,7 @@ int bpf_map_lookup_elem(int fd, const void *key, void *value)
 {
 	union bpf_attr attr;
 
-	bzero(&attr, sizeof(attr));
+	memset(&attr, 0, sizeof(attr));
 	attr.map_fd = fd;
 	attr.key = ptr_to_u64(key);
 	attr.value = ptr_to_u64(value);
@@ -385,7 +385,7 @@ int bpf_map_lookup_elem_flags(int fd, const void *key, void *value, __u64 flags)
 {
 	union bpf_attr attr;
 
-	bzero(&attr, sizeof(attr));
+	memset(&attr, 0, sizeof(attr));
 	attr.map_fd = fd;
 	attr.key = ptr_to_u64(key);
 	attr.value = ptr_to_u64(value);
@@ -398,7 +398,7 @@ int bpf_map_lookup_and_delete_elem(int fd, const void *key, void *value)
 {
 	union bpf_attr attr;
 
-	bzero(&attr, sizeof(attr));
+	memset(&attr, 0, sizeof(attr));
 	attr.map_fd = fd;
 	attr.key = ptr_to_u64(key);
 	attr.value = ptr_to_u64(value);
@@ -410,7 +410,7 @@ int bpf_map_delete_elem(int fd, const void *key)
 {
 	union bpf_attr attr;
 
-	bzero(&attr, sizeof(attr));
+	memset(&attr, 0, sizeof(attr));
 	attr.map_fd = fd;
 	attr.key = ptr_to_u64(key);
 
@@ -421,7 +421,7 @@ int bpf_map_get_next_key(int fd, const void *key, void *next_key)
 {
 	union bpf_attr attr;
 
-	bzero(&attr, sizeof(attr));
+	memset(&attr, 0, sizeof(attr));
 	attr.map_fd = fd;
 	attr.key = ptr_to_u64(key);
 	attr.next_key = ptr_to_u64(next_key);
@@ -433,7 +433,7 @@ int bpf_obj_pin(int fd, const char *pathname)
 {
 	union bpf_attr attr;
 
-	bzero(&attr, sizeof(attr));
+	memset(&attr, 0, sizeof(attr));
 	attr.pathname = ptr_to_u64((void *)pathname);
 	attr.bpf_fd = fd;
 
@@ -444,7 +444,7 @@ int bpf_obj_get(const char *pathname)
 {
 	union bpf_attr attr;
 
-	bzero(&attr, sizeof(attr));
+	memset(&attr, 0, sizeof(attr));
 	attr.pathname = ptr_to_u64((void *)pathname);
 
 	return sys_bpf(BPF_OBJ_GET, &attr, sizeof(attr));
@@ -455,7 +455,7 @@ int bpf_prog_attach(int prog_fd, int target_fd, enum bpf_attach_type type,
 {
 	union bpf_attr attr;
 
-	bzero(&attr, sizeof(attr));
+	memset(&attr, 0, sizeof(attr));
 	attr.target_fd	   = target_fd;
 	attr.attach_bpf_fd = prog_fd;
 	attr.attach_type   = type;
@@ -468,7 +468,7 @@ int bpf_prog_detach(int target_fd, enum bpf_attach_type type)
 {
 	union bpf_attr attr;
 
-	bzero(&attr, sizeof(attr));
+	memset(&attr, 0, sizeof(attr));
 	attr.target_fd	 = target_fd;
 	attr.attach_type = type;
 
@@ -479,7 +479,7 @@ int bpf_prog_detach2(int prog_fd, int target_fd, enum bpf_attach_type type)
 {
 	union bpf_attr attr;
 
-	bzero(&attr, sizeof(attr));
+	memset(&attr, 0, sizeof(attr));
 	attr.target_fd	 = target_fd;
 	attr.attach_bpf_fd = prog_fd;
 	attr.attach_type = type;
@@ -493,7 +493,7 @@ int bpf_prog_query(int target_fd, enum bpf_attach_type type, __u32 query_flags,
 	union bpf_attr attr;
 	int ret;
 
-	bzero(&attr, sizeof(attr));
+	memset(&attr, 0, sizeof(attr));
 	attr.query.target_fd	= target_fd;
 	attr.query.attach_type	= type;
 	attr.query.query_flags	= query_flags;
@@ -514,7 +514,7 @@ int bpf_prog_test_run(int prog_fd, int repeat, void *data, __u32 size,
 	union bpf_attr attr;
 	int ret;
 
-	bzero(&attr, sizeof(attr));
+	memset(&attr, 0, sizeof(attr));
 	attr.test.prog_fd = prog_fd;
 	attr.test.data_in = ptr_to_u64(data);
 	attr.test.data_out = ptr_to_u64(data_out);
@@ -539,7 +539,7 @@ int bpf_prog_test_run_xattr(struct bpf_prog_test_run_attr *test_attr)
 	if (!test_attr->data_out && test_attr->data_size_out > 0)
 		return -EINVAL;
 
-	bzero(&attr, sizeof(attr));
+	memset(&attr, 0, sizeof(attr));
 	attr.test.prog_fd = test_attr->prog_fd;
 	attr.test.data_in = ptr_to_u64(test_attr->data_in);
 	attr.test.data_out = ptr_to_u64(test_attr->data_out);
@@ -559,7 +559,7 @@ int bpf_prog_get_next_id(__u32 start_id, __u32 *next_id)
 	union bpf_attr attr;
 	int err;
 
-	bzero(&attr, sizeof(attr));
+	memset(&attr, 0, sizeof(attr));
 	attr.start_id = start_id;
 
 	err = sys_bpf(BPF_PROG_GET_NEXT_ID, &attr, sizeof(attr));
@@ -574,7 +574,7 @@ int bpf_map_get_next_id(__u32 start_id, __u32 *next_id)
 	union bpf_attr attr;
 	int err;
 
-	bzero(&attr, sizeof(attr));
+	memset(&attr, 0, sizeof(attr));
 	attr.start_id = start_id;
 
 	err = sys_bpf(BPF_MAP_GET_NEXT_ID, &attr, sizeof(attr));
@@ -588,7 +588,7 @@ int bpf_prog_get_fd_by_id(__u32 id)
 {
 	union bpf_attr attr;
 
-	bzero(&attr, sizeof(attr));
+	memset(&attr, 0, sizeof(attr));
 	attr.prog_id = id;
 
 	return sys_bpf(BPF_PROG_GET_FD_BY_ID, &attr, sizeof(attr));
@@ -598,7 +598,7 @@ int bpf_map_get_fd_by_id(__u32 id)
 {
 	union bpf_attr attr;
 
-	bzero(&attr, sizeof(attr));
+	memset(&attr, 0, sizeof(attr));
 	attr.map_id = id;
 
 	return sys_bpf(BPF_MAP_GET_FD_BY_ID, &attr, sizeof(attr));
@@ -608,7 +608,7 @@ int bpf_btf_get_fd_by_id(__u32 id)
 {
 	union bpf_attr attr;
 
-	bzero(&attr, sizeof(attr));
+	memset(&attr, 0, sizeof(attr));
 	attr.btf_id = id;
 
 	return sys_bpf(BPF_BTF_GET_FD_BY_ID, &attr, sizeof(attr));
@@ -619,7 +619,7 @@ int bpf_obj_get_info_by_fd(int prog_fd, void *info, __u32 *info_len)
 	union bpf_attr attr;
 	int err;
 
-	bzero(&attr, sizeof(attr));
+	memset(&attr, 0, sizeof(attr));
 	attr.info.bpf_fd = prog_fd;
 	attr.info.info_len = *info_len;
 	attr.info.info = ptr_to_u64(info);
@@ -635,7 +635,7 @@ int bpf_raw_tracepoint_open(const char *name, int prog_fd)
 {
 	union bpf_attr attr;
 
-	bzero(&attr, sizeof(attr));
+	memset(&attr, 0, sizeof(attr));
 	attr.raw_tracepoint.name = ptr_to_u64(name);
 	attr.raw_tracepoint.prog_fd = prog_fd;
 
diff --git a/tools/lib/bpf/btf.c b/tools/lib/bpf/btf.c
index 6953fedb88ff..ade1c32fb083 100644
--- a/tools/lib/bpf/btf.c
+++ b/tools/lib/bpf/btf.c
@@ -4,7 +4,6 @@
 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>
-#include <strings.h>
 #include <unistd.h>
 #include <errno.h>
 #include <linux/err.h>
@@ -484,7 +483,7 @@ int btf__get_from_id(__u32 id, struct btf **btf)
 		goto exit_free;
 	}
 
-	bzero(ptr, last_size);
+	memset(ptr, 0, last_size);
 	btf_info.btf = ptr_to_u64(ptr);
 	err = bpf_obj_get_info_by_fd(btf_fd, &btf_info, &len);
 
@@ -498,7 +497,7 @@ int btf__get_from_id(__u32 id, struct btf **btf)
 			goto exit_free;
 		}
 		ptr = temp_ptr;
-		bzero(ptr, last_size);
+		memset(ptr, 0, last_size);
 		btf_info.btf = ptr_to_u64(ptr);
 		err = bpf_obj_get_info_by_fd(btf_fd, &btf_info, &len);
 	}
diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index e3c39edfb9d3..6ef7e6e4cbd3 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -18,7 +18,6 @@
 #include <libgen.h>
 #include <inttypes.h>
 #include <string.h>
-#include <strings.h>
 #include <unistd.h>
 #include <fcntl.h>
 #include <errno.h>
@@ -308,7 +307,7 @@ bpf_program__init(void *data, size_t size, char *section_name, int idx,
 		return -EINVAL;
 	}
 
-	bzero(prog, sizeof(*prog));
+	memset(prog, 0, sizeof(*prog));
 
 	prog->section_name = strdup(section_name);
 	if (!prog->section_name) {
@@ -1577,7 +1576,7 @@ bpf_program__load(struct bpf_program *prog,
 		struct bpf_prog_prep_result result;
 		bpf_program_prep_t preprocessor = prog->preprocessor;
 
-		bzero(&result, sizeof(result));
+		memset(&result, 0, sizeof(result));
 		err = preprocessor(prog, i, prog->insns,
 				   prog->insns_cnt, &result);
 		if (err) {
-- 
2.17.1


^ permalink raw reply related
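
To illustrate the bzero() to memset() conversion in the patch above, here is a
minimal sketch. The struct and function names (demo_attr, demo_attr_init) are
hypothetical stand-ins and are not part of libbpf; only memset() itself is a
real libc call. The point is that memset(p, 0, n) zeroes the same bytes that
bzero(p, n) did, and needs <string.h> rather than <strings.h>.

```c
#include <string.h>

/* Hypothetical stand-in for an attribute block such as union bpf_attr. */
struct demo_attr {
	unsigned int prog_type;
	unsigned int insn_cnt;
	unsigned long long insns;
};

/* Zero the whole object, then set only the fields the caller cares about. */
void demo_attr_init(struct demo_attr *attr, unsigned int prog_type)
{
	memset(attr, 0, sizeof(*attr));	/* was: bzero(attr, sizeof(*attr)); */
	attr->prog_type = prog_type;
}
```

Zeroing first means every field that is not set explicitly reaches the callee
as 0, which is the property the libbpf wrappers rely on before calling
sys_bpf().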

* Re: [PATCH v2 bpf-next 2/2] tools: sync uapi/linux/if_link.h header
From: Martin Lau @ 2019-02-13 18:36 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: andrii.nakryiko@gmail.com, netdev@vger.kernel.org, Kernel Team,
	Yonghong Song, Alexei Starovoitov, daniel@iogearbox.net,
	david.laight@aculab.com, acme@kernel.org
In-Reply-To: <20190213182554.2763867-3-andriin@fb.com>

On Wed, Feb 13, 2019 at 10:25:54AM -0800, Andrii Nakryiko wrote:
> Syncing if_link.h that got out of sync.
Acked-by: Martin KaFai Lau <kafai@fb.com>

^ permalink raw reply

* [PATCH net 0/2] net: phy: fix locking issue
From: Heiner Kallweit @ 2019-02-13 19:10 UTC (permalink / raw)
  To: Andrew Lunn, Florian Fainelli, David Miller
  Cc: Russell King - ARM Linux, netdev@vger.kernel.org

Russell pointed out that the locking used in phy_is_started() isn't
needed and is misleading. This locking also contributes to a race fixed
by patch 2.

Heiner Kallweit (2):
  net: phy: don't use locking in phy_is_started
  net: phy: fix potential race in the phylib state machine

 drivers/net/phy/phy.c | 13 +++++++------
 include/linux/phy.h   | 15 +--------------
 2 files changed, 8 insertions(+), 20 deletions(-)

-- 
2.20.1


^ permalink raw reply

* [PATCH net 1/2] net: phy: don't use locking in phy_is_started
From: Heiner Kallweit @ 2019-02-13 19:11 UTC (permalink / raw)
  To: Andrew Lunn, Florian Fainelli, David Miller
  Cc: Russell King - ARM Linux, netdev@vger.kernel.org
In-Reply-To: <2a39271d-3b9e-e425-98b4-b2a24074e806@gmail.com>

Russell suggested removing the locking from phy_is_started() because
the read is atomic anyway and the locking may actually be
misleading.

Fixes: 2b3e88ea6528 ("net: phy: improve phy state checking")
Suggested-by: Russell King - ARM Linux admin <linux@armlinux.org.uk>
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
---
 drivers/net/phy/phy.c | 11 +++++------
 include/linux/phy.h   | 15 +--------------
 2 files changed, 6 insertions(+), 20 deletions(-)

diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
index ca5e0c0f018c..602816d70281 100644
--- a/drivers/net/phy/phy.c
+++ b/drivers/net/phy/phy.c
@@ -553,7 +553,7 @@ int phy_start_aneg(struct phy_device *phydev)
 	if (err < 0)
 		goto out_unlock;
 
-	if (__phy_is_started(phydev)) {
+	if (phy_is_started(phydev)) {
 		if (phydev->autoneg == AUTONEG_ENABLE) {
 			err = phy_check_link_status(phydev);
 		} else {
@@ -709,7 +709,7 @@ void phy_stop_machine(struct phy_device *phydev)
 	cancel_delayed_work_sync(&phydev->state_queue);
 
 	mutex_lock(&phydev->lock);
-	if (__phy_is_started(phydev))
+	if (phy_is_started(phydev))
 		phydev->state = PHY_UP;
 	mutex_unlock(&phydev->lock);
 }
@@ -839,15 +839,14 @@ EXPORT_SYMBOL(phy_stop_interrupts);
  */
 void phy_stop(struct phy_device *phydev)
 {
-	mutex_lock(&phydev->lock);
-
-	if (!__phy_is_started(phydev)) {
+	if (!phy_is_started(phydev)) {
 		WARN(1, "called from state %s\n",
 		     phy_state_to_str(phydev->state));
-		mutex_unlock(&phydev->lock);
 		return;
 	}
 
+	mutex_lock(&phydev->lock);
+
 	if (phy_interrupt_is_valid(phydev))
 		phy_disable_interrupts(phydev);
 
diff --git a/include/linux/phy.h b/include/linux/phy.h
index ef20aeea10cc..127fcc9c3778 100644
--- a/include/linux/phy.h
+++ b/include/linux/phy.h
@@ -674,26 +674,13 @@ phy_lookup_setting(int speed, int duplex, const unsigned long *mask,
 size_t phy_speeds(unsigned int *speeds, size_t size,
 		  unsigned long *mask);
 
-static inline bool __phy_is_started(struct phy_device *phydev)
-{
-	WARN_ON(!mutex_is_locked(&phydev->lock));
-
-	return phydev->state >= PHY_UP;
-}
-
 /**
  * phy_is_started - Convenience function to check whether PHY is started
  * @phydev: The phy_device struct
  */
 static inline bool phy_is_started(struct phy_device *phydev)
 {
-	bool started;
-
-	mutex_lock(&phydev->lock);
-	started = __phy_is_started(phydev);
-	mutex_unlock(&phydev->lock);
-
-	return started;
+	return phydev->state >= PHY_UP;
 }
 
 void phy_resolve_aneg_linkmode(struct phy_device *phydev);
-- 
2.20.1



^ permalink raw reply related

* [PATCH net 2/2] net: phy: fix potential race in the phylib state machine
From: Heiner Kallweit @ 2019-02-13 19:12 UTC (permalink / raw)
  To: Andrew Lunn, Florian Fainelli, David Miller
  Cc: Russell King - ARM Linux, netdev@vger.kernel.org
In-Reply-To: <2a39271d-3b9e-e425-98b4-b2a24074e806@gmail.com>

Russell reported the following race in the phylib state machine
(quoting from his mail):

if (phy_polling_mode(phydev) && phy_is_started(phydev))
	phy_queue_state_machine(phydev, PHY_STATE_TIME);

state = PHY_UP
thread 0			thread 1
				phy_disconnect()
				+-phy_is_started()
phy_is_started()                |
				`-phy_stop()
				  +-phydev->state = PHY_HALTED
				  `-phy_stop_machine()
				    `-cancel_delayed_work_sync()
phy_queue_state_machine()
`-mod_delayed_work()

At this point, the phydev->state_queue() has been added back onto the
system workqueue despite phy_stop_machine() having been called and
cancel_delayed_work_sync() called on it.

Fix this by protecting the complete operation in thread 0.

Fixes: 2b3e88ea6528 ("net: phy: improve phy state checking")
Reported-by: Russell King - ARM Linux admin <linux@armlinux.org.uk>
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
---
 drivers/net/phy/phy.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
index 602816d70281..c5675df5fc6f 100644
--- a/drivers/net/phy/phy.c
+++ b/drivers/net/phy/phy.c
@@ -985,8 +985,10 @@ void phy_state_machine(struct work_struct *work)
 	 * state machine would be pointless and possibly error prone when
 	 * called from phy_disconnect() synchronously.
 	 */
+	mutex_lock(&phydev->lock);
 	if (phy_polling_mode(phydev) && phy_is_started(phydev))
 		phy_queue_state_machine(phydev, PHY_STATE_TIME);
+	mutex_unlock(&phydev->lock);
 }
 
 /**
-- 
2.20.1



^ permalink raw reply related
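
The race described in the patch above is a check-then-act problem: the state
test and the requeue must form one critical section. The following is a
userspace sketch of that idea only, using a pthread mutex in place of
phydev->lock. All names (demo_dev, demo_requeue_if_started, demo_stop) are
hypothetical, and the work_queued flag merely stands in for the delayed work
item; this is not the phylib code.

```c
#include <pthread.h>
#include <stdbool.h>

enum demo_state { DEMO_HALTED, DEMO_UP };

struct demo_dev {
	pthread_mutex_t lock;
	enum demo_state state;
	bool work_queued;	/* stands in for the delayed work item */
};

void demo_dev_init(struct demo_dev *d)
{
	pthread_mutex_init(&d->lock, NULL);
	d->state = DEMO_UP;
	d->work_queued = false;
}

/* Check the state and requeue under one lock, so a concurrent stop
 * cannot slip in between the check and the requeue. */
bool demo_requeue_if_started(struct demo_dev *d)
{
	bool queued = false;

	pthread_mutex_lock(&d->lock);
	if (d->state >= DEMO_UP) {
		d->work_queued = true;
		queued = true;
	}
	pthread_mutex_unlock(&d->lock);
	return queued;
}

/* Change the state under the lock, then cancel any queued work. */
void demo_stop(struct demo_dev *d)
{
	pthread_mutex_lock(&d->lock);
	d->state = DEMO_HALTED;
	pthread_mutex_unlock(&d->lock);

	/* Stands in for cancelling already-queued work after the state change. */
	pthread_mutex_lock(&d->lock);
	d->work_queued = false;
	pthread_mutex_unlock(&d->lock);
}
```

With this ordering, a requeue either happens before the state change (and is
cancelled afterwards) or observes the halted state and does nothing.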

* Re: Fw: [Bug 202561] BUG: Null pointer dereference in __skb_unlink()
From: Cong Wang @ 2019-02-13 19:30 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: Linux Kernel Network Developers, sharathkernel
In-Reply-To: <20190212144547.27dca239@shemminger-XPS-13-9360>

On Tue, Feb 12, 2019 at 6:10 PM Stephen Hemminger
<stephen@networkplumber.org> wrote:
> Original report from sharathkernel@gmail.com:
>
> NULL POINTER DEREFERENCE DURING __skb_unlink()
>
> In the function call, __skb_try_recv_from_queue() (net/core/datagram.c),
> skb_queue_walk() walks through the queue without checking if the next member in the queue has a valid next pointer/address. When a socket buffer has to be unlinked, __skb_unlink() is called.
>
>
>
> Inside __skb_unlink() function, it doesn't verify if skb->next has a valid address. skb->next is assigned and used, without verifying the value inside it.


It should always have a valid ->next pointer as it is in a doubly
linked list, where the last one simply points to the head of the
list. I don't see any problem in the code you quote here.


>
> What could be probable solution, in this scenario? Should we check if skb->next is not NULL, before calling __skb_unlink()?


Do you have a reproducer? Also, your crash report is incomplete,
it doesn't even show a kernel version... Is it 4.20.7? Is it tainted?
Please share the complete dmesg.


Thanks.

^ permalink raw reply
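
The reply above rests on the list being circular and doubly linked. As a
minimal sketch of why ->next is never NULL for a node that is on such a list,
consider the following. The names (demo_node, demo_list_*) are hypothetical
and this is not the sk_buff_head implementation; it only mirrors the general
shape of an unlink on a circular list.

```c
#include <stddef.h>

struct demo_node {
	struct demo_node *next;
	struct demo_node *prev;
};

/* An empty list is a head that points at itself, so next/prev are never NULL. */
void demo_list_init(struct demo_node *head)
{
	head->next = head;
	head->prev = head;
}

/* Append n just before the head, i.e. at the tail of the list. */
void demo_list_add_tail(struct demo_node *head, struct demo_node *n)
{
	n->next = head;
	n->prev = head->prev;
	head->prev->next = n;
	head->prev = n;
}

/* Remove n from whatever list it is on; its neighbours always exist. */
void demo_list_unlink(struct demo_node *n)
{
	struct demo_node *next = n->next;
	struct demo_node *prev = n->prev;

	n->next = NULL;		/* only the removed node loses its links */
	n->prev = NULL;
	next->prev = prev;
	prev->next = next;
}
```

The last element's next pointer is the head itself, so unlinking the tail
needs no NULL check. A NULL ->next therefore suggests a node that was already
unlinked or corrupted, which is why a reproducer and full dmesg are needed.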

* [PATCH net-next] net: ip6_gre: Give ERSPAN a fill_info link op of its own
From: Petr Machata @ 2019-02-13 19:31 UTC (permalink / raw)
  To: netdev@vger.kernel.org
  Cc: Petr Machata, davem@davemloft.net, kuznet@ms2.inr.ac.ru,
	yoshfuji@linux-ipv6.org, Lorenzo Bianconi

In commit c706863bc890 ("net: ip6_gre: always reports o_key to
userspace"), ip6gre and ip6gretap tunnels started reporting a TUNNEL_KEY
output flag even if one was not configured on the device.

When an okey-less ip6gre or ip6gretap netdevice is created, it initially
encapsulates the packets without okey. But any configuration change
(even a non-change such as setting TOS to an already-configured value)
then causes the okey flag from the reported configuration to be
circulated back to actual configuration. From that point on, the device
encapsulates packets with output key of 0.

The intention was to implement this behavior for ERSPAN devices, not for
all ip6gre devices. The ERSPAN netdevice should really have its own
fill_info callback. Add one.

Fixes: c706863bc890 ("net: ip6_gre: always reports o_key to userspace")
CC: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
Signed-off-by: Petr Machata <petrm@mellanox.com>
---
 net/ipv6/ip6_gre.c | 31 ++++++++++++++++++++++++-------
 1 file changed, 24 insertions(+), 7 deletions(-)

diff --git a/net/ipv6/ip6_gre.c b/net/ipv6/ip6_gre.c
index 65a4f96dc462..0a6087cffe54 100644
--- a/net/ipv6/ip6_gre.c
+++ b/net/ipv6/ip6_gre.c
@@ -2094,15 +2094,13 @@ static size_t ip6gre_get_size(const struct net_device *dev)
 		0;
 }
 
-static int ip6gre_fill_info(struct sk_buff *skb, const struct net_device *dev)
+static int __ip6gre_fill_info(struct sk_buff *skb,
+			      const struct net_device *dev,
+			      __be16 base_o_flags)
 {
 	struct ip6_tnl *t = netdev_priv(dev);
 	struct __ip6_tnl_parm *p = &t->parms;
-	__be16 o_flags = p->o_flags;
-
-	if ((p->erspan_ver == 1 || p->erspan_ver == 2) &&
-	    !p->collect_md)
-		o_flags |= TUNNEL_KEY;
+	__be16 o_flags = p->o_flags | base_o_flags;
 
 	if (nla_put_u32(skb, IFLA_GRE_LINK, p->link) ||
 	    nla_put_be16(skb, IFLA_GRE_IFLAGS,
@@ -2155,6 +2153,11 @@ static int ip6gre_fill_info(struct sk_buff *skb, const struct net_device *dev)
 	return -EMSGSIZE;
 }
 
+static int ip6gre_fill_info(struct sk_buff *skb, const struct net_device *dev)
+{
+	return __ip6gre_fill_info(skb, dev, 0);
+}
+
 static const struct nla_policy ip6gre_policy[IFLA_GRE_MAX + 1] = {
 	[IFLA_GRE_LINK]        = { .type = NLA_U32 },
 	[IFLA_GRE_IFLAGS]      = { .type = NLA_U16 },
@@ -2256,6 +2259,20 @@ static int ip6erspan_changelink(struct net_device *dev, struct nlattr *tb[],
 	return 0;
 }
 
+static int ip6erspan_fill_info(struct sk_buff *skb,
+			       const struct net_device *dev)
+{
+	struct ip6_tnl *t = netdev_priv(dev);
+	struct __ip6_tnl_parm *p = &t->parms;
+	__be16 base_o_flags = 0;
+
+	if ((p->erspan_ver == 1 || p->erspan_ver == 2) &&
+	    !p->collect_md)
+		base_o_flags |= TUNNEL_KEY;
+
+	return __ip6gre_fill_info(skb, dev, base_o_flags);
+}
+
 static struct rtnl_link_ops ip6gre_link_ops __read_mostly = {
 	.kind		= "ip6gre",
 	.maxtype	= IFLA_GRE_MAX,
@@ -2295,7 +2312,7 @@ static struct rtnl_link_ops ip6erspan_tap_ops __read_mostly = {
 	.newlink	= ip6erspan_newlink,
 	.changelink	= ip6erspan_changelink,
 	.get_size	= ip6gre_get_size,
-	.fill_info	= ip6gre_fill_info,
+	.fill_info	= ip6erspan_fill_info,
 	.get_link_net	= ip6_tnl_get_link_net,
 };
 
-- 
2.4.11


^ permalink raw reply related

* [RFC PATCH] net act_vlan: use correct len in skb_pull
From: Zahari Doychev @ 2019-02-13 19:51 UTC (permalink / raw)
  To: netdev, bridge, makita.toshiaki, nikolay, roopa, jhs, jiri,
	xiyou.wangcong
  Cc: johannes, zahari.doychev

The bridge and VLAN code expects skb->data to point to the start of the
VLAN header rather than the next (network) header. Currently, after
tcf_vlan_act() runs on an ingress filter, skb->data points to the next
network header. As a result, the Linux bridge does not correctly forward
double-tagged VLAN packets whose tags were added by the tc vlan action: the
outer VLAN tag from the skb is inserted at the wrong offset, after the VLAN
tag in the payload. Making skb->data point to the VLAN header in
tcf_vlan_act() by passing ETH_HLEN to skb_pull_rcsum() fixes the problem.

The following commands were used for testing:

ip link add name br0 type bridge vlan_filtering 1
ip link set dev br0 up

ip link set dev net0 up
ip link set dev net0 master br0

ip link set dev net1 up
ip link set dev net1 master br0

bridge vlan add dev net0 vid 100 master
bridge vlan add dev br0 vid 100 self
bridge vlan add dev net1 vid 100 master

tc qdisc add dev net0 handle ffff: clsact
tc qdisc add dev net1 handle ffff: clsact

tc filter add dev net0 ingress pref 1 protocol all flower \
                  action vlan push id 10 pipe action vlan push id 100

tc filter add dev net0 egress pref 1 protocol 802.1q flower \
                  vlan_id 100 vlan_ethtype 802.1q cvlan_id 10 \
                  action vlan pop pipe action vlan pop

Signed-off-by: Zahari Doychev <zahari.doychev@linux.com>
---
 net/sched/act_vlan.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/sched/act_vlan.c b/net/sched/act_vlan.c
index 93fdaf707313..308d7d89f925 100644
--- a/net/sched/act_vlan.c
+++ b/net/sched/act_vlan.c
@@ -86,7 +86,7 @@ static int tcf_vlan_act(struct sk_buff *skb, const struct tc_action *a,
 
 out:
 	if (skb_at_tc_ingress(skb))
-		skb_pull_rcsum(skb, skb->mac_len);
+		skb_pull_rcsum(skb, ETH_HLEN);
 
 	return action;
 
-- 
2.20.1


^ permalink raw reply related
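
The offset arithmetic behind the one-line change above can be sketched as
follows. This is an illustration only: the DEMO_* constants and helper names
are hypothetical, and it assumes, as the commit message implies, that mac_len
counts a VLAN tag already present in the payload in addition to the 14-byte
Ethernet header.

```c
#include <stddef.h>

#define DEMO_ETH_HLEN	14	/* dst MAC + src MAC + EtherType */
#define DEMO_VLAN_HLEN	4	/* TCI + encapsulated EtherType */

/* MAC header length when the given number of VLAN tags sit in the payload
 * (assumption: each in-payload tag adds DEMO_VLAN_HLEN to mac_len). */
size_t demo_mac_len(unsigned int vlan_tags_in_payload)
{
	return DEMO_ETH_HLEN + (size_t)vlan_tags_in_payload * DEMO_VLAN_HLEN;
}

/* Bytes of VLAN header that end up before the data pointer after pulling
 * pull_len bytes from the start of the Ethernet header. */
size_t demo_vlan_bytes_skipped(size_t pull_len)
{
	return pull_len > DEMO_ETH_HLEN ? pull_len - DEMO_ETH_HLEN : 0;
}
```

Under that assumption, pulling mac_len (18 bytes with one in-payload tag)
skips the 4-byte VLAN header, while pulling ETH_HLEN leaves the data pointer
at the VLAN header, which is what the bridge code expects.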

* [PATCH bpf-next v11 0/7] bpf: add BPF_LWT_ENCAP_IP option to bpf_lwt_push_encap
From: Peter Oskolkov @ 2019-02-13 19:53 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, netdev
  Cc: Peter Oskolkov, David Ahern, Willem de Bruijn, Peter Oskolkov

This patchset implements BPF_LWT_ENCAP_IP mode in bpf_lwt_push_encap
BPF helper. It enables BPF programs (specifically, BPF_PROG_TYPE_LWT_IN
and BPF_PROG_TYPE_LWT_XMIT prog types) to add IP encapsulation headers
to packets (e.g. IP/GRE, GUE, IPIP).

This is useful when thousands of different short-lived flows should be
encapped, each with different and dynamically determined destination.
Although lwtunnels can be used in some of these scenarios, the ability
to dynamically generate encap headers adds more flexibility, e.g.
when routing depends on the state of the host (reflected in global bpf
maps).

V2 changes: added flowi-based route lookup, IPv6 encapping, and
   encapping on ingress.

V3 changes: incorporated David Ahern's suggestions:
   - added l3mdev check/oif (patch 2)
   - sync bpf.h from include/uapi into tools/include/uapi
   - selftest tweaks

V4 changes: moved route lookup/dst change from bpf_push_ip_encap
   to when BPF_LWT_REROUTE is handled, as suggested by David Ahern.

V5 changes: added a check in lwt_xmit that skb->protocol stays the
   same if the skb is to be passed back to the stack (ret == BPF_OK).
   Again, suggested by David Ahern.

V6 changes: abandoned.

V7 changes: added handling of GSO packets (patch 3 in the patchset added),
   as suggested by BPF maintainers.

V8 changes:
   - fixed build errors when LWT or IPV6 are not enabled;
   - whitelisted TCP GSO instead of blacklisting SCTP and UDP GSO, as
     suggested by Willem de Bruijn;
   - added validation that the pushed length covers the needed headers when
     GRE/UDP encap is detected, as suggested by Willem de Bruijn;
   - a couple of minor/stylistic tweaks/fixed typos.

V9 changes:
   - fixed a kbuild test robot compiler warning;
   - added ipv6_route_input to ipv6_stub (patch 4 in the patchset
     added), and IPv6 routing functions are now invoked via ipv6_stub,
     as suggested by David Ahern.

V10 changes:
   - removed unnecessary IS_ENABLED and pr_warn_once from patch 5.

V11 changes: fixed a potential dst leak in patch 5, as suggested by
    David Ahern.

Peter Oskolkov (7):
  bpf: add plumbing for BPF_LWT_ENCAP_IP in bpf_lwt_push_encap
  bpf: implement BPF_LWT_ENCAP_IP mode in bpf_lwt_push_encap
  bpf: handle GSO in bpf_lwt_push_encap
  ipv6_stub: add ipv6_route_input stub/proxy.
  bpf: add handling of BPF_LWT_REROUTE to lwt_bpf.c
  bpf: sync <kdir>/include/.../bpf.h with tools/include/.../bpf.h
  selftests: bpf: add test_lwt_ip_encap selftest

 include/net/addrconf.h                        |   1 +
 include/net/lwtunnel.h                        |   2 +
 include/uapi/linux/bpf.h                      |  26 +-
 net/core/filter.c                             |  49 ++-
 net/core/lwt_bpf.c                            | 254 +++++++++++++-
 net/ipv6/addrconf_core.c                      |   6 +
 net/ipv6/af_inet6.c                           |   7 +
 tools/include/uapi/linux/bpf.h                |  26 +-
 tools/testing/selftests/bpf/Makefile          |   3 +-
 .../selftests/bpf/progs/test_lwt_ip_encap.c   |  85 +++++
 .../selftests/bpf/test_lwt_ip_encap.sh        | 311 ++++++++++++++++++
 11 files changed, 758 insertions(+), 12 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/progs/test_lwt_ip_encap.c
 create mode 100755 tools/testing/selftests/bpf/test_lwt_ip_encap.sh

-- 
2.20.1.791.gb4d0f1c61a-goog

^ permalink raw reply

* [PATCH bpf-next v11 1/7] bpf: add plumbing for BPF_LWT_ENCAP_IP in bpf_lwt_push_encap
From: Peter Oskolkov @ 2019-02-13 19:53 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, netdev
  Cc: Peter Oskolkov, David Ahern, Willem de Bruijn, Peter Oskolkov
In-Reply-To: <20190213195341.184969-1-posk@google.com>

This patch adds all the plumbing needed to allow BPF programs
to do IP encapping via bpf_lwt_push_encap. The actual
implementation is added in the next patch in the patchset.

Of note:
- bpf_lwt_push_encap can now be called from BPF_PROG_TYPE_LWT_XMIT
  prog types in addition to BPF_PROG_TYPE_LWT_IN;
- if the skb being encapped has GSO set, encapsulation is limited
  to IPIP/IP+GRE/IP+GUE (both IPv4 and IPv6);
- as route lookups are different for ingress vs egress, the single
  external bpf_lwt_push_encap BPF helper is routed internally to
  either bpf_lwt_in_push_encap or bpf_lwt_xmit_push_encap BPF_CALLs,
  depending on prog type.

v8 changes: fixed a typo.

Signed-off-by: Peter Oskolkov <posk@google.com>
---
 include/uapi/linux/bpf.h | 26 ++++++++++++++++++++--
 net/core/filter.c        | 48 +++++++++++++++++++++++++++++++++++-----
 2 files changed, 67 insertions(+), 7 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 25c8c0e62ecf..bcdd2474eee7 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2016,6 +2016,19 @@ union bpf_attr {
  *			Only works if *skb* contains an IPv6 packet. Insert a
  *			Segment Routing Header (**struct ipv6_sr_hdr**) inside
  *			the IPv6 header.
+ *		**BPF_LWT_ENCAP_IP**
+ *			IP encapsulation (GRE/GUE/IPIP/etc). The outer header
+ *			must be IPv4 or IPv6, followed by zero or more
+ *			additional headers, up to LWT_BPF_MAX_HEADROOM total
+ *			bytes in all prepended headers. Please note that
+ *			if skb_is_gso(skb) is true, no more than two headers
+ *			can be prepended, and the inner header, if present,
+ *			should be either GRE or UDP/GUE.
+ *
+ *		BPF_LWT_ENCAP_SEG6*** types can be called by bpf programs of
+ *		type BPF_PROG_TYPE_LWT_IN; BPF_LWT_ENCAP_IP type can be called
+ *		by bpf programs of types BPF_PROG_TYPE_LWT_IN and
+ *		BPF_PROG_TYPE_LWT_XMIT.
  *
  * 		A call to this helper is susceptible to change the underlaying
  * 		packet buffer. Therefore, at load time, all checks on pointers
@@ -2517,7 +2530,8 @@ enum bpf_hdr_start_off {
 /* Encapsulation type for BPF_FUNC_lwt_push_encap helper. */
 enum bpf_lwt_encap_mode {
 	BPF_LWT_ENCAP_SEG6,
-	BPF_LWT_ENCAP_SEG6_INLINE
+	BPF_LWT_ENCAP_SEG6_INLINE,
+	BPF_LWT_ENCAP_IP,
 };
 
 #define __bpf_md_ptr(type, name)	\
@@ -2606,7 +2620,15 @@ enum bpf_ret_code {
 	BPF_DROP = 2,
 	/* 3-6 reserved */
 	BPF_REDIRECT = 7,
-	/* >127 are reserved for prog type specific return codes */
+	/* >127 are reserved for prog type specific return codes.
+	 *
+	 * BPF_LWT_REROUTE: used by BPF_PROG_TYPE_LWT_IN and
+	 *    BPF_PROG_TYPE_LWT_XMIT to indicate that skb had been
+	 *    changed and should be routed based on its new L3 header.
+	 *    (This is an L3 redirect, as opposed to L2 redirect
+	 *    represented by BPF_REDIRECT above).
+	 */
+	BPF_LWT_REROUTE = 128,
 };
 
 struct bpf_sock {
diff --git a/net/core/filter.c b/net/core/filter.c
index 353735575204..12c88c21b6b8 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4815,7 +4815,15 @@ static int bpf_push_seg6_encap(struct sk_buff *skb, u32 type, void *hdr, u32 len
 }
 #endif /* CONFIG_IPV6_SEG6_BPF */
 
-BPF_CALL_4(bpf_lwt_push_encap, struct sk_buff *, skb, u32, type, void *, hdr,
+#if IS_ENABLED(CONFIG_LWTUNNEL_BPF)
+static int bpf_push_ip_encap(struct sk_buff *skb, void *hdr, u32 len,
+			     bool ingress)
+{
+	return -EINVAL;  /* Implemented in the next patch. */
+}
+#endif
+
+BPF_CALL_4(bpf_lwt_in_push_encap, struct sk_buff *, skb, u32, type, void *, hdr,
 	   u32, len)
 {
 	switch (type) {
@@ -4823,14 +4831,41 @@ BPF_CALL_4(bpf_lwt_push_encap, struct sk_buff *, skb, u32, type, void *, hdr,
 	case BPF_LWT_ENCAP_SEG6:
 	case BPF_LWT_ENCAP_SEG6_INLINE:
 		return bpf_push_seg6_encap(skb, type, hdr, len);
+#endif
+#if IS_ENABLED(CONFIG_LWTUNNEL_BPF)
+	case BPF_LWT_ENCAP_IP:
+		return bpf_push_ip_encap(skb, hdr, len, true /* ingress */);
+#endif
+	default:
+		return -EINVAL;
+	}
+}
+
+BPF_CALL_4(bpf_lwt_xmit_push_encap, struct sk_buff *, skb, u32, type,
+	   void *, hdr, u32, len)
+{
+	switch (type) {
+#if IS_ENABLED(CONFIG_LWTUNNEL_BPF)
+	case BPF_LWT_ENCAP_IP:
+		return bpf_push_ip_encap(skb, hdr, len, false /* egress */);
 #endif
 	default:
 		return -EINVAL;
 	}
 }
 
-static const struct bpf_func_proto bpf_lwt_push_encap_proto = {
-	.func		= bpf_lwt_push_encap,
+static const struct bpf_func_proto bpf_lwt_in_push_encap_proto = {
+	.func		= bpf_lwt_in_push_encap,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_CTX,
+	.arg2_type	= ARG_ANYTHING,
+	.arg3_type	= ARG_PTR_TO_MEM,
+	.arg4_type	= ARG_CONST_SIZE
+};
+
+static const struct bpf_func_proto bpf_lwt_xmit_push_encap_proto = {
+	.func		= bpf_lwt_xmit_push_encap,
 	.gpl_only	= false,
 	.ret_type	= RET_INTEGER,
 	.arg1_type	= ARG_PTR_TO_CTX,
@@ -5417,7 +5452,8 @@ bool bpf_helper_changes_pkt_data(void *func)
 	    func == bpf_lwt_seg6_adjust_srh ||
 	    func == bpf_lwt_seg6_action ||
 #endif
-	    func == bpf_lwt_push_encap)
+	    func == bpf_lwt_in_push_encap ||
+	    func == bpf_lwt_xmit_push_encap)
 		return true;
 
 	return false;
@@ -5815,7 +5851,7 @@ lwt_in_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 {
 	switch (func_id) {
 	case BPF_FUNC_lwt_push_encap:
-		return &bpf_lwt_push_encap_proto;
+		return &bpf_lwt_in_push_encap_proto;
 	default:
 		return lwt_out_func_proto(func_id, prog);
 	}
@@ -5851,6 +5887,8 @@ lwt_xmit_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_l4_csum_replace_proto;
 	case BPF_FUNC_set_hash_invalid:
 		return &bpf_set_hash_invalid_proto;
+	case BPF_FUNC_lwt_push_encap:
+		return &bpf_lwt_xmit_push_encap_proto;
 	default:
 		return lwt_out_func_proto(func_id, prog);
 	}
-- 
2.20.1.791.gb4d0f1c61a-goog


* [PATCH bpf-next v11 2/7] bpf: implement BPF_LWT_ENCAP_IP mode in bpf_lwt_push_encap
From: Peter Oskolkov @ 2019-02-13 19:53 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, netdev
  Cc: Peter Oskolkov, David Ahern, Willem de Bruijn, Peter Oskolkov
In-Reply-To: <20190213195341.184969-1-posk@google.com>

Implement BPF_LWT_ENCAP_IP mode in bpf_lwt_push_encap BPF helper.
It enables BPF programs (specifically, BPF_PROG_TYPE_LWT_IN and
BPF_PROG_TYPE_LWT_XMIT prog types) to add IP encapsulation headers
to packets (e.g. IP/GRE, GUE, IPIP).

This is useful when thousands of different short-lived flows need to be
encapped, each with a different, dynamically determined destination.
Although lwtunnels can be used in some of these scenarios, the ability
to dynamically generate encap headers adds more flexibility, e.g.
when routing depends on the state of the host (reflected in global bpf
maps).
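
For illustration only, here is a userspace sketch of the sanity checks
that the helper applies to the pushed header before touching the skb
(version nibble, minimum length, headroom cap). All names and the
SKETCH_MAX_HEADROOM value are assumptions made for this example; the
real checks are in bpf_lwt_push_ip_encap() in the diff below and
operate on struct iphdr and LWT_BPF_MAX_HEADROOM.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed values for this sketch, not taken from kernel headers. */
#define SKETCH_MAX_HEADROOM 256	/* stands in for LWT_BPF_MAX_HEADROOM */
#define SKETCH_IPV4_MIN_HDR 20	/* sizeof(struct iphdr) without options */
#define SKETCH_IPV6_HDR     40	/* sizeof(struct ipv6hdr) */

/* Returns 4 or 6 for an acceptable outer header, -EINVAL otherwise. */
int sketch_check_encap_hdr(const uint8_t *hdr, size_t len)
{
	unsigned int version, ihl;

	/* The buffer must hold at least a minimal IPv4 header and must
	 * not exceed the headroom limit.
	 */
	if (len < SKETCH_IPV4_MIN_HDR || len > SKETCH_MAX_HEADROOM)
		return -EINVAL;

	version = hdr[0] >> 4;		/* high nibble of the first byte */
	if (version == 4) {
		ihl = hdr[0] & 0x0f;	/* IPv4 header length in 32-bit words */
		if (len < ihl * 4)
			return -EINVAL;
		return 4;
	}
	if (version == 6) {
		if (len < SKETCH_IPV6_HDR)
			return -EINVAL;
		return 6;
	}
	return -EINVAL;
}
```

A 28-byte buffer starting with 0x45 (IPv4, ihl 5, plus a 4-byte GRE
header) passes; a 24-byte buffer starting with 0x60 is rejected because
it is shorter than an IPv6 header.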

v7 changes:
 - added a call to skb_clear_hash();
 - removed calls to skb_set_transport_header();
 - refuse to encap GSO-enabled packets.

v8 changes:
 - fix build errors when LWT is not enabled.

Note: the next patch in the patchset will deal with GSO-enabled packets,
which are currently rejected when an encap is attempted.

Signed-off-by: Peter Oskolkov <posk@google.com>
---
 include/net/lwtunnel.h |  2 ++
 net/core/filter.c      |  3 +-
 net/core/lwt_bpf.c     | 65 ++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 69 insertions(+), 1 deletion(-)

diff --git a/include/net/lwtunnel.h b/include/net/lwtunnel.h
index 33fd9ba7e0e5..671113bcb2cc 100644
--- a/include/net/lwtunnel.h
+++ b/include/net/lwtunnel.h
@@ -126,6 +126,8 @@ int lwtunnel_cmp_encap(struct lwtunnel_state *a, struct lwtunnel_state *b);
 int lwtunnel_output(struct net *net, struct sock *sk, struct sk_buff *skb);
 int lwtunnel_input(struct sk_buff *skb);
 int lwtunnel_xmit(struct sk_buff *skb);
+int bpf_lwt_push_ip_encap(struct sk_buff *skb, void *hdr, u32 len,
+			  bool ingress);
 
 static inline void lwtunnel_set_redirect(struct dst_entry *dst)
 {
diff --git a/net/core/filter.c b/net/core/filter.c
index 12c88c21b6b8..a78deb2656e1 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -73,6 +73,7 @@
 #include <linux/seg6_local.h>
 #include <net/seg6.h>
 #include <net/seg6_local.h>
+#include <net/lwtunnel.h>
 
 /**
  *	sk_filter_trim_cap - run a packet through a socket filter
@@ -4819,7 +4820,7 @@ static int bpf_push_seg6_encap(struct sk_buff *skb, u32 type, void *hdr, u32 len
 static int bpf_push_ip_encap(struct sk_buff *skb, void *hdr, u32 len,
 			     bool ingress)
 {
-	return -EINVAL;  /* Implemented in the next patch. */
+	return bpf_lwt_push_ip_encap(skb, hdr, len, ingress);
 }
 #endif
 
diff --git a/net/core/lwt_bpf.c b/net/core/lwt_bpf.c
index a648568c5e8f..e5a9850d9f48 100644
--- a/net/core/lwt_bpf.c
+++ b/net/core/lwt_bpf.c
@@ -390,6 +390,71 @@ static const struct lwtunnel_encap_ops bpf_encap_ops = {
 	.owner		= THIS_MODULE,
 };
 
+static int handle_gso_encap(struct sk_buff *skb, bool ipv4, int encap_len)
+{
+	/* Handling of GSO-enabled packets is added in the next patch. */
+	return -EOPNOTSUPP;
+}
+
+int bpf_lwt_push_ip_encap(struct sk_buff *skb, void *hdr, u32 len, bool ingress)
+{
+	struct iphdr *iph;
+	bool ipv4;
+	int err;
+
+	if (unlikely(len < sizeof(struct iphdr) || len > LWT_BPF_MAX_HEADROOM))
+		return -EINVAL;
+
+	/* validate protocol and length */
+	iph = (struct iphdr *)hdr;
+	if (iph->version == 4) {
+		ipv4 = true;
+		if (unlikely(len < iph->ihl * 4))
+			return -EINVAL;
+	} else if (iph->version == 6) {
+		ipv4 = false;
+		if (unlikely(len < sizeof(struct ipv6hdr)))
+			return -EINVAL;
+	} else {
+		return -EINVAL;
+	}
+
+	if (ingress)
+		err = skb_cow_head(skb, len + skb->mac_len);
+	else
+		err = skb_cow_head(skb,
+				   len + LL_RESERVED_SPACE(skb_dst(skb)->dev));
+	if (unlikely(err))
+		return err;
+
+	/* push the encap headers and fix pointers */
+	skb_reset_inner_headers(skb);
+	skb->encapsulation = 1;
+	skb_push(skb, len);
+	if (ingress)
+		skb_postpush_rcsum(skb, iph, len);
+	skb_reset_network_header(skb);
+	memcpy(skb_network_header(skb), hdr, len);
+	bpf_compute_data_pointers(skb);
+	skb_clear_hash(skb);
+
+	if (ipv4) {
+		skb->protocol = htons(ETH_P_IP);
+		iph = ip_hdr(skb);
+
+		if (!iph->check)
+			iph->check = ip_fast_csum((unsigned char *)iph,
+						  iph->ihl);
+	} else {
+		skb->protocol = htons(ETH_P_IPV6);
+	}
+
+	if (skb_is_gso(skb))
+		return handle_gso_encap(skb, ipv4, len);
+
+	return 0;
+}
+
 static int __init bpf_lwt_init(void)
 {
 	return lwtunnel_encap_add_ops(&bpf_encap_ops, LWTUNNEL_ENCAP_BPF);
-- 
2.20.1.791.gb4d0f1c61a-goog


* [PATCH bpf-next v11 3/7] bpf: handle GSO in bpf_lwt_push_encap
From: Peter Oskolkov @ 2019-02-13 19:53 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, netdev
  Cc: Peter Oskolkov, David Ahern, Willem de Bruijn, Peter Oskolkov
In-Reply-To: <20190213195341.184969-1-posk@google.com>

This patch adds handling of GSO packets in bpf_lwt_push_ip_encap()
(called from bpf_lwt_push_encap):

* IPIP, GRE, and UDP encapsulation types are deduced by looking
  into iphdr->protocol or ipv6hdr->nexthdr;
* SCTP GSO packets are not supported (as is the case in
  bpf_skb_proto_4_to_6 and similar);
* UDP_L4 GSO packets are also not supported (although they are
  not blocked in bpf_skb_proto_4_to_6 and similar), as
  skb_decrease_gso_size() will break them;
* SKB_GSO_DODGY bit is set.

Note: it may be possible to support SCTP and UDP_L4 gso packets;
      but as these cases seem to be not well handled by other
      tunneling/encapping code paths, the solution should
      be generic enough to apply to all tunneling/encapping code.
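
As a rough illustration of the deduction described in the list above,
the following userspace sketch maps the outer header's protocol field
to a tunnel GSO class. The enum, the SKETCH_* constants, and the
function name are hypothetical; the kernel code in the diff below sets
SKB_GSO_* bits in skb_shinfo(skb)->gso_type instead, and it first
rejects anything that is not TCP GSO, which this sketch does not model.

```c
#include <stdbool.h>
#include <stdint.h>

/* Protocol numbers matched by the switch in handle_gso_encap(). */
#define SKETCH_IPPROTO_IP	0
#define SKETCH_IPPROTO_UDP	17
#define SKETCH_IPPROTO_IPV6	41
#define SKETCH_IPPROTO_GRE	47

/* Hypothetical stand-ins for the SKB_GSO_* tunnel bits. */
enum sketch_gso_tunnel {
	SKETCH_GSO_UNSUPPORTED = 0,
	SKETCH_GSO_GRE,
	SKETCH_GSO_GRE_CSUM,
	SKETCH_GSO_UDP_TUNNEL,
	SKETCH_GSO_UDP_TUNNEL_CSUM,
	SKETCH_GSO_IPXIP4,
	SKETCH_GSO_IPXIP6,
};

/* protocol:   iphdr->protocol or ipv6hdr->nexthdr of the outer header
 * outer_ipv4: true if the outer header is IPv4
 * has_csum:   GRE_CSUM flag set (GRE) or non-zero UDP checksum (UDP)
 */
enum sketch_gso_tunnel sketch_classify_encap(uint8_t protocol,
					     bool outer_ipv4, bool has_csum)
{
	switch (protocol) {
	case SKETCH_IPPROTO_GRE:
		return has_csum ? SKETCH_GSO_GRE_CSUM : SKETCH_GSO_GRE;
	case SKETCH_IPPROTO_UDP:
		return has_csum ? SKETCH_GSO_UDP_TUNNEL_CSUM
				: SKETCH_GSO_UDP_TUNNEL;
	case SKETCH_IPPROTO_IP:
	case SKETCH_IPPROTO_IPV6:
		return outer_ipv4 ? SKETCH_GSO_IPXIP4 : SKETCH_GSO_IPXIP6;
	default:
		return SKETCH_GSO_UNSUPPORTED;
	}
}
```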

v8 changes:
   - make sure that if GRE or UDP encap is detected, enough bytes
     have been pushed to cover both IP[v6] + GRE|UDP headers;
   - do not reject double-encapped packets;
   - whitelist TCP GSO packets rather than block SCTP GSO and
     UDP GSO.

Signed-off-by: Peter Oskolkov <posk@google.com>
---
 net/core/lwt_bpf.c | 67 ++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 65 insertions(+), 2 deletions(-)

diff --git a/net/core/lwt_bpf.c b/net/core/lwt_bpf.c
index e5a9850d9f48..079871fc020f 100644
--- a/net/core/lwt_bpf.c
+++ b/net/core/lwt_bpf.c
@@ -16,6 +16,7 @@
 #include <linux/types.h>
 #include <linux/bpf.h>
 #include <net/lwtunnel.h>
+#include <net/gre.h>
 
 struct bpf_lwt_prog {
 	struct bpf_prog *prog;
@@ -390,10 +391,72 @@ static const struct lwtunnel_encap_ops bpf_encap_ops = {
 	.owner		= THIS_MODULE,
 };
 
+static int handle_gso_type(struct sk_buff *skb, unsigned int gso_type,
+			   int encap_len)
+{
+	struct skb_shared_info *shinfo = skb_shinfo(skb);
+
+	gso_type |= SKB_GSO_DODGY;
+	shinfo->gso_type |= gso_type;
+	skb_decrease_gso_size(shinfo, encap_len);
+	shinfo->gso_segs = 0;
+	return 0;
+}
+
 static int handle_gso_encap(struct sk_buff *skb, bool ipv4, int encap_len)
 {
-	/* Handling of GSO-enabled packets is added in the next patch. */
-	return -EOPNOTSUPP;
+	int next_hdr_offset;
+	void *next_hdr;
+	__u8 protocol;
+
+	/* SCTP and UDP_L4 gso need more nuanced handling than what
+	 * handle_gso_type() does above: skb_decrease_gso_size() is not enough.
+	 * So at the moment only TCP GSO packets are let through.
+	 */
+	if (!(skb_shinfo(skb)->gso_type & (SKB_GSO_TCPV4 | SKB_GSO_TCPV6)))
+		return -ENOTSUPP;
+
+	if (ipv4) {
+		protocol = ip_hdr(skb)->protocol;
+		next_hdr_offset = sizeof(struct iphdr);
+		next_hdr = skb_network_header(skb) + next_hdr_offset;
+	} else {
+		protocol = ipv6_hdr(skb)->nexthdr;
+		next_hdr_offset = sizeof(struct ipv6hdr);
+		next_hdr = skb_network_header(skb) + next_hdr_offset;
+	}
+
+	switch (protocol) {
+	case IPPROTO_GRE:
+		next_hdr_offset += sizeof(struct gre_base_hdr);
+		if (next_hdr_offset > encap_len)
+			return -EINVAL;
+
+		if (((struct gre_base_hdr *)next_hdr)->flags & GRE_CSUM)
+			return handle_gso_type(skb, SKB_GSO_GRE_CSUM,
+					       encap_len);
+		return handle_gso_type(skb, SKB_GSO_GRE, encap_len);
+
+	case IPPROTO_UDP:
+		next_hdr_offset += sizeof(struct udphdr);
+		if (next_hdr_offset > encap_len)
+			return -EINVAL;
+
+		if (((struct udphdr *)next_hdr)->check)
+			return handle_gso_type(skb, SKB_GSO_UDP_TUNNEL_CSUM,
+					       encap_len);
+		return handle_gso_type(skb, SKB_GSO_UDP_TUNNEL, encap_len);
+
+	case IPPROTO_IP:
+	case IPPROTO_IPV6:
+		if (ipv4)
+			return handle_gso_type(skb, SKB_GSO_IPXIP4, encap_len);
+		else
+			return handle_gso_type(skb, SKB_GSO_IPXIP6, encap_len);
+
+	default:
+		return -EPROTONOSUPPORT;
+	}
 }
 
 int bpf_lwt_push_ip_encap(struct sk_buff *skb, void *hdr, u32 len, bool ingress)
-- 
2.20.1.791.gb4d0f1c61a-goog


* [PATCH bpf-next v11 4/7] ipv6_stub: add ipv6_route_input stub/proxy.
From: Peter Oskolkov @ 2019-02-13 19:53 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, netdev
  Cc: Peter Oskolkov, David Ahern, Willem de Bruijn, Peter Oskolkov
In-Reply-To: <20190213195341.184969-1-posk@google.com>

Proxy ip6_route_input via ipv6_stub, for later use by lwt bpf ip encap
(see the next patch in the patchset).
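
The stub/proxy idea can be sketched in plain C as a table of function
pointers that defaults to an "address family not supported" handler and
is repointed once the real implementation becomes available. Everything
below (struct sketch_pkt, the sketch_* functions, sketch_module_load)
is a hypothetical stand-in written for this example; the actual
ipv6_stub wiring is in the diff that follows.

```c
#include <errno.h>

/* Hypothetical packet type standing in for struct sk_buff. */
struct sketch_pkt {
	int routed;
};

/* Table of operations that core code calls through. */
struct sketch_stub {
	int (*route_input)(struct sketch_pkt *pkt);
};

/* Default used while the real implementation is not loaded. */
int sketch_nosupport_route_input(struct sketch_pkt *pkt)
{
	(void)pkt;
	return -EAFNOSUPPORT;
}

/* Real implementation, as it would be provided by the module. */
int sketch_real_route_input(struct sketch_pkt *pkt)
{
	pkt->routed = 1;
	return 0;
}

const struct sketch_stub sketch_default_stub = {
	.route_input = sketch_nosupport_route_input,
};

const struct sketch_stub sketch_real_stub = {
	.route_input = sketch_real_route_input,
};

/* Callers always dereference this pointer, so they never need to link
 * against the real implementation directly.
 */
const struct sketch_stub *sketch_stub_ptr = &sketch_default_stub;

/* What module initialization would do: swap in the real table. */
void sketch_module_load(void)
{
	sketch_stub_ptr = &sketch_real_stub;
}
```

Before sketch_module_load() runs, a call through sketch_stub_ptr returns
-EAFNOSUPPORT; afterwards the same call site reaches the real function.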

Signed-off-by: Peter Oskolkov <posk@google.com>
---
 include/net/addrconf.h   | 1 +
 net/ipv6/addrconf_core.c | 6 ++++++
 net/ipv6/af_inet6.c      | 7 +++++++
 3 files changed, 14 insertions(+)

diff --git a/include/net/addrconf.h b/include/net/addrconf.h
index 20d523ee2fec..269ec27385e9 100644
--- a/include/net/addrconf.h
+++ b/include/net/addrconf.h
@@ -248,6 +248,7 @@ struct ipv6_stub {
 				 const struct in6_addr *addr);
 	int (*ipv6_dst_lookup)(struct net *net, struct sock *sk,
 			       struct dst_entry **dst, struct flowi6 *fl6);
+	int (*ipv6_route_input)(struct sk_buff *skb);
 
 	struct fib6_table *(*fib6_get_table)(struct net *net, u32 id);
 	struct fib6_info *(*fib6_lookup)(struct net *net, int oif,
diff --git a/net/ipv6/addrconf_core.c b/net/ipv6/addrconf_core.c
index 5cd0029d930e..6c79af056d9b 100644
--- a/net/ipv6/addrconf_core.c
+++ b/net/ipv6/addrconf_core.c
@@ -134,6 +134,11 @@ static int eafnosupport_ipv6_dst_lookup(struct net *net, struct sock *u1,
 	return -EAFNOSUPPORT;
 }
 
+static int eafnosupport_ipv6_route_input(struct sk_buff *skb)
+{
+	return -EAFNOSUPPORT;
+}
+
 static struct fib6_table *eafnosupport_fib6_get_table(struct net *net, u32 id)
 {
 	return NULL;
@@ -170,6 +175,7 @@ eafnosupport_ip6_mtu_from_fib6(struct fib6_info *f6i, struct in6_addr *daddr,
 
 const struct ipv6_stub *ipv6_stub __read_mostly = &(struct ipv6_stub) {
 	.ipv6_dst_lookup   = eafnosupport_ipv6_dst_lookup,
+	.ipv6_route_input  = eafnosupport_ipv6_route_input,
 	.fib6_get_table    = eafnosupport_fib6_get_table,
 	.fib6_table_lookup = eafnosupport_fib6_table_lookup,
 	.fib6_lookup       = eafnosupport_fib6_lookup,
diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
index d99753b5e39b..2f45d2a3e3a3 100644
--- a/net/ipv6/af_inet6.c
+++ b/net/ipv6/af_inet6.c
@@ -900,10 +900,17 @@ static struct pernet_operations inet6_net_ops = {
 	.exit = inet6_net_exit,
 };
 
+static int ipv6_route_input(struct sk_buff *skb)
+{
+	ip6_route_input(skb);
+	return skb_dst(skb)->error;
+}
+
 static const struct ipv6_stub ipv6_stub_impl = {
 	.ipv6_sock_mc_join = ipv6_sock_mc_join,
 	.ipv6_sock_mc_drop = ipv6_sock_mc_drop,
 	.ipv6_dst_lookup   = ip6_dst_lookup,
+	.ipv6_route_input  = ipv6_route_input,
 	.fib6_get_table	   = fib6_get_table,
 	.fib6_table_lookup = fib6_table_lookup,
 	.fib6_lookup       = fib6_lookup,
-- 
2.20.1.791.gb4d0f1c61a-goog


* [PATCH bpf-next v11 5/7] bpf: add handling of BPF_LWT_REROUTE to lwt_bpf.c
From: Peter Oskolkov @ 2019-02-13 19:53 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, netdev
  Cc: Peter Oskolkov, David Ahern, Willem de Bruijn, Peter Oskolkov
In-Reply-To: <20190213195341.184969-1-posk@google.com>

This patch builds on top of the previous patch in the patchset,
which added BPF_LWT_ENCAP_IP mode to bpf_lwt_push_encap. As the
encapping can result in the skb needing to go via a different
interface/route/dst, bpf programs can indicate this by returning
BPF_LWT_REROUTE, which triggers a new route lookup for the skb.
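
To make the xmit-side behaviour easier to follow, here is a small
userspace sketch of how the program's return code is mapped to an
action. The return-code values are the ones from bpf.h; the
sketch_action enum and the function name are assumptions made for this
example, and the kernel's bpf_xmit() in the diff below remains the
reference.

```c
#include <stdbool.h>

/* Return codes as defined in enum bpf_ret_code. */
#define SKETCH_BPF_OK		0
#define SKETCH_BPF_DROP		2
#define SKETCH_BPF_REDIRECT	7
#define SKETCH_BPF_LWT_REROUTE	128

/* Hypothetical summary of what the xmit path does next. */
enum sketch_action {
	SKETCH_ACT_CONTINUE,	/* keep the current dst, continue xmit */
	SKETCH_ACT_DONE,	/* skb was consumed by an L2 redirect */
	SKETCH_ACT_REROUTE,	/* do a new L3 route lookup */
	SKETCH_ACT_ERROR,	/* drop or propagate an error */
};

/* proto_changed: skb->protocol differs from its value before the
 * program ran, e.g. an IPv6 header was pushed onto an IPv4 packet.
 */
enum sketch_action sketch_xmit_action(int ret, bool proto_changed)
{
	switch (ret) {
	case SKETCH_BPF_OK:
		/* Changing the L3 protocol without asking for a
		 * reroute is treated as an error.
		 */
		return proto_changed ? SKETCH_ACT_ERROR : SKETCH_ACT_CONTINUE;
	case SKETCH_BPF_REDIRECT:
		return SKETCH_ACT_DONE;
	case SKETCH_BPF_LWT_REROUTE:
		return SKETCH_ACT_REROUTE;
	default:
		return SKETCH_ACT_ERROR;
	}
}
```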

v8 changes: fix kbuild errors when LWTUNNEL_BPF is builtin, but
   IPV6 is a module: as LWTUNNEL_BPF can only be either Y or N,
   call IPV6 routing functions only if they are built-in.

v9 changes:
   - fixed a kbuild test robot compiler warning;
   - call IPV6 routing functions via ipv6_stub.

v10 changes: removed unnecessary IS_ENABLED and pr_warn_once.

v11 changes: fixed a potential dst leak.

Signed-off-by: Peter Oskolkov <posk@google.com>
---
 net/core/lwt_bpf.c | 126 ++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 124 insertions(+), 2 deletions(-)

diff --git a/net/core/lwt_bpf.c b/net/core/lwt_bpf.c
index 079871fc020f..32251f3fcda0 100644
--- a/net/core/lwt_bpf.c
+++ b/net/core/lwt_bpf.c
@@ -17,6 +17,7 @@
 #include <linux/bpf.h>
 #include <net/lwtunnel.h>
 #include <net/gre.h>
+#include <net/ip6_route.h>
 
 struct bpf_lwt_prog {
 	struct bpf_prog *prog;
@@ -56,6 +57,7 @@ static int run_lwt_bpf(struct sk_buff *skb, struct bpf_lwt_prog *lwt,
 
 	switch (ret) {
 	case BPF_OK:
+	case BPF_LWT_REROUTE:
 		break;
 
 	case BPF_REDIRECT:
@@ -88,6 +90,30 @@ static int run_lwt_bpf(struct sk_buff *skb, struct bpf_lwt_prog *lwt,
 	return ret;
 }
 
+static int bpf_lwt_input_reroute(struct sk_buff *skb)
+{
+	int err = -EINVAL;
+
+	if (skb->protocol == htons(ETH_P_IP)) {
+		struct iphdr *iph = ip_hdr(skb);
+
+		err = ip_route_input_noref(skb, iph->daddr, iph->saddr,
+					   iph->tos, skb_dst(skb)->dev);
+	} else if (skb->protocol == htons(ETH_P_IPV6)) {
+		err = ipv6_stub->ipv6_route_input(skb);
+	} else {
+		err = -EAFNOSUPPORT;
+	}
+
+	if (err)
+		goto err;
+	return dst_input(skb);
+
+err:
+	kfree_skb(skb);
+	return err;
+}
+
 static int bpf_input(struct sk_buff *skb)
 {
 	struct dst_entry *dst = skb_dst(skb);
@@ -99,11 +125,11 @@ static int bpf_input(struct sk_buff *skb)
 		ret = run_lwt_bpf(skb, &bpf->in, dst, NO_REDIRECT);
 		if (ret < 0)
 			return ret;
+		if (ret == BPF_LWT_REROUTE)
+			return bpf_lwt_input_reroute(skb);
 	}
 
 	if (unlikely(!dst->lwtstate->orig_input)) {
-		pr_warn_once("orig_input not set on dst for prog %s\n",
-			     bpf->out.name);
 		kfree_skb(skb);
 		return -EINVAL;
 	}
@@ -148,6 +174,91 @@ static int xmit_check_hhlen(struct sk_buff *skb)
 	return 0;
 }
 
+static int bpf_lwt_xmit_reroute(struct sk_buff *skb)
+{
+	struct net_device *l3mdev = l3mdev_master_dev_rcu(skb_dst(skb)->dev);
+	int oif = l3mdev ? l3mdev->ifindex : 0;
+	struct dst_entry *dst = NULL;
+	struct sock *sk;
+	struct net *net;
+	bool ipv4;
+	int err;
+
+	if (skb->protocol == htons(ETH_P_IP))
+		ipv4 = true;
+	else if (skb->protocol == htons(ETH_P_IPV6))
+		ipv4 = false;
+	else
+		return -EAFNOSUPPORT;
+
+	sk = sk_to_full_sk(skb->sk);
+	if (sk) {
+		if (sk->sk_bound_dev_if)
+			oif = sk->sk_bound_dev_if;
+		net = sock_net(sk);
+	} else {
+		net = dev_net(skb_dst(skb)->dev);
+	}
+
+	if (ipv4) {
+		struct iphdr *iph = ip_hdr(skb);
+		struct flowi4 fl4 = {};
+		struct rtable *rt;
+
+		fl4.flowi4_oif = oif;
+		fl4.flowi4_mark = skb->mark;
+		fl4.flowi4_uid = sock_net_uid(net, sk);
+		fl4.flowi4_tos = RT_TOS(iph->tos);
+		fl4.flowi4_flags = FLOWI_FLAG_ANYSRC;
+		fl4.flowi4_proto = iph->protocol;
+		fl4.daddr = iph->daddr;
+		fl4.saddr = iph->saddr;
+
+		rt = ip_route_output_key(net, &fl4);
+		if (IS_ERR(rt))
+			return -EINVAL;
+		dst = &rt->dst;
+	} else {
+		struct ipv6hdr *iph6 = ipv6_hdr(skb);
+		struct flowi6 fl6 = {};
+
+		fl6.flowi6_oif = oif;
+		fl6.flowi6_mark = skb->mark;
+		fl6.flowi6_uid = sock_net_uid(net, sk);
+		fl6.flowlabel = ip6_flowinfo(iph6);
+		fl6.flowi6_proto = iph6->nexthdr;
+		fl6.daddr = iph6->daddr;
+		fl6.saddr = iph6->saddr;
+
+		err = ipv6_stub->ipv6_dst_lookup(net, skb->sk, &dst, &fl6);
+		if (err || IS_ERR(dst))
+			return -EINVAL;
+	}
+	if (unlikely(dst->error)) {
+		dst_release(dst);
+		return -EINVAL;
+	}
+
+	/* Although skb header was reserved in bpf_lwt_push_ip_encap(), it
+	 * was done for the previous dst, so we are doing it here again, in
+	 * case the new dst needs much more space. The call below is a noop
+	 * if there is enough header space in skb.
+	 */
+	err = skb_cow_head(skb, LL_RESERVED_SPACE(dst->dev));
+	if (unlikely(err))
+		return err;
+
+	skb_dst_drop(skb);
+	skb_dst_set(skb, dst);
+
+	err = dst_output(dev_net(skb_dst(skb)->dev), skb->sk, skb);
+	if (unlikely(err))
+		return err;
+
+	/* ip[6]_finish_output2 understand LWTUNNEL_XMIT_DONE */
+	return LWTUNNEL_XMIT_DONE;
+}
+
 static int bpf_xmit(struct sk_buff *skb)
 {
 	struct dst_entry *dst = skb_dst(skb);
@@ -155,11 +266,20 @@ static int bpf_xmit(struct sk_buff *skb)
 
 	bpf = bpf_lwt_lwtunnel(dst->lwtstate);
 	if (bpf->xmit.prog) {
+		__be16 proto = skb->protocol;
 		int ret;
 
 		ret = run_lwt_bpf(skb, &bpf->xmit, dst, CAN_REDIRECT);
 		switch (ret) {
 		case BPF_OK:
+			/* If the header changed, e.g. via bpf_lwt_push_encap,
+			 * BPF_LWT_REROUTE below should have been used if the
+			 * protocol was also changed.
+			 */
+			if (skb->protocol != proto) {
+				kfree_skb(skb);
+				return -EINVAL;
+			}
 			/* If the header was expanded, headroom might be too
 			 * small for L2 header to come, expand as needed.
 			 */
@@ -170,6 +290,8 @@ static int bpf_xmit(struct sk_buff *skb)
 			return LWTUNNEL_XMIT_CONTINUE;
 		case BPF_REDIRECT:
 			return LWTUNNEL_XMIT_DONE;
+		case BPF_LWT_REROUTE:
+			return bpf_lwt_xmit_reroute(skb);
 		default:
 			return ret;
 		}
-- 
2.20.1.791.gb4d0f1c61a-goog


* [PATCH bpf-next v11 6/7] bpf: sync <kdir>/include/.../bpf.h with tools/include/.../bpf.h
From: Peter Oskolkov @ 2019-02-13 19:53 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, netdev
  Cc: Peter Oskolkov, David Ahern, Willem de Bruijn, Peter Oskolkov
In-Reply-To: <20190213195341.184969-1-posk@google.com>

This patch copies changes in bpf.h done by a previous patch
in this patchset from the kernel uapi include dir into tools
uapi include dir.

Signed-off-by: Peter Oskolkov <posk@google.com>
---
 tools/include/uapi/linux/bpf.h | 26 ++++++++++++++++++++++++--
 1 file changed, 24 insertions(+), 2 deletions(-)

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 25c8c0e62ecf..bcdd2474eee7 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -2016,6 +2016,19 @@ union bpf_attr {
  *			Only works if *skb* contains an IPv6 packet. Insert a
  *			Segment Routing Header (**struct ipv6_sr_hdr**) inside
  *			the IPv6 header.
+ *		**BPF_LWT_ENCAP_IP**
+ *			IP encapsulation (GRE/GUE/IPIP/etc). The outer header
+ *			must be IPv4 or IPv6, followed by zero or more
+ *			additional headers, up to LWT_BPF_MAX_HEADROOM total
+ *			bytes in all prepended headers. Please note that
+ *			if skb_is_gso(skb) is true, no more than two headers
+ *			can be prepended, and the inner header, if present,
+ *			should be either GRE or UDP/GUE.
+ *
+ *		BPF_LWT_ENCAP_SEG6*** types can be called by bpf programs of
+ *		type BPF_PROG_TYPE_LWT_IN; BPF_LWT_ENCAP_IP type can be called
+ *		by bpf programs of types BPF_PROG_TYPE_LWT_IN and
+ *		BPF_PROG_TYPE_LWT_XMIT.
  *
  * 		A call to this helper is susceptible to change the underlaying
  * 		packet buffer. Therefore, at load time, all checks on pointers
@@ -2517,7 +2530,8 @@ enum bpf_hdr_start_off {
 /* Encapsulation type for BPF_FUNC_lwt_push_encap helper. */
 enum bpf_lwt_encap_mode {
 	BPF_LWT_ENCAP_SEG6,
-	BPF_LWT_ENCAP_SEG6_INLINE
+	BPF_LWT_ENCAP_SEG6_INLINE,
+	BPF_LWT_ENCAP_IP,
 };
 
 #define __bpf_md_ptr(type, name)	\
@@ -2606,7 +2620,15 @@ enum bpf_ret_code {
 	BPF_DROP = 2,
 	/* 3-6 reserved */
 	BPF_REDIRECT = 7,
-	/* >127 are reserved for prog type specific return codes */
+	/* >127 are reserved for prog type specific return codes.
+	 *
+	 * BPF_LWT_REROUTE: used by BPF_PROG_TYPE_LWT_IN and
+	 *    BPF_PROG_TYPE_LWT_XMIT to indicate that skb had been
+	 *    changed and should be routed based on its new L3 header.
+	 *    (This is an L3 redirect, as opposed to L2 redirect
+	 *    represented by BPF_REDIRECT above).
+	 */
+	BPF_LWT_REROUTE = 128,
 };
 
 struct bpf_sock {
-- 
2.20.1.791.gb4d0f1c61a-goog


* [PATCH bpf-next v11 7/7] selftests: bpf: add test_lwt_ip_encap selftest
From: Peter Oskolkov @ 2019-02-13 19:53 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, netdev
  Cc: Peter Oskolkov, David Ahern, Willem de Bruijn, Peter Oskolkov
In-Reply-To: <20190213195341.184969-1-posk@google.com>

This patch adds a bpf self-test to cover BPF_LWT_ENCAP_IP mode
in bpf_lwt_push_encap.

Covered:
- encapping in LWT_IN and LWT_XMIT
- IPv4 and IPv6

A follow-up patch will add GSO and VRF-enabled tests.
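
The test program hard-codes its IPv4 addresses as 32-bit constants per
byte order. As a worked check of those constants, the sketch below
derives both forms from the dotted-quad octets. The function names are
hypothetical and the code is plain userspace C, not part of the
selftest.

```c
#include <stdint.h>

/* Value whose most significant byte is the first octet. Stored on a
 * big-endian host, its bytes appear in network order.
 */
uint32_t sketch_ipv4_be(uint8_t a, uint8_t b, uint8_t c, uint8_t d)
{
	return ((uint32_t)a << 24) | ((uint32_t)b << 16) |
	       ((uint32_t)c << 8) | (uint32_t)d;
}

/* Byte-swapped form: the constant a little-endian host must store so
 * that the bytes in memory are still in network order.
 */
uint32_t sketch_bswap32(uint32_t v)
{
	return (v >> 24) | ((v >> 8) & 0x0000ff00u) |
	       ((v << 8) & 0x00ff0000u) | (v << 24);
}
```

For 172.16.1.100 this gives 0xac100164 and, after the swap, 0x640110ac;
for 172.16.16.100 it gives 0xac101064 and 0x641010ac, which are the
values used in the two #if branches of the test program.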

Signed-off-by: Peter Oskolkov <posk@google.com>
---
 tools/testing/selftests/bpf/Makefile          |   3 +-
 .../selftests/bpf/progs/test_lwt_ip_encap.c   |  85 +++++
 .../selftests/bpf/test_lwt_ip_encap.sh        | 311 ++++++++++++++++++
 3 files changed, 398 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/bpf/progs/test_lwt_ip_encap.c
 create mode 100755 tools/testing/selftests/bpf/test_lwt_ip_encap.sh

diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index c3edf47da05d..ccffaa0a0787 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -50,7 +50,8 @@ TEST_PROGS := test_kmod.sh \
 	test_lirc_mode2.sh \
 	test_skb_cgroup_id.sh \
 	test_flow_dissector.sh \
-	test_xdp_vlan.sh
+	test_xdp_vlan.sh \
+	test_lwt_ip_encap.sh
 
 TEST_PROGS_EXTENDED := with_addr.sh \
 	with_tunnels.sh \
diff --git a/tools/testing/selftests/bpf/progs/test_lwt_ip_encap.c b/tools/testing/selftests/bpf/progs/test_lwt_ip_encap.c
new file mode 100644
index 000000000000..c957d6dfe6d7
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_lwt_ip_encap.c
@@ -0,0 +1,85 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <stddef.h>
+#include <string.h>
+#include <linux/bpf.h>
+#include <linux/ip.h>
+#include <linux/ipv6.h>
+#include "bpf_helpers.h"
+#include "bpf_endian.h"
+
+struct grehdr {
+	__be16 flags;
+	__be16 protocol;
+};
+
+SEC("encap_gre")
+int bpf_lwt_encap_gre(struct __sk_buff *skb)
+{
+	struct encap_hdr {
+		struct iphdr iph;
+		struct grehdr greh;
+	} hdr;
+	int err;
+
+	memset(&hdr, 0, sizeof(struct encap_hdr));
+
+	hdr.iph.ihl = 5;
+	hdr.iph.version = 4;
+	hdr.iph.ttl = 0x40;
+	hdr.iph.protocol = 47;  /* IPPROTO_GRE */
+#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
+	hdr.iph.saddr = 0x640110ac;  /* 172.16.1.100 */
+	hdr.iph.daddr = 0x641010ac;  /* 172.16.16.100 */
+#elif __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
+	hdr.iph.saddr = 0xac100164;  /* 172.16.1.100 */
+	hdr.iph.daddr = 0xac101064;  /* 172.16.16.100 */
+#else
+#error "Fix your compiler's __BYTE_ORDER__?!"
+#endif
+	hdr.iph.tot_len = bpf_htons(skb->len + sizeof(struct encap_hdr));
+
+	hdr.greh.protocol = skb->protocol;
+
+	err = bpf_lwt_push_encap(skb, BPF_LWT_ENCAP_IP, &hdr,
+				 sizeof(struct encap_hdr));
+	if (err)
+		return BPF_DROP;
+
+	return BPF_LWT_REROUTE;
+}
+
+SEC("encap_gre6")
+int bpf_lwt_encap_gre6(struct __sk_buff *skb)
+{
+	struct encap_hdr {
+		struct ipv6hdr ip6hdr;
+		struct grehdr greh;
+	} hdr;
+	int err;
+
+	memset(&hdr, 0, sizeof(struct encap_hdr));
+
+	hdr.ip6hdr.version = 6;
+	hdr.ip6hdr.payload_len = bpf_htons(skb->len + sizeof(struct grehdr));
+	hdr.ip6hdr.nexthdr = 47;  /* IPPROTO_GRE */
+	hdr.ip6hdr.hop_limit = 0x40;
+	/* fb01::1 */
+	hdr.ip6hdr.saddr.s6_addr[0] = 0xfb;
+	hdr.ip6hdr.saddr.s6_addr[1] = 1;
+	hdr.ip6hdr.saddr.s6_addr[15] = 1;
+	/* fb10::1 */
+	hdr.ip6hdr.daddr.s6_addr[0] = 0xfb;
+	hdr.ip6hdr.daddr.s6_addr[1] = 0x10;
+	hdr.ip6hdr.daddr.s6_addr[15] = 1;
+
+	hdr.greh.protocol = skb->protocol;
+
+	err = bpf_lwt_push_encap(skb, BPF_LWT_ENCAP_IP, &hdr,
+				 sizeof(struct encap_hdr));
+	if (err)
+		return BPF_DROP;
+
+	return BPF_LWT_REROUTE;
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/test_lwt_ip_encap.sh b/tools/testing/selftests/bpf/test_lwt_ip_encap.sh
new file mode 100755
index 000000000000..4ca714e23ab0
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_lwt_ip_encap.sh
@@ -0,0 +1,311 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# Setup/topology:
+#
+#    NS1             NS2             NS3
+#   veth1 <---> veth2   veth3 <---> veth4 (the top route)
+#   veth5 <---> veth6   veth7 <---> veth8 (the bottom route)
+#
+#   each vethN gets IPv[4|6]_N address
+#
+#   IPv*_SRC = IPv*_1
+#   IPv*_DST = IPv*_4
+#
+#   all tests test pings from IPv*_SRC to IPv*_DST
+#
+#   by default, routes are configured to allow packets to go
+#   IP*_1 <=> IP*_2 <=> IP*_3 <=> IP*_4 (the top route)
+#
+#   a GRE device is installed in NS3 with IPv*_GRE, and
+#   NS1/NS2 are configured to route packets to IPv*_GRE via IP*_8
+#   (the bottom route)
+#
+# Tests:
+#
+#   1. routes NS2->IPv*_DST are brought down, so the only way a ping
+#      from IP*_SRC to IP*_DST can work is via IPv*_GRE
+#
+#   2a. in an egress test, a bpf LWT_XMIT program is installed on veth1
+#       that encaps the packets with an IP/GRE header to route to IPv*_GRE
+#
+#       ping: SRC->[encap at veth1:egress]->GRE:decap->DST
+#       ping replies go DST->SRC directly
+#
+#   2b. in an ingress test, a bpf LWT_IN program is installed on veth2
+#       that encaps the packets with an IP/GRE header to route to IPv*_GRE
+#
+#       ping: SRC->[encap at veth2:ingress]->GRE:decap->DST
+#       ping replies go DST->SRC directly
+
+set -e  # exit on error
+
+if [[ $EUID -ne 0 ]]; then
+	echo "This script must be run as root"
+	echo "FAIL"
+	exit 1
+fi
+
+readonly NS1="ns1-$(mktemp -u XXXXXX)"
+readonly NS2="ns2-$(mktemp -u XXXXXX)"
+readonly NS3="ns3-$(mktemp -u XXXXXX)"
+
+readonly IPv4_1="172.16.1.100"
+readonly IPv4_2="172.16.2.100"
+readonly IPv4_3="172.16.3.100"
+readonly IPv4_4="172.16.4.100"
+readonly IPv4_5="172.16.5.100"
+readonly IPv4_6="172.16.6.100"
+readonly IPv4_7="172.16.7.100"
+readonly IPv4_8="172.16.8.100"
+readonly IPv4_GRE="172.16.16.100"
+
+readonly IPv4_SRC=$IPv4_1
+readonly IPv4_DST=$IPv4_4
+
+readonly IPv6_1="fb01::1"
+readonly IPv6_2="fb02::1"
+readonly IPv6_3="fb03::1"
+readonly IPv6_4="fb04::1"
+readonly IPv6_5="fb05::1"
+readonly IPv6_6="fb06::1"
+readonly IPv6_7="fb07::1"
+readonly IPv6_8="fb08::1"
+readonly IPv6_GRE="fb10::1"
+
+readonly IPv6_SRC=$IPv6_1
+readonly IPv6_DST=$IPv6_4
+
+setup() {
+set -e  # exit on error
+	# create devices and namespaces
+	ip netns add "${NS1}"
+	ip netns add "${NS2}"
+	ip netns add "${NS3}"
+
+	ip link add veth1 type veth peer name veth2
+	ip link add veth3 type veth peer name veth4
+	ip link add veth5 type veth peer name veth6
+	ip link add veth7 type veth peer name veth8
+
+	ip netns exec ${NS2} sysctl -wq net.ipv4.ip_forward=1
+	ip netns exec ${NS2} sysctl -wq net.ipv6.conf.all.forwarding=1
+
+	ip link set veth1 netns ${NS1}
+	ip link set veth2 netns ${NS2}
+	ip link set veth3 netns ${NS2}
+	ip link set veth4 netns ${NS3}
+	ip link set veth5 netns ${NS1}
+	ip link set veth6 netns ${NS2}
+	ip link set veth7 netns ${NS2}
+	ip link set veth8 netns ${NS3}
+
+	# configure addresses: the top route (1-2-3-4)
+	ip -netns ${NS1}    addr add ${IPv4_1}/24  dev veth1
+	ip -netns ${NS2}    addr add ${IPv4_2}/24  dev veth2
+	ip -netns ${NS2}    addr add ${IPv4_3}/24  dev veth3
+	ip -netns ${NS3}    addr add ${IPv4_4}/24  dev veth4
+	ip -netns ${NS1} -6 addr add ${IPv6_1}/128 nodad dev veth1
+	ip -netns ${NS2} -6 addr add ${IPv6_2}/128 nodad dev veth2
+	ip -netns ${NS2} -6 addr add ${IPv6_3}/128 nodad dev veth3
+	ip -netns ${NS3} -6 addr add ${IPv6_4}/128 nodad dev veth4
+
+	# configure addresses: the bottom route (5-6-7-8)
+	ip -netns ${NS1}    addr add ${IPv4_5}/24  dev veth5
+	ip -netns ${NS2}    addr add ${IPv4_6}/24  dev veth6
+	ip -netns ${NS2}    addr add ${IPv4_7}/24  dev veth7
+	ip -netns ${NS3}    addr add ${IPv4_8}/24  dev veth8
+	ip -netns ${NS1} -6 addr add ${IPv6_5}/128 nodad dev veth5
+	ip -netns ${NS2} -6 addr add ${IPv6_6}/128 nodad dev veth6
+	ip -netns ${NS2} -6 addr add ${IPv6_7}/128 nodad dev veth7
+	ip -netns ${NS3} -6 addr add ${IPv6_8}/128 nodad dev veth8
+
+
+	ip -netns ${NS1} link set dev veth1 up
+	ip -netns ${NS2} link set dev veth2 up
+	ip -netns ${NS2} link set dev veth3 up
+	ip -netns ${NS3} link set dev veth4 up
+	ip -netns ${NS1} link set dev veth5 up
+	ip -netns ${NS2} link set dev veth6 up
+	ip -netns ${NS2} link set dev veth7 up
+	ip -netns ${NS3} link set dev veth8 up
+
+	# configure routes: IP*_SRC -> veth1/IP*_2 (= top route) default;
+	# the bottom route to specific bottom addresses
+
+	# NS1
+	# top route
+	ip -netns ${NS1}    route add ${IPv4_2}/32  dev veth1
+	ip -netns ${NS1}    route add default dev veth1 via ${IPv4_2}  # go top by default
+	ip -netns ${NS1} -6 route add ${IPv6_2}/128 dev veth1
+	ip -netns ${NS1} -6 route add default dev veth1 via ${IPv6_2}  # go top by default
+	# bottom route
+	ip -netns ${NS1}    route add ${IPv4_6}/32  dev veth5
+	ip -netns ${NS1}    route add ${IPv4_7}/32  dev veth5 via ${IPv4_6}
+	ip -netns ${NS1}    route add ${IPv4_8}/32  dev veth5 via ${IPv4_6}
+	ip -netns ${NS1} -6 route add ${IPv6_6}/128 dev veth5
+	ip -netns ${NS1} -6 route add ${IPv6_7}/128 dev veth5 via ${IPv6_6}
+	ip -netns ${NS1} -6 route add ${IPv6_8}/128 dev veth5 via ${IPv6_6}
+
+	# NS2
+	# top route
+	ip -netns ${NS2}    route add ${IPv4_1}/32  dev veth2
+	ip -netns ${NS2}    route add ${IPv4_4}/32  dev veth3
+	ip -netns ${NS2} -6 route add ${IPv6_1}/128 dev veth2
+	ip -netns ${NS2} -6 route add ${IPv6_4}/128 dev veth3
+	# bottom route
+	ip -netns ${NS2}    route add ${IPv4_5}/32  dev veth6
+	ip -netns ${NS2}    route add ${IPv4_8}/32  dev veth7
+	ip -netns ${NS2} -6 route add ${IPv6_5}/128 dev veth6
+	ip -netns ${NS2} -6 route add ${IPv6_8}/128 dev veth7
+
+	# NS3
+	# top route
+	ip -netns ${NS3}    route add ${IPv4_3}/32  dev veth4
+	ip -netns ${NS3}    route add ${IPv4_1}/32  dev veth4 via ${IPv4_3}
+	ip -netns ${NS3}    route add ${IPv4_2}/32  dev veth4 via ${IPv4_3}
+	ip -netns ${NS3} -6 route add ${IPv6_3}/128 dev veth4
+	ip -netns ${NS3} -6 route add ${IPv6_1}/128 dev veth4 via ${IPv6_3}
+	ip -netns ${NS3} -6 route add ${IPv6_2}/128 dev veth4 via ${IPv6_3}
+	# bottom route
+	ip -netns ${NS3}    route add ${IPv4_7}/32  dev veth8
+	ip -netns ${NS3}    route add ${IPv4_5}/32  dev veth8 via ${IPv4_7}
+	ip -netns ${NS3}    route add ${IPv4_6}/32  dev veth8 via ${IPv4_7}
+	ip -netns ${NS3} -6 route add ${IPv6_7}/128 dev veth8
+	ip -netns ${NS3} -6 route add ${IPv6_5}/128 dev veth8 via ${IPv6_7}
+	ip -netns ${NS3} -6 route add ${IPv6_6}/128 dev veth8 via ${IPv6_7}
+
+	# configure IPv4 GRE device in NS3, and a route to it via the "bottom" route
+	ip -netns ${NS3} tunnel add gre_dev mode gre remote ${IPv4_1} local ${IPv4_GRE} ttl 255
+	ip -netns ${NS3} link set gre_dev up
+	ip -netns ${NS3} addr add ${IPv4_GRE} dev gre_dev
+	ip -netns ${NS1} route add ${IPv4_GRE}/32 dev veth5 via ${IPv4_6}
+	ip -netns ${NS2} route add ${IPv4_GRE}/32 dev veth7 via ${IPv4_8}
+
+
+	# configure IPv6 GRE device in NS3, and a route to it via the "bottom" route
+	ip -netns ${NS3} -6 tunnel add name gre6_dev mode ip6gre remote ${IPv6_1} local ${IPv6_GRE} ttl 255
+	ip -netns ${NS3} link set gre6_dev up
+	ip -netns ${NS3} -6 addr add ${IPv6_GRE} nodad dev gre6_dev
+	ip -netns ${NS1} -6 route add ${IPv6_GRE}/128 dev veth5 via ${IPv6_6}
+	ip -netns ${NS2} -6 route add ${IPv6_GRE}/128 dev veth7 via ${IPv6_8}
+
+	# rp_filter gets confused by what these tests are doing, so disable it
+	ip netns exec ${NS1} sysctl -wq net.ipv4.conf.all.rp_filter=0
+	ip netns exec ${NS2} sysctl -wq net.ipv4.conf.all.rp_filter=0
+	ip netns exec ${NS3} sysctl -wq net.ipv4.conf.all.rp_filter=0
+}
+
+cleanup() {
+	ip netns del ${NS1} 2> /dev/null
+	ip netns del ${NS2} 2> /dev/null
+	ip netns del ${NS3} 2> /dev/null
+}
+
+trap cleanup EXIT
+
+test_ping() {
+	local -r PROTO=$1
+	local -r EXPECTED=$2
+	local RET=0
+
+	set +e
+	if [ "${PROTO}" == "IPv4" ] ; then
+		ip netns exec ${NS1} ping  -c 1 -W 1 -I ${IPv4_SRC} ${IPv4_DST} > /dev/null 2>&1
+		RET=$?
+	elif [ "${PROTO}" == "IPv6" ] ; then
+		ip netns exec ${NS1} ping6 -c 1 -W 6 -I ${IPv6_SRC} ${IPv6_DST} > /dev/null 2>&1
+		RET=$?
+	else
+		echo "test_ping: unknown PROTO: ${PROTO}"
+		exit 1
+	fi
+	set -e
+
+	if [ "0" != "${RET}" ]; then
+		RET=1
+	fi
+
+	if [ "${EXPECTED}" != "${RET}" ] ; then
+		echo "FAIL: test_ping: ${RET}"
+		exit 1
+	fi
+}
+
+test_egress() {
+	local -r ENCAP=$1
+	echo "starting egress ${ENCAP} encap test"
+	setup
+
+	# need to wait a bit for IPv6 to autoconf, otherwise
+	# ping6 sometimes fails with "unable to bind to address"
+
+	# by default, pings work
+	test_ping IPv4 0
+	test_ping IPv6 0
+
+	# remove NS2->DST routes, pings fail
+	ip -netns ${NS2}    route del ${IPv4_DST}/32  dev veth3
+	ip -netns ${NS2} -6 route del ${IPv6_DST}/128 dev veth3
+	test_ping IPv4 1
+	test_ping IPv6 1
+
+	# install replacement routes (LWT/eBPF), pings succeed
+	if [ "${ENCAP}" == "IPv4" ] ; then
+		ip -netns ${NS1} route add ${IPv4_DST} encap bpf xmit obj test_lwt_ip_encap.o sec encap_gre dev veth1
+		ip -netns ${NS1} -6 route add ${IPv6_DST} encap bpf xmit obj test_lwt_ip_encap.o sec encap_gre dev veth1
+	elif [ "${ENCAP}" == "IPv6" ] ; then
+		ip -netns ${NS1} route add ${IPv4_DST} encap bpf xmit obj test_lwt_ip_encap.o sec encap_gre6 dev veth1
+		ip -netns ${NS1} -6 route add ${IPv6_DST} encap bpf xmit obj test_lwt_ip_encap.o sec encap_gre6 dev veth1
+	else
+		echo "FAIL: unknown encap ${ENCAP}"
+		exit 1
+	fi
+	test_ping IPv4 0
+	test_ping IPv6 0
+
+	cleanup
+	echo "PASS"
+}
+
+test_ingress() {
+	local -r ENCAP=$1
+	echo "starting ingress ${ENCAP} encap test"
+	setup
+
+	# need to wait a bit for IPv6 to autoconf, otherwise
+	# ping6 sometimes fails with "unable to bind to address"
+
+	# by default, pings work
+	test_ping IPv4 0
+	test_ping IPv6 0
+
+	# remove NS2->DST routes, pings fail
+	ip -netns ${NS2}    route del ${IPv4_DST}/32  dev veth3
+	ip -netns ${NS2} -6 route del ${IPv6_DST}/128 dev veth3
+	test_ping IPv4 1
+	test_ping IPv6 1
+
+	# install replacement routes (LWT/eBPF), pings succeed
+	if [ "${ENCAP}" == "IPv4" ] ; then
+		ip -netns ${NS2} route add ${IPv4_DST} encap bpf in obj test_lwt_ip_encap.o sec encap_gre dev veth2
+		ip -netns ${NS2} -6 route add ${IPv6_DST} encap bpf in obj test_lwt_ip_encap.o sec encap_gre dev veth2
+	elif [ "${ENCAP}" == "IPv6" ] ; then
+		ip -netns ${NS2} route add ${IPv4_DST} encap bpf in obj test_lwt_ip_encap.o sec encap_gre6 dev veth2
+		ip -netns ${NS2} -6 route add ${IPv6_DST} encap bpf in obj test_lwt_ip_encap.o sec encap_gre6 dev veth2
+	else
+		echo "FAIL: unknown encap ${ENCAP}"
+		exit 1
+	fi
+	test_ping IPv4 0
+	test_ping IPv6 0
+
+	cleanup
+	echo "PASS"
+}
+
+test_egress IPv4
+test_egress IPv6
+
+test_ingress IPv4
+test_ingress IPv6
+
+echo "all tests passed"
-- 
2.20.1.791.gb4d0f1c61a-goog


^ permalink raw reply related

* Re: [PATCH bpf-next v10 5/7] bpf: add handling of BPF_LWT_REROUTE to lwt_bpf.c
From: Peter Oskolkov @ 2019-02-13 19:57 UTC (permalink / raw)
  To: David Ahern
  Cc: Alexei Starovoitov, Daniel Borkmann, netdev, Peter Oskolkov,
	Willem de Bruijn
In-Reply-To: <681aca28-b4e5-eb0d-46cd-94db7a2c368c@gmail.com>

On Tue, Feb 12, 2019 at 6:58 PM David Ahern <dsahern@gmail.com> wrote:
>
> On 2/12/19 10:32 AM, Peter Oskolkov wrote:
> > @@ -148,6 +174,87 @@ static int xmit_check_hhlen(struct sk_buff *skb)
> >       return 0;
> >  }
> >
> > +static int bpf_lwt_xmit_reroute(struct sk_buff *skb)
> > +{
> > +     struct net_device *l3mdev = l3mdev_master_dev_rcu(skb_dst(skb)->dev);
> > +     int oif = l3mdev ? l3mdev->ifindex : 0;
> > +     struct dst_entry *dst = NULL;
> > +     struct sock *sk;
> > +     struct net *net;
> > +     bool ipv4;
> > +     int err;
> > +
> > +     if (skb->protocol == htons(ETH_P_IP))
> > +             ipv4 = true;
> > +     else if (skb->protocol == htons(ETH_P_IPV6))
> > +             ipv4 = false;
> > +     else
> > +             return -EAFNOSUPPORT;
> > +
> > +     sk = sk_to_full_sk(skb->sk);
> > +     if (sk) {
> > +             if (sk->sk_bound_dev_if)
> > +                     oif = sk->sk_bound_dev_if;
> > +             net = sock_net(sk);
> > +     } else {
> > +             net = dev_net(skb_dst(skb)->dev);
> > +     }
> > +
> > +     if (ipv4) {
> > +             struct iphdr *iph = ip_hdr(skb);
> > +             struct flowi4 fl4 = {};
> > +             struct rtable *rt;
> > +
> > +             fl4.flowi4_oif = oif;
> > +             fl4.flowi4_mark = skb->mark;
> > +             fl4.flowi4_uid = sock_net_uid(net, sk);
> > +             fl4.flowi4_tos = RT_TOS(iph->tos);
> > +             fl4.flowi4_flags = FLOWI_FLAG_ANYSRC;
> > +             fl4.flowi4_proto = iph->protocol;
> > +             fl4.daddr = iph->daddr;
> > +             fl4.saddr = iph->saddr;
> > +
> > +             rt = ip_route_output_key(net, &fl4);
> > +             if (IS_ERR(rt) || rt->dst.error)
> > +                     return -EINVAL;
>
> I think you have a dst leak here if rt is valid but the lookup is a
> reject (e.g., unreachable or blackhole).

Thanks, David! I was not able to reproduce the leak, but based on your
suggestion and similar code elsewhere I made a change in v11 to explicitly
release a dst with error.
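
To make the concern concrete, here is a minimal user-space sketch of the
pattern being discussed. It is not the v11 code: struct toy_dst,
toy_lookup(), toy_release() and toy_reroute() are hypothetical stand-ins,
and the only assumption carried over from the thread is that a lookup can
succeed (returning with a reference held) and still yield a reject route
with an error set.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct dst_entry: only what the sketch needs. */
struct toy_dst {
	int refcnt;	/* references currently held */
	int error;	/* non-zero for reject routes (unreachable, blackhole) */
};

/* Toy lookup: a non-NULL result comes back with a reference held,
 * even when the route itself is a reject route. */
static struct toy_dst *toy_lookup(struct toy_dst *route)
{
	if (route)
		route->refcnt++;
	return route;
}

static void toy_release(struct toy_dst *dst)
{
	if (dst)
		dst->refcnt--;
}

/* Returns 0 and hands the held dst to the caller via *out,
 * or a negative value on failure with no reference left behind. */
static int toy_reroute(struct toy_dst *route, struct toy_dst **out)
{
	struct toy_dst *dst = toy_lookup(route);

	if (!dst)
		return -1;	/* lookup failed outright: nothing to release */

	if (dst->error) {
		int err = dst->error;

		toy_release(dst);	/* drop the reference before bailing out */
		return err;
	}

	*out = dst;	/* caller now owns the reference */
	return 0;
}
```

Without the toy_release() call in the error branch, the reject route's
refcnt would stay at 1 after the function returns, which is the leak
described above.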

>
> > +             dst = &rt->dst;
> > +     } else {
> > +             struct ipv6hdr *iph6 = ipv6_hdr(skb);
> > +             struct flowi6 fl6 = {};
> > +
> > +             fl6.flowi6_oif = oif;
> > +             fl6.flowi6_mark = skb->mark;
> > +             fl6.flowi6_uid = sock_net_uid(net, sk);
> > +             fl6.flowlabel = ip6_flowinfo(iph6);
> > +             fl6.flowi6_proto = iph6->nexthdr;
> > +             fl6.daddr = iph6->daddr;
> > +             fl6.saddr = iph6->saddr;
> > +
> > +             err = ipv6_stub->ipv6_dst_lookup(net, skb->sk, &dst, &fl6);
> > +             if (err || IS_ERR(dst) || dst->error)
> > +                     return -EINVAL;
>
> same here.
>
> You could check this by adding a route with unreachable as the target in
> your tests. Test cleanup and namespace teardown will tell you pretty quick.

^ permalink raw reply

* Re: [PATCH] net: phy: at803x: disable delay only for RGMII mode
From: Niklas Cassel @ 2019-02-13 20:07 UTC (permalink / raw)
  To: Florian Fainelli
  Cc: Marc Gonzalez, Andrew Lunn, Vinod Koul, David S Miller,
	linux-arm-msm, Bjorn Andersson, netdev, Nori, Sekhar,
	Peter Ujfalusi, hkallweit1
In-Reply-To: <34037b72-b082-89fa-f586-8c032ebe5aea@gmail.com>

On Wed, Feb 13, 2019 at 09:59:43AM -0800, Florian Fainelli wrote:
> On 2/13/19 9:40 AM, Niklas Cassel wrote:
> > On Wed, Feb 13, 2019 at 02:40:18PM +0100, Marc Gonzalez wrote:
> >> On 13/02/2019 14:29, Andrew Lunn wrote:
> >>
> >>>> So we have these modes:
> >>>>
> >>>> PHY_INTERFACE_MODE_RGMII: TX and RX delays disabled
> >>>> PHY_INTERFACE_MODE_RGMII_ID: TX and RX delays enabled
> >>>> PHY_INTERFACE_MODE_RGMII_RXID: RX delay enabled, TX delay disabled
> >>>> PHY_INTERFACE_MODE_RGMII_TXID: TX delay enabled, RX delay disabled
> >>>>
> >>>> What I don't like with this patch, is that if we specify phy-mode
> >>>> PHY_INTERFACE_MODE_RGMII_TXID, this patch will enable TX delay,
> >>>> but RX delay will not be explicitly set.
> >>>
> >>> That is not the behaviour we want. It is best to assume the device is
> >>> in a random state, and correctly enable/disable all delays as
> >>> requested. Only leave the hardware alone if PHY_INTERFACE_MODE_NA is
> >>> used.
> >>
> >> That's what my patch did:
> >> https://www.spinics.net/lists/netdev/msg445053.html
> >>
> >> But see Florian's remarks:
> >> https://www.spinics.net/lists/netdev/msg445133.html
> > 
> > Hello Marc,
> > 
> > I saw that comment from Florian. However that was way back in 2017.
> > Maybe the phy-modes were not as well defined back then?
> 
> The definition of the 'phy-mode' was clarified to be understood from the
> perspective of the PHY device (hence the name) after we had several
> fruitful exchanges with Marc (at least from my perspective), but since
> the definition was not clear before, there is a high chance of finding
> DTS/DTBs out there with the 'phy-mode' property understood from the
> MAC's perspective, which would now be wrong.

Hello Florian,


We have a specification:
Documentation/devicetree/bindings/net/ethernet.txt

And several implementations: the PHY drivers.

Either we decide that all PHY drivers have to follow
the specification for "phy-mode" in
Documentation/devicetree/bindings/net/ethernet.txt
or we decide that they don't.

If we decide that all PHY drivers have to follow the specification,
then we can fix the PHY drivers that currently do not follow the
specification.

If we decide that all PHY drivers do not have to follow the spec,
then the "phy-mode" property is basically useless, and then we should
introduce a new device tree property, e.g. "phy-mode2", that is
guaranteed to respect the definitions in
Documentation/devicetree/bindings/net/ethernet.txt

> 
> 
> > 
> > Andrew recently suggested to fix the driver so that it conforms with the
> > phy-modes, and fix any SoC that specified an incorrect phy-mode in DT
> > and thus relied upon the broken behavior of the PHY driver:
> > https://www.spinics.net/lists/netdev/msg445133.html
> > 
> > 
> > So, I've rebased your old patch, see attachment.
> > I suggest that Peter test it on am335x-evm.
> > 
> > am335x-evm appears to rely on the current broken behavior of the PHY
> > driver, so we will probably need to fix the am335x-evm according to this:
> > https://www.spinics.net/lists/netdev/msg445117.html
> > and merge that as well.
> > 
> > 
> > Andrew, Florian, do you both agree?
> 
> In my reply to Marc, there was a concern that while am335x-evm was
> identified and reported to be broken after fixing the PHY driver, there
> could be platforms out there that we have little to no visibility into and
> that would most likely be equally broken. That concern still exists, and I
> don't think there is anything we can do to even assess the size of the
> problem unless we attempt to fix it, so maybe we should attempt to fix that.
> 
> There was a suggestion to Marc that one way to possibly "ignore" an
> incorrectly broken 'phy-mode' property would be to allow specifying
> rx/tx delay properties such that if the driver obtained its
> phy_interface_t, yet still parsed rx/tx delays, the rx/tx delays would
> take precedence, and we could possibly derive some sort of a "more
> correct" phy_interface_t that we could assign back to phydev->interface
> and issue a warning about that.

You mean to add new device tree properties to
Documentation/devicetree/bindings/net/ethernet.txt

- phy-id-tx: "true" if PHY should add internal delay on TX lines;
             "false" or not specified if PHY should not add internal
             delay on TX lines. This property overrides any delay
             requested by "phy-mode".
- phy-id-rx: "true" if PHY should add internal delay on RX lines;
             "false" or not specified if PHY should not add internal
             delay on RX lines. This property overrides any delay
             requested by "phy-mode".

Perhaps something like that?

Personally, I prefer making "phy-mode" strict,
but whatever you guys decide:
- making "phy-mode" strict
- introducing a "phy-mode2"
- introducing "phy-id-tx/phy-id-rx"
- introducing "mac-mode"
- some other solution

It is probably wise to introduce helper functions in phy.h
 phy_wants_id_rx()
 phy_wants_id_tx()
so that PHY drivers can simply use e.g.:

if (phy_wants_id_rx(phydev))
	at803x_enable_rx_delay(phydev);
else
	at803x_disable_rx_delay(phydev);

if (phy_wants_id_tx(phydev))
	at803x_enable_tx_delay(phydev);
else
	at803x_disable_tx_delay(phydev);
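
A rough sketch of what such helpers might look like, assuming they do
nothing more than decode the four RGMII modes listed earlier in this
thread. Everything here is hypothetical: enum toy_phy_mode stands in for
phy_interface_t, and the helpers take the mode directly rather than a
phydev so that the example compiles on its own.

```c
#include <stdbool.h>

/* Hypothetical stand-in for phy_interface_t, limited to the RGMII modes. */
enum toy_phy_mode {
	TOY_MODE_NA,		/* leave the hardware alone */
	TOY_MODE_RGMII,		/* TX and RX delays disabled */
	TOY_MODE_RGMII_ID,	/* TX and RX delays enabled */
	TOY_MODE_RGMII_RXID,	/* RX delay enabled, TX delay disabled */
	TOY_MODE_RGMII_TXID,	/* TX delay enabled, RX delay disabled */
};

/* True if the PHY is asked to add its internal delay on the RX lines. */
static bool toy_wants_id_rx(enum toy_phy_mode mode)
{
	return mode == TOY_MODE_RGMII_ID || mode == TOY_MODE_RGMII_RXID;
}

/* True if the PHY is asked to add its internal delay on the TX lines. */
static bool toy_wants_id_tx(enum toy_phy_mode mode)
{
	return mode == TOY_MODE_RGMII_ID || mode == TOY_MODE_RGMII_TXID;
}
```

With helpers like these, rgmii-txid yields tx = true and rx = false, so a
driver using the if/else pattern above would explicitly disable the RX
delay instead of leaving it in whatever state it was found.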

> 
> Another possible way to resolve that could be to introduce a 'mac-mode'
> property, which must be strictly compatible with specifying a 'phy-mode'
> property. For instance:
> 
> - MAC specifies mac-mode = 'rgmii-id', then the PHY must have phy-mode =
> 'rgmii' since the MAC is taking care of inserting both RX and TX delays,
> reverse also applies
> 
> - MAC specifies mac-mode = 'rgmii-txid', then the PHY must have phy-mode
> = 'rgmii-rxid' because the MAC adds the TX delay, but the PHY should
> insert the delay on the RX lines, reverse also applies
> 
> Because there is usually (not always, DSA is an exception) a 1:1 mapping
> between MAC and PHY devices we could look up the 'mac-mode' property in
> the MAC in the PHY library code and make sure that we have a compatible
> matrix and if we do not, maybe pass something like PHY_INTERFACE_MODE_NA
> such that the driver retains its settings.
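
For reference, the compatibility matrix described above could be sketched
roughly as follows. This is only an illustration of the idea, not a
proposed API: enum toy_rgmii_mode, toy_complement() and
toy_effective_phy_mode() are hypothetical names, and the fallback to an
"NA" mode on a mismatch is taken from the suggestion quoted above.

```c
/* Hypothetical stand-in for the RGMII subset of phy_interface_t. */
enum toy_rgmii_mode {
	TOY_NA,		/* driver retains its current settings */
	TOY_RGMII,	/* no delays added by this side */
	TOY_RGMII_ID,	/* this side adds both RX and TX delays */
	TOY_RGMII_RXID,	/* this side adds the RX delay only */
	TOY_RGMII_TXID,	/* this side adds the TX delay only */
};

/* The mode the other side must use so that each delay is added exactly once. */
static enum toy_rgmii_mode toy_complement(enum toy_rgmii_mode mode)
{
	switch (mode) {
	case TOY_RGMII:
		return TOY_RGMII_ID;
	case TOY_RGMII_ID:
		return TOY_RGMII;
	case TOY_RGMII_RXID:
		return TOY_RGMII_TXID;
	case TOY_RGMII_TXID:
		return TOY_RGMII_RXID;
	default:
		return TOY_NA;
	}
}

/* Returns phy_mode when it is compatible with mac_mode, TOY_NA otherwise. */
static enum toy_rgmii_mode toy_effective_phy_mode(enum toy_rgmii_mode mac_mode,
						  enum toy_rgmii_mode phy_mode)
{
	if (mac_mode != TOY_NA && toy_complement(mac_mode) == phy_mode)
		return phy_mode;
	return TOY_NA;
}
```

In this sketch mac-mode 'rgmii-txid' paired with phy-mode 'rgmii-rxid' is
accepted, while 'rgmii-id' on both sides falls back to TOY_NA.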

Is there any advantage to creating a "mac-mode" over creating a
"phy-mode2"?


Kind regards,
Niklas

> 
> Maybe another way to approach this is if we assume that the PHY comes up
> configured correctly by the boot loader, or upon power on reset, we add
> some PHY driver methods that allow us to determine the RGMII mode in
> which a PHY is and that tells us whether we are compatible with the
> MAC's phy_interface_t upon connection. We check both at connect() time
> and if something does not look right, we flip the meaning of
> phy_interface_t.
> 
> None of those solutions are entirely fool proof, but at least we might
> be able to detect incorrect combinations, yet still make them work by
> reversing the meaning of the 'phy-mode' property given information at hand.
> 
> Let me know if none of that makes sense and this just looks like yet
> another brain dump.
> 
> Wonderful RGMII...
> -- 
> Florian

^ permalink raw reply

* [RFC iproute2] ip route: get: allow zero-length subnet mask
From: Luca Boccassi @ 2019-02-13 20:09 UTC (permalink / raw)
  To: netdev; +Cc: stephen, Luca Boccassi, Clément Hertling

A /0 subnet mask is theoretically valid, but ip route get doesn't allow
it:

$ ip route get 1.0.0.0/0
need at least a destination address

Remove the check so that it can go through:

$ ip/ip route get 1.0.0.0/0
1.0.0.0 via 192.168.1.1 dev eth0 src 192.168.1.91 uid 1000
    cache

Reported-by: Clément Hertling <wxcafe@wxcafe.net>
Signed-off-by: Luca Boccassi <bluca@debian.org>
---
Stephen et al, this was reported by a Debian user:

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=921737

It makes sense to me at a cursory glance, but sending as RFC as I'm
not 100% familiar with the route get function.

 ip/iproute.c | 5 -----
 1 file changed, 5 deletions(-)

diff --git a/ip/iproute.c b/ip/iproute.c
index 5f58a3b3..d78f43d8 100644
--- a/ip/iproute.c
+++ b/ip/iproute.c
@@ -2041,11 +2041,6 @@ static int iproute_get(int argc, char **argv)
 		argc--; argv++;
 	}
 
-	if (req.r.rtm_dst_len == 0) {
-		fprintf(stderr, "need at least a destination address\n");
-		return -1;
-	}
-
 	if (idev || odev)  {
 		int idx;
 
-- 
2.20.1


^ permalink raw reply related

* Re: [PATCH bpf-next v10 5/7] bpf: add handling of BPF_LWT_REROUTE to lwt_bpf.c
From: David Ahern @ 2019-02-13 20:11 UTC (permalink / raw)
  To: Peter Oskolkov
  Cc: Alexei Starovoitov, Daniel Borkmann, netdev, Peter Oskolkov,
	Willem de Bruijn
In-Reply-To: <CAPNVh5eFMaXAdbhkn3Le5eQ-ZYaf2kWjKAxf4dfW9tYhyyXAKQ@mail.gmail.com>

On 2/13/19 12:57 PM, Peter Oskolkov wrote:
> Thanks, David! I was not able to reproduce the leak, but based on your
> suggestion and similar code elsewhere I made a change in v11 to explicitly
> release a dst with error.

ok. Did you run the test with a debug kernel - checking refcount, use
after free, etc?

^ permalink raw reply

* Re: [RFC iproute2] ip route: get: allow zero-length subnet mask
From: Stephen Hemminger @ 2019-02-13 20:37 UTC (permalink / raw)
  To: Luca Boccassi; +Cc: netdev, Clément Hertling
In-Reply-To: <20190213200954.32271-1-bluca@debian.org>

On Wed, 13 Feb 2019 20:09:53 +0000
Luca Boccassi <bluca@debian.org> wrote:

> A /0 subnet mask is theoretically valid, but ip route get doesn't allow
> it:
> 
> $ ip route get 1.0.0.0/0
> need at least a destination address
> 
> Remove the check so that it can go through:
> 
> $ ip/ip route get 1.0.0.0/0
> 1.0.0.0 via 192.168.1.1 dev eth0 src 192.168.1.91 uid 1000
>     cache
> 
> Reported-by: Clément Hertling <wxcafe@wxcafe.net>
> Signed-off-by: Luca Boccassi <bluca@debian.org>
> ---
> Stephen et al, this was reported by a Debian user:
> 
> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=921737
> 
> It makes sense to me at a cursory glance, but sending as RFC as I'm
> not 100% familiar with the route get function.
> 
>  ip/iproute.c | 5 -----
>  1 file changed, 5 deletions(-)
> 
> diff --git a/ip/iproute.c b/ip/iproute.c
> index 5f58a3b3..d78f43d8 100644
> --- a/ip/iproute.c
> +++ b/ip/iproute.c
> @@ -2041,11 +2041,6 @@ static int iproute_get(int argc, char **argv)
>  		argc--; argv++;
>  	}
>  
> -	if (req.r.rtm_dst_len == 0) {
> -		fprintf(stderr, "need at least a destination address\n");
> -		return -1;
> -	}
> -
>  	if (idev || odev)  {
>  		int idx;
>  

You still need a way to report an error for:
	ip route get
(i.e., when no address is present)
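
One way to do that, sketched here as a self-contained toy rather than as a
patch against ip/iproute.c, is to track whether a destination argument was
seen at all instead of inferring it from the prefix length.
toy_route_get_check() and its simplified argument handling ("from", "iif"
and "oif" each consume one value, anything else is the destination) are
assumptions made for illustration only.

```c
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Returns 0 if a destination was supplied and stores its prefix length
 * in *plen (32 when no "/len" suffix is given); returns -1 if no
 * destination was supplied at all. A "/0" prefix is accepted.
 */
static int toy_route_get_check(int argc, char **argv, int *plen)
{
	bool address_found = false;

	for (int i = 0; i < argc; i++) {
		/* simplified: these keywords consume the following argument */
		if (!strcmp(argv[i], "from") || !strcmp(argv[i], "iif") ||
		    !strcmp(argv[i], "oif")) {
			i++;
			continue;
		}

		/* anything else is treated as the destination prefix */
		const char *slash = strchr(argv[i], '/');

		*plen = slash ? atoi(slash + 1) : 32;
		address_found = true;
	}

	if (!address_found) {
		fprintf(stderr, "need at least a destination address\n");
		return -1;
	}
	return 0;
}
```

With a flag like this, "1.0.0.0/0" passes with a prefix length of 0, while
an empty argument list still produces the existing error message.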

^ permalink raw reply

* Re: [PATCH bpf-next v10 5/7] bpf: add handling of BPF_LWT_REROUTE to lwt_bpf.c
From: Peter Oskolkov @ 2019-02-13 20:41 UTC (permalink / raw)
  To: David Ahern
  Cc: Alexei Starovoitov, Daniel Borkmann, netdev, Peter Oskolkov,
	Willem de Bruijn
In-Reply-To: <80849fb5-c5de-ce6b-6c25-bd152326196c@gmail.com>

On Wed, Feb 13, 2019 at 12:11 PM David Ahern <dsahern@gmail.com> wrote:
>
> On 2/13/19 12:57 PM, Peter Oskolkov wrote:
> > Thanks, David! I was not able to reproduce the leak, but based on your
> > suggestion and similar code elsewhere I made a change in v11 to explicitly
> > release a dst with error.
>
> ok. Did you run the test with a debug kernel - checking refcount, use
> after free, etc?

In my tests I was always getting ERR_PTR for unroutable packets,
not a full rt/dst with an error flag set. But I checked several
similar route lookups, and they all release bad dsts, so I did not
feel it was worth investigating further.

^ permalink raw reply
