Netdev List

Netdev List
 help / color / mirror / Atom feed

* [PATCH bpf-next 2/2] tools/bpf: add log_level to bpf_load_program_attr
From: Yonghong Song @ 2019-02-05 19:48 UTC (permalink / raw)
  To: netdev; +Cc: Alexei Starovoitov, Daniel Borkmann, kernel-team, Yonghong Song
In-Reply-To: <20190205194821.1804747-1-yhs@fb.com>

The kernel verifier has three levels of logs:
    0: no logs
    1: logs mostly useful
  > 1: verbose

Current libbpf API functions bpf_load_program_xattr() and
bpf_load_program() cannot specify log_level.
The bcc, however, provides an interface for user to
specify log_level 2 for verbose output.

This patch added log_level into structure
bpf_load_program_attr, so users, including bcc, can use
bpf_load_program_xattr() to change log_level.

The bpf selftest test_sock.c is modified to enable log_level = 2.
If the "verbose" in test_sock.c is changed to true,
the test will output logs like below:
  $ ./test_sock
  func#0 @0
  0: R1=ctx(id=0,off=0,imm=0) R10=fp0,call_-1
  0: (bf) r6 = r1
  1: R1=ctx(id=0,off=0,imm=0) R6_w=ctx(id=0,off=0,imm=0) R10=fp0,call_-1
  1: (61) r7 = *(u32 *)(r6 +28)
  invalid bpf_context access off=28 size=4

  Test case: bind4 load with invalid access: src_ip6 .. [PASS]
  ...
  Test case: bind6 allow all .. [PASS]
  Summary: 16 PASSED, 0 FAILED

Some test_sock tests are negative tests and verbose verifier
log will be printed out as shown in the above.

Signed-off-by: Yonghong Song <yhs@fb.com>
---
 tools/lib/bpf/bpf.c                     | 4 +++-
 tools/lib/bpf/bpf.h                     | 1 +
 tools/testing/selftests/bpf/test_sock.c | 9 ++++++++-
 3 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
index 3defad77dc7a..78aa8c2b1a5c 100644
--- a/tools/lib/bpf/bpf.c
+++ b/tools/lib/bpf/bpf.c
@@ -214,6 +214,7 @@ int bpf_load_program_xattr(const struct bpf_load_program_attr *load_attr,
 {
 	void *finfo = NULL, *linfo = NULL;
 	union bpf_attr attr;
+	__u32 log_level;
 	__u32 name_len;
 	int fd;
 
@@ -292,7 +293,8 @@ int bpf_load_program_xattr(const struct bpf_load_program_attr *load_attr,
 	/* Try again with log */
 	attr.log_buf = ptr_to_u64(log_buf);
 	attr.log_size = log_buf_sz;
-	attr.log_level = 1;
+	log_level = load_attr->log_level;
+	attr.log_level = (log_level >= 2) ? log_level : 1;
 	log_buf[0] = 0;
 	fd = sys_bpf_prog_load(&attr, sizeof(attr));
 done:
diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h
index ed09eed2dc3b..15a8e22e8eae 100644
--- a/tools/lib/bpf/bpf.h
+++ b/tools/lib/bpf/bpf.h
@@ -76,6 +76,7 @@ struct bpf_load_program_attr {
 	const struct bpf_insn *insns;
 	size_t insns_cnt;
 	const char *license;
+	__u32 log_level;
 	__u32 kern_version;
 	__u32 prog_ifindex;
 	__u32 prog_btf_fd;
diff --git a/tools/testing/selftests/bpf/test_sock.c b/tools/testing/selftests/bpf/test_sock.c
index 561ffb6d6433..fb679ac3d4b0 100644
--- a/tools/testing/selftests/bpf/test_sock.c
+++ b/tools/testing/selftests/bpf/test_sock.c
@@ -20,6 +20,7 @@
 #define MAX_INSNS	512
 
 char bpf_log_buf[BPF_LOG_BUF_SIZE];
+static bool verbose = false;
 
 struct sock_test {
 	const char *descr;
@@ -325,6 +326,7 @@ static int load_sock_prog(const struct bpf_insn *prog,
 			  enum bpf_attach_type attach_type)
 {
 	struct bpf_load_program_attr attr;
+	int ret;
 
 	memset(&attr, 0, sizeof(struct bpf_load_program_attr));
 	attr.prog_type = BPF_PROG_TYPE_CGROUP_SOCK;
@@ -332,8 +334,13 @@ static int load_sock_prog(const struct bpf_insn *prog,
 	attr.insns = prog;
 	attr.insns_cnt = probe_prog_length(attr.insns);
 	attr.license = "GPL";
+	attr.log_level = 2;
 
-	return bpf_load_program_xattr(&attr, bpf_log_buf, BPF_LOG_BUF_SIZE);
+	ret = bpf_load_program_xattr(&attr, bpf_log_buf, BPF_LOG_BUF_SIZE);
+	if (verbose && ret < 0)
+		fprintf(stderr, "%s\n", bpf_log_buf);
+
+	return ret;
 }
 
 static int attach_sock_prog(int cgfd, int progfd,
-- 
2.17.1


^ permalink raw reply related

* [PATCH bpf-next 0/2] tools/btf: extends libbpf APIs to work with btf w/o kernel
From: Andrii Nakryiko @ 2019-02-05 19:48 UTC (permalink / raw)
  To: songliubraving, yhs, ast, kafai, netdev, daniel, andrii.nakryiko
  Cc: Andrii Nakryiko

This patchset changes existing btf__new() API call to only load and initialize
struct btf, while exposing new btf__load() API to attempt to load and validate
BTF in kernel. It also adds ability to copy raw BTF data out of struct btf for
further processing by external applications.

This makes utilizing libbpf's APIs that don't require kernel facilities (e.g.,
btf_dedup) simpler and more natural from external application.

Andrii Nakryiko (2):
  btf: separate btf creation and loading
  btf: expose API to work with raw btf data

 tools/lib/bpf/btf.c      | 63 +++++++++++++++++++++++++---------------
 tools/lib/bpf/btf.h      |  3 ++
 tools/lib/bpf/libbpf.c   |  2 +-
 tools/lib/bpf/libbpf.map |  3 ++
 4 files changed, 46 insertions(+), 25 deletions(-)

-- 
2.17.1

^ permalink raw reply

* [PATCH bpf-next 2/2] btf: expose API to work with raw btf data
From: Andrii Nakryiko @ 2019-02-05 19:48 UTC (permalink / raw)
  To: songliubraving, yhs, ast, kafai, netdev, daniel, andrii.nakryiko
  Cc: Andrii Nakryiko
In-Reply-To: <20190205194856.967463-1-andriin@fb.com>

This patch exposes two new APIs btf__get_raw_data_size() and
btf__get_raw_data() that allows to get a copy of raw BTF data out of
struct btf. This is useful for external programs that need to manipulate
raw data, e.g., pahole using btf__dedup() to deduplicate BTF type info
and then writing it back to file.

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
---
 tools/lib/bpf/btf.c      | 10 ++++++++++
 tools/lib/bpf/btf.h      |  2 ++
 tools/lib/bpf/libbpf.map |  2 ++
 3 files changed, 14 insertions(+)

diff --git a/tools/lib/bpf/btf.c b/tools/lib/bpf/btf.c
index 065d51fa63e5..6491d440c9a7 100644
--- a/tools/lib/bpf/btf.c
+++ b/tools/lib/bpf/btf.c
@@ -437,6 +437,16 @@ int btf__fd(const struct btf *btf)
 	return btf->fd;
 }
 
+__u32 btf__get_raw_data_size(const struct btf *btf)
+{
+	return btf->data_size;
+}
+
+void btf__get_raw_data(const struct btf *btf, char *data)
+{
+	memcpy(data, btf->data, btf->data_size);
+}
+
 void btf__get_strings(const struct btf *btf, const char **strings,
 		      __u32 *str_len)
 {
diff --git a/tools/lib/bpf/btf.h b/tools/lib/bpf/btf.h
index e8410887f93a..d46f680b9416 100644
--- a/tools/lib/bpf/btf.h
+++ b/tools/lib/bpf/btf.h
@@ -66,6 +66,8 @@ LIBBPF_API const struct btf_type *btf__type_by_id(const struct btf *btf,
 LIBBPF_API __s64 btf__resolve_size(const struct btf *btf, __u32 type_id);
 LIBBPF_API int btf__resolve_type(const struct btf *btf, __u32 type_id);
 LIBBPF_API int btf__fd(const struct btf *btf);
+LIBBPF_API __u32 btf__get_raw_data_size(const struct btf *btf);
+LIBBPF_API void btf__get_raw_data(const struct btf *btf, char *data);
 LIBBPF_API void btf__get_strings(const struct btf *btf, const char **strings,
 				 __u32 *str_len);
 LIBBPF_API const char *btf__name_by_offset(const struct btf *btf, __u32 offset);
diff --git a/tools/lib/bpf/libbpf.map b/tools/lib/bpf/libbpf.map
index ffa1fe044f6a..873a26cd714d 100644
--- a/tools/lib/bpf/libbpf.map
+++ b/tools/lib/bpf/libbpf.map
@@ -137,6 +137,8 @@ LIBBPF_0.0.2 {
 		btf__load;
 		btf__get_map_kv_tids;
 		btf__get_nr_types;
+		btf__get_raw_data;
+		btf__get_raw_data_size;
 		btf__get_strings;
 		btf_ext__free;
 		btf_ext__func_info_rec_size;
-- 
2.17.1


^ permalink raw reply related

* [PATCH bpf-next 1/2] btf: separate btf creation and loading
From: Andrii Nakryiko @ 2019-02-05 19:48 UTC (permalink / raw)
  To: songliubraving, yhs, ast, kafai, netdev, daniel, andrii.nakryiko
  Cc: Andrii Nakryiko
In-Reply-To: <20190205194856.967463-1-andriin@fb.com>

This change splits out previous btf__new functionality of constructing
struct btf and loading it into kernel into two:
- btf__new() just creates and initializes struct btf
- btf__load() attempts to load existing struct btf into kernel

btf__free will still close BTF fd, if it was ever loaded successfully
into kernel.

This change allows users of libbpf to manipulate BTF using its API,
without the need to unnecessarily load it into kernel.

One of the intended use cases is pahole using libbpf to do DWARF to BTF
conversion and deduplication using libbpf, while handling ELF sections
overwrites and other concerns on its own.

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
---
 tools/lib/bpf/btf.c      | 53 ++++++++++++++++++++++------------------
 tools/lib/bpf/btf.h      |  1 +
 tools/lib/bpf/libbpf.c   |  2 +-
 tools/lib/bpf/libbpf.map |  1 +
 4 files changed, 32 insertions(+), 25 deletions(-)

diff --git a/tools/lib/bpf/btf.c b/tools/lib/bpf/btf.c
index 4949f8840bda..065d51fa63e5 100644
--- a/tools/lib/bpf/btf.c
+++ b/tools/lib/bpf/btf.c
@@ -366,8 +366,6 @@ void btf__free(struct btf *btf)
 
 struct btf *btf__new(__u8 *data, __u32 size)
 {
-	__u32 log_buf_size = 0;
-	char *log_buf = NULL;
 	struct btf *btf;
 	int err;
 
@@ -377,15 +375,6 @@ struct btf *btf__new(__u8 *data, __u32 size)
 
 	btf->fd = -1;
 
-	log_buf = malloc(BPF_LOG_BUF_SIZE);
-	if (!log_buf) {
-		err = -ENOMEM;
-		goto done;
-	}
-
-	*log_buf = 0;
-	log_buf_size = BPF_LOG_BUF_SIZE;
-
 	btf->data = malloc(size);
 	if (!btf->data) {
 		err = -ENOMEM;
@@ -395,17 +384,6 @@ struct btf *btf__new(__u8 *data, __u32 size)
 	memcpy(btf->data, data, size);
 	btf->data_size = size;
 
-	btf->fd = bpf_load_btf(btf->data, btf->data_size,
-			       log_buf, log_buf_size, false);
-
-	if (btf->fd == -1) {
-		err = -errno;
-		pr_warning("Error loading BTF: %s(%d)\n", strerror(errno), errno);
-		if (log_buf && *log_buf)
-			pr_warning("%s\n", log_buf);
-		goto done;
-	}
-
 	err = btf_parse_hdr(btf);
 	if (err)
 		goto done;
@@ -417,8 +395,6 @@ struct btf *btf__new(__u8 *data, __u32 size)
 	err = btf_parse_type_sec(btf);
 
 done:
-	free(log_buf);
-
 	if (err) {
 		btf__free(btf);
 		return ERR_PTR(err);
@@ -427,6 +403,35 @@ struct btf *btf__new(__u8 *data, __u32 size)
 	return btf;
 }
 
+int btf__load(struct btf* btf) {
+	__u32 log_buf_size = BPF_LOG_BUF_SIZE;
+	char *log_buf = NULL;
+
+	if (btf->fd >= 0) {
+		return -EEXIST;
+	}
+
+	log_buf = malloc(log_buf_size);
+	if (!log_buf)
+		return -ENOMEM;
+
+	*log_buf = 0;
+
+	btf->fd = bpf_load_btf(btf->data, btf->data_size,
+			       log_buf, log_buf_size, false);
+	if (btf->fd < 0) {
+		btf->fd = -errno;
+		pr_warning("Error loading BTF: %s(%d)\n", strerror(errno), errno);
+		if (*log_buf)
+			pr_warning("%s\n", log_buf);
+		goto done;
+	}
+
+done:
+	free(log_buf);
+	return btf->fd;
+}
+
 int btf__fd(const struct btf *btf)
 {
 	return btf->fd;
diff --git a/tools/lib/bpf/btf.h b/tools/lib/bpf/btf.h
index 25a9d2db035d..e8410887f93a 100644
--- a/tools/lib/bpf/btf.h
+++ b/tools/lib/bpf/btf.h
@@ -57,6 +57,7 @@ struct btf_ext_header {
 
 LIBBPF_API void btf__free(struct btf *btf);
 LIBBPF_API struct btf *btf__new(__u8 *data, __u32 size);
+LIBBPF_API int btf__load(struct btf* btf);
 LIBBPF_API __s32 btf__find_by_name(const struct btf *btf,
 				   const char *type_name);
 LIBBPF_API __u32 btf__get_nr_types(const struct btf *btf);
diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 47969aa0faf8..75b82c1cfc72 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -835,7 +835,7 @@ static int bpf_object__elf_collect(struct bpf_object *obj, int flags)
 			obj->efile.maps_shndx = idx;
 		else if (strcmp(name, BTF_ELF_SEC) == 0) {
 			obj->btf = btf__new(data->d_buf, data->d_size);
-			if (IS_ERR(obj->btf)) {
+			if (IS_ERR(obj->btf) || btf__load(obj->btf) < 0) {
 				pr_warning("Error loading ELF section %s: %ld. Ignored and continue.\n",
 					   BTF_ELF_SEC, PTR_ERR(obj->btf));
 				obj->btf = NULL;
diff --git a/tools/lib/bpf/libbpf.map b/tools/lib/bpf/libbpf.map
index 89c1149e32ee..ffa1fe044f6a 100644
--- a/tools/lib/bpf/libbpf.map
+++ b/tools/lib/bpf/libbpf.map
@@ -134,6 +134,7 @@ LIBBPF_0.0.2 {
 		bpf_object__find_map_fd_by_name;
 		bpf_get_link_xdp_id;
 		btf__dedup;
+		btf__load;
 		btf__get_map_kv_tids;
 		btf__get_nr_types;
 		btf__get_strings;
-- 
2.17.1


^ permalink raw reply related

* Re: [PATCH] ipmr: ip6mr: Create new sockopt to clear mfc cache only
From: Nikolay Aleksandrov @ 2019-02-05 19:52 UTC (permalink / raw)
  To: Callum Sinclair, davem, kuznet, yoshfuji, netdev, linux-kernel,
	Roopa Prabhu
In-Reply-To: <20190205025800.17185-2-callum.sinclair@alliedtelesis.co.nz>

On 05/02/2019 04:58, Callum Sinclair wrote:
> Currently the only way to clear the mfc cache was to delete the entries
> one by one using the MRT_DEL_MFC socket option or to destroy and
> recreate the socket.
> 
> Create a new socket option which will clear the multicast forwarding
> cache on the socket without destroying the socket.
> 
> Signed-off-by: Callum Sinclair <callum.sinclair@alliedtelesis.co.nz>
> ---
>  include/uapi/linux/mroute.h  |  3 ++-
>  include/uapi/linux/mroute6.h |  3 ++-
>  net/ipv4/ipmr.c              | 40 +++++++++++++++++++++----------
>  net/ipv6/ip6mr.c             | 46 +++++++++++++++++++++++-------------
>  4 files changed, 61 insertions(+), 31 deletions(-)
> 

Hi,
Why don't you do it per-table ? That is - use the selected table and flush
only its entries, I think that would be more useful. Also would be nice to
make the interface flush optional, e.g. depending on a user-supplied value.
Some people would like to flush only the MFC entries without deleting the
devices. Since it's called DEL_MFC_ALL, I'd expect this call to act only
on the MFC entries and not the device list, but it also flushes that.
I'd rename it to MRT_FLUSH with an argument which chooses to flush MFCs +
VIFs based on flags (so you could flush either one separately).

Thanks,
 Nik

> diff --git a/include/uapi/linux/mroute.h b/include/uapi/linux/mroute.h
> index 5d37a9ccce63..8a0beb885cd9 100644
> --- a/include/uapi/linux/mroute.h
> +++ b/include/uapi/linux/mroute.h
> @@ -28,7 +28,8 @@
>  #define MRT_TABLE	(MRT_BASE+9)	/* Specify mroute table ID		*/
>  #define MRT_ADD_MFC_PROXY	(MRT_BASE+10)	/* Add a (*,*|G) mfc entry	*/
>  #define MRT_DEL_MFC_PROXY	(MRT_BASE+11)	/* Del a (*,*|G) mfc entry	*/
> -#define MRT_MAX		(MRT_BASE+11)
> +#define MRT_DEL_MFC_ALL		(MRT_BASE+12)	/* Del all multicast entries	*/
> +#define MRT_MAX		(MRT_BASE+12)
>  
>  #define SIOCGETVIFCNT	SIOCPROTOPRIVATE	/* IP protocol privates */
>  #define SIOCGETSGCNT	(SIOCPROTOPRIVATE+1)
> diff --git a/include/uapi/linux/mroute6.h b/include/uapi/linux/mroute6.h
> index 9999cc006390..7def70cdf571 100644
> --- a/include/uapi/linux/mroute6.h
> +++ b/include/uapi/linux/mroute6.h
> @@ -31,7 +31,8 @@
>  #define MRT6_TABLE	(MRT6_BASE+9)	/* Specify mroute table ID		*/
>  #define MRT6_ADD_MFC_PROXY	(MRT6_BASE+10)	/* Add a (*,*|G) mfc entry	*/
>  #define MRT6_DEL_MFC_PROXY	(MRT6_BASE+11)	/* Del a (*,*|G) mfc entry	*/
> -#define MRT6_MAX	(MRT6_BASE+11)
> +#define MRT6_DEL_MFC_ALL	(MRT6_BASE+12)	/* Del all multicast entries	*/
> +#define MRT6_MAX	(MRT6_BASE+12)
>  
>  #define SIOCGETMIFCNT_IN6	SIOCPROTOPRIVATE	/* IP protocol privates */
>  #define SIOCGETSGCNT_IN6	(SIOCPROTOPRIVATE+1)
> diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
> index ddbf8c9a1abb..ccfeebd38e1a 100644
> --- a/net/ipv4/ipmr.c
> +++ b/net/ipv4/ipmr.c
> @@ -1298,22 +1298,12 @@ static int ipmr_mfc_add(struct net *net, struct mr_table *mrt,
>  	return 0;
>  }
>  
> -/* Close the multicast socket, and clear the vif tables etc */
> -static void mroute_clean_tables(struct mr_table *mrt, bool all)
> +/* Clear the vif tables */
> +static void mroute_clean_cache(struct mr_table *mrt, bool all)
>  {
>  	struct net *net = read_pnet(&mrt->net);
> -	struct mr_mfc *c, *tmp;
>  	struct mfc_cache *cache;
> -	LIST_HEAD(list);
> -	int i;
> -
> -	/* Shut down all active vif entries */
> -	for (i = 0; i < mrt->maxvif; i++) {
> -		if (!all && (mrt->vif_table[i].flags & VIFF_STATIC))
> -			continue;
> -		vif_delete(mrt, i, 0, &list);
> -	}
> -	unregister_netdevice_many(&list);
> +	struct mr_mfc *c, *tmp;
>  
>  	/* Wipe the cache */
>  	list_for_each_entry_safe(c, tmp, &mrt->mfc_cache_list, list) {
> @@ -1340,6 +1330,23 @@ static void mroute_clean_tables(struct mr_table *mrt, bool all)
>  	}
>  }
>  
> +/* Close the multicast socket, and clear the vif tables etc */
> +static void mroute_clean_tables(struct mr_table *mrt, bool all)
> +{
> +	LIST_HEAD(list);
> +	int i;
> +
> +	/* Shut down all active vif entries */
> +	for (i = 0; i < mrt->maxvif; i++) {
> +		if (!all && (mrt->vif_table[i].flags & VIFF_STATIC))
> +			continue;
> +		vif_delete(mrt, i, 0, &list);
> +	}
> +	unregister_netdevice_many(&list);
> +
> +	mroute_clean_cache(mrt, all);
> +}
> +
>  /* called from ip_ra_control(), before an RCU grace period,
>   * we dont need to call synchronize_rcu() here
>   */
> @@ -1482,6 +1489,13 @@ int ip_mroute_setsockopt(struct sock *sk, int optname, char __user *optval,
>  					   sk == rtnl_dereference(mrt->mroute_sk),
>  					   parent);
>  		break;
> +	case MRT_DEL_MFC_ALL:
> +		rtnl_lock();
> +		ipmr_for_each_table(mrt, net) {
> +			mroute_clean_cache(mrt, true);
> +		}
> +		rtnl_unlock();
> +		break;
>  	/* Control PIM assert. */
>  	case MRT_ASSERT:
>  		if (optlen != sizeof(val)) {
> diff --git a/net/ipv6/ip6mr.c b/net/ipv6/ip6mr.c
> index 30337b38274b..0168420d217b 100644
> --- a/net/ipv6/ip6mr.c
> +++ b/net/ipv6/ip6mr.c
> @@ -1492,25 +1492,11 @@ static int ip6mr_mfc_add(struct net *net, struct mr_table *mrt,
>  	return 0;
>  }
>  
> -/*
> - *	Close the multicast socket, and clear the vif tables etc
> - */
> -
> -static void mroute_clean_tables(struct mr_table *mrt, bool all)
> +/* Clear the vif tables */
> +static void mroute_clean_cache(struct mr_table *mrt, bool all)
>  {
>  	struct mr_mfc *c, *tmp;
> -	LIST_HEAD(list);
> -	int i;
> -
> -	/* Shut down all active vif entries */
> -	for (i = 0; i < mrt->maxvif; i++) {
> -		if (!all && (mrt->vif_table[i].flags & VIFF_STATIC))
> -			continue;
> -		mif6_delete(mrt, i, 0, &list);
> -	}
> -	unregister_netdevice_many(&list);
>  
> -	/* Wipe the cache */
>  	list_for_each_entry_safe(c, tmp, &mrt->mfc_cache_list, list) {
>  		if (!all && (c->mfc_flags & MFC_STATIC))
>  			continue;
> @@ -1536,6 +1522,27 @@ static void mroute_clean_tables(struct mr_table *mrt, bool all)
>  	}
>  }
>  
> +/*
> + *	Close the multicast socket, and clear the vif tables etc
> + */
> +
> +static void mroute_clean_tables(struct mr_table *mrt, bool all)
> +{
> +	LIST_HEAD(list);
> +	int i;
> +
> +	/* Shut down all active vif entries */
> +	for (i = 0; i < mrt->maxvif; i++) {
> +		if (!all && (mrt->vif_table[i].flags & VIFF_STATIC))
> +			continue;
> +		mif6_delete(mrt, i, 0, &list);
> +	}
> +	unregister_netdevice_many(&list);
> +
> +	/* Wipe the cache */
> +	mroute_clean_cache(mrt, all);
> +}
> +
>  static int ip6mr_sk_init(struct mr_table *mrt, struct sock *sk)
>  {
>  	int err = 0;
> @@ -1703,6 +1710,13 @@ int ip6_mroute_setsockopt(struct sock *sk, int optname, char __user *optval, uns
>  					    parent);
>  		rtnl_unlock();
>  		return ret;
> +	case MRT6_DEL_MFC_ALL:
> +		rtnl_lock();
> +		ip6mr_for_each_table(mrt, net) {
> +			mroute_clean_cache(mrt, true);
> +		}
> +		rtnl_unlock();
> +		return 0;
>  
>  	/*
>  	 *	Control PIM assert (to activate pim will activate assert)
> 


^ permalink raw reply

* Re: [PATCH v2] net: dsa: mv88e6xxx: Revise irq setup ordering
From: Andrew Lunn @ 2019-02-05 19:54 UTC (permalink / raw)
  To: John David Anglin; +Cc: Russell King, Vivien Didelot, Florian Fainelli, netdev
In-Reply-To: <964a7e63-82b5-1b21-830e-ec5c7f1eb917@bell.net>

> My hunch is the second hunk of the original patch will fix this.

O.K.

But please fully explain how the race happens and how the fix fully
fixes it. I don't really want to add code which just makes the race
window smaller, but the race still happens. That helps nobody.

In the end, i think you will end up polling. If so, feel free to
submit a patch which makes the poll interval build time configurable.

       Andrew

^ permalink raw reply

* Re: [PATCH net] net: defxx: replace dev_kfree_skb_irq by dev_consume_skb_irq for drop profiles
From: Maciej W. Rozycki @ 2019-02-05 19:57 UTC (permalink / raw)
  To: Yang Wei; +Cc: netdev, David S. Miller, yang.wei9
In-Reply-To: <1549382464-5138-1-git-send-email-albin_yang@163.com>

On Wed, 6 Feb 2019, Yang Wei wrote:

> From: Yang Wei <yang.wei9@zte.com.cn>
> 
> dev_consume_skb_irq() should be called in dfx_xmt_done() when skb
> xmit done. It makes drop profiles(dropwatch, perf) more friendly.
> 
> Signed-off-by: Yang Wei <yang.wei9@zte.com.cn>

Reviewed-by: Maciej W. Rozycki <macro@linux-mips.org>

 It looks to me the driver has to be reviewed WRT `dev_kfree_skb' vs 
`kfree_skb' use too.  I'll have a look into it unless you are happy to do 
that.

 Thanks for your contribution!

  Maciej

^ permalink raw reply

* RE: lan78xx: WARNING: irq 79 handler enabled interrupts
From: Stefan Wahren @ 2019-02-05 19:57 UTC (permalink / raw)
  To: netdev; +Cc: eric, UNGLinuxDriver, marc.zyngier, linux-arm-kernel, Woojung.Huh
In-Reply-To: <958035754.310420.1546378307673@email.ionos.de>

Hi,

> Stefan Wahren <stefan.wahren@i2se.com> hat am 1. Januar 2019 um 22:31 geschrieben:
> 
> 
> Hi Woojung,
> 
> > Woojung.Huh@microchip.com hat am 30. Dezember 2018 um 04:25 geschrieben:
> > 
> > 
> > HI Marc & Stephen,
> > 
> > Most of engineers are out until New Year's Day.
> 
> thanks. I didn't expect a reply that fast.
> 
> >  
> > LAN78xx driver uses irq_domain for phy interrupt, but smsc95xx uses polling.
> > Need to check flow again, you can try that comment out "lan78xx_setup_irq_domain" to
> > make dev->domain_data.phyirq = 0 which forces PHY polling.
> 
> I tested your suggestion with multi_v7_defconfig (32 bit) and arm64/defconfig.
> The warning disappeared and Ethernet is still working.
> 
> Only the old issue that we can't receive until a first packet has been send out reappear. But this should be manageable.
> 

i got informed that the engineers are busy with other issues and come back later to this :-(

Since i'm getting requests to provide my PHY polling patch, here it is:

diff --git a/drivers/net/usb/lan78xx.c b/drivers/net/usb/lan78xx.c
index e96bc0c..a5bb292 100644
--- a/drivers/net/usb/lan78xx.c
+++ b/drivers/net/usb/lan78xx.c
@@ -2982,13 +2982,6 @@ static int lan78xx_bind(struct lan78xx_net *dev, struct usb_interface *intf)
 
 	dev->net->hw_features = dev->net->features;
 
-	ret = lan78xx_setup_irq_domain(dev);
-	if (ret < 0) {
-		netdev_warn(dev->net,
-			    "lan78xx_setup_irq_domain() failed : %d", ret);
-		goto out1;
-	}
-
 	dev->net->hard_header_len += TX_OVERHEAD;
 	dev->hard_mtu = dev->net->mtu + dev->net->hard_header_len;
 
@@ -2996,13 +2989,13 @@ static int lan78xx_bind(struct lan78xx_net *dev, struct usb_interface *intf)
 	ret = lan78xx_reset(dev);
 	if (ret) {
 		netdev_warn(dev->net, "Registers INIT FAILED....");
-		goto out2;
+		goto out1;
 	}
 
 	ret = lan78xx_mdio_init(dev);
 	if (ret) {
 		netdev_warn(dev->net, "MDIO INIT FAILED.....");
-		goto out2;
+		goto out1;
 	}
 
 	dev->net->flags |= IFF_MULTICAST;
@@ -3011,9 +3004,6 @@ static int lan78xx_bind(struct lan78xx_net *dev, struct usb_interface *intf)
 
 	return ret;
 
-out2:
-	lan78xx_remove_irq_domain(dev);
-
 out1:
 	netdev_warn(dev->net, "Bind routine FAILED");
 	cancel_work_sync(&pdata->set_multicast);

^ permalink raw reply related

* Re: [PATCH v2 2/2] r8169: Avoid pointer aliasing
From: Eric Dumazet @ 2019-02-05 20:04 UTC (permalink / raw)
  To: Joe Perches, David Miller, thierry.reding
  Cc: hkallweit1, andrew, nic_swsd, netdev, linux-kernel
In-Reply-To: <8553086eaec167846992f1cff12aa388cee81767.camel@perches.com>



On 02/05/2019 10:42 AM, Joe Perches wrote:
> 
> It's declared after a pointer so it is already is 2 byte aligned.
> 
> A lot of drivers wouldn't work otherwise.

Maybe these drivers are only used on arches where this does not matter.



^ permalink raw reply

* Re: Kernel panic in eth_header
From: Eric Dumazet @ 2019-02-05 20:13 UTC (permalink / raw)
  To: Florian Fainelli, Eric Dumazet, Andrew, Netdev
In-Reply-To: <3ed6d5b9-f2ef-8bd1-f7b4-c4e1d8a540fd@gmail.com>



On 02/05/2019 11:34 AM, Florian Fainelli wrote:
> On 2/5/19 8:57 AM, Eric Dumazet wrote:
>>
>>
>> On 02/05/2019 08:29 AM, Andrew wrote:
>>> Hi all.
>>>
>>> After upgrade on PPPoE BRAS to kernel 4.9.153 I've got an kernel panic after a 3 days of uptime.
>>>
>>> Unfortunately kernel is compiled w/o debug info; I rebuilt kernel with debug info enabled (kernel is compiled with same function addresses - I compare vmlinux symbol maps) - it says that panic is in net/ethernet/eth.c:88
>>>
>>> Below there is a kernel panic trace. igb is from vendor, ver. 5.3.5.4. What extra info is needed?
>>>
>>> [263565.106441] BUG: unable to handle kernel paging request at ffff88015a4d2dd4
>>> [263565.113527] IP: [<ffffffff8158e48b>] eth_header+0x3b/0xc0
>>> [263565.119030] PGD 1e8f067 [263565.121474] PUD 0
>>> [263565.123580]
>>> [263565.125166] Oops: 0002 [#1] SMP
>>> [263565.128398] Modules linked in: xt_nat iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 iptable_filter xt_length xt_TCPMSS xt_tcpudp xt_mark xt_dscp iptable_mangle ip_tables x_tables nf_nat_pptp nf_conntrack_pptp nf_conntrack_proto_gre nf_nat_proto_gre nf_nat nf_conntrack sch_sfq sch_htb cls_u32 sch_ingress sch_prio sch_tbf cls_flow cls_fw act_police ifb 8021q mrp garp stp llc softdog pppoe pppox ppp_generic slhc i2c_nforce2 i2c_core igb(O) parport_pc dca parport thermal asus_atk0110 fan ptp k10temp hwmon pps_core nv_tco
>>> [263565.176083] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G           O    4.9.153-x86_64 #1
>>> [263565.183996] Hardware name: System manufacturer System Product Name/M2N-E, BIOS ASUS M2N-E ACPI BIOS Revision 5001 03/23/2010
>>> [263565.195289] task: ffff88007d0f5200 task.stack: ffffc9000006c000
>>> [263565.201295] RIP: 0010:[<ffffffff8158e48b>] [<ffffffff8158e48b>] eth_header+0x3b/0xc0
>>> [263565.209225] RSP: 0018:ffff88007fa83c58  EFLAGS: 00010286
>>> [263565.214622] RAX: ffff88015a4d2dc8 RBX: 0000000000000008 RCX: ffff8800682434a0
>>> [263565.221843] RDX: ffff88015a4d2dc8 RSI: ffff88015a4d2dc8 RDI: ffff880077aab000
>>> [263565.229062] RBP: ffff88007b663d90 R08: ffff88007b663d90 R09: 0000000000000574
>>> [263565.236281] R10: ffff88007d1fa000 R11: 0000000000000000 R12: ffff8800682434a0
>>> [263565.243501] R13: ffff88007d1fa000 R14: 0000000000000574 R15: 0000000000000008
>>> [263565.250719] FS:  0000000000000000(0000) GS:ffff88007fa80000(0000) knlGS:0000000000000000
>>> [263565.258894] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [263565.264725] CR2: ffff88015a4d2dd4 CR3: 000000007ad73000 CR4: 00000000000006f0
>>> [263565.271944] Stack:
>>> [263565.274041]  ffff880077aab000 ffff880068243400 ffff88007a745000 ffff8800682434a0
>>> [263565.281582]  0000000000000002 ffffffff81571d09 ffff880068243400 ffff88007fa83d00
>>> [263565.289121]  ffff88007a745000 ffff880077aab000 ffff88007a712000 ffffffff815a8c61
>>> [263565.296661] Call Trace:
>>> [263565.299193]  <IRQ> [263565.301205] [<ffffffff81571d09>] ? neigh_connected_output+0xa9/0x100
>>> [263565.307740]  [<ffffffff815a8c61>] ? ip_finish_output2+0x221/0x400
>>> [263565.313920]  [<ffffffff8159e144>] ? nf_iterate+0x54/0x60
>>> [263565.319319]  [<ffffffff815ab2fa>] ? ip_output+0x6a/0xf0
>>> [263565.324631]  [<ffffffff8159e102>] ? nf_iterate+0x12/0x60
>>> [263565.330030]  [<ffffffff815aa6e0>] ? ip_fragment.constprop.5+0x80/0x80
>>> [263565.336556]  [<ffffffff815a73b6>] ? ip_forward+0x396/0x480
>>> [263565.342128]  [<ffffffff815a6fb0>] ? ip_check_defrag+0x1e0/0x1e0
>>> [263565.348134]  [<ffffffff815a5a2e>] ? ip_rcv+0x2ae/0x370
>>> [263565.353361]  [<ffffffffa0107c02>] ? pppoe_rcv_core+0xd2/0x160 [pppoe]
>>> [263565.359888]  [<ffffffff815a5170>] ? ip_local_deliver_finish+0x1d0/0x1d0
>>> [263565.366586]  [<ffffffff81562a57>] ? __netif_receive_skb_core+0x527/0xa80
>>> [263565.373373]  [<ffffffff81567632>] ? process_backlog+0x92/0x130
>>> [263565.379291]  [<ffffffff8156745d>] ? net_rx_action+0x24d/0x390
>>> [263565.385124]  [<ffffffff81628374>] ? __do_softirq+0xf4/0x2a0
>>> [263565.390784]  [<ffffffff8107136c>] ? irq_exit+0xbc/0xd0
>>> [263565.396008]  [<ffffffff81626cd6>] ? call_function_single_interrupt+0x96/0xa0
>>> [263565.403141]  <EOI> [263565.405153] [<ffffffff81623eb0>] ? __sched_text_end+0x2/0x2
>>> [263565.410907]  [<ffffffff81624182>] ? native_safe_halt+0x2/0x10
>>> [263565.416741]  [<ffffffff81623ec8>] ? default_idle+0x18/0xd0
>>> [263565.422314]  [<ffffffff810a7a46>] ? cpu_startup_entry+0x126/0x220
>>> [263565.428492]  [<ffffffff8104c261>] ? start_secondary+0x161/0x180
>>> [263565.434496] Code: 0e 00 00 00 53 89 d3 49 89 cc 4c 89 c5 45 89 ce e8 bb 8a fc ff 66 83 fb 01 48 89 c6 74 44 66 83 fb 04 74 3e 66 c1 c3 08 48 85 ed <66> 89 58 0c 74 40 8b 45 00 4d 85 e4 89 46 06 0f b7 45 04 66 89
>>> [263565.454534] RIP  [<ffffffff8158e48b>] eth_header+0x3b/0xc0
>>> [263565.460124]  RSP <ffff88007fa83c58>
>>> [263565.463696] CR2: ffff88015a4d2dd4
>>> [263565.467104] ---[ end trace a1bcaf3618724adf ]---
>>> [263565.471807] Kernel panic - not syncing: Fatal exception in interrupt
>>> [263565.478245] Kernel Offset: disabled
>>> [263565.481818] Rebooting in 5 seconds..
>>>
>>
>>
>> This is a well known issue, a fix should come shortly in stable branches
> 
> Is Peter or yourself doing the backport? David would only take care of
> the most two recent stable kernels.
> 
> Sorry about missing that change as part of the fragmenstack backport to
> 4.9...


Greg took care of this for the trees he manages.




^ permalink raw reply

* Re: [PATCH v2 2/2] r8169: Avoid pointer aliasing
From: Joe Perches @ 2019-02-05 20:18 UTC (permalink / raw)
  To: Eric Dumazet, David Miller, thierry.reding
  Cc: hkallweit1, andrew, nic_swsd, netdev, linux-kernel
In-Reply-To: <ca6ca816-af03-e1f9-739a-a06cc4837b80@gmail.com>

On Tue, 2019-02-05 at 12:04 -0800, Eric Dumazet wrote:
> 
> On 02/05/2019 10:42 AM, Joe Perches wrote:
> > It's declared after a pointer so it is already is 2 byte aligned.
> > 
> > A lot of drivers wouldn't work otherwise.
> 
> Maybe these drivers are only used on arches where this does not matter.

Possible.

I had only grepped through the sources looking for
declarations using:

$ git grep -B1 '\[ETH_ALEN\];' -- '*.c' | grep -A1 '\*'

It's quite a few files in net/ too btw.

I still think adding __align(<even#>) is unnecessary here unless
it follows something like a bool or a u8.



^ permalink raw reply

* Re: [RFC bpf-next 0/7] net: flow_dissector: trigger BPF hook when called from eth_get_headlen
From: Willem de Bruijn @ 2019-02-05 20:18 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Network Development, David Miller, Alexei Starovoitov,
	Daniel Borkmann, simon.horman, Willem de Bruijn
In-Reply-To: <20190205173629.160717-1-sdf@google.com>

On Tue, Feb 5, 2019 at 12:57 PM Stanislav Fomichev <sdf@google.com> wrote:
>
> Currently, when eth_get_headlen calls flow dissector, it doesn't pass any
> skb. Because we use passed skb to lookup associated networking namespace
> to find whether we have a BPF program attached or not, we always use
> C-based flow dissector in this case.
>
> The goal of this patch series is to add new networking namespace argument
> to the eth_get_headlen and make BPF flow dissector programs be able to
> work in the skb-less case.
>
> The series goes like this:
> 1. introduce __init_skb and __init_skb_shinfo; those will be used to
>    initialize temporary skb
> 2. introduce skb_net which can be used to get networking namespace
>    associated with an skb
> 3. add new optional network namespace argument to __skb_flow_dissect and
>    plumb through the callers
> 4. add new __flow_bpf_dissect which constructs temporary on-stack skb
>    (using __init_skb) and calls BPF flow dissector program

The main concern I see with this series is this cost of skb zeroing
for every packet in the device driver receive routine, *independent*
from the real skb allocation and zeroing which will likely happen
later.

^ permalink raw reply

* Re: [RFC bpf-next 1/7] net: introduce __init_skb and __init_skb_shinfo helpers
From: Willem de Bruijn @ 2019-02-05 20:18 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Network Development, David Miller, Alexei Starovoitov,
	Daniel Borkmann, simon.horman, Willem de Bruijn
In-Reply-To: <20190205173629.160717-2-sdf@google.com>

On Tue, Feb 5, 2019 at 12:57 PM Stanislav Fomichev <sdf@google.com> wrote:
>
> __init_skb is essentially a version of __build_skb which accepts skb as
> an argument (instead of doing kmem_cache_alloc to allocate it).
>
> __init_skb_shinfo initializes shinfo.
>
> No functional changes.
>
> Signed-off-by: Stanislav Fomichev <sdf@google.com>

Nice code deduplication :-)

^ permalink raw reply

* Re: [RFC bpf-next 3/7] net: plumb network namespace into __skb_flow_dissect
From: Willem de Bruijn @ 2019-02-05 20:19 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Network Development, David Miller, Alexei Starovoitov,
	Daniel Borkmann, simon.horman, Willem de Bruijn
In-Reply-To: <20190205173629.160717-4-sdf@google.com>

On Tue, Feb 5, 2019 at 12:57 PM Stanislav Fomichev <sdf@google.com> wrote:
>
> This new argument will be used in the next patches for the
> eth_get_headlen use case. eth_get_headlen calls flow dissector
> with only data (without skb) so there is currently no way to
> pull attached BPF flow dissector program. With this new argument,
> we can amend the callers to explicitly pass network namespace
> so we can use attached BPF program.
>
> Note: WARN_ON_ONCE(!net) will now trigger for eth_get_headlen users.
>
> Signed-off-by: Stanislav Fomichev <sdf@google.com>

>  /**
>   * __skb_flow_dissect - extract the flow_keys struct and return it
> + * @net: associated network namespace, if NULL pulled from skb
>   * @skb: sk_buff to extract the flow from, can be NULL if the rest are specified
>   * @flow_dissector: list of keys to dissect
>   * @target_container: target structure to put dissected values into
> @@ -739,7 +740,8 @@ bool __skb_flow_bpf_dissect(struct bpf_prog *prog,
>   *
>   * Caller must take care of zeroing target container memory.
>   */
> -bool __skb_flow_dissect(const struct sk_buff *skb,
> +bool __skb_flow_dissect(struct net *net,
> +                       const struct sk_buff *skb,
>                         struct flow_dissector *flow_dissector,
>                         void *target_container,
>                         void *data, __be16 proto, int nhoff, int hlen,
> @@ -799,12 +801,11 @@ bool __skb_flow_dissect(const struct sk_buff *skb,
>
>                 rcu_read_lock();
>
> -               if (skb->dev)
> -                       attached = rcu_dereference(dev_net(skb->dev)->flow_dissector_prog);
> -               else if (skb->sk)
> -                       attached = rcu_dereference(sock_net(skb->sk)->flow_dissector_prog);
> -               else
> -                       WARN_ON_ONCE(1);
> +               if (!net && skb)
> +                       net = skb_net(skb);
> +               if (net)
> +                       attached = rcu_dereference(net->flow_dissector_prog);
> +               WARN_ON_ONCE(!net);

Instead of this just call skb_net(skb) in all callers of
__skb_flow_dissect that are called with an skb argument directly?

It may have to be able to handle skb == NULL args.

^ permalink raw reply

* Re: [RFC bpf-next 2/7] net: introduce skb_net helper
From: Willem de Bruijn @ 2019-02-05 20:19 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Network Development, David Miller, Alexei Starovoitov,
	Daniel Borkmann, simon.horman, Willem de Bruijn
In-Reply-To: <20190205173629.160717-3-sdf@google.com>

On Tue, Feb 5, 2019 at 12:57 PM Stanislav Fomichev <sdf@google.com> wrote:
>
> skb_net returns network namespace from the associated device or socket.
>
> This will be used in the next commit.
>
> Signed-off-by: Stanislav Fomichev <sdf@google.com>
> ---
>  include/linux/skbuff.h |  2 ++
>  net/core/skbuff.c      | 10 ++++++++++
>  2 files changed, 12 insertions(+)
>
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index ad883ab2762c..28723a86efdf 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -4343,5 +4343,7 @@ static inline __wsum lco_csum(struct sk_buff *skb)
>         return csum_partial(l4_hdr, csum_start - l4_hdr, partial);
>  }
>
> +struct net *skb_net(const struct sk_buff *skb);
> +
>  #endif /* __KERNEL__ */
>  #endif /* _LINUX_SKBUFF_H */
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 23c9cf100bd4..016db13fa2b6 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -5585,6 +5585,16 @@ void skb_condense(struct sk_buff *skb)
>         skb->truesize = SKB_TRUESIZE(skb_end_offset(skb));
>  }
>
> +struct net *skb_net(const struct sk_buff *skb)
> +{
> +       if (skb->dev)
> +               return dev_net(skb->dev);
> +       else if (skb->sk)
> +               return sock_net(skb->sk);
> +       return NULL;
> +}
> +EXPORT_SYMBOL(skb_net);

If this needs a helper it is probably better static inline in the header.

^ permalink raw reply

* Re: [RFC bpf-next 4/7] net: flow_dissector: handle no-skb use case
From: Willem de Bruijn @ 2019-02-05 20:19 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Network Development, David Miller, Alexei Starovoitov,
	Daniel Borkmann, simon.horman, Willem de Bruijn
In-Reply-To: <20190205173629.160717-5-sdf@google.com>

On Tue, Feb 5, 2019 at 12:57 PM Stanislav Fomichev <sdf@google.com> wrote:
>
> When flow_dissector is called without skb (with only data and hlen),
> construct on-stack skb (which has a linear chunk of data passed
> to the flow dissector). This should let us handle eth_get_headlen
> case where only data is provided and we don't want to (yet) allocate
> an skb.
>
> Since this on-stack skb doesn't allocate its own data, we can't
> add shinfo and need to be careful to avoid any code paths that use
> it. Flow dissector BPF programs can only call bpf_skb_load_bytes helper,
> which doesn't touch shinfo in our case (skb->len is the length of the
> linear header so it exits early).
>
> Signed-off-by: Stanislav Fomichev <sdf@google.com>
> ---
>  include/linux/skbuff.h    |  5 +++
>  net/core/flow_dissector.c | 95 +++++++++++++++++++++++++++++----------
>  2 files changed, 76 insertions(+), 24 deletions(-)
>
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index aa9a9983de80..5f1c085cb34c 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -1227,6 +1227,11 @@ bool __skb_flow_bpf_dissect(struct bpf_prog *prog,
>                             const struct sk_buff *skb,
>                             struct flow_dissector *flow_dissector,
>                             struct bpf_flow_keys *flow_keys);
> +bool __flow_bpf_dissect(struct bpf_prog *prog,
> +                       void *data, __be16 proto,
> +                       int nhoff, int hlen,
> +                       struct flow_dissector *flow_dissector,
> +                       struct bpf_flow_keys *flow_keys);

nit: please use more descriptive name. Perhaps bpf_flow_dissect_raw
and rename __skb_flow_bpf_dissect to bpf_flow_dissect_skb.

> +bool __flow_bpf_dissect(struct bpf_prog *prog,
> +                       void *data, __be16 proto,
> +                       int nhoff, int hlen,
> +                       struct flow_dissector *flow_dissector,
> +                       struct bpf_flow_keys *flow_keys)
> +{
> +       struct bpf_skb_data_end *cb;
> +       struct sk_buff skb;
> +       u32 result;
> +
> +       __init_skb(&skb, data, hlen);
> +       skb_put(&skb, hlen);
> +       skb.protocol = proto;
> +
> +       init_flow_keys(flow_keys, &skb, nhoff);
> +
> +       cb = (struct bpf_skb_data_end *)skb.cb;
> +       cb->data_meta = skb.data;
> +       cb->data_end  = skb.data + skb_headlen(&skb);
> +
> +       result = BPF_PROG_RUN(prog, &skb);
> +
> +       clamp_flow_keys(flow_keys, hlen);
>
>         return result == BPF_OK;
>  }

Can__flow_bpf_dissect just construct an skb and then call
__skb_flow_bpf_dissect?

It will unnecessarily save and restore the control block, but that is
a relatively small cost (compared to, say, zeroing the entire skb).

^ permalink raw reply

* Re: [RFC bpf-next 5/7] bpf: when doing BPF_PROG_TEST_RUN for flow dissector use no-skb mode
From: Willem de Bruijn @ 2019-02-05 20:19 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Network Development, David Miller, Alexei Starovoitov,
	Daniel Borkmann, simon.horman, Willem de Bruijn
In-Reply-To: <20190205173629.160717-6-sdf@google.com>

On Tue, Feb 5, 2019 at 12:58 PM Stanislav Fomichev <sdf@google.com> wrote:
>
> Now that we have __flow_bpf_dissect which works on raw data (by
> constructing temporary on-stack skb), use it when doing
> BPF_PROG_TEST_RUN for flow dissector.
>
> This should help us catch any possible bugs due to missing shinfo on
> the on-stack skb.
>
> Note that existing __skb_flow_bpf_dissect swallows L2 headers and returns
> nhoff=0, we need to preserve the existing behavior.
>
> Signed-off-by: Stanislav Fomichev <sdf@google.com>
> ---
>  net/bpf/test_run.c | 52 +++++++++++++++-------------------------------
>  1 file changed, 17 insertions(+), 35 deletions(-)
>

>         ret = bpf_test_finish(kattr, uattr, &flow_keys, sizeof(flow_keys),
>                               retval, duration);
> -
> -       kfree_skb(skb);
> -       kfree(sk);
> +       kfree(data);
>         return ret;
> +

nit: unnecessary whitespace line

>  }

^ permalink raw reply

* [pull request][net 0/3] Mellanox, mlx5 fixes 2019-02-05
From: Saeed Mahameed @ 2019-02-05 20:20 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev, Saeed Mahameed

Hi Dave,

This series introduces some fixes to mlx5 driver.

Please pull and let me know if there is any problem.

For -stable v4.19
('net/mlx5e: FPGA, fix Innova IPsec TX offload data path performance')

For -stable v4.20
('net/mlx5e: Use the inner headers to determine tc/pedit offload limitation on decap flows')

Thanks,
Saeed.

---
The following changes since commit f09bef61f1ed72869b231e5cff16e73a06505cfb:

  Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf (2019-02-05 11:23:23 -0800)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux.git tags/mlx5-fixes-2019-02-05

for you to fetch changes up to 1651925d403e077e3fc86f961905e27c6810e132:

  net/mlx5e: Use the inner headers to determine tc/pedit offload limitation on decap flows (2019-02-05 12:10:19 -0800)

----------------------------------------------------------------
mlx5-fixes-2019-02-05

----------------------------------------------------------------
Guy Shattah (1):
      net/mlx5e: Use the inner headers to determine tc/pedit offload limitation on decap flows

Or Gerlitz (1):
      net/mlx5e: Properly set steering match levels for offloaded TC decap rules

Raed Salem (1):
      net/mlx5e: FPGA, fix Innova IPsec TX offload data path performance

 .../net/ethernet/mellanox/mlx5/core/en/tc_tun.c    |  4 +++-
 .../net/ethernet/mellanox/mlx5/core/en/tc_tun.h    |  2 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_tc.c    | 27 ++++++++++++++--------
 drivers/net/ethernet/mellanox/mlx5/core/en_tx.c    |  6 +++++
 drivers/net/ethernet/mellanox/mlx5/core/eswitch.h  |  1 +
 .../ethernet/mellanox/mlx5/core/eswitch_offloads.c | 17 +++++++-------
 6 files changed, 37 insertions(+), 20 deletions(-)

^ permalink raw reply

* [net 3/3] net/mlx5e: Use the inner headers to determine tc/pedit offload limitation on decap flows
From: Saeed Mahameed @ 2019-02-05 20:20 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev, Guy Shattah, Or Gerlitz, Saeed Mahameed
In-Reply-To: <20190205202011.24023-1-saeedm@mellanox.com>

From: Guy Shattah <sguy@mellanox.com>

In packets that need to be decaped the internal headers
have to be checked, not the external ones.

Fixes: bdd66ac0aeed ("net/mlx5e: Disallow TC offloading of unsupported match/action combinations")
Signed-off-by: Guy Shattah <sguy@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en_tc.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
index 043896e13ffa..1c3c9fa26b55 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
@@ -2181,6 +2181,7 @@ static bool csum_offload_supported(struct mlx5e_priv *priv,
 
 static bool modify_header_match_supported(struct mlx5_flow_spec *spec,
 					  struct tcf_exts *exts,
+					  u32 actions,
 					  struct netlink_ext_ack *extack)
 {
 	const struct tc_action *a;
@@ -2190,7 +2191,11 @@ static bool modify_header_match_supported(struct mlx5_flow_spec *spec,
 	u16 ethertype;
 	int nkeys, i;
 
-	headers_v = MLX5_ADDR_OF(fte_match_param, spec->match_value, outer_headers);
+	if (actions & MLX5_FLOW_CONTEXT_ACTION_DECAP)
+		headers_v = MLX5_ADDR_OF(fte_match_param, spec->match_value, inner_headers);
+	else
+		headers_v = MLX5_ADDR_OF(fte_match_param, spec->match_value, outer_headers);
+
 	ethertype = MLX5_GET(fte_match_set_lyr_2_4, headers_v, ethertype);
 
 	/* for non-IP we only re-write MACs, so we're okay */
@@ -2247,7 +2252,7 @@ static bool actions_match_supported(struct mlx5e_priv *priv,
 
 	if (actions & MLX5_FLOW_CONTEXT_ACTION_MOD_HDR)
 		return modify_header_match_supported(&parse_attr->spec, exts,
-						     extack);
+						     actions, extack);
 
 	return true;
 }
-- 
2.20.1


^ permalink raw reply related

* [net 2/3] net/mlx5e: Properly set steering match levels for offloaded TC decap rules
From: Saeed Mahameed @ 2019-02-05 20:20 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Or Gerlitz, Slava Ovsiienko, Jianbo Liu, Saeed Mahameed
In-Reply-To: <20190205202011.24023-1-saeedm@mellanox.com>

From: Or Gerlitz <ogerlitz@mellanox.com>

The match level computed by the driver gets to be wrong for decap
rules with wildcarded inner packet match such as:

tc filter add dev vxlan_sys_4789 protocol all parent ffff: prio 2 flower
       enc_dst_ip 192.168.0.9 enc_key_id 100 enc_dst_port 4789
       action tunnel_key unset
       action mirred egress redirect dev eth1

The FW errs for a missing matching meta-data indicator for the outer
headers (where we do have a match), and a wrong matching meta-data
indicator for the inner headers (where we don't have a match).

Fix that by taking into account the matching on the tunnel info and
relating the match level of the encapsulated packet to the firmware
inner headers indicator in case of decap.

As for vxlan we mandate a match on the tunnel udp dst port, and in general
we practically madndate a match on the source or dest ip for any IP tunnel,
the fix was done in a minimal manner around the tunnel match parsing code.

Fixes: d708f902989b ('net/mlx5e: Get the required HW match level while parsing TC flow matches')
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Reported-by: Slava Ovsiienko <viacheslavo@mellanox.com>
Reviewed-by: Jianbo Liu <jianbol@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 .../ethernet/mellanox/mlx5/core/en/tc_tun.c    |  4 +++-
 .../ethernet/mellanox/mlx5/core/en/tc_tun.h    |  2 +-
 .../net/ethernet/mellanox/mlx5/core/en_tc.c    | 18 ++++++++++--------
 .../net/ethernet/mellanox/mlx5/core/eswitch.h  |  1 +
 .../mellanox/mlx5/core/eswitch_offloads.c      | 17 +++++++++--------
 5 files changed, 24 insertions(+), 18 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/tc_tun.c b/drivers/net/ethernet/mellanox/mlx5/core/en/tc_tun.c
index 046948ead152..a3750af074a4 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/tc_tun.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/tc_tun.c
@@ -612,16 +612,18 @@ int mlx5e_tc_tun_parse(struct net_device *filter_dev,
 		       struct mlx5_flow_spec *spec,
 		       struct tc_cls_flower_offload *f,
 		       void *headers_c,
-		       void *headers_v)
+		       void *headers_v, u8 *match_level)
 {
 	int tunnel_type;
 	int err = 0;
 
 	tunnel_type = mlx5e_tc_tun_get_type(filter_dev);
 	if (tunnel_type == MLX5E_TC_TUNNEL_TYPE_VXLAN) {
+		*match_level = MLX5_MATCH_L4;
 		err = mlx5e_tc_tun_parse_vxlan(priv, spec, f,
 					       headers_c, headers_v);
 	} else if (tunnel_type == MLX5E_TC_TUNNEL_TYPE_GRETAP) {
+		*match_level = MLX5_MATCH_L3;
 		err = mlx5e_tc_tun_parse_gretap(priv, spec, f,
 						headers_c, headers_v);
 	} else {
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/tc_tun.h b/drivers/net/ethernet/mellanox/mlx5/core/en/tc_tun.h
index 706ce7bf15e7..b63f15de899d 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/tc_tun.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/tc_tun.h
@@ -39,6 +39,6 @@ int mlx5e_tc_tun_parse(struct net_device *filter_dev,
 		       struct mlx5_flow_spec *spec,
 		       struct tc_cls_flower_offload *f,
 		       void *headers_c,
-		       void *headers_v);
+		       void *headers_v, u8 *match_level);
 
 #endif //__MLX5_EN_TC_TUNNEL_H__
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
index cae6c6d48984..043896e13ffa 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
@@ -1302,7 +1302,7 @@ static void mlx5e_tc_del_flow(struct mlx5e_priv *priv,
 static int parse_tunnel_attr(struct mlx5e_priv *priv,
 			     struct mlx5_flow_spec *spec,
 			     struct tc_cls_flower_offload *f,
-			     struct net_device *filter_dev)
+			     struct net_device *filter_dev, u8 *match_level)
 {
 	struct netlink_ext_ack *extack = f->common.extack;
 	void *headers_c = MLX5_ADDR_OF(fte_match_param, spec->match_criteria,
@@ -1317,7 +1317,7 @@ static int parse_tunnel_attr(struct mlx5e_priv *priv,
 	int err = 0;
 
 	err = mlx5e_tc_tun_parse(filter_dev, priv, spec, f,
-				 headers_c, headers_v);
+				 headers_c, headers_v, match_level);
 	if (err) {
 		NL_SET_ERR_MSG_MOD(extack,
 				   "failed to parse tunnel attributes");
@@ -1426,7 +1426,7 @@ static int __parse_cls_flower(struct mlx5e_priv *priv,
 			      struct mlx5_flow_spec *spec,
 			      struct tc_cls_flower_offload *f,
 			      struct net_device *filter_dev,
-			      u8 *match_level)
+			      u8 *match_level, u8 *tunnel_match_level)
 {
 	struct netlink_ext_ack *extack = f->common.extack;
 	void *headers_c = MLX5_ADDR_OF(fte_match_param, spec->match_criteria,
@@ -1477,7 +1477,7 @@ static int __parse_cls_flower(struct mlx5e_priv *priv,
 		switch (key->addr_type) {
 		case FLOW_DISSECTOR_KEY_IPV4_ADDRS:
 		case FLOW_DISSECTOR_KEY_IPV6_ADDRS:
-			if (parse_tunnel_attr(priv, spec, f, filter_dev))
+			if (parse_tunnel_attr(priv, spec, f, filter_dev, tunnel_match_level))
 				return -EOPNOTSUPP;
 			break;
 		default:
@@ -1826,11 +1826,11 @@ static int parse_cls_flower(struct mlx5e_priv *priv,
 	struct mlx5_core_dev *dev = priv->mdev;
 	struct mlx5_eswitch *esw = dev->priv.eswitch;
 	struct mlx5e_rep_priv *rpriv = priv->ppriv;
+	u8 match_level, tunnel_match_level = MLX5_MATCH_NONE;
 	struct mlx5_eswitch_rep *rep;
-	u8 match_level;
 	int err;
 
-	err = __parse_cls_flower(priv, spec, f, filter_dev, &match_level);
+	err = __parse_cls_flower(priv, spec, f, filter_dev, &match_level, &tunnel_match_level);
 
 	if (!err && (flow->flags & MLX5E_TC_FLOW_ESWITCH)) {
 		rep = rpriv->rep;
@@ -1846,10 +1846,12 @@ static int parse_cls_flower(struct mlx5e_priv *priv,
 		}
 	}
 
-	if (flow->flags & MLX5E_TC_FLOW_ESWITCH)
+	if (flow->flags & MLX5E_TC_FLOW_ESWITCH) {
 		flow->esw_attr->match_level = match_level;
-	else
+		flow->esw_attr->tunnel_match_level = tunnel_match_level;
+	} else {
 		flow->nic_attr->match_level = match_level;
+	}
 
 	return err;
 }
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
index 9c89eea9b2c3..748ff178a1d6 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
@@ -312,6 +312,7 @@ struct mlx5_esw_flow_attr {
 	} dests[MLX5_MAX_FLOW_FWD_VPORTS];
 	u32	mod_hdr_id;
 	u8	match_level;
+	u8	tunnel_match_level;
 	struct mlx5_fc *counter;
 	u32	chain;
 	u16	prio;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
index 53065b6ae593..d4e6fe5b9300 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
@@ -160,14 +160,15 @@ mlx5_eswitch_add_offloaded_rule(struct mlx5_eswitch *esw,
 		MLX5_SET_TO_ONES(fte_match_set_misc, misc,
 				 source_eswitch_owner_vhca_id);
 
-	if (attr->match_level == MLX5_MATCH_NONE)
-		spec->match_criteria_enable = MLX5_MATCH_MISC_PARAMETERS;
-	else
-		spec->match_criteria_enable = MLX5_MATCH_OUTER_HEADERS |
-					      MLX5_MATCH_MISC_PARAMETERS;
-
-	if (flow_act.action & MLX5_FLOW_CONTEXT_ACTION_DECAP)
-		spec->match_criteria_enable |= MLX5_MATCH_INNER_HEADERS;
+	spec->match_criteria_enable = MLX5_MATCH_MISC_PARAMETERS;
+	if (flow_act.action & MLX5_FLOW_CONTEXT_ACTION_DECAP) {
+		if (attr->tunnel_match_level != MLX5_MATCH_NONE)
+			spec->match_criteria_enable |= MLX5_MATCH_OUTER_HEADERS;
+		if (attr->match_level != MLX5_MATCH_NONE)
+			spec->match_criteria_enable |= MLX5_MATCH_INNER_HEADERS;
+	} else if (attr->match_level != MLX5_MATCH_NONE) {
+		spec->match_criteria_enable |= MLX5_MATCH_OUTER_HEADERS;
+	}
 
 	if (flow_act.action & MLX5_FLOW_CONTEXT_ACTION_MOD_HDR)
 		flow_act.modify_id = attr->mod_hdr_id;
-- 
2.20.1


^ permalink raw reply related

* [net 1/3] net/mlx5e: FPGA, fix Innova IPsec TX offload data path performance
From: Saeed Mahameed @ 2019-02-05 20:20 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev, Raed Salem, Tariq Toukan, Saeed Mahameed
In-Reply-To: <20190205202011.24023-1-saeedm@mellanox.com>

From: Raed Salem <raeds@mellanox.com>

At Innova IPsec TX offload data path a special software parser metadata
is used to pass some packet attributes to the hardware, this metadata
is passed using the Ethernet control segment of a WQE (a HW descriptor)
header.

The cited commit might nullify this header, hence the metadata is lost,
this caused a significant performance drop during hw offloading
operation.

Fix by restoring the metadata at the Ethernet control segment in case
it was nullified.

Fixes: 37fdffb217a4 ("net/mlx5: WQ, fixes for fragmented WQ buffers API")
Signed-off-by: Raed Salem <raeds@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en_tx.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
index 598ad7e4d5c9..0e55cd1f2e98 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
@@ -387,8 +387,14 @@ netdev_tx_t mlx5e_sq_xmit(struct mlx5e_txqsq *sq, struct sk_buff *skb,
 	num_wqebbs = DIV_ROUND_UP(ds_cnt, MLX5_SEND_WQEBB_NUM_DS);
 	contig_wqebbs_room = mlx5_wq_cyc_get_contig_wqebbs(wq, pi);
 	if (unlikely(contig_wqebbs_room < num_wqebbs)) {
+#ifdef CONFIG_MLX5_EN_IPSEC
+		struct mlx5_wqe_eth_seg cur_eth = wqe->eth;
+#endif
 		mlx5e_fill_sq_frag_edge(sq, wq, pi, contig_wqebbs_room);
 		mlx5e_sq_fetch_wqe(sq, &wqe, &pi);
+#ifdef CONFIG_MLX5_EN_IPSEC
+		wqe->eth = cur_eth;
+#endif
 	}

 	/* fill wqe */
-- 
2.20.1

^ permalink raw reply related

* Re: Kernel panic in eth_header
From: Andrew @ 2019-02-05 20:21 UTC (permalink / raw)
  To: Netdev
In-Reply-To: <16f5a810-f183-2874-c67d-d490f70f7bf6@gmail.com>

On 05.02.2019 21:34, Florian Fainelli wrote:
> On 2/5/19 8:57 AM, Eric Dumazet wrote:
>>
>> On 02/05/2019 08:29 AM, Andrew wrote:
>>> Hi all.
>>>
>>> After upgrade on PPPoE BRAS to kernel 4.9.153 I've got an kernel panic after a 3 days of uptime.
>>>
>>> Unfortunately kernel is compiled w/o debug info; I rebuilt kernel with debug info enabled (kernel is compiled with same function addresses - I compare vmlinux symbol maps) - it says that panic is in net/ethernet/eth.c:88
>>>
>>> Below there is a kernel panic trace. igb is from vendor, ver. 5.3.5.4. What extra info is needed?
>>>
>>> [263565.106441] BUG: unable to handle kernel paging request at ffff88015a4d2dd4
>>> [263565.113527] IP: [<ffffffff8158e48b>] eth_header+0x3b/0xc0
>>> [263565.119030] PGD 1e8f067 [263565.121474] PUD 0
>>> [263565.123580]
>>> [263565.125166] Oops: 0002 [#1] SMP
>>> [263565.128398] Modules linked in: xt_nat iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 iptable_filter xt_length xt_TCPMSS xt_tcpudp xt_mark xt_dscp iptable_mangle ip_tables x_tables nf_nat_pptp nf_conntrack_pptp nf_conntrack_proto_gre nf_nat_proto_gre nf_nat nf_conntrack sch_sfq sch_htb cls_u32 sch_ingress sch_prio sch_tbf cls_flow cls_fw act_police ifb 8021q mrp garp stp llc softdog pppoe pppox ppp_generic slhc i2c_nforce2 i2c_core igb(O) parport_pc dca parport thermal asus_atk0110 fan ptp k10temp hwmon pps_core nv_tco
>>> [263565.176083] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G           O    4.9.153-x86_64 #1
>>> [263565.183996] Hardware name: System manufacturer System Product Name/M2N-E, BIOS ASUS M2N-E ACPI BIOS Revision 5001 03/23/2010
>>> [263565.195289] task: ffff88007d0f5200 task.stack: ffffc9000006c000
>>> [263565.201295] RIP: 0010:[<ffffffff8158e48b>] [<ffffffff8158e48b>] eth_header+0x3b/0xc0
>>> [263565.209225] RSP: 0018:ffff88007fa83c58  EFLAGS: 00010286
>>> [263565.214622] RAX: ffff88015a4d2dc8 RBX: 0000000000000008 RCX: ffff8800682434a0
>>> [263565.221843] RDX: ffff88015a4d2dc8 RSI: ffff88015a4d2dc8 RDI: ffff880077aab000
>>> [263565.229062] RBP: ffff88007b663d90 R08: ffff88007b663d90 R09: 0000000000000574
>>> [263565.236281] R10: ffff88007d1fa000 R11: 0000000000000000 R12: ffff8800682434a0
>>> [263565.243501] R13: ffff88007d1fa000 R14: 0000000000000574 R15: 0000000000000008
>>> [263565.250719] FS:  0000000000000000(0000) GS:ffff88007fa80000(0000) knlGS:0000000000000000
>>> [263565.258894] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [263565.264725] CR2: ffff88015a4d2dd4 CR3: 000000007ad73000 CR4: 00000000000006f0
>>> [263565.271944] Stack:
>>> [263565.274041]  ffff880077aab000 ffff880068243400 ffff88007a745000 ffff8800682434a0
>>> [263565.281582]  0000000000000002 ffffffff81571d09 ffff880068243400 ffff88007fa83d00
>>> [263565.289121]  ffff88007a745000 ffff880077aab000 ffff88007a712000 ffffffff815a8c61
>>> [263565.296661] Call Trace:
>>> [263565.299193]  <IRQ> [263565.301205] [<ffffffff81571d09>] ? neigh_connected_output+0xa9/0x100
>>> [263565.307740]  [<ffffffff815a8c61>] ? ip_finish_output2+0x221/0x400
>>> [263565.313920]  [<ffffffff8159e144>] ? nf_iterate+0x54/0x60
>>> [263565.319319]  [<ffffffff815ab2fa>] ? ip_output+0x6a/0xf0
>>> [263565.324631]  [<ffffffff8159e102>] ? nf_iterate+0x12/0x60
>>> [263565.330030]  [<ffffffff815aa6e0>] ? ip_fragment.constprop.5+0x80/0x80
>>> [263565.336556]  [<ffffffff815a73b6>] ? ip_forward+0x396/0x480
>>> [263565.342128]  [<ffffffff815a6fb0>] ? ip_check_defrag+0x1e0/0x1e0
>>> [263565.348134]  [<ffffffff815a5a2e>] ? ip_rcv+0x2ae/0x370
>>> [263565.353361]  [<ffffffffa0107c02>] ? pppoe_rcv_core+0xd2/0x160 [pppoe]
>>> [263565.359888]  [<ffffffff815a5170>] ? ip_local_deliver_finish+0x1d0/0x1d0
>>> [263565.366586]  [<ffffffff81562a57>] ? __netif_receive_skb_core+0x527/0xa80
>>> [263565.373373]  [<ffffffff81567632>] ? process_backlog+0x92/0x130
>>> [263565.379291]  [<ffffffff8156745d>] ? net_rx_action+0x24d/0x390
>>> [263565.385124]  [<ffffffff81628374>] ? __do_softirq+0xf4/0x2a0
>>> [263565.390784]  [<ffffffff8107136c>] ? irq_exit+0xbc/0xd0
>>> [263565.396008]  [<ffffffff81626cd6>] ? call_function_single_interrupt+0x96/0xa0
>>> [263565.403141]  <EOI> [263565.405153] [<ffffffff81623eb0>] ? __sched_text_end+0x2/0x2
>>> [263565.410907]  [<ffffffff81624182>] ? native_safe_halt+0x2/0x10
>>> [263565.416741]  [<ffffffff81623ec8>] ? default_idle+0x18/0xd0
>>> [263565.422314]  [<ffffffff810a7a46>] ? cpu_startup_entry+0x126/0x220
>>> [263565.428492]  [<ffffffff8104c261>] ? start_secondary+0x161/0x180
>>> [263565.434496] Code: 0e 00 00 00 53 89 d3 49 89 cc 4c 89 c5 45 89 ce e8 bb 8a fc ff 66 83 fb 01 48 89 c6 74 44 66 83 fb 04 74 3e 66 c1 c3 08 48 85 ed <66> 89 58 0c 74 40 8b 45 00 4d 85 e4 89 46 06 0f b7 45 04 66 89
>>> [263565.454534] RIP  [<ffffffff8158e48b>] eth_header+0x3b/0xc0
>>> [263565.460124]  RSP <ffff88007fa83c58>
>>> [263565.463696] CR2: ffff88015a4d2dd4
>>> [263565.467104] ---[ end trace a1bcaf3618724adf ]---
>>> [263565.471807] Kernel panic - not syncing: Fatal exception in interrupt
>>> [263565.478245] Kernel Offset: disabled
>>> [263565.481818] Rebooting in 5 seconds..
>>>
>>
>> This is a well known issue, a fix should come shortly in stable branches
> Is Peter or yourself doing the backport? David would only take care of
> the most two recent stable kernels.
>
> Sorry about missing that change as part of the fragmenstack backport to
> 4.9...

I think that backport will be trivial - at least patch lays smoothly on 
4.9 (just with offsets difference).

I'll test it.

Btw, maybe there's a some test conditions to quickly check if patch 
helps? Crash is reproducible with unpredictable interval (tens of hours 
of quite heavy load).


>> diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
>> index f8bbd693c19c247e41839c2d0b5318ca51b23ee8..d95b32af4a0e3f552405c9e61cc372729834160c 100644
>> --- a/net/ipv4/ip_fragment.c
>> +++ b/net/ipv4/ip_fragment.c
>> @@ -425,6 +425,7 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
>>           * fragment.
>>           */
>>   
>> +       err = -EINVAL;
>>          /* Find out where to put this fragment.  */
>>          prev_tail = qp->q.fragments_tail;
>>          if (!prev_tail)
>> @@ -501,7 +502,6 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
>>   
>>   discard_qp:
>>          inet_frag_kill(&qp->q);
>> -       err = -EINVAL;
>>          __IP_INC_STATS(net, IPSTATS_MIB_REASM_OVERLAPS);
>>   err:
>>          kfree_skb(skb);
>>
>>
>>
>


^ permalink raw reply

* Re: [PATCH v2 2/2] r8169: Avoid pointer aliasing
From: Eric Dumazet @ 2019-02-05 20:23 UTC (permalink / raw)
  To: Joe Perches, David Miller, thierry.reding
  Cc: hkallweit1, andrew, nic_swsd, netdev, linux-kernel
In-Reply-To: <4aea523a3bc5be0d944f7ed9fadac276b7115002.camel@perches.com>

On 02/05/2019 12:18 PM, Joe Perches wrote:

> I still think adding __align(<even#>) is unnecessary here unless
> it follows something like a bool or a u8.

This would be some historical side effect, and we do not want to rely on that.

A security feature could in fact ask a compiler to perform random
shuffling of automatic variables.

( a la __randomize_layout )

^ permalink raw reply

* Re: [ovs-dev] [PATCH net-next V2 1/1] openvswitch: Declare ovs key structures using macros
From: Ben Pfaff @ 2019-02-05 20:23 UTC (permalink / raw)
  To: Gregory Rose
  Cc: Eli Britstein, David Miller, yihung.wei@gmail.com,
	dev@openvswitch.org, netdev@vger.kernel.org,
	simon.horman@netronome.com
In-Reply-To: <c3455573-5dad-e7a7-fc89-15312c0b7c88@gmail.com>

On Tue, Feb 05, 2019 at 10:22:09AM -0800, Gregory Rose wrote:
> 
> On 2/5/2019 4:02 AM, Eli Britstein wrote:
> > On 2/4/2019 10:07 PM, David Miller wrote:
> > > From: Yi-Hung Wei <yihung.wei@gmail.com>
> > > Date: Mon, 4 Feb 2019 10:47:18 -0800
> > > 
> > > > For example, to see how 'struct ovs_key_ipv6' is defined, now we need
> > > > to trace how OVS_KEY_IPV6_FIELDS is defined, and how OVS_KEY_FIELD_ARR
> > > > and OVS_KEY_FIELD defined.  I think it makes the header file to be
> > > > more complicated.
> > > I completely agree.
> > > 
> > > Unless this is totally unavoidable, I do not want to apply a patch
> > > which makes reading and auditing the networking code more difficult.
> > This technique is discussed for example in
> > https://stackoverflow.com/questions/6635851/real-world-use-of-x-macros,
> > and I found existing examples of using it in the kernel tree:
> > 
> > __BPF_FUNC_MAPPER in commit ebb676daa1a34 ("bpf: Print function name in
> > addition to function id")
> > 
> > __AAL_STAT_ITEMS and __SONET_ITEMS in commit 607ca46e97a1b ("UAPI:
> > (Scripted) Disintegrate include/linux"), the successor of commit
> > 1da177e4c3f4 ("Linux-2.6.12-rc2"). I didn't dig deeper to the past.
> > 
> > I can agree it makes that H file a bit more complicated, but for sure
> > less than ## macros that are widely used.
> > 
> > However, I think the alternatives of generating such defines by some
> > scripts, or having the fields in more than one place are even worse, so
> > it is a kind of unavoidable.
> 
> Why is using a script to generate defines for the requirements of your
> application or driver so bad?  Your patch
> turns openvswitch.h into some hardly readable code - using a script and
> having a machine output the macros
> your application or driver needs seems like a much better alternative to me.

In addition, one of the reasons that developers prefer to avoid
duplication is because of the need to synchronize copies when one of
them changes.  This doesn't apply in the same way to these structures,
because they are ABIs that will not change.

^ permalink raw reply

* Re: [PATCH v2 2/2] r8169: Avoid pointer aliasing
From: Heiner Kallweit @ 2019-02-05 20:23 UTC (permalink / raw)
  To: Joe Perches, Eric Dumazet, David Miller, thierry.reding
  Cc: andrew, nic_swsd, netdev, linux-kernel
In-Reply-To: <4aea523a3bc5be0d944f7ed9fadac276b7115002.camel@perches.com>

On 05.02.2019 21:18, Joe Perches wrote:
> On Tue, 2019-02-05 at 12:04 -0800, Eric Dumazet wrote:
>>
>> On 02/05/2019 10:42 AM, Joe Perches wrote:
>>> It's declared after a pointer so it is already is 2 byte aligned.
>>>
>>> A lot of drivers wouldn't work otherwise.
>>
>> Maybe these drivers are only used on arches where this does not matter.
> 
> Possible.
> 
> I had only grepped through the sources looking for
> declarations using:
> 
> $ git grep -B1 '\[ETH_ALEN\];' -- '*.c' | grep -A1 '\*'
> 
> It's quite a few files in net/ too btw.
> 
> I still think adding __align(<even#>) is unnecessary here unless
> it follows something like a bool or a u8.
> 
> 
I there's such a controversy, then it may be better to stay with
the current code, or?

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox