Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [patch net-next rfc 2/7] net: introduce name_node struct to be used in hashlist
From: kbuild test robot @ 2019-08-08  4:34 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: kbuild-all, netdev, davem, jakub.kicinski, sthemmin, dsahern,
	dcbw, mkubecek, andrew, parav, saeedm, mlxsw
In-Reply-To: <20190719110029.29466-3-jiri@resnulli.us>

[-- Attachment #1: Type: text/plain, Size: 1138 bytes --]

Hi Jiri,

I love your patch! Yet something to improve:

[auto build test ERROR on net-next/master]

url:    https://github.com/0day-ci/linux/commits/Jiri-Pirko/net-introduce-alternative-names-for-network-interfaces/20190720-094639
config: arc-defconfig (attached as .config)
compiler: arc-elf-gcc (GCC) 7.4.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        GCC_VERSION=7.4.0 make.cross ARCH=arc 

If you fix the issue, kindly add following tag
Reported-by: kbuild test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   {standard input}: Assembler messages:
   {standard input}:147: Error: inappropriate arguments for opcode 'st'
>> {standard input}:150: Error: Insn j_s has a jump/branch instruction j_s in its delay slot.
   {standard input}:175: Error: inappropriate arguments for opcode 'st'

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 8655 bytes --]

^ permalink raw reply

* Re: [PATCH v5 bpf-next] BPF: helpers: New helper to obtain namespace data from current task
From: Yonghong Song @ 2019-08-08  4:44 UTC (permalink / raw)
  To: Carlos Antonio Neira Bustos, netdev@vger.kernel.org
  Cc: ebiederm@xmission.com, brouer@redhat.com,
	quentin.monnet@netronome.com
In-Reply-To: <20190808012240.htbgpv2mhktvig5h@dev00>



On 8/7/19 6:22 PM, Carlos Antonio Neira Bustos wrote:
> The code has been modified to avoid syscalls that could sleep.
> Please let me know if any other modification is needed.
> 
>  From be0384c0fa209a78c1567936e8db4e35b9a7c0f8 Mon Sep 17 00:00:00 2001
> From: Carlos <cneirabustos@gmail.com>
> Date: Wed, 7 Aug 2019 20:04:30 -0400
> Subject: [PATCH] [PATCH v5 bpf-next] BPF: New helper to obtain namespace data
>   from current task
> 
> This helper obtains the active namespace from current and returns pid, tgid,
> device and namespace id as seen from that namespace, allowing to instrument
> a process inside a container.
> Device is read from /proc/self/ns/pid, as in the future it's possible that
> different pid_ns files may belong to different devices, according
> to the discussion between Eric Biederman and Yonghong in 2017 linux plumbers
> conference.
> Currently bpf_get_current_pid_tgid(), is used to do pid filtering in bcc's
> scripts but this helper returns the pid as seen by the root namespace which is
> fine when a bcc script is not executed inside a container.
> When the process of interest is inside a container, pid filtering will not work
> if bpf_get_current_pid_tgid() is used. This helper addresses this limitation
> returning the pid as it's seen by the current namespace where the script is
> executing.
> 
> This helper has the same use cases as bpf_get_current_pid_tgid() as it can be
> used to do pid filtering even inside a container.
> 
> For example a bcc script using bpf_get_current_pid_tgid() (tools/funccount.py):
> 
>          u32 pid = bpf_get_current_pid_tgid() >> 32;
>          if (pid != <pid_arg_passed_in>)
>                  return 0;
> Could be modified to use bpf_get_current_pidns_info() as follows:
> 
>          struct bpf_pidns pidns;
>          bpf_get_current_pidns_info(&pidns, sizeof(struct bpf_pidns));
>          u32 pid = pidns.tgid;
>          u32 nsid = pidns.nsid;
>          if ((pid != <pid_arg_passed_in>) && (nsid != <nsid_arg_passed_in>))
>                  return 0;
> 
> To find out the name PID namespace id of a process, you could use this command:
> 
> $ ps -h -o pidns -p <pid_of_interest>
> 
> Or this other command:
> 
> $ ls -Li /proc/<pid_of_interest>/ns/pid
> 
> Signed-off-by: Carlos Neira <cneirabustos@gmail.com>
> ---
>   fs/namei.c                                         |   2 +-
>   include/linux/bpf.h                                |   1 +
>   include/linux/namei.h                              |   4 +
>   include/uapi/linux/bpf.h                           |  29 ++++-
>   kernel/bpf/core.c                                  |   1 +
>   kernel/bpf/helpers.c                               |  78 ++++++++++++
>   kernel/trace/bpf_trace.c                           |   2 +
>   samples/bpf/Makefile                               |   3 +
>   samples/bpf/trace_ns_info_user.c                   |  35 ++++++
>   samples/bpf/trace_ns_info_user_kern.c              |  44 +++++++
>   tools/include/uapi/linux/bpf.h                     |  29 ++++-
>   tools/testing/selftests/bpf/Makefile               |   2 +-
>   tools/testing/selftests/bpf/bpf_helpers.h          |   3 +
>   .../testing/selftests/bpf/progs/test_pidns_kern.c  |  51 ++++++++
>   tools/testing/selftests/bpf/test_pidns.c           | 138 +++++++++++++++++++++
>   15 files changed, 418 insertions(+), 4 deletions(-)
>   create mode 100644 samples/bpf/trace_ns_info_user.c
>   create mode 100644 samples/bpf/trace_ns_info_user_kern.c
>   create mode 100644 tools/testing/selftests/bpf/progs/test_pidns_kern.c
>   create mode 100644 tools/testing/selftests/bpf/test_pidns.c

Carlos,

Thanks for your patch. Maybe the following changes can be made
for easier review and eventual merging if accepted.

(1). The single patch should be broken into multiple patches:
      #1. fs/namei.c include/linux/namei.h
      #2. bpf kernel changes in include/linux/bpf.h,
          include/uapi/linux/bpf.h, kernel/bpf/core.c
          kernel/bpf/helpers.h kernel/trace/bpf_trace.c
      #3. sync uapi header tools/include/uapi/linux/bpf.h
      #4. tools/testing/selftests/bpf changes.
      #5. samples/bpf/ changes.
(2). Since you have more than one patch, please do provide
      a cover letter to describe use case as you did in the
      above commit. Please put the change logs after
      the original commit message.
      You can look at some other examples
      "git log" in the net-next or bpf-next repo.
(3). Please send to or cc BPF maintainers:
      Alexei Starovoitov <ast@kernel.org>
      Daniel Borkmann <daniel@iogearbox.net>

> 
> diff --git a/fs/namei.c b/fs/namei.c
> index 209c51a5226c..d1eca36972d2 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -19,7 +19,6 @@
>   #include <linux/export.h>
>   #include <linux/kernel.h>
>   #include <linux/slab.h>
> -#include <linux/fs.h>
>   #include <linux/namei.h>
>   #include <linux/pagemap.h>
>   #include <linux/fsnotify.h>
> @@ -2355,6 +2354,7 @@ int filename_lookup(int dfd, struct filename *name, unsigned flags,
>   	putname(name);
>   	return retval;
>   }
> +EXPORT_SYMBOL(filename_lookup);
>   
>   /* Returns 0 and nd will be valid on success; Retuns error, otherwise. */
>   static int path_parentat(struct nameidata *nd, unsigned flags,
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index f9a506147c8a..e4adf5e05afd 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -1050,6 +1050,7 @@ extern const struct bpf_func_proto bpf_get_local_storage_proto;
>   extern const struct bpf_func_proto bpf_strtol_proto;
>   extern const struct bpf_func_proto bpf_strtoul_proto;
>   extern const struct bpf_func_proto bpf_tcp_sock_proto;
> +extern const struct bpf_func_proto bpf_get_current_pidns_info_proto;
>   
>   /* Shared helpers among cBPF and eBPF. */
>   void bpf_user_rnd_init_once(void);
> diff --git a/include/linux/namei.h b/include/linux/namei.h
> index 9138b4471dbf..2c24e8c71d46 100644
> --- a/include/linux/namei.h
> +++ b/include/linux/namei.h
> @@ -6,6 +6,7 @@
>   #include <linux/path.h>
>   #include <linux/fcntl.h>
>   #include <linux/errno.h>
> +#include <linux/fs.h>
>   
>   enum { MAX_NESTED_LINKS = 8 };
>   
> @@ -97,6 +98,9 @@ extern void unlock_rename(struct dentry *, struct dentry *);
>   
>   extern void nd_jump_link(struct path *path);
>   
> +extern int filename_lookup(int dfd, struct filename *name, unsigned int flags,
> +		    struct path *path, struct path *root);
> +
>   static inline void nd_terminate_link(void *name, size_t len, size_t maxlen)
>   {
>   	((char *) name)[min(len, maxlen)] = '\0';
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 4393bd4b2419..6f601f7106e2 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -2741,6 +2741,26 @@ union bpf_attr {
>    *		**-EOPNOTSUPP** kernel configuration does not enable SYN cookies
>    *
>    *		**-EPROTONOSUPPORT** IP packet version is not 4 or 6
> + *
> + * int bpf_get_current_pidns_info(struct bpf_pidns_info *pidns, u32 size_of_pidns)
> + *	Description
> + *		Copies into *pidns* pid, namespace id and tgid as seen by the
> + *		current namespace and also device from /proc/self/ns/pid.
> + *		*size_of_pidns* must be the size of *pidns*
> + *
> + *		This helper is used when pid filtering is needed inside a
> + *		container as bpf_get_current_tgid() helper returns always the
> + *		pid id as seen by the root namespace.
> + *	Return
> + *		0 on success
> + *
> + *		**-EINVAL**  if unable to get ns, pid or tgid of current task.
> + *		Or if size_of_pidns is not valid.
> + *
> + *		**-ENOMEM**  if allocation fails.
> + *
> + *		If unable to get the inode from /proc/self/ns/pid an error code
> + *		will be returned.
>    */
>   #define __BPF_FUNC_MAPPER(FN)		\
>   	FN(unspec),			\
> @@ -2853,7 +2873,8 @@ union bpf_attr {
>   	FN(sk_storage_get),		\
>   	FN(sk_storage_delete),		\
>   	FN(send_signal),		\
> -	FN(tcp_gen_syncookie),
> +	FN(tcp_gen_syncookie),		\
> +	FN(get_current_pidns_info),
>   
>   /* integer value in 'imm' field of BPF_CALL instruction selects which helper
>    * function eBPF program intends to call
> @@ -3604,4 +3625,10 @@ struct bpf_sockopt {
>   	__s32	retval;
>   };
>   
> +struct bpf_pidns_info {
> +	__u32 dev;
> +	__u32 nsid;
> +	__u32 tgid;
> +	__u32 pid;
> +};
>   #endif /* _UAPI__LINUX_BPF_H__ */
> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
> index 8191a7db2777..3159f2a0188c 100644
> --- a/kernel/bpf/core.c
> +++ b/kernel/bpf/core.c
> @@ -2038,6 +2038,7 @@ const struct bpf_func_proto bpf_get_current_uid_gid_proto __weak;
>   const struct bpf_func_proto bpf_get_current_comm_proto __weak;
>   const struct bpf_func_proto bpf_get_current_cgroup_id_proto __weak;
>   const struct bpf_func_proto bpf_get_local_storage_proto __weak;
> +const struct bpf_func_proto bpf_get_current_pidns_info __weak;
>   
>   const struct bpf_func_proto * __weak bpf_get_trace_printk_proto(void)
>   {
> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> index 5e28718928ca..571f24077db2 100644
> --- a/kernel/bpf/helpers.c
> +++ b/kernel/bpf/helpers.c
> @@ -11,6 +11,12 @@
>   #include <linux/uidgid.h>
>   #include <linux/filter.h>
>   #include <linux/ctype.h>
> +#include <linux/pid_namespace.h>
> +#include <linux/major.h>
> +#include <linux/stat.h>
> +#include <linux/namei.h>
> +#include <linux/version.h>
> +
>   
>   #include "../../lib/kstrtox.h"
>   
> @@ -312,6 +318,78 @@ void copy_map_value_locked(struct bpf_map *map, void *dst, void *src,
>   	preempt_enable();
>   }
>   
> +BPF_CALL_2(bpf_get_current_pidns_info, struct bpf_pidns_info *, pidns_info, u32,
> +	 size)
> +{
> +	const char *name = "/proc/self/ns/pid";
> +	struct pid_namespace *pidns = NULL;
> +	struct filename *tmp = NULL;
> +	int len = strlen(name) + 1;
> +	struct inode *inode;
> +	struct path kp;
> +	pid_t tgid = 0;
> +	pid_t pid = 0;
> +	int ret;
> +
> +	if (unlikely(size != sizeof(struct bpf_pidns_info)))
> +		return -EINVAL;
> +
> +	pidns = task_active_pid_ns(current);
> +
> +	if (unlikely(!pidns))
> +		goto clear;
> +
> +	pidns_info->nsid =  pidns->ns.inum;
> +	pid = task_pid_nr_ns(current, pidns);
> +
> +	if (unlikely(!pid))
> +		goto clear;
> +
> +	tgid = task_tgid_nr_ns(current, pidns);
> +
> +	if (unlikely(!tgid))
> +		goto clear;
> +
> +	pidns_info->tgid = (u32) tgid;
> +	pidns_info->pid = (u32) pid;
> +
> +	tmp = kmem_cache_alloc(names_cachep, GFP_ATOMIC);
> +	if (unlikely(!tmp)) {
> +		memset((void *)pidns_info, 0, (size_t) size);
> +		return -ENOMEM;
> +	}
> +
> +	memcpy((char *)tmp->name, name, len);
> +	tmp->uptr = NULL;
> +	tmp->aname = NULL;
> +	tmp->refcnt = 1;
> +
> +	ret = filename_lookup(AT_FDCWD, tmp, 0, &kp, NULL);
> +
> +	if (ret) {
> +		memset((void *)pidns_info, 0, (size_t) size);
> +		return ret;
> +	}
> +
> +	inode = d_backing_inode(kp.dentry);
> +	pidns_info->dev = inode->i_sb->s_dev;
> +
> +	return 0;
> +
> +clear:
> +	memset((void *)pidns_info, 0, (size_t) size);
> +
> +	return -EINVAL;
> +}
> +
> +const struct bpf_func_proto bpf_get_current_pidns_info_proto = {
> +	.func	= bpf_get_current_pidns_info,
> +	.gpl_only	= false,
> +	.ret_type	= RET_INTEGER,
> +	.arg1_type	= ARG_PTR_TO_UNINIT_MEM,
> +	.arg2_type	= ARG_CONST_SIZE,
> +};
> +
>   #ifdef CONFIG_CGROUPS
>   BPF_CALL_0(bpf_get_current_cgroup_id)
>   {
> diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> index ca1255d14576..5e1dc22765a5 100644
> --- a/kernel/trace/bpf_trace.c
> +++ b/kernel/trace/bpf_trace.c
> @@ -709,6 +709,8 @@ tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
>   #endif
>   	case BPF_FUNC_send_signal:
>   		return &bpf_send_signal_proto;
> +	case BPF_FUNC_get_current_pidns_info:
> +		return &bpf_get_current_pidns_info_proto;
>   	default:
>   		return NULL;
>   	}
> diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
> index 1d9be26b4edd..238453ff27d2 100644
> --- a/samples/bpf/Makefile
> +++ b/samples/bpf/Makefile
> @@ -53,6 +53,7 @@ hostprogs-y += task_fd_query
>   hostprogs-y += xdp_sample_pkts
>   hostprogs-y += ibumad
>   hostprogs-y += hbm
> +hostprogs-y += trace_ns_info
>   
>   # Libbpf dependencies
>   LIBBPF = $(TOOLS_PATH)/lib/bpf/libbpf.a
> @@ -109,6 +110,7 @@ task_fd_query-objs := bpf_load.o task_fd_query_user.o $(TRACE_HELPERS)
>   xdp_sample_pkts-objs := xdp_sample_pkts_user.o $(TRACE_HELPERS)
>   ibumad-objs := bpf_load.o ibumad_user.o $(TRACE_HELPERS)
>   hbm-objs := bpf_load.o hbm.o $(CGROUP_HELPERS)
> +trace_ns_info-objs := bpf_load.o trace_ns_info_user.o
>   
>   # Tell kbuild to always build the programs
>   always := $(hostprogs-y)
> @@ -170,6 +172,7 @@ always += xdp_sample_pkts_kern.o
>   always += ibumad_kern.o
>   always += hbm_out_kern.o
>   always += hbm_edt_kern.o
> +always += trace_ns_info_user_kern.o
>   
>   KBUILD_HOSTCFLAGS += -I$(objtree)/usr/include
>   KBUILD_HOSTCFLAGS += -I$(srctree)/tools/lib/bpf/
> diff --git a/samples/bpf/trace_ns_info_user.c b/samples/bpf/trace_ns_info_user.c
> new file mode 100644
> index 000000000000..e06d08db6f30
> --- /dev/null
> +++ b/samples/bpf/trace_ns_info_user.c
> @@ -0,0 +1,35 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright (c) 2018 Carlos Neira cneirabustos@gmail.com
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of version 2 of the GNU General Public
> + * License as published by the Free Software Foundation.
> + */
> +
> +#include <stdio.h>
> +#include <linux/bpf.h>
> +#include <unistd.h>
> +#include "bpf/libbpf.h"
> +#include "bpf_load.h"
> +
> +/* This code was taken verbatim from tracex1_user.c, it's used
> + * to exercize bpf_get_current_pidns_info() helper call.
> + */
> +int main(int ac, char **argv)
> +{
> +	FILE *f;
> +	char filename[256];
> +
> +	snprintf(filename, sizeof(filename), "%s_user_kern.o", argv[0]);
> +	printf("loading %s\n", filename);
> +
> +	if (load_bpf_file(filename)) {
> +		printf("%s", bpf_log_buf);
> +		return 1;
> +	}
> +
> +	f = popen("taskset 1 ping  localhost", "r");
> +	(void) f;
> +	read_trace_pipe();
> +	return 0;
> +}
> diff --git a/samples/bpf/trace_ns_info_user_kern.c b/samples/bpf/trace_ns_info_user_kern.c
> new file mode 100644
> index 000000000000..96675e02b707
> --- /dev/null
> +++ b/samples/bpf/trace_ns_info_user_kern.c
> @@ -0,0 +1,44 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright (c) 2018 Carlos Neira cneirabustos@gmail.com
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of version 2 of the GNU General Public
> + * License as published by the Free Software Foundation.
> + */
> +#include <linux/skbuff.h>
> +#include <linux/netdevice.h>
> +#include <linux/version.h>
> +#include <uapi/linux/bpf.h>
> +#include "bpf_helpers.h"
> +
> +typedef __u64 u64;
> +typedef __u32 u32;
> +
> +
> +/* kprobe is NOT a stable ABI
> + * kernel functions can be removed, renamed or completely change semantics.
> + * Number of arguments and their positions can change, etc.
> + * In such case this bpf+kprobe example will no longer be meaningful
> + */
> +
> +/* This will call bpf_get_current_pidns_info() to display pid and ns values
> + * as seen by the current namespace, on the far left you will see the pid as
> + * seen as by the root namespace.
> + */
> +
> +SEC("kprobe/__netif_receive_skb_core")
> +int bpf_prog1(struct pt_regs *ctx)
> +{
> +	char fmt[] = "nsid:%u, dev: %u,  pid:%u\n";
> +	struct bpf_pidns_info nsinfo;
> +	int ok = 0;
> +
> +	ok = bpf_get_current_pidns_info(&nsinfo, sizeof(nsinfo));
> +	if (ok == 0)
> +		bpf_trace_printk(fmt, sizeof(fmt), (u32)nsinfo.nsid,
> +				 (u32) nsinfo.dev, (u32)nsinfo.pid);
> +
> +	return 0;
> +}
> +char _license[] SEC("license") = "GPL";
> +u32 _version SEC("version") = LINUX_VERSION_CODE;
> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> index 4393bd4b2419..6f601f7106e2 100644
> --- a/tools/include/uapi/linux/bpf.h
> +++ b/tools/include/uapi/linux/bpf.h
> @@ -2741,6 +2741,26 @@ union bpf_attr {
>    *		**-EOPNOTSUPP** kernel configuration does not enable SYN cookies
>    *
>    *		**-EPROTONOSUPPORT** IP packet version is not 4 or 6
> + *
> + * int bpf_get_current_pidns_info(struct bpf_pidns_info *pidns, u32 size_of_pidns)
> + *	Description
> + *		Copies into *pidns* pid, namespace id and tgid as seen by the
> + *		current namespace and also device from /proc/self/ns/pid.
> + *		*size_of_pidns* must be the size of *pidns*
> + *
> + *		This helper is used when pid filtering is needed inside a
> + *		container as bpf_get_current_tgid() helper returns always the
> + *		pid id as seen by the root namespace.
> + *	Return
> + *		0 on success
> + *
> + *		**-EINVAL**  if unable to get ns, pid or tgid of current task.
> + *		Or if size_of_pidns is not valid.
> + *
> + *		**-ENOMEM**  if allocation fails.
> + *
> + *		If unable to get the inode from /proc/self/ns/pid an error code
> + *		will be returned.
>    */
>   #define __BPF_FUNC_MAPPER(FN)		\
>   	FN(unspec),			\
> @@ -2853,7 +2873,8 @@ union bpf_attr {
>   	FN(sk_storage_get),		\
>   	FN(sk_storage_delete),		\
>   	FN(send_signal),		\
> -	FN(tcp_gen_syncookie),
> +	FN(tcp_gen_syncookie),		\
> +	FN(get_current_pidns_info),
>   
>   /* integer value in 'imm' field of BPF_CALL instruction selects which helper
>    * function eBPF program intends to call
> @@ -3604,4 +3625,10 @@ struct bpf_sockopt {
>   	__s32	retval;
>   };
>   
> +struct bpf_pidns_info {
> +	__u32 dev;
> +	__u32 nsid;
> +	__u32 tgid;
> +	__u32 pid;
> +};
>   #endif /* _UAPI__LINUX_BPF_H__ */
> diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
> index 3bd0f4a0336a..1f97b571b581 100644
> --- a/tools/testing/selftests/bpf/Makefile
> +++ b/tools/testing/selftests/bpf/Makefile
> @@ -29,7 +29,7 @@ TEST_GEN_PROGS = test_verifier test_tag test_maps test_lru_map test_lpm_map test
>   	test_cgroup_storage test_select_reuseport test_section_names \
>   	test_netcnt test_tcpnotify_user test_sock_fields test_sysctl test_hashmap \
>   	test_btf_dump test_cgroup_attach xdping test_sockopt test_sockopt_sk \
> -	test_sockopt_multi test_tcp_rtt
> +	test_sockopt_multi test_tcp_rtt test_pidns
>   
>   BPF_OBJ_FILES = $(patsubst %.c,%.o, $(notdir $(wildcard progs/*.c)))
>   TEST_GEN_FILES = $(BPF_OBJ_FILES)
> diff --git a/tools/testing/selftests/bpf/bpf_helpers.h b/tools/testing/selftests/bpf/bpf_helpers.h
> index 120aa86c58d3..c96795a9d983 100644
> --- a/tools/testing/selftests/bpf/bpf_helpers.h
> +++ b/tools/testing/selftests/bpf/bpf_helpers.h
> @@ -231,6 +231,9 @@ static int (*bpf_send_signal)(unsigned sig) = (void *)BPF_FUNC_send_signal;
>   static long long (*bpf_tcp_gen_syncookie)(struct bpf_sock *sk, void *ip,
>   					  int ip_len, void *tcp, int tcp_len) =
>   	(void *) BPF_FUNC_tcp_gen_syncookie;
> +static int (*bpf_get_current_pidns_info)(struct bpf_pidns_info *buf,
> +					 unsigned int buf_size) =
> +	(void *) BPF_FUNC_get_current_pidns_info;
>   
>   /* llvm builtin functions that eBPF C program may use to
>    * emit BPF_LD_ABS and BPF_LD_IND instructions
> diff --git a/tools/testing/selftests/bpf/progs/test_pidns_kern.c b/tools/testing/selftests/bpf/progs/test_pidns_kern.c
> new file mode 100644
> index 000000000000..e1d2facfa762
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/test_pidns_kern.c
> @@ -0,0 +1,51 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright (c) 2018 Carlos Neira cneirabustos@gmail.com
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of version 2 of the GNU General Public
> + * License as published by the Free Software Foundation.
> + */
> +
> +#include <linux/bpf.h>
> +#include <errno.h>
> +#include "bpf_helpers.h"
> +
> +struct bpf_map_def SEC("maps") nsidmap = {
> +	.type = BPF_MAP_TYPE_ARRAY,
> +	.key_size = sizeof(__u32),
> +	.value_size = sizeof(__u32),
> +	.max_entries = 1,
> +};
> +
> +struct bpf_map_def SEC("maps") pidmap = {
> +	.type = BPF_MAP_TYPE_ARRAY,
> +	.key_size = sizeof(__u32),
> +	.value_size = sizeof(__u32),
> +	.max_entries = 1,
> +};
> +
> +SEC("tracepoint/syscalls/sys_enter_nanosleep")
> +int trace(void *ctx)
> +{
> +	struct bpf_pidns_info nsinfo;
> +	__u32 key = 0, *expected_pid, *val;
> +	char fmt[] = "ERROR nspid:%d\n";
> +
> +	if (bpf_get_current_pidns_info(&nsinfo, sizeof(nsinfo)))
> +		return -EINVAL;
> +
> +	expected_pid = bpf_map_lookup_elem(&pidmap, &key);
> +
> +
> +	if (!expected_pid || *expected_pid != nsinfo.pid)
> +		return 0;
> +
> +	val = bpf_map_lookup_elem(&nsidmap, &key);
> +	if (val)
> +		*val = nsinfo.nsid;
> +
> +	return 0;
> +}
> +
> +char _license[] SEC("license") = "GPL";
> +__u32 _version SEC("version") = 1;
> diff --git a/tools/testing/selftests/bpf/test_pidns.c b/tools/testing/selftests/bpf/test_pidns.c
> new file mode 100644
> index 000000000000..a7254055f294
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/test_pidns.c
> @@ -0,0 +1,138 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright (c) 2018 Carlos Neira cneirabustos@gmail.com
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of version 2 of the GNU General Public
> + * License as published by the Free Software Foundation.
> + */
> +
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <string.h>
> +#include <errno.h>
> +#include <fcntl.h>
> +#include <syscall.h>
> +#include <unistd.h>
> +#include <linux/perf_event.h>
> +#include <sys/ioctl.h>
> +#include <sys/time.h>
> +#include <sys/types.h>
> +#include <sys/stat.h>
> +
> +#include <linux/bpf.h>
> +#include <bpf/bpf.h>
> +#include <bpf/libbpf.h>
> +
> +#include "cgroup_helpers.h"
> +#include "bpf_rlimit.h"
> +
> +#define CHECK(condition, tag, format...) ({		\
> +	int __ret = !!(condition);			\
> +	if (__ret) {					\
> +		printf("%s:FAIL:%s ", __func__, tag);	\
> +		printf(format);				\
> +	} else {					\
> +		printf("%s:PASS:%s\n", __func__, tag);	\
> +	}						\
> +	__ret;						\
> +})
> +
> +static int bpf_find_map(const char *test, struct bpf_object *obj,
> +			const char *name)
> +{
> +	struct bpf_map *map;
> +
> +	map = bpf_object__find_map_by_name(obj, name);
> +	if (!map)
> +		return -1;
> +	return bpf_map__fd(map);
> +}
> +
> +
> +int main(int argc, char **argv)
> +{
> +	const char *probe_name = "syscalls/sys_enter_nanosleep";
> +	const char *file = "test_pidns_kern.o";
> +	int err, bytes, efd, prog_fd, pmu_fd;
> +	int pidmap_fd, nsidmap_fd;
> +	struct perf_event_attr attr = {};
> +	struct bpf_object *obj;
> +	__u32 knsid = 0;
> +	__u32 key = 0, pid;
> +	int exit_code = 1;
> +	struct stat st;
> +	char buf[256];
> +
> +	err = bpf_prog_load(file, BPF_PROG_TYPE_TRACEPOINT, &obj, &prog_fd);
> +	if (CHECK(err, "bpf_prog_load", "err %d errno %d\n", err, errno))
> +		goto cleanup_cgroup_env;
> +
> +	nsidmap_fd = bpf_find_map(__func__, obj, "nsidmap");
> +	if (CHECK(nsidmap_fd < 0, "bpf_find_map", "err %d errno %d\n",
> +		  nsidmap_fd, errno))
> +		goto close_prog;
> +
> +	pidmap_fd = bpf_find_map(__func__, obj, "pidmap");
> +	if (CHECK(pidmap_fd < 0, "bpf_find_map", "err %d errno %d\n",
> +		  pidmap_fd, errno))
> +		goto close_prog;
> +
> +	pid = getpid();
> +	bpf_map_update_elem(pidmap_fd, &key, &pid, 0);
> +
> +	snprintf(buf, sizeof(buf),
> +		 "/sys/kernel/debug/tracing/events/%s/id", probe_name);
> +	efd = open(buf, O_RDONLY, 0);
> +	if (CHECK(efd < 0, "open", "err %d errno %d\n", efd, errno))
> +		goto close_prog;
> +	bytes = read(efd, buf, sizeof(buf));
> +	close(efd);
> +	if (CHECK(bytes <= 0 || bytes >= sizeof(buf), "read",
> +		  "bytes %d errno %d\n", bytes, errno))
> +		goto close_prog;
> +
> +	attr.config = strtol(buf, NULL, 0);
> +	attr.type = PERF_TYPE_TRACEPOINT;
> +	attr.sample_type = PERF_SAMPLE_RAW;
> +	attr.sample_period = 1;
> +	attr.wakeup_events = 1;
> +
> +	pmu_fd = syscall(__NR_perf_event_open, &attr, getpid(), -1, -1, 0);
> +	if (CHECK(pmu_fd < 0, "perf_event_open", "err %d errno %d\n", pmu_fd,
> +		  errno))
> +		goto close_prog;
> +
> +	err = ioctl(pmu_fd, PERF_EVENT_IOC_ENABLE, 0);
> +	if (CHECK(err, "perf_event_ioc_enable", "err %d errno %d\n", err,
> +		  errno))
> +		goto close_pmu;
> +
> +	err = ioctl(pmu_fd, PERF_EVENT_IOC_SET_BPF, prog_fd);
> +	if (CHECK(err, "perf_event_ioc_set_bpf", "err %d errno %d\n", err,
> +		  errno))
> +		goto close_pmu;
> +
> +	/* trigger some syscalls */
> +	sleep(1);
> +
> +	err = bpf_map_lookup_elem(nsidmap_fd, &key, &knsid);
> +	if (CHECK(err, "bpf_map_lookup_elem", "err %d errno %d\n", err, errno))
> +		goto close_pmu;
> +
> +	if (stat("/proc/self/ns/pid", &st))
> +		goto close_pmu;
> +
> +	if (CHECK(knsid != (__u32) st.st_ino, "compare_namespace_id",
> +		  "kern knsid %u user unsid %u\n", knsid, (__u32) st.st_ino))
> +		goto close_pmu;
> +
> +	exit_code = 0;
> +	printf("%s:PASS\n", argv[0]);
> +
> +close_pmu:
> +	close(pmu_fd);
> +close_prog:
> +	bpf_object__close(obj);
> +cleanup_cgroup_env:
> +	return exit_code;
> +}
> 

^ permalink raw reply

* Re: [PATCH net-next] net/ncsi: allow to customize BMC MAC Address offset
From: Tao Ren @ 2019-08-08  4:48 UTC (permalink / raw)
  To: Andrew Lunn, Jakub Kicinski
  Cc: Samuel Mendoza-Jonas, David S . Miller, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org, openbmc@lists.ozlabs.org,
	William Kennington, Joel Stanley
In-Reply-To: <20190807184143.GE26047@lunn.ch>

On 8/7/19 11:41 AM, Andrew Lunn wrote:
> On Wed, Aug 07, 2019 at 11:25:18AM -0700, Jakub Kicinski wrote:
>> On Tue, 6 Aug 2019 17:21:18 -0700, Tao Ren wrote:
>>> Currently BMC's MAC address is calculated by adding 1 to NCSI NIC's base
>>> MAC address when CONFIG_NCSI_OEM_CMD_GET_MAC option is enabled. The logic
>>> doesn't work for platforms with different BMC MAC offset: for example,
>>> Facebook Yamp BMC's MAC address is calculated by adding 2 to NIC's base
>>> MAC address ("BaseMAC + 1" is reserved for Host use).
>>>
>>> This patch adds NET_NCSI_MC_MAC_OFFSET config option to customize offset
>>> between NIC's Base MAC address and BMC's MAC address. Its default value is
>>> set to 1 to avoid breaking existing users.
>>>
>>> Signed-off-by: Tao Ren <taoren@fb.com>
>>
>> Maybe someone more knowledgeable like Andrew has an opinion here, 
>> but to me it seems a bit strange to encode what seems to be platfrom
>> information in the kernel config :(
> 
> Yes, this is not a good idea. It makes it impossible to have a 'BMC
> distro' kernel which you install on a number of different BMCs.
> 
> A device tree property would be better. Ideally it would be the
> standard local-mac-address, or mac-address.

Thank you Andrew and Jakub for the feedback. I picked up kconfig approach mainly because it's an OEM-only extention which is used only when NCSI_OEM_CMD_GET_MAC is enabled.

Let me prepare patch v2 using device tree. I'm not sure if standard "mac-address" fits this situation because all we need is an offset (integer) and BMC MAC is calculated by adding the offset to NIC's MAC address. Anyways, let me work out v2 patch we can discuss more then.


Thanks,

Tao

^ permalink raw reply

* Re: [PATCH net-next] net/ncsi: allow to customize BMC MAC Address offset
From: Tao Ren @ 2019-08-08  4:51 UTC (permalink / raw)
  To: Vijay Khemka, Samuel Mendoza-Jonas, David S . Miller,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	openbmc@lists.ozlabs.org, William Kennington, Joel Stanley
In-Reply-To: <75DDAF9A-DABC-4670-BEC0-320185017642@fb.com>

On 8/7/19 10:36 AM, Vijay Khemka wrote:
> Lgtm except one small comment below.
> 
> On 8/6/19, 5:22 PM, "openbmc on behalf of Tao Ren" <openbmc-bounces+vijaykhemka=fb.com@lists.ozlabs.org on behalf of taoren@fb.com> wrote:
> 
>     Currently BMC's MAC address is calculated by adding 1 to NCSI NIC's base
>     MAC address when CONFIG_NCSI_OEM_CMD_GET_MAC option is enabled. The logic
>     doesn't work for platforms with different BMC MAC offset: for example,
>     Facebook Yamp BMC's MAC address is calculated by adding 2 to NIC's base
>     MAC address ("BaseMAC + 1" is reserved for Host use).
>     
>     This patch adds NET_NCSI_MC_MAC_OFFSET config option to customize offset
>     between NIC's Base MAC address and BMC's MAC address. Its default value is
>     set to 1 to avoid breaking existing users.
>     
>     Signed-off-by: Tao Ren <taoren@fb.com>
>     ---
>      net/ncsi/Kconfig    |  8 ++++++++
>      net/ncsi/ncsi-rsp.c | 15 +++++++++++++--
>      2 files changed, 21 insertions(+), 2 deletions(-)
>     
>     diff --git a/net/ncsi/Kconfig b/net/ncsi/Kconfig
>     index 2f1e5756c03a..be8efe1ed99e 100644
>     --- a/net/ncsi/Kconfig
>     +++ b/net/ncsi/Kconfig
>     @@ -17,3 +17,11 @@ config NCSI_OEM_CMD_GET_MAC
>      	---help---
>      	  This allows to get MAC address from NCSI firmware and set them back to
>      		controller.
>     +config NET_NCSI_MC_MAC_OFFSET
>     +	int
>     +	prompt "Offset of Management Controller's MAC Address"
>     +	depends on NCSI_OEM_CMD_GET_MAC
>     +	default 1
>     +	help
>     +	  This defines the offset between Network Controller's (base) MAC
>     +	  address and Management Controller's MAC address.
>     diff --git a/net/ncsi/ncsi-rsp.c b/net/ncsi/ncsi-rsp.c
>     index 7581bf919885..24a791f9ebf5 100644
>     --- a/net/ncsi/ncsi-rsp.c
>     +++ b/net/ncsi/ncsi-rsp.c
>     @@ -656,6 +656,11 @@ static int ncsi_rsp_handler_oem_bcm_gma(struct ncsi_request *nr)
>      	struct ncsi_rsp_oem_pkt *rsp;
>      	struct sockaddr saddr;
>      	int ret = 0;
>     +#ifdef CONFIG_NET_NCSI_MC_MAC_OFFSET
>     +	int mac_offset = CONFIG_NET_NCSI_MC_MAC_OFFSET;
>     +#else
>     +	int mac_offset = 1;
>     +#endif
>      
>      	/* Get the response header */
>      	rsp = (struct ncsi_rsp_oem_pkt *)skb_network_header(nr->rsp);
>     @@ -663,8 +668,14 @@ static int ncsi_rsp_handler_oem_bcm_gma(struct ncsi_request *nr)
>      	saddr.sa_family = ndev->type;
>      	ndev->priv_flags |= IFF_LIVE_ADDR_CHANGE;
>      	memcpy(saddr.sa_data, &rsp->data[BCM_MAC_ADDR_OFFSET], ETH_ALEN);
>     -	/* Increase mac address by 1 for BMC's address */
>     -	eth_addr_inc((u8 *)saddr.sa_data);
>     +
>     +	/* Management Controller's MAC address is calculated by adding
>     +	 * the offset to Network Controller's (base) MAC address.
>     +	 * Note: negative offset is "ignored", and BMC will use the Base
> Just mention negative and zero offset is ignored. As you are ignoring 0 as well.

Thank you Vijay for the review.

Zero offset is not ignored: users get what they want when setting offset to 0 (BMC-MAC = Base-MAC).


Thanks,

Tao

^ permalink raw reply

* [PATCH] liquidio: Use pcie_flr() instead of reimplementing it
From: Denis Efremov @ 2019-08-08  4:57 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Denis Efremov, Derek Chickles, Satanand Burla, Felix Manlunas,
	netdev, linux-pci, linux-kernel

octeon_mbox_process_cmd() directly writes the PCI_EXP_DEVCTL_BCR_FLR
bit, which bypasses timing requirements imposed by the PCIe spec.
This patch fixes the function to use the pcie_flr() interface instead.

Signed-off-by: Denis Efremov <efremov@linux.com>
---
 drivers/net/ethernet/cavium/liquidio/octeon_mailbox.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/cavium/liquidio/octeon_mailbox.c b/drivers/net/ethernet/cavium/liquidio/octeon_mailbox.c
index 021d99cd1665..614d07be7181 100644
--- a/drivers/net/ethernet/cavium/liquidio/octeon_mailbox.c
+++ b/drivers/net/ethernet/cavium/liquidio/octeon_mailbox.c
@@ -260,9 +260,7 @@ static int octeon_mbox_process_cmd(struct octeon_mbox *mbox,
 		dev_info(&oct->pci_dev->dev,
 			 "got a request for FLR from VF that owns DPI ring %u\n",
 			 mbox->q_no);
-		pcie_capability_set_word(
-			oct->sriov_info.dpiring_to_vfpcidev_lut[mbox->q_no],
-			PCI_EXP_DEVCTL, PCI_EXP_DEVCTL_BCR_FLR);
+		pcie_flr(oct->sriov_info.dpiring_to_vfpcidev_lut[mbox->q_no]);
 		break;
 
 	case OCTEON_PF_CHANGED_VF_MACADDR:
-- 
2.21.0


^ permalink raw reply related

* Re: [PATCH v5 bpf-next] BPF: helpers: New helper to obtain namespace data from current task
From: Yonghong Song @ 2019-08-08  5:09 UTC (permalink / raw)
  To: Carlos Antonio Neira Bustos, netdev@vger.kernel.org
  Cc: ebiederm@xmission.com, brouer@redhat.com,
	quentin.monnet@netronome.com
In-Reply-To: <20190808012240.htbgpv2mhktvig5h@dev00>



On 8/7/19 6:22 PM, Carlos Antonio Neira Bustos wrote:
> The code has been modified to avoid syscalls that could sleep.
> Please let me know if any other modification is needed.
> 
>  From be0384c0fa209a78c1567936e8db4e35b9a7c0f8 Mon Sep 17 00:00:00 2001
> From: Carlos <cneirabustos@gmail.com>
> Date: Wed, 7 Aug 2019 20:04:30 -0400
> Subject: [PATCH] [PATCH v5 bpf-next] BPF: New helper to obtain namespace data
>   from current task
> 
> This helper obtains the active namespace from current and returns pid, tgid,
> device and namespace id as seen from that namespace, allowing to instrument
> a process inside a container.
> Device is read from /proc/self/ns/pid, as in the future it's possible that
> different pid_ns files may belong to different devices, according
> to the discussion between Eric Biederman and Yonghong in 2017 linux plumbers
> conference.
> Currently bpf_get_current_pid_tgid(), is used to do pid filtering in bcc's
> scripts but this helper returns the pid as seen by the root namespace which is
> fine when a bcc script is not executed inside a container.
> When the process of interest is inside a container, pid filtering will not work
> if bpf_get_current_pid_tgid() is used. This helper addresses this limitation
> returning the pid as it's seen by the current namespace where the script is
> executing.
> 
> This helper has the same use cases as bpf_get_current_pid_tgid() as it can be
> used to do pid filtering even inside a container.
> 
> For example a bcc script using bpf_get_current_pid_tgid() (tools/funccount.py):
> 
>          u32 pid = bpf_get_current_pid_tgid() >> 32;
>          if (pid != <pid_arg_passed_in>)
>                  return 0;
> Could be modified to use bpf_get_current_pidns_info() as follows:
> 
>          struct bpf_pidns pidns;
>          bpf_get_current_pidns_info(&pidns, sizeof(struct bpf_pidns));
>          u32 pid = pidns.tgid;
>          u32 nsid = pidns.nsid;
>          if ((pid != <pid_arg_passed_in>) && (nsid != <nsid_arg_passed_in>))
>                  return 0;
> 
> To find out the name PID namespace id of a process, you could use this command:
> 
> $ ps -h -o pidns -p <pid_of_interest>
> 
> Or this other command:
> 
> $ ls -Li /proc/<pid_of_interest>/ns/pid
> 
> Signed-off-by: Carlos Neira <cneirabustos@gmail.com>
> ---
>   fs/namei.c                                         |   2 +-
>   include/linux/bpf.h                                |   1 +
>   include/linux/namei.h                              |   4 +
>   include/uapi/linux/bpf.h                           |  29 ++++-
>   kernel/bpf/core.c                                  |   1 +
>   kernel/bpf/helpers.c                               |  78 ++++++++++++
>   kernel/trace/bpf_trace.c                           |   2 +
>   samples/bpf/Makefile                               |   3 +
>   samples/bpf/trace_ns_info_user.c                   |  35 ++++++
>   samples/bpf/trace_ns_info_user_kern.c              |  44 +++++++
>   tools/include/uapi/linux/bpf.h                     |  29 ++++-
>   tools/testing/selftests/bpf/Makefile               |   2 +-
>   tools/testing/selftests/bpf/bpf_helpers.h          |   3 +
>   .../testing/selftests/bpf/progs/test_pidns_kern.c  |  51 ++++++++
>   tools/testing/selftests/bpf/test_pidns.c           | 138 +++++++++++++++++++++
>   15 files changed, 418 insertions(+), 4 deletions(-)
>   create mode 100644 samples/bpf/trace_ns_info_user.c
>   create mode 100644 samples/bpf/trace_ns_info_user_kern.c
>   create mode 100644 tools/testing/selftests/bpf/progs/test_pidns_kern.c
>   create mode 100644 tools/testing/selftests/bpf/test_pidns.c
> 
> diff --git a/fs/namei.c b/fs/namei.c
> index 209c51a5226c..d1eca36972d2 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -19,7 +19,6 @@
>   #include <linux/export.h>
>   #include <linux/kernel.h>
>   #include <linux/slab.h>
> -#include <linux/fs.h>
>   #include <linux/namei.h>
>   #include <linux/pagemap.h>
>   #include <linux/fsnotify.h>
> @@ -2355,6 +2354,7 @@ int filename_lookup(int dfd, struct filename *name, unsigned flags,
>   	putname(name);
>   	return retval;
>   }
> +EXPORT_SYMBOL(filename_lookup);

No need to export symbols. bpf uses it and bpf is in the core, not in 
modules.

>   
>   /* Returns 0 and nd will be valid on success; Retuns error, otherwise. */
>   static int path_parentat(struct nameidata *nd, unsigned flags,
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index f9a506147c8a..e4adf5e05afd 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -1050,6 +1050,7 @@ extern const struct bpf_func_proto bpf_get_local_storage_proto;
>   extern const struct bpf_func_proto bpf_strtol_proto;
>   extern const struct bpf_func_proto bpf_strtoul_proto;
>   extern const struct bpf_func_proto bpf_tcp_sock_proto;
> +extern const struct bpf_func_proto bpf_get_current_pidns_info_proto;
>   
>   /* Shared helpers among cBPF and eBPF. */
>   void bpf_user_rnd_init_once(void);
> diff --git a/include/linux/namei.h b/include/linux/namei.h
> index 9138b4471dbf..2c24e8c71d46 100644
> --- a/include/linux/namei.h
> +++ b/include/linux/namei.h
> @@ -6,6 +6,7 @@
>   #include <linux/path.h>
>   #include <linux/fcntl.h>
>   #include <linux/errno.h>
> +#include <linux/fs.h>
>   
>   enum { MAX_NESTED_LINKS = 8 };
>   
> @@ -97,6 +98,9 @@ extern void unlock_rename(struct dentry *, struct dentry *);
>   
>   extern void nd_jump_link(struct path *path);
>   
> +extern int filename_lookup(int dfd, struct filename *name, unsigned int flags,
> +		    struct path *path, struct path *root);

The previous definition in fs/internal.h should be removed.

> +
>   static inline void nd_terminate_link(void *name, size_t len, size_t maxlen)
>   {
>   	((char *) name)[min(len, maxlen)] = '\0';
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 4393bd4b2419..6f601f7106e2 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -2741,6 +2741,26 @@ union bpf_attr {
>    *		**-EOPNOTSUPP** kernel configuration does not enable SYN cookies
>    *
>    *		**-EPROTONOSUPPORT** IP packet version is not 4 or 6
> + *
> + * int bpf_get_current_pidns_info(struct bpf_pidns_info *pidns, u32 size_of_pidns)
> + *	Description
> + *		Copies into *pidns* pid, namespace id and tgid as seen by the
> + *		current namespace and also device from /proc/self/ns/pid.
> + *		*size_of_pidns* must be the size of *pidns*
> + *
> + *		This helper is used when pid filtering is needed inside a
> + *		container as bpf_get_current_tgid() helper returns always the
> + *		pid id as seen by the root namespace.
> + *	Return
> + *		0 on success
> + *
> + *		**-EINVAL**  if unable to get ns, pid or tgid of current task.
> + *		Or if size_of_pidns is not valid.

Maybe reword by following the code sequence.
    if *size_of_pidns* is not valid or unable to get ns, pid or tgid of
    the current task.

> + *
> + *		**-ENOMEM**  if allocation fails.

Maybe some other error codes in filename_lookup() function?

> + *
> + *		If unable to get the inode from /proc/self/ns/pid an error code
> + *		will be returned.

You do not need this. The description of error code cases should cover this.

>    */
>   #define __BPF_FUNC_MAPPER(FN)		\
>   	FN(unspec),			\
> @@ -2853,7 +2873,8 @@ union bpf_attr {
>   	FN(sk_storage_get),		\
>   	FN(sk_storage_delete),		\
>   	FN(send_signal),		\
> -	FN(tcp_gen_syncookie),
> +	FN(tcp_gen_syncookie),		\
> +	FN(get_current_pidns_info),
>   
>   /* integer value in 'imm' field of BPF_CALL instruction selects which helper
>    * function eBPF program intends to call
> @@ -3604,4 +3625,10 @@ struct bpf_sockopt {
>   	__s32	retval;
>   };
>   
> +struct bpf_pidns_info {
> +	__u32 dev;
> +	__u32 nsid;
> +	__u32 tgid;
> +	__u32 pid;
> +};
>   #endif /* _UAPI__LINUX_BPF_H__ */
> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
> index 8191a7db2777..3159f2a0188c 100644
> --- a/kernel/bpf/core.c
> +++ b/kernel/bpf/core.c
> @@ -2038,6 +2038,7 @@ const struct bpf_func_proto bpf_get_current_uid_gid_proto __weak;
>   const struct bpf_func_proto bpf_get_current_comm_proto __weak;
>   const struct bpf_func_proto bpf_get_current_cgroup_id_proto __weak;
>   const struct bpf_func_proto bpf_get_local_storage_proto __weak;
> +const struct bpf_func_proto bpf_get_current_pidns_info __weak;
>   
>   const struct bpf_func_proto * __weak bpf_get_trace_printk_proto(void)
>   {
> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> index 5e28718928ca..571f24077db2 100644
> --- a/kernel/bpf/helpers.c
> +++ b/kernel/bpf/helpers.c
> @@ -11,6 +11,12 @@
>   #include <linux/uidgid.h>
>   #include <linux/filter.h>
>   #include <linux/ctype.h>
> +#include <linux/pid_namespace.h>
> +#include <linux/major.h>
> +#include <linux/stat.h>
> +#include <linux/namei.h>
> +#include <linux/version.h>
> +
>   
>   #include "../../lib/kstrtox.h"
>   
> @@ -312,6 +318,78 @@ void copy_map_value_locked(struct bpf_map *map, void *dst, void *src,
>   	preempt_enable();
>   }
>   
> +BPF_CALL_2(bpf_get_current_pidns_info, struct bpf_pidns_info *, pidns_info, u32,
> +	 size)
> +{
> +	const char *name = "/proc/self/ns/pid";

maybe rename this variable to pidns_path?

> +	struct pid_namespace *pidns = NULL;
> +	struct filename *tmp = NULL;

Maybe rename this variable to name?

> +	int len = strlen(name) + 1;

We can delay this assignment later until it is needed.

> +	struct inode *inode;
> +	struct path kp;
> +	pid_t tgid = 0;
> +	pid_t pid = 0;
> +	int ret;
> +
> +	if (unlikely(size != sizeof(struct bpf_pidns_info)))
> +		return -EINVAL;
> +
> +	pidns = task_active_pid_ns(current);
> +

we can save an empty line here.

> +	if (unlikely(!pidns))
> +		goto clear;
> +
> +	pidns_info->nsid =  pidns->ns.inum;
> +	pid = task_pid_nr_ns(current, pidns);
> +

We can save an empty line here.

> +	if (unlikely(!pid))
> +		goto clear;
> +
> +	tgid = task_tgid_nr_ns(current, pidns);
> +
ditto. save an empty line.
> +	if (unlikely(!tgid))
> +		goto clear;
> +
> +	pidns_info->tgid = (u32) tgid;
> +	pidns_info->pid = (u32) pid;
> +
> +	tmp = kmem_cache_alloc(names_cachep, GFP_ATOMIC);
> +	if (unlikely(!tmp)) {
> +		memset((void *)pidns_info, 0, (size_t) size);
> +		return -ENOMEM;
> +	}
> +
> +	memcpy((char *)tmp->name, name, len);
> +	tmp->uptr = NULL;
> +	tmp->aname = NULL;
> +	tmp->refcnt = 1;
> +
ditto. save an empty line.
> +	ret = filename_lookup(AT_FDCWD, tmp, 0, &kp, NULL);
> +
ditto. save an empty line.
> +	if (ret) {
> +		memset((void *)pidns_info, 0, (size_t) size);
> +		return ret;
> +	}
> +
> +	inode = d_backing_inode(kp.dentry);
> +	pidns_info->dev = inode->i_sb->s_dev;
> +
> +	return 0;
> +
> +clear:
> +	memset((void *)pidns_info, 0, (size_t) size);
> +
save an empty line.
> +	return -EINVAL;
> +}
> +
> +const struct bpf_func_proto bpf_get_current_pidns_info_proto = {
> +	.func	= bpf_get_current_pidns_info,
make the "= " aligned with others?
> +	.gpl_only	= false,
> +	.ret_type	= RET_INTEGER,
> +	.arg1_type	= ARG_PTR_TO_UNINIT_MEM,
> +	.arg2_type	= ARG_CONST_SIZE,
> +};
> +
>   #ifdef CONFIG_CGROUPS
>   BPF_CALL_0(bpf_get_current_cgroup_id)
>   {
> diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> index ca1255d14576..5e1dc22765a5 100644
> --- a/kernel/trace/bpf_trace.c
> +++ b/kernel/trace/bpf_trace.c
> @@ -709,6 +709,8 @@ tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
>   #endif
>   	case BPF_FUNC_send_signal:
>   		return &bpf_send_signal_proto;
> +	case BPF_FUNC_get_current_pidns_info:
> +		return &bpf_get_current_pidns_info_proto;
>   	default:
>   		return NULL;
>   	}
> diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
> index 1d9be26b4edd..238453ff27d2 100644
> --- a/samples/bpf/Makefile
> +++ b/samples/bpf/Makefile
> @@ -53,6 +53,7 @@ hostprogs-y += task_fd_query
>   hostprogs-y += xdp_sample_pkts
>   hostprogs-y += ibumad
>   hostprogs-y += hbm
> +hostprogs-y += trace_ns_info
[...]

^ permalink raw reply

* Re: [PATCH net 1/2] mlxsw: spectrum: Fix error path in mlxsw_sp_module_init()
From: Jesse Brandeburg @ 2019-08-08  5:19 UTC (permalink / raw)
  To: Ido Schimmel
  Cc: netdev, davem, jiri, petrm, mlxsw, Ido Schimmel, jesse.brandeburg
In-Reply-To: <20190731063315.9381-2-idosch@idosch.org>

On Wed, 31 Jul 2019 09:33:14 +0300
Ido Schimmel <idosch@idosch.org> wrote:

> From: Jiri Pirko <jiri@mellanox.com>
> 
> In case of sp2 pci driver registration fail, fix the error path to
> start with sp1 pci driver unregister.
> 
> Fixes: c3ab435466d5 ("mlxsw: spectrum: Extend to support Spectrum-2 ASIC")
> Signed-off-by: Jiri Pirko <jiri@mellanox.com>
> Signed-off-by: Ido Schimmel <idosch@mellanox.com>

Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>

^ permalink raw reply

* Re: [PATCH net 2/2] mlxsw: spectrum_buffers: Further reduce pool size on Spectrum-2
From: Jesse Brandeburg @ 2019-08-08  5:19 UTC (permalink / raw)
  To: Ido Schimmel
  Cc: netdev, davem, jiri, petrm, mlxsw, Ido Schimmel, jesse.brandeburg
In-Reply-To: <20190731063315.9381-3-idosch@idosch.org>

On Wed, 31 Jul 2019 09:33:15 +0300
Ido Schimmel <idosch@idosch.org> wrote:

> From: Petr Machata <petrm@mellanox.com>
> 
> In commit e891ce1dd2a5 ("mlxsw: spectrum_buffers: Reduce pool size on
> Spectrum-2"), pool size was reduced to mitigate a problem in port buffer
> usage of ports split four ways. It turns out that this work around does not
> solve the issue, and a further reduction is required.
> 
> Thus reduce the size of pool 0 by another 2.7 MiB, and round down to the
> whole number of cells.
> 
> Fixes: e891ce1dd2a5 ("mlxsw: spectrum_buffers: Reduce pool size on Spectrum-2")
> Signed-off-by: Petr Machata <petrm@mellanox.com>
> Signed-off-by: Ido Schimmel <idosch@mellanox.com>

Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>

^ permalink raw reply

* Re: [PATCH] net: ethernet: et131x: Use GFP_KERNEL instead of GFP_ATOMIC when allocating tx_ring->tcb_ring
From: Jesse Brandeburg @ 2019-08-08  5:23 UTC (permalink / raw)
  To: Christophe JAILLET
  Cc: mark.einon, davem, willy, f.fainelli, andrew, netdev,
	linux-kernel, kernel-janitors, jesse.brandeburg
In-Reply-To: <20190731073842.16948-1-christophe.jaillet@wanadoo.fr>

On Wed, 31 Jul 2019 09:38:42 +0200
Christophe JAILLET <christophe.jaillet@wanadoo.fr> wrote:

> There is no good reason to use GFP_ATOMIC here. Other memory allocations
> are performed with GFP_KERNEL (see other 'dma_alloc_coherent()' below and
> 'kzalloc()' in 'et131x_rx_dma_memory_alloc()')
> 
> Use GFP_KERNEL which should be enough.
> 
> Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>

Sure, but generally I'd say GFP_ATOMIC is ok if you're in an init path
and you can afford to have the allocation thread sleep while memory is
being found by the kernel.

Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>

^ permalink raw reply

* [PATCH] net/netfilter - add missing prototypes.
From: Valdis Klētnieks @ 2019-08-08  5:28 UTC (permalink / raw)
  To: Pablo Neira Ayuso, Jozsef Kadlecsik, Florian Westphal
  Cc: netfilter-devel, coreteam, netdev, linux-kernel

Sparse rightly complains about undeclared symbols.

  CHECK   net/netfilter/nft_set_hash.c
net/netfilter/nft_set_hash.c:647:21: warning: symbol 'nft_set_rhash_type' was not declared. Should it be static?
net/netfilter/nft_set_hash.c:670:21: warning: symbol 'nft_set_hash_type' was not declared. Should it be static?
net/netfilter/nft_set_hash.c:690:21: warning: symbol 'nft_set_hash_fast_type' was not declared. Should it be static?
  CHECK   net/netfilter/nft_set_bitmap.c
net/netfilter/nft_set_bitmap.c:296:21: warning: symbol 'nft_set_bitmap_type' was not declared. Should it be static?
  CHECK   net/netfilter/nft_set_rbtree.c
net/netfilter/nft_set_rbtree.c:470:21: warning: symbol 'nft_set_rbtree_type' was not declared. Should it be static?

Include nf_tables_core.h rather than nf_tables.h to pick up the additional definitions.

Signed-off-by: Valdis Kletnieks <valdis.kletnieks@vt.edu>

diff --git a/net/netfilter/nft_set_bitmap.c b/net/netfilter/nft_set_bitmap.c
index b5aeccdddb22..087a056e34d1 100644
--- a/net/netfilter/nft_set_bitmap.c
+++ b/net/netfilter/nft_set_bitmap.c
@@ -10,7 +10,7 @@
 #include <linux/netlink.h>
 #include <linux/netfilter.h>
 #include <linux/netfilter/nf_tables.h>
-#include <net/netfilter/nf_tables.h>
+#include <net/netfilter/nf_tables_core.h>

 struct nft_bitmap_elem {
 	struct list_head	head;
diff --git a/net/netfilter/nft_set_hash.c b/net/netfilter/nft_set_hash.c
index 6e8d20c03e3d..c490451fcebf 100644
--- a/net/netfilter/nft_set_hash.c
+++ b/net/netfilter/nft_set_hash.c
@@ -16,7 +16,7 @@
 #include <linux/rhashtable.h>
 #include <linux/netfilter.h>
 #include <linux/netfilter/nf_tables.h>
-#include <net/netfilter/nf_tables.h>
+#include <net/netfilter/nf_tables_core.h>

 /* We target a hash table size of 4, element hint is 75% of final size */
 #define NFT_RHASH_ELEMENT_HINT 3
diff --git a/net/netfilter/nft_set_rbtree.c b/net/netfilter/nft_set_rbtree.c
index 419d58ef802b..57123259452f 100644
--- a/net/netfilter/nft_set_rbtree.c
+++ b/net/netfilter/nft_set_rbtree.c
@@ -13,7 +13,7 @@
 #include <linux/netlink.h>
 #include <linux/netfilter.h>
 #include <linux/netfilter/nf_tables.h>
-#include <net/netfilter/nf_tables.h>
+#include <net/netfilter/nf_tables_core.h>

 struct nft_rbtree {
 	struct rb_root		root;

^ permalink raw reply related

* Re: [PATCH 1/2] net: ag71xx: Slighly simplify code in 'ag71xx_rings_init()'
From: Jesse Brandeburg @ 2019-08-08  5:29 UTC (permalink / raw)
  To: Christophe JAILLET
  Cc: jcliburn, davem, chris.snook, netdev, linux-kernel,
	kernel-janitors, jesse.brandeburg
In-Reply-To: <08fbcfe0f913644fe538656221a15790a1a83f1d.1564560130.git.christophe.jaillet@wanadoo.fr>

On Wed, 31 Jul 2019 10:06:38 +0200
Christophe JAILLET <christophe.jaillet@wanadoo.fr> wrote:

> A few lines above, we have:
>    tx_size = BIT(tx->order);
> 
> So use 'tx_size' directly to be consistent with the way 'rx->descs_cpu' and
> 'rx->descs_dma' are computed below.
> 
> Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>

Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>

^ permalink raw reply

* Re: [PATCH 2/2] net: ag71xx: Use GFP_KERNEL instead of GFP_ATOMIC in 'ag71xx_rings_init()'
From: Jesse Brandeburg @ 2019-08-08  5:30 UTC (permalink / raw)
  To: Christophe JAILLET
  Cc: jcliburn, davem, chris.snook, netdev, linux-kernel,
	kernel-janitors, jesse.brandeburg
In-Reply-To: <75ee52ae065ce9cb1f32d88939b166773316dc45.1564560130.git.christophe.jaillet@wanadoo.fr>

On Wed, 31 Jul 2019 10:06:48 +0200
Christophe JAILLET <christophe.jaillet@wanadoo.fr> wrote:

> There is no need to use GFP_ATOMIC here, GFP_KERNEL should be enough.
> The 'kcalloc()' just a few lines above, already uses GFP_KERNEL.
> 
> Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>

Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>

^ permalink raw reply

* Re: [PATCH] myri10ge: remove unneeded variable
From: Jesse Brandeburg @ 2019-08-08  5:31 UTC (permalink / raw)
  To: Ding Xiang; +Cc: christopher.lee, davem, netdev, linux-kernel, jesse.brandeburg
In-Reply-To: <1564563226-13367-1-git-send-email-dingxiang@cmss.chinamobile.com>

On Wed, 31 Jul 2019 16:53:46 +0800
Ding Xiang <dingxiang@cmss.chinamobile.com> wrote:

> "error" is unneeded,just return 0
> 
> Signed-off-by: Ding Xiang <dingxiang@cmss.chinamobile.com>

Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>

^ permalink raw reply

* Re: [PATCH v3 38/41] powerpc: convert put_page() to put_user_page*()
From: Michael Ellerman @ 2019-08-08  5:42 UTC (permalink / raw)
  To: john.hubbard, Andrew Morton
  Cc: Christoph Hellwig, Dan Williams, Dave Chinner, Dave Hansen,
	Ira Weiny, Jan Kara, Jason Gunthorpe, Jérôme Glisse,
	LKML, amd-gfx, ceph-devel, devel, devel, dri-devel, intel-gfx,
	kvm, linux-arm-kernel, linux-block, linux-crypto, linux-fbdev,
	linux-fsdevel, linux-media, linux-mm, linux-nfs, linux-rdma,
	linux-rpi-kernel, linux-xfs, netdev, rds-devel, sparclinux, x86,
	xen-devel, John Hubbard, Benjamin Herrenschmidt,
	Christoph Hellwig, linuxppc-dev
In-Reply-To: <20190807013340.9706-39-jhubbard@nvidia.com>

Hi John,

john.hubbard@gmail.com writes:
> diff --git a/arch/powerpc/mm/book3s64/iommu_api.c b/arch/powerpc/mm/book3s64/iommu_api.c
> index b056cae3388b..e126193ba295 100644
> --- a/arch/powerpc/mm/book3s64/iommu_api.c
> +++ b/arch/powerpc/mm/book3s64/iommu_api.c
> @@ -203,6 +202,7 @@ static void mm_iommu_unpin(struct mm_iommu_table_group_mem_t *mem)
>  {
>  	long i;
>  	struct page *page = NULL;
> +	bool dirty = false;

I don't think you need that initialisation do you?

>  	if (!mem->hpas)
>  		return;
> @@ -215,10 +215,9 @@ static void mm_iommu_unpin(struct mm_iommu_table_group_mem_t *mem)
>  		if (!page)
>  			continue;
>  
> -		if (mem->hpas[i] & MM_IOMMU_TABLE_GROUP_PAGE_DIRTY)
> -			SetPageDirty(page);
> +		dirty = mem->hpas[i] & MM_IOMMU_TABLE_GROUP_PAGE_DIRTY;
> -		put_page(page);
> +		put_user_pages_dirty_lock(&page, 1, dirty);
>  		mem->hpas[i] = 0;
>  	}
>  }

cheers

^ permalink raw reply

* [PATCH] net/netfilter/nf_nat_proto.c - make tables static
From: Valdis Klētnieks @ 2019-08-08  5:43 UTC (permalink / raw)
  To: Pablo Neira Ayuso, Jozsef Kadlecsik, Florian Westphal
  Cc: netfilter-devel, coreteam, netdev, linux-kernel

Sparse warns about two tables not being declared.

  CHECK   net/netfilter/nf_nat_proto.c
net/netfilter/nf_nat_proto.c:725:26: warning: symbol 'nf_nat_ipv4_ops' was not declared. Should it be static?
net/netfilter/nf_nat_proto.c:964:26: warning: symbol 'nf_nat_ipv6_ops' was not declared. Should it be static?

And in fact they can indeed be static.

Signed-off-by: Valdis Kletnieks <valdis.kletnieks@vt.edu>

diff --git a/net/netfilter/nf_nat_proto.c b/net/netfilter/nf_nat_proto.c
index 7ac733ebd060..0a59c14b5177 100644
--- a/net/netfilter/nf_nat_proto.c
+++ b/net/netfilter/nf_nat_proto.c
@@ -722,7 +722,7 @@ nf_nat_ipv4_local_fn(void *priv, struct sk_buff *skb,
 	return ret;
 }
 
-const struct nf_hook_ops nf_nat_ipv4_ops[] = {
+static const struct nf_hook_ops nf_nat_ipv4_ops[] = {
 	/* Before packet filtering, change destination */
 	{
 		.hook		= nf_nat_ipv4_in,
@@ -961,7 +961,7 @@ nf_nat_ipv6_local_fn(void *priv, struct sk_buff *skb,
 	return ret;
 }
 
-const struct nf_hook_ops nf_nat_ipv6_ops[] = {
+static const struct nf_hook_ops nf_nat_ipv6_ops[] = {
 	/* Before packet filtering, change destination */
 	{
 		.hook		= nf_nat_ipv6_in,


^ permalink raw reply related

* Re: [PATCH] team: Add vlan tx offload to hw_enc_features
From: Jesse Brandeburg @ 2019-08-08  5:48 UTC (permalink / raw)
  To: YueHaibing
  Cc: j.vosburgh, vfalico, andy, davem, jiri, jay.vosburgh,
	linux-kernel, netdev, jesse.brandeburg
In-Reply-To: <20190807023808.51976-1-yuehaibing@huawei.com>

On Wed, 7 Aug 2019 10:38:08 +0800
YueHaibing <yuehaibing@huawei.com> wrote:

> We should also enable bonding's vlan tx offload in hw_enc_features,

You mean team's vlan tx offload?

> pass the vlan packets to the slave devices with vlan tci, let them

s/let them to/let the slave/

> to handle vlan tunneling offload implementation.
> 
> Fixes: 3268e5cb494d ("team: Advertise tunneling offload features")
> Signed-off-by: YueHaibing <yuehaibing@huawei.com>


^ permalink raw reply

* Re: [PATCH v2 bpf-next] btf: expose BTF info through sysfs
From: Greg KH @ 2019-08-08  6:08 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Andrii Nakryiko, bpf@vger.kernel.org, netdev@vger.kernel.org,
	Alexei Starovoitov, daniel@iogearbox.net,
	andrii.nakryiko@gmail.com, Kernel Team, Masahiro Yamada,
	Arnaldo Carvalho de Melo, Jiri Olsa, Sam Ravnborg
In-Reply-To: <89a6e282-0250-4264-128d-469be99073e9@fb.com>

On Thu, Aug 08, 2019 at 04:24:25AM +0000, Yonghong Song wrote:
> 
> 
> On 8/7/19 5:32 PM, Andrii Nakryiko wrote:
> > Make .BTF section allocated and expose its contents through sysfs.

Was this original patch not on bpf@vger?  I can't find it in my
archive.  Anyway...

> > /sys/kernel/btf directory is created to contain all the BTFs present
> > inside kernel. Currently there is only kernel's main BTF, represented as
> > /sys/kernel/btf/kernel file. Once kernel modules' BTFs are supported,
> > each module will expose its BTF as /sys/kernel/btf/<module-name> file.

Why are you using sysfs for this?  Who uses "BTF"s?  Are these debugging
images that only people working on developing bpf programs are going to
need, or are these things that you are going to need on a production
system?

I ask as maybe debugfs is the best place for this if they are not needed
on production systems.


> > 
> > Current approach relies on a few pieces coming together:
> > 1. pahole is used to take almost final vmlinux image (modulo .BTF and
> >     kallsyms) and generate .BTF section by converting DWARF info into
> >     BTF. This section is not allocated and not mapped to any segment,
> >     though, so is not yet accessible from inside kernel at runtime.
> > 2. objcopy dumps .BTF contents into binary file and subsequently
> >     convert binary file into linkable object file with automatically
> >     generated symbols _binary__btf_kernel_bin_start and
> >     _binary__btf_kernel_bin_end, pointing to start and end, respectively,
> >     of BTF raw data.
> > 3. final vmlinux image is generated by linking this object file (and
> >     kallsyms, if necessary). sysfs_btf.c then creates
> >     /sys/kernel/btf/kernel file and exposes embedded BTF contents through
> >     it. This allows, e.g., libbpf and bpftool access BTF info at
> >     well-known location, without resorting to searching for vmlinux image
> >     on disk (location of which is not standardized and vmlinux image
> >     might not be even available in some scenarios, e.g., inside qemu
> >     during testing).
> > 
> > Alternative approach using .incbin assembler directive to embed BTF
> > contents directly was attempted but didn't work, because sysfs_proc.o is
> > not re-compiled during link-vmlinux.sh stage. This is required, though,
> > to update embedded BTF data (initially empty data is embedded, then
> > pahole generates BTF info and we need to regenerate sysfs_btf.o with
> > updated contents, but it's too late at that point).
> > 
> > If BTF couldn't be generated due to missing or too old pahole,
> > sysfs_btf.c handles that gracefully by detecting that
> > _binary__btf_kernel_bin_start (weak symbol) is 0 and not creating
> > /sys/kernel/btf at all.
> > 
> > v1->v2:
> > - allow kallsyms stage to re-use vmlinux generated by gen_btf();
> > 
> > Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
> > Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
> > Cc: Jiri Olsa <jolsa@kernel.org>
> > Cc: Sam Ravnborg <sam@ravnborg.org>
> > Signed-off-by: Andrii Nakryiko <andriin@fb.com>
> > ---
> >   kernel/bpf/Makefile     |  3 +++
> >   kernel/bpf/sysfs_btf.c  | 52 ++++++++++++++++++++++++++++++++++++++
> >   scripts/link-vmlinux.sh | 55 +++++++++++++++++++++++++++--------------
> >   3 files changed, 91 insertions(+), 19 deletions(-)
> >   create mode 100644 kernel/bpf/sysfs_btf.c

First rule, you can't create new sysfs files without a matching
Documentation/ABI/ set of entries.  Please do that for the next version
of this patch so we can properly check to see if what you are
documenting lines up with the code.  Otherwise we just have to guess as
to what the entries you are creating actually do.

> > 
> > diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
> > index 29d781061cd5..e1d9adb212f9 100644
> > --- a/kernel/bpf/Makefile
> > +++ b/kernel/bpf/Makefile
> > @@ -22,3 +22,6 @@ obj-$(CONFIG_CGROUP_BPF) += cgroup.o
> >   ifeq ($(CONFIG_INET),y)
> >   obj-$(CONFIG_BPF_SYSCALL) += reuseport_array.o
> >   endif
> > +ifeq ($(CONFIG_SYSFS),y)
> > +obj-$(CONFIG_DEBUG_INFO_BTF) += sysfs_btf.o
> > +endif
> > diff --git a/kernel/bpf/sysfs_btf.c b/kernel/bpf/sysfs_btf.c
> > new file mode 100644
> > index 000000000000..ac06ce1d62e8
> > --- /dev/null
> > +++ b/kernel/bpf/sysfs_btf.c
> > @@ -0,0 +1,52 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/*
> > + * Provide kernel BTF information for introspection and use by eBPF tools.
> > + */
> > +#include <linux/kernel.h>
> > +#include <linux/module.h>
> > +#include <linux/kobject.h>
> > +#include <linux/init.h>
> > +
> > +/* See scripts/link-vmlinux.sh, gen_btf() func for details */
> > +extern char __weak _binary__btf_kernel_bin_start[];
> > +extern char __weak _binary__btf_kernel_bin_end[];
> > +
> > +static ssize_t
> > +btf_kernel_read(struct file *file, struct kobject *kobj,
> > +		struct bin_attribute *bin_attr,
> > +		char *buf, loff_t off, size_t len)
> > +{
> > +	memcpy(buf, _binary__btf_kernel_bin_start + off, len);
> > +	return len;
> > +}
> > +
> > +static struct bin_attribute btf_kernel_attr __ro_after_init = {
> > +	.attr = {
> > +		.name = "kernel",
> > +		.mode = 0444,
> > +	},
> > +	.read = btf_kernel_read,
> > +};

BIN_ATTR_RO()?

> > +
> > +static struct bin_attribute *btf_attrs[] __ro_after_init = {
> > +	&btf_kernel_attr,
> > +	NULL,
> > +};
> > +
> > +static struct attribute_group btf_group_attr __ro_after_init = {
> > +	.name = "btf",
> > +	.bin_attrs = btf_attrs,
> > +};
> > +
> > +static int __init btf_kernel_init(void)
> > +{
> > +	if (!_binary__btf_kernel_bin_start)
> > +		return 0;
> > +
> > +	btf_kernel_attr.size = _binary__btf_kernel_bin_end -
> > +			       _binary__btf_kernel_bin_start;
> > +
> > +	return sysfs_create_group(kernel_kobj, &btf_group_attr);

You are nesting directories here without a "real" kobject in the middle.
Are you _sure_ you want to do that?  It's going to get really tricky
later on based on your comments above about creating multiple files in
that directory over time once "modules" are allowed.

thanks,

greg k-h

^ permalink raw reply

* Re: [PATCH bpf-next] btf: expose BTF info through sysfs
From: Greg KH @ 2019-08-08  6:09 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: bpf, netdev, ast, daniel, yhs, andrii.nakryiko, kernel-team,
	Masahiro Yamada, Arnaldo Carvalho de Melo, Jiri Olsa
In-Reply-To: <20190807183821.138728-1-andriin@fb.com>

On Wed, Aug 07, 2019 at 11:38:21AM -0700, Andrii Nakryiko wrote:
> Make .BTF section allocated and expose its contents through sysfs.

Ah, found the original of this, sorry for the noise on the previous
response...

Still need Documentation/ABI/ entries though :)

thanks,

greg k-h

^ permalink raw reply

* Re: [PATCH net] net: phy: rtl8211f: do a double read to get real time link status
From: Heiner Kallweit @ 2019-08-08  6:11 UTC (permalink / raw)
  To: Yonglong Liu, davem, andrew
  Cc: netdev, linux-kernel, linuxarm, salil.mehta, yisen.zhuang,
	shiju.jose
In-Reply-To: <e663235c-93eb-702d-5a9c-8f781d631c42@huawei.com>

On 08.08.2019 03:15, Yonglong Liu wrote:
> 
> 
> On 2019/8/8 0:47, Heiner Kallweit wrote:
>> On 07.08.2019 15:16, Yonglong Liu wrote:
>>> [   27.232781] hns3 0000:bd:00.3 eth7: net open
>>> [   27.237303] 8021q: adding VLAN 0 to HW filter on device eth7
>>> [   27.242972] IPv6: ADDRCONF(NETDEV_CHANGE): eth7: link becomes ready
>>> [   27.244449] hns3 0000:bd:00.3: invalid speed (-1)
>>> [   27.253904] hns3 0000:bd:00.3 eth7: failed to adjust link.
>>> [   27.259379] RTL8211F Gigabit Ethernet mii-0000:bd:00.3:07: PHY state change UP -> RUNNING
>>> [   27.924903] hns3 0000:bd:00.3 eth7: link up
>>> [   28.280479] RTL8211F Gigabit Ethernet mii-0000:bd:00.3:07: PHY state change RUNNING -> NOLINK
>>> [   29.208452] hns3 0000:bd:00.3 eth7: link down
>>> [   32.376745] RTL8211F Gigabit Ethernet mii-0000:bd:00.3:07: PHY state change NOLINK -> RUNNING
>>> [   33.208448] hns3 0000:bd:00.3 eth7: link up
>>> [   35.253821] hns3 0000:bd:00.3 eth7: net stop
>>> [   35.258270] hns3 0000:bd:00.3 eth7: link down
>>>
>>> When using rtl8211f in polling mode, may get a invalid speed,
>>> because of reading a fake link up and autoneg complete status
>>> immediately after starting autoneg:
>>>
>>>         ifconfig-1176  [007] ....    27.232763: mdio_access: mii-0000:bd:00.3 read  phy:0x07 reg:0x00 val:0x1040
>>>   kworker/u257:1-670   [015] ....    27.232805: mdio_access: mii-0000:bd:00.3 read  phy:0x07 reg:0x04 val:0x01e1
>>>   kworker/u257:1-670   [015] ....    27.232815: mdio_access: mii-0000:bd:00.3 write phy:0x07 reg:0x04 val:0x05e1
>>>   kworker/u257:1-670   [015] ....    27.232869: mdio_access: mii-0000:bd:00.3 read  phy:0x07 reg:0x01 val:0x79ad
>>>   kworker/u257:1-670   [015] ....    27.232904: mdio_access: mii-0000:bd:00.3 read  phy:0x07 reg:0x09 val:0x0200
>>>   kworker/u257:1-670   [015] ....    27.232940: mdio_access: mii-0000:bd:00.3 read  phy:0x07 reg:0x00 val:0x1040
>>>   kworker/u257:1-670   [015] ....    27.232949: mdio_access: mii-0000:bd:00.3 write phy:0x07 reg:0x00 val:0x1240
>>>   kworker/u257:1-670   [015] ....    27.233003: mdio_access: mii-0000:bd:00.3 read  phy:0x07 reg:0x01 val:0x79ad
>>>   kworker/u257:1-670   [015] ....    27.233039: mdio_access: mii-0000:bd:00.3 read  phy:0x07 reg:0x0a val:0x3002
>>>   kworker/u257:1-670   [015] ....    27.233074: mdio_access: mii-0000:bd:00.3 read  phy:0x07 reg:0x09 val:0x0200
>>>   kworker/u257:1-670   [015] ....    27.233110: mdio_access: mii-0000:bd:00.3 read  phy:0x07 reg:0x05 val:0x0000
>>>   kworker/u257:1-670   [000] ....    28.280475: mdio_access: mii-0000:bd:00.3 read  phy:0x07 reg:0x01 val:0x7989
>>>   kworker/u257:1-670   [000] ....    29.304471: mdio_access: mii-0000:bd:00.3 read  phy:0x07 reg:0x01 val:0x7989
>>>
>>> According to the datasheet of rtl8211f, to get the real time
>>> link status, need to read MII_BMSR twice.
>>>
>>> This patch add a read_status hook for rtl8211f, and do a fake
>>> phy_read before genphy_read_status(), so that can get real link
>>> status in genphy_read_status().
>>>
>>> Signed-off-by: Yonglong Liu <liuyonglong@huawei.com>
>>> ---
>>>  drivers/net/phy/realtek.c | 13 +++++++++++++
>>>  1 file changed, 13 insertions(+)
>>>
>> Is this an accidental resubmit? Because we discussed this in
>> https://marc.info/?t=156413509900003&r=1&w=2 and a fix has
>> been applied already.
>>
>> Heiner
>>
>> .
>>
> 
> In https://marc.info/?t=156413509900003&r=1&w=2 , the invalid speed
> recurrence rate is almost 100%, and I had test the solution about
> 5 times and it works. But yesterday it happen again suddenly, and than
> I fount that the recurrence rate reduce to 10%. This time we get 0x79ad
> after autoneg started which is not 0x798d from last discussion.
> 
> 
> 
OK, I'll have a look.
However the approach is wrong. The double read is related to the latching
of link-down events. This is done by all PHY's and not specific to RT8211F.
Also it's not related to the problem. I assume any sufficient delay would
do instead of the read.

^ permalink raw reply

* Re: [PATCH v2 1/2] tcp: add new tcp_mtu_probe_floor sysctl
From: Eric Dumazet @ 2019-08-08  6:12 UTC (permalink / raw)
  To: Josh Hunt, netdev; +Cc: davem, edumazet, ncardwell
In-Reply-To: <1565221950-1376-1-git-send-email-johunt@akamai.com>



On 8/8/19 1:52 AM, Josh Hunt wrote:
> The current implementation of TCP MTU probing can considerably
> underestimate the MTU on lossy connections allowing the MSS to get down to
> 48. We have found that in almost all of these cases on our networks these
> paths can handle much larger MTUs meaning the connections are being
> artificially limited. Even though TCP MTU probing can raise the MSS back up
> we have seen this not to be the case causing connections to be "stuck" with
> an MSS of 48 when heavy loss is present.
> 
> Prior to pushing out this change we could not keep TCP MTU probing enabled
> b/c of the above reasons. Now with a reasonble floor set we've had it
> enabled for the past 6 months.

I am still sad to see you do not share what is a reasonable value and let
everybody guess. 

It seems to be a top-secret value.

> 
> The new sysctl will still default to TCP_MIN_SND_MSS (48), but gives
> administrators the ability to control the floor of MSS probing.
> 
> Signed-off-by: Josh Hunt <johunt@akamai.com>

Signed-off-by: Eric Dumazet <edumazet@google.com>

^ permalink raw reply

* Re: [PATCH v2 2/2] tcp: Update TCP_BASE_MSS comment
From: Eric Dumazet @ 2019-08-08  6:13 UTC (permalink / raw)
  To: Josh Hunt, netdev; +Cc: davem, edumazet, ncardwell
In-Reply-To: <1565221950-1376-2-git-send-email-johunt@akamai.com>



On 8/8/19 1:52 AM, Josh Hunt wrote:
> TCP_BASE_MSS is used as the default initial MSS value when MTU probing is
> enabled. Update the comment to reflect this.
> 
> Suggested-by: Neal Cardwell <ncardwell@google.com>
> Signed-off-by: Josh Hunt <johunt@akamai.com>
> ---

Signed-off-by: Eric Dumazet <edumazet@google.com>


^ permalink raw reply

* Re: [PATCH net] net: phy: rtl8211f: do a double read to get real time link status
From: Yonglong Liu @ 2019-08-08  6:21 UTC (permalink / raw)
  To: Heiner Kallweit, davem, andrew
  Cc: netdev, linux-kernel, linuxarm, salil.mehta, yisen.zhuang,
	shiju.jose
In-Reply-To: <080b68c7-abe6-d142-da4b-26e8a7d4dc19@gmail.com>



On 2019/8/8 14:11, Heiner Kallweit wrote:
> On 08.08.2019 03:15, Yonglong Liu wrote:
>>
>>
>> On 2019/8/8 0:47, Heiner Kallweit wrote:
>>> On 07.08.2019 15:16, Yonglong Liu wrote:
>>>> [   27.232781] hns3 0000:bd:00.3 eth7: net open
>>>> [   27.237303] 8021q: adding VLAN 0 to HW filter on device eth7
>>>> [   27.242972] IPv6: ADDRCONF(NETDEV_CHANGE): eth7: link becomes ready
>>>> [   27.244449] hns3 0000:bd:00.3: invalid speed (-1)
>>>> [   27.253904] hns3 0000:bd:00.3 eth7: failed to adjust link.
>>>> [   27.259379] RTL8211F Gigabit Ethernet mii-0000:bd:00.3:07: PHY state change UP -> RUNNING
>>>> [   27.924903] hns3 0000:bd:00.3 eth7: link up
>>>> [   28.280479] RTL8211F Gigabit Ethernet mii-0000:bd:00.3:07: PHY state change RUNNING -> NOLINK
>>>> [   29.208452] hns3 0000:bd:00.3 eth7: link down
>>>> [   32.376745] RTL8211F Gigabit Ethernet mii-0000:bd:00.3:07: PHY state change NOLINK -> RUNNING
>>>> [   33.208448] hns3 0000:bd:00.3 eth7: link up
>>>> [   35.253821] hns3 0000:bd:00.3 eth7: net stop
>>>> [   35.258270] hns3 0000:bd:00.3 eth7: link down
>>>>
>>>> When using rtl8211f in polling mode, may get a invalid speed,
>>>> because of reading a fake link up and autoneg complete status
>>>> immediately after starting autoneg:
>>>>
>>>>         ifconfig-1176  [007] ....    27.232763: mdio_access: mii-0000:bd:00.3 read  phy:0x07 reg:0x00 val:0x1040
>>>>   kworker/u257:1-670   [015] ....    27.232805: mdio_access: mii-0000:bd:00.3 read  phy:0x07 reg:0x04 val:0x01e1
>>>>   kworker/u257:1-670   [015] ....    27.232815: mdio_access: mii-0000:bd:00.3 write phy:0x07 reg:0x04 val:0x05e1
>>>>   kworker/u257:1-670   [015] ....    27.232869: mdio_access: mii-0000:bd:00.3 read  phy:0x07 reg:0x01 val:0x79ad
>>>>   kworker/u257:1-670   [015] ....    27.232904: mdio_access: mii-0000:bd:00.3 read  phy:0x07 reg:0x09 val:0x0200
>>>>   kworker/u257:1-670   [015] ....    27.232940: mdio_access: mii-0000:bd:00.3 read  phy:0x07 reg:0x00 val:0x1040
>>>>   kworker/u257:1-670   [015] ....    27.232949: mdio_access: mii-0000:bd:00.3 write phy:0x07 reg:0x00 val:0x1240
>>>>   kworker/u257:1-670   [015] ....    27.233003: mdio_access: mii-0000:bd:00.3 read  phy:0x07 reg:0x01 val:0x79ad
>>>>   kworker/u257:1-670   [015] ....    27.233039: mdio_access: mii-0000:bd:00.3 read  phy:0x07 reg:0x0a val:0x3002
>>>>   kworker/u257:1-670   [015] ....    27.233074: mdio_access: mii-0000:bd:00.3 read  phy:0x07 reg:0x09 val:0x0200
>>>>   kworker/u257:1-670   [015] ....    27.233110: mdio_access: mii-0000:bd:00.3 read  phy:0x07 reg:0x05 val:0x0000
>>>>   kworker/u257:1-670   [000] ....    28.280475: mdio_access: mii-0000:bd:00.3 read  phy:0x07 reg:0x01 val:0x7989
>>>>   kworker/u257:1-670   [000] ....    29.304471: mdio_access: mii-0000:bd:00.3 read  phy:0x07 reg:0x01 val:0x7989
>>>>
>>>> According to the datasheet of rtl8211f, to get the real time
>>>> link status, need to read MII_BMSR twice.
>>>>
>>>> This patch add a read_status hook for rtl8211f, and do a fake
>>>> phy_read before genphy_read_status(), so that can get real link
>>>> status in genphy_read_status().
>>>>
>>>> Signed-off-by: Yonglong Liu <liuyonglong@huawei.com>
>>>> ---
>>>>  drivers/net/phy/realtek.c | 13 +++++++++++++
>>>>  1 file changed, 13 insertions(+)
>>>>
>>> Is this an accidental resubmit? Because we discussed this in
>>> https://marc.info/?t=156413509900003&r=1&w=2 and a fix has
>>> been applied already.
>>>
>>> Heiner
>>>
>>> .
>>>
>>
>> In https://marc.info/?t=156413509900003&r=1&w=2 , the invalid speed
>> recurrence rate is almost 100%, and I had test the solution about
>> 5 times and it works. But yesterday it happen again suddenly, and than
>> I fount that the recurrence rate reduce to 10%. This time we get 0x79ad
>> after autoneg started which is not 0x798d from last discussion.
>>
>>
>>
> OK, I'll have a look.
> However the approach is wrong. The double read is related to the latching
> of link-down events. This is done by all PHY's and not specific to RT8211F.
> Also it's not related to the problem. I assume any sufficient delay would
> do instead of the read.
> 
> .
> 

So you will send a new patch to fix this problem? I am waiting for it,
and can do a full test this time.


^ permalink raw reply

* [PATCH v2] team: Add vlan tx offload to hw_enc_features
From: YueHaibing @ 2019-08-08  6:22 UTC (permalink / raw)
  To: j.vosburgh, vfalico, andy, davem, jiri, jay.vosburgh
  Cc: linux-kernel, netdev, YueHaibing
In-Reply-To: <20190807023808.51976-1-yuehaibing@huawei.com>

We should also enable team's vlan tx offload in hw_enc_features,
pass the vlan packets to the slave devices with vlan tci, let the
slave handle vlan tunneling offload implementation.

Fixes: 3268e5cb494d ("team: Advertise tunneling offload features")
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
---
v2: fix commit log typo
---
 drivers/net/team/team.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/team/team.c b/drivers/net/team/team.c
index abfa0da..e8089de 100644
--- a/drivers/net/team/team.c
+++ b/drivers/net/team/team.c
@@ -1004,6 +1004,8 @@ static void __team_compute_features(struct team *team)
 
 	team->dev->vlan_features = vlan_features;
 	team->dev->hw_enc_features = enc_features | NETIF_F_GSO_ENCAP_ALL |
+				     NETIF_F_HW_VLAN_CTAG_TX |
+				     NETIF_F_HW_VLAN_STAG_TX |
 				     NETIF_F_GSO_UDP_L4;
 	team->dev->hard_header_len = max_hard_header_len;
 
-- 
2.7.4



^ permalink raw reply related

* Re: [PATCH 11/17] qca: no need to check return value of debugfs_create functions
From: Michael Heimpold @ 2019-08-08  6:30 UTC (permalink / raw)
  To: Greg Kroah-Hartman; +Cc: netdev, David S. Miller, Stefan Wahren, Yangtao Li
In-Reply-To: <20190806161128.31232-12-gregkh@linuxfoundation.org>

Am Dienstag, 6. August 2019, 18:11:22 CEST schrieb Greg Kroah-Hartman:
> When calling debugfs functions, there is no need to ever check the
> return value.  The function can work or not, but the code logic should
> never do something different based on this.
> 
> Cc: "David S. Miller" <davem@davemloft.net>
> Cc: Stefan Wahren <stefan.wahren@i2se.com>
> Cc: Michael Heimpold <michael.heimpold@i2se.com>
> Cc: Yangtao Li <tiny.windzz@gmail.com>
> Cc: netdev@vger.kernel.org
> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> ---
>  drivers/net/ethernet/qualcomm/qca_debug.c | 13 +++----------
>  1 file changed, 3 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/net/ethernet/qualcomm/qca_debug.c
> b/drivers/net/ethernet/qualcomm/qca_debug.c index
> bcb890b18a94..702aa217a27a 100644
> --- a/drivers/net/ethernet/qualcomm/qca_debug.c
> +++ b/drivers/net/ethernet/qualcomm/qca_debug.c
> @@ -131,17 +131,10 @@ DEFINE_SHOW_ATTRIBUTE(qcaspi_info);
>  void
>  qcaspi_init_device_debugfs(struct qcaspi *qca)
>  {
> -	struct dentry *device_root;
> +	qca->device_root = debugfs_create_dir(dev_name(&qca->net_dev->dev),
> +					      NULL);
> 
> -	device_root = debugfs_create_dir(dev_name(&qca->net_dev->dev), NULL);
> -	qca->device_root = device_root;
> -
> -	if (IS_ERR(device_root) || !device_root) {
> -		pr_warn("failed to create debugfs directory for %s\n",
> -			dev_name(&qca->net_dev->dev));
> -		return;
> -	}
> -	debugfs_create_file("info", S_IFREG | 0444, device_root, qca,
> +	debugfs_create_file("info", S_IFREG | 0444, qca->device_root, qca,
>  			    &qcaspi_info_fops);
>  }

Acked-by: Michael Heimpold <michael.heimpold@i2se.com>





^ permalink raw reply

* Re: [PATCH net-next] net: openvswitch: Set OvS recirc_id from tc chain index
From: Paul Blakey @ 2019-08-08  6:39 UTC (permalink / raw)
  To: Marcelo Ricardo Leitner
  Cc: netdev@vger.kernel.org, David S. Miller, Justin Pettit,
	Pravin B Shelar, Simon Horman, Vlad Buslov, Jiri Pirko, Roi Dayan,
	Yossi Kuperman, Rony Efraim, Oz Shlomo
In-Reply-To: <20190807150026.GE21609@localhost.localdomain>


On 8/7/2019 6:00 PM, Marcelo Ricardo Leitner wrote:
> On Wed, Aug 07, 2019 at 03:08:42PM +0300, Paul Blakey wrote:
>> Offloaded OvS datapath rules are translated one to one to tc rules,
>> for example the following simplified OvS rule:
>>
>> recirc_id(0),in_port(dev1),eth_type(0x0800),ct_state(-trk) actions:ct(),recirc(2)
>>
>> Will be translated to the following tc rule:
>>
>> $ tc filter add dev dev1 ingress \
>> 	    prio 1 chain 0 proto ip \
>> 		flower tcp ct_state -trk \
>> 		action ct pipe \
>> 		action goto chain 2
>>
>> Received packets will first travel though tc, and if they aren't stolen
>> by it, like in the above rule, they will continue to OvS datapath.
>> Since we already did some actions (action ct in this case) which might
>> modify the packets, and updated action stats, we would like to continue
>> the proccessing with the correct recirc_id in OvS (here recirc_id(2))
>> where we left off.
>>
>> To support this, introduce a new skb extension for tc, which
>> will be used for translating tc chain to ovs recirc_id to
>> handle these miss cases. Last tc chain index will be set
>> by tc goto chain action and read by OvS datapath.
>>
>> Signed-off-by: Paul Blakey <paulb@mellanox.com>
>> Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
>> Acked-by: Jiri Pirko <jiri@mellanox.com>
> Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>

Thanks!


>> ---
>>   include/linux/skbuff.h    | 13 +++++++++++++
>>   include/net/sch_generic.h |  5 ++++-
>>   net/core/skbuff.c         |  6 ++++++
>>   net/openvswitch/flow.c    |  9 +++++++++
>>   net/sched/Kconfig         | 13 +++++++++++++
>>   net/sched/act_api.c       |  1 +
>>   net/sched/cls_api.c       | 12 ++++++++++++
>>   7 files changed, 58 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
>> index 3aef8d8..fb2a792 100644
>> --- a/include/linux/skbuff.h
>> +++ b/include/linux/skbuff.h
>> @@ -279,6 +279,16 @@ struct nf_bridge_info {
>>   };
>>   #endif
>>   
>> +#if IS_ENABLED(CONFIG_NET_TC_SKB_EXT)
>> +/* Chain in tc_skb_ext will be used to share the tc chain with
>> + * ovs recirc_id. It will be set to the current chain by tc
>> + * and read by ovs to recirc_id.
>> + */
>> +struct tc_skb_ext {
>> +	__u32 chain;
>> +};
>> +#endif
>> +
>>   struct sk_buff_head {
>>   	/* These two members must be first. */
>>   	struct sk_buff	*next;
>> @@ -4050,6 +4060,9 @@ enum skb_ext_id {
>>   #ifdef CONFIG_XFRM
>>   	SKB_EXT_SEC_PATH,
>>   #endif
>> +#if IS_ENABLED(CONFIG_NET_TC_SKB_EXT)
>> +	TC_SKB_EXT,
>> +#endif
>>   	SKB_EXT_NUM, /* must be last */
>>   };
>>   
>> diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
>> index 6b6b012..871feea 100644
>> --- a/include/net/sch_generic.h
>> +++ b/include/net/sch_generic.h
>> @@ -275,7 +275,10 @@ struct tcf_result {
>>   			unsigned long	class;
>>   			u32		classid;
>>   		};
>> -		const struct tcf_proto *goto_tp;
>> +		struct {
>> +			const struct tcf_proto *goto_tp;
>> +			u32 goto_index;
>> +		};
>>   
>>   		/* used in the skb_tc_reinsert function */
>>   		struct {
>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
>> index ea8e8d3..2b40b5a 100644
>> --- a/net/core/skbuff.c
>> +++ b/net/core/skbuff.c
>> @@ -4087,6 +4087,9 @@ int skb_gro_receive(struct sk_buff *p, struct sk_buff *skb)
>>   #ifdef CONFIG_XFRM
>>   	[SKB_EXT_SEC_PATH] = SKB_EXT_CHUNKSIZEOF(struct sec_path),
>>   #endif
>> +#if IS_ENABLED(CONFIG_NET_TC_SKB_EXT)
>> +	[TC_SKB_EXT] = SKB_EXT_CHUNKSIZEOF(struct tc_skb_ext),
>> +#endif
>>   };
>>   
>>   static __always_inline unsigned int skb_ext_total_length(void)
>> @@ -4098,6 +4101,9 @@ static __always_inline unsigned int skb_ext_total_length(void)
>>   #ifdef CONFIG_XFRM
>>   		skb_ext_type_len[SKB_EXT_SEC_PATH] +
>>   #endif
>> +#if IS_ENABLED(CONFIG_NET_TC_SKB_EXT)
>> +		skb_ext_type_len[TC_SKB_EXT] +
>> +#endif
>>   		0;
>>   }
>>   
>> diff --git a/net/openvswitch/flow.c b/net/openvswitch/flow.c
>> index bc89e16..0287ead 100644
>> --- a/net/openvswitch/flow.c
>> +++ b/net/openvswitch/flow.c
>> @@ -816,6 +816,9 @@ static int key_extract_mac_proto(struct sk_buff *skb)
>>   int ovs_flow_key_extract(const struct ip_tunnel_info *tun_info,
>>   			 struct sk_buff *skb, struct sw_flow_key *key)
>>   {
>> +#if IS_ENABLED(CONFIG_NET_TC_SKB_EXT)
>> +	struct tc_skb_ext *tc_ext;
>> +#endif
>>   	int res, err;
>>   
>>   	/* Extract metadata from packet. */
>> @@ -848,7 +851,13 @@ int ovs_flow_key_extract(const struct ip_tunnel_info *tun_info,
>>   	if (res < 0)
>>   		return res;
>>   	key->mac_proto = res;
>> +
>> +#if IS_ENABLED(CONFIG_NET_TC_SKB_EXT)
>> +	tc_ext = skb_ext_find(skb, TC_SKB_EXT);
>> +	key->recirc_id = tc_ext ? tc_ext->chain : 0;
>> +#else
>>   	key->recirc_id = 0;
>> +#endif
>>   
>>   	err = key_extract(skb, key);
>>   	if (!err)
>> diff --git a/net/sched/Kconfig b/net/sched/Kconfig
>> index afd2ba1..b3faafe 100644
>> --- a/net/sched/Kconfig
>> +++ b/net/sched/Kconfig
>> @@ -963,6 +963,19 @@ config NET_IFE_SKBTCINDEX
>>           tristate "Support to encoding decoding skb tcindex on IFE action"
>>           depends on NET_ACT_IFE
>>   
>> +config NET_TC_SKB_EXT
>> +	bool "TC recirculation support"
>> +	depends on NET_CLS_ACT
>> +	default y if NET_CLS_ACT
>> +	select SKB_EXTENSIONS
>> +
>> +	help
>> +	  Say Y here to allow tc chain misses to continue in OvS datapath in
>> +	  the correct recirc_id, and hardware chain misses to continue in
>> +	  the correct chain in tc software datapath.
>> +
>> +	  Say N here if you won't be using tc<->ovs offload or tc chains offload.
>> +
>>   endif # NET_SCHED
>>   
>>   config NET_SCH_FIFO
>> diff --git a/net/sched/act_api.c b/net/sched/act_api.c
>> index 3397122..c393604 100644
>> --- a/net/sched/act_api.c
>> +++ b/net/sched/act_api.c
>> @@ -27,6 +27,7 @@ static void tcf_action_goto_chain_exec(const struct tc_action *a,
>>   {
>>   	const struct tcf_chain *chain = rcu_dereference_bh(a->goto_chain);
>>   
>> +	res->goto_index = chain->index;
>>   	res->goto_tp = rcu_dereference_bh(chain->filter_chain);
>>   }
>>   
>> diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
>> index 3565d9a..b0b829a 100644
>> --- a/net/sched/cls_api.c
>> +++ b/net/sched/cls_api.c
>> @@ -1660,6 +1660,18 @@ int tcf_classify(struct sk_buff *skb, const struct tcf_proto *tp,
>>   			goto reset;
>>   		} else if (unlikely(TC_ACT_EXT_CMP(err, TC_ACT_GOTO_CHAIN))) {
>>   			first_tp = res->goto_tp;
>> +
>> +#if IS_ENABLED(CONFIG_NET_TC_SKB_EXT)
>> +			{
>> +				struct tc_skb_ext *ext;
>> +
>> +				ext = skb_ext_add(skb, TC_SKB_EXT);
>> +				if (WARN_ON_ONCE(!ext))
>> +					return TC_ACT_SHOT;
>> +
>> +				ext->chain = res->goto_index;
>> +			}
>> +#endif
>>   			goto reset;
>>   		}
>>   #endif
>> -- 
>> 1.8.3.1
>>

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox