Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
From: Andy Lutomirski @ 2019-08-05 21:25 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Andy Lutomirski, Song Liu, Kees Cook, Networking, bpf,
	Alexei Starovoitov, Daniel Borkmann, Kernel Team, Lorenz Bauer,
	Jann Horn, Greg KH, Linux API, LSM List
In-Reply-To: <20190805192122.laxcaz75k4vxdspn@ast-mbp>

On Mon, Aug 5, 2019 at 12:21 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Mon, Aug 05, 2019 at 10:23:10AM -0700, Andy Lutomirski wrote:
> >
> > I refreshed the branch again.  I had a giant hole in my previous idea
> > that we could deprivilege program loading: some BPF functions need
> > privilege.  Now I have a changelog comment to that effect and a patch
> > that sketches out a way to addressing this.
> >
> > I don't think I'm going to have time soon to actually get any of this
> > stuff mergeable, and it would be fantastic if you or someone else who
> > likes working of bpf were to take this code and run with it.  Feel
> > free to add my Signed-off-by, and I'd be happy to help review.
>
> Thanks a lot for working on patches and helping us with the design!
>
> Can you resend the patches to the mailing list?
> It's kinda hard to reply/review to patches that are somewhere in the web.

Will do.

> I'm still trying to understand the main idea.
> If I'm reading things correctly:

The series doesn't, strictly speaking, have an overall problem that it
solves.  It's a series of steps in the direction of making bpf() make
more sense without privilege and toward reducing the required
privilege.

> patch 1 "add access permissions to bpf fds"
>   just passes the flags ?

It tries to make the kernel respect the access modes for fds.  Without
this patch, there seem to be some holes: nothing looked at program fds
and, unless I missed something, you could take a readonly fd for a
program, pin the program, and reopen it RW.

> patch 2 "Don't require mknod() permission to pin an object"
>  makes sense in isolation.

It makes even more sense now :)

> patch 3 "Allow creating all program types without privilege"
>   is not right.

I think it can be made right, which is the point.

> patch 4 "Add a way to mark functions as requiring privilege"
>  is an interesting idea, but I don't think it helps that much.

Other than the issue that this patch partially fixes, can you see any
reason that loading a program should require privilege?  Obviously the
verifier is weakened a bit when called by privileged users, but a lot
of that is about excessive resource usage and various less-well-tested
features.  It seems to me that most of the value of bpf() should be
available to programs that should not need privilege to load.  Are
there things I'm missing?

>
> So the main thing we're trying to solve with augmented bpf syscall
> and/or /dev/bpf is to be able to use root-only features of bpf when
> trused process already dropped root permissions.
> These features include bpf2bpf calls, bounded loops, special maps (like LPM), etc.

Can you elaborate on all these:

I see nothing inherently wrong with bpf2bpf for unprivileged users as
long as they have appropriate access to the called program.  Patch 1
improves that.

Bounded loops: if they are adequately well verified, then the only
damage is that they can make bpf progs that run slowly, right?  It
seems like some kind of capability or sysctl for "allow using lots of
bpf resources" would do the trick.  This could even be a cgroup
setting -- bpf resources aren't all that different from any other
resource.

LPM: I don't see why this requires privilege at all.  It indeed checks
capable(CAP_SYS_ADMIN), but I don't see why.

>
> Attaching to a cgroup already has file based permission checks.
> The user needs to open cgroup directory to attach.
> acls on cgroup dir can already be used to prevent attaching to
> certain parts of cgroup hierarchy.

The current checks seem inadequate.

$ echo 'yay' </sys/fs/cgroup/systemd/system.slice/

The ability to obtain an fd to a cgroup does *not* imply any right to
modify that cgroup.  The ability to write to a cgroup directory
already means something else -- it's the ability to create cgroups
under the group in question.  I'm suggesting that a new API be added
that allows attaching a bpf program to a cgroup without capabilities
and that instead requires write access to a new file in the cgroup
directory.  (It could be a single file for all bpf types or one file
per type.  I prefer the latter -- it gives the admin finer-grained
control.)

> What we need is to drop privileges sooner in daemons like systemd.

This is doable right now: systemd could fork off a subprocess and
delegate its cgroup operations to it.  It would be maybe a couple
hundred lines of code.  As an added benefit, that subprocess could
verify that the bpf operations in question are reasonable.
Alternatively, if there was a CAP_BPF_ADMIN, systemd could retain that
capability and flip it on and off as needed.

> Container management daemon runs in the nested containers.
> These trusted daemons need to have access to full bpf, but they
> don't want to be root all the time.
> They cannot flip back and forth via seteuid to root every time they
> need to do bpf.
> Hence the idea is to have a file that this daemon can open,
> then drop privileges and still keep doing bpf things because FD is held.
> Outer container daemon can pass this /dev/bpf's FD to inner daemon, etc.
> This /dev/bpf would be accessible to root only.
> There is no desire to open it up to non-root.

This seems extremely dangerous right now.  A program that can bypass
*all* of the capable() checks in bpf() can do a whole lot.  Among
other things, it can read all of kernel memory.  It can very likely
gain full system root by appropriate installation of malicious
programs in a cgroup that contains fully privileged programs.  In this
regard, bpf() is like most of the Linux capabilities -- it seems
somewhat limited, but it really implies a lot of privilege.  There was
a little paper awhile back pointing out that, on a normal system, most
of the Linux capabilities were functionally equivalent.

>
> It seems there is concern that /dev/bpf is unnecessary special.
> How about we combine bpffs and /dev/bpf ideas?
> Like we can have a special file name in bpffs.
> The root would do 'touch /sys/fs/bpf/privileges' and it would behave
> just like /dev/bpf, but now it can be in any bpffs directory and acls
> to bpffs mount would work as-is.

This seems to have most of the same problems.  My main point is that
it conflates a whole lot of different permissions, and I really don't
think it's that much work to mostly disentangle the permissions in
question.  My little series (if completed) plus a patch to allow
unprivileged cgroup attach operations if you have an FMODE_WRITE fd to
an appropriate file should get most of the way there.

Also, be careful about your bpffs idea: bpffs is (sort of) namespaced,
and it would make sense to allow new bpf instances to be created
inside unprivileged user namespaces.  Such instances should not be
able to create magical privilege-granting files.  In that respect,
/dev/bpf is better.

>
> CAP_BPF is also good idea. I think for the enviroment where untrusted
> and unprivileged users want to run 'bpftrace' that would be perfect mechanism.
> getcap /bin/bpftrace would have cap_bpf, cap_kprobe and whatever else.
> Sort of like /bin/ping.
> But I don't see how cap_bpf helps to solve our trusted root daemon problem.
> imo open ("/sys/fs/bpf/privileges") and pass that FD into bpf syscall
> is the only viable mechanism.
>

As above, I think that forking before dropping privileges and asking
the child to do the bpf() operations is safer and more flexible.

> Note the verifier does very different amount of work for unpriv vs root.
> It does speculative execution analysis, pointer leak checks for unpriv.
> So we gotta pass special flag to the verifier to make it act like it's
> loading a program for root.
>

Indeed.  And programs in untrusted containers should not be able to do this.

^ permalink raw reply

* [WIP 2/4] bpf: Don't require mknod() permission to pin an object
From: Andy Lutomirski @ 2019-08-05 21:29 UTC (permalink / raw)
  To: LKML, Alexei Starovoitov
  Cc: Song Liu, Kees Cook, Networking, bpf, Daniel Borkmann,
	Alexei Starovoitov, Kernel Team, Lorenz Bauer, Jann Horn, Greg KH,
	Linux API, LSM List, Andy Lutomirski
In-Reply-To: <cover.1565040372.git.luto@kernel.org>

security_path_mknod() seems excessive for pinning an object --
pinning an object is effectively just creating a file.  It's also
redundant, as vfs_mkobj() calls security_inode_create() by itself.

This isn't strictly required -- mknod(path, S_IFREG, unused) works
to create regular files, but bpf is currently the only user in the
kernel outside of mknod() itself that uses it to create regular
(i.e. S_IFREG) files.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 kernel/bpf/inode.c | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/kernel/bpf/inode.c b/kernel/bpf/inode.c
index cb07736b33ae..14304609003a 100644
--- a/kernel/bpf/inode.c
+++ b/kernel/bpf/inode.c
@@ -394,10 +394,6 @@ static int bpf_obj_do_pin(const struct filename *pathname, void *raw,
 
 	mode = S_IFREG | ((S_IRUSR | S_IWUSR) & ~current_umask());
 
-	ret = security_path_mknod(&path, dentry, mode, 0);
-	if (ret)
-		goto out;
-
 	dir = d_inode(path.dentry);
 	if (dir->i_op != &bpf_dir_iops) {
 		ret = -EPERM;
-- 
2.21.0


^ permalink raw reply related

* [WIP 1/4] bpf: Respect persistent map and prog access modes
From: Andy Lutomirski @ 2019-08-05 21:29 UTC (permalink / raw)
  To: LKML, Alexei Starovoitov
  Cc: Song Liu, Kees Cook, Networking, bpf, Daniel Borkmann,
	Alexei Starovoitov, Kernel Team, Lorenz Bauer, Jann Horn, Greg KH,
	Linux API, LSM List, Andy Lutomirski
In-Reply-To: <cover.1565040372.git.luto@kernel.org>

In the interest of making bpf() more useful by unprivileged users,
this patch teaches bpf to respect access modes on map and prog
inodes.  The permissions are:

R on a map: read the map
W on a map: write the map

Referencing a map from a program should require RW.

R on a prog: Read or introspect the prog
W on a prog: Attach the prog to something

Test-running a prog is a form of introspection, so it requires RW.
Detaching a prog merely uses the fd for identification, so neither R
nor W is needed.

This is likely incomplete, and it has some comments that should be
removed.

This patch uses WRITE instead of EXEC as the permission needed to
run (by attaching or test-running) a program.  EXEC seems nicer, but
O_MAYEXEC isn't merged, which makes using X awkward.
---
 include/linux/bpf.h    | 15 +++++++------
 kernel/bpf/arraymap.c  |  8 ++++++-
 kernel/bpf/cgroup.c    |  6 ++++-
 kernel/bpf/inode.c     | 25 ++++++++++++++-------
 kernel/bpf/syscall.c   | 51 ++++++++++++++++++++++++++++++------------
 kernel/events/core.c   |  5 +++--
 net/core/dev.c         |  4 +++-
 net/core/filter.c      |  8 ++++---
 net/netfilter/xt_bpf.c |  5 +++--
 net/packet/af_packet.c |  2 +-
 10 files changed, 89 insertions(+), 40 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 18f4cc2c6acd..2d5e1a4dff6c 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -630,9 +630,9 @@ extern const struct bpf_prog_ops bpf_offload_prog_ops;
 extern const struct bpf_verifier_ops tc_cls_act_analyzer_ops;
 extern const struct bpf_verifier_ops xdp_analyzer_ops;
 
-struct bpf_prog *bpf_prog_get(u32 ufd);
+struct bpf_prog *bpf_prog_get(u32 ufd, int mask);
 struct bpf_prog *bpf_prog_get_type_dev(u32 ufd, enum bpf_prog_type type,
-				       bool attach_drv);
+				       bool attach_drv, int mask);
 struct bpf_prog * __must_check bpf_prog_add(struct bpf_prog *prog, int i);
 void bpf_prog_sub(struct bpf_prog *prog, int i);
 struct bpf_prog * __must_check bpf_prog_inc(struct bpf_prog *prog);
@@ -662,7 +662,7 @@ void bpf_map_init_from_attr(struct bpf_map *map, union bpf_attr *attr);
 extern int sysctl_unprivileged_bpf_disabled;
 
 int bpf_map_new_fd(struct bpf_map *map, int flags);
-int bpf_prog_new_fd(struct bpf_prog *prog);
+int bpf_prog_new_fd(struct bpf_prog *prog, int flags);
 
 int bpf_obj_pin_user(u32 ufd, const char __user *pathname);
 int bpf_obj_get_user(const char __user *pathname, int flags);
@@ -733,7 +733,7 @@ static inline int bpf_map_attr_numa_node(const union bpf_attr *attr)
 		attr->numa_node : NUMA_NO_NODE;
 }
 
-struct bpf_prog *bpf_prog_get_type_path(const char *name, enum bpf_prog_type type);
+struct bpf_prog *bpf_prog_get_type_path(const char *name, enum bpf_prog_type type, int mask);
 int array_map_alloc_check(union bpf_attr *attr);
 
 int bpf_prog_test_run_xdp(struct bpf_prog *prog, const union bpf_attr *kattr,
@@ -850,7 +850,7 @@ static inline int cpu_map_enqueue(struct bpf_cpu_map_entry *rcpu,
 }
 
 static inline struct bpf_prog *bpf_prog_get_type_path(const char *name,
-				enum bpf_prog_type type)
+				enum bpf_prog_type type, int mask)
 {
 	return ERR_PTR(-EOPNOTSUPP);
 }
@@ -878,9 +878,10 @@ static inline int bpf_prog_test_run_flow_dissector(struct bpf_prog *prog,
 #endif /* CONFIG_BPF_SYSCALL */
 
 static inline struct bpf_prog *bpf_prog_get_type(u32 ufd,
-						 enum bpf_prog_type type)
+						 enum bpf_prog_type type,
+						 int mask)
 {
-	return bpf_prog_get_type_dev(ufd, type, false);
+	return bpf_prog_get_type_dev(ufd, type, false, mask);
 }
 
 bool bpf_prog_get_ok(struct bpf_prog *, enum bpf_prog_type *, bool);
diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index 1c65ce0098a9..7e17a5d42110 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -522,6 +522,10 @@ int bpf_fd_array_map_lookup_elem(struct bpf_map *map, void *key, u32 *value)
 }
 
 /* only called from syscall */
+/*
+ * XXX: it's totally unclear to me what this ends up doing with the fd
+ * in general.
+ */
 int bpf_fd_array_map_update_elem(struct bpf_map *map, struct file *map_file,
 				 void *key, void *value, u64 map_flags)
 {
@@ -569,7 +573,9 @@ static void *prog_fd_array_get_ptr(struct bpf_map *map,
 				   struct file *map_file, int fd)
 {
 	struct bpf_array *array = container_of(map, struct bpf_array, map);
-	struct bpf_prog *prog = bpf_prog_get(fd);
+
+	/* XXX: what, exactly, does this end up doing to the prog in question? */
+	struct bpf_prog *prog = bpf_prog_get(fd, FMODE_READ | FMODE_WRITE);
 
 	if (IS_ERR(prog))
 		return prog;
diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
index 0a00eaca6fae..1450c3bdab82 100644
--- a/kernel/bpf/cgroup.c
+++ b/kernel/bpf/cgroup.c
@@ -562,7 +562,11 @@ int cgroup_bpf_prog_detach(const union bpf_attr *attr, enum bpf_prog_type ptype)
 	if (IS_ERR(cgrp))
 		return PTR_ERR(cgrp);
 
-	prog = bpf_prog_get_type(attr->attach_bpf_fd, ptype);
+	/*
+	 * No particular access required -- this only uses the fd to identify
+	 * a program, not to do anything with the program.
+	 */
+	prog = bpf_prog_get_type(attr->attach_bpf_fd, ptype, 0);
 	if (IS_ERR(prog))
 		prog = NULL;
 
diff --git a/kernel/bpf/inode.c b/kernel/bpf/inode.c
index cc0d0cf114e3..cb07736b33ae 100644
--- a/kernel/bpf/inode.c
+++ b/kernel/bpf/inode.c
@@ -58,7 +58,7 @@ static void bpf_any_put(void *raw, enum bpf_type type)
 	}
 }
 
-static void *bpf_fd_probe_obj(u32 ufd, enum bpf_type *type)
+static void *bpf_fd_probe_obj(u32 ufd, enum bpf_type *type, int mask)
 {
 	void *raw;
 
@@ -66,7 +66,7 @@ static void *bpf_fd_probe_obj(u32 ufd, enum bpf_type *type)
 	raw = bpf_map_get_with_uref(ufd);
 	if (IS_ERR(raw)) {
 		*type = BPF_TYPE_PROG;
-		raw = bpf_prog_get(ufd);
+		raw = bpf_prog_get(ufd, mask);
 	}
 
 	return raw;
@@ -430,7 +430,12 @@ int bpf_obj_pin_user(u32 ufd, const char __user *pathname)
 	if (IS_ERR(pname))
 		return PTR_ERR(pname);
 
-	raw = bpf_fd_probe_obj(ufd, &type);
+	/*
+	 * Pinning an object effectively grants the caller all access, because
+	 * the caller ends up owning the inode.  So require all access.
+	 * XXX: If we use FMODE_EXEC, we should require FMODE_EXEC too.
+	 */
+	raw = bpf_fd_probe_obj(ufd, &type, FMODE_READ | FMODE_WRITE);
 	if (IS_ERR(raw)) {
 		ret = PTR_ERR(raw);
 		goto out;
@@ -456,6 +461,10 @@ static void *bpf_obj_do_get(const struct filename *pathname,
 	if (ret)
 		return ERR_PTR(ret);
 
+	/*
+	 * XXX: O_MAYEXEC doesn't exist, which is problematic here if we
+	 * want to use FMODE_EXEC.
+	 */
 	inode = d_backing_inode(path.dentry);
 	ret = inode_permission(inode, ACC_MODE(flags));
 	if (ret)
@@ -499,7 +508,7 @@ int bpf_obj_get_user(const char __user *pathname, int flags)
 	}
 
 	if (type == BPF_TYPE_PROG)
-		ret = bpf_prog_new_fd(raw);
+		ret = bpf_prog_new_fd(raw, f_flags);
 	else if (type == BPF_TYPE_MAP)
 		ret = bpf_map_new_fd(raw, f_flags);
 	else
@@ -512,10 +521,10 @@ int bpf_obj_get_user(const char __user *pathname, int flags)
 	return ret;
 }
 
-static struct bpf_prog *__get_prog_inode(struct inode *inode, enum bpf_prog_type type)
+static struct bpf_prog *__get_prog_inode(struct inode *inode, enum bpf_prog_type type, int mask)
 {
 	struct bpf_prog *prog;
-	int ret = inode_permission(inode, MAY_READ);
+	int ret = inode_permission(inode, mask);
 	if (ret)
 		return ERR_PTR(ret);
 
@@ -536,14 +545,14 @@ static struct bpf_prog *__get_prog_inode(struct inode *inode, enum bpf_prog_type
 	return bpf_prog_inc(prog);
 }
 
-struct bpf_prog *bpf_prog_get_type_path(const char *name, enum bpf_prog_type type)
+struct bpf_prog *bpf_prog_get_type_path(const char *name, enum bpf_prog_type type, int mask)
 {
 	struct bpf_prog *prog;
 	struct path path;
 	int ret = kern_path(name, LOOKUP_FOLLOW, &path);
 	if (ret)
 		return ERR_PTR(ret);
-	prog = __get_prog_inode(d_backing_inode(path.dentry), type);
+	prog = __get_prog_inode(d_backing_inode(path.dentry), type, mask);
 	if (!IS_ERR(prog))
 		touch_atime(&path);
 	path_put(&path);
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 5d141f16f6fa..23f8f89d2a86 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -447,6 +447,7 @@ int bpf_map_new_fd(struct bpf_map *map, int flags)
 
 int bpf_get_file_flag(int flags)
 {
+	/* XXX: What about exec? */
 	if ((flags & BPF_F_RDONLY) && (flags & BPF_F_WRONLY))
 		return -EINVAL;
 	if (flags & BPF_F_RDONLY)
@@ -556,6 +557,10 @@ static int map_create(union bpf_attr *attr)
 	if (err)
 		return -EINVAL;
 
+	/*
+	 * XXX: I'm a bit confused.  Why would you ever create a map and
+	 * grant *yourself* less than full permission?
+	 */
 	f_flags = bpf_get_file_flag(attr->map_flags);
 	if (f_flags < 0)
 		return f_flags;
@@ -1411,7 +1416,7 @@ const struct file_operations bpf_prog_fops = {
 	.write		= bpf_dummy_write,
 };
 
-int bpf_prog_new_fd(struct bpf_prog *prog)
+int bpf_prog_new_fd(struct bpf_prog *prog, int flags)
 {
 	int ret;
 
@@ -1420,10 +1425,10 @@ int bpf_prog_new_fd(struct bpf_prog *prog)
 		return ret;
 
 	return anon_inode_getfd("bpf-prog", &bpf_prog_fops, prog,
-				O_RDWR | O_CLOEXEC);
+				flags | O_CLOEXEC);
 }
 
-static struct bpf_prog *____bpf_prog_get(struct fd f)
+static struct bpf_prog *____bpf_prog_get(struct fd f, int mask)
 {
 	if (!f.file)
 		return ERR_PTR(-EBADF);
@@ -1431,6 +1436,10 @@ static struct bpf_prog *____bpf_prog_get(struct fd f)
 		fdput(f);
 		return ERR_PTR(-EINVAL);
 	}
+	if ((f.file->f_mode & mask) != mask) {
+		fdput(f);
+		return ERR_PTR(-EACCES);
+	}
 
 	return f.file->private_data;
 }
@@ -1497,12 +1506,12 @@ bool bpf_prog_get_ok(struct bpf_prog *prog,
 }
 
 static struct bpf_prog *__bpf_prog_get(u32 ufd, enum bpf_prog_type *attach_type,
-				       bool attach_drv)
+				       bool attach_drv, int mask)
 {
 	struct fd f = fdget(ufd);
 	struct bpf_prog *prog;
 
-	prog = ____bpf_prog_get(f);
+	prog = ____bpf_prog_get(f, mask);
 	if (IS_ERR(prog))
 		return prog;
 	if (!bpf_prog_get_ok(prog, attach_type, attach_drv)) {
@@ -1516,15 +1525,15 @@ static struct bpf_prog *__bpf_prog_get(u32 ufd, enum bpf_prog_type *attach_type,
 	return prog;
 }
 
-struct bpf_prog *bpf_prog_get(u32 ufd)
+struct bpf_prog *bpf_prog_get(u32 ufd, int mask)
 {
-	return __bpf_prog_get(ufd, NULL, false);
+	return __bpf_prog_get(ufd, NULL, false, mask);
 }
 
 struct bpf_prog *bpf_prog_get_type_dev(u32 ufd, enum bpf_prog_type type,
-				       bool attach_drv)
+				       bool attach_drv, int mask)
 {
-	return __bpf_prog_get(ufd, &type, attach_drv);
+	return __bpf_prog_get(ufd, &type, attach_drv, mask);
 }
 EXPORT_SYMBOL_GPL(bpf_prog_get_type_dev);
 
@@ -1707,7 +1716,7 @@ static int bpf_prog_load(union bpf_attr *attr, union bpf_attr __user *uattr)
 	if (err)
 		goto free_used_maps;
 
-	err = bpf_prog_new_fd(prog);
+	err = bpf_prog_new_fd(prog, O_RDWR /* | O_MAYEXEC */);
 	if (err < 0) {
 		/* failed to allocate fd.
 		 * bpf_prog_put() is needed because the above
@@ -1808,7 +1817,7 @@ static int bpf_raw_tracepoint_open(const union bpf_attr *attr)
 	}
 	raw_tp->btp = btp;
 
-	prog = bpf_prog_get(attr->raw_tracepoint.prog_fd);
+	prog = bpf_prog_get(attr->raw_tracepoint.prog_fd, MAY_EXEC);
 	if (IS_ERR(prog)) {
 		err = PTR_ERR(prog);
 		goto out_free_tp;
@@ -1929,7 +1938,7 @@ static int bpf_prog_attach(const union bpf_attr *attr)
 		return -EINVAL;
 	}
 
-	prog = bpf_prog_get_type(attr->attach_bpf_fd, ptype);
+	prog = bpf_prog_get_type(attr->attach_bpf_fd, ptype, MAY_EXEC);
 	if (IS_ERR(prog))
 		return PTR_ERR(prog);
 
@@ -2083,7 +2092,11 @@ static int bpf_prog_test_run(const union bpf_attr *attr,
 	    (!attr->test.ctx_size_out && attr->test.ctx_out))
 		return -EINVAL;
 
-	prog = bpf_prog_get(attr->test.prog_fd);
+	/*
+	 * A test run is is a form of query, so require RW.  Using W as a proxy for
+	 * X, since X is awkward due to a lack of O_MAYEXEC.
+	 */
+	prog = bpf_prog_get(attr->test.prog_fd, MAY_READ | MAY_WRITE);
 	if (IS_ERR(prog))
 		return PTR_ERR(prog);
 
@@ -2147,7 +2160,11 @@ static int bpf_prog_get_fd_by_id(const union bpf_attr *attr)
 	if (IS_ERR(prog))
 		return PTR_ERR(prog);
 
-	fd = bpf_prog_new_fd(prog);
+	/*
+	 * We have all permissions.  This is okay, since we also require
+	 * CAP_SYS_ADMIN to do this at all.
+	 */
+	fd = bpf_prog_new_fd(prog, O_RDWR /* | O_MAYEXEC */);
 	if (fd < 0)
 		bpf_prog_put(prog);
 
@@ -2638,6 +2655,11 @@ static int bpf_obj_get_info_by_fd(const union bpf_attr *attr,
 	if (!f.file)
 		return -EBADFD;
 
+	if (!(f.file->f_mode & FMODE_READ)) {
+		err = -EACCES;
+		goto out;
+	}
+
 	if (f.file->f_op == &bpf_prog_fops)
 		err = bpf_prog_get_info_by_fd(f.file->private_data, attr,
 					      uattr);
@@ -2649,6 +2671,7 @@ static int bpf_obj_get_info_by_fd(const union bpf_attr *attr,
 	else
 		err = -EINVAL;
 
+out:
 	fdput(f);
 	return err;
 }
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 026a14541a38..f2e3973b28f2 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -8876,7 +8876,8 @@ static int perf_event_set_bpf_handler(struct perf_event *event, u32 prog_fd)
 	if (event->prog)
 		return -EEXIST;
 
-	prog = bpf_prog_get_type(prog_fd, BPF_PROG_TYPE_PERF_EVENT);
+	/* Should maybe be FMODE_EXEC? */
+	prog = bpf_prog_get_type(prog_fd, BPF_PROG_TYPE_PERF_EVENT, FMODE_WRITE);
 	if (IS_ERR(prog))
 		return PTR_ERR(prog);
 
@@ -8942,7 +8943,7 @@ static int perf_event_set_bpf_prog(struct perf_event *event, u32 prog_fd)
 		/* bpf programs can only be attached to u/kprobe or tracepoint */
 		return -EINVAL;
 
-	prog = bpf_prog_get(prog_fd);
+	prog = bpf_prog_get(prog_fd, FMODE_WRITE);
 	if (IS_ERR(prog))
 		return PTR_ERR(prog);
 
diff --git a/net/core/dev.c b/net/core/dev.c
index fc676b2610e3..3fcaeae693bb 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -8093,8 +8093,10 @@ int dev_change_xdp_fd(struct net_device *dev, struct netlink_ext_ack *extack,
 			return -EBUSY;
 		}
 
+		/* XXX: FMODE_EXEC? */
 		prog = bpf_prog_get_type_dev(fd, BPF_PROG_TYPE_XDP,
-					     bpf_op == ops->ndo_bpf);
+					     bpf_op == ops->ndo_bpf,
+					     FMODE_WRITE);
 		if (IS_ERR(prog))
 			return PTR_ERR(prog);
 
diff --git a/net/core/filter.c b/net/core/filter.c
index 4e2a79b2fd77..9282462678fd 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -1544,7 +1544,8 @@ static struct bpf_prog *__get_bpf(u32 ufd, struct sock *sk)
 	if (sock_flag(sk, SOCK_FILTER_LOCKED))
 		return ERR_PTR(-EPERM);
 
-	return bpf_prog_get_type(ufd, BPF_PROG_TYPE_SOCKET_FILTER);
+	/* FMODE_EXEC? */
+	return bpf_prog_get_type(ufd, BPF_PROG_TYPE_SOCKET_FILTER, FMODE_WRITE);
 }
 
 int sk_attach_bpf(u32 ufd, struct sock *sk)
@@ -1572,9 +1573,10 @@ int sk_reuseport_attach_bpf(u32 ufd, struct sock *sk)
 	if (sock_flag(sk, SOCK_FILTER_LOCKED))
 		return -EPERM;
 
-	prog = bpf_prog_get_type(ufd, BPF_PROG_TYPE_SOCKET_FILTER);
+	prog = bpf_prog_get_type(ufd, BPF_PROG_TYPE_SOCKET_FILTER, FMODE_WRITE);
 	if (IS_ERR(prog) && PTR_ERR(prog) == -EINVAL)
-		prog = bpf_prog_get_type(ufd, BPF_PROG_TYPE_SK_REUSEPORT);
+		prog = bpf_prog_get_type(ufd, BPF_PROG_TYPE_SK_REUSEPORT,
+					 FMODE_WRITE);
 	if (IS_ERR(prog))
 		return PTR_ERR(prog);
 
diff --git a/net/netfilter/xt_bpf.c b/net/netfilter/xt_bpf.c
index 13cf3f9b5938..34e5c08ee1f3 100644
--- a/net/netfilter/xt_bpf.c
+++ b/net/netfilter/xt_bpf.c
@@ -44,7 +44,7 @@ static int __bpf_mt_check_fd(int fd, struct bpf_prog **ret)
 {
 	struct bpf_prog *prog;
 
-	prog = bpf_prog_get_type(fd, BPF_PROG_TYPE_SOCKET_FILTER);
+	prog = bpf_prog_get_type(fd, BPF_PROG_TYPE_SOCKET_FILTER, MAY_EXEC);
 	if (IS_ERR(prog))
 		return PTR_ERR(prog);
 
@@ -57,7 +57,8 @@ static int __bpf_mt_check_path(const char *path, struct bpf_prog **ret)
 	if (strnlen(path, XT_BPF_PATH_MAX) == XT_BPF_PATH_MAX)
 		return -EINVAL;
 
-	*ret = bpf_prog_get_type_path(path, BPF_PROG_TYPE_SOCKET_FILTER);
+	*ret = bpf_prog_get_type_path(path, BPF_PROG_TYPE_SOCKET_FILTER,
+				      MAY_EXEC);
 	return PTR_ERR_OR_ZERO(*ret);
 }
 
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 8d54f3047768..5b8c5e5d94bf 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -1563,7 +1563,7 @@ static int fanout_set_data_ebpf(struct packet_sock *po, char __user *data,
 	if (copy_from_user(&fd, data, len))
 		return -EFAULT;
 
-	new = bpf_prog_get_type(fd, BPF_PROG_TYPE_SOCKET_FILTER);
+	new = bpf_prog_get_type(fd, BPF_PROG_TYPE_SOCKET_FILTER, FMODE_WRITE);
 	if (IS_ERR(new))
 		return PTR_ERR(new);
 
-- 
2.21.0


^ permalink raw reply related

* [WIP 4/4] bpf: Allow creating all program types without privilege
From: Andy Lutomirski @ 2019-08-05 21:29 UTC (permalink / raw)
  To: LKML, Alexei Starovoitov
  Cc: Song Liu, Kees Cook, Networking, bpf, Daniel Borkmann,
	Alexei Starovoitov, Kernel Team, Lorenz Bauer, Jann Horn, Greg KH,
	Linux API, LSM List, Andy Lutomirski
In-Reply-To: <cover.1565040372.git.luto@kernel.org>

This doesn't let you *run* the programs except in test mode, so it should
be safe.  Famous last words.

This assumes that the check-privilege-to-call-privileged-functions
patch actually catches all the cases and that there's nothing else
that should need privilege lurking in the type-specific verifiers.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 kernel/bpf/syscall.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 23f8f89d2a86..730afa2be786 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1649,8 +1649,7 @@ static int bpf_prog_load(union bpf_attr *attr, union bpf_attr __user *uattr)
 	    attr->insn_cnt > (capable(CAP_SYS_ADMIN) ? BPF_COMPLEXITY_LIMIT_INSNS : BPF_MAXINSNS))
 		return -E2BIG;
 	if (type != BPF_PROG_TYPE_SOCKET_FILTER &&
-	    type != BPF_PROG_TYPE_CGROUP_SKB &&
-	    !capable(CAP_SYS_ADMIN))
+	    type != BPF_PROG_TYPE_CGROUP_SKB)
 		return -EPERM;
 
 	bpf_prog_load_fixup_attach_type(attr);
-- 
2.21.0


^ permalink raw reply related

* [WIP 3/4] bpf: Add a way to mark functions as requiring privilege
From: Andy Lutomirski @ 2019-08-05 21:29 UTC (permalink / raw)
  To: LKML, Alexei Starovoitov
  Cc: Song Liu, Kees Cook, Networking, bpf, Daniel Borkmann,
	Alexei Starovoitov, Kernel Team, Lorenz Bauer, Jann Horn, Greg KH,
	Linux API, LSM List, Andy Lutomirski
In-Reply-To: <cover.1565040372.git.luto@kernel.org>

This is horribly incomplete:

 - I only marked one function as requiring privilege, and there are
   surely more.

 - Checking is_priv is probably not the right thing to do.  This should
   probably do something more clever.  At the very lease, it needs to
   integrate with the upcoming lockdown LSM infrastructure.

 - The seen_privileged_funcs mechanism is probably not a good solution.
   Instead we should check something while we still have enough context
   to give a good error message.  But we *don't* want to check for
   capabilities up front before even seeing a function call, since we
   don't want to inadvertently generate audit events for privileges that
   are never used.

So it's the idea that counts :)

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 include/linux/bpf.h          | 15 +++++++++++++++
 include/linux/bpf_verifier.h |  1 +
 kernel/bpf/verifier.c        |  8 ++++++++
 kernel/trace/bpf_trace.c     |  1 +
 4 files changed, 25 insertions(+)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 2d5e1a4dff6c..de31b9888b6c 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -229,6 +229,7 @@ struct bpf_func_proto {
 	u64 (*func)(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
 	bool gpl_only;
 	bool pkt_access;
+	u16 privilege;
 	enum bpf_return_type ret_type;
 	enum bpf_arg_type arg1_type;
 	enum bpf_arg_type arg2_type;
@@ -237,6 +238,20 @@ struct bpf_func_proto {
 	enum bpf_arg_type arg5_type;
 };
 
+/*
+ * Some functions should require privilege to call at all, even in a test
+ * run.  These flags indicate why privilege is required.  The core BPF
+ * code will verify that the creator of such a program has the requisite
+ * privilege.
+ *
+ * NB: This means that anyone who creates a privileged program (due to
+ * such a call or due to a privilege-requiring pointer-to-integer conversion)
+ * is responsible for restricting access to the program in an appropriate
+ * manner.
+ */
+#define BPF_FUNC_PRIV_READ_KERNEL_MEMORY BIT(0)
+#define BPT_FUNC_PRIV_WRITE_GLOBAL_LOGS BIT(1)
+
 /* bpf_context is intentionally undefined structure. Pointer to bpf_context is
  * the first argument to eBPF programs.
  * For socket filters: 'struct bpf_context *' == 'struct sk_buff *'
diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index 5fe99f322b1c..9877f5753cf4 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -363,6 +363,7 @@ struct bpf_verifier_env {
 	u32 id_gen;			/* used to generate unique reg IDs */
 	bool allow_ptr_leaks;
 	bool seen_direct_write;
+	u16 seen_privileged_funcs;
 	struct bpf_insn_aux_data *insn_aux_data; /* array of per-insn state */
 	const struct bpf_line_info *prev_linfo;
 	struct bpf_verifier_log log;
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 5900cbb966b1..5e048688fd8d 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -4129,6 +4129,9 @@ static int check_helper_call(struct bpf_verifier_env *env, int func_id, int insn
 
 	if (changes_data)
 		clear_all_pkt_pointers(env);
+
+	env->seen_privileged_funcs |= fn->privilege;
+
 	return 0;
 }
 
@@ -9371,6 +9374,11 @@ int bpf_check(struct bpf_prog **prog, union bpf_attr *attr,
 	if (ret == 0)
 		adjust_btf_func(env);
 
+	if (env->seen_privileged_funcs && !is_priv) {
+		ret = -EPERM;
+		goto err_release_maps;
+	}
+
 err_release_maps:
 	if (!env->prog->aux->used_maps)
 		/* if we didn't copy map pointers into bpf_prog_info, release
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index ca1255d14576..d9454588d9e8 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -152,6 +152,7 @@ BPF_CALL_3(bpf_probe_read, void *, dst, u32, size, const void *, unsafe_ptr)
 static const struct bpf_func_proto bpf_probe_read_proto = {
 	.func		= bpf_probe_read,
 	.gpl_only	= true,
+	.privilege	= BPF_FUNC_PRIV_READ_KERNEL_MEMORY,
 	.ret_type	= RET_INTEGER,
 	.arg1_type	= ARG_PTR_TO_UNINIT_MEM,
 	.arg2_type	= ARG_CONST_SIZE_OR_ZERO,
-- 
2.21.0


^ permalink raw reply related

* [WIP 0/4] bpf: A bit of progress toward unprivileged use
From: Andy Lutomirski @ 2019-08-05 21:29 UTC (permalink / raw)
  To: LKML, Alexei Starovoitov
  Cc: Song Liu, Kees Cook, Networking, bpf, Daniel Borkmann,
	Alexei Starovoitov, Kernel Team, Lorenz Bauer, Jann Horn, Greg KH,
	Linux API, LSM List, Andy Lutomirski

Other than the mknod() patch, this is not ready for prime time.  These
patches try to make progress toward making bpf() more useful without
privilege

Andy Lutomirski (4):
  bpf: Respect persistent map and prog access modes
  bpf: Don't require mknod() permission to pin an object
  bpf: Add a way to mark functions as requiring privilege
  bpf: Allow creating all program types without privilege

 include/linux/bpf.h          | 30 +++++++++++++++-----
 include/linux/bpf_verifier.h |  1 +
 kernel/bpf/arraymap.c        |  8 +++++-
 kernel/bpf/cgroup.c          |  6 +++-
 kernel/bpf/inode.c           | 29 +++++++++++--------
 kernel/bpf/syscall.c         | 54 +++++++++++++++++++++++++-----------
 kernel/bpf/verifier.c        |  8 ++++++
 kernel/events/core.c         |  5 ++--
 kernel/trace/bpf_trace.c     |  1 +
 net/core/dev.c               |  4 ++-
 net/core/filter.c            |  8 ++++--
 net/netfilter/xt_bpf.c       |  5 ++--
 net/packet/af_packet.c       |  2 +-
 13 files changed, 115 insertions(+), 46 deletions(-)

-- 
2.21.0


^ permalink raw reply

* Re: [PATCH bpf-next 1/2] selftests/bpf: add loop test 4
From: Andrii Nakryiko @ 2019-08-05 21:37 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Yonghong Song, Alexei Starovoitov, David S. Miller,
	Daniel Borkmann, Networking, bpf, Kernel Team
In-Reply-To: <f3ccc18f-7c25-a4e8-3d3d-c9f0bdf453ea@fb.com>

On Mon, Aug 5, 2019 at 1:53 PM Alexei Starovoitov <ast@fb.com> wrote:
>
> On 8/5/19 1:04 PM, Yonghong Song wrote:
> >
> >
> > On 8/5/19 12:45 PM, Andrii Nakryiko wrote:
> >> On Sat, Aug 3, 2019 at 8:19 PM Alexei Starovoitov <ast@kernel.org> wrote:
> >>>
> >>> Add a test that returns a 'random' number between [0, 2^20)
> >>> If state pruning is not working correctly for loop body the number of
> >>> processed insns will be 2^20 * num_of_insns_in_loop_body and the program
> >>> will be rejected.
> >>>
> >>> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> >>> ---
> >>>    .../bpf/prog_tests/bpf_verif_scale.c          |  1 +
> >>>    tools/testing/selftests/bpf/progs/loop4.c     | 23 +++++++++++++++++++
> >>>    2 files changed, 24 insertions(+)
> >>>    create mode 100644 tools/testing/selftests/bpf/progs/loop4.c
> >>>
> >>> diff --git a/tools/testing/selftests/bpf/prog_tests/bpf_verif_scale.c b/tools/testing/selftests/bpf/prog_tests/bpf_verif_scale.c
> >>> index b4be96162ff4..757e39540eda 100644
> >>> --- a/tools/testing/selftests/bpf/prog_tests/bpf_verif_scale.c
> >>> +++ b/tools/testing/selftests/bpf/prog_tests/bpf_verif_scale.c
> >>> @@ -71,6 +71,7 @@ void test_bpf_verif_scale(void)
> >>>
> >>>                   { "loop1.o", BPF_PROG_TYPE_RAW_TRACEPOINT },
> >>>                   { "loop2.o", BPF_PROG_TYPE_RAW_TRACEPOINT },
> >>> +               { "loop4.o", BPF_PROG_TYPE_RAW_TRACEPOINT },
> >>>
> >>>                   /* partial unroll. 19k insn in a loop.
> >>>                    * Total program size 20.8k insn.
> >>> diff --git a/tools/testing/selftests/bpf/progs/loop4.c b/tools/testing/selftests/bpf/progs/loop4.c
> >>> new file mode 100644
> >>> index 000000000000..3e7ee14fddbd
> >>> --- /dev/null
> >>> +++ b/tools/testing/selftests/bpf/progs/loop4.c
> >>> @@ -0,0 +1,23 @@
> >>> +// SPDX-License-Identifier: GPL-2.0
> >>> +// Copyright (c) 2019 Facebook
> >>> +#include <linux/sched.h>
> >>> +#include <linux/ptrace.h>
> >>> +#include <stdint.h>
> >>> +#include <stddef.h>
> >>> +#include <stdbool.h>
> >>> +#include <linux/bpf.h>
> >>> +#include "bpf_helpers.h"
> >>> +
> >>> +char _license[] SEC("license") = "GPL";
> >>> +
> >>> +SEC("socket")
> >>> +int combinations(volatile struct __sk_buff* skb)
> >>> +{
> >>> +       int ret = 0, i;
> >>> +
> >>> +#pragma nounroll
> >>> +       for (i = 0; i < 20; i++)
> >>> +               if (skb->len)
> >>> +                       ret |= 1 << i;
> >>
> >> So I think the idea is that because verifier shouldn't know whether
> >> skb->len is zero or not, then you have two outcomes on every iteration
> >> leading to 2^20 states, right?
> >>
> >> But I'm afraid that verifier can eventually be smart enough (if it's
> >> not already, btw), to figure out that ret can be either 0 or ((1 <<
> >> 21) - 1), actually. If skb->len is put into separate register, then
> >> that register's bounds will be established on first loop iteration as
> >> either == 0 on one branch or (0, inf) on another branch, after which
> >> all subsequent iterations will not branch at all (one or the other
> >> branch will be always taken).
> >>
> >> It's also possible that LLVM/Clang is smart enough already to figure
> >> this out on its own and optimize loop into.
> >>
> >>
> >> if (skb->len) {
> >>       for (i = 0; i < 20; i++)
> >>           ret |= 1 << i;
> >> }
> >
> > We have
> >      volatile struct __sk_buff* skb
> >
> > So from the source code, skb->len could be different for each
> > iteration. The compiler cannot do the above optimization.
>
> yep.
> Without volatile llvm optimizes it even more than Andrii predicted :)

My bad, completely missed volatile.

>
> >>
> >>
> >> So two complains:
> >>
> >> 1. Let's obfuscate this a bit more, e.g., with testing (skb->len &
> >> (1<<i)) instead, so that result really depends on actual length of the
> >> packet.
> >> 2. Is it possible to somehow turn off this precision tracking (e.g.,
> >> running not under root, maybe?) and see that this same program fails
> >> in that case? That way we'll know test actually validates what we
> >> think it validates.
>
> that's on my todo list already.
> To do proper unit tests for all this stuff there should be a way
> to turn off not only precision, but heuristics too.
> All magic numbers in is_state_visited() need to be switchable.
> I'm still thinking on the way to expose it to tests infra.

Yep, that would be great.

I have nothing beyond what Yonghong suggested.

Acked-by: Andrii Nakryiko <andriin@fb.com>

^ permalink raw reply

* linux-next: Signed-off-by missing for commit in the net tree
From: Stephen Rothwell @ 2019-08-05 21:38 UTC (permalink / raw)
  To: David Miller, Networking
  Cc: Linux Next Mailing List, Linux Kernel Mailing List

[-- Attachment #1: Type: text/plain, Size: 168 bytes --]

Hi all,

Commit

  c3953a3c2d31 ("NFC: nfcmrvl: fix gpio-handling regression")

is missing a Signed-off-by from its committer.

-- 
Cheers,
Stephen Rothwell

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply

* RE: [Intel-wired-lan] [PATCH 1/2] ixgbe: Explicitly initialize reference count to 0
From: Bowers, AndrewX @ 2019-08-05 21:42 UTC (permalink / raw)
  To: netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	intel-wired-lan@lists.osuosl.org
In-Reply-To: <20190802105457.16596-1-hslester96@gmail.com>

> -----Original Message-----
> From: Intel-wired-lan [mailto:intel-wired-lan-bounces@osuosl.org] On
> Behalf Of Chuhong Yuan
> Sent: Friday, August 2, 2019 3:55 AM
> Cc: netdev@vger.kernel.org; Chuhong Yuan <hslester96@gmail.com>; linux-
> kernel@vger.kernel.org; intel-wired-lan@lists.osuosl.org; David S . Miller
> <davem@davemloft.net>
> Subject: [Intel-wired-lan] [PATCH 1/2] ixgbe: Explicitly initialize reference
> count to 0
> 
> The driver does not explicitly call atomic_set to initialize refcount to 0.
> Add the call so that it will be more straight forward to convert refcount from
> atomic_t to refcount_t.
> 
> Signed-off-by: Chuhong Yuan <hslester96@gmail.com>
> ---
>  drivers/net/ethernet/intel/ixgbe/ixgbe_fcoe.c | 1 +
>  1 file changed, 1 insertion(+)

Tested-by: Andrew Bowers <andrewx.bowers@intel.com>



^ permalink raw reply

* RE: [Intel-wired-lan] [PATCH 2/2] ixgbe: Use refcount_t for refcount
From: Bowers, AndrewX @ 2019-08-05 21:43 UTC (permalink / raw)
  To: netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	intel-wired-lan@lists.osuosl.org
In-Reply-To: <20190802105507.16650-1-hslester96@gmail.com>

> -----Original Message-----
> From: Intel-wired-lan [mailto:intel-wired-lan-bounces@osuosl.org] On
> Behalf Of Chuhong Yuan
> Sent: Friday, August 2, 2019 3:55 AM
> Cc: netdev@vger.kernel.org; Chuhong Yuan <hslester96@gmail.com>; linux-
> kernel@vger.kernel.org; intel-wired-lan@lists.osuosl.org; David S . Miller
> <davem@davemloft.net>
> Subject: [Intel-wired-lan] [PATCH 2/2] ixgbe: Use refcount_t for refcount
> 
> refcount_t is better for reference counters since its implementation can
> prevent overflows.
> So convert atomic_t ref counters to refcount_t.
> 
> Also convert refcount from 0-based to 1-based.
> 
> This patch depends on PATCH 1/2.
> 
> Signed-off-by: Chuhong Yuan <hslester96@gmail.com>
> ---
>  drivers/net/ethernet/intel/ixgbe/ixgbe_fcoe.c | 6 +++---
> drivers/net/ethernet/intel/ixgbe/ixgbe_fcoe.h | 2 +-
>  2 files changed, 4 insertions(+), 4 deletions(-)

Tested-by: Andrew Bowers <andrewx.bowers@intel.com>



^ permalink raw reply

* Re: [PATCH net-next v2] openvswitch: Print error when ovs_execute_actions() fails
From: Yifeng Sun @ 2019-08-05 21:43 UTC (permalink / raw)
  To: Pravin Shelar; +Cc: Linux Kernel Network Developers, Greg Rose
In-Reply-To: <CAOrHB_C758HjLJxb3jzAn0Wy1a_m4G2o4gsqMDdhJ9PRdT4GUg@mail.gmail.com>

Thanks Pravin!

Best,
Yifeng

On Mon, Aug 5, 2019 at 1:49 PM Pravin Shelar <pshelar@ovn.org> wrote:
>
> On Sun, Aug 4, 2019 at 7:56 PM Yifeng Sun <pkusunyifeng@gmail.com> wrote:
> >
> > Currently in function ovs_dp_process_packet(), return values of
> > ovs_execute_actions() are silently discarded. This patch prints out
> > an debug message when error happens so as to provide helpful hints
> > for debugging.
> > ---
> > v1->v2: Fixed according to Pravin's review.
> >
>
> Looks good.
> Acked-by: Pravin B Shelar <pshelar@ovn.org>
>
> Thanks,
> Pravin.

^ permalink raw reply

* Re: linux-next: Signed-off-by missing for commit in the net tree
From: David Miller @ 2019-08-05 21:43 UTC (permalink / raw)
  To: sfr; +Cc: netdev, linux-next, linux-kernel
In-Reply-To: <20190806073825.6e6ba393@canb.auug.org.au>

From: Stephen Rothwell <sfr@canb.auug.org.au>
Date: Tue, 6 Aug 2019 07:38:25 +1000

> Commit
> 
>   c3953a3c2d31 ("NFC: nfcmrvl: fix gpio-handling regression")
> 
> is missing a Signed-off-by from its committer.

That has to be the first time that's ever happened to me :-)

And indeed as I check my command line history I forgot the --signoff
command line option.


^ permalink raw reply

* Re: pull-request: can 2019-08-02
From: David Miller @ 2019-08-05 21:45 UTC (permalink / raw)
  To: mkl; +Cc: netdev, linux-can, kernel
In-Reply-To: <20190802120038.18154-1-mkl@pengutronix.de>

From: Marc Kleine-Budde <mkl@pengutronix.de>
Date: Fri,  2 Aug 2019 14:00:34 +0200

> this is a pull request of 4 patches for net/master.
> 
> The first two patches are by Wang Xiayang, they force that the string buffer
> during a dev_info() is properly NULL terminated.
> 
> The last two patches are by Tomas Bortoli and fix both a potential info leak of
> kernel memory to USB devices.

Pulled, thanks Marc.

^ permalink raw reply

* RE: [Intel-wired-lan] [PATCH][net-next] ice: fix potential infinite loop
From: Bowers, AndrewX @ 2019-08-05 21:49 UTC (permalink / raw)
  To: intel-wired-lan@lists.osuosl.org, netdev@vger.kernel.org
  Cc: kernel-janitors@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <20190802155217.16996-1-colin.king@canonical.com>

> -----Original Message-----
> From: Intel-wired-lan [mailto:intel-wired-lan-bounces@osuosl.org] On
> Behalf Of Colin King
> Sent: Friday, August 2, 2019 8:52 AM
> To: Kirsher, Jeffrey T <jeffrey.t.kirsher@intel.com>; David S . Miller
> <davem@davemloft.net>; intel-wired-lan@lists.osuosl.org;
> netdev@vger.kernel.org
> Cc: kernel-janitors@vger.kernel.org; linux-kernel@vger.kernel.org
> Subject: [Intel-wired-lan] [PATCH][net-next] ice: fix potential infinite loop
> 
> From: Colin Ian King <colin.king@canonical.com>
> 
> The loop counter of a for-loop is a u8 however this is being compared to an
> int upper bound and this can lead to an infinite loop if the upper bound is
> greater than 255 since the loop counter will wrap back to zero. Fix this
> potential issue by making the loop counter an int.
> 
> Addresses-Coverity: ("Infinite loop")
> Fixes: c7aeb4d1b9bf ("ice: Disable VFs until reset is completed")
> Signed-off-by: Colin Ian King <colin.king@canonical.com>
> ---
>  drivers/net/ethernet/intel/ice/ice_main.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

Tested-by: Andrew Bowers <andrewx.bowers@intel.com>



^ permalink raw reply

* Re: [Intel-wired-lan] [PATCH 2/2] ixgbe: Use refcount_t for refcount
From: Willem de Bruijn @ 2019-08-05 21:59 UTC (permalink / raw)
  To: Bowers, AndrewX
  Cc: netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	intel-wired-lan@lists.osuosl.org
In-Reply-To: <26D9FDECA4FBDD4AADA65D8E2FC68A4A1D40F174@ORSMSX104.amr.corp.intel.com>

On Mon, Aug 5, 2019 at 5:43 PM Bowers, AndrewX <andrewx.bowers@intel.com> wrote:
>
> > -----Original Message-----
> > From: Intel-wired-lan [mailto:intel-wired-lan-bounces@osuosl.org] On
> > Behalf Of Chuhong Yuan
> > Sent: Friday, August 2, 2019 3:55 AM
> > Cc: netdev@vger.kernel.org; Chuhong Yuan <hslester96@gmail.com>; linux-
> > kernel@vger.kernel.org; intel-wired-lan@lists.osuosl.org; David S . Miller
> > <davem@davemloft.net>
> > Subject: [Intel-wired-lan] [PATCH 2/2] ixgbe: Use refcount_t for refcount
> >
> > refcount_t is better for reference counters since its implementation can
> > prevent overflows.
> > So convert atomic_t ref counters to refcount_t.
> >
> > Also convert refcount from 0-based to 1-based.
> >
> > This patch depends on PATCH 1/2.
> >
> > Signed-off-by: Chuhong Yuan <hslester96@gmail.com>
> > ---
> >  drivers/net/ethernet/intel/ixgbe/ixgbe_fcoe.c | 6 +++---
> > drivers/net/ethernet/intel/ixgbe/ixgbe_fcoe.h | 2 +-
> >  2 files changed, 4 insertions(+), 4 deletions(-)
>
> Tested-by: Andrew Bowers <andrewx.bowers@intel.com>

To reiterate, this patchset should not be applied as is. It is not
correct to simply change the initial refcount.

^ permalink raw reply

* Re: [Intel-wired-lan] [PATCH 2/2] ixgbe: Use refcount_t for refcount
From: Alexander Duyck @ 2019-08-05 22:02 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Chuhong Yuan, Network Development, intel-wired-lan, linux-kernel,
	David S . Miller
In-Reply-To: <CA+FuTScLs-qJApj5Yw9OOjVk4++HSjn__Vdy+xX2V1rpWU8uLg@mail.gmail.com>

On Fri, Aug 2, 2019 at 6:47 AM Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
>
> On Fri, Aug 2, 2019 at 6:55 AM Chuhong Yuan <hslester96@gmail.com> wrote:
> >
> > refcount_t is better for reference counters since its
> > implementation can prevent overflows.
> > So convert atomic_t ref counters to refcount_t.
> >
> > Also convert refcount from 0-based to 1-based.
> >
> > This patch depends on PATCH 1/2.
> >
> > Signed-off-by: Chuhong Yuan <hslester96@gmail.com>
> > ---
> >  drivers/net/ethernet/intel/ixgbe/ixgbe_fcoe.c | 6 +++---
> >  drivers/net/ethernet/intel/ixgbe/ixgbe_fcoe.h | 2 +-
> >  2 files changed, 4 insertions(+), 4 deletions(-)
> >
> > diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_fcoe.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_fcoe.c
> > index 00710f43cfd2..d313d00065cd 100644
> > --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_fcoe.c
> > +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_fcoe.c
> > @@ -773,7 +773,7 @@ int ixgbe_setup_fcoe_ddp_resources(struct ixgbe_adapter *adapter)
> >
> >         fcoe->extra_ddp_buffer = buffer;
> >         fcoe->extra_ddp_buffer_dma = dma;
> > -       atomic_set(&fcoe->refcnt, 0);
> > +       refcount_set(&fcoe->refcnt, 1);
>
> Same point as in the cxgb4 driver patch: how can you just change the
> initial value without modifying the condition for release?
>
> This is not a suggestion to resubmit all these changes again with a
> change to the release condition.

So I am pretty sure this patch is badly broken. It doesn't make any
sense to be resetting with the refcnt in
ixgbe_setup_fcoe_ddp_resources. The value is initialized to zero when
the adapter structure was allocated.

Consider this a NAK from me.

^ permalink raw reply

* Re: [Intel-wired-lan] [PATCH 1/2] ixgbe: Explicitly initialize reference count to 0
From: Alexander Duyck @ 2019-08-05 22:03 UTC (permalink / raw)
  To: Bowers, AndrewX
  Cc: netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	intel-wired-lan@lists.osuosl.org
In-Reply-To: <26D9FDECA4FBDD4AADA65D8E2FC68A4A1D40F162@ORSMSX104.amr.corp.intel.com>

On Mon, Aug 5, 2019 at 2:42 PM Bowers, AndrewX <andrewx.bowers@intel.com> wrote:
>
> > -----Original Message-----
> > From: Intel-wired-lan [mailto:intel-wired-lan-bounces@osuosl.org] On
> > Behalf Of Chuhong Yuan
> > Sent: Friday, August 2, 2019 3:55 AM
> > Cc: netdev@vger.kernel.org; Chuhong Yuan <hslester96@gmail.com>; linux-
> > kernel@vger.kernel.org; intel-wired-lan@lists.osuosl.org; David S . Miller
> > <davem@davemloft.net>
> > Subject: [Intel-wired-lan] [PATCH 1/2] ixgbe: Explicitly initialize reference
> > count to 0
> >
> > The driver does not explicitly call atomic_set to initialize refcount to 0.
> > Add the call so that it will be more straight forward to convert refcount from
> > atomic_t to refcount_t.
> >
> > Signed-off-by: Chuhong Yuan <hslester96@gmail.com>
> > ---
> >  drivers/net/ethernet/intel/ixgbe/ixgbe_fcoe.c | 1 +
> >  1 file changed, 1 insertion(+)
>
> Tested-by: Andrew Bowers <andrewx.bowers@intel.com>

NAK. This patch is badly broken. We should not be resetting the fcoe
refcnt in ixgbe_setup_fcoe_ddp_resources.

^ permalink raw reply

* Re: [PATCH] dt-bindings: net: meson-dwmac: convert to yaml
From: Rob Herring @ 2019-08-05 22:09 UTC (permalink / raw)
  To: Neil Armstrong
  Cc: Martin Blumenstingl, devicetree, netdev, linux-amlogic,
	moderated list:ARM/FREESCALE IMX / MXC ARM ARCHITECTURE,
	linux-kernel@vger.kernel.org
In-Reply-To: <20190805122558.5130-1-narmstrong@baylibre.com>

On Mon, Aug 5, 2019 at 6:26 AM Neil Armstrong <narmstrong@baylibre.com> wrote:
>
> Now that we have the DT validation in place, let's convert the device tree
> bindings for the Synopsys DWMAC Glue for Amlogic SoCs over to a YAML schemas.
>
> Signed-off-by: Neil Armstrong <narmstrong@baylibre.com>
> ---
> Rob,
>
> I keep getting :
> .../devicetree/bindings/net/amlogic,meson-dwmac.example.dt.yaml: ethernet@c9410000: reg: [[3376480256, 65536], [3364046144, 8]] is too long

Because snps,dwmac.yaml has:

  reg:
    maxItems: 1

The schemas are applied separately and all have to be valid. You'll
need to change snps,dwmac.yaml to:

reg:
  minItems: 1
  maxItems: 2


The schema error messages leave something to be desired. I wish the
error messages said which schema is throwing the error.

> for the example DT
>
> and for the board DT :
> ../amlogic/meson-gxl-s905x-libretech-cc.dt.yaml: ethernet@c9410000: reg: [[0, 3376480256, 0, 65536, 0, 3364046144, 0, 4]] is too short
> ../amlogic/meson-gxl-s905x-nexbox-a95x.dt.yaml: soc: ethernet@c9410000:reg:0: [0, 3376480256, 0, 65536, 0, 3364046144, 0, 4] is too long
>
> and I don't know how to get rid of it.

The first issue is the same as the above. The 2nd issue is the use of
<> in dts files becomes stricter with the schema. Each entry in an
array needs to be bracketed:

reg = <0x0 0xc9410000 0x0 0x10000>,
          <0x0 0xc8834540 0x0 0x4>;

Rob

^ permalink raw reply

* Re: [PATCH 10/16] net: phy: adin: add EEE translation layer for Clause 22
From: Andrew Lunn @ 2019-08-05 22:11 UTC (permalink / raw)
  To: Alexandru Ardelean
  Cc: netdev, devicetree, linux-kernel, davem, robh+dt, mark.rutland,
	f.fainelli, hkallweit1
In-Reply-To: <20190805165453.3989-11-alexandru.ardelean@analog.com>

> +static int adin_cl22_to_adin_reg(int devad, u16 cl22_regnum)
> +{
> +	struct clause22_mmd_map *m;
> +	int i;
> +
> +	if (devad == MDIO_MMD_VEND1)
> +		return cl22_regnum;
> +
> +	for (i = 0; i < ARRAY_SIZE(clause22_mmd_map); i++) {
> +		m = &clause22_mmd_map[i];
> +		if (m->devad == devad && m->cl22_regnum == cl22_regnum)
> +			return m->adin_regnum;
> +	}
> +
> +	pr_err("No translation available for devad: %d reg: %04x\n",
> +	       devad, cl22_regnum);

phydev_err(). 

	      Andrew

^ permalink raw reply

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
From: Andy Lutomirski @ 2019-08-05 22:21 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Alexei Starovoitov, Song Liu, Kees Cook, Networking, bpf,
	Alexei Starovoitov, Daniel Borkmann, Kernel Team, Lorenz Bauer,
	Jann Horn, Greg KH, Linux API, LSM List
In-Reply-To: <CALCETrVtPs8gY-H4gmzSqPboid3CB++n50SvYd6RU9YVde_-Ow@mail.gmail.com>



> On Aug 5, 2019, at 2:25 PM, Andy Lutomirski <luto@kernel.org> wrote:
> 
> On Mon, Aug 5, 2019 at 12:21 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:


> 
>> What we need is to drop privileges sooner in daemons like systemd.
> 
> This is doable right now: systemd could fork off a subprocess and
> delegate its cgroup operations to it.  It would be maybe a couple
> hundred lines of code.  As an added benefit, that subprocess could
> verify that the bpf operations in question are reasonable.
> Alternatively, if there was a CAP_BPF_ADMIN, systemd could retain that
> capability and flip it on and off as needed.

I tried to look at the code and I couldn’t find it. Does systemd drop privileges at all?  Can you point me at the code you’re thinking of

^ permalink raw reply

* [PATCH net 1/2] docs: admin-guide: remove references to IPX and token-ring
From: Stephen Hemminger @ 2019-08-05 22:30 UTC (permalink / raw)
  To: davem, corbet; +Cc: linux-dog, netdev, Stephen Hemminger

Both IPX and TR have not been supported for a while now.
Remove them from the /proc/sys/net documentation.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
---
 Documentation/admin-guide/sysctl/net.rst | 29 +-----------------------
 1 file changed, 1 insertion(+), 28 deletions(-)

diff --git a/Documentation/admin-guide/sysctl/net.rst b/Documentation/admin-guide/sysctl/net.rst
index a7d44e71019d..287b98708a40 100644
--- a/Documentation/admin-guide/sysctl/net.rst
+++ b/Documentation/admin-guide/sysctl/net.rst
@@ -39,7 +39,6 @@ Table : Subdirectories in /proc/sys/net
  802       E802 protocol         ax25       AX25
  ethernet  Ethernet protocol     rose       X.25 PLP layer
  ipv4      IP version 4          x25        X.25 protocol
- ipx       IPX                   token-ring IBM token ring
  bridge    Bridging              decnet     DEC net
  ipv6      IP version 6          tipc       TIPC
  ========= =================== = ========== ==================
@@ -401,33 +400,7 @@ interface.
 (network) that the route leads to, the router (may be directly connected), the
 route flags, and the device the route is using.
 
-
-5. IPX
-------
-
-The IPX protocol has no tunable values in proc/sys/net.
-
-The IPX  protocol  does,  however,  provide  proc/net/ipx. This lists each IPX
-socket giving  the  local  and  remote  addresses  in  Novell  format (that is
-network:node:port). In  accordance  with  the  strange  Novell  tradition,
-everything but the port is in hex. Not_Connected is displayed for sockets that
-are not  tied to a specific remote address. The Tx and Rx queue sizes indicate
-the number  of  bytes  pending  for  transmission  and  reception.  The  state
-indicates the  state  the  socket  is  in and the uid is the owning uid of the
-socket.
-
-The /proc/net/ipx_interface  file lists all IPX interfaces. For each interface
-it gives  the network number, the node number, and indicates if the network is
-the primary  network.  It  also  indicates  which  device  it  is bound to (or
-Internal for  internal  networks)  and  the  Frame  Type if appropriate. Linux
-supports 802.3,  802.2,  802.2  SNAP  and DIX (Blue Book) ethernet framing for
-IPX.
-
-The /proc/net/ipx_route  table  holds  a list of IPX routes. For each route it
-gives the  destination  network, the router node (or Directly) and the network
-address of the router (or Connected) for internal networks.
-
-6. TIPC
+5. TIPC
 -------
 
 tipc_rmem
-- 
2.20.1


^ permalink raw reply related

* [PATCH net 2/2] net: docs: replace IPX in tuntap documentation
From: Stephen Hemminger @ 2019-08-05 22:30 UTC (permalink / raw)
  To: davem, corbet; +Cc: linux-dog, netdev, Stephen Hemminger
In-Reply-To: <20190805223003.13444-1-stephen@networkplumber.org>

IPX is no longer supported, but the example in the documentation
might useful. Replace it with IPv6.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
---
 Documentation/networking/tuntap.txt | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/Documentation/networking/tuntap.txt b/Documentation/networking/tuntap.txt
index 949d5dcdd9a3..0104830d5075 100644
--- a/Documentation/networking/tuntap.txt
+++ b/Documentation/networking/tuntap.txt
@@ -204,8 +204,8 @@ Ethernet device, which instead of receiving packets from a physical
 media, receives them from user space program and instead of sending 
 packets via physical media sends them to the user space program. 
 
-Let's say that you configured IPX on the tap0, then whenever 
-the kernel sends an IPX packet to tap0, it is passed to the application
+Let's say that you configured IPv6 on the tap0, then whenever
+the kernel sends an IPv6 packet to tap0, it is passed to the application
 (VTun for example). The application encrypts, compresses and sends it to 
 the other side over TCP or UDP. The application on the other side decompresses
 and decrypts the data received and writes the packet to the TAP device, 
-- 
2.20.1


^ permalink raw reply related

* [PATCH v2 net-next] selftests: Add l2tp tests
From: David Ahern @ 2019-08-05 22:41 UTC (permalink / raw)
  To: davem; +Cc: netdev, David Ahern

From: David Ahern <dsahern@gmail.com>

Add IPv4 and IPv6 l2tp tests. Current set is over IP and with
IPsec.

v2
- add l2tp.sh to TEST_PROGS in Makefile

Signed-off-by: David Ahern <dsahern@gmail.com>
---
 tools/testing/selftests/net/Makefile |   2 +-
 tools/testing/selftests/net/l2tp.sh  | 382 +++++++++++++++++++++++++++++++++++
 2 files changed, 383 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/net/l2tp.sh

diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile
index 70f2d6656170..0bd6b23c97ef 100644
--- a/tools/testing/selftests/net/Makefile
+++ b/tools/testing/selftests/net/Makefile
@@ -10,7 +10,7 @@ TEST_PROGS += fib_tests.sh fib-onlink-tests.sh pmtu.sh udpgso.sh ip_defrag.sh
 TEST_PROGS += udpgso_bench.sh fib_rule_tests.sh msg_zerocopy.sh psock_snd.sh
 TEST_PROGS += udpgro_bench.sh udpgro.sh test_vxlan_under_vrf.sh reuseport_addr_any.sh
 TEST_PROGS += test_vxlan_fdb_changelink.sh so_txtime.sh ipv6_flowlabel.sh
-TEST_PROGS += tcp_fastopen_backup_key.sh fcnal-test.sh
+TEST_PROGS += tcp_fastopen_backup_key.sh fcnal-test.sh l2tp.sh
 TEST_PROGS_EXTENDED := in_netns.sh
 TEST_GEN_FILES =  socket nettest
 TEST_GEN_FILES += psock_fanout psock_tpacket msg_zerocopy reuseport_addr_any
diff --git a/tools/testing/selftests/net/l2tp.sh b/tools/testing/selftests/net/l2tp.sh
new file mode 100644
index 000000000000..5782433886fc
--- /dev/null
+++ b/tools/testing/selftests/net/l2tp.sh
@@ -0,0 +1,382 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# L2TPv3 tunnel between 2 hosts
+#
+#            host-1          |   router   |     host-2
+#                            |            |
+#      lo          l2tp      |            |      l2tp           lo
+# 172.16.101.1  172.16.1.1   |            | 172.16.1.2    172.16.101.2
+#  fc00:101::1   fc00:1::1   |            |   fc00:1::2    fc00:101::2
+#                            |            |
+#                  eth0      |            |     eth0
+#                10.1.1.1    |            |   10.1.2.1
+#              2001:db8:1::1 |            | 2001:db8:2::1
+
+VERBOSE=0
+PAUSE_ON_FAIL=no
+
+which ping6 > /dev/null 2>&1 && ping6=$(which ping6) || ping6=$(which ping)
+
+################################################################################
+#
+log_test()
+{
+	local rc=$1
+	local expected=$2
+	local msg="$3"
+
+	if [ ${rc} -eq ${expected} ]; then
+		printf "TEST: %-60s  [ OK ]\n" "${msg}"
+		nsuccess=$((nsuccess+1))
+	else
+		ret=1
+		nfail=$((nfail+1))
+		printf "TEST: %-60s  [FAIL]\n" "${msg}"
+		if [ "${PAUSE_ON_FAIL}" = "yes" ]; then
+			echo
+			echo "hit enter to continue, 'q' to quit"
+			read a
+			[ "$a" = "q" ] && exit 1
+		fi
+	fi
+}
+
+run_cmd()
+{
+	local ns
+	local cmd
+	local out
+	local rc
+
+	ns="$1"
+	shift
+	cmd="$*"
+
+	if [ "$VERBOSE" = "1" ]; then
+		printf "    COMMAND: $cmd\n"
+	fi
+
+	out=$(eval ip netns exec ${ns} ${cmd} 2>&1)
+	rc=$?
+	if [ "$VERBOSE" = "1" -a -n "$out" ]; then
+		echo "    $out"
+	fi
+
+	[ "$VERBOSE" = "1" ] && echo
+
+	return $rc
+}
+
+################################################################################
+# create namespaces and interconnects
+
+create_ns()
+{
+	local ns=$1
+	local addr=$2
+	local addr6=$3
+
+	[ -z "${addr}" ] && addr="-"
+	[ -z "${addr6}" ] && addr6="-"
+
+	ip netns add ${ns}
+
+	ip -netns ${ns} link set lo up
+	if [ "${addr}" != "-" ]; then
+		ip -netns ${ns} addr add dev lo ${addr}
+	fi
+	if [ "${addr6}" != "-" ]; then
+		ip -netns ${ns} -6 addr add dev lo ${addr6}
+	fi
+
+	ip -netns ${ns} ro add unreachable default metric 8192
+	ip -netns ${ns} -6 ro add unreachable default metric 8192
+
+	ip netns exec ${ns} sysctl -qw net.ipv4.ip_forward=1
+	ip netns exec ${ns} sysctl -qw net.ipv6.conf.all.keep_addr_on_down=1
+	ip netns exec ${ns} sysctl -qw net.ipv6.conf.all.forwarding=1
+	ip netns exec ${ns} sysctl -qw net.ipv6.conf.default.forwarding=1
+	ip netns exec ${ns} sysctl -qw net.ipv6.conf.default.accept_dad=0
+}
+
+# create veth pair to connect namespaces and apply addresses.
+connect_ns()
+{
+	local ns1=$1
+	local ns1_dev=$2
+	local ns1_addr=$3
+	local ns1_addr6=$4
+	local ns2=$5
+	local ns2_dev=$6
+	local ns2_addr=$7
+	local ns2_addr6=$8
+
+	ip -netns ${ns1} li add ${ns1_dev} type veth peer name tmp
+	ip -netns ${ns1} li set ${ns1_dev} up
+	ip -netns ${ns1} li set tmp netns ${ns2} name ${ns2_dev}
+	ip -netns ${ns2} li set ${ns2_dev} up
+
+	if [ "${ns1_addr}" != "-" ]; then
+		ip -netns ${ns1} addr add dev ${ns1_dev} ${ns1_addr}
+		ip -netns ${ns2} addr add dev ${ns2_dev} ${ns2_addr}
+	fi
+
+	if [ "${ns1_addr6}" != "-" ]; then
+		ip -netns ${ns1} addr add dev ${ns1_dev} ${ns1_addr6}
+		ip -netns ${ns2} addr add dev ${ns2_dev} ${ns2_addr6}
+	fi
+}
+
+################################################################################
+# test setup
+
+cleanup()
+{
+	local ns
+
+	for ns in host-1 host-2 router
+	do
+		ip netns del ${ns} 2>/dev/null
+	done
+}
+
+setup_l2tp_ipv4()
+{
+	#
+	# configure l2tpv3 tunnel on host-1
+	#
+	ip -netns host-1 l2tp add tunnel tunnel_id 1041 peer_tunnel_id 1042 \
+			 encap ip local 10.1.1.1 remote 10.1.2.1
+	ip -netns host-1 l2tp add session name l2tp4 tunnel_id 1041 \
+			 session_id 1041 peer_session_id 1042
+	ip -netns host-1 link set dev l2tp4 up
+	ip -netns host-1 addr add dev l2tp4 172.16.1.1 peer 172.16.1.2
+
+	#
+	# configure l2tpv3 tunnel on host-2
+	#
+	ip -netns host-2 l2tp add tunnel tunnel_id 1042 peer_tunnel_id 1041 \
+			 encap ip local 10.1.2.1 remote 10.1.1.1
+	ip -netns host-2 l2tp add session name l2tp4 tunnel_id 1042 \
+			 session_id 1042 peer_session_id 1041
+	ip -netns host-2 link set dev l2tp4 up
+	ip -netns host-2 addr add dev l2tp4 172.16.1.2 peer 172.16.1.1
+
+	#
+	# add routes to loopback addresses
+	#
+	ip -netns host-1 ro add 172.16.101.2/32 via 172.16.1.2
+	ip -netns host-2 ro add 172.16.101.1/32 via 172.16.1.1
+}
+
+setup_l2tp_ipv6()
+{
+	#
+	# configure l2tpv3 tunnel on host-1
+	#
+	ip -netns host-1 l2tp add tunnel tunnel_id 1061 peer_tunnel_id 1062 \
+			 encap ip local 2001:db8:1::1 remote 2001:db8:2::1
+	ip -netns host-1 l2tp add session name l2tp6 tunnel_id 1061 \
+			 session_id 1061 peer_session_id 1062
+	ip -netns host-1 link set dev l2tp6 up
+	ip -netns host-1 addr add dev l2tp6 fc00:1::1 peer fc00:1::2
+
+	#
+	# configure l2tpv3 tunnel on host-2
+	#
+	ip -netns host-2 l2tp add tunnel tunnel_id 1062 peer_tunnel_id 1061 \
+			 encap ip local 2001:db8:2::1 remote 2001:db8:1::1
+	ip -netns host-2 l2tp add session name l2tp6 tunnel_id 1062 \
+			 session_id 1062 peer_session_id 1061
+	ip -netns host-2 link set dev l2tp6 up
+	ip -netns host-2 addr add dev l2tp6 fc00:1::2 peer fc00:1::1
+
+	#
+	# add routes to loopback addresses
+	#
+	ip -netns host-1 -6 ro add fc00:101::2/128 via fc00:1::2
+	ip -netns host-2 -6 ro add fc00:101::1/128 via fc00:1::1
+}
+
+setup()
+{
+	# start clean
+	cleanup
+
+	set -e
+	create_ns host-1 172.16.101.1/32 fc00:101::1/128
+	create_ns host-2 172.16.101.2/32 fc00:101::2/128
+	create_ns router
+
+	connect_ns host-1 eth0 10.1.1.1/24 2001:db8:1::1/64 \
+	           router eth1 10.1.1.2/24 2001:db8:1::2/64
+
+	connect_ns host-2 eth0 10.1.2.1/24 2001:db8:2::1/64 \
+	           router eth2 10.1.2.2/24 2001:db8:2::2/64
+
+	ip -netns host-1 ro add 10.1.2.0/24 via 10.1.1.2
+	ip -netns host-1 -6 ro add 2001:db8:2::/64 via 2001:db8:1::2
+
+	ip -netns host-2 ro add 10.1.1.0/24 via 10.1.2.2
+	ip -netns host-2 -6 ro add 2001:db8:1::/64 via 2001:db8:2::2
+
+	setup_l2tp_ipv4
+	setup_l2tp_ipv6
+	set +e
+}
+
+setup_ipsec()
+{
+	#
+	# IPv4
+	#
+	run_cmd host-1 ip xfrm policy add \
+		src 10.1.1.1 dst 10.1.2.1 dir out \
+		tmpl proto esp mode transport
+
+	run_cmd host-1 ip xfrm policy add \
+		src 10.1.2.1 dst 10.1.1.1 dir in \
+		tmpl proto esp mode transport
+
+	run_cmd host-2 ip xfrm policy add \
+		src 10.1.1.1 dst 10.1.2.1 dir in \
+		tmpl proto esp mode transport
+
+	run_cmd host-2 ip xfrm policy add \
+		src 10.1.2.1 dst 10.1.1.1 dir out \
+		tmpl proto esp mode transport
+
+	ip -netns host-1 xfrm state add \
+		src 10.1.1.1 dst 10.1.2.1 \
+		spi 0x1000 proto esp aead 'rfc4106(gcm(aes))' \
+		0x0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f 128 mode transport
+
+	ip -netns host-1 xfrm state add \
+		src 10.1.2.1 dst 10.1.1.1 \
+		spi 0x1001 proto esp aead 'rfc4106(gcm(aes))' \
+		0x0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f 128 mode transport
+
+	ip -netns host-2 xfrm state add \
+		src 10.1.1.1 dst 10.1.2.1 \
+		spi 0x1000 proto esp aead 'rfc4106(gcm(aes))' \
+		0x0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f 128 mode transport
+
+	ip -netns host-2 xfrm state add \
+		src 10.1.2.1 dst 10.1.1.1 \
+		spi 0x1001 proto esp aead 'rfc4106(gcm(aes))' \
+		0x0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f 128 mode transport
+
+	#
+	# IPV6
+	#
+	run_cmd host-1 ip -6 xfrm policy add \
+		src 2001:db8:1::1 dst 2001:db8:2::1 dir out \
+		tmpl proto esp mode transport
+
+	run_cmd host-1 ip -6 xfrm policy add \
+		src 2001:db8:2::1 dst 2001:db8:1::1 dir in \
+		tmpl proto esp mode transport
+
+	run_cmd host-2 ip -6 xfrm policy add \
+		src 2001:db8:1::1 dst 2001:db8:2::1 dir in \
+		tmpl proto esp mode transport
+
+	run_cmd host-2 ip -6 xfrm policy add \
+		src 2001:db8:2::1 dst 2001:db8:1::1 dir out \
+		tmpl proto esp mode transport
+
+	ip -netns host-1 -6 xfrm state add \
+		src 2001:db8:1::1 dst 2001:db8:2::1 \
+		spi 0x1000 proto esp aead 'rfc4106(gcm(aes))' \
+		0x0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f 128 mode transport
+
+	ip -netns host-1 -6 xfrm state add \
+		src 2001:db8:2::1 dst 2001:db8:1::1 \
+		spi 0x1001 proto esp aead 'rfc4106(gcm(aes))' \
+		0x0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f 128 mode transport
+
+	ip -netns host-2 -6 xfrm state add \
+		src 2001:db8:1::1 dst 2001:db8:2::1 \
+		spi 0x1000 proto esp aead 'rfc4106(gcm(aes))' \
+		0x0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f 128 mode transport
+
+	ip -netns host-2 -6 xfrm state add \
+		src 2001:db8:2::1 dst 2001:db8:1::1 \
+		spi 0x1001 proto esp aead 'rfc4106(gcm(aes))' \
+		0x0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f 128 mode transport
+}
+
+teardown_ipsec()
+{
+	run_cmd host-1 ip xfrm state flush
+	run_cmd host-1 ip xfrm policy flush
+	run_cmd host-2 ip xfrm state flush
+	run_cmd host-2 ip xfrm policy flush
+}
+
+################################################################################
+# generate traffic through tunnel for various cases
+
+run_ping()
+{
+	local desc="$1"
+
+	run_cmd host-1 ping -c1 -w1 172.16.1.2
+	log_test $? 0 "IPv4 basic L2TP tunnel ${desc}"
+
+	run_cmd host-1 ping -c1 -w1 -I 172.16.101.1 172.16.101.2
+	log_test $? 0 "IPv4 route through L2TP tunnel ${desc}"
+
+	run_cmd host-1 ${ping6} -c1 -w1 fc00:1::2
+	log_test $? 0 "IPv6 basic L2TP tunnel ${desc}"
+
+	run_cmd host-1 ${ping6} -c1 -w1 -I fc00:101::1 fc00:101::2
+	log_test $? 0 "IPv6 route through L2TP tunnel ${desc}"
+}
+
+run_tests()
+{
+	local desc
+
+	setup
+	run_ping
+
+	setup_ipsec
+	run_ping "- with IPsec"
+	run_cmd host-1 ping -c1 -w1 172.16.1.2
+	log_test $? 0 "IPv4 basic L2TP tunnel ${desc}"
+
+	run_cmd host-1 ping -c1 -w1 -I 172.16.101.1 172.16.101.2
+	log_test $? 0 "IPv4 route through L2TP tunnel ${desc}"
+
+	run_cmd host-1 ${ping6} -c1 -w1 fc00:1::2
+	log_test $? 0 "IPv6 basic L2TP tunnel - with IPsec"
+
+	run_cmd host-1 ${ping6} -c1 -w1 -I fc00:101::1 fc00:101::2
+	log_test $? 0 "IPv6 route through L2TP tunnel - with IPsec"
+
+	teardown_ipsec
+	run_ping "- after IPsec teardown"
+}
+
+################################################################################
+# main
+
+declare -i nfail=0
+declare -i nsuccess=0
+
+while getopts :pv o
+do
+	case $o in
+		p) PAUSE_ON_FAIL=yes;;
+		v) VERBOSE=$(($VERBOSE + 1));;
+		*) exit 1;;
+	esac
+done
+
+run_tests
+cleanup
+
+printf "\nTests passed: %3d\n" ${nsuccess}
+printf "Tests failed: %3d\n"   ${nfail}
-- 
2.11.0


^ permalink raw reply related

* Re: [PATCH 00/12] block/bio, fs: convert put_page() to put_user_page*()
From: John Hubbard @ 2019-08-05 22:54 UTC (permalink / raw)
  To: Christoph Hellwig, john.hubbard
  Cc: Andrew Morton, Alexander Viro, Anna Schumaker, David S . Miller,
	Dominique Martinet, Eric Van Hensbergen, Jason Gunthorpe,
	Jason Wang, Jens Axboe, Latchesar Ionkov, Michael S . Tsirkin,
	Miklos Szeredi, Trond Myklebust, Christoph Hellwig,
	Matthew Wilcox, linux-mm, LKML, ceph-devel, kvm, linux-block,
	linux-cifs, linux-fsdevel, linux-nfs, linux-rdma, netdev,
	samba-technical, v9fs-developer, virtualization
In-Reply-To: <20190724061750.GA19397@infradead.org>

On 7/23/19 11:17 PM, Christoph Hellwig wrote:
> On Tue, Jul 23, 2019 at 09:25:06PM -0700, john.hubbard@gmail.com wrote:
>> * Store, in the iov_iter, a "came from gup (get_user_pages)" parameter.
>>   Then, use the new iov_iter_get_pages_use_gup() to retrieve it when
>>   it is time to release the pages. That allows choosing between put_page()
>>   and put_user_page*().
>>
>> * Pass in one more piece of information to bio_release_pages: a "from_gup"
>>   parameter. Similar use as above.
>>
>> * Change the block layer, and several file systems, to use
>>   put_user_page*().
> 
> I think we can do this in a simple and better way.  We have 5 ITER_*
> types.  Of those ITER_DISCARD as the name suggests never uses pages, so
> we can skip handling it.  ITER_PIPE is rejected іn the direct I/O path,
> which leaves us with three.
> 

Hi Christoph,

Are you working on anything like this? Or on the put_user_bvec() idea?
Please let me know, otherwise I'll go in and implement something here.


thanks,
-- 
John Hubbard
NVIDIA

> Out of those ITER_BVEC needs a user page reference, so we want to call
> put_user_page* on it.  ITER_BVEC always already has page reference,
> which means in the block direct I/O path path we alread don't take
> a page reference.  We should extent that handling to all other calls
> of iov_iter_get_pages / iov_iter_get_pages_alloc.  I think we should
> just reject ITER_KVEC for direct I/O as well as we have no users and
> it is rather pointless.  Alternatively if we see a use for it the
> callers should always have a life page reference anyway (or might
> be on kmalloc memory), so we really should not take a reference either.
> 
> In other words:  the only time we should ever have to put a page in
> this patch is when they are user pages.  We'll need to clean up
> various bits of code for that, but that can be done gradually before
> even getting to the actual put_user_pages conversion.
> 

^ permalink raw reply

* Re: [PATCH] tools: bpftool: fix reading from /proc/config.gz
From: Peter Wu @ 2019-08-05 23:54 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Stanislav Fomichev, Alexei Starovoitov, Daniel Borkmann, netdev,
	Stanislav Fomichev, Quentin Monnet
In-Reply-To: <20190805120649.421211da@cakuba.netronome.com>

Hi all,

Thank you for your quick feedback, I will address them in the next
revision.

On Mon, Aug 05, 2019 at 11:41:09AM +0100, Quentin Monnet wrote:

> As far as I understood (from examining Cilium [0]), /proc/config _is_
> used by some distributions, such as CoreOS. This is why we look at that
> location in bpftool.
> 
> [0] https://github.com/cilium/cilium/blob/master/bpf/run_probes.sh#L42

This comment[1] says "CoreOS uses /proc/config", but I think that is a
typo and is supposed to say "/proc/config.gz". The original feature
request[2] uses "/boot/config" as example.

 [1]: https://github.com/cilium/cilium/pull/1065
 [2]: https://github.com/cilium/cilium/issues/891

Given that "/proc/config.gz" is the standard since at least v2.6.12-rc2,
and the official kernel has no mention of "/proc/config", I would like
to skip the latter. If someone has a need for this and it is not covered
by either /boot/config-$(uname -r) or /proc/config.gz, they could submit
a patch for it with links to documentation. How about that?

> > -static char *get_kernel_config_option(FILE *fd, const char *option)
> > +static bool get_kernel_config_option(FILE *fd, char **buf_p, size_t *n_p,
> > +				     char **value)
> 
> Maybe we could rename this function, and have "next" appear in it
> somewhere? After your changes, it does not return the value for a
> specific option anymore.

I have changed it to "read_next_kernel_config_option", let me know if
you prefer an alternative.

> >  {
> > -	size_t line_n = 0, optlen = strlen(option);
> > -	char *res, *strval, *line = NULL;
> > -	ssize_t n;
> > +	char *sep;
> > +	ssize_t linelen;
> 
> Please order the declarations in reverse-Christmas tree style.

Does this refer to the type, name, or full line length? I did not find
this in CodingStyle, the closest I could get is:
https://lore.kernel.org/patchwork/patch/732076/

I will assume the line length for now.

> >  static void probe_kernel_image_config(void)
> > @@ -386,31 +386,34 @@ static void probe_kernel_image_config(void)
> >  		/* test_bpf module for BPF tests */
> >  		"CONFIG_TEST_BPF",
> >  	};
> > +	char *values[ARRAY_SIZE(options)] = { };
> >  	char *value, *buf = NULL;
> >  	struct utsname utsn;
> >  	char path[PATH_MAX];
> >  	size_t i, n;
> >  	ssize_t ret;
> > -	FILE *fd;
> > +	FILE *fd = NULL;
> > +	bool is_pipe = false;
> 
> Reverse-Christmas-tree style please.

Even if that means moving lines? Something like this?

        char path[PATH_MAX];
   +    bool is_pipe = false;
   +    FILE *fd = NULL;
        size_t i, n;
        ssize_t ret;
   -    FILE *fd;

> >  	if (uname(&utsn))
> > -		goto no_config;
> > +		goto end_parse;
> 
> Just thinking, maybe if uname() fails we can skip /boot/config-$(uname
> -r) but still attempt to parse /proc/config{,.gz} instead of printing
> only NULL options?

Good idea, I'll try a bit harder if uname falls for whatever reason.

> Because some distributions do use /proc/config, we should keep this. You
> can probably add /proc/config.gz as another attempt below (or even
> above) the current case?

I doubt it is actually in use, it looks like a typo in the original PR.
This post only lists /proc/config.gz, /boot/config and
/boot/config-$(uname -r): https://superuser.com/questions/287371

> > +	while (get_kernel_config_option(fd, &buf, &n, &value))> +		for (i = 0; i < ARRAY_SIZE(options); i++) {
> > +			if (values[i] || strcmp(buf, options[i]))
> 
> Can we have an option set multiple times in the config file? If so,
> maybe have a p_info() if values are different to warn users that
> conflicting values were found?

scripts/kconfig/merge_config.sh seems to apply a merge strategy,
overwriting earlier values and warning about it. However this should be
rare given that it ended up at /proc/config.gz. For now I will favor
simplicity over complexity and keep the old situation. Let me know if
you prefer otherwise.


On Mon, Aug 05, 2019 at 12:06:49PM -0700, Jakub Kicinski wrote:
> On Mon, 5 Aug 2019 08:29:36 -0700, Stanislav Fomichev wrote:
> > On 08/05, Peter Wu wrote:
> > > /proc/config has never existed as far as I can see, but /proc/config.gz
> > > is present on Arch Linux. Execute an external gunzip program to avoid
> > > linking to zlib and rework the option scanning code since a pipe is not
> > > seekable. This also fixes a file handle leak on some error paths.  
> > Thanks for doing that! One question: why not link against -lz instead?
> > With fork/execing gunzip you're just hiding this dependency.
> > 
> > You can add something like this to the Makefile:
> > ifeq ($(feature-zlib),1)
> > CLFAGS += -DHAVE_ZLIB
> > endif
> > 
> > And then conditionally add support for config.gz. Thoughts?
> 
> +1

Given that the old code did not have this library dependency I did not
add it (the program would otherwise fail to run). Executing an external
process is similar to what tar does. I will look into linking directly
to zlib, thanks!
-- 
Kind regards,
Peter Wu
https://lekensteyn.nl

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox