Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH] lan743x: remove redundant assignment to variable rx_process_result
From: David Miller @ 2019-09-06 14:47 UTC (permalink / raw)
  To: colin.king
  Cc: bryan.whitehead, UNGLinuxDriver, netdev, kernel-janitors,
	linux-kernel
In-Reply-To: <20190905140135.26951-1-colin.king@canonical.com>

From: Colin King <colin.king@canonical.com>
Date: Thu,  5 Sep 2019 15:01:35 +0100

> From: Colin Ian King <colin.king@canonical.com>
> 
> The variable rx_process_result is being initialized with a value that
> is never read and is being re-assigned immediately afterwards. The
> assignment is redundant, so replace it with the return from function
> lan743x_rx_process_packet.
> 
> Addresses-Coverity: ("Unused value")
> Signed-off-by: Colin Ian King <colin.king@canonical.com>

Applied to net-next, thanks.

^ permalink raw reply

* Re: [PATCH net-next,v3 0/4] flow_offload: update mangle action representation
From: Pablo Neira Ayuso @ 2019-09-06 14:50 UTC (permalink / raw)
  To: Edward Cree
  Cc: netfilter-devel, davem, netdev, jakub.kicinski, jiri, saeedm,
	vishal, vladbu
In-Reply-To: <35ac21be-ff2f-a9cd-dd71-28bc37e8a51b@solarflare.com>

On Fri, Sep 06, 2019 at 02:37:16PM +0100, Edward Cree wrote:
> On 06/09/2019 14:14, Pablo Neira Ayuso wrote:
> > OK, I can document this semantics, I need just _time_ to write that
> > documentation. I was expecting this patch description is enough by now
> > until I can get to finish that documentation.
>
> I think for two structs with apparently the same contents but different
>  semantics (one has the mask bitwise complemented) it's best to hold up
>  the code change until the comment is ready to come with it, because
>  until then it's a dangerously confusing situation.

The idea is that flow rule API != tc front-end anymore. So the flow
rule API can evolve regardless the front-end requirements. Before this
update there was no other choice than using the tc pedit layout since
it was exposed to the drivers, this is not the case anymore.

> >> And you can't just coalesce all consecutive mangles, because if you
> >>  mangle two consecutive fields (e.g. UDP sport and dport) the driver
> >>  still needs to disentangle that if it works on a 'fields' (rather
> >>  than 'u32s') level.
> >
> > This infrastructure is _not_ coalescing two consecutive field, e.g.
> > UDP sport and dport is _not_ coalesced. The coalesce routine does
> > _not_ handle multiple tc pedit ex actions.
>
> So an IPv6 address mangle only comes as a single action if it's from
>  netfilter, not if it's coming from TC pedit.

Driver gets one single action from tc/netfilter (and potentially
ethtool if it would support for this), it comes as a single action
from both subsystems:

        front-end -----> flow_rule API -----> drivers

Front-end need to translate their representation to this
FLOW_ACTION_MANGLE layout.

By front-end, I refer to ethtool/netfilter/tc.

>  Therefore drivers still  need to handle an IPv6 or MAC address
>  mangle coming in multiple  actions, therefore your driver
>  simplifications are invalid.  No?

No. IPv6 and MAC come as a single action. All subsystems use the same
flow_rule API, they use the same layout.

> > The model you propose would still need this code for tc pedit to
> > adjust offset/length and coalesce u32 fields.
>
>  Yes, but we don't add code/features to the kernel based on what we
>  *could* use it for later

This is already useful: Look at the cxgb driver in particular, it's
way easier to read.

Other existing drivers do not need to do things like:

        case round_down(offsetof(struct iphdr, tos), 4):

and the play with masks to infer if the user wants to mangle the TOS
field, they can just do:

        case offsetof(struct iphdr, tos):

which is way more natural representation.

^ permalink raw reply

* Re: [PATCH net] net: gso: Fix skb_segment splat when splitting gso_size mangled skb having linear-headed frag_list
From: Willem de Bruijn @ 2019-09-06 14:49 UTC (permalink / raw)
  To: Shmulik Ladkani
  Cc: Daniel Borkmann, Eric Dumazet, eyal, netdev, Shmulik Ladkani,
	Alexander Duyck
In-Reply-To: <20190906094744.345d9442@pixies>

On Fri, Sep 6, 2019 at 2:47 AM Shmulik Ladkani <shmulik@metanetworks.com> wrote:
>
> On Thu, 5 Sep 2019 17:51:20 -0400
> Willem de Bruijn <willemdebruijn.kernel@gmail.com> wrote:
>
> > On Thu, Sep 5, 2019 at 2:36 PM Shmulik Ladkani <shmulik@metanetworks.com> wrote:
> > >
> > > +       if (mss != GSO_BY_FRAGS &&
> > > +           (skb_shinfo(head_skb)->gso_type & SKB_GSO_DODGY)) {
> > > +               /* gso_size is untrusted.
> > > +                *
> > > +                * If head_skb has a frag_list with a linear non head_frag
> > > +                * item, and head_skb's headlen does not fit requested
> > > +                * gso_size, fall back to copying the skbs - by disabling sg.
> > > +                *
> > > +                * We assume checking the first frag suffices, i.e if either of
> > > +                * the frags have non head_frag data, then the first frag is
> > > +                * too.
> > > +                */
> > > +               if (list_skb && skb_headlen(list_skb) && !list_skb->head_frag &&
> > > +                   (mss != skb_headlen(head_skb) - doffset)) {
> >
> > I thought the idea was to check skb_headlen(list_skb), as that is the
> > cause of the problem. Is skb_headlen(head_skb) a good predictor of
> > that? I can certainly imagine that it is, just not sure.
>
> Yes, 'mss != skb_headlen(HEAD_SKB)' seems to be a very good predictor,
> both for the test reproducer, and what's observered on a live system.
>
> We *CANNOT* use 'mss != skb_headlen(LIST_SKB)' as the test condition.
> The packet could have just a SINGLE frag_list member, and that member could
> be a "small remainder" not reaching the full mss size - so we could hit
> the test condition EVEN FOR NON gso_size mangled frag_list skbs -
> which is not desired.

The last segment. Yes, good point.

> Also, is we test 'mss != skb_headlen(list_skb)' and execute 'sg=false'
> ONLY IF 'list_skb' is *NOT* the last item, this is still bogus.
> Imagine a gso_size mangled packet having just head_skb and a single
> "small remainder" frag. This packet will hit the BUG_ON, as the
> 'sg=false' solution is now skipped according to the revised condition.

Right, I wouldn't suggest that.

But I wonder whether it is a given that head_skb has headlen.

Btw, it seems slightly odd to me tot test head_frag before testing
headlen in the v2 patch.

^ permalink raw reply

* Re: [PATCH] net/skbuff: silence warnings under memory pressure
From: Petr Mladek @ 2019-09-06 14:55 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Steven Rostedt, Qian Cai, davem, Eric Dumazet, Sergey Senozhatsky,
	Michal Hocko, linux-mm, linux-kernel, netdev
In-Reply-To: <20190905113208.GA521@jagdpanzerIV>

On Thu 2019-09-05 20:32:08, Sergey Senozhatsky wrote:
> On (09/04/19 16:42), Qian Cai wrote:
> > > Let me think more.
> > 
> > To summary, those look to me are all good long-term improvement that would
> > reduce the likelihood of this kind of livelock in general especially for other
> > unknown allocations that happen while processing softirqs, but it is still up to
> > the air if it fixes it 100% in all situations as printk() is going to take more
> > time
> 
> Well. So. I guess that we don't need irq_work most of the time.
> 
> We need to queue irq_work for "safe" wake_up_interruptible(), when we
> know that we can deadlock in scheduler. IOW, only when we are invoked
> from the scheduler. Scheduler has printk_deferred(), which tells printk()
> that it cannot do wake_up_interruptible(). Otherwise we can just use
> normal wake_up_process() and don't need that irq_work->wake_up_interruptible()
> indirection. The parts of the scheduler, which by mistake call plain printk()
> from under pi_lock or rq_lock have chances to deadlock anyway and should
> be switched to printk_deferred().
> 
> I think we can queue significantly much less irq_work-s from printk().
> 
> Petr, Steven, what do you think?
> 
> Something like this. Call wake_up_interruptible(), switch to
> wake_up_klogd() only when called from sched code.

Replacing irq_work_queue() with wake_up_interruptible() looks
dangerous to me.

As a result, all "normal" printk() calls from the scheduler
code will deadlock. There is almost always a userspace
logger registered.

By "normal" I mean anything that is not printk_deferred(). For
example, any WARN() from sheduler will cause a deadlock.
We will not even have chance to catch these problems in
advance by lockdep.

The difference is that console_unlock() calls wake_up_process()
only when there is a waiter. And the hard console_lock() is not
called that often.


Honestly, scheduling IRQ looks like the most lightweight and reliable
solution for offloading. We are in big troubles if we could not use
it in printk() code.

IMHO, the best solution is to ratelimit the warnings about the
allocation failures. It does not make sense to repeat the same
warning again and again. We might need a better ratelimiting API
if the current one is not reliable.

Best Regards,
Petr

^ permalink raw reply

* Re: Need more information on "ifi_change" in "struct ifinfomsg"
From: Nicolas Dichtel @ 2019-09-06 14:55 UTC (permalink / raw)
  To: dhan lin, netdev
In-Reply-To: <CAMvS6vbeo5tBADNmLvkXUuSSHmxVpt3UW+jZtxY2Le9nXRbNDw@mail.gmail.com>

Le 06/09/2019 à 07:08, dhan lin a écrit :
> Hi All,
> 
> There is a field called ifi_change in "struct ifinfomsg". man page for
> rtnetlink says its for future use and should be always set to
> 0xFFFFFFFF.
> 
> But ive run some sample tests, to confirm the value is not as per man
> pages explanation.
> Its 0 most of the times and non-zero sometimes.
> 
> I've the following question,
> 
> Is ifi_change set only when there is a state change in interface values?
ifi_change indicates which flags (ifi_flags) have changed.
It does not cover other changes.


Regards,
Nicolas

^ permalink raw reply

* [PATCH bpf-next v10 0/4] BPF: New helper to obtain namespace data  from current task
From: Carlos Neira @ 2019-09-06 15:09 UTC (permalink / raw)
  To: netdev; +Cc: yhs, ebiederm, brouer, cneirabustos, bpf

This helper obtains the active namespace from current and returns pid, tgid,
device and namespace id as seen from that namespace, allowing to instrument
a process inside a container.
Device is read from /proc/self/ns/pid, as in the future it's possible that
different pid_ns files may belong to different devices, according
to the discussion between Eric Biederman and Yonghong in 2017 linux plumbers
conference.
Currently bpf_get_current_pid_tgid(), is used to do pid filtering in bcc's
scripts but this helper returns the pid as seen by the root namespace which is
fine when a bcc script is not executed inside a container.
When the process of interest is inside a container, pid filtering will not work
if bpf_get_current_pid_tgid() is used. This helper addresses this limitation
returning the pid as it's seen by the current namespace where the script is
executing.

This helper has the same use cases as bpf_get_current_pid_tgid() as it can be
used to do pid filtering even inside a container.

For example a bcc script using bpf_get_current_pid_tgid() (tools/funccount.py):

        u32 pid = bpf_get_current_pid_tgid() >> 32;
        if (pid != <pid_arg_passed_in>)
                return 0;
Could be modified to use bpf_get_current_pidns_info() as follows:

        struct bpf_pidns pidns;
        bpf_get_current_pidns_info(&pidns, sizeof(struct bpf_pidns));
        u32 pid = pidns.tgid;
        u32 nsid = pidns.nsid;
        if ((pid != <pid_arg_passed_in>) && (nsid != <nsid_arg_passed_in>))
                return 0;

To find out the name PID namespace id of a process, you could use this command:

$ ps -h -o pidns -p <pid_of_interest>

Or this other command:

$ ls -Li /proc/<pid_of_interest>/ns/pid

Changes from v9 :
Removed samples/bpf in favor of tools/testing/selftests/bpf
Fixed bug when bpf helper is called in an interrupt context.
Code style fixes.
Added more comments on bpf helper struct member.

Signed-off-by: Carlos Neira <cneirabustos@gmail.com>

Carlos Neira (4):
  fs/namei.c: make available filename_lookup() for bpf helpers.
  bpf: new helper to obtain namespace data from  current task New bpf
    helper bpf_get_current_pidns_info.
  tools: Added bpf_get_current_pidns_info  helper.
  tools/testing/selftests/bpf: Add self-tests  for helper
    bpf_get_pidns_info.

 fs/internal.h                                      |   2 -
 fs/namei.c                                         |   1 -
 include/linux/bpf.h                                |   1 +
 include/linux/namei.h                              |   4 +
 include/uapi/linux/bpf.h                           |  35 ++++-
 kernel/bpf/core.c                                  |   1 +
 kernel/bpf/helpers.c                               |  86 ++++++++++++
 kernel/trace/bpf_trace.c                           |   2 +
 tools/include/uapi/linux/bpf.h                     |  35 ++++-
 tools/testing/selftests/bpf/Makefile               |   2 +-
 tools/testing/selftests/bpf/bpf_helpers.h          |   3 +
 .../testing/selftests/bpf/progs/test_pidns_kern.c  |  52 ++++++++
 .../selftests/bpf/progs/test_pidns_nmi_kern.c      |  52 ++++++++
 tools/testing/selftests/bpf/test_pidns.c           | 146 +++++++++++++++++++++
 tools/testing/selftests/bpf/test_pidns_nmi.c       | 139 ++++++++++++++++++++
 15 files changed, 555 insertions(+), 6 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/progs/test_pidns_kern.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_pidns_nmi_kern.c
 create mode 100644 tools/testing/selftests/bpf/test_pidns.c
 create mode 100644 tools/testing/selftests/bpf/test_pidns_nmi.c

-- 
2.11.0


^ permalink raw reply

* [PATCH bpf-next v10 1/4] fs/namei.c: make available filename_lookup() for bpf helpers.
From: Carlos Neira @ 2019-09-06 15:09 UTC (permalink / raw)
  To: netdev; +Cc: yhs, ebiederm, brouer, cneirabustos, bpf
In-Reply-To: <20190906150952.23066-1-cneirabustos@gmail.com>

Signed-off-by: Carlos Neira <cneirabustos@gmail.com>
---
 fs/internal.h         | 2 --
 fs/namei.c            | 1 -
 include/linux/namei.h | 4 ++++
 3 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/fs/internal.h b/fs/internal.h
index 315fcd8d237c..6647e15dd419 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -59,8 +59,6 @@ extern int finish_clean_context(struct fs_context *fc);
 /*
  * namei.c
  */
-extern int filename_lookup(int dfd, struct filename *name, unsigned flags,
-			   struct path *path, struct path *root);
 extern int user_path_mountpoint_at(int, const char __user *, unsigned int, struct path *);
 extern int vfs_path_lookup(struct dentry *, struct vfsmount *,
 			   const char *, unsigned int, struct path *);
diff --git a/fs/namei.c b/fs/namei.c
index 209c51a5226c..a89fc72a4a10 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -19,7 +19,6 @@
 #include <linux/export.h>
 #include <linux/kernel.h>
 #include <linux/slab.h>
-#include <linux/fs.h>
 #include <linux/namei.h>
 #include <linux/pagemap.h>
 #include <linux/fsnotify.h>
diff --git a/include/linux/namei.h b/include/linux/namei.h
index 9138b4471dbf..b45c8b6f7cb4 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -6,6 +6,7 @@
 #include <linux/path.h>
 #include <linux/fcntl.h>
 #include <linux/errno.h>
+#include <linux/fs.h>
 
 enum { MAX_NESTED_LINKS = 8 };
 
@@ -97,6 +98,9 @@ extern void unlock_rename(struct dentry *, struct dentry *);
 
 extern void nd_jump_link(struct path *path);
 
+extern int filename_lookup(int dfd, struct filename *name, unsigned flags,
+			   struct path *path, struct path *root);
+
 static inline void nd_terminate_link(void *name, size_t len, size_t maxlen)
 {
 	((char *) name)[min(len, maxlen)] = '\0';
-- 
2.11.0


^ permalink raw reply related

* [PATCH bpf-next v10 2/4] bpf: new helper to obtain namespace data from  current task New bpf helper bpf_get_current_pidns_info.
From: Carlos Neira @ 2019-09-06 15:09 UTC (permalink / raw)
  To: netdev; +Cc: yhs, ebiederm, brouer, cneirabustos, bpf
In-Reply-To: <20190906150952.23066-1-cneirabustos@gmail.com>

This helper(bpf_get_current_pidns_info) obtains the active namespace from
current and returns pid, tgid, device and namespace id as seen from that
namespace, allowing to instrument a process inside a container.

Signed-off-by: Carlos Neira <cneirabustos@gmail.com>
---
 include/linux/bpf.h      |  1 +
 include/uapi/linux/bpf.h | 35 +++++++++++++++++++-
 kernel/bpf/core.c        |  1 +
 kernel/bpf/helpers.c     | 86 ++++++++++++++++++++++++++++++++++++++++++++++++
 kernel/trace/bpf_trace.c |  2 ++
 5 files changed, 124 insertions(+), 1 deletion(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 5b9d22338606..819cb1c84be0 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1055,6 +1055,7 @@ extern const struct bpf_func_proto bpf_get_local_storage_proto;
 extern const struct bpf_func_proto bpf_strtol_proto;
 extern const struct bpf_func_proto bpf_strtoul_proto;
 extern const struct bpf_func_proto bpf_tcp_sock_proto;
+extern const struct bpf_func_proto bpf_get_current_pidns_info_proto;
 
 /* Shared helpers among cBPF and eBPF. */
 void bpf_user_rnd_init_once(void);
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index b5889257cc33..3ec9aa1438b7 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2747,6 +2747,32 @@ union bpf_attr {
  *		**-EOPNOTSUPP** kernel configuration does not enable SYN cookies
  *
  *		**-EPROTONOSUPPORT** IP packet version is not 4 or 6
+ *
+ * int bpf_get_current_pidns_info(struct bpf_pidns_info *pidns, u32 size_of_pidns)
+ *	Description
+ *		Get tgid, pid and namespace id as seen by the current namespace,
+ *		and device major/minor numbers from /proc/self/ns/pid. Such
+ *		information is stored in *pidns* of size *size*.
+ *
+ *		This helper is used when pid filtering is needed inside a
+ *		container as bpf_get_current_tgid() helper always returns the
+ *		pid id as seen by the root namespace.
+ *	Return
+ *		0 on success
+ *
+ *		On failure, the returned value is one of the following:
+ *
+ *		**-EINVAL** if *size_of_pidns* is not valid or unable to get ns, pid
+ *		or tgid of the current task.
+ *
+ *		**-ENOENT** if /proc/self/ns/pid does not exists.
+ *
+ *		**-ENOENT** if /proc/self/ns does not exists.
+ *
+ *		**-ENOMEM** if helper internal allocation fails.
+ *
+ *		**-EPERM** if not able to call helper.
+ *
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -2859,7 +2885,8 @@ union bpf_attr {
 	FN(sk_storage_get),		\
 	FN(sk_storage_delete),		\
 	FN(send_signal),		\
-	FN(tcp_gen_syncookie),
+	FN(tcp_gen_syncookie),		\
+	FN(get_current_pidns_info),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
@@ -3610,4 +3637,10 @@ struct bpf_sockopt {
 	__s32	retval;
 };
 
+struct bpf_pidns_info {
+	__u32 dev;	/* dev_t from /proc/self/ns/pid inode */
+	__u32 nsid;
+	__u32 tgid;
+	__u32 pid;
+};
 #endif /* _UAPI__LINUX_BPF_H__ */
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 8191a7db2777..3159f2a0188c 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -2038,6 +2038,7 @@ const struct bpf_func_proto bpf_get_current_uid_gid_proto __weak;
 const struct bpf_func_proto bpf_get_current_comm_proto __weak;
 const struct bpf_func_proto bpf_get_current_cgroup_id_proto __weak;
 const struct bpf_func_proto bpf_get_local_storage_proto __weak;
+const struct bpf_func_proto bpf_get_current_pidns_info __weak;
 
 const struct bpf_func_proto * __weak bpf_get_trace_printk_proto(void)
 {
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 5e28718928ca..8dbe6347893c 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -11,6 +11,11 @@
 #include <linux/uidgid.h>
 #include <linux/filter.h>
 #include <linux/ctype.h>
+#include <linux/pid_namespace.h>
+#include <linux/kdev_t.h>
+#include <linux/stat.h>
+#include <linux/namei.h>
+#include <linux/version.h>
 
 #include "../../lib/kstrtox.h"
 
@@ -312,6 +317,87 @@ void copy_map_value_locked(struct bpf_map *map, void *dst, void *src,
 	preempt_enable();
 }
 
+BPF_CALL_2(bpf_get_current_pidns_info, struct bpf_pidns_info *, pidns_info, u32,
+	 size)
+{
+	const char *pidns_path = "/proc/self/ns/pid";
+	struct pid_namespace *pidns = NULL;
+	struct filename *fname = NULL;
+	struct inode *inode;
+	struct path kp;
+	pid_t tgid = 0;
+	pid_t pid = 0;
+	int ret = -EINVAL;
+	int len;
+
+	if (unlikely(in_interrupt() ||
+			current->flags & (PF_KTHREAD | PF_EXITING)))
+		return -EPERM;
+
+	if (unlikely(size != sizeof(struct bpf_pidns_info)))
+		return -EINVAL;
+
+	pidns = task_active_pid_ns(current);
+	if (unlikely(!pidns))
+		return -ENOENT;
+
+	pidns_info->nsid =  pidns->ns.inum;
+	pid = task_pid_nr_ns(current, pidns);
+	if (unlikely(!pid))
+		goto clear;
+
+	tgid = task_tgid_nr_ns(current, pidns);
+	if (unlikely(!tgid))
+		goto clear;
+
+	pidns_info->tgid = (u32) tgid;
+	pidns_info->pid = (u32) pid;
+
+	fname = kmem_cache_alloc(names_cachep, GFP_ATOMIC);
+	if (unlikely(!fname)) {
+		ret = -ENOMEM;
+		goto clear;
+	}
+	const size_t fnamesize = offsetof(struct filename, iname[1]);
+	struct filename *tmp;
+
+	tmp = kmalloc(fnamesize, GFP_ATOMIC);
+	if (unlikely(!tmp)) {
+		__putname(fname);
+		ret = -ENOMEM;
+		goto clear;
+	}
+
+	tmp->name = (char *)fname;
+	fname = tmp;
+	len = strlen(pidns_path) + 1;
+	memcpy((char *)fname->name, pidns_path, len);
+	fname->uptr = NULL;
+	fname->aname = NULL;
+	fname->refcnt = 1;
+
+	ret = filename_lookup(AT_FDCWD, fname, 0, &kp, NULL);
+	if (ret)
+		goto clear;
+
+	inode = d_backing_inode(kp.dentry);
+	pidns_info->dev = (u32)inode->i_rdev;
+
+	return 0;
+
+clear:
+	memset((void *)pidns_info, 0, (size_t) size);
+	return ret;
+}
+
+const struct bpf_func_proto bpf_get_current_pidns_info_proto = {
+	.func		= bpf_get_current_pidns_info,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_UNINIT_MEM,
+	.arg2_type	= ARG_CONST_SIZE,
+};
+
 #ifdef CONFIG_CGROUPS
 BPF_CALL_0(bpf_get_current_cgroup_id)
 {
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index ca1255d14576..5e1dc22765a5 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -709,6 +709,8 @@ tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 #endif
 	case BPF_FUNC_send_signal:
 		return &bpf_send_signal_proto;
+	case BPF_FUNC_get_current_pidns_info:
+		return &bpf_get_current_pidns_info_proto;
 	default:
 		return NULL;
 	}
-- 
2.11.0


^ permalink raw reply related

* [PATCH bpf-next v10 3/4] tools: Added bpf_get_current_pidns_info  helper.
From: Carlos Neira @ 2019-09-06 15:09 UTC (permalink / raw)
  To: netdev; +Cc: yhs, ebiederm, brouer, cneirabustos, bpf
In-Reply-To: <20190906150952.23066-1-cneirabustos@gmail.com>

Signed-off-by: Carlos Neira <cneirabustos@gmail.com>
---
 tools/include/uapi/linux/bpf.h | 35 ++++++++++++++++++++++++++++++++++-
 1 file changed, 34 insertions(+), 1 deletion(-)

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index b5889257cc33..3ec9aa1438b7 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -2747,6 +2747,32 @@ union bpf_attr {
  *		**-EOPNOTSUPP** kernel configuration does not enable SYN cookies
  *
  *		**-EPROTONOSUPPORT** IP packet version is not 4 or 6
+ *
+ * int bpf_get_current_pidns_info(struct bpf_pidns_info *pidns, u32 size_of_pidns)
+ *	Description
+ *		Get tgid, pid and namespace id as seen by the current namespace,
+ *		and device major/minor numbers from /proc/self/ns/pid. Such
+ *		information is stored in *pidns* of size *size*.
+ *
+ *		This helper is used when pid filtering is needed inside a
+ *		container as bpf_get_current_tgid() helper always returns the
+ *		pid id as seen by the root namespace.
+ *	Return
+ *		0 on success
+ *
+ *		On failure, the returned value is one of the following:
+ *
+ *		**-EINVAL** if *size_of_pidns* is not valid or unable to get ns, pid
+ *		or tgid of the current task.
+ *
+ *		**-ENOENT** if /proc/self/ns/pid does not exists.
+ *
+ *		**-ENOENT** if /proc/self/ns does not exists.
+ *
+ *		**-ENOMEM** if helper internal allocation fails.
+ *
+ *		**-EPERM** if not able to call helper.
+ *
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -2859,7 +2885,8 @@ union bpf_attr {
 	FN(sk_storage_get),		\
 	FN(sk_storage_delete),		\
 	FN(send_signal),		\
-	FN(tcp_gen_syncookie),
+	FN(tcp_gen_syncookie),		\
+	FN(get_current_pidns_info),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
@@ -3610,4 +3637,10 @@ struct bpf_sockopt {
 	__s32	retval;
 };
 
+struct bpf_pidns_info {
+	__u32 dev;	/* dev_t from /proc/self/ns/pid inode */
+	__u32 nsid;
+	__u32 tgid;
+	__u32 pid;
+};
 #endif /* _UAPI__LINUX_BPF_H__ */
-- 
2.11.0


^ permalink raw reply related

* [PATCH bpf-next v10 4/4] tools/testing/selftests/bpf: Add self-tests  for helper bpf_get_pidns_info.
From: Carlos Neira @ 2019-09-06 15:09 UTC (permalink / raw)
  To: netdev; +Cc: yhs, ebiederm, brouer, cneirabustos, bpf
In-Reply-To: <20190906150952.23066-1-cneirabustos@gmail.com>

Added 2 selftest:

bpf_get_current_pidns_info helper is called in an interrupt
context and also in a non interrupt context.

Signed-off-by: Carlos Neira <cneirabustos@gmail.com>
---
 tools/testing/selftests/bpf/Makefile               |   2 +-
 tools/testing/selftests/bpf/bpf_helpers.h          |   3 +
 .../testing/selftests/bpf/progs/test_pidns_kern.c  |  52 ++++++++
 .../selftests/bpf/progs/test_pidns_nmi_kern.c      |  52 ++++++++
 tools/testing/selftests/bpf/test_pidns.c           | 146 +++++++++++++++++++++
 tools/testing/selftests/bpf/test_pidns_nmi.c       | 139 ++++++++++++++++++++
 6 files changed, 393 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/bpf/progs/test_pidns_kern.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_pidns_nmi_kern.c
 create mode 100644 tools/testing/selftests/bpf/test_pidns.c
 create mode 100644 tools/testing/selftests/bpf/test_pidns_nmi.c

diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index 1faad0c3c3c9..8507c89141f5 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -29,7 +29,7 @@ TEST_GEN_PROGS = test_verifier test_tag test_maps test_lru_map test_lpm_map test
 	test_cgroup_storage test_select_reuseport test_section_names \
 	test_netcnt test_tcpnotify_user test_sock_fields test_sysctl test_hashmap \
 	test_btf_dump test_cgroup_attach xdping test_sockopt test_sockopt_sk \
-	test_sockopt_multi test_sockopt_inherit test_tcp_rtt
+	test_sockopt_multi test_sockopt_inherit test_tcp_rtt test_pidns test_pidns_nmi
 
 BPF_OBJ_FILES = $(patsubst %.c,%.o, $(notdir $(wildcard progs/*.c)))
 TEST_GEN_FILES = $(BPF_OBJ_FILES)
diff --git a/tools/testing/selftests/bpf/bpf_helpers.h b/tools/testing/selftests/bpf/bpf_helpers.h
index 6c4930bc6e2e..30a70ebe583a 100644
--- a/tools/testing/selftests/bpf/bpf_helpers.h
+++ b/tools/testing/selftests/bpf/bpf_helpers.h
@@ -313,6 +313,9 @@ static unsigned int (*bpf_set_hash)(void *ctx, __u32 hash) =
 static int (*bpf_skb_adjust_room)(void *ctx, __s32 len_diff, __u32 mode,
 				  unsigned long long flags) =
 	(void *) BPF_FUNC_skb_adjust_room;
+static int (*bpf_get_current_pidns_info)(struct bpf_pidns_info *buf,
+					 unsigned int buf_size) =
+	(void *) BPF_FUNC_get_current_pidns_info;
 
 /* Scan the ARCH passed in from ARCH env variable (see Makefile) */
 #if defined(__TARGET_ARCH_x86)
diff --git a/tools/testing/selftests/bpf/progs/test_pidns_kern.c b/tools/testing/selftests/bpf/progs/test_pidns_kern.c
new file mode 100644
index 000000000000..a4c0bde1608b
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_pidns_kern.c
@@ -0,0 +1,52 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2018 Carlos Neira cneirabustos@gmail.com
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ */
+
+#include <linux/bpf.h>
+#include <errno.h>
+#include "bpf_helpers.h"
+
+struct {
+	__uint(type, BPF_MAP_TYPE_ARRAY);
+	__uint(max_entries, 1);
+	__type(key, __u32);
+	__type(value, struct bpf_pidns_info);
+} nsidmap SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_ARRAY);
+	__uint(max_entries, 1);
+	__type(key, __u32);
+	__type(value, __u32);
+} pidmap SEC(".maps");
+
+SEC("tracepoint/syscalls/sys_enter_nanosleep")
+int trace(void *ctx)
+{
+	struct bpf_pidns_info nsinfo;
+	__u32 key = 0, *expected_pid;
+	struct bpf_pidns_info  *val;
+
+	if (bpf_get_current_pidns_info(&nsinfo, sizeof(nsinfo)))
+		return -EINVAL;
+
+	expected_pid = bpf_map_lookup_elem(&pidmap, &key);
+	__u64 tgid = bpf_get_current_pid_tgid();
+
+	if (!expected_pid || *expected_pid != nsinfo.pid ||
+			nsinfo.tgid != (__u32)tgid)
+		return 0;
+
+	val = bpf_map_lookup_elem(&nsidmap, &key);
+	if (val)
+		*val = nsinfo;
+
+	return 0;
+}
+
+char _license[] SEC("license") = "GPL";
+__u32 _version SEC("version") = 1;
diff --git a/tools/testing/selftests/bpf/progs/test_pidns_nmi_kern.c b/tools/testing/selftests/bpf/progs/test_pidns_nmi_kern.c
new file mode 100644
index 000000000000..7f02e4e29021
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_pidns_nmi_kern.c
@@ -0,0 +1,52 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2018 Carlos Neira cneirabustos@gmail.com
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ */
+
+#include <linux/bpf.h>
+#include <errno.h>
+#include "bpf_helpers.h"
+
+struct {
+	__uint(type, BPF_MAP_TYPE_ARRAY);
+	__uint(max_entries, 1);
+	__type(key, __u32);
+	__type(value, struct bpf_pidns_info);
+} nsidmap SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_ARRAY);
+	__uint(max_entries, 1);
+	__type(key, __u32);
+	__type(value, __u32);
+} pidmap SEC(".maps");
+
+SEC("tracepoint/net/netif_receive_skb")
+int trace(void *ctx)
+{
+	struct bpf_pidns_info nsinfo;
+	__u32 key = 0, *expected_pid;
+	struct bpf_pidns_info  *val;
+
+	if (bpf_get_current_pidns_info(&nsinfo, sizeof(nsinfo)))
+		return -EINVAL;
+
+	expected_pid = bpf_map_lookup_elem(&pidmap, &key);
+	__u64 tgid = bpf_get_current_pid_tgid();
+
+	if (!expected_pid || *expected_pid != nsinfo.pid ||
+			nsinfo.tgid != (__u32)tgid)
+		return 0;
+
+	val = bpf_map_lookup_elem(&nsidmap, &key);
+	if (val)
+		*val = nsinfo;
+
+	return 0;
+}
+
+char _license[] SEC("license") = "GPL";
+__u32 _version SEC("version") = 1;
diff --git a/tools/testing/selftests/bpf/test_pidns.c b/tools/testing/selftests/bpf/test_pidns.c
new file mode 100644
index 000000000000..0c579305da53
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_pidns.c
@@ -0,0 +1,146 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2018 Carlos Neira cneirabustos@gmail.com
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <errno.h>
+#include <fcntl.h>
+#include <syscall.h>
+#include <unistd.h>
+#include <linux/perf_event.h>
+#include <sys/ioctl.h>
+#include <sys/time.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+
+#include <linux/bpf.h>
+#include <bpf/bpf.h>
+#include <bpf/libbpf.h>
+
+#include "cgroup_helpers.h"
+#include "bpf_rlimit.h"
+
+#define CHECK(condition, tag, format...) ({		\
+	int __ret = !!(condition);			\
+	if (__ret) {					\
+		printf("%s:FAIL:%s ", __func__, tag);	\
+		printf(format);				\
+	} else {					\
+		printf("%s:PASS:%s\n", __func__, tag);	\
+	}						\
+	__ret;						\
+})
+
+static int bpf_find_map(const char *test, struct bpf_object *obj,
+			const char *name)
+{
+	struct bpf_map *map;
+
+	map = bpf_object__find_map_by_name(obj, name);
+	if (!map)
+		return -1;
+	return bpf_map__fd(map);
+}
+
+
+int main(int argc, char **argv)
+{
+	const char *probe_name = "syscalls/sys_enter_nanosleep";
+	const char *file = "test_pidns_kern.o";
+	int err, bytes, efd, prog_fd, pmu_fd;
+	struct bpf_pidns_info knsid = {};
+	int pidmap_fd, nsidmap_fd;
+	struct perf_event_attr attr = {};
+	struct bpf_object *obj;
+	__u32 key = 0, pid;
+	int exit_code = 1;
+	struct stat st;
+	char buf[256];
+
+	err = bpf_prog_load(file, BPF_PROG_TYPE_TRACEPOINT, &obj, &prog_fd);
+	if (CHECK(err, "bpf_prog_load", "err %d errno %d\n", err, errno))
+		goto cleanup_cgroup_env;
+
+	nsidmap_fd = bpf_find_map(__func__, obj, "nsidmap");
+	if (CHECK(nsidmap_fd < 0, "bpf_find_map", "err %d errno %d\n",
+		  nsidmap_fd, errno))
+		goto close_prog;
+
+	pidmap_fd = bpf_find_map(__func__, obj, "pidmap");
+	if (CHECK(pidmap_fd < 0, "bpf_find_map", "err %d errno %d\n",
+		  pidmap_fd, errno))
+		goto close_prog;
+
+	pid = getpid();
+	bpf_map_update_elem(pidmap_fd, &key, &pid, 0);
+
+	snprintf(buf, sizeof(buf),
+		 "/sys/kernel/debug/tracing/events/%s/id", probe_name);
+	efd = open(buf, O_RDONLY, 0);
+	if (CHECK(efd < 0, "open", "err %d errno %d\n", efd, errno))
+		goto close_prog;
+	bytes = read(efd, buf, sizeof(buf));
+	close(efd);
+	if (CHECK(bytes <= 0 || bytes >= sizeof(buf), "read",
+		  "bytes %d errno %d\n", bytes, errno))
+		goto close_prog;
+
+	attr.config = strtol(buf, NULL, 0);
+	attr.type = PERF_TYPE_TRACEPOINT;
+	attr.sample_type = PERF_SAMPLE_RAW;
+	attr.sample_period = 1;
+	attr.wakeup_events = 1;
+
+	pmu_fd = syscall(__NR_perf_event_open, &attr, getpid(), -1, -1, 0);
+	if (CHECK(pmu_fd < 0, "perf_event_open", "err %d errno %d\n", pmu_fd,
+		  errno))
+		goto close_prog;
+
+	err = ioctl(pmu_fd, PERF_EVENT_IOC_ENABLE, 0);
+	if (CHECK(err, "perf_event_ioc_enable", "err %d errno %d\n", err,
+		  errno))
+		goto close_pmu;
+
+	err = ioctl(pmu_fd, PERF_EVENT_IOC_SET_BPF, prog_fd);
+	if (CHECK(err, "perf_event_ioc_set_bpf", "err %d errno %d\n", err,
+		  errno))
+		goto close_pmu;
+
+	/* trigger some syscalls */
+	sleep(1);
+
+	err = bpf_map_lookup_elem(nsidmap_fd, &key, &knsid);
+	if (CHECK(err, "bpf_map_lookup_elem", "err %d errno %d\n", err, errno))
+		goto close_pmu;
+
+	if (stat("/proc/self/ns/pid", &st))
+		goto close_pmu;
+
+	if (CHECK(knsid.nsid != (__u32) st.st_ino, "namespace id",
+		  "bpf %u user %u\n", knsid.nsid, (__u32) st.st_ino))
+		goto close_pmu;
+
+	if (CHECK(major(knsid.dev) != major(st.st_rdev), "dev major",
+		  "bpf %u user %u\n", major(knsid.dev), major(st.st_rdev)))
+		goto close_pmu;
+
+	if (CHECK(minor(knsid.dev) != minor(st.st_rdev), "dev minor",
+		  "bpf %u user %u\n", minor(knsid.dev), minor(st.st_rdev)))
+		goto close_pmu;
+
+	exit_code = 0;
+	printf("%s:PASS\n", argv[0]);
+
+close_pmu:
+	close(pmu_fd);
+close_prog:
+	bpf_object__close(obj);
+cleanup_cgroup_env:
+	return exit_code;
+}
diff --git a/tools/testing/selftests/bpf/test_pidns_nmi.c b/tools/testing/selftests/bpf/test_pidns_nmi.c
new file mode 100644
index 000000000000..bb8075bbe7d8
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_pidns_nmi.c
@@ -0,0 +1,139 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2018 Carlos Neira cneirabustos@gmail.com
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <errno.h>
+#include <fcntl.h>
+#include <syscall.h>
+#include <unistd.h>
+#include <linux/perf_event.h>
+#include <sys/ioctl.h>
+#include <sys/time.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+
+#include <linux/bpf.h>
+#include <bpf/bpf.h>
+#include <bpf/libbpf.h>
+
+#include "cgroup_helpers.h"
+#include "bpf_rlimit.h"
+
+#define CHECK(condition, tag, format...) ({		\
+	int __ret = !!(condition);			\
+	if (__ret) {					\
+		printf("%s:FAIL:%s ", __func__, tag);	\
+		printf(format);				\
+	} else {					\
+		printf("%s:PASS:%s\n", __func__, tag);	\
+	}						\
+	__ret;						\
+})
+
+static int bpf_find_map(const char *test, struct bpf_object *obj,
+			const char *name)
+{
+	struct bpf_map *map;
+
+	map = bpf_object__find_map_by_name(obj, name);
+	if (!map)
+		return -1;
+	return bpf_map__fd(map);
+}
+
+
+int main(int argc, char **argv)
+{
+	const char *probe_name = "net/netif_receive_skb";
+	const char *file = "test_pidns_kern.o";
+	int err, bytes, efd, prog_fd, pmu_fd;
+	struct bpf_pidns_info knsid = {};
+	int pidmap_fd, nsidmap_fd;
+	struct perf_event_attr attr = {};
+	struct bpf_object *obj;
+	__u32 key = 0, pid;
+	int exit_code = 1;
+	struct stat st;
+	char buf[256];
+
+	err = bpf_prog_load(file, BPF_PROG_TYPE_TRACEPOINT, &obj, &prog_fd);
+	if (CHECK(err, "bpf_prog_load", "err %d errno %d\n", err, errno))
+		goto cleanup_cgroup_env;
+
+	nsidmap_fd = bpf_find_map(__func__, obj, "nsidmap");
+	if (CHECK(nsidmap_fd < 0, "bpf_find_map", "err %d errno %d\n",
+		  nsidmap_fd, errno))
+		goto close_prog;
+
+	pidmap_fd = bpf_find_map(__func__, obj, "pidmap");
+	if (CHECK(pidmap_fd < 0, "bpf_find_map", "err %d errno %d\n",
+		  pidmap_fd, errno))
+		goto close_prog;
+
+	pid = getpid();
+	bpf_map_update_elem(pidmap_fd, &key, &pid, 0);
+
+	snprintf(buf, sizeof(buf),
+		 "/sys/kernel/debug/tracing/events/%s/id", probe_name);
+	efd = open(buf, O_RDONLY, 0);
+	if (CHECK(efd < 0, "open", "err %d errno %d\n", efd, errno))
+		goto close_prog;
+	bytes = read(efd, buf, sizeof(buf));
+	close(efd);
+	if (CHECK(bytes <= 0 || bytes >= sizeof(buf), "read",
+		  "bytes %d errno %d\n", bytes, errno))
+		goto close_prog;
+
+	attr.config = strtol(buf, NULL, 0);
+	attr.type = PERF_TYPE_TRACEPOINT;
+	attr.sample_type = PERF_SAMPLE_RAW;
+	attr.sample_period = 1;
+	attr.wakeup_events = 1;
+
+	pmu_fd = syscall(__NR_perf_event_open, &attr, getpid(), -1, -1, 0);
+	if (CHECK(pmu_fd < 0, "perf_event_open", "err %d errno %d\n", pmu_fd,
+		  errno))
+		goto close_prog;
+
+	err = ioctl(pmu_fd, PERF_EVENT_IOC_ENABLE, 0);
+	if (CHECK(err, "perf_event_ioc_enable", "err %d errno %d\n", err,
+		  errno))
+		goto close_pmu;
+
+	err = ioctl(pmu_fd, PERF_EVENT_IOC_SET_BPF, prog_fd);
+	if (CHECK(err, "perf_event_ioc_set_bpf", "err %d errno %d\n", err,
+		  errno))
+		goto close_pmu;
+
+	/* trigger some syscalls */
+	sleep(1);
+
+	err = bpf_map_lookup_elem(nsidmap_fd, &key, &knsid);
+	if (CHECK(err, "bpf_map_lookup_elem", "err %d errno %d\n", err, errno))
+		goto close_pmu;
+
+	if (stat("/proc/self/ns/pid", &st))
+		goto close_pmu;
+
+	if (CHECK(knsid.nsid != (__u32) st.st_ino, "namespace_id",
+				"Called in interrupt context bpf %u user %u\n",
+				knsid.nsid, (__u32) st.st_ino))
+		goto close_pmu;
+
+	exit_code = 0;
+	printf("%s:PASS\n", argv[0]);
+
+close_pmu:
+	close(pmu_fd);
+close_prog:
+	bpf_object__close(obj);
+cleanup_cgroup_env:
+	return exit_code;
+}
-- 
2.11.0


^ permalink raw reply related

* [PATCH] net/mlx5: reduce stack usage in FW tracer
From: Arnd Bergmann @ 2019-09-06 15:11 UTC (permalink / raw)
  To: Saeed Mahameed, Leon Romanovsky, David S. Miller
  Cc: Arnd Bergmann, Feras Daoud, Erez Shitrit, Moshe Shemesh,
	Eran Ben Elisha, Qian Cai, netdev, linux-rdma, linux-kernel

It's generally not ok to put a 512 byte buffer on the stack, as kernel
stack is a scarce resource:

drivers/net/ethernet/mellanox/mlx5/core/diag/fw_tracer.c:660:13: error: stack frame size of 1032 bytes in function 'mlx5_fw_tracer_handle_traces' [-Werror,-Wframe-larger-than=]

This is done in a context that is allowed to sleep, so using
dynamic allocation is ok as well. I'm not too worried about
runtime overhead, as this already contains an snprintf() and
other expensive functions.

Fixes: 70dd6fdb8987 ("net/mlx5: FW tracer, parse traces and kernel tracing support")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
---
 .../mellanox/mlx5/core/diag/fw_tracer.c       | 21 ++++++++++---------
 1 file changed, 11 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/diag/fw_tracer.c b/drivers/net/ethernet/mellanox/mlx5/core/diag/fw_tracer.c
index 2011eaf15cc5..d81e78060f9f 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/diag/fw_tracer.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/diag/fw_tracer.c
@@ -557,16 +557,16 @@ static void mlx5_tracer_print_trace(struct tracer_string_format *str_frmt,
 				    struct mlx5_core_dev *dev,
 				    u64 trace_timestamp)
 {
-	char	tmp[512];
-
-	snprintf(tmp, sizeof(tmp), str_frmt->string,
-		 str_frmt->params[0],
-		 str_frmt->params[1],
-		 str_frmt->params[2],
-		 str_frmt->params[3],
-		 str_frmt->params[4],
-		 str_frmt->params[5],
-		 str_frmt->params[6]);
+	char *tmp = kasprintf(GFP_KERNEL, str_frmt->string,
+			      str_frmt->params[0],
+			      str_frmt->params[1],
+			      str_frmt->params[2],
+			      str_frmt->params[3],
+			      str_frmt->params[4],
+			      str_frmt->params[5],
+			      str_frmt->params[6]);
+	if (!tmp)
+		return;
 
 	trace_mlx5_fw(dev->tracer, trace_timestamp, str_frmt->lost,
 		      str_frmt->event_id, tmp);
@@ -576,6 +576,7 @@ static void mlx5_tracer_print_trace(struct tracer_string_format *str_frmt,
 
 	/* remove it from hash */
 	mlx5_tracer_clean_message(str_frmt);
+	kfree(tmp);
 }
 
 static int mlx5_tracer_handle_string_trace(struct mlx5_fw_tracer *tracer,
-- 
2.20.0


^ permalink raw reply related

* [PATCH net-next] netfilter: nf_tables: avoid excessive stack usage
From: Arnd Bergmann @ 2019-09-06 15:12 UTC (permalink / raw)
  To: Pablo Neira Ayuso, Jozsef Kadlecsik, Florian Westphal,
	David S. Miller
  Cc: Arnd Bergmann, Jakub Kicinski, wenxu, netfilter-devel, coreteam,
	netdev, linux-kernel

The nft_offload_ctx structure is much too large to put on the
stack:

net/netfilter/nf_tables_offload.c:31:23: error: stack frame size of 1200 bytes in function 'nft_flow_rule_create' [-Werror,-Wframe-larger-than=]

Use dynamic allocation here, as we do elsewhere in the same
function.

Fixes: c9626a2cbdb2 ("netfilter: nf_tables: add hardware offload support")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
---
Since we only really care about two members of the structure, an
alternative would be a larger rewrite, but that is probably too
late for v5.4.
---
 net/netfilter/nf_tables_offload.c | 22 +++++++++++++++-------
 1 file changed, 15 insertions(+), 7 deletions(-)

diff --git a/net/netfilter/nf_tables_offload.c b/net/netfilter/nf_tables_offload.c
index 3c2725ade61b..c94331aae552 100644
--- a/net/netfilter/nf_tables_offload.c
+++ b/net/netfilter/nf_tables_offload.c
@@ -30,15 +30,13 @@ static struct nft_flow_rule *nft_flow_rule_alloc(int num_actions)
 
 struct nft_flow_rule *nft_flow_rule_create(const struct nft_rule *rule)
 {
-	struct nft_offload_ctx ctx = {
-		.dep	= {
-			.type	= NFT_OFFLOAD_DEP_UNSPEC,
-		},
-	};
+	struct nft_offload_ctx *ctx;
+
 	struct nft_flow_rule *flow;
 	int num_actions = 0, err;
 	struct nft_expr *expr;
 
+
 	expr = nft_expr_first(rule);
 	while (expr->ops && expr != nft_expr_last(rule)) {
 		if (expr->ops->offload_flags & NFT_OFFLOAD_F_ACTION)
@@ -52,21 +50,31 @@ struct nft_flow_rule *nft_flow_rule_create(const struct nft_rule *rule)
 		return ERR_PTR(-ENOMEM);
 
 	expr = nft_expr_first(rule);
+
+	ctx = kzalloc(sizeof(struct nft_offload_ctx), GFP_KERNEL);
+	if (!ctx) {
+		err = -ENOMEM;
+		goto err_out;
+	}
+	ctx->dep.type = NFT_OFFLOAD_DEP_UNSPEC;
+
 	while (expr->ops && expr != nft_expr_last(rule)) {
 		if (!expr->ops->offload) {
 			err = -EOPNOTSUPP;
 			goto err_out;
 		}
-		err = expr->ops->offload(&ctx, flow, expr);
+		err = expr->ops->offload(ctx, flow, expr);
 		if (err < 0)
 			goto err_out;
 
 		expr = nft_expr_next(expr);
 	}
-	flow->proto = ctx.dep.l3num;
+	flow->proto = ctx->dep.l3num;
+	kfree(ctx);
 
 	return flow;
 err_out:
+	kfree(ctx);
 	nft_flow_rule_destroy(flow);
 
 	return ERR_PTR(err);
-- 
2.20.0


^ permalink raw reply related

* Re: [PATCH net-next,v3 0/4] flow_offload: update mangle action representation
From: Edward Cree @ 2019-09-06 15:13 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: netfilter-devel, davem, netdev, jakub.kicinski, jiri, saeedm,
	vishal, vladbu
In-Reply-To: <20190906145019.2bggchaq43tcqdyc@salvia>

On 06/09/2019 15:50, Pablo Neira Ayuso wrote:
> On Fri, Sep 06, 2019 at 02:37:16PM +0100, Edward Cree wrote:
>> On 06/09/2019 14:14, Pablo Neira Ayuso wrote:
>>> OK, I can document this semantics, I need just _time_ to write that
>>> documentation. I was expecting this patch description is enough by now
>>> until I can get to finish that documentation.
>> I think for two structs with apparently the same contents but different
>>  semantics (one has the mask bitwise complemented) it's best to hold up
>>  the code change until the comment is ready to come with it, because
>>  until then it's a dangerously confusing situation.
> The idea is that flow rule API != tc front-end anymore. So the flow
> rule API can evolve regardless the front-end requirements. Before this
> update there was no other choice than using the tc pedit layout since
> it was exposed to the drivers, this is not the case anymore.
I'm not saying that they have to be the same.
I'm saying that they're _almost_ the same, and having things that are
 _almost_ the same but subtly different is a recipe for misunderstandings
 and bugs, and so must must must have very visible comments in the code.

>> So an IPv6 address mangle only comes as a single action if it's from
>>  netfilter, not if it's coming from TC pedit.
> Driver gets one single action from tc/netfilter (and potentially
> ethtool if it would support for this), it comes as a single action
> from both subsystems:
>
>         front-end -----> flow_rule API -----> drivers
>
> Front-end need to translate their representation to this
> FLOW_ACTION_MANGLE layout.
>
> By front-end, I refer to ethtool/netfilter/tc.
In your patch, flow_action_mangle() looks only at the offset and type
 (add vs. set) of each pedit, coalesces them if compatible (so, unless
 I'm misreading the code, it _will_ coalesce adjacent pedits to
 contiguous fields like UDP sport & dport, unlike what you said),
 because it doesn't know whether two contiguous pedits are part of the
 same field or not (it doesn't have any knowledge of 'fields' at all).
And for you to change that, while still coalescing IPv6 pedits, you
 would need flow_action_mangle() to know what fields each pedit is in.
It is not possible for code that doesn't know about fields to both:
 (a) not coalesce edits of contiguous fields, and
 (b) coalesce edits of larger-than-u32 fields

>>  Yes, but we don't add code/features to the kernel based on what we
>>  *could* use it for later
> This is already useful: Look at the cxgb driver in particular, it's
> way easier to read.
Have you tested it?  Again, I might be misreading, but it looks like
 your flow_action_mangle() *will* coalesce sport & dport, which it
 appears will break that cxgb code.

> Other existing drivers do not need to do things like:
>
>         case round_down(offsetof(struct iphdr, tos), 4):
>
> and the play with masks to infer if the user wants to mangle the TOS
> field, they can just do:
>
>         case offsetof(struct iphdr, tos):
>
> which is way more natural representation.
Proper thing to do is have helper functions available to drivers to test
 the pedit, and not just switch on the offset.  Why do I say that?
Well, consider a pedit on UDP dport, with mask 0x00ff (network endian).
Now as a u32 pedit that's 0x000000ff offset 0, so field-blind offset
 calculation (ffs in flow_action_mangle_entry()) will turn that into
 offset 3 mask 0xff.  Now driver does
    switch(offset) { /* 3 */
    case offsetof(struct udphdr, dest): /* 2 */
        /* Whoops, we never get here! */
    }
Do you see the problem?

-Ed

^ permalink raw reply

* Re: [PATCH bpf-next 0/6] selftests/bpf: move sockopt tests under test_progs
From: Stanislav Fomichev @ 2019-09-06 15:18 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Alexei Starovoitov, Stanislav Fomichev, Networking, bpf,
	David S. Miller, Alexei Starovoitov, Daniel Borkmann
In-Reply-To: <CAEf4Bzaoh0Ur6Ze0VLNYqhTJ21Vp6D2NBMkb7yAeseqom=TyKA@mail.gmail.com>

On 09/06, Andrii Nakryiko wrote:
> On Wed, Sep 4, 2019 at 4:03 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Wed, Sep 04, 2019 at 09:25:03AM -0700, Stanislav Fomichev wrote:
> > > Now that test_progs is shaping into more generic test framework,
> > > let's convert sockopt tests to it. This requires adding
> > > a helper to create and join a cgroup first (test__join_cgroup).
> > > Since we already hijack stdout/stderr that shouldn't be
> > > a problem (cgroup helpers log to stderr).
> > >
> > > The rest of the patches just move sockopt tests files under prog_tests/
> > > and do the required small adjustments.
> >
> > Looks good. Thank you for working on it.
> > Could you de-verbose setsockopt test a bit?
> > #23/32 setsockopt: deny write ctx->retval:OK
> > #23/33 setsockopt: deny read ctx->retval:OK
> > #23/34 setsockopt: deny writing to ctx->optval:OK
> > #23/35 setsockopt: deny writing to ctx->optval_end:OK
> > #23/36 setsockopt: allow IP_TOS <= 128:OK
> > #23/37 setsockopt: deny IP_TOS > 128:OK
> > 37 subtests is a bit too much spam.
> 
> If we merged test_btf into test_progs, we'd have >150 subtests, which
> would be pretty verbose as well. But the benefit of subtest is that
> you can run just that sub-test and debug/verify just it, without all
> the rest stuff.
> 
> So I'm wondering, if too many lines of default output is the only
> problem, should we just not output per-subtest line by default?
Ack, we can output per-subtest line if it fails so it's easy to re-run;
otherwise, hiding by default sounds good. I'll prepare a v3 sometime
today; Alexei, let us know if you disagree.

^ permalink raw reply

* Re: [PATCH bpf-next v10 2/4] bpf: new helper to obtain namespace data from  current task New bpf helper bpf_get_current_pidns_info.
From: Al Viro @ 2019-09-06 15:24 UTC (permalink / raw)
  To: Carlos Neira; +Cc: netdev, yhs, ebiederm, brouer, bpf
In-Reply-To: <20190906150952.23066-3-cneirabustos@gmail.com>

On Fri, Sep 06, 2019 at 11:09:50AM -0400, Carlos Neira wrote:

> +BPF_CALL_2(bpf_get_current_pidns_info, struct bpf_pidns_info *, pidns_info, u32,
> +	 size)
> +{
> +	const char *pidns_path = "/proc/self/ns/pid";

> +	fname = kmem_cache_alloc(names_cachep, GFP_ATOMIC);
> +	if (unlikely(!fname)) {
> +		ret = -ENOMEM;
> +		goto clear;
> +	}
> +	const size_t fnamesize = offsetof(struct filename, iname[1]);
> +	struct filename *tmp;
> +
> +	tmp = kmalloc(fnamesize, GFP_ATOMIC);
> +	if (unlikely(!tmp)) {
> +		__putname(fname);
> +		ret = -ENOMEM;
> +		goto clear;
> +	}
> +
> +	tmp->name = (char *)fname;
> +	fname = tmp;
> +	len = strlen(pidns_path) + 1;
> +	memcpy((char *)fname->name, pidns_path, len);
> +	fname->uptr = NULL;
> +	fname->aname = NULL;
> +	fname->refcnt = 1;
> +
> +	ret = filename_lookup(AT_FDCWD, fname, 0, &kp, NULL);
> +	if (ret)
> +		goto clear;

Where do I begin?
	* getname_kernel() is there for purpose
	* so's kern_path(), damnit
> +
> +	inode = d_backing_inode(kp.dentry);
> +	pidns_info->dev = (u32)inode->i_rdev;

	* ... and this is utter bollocks - userland doesn't
have to have procfs mounted anywhere; it doesn't have to
have it mounted on /proc; it can bloody well bind a symlink
to anywhere and anythin on top of /proc/self even if its
has procfs mounted there.

	This is fundamentally wrong; nothing in the kernel
(bpf very much included) has any business assuming anything
about what's mounted where.  And while we are at it,
how deep on kernel stack can that thing be called?
Because pathname resolution can bring all kinds of interesting
crap into the game - consider e.g. NFS4 referral traversal.
And it can occur - see above about the lack of warranties
that your pathwalk will go to procfs and will remain there.

NAKed-by: Al Viro <viro@zeniv.linux.org.uk>

^ permalink raw reply

* Re: [PATCH] net/skbuff: silence warnings under memory pressure
From: Petr Mladek @ 2019-09-06 15:32 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Steven Rostedt, davem, Eric Dumazet, Sergey Senozhatsky,
	Michal Hocko, linux-mm, Qian Cai, linux-kernel, netdev
In-Reply-To: <20190906033900.GB1253@jagdpanzerIV>

On Fri 2019-09-06 12:39:00, Sergey Senozhatsky wrote:
> On (09/05/19 13:23), Steven Rostedt wrote:
> > > I think we can queue significantly much less irq_work-s from printk().
> > > 
> > > Petr, Steven, what do you think?
> 
> [..]
> > I mean, really, do we need to keep calling wake up if it
> > probably never even executed?
> 
> I guess ratelimiting you are talking about ("if it probably never even
> executed") would be to check if we have already called wake up on the
> log_wait ->head. For that we need to, at least, take log_wait spin_lock
> and check that ->head is still in TASK_INTERRUPTIBLE; which is (quite,
> but not exactly) close to what wake_up_interruptible() does - it doesn't
> wake up the same task twice, it bails out on `p->state & state' check.

I have just realized that only sleeping tasks are in the waitqueue.
It is already handled by waitqueue_active() check.

I am afraid that we could not ratelimit the wakeups. The userspace
loggers might then miss the last lines for a long.

We could move wake_up_klogd() back to console_unlock(). But it might
end up with a back-and-forth games according to who is currently
complaining.

Sigh, I still suggest to ratelimit the warning about failed
allocation.

Best Regards,
Petr

^ permalink raw reply

* Re: [PATCH net-next] ionic: Remove unused including <linux/version.h>
From: Shannon Nelson @ 2019-09-06 15:36 UTC (permalink / raw)
  To: YueHaibing, Pensando Drivers, David S . Miller; +Cc: netdev, kernel-janitors
In-Reply-To: <20190906095410.107596-1-yuehaibing@huawei.com>


On 9/6/19 2:54 AM, YueHaibing wrote:
> Remove including <linux/version.h> that don't need it.
>
> Signed-off-by: YueHaibing <yuehaibing@huawei.com>
> ---
>   drivers/net/ethernet/pensando/ionic/ionic_main.c | 1 -
>   1 file changed, 1 deletion(-)
>
> diff --git a/drivers/net/ethernet/pensando/ionic/ionic_main.c b/drivers/net/ethernet/pensando/ionic/ionic_main.c
> index 5ec67f3f1853..15e432386b35 100644
> --- a/drivers/net/ethernet/pensando/ionic/ionic_main.c
> +++ b/drivers/net/ethernet/pensando/ionic/ionic_main.c
> @@ -2,7 +2,6 @@
>   /* Copyright(c) 2017 - 2019 Pensando Systems, Inc */
>   
>   #include <linux/module.h>
> -#include <linux/version.h>
>   #include <linux/netdevice.h>
>   #include <linux/utsname.h>
>
>

Acked-by: Shannon Nelson <snelson@pensando.io>


^ permalink raw reply

* Re: [PATCH net] net: gso: Fix skb_segment splat when splitting gso_size mangled skb having linear-headed frag_list
From: Shmulik Ladkani @ 2019-09-06 15:37 UTC (permalink / raw)
  To: Willem de Bruijn, Eric Dumazet, Alexander Duyck
  Cc: Daniel Borkmann, eyal, netdev, Shmulik Ladkani
In-Reply-To: <CAF=yD-JB6TMQuyaxzLX8=9CZZF+Zk5EmniSkx_F81bVc87XqJw@mail.gmail.com>

On Fri, 6 Sep 2019 10:49:55 -0400
Willem de Bruijn <willemdebruijn.kernel@gmail.com> wrote:

> But I wonder whether it is a given that head_skb has headlen.

This is what I observed for GRO packets that do have headlen frag_list
members: the 'head_skb' itself had a headlen too, and its head was
built using the original gso_size (similar to the frag_list members).

Maybe Eric can comment better.

> Btw, it seems slightly odd to me tot test head_frag before testing
> headlen in the v2 patch.

Requested by Alexander. I'm fine either way.

Thanks
Shmulik

^ permalink raw reply

* Re: [PATCH net] net: gso: Fix skb_segment splat when splitting gso_size mangled skb having linear-headed frag_list
From: Alexander Duyck @ 2019-09-06 15:42 UTC (permalink / raw)
  To: Shmulik Ladkani
  Cc: Willem de Bruijn, Eric Dumazet, Daniel Borkmann, eyal, netdev,
	Shmulik Ladkani
In-Reply-To: <20190906183707.3eaacd79@pixies>

On Fri, Sep 6, 2019 at 8:37 AM Shmulik Ladkani <shmulik@metanetworks.com> wrote:
>
> On Fri, 6 Sep 2019 10:49:55 -0400
> Willem de Bruijn <willemdebruijn.kernel@gmail.com> wrote:
>
> > But I wonder whether it is a given that head_skb has headlen.
>
> This is what I observed for GRO packets that do have headlen frag_list
> members: the 'head_skb' itself had a headlen too, and its head was
> built using the original gso_size (similar to the frag_list members).
>
> Maybe Eric can comment better.
>
> > Btw, it seems slightly odd to me tot test head_frag before testing
> > headlen in the v2 patch.
>
> Requested by Alexander. I'm fine either way.

Yeah, my thought on that was "do we care about the length if the data
is stored in a head_frag?". I suppose you could flip the logic and
make it "do we care about it being a head_frag if there is no data
there?". The reason I had suggested the head_frag test first was
because it was a single test bit whereas the length requires reading
two fields and doing a comparison.

For either ordering it is fine by me. So if we need to feel free to
swap those two tests for a v3.

Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>

^ permalink raw reply

* Re: [PATCH bpf-next v10 2/4] bpf: new helper to obtain namespace data from  current task New bpf helper bpf_get_current_pidns_info.
From: Al Viro @ 2019-09-06 15:46 UTC (permalink / raw)
  To: Carlos Neira; +Cc: netdev, yhs, ebiederm, brouer, bpf
In-Reply-To: <20190906152435.GW1131@ZenIV.linux.org.uk>

On Fri, Sep 06, 2019 at 04:24:35PM +0100, Al Viro wrote:
> > +	tmp = kmalloc(fnamesize, GFP_ATOMIC);
> > +	if (unlikely(!tmp)) {
> > +		__putname(fname);
> > +		ret = -ENOMEM;
> > +		goto clear;
> > +	}
> > +
> > +	tmp->name = (char *)fname;
> > +	fname = tmp;
> > +	len = strlen(pidns_path) + 1;
> > +	memcpy((char *)fname->name, pidns_path, len);
> > +	fname->uptr = NULL;
> > +	fname->aname = NULL;
> > +	fname->refcnt = 1;
> > +
> > +	ret = filename_lookup(AT_FDCWD, fname, 0, &kp, NULL);
> > +	if (ret)
> > +		goto clear;
> 
> Where do I begin?
> 	* getname_kernel() is there for purpose
> 	* so's kern_path(), damnit

Oh, and filename_lookup() *CAN* sleep, obviously.  So that
GFP_ATOMIC above is completely pointless.

> > +
> > +	inode = d_backing_inode(kp.dentry);
> > +	pidns_info->dev = (u32)inode->i_rdev;

Why are plaing with device number, anyway?  And why would it
be anything other than 0?

^ permalink raw reply

* Re: [PATCH net-next,v3 0/4] flow_offload: update mangle action representation
From: Pablo Neira Ayuso @ 2019-09-06 15:58 UTC (permalink / raw)
  To: Edward Cree
  Cc: netfilter-devel, davem, netdev, jakub.kicinski, jiri, saeedm,
	vishal, vladbu
In-Reply-To: <be6eee6b-9d58-f0f7-571b-7e473612e2b3@solarflare.com>

Hi Edward,

On Fri, Sep 06, 2019 at 04:13:17PM +0100, Edward Cree wrote:
> On 06/09/2019 15:50, Pablo Neira Ayuso wrote:
> > On Fri, Sep 06, 2019 at 02:37:16PM +0100, Edward Cree wrote:
[...]
> >> So an IPv6 address mangle only comes as a single action if it's from
> >>  netfilter, not if it's coming from TC pedit.
> > Driver gets one single action from tc/netfilter (and potentially
> > ethtool if it would support for this), it comes as a single action
> > from both subsystems:
> >
> >         front-end -----> flow_rule API -----> drivers
> >
> > Front-end need to translate their representation to this
> > FLOW_ACTION_MANGLE layout.
> >
> > By front-end, I refer to ethtool/netfilter/tc.
>
>  In your patch, flow_action_mangle() looks only at the offset and type
>  (add vs. set) of each pedit, coalesces them if compatible (so, unless
>  I'm misreading the code, it _will_ coalesce adjacent pedits to
>  contiguous fields like UDP sport & dport, unlike what you said),
>  because it doesn't know whether two contiguous pedits are part of the
>  same field or not (it doesn't have any knowledge of 'fields' at all).

In tc pedit ex, those are _indeed_ two separated actions:

* One to mangle UDP sport.
* One to mangle UDP dport.

They are _not_ one as you describe above.

The coalesce routine I made does _not_ coalesce fields in two
different actions.

>  And for you to change that, while still coalescing IPv6 pedits, you
>  would need flow_action_mangle() to know what fields each pedit is in.

In the particular case of IPv6 address, 'tc pedit ex' generates one
single action with 4 nkeys. So this is:

* One action to mangle four 32-bits words (the number of words in tc
  pedit is stored in the nkeys field).

The coalesce routine I made in this case merges the four 32-bits words
into one single action.

[...]
> >>  Yes, but we don't add code/features to the kernel based on what we
> >>  *could* use it for later
> > This is already useful: Look at the cxgb driver in particular, it's
> > way easier to read.
>
>  Have you tested it?  Again, I might be misreading, but it looks like
>  your flow_action_mangle() *will* coalesce sport & dport, which it
>  appears will break that cxgb code.

Because sport and dport are not place in the same action by tc pedit
ex, _that cannot happen_.

> > Other existing drivers do not need to do things like:
> >
> >         case round_down(offsetof(struct iphdr, tos), 4):
> >
> > and the play with masks to infer if the user wants to mangle the TOS
> > field, they can just do:
> >
> >         case offsetof(struct iphdr, tos):
> >
> > which is way more natural representation.
>
> Proper thing to do is have helper functions available to drivers to test
> the pedit, and not just switch on the offset.  Why do I say that?
>
> Well, consider a pedit on UDP dport, with mask 0x00ff (network endian).
> Now as a u32 pedit that's 0x000000ff offset 0, so field-blind offset
> calculation (ffs in flow_action_mangle_entry()) will turn that into
> offset 3 mask 0xff.  Now driver does
>     switch(offset) { /* 3 */
>     case offsetof(struct udphdr, dest): /* 2 */
>         /* Whoops, we never get here! */
>     }
>
> Do you see the problem?

This scenario you describe cannot _work_ right now, with the existing
code. Without my patchset, this scenario you describe does _not_ work,

The drivers in the tree need a mask of 0xffff to infer that this is
UDP dport.

The 'tc pedit ex' infrastructure does not allow for the scenario that
you describe above.

No driver in the tree allow for what you describe already.

^ permalink raw reply

* Re: [PATCH bpf-next v10 2/4] bpf: new helper to obtain namespace data from  current task New bpf helper bpf_get_current_pidns_info.
From: Al Viro @ 2019-09-06 16:00 UTC (permalink / raw)
  To: Carlos Neira; +Cc: netdev, yhs, ebiederm, brouer, bpf
In-Reply-To: <20190906154647.GA19707@ZenIV.linux.org.uk>

On Fri, Sep 06, 2019 at 04:46:47PM +0100, Al Viro wrote:

> > Where do I begin?
> > 	* getname_kernel() is there for purpose
> > 	* so's kern_path(), damnit
> 
> Oh, and filename_lookup() *CAN* sleep, obviously.  So that
> GFP_ATOMIC above is completely pointless.
> 
> > > +
> > > +	inode = d_backing_inode(kp.dentry);
> > > +	pidns_info->dev = (u32)inode->i_rdev;

In the original variant of patchset it used to be ->i_sb->s_dev,
which is also bloody strange - you are not asking filename_lookup()
to follow symlinks, so you'd get that of whatever filesystem
/proc/self/ns resides on.

->i_rdev use makes no sense whatsoever - it's a symlink and
neither it nor its target are device nodes; ->i_rdev will be
left zero for both.

What data are you really trying to get there?

^ permalink raw reply

* [net-next 00/11] nfp: implement firmware loading policy
From: Simon Horman @ 2019-09-06 16:00 UTC (permalink / raw)
  To: David Miller
  Cc: Jakub Kicinski, netdev, oss-drivers, Dirk van der Merwe,
	Simon Horman

Hi Dave,

I am handling maintenance of the nfp diver in Jakub's absence while he is
on vacation this week and next, and I am sending this patchset in that
capacity.

Regarding the patches, Dirk says:

This series adds configuration capabilities to the firmware loading policy of
the NFP driver.

NFP firmware loading is controlled via three HWinfo keys which can be set per
device: 'abi_drv_reset', 'abi_drv_load_ifc' and 'app_fw_from_flash'.
Refer to patch #11 for more detail on how these control the firmware loading.

In order to configure the full extend of FW loading policy, a new devlink
parameter has been introduced, 'reset_dev_on_drv_probe', which controls if the
driver should reset the device when it's probed. This, inconjunction with the
existing 'fw_load_policy' (extended to include a 'disk' option) provides the
means to tweak the NFP HWinfo keys as required by users.

Patches 1 and 2 adds the devlink modifications and patches 3 through 9 adds the
support into the NFP driver. Furthermore, the last 2 patches are documentation
only.

Dirk van der Merwe (11):
  devlink: extend 'fw_load_policy' values
  devlink: add 'reset_dev_on_drv_probe' param
  nfp: nsp: add support for fw_loaded command
  nfp: nsp: add support for optional hwinfo lookup
  nfp: nsp: add support for hwinfo set operation
  nfp: honor FW reset and loading policies
  nfp: add devlink param infrastructure
  nfp: devlink: add 'fw_load_policy' support
  nfp: devlink: add 'reset_dev_on_drv_probe' support
  kdoc: fix nfp_fw_load documentation
  Documentation: nfp: add nfp driver specific notes

 .../networking/device_drivers/netronome/nfp.rst    | 133 +++++++++++
 Documentation/networking/devlink-params-nfp.txt    |   5 +
 Documentation/networking/devlink-params.txt        |  16 ++
 drivers/net/ethernet/netronome/nfp/Makefile        |   1 +
 drivers/net/ethernet/netronome/nfp/devlink_param.c | 253 +++++++++++++++++++++
 drivers/net/ethernet/netronome/nfp/nfp_main.c      | 141 +++++++++---
 drivers/net/ethernet/netronome/nfp/nfp_main.h      |   5 +
 drivers/net/ethernet/netronome/nfp/nfp_net_main.c  |   7 +
 .../net/ethernet/netronome/nfp/nfpcore/nfp_nsp.c   |  77 ++++++-
 .../net/ethernet/netronome/nfp/nfpcore/nfp_nsp.h   |  29 +++
 include/net/devlink.h                              |   4 +
 include/uapi/linux/devlink.h                       |   8 +
 net/core/devlink.c                                 |   5 +
 13 files changed, 654 insertions(+), 30 deletions(-)
 create mode 100644 Documentation/networking/device_drivers/netronome/nfp.rst
 create mode 100644 Documentation/networking/devlink-params-nfp.txt
 create mode 100644 drivers/net/ethernet/netronome/nfp/devlink_param.c

-- 
2.11.0


^ permalink raw reply

* [net-next 01/11] devlink: extend 'fw_load_policy' values
From: Simon Horman @ 2019-09-06 16:00 UTC (permalink / raw)
  To: David Miller
  Cc: Jakub Kicinski, netdev, oss-drivers, Dirk van der Merwe,
	Simon Horman
In-Reply-To: <20190906160101.14866-1-simon.horman@netronome.com>

From: Dirk van der Merwe <dirk.vandermerwe@netronome.com>

Add the 'disk' value to the generic 'fw_load_policy' devlink parameter.
This value indicates that firmware should always be loaded from disk
only.

Signed-off-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
---
 Documentation/networking/devlink-params.txt | 2 ++
 include/uapi/linux/devlink.h                | 1 +
 2 files changed, 3 insertions(+)

diff --git a/Documentation/networking/devlink-params.txt b/Documentation/networking/devlink-params.txt
index 2d26434ddcf8..fadb5436188d 100644
--- a/Documentation/networking/devlink-params.txt
+++ b/Documentation/networking/devlink-params.txt
@@ -48,4 +48,6 @@ fw_load_policy		[DEVICE, GENERIC]
 			  Load firmware version preferred by the driver.
 			* DEVLINK_PARAM_FW_LOAD_POLICY_VALUE_FLASH (1)
 			  Load firmware currently stored in flash.
+			* DEVLINK_PARAM_FW_LOAD_POLICY_VALUE_DISK (2)
+			  Load firmware currently available on host's disk.
 			Type: u8
diff --git a/include/uapi/linux/devlink.h b/include/uapi/linux/devlink.h
index 546e75dd74ac..c25cc29a6647 100644
--- a/include/uapi/linux/devlink.h
+++ b/include/uapi/linux/devlink.h
@@ -202,6 +202,7 @@ enum devlink_param_cmode {
 enum devlink_param_fw_load_policy_value {
 	DEVLINK_PARAM_FW_LOAD_POLICY_VALUE_DRIVER,
 	DEVLINK_PARAM_FW_LOAD_POLICY_VALUE_FLASH,
+	DEVLINK_PARAM_FW_LOAD_POLICY_VALUE_DISK,
 };
 
 enum {
-- 
2.11.0


^ permalink raw reply related

* [net-next 02/11] devlink: add 'reset_dev_on_drv_probe' param
From: Simon Horman @ 2019-09-06 16:00 UTC (permalink / raw)
  To: David Miller
  Cc: Jakub Kicinski, netdev, oss-drivers, Dirk van der Merwe,
	Simon Horman
In-Reply-To: <20190906160101.14866-1-simon.horman@netronome.com>

From: Dirk van der Merwe <dirk.vandermerwe@netronome.com>

Add the 'reset_dev_on_drv_probe' devlink parameter, controlling the
device reset policy on driver probe.

This parameter is useful in conjunction with the existing
'fw_load_policy' parameter.

Signed-off-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
---
 Documentation/networking/devlink-params.txt | 14 ++++++++++++++
 include/net/devlink.h                       |  4 ++++
 include/uapi/linux/devlink.h                |  7 +++++++
 net/core/devlink.c                          |  5 +++++
 4 files changed, 30 insertions(+)

diff --git a/Documentation/networking/devlink-params.txt b/Documentation/networking/devlink-params.txt
index fadb5436188d..f9e30d686243 100644
--- a/Documentation/networking/devlink-params.txt
+++ b/Documentation/networking/devlink-params.txt
@@ -51,3 +51,17 @@ fw_load_policy		[DEVICE, GENERIC]
 			* DEVLINK_PARAM_FW_LOAD_POLICY_VALUE_DISK (2)
 			  Load firmware currently available on host's disk.
 			Type: u8
+
+reset_dev_on_drv_probe	[DEVICE, GENERIC]
+			Controls the device's reset policy on driver probe.
+			Valid values:
+			* DEVLINK_PARAM_RESET_DEV_VALUE_UNKNOWN (0)
+			  Unknown or invalid value.
+			* DEVLINK_PARAM_RESET_DEV_VALUE_ALWAYS (1)
+			  Always reset device on driver probe.
+			* DEVLINK_PARAM_RESET_DEV_VALUE_NEVER (2)
+			  Never reset device on driver probe.
+			* DEVLINK_PARAM_RESET_DEV_VALUE_DISK (3)
+			  Reset only if device firmware can be found in the
+			  filesystem.
+			Type: u8
diff --git a/include/net/devlink.h b/include/net/devlink.h
index 460bc629d1a4..d880de5b8d3a 100644
--- a/include/net/devlink.h
+++ b/include/net/devlink.h
@@ -398,6 +398,7 @@ enum devlink_param_generic_id {
 	DEVLINK_PARAM_GENERIC_ID_MSIX_VEC_PER_PF_MAX,
 	DEVLINK_PARAM_GENERIC_ID_MSIX_VEC_PER_PF_MIN,
 	DEVLINK_PARAM_GENERIC_ID_FW_LOAD_POLICY,
+	DEVLINK_PARAM_GENERIC_ID_RESET_DEV,
 
 	/* add new param generic ids above here*/
 	__DEVLINK_PARAM_GENERIC_ID_MAX,
@@ -428,6 +429,9 @@ enum devlink_param_generic_id {
 #define DEVLINK_PARAM_GENERIC_FW_LOAD_POLICY_NAME "fw_load_policy"
 #define DEVLINK_PARAM_GENERIC_FW_LOAD_POLICY_TYPE DEVLINK_PARAM_TYPE_U8
 
+#define DEVLINK_PARAM_GENERIC_RESET_DEV_NAME "reset_dev_on_drv_probe"
+#define DEVLINK_PARAM_GENERIC_RESET_DEV_TYPE DEVLINK_PARAM_TYPE_U8
+
 #define DEVLINK_PARAM_GENERIC(_id, _cmodes, _get, _set, _validate)	\
 {									\
 	.id = DEVLINK_PARAM_GENERIC_ID_##_id,				\
diff --git a/include/uapi/linux/devlink.h b/include/uapi/linux/devlink.h
index c25cc29a6647..3172d1b3329f 100644
--- a/include/uapi/linux/devlink.h
+++ b/include/uapi/linux/devlink.h
@@ -205,6 +205,13 @@ enum devlink_param_fw_load_policy_value {
 	DEVLINK_PARAM_FW_LOAD_POLICY_VALUE_DISK,
 };
 
+enum devlink_param_reset_dev_value {
+	DEVLINK_PARAM_RESET_DEV_VALUE_UNKNOWN,
+	DEVLINK_PARAM_RESET_DEV_VALUE_ALWAYS,
+	DEVLINK_PARAM_RESET_DEV_VALUE_NEVER,
+	DEVLINK_PARAM_RESET_DEV_VALUE_DISK,
+};
+
 enum {
 	DEVLINK_ATTR_STATS_RX_PACKETS,		/* u64 */
 	DEVLINK_ATTR_STATS_RX_BYTES,		/* u64 */
diff --git a/net/core/devlink.c b/net/core/devlink.c
index 6e52d639dac6..e8bc96f104a7 100644
--- a/net/core/devlink.c
+++ b/net/core/devlink.c
@@ -2852,6 +2852,11 @@ static const struct devlink_param devlink_param_generic[] = {
 		.name = DEVLINK_PARAM_GENERIC_FW_LOAD_POLICY_NAME,
 		.type = DEVLINK_PARAM_GENERIC_FW_LOAD_POLICY_TYPE,
 	},
+	{
+		.id = DEVLINK_PARAM_GENERIC_ID_RESET_DEV,
+		.name = DEVLINK_PARAM_GENERIC_RESET_DEV_NAME,
+		.type = DEVLINK_PARAM_GENERIC_RESET_DEV_TYPE,
+	},
 };
 
 static int devlink_param_generic_verify(const struct devlink_param *param)
-- 
2.11.0


^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox