Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH net-next v4 0/3] kernel: add support to collect hardware logs in crash recovery kernel
From: Rahul Lakkireddy @ 2018-04-18 15:07 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Dave Young, netdev@vger.kernel.org, kexec@lists.infradead.org,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	Indranil Choudhury, Nirranjan Kirubaharan,
	stephen@networkplumber.org, Ganesh GR, akpm@linux-foundation.org,
	torvalds@linux-foundation.org, davem@davemloft.net,
	viro@zeniv.linux.org.uk
In-Reply-To: <871sfcy4ge.fsf@xmission.com>

On Wednesday, April 04/18/18, 2018 at 19:58:01 +0530, Eric W. Biederman wrote:
> Rahul Lakkireddy <rahul.lakkireddy@chelsio.com> writes:
> 
> > On Wednesday, April 04/18/18, 2018 at 11:45:46 +0530, Dave Young wrote:
> >> Hi Rahul,
> >> On 04/17/18 at 01:14pm, Rahul Lakkireddy wrote:
> >> > On production servers running variety of workloads over time, kernel
> >> > panic can happen sporadically after days or even months. It is
> >> > important to collect as much debug logs as possible to root cause
> >> > and fix the problem, that may not be easy to reproduce. Snapshot of
> >> > underlying hardware/firmware state (like register dump, firmware
> >> > logs, adapter memory, etc.), at the time of kernel panic will be very
> >> > helpful while debugging the culprit device driver.
> >> > 
> >> > This series of patches add new generic framework that enable device
> >> > drivers to collect device specific snapshot of the hardware/firmware
> >> > state of the underlying device in the crash recovery kernel. In crash
> >> > recovery kernel, the collected logs are added as elf notes to
> >> > /proc/vmcore, which is copied by user space scripts for post-analysis.
> >> > 
> >> > The sequence of actions done by device drivers to append their device
> >> > specific hardware/firmware logs to /proc/vmcore are as follows:
> >> > 
> >> > 1. During probe (before hardware is initialized), device drivers
> >> > register to the vmcore module (via vmcore_add_device_dump()), with
> >> > callback function, along with buffer size and log name needed for
> >> > firmware/hardware log collection.
> >> 
> >> I assumed the elf notes info should be prepared while kexec_[file_]load
> >> phase. But I did not read the old comment, not sure if it has been discussed
> >> or not.
> >> 
> >
> > We must not collect dumps in crashing kernel. Adding more things in
> > crash dump path risks not collecting vmcore at all. Eric had
> > discussed this in more detail at:
> >
> > https://lkml.org/lkml/2018/3/24/319
> >
> > We are safe to collect dumps in the second kernel. Each device dump
> > will be exported as an elf note in /proc/vmcore.
> 
> It just occurred to me there is one variation that is worth
> considering.
> 
> Is the area you are looking at dumping part of a huge mmio area?
> I think someone said 2GB?
> 
> If that is the case it could be worth it to simply add the needed
> addresses to the range of memory we need to dump, and simply having a
> elf note saying that is what happened.
> 

We are _not_ dumping mmio area. However, one part of the dump
collection involves reading 2 GB on-chip memory via PIO access,
which is compressed and stored.

Thanks,
Rahul

^ permalink raw reply

* Re: [PATCH RFC net-next 00/11] udp gso
From: Samudrala, Sridhar @ 2018-04-18 15:08 UTC (permalink / raw)
  To: Willem de Bruijn, Sowmini Varadhan
  Cc: Eric Dumazet, Network Development, Willem de Bruijn
In-Reply-To: <CAF=yD-LkTpVQ_8F2oTLeEDKcNWYuvn3QoSrSa0y5j7zgB6Em9A@mail.gmail.com>

On 4/18/2018 6:51 AM, Willem de Bruijn wrote:
> On Wed, Apr 18, 2018 at 9:47 AM, Sowmini Varadhan
> <sowmini.varadhan@oracle.com> wrote:
>> On (04/18/18 06:35), Eric Dumazet wrote:
>>> There is no change at all.
>>>
>>> This will only be used as a mechanism to send X packets of same size.
>>>
>>> So instead of X system calls , one system call.
>>>
>>> One traversal of some expensive part of the host stack.
>>>
>>> The content on the wire should be the same.
>> I'm sorry that's not how I interpret Willem's email below
>> (and maybe I misunderstood)
>>
>> the following taken from https://www.spinics.net/lists/netdev/msg496150.html
>>
>> Sowmini> If yes, how will the recvmsg differentiate between the case
>> Sowmini> (2000 byte message followed by 512 byte message) and
>> Sowmini> (1472 byte message, 526 byte message, then 512 byte message),
>> Sowmini> in other words, how are UDP message boundary semantics preserved?
>>
>> Willem> They aren't. This is purely an optimization to amortize the cost of
>> Willem> repeated tx stack traversal. Unlike UFO, which would preserve the
>> Willem> boundaries of the original larger than MTU datagram.
>>
>> As I understand Willem's explanation, if I do a sendmsg of 2000 bytes,
>> - classic UDP will send 2 IP fragments, the first one with a full UDP
>>    header, and the IP header indicating that this is the first frag for
>>    that ipid, with more frags to follow. The second frag will have the
>>    rest with the same ipid, it will not have a udp header,
>>    and it will indicatet that it is the last frag (no more frags).
>>
>>    The receiver can thus use the ipid, "more-frags" bit, frag offset etc
>>    to stitch the 2000 byte udp message together and pass it up on the udp
>>    socket.
>>
>> - in the "GSO" proposal my 2000  bytes of data are sent as *two*
>>    udp packets, each of them with a unique udp header, and uh_len set
>>    to 1476 (for first) and 526 (for second). The receiver has no clue
>>    that they are both part of the same UDP datagram, So wire format
>>    is not the same, am I mistaken?
> Eric is correct. If the application sets a segment size with UDP_SEGMENT
> this is an instruction to the kernel to split the payload along that border into
> separate discrete datagrams.

OK. So the sender app is passing the message boundary info to the kernel via the socket
option and letting the kernel split the large payload into multiple UDP segments.


>
> It does not matter what the behavior is without setting this option. If a
> process wants to send a larger than MTU datagram and rely on the
> kernel to fragment, then it should not set the option.

^ permalink raw reply

* Re: SRIOV switchdev mode BoF minutes
From: Andy Gospodarek @ 2018-04-18 15:15 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Andy Gospodarek, Or Gerlitz, Samudrala, Sridhar, David Miller,
	Anjali Singhai Jain, Michael Chan, Simon Horman, John Fastabend,
	Saeed Mahameed, Jiri Pirko, Rony Efraim, Linux Netdev List
In-Reply-To: <20180417161915.193c3651@cakuba.netronome.com>

On Tue, Apr 17, 2018 at 04:19:15PM -0700, Jakub Kicinski wrote:
> On Tue, 17 Apr 2018 10:47:00 -0400, Andy Gospodarek wrote:
> > There is also a school of thought that the VF reps could be
> > pre-allocated on the SmartNIC so that any application processing that
> > traffic would sit idle when no traffic arrives on the rep, but could
> > process frames that do arrive when the VFs were created on the host.
> > This implementation will depend on how resources are allocated on a
> > given bit of hardware, but can really work well.
> 
> +1 if there is no FW resource allocation issues IMHO it's okay to
> just show all reprs for "remote PCIes (PFs and VFs)" on the SmartNIC/
> controller.  The reprs should just show link down as if PCIe cable
> was unpluged until host actually enables them.  

Yes we are on the same page on this.

> A similar issue exists on multi-host for PFs, right?  If one of the
> hosts is down do we still show their PF repr?  IMHO yes.

I would agree with that as well.  With today's model the VF reps are
created once a PF is put into switchdev mode, but I'm still working out
how we want to consider whether or not a PF rep for the other domains is
created locally or not and also how one can determine which domain is in
control.

Permanent config options (like NVRAM settings) could easily handle which
domain is in control, but that still does not mean that PF reps must be
created automatically, does it?

> That makes the thing looks more like a switch with cables being plugged
> in and out.

Yes, that's exactly how I view it as well.

^ permalink raw reply

* Re: [PATCH bpf-next v4 07/10] bpf: btf: Add pretty print support to the basic arraymap
From: Daniel Borkmann @ 2018-04-18 15:20 UTC (permalink / raw)
  To: Martin KaFai Lau, netdev; +Cc: Alexei Starovoitov, kernel-team, jakub.kicinski
In-Reply-To: <20180417204243.4028831-8-kafai@fb.com>

Hi Martin,

first of all great work on the set! One issue that puzzled me
while digesting it further below.

On 04/17/2018 10:42 PM, Martin KaFai Lau wrote:
> This patch adds pretty print support to the basic arraymap.
> Support for other bpf maps can be added later.
> 
> This patch adds new attrs to the BPF_MAP_CREATE command to allow
> specifying the btf_fd, btf_key_id and btf_value_id.  The
> BPF_MAP_CREATE can then associate the btf to the map if
> the creating map supports BTF.

Feels like this patch is doing two things at once, meaning i)
attaching btf object to map object through bpf syscall at map
creation time, and ...

> A BTF supported map needs to implement two new map ops,
> map_seq_show_elem() and map_check_btf().  This patch has
> implemented these new map ops for the basic arraymap.
> 
> It also adds file_operations to the pinned map
> such that the pinned map can be opened and read.

... ii) pretty print map dump via bpf fs for array map.

> Here is a sample output when reading a pinned arraymap
> with the following map's value:
> 
> struct map_value {
> 	int count_a;
> 	int count_b;
> };
> 
> cat /sys/fs/bpf/pinned_array_map:
> 
> 0: {1,2}
> 1: {3,4}
> 2: {5,6}
> ...

Rather than adding this to the bpf fs itself, why not add full BTF
support for the main debugging and introspection tool we have and
ship with the kernel for BPF, namely bpftool? You can already dump
the whole map's key/value pairs via the following command from a
pinned file:

  bpftool map dump pinned /sys/fs/bpf/pinned_array_map

And given we already export the BTF info in your earlier patch through
the BPF_OBJ_GET_INFO_BY_FD, this would fit perfectly for bpftool
integration instead where the pretty-print which is done through the
extra cb map_seq_show_elem (which only does a map lookup and print
anyway) in this patch can simply all be done in user space w/o any
additional kernel complexity.

Aside that this would be very valuable there it would also nicely
demonstrate usage of it, but more importantly we could avoid implementing
such pretty-print callback in the kernel for every other map type and
then having two locations where a user now needs to go for debugging
(bpftool being one, and cat of pinned file the other; this split seems
confusing from a user perspective, imho, but also single key lookup +
pretty-print cannot be realized with the latter whereas it's trivial
with bpftool).

The same could be done for bpftool map lookup, updates, deletions, etc
where the key resp. key/value pair can be specified through a struct
like initializer from cmdline. (But dump/lookup should be good enough
starting point initially.) Thoughts?

Thanks again,
Daniel

> Signed-off-by: Martin KaFai Lau <kafai@fb.com>
> Acked-by: Alexei Starovoitov <ast@fb.com>
> ---
>  include/linux/bpf.h      |  20 ++++++-
>  include/uapi/linux/bpf.h |   3 +
>  kernel/bpf/arraymap.c    |  50 ++++++++++++++++
>  kernel/bpf/inode.c       | 146 ++++++++++++++++++++++++++++++++++++++++++++++-
>  kernel/bpf/syscall.c     |  32 ++++++++++-
>  5 files changed, 244 insertions(+), 7 deletions(-)

^ permalink raw reply

* [PATCH net-next 0/2] netns: uevent performance tweaks
From: Christian Brauner @ 2018-04-18 15:21 UTC (permalink / raw)
  To: ebiederm, davem, netdev, linux-kernel
  Cc: avagin, ktkhai, serge, gregkh, Christian Brauner

Hey,

This series deals with a bunch of performance improvements when sending out
uevents that have been extensively discussed here:
https://lkml.org/lkml/2018/4/10/592

- Only record uevent sockets from network namespaces owned by the
  initial user namespace in the global uevent socket list.
  Eric, this is the exact patch we agreed upon in
  https://lkml.org/lkml/2018/4/10/592.
  **A very detailed rationale is present in the commit message for
    [PATCH 1/2] netns: restrict uevents**
- Decouple the locking for network namespaces in the global uevent socket
  list from the locking for network namespaces not in the global uevent
  socket list.
  **A very detailed rationale is present in the commit message
    [PATCH 2/2] netns: isolate seqnums to use per-netns locks**

Thanks!
Christian

Christian Brauner (2):
  netns: restrict uevents
  netns: isolate seqnums to use per-netns locks

 include/linux/kobject.h     |   3 -
 include/net/net_namespace.h |   3 +
 kernel/ksysfs.c             |   3 +-
 lib/kobject_uevent.c        | 118 ++++++++++++++++++++++++++++--------
 net/core/net_namespace.c    |  13 ++++
 5 files changed, 110 insertions(+), 30 deletions(-)

-- 
2.17.0

^ permalink raw reply

* [PATCH net-next 1/2] netns: restrict uevents
From: Christian Brauner @ 2018-04-18 15:21 UTC (permalink / raw)
  To: ebiederm, davem, netdev, linux-kernel
  Cc: avagin, ktkhai, serge, gregkh, Christian Brauner
In-Reply-To: <20180418152106.18519-1-christian.brauner@ubuntu.com>

commit 07e98962fa77 ("kobject: Send hotplug events in all network namespaces")

enabled sending hotplug events into all network namespaces back in 2010.
Over time the set of uevents that get sent into all network namespaces has
shrunk a little. We have now reached the point where hotplug events for all
devices that carry a namespace tag are filtered according to that
namespace. Specifically, they are filtered whenever the namespace tag of
the kobject does not match the namespace tag of the netlink socket. One
example are network devices. Uevents for network devices only show up in
the network namespaces these devices are moved to or created in.

However, any uevent for a kobject that does not have a namespace tag
associated with it will not be filtered and we will broadcast it into all
network namespaces. This behavior stopped making sense when user namespaces
were introduced.

This patch restricts uevents to the initial user namespace for a couple of
reasons that have been extensively discusses on the mailing list [1].
- Thundering herd:
  Broadcasting uevents into all network namespaces introduces significant
  overhead.
  All processes that listen to uevents running in non-initial user
  namespaces will end up responding to uevents that will be meaningless to
  them. Mainly, because non-initial user namespaces cannot easily manage
  devices unless they have a privileged host-process helping them out. This
  means that there will be a thundering herd of activity when there
  shouldn't be any.
- Uevents from non-root users are already filtered in userspace:
  Uevents are filtered by userspace in a user namespace because the
  received uid != 0. Instead the uid associated with the event will be
  65534 == "nobody" because the global root uid is not mapped.
  This means we can safely and without introducing regressions modify the
  kernel to not send uevents into all network namespaces whose owning user
  namespace is not the initial user namespace because we know that
  userspace will ignore the message because of the uid anyway. I have
  a) verified that is is true for every udev implementation out there b)
  that this behavior has been present in all udev implementations from the
  very beginning.
- Removing needless overhead/Increasing performance:
  Currently, the uevent socket for each network namespace is added to the
  global variable uevent_sock_list. The list itself needs to be protected
  by a mutex. So everytime a uevent is generated the mutex is taken on the
  list. The mutex is held *from the creation of the uevent (memory
  allocation, string creation etc. until all uevent sockets have been
  handled*. This is aggravated by the fact that for each uevent socket that
  has listeners the mc_list must be walked as well which means we're
  talking O(n^2) here. Given that a standard Linux workload usually has
  quite a lot of network namespaces and - in the face of containers - a lot
  of user namespaces this quickly becomes a performance problem (see
  "Thundering herd" above). By just recording uevent sockets of network
  namespaces that are owned by the initial user namespace we significantly
  increase performance in this codepath.
- Injecting uevents:
  There's a valid argument that containers might be interested in receiving
  device events especially if they are delegated to them by a privileged
  userspace process. One prime example are SR-IOV enabled devices that are
  explicitly designed to be handed of to other users such as VMs or
  containers.
  This use-case can now be correctly handled since
  commit 692ec06d7c92 ("netns: send uevent messages"). This commit
  introduced the ability to send uevents from userspace. As such we can let
  a sufficiently privileged (CAP_SYS_ADMIN in the owning user namespace of
  the network namespace of the netlink socket) userspace process make a
  decision what uevents should be sent. This removes the need to blindly
  broadcast uevents into all user namespaces and provides a performant and
  safe solution to this problem.
- Filtering logic:
  This patch filters by *owning user namespace of the network namespace a
  given task resides in* and not by user namespace of the task per se. This
  means if the user namespace of a given task is unshared but the network
  namespace is kept and is owned by the initial user namespace a listener
  that is opening the uevent socket in that network namespace can still
  listen to uevents.

[1]: https://lkml.org/lkml/2018/4/4/739
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
 lib/kobject_uevent.c | 18 ++++++++++++------
 1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
index 15ea216a67ce..f5f5038787ac 100644
--- a/lib/kobject_uevent.c
+++ b/lib/kobject_uevent.c
@@ -703,9 +703,13 @@ static int uevent_net_init(struct net *net)
 
 	net->uevent_sock = ue_sk;
 
-	mutex_lock(&uevent_sock_mutex);
-	list_add_tail(&ue_sk->list, &uevent_sock_list);
-	mutex_unlock(&uevent_sock_mutex);
+	/* Restrict uevents to initial user namespace. */
+	if (sock_net(ue_sk->sk)->user_ns == &init_user_ns) {
+		mutex_lock(&uevent_sock_mutex);
+		list_add_tail(&ue_sk->list, &uevent_sock_list);
+		mutex_unlock(&uevent_sock_mutex);
+	}
+
 	return 0;
 }
 
@@ -713,9 +717,11 @@ static void uevent_net_exit(struct net *net)
 {
 	struct uevent_sock *ue_sk = net->uevent_sock;
 
-	mutex_lock(&uevent_sock_mutex);
-	list_del(&ue_sk->list);
-	mutex_unlock(&uevent_sock_mutex);
+	if (sock_net(ue_sk->sk)->user_ns == &init_user_ns) {
+		mutex_lock(&uevent_sock_mutex);
+		list_del(&ue_sk->list);
+		mutex_unlock(&uevent_sock_mutex);
+	}
 
 	netlink_kernel_release(ue_sk->sk);
 	kfree(ue_sk);
-- 
2.17.0

^ permalink raw reply related

* [PATCH net-next 2/2] netns: isolate seqnums to use per-netns locks
From: Christian Brauner @ 2018-04-18 15:21 UTC (permalink / raw)
  To: ebiederm, davem, netdev, linux-kernel
  Cc: avagin, ktkhai, serge, gregkh, Christian Brauner
In-Reply-To: <20180418152106.18519-1-christian.brauner@ubuntu.com>

Now that it's possible to have a different set of uevents in different
network namespaces, per-network namespace uevent sequence numbers are
introduced. This increases performance as locking is now restricted to the
network namespace affected by the uevent rather than locking everything.

Since commit 692ec06 ("netns: send uevent messages") network namespaces not
owned by the intial user namespace can be sent uevents from a sufficiently
privileged userspace process.
In order to send a uevent into a network namespace not owned by the initial
user namespace we currently still need to take the *global mutex* that
locks the uevent socket list even though the list *only contains network
namespaces owned by the initial user namespace*. This needs to be done
because the uevent counter is a global variable. Taking the global lock is
performance sensitive since a user on the host can spawn a pool of n
process that each create their own new user and network namespaces and then
go on to inject uevents in parallel into the network namespace of all of
these processes. This can have a significant performance impact for the
host's udevd since it means that there can be a lot of delay between a
device being added and the corresponding uevent being sent out and
available for processing by udevd. It also means that each network
namespace not owned by the initial user namespace which userspace has sent
a uevent to will need to wait until the lock becomes available.

Implementation:
This patch gives each network namespace its own uevent sequence number.
Each network namespace not owned by the initial user namespace receives its
own mutex. The struct uevent_sock is opaque to callers outside of kobject.c
so the mutex *can* and *is* only ever accessed in lib/kobject.c. In this
file it is clearly documented which lock has to be taken. All network
namespaces owned by the initial user namespace will still share the same
lock since they are all served sequentially via the uevent socket list.
This decouples the locking and ensures that the host retrieves uevents as
fast as possible even if there are a lot of uevents injected into network
namespaces not owned by the initial user namespace.  In addition, each
network namespace not owned by the initial user namespace does not have to
wait on any other network namespace not sharing the same user namespace.

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
 include/linux/kobject.h     |   3 --
 include/net/net_namespace.h |   3 ++
 kernel/ksysfs.c             |   3 +-
 lib/kobject_uevent.c        | 100 ++++++++++++++++++++++++++++--------
 net/core/net_namespace.c    |  13 +++++
 5 files changed, 98 insertions(+), 24 deletions(-)

diff --git a/include/linux/kobject.h b/include/linux/kobject.h
index 7f6f93c3df9c..776391aea247 100644
--- a/include/linux/kobject.h
+++ b/include/linux/kobject.h
@@ -36,9 +36,6 @@
 extern char uevent_helper[];
 #endif
 
-/* counter to tag the uevent, read only except for the kobject core */
-extern u64 uevent_seqnum;
-
 /*
  * The actions here must match the index to the string array
  * in lib/kobject_uevent.c
diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
index 47e35cce3b64..e4e171b1ba69 100644
--- a/include/net/net_namespace.h
+++ b/include/net/net_namespace.h
@@ -85,6 +85,8 @@ struct net {
 	struct sock		*genl_sock;
 
 	struct uevent_sock	*uevent_sock;		/* uevent socket */
+	/* counter to tag the uevent, read only except for the kobject core */
+	u64                     uevent_seqnum;
 
 	struct list_head 	dev_base_head;
 	struct hlist_head 	*dev_name_head;
@@ -189,6 +191,7 @@ extern struct list_head net_namespace_list;
 
 struct net *get_net_ns_by_pid(pid_t pid);
 struct net *get_net_ns_by_fd(int fd);
+u64 get_ns_uevent_seqnum_by_vpid(void);
 
 #ifdef CONFIG_SYSCTL
 void ipx_register_sysctl(void);
diff --git a/kernel/ksysfs.c b/kernel/ksysfs.c
index 46ba853656f6..83264edcecda 100644
--- a/kernel/ksysfs.c
+++ b/kernel/ksysfs.c
@@ -19,6 +19,7 @@
 #include <linux/sched.h>
 #include <linux/capability.h>
 #include <linux/compiler.h>
+#include <net/net_namespace.h>
 
 #include <linux/rcupdate.h>	/* rcu_expedited and rcu_normal */
 
@@ -33,7 +34,7 @@ static struct kobj_attribute _name##_attr = \
 static ssize_t uevent_seqnum_show(struct kobject *kobj,
 				  struct kobj_attribute *attr, char *buf)
 {
-	return sprintf(buf, "%llu\n", (unsigned long long)uevent_seqnum);
+	return sprintf(buf, "%llu\n", (unsigned long long)get_ns_uevent_seqnum_by_vpid());
 }
 KERNEL_ATTR_RO(uevent_seqnum);
 
diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
index f5f5038787ac..796fd502c227 100644
--- a/lib/kobject_uevent.c
+++ b/lib/kobject_uevent.c
@@ -29,21 +29,38 @@
 #include <net/net_namespace.h>
 
 
-u64 uevent_seqnum;
 #ifdef CONFIG_UEVENT_HELPER
 char uevent_helper[UEVENT_HELPER_PATH_LEN] = CONFIG_UEVENT_HELPER_PATH;
 #endif
 
+/*
+ * Size a buffer needs to be in order to hold the largest possible sequence
+ * number stored in a u64 including \0 byte: 2^64 - 1 = 21 chars.
+ */
+#define SEQNUM_BUFSIZE (sizeof("SEQNUM=") + 21)
 struct uevent_sock {
 	struct list_head list;
 	struct sock *sk;
+	/*
+	 * This mutex protects uevent sockets and the uevent counter of
+	 * network namespaces *not* owned by init_user_ns.
+	 * For network namespaces owned by init_user_ns this lock is *not*
+	 * valid instead the global uevent_sock_mutex must be used!
+	 */
+	struct mutex sk_mutex;
 };
 
 #ifdef CONFIG_NET
 static LIST_HEAD(uevent_sock_list);
 #endif
 
-/* This lock protects uevent_seqnum and uevent_sock_list */
+/*
+ * This mutex protects uevent sockets and the uevent counter of network
+ * namespaces owned by init_user_ns.
+ * For network namespaces not owned by init_user_ns this lock is *not*
+ * valid instead the network namespace specific sk_mutex in struct
+ * uevent_sock must be used!
+ */
 static DEFINE_MUTEX(uevent_sock_mutex);
 
 /* the strings here must match the enum in include/linux/kobject.h */
@@ -253,6 +270,22 @@ static int kobj_bcast_filter(struct sock *dsk, struct sk_buff *skb, void *data)
 
 	return 0;
 }
+
+static bool can_hold_seqnum(const struct kobj_uevent_env *env, size_t len)
+{
+	if (env->envp_idx >= ARRAY_SIZE(env->envp)) {
+		WARN(1, KERN_ERR "Failed to append sequence number. "
+		     "Too many uevent variables\n");
+		return false;
+	}
+
+	if ((env->buflen + len) > UEVENT_BUFFER_SIZE) {
+		WARN(1, KERN_ERR "Insufficient space to append sequence number\n");
+		return false;
+	}
+
+	return true;
+}
 #endif
 
 #ifdef CONFIG_UEVENT_HELPER
@@ -308,18 +341,22 @@ static int kobject_uevent_net_broadcast(struct kobject *kobj,
 
 	/* send netlink message */
 	list_for_each_entry(ue_sk, &uevent_sock_list, list) {
+		/* bump sequence number */
+		u64 seqnum = ++sock_net(ue_sk->sk)->uevent_seqnum;
 		struct sock *uevent_sock = ue_sk->sk;
+		char buf[SEQNUM_BUFSIZE];
 
 		if (!netlink_has_listeners(uevent_sock, 1))
 			continue;
 
 		if (!skb) {
-			/* allocate message with the maximum possible size */
+			/* calculate header length */
 			size_t len = strlen(action_string) + strlen(devpath) + 2;
 			char *scratch;
 
+			/* allocate message with the maximum possible size */
 			retval = -ENOMEM;
-			skb = alloc_skb(len + env->buflen, GFP_KERNEL);
+			skb = alloc_skb(len + env->buflen + SEQNUM_BUFSIZE, GFP_KERNEL);
 			if (!skb)
 				continue;
 
@@ -327,11 +364,24 @@ static int kobject_uevent_net_broadcast(struct kobject *kobj,
 			scratch = skb_put(skb, len);
 			sprintf(scratch, "%s@%s", action_string, devpath);
 
+			/* add env */
 			skb_put_data(skb, env->buf, env->buflen);
 
 			NETLINK_CB(skb).dst_group = 1;
 		}
 
+		/* prepare netns seqnum */
+		retval = snprintf(buf, SEQNUM_BUFSIZE, "SEQNUM=%llu", seqnum);
+		if (retval < 0 || retval >= SEQNUM_BUFSIZE)
+			continue;
+		retval++;
+
+		if (!can_hold_seqnum(env, retval))
+			continue;
+
+		/* append netns seqnum */
+		skb_put_data(skb, buf, retval);
+
 		retval = netlink_broadcast_filtered(uevent_sock, skb_get(skb),
 						    0, 1, GFP_KERNEL,
 						    kobj_bcast_filter,
@@ -339,6 +389,9 @@ static int kobject_uevent_net_broadcast(struct kobject *kobj,
 		/* ENOBUFS should be handled in userspace */
 		if (retval == -ENOBUFS || retval == -ESRCH)
 			retval = 0;
+
+		/* remove netns seqnum */
+		skb_trim(skb, env->buflen);
 	}
 	consume_skb(skb);
 #endif
@@ -510,14 +563,7 @@ int kobject_uevent_env(struct kobject *kobj, enum kobject_action action,
 	}
 
 	mutex_lock(&uevent_sock_mutex);
-	/* we will send an event, so request a new sequence number */
-	retval = add_uevent_var(env, "SEQNUM=%llu", (unsigned long long)++uevent_seqnum);
-	if (retval) {
-		mutex_unlock(&uevent_sock_mutex);
-		goto exit;
-	}
-	retval = kobject_uevent_net_broadcast(kobj, env, action_string,
-					      devpath);
+	retval = kobject_uevent_net_broadcast(kobj, env, action_string, devpath);
 	mutex_unlock(&uevent_sock_mutex);
 
 #ifdef CONFIG_UEVENT_HELPER
@@ -605,17 +651,18 @@ int add_uevent_var(struct kobj_uevent_env *env, const char *format, ...)
 EXPORT_SYMBOL_GPL(add_uevent_var);
 
 #if defined(CONFIG_NET)
-static int uevent_net_broadcast(struct sock *usk, struct sk_buff *skb,
+static int uevent_net_broadcast(struct uevent_sock *ue_sk, struct sk_buff *skb,
 				struct netlink_ext_ack *extack)
 {
-	/* u64 to chars: 2^64 - 1 = 21 chars */
-	char buf[sizeof("SEQNUM=") + 21];
+	struct sock *usk = ue_sk->sk;
+	char buf[SEQNUM_BUFSIZE];
 	struct sk_buff *skbc;
 	int ret;
 
 	/* bump and prepare sequence number */
-	ret = snprintf(buf, sizeof(buf), "SEQNUM=%llu", ++uevent_seqnum);
-	if (ret < 0 || (size_t)ret >= sizeof(buf))
+	ret = snprintf(buf, SEQNUM_BUFSIZE, "SEQNUM=%llu",
+		       ++sock_net(ue_sk->sk)->uevent_seqnum);
+	if (ret < 0 || ret >= SEQNUM_BUFSIZE)
 		return -ENOMEM;
 	ret++;
 
@@ -668,9 +715,15 @@ static int uevent_net_rcv_skb(struct sk_buff *skb, struct nlmsghdr *nlh,
 		return -EPERM;
 	}
 
-	mutex_lock(&uevent_sock_mutex);
-	ret = uevent_net_broadcast(net->uevent_sock->sk, skb, extack);
-	mutex_unlock(&uevent_sock_mutex);
+	if (net->user_ns == &init_user_ns)
+		mutex_lock(&uevent_sock_mutex);
+	else
+		mutex_lock(&net->uevent_sock->sk_mutex);
+	ret = uevent_net_broadcast(net->uevent_sock, skb, extack);
+	if (net->user_ns == &init_user_ns)
+		mutex_unlock(&uevent_sock_mutex);
+	else
+		mutex_unlock(&net->uevent_sock->sk_mutex);
 
 	return ret;
 }
@@ -708,6 +761,13 @@ static int uevent_net_init(struct net *net)
 		mutex_lock(&uevent_sock_mutex);
 		list_add_tail(&ue_sk->list, &uevent_sock_list);
 		mutex_unlock(&uevent_sock_mutex);
+	} else {
+		/*
+		 * Uevent sockets and counters for network namespaces
+		 * not owned by the initial user namespace have their
+		 * own mutex.
+		 */
+		mutex_init(&ue_sk->sk_mutex);
 	}
 
 	return 0;
diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index a11e03f920d3..2f914804ef73 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -618,6 +618,19 @@ struct net *get_net_ns_by_pid(pid_t pid)
 }
 EXPORT_SYMBOL_GPL(get_net_ns_by_pid);
 
+u64 get_ns_uevent_seqnum_by_vpid(void)
+{
+	pid_t cur_pid;
+	struct net *net;
+
+	cur_pid = task_pid_vnr(current);
+	net = get_net_ns_by_pid(cur_pid);
+	if (IS_ERR(net))
+		return 0;
+
+	return net->uevent_seqnum;
+}
+
 static __net_init int net_ns_net_init(struct net *net)
 {
 #ifdef CONFIG_NET_NS
-- 
2.17.0

^ permalink raw reply related

* Re: [PATCH net-next] team: account for oper state
From: George Wilkie @ 2018-04-18 15:33 UTC (permalink / raw)
  To: Jiri Pirko; +Cc: netdev
In-Reply-To: <20180418145822.GE1989@nanopsycho>

On Wed, Apr 18, 2018 at 04:58:22PM +0200, Jiri Pirko wrote:
> Wed, Apr 18, 2018 at 03:35:49PM CEST, gwilkie@vyatta.att-mail.com wrote:
> >On Wed, Apr 18, 2018 at 02:56:44PM +0200, Jiri Pirko wrote:
> >> Wed, Apr 18, 2018 at 12:29:50PM CEST, gwilkie@vyatta.att-mail.com wrote:
> >> >Account for operational state when determining port linkup state,
> >> >as per Documentation/networking/operstates.txt.
> >> 
> >> Could you please point me to the exact place in the document where this
> >> is suggested?
> >> 
> >
> >Various places cover it I think.
> >
> >In 1. Introduction:
> >"interface is not usable just because the admin enabled it"
> >"userspace must be granted the possibility to
> >influence operational state"
> >
> >In 4. Setting from userspace:
> >"the userspace application can set IFLA_OPERSTATE
> >to IF_OPER_DORMANT or IF_OPER_UP as long as the driver does not set
> >netif_carrier_off() or netif_dormant_on()"
> >
> >We have a use case where we want to set the oper state of the team ports based
> >on whether they are actually usable or not (as opposed to just admin up).
> 
> Are you running a supplicant there or what is the use-case?
> 

We are using tun/tap interfaces for the team ports with the physical interfaces
under the control of a user process.

> How is this handle in other drivers like bond, openvswitch, bridge, etc?

It looks like bridge is using it, looking at br_port_carrier_check() and
br_add_if().

Cheers.

> 
> >
> >Cheers.
> >
> >> 
> >> >
> >> >Signed-off-by: George Wilkie <gwilkie@vyatta.att-mail.com>
> >> >---
> >> > drivers/net/team/team.c | 3 ++-
> >> > 1 file changed, 2 insertions(+), 1 deletion(-)
> >> >
> >> >diff --git a/drivers/net/team/team.c b/drivers/net/team/team.c
> >> >index a6c6ce19eeee..231264a05e55 100644
> >> >--- a/drivers/net/team/team.c
> >> >+++ b/drivers/net/team/team.c
> >> >@@ -2918,7 +2918,8 @@ static int team_device_event(struct notifier_block *unused,
> >> > 	case NETDEV_CHANGE:
> >> > 		if (netif_running(port->dev))
> >> > 			team_port_change_check(port,
> >> >-					       !!netif_carrier_ok(port->dev));
> >> >+					       !!(netif_carrier_ok(port->dev) &&
> >> >+						  netif_oper_up(port->dev)));
> >> > 		break;
> >> > 	case NETDEV_UNREGISTER:
> >> > 		team_del_slave(port->team->dev, dev);
> >> >-- 
> >> >2.11.0
> >> >

^ permalink raw reply

* Re: [PATCH net-next] team: account for oper state
From: George Wilkie @ 2018-04-18 15:39 UTC (permalink / raw)
  To: Jiri Pirko; +Cc: netdev
In-Reply-To: <20180418153312.24h7mvle6sy2dv25@debian9.gwilkie>

On Wed, Apr 18, 2018 at 04:33:12PM +0100, George Wilkie wrote:
> On Wed, Apr 18, 2018 at 04:58:22PM +0200, Jiri Pirko wrote:
> > Wed, Apr 18, 2018 at 03:35:49PM CEST, gwilkie@vyatta.att-mail.com wrote:
> > >On Wed, Apr 18, 2018 at 02:56:44PM +0200, Jiri Pirko wrote:
> > >> Wed, Apr 18, 2018 at 12:29:50PM CEST, gwilkie@vyatta.att-mail.com wrote:
> > >> >Account for operational state when determining port linkup state,
> > >> >as per Documentation/networking/operstates.txt.
> > >> 
> > >> Could you please point me to the exact place in the document where this
> > >> is suggested?
> > >> 
> > >
> > >Various places cover it I think.
> > >
> > >In 1. Introduction:
> > >"interface is not usable just because the admin enabled it"
> > >"userspace must be granted the possibility to
> > >influence operational state"
> > >
> > >In 4. Setting from userspace:
> > >"the userspace application can set IFLA_OPERSTATE
> > >to IF_OPER_DORMANT or IF_OPER_UP as long as the driver does not set
> > >netif_carrier_off() or netif_dormant_on()"
> > >
> > >We have a use case where we want to set the oper state of the team ports based
> > >on whether they are actually usable or not (as opposed to just admin up).
> > 
> > Are you running a supplicant there or what is the use-case?
> > 
> 
> We are using tun/tap interfaces for the team ports with the physical interfaces
> under the control of a user process.
> 
> > How is this handle in other drivers like bond, openvswitch, bridge, etc?
> 
> It looks like bridge is using it, looking at br_port_carrier_check() and
> br_add_if().
> 

commit 576eb62598f10c8c7fd75703fe89010cdcfff596
Author:     stephen hemminger <shemminger@vyatta.com>
AuthorDate: Fri Dec 28 18:15:22 2012 +0000
Commit:     David S. Miller <davem@davemloft.net>
CommitDate: Sun Dec 30 02:31:43 2012 -0800

    bridge: respect RFC2863 operational state
    
    The bridge link detection should follow the operational state
    of the lower device, rather than the carrier bit. This allows devices
    like tunnels that are controlled by userspace control plane to work
    with bridge STP link management.

> Cheers.
> 
> > 
> > >
> > >Cheers.
> > >
> > >> 
> > >> >
> > >> >Signed-off-by: George Wilkie <gwilkie@vyatta.att-mail.com>
> > >> >---
> > >> > drivers/net/team/team.c | 3 ++-
> > >> > 1 file changed, 2 insertions(+), 1 deletion(-)
> > >> >
> > >> >diff --git a/drivers/net/team/team.c b/drivers/net/team/team.c
> > >> >index a6c6ce19eeee..231264a05e55 100644
> > >> >--- a/drivers/net/team/team.c
> > >> >+++ b/drivers/net/team/team.c
> > >> >@@ -2918,7 +2918,8 @@ static int team_device_event(struct notifier_block *unused,
> > >> > 	case NETDEV_CHANGE:
> > >> > 		if (netif_running(port->dev))
> > >> > 			team_port_change_check(port,
> > >> >-					       !!netif_carrier_ok(port->dev));
> > >> >+					       !!(netif_carrier_ok(port->dev) &&
> > >> >+						  netif_oper_up(port->dev)));
> > >> > 		break;
> > >> > 	case NETDEV_UNREGISTER:
> > >> > 		team_del_slave(port->team->dev, dev);
> > >> >-- 
> > >> >2.11.0
> > >> >

^ permalink raw reply

* Re: [PATCH bpf-next v3 8/8] bpf: add documentation for eBPF helpers (58-64)
From: Jesper Dangaard Brouer @ 2018-04-18 15:43 UTC (permalink / raw)
  To: Quentin Monnet
  Cc: daniel, ast, netdev, oss-drivers, linux-doc, linux-man,
	John Fastabend, brouer
In-Reply-To: <67e84a95-5e7b-1c2c-e90f-7bcc0026bd10@netronome.com>

On Wed, 18 Apr 2018 15:09:41 +0100
Quentin Monnet <quentin.monnet@netronome.com> wrote:

> 2018-04-18 15:34 UTC+0200 ~ Jesper Dangaard Brouer <brouer@redhat.com>
> > On Tue, 17 Apr 2018 15:34:38 +0100
> > Quentin Monnet <quentin.monnet@netronome.com> wrote:
> >   
> >> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> >> index 350459c583de..3d329538498f 100644
> >> --- a/include/uapi/linux/bpf.h
> >> +++ b/include/uapi/linux/bpf.h
> >> @@ -1276,6 +1276,50 @@ union bpf_attr {
> >>   * 	Return
> >>   * 		0 on success, or a negative error in case of failure.
> >>   *
> >> + * int bpf_redirect_map(struct bpf_map *map, u32 key, u64 flags)
> >> + * 	Description
> >> + * 		Redirect the packet to the endpoint referenced by *map* at
> >> + * 		index *key*. Depending on its type, his *map* can contain  
> >                                                     ^^^
> > 
> > "his" -> "this"  
> 
> Thanks!
> 
> >> + * 		references to net devices (for forwarding packets through other
> >> + * 		ports), or to CPUs (for redirecting XDP frames to another CPU;
> >> + * 		but this is only implemented for native XDP (with driver
> >> + * 		support) as of this writing).
> >> + *
> >> + * 		All values for *flags* are reserved for future usage, and must
> >> + * 		be left at zero.
> >> + * 	Return
> >> + * 		**XDP_REDIRECT** on success, or **XDP_ABORT** on error.
> >> + *  
> > 
> > "XDP_ABORT" -> "XDP_ABORTED"  
> 
> Ouch. And I did the same for bpf_redirect(). Thanks for the catch.
> 
> > 
> > I don't know if it's worth mentioning in the doc/man-page; that for XDP
> > using bpf_redirect_map() is a HUGE performance advantage, compared to
> > the bpf_redirect() call ?  
> 
> It seems worth to me. How would you simply explain the reason for this
> difference?

The basic reason is "bulking effect", as devmap avoids the NIC
tailptr/doorbell update on every packet... how to write that in a doc
format?

I wrote about why XDP_REDIRECT with maps are smart here:
 http://vger.kernel.org/netconf2017_files/XDP_devel_update_NetConf2017_Seoul.pdf

Using maps for redirect, hopefully makes XDP_REDIRECT the last driver
XDP action code we need.  As new types of redirect can be introduced
without driver changes. See that AF_XDP also uses a map.

It is more subtle, but maps also function as a sorting step. Imagine
your XDP program need to redirect out different interfaces (or CPUs in
cpumap case), and packets arrive intermixed.  Packets get sorted into
the different map indexes, and the xdp_do_flush_map() will trigger the
flush operation.


Happened to have an i40e NIC benchmark setup, and ran a single flow pktgen test.

Results with 'xdp_redirect_map'
 13589297 pps (13,589,297) 

Results with 'xdp_redirect' NOT using devmap:
  7567575 pps (7,567,575)

Just to point out the performance benefit of devmap...

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply

* [PATCH v2 0/3] lan78xx: Read configuration from Device Tree
From: Phil Elwell @ 2018-04-18 15:45 UTC (permalink / raw)
  To: Woojung Huh, Microchip Linux Driver Support, Rob Herring,
	Mark Rutland, David S. Miller, Mauro Carvalho Chehab,
	Greg Kroah-Hartman, Linus Walleij, Andrew Morton, Randy Dunlap,
	netdev, devicetree, linux-kernel, linux-usb
  Cc: Phil Elwell

The Microchip LAN78XX family of devices are Ethernet controllers with
a USB interface. Despite being discoverable devices it can be useful to
be able to configure them from Device Tree, particularly in low-cost
applications without an EEPROM or programmed OTP.

This patch set adds support for reading the MAC address and LED modes from
Device Tree.

v2:
- Use eth_platform_get_mac_address.
- Support up to 4 LEDs, and move LED mode constants into dt-bindings header.
- Improve bindings document.
- Remove EEE support.

Phil Elwell (3):
  lan78xx: Read MAC address from DT if present
  lan78xx: Read LED states from Device Tree
  dt-bindings: Document the DT bindings for lan78xx

 .../devicetree/bindings/net/microchip,lan78xx.txt  | 43 +++++++++++++++
 MAINTAINERS                                        |  2 +
 drivers/net/usb/lan78xx.c                          | 62 ++++++++++++++--------
 include/dt-bindings/net/microchip-78xx.h           | 40 ++++++++++++++
 4 files changed, 125 insertions(+), 22 deletions(-)
 create mode 100644 Documentation/devicetree/bindings/net/microchip,lan78xx.txt
 create mode 100644 include/dt-bindings/net/microchip-78xx.h

-- 
2.7.4

^ permalink raw reply

* [PATCH v2 1/3] lan78xx: Read MAC address from DT if present
From: Phil Elwell @ 2018-04-18 15:45 UTC (permalink / raw)
  To: Woojung Huh, Microchip Linux Driver Support, Rob Herring,
	Mark Rutland, David S. Miller, Mauro Carvalho Chehab,
	Greg Kroah-Hartman, Linus Walleij, Andrew Morton, Randy Dunlap,
	netdev, devicetree, linux-kernel, linux-usb
  Cc: Phil Elwell
In-Reply-To: <1524066323-109628-1-git-send-email-phil@raspberrypi.org>

There is a standard mechanism for locating and using a MAC address from
the Device Tree. Use this facility in the lan78xx driver to support
applications without programmed EEPROM or OTP. At the same time,
regularise the handling of the different address sources.

Signed-off-by: Phil Elwell <phil@raspberrypi.org>
---
 drivers/net/usb/lan78xx.c | 42 ++++++++++++++++++++----------------------
 1 file changed, 20 insertions(+), 22 deletions(-)

diff --git a/drivers/net/usb/lan78xx.c b/drivers/net/usb/lan78xx.c
index 0867f72..a823f01 100644
--- a/drivers/net/usb/lan78xx.c
+++ b/drivers/net/usb/lan78xx.c
@@ -37,6 +37,7 @@
 #include <linux/irqchip/chained_irq.h>
 #include <linux/microchipphy.h>
 #include <linux/phy.h>
+#include <linux/of_net.h>
 #include "lan78xx.h"
 
 #define DRIVER_AUTHOR	"WOOJUNG HUH <woojung.huh@microchip.com>"
@@ -1652,34 +1653,31 @@ static void lan78xx_init_mac_address(struct lan78xx_net *dev)
 	addr[5] = (addr_hi >> 8) & 0xFF;
 
 	if (!is_valid_ether_addr(addr)) {
-		/* reading mac address from EEPROM or OTP */
-		if ((lan78xx_read_eeprom(dev, EEPROM_MAC_OFFSET, ETH_ALEN,
-					 addr) == 0) ||
-		    (lan78xx_read_otp(dev, EEPROM_MAC_OFFSET, ETH_ALEN,
-				      addr) == 0)) {
-			if (is_valid_ether_addr(addr)) {
-				/* eeprom values are valid so use them */
-				netif_dbg(dev, ifup, dev->net,
-					  "MAC address read from EEPROM");
-			} else {
-				/* generate random MAC */
-				random_ether_addr(addr);
-				netif_dbg(dev, ifup, dev->net,
-					  "MAC address set to random addr");
-			}
-
-			addr_lo = addr[0] | (addr[1] << 8) |
-				  (addr[2] << 16) | (addr[3] << 24);
-			addr_hi = addr[4] | (addr[5] << 8);
-
-			ret = lan78xx_write_reg(dev, RX_ADDRL, addr_lo);
-			ret = lan78xx_write_reg(dev, RX_ADDRH, addr_hi);
+		if (!eth_platform_get_mac_address(&dev->udev->dev, addr)) {
+			/* valid address present in Device Tree */
+			netif_dbg(dev, ifup, dev->net,
+				  "MAC address read from Device Tree");
+		} else if (((lan78xx_read_eeprom(dev, EEPROM_MAC_OFFSET,
+						 ETH_ALEN, addr) == 0) ||
+			    (lan78xx_read_otp(dev, EEPROM_MAC_OFFSET,
+					      ETH_ALEN, addr) == 0)) &&
+			   is_valid_ether_addr(addr)) {
+			/* eeprom values are valid so use them */
+			netif_dbg(dev, ifup, dev->net,
+				  "MAC address read from EEPROM");
 		} else {
 			/* generate random MAC */
 			random_ether_addr(addr);
 			netif_dbg(dev, ifup, dev->net,
 				  "MAC address set to random addr");
 		}
+
+		addr_lo = addr[0] | (addr[1] << 8) |
+			  (addr[2] << 16) | (addr[3] << 24);
+		addr_hi = addr[4] | (addr[5] << 8);
+
+		ret = lan78xx_write_reg(dev, RX_ADDRL, addr_lo);
+		ret = lan78xx_write_reg(dev, RX_ADDRH, addr_hi);
 	}
 
 	ret = lan78xx_write_reg(dev, MAF_LO(0), addr_lo);
-- 
2.7.4

^ permalink raw reply related

* [PATCH v2 2/3] lan78xx: Read LED states from Device Tree
From: Phil Elwell @ 2018-04-18 15:45 UTC (permalink / raw)
  To: Woojung Huh, Microchip Linux Driver Support, Rob Herring,
	Mark Rutland, David S. Miller, Mauro Carvalho Chehab,
	Greg Kroah-Hartman, Linus Walleij, Andrew Morton, Randy Dunlap,
	netdev, devicetree, linux-kernel, linux-usb
  Cc: Phil Elwell
In-Reply-To: <1524066323-109628-1-git-send-email-phil@raspberrypi.org>

Add support for DT property "microchip,led-modes", a vector of zero
to four cells (u32s) in the range 0-15, each of which sets the mode
for one of the LEDs. Some possible values are:

    0=link/activity          1=link1000/activity
    2=link100/activity       3=link10/activity
    4=link100/1000/activity  5=link10/1000/activity
    6=link10/100/activity    14=off    15=on

These values are given symbolic constants in a dt-bindings header.

Also use the presence of the DT property to indicate that the
LEDs should be enabled - necessary in the event that no valid OTP
or EEPROM is available.

Signed-off-by: Phil Elwell <phil@raspberrypi.org>
---
 MAINTAINERS                              |  1 +
 drivers/net/usb/lan78xx.c                | 35 ++++++++++++++++++++++++++++++++
 include/dt-bindings/net/microchip-78xx.h | 21 +++++++++++++++++++
 3 files changed, 57 insertions(+)
 create mode 100644 include/dt-bindings/net/microchip-78xx.h

diff --git a/MAINTAINERS b/MAINTAINERS
index b60179d..9c9bc63 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -14573,6 +14573,7 @@ M:	Microchip Linux Driver Support <UNGLinuxDriver@microchip.com>
 L:	netdev@vger.kernel.org
 S:	Maintained
 F:	drivers/net/usb/lan78xx.*
+F:	include/dt-bindings/net/microchip-78xx.h
 
 USB MASS STORAGE DRIVER
 M:	Alan Stern <stern@rowland.harvard.edu>
diff --git a/drivers/net/usb/lan78xx.c b/drivers/net/usb/lan78xx.c
index a823f01..f47ffea 100644
--- a/drivers/net/usb/lan78xx.c
+++ b/drivers/net/usb/lan78xx.c
@@ -38,6 +38,7 @@
 #include <linux/microchipphy.h>
 #include <linux/phy.h>
 #include <linux/of_net.h>
+#include <dt-bindings/net/microchip-78xx.h>
 #include "lan78xx.h"
 
 #define DRIVER_AUTHOR	"WOOJUNG HUH <woojung.huh@microchip.com>"
@@ -74,6 +75,9 @@
 #define LAN78XX_EEPROM_MAGIC		(0x78A5)
 #define LAN78XX_OTP_MAGIC		(0x78F3)
 
+/* This register is specific to the LAN7800 and LAN7850 embedded PHYs */
+#define LAN78XX_PHY_LED_MODE_SELECT	29
+
 #define	MII_READ			1
 #define	MII_WRITE			0
 
@@ -2005,6 +2009,8 @@ static int lan78xx_phy_init(struct lan78xx_net *dev)
 {
 	int ret;
 	u32 mii_adv;
+	u32 led_modes[4];
+	int len;
 	struct phy_device *phydev;
 
 	phydev = phy_find_first(dev->mdiobus);
@@ -2077,6 +2083,35 @@ static int lan78xx_phy_init(struct lan78xx_net *dev)
 	mii_adv = (u32)mii_advertise_flowctrl(dev->fc_request_control);
 	phydev->advertising |= mii_adv_to_ethtool_adv_t(mii_adv);
 
+	len = of_property_read_variable_u32_array(dev->udev->dev.of_node,
+						  "microchip,led-modes",
+						  led_modes,
+						  0,
+						  ARRAY_SIZE(led_modes));
+	if (len >= 0) {
+		u32 reg = 0;
+		int i;
+
+		for (i = 0; i < len; i++) {
+			if (led_modes[i] > 15) {
+				ret = -EINVAL;
+				goto error;
+			}
+			reg |= led_modes[i] << (i * 4);
+		}
+		for (; i < ARRAY_SIZE(led_modes); i++)
+			reg |= LAN78XX_FORCE_LED_OFF << (i * 4);
+		(void)phy_write(phydev, LAN78XX_PHY_LED_MODE_SELECT, reg);
+
+		/* Ensure the LEDs are enabled */
+		lan78xx_read_reg(dev, HW_CFG, &reg);
+		reg |= HW_CFG_LED0_EN_ | HW_CFG_LED1_EN_;
+		lan78xx_write_reg(dev, HW_CFG, reg);
+	} else if (len == -EOVERFLOW) {
+		ret = -EINVAL;
+		goto error;
+	}
+
 	genphy_config_aneg(phydev);
 
 	dev->fc_autoneg = phydev->autoneg;
diff --git a/include/dt-bindings/net/microchip-78xx.h b/include/dt-bindings/net/microchip-78xx.h
new file mode 100644
index 0000000..dcf4a43
--- /dev/null
+++ b/include/dt-bindings/net/microchip-78xx.h
@@ -0,0 +1,21 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _DT_BINDINGS_MICROCHIP_LAN78XX_H
+#define _DT_BINDINGS_MICROCHIP_LAN78XX_H
+
+/* LED modes */
+
+#define LAN78XX_LINK_ACTIVITY           0
+#define LAN78XX_LINK_1000_ACTIVITY      1
+#define LAN78XX_LINK_100_ACTIVITY       2
+#define LAN78XX_LINK_10_ACTIVITY        3
+#define LAN78XX_LINK_100_1000_ACTIVITY  4
+#define LAN78XX_LINK_10_1000_ACTIVITY   5
+#define LAN78XX_LINK_10_100_ACTIVITY    6
+#define LAN78XX_DUPLEX_COLLISION        8
+#define LAN78XX_COLLISION               9
+#define LAN78XX_ACTIVITY                10
+#define LAN78XX_AUTONEG_FAULT           12
+#define LAN78XX_FORCE_LED_OFF           14
+#define LAN78XX_FORCE_LED_ON            15
+
+#endif
-- 
2.7.4

^ permalink raw reply related

* [PATCH v2 3/3] dt-bindings: Document the DT bindings for lan78xx
From: Phil Elwell @ 2018-04-18 15:45 UTC (permalink / raw)
  To: Woojung Huh, Microchip Linux Driver Support, Rob Herring,
	Mark Rutland, David S. Miller, Mauro Carvalho Chehab,
	Greg Kroah-Hartman, Linus Walleij, Andrew Morton, Randy Dunlap,
	netdev, devicetree, linux-kernel, linux-usb
  Cc: Phil Elwell
In-Reply-To: <1524066323-109628-1-git-send-email-phil@raspberrypi.org>

The Microchip LAN78XX family of devices are Ethernet controllers with
a USB interface. Despite being discoverable devices it can be useful to
be able to configure them from Device Tree, particularly in low-cost
applications without an EEPROM or programmed OTP.

Document the supported properties in a bindings file.

Signed-off-by: Phil Elwell <phil@raspberrypi.org>
---
 .../devicetree/bindings/net/microchip,lan78xx.txt  | 44 ++++++++++++++++++++++
 MAINTAINERS                                        |  1 +
 2 files changed, 45 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/net/microchip,lan78xx.txt

diff --git a/Documentation/devicetree/bindings/net/microchip,lan78xx.txt b/Documentation/devicetree/bindings/net/microchip,lan78xx.txt
new file mode 100644
index 0000000..fa68f9b
--- /dev/null
+++ b/Documentation/devicetree/bindings/net/microchip,lan78xx.txt
@@ -0,0 +1,44 @@
+Microchip LAN78xx Gigabit Ethernet controller
+
+The LAN78XX devices are usually configured by programming their OTP or with
+an external EEPROM, but some platforms (e.g. Raspberry Pi 3 B+) have neither.
+The Device Tree properties, if present, override the OTP and EEPROM.
+
+Required properties:
+- compatible: Should be one of "usb424,7800", "usb424,7801" or "usb424,7850".
+
+Optional properties:
+- local-mac-address:   see ethernet.txt
+- mac-address:         see ethernet.txt
+- microchip,led-modes: a 0..4 element vector, with each element configuring
+  the operating mode of an LED. Omitted LEDs are turned off. Allowed values
+  are defined in "include/dt-bindings/net/microchip-78xx.h".
+
+Example:
+
+/* Standard configuration for a Raspberry Pi 3 B+ */
+&usb {
+	usb1@1 {
+		compatible = "usb424,2514";
+		reg = <1>;
+		#address-cells = <1>;
+		#size-cells = <0>;
+
+		usb1_1@1 {
+			compatible = "usb424,2514";
+			reg = <1>;
+			#address-cells = <1>;
+			#size-cells = <0>;
+
+			ethernet: usbether@1 {
+				compatible = "usb424,7800";
+				reg = <1>;
+				local-mac-address = [ 00 11 22 33 44 55 ];
+				microchip,led-modes = <
+					LAN78XX_LINK_1000_ACTIVITY
+					LAN78XX_LINK_10_100_ACTIVITY
+				>;
+			};
+		};
+	};
+};
diff --git a/MAINTAINERS b/MAINTAINERS
index 9c9bc63..5352bbb 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -14572,6 +14572,7 @@ M:	Woojung Huh <woojung.huh@microchip.com>
 M:	Microchip Linux Driver Support <UNGLinuxDriver@microchip.com>
 L:	netdev@vger.kernel.org
 S:	Maintained
+F:	Documentation/devicetree/bindings/net/microchip,lan78xx.txt
 F:	drivers/net/usb/lan78xx.*
 F:	include/dt-bindings/net/microchip-78xx.h
 
-- 
2.7.4

^ permalink raw reply related

* Re: [PATCH iproute2 net-next] vxlan: fix ttl inherit behavior
From: Stephen Hemminger @ 2018-04-18 15:50 UTC (permalink / raw)
  To: Hangbin Liu; +Cc: network dev, Jiri Benc
In-Reply-To: <CAPwn2JT_e5UZg52Nmutv97DM408853i1HSEZRv+2e2Geqsma0g@mail.gmail.com>

On Wed, 18 Apr 2018 13:10:49 +0800
Hangbin Liu <liuhangbin@gmail.com> wrote:

> Hi Stephen,
> 
> The patch's subject contains fix. But the kernel feature is applied on net-next.
> So I'm not sure if iproute2 net-next is suitable. If you are OK with the patch,
> please feel free to apply it on the branch which you think is suitable.
> 
> Thanks
> Hangbin
> 
> On 18 April 2018 at 13:05, Hangbin Liu <liuhangbin@gmail.com> wrote:
> > Like kernel net-next commit 72f6d71e491e6 ("vxlan: add ttl inherit support"),
> > vxlan ttl inherit should means inherit the inner protocol's ttl value.
> >
> > But currently when we add vxlan with "ttl inherit", we only set ttl 0,
> > which is actually use whatever default value instead of inherit the inner
> > protocol's ttl value.
> >
> > To make a difference with ttl inherit and ttl == 0, we add an attribute
> > IFLA_VXLAN_TTL_INHERIT when "ttl inherit" specified. And use "ttl auto"
> > to means "use whatever default value", the same behavior with ttl == 0.
> >
> > Reported-by: Jianlin Shi <jishi@redhat.com>
> > Suggested-by: Jiri Benc <jbenc@redhat.com>
> > Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>  

When davem  merges the feature into net-next, dsa will merge this into iproute2-next.
We hold off merging into iproute2 because often the kernel review feedback causes
API changes.

^ permalink raw reply

* [PATCH net-next] lan78xx: Add support to dump lan78xx registers
From: Raghuram Chary J @ 2018-04-18 15:57 UTC (permalink / raw)
  To: davem; +Cc: netdev, unglinuxdriver, woojung.huh, raghuramchary.jallipalli

In order to dump lan78xx family registers using ethtool, add
support at lan78xx driver level.

Signed-off-by: Raghuram Chary J <raghuramchary.jallipalli@microchip.com>
---
 drivers/net/usb/lan78xx.c | 54 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 54 insertions(+)

diff --git a/drivers/net/usb/lan78xx.c b/drivers/net/usb/lan78xx.c
index 0867f7275852..e846698fa32a 100644
--- a/drivers/net/usb/lan78xx.c
+++ b/drivers/net/usb/lan78xx.c
@@ -278,6 +278,30 @@ struct lan78xx_statstage64 {
 	u64 eee_tx_lpi_time;
 };
 
+static u32 lan78xx_regs[] = {
+	ID_REV,
+	INT_STS,
+	HW_CFG,
+	PMT_CTL,
+	E2P_CMD,
+	E2P_DATA,
+	USB_STATUS,
+	VLAN_TYPE,
+	MAC_CR,
+	MAC_RX,
+	MAC_TX,
+	FLOW,
+	ERR_STS,
+	MII_ACC,
+	MII_DATA,
+	EEE_TX_LPI_REQ_DLY,
+	EEE_TW_TX_SYS,
+	EEE_TX_LPI_REM_DLY,
+	WUCSR
+};
+
+#define PHY_REG_SIZE (32 * sizeof(u32))
+
 struct lan78xx_net;
 
 struct lan78xx_priv {
@@ -1605,6 +1629,34 @@ static int lan78xx_set_pause(struct net_device *net,
 	return ret;
 }
 
+static int lan78xx_get_regs_len(struct net_device *netdev)
+{
+	if (!netdev->phydev)
+		return (sizeof(lan78xx_regs));
+	else
+		return (sizeof(lan78xx_regs) + PHY_REG_SIZE);
+}
+
+static void
+lan78xx_get_regs(struct net_device *netdev, struct ethtool_regs *regs,
+		 void *buf)
+{
+	u32 *data = buf;
+	int i, j;
+	struct lan78xx_net *dev = netdev_priv(netdev);
+
+	/* Read Device/MAC registers */
+	for (i = 0, j = 0; i < (sizeof(lan78xx_regs) / sizeof(u32)); i++, j++)
+		lan78xx_read_reg(dev, lan78xx_regs[i], &data[j]);
+
+	if (!netdev->phydev)
+		return;
+
+	/* Read PHY registers */
+	for (i = 0; i < 32; i++, j++)
+		data[j] = phy_read(netdev->phydev, i);
+}
+
 static const struct ethtool_ops lan78xx_ethtool_ops = {
 	.get_link	= lan78xx_get_link,
 	.nway_reset	= phy_ethtool_nway_reset,
@@ -1625,6 +1677,8 @@ static const struct ethtool_ops lan78xx_ethtool_ops = {
 	.set_pauseparam	= lan78xx_set_pause,
 	.get_link_ksettings = lan78xx_get_link_ksettings,
 	.set_link_ksettings = lan78xx_set_link_ksettings,
+	.get_regs_len	= lan78xx_get_regs_len,
+	.get_regs	= lan78xx_get_regs,
 };
 
 static int lan78xx_ioctl(struct net_device *netdev, struct ifreq *rq, int cmd)
-- 
2.16.2

^ permalink raw reply related

* Re: [RFC PATCH] net: bridge: multicast querier per VLAN support
From: Stephen Hemminger @ 2018-04-18 15:54 UTC (permalink / raw)
  To: Nikolay Aleksandrov; +Cc: Joachim Nilsson, netdev, roopa
In-Reply-To: <da36ee2f-d39b-d6c0-15b2-50bde81482ab@cumulusnetworks.com>

On Wed, 18 Apr 2018 16:14:26 +0300
Nikolay Aleksandrov <nikolay@cumulusnetworks.com> wrote:

> On 18/04/18 16:07, Joachim Nilsson wrote:
> > On Wed, Apr 18, 2018 at 03:31:57PM +0300, Nikolay Aleksandrov wrote:  
> >> On 18/04/18 15:07, Joachim Nilsson wrote:  
> >>> - First of all, is this patch useful to anyone  
> >> Obviously to us as it's based on our patch. :-)
> >> We actually recently discussed what will be needed to make it acceptable to upstream.  
> > 
> > Great! :)
> >   
> >>> - The current br_multicast.c is very complex.  The support for both IPv4
> >>>    and IPv6 is a no-brainer, but it also has #ifdef VLAN_FILTERING and
> >>>    'br->vlan_enabled' ... this has likely been discussed before, but if
> >>>    we could remove those code paths I believe what's left would be quite
> >>>    a bit easier to read and maintain.  
> >> br->vlan_enabled has a wrapper that can be used without ifdefs, as does br_vlan_find()
> >> so in short - you can remove the ifdefs and use the wrappers,  they'll degrade to always
> >> false/null when vlans are disabled.  
> > 
> > Thanks, I'll have a look at that and prepare an RFC v2!
> >   
> >>> - Many per-bridge specific multicast sysfs settings may need to have a
> >>>    corresponding per-VLAN setting, e.g. snooping, query_interval, etc.
> >>>    How should we go about that? (For status reporting I have a proposal)  
> >> We'll have to add more to the per-vlan context, but yes it has to happen.
> >> It will be only netlink interface for config/retrieval, no sysfs.  
> > 
> > Some settings are possible to do with sysfs, like multicast_query_interval
> > and ...  
> 
> We want to avoid sysfs in general, all of networking config and stats
> are moving to netlink. It is better controlled and structured for such
> changes, also provides nice interfaces for automatic  type checks etc.
> 
> Also (but a minor reason) there is no tree/entity in sysfs for the vlans
> where to add this. It will either have to be a file which does some
> format string hack (like us currently) or will need to add new tree for
> them which I'd really like to avoid for the bridge.

In general, all bridge attributes need to show in netlink and sysfs.
Sysfs is easier for scripting from languages.

^ permalink raw reply

* [PATCH] atm: iphase: fix spelling mistake: "Tansmit" -> "Transmit"
From: Colin King @ 2018-04-18 15:55 UTC (permalink / raw)
  To: Chas Williams, linux-atm-general, netdev; +Cc: kernel-janitors, linux-kernel

From: Colin Ian King <colin.king@canonical.com>

Trivial fix to spelling mistake in message text.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
---
 drivers/atm/iphase.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/atm/iphase.c b/drivers/atm/iphase.c
index 44abb8a0a5e5..be076606d30e 100644
--- a/drivers/atm/iphase.c
+++ b/drivers/atm/iphase.c
@@ -671,7 +671,7 @@ static void ia_tx_poll (IADEV *iadev) {
           if ((vcc->pop) && (skb1->len != 0))
           {
              vcc->pop(vcc, skb1);
-             IF_EVENT(printk("Tansmit Done - skb 0x%lx return\n",
+             IF_EVENT(printk("Transmit Done - skb 0x%lx return\n",
                                                           (long)skb1);)
           }
           else 
@@ -1665,7 +1665,7 @@ static void tx_intr(struct atm_dev *dev)
 	status = readl(iadev->seg_reg+SEG_INTR_STATUS_REG);  
         if (status & TRANSMIT_DONE){
 
-           IF_EVENT(printk("Tansmit Done Intr logic run\n");)
+           IF_EVENT(printk("Transmit Done Intr logic run\n");)
            spin_lock_irqsave(&iadev->tx_lock, flags);
            ia_tx_poll(iadev);
            spin_unlock_irqrestore(&iadev->tx_lock, flags);
-- 
2.17.0

^ permalink raw reply related

* Re: [Bug 199429] New: smc_shutdown(net/smc/af_smc.c) has a UAF causing null pointer vulnerability.
From: Stephen Hemminger @ 2018-04-18 15:55 UTC (permalink / raw)
  To: Ursula Braun; +Cc: Ursula Braun, netdev
In-Reply-To: <49ed2fa7-cace-12c9-eb57-539cac783cb2@linux.ibm.com>

On Wed, 18 Apr 2018 13:46:20 +0200
Ursula Braun <ubraun@linux.ibm.com> wrote:

> On 04/18/2018 04:56 AM, Stephen Hemminger wrote:
> > This may already be fixed.
> > 
> > Begin forwarded message:
> > 
> > Date: Wed, 18 Apr 2018 01:52:59 +0000
> > From: bugzilla-daemon@bugzilla.kernel.org
> > To: stephen@networkplumber.org
> > Subject: [Bug 199429] New: smc_shutdown(net/smc/af_smc.c) has a UAF causing null pointer vulnerability.
> > 
> > 
> > https://bugzilla.kernel.org/show_bug.cgi?id=199429
> > 
> >             Bug ID: 199429
> >            Summary: smc_shutdown(net/smc/af_smc.c) has a UAF causing null
> >                     pointer vulnerability.
> >            Product: Networking
> >            Version: 2.5
> >     Kernel Version: 4.16.0-rc7
> >           Hardware: All
> >                 OS: Linux
> >               Tree: Mainline
> >             Status: NEW
> >           Severity: normal
> >           Priority: P1
> >          Component: Other
> >           Assignee: stephen@networkplumber.org
> >           Reporter: 1773876454@qq.com
> >         Regression: No
> > 
> > Created attachment 275431  
> >   --> https://bugzilla.kernel.org/attachment.cgi?id=275431&action=edit    
> > POC
> > 
> > Syzkaller hit 'general protection fault in kernel_sock_shutdown' bug.
> > 
> > NET: Registered protocol family 43  
> 
> Thanks for reporting. This fix is needed here:
> 
>  net/smc/af_smc.c |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> --- a/net/smc/af_smc.c
> +++ b/net/smc/af_smc.c
> @@ -1314,7 +1314,7 @@ static int smc_shutdown(struct socket *s
>  	    (sk->sk_state != SMC_APPCLOSEWAIT2) &&
>  	    (sk->sk_state != SMC_APPFINCLOSEWAIT))
>  		goto out;
> -	if (smc->use_fallback) {
> +	if (smc->use_fallback || sk->sk_state == SMC_LISTEN) {
>  		rc = kernel_sock_shutdown(smc->clcsock, how);
>  		sk->sk_shutdown = smc->clcsock->sk->sk_shutdown;
>  		if (sk->sk_shutdown == SHUTDOWN_MASK)
> 
> Kind regards, Ursula
> 

Please submit patch to linux net with proper signed-off-by and Fixes tags.
The maintainer (davem) will take care of getting this into upstream and stable.

^ permalink raw reply

* [bpf-next PATCH 0/3] Add ID to bpf_map/prog tracepoints
From: Sebastiano Miano @ 2018-04-18 15:30 UTC (permalink / raw)
  To: netdev, ast, daniel; +Cc: mingo, rostedt, brouer, fulvio.risso

The following series:
1) Add ID to both map and prog related tracepoints
2) Add a sample program that shows how to monitor and 
   filter map related events using their IDs.

---

Sebastiano Miano (3):
      bpf: add id to map tracepoint
      bpf: add id to prog tracepoint
      bpf: add sample program to trace map events


 include/trace/events/bpf.h          |   44 ++++-
 samples/bpf/Makefile                |    4 
 samples/bpf/trace_map_events_kern.c |  217 ++++++++++++++++++++++++
 samples/bpf/trace_map_events_user.c |  314 +++++++++++++++++++++++++++++++++++
 4 files changed, 569 insertions(+), 10 deletions(-)
 create mode 100644 samples/bpf/trace_map_events_kern.c
 create mode 100644 samples/bpf/trace_map_events_user.c

^ permalink raw reply

* [bpf-next PATCH 1/3] bpf: add id to map tracepoint
From: Sebastiano Miano @ 2018-04-18 15:30 UTC (permalink / raw)
  To: netdev, ast, daniel; +Cc: mingo, rostedt, brouer, fulvio.risso
In-Reply-To: <152406544226.3465.948692097697975172.stgit@localhost.localdomain>

This patch adds the map id to the bpf tracepoints
that can be used when monitoring or inspecting map
related functions.

Signed-off-by: Sebastiano Miano <sebastiano.miano@polito.it>
Suggested-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 include/trace/events/bpf.h |   29 ++++++++++++++++++++++-------
 1 file changed, 22 insertions(+), 7 deletions(-)

diff --git a/include/trace/events/bpf.h b/include/trace/events/bpf.h
index 1501856..d7c9726 100644
--- a/include/trace/events/bpf.h
+++ b/include/trace/events/bpf.h
@@ -130,6 +130,7 @@ TRACE_EVENT(bpf_map_create,
 		__field(u32, max_entries)
 		__field(u32, flags)
 		__field(int, ufd)
+		__field(u32, id)
 	),
 
 	TP_fast_assign(
@@ -139,9 +140,11 @@ TRACE_EVENT(bpf_map_create,
 		__entry->max_entries = map->max_entries;
 		__entry->flags       = map->map_flags;
 		__entry->ufd         = ufd;
+		__entry->id          = map->id;
 	),
 
-	TP_printk("map type=%s ufd=%d key=%u val=%u max=%u flags=%x",
+	TP_printk("id=%u type=%s ufd=%d key=%u val=%u max=%u flags=%x",
+		  __entry->id,
 		  __print_symbolic(__entry->type, __MAP_TYPE_SYM_TAB),
 		  __entry->ufd, __entry->size_key, __entry->size_value,
 		  __entry->max_entries, __entry->flags)
@@ -199,17 +202,20 @@ DECLARE_EVENT_CLASS(bpf_obj_map,
 		__field(u32, type)
 		__field(int, ufd)
 		__string(path, pname->name)
+		__field(u32, id)
 	),
 
 	TP_fast_assign(
 		__assign_str(path, pname->name);
 		__entry->type = map->map_type;
 		__entry->ufd  = ufd;
+		__entry->id   = map->id;
 	),
 
-	TP_printk("map type=%s ufd=%d path=%s",
-		  __print_symbolic(__entry->type, __MAP_TYPE_SYM_TAB),
-		  __entry->ufd, __get_str(path))
+	TP_printk("map id=%u type=%s ufd=%d path=%s",
+		__entry->id,
+		__print_symbolic(__entry->type, __MAP_TYPE_SYM_TAB),
+		__entry->ufd, __get_str(path))
 );
 
 DEFINE_EVENT(bpf_obj_map, bpf_obj_pin_map,
@@ -244,6 +250,7 @@ DECLARE_EVENT_CLASS(bpf_map_keyval,
 		__dynamic_array(u8, val, map->value_size)
 		__field(bool, val_trunc)
 		__field(int, ufd)
+		__field(u32, id)
 	),
 
 	TP_fast_assign(
@@ -255,9 +262,11 @@ DECLARE_EVENT_CLASS(bpf_map_keyval,
 		__entry->val_len   = min(map->value_size, 16U);
 		__entry->val_trunc = map->value_size != __entry->val_len;
 		__entry->ufd       = ufd;
+		__entry->id        = map->id;
 	),
 
-	TP_printk("map type=%s ufd=%d key=[%s%s] val=[%s%s]",
+	TP_printk("map id=%d type=%s ufd=%d key=[%s%s] val=[%s%s]",
+		  __entry->id,
 		  __print_symbolic(__entry->type, __MAP_TYPE_SYM_TAB),
 		  __entry->ufd,
 		  __print_hex(__get_dynamic_array(key), __entry->key_len),
@@ -295,6 +304,7 @@ TRACE_EVENT(bpf_map_delete_elem,
 		__dynamic_array(u8, key, map->key_size)
 		__field(bool, key_trunc)
 		__field(int, ufd)
+		__field(u32, id)
 	),
 
 	TP_fast_assign(
@@ -303,9 +313,11 @@ TRACE_EVENT(bpf_map_delete_elem,
 		__entry->key_len   = min(map->key_size, 16U);
 		__entry->key_trunc = map->key_size != __entry->key_len;
 		__entry->ufd       = ufd;
+		__entry->id        = map->id;
 	),
 
-	TP_printk("map type=%s ufd=%d key=[%s%s]",
+	TP_printk("map id=%d type=%s ufd=%d key=[%s%s]",
+		  __entry->id,
 		  __print_symbolic(__entry->type, __MAP_TYPE_SYM_TAB),
 		  __entry->ufd,
 		  __print_hex(__get_dynamic_array(key), __entry->key_len),
@@ -327,6 +339,7 @@ TRACE_EVENT(bpf_map_next_key,
 		__field(bool, key_trunc)
 		__field(bool, key_null)
 		__field(int, ufd)
+		__field(u32, id)
 	),
 
 	TP_fast_assign(
@@ -338,9 +351,11 @@ TRACE_EVENT(bpf_map_next_key,
 		__entry->key_len   = min(map->key_size, 16U);
 		__entry->key_trunc = map->key_size != __entry->key_len;
 		__entry->ufd       = ufd;
+		__entry->id        = map->id;
 	),
 
-	TP_printk("map type=%s ufd=%d key=[%s%s] next=[%s%s]",
+	TP_printk("map id=%d type=%s ufd=%d key=[%s%s] next=[%s%s]",
+		  __entry->id,
 		  __print_symbolic(__entry->type, __MAP_TYPE_SYM_TAB),
 		  __entry->ufd,
 		  __entry->key_null ? "NULL" : __print_hex(__get_dynamic_array(key),

^ permalink raw reply related

* [bpf-next PATCH 2/3] bpf: add id to prog tracepoint
From: Sebastiano Miano @ 2018-04-18 15:30 UTC (permalink / raw)
  To: netdev, ast, daniel; +Cc: mingo, rostedt, brouer, fulvio.risso
In-Reply-To: <152406544226.3465.948692097697975172.stgit@localhost.localdomain>

This patch adds the prog id to the bpf tracepoints
that can be used when monitoring or inspecting prog
related functions.

Signed-off-by: Sebastiano Miano <sebastiano.miano@polito.it>
Suggested-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 include/trace/events/bpf.h |   15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/include/trace/events/bpf.h b/include/trace/events/bpf.h
index d7c9726..4ec19c5 100644
--- a/include/trace/events/bpf.h
+++ b/include/trace/events/bpf.h
@@ -65,16 +65,19 @@ DECLARE_EVENT_CLASS(bpf_prog_event,
 	TP_STRUCT__entry(
 		__array(u8, prog_tag, 8)
 		__field(u32, type)
+		__field(u32, id)
 	),
 
 	TP_fast_assign(
 		BUILD_BUG_ON(sizeof(__entry->prog_tag) != sizeof(prg->tag));
 		memcpy(__entry->prog_tag, prg->tag, sizeof(prg->tag));
 		__entry->type = prg->type;
+		__entry->id   = prg->aux->id;
 	),
 
-	TP_printk("prog=%s type=%s",
+	TP_printk("prog=%s id=%u type=%s",
 		  __print_hex_str(__entry->prog_tag, 8),
+		  __entry->id,
 		  __print_symbolic(__entry->type, __PROG_TYPE_SYM_TAB))
 );
 
@@ -102,6 +105,7 @@ TRACE_EVENT(bpf_prog_load,
 		__array(u8, prog_tag, 8)
 		__field(u32, type)
 		__field(int, ufd)
+		__field(u32, id)
 	),
 
 	TP_fast_assign(
@@ -109,10 +113,12 @@ TRACE_EVENT(bpf_prog_load,
 		memcpy(__entry->prog_tag, prg->tag, sizeof(prg->tag));
 		__entry->type = prg->type;
 		__entry->ufd  = ufd;
+		__entry->id   = prg->aux->id;
 	),
 
-	TP_printk("prog=%s type=%s ufd=%d",
+	TP_printk("prog=%s id=%u type=%s ufd=%d",
 		  __print_hex_str(__entry->prog_tag, 8),
+		  __entry->id,
 		  __print_symbolic(__entry->type, __PROG_TYPE_SYM_TAB),
 		  __entry->ufd)
 );
@@ -161,6 +167,7 @@ DECLARE_EVENT_CLASS(bpf_obj_prog,
 		__array(u8, prog_tag, 8)
 		__field(int, ufd)
 		__string(path, pname->name)
+		__field(u32, id)
 	),
 
 	TP_fast_assign(
@@ -168,10 +175,12 @@ DECLARE_EVENT_CLASS(bpf_obj_prog,
 		memcpy(__entry->prog_tag, prg->tag, sizeof(prg->tag));
 		__assign_str(path, pname->name);
 		__entry->ufd = ufd;
+		__entry->id  = prg->aux->id;
 	),
 
-	TP_printk("prog=%s path=%s ufd=%d",
+	TP_printk("prog=%s id=%u path=%s ufd=%d",
 		  __print_hex_str(__entry->prog_tag, 8),
+		  __entry->id,
 		  __get_str(path), __entry->ufd)
 );
 

^ permalink raw reply related

* [bpf-next PATCH 3/3] bpf: add sample program to trace map events
From: Sebastiano Miano @ 2018-04-18 15:30 UTC (permalink / raw)
  To: netdev, ast, daniel; +Cc: mingo, rostedt, brouer, fulvio.risso
In-Reply-To: <152406544226.3465.948692097697975172.stgit@localhost.localdomain>

This patch adds a sample program, called trace_map_events,
that shows how to capture map events and filter them based on
the map id.

The program accepts a list of map IDs, via the -i command line
option, and filters all the map events related to those IDs (i.e.,
map_create/update/lookup/next_key).
If no IDs are specified, all map events are listed and no filtering
is performed.

Sample usage:

 # trace_map_events -i <map_id1> -i <map_id2> -i <map_id3> ...

Signed-off-by: Sebastiano Miano <sebastiano.miano@polito.it>
---
 samples/bpf/Makefile                |    4 
 samples/bpf/trace_map_events_kern.c |  225 +++++++++++++++++++++++++
 samples/bpf/trace_map_events_user.c |  314 +++++++++++++++++++++++++++++++++++
 3 files changed, 543 insertions(+)
 create mode 100644 samples/bpf/trace_map_events_kern.c
 create mode 100644 samples/bpf/trace_map_events_user.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 4d6a6ed..a7d52b6 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -15,6 +15,7 @@ hostprogs-y += tracex6
 hostprogs-y += tracex7
 hostprogs-y += test_probe_write_user
 hostprogs-y += trace_output
+hostprogs-y += trace_map_events
 hostprogs-y += lathist
 hostprogs-y += offwaketime
 hostprogs-y += spintest
@@ -65,6 +66,7 @@ tracex7-objs := bpf_load.o $(LIBBPF) tracex7_user.o
 load_sock_ops-objs := bpf_load.o $(LIBBPF) load_sock_ops.o
 test_probe_write_user-objs := bpf_load.o $(LIBBPF) test_probe_write_user_user.o
 trace_output-objs := bpf_load.o $(LIBBPF) trace_output_user.o
+trace_map_events-objs := bpf_load.o $(LIBBPF) trace_map_events_user.o
 lathist-objs := bpf_load.o $(LIBBPF) lathist_user.o
 offwaketime-objs := bpf_load.o $(LIBBPF) offwaketime_user.o
 spintest-objs := bpf_load.o $(LIBBPF) spintest_user.o
@@ -111,6 +113,7 @@ always += tracex7_kern.o
 always += sock_flags_kern.o
 always += test_probe_write_user_kern.o
 always += trace_output_kern.o
+always += trace_map_events_kern.o
 always += tcbpf1_kern.o
 always += tcbpf2_kern.o
 always += tc_l2_redirect_kern.o
@@ -171,6 +174,7 @@ HOSTLOADLIBES_test_cgrp2_sock2 += -lelf
 HOSTLOADLIBES_load_sock_ops += -lelf
 HOSTLOADLIBES_test_probe_write_user += -lelf
 HOSTLOADLIBES_trace_output += -lelf -lrt
+HOSTLOADLIBES_trace_map_events += -lelf -lrt
 HOSTLOADLIBES_lathist += -lelf
 HOSTLOADLIBES_offwaketime += -lelf
 HOSTLOADLIBES_spintest += -lelf
diff --git a/samples/bpf/trace_map_events_kern.c b/samples/bpf/trace_map_events_kern.c
new file mode 100644
index 0000000..f887b5b
--- /dev/null
+++ b/samples/bpf/trace_map_events_kern.c
@@ -0,0 +1,225 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * Copyright (C) 2018 Politecnico di Torino, Italy
+ *
+ * Author: Sebastiano Miano <sebastiano.miano@polito.it>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation, version 2 of the
+ * License.
+ */
+
+#include <uapi/linux/bpf.h>
+#include <linux/version.h>
+#include "bpf_helpers.h"
+
+enum map_event_type {
+	MAP_CREATE = 0,
+	MAP_UPDATE = 1,
+	MAP_LOOKUP = 2,
+	MAP_NEXT_KEY = 3
+};
+
+struct map_event_data {
+	u32 map_id;
+	enum map_event_type evnt_type;
+	u32 map_type;
+};
+
+struct bpf_map_def SEC("maps") map_event_trace = {
+	.type = BPF_MAP_TYPE_PERF_EVENT_ARRAY,
+	.key_size = sizeof(int),
+	.value_size = sizeof(u32),
+	.max_entries = 64,
+};
+
+struct bpf_map_def SEC("maps") filtered_ids = {
+	.type = BPF_MAP_TYPE_HASH,
+	.key_size = sizeof(u32),
+	.value_size = sizeof(u32),
+	.max_entries = 64,
+};
+
+struct bpf_map_def SEC("maps") filter_events = {
+	.type = BPF_MAP_TYPE_ARRAY,
+	.key_size = sizeof(u32),
+	.value_size = sizeof(bool),
+	.max_entries = 1,
+};
+
+/*
+ * Tracepoint format: /sys/kernel/debug/tracing/events/bpf/bpf_map_create/format
+ * Code in:                kernel/include/trace/events/bpf.h
+ */
+struct bpf_map_create_ctx {
+	u64 pad;		// First 8 bytes are not accessible by bpf code
+	u32 type;		// offset:8;	size:4;	signed:0;
+	u32 size_key;		// offset:12;	size:4;	signed:0;
+	u32 size_value;		// offset:16;	size:4;	signed:0;
+	u32 max_entries;	// offset:20;	size:4;	signed:0;
+	u32 flags;		// offset:24;	size:4;	signed:0;
+	int ufd;		// offset:28;	size:4;	signed:1;
+	u32 id;			// offset:32;	size:4;	signed:0;
+};
+
+SEC("tracepoint/bpf/bpf_map_create")
+int trace_bpf_map_create(struct bpf_map_create_ctx *ctx)
+{
+	struct map_event_data data;
+	int cpu = bpf_get_smp_processor_id();
+	bool *filter;
+	u32 key = 0, map_id = ctx->id;
+
+	filter = bpf_map_lookup_elem(&filter_events, &key);
+	if (!filter)
+		return 1;
+
+	if (!*filter)
+		goto send_event;
+
+	/*
+	 * If the map_id is not in the list of filtered
+	 * ids we immediately return
+	 */
+	if (!bpf_map_lookup_elem(&filtered_ids, &map_id))
+		return 0;
+
+send_event:
+	data.map_id = map_id;
+	data.evnt_type = MAP_CREATE;
+	data.map_type = ctx->type;
+
+	bpf_perf_event_output(ctx, &map_event_trace, cpu, &data, sizeof(data));
+	return 0;
+}
+
+/*
+ * Tracepoint:	/sys/kernel/debug/tracing/events/bpf/bpf_map_lookup_elem/format
+ * Tracepoint:	/sys/kernel/debug/tracing/events/bpf/bpf_map_update_elem/format
+ * Code in:          kernel/include/trace/events/bpf.h
+ */
+struct bpf_map_keyval_ctx {
+	u64 pad;		// First 8 bytes are not accessible by bpf code
+	u32 type;		// offset:8;	size:4;	signed:0;
+	u32 key_len;		// offset:12;	size:4;	signed:0;
+	u32 key;		// offset:16;	size:4;	signed:0;
+	bool key_trunc;		// offset:20;	size:1;	signed:0;
+	u32 val_len;		// offset:24;	size:4;	signed:0;
+	u32 val;		// offset:28;	size:4;	signed:0;
+	bool val_trunc;		// offset:32;	size:1;	signed:0;
+	int ufd;		// offset:36;	size:4;	signed:1;
+	u32 id;			// offset:40;	size:4;	signed:0;
+};
+
+SEC("tracepoint/bpf/bpf_map_lookup_elem")
+int trace_bpf_map_lookup_elem(struct bpf_map_keyval_ctx *ctx)
+{
+	struct map_event_data data;
+	int cpu = bpf_get_smp_processor_id();
+	bool *filter;
+	u32 key = 0, map_id = ctx->id;
+
+	filter = bpf_map_lookup_elem(&filter_events, &key);
+	if (!filter)
+		return 1;
+
+	if (!*filter)
+		goto send_event;
+
+	/*
+	 * If the map_id is not in the list of filtered
+	 * ids we immediately return
+	 */
+	if (!bpf_map_lookup_elem(&filtered_ids, &map_id))
+		return 0;
+
+send_event:
+	data.map_id = map_id;
+	data.evnt_type = MAP_LOOKUP;
+	data.map_type = ctx->type;
+
+	bpf_perf_event_output(ctx, &map_event_trace, cpu, &data, sizeof(data));
+	return 0;
+}
+
+SEC("tracepoint/bpf/bpf_map_update_elem")
+int trace_bpf_map_update_elem(struct bpf_map_keyval_ctx *ctx)
+{
+	struct map_event_data data;
+	int cpu = bpf_get_smp_processor_id();
+	bool *filter;
+	u32 key = 0, map_id = ctx->id;
+
+	filter = bpf_map_lookup_elem(&filter_events, &key);
+	if (!filter)
+		return 1;
+
+	if (!*filter)
+		goto send_event;
+
+	/*
+	 * If the map_id is not in the list of filtered
+	 * ids we immediately return
+	 */
+	if (!bpf_map_lookup_elem(&filtered_ids, &map_id))
+		return 0;
+
+send_event:
+	data.map_id = map_id;
+	data.evnt_type = MAP_UPDATE;
+	data.map_type = ctx->type;
+
+	bpf_perf_event_output(ctx, &map_event_trace, cpu, &data, sizeof(data));
+	return 0;
+}
+
+/*
+ * Tracepoint:	/sys/kernel/debug/tracing/events/bpf/bpf_map_next_key/format
+ * Code in:          kernel/include/trace/events/bpf.h
+ */
+struct bpf_map_next_key_ctx {
+	u64 pad;		// First 8 bytes are not accessible by bpf code
+	u32 type;		// offset:8;	size:4;	signed:0;
+	u32 key_len;		// offset:12;	size:4;	signed:0;
+	u32 key;		// offset:16;	size:4;	signed:0;
+	u32 nxt;		// offset:20;	size:4;	signed:0;
+	bool key_trunc;		// offset:24;	size:1;	signed:0;
+	bool key_null;		// offset:25;	size:1;	signed:0;
+	int ufd;		// offset:28;	size:4;	signed:1;
+	u32 id;			// offset:32;	size:4;	signed:0;
+};
+
+SEC("tracepoint/bpf/bpf_map_next_key")
+int trace_bpf_map_next_key(struct bpf_map_next_key_ctx *ctx)
+{
+	struct map_event_data data;
+	int cpu = bpf_get_smp_processor_id();
+	bool *filter;
+	u32 key = 0, map_id = ctx->id;
+
+	filter = bpf_map_lookup_elem(&filter_events, &key);
+	if (!filter)
+		return 1;
+
+	if (!*filter)
+		goto send_event;
+
+	/*
+	 * If the map_id is not in the list of filtered
+	 * ids we immediately return
+	 */
+	if (!bpf_map_lookup_elem(&filtered_ids, &map_id))
+		return 0;
+
+send_event:
+	data.map_id = map_id;
+	data.evnt_type = MAP_NEXT_KEY;
+	data.map_type = ctx->type;
+
+	bpf_perf_event_output(ctx, &map_event_trace, cpu, &data, sizeof(data));
+	return 0;
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/samples/bpf/trace_map_events_user.c b/samples/bpf/trace_map_events_user.c
new file mode 100644
index 0000000..bc7447e
--- /dev/null
+++ b/samples/bpf/trace_map_events_user.c
@@ -0,0 +1,314 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * Copyright (C) 2018 Politecnico di Torino, Italy
+ *
+ * Author: Sebastiano Miano <sebastiano.miano@polito.it>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation, version 2 of the
+ * License.
+ */
+
+static const char *__desc__ =
+"Sample program to trace map related events\n"
+"The -i option allows to set the id(s) of the map you are interested in.\n"
+"If no ID is specified, all map events are listed.\n";
+
+#include <stdio.h>
+#include <unistd.h>
+#include <stdlib.h>
+#include <stdbool.h>
+#include <string.h>
+#include <fcntl.h>
+#include <poll.h>
+#include <sys/resource.h>
+#include <linux/perf_event.h>
+#include <linux/bpf.h>
+#include <errno.h>
+#include <assert.h>
+#include <sys/syscall.h>
+#include <sys/ioctl.h>
+#include <sys/epoll.h>
+#include <sys/mman.h>
+#include <time.h>
+#include <signal.h>
+#include <getopt.h>
+
+#include "libbpf.h"
+#include "bpf_load.h"
+#include "bpf_util.h"
+#include "perf-sys.h"
+
+#define MAX_FILTERED_IDS 64
+
+static int *perf_fd;
+
+int epoll_fd;
+int page_size;
+int page_cnt = 8;
+volatile struct perf_event_mmap_page **readers;
+
+typedef void (*event_cb)(void *data, int size);
+
+enum map_event_type {
+	MAP_CREATE = 0,
+	MAP_UPDATE = 1,
+	MAP_LOOKUP = 2,
+	MAP_NEXT_KEY = 3
+};
+
+static void usage(char *argv[])
+{
+	printf("\nDESCRIPTION:\n%s", __desc__);
+	printf("\n");
+	printf(" Usage: %s [-i map_id1] [-i map_id2] ...\n", argv[0]);
+	printf("\n");
+}
+
+static int perf_event_mmap(int fd, int cpu)
+{
+	void *base;
+	int mmap_size;
+
+	page_size = getpagesize();
+	mmap_size = page_size * (page_cnt + 1);
+
+	base = mmap(NULL, mmap_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+	if (base == MAP_FAILED) {
+		printf("mmap err\n");
+		return -1;
+	}
+
+	readers[cpu] = base;
+	return 0;
+}
+
+static void init_bpf_perf_event_on_cpu(int cpu)
+{
+	struct perf_event_attr attr = {
+		.sample_type = PERF_SAMPLE_RAW,
+		.type = PERF_TYPE_SOFTWARE,
+		.config = PERF_COUNT_SW_BPF_OUTPUT,
+		.sample_period = 1,
+		.wakeup_events = 1,
+	};
+	int key = cpu;
+
+	perf_fd[cpu] = sys_perf_event_open(&attr, -1, cpu, -1, 0);
+
+	assert(perf_fd[cpu] >= 0);
+	assert(perf_event_mmap(perf_fd[cpu], cpu) >= 0);
+	assert(ioctl(perf_fd[cpu], PERF_EVENT_IOC_ENABLE, 0) >= 0);
+	assert(bpf_map_update_elem(map_fd[0], &key, &perf_fd[cpu], 0) == 0);
+
+	struct epoll_event e = { .events = EPOLLIN, .data.u32 = cpu };
+
+	assert(epoll_ctl(epoll_fd, EPOLL_CTL_ADD, perf_fd[cpu], &e) == 0);
+}
+
+static int perf_event_poll(int fd, int num_cpus, struct epoll_event *events)
+{
+	return epoll_wait(fd, events, num_cpus, -1);
+}
+
+struct perf_event_sample {
+	struct perf_event_header header;
+	__u32 size;
+	char data[];
+};
+
+static void perf_event_read(event_cb fn, __u32 index)
+{
+	__u64 data_tail = readers[index]->data_tail;
+	__u64 data_head = readers[index]->data_head;
+	__u64 buffer_size = page_cnt * page_size;
+	void *base, *begin, *end;
+	char buf[256];
+
+	asm volatile("" ::: "memory"); /* in real code it should be smp_rmb() */
+	if (data_head == data_tail)
+		return;
+
+	base = ((char *)readers[index]) + page_size;
+
+	begin = base + data_tail % buffer_size;
+	end = base + data_head % buffer_size;
+
+	while (begin != end) {
+		struct perf_event_sample *e;
+
+		e = begin;
+		if (begin + e->header.size > base + buffer_size) {
+			long len = base + buffer_size - begin;
+
+			assert(len < e->header.size);
+			memcpy(buf, begin, len);
+			memcpy(buf + len, base, e->header.size - len);
+			e = (void *) buf;
+			begin = base + e->header.size - len;
+		} else if (begin + e->header.size == base + buffer_size) {
+			begin = base;
+		} else {
+			begin += e->header.size;
+		}
+
+		if (e->header.type == PERF_RECORD_SAMPLE) {
+			fn(e->data, e->size);
+		} else if (e->header.type == PERF_RECORD_LOST) {
+			struct {
+				struct perf_event_header header;
+				__u64 id;
+				__u64 lost;
+			} *lost = (void *) e;
+			printf("lost %lld events\n", lost->lost);
+		} else {
+			printf("unknown event type=%d size=%d\n",
+			       e->header.type, e->header.size);
+		}
+	}
+
+	__sync_synchronize(); /* smp_mb() */
+	readers[index]->data_tail = data_head;
+}
+
+static const char *get_event_type(enum map_event_type event)
+{
+	switch (event) {
+	case MAP_CREATE:
+		return "CREATE";
+	case MAP_LOOKUP:
+		return "LOOKUP";
+	case MAP_UPDATE:
+		return "UPDATE";
+	case MAP_NEXT_KEY:
+		return "NEXT_KEY";
+	}
+
+	return "UNKNOWN";
+}
+
+
+static void map_event_callback(void *data, int size)
+{
+	struct {
+		__u32 map_id;
+		enum map_event_type event_type;
+		__u32 map_type;
+	} *e = data;
+
+	printf("%s event for map id: %d and type: %d\n",
+	       get_event_type(e->event_type), e->map_id, e->map_type);
+}
+
+static bool init_filtered_ids_map(int num_ids, int *filtered_ids)
+{
+	int i, key, value;
+	bool filtering = false;
+	/*
+	 * I am going to put the IDs in the map. Only event related to those IDs
+	 * will be shown. The key indicates the ID of the map while the value
+	 * is not used and then is set to 0.
+	 */
+	for (i = 0; i < num_ids; i++) {
+		key = filtered_ids[i];
+		value = 0;
+		if (bpf_map_update_elem(map_fd[1], &key, &value, 0) != 0) {
+			fprintf(stderr,
+			"ERR: bpf_map_update_elem failed key:0x%X\n", key);
+		return false;
+		}
+	}
+
+	if (num_ids > 0)
+		filtering = true;
+
+	key = 0;
+	assert(bpf_map_update_elem(map_fd[2], &key, &filtering, BPF_ANY) == 0);
+	return true;
+}
+
+static bool init_perf_buffer_data_structures(int nr_cpus)
+{
+	int i;
+
+	perf_fd = malloc(sizeof(int) * nr_cpus);
+	assert(perf_fd);
+	readers = malloc(sizeof(*readers) * nr_cpus);
+	assert(readers);
+
+	epoll_fd = epoll_create1(EPOLL_CLOEXEC);
+
+	for (i = 0; i < nr_cpus; i++) {
+		printf("Init bpf_perf_event for cpu:%d\n", i);
+		init_bpf_perf_event_on_cpu(i);
+	}
+
+	return true;
+}
+
+int main(int argc, char **argv)
+{
+	struct rlimit r = {RLIM_INFINITY, RLIM_INFINITY};
+	int i, cnt, opt, ret = EXIT_SUCCESS;
+	char bpf_obj_file[256];
+	int num_ids = 0, nr_cpus = bpf_num_possible_cpus();
+	int filtered_ids[MAX_FILTERED_IDS];
+
+	snprintf(bpf_obj_file, sizeof(bpf_obj_file), "%s_kern.o", argv[0]);
+
+	/* Parse commands line args */
+	while ((opt = getopt(argc, argv, "hi:")) != -1) {
+		switch (opt) {
+		case 'i':
+			if (num_ids == MAX_FILTERED_IDS) {
+				printf("Reached maximum number of IDs");
+				return EXIT_FAILURE;
+			}
+			i = atoi(optarg);
+			if (!i)
+				printf("ERROR - Invalid id %s", optarg);
+			else
+				filtered_ids[num_ids++] = i;
+			break;
+		case 'h':
+		default:
+			usage(argv);
+			return EXIT_FAILURE;
+		}
+	}
+
+	if (setrlimit(RLIMIT_MEMLOCK, &r)) {
+		perror("setrlimit(RLIMIT_MEMLOCK)");
+		return EXIT_FAILURE;
+	}
+
+	if (load_bpf_file(bpf_obj_file)) {
+		printf("ERROR - bpf_log_buf: %s", bpf_log_buf);
+		return EXIT_FAILURE;
+	}
+
+	if (!prog_fd[0]) {
+		printf("ERROR - load_bpf_file: %s\n", strerror(errno));
+		return EXIT_FAILURE;
+	}
+
+	init_filtered_ids_map(num_ids, filtered_ids);
+	init_perf_buffer_data_structures(nr_cpus);
+
+	struct epoll_event *events = calloc(nr_cpus, sizeof(*events));
+
+	while (true) {
+		printf("Waiting for map events...\n");
+		cnt = perf_event_poll(epoll_fd, nr_cpus, events);
+		for (i = 0; i < cnt; i++)
+			perf_event_read(map_event_callback, events[i].data.u32);
+	}
+
+	free(perf_fd);
+	free(readers);
+	free(events);
+
+	return ret;
+}

^ permalink raw reply related

* Re: [PATCH] net: don't use kvzalloc for DMA memory
From: Eric Dumazet @ 2018-04-18 16:05 UTC (permalink / raw)
  To: Mikulas Patocka, David S. Miller, Eric Dumazet
  Cc: Joby Poriyath, Ben Hutchings, netdev, linux-kernel
In-Reply-To: <alpine.LRH.2.02.1804181029270.19294@file01.intranet.prod.int.rdu2.redhat.com>



On 04/18/2018 07:34 AM, Mikulas Patocka wrote:
> The patch 74d332c13b21 changes alloc_netdev_mqs to use vzalloc if kzalloc
> fails (later patches change it to kvzalloc).
> 
> The problem with this is that if the vzalloc function is actually used, 
> virtio_net doesn't work (because it expects that the extra memory should 
> be accessible with DMA-API and memory allocated with vzalloc isn't).
> 
> This patch changes it back to kzalloc and adds a warning if the allocated
> size is too large (the allocation is unreliable in this case).
> 
> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
> Fixes: 74d332c13b21 ("net: extend net_device allocation to vmalloc()")
> 
> ---
>  net/core/dev.c |    3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> Index: linux-2.6/net/core/dev.c
> ===================================================================
> --- linux-2.6.orig/net/core/dev.c	2018-04-16 21:08:36.000000000 +0200
> +++ linux-2.6/net/core/dev.c	2018-04-18 16:24:43.000000000 +0200
> @@ -8366,7 +8366,8 @@ struct net_device *alloc_netdev_mqs(int
>  	/* ensure 32-byte alignment of whole construct */
>  	alloc_size += NETDEV_ALIGN - 1;
>  
> -	p = kvzalloc(alloc_size, GFP_KERNEL | __GFP_RETRY_MAYFAIL);
> +	WARN_ON(alloc_size > PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER);
> +	p = kzalloc(alloc_size, GFP_KERNEL | __GFP_RETRY_MAYFAIL);
>  	if (!p)
>  		return NULL;
>  
> 

Since when a net_device needs to be in DMA zone ???

I would rather fix virtio_net, this looks very suspect to me.

Each virtio_net should probably allocate the exact amount of DMA-memory it wants,
instead of expecting core networking stack to have a huge chunk of DMA-memory for everything.

^ permalink raw reply

* Re: [PATCH v2 2/3] lan78xx: Read LED states from Device Tree
From: Andrew Lunn @ 2018-04-18 16:11 UTC (permalink / raw)
  To: Phil Elwell
  Cc: Woojung Huh, Microchip Linux Driver Support, Rob Herring,
	Mark Rutland, David S. Miller, Mauro Carvalho Chehab,
	Greg Kroah-Hartman, Linus Walleij, Andrew Morton, Randy Dunlap,
	netdev, devicetree, linux-kernel, linux-usb
In-Reply-To: <1524066323-109628-3-git-send-email-phil@raspberrypi.org>

On Wed, Apr 18, 2018 at 04:45:22PM +0100, Phil Elwell wrote:
> Add support for DT property "microchip,led-modes", a vector of zero
> to four cells (u32s) in the range 0-15, each of which sets the mode
> for one of the LEDs. Some possible values are:
> 
>     0=link/activity          1=link1000/activity
>     2=link100/activity       3=link10/activity
>     4=link100/1000/activity  5=link10/1000/activity
>     6=link10/100/activity    14=off    15=on
> 
> These values are given symbolic constants in a dt-bindings header.
> 
> Also use the presence of the DT property to indicate that the
> LEDs should be enabled - necessary in the event that no valid OTP
> or EEPROM is available.

Hi Phil

As i said last week, these are PHY properties, so should be in the PHY
node in device tree. It should be the PHY driver which parses these
properties and configures the LEDs, not the MAC.

	   Andrew

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox