Netdev List
* [PATCH 2/5 net-next] inet: kill smallest_size and smallest_port
From: Josef Bacik @ 2016-12-20 20:07 UTC (permalink / raw)
  To: davem, hannes, kraigatgoog, eric.dumazet, tom, netdev,
	kernel-team
In-Reply-To: <1482264424-15439-1-git-send-email-jbacik@fb.com>

In inet_csk_get_port we seem to be using smallest_port to figure out where the
best place to look for a SO_REUSEPORT sk that matches with an existing set of
SO_REUSEPORT's.  However if we get to the logic

if (smallest_size != -1) {
	port = smallest_port;
	goto have_port;
}

we will do a useless search, because we would have already done the
inet_csk_bind_conflict for that port and it would have returned 1, otherwise we
would have gone to tb_found and succeeded.  Since this logic makes us do yet
another trip through inet_csk_bind_conflict for a port we know won't work, just
delete this code and save us the time.

Signed-off-by: Josef Bacik <jbacik@fb.com>
---
 net/ipv4/inet_connection_sock.c | 26 ++++----------------------
 1 file changed, 4 insertions(+), 22 deletions(-)

diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 74f6a57..1a1a94bd 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -93,7 +93,6 @@ int inet_csk_get_port(struct sock *sk, unsigned short snum)
 	bool reuse = sk->sk_reuse && sk->sk_state != TCP_LISTEN;
 	struct inet_hashinfo *hinfo = sk->sk_prot->h.hashinfo;
 	int ret = 1, attempts = 5, port = snum;
-	int smallest_size = -1, smallest_port;
 	struct inet_bind_hashbucket *head;
 	struct net *net = sock_net(sk);
 	int i, low, high, attempt_half;
@@ -103,7 +102,6 @@ int inet_csk_get_port(struct sock *sk, unsigned short snum)
 	bool reuseport_ok = !!snum;
 
 	if (port) {
-have_port:
 		head = &hinfo->bhash[inet_bhashfn(net, port,
 						  hinfo->bhash_size)];
 		spin_lock_bh(&head->lock);
@@ -137,8 +135,6 @@ other_half_scan:
 	 * We do the opposite to not pollute connect() users.
 	 */
 	offset |= 1U;
-	smallest_size = -1;
-	smallest_port = low; /* avoid compiler warning */
 
 other_parity_scan:
 	port = low + offset;
@@ -152,15 +148,6 @@ other_parity_scan:
 		spin_lock_bh(&head->lock);
 		inet_bind_bucket_for_each(tb, &head->chain)
 			if (net_eq(ib_net(tb), net) && tb->port == port) {
-				if (((tb->fastreuse > 0 && reuse) ||
-				     (tb->fastreuseport > 0 &&
-				      sk->sk_reuseport &&
-				      !rcu_access_pointer(sk->sk_reuseport_cb) &&
-				      uid_eq(tb->fastuid, uid))) &&
-				    (tb->num_owners < smallest_size || smallest_size == -1)) {
-					smallest_size = tb->num_owners;
-					smallest_port = port;
-				}
 				if (!inet_csk_bind_conflict(sk, tb, false, reuseport_ok))
 					goto tb_found;
 				goto next_port;
@@ -171,10 +158,6 @@ next_port:
 		cond_resched();
 	}
 
-	if (smallest_size != -1) {
-		port = smallest_port;
-		goto have_port;
-	}
 	offset--;
 	if (!(offset & 1))
 		goto other_parity_scan;
@@ -196,19 +179,18 @@ tb_found:
 		if (sk->sk_reuse == SK_FORCE_REUSE)
 			goto success;
 
-		if (((tb->fastreuse > 0 && reuse) ||
+		if ((tb->fastreuse > 0 && reuse) ||
 		     (tb->fastreuseport > 0 &&
 		      !rcu_access_pointer(sk->sk_reuseport_cb) &&
-		      sk->sk_reuseport && uid_eq(tb->fastuid, uid))) &&
-		    smallest_size == -1)
+		      sk->sk_reuseport && uid_eq(tb->fastuid, uid)))
 			goto success;
 		if (inet_csk_bind_conflict(sk, tb, true, reuseport_ok)) {
 			if ((reuse ||
 			     (tb->fastreuseport > 0 &&
 			      sk->sk_reuseport &&
 			      !rcu_access_pointer(sk->sk_reuseport_cb) &&
-			      uid_eq(tb->fastuid, uid))) &&
-			    !snum && smallest_size != -1 && --attempts >= 0) {
+			      uid_eq(tb->fastuid, uid))) && !snum &&
+			    --attempts >= 0) {
 				spin_unlock_bh(&head->lock);
 				goto again;
 			}
-- 
2.9.3

^ permalink raw reply related
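
The argument in the commit message of patch 2/5 can be illustrated with a toy
model.  This is only a sketch, not kernel code: scan_old, scan_new and the
conflicts callback are hypothetical names, conflicts stands in for
inet_csk_bind_conflict, and it is assumed to give the same answer every time it
is asked about a port, which is the premise the commit message relies on.  The
real code also tracked num_owners when picking smallest_port; that detail is
left out here.

```python
# Toy model of the port scan, before and after the patch (not kernel code).

def scan_old(ports, in_use, conflicts):
    """Old logic: remember a fallback port and retry it after the scan."""
    checks = 0
    smallest_port = None
    for port in ports:
        if port not in in_use:
            return port, checks           # no bucket for this port, take it
        if smallest_port is None:
            smallest_port = port          # remembered as the fallback
        checks += 1
        if not conflicts(port):
            return port, checks           # corresponds to "goto tb_found"
    if smallest_port is not None:
        checks += 1                       # the extra trip the patch removes
        if not conflicts(smallest_port):
            return smallest_port, checks
    return None, checks


def scan_new(ports, in_use, conflicts):
    """New logic: the same scan without the smallest_port fallback."""
    checks = 0
    for port in ports:
        if port not in in_use:
            return port, checks
        checks += 1
        if not conflicts(port):
            return port, checks
    return None, checks


ports = range(1024, 1030)                 # hypothetical port range
in_use = set(ports)                       # every port already has a bucket

# Worst case: every port conflicts, so the fallback cannot succeed either.
old_result = scan_old(ports, in_use, lambda port: True)
new_result = scan_new(ports, in_use, lambda port: True)

# When some port is usable, both versions find it during the scan itself.
old_hit = scan_old(ports, in_use, lambda port: port != 1027)
new_hit = scan_new(ports, in_use, lambda port: port != 1027)
```

Under that assumption both versions always pick the same port, and the old one
only ever adds one more conflict check in the case where the scan has already
failed.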

* [RFC][PATCH 0/5 net-next] Rework inet_csk_get_port
From: Josef Bacik @ 2016-12-20 20:06 UTC (permalink / raw)
  To: davem, hannes, kraigatgoog, eric.dumazet, tom, netdev,
	kernel-team

At some point recently the guys working on our load balancer added the ability
to use SO_REUSEPORT.  When they restarted their app with this option enabled
they immediately hit a softlockup on what appeared to be the
inet_bind_bucket->lock.  Eventually what all of our debugging and discussion led
us to was the fact that the application comes up without SO_REUSEPORT, shuts
down which creates around 100k twsk's, and then comes up and tries to open a
bunch of sockets using SO_REUSEPORT, which meant traversing the inet_bind_bucket
owners list under the lock.  Since this lock is needed for dealing with the
twsk's and basically anything else related to connections we would softlockup,
and sometimes not ever recover.

To solve this problem I did what you see in Patch 5/5.  Once we have a
SO_REUSEPORT socket on the tb->owners list we know that the socket has no
conflicts with any of the other sockets on that list.  So we can add a copy of
the sock_common (really all we need is the recv_saddr but it seemed ugly to copy
just the ipv6, ipv4, and flag to indicate if we were ipv6 only in there so I've
copied the whole common) in order to check subsequent SO_REUSEPORT sockets.  If
they match the previous one then we can skip the expensive
inet_csk_bind_conflict check.  This is what eliminated the soft lockup that we
were seeing.

Patches 1-4 are cleanups and re-workings.  For instance when we specify port ==
0 we need to find an open port, but we would do two passes through
inet_csk_bind_conflict every time we found a possible port.  We would also keep
track of the smallest_port value in order to try and use it if we found no
port our first run through.  This however made no sense as it would have had to
fail the first pass through inet_csk_bind_conflict, so would not actually pass
the second pass through either.  Finally I split the function into two functions
in order to make it easier to read and to distinguish between the two behaviors.

I have tested this on one of our load balancing boxes during peak traffic and it
hasn't fallen over.  But this is not my area, so obviously feel free to point
out where I'm being stupid and I'll get it fixed up and retested.  Thanks,

Josef

^ permalink raw reply
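
The fast path described in the cover letter can be sketched roughly as follows.
This is a simplified model under stated assumptions, not the code from patch
5/5: Sock, BindBucket, cached_key and the conflict rule (same address, and not
both SO_REUSEPORT) are all hypothetical, and uid, TIME_WAIT and device checks
are ignored.  It only shows the idea of remembering one SO_REUSEPORT socket
that is known to be conflict-free and comparing later sockets against it
instead of walking the owners list.

```python
from collections import namedtuple

# Hypothetical stand-in for the fields of sock_common that matter here.
Sock = namedtuple("Sock", "addr ipv6only reuseport")


def conflict(a, b):
    """Simplified rule: same address conflicts unless both use SO_REUSEPORT."""
    return a.addr == b.addr and not (a.reuseport and b.reuseport)


class BindBucket:
    """Hypothetical stand-in for an inet_bind_bucket."""

    def __init__(self):
        self.owners = []        # sockets bound to this port
        self.cached_key = None  # (addr, ipv6only) of a conflict-free SO_REUSEPORT socket
        self.full_walks = 0     # how often the owners list was traversed

    def _full_conflict_check(self, sock):
        # Stands in for the expensive inet_csk_bind_conflict walk.
        self.full_walks += 1
        return any(conflict(sock, other) for other in self.owners)

    def bind(self, sock):
        key = (sock.addr, sock.ipv6only)
        if sock.reuseport and self.cached_key == key:
            pass                # fast path: matches a socket already known to fit
        elif self._full_conflict_check(sock):
            return False
        self.owners.append(sock)
        if sock.reuseport:
            self.cached_key = key
        return True


bucket = BindBucket()
# Fill the bucket with many unrelated owners (distinct, made-up addresses).
for i in range(1000):
    bucket.bind(Sock("10.0.%d.%d" % (i // 250, i % 250), False, False))
baseline = bucket.full_walks

# Ten SO_REUSEPORT sockets on the same address: only the first walks the list.
results = [bucket.bind(Sock("192.0.2.1", False, True)) for _ in range(10)]
extra_walks = bucket.full_walks - baseline

# A socket without SO_REUSEPORT on that address still takes the slow path.
rejected = bucket.bind(Sock("192.0.2.1", False, False))
```

In this model the cached comparison is safe because every later owner was
itself checked against the cached socket when it was added.  Whether the same
reasoning covers every case in the real conflict check is exactly what the
review of the series needs to establish.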

* Re: nfc: trf7970a: Prevent repeated polling from crashing the kernel
From: Mark Greer @ 2016-12-20 19:56 UTC (permalink / raw)
  To: Justin Bronder
  Cc: Geoff Lansberry, linux-wireless-u79uwXL29TY76Z2rM5mHXA,
	lauro.venancio-430g2QfJUUCGglJvpFV4uA,
	aloisio.almeida-430g2QfJUUCGglJvpFV4uA,
	sameo-VuQAYsv1563Yd54FQh9/CA, robh+dt-DgEjT+Ai2ygdnm+yROfE0A,
	mark.rutland-5wv7dgnIgG8, netdev-u79uwXL29TY76Z2rM5mHXA,
	devicetree-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Jaret Cantu
In-Reply-To: <20161220191352.GB23496-WrO9gjaJVAZ9BIO3fBtL+FdjMaeQq/Z1QQ4Iyu8u01E@public.gmane.org>

On Tue, Dec 20, 2016 at 02:13:52PM -0500, Justin Bronder wrote:
> On 20/12/16 11:59 -0700, Mark Greer wrote:
> > On Tue, Dec 20, 2016 at 11:16:32AM -0500, Geoff Lansberry wrote:
> > > From: Jaret Cantu <jaret.cantu-jEh4hwF5bVhBDgjK7y7TUQ@public.gmane.org>
> > > 
> > > Repeated polling attempts cause a NULL dereference error to occur.
> > > This is because the state of the trf7970a is currently reading but
> > > another request has been made to send a command before it has finished.
> > 
> > How is this happening?  Was trf7970a_abort_cmd() called and it didn't
> > work right?  Was it not called at all and there is a bug in the digital
> > layer?  More details please.
> > 
> > > The solution is to properly kill the waiting reading (workqueue)
> > > before failing on the send.
> > 
> > If the bug is in the calling code, then that is what should get fixed.
> > This seems to be a hack to work-around a digital layer bug.
> 
> One of our uses of NFC is to begin polling to read a tag and then stop polling
> (in order to save power) until we know via user interaction that we need to poll
> again.  This is typically many minutes later so the power saving is pretty
> significant.  However, it's possible that a user will remove the tag before
> reading has completed.  We also detect this case and stop polling.  I can go
> more into this if necessary but that is what exposed a panic.
> 
> You can reproduce using neard and python; in our testing it was very likely to
> occur in 10-100 iterations of the following:
> 
>     #!/usr/bin/python
>     import time
> 
>     import dbus
> 
>     bus = dbus.SystemBus()
>     nfc0 = bus.get_object('org.neard', '/org/neard/nfc0')
>     props = dbus.Interface(nfc0, 'org.freedesktop.DBus.Properties')
> 
>     try:
>         props.Set('org.neard.Adapter', 'Powered', dbus.Boolean(1))
>     except:
>         pass
> 
>     adapter = dbus.Interface(nfc0, 'org.neard.Adapter')
> 
>     for i in range(1000):
>         adapter.StartPollLoop('Initiator')
>         time.sleep(0.1)
>         adapter.StopPollLoop()
>         print(i)
> 
> I believe the last time we tested this was around the 4.1 release.

Thanks for the info, Justin, but I was also seeking more information
at the kernel NFC subsystem and trf7970a driver level.  This patch
adds code inside an 'if' in the driver whose condition should never
evaluate to true but apparently it did.  How?

Thanks,

Mark

^ permalink raw reply

* Re: ipv6: handle -EFAULT from skb_copy_bits
From: Dave Jones @ 2016-12-20 19:34 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20161220.132813.435056880928769245.davem@davemloft.net>

On Tue, Dec 20, 2016 at 01:28:13PM -0500, David Miller wrote:
 
 > This has to do with the SKB buffer layout and geometry, not whether
 > the packet is "fragmented" in the protocol sense.
 > 
 > So no, this isn't a criterion for packets being filtered out by this
 > point.
 > 
 > Can you try to capture what sk->sk_socket->type and
 > inet_sk(sk)->hdrincl are set to at the time of the crash?
 > 

type:3 hdrincl:0

	Dave

^ permalink raw reply

* Re: ipv6: handle -EFAULT from skb_copy_bits
From: Cong Wang @ 2016-12-20 19:31 UTC (permalink / raw)
  To: Dave Jones; +Cc: David Miller, Linux Kernel Network Developers
In-Reply-To: <20161220181728.dd2cynjwrceruwcu@codemonkey.org.uk>

On Tue, Dec 20, 2016 at 10:17 AM, Dave Jones <davej@codemonkey.org.uk> wrote:
> On Mon, Dec 19, 2016 at 08:36:23PM -0500, David Miller wrote:
>  > From: Dave Jones <davej@codemonkey.org.uk>
>  > Date: Mon, 19 Dec 2016 19:40:13 -0500
>  >
>  > > On Mon, Dec 19, 2016 at 07:31:44PM -0500, Dave Jones wrote:
>  > >
>  > >  > Unfortunately, this made no difference.  I spent some time today trying
>  > >  > to make a better reproducer, but failed. I'll revisit again tomorrow.
>  > >  >
>  > >  > Maybe I need >1 process/thread to trigger this.  That would explain why
>  > >  > I can trigger it with Trinity.
>  > >
>  > > scratch that last part, I finally just repro'd it with a single process.
>  >
>  > Thanks for the info, I'll try to think about this some more.
>
> I threw in some debug printks right before that BUG_ON.
> it's always this:
>
> skb->len=31 skb->data_len=0 offset:30 total_len:9

Clearly we fail because 30 > 31 - 2; it seems 'offset' is not correct here.
Off-by-one?

^ permalink raw reply
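
The arithmetic in the reply above can be spelled out with a small sketch.  It
assumes the failing call copies 2 bytes (inferred from the "31 - 2" in the
message) and that the copy is rejected when offset > len - length, which is how
I understand the range check at the top of skb_copy_bits.  The function name
copy_would_fault is made up for illustration.

```python
def copy_would_fault(skb_len, offset, length):
    """Model of the range check: copying `length` bytes starting at `offset`
    must not run past the end of a buffer holding `skb_len` bytes."""
    return offset > skb_len - length


# Values from the debug printk: skb->len=31, offset:30, assumed 2-byte copy.
reported = copy_would_fault(skb_len=31, offset=30, length=2)

# One byte earlier the same copy would fit (bytes 29 and 30 of 0..30).
one_less = copy_would_fault(skb_len=31, offset=29, length=2)
```

With the reported values the check fails (30 > 29), while an offset of 29 would
pass, which is consistent with the off-by-one suspicion.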

* Re: [PATCH net] be2net: Increase skb headroom size to 256 bytes
From: David Miller @ 2016-12-20 19:30 UTC (permalink / raw)
  To: suresh.reddy; +Cc: netdev, kalesh-anakkur.purayil
In-Reply-To: <20161220151430.11115-1-suresh.reddy@broadcom.com>

From: Suresh Reddy <suresh.reddy@broadcom.com>
Date: Tue, 20 Dec 2016 10:14:30 -0500

> From: Kalesh A P <kalesh-anakkur.purayil@broadcom.com>
> 
> The driver currently allocates 128 bytes of skb headroom.
> This was found to be insufficient with some configurations
> like Geneve tunnels, which resulted in skb head reallocations.
> 
> Increase the headroom to 256 bytes to fix this.
> 
> Signed-off-by: Kalesh A P <kalesh-anakkur.purayil@broadcom.com>
> Signed-off-by: Suresh Reddy <suresh.reddy@broadcom.com>

Adding 128 bytes of headroom just for geneve seems excessive.

Do you really need to add that much?

^ permalink raw reply

* Re: [PATCH] net/mlx5: use rb_entry()
From: David Miller @ 2016-12-20 19:23 UTC (permalink / raw)
  To: geliangtang; +Cc: saeedm, matanb, leonro, netdev, linux-rdma, linux-kernel
In-Reply-To: <8443fa3fa03d82c2b829375d8020762e5236dc6d.1482203930.git.geliangtang@gmail.com>

From: Geliang Tang <geliangtang@gmail.com>
Date: Tue, 20 Dec 2016 22:02:14 +0800

> To make the code clearer, use rb_entry() instead of container_of() to
> deal with rbtree.
> 
> Signed-off-by: Geliang Tang <geliangtang@gmail.com>

Applied.

^ permalink raw reply

* Re: [PATCH] RDS: use rb_entry()
From: David Miller @ 2016-12-20 19:23 UTC (permalink / raw)
  To: geliangtang
  Cc: santosh.shilimkar, netdev, linux-rdma, rds-devel, linux-kernel
In-Reply-To: <2cd84448fe04ffb7023be892c5ed04bbfc759c87.1482204342.git.geliangtang@gmail.com>

From: Geliang Tang <geliangtang@gmail.com>
Date: Tue, 20 Dec 2016 22:02:18 +0800

> To make the code clearer, use rb_entry() instead of container_of() to
> deal with rbtree.
> 
> Signed-off-by: Geliang Tang <geliangtang@gmail.com>

Applied.

^ permalink raw reply

* Re: [PATCH] net_sched: sch_netem: use rb_entry()
From: David Miller @ 2016-12-20 19:23 UTC (permalink / raw)
  To: geliangtang; +Cc: stephen, jhs, netem, netdev, linux-kernel
In-Reply-To: <dc6e312e4e2ff3868d1791032c329477d27ba50b.1482204694.git.geliangtang@gmail.com>

From: Geliang Tang <geliangtang@gmail.com>
Date: Tue, 20 Dec 2016 22:02:16 +0800

> To make the code clearer, use rb_entry() instead of container_of() to
> deal with rbtree.
> 
> Signed-off-by: Geliang Tang <geliangtang@gmail.com>

Applied.

^ permalink raw reply

* Re: [PATCH] net_sched: sch_fq: use rb_entry()
From: David Miller @ 2016-12-20 19:23 UTC (permalink / raw)
  To: geliangtang; +Cc: jhs, netdev, linux-kernel
In-Reply-To: <6aa180710d70b09d4f81d3d219b3161077e1ff11.1482204526.git.geliangtang@gmail.com>

From: Geliang Tang <geliangtang@gmail.com>
Date: Tue, 20 Dec 2016 22:02:15 +0800

> To make the code clearer, use rb_entry() instead of container_of() to
> deal with rbtree.
> 
> Signed-off-by: Geliang Tang <geliangtang@gmail.com>

Applied.

^ permalink raw reply

* Re: [mm PATCH 0/3] Page fragment updates
From: Alexander Duyck @ 2016-12-20 19:22 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, Netdev, Eric Dumazet, David Miller, Jeff Kirsher,
	linux-kernel@vger.kernel.org
In-Reply-To: <20161205121131.3c1d9ad8452d5e09247336e4@linux-foundation.org>

On Mon, Dec 5, 2016 at 12:11 PM, Andrew Morton
<akpm@linux-foundation.org> wrote:
> On Mon, 5 Dec 2016 09:01:12 -0800 Alexander Duyck <alexander.duyck@gmail.com> wrote:
>
>> On Tue, Nov 29, 2016 at 10:23 AM, Alexander Duyck
>> <alexander.duyck@gmail.com> wrote:
>> > This patch series takes care of a few cleanups for the page fragments API.
>> >
>> > ...
>>
>> It's been about a week since I submitted this series.  Just wanted to
>> check in and see if anyone had any feedback or if this is good to be
>> accepted for 4.10-rc1 with the rest of the set?
>
> Looks good to me.  I have it all queued for post-4.9 processing.

So I guess there is a small bug in the first patch in that I was
comparing a pointer to 0 instead of NULL.  Just wondering if I
should resubmit the first patch, the whole series, or if I need to
just submit an incremental patch.

Thanks.

- Alex


^ permalink raw reply

* Re: [PATCH] ethernet: sfc: Add Kconfig entry for vendor Solarflare
From: David Miller @ 2016-12-20 19:20 UTC (permalink / raw)
  To: tklauser; +Cc: netdev, linux-net-drivers, ecree, bkenward
In-Reply-To: <20161220133826.1478-1-tklauser@distanz.ch>

From: Tobias Klauser <tklauser@distanz.ch>
Date: Tue, 20 Dec 2016 14:38:26 +0100

> Since commit
> 
>   5a6681e22c14 ("sfc: separate out SFC4000 ("Falcon") support into new sfc-falcon driver")
> 
> there are two drivers for Solarflare devices, but both still show up
> directly beneath "Ethernet driver support" in the Kconfig. Follow the
> pattern of other vendors and group them beneath their own vendor Kconfig
> entry for Solarflare.
> 
> Cc: Edward Cree <ecree@solarflare.com>
> Signed-off-by: Tobias Klauser <tklauser@distanz.ch>

Applied.

^ permalink raw reply

* Re: [GIT PULL 00/29] perf/core improvements and fixes
From: Ingo Molnar @ 2016-12-20 19:15 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Kim Phillips, Alexander Shishkin, Adrian Hunter, Andi Kleen,
	Paul Mackerras, Jiri Olsa, Daniel Borkmann, Alexei Starovoitov,
	Ravi Bangoria, Peter Zijlstra, Naveen N . Rao,
	Markus Trippelsdorf, Taeung Song, Wang Nan, Joe Stringer,
	Nicholas Piggin, Arnaldo Carvalho de Melo, Namhyung Kim,
	Kyle McMartin, Kan Liang, Chris Riyder, linux-kernel,
	Davidlohr Bueso, Masami
In-Reply-To: <20161220170358.4350-1-acme@kernel.org>


* Arnaldo Carvalho de Melo <acme@kernel.org> wrote:

> Hi Ingo,
> 
>         Please consider pulling, I had most of this queued before your first
> pull req to Linus for 4.10, most are fixes, with 'perf sched timehist --idle'
> as a followup new feature to the 'perf sched timehist' command introduced in
> this window.
> 	
> 	One other thing that delayed this was the samples/bpf/ switch to
> tools/lib/bpf/ that involved fixing up merge clashes with net.git and also
> to properly test it, after more rounds than anticipated, but all seems ok
> now and would be good to get these merge issues past us ASAP.
> 
> - Arnaldo
> 
> Test results at the end of this message, as usual.
> 
> The following changes since commit e7aa8c2eb11ba69b1b69099c3c7bd6be3087b0ba:
> 
>   Merge tag 'docs-4.10' of git://git.lwn.net/linux (2016-12-12 21:58:13 -0800)
> 
> are available in the git repository at:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git tags/perf-core-for-mingo-20161220
> 
> for you to fetch changes up to 9899694a7f67714216665b87318eb367e2c5c901:
> 
>   samples/bpf: Move open_raw_sock to separate header (2016-12-20 12:00:40 -0300)
> 
> ----------------------------------------------------------------
> perf/core improvements and fixes:
> 
> New features:
> 
> - Introduce 'perf sched timehist --idle', to analyse processes
>   going to/from idle state (Namhyung Kim)
> 
> Fixes:
> 
> - Allow 'perf record -u user' to continue when facing races with threads
>   going away after having scanned them via /proc (Jiri Olsa)
> 
> - Fix 'perf mem' --all-user/--all-kernel options (Jiri Olsa)
> 
> - Support jumps with multiple arguments (Ravi Bangoria)
> 
> - Fix jumps to before the function where they are located (Ravi
> Bangoria)
> 
> - Fix lock-pi help string (Davidlohr Bueso)
> 
> - Fix build of 'perf trace' in odd systems such as a RHEL PPC one (Jiri Olsa)
> 
> - Do not overwrite valid build id in 'perf diff' (Kan Liang)
> 
> - Don't throw error for zero length symbols, allowing the use of the TUI
>   in PowerPC, where such symbols became more common recently (Ravi Bangoria)
> 
> Infrastructure:
> 
> - Switch of samples/bpf/ to use tools/lib/bpf, removing libbpf
>   duplication (Joe Stringer)
> 
> - Move headers check into bash script (Jiri Olsa)
> 
> Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
> 
> ----------------------------------------------------------------
> Arnaldo Carvalho de Melo (3):
>       perf tools: Remove some needless __maybe_unused
>       samples/bpf: Make perf_event_read() static
>       samples/bpf: Be consistent with bpf_load_program bpf_insn parameter
> 
> Davidlohr Bueso (1):
>       perf bench futex: Fix lock-pi help string
> 
> Jiri Olsa (7):
>       perf tools: Move headers check into bash script
>       perf mem: Fix --all-user/--all-kernel options
>       perf evsel: Use variable instead of repeating lengthy FD macro
>       perf thread_map: Add thread_map__remove function
>       perf evsel: Allow to ignore missing pid
>       perf record: Force ignore_missing_thread for uid option
>       perf trace: Check if MAP_32BIT is defined (again)
> 
> Joe Stringer (8):
>       tools lib bpf: Sync {tools,}/include/uapi/linux/bpf.h
>       tools lib bpf: use __u32 from linux/types.h
>       tools lib bpf: Add flags to bpf_create_map()
>       samples/bpf: Make samples more libbpf-centric
>       samples/bpf: Switch over to libbpf
>       tools lib bpf: Add bpf_prog_{attach,detach}
>       samples/bpf: Remove perf_event_open() declaration
>       samples/bpf: Move open_raw_sock to separate header
> 
> Kan Liang (1):
>       perf diff: Do not overwrite valid build id
> 
> Namhyung Kim (6):
>       perf sched timehist: Split is_idle_sample()
>       perf sched timehist: Introduce struct idle_time_data
>       perf sched timehist: Save callchain when entering idle
>       perf sched timehist: Skip non-idle events when necessary
>       perf sched timehist: Add -I/--idle-hist option
>       perf sched timehist: Show callchains for idle stat
> 
> Ravi Bangoria (3):
>       perf annotate: Support jump instruction with target as second operand
>       perf annotate: Fix jump target outside of function address range
>       perf annotate: Don't throw error for zero length symbols
> 
>  samples/bpf/Makefile                              |  70 +--
>  samples/bpf/README.rst                            |   4 +-
>  samples/bpf/bpf_load.c                            |  21 +-
>  samples/bpf/bpf_load.h                            |   3 +
>  samples/bpf/fds_example.c                         |  13 +-
>  samples/bpf/lathist_user.c                        |   2 +-
>  samples/bpf/libbpf.c                              | 176 -------
>  samples/bpf/libbpf.h                              |  28 +-
>  samples/bpf/lwt_len_hist_user.c                   |   6 +-
>  samples/bpf/offwaketime_user.c                    |   8 +-
>  samples/bpf/sampleip_user.c                       |   7 +-
>  samples/bpf/sock_example.c                        |  14 +-
>  samples/bpf/sock_example.h                        |  35 ++
>  samples/bpf/sockex1_user.c                        |   7 +-
>  samples/bpf/sockex2_user.c                        |   5 +-
>  samples/bpf/sockex3_user.c                        |   5 +-
>  samples/bpf/spintest_user.c                       |   8 +-
>  samples/bpf/tc_l2_redirect_user.c                 |   4 +-
>  samples/bpf/test_cgrp2_array_pin.c                |   4 +-
>  samples/bpf/test_cgrp2_attach.c                   |  12 +-
>  samples/bpf/test_cgrp2_attach2.c                  |   8 +-
>  samples/bpf/test_cgrp2_sock.c                     |   7 +-
>  samples/bpf/test_current_task_under_cgroup_user.c |   8 +-
>  samples/bpf/test_lru_dist.c                       |  32 +-
>  samples/bpf/test_probe_write_user_user.c          |   2 +-
>  samples/bpf/trace_event_user.c                    |  23 +-
>  samples/bpf/trace_output_user.c                   |   7 +-
>  samples/bpf/tracex2_user.c                        |  10 +-
>  samples/bpf/tracex3_user.c                        |   4 +-
>  samples/bpf/tracex4_user.c                        |   4 +-
>  samples/bpf/tracex6_user.c                        |   5 +-
>  samples/bpf/xdp1_user.c                           |   2 +-
>  samples/bpf/xdp_tx_iptunnel_user.c                |   6 +-
>  tools/include/uapi/linux/bpf.h                    | 593 +++++++++++++---------
>  tools/lib/bpf/bpf.c                               |  30 +-
>  tools/lib/bpf/bpf.h                               |   9 +-
>  tools/lib/bpf/libbpf.c                            |   3 +-
>  tools/perf/Documentation/perf-sched.txt           |   4 +
>  tools/perf/Makefile.perf                          |  94 +---
>  tools/perf/bench/futex-lock-pi.c                  |   2 +-
>  tools/perf/builtin-c2c.c                          |  13 +-
>  tools/perf/builtin-mem.c                          |   4 +-
>  tools/perf/builtin-record.c                       |   3 +
>  tools/perf/builtin-report.c                       |   2 +-
>  tools/perf/builtin-sched.c                        | 261 ++++++++--
>  tools/perf/builtin-stat.c                         |   6 +-
>  tools/perf/check-headers.sh                       |  59 +++
>  tools/perf/perf.h                                 |   1 +
>  tools/perf/tests/builtin-test.c                   |   4 +
>  tools/perf/tests/tests.h                          |   1 +
>  tools/perf/tests/thread-map.c                     |  44 ++
>  tools/perf/trace/beauty/mmap.c                    |   2 +
>  tools/perf/ui/browsers/annotate.c                 |   5 +-
>  tools/perf/util/annotate.c                        |  23 +-
>  tools/perf/util/annotate.h                        |   5 +-
>  tools/perf/util/evsel.c                           |  61 ++-
>  tools/perf/util/evsel.h                           |   1 +
>  tools/perf/util/symbol.c                          |   3 +-
>  tools/perf/util/thread_map.c                      |  22 +
>  tools/perf/util/thread_map.h                      |   1 +
>  60 files changed, 1075 insertions(+), 731 deletions(-)
>  delete mode 100644 samples/bpf/libbpf.c
>  create mode 100644 samples/bpf/sock_example.h
>  create mode 100755 tools/perf/check-headers.sh

Pulled, thanks a lot Arnaldo!

	Ingo

^ permalink raw reply

* Re: nfc: trf7970a: Prevent repeated polling from crashing the kernel
From: Justin Bronder @ 2016-12-20 19:13 UTC (permalink / raw)
  To: Mark Greer
  Cc: Geoff Lansberry, linux-wireless-u79uwXL29TY76Z2rM5mHXA,
	lauro.venancio-430g2QfJUUCGglJvpFV4uA,
	aloisio.almeida-430g2QfJUUCGglJvpFV4uA,
	sameo-VuQAYsv1563Yd54FQh9/CA, robh+dt-DgEjT+Ai2ygdnm+yROfE0A,
	mark.rutland-5wv7dgnIgG8, netdev-u79uwXL29TY76Z2rM5mHXA,
	devicetree-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Jaret Cantu
In-Reply-To: <20161220185905.GA5867-luAo+O/VEmrlveNOaEYElw@public.gmane.org>

On 20/12/16 11:59 -0700, Mark Greer wrote:
> On Tue, Dec 20, 2016 at 11:16:32AM -0500, Geoff Lansberry wrote:
> > From: Jaret Cantu <jaret.cantu-jEh4hwF5bVhBDgjK7y7TUQ@public.gmane.org>
> > 
> > Repeated polling attempts cause a NULL dereference error to occur.
> > This is because the state of the trf7970a is currently reading but
> > another request has been made to send a command before it has finished.
> 
> How is this happening?  Was trf7970a_abort_cmd() called and it didn't
> work right?  Was it not called at all and there is a bug in the digital
> layer?  More details please.
> 
> > The solution is to properly kill the waiting reading (workqueue)
> > before failing on the send.
> 
> If the bug is in the calling code, then that is what should get fixed.
> This seems to be a hack to work-around a digital layer bug.

One of our uses of NFC is to begin polling to read a tag and then stop polling
(in order to save power) until we know via user interaction that we need to poll
again.  This is typically many minutes later so the power saving is pretty
significant.  However, it's possible that a user will remove the tag before
reading has completed.  We also detect this case and stop polling.  I can go
more into this if necessary but that is what exposed a panic.

You can reproduce using neard and python; in our testing it was very likely to
occur in 10-100 iterations of the following:

    #!/usr/bin/python
    import time

    import dbus

    bus = dbus.SystemBus()
    nfc0 = bus.get_object('org.neard', '/org/neard/nfc0')
    props = dbus.Interface(nfc0, 'org.freedesktop.DBus.Properties')

    try:
        props.Set('org.neard.Adapter', 'Powered', dbus.Boolean(1))
    except:
        pass

    adapter = dbus.Interface(nfc0, 'org.neard.Adapter')

    for i in range(1000):
        adapter.StartPollLoop('Initiator')
        time.sleep(0.1)
        adapter.StopPollLoop()
        print(i)

I believe the last time we tested this was around the 4.1 release.

-- 
Justin Bronder

^ permalink raw reply

* Re: [PATCH 0/2] net: hix5hd2_gmac: keep the compatible string not changed
From: David Miller @ 2016-12-20 19:13 UTC (permalink / raw)
  To: lidongpo
  Cc: robh+dt, mark.rutland, linux, yisen.zhuang, salil.mehta, arnd,
	andrew, xuejiancheng, benjamin.chenhao, caizhiyong, netdev,
	devicetree, linux-kernel
In-Reply-To: <1482199769-106501-1-git-send-email-lidongpo@hisilicon.com>

From: Dongpo Li <lidongpo@hisilicon.com>
Date: Tue, 20 Dec 2016 10:09:27 +0800

> This patch series fixes the patch:
> d0fb6ba75dc0 ("net: hix5hd2_gmac: add generic compatible string")
> 
> The SoC hix5hd2 compatible string has the suffix "-gmac" and
> we should not change its compatible string.
> So we should name all the compatible string with the suffix "-gmac".
> Creating a new name suffix "-gemac" is unnecessary.

Series applied.

^ permalink raw reply

* Re: [PATCH net] openvswitch: Add a missing break statement.
From: David Miller @ 2016-12-20 19:08 UTC (permalink / raw)
  To: jarno; +Cc: netdev, jbenc, pshelar
In-Reply-To: <1482195993-97937-1-git-send-email-jarno@ovn.org>

From: Jarno Rajahalme <jarno@ovn.org>
Date: Mon, 19 Dec 2016 17:06:33 -0800

> Add a break statement to prevent fall-through from
> OVS_KEY_ATTR_ETHERNET to OVS_KEY_ATTR_TUNNEL.  Without the break
> actions setting ethernet addresses fail to validate with log messages
> complaining about invalid tunnel attributes.
> 
> Fixes: 0a6410fbde ("openvswitch: netlink: support L3 packets")
> Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
> Acked-by: Pravin B Shelar <pshelar@ovn.org>
> Acked-by: Jiri Benc <jbenc@redhat.com>

Applied.

^ permalink raw reply

* Re: [PATCH net 2/2] net: netcp: ethss: fix 10gbe host port tx pri map configuration
From: David Miller @ 2016-12-20 19:08 UTC (permalink / raw)
  To: m-karicheri2; +Cc: netdev, linux-kernel
In-Reply-To: <1482188157-24490-2-git-send-email-m-karicheri2@ti.com>

From: Murali Karicheri <m-karicheri2@ti.com>
Date: Mon, 19 Dec 2016 17:55:57 -0500

> From: WingMan Kwok <w-kwok2@ti.com>
> 
> This patch adds the missing 10gbe host port tx priority map
> configurations.
> 
> Signed-off-by: WingMan Kwok <w-kwok2@ti.com>
> Signed-off-by: Murali Karicheri <m-karicheri2@ti.com>
> Signed-off-by: Sekhar Nori <nsekhar@ti.com>

Applied.

^ permalink raw reply

* Re: [PATCH net 1/2] net: netcp: ethss: fix errors in ethtool ops
From: David Miller @ 2016-12-20 19:08 UTC (permalink / raw)
  To: m-karicheri2; +Cc: netdev, linux-kernel
In-Reply-To: <1482188157-24490-1-git-send-email-m-karicheri2@ti.com>

From: Murali Karicheri <m-karicheri2@ti.com>
Date: Mon, 19 Dec 2016 17:55:56 -0500

> From: WingMan Kwok <w-kwok2@ti.com>
> 
> In ethtool ops, it needs to retrieve the corresponding
> ethss module (gbe or xgbe) from the net_device structure.
> Prior to this patch, the retrieving procedure only
> checks for the gbe module.  This patch fixes the issue
> by checking the xgbe module if the net_device structure
> does not correspond to the gbe module.
> 
> Signed-off-by: WingMan Kwok <w-kwok2@ti.com>
> Signed-off-by: Murali Karicheri <m-karicheri2@ti.com>
> Signed-off-by: Sekhar Nori <nsekhar@ti.com>

Applied.

^ permalink raw reply

* Re: [PATCH net v4 0/4] fsl/fman: fixes for ARM
From: David Miller @ 2016-12-20 19:00 UTC (permalink / raw)
  To: madalin.bucur; +Cc: netdev, linuxppc-dev, linux-kernel, scott.wood
In-Reply-To: <1482180166-10677-1-git-send-email-madalin.bucur@nxp.com>

From: Madalin Bucur <madalin.bucur@nxp.com>
Date: Mon, 19 Dec 2016 22:42:42 +0200

> The patch set fixes advertised speeds for QSGMII interfaces, disables
> A007273 erratum workaround on non-PowerPC platforms where it does not
> apply, enables compilation on ARM64 and addresses a probing issue on
> non PPC platforms.
> 
> Changes from v3: removed redundant comment, added ack by Scott
> Changes from v2: merged fsl/fman changes to avoid a point of failure
> Changes from v1: unifying probing on all supported platforms

Series applied, thanks.

^ permalink raw reply

* Re: [PATCH 3/3] nfc: trf7970a: Prevent repeated polling from crashing the kernel
From: Mark Greer @ 2016-12-20 18:59 UTC (permalink / raw)
  To: Geoff Lansberry
  Cc: linux-wireless, lauro.venancio, aloisio.almeida, sameo, robh+dt,
	mark.rutland, netdev, devicetree, linux-kernel, justin,
	Jaret Cantu
In-Reply-To: <1482250592-4268-3-git-send-email-glansberry@gmail.com>

On Tue, Dec 20, 2016 at 11:16:32AM -0500, Geoff Lansberry wrote:
> From: Jaret Cantu <jaret.cantu@timesys.com>
> 
> Repeated polling attempts cause a NULL dereference error to occur.
> This is because the state of the trf7970a is currently reading but
> another request has been made to send a command before it has finished.

How is this happening?  Was trf7970a_abort_cmd() called and it didn't
work right?  Was it not called at all and there is a bug in the digital
layer?  More details please.

> The solution is to properly kill the waiting reading (workqueue)
> before failing on the send.

If the bug is in the calling code, then that is what should get fixed.
This seems to be a hack to work-around a digital layer bug.

Mark
--

^ permalink raw reply

* Re: Soft lockup in tc_classify
From: Shahar Klein @ 2016-12-20  6:22 UTC (permalink / raw)
  To: Cong Wang
  Cc: shahark, Or Gerlitz, Daniel Borkmann, Linux Netdev List,
	Roi Dayan, David Miller, Jiri Pirko, John Fastabend,
	Hadar Hen Zion
In-Reply-To: <CAM_iQpXUQYvvXonEXe0czd4osL5YxZ+G5B-PUddautcHnGOtQw@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 3353 bytes --]



On 12/19/2016 7:58 PM, Cong Wang wrote:
> Hello,
>
> On Mon, Dec 19, 2016 at 8:39 AM, Shahar Klein <shahark@mellanox.com> wrote:
>>
>>
>> On 12/13/2016 12:51 AM, Cong Wang wrote:
>>>
>>> On Mon, Dec 12, 2016 at 1:18 PM, Or Gerlitz <gerlitz.or@gmail.com> wrote:
>>>>
>>>> On Mon, Dec 12, 2016 at 3:28 PM, Daniel Borkmann <daniel@iogearbox.net>
>>>> wrote:
>>>>
>>>>> Note that there's still the RCU fix missing for the deletion race that
>>>>> Cong will still send out, but you say that the only thing you do is to
>>>>> add a single rule, but no other operation is involved during that test?
>>>>
>>>>
>>>> What's missing to have the deletion race fixed? Making a patch, or
>>>> testing a patch which was sent?
>>>
>>>
>>> If you think it would help for this problem, here is my patch rebased
>>> on the latest net-next.
>>>
>>> Again, I don't see how it could help this case yet, especially I don't
>>> see how we could have a loop in this singly linked list.
>>>
>>
>> I've applied Cong's patch and hit a different lockup (full log attached):
>
>
> Are you sure this is really different? For me, it is still inside the loop
> in tc_classify(), with only a slightly different offset.
>
>
>>
>> Daniel suggested I add a print:
>>                 case RTM_DELTFILTER:
>> -                   err = tp->ops->delete(tp, fh);
>> +                 printk(KERN_ERR "DEBUGG:SK %s:%d\n", __func__, __LINE__);
>> +                 err = tp->ops->delete(tp, fh, &last);
>>                         if (err == 0) {
>>
>> and I couldn't see this print in the output.....
>
> Hmm, that is odd, if this never prints, then my patch should not make any
> difference.
>
> There are still two other cases where we could change tp->next, so would you
> mind adding two more printk's for debugging?
>
> Attached is the delta patch.
>
> Thanks!
>

I've added a slightly different debug print:
@@ -368,11 +375,12 @@ static int tc_ctl_tfilter(struct sk_buff *skb, struct nlmsghdr *n)
                 if (tp_created) {
                         RCU_INIT_POINTER(tp->next, rtnl_dereference(*back));
                         rcu_assign_pointer(*back, tp);
+                 printk(KERN_ERR "DEBUGG:SK add/change filter by: %pf tp=%p tp->next=%p\n", tp->ops->get, tp, tp->next);
                 }
                 tfilter_notify(net, skb, n, tp, fh, RTM_NEWTFILTER, false);

full output attached:

[  283.290271] Mirror/redirect action on
[  283.305031] DEBUGG:SK add/change filter by: fl_get [cls_flower] tp=ffff9432d704df60 tp->next=          (null)
[  283.322563] DEBUGG:SK add/change filter by: fl_get [cls_flower] tp=ffff9436e718d240 tp->next=          (null)
[  283.359997] GACT probability on
[  283.365923] DEBUGG:SK add/change filter by: fl_get [cls_flower] tp=ffff9436e718d3c0 tp->next=ffff9436e718d240
[  283.378725] DEBUGG:SK add/change filter by: fl_get [cls_flower] tp=ffff9436e718d3c0 tp->next=ffff9436e718d3c0
[  283.391310] DEBUGG:SK add/change filter by: fl_get [cls_flower] tp=ffff9436e718d3c0 tp->next=ffff9436e718d3c0
[  283.403923] DEBUGG:SK add/change filter by: fl_get [cls_flower] tp=ffff9436e718d3c0 tp->next=ffff9436e718d3c0
[  283.416542] DEBUGG:SK add/change filter by: fl_get [cls_flower] tp=ffff9436e718d3c0 tp->next=ffff9436e718d3c0
[  308.538571] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [swapper/0:0]
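
The last four debug lines show tp->next equal to tp itself. Any walker that
follows ->next from that node never reaches NULL, which would be consistent
with the soft lockup in tc_classify(). One way such a self-loop could arise is
linking a node at the head of a list when it already is the head. The
following is a minimal user-space sketch of that effect only: struct node,
push_front() and walk_bounded() are hypothetical names, RCU is ignored, and
this is not the kernel code.

```c
#include <stddef.h>

struct node {
	struct node *next;
};

/* Link n in front of *back, mirroring the two pointer assignments in the
 * debug hunk (minus RCU). If n already is *back, n->next ends up pointing
 * at n itself. */
static void push_front(struct node **back, struct node *n)
{
	n->next = *back;
	*back = n;
}

/* Walk the list, giving up after `limit` nodes. Returns the number of nodes
 * visited, or -1 if the limit was exceeded (a cycle, or a list longer than
 * `limit`). */
static int walk_bounded(const struct node *head, int limit)
{
	const struct node *n;
	int steps = 0;

	for (n = head; n != NULL; n = n->next) {
		if (steps == limit)
			return -1;
		steps++;
	}
	return steps;
}
```

Pushing two distinct nodes gives a walk of length 2. Pushing the head node a
second time makes its next pointer refer to itself, and the bounded walk
reports -1 where an unbounded loop would spin forever.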


Thanks
Shahar




[-- Attachment #2: tp_p_debug.log --]
[-- Type: text/plain, Size: 18431 bytes --]

[  283.290271] Mirror/redirect action on
[  283.305031] DEBUGG:SK add/change filter by: fl_get [cls_flower] tp=ffff9432d704df60 tp->next=          (null)
[  283.322563] DEBUGG:SK add/change filter by: fl_get [cls_flower] tp=ffff9436e718d240 tp->next=          (null)
[  283.359997] GACT probability on
[  283.365923] DEBUGG:SK add/change filter by: fl_get [cls_flower] tp=ffff9436e718d3c0 tp->next=ffff9436e718d240
[  283.378725] DEBUGG:SK add/change filter by: fl_get [cls_flower] tp=ffff9436e718d3c0 tp->next=ffff9436e718d3c0
[  283.391310] DEBUGG:SK add/change filter by: fl_get [cls_flower] tp=ffff9436e718d3c0 tp->next=ffff9436e718d3c0
[  283.403923] DEBUGG:SK add/change filter by: fl_get [cls_flower] tp=ffff9436e718d3c0 tp->next=ffff9436e718d3c0
[  283.416542] DEBUGG:SK add/change filter by: fl_get [cls_flower] tp=ffff9436e718d3c0 tp->next=ffff9436e718d3c0
[  308.538571] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [swapper/0:0]
[  308.547322] Modules linked in: act_gact act_mirred openvswitch nf_conntrack_ipv6 nf_nat_ipv6 nf_defrag_ipv6 vfio_pci vfio_virqfd vfio_iommu_type1 vfio cls_flower mlx5_ib mlx5_core devlink sch_ingress nfsv3 nfs fscache xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat libcrc32c nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack tun ebtable_filter ebtables ip6table_filter ip6_tables netconsole rpcrdma bridge ib_isert stp iscsi_target_mod llc ib_iser libiscsi scsi_transport_iscsi ib_srpt ib_srp scsi_transport_srp ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_core intel_rapl sb_edac edac_core x86_pkg_temp_thermal coretemp kvm_intel kvm igb irqbypass joydev ipmi_ssif crct10dif_pclmul crc32_pclmul iTCO_wdt crc32c_intel ptp ipmi_si iTCO_vendor_support pcspkr ghash_clmulni_intel wmi pps_core i2c_algo_bit ipmi_msghandler mei_me i2c_i801 ioatdma tpm_tis mei shpchp i2c_smbus dca tpm_tis_core lpc_ich tpm nfsd target_core_mod auth_rpcgss nfs_acl lockd grace sunrpc isci libsas serio_raw scsi_transport_sas [last unloaded: devlink]
[  308.668291] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.9.0+ #31
[  308.675337] Hardware name: Supermicro X9DRW/X9DRW, BIOS 3.0a 08/08/2013
[  308.683060] task: ffffffff94e0e500 task.stack: ffffffff94e00000
[  308.690012] RIP: 0010:fl_classify+0xb/0x2b0 [cls_flower]
[  308.696275] RSP: 0018:ffff9432efa03c20 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff10
[  308.705396] RAX: 0000000000000008 RBX: ffff9432b59c4100 RCX: 0000000000000000
[  308.713704] RDX: ffff9432efa03c98 RSI: ffff9436e718d3c0 RDI: ffff9432b59c4100
[  308.722099] RBP: ffff9432efa03c28 R08: 000000000000270f R09: 0000000000000000
[  308.730409] R10: 0000000000000000 R11: 0000000000000004 R12: ffff9432efa03c98
[  308.738713] R13: 0000000000000008 R14: ffff9436e718d3c0 R15: 0000000000000001
[  308.747013] FS:  0000000000000000(0000) GS:ffff9432efa00000(0000) knlGS:0000000000000000
[  308.756625] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  308.763378] CR2: 00007f5415f67914 CR3: 00000005fde07000 CR4: 00000000000426f0
[  308.771684] Call Trace:
[  308.774739]  <IRQ>
[  308.777311]  tc_classify+0x78/0x120
[  308.781549]  __netif_receive_skb_core+0x623/0xa00
[  308.787141]  ? udp4_gro_receive+0x10b/0x2d0
[  308.792143]  __netif_receive_skb+0x18/0x60
[  308.797048]  netif_receive_skb_internal+0x40/0xb0
[  308.802637]  napi_gro_receive+0xcd/0x120
[  308.807462]  mlx5e_handle_rx_cqe_rep+0x61b/0x890 [mlx5_core]
[  308.814123]  mlx5e_poll_rx_cq+0x83/0x840 [mlx5_core]
[  308.820015]  mlx5e_napi_poll+0x89/0x480 [mlx5_core]
[  308.825818]  net_rx_action+0x260/0x3c0
[  308.830334]  __do_softirq+0xc9/0x28c
[  308.834658]  irq_exit+0xd7/0xe0
[  308.838492]  do_IRQ+0x51/0xd0
[  308.842132]  common_interrupt+0x93/0x93
[  308.846747] RIP: 0010:cpuidle_enter_state+0xe1/0x260
[  308.852624] RSP: 0018:ffffffff94e03dc8 EFLAGS: 00000246 ORIG_RAX: ffffffffffffffa2
[  308.861766] RAX: ffff9432efa19600 RBX: ffff9432efa23600 RCX: 000000000000001f
[  308.870077] RDX: 0000000000000000 RSI: ffff9432efa16cd8 RDI: 0000000000000000
[  308.878379] RBP: ffffffff94e03e00 R08: 0000000000000001 R09: cccccccccccccccd
[  308.886690] R10: 0000000000000000 R11: 0000000000000008 R12: 0000000000000001
[  308.895000] R13: 0000000000000000 R14: ffffffff94ec79a0 R15: 00000041fab01c8d
[  308.903306]  </IRQ>
[  308.905978]  ? cpuidle_enter_state+0xc0/0x260
[  308.911173]  cpuidle_enter+0x17/0x20
[  308.915498]  call_cpuidle+0x23/0x40
[  308.919721]  do_idle+0x172/0x200
[  308.923656]  cpu_startup_entry+0x71/0x80
[  308.928370]  rest_init+0x77/0x80
[  308.932304]  start_kernel+0x4a6/0x4c7
[  308.936723]  ? set_init_arg+0x55/0x55
[  308.941141]  ? early_idt_handler_array+0x120/0x120
[  308.946823]  x86_64_start_reservations+0x24/0x26
[  308.952314]  x86_64_start_kernel+0x14c/0x16f
[  308.957418]  start_cpu+0x5/0x14
[  308.961242] Code: a8 4c 89 fe 48 8b 4d c8 48 8d 14 07 4c 89 e7 e8 2c fe ff ff e9 14 ff ff ff 0f 1f 80 00 00 00 00 66 66 66 66 90 55 48 89 e5 41 57 <41> 56 41 55 41 54 53 48 81 ec 28 01 00 00 65 48 8b 04 25 28 00 
[  308.989075] Kernel panic - not syncing: softlockup: hung tasks
[  308.995924] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G             L  4.9.0+ #31
[  309.010442] Hardware name: Supermicro X9DRW/X9DRW, BIOS 3.0a 08/08/2013
[  309.018160] Call Trace:
[  309.021211]  <IRQ>
[  309.023776]  dump_stack+0x63/0x8c
[  309.027807]  panic+0xeb/0x239
[  309.031449]  watchdog_timer_fn+0x1e5/0x1f0
[  309.036354]  ? watchdog+0x40/0x40
[  309.040386]  __hrtimer_run_queues+0xee/0x270
[  309.045486]  hrtimer_interrupt+0xa8/0x190
[  309.050293]  local_apic_timer_interrupt+0x35/0x60
[  309.055880]  smp_apic_timer_interrupt+0x38/0x50
[  309.061272]  apic_timer_interrupt+0x93/0xa0
[  309.066272] RIP: 0010:fl_classify+0xb/0x2b0 [cls_flower]
[  309.072538] RSP: 0018:ffff9432efa03c20 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff10
[  309.081686] RAX: 0000000000000008 RBX: ffff9432b59c4100 RCX: 0000000000000000
[  309.089994] RDX: ffff9432efa03c98 RSI: ffff9436e718d3c0 RDI: ffff9432b59c4100
[  309.098297] RBP: ffff9432efa03c28 R08: 000000000000270f R09: 0000000000000000
[  309.106603] R10: 0000000000000000 R11: 0000000000000004 R12: ffff9432efa03c98
[  309.114914] R13: 0000000000000008 R14: ffff9436e718d3c0 R15: 0000000000000001
[  309.123229]  tc_classify+0x78/0x120
[  309.127452]  __netif_receive_skb_core+0x623/0xa00
[  309.133031]  ? udp4_gro_receive+0x10b/0x2d0
[  309.138033]  __netif_receive_skb+0x18/0x60
[  309.142949]  netif_receive_skb_internal+0x40/0xb0
[  309.148534]  napi_gro_receive+0xcd/0x120
[  309.153259]  mlx5e_handle_rx_cqe_rep+0x61b/0x890 [mlx5_core]
[  309.159918]  mlx5e_poll_rx_cq+0x83/0x840 [mlx5_core]
[  309.165823]  mlx5e_napi_poll+0x89/0x480 [mlx5_core]
[  309.171608]  net_rx_action+0x260/0x3c0
[  309.176238]  __do_softirq+0xc9/0x28c
[  309.180563]  irq_exit+0xd7/0xe0
[  309.184395]  do_IRQ+0x51/0xd0
[  309.188035]  common_interrupt+0x93/0x93
[  309.192651] RIP: 0010:cpuidle_enter_state+0xe1/0x260
[  309.198527] RSP: 0018:ffffffff94e03dc8 EFLAGS: 00000246 ORIG_RAX: ffffffffffffffa2
[  309.207651] RAX: ffff9432efa19600 RBX: ffff9432efa23600 RCX: 000000000000001f
[  309.215959] RDX: 0000000000000000 RSI: ffff9432efa16cd8 RDI: 0000000000000000
[  309.224268] RBP: ffffffff94e03e00 R08: 0000000000000001 R09: cccccccccccccccd
[  309.232573] R10: 0000000000000000 R11: 0000000000000008 R12: 0000000000000001
[  309.240881] R13: 0000000000000000 R14: ffffffff94ec79a0 R15: 00000041fab01c8d
[  309.249187]  </IRQ>
[  309.251858]  ? cpuidle_enter_state+0xc0/0x260
[  309.257057]  cpuidle_enter+0x17/0x20
[  309.261382]  call_cpuidle+0x23/0x40
[  309.265635]  do_idle+0x172/0x200
[  309.269604]  cpu_startup_entry+0x71/0x80
[  309.274314]  rest_init+0x77/0x80
[  309.278247]  start_kernel+0x4a6/0x4c7
[  309.282668]  ? set_init_arg+0x55/0x55
[  309.287089]  ? early_idt_handler_array+0x120/0x120
[  309.292771]  x86_64_start_reservations+0x24/0x26
[  309.298262]  x86_64_start_kernel+0x14c/0x16f
[  309.303361]  start_cpu+0x5/0x14
[  309.307245] Kernel Offset: 0x13000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[  310.573997] ---[ end Kernel panic - not syncing: softlockup: hung tasks
[  310.581734] ------------[ cut here ]------------
[  310.587236] unchecked MSR access error: WRMSR to 0x83f (tried to write 0x00000000000000f6) at rIP: 0xffffffff94065c14 (native_write_msr+0x4/0x30)
[  310.602404] Call Trace:
[  310.605472]  <IRQ>
[  310.608066]  ? native_apic_msr_write+0x30/0x40
[  310.613371]  x2apic_send_IPI_self+0x1d/0x20
[  310.618390]  arch_irq_work_raise+0x28/0x40
[  310.623309]  irq_work_queue+0x6e/0x80
[  310.627724]  wake_up_klogd+0x34/0x40
[  310.632045]  console_unlock+0x4dc/0x540
[  310.636659]  vprintk_emit+0x2eb/0x4b0
[  310.641091]  ? native_smp_send_reschedule+0x3f/0x50
[  310.646871]  vprintk_default+0x29/0x40
[  310.651393]  printk+0x5d/0x74
[  310.655034]  ? native_smp_send_reschedule+0x3f/0x50
[  310.660807]  __warn+0x3b/0xf0
[  310.664450]  warn_slowpath_null+0x1d/0x20
[  310.669262]  native_smp_send_reschedule+0x3f/0x50
[  310.674849]  try_to_wake_up+0x312/0x390
[  310.679456]  default_wake_function+0x12/0x20
[  310.684560]  __wake_up_common+0x55/0x90
[  310.689170]  __wake_up_locked+0x13/0x20
[  310.693788]  ep_poll_callback+0xbb/0x240
[  310.698493]  __wake_up_common+0x55/0x90
[  310.703101]  __wake_up+0x39/0x50
[  310.707028]  wake_up_klogd_work_func+0x40/0x60
[  310.712316]  irq_work_run_list+0x4d/0x70
[  310.717022]  irq_work_run+0x2c/0x40
[  310.721243]  smp_irq_work_interrupt+0x2e/0x40
[  310.726443]  irq_work_interrupt+0x93/0xa0
[  310.731253] RIP: 0010:panic+0x1f5/0x239
[  310.735876] RSP: 0018:ffff9432efa039e8 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff09
[  310.744995] RAX: 000000000000003b RBX: 0000000000000000 RCX: 0000000000000006
[  310.753294] RDX: 0000000000000000 RSI: 0000000000000046 RDI: ffff9432efa0e060
[  310.761594] RBP: ffff9432efa03a58 R08: 0000000000000674 R09: ffff942e800bb3e0
[  310.769900] R10: 00000000000000ef R11: 0000000000000198 R12: ffffffff94c4a4a9
[  310.778199] R13: 0000000000000000 R14: 0000000000000000 R15: ffff9432efa03b78
[  310.786505]  ? panic+0x1f1/0x239
[  310.790444]  watchdog_timer_fn+0x1e5/0x1f0
[  310.795353]  ? watchdog+0x40/0x40
[  310.799401]  __hrtimer_run_queues+0xee/0x270
[  310.804501]  hrtimer_interrupt+0xa8/0x190
[  310.809318]  local_apic_timer_interrupt+0x35/0x60
[  310.814895]  smp_apic_timer_interrupt+0x38/0x50
[  310.820282]  apic_timer_interrupt+0x93/0xa0
[  310.825287] RIP: 0010:fl_classify+0xb/0x2b0 [cls_flower]
[  310.831554] RSP: 0018:ffff9432efa03c20 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff10
[  310.840693] RAX: 0000000000000008 RBX: ffff9432b59c4100 RCX: 0000000000000000
[  310.849007] RDX: ffff9432efa03c98 RSI: ffff9436e718d3c0 RDI: ffff9432b59c4100
[  310.857402] RBP: ffff9432efa03c28 R08: 000000000000270f R09: 0000000000000000
[  310.865712] R10: 0000000000000000 R11: 0000000000000004 R12: ffff9432efa03c98
[  310.874020] R13: 0000000000000008 R14: ffff9436e718d3c0 R15: 0000000000000001
[  310.882337]  tc_classify+0x78/0x120
[  310.886568]  __netif_receive_skb_core+0x623/0xa00
[  310.892157]  ? udp4_gro_receive+0x10b/0x2d0
[  310.897151]  __netif_receive_skb+0x18/0x60
[  310.902057]  netif_receive_skb_internal+0x40/0xb0
[  310.907643]  napi_gro_receive+0xcd/0x120
[  310.912370]  mlx5e_handle_rx_cqe_rep+0x61b/0x890 [mlx5_core]
[  310.919031]  mlx5e_poll_rx_cq+0x83/0x840 [mlx5_core]
[  310.924924]  mlx5e_napi_poll+0x89/0x480 [mlx5_core]
[  310.930808]  net_rx_action+0x260/0x3c0
[  310.935319]  __do_softirq+0xc9/0x28c
[  310.939658]  irq_exit+0xd7/0xe0
[  310.943485]  do_IRQ+0x51/0xd0
[  310.947124]  common_interrupt+0x93/0x93
[  310.951748] RIP: 0010:cpuidle_enter_state+0xe1/0x260
[  310.957616] RSP: 0018:ffffffff94e03dc8 EFLAGS: 00000246 ORIG_RAX: ffffffffffffffa2
[  310.966743] RAX: ffff9432efa19600 RBX: ffff9432efa23600 RCX: 000000000000001f
[  310.975044] RDX: 0000000000000000 RSI: ffff9432efa16cd8 RDI: 0000000000000000
[  310.983349] RBP: ffffffff94e03e00 R08: 0000000000000001 R09: cccccccccccccccd
[  310.991654] R10: 0000000000000000 R11: 0000000000000008 R12: 0000000000000001
[  310.999952] R13: 0000000000000000 R14: ffffffff94ec79a0 R15: 00000041fab01c8d
[  311.008254]  </IRQ>
[  311.010926]  ? cpuidle_enter_state+0xc0/0x260
[  311.016122]  cpuidle_enter+0x17/0x20
[  311.020430]  call_cpuidle+0x23/0x40
[  311.024658]  do_idle+0x172/0x200
[  311.028583]  cpu_startup_entry+0x71/0x80
[  311.033295]  rest_init+0x77/0x80
[  311.037233]  start_kernel+0x4a6/0x4c7
[  311.041646]  ? set_init_arg+0x55/0x55
[  311.046068]  ? early_idt_handler_array+0x120/0x120
[  311.051752]  x86_64_start_reservations+0x24/0x26
[  311.057238]  x86_64_start_kernel+0x14c/0x16f
[  311.062339]  start_cpu+0x5/0x14
[  311.066180] WARNING: CPU: 0 PID: 0 at arch/x86/kernel/smp.c:127 native_smp_send_reschedule+0x3f/0x50
[  311.076956] Modules linked in: act_gact act_mirred openvswitch nf_conntrack_ipv6 nf_nat_ipv6 nf_defrag_ipv6 vfio_pci vfio_virqfd vfio_iommu_type1 vfio cls_flower mlx5_ib mlx5_core devlink sch_ingress nfsv3 nfs fscache xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat libcrc32c nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack tun ebtable_filter ebtables ip6table_filter ip6_tables netconsole rpcrdma bridge ib_isert stp iscsi_target_mod llc ib_iser libiscsi scsi_transport_iscsi ib_srpt ib_srp scsi_transport_srp ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_core intel_rapl sb_edac edac_core x86_pkg_temp_thermal coretemp kvm_intel kvm igb irqbypass joydev ipmi_ssif crct10dif_pclmul crc32_pclmul iTCO_wdt crc32c_intel ptp ipmi_si iTCO_vendor_support pcspkr ghash_clmulni_intel wmi pps_core i2c_algo_bit ipmi_msghandler mei_me i2c_i801 ioatdma tpm_tis mei shpchp i2c_smbus dca tpm_tis_core lpc_ich tpm nfsd target_core_mod auth_rpcgss nfs_acl lockd grace sunrpc isci libsas serio_raw scsi_transport_sas [last unloaded: devlink]
[  311.198587] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G             L  4.9.0+ #31
[  311.207253] Hardware name: Supermicro X9DRW/X9DRW, BIOS 3.0a 08/08/2013
[  311.214983] Call Trace:
[  311.218051]  <IRQ>
[  311.220626]  dump_stack+0x63/0x8c
[  311.224657]  __warn+0xd1/0xf0
[  311.228298]  warn_slowpath_null+0x1d/0x20
[  311.233116]  native_smp_send_reschedule+0x3f/0x50
[  311.238702]  try_to_wake_up+0x312/0x390
[  311.243318]  default_wake_function+0x12/0x20
[  311.248418]  __wake_up_common+0x55/0x90
[  311.253034]  __wake_up_locked+0x13/0x20
[  311.257641]  ep_poll_callback+0xbb/0x240
[  311.262346]  __wake_up_common+0x55/0x90
[  311.272771]  __wake_up+0x39/0x50
[  311.276697]  wake_up_klogd_work_func+0x40/0x60
[  311.281986]  irq_work_run_list+0x4d/0x70
[  311.286681]  irq_work_run+0x2c/0x40
[  311.290899]  smp_irq_work_interrupt+0x2e/0x40
[  311.296090]  irq_work_interrupt+0x93/0xa0
[  311.300900] RIP: 0010:panic+0x1f5/0x239
[  311.305508] RSP: 0018:ffff9432efa039e8 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff09
[  311.314630] RAX: 000000000000003b RBX: 0000000000000000 RCX: 0000000000000006
[  311.322936] RDX: 0000000000000000 RSI: 0000000000000046 RDI: ffff9432efa0e060
[  311.331245] RBP: ffff9432efa03a58 R08: 0000000000000674 R09: ffff942e800bb3e0
[  311.339543] R10: 00000000000000ef R11: 0000000000000198 R12: ffffffff94c4a4a9
[  311.347855] R13: 0000000000000000 R14: 0000000000000000 R15: ffff9432efa03b78
[  311.356167]  ? panic+0x1f1/0x239
[  311.360106]  watchdog_timer_fn+0x1e5/0x1f0
[  311.365004]  ? watchdog+0x40/0x40
[  311.369035]  __hrtimer_run_queues+0xee/0x270
[  311.374132]  hrtimer_interrupt+0xa8/0x190
[  311.378935]  local_apic_timer_interrupt+0x35/0x60
[  311.384511]  smp_apic_timer_interrupt+0x38/0x50
[  311.389897]  apic_timer_interrupt+0x93/0xa0
[  311.394892] RIP: 0010:fl_classify+0xb/0x2b0 [cls_flower]
[  311.401151] RSP: 0018:ffff9432efa03c20 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff10
[  311.410270] RAX: 0000000000000008 RBX: ffff9432b59c4100 RCX: 0000000000000000
[  311.418580] RDX: ffff9432efa03c98 RSI: ffff9436e718d3c0 RDI: ffff9432b59c4100
[  311.426967] RBP: ffff9432efa03c28 R08: 000000000000270f R09: 0000000000000000
[  311.435278] R10: 0000000000000000 R11: 0000000000000004 R12: ffff9432efa03c98
[  311.443584] R13: 0000000000000008 R14: ffff9436e718d3c0 R15: 0000000000000001
[  311.451889]  tc_classify+0x78/0x120
[  311.456105]  __netif_receive_skb_core+0x623/0xa00
[  311.461683]  ? udp4_gro_receive+0x10b/0x2d0
[  311.466687]  __netif_receive_skb+0x18/0x60
[  311.471593]  netif_receive_skb_internal+0x40/0xb0
[  311.477186]  napi_gro_receive+0xcd/0x120
[  311.481900]  mlx5e_handle_rx_cqe_rep+0x61b/0x890 [mlx5_core]
[  311.488555]  mlx5e_poll_rx_cq+0x83/0x840 [mlx5_core]
[  311.494451]  mlx5e_napi_poll+0x89/0x480 [mlx5_core]
[  311.500233]  net_rx_action+0x260/0x3c0
[  311.504751]  __do_softirq+0xc9/0x28c
[  311.509075]  irq_exit+0xd7/0xe0
[  311.512901]  do_IRQ+0x51/0xd0
[  311.516529]  common_interrupt+0x93/0x93
[  311.521143] RIP: 0010:cpuidle_enter_state+0xe1/0x260
[  311.527011] RSP: 0018:ffffffff94e03dc8 EFLAGS: 00000246 ORIG_RAX: ffffffffffffffa2
[  311.536123] RAX: ffff9432efa19600 RBX: ffff9432efa23600 RCX: 000000000000001f
[  311.544430] RDX: 0000000000000000 RSI: ffff9432efa16cd8 RDI: 0000000000000000
[  311.552760] RBP: ffffffff94e03e00 R08: 0000000000000001 R09: cccccccccccccccd
[  311.561087] R10: 0000000000000000 R11: 0000000000000008 R12: 0000000000000001
[  311.569396] R13: 0000000000000000 R14: ffffffff94ec79a0 R15: 00000041fab01c8d
[  311.577714]  </IRQ>
[  311.580393]  ? cpuidle_enter_state+0xc0/0x260
[  311.585591]  cpuidle_enter+0x17/0x20
[  311.589913]  call_cpuidle+0x23/0x40
[  311.594136]  do_idle+0x172/0x200
[  311.598069]  cpu_startup_entry+0x71/0x80
[  311.602782]  rest_init+0x77/0x80
[  311.606713]  start_kernel+0x4a6/0x4c7
[  311.611134]  ? set_init_arg+0x55/0x55
[  311.615547]  ? early_idt_handler_array+0x120/0x120
[  311.621231]  x86_64_start_reservations+0x24/0x26
[  311.626717]  x86_64_start_kernel+0x14c/0x16f
[  311.631810]  start_cpu+0x5/0x14
[  311.635648] ---[ end trace c2fd08dd3d93dab3 ]---



^ permalink raw reply

* Re: [PATCH net 0/3] Fix integration of eee-broken-modes
From: David Miller @ 2016-12-20 18:51 UTC (permalink / raw)
  To: jbrunet
  Cc: netdev, devicetree, f.fainelli, carlo, khilman,
	martin.blumenstingl, neolynx, andrew, narmstrong, linux-amlogic,
	linux-arm-kernel, linux-kernel, julia.lawall, yegorslists,
	afaerber
In-Reply-To: <1482159938-13239-1-git-send-email-jbrunet@baylibre.com>

From: Jerome Brunet <jbrunet@baylibre.com>
Date: Mon, 19 Dec 2016 16:05:35 +0100

> The purpose of this series is to fix the integration of the ethernet phy
> property "eee-broken-modes" [0]
> 
> The v3 of this series has been merged, missing a fix (error reported by
> kbuild robot) available in the v4 [1]
> 
> More importantly, Florian opposed adding a DT property mapping a device
> register this directly [2]. The concern was that the property could be
> abused to implement platform configuration policy. After discussing it,
> I think we agreed that such information about the HW (defect) should appear
> in the platform DT. However, the preferred way is to add a boolean property
> for each EEE broken mode.
> 
> [0]: http://lkml.kernel.org/r/1480326409-25419-1-git-send-email-jbrunet@baylibre.com
> [1]: http://lkml.kernel.org/r/1480348229-25672-1-git-send-email-jbrunet@baylibre.com
> [2]: http://lkml.kernel.org/r/e14a3b0c-dc34-be14-48b3-518a0ad0c080@gmail.com

Series applied, thank you.

^ permalink raw reply

* Re: [PATCH perf/core REBASE 3/5] tools lib bpf: Add bpf_prog_{attach,detach}
From: Joe Stringer @ 2016-12-20 18:50 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo; +Cc: LKML, netdev, Wang Nan, ast, Daniel Borkmann
In-Reply-To: <20161220143217.GC32756@kernel.org>

On 20 December 2016 at 06:32, Arnaldo Carvalho de Melo <acme@kernel.org> wrote:
> Em Tue, Dec 20, 2016 at 11:18:51AM -0300, Arnaldo Carvalho de Melo escreveu:
>> Em Wed, Dec 14, 2016 at 02:43:40PM -0800, Joe Stringer escreveu:
>> > Commit d8c5b17f2bc0 ("samples: bpf: add userspace example for attaching
>> > eBPF programs to cgroups") added these functions to samples/libbpf, but
>> > during this merge all of the samples libbpf functionality is shifting to
>> > tools/lib/bpf. Shift these functions there.
>> >
>> > Signed-off-by: Joe Stringer <joe@ovn.org>
>> > ---
>> > Arnaldo, this is a new patch you didn't previously review which I've
>> > prepared due to the conflict with net-next. I figured it's better to try
>> > to get samples/bpf properly switched over this window rather than defer the
>> > problem and end up having to deal with another merge problem next time
>> > around. I hope that is fine for you. If not, this patch onwards will need
>> > to be dropped
>> >
>> > It's a simple copy/paste/delete with a minor change for sys_bpf() vs
>> > syscall().
>> > ---
>> >  samples/bpf/libbpf.c | 21 ---------------------
>> >  samples/bpf/libbpf.h |  3 ---
>> >  tools/lib/bpf/bpf.c  | 21 +++++++++++++++++++++
>> >  tools/lib/bpf/bpf.h  |  3 +++
>> >  4 files changed, 24 insertions(+), 24 deletions(-)
>> >
>> > diff --git a/samples/bpf/libbpf.c b/samples/bpf/libbpf.c
>> > index 3391225ad7e9..d9af876b4a2c 100644
>> > --- a/samples/bpf/libbpf.c
>> > +++ b/samples/bpf/libbpf.c
>> > @@ -11,27 +11,6 @@
>> >  #include <arpa/inet.h>
>> >  #include "libbpf.h"
>> >
>> > -int bpf_prog_attach(int prog_fd, int target_fd, enum bpf_attach_type type)
>> > -{
>> > -   union bpf_attr attr = {
>> > -           .target_fd = target_fd,
>> > -           .attach_bpf_fd = prog_fd,
>> > -           .attach_type = type,
>> > -   };
>> > -
>> > -   return syscall(__NR_bpf, BPF_PROG_ATTACH, &attr, sizeof(attr));
>>
>> This one makes it fail for CentOS 5 and 6, others may fail as well,
>> still building, investigating...
>
> Ok, fixed it by making it follow the model of the other sys_bpf wrappers
> setting up that bpf_attr union wrt initializing unnamed struct members:
>
>  int bpf_prog_attach(int prog_fd, int target_fd, enum bpf_attach_type type)
>  {
> -       union bpf_attr attr = {
> -               .target_fd = target_fd,
> -               .attach_bpf_fd = prog_fd,
> -               .attach_type = type,
> -       };
> +       union bpf_attr attr;
> +
> +       bzero(&attr, sizeof(attr));
> +       attr.target_fd     = target_fd;
> +       attr.attach_bpf_fd = prog_fd;
> +       attr.attach_type   = type;
>
>         return sys_bpf(BPF_PROG_ATTACH, &attr, sizeof(attr));
>  }

Ah, I just shifted these across originally so the delta would be
minimal but now I know why this code is like this. Thanks.
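
For readers following along, the pattern Arnaldo switched to can be sketched
outside the tree. The assumption here, based on his note about unnamed struct
members, is that older compilers such as those on CentOS 5 and 6 do not accept
designated initializers that name members of an anonymous struct inside a
union. Zeroing the union and assigning each member avoids that construct and
also clears every byte the kernel might look at. union demo_attr and
fill_attach_attr() are hypothetical stand-ins, not union bpf_attr or the real
wrapper.

```c
#include <string.h>

/* Hypothetical stand-in for union bpf_attr: anonymous structs overlaid in
 * one union. The attach field names mirror the ones in the thread. */
union demo_attr {
	struct {                 /* fields for a "map create" style command */
		unsigned int map_type;
		unsigned int key_size;
	};
	struct {                 /* fields for an "attach" style command */
		unsigned int target_fd;
		unsigned int attach_bpf_fd;
		unsigned int attach_type;
	};
	unsigned char pad[64];   /* makes the union larger than either struct */
};

/* Zero the whole union, then assign members individually. No designated
 * initializer touches an anonymous member, and bytes outside the attach
 * struct are guaranteed to be zero. */
static void fill_attach_attr(union demo_attr *attr, int prog_fd,
			     int target_fd, unsigned int type)
{
	memset(attr, 0, sizeof(*attr));
	attr->target_fd     = target_fd;
	attr->attach_bpf_fd = prog_fd;
	attr->attach_type   = type;
}
```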

^ permalink raw reply

* Re: Potential issues (security and otherwise) with the current cgroup-bpf API
From: Andy Lutomirski @ 2016-12-20 18:49 UTC (permalink / raw)
  To: Daniel Mack
  Cc: Alexei Starovoitov, Andy Lutomirski, Mickaël Salaün,
	Kees Cook, Jann Horn, Tejun Heo, David Ahern, David S. Miller,
	Thomas Graf, Michael Kerrisk, Peter Zijlstra, Linux API,
	linux-kernel@vger.kernel.org, Network Development
In-Reply-To: <9e378fb1-23ff-a239-d915-3d9aa26a999e@zonque.org>

On Tue, Dec 20, 2016 at 10:36 AM, Daniel Mack <daniel@zonque.org> wrote:
> Hi,
>
> On 12/20/2016 06:23 PM, Andy Lutomirski wrote:
>> On Tue, Dec 20, 2016 at 2:21 AM, Daniel Mack <daniel@zonque.org> wrote:
>
>> To clarify, since this thread has gotten excessively long and twisted,
>> I think it's important that, for hooks attached to a cgroup, you be
>> able to tell in a generic way whether something is plugged into the
>> hook.  The natural way to see a cgroup's configuration is to read from
>> cgroupfs, so I think that reading from cgroupfs should show you that a
>> BPF program is attached and also give enough information that, once
>> bpf programs become dumpable, you can dump the program (using the
>> bpf() syscall or whatever).
>
> [...]
>
>> There isn't a big semantic difference between
>> 'open("/cgroup/NAME/some.control.file", O_WRONLY); ioctl(...,
>> CGROUP_ATTACH_BPF, ...)' and 'open("/cgroup/NAME/some.control.file",
>> O_WRONLY); bpf(BPF_PROG_ATTACH, ...);'.  There is, however, a semantic
>> difference when you do open("/cgroup/NAME", O_RDONLY | O_DIRECTORY)
>> because the permission check is much weaker.
>
> Okay, if you have such a control file, you can of course do something
> like that. When we discussed things back then with Tejun however, we
> concluded that a controller that is not completely controllable through
> control knobs that can be written and read via cat is meaningless.
> That's why this has become a 'hidden' cgroup feature.
>
> With your proposed API, you'd first go to the bpf(2) syscall in order to
> get a prog fd, and then come back to some sort of cgroup API to put the
> fd in there. That's quite a mix and match, which is why we considered
> the API cleaner in its current form, as everything that is related to
> bpf is encapsulated behind a single syscall.

You already have to do bpf() to get a prog fd, then open() to get a
cgroup fd, then bpf() or ioctl() to attach, so this isn't much
different, and it's exactly the same number of syscalls.

>
>> My preference would be to do an ioctl on a new
>> /cgroup/NAME/network_hooks.inet_ingress file.  Reading that file tells
>> you whether something is attached and hopefully also gives enough
>> information (a hash of the BPF program, perhaps) to dump the actual
>> program using future bpf() interfaces.  write() and ioctl() can be
>> used to configure it as appropriate.
>
> So am I reading this right? You're proposing to add ioctl() hooks to
> kernfs/cgroupfs? That would open more possibilities of course, but I'm
> not sure where that rabbit hole leads us eventually.

Indeed.  I already have a test patch to add ioctl() to kernfs.  Adding
it to cgroupfs shouldn't be much more complicated.

>
>> Another option that I like less would be to have a
>> /cgroup/NAME/cgroup.bpf that lists all the active hooks along with
>> their contents.  You would do an ioctl() on that to program a hook and
>> you could read it to see what's there.
>
> Yes, read() could, in theory, give you information similar to ioctl(),
> but in human-readable form.
>
>> FWIW, everywhere I say ioctl(), the bpf() syscall would be okay, too.
>> It doesn't make a semantic difference, except that I dislike
>> BPF_PROG_DETACH because that particular command isn't BPF-specific at
>> all.
>
> Well, I think it is; it pops the bpf program from a target and drops the
> reference on it. It's not much code, but it's certainly bpf-specific.

I mean the interface isn't bpf-specific.  If something other than a bpf
program were attached to the target, you'd still want an API to detach
it.

>
>>>> So if I set up a cgroup that's monitored and call it /cgroup/a and
>>>> enable delegation and if the program running there wants to do its own
>>>> monitoring in /cgroup/a/b (via delegation), then you really want the
>>>> outer monitor to silently drop events coming from /cgroup/a/b?
>>>
>>> That's a fair point, and we've discussed it as well. The issue is, as
>>> Alexei already pointed out, that we do not want to traverse the tree up
>>> to the root for nested cgroups due to the runtime costs in the
>>> networking fast-path. After all, we're running the bpf program for each
>>> packet in flight. Hence, we opted for the approach to only look at the
>>> leaf node for now, with the ability to open it up further in the future
>>> using flags during attach etc.
>>
>> Careful here!  You don't look only at the leaf node for now.  You do a
>> fancy traversal and choose the nearest node that has a hook set up.
>
> But we do the 'complex' operation at attach time or when a cgroup is
> created, both of which are slow-path operations. In the fast-path, we
> only look at the leaf, which may or may not have an effective program
> installed. And that's of course much cheaper than doing the traversal
> for each packet.

You would never traverse the full hierarchy for each packet.  You'd
have a linked list of programs that are attached, kind of like how
there's an "effective" array right now.  I sent out pseudocode earlier
in the thread.
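
Since that pseudocode is not repeated here, the following is a rough
user-space sketch of the idea, under stated assumptions: every name in it
(struct hook, struct cgroup, cgroup_attach(), cgroup_run_hooks()) is
hypothetical, a plain function pointer stands in for a BPF program, and
running the innermost hook first is an arbitrary choice.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a BPF program: 1 = allow, 0 = drop. */
typedef int (*hook_fn)(const void *pkt);

struct hook {
	hook_fn run;
	const struct hook *outer;	/* nearest hook attached further up */
};

struct cgroup {
	struct cgroup *parent;
	struct hook own;		/* in use only once own.run != NULL */
	const struct hook *chain;	/* nearest hook at or above this cgroup */
};

/* Example hooks, for illustration only. */
int hook_allow(const void *pkt) { (void)pkt; return 1; }
int hook_drop(const void *pkt)  { (void)pkt; return 0; }

/* A new cgroup starts out sharing its parent's chain. */
void cgroup_init(struct cgroup *cg, struct cgroup *parent)
{
	cg->parent = parent;
	cg->own.run = NULL;
	cg->own.outer = NULL;
	cg->chain = parent ? parent->chain : NULL;
}

/*
 * Slow path. For brevity this assumes cg has no children yet; a fuller
 * version would also repoint descendants still using the old chain.
 */
void cgroup_attach(struct cgroup *cg, hook_fn fn)
{
	assert(fn != NULL);
	cg->own.run = fn;
	cg->own.outer = cg->parent ? cg->parent->chain : NULL;
	cg->chain = &cg->own;
}

/*
 * Fast path: one step per attached program, not per hierarchy level.
 * Any hook in the chain can drop the packet.
 */
int cgroup_run_hooks(const struct cgroup *cg, const void *pkt)
{
	const struct hook *h;

	for (h = cg->chain; h; h = h->outer)
		if (!h->run(pkt))
			return 0;
	return 1;
}
```

With this shape, a hook attached in a nested cgroup adds to the outer hooks
rather than replacing them, and the per-packet cost grows with the number of
attached programs rather than with the depth of the hierarchy.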

>
>> mkdir /cgroup/foo
>> BPF_PROG_ATTACH(some program to foo)
>> mkdir /cgroup/foo/bar
>> chown -R some_user /cgroup/foo/bar
>>
>> If the kernel only looked at the leaf, then the program that did the
>> above would not expect that the program would constrain
>> /cgroup/foo/bar's activity.  But, as it stands, the program *would*
>> expect /cgroup/foo/bar to be constrained, except that, whenever the
>> capable() check changes to ns_capable() (which will happen eventually
>> one way or another), then the bad guy can create /cgroup/foo/bar/baz,
>> install a new no-op hook there, and break the security assumption.
>>
>> IOW, I think that totally non-recursive hooks are okay from a security
>> perspective, albeit rather strange, but the current design is not okay
>> from a security perspective.
>
> We locked down the ability to override any of these programs with
> CAP_NET_ADMIN, which is also what it takes to flush iptables, right?
> What's the difference?

For iptables, it's ns_capable() now, and there have been a number of
holes in it.  For cgroup, it's going to turn into ns_capable() sooner
or later, and it would be nice to be ready for it.

--Andy


* Re: Potential issues (security and otherwise) with the current cgroup-bpf API
From: Daniel Mack @ 2016-12-20 18:36 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Alexei Starovoitov, Andy Lutomirski, Mickaël Salaün,
	Kees Cook, Jann Horn, Tejun Heo, David Ahern, David S. Miller,
	Thomas Graf, Michael Kerrisk, Peter Zijlstra, Linux API,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Network Development
In-Reply-To: <CALCETrXyp2ddf4HRsEoN=qEwTBaezOUX2XWj6nxPcbc4t13svw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

Hi,

On 12/20/2016 06:23 PM, Andy Lutomirski wrote:
> On Tue, Dec 20, 2016 at 2:21 AM, Daniel Mack <daniel-cYrQPVfZoowdnm+yROfE0A@public.gmane.org> wrote:

> To clarify, since this thread has gotten excessively long and twisted,
> I think it's important that, for hooks attached to a cgroup, you be
> able to tell in a generic way whether something is plugged into the
> hook.  The natural way to see a cgroup's configuration is to read from
> cgroupfs, so I think that reading from cgroupfs should show you that a
> BPF program is attached and also give enough information that, once
> bpf programs become dumpable, you can dump the program (using the
> bpf() syscall or whatever).

[...]

> There isn't a big semantic difference between
> 'open("/cgroup/NAME/some.control.file", O_WRONLY); ioctl(...,
> CGROUP_ATTACH_BPF, ...)' and 'open("/cgroup/NAME/some.control.file",
> O_WRONLY); bpf(BPF_PROG_ATTACH, ...);'.  There is, however, a semantic
> difference when you do open("/cgroup/NAME", O_RDONLY | O_DIRECTORY)
> because the permission check is much weaker.

Okay, if you have such a control file, you can of course do something
like that. When we discussed things back then with Tejun however, we
concluded that a controller that is not completely controllable through
control knobs that can be written and read via cat is meaningless.
That's why this has become a 'hidden' cgroup feature.

With your proposed API, you'd first go to the bpf(2) syscall in order to
get a prog fd, and then come back to some sort of cgroup API to put the
fd in there. That's quite a mix and match, which is why we considered
the API cleaner in its current form, as everything that is related to
bpf is encapsulated behind a single syscall.

> My preference would be to do an ioctl on a new
> /cgroup/NAME/network_hooks.inet_ingress file.  Reading that file tells
> you whether something is attached and hopefully also gives enough
> information (a hash of the BPF program, perhaps) to dump the actual
> program using future bpf() interfaces.  write() and ioctl() can be
> used to configure it as appropriate.

So am I reading this right? You're proposing to add ioctl() hooks to
kernfs/cgroupfs? That would open more possibilities of course, but I'm
not sure where that rabbit hole leads us eventually.

> Another option that I like less would be to have a
> /cgroup/NAME/cgroup.bpf that lists all the active hooks along with
> their contents.  You would do an ioctl() on that to program a hook and
> you could read it to see what's there.

Yes, read() could, in theory, give you similar information to ioctl(),
but in human-readable form.

> FWIW, everywhere I say ioctl(), the bpf() syscall would be okay, too.
> It doesn't make a semantic difference, except that I dislike
> BPF_PROG_DETACH because that particular command isn't BPF-specific at
> all.

Well, I think it is; it pops the bpf program from a target and drops the
reference on it. It's not much code, but it's certainly bpf-specific.

>>> So if I set up a cgroup that's monitored and call it /cgroup/a and
>>> enable delegation and if the program running there wants to do its own
>>> monitoring in /cgroup/a/b (via delegation), then you really want the
>>> outer monitor to silently drop events coming from /cgroup/a/b?
>>
>> That's a fair point, and we've discussed it as well. The issue is, as
>> Alexei already pointed out, that we do not want to traverse the tree up
>> to the root for nested cgroups due to the runtime costs in the
>> networking fast-path. After all, we're running the bpf program for each
>> packet in flight. Hence, we opted for the approach to only look at the
>> leaf node for now, with the ability to open it up further in the future
>> using flags during attach etc.
> 
> Careful here!  You don't look only at the leaf node for now.  You do a
> fancy traversal and choose the nearest node that has a hook set up.

But we do the 'complex' operation at attach time or when a cgroup is
created, both of which are slow-path operations. In the fast-path, we
only look at the leaf, which may or may not have an effective program
installed. And that's of course much cheaper than doing the traversal
for each packet.
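
To illustrate the split between slow path and fast path, here is a small
user-space model. It is only a sketch and not the kernel code: the struct
layout, MAX_CHILDREN, and the function names are invented for this example,
and a string stands in for a program reference.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CHILDREN 8	/* arbitrary limit for this sketch */

struct cgroup {
	struct cgroup *parent;
	struct cgroup *children[MAX_CHILDREN];
	int nr_children;
	const char *prog;	/* program attached here, or NULL */
	const char *effective;	/* what the fast path runs, or NULL */
};

/* Slow path: a new cgroup inherits its parent's effective program. */
void cgroup_init(struct cgroup *cg, struct cgroup *parent)
{
	cg->parent = parent;
	cg->nr_children = 0;
	cg->prog = NULL;
	cg->effective = parent ? parent->effective : NULL;
	if (parent) {
		assert(parent->nr_children < MAX_CHILDREN);
		parent->children[parent->nr_children++] = cg;
	}
}

/* Recompute 'effective' for cg and everything below it. */
static void update_effective(struct cgroup *cg)
{
	int i;

	if (cg->prog)
		cg->effective = cg->prog;
	else
		cg->effective = cg->parent ? cg->parent->effective : NULL;

	for (i = 0; i < cg->nr_children; i++)
		update_effective(cg->children[i]);
}

/* Slow path: attach a program, or pass NULL to detach. */
void cgroup_attach(struct cgroup *cg, const char *prog)
{
	cg->prog = prog;
	update_effective(cg);
}

/* Fast path: a single pointer read on the leaf, no tree walk. */
const char *cgroup_fast_path(const struct cgroup *cg)
{
	return cg->effective;
}
```

In this model, attaching to /cgroup/foo makes the program effective for
/cgroup/foo/bar as well, and a later attach further down replaces it for
that subtree only, which is the "nearest node that has a hook set up"
behaviour described in the quoted text.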

> mkdir /cgroup/foo
> BPF_PROG_ATTACH(some program to foo)
> mkdir /cgroup/foo/bar
> chown -R some_user /cgroup/foo/bar
> 
> If the kernel only looked at the leaf, then the program that did the
> above would not expect that the program would constrain
> /cgroup/foo/bar's activity.  But, as it stands, the program *would*
> expect /cgroup/foo/bar to be constrained, except that, whenever the
> capable() check changes to ns_capable() (which will happen eventually
> one way or another), then the bad guy can create /cgroup/foo/bar/baz,
> install a new no-op hook there, and break the security assumption.
> 
> IOW, I think that totally non-recursive hooks are okay from a security
> perspective, albeit rather strange, but the current design is not okay
> from a security perspective.

We locked down the ability to override any of these programs with
CAP_NET_ADMIN, which is also what it takes to flush iptables, right?
What's the difference?

> So here's a fleshed-out possible version that's a bit of a compromise
> after sleeping on this.  There's plenty of room to tweak this.
> 
> Each cgroup gets a new file cgroup.hooks.  Reading it shows a list of
> active hooks.  (A hook can be a string like "network.inet_ingress".)
> 
> You can write a command like "-network.inet_ingress off" to it to
> disable network.inet_ingress.  You can write a command like
> "+network.inet_ingress" to it to enable the network.inet_ingress hook.
> 
> When a hook (e.g. network.inet_ingress) is enabled, a new file appears
> in the cgroup called "hooks.network.inet_ingress".  You can read it
> to get an indication of what is currently installed in that slot.  You
> can write "none" to it to cause nothing to be installed in that slot.
> (This replaces BPF_PROG_DETACH.)  You can open it for write and use
> bpf() or perhaps ioctl() to attach a bpf program.  Maybe you can also
> use bpf() to dump the bpf program, but, regardless, if a bpf program
> is there, read() will return some string that contains "bpf" and maybe
> some other useful information.

I can see where you're going, but I don't know yet whether I like this
approach better, given that you would still need a binary interface at
least at attach time, and that such an interface would use a resource
returned from bpf(2). The ability to read from control files in order to
see what's going on is nice though.

I'd like to have Tejun's and Alexei's opinion on this - as I said, I had
something like that (albeit much simpler) in one of my very early
drafts, but we agreed to do the hookup the other way around, for the
reasons stated above.


Thanks,
Daniel
