Netdev List

Netdev List
 help / color / mirror / Atom feed

* [PATCHv2 net 1/2] sctp: reduce indent level in sctp_copy_local_addr_list
From: Xin Long @ 2016-12-20  5:49 UTC (permalink / raw)
  To: network dev, linux-sctp; +Cc: davem, Marcelo Ricardo Leitner, Neil Horman
In-Reply-To: <cover.1482212764.git.lucien.xin@gmail.com>

This patch is to reduce indent level by using continue when the addr
is not allowed, and also drop end_copy by using break.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
---
 net/sctp/protocol.c | 37 +++++++++++++++++++------------------
 1 file changed, 19 insertions(+), 18 deletions(-)

diff --git a/net/sctp/protocol.c b/net/sctp/protocol.c
index 7b523e3..da5d82b 100644
--- a/net/sctp/protocol.c
+++ b/net/sctp/protocol.c
@@ -205,26 +205,27 @@ int sctp_copy_local_addr_list(struct net *net, struct sctp_bind_addr *bp,
 	list_for_each_entry_rcu(addr, &net->sctp.local_addr_list, list) {
 		if (!addr->valid)
 			continue;
-		if (sctp_in_scope(net, &addr->a, scope)) {
-			/* Now that the address is in scope, check to see if
-			 * the address type is really supported by the local
-			 * sock as well as the remote peer.
-			 */
-			if ((((AF_INET == addr->a.sa.sa_family) &&
-			      (copy_flags & SCTP_ADDR4_PEERSUPP))) ||
-			    (((AF_INET6 == addr->a.sa.sa_family) &&
-			      (copy_flags & SCTP_ADDR6_ALLOWED) &&
-			      (copy_flags & SCTP_ADDR6_PEERSUPP)))) {
-				error = sctp_add_bind_addr(bp, &addr->a,
-						    sizeof(addr->a),
-						    SCTP_ADDR_SRC, GFP_ATOMIC);
-				if (error)
-					goto end_copy;
-			}
-		}
+		if (!sctp_in_scope(net, &addr->a, scope))
+			continue;
+
+		/* Now that the address is in scope, check to see if
+		 * the address type is really supported by the local
+		 * sock as well as the remote peer.
+		 */
+		if (addr->a.sa.sa_family == AF_INET &&
+		    !(copy_flags & SCTP_ADDR4_PEERSUPP))
+			continue;
+		if (addr->a.sa.sa_family == AF_INET6 &&
+		    (!(copy_flags & SCTP_ADDR6_ALLOWED) ||
+		     !(copy_flags & SCTP_ADDR6_PEERSUPP)))
+			continue;
+
+		error = sctp_add_bind_addr(bp, &addr->a, sizeof(addr->a),
+					   SCTP_ADDR_SRC, GFP_ATOMIC);
+		if (error)
+			break;
 	}
 
-end_copy:
 	rcu_read_unlock();
 	return error;
 }
-- 
2.1.0

^ permalink raw reply related

* [PATCHv2 net 0/2] fix the issue that may copy duplicate addrs into assoc's bind address list
From: Xin Long @ 2016-12-20  5:49 UTC (permalink / raw)
  To: network dev, linux-sctp; +Cc: davem, Marcelo Ricardo Leitner, Neil Horman

Patch 1/2 is to fix some indent level.

Given that we have kernels out there with this issue, patch 2/2 also
fix sctp_raw_to_bind_addrs.

v1 -> v2:
  Explain why we didn't filter the duplicate addresses when global
  address list gets updated in patch 2/2 changelog.

Xin Long (2):
  sctp: reduce indent level in sctp_copy_local_addr_list
  sctp: not copying duplicate addrs to the assoc's bind address list

 net/sctp/bind_addr.c |  3 +++
 net/sctp/protocol.c  | 40 ++++++++++++++++++++++------------------
 2 files changed, 25 insertions(+), 18 deletions(-)

-- 
2.1.0

^ permalink raw reply

* Re: Potential issues (security and otherwise) with the current cgroup-bpf API
From: Alexei Starovoitov @ 2016-12-20  5:32 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: David Ahern, Andy Lutomirski, Daniel Mack,
	Mickaël Salaün, Kees Cook, Jann Horn, Tejun Heo,
	David S. Miller, Thomas Graf, Michael Kerrisk, Peter Zijlstra,
	Linux API, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Network Development
In-Reply-To: <CALCETrVxkdZA3SsRv0KKhBz9YvNMsnmHSjS8HN1GHrgWRYNM1g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On Mon, Dec 19, 2016 at 09:27:18PM -0800, Andy Lutomirski wrote:
> On Mon, Dec 19, 2016 at 8:44 PM, Alexei Starovoitov
> <alexei.starovoitov-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> > On Mon, Dec 19, 2016 at 07:12:48PM -0800, Andy Lutomirski wrote:
> >>
> >> struct cgroup_bpf {
> >>         /*
> >>          * Store two sets of bpf_prog pointers, one for programs that are
> >>          * pinned directly to this cgroup, and one for those that are effective
> >>          * when this cgroup is accessed.
> >>          */
> >>         struct bpf_prog *prog[MAX_BPF_ATTACH_TYPE];
> >>         struct bpf_prog *effective[MAX_BPF_ATTACH_TYPE];
> >> };
> >>
> >> in struct cgroup, there's a 'struct cgroup_bpf bpf;'.
> >>
> >> This would change to something like:
> >>
> >> struct cgroup_filter_slot {
> >>   struct bpf_prog *effective;
> >>   struct cgroup_filter_slot *next;
> >>   struct bpf_prog *local;
> >> }
> >>
> >> local is NULL unless *this* cgroup has a filter.  effective points to
> >> the bpf_prog that's active in this cgroup or the nearest ancestor that
> >> has a filter.  next is NULL if there are no filters higher in the
> >> chain or points to the next slot that has a filter.  struct cgroup
> >> has:
> >>
> >> struct cgroup_filter_slot filters[MAX_BPF_ATTACH_TYPE];
> >>
> >> To evaluate it, you do:
> >>
> >> struct cgroup_filter_slot *slot = &cgroup->slot[the index];
> >>
> >> if (!slot->effective)
> >>   return;
> >>
> >> do {
> >>   evaluate(slot->effective);
> >>   slot = slot->next;
> >> } while (unlikely(slot));
> >
> > yes. something like this can work as a future extension
> > to support multiple programs for security use case.
> > Please propose a patch.
> > Again, it's not needed today and there is no rush to implement it.
> >
> 
> If this happens after 4.10 and 4.10 is released as is, then this
> change would be an ABI break.

it won't break existing apps.
please study how bpf syscall was extended in the past without
breaking anything.
Same thing here. The default is one program per hook per cgroup.
Everything else is future extensions.

^ permalink raw reply

* Re: Potential issues (security and otherwise) with the current cgroup-bpf API
From: Andy Lutomirski @ 2016-12-20  5:27 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: David Ahern, Andy Lutomirski, Daniel Mack,
	Mickaël Salaün, Kees Cook, Jann Horn, Tejun Heo,
	David S. Miller, Thomas Graf, Michael Kerrisk, Peter Zijlstra,
	Linux API, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Network Development
In-Reply-To: <20161220044440.GB86803-+o4/htvd0TDFYCXBM6kdu7fOX0fSgVTm@public.gmane.org>

On Mon, Dec 19, 2016 at 8:44 PM, Alexei Starovoitov
<alexei.starovoitov-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> On Mon, Dec 19, 2016 at 07:12:48PM -0800, Andy Lutomirski wrote:
>>
>> struct cgroup_bpf {
>>         /*
>>          * Store two sets of bpf_prog pointers, one for programs that are
>>          * pinned directly to this cgroup, and one for those that are effective
>>          * when this cgroup is accessed.
>>          */
>>         struct bpf_prog *prog[MAX_BPF_ATTACH_TYPE];
>>         struct bpf_prog *effective[MAX_BPF_ATTACH_TYPE];
>> };
>>
>> in struct cgroup, there's a 'struct cgroup_bpf bpf;'.
>>
>> This would change to something like:
>>
>> struct cgroup_filter_slot {
>>   struct bpf_prog *effective;
>>   struct cgroup_filter_slot *next;
>>   struct bpf_prog *local;
>> }
>>
>> local is NULL unless *this* cgroup has a filter.  effective points to
>> the bpf_prog that's active in this cgroup or the nearest ancestor that
>> has a filter.  next is NULL if there are no filters higher in the
>> chain or points to the next slot that has a filter.  struct cgroup
>> has:
>>
>> struct cgroup_filter_slot filters[MAX_BPF_ATTACH_TYPE];
>>
>> To evaluate it, you do:
>>
>> struct cgroup_filter_slot *slot = &cgroup->slot[the index];
>>
>> if (!slot->effective)
>>   return;
>>
>> do {
>>   evaluate(slot->effective);
>>   slot = slot->next;
>> } while (unlikely(slot));
>
> yes. something like this can work as a future extension
> to support multiple programs for security use case.
> Please propose a patch.
> Again, it's not needed today and there is no rush to implement it.
>

If this happens after 4.10 and 4.10 is released as is, then this
change would be an ABI break.

--Andy

^ permalink raw reply

* Re: Potential issues (security and otherwise) with the current cgroup-bpf API
From: Andy Lutomirski @ 2016-12-20  5:26 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: David Miller, Andrew Lutomirski, Daniel Mack,
	Mickaël Salaün, Kees Cook, Jann Horn, Tejun Heo,
	David Ahern, Thomas Graf, Michael Kerrisk, Peter Zijlstra,
	Linux API, linux-kernel@vger.kernel.org, Network Development
In-Reply-To: <20161220045155.GC86803@ast-mbp.thefacebook.com>

On Mon, Dec 19, 2016 at 8:51 PM, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> On Mon, Dec 19, 2016 at 05:40:53PM -0800, Andy Lutomirski wrote:
>>
>> By the way, even if Alexei is right, the BPF_PROG_DETACH API doesn't
>> even take a reference to a BPF program as an argument.  What is it
>> supposed to do if this mechanism ever gets extended?
>
> we just add another field to that anonymous union just like
> we did for other commands and everything is backwards compatible.
> It's the basics of bpf syscall that we've been relying on for some
> time now and it worked just fine.

And what happens if you don't specify that member and two programs are attached?

--Andy

^ permalink raw reply

* Re: Soft lockup in inet_put_port on 4.6
From: Josef Bacik @ 2016-12-20  4:59 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Tom Herbert, David Miller, Hannes Frederic Sowa, Craig Gallek,
	Linux Kernel Network Developers
In-Reply-To: <1482209536.1521.21.camel@edumazet-glaptop3.roam.corp.google.com>


> On Dec 19, 2016, at 11:52 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> 
> On Tue, 2016-12-20 at 03:40 +0000, Josef Bacik wrote:
>>> On Dec 19, 2016, at 9:42 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>> 
>>>> On Mon, 2016-12-19 at 18:07 -0800, Tom Herbert wrote:
>>>> 
>>>> When sockets created SO_REUSEPORT move to TW state they are placed
>>>> back on the the tb->owners. fastreuse port is no longer set so we have
>>>> to walk potential long list of sockets in tb->owners to open a new
>>>> listener socket. I imagine this is happens when we try to open a new
>>>> listener SO_REUSEPORT after the system has been running a while and so
>>>> we hit the long tb->owners list.
>>> 
>>> Hmm...  __inet_twsk_hashdance() does not change tb->fastreuse
>>> 
>>> So where tb->fastreuse is cleared ?
>>> 
>>> If all your sockets have SO_REUSEPORT set, this should not happen.
>>> 
>> 
>> The app starts out with no SO_REUSEPORT, and then we restart it with
>> that option enabled.
> 
> But... why would the application do this dance ?
> 
> I now better understand why we never had these issues...
> 

It doesn't do it as a part of it's normal operation.  The old version didn't use SO_REUSEPORT and then somebody added support for it, restarted the service with the new option enabled and boom.  They immediately stopped doing anything and gave it to me to figure out.

> 
>>  What I suspect is we have all the twsks from the original service,
>> and the fastreuse stuff is cleared.  My naive patch resets it once we
>> add a reuseport sk to the tb and that makes the problem go away.  I'm
>> reworking all of this logic and adding some extra info to the tb to
>> make the reset actually safe.  I'll send those patches out tomorrow.
>> Thanks,
> 
> Okay, we will review them ;)
> 
> Note that Willy Tarreau wants some mechanism to be able to freeze a
> listener, to allow haproxy to be replaced without closing any sessions.
> 

I assume that's what these guys would want as well.  They have some weird handoff thing they do when the app starts but I'm not entirely convinced it does what they think it does.  Thanks,

Josef

^ permalink raw reply

* Re: Soft lockup in inet_put_port on 4.6
From: Eric Dumazet @ 2016-12-20  4:52 UTC (permalink / raw)
  To: Josef Bacik
  Cc: Tom Herbert, David Miller, Hannes Frederic Sowa, Craig Gallek,
	Linux Kernel Network Developers
In-Reply-To: <9DF94C8E-1463-4C10-81E3-E6F4534097CB@fb.com>

On Tue, 2016-12-20 at 03:40 +0000, Josef Bacik wrote:
> > On Dec 19, 2016, at 9:42 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > 
> >> On Mon, 2016-12-19 at 18:07 -0800, Tom Herbert wrote:
> >> 
> >> When sockets created SO_REUSEPORT move to TW state they are placed
> >> back on the the tb->owners. fastreuse port is no longer set so we have
> >> to walk potential long list of sockets in tb->owners to open a new
> >> listener socket. I imagine this is happens when we try to open a new
> >> listener SO_REUSEPORT after the system has been running a while and so
> >> we hit the long tb->owners list.
> > 
> > Hmm...  __inet_twsk_hashdance() does not change tb->fastreuse
> > 
> > So where tb->fastreuse is cleared ?
> > 
> > If all your sockets have SO_REUSEPORT set, this should not happen.
> > 
> 
> The app starts out with no SO_REUSEPORT, and then we restart it with
> that option enabled.

But... why would the application do this dance ?

I now better understand why we never had these issues...


>   What I suspect is we have all the twsks from the original service,
> and the fastreuse stuff is cleared.  My naive patch resets it once we
> add a reuseport sk to the tb and that makes the problem go away.  I'm
> reworking all of this logic and adding some extra info to the tb to
> make the reset actually safe.  I'll send those patches out tomorrow.
> Thanks,

Okay, we will review them ;)

Note that Willy Tarreau wants some mechanism to be able to freeze a
listener, to allow haproxy to be replaced without closing any sessions.

^ permalink raw reply

* Re: Potential issues (security and otherwise) with the current cgroup-bpf API
From: Alexei Starovoitov @ 2016-12-20  4:51 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: David Miller, Andrew Lutomirski, Daniel Mack,
	Mickaël Salaün, Kees Cook, Jann Horn, Tejun Heo,
	David Ahern, Thomas Graf, Michael Kerrisk, Peter Zijlstra,
	Linux API, linux-kernel@vger.kernel.org, Network Development
In-Reply-To: <CALCETrUc-iDUT8isKc43PPZ3xz31Sz+QTU+_SQQTAsWhH+zkLw@mail.gmail.com>

On Mon, Dec 19, 2016 at 05:40:53PM -0800, Andy Lutomirski wrote:
> 
> By the way, even if Alexei is right, the BPF_PROG_DETACH API doesn't
> even take a reference to a BPF program as an argument.  What is it
> supposed to do if this mechanism ever gets extended?

we just add another field to that anonymous union just like
we did for other commands and everything is backwards compatible.
It's the basics of bpf syscall that we've been relying on for some
time now and it worked just fine.

^ permalink raw reply

* Re: Potential issues (security and otherwise) with the current cgroup-bpf API
From: Alexei Starovoitov @ 2016-12-20  4:44 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: David Ahern, Andy Lutomirski, Daniel Mack,
	Mickaël Salaün, Kees Cook, Jann Horn, Tejun Heo,
	David S. Miller, Thomas Graf, Michael Kerrisk, Peter Zijlstra,
	Linux API, linux-kernel@vger.kernel.org, Network Development
In-Reply-To: <CALCETrVKu63BFVQFAJcLcd6ovPtq-WDdTh-BwyAPSprw8UarNQ@mail.gmail.com>

On Mon, Dec 19, 2016 at 07:12:48PM -0800, Andy Lutomirski wrote:
> 
> struct cgroup_bpf {
>         /*
>          * Store two sets of bpf_prog pointers, one for programs that are
>          * pinned directly to this cgroup, and one for those that are effective
>          * when this cgroup is accessed.
>          */
>         struct bpf_prog *prog[MAX_BPF_ATTACH_TYPE];
>         struct bpf_prog *effective[MAX_BPF_ATTACH_TYPE];
> };
> 
> in struct cgroup, there's a 'struct cgroup_bpf bpf;'.
> 
> This would change to something like:
> 
> struct cgroup_filter_slot {
>   struct bpf_prog *effective;
>   struct cgroup_filter_slot *next;
>   struct bpf_prog *local;
> }
> 
> local is NULL unless *this* cgroup has a filter.  effective points to
> the bpf_prog that's active in this cgroup or the nearest ancestor that
> has a filter.  next is NULL if there are no filters higher in the
> chain or points to the next slot that has a filter.  struct cgroup
> has:
> 
> struct cgroup_filter_slot filters[MAX_BPF_ATTACH_TYPE];
> 
> To evaluate it, you do:
> 
> struct cgroup_filter_slot *slot = &cgroup->slot[the index];
> 
> if (!slot->effective)
>   return;
> 
> do {
>   evaluate(slot->effective);
>   slot = slot->next;
> } while (unlikely(slot));

yes. something like this can work as a future extension
to support multiple programs for security use case.
Please propose a patch.
Again, it's not needed today and there is no rush to implement it.

^ permalink raw reply

* Re: Potential issues (security and otherwise) with the current cgroup-bpf API
From: Alexei Starovoitov @ 2016-12-20  4:41 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andy Lutomirski, Daniel Mack, Mickaël Salaün, Kees Cook,
	Jann Horn, Tejun Heo, David Ahern, David S. Miller, Thomas Graf,
	Michael Kerrisk, Peter Zijlstra, Linux API,
	linux-kernel@vger.kernel.org, Network Development
In-Reply-To: <CALCETrXymvAo-9zhQe=amToz_fs9XGniK2KLZv5Fxc66qcUx6A@mail.gmail.com>

On Mon, Dec 19, 2016 at 07:50:01PM -0800, Andy Lutomirski wrote:
> >>
> >> net.socket_create_filter = "none": no filter
> >> net.socket_create_filter = "bpf:baadf00d": bpf filter
> >
> > i'm assuming 'baadf00d' is bpf program fd expressed a text string?
> > and kernel needs to parse above? will you allow capital and lower
> > case for 'bpf:' ? and mixed case too? spaces and tabs allowed or not?
> > can program fd expressed as decimal or hex or both?
> > how do you return the error? as a text string for user space
> > to parse?
> 
> No.  The kernel does not parse it because you cannot write this to the
> file.  You set a bpf filter with ioctl and pass an fd.  If you *read*
> the file, you get the same bpf program hash that fdinfo on the bpf
> object would show -- this is for debugging and (eventually) CRIU.

my understanding that cgroup is based on kernfs and both don't support
ioctl, so you'd need quite some hacking to introduce such concepts
and buy-in from a bunch of people first.

> >> net.socket_create_filter = "iptables:foobar": some iptables thingy
> >> net.socket_create_filter = "nft:blahblahblah": some nft thingy
> >
> > iptables/nft are not applicable to 'bpf_socket_create' which
> > looks into:
> > struct bpf_sock {
> >         __u32 bound_dev_if;
> >         __u32 family;
> >         __u32 type;
> >         __u32 protocol;
> > };
> > so don't fit as an example.
> 
> The code that takes a 'struct sock' and sets up bpf_sock is
> bpf-specific and would obviously not be used for non-bpf filter.  But
> if you had a filter that was just a like of address families, that
> filter would look at struct sock and do its own thing.  iptables
> wouldn't make sense for a socket creation filter, but it would make
> perfect sense for an ingress filter.  Obviously the bpf-specific state
> object wouldn't be used, but it would still be a hook, invoked from
> the same network code, guarded by the same static key, looking at the
> same skb.

I strongly suggest to go back and read my first reply where
I think I explained well enough that something like iptables
will not able to reuse the ioctl mechanism you're proposing here,
hook ids will be different, attachment mechanism will be different too.
So your proposed cgroup ioctl is already dead as a reusable interface.

> >> net.socket_create_filter = "disallow": no sockets created period
> >> net.socket_create_filter = "address_family_list:1,2,3": allow AF 1, 2, and 3
> >
> > so you're proposing to add a bunch of hard coded logic to the kernel.
> > First to parse such text into some sort of syntax tree or list/set
> > and then have hard coded logic specifically for these two use cases?
> > While above two can be implemented as trivial bpf programs already?!
> > That goes 180% degree vs bpf philosophy. bpf is about moving
> > the specific code out of the kernel and keeping kernel generic that
> > it can solve as many use cases as possible by being programmable.
> 
> I'm not seriously proposing implementing these.  My point is that
> *bpf*, while wonderful, is not the be-all-and-end-all of kernel
> configurability, and other types of hooks might want to be hooked in
> here.

Then please let's talk about real use cases.
This daydreaming of some future llvm in the kernel that you were
bringing up during LPC doesn't help the discussion.
Just like these artificial examples.

> > ...
> >> What exactly isn't sensible about using cgroup_bpf for containers?
> >
> > my use case is close to zero overhead application network monitoring.
> 
> So if I set up a cgroup that's monitored and call it /cgroup/a and
> enable delegation and if the program running there wants to do its own
> monitoring in /cgroup/a/b (via delegation), then you really want the
> outer monitor to silently drop events coming from /cgroup/a/b?

yes. both are root and must talk to each other if they want
to co-exist. When root process is asking kernel to do X, this X has
to happen.

> Then disallow nesting.  You're welcome to not rush the decision, but
> that's not what you've done.  If 4.10 is released as is, you've made
> the decision and you're going to have a hard time changing it.

Nothing needs to be changed.

> > No. As was pointed out before only root can load BPF_PROG_TYPE_CGROUP_[SKB|SOCK]
> > type programs and only root can attach them.
> 
> Why?  It really seems to me that you expect that future namespaceable
> bpf hooks will use a totally different API.  At KS, I sat in a room
> full of people arguing about cgroup v2 and a lot of them pointed out
> that there are valid, paying use cases that want to stick cgroup v1 in
> a container, because the code that uses cgroup v1 in the container is
> called systemd and the contained OS is called RHEL (or SuSE or Ubuntu
> or Gentoo or whatever), and that code is *already written* and needs
> to be contained.

bpf in general is not namespace aware. It's global and this cgroup scoping
of bpf programs is the first of this kind. Namespacing of bpf is completely
different topic.

> The current approach to bpf hooks will bite you down the road.  David
> Ahern is already proposing using it for something that is not tracing
> at all, and someone will want that in a container, and there will be a
> problem.

vrf use case already supported by existing code.

> How about slowing down a wee bit and trying to come up with cgroup
> hook semantics that work for all of these use cases?  I think my
> proposal is quite close to workable.

you've started the topic by claiming that things are broken
and non-extensible. At the course of this thread it was explained
that the interface is extensible and not broken for the use case
it was designed for. The 'security' use case like lsm+bpf+cgroup
is not supported by the current model yet and that's what we need
to discuss in the future. So, yes, please slow down.

^ permalink raw reply

* Re: Potential issues (security and otherwise) with the current cgroup-bpf API
From: Andy Lutomirski @ 2016-12-20  3:50 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Andy Lutomirski, Daniel Mack, Mickaël Salaün, Kees Cook,
	Jann Horn, Tejun Heo, David Ahern, David S. Miller, Thomas Graf,
	Michael Kerrisk, Peter Zijlstra, Linux API,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Network Development
In-Reply-To: <20161220031802.GA77838-+o4/htvd0TDFYCXBM6kdu7fOX0fSgVTm@public.gmane.org>

On Mon, Dec 19, 2016 at 7:18 PM, Alexei Starovoitov
<alexei.starovoitov-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> On Mon, Dec 19, 2016 at 04:25:32PM -0800, Andy Lutomirski wrote:
>> I think we're still talking past each other.  A big part of the point
>> of changing it is that none of this is specific to bpf.  You could (in
>
> the hooks and context passed into the program is very much bpf specific.
> That's what I've been trying to convey all along.

You mean BPF_CGROUP_RUN_PROG_INET_SOCK(sk)?  There is nothing bpf
specfic about the hook except that the name of this macro has "BPF" in
it.  There is nothing whatsoever that's bpf-specific about the context
-- sk is not bpf-specific at all.

The only thing bpf-specific about it is that it currently only invokes
bpf programs.  That could easily change.

>
>> theory -- I'm not proposing implementing these until there's demand)
>> have:
>>
>> net.socket_create_filter = "none": no filter
>> net.socket_create_filter = "bpf:baadf00d": bpf filter
>
> i'm assuming 'baadf00d' is bpf program fd expressed a text string?
> and kernel needs to parse above? will you allow capital and lower
> case for 'bpf:' ? and mixed case too? spaces and tabs allowed or not?
> can program fd expressed as decimal or hex or both?
> how do you return the error? as a text string for user space
> to parse?

No.  The kernel does not parse it because you cannot write this to the
file.  You set a bpf filter with ioctl and pass an fd.  If you *read*
the file, you get the same bpf program hash that fdinfo on the bpf
object would show -- this is for debugging and (eventually) CRIU.

>
>> net.socket_create_filter = "iptables:foobar": some iptables thingy
>> net.socket_create_filter = "nft:blahblahblah": some nft thingy
>
> iptables/nft are not applicable to 'bpf_socket_create' which
> looks into:
> struct bpf_sock {
>         __u32 bound_dev_if;
>         __u32 family;
>         __u32 type;
>         __u32 protocol;
> };
> so don't fit as an example.

The code that takes a 'struct sock' and sets up bpf_sock is
bpf-specific and would obviously not be used for non-bpf filter.  But
if you had a filter that was just a like of address families, that
filter would look at struct sock and do its own thing.  iptables
wouldn't make sense for a socket creation filter, but it would make
perfect sense for an ingress filter.  Obviously the bpf-specific state
object wouldn't be used, but it would still be a hook, invoked from
the same network code, guarded by the same static key, looking at the
same skb.

>
>> net.socket_create_filter = "disallow": no sockets created period
>> net.socket_create_filter = "address_family_list:1,2,3": allow AF 1, 2, and 3
>
> so you're proposing to add a bunch of hard coded logic to the kernel.
> First to parse such text into some sort of syntax tree or list/set
> and then have hard coded logic specifically for these two use cases?
> While above two can be implemented as trivial bpf programs already?!
> That goes 180% degree vs bpf philosophy. bpf is about moving
> the specific code out of the kernel and keeping kernel generic that
> it can solve as many use cases as possible by being programmable.

I'm not seriously proposing implementing these.  My point is that
*bpf*, while wonderful, is not the be-all-and-end-all of kernel
configurability, and other types of hooks might want to be hooked in
here.

> ...
>> What exactly isn't sensible about using cgroup_bpf for containers?
>
> my use case is close to zero overhead application network monitoring.

So if I set up a cgroup that's monitored and call it /cgroup/a and
enable delegation and if the program running there wants to do its own
monitoring in /cgroup/a/b (via delegation), then you really want the
outer monitor to silently drop events coming from /cgroup/a/b?

>
>> >> > As you're pointing out, in case of security, we probably
>> >> > want to preserve original bpf program that should always be
>> >> > run first and only after it returned 'ok' (we'd need to define
>> >> > what 'ok' means in secruity context) run program attached to sub-hierarchy.
>> >>
>> >> It's already defined AFAICT.  1 means okay.  0 means not okay.
>> >
>> > sorry that doesn't make any sense. For seccomp we have a set of
>> > ranges that mean different things. Here you're proposing to
>> > hastily assign 1 and 0 ? How is that extensible?
>> > We need to carefully think through what should be the semantics
>> > of attaching multiple programs, consider performance implications,
>> > return codes and so on.
>>
>> You already assigned it.  The return value of the bpf program, loaded
>> in Linus' tree today, tells the kernel whether to accept or reject.
>
> yes. that's what the program tells the hook.
> I'm saying that whenever we have a link list of the programs
> interaction between them may or may not be expressible with 1/0
> and I don't want to rush such decision.

Then disallow nesting.  You're welcome to not rush the decision, but
that's not what you've done.  If 4.10 is released as is, you've made
the decision and you're going to have a hard time changing it.

>
>> >
>> >> > Another alternative is to disallow attaching programs in sub-hierarchy
>> >> > if parent has something already attached, but it's not useful
>> >> > for general case.
>> >> > All of these are possible future extensions.
>> >>
>> >> I like this last one, but IT'S NOT A POSSIBLE FUTURE EXTENSION.  You
>> >> have to do it now (or disable the feature for 4.10).  This is why I'm
>> >> bringing this whole thing up now.
>> >
>> > We don't have to touch user visible api here, so extensions are fine.
>>
>> Huh?  My example in the original email attaches a program in a
>> sub-hierarchy.  Are you saying that 4.11 could make that example stop
>> working?
>
> No. As was pointed out before only root can load BPF_PROG_TYPE_CGROUP_[SKB|SOCK]
> type programs and only root can attach them.

Why?  It really seems to me that you expect that future namespaceable
bpf hooks will use a totally different API.  At KS, I sat in a room
full of people arguing about cgroup v2 and a lot of them pointed out
that there are valid, paying use cases that want to stick cgroup v1 in
a container, because the code that uses cgroup v1 in the container is
called systemd and the contained OS is called RHEL (or SuSE or Ubuntu
or Gentoo or whatever), and that code is *already written* and needs
to be contained.

The current approach to bpf hooks will bite you down the road.  David
Ahern is already proposing using it for something that is not tracing
at all, and someone will want that in a container, and there will be a
problem.

> Similarly for bpf+lsm+cgroup. In the next round of Mickael's patches
> we'll keep debating the right security model for it.
>

How about slowing down a wee bit and trying to come up with cgroup
hook semantics that work for all of these use cases?  I think my
proposal is quite close to workable.

--Andy

^ permalink raw reply

* Re: [PATCH v2 2/3] Bluetooth: btusb: Add out-of-band wakeup support
From: Brian Norris @ 2016-12-20  3:41 UTC (permalink / raw)
  To: Rajat Jain
  Cc: Rob Herring, Mark Rutland, Marcel Holtmann, Gustavo Padovan,
	Johan Hedberg, Amitkumar Karwar, Wei-Ning Huang, Xinming Hu,
	netdev, devicetree, linux-bluetooth, linux-kernel, Rajat Jain
In-Reply-To: <CACK8Z6FteRYmrNq5Oq9LpHOUOYg49XrL0WsxD8G45pTw_cwaSg@mail.gmail.com>

On Mon, Dec 19, 2016 at 05:46:19PM -0800, Rajat Jain wrote:
> On Mon, Dec 19, 2016 at 3:10 PM, Brian Norris <briannorris@chromium.org> wrote:
> > On Fri, Dec 16, 2016 at 11:30:03AM -0800, Rajat Jain wrote:
> >> diff --git a/drivers/bluetooth/btusb.c b/drivers/bluetooth/btusb.c
> >> index ce22cef..beca4e9 100644
> >> --- a/drivers/bluetooth/btusb.c
> >> +++ b/drivers/bluetooth/btusb.c

> >> +static const struct of_device_id btusb_match_table[] = {
> >> +     { .compatible = "usb1286,204e" },
> >> +     { }
> >> +};
> >> +MODULE_DEVICE_TABLE(of, btusb_match_table);
> >
> > You define a match table here, but you also define essentially same
> > table for Marvell-specific additions in patch 3. It looks like maybe
> > it's legal to have more than one OF table in a module? But it seems like
> > it would get confusing, besides being somewhat strange to maintain. It
> > might also produce duplicate 'modinfo' output.
> >
> > If you really want to independently opt into device-tree-specified
> > interrupts vs. Marvell-specific interrrupt configuration, then you
> > should probably just merge the latter into the former table, and
> > implement a Marvell/GPIO flag to stick in the .data field of this table.
> 
> The data we want to stick (The pin number on the chip) is board
> specific rather than being chip specific, and hence .data may not be
> useful here.

I just meant that if you conceptually wanted Marvell devices to "opt in"
to this, then you could turn the .data field into some kind of flag or
struct for describing capabilities (e.g., MARVELL_GPIO_WAKE or similar),
effectively merging the two tables instead of having two
mostly-overlapping tables.

But you could do the same by putting such flag(s) in the
{btusb,blacklist}_table[], or add such flag(s) later whenever there's a
Marvell device that doesn't support this feature.

> > Or it might be fine to drop one or both "match" checks. Particularly for
> > the Marvell-specific stuff, it's probably fair just to check if it has
> > an ->of_node and if 'id->driver_info & BTUSB_MARVELL'. Any other Marvell
> > device-specific quirks could probably be keyed off of the
> > (weirdly-named?) blacklist_table[], which already matches PID/VID.
> 
> Yup, I think I can remove the compatible string checks.
> 
> I'll be sending a V3.

Great!

Brian

^ permalink raw reply

* Re: Soft lockup in inet_put_port on 4.6
From: Josef Bacik @ 2016-12-20  3:40 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Tom Herbert, David Miller, Hannes Frederic Sowa, Craig Gallek,
	Linux Kernel Network Developers
In-Reply-To: <1482201702.1521.13.camel@edumazet-glaptop3.roam.corp.google.com>


> On Dec 19, 2016, at 9:42 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> 
>> On Mon, 2016-12-19 at 18:07 -0800, Tom Herbert wrote:
>> 
>> When sockets created SO_REUSEPORT move to TW state they are placed
>> back on the the tb->owners. fastreuse port is no longer set so we have
>> to walk potential long list of sockets in tb->owners to open a new
>> listener socket. I imagine this is happens when we try to open a new
>> listener SO_REUSEPORT after the system has been running a while and so
>> we hit the long tb->owners list.
> 
> Hmm...  __inet_twsk_hashdance() does not change tb->fastreuse
> 
> So where tb->fastreuse is cleared ?
> 
> If all your sockets have SO_REUSEPORT set, this should not happen.
> 

The app starts out with no SO_REUSEPORT, and then we restart it with that option enabled.  What I suspect is we have all the twsks from the original service, and the fastreuse stuff is cleared.  My naive patch resets it once we add a reuseport sk to the tb and that makes the problem go away.  I'm reworking all of this logic and adding some extra info to the tb to make the reset actually safe.  I'll send those patches out tomorrow. Thanks,

Josef

^ permalink raw reply

* Re: Potential issues (security and otherwise) with the current cgroup-bpf API
From: Alexei Starovoitov @ 2016-12-20  3:18 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andy Lutomirski, Daniel Mack, Mickaël Salaün, Kees Cook,
	Jann Horn, Tejun Heo, David Ahern, David S. Miller, Thomas Graf,
	Michael Kerrisk, Peter Zijlstra, Linux API,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Network Development
In-Reply-To: <CALCETrU1_bDVLfokQ7zasHVmeq7S-R+603GEw59V_wuj4eE1hw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On Mon, Dec 19, 2016 at 04:25:32PM -0800, Andy Lutomirski wrote:
> On Mon, Dec 19, 2016 at 4:02 PM, Alexei Starovoitov
> <alexei.starovoitov-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> > On Mon, Dec 19, 2016 at 01:23:50PM -0800, Andy Lutomirski wrote:
> >> On Mon, Dec 19, 2016 at 12:56 PM, Alexei Starovoitov
> >> <alexei.starovoitov-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> >> > On Sat, Dec 17, 2016 at 10:18:44AM -0800, Andy Lutomirski wrote:
> >> >> Hi all-
> >> >>
> >> >> I apologize for being rather late with this.  I didn't realize that
> >> >> cgroup-bpf was going to be submitted for Linux 4.10, and I didn't see
> >> >> it on the linux-api list, so I missed the discussion.
> >> >>
> >> >> I think that the inet ingress, egress etc filters are a neat feature,
> >> >> but I think the API has some issues that will bite us down the road
> >> >> if it becomes stable in its current form.
> >> >>
> >> >> Most of the problems I see are summarized in this transcript:
> >> >>
> >> >> # mkdir cg2
> >> >> # mount -t cgroup2 none cg2
> >> >> # mkdir cg2/nosockets
> >> >> # strace cgrp_socket_rule cg2/nosockets/ 0
> >> >> ...
> >> >> open("cg2/nosockets/", O_RDONLY|O_DIRECTORY) = 3
> >> >>
> >> >> ^^^^ You can modify a cgroup after opening it O_RDONLY?
> >> >>
> >> >> bpf(BPF_PROG_LOAD, {prog_type=0x9 /* BPF_PROG_TYPE_??? */, insn_cnt=2,
> >> >> insns=0x7fffe3568c10, license="GPL", log_level=1, log_size=262144,
> >> >> log_buf=0x6020c0, kern_version=0}, 48) = 4
> >> >>
> >> >> ^^^^ This is fine.  The bpf() syscall manipulates bpf objects.
> >> >>
> >> >> bpf(0x8 /* BPF_??? */, 0x7fffe3568bf0, 48) = 0
> >> >>
> >> >> ^^^^ This is not so good:
> >> >> ^^^^
> >> >> ^^^^ a) The bpf() syscall is supposed to manipulate bpf objects.  This
> >> >> ^^^^    is manipulating a cgroup.  There's no reason that a socket creation
> >> >> ^^^^    filter couldn't be written in a different language (new iptables
> >> >> ^^^^    table?  Simple list of address families?), but if that happened,
> >> >> ^^^^    then using bpf() to install it would be entirely nonsensical.
> >> >
> >> > I don't see why it's _modifing_ the cgroup. I'm looking at it as
> >> > network stack is using cgroup as an application group that should
> >> > invoke bpf program at the certain point in the stack.
> >> > imo cgroup management is orthogonal.
> >>
> >> It is literally modifying the struct cgroup, and, as a practical
> >> matter, it's causing membership in the cgroup to have a certain
> >> effect.  But rather than pointless arguing, let me propose an
> >> alternative API that I think solves most of the problems here.
> >>
> >> In my model, BPF_PROG_ATTACH and BPF_PROG_DETACH go away completely.
> >> Instead, the cgroup gets three new control files:
> >> "net.ingress_filter", "net.egress_filter", and
> >> "net.socket_create_filter".  Initially, if you read these files, you
> >> see "none\n".
> >>
> >> To attach a bpf filter, you open the file for write and do an ioctl on
> >> it.  After doing the ioctl, if you read the file, you'll see
> >> "bpf:[hash]\n" where "[hash]" is exactly what you'd see in fdinfo for
> >> the bpf program.
> >>
> >> To detach any type of filter, bpf or otherwise, you open the file for
> >> write and write "none\n" (or just "none").
> >>
> >> If you write anything else to the file, you get -EINVAL.  But, if
> >> someone writes a new type of filter (perhaps a simple list of address
> >> families), maybe you can enable the filter by writing something
> >> appropriate to the file.
> >
> > I see no difference in what you're proposing vs what is already implemented
> > from feature set point of view, but the file approach is very ugly, since
> > it's a mismatch to FD style access that bpf is using everywhere.
> > In your proposal you'd also need to add bpf prefix everywhere.
> > So the control file names should be bpf_inet_ingress, bpf_inet_egress
> > and bpf_socket_create.
> 
> I think we're still talking past each other.  A big part of the point
> of changing it is that none of this is specific to bpf.  You could (in

the hooks and context passed into the program is very much bpf specific.
That's what I've been trying to convey all along.

> theory -- I'm not proposing implementing these until there's demand)
> have:
> 
> net.socket_create_filter = "none": no filter
> net.socket_create_filter = "bpf:baadf00d": bpf filter

i'm assuming 'baadf00d' is bpf program fd expressed a text string?
and kernel needs to parse above? will you allow capital and lower
case for 'bpf:' ? and mixed case too? spaces and tabs allowed or not?
can program fd expressed as decimal or hex or both?
how do you return the error? as a text string for user space
to parse?

> net.socket_create_filter = "iptables:foobar": some iptables thingy
> net.socket_create_filter = "nft:blahblahblah": some nft thingy

iptables/nft are not applicable to 'bpf_socket_create' which
looks into:
struct bpf_sock {
        __u32 bound_dev_if;
        __u32 family;
        __u32 type;
        __u32 protocol;
};
so don't fit as an example.

> net.socket_create_filter = "disallow": no sockets created period
> net.socket_create_filter = "address_family_list:1,2,3": allow AF 1, 2, and 3

so you're proposing to add a bunch of hard coded logic to the kernel.
First to parse such text into some sort of syntax tree or list/set
and then have hard coded logic specifically for these two use cases?
While above two can be implemented as trivial bpf programs already?!
That goes 180% degree vs bpf philosophy. bpf is about moving
the specific code out of the kernel and keeping kernel generic that
it can solve as many use cases as possible by being programmable.

> See?  This API is not bpf-specific.  It's an API for filtering.  The

no. I don't see it. BPF_CGROUP_INET_SOCK_CREATE is very much bpf specific
and we just discussed it to the last the detail.

> Can you explain your use case more clearly?
...
> What exactly isn't sensible about using cgroup_bpf for containers?

my use case is close to zero overhead application network monitoring.

> >> > As you're pointing out, in case of security, we probably
> >> > want to preserve original bpf program that should always be
> >> > run first and only after it returned 'ok' (we'd need to define
> >> > what 'ok' means in secruity context) run program attached to sub-hierarchy.
> >>
> >> It's already defined AFAICT.  1 means okay.  0 means not okay.
> >
> > sorry that doesn't make any sense. For seccomp we have a set of
> > ranges that mean different things. Here you're proposing to
> > hastily assign 1 and 0 ? How is that extensible?
> > We need to carefully think through what should be the semantics
> > of attaching multiple programs, consider performance implications,
> > return codes and so on.
> 
> You already assigned it.  The return value of the bpf program, loaded
> in Linus' tree today, tells the kernel whether to accept or reject.

yes. that's what the program tells the hook.
I'm saying that whenever we have a link list of the programs
interaction between them may or may not be expressible with 1/0
and I don't want to rush such decision.

> >
> >> > Another alternative is to disallow attaching programs in sub-hierarchy
> >> > if parent has something already attached, but it's not useful
> >> > for general case.
> >> > All of these are possible future extensions.
> >>
> >> I like this last one, but IT'S NOT A POSSIBLE FUTURE EXTENSION.  You
> >> have to do it now (or disable the feature for 4.10).  This is why I'm
> >> bringing this whole thing up now.
> >
> > We don't have to touch user visible api here, so extensions are fine.
> 
> Huh?  My example in the original email attaches a program in a
> sub-hierarchy.  Are you saying that 4.11 could make that example stop
> working?

No. As was pointed out before only root can load BPF_PROG_TYPE_CGROUP_[SKB|SOCK]
type programs and only root can attach them.
and this root semantics obviously have to be preserved from now on,
but that doesn't mean that non-root combinations have to follow the same.
For example, if for some bizarre reason you want to do
net.socket_create_filter = "disallow": no sockets created period
in the hard coded way without using bpf at all
(I would certainly oppose that as a waste of kernel .text,
but I'm not going to nack it), so you can do whatever semantics you like.
Similarly for bpf+lsm+cgroup. In the next round of Mickael's patches
we'll keep debating the right security model for it.

^ permalink raw reply

* Re: Potential issues (security and otherwise) with the current cgroup-bpf API
From: Andy Lutomirski @ 2016-12-20  3:12 UTC (permalink / raw)
  To: David Ahern
  Cc: Alexei Starovoitov, Andy Lutomirski, Daniel Mack,
	Mickaël Salaün, Kees Cook, Jann Horn, Tejun Heo,
	David S. Miller, Thomas Graf, Michael Kerrisk, Peter Zijlstra,
	Linux API, linux-kernel@vger.kernel.org, Network Development
In-Reply-To: <80574175-3692-0278-a74e-23b752d44f73@gmail.com>

On Mon, Dec 19, 2016 at 6:52 PM, David Ahern <dsahern@gmail.com> wrote:
> On 12/19/16 6:56 PM, Andy Lutomirski wrote:
>> On Mon, Dec 19, 2016 at 5:44 PM, David Ahern <dsahern@gmail.com> wrote:
>>> On 12/19/16 5:25 PM, Andy Lutomirski wrote:
>>>> net.socket_create_filter = "none": no filter
>>>> net.socket_create_filter = "bpf:baadf00d": bpf filter
>>>> net.socket_create_filter = "disallow": no sockets created period
>>>> net.socket_create_filter = "iptables:foobar": some iptables thingy
>>>> net.socket_create_filter = "nft:blahblahblah": some nft thingy
>>>> net.socket_create_filter = "address_family_list:1,2,3": allow AF 1, 2, and 3
>>>
>>> Such a scheme works for the socket create filter b/c it is a very simple use case. It does not work for the ingress and egress which allow generic bpf filters.
>>
>> Can you elaborate on what goes wrong?  (Obviously the
>> "address_family_list" example makes no sense in that context.)
>
> Being able to dump a filter or see that one exists would be a great add-on, but I don't see how 'net.socket_create_filter = "bpf:baadf00d"' is a viable API for loading generic BPF filters. Simple cases like "disallow" are easy -- just return 0 in the filter, no complicated BPF code needed. The rest are specific cases of the moment which goes against the intent of ebpf and generic programmability.

Oh -- I'm not proposing that at all.  Let me clarify.  For the bpf
case, if you *read* the file, you'd see "bpf:baadf00d".  But writing
"bpf:baadf00d" is nonsense and would give you -EINVAL.  Instead you
install a bpf filter by opening the file for write (O_RDWR or
O_WRONLY) and doing ioctl(cgroup_socket_create_file_fd,
CGROUP_IOCTL_SET_BPF_FILTER, bpf_fd);  It's kind of like
BPF_PROG_ATTACH except that it respects filesystem permissions, it
isn't in the bpf() multiplexer, the filter being set is implied (by
the fd in use), and everything is nicely seccompable.

To remove the filter, you write "none" or "none\n" to the file.

As a future extension, if someone wanted more than one filter to be
able to coexist in the cgroup socket_create_filter slot, you could
plausibly do 'echo disallow >>net.socket_create_filter' or use a new
ioctl CGROUP_IOCTL_APPEND_BPF_FILTER, and then you'd read the file and
see more than one line.  But this would be a future extension and may
never be needed.

>>
>> a) sub-cgroups cannot have a filter at all of the parent has a filter.
>> (This is the "punt" approach -- it lets different semantics be
>> assigned later without breaking userspace.)
>>
>> b) sub-cgroups can have a filter if a parent does, too.  The semantics
>> are that the sub-cgroup filter runs first and all side-effects occur.
>> If that filter says "reject" then ancestor filters are skipped.  If
>> that filter says "accept", then the ancestor filter is run and its
>> side-effects happen as well.  (And so on, all the way up to the root.)
>
> That comes with a big performance hit for skb / data path cases.
>
> I'm riding my use case on Daniel's work, and as I understand it the nesting case has been discussed. I'll defer to Daniel and Alexei on this part.

I'm not sure I buy the performance hit.  If you do it naively, then
performance will indeed suck.  But there's already a bunch of code in
there to pre-populate a filter list for faster use.  Currently, we
have:

struct cgroup_bpf {
        /*
         * Store two sets of bpf_prog pointers, one for programs that are
         * pinned directly to this cgroup, and one for those that are effective
         * when this cgroup is accessed.
         */
        struct bpf_prog *prog[MAX_BPF_ATTACH_TYPE];
        struct bpf_prog *effective[MAX_BPF_ATTACH_TYPE];
};

in struct cgroup, there's a 'struct cgroup_bpf bpf;'.

This would change to something like:

struct cgroup_filter_slot {
  struct bpf_prog *effective;
  struct cgroup_filter_slot *next;
  struct bpf_prog *local;
}

local is NULL unless *this* cgroup has a filter.  effective points to
the bpf_prog that's active in this cgroup or the nearest ancestor that
has a filter.  next is NULL if there are no filters higher in the
chain or points to the next slot that has a filter.  struct cgroup
has:

struct cgroup_filter_slot filters[MAX_BPF_ATTACH_TYPE];

To evaluate it, you do:

struct cgroup_filter_slot *slot = &cgroup->slot[the index];

if (!slot->effective)
  return;

do {
  evaluate(slot->effective);
  slot = slot->next;
} while (unlikely(slot));

The old code was one branch per evaluation.  The new code is n+1
branches where n is the number of filters, so it's one extra branch in
the worst case.  But the extra branch is cache-hot (the variable is
right next to slot->effective, which is needed regardless) and highly
predictable (in the case where nesting isn't used, the branch is not
taken, and it's a backward branch which most CPUs will predict as not
taken).

I expect that, on x86, this adds at most a cycle or two and quite
possibly adds no measurable overhead at all.

^ permalink raw reply

* [PATCH v3 3/3] Bluetooth: btusb: Configure Marvell to use one of the pins for oob wakeup
From: Rajat Jain @ 2016-12-20  2:56 UTC (permalink / raw)
  To: Rob Herring, Mark Rutland, Marcel Holtmann, Gustavo Padovan,
	Johan Hedberg, Amitkumar Karwar, Wei-Ning Huang, Xinming Hu,
	netdev-u79uwXL29TY76Z2rM5mHXA, devicetree-u79uwXL29TY76Z2rM5mHXA,
	linux-bluetooth-u79uwXL29TY76Z2rM5mHXA, Brian Norris,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA
  Cc: Rajat Jain, rajatxjain-Re5JQEeQqe8AvxtiuMwx3w
In-Reply-To: <1482202592-76314-1-git-send-email-rajatja-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>

The Marvell devices may have many gpio pins, and hence for wakeup
on these out-of-band pins, the chip needs to be told which pin is
to be used for wakeup, using an hci command.

Thus, we read the pin number etc from the device tree node and send
a command to the chip.

Signed-off-by: Rajat Jain <rajatja-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Reviewed-by: Brian Norris <briannorris-F7+t8E8rja9g9hUCZPvPmw@public.gmane.org>
Acked-by: Rob Herring <robh-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
---
v3: * remove the Marvell specific id table and check
    * Add reference to marvell-bt-8xxx.txt in btusb.txt
    * Add "Reviewed-by" and "Acked-by"    
v2: Fix the binding document to specify to use "wakeup" interrupt-name

 Documentation/devicetree/bindings/net/btusb.txt    |  3 ++
 .../{marvell-bt-sd8xxx.txt => marvell-bt-8xxx.txt} | 46 +++++++++++++++----
 drivers/bluetooth/btusb.c                          | 51 ++++++++++++++++++++++
 3 files changed, 92 insertions(+), 8 deletions(-)
 rename Documentation/devicetree/bindings/net/{marvell-bt-sd8xxx.txt => marvell-bt-8xxx.txt} (50%)

diff --git a/Documentation/devicetree/bindings/net/btusb.txt b/Documentation/devicetree/bindings/net/btusb.txt
index 2c0355c..01fa2d4 100644
--- a/Documentation/devicetree/bindings/net/btusb.txt
+++ b/Documentation/devicetree/bindings/net/btusb.txt
@@ -10,6 +10,9 @@ Required properties:
 
 		  "usb1286,204e" (Marvell 8997)
 
+Also, vendors that use btusb may have device additional properties, e.g:
+Documentation/devicetree/bindings/net/marvell-bt-8xxx.txt
+
 Optional properties:
 
   - interrupt-parent: phandle of the parent interrupt controller
diff --git a/Documentation/devicetree/bindings/net/marvell-bt-sd8xxx.txt b/Documentation/devicetree/bindings/net/marvell-bt-8xxx.txt
similarity index 50%
rename from Documentation/devicetree/bindings/net/marvell-bt-sd8xxx.txt
rename to Documentation/devicetree/bindings/net/marvell-bt-8xxx.txt
index 6a9a63c..9be1059 100644
--- a/Documentation/devicetree/bindings/net/marvell-bt-sd8xxx.txt
+++ b/Documentation/devicetree/bindings/net/marvell-bt-8xxx.txt
@@ -1,16 +1,21 @@
-Marvell 8897/8997 (sd8897/sd8997) bluetooth SDIO devices
+Marvell 8897/8997 (sd8897/sd8997) bluetooth devices (SDIO or USB based)
 ------
+The 8997 devices supports multiple interfaces. When used on SDIO interfaces,
+the btmrvl driver is used and when used on USB interface, the btusb driver is
+used.
 
 Required properties:
 
   - compatible : should be one of the following:
-	* "marvell,sd8897-bt"
-	* "marvell,sd8997-bt"
+	* "marvell,sd8897-bt" (for SDIO)
+	* "marvell,sd8997-bt" (for SDIO)
+	* "usb1286,204e"      (for USB)
 
 Optional properties:
 
   - marvell,cal-data: Calibration data downloaded to the device during
 		      initialization. This is an array of 28 values(u8).
+		      This is only applicable to SDIO devices.
 
   - marvell,wakeup-pin: It represents wakeup pin number of the bluetooth chip.
 		        firmware will use the pin to wakeup host system (u16).
@@ -18,10 +23,15 @@ Optional properties:
 		      platform. The value will be configured to firmware. This
 		      is needed to work chip's sleep feature as expected (u16).
   - interrupt-parent: phandle of the parent interrupt controller
-  - interrupts : interrupt pin number to the cpu. Driver will request an irq based
-		 on this interrupt number. During system suspend, the irq will be
-		 enabled so that the bluetooth chip can wakeup host platform under
-		 certain condition. During system resume, the irq will be disabled
+  - interrupt-names: Used only for USB based devices (See below)
+  - interrupts : specifies the interrupt pin number to the cpu. For SDIO, the
+		 driver will use the first interrupt specified in the interrupt
+		 array. For USB based devices, the driver will use the interrupt
+		 named "wakeup" from the interrupt-names and interrupt arrays.
+		 The driver will request an irq based on this interrupt number.
+		 During system suspend, the irq will be enabled so that the
+		 bluetooth chip can wakeup host platform under certain
+		 conditions. During system resume, the irq will be disabled
 		 to make sure unnecessary interrupt is not received.
 
 Example:
@@ -29,7 +39,9 @@ Example:
 IRQ pin 119 is used as system wakeup source interrupt.
 wakeup pin 13 and gap 100ms are configured so that firmware can wakeup host
 using this device side pin and wakeup latency.
-calibration data is also available in below example.
+
+Example for SDIO device follows (calibration data is also available in
+below example).
 
 &mmc3 {
 	status = "okay";
@@ -54,3 +66,21 @@ calibration data is also available in below example.
 		marvell,wakeup-gap-ms = /bits/ 16 <0x64>;
 	};
 };
+
+Example for USB device:
+
+&usb_host1_ohci {
+    status = "okay";
+    #address-cells = <1>;
+    #size-cells = <0>;
+
+    mvl_bt1: bt@1 {
+	compatible = "usb1286,204e";
+	reg = <1>;
+	interrupt-parent = <&gpio0>;
+	interrupt-names = "wakeup";
+	interrupts = <119 IRQ_TYPE_LEVEL_LOW>;
+	marvell,wakeup-pin = /bits/ 16 <0x0d>;
+	marvell,wakeup-gap-ms = /bits/ 16 <0x64>;
+    };
+};
diff --git a/drivers/bluetooth/btusb.c b/drivers/bluetooth/btusb.c
index beca4e9..a5c2cf9 100644
--- a/drivers/bluetooth/btusb.c
+++ b/drivers/bluetooth/btusb.c
@@ -2343,6 +2343,50 @@ static int btusb_shutdown_intel(struct hci_dev *hdev)
 	return 0;
 }
 
+#ifdef CONFIG_PM
+/* Configure an out-of-band gpio as wake-up pin, if specified in device tree */
+static int marvell_config_oob_wake(struct hci_dev *hdev)
+{
+	struct sk_buff *skb;
+	struct btusb_data *data = hci_get_drvdata(hdev);
+	struct device *dev = &data->udev->dev;
+	u16 pin, gap, opcode;
+	int ret;
+	u8 cmd[5];
+
+	/* Move on if no wakeup pin specified */
+	if (of_property_read_u16(dev->of_node, "marvell,wakeup-pin", &pin) ||
+	    of_property_read_u16(dev->of_node, "marvell,wakeup-gap-ms", &gap))
+		return 0;
+
+	/* Vendor specific command to configure a GPIO as wake-up pin */
+	opcode = hci_opcode_pack(0x3F, 0x59);
+	cmd[0] = opcode & 0xFF;
+	cmd[1] = opcode >> 8;
+	cmd[2] = 2; /* length of parameters that follow */
+	cmd[3] = pin;
+	cmd[4] = gap; /* time in ms, for which wakeup pin should be asserted */
+
+	skb = bt_skb_alloc(sizeof(cmd), GFP_KERNEL);
+	if (!skb) {
+		bt_dev_err(hdev, "%s: No memory\n", __func__);
+		return -ENOMEM;
+	}
+
+	memcpy(skb_put(skb, sizeof(cmd)), cmd, sizeof(cmd));
+	hci_skb_pkt_type(skb) = HCI_COMMAND_PKT;
+
+	ret = btusb_send_frame(hdev, skb);
+	if (ret) {
+		bt_dev_err(hdev, "%s: configuration failed\n", __func__);
+		kfree_skb(skb);
+		return ret;
+	}
+
+	return 0;
+}
+#endif
+
 static int btusb_set_bdaddr_marvell(struct hci_dev *hdev,
 				    const bdaddr_t *bdaddr)
 {
@@ -2917,6 +2961,13 @@ static int btusb_probe(struct usb_interface *intf,
 	err = btusb_config_oob_wake(hdev);
 	if (err)
 		goto out_free_dev;
+
+	/* Marvell devices may need a specific chip configuration */
+	if (id->driver_info & BTUSB_MARVELL && data->oob_wake_irq) {
+		err = marvell_config_oob_wake(hdev);
+		if (err)
+			goto out_free_dev;
+	}
 #endif
 	if (id->driver_info & BTUSB_CW6622)
 		set_bit(HCI_QUIRK_BROKEN_STORED_LINK_KEY, &hdev->quirks);
-- 
2.8.0.rc3.226.g39d4020

--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related

* [PATCH v3 2/3] Bluetooth: btusb: Add out-of-band wakeup support
From: Rajat Jain @ 2016-12-20  2:56 UTC (permalink / raw)
  To: Rob Herring, Mark Rutland, Marcel Holtmann, Gustavo Padovan,
	Johan Hedberg, Amitkumar Karwar, Wei-Ning Huang, Xinming Hu,
	netdev, devicetree, linux-bluetooth, Brian Norris, linux-kernel
  Cc: Rajat Jain, rajatxjain
In-Reply-To: <1482202592-76314-1-git-send-email-rajatja@google.com>

Some onboard BT chips (e.g. Marvell 8997) contain a wakeup pin that
can be connected to a gpio on the CPU side, and can be used to wakeup
the host out-of-band. This can be useful in situations where the
in-band wakeup is not possible or not preferable (e.g. the in-band
wakeup may require the USB host controller to remain active, and
hence consuming more system power during system sleep).

The oob gpio interrupt to be used for wakeup on the CPU side, is
read from the device tree node, (using standard interrupt descriptors).
A devcie tree binding document is also added for the driver. The
compatible string is in compliance with
Documentation/devicetree/bindings/usb/usb-device.txt

Signed-off-by: Rajat Jain <rajatja@google.com>
Reviewed-by: Brian Norris <briannorris@chromium.org>
---
v3: Add Brian's "Reviewed-by"
v2: * Use interrupt-names ("wakeup") instead of assuming first interrupt.
    * Leave it on device tree to specify IRQ flags (level /edge triggered)
    * Mark the device as non wakeable on exit.

 Documentation/devicetree/bindings/net/btusb.txt | 40 ++++++++++++
 drivers/bluetooth/btusb.c                       | 84 +++++++++++++++++++++++++
 2 files changed, 124 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/net/btusb.txt

diff --git a/Documentation/devicetree/bindings/net/btusb.txt b/Documentation/devicetree/bindings/net/btusb.txt
new file mode 100644
index 0000000..2c0355c
--- /dev/null
+++ b/Documentation/devicetree/bindings/net/btusb.txt
@@ -0,0 +1,40 @@
+Generic Bluetooth controller over USB (btusb driver)
+---------------------------------------------------
+
+Required properties:
+
+  - compatible : should comply with the format "usbVID,PID" specified in
+		 Documentation/devicetree/bindings/usb/usb-device.txt
+		 At the time of writing, the only OF supported devices
+		 (more may be added later) are:
+
+		  "usb1286,204e" (Marvell 8997)
+
+Optional properties:
+
+  - interrupt-parent: phandle of the parent interrupt controller
+  - interrupt-names: (see below)
+  - interrupts : The interrupt specified by the name "wakeup" is the interrupt
+		 that shall be used for out-of-band wake-on-bt. Driver will
+		 request this interrupt for wakeup. During system suspend, the
+		 irq will be enabled so that the bluetooth chip can wakeup host
+		 platform out of band. During system resume, the irq will be
+		 disabled to make sure unnecessary interrupt is not received.
+
+Example:
+
+Following example uses irq pin number 3 of gpio0 for out of band wake-on-bt:
+
+&usb_host1_ehci {
+    status = "okay";
+    #address-cells = <1>;
+    #size-cells = <0>;
+
+    mvl_bt1: bt@1 {
+	compatible = "usb1286,204e";
+	reg = <1>;
+	interrupt-parent = <&gpio0>;
+	interrupt-name = "wakeup";
+	interrupts = <3 IRQ_TYPE_LEVEL_LOW>;
+    };
+};
diff --git a/drivers/bluetooth/btusb.c b/drivers/bluetooth/btusb.c
index ce22cef..beca4e9 100644
--- a/drivers/bluetooth/btusb.c
+++ b/drivers/bluetooth/btusb.c
@@ -24,6 +24,8 @@
 #include <linux/module.h>
 #include <linux/usb.h>
 #include <linux/firmware.h>
+#include <linux/of_device.h>
+#include <linux/of_irq.h>
 #include <asm/unaligned.h>
 
 #include <net/bluetooth/bluetooth.h>
@@ -369,6 +371,7 @@ static const struct usb_device_id blacklist_table[] = {
 #define BTUSB_BOOTING		9
 #define BTUSB_RESET_RESUME	10
 #define BTUSB_DIAG_RUNNING	11
+#define BTUSB_OOB_WAKE_DISABLED	12
 
 struct btusb_data {
 	struct hci_dev       *hdev;
@@ -416,6 +419,8 @@ struct btusb_data {
 	int (*recv_bulk)(struct btusb_data *data, void *buffer, int count);
 
 	int (*setup_on_usb)(struct hci_dev *hdev);
+
+	int oob_wake_irq;   /* irq for out-of-band wake-on-bt */
 };
 
 static inline void btusb_free_frags(struct btusb_data *data)
@@ -2728,6 +2733,65 @@ static int btusb_bcm_set_diag(struct hci_dev *hdev, bool enable)
 }
 #endif
 
+#ifdef CONFIG_PM
+static irqreturn_t btusb_oob_wake_handler(int irq, void *priv)
+{
+	struct btusb_data *data = priv;
+
+	/* Disable only if not already disabled (keep it balanced) */
+	if (!test_and_set_bit(BTUSB_OOB_WAKE_DISABLED, &data->flags)) {
+		disable_irq_nosync(irq);
+		disable_irq_wake(irq);
+	}
+	pm_wakeup_event(&data->udev->dev, 0);
+	return IRQ_HANDLED;
+}
+
+static const struct of_device_id btusb_match_table[] = {
+	{ .compatible = "usb1286,204e" },
+	{ }
+};
+MODULE_DEVICE_TABLE(of, btusb_match_table);
+
+/* Use an oob wakeup pin? */
+static int btusb_config_oob_wake(struct hci_dev *hdev)
+{
+	struct btusb_data *data = hci_get_drvdata(hdev);
+	struct device *dev = &data->udev->dev;
+	int irq, ret;
+
+	if (!of_match_device(btusb_match_table, dev))
+		return 0;
+
+	/* Move on if no IRQ specified */
+	irq = of_irq_get_byname(dev->of_node, "wakeup");
+	if (irq <= 0) {
+		bt_dev_dbg(hdev, "%s: no OOB Wakeup IRQ in DT", __func__);
+		return 0;
+	}
+
+	set_bit(BTUSB_OOB_WAKE_DISABLED, &data->flags);
+
+	ret = devm_request_irq(&hdev->dev, irq, btusb_oob_wake_handler,
+			       0, "OOB Wake-on-BT", data);
+	if (ret) {
+		bt_dev_err(hdev, "%s: IRQ request failed", __func__);
+		return ret;
+	}
+
+	ret = device_init_wakeup(dev, true);
+	if (ret) {
+		bt_dev_err(hdev, "%s: failed to init_wakeup\n", __func__);
+		return ret;
+	}
+
+	data->oob_wake_irq = irq;
+	disable_irq(irq);
+	bt_dev_info(hdev, "OOB Wake-on-BT configured at IRQ %u\n", irq);
+	return 0;
+}
+#endif
+
 static int btusb_probe(struct usb_interface *intf,
 		       const struct usb_device_id *id)
 {
@@ -2849,6 +2913,11 @@ static int btusb_probe(struct usb_interface *intf,
 	hdev->send   = btusb_send_frame;
 	hdev->notify = btusb_notify;
 
+#ifdef CONFIG_PM
+	err = btusb_config_oob_wake(hdev);
+	if (err)
+		goto out_free_dev;
+#endif
 	if (id->driver_info & BTUSB_CW6622)
 		set_bit(HCI_QUIRK_BROKEN_STORED_LINK_KEY, &hdev->quirks);
 
@@ -3061,6 +3130,9 @@ static void btusb_disconnect(struct usb_interface *intf)
 			usb_driver_release_interface(&btusb_driver, data->isoc);
 	}
 
+	if (data->oob_wake_irq)
+		device_init_wakeup(&data->udev->dev, false);
+
 	hci_free_dev(hdev);
 }
 
@@ -3089,6 +3161,12 @@ static int btusb_suspend(struct usb_interface *intf, pm_message_t message)
 	btusb_stop_traffic(data);
 	usb_kill_anchored_urbs(&data->tx_anchor);
 
+	if (data->oob_wake_irq && device_may_wakeup(&data->udev->dev)) {
+		clear_bit(BTUSB_OOB_WAKE_DISABLED, &data->flags);
+		enable_irq_wake(data->oob_wake_irq);
+		enable_irq(data->oob_wake_irq);
+	}
+
 	/* Optionally request a device reset on resume, but only when
 	 * wakeups are disabled. If wakeups are enabled we assume the
 	 * device will stay powered up throughout suspend.
@@ -3126,6 +3204,12 @@ static int btusb_resume(struct usb_interface *intf)
 	if (--data->suspend_count)
 		return 0;
 
+	/* Disable only if not already disabled (keep it balanced) */
+	if (!test_and_set_bit(BTUSB_OOB_WAKE_DISABLED, &data->flags)) {
+		disable_irq(data->oob_wake_irq);
+		disable_irq_wake(data->oob_wake_irq);
+	}
+
 	if (!test_bit(HCI_RUNNING, &hdev->flags))
 		goto done;
 
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related

* [PATCH v3 1/3] Bluetooth: btusb: Use an error label for error paths
From: Rajat Jain @ 2016-12-20  2:56 UTC (permalink / raw)
  To: Rob Herring, Mark Rutland, Marcel Holtmann, Gustavo Padovan,
	Johan Hedberg, Amitkumar Karwar, Wei-Ning Huang, Xinming Hu,
	netdev, devicetree, linux-bluetooth, Brian Norris, linux-kernel
  Cc: Rajat Jain, rajatxjain

Use a label to remove the repetetive cleanup, for error cases.

Signed-off-by: Rajat Jain <rajatja@google.com>
Reviewed-by: Brian Norris <briannorris@chromium.org>
---
v3: Added Brian's "Reviewed-by"
v2: same as v1

 drivers/bluetooth/btusb.c | 19 +++++++++----------
 1 file changed, 9 insertions(+), 10 deletions(-)

diff --git a/drivers/bluetooth/btusb.c b/drivers/bluetooth/btusb.c
index 2f633df..ce22cef 100644
--- a/drivers/bluetooth/btusb.c
+++ b/drivers/bluetooth/btusb.c
@@ -2991,18 +2991,15 @@ static int btusb_probe(struct usb_interface *intf,
 		err = usb_set_interface(data->udev, 0, 0);
 		if (err < 0) {
 			BT_ERR("failed to set interface 0, alt 0 %d", err);
-			hci_free_dev(hdev);
-			return err;
+			goto out_free_dev;
 		}
 	}
 
 	if (data->isoc) {
 		err = usb_driver_claim_interface(&btusb_driver,
 						 data->isoc, data);
-		if (err < 0) {
-			hci_free_dev(hdev);
-			return err;
-		}
+		if (err < 0)
+			goto out_free_dev;
 	}
 
 #ifdef CONFIG_BT_HCIBTUSB_BCM
@@ -3016,14 +3013,16 @@ static int btusb_probe(struct usb_interface *intf,
 #endif
 
 	err = hci_register_dev(hdev);
-	if (err < 0) {
-		hci_free_dev(hdev);
-		return err;
-	}
+	if (err < 0)
+		goto out_free_dev;
 
 	usb_set_intfdata(intf, data);
 
 	return 0;
+
+out_free_dev:
+	hci_free_dev(hdev);
+	return err;
 }
 
 static void btusb_disconnect(struct usb_interface *intf)
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related

* Re: Potential issues (security and otherwise) with the current cgroup-bpf API
From: David Ahern @ 2016-12-20  2:52 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Alexei Starovoitov, Andy Lutomirski, Daniel Mack,
	Mickaël Salaün, Kees Cook, Jann Horn, Tejun Heo,
	David S. Miller, Thomas Graf, Michael Kerrisk, Peter Zijlstra,
	Linux API, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Network Development
In-Reply-To: <CALCETrUW2jEYmjSsOrPj+MAjkDGGUCw_rdxQh+5Er0r4ReGLnA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On 12/19/16 6:56 PM, Andy Lutomirski wrote:
> On Mon, Dec 19, 2016 at 5:44 PM, David Ahern <dsahern-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>> On 12/19/16 5:25 PM, Andy Lutomirski wrote:
>>> net.socket_create_filter = "none": no filter
>>> net.socket_create_filter = "bpf:baadf00d": bpf filter
>>> net.socket_create_filter = "disallow": no sockets created period
>>> net.socket_create_filter = "iptables:foobar": some iptables thingy
>>> net.socket_create_filter = "nft:blahblahblah": some nft thingy
>>> net.socket_create_filter = "address_family_list:1,2,3": allow AF 1, 2, and 3
>>
>> Such a scheme works for the socket create filter b/c it is a very simple use case. It does not work for the ingress and egress which allow generic bpf filters.
> 
> Can you elaborate on what goes wrong?  (Obviously the
> "address_family_list" example makes no sense in that context.)

Being able to dump a filter or see that one exists would be a great add-on, but I don't see how 'net.socket_create_filter = "bpf:baadf00d"' is a viable API for loading generic BPF filters. Simple cases like "disallow" are easy -- just return 0 in the filter, no complicated BPF code needed. The rest are specific cases of the moment which goes against the intent of ebpf and generic programmability. 
 
>>
>> ...
>>
>>>> you're ignoring use cases I described earlier.
>>>> In vrf case there is only one ifindex it needs to bind to.
>>>
>>> I'm totally lost.  Can you explain what this has to do with the cgroup
>>> hierarchy?
>>
>> I think the point is that a group hierarchy makes no sense for the VRF use case. What I put into iproute2 is
>>
>>     cgrp2/vrf/NAME
>>
>> where NAME is the vrf name. The filter added to it binds ipv4 and ipv6 sockets to a specific device index. cgrp2/vrf is the "default" vrf and does not have a filter. A user can certainly add another layer cgrp2/vrf/NAME/NAME2 but it provides no value since VRF in a VRF does not make sense.
> 
> I tend to agree.  I still think that the mechanism as it stands is
> broken in other respects and should be fixed before it goes live.  I
> have no desire to cause problems for the vrf use case.
> 
> But keep in mind that the vrf use case is, in Linus' tree, a bit
> broken right now in its interactions with other users of the same
> mechanism.  Suppose I create a container and want to trace all of its
> created sockets.  I'll set up cgrp2/container and load my tracer as a
> socket creation hook.  Then a container sets up
> cgrp2/container/vrf/NAME (using delgation) and loads your vrf binding
> filter.  Now the tracing stops working -- oops.

There are other ways to achieve socket tracing, but I get your point -- nested cases do not work as users may want.


>>>>> I like this last one, but IT'S NOT A POSSIBLE FUTURE EXTENSION.  You
>>>>> have to do it now (or disable the feature for 4.10).  This is why I'm
>>>>> bringing this whole thing up now.
>>>>
>>>> We don't have to touch user visible api here, so extensions are fine.
>>>
>>> Huh?  My example in the original email attaches a program in a
>>> sub-hierarchy.  Are you saying that 4.11 could make that example stop
>>> working?
>>
>> Are you suggesting sub-cgroups should not be allowed to override the filter of a parent cgroup?
> 
> Yes, exactly.  I think there are two sensible behaviors:
> 
> a) sub-cgroups cannot have a filter at all of the parent has a filter.
> (This is the "punt" approach -- it lets different semantics be
> assigned later without breaking userspace.)
> 
> b) sub-cgroups can have a filter if a parent does, too.  The semantics
> are that the sub-cgroup filter runs first and all side-effects occur.
> If that filter says "reject" then ancestor filters are skipped.  If
> that filter says "accept", then the ancestor filter is run and its
> side-effects happen as well.  (And so on, all the way up to the root.)

That comes with a big performance hit for skb / data path cases.

I'm riding my use case on Daniel's work, and as I understand it the nesting case has been discussed. I'll defer to Daniel and Alexei on this part.

^ permalink raw reply

* Re: Soft lockup in inet_put_port on 4.6
From: Eric Dumazet @ 2016-12-20  2:41 UTC (permalink / raw)
  To: Tom Herbert
  Cc: David Miller, Josef Bacik, Hannes Frederic Sowa, Craig Gallek,
	Linux Kernel Network Developers
In-Reply-To: <CALx6S340vCsaymbx+CcKL3Zi2p9EvDSwc0fDAxKsSqs4C2p5kg@mail.gmail.com>

On Mon, 2016-12-19 at 18:07 -0800, Tom Herbert wrote:

> When sockets created SO_REUSEPORT move to TW state they are placed
> back on the the tb->owners. fastreuse port is no longer set so we have
> to walk potential long list of sockets in tb->owners to open a new
> listener socket. I imagine this is happens when we try to open a new
> listener SO_REUSEPORT after the system has been running a while and so
> we hit the long tb->owners list.

Hmm...  __inet_twsk_hashdance() does not change tb->fastreuse

So where tb->fastreuse is cleared ?

If all your sockets have SO_REUSEPORT set, this should not happen.

^ permalink raw reply

* [PATCH 2/2] ARM: dts: hix5hd2: don't change the existing compatible string
From: Dongpo Li @ 2016-12-20  2:09 UTC (permalink / raw)
  To: robh+dt-DgEjT+Ai2ygdnm+yROfE0A, mark.rutland-5wv7dgnIgG8,
	linux-I+IVW8TIWO2tmTQ+vhA3Yw, yisen.zhuang-hv44wF8Li93QT0dZR+AlfA,
	salil.mehta-hv44wF8Li93QT0dZR+AlfA, davem-fT/PcQaiUtIeIZ0/mPfg9Q,
	arnd-r2nGTMty4D4, andrew-g2DYL2Zd6BY
  Cc: xuejiancheng-C8/M+/jPZTeaMJb+Lgu22Q,
	benjamin.chenhao-C8/M+/jPZTeaMJb+Lgu22Q,
	caizhiyong-C8/M+/jPZTeaMJb+Lgu22Q, netdev-u79uwXL29TY76Z2rM5mHXA,
	devicetree-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Dongpo Li
In-Reply-To: <1482199769-106501-1-git-send-email-lidongpo-C8/M+/jPZTeaMJb+Lgu22Q@public.gmane.org>

The SoC hix5hd2 compatible string has the suffix "-gmac" and
we should not change it.
We should only add the generic compatible string "hisi-gmac-v1".

Fixes: 0855950ba580 ("ARM: dts: hix5hd2: add gmac generic compatible and clock names")
Signed-off-by: Dongpo Li <lidongpo-C8/M+/jPZTeaMJb+Lgu22Q@public.gmane.org>
---
 arch/arm/boot/dts/hisi-x5hd2.dtsi | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/arm/boot/dts/hisi-x5hd2.dtsi b/arch/arm/boot/dts/hisi-x5hd2.dtsi
index 0da76c5..2d06f4c 100644
--- a/arch/arm/boot/dts/hisi-x5hd2.dtsi
+++ b/arch/arm/boot/dts/hisi-x5hd2.dtsi
@@ -436,7 +436,7 @@
 		};
 
 		gmac0: ethernet@1840000 {
-			compatible = "hisilicon,hix5hd2-gemac", "hisilicon,hisi-gemac-v1";
+			compatible = "hisilicon,hix5hd2-gmac", "hisilicon,hisi-gmac-v1";
 			reg = <0x1840000 0x1000>,<0x184300c 0x4>;
 			interrupts = <0 71 4>;
 			clocks = <&clock HIX5HD2_MAC0_CLK>;
@@ -445,7 +445,7 @@
 		};
 
 		gmac1: ethernet@1841000 {
-			compatible = "hisilicon,hix5hd2-gemac", "hisilicon,hisi-gemac-v1";
+			compatible = "hisilicon,hix5hd2-gmac", "hisilicon,hisi-gmac-v1";
 			reg = <0x1841000 0x1000>,<0x1843010 0x4>;
 			interrupts = <0 72 4>;
 			clocks = <&clock HIX5HD2_MAC1_CLK>;
-- 
2.8.2

--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related

* [PATCH 1/2] net: hix5hd2_gmac: fix compatible strings name
From: Dongpo Li @ 2016-12-20  2:09 UTC (permalink / raw)
  To: robh+dt, mark.rutland, linux, yisen.zhuang, salil.mehta, davem,
	arnd, andrew
  Cc: xuejiancheng, benjamin.chenhao, caizhiyong, netdev, devicetree,
	linux-kernel, Dongpo Li
In-Reply-To: <1482199769-106501-1-git-send-email-lidongpo@hisilicon.com>

The SoC hix5hd2 compatible string has the suffix "-gmac" and
we should not change its compatible string.
So we should name all the compatible string with the suffix "-gmac".
Creating a new name suffix "-gemac" is unnecessary.

We also add another SoC compatible string in dt binding documentation
and describe which generic version the SoC belongs to.

Fixes: d0fb6ba75dc0 ("net: hix5hd2_gmac: add generic compatible string")
Signed-off-by: Dongpo Li <lidongpo@hisilicon.com>
---
 .../devicetree/bindings/net/hisilicon-hix5hd2-gmac.txt      | 13 ++++++++-----
 drivers/net/ethernet/hisilicon/hix5hd2_gmac.c               | 13 +++++++------
 2 files changed, 15 insertions(+), 11 deletions(-)

diff --git a/Documentation/devicetree/bindings/net/hisilicon-hix5hd2-gmac.txt b/Documentation/devicetree/bindings/net/hisilicon-hix5hd2-gmac.txt
index 063c02d..eea73ad 100644
--- a/Documentation/devicetree/bindings/net/hisilicon-hix5hd2-gmac.txt
+++ b/Documentation/devicetree/bindings/net/hisilicon-hix5hd2-gmac.txt
@@ -2,11 +2,14 @@ Hisilicon hix5hd2 gmac controller
 
 Required properties:
 - compatible: should contain one of the following SoC strings:
-	* "hisilicon,hix5hd2-gemac"
-	* "hisilicon,hi3798cv200-gemac"
+	* "hisilicon,hix5hd2-gmac"
+	* "hisilicon,hi3798cv200-gmac"
+	* "hisilicon,hi3516a-gmac"
 	and one of the following version string:
-	* "hisilicon,hisi-gemac-v1"
-	* "hisilicon,hisi-gemac-v2"
+	* "hisilicon,hisi-gmac-v1"
+	* "hisilicon,hisi-gmac-v2"
+  The version v1 includes SoCs hix5hd2.
+  The version v2 includes SoCs hi3798cv200, hi3516a.
 - reg: specifies base physical address(s) and size of the device registers.
   The first region is the MAC register base and size.
   The second region is external interface control register.
@@ -35,7 +38,7 @@ Required properties:
 
 Example:
 	gmac0: ethernet@f9840000 {
-		compatible = "hisilicon,hi3798cv200-gemac", "hisilicon,hisi-gemac-v2";
+		compatible = "hisilicon,hi3798cv200-gmac", "hisilicon,hisi-gmac-v2";
 		reg = <0xf9840000 0x1000>,<0xf984300c 0x4>;
 		interrupts = <0 71 4>;
 		#address-cells = <1>;
diff --git a/drivers/net/ethernet/hisilicon/hix5hd2_gmac.c b/drivers/net/ethernet/hisilicon/hix5hd2_gmac.c
index ee7e9ce..418ca1f3 100644
--- a/drivers/net/ethernet/hisilicon/hix5hd2_gmac.c
+++ b/drivers/net/ethernet/hisilicon/hix5hd2_gmac.c
@@ -1316,10 +1316,11 @@ static int hix5hd2_dev_remove(struct platform_device *pdev)
 }
 
 static const struct of_device_id hix5hd2_of_match[] = {
-	{ .compatible = "hisilicon,hisi-gemac-v1", .data = (void *)GEMAC_V1 },
-	{ .compatible = "hisilicon,hisi-gemac-v2", .data = (void *)GEMAC_V2 },
-	{ .compatible = "hisilicon,hix5hd2-gemac", .data = (void *)GEMAC_V1 },
-	{ .compatible = "hisilicon,hi3798cv200-gemac", .data = (void *)GEMAC_V2 },
+	{ .compatible = "hisilicon,hisi-gmac-v1", .data = (void *)GEMAC_V1 },
+	{ .compatible = "hisilicon,hisi-gmac-v2", .data = (void *)GEMAC_V2 },
+	{ .compatible = "hisilicon,hix5hd2-gmac", .data = (void *)GEMAC_V1 },
+	{ .compatible = "hisilicon,hi3798cv200-gmac", .data = (void *)GEMAC_V2 },
+	{ .compatible = "hisilicon,hi3516a-gmac", .data = (void *)GEMAC_V2 },
 	{},
 };
 
@@ -1327,7 +1328,7 @@ MODULE_DEVICE_TABLE(of, hix5hd2_of_match);
 
 static struct platform_driver hix5hd2_dev_driver = {
 	.driver = {
-		.name = "hisi-gemac",
+		.name = "hisi-gmac",
 		.of_match_table = hix5hd2_of_match,
 	},
 	.probe = hix5hd2_dev_probe,
@@ -1338,4 +1339,4 @@ module_platform_driver(hix5hd2_dev_driver);
 
 MODULE_DESCRIPTION("HISILICON Gigabit Ethernet MAC driver");
 MODULE_LICENSE("GPL v2");
-MODULE_ALIAS("platform:hisi-gemac");
+MODULE_ALIAS("platform:hisi-gmac");
-- 
2.8.2

^ permalink raw reply related

* [PATCH 0/2] net: hix5hd2_gmac: keep the compatible string not changed
From: Dongpo Li @ 2016-12-20  2:09 UTC (permalink / raw)
  To: robh+dt, mark.rutland, linux, yisen.zhuang, salil.mehta, davem,
	arnd, andrew
  Cc: xuejiancheng, benjamin.chenhao, caizhiyong, netdev, devicetree,
	linux-kernel, Dongpo Li

This patch series fix the patch:
d0fb6ba75dc0 ("net: hix5hd2_gmac: add generic compatible string")

The SoC hix5hd2 compatible string has the suffix "-gmac" and
we should not change its compatible string.
So we should name all the compatible string with the suffix "-gmac".
Creating a new name suffix "-gemac" is unnecessary.

Dongpo Li (2):
  net: hix5hd2_gmac: fix compatible strings name
  ARM: dts: hix5hd2: don't change the existing compatible string

 .../devicetree/bindings/net/hisilicon-hix5hd2-gmac.txt      | 13 ++++++++-----
 arch/arm/boot/dts/hisi-x5hd2.dtsi                           |  4 ++--
 drivers/net/ethernet/hisilicon/hix5hd2_gmac.c               | 13 +++++++------
 3 files changed, 17 insertions(+), 13 deletions(-)

-- 
2.8.2

^ permalink raw reply

* Re: Soft lockup in inet_put_port on 4.6
From: Tom Herbert @ 2016-12-20  2:07 UTC (permalink / raw)
  To: David Miller
  Cc: Josef Bacik, Hannes Frederic Sowa, Craig Gallek, Eric Dumazet,
	Linux Kernel Network Developers
In-Reply-To: <20161219.205646.1955469060856026212.davem@davemloft.net>

On Mon, Dec 19, 2016 at 5:56 PM, David Miller <davem@davemloft.net> wrote:
> From: Josef Bacik <jbacik@fb.com>
> Date: Sat, 17 Dec 2016 13:26:00 +0000
>
>> So take my current duct tape fix and augment it with more
>> information in the bind bucket?  I'm not sure how to make this work
>> without at least having a list of the binded addrs as well to make
>> sure we are really ok.  I suppose we could save the fastreuseport
>> address that last succeeded to make it work properly, but I'd have
>> to make it protocol agnostic and then have a callback to have the
>> protocol to make sure we don't have to do the bind_conflict run.  Is
>> that what you were thinking of?  Thanks,
>
> So there isn't a deadlock or lockup here, something is just running
> really slow, right?
>
Correct.

> And that "something" is a scan of the sockets on a tb list, and
> there's lots of timewait sockets hung off of that tb.
>
Yes.

> As far as I can tell, this scan is happening in
> inet_csk_bind_conflict().
>
Yes.

> Furthermore, reuseport is somehow required to make this problem
> happen.  How exactly?

When sockets created SO_REUSEPORT move to TW state they are placed
back on the the tb->owners. fastreuse port is no longer set so we have
to walk potential long list of sockets in tb->owners to open a new
listener socket. I imagine this is happens when we try to open a new
listener SO_REUSEPORT after the system has been running a while and so
we hit the long tb->owners list.

Tom

^ permalink raw reply

* Re: Soft lockup in inet_put_port on 4.6
From: David Miller @ 2016-12-20  1:56 UTC (permalink / raw)
  To: jbacik; +Cc: hannes, tom, kraigatgoog, eric.dumazet, netdev
In-Reply-To: <286A21B1-2A15-4DDF-B334-A016DA3D52EA@fb.com>

From: Josef Bacik <jbacik@fb.com>
Date: Sat, 17 Dec 2016 13:26:00 +0000

> So take my current duct tape fix and augment it with more
> information in the bind bucket?  I'm not sure how to make this work
> without at least having a list of the binded addrs as well to make
> sure we are really ok.  I suppose we could save the fastreuseport
> address that last succeeded to make it work properly, but I'd have
> to make it protocol agnostic and then have a callback to have the
> protocol to make sure we don't have to do the bind_conflict run.  Is
> that what you were thinking of?  Thanks,

So there isn't a deadlock or lockup here, something is just running
really slow, right?

And that "something" is a scan of the sockets on a tb list, and
there's lots of timewait sockets hung off of that tb.

As far as I can tell, this scan is happening in
inet_csk_bind_conflict().

Furthermore, reuseport is somehow required to make this problem
happen.  How exactly?

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox