Netdev List


* Re: UDP sockets oddities
From: Eric Dumazet @ 2017-08-25 23:57 UTC (permalink / raw)
  To: Florian Fainelli; +Cc: netdev, edumazet, pabeni, willemb, davem
In-Reply-To: <3c44d27e-27d9-e9c4-04b3-17c0366e60d9@gmail.com>

On Fri, 2017-08-25 at 16:18 -0700, Florian Fainelli wrote:

> Eric, are there areas of the stack where we are allowed to drop packets,
> not propagate that back to write(2) and also not increment any counter
> either, or maybe I am not looking where I should...

What happens if you increase these sysctls ?

grep .  `find /proc/sys|grep unres_qlen`


unres_qlen_bytes -> 2000000
unres_qlen -> 10000


* Re: [PATCH net] ipv6: Fix may be used uninitialized warning in rt6_check
From: David Miller @ 2017-08-26  0:04 UTC (permalink / raw)
  To: sbrivio; +Cc: steffen.klassert, weiwan, edumazet, kafai, netdev
In-Reply-To: <20170825110206.36e4a7a7@elisabeth>

From: Stefano Brivio <sbrivio@redhat.com>
Date: Fri, 25 Aug 2017 11:02:06 +0200

> On Fri, 25 Aug 2017 09:52:17 +0200
> Steffen Klassert <steffen.klassert@secunet.com> wrote:
> 
>> On Fri, Aug 25, 2017 at 09:05:42AM +0200, Steffen Klassert wrote:
>> > rt_cookie might be used uninitialized, fix this by
>> > initializing it.
>> > 
>> > Fixes: c5cff8561d2d ("ipv6: add rcu grace period before freeing fib6_node")
>> > Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
>> > ---
>> >  net/ipv6/route.c | 2 +-
>> >  1 file changed, 1 insertion(+), 1 deletion(-)
>> > 
>> > diff --git a/net/ipv6/route.c b/net/ipv6/route.c
>> > index a9d3564..48c8c92 100644
>> > --- a/net/ipv6/route.c
>> > +++ b/net/ipv6/route.c
>> > @@ -1289,7 +1289,7 @@ static void rt6_dst_from_metrics_check(struct rt6_info *rt)
>> >  
>> >  static struct dst_entry *rt6_check(struct rt6_info *rt, u32 cookie)
>> >  {
>> > -	u32 rt_cookie;
>> > +	u32 rt_cookie = 0;
>> >  
>> >  	if (!rt6_get_cookie_safe(rt, &rt_cookie) || rt_cookie != cookie)
>> >  		return NULL;  
>> 
>> The compiler warning seems to be a false positive, as
>> rt_cookie != cookie is only checked if rt6_get_cookie_safe
>> returns true in which case rt_cookie is initialized.
>> 
>> Please disregard this patch.
> 
> ...or not? I was thinking of sending a similar patch with
> uninitialized_var(rt_cookie), but it seems we have similar cases
> where we just initialize to zero instead.
> 
> I wonder which approach is considered the most acceptable nowadays. I
> would be in favour of uninitialized_var() as it doesn't change the
> binary output, but https://lwn.net/Articles/529954/ also contains some
> valid criticism. Ideas?

Generally speaking I guess initializing to zero is Ok to do.

As far as which approach is better, I don't have any strong opinion.

So I will probably just apply Steffen's patch.
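
For reference, the false-positive argument made earlier in the thread can be
reproduced in a minimal userspace sketch. The names get_cookie_safe() and
check() are hypothetical stand-ins for rt6_get_cookie_safe() and rt6_check(),
not the kernel code. The right-hand side of || is evaluated only when the
getter succeeded and therefore wrote the output, so zero-initializing merely
silences the warning.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for rt6_get_cookie_safe(): writes *cookie only
 * when it returns true.
 */
static bool get_cookie_safe(const uint32_t *src, uint32_t *cookie)
{
	if (!src)
		return false;	/* *cookie is left untouched */
	*cookie = *src;
	return true;
}

/* Hypothetical stand-in for rt6_check(): returns src on a cookie match,
 * NULL otherwise.
 */
static const uint32_t *check(const uint32_t *src, uint32_t cookie)
{
	uint32_t rt_cookie = 0;	/* initialization only silences the warning */

	/* || short-circuits: rt_cookie is compared only after a successful get */
	if (!get_cookie_safe(src, &rt_cookie) || rt_cookie != cookie)
		return NULL;
	return src;
}
```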


* Re: [PATCH net] ipv6: fix sparse warning on rt6i_node
From: Martin KaFai Lau @ 2017-08-26  0:06 UTC (permalink / raw)
  To: Wei Wang; +Cc: David Miller, netdev, Eric Dumazet
In-Reply-To: <20170825220310.24863-1-tracywwnj@gmail.com>

On Fri, Aug 25, 2017 at 03:03:10PM -0700, Wei Wang wrote:
> From: Wei Wang <weiwan@google.com>
>
> Commit c5cff8561d2d adds rcu grace period before freeing fib6_node. This
> generates a new sparse warning on rt->rt6i_node related code:
>   net/ipv6/route.c:1394:30: error: incompatible types in comparison
>   expression (different address spaces)
>   ./include/net/ip6_fib.h:187:14: error: incompatible types in comparison
>   expression (different address spaces)
>
> This commit adds "__rcu" tag for rt6i_node and makes sure corresponding
> rcu API is used for it.
> After this fix, sparse no longer generates the above warning.
>
> Fixes: c5cff8561d2d ("ipv6: add rcu grace period before freeing fib6_node")
> Signed-off-by: Wei Wang <weiwan@google.com>
> Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>

> ---
>  include/net/ip6_fib.h |  2 +-
>  net/ipv6/addrconf.c   |  2 +-
>  net/ipv6/ip6_fib.c    | 11 +++++++----
>  net/ipv6/route.c      |  3 ++-
>  4 files changed, 11 insertions(+), 7 deletions(-)
>
> diff --git a/include/net/ip6_fib.h b/include/net/ip6_fib.h
> index e9c59db92942..af509f801084 100644
> --- a/include/net/ip6_fib.h
> +++ b/include/net/ip6_fib.h
> @@ -105,7 +105,7 @@ struct rt6_info {
>  	 * the same cache line.
>  	 */
>  	struct fib6_table		*rt6i_table;
> -	struct fib6_node		*rt6i_node;
> +	struct fib6_node __rcu		*rt6i_node;
>
>  	struct in6_addr			rt6i_gateway;
>
> diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
> index 3c46e9513a31..936e9ab4dda5 100644
> --- a/net/ipv6/addrconf.c
> +++ b/net/ipv6/addrconf.c
> @@ -5556,7 +5556,7 @@ static void __ipv6_ifa_notify(int event, struct inet6_ifaddr *ifp)
>  		 * our DAD process, so we don't need
>  		 * to do it again
>  		 */
> -		if (!(ifp->rt->rt6i_node))
> +		if (!rcu_access_pointer(ifp->rt->rt6i_node))
>  			ip6_ins_rt(ifp->rt);
>  		if (ifp->idev->cnf.forwarding)
>  			addrconf_join_anycast(ifp);
> diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
> index a5ebf86f6be8..10b4b1f8b838 100644
> --- a/net/ipv6/ip6_fib.c
> +++ b/net/ipv6/ip6_fib.c
> @@ -889,7 +889,7 @@ static int fib6_add_rt2node(struct fib6_node *fn, struct rt6_info *rt,
>
>  		rt->dst.rt6_next = iter;
>  		*ins = rt;
> -		rt->rt6i_node = fn;
> +		rcu_assign_pointer(rt->rt6i_node, fn);
>  		atomic_inc(&rt->rt6i_ref);
>  		if (!info->skip_notify)
>  			inet6_rt_notify(RTM_NEWROUTE, rt, info, nlflags);
> @@ -915,7 +915,7 @@ static int fib6_add_rt2node(struct fib6_node *fn, struct rt6_info *rt,
>  			return err;
>
>  		*ins = rt;
> -		rt->rt6i_node = fn;
> +		rcu_assign_pointer(rt->rt6i_node, fn);
>  		rt->dst.rt6_next = iter->dst.rt6_next;
>  		atomic_inc(&rt->rt6i_ref);
>  		if (!info->skip_notify)
> @@ -1480,8 +1480,9 @@ static void fib6_del_route(struct fib6_node *fn, struct rt6_info **rtp,
>
>  int fib6_del(struct rt6_info *rt, struct nl_info *info)
>  {
> +	struct fib6_node *fn = rcu_dereference_protected(rt->rt6i_node,
> +				    lockdep_is_held(&rt->rt6i_table->tb6_lock));
>  	struct net *net = info->nl_net;
> -	struct fib6_node *fn = rt->rt6i_node;
>  	struct rt6_info **rtp;
>
>  #if RT6_DEBUG >= 2
> @@ -1670,7 +1671,9 @@ static int fib6_clean_node(struct fib6_walker *w)
>  			if (res) {
>  #if RT6_DEBUG >= 2
>  				pr_debug("%s: del failed: rt=%p@%p err=%d\n",
> -					 __func__, rt, rt->rt6i_node, res);
> +					 __func__, rt,
> +					 rcu_access_pointer(rt->rt6i_node),
> +					 res);
>  #endif
>  				continue;
>  			}
> diff --git a/net/ipv6/route.c b/net/ipv6/route.c
> index a9d3564caf49..33629f2a0f9d 100644
> --- a/net/ipv6/route.c
> +++ b/net/ipv6/route.c
> @@ -1383,7 +1383,8 @@ static void rt6_do_update_pmtu(struct rt6_info *rt, u32 mtu)
>  static bool rt6_cache_allowed_for_pmtu(const struct rt6_info *rt)
>  {
>  	return !(rt->rt6i_flags & RTF_CACHE) &&
> -		(rt->rt6i_flags & RTF_PCPU || rt->rt6i_node);
> +		(rt->rt6i_flags & RTF_PCPU ||
> +		 rcu_access_pointer(rt->rt6i_node));
>  }
>
>  static void __ip6_rt_update_pmtu(struct dst_entry *dst, const struct sock *sk,
> --
> 2.14.1.342.g6490525c54-goog
>


* Re: [PATCH net] ipv6: Fix may be used uninitialized warning in rt6_check
From: David Miller @ 2017-08-26  0:08 UTC (permalink / raw)
  To: steffen.klassert; +Cc: weiwan, edumazet, kafai, netdev
In-Reply-To: <20170825070542.GV31224@secunet.com>

From: Steffen Klassert <steffen.klassert@secunet.com>
Date: Fri, 25 Aug 2017 09:05:42 +0200

> rt_cookie might be used uninitialized, fix this by
> initializing it.
> 
> Fixes: c5cff8561d2d ("ipv6: add rcu grace period before freeing fib6_node")
> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>

Applied, thanks.


* Re: [PATCH net-next v2 0/5] net: updates for IPv6 Segment Routing
From: David Miller @ 2017-08-26  0:10 UTC (permalink / raw)
  To: david.lebrun; +Cc: netdev
In-Reply-To: <20170825075648.5061-1-david.lebrun@uclouvain.be>

From: David Lebrun <david.lebrun@uclouvain.be>
Date: Fri, 25 Aug 2017 09:56:43 +0200

> v2: seg6_lwt_headroom() is not relevant for lwtunnel_input_redirect()
>     use cases, and L2ENCAP only uses this redirection. Fix incoherence
>     between arbitrary MAC header size support and fixed headroom
>     computation by setting only LWTUNNEL_STATE_INPUT_REDIRECT for L2ENCAP
>     mode.
> 
> This patch series provides several updates for the SRv6 implementation. The
> first patch leverages the existing infrastructure to support encapsulation
> of IPv4 packets. The second patch implements the T.Encaps.L2 SR function,
> enabling to encapsulate an L2 Ethernet frame within an IPv6+SRH packet.
> The last three patches update the seg6local lightweight tunnel, and mainly
> implement four new actions: End.T, End.DX2, End.DX4 and End.DT6.

Series applied, thanks.


* Re: [PATCH] hinic: skb_pad() frees on error
From: David Miller @ 2017-08-26  0:14 UTC (permalink / raw)
  To: dan.carpenter; +Cc: aviad.krawczyk, netdev, kernel-janitors
In-Reply-To: <20170825082428.hpnbs4i74bubm4cz@mwanda>

From: Dan Carpenter <dan.carpenter@oracle.com>
Date: Fri, 25 Aug 2017 11:24:28 +0300

> The skb_pad() function frees the skb on error, so this code has a double
> free.
> 
> Fixes: 00e57a6d4ad3 ("net-next/hinic: Add Tx operation")
> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>

Applied, thanks.
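
The ownership rule behind this fix can be sketched in userspace C. This is an
illustration under stated assumptions, not the hinic code: struct buf,
buf_new(), buf_pad() and xmit() are hypothetical names, and buf_pad() merely
mimics the skb_pad() convention of freeing the buffer itself on failure.

```c
#include <stdlib.h>

struct buf {
	size_t len;
	size_t cap;
};

static int frees;	/* counts releases, for illustration only */

static struct buf *buf_new(size_t len, size_t cap)
{
	struct buf *b = malloc(sizeof(*b));

	if (b) {
		b->len = len;
		b->cap = cap;
	}
	return b;
}

static void buf_free(struct buf *b)
{
	frees++;
	free(b);
}

/* Hypothetical analogue of skb_pad(): on failure the buffer is freed
 * here, so the caller must not free it again.
 */
static int buf_pad(struct buf *b, size_t pad)
{
	if (b->len + pad > b->cap) {
		buf_free(b);
		return -1;
	}
	b->len += pad;
	return 0;
}

/* Correct caller: on a padding error just return, b is already gone. */
static int xmit(struct buf *b, size_t min_len)
{
	if (b->len < min_len && buf_pad(b, min_len - b->len))
		return -1;	/* calling buf_free(b) here would be a double free */

	buf_free(b);		/* normal completion */
	return 0;
}
```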


* Re: [PATCH net] tcp: fix refcnt leak with ebpf congestion control
From: David Miller @ 2017-08-26  0:16 UTC (permalink / raw)
  To: sd; +Cc: netdev, brakmo, daniel
In-Reply-To: <d69d1981d2c74805affc4ed6378447ea4cb06c67.1503659184.git.sd@queasysnail.net>

From: Sabrina Dubroca <sd@queasysnail.net>
Date: Fri, 25 Aug 2017 13:10:12 +0200

> There are a few bugs around refcnt handling in the new BPF congestion
> control setsockopt:
> 
>  - The new ca is assigned to icsk->icsk_ca_ops even in the case where we
>    cannot get a reference on it. This would lead to a use after free,
>    since that ca is going away soon.
> 
>  - Changing the congestion control case doesn't release the refcnt on
>    the previous ca.
> 
>  - In the reinit case, we first leak a reference on the old ca, then we
>    call tcp_reinit_congestion_control on the ca that we have just
>    assigned, leading to deinitializing the wrong ca (->release of the
>    new ca on the old ca's data) and releasing the refcount on the ca
>    that we actually want to use.
> 
> This is visible by building (for example) BIC as a module and setting
> net.ipv4.tcp_congestion_control=bic, and using tcp_cong_kern.c from
> samples/bpf.
> 
> This patch fixes the refcount issues, and moves reinit back into tcp
> core to avoid passing a ca pointer back to BPF.
> 
> Fixes: 91b5b21c7c16 ("bpf: Add support for changing congestion control")
> Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>

Applied, thank you.
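
The first two bullets of the commit message describe a general
reference-swap pattern, which might be sketched as follows. This is a
simplified, single-threaded illustration: struct ca_ops, ca_try_get(),
ca_put() and set_ca() are hypothetical names, not the TCP stack's API, and
the reinit path is left out.

```c
#include <stdbool.h>
#include <stddef.h>

struct ca_ops {
	int refcnt;
	bool going_away;	/* e.g. the owning module is being unloaded */
};

struct sock_ca {
	struct ca_ops *ops;
};

/* Take a reference unless the object is on its way out. */
static bool ca_try_get(struct ca_ops *ca)
{
	if (ca->going_away)
		return false;
	ca->refcnt++;
	return true;
}

static void ca_put(struct ca_ops *ca)
{
	ca->refcnt--;
}

static int set_ca(struct sock_ca *sk, struct ca_ops *new_ca)
{
	struct ca_ops *old = sk->ops;

	if (new_ca == old)
		return 0;
	if (!ca_try_get(new_ca))
		return -1;	/* keep the old ops, never install an unreferenced ca */

	sk->ops = new_ca;
	if (old)
		ca_put(old);	/* release the reference held on the previous ca */
	return 0;
}
```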


* Re: [Patch net-next v2 0/4] net_sched: clean up tc classes and u32 filter
From: David Miller @ 2017-08-26  0:20 UTC (permalink / raw)
  To: xiyou.wangcong; +Cc: netdev, jhs
In-Reply-To: <20170824235130.28503-1-xiyou.wangcong@gmail.com>

From: Cong Wang <xiyou.wangcong@gmail.com>
Date: Thu, 24 Aug 2017 16:51:26 -0700

> Patch 1 and patch 2 prepare for patch 3. Major changes
> are in patch 3 and patch 4, details are there too.
> 
> Cong Wang (4):
>   net_sched: get rid of more forward declarations
>   net_sched: introduce tclass_del_notify()
>   net_sched: remove tc class reference counting
>   net_sched: kill u32_node pointer in Qdisc
> 
> ---
> v2: Add patch 1 and 2, group all into a patchset
>     Fix a coding style issue in patch 4

Series applied, thanks.


* Re: [PATCH net-next] tcp: fix hang in tcp_sendpage_locked()
From: David Miller @ 2017-08-26  0:22 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev, tom, dvyukov
In-Reply-To: <1503667625.18816.9.camel@edumazet-glaptop3.roam.corp.google.com>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Fri, 25 Aug 2017 06:27:05 -0700

> From: Eric Dumazet <edumazet@google.com>
> 
> syzkaller got a hang in tcp stack, related to a bug in
> tcp_sendpage_locked()
> 
> root@syzkaller:~# cat /proc/3059/stack
> [<ffffffff83de926c>] __lock_sock+0x1dc/0x2f0
> [<ffffffff83de9473>] lock_sock_nested+0xf3/0x110
> [<ffffffff8408ce01>] tcp_sendmsg+0x21/0x50
> [<ffffffff84163b6f>] inet_sendmsg+0x11f/0x5e0
> [<ffffffff83dd8eea>] sock_sendmsg+0xca/0x110
> [<ffffffff83dd9547>] kernel_sendmsg+0x47/0x60
> [<ffffffff83de35dc>] sock_no_sendpage+0x1cc/0x280
> [<ffffffff8408916b>] tcp_sendpage_locked+0x10b/0x160
> [<ffffffff84089203>] tcp_sendpage+0x43/0x60
> [<ffffffff841641da>] inet_sendpage+0x1aa/0x660
> [<ffffffff83dd4fcd>] kernel_sendpage+0x8d/0xe0
> [<ffffffff83dd50ac>] sock_sendpage+0x8c/0xc0
> [<ffffffff81b63300>] pipe_to_sendpage+0x290/0x3b0
> [<ffffffff81b67243>] __splice_from_pipe+0x343/0x750
> [<ffffffff81b6a459>] splice_from_pipe+0x1e9/0x330
> [<ffffffff81b6a5e0>] generic_splice_sendpage+0x40/0x50
> [<ffffffff81b6b1d7>] SyS_splice+0x7b7/0x1610
> [<ffffffff84d77a01>] entry_SYSCALL_64_fastpath+0x1f/0xbe
> 
> Fixes: 306b13eb3cf9 ("proto_ops: Add locked held versions of sendmsg and sendpage")
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Reported-by: Dmitry Vyukov <dvyukov@google.com>

Applied, thanks Eric.


* Re: Permissions for eBPF objects
From: Alexei Starovoitov @ 2017-08-26  1:03 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Chenbo Feng, Jeffrey Vander Stoep, Stephen Smalley, netdev,
	SELinux, mic
In-Reply-To: <59A0837F.9090301@iogearbox.net>

On Fri, Aug 25, 2017 at 10:07:27PM +0200, Daniel Borkmann wrote:
> On 08/25/2017 09:52 PM, Chenbo Feng wrote:
> > On Fri, Aug 25, 2017 at 12:45 PM, Jeffrey Vander Stoep <jeffv@google.com> wrote:
> > > On Fri, Aug 25, 2017 at 12:26 PM, Stephen Smalley <sds@tycho.nsa.gov> wrote:
> > > > On Fri, 2017-08-25 at 11:01 -0700, Jeffrey Vander Stoep via Selinux
> > > > wrote:
> > > > > I’d like to get your thoughts on adding LSM permission checks on BPF
> > > > > objects.

Before reinventing the wheel, please take a look at the landlock work.
Everything that was discussed in this thread is covered by it.
The patches have been in development for more than a year and most of the early
issues have been resolved.
It will be presented again during the security summit in LA in September.


* Re: [PATCH net-next] virtio-net: invoke zerocopy callback on xmit path if no tx napi
From: Willem de Bruijn @ 2017-08-26  1:03 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Koichiro Den, Jason Wang, virtualization, Network Development
In-Reply-To: <20170826022744-mutt-send-email-mst@kernel.org>

On Fri, Aug 25, 2017 at 7:32 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Fri, Aug 25, 2017 at 06:44:36PM -0400, Willem de Bruijn wrote:
>> >> >> > We don't enable network watchdog on virtio but we could and maybe
>> >> >> > should.
>> >> >>
>> >> >> Can you elaborate?
>> >> >
>> >> > The issue is that holding onto buffers for very long times makes guests
>> >> > think they are stuck. This is fundamentally because from guest point of
>> >> > view this is a NIC, so it is supposed to transmit things out in
>> >> > a timely manner. If host backs the virtual NIC by something that is not
>> >> > a NIC, with traffic shaping etc introducing unbounded latencies,
>> >> > guest will be confused.
>> >>
>> >> That assumes that guests are fragile in this regard. A linux guest
>> >> does not make such assumptions.
>> >
>> > Yes it does. Examples above:
>> >         > > - a single slow flow can occupy the whole ring, you will not
>> >         > >   be able to make any new buffers available for the fast flow
>>
>> Oh, right. Though those are due to vring_desc pool exhaustion
>> rather than an upper bound on latency of any single packet.
>>
>> Limiting the number of zerocopy packets in flight to some fraction
>> of the ring ensures that fast flows can always grab a slot.
>> Running
>> out of ubuf_info slots reverts to copy, so indirectly does this. But
>> I read it correclty the zerocopy pool may be equal to or larger than
>> the descriptor pool. Should we refine the zcopy_used test
>>
>>     (nvq->upend_idx + 1) % UIO_MAXIOV != nvq->done_idx
>>
>> to also return false if the number of outstanding ubuf_info is greater
>> than, say, vq->num >> 1?
>
>
> We'll need to think about where to put the threshold, but I think it's
> a good idea.
>
> Maybe even a fixed number, e.g. max(vq->num >> 1, X) to limit host
> resources.
>
> In a sense it still means once you run out of slots zcopy gets disabled possibly permanently.
>
> Need to experiment with some numbers.

I can take a stab with two flows, one delayed in a deep host qdisc
queue. See how this change affects the other flow and also how
sensitive that is to the chosen threshold value.
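
A rough sketch of the refined test under discussion, in userspace C. It
assumes every ubuf_info entry between done_idx and upend_idx counts as
outstanding; zcopy_outstanding() and zcopy_allowed() are hypothetical helper
names, and the vq->num >> 1 threshold is only the example value floated
above, still to be tuned by experiment.

```c
#include <stdbool.h>

#define UIO_MAXIOV 1024	/* value from the Linux UAPI headers */

/* Entries issued (upend_idx) but not yet reaped (done_idx), modulo the ring. */
static unsigned int zcopy_outstanding(unsigned int upend_idx,
				      unsigned int done_idx)
{
	return (upend_idx + UIO_MAXIOV - done_idx) % UIO_MAXIOV;
}

static bool zcopy_allowed(unsigned int upend_idx, unsigned int done_idx,
			  unsigned int vq_num)
{
	/* existing test: the ubuf_info ring must not be full */
	if ((upend_idx + 1) % UIO_MAXIOV == done_idx)
		return false;

	/* proposed extra test: cap in-flight zerocopy at half the vring */
	return zcopy_outstanding(upend_idx, done_idx) <= (vq_num >> 1);
}
```

With a 256-entry vring this falls back to copy once more than 128 zerocopy
buffers are pending, which leaves descriptors free for other flows.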


* Re: [PATCH net-next] bpf: fix oops on allocation failure
From: Alexei Starovoitov @ 2017-08-26  1:05 UTC (permalink / raw)
  To: Dan Carpenter
  Cc: Alexei Starovoitov, John Fastabend, Daniel Borkmann, netdev,
	kernel-janitors
In-Reply-To: <20170825202714.64ivixeindjph3z6@mwanda>

On Fri, Aug 25, 2017 at 11:27:14PM +0300, Dan Carpenter wrote:
> "err" is set to zero if bpf_map_area_alloc() fails so it means we return
> ERR_PTR(0) which is NULL.  The caller, find_and_alloc_map(), is not
> expecting NULL returns and will oops.
> 
> Fixes: 174a79ff9515 ("bpf: sockmap with sk redirect support")
> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>

good catch. Thanks!
Acked-by: Alexei Starovoitov <ast@kernel.org>
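
Why ERR_PTR(0) slips past the caller's error check can be shown with a
userspace re-creation of the kernel's error-pointer helpers. The helpers
below mirror the kernel definitions in spirit; alloc_map() is a hypothetical
stand-in for the allocation path, not the sockmap code.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_ERRNO 4095

/* Userspace re-creation of the kernel's error-pointer helpers. */
static inline void *ERR_PTR(long error)
{
	return (void *)error;
}

static inline long PTR_ERR(const void *ptr)
{
	return (long)ptr;
}

static inline bool IS_ERR(const void *ptr)
{
	return (unsigned long)ptr >= (unsigned long)-MAX_ERRNO;
}

/* Hypothetical allocator: when set_err is false it reproduces the bug of
 * returning ERR_PTR(err) with err still zero.
 */
static void *alloc_map(bool alloc_fails, bool set_err)
{
	static int map;
	int err = 0;

	if (alloc_fails) {
		if (set_err)
			err = -ENOMEM;
		return ERR_PTR(err);	/* ERR_PTR(0) is just NULL */
	}
	return &map;
}
```

A caller that only tests IS_ERR() treats the NULL return as success and
dereferences it.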



* Re: [PATCH net-next v7 01/10] selftest: Enhance kselftest_harness.h with a step mechanism
From: Alexei Starovoitov @ 2017-08-26  1:07 UTC (permalink / raw)
  To: Mickaël Salaün
  Cc: Shuah Khan, linux-kernel, Alexei Starovoitov, Andy Lutomirski,
	Arnaldo Carvalho de Melo, Casey Schaufler, Daniel Borkmann,
	David Drysdale, David S . Miller, Eric W . Biederman,
	James Morris, Jann Horn, Jonathan Corbet, Matthew Garrett,
	Michael Kerrisk, Kees Cook, Paul Moore, Sargun Dhillon,
	Serge E . Hallyn, Tejun Heo, Thomas Graf <tgr
In-Reply-To: <0e15da13-fad0-ba01-053c-1b4853e2bd6f@digikod.net>

On Fri, Aug 25, 2017 at 09:58:33AM +0200, Mickaël Salaün wrote:
> 
> 
> On 24/08/2017 04:31, Alexei Starovoitov wrote:
> > On Mon, Aug 21, 2017 at 02:09:24AM +0200, Mickaël Salaün wrote:
> >> This step mechanism may be useful to return information about the
> >> error without being able to write to TH_LOG_STREAM.
> >>
> >> Set _metadata->no_print to true to print this counter.
> >>
> >> Signed-off-by: Mickaël Salaün <mic@digikod.net>
> >> Cc: Andy Lutomirski <luto@amacapital.net>
> >> Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
> >> Cc: Kees Cook <keescook@chromium.org>
> >> Cc: Shuah Khan <shuah@kernel.org>
> >> Cc: Will Drewry <wad@chromium.org>
> >> Link: https://lkml.kernel.org/r/CAGXu5j+D-FP8Kt9unNOqKrQJP4DYTpmgkJxWykZyrYiVPz3Y3Q@mail.gmail.com
> >> ---
> >>
> >> This patch is intended for the kselftest tree:
> >> https://lkml.kernel.org/r/20170806232337.4191-1-mic@digikod.net
> >>
> >> Changes since v6:
> >> * add the step counter in assert/expect macros and use _metadata to
> >>   enable the counter (suggested by Kees Cook)
> >> ---
> >>  tools/testing/selftests/kselftest_harness.h   | 31 ++++++++++++++++++++++-----
> >>  tools/testing/selftests/seccomp/seccomp_bpf.c |  2 +-
> >>  2 files changed, 27 insertions(+), 6 deletions(-)
> > 
> > is there a dependency on this in patches 2+ ?
> > if not, I would send this patch to selftests right away.
> > 
> > 
> 
> The Landlock tests [patch 9/10] rely on it for now.
> 
> I sent it three weeks ago:
> https://lkml.kernel.org/r/20170806232337.4191-1-mic@digikod.net
> 
> Anyway, until this patch is merged in the kselftest tree and then
> available to net-next, I'll have to keep it here.

Shuah,
could you please pick up this patch into your tree?


* [PATCH net-next v3] e1000e: Be drop monitor friendly
From: Florian Fainelli @ 2017-08-26  1:14 UTC (permalink / raw)
  To: netdev
  Cc: edumazet, davem, Florian Fainelli, Jeff Kirsher,
	moderated list:INTEL ETHERNET DRIVERS, open list

e1000_put_txbuf() can be called from the normal reclamation path as well as
when a DMA mapping fails, so we need to differentiate these two cases
when freeing SKBs to be drop monitor friendly. e1000e_tx_hwtstamp_work()
and e1000_remove() process TX timestamped SKBs, and those should
not be accounted as drops either.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
---
Changes in v3:

- differentiate normal reclamation from TX DMA fragment mapping errors
- removed a few invalid dev_kfree_skb() replacements (those are already
  drop monitor friendly)

Changes in v2:

- make it compile

 drivers/net/ethernet/intel/e1000e/netdev.c | 18 +++++++++++-------
 1 file changed, 11 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c
index 327dfe5bedc0..cfd21858c095 100644
--- a/drivers/net/ethernet/intel/e1000e/netdev.c
+++ b/drivers/net/ethernet/intel/e1000e/netdev.c
@@ -1071,7 +1071,8 @@ static bool e1000_clean_rx_irq(struct e1000_ring *rx_ring, int *work_done,
 }
 
 static void e1000_put_txbuf(struct e1000_ring *tx_ring,
-			    struct e1000_buffer *buffer_info)
+			    struct e1000_buffer *buffer_info,
+			    bool drop)
 {
 	struct e1000_adapter *adapter = tx_ring->adapter;
 
@@ -1085,7 +1086,10 @@ static void e1000_put_txbuf(struct e1000_ring *tx_ring,
 		buffer_info->dma = 0;
 	}
 	if (buffer_info->skb) {
-		dev_kfree_skb_any(buffer_info->skb);
+		if (drop)
+			dev_kfree_skb_any(buffer_info->skb);
+		else
+			dev_consume_skb_any(buffer_info->skb);
 		buffer_info->skb = NULL;
 	}
 	buffer_info->time_stamp = 0;
@@ -1199,7 +1203,7 @@ static void e1000e_tx_hwtstamp_work(struct work_struct *work)
 		wmb(); /* force write prior to skb_tstamp_tx */
 
 		skb_tstamp_tx(skb, &shhwtstamps);
-		dev_kfree_skb_any(skb);
+		dev_consume_skb_any(skb);
 	} else if (time_after(jiffies, adapter->tx_hwtstamp_start
 			      + adapter->tx_timeout_factor * HZ)) {
 		dev_kfree_skb_any(adapter->tx_hwtstamp_skb);
@@ -1254,7 +1258,7 @@ static bool e1000_clean_tx_irq(struct e1000_ring *tx_ring)
 				}
 			}
 
-			e1000_put_txbuf(tx_ring, buffer_info);
+			e1000_put_txbuf(tx_ring, buffer_info, false);
 			tx_desc->upper.data = 0;
 
 			i++;
@@ -2421,7 +2425,7 @@ static void e1000_clean_tx_ring(struct e1000_ring *tx_ring)
 
 	for (i = 0; i < tx_ring->count; i++) {
 		buffer_info = &tx_ring->buffer_info[i];
-		e1000_put_txbuf(tx_ring, buffer_info);
+		e1000_put_txbuf(tx_ring, buffer_info, false);
 	}
 
 	netdev_reset_queue(adapter->netdev);
@@ -5614,7 +5618,7 @@ static int e1000_tx_map(struct e1000_ring *tx_ring, struct sk_buff *skb,
 			i += tx_ring->count;
 		i--;
 		buffer_info = &tx_ring->buffer_info[i];
-		e1000_put_txbuf(tx_ring, buffer_info);
+		e1000_put_txbuf(tx_ring, buffer_info, true);
 	}
 
 	return 0;
@@ -7411,7 +7415,7 @@ static void e1000_remove(struct pci_dev *pdev)
 	if (adapter->flags & FLAG_HAS_HW_TIMESTAMP) {
 		cancel_work_sync(&adapter->tx_hwtstamp_work);
 		if (adapter->tx_hwtstamp_skb) {
-			dev_kfree_skb_any(adapter->tx_hwtstamp_skb);
+			dev_consume_skb_any(adapter->tx_hwtstamp_skb);
 			adapter->tx_hwtstamp_skb = NULL;
 		}
 	}
-- 
2.9.3
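
The distinction the patch draws can be illustrated with a small userspace
sketch. All names here (struct pkt, pkt_free_drop(), pkt_free_consume(),
put_txbuf()) are hypothetical; the two counters stand in for what a drop
monitor would and would not report, in the same way that the kfree-style
helpers signal a drop while the consume-style helpers signal normal
completion.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

struct pkt {
	int len;
};

static unsigned long dropped;	/* what a drop monitor would report */
static unsigned long consumed;	/* normal completions, not reported */

static void pkt_free_drop(struct pkt *p)
{
	dropped++;
	free(p);
}

static void pkt_free_consume(struct pkt *p)
{
	consumed++;
	free(p);
}

/* Release the packet held in a TX slot; the caller says whether this is
 * an error path (drop) or a regular reclaim (consume).
 */
static void put_txbuf(struct pkt **slot, bool drop)
{
	if (!*slot)
		return;
	if (drop)
		pkt_free_drop(*slot);
	else
		pkt_free_consume(*slot);
	*slot = NULL;
}
```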


* Re: [PATCH net-next] selftests/bpf: check the instruction dumps are populated
From: Daniel Borkmann @ 2017-08-26  1:16 UTC (permalink / raw)
  To: Jakub Kicinski, netdev; +Cc: kafai, oss-drivers
In-Reply-To: <20170825213957.4768-1-jakub.kicinski@netronome.com>

On 08/25/2017 11:39 PM, Jakub Kicinski wrote:
> Add a basic test for checking whether kernel is populating
> the jited and xlated BPF images.  It was used to confirm
> the behaviour change from commit d777b2ddbecf ("bpf: don't
> zero out the info struct in bpf_obj_get_info_by_fd()"),
> which made bpf_obj_get_info_by_fd() usable for retrieving
> the image dumps.
>
> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
[...]
> @@ -328,15 +331,20 @@ static void test_bpf_obj_id(void)
>   			  prog_infos[i].type != BPF_PROG_TYPE_SOCKET_FILTER ||
>   			  info_len != sizeof(struct bpf_prog_info) ||
>   			  (jit_enabled && !prog_infos[i].jited_prog_len) ||
> -			  !prog_infos[i].xlated_prog_len,
> +			  (jit_enabled &&
> +			   !memcmp(jited_insns, zeros, sizeof(zeros))) ||
> +			  !prog_infos[i].xlated_prog_len ||
> +			  !memcmp(xlated_insns, zeros, sizeof(zeros)),

There could still be the case where a JIT could bail out
for some reason and punt to the interpreter instead, but
I'm fine assuming for the specific test cases we have it
has to succeed, and if not JIT misses features or has other
issues. ;) Thus:

Acked-by: Daniel Borkmann <daniel@iogearbox.net>

Thanks!
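
The check added by the patch boils down to comparing a dump buffer against
an all-zero buffer of the same size. A minimal sketch, assuming a fixed
buffer size; DUMP_LEN and dump_is_populated() are hypothetical names, not
taken from the selftest:

```c
#include <stdbool.h>
#include <string.h>

#define DUMP_LEN 128	/* hypothetical size of the instruction dump buffer */

/* True if at least one byte of the dump differs from zero. */
static bool dump_is_populated(const unsigned char *insns)
{
	static const unsigned char zeros[DUMP_LEN];	/* zero-initialized */

	return memcmp(insns, zeros, sizeof(zeros)) != 0;
}
```

As noted above, a JIT that punts to the interpreter would leave the jited
buffer empty, so a check like this is only meaningful when JIT compilation
is expected to succeed.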


* Re: [PATCH net-next v7 05/10] landlock: Add LSM hooks related to filesystem
From: Alexei Starovoitov @ 2017-08-26  1:16 UTC (permalink / raw)
  To: Mickaël Salaün
  Cc: linux-kernel, Alexei Starovoitov, Andy Lutomirski,
	Arnaldo Carvalho de Melo, Casey Schaufler, Daniel Borkmann,
	David Drysdale, David S . Miller, Eric W . Biederman,
	James Morris, Jann Horn, Jonathan Corbet, Matthew Garrett,
	Michael Kerrisk, Kees Cook, Paul Moore, Sargun Dhillon,
	Serge E . Hallyn, Shuah Khan, Tejun Heo, Thomas Graf <tgr
In-Reply-To: <22d09137-7212-5803-af64-0964fad875c7@digikod.net>

On Fri, Aug 25, 2017 at 10:16:39AM +0200, Mickaël Salaün wrote:
> > 
> >> +/* WRAP_ARG_SB */
> >> +#define WRAP_ARG_SB_TYPE	WRAP_TYPE_FS
> >> +#define WRAP_ARG_SB_DEC(arg)					\
> >> +	EXPAND_C(WRAP_TYPE_FS) wrap_##arg =			\
> >> +	{ .type = BPF_HANDLE_FS_TYPE_DENTRY, .dentry = arg->s_root };
> >> +#define WRAP_ARG_SB_VAL(arg)	((uintptr_t)&wrap_##arg)
> >> +#define WRAP_ARG_SB_OK(arg)	(arg && arg->s_root)
> > ...
> > 
> >> +HOOK_NEW_FS(sb_remount, 2,
> >> +	struct super_block *, sb,
> >> +	void *, data,
> >> +	WRAP_ARG_SB, sb,
> >> +	WRAP_ARG_RAW, LANDLOCK_ACTION_FS_WRITE
> >> +);
> > 
> > this looks wrong. casting super_block to dentry?
> 
> This is called when remounting a block device. The WRAP_ARG_SB take the
> sb->s_root as a dentry, it is not a cast. What do you expect from this hook?

got it. I missed -> part. Now it makes sense.

> > 
> >> +/* a directory inode contains only one dentry */
> >> +HOOK_NEW_FS(inode_create, 3,
> >> +	struct inode *, dir,
> >> +	struct dentry *, dentry,
> >> +	umode_t, mode,
> >> +	WRAP_ARG_INODE, dir,
> >> +	WRAP_ARG_RAW, LANDLOCK_ACTION_FS_WRITE
> >> +);
> > 
> > more general question: why you're not wrapping all useful
> > arguments? Like in the above dentry can be acted upon
> > by the landlock rule and it's readily available...
> 
> The context used for the FS event must have the exact same types for all
> calls. This event is meant to be generic but we can add more specific
> ones if needed, like I do with FS_IOCTL.

I see. So all FS events will have dentry as first argument
regardless of how it is in LSM hook ?
I guess that will simplify the rules indeed.
I suspect you're doing it to simplify the LSM->landlock shim layer as well, right?

> The idea is to enable people to write simple rules, while being able to
> write fine grain rules for special cases (e.g. IOCTL) if needed.
> 
> > 
> > The limitation of only 2 args looks odd.
> > Is it a hard limitation ? how hard to extend?
> 
> It's not a hard limit at all. Actually, the FS_FCNTL event should have
> three arguments (I'll add them in the next series): FS handle, FCNTL
> command and FCNTL argument. I made sure that it's really easy to add
> more arguments to the context of an event.

I'm asking because I'm not completely convinced that adding another
argument to an existing event will be backwards compatible.
It looks like you're expecting only two args for all FS events, right?
How can you add a 3rd argument? All FS events would have to get it,
but in some LSM hooks such an argument will be meaningless, whereas
in other places it will carry useful info that a rule can operate on.
Would that mean that we'll have an FS_3 event type and only a few LSM
hooks will be converted to it? That works, but then we'll lose
compatibility with old rules written for the FS event and that given hook.
Otherwise we'd need to have fancy logic to accept an old FS event
into an FS_3 LSM hook.




* Re: UDP sockets oddities
From: Florian Fainelli @ 2017-08-26  1:17 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, edumazet, pabeni, willemb, davem
In-Reply-To: <1503705440.11498.9.camel@edumazet-glaptop3.roam.corp.google.com>

On 08/25/2017 04:57 PM, Eric Dumazet wrote:
> On Fri, 2017-08-25 at 16:18 -0700, Florian Fainelli wrote:
> 
>> Eric, are there areas of the stack where we are allowed to drop packets,
>> not propagate that back to write(2) and also not increment any counter
>> either, or maybe I am not looking where I should...
> 
> What happens if you increase these sysctls ?

I don't see packet loss after I tweak these two sysctls according to
your suggestions.

Tweaking eth0's sysctls did not change anything, but tweaking gphy's
sysctl resolved the loss. This was a little surprising considering that
gphy is an IFF_NO_QUEUE interface and eth0 is the conduit interface that
does the real transmission.

Does that make sense with respect to what I reported earlier? Should I
try to dump the neigh stats?

Thanks!

> 
> grep .  `find /proc/sys|grep unres_qlen`
> 
> 
> unres_qlen_bytes -> 2000000
> unres_qlen -> 10000
> 
> 

-- 
Florian


* Re: UDP sockets oddities
From: Eric Dumazet @ 2017-08-26  1:52 UTC (permalink / raw)
  To: Florian Fainelli; +Cc: netdev, edumazet, pabeni, willemb, davem
In-Reply-To: <ce4af8cc-3de1-a777-967c-a57103994e1d@gmail.com>

On Fri, 2017-08-25 at 18:17 -0700, Florian Fainelli wrote:
> On 08/25/2017 04:57 PM, Eric Dumazet wrote:
> > On Fri, 2017-08-25 at 16:18 -0700, Florian Fainelli wrote:
> > 
> >> Eric, are there areas of the stack where we are allowed to drop packets,
> >> not propagate that back to write(2) and also not increment any counter
> >> either, or maybe I am not looking where I should...
> > 
> > What happens if you increase these sysctls ?
> 
> I don't see packet loss after I tweak these two sysctls according to
> your suggestions.
> 
> Tweaking eth0's sysctls did not change anything, but tweaking gphy's
> sysctl resolved the loss. This was a little surprising considering that
> gphy is an IFF_NO_QUEUE interface and eth0 is the conduit interface that
> does the real transmission.
> 
> Does that make sense with respect to what I reported earlier? Should I
> try to dump the neigh stats?

Note that if you had TCP traffic, the neighbour would be constantly
confirmed and no losses would happen.

I guess we should add an SNMP counter for packets dropped in neigh queues.


* Re: [PATCH v2 net-next 1/8] bpf: Add support for recursively running cgroup sock filters
From: Daniel Borkmann @ 2017-08-26  2:00 UTC (permalink / raw)
  To: David Ahern, netdev, ast, tj, davem
In-Reply-To: <1503687941-626-2-git-send-email-dsahern@gmail.com>

On 08/25/2017 09:05 PM, David Ahern wrote:
> Add support for recursively applying sock filters attached to a cgroup.
> For now, start with the inner cgroup attached to the socket and work back
> to the root or first cgroup without the recursive flag set. Once the
> recursive flag is set for a cgroup all descendant groups must have the
> flag as well.
>
> Signed-off-by: David Ahern <dsahern@gmail.com>
[...]
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index f71f5e07d82d..595e31b30f23 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -151,6 +151,15 @@ enum bpf_attach_type {
>    */
>   #define BPF_F_ALLOW_OVERRIDE	(1U << 0)
>
> +/* If BPF_F_RECURSIVE flag is used in BPF_PROG_ATTACH command
> + * cgroups are walked recursively back to the root cgroup or the
> + * first cgroup without the flag set running any program attached.
> + * Once the flag is set, it MUST be set for all descendant cgroups.
> + */
> +#define BPF_F_RECURSIVE		(1U << 1)
> +
> +#define BPF_F_ALL_ATTACH_FLAGS  (BPF_F_ALLOW_OVERRIDE | BPF_F_RECURSIVE)
> +
>   /* If BPF_F_STRICT_ALIGNMENT is used in BPF_PROG_LOAD command, the
>    * verifier will perform strict alignment checking as if the kernel
>    * has been built with CONFIG_EFFICIENT_UNALIGNED_ACCESS not set,
> diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
> index 546113430049..eb1f436c18fb 100644
> --- a/kernel/bpf/cgroup.c
> +++ b/kernel/bpf/cgroup.c
> @@ -47,10 +47,16 @@ void cgroup_bpf_inherit(struct cgroup *cgrp, struct cgroup *parent)
>   	unsigned int type;
>
>   	for (type = 0; type < ARRAY_SIZE(cgrp->bpf.effective); type++) {
> -		struct bpf_prog *e;
> +		struct bpf_prog *e = NULL;
> +
> +		/* do not need to set effective program if cgroups are
> +		 * walked recursively
> +		 */
> +		cgrp->bpf.is_recursive[type] = parent->bpf.is_recursive[type];
> +		if (!cgrp->bpf.is_recursive[type])
> +			e = rcu_dereference_protected(parent->bpf.effective[type],
> +						      lockdep_is_held(&cgroup_mutex));

[...]

> -		e = rcu_dereference_protected(parent->bpf.effective[type],
> -					      lockdep_is_held(&cgroup_mutex));
>   		rcu_assign_pointer(cgrp->bpf.effective[type], e);
>   		cgrp->bpf.disallow_override[type] = parent->bpf.disallow_override[type];
>   	}
[...]
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index d5774a6851f1..a1ab5dbaae89 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -1187,7 +1187,7 @@ static int bpf_prog_attach(const union bpf_attr *attr)
>   	if (CHECK_ATTR(BPF_PROG_ATTACH))
>   		return -EINVAL;
>
> -	if (attr->attach_flags & ~BPF_F_ALLOW_OVERRIDE)
> +	if (attr->attach_flags & ~BPF_F_ALL_ATTACH_FLAGS)
>   		return -EINVAL;
>
>   	switch (attr->attach_type) {
> @@ -1222,7 +1222,7 @@ static int bpf_prog_attach(const union bpf_attr *attr)
>   	}
>
>   	ret = cgroup_bpf_update(cgrp, prog, attr->attach_type,
> -				attr->attach_flags & BPF_F_ALLOW_OVERRIDE);
> +				attr->attach_flags);
>   	if (ret)
>   		bpf_prog_put(prog);
>   	cgroup_put(cgrp);
> @@ -1252,7 +1252,7 @@ static int bpf_prog_detach(const union bpf_attr *attr)
>   		if (IS_ERR(cgrp))
>   			return PTR_ERR(cgrp);
>
> -		ret = cgroup_bpf_update(cgrp, NULL, attr->attach_type, false);
> +		ret = cgroup_bpf_update(cgrp, NULL, attr->attach_type, 0);
>   		cgroup_put(cgrp);
>   		break;

Can you elaborate on the semantical changes for the programs
setting the new flag which are not using below cgroup_bpf_run_filter_sk()
helper to walk back to root?

> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index df2e0f14a95d..27a4f14435a3 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -5176,14 +5176,35 @@ void cgroup_sk_free(struct sock_cgroup_data *skcd)
>
>   #ifdef CONFIG_CGROUP_BPF
>   int cgroup_bpf_update(struct cgroup *cgrp, struct bpf_prog *prog,
> -		      enum bpf_attach_type type, bool overridable)
> +		      enum bpf_attach_type type, u32 flags)
>   {
>   	struct cgroup *parent = cgroup_parent(cgrp);
>   	int ret;
>
>   	mutex_lock(&cgroup_mutex);
> -	ret = __cgroup_bpf_update(cgrp, parent, prog, type, overridable);
> +	ret = __cgroup_bpf_update(cgrp, parent, prog, type, flags);
>   	mutex_unlock(&cgroup_mutex);
>   	return ret;
>   }
> +
> +int cgroup_bpf_run_filter_sk(struct sock *sk,
> +			     enum bpf_attach_type type)
> +{
> +	struct cgroup *cgrp = sock_cgroup_ptr(&sk->sk_cgrp_data);
> +	int ret = 0;
> +
> +	while (cgrp) {
> +		ret = __cgroup_bpf_run_filter_sk(cgrp, sk, type);
> +		if (ret)
> +			break;
> +
> +		if (!cgrp->bpf.is_recursive[type])
> +			break;
> +
> +		cgrp = cgroup_parent(cgrp);
> +	}
> +
> +	return ret;
> +}
> +EXPORT_SYMBOL(cgroup_bpf_run_filter_sk);
>   #endif /* CONFIG_CGROUP_BPF */
>
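
For readers following the walk in cgroup_bpf_run_filter_sk() above, a minimal userspace sketch in C of the same control flow may help. Everything here is a hypothetical model: `struct model_cgroup`, the function-pointer "program" and the helper names are not kernel APIs, and the model runs the program attached to each cgroup rather than the effective program the kernel would pick. It only illustrates the stop conditions: a non-zero verdict ends the walk, and so does the first cgroup without the recursive flag (after its own program has run).

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an attached program: 0 allows, <0 denies. */
typedef int (*model_filter_fn)(int sk_id);

struct model_cgroup {
	struct model_cgroup *parent;	/* NULL for the root */
	model_filter_fn prog;		/* NULL if nothing is attached */
	bool is_recursive;		/* models bpf.is_recursive[type] */
};

static int model_allow(int sk_id)
{
	(void)sk_id;
	return 0;
}

static int model_deny_odd(int sk_id)
{
	return (sk_id % 2) ? -EPERM : 0;
}

/* Walk from the socket's cgroup towards the root, as in the patch. */
static int model_run_filter_sk(const struct model_cgroup *cgrp, int sk_id)
{
	int ret = 0;

	while (cgrp) {
		if (cgrp->prog)
			ret = cgrp->prog(sk_id);
		if (ret)
			break;		/* first denial wins */
		if (!cgrp->is_recursive)
			break;		/* flag not set: stop here */
		cgrp = cgrp->parent;
	}
	return ret;
}

static void model_cgroup_demo(void)
{
	struct model_cgroup root = { NULL, model_deny_odd, false };
	struct model_cgroup child = { &root, model_allow, true };

	/* Recursive child: the root's program also runs and denies. */
	assert(model_run_filter_sk(&child, 3) == -EPERM);
	assert(model_run_filter_sk(&child, 4) == 0);

	/* Without the flag only the child's own program runs. */
	child.is_recursive = false;
	assert(model_run_filter_sk(&child, 3) == 0);
}
```

Calling `model_cgroup_demo()` from a `main()` runs the three cases. The model makes the question above easier to state: any attach type whose run path does not go through such a walk would see only a single cgroup's program, whatever the flag says.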

^ permalink raw reply

* [PATCH net-next v2 1/2] inet_diag: allow protocols to provide additional data
From: Ivan Delalande @ 2017-08-26  1:53 UTC (permalink / raw)
  To: David Miller; +Cc: Eric Dumazet, netdev, Ivan Delalande

Extend inet_diag_handler to allow individual protocols to report
additional data on INET_DIAG_INFO through idiag_get_aux. The size
can be dynamic and is computed by idiag_get_aux_size.

Signed-off-by: Ivan Delalande <colona@arista.com>
---
 include/linux/inet_diag.h |  7 +++++++
 net/ipv4/inet_diag.c      | 22 ++++++++++++++++++----
 2 files changed, 25 insertions(+), 4 deletions(-)

diff --git a/include/linux/inet_diag.h b/include/linux/inet_diag.h
index 65da430e260f..ee251c585854 100644
--- a/include/linux/inet_diag.h
+++ b/include/linux/inet_diag.h
@@ -25,6 +25,13 @@ struct inet_diag_handler {
 					  struct inet_diag_msg *r,
 					  void *info);
 
+	int		(*idiag_get_aux)(struct sock *sk,
+					 bool net_admin,
+					 struct sk_buff *skb);
+
+	size_t		(*idiag_get_aux_size)(struct sock *sk,
+					      bool net_admin);
+
 	int		(*destroy)(struct sk_buff *in_skb,
 				   const struct inet_diag_req_v2 *req);
 
diff --git a/net/ipv4/inet_diag.c b/net/ipv4/inet_diag.c
index 67325d5832d7..8a88ef373395 100644
--- a/net/ipv4/inet_diag.c
+++ b/net/ipv4/inet_diag.c
@@ -93,8 +93,17 @@ void inet_diag_msg_common_fill(struct inet_diag_msg *r, struct sock *sk)
 }
 EXPORT_SYMBOL_GPL(inet_diag_msg_common_fill);
 
-static size_t inet_sk_attr_size(void)
+static size_t inet_sk_attr_size(struct sock *sk,
+				const struct inet_diag_req_v2 *req,
+				bool net_admin)
 {
+	const struct inet_diag_handler *handler;
+	size_t aux = 0;
+
+	handler = inet_diag_table[req->sdiag_protocol];
+	if (handler && handler->idiag_get_aux_size)
+		aux = handler->idiag_get_aux_size(sk, net_admin);
+
 	return	  nla_total_size(sizeof(struct tcp_info))
 		+ nla_total_size(1) /* INET_DIAG_SHUTDOWN */
 		+ nla_total_size(1) /* INET_DIAG_TOS */
@@ -105,6 +114,7 @@ static size_t inet_sk_attr_size(void)
 		+ nla_total_size(SK_MEMINFO_VARS * sizeof(u32))
 		+ nla_total_size(TCP_CA_NAME_MAX)
 		+ nla_total_size(sizeof(struct tcpvegas_info))
+		+ nla_total_size(aux)
 		+ 64;
 }
 
@@ -260,6 +270,10 @@ int inet_sk_diag_fill(struct sock *sk, struct inet_connection_sock *icsk,
 
 	handler->idiag_get_info(sk, r, info);
 
+	if (ext & (1 << (INET_DIAG_INFO - 1)) && handler->idiag_get_aux)
+		if (handler->idiag_get_aux(sk, net_admin, skb) < 0)
+			goto errout;
+
 	if (sk->sk_state < TCP_TIME_WAIT) {
 		union tcp_cc_info info;
 		size_t sz = 0;
@@ -452,13 +466,14 @@ int inet_diag_dump_one_icsk(struct inet_hashinfo *hashinfo,
 	struct net *net = sock_net(in_skb->sk);
 	struct sk_buff *rep;
 	struct sock *sk;
+	bool net_admin = netlink_net_capable(in_skb, CAP_NET_ADMIN);
 	int err;
 
 	sk = inet_diag_find_one_icsk(net, hashinfo, req);
 	if (IS_ERR(sk))
 		return PTR_ERR(sk);
 
-	rep = nlmsg_new(inet_sk_attr_size(), GFP_KERNEL);
+	rep = nlmsg_new(inet_sk_attr_size(sk, req, net_admin), GFP_KERNEL);
 	if (!rep) {
 		err = -ENOMEM;
 		goto out;
@@ -467,8 +482,7 @@ int inet_diag_dump_one_icsk(struct inet_hashinfo *hashinfo,
 	err = sk_diag_fill(sk, rep, req,
 			   sk_user_ns(NETLINK_CB(in_skb).sk),
 			   NETLINK_CB(in_skb).portid,
-			   nlh->nlmsg_seq, 0, nlh,
-			   netlink_net_capable(in_skb, CAP_NET_ADMIN));
+			   nlh->nlmsg_seq, 0, nlh, net_admin);
 	if (err < 0) {
 		WARN_ON(err == -EMSGSIZE);
 		nlmsg_free(rep);
-- 
2.14.1
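
The sizing side of this patch can be sketched in userspace C as follows. This is a hedged model under stated assumptions, not the kernel code: `model_handler`, `model_table`, `MODEL_BASE_SIZE` and the 32-byte auxiliary size are all hypothetical. It shows two things the patch relies on: the optional callback contributes to the reply size only when a handler provides it, and attribute N is requested through bit N-1 of the ext mask, which is the shape of the INET_DIAG_INFO test added to inet_sk_diag_fill().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_BASE_SIZE	64	/* hypothetical fixed part of the reply */
#define MODEL_NPROTO	4	/* hypothetical number of protocol slots */
#define MODEL_DIAG_INFO	2	/* stands in for INET_DIAG_INFO */

struct model_handler {
	size_t (*get_aux_size)(const void *sk);	/* optional, may be NULL */
};

static const struct model_handler *model_table[MODEL_NPROTO];

/* Base size plus whatever the protocol's optional callback reports. */
static size_t model_attr_size(const void *sk, unsigned int proto)
{
	const struct model_handler *h;
	size_t aux = 0;

	assert(proto < MODEL_NPROTO);

	h = model_table[proto];
	if (h && h->get_aux_size)
		aux = h->get_aux_size(sk);

	return MODEL_BASE_SIZE + aux;
}

/* Attribute N is requested by bit N-1 of the ext mask. */
static bool model_ext_wants(unsigned int ext, unsigned int attr)
{
	assert(attr >= 1 && attr <= 32);
	return (ext & (1u << (attr - 1))) != 0;
}

static size_t model_tcp_aux_size(const void *sk)
{
	(void)sk;
	return 32;	/* arbitrary value for the demo */
}

static void model_sizing_demo(void)
{
	static const struct model_handler tcp = { model_tcp_aux_size };
	static const struct model_handler udp = { NULL };

	model_table[1] = &tcp;
	model_table[2] = &udp;

	assert(model_attr_size(NULL, 1) == MODEL_BASE_SIZE + 32);
	assert(model_attr_size(NULL, 2) == MODEL_BASE_SIZE);	/* no callback */
	assert(model_attr_size(NULL, 3) == MODEL_BASE_SIZE);	/* no handler */

	assert(model_ext_wants(0x2, MODEL_DIAG_INFO));
	assert(!model_ext_wants(0x1, MODEL_DIAG_INFO));
}
```

Calling `model_sizing_demo()` from a `main()` runs the checks. Note that the size is computed before the fill runs, so the two callbacks must agree on what they report for the same socket.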

^ permalink raw reply related

* [PATCH net-next v2 2/2] tcp_diag: report TCP MD5 signing keys and addresses
From: Ivan Delalande @ 2017-08-26  1:53 UTC (permalink / raw)
  To: David Miller; +Cc: Eric Dumazet, netdev, Ivan Delalande
In-Reply-To: <20170826015346.24247-1-colona@arista.com>

Report TCP MD5 (RFC2385) signing keys, addresses and address prefixes to
processes with CAP_NET_ADMIN requesting INET_DIAG_INFO. Currently it is
not possible to retrieve these from the kernel once they have been
configured on sockets.

Signed-off-by: Ivan Delalande <colona@arista.com>
---
 include/uapi/linux/inet_diag.h |   1 +
 net/ipv4/tcp_diag.c            | 112 ++++++++++++++++++++++++++++++++++++++---
 2 files changed, 107 insertions(+), 6 deletions(-)

diff --git a/include/uapi/linux/inet_diag.h b/include/uapi/linux/inet_diag.h
index 678496897a68..f52ff62bfabe 100644
--- a/include/uapi/linux/inet_diag.h
+++ b/include/uapi/linux/inet_diag.h
@@ -143,6 +143,7 @@ enum {
 	INET_DIAG_MARK,
 	INET_DIAG_BBRINFO,
 	INET_DIAG_CLASS_ID,
+	INET_DIAG_MD5SIG,
 	__INET_DIAG_MAX,
 };
 
diff --git a/net/ipv4/tcp_diag.c b/net/ipv4/tcp_diag.c
index a748c74aa8b7..99c54b765921 100644
--- a/net/ipv4/tcp_diag.c
+++ b/net/ipv4/tcp_diag.c
@@ -16,6 +16,7 @@
 
 #include <linux/tcp.h>
 
+#include <net/netlink.h>
 #include <net/tcp.h>
 
 static void tcp_diag_get_info(struct sock *sk, struct inet_diag_msg *r,
@@ -36,6 +37,103 @@ static void tcp_diag_get_info(struct sock *sk, struct inet_diag_msg *r,
 		tcp_get_info(sk, info);
 }
 
+#ifdef CONFIG_TCP_MD5SIG
+static void inet_diag_md5sig_fill(struct tcp_md5sig *info,
+				  const struct tcp_md5sig_key *key)
+{
+	#if IS_ENABLED(CONFIG_IPV6)
+	if (key->family == AF_INET6) {
+		struct sockaddr_in6 *sin6 =
+			(struct sockaddr_in6 *)&info->tcpm_addr;
+
+		memcpy(&sin6->sin6_addr, &key->addr.a6,
+		       sizeof(struct in6_addr));
+	} else
+	#endif
+	{
+		struct sockaddr_in *sin =
+			(struct sockaddr_in *)&info->tcpm_addr;
+
+		memcpy(&sin->sin_addr, &key->addr.a4, sizeof(struct in_addr));
+	}
+
+	info->tcpm_addr.ss_family = key->family;
+	info->tcpm_prefixlen = key->prefixlen;
+	info->tcpm_keylen = key->keylen;
+	memcpy(info->tcpm_key, key->key, key->keylen);
+}
+
+static int inet_diag_put_md5sig(struct sk_buff *skb,
+				const struct tcp_md5sig_info *md5sig)
+{
+	const struct tcp_md5sig_key *key;
+	struct nlattr *attr;
+	struct tcp_md5sig *info;
+	int md5sig_count = 0;
+
+	hlist_for_each_entry_rcu(key, &md5sig->head, node)
+		md5sig_count++;
+
+	attr = nla_reserve(skb, INET_DIAG_MD5SIG,
+			   md5sig_count * sizeof(struct tcp_md5sig));
+	if (!attr)
+		return -EMSGSIZE;
+
+	info = nla_data(attr);
+	hlist_for_each_entry_rcu(key, &md5sig->head, node) {
+		inet_diag_md5sig_fill(info, key);
+		info++;
+	}
+
+	return 0;
+}
+#endif
+
+static int tcp_diag_get_aux(struct sock *sk, bool net_admin,
+			    struct sk_buff *skb)
+{
+#ifdef CONFIG_TCP_MD5SIG
+	if (net_admin) {
+		struct tcp_md5sig_info *md5sig;
+		int err = 0;
+
+		lock_sock(sk);
+		rcu_read_lock();
+		md5sig = rcu_dereference(tcp_sk(sk)->md5sig_info);
+		if (md5sig)
+			err = inet_diag_put_md5sig(skb, md5sig);
+		rcu_read_unlock();
+		release_sock(sk);
+		if (err < 0)
+			return err;
+	}
+#endif
+
+	return 0;
+}
+
+static size_t tcp_diag_get_aux_size(struct sock *sk, bool net_admin)
+{
+	size_t size = 0;
+
+#ifdef CONFIG_TCP_MD5SIG
+	if (sk_fullsock(sk)) {
+		const struct tcp_md5sig_info *md5sig;
+		const struct tcp_md5sig_key *key;
+
+		rcu_read_lock();
+		md5sig = rcu_dereference(tcp_sk(sk)->md5sig_info);
+		if (md5sig) {
+			hlist_for_each_entry_rcu(key, &md5sig->head, node)
+				size += sizeof(struct tcp_md5sig);
+		}
+		rcu_read_unlock();
+	}
+#endif
+
+	return size;
+}
+
 static void tcp_diag_dump(struct sk_buff *skb, struct netlink_callback *cb,
 			  const struct inet_diag_req_v2 *r, struct nlattr *bc)
 {
@@ -68,13 +166,15 @@ static int tcp_diag_destroy(struct sk_buff *in_skb,
 #endif
 
 static const struct inet_diag_handler tcp_diag_handler = {
-	.dump		 = tcp_diag_dump,
-	.dump_one	 = tcp_diag_dump_one,
-	.idiag_get_info	 = tcp_diag_get_info,
-	.idiag_type	 = IPPROTO_TCP,
-	.idiag_info_size = sizeof(struct tcp_info),
+	.dump			= tcp_diag_dump,
+	.dump_one		= tcp_diag_dump_one,
+	.idiag_get_info		= tcp_diag_get_info,
+	.idiag_get_aux		= tcp_diag_get_aux,
+	.idiag_get_aux_size	= tcp_diag_get_aux_size,
+	.idiag_type		= IPPROTO_TCP,
+	.idiag_info_size	= sizeof(struct tcp_info),
 #ifdef CONFIG_INET_DIAG_DESTROY
-	.destroy	 = tcp_diag_destroy,
+	.destroy		= tcp_diag_destroy,
 #endif
 };
 
-- 
2.14.1
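
The count, reserve and fill sequence in inet_diag_put_md5sig() can be modelled in userspace C as below. This is a sketch under assumptions, not the patch's code: the `model_*` types and functions are hypothetical, a plain singly linked list stands in for the RCU hlist, and a caller-provided buffer stands in for nla_reserve(). One deliberate difference is that the model zeroes each record before filling it, so unused key bytes never carry stale buffer contents; whether the patch needs the same is a question for review rather than something this sketch settles.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define MODEL_KEYLEN_MAX 80	/* model limit on key length */

struct model_key {			/* list node, as kept on the socket */
	struct model_key *next;
	unsigned char keylen;
	unsigned char key[MODEL_KEYLEN_MAX];
};

struct model_key_info {			/* flat record copied to the reply */
	unsigned char keylen;
	unsigned char key[MODEL_KEYLEN_MAX];
};

/* First pass: how many bytes the flat array will need. */
static size_t model_keys_size(const struct model_key *head)
{
	size_t size = 0;

	for (; head; head = head->next)
		size += sizeof(struct model_key_info);
	return size;
}

/* Second pass: fill the records, or fail if the buffer is too small. */
static int model_put_keys(struct model_key_info *info, size_t room,
			  const struct model_key *head)
{
	size_t need = model_keys_size(head);

	if (need > room)
		return -EMSGSIZE;

	for (; head; head = head->next, info++) {
		assert(head->keylen <= MODEL_KEYLEN_MAX);
		memset(info, 0, sizeof(*info));
		info->keylen = head->keylen;
		memcpy(info->key, head->key, head->keylen);
	}
	return (int)(need / sizeof(struct model_key_info));
}

static void model_keys_demo(void)
{
	struct model_key k2 = { NULL, 3, "abc" };
	struct model_key k1 = { &k2, 2, "xy" };
	struct model_key_info out[2];

	assert(model_keys_size(&k1) == 2 * sizeof(struct model_key_info));
	assert(model_put_keys(out, sizeof(out), &k1) == 2);
	assert(out[0].keylen == 2 && memcmp(out[0].key, "xy", 2) == 0);
	assert(out[1].keylen == 3 && memcmp(out[1].key, "abc", 3) == 0);

	/* Room for one record only: the fill must fail, not truncate. */
	assert(model_put_keys(out, sizeof(out[0]), &k1) == -EMSGSIZE);
}
```

Calling `model_keys_demo()` from a `main()` runs the checks. The failing case mirrors what happens if the key list grows between the size computation and the fill: the reply buffer sized earlier is too small and the fill returns -EMSGSIZE.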

^ permalink raw reply related

* Re: [PATCH net 0/2] r8169: Be drop monitor friendly
From: David Miller @ 2017-08-26  2:13 UTC (permalink / raw)
  To: f.fainelli
  Cc: netdev, nic_swsd, romieu, edumazet, alexander.h.duyck, sgruszka
In-Reply-To: <20170825013359.27258-1-f.fainelli@gmail.com>

From: Florian Fainelli <f.fainelli@gmail.com>
Date: Thu, 24 Aug 2017 18:33:57 -0700

> First patch may be questionable but no other driver appears to be doing that
> and while it is defendable to account for left packets as dropped during TX
> clean, this appears misleading. I picked Stanislaw changes which brings us
> back to 2010, but this was present from pre-git days as well.

Right, drivers should not do this.

> Second patch fixes the two missing calls to dev_consume_skb_any().

Series applied, thanks.

^ permalink raw reply

* Re: [PATCH net-next 0/2] nfp: SR-IOV ndos support
From: David Miller @ 2017-08-26  2:25 UTC (permalink / raw)
  To: jakub.kicinski; +Cc: netdev, oss-drivers
In-Reply-To: <20170825043150.375-1-jakub.kicinski@netronome.com>

From: Jakub Kicinski <jakub.kicinski@netronome.com>
Date: Thu, 24 Aug 2017 21:31:48 -0700

> This set adds basic SR-IOV including setting/getting VF MAC addresses,
> VLANs, link state and spoofcheck settings.  It is wired up for both
> vNICs and representors (note: ip link will not report VF settings on
> VF/PF representors because they are not linked to the PF PCI device).
> 
> Pablo and team add the basic implementation, Simon and Dirk follow
> up with the representor plumbing.

Series applied, thanks.

^ permalink raw reply

* Re: [PATCH v2 net-next 3/8] bpf: Allow cgroup sock filters to use get_current_uid_gid helper
From: Alexei Starovoitov @ 2017-08-26  2:30 UTC (permalink / raw)
  To: David Ahern; +Cc: netdev, daniel, ast, tj, davem
In-Reply-To: <1503687941-626-4-git-send-email-dsahern@gmail.com>

On Fri, Aug 25, 2017 at 12:05:36PM -0700, David Ahern wrote:
> Allow BPF programs run on sock create to use the get_current_uid_gid
> helper. IPv4 and IPv6 sockets are created in a process context so
> there is always a valid uid/gid
> 
> Signed-off-by: David Ahern <dsahern@gmail.com>

Acked-by: Alexei Starovoitov <ast@kernel.org>

^ permalink raw reply

* Re: [net-next v2 00/13][pull request] 40GbE Intel Wired LAN Driver Updates 2017-08-25
From: David Miller @ 2017-08-26  2:41 UTC (permalink / raw)
  To: jeffrey.t.kirsher; +Cc: netdev, nhorman, sassmann, jogreene
In-Reply-To: <20170825220057.51804-1-jeffrey.t.kirsher@intel.com>

From: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Date: Fri, 25 Aug 2017 15:00:44 -0700

> This series contains updates to i40e and i40evf only.

Pulled, thanks Jeff.

^ permalink raw reply
