Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH v1 net-next] stmmac: fix DMA channel hang in half-duplex mode
From: David Miller @ 2018-06-15  0:06 UTC (permalink / raw)
  To: vbhadram; +Cc: peppe.cavallaro, alexandre.torgue, joabreu, netdev, narayanr
In-Reply-To: <1528864248-2246-1-git-send-email-vbhadram@nvidia.com>

From: Bhadram Varka <vbhadram@nvidia.com>
Date: Wed, 13 Jun 2018 10:00:48 +0530

> HW does not support Half-duplex mode in multi-queue
> scenario. Fix it by not advertising the Half-Duplex
> mode if multi-queue enabled.
> 
> Signed-off-by: Bhadram Varka <vbhadram@nvidia.com>

Bug fixes should be submitted against net, not net-next.  And
net-next is closed for submissions at this time.

Thanks.

^ permalink raw reply

* Re: [PATCH v3] tcp: verify the checksum of the first data segment in a new connection
From: David Miller @ 2018-06-15  0:05 UTC (permalink / raw)
  To: fllinden; +Cc: edumazet, netdev
In-Reply-To: <5b2052b1.eyFJo5VnqmUbz/O6%fllinden@amazon.com>

From: Frank van der Linden <fllinden@amazon.com>
Date: Tue, 12 Jun 2018 23:09:37 +0000

> commit 079096f103fa ("tcp/dccp: install syn_recv requests into ehash
> table") introduced an optimization for the handling of child sockets
> created for a new TCP connection.
> 
> But this optimization passes any data associated with the last ACK of the
> connection handshake up the stack without verifying its checksum, because it
> calls tcp_child_process(), which in turn calls tcp_rcv_state_process()
> directly.  These lower-level processing functions do not do any checksum
> verification.
> 
> Insert a tcp_checksum_complete call in the TCP_NEW_SYN_RECEIVE path to
> fix this.
> 
> Signed-off-by: Frank van der Linden <fllinden@amazon.com>

Applied and queued up for -stable.

I know you mention the bug causing commit in your commit message,
but you should also still provide a proper Fixes: tag.  I took
care of it for you this time.

Thanks.

^ permalink raw reply

* Re: [PATCH bpf-net] selftests/bpf: delete xfrm tunnel when test exits.
From: Daniel Borkmann @ 2018-06-15  0:05 UTC (permalink / raw)
  To: Martin KaFai Lau, William Tu; +Cc: netdev, anders.roxell
In-Reply-To: <20180614223009.7ohlsxctbglbi24e@kafai-mbp.dhcp.thefacebook.com>

On 06/15/2018 12:30 AM, Martin KaFai Lau wrote:
> On Thu, Jun 14, 2018 at 05:01:06AM -0700, William Tu wrote:
>> Make the printting of bpf xfrm tunnel better and
>> cleanup xfrm state and policy when xfrm test finishes.
> LGTM.  The subject tag actually meant s/bpf-net/bpf-next/?
> 
> It makes sense to be in bpf-next but I think bpf-next is still closed.
> Please repost later.

But given this fixes up missing cleanup of xfrm policy/state that was
added earlier as part of the test, I think bpf would be fine. (Subject
is a bit confusing indeed either bpf resp. net tree or bpf-next was
meant.)

Thanks,
Daniel

^ permalink raw reply

* Re: [PATCH net] net: qcom/emac: Add missing of_node_put()
From: David Miller @ 2018-06-15  0:01 UTC (permalink / raw)
  To: yuehaibing; +Cc: timur, linux-kernel, netdev
In-Reply-To: <20180611130345.15172-1-yuehaibing@huawei.com>

From: YueHaibing <yuehaibing@huawei.com>
Date: Mon, 11 Jun 2018 21:03:45 +0800

> Add missing of_node_put() call for device node returned by
> of_parse_phandle().
> 
> Signed-off-by: YueHaibing <yuehaibing@huawei.com>

Applied, thank you.

^ permalink raw reply

* Re: [PATCH net-next] hv_netvsc: Fix the variable sizes in ipsecv2 and rsc offload
From: David Miller @ 2018-06-15  0:00 UTC (permalink / raw)
  To: haiyangz, haiyangz
  Cc: netdev, kys, sthemmin, olaf, vkuznets, devel, linux-kernel
In-Reply-To: <20180614213912.2889-1-haiyangz@linuxonhyperv.com>


Bug fixes should be targetted at net, not net-next.  Furthermore,
net-next is closed.

^ permalink raw reply

* Re: [PATCH net-next,RFC 00/13] New fast forwarding path
From: David Miller @ 2018-06-14 23:58 UTC (permalink / raw)
  To: tom; +Cc: pablo, netfilter-devel, netdev, steffen.klassert
In-Reply-To: <CALx6S345B2-sejH52beewdiHuGo5nAY0v0uFmNuiJ0hQ6Wu3xw@mail.gmail.com>

From: Tom Herbert <tom@herbertland.com>
Date: Thu, 14 Jun 2018 13:52:03 -0700

> IIRC, there was a similar proposal a while back that want to bundle
> packets of the same flow together (without doing GRO) so that they
> could be processed by various functions by looking at just one
> representative packet in the group. The concept had some promise, but
> in the end it created quite a bit of complexity since at some point
> the packet bundle needed to be undone to go back to processing the
> individual packets.

You're probably talking about Edward Cree's SKB list stuff, and as
per his presenation at netconf 2 weeks ago he plans to revitalize
it given how Spectre et al. gives cause to reevaluate all bulking
techniques.

^ permalink raw reply

* Re: [PATCH net-next,RFC 00/13] New fast forwarding path
From: David Miller @ 2018-06-14 23:55 UTC (permalink / raw)
  To: f.fainelli; +Cc: pablo, netfilter-devel, netdev, steffen.klassert
In-Reply-To: <66634ba7-3d72-644a-9d26-d9644c540619@gmail.com>

From: Florian Fainelli <f.fainelli@gmail.com>
Date: Thu, 14 Jun 2018 11:14:37 -0700

> On those platforms there are a number of things that just literally
> kill the routing performance: small I and D caches, small or not L2,
> limited bandwidth DRAM, huge call depths, big struct sk_buff layout,
> you name it.

Another reason to work on a 64-bit MIPS eBPF JIT.

We have a model, and game plan for this kind of application.  And it's
XDP and eBPF with JITs.

We are fully commited to this approach, and I see anything else that
tries to slip in and approach some sub-part of the problem as a
complete distraction and a step backwards.

All of the effort on this work could have been spent filling in the
missing pieces you mention.

And guess what?  Then millions of possibilities would have been
openned up, rather than just this one special case.

So, I ask, please see the larger picture.

Thank you.

^ permalink raw reply

* Re: [bpf PATCH v2 1/6] bpf: sockmap, fix crash when ipv6 sock is added
From: Martin KaFai Lau @ 2018-06-14 23:53 UTC (permalink / raw)
  To: John Fastabend; +Cc: ast, daniel, netdev
In-Reply-To: <20180614164446.24994.22118.stgit@john-Precision-Tower-5810>

On Thu, Jun 14, 2018 at 09:44:46AM -0700, John Fastabend wrote:
> This fixes a crash where we assign tcp_prot to IPv6 sockets instead
> of tcpv6_prot.
> 
> Previously we overwrote the sk->prot field with tcp_prot even in the
> AF_INET6 case. This patch ensures the correct tcp_prot and tcpv6_prot
> are used.

> Further, only allow ESTABLISHED connections to join the
> map per note in TLS ULP,
> 
>    /* The TLS ulp is currently supported only for TCP sockets
>     * in ESTABLISHED state.
>     * Supporting sockets in LISTEN state will require us
>     * to modify the accept implementation to clone rather then
>     * share the ulp context.
>     */
This bit has been moved to patch 2.

> 
> Also tested with 'netserver -6' and 'netperf -H [IPv6]' as well as
> 'netperf -H [IPv4]'. The ESTABLISHED check resolves the previously
> crashing case here.
> 
> Fixes: 174a79ff9515 ("bpf: sockmap with sk redirect support")
> Reported-by: syzbot+5c063698bdbfac19f363@syzkaller.appspotmail.com
> Signed-off-by: John Fastabend <john.fastabend@gmail.com>
> Signed-off-by: Wei Wang <weiwan@google.com>
> ---
>  0 files changed
> 
> diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
> index 52a91d8..f6dd4cd 100644
> --- a/kernel/bpf/sockmap.c
> +++ b/kernel/bpf/sockmap.c
> @@ -140,6 +140,7 @@ static int bpf_tcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
>  static int bpf_tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size);
>  static int bpf_tcp_sendpage(struct sock *sk, struct page *page,
>  			    int offset, size_t size, int flags);
> +static void bpf_tcp_close(struct sock *sk, long timeout);
>  
>  static inline struct smap_psock *smap_psock_sk(const struct sock *sk)
>  {
> @@ -161,7 +162,42 @@ static bool bpf_tcp_stream_read(const struct sock *sk)
>  	return !empty;
>  }
>  
> -static struct proto tcp_bpf_proto;
> +enum {
> +	SOCKMAP_IPV4,
> +	SOCKMAP_IPV6,
> +	SOCKMAP_NUM_PROTS,
> +};
> +
> +enum {
> +	SOCKMAP_BASE,
> +	SOCKMAP_TX,
> +	SOCKMAP_NUM_CONFIGS,
> +};
> +
> +static struct proto *saved_tcpv6_prot;
__read_mostly

> +static DEFINE_MUTEX(tcpv6_prot_mutex);
> +static struct proto bpf_tcp_prots[SOCKMAP_NUM_PROTS][SOCKMAP_NUM_CONFIGS];
> +static void build_protos(struct proto prot[SOCKMAP_NUM_CONFIGS],
> +			 struct proto *base)
> +{
> +	prot[SOCKMAP_BASE]			= *base;
> +	prot[SOCKMAP_BASE].close		= bpf_tcp_close;
> +	prot[SOCKMAP_BASE].recvmsg		= bpf_tcp_recvmsg;
> +	prot[SOCKMAP_BASE].stream_memory_read	= bpf_tcp_stream_read;
> +
> +	prot[SOCKMAP_TX]			= prot[SOCKMAP_BASE];
> +	prot[SOCKMAP_TX].sendmsg		= bpf_tcp_sendmsg;
> +	prot[SOCKMAP_TX].sendpage		= bpf_tcp_sendpage;
> +}
> +
> +static void update_sk_prot(struct sock *sk, struct smap_psock *psock)
> +{
> +	int family = sk->sk_family == AF_INET6 ? SOCKMAP_IPV6 : SOCKMAP_IPV4;
> +	int conf = psock->bpf_tx_msg ? SOCKMAP_TX : SOCKMAP_BASE;
> +
> +	sk->sk_prot = &bpf_tcp_prots[family][conf];
> +}
> +
>  static int bpf_tcp_init(struct sock *sk)
>  {
>  	struct smap_psock *psock;
> @@ -181,14 +217,17 @@ static int bpf_tcp_init(struct sock *sk)
>  	psock->save_close = sk->sk_prot->close;
>  	psock->sk_proto = sk->sk_prot;
>  
> -	if (psock->bpf_tx_msg) {
> -		tcp_bpf_proto.sendmsg = bpf_tcp_sendmsg;
> -		tcp_bpf_proto.sendpage = bpf_tcp_sendpage;
> -		tcp_bpf_proto.recvmsg = bpf_tcp_recvmsg;
> -		tcp_bpf_proto.stream_memory_read = bpf_tcp_stream_read;
> +	/* Build IPv6 sockmap whenever the address of tcpv6_prot changes */
> +	if (sk->sk_family == AF_INET6 &&
> +	    unlikely(sk->sk_prot != smp_load_acquire(&saved_tcpv6_prot))) {
> +		mutex_lock(&tcpv6_prot_mutex);
bpf_tcp_init() can be called by skops?
Can mutex_lock() be used here?

> +		if (likely(sk->sk_prot != saved_tcpv6_prot)) {
> +			build_protos(bpf_tcp_prots[SOCKMAP_IPV6], sk->sk_prot);
> +			smp_store_release(&saved_tcpv6_prot, sk->sk_prot);
> +		}
> +		mutex_unlock(&tcpv6_prot_mutex);
>  	}
> -
> -	sk->sk_prot = &tcp_bpf_proto;
> +	update_sk_prot(sk, psock);
>  	rcu_read_unlock();
>  	return 0;
>  }
> @@ -1111,8 +1150,7 @@ static void bpf_tcp_msg_add(struct smap_psock *psock,
>  
>  static int bpf_tcp_ulp_register(void)
>  {
> -	tcp_bpf_proto = tcp_prot;
> -	tcp_bpf_proto.close = bpf_tcp_close;
> +	build_protos(bpf_tcp_prots[SOCKMAP_IPV4], &tcp_prot);
>  	/* Once BPF TX ULP is registered it is never unregistered. It
>  	 * will be in the ULP list for the lifetime of the system. Doing
>  	 * duplicate registers is not a problem.
> 

^ permalink raw reply

* Re: [net RFC] net/mlx4_en: Use frag stride in crossing page boundary condition
From: Eric Dumazet @ 2018-06-14 23:49 UTC (permalink / raw)
  To: Saeed Mahameed, eric.dumazet@gmail.com, kafai@fb.com,
	Tariq Toukan
  Cc: netdev@vger.kernel.org, edumazet@google.com
In-Reply-To: <77a87572d82deef1e035ec3b36027e8f6e0c1ea2.camel@mellanox.com>



On 06/14/2018 02:04 PM, Saeed Mahameed wrote:

> I was looking at the code without my fix :)
> 
> with my fix:
> release = frags->page_offset + frag_info->frag_stride > PAGE_SIZE;
> 
> for XDP: frag_info->frag_stride is PAGE_SIZE, so release will always be
> true regardless of PAGE_SIZE.
> 
> So i guess i didn't quite understand your PowerPC concern.. can you
> elaborate ?
> 

So your maths with PAGE_SIZE=65536 and MTU 9000

frag_stride is about 9344

So if the last chunk of the page has 9100 bytes, we wont be able to use it, while really we should be able to use it.

^ permalink raw reply

* Re: [PATCH bpf-net] selftests/bpf: delete xfrm tunnel when test exits.
From: Martin KaFai Lau @ 2018-06-14 22:30 UTC (permalink / raw)
  To: William Tu; +Cc: netdev, anders.roxell
In-Reply-To: <1528977666-26477-1-git-send-email-u9012063@gmail.com>

On Thu, Jun 14, 2018 at 05:01:06AM -0700, William Tu wrote:
> Make the printting of bpf xfrm tunnel better and
> cleanup xfrm state and policy when xfrm test finishes.
LGTM.  The subject tag actually meant s/bpf-net/bpf-next/?

It makes sense to be in bpf-next but I think bpf-next is still closed.
Please repost later.

> 
> Signed-off-by: William Tu <u9012063@gmail.com>
> ---
>  tools/testing/selftests/bpf/test_tunnel.sh | 24 +++++++++++++-----------
>  1 file changed, 13 insertions(+), 11 deletions(-)
> 
> diff --git a/tools/testing/selftests/bpf/test_tunnel.sh b/tools/testing/selftests/bpf/test_tunnel.sh
> index aeb2901f21f4..7b1946b340be 100755
> --- a/tools/testing/selftests/bpf/test_tunnel.sh
> +++ b/tools/testing/selftests/bpf/test_tunnel.sh
> @@ -608,28 +608,26 @@ setup_xfrm_tunnel()
>  test_xfrm_tunnel()
>  {
>  	config_device
> -        #tcpdump -nei veth1 ip &
> -	output=$(mktemp)
> -	cat /sys/kernel/debug/tracing/trace_pipe | tee $output &
> -        setup_xfrm_tunnel
> +	> /sys/kernel/debug/tracing/trace
> +	setup_xfrm_tunnel
>  	tc qdisc add dev veth1 clsact
>  	tc filter add dev veth1 proto ip ingress bpf da obj test_tunnel_kern.o \
>  		sec xfrm_get_state
>  	ip netns exec at_ns0 ping $PING_ARG 10.1.1.200
>  	sleep 1
> -	grep "reqid 1" $output
> +	grep "reqid 1" /sys/kernel/debug/tracing/trace
>  	check_err $?
> -	grep "spi 0x1" $output
> +	grep "spi 0x1" /sys/kernel/debug/tracing/trace
>  	check_err $?
> -	grep "remote ip 0xac100164" $output
> +	grep "remote ip 0xac100164" /sys/kernel/debug/tracing/trace
>  	check_err $?
>  	cleanup
>  
>  	if [ $ret -ne 0 ]; then
> -                echo -e ${RED}"FAIL: xfrm tunnel"${NC}
> -                return 1
> -        fi
> -        echo -e ${GREEN}"PASS: xfrm tunnel"${NC}
> +		echo -e ${RED}"FAIL: xfrm tunnel"${NC}
> +		return 1
> +	fi
> +	echo -e ${GREEN}"PASS: xfrm tunnel"${NC}
>  }
>  
>  attach_bpf()
> @@ -657,6 +655,10 @@ cleanup()
>  	ip link del ip6geneve11 2> /dev/null
>  	ip link del erspan11 2> /dev/null
>  	ip link del ip6erspan11 2> /dev/null
> +	ip xfrm policy delete dir out src 10.1.1.200/32 dst 10.1.1.100/32 2> /dev/null
> +	ip xfrm policy delete dir in src 10.1.1.100/32 dst 10.1.1.200/32 2> /dev/null
> +	ip xfrm state delete src 172.16.1.100 dst 172.16.1.200 proto esp spi 0x1 2> /dev/null
> +	ip xfrm state delete src 172.16.1.200 dst 172.16.1.100 proto esp spi 0x2 2> /dev/null
>  }
>  
>  cleanup_exit()
> -- 
> 2.7.4
> 

^ permalink raw reply

* [PATCH net-next] hv_netvsc: Fix the variable sizes in ipsecv2 and rsc offload
From: Haiyang Zhang @ 2018-06-14 21:39 UTC (permalink / raw)
  To: davem, netdev; +Cc: olaf, sthemmin, haiyangz, linux-kernel, devel, vkuznets

From: Haiyang Zhang <haiyangz@microsoft.com>

These fields in struct ndis_ipsecv2_offload and struct ndis_rsc_offload
are one byte according to the specs. This patch defines them with the
right size.

Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
---
 drivers/net/hyperv/hyperv_net.h | 26 +++++++++++++-------------
 1 file changed, 13 insertions(+), 13 deletions(-)

diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h
index 23304aca25f9..84eaaa6368a2 100644
--- a/drivers/net/hyperv/hyperv_net.h
+++ b/drivers/net/hyperv/hyperv_net.h
@@ -1277,17 +1277,17 @@ struct ndis_lsov2_offload {
 
 struct ndis_ipsecv2_offload {
 	u32	encap;
-	u16	ip6;
-	u16	ip4opt;
-	u16	ip6ext;
-	u16	ah;
-	u16	esp;
-	u16	ah_esp;
-	u16	xport;
-	u16	tun;
-	u16	xport_tun;
-	u16	lso;
-	u16	extseq;
+	u8	ip6;
+	u8	ip4opt;
+	u8	ip6ext;
+	u8	ah;
+	u8	esp;
+	u8	ah_esp;
+	u8	xport;
+	u8	tun;
+	u8	xport_tun;
+	u8	lso;
+	u8	extseq;
 	u32	udp_esp;
 	u32	auth;
 	u32	crypto;
@@ -1295,8 +1295,8 @@ struct ndis_ipsecv2_offload {
 };
 
 struct ndis_rsc_offload {
-	u16	ip4;
-	u16	ip6;
+	u8	ip4;
+	u8	ip6;
 };
 
 struct ndis_encap_offload {
-- 
2.17.1

^ permalink raw reply related

* Re: [net RFC] net/mlx4_en: Use frag stride in crossing page boundary condition
From: Saeed Mahameed @ 2018-06-14 21:04 UTC (permalink / raw)
  To: eric.dumazet@gmail.com, kafai@fb.com, Tariq Toukan
  Cc: netdev@vger.kernel.org, edumazet@google.com
In-Reply-To: <9146b0a3d34388006d456135c9fe4f618260e63b.camel@mellanox.com>

On Thu, 2018-06-14 at 20:53 +0000, Saeed Mahameed wrote:
> On Thu, 2018-06-14 at 13:47 -0700, saeedm@mellanox.com wrote:
> > On Thu, 2018-06-14 at 12:12 -0700, Eric Dumazet wrote:
> > > 
> > > On 06/14/2018 11:56 AM, Saeed Mahameed wrote:
> > > 
> > > > Interestingly for this exact frag_stride we don't have an issue
> > > > :)
> > > > since it goes through a different condition branch
> > > > (the page flipping thing):
> > > > 
> > > > if (frag_info->frag_stride == PAGE_SIZE / 2) {
> > > >     frags->page_offset ^= PAGE_SIZE / 2;
> > > > 	release = page_count(page) != 1 ||
> > > > 		  page_is_pfmemalloc(page) ||
> > > > 		  page_to_nid(page) != numa_mem_id();
> > > > 
> > > 
> > > I guess you forgot to test on PowerPC where PAGE_SIZE=65536 ?
> > > 
> > > On PowerPC, the first branch is never taken.
> > > 
> 
> Oh, sorry i see where you are going with this, you mean for xdp in
> the
> other branch we need to take care of PAGE_SIZE > 4k .. 
> 

I was looking at the code without my fix :)

with my fix:
release = frags->page_offset + frag_info->frag_stride > PAGE_SIZE;

for XDP: frag_info->frag_stride is PAGE_SIZE, so release will always be
true regardless of PAGE_SIZE.

So i guess i didn't quite understand your PowerPC concern.. can you
elaborate ?

> You are totally right, good catch! I will have to figure this out :)
> ..
> 
> > 
> > This code is already in the driver, i guess as optimization for
> > when
> > a
> > page can be reused only twice (PAGE_SIZE=4K, strides_size = 2k).
> > frag_stride is never more than 2k.
> > 
> > for power PC we should always take the other branch which will
> > reuse
> > as
> > much 2k strides as possible in a page regardless of PAGE_SIZE.
> > 
> > so are we good with my change ?
> > 

^ permalink raw reply

* Re: [net RFC] net/mlx4_en: Use frag stride in crossing page boundary condition
From: Saeed Mahameed @ 2018-06-14 20:53 UTC (permalink / raw)
  To: eric.dumazet@gmail.com, kafai@fb.com, Tariq Toukan
  Cc: netdev@vger.kernel.org, edumazet@google.com
In-Reply-To: <59c3b776cc0c683fff8090195fd71d7f22305744.camel@mellanox.com>

On Thu, 2018-06-14 at 13:47 -0700, saeedm@mellanox.com wrote:
> On Thu, 2018-06-14 at 12:12 -0700, Eric Dumazet wrote:
> > 
> > On 06/14/2018 11:56 AM, Saeed Mahameed wrote:
> > 
> > > Interestingly for this exact frag_stride we don't have an issue
> > > :)
> > > since it goes through a different condition branch
> > > (the page flipping thing):
> > > 
> > > if (frag_info->frag_stride == PAGE_SIZE / 2) {
> > >     frags->page_offset ^= PAGE_SIZE / 2;
> > > 	release = page_count(page) != 1 ||
> > > 		  page_is_pfmemalloc(page) ||
> > > 		  page_to_nid(page) != numa_mem_id();
> > > 
> > 
> > I guess you forgot to test on PowerPC where PAGE_SIZE=65536 ?
> > 
> > On PowerPC, the first branch is never taken.
> > 

Oh, sorry i see where you are going with this, you mean for xdp in the
other branch we need to take care of PAGE_SIZE > 4k .. 

You are totally right, good catch! I will have to figure this out :) ..

> 
> This code is already in the driver, i guess as optimization for when
> a
> page can be reused only twice (PAGE_SIZE=4K, strides_size = 2k).
> frag_stride is never more than 2k.
> 
> for power PC we should always take the other branch which will reuse
> as
> much 2k strides as possible in a page regardless of PAGE_SIZE.
> 
> so are we good with my change ?
> 
> 

^ permalink raw reply

* Re: [PATCH net-next,RFC 00/13] New fast forwarding path
From: Tom Herbert @ 2018-06-14 20:52 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: netfilter-devel, Linux Kernel Network Developers,
	Steffen Klassert
In-Reply-To: <20180614141947.3580-1-pablo@netfilter.org>

On Thu, Jun 14, 2018 at 7:19 AM, Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> Hi,
>
> This patchset proposes a new fast forwarding path infrastructure that
> combines the GRO/GSO and the flowtable infrastructures. The idea is to
> add a hook at the GRO layer that is invoked before the standard GRO
> protocol offloads. This allows us to build custom packet chains that we
> can quickly pass in one go to the neighbour layer to define fast
> forwarding path for flows.
>
> For each packet that gets into the GRO layer, we first check if there is
> an entry in the flowtable, if so, the packet is placed in a list until
> the GRO infrastructure decides to send the batch from gro_complete to
> the neighbour layer. The first packet in the list takes the route from
> the flowtable entry, so we avoid reiterative routing lookups.
>
> In case no entry is found in the flowtable, the packet is passed up to
> the classic GRO offload handlers. Thus, this packet follows the standard
> forwarding path. Note that the initial packets of the flow always go
> through the standard IPv4/IPv6 netfilter forward hook, that is used to
> configure what flows are placed in the flowtable. Therefore, only a few
> (initial) packets follow the standard forwarding path while most of the
> follow up packets take this new fast forwarding path.
>

IIRC, there was a similar proposal a while back that want to bundle
packets of the same flow together (without doing GRO) so that they
could be processed by various functions by looking at just one
representative packet in the group. The concept had some promise, but
in the end it created quite a bit of complexity since at some point
the packet bundle needed to be undone to go back to processing the
individual packets.

Tom

> The fast forwarding path is enabled through explicit user policy, so the
> user needs to request this behaviour from control plane, the following
> example shows how to place flows in the new fast forwarding path from
> the netfilter forward chain:
>
>  table x {
>         flowtable f {
>                 hook early_ingress priority 0; devices = { eth0, eth1 }
>         }
>
>         chain y {
>                 type filter hook forward priority 0;
>                 ip protocol tcp flow offload @f
>         }
>  }
>
> The example above defines a fastpath for TCP flows that are placed in
> the flowtable 'f', this flowtable is hooked at the new early_ingress
> hook. The initial TCP packets that match this rule from the standard
> fowarding path create an entry in the flowtable, thus, GRO creates chain
> of packets for those that find an entry in the flowtable and send
> them through the neighbour layer.
>
> This new hook is happening before the ingress taps, therefore, packets
> that follow this new fast forwarding path are not shown by tcpdump.
>
> This patchset supports both layer 3 IPv4 and IPv6, and layer 4 TCP and
> UDP protocols. This fastpath also integrates with the IPSec
> infrastructure and the ESP protocol.
>
> We have collected performance numbers:
>
>         TCP TSO         TCP Fast Forward
>         32.5 Gbps       35.6 Gbps
>
>         UDP             UDP Fast Forward
>         17.6 Gbps       35.6 Gbps
>
>         ESP             ESP Fast Forward
>         6 Gbps          7.5 Gbps
>
> For UDP, this is doubling performance, and we almost achieve line rate
> with one single CPU using the Intel i40e NIC. We got similar numbers
> with the Mellanox ConnectX-4. For TCP, this is slightly improving things
> even if TSO is being defeated given that we need to segment the packet
> chain in software. We would like to explore HW GRO support with hardware
> vendors with this new mode, we think that should improve the TCP numbers
> we are showing above even more. For ESP traffic, performance improvement
> is ~25%, in this case, perf shows the bottleneck becomes the crypto layer.
>
> This patchset is co-authored work with Steffen Klassert.
>
> Comments are welcome, thanks.
>
>
> Pablo Neira Ayuso (6):
>   netfilter: nft_chain_filter: add support for early ingress
>   netfilter: nf_flow_table: add hooknum to flowtable type
>   netfilter: nf_flow_table: add flowtable for early ingress hook
>   netfilter: nft_flow_offload: enable offload after second packet is seen
>   netfilter: nft_flow_offload: remove secpath check
>   netfilter: nft_flow_offload: make sure route is not stale
>
> Steffen Klassert (7):
>   net: Add a helper to get the packet offload callbacks by priority.
>   net: Change priority of ipv4 and ipv6 packet offloads.
>   net: Add a GSO feature bit for the netfilter forward fastpath.
>   net: Use one bit of NAPI_GRO_CB for the netfilter fastpath.
>   netfilter: add early ingress hook for IPv4
>   netfilter: add early ingress support for IPv6
>   netfilter: add ESP support for early ingress
>
>  include/linux/netdev_features.h         |   4 +-
>  include/linux/netdevice.h               |   6 +-
>  include/linux/netfilter.h               |   6 +
>  include/linux/netfilter_ingress.h       |   1 +
>  include/linux/skbuff.h                  |   2 +
>  include/net/netfilter/early_ingress.h   |  24 +++
>  include/net/netfilter/nf_flow_table.h   |   4 +
>  include/uapi/linux/netfilter.h          |   1 +
>  net/core/dev.c                          |  50 ++++-
>  net/ipv4/af_inet.c                      |   1 +
>  net/ipv4/netfilter/Makefile             |   1 +
>  net/ipv4/netfilter/early_ingress.c      | 327 +++++++++++++++++++++++++++++
>  net/ipv4/netfilter/nf_flow_table_ipv4.c |  12 ++
>  net/ipv6/ip6_offload.c                  |   1 +
>  net/ipv6/netfilter/Makefile             |   1 +
>  net/ipv6/netfilter/early_ingress.c      | 315 ++++++++++++++++++++++++++++
>  net/ipv6/netfilter/nf_flow_table_ipv6.c |   1 +
>  net/netfilter/Kconfig                   |   8 +
>  net/netfilter/Makefile                  |   1 +
>  net/netfilter/core.c                    |  35 +++-
>  net/netfilter/early_ingress.c           | 361 ++++++++++++++++++++++++++++++++
>  net/netfilter/nf_flow_table_inet.c      |   1 +
>  net/netfilter/nf_flow_table_ip.c        |  72 +++++++
>  net/netfilter/nf_tables_api.c           | 120 ++++++-----
>  net/netfilter/nft_chain_filter.c        |   6 +-
>  net/netfilter/nft_flow_offload.c        |  13 +-
>  net/xfrm/xfrm_output.c                  |   4 +
>  27 files changed, 1297 insertions(+), 81 deletions(-)
>  create mode 100644 include/net/netfilter/early_ingress.h
>  create mode 100644 net/ipv4/netfilter/early_ingress.c
>  create mode 100644 net/ipv6/netfilter/early_ingress.c
>  create mode 100644 net/netfilter/early_ingress.c
>
> --
> 2.11.0
>
>

^ permalink raw reply

* Re: [net RFC] net/mlx4_en: Use frag stride in crossing page boundary condition
From: Saeed Mahameed @ 2018-06-14 20:47 UTC (permalink / raw)
  To: eric.dumazet@gmail.com, kafai@fb.com, Tariq Toukan
  Cc: netdev@vger.kernel.org, edumazet@google.com
In-Reply-To: <1889d389-a741-aa7b-c2b1-14530fb44ba8@gmail.com>

On Thu, 2018-06-14 at 12:12 -0700, Eric Dumazet wrote:
> 
> On 06/14/2018 11:56 AM, Saeed Mahameed wrote:
> 
> > Interestingly for this exact frag_stride we don't have an issue :)
> > since it goes through a different condition branch
> > (the page flipping thing):
> > 
> > if (frag_info->frag_stride == PAGE_SIZE / 2) {
> >     frags->page_offset ^= PAGE_SIZE / 2;
> > 	release = page_count(page) != 1 ||
> > 		  page_is_pfmemalloc(page) ||
> > 		  page_to_nid(page) != numa_mem_id();
> > 
> 
> I guess you forgot to test on PowerPC where PAGE_SIZE=65536 ?
> 
> On PowerPC, the first branch is never taken.
> 

This code is already in the driver, i guess as optimization for when a
page can be reused only twice (PAGE_SIZE=4K, strides_size = 2k).
frag_stride is never more than 2k.

for power PC we should always take the other branch which will reuse as
much 2k strides as possible in a page regardless of PAGE_SIZE.

so are we good with my change ?



^ permalink raw reply

* Re: [PATCH] net: split sk_reuse into sk_reuse and sk_force_reuse
From: Andrei Vagin @ 2018-06-14 20:02 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Andrey Vagin, David Miller, netdev, Maciej Żenczykowski
In-Reply-To: <CANn89iJ6Yj8VPBQHsJcEcJUePRJz6NpZrL3OenSfiYccKDbDFA@mail.gmail.com>

On Wed, Jun 13, 2018 at 06:17:41PM -0700, Eric Dumazet wrote:
> On Wed, Jun 13, 2018 at 5:56 PM Andrei Vagin <avagin@openvz.org> wrote:
> 
> > The commit f396922d862a added a check to not allow changing
> > SO_REUSEADDR/SO_REUSEPORT on bound sockets. First, it doesn't
> > take into account that TCP_REPAIR changes SO_REUSEADDR. Second, now it
> > is impossible to restore a socket state and set SO_REUSEADDR,
> > because the kernel always sets SO_REUSEADDR into zero after disabling
> > the repair mode.
> >
> >
> Hi Andrey
> 
> This commit was reverted, do we still need  this patch ?

I have seen that this patch was reverted. Probably I had to check
net-next before sending it.

I'm agree with Maciej Żenczykowski that it makes this code better. I
have never understood why TCP_REPAIR drops SO_REUSEADDR. Now each time
when we use TCP_REPAIR, we have to save a value of SO_REUSEADDR and
restore it back after disabling TCP_REPAIR. With this patch, we will
able to enable/disable TCP_REPAIR and don't care about sk_reuse.

I will update the commit message and send the patch again.

Thanks,
Andrei

^ permalink raw reply

* Re: [RFC PATCH 12/12] PM / hibernate: update the resume offset on SNAPSHOT_SET_SWAP_AREA
From: Besogonov, Aleksei @ 2018-06-14 19:50 UTC (permalink / raw)
  To: Pavel Machek, Agarwal, Anchal
  Cc: tglx@linutronix.de, mingo@redhat.com, hpa@zytor.com,
	x86@kernel.org, boris.ostrovsky@oracle.com,
	konrad.wilk@oracle.com, roger.pau@citrix.com,
	netdev@vger.kernel.org, jgross@suse.com,
	xen-devel@lists.xenproject.org, linux-kernel@vger.kernel.org,
	Kamata, Munehisa, van der Linden, Frank, Vaidyeshwara, Vallish,
	Anbalagane, Guru, Valentin, Eduardo, rjw@rjwysocki.net
In-Reply-To: <20180614194515.GE17808@amd>

(sorry for top-posting)

Well, not completely broken. It just required a restart for the offset setting to take an effect.

On 6/14/18, 12:46, "Pavel Machek" <pavel@ucw.cz> wrote:

    Hi!
    
    > From: Aleksei Besogonov <cyberax@amazon.com>
    > 
    > The SNAPSHOT_SET_SWAP_AREA is supposed to be used to set the hibernation
    > offset on a running kernel to enable hibernating to a swap file.
    > However, it doesn't actually update the swsusp_resume_block variable. As
    > a result, the hibernation fails at the last step (after all the data is
    > written out) in the validation of the swap signature in
    > mark_swapfiles().
    > 
    > Before this patch, the command line processing was the only place where
    > swsusp_resume_block was set.
    
    Are you saying that suspend-to-file was broken, even on
    non-virtualized systems?
    
    If so, we may this to go in first...
    								Pavel
    								
    -- 
    (english) http://www.livejournal.com/~pavelmachek
    (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
    


^ permalink raw reply

* Re: List of Networking enhancements and bug fixes in a particular release
From: Joe Smith @ 2018-06-14 19:48 UTC (permalink / raw)
  To: whiteheadm; +Cc: netdev, David Miller, Eric Dumazet
In-Reply-To: <CAP8WD_bu8jcVjGpdK_a=C_9P075ickqi=-EZnvVR0Xa3=8crgw@mail.gmail.com>

Matthew,

Thanks a lot for pointing this out to me. It is extremely helpful.

Regards

On Thu, Jun 14, 2018 at 12:19 PM, tedheadster <tedheadster@gmail.com> wrote:
> On Thu, Jun 14, 2018 at 1:21 PM, Joe Smith <codesoldier1@gmail.com> wrote:
>>
>> What is the best and authoritative mechanism to find out networking
>> enhancements in a Linux release?
>>
>
> Joe,
>   there usually is a good summary a few days after a kernel release on
> the Kernel Newbies site. Here is a recent one:
>
> https://kernelnewbies.org/Linux_4.15
>
> - Matthew



-- 
JS

^ permalink raw reply

* Re: [RFC PATCH 12/12] PM / hibernate: update the resume offset on SNAPSHOT_SET_SWAP_AREA
From: Pavel Machek @ 2018-06-14 19:45 UTC (permalink / raw)
  To: Anchal Agarwal
  Cc: tglx, mingo, hpa, x86, boris.ostrovsky, konrad.wilk, roger.pau,
	netdev, jgross, xen-devel, linux-kernel, kamatam, fllinden,
	vallish, guruanb, eduval, rjw, len.brown, linux-pm, cyberax
In-Reply-To: <20180612205619.28156-13-anchalag@amazon.com>

[-- Attachment #1: Type: text/plain, Size: 834 bytes --]

Hi!

> From: Aleksei Besogonov <cyberax@amazon.com>
> 
> The SNAPSHOT_SET_SWAP_AREA is supposed to be used to set the hibernation
> offset on a running kernel to enable hibernating to a swap file.
> However, it doesn't actually update the swsusp_resume_block variable. As
> a result, the hibernation fails at the last step (after all the data is
> written out) in the validation of the swap signature in
> mark_swapfiles().
> 
> Before this patch, the command line processing was the only place where
> swsusp_resume_block was set.

Are you saying that suspend-to-file was broken, even on
non-virtualized systems?

If so, we may this to go in first...
								Pavel
								
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply

* Re: List of Networking enhancements and bug fixes in a particular release
From: tedheadster @ 2018-06-14 19:19 UTC (permalink / raw)
  To: Joe Smith; +Cc: netdev, David Miller, Eric Dumazet
In-Reply-To: <CABGNecwFH+jYwTgWBO-FKd6CzP1-+uUL750aw-N1b63Yx=6a=A@mail.gmail.com>

On Thu, Jun 14, 2018 at 1:21 PM, Joe Smith <codesoldier1@gmail.com> wrote:
>
> What is the best and authoritative mechanism to find out networking
> enhancements in a Linux release?
>

Joe,
  there usually is a good summary a few days after a kernel release on
the Kernel Newbies site. Here is a recent one:

https://kernelnewbies.org/Linux_4.15

- Matthew

^ permalink raw reply

* Re: [net RFC] net/mlx4_en: Use frag stride in crossing page boundary condition
From: Eric Dumazet @ 2018-06-14 19:12 UTC (permalink / raw)
  To: Saeed Mahameed, eric.dumazet@gmail.com, kafai@fb.com,
	Tariq Toukan
  Cc: netdev@vger.kernel.org, edumazet@google.com
In-Reply-To: <9a8f7e1b2b51320178f671c2ae57d7d54be5af5a.camel@mellanox.com>



On 06/14/2018 11:56 AM, Saeed Mahameed wrote:

> Interestingly for this exact frag_stride we don't have an issue :)
> since it goes through a different condition branch
> (the page flipping thing):
> 
> if (frag_info->frag_stride == PAGE_SIZE / 2) {
>     frags->page_offset ^= PAGE_SIZE / 2;
> 	release = page_count(page) != 1 ||
> 		  page_is_pfmemalloc(page) ||
> 		  page_to_nid(page) != numa_mem_id();
> 

I guess you forgot to test on PowerPC where PAGE_SIZE=65536 ?

On PowerPC, the first branch is never taken.

^ permalink raw reply

* Re: [net RFC] net/mlx4_en: Use frag stride in crossing page boundary condition
From: Saeed Mahameed @ 2018-06-14 18:56 UTC (permalink / raw)
  To: eric.dumazet@gmail.com, kafai@fb.com, Tariq Toukan
  Cc: netdev@vger.kernel.org, edumazet@google.com
In-Reply-To: <82f89ebc-713c-1b97-0d0a-e455094e2638@gmail.com>

On Wed, 2018-06-13 at 19:30 -0700, Eric Dumazet wrote:
> 
> On 06/13/2018 05:53 PM, Saeed Mahameed wrote:
> > When a new rx packet arrives, the rx path will decide whether to
> > reuse
> > the same page or not according to the current rx frag page offset
> > and
> > frag size, i.e:
> > release = frags->page_offset + frag_info->frag_size > PAGE_SIZE;
> > 
> > Martin debugged this and he showed that this can cause mlx4 XDP to
> > reuse
> > buffers when it shouldn't.
> > 
> > Using frag_info->frag_size is wrong since frag size is always the
> > port
> > mtu and the frag stride might be larger, especially in XDP case
> > where
> > frag_stride == PAGE_SIZE.
> 
> Hmmm... why frag_size is not PAGE_SIZE for XDP then ?
> 

I thought about this, but i see some problems with this that i would
like to avoid.

1. definition of frag_size v.s. frag_stride:
frag_size should be equal to the effective mtu, since the descriptor
frag size depends solely on it. and we don't want to end up receiving
packets larger than the stack mtu (think of a vf with smaller mtu than
the wire).
e.g:
rx_desc->data[i].byte_count = 
       cpu_to_be32(priv->frag_info[i].frag_size);

frag_stride defines how much the frag_size can safely grow inside of a
page or how much padding we need for this frag.

so yes frag_size = PAGE_SIZE can work but it also kind of weird and
might cause incorrectness in terms of RX packets sizes.


> This patch is a bit strange, since we do :
> 
> u32 sz_align = ALIGN(frag_size, SMP_CACHE_BYTES);
> 
> frags->page_offset += sz_align;
> release = frags->page_offset + frag_info->frag_size > PAGE_SIZE;
> 
> So the @release condition is really to have enough space from frags-
> >page_offset
> to hold up to frag_info->frag_size bytes.
> 
> (NIC wont push more than frag_info->frag_size bytes, regardless of
> how big is out frag_stride )
> 

yes, but if frag_size > mtu, we might have a problem.

> Your patch would prevent a page being reused if say the amount of
> available bytes is 1536,
> but frag_stride = 2048
> 

Interestingly for this exact frag_stride we don't have an issue :)
since it goes through a different condition branch
(the page flipping thing):

if (frag_info->frag_stride == PAGE_SIZE / 2) {
    frags->page_offset ^= PAGE_SIZE / 2;
	release = page_count(page) != 1 ||
		  page_is_pfmemalloc(page) ||
		  page_to_nid(page) != numa_mem_id();

but yes there is a small price to pay for when 1536 < mtu < 2k

but for other cases there is no change in behavior.

overall i am not against your suggestion i am just laying off the
problems with the two suggestions.

> 
> I would suggest a patch like the following instead.
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> index
> 5edd0cf2999cbde37b3404aafd700584696af940..83d6b17b102bc9a22bfd8e68863
> d079f38670501 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> @@ -1082,7 +1082,7 @@ void mlx4_en_calc_rx_buf(struct net_device
> *dev)
>          * This only works when num_frags == 1.
>          */
>         if (priv->tx_ring_num[TX_XDP]) {
> -               priv->frag_info[0].frag_size = eff_mtu;
> +               priv->frag_info[0].frag_size = PAGE_SIZE;
>                 /* This will gain efficient xdp frame recycling at
> the
>                  * expense of more costly truesize accounting
>                  */
> 

^ permalink raw reply

* [PATCH] rds: avoid unenecessary cong_update in loop transport
From: Santosh Shilimkar @ 2018-06-14 18:52 UTC (permalink / raw)
  To: netdev, davem; +Cc: Santosh Shilimkar

Loop transport which is self loopback, remote port congestion
update isn't relevant. Infact the xmit path already ignores it.
Receive path needs to do the same.

Reported-by: syzbot+4c20b3866171ce8441d2@syzkaller.appspotmail.com
Reviewed-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
---
 net/rds/loop.c | 1 +
 net/rds/rds.h  | 5 +++++
 net/rds/recv.c | 5 +++++
 3 files changed, 11 insertions(+)

diff --git a/net/rds/loop.c b/net/rds/loop.c
index f2bf78d..dac6218 100644
--- a/net/rds/loop.c
+++ b/net/rds/loop.c
@@ -193,4 +193,5 @@ struct rds_transport rds_loop_transport = {
 	.inc_copy_to_user	= rds_message_inc_copy_to_user,
 	.inc_free		= rds_loop_inc_free,
 	.t_name			= "loopback",
+	.t_type			= RDS_TRANS_LOOP,
 };
diff --git a/net/rds/rds.h b/net/rds/rds.h
index b04c333..f2272fb 100644
--- a/net/rds/rds.h
+++ b/net/rds/rds.h
@@ -479,6 +479,11 @@ struct rds_notifier {
 	int			n_status;
 };
 
+/* Available as part of RDS core, so doesn't need to participate
+ * in get_preferred transport etc
+ */
+#define	RDS_TRANS_LOOP	3
+
 /**
  * struct rds_transport -  transport specific behavioural hooks
  *
diff --git a/net/rds/recv.c b/net/rds/recv.c
index dc67458..192ac6f 100644
--- a/net/rds/recv.c
+++ b/net/rds/recv.c
@@ -103,6 +103,11 @@ static void rds_recv_rcvbuf_delta(struct rds_sock *rs, struct sock *sk,
 		rds_stats_add(s_recv_bytes_added_to_socket, delta);
 	else
 		rds_stats_add(s_recv_bytes_removed_from_socket, -delta);
+
+	/* loop transport doesn't send/recv congestion updates */
+	if (rs->rs_transport->t_type == RDS_TRANS_LOOP)
+		return;
+
 	now_congested = rs->rs_rcv_bytes > rds_sk_rcvbuf(rs);
 
 	rdsdebug("rs %p (%pI4:%u) recv bytes %d buf %d "
-- 
1.9.1

^ permalink raw reply related

* Re: FW: [PATCH 2/2] ath10k: allow ATH10K_SNOC with COMPILE_TEST
From: Niklas Cassel @ 2018-06-14 18:42 UTC (permalink / raw)
  To: Kalle Valo
  Cc: Govind Singh, bjorn.andersson, davem, netdev, linux-wireless,
	linux-kernel, ath10k
In-Reply-To: <878t7ha29r.fsf@kamboji.qca.qualcomm.com>

On Thu, Jun 14, 2018 at 05:09:04PM +0300, Kalle Valo wrote:
> Niklas Cassel <niklas.cassel@linaro.org> writes:
> 
> > On Tue, Jun 12, 2018 at 02:44:03PM +0200, Niklas Cassel wrote:
> >> On Tue, Jun 12, 2018 at 06:02:48PM +0530, Govind Singh wrote:
> >> > On 2018-06-12 17:45, Govind Singh wrote:
> >> > > 
> >> > > ATH10K_SNOC builds just fine with COMPILE_TEST, so make that possible.
> >> > > 
> >> > > Signed-off-by: Niklas Cassel <niklas.cassel@linaro.org>
> >> > > ---
> >> > >  drivers/net/wireless/ath/ath10k/Kconfig | 3 ++-
> >> > >  1 file changed, 2 insertions(+), 1 deletion(-)
> >> > > 
> >> > > diff --git a/drivers/net/wireless/ath/ath10k/Kconfig
> >> > > b/drivers/net/wireless/ath/ath10k/Kconfig
> >> > > index 54ff5930126c..6572a43590a8 100644
> >> > > --- a/drivers/net/wireless/ath/ath10k/Kconfig
> >> > > +++ b/drivers/net/wireless/ath/ath10k/Kconfig
> >> > > @@ -42,7 +42,8 @@ config ATH10K_USB
> >> > > 
> >> > >  config ATH10K_SNOC
> >> > >  	tristate "Qualcomm ath10k SNOC support (EXPERIMENTAL)"
> >> > > -	depends on ATH10K && ARCH_QCOM
> >> > > +	depends on ATH10K
> >> > > +	depends on ARCH_QCOM || COMPILE_TEST
> >> > >  	---help---
> >> > >  	  This module adds support for integrated WCN3990 chip connected
> >> > >  	  to system NOC(SNOC). Currently work in progress and will not
> >> > 
> >> > Thanks Niklas for enabling COMPILE_TEST. With QMI set of
> >> > changes(https://patchwork.kernel.org/patch/10448183/), we need to enable
> >> > COMPILE_TEST for
> >> > QCOM_SCM/QMI_HELPERS which seems broken today. Are you planning to fix the
> >> > same.
> >
> > This patch is good as is.
> >
> > However, Govind's QMI patch set together with this patch
> > resulted in build errors.
> >
> > FTR, these are fixed by:
> > https://marc.info/?l=linux-kernel&m=152880985402356
> > https://marc.info/?l=linux-kernel&m=152889452326350
> 
> So the problem is that if I apply this patch I can't apply Govind's QMI
> patchset (due to the build problems) until Niklas' fixes to qcom and
> rpmsg subsystems propogate back to my tree and that might take weeks, or
> even months. But I really would like to apply the QMI patchset ASAP so
> that we can complete the wcn3990 support and not unnecessarily delay it.
> 
> So what I propose is that I put this patch 2 as 'Awaiting Upstream' in
> patchwork and apply it once Niklas' patches get to my tree. Does that
> sound good?

Absolutely.

I didn't realize this until after sending the patch.


Kind regards,
Niklas

^ permalink raw reply

* [PATCH] bpf: attach type BPF_LIRC_MODE2 should not depend on CONFIG_CGROUP_BPF
From: Sean Young @ 2018-06-14 18:42 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Y Song, Matthias Reichl, linux-media, LKML, Alexei Starovoitov,
	Mauro Carvalho Chehab, netdev, Devin Heitmueller, Quentin Monnet
In-Reply-To: <34406f72-722d-9c23-327f-b7c5d7a3090c@iogearbox.net>

If the kernel is compiled with CONFIG_CGROUP_BPF not enabled, it is not
possible to attach, detach or query IR BPF programs to /dev/lircN devices,
making them impossible to use. For embedded devices, it should be possible
to use IR decoding without cgroups or CONFIG_CGROUP_BPF enabled.

This change requires some refactoring, since bpf_prog_{attach,detach,query}
functions are now always compiled, but their code paths for cgroups need
moving out. Rather than a #ifdef CONFIG_CGROUP_BPF in kernel/bpf/syscall.c,
moving them to kernel/bpf/cgroup.c does not require #ifdefs since that file
is already conditionally compiled.

Signed-off-by: Sean Young <sean@mess.org>
---
 include/linux/bpf-cgroup.h |  31 +++++++++++
 kernel/bpf/cgroup.c        | 110 +++++++++++++++++++++++++++++++++++++
 kernel/bpf/syscall.c       | 105 ++---------------------------------
 3 files changed, 145 insertions(+), 101 deletions(-)

diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h
index 975fb4cf1bb7..ee67cd35f426 100644
--- a/include/linux/bpf-cgroup.h
+++ b/include/linux/bpf-cgroup.h
@@ -188,12 +188,43 @@ int __cgroup_bpf_check_dev_permission(short dev_type, u32 major, u32 minor,
 									      \
 	__ret;								      \
 })
+int sockmap_get_from_fd(const union bpf_attr *attr, int type, bool attach);
+int cgroup_bpf_prog_attach(const union bpf_attr *attr,
+			   enum bpf_prog_type ptype);
+int cgroup_bpf_prog_detach(const union bpf_attr *attr,
+			   enum bpf_prog_type ptype);
+int cgroup_bpf_prog_query(const union bpf_attr *attr,
+			  union bpf_attr __user *uattr);
 #else
 
 struct cgroup_bpf {};
 static inline void cgroup_bpf_put(struct cgroup *cgrp) {}
 static inline int cgroup_bpf_inherit(struct cgroup *cgrp) { return 0; }
 
+static inline int sockmap_get_from_fd(const union bpf_attr *attr,
+				      int type, bool attach)
+{
+	return -EINVAL;
+}
+
+static inline int cgroup_bpf_prog_attach(const union bpf_attr *attr,
+					 enum bpf_prog_type ptype)
+{
+	return -EINVAL;
+}
+
+static inline int cgroup_bpf_prog_detach(const union bpf_attr *attr,
+					 enum bpf_prog_type ptype)
+{
+	return -EINVAL;
+}
+
+static inline int cgroup_bpf_prog_query(const union bpf_attr *attr,
+					union bpf_attr __user *uattr)
+{
+	return -EINVAL;
+}
+
 #define cgroup_bpf_enabled (0)
 #define BPF_CGROUP_PRE_CONNECT_ENABLED(sk) (0)
 #define BPF_CGROUP_RUN_PROG_INET_INGRESS(sk,skb) ({ 0; })
diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
index f7c00bd6f8e4..d6e18f9dc0c4 100644
--- a/kernel/bpf/cgroup.c
+++ b/kernel/bpf/cgroup.c
@@ -428,6 +428,116 @@ int __cgroup_bpf_query(struct cgroup *cgrp, const union bpf_attr *attr,
 	return ret;
 }
 
+int bpf_prog_attach_check_attach_type(const struct bpf_prog *prog,
+				      enum bpf_attach_type attach_type)
+{
+	switch (prog->type) {
+	case BPF_PROG_TYPE_CGROUP_SOCK:
+	case BPF_PROG_TYPE_CGROUP_SOCK_ADDR:
+		return attach_type == prog->expected_attach_type ? 0 : -EINVAL;
+	default:
+		return 0;
+	}
+}
+
+int sockmap_get_from_fd(const union bpf_attr *attr, int type, bool attach)
+{
+	struct bpf_prog *prog = NULL;
+	int ufd = attr->target_fd;
+	struct bpf_map *map;
+	struct fd f;
+	int err;
+
+	f = fdget(ufd);
+	map = __bpf_map_get(f);
+	if (IS_ERR(map))
+		return PTR_ERR(map);
+
+	if (attach) {
+		prog = bpf_prog_get_type(attr->attach_bpf_fd, type);
+		if (IS_ERR(prog)) {
+			fdput(f);
+			return PTR_ERR(prog);
+		}
+	}
+
+	err = sock_map_prog(map, prog, attr->attach_type);
+	if (err) {
+		fdput(f);
+		if (prog)
+			bpf_prog_put(prog);
+		return err;
+	}
+
+	fdput(f);
+	return 0;
+}
+
+int cgroup_bpf_prog_attach(const union bpf_attr *attr, enum bpf_prog_type ptype)
+{
+	struct bpf_prog *prog;
+	struct cgroup *cgrp;
+	int ret;
+
+	prog = bpf_prog_get_type(attr->attach_bpf_fd, ptype);
+	if (IS_ERR(prog))
+		return PTR_ERR(prog);
+
+	if (bpf_prog_attach_check_attach_type(prog, attr->attach_type)) {
+		bpf_prog_put(prog);
+		return -EINVAL;
+	}
+
+	cgrp = cgroup_get_from_fd(attr->target_fd);
+	if (IS_ERR(cgrp)) {
+		bpf_prog_put(prog);
+		return PTR_ERR(cgrp);
+	}
+
+	ret = cgroup_bpf_attach(cgrp, prog, attr->attach_type,
+				attr->attach_flags);
+	if (ret)
+		bpf_prog_put(prog);
+	cgroup_put(cgrp);
+
+	return ret;
+}
+
+int cgroup_bpf_prog_detach(const union bpf_attr *attr, enum bpf_prog_type ptype)
+{
+	struct bpf_prog *prog;
+	struct cgroup *cgrp;
+	int ret;
+
+	cgrp = cgroup_get_from_fd(attr->target_fd);
+	if (IS_ERR(cgrp))
+		return PTR_ERR(cgrp);
+
+	prog = bpf_prog_get_type(attr->attach_bpf_fd, ptype);
+	if (IS_ERR(prog))
+		prog = NULL;
+
+	ret = cgroup_bpf_detach(cgrp, prog, attr->attach_type, 0);
+	if (prog)
+		bpf_prog_put(prog);
+	cgroup_put(cgrp);
+	return ret;
+}
+
+int cgroup_bpf_prog_query(const union bpf_attr *attr,
+			  union bpf_attr __user *uattr)
+{
+	struct cgroup *cgrp;
+	int ret;
+
+	cgrp = cgroup_get_from_fd(attr->query.target_fd);
+	if (IS_ERR(cgrp))
+		return PTR_ERR(cgrp);
+	ret = cgroup_bpf_query(cgrp, attr, uattr);
+	cgroup_put(cgrp);
+	return ret;
+}
+
 /**
  * __cgroup_bpf_run_filter_skb() - Run a program for packet filtering
  * @sk: The socket sending or receiving traffic
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 0fa20624707f..52fa44856623 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1489,65 +1489,14 @@ static int bpf_raw_tracepoint_open(const union bpf_attr *attr)
 	return err;
 }
 
-#ifdef CONFIG_CGROUP_BPF
-
-static int bpf_prog_attach_check_attach_type(const struct bpf_prog *prog,
-					     enum bpf_attach_type attach_type)
-{
-	switch (prog->type) {
-	case BPF_PROG_TYPE_CGROUP_SOCK:
-	case BPF_PROG_TYPE_CGROUP_SOCK_ADDR:
-		return attach_type == prog->expected_attach_type ? 0 : -EINVAL;
-	default:
-		return 0;
-	}
-}
-
 #define BPF_PROG_ATTACH_LAST_FIELD attach_flags
 
-static int sockmap_get_from_fd(const union bpf_attr *attr,
-			       int type, bool attach)
-{
-	struct bpf_prog *prog = NULL;
-	int ufd = attr->target_fd;
-	struct bpf_map *map;
-	struct fd f;
-	int err;
-
-	f = fdget(ufd);
-	map = __bpf_map_get(f);
-	if (IS_ERR(map))
-		return PTR_ERR(map);
-
-	if (attach) {
-		prog = bpf_prog_get_type(attr->attach_bpf_fd, type);
-		if (IS_ERR(prog)) {
-			fdput(f);
-			return PTR_ERR(prog);
-		}
-	}
-
-	err = sock_map_prog(map, prog, attr->attach_type);
-	if (err) {
-		fdput(f);
-		if (prog)
-			bpf_prog_put(prog);
-		return err;
-	}
-
-	fdput(f);
-	return 0;
-}
-
 #define BPF_F_ATTACH_MASK \
 	(BPF_F_ALLOW_OVERRIDE | BPF_F_ALLOW_MULTI)
 
 static int bpf_prog_attach(const union bpf_attr *attr)
 {
 	enum bpf_prog_type ptype;
-	struct bpf_prog *prog;
-	struct cgroup *cgrp;
-	int ret;
 
 	if (!capable(CAP_NET_ADMIN))
 		return -EPERM;
@@ -1593,28 +1542,7 @@ static int bpf_prog_attach(const union bpf_attr *attr)
 		return -EINVAL;
 	}
 
-	prog = bpf_prog_get_type(attr->attach_bpf_fd, ptype);
-	if (IS_ERR(prog))
-		return PTR_ERR(prog);
-
-	if (bpf_prog_attach_check_attach_type(prog, attr->attach_type)) {
-		bpf_prog_put(prog);
-		return -EINVAL;
-	}
-
-	cgrp = cgroup_get_from_fd(attr->target_fd);
-	if (IS_ERR(cgrp)) {
-		bpf_prog_put(prog);
-		return PTR_ERR(cgrp);
-	}
-
-	ret = cgroup_bpf_attach(cgrp, prog, attr->attach_type,
-				attr->attach_flags);
-	if (ret)
-		bpf_prog_put(prog);
-	cgroup_put(cgrp);
-
-	return ret;
+	return cgroup_bpf_prog_attach(attr, ptype);
 }
 
 #define BPF_PROG_DETACH_LAST_FIELD attach_type
@@ -1622,9 +1550,6 @@ static int bpf_prog_attach(const union bpf_attr *attr)
 static int bpf_prog_detach(const union bpf_attr *attr)
 {
 	enum bpf_prog_type ptype;
-	struct bpf_prog *prog;
-	struct cgroup *cgrp;
-	int ret;
 
 	if (!capable(CAP_NET_ADMIN))
 		return -EPERM;
@@ -1667,19 +1592,7 @@ static int bpf_prog_detach(const union bpf_attr *attr)
 		return -EINVAL;
 	}
 
-	cgrp = cgroup_get_from_fd(attr->target_fd);
-	if (IS_ERR(cgrp))
-		return PTR_ERR(cgrp);
-
-	prog = bpf_prog_get_type(attr->attach_bpf_fd, ptype);
-	if (IS_ERR(prog))
-		prog = NULL;
-
-	ret = cgroup_bpf_detach(cgrp, prog, attr->attach_type, 0);
-	if (prog)
-		bpf_prog_put(prog);
-	cgroup_put(cgrp);
-	return ret;
+	return cgroup_bpf_prog_detach(attr, ptype);
 }
 
 #define BPF_PROG_QUERY_LAST_FIELD query.prog_cnt
@@ -1687,9 +1600,6 @@ static int bpf_prog_detach(const union bpf_attr *attr)
 static int bpf_prog_query(const union bpf_attr *attr,
 			  union bpf_attr __user *uattr)
 {
-	struct cgroup *cgrp;
-	int ret;
-
 	if (!capable(CAP_NET_ADMIN))
 		return -EPERM;
 	if (CHECK_ATTR(BPF_PROG_QUERY))
@@ -1717,14 +1627,9 @@ static int bpf_prog_query(const union bpf_attr *attr,
 	default:
 		return -EINVAL;
 	}
-	cgrp = cgroup_get_from_fd(attr->query.target_fd);
-	if (IS_ERR(cgrp))
-		return PTR_ERR(cgrp);
-	ret = cgroup_bpf_query(cgrp, attr, uattr);
-	cgroup_put(cgrp);
-	return ret;
+
+	return cgroup_bpf_prog_query(attr, uattr);
 }
-#endif /* CONFIG_CGROUP_BPF */
 
 #define BPF_PROG_TEST_RUN_LAST_FIELD test.duration
 
@@ -2371,7 +2276,6 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, siz
 	case BPF_OBJ_GET:
 		err = bpf_obj_get(&attr);
 		break;
-#ifdef CONFIG_CGROUP_BPF
 	case BPF_PROG_ATTACH:
 		err = bpf_prog_attach(&attr);
 		break;
@@ -2381,7 +2285,6 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, siz
 	case BPF_PROG_QUERY:
 		err = bpf_prog_query(&attr, uattr);
 		break;
-#endif
 	case BPF_PROG_TEST_RUN:
 		err = bpf_prog_test_run(&attr, uattr);
 		break;
-- 
2.17.1

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox