Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH net-next,RFC 00/13] New fast forwarding path
From: Tom Herbert @ 2018-06-14 20:52 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: netfilter-devel, Linux Kernel Network Developers,
	Steffen Klassert
In-Reply-To: <20180614141947.3580-1-pablo@netfilter.org>

On Thu, Jun 14, 2018 at 7:19 AM, Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> Hi,
>
> This patchset proposes a new fast forwarding path infrastructure that
> combines the GRO/GSO and the flowtable infrastructures. The idea is to
> add a hook at the GRO layer that is invoked before the standard GRO
> protocol offloads. This allows us to build custom packet chains that we
> can quickly pass in one go to the neighbour layer to define fast
> forwarding path for flows.
>
> For each packet that gets into the GRO layer, we first check if there is
> an entry in the flowtable, if so, the packet is placed in a list until
> the GRO infrastructure decides to send the batch from gro_complete to
> the neighbour layer. The first packet in the list takes the route from
> the flowtable entry, so we avoid reiterative routing lookups.
>
> In case no entry is found in the flowtable, the packet is passed up to
> the classic GRO offload handlers. Thus, this packet follows the standard
> forwarding path. Note that the initial packets of the flow always go
> through the standard IPv4/IPv6 netfilter forward hook, that is used to
> configure what flows are placed in the flowtable. Therefore, only a few
> (initial) packets follow the standard forwarding path while most of the
> follow up packets take this new fast forwarding path.
>

IIRC, there was a similar proposal a while back that want to bundle
packets of the same flow together (without doing GRO) so that they
could be processed by various functions by looking at just one
representative packet in the group. The concept had some promise, but
in the end it created quite a bit of complexity since at some point
the packet bundle needed to be undone to go back to processing the
individual packets.

Tom

> The fast forwarding path is enabled through explicit user policy, so the
> user needs to request this behaviour from control plane, the following
> example shows how to place flows in the new fast forwarding path from
> the netfilter forward chain:
>
>  table x {
>         flowtable f {
>                 hook early_ingress priority 0; devices = { eth0, eth1 }
>         }
>
>         chain y {
>                 type filter hook forward priority 0;
>                 ip protocol tcp flow offload @f
>         }
>  }
>
> The example above defines a fastpath for TCP flows that are placed in
> the flowtable 'f', this flowtable is hooked at the new early_ingress
> hook. The initial TCP packets that match this rule from the standard
> fowarding path create an entry in the flowtable, thus, GRO creates chain
> of packets for those that find an entry in the flowtable and send
> them through the neighbour layer.
>
> This new hook is happening before the ingress taps, therefore, packets
> that follow this new fast forwarding path are not shown by tcpdump.
>
> This patchset supports both layer 3 IPv4 and IPv6, and layer 4 TCP and
> UDP protocols. This fastpath also integrates with the IPSec
> infrastructure and the ESP protocol.
>
> We have collected performance numbers:
>
>         TCP TSO         TCP Fast Forward
>         32.5 Gbps       35.6 Gbps
>
>         UDP             UDP Fast Forward
>         17.6 Gbps       35.6 Gbps
>
>         ESP             ESP Fast Forward
>         6 Gbps          7.5 Gbps
>
> For UDP, this is doubling performance, and we almost achieve line rate
> with one single CPU using the Intel i40e NIC. We got similar numbers
> with the Mellanox ConnectX-4. For TCP, this is slightly improving things
> even if TSO is being defeated given that we need to segment the packet
> chain in software. We would like to explore HW GRO support with hardware
> vendors with this new mode, we think that should improve the TCP numbers
> we are showing above even more. For ESP traffic, performance improvement
> is ~25%, in this case, perf shows the bottleneck becomes the crypto layer.
>
> This patchset is co-authored work with Steffen Klassert.
>
> Comments are welcome, thanks.
>
>
> Pablo Neira Ayuso (6):
>   netfilter: nft_chain_filter: add support for early ingress
>   netfilter: nf_flow_table: add hooknum to flowtable type
>   netfilter: nf_flow_table: add flowtable for early ingress hook
>   netfilter: nft_flow_offload: enable offload after second packet is seen
>   netfilter: nft_flow_offload: remove secpath check
>   netfilter: nft_flow_offload: make sure route is not stale
>
> Steffen Klassert (7):
>   net: Add a helper to get the packet offload callbacks by priority.
>   net: Change priority of ipv4 and ipv6 packet offloads.
>   net: Add a GSO feature bit for the netfilter forward fastpath.
>   net: Use one bit of NAPI_GRO_CB for the netfilter fastpath.
>   netfilter: add early ingress hook for IPv4
>   netfilter: add early ingress support for IPv6
>   netfilter: add ESP support for early ingress
>
>  include/linux/netdev_features.h         |   4 +-
>  include/linux/netdevice.h               |   6 +-
>  include/linux/netfilter.h               |   6 +
>  include/linux/netfilter_ingress.h       |   1 +
>  include/linux/skbuff.h                  |   2 +
>  include/net/netfilter/early_ingress.h   |  24 +++
>  include/net/netfilter/nf_flow_table.h   |   4 +
>  include/uapi/linux/netfilter.h          |   1 +
>  net/core/dev.c                          |  50 ++++-
>  net/ipv4/af_inet.c                      |   1 +
>  net/ipv4/netfilter/Makefile             |   1 +
>  net/ipv4/netfilter/early_ingress.c      | 327 +++++++++++++++++++++++++++++
>  net/ipv4/netfilter/nf_flow_table_ipv4.c |  12 ++
>  net/ipv6/ip6_offload.c                  |   1 +
>  net/ipv6/netfilter/Makefile             |   1 +
>  net/ipv6/netfilter/early_ingress.c      | 315 ++++++++++++++++++++++++++++
>  net/ipv6/netfilter/nf_flow_table_ipv6.c |   1 +
>  net/netfilter/Kconfig                   |   8 +
>  net/netfilter/Makefile                  |   1 +
>  net/netfilter/core.c                    |  35 +++-
>  net/netfilter/early_ingress.c           | 361 ++++++++++++++++++++++++++++++++
>  net/netfilter/nf_flow_table_inet.c      |   1 +
>  net/netfilter/nf_flow_table_ip.c        |  72 +++++++
>  net/netfilter/nf_tables_api.c           | 120 ++++++-----
>  net/netfilter/nft_chain_filter.c        |   6 +-
>  net/netfilter/nft_flow_offload.c        |  13 +-
>  net/xfrm/xfrm_output.c                  |   4 +
>  27 files changed, 1297 insertions(+), 81 deletions(-)
>  create mode 100644 include/net/netfilter/early_ingress.h
>  create mode 100644 net/ipv4/netfilter/early_ingress.c
>  create mode 100644 net/ipv6/netfilter/early_ingress.c
>  create mode 100644 net/netfilter/early_ingress.c
>
> --
> 2.11.0
>
>

^ permalink raw reply

* Re: [net RFC] net/mlx4_en: Use frag stride in crossing page boundary condition
From: Saeed Mahameed @ 2018-06-14 20:53 UTC (permalink / raw)
  To: eric.dumazet@gmail.com, kafai@fb.com, Tariq Toukan
  Cc: netdev@vger.kernel.org, edumazet@google.com
In-Reply-To: <59c3b776cc0c683fff8090195fd71d7f22305744.camel@mellanox.com>

On Thu, 2018-06-14 at 13:47 -0700, saeedm@mellanox.com wrote:
> On Thu, 2018-06-14 at 12:12 -0700, Eric Dumazet wrote:
> > 
> > On 06/14/2018 11:56 AM, Saeed Mahameed wrote:
> > 
> > > Interestingly for this exact frag_stride we don't have an issue
> > > :)
> > > since it goes through a different condition branch
> > > (the page flipping thing):
> > > 
> > > if (frag_info->frag_stride == PAGE_SIZE / 2) {
> > >     frags->page_offset ^= PAGE_SIZE / 2;
> > > 	release = page_count(page) != 1 ||
> > > 		  page_is_pfmemalloc(page) ||
> > > 		  page_to_nid(page) != numa_mem_id();
> > > 
> > 
> > I guess you forgot to test on PowerPC where PAGE_SIZE=65536 ?
> > 
> > On PowerPC, the first branch is never taken.
> > 

Oh, sorry i see where you are going with this, you mean for xdp in the
other branch we need to take care of PAGE_SIZE > 4k .. 

You are totally right, good catch! I will have to figure this out :) ..

> 
> This code is already in the driver, i guess as optimization for when
> a
> page can be reused only twice (PAGE_SIZE=4K, strides_size = 2k).
> frag_stride is never more than 2k.
> 
> for power PC we should always take the other branch which will reuse
> as
> much 2k strides as possible in a page regardless of PAGE_SIZE.
> 
> so are we good with my change ?
> 
> 

^ permalink raw reply

* Re: [net RFC] net/mlx4_en: Use frag stride in crossing page boundary condition
From: Saeed Mahameed @ 2018-06-14 21:04 UTC (permalink / raw)
  To: eric.dumazet@gmail.com, kafai@fb.com, Tariq Toukan
  Cc: netdev@vger.kernel.org, edumazet@google.com
In-Reply-To: <9146b0a3d34388006d456135c9fe4f618260e63b.camel@mellanox.com>

On Thu, 2018-06-14 at 20:53 +0000, Saeed Mahameed wrote:
> On Thu, 2018-06-14 at 13:47 -0700, saeedm@mellanox.com wrote:
> > On Thu, 2018-06-14 at 12:12 -0700, Eric Dumazet wrote:
> > > 
> > > On 06/14/2018 11:56 AM, Saeed Mahameed wrote:
> > > 
> > > > Interestingly for this exact frag_stride we don't have an issue
> > > > :)
> > > > since it goes through a different condition branch
> > > > (the page flipping thing):
> > > > 
> > > > if (frag_info->frag_stride == PAGE_SIZE / 2) {
> > > >     frags->page_offset ^= PAGE_SIZE / 2;
> > > > 	release = page_count(page) != 1 ||
> > > > 		  page_is_pfmemalloc(page) ||
> > > > 		  page_to_nid(page) != numa_mem_id();
> > > > 
> > > 
> > > I guess you forgot to test on PowerPC where PAGE_SIZE=65536 ?
> > > 
> > > On PowerPC, the first branch is never taken.
> > > 
> 
> Oh, sorry i see where you are going with this, you mean for xdp in
> the
> other branch we need to take care of PAGE_SIZE > 4k .. 
> 

I was looking at the code without my fix :)

with my fix:
release = frags->page_offset + frag_info->frag_stride > PAGE_SIZE;

for XDP: frag_info->frag_stride is PAGE_SIZE, so release will always be
true regardless of PAGE_SIZE.

So i guess i didn't quite understand your PowerPC concern.. can you
elaborate ?

> You are totally right, good catch! I will have to figure this out :)
> ..
> 
> > 
> > This code is already in the driver, i guess as optimization for
> > when
> > a
> > page can be reused only twice (PAGE_SIZE=4K, strides_size = 2k).
> > frag_stride is never more than 2k.
> > 
> > for power PC we should always take the other branch which will
> > reuse
> > as
> > much 2k strides as possible in a page regardless of PAGE_SIZE.
> > 
> > so are we good with my change ?
> > 

^ permalink raw reply

* [PATCH net-next] hv_netvsc: Fix the variable sizes in ipsecv2 and rsc offload
From: Haiyang Zhang @ 2018-06-14 21:39 UTC (permalink / raw)
  To: davem, netdev; +Cc: olaf, sthemmin, haiyangz, linux-kernel, devel, vkuznets

From: Haiyang Zhang <haiyangz@microsoft.com>

These fields in struct ndis_ipsecv2_offload and struct ndis_rsc_offload
are one byte according to the specs. This patch defines them with the
right size.

Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
---
 drivers/net/hyperv/hyperv_net.h | 26 +++++++++++++-------------
 1 file changed, 13 insertions(+), 13 deletions(-)

diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h
index 23304aca25f9..84eaaa6368a2 100644
--- a/drivers/net/hyperv/hyperv_net.h
+++ b/drivers/net/hyperv/hyperv_net.h
@@ -1277,17 +1277,17 @@ struct ndis_lsov2_offload {
 
 struct ndis_ipsecv2_offload {
 	u32	encap;
-	u16	ip6;
-	u16	ip4opt;
-	u16	ip6ext;
-	u16	ah;
-	u16	esp;
-	u16	ah_esp;
-	u16	xport;
-	u16	tun;
-	u16	xport_tun;
-	u16	lso;
-	u16	extseq;
+	u8	ip6;
+	u8	ip4opt;
+	u8	ip6ext;
+	u8	ah;
+	u8	esp;
+	u8	ah_esp;
+	u8	xport;
+	u8	tun;
+	u8	xport_tun;
+	u8	lso;
+	u8	extseq;
 	u32	udp_esp;
 	u32	auth;
 	u32	crypto;
@@ -1295,8 +1295,8 @@ struct ndis_ipsecv2_offload {
 };
 
 struct ndis_rsc_offload {
-	u16	ip4;
-	u16	ip6;
+	u8	ip4;
+	u8	ip6;
 };
 
 struct ndis_encap_offload {
-- 
2.17.1

^ permalink raw reply related

* Re: [PATCH bpf-net] selftests/bpf: delete xfrm tunnel when test exits.
From: Martin KaFai Lau @ 2018-06-14 22:30 UTC (permalink / raw)
  To: William Tu; +Cc: netdev, anders.roxell
In-Reply-To: <1528977666-26477-1-git-send-email-u9012063@gmail.com>

On Thu, Jun 14, 2018 at 05:01:06AM -0700, William Tu wrote:
> Make the printting of bpf xfrm tunnel better and
> cleanup xfrm state and policy when xfrm test finishes.
LGTM.  The subject tag actually meant s/bpf-net/bpf-next/?

It makes sense to be in bpf-next but I think bpf-next is still closed.
Please repost later.

> 
> Signed-off-by: William Tu <u9012063@gmail.com>
> ---
>  tools/testing/selftests/bpf/test_tunnel.sh | 24 +++++++++++++-----------
>  1 file changed, 13 insertions(+), 11 deletions(-)
> 
> diff --git a/tools/testing/selftests/bpf/test_tunnel.sh b/tools/testing/selftests/bpf/test_tunnel.sh
> index aeb2901f21f4..7b1946b340be 100755
> --- a/tools/testing/selftests/bpf/test_tunnel.sh
> +++ b/tools/testing/selftests/bpf/test_tunnel.sh
> @@ -608,28 +608,26 @@ setup_xfrm_tunnel()
>  test_xfrm_tunnel()
>  {
>  	config_device
> -        #tcpdump -nei veth1 ip &
> -	output=$(mktemp)
> -	cat /sys/kernel/debug/tracing/trace_pipe | tee $output &
> -        setup_xfrm_tunnel
> +	> /sys/kernel/debug/tracing/trace
> +	setup_xfrm_tunnel
>  	tc qdisc add dev veth1 clsact
>  	tc filter add dev veth1 proto ip ingress bpf da obj test_tunnel_kern.o \
>  		sec xfrm_get_state
>  	ip netns exec at_ns0 ping $PING_ARG 10.1.1.200
>  	sleep 1
> -	grep "reqid 1" $output
> +	grep "reqid 1" /sys/kernel/debug/tracing/trace
>  	check_err $?
> -	grep "spi 0x1" $output
> +	grep "spi 0x1" /sys/kernel/debug/tracing/trace
>  	check_err $?
> -	grep "remote ip 0xac100164" $output
> +	grep "remote ip 0xac100164" /sys/kernel/debug/tracing/trace
>  	check_err $?
>  	cleanup
>  
>  	if [ $ret -ne 0 ]; then
> -                echo -e ${RED}"FAIL: xfrm tunnel"${NC}
> -                return 1
> -        fi
> -        echo -e ${GREEN}"PASS: xfrm tunnel"${NC}
> +		echo -e ${RED}"FAIL: xfrm tunnel"${NC}
> +		return 1
> +	fi
> +	echo -e ${GREEN}"PASS: xfrm tunnel"${NC}
>  }
>  
>  attach_bpf()
> @@ -657,6 +655,10 @@ cleanup()
>  	ip link del ip6geneve11 2> /dev/null
>  	ip link del erspan11 2> /dev/null
>  	ip link del ip6erspan11 2> /dev/null
> +	ip xfrm policy delete dir out src 10.1.1.200/32 dst 10.1.1.100/32 2> /dev/null
> +	ip xfrm policy delete dir in src 10.1.1.100/32 dst 10.1.1.200/32 2> /dev/null
> +	ip xfrm state delete src 172.16.1.100 dst 172.16.1.200 proto esp spi 0x1 2> /dev/null
> +	ip xfrm state delete src 172.16.1.200 dst 172.16.1.100 proto esp spi 0x2 2> /dev/null
>  }
>  
>  cleanup_exit()
> -- 
> 2.7.4
> 

^ permalink raw reply

* Re: [net RFC] net/mlx4_en: Use frag stride in crossing page boundary condition
From: Eric Dumazet @ 2018-06-14 23:49 UTC (permalink / raw)
  To: Saeed Mahameed, eric.dumazet@gmail.com, kafai@fb.com,
	Tariq Toukan
  Cc: netdev@vger.kernel.org, edumazet@google.com
In-Reply-To: <77a87572d82deef1e035ec3b36027e8f6e0c1ea2.camel@mellanox.com>



On 06/14/2018 02:04 PM, Saeed Mahameed wrote:

> I was looking at the code without my fix :)
> 
> with my fix:
> release = frags->page_offset + frag_info->frag_stride > PAGE_SIZE;
> 
> for XDP: frag_info->frag_stride is PAGE_SIZE, so release will always be
> true regardless of PAGE_SIZE.
> 
> So i guess i didn't quite understand your PowerPC concern.. can you
> elaborate ?
> 

So your maths with PAGE_SIZE=65536 and MTU 9000

frag_stride is about 9344

So if the last chunk of the page has 9100 bytes, we wont be able to use it, while really we should be able to use it.

^ permalink raw reply

* Re: [bpf PATCH v2 1/6] bpf: sockmap, fix crash when ipv6 sock is added
From: Martin KaFai Lau @ 2018-06-14 23:53 UTC (permalink / raw)
  To: John Fastabend; +Cc: ast, daniel, netdev
In-Reply-To: <20180614164446.24994.22118.stgit@john-Precision-Tower-5810>

On Thu, Jun 14, 2018 at 09:44:46AM -0700, John Fastabend wrote:
> This fixes a crash where we assign tcp_prot to IPv6 sockets instead
> of tcpv6_prot.
> 
> Previously we overwrote the sk->prot field with tcp_prot even in the
> AF_INET6 case. This patch ensures the correct tcp_prot and tcpv6_prot
> are used.

> Further, only allow ESTABLISHED connections to join the
> map per note in TLS ULP,
> 
>    /* The TLS ulp is currently supported only for TCP sockets
>     * in ESTABLISHED state.
>     * Supporting sockets in LISTEN state will require us
>     * to modify the accept implementation to clone rather then
>     * share the ulp context.
>     */
This bit has been moved to patch 2.

> 
> Also tested with 'netserver -6' and 'netperf -H [IPv6]' as well as
> 'netperf -H [IPv4]'. The ESTABLISHED check resolves the previously
> crashing case here.
> 
> Fixes: 174a79ff9515 ("bpf: sockmap with sk redirect support")
> Reported-by: syzbot+5c063698bdbfac19f363@syzkaller.appspotmail.com
> Signed-off-by: John Fastabend <john.fastabend@gmail.com>
> Signed-off-by: Wei Wang <weiwan@google.com>
> ---
>  0 files changed
> 
> diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
> index 52a91d8..f6dd4cd 100644
> --- a/kernel/bpf/sockmap.c
> +++ b/kernel/bpf/sockmap.c
> @@ -140,6 +140,7 @@ static int bpf_tcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
>  static int bpf_tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size);
>  static int bpf_tcp_sendpage(struct sock *sk, struct page *page,
>  			    int offset, size_t size, int flags);
> +static void bpf_tcp_close(struct sock *sk, long timeout);
>  
>  static inline struct smap_psock *smap_psock_sk(const struct sock *sk)
>  {
> @@ -161,7 +162,42 @@ static bool bpf_tcp_stream_read(const struct sock *sk)
>  	return !empty;
>  }
>  
> -static struct proto tcp_bpf_proto;
> +enum {
> +	SOCKMAP_IPV4,
> +	SOCKMAP_IPV6,
> +	SOCKMAP_NUM_PROTS,
> +};
> +
> +enum {
> +	SOCKMAP_BASE,
> +	SOCKMAP_TX,
> +	SOCKMAP_NUM_CONFIGS,
> +};
> +
> +static struct proto *saved_tcpv6_prot;
__read_mostly

> +static DEFINE_MUTEX(tcpv6_prot_mutex);
> +static struct proto bpf_tcp_prots[SOCKMAP_NUM_PROTS][SOCKMAP_NUM_CONFIGS];
> +static void build_protos(struct proto prot[SOCKMAP_NUM_CONFIGS],
> +			 struct proto *base)
> +{
> +	prot[SOCKMAP_BASE]			= *base;
> +	prot[SOCKMAP_BASE].close		= bpf_tcp_close;
> +	prot[SOCKMAP_BASE].recvmsg		= bpf_tcp_recvmsg;
> +	prot[SOCKMAP_BASE].stream_memory_read	= bpf_tcp_stream_read;
> +
> +	prot[SOCKMAP_TX]			= prot[SOCKMAP_BASE];
> +	prot[SOCKMAP_TX].sendmsg		= bpf_tcp_sendmsg;
> +	prot[SOCKMAP_TX].sendpage		= bpf_tcp_sendpage;
> +}
> +
> +static void update_sk_prot(struct sock *sk, struct smap_psock *psock)
> +{
> +	int family = sk->sk_family == AF_INET6 ? SOCKMAP_IPV6 : SOCKMAP_IPV4;
> +	int conf = psock->bpf_tx_msg ? SOCKMAP_TX : SOCKMAP_BASE;
> +
> +	sk->sk_prot = &bpf_tcp_prots[family][conf];
> +}
> +
>  static int bpf_tcp_init(struct sock *sk)
>  {
>  	struct smap_psock *psock;
> @@ -181,14 +217,17 @@ static int bpf_tcp_init(struct sock *sk)
>  	psock->save_close = sk->sk_prot->close;
>  	psock->sk_proto = sk->sk_prot;
>  
> -	if (psock->bpf_tx_msg) {
> -		tcp_bpf_proto.sendmsg = bpf_tcp_sendmsg;
> -		tcp_bpf_proto.sendpage = bpf_tcp_sendpage;
> -		tcp_bpf_proto.recvmsg = bpf_tcp_recvmsg;
> -		tcp_bpf_proto.stream_memory_read = bpf_tcp_stream_read;
> +	/* Build IPv6 sockmap whenever the address of tcpv6_prot changes */
> +	if (sk->sk_family == AF_INET6 &&
> +	    unlikely(sk->sk_prot != smp_load_acquire(&saved_tcpv6_prot))) {
> +		mutex_lock(&tcpv6_prot_mutex);
bpf_tcp_init() can be called by skops?
Can mutex_lock() be used here?

> +		if (likely(sk->sk_prot != saved_tcpv6_prot)) {
> +			build_protos(bpf_tcp_prots[SOCKMAP_IPV6], sk->sk_prot);
> +			smp_store_release(&saved_tcpv6_prot, sk->sk_prot);
> +		}
> +		mutex_unlock(&tcpv6_prot_mutex);
>  	}
> -
> -	sk->sk_prot = &tcp_bpf_proto;
> +	update_sk_prot(sk, psock);
>  	rcu_read_unlock();
>  	return 0;
>  }
> @@ -1111,8 +1150,7 @@ static void bpf_tcp_msg_add(struct smap_psock *psock,
>  
>  static int bpf_tcp_ulp_register(void)
>  {
> -	tcp_bpf_proto = tcp_prot;
> -	tcp_bpf_proto.close = bpf_tcp_close;
> +	build_protos(bpf_tcp_prots[SOCKMAP_IPV4], &tcp_prot);
>  	/* Once BPF TX ULP is registered it is never unregistered. It
>  	 * will be in the ULP list for the lifetime of the system. Doing
>  	 * duplicate registers is not a problem.
> 

^ permalink raw reply

* Re: [PATCH net-next,RFC 00/13] New fast forwarding path
From: David Miller @ 2018-06-14 23:55 UTC (permalink / raw)
  To: f.fainelli; +Cc: pablo, netfilter-devel, netdev, steffen.klassert
In-Reply-To: <66634ba7-3d72-644a-9d26-d9644c540619@gmail.com>

From: Florian Fainelli <f.fainelli@gmail.com>
Date: Thu, 14 Jun 2018 11:14:37 -0700

> On those platforms there are a number of things that just literally
> kill the routing performance: small I and D caches, small or not L2,
> limited bandwidth DRAM, huge call depths, big struct sk_buff layout,
> you name it.

Another reason to work on a 64-bit MIPS eBPF JIT.

We have a model, and game plan for this kind of application.  And it's
XDP and eBPF with JITs.

We are fully commited to this approach, and I see anything else that
tries to slip in and approach some sub-part of the problem as a
complete distraction and a step backwards.

All of the effort on this work could have been spent filling in the
missing pieces you mention.

And guess what?  Then millions of possibilities would have been
openned up, rather than just this one special case.

So, I ask, please see the larger picture.

Thank you.

^ permalink raw reply

* Re: [PATCH net-next,RFC 00/13] New fast forwarding path
From: David Miller @ 2018-06-14 23:58 UTC (permalink / raw)
  To: tom; +Cc: pablo, netfilter-devel, netdev, steffen.klassert
In-Reply-To: <CALx6S345B2-sejH52beewdiHuGo5nAY0v0uFmNuiJ0hQ6Wu3xw@mail.gmail.com>

From: Tom Herbert <tom@herbertland.com>
Date: Thu, 14 Jun 2018 13:52:03 -0700

> IIRC, there was a similar proposal a while back that want to bundle
> packets of the same flow together (without doing GRO) so that they
> could be processed by various functions by looking at just one
> representative packet in the group. The concept had some promise, but
> in the end it created quite a bit of complexity since at some point
> the packet bundle needed to be undone to go back to processing the
> individual packets.

You're probably talking about Edward Cree's SKB list stuff, and as
per his presenation at netconf 2 weeks ago he plans to revitalize
it given how Spectre et al. gives cause to reevaluate all bulking
techniques.

^ permalink raw reply

* Re: [PATCH net-next] hv_netvsc: Fix the variable sizes in ipsecv2 and rsc offload
From: David Miller @ 2018-06-15  0:00 UTC (permalink / raw)
  To: haiyangz, haiyangz
  Cc: netdev, kys, sthemmin, olaf, vkuznets, devel, linux-kernel
In-Reply-To: <20180614213912.2889-1-haiyangz@linuxonhyperv.com>


Bug fixes should be targetted at net, not net-next.  Furthermore,
net-next is closed.

^ permalink raw reply

* Re: [PATCH net] net: qcom/emac: Add missing of_node_put()
From: David Miller @ 2018-06-15  0:01 UTC (permalink / raw)
  To: yuehaibing; +Cc: timur, linux-kernel, netdev
In-Reply-To: <20180611130345.15172-1-yuehaibing@huawei.com>

From: YueHaibing <yuehaibing@huawei.com>
Date: Mon, 11 Jun 2018 21:03:45 +0800

> Add missing of_node_put() call for device node returned by
> of_parse_phandle().
> 
> Signed-off-by: YueHaibing <yuehaibing@huawei.com>

Applied, thank you.

^ permalink raw reply

* Re: [PATCH bpf-net] selftests/bpf: delete xfrm tunnel when test exits.
From: Daniel Borkmann @ 2018-06-15  0:05 UTC (permalink / raw)
  To: Martin KaFai Lau, William Tu; +Cc: netdev, anders.roxell
In-Reply-To: <20180614223009.7ohlsxctbglbi24e@kafai-mbp.dhcp.thefacebook.com>

On 06/15/2018 12:30 AM, Martin KaFai Lau wrote:
> On Thu, Jun 14, 2018 at 05:01:06AM -0700, William Tu wrote:
>> Make the printting of bpf xfrm tunnel better and
>> cleanup xfrm state and policy when xfrm test finishes.
> LGTM.  The subject tag actually meant s/bpf-net/bpf-next/?
> 
> It makes sense to be in bpf-next but I think bpf-next is still closed.
> Please repost later.

But given this fixes up missing cleanup of xfrm policy/state that was
added earlier as part of the test, I think bpf would be fine. (Subject
is a bit confusing indeed either bpf resp. net tree or bpf-next was
meant.)

Thanks,
Daniel

^ permalink raw reply

* Re: [PATCH v3] tcp: verify the checksum of the first data segment in a new connection
From: David Miller @ 2018-06-15  0:05 UTC (permalink / raw)
  To: fllinden; +Cc: edumazet, netdev
In-Reply-To: <5b2052b1.eyFJo5VnqmUbz/O6%fllinden@amazon.com>

From: Frank van der Linden <fllinden@amazon.com>
Date: Tue, 12 Jun 2018 23:09:37 +0000

> commit 079096f103fa ("tcp/dccp: install syn_recv requests into ehash
> table") introduced an optimization for the handling of child sockets
> created for a new TCP connection.
> 
> But this optimization passes any data associated with the last ACK of the
> connection handshake up the stack without verifying its checksum, because it
> calls tcp_child_process(), which in turn calls tcp_rcv_state_process()
> directly.  These lower-level processing functions do not do any checksum
> verification.
> 
> Insert a tcp_checksum_complete call in the TCP_NEW_SYN_RECEIVE path to
> fix this.
> 
> Signed-off-by: Frank van der Linden <fllinden@amazon.com>

Applied and queued up for -stable.

I know you mention the bug causing commit in your commit message,
but you should also still provide a proper Fixes: tag.  I took
care of it for you this time.

Thanks.

^ permalink raw reply

* Re: [PATCH v1 net-next] stmmac: fix DMA channel hang in half-duplex mode
From: David Miller @ 2018-06-15  0:06 UTC (permalink / raw)
  To: vbhadram; +Cc: peppe.cavallaro, alexandre.torgue, joabreu, netdev, narayanr
In-Reply-To: <1528864248-2246-1-git-send-email-vbhadram@nvidia.com>

From: Bhadram Varka <vbhadram@nvidia.com>
Date: Wed, 13 Jun 2018 10:00:48 +0530

> HW does not support Half-duplex mode in multi-queue
> scenario. Fix it by not advertising the Half-Duplex
> mode if multi-queue enabled.
> 
> Signed-off-by: Bhadram Varka <vbhadram@nvidia.com>

Bug fixes should be submitted against net, not net-next.  And
net-next is closed for submissions at this time.

Thanks.

^ permalink raw reply

* Re: [PATCH 0/4] emaclite bug fixes and code cleanup
From: David Miller @ 2018-06-15  0:08 UTC (permalink / raw)
  To: radhey.shyam.pandey; +Cc: michal.simek, netdev, linux-arm-kernel, linux-kernel
In-Reply-To: <1528871719-1681-1-git-send-email-radhey.shyam.pandey@xilinx.com>

From: Radhey Shyam Pandey <radhey.shyam.pandey@xilinx.com>
Date: Wed, 13 Jun 2018 12:05:15 +0530

> This patch series fixes bug in emaclite remove and mdio_setup routines.
> It does minor code cleanup.

Series applied.

^ permalink raw reply

* Re: [PATCH net-next] tcp: add SNMP counter for zero-window drops
From: David Miller @ 2018-06-15  0:09 UTC (permalink / raw)
  To: laoar.shao; +Cc: edumazet, netdev, linux-kernel
In-Reply-To: <1528889908-15980-1-git-send-email-laoar.shao@gmail.com>

From: Yafang Shao <laoar.shao@gmail.com>
Date: Wed, 13 Jun 2018 07:38:28 -0400

> It will be helpful if we could display the drops due to zero window or no
> enough window space.
> So a new SNMP MIB entry is added to track this behavior.
> This entry is named LINUX_MIB_TCPZEROWINDOWDROP and published in
> /proc/net/netstat in TcpExt line as TCPZeroWindowDrop.
> 
> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>

net-next is closed, please resubmit this when net-next opens
back up.

Thank you.

^ permalink raw reply

* Re: [PATCH net 0/4] l2tp: pppol2tp_connect() fixes
From: David Miller @ 2018-06-15  0:10 UTC (permalink / raw)
  To: g.nault; +Cc: netdev, jchapman
In-Reply-To: <cover.1528887257.git.g.nault@alphalink.fr>

From: Guillaume Nault <g.nault@alphalink.fr>
Date: Wed, 13 Jun 2018 15:09:16 +0200

> This series fixes a few remaining issues with pppol2tp_connect().
> 
> It doesn't try to prevent invalid configurations that have no effect on
> kernel's reliability. That would be work for a future patch set.
> 
> Patch 2 is the most important as it avoids an invalid pointer
> dereference crashing the kernel. It depends on patch 1 for correctly
> identifying L2TP session types.
> 
> Patches 3 and 4 avoid creating stale tunnels and sessions.

Series applied, thank you.

^ permalink raw reply

* Re: [bpf PATCH v2 2/6] bpf: sockmap only allow ESTABLISHED sock state
From: Martin KaFai Lau @ 2018-06-15  0:18 UTC (permalink / raw)
  To: John Fastabend; +Cc: ast, daniel, netdev
In-Reply-To: <20180614164451.24994.31096.stgit@john-Precision-Tower-5810>

On Thu, Jun 14, 2018 at 09:44:52AM -0700, John Fastabend wrote:
> Per the note in the TLS ULP (which is actually a generic statement
> regarding ULPs)
> 
>  /* The TLS ulp is currently supported only for TCP sockets
>   * in ESTABLISHED state.
>   * Supporting sockets in LISTEN state will require us
>   * to modify the accept implementation to clone rather then
>   * share the ulp context.
>   */
Can you explain how that apply to bpf_tcp ulp?

My understanding is the "ulp context" referred in TLS ulp is
the tls_context stored in icsk_ulp_data but I don't see bpf_tcp's
ulp is using icsk_ulp_data.

Others LGTM.

> 
> After this patch we only allow socks that are in ESTABLISHED state or
> are being added via a sock_ops event that is transitioning into an
> ESTABLISHED state. By allowing sock_ops events we allow users to
> manage sockmaps directly from sock ops programs. The two supported
> sock_ops ops are BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB and
> BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB.
> 
> >From the userspace BPF update API the sock lock is also taken now
> to ensure we don't race with state changes after the ESTABLISHED
> check. The BPF program sock ops hook already has the sock lock
> taken.
> 
> Also tested with 'netserver -6' and 'netperf -H [IPv6]' as well as
> 'netperf -H [IPv4]'.
> 
> Reported-by: Eric Dumazet <edumazet@google.com>
> Signed-off-by: John Fastabend <john.fastabend@gmail.com>
> ---
>  0 files changed
> 
> diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
> index f6dd4cd..f1ab52d 100644
> --- a/kernel/bpf/sockmap.c
> +++ b/kernel/bpf/sockmap.c
> @@ -1976,13 +1976,20 @@ static int sock_map_update_elem(struct bpf_map *map,
>  		return -EINVAL;
>  	}
>  
> +	lock_sock(skops.sk);
> +	/* ULPs are currently supported only for TCP sockets in ESTABLISHED
> +	 * state.
> +	 */
>  	if (skops.sk->sk_type != SOCK_STREAM ||
> -	    skops.sk->sk_protocol != IPPROTO_TCP) {
> -		fput(socket->file);
> -		return -EOPNOTSUPP;
> +	    skops.sk->sk_protocol != IPPROTO_TCP ||
> +	    skops.sk->sk_state != TCP_ESTABLISHED) {
> +		err = -EOPNOTSUPP;
> +		goto out;
>  	}
>  
>  	err = sock_map_ctx_update_elem(&skops, map, key, flags);
> +out:
> +	release_sock(skops.sk);
>  	fput(socket->file);
>  	return err;
>  }
> @@ -2247,10 +2254,6 @@ static int sock_hash_ctx_update_elem(struct bpf_sock_ops_kern *skops,
>  
>  	sock = skops->sk;
>  
> -	if (sock->sk_type != SOCK_STREAM ||
> -	    sock->sk_protocol != IPPROTO_TCP)
> -		return -EOPNOTSUPP;
> -
>  	if (unlikely(map_flags > BPF_EXIST))
>  		return -EINVAL;
>  
> @@ -2338,7 +2341,20 @@ static int sock_hash_update_elem(struct bpf_map *map,
>  		return -EINVAL;
>  	}
>  
> +	lock_sock(skops.sk);
> +	/* ULPs are currently supported only for TCP sockets in ESTABLISHED
> +	 * state.
> +	 */
> +	if (skops.sk->sk_type != SOCK_STREAM ||
> +	    skops.sk->sk_protocol != IPPROTO_TCP ||
> +	    skops.sk->sk_state != TCP_ESTABLISHED) {
> +		err = -EOPNOTSUPP;
> +		goto out;
> +	}
> +
>  	err = sock_hash_ctx_update_elem(&skops, map, key, flags);
> +out:
> +	release_sock(skops.sk);
>  	fput(socket->file);
>  	return err;
>  }
> @@ -2423,10 +2439,19 @@ struct sock  *__sock_hash_lookup_elem(struct bpf_map *map, void *key)
>  	.map_delete_elem = sock_hash_delete_elem,
>  };
>  
> +static bool bpf_is_valid_sock(struct bpf_sock_ops_kern *ops)
> +{
> +	return ops->op == BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB ||
> +	       ops->op == BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB;
> +}
> +
>  BPF_CALL_4(bpf_sock_map_update, struct bpf_sock_ops_kern *, bpf_sock,
>  	   struct bpf_map *, map, void *, key, u64, flags)
>  {
>  	WARN_ON_ONCE(!rcu_read_lock_held());
> +
> +	if (!bpf_is_valid_sock(bpf_sock))
> +		return -EOPNOTSUPP;
>  	return sock_map_ctx_update_elem(bpf_sock, map, key, flags);
>  }
>  
> @@ -2445,6 +2470,9 @@ struct sock  *__sock_hash_lookup_elem(struct bpf_map *map, void *key)
>  	   struct bpf_map *, map, void *, key, u64, flags)
>  {
>  	WARN_ON_ONCE(!rcu_read_lock_held());
> +
> +	if (!bpf_is_valid_sock(bpf_sock))
> +		return -EOPNOTSUPP;
>  	return sock_hash_ctx_update_elem(bpf_sock, map, key, flags);
>  }
>  
> 

^ permalink raw reply

* [PATCH bpf 1/2] bpf: fix panic in prog load calls cleanup
From: Daniel Borkmann @ 2018-06-15  0:30 UTC (permalink / raw)
  To: ast; +Cc: netdev, Daniel Borkmann
In-Reply-To: <20180615003048.3219-1-daniel@iogearbox.net>

While testing I found that when hitting error path in bpf_prog_load()
where we jump to free_used_maps and prog contained BPF to BPF calls
that were JITed earlier, then we never clean up the bpf_prog_kallsyms_add()
done under jit_subprogs(). Add proper API to make BPF kallsyms deletion
more clear and fix that.

Fixes: 1c2a088a6626 ("bpf: x64: add JIT support for multi-function programs")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
---
 include/linux/filter.h |  3 +++
 kernel/bpf/core.c      | 14 ++++++++++++++
 kernel/bpf/syscall.c   |  8 ++------
 3 files changed, 19 insertions(+), 6 deletions(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 45fc0f5..297c56f 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -961,6 +961,9 @@ static inline void bpf_prog_kallsyms_del(struct bpf_prog *fp)
 }
 #endif /* CONFIG_BPF_JIT */
 
+void bpf_prog_kallsyms_del_subprogs(struct bpf_prog *fp);
+void bpf_prog_kallsyms_del_all(struct bpf_prog *fp);
+
 #define BPF_ANC		BIT(15)
 
 static inline bool bpf_needs_clear_a(const struct sock_filter *first)
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 9f14937..1061968 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -350,6 +350,20 @@ struct bpf_prog *bpf_patch_insn_single(struct bpf_prog *prog, u32 off,
 	return prog_adj;
 }
 
+void bpf_prog_kallsyms_del_subprogs(struct bpf_prog *fp)
+{
+	int i;
+
+	for (i = 0; i < fp->aux->func_cnt; i++)
+		bpf_prog_kallsyms_del(fp->aux->func[i]);
+}
+
+void bpf_prog_kallsyms_del_all(struct bpf_prog *fp)
+{
+	bpf_prog_kallsyms_del_subprogs(fp);
+	bpf_prog_kallsyms_del(fp);
+}
+
 #ifdef CONFIG_BPF_JIT
 /* All BPF JIT sysctl knobs here. */
 int bpf_jit_enable   __read_mostly = IS_BUILTIN(CONFIG_BPF_JIT_ALWAYS_ON);
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 0fa2062..0f62692 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1034,14 +1034,9 @@ static void __bpf_prog_put_rcu(struct rcu_head *rcu)
 static void __bpf_prog_put(struct bpf_prog *prog, bool do_idr_lock)
 {
 	if (atomic_dec_and_test(&prog->aux->refcnt)) {
-		int i;
-
 		/* bpf_prog_free_id() must be called first */
 		bpf_prog_free_id(prog, do_idr_lock);
-
-		for (i = 0; i < prog->aux->func_cnt; i++)
-			bpf_prog_kallsyms_del(prog->aux->func[i]);
-		bpf_prog_kallsyms_del(prog);
+		bpf_prog_kallsyms_del_all(prog);
 
 		call_rcu(&prog->aux->rcu, __bpf_prog_put_rcu);
 	}
@@ -1384,6 +1379,7 @@ static int bpf_prog_load(union bpf_attr *attr)
 	return err;
 
 free_used_maps:
+	bpf_prog_kallsyms_del_subprogs(prog);
 	free_used_maps(prog->aux);
 free_prog:
 	bpf_prog_uncharge_memlock(prog);
-- 
2.9.5

^ permalink raw reply related

* [PATCH bpf 0/2] Two bpf fixes
From: Daniel Borkmann @ 2018-06-15  0:30 UTC (permalink / raw)
  To: ast; +Cc: netdev, Daniel Borkmann

First one is a panic I ran into while testing the second
one where we got several syzkaller reports. Series here
fixes both.

Thanks!

Daniel Borkmann (2):
  bpf: fix panic in prog load calls cleanup
  bpf: reject any prog that failed read-only lock

 include/linux/filter.h | 63 +++++++++++++++++++++++++++++----------------
 kernel/bpf/core.c      | 69 +++++++++++++++++++++++++++++++++++++++++++++-----
 kernel/bpf/syscall.c   | 12 +++------
 3 files changed, 106 insertions(+), 38 deletions(-)

-- 
2.9.5

^ permalink raw reply

* [PATCH bpf 2/2] bpf: reject any prog that failed read-only lock
From: Daniel Borkmann @ 2018-06-15  0:30 UTC (permalink / raw)
  To: ast; +Cc: netdev, Daniel Borkmann
In-Reply-To: <20180615003048.3219-1-daniel@iogearbox.net>

We currently lock any JITed image as read-only via bpf_jit_binary_lock_ro()
as well as the BPF image as read-only through bpf_prog_lock_ro(). In
the case any of these would fail we throw a WARN_ON_ONCE() in order to
yell loudly to the log. Perhaps, to some extend, this may be comparable
to an allocation where __GFP_NOWARN is explicitly not set.

Added via 65869a47f348 ("bpf: improve read-only handling"), this behavior
is slightly different compared to any of the other in-kernel set_memory_ro()
users who do not check the return code of set_memory_ro() and friends /at
all/ (e.g. in the case of module_enable_ro() / module_disable_ro()). Given
in BPF this is mandatory hardening step, we want to know whether there
are any issues that would leave both BPF data writable. So it happens
that syzkaller enabled fault injection and it triggered memory allocation
failure deep inside x86's change_page_attr_set_clr() which was triggered
from set_memory_ro().

Now, there are two options: i) leaving everything as is, and ii) reworking
the image locking code in order to have a final checkpoint out of the
central bpf_prog_select_runtime() which probes whether any of the calls
during prog setup weren't successful, and then bailing out with an error.
Option ii) is a better approach since this additional paranoia avoids
altogether leaving any potential W+X pages from BPF side in the system.
Therefore, lets be strict about it, and reject programs in such unlikely
occasion. While testing I noticed also that one bpf_prog_lock_ro()
call was missing on the outer dummy prog in case of calls, e.g. in the
destructor we call bpf_prog_free_deferred() on the main prog where we
try to bpf_prog_unlock_free() the program, and since we go via
bpf_prog_select_runtime() do that as well.

Reported-by: syzbot+3b889862e65a98317058@syzkaller.appspotmail.com
Reported-by: syzbot+9e762b52dd17e616a7a5@syzkaller.appspotmail.com
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
---
 include/linux/filter.h | 60 ++++++++++++++++++++++++++++++++------------------
 kernel/bpf/core.c      | 55 +++++++++++++++++++++++++++++++++++++++------
 kernel/bpf/syscall.c   |  4 +---
 3 files changed, 87 insertions(+), 32 deletions(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 297c56f..108f981 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -469,7 +469,8 @@ struct sock_fprog_kern {
 };
 
 struct bpf_binary_header {
-	unsigned int pages;
+	u16 pages;
+	u16 locked:1;
 	u8 image[];
 };
 
@@ -671,15 +672,18 @@ bpf_ctx_narrow_access_ok(u32 off, u32 size, u32 size_default)
 
 #define bpf_classic_proglen(fprog) (fprog->len * sizeof(fprog->filter[0]))
 
-#ifdef CONFIG_ARCH_HAS_SET_MEMORY
 static inline void bpf_prog_lock_ro(struct bpf_prog *fp)
 {
+#ifdef CONFIG_ARCH_HAS_SET_MEMORY
 	fp->locked = 1;
-	WARN_ON_ONCE(set_memory_ro((unsigned long)fp, fp->pages));
+	if (set_memory_ro((unsigned long)fp, fp->pages))
+		fp->locked = 0;
+#endif
 }
 
 static inline void bpf_prog_unlock_ro(struct bpf_prog *fp)
 {
+#ifdef CONFIG_ARCH_HAS_SET_MEMORY
 	if (fp->locked) {
 		WARN_ON_ONCE(set_memory_rw((unsigned long)fp, fp->pages));
 		/* In case set_memory_rw() fails, we want to be the first
@@ -687,34 +691,30 @@ static inline void bpf_prog_unlock_ro(struct bpf_prog *fp)
 		 */
 		fp->locked = 0;
 	}
+#endif
 }
 
 static inline void bpf_jit_binary_lock_ro(struct bpf_binary_header *hdr)
 {
-	WARN_ON_ONCE(set_memory_ro((unsigned long)hdr, hdr->pages));
-}
-
-static inline void bpf_jit_binary_unlock_ro(struct bpf_binary_header *hdr)
-{
-	WARN_ON_ONCE(set_memory_rw((unsigned long)hdr, hdr->pages));
-}
-#else
-static inline void bpf_prog_lock_ro(struct bpf_prog *fp)
-{
-}
-
-static inline void bpf_prog_unlock_ro(struct bpf_prog *fp)
-{
-}
-
-static inline void bpf_jit_binary_lock_ro(struct bpf_binary_header *hdr)
-{
+#ifdef CONFIG_ARCH_HAS_SET_MEMORY
+	hdr->locked = 1;
+	if (set_memory_ro((unsigned long)hdr, hdr->pages))
+		hdr->locked = 0;
+#endif
 }
 
 static inline void bpf_jit_binary_unlock_ro(struct bpf_binary_header *hdr)
 {
+#ifdef CONFIG_ARCH_HAS_SET_MEMORY
+	if (hdr->locked) {
+		WARN_ON_ONCE(set_memory_rw((unsigned long)hdr, hdr->pages));
+		/* In case set_memory_rw() fails, we want to be the first
+		 * to crash here instead of some random place later on.
+		 */
+		hdr->locked = 0;
+	}
+#endif
 }
-#endif /* CONFIG_ARCH_HAS_SET_MEMORY */
 
 static inline struct bpf_binary_header *
 bpf_jit_binary_hdr(const struct bpf_prog *fp)
@@ -725,6 +725,22 @@ bpf_jit_binary_hdr(const struct bpf_prog *fp)
 	return (void *)addr;
 }
 
+#ifdef CONFIG_ARCH_HAS_SET_MEMORY
+static inline int bpf_prog_check_pages_ro_single(const struct bpf_prog *fp)
+{
+	if (!fp->locked)
+		return -ENOLCK;
+	if (fp->jited) {
+		const struct bpf_binary_header *hdr = bpf_jit_binary_hdr(fp);
+
+		if (!hdr->locked)
+			return -ENOLCK;
+	}
+
+	return 0;
+}
+#endif
+
 int sk_filter_trim_cap(struct sock *sk, struct sk_buff *skb, unsigned int cap);
 static inline int sk_filter(struct sock *sk, struct sk_buff *skb)
 {
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 1061968..a9e6c04 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -598,6 +598,8 @@ bpf_jit_binary_alloc(unsigned int proglen, u8 **image_ptr,
 	bpf_fill_ill_insns(hdr, size);
 
 	hdr->pages = size / PAGE_SIZE;
+	hdr->locked = 0;
+
 	hole = min_t(unsigned int, size - (proglen + sizeof(*hdr)),
 		     PAGE_SIZE - sizeof(*hdr));
 	start = (get_random_int() % hole) & ~(alignment - 1);
@@ -1448,6 +1450,33 @@ static int bpf_check_tail_call(const struct bpf_prog *fp)
 	return 0;
 }
 
+static int bpf_prog_check_pages_ro_locked(const struct bpf_prog *fp)
+{
+#ifdef CONFIG_ARCH_HAS_SET_MEMORY
+	int i, err;
+
+	for (i = 0; i < fp->aux->func_cnt; i++) {
+		err = bpf_prog_check_pages_ro_single(fp->aux->func[i]);
+		if (err)
+			return err;
+	}
+
+	return bpf_prog_check_pages_ro_single(fp);
+#endif
+	return 0;
+}
+
+static void bpf_prog_select_func(struct bpf_prog *fp)
+{
+#ifndef CONFIG_BPF_JIT_ALWAYS_ON
+	u32 stack_depth = max_t(u32, fp->aux->stack_depth, 1);
+
+	fp->bpf_func = interpreters[(round_up(stack_depth, 32) / 32) - 1];
+#else
+	fp->bpf_func = __bpf_prog_ret0_warn;
+#endif
+}
+
 /**
  *	bpf_prog_select_runtime - select exec runtime for BPF program
  *	@fp: bpf_prog populated with internal BPF program
@@ -1458,13 +1487,13 @@ static int bpf_check_tail_call(const struct bpf_prog *fp)
  */
 struct bpf_prog *bpf_prog_select_runtime(struct bpf_prog *fp, int *err)
 {
-#ifndef CONFIG_BPF_JIT_ALWAYS_ON
-	u32 stack_depth = max_t(u32, fp->aux->stack_depth, 1);
+	/* In case of BPF to BPF calls, verifier did all the prep
+	 * work with regards to JITing, etc.
+	 */
+	if (fp->bpf_func)
+		goto finalize;
 
-	fp->bpf_func = interpreters[(round_up(stack_depth, 32) / 32) - 1];
-#else
-	fp->bpf_func = __bpf_prog_ret0_warn;
-#endif
+	bpf_prog_select_func(fp);
 
 	/* eBPF JITs can rewrite the program in case constant
 	 * blinding is active. However, in case of error during
@@ -1485,6 +1514,8 @@ struct bpf_prog *bpf_prog_select_runtime(struct bpf_prog *fp, int *err)
 		if (*err)
 			return fp;
 	}
+
+finalize:
 	bpf_prog_lock_ro(fp);
 
 	/* The tail call compatibility check can only be done at
@@ -1493,7 +1524,17 @@ struct bpf_prog *bpf_prog_select_runtime(struct bpf_prog *fp, int *err)
 	 * all eBPF JITs might immediately support all features.
 	 */
 	*err = bpf_check_tail_call(fp);
-
+	if (*err)
+		return fp;
+
+	/* Checkpoint: at this point onwards any cBPF -> eBPF or
+	 * native eBPF program is read-only. If we failed to change
+	 * the page attributes (e.g. allocation failure from
+	 * splitting large pages), then reject the whole program
+	 * in order to guarantee not ending up with any W+X pages
+	 * from BPF side in kernel.
+	 */
+	*err = bpf_prog_check_pages_ro_locked(fp);
 	return fp;
 }
 EXPORT_SYMBOL_GPL(bpf_prog_select_runtime);
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 0f62692..35dc466 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1353,9 +1353,7 @@ static int bpf_prog_load(union bpf_attr *attr)
 	if (err < 0)
 		goto free_used_maps;
 
-	/* eBPF program is ready to be JITed */
-	if (!prog->bpf_func)
-		prog = bpf_prog_select_runtime(prog, &err);
+	prog = bpf_prog_select_runtime(prog, &err);
 	if (err < 0)
 		goto free_used_maps;
 
-- 
2.9.5

^ permalink raw reply related

* Re: [PATCH v2 2/3] bpfilter: include bpfilter_umh in assembly instead of using objcopy
From: Alexei Starovoitov @ 2018-06-15  0:47 UTC (permalink / raw)
  To: Masahiro Yamada
  Cc: netdev, Alexei Starovoitov, David S . Miller, Arnd Bergmann,
	Geert Uytterhoeven, linux-kernel, YueHaibing, Daniel Borkmann
In-Reply-To: <1528987172-19810-3-git-send-email-yamada.masahiro@socionext.com>

On Thu, Jun 14, 2018 at 11:39:31PM +0900, Masahiro Yamada wrote:
> What we want here is to embed a user-space program into the kernel.
> Instead of the complex ELF magic, let's simply wrap it in the assembly
> with the '.incbin' directive.
> 
> Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
> ---
> 
> Changes in v2:
>   - Rebase
> 
>  net/bpfilter/Makefile            | 15 ++-------------
>  net/bpfilter/bpfilter_kern.c     | 11 +++++------
>  net/bpfilter/bpfilter_umh_blob.S |  7 +++++++
>  3 files changed, 14 insertions(+), 19 deletions(-)
>  create mode 100644 net/bpfilter/bpfilter_umh_blob.S
> 
> diff --git a/net/bpfilter/Makefile b/net/bpfilter/Makefile
> index e0bbe75..39c6980 100644
> --- a/net/bpfilter/Makefile
> +++ b/net/bpfilter/Makefile
> @@ -15,18 +15,7 @@ ifeq ($(CONFIG_BPFILTER_UMH), y)
>  HOSTLDFLAGS += -static
>  endif
>  
> -# a bit of elf magic to convert bpfilter_umh binary into a binary blob
> -# inside bpfilter_umh.o elf file referenced by
> -# _binary_net_bpfilter_bpfilter_umh_start symbol
> -# which bpfilter_kern.c passes further into umh blob loader at run-time
> -quiet_cmd_copy_umh = GEN $@
> -      cmd_copy_umh = echo ':' > $(obj)/.bpfilter_umh.o.cmd; \
> -      $(OBJCOPY) -I binary -O `$(OBJDUMP) -f $<|grep format|cut -d' ' -f8` \
> -      -B `$(OBJDUMP) -f $<|grep architecture|cut -d, -f1|cut -d' ' -f2` \
> -      --rename-section .data=.init.rodata $< $@
> -
> -$(obj)/bpfilter_umh.o: $(obj)/bpfilter_umh
> -	$(call cmd,copy_umh)
> +$(obj)/bpfilter_umh_blob.o: $(obj)/bpfilter_umh
>  
>  obj-$(CONFIG_BPFILTER_UMH) += bpfilter.o
> -bpfilter-objs += bpfilter_kern.o bpfilter_umh.o
> +bpfilter-objs += bpfilter_kern.o bpfilter_umh_blob.o
> diff --git a/net/bpfilter/bpfilter_kern.c b/net/bpfilter/bpfilter_kern.c
> index 0952257..6de3ae5 100644
> --- a/net/bpfilter/bpfilter_kern.c
> +++ b/net/bpfilter/bpfilter_kern.c
> @@ -10,11 +10,8 @@
>  #include <linux/file.h>
>  #include "msgfmt.h"
>  
> -#define UMH_start _binary_net_bpfilter_bpfilter_umh_start
> -#define UMH_end _binary_net_bpfilter_bpfilter_umh_end
> -
> -extern char UMH_start;
> -extern char UMH_end;
> +extern char bpfilter_umh_start;
> +extern char bpfilter_umh_end;
>  
>  static struct umh_info info;
>  /* since ip_getsockopt() can run in parallel, serialize access to umh */
> @@ -93,7 +90,9 @@ static int __init load_umh(void)
>  	int err;
>  
>  	/* fork usermode process */
> -	err = fork_usermode_blob(&UMH_start, &UMH_end - &UMH_start, &info);
> +	err = fork_usermode_blob(&bpfilter_umh_end,
> +				 &bpfilter_umh_end - &bpfilter_umh_start,
> +				 &info);
>  	if (err)
>  		return err;
>  	pr_info("Loaded bpfilter_umh pid %d\n", info.pid);
> diff --git a/net/bpfilter/bpfilter_umh_blob.S b/net/bpfilter/bpfilter_umh_blob.S
> new file mode 100644
> index 0000000..40311d1
> --- /dev/null
> +++ b/net/bpfilter/bpfilter_umh_blob.S
> @@ -0,0 +1,7 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +	.section .init.rodata, "a"
> +	.global bpfilter_umh_start
> +bpfilter_umh_start:
> +	.incbin "net/bpfilter/bpfilter_umh"
> +	.global bpfilter_umh_end
> +bpfilter_umh_end:

for some reason it doesn't work.
fork_usermode_blob() returns ENOEXEC
You should be able to test it simply running 'iptables -L'.
Without this patch you should see:
[   12.696937] bpfilter: Loaded bpfilter_umh pid 225
Started bpfilter

where first line comes from kernel module and second from umh.

^ permalink raw reply

* [PATCH net 0/2] Two tls fixes
From: Daniel Borkmann @ 2018-06-15  1:07 UTC (permalink / raw)
  To: davem; +Cc: davejwatson, netdev, Daniel Borkmann

First one is syzkaller trigered uaf and second one noticed
while writing test code with tls ulp. For details please see
individual patches.

Thanks!

Daniel Borkmann (2):
  tls: fix use-after-free in tls_push_record
  tls: fix waitall behavior in tls_sw_recvmsg

 net/tls/tls_sw.c | 32 ++++++++++++++++++--------------
 1 file changed, 18 insertions(+), 14 deletions(-)

-- 
2.9.5

^ permalink raw reply

* [PATCH net 2/2] tls: fix waitall behavior in tls_sw_recvmsg
From: Daniel Borkmann @ 2018-06-15  1:07 UTC (permalink / raw)
  To: davem; +Cc: davejwatson, netdev, Daniel Borkmann
In-Reply-To: <20180615010746.3099-1-daniel@iogearbox.net>

Current behavior in tls_sw_recvmsg() is to wait for incoming tls
messages and copy up to exactly len bytes of data that the user
provided. This is problematic in the sense that i) if no packet
is currently queued in strparser we keep waiting until one has been
processed and pushed into tls receive layer for tls_wait_data() to
wake up and push the decrypted bits to user space. Given after
tls decryption, we're back at streaming data, use sock_rcvlowat()
hint from tcp socket instead. Retain current behavior with MSG_WAITALL
flag and otherwise use the hint target for breaking the loop and
returning to application. This is done if currently no ctx->recv_pkt
is ready, otherwise continue to process it from our strparser
backlog.

Fixes: c46234ebb4d1 ("tls: RX path for ktls")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Dave Watson <davejwatson@fb.com>
---
 net/tls/tls_sw.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index 2945a3b..f127fac 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -754,7 +754,7 @@ int tls_sw_recvmsg(struct sock *sk,
 	struct sk_buff *skb;
 	ssize_t copied = 0;
 	bool cmsg = false;
-	int err = 0;
+	int target, err = 0;
 	long timeo;
 
 	flags |= nonblock;
@@ -764,6 +764,7 @@ int tls_sw_recvmsg(struct sock *sk,
 
 	lock_sock(sk);
 
+	target = sock_rcvlowat(sk, flags & MSG_WAITALL, len);
 	timeo = sock_rcvtimeo(sk, flags & MSG_DONTWAIT);
 	do {
 		bool zc = false;
@@ -856,6 +857,9 @@ int tls_sw_recvmsg(struct sock *sk,
 					goto recv_end;
 			}
 		}
+		/* If we have a new message from strparser, continue now. */
+		if (copied >= target && !ctx->recv_pkt)
+			break;
 	} while (len);
 
 recv_end:
-- 
2.9.5

^ permalink raw reply related

* [PATCH net 1/2] tls: fix use-after-free in tls_push_record
From: Daniel Borkmann @ 2018-06-15  1:07 UTC (permalink / raw)
  To: davem; +Cc: davejwatson, netdev, Daniel Borkmann
In-Reply-To: <20180615010746.3099-1-daniel@iogearbox.net>

syzkaller managed to trigger a use-after-free in tls like the
following:

  BUG: KASAN: use-after-free in tls_push_record.constprop.15+0x6a2/0x810 [tls]
  Write of size 1 at addr ffff88037aa08000 by task a.out/2317

  CPU: 3 PID: 2317 Comm: a.out Not tainted 4.17.0+ #144
  Hardware name: LENOVO 20FBCTO1WW/20FBCTO1WW, BIOS N1FET47W (1.21 ) 11/28/2016
  Call Trace:
   dump_stack+0x71/0xab
   print_address_description+0x6a/0x280
   kasan_report+0x258/0x380
   ? tls_push_record.constprop.15+0x6a2/0x810 [tls]
   tls_push_record.constprop.15+0x6a2/0x810 [tls]
   tls_sw_push_pending_record+0x2e/0x40 [tls]
   tls_sk_proto_close+0x3fe/0x710 [tls]
   ? tcp_check_oom+0x4c0/0x4c0
   ? tls_write_space+0x260/0x260 [tls]
   ? kmem_cache_free+0x88/0x1f0
   inet_release+0xd6/0x1b0
   __sock_release+0xc0/0x240
   sock_close+0x11/0x20
   __fput+0x22d/0x660
   task_work_run+0x114/0x1a0
   do_exit+0x71a/0x2780
   ? mm_update_next_owner+0x650/0x650
   ? handle_mm_fault+0x2f5/0x5f0
   ? __do_page_fault+0x44f/0xa50
   ? mm_fault_error+0x2d0/0x2d0
   do_group_exit+0xde/0x300
   __x64_sys_exit_group+0x3a/0x50
   do_syscall_64+0x9a/0x300
   ? page_fault+0x8/0x30
   entry_SYSCALL_64_after_hwframe+0x44/0xa9

This happened through fault injection where aead_req allocation in
tls_do_encryption() eventually failed and we returned -ENOMEM from
the function. Turns out that the use-after-free is triggered from
tls_sw_sendmsg() in the second tls_push_record(). The error then
triggers a jump to waiting for memory in sk_stream_wait_memory()
resp. returning immediately in case of MSG_DONTWAIT. What follows is
the trim_both_sgl(sk, orig_size), which drops elements from the sg
list added via tls_sw_sendmsg(). Now the use-after-free gets triggered
when the socket is being closed, where tls_sk_proto_close() callback
is invoked. The tls_complete_pending_work() will figure that there's
a pending closed tls record to be flushed and thus calls into the
tls_push_pending_closed_record() from there. ctx->push_pending_record()
is called from the latter, which is the tls_sw_push_pending_record()
from sw path. This again calls into tls_push_record(). And here the
tls_fill_prepend() will panic since the buffer address has been freed
earlier via trim_both_sgl(). One way to fix it is to move the aead
request allocation out of tls_do_encryption() early into tls_push_record().
This means we don't prep the tls header and advance state to the
TLS_PENDING_CLOSED_RECORD before allocation which could potentially
fail happened. That fixes the issue on my side.

Fixes: 3c4d7559159b ("tls: kernel TLS support")
Reported-by: syzbot+5c74af81c547738e1684@syzkaller.appspotmail.com
Reported-by: syzbot+709f2810a6a05f11d4d3@syzkaller.appspotmail.com
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Dave Watson <davejwatson@fb.com>
---
 net/tls/tls_sw.c | 26 +++++++++++++-------------
 1 file changed, 13 insertions(+), 13 deletions(-)

diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index 34895b7..2945a3b 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -191,18 +191,12 @@ static void tls_free_both_sg(struct sock *sk)
 }
 
 static int tls_do_encryption(struct tls_context *tls_ctx,
-			     struct tls_sw_context_tx *ctx, size_t data_len,
-			     gfp_t flags)
+			     struct tls_sw_context_tx *ctx,
+			     struct aead_request *aead_req,
+			     size_t data_len)
 {
-	unsigned int req_size = sizeof(struct aead_request) +
-		crypto_aead_reqsize(ctx->aead_send);
-	struct aead_request *aead_req;
 	int rc;
 
-	aead_req = kzalloc(req_size, flags);
-	if (!aead_req)
-		return -ENOMEM;
-
 	ctx->sg_encrypted_data[0].offset += tls_ctx->tx.prepend_size;
 	ctx->sg_encrypted_data[0].length -= tls_ctx->tx.prepend_size;
 
@@ -219,7 +213,6 @@ static int tls_do_encryption(struct tls_context *tls_ctx,
 	ctx->sg_encrypted_data[0].offset -= tls_ctx->tx.prepend_size;
 	ctx->sg_encrypted_data[0].length += tls_ctx->tx.prepend_size;
 
-	kfree(aead_req);
 	return rc;
 }
 
@@ -228,8 +221,14 @@ static int tls_push_record(struct sock *sk, int flags,
 {
 	struct tls_context *tls_ctx = tls_get_ctx(sk);
 	struct tls_sw_context_tx *ctx = tls_sw_ctx_tx(tls_ctx);
+	struct aead_request *req;
 	int rc;
 
+	req = kzalloc(sizeof(struct aead_request) +
+		      crypto_aead_reqsize(ctx->aead_send), sk->sk_allocation);
+	if (!req)
+		return -ENOMEM;
+
 	sg_mark_end(ctx->sg_plaintext_data + ctx->sg_plaintext_num_elem - 1);
 	sg_mark_end(ctx->sg_encrypted_data + ctx->sg_encrypted_num_elem - 1);
 
@@ -245,15 +244,14 @@ static int tls_push_record(struct sock *sk, int flags,
 	tls_ctx->pending_open_record_frags = 0;
 	set_bit(TLS_PENDING_CLOSED_RECORD, &tls_ctx->flags);
 
-	rc = tls_do_encryption(tls_ctx, ctx, ctx->sg_plaintext_size,
-			       sk->sk_allocation);
+	rc = tls_do_encryption(tls_ctx, ctx, req, ctx->sg_plaintext_size);
 	if (rc < 0) {
 		/* If we are called from write_space and
 		 * we fail, we need to set this SOCK_NOSPACE
 		 * to trigger another write_space in the future.
 		 */
 		set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
-		return rc;
+		goto out_req;
 	}
 
 	free_sg(sk, ctx->sg_plaintext_data, &ctx->sg_plaintext_num_elem,
@@ -268,6 +266,8 @@ static int tls_push_record(struct sock *sk, int flags,
 		tls_err_abort(sk, EBADMSG);
 
 	tls_advance_record_sn(sk, &tls_ctx->tx);
+out_req:
+	kfree(req);
 	return rc;
 }
 
-- 
2.9.5

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox