Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH] tcp: verify the checksum of the first data segment in a new connection
From: Eric Dumazet @ 2018-06-11 23:37 UTC (permalink / raw)
  To: van der Linden, Frank, edumazet@google.com,
	netdev@vger.kernel.org
In-Reply-To: <631FD61F-40EF-4B6A-BD30-6C208DD450B7@amazon.com>



On 06/11/2018 04:25 PM, van der Linden, Frank wrote:
> A few comments on this one:
> 
> - obviously this is fairly serious, as it can let corrupted data all the way up to the application

Sure, although anyone relying on CRC checksum for ensuring TCP data integrity
has big troubles ;)

I would rather have a refined version of this patch doing a "goto csum_error" 
so that we properly increment TCP_MIB_CSUMERRORS and TCP_MIB_INERRS 

Thanks !

^ permalink raw reply

* Re: [PATCH 00/15] Netfilter/IPVS fixes for net
From: David Miller @ 2018-06-11 23:31 UTC (permalink / raw)
  To: pablo; +Cc: netfilter-devel, netdev
In-Reply-To: <20180611092233.3219-1-pablo@netfilter.org>

From: Pablo Neira Ayuso <pablo@netfilter.org>
Date: Mon, 11 Jun 2018 11:22:18 +0200

> The following patchset contains Netfilter/IPVS fixes for your net tree:
 ...
> You can pull these changes from:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf.git

Pulled, thanks Pablo.

^ permalink raw reply

* Re: [PATCH net] tls: fix NULL pointer dereference on poll
From: David Miller @ 2018-06-11 23:30 UTC (permalink / raw)
  To: daniel; +Cc: davejwatson, netdev, ast, hch
In-Reply-To: <20180611212204.24002-1-daniel@iogearbox.net>

From: Daniel Borkmann <daniel@iogearbox.net>
Date: Mon, 11 Jun 2018 23:22:04 +0200

> While hacking on kTLS, I ran into the following panic from an
> unprivileged netserver / netperf TCP session:
 ...
> Debugging further, it turns out that calling into ctx->sk_poll() is
> invalid since sk_poll itself is NULL which was saved from the original
> TCP socket in order for tls_sw_poll() to invoke it.
> 
> Looks like the recent conversion from poll to poll_mask callback started
> in 152524231023 ("net: add support for ->poll_mask in proto_ops") missed
> to eventually convert kTLS, too: TCP's ->poll was converted over to the
> ->poll_mask in commit 2c7d3dacebd4 ("net/tcp: convert to ->poll_mask")
> and therefore kTLS wrongly saved the ->poll old one which is now NULL.
> 
> Convert kTLS over to use ->poll_mask instead. Also instead of POLLIN |
> POLLRDNORM use the proper EPOLLIN | EPOLLRDNORM bits as the case in
> tcp_poll_mask() as well that is mangled here.
> 
> Fixes: 2c7d3dacebd4 ("net/tcp: convert to ->poll_mask")
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

Applied, thanks Daniel.

^ permalink raw reply

* Re: [bpf PATCH] bpf: selftest fix for sockmap
From: Y Song @ 2018-06-11 23:28 UTC (permalink / raw)
  To: John Fastabend; +Cc: Alexei Starovoitov, Daniel Borkmann, netdev
In-Reply-To: <20180611184735.31255.51105.stgit@john-Precision-Tower-5810>

On Mon, Jun 11, 2018 at 11:47 AM, John Fastabend
<john.fastabend@gmail.com> wrote:
> In selftest test_maps the sockmap test case attempts to add a socket
> in listening state to the sockmap. This is no longer a valid operation
> so it fails as expected. However, the test wrongly reports this as an
> error now. Fix the test to avoid adding sockets in listening state.
>
> Fixes: 945ae430aa44 ("bpf: sockmap only allow ESTABLISHED sock state")

I cannot reproduce the failure and I cannot find the above "Fixes" commit.
I checked latest bpf/bpf-next/net-next trees. Maybe I missed something?

> Signed-off-by: John Fastabend <john.fastabend@gmail.com>
> ---
>  tools/testing/selftests/bpf/test_maps.c |    4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/tools/testing/selftests/bpf/test_maps.c b/tools/testing/selftests/bpf/test_maps.c
> index 6c25334..9fed5f0 100644
> --- a/tools/testing/selftests/bpf/test_maps.c
> +++ b/tools/testing/selftests/bpf/test_maps.c
> @@ -564,7 +564,7 @@ static void test_sockmap(int tasks, void *data)
>         }
>
>         /* Test update without programs */
> -       for (i = 0; i < 6; i++) {
> +       for (i = 2; i < 6; i++) {
>                 err = bpf_map_update_elem(fd, &i, &sfd[i], BPF_ANY);
>                 if (err) {
>                         printf("Failed noprog update sockmap '%i:%i'\n",
> @@ -727,7 +727,7 @@ static void test_sockmap(int tasks, void *data)
>         }
>
>         /* Test map update elem afterwards fd lives in fd and map_fd */
> -       for (i = 0; i < 6; i++) {
> +       for (i = 2; i < 6; i++) {
>                 err = bpf_map_update_elem(map_fd_rx, &i, &sfd[i], BPF_ANY);
>                 if (err) {
>                         printf("Failed map_fd_rx update sockmap %i '%i:%i'\n",
>

^ permalink raw reply

* Re: [PATCH] tcp: verify the checksum of the first data segment in a new connection
From: van der Linden, Frank @ 2018-06-11 23:25 UTC (permalink / raw)
  To: edumazet@google.com, netdev@vger.kernel.org
In-Reply-To: <5b1f0292.IdMQh83ac/EN53Sl%fllinden@amazon.com>

A few comments on this one:

- obviously this is fairly serious, as it can let corrupted data all the way up to the application
- I am not nuts about the patch itself, the code feels a bit cluttered, but it's the least invasive way
  I could think of. Probably some refactoring is needed at some point.
- here's how to easily reproduce it:

On the sender, set up artificial packet corruption:

#!/bin/sh
tc qdisc add dev eth0 root handle 1: prio
tc qdisc add dev eth0 parent 1:3 netem corrupt 50.0%
tc filter add dev eth0 protocol ip parent 1:0 prio 3 u32 match ip dst $DSTADDR flowid 10:3


Then, run the following on the sender:

while :; do dd if=/dev/zero of=/dev/stdout bs=4096 count=4 | nc $DSTADDR 10000; sleep 1; done

..and this on the receiver:

uname -r; grep ^Tcp /proc/net/snmp; ttl=$((${SECONDS} + 300)); while [[ ${SECONDS} -lt ${ttl} ]]; do out="foo.$(date +%s)"; nc -l 10000 > "${out}"; md5=$(md5sum "${out}"|cut -d\  -f 1); echo -n "${md5}"; if [[ "${md5}" != "ce338fe6899778aacfc28414f2d9498b" ]]; then echo " corrupted"; else echo; fi; done; grep ^Tcp /proc/net/snmp

[obviously, if you change the size / content, the md5 sum has to be changed here]

This reproduces it fairly quickly (within 20 seconds) for us, on 4.14.x kernels. 4.17 kernels were verified to still have the issue.

Frank

On 6/11/18, 4:15 PM, "Frank van der Linden" <fllinden@amazon.com> wrote:

    commit 079096f103fa ("tcp/dccp: install syn_recv requests into ehash
    table") introduced an optimization for the handling of child sockets
    created for a new TCP connection.
    
    But this optimization passes any data associated with the last ACK of the
    connection handshake up the stack without verifying its checksum, because it
    calls tcp_child_process(), which in turn calls tcp_rcv_state_process()
    directly.  These lower-level processing functions do not do any checksum
    verification.
    
    Insert a tcp_checksum_complete call in the TCP_NEW_SYN_RECEIVE path to
    fix this.
    
    Signed-off-by: Frank van der Linden <fllinden@amazon.com>
    ---
     net/ipv4/tcp_ipv4.c | 8 +++++++-
     net/ipv6/tcp_ipv6.c | 8 +++++++-
     2 files changed, 14 insertions(+), 2 deletions(-)
    
    diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
    index f70586b50838..1ec4c0d4aba5 100644
    --- a/net/ipv4/tcp_ipv4.c
    +++ b/net/ipv4/tcp_ipv4.c
    @@ -1703,7 +1703,13 @@ int tcp_v4_rcv(struct sk_buff *skb)
     			th = (const struct tcphdr *)skb->data;
     			iph = ip_hdr(skb);
     			tcp_v4_fill_cb(skb, iph, th);
    -			nsk = tcp_check_req(sk, skb, req, false, &req_stolen);
    +
    +			if (tcp_checksum_complete(skb)) {
    +				__TCP_INC_STATS(net, TCP_MIB_CSUMERRORS);
    +			} else {
    +				nsk = tcp_check_req(sk, skb, req, false,
    +						    &req_stolen);
    +			}
     		}
     		if (!nsk) {
     			reqsk_put(req);
    diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
    index 6d664d83cd16..a12b694d3d1e 100644
    --- a/net/ipv6/tcp_ipv6.c
    +++ b/net/ipv6/tcp_ipv6.c
    @@ -1486,7 +1486,13 @@ static int tcp_v6_rcv(struct sk_buff *skb)
     			th = (const struct tcphdr *)skb->data;
     			hdr = ipv6_hdr(skb);
     			tcp_v6_fill_cb(skb, hdr, th);
    -			nsk = tcp_check_req(sk, skb, req, false, &req_stolen);
    +
    +			if (tcp_checksum_complete(skb)) {
    +				__TCP_INC_STATS(net, TCP_MIB_CSUMERRORS);
    +			} else {
    +				nsk = tcp_check_req(sk, skb, req, false,
    +						    &req_stolen);
    +			}
     		}
     		if (!nsk) {
     			reqsk_put(req);
    -- 
    2.14.4
    
    


^ permalink raw reply

* [PATCH] tcp: verify the checksum of the first data segment in a new connection
From: Frank van der Linden @ 2018-06-11 23:15 UTC (permalink / raw)
  To: edumazet, netdev; +Cc: fllinden

commit 079096f103fa ("tcp/dccp: install syn_recv requests into ehash
table") introduced an optimization for the handling of child sockets
created for a new TCP connection.

But this optimization passes any data associated with the last ACK of the
connection handshake up the stack without verifying its checksum, because it
calls tcp_child_process(), which in turn calls tcp_rcv_state_process()
directly.  These lower-level processing functions do not do any checksum
verification.

Insert a tcp_checksum_complete call in the TCP_NEW_SYN_RECEIVE path to
fix this.

Signed-off-by: Frank van der Linden <fllinden@amazon.com>
---
 net/ipv4/tcp_ipv4.c | 8 +++++++-
 net/ipv6/tcp_ipv6.c | 8 +++++++-
 2 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index f70586b50838..1ec4c0d4aba5 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1703,7 +1703,13 @@ int tcp_v4_rcv(struct sk_buff *skb)
 			th = (const struct tcphdr *)skb->data;
 			iph = ip_hdr(skb);
 			tcp_v4_fill_cb(skb, iph, th);
-			nsk = tcp_check_req(sk, skb, req, false, &req_stolen);
+
+			if (tcp_checksum_complete(skb)) {
+				__TCP_INC_STATS(net, TCP_MIB_CSUMERRORS);
+			} else {
+				nsk = tcp_check_req(sk, skb, req, false,
+						    &req_stolen);
+			}
 		}
 		if (!nsk) {
 			reqsk_put(req);
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 6d664d83cd16..a12b694d3d1e 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1486,7 +1486,13 @@ static int tcp_v6_rcv(struct sk_buff *skb)
 			th = (const struct tcphdr *)skb->data;
 			hdr = ipv6_hdr(skb);
 			tcp_v6_fill_cb(skb, hdr, th);
-			nsk = tcp_check_req(sk, skb, req, false, &req_stolen);
+
+			if (tcp_checksum_complete(skb)) {
+				__TCP_INC_STATS(net, TCP_MIB_CSUMERRORS);
+			} else {
+				nsk = tcp_check_req(sk, skb, req, false,
+						    &req_stolen);
+			}
 		}
 		if (!nsk) {
 			reqsk_put(req);
-- 
2.14.4

^ permalink raw reply related

* Re: [bpf PATCH v2 1/2] bpf: sockmap, fix crash when ipv6 sock is added
From: Daniel Borkmann @ 2018-06-11 23:14 UTC (permalink / raw)
  To: John Fastabend, edumazet, weiwan, ast; +Cc: netdev
In-Reply-To: <20180608150639.15153.91342.stgit@john-Precision-Tower-5810>

Hi John,

On 06/08/2018 05:06 PM, John Fastabend wrote:
> This fixes a crash where we assign tcp_prot to IPv6 sockets instead
> of tcpv6_prot.
> 
> Previously we overwrote the sk->prot field with tcp_prot even in the
> AF_INET6 case. This patch ensures the correct tcp_prot and tcpv6_prot
> are used. Further, only allow ESTABLISHED connections to join the
> map per note in TLS ULP,
> 
>    /* The TLS ulp is currently supported only for TCP sockets
>     * in ESTABLISHED state.
>     * Supporting sockets in LISTEN state will require us
>     * to modify the accept implementation to clone rather then
>     * share the ulp context.
>     */
> 
> Also tested with 'netserver -6' and 'netperf -H [IPv6]' as well as
> 'netperf -H [IPv4]'. The ESTABLISHED check resolves the previously
> crashing case here.
> 
> Fixes: 174a79ff9515 ("bpf: sockmap with sk redirect support")
> Reported-by: syzbot+5c063698bdbfac19f363@syzkaller.appspotmail.com
> Signed-off-by: John Fastabend <john.fastabend@gmail.com>
> Signed-off-by: Wei Wang <weiwan@google.com>
[...]

Still one question for some more clarification below that popped up while
review:

> @@ -162,6 +164,8 @@ static bool bpf_tcp_stream_read(const struct sock *sk)
>  }
>  
>  static struct proto tcp_bpf_proto;
> +static struct proto tcpv6_bpf_proto;

These two are global, w/o locking.

>  static int bpf_tcp_init(struct sock *sk)
>  {
>  	struct smap_psock *psock;
> @@ -181,14 +185,30 @@ static int bpf_tcp_init(struct sock *sk)
>  	psock->save_close = sk->sk_prot->close;
>  	psock->sk_proto = sk->sk_prot;
>  
> +	if (sk->sk_family == AF_INET6) {
> +		tcpv6_bpf_proto = *sk->sk_prot;
> +		tcpv6_bpf_proto.close = bpf_tcp_close;
> +	} else {
> +		tcp_bpf_proto = *sk->sk_prot;
> +		tcp_bpf_proto.close = bpf_tcp_close;
> +	}

And each time we add a BPF ULP to a v4/v6 socket, we override tcp{,v6}_bpf_proto
from scratch.

>  	if (psock->bpf_tx_msg) {
> +		tcpv6_bpf_proto.sendmsg = bpf_tcp_sendmsg;
> +		tcpv6_bpf_proto.sendpage = bpf_tcp_sendpage;
> +		tcpv6_bpf_proto.recvmsg = bpf_tcp_recvmsg;
> +		tcpv6_bpf_proto.stream_memory_read = bpf_tcp_stream_read;
>  		tcp_bpf_proto.sendmsg = bpf_tcp_sendmsg;
>  		tcp_bpf_proto.sendpage = bpf_tcp_sendpage;
>  		tcp_bpf_proto.recvmsg = bpf_tcp_recvmsg;
>  		tcp_bpf_proto.stream_memory_read = bpf_tcp_stream_read;
>  	}
>  
> -	sk->sk_prot = &tcp_bpf_proto;
> +	if (sk->sk_family == AF_INET6)
> +		sk->sk_prot = &tcpv6_bpf_proto;
> +	else
> +		sk->sk_prot = &tcp_bpf_proto;

Where every active socket would be affected from it as well. Isn't that
generally racy? E.g. existing ones where tcpv6_bpf_proto.sendmsg points
to bpf_tcp_sendmsg would get overridden with earlier assignment on the
tcpv6_bpf_proto = *sk->sk_prot during their lifetime after bpf_tcp_init().

In the kTLS case, the v4 protos are built up in module init via tls_register()
and never change from there. The v6 ones are only reloaded when their addr
changes e.g. module reload would come to mind, which should only be possible
once no active v6 socket is present. What speaks against adapting similar
scheme resp. what am I missing that the above would work? (Would be nice if
there was some discussion in commit log related to it on 'why' this approach
was done differently.)

Thanks,
Daniel

>  	rcu_read_unlock();
>  	return 0;
>  }
> @@ -1111,8 +1131,6 @@ static void bpf_tcp_msg_add(struct smap_psock *psock,
>  
>  static int bpf_tcp_ulp_register(void)
>  {
> -	tcp_bpf_proto = tcp_prot;
> -	tcp_bpf_proto.close = bpf_tcp_close;
>  	/* Once BPF TX ULP is registered it is never unregistered. It
>  	 * will be in the ULP list for the lifetime of the system. Doing
>  	 * duplicate registers is not a problem.
> 

^ permalink raw reply

* Re: [Intel-wired-lan] [jkirsher/next-queue PATCH 5/7] net: Add generic ndo_select_queue functions
From: kbuild test robot @ 2018-06-11 23:10 UTC (permalink / raw)
  To: Alexander Duyck; +Cc: kbuild-all, netdev, intel-wired-lan, jeffrey.t.kirsher
In-Reply-To: <20180611174119.41352.28975.stgit@ahduyck-green-test.jf.intel.com>

[-- Attachment #1: Type: text/plain, Size: 2094 bytes --]

Hi Alexander,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on net-next/master]
[also build test ERROR on next-20180608]
[cannot apply to v4.17]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Alexander-Duyck/Add-support-for-L2-Fwd-Offload-w-o-ndo_select_queue/20180612-015220
config: arm-allmodconfig (attached as .config)
compiler: arm-linux-gnueabi-gcc (Debian 7.2.0-11) 7.2.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        GCC_VERSION=7.2.0 make.cross ARCH=arm 

All errors (new ones prefixed by >>):

>> drivers/net//ethernet/ti/netcp_core.c:1968:22: error: initialization from incompatible pointer type [-Werror=incompatible-pointer-types]
     .ndo_select_queue = dev_pick_tx_zero,
                         ^~~~~~~~~~~~~~~~
   drivers/net//ethernet/ti/netcp_core.c:1968:22: note: (near initialization for 'netcp_netdev_ops.ndo_select_queue')
   cc1: some warnings being treated as errors

vim +1968 drivers/net//ethernet/ti/netcp_core.c

  1955	
  1956	static const struct net_device_ops netcp_netdev_ops = {
  1957		.ndo_open		= netcp_ndo_open,
  1958		.ndo_stop		= netcp_ndo_stop,
  1959		.ndo_start_xmit		= netcp_ndo_start_xmit,
  1960		.ndo_set_rx_mode	= netcp_set_rx_mode,
  1961		.ndo_do_ioctl           = netcp_ndo_ioctl,
  1962		.ndo_get_stats64        = netcp_get_stats,
  1963		.ndo_set_mac_address	= eth_mac_addr,
  1964		.ndo_validate_addr	= eth_validate_addr,
  1965		.ndo_vlan_rx_add_vid	= netcp_rx_add_vid,
  1966		.ndo_vlan_rx_kill_vid	= netcp_rx_kill_vid,
  1967		.ndo_tx_timeout		= netcp_ndo_tx_timeout,
> 1968		.ndo_select_queue	= dev_pick_tx_zero,
  1969		.ndo_setup_tc		= netcp_setup_tc,
  1970	};
  1971	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 65910 bytes --]

^ permalink raw reply

* Re: [PATCH] net: do not allow changing SO_REUSEADDR/SO_REUSEPORT on bound sockets
From: Marc Dionne @ 2018-06-11 23:09 UTC (permalink / raw)
  To: Maciej Żenczykowski
  Cc: David S . Miller, Eric Dumazet, netdev, Jeffrey Altman,
	Marc Dionne
In-Reply-To: <CANP3RGc=hsJ8VmB28ZgN_iBpvC0Ck3_HSctYQMjioZngNKnC7g@mail.gmail.com>

On Mon, Jun 11, 2018 at 7:29 PM, Maciej Żenczykowski
<zenczykowski@gmail.com> wrote:
>> This change is a potential performance regression for us.  Our
>> networking code sets up multiple sockets used by multiple threads to
>> listen on the same udp port.  For the first socket, the bind is done
>> before setting SO_REUSEPORT so that we can get an error and detect
>> that the port is currently in use by a different process, possibly a
>> previous incarnation of the same application.
>>
>> With this change, the servers end up with a single listener socket and
>> thread, which can have a large impact on performance for a busy
>> server.
>>
>> While we can adjust for the new behaviour going forward, this could be
>> an unpleasant surprise for our customers who get this update when
>> moving to a new kernel version or through a backport to stable
>> kernels, if this is targeted for stable.
>
> Probably you can:
> fd1=socket()
> fd2=socket()
> bind(fd1,port=0)
> setsockopt(fd2,REUSEPORT,1)
> port = getsockname(fd1)
> close(fd1)
> bind(fd2,port)
>
> Although yeah there's a slight chance of a race (ie. 2nd bind failing,
> in which case close() and retry).

This would be a bit different since we want a specific port rather
than a wildcard, but yes, a temporary socket could be bound just to
check if the port is in use and closed afterwards.  The problem is
that since fd2 has REUSEPORT set, that second bind might well succeed
if there was a race and the other user of the port also has REUSEPORT
set, as could be the case if it's another instance of the same
application.

Marc

^ permalink raw reply

* Re: Qualcomm rmnet driver and qmi_wwan
From: Subash Abhinov Kasiviswanathan @ 2018-06-11 23:00 UTC (permalink / raw)
  To: Bjørn Mork, Daniele Palmas; +Cc: Dan Williams, netdev
In-Reply-To: <87zi01w753.fsf@miraculix.mork.no>

> both patches work properly for me.
> 
> Maybe it could be helpful in the first patch to add a print when
> pass-through setting fails if raw-ip has not been set, just to let the
> user know what is happening.
> 

Thanks for testing Daniele.
I can add an error message there when pass through mode setting fails.
Will you be submitting the patch for iproute2. Otherwise, I can do so
a bit later.

> Maybe rmnet_is_real_dev_registered(dev->net) instead, since we use that
> elsewhere in this function?
> 
> Like Daniele said: It would be good to have some way to know when the
> rawip condition fails.  Or even better: Automatically force rawip mode
> when the rmnet driver attaches.  But that doesn't seem possible?  No
> notifications or anything when an rx handler is registered?
> 
> Hmm, looking at this I wonder: Is the rawip check really necessary?  
> You
> skip all the extra rawip code in the driver anyway, so I don't see how
> it matters.  But maybe the ethernet header_ops are a problem?
> 
> And I wonder about using skb->dev here.  Does that really work?  I
> didn't think we set that until later.  Why not use dev->net instead?
> 
> 

Hi Bjørn

Yes, I will switch to using dev->net.
There doesnt seem to be a net device notifier event when the rx 
registration
happens.

If the dev type is ethernet, rmnet driver will try to remove the first 
14
bytes since it assumes an ethernet header is present in the packet. 
Hence the
need for raw ip mode in qmi_wwan.

-- 
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project

^ permalink raw reply

* Re: [PATCH bpf] xsk: silence warning on memory allocation failure
From: Y Song @ 2018-06-11 22:44 UTC (permalink / raw)
  To: Björn Töpel
  Cc: magnus.karlsson, magnus.karlsson, Alexei Starovoitov,
	Daniel Borkmann, netdev, Björn Töpel, Tetsuo Handa,
	syzkaller-bugs, syzbot+4abadc5d69117b346506
In-Reply-To: <20180611115712.9680-1-bjorn.topel@gmail.com>

On Mon, Jun 11, 2018 at 4:57 AM, Björn Töpel <bjorn.topel@gmail.com> wrote:
> From: Björn Töpel <bjorn.topel@intel.com>
>
> syzkaller reported a warning from xdp_umem_pin_pages():
>
>   WARNING: CPU: 1 PID: 4537 at mm/slab_common.c:996 kmalloc_slab+0x56/0x70 mm/slab_common.c:996
>   ...
>   __do_kmalloc mm/slab.c:3713 [inline]
>   __kmalloc+0x25/0x760 mm/slab.c:3727
>   kmalloc_array include/linux/slab.h:634 [inline]
>   kcalloc include/linux/slab.h:645 [inline]
>   xdp_umem_pin_pages net/xdp/xdp_umem.c:205 [inline]
>   xdp_umem_reg net/xdp/xdp_umem.c:318 [inline]
>   xdp_umem_create+0x5c9/0x10f0 net/xdp/xdp_umem.c:349
>   xsk_setsockopt+0x443/0x550 net/xdp/xsk.c:531
>   __sys_setsockopt+0x1bd/0x390 net/socket.c:1935
>   __do_sys_setsockopt net/socket.c:1946 [inline]
>   __se_sys_setsockopt net/socket.c:1943 [inline]
>   __x64_sys_setsockopt+0xbe/0x150 net/socket.c:1943
>   do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287
>   entry_SYSCALL_64_after_hwframe+0x49/0xbe
>
> This is a warning about attempting to allocate more than
> KMALLOC_MAX_SIZE memory. The request originates from userspace, and if
> the request is too big, the kernel is free to deny its allocation. In
> this patch, the failed allocation attempt is silenced with
> __GFP_NOWARN.
>
> Fixes: c0c77d8fb787 ("xsk: add user memory registration support sockopt")
> Reported-by: syzbot+4abadc5d69117b346506@syzkaller.appspotmail.com
> Signed-off-by: Björn Töpel <bjorn.topel@intel.com>

Acked-by: Yonghong Song <yhs@fb.com>

^ permalink raw reply

* Re: [PATCH] net: do not allow changing SO_REUSEADDR/SO_REUSEPORT on bound sockets
From: Maciej Żenczykowski @ 2018-06-11 22:29 UTC (permalink / raw)
  To: Marc Dionne
  Cc: David S . Miller, Eric Dumazet, netdev, Jeffrey Altman,
	Marc Dionne
In-Reply-To: <CAB9dFdueWF6iXdqC+uN5pmfs5Z6hdb9b-Zq0g+dUtX20cMuWSg@mail.gmail.com>

> This change is a potential performance regression for us.  Our
> networking code sets up multiple sockets used by multiple threads to
> listen on the same udp port.  For the first socket, the bind is done
> before setting SO_REUSEPORT so that we can get an error and detect
> that the port is currently in use by a different process, possibly a
> previous incarnation of the same application.
>
> With this change, the servers end up with a single listener socket and
> thread, which can have a large impact on performance for a busy
> server.
>
> While we can adjust for the new behaviour going forward, this could be
> an unpleasant surprise for our customers who get this update when
> moving to a new kernel version or through a backport to stable
> kernels, if this is targeted for stable.

Probably you can:
fd1=socket()
fd2=socket()
bind(fd1,port=0)
setsockopt(fd2,REUSEPORT,1)
port = getsockname(fd1)
close(fd1)
bind(fd2,port)

Although yeah there's a slight chance of a race (ie. 2nd bind failing,
in which case close() and retry).

^ permalink raw reply

* Re: [PATCH net] tls: fix NULL pointer dereference on poll
From: Dave Watson @ 2018-06-11 22:27 UTC (permalink / raw)
  To: Daniel Borkmann; +Cc: davem, netdev, ast, Christoph Hellwig
In-Reply-To: <20180611212204.24002-1-daniel@iogearbox.net>

On 06/11/18 11:22 PM, Daniel Borkmann wrote:
> While hacking on kTLS, I ran into the following panic from an
> unprivileged netserver / netperf TCP session:
> 
>   [...]
> Looks like the recent conversion from poll to poll_mask callback started
> in 152524231023 ("net: add support for ->poll_mask in proto_ops") missed
> to eventually convert kTLS, too: TCP's ->poll was converted over to the
> ->poll_mask in commit 2c7d3dacebd4 ("net/tcp: convert to ->poll_mask")
> and therefore kTLS wrongly saved the ->poll old one which is now NULL.
> 
> Convert kTLS over to use ->poll_mask instead. Also instead of POLLIN |
> POLLRDNORM use the proper EPOLLIN | EPOLLRDNORM bits as the case in
> tcp_poll_mask() as well that is mangled here.

Thanks, was just trying to bisect this myself. Works for me. 

Tested-by: Dave Watson <davejwatson@fb.com>

> Fixes: 2c7d3dacebd4 ("net/tcp: convert to ->poll_mask")
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Dave Watson <davejwatson@fb.com>

^ permalink raw reply

* Re: [PATCH bpf] xsk: silence warning on memory allocation failure
From: Daniel Borkmann @ 2018-06-11 22:27 UTC (permalink / raw)
  To: Björn Töpel, magnus.karlsson, magnus.karlsson, ast,
	netdev
  Cc: Björn Töpel, penguin-kernel, syzkaller-bugs,
	syzbot+4abadc5d69117b346506
In-Reply-To: <20180611115712.9680-1-bjorn.topel@gmail.com>

On 06/11/2018 01:57 PM, Björn Töpel wrote:
> From: Björn Töpel <bjorn.topel@intel.com>
> 
> syzkaller reported a warning from xdp_umem_pin_pages():
> 
>   WARNING: CPU: 1 PID: 4537 at mm/slab_common.c:996 kmalloc_slab+0x56/0x70 mm/slab_common.c:996
>   ...
>   __do_kmalloc mm/slab.c:3713 [inline]
>   __kmalloc+0x25/0x760 mm/slab.c:3727
>   kmalloc_array include/linux/slab.h:634 [inline]
>   kcalloc include/linux/slab.h:645 [inline]
>   xdp_umem_pin_pages net/xdp/xdp_umem.c:205 [inline]
>   xdp_umem_reg net/xdp/xdp_umem.c:318 [inline]
>   xdp_umem_create+0x5c9/0x10f0 net/xdp/xdp_umem.c:349
>   xsk_setsockopt+0x443/0x550 net/xdp/xsk.c:531
>   __sys_setsockopt+0x1bd/0x390 net/socket.c:1935
>   __do_sys_setsockopt net/socket.c:1946 [inline]
>   __se_sys_setsockopt net/socket.c:1943 [inline]
>   __x64_sys_setsockopt+0xbe/0x150 net/socket.c:1943
>   do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287
>   entry_SYSCALL_64_after_hwframe+0x49/0xbe
> 
> This is a warning about attempting to allocate more than
> KMALLOC_MAX_SIZE memory. The request originates from userspace, and if
> the request is too big, the kernel is free to deny its allocation. In
> this patch, the failed allocation attempt is silenced with
> __GFP_NOWARN.
> 
> Fixes: c0c77d8fb787 ("xsk: add user memory registration support sockopt")
> Reported-by: syzbot+4abadc5d69117b346506@syzkaller.appspotmail.com
> Signed-off-by: Björn Töpel <bjorn.topel@intel.com>

Applied to bpf, thanks Björn!

^ permalink raw reply

* Re: [RFC nf-next 0/5] netfilter: add ebpf translation infrastructure
From: Alexei Starovoitov @ 2018-06-11 22:12 UTC (permalink / raw)
  To: Florian Westphal
  Cc: netfilter-devel, ast, daniel, netdev, David S. Miller, ecree
In-Reply-To: <20180601153216.10901-1-fw@strlen.de>

On Fri, Jun 01, 2018 at 05:32:11PM +0200, Florian Westphal wrote:
> This patch series adds a JIT layer to translate nft expressions
> to ebpf programs.
> 
> From commit phase, spawn a userspace program (using recently added UMH
> infrastructure).
> 
> We then provide rules that came in this transaction to the helper via pipe,
> using same nf_tables netlink that nftables already uses.
> 
> The userspace helper translates the rules, and, if successful, installs the
> generated program(s) via bpf syscall.
> 
> For each rule a small response containing the corresponding epbf file
> descriptor (can be -1 on failure) and a attribute count (how many
> expressions were jitted) gets sent back to kernel via pipe.
> 
> If translation fails, the rule is will be processed by nf_tables
> interpreter (as before this patch).
> 
> If translation succeeded, nf_tables fetches the bpf program using the file
> descriptor identifier, allocates a new rule blob containing the new 'ebpf'
> expression (and possible trailing un-translated expressions).
> 
> It then replaces the original rule in the transaction log with the new
> 'ebpf-rule'.  The original rule is retained in a private area inside the epbf
> expression to be able to present the original expressions back to userspace
> on 'nft list ruleset'.
> 
> For easier review, this contains the kernel-side only.
> nf_tables_jit_work() will not do anything, yet.
> 
> Unresolved issues:
>  - maps and sets.
>    It might be possible to add a new ebpf map type that just wraps
>    the nft set infrastructure for lookups.
>    This would allow nft userspace to continue to work as-is while
>    not requiring new ebpf helper.
>    Anonymous set should be a lot easier as they're immutable
>    and could probably be handled already by existing infra.
> 
>  - BPF_PROG_RUN() is bolted into nft main loop via a middleman expression.
>    I'm also abusing skb->cb[] to pass network and transport header offsets.
>    Its not 'public' api so this can be changed later.
> 
>  - always uses BPF_PROG_TYPE_SCHED_CLS.
>    This is because it "works" for current RFC purposes.
> 
>  - we should eventually support translating multiple (adjacent) rules
>    into single program.
> 
>    If we do this kernel will need to track mapping of rules to
>    program (to re-jit when a rule is changed.  This isn't implemented
>    so far, but can be added later.  Alternatively, one could also add a
>    'readonly' table switch to just prevent further updates.
> 
>    We will also need to dump the 'next' generation of the
>    to-be-translated table.  The kernel has this information, so its only
>    a matter of serializing it back to userspace from the commit phase.
> 
> The jitter is still limited.  So far it supports:
> 
>  * payload expression for network and transport header
>  * meta mark, nfproto, l4proto
>  * 32 bit immediates
>  * 32 bit bitmask ops
>  * accept/drop verdicts
> 
> As this uses netlink, there is also no technical requirement for
> libnftnl, its simply used here for convienience.
> 
> It doesn't need any userspace changes. Patches for libnftnl and nftables
> make debug info available (e.g. to map rule to its bpf prog id).
> 
> Comments welcome.

The implementation of patch 5 looks good to me, but I'm concerned with
patch 2 that adds 'ebpf expression' to nft. I see no reason to do so.
It seems existing support for infinite number of nft expressions is
used as a way to execute infinite number of bpf programs sequentially.
I don't think it was a scalable approach before and won't scale in the future.
I think the algorithm should consider all nft rules at once and generate
a program or two that will execute fast even when number of rules is large.
We have the same scalability issue with bpfilter RFC patches. That's why
they're still in RFC stage, since we need to figure out a way to support
thousands of iptable rules in scalable way.
There are papers on scalable packet classification algorithms that
use decision trees (hicuts, hypercuts, efficuts, etc)
Imo that is the direction should we should be looking at.
If we implement one of the algorithms as a tree(trie) with a generic lookup
it will be usuable from bpf(bpfilter), from XDP, and other places
inside the kernel.
We can even have multiple algorithms implemented and pick and choose
depending on the size of ruleset and its properties, since one size
doesn't always fit all.
I'm imagining umh will be doing iptables->trie+bpf conversion and
nft->trie+bpf conversion where bpf progs will be dealing with pieces
of logic that don't fit into trie lookup and provide generic mechanism
for parsing the packet in the specific way suited for trie lookup
for the given ruleset. The trie will be sized differently depending
on tuples needed in the lookup. Like if there is no ipv6 in the ruleset
the bpf prog won't be parsing that to prepare a tuple for given trie.
Just like bpf hash map can be of different key/value size, this new
trie will be customized for specific ruleset on the fly by umh.
At the end the trie lookup is fully generic and bpf progs before
and after are generic as well.
imo this way majority of iptables/nft rules can be converted and
performance will be great even with large rulesets.

^ permalink raw reply

* Re: [PATCH] net: do not allow changing SO_REUSEADDR/SO_REUSEPORT on bound sockets
From: Marc Dionne @ 2018-06-11 21:25 UTC (permalink / raw)
  To: Maciej Żenczykowski
  Cc: Maciej Żenczykowski, David S . Miller, Eric Dumazet, netdev,
	Jeffrey Altman, Marc Dionne
In-Reply-To: <20180603174705.51802-1-zenczykowski@gmail.com>

On Sun, Jun 3, 2018 at 2:47 PM, Maciej Żenczykowski
<zenczykowski@gmail.com> wrote:
> From: Maciej Żenczykowski <maze@google.com>
>
> It is not safe to do so because such sockets are already in the
> hash tables and changing these options can result in invalidating
> the tb->fastreuse(port) caching.
>
> This can have later far reaching consequences wrt. bind conflict checks
> which rely on these caches (for optimization purposes).
>
> Not to mention that you can currently end up with two identical
> non-reuseport listening sockets bound to the same local ip:port
> by clearing reuseport on them after they've already both been bound.
>
> There is unfortunately no EISBOUND error or anything similar,
> and EISCONN seems to be misleading for a bound-but-not-connected
> socket, so use EUCLEAN 'Structure needs cleaning' which AFAICT
> is the closest you can get to meaning 'socket in bad state'.
> (although perhaps EINVAL wouldn't be a bad choice either?)
>
> This does unfortunately run the risk of breaking buggy
> userspace programs...

This change is a potential performance regression for us.  Our
networking code sets up multiple sockets used by multiple threads to
listen on the same udp port.  For the first socket, the bind is done
before setting SO_REUSEPORT so that we can get an error and detect
that the port is currently in use by a different process, possibly a
previous incarnation of the same application.

With this change, the servers end up with a single listener socket and
thread, which can have a large impact on performance for a busy
server.

While we can adjust for the new behaviour going forward, this could be
an unpleasant surprise for our customers who get this update when
moving to a new kernel version or through a backport to stable
kernels, if this is targeted for stable.

Marc

^ permalink raw reply

* Re: [PATCH] net: cxgb3: add error handling for some functions
From: David Miller @ 2018-06-11 21:23 UTC (permalink / raw)
  To: jiazhouyang09; +Cc: santosh, netdev, linux-kernel
In-Reply-To: <1528706547-35243-1-git-send-email-jiazhouyang09@gmail.com>

From: Zhouyang Jia <jiazhouyang09@gmail.com>
Date: Mon, 11 Jun 2018 16:42:26 +0800

> When sysfs_create_group or alloc_skb fails, the lack of error-handling
> code may cause unexpected results.
> 
> This patch adds error-handling code after the functions.
> 
> Signed-off-by: Zhouyang Jia <jiazhouyang09@gmail.com>

The ->nofail_skb handling is perfectly fine.

It is a backup SKB, meant to handle alloc_skb() failures for the
primary 'skb' allocation the _next_ time this function is called.

Even if the ->nofail_skb allocation fails, the primary operation
done by init_tp_parity() has succeeeded and we should return
success.

All of the code in this function is perfectly prepared to handle
the case where ->nofail_skb is NULL.

^ permalink raw reply

* [PATCH net] tls: fix NULL pointer dereference on poll
From: Daniel Borkmann @ 2018-06-11 21:22 UTC (permalink / raw)
  To: davem; +Cc: davejwatson, netdev, ast, Daniel Borkmann, Christoph Hellwig

While hacking on kTLS, I ran into the following panic from an
unprivileged netserver / netperf TCP session:

  BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
  PGD 800000037f378067 P4D 800000037f378067 PUD 3c0e61067 PMD 0
  Oops: 0010 [#1] SMP KASAN PTI
  CPU: 1 PID: 2289 Comm: netserver Not tainted 4.17.0+ #139
  Hardware name: LENOVO 20FBCTO1WW/20FBCTO1WW, BIOS N1FET47W (1.21 ) 11/28/2016
  RIP: 0010:          (null)
  Code: Bad RIP value.
  RSP: 0018:ffff88036abcf740 EFLAGS: 00010246
  RAX: dffffc0000000000 RBX: ffff88036f5f6800 RCX: 1ffff1006debed26
  RDX: ffff88036abcf920 RSI: ffff8803cb1a4f00 RDI: ffff8803c258c280
  RBP: ffff8803c258c280 R08: ffff8803c258c280 R09: ffffed006f559d48
  R10: ffff88037aacea43 R11: ffffed006f559d49 R12: ffff8803c258c280
  R13: ffff8803cb1a4f20 R14: 00000000000000db R15: ffffffffc168a350
  FS:  00007f7e631f4700(0000) GS:ffff8803d1c80000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: ffffffffffffffd6 CR3: 00000003ccf64005 CR4: 00000000003606e0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Call Trace:
   ? tls_sw_poll+0xa4/0x160 [tls]
   ? sock_poll+0x20a/0x680
   ? do_select+0x77b/0x11a0
   ? poll_schedule_timeout.constprop.12+0x130/0x130
   ? pick_link+0xb00/0xb00
   ? read_word_at_a_time+0x13/0x20
   ? vfs_poll+0x270/0x270
   ? deref_stack_reg+0xad/0xe0
   ? __read_once_size_nocheck.constprop.6+0x10/0x10
  [...]

Debugging further, it turns out that calling into ctx->sk_poll() is
invalid since sk_poll itself is NULL which was saved from the original
TCP socket in order for tls_sw_poll() to invoke it.

Looks like the recent conversion from poll to poll_mask callback started
in 152524231023 ("net: add support for ->poll_mask in proto_ops") missed
to eventually convert kTLS, too: TCP's ->poll was converted over to the
->poll_mask in commit 2c7d3dacebd4 ("net/tcp: convert to ->poll_mask")
and therefore kTLS wrongly saved the ->poll old one which is now NULL.

Convert kTLS over to use ->poll_mask instead. Also instead of POLLIN |
POLLRDNORM use the proper EPOLLIN | EPOLLRDNORM bits as the case in
tcp_poll_mask() as well that is mangled here.

Fixes: 2c7d3dacebd4 ("net/tcp: convert to ->poll_mask")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Watson <davejwatson@fb.com>
---
 include/net/tls.h  |  6 ++----
 net/tls/tls_main.c |  2 +-
 net/tls/tls_sw.c   | 19 +++++++++----------
 3 files changed, 12 insertions(+), 15 deletions(-)

diff --git a/include/net/tls.h b/include/net/tls.h
index 70c2737..7f84ea3 100644
--- a/include/net/tls.h
+++ b/include/net/tls.h
@@ -109,8 +109,7 @@ struct tls_sw_context_rx {
 
 	struct strparser strp;
 	void (*saved_data_ready)(struct sock *sk);
-	unsigned int (*sk_poll)(struct file *file, struct socket *sock,
-				struct poll_table_struct *wait);
+	__poll_t (*sk_poll_mask)(struct socket *sock, __poll_t events);
 	struct sk_buff *recv_pkt;
 	u8 control;
 	bool decrypted;
@@ -225,8 +224,7 @@ void tls_sw_free_resources_tx(struct sock *sk);
 void tls_sw_free_resources_rx(struct sock *sk);
 int tls_sw_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 		   int nonblock, int flags, int *addr_len);
-unsigned int tls_sw_poll(struct file *file, struct socket *sock,
-			 struct poll_table_struct *wait);
+__poll_t tls_sw_poll_mask(struct socket *sock, __poll_t events);
 ssize_t tls_sw_splice_read(struct socket *sock, loff_t *ppos,
 			   struct pipe_inode_info *pipe,
 			   size_t len, unsigned int flags);
diff --git a/net/tls/tls_main.c b/net/tls/tls_main.c
index 301f224..a127d61 100644
--- a/net/tls/tls_main.c
+++ b/net/tls/tls_main.c
@@ -712,7 +712,7 @@ static int __init tls_register(void)
 	build_protos(tls_prots[TLSV4], &tcp_prot);
 
 	tls_sw_proto_ops = inet_stream_ops;
-	tls_sw_proto_ops.poll = tls_sw_poll;
+	tls_sw_proto_ops.poll_mask = tls_sw_poll_mask;
 	tls_sw_proto_ops.splice_read = tls_sw_splice_read;
 
 #ifdef CONFIG_TLS_DEVICE
diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index 8ca57d0..34895b7 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -915,23 +915,22 @@ ssize_t tls_sw_splice_read(struct socket *sock,  loff_t *ppos,
 	return copied ? : err;
 }
 
-unsigned int tls_sw_poll(struct file *file, struct socket *sock,
-			 struct poll_table_struct *wait)
+__poll_t tls_sw_poll_mask(struct socket *sock, __poll_t events)
 {
-	unsigned int ret;
 	struct sock *sk = sock->sk;
 	struct tls_context *tls_ctx = tls_get_ctx(sk);
 	struct tls_sw_context_rx *ctx = tls_sw_ctx_rx(tls_ctx);
+	__poll_t mask;
 
-	/* Grab POLLOUT and POLLHUP from the underlying socket */
-	ret = ctx->sk_poll(file, sock, wait);
+	/* Grab EPOLLOUT and EPOLLHUP from the underlying socket */
+	mask = ctx->sk_poll_mask(sock, events);
 
-	/* Clear POLLIN bits, and set based on recv_pkt */
-	ret &= ~(POLLIN | POLLRDNORM);
+	/* Clear EPOLLIN bits, and set based on recv_pkt */
+	mask &= ~(EPOLLIN | EPOLLRDNORM);
 	if (ctx->recv_pkt)
-		ret |= POLLIN | POLLRDNORM;
+		mask |= EPOLLIN | EPOLLRDNORM;
 
-	return ret;
+	return mask;
 }
 
 static int tls_read_size(struct strparser *strp, struct sk_buff *skb)
@@ -1188,7 +1187,7 @@ int tls_set_sw_offload(struct sock *sk, struct tls_context *ctx, int tx)
 		sk->sk_data_ready = tls_data_ready;
 		write_unlock_bh(&sk->sk_callback_lock);
 
-		sw_ctx_rx->sk_poll = sk->sk_socket->ops->poll;
+		sw_ctx_rx->sk_poll_mask = sk->sk_socket->ops->poll_mask;
 
 		strp_check_rcv(&sw_ctx_rx->strp);
 	}
-- 
2.9.5

^ permalink raw reply related

* Re: [PATCH] net: dsa: add error handling for pskb_trim_rcsum
From: David Miller @ 2018-06-11 21:20 UTC (permalink / raw)
  To: jiazhouyang09; +Cc: andrew, vivien.didelot, f.fainelli, netdev, linux-kernel
In-Reply-To: <1528694795-32000-1-git-send-email-jiazhouyang09@gmail.com>

From: Zhouyang Jia <jiazhouyang09@gmail.com>
Date: Mon, 11 Jun 2018 13:26:35 +0800

> When pskb_trim_rcsum fails, the lack of error-handling code may
> cause unexpected results.
> 
> This patch adds error-handling code after calling pskb_trim_rcsum.
> 
> Signed-off-by: Zhouyang Jia <jiazhouyang09@gmail.com>

Applied and queued up for -stable, thanks.

^ permalink raw reply

* Re: [PATCH net] ipv6: allow PMTU exceptions to local routes
From: David Miller @ 2018-06-11 21:17 UTC (permalink / raw)
  To: ja; +Cc: netdev, kafai, kernel-team, lvs-devel
In-Reply-To: <20180610230254.6347-1-ja@ssi.bg>

From: Julian Anastasov <ja@ssi.bg>
Date: Mon, 11 Jun 2018 02:02:54 +0300

> IPVS setups with local client and remote tunnel server need
> to create exception for the local virtual IP. What we do is to
> change PMTU from 64KB (on "lo") to 1460 in the common case.
> 
> Suggested-by: Martin KaFai Lau <kafai@fb.com>
> Fixes: 45e4fd26683c ("ipv6: Only create RTF_CACHE routes after encountering pmtu exception")
> Fixes: 7343ff31ebf0 ("ipv6: Don't create clones of host routes.")
> Signed-off-by: Julian Anastasov <ja@ssi.bg>

Applied and queued up for -stable, thanks.

^ permalink raw reply

* [Patch net] smc: convert to ->poll_mask
From: Cong Wang @ 2018-06-11 21:07 UTC (permalink / raw)
  To: netdev; +Cc: penguin-kernel, Cong Wang, Christoph Hellwig, Ursula Braun

smc->clcsock is an internal TCP socket, after TCP socket
converts to ->poll_mask, ->poll doesn't exist any more.
So just convert smc socket to ->poll_mask too.

Fixes: 2c7d3dacebd4 ("net/tcp: convert to ->poll_mask")
Reported-by: syzbot+f5066e369b2d5fff630f@syzkaller.appspotmail.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
---
 net/smc/af_smc.c | 12 +++---------
 1 file changed, 3 insertions(+), 9 deletions(-)

diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
index 973b4471b532..da7f02edcd37 100644
--- a/net/smc/af_smc.c
+++ b/net/smc/af_smc.c
@@ -1273,8 +1273,7 @@ static __poll_t smc_accept_poll(struct sock *parent)
 	return mask;
 }
 
-static __poll_t smc_poll(struct file *file, struct socket *sock,
-			     poll_table *wait)
+static __poll_t smc_poll_mask(struct socket *sock, __poll_t events)
 {
 	struct sock *sk = sock->sk;
 	__poll_t mask = 0;
@@ -1290,7 +1289,7 @@ static __poll_t smc_poll(struct file *file, struct socket *sock,
 	if ((sk->sk_state == SMC_INIT) || smc->use_fallback) {
 		/* delegate to CLC child sock */
 		release_sock(sk);
-		mask = smc->clcsock->ops->poll(file, smc->clcsock, wait);
+		mask = smc->clcsock->ops->poll_mask(smc->clcsock, events);
 		lock_sock(sk);
 		sk->sk_err = smc->clcsock->sk->sk_err;
 		if (sk->sk_err) {
@@ -1308,11 +1307,6 @@ static __poll_t smc_poll(struct file *file, struct socket *sock,
 			}
 		}
 	} else {
-		if (sk->sk_state != SMC_CLOSED) {
-			release_sock(sk);
-			sock_poll_wait(file, sk_sleep(sk), wait);
-			lock_sock(sk);
-		}
 		if (sk->sk_err)
 			mask |= EPOLLERR;
 		if ((sk->sk_shutdown == SHUTDOWN_MASK) ||
@@ -1625,7 +1619,7 @@ static const struct proto_ops smc_sock_ops = {
 	.socketpair	= sock_no_socketpair,
 	.accept		= smc_accept,
 	.getname	= smc_getname,
-	.poll		= smc_poll,
+	.poll_mask	= smc_poll_mask,
 	.ioctl		= smc_ioctl,
 	.listen		= smc_listen,
 	.shutdown	= smc_shutdown,
-- 
2.13.0

^ permalink raw reply related

* RE: [net] fq_codel: fix NULL pointer deref in fq_codel_reset
From: Keller, Jacob E @ 2018-06-11 20:47 UTC (permalink / raw)
  To: Cong Wang, Kirsher, Jeffrey T
  Cc: David Miller, Linux Kernel Network Developers, nhorman@redhat.com,
	sassmann@redhat.com, jogreene@redhat.com, Eric Dumazet
In-Reply-To: <CAM_iQpWmcBtQn8+g-QG2UKENeWYD=Zqq_78E8oNxS7tkLi0Cvw@mail.gmail.com>

> -----Original Message-----
> From: Cong Wang [mailto:xiyou.wangcong@gmail.com]
> Sent: Monday, June 11, 2018 1:03 PM
> To: Kirsher, Jeffrey T <jeffrey.t.kirsher@intel.com>
> Cc: David Miller <davem@davemloft.net>; Keller, Jacob E
> <jacob.e.keller@intel.com>; Linux Kernel Network Developers
> <netdev@vger.kernel.org>; nhorman@redhat.com; sassmann@redhat.com;
> jogreene@redhat.com; Eric Dumazet <edumazet@google.com>
> Subject: Re: [net] fq_codel: fix NULL pointer deref in fq_codel_reset
> 

> Making q->flows_cnt 0 is simpler and easier to understand.

Feel free to propose such a patch :)

Thanks,
Jake

^ permalink raw reply

* Re: [PATCH] iwlwifi: pcie: make array prop static, shrinks object size
From: Greg Kroah-Hartman @ 2018-06-11 20:45 UTC (permalink / raw)
  To: Joe Perches
  Cc: Colin King, Johannes Berg, Emmanuel Grumbach, Luca Coelho,
	Intel Linux Wireless, Kalle Valo, David S . Miller,
	linux-wireless, netdev, kernel-janitors, linux-kernel
In-Reply-To: <7f508dccf95cff7fb8e083666b8939c58f65f894.camel@perches.com>

On Mon, Jun 11, 2018 at 12:40:55PM -0700, Joe Perches wrote:
> (adding Greg KH)
> 
> On Mon, 2018-06-11 at 18:15 +0100, Colin King wrote:
> > From: Colin Ian King <colin.king@canonical.com>
> > 
> > Don't populate the read-only array 'prop' on the stack but
> > instead make it static. Makes the object code smaller by 20 bytes:
> > 
> > Before:
> >    text    data     bss     dec     hex filename
> >   71659   14614     576   86849   15341 trans.o
> > 
> > After:
> >    text    data     bss     dec     hex filename
> >   71479   14774     576   86829   1532d trans.o
> > 
> > (gcc version 7.3.0 x86_64)
> > 
> > Signed-off-by: Colin Ian King <colin.king@canonical.com>
> > ---
> >  drivers/net/wireless/intel/iwlwifi/pcie/trans.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/drivers/net/wireless/intel/iwlwifi/pcie/trans.c b/drivers/net/wireless/intel/iwlwifi/pcie/trans.c
> > index 7229991ae70d..c4626ebe5da1 100644
> > --- a/drivers/net/wireless/intel/iwlwifi/pcie/trans.c
> > +++ b/drivers/net/wireless/intel/iwlwifi/pcie/trans.c
> > @@ -1946,7 +1946,7 @@ static void iwl_trans_pcie_removal_wk(struct work_struct *wk)
> >  	struct iwl_trans_pcie_removal *removal =
> >  		container_of(wk, struct iwl_trans_pcie_removal, work);
> >  	struct pci_dev *pdev = removal->pdev;
> > -	char *prop[] = {"EVENT=INACCESSIBLE", NULL};
> > +	static char *prop[] = {"EVENT=INACCESSIBLE", NULL};
> >  
> >  	dev_err(&pdev->dev, "Device gone - attempting removal\n");
> >  	kobject_uevent_env(&pdev->dev.kobj, KOBJ_CHANGE, prop);
> 
> Now what is happening is that prop is being reloaded
> each invocation with the constant addresses of the strings.
> 
> It seems the prototype and function for kobject_uevent_env
> should change as well to avoid this.
> 
> Perhaps this should become:
> ---
>  drivers/net/wireless/intel/iwlwifi/pcie/trans.c | 2 +-
>  include/linux/kobject.h                         | 2 +-
>  lib/kobject_uevent.c                            | 2 +-
>  3 files changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/net/wireless/intel/iwlwifi/pcie/trans.c b/drivers/net/wireless/intel/iwlwifi/pcie/trans.c
> index 7229991ae70d..6668a8aad22e 100644
> --- a/drivers/net/wireless/intel/iwlwifi/pcie/trans.c
> +++ b/drivers/net/wireless/intel/iwlwifi/pcie/trans.c
> @@ -1946,7 +1946,7 @@ static void iwl_trans_pcie_removal_wk(struct work_struct *wk)
>  	struct iwl_trans_pcie_removal *removal =
>  		container_of(wk, struct iwl_trans_pcie_removal, work);
>  	struct pci_dev *pdev = removal->pdev;
> -	char *prop[] = {"EVENT=INACCESSIBLE", NULL};
> +	static const char * const prop[] = {"EVENT=INACCESSIBLE", NULL};
>  
>  	dev_err(&pdev->dev, "Device gone - attempting removal\n");
>  	kobject_uevent_env(&pdev->dev.kobj, KOBJ_CHANGE, prop);
> diff --git a/include/linux/kobject.h b/include/linux/kobject.h
> index 7f6f93c3df9c..9f5cf553dd1e 100644
> --- a/include/linux/kobject.h
> +++ b/include/linux/kobject.h
> @@ -217,7 +217,7 @@ extern struct kobject *firmware_kobj;
>  
>  int kobject_uevent(struct kobject *kobj, enum kobject_action action);
>  int kobject_uevent_env(struct kobject *kobj, enum kobject_action action,
> -			char *envp[]);
> +			const char * const envp[]);
>  int kobject_synth_uevent(struct kobject *kobj, const char *buf, size_t count);
>  
>  __printf(2, 3)
> diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
> index 63d0816ab23b..9107989a0cc8 100644
> --- a/lib/kobject_uevent.c
> +++ b/lib/kobject_uevent.c
> @@ -452,7 +452,7 @@ static void zap_modalias_env(struct kobj_uevent_env *env)
>   * corresponding error when it fails.
>   */
>  int kobject_uevent_env(struct kobject *kobj, enum kobject_action action,
> -		       char *envp_ext[])
> +		       const char * const envp_ext[])
>  {
>  	struct kobj_uevent_env *env;
>  	const char *action_string = kobject_actions[action];

No objection from me, care to make it a real patch so that I can apply
it after 4.18-rc1 is out?

thanks,

greg k-h

^ permalink raw reply

* Re: [net] fq_codel: fix NULL pointer deref in fq_codel_reset
From: Cong Wang @ 2018-06-11 20:22 UTC (permalink / raw)
  To: Keller, Jacob E
  Cc: Kirsher, Jeffrey T, davem@davemloft.net, netdev@vger.kernel.org,
	nhorman@redhat.com, sassmann@redhat.com, jogreene@redhat.com,
	Eric Dumazet
In-Reply-To: <02874ECE860811409154E81DA85FBB5884BC3EC7@ORSMSX115.amr.corp.intel.com>

On Mon, Jun 11, 2018 at 12:57 PM, Keller, Jacob E
<jacob.e.keller@intel.com> wrote:
>
> I'm open to alternative suggestinos for fixing this, I think Eric suggested that maybe we should just remove the ->reset() call from qdisc_destroy..?

You can't remove ->reset() for non-failure call path.

For failure path, yeah, but it is much simpler to just make
q->flows_cnt be 0 for this specific case.

^ permalink raw reply

* Re: 4.17.0-10146-gf0dc7f9c6dd9: hw csum failure on powerpc+sungem
From: Mathieu Malaterre @ 2018-06-11 20:20 UTC (permalink / raw)
  To: Meelis Roos; +Cc: netdev, Mauro Carvalho Chehab, linuxppc-dev, LKML
In-Reply-To: <alpine.LRH.2.21.1806111352330.17091@math.ut.ee>

Hi Meelis,

On Mon, Jun 11, 2018 at 1:21 PM Meelis Roos <mroos@linux.ee> wrote:
>
> I am seeing this on PowerMac G4 with sungem ethernet driver. 4.17 was
> OK, 4.17.0-10146-gf0dc7f9c6dd9 is problematic.

Same here.

> [  140.518664] eth0: hw csum failure
> [  140.518699] CPU: 0 PID: 1237 Comm: postconf Not tainted 4.17.0-10146-gf0dc7f9c6dd9 #83
> [  140.518707] Call Trace:
> [  140.518734] [effefd90] [c03d6db8] __skb_checksum_complete+0xd8/0xdc (unreliable)
> [  140.518759] [effefdb0] [c04c1284] icmpv6_rcv+0x248/0x4ec
> [  140.518775] [effefdd0] [c049a448] ip6_input_finish.constprop.0+0x11c/0x5f4
> [  140.518786] [effefe10] [c049b1c0] ip6_mc_input+0xcc/0x100
> [  140.518807] [effefe20] [c03e110c] __netif_receive_skb_core+0x310/0x944
> [  140.518820] [effefe70] [c03e76ec] napi_gro_receive+0xd0/0xe8
> [  140.518845] [effefe80] [f3e1f66c] gem_poll+0x618/0x1274 [sungem]
> [  140.518856] [effeff30] [c03e6f0c] net_rx_action+0x198/0x374
> [  140.518872] [effeff90] [c0501a88] __do_softirq+0x120/0x278
> [  140.518890] [effeffe0] [c0036188] irq_exit+0xd8/0xdc
> [  140.518908] [effefff0] [c000f478] call_do_irq+0x24/0x3c
> [  140.518925] [d05a5d30] [c0007120] do_IRQ+0x74/0xf0
> [  140.518941] [d05a5d50] [c0012474] ret_from_except+0x0/0x14
> [  140.518960] --- interrupt: 501 at copy_page+0x40/0x90
>                    LR = copy_user_page+0x18/0x30
> [  140.518973] [d05a5e10] [d058cd80] 0xd058cd80 (unreliable)
> [  140.518989] [d05a5e20] [c00fa2bc] wp_page_copy+0xec/0x654
> [  140.519002] [d05a5e60] [c00fd3a4] do_wp_page+0xa8/0x5b4
> [  140.519013] [d05a5e90] [c00fe934] handle_mm_fault+0x564/0xa84
> [  140.519025] [d05a5f00] [c0016230] do_page_fault+0x1bc/0x7e8
> [  140.519037] [d05a5f40] [c0012300] handle_page_fault+0x14/0x40
> [  140.519048] --- interrupt: 301 at 0xb78b6864
>                    LR = 0xb78b6c54
>

For some reason if I do a git bisect it returns that:

$ git bisect good
3036bc45364f98515a2c446d7fac2c34dcfbeff4 is the first bad commit

Could you also check on your side please.

> --
> Meelis Roos (mroos@linux.ee)

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox