Netdev List

* Re: [PATCH 4.4 stable net] net: tcp: Fix use-after-free in tcp_write_xmit
From: maowenan @ 2019-07-26  9:10 UTC (permalink / raw)
  To: Eric Dumazet, davem, gregkh, netdev, linux-kernel
In-Reply-To: <510109e3-101f-517c-22b4-921432f04fe5@gmail.com>



On 2019/7/25 14:19, Eric Dumazet wrote:
> 
> 
> On 7/25/19 6:29 AM, maowenan wrote:
>>
> 
>>>>>> Syzkaller reproducer():
>>>>>> r0 = socket$packet(0x11, 0x3, 0x300)
>>>>>> r1 = socket$inet_tcp(0x2, 0x1, 0x0)
>>>>>> bind$inet(r1, &(0x7f0000000300)={0x2, 0x4e21, @multicast1}, 0x10)
>>>>>> connect$inet(r1, &(0x7f0000000140)={0x2, 0x1000004e21, @loopback}, 0x10)
>>>>>> recvmmsg(r1, &(0x7f0000001e40)=[{{0x0, 0x0, &(0x7f0000000100)=[{&(0x7f00000005c0)=""/88, 0x58}], 0x1}}], 0x1, 0x40000000, 0x0)
>>>>>> sendto$inet(r1, &(0x7f0000000000)="e2f7ad5b661c761edf", 0x9, 0x8080, 0x0, 0x0)
>>>>>> r2 = fcntl$dupfd(r1, 0x0, r0)
>>>>>> connect$unix(r2, &(0x7f00000001c0)=@file={0x0, './file0\x00'}, 0x6e)
>>>>>>
>>>
>>> It does call tcp_disconnect(), by one of the connect() call.
>>
>> yes, __inet_stream_connect will call tcp_disconnect when sa_family == AF_UNSPEC, in c repro if it
>> passes sa_family with AF_INET it won't call disconnect, and then sk_send_head won't be NULL when tcp_connect.
>>
> 
> 
> Look again at the Syzkaller reproducer()
> 
> It definitely uses tcp_disconnect()
> 
> Do not be fooled by connect$unix(), this is a connect() call really, with AF_UNSPEC

Right, the syzkaller reproducer calls connect() with AF_UNSPEC. Actually, I can reproduce the issue only with the C repro (https://syzkaller.appspot.com/text?tag=ReproC&x=14db474f800000).
The syscall sequence in the C repro is:
__NR_socket
__NR_bind
__NR_sendto  (flags = 0x20000000, MSG_FASTOPEN; it calls __inet_stream_connect with sa_family = AF_INET, sk->sk_send_head = NULL)
__NR_write
__NR_connect (calls __inet_stream_connect with sa_family = AF_UNSPEC, which calls tcp_disconnect and sets sk->sk_send_head = NULL)
__NR_connect (calls __inet_stream_connect with sa_family = AF_INET; if sk->sk_send_head != NULL, the UAF happens)

I debugged why sk->sk_send_head is not NULL after the next __NR_connect even though tcp_disconnect has already set it to NULL.
I found that some packets are sent before the second __NR_connect (with AF_INET), so sk_send_head is modified via tcp_sendmsg->skb_entail->tcp_add_write_queue_tail:
static inline void tcp_add_write_queue_tail(struct sock *sk, struct sk_buff *skb)
{
	__tcp_add_write_queue_tail(sk, skb);

	/* Queue it, remembering where we must start sending. */
	if (sk->sk_send_head == NULL) {
		sk->sk_send_head = skb;  //here, sk->sk_send_head is changed.

		if (tcp_sk(sk)->highest_sack == NULL)
			tcp_sk(sk)->highest_sack = skb;
	}
}
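
The effect described above can be sketched in a small userspace model. This is only an illustration, not kernel code: struct model_sock, struct model_skb and both helpers are hypothetical names, and model_disconnect() covers only the "clear the queue and sk_send_head" part of tcp_disconnect() that matters here.

```c
#include <stddef.h>

/* Hypothetical userspace model, not kernel code. */
struct model_skb {
	struct model_skb *next;
};

struct model_sock {
	struct model_skb *queue_head;
	struct model_skb *queue_tail;
	struct model_skb *send_head;	/* first skb not yet sent */
};

/* Same decision as tcp_add_write_queue_tail(): append the skb and,
 * if nothing is pending, remember it as the place to start sending.
 */
static void model_add_write_queue_tail(struct model_sock *sk,
				       struct model_skb *skb)
{
	skb->next = NULL;
	if (sk->queue_tail)
		sk->queue_tail->next = skb;
	else
		sk->queue_head = skb;
	sk->queue_tail = skb;

	if (sk->send_head == NULL)
		sk->send_head = skb;
}

/* Assumption: models only the queue purge done on disconnect. */
static void model_disconnect(struct model_sock *sk)
{
	sk->queue_head = NULL;
	sk->queue_tail = NULL;
	sk->send_head = NULL;
}
```

In this model, any data queued after model_disconnect() makes send_head non-NULL again, which matches the observation that sk_send_head is set again before the second connect().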




> 
> 





* Re: [PATCH net-next 1/4] sctp: check addr_size with sa_family_t size in __sctp_setsockopt_connectx
From: Xin Long @ 2019-07-26  9:11 UTC (permalink / raw)
  To: Neil Horman; +Cc: Marcelo Ricardo Leitner, network dev, linux-sctp, davem
In-Reply-To: <20190724204332.GF7212@hmswarspite.think-freely.org>

On Thu, Jul 25, 2019 at 4:44 AM Neil Horman <nhorman@tuxdriver.com> wrote:
>
> On Wed, Jul 24, 2019 at 04:12:43PM -0300, Marcelo Ricardo Leitner wrote:
> > On Wed, Jul 24, 2019 at 04:05:43PM -0300, Marcelo Ricardo Leitner wrote:
> > > On Wed, Jul 24, 2019 at 02:44:56PM -0400, Neil Horman wrote:
> > > > On Wed, Jul 24, 2019 at 09:49:07AM -0300, Marcelo Ricardo Leitner wrote:
> > > > > On Wed, Jul 24, 2019 at 09:36:50AM -0300, Marcelo Ricardo Leitner wrote:
> > > > > > On Wed, Jul 24, 2019 at 07:22:35AM -0400, Neil Horman wrote:
> > > > > > > On Wed, Jul 24, 2019 at 03:21:12PM +0800, Xin Long wrote:
> > > > > > > > On Tue, Jul 23, 2019 at 11:25 PM Neil Horman <nhorman@tuxdriver.com> wrote:
> > > > > > > > >
> > > > > > > > > On Tue, Jul 23, 2019 at 01:37:57AM +0800, Xin Long wrote:
> > > > > > > > > > Now __sctp_connect() is called by __sctp_setsockopt_connectx() and
> > > > > > > > > > sctp_inet_connect(), the latter has done addr_size check with size
> > > > > > > > > > of sa_family_t.
> > > > > > > > > >
> > > > > > > > > > In the next patch to clean up __sctp_connect(), we will remove
> > > > > > > > > > addr_size check with size of sa_family_t from __sctp_connect()
> > > > > > > > > > for the 1st address.
> > > > > > > > > >
> > > > > > > > > > So before doing that, __sctp_setsockopt_connectx() should do
> > > > > > > > > > this check first, as sctp_inet_connect() does.
> > > > > > > > > >
> > > > > > > > > > Signed-off-by: Xin Long <lucien.xin@gmail.com>
> > > > > > > > > > ---
> > > > > > > > > >  net/sctp/socket.c | 2 +-
> > > > > > > > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > > > > > > > >
> > > > > > > > > > diff --git a/net/sctp/socket.c b/net/sctp/socket.c
> > > > > > > > > > index aa80cda..5f92e4a 100644
> > > > > > > > > > --- a/net/sctp/socket.c
> > > > > > > > > > +++ b/net/sctp/socket.c
> > > > > > > > > > @@ -1311,7 +1311,7 @@ static int __sctp_setsockopt_connectx(struct sock *sk,
> > > > > > > > > >       pr_debug("%s: sk:%p addrs:%p addrs_size:%d\n",
> > > > > > > > > >                __func__, sk, addrs, addrs_size);
> > > > > > > > > >
> > > > > > > > > > -     if (unlikely(addrs_size <= 0))
> > > > > > > > > > +     if (unlikely(addrs_size < sizeof(sa_family_t)))
> > > > > > > > > I don't think this is what you want to check for here.  sa_family_t is
> > > > > > > > > an unsigned short, and addrs_size is the number of bytes in the addrs
> > > > > > > > > array.  The addrs array should be at least the size of one struct
> > > > > > > > > sockaddr (16 bytes iirc), and, if larger, should be a multiple of
> > > > > > > > > sizeof(struct sockaddr)
> > > > > > > > sizeof(struct sockaddr) is not the right value to check either.
> > > > > > > >
> > > > > > > > The proper check will be done later in __sctp_connect():
> > > > > > > >
> > > > > > > >         af = sctp_get_af_specific(daddr->sa.sa_family);
> > > > > > > >         if (!af || af->sockaddr_len > addrs_size)
> > > > > > > >                 return -EINVAL;
> > > > > > > >
> > > > > > > > So the check 'addrs_size < sizeof(sa_family_t)' in this patch is
> > > > > > > > just to make sure daddr->sa.sa_family is accessible. the same
> > > > > > > > check is also done in sctp_inet_connect().
> > > > > > > >
> > > > > > > That doesn't make much sense, if the proper check is done in __sctp_connect with
> > > > > > > the size of the families sockaddr_len, then we don't need this check at all, we
> > > > > > > can just let memdup_user take the fault on copy_to_user and return -EFAULT.  If
> > > > > > > we get that from memdup_user, we know its not accessible, and can bail out.
> > > > > > >
> > > > > > > About the only thing we need to check for here is that addr_len isn't some
> > > > > > > absurdly high value (i.e. a negative value), so that we avoid trying to kmalloc
> > > > > > > upwards of 2G in memdup_user.  Your change does that just fine, but its no
> > > > > > > better or worse than checking for <=0
> > > > > >
> > > > > > One can argue that such check against absurdly high values is random
> > > > > > and not effective, as 2G can be somewhat reasonable on 8GB systems but
> > > > > > certainly isn't on 512MB ones. On that, kmemdup_user() will also fail
> > > > > > gracefully as it uses GFP_USER and __GFP_NOWARN.
> > > > > >
> > > > > > The original check is more for protecting for sane usage of the
> > > > > > variable, which is an int, and a negative value is questionable. We
> > > > > > could cast, yes, but.. was that really the intent of the application?
> > > > > > Probably not.
> > > > >
> > > > > Though that said, I'm okay with the new check here: a quick sanity
> > > > > check that can avoid expensive calls to kmalloc(), while more refined
> > > > > check is done later on.
> > > > >
> > > > I agree a sanity check makes sense, just to avoid allocating a huge value
> > > > (even 2G is absurd on many systems), however, I'm not super comfortable with
> > > > checking for the value being less than 16 (sizeof(sa_family_t)).  The zero check
> > >
> > > 16 bits you mean then, per
> > > include/uapi/linux/socket.h
> > > typedef unsigned short __kernel_sa_family_t;
> > > include/linux/socket.h
> > > typedef __kernel_sa_family_t    sa_family_t;
> > >
> > > > is fairly obvious given the signed nature of the lengh field, this check makes
> > > > me wonder what exactly we are checking for.
> > >
> > > A minimum viable buffer without doing more extensive tests. Beyond
> > > sa_family, we need to parse sa_family and then that's left for later.
> > > Perhaps a comment helps, something like
> > >     /* Check if we have at least the family type in there */
> > > ?
> >
> > Hm, then this could be
> > -     if (unlikely(addrs_size <= 0))
> > +     if (unlikely(addrs_size < sizeof(struct sockaddr_in)))
> > (ipv4)
> > As it can't be smaller than that, always.
> >
> True, but I think perhaps just the family type size check is more correct, as
> thats the minimal information we need to get the proper sockaddr_len out of
> sctp_get_af_specific.
Okay, I will keep the check "addrs_size < sizeof(sa_family_t)" in this
patch and remove the useless variables in patch 2/4 when sending v2.

Thanks.
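
For readers following the thread, the two-stage validation agreed on here can be sketched in userspace C. This is a hedged illustration, not the SCTP code: model_sockaddr_len() is a hypothetical stand-in for the sockaddr_len lookup done through sctp_get_af_specific(), and the sketch assumes the Linux layout where sa_family is the first member of every sockaddr variant. The explicit (int) cast is a choice made for this sketch so that negative sizes are rejected by a signed comparison.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>

/* Hypothetical stand-in for the per-family sockaddr_len lookup.
 * Returns 0 for families this sketch does not know.
 */
static size_t model_sockaddr_len(sa_family_t family)
{
	switch (family) {
	case AF_INET:
		return sizeof(struct sockaddr_in);
	case AF_INET6:
		return sizeof(struct sockaddr_in6);
	default:
		return 0;
	}
}

/* Returns 0 if the first address in the buffer looks usable, -EINVAL otherwise. */
static int model_check_first_addr(const void *addrs, int addrs_size)
{
	sa_family_t family;
	size_t len;

	/* Stage 1: make sure the family field itself is readable. */
	if (addrs_size < (int)sizeof(sa_family_t))
		return -EINVAL;

	/* Assumes sa_family sits at offset 0 (Linux layout). */
	memcpy(&family, addrs, sizeof(family));

	/* Stage 2: the family decides how long the address must be. */
	len = model_sockaddr_len(family);
	if (len == 0 || len > (size_t)addrs_size)
		return -EINVAL;

	return 0;
}
```

A buffer of only sizeof(sa_family_t) bytes passes stage 1 but still fails stage 2, which is the division of labour discussed above: the early check only guarantees that reading the family is safe.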

>
> Neil
>
> > >
> > >   Marcelo
> > >
> > > >
> > > > Neil
> > > >
> > > > > >
> > > > > > >
> > > > > > > Neil
> > > > > > >
> > > > > > > > >
> > > > > > > > > Neil
> > > > > > > > >
> > > > > > > > > >               return -EINVAL;
> > > > > > > > > >
> > > > > > > > > >       kaddrs = memdup_user(addrs, addrs_size);
> > > > > > > > > > --
> > > > > > > > > > 2.1.0
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > >
> > > > >
> > >
> >


* Re: [PATCH 1/2] net: ipv4: Fix a possible null-pointer dereference in inet_csk_rebuild_route()
From: Nicolas Dichtel @ 2019-07-26  9:15 UTC (permalink / raw)
  To: Jia-Ju Bai, davem, kuznet, yoshfuji; +Cc: netdev, linux-kernel
In-Reply-To: <20190726022534.24994-1-baijiaju1990@gmail.com>

On 26/07/2019 at 04:25, Jia-Ju Bai wrote:
> In inet_csk_rebuild_route(), rt is assigned to NULL on line 1071.
> On line 1076, rt is used:
>     return &rt->dst;
> Thus, a possible null-pointer dereference may occur.
>
> To fix this bug, rt is checked before being used.
> 
> This bug is found by a static analysis tool STCheck written by us.
> 
> Signed-off-by: Jia-Ju Bai <baijiaju1990@gmail.com>
> ---
>  net/ipv4/inet_connection_sock.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
> index f5c163d4771b..27d9d80f3401 100644
> --- a/net/ipv4/inet_connection_sock.c
> +++ b/net/ipv4/inet_connection_sock.c
> @@ -1073,7 +1073,10 @@ static struct dst_entry *inet_csk_rebuild_route(struct sock *sk, struct flowi *f
>  		sk_setup_caps(sk, &rt->dst);
>  	rcu_read_unlock();
>  
> -	return &rt->dst;
> +	if (rt)
> +		return &rt->dst;
> +	else
> +		return NULL;
Hmm, ->dst is the first field (and that will never change), thus &rt->dst is
NULL if rt is NULL.
I don't think there is a problem with the current code.
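
The "first field" argument can be illustrated with a userspace sketch. The struct names below are hypothetical stand-ins for struct rtable and struct dst_entry, and the helper works on integer addresses rather than dereferencing a null pointer, so the example stays within defined C behaviour.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for struct dst_entry and struct rtable. */
struct model_dst_entry {
	int placeholder;
};

struct model_rtable {
	struct model_dst_entry dst;	/* first member, as in struct rtable */
	int other;
};

/* Address of the dst member for an object located at rt_addr.
 * With dst at offset 0 this is rt_addr itself, so a NULL (0)
 * object address yields a NULL member address.
 */
static uintptr_t model_dst_addr(uintptr_t rt_addr)
{
	return rt_addr + offsetof(struct model_rtable, dst);
}
```

If dst were ever moved away from offset 0, model_dst_addr(0) would become non-zero and the explicit check would be needed; the argument above relies on that never happening.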


Regards,
Nicolas


* [PATCH] Revert "net: get rid of an signed integer overflow in ip_idents_reserve()"
From: Shaokun Zhang @ 2019-07-26  9:17 UTC (permalink / raw)
  To: netdev, linux-kernel
  Cc: Yang Guo, David S. Miller, Alexey Kuznetsov, Hideaki YOSHIFUJI,
	Eric Dumazet, Jiri Pirko

From: Yang Guo <guoyang2@huawei.com>

There is a significant performance regression caused by commit
adb03115f459
("net: get rid of an signed integer overflow in ip_idents_reserve()").

On both an x86 server (Skylake) and an ARM64 server, the CPU usage of
ip_idents_reserve() becomes very high as the number of CPU cores
increases, and performance degrades. Reverting the patch avoids this
problem as the number of CPU cores increases.

With the patch on x86, ip_idents_reserve() cpu usage is 63.05% when
iperf3 is run with 32 cpu cores.
Samples: 18K of event 'cycles:ppp', Event count (approx.)
  Children      Self  Command  Shared Object      Symbol
    63.18%    63.05%  iperf3   [kernel.vmlinux]   [k] ip_idents_reserve

And the IOPS is 4483830pps.
10:46:13 AM     IFACE   rxpck/s   txpck/s    rxkB/s    txkB/s
10:46:14 AM        lo 4483830.00 4483830.00 192664.57 192664.57

With the patch reverted, ip_idents_reserve() cpu usage is 17.05%.
Samples: 37K of event 'cycles:ppp', 4000 Hz, Event count (approx.)
  Children      Self  Shared Object      Symbol
    17.07%    17.05%  [kernel]           [k] ip_idents_reserve

And the IOPS is 11600213pps.
05:03:15 PM     IFACE   rxpck/s   txpck/s    rxkB/s    txkB/s
05:03:16 PM        lo 11600213.00 11600213.00 498446.65 498446.65

The performance regression was also found on ARM64 server and discussed
a few days ago:
https://lore.kernel.org/netdev/98b95fbe-adcc-c95f-7f3d-6c57122f4586@pengutronix.de/T/#t

Cc: "David S. Miller" <davem@davemloft.net> 
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> 
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Yang Guo <guoyang2@huawei.com>
---
 net/ipv4/route.c | 10 ++--------
 1 file changed, 2 insertions(+), 8 deletions(-)

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 517300d..dff457b 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -489,18 +489,12 @@ u32 ip_idents_reserve(u32 hash, int segs)
 	atomic_t *p_id = ip_idents + hash % IP_IDENTS_SZ;
 	u32 old = READ_ONCE(*p_tstamp);
 	u32 now = (u32)jiffies;
-	u32 new, delta = 0;
+	u32 delta = 0;
 
 	if (old != now && cmpxchg(p_tstamp, old, now) == old)
 		delta = prandom_u32_max(now - old);
 
-	/* Do not use atomic_add_return() as it makes UBSAN unhappy */
-	do {
-		old = (u32)atomic_read(p_id);
-		new = old + delta + segs;
-	} while (atomic_cmpxchg(p_id, old, new) != old);
-
-	return new - segs;
+	return atomic_add_return(segs + delta, p_id) - segs;
 }
 EXPORT_SYMBOL(ip_idents_reserve);
 
-- 
1.8.3.1
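
The two update strategies touched by this patch can be compared in a userspace sketch using C11 atomics. This is an assumption-laden illustration, not the kernel implementation: the function names are hypothetical, and the counter is an unsigned 32-bit value here (so wraparound is well defined), whereas the kernel's atomic_t is a signed int.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Variant A: compare-and-swap retry loop, the shape of the code
 * being removed. Under contention the loop may retry many times.
 */
static uint32_t model_reserve_cmpxchg(_Atomic uint32_t *p_id,
				      uint32_t delta, uint32_t segs)
{
	uint32_t old = atomic_load(p_id);
	uint32_t new_val;

	do {
		new_val = old + delta + segs;
		/* On failure, old is reloaded with the current value. */
	} while (!atomic_compare_exchange_weak(p_id, &old, new_val));

	return new_val - segs;
}

/* Variant B: one fetch-and-add, the shape of the code being restored.
 * atomic_fetch_add() returns the previous value, so the increment is
 * added back to mimic an "add and return new value" primitive.
 */
static uint32_t model_reserve_add(_Atomic uint32_t *p_id,
				  uint32_t delta, uint32_t segs)
{
	uint32_t inc = segs + delta;

	return atomic_fetch_add(p_id, inc) + inc - segs;
}
```

Both variants return the same identifier for the same starting state. They differ in how they behave when many CPUs hit the same counter, which is the contention effect reported in the measurements above.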



* Re: [PATCH 2/2] net: ipv6: Fix a possible null-pointer dereference in vti6_link_config()
From: Nicolas Dichtel @ 2019-07-26  9:21 UTC (permalink / raw)
  To: Jia-Ju Bai, davem, kuznet, yoshfuji; +Cc: netdev, linux-kernel
In-Reply-To: <20190726080321.4466-1-baijiaju1990@gmail.com>

On 26/07/2019 at 10:03, Jia-Ju Bai wrote:
> In vti6_link_config(), there is an if statement on line 649 to check
> whether rt is NULL:
>     if (rt)
> 
> When rt is NULL, it is used on line 651:
>     ip6_rt_put(rt);
>         dst_release(&rt->dst);
> 
> Thus, a possible null-pointer dereference may occur.
> 
> To fix this bug, ip6_rt_put() is called when rt is not NULL.
> 
> This bug is found by a static analysis tool STCheck written by us.
> 
> Signed-off-by: Jia-Ju Bai <baijiaju1990@gmail.com>
> ---
>  net/ipv6/ip6_vti.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/net/ipv6/ip6_vti.c b/net/ipv6/ip6_vti.c
> index 024db17386d2..572647205c52 100644
> --- a/net/ipv6/ip6_vti.c
> +++ b/net/ipv6/ip6_vti.c
> @@ -646,9 +646,10 @@ static void vti6_link_config(struct ip6_tnl *t, bool keep_mtu)
>  						 &p->raddr, &p->laddr,
>  						 p->link, NULL, strict);
>  
> -		if (rt)
> +		if (rt) {
>  			tdev = rt->dst.dev;
> -		ip6_rt_put(rt);
> +			ip6_rt_put(rt);
> +		}
Please, look at ip6_rt_put(), it is explicitly stated that it can be called with
rt == NULL.
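
The calling convention referred to here, a release helper that tolerates NULL so that callers need no check of their own, can be sketched as follows. The names are hypothetical and the explicit NULL test is a simplification chosen for this sketch; it does not claim to reproduce how ip6_rt_put() is implemented.

```c
#include <stddef.h>

/* Hypothetical reference-counted object. */
struct model_rt {
	int refcnt;
};

/* Release helper that accepts NULL, so a caller may write
 *     if (rt)
 *             use(rt);
 *     model_rt_put(rt);
 * without guarding the put itself.
 */
static void model_rt_put(struct model_rt *rt)
{
	if (rt == NULL)
		return;
	rt->refcnt--;
}
```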


* RE: [PATCH] rtw88: pci: Use general byte arrays as the elements of RX ring
From: David Laight @ 2019-07-26  9:23 UTC (permalink / raw)
  To: 'Jian-Hong Pan'
  Cc: Yan-Hsuan Chuang, Kalle Valo, David S . Miller,
	linux-wireless@vger.kernel.org, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux@endlessm.com,
	stable@vger.kernel.org
In-Reply-To: <CAPpJ_ecAAw=1X=7+MOw-VVH0ZKBr6rcRub6JnEqgNbZ6Hxt=ag@mail.gmail.com>

From: Jian-Hong Pan 
> Sent: 26 July 2019 07:18
...
> > While allocating all 512 buffers in one block (just over 4MB)
> > is probably not a good idea, you may need to allocated (and dma map)
> > then in groups.
> 
> Thanks for reviewing.  But got questions here to double confirm the idea.
> According to original code, it allocates 512 skbs for RX ring and dma
> mapping one by one.  So, the new code allocates memory buffer 512
> times to get 512 buffer arrays.  Will the 512 buffers arrays be in one
> block?  Do you mean aggregate the buffers as a scatterlist and use
> dma_map_sg?

If you malloc a buffer of size (8192+32) the allocator will either
round it up to a whole number of (often 4k) pages or to a power of
2 of pages - so either 12k or 16k.
I think the Linux allocator does the latter.
Some of the allocators also 'steal' a bit from the front of the buffer
for 'red tape'.

OTOH malloc the space for 15 buffers and the allocator will round the
15*(8192 + 32) up to 32*4k - and you waste under 8k across all the
buffers.

You then dma_map the large buffer and split into the actual rx buffers.
Repeat until you've filled the entire ring.
The only complication is remembering the base address (and size) for
the dma_unmap and free.
Although there is plenty of padding to extend the buffer structure
significantly without using more memory.
Allocate in 15's and you (probably) have 512 bytes per buffer.
Allocate in 31's and you have 256 bytes.
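
The arithmetic above can be checked with a short sketch. It assumes 4k pages and an allocator that rounds up to a power-of-two number of pages, as suggested above; the helper names are hypothetical, and reading the 512 and 256 byte figures as "room per buffer beyond the 8192 byte payload" is an interpretation, not something stated explicitly.

```c
#include <stddef.h>

#define MODEL_PAGE_SIZE	((size_t)4096)
#define MODEL_PAYLOAD	((size_t)8192)
#define MODEL_RX_BUF	(MODEL_PAYLOAD + 32)

/* Round a byte count up to a power-of-two number of pages. */
static size_t model_round_pow2_pages(size_t bytes)
{
	size_t pages = (bytes + MODEL_PAGE_SIZE - 1) / MODEL_PAGE_SIZE;
	size_t rounded = 1;

	while (rounded < pages)
		rounded <<= 1;
	return rounded * MODEL_PAGE_SIZE;
}

/* Bytes lost to rounding when n rx buffers share one allocation. */
static size_t model_group_waste(size_t n)
{
	return model_round_pow2_pages(n * MODEL_RX_BUF) - n * MODEL_RX_BUF;
}

/* Room per buffer beyond the payload when the rounded block is
 * divided evenly between n buffers.
 */
static size_t model_extra_per_buffer(size_t n)
{
	return model_round_pow2_pages(n * MODEL_RX_BUF) / n - MODEL_PAYLOAD;
}
```

Under these assumptions a single buffer rounds up to 16k, a group of 15 wastes 7712 bytes in total and leaves 546 bytes per buffer beyond the payload, and a group of 31 leaves 264 bytes, which is in line with the 512 and 256 byte figures.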

The problem is that larger allocates are more likely to fail
(especially if the system has been running for some time).
So you almost certainly want to be able to fall back to smaller
allocates even though they use more memory.

I also wonder if you actually need 512 8k rx buffers to cover
interrupt latency?
I've not done any measurements for 20 years!

	David


* Re: [PATCH] net: bridge: Allow bridge to joing multicast groups
From: Nikolay Aleksandrov @ 2019-07-26  9:26 UTC (permalink / raw)
  To: Horatiu Vultur; +Cc: roopa, davem, bridge, netdev, linux-kernel, allan.nielsen
In-Reply-To: <b9ce433a-3ef7-fe15-642a-659c5715d992@cumulusnetworks.com>

On 26/07/2019 11:41, Nikolay Aleksandrov wrote:
> On 25/07/2019 17:21, Horatiu Vultur wrote:
>> Hi Nikolay,
>>
>> The 07/25/2019 16:21, Nikolay Aleksandrov wrote:
>>> External E-Mail
>>>
>>>
>>> On 25/07/2019 16:06, Nikolay Aleksandrov wrote:
>>>> On 25/07/2019 14:44, Horatiu Vultur wrote:
>>>>> There is no way to configure the bridge, to receive only specific link
>>>>> layer multicast addresses. From the description of the command 'bridge
>>>>> fdb append' is supposed to do that, but there was no way to notify the
>>>>> network driver that the bridge joined a group, because LLADDR was added
>>>>> to the unicast netdev_hw_addr_list.
>>>>>
>>>>> Therefore update fdb_add_entry to check if the NLM_F_APPEND flag is set
>>>>> and if the source is NULL, which represent the bridge itself. Then add
>>>>> address to multicast netdev_hw_addr_list for each bridge interfaces.
>>>>> And then the .ndo_set_rx_mode function on the driver is called. To notify
>>>>> the driver that the list of multicast mac addresses changed.
>>>>>
>>>>> Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
>>>>> ---
>>>>>  net/bridge/br_fdb.c | 49 ++++++++++++++++++++++++++++++++++++++++++++++---
>>>>>  1 file changed, 46 insertions(+), 3 deletions(-)
>>>>>
>>>>
>>>> Hi,
>>>> I'm sorry but this patch is wrong on many levels, some notes below. In general
>>>> NLM_F_APPEND is only used in vxlan, the bridge does not handle that flag at all.
>>>> FDB is only for *unicast*, nothing is joined and no multicast should be used with fdbs.
>>>> MDB is used for multicast handling, but both of these are used for forwarding.
>>>> The reason the static fdbs are added to the filter is for non-promisc ports, so they can
>>>> receive traffic destined for these FDBs for forwarding.
>>>> If you'd like to join any multicast group please use the standard way, if you'd like to join
>>>> it only on a specific port - join it only on that port (or ports) and the bridge and you'll
>>>
>>> And obviously this is for the case where you're not enabling port promisc mode (non-default).
>>> In general you'll only need to join the group on the bridge to receive traffic for it
>>> or add it as an mdb entry to forward it.
>>>
>>>> have the effect that you're describing. What do you mean there's no way ?
>>
>> Thanks for the explanation.
>> There are few things that are not 100% clear to me and maybe you can
>> explain them, not to go totally in the wrong direction. Currently I am
>> writing a network driver on which I added switchdev support. Then I was
>> looking for a way to configure the network driver to copy link layer
>> multicast address to the CPU port.
>>
>> If I am using bridge mdb I can do it only for IP multicast addreses,
>> but how should I do it if I want non IP frames with link layer multicast
>> address to be copy to CPU? For example: all frames with multicast
>> address '01-21-6C-00-00-01' to be copy to CPU. What is the user space
>> command for that?
>>
> 
> Check SIOCADDMULTI (ip maddr from iproute2), f.e. add that mac to the port
> which needs to receive it and the bridge will send it up automatically since
> it's unknown mcast (note that if there's a querier, you'll have to make the
> bridge mcast router if it is not the querier itself). It would also flood it to all

Actually you mentioned non-IP traffic, so the querier stuff is not a problem. This
traffic will always be flooded by the bridge (and also a copy will be locally sent up).
Thus only the flooding may need to be controlled.

> other ports so you may want to control that. It really depends on the setup
> and the how the hardware is configured.
> 
>>>>
>>>> In addition you're allowing a mix of mcast functions to be called with unicast addresses
>>>> and vice versa, it is not that big of a deal because the kernel will simply return an error
>>>> but still makes no sense.
>>>>
>>>> Nacked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
>>>>
>>>>> diff --git a/net/bridge/br_fdb.c b/net/bridge/br_fdb.c
>>>>> index b1d3248..d93746d 100644
>>>>> --- a/net/bridge/br_fdb.c
>>>>> +++ b/net/bridge/br_fdb.c
>>>>> @@ -175,6 +175,29 @@ static void fdb_add_hw_addr(struct net_bridge *br, const unsigned char *addr)
>>>>>  	}
>>>>>  }
>>>>>  
>>>>> +static void fdb_add_hw_maddr(struct net_bridge *br, const unsigned char *addr)
>>>>> +{
>>>>> +	int err;
>>>>> +	struct net_bridge_port *p;
>>>>> +
>>>>> +	ASSERT_RTNL();
>>>>> +
>>>>> +	list_for_each_entry(p, &br->port_list, list) {
>>>>> +		if (!br_promisc_port(p)) {
>>>>> +			err = dev_mc_add(p->dev, addr);
>>>>> +			if (err)
>>>>> +				goto undo;
>>>>> +		}
>>>>> +	}
>>>>> +
>>>>> +	return;
>>>>> +undo:
>>>>> +	list_for_each_entry_continue_reverse(p, &br->port_list, list) {
>>>>> +		if (!br_promisc_port(p))
>>>>> +			dev_mc_del(p->dev, addr);
>>>>> +	}
>>>>> +}
>>>>> +
>>>>>  /* When a static FDB entry is deleted, the HW address from that entry is
>>>>>   * also removed from the bridge private HW address list and updates all
>>>>>   * the ports with needed information.
>>>>> @@ -192,13 +215,27 @@ static void fdb_del_hw_addr(struct net_bridge *br, const unsigned char *addr)
>>>>>  	}
>>>>>  }
>>>>>  
>>>>> +static void fdb_del_hw_maddr(struct net_bridge *br, const unsigned char *addr)
>>>>> +{
>>>>> +	struct net_bridge_port *p;
>>>>> +
>>>>> +	ASSERT_RTNL();
>>>>> +
>>>>> +	list_for_each_entry(p, &br->port_list, list) {
>>>>> +		if (!br_promisc_port(p))
>>>>> +			dev_mc_del(p->dev, addr);
>>>>> +	}
>>>>> +}
>>>>> +
>>>>>  static void fdb_delete(struct net_bridge *br, struct net_bridge_fdb_entry *f,
>>>>>  		       bool swdev_notify)
>>>>>  {
>>>>>  	trace_fdb_delete(br, f);
>>>>>  
>>>>> -	if (f->is_static)
>>>>> +	if (f->is_static) {
>>>>>  		fdb_del_hw_addr(br, f->key.addr.addr);
>>>>> +		fdb_del_hw_maddr(br, f->key.addr.addr);
>>>>
>>>> Walking over all ports again for each static delete is a no-go.
>>>>
>>>>> +	}
>>>>>  
>>>>>  	hlist_del_init_rcu(&f->fdb_node);
>>>>>  	rhashtable_remove_fast(&br->fdb_hash_tbl, &f->rhnode,
>>>>> @@ -843,13 +880,19 @@ static int fdb_add_entry(struct net_bridge *br, struct net_bridge_port *source,
>>>>>  			fdb->is_local = 1;
>>>>>  			if (!fdb->is_static) {
>>>>>  				fdb->is_static = 1;
>>>>> -				fdb_add_hw_addr(br, addr);
>>>>> +				if (flags & NLM_F_APPEND && !source)
>>>>> +					fdb_add_hw_maddr(br, addr);
>>>>> +				else
>>>>> +					fdb_add_hw_addr(br, addr);
>>>>>  			}
>>>>>  		} else if (state & NUD_NOARP) {
>>>>>  			fdb->is_local = 0;
>>>>>  			if (!fdb->is_static) {
>>>>>  				fdb->is_static = 1;
>>>>> -				fdb_add_hw_addr(br, addr);
>>>>> +				if (flags & NLM_F_APPEND && !source)
>>>>> +					fdb_add_hw_maddr(br, addr);
>>>>> +				else
>>>>> +					fdb_add_hw_addr(br, addr);
>>>>>  			}
>>>>>  		} else {
>>>>>  			fdb->is_local = 0;
>>>>>
>>>>
>>>
>>
> 



* general protection fault in tls_sk_proto_close
From: syzbot @ 2019-07-26  9:28 UTC (permalink / raw)
  To: aviadye, borisp, daniel, davejwatson, davem, john.fastabend,
	linux-kernel, netdev, syzkaller-bugs

Hello,

syzbot found the following crash on:

HEAD commit:    9e6dfe80 Add linux-next specific files for 20190724
git tree:       linux-next
console output: https://syzkaller.appspot.com/x/log.txt?x=11ff2594600000
kernel config:  https://syzkaller.appspot.com/x/.config?x=6cbb8fc2cf2842d7
dashboard link: https://syzkaller.appspot.com/bug?extid=fb2a31b9c0676ea410e3
compiler:       gcc (GCC) 9.0.0 20181231 (experimental)
syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=13eb6a7c600000

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+fb2a31b9c0676ea410e3@syzkaller.appspotmail.com

kasan: CONFIG_KASAN_INLINE enabled
kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault: 0000 [#1] PREEMPT SMP KASAN
CPU: 1 PID: 9180 Comm: syz-executor.0 Not tainted 5.3.0-rc1-next-20190724 #50
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:tls_sk_proto_close+0x90/0x4a0 net/tls/tls_main.c:348
Code: 3c 02 00 0f 85 dd 03 00 00 49 8b 84 24 c0 02 00 00 4d 8d 75 14 4c 89 f2 48 c1 ea 03 48 89 45 b8 48 b8 00 00 00 00 00 fc ff df <0f> b6 04 02 4c 89 f2 83 e2 07 38 d0 7f 08 84 c0 0f 85 67 03 00 00
RSP: 0018:ffff8880a6497c70 EFLAGS: 00010203
RAX: dffffc0000000000 RBX: 00000000fffffff0 RCX: ffffffff8629731c
RDX: 0000000000000002 RSI: ffffffff862970cd RDI: ffff88808b204f00
RBP: ffff8880a6497cb8 R08: ffff8880a76c4700 R09: fffffbfff14a8151
R10: fffffbfff14a8150 R11: ffffffff8a540a87 R12: ffff88808b204c40
R13: 0000000000000000 R14: 0000000000000014 R15: 0000000000000001
FS:  000055555741a940(0000) GS:ffff8880ae900000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000780000 CR3: 000000008ff7d000 CR4: 00000000001406e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
  tls_sk_proto_close+0x2a9/0x4a0 net/tls/tls_main.c:369
  tcp_bpf_close+0x17c/0x390 net/ipv4/tcp_bpf.c:578
  inet_release+0xed/0x200 net/ipv4/af_inet.c:427
  inet6_release+0x53/0x80 net/ipv6/af_inet6.c:470
  __sock_release+0xce/0x280 net/socket.c:590
  sock_close+0x1e/0x30 net/socket.c:1268
  __fput+0x2ff/0x890 fs/file_table.c:280
  ____fput+0x16/0x20 fs/file_table.c:313
  task_work_run+0x145/0x1c0 kernel/task_work.c:113
  tracehook_notify_resume include/linux/tracehook.h:188 [inline]
  exit_to_usermode_loop+0x316/0x380 arch/x86/entry/common.c:163
  prepare_exit_to_usermode arch/x86/entry/common.c:194 [inline]
  syscall_return_slowpath arch/x86/entry/common.c:274 [inline]
  do_syscall_64+0x65f/0x760 arch/x86/entry/common.c:300
  entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x4134f0
Code: 01 f0 ff ff 0f 83 30 1b 00 00 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 83 3d 9d 2d 66 00 00 75 14 b8 03 00 00 00 0f 05 <48> 3d 01 f0 ff ff 0f 83 04 1b 00 00 c3 48 83 ec 08 e8 0a fc ff ff
RSP: 002b:00007ffc6f204768 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
RAX: 0000000000000000 RBX: 0000000000000006 RCX: 00000000004134f0
RDX: 0000001b30d20000 RSI: 0000000000000000 RDI: 0000000000000005
RBP: 0000000000000001 R08: 0000000000000000 R09: ffffffffffffffff
R10: 0000000000000000 R11: 0000000000000246 R12: 000000000075bf20
R13: 0000000000000003 R14: 0000000000761178 R15: ffffffffffffffff
Modules linked in:
---[ end trace 5143786da0160ad0 ]---
RIP: 0010:tls_sk_proto_close+0x90/0x4a0 net/tls/tls_main.c:348
Code: 3c 02 00 0f 85 dd 03 00 00 49 8b 84 24 c0 02 00 00 4d 8d 75 14 4c 89 f2 48 c1 ea 03 48 89 45 b8 48 b8 00 00 00 00 00 fc ff df <0f> b6 04 02 4c 89 f2 83 e2 07 38 d0 7f 08 84 c0 0f 85 67 03 00 00
RSP: 0018:ffff8880a6497c70 EFLAGS: 00010203
RAX: dffffc0000000000 RBX: 00000000fffffff0 RCX: ffffffff8629731c
RDX: 0000000000000002 RSI: ffffffff862970cd RDI: ffff88808b204f00
RBP: ffff8880a6497cb8 R08: ffff8880a76c4700 R09: fffffbfff14a8151
R10: fffffbfff14a8150 R11: ffffffff8a540a87 R12: ffff88808b204c40
R13: 0000000000000000 R14: 0000000000000014 R15: 0000000000000001
FS:  000055555741a940(0000) GS:ffff8880ae900000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000780000 CR3: 000000008ff7d000 CR4: 00000000001406e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400


---
This bug is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.

syzbot will keep track of this bug report. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.
syzbot can test patches for this bug, for details see:
https://goo.gl/tpsmEJ#testing-patches


* INFO: rcu detected stall in vhost_worker
From: syzbot @ 2019-07-26  9:38 UTC (permalink / raw)
  To: jasowang, kvm, linux-kernel, mst, netdev, syzkaller-bugs,
	virtualization

Hello,

syzbot found the following crash on:

HEAD commit:    13bf6d6a Add linux-next specific files for 20190725
git tree:       linux-next
console output: https://syzkaller.appspot.com/x/log.txt?x=141449f0600000
kernel config:  https://syzkaller.appspot.com/x/.config?x=8ae987d803395886
dashboard link: https://syzkaller.appspot.com/bug?extid=36e93b425cd6eb54fcc1
compiler:       gcc (GCC) 9.0.0 20181231 (experimental)
syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=15112f3fa00000
C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=131ab578600000

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+36e93b425cd6eb54fcc1@syzkaller.appspotmail.com

rcu: INFO: rcu_preempt self-detected stall on CPU
rcu: 	0-....: (10500 ticks this GP) idle=a56/1/0x4000000000000002 softirq=12266/12266 fqs=5250
	(t=10502 jiffies g=14905 q=12)
NMI backtrace for cpu 0
CPU: 0 PID: 10848 Comm: vhost-10847 Not tainted 5.3.0-rc1-next-20190725 #52
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
  <IRQ>
  __dump_stack lib/dump_stack.c:77 [inline]
  dump_stack+0x172/0x1f0 lib/dump_stack.c:113
  nmi_cpu_backtrace.cold+0x70/0xb2 lib/nmi_backtrace.c:101
  nmi_trigger_cpumask_backtrace+0x23b/0x28b lib/nmi_backtrace.c:62
  arch_trigger_cpumask_backtrace+0x14/0x20 arch/x86/kernel/apic/hw_nmi.c:38
  trigger_single_cpu_backtrace include/linux/nmi.h:164 [inline]
  rcu_dump_cpu_stacks+0x183/0x1cf kernel/rcu/tree_stall.h:254
  print_cpu_stall kernel/rcu/tree_stall.h:455 [inline]
  check_cpu_stall kernel/rcu/tree_stall.h:529 [inline]
  rcu_pending kernel/rcu/tree.c:2736 [inline]
  rcu_sched_clock_irq.cold+0x4dd/0xc13 kernel/rcu/tree.c:2183
  update_process_times+0x32/0x80 kernel/time/timer.c:1639
  tick_sched_handle+0xa2/0x190 kernel/time/tick-sched.c:167
  tick_sched_timer+0x53/0x140 kernel/time/tick-sched.c:1296
  __run_hrtimer kernel/time/hrtimer.c:1389 [inline]
  __hrtimer_run_queues+0x364/0xe40 kernel/time/hrtimer.c:1451
  hrtimer_interrupt+0x314/0x770 kernel/time/hrtimer.c:1509
  local_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1068 [inline]
  smp_apic_timer_interrupt+0x160/0x610 arch/x86/kernel/apic/apic.c:1093
  apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:828
  </IRQ>
RIP: 0010:check_memory_region_inline mm/kasan/generic.c:173 [inline]
RIP: 0010:check_memory_region+0x0/0x1a0 mm/kasan/generic.c:192
Code: 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 f2 be f8 00 00 00 48 89 e5 e8 df 60 90 05 5d c3 0f 1f 00 66 2e 0f 1f 84 00 00 00 00 00 <48> 85 f6 0f 84 34 01 00 00 48 b8 ff ff ff ff ff 7f ff ff 55 0f b6
RSP: 0018:ffff8880a40bf950 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
RAX: 0000000000000000 RBX: ffff8880836a8220 RCX: ffffffff81599777
RDX: 0000000000000000 RSI: 0000000000000004 RDI: ffff8880836a8220
RBP: ffff8880a40bf958 R08: 1ffff110106d5044 R09: ffffed10106d5045
R10: ffffed10106d5044 R11: ffff8880836a8223 R12: 0000000000000001
R13: 0000000000000003 R14: ffffed10106d5044 R15: 0000000000000001
  atomic_read include/asm-generic/atomic-instrumented.h:26 [inline]
  virt_spin_lock arch/x86/include/asm/qspinlock.h:83 [inline]
  native_queued_spin_lock_slowpath+0xb7/0x9f0 kernel/locking/qspinlock.c:325
  pv_queued_spin_lock_slowpath arch/x86/include/asm/paravirt.h:642 [inline]
  queued_spin_lock_slowpath arch/x86/include/asm/qspinlock.h:50 [inline]
  queued_spin_lock include/asm-generic/qspinlock.h:81 [inline]
  do_raw_spin_lock+0x20e/0x2e0 kernel/locking/spinlock_debug.c:113
  __raw_spin_lock include/linux/spinlock_api_smp.h:143 [inline]
  _raw_spin_lock+0x37/0x40 kernel/locking/spinlock.c:151
  spin_lock include/linux/spinlock.h:338 [inline]
  vhost_setup_uaddr drivers/vhost/vhost.c:790 [inline]
  vhost_setup_vq_uaddr drivers/vhost/vhost.c:801 [inline]
  vhost_vq_map_prefetch drivers/vhost/vhost.c:1783 [inline]
  vq_meta_prefetch+0x2a0/0xcb0 drivers/vhost/vhost.c:1804
  handle_rx+0x145/0x1890 drivers/vhost/net.c:1128
  handle_rx_net+0x19/0x20 drivers/vhost/net.c:1270
  vhost_worker+0x2af/0x4d0 drivers/vhost/vhost.c:473
  kthread+0x361/0x430 kernel/kthread.c:255
  ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:352


---
This bug is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.

syzbot will keep track of this bug report. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.
syzbot can test patches for this bug, for details see:
https://goo.gl/tpsmEJ#testing-patches

* INFO: rcu detected stall in ipv6_rcv (2)
From: syzbot @ 2019-07-26  9:38 UTC (permalink / raw)
  To: andrew, arvid.brodin, aviadye, borisp, daniel, davejwatson, davem,
	f.fainelli, hkallweit1, huangfq.daxian, john.fastabend,
	linux-kernel, netdev, syzkaller-bugs

Hello,

syzbot found the following crash on:

HEAD commit:    13bf6d6a Add linux-next specific files for 20190725
git tree:       linux-next
console output: https://syzkaller.appspot.com/x/log.txt?x=16c5cd94600000
kernel config:  https://syzkaller.appspot.com/x/.config?x=8ae987d803395886
dashboard link: https://syzkaller.appspot.com/bug?extid=34f3e3f781b524b5127a
compiler:       gcc (GCC) 9.0.0 20181231 (experimental)
syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=15e55df4600000
C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=142f7768600000

The bug was bisected to:

commit ccf355e52a3265624b7acadd693c849d599e9b9f
Author: Fuqian Huang <huangfq.daxian@gmail.com>
Date:   Mon Jul 8 12:34:17 2019 +0000

     net: phy: Make use of linkmode_mod_bit helper

bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=12285f58600000
final crash:    https://syzkaller.appspot.com/x/report.txt?x=11285f58600000
console output: https://syzkaller.appspot.com/x/log.txt?x=16285f58600000

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+34f3e3f781b524b5127a@syzkaller.appspotmail.com
Fixes: ccf355e52a32 ("net: phy: Make use of linkmode_mod_bit helper")

TCP: request_sock_TCPv6: Possible SYN flooding on port 20002. Sending cookies.  Check SNMP counters.
rcu: INFO: rcu_preempt self-detected stall on CPU
rcu: 	1-....: (1 GPs behind) idle=c0a/1/0x4000000000000002 softirq=11627/11628 fqs=5250
	(t=10500 jiffies g=10477 q=33)
NMI backtrace for cpu 1
CPU: 1 PID: 10160 Comm: syz-executor291 Not tainted 5.3.0-rc1-next-20190725 #52
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
  <IRQ>
  __dump_stack lib/dump_stack.c:77 [inline]
  dump_stack+0x172/0x1f0 lib/dump_stack.c:113
  nmi_cpu_backtrace.cold+0x70/0xb2 lib/nmi_backtrace.c:101
  nmi_trigger_cpumask_backtrace+0x23b/0x28b lib/nmi_backtrace.c:62
  arch_trigger_cpumask_backtrace+0x14/0x20 arch/x86/kernel/apic/hw_nmi.c:38
  trigger_single_cpu_backtrace include/linux/nmi.h:164 [inline]
  rcu_dump_cpu_stacks+0x183/0x1cf kernel/rcu/tree_stall.h:254
  print_cpu_stall kernel/rcu/tree_stall.h:455 [inline]
  check_cpu_stall kernel/rcu/tree_stall.h:529 [inline]
  rcu_pending kernel/rcu/tree.c:2736 [inline]
  rcu_sched_clock_irq.cold+0x4dd/0xc13 kernel/rcu/tree.c:2183
  update_process_times+0x32/0x80 kernel/time/timer.c:1639
  tick_sched_handle+0xa2/0x190 kernel/time/tick-sched.c:167
  tick_sched_timer+0x53/0x140 kernel/time/tick-sched.c:1296
  __run_hrtimer kernel/time/hrtimer.c:1389 [inline]
  __hrtimer_run_queues+0x364/0xe40 kernel/time/hrtimer.c:1451
  hrtimer_interrupt+0x314/0x770 kernel/time/hrtimer.c:1509
  local_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1068 [inline]
  smp_apic_timer_interrupt+0x160/0x610 arch/x86/kernel/apic/apic.c:1093
  apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:828
RIP: 0010:cpu_relax arch/x86/include/asm/processor.h:656 [inline]
RIP: 0010:virt_spin_lock arch/x86/include/asm/qspinlock.h:84 [inline]
RIP: 0010:native_queued_spin_lock_slowpath+0x132/0x9f0 kernel/locking/qspinlock.c:325
Code: 00 00 00 48 8b 45 d0 65 48 33 04 25 28 00 00 00 0f 85 37 07 00 00 48 81 c4 98 00 00 00 5b 41 5c 41 5d 41 5e 41 5f 5d c3 f3 90 <e9> 73 ff ff ff 8b 45 98 4c 8d 65 d8 3d 00 01 00 00 0f 84 e5 00 00
RSP: 0018:ffff8880ae909210 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff13
RAX: 0000000000000000 RBX: ffff88809338ad08 RCX: ffffffff81599777
RDX: 0000000000000000 RSI: 0000000000000004 RDI: ffff88809338ad08
RBP: ffff8880ae9092d0 R08: 1ffff110126715a1 R09: ffffed10126715a2
R10: ffffed10126715a1 R11: ffff88809338ad0b R12: 0000000000000001
R13: 0000000000000003 R14: ffffed10126715a1 R15: 0000000000000001
  pv_queued_spin_lock_slowpath arch/x86/include/asm/paravirt.h:642 [inline]
  queued_spin_lock_slowpath arch/x86/include/asm/qspinlock.h:50 [inline]
  queued_spin_lock include/asm-generic/qspinlock.h:81 [inline]
  do_raw_spin_lock+0x20e/0x2e0 kernel/locking/spinlock_debug.c:113
  __raw_spin_lock_bh include/linux/spinlock_api_smp.h:136 [inline]
  _raw_spin_lock_bh+0x3b/0x50 kernel/locking/spinlock.c:175
  spin_lock_bh include/linux/spinlock.h:343 [inline]
  release_sock+0x20/0x1c0 net/core/sock.c:2932
  wait_on_pending_writer+0x20f/0x420 net/tls/tls_main.c:91
  tls_sk_proto_cleanup+0x2c5/0x3e0 net/tls/tls_main.c:295
  tls_sk_proto_unhash+0x90/0x3f0 net/tls/tls_main.c:330
  tcp_set_state+0x5b9/0x7d0 net/ipv4/tcp.c:2235
  tcp_done+0xe2/0x320 net/ipv4/tcp.c:3824
  tcp_reset+0x132/0x500 net/ipv4/tcp_input.c:4080
  tcp_validate_incoming+0xa2d/0x1660 net/ipv4/tcp_input.c:5440
  tcp_rcv_established+0x6b5/0x1e70 net/ipv4/tcp_input.c:5648
  tcp_v6_do_rcv+0x41e/0x12c0 net/ipv6/tcp_ipv6.c:1356
  tcp_v6_rcv+0x31f1/0x3500 net/ipv6/tcp_ipv6.c:1588
  ip6_protocol_deliver_rcu+0x2fe/0x1660 net/ipv6/ip6_input.c:397
  ip6_input_finish+0x84/0x170 net/ipv6/ip6_input.c:438
  NF_HOOK include/linux/netfilter.h:305 [inline]
  NF_HOOK include/linux/netfilter.h:299 [inline]
  ip6_input+0xe4/0x3f0 net/ipv6/ip6_input.c:447
  dst_input include/net/dst.h:442 [inline]
  ip6_rcv_finish+0x1de/0x2f0 net/ipv6/ip6_input.c:76
  NF_HOOK include/linux/netfilter.h:305 [inline]
  NF_HOOK include/linux/netfilter.h:299 [inline]
  ipv6_rcv+0x10e/0x420 net/ipv6/ip6_input.c:272
  __netif_receive_skb_one_core+0x113/0x1a0 net/core/dev.c:4999
  __netif_receive_skb+0x2c/0x1d0 net/core/dev.c:5113
  process_backlog+0x206/0x750 net/core/dev.c:5924
  napi_poll net/core/dev.c:6347 [inline]
  net_rx_action+0x508/0x10c0 net/core/dev.c:6413
  __do_softirq+0x262/0x98c kernel/softirq.c:292
  do_softirq_own_stack+0x2a/0x40 arch/x86/entry/entry_64.S:1080
  </IRQ>
  do_softirq.part.0+0x11a/0x170 kernel/softirq.c:337
  do_softirq kernel/softirq.c:329 [inline]
  __local_bh_enable_ip+0x211/0x270 kernel/softirq.c:189
  local_bh_enable include/linux/bottom_half.h:32 [inline]
  inet_csk_listen_stop+0x1e0/0x850 net/ipv4/inet_connection_sock.c:993
  tcp_close+0xd5b/0x10e0 net/ipv4/tcp.c:2338
  inet_release+0xed/0x200 net/ipv4/af_inet.c:427
  inet6_release+0x53/0x80 net/ipv6/af_inet6.c:470
  __sock_release+0xce/0x280 net/socket.c:590
  sock_close+0x1e/0x30 net/socket.c:1268
  __fput+0x2ff/0x890 fs/file_table.c:280
  ____fput+0x16/0x20 fs/file_table.c:313
  task_work_run+0x145/0x1c0 kernel/task_work.c:113
  tracehook_notify_resume include/linux/tracehook.h:188 [inline]
  exit_to_usermode_loop+0x316/0x380 arch/x86/entry/common.c:163
  prepare_exit_to_usermode arch/x86/entry/common.c:194 [inline]
  syscall_return_slowpath arch/x86/entry/common.c:274 [inline]
  do_syscall_64+0x65f/0x760 arch/x86/entry/common.c:300
  entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x406571
Code: 75 14 b8 03 00 00 00 0f 05 48 3d 01 f0 ff ff 0f 83 24 1a 00 00 c3 48 83 ec 08 e8 6a fc ff ff 48 89 04 24 b8 03 00 00 00 0f 05 <48> 8b 3c 24 48 89 c2 e8 b3 fc ff ff 48 89 d0 48 83 c4 08 48 3d 01
RSP: 002b:00007ffc1a1a5e00 EFLAGS: 00000293 ORIG_RAX: 0000000000000003
RAX: 0000000000000000 RBX: 0000000000000005 RCX: 0000000000406571
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000004
RBP: 00000000006dcc20 R08: 0000000000000140 R09: 0000000000000140
R10: 00007ffc1a1a5e30 R11: 0000000000000293 R12: 00007ffc1a1a5e60
R13: 00000000006dcc2c R14: 000000000000002d R15: 0000000000000007


---
This bug is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.

syzbot will keep track of this bug report. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.
For information about bisection process see: https://goo.gl/tpsmEJ#bisection
syzbot can test patches for this bug, for details see:
https://goo.gl/tpsmEJ#testing-patches

* Re: [PATCH] rtw88: pci: Use general byte arrays as the elements of RX ring
From: Jian-Hong Pan @ 2019-07-26  9:40 UTC (permalink / raw)
  To: David Laight
  Cc: Yan-Hsuan Chuang, Kalle Valo, David S . Miller,
	linux-wireless@vger.kernel.org, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux@endlessm.com,
	stable@vger.kernel.org
In-Reply-To: <c2cdffd30923459e8773379fc2927e1d@AcuMS.aculab.com>

On Fri, Jul 26, 2019 at 5:23 PM, David Laight <David.Laight@aculab.com> wrote:
>
> From: Jian-Hong Pan
> > Sent: 26 July 2019 07:18
> ...
> > > While allocating all 512 buffers in one block (just over 4MB)
> > > is probably not a good idea, you may need to allocate (and dma map)
> > > them in groups.
> >
> > Thanks for reviewing.  I have some questions to double-check the idea.
> > The original code allocates 512 skbs for the RX ring and DMA-maps
> > them one by one.  The new code likewise allocates a memory buffer 512
> > times to get 512 buffer arrays.  Will the 512 buffer arrays be in one
> > block?  Do you mean aggregating the buffers as a scatterlist and using
> > dma_map_sg?
>
> If you malloc a buffer of size (8192+32) the allocator will either
> round it up to a whole number of (often 4k) pages or to a power of
> 2 of pages - so either 12k or 16k.
> I think the Linux allocator does the latter.
> Some of the allocators also 'steal' a bit from the front of the buffer
> for 'red tape'.
>
> OTOH malloc the space for 15 buffers and the allocator will round the
> 15*(8192 + 32) up to 32*4k - and you waste under 8k across all the
> buffers.
>
> You then dma_map the large buffer and split into the actual rx buffers.
> Repeat until you've filled the entire ring.
> The only complication is remembering the base address (and size) for
> the dma_unmap and free.
> Although there is plenty of padding to extend the buffer structure
> significantly without using more memory.
> Allocate in 15's and you (probably) have 512 bytes per buffer.
> Allocate in 31's and you have 256 bytes.
>
> The problem is that larger allocations are more likely to fail
> (especially if the system has been running for some time).
> So you almost certainly want to be able to fall back to smaller
> allocations even though they use more memory.
>
> I also wonder if you actually need 512 8k rx buffers to cover
> interrupt latency?
> I've not done any measurements for 20 years!
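
The rounding arithmetic described in the quoted text can be checked with a
small user-space sketch. This is only an illustration: PAGE_SZ, pow2_pages(),
alloc_bytes() and slack_per_buf() are hypothetical names, and the
"round up to a power-of-two number of 4k pages" rule is the assumption made
in the quote, not a statement about any particular allocator.

```c
#include <stddef.h>

#define PAGE_SZ   4096u           /* assumed page size */
#define RX_BUF_SZ (8192u + 32u)   /* per-buffer size from the discussion */

/* Round a page count (>= 1) up to the next power of two. */
static size_t pow2_pages(size_t pages)
{
	size_t p = 1;

	while (p < pages)
		p <<= 1;
	return p;
}

/* Bytes consumed when `count` RX buffers share a single allocation. */
static size_t alloc_bytes(size_t count)
{
	size_t need = count * RX_BUF_SZ;
	size_t pages = (need + PAGE_SZ - 1) / PAGE_SZ;

	return pow2_pages(pages) * PAGE_SZ;
}

/* Unused bytes per buffer left over after the rounding. */
static size_t slack_per_buf(size_t count)
{
	return (alloc_bytes(count) - count * RX_BUF_SZ) / count;
}
```

Under these assumptions a single buffer consumes 16384 bytes, a group of 15
consumes 131072 bytes with about 514 spare bytes per buffer, and a group of
31 leaves about 232 spare bytes per buffer, which is in line with the rough
figures quoted above.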

Thanks for the explanation.
I am not sure about the choice of 512 8k RX buffers.  Maybe the Realtek
folks can give us some idea.
Tony Chuang, any comment?

Jian-Hong Pan

* Re: [PATCH] net: key: af_key: Fix possible null-pointer dereferences in pfkey_send_policy_notify()
From: Steffen Klassert @ 2019-07-26  9:45 UTC (permalink / raw)
  To: Jia-Ju Bai; +Cc: herbert, davem, netdev, linux-kernel
In-Reply-To: <20190724093509.1676-1-baijiaju1990@gmail.com>

On Wed, Jul 24, 2019 at 05:35:09PM +0800, Jia-Ju Bai wrote:
> In pfkey_send_policy_notify(), there is an if statement on line 3081 to
> check whether xp is NULL:
>     if (xp && xp->type != XFRM_POLICY_TYPE_MAIN)
> 
> When xp is NULL, it is used by key_notify_policy() on line 3090:
>     key_notify_policy(xp, ...)
>         pfkey_xfrm_policy2msg_prep(xp) -- line 2211
>             pfkey_xfrm_policy2msg_size(xp) -- line 2046
>                 for (i=0; i<xp->xfrm_nr; i++) -- line 2026
>                 t = xp->xfrm_vec + i; -- line 2027
>     key_notify_policy(xp, ...)
>         xp_net(xp) -- line 2231
>             return read_pnet(&xp->xp_net); -- line 534

Please don't quote random code lines, explain the
problem instead.

> 
> Thus, possible null-pointer dereferences may occur.
> 
> To fix these bugs, xp is checked before calling key_notify_policy().
> 
> These bugs are found by a static analysis tool STCheck written by us.
> 
> Signed-off-by: Jia-Ju Bai <baijiaju1990@gmail.com>
> ---
>  net/key/af_key.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/net/key/af_key.c b/net/key/af_key.c
> index b67ed3a8486c..ced54144d5fd 100644
> --- a/net/key/af_key.c
> +++ b/net/key/af_key.c
> @@ -3087,6 +3087,8 @@ static int pfkey_send_policy_notify(struct xfrm_policy *xp, int dir, const struc
>  	case XFRM_MSG_DELPOLICY:
>  	case XFRM_MSG_NEWPOLICY:
>  	case XFRM_MSG_UPDPOLICY:
> +		if (!xp)
> +			break;

I think this cannot happen. Who sends one of these notifications
without a pointer to the policy?
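
For readers following the thread, the pattern being debated can be sketched
in user-space C. Everything here is hypothetical (struct policy, the enum
values, notify_policy(), send_policy_notify()); it only mirrors the shape of
the reported code: a first test that tolerates a NULL pointer, followed by a
callee that dereferences it unconditionally. Whether a NULL policy can
actually reach this path in the kernel is exactly the open question above.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the kernel types; not the real definitions. */
enum { POLICY_TYPE_MAIN = 0, POLICY_TYPE_SUB = 1 };
enum { MSG_NEWPOLICY = 1, MSG_OTHER = 2 };

struct policy {
	int type;
	int xfrm_nr;
};

/* Dereferences xp unconditionally, like the callee named in the report. */
static int notify_policy(const struct policy *xp)
{
	return xp->xfrm_nr;
}

/* Returns -1 when nothing is sent, otherwise the callee's result. */
static int send_policy_notify(const struct policy *xp, int event)
{
	/* This test is written so that xp == NULL falls through ... */
	if (xp && xp->type != POLICY_TYPE_MAIN)
		return -1;

	switch (event) {
	case MSG_NEWPOLICY:
		/* ... so the callee needs a guard if NULL is possible. */
		if (!xp)
			break;
		return notify_policy(xp);
	default:
		break;
	}
	return -1;
}
```

If no caller can ever pass NULL, the first `xp &&` test is the redundant
part and the added guard is dead code; the sketch does not settle which of
the two is the case.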


* [PATCH] net: neigh: remove redundant assignment to variable bucket
From: Colin King @ 2019-07-26  9:46 UTC (permalink / raw)
  To: David S . Miller, David Ahern, netdev; +Cc: kernel-janitors, linux-kernel

From: Colin Ian King <colin.king@canonical.com>

The variable bucket is being initialized with a value that is never
read and it is being updated later with a new value in a following
for-loop. The initialization is redundant and can be removed.

Addresses-Coverity: ("Unused value")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
---
 net/core/neighbour.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index f79e61c570ea..5480edff0c86 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -3033,7 +3033,7 @@ static struct neighbour *neigh_get_first(struct seq_file *seq)
 	struct net *net = seq_file_net(seq);
 	struct neigh_hash_table *nht = state->nht;
 	struct neighbour *n = NULL;
-	int bucket = state->bucket;
+	int bucket;

 	state->flags &= ~NEIGH_SEQ_IS_PNEIGH;
 	for (bucket = 0; bucket < (1 << nht->hash_shift); bucket++) {
-- 
2.20.1
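
As a stand-alone illustration of why the initializer is dead, here is a
reduced user-space sketch. struct seq_state and first_used_bucket() are
hypothetical and much simpler than the real neigh_get_first(); the point is
only that the for-loop assigns `bucket` before anything reads it.

```c
/* Hypothetical miniature of the seq-file state; only `bucket` matters. */
struct seq_state {
	int bucket;
};

/* Returns the first index whose slot is non-zero, or -1 if none is. */
static int first_used_bucket(struct seq_state *state, const int *slots,
			     int nslots)
{
	int bucket; /* no initializer: the loop assigns it before any read */

	for (bucket = 0; bucket < nslots; bucket++) {
		if (slots[bucket])
			break;
	}
	state->bucket = bucket;
	return bucket < nslots ? bucket : -1;
}
```

Whatever value state->bucket held on entry has no influence on the result,
which is what the Coverity "Unused value" report points at.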

* [RFC] net: phy: read link status twice when phy_check_link_status()
From: Yonglong Liu @ 2019-07-26  9:53 UTC (permalink / raw)
  To: andrew, davem
  Cc: netdev, linux-kernel, linuxarm, salil.mehta, yisen.zhuang,
	shiju.jose

According to the datasheets of the Marvell and Realtek PHYs, the
copper link status should be read twice; otherwise the first read may
return a fake link-up status and cause an up->down->up transition the
first time the link comes up. This happens more often with the Realtek
PHY.

Adding a fake status read solves this problem for me.

I also see that the fake read in polling mode has been deleted from
genphy_update_link(), so I don't know whether my solution is
correct.

Or would providing a phydev->drv->read_status function for the PHY I
use be more acceptable?

Signed-off-by: Yonglong Liu <liuyonglong@huawei.com>
---
 drivers/net/phy/phy.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
index ef7aa73..0c03edc 100644
--- a/drivers/net/phy/phy.c
+++ b/drivers/net/phy/phy.c
@@ -1,4 +1,7 @@
 // SPDX-License-Identifier: GPL-2.0+
+	err = phy_read_status(phydev);
+	if (err)
+		return err;
 /* Framework for configuring and reading PHY devices
  * Based on code in sungem_phy.c and gianfar_phy.c
  *
@@ -525,6 +528,11 @@ static int phy_check_link_status(struct phy_device *phydev)

 	WARN_ON(!mutex_is_locked(&phydev->lock));

+	/* Do a fake read */
+	err = phy_read(phydev, MII_BMSR);
+	if (err < 0)
+		return err;
+
 	err = phy_read_status(phydev);
 	if (err)
 		return err;
-- 
2.8.1
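
To make the "read twice" idea concrete, here is a small user-space
simulation. It models only the generic behaviour defined by IEEE 802.3
clause 22, where the BMSR link status bit latches low until it is read; it
does not model the specific Marvell or Realtek behaviour described above.
struct sim_phy and the sim_*() helpers are hypothetical; BMSR_LSTATUS uses
the same value as linux/mii.h.

```c
#include <stdbool.h>
#include <stdint.h>

#define BMSR_LSTATUS 0x0004 /* link status bit, same value as linux/mii.h */

/* Hypothetical simulated PHY: `latched` remembers a past link drop. */
struct sim_phy {
	bool link_now; /* current physical link state */
	bool latched;  /* true if the link dropped since the last read */
};

static void sim_set_link(struct sim_phy *phy, bool up)
{
	if (!up)
		phy->latched = true; /* a drop stays latched until read */
	phy->link_now = up;
}

/* One BMSR read: reports "down" if a drop was latched, then clears it. */
static uint16_t sim_read_bmsr(struct sim_phy *phy)
{
	bool up = phy->link_now && !phy->latched;

	phy->latched = false;
	return up ? BMSR_LSTATUS : 0;
}

/* Read twice and keep the second value, which reflects the current state. */
static bool sim_link_is_up(struct sim_phy *phy)
{
	(void)sim_read_bmsr(phy); /* discard the possibly stale value */
	return (sim_read_bmsr(phy) & BMSR_LSTATUS) != 0;
}
```

In this model a single read after a down/up glitch reports "down" once even
though the link is up again; the discarded first read is what removes that
stale result.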

* Re: [PATCH] Revert "net: get rid of an signed integer overflow in ip_idents_reserve()"
From: Eric Dumazet @ 2019-07-26  9:58 UTC (permalink / raw)
  To: Shaokun Zhang, netdev, linux-kernel
  Cc: Yang Guo, David S. Miller, Alexey Kuznetsov, Hideaki YOSHIFUJI,
	Eric Dumazet, Jiri Pirko
In-Reply-To: <1564132635-57634-1-git-send-email-zhangshaokun@hisilicon.com>

On 7/26/19 11:17 AM, Shaokun Zhang wrote:
> From: Yang Guo <guoyang2@huawei.com>
> 
> There is a significant performance regression with the following
> commit-id <adb03115f459>
> ("net: get rid of an signed integer overflow in ip_idents_reserve()").
> 
>

So, you jumped in and took ownership of this issue, while some of us
are already working on it?

Have you first checked that current UBSAN versions will not complain anymore?

A revert adding back the original issue would be silly; performance of
benchmarks is nice but secondary.

* [PATCH] ipw2x00: remove redundant assignment to err
From: Colin King @ 2019-07-26 10:06 UTC (permalink / raw)
  To: Stanislav Yakovlev, Kalle Valo, David S . Miller, linux-wireless,
	netdev
  Cc: kernel-janitors, linux-kernel

From: Colin Ian King <colin.king@canonical.com>

Variable err is initialized to a value that is never read and it
is re-assigned later.  The initialization is redundant and can
be removed.

Addresses-Coverity: ("Unused value")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
---
 drivers/net/wireless/intel/ipw2x00/ipw2100.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/wireless/intel/ipw2x00/ipw2100.c b/drivers/net/wireless/intel/ipw2x00/ipw2100.c
index 75c0c29d81f0..8dfbaff2d1fe 100644
--- a/drivers/net/wireless/intel/ipw2x00/ipw2100.c
+++ b/drivers/net/wireless/intel/ipw2x00/ipw2100.c
@@ -4413,7 +4413,7 @@ static void ipw2100_kill_works(struct ipw2100_priv *priv)
 
 static int ipw2100_tx_allocate(struct ipw2100_priv *priv)
 {
-	int i, j, err = -EINVAL;
+	int i, j, err;
 	void *v;
 	dma_addr_t p;
 
-- 
2.20.1


* Re: [PATCH 1/2] ipmr: Make cache queue length configurable
From: Stephen Suryaputra @ 2019-07-26 10:10 UTC (permalink / raw)
  To: Brodie Greenfield
  Cc: davem, stephen, kuznet, yoshfuji, netdev, linux-kernel,
	chris.packham, luuk.paulussen
In-Reply-To: <20190725204230.12229-2-brodie.greenfield@alliedtelesis.co.nz>

On Fri, Jul 26, 2019 at 08:42:29AM +1200, Brodie Greenfield wrote:
> We want to be able to keep more spaces available in our queue for
> processing incoming multicast traffic (adding (S,G) entries) - this lets
> us learn more groups faster, rather than dropping them at this stage.
> 
> Signed-off-by: Brodie Greenfield <brodie.greenfield@alliedtelesis.co.nz>

Our system can use this. The patch applied cleanly to my net-next
sandbox. Thank you.

Reviewed-by: Stephen Suryaputra <ssuryaextr@gmail.com>

* Re: [PATCH 2/2] ip6mr: Make cache queue length configurable
From: Stephen Suryaputra @ 2019-07-26 10:10 UTC (permalink / raw)
  To: Brodie Greenfield
  Cc: davem, stephen, kuznet, yoshfuji, netdev, linux-kernel,
	chris.packham, luuk.paulussen
In-Reply-To: <20190725204230.12229-3-brodie.greenfield@alliedtelesis.co.nz>

On Fri, Jul 26, 2019 at 08:42:30AM +1200, Brodie Greenfield wrote:
> We want to be able to keep more spaces available in our queue for
> processing incoming IPv6 multicast traffic (adding (S,G) entries) - this
> lets us learn more groups faster, rather than dropping them at this stage.
> 
> Signed-off-by: Brodie Greenfield <brodie.greenfield@alliedtelesis.co.nz>

Reviewed-by: Stephen Suryaputra <ssuryaextr@gmail.com>

* [PATCH net-next 1/2] net: stmmac: Make MDIO bus reset optional
From: Thierry Reding @ 2019-07-26 10:27 UTC (permalink / raw)
  To: David S . Miller
  Cc: Jose Abreu, Alexandre Torgue, Giuseppe Cavallaro, Jon Hunter,
	netdev, linux-tegra, linux-arm-kernel

From: Thierry Reding <treding@nvidia.com>

The Tegra EQOS driver already resets the MDIO bus at probe time via the
reset GPIO specified in the phy-reset-gpios device tree property. There
is no need to reset the bus again later on.

This avoids the need to query the device tree for the snps,reset GPIO,
which is not part of the Tegra EQOS device tree bindings. This quiesces
an error message from the generic bus reset code if it doesn't find the
snps,reset related delays.

Signed-off-by: Thierry Reding <treding@nvidia.com>
---
 drivers/net/ethernet/stmicro/stmmac/dwmac-dwc-qos-eth.c | 3 +++
 drivers/net/ethernet/stmicro/stmmac/stmmac_mdio.c       | 4 +++-
 drivers/net/ethernet/stmicro/stmmac/stmmac_pci.c        | 1 +
 drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c   | 8 +++++++-
 include/linux/stmmac.h                                  | 1 +
 5 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac-dwc-qos-eth.c b/drivers/net/ethernet/stmicro/stmmac/dwmac-dwc-qos-eth.c
index 3a14cdd01f5f..66933332c68e 100644
--- a/drivers/net/ethernet/stmicro/stmmac/dwmac-dwc-qos-eth.c
+++ b/drivers/net/ethernet/stmicro/stmmac/dwmac-dwc-qos-eth.c
@@ -333,6 +333,9 @@ static void *tegra_eqos_probe(struct platform_device *pdev,
 	usleep_range(2000, 4000);
 	gpiod_set_value(eqos->reset, 0);
 
+	/* MDIO bus was already reset just above */
+	data->mdio_bus_data->needs_reset = false;
+
 	eqos->rst = devm_reset_control_get(&pdev->dev, "eqos");
 	if (IS_ERR(eqos->rst)) {
 		err = PTR_ERR(eqos->rst);
diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_mdio.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_mdio.c
index 4304c1abc5d1..40c42637ad75 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_mdio.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_mdio.c
@@ -348,7 +348,9 @@ int stmmac_mdio_register(struct net_device *ndev)
 		max_addr = PHY_MAX_ADDR;
 	}
 
-	new_bus->reset = &stmmac_mdio_reset;
+	if (mdio_bus_data->needs_reset)
+		new_bus->reset = &stmmac_mdio_reset;
+
 	snprintf(new_bus->id, MII_BUS_ID_SIZE, "%s-%x",
 		 new_bus->name, priv->plat->bus_id);
 	new_bus->priv = ndev;
diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_pci.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_pci.c
index 86f9c07a38cf..d5d08e11c353 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_pci.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_pci.c
@@ -63,6 +63,7 @@ static void common_default_data(struct plat_stmmacenet_data *plat)
 	plat->has_gmac = 1;
 	plat->force_sf_dma_mode = 1;
 
+	plat->mdio_bus_data->needs_reset = true;
 	plat->mdio_bus_data->phy_mask = 0;
 
 	/* Set default value for multicast hash bins */
diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c
index 73fc2524372e..333b09564b88 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c
@@ -342,10 +342,16 @@ static int stmmac_dt_phy(struct plat_stmmacenet_data *plat,
 		mdio = true;
 	}
 
-	if (mdio)
+	if (mdio) {
 		plat->mdio_bus_data =
 			devm_kzalloc(dev, sizeof(struct stmmac_mdio_bus_data),
 				     GFP_KERNEL);
+		if (!plat->mdio_bus_data)
+			return -ENOMEM;
+
+		plat->mdio_bus_data->needs_reset = true;
+	}
+
 	return 0;
 }
 
diff --git a/include/linux/stmmac.h b/include/linux/stmmac.h
index 7d06241582dd..7b3e354bcd3c 100644
--- a/include/linux/stmmac.h
+++ b/include/linux/stmmac.h
@@ -81,6 +81,7 @@ struct stmmac_mdio_bus_data {
 	unsigned int phy_mask;
 	int *irqs;
 	int probed_phy_irq;
+	bool needs_reset;
 };
 
 struct stmmac_dma_cfg {
-- 
2.22.0
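
The idea of the patch, installing the reset hook only when the platform
data asks for it, can be sketched outside the kernel as follows. The types
and helpers (struct bus_data, struct bus, demo_reset(), bus_setup(),
bus_register()) are hypothetical reductions, and the "call the hook only if
it is set" behaviour of the registration path is an assumption of this
sketch.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, reduced versions of the structures touched by the patch. */
struct bus_data {
	bool needs_reset;
};

struct bus {
	int (*reset)(struct bus *bus); /* NULL means "no reset required" */
	int reset_count;
};

static int demo_reset(struct bus *bus)
{
	bus->reset_count++;
	return 0;
}

/* Install the reset hook only when the platform data asks for it. */
static void bus_setup(struct bus *bus, const struct bus_data *data)
{
	bus->reset = NULL;
	bus->reset_count = 0;
	if (data->needs_reset)
		bus->reset = demo_reset;
}

/* Generic registration path: call the hook if one is present. */
static int bus_register(struct bus *bus)
{
	return bus->reset ? bus->reset(bus) : 0;
}
```

With needs_reset cleared, registration never enters the reset code at all,
so any lookups or warnings inside the reset routine are skipped as well.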


* [PATCH net-next 2/2] net: stmmac: Do not request stmmaceth clock
From: Thierry Reding @ 2019-07-26 10:27 UTC (permalink / raw)
  To: David S . Miller
  Cc: Jose Abreu, Alexandre Torgue, Giuseppe Cavallaro, Jon Hunter,
	netdev, linux-tegra, linux-arm-kernel
In-Reply-To: <20190726102741.27872-1-thierry.reding@gmail.com>

From: Thierry Reding <treding@nvidia.com>

The stmmaceth clock is specified by the slave_bus and apb_pclk clocks in
the device tree bindings for snps,dwc-qos-ethernet-4.10 compatible nodes
of this IP.

The subdrivers for these bindings will be requesting the stmmac clock
correctly at a later point, so there is no need to request it here and
cause an error message to be printed to the kernel log.

Signed-off-by: Thierry Reding <treding@nvidia.com>
---
 .../net/ethernet/stmicro/stmmac/stmmac_platform.c  | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c
index 333b09564b88..7ad2bb90ceb1 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c
@@ -521,13 +521,15 @@ stmmac_probe_config_dt(struct platform_device *pdev, const char **mac)
 	}
 
 	/* clock setup */
-	plat->stmmac_clk = devm_clk_get(&pdev->dev,
-					STMMAC_RESOURCE_NAME);
-	if (IS_ERR(plat->stmmac_clk)) {
-		dev_warn(&pdev->dev, "Cannot get CSR clock\n");
-		plat->stmmac_clk = NULL;
+	if (!of_device_is_compatible(np, "snps,dwc-qos-ethernet-4.10")) {
+		plat->stmmac_clk = devm_clk_get(&pdev->dev,
+						STMMAC_RESOURCE_NAME);
+		if (IS_ERR(plat->stmmac_clk)) {
+			dev_warn(&pdev->dev, "Cannot get CSR clock\n");
+			plat->stmmac_clk = NULL;
+		}
+		clk_prepare_enable(plat->stmmac_clk);
 	}
-	clk_prepare_enable(plat->stmmac_clk);
 
 	plat->pclk = devm_clk_get(&pdev->dev, "pclk");
 	if (IS_ERR(plat->pclk)) {
-- 
2.22.0


* Re: [PATCH net-next 2/2] mlx4/en_netdev: call notifiers when hw_enc_features change
From: Davide Caratti @ 2019-07-26 10:39 UTC (permalink / raw)
  To: Saeed Mahameed, davem@davemloft.net, Tariq Toukan,
	netdev@vger.kernel.org
  Cc: Eran Ben Elisha
In-Reply-To: <f9ca12ff3880f94d4576ab4e4239f072ed611293.camel@mellanox.com>

On Thu, 2019-07-25 at 21:27 +0000, Saeed Mahameed wrote:
> On Thu, 2019-07-25 at 14:25 +0200, Davide Caratti wrote:
> > On Wed, 2019-07-24 at 20:47 +0000, Saeed Mahameed wrote:
> > > On Wed, 2019-07-24 at 16:02 +0200, Davide Caratti wrote:
> > > > ensure to call netdev_features_change() when the driver flips its
> > > > hw_enc_features bits.
> > > > 
> > > > Signed-off-by: Davide Caratti <dcaratti@redhat.com>
> > > 
> > > The patch is correct, 
> > 
> > hello Saeed, and thanks for looking at this!
> > 
> > > but can you explain how did you come to this ? 
> > > did you encounter any issue with the current code ?
> > > 
> > > I am asking just because i think the whole dynamic changing of
> > > dev->hw_enc_features is redundant since mlx4 has the features_check
> > > callback.
> > 
> > we need it to ensure that vlan_transfer_features() updates
> > the (new) value of hw_enc_features in the overlying vlan: otherwise,
> > segmentation will happen anyway when skb passes from vxlan to vlan,
> > if the vxlan is added after the vlan device has been created (see:
> > 7dad9937e064 ("net: vlan: add support for tunnel offload") ).
> > 
> 
> but in previous patch you made sure that the vlan always sees the
> correct hw_enc_features on driver load, we don't need to have this
> dynamic update mechanism,

ok, but the mlx4 driver flips the value of hw_enc_features when VXLAN
tunnels are added or removed. So, assume eth0 is a Cx3-pro, and I do:
 
 # ip link add name vlan5 link eth0 type vlan id 5
 # ip link add dev vxlan6 type vxlan id 6  [...]  dev vlan5
 
the value of dev->hw_enc_features is 0 for vlan5, and as a consequence
VXLAN over VLAN traffic becomes segmented by the VLAN, even if eth0, at
the end of this sequence, has the "right" features bits.
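
The ordering problem in that sequence can be reduced to a toy model in
user-space C. struct dev, F_TUNNEL_GSO, transfer_features() and
lower_set_features() are hypothetical; the model only shows that a stacked
device which copied the bits at creation time stays stale unless it is told
about a later change.

```c
#include <stddef.h>

#define F_TUNNEL_GSO 0x1u /* hypothetical feature bit */

/* Hypothetical miniature devices; `lower` is NULL for a physical device. */
struct dev {
	unsigned int hw_enc_features;
	struct dev *lower;
};

/* The upper device copies the lower device's bits (at creation or on notify). */
static void transfer_features(struct dev *upper)
{
	upper->hw_enc_features = upper->lower->hw_enc_features;
}

/* The driver flips bits; `notify` selects whether the stacked device is told. */
static void lower_set_features(struct dev *lower, struct dev *upper,
			       unsigned int bits, int notify)
{
	lower->hw_enc_features = bits;
	if (notify)
		transfer_features(upper);
}
```

Creating the upper device first and flipping the lower bits without a
notification leaves the upper device at 0, which corresponds to the vlan5
case above; with the notification, both devices end up with the same bits.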

> features_check ndo should take care of
> protocols we don't support.

I just had a look at mlx4_en_features_check(), I see it checks if the
packet is tunneled in VXLAN and the destination port matches the
configured value of priv->vxlan_port (when that value is not zero). Now:

On Wed, 2019-07-24 at 20:47 +0000, Saeed Mahameed wrote:
> I am asking just because i think the whole dynamic changing of
> dev->hw_enc_features is redundant since mlx4 has the features_check
> callback.

I read your initial proposal again. Would it be correct if I just use
patch 1/2, where I add an assignment of

dev->hw_enc_features = NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM | \
                       NETIF_F_RXCSUM | \
                       NETIF_F_TSO | NETIF_F_TSO6 | \
                       NETIF_F_GSO_UDP_TUNNEL | \
                       NETIF_F_GSO_UDP_TUNNEL_CSUM | \
                       NETIF_F_GSO_PARTIAL;

in mlx4_en_init_netdev(), and then remove the code that flips
dev->hw_enc_features in mlx4_en_add_vxlan_offloads() and
mlx4_en_del_vxlan_offloads() ?

thanks,
--
davide



* KASAN: use-after-free Read in bpf_get_prog_name
From: syzbot @ 2019-07-26 10:59 UTC (permalink / raw)
  To: ast, bpf, daniel, davem, hawk, jakub.kicinski, john.fastabend,
	kafai, linux-kernel, netdev, songliubraving, syzkaller-bugs,
	xdp-newbies, yhs

Hello,

syzbot found the following crash on:

HEAD commit:    192f0f8e Merge tag 'powerpc-5.3-1' of git://git.kernel.org..
git tree:       bpf-next
console output: https://syzkaller.appspot.com/x/log.txt?x=170afe64600000
kernel config:  https://syzkaller.appspot.com/x/.config?x=87305c3ca9c25c70
dashboard link: https://syzkaller.appspot.com/bug?extid=4d5cdc96ead2e74e7f90
compiler:       gcc (GCC) 9.0.0 20181231 (experimental)

Unfortunately, I don't have any reproducer for this crash yet.

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+4d5cdc96ead2e74e7f90@syzkaller.appspotmail.com

==================================================================
BUG: KASAN: use-after-free in string_nocheck+0x219/0x240 lib/vsprintf.c:605
Read of size 1 at addr ffff88809fee2d70 by task syz-executor.1/30647

CPU: 1 PID: 30647 Comm: syz-executor.1 Not tainted 5.2.0+ #41
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
  __dump_stack lib/dump_stack.c:77 [inline]
  dump_stack+0x172/0x1f0 lib/dump_stack.c:113
  print_address_description.cold+0xd4/0x306 mm/kasan/report.c:351
  __kasan_report.cold+0x1b/0x36 mm/kasan/report.c:482
  kasan_report+0x12/0x20 mm/kasan/common.c:612
  __asan_report_load1_noabort+0x14/0x20 mm/kasan/generic_report.c:129
  string_nocheck+0x219/0x240 lib/vsprintf.c:605
  string+0xed/0x100 lib/vsprintf.c:668
  vsnprintf+0x97b/0x19a0 lib/vsprintf.c:2503
  snprintf+0xbb/0xf0 lib/vsprintf.c:2636
  bpf_get_prog_name+0x159/0x360 kernel/bpf/core.c:570
  perf_event_bpf_emit_ksymbols+0x284/0x390 kernel/events/core.c:7883
  perf_event_bpf_event+0x253/0x290 kernel/events/core.c:7914
  bpf_prog_load+0x102a/0x1670 kernel/bpf/syscall.c:1723
  __do_sys_bpf+0xa46/0x42f0 kernel/bpf/syscall.c:2849
  __se_sys_bpf kernel/bpf/syscall.c:2808 [inline]
  __x64_sys_bpf+0x73/0xb0 kernel/bpf/syscall.c:2808
  do_syscall_64+0xfd/0x6a0 arch/x86/entry/common.c:296
  entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x459829
Code: fd b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 cb b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007f8c78cf3c78 EFLAGS: 00000246 ORIG_RAX: 0000000000000141
RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 0000000000459829
RDX: 0000000000000070 RSI: 0000000020000240 RDI: 0000000000000005
RBP: 000000000075bfc8 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00007f8c78cf46d4
R13: 00000000004bfc7c R14: 00000000004d16d8 R15: 00000000ffffffff

Allocated by task 30647:
  save_stack+0x23/0x90 mm/kasan/common.c:69
  set_track mm/kasan/common.c:77 [inline]
  __kasan_kmalloc mm/kasan/common.c:487 [inline]
  __kasan_kmalloc.constprop.0+0xcf/0xe0 mm/kasan/common.c:460
  kasan_kmalloc+0x9/0x10 mm/kasan/common.c:501
  kmem_cache_alloc_trace+0x158/0x790 mm/slab.c:3550
  kmalloc include/linux/slab.h:552 [inline]
  kzalloc include/linux/slab.h:748 [inline]
  bpf_prog_alloc_no_stats+0xe6/0x2b0 kernel/bpf/core.c:88
  bpf_prog_alloc+0x31/0x230 kernel/bpf/core.c:110
  bpf_prog_load+0x400/0x1670 kernel/bpf/syscall.c:1652
  __do_sys_bpf+0xa46/0x42f0 kernel/bpf/syscall.c:2849
  __se_sys_bpf kernel/bpf/syscall.c:2808 [inline]
  __x64_sys_bpf+0x73/0xb0 kernel/bpf/syscall.c:2808
  do_syscall_64+0xfd/0x6a0 arch/x86/entry/common.c:296
  entry_SYSCALL_64_after_hwframe+0x49/0xbe

Freed by task 12:
  save_stack+0x23/0x90 mm/kasan/common.c:69
  set_track mm/kasan/common.c:77 [inline]
  __kasan_slab_free+0x102/0x150 mm/kasan/common.c:449
  kasan_slab_free+0xe/0x10 mm/kasan/common.c:457
  __cache_free mm/slab.c:3425 [inline]
  kfree+0x10a/0x2c0 mm/slab.c:3756
  __bpf_prog_free+0x87/0xc0 kernel/bpf/core.c:258
  bpf_jit_free+0x64/0x1b0
  bpf_prog_free_deferred+0x27a/0x350 kernel/bpf/core.c:1982
  process_one_work+0x9af/0x1740 kernel/workqueue.c:2269
  worker_thread+0x98/0xe40 kernel/workqueue.c:2415
  kthread+0x361/0x430 kernel/kthread.c:255
  ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:352

The buggy address belongs to the object at ffff88809fee2cc0
  which belongs to the cache kmalloc-512 of size 512
The buggy address is located 176 bytes inside of
  512-byte region [ffff88809fee2cc0, ffff88809fee2ec0)
The buggy address belongs to the page:
page:ffffea00027fb880 refcount:1 mapcount:0 mapping:ffff8880aa400a80 index:0x0
flags: 0x1fffc0000000200(slab)
raw: 01fffc0000000200 ffffea0002709008 ffffea000246e348 ffff8880aa400a80
raw: 0000000000000000 ffff88809fee2040 0000000100000006 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
  ffff88809fee2c00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
  ffff88809fee2c80: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
> ffff88809fee2d00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                                                              ^
  ffff88809fee2d80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
  ffff88809fee2e00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
==================================================================
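The address arithmetic in the report above can be cross-checked by hand. The following is a minimal sketch, assuming generic KASAN's granularity of 8 bytes of memory per shadow byte and 16 shadow bytes per dump row (both consistent with the row addresses in this dump). The helper names are made up for illustration and are not kernel functions.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: generic KASAN, one shadow byte describes 8 bytes of memory. */
#define KASAN_GRANULE 8u
/* Assumption: each row of the "Memory state" dump shows 16 shadow bytes. */
#define SHADOW_BYTES_PER_ROW 16u

/* Byte offset of the faulting address inside the slab object. */
static uint64_t offset_in_object(uint64_t addr, uint64_t obj_start)
{
	assert(addr >= obj_start);
	return addr - obj_start;
}

/* Number of bytes of real memory described by one dump row. */
static uint64_t row_span_bytes(void)
{
	return (uint64_t)KASAN_GRANULE * SHADOW_BYTES_PER_ROW;
}

/* Index, within one dump row, of the shadow byte that covers addr. */
static unsigned int shadow_index_in_row(uint64_t addr, uint64_t row_start)
{
	assert(addr >= row_start && addr - row_start < row_span_bytes());
	return (unsigned int)((addr - row_start) / KASAN_GRANULE);
}
```

With the values from the report, offset_in_object(0xffff88809fee2d70, 0xffff88809fee2cc0) gives 176, matching "176 bytes inside of" the 512-byte region, and shadow_index_in_row(0xffff88809fee2d70, 0xffff88809fee2d00) gives 14, which is the column the caret points at. The shadow value there is fb, which KASAN uses for freed slab memory.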


---
This bug is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.

syzbot will keep track of this bug report. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.

^ permalink raw reply

* Re: possible deadlock in rxrpc_put_peer
From: Dmitry Vyukov @ 2019-07-26 10:59 UTC (permalink / raw)
  To: syzbot, David Howells, David Miller, linux-afs, netdev
  Cc: LKML, syzkaller-bugs
In-Reply-To: <000000000000b7abcc058e924c12@google.com>

On Fri, Jul 26, 2019 at 11:38 AM syzbot
<syzbot+72af434e4b3417318f84@syzkaller.appspotmail.com> wrote:
>
> Hello,
>
> syzbot found the following crash on:
>
> HEAD commit:    6d21a41b Add linux-next specific files for 20190718
> git tree:       linux-next
> console output: https://syzkaller.appspot.com/x/log.txt?x=174e3af0600000
> kernel config:  https://syzkaller.appspot.com/x/.config?x=3430a151e1452331
> dashboard link: https://syzkaller.appspot.com/bug?extid=72af434e4b3417318f84
> compiler:       gcc (GCC) 9.0.0 20181231 (experimental)
>
> Unfortunately, I don't have any reproducer for this crash yet.
>
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+72af434e4b3417318f84@syzkaller.appspotmail.com

+net/rxrpc/peer_object.c maintainers

> ============================================
> WARNING: possible recursive locking detected
> 5.2.0-next-20190718 #41 Not tainted
> --------------------------------------------
> kworker/0:3/21678 is trying to acquire lock:
> 00000000aa5eecdf (&(&rxnet->peer_hash_lock)->rlock){+.-.}, at: spin_lock_bh
> /./include/linux/spinlock.h:343 [inline]
> 00000000aa5eecdf (&(&rxnet->peer_hash_lock)->rlock){+.-.}, at:
> __rxrpc_put_peer /net/rxrpc/peer_object.c:415 [inline]
> 00000000aa5eecdf (&(&rxnet->peer_hash_lock)->rlock){+.-.}, at:
> rxrpc_put_peer+0x2d3/0x6a0 /net/rxrpc/peer_object.c:435
>
> but task is already holding lock:
> 00000000aa5eecdf (&(&rxnet->peer_hash_lock)->rlock){+.-.}, at: spin_lock_bh
> /./include/linux/spinlock.h:343 [inline]
> 00000000aa5eecdf (&(&rxnet->peer_hash_lock)->rlock){+.-.}, at:
> rxrpc_peer_keepalive_dispatch /net/rxrpc/peer_event.c:378 [inline]
> 00000000aa5eecdf (&(&rxnet->peer_hash_lock)->rlock){+.-.}, at:
> rxrpc_peer_keepalive_worker+0x6b3/0xd02 /net/rxrpc/peer_event.c:430
>
> other info that might help us debug this:
>   Possible unsafe locking scenario:
>
>         CPU0
>         ----
>    lock(&(&rxnet->peer_hash_lock)->rlock);
>    lock(&(&rxnet->peer_hash_lock)->rlock);
>
>   *** DEADLOCK ***
>
>   May be due to missing lock nesting notation
>
> 3 locks held by kworker/0:3/21678:
>   #0: 000000007c4c2bc3 ((wq_completion)krxrpcd){+.+.}, at: __write_once_size
> /./include/linux/compiler.h:226 [inline]
>   #0: 000000007c4c2bc3 ((wq_completion)krxrpcd){+.+.}, at: arch_atomic64_set
> /./arch/x86/include/asm/atomic64_64.h:34 [inline]
>   #0: 000000007c4c2bc3 ((wq_completion)krxrpcd){+.+.}, at: atomic64_set
> /./include/asm-generic/atomic-instrumented.h:855 [inline]
>   #0: 000000007c4c2bc3 ((wq_completion)krxrpcd){+.+.}, at: atomic_long_set
> /./include/asm-generic/atomic-long.h:40 [inline]
>   #0: 000000007c4c2bc3 ((wq_completion)krxrpcd){+.+.}, at: set_work_data
> /kernel/workqueue.c:620 [inline]
>   #0: 000000007c4c2bc3 ((wq_completion)krxrpcd){+.+.}, at:
> set_work_pool_and_clear_pending /kernel/workqueue.c:647 [inline]
>   #0: 000000007c4c2bc3 ((wq_completion)krxrpcd){+.+.}, at:
> process_one_work+0x88b/0x1740 /kernel/workqueue.c:2240
>   #1: 000000006782bc7f
> ((work_completion)(&rxnet->peer_keepalive_work)){+.+.}, at:
> process_one_work+0x8c1/0x1740 /kernel/workqueue.c:2244
>   #2: 00000000aa5eecdf (&(&rxnet->peer_hash_lock)->rlock){+.-.}, at:
> spin_lock_bh /./include/linux/spinlock.h:343 [inline]
>   #2: 00000000aa5eecdf (&(&rxnet->peer_hash_lock)->rlock){+.-.}, at:
> rxrpc_peer_keepalive_dispatch /net/rxrpc/peer_event.c:378 [inline]
>   #2: 00000000aa5eecdf (&(&rxnet->peer_hash_lock)->rlock){+.-.}, at:
> rxrpc_peer_keepalive_worker+0x6b3/0xd02 /net/rxrpc/peer_event.c:430
>
> stack backtrace:
> CPU: 0 PID: 21678 Comm: kworker/0:3 Not tainted 5.2.0-next-20190718 #41
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> Google 01/01/2011
> Workqueue: krxrpcd rxrpc_peer_keepalive_worker
> Call Trace:
>   __dump_stack /lib/dump_stack.c:77 [inline]
>   dump_stack+0x172/0x1f0 /lib/dump_stack.c:113
>   print_deadlock_bug /kernel/locking/lockdep.c:2301 [inline]
>   check_deadlock /kernel/locking/lockdep.c:2342 [inline]
>   validate_chain /kernel/locking/lockdep.c:2881 [inline]
>   __lock_acquire.cold+0x194/0x398 /kernel/locking/lockdep.c:3880
>   lock_acquire+0x190/0x410 /kernel/locking/lockdep.c:4413
>   __raw_spin_lock_bh /./include/linux/spinlock_api_smp.h:135 [inline]
>   _raw_spin_lock_bh+0x33/0x50 /kernel/locking/spinlock.c:175
>   spin_lock_bh /./include/linux/spinlock.h:343 [inline]
>   __rxrpc_put_peer /net/rxrpc/peer_object.c:415 [inline]
>   rxrpc_put_peer+0x2d3/0x6a0 /net/rxrpc/peer_object.c:435
>   rxrpc_peer_keepalive_dispatch /net/rxrpc/peer_event.c:381 [inline]
>   rxrpc_peer_keepalive_worker+0x7a6/0xd02 /net/rxrpc/peer_event.c:430
>   process_one_work+0x9af/0x1740 /kernel/workqueue.c:2269
>   worker_thread+0x98/0xe40 /kernel/workqueue.c:2415
>   kthread+0x361/0x430 /kernel/kthread.c:255
>   ret_from_fork+0x24/0x30 /arch/x86/entry/entry_64.S:352
>
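The pattern lockdep complains about in the backtrace above can be modelled in userspace. This is only a toy sketch: toy_lock is a stand-in for a non-recursive spinlock, every toy_* name is hypothetical, and it assumes (as the trace suggests) that the final put of a peer takes the hash lock to unhash it. The second dispatch variant is one common way to avoid the pattern, not necessarily the fix that should go upstream.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Toy stand-in for a non-recursive lock; not the kernel's spinlock_t. */
struct toy_lock {
	bool held;
};

/* Returns 0 on success, -EDEADLK where a real spinlock would self-deadlock. */
static int toy_lock_acquire(struct toy_lock *l)
{
	if (l->held)
		return -EDEADLK;
	l->held = true;
	return 0;
}

static void toy_lock_release(struct toy_lock *l)
{
	l->held = false;
}

struct toy_peer {
	int refcount;
	struct toy_lock *hash_lock;
};

/* Models the put path: the final put takes the hash lock to unhash the peer. */
static int toy_put_peer(struct toy_peer *p)
{
	int err;

	assert(p->refcount > 0);
	if (--p->refcount > 0)
		return 0;
	err = toy_lock_acquire(p->hash_lock);
	if (err)
		return err;
	/* The peer would be removed from the hash table here. */
	toy_lock_release(p->hash_lock);
	return 0;
}

/* Models the reported path: put is called while the hash lock is held. */
static int toy_dispatch_put_under_lock(struct toy_peer *p)
{
	int err = toy_lock_acquire(p->hash_lock);

	if (err)
		return err;
	err = toy_put_peer(p);
	toy_lock_release(p->hash_lock);
	return err;
}

/* Alternative ordering: drop the hash lock before the put. */
static int toy_dispatch_put_after_unlock(struct toy_peer *p)
{
	int err = toy_lock_acquire(p->hash_lock);

	if (err)
		return err;
	toy_lock_release(p->hash_lock);
	return toy_put_peer(p);
}
```

In this model the first variant only fails when the put drops the last reference, which would explain why such a bug can stay hidden until the keepalive worker happens to hold the final reference.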
>
> ---
> This bug is generated by a bot. It may contain errors.
> See https://goo.gl/tpsmEJ for more information about syzbot.
> syzbot engineers can be reached at syzkaller@googlegroups.com.
>
> syzbot will keep track of this bug report. See:
> https://goo.gl/tpsmEJ#status for how to communicate with syzbot.
>

^ permalink raw reply

* Re: [PATCH 1/2] ipmr: Make cache queue length configurable
From: Nikolay Aleksandrov @ 2019-07-26 11:05 UTC (permalink / raw)
  To: Brodie Greenfield, davem, stephen, kuznet, yoshfuji, netdev
  Cc: linux-kernel, chris.packham, luuk.paulussen
In-Reply-To: <20190725204230.12229-2-brodie.greenfield@alliedtelesis.co.nz>

On 25/07/2019 23:42, Brodie Greenfield wrote:
> We want to be able to keep more spaces available in our queue for
> processing incoming multicast traffic (adding (S,G) entries) - this lets
> us learn more groups faster, rather than dropping them at this stage.
> 
> Signed-off-by: Brodie Greenfield <brodie.greenfield@alliedtelesis.co.nz>
> ---
>  Documentation/networking/ip-sysctl.txt | 8 ++++++++
>  include/net/netns/ipv4.h               | 1 +
>  net/ipv4/af_inet.c                     | 1 +
>  net/ipv4/ipmr.c                        | 4 +++-
>  net/ipv4/sysctl_net_ipv4.c             | 7 +++++++
>  5 files changed, 20 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
> index acdfb5d2bcaa..02f77e932adf 100644
> --- a/Documentation/networking/ip-sysctl.txt
> +++ b/Documentation/networking/ip-sysctl.txt
> @@ -887,6 +887,14 @@ ip_local_reserved_ports - list of comma separated ranges
>  
>  	Default: Empty
>  
> +ip_mr_cache_queue_length - INTEGER
> +	Limit the number of multicast packets we can have in the queue to be
> +	resolved.
> +	Bear in mind that when an unresolved multicast packet is received,
> +	there is an O(n) traversal of the queue. This should be considered
> +	if increasing.
> +	Default: 10
> +

Hi,
You've said it yourself: the queue has linear traversal time. Doesn't this patch allow any netns on the
system to raise its limit to any value, and so possibly affect others?
The socket limit will kick in at some point, though. I think that's where David
was going with his suggestion back in 2018:
https://www.spinics.net/lists/netdev/msg514543.html

If we add this sysctl now, we'll be stuck with it. I'd prefer David's suggestion
so we can rely only on the receive queue limit, which is already configurable.
We still need to be careful with the defaults, though: the NOCACHE entry is 128 bytes,
and with the skb overhead on my setup we currently end up with a default limit of about 277 entries.

Cheers,
 Nik
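To make the arithmetic behind that estimate explicit, here is a rough sketch: if the only bound is the socket receive buffer, the number of queued unresolved entries is roughly the buffer size divided by the memory charged per queued skb. The function name and both example numbers below are assumptions for illustration; the per-entry charge in particular depends on the kernel build and was not measured here.

```c
#include <assert.h>

/*
 * Rough upper bound on queued unresolved entries when the receive buffer
 * is the only limit. Both arguments are in bytes; per_entry_truesize is
 * the total memory charged to the socket for one queued skb (payload plus
 * skb overhead), which is setup-dependent.
 */
static unsigned long max_queued_entries(unsigned long rcvbuf_bytes,
					unsigned long per_entry_truesize)
{
	assert(per_entry_truesize > 0);
	return rcvbuf_bytes / per_entry_truesize;
}
```

As an example under stated assumptions, a 212992-byte receive buffer and a hypothetical 768 bytes charged per entry give max_queued_entries(212992, 768) == 277. Those inputs were picked only to show how a figure of that size can arise, so the real numbers on a given system should be checked before choosing a default.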

>  ip_unprivileged_port_start - INTEGER
>  	This is a per-namespace sysctl.  It defines the first
>  	unprivileged port in the network namespace.  Privileged ports
> diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
> index 104a6669e344..3411d3f18d51 100644
> --- a/include/net/netns/ipv4.h
> +++ b/include/net/netns/ipv4.h
> @@ -187,6 +187,7 @@ struct netns_ipv4 {
>  	int sysctl_igmp_max_msf;
>  	int sysctl_igmp_llm_reports;
>  	int sysctl_igmp_qrv;
> +	unsigned int sysctl_ip_mr_cache_queue_length;
>  
>  	struct ping_group_range ping_group_range;
>  
> diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
> index 0dfb72c46671..8e25538bdb1e 100644
> --- a/net/ipv4/af_inet.c
> +++ b/net/ipv4/af_inet.c
> @@ -1827,6 +1827,7 @@ static __net_init int inet_init_net(struct net *net)
>  	net->ipv4.sysctl_igmp_llm_reports = 1;
>  	net->ipv4.sysctl_igmp_qrv = 2;
>  
> +	net->ipv4.sysctl_ip_mr_cache_queue_length = 10;
>  	return 0;
>  }
>  
> diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
> index ddbf8c9a1abb..c6a6c3e453a9 100644
> --- a/net/ipv4/ipmr.c
> +++ b/net/ipv4/ipmr.c
> @@ -1127,6 +1127,7 @@ static int ipmr_cache_unresolved(struct mr_table *mrt, vifi_t vifi,
>  				 struct sk_buff *skb, struct net_device *dev)
>  {
>  	const struct iphdr *iph = ip_hdr(skb);
> +	struct net *net = dev_net(dev);
>  	struct mfc_cache *c;
>  	bool found = false;
>  	int err;
> @@ -1142,7 +1143,8 @@ static int ipmr_cache_unresolved(struct mr_table *mrt, vifi_t vifi,
>  
>  	if (!found) {
>  		/* Create a new entry if allowable */
> -		if (atomic_read(&mrt->cache_resolve_queue_len) >= 10 ||
> +		if (atomic_read(&mrt->cache_resolve_queue_len) >=
> +		    net->ipv4.sysctl_ip_mr_cache_queue_length ||
>  		    (c = ipmr_cache_alloc_unres()) == NULL) {
>  			spin_unlock_bh(&mfc_unres_lock);
>  
> diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
> index ba0fc4b18465..78ae86e8c6cb 100644
> --- a/net/ipv4/sysctl_net_ipv4.c
> +++ b/net/ipv4/sysctl_net_ipv4.c
> @@ -784,6 +784,13 @@ static struct ctl_table ipv4_net_table[] = {
>  		.proc_handler	= proc_dointvec
>  	},
>  #ifdef CONFIG_IP_MULTICAST
> +	{
> +		.procname	= "ip_mr_cache_queue_length",
> +		.data		= &init_net.ipv4.sysctl_ip_mr_cache_queue_length,
> +		.maxlen		= sizeof(int),
> +		.mode		= 0644,
> +		.proc_handler	= proc_dointvec
> +	},
>  	{
>  		.procname	= "igmp_qrv",
>  		.data		= &init_net.ipv4.sysctl_igmp_qrv,
> 


^ permalink raw reply

* Re: [PATCH 1/2] ipmr: Make cache queue length configurable
From: Nikolay Aleksandrov @ 2019-07-26 11:15 UTC (permalink / raw)
  To: Brodie Greenfield, davem, stephen, kuznet, yoshfuji, netdev
  Cc: linux-kernel, chris.packham, luuk.paulussen
In-Reply-To: <e5606cf7-6848-1109-6cbe-63d94868ed65@cumulusnetworks.com>

On 26/07/2019 14:05, Nikolay Aleksandrov wrote:
> On 25/07/2019 23:42, Brodie Greenfield wrote:
>> We want to be able to keep more spaces available in our queue for
>> processing incoming multicast traffic (adding (S,G) entries) - this lets
>> us learn more groups faster, rather than dropping them at this stage.
>>
>> Signed-off-by: Brodie Greenfield <brodie.greenfield@alliedtelesis.co.nz>
>> ---
>>  Documentation/networking/ip-sysctl.txt | 8 ++++++++
>>  include/net/netns/ipv4.h               | 1 +
>>  net/ipv4/af_inet.c                     | 1 +
>>  net/ipv4/ipmr.c                        | 4 +++-
>>  net/ipv4/sysctl_net_ipv4.c             | 7 +++++++
>>  5 files changed, 20 insertions(+), 1 deletion(-)
>>
>> diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
>> index acdfb5d2bcaa..02f77e932adf 100644
>> --- a/Documentation/networking/ip-sysctl.txt
>> +++ b/Documentation/networking/ip-sysctl.txt
>> @@ -887,6 +887,14 @@ ip_local_reserved_ports - list of comma separated ranges
>>  
>>  	Default: Empty
>>  
>> +ip_mr_cache_queue_length - INTEGER
>> +	Limit the number of multicast packets we can have in the queue to be
>> +	resolved.
>> +	Bear in mind that when an unresolved multicast packet is received,
>> +	there is an O(n) traversal of the queue. This should be considered
>> +	if increasing.
>> +	Default: 10
>> +
> 
> Hi,
> You've said it yourself - it has linear traversal time, but doesn't this patch allow any netns on the
> system to increase its limit to any value, thus possibly affecting others ?
> Though the socket limit will kick in at some point. I think that's where David
> was going with his suggestion back in 2018:
> https://www.spinics.net/lists/netdev/msg514543.html
> 
> If we add this sysctl now, we'll be stuck with it. I'd prefer David's suggestion
> so we can rely only on the receive queue limit which is already configurable.
> We still need to be careful with the defaults though, the NOCACHE entry is 128 bytes
> and with the skb overhead currently on my setup we end up at about 277 entries default limit.

I mean that people might be surprised if that limit were raised by default; that's the
only problem I'm not sure how to handle. Maybe we need some hard limit anyway.
Have you tested which value works for your setup?

In the end we might have to go with this patch, but perhaps limit the per-netns sysctl
to the init_ns value as a maximum (similar to what we did for frags), or not make it per-netns
at all.
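Roughly what I have in mind for the cap, as a userspace sketch rather than kernel code: a child netns can lower its own limit, but the value actually used never exceeds what init_ns has configured. The struct and function names are hypothetical, and the second helper just mirrors the ">=" comparison from the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the per-netns setting; not the kernel's struct netns_ipv4. */
struct toy_netns {
	unsigned int mr_cache_queue_length;
};

/* Limit actually used for ns: its own value, capped by the init_ns value. */
static unsigned int effective_queue_limit(const struct toy_netns *init_ns,
					  const struct toy_netns *ns)
{
	unsigned int cap;

	assert(init_ns && ns);
	cap = init_ns->mr_cache_queue_length;
	return ns->mr_cache_queue_length < cap ? ns->mr_cache_queue_length : cap;
}

/* Mirrors the check in the patch: refuse a new entry once len >= limit. */
static bool queue_has_room(unsigned int queue_len, unsigned int limit)
{
	return queue_len < limit;
}
```

With this shape, raising the value inside a child netns beyond the init_ns setting has no effect, so only the admin of init_ns can increase the cost of the O(n) traversal.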

> 
> Cheers,
>  Nik
> 
>>  ip_unprivileged_port_start - INTEGER
>>  	This is a per-namespace sysctl.  It defines the first
>>  	unprivileged port in the network namespace.  Privileged ports
>> diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
>> index 104a6669e344..3411d3f18d51 100644
>> --- a/include/net/netns/ipv4.h
>> +++ b/include/net/netns/ipv4.h
>> @@ -187,6 +187,7 @@ struct netns_ipv4 {
>>  	int sysctl_igmp_max_msf;
>>  	int sysctl_igmp_llm_reports;
>>  	int sysctl_igmp_qrv;
>> +	unsigned int sysctl_ip_mr_cache_queue_length;
>>  
>>  	struct ping_group_range ping_group_range;
>>  
>> diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
>> index 0dfb72c46671..8e25538bdb1e 100644
>> --- a/net/ipv4/af_inet.c
>> +++ b/net/ipv4/af_inet.c
>> @@ -1827,6 +1827,7 @@ static __net_init int inet_init_net(struct net *net)
>>  	net->ipv4.sysctl_igmp_llm_reports = 1;
>>  	net->ipv4.sysctl_igmp_qrv = 2;
>>  
>> +	net->ipv4.sysctl_ip_mr_cache_queue_length = 10;
>>  	return 0;
>>  }
>>  
>> diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
>> index ddbf8c9a1abb..c6a6c3e453a9 100644
>> --- a/net/ipv4/ipmr.c
>> +++ b/net/ipv4/ipmr.c
>> @@ -1127,6 +1127,7 @@ static int ipmr_cache_unresolved(struct mr_table *mrt, vifi_t vifi,
>>  				 struct sk_buff *skb, struct net_device *dev)
>>  {
>>  	const struct iphdr *iph = ip_hdr(skb);
>> +	struct net *net = dev_net(dev);
>>  	struct mfc_cache *c;
>>  	bool found = false;
>>  	int err;
>> @@ -1142,7 +1143,8 @@ static int ipmr_cache_unresolved(struct mr_table *mrt, vifi_t vifi,
>>  
>>  	if (!found) {
>>  		/* Create a new entry if allowable */
>> -		if (atomic_read(&mrt->cache_resolve_queue_len) >= 10 ||
>> +		if (atomic_read(&mrt->cache_resolve_queue_len) >=
>> +		    net->ipv4.sysctl_ip_mr_cache_queue_length ||
>>  		    (c = ipmr_cache_alloc_unres()) == NULL) {
>>  			spin_unlock_bh(&mfc_unres_lock);
>>  
>> diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
>> index ba0fc4b18465..78ae86e8c6cb 100644
>> --- a/net/ipv4/sysctl_net_ipv4.c
>> +++ b/net/ipv4/sysctl_net_ipv4.c
>> @@ -784,6 +784,13 @@ static struct ctl_table ipv4_net_table[] = {
>>  		.proc_handler	= proc_dointvec
>>  	},
>>  #ifdef CONFIG_IP_MULTICAST
>> +	{
>> +		.procname	= "ip_mr_cache_queue_length",
>> +		.data		= &init_net.ipv4.sysctl_ip_mr_cache_queue_length,
>> +		.maxlen		= sizeof(int),
>> +		.mode		= 0644,
>> +		.proc_handler	= proc_dointvec
>> +	},
>>  	{
>>  		.procname	= "igmp_qrv",
>>  		.data		= &init_net.ipv4.sysctl_igmp_qrv,
>>
> 


^ permalink raw reply
