Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH] net: sched: taprio: Fix potential integer overflow in taprio_set_picos_per_byte
From: Eric Dumazet @ 2019-09-03  7:19 UTC (permalink / raw)
  To: Gustavo A. R. Silva, Jamal Hadi Salim, Cong Wang, Jiri Pirko,
	David S. Miller, Vladimir Oltean
  Cc: netdev, linux-kernel
In-Reply-To: <20190903010817.GA13595@embeddedor>



On 9/3/19 3:08 AM, Gustavo A. R. Silva wrote:
> Add suffix LL to constant 1000 in order to avoid a potential integer
> overflow and give the compiler complete information about the proper
> arithmetic to use. Notice that this constant is being used in a context
> that expects an expression of type s64, but it's currently evaluated
> using 32-bit arithmetic.
> 
> Addresses-Coverity-ID: 1453459 ("Unintentional integer overflow")
> Fixes: f04b514c0ce2 ("taprio: Set default link speed to 10 Mbps in taprio_set_picos_per_byte")
> Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
> ---
>  net/sched/sch_taprio.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/net/sched/sch_taprio.c b/net/sched/sch_taprio.c
> index 8d8bc2ec5cd6..956f837436ea 100644
> --- a/net/sched/sch_taprio.c
> +++ b/net/sched/sch_taprio.c
> @@ -966,7 +966,7 @@ static void taprio_set_picos_per_byte(struct net_device *dev,
>  
>  skip:
>  	picos_per_byte = div64_s64(NSEC_PER_SEC * 1000LL * 8,
> -				   speed * 1000 * 1000);
> +				   speed * 1000LL * 1000);
>  
>  	atomic64_set(&q->picos_per_byte, picos_per_byte);
>  	netdev_dbg(dev, "taprio: set %s's picos_per_byte to: %lld, linkspeed: %d\n",
> 

But, why even multiplying by 1,000,000 in the first place, this seems silly,
a standard 32 bit divide could be used instead.

->

diff --git a/net/sched/sch_taprio.c b/net/sched/sch_taprio.c
index 8d8bc2ec5cd6281d811fd5d8a5c5211ebb0edd73..944b1af3215668e927d486b6c6c65c4599fb9539 100644
--- a/net/sched/sch_taprio.c
+++ b/net/sched/sch_taprio.c
@@ -965,8 +965,7 @@ static void taprio_set_picos_per_byte(struct net_device *dev,
                speed = ecmd.base.speed;
 
 skip:
-       picos_per_byte = div64_s64(NSEC_PER_SEC * 1000LL * 8,
-                                  speed * 1000 * 1000);
+       picos_per_byte = (USEC_PER_SEC * 8) / speed;
 
        atomic64_set(&q->picos_per_byte, picos_per_byte);
        netdev_dbg(dev, "taprio: set %s's picos_per_byte to: %lld, linkspeed: %d\n",




^ permalink raw reply related

* Re: [PATCH v4 2/5] vsock/virtio: reduce credit update messages
From: Stefano Garzarella @ 2019-09-03  7:31 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: netdev, linux-kernel, Stefan Hajnoczi, David S. Miller,
	virtualization, Jason Wang, kvm
In-Reply-To: <20190903003050-mutt-send-email-mst@kernel.org>

On Tue, Sep 03, 2019 at 12:38:02AM -0400, Michael S. Tsirkin wrote:
> On Wed, Jul 17, 2019 at 01:30:27PM +0200, Stefano Garzarella wrote:
> > In order to reduce the number of credit update messages,
> > we send them only when the space available seen by the
> > transmitter is less than VIRTIO_VSOCK_MAX_PKT_BUF_SIZE.
> > 
> > Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
> > ---
> >  include/linux/virtio_vsock.h            |  1 +
> >  net/vmw_vsock/virtio_transport_common.c | 16 +++++++++++++---
> >  2 files changed, 14 insertions(+), 3 deletions(-)
> > 
> > diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
> > index 7d973903f52e..49fc9d20bc43 100644
> > --- a/include/linux/virtio_vsock.h
> > +++ b/include/linux/virtio_vsock.h
> > @@ -41,6 +41,7 @@ struct virtio_vsock_sock {
> >  
> >  	/* Protected by rx_lock */
> >  	u32 fwd_cnt;
> > +	u32 last_fwd_cnt;
> >  	u32 rx_bytes;
> >  	struct list_head rx_queue;
> >  };
> > diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
> > index 095221f94786..a85559d4d974 100644
> > --- a/net/vmw_vsock/virtio_transport_common.c
> > +++ b/net/vmw_vsock/virtio_transport_common.c
> > @@ -211,6 +211,7 @@ static void virtio_transport_dec_rx_pkt(struct virtio_vsock_sock *vvs,
> >  void virtio_transport_inc_tx_pkt(struct virtio_vsock_sock *vvs, struct virtio_vsock_pkt *pkt)
> >  {
> >  	spin_lock_bh(&vvs->tx_lock);
> > +	vvs->last_fwd_cnt = vvs->fwd_cnt;
> >  	pkt->hdr.fwd_cnt = cpu_to_le32(vvs->fwd_cnt);
> >  	pkt->hdr.buf_alloc = cpu_to_le32(vvs->buf_alloc);
> >  	spin_unlock_bh(&vvs->tx_lock);
> > @@ -261,6 +262,7 @@ virtio_transport_stream_do_dequeue(struct vsock_sock *vsk,
> >  	struct virtio_vsock_sock *vvs = vsk->trans;
> >  	struct virtio_vsock_pkt *pkt;
> >  	size_t bytes, total = 0;
> > +	u32 free_space;
> >  	int err = -EFAULT;
> >  
> >  	spin_lock_bh(&vvs->rx_lock);
> > @@ -291,11 +293,19 @@ virtio_transport_stream_do_dequeue(struct vsock_sock *vsk,
> >  			virtio_transport_free_pkt(pkt);
> >  		}
> >  	}
> > +
> > +	free_space = vvs->buf_alloc - (vvs->fwd_cnt - vvs->last_fwd_cnt);
> > +
> >  	spin_unlock_bh(&vvs->rx_lock);
> >  
> > -	/* Send a credit pkt to peer */
> > -	virtio_transport_send_credit_update(vsk, VIRTIO_VSOCK_TYPE_STREAM,
> > -					    NULL);
> > +	/* We send a credit update only when the space available seen
> > +	 * by the transmitter is less than VIRTIO_VSOCK_MAX_PKT_BUF_SIZE
> 
> This is just repeating what code does though.
> Please include the *reason* for the condition.
> E.g. here's a better comment:
> 
> 	/* To reduce number of credit update messages,
> 	 * don't update credits as long as lots of space is available.
> 	 * Note: the limit chosen here is arbitrary. Setting the limit
> 	 * too high causes extra messages. Too low causes transmitter
> 	 * stalls. As stalls are in theory more expensive than extra
> 	 * messages, we set the limit to a high value. TODO: experiment
> 	 * with different values.
> 	 */
> 

Yes, it is better, sorry for that. I'll try to avoid unnecessary comments,
explaining the reason for certain changes.

Since this patch is already queued in net-next, should I send another
patch to fix the comment?

Thanks,
Stefano

^ permalink raw reply

* Re: [PATCH v3] tun: fix use-after-free when register netdev failed
From: Yang Yingliang @ 2019-09-03  7:35 UTC (permalink / raw)
  To: Jason Wang
  Cc: David Miller, netdev, eric dumazet, xiyou wangcong, weiyongjun1
In-Reply-To: <71e17457-d4bc-15be-dfb3-d0a977fd7556@redhat.com>



On 2019/9/3 14:06, Jason Wang wrote:
>
> On 2019/9/3 下午1:42, Yang Yingliang wrote:
>>
>>
>> On 2019/9/3 11:03, Jason Wang wrote:
>>>
>>> On 2019/9/3 上午9:45, Yang Yingliang wrote:
>>>>
>>>>
>>>> On 2019/9/2 13:32, Jason Wang wrote:
>>>>>
>>>>> On 2019/8/23 下午5:36, Yang Yingliang wrote:
>>>>>>
>>>>>>
>>>>>> On 2019/8/23 11:05, Jason Wang wrote:
>>>>>>> ----- Original Message -----
>>>>>>>>
>>>>>>>> On 2019/8/22 14:07, Yang Yingliang wrote:
>>>>>>>>>
>>>>>>>>> On 2019/8/22 10:13, Jason Wang wrote:
>>>>>>>>>> On 2019/8/20 上午10:28, Jason Wang wrote:
>>>>>>>>>>> On 2019/8/20 上午9:25, David Miller wrote:
>>>>>>>>>>>> From: Yang Yingliang <yangyingliang@huawei.com>
>>>>>>>>>>>> Date: Mon, 19 Aug 2019 21:31:19 +0800
>>>>>>>>>>>>
>>>>>>>>>>>>> Call tun_attach() after register_netdevice() to make sure 
>>>>>>>>>>>>> tfile->tun
>>>>>>>>>>>>> is not published until the netdevice is registered. So the 
>>>>>>>>>>>>> read/write
>>>>>>>>>>>>> thread can not use the tun pointer that may freed by 
>>>>>>>>>>>>> free_netdev().
>>>>>>>>>>>>> (The tun and dev pointer are allocated by 
>>>>>>>>>>>>> alloc_netdev_mqs(), they
>>>>>>>>>>>>> can
>>>>>>>>>>>>> be freed by netdev_freemem().)
>>>>>>>>>>>> register_netdevice() must always be the last operation in 
>>>>>>>>>>>> the order of
>>>>>>>>>>>> network device setup.
>>>>>>>>>>>>
>>>>>>>>>>>> At the point register_netdevice() is called, the device is 
>>>>>>>>>>>> visible
>>>>>>>>>>>> globally
>>>>>>>>>>>> and therefore all of it's software state must be fully 
>>>>>>>>>>>> initialized and
>>>>>>>>>>>> ready for us.
>>>>>>>>>>>>
>>>>>>>>>>>> You're going to have to find another solution to these 
>>>>>>>>>>>> problems.
>>>>>>>>>>>
>>>>>>>>>>> The device is loosely coupled with sockets/queues. Each side is
>>>>>>>>>>> allowed to be go away without caring the other side. So in this
>>>>>>>>>>> case, there's a small window that network stack think the 
>>>>>>>>>>> device has
>>>>>>>>>>> one queue but actually not, the code can then safely drop them.
>>>>>>>>>>> Maybe it's ok here with some comments?
>>>>>>>>>>>
>>>>>>>>>>> Or if not, we can try to hold the device before tun_attach 
>>>>>>>>>>> and drop
>>>>>>>>>>> it after register_netdevice().
>>>>>>>>>>
>>>>>>>>>> Hi Yang:
>>>>>>>>>>
>>>>>>>>>> I think maybe we can try to hold refcnt instead of playing 
>>>>>>>>>> real num
>>>>>>>>>> queues here. Do you want to post a V4?
>>>>>>>>> I think the refcnt can prevent freeing the memory in this case.
>>>>>>>>> When register_netdevice() failed, free_netdev() will be called 
>>>>>>>>> directly,
>>>>>>>>> dev->pcpu_refcnt and dev are freed without checking refcnt of 
>>>>>>>>> dev.
>>>>>>>> How about using patch-v1 that using a flag to check whether the 
>>>>>>>> device
>>>>>>>> registered successfully.
>>>>>>>>
>>>>>>> As I said, it lacks sufficient locks or barriers. To be clear, I 
>>>>>>> meant
>>>>>>> something like (compile-test only):
>>>>>>>
>>>>>>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
>>>>>>> index db16d7a13e00..e52678f9f049 100644
>>>>>>> --- a/drivers/net/tun.c
>>>>>>> +++ b/drivers/net/tun.c
>>>>>>> @@ -2828,6 +2828,7 @@ static int tun_set_iff(struct net *net, 
>>>>>>> struct file *file, struct ifreq *ifr)
>>>>>>>                                (ifr->ifr_flags & TUN_FEATURES);
>>>>>>> INIT_LIST_HEAD(&tun->disabled);
>>>>>>> +               dev_hold(dev);
>>>>>>>                  err = tun_attach(tun, file, false, 
>>>>>>> ifr->ifr_flags & IFF_NAPI,
>>>>>>>                                   ifr->ifr_flags & IFF_NAPI_FRAGS);
>>>>>>>                  if (err < 0)
>>>>>>> @@ -2836,6 +2837,7 @@ static int tun_set_iff(struct net *net, 
>>>>>>> struct file *file, struct ifreq *ifr)
>>>>>>>                  err = register_netdevice(tun->dev);
>>>>>>>                  if (err < 0)
>>>>>>>                          goto err_detach;
>>>>>>> +               dev_put(dev);
>>>>>>>          }
>>>>>>>            netif_carrier_on(tun->dev);
>>>>>>> @@ -2852,11 +2854,13 @@ static int tun_set_iff(struct net *net, 
>>>>>>> struct file *file, struct ifreq *ifr)
>>>>>>>          return 0;
>>>>>>>     err_detach:
>>>>>>> +       dev_put(dev);
>>>>>>>          tun_detach_all(dev);
>>>>>>>          /* register_netdevice() already called 
>>>>>>> tun_free_netdev() */
>>>>>>>          goto err_free_dev;
>>>>>>>     err_free_flow:
>>>>>>> +       dev_put(dev);
>>>>>>>          tun_flow_uninit(tun);
>>>>>>> security_tun_dev_free_security(tun->security);
>>>>>>>   err_free_stat:
>>>>>>>
>>>>>>> What's your thought?
>>>>>>
>>>>>> The dev pointer are freed without checking the refcount in 
>>>>>> free_netdev() called by err_free_dev
>>>>>>
>>>>>> path, so I don't understand how the refcount protects this pointer.
>>>>>>
>>>>>
>>>>> The refcount are guaranteed to be zero there, isn't it?
>>>> No, it's not.
>>>>
>>>> err_free_dev:
>>>>         free_netdev(dev);
>>>>
>>>> void free_netdev(struct net_device *dev)
>>>> {
>>>> ...
>>>>         /* pcpu_refcnt can be freed without checking refcount */
>>>>         free_percpu(dev->pcpu_refcnt);
>>>>         dev->pcpu_refcnt = NULL;
>>>>
>>>>         /*  Compatibility with error handling in drivers */
>>>>         if (dev->reg_state == NETREG_UNINITIALIZED) {
>>>>                 /* dev can be freed without checking refcount */
>>>>                 netdev_freemem(dev);
>>>>                 return;
>>>>         }
>>>> ...
>>>> }
>>>
>>>
>>> Right, but what I meant is in my patch, when code reaches 
>>> free_netdev() the refcnt is zero. What did I miss?
>> Yes, but it can't fix the UAF problem.
>
>
> Well, it looks to me that the dev_put() in tun_put() won't release the 
> device in this case.

The device is not released in tun_put().
This is how the UAF occurs:

         CPUA                                           CPUB
     tun_set_iff()
       alloc_netdev_mqs()
       tun_attach()
                                                     tun_chr_read_iter()
                                                       tun_get()
                                                       tun_do_read()
                                                         tun_ring_recv()
       register_netdevice() <-- inject error
       goto err_detach
       tun_detach_all() <-- set RCV_SHUTDOWN
       free_netdev() <-- called from
                        err_free_dev path
         netdev_freemem() <-- free the memory
                           without check refcount
         (In this path, the refcount cannot prevent
          freeing the memory of dev, and the memory
          will be used by dev_put() called by
          tun_chr_read_iter() on CPUB.)
                                                        (Break from tun_ring_recv(), because RCV_SHUTDOWN is set)
                                                      tun_put()
                                                      dev_put() <-- use the memory freed by netdev_freemem()


>
> Thanks
>



^ permalink raw reply

* RE: [PATCH net-next] r8152: modify rtl8152_set_speed function
From: Hayes Wang @ 2019-09-03  7:36 UTC (permalink / raw)
  To: Heiner Kallweit, netdev@vger.kernel.org
  Cc: nic_swsd, linux-kernel@vger.kernel.org
In-Reply-To: <32d490ae-70af-ba86-93de-be342a2a7e39@gmail.com>

Heiner Kallweit [mailto:hkallweit1@gmail.com]
> Sent: Tuesday, September 03, 2019 3:13 PM
[...]
> > Some of our chips support the feature of UPS. When satisfying certain
> > condition, the hw would recover the settings of speed. Therefore, I have
> > to record the settings of the speed, and set them to hw.
> >
> Not knowing the UPS feature in detail:
> In net-next I changed the software "PHY speed-down" implementation to
> be more generic. It stores the old advertised settings in a new
> phy_device member adv_old, and restores them in phy_speed_up().
> Maybe what you need is similar.

It is a feature about power saving. When some conditions are
satisfied, the power of PHY would be cut. And the hw would
restore the PHY settings including the speed automatically,
when leaving power saving mode.

Best Regards,
Hayes



^ permalink raw reply

* [PATCH net-next] vsock/virtio: a better comment on credit update
From: Michael S. Tsirkin @ 2019-09-03  7:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: Stefano Garzarella, Stefan Hajnoczi, David S. Miller, kvm,
	virtualization, netdev

The comment we have is just repeating what the code does.
Include the *reason* for the condition instead.

Cc: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 net/vmw_vsock/virtio_transport_common.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
index 94cc0fa3e848..5bb70c692b1e 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -307,8 +307,13 @@ virtio_transport_stream_do_dequeue(struct vsock_sock *vsk,
 
 	spin_unlock_bh(&vvs->rx_lock);
 
-	/* We send a credit update only when the space available seen
-	 * by the transmitter is less than VIRTIO_VSOCK_MAX_PKT_BUF_SIZE
+	/* To reduce the number of credit update messages,
+	 * don't update credits as long as lots of space is available.
+	 * Note: the limit chosen here is arbitrary. Setting the limit
+	 * too high causes extra messages. Too low causes transmitter
+	 * stalls. As stalls are in theory more expensive than extra
+	 * messages, we set the limit to a high value. TODO: experiment
+	 * with different values.
 	 */
 	if (free_space < VIRTIO_VSOCK_MAX_PKT_BUF_SIZE) {
 		virtio_transport_send_credit_update(vsk,
-- 
MST

^ permalink raw reply related

* Re: [PATCH v4 2/5] vsock/virtio: reduce credit update messages
From: Michael S. Tsirkin @ 2019-09-03  7:38 UTC (permalink / raw)
  To: Stefano Garzarella
  Cc: netdev, linux-kernel, Stefan Hajnoczi, David S. Miller,
	virtualization, Jason Wang, kvm
In-Reply-To: <20190903073120.kefllalytkvidcvh@steredhat>

On Tue, Sep 03, 2019 at 09:31:20AM +0200, Stefano Garzarella wrote:
> On Tue, Sep 03, 2019 at 12:38:02AM -0400, Michael S. Tsirkin wrote:
> > On Wed, Jul 17, 2019 at 01:30:27PM +0200, Stefano Garzarella wrote:
> > > In order to reduce the number of credit update messages,
> > > we send them only when the space available seen by the
> > > transmitter is less than VIRTIO_VSOCK_MAX_PKT_BUF_SIZE.
> > > 
> > > Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
> > > ---
> > >  include/linux/virtio_vsock.h            |  1 +
> > >  net/vmw_vsock/virtio_transport_common.c | 16 +++++++++++++---
> > >  2 files changed, 14 insertions(+), 3 deletions(-)
> > > 
> > > diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
> > > index 7d973903f52e..49fc9d20bc43 100644
> > > --- a/include/linux/virtio_vsock.h
> > > +++ b/include/linux/virtio_vsock.h
> > > @@ -41,6 +41,7 @@ struct virtio_vsock_sock {
> > >  
> > >  	/* Protected by rx_lock */
> > >  	u32 fwd_cnt;
> > > +	u32 last_fwd_cnt;
> > >  	u32 rx_bytes;
> > >  	struct list_head rx_queue;
> > >  };
> > > diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
> > > index 095221f94786..a85559d4d974 100644
> > > --- a/net/vmw_vsock/virtio_transport_common.c
> > > +++ b/net/vmw_vsock/virtio_transport_common.c
> > > @@ -211,6 +211,7 @@ static void virtio_transport_dec_rx_pkt(struct virtio_vsock_sock *vvs,
> > >  void virtio_transport_inc_tx_pkt(struct virtio_vsock_sock *vvs, struct virtio_vsock_pkt *pkt)
> > >  {
> > >  	spin_lock_bh(&vvs->tx_lock);
> > > +	vvs->last_fwd_cnt = vvs->fwd_cnt;
> > >  	pkt->hdr.fwd_cnt = cpu_to_le32(vvs->fwd_cnt);
> > >  	pkt->hdr.buf_alloc = cpu_to_le32(vvs->buf_alloc);
> > >  	spin_unlock_bh(&vvs->tx_lock);
> > > @@ -261,6 +262,7 @@ virtio_transport_stream_do_dequeue(struct vsock_sock *vsk,
> > >  	struct virtio_vsock_sock *vvs = vsk->trans;
> > >  	struct virtio_vsock_pkt *pkt;
> > >  	size_t bytes, total = 0;
> > > +	u32 free_space;
> > >  	int err = -EFAULT;
> > >  
> > >  	spin_lock_bh(&vvs->rx_lock);
> > > @@ -291,11 +293,19 @@ virtio_transport_stream_do_dequeue(struct vsock_sock *vsk,
> > >  			virtio_transport_free_pkt(pkt);
> > >  		}
> > >  	}
> > > +
> > > +	free_space = vvs->buf_alloc - (vvs->fwd_cnt - vvs->last_fwd_cnt);
> > > +
> > >  	spin_unlock_bh(&vvs->rx_lock);
> > >  
> > > -	/* Send a credit pkt to peer */
> > > -	virtio_transport_send_credit_update(vsk, VIRTIO_VSOCK_TYPE_STREAM,
> > > -					    NULL);
> > > +	/* We send a credit update only when the space available seen
> > > +	 * by the transmitter is less than VIRTIO_VSOCK_MAX_PKT_BUF_SIZE
> > 
> > This is just repeating what code does though.
> > Please include the *reason* for the condition.
> > E.g. here's a better comment:
> > 
> > 	/* To reduce number of credit update messages,
> > 	 * don't update credits as long as lots of space is available.
> > 	 * Note: the limit chosen here is arbitrary. Setting the limit
> > 	 * too high causes extra messages. Too low causes transmitter
> > 	 * stalls. As stalls are in theory more expensive than extra
> > 	 * messages, we set the limit to a high value. TODO: experiment
> > 	 * with different values.
> > 	 */
> > 
> 
> Yes, it is better, sorry for that. I'll try to avoid unnecessary comments,
> explaining the reason for certain changes.
> 
> Since this patch is already queued in net-next, should I send another
> patch to fix the comment?
> 
> Thanks,
> Stefano

I just sent a patch like that, pls ack it.

-- 
MST

^ permalink raw reply

* Re: [PATCH] Clock-independent TCP ISN generation
From: Eric Dumazet @ 2019-09-03  7:41 UTC (permalink / raw)
  To: Cyrus Sh, davem; +Cc: shiraz.saleem, jgg, arnd, netdev, sirus
In-Reply-To: <70c41960-6d14-3943-31ca-75598ad3d2d7@gmail.com>



On 9/3/19 7:06 AM, Cyrus Sh wrote:
> This patch addresses the privacy issue of TCP ISN generation in Linux
> kernel. Currently an adversary can deanonymize a user behind an anonymity
> network by inducing a load pattern on the target machine and correlating
> its clock skew with the pattern. Since the kernel adds a clock-based
> counter to generated ISNs, the adversary can observe SYN packets with
> similar IP and port numbers to find out the clock skew of the target
> machine and this can help them identify the user.  To resolve this problem
> I have changed the related function to generate the initial sequence
> numbers randomly and independent from the cpu clock. This feature is
> controlled by a new sysctl option called "tcp_random_isn" which I've added
> to the kernel. Once enabled the initial sequence numbers are guaranteed to
> be generated independently from each other and from the hardware clock of
> the machine. If the option is off, ISNs are generated as before.  To get
> more information about this patch and its effectiveness you can refer to my
> post here:
> https://bitguard.wordpress.com/?p=982


<quote>
Firstly it’s unlikely that this happens at all,
</quote>

Sorry this happens all the time.
Some people use very disturbing setups really, and they are not trying to be malicious.

Clock skew seems quite secondary. Some firewall rules should prevent this kind of attacks ?

> and to see a discussion about the issue you can read this:
> https://trac.torproject.org/projects/tor/ticket/16659

Four years old discussion. Does not seem urgent matter :/

> 
> Signed-off-by: Sirus Shahini <sirus.shahini@gmail.com>
> ---
>  include/net/tcp.h           |  1 +
>  include/uapi/linux/sysctl.h |  1 +
>  kernel/sysctl_binary.c      |  1 +
>  net/core/secure_seq.c       | 24 +++++++++++++++++++++++-
>  net/ipv4/sysctl_net_ipv4.c  |  7 +++++++
>  net/ipv4/tcp_input.c        |  1 +
>  6 files changed, 34 insertions(+), 1 deletion(-)
> 
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index 81e8ade..4ad1bbf 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -241,6 +241,7 @@ void tcp_time_wait(struct sock *sk, int state, int timeo);
>  
>  /* sysctl variables for tcp */
>  extern int sysctl_tcp_max_orphans;
> +extern int sysctl_tcp_random_isn;

All TCP sysctls are per netns these days. (Look in include/net/netns/ipv4.h )

>  extern long sysctl_tcp_mem[3];
>  
>  #define TCP_RACK_LOSS_DETECTION  0x1 /* Use RACK to detect losses */
> diff --git a/include/uapi/linux/sysctl.h b/include/uapi/linux/sysctl.h
> index 87aa2a6..ba8927e 100644
> --- a/include/uapi/linux/sysctl.h
> +++ b/include/uapi/linux/sysctl.h
> @@ -426,6 +426,7 @@ enum
>  	NET_TCP_ALLOWED_CONG_CONTROL=123,
>  	NET_TCP_MAX_SSTHRESH=124,
>  	NET_TCP_FRTO_RESPONSE=125,
> +	NET_IPV4_TCP_RANDOM_ISN=126,

Nope, we do not add new sysctls there anymore.

Everybody should now have /proc files to tune the values.

>  };
>  
>  enum {
> diff --git a/kernel/sysctl_binary.c b/kernel/sysctl_binary.c
> index 73c1320..0faf7d4 100644
> --- a/kernel/sysctl_binary.c
> +++ b/kernel/sysctl_binary.c
> @@ -332,6 +332,7 @@ static const struct bin_table bin_net_ipv4_netfilter_table[] = {
>  };
>  
>  static const struct bin_table bin_net_ipv4_table[] = {
> +	{CTL_INT,   NET_IPV4_TCP_RANDOM_ISN     "tcp_random_isn"}

Same remark. We no longer add stuff there.

>  	{CTL_INT,	NET_IPV4_FORWARD,			"ip_forward" },
>  
>  	{ CTL_DIR,	NET_IPV4_CONF,		"conf",		bin_net_ipv4_conf_table },
> diff --git a/net/core/secure_seq.c b/net/core/secure_seq.c
> index 7b6b1d2..b644bbe 100644
> --- a/net/core/secure_seq.c
> +++ b/net/core/secure_seq.c
> @@ -22,6 +22,7 @@
>  
>  static siphash_key_t net_secret __read_mostly;
>  static siphash_key_t ts_secret __read_mostly;
> +static siphash_key_t last_secret = {{0,0}} ;
>  
>  static __always_inline void net_secret_init(void)
>  {
> @@ -134,8 +135,29 @@ u32 secure_tcp_seq(__be32 saddr, __be32 daddr,
>  		   __be16 sport, __be16 dport)
>  {
>  	u32 hash;
> -
> +	u32 temp;
> +	
>  	net_secret_init();
> +	
> +	if (sysctl_tcp_random_isn){
> +		if (!last_secret.key[0] && !last_secret.key[1]){
> +			memcpy(&last_secret,&net_secret,sizeof(last_secret));	
> +					
> +		}else{

Check your patch using scripts/checkpatch.pl

All these missing spaces should be spotted.

> +			temp = *((u32*)&(net_secret.key[0]));
> +			temp >>= 8;
> +			last_secret.key[0]+=temp;
> +			temp = *((u32*)&(net_secret.key[1]));
> +			temp >>= 8;
> +			last_secret.key[1]+=temp;

Why not simply use an official random generator, instead of these convoluted
and not SMP safe attempts ?

Check drivers/char/random.c for officially maintained generators.

> +		}
> +		hash = siphash_3u32((__force u32)saddr, (__force u32)daddr,
> +			        (__force u32)sport << 16 | (__force u32)dport,
> +			        &last_secret);
> +		return hash;
> +	}
> +	
> +	
>  	hash = siphash_3u32((__force u32)saddr, (__force u32)daddr,
>  			    (__force u32)sport << 16 | (__force u32)dport,
>  			    &net_secret);

^ permalink raw reply

* Re: [PATCH v4 1/5] vsock/virtio: limit the memory used per-socket
From: Stefano Garzarella @ 2019-09-03  7:45 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: netdev, linux-kernel, Stefan Hajnoczi, David S. Miller,
	virtualization, Jason Wang, kvm
In-Reply-To: <20190903003823-mutt-send-email-mst@kernel.org>

On Tue, Sep 03, 2019 at 12:39:19AM -0400, Michael S. Tsirkin wrote:
> On Mon, Sep 02, 2019 at 11:57:23AM +0200, Stefano Garzarella wrote:
> > > 
> > > Assuming we miss nothing and buffers < 4K are broken,
> > > I think we need to add this to the spec, possibly with
> > > a feature bit to relax the requirement that all buffers
> > > are at least 4k in size.
> > > 
> > 
> > Okay, should I send a proposal to virtio-dev@lists.oasis-open.org?
> 
> How about we also fix the bug for now?

This series unintentionally fix the bug because we are introducing a way
to split the packet depending on the buffer size ([PATCH 4/5] vhost/vsock:
split packets to send using multiple buffers) and we removed the limit
to 4K buffers ([PATCH 5/5] vsock/virtio: change the maximum packet size
allowed).

I discovered that there was a bug while we discussed memory accounting.

Do you think it's enough while we introduce the feature bit in the spec?

Thanks,
Stefano

^ permalink raw reply

* Re: [PATCH] Clock-independent TCP ISN generation
From: kbuild test robot @ 2019-09-03  7:45 UTC (permalink / raw)
  To: Cyrus Sh; +Cc: kbuild-all, davem, shiraz.saleem, jgg, arnd, netdev, sirus
In-Reply-To: <70c41960-6d14-3943-31ca-75598ad3d2d7@gmail.com>

[-- Attachment #1: Type: text/plain, Size: 7349 bytes --]

Hi Cyrus,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on linus/master]
[cannot apply to v5.3-rc7 next-20190902]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Cyrus-Sh/Clock-independent-TCP-ISN-generation/20190903-131719
config: x86_64-randconfig-e003-201935 (attached as .config)
compiler: gcc-7 (Debian 7.4.0-11) 7.4.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64 

If you fix the issue, kindly add following tag
Reported-by: kbuild test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

>> kernel/sysctl_binary.c:335:42: error: expected '}' before string constant
     {CTL_INT,   NET_IPV4_TCP_RANDOM_ISN     "tcp_random_isn"}
                                             ^~~~~~~~~~~~~~~~
>> kernel/sysctl_binary.c:336:2: error: expected '}' before '{' token
     {CTL_INT, NET_IPV4_FORWARD,   "ip_forward" },
     ^

vim +335 kernel/sysctl_binary.c

   333	
   334	static const struct bin_table bin_net_ipv4_table[] = {
 > 335		{CTL_INT,   NET_IPV4_TCP_RANDOM_ISN     "tcp_random_isn"}
 > 336		{CTL_INT,	NET_IPV4_FORWARD,			"ip_forward" },
   337	
   338		{ CTL_DIR,	NET_IPV4_CONF,		"conf",		bin_net_ipv4_conf_table },
   339		{ CTL_DIR,	NET_IPV4_NEIGH,		"neigh",	bin_net_neigh_table },
   340		{ CTL_DIR,	NET_IPV4_ROUTE,		"route",	bin_net_ipv4_route_table },
   341		/* NET_IPV4_FIB_HASH unused */
   342		{ CTL_DIR,	NET_IPV4_NETFILTER,	"netfilter",	bin_net_ipv4_netfilter_table },
   343	
   344		{ CTL_INT,	NET_IPV4_TCP_TIMESTAMPS,		"tcp_timestamps" },
   345		{ CTL_INT,	NET_IPV4_TCP_WINDOW_SCALING,		"tcp_window_scaling" },
   346		{ CTL_INT,	NET_IPV4_TCP_SACK,			"tcp_sack" },
   347		{ CTL_INT,	NET_IPV4_TCP_RETRANS_COLLAPSE,		"tcp_retrans_collapse" },
   348		{ CTL_INT,	NET_IPV4_DEFAULT_TTL,			"ip_default_ttl" },
   349		/* NET_IPV4_AUTOCONFIG unused */
   350		{ CTL_INT,	NET_IPV4_NO_PMTU_DISC,			"ip_no_pmtu_disc" },
   351		{ CTL_INT,	NET_IPV4_NONLOCAL_BIND,			"ip_nonlocal_bind" },
   352		{ CTL_INT,	NET_IPV4_TCP_SYN_RETRIES,		"tcp_syn_retries" },
   353		{ CTL_INT,	NET_TCP_SYNACK_RETRIES,			"tcp_synack_retries" },
   354		{ CTL_INT,	NET_TCP_MAX_ORPHANS,			"tcp_max_orphans" },
   355		{ CTL_INT,	NET_TCP_MAX_TW_BUCKETS,			"tcp_max_tw_buckets" },
   356		{ CTL_INT,	NET_IPV4_DYNADDR,			"ip_dynaddr" },
   357		{ CTL_INT,	NET_IPV4_TCP_KEEPALIVE_TIME,		"tcp_keepalive_time" },
   358		{ CTL_INT,	NET_IPV4_TCP_KEEPALIVE_PROBES,		"tcp_keepalive_probes" },
   359		{ CTL_INT,	NET_IPV4_TCP_KEEPALIVE_INTVL,		"tcp_keepalive_intvl" },
   360		{ CTL_INT,	NET_IPV4_TCP_RETRIES1,			"tcp_retries1" },
   361		{ CTL_INT,	NET_IPV4_TCP_RETRIES2,			"tcp_retries2" },
   362		{ CTL_INT,	NET_IPV4_TCP_FIN_TIMEOUT,		"tcp_fin_timeout" },
   363		{ CTL_INT,	NET_TCP_SYNCOOKIES,			"tcp_syncookies" },
   364		{ CTL_INT,	NET_TCP_TW_RECYCLE,			"tcp_tw_recycle" },
   365		{ CTL_INT,	NET_TCP_ABORT_ON_OVERFLOW,		"tcp_abort_on_overflow" },
   366		{ CTL_INT,	NET_TCP_STDURG,				"tcp_stdurg" },
   367		{ CTL_INT,	NET_TCP_RFC1337,			"tcp_rfc1337" },
   368		{ CTL_INT,	NET_TCP_MAX_SYN_BACKLOG,		"tcp_max_syn_backlog" },
   369		{ CTL_INT,	NET_IPV4_LOCAL_PORT_RANGE,		"ip_local_port_range" },
   370		{ CTL_INT,	NET_IPV4_IGMP_MAX_MEMBERSHIPS,		"igmp_max_memberships" },
   371		{ CTL_INT,	NET_IPV4_IGMP_MAX_MSF,			"igmp_max_msf" },
   372		{ CTL_INT,	NET_IPV4_INET_PEER_THRESHOLD,		"inet_peer_threshold" },
   373		{ CTL_INT,	NET_IPV4_INET_PEER_MINTTL,		"inet_peer_minttl" },
   374		{ CTL_INT,	NET_IPV4_INET_PEER_MAXTTL,		"inet_peer_maxttl" },
   375		{ CTL_INT,	NET_IPV4_INET_PEER_GC_MINTIME,		"inet_peer_gc_mintime" },
   376		{ CTL_INT,	NET_IPV4_INET_PEER_GC_MAXTIME,		"inet_peer_gc_maxtime" },
   377		{ CTL_INT,	NET_TCP_ORPHAN_RETRIES,			"tcp_orphan_retries" },
   378		{ CTL_INT,	NET_TCP_FACK,				"tcp_fack" },
   379		{ CTL_INT,	NET_TCP_REORDERING,			"tcp_reordering" },
   380		{ CTL_INT,	NET_TCP_ECN,				"tcp_ecn" },
   381		{ CTL_INT,	NET_TCP_DSACK,				"tcp_dsack" },
   382		{ CTL_INT,	NET_TCP_MEM,				"tcp_mem" },
   383		{ CTL_INT,	NET_TCP_WMEM,				"tcp_wmem" },
   384		{ CTL_INT,	NET_TCP_RMEM,				"tcp_rmem" },
   385		{ CTL_INT,	NET_TCP_APP_WIN,			"tcp_app_win" },
   386		{ CTL_INT,	NET_TCP_ADV_WIN_SCALE,			"tcp_adv_win_scale" },
   387		{ CTL_INT,	NET_TCP_TW_REUSE,			"tcp_tw_reuse" },
   388		{ CTL_INT,	NET_TCP_FRTO,				"tcp_frto" },
   389		{ CTL_INT,	NET_TCP_FRTO_RESPONSE,			"tcp_frto_response" },
   390		{ CTL_INT,	NET_TCP_LOW_LATENCY,			"tcp_low_latency" },
   391		{ CTL_INT,	NET_TCP_NO_METRICS_SAVE,		"tcp_no_metrics_save" },
   392		{ CTL_INT,	NET_TCP_MODERATE_RCVBUF,		"tcp_moderate_rcvbuf" },
   393		{ CTL_INT,	NET_TCP_TSO_WIN_DIVISOR,		"tcp_tso_win_divisor" },
   394		{ CTL_STR,	NET_TCP_CONG_CONTROL,			"tcp_congestion_control" },
   395		{ CTL_INT,	NET_TCP_MTU_PROBING,			"tcp_mtu_probing" },
   396		{ CTL_INT,	NET_TCP_BASE_MSS,			"tcp_base_mss" },
   397		{ CTL_INT,	NET_IPV4_TCP_WORKAROUND_SIGNED_WINDOWS,	"tcp_workaround_signed_windows" },
   398		{ CTL_INT,	NET_TCP_SLOW_START_AFTER_IDLE,		"tcp_slow_start_after_idle" },
   399		{ CTL_INT,	NET_CIPSOV4_CACHE_ENABLE,		"cipso_cache_enable" },
   400		{ CTL_INT,	NET_CIPSOV4_CACHE_BUCKET_SIZE,		"cipso_cache_bucket_size" },
   401		{ CTL_INT,	NET_CIPSOV4_RBM_OPTFMT,			"cipso_rbm_optfmt" },
   402		{ CTL_INT,	NET_CIPSOV4_RBM_STRICTVALID,		"cipso_rbm_strictvalid" },
   403		/* NET_TCP_AVAIL_CONG_CONTROL "tcp_available_congestion_control" no longer used */
   404		{ CTL_STR,	NET_TCP_ALLOWED_CONG_CONTROL,		"tcp_allowed_congestion_control" },
   405		{ CTL_INT,	NET_TCP_MAX_SSTHRESH,			"tcp_max_ssthresh" },
   406	
   407		{ CTL_INT,	NET_IPV4_ICMP_ECHO_IGNORE_ALL,		"icmp_echo_ignore_all" },
   408		{ CTL_INT,	NET_IPV4_ICMP_ECHO_IGNORE_BROADCASTS,	"icmp_echo_ignore_broadcasts" },
   409		{ CTL_INT,	NET_IPV4_ICMP_IGNORE_BOGUS_ERROR_RESPONSES,	"icmp_ignore_bogus_error_responses" },
   410		{ CTL_INT,	NET_IPV4_ICMP_ERRORS_USE_INBOUND_IFADDR,	"icmp_errors_use_inbound_ifaddr" },
   411		{ CTL_INT,	NET_IPV4_ICMP_RATELIMIT,		"icmp_ratelimit" },
   412		{ CTL_INT,	NET_IPV4_ICMP_RATEMASK,			"icmp_ratemask" },
   413	
   414		{ CTL_INT,	NET_IPV4_IPFRAG_HIGH_THRESH,		"ipfrag_high_thresh" },
   415		{ CTL_INT,	NET_IPV4_IPFRAG_LOW_THRESH,		"ipfrag_low_thresh" },
   416		{ CTL_INT,	NET_IPV4_IPFRAG_TIME,			"ipfrag_time" },
   417	
   418		{ CTL_INT,	NET_IPV4_IPFRAG_SECRET_INTERVAL,	"ipfrag_secret_interval" },
   419		/* NET_IPV4_IPFRAG_MAX_DIST "ipfrag_max_dist" no longer used */
   420	
   421		{ CTL_INT,	2088 /* NET_IPQ_QMAX */,		"ip_queue_maxlen" },
   422	
   423		/* NET_TCP_DEFAULT_WIN_SCALE unused */
   424		/* NET_TCP_BIC_BETA unused */
   425		/* NET_IPV4_TCP_MAX_KA_PROBES unused */
   426		/* NET_IPV4_IP_MASQ_DEBUG unused */
   427		/* NET_TCP_SYN_TAILDROP unused */
   428		/* NET_IPV4_ICMP_SOURCEQUENCH_RATE unused */
   429		/* NET_IPV4_ICMP_DESTUNREACH_RATE unused */
   430		/* NET_IPV4_ICMP_TIMEEXCEED_RATE unused */
   431		/* NET_IPV4_ICMP_PARAMPROB_RATE unused */
   432		/* NET_IPV4_ICMP_ECHOREPLY_RATE unused */
   433		/* NET_IPV4_ALWAYS_DEFRAG unused */
   434		{}
   435	};
   436	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 24350 bytes --]

^ permalink raw reply

* Re: [PATCH net-next] vsock/virtio: a better comment on credit update
From: Stefano Garzarella @ 2019-09-03  7:47 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-kernel, Stefan Hajnoczi, David S. Miller, kvm,
	virtualization, netdev
In-Reply-To: <20190903073748.25214-1-mst@redhat.com>

On Tue, Sep 03, 2019 at 03:38:16AM -0400, Michael S. Tsirkin wrote:
> The comment we have is just repeating what the code does.
> Include the *reason* for the condition instead.
> 
> Cc: Stefano Garzarella <sgarzare@redhat.com>
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> ---
>  net/vmw_vsock/virtio_transport_common.c | 9 +++++++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
> index 94cc0fa3e848..5bb70c692b1e 100644
> --- a/net/vmw_vsock/virtio_transport_common.c
> +++ b/net/vmw_vsock/virtio_transport_common.c
> @@ -307,8 +307,13 @@ virtio_transport_stream_do_dequeue(struct vsock_sock *vsk,
>  
>  	spin_unlock_bh(&vvs->rx_lock);
>  
> -	/* We send a credit update only when the space available seen
> -	 * by the transmitter is less than VIRTIO_VSOCK_MAX_PKT_BUF_SIZE
> +	/* To reduce the number of credit update messages,
> +	 * don't update credits as long as lots of space is available.
> +	 * Note: the limit chosen here is arbitrary. Setting the limit
> +	 * too high causes extra messages. Too low causes transmitter
> +	 * stalls. As stalls are in theory more expensive than extra
> +	 * messages, we set the limit to a high value. TODO: experiment
> +	 * with different values.
>  	 */
>  	if (free_space < VIRTIO_VSOCK_MAX_PKT_BUF_SIZE) {
>  		virtio_transport_send_credit_update(vsk,

Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>



^ permalink raw reply

* Re: [PATCH v4 1/5] vsock/virtio: limit the memory used per-socket
From: Michael S. Tsirkin @ 2019-09-03  7:52 UTC (permalink / raw)
  To: Stefano Garzarella
  Cc: netdev, linux-kernel, Stefan Hajnoczi, David S. Miller,
	virtualization, Jason Wang, kvm
In-Reply-To: <20190903074554.mq6spyivftuodahy@steredhat>

On Tue, Sep 03, 2019 at 09:45:54AM +0200, Stefano Garzarella wrote:
> On Tue, Sep 03, 2019 at 12:39:19AM -0400, Michael S. Tsirkin wrote:
> > On Mon, Sep 02, 2019 at 11:57:23AM +0200, Stefano Garzarella wrote:
> > > > 
> > > > Assuming we miss nothing and buffers < 4K are broken,
> > > > I think we need to add this to the spec, possibly with
> > > > a feature bit to relax the requirement that all buffers
> > > > are at least 4k in size.
> > > > 
> > > 
> > > Okay, should I send a proposal to virtio-dev@lists.oasis-open.org?
> > 
> > How about we also fix the bug for now?
> 
> This series unintentionally fix the bug because we are introducing a way
> to split the packet depending on the buffer size ([PATCH 4/5] vhost/vsock:
> split packets to send using multiple buffers) and we removed the limit
> to 4K buffers ([PATCH 5/5] vsock/virtio: change the maximum packet size
> allowed).
> 
> I discovered that there was a bug while we discussed memory accounting.
> 
> Do you think it's enough while we introduce the feature bit in the spec?
> 
> Thanks,
> Stefano

Well locking is also broken (patch 3/5).  It seems that 3/5 and 4/5 work
by themselves, right?  So how about we ask Dave to send these to stable?
Also, how about 1/5? Also needed for stable?


-- 
MST

^ permalink raw reply

* [PATCH] hostap: remove set but not used variable 'copied' in prism2_io_debug_proc_read
From: zhong jiang @ 2019-09-03  7:57 UTC (permalink / raw)
  To: kvalo, davem; +Cc: zhongjiang, linux-wireless, netdev, linux-kernel

Obviously, variable 'copied' is initialized to zero. But it is not used.
hence just remove it.

Signed-off-by: zhong jiang <zhongjiang@huawei.com>
---
 drivers/net/wireless/intersil/hostap/hostap_proc.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/net/wireless/intersil/hostap/hostap_proc.c b/drivers/net/wireless/intersil/hostap/hostap_proc.c
index 703d74c..6151d8d 100644
--- a/drivers/net/wireless/intersil/hostap/hostap_proc.c
+++ b/drivers/net/wireless/intersil/hostap/hostap_proc.c
@@ -234,7 +234,7 @@ static int prism2_io_debug_proc_read(char *page, char **start, off_t off,
 {
 	local_info_t *local = (local_info_t *) data;
 	int head = local->io_debug_head;
-	int start_bytes, left, copy, copied;
+	int start_bytes, left, copy;
 
 	if (off + count > PRISM2_IO_DEBUG_SIZE * 4) {
 		*eof = 1;
@@ -243,7 +243,6 @@ static int prism2_io_debug_proc_read(char *page, char **start, off_t off,
 		count = PRISM2_IO_DEBUG_SIZE * 4 - off;
 	}
 
-	copied = 0;
 	start_bytes = (PRISM2_IO_DEBUG_SIZE - head) * 4;
 	left = count;
 
-- 
1.7.12.4


^ permalink raw reply related

* Re: [PATCH v4 1/5] vsock/virtio: limit the memory used per-socket
From: Stefano Garzarella @ 2019-09-03  8:00 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: netdev, linux-kernel, Stefan Hajnoczi, David S. Miller,
	virtualization, Jason Wang, kvm
In-Reply-To: <20190903034747-mutt-send-email-mst@kernel.org>

On Tue, Sep 03, 2019 at 03:52:24AM -0400, Michael S. Tsirkin wrote:
> On Tue, Sep 03, 2019 at 09:45:54AM +0200, Stefano Garzarella wrote:
> > On Tue, Sep 03, 2019 at 12:39:19AM -0400, Michael S. Tsirkin wrote:
> > > On Mon, Sep 02, 2019 at 11:57:23AM +0200, Stefano Garzarella wrote:
> > > > > 
> > > > > Assuming we miss nothing and buffers < 4K are broken,
> > > > > I think we need to add this to the spec, possibly with
> > > > > a feature bit to relax the requirement that all buffers
> > > > > are at least 4k in size.
> > > > > 
> > > > 
> > > > Okay, should I send a proposal to virtio-dev@lists.oasis-open.org?
> > > 
> > > How about we also fix the bug for now?
> > 
> > This series unintentionally fix the bug because we are introducing a way
> > to split the packet depending on the buffer size ([PATCH 4/5] vhost/vsock:
> > split packets to send using multiple buffers) and we removed the limit
> > to 4K buffers ([PATCH 5/5] vsock/virtio: change the maximum packet size
> > allowed).
> > 
> > I discovered that there was a bug while we discussed memory accounting.
> > 
> > Do you think it's enough while we introduce the feature bit in the spec?
> > 
> > Thanks,
> > Stefano
> 
> Well locking is also broken (patch 3/5).  It seems that 3/5 and 4/5 work
> by themselves, right?  So how about we ask Dave to send these to stable?

Yes, they work by themselves and I agree that should be send to stable.

> Also, how about 1/5? Also needed for stable?

I think so, without this patch if we flood the guest with 1-byte packets,
we can consume ~ 1 GB of guest memory per-socket.

Thanks,
Stefano

^ permalink raw reply

* Re: [PATCH] hostap: remove set but not used variable 'copied' in prism2_io_debug_proc_read
From: zhong jiang @ 2019-09-03  8:01 UTC (permalink / raw)
  To: zhong jiang; +Cc: kvalo, davem, linux-wireless, netdev, linux-kernel
In-Reply-To: <1567497430-22539-1-git-send-email-zhongjiang@huawei.com>

Please ignore the patch.  Because  the hostap_proc.c is marked as 'obsolete'.

Thanks,
zhong jiang
On 2019/9/3 15:57, zhong jiang wrote:
> Obviously, variable 'copied' is initialized to zero. But it is not used.
> hence just remove it.
>
> Signed-off-by: zhong jiang <zhongjiang@huawei.com>
> ---
>  drivers/net/wireless/intersil/hostap/hostap_proc.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/drivers/net/wireless/intersil/hostap/hostap_proc.c b/drivers/net/wireless/intersil/hostap/hostap_proc.c
> index 703d74c..6151d8d 100644
> --- a/drivers/net/wireless/intersil/hostap/hostap_proc.c
> +++ b/drivers/net/wireless/intersil/hostap/hostap_proc.c
> @@ -234,7 +234,7 @@ static int prism2_io_debug_proc_read(char *page, char **start, off_t off,
>  {
>  	local_info_t *local = (local_info_t *) data;
>  	int head = local->io_debug_head;
> -	int start_bytes, left, copy, copied;
> +	int start_bytes, left, copy;
>  
>  	if (off + count > PRISM2_IO_DEBUG_SIZE * 4) {
>  		*eof = 1;
> @@ -243,7 +243,6 @@ static int prism2_io_debug_proc_read(char *page, char **start, off_t off,
>  		count = PRISM2_IO_DEBUG_SIZE * 4 - off;
>  	}
>  
> -	copied = 0;
>  	start_bytes = (PRISM2_IO_DEBUG_SIZE - head) * 4;
>  	left = count;
>  



^ permalink raw reply

* request for stable (was Re: [PATCH v4 0/5] vsock/virtio: optimizations to increase the throughput)
From: Michael S. Tsirkin @ 2019-09-03  8:02 UTC (permalink / raw)
  To: Stefano Garzarella
  Cc: netdev, linux-kernel, Stefan Hajnoczi, David S. Miller,
	virtualization, Jason Wang, kvm
In-Reply-To: <20190717113030.163499-1-sgarzare@redhat.com>

Patches 1,3 and 4 are needed for stable.
Dave, could you queue them there please?

On Wed, Jul 17, 2019 at 01:30:25PM +0200, Stefano Garzarella wrote:
> This series tries to increase the throughput of virtio-vsock with slight
> changes.
> While I was testing the v2 of this series I discovered an huge use of memory,
> so I added patch 1 to mitigate this issue. I put it in this series in order
> to better track the performance trends.
> 
> v4:
> - rebased all patches on current master (conflicts is Patch 4)
> - Patch 1: added Stefan's R-b
> - Patch 3: removed lock when buf_alloc is written [David];
>            moved this patch after "vsock/virtio: reduce credit update messages"
>            to make it clearer
> - Patch 4: vhost_exceeds_weight() is recently introduced, so I've solved some
>            conflicts
> 
> v3: https://patchwork.kernel.org/cover/10970145
> 
> v2: https://patchwork.kernel.org/cover/10938743
> 
> v1: https://patchwork.kernel.org/cover/10885431
> 
> Below are the benchmarks step by step. I used iperf3 [1] modified with VSOCK
> support. As Micheal suggested in the v1, I booted host and guest with 'nosmap'.
> 
> A brief description of patches:
> - Patches 1:   limit the memory usage with an extra copy for small packets
> - Patches 2+3: reduce the number of credit update messages sent to the
>                transmitter
> - Patches 4+5: allow the host to split packets on multiple buffers and use
>                VIRTIO_VSOCK_MAX_PKT_BUF_SIZE as the max packet size allowed
> 
>                     host -> guest [Gbps]
> pkt_size before opt   p 1     p 2+3    p 4+5
> 
> 32         0.032     0.030    0.048    0.051
> 64         0.061     0.059    0.108    0.117
> 128        0.122     0.112    0.227    0.234
> 256        0.244     0.241    0.418    0.415
> 512        0.459     0.466    0.847    0.865
> 1K         0.927     0.919    1.657    1.641
> 2K         1.884     1.813    3.262    3.269
> 4K         3.378     3.326    6.044    6.195
> 8K         5.637     5.676   10.141   11.287
> 16K        8.250     8.402   15.976   16.736
> 32K       13.327    13.204   19.013   20.515
> 64K       21.241    21.341   20.973   21.879
> 128K      21.851    22.354   21.816   23.203
> 256K      21.408    21.693   21.846   24.088
> 512K      21.600    21.899   21.921   24.106
> 
>                     guest -> host [Gbps]
> pkt_size before opt   p 1     p 2+3    p 4+5
> 
> 32         0.045     0.046    0.057    0.057
> 64         0.089     0.091    0.103    0.104
> 128        0.170     0.179    0.192    0.200
> 256        0.364     0.351    0.361    0.379
> 512        0.709     0.699    0.731    0.790
> 1K         1.399     1.407    1.395    1.427
> 2K         2.670     2.684    2.745    2.835
> 4K         5.171     5.199    5.305    5.451
> 8K         8.442     8.500   10.083    9.941
> 16K       12.305    12.259   13.519   15.385
> 32K       11.418    11.150   11.988   24.680
> 64K       10.778    10.659   11.589   35.273
> 128K      10.421    10.339   10.939   40.338
> 256K      10.300     9.719   10.508   36.562
> 512K       9.833     9.808   10.612   35.979
> 
> As Stefan suggested in the v1, I measured also the efficiency in this way:
>     efficiency = Mbps / (%CPU_Host + %CPU_Guest)
> 
> The '%CPU_Guest' is taken inside the VM. I know that it is not the best way,
> but it's provided for free from iperf3 and could be an indication.
> 
>         host -> guest efficiency [Mbps / (%CPU_Host + %CPU_Guest)]
> pkt_size before opt   p 1     p 2+3    p 4+5
> 
> 32         0.35      0.45     0.79     1.02
> 64         0.56      0.80     1.41     1.54
> 128        1.11      1.52     3.03     3.12
> 256        2.20      2.16     5.44     5.58
> 512        4.17      4.18    10.96    11.46
> 1K         8.30      8.26    20.99    20.89
> 2K        16.82     16.31    39.76    39.73
> 4K        30.89     30.79    74.07    75.73
> 8K        53.74     54.49   124.24   148.91
> 16K       80.68     83.63   200.21   232.79
> 32K      132.27    132.52   260.81   357.07
> 64K      229.82    230.40   300.19   444.18
> 128K     332.60    329.78   331.51   492.28
> 256K     331.06    337.22   339.59   511.59
> 512K     335.58    328.50   331.56   504.56
> 
>         guest -> host efficiency [Mbps / (%CPU_Host + %CPU_Guest)]
> pkt_size before opt   p 1     p 2+3    p 4+5
> 
> 32         0.43      0.43     0.53     0.56
> 64         0.85      0.86     1.04     1.10
> 128        1.63      1.71     2.07     2.13
> 256        3.48      3.35     4.02     4.22
> 512        6.80      6.67     7.97     8.63
> 1K        13.32     13.31    15.72    15.94
> 2K        25.79     25.92    30.84    30.98
> 4K        50.37     50.48    58.79    59.69
> 8K        95.90     96.15   107.04   110.33
> 16K      145.80    145.43   143.97   174.70
> 32K      147.06    144.74   146.02   282.48
> 64K      145.25    143.99   141.62   406.40
> 128K     149.34    146.96   147.49   489.34
> 256K     156.35    149.81   152.21   536.37
> 512K     151.65    150.74   151.52   519.93
> 
> [1] https://github.com/stefano-garzarella/iperf/
> 
> Stefano Garzarella (5):
>   vsock/virtio: limit the memory used per-socket
>   vsock/virtio: reduce credit update messages
>   vsock/virtio: fix locking in virtio_transport_inc_tx_pkt()
>   vhost/vsock: split packets to send using multiple buffers
>   vsock/virtio: change the maximum packet size allowed
> 
>  drivers/vhost/vsock.c                   | 68 ++++++++++++-----
>  include/linux/virtio_vsock.h            |  4 +-
>  net/vmw_vsock/virtio_transport.c        |  1 +
>  net/vmw_vsock/virtio_transport_common.c | 99 ++++++++++++++++++++-----
>  4 files changed, 134 insertions(+), 38 deletions(-)
> 
> -- 
> 2.20.1

^ permalink raw reply

* Re: [PATCH v3 1/2] net: core: Notify on changes to dev->promiscuity.
From: Allan W. Nielsen @ 2019-09-03  8:14 UTC (permalink / raw)
  To: Ido Schimmel
  Cc: Jiri Pirko, David Miller, andrew, horatiu.vultur,
	alexandre.belloni, UNGLinuxDriver, ivecera, f.fainelli, netdev,
	linux-kernel
In-Reply-To: <20190903061324.GA6149@splinter>

Hi Ido,

The 09/03/2019 09:13, Ido Schimmel wrote:
> On Mon, Sep 02, 2019 at 07:42:31PM +0200, Allan W. Nielsen wrote:
> > I have been reading through this thread several times and I still do not get it.
> I kept thinking about this and I want to make sure that I correctly
> understand the end result.
So do I, and regardless of this being merged or not, I'm really happy that we
have the discussion as there are clearly a need for synchronizing the meaning of
promiscuity.

One of the reasons why we started working on this patch was the comments we got
when requesting comments on the "[PATCH] net: bridge: Allow bridge to join
multicast groups" where we want to be able to setup mac-entries for L2 multicast
MAC addresses. Here it is important for us to be able to control which ports a
static configured multicast address needs to be forwarded to (including the
CPU). When working on this patch it was pointed out that the interfaces are in
promisc mode anyway (if member of a bridge, and if flooding and learning are
enabled), and it therefore this optimization should not matter anyway.

I found that this was a fair comment, and it has been on the list of things we
wanted to "fix" for a long time.

The optimization does matter to us, as we have until now worked around the issue
by not implementing promisc mode.

When debugging, it is very useful for us to be able to see all the traffic RXed
on the interface, and to us it did make a lot of sense to "just" make promisc
mode work like we was used to.

But, this is for debugging, it is not the most important patch we have on the
backlog, and I would therefore prefer finding a solution which we are all happy
with.

> With these patches applied I assume I will see the following traffic
> when running tcpdump on one of the netdevs exposed by the ocelot driver:
> 
> - Ingress: All
> - Egress: Only locally generated traffic and traffic forwarded by the
>   kernel from interfaces not belonging to the ocelot driver
> 
> The above means I will not see any offloaded traffic transmitted by the
> port. Is that correct?
Correct - but maybe we should change this.

In Ocelot and in LANxxxx (the part we are working on now), we can come pretty
close. We can get the offloaded TX traffic to the CPU, but it will not be
re-written (it will look like the ingress frame, which is not always the same as
the egress frame, vlan tags an others may be re-written).

In some of our chips we can actually do this (not Ocelot, and not the LANxxxx
part we are working on now) after the frame as been re-written.

> I see that the driver is setting 'offload_fwd_mark' for any traffic trapped
> from bridged ports, which means the bridge will drop it before it traverses
> the packet taps on egress.
Correct.

> Large parts of the discussion revolve around the fact that switch ports
> are not any different than other ports. Dave wrote "Please stop
> portraying switches as special in this regard" and Andrew wrote "[The
> user] just wants tcpdump to work like on their desktop."
And we are trying to get as close to this as practical possible, knowing that it
may not be exactly the same.

> But if anything, this discussion proves that switch ports are special in
> this regard and that tcpdump will not work like on the desktop.
I think it can come really close. Some drivers may be able to fix the TX issue
you point out, others may not.

> Beside the fact that I don't agree (but gave up) with the new
> interpretation of promisc mode, I wonder if we're not asking for trouble
> with this patchset. Users will see all offloaded traffic on ingress, but
> none of it on egress. This is in contrast to the sever/desktop, where
> Linux is much more dominant in comparison to switches (let alone hw
> accelerated ones) and where all the traffic is visible through tcpdump.
> I can already see myself having to explain this over and over again to
> confused users.
> 
> Now, I understand that showing egress traffic is inherently difficult.
> It means one of two things:
> 
> 1. We allow packets to be forwarded by both the software and the
> hardware
> 2. We trap all ingressing traffic from all the ports
If the HW cannot copy the egress traffic to the CPU (which our HW cannot), then
you need to do both. All ingress traffic needs to go to the CPU, you need to
make all the forwarding decisions in the CPU, to figure out what traffic happens
to go to the port you want to monitor.

I really doubt this will work in real life. Too much traffic, and HW may make
different forwarding decision that the SW (tc rules in HW but not in SW), which
means that it will not be good for debugging anyway.

> Both options can have devastating effects on the network and therefore
> should not be triggered by a supposedly innocent invocation of tcpdump.
Agree.

> I again wonder if it would not be wiser to solve this by introducing two
> new flags to tcpdump for ingress/egress (similar to -Q in/out) capturing
> of offloaded traffic. The capturing of egress offloaded traffic can be
> documented with the appropriate warnings.
Not sure I agree, but I will try to spend some more time considering it.

In the mean while, what TC action was it that Jiri suggestion we should use? The
trap action is no good, and it prevents the forwarding in silicon, and I'm not
aware of a "COPY-TO-CPU" action.

> Anyway, I don't want to hold you up, I merely want to make sure that the
> above (assuming it's correct) is considered before the patches are
> applied.
Sounds good, and thanks for all the time spend on reviewing and asking the
critical questions.

/Allan


^ permalink raw reply

* [PATCH net] ipmr: remove cache_resolve_queue_len
From: Hangbin Liu @ 2019-09-03  8:43 UTC (permalink / raw)
  To: netdev
  Cc: Phil Karn, Sukumar Gopalakrishnan, David S . Miller,
	Alexey Kuznetsov, Hideaki YOSHIFUJI, Hangbin Liu

This is a re-post of previous patch wrote by David Miller[1].

Phil Karn reported[2] that on busy networks with lots of unresolved
multicast routing entries, the creation of new multicast group routes
can be extremely slow and unreliable.

The reason is we hard-coded multicast route entries with unresolved source
addresses(cache_resolve_queue_len) to 10. If some multicast route never
resolves and the unresolved source addresses increased, there will
be no ability to create new multicast route cache.

To resolve this issue, we need either add a sysctl entry to make the
cache_resolve_queue_len configurable, or just remove cache_resolve_queue_len
directly, as we already have the socket receive queue limits of mrouted
socket, pointed by David.

From my side, I'd perfer to remove the cache_resolve_queue_len instead
of creating two more(IPv4 and IPv6 version) sysctl entry.

[1] https://lkml.org/lkml/2018/7/22/11
[2] https://lkml.org/lkml/2018/7/21/343

Reported-by: Phil Karn <karn@ka9q.net>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
---
 include/linux/mroute_base.h |  2 --
 net/ipv4/ipmr.c             | 27 ++++++++++++++++++---------
 net/ipv6/ip6mr.c            | 10 +++-------
 3 files changed, 21 insertions(+), 18 deletions(-)

diff --git a/include/linux/mroute_base.h b/include/linux/mroute_base.h
index 34de06b426ef..50fb89533029 100644
--- a/include/linux/mroute_base.h
+++ b/include/linux/mroute_base.h
@@ -235,7 +235,6 @@ struct mr_table_ops {
  * @mfc_hash: Hash table of all resolved routes for easy lookup
  * @mfc_cache_list: list of resovled routes for possible traversal
  * @maxvif: Identifier of highest value vif currently in use
- * @cache_resolve_queue_len: current size of unresolved queue
  * @mroute_do_assert: Whether to inform userspace on wrong ingress
  * @mroute_do_pim: Whether to receive IGMP PIMv1
  * @mroute_reg_vif_num: PIM-device vif index
@@ -252,7 +251,6 @@ struct mr_table {
 	struct rhltable		mfc_hash;
 	struct list_head	mfc_cache_list;
 	int			maxvif;
-	atomic_t		cache_resolve_queue_len;
 	bool			mroute_do_assert;
 	bool			mroute_do_pim;
 	bool			mroute_do_wrvifwhole;
diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
index c07bc82cbbe9..6c5278b45afb 100644
--- a/net/ipv4/ipmr.c
+++ b/net/ipv4/ipmr.c
@@ -744,8 +744,6 @@ static void ipmr_destroy_unres(struct mr_table *mrt, struct mfc_cache *c)
 	struct sk_buff *skb;
 	struct nlmsgerr *e;
 
-	atomic_dec(&mrt->cache_resolve_queue_len);
-
 	while ((skb = skb_dequeue(&c->_c.mfc_un.unres.unresolved))) {
 		if (ip_hdr(skb)->version == 0) {
 			struct nlmsghdr *nlh = skb_pull(skb,
@@ -1133,9 +1131,11 @@ static int ipmr_cache_unresolved(struct mr_table *mrt, vifi_t vifi,
 	}
 
 	if (!found) {
+		bool was_empty;
+
 		/* Create a new entry if allowable */
-		if (atomic_read(&mrt->cache_resolve_queue_len) >= 10 ||
-		    (c = ipmr_cache_alloc_unres()) == NULL) {
+		c = ipmr_cache_alloc_unres();
+		if (!c) {
 			spin_unlock_bh(&mfc_unres_lock);
 
 			kfree_skb(skb);
@@ -1161,11 +1161,11 @@ static int ipmr_cache_unresolved(struct mr_table *mrt, vifi_t vifi,
 			return err;
 		}
 
-		atomic_inc(&mrt->cache_resolve_queue_len);
+		was_empty = list_empty(&mrt->mfc_unres_queue);
 		list_add(&c->_c.list, &mrt->mfc_unres_queue);
 		mroute_netlink_event(mrt, c, RTM_NEWROUTE);
 
-		if (atomic_read(&mrt->cache_resolve_queue_len) == 1)
+		if (was_empty)
 			mod_timer(&mrt->ipmr_expire_timer,
 				  c->_c.mfc_un.unres.expires);
 	}
@@ -1272,7 +1272,6 @@ static int ipmr_mfc_add(struct net *net, struct mr_table *mrt,
 		if (uc->mfc_origin == c->mfc_origin &&
 		    uc->mfc_mcastgrp == c->mfc_mcastgrp) {
 			list_del(&_uc->list);
-			atomic_dec(&mrt->cache_resolve_queue_len);
 			found = true;
 			break;
 		}
@@ -1328,7 +1327,7 @@ static void mroute_clean_tables(struct mr_table *mrt, int flags)
 	}
 
 	if (flags & MRT_FLUSH_MFC) {
-		if (atomic_read(&mrt->cache_resolve_queue_len) != 0) {
+		if (!list_empty(&mrt->mfc_unres_queue)) {
 			spin_lock_bh(&mfc_unres_lock);
 			list_for_each_entry_safe(c, tmp, &mrt->mfc_unres_queue, list) {
 				list_del(&c->list);
@@ -2750,9 +2749,19 @@ static int ipmr_rtm_route(struct sk_buff *skb, struct nlmsghdr *nlh,
 		return ipmr_mfc_delete(tbl, &mfcc, parent);
 }
 
+static int queue_count(struct mr_table *mrt)
+{
+	struct list_head *pos;
+	int count = 0;
+
+	list_for_each(pos, &mrt->mfc_unres_queue)
+		count++;
+	return count;
+}
+
 static bool ipmr_fill_table(struct mr_table *mrt, struct sk_buff *skb)
 {
-	u32 queue_len = atomic_read(&mrt->cache_resolve_queue_len);
+	u32 queue_len = queue_count(mrt);
 
 	if (nla_put_u32(skb, IPMRA_TABLE_ID, mrt->id) ||
 	    nla_put_u32(skb, IPMRA_TABLE_CACHE_RES_QUEUE_LEN, queue_len) ||
diff --git a/net/ipv6/ip6mr.c b/net/ipv6/ip6mr.c
index e80d36c5073d..d02f0f903559 100644
--- a/net/ipv6/ip6mr.c
+++ b/net/ipv6/ip6mr.c
@@ -768,8 +768,6 @@ static void ip6mr_destroy_unres(struct mr_table *mrt, struct mfc6_cache *c)
 	struct net *net = read_pnet(&mrt->net);
 	struct sk_buff *skb;
 
-	atomic_dec(&mrt->cache_resolve_queue_len);
-
 	while ((skb = skb_dequeue(&c->_c.mfc_un.unres.unresolved)) != NULL) {
 		if (ipv6_hdr(skb)->version == 0) {
 			struct nlmsghdr *nlh = skb_pull(skb,
@@ -1148,8 +1146,8 @@ static int ip6mr_cache_unresolved(struct mr_table *mrt, mifi_t mifi,
 		 *	Create a new entry if allowable
 		 */
 
-		if (atomic_read(&mrt->cache_resolve_queue_len) >= 10 ||
-		    (c = ip6mr_cache_alloc_unres()) == NULL) {
+		c = ip6mr_cache_alloc_unres();
+		if (!c) {
 			spin_unlock_bh(&mfc_unres_lock);
 
 			kfree_skb(skb);
@@ -1176,7 +1174,6 @@ static int ip6mr_cache_unresolved(struct mr_table *mrt, mifi_t mifi,
 			return err;
 		}
 
-		atomic_inc(&mrt->cache_resolve_queue_len);
 		list_add(&c->_c.list, &mrt->mfc_unres_queue);
 		mr6_netlink_event(mrt, c, RTM_NEWROUTE);
 
@@ -1468,7 +1465,6 @@ static int ip6mr_mfc_add(struct net *net, struct mr_table *mrt,
 		if (ipv6_addr_equal(&uc->mf6c_origin, &c->mf6c_origin) &&
 		    ipv6_addr_equal(&uc->mf6c_mcastgrp, &c->mf6c_mcastgrp)) {
 			list_del(&_uc->list);
-			atomic_dec(&mrt->cache_resolve_queue_len);
 			found = true;
 			break;
 		}
@@ -1526,7 +1522,7 @@ static void mroute_clean_tables(struct mr_table *mrt, int flags)
 	}
 
 	if (flags & MRT6_FLUSH_MFC) {
-		if (atomic_read(&mrt->cache_resolve_queue_len) != 0) {
+		if (!list_empty(&mrt->mfc_unres_queue)) {
 			spin_lock_bh(&mfc_unres_lock);
 			list_for_each_entry_safe(c, tmp, &mrt->mfc_unres_queue, list) {
 				list_del(&c->list);
-- 
2.19.2


^ permalink raw reply related

* Re: [PATCH 4.14] tcp: fix tcp_rtx_queue_tail in case of empty retransmit queue
From: maowenan @ 2019-09-03  8:55 UTC (permalink / raw)
  To: Tim Froidcoeur, eric.dumazet
  Cc: David Miller, cpaasch@apple.com, jonathan.lemon@gmail.com,
	stable@vger.kernel.org, gregkh@linuxfoundation.org,
	matthieu.baerts@tessares.net, aprout@ll.mit.edu,
	edumazet@google.com, jtl@netflix.com,
	linux-kernel@vger.kernel.org, mkubecek@suse.cz,
	ncardwell@google.com, sashal@kernel.org, ycheng@google.com,
	netdev@vger.kernel.org
In-Reply-To: <CAOj+RUvXMaoVKzSeDab4oTn3p=-BJtuhgqwKDCUuhCQWHO7bgQ@mail.gmail.com>


On 2019/9/3 14:58, Tim Froidcoeur wrote:
> Hi,
> 
> I also tried to reproduce this in a targeted way, and run into the
> same difficulty as you: satisfying the first condition “
> (sk->sk_wmem_queued >> 1) > limit “.
> I will not have bandwidth the coming days to try and reproduce it in
> this way. Maybe simply forcing a very small send buffer using sysctl
> net.ipv4.tcp_wmem might even do the trick?
> 
> I suspect that the bug is easier to trigger with the MPTCP patch like
> I did originally, due to the way this patch manages the tcp subflow
> buffers (it can temporarily overfill the buffers, satisfying that
> first condition more often).
> 
> another thing, the stacktrace you shared before seems caused by
> another issue (corrupted socket?), it will not be solved by the patch
> we submitted.

The trace shows zero window probe message can be BUG_ON in skb_queue_prev,
this is reproduced on our platform with syzkaller. It can be resolved by
your fix patch.
The thing I need to think is why the first condition can be satisfied?
Eric, Do you have any comments to reproduce it as the first condition
is hard to be true?
(sk->sk_wmem_queued >> 1) > limit

> 
> kind regards,
> 
> Tim
> 
> 
> On Tue, Sep 3, 2019 at 5:22 AM maowenan <maowenan@huawei.com> wrote:
>>
>> Hi Tim,
>>
>>
>>
>> I try to reproduce it with packetdrill or user application, but I can’t.
>>
>> The first condition “ (sk->sk_wmem_queued >> 1) > limit “    can’t be satisfied,
>>
>> This condition is to avoid tiny SO_SNDBUF values set by user.
>>
>> It also adds the some room due to the fact that tcp_sendmsg()
>>
>> and tcp_sendpage() might overshoot sk_wmem_queued by about one full
>>
>> TSO skb (64KB size).
>>
>>
>>
>>         limit = sk->sk_sndbuf + 2 * SKB_TRUESIZE(GSO_MAX_SIZE);
>>
>>         if (unlikely((sk->sk_wmem_queued >> 1) > limit &&
>>
>>                      skb != tcp_rtx_queue_head(sk) &&
>>
>>                      skb != tcp_rtx_queue_tail(sk))) {
>>
>>                 NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPWQUEUETOOBIG);
>>
>>                 return -ENOMEM;
>>
>>         }
>>
>>
>>
>> Can you try to reproduce it with packetdrill or C socket application?
>>
>>
> 
> 
> 


^ permalink raw reply

* RE: Proposal: r8152 firmware patching framework
From: Bambi Yeh @ 2019-09-03  9:01 UTC (permalink / raw)
  To: Hayes Wang, Amber Chen, Prashant Malani
  Cc: David Miller, netdev@vger.kernel.org, Ryankao, Jackc, Albertk,
	marcochen@google.com, nic_swsd, Grant Grundler
In-Reply-To: <0835B3720019904CB8F7AA43166CEEB2F18DA7A9@RTITMBSVM03.realtek.com.tw>

Hi Prashant:

We will try to implement your requests.
Based on our experience, upstream reviewer often reject our modification if they have any concern.
Do you think you can talk to them about this idea and see if they will accept it or not?
Or if you can help on this after we submit it?

Also, Hayes is now updating our current upstream driver and it goes back and forth for a while.
So we will need some time to finish it and the target schedule to have your request done is in the end of this month.

Thank you very much.

Best Regards,
Bambi Yeh

-----Original Message-----
From: Hayes Wang <hayeswang@realtek.com> 
Sent: Monday, September 2, 2019 2:31 PM
To: Amber Chen <amber.chen@realtek.com>; Prashant Malani <pmalani@chromium.org>
Cc: David Miller <davem@davemloft.net>; netdev@vger.kernel.org; Bambi Yeh <bambi.yeh@realtek.com>; Ryankao <ryankao@realtek.com>; Jackc <jackc@realtek.com>; Albertk <albertk@realtek.com>; marcochen@google.com; nic_swsd <nic_swsd@realtek.com>; Grant Grundler <grundler@chromium.org>
Subject: RE: Proposal: r8152 firmware patching framework

Prashant Malani <pmalani@chromium.org> 
> >
> > (Adding a few more Realtek folks)
> >
> > Friendly ping. Any thoughts / feedback, Realtek folks (and others) ?
> >
> >> On Thu, Aug 29, 2019 at 11:40 AM Prashant Malani
> <pmalani@chromium.org> wrote:
> >>
> >> Hi,
> >>
> >> The r8152 driver source code distributed by Realtek (on
> >> www.realtek.com) contains firmware patches. This involves binary 
> >> byte-arrays being written byte/word-wise to the hardware memory
> >> Example: grundler@chromium.org (cc-ed) has an experimental patch
> which
> >> includes the firmware patching code which was distributed with the 
> >> Realtek source :
> >>
> https://chromium-review.googlesource.com/c/chromiumos/third_party/kern
> el
> /+/1417953
> >>
> >> It would be nice to have a way to incorporate these firmware fixes 
> >> into the upstream code. Since having indecipherable byte-arrays is 
> >> not possible upstream, I propose the following:
> >> - We use the assistance of Realtek to come up with a format which 
> >> the firmware patch files can follow (this can be documented in the 
> >> comments).
> >>       - A real simple format could look like this:
> >>               +
> >>
> <section1><size_in_bytes><address1><data1><address2><data2>...<address
> N
> ><dataN><section2>...
> >>                + The driver would be able to understand how to 
> >> parse each section (e.g is each data entry a byte or a word?)
> >>
> >> - We use request_firmware() to load the firmware, parse it and 
> >> write the data to the relevant registers.

I plan to finish the patches which I am going to submit, first. Then, I could focus on this. However, I don't think I would start this quickly. There are many preparations and they would take me a lot of time.

Best Regards,
Hayes



^ permalink raw reply

* Re: [PATCH v5 04/17] MIPS: PCI: refactor ioc3 special handling
From: Paul Burton @ 2019-09-03  9:10 UTC (permalink / raw)
  To: Thomas Bogendoerfer
  Cc: Jonathan Corbet, Ralf Baechle, Paul Burton, James Hogan,
	Dmitry Torokhov, Lee Jones, David S. Miller, Srinivas Kandagatla,
	Alessandro Zummo, Alexandre Belloni, Greg Kroah-Hartman,
	Jiri Slaby, Evgeniy Polyakov, linux-doc@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-mips@vger.kernel.org,
	linux-input@vger.kernel.org, netdev@vger.kernel.org,
	linux-rtc@vger.kernel.org, linux-serial@vger.kernel.org,
	linux-mips@vger.kernel.org
In-Reply-To: <20190819163144.3478-5-tbogendoerfer@suse.de>

Hello,

Thomas Bogendoerfer wrote:
> Refactored code to only have one ioc3 special handling for read
> access and one for write access.

Applied to mips-next.

> commit 813cafc4109c
> https://git.kernel.org/mips/c/813cafc4109c
> 
> Signed-off-by: Thomas Bogendoerfer <tbogendoerfer@suse.de>
> Signed-off-by: Paul Burton <paul.burton@mips.com>

Thanks,
    Paul

[ This message was auto-generated; if you believe anything is incorrect
  then please email paul.burton@mips.com to report it. ]

^ permalink raw reply

* Re: [PATCH net] ipmr: remove cache_resolve_queue_len
From: Nikolay Aleksandrov @ 2019-09-03  9:15 UTC (permalink / raw)
  To: Hangbin Liu, netdev
  Cc: Phil Karn, Sukumar Gopalakrishnan, David S . Miller,
	Alexey Kuznetsov, Hideaki YOSHIFUJI
In-Reply-To: <20190903084359.13310-1-liuhangbin@gmail.com>

On 03/09/2019 11:43, Hangbin Liu wrote:
> This is a re-post of previous patch wrote by David Miller[1].
> 
> Phil Karn reported[2] that on busy networks with lots of unresolved
> multicast routing entries, the creation of new multicast group routes
> can be extremely slow and unreliable.
> 
> The reason is we hard-coded multicast route entries with unresolved source
> addresses(cache_resolve_queue_len) to 10. If some multicast route never
> resolves and the unresolved source addresses increased, there will
> be no ability to create new multicast route cache.
> 
> To resolve this issue, we need either add a sysctl entry to make the
> cache_resolve_queue_len configurable, or just remove cache_resolve_queue_len
> directly, as we already have the socket receive queue limits of mrouted
> socket, pointed by David.
> 
> From my side, I'd perfer to remove the cache_resolve_queue_len instead
> of creating two more(IPv4 and IPv6 version) sysctl entry.
> 
> [1] https://lkml.org/lkml/2018/7/22/11
> [2] https://lkml.org/lkml/2018/7/21/343
> 
> Reported-by: Phil Karn <karn@ka9q.net>
> Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
> ---
>  include/linux/mroute_base.h |  2 --
>  net/ipv4/ipmr.c             | 27 ++++++++++++++++++---------
>  net/ipv6/ip6mr.c            | 10 +++-------
>  3 files changed, 21 insertions(+), 18 deletions(-)
> 

Hi,
IMO this is definitely net-next material. A few more comments below.

Note that hosts will automatically have this limit lifted to about 270
entries with current defaults, some might be surprised if they have
higher limits set and could be left with queues allowing for thousands
of entries in a linked list.

> diff --git a/include/linux/mroute_base.h b/include/linux/mroute_base.h
> index 34de06b426ef..50fb89533029 100644
> --- a/include/linux/mroute_base.h
> +++ b/include/linux/mroute_base.h
> @@ -235,7 +235,6 @@ struct mr_table_ops {
>   * @mfc_hash: Hash table of all resolved routes for easy lookup
>   * @mfc_cache_list: list of resovled routes for possible traversal
>   * @maxvif: Identifier of highest value vif currently in use
> - * @cache_resolve_queue_len: current size of unresolved queue
>   * @mroute_do_assert: Whether to inform userspace on wrong ingress
>   * @mroute_do_pim: Whether to receive IGMP PIMv1
>   * @mroute_reg_vif_num: PIM-device vif index
> @@ -252,7 +251,6 @@ struct mr_table {
>  	struct rhltable		mfc_hash;
>  	struct list_head	mfc_cache_list;
>  	int			maxvif;
> -	atomic_t		cache_resolve_queue_len;
>  	bool			mroute_do_assert;
>  	bool			mroute_do_pim;
>  	bool			mroute_do_wrvifwhole;
> diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
> index c07bc82cbbe9..6c5278b45afb 100644
> --- a/net/ipv4/ipmr.c
> +++ b/net/ipv4/ipmr.c
> @@ -744,8 +744,6 @@ static void ipmr_destroy_unres(struct mr_table *mrt, struct mfc_cache *c)
>  	struct sk_buff *skb;
>  	struct nlmsgerr *e;
>  
> -	atomic_dec(&mrt->cache_resolve_queue_len);
> -
>  	while ((skb = skb_dequeue(&c->_c.mfc_un.unres.unresolved))) {
>  		if (ip_hdr(skb)->version == 0) {
>  			struct nlmsghdr *nlh = skb_pull(skb,
> @@ -1133,9 +1131,11 @@ static int ipmr_cache_unresolved(struct mr_table *mrt, vifi_t vifi,
>  	}
>  
>  	if (!found) {
> +		bool was_empty;
> +
>  		/* Create a new entry if allowable */
> -		if (atomic_read(&mrt->cache_resolve_queue_len) >= 10 ||
> -		    (c = ipmr_cache_alloc_unres()) == NULL) {
> +		c = ipmr_cache_alloc_unres();
> +		if (!c) {
>  			spin_unlock_bh(&mfc_unres_lock);
>  
>  			kfree_skb(skb);
> @@ -1161,11 +1161,11 @@ static int ipmr_cache_unresolved(struct mr_table *mrt, vifi_t vifi,
>  			return err;
>  		}
>  
> -		atomic_inc(&mrt->cache_resolve_queue_len);
> +		was_empty = list_empty(&mrt->mfc_unres_queue);
>  		list_add(&c->_c.list, &mrt->mfc_unres_queue);
>  		mroute_netlink_event(mrt, c, RTM_NEWROUTE);
>  
> -		if (atomic_read(&mrt->cache_resolve_queue_len) == 1)
> +		if (was_empty)
>  			mod_timer(&mrt->ipmr_expire_timer,
>  				  c->_c.mfc_un.unres.expires);
>  	}
> @@ -1272,7 +1272,6 @@ static int ipmr_mfc_add(struct net *net, struct mr_table *mrt,
>  		if (uc->mfc_origin == c->mfc_origin &&
>  		    uc->mfc_mcastgrp == c->mfc_mcastgrp) {
>  			list_del(&_uc->list);
> -			atomic_dec(&mrt->cache_resolve_queue_len);
>  			found = true;
>  			break;
>  		}
> @@ -1328,7 +1327,7 @@ static void mroute_clean_tables(struct mr_table *mrt, int flags)
>  	}
>  
>  	if (flags & MRT_FLUSH_MFC) {
> -		if (atomic_read(&mrt->cache_resolve_queue_len) != 0) {
> +		if (!list_empty(&mrt->mfc_unres_queue)) {
>  			spin_lock_bh(&mfc_unres_lock);
>  			list_for_each_entry_safe(c, tmp, &mrt->mfc_unres_queue, list) {
>  				list_del(&c->list);
> @@ -2750,9 +2749,19 @@ static int ipmr_rtm_route(struct sk_buff *skb, struct nlmsghdr *nlh,
>  		return ipmr_mfc_delete(tbl, &mfcc, parent);
>  }
>  
> +static int queue_count(struct mr_table *mrt)
> +{
> +	struct list_head *pos;
> +	int count = 0;
> +
> +	list_for_each(pos, &mrt->mfc_unres_queue)
> +		count++;
> +	return count;
> +}

I don't think we hold the mfc_unres_lock here while walking
the unresolved list below in ipmr_fill_table().

> +
>  static bool ipmr_fill_table(struct mr_table *mrt, struct sk_buff *skb)
>  {
> -	u32 queue_len = atomic_read(&mrt->cache_resolve_queue_len);
> +	u32 queue_len = queue_count(mrt);
>  
>  	if (nla_put_u32(skb, IPMRA_TABLE_ID, mrt->id) ||
>  	    nla_put_u32(skb, IPMRA_TABLE_CACHE_RES_QUEUE_LEN, queue_len) ||
> diff --git a/net/ipv6/ip6mr.c b/net/ipv6/ip6mr.c
> index e80d36c5073d..d02f0f903559 100644
> --- a/net/ipv6/ip6mr.c
> +++ b/net/ipv6/ip6mr.c
> @@ -768,8 +768,6 @@ static void ip6mr_destroy_unres(struct mr_table *mrt, struct mfc6_cache *c)
>  	struct net *net = read_pnet(&mrt->net);
>  	struct sk_buff *skb;
>  
> -	atomic_dec(&mrt->cache_resolve_queue_len);
> -
>  	while ((skb = skb_dequeue(&c->_c.mfc_un.unres.unresolved)) != NULL) {
>  		if (ipv6_hdr(skb)->version == 0) {
>  			struct nlmsghdr *nlh = skb_pull(skb,
> @@ -1148,8 +1146,8 @@ static int ip6mr_cache_unresolved(struct mr_table *mrt, mifi_t mifi,
>  		 *	Create a new entry if allowable
>  		 */
>  
> -		if (atomic_read(&mrt->cache_resolve_queue_len) >= 10 ||
> -		    (c = ip6mr_cache_alloc_unres()) == NULL) {
> +		c = ip6mr_cache_alloc_unres();
> +		if (!c) {
>  			spin_unlock_bh(&mfc_unres_lock);
>  
>  			kfree_skb(skb);
> @@ -1176,7 +1174,6 @@ static int ip6mr_cache_unresolved(struct mr_table *mrt, mifi_t mifi,
>  			return err;
>  		}
>  
> -		atomic_inc(&mrt->cache_resolve_queue_len);
>  		list_add(&c->_c.list, &mrt->mfc_unres_queue);
>  		mr6_netlink_event(mrt, c, RTM_NEWROUTE);
>  
> @@ -1468,7 +1465,6 @@ static int ip6mr_mfc_add(struct net *net, struct mr_table *mrt,
>  		if (ipv6_addr_equal(&uc->mf6c_origin, &c->mf6c_origin) &&
>  		    ipv6_addr_equal(&uc->mf6c_mcastgrp, &c->mf6c_mcastgrp)) {
>  			list_del(&_uc->list);
> -			atomic_dec(&mrt->cache_resolve_queue_len);
>  			found = true;
>  			break;
>  		}
> @@ -1526,7 +1522,7 @@ static void mroute_clean_tables(struct mr_table *mrt, int flags)
>  	}
>  
>  	if (flags & MRT6_FLUSH_MFC) {
> -		if (atomic_read(&mrt->cache_resolve_queue_len) != 0) {
> +		if (!list_empty(&mrt->mfc_unres_queue)) {
>  			spin_lock_bh(&mfc_unres_lock);
>  			list_for_each_entry_safe(c, tmp, &mrt->mfc_unres_queue, list) {
>  				list_del(&c->list);
> 


^ permalink raw reply

* RE: [PATCH v4] rtw88: pci: Move a mass of jobs in hw IRQ to soft IRQ
From: Tony Chuang @ 2019-09-03  9:18 UTC (permalink / raw)
  To: Jian-Hong Pan, Kalle Valo
  Cc: David S . Miller, linux-wireless@vger.kernel.org,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux@endlessm.com
In-Reply-To: <CAPpJ_efAxQN4pRdpVmT5Pdkp-6Y-QVOQdJR4iY4A-PXZokLGtA@mail.gmail.com>

> From: Jian-Hong Pan [mailto:jian-hong@endlessm.com]
> 
> >
> > Tony Chuang <yhchuang@realtek.com> writes:
> >
> > >> From: Jian-Hong Pan
> > >> Subject: [PATCH v4] rtw88: pci: Move a mass of jobs in hw IRQ to soft
> IRQ
> > >>
> > >> There is a mass of jobs between spin lock and unlock in the hardware
> > >> IRQ which will occupy much time originally. To make system work more
> > >> efficiently, this patch moves the jobs to the soft IRQ (bottom half) to
> > >> reduce the time in hardware IRQ.
> > >>
> > >> Signed-off-by: Jian-Hong Pan <jian-hong@endlessm.com>
> > >
> > > Now it works fine with MSI interrupt enabled.
> > >
> > > But this patch is conflicting with MSI interrupt patch.
> > > Is there a better way we can make Kalle apply them more smoothly?
> > > I can rebase them and submit both if you're OK.
> 
> The rebase work is appreciated.
> 

Rebased and sent. Please check it and see if I've done anything wrong :)
https://patchwork.kernel.org/cover/11127453/

Thanks,
Yan-Hsuan

^ permalink raw reply

* Re: how to search for the best route from userspace in netlink?
From: Dave Taht @ 2019-09-03  9:29 UTC (permalink / raw)
  To: David Ahern; +Cc: Linux Kernel Network Developers
In-Reply-To: <3e8fd488-1bd1-3213-6329-6baf8935a446@gmail.com>

On Mon, Sep 2, 2019 at 8:13 PM David Ahern <dsahern@gmail.com> wrote:
>
> On 9/2/19 4:07 PM, Dave Taht wrote:
> > Windows has the "RtmGetMostSpecificDestination" call:
> > https://docs.microsoft.com/en-us/windows/win32/rras/search-for-the-best-route
> >
> > In particular, I wanted to search for the best route, AND pick up the
> > PMTU from that (if it existed)
> > for older UDP applications like dnssec[1] and newer ones like QUIC[2].
>
> RTM_GETROUTE with data for the route lookup. See iproute2 code as an
> example.

Yes. I really didn't describe my thinking very well. It's coping with
pmtu better
in the case of a more increasingly udp'd and tunneled internet. tcp
(being kernel based)
will do the probe and cache that attribute of the path, udp does not.
A udp based app with root privs could be setting it after figuring it
out,  a userspace one cannot.

for more detail from server-al sides of that philosophical debate,
please see the links I posted
originally.



-- 

Dave Täht
CTO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-831-205-9740

^ permalink raw reply

* pull-request: can-next 2019-09-03
From: Marc Kleine-Budde @ 2019-09-03  9:46 UTC (permalink / raw)
  To: netdev; +Cc: davem, kernel, linux-can


[-- Attachment #1.1: Type: text/plain, Size: 3307 bytes --]

Hello David,

this is a pull request for net-next/master consisting of 15 patches.

The first patch is by Christer Beskow, targets the kvaser_pciefd driver
and fixes the PWM generator's frequency.

The next three patches are by Dan Murphy, the tcan4x5x is updated to use
a proper interrupts/interrupt-parent DT binding to specify the devices
IRQ line. Further the unneeded wake ups of the device is removed from
the driver.

A patch by me for the mcp25xx driver removes the deprecated board file
setup example. Three patches by Andy Shevchenko simplify clock handling,
update the driver from OF to device property API and simplify the
mcp251x_can_suspend() function.

The remaining 7 patches are by me and clean up checkpatch warnings in
the generic CAN device infrastructure.

regards,
Marc

---

The following changes since commit 67538eb5c00f08d7fe27f1bb703098b17302bdc0:

  Merge branch 'mvpp2-per-cpu-buffers' (2019-09-02 12:07:46 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next.git tags/linux-can-next-for-5.4-20190903

for you to fetch changes up to 13ecee77fa810b21beaf3023e921525c55f88b04:

  can: dev: can_dev_init(): convert from printk(KERN_INFO) to pr_info (2019-09-03 10:28:13 +0200)

----------------------------------------------------------------
linux-can-next-for-5.4-20190903

----------------------------------------------------------------
Andy Shevchenko (3):
      can: mcp251x: Use devm_clk_get_optional() to get the input clock
      can: mcp251x: Make use of device property API
      can: mcp251x: Call wrapper instead of regulator_disable()

Christer Beskow (1):
      can: kvaser_pciefd: the PWM generator is running at the bus frequency of the system.

Dan Murphy (3):
      dt-bindings: can: tcan4x5x: Update binding to use interrupt property
      can: tcan4x5x: Remove data-ready gpio interrupt
      can: tcan4x5x: Remove checking the wake pin

Marc Kleine-Budde (8):
      can: mcp251x: remove deprecated board file setup example
      can: dev: convert block comments to network style comments
      can: dev: avoid long lines
      can: dev: remove unnecessary parentheses
      can: dev: remove unnecessary blank line
      can: dev: can_restart(): convert NULL pointer check
      can: dev: can_dellink(): remove return at end of void function
      can: dev: can_dev_init(): convert from printk(KERN_INFO) to pr_info

 .../devicetree/bindings/net/can/tcan4x5x.txt       |   7 +-
 drivers/net/can/dev.c                              | 131 +++++++++------------
 drivers/net/can/kvaser_pciefd.c                    |   6 +-
 drivers/net/can/m_can/tcan4x5x.c                   |  24 +---
 drivers/net/can/spi/mcp251x.c                      |  68 +++--------
 include/linux/can/dev.h                            |   3 +-
 include/linux/can/rx-offload.h                     |  13 +-
 7 files changed, 97 insertions(+), 155 deletions(-)

-- 
Pengutronix e.K.                  | Marc Kleine-Budde           |
Industrial Linux Solutions        | Phone: +49-231-2826-924     |
Vertretung West/Dortmund          | Fax:   +49-5121-206917-5555 |-
Amtsgericht Hildesheim, HRA 2686  | http://www.pengutronix.de   |












[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply

* [PATCH net] tipc: add NULL pointer check before calling kfree_rcu
From: Xin Long @ 2019-09-03  9:53 UTC (permalink / raw)
  To: network dev; +Cc: davem, Jon Maloy, Ying Xue, tipc-discussion

Unlike kfree(p), kfree_rcu(p, rcu) won't do NULL pointer check. When
tipc_nametbl_remove_publ returns NULL, the panic below happens:

   BUG: unable to handle kernel NULL pointer dereference at 0000000000000068
   RIP: 0010:__call_rcu+0x1d/0x290
   Call Trace:
    <IRQ>
    tipc_publ_notify+0xa9/0x170 [tipc]
    tipc_node_write_unlock+0x8d/0x100 [tipc]
    tipc_node_link_down+0xae/0x1d0 [tipc]
    tipc_node_check_dest+0x3ea/0x8f0 [tipc]
    ? tipc_disc_rcv+0x2c7/0x430 [tipc]
    tipc_disc_rcv+0x2c7/0x430 [tipc]
    ? tipc_rcv+0x6bb/0xf20 [tipc]
    tipc_rcv+0x6bb/0xf20 [tipc]
    ? ip_route_input_slow+0x9cf/0xb10
    tipc_udp_recv+0x195/0x1e0 [tipc]
    ? tipc_udp_is_known_peer+0x80/0x80 [tipc]
    udp_queue_rcv_skb+0x180/0x460
    udp_unicast_rcv_skb.isra.56+0x75/0x90
    __udp4_lib_rcv+0x4ce/0xb90
    ip_local_deliver_finish+0x11c/0x210
    ip_local_deliver+0x6b/0xe0
    ? ip_rcv_finish+0xa9/0x410
    ip_rcv+0x273/0x362

Fixes: 97ede29e80ee ("tipc: convert name table read-write lock to RCU")
Reported-by: Li Shuang <shuali@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
---
 net/tipc/name_distr.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/tipc/name_distr.c b/net/tipc/name_distr.c
index 44abc8e..241ed22 100644
--- a/net/tipc/name_distr.c
+++ b/net/tipc/name_distr.c
@@ -223,7 +223,8 @@ static void tipc_publ_purge(struct net *net, struct publication *publ, u32 addr)
 		       publ->key);
 	}
 
-	kfree_rcu(p, rcu);
+	if (p)
+		kfree_rcu(p, rcu);
 }
 
 /**
-- 
2.1.0


^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox