Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH net-next 6/7] net: dsa: qca8k: Replace GPL boilerplate by SPDX
From: Florian Fainelli @ 2018-05-21 15:20 UTC (permalink / raw)
  To: Michal Vokáč, netdev, michal.vokac
  Cc: linux-kernel, devicetree, vivien.didelot, andrew, mark.rutland,
	robh+dt, davem
In-Reply-To: <1526909293-56377-7-git-send-email-michal.vokac@ysoft.com>



On 05/21/2018 06:28 AM, Michal Vokáč wrote:
> Signed-off-by: Michal Vokáč <michal.vokac@ysoft.com>

Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>

I don't know if we need all people who contributed to that driver to
agree on that, this is not a license change, so it should be okay I presume?

-- 
Florian

^ permalink raw reply

* Re: [PATCH net-next 5/7] net: dsa: qca8k: Allow overwriting CPU port setting
From: Florian Fainelli @ 2018-05-21 15:20 UTC (permalink / raw)
  To: Michal Vokáč, netdev, michal.vokac
  Cc: linux-kernel, devicetree, vivien.didelot, andrew, mark.rutland,
	robh+dt, davem
In-Reply-To: <1526909293-56377-6-git-send-email-michal.vokac@ysoft.com>



On 05/21/2018 06:28 AM, Michal Vokáč wrote:
> Implement adjust_link function that allows to overwrite default CPU port
> setting using fixed-link device tree subnode.
> 
> Signed-off-by: Michal Vokáč <michal.vokac@ysoft.com>

Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
-- 
Florian

^ permalink raw reply

* Re: [PATCH net-next 4/7] net: dsa: qca8k: Force CPU port to its highest bandwidth
From: Florian Fainelli @ 2018-05-21 15:19 UTC (permalink / raw)
  To: Michal Vokáč, netdev, michal.vokac
  Cc: linux-kernel, devicetree, vivien.didelot, andrew, mark.rutland,
	robh+dt, davem
In-Reply-To: <1526909293-56377-5-git-send-email-michal.vokac@ysoft.com>



On 05/21/2018 06:28 AM, Michal Vokáč wrote:
> By default autonegotiation is enabled to configure MAC on all ports.
> For the CPU port autonegotiation can not be used so we need to set
> some sensible defaults manually.
> 
> This patch forces the default setting of the CPU port to 1000Mbps/full
> duplex which is the chip maximum capability.
> 
> Also correct size of the bit field used to configure link speed.
> 
> Signed-off-by: Michal Vokáč <michal.vokac@ysoft.com>

Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>

Likewise, would not we want to have a:

Fixes: 6b93fb46480a ("net-next: dsa: add new driver for qca8xxx family")

tag here as well?
-- 
Florian

^ permalink raw reply

* Re: [PATCH net-next 3/7] net: dsa: qca8k: Enable RXMAC when bringing up a port
From: Florian Fainelli @ 2018-05-21 15:17 UTC (permalink / raw)
  To: Michal Vokáč, netdev, michal.vokac
  Cc: linux-kernel, devicetree, vivien.didelot, andrew, mark.rutland,
	robh+dt, davem
In-Reply-To: <1526909293-56377-4-git-send-email-michal.vokac@ysoft.com>



On 05/21/2018 06:28 AM, Michal Vokáč wrote:
> When a port is brought up/down do not enable/disable only the TXMAC
> but the RXMAC as well. This is essential for the CPU port to work.
> 
> Signed-off-by: Michal Vokáč <michal.vokac@ysoft.com>

Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>

Should this have:

Fixes: 6b93fb46480a ("net-next: dsa: add new driver for qca8xxx family")?
-- 
Florian

^ permalink raw reply

* Re: [PATCH] bpf: check NULL for sk_to_full_sk()
From: Eric Dumazet @ 2018-05-21 15:17 UTC (permalink / raw)
  To: YueHaibing, ast, daniel; +Cc: linux-kernel, netdev
In-Reply-To: <20180521075558.11968-1-yuehaibing@huawei.com>



On 05/21/2018 12:55 AM, YueHaibing wrote:
> like commit df39a9f106d5 ("bpf: check NULL for sk_to_full_sk() return value"),
> we should check sk_to_full_sk return value against NULL.
> 
> Signed-off-by: YueHaibing <yuehaibing@huawei.com>
> ---
>  include/linux/bpf-cgroup.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h
> index 30d15e6..fd3fbeb 100644
> --- a/include/linux/bpf-cgroup.h
> +++ b/include/linux/bpf-cgroup.h
> @@ -91,7 +91,7 @@ int __cgroup_bpf_check_dev_permission(short dev_type, u32 major, u32 minor,
>  	int __ret = 0;							       \
>  	if (cgroup_bpf_enabled && sk && sk == skb->sk) {		       \
>  		typeof(sk) __sk = sk_to_full_sk(sk);			       \
> -		if (sk_fullsock(__sk))					       \
> +		if (__sk && sk_fullsock(__sk))				       \
>  			__ret = __cgroup_bpf_run_filter_skb(__sk, skb,	       \
>  						      BPF_CGROUP_INET_EGRESS); \
>  	}								       \
> 

Why is this needed ???

^ permalink raw reply

* Re: [net-next PATCH v2 2/4] net: Enable Tx queue selection based on Rx queues
From: Willem de Bruijn @ 2018-05-21 15:12 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Amritha Nambiar, Linux Kernel Network Developers, David S. Miller,
	Alexander Duyck, Sridhar Samudrala, Eric Dumazet,
	Hannes Frederic Sowa
In-Reply-To: <CALx6S36h=gGb1LkLuJ80DUrE=m+FhbcQ0AD94AdtEUvxJfHf=g@mail.gmail.com>

On Mon, May 21, 2018 at 10:51 AM, Tom Herbert <tom@herbertland.com> wrote:
> On Sat, May 19, 2018 at 1:27 PM, Willem de Bruijn
> <willemdebruijn.kernel@gmail.com> wrote:
>> On Sat, May 19, 2018 at 4:13 PM, Willem de Bruijn
>> <willemdebruijn.kernel@gmail.com> wrote:
>>> On Fri, May 18, 2018 at 12:03 AM, Tom Herbert <tom@herbertland.com> wrote:
>>>> On Tue, May 15, 2018 at 6:26 PM, Amritha Nambiar
>>>> <amritha.nambiar@intel.com> wrote:
>>>>> This patch adds support to pick Tx queue based on the Rx queue map
>>>>> configuration set by the admin through the sysfs attribute
>>>>> for each Tx queue. If the user configuration for receive
>>>>> queue map does not apply, then the Tx queue selection falls back
>>>>> to CPU map based selection and finally to hashing.
>>>>>
>>>>> Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
>>>>> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>>>>> ---
>>
>>>>> +static int get_xps_queue(struct net_device *dev, struct sk_buff *skb)
>>>>> +{
>>>>> +#ifdef CONFIG_XPS
>>>>> +       enum xps_map_type i = XPS_MAP_RXQS;
>>>>> +       struct xps_dev_maps *dev_maps;
>>>>> +       struct sock *sk = skb->sk;
>>>>> +       int queue_index = -1;
>>>>> +       unsigned int tci = 0;
>>>>> +
>>>>> +       if (sk && sk->sk_rx_queue_mapping <= dev->real_num_rx_queues &&
>>>>> +           dev->ifindex == sk->sk_rx_ifindex)
>>>>> +               tci = sk->sk_rx_queue_mapping;
>>>>> +
>>>>> +       rcu_read_lock();
>>>>> +       while (queue_index < 0 && i < __XPS_MAP_MAX) {
>>>>> +               if (i == XPS_MAP_CPUS)
>>>>
>>>> This while loop typifies exactly why I don't think the XPS maps should
>>>> be an array.
>>>
>>> +1
>>
>> as a matter of fact, as enabling both cpu and rxqueue map at the same
>> time makes no sense, only one map is needed at any one time. The
>> only difference is in how it is indexed. It should probably not be possible
>> to configure both at the same time. Keeping a single map probably also
>> significantly simplifies patch 1/4.
>
> Willem,
>
> I think it might makes sense to have them both. Maybe one application
> is spin polling that needs this, where others might be happy with
> normal CPU mappings as default.

Some entries in the rx_queue table have queue_pair affinity
configured, the others return -1 to fall through to the cpu
affinity table?

I guess that implies flow steering to those special purpose
queues. I wonder whether this would be used this in practice.
I does make the code more complex by having to duplicate
the map lookup logic (mostly, patch 1/4).

^ permalink raw reply

* Re: [PATCH net 0/4] Fix several issues of virtio-net mergeable XDP
From: Michael S. Tsirkin @ 2018-05-21 15:04 UTC (permalink / raw)
  To: Jason Wang; +Cc: virtualization, netdev, linux-kernel
In-Reply-To: <1526891706-18516-1-git-send-email-jasowang@redhat.com>

On Mon, May 21, 2018 at 04:35:02PM +0800, Jason Wang wrote:
> Hi:
> 
> Please review the patches that tries to fix sevreal issues of
> virtio-net mergeable XDP.
> 
> Thanks

I think we should do 3/4 differently.
The rest looks good, and probably needed on stable.

Thanks!

> Jason Wang (4):
>   virtio-net: correctly redirect linearized packet
>   virtio-net: correctly transmit XDP buff after linearizing
>   virtio-net: reset num_buf to 1 after linearizing packet
>   virito-net: fix leaking page for gso packet during mergeable XDP
> 
>  drivers/net/virtio_net.c | 21 +++++++++++----------
>  1 file changed, 11 insertions(+), 10 deletions(-)
> 
> -- 
> 2.7.4

^ permalink raw reply

* Re: [PATCH net 2/4] virtio-net: correctly transmit XDP buff after linearizing
From: Michael S. Tsirkin @ 2018-05-21 15:03 UTC (permalink / raw)
  To: Jason Wang; +Cc: virtualization, netdev, linux-kernel, John Fastabend
In-Reply-To: <1526891706-18516-3-git-send-email-jasowang@redhat.com>

On Mon, May 21, 2018 at 04:35:04PM +0800, Jason Wang wrote:
> We should not go for the error path after successfully transmitting a
> XDP buffer after linearizing. Since the error path may try to pop and
> drop next packet and increase the drop counters. Fixing this by simply
> drop the refcnt of original page and go for xmit path.
> 
> Fixes: 72979a6c3590 ("virtio_net: xdp, add slowpath case for non contiguous buffers")
> Cc: John Fastabend <john.fastabend@gmail.com>
> Signed-off-by: Jason Wang <jasowang@redhat.com>

Acked-by: Michael S. Tsirkin <mst@redhat.com>

> ---
>  drivers/net/virtio_net.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index c15d240..6260d65 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -775,7 +775,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
>  			}
>  			*xdp_xmit = true;
>  			if (unlikely(xdp_page != page))
> -				goto err_xdp;
> +				put_page(page);
>  			rcu_read_unlock();
>  			goto xdp_xmit;
>  		case XDP_REDIRECT:
> -- 
> 2.7.4

^ permalink raw reply

* Re: [PATCH net 1/4] virtio-net: correctly redirect linearized packet
From: Michael S. Tsirkin @ 2018-05-21 15:03 UTC (permalink / raw)
  To: Jason Wang; +Cc: virtualization, netdev, linux-kernel
In-Reply-To: <1526891706-18516-2-git-send-email-jasowang@redhat.com>

On Mon, May 21, 2018 at 04:35:03PM +0800, Jason Wang wrote:
> After a linearized packet was redirected by XDP, we should not go for
> the err path which will try to pop buffers for the next packet and
> increase the drop counter. Fixing this by just drop the page refcnt
> for the original page.
> 
> Fixes: 186b3c998c50 ("virtio-net: support XDP_REDIRECT")
> Reported-by: David Ahern <dsahern@gmail.com>
> Tested-by: David Ahern <dsahern@gmail.com>
> Signed-off-by: Jason Wang <jasowang@redhat.com>

Acked-by: Michael S. Tsirkin <mst@redhat.com>

> ---
>  drivers/net/virtio_net.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 770422e..c15d240 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -787,7 +787,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
>  			}
>  			*xdp_xmit = true;
>  			if (unlikely(xdp_page != page))
> -				goto err_xdp;
> +				put_page(page);
>  			rcu_read_unlock();
>  			goto xdp_xmit;
>  		default:
> -- 
> 2.7.4

^ permalink raw reply

* Re: [PATCH net 4/4] virito-net: fix leaking page for gso packet during mergeable XDP
From: Michael S. Tsirkin @ 2018-05-21 15:01 UTC (permalink / raw)
  To: Jason Wang; +Cc: netdev, John Fastabend, linux-kernel, virtualization
In-Reply-To: <1526891706-18516-5-git-send-email-jasowang@redhat.com>

On Mon, May 21, 2018 at 04:35:06PM +0800, Jason Wang wrote:
> We need to drop refcnt to xdp_page if we see a gso packet. Otherwise
> it will be leaked. Fixing this by moving the check of gso packet above
> the linearizing logic.
> 
> Cc: John Fastabend <john.fastabend@gmail.com>
> Fixes: 72979a6c3590 ("virtio_net: xdp, add slowpath case for non contiguous buffers")
> Signed-off-by: Jason Wang <jasowang@redhat.com>

typo in subject

> ---
>  drivers/net/virtio_net.c | 16 ++++++++--------
>  1 file changed, 8 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 165a922..f8db809 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -707,6 +707,14 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
>  		void *data;
>  		u32 act;
>  
> +		/* Transient failure which in theory could occur if
> +		 * in-flight packets from before XDP was enabled reach
> +		 * the receive path after XDP is loaded. In practice I
> +		 * was not able to create this condition.

BTW we should probably drop the last sentence. It says in theory, should be enough.

> +		 */
> +		if (unlikely(hdr->hdr.gso_type))
> +			goto err_xdp;
> +
>  		/* This happens when rx buffer size is underestimated
>  		 * or headroom is not enough because of the buffer
>  		 * was refilled before XDP is set. This should only
> @@ -728,14 +736,6 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
>  			xdp_page = page;
>  		}
>  
> -		/* Transient failure which in theory could occur if
> -		 * in-flight packets from before XDP was enabled reach
> -		 * the receive path after XDP is loaded. In practice I
> -		 * was not able to create this condition.
> -		 */
> -		if (unlikely(hdr->hdr.gso_type))
> -			goto err_xdp;
> -
>  		/* Allow consuming headroom but reserve enough space to push
>  		 * the descriptor on if we get an XDP_TX return code.
>  		 */
> -- 
> 2.7.4

^ permalink raw reply

* Re: [PATCH net-next] net:sched: add action inheritdsfield to skbmod
From: Jamal Hadi Salim @ 2018-05-21 15:00 UTC (permalink / raw)
  To: Fu, Qiaobin; +Cc: davem@davemloft.net, netdev@vger.kernel.org, Michel Machado
In-Reply-To: <DA5C727C-BAE1-4355-B67C-5F9C3769CA30@bu.edu>

On 21/05/18 10:42 AM, Fu, Qiaobin wrote:
> Hi Jamal,
> 
> I've tested my patch before publishing it here, and Nishanth is going to test it further with version 2 of the GKprio. I'm going to push a patch to the repository iproute2 to add support for "inheritdsfield”.
> 

Thanks. I already acked the kernel patch. It looks good on its own.

Would you consider adding one or more tdc tests as well?

cheers,
jamal

^ permalink raw reply

* Re: [PATCH net 3/4] virtio-net: reset num_buf to 1 after linearizing packet
From: Michael S. Tsirkin @ 2018-05-21 14:59 UTC (permalink / raw)
  To: Jason Wang; +Cc: virtualization, netdev, linux-kernel
In-Reply-To: <1526891706-18516-4-git-send-email-jasowang@redhat.com>

On Mon, May 21, 2018 at 04:35:05PM +0800, Jason Wang wrote:
> If we successfully linearize the packets, num_buf were set to zero
> which was wrong since we now have only 1 buffer to be used for e.g in
> the error path of receive_mergeable(). Zero num_buf will lead the code
> try to pop the buffers of next packet and drop it. Fixing this by set
> num_buf to 1 if we successfully linearize the packet.
> 
> Fixes: 4941d472bf95 ("virtio-net: do not reset during XDP set")
> Signed-off-by: Jason Wang <jasowang@redhat.com>
> ---
>  drivers/net/virtio_net.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 6260d65..165a922 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -722,6 +722,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
>  						      &len);
>  			if (!xdp_page)
>  				goto err_xdp;
> +			num_buf = 1;

So this is tweaked here for the benefit of err_skb below.
That's confusing and we won't remember to change it
if we change the error handling.

How about fixing the error path?


-        while (--num_buf) {
+        while (num_buf-- > 1) {

Seems more robust to me.


>  			offset = VIRTIO_XDP_HEADROOM;
>  		} else {
>  			xdp_page = page;
> -- 
> 2.7.4

^ permalink raw reply

* Re: [net-next PATCH v2 2/4] net: Enable Tx queue selection based on Rx queues
From: Tom Herbert @ 2018-05-21 14:51 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Amritha Nambiar, Linux Kernel Network Developers, David S. Miller,
	Alexander Duyck, Sridhar Samudrala, Eric Dumazet,
	Hannes Frederic Sowa
In-Reply-To: <CAF=yD-JghZY5NN6cHGdHeOTs8xb9KF=mQ=J2P49ojrvp+MsD8w@mail.gmail.com>

On Sat, May 19, 2018 at 1:27 PM, Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
> On Sat, May 19, 2018 at 4:13 PM, Willem de Bruijn
> <willemdebruijn.kernel@gmail.com> wrote:
>> On Fri, May 18, 2018 at 12:03 AM, Tom Herbert <tom@herbertland.com> wrote:
>>> On Tue, May 15, 2018 at 6:26 PM, Amritha Nambiar
>>> <amritha.nambiar@intel.com> wrote:
>>>> This patch adds support to pick Tx queue based on the Rx queue map
>>>> configuration set by the admin through the sysfs attribute
>>>> for each Tx queue. If the user configuration for receive
>>>> queue map does not apply, then the Tx queue selection falls back
>>>> to CPU map based selection and finally to hashing.
>>>>
>>>> Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
>>>> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>>>> ---
>
>>>> +static int get_xps_queue(struct net_device *dev, struct sk_buff *skb)
>>>> +{
>>>> +#ifdef CONFIG_XPS
>>>> +       enum xps_map_type i = XPS_MAP_RXQS;
>>>> +       struct xps_dev_maps *dev_maps;
>>>> +       struct sock *sk = skb->sk;
>>>> +       int queue_index = -1;
>>>> +       unsigned int tci = 0;
>>>> +
>>>> +       if (sk && sk->sk_rx_queue_mapping <= dev->real_num_rx_queues &&
>>>> +           dev->ifindex == sk->sk_rx_ifindex)
>>>> +               tci = sk->sk_rx_queue_mapping;
>>>> +
>>>> +       rcu_read_lock();
>>>> +       while (queue_index < 0 && i < __XPS_MAP_MAX) {
>>>> +               if (i == XPS_MAP_CPUS)
>>>
>>> This while loop typifies exactly why I don't think the XPS maps should
>>> be an array.
>>
>> +1
>
> as a matter of fact, as enabling both cpu and rxqueue map at the same
> time makes no sense, only one map is needed at any one time. The
> only difference is in how it is indexed. It should probably not be possible
> to configure both at the same time. Keeping a single map probably also
> significantly simplifies patch 1/4.

Willem,

I think it might makes sense to have them both. Maybe one application
is spin polling that needs this, where others might be happy with
normal CPU mappings as default.

Tom

^ permalink raw reply

* Re: [PATCH net-next 7/7] net: dsa: qca8k: Remove rudundant parentheses
From: Andrew Lunn @ 2018-05-21 14:48 UTC (permalink / raw)
  To: Michal Vokáč
  Cc: netdev, michal.vokac, linux-kernel, devicetree, f.fainelli,
	vivien.didelot, mark.rutland, robh+dt, davem
In-Reply-To: <1526909293-56377-8-git-send-email-michal.vokac@ysoft.com>

On Mon, May 21, 2018 at 03:28:13PM +0200, Michal Vokáč wrote:
> Fix warning reported by checkpatch.
> 
> Signed-off-by: Michal Vokáč <michal.vokac@ysoft.com>

Reviewed-by: Andrew Lunn <andrew@lunn.ch>

    Andrew

^ permalink raw reply

* Re: [PATCH net-next 6/7] net: dsa: qca8k: Replace GPL boilerplate by SPDX
From: Andrew Lunn @ 2018-05-21 14:47 UTC (permalink / raw)
  To: Michal Vokáč
  Cc: netdev, michal.vokac, linux-kernel, devicetree, f.fainelli,
	vivien.didelot, mark.rutland, robh+dt, davem
In-Reply-To: <1526909293-56377-7-git-send-email-michal.vokac@ysoft.com>

On Mon, May 21, 2018 at 03:28:12PM +0200, Michal Vokáč wrote:
> Signed-off-by: Michal Vokáč <michal.vokac@ysoft.com>

Reviewed-by: Andrew Lunn <andrew@lunn.ch>

    Andrew

^ permalink raw reply

* Re: [PATCH net-next 1/7] net: dsa: qca8k: Add QCA8334 binding documentation
From: Andrew Lunn @ 2018-05-21 14:47 UTC (permalink / raw)
  To: Michal Vokáč
  Cc: netdev, michal.vokac, linux-kernel, devicetree, f.fainelli,
	vivien.didelot, mark.rutland, robh+dt, davem
In-Reply-To: <1526909293-56377-2-git-send-email-michal.vokac@ysoft.com>

On Mon, May 21, 2018 at 03:28:07PM +0200, Michal Vokáč wrote:
> Signed-off-by: Michal Vokáč <michal.vokac@ysoft.com>

Hi Michal

It would be good to document that fixed-link can be used.

   Andrew

^ permalink raw reply

* Re: [PATCH net-next 5/7] net: dsa: qca8k: Allow overwriting CPU port setting
From: Andrew Lunn @ 2018-05-21 14:46 UTC (permalink / raw)
  To: Michal Vokáč
  Cc: netdev, michal.vokac, linux-kernel, devicetree, f.fainelli,
	vivien.didelot, mark.rutland, robh+dt, davem
In-Reply-To: <1526909293-56377-6-git-send-email-michal.vokac@ysoft.com>

On Mon, May 21, 2018 at 03:28:11PM +0200, Michal Vokáč wrote:
> Implement adjust_link function that allows to overwrite default CPU port
> setting using fixed-link device tree subnode.
> 
> Signed-off-by: Michal Vokáč <michal.vokac@ysoft.com>

Reviewed-by: Andrew Lunn <andrew@lunn.ch>

    Andrew

^ permalink raw reply

* Re: KASAN: use-after-free Read in vhost_chr_write_iter
From: Michael S. Tsirkin @ 2018-05-21 14:42 UTC (permalink / raw)
  To: Jason Wang
  Cc: DaeRyong Jeong, kvm, virtualization, netdev, linux-kernel,
	byoungyoung, kt0755, bammanag
In-Reply-To: <fb27c1fd-5172-252a-cb8f-b53927a26d06@redhat.com>

On Mon, May 21, 2018 at 10:38:10AM +0800, Jason Wang wrote:
> 
> 
> On 2018年05月18日 17:24, Jason Wang wrote:
> > 
> > 
> > On 2018年05月17日 21:45, DaeRyong Jeong wrote:
> > > We report the crash: KASAN: use-after-free Read in vhost_chr_write_iter
> > > 
> > > This crash has been found in v4.17-rc1 using RaceFuzzer (a modified
> > > version of Syzkaller), which we describe more at the end of this
> > > report. Our analysis shows that the race occurs when invoking two
> > > syscalls concurrently, write$vnet and ioctl$VHOST_RESET_OWNER.
> > > 
> > > 
> > > Analysis:
> > > We think the concurrent execution of vhost_process_iotlb_msg() and
> > > vhost_dev_cleanup() causes the crash.
> > > Both of functions can run concurrently (please see call sequence below),
> > > and possibly, there is a race on dev->iotlb.
> > > If the switch occurs right after vhost_dev_cleanup() frees
> > > dev->iotlb, vhost_process_iotlb_msg() still sees the non-null value
> > > and it
> > > keep executing without returning -EFAULT. Consequently, use-after-free
> > > occures
> > > 
> > > 
> > > Thread interleaving:
> > > CPU0 (vhost_process_iotlb_msg)                CPU1 (vhost_dev_cleanup)
> > > (In the case of both VHOST_IOTLB_UPDATE and
> > > VHOST_IOTLB_INVALIDATE)
> > > =====                            =====
> > >                             vhost_umem_clean(dev->iotlb);
> > > if (!dev->iotlb) {
> > >             ret = -EFAULT;
> > >                 break;
> > > }
> > >                             dev->iotlb = NULL;
> > > 
> > > 
> > > Call Sequence:
> > > CPU0
> > > =====
> > > vhost_net_chr_write_iter
> > >     vhost_chr_write_iter
> > >         vhost_process_iotlb_msg
> > > 
> > > CPU1
> > > =====
> > > vhost_net_ioctl
> > >     vhost_net_reset_owner
> > >         vhost_dev_reset_owner
> > >             vhost_dev_cleanup
> > 
> > Thanks a lot for the analysis.
> > 
> > This could be addressed by simply protect it with dev mutex.
> > 
> > Will post a patch.
> > 
> 
> Could you please help to test the attached patch? I've done some smoking
> test.
> 
> Thanks

> >From 88328386f3f652e684ee33dc4cf63dcaed871aea Mon Sep 17 00:00:00 2001
> From: Jason Wang <jasowang@redhat.com>
> Date: Fri, 18 May 2018 17:33:27 +0800
> Subject: [PATCH] vhost: synchronize IOTLB message with dev cleanup
> 
> DaeRyong Jeong reports a race between vhost_dev_cleanup() and
> vhost_process_iotlb_msg():
> 
> Thread interleaving:
> CPU0 (vhost_process_iotlb_msg)			CPU1 (vhost_dev_cleanup)
> (In the case of both VHOST_IOTLB_UPDATE and
> VHOST_IOTLB_INVALIDATE)
> =====						=====
> 						vhost_umem_clean(dev->iotlb);
> if (!dev->iotlb) {
> 	        ret = -EFAULT;
> 		        break;
> }
> 						dev->iotlb = NULL;
> 
> The reason is we don't synchronize between them, fixing by protecting
> vhost_process_iotlb_msg() with dev mutex.
> 
> Reported-by: DaeRyong Jeong <threeearcat@gmail.com>
> Fixes: 6b1e6cc7855b0 ("vhost: new device IOTLB API")
> Reported-by: DaeRyong Jeong <threeearcat@gmail.com>

Long terms we might want to move iotlb into vqs
so that messages can be processed in parallel.
Not sure how to do it yet.

> ---
>  drivers/vhost/vhost.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index f3bd8e9..f0be5f3 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -981,6 +981,7 @@ static int vhost_process_iotlb_msg(struct vhost_dev *dev,
>  {
>  	int ret = 0;
>  
> +	mutex_lock(&dev->mutex);
>  	vhost_dev_lock_vqs(dev);
>  	switch (msg->type) {
>  	case VHOST_IOTLB_UPDATE:
> @@ -1016,6 +1017,8 @@ static int vhost_process_iotlb_msg(struct vhost_dev *dev,
>  	}
>  
>  	vhost_dev_unlock_vqs(dev);
> +	mutex_unlock(&dev->mutex);
> +
>  	return ret;
>  }
>  ssize_t vhost_chr_write_iter(struct vhost_dev *dev,
> -- 
> 2.7.4
> 

^ permalink raw reply

* Re: [PATCH net-next 4/7] net: dsa: qca8k: Force CPU port to its highest bandwidth
From: Andrew Lunn @ 2018-05-21 14:42 UTC (permalink / raw)
  To: Michal Vokáč
  Cc: netdev, michal.vokac, linux-kernel, devicetree, f.fainelli,
	vivien.didelot, mark.rutland, robh+dt, davem
In-Reply-To: <1526909293-56377-5-git-send-email-michal.vokac@ysoft.com>

On Mon, May 21, 2018 at 03:28:10PM +0200, Michal Vokáč wrote:
> By default autonegotiation is enabled to configure MAC on all ports.
> For the CPU port autonegotiation can not be used so we need to set
> some sensible defaults manually.
> 
> This patch forces the default setting of the CPU port to 1000Mbps/full
> duplex which is the chip maximum capability.
> 
> Also correct size of the bit field used to configure link speed.
> 
> Signed-off-by: Michal Vokáč <michal.vokac@ysoft.com>

Reviewed-by: Andrew Lunn <andrew@lunn.ch>

    Andrew

^ permalink raw reply

* Re: [PATCH net-next 3/7] net: dsa: qca8k: Enable RXMAC when bringing up a port
From: Andrew Lunn @ 2018-05-21 14:40 UTC (permalink / raw)
  To: Michal Vokáč
  Cc: netdev, michal.vokac, linux-kernel, devicetree, f.fainelli,
	vivien.didelot, mark.rutland, robh+dt, davem
In-Reply-To: <1526909293-56377-4-git-send-email-michal.vokac@ysoft.com>

On Mon, May 21, 2018 at 03:28:09PM +0200, Michal Vokáč wrote:
> When a port is brought up/down do not enable/disable only the TXMAC
> but the RXMAC as well. This is essential for the CPU port to work.
> 
> Signed-off-by: Michal Vokáč <michal.vokac@ysoft.com>

Reviewed-by: Andrew Lunn <andrew@lunn.ch>

    Andrew

^ permalink raw reply

* Re: [PATCH net-next 1/7] net: dsa: qca8k: Add QCA8334 binding documentation
From: Andrew Lunn @ 2018-05-21 14:39 UTC (permalink / raw)
  To: Michal Vokáč
  Cc: netdev, michal.vokac, linux-kernel, devicetree, f.fainelli,
	vivien.didelot, mark.rutland, robh+dt, davem
In-Reply-To: <1526909293-56377-2-git-send-email-michal.vokac@ysoft.com>

On Mon, May 21, 2018 at 03:28:07PM +0200, Michal Vokáč wrote:

Hi Michal

It is normal to have some commit message, even if it is the subject
said differently.

     Andrew

> Signed-off-by: Michal Vokáč <michal.vokac@ysoft.com>
> ---
>  Documentation/devicetree/bindings/net/dsa/qca8k.txt | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/devicetree/bindings/net/dsa/qca8k.txt b/Documentation/devicetree/bindings/net/dsa/qca8k.txt
> index 9c67ee4..3d73cd0 100644
> --- a/Documentation/devicetree/bindings/net/dsa/qca8k.txt
> +++ b/Documentation/devicetree/bindings/net/dsa/qca8k.txt
> @@ -2,7 +2,10 @@
>  
>  Required properties:
>  
> -- compatible: should be "qca,qca8337"
> +- compatible: should be one of:
> +    "qca,qca8334"
> +    "qca,qca8337"
> +
>  - #size-cells: must be 0
>  - #address-cells: must be 1
>  
> -- 
> 2.1.4
> 

^ permalink raw reply

* Re: [RFC PATCH net-next 12/12] vhost_net: batch submitting XDP buffers to underlayer sockets
From: Michael S. Tsirkin @ 2018-05-21 14:33 UTC (permalink / raw)
  To: Jason Wang; +Cc: kvm, virtualization, netdev, linux-kernel
In-Reply-To: <1526893473-20128-13-git-send-email-jasowang@redhat.com>

On Mon, May 21, 2018 at 05:04:33PM +0800, Jason Wang wrote:
> This patch implements XDP batching for vhost_net with tun. This is
> done by batching XDP buffs in vhost and submit them when:
> 
> - vhost_net can not build XDP buff (mostly because of the size of packet)
> - #batched exceeds the limitation (VHOST_NET_RX_BATCH).
> - tun accept a batch of XDP buff through msg_control and process them
>   in a batch
> 
> With this tun XDP can benefit from e.g batch transmission during
> XDP_REDIRECT or XDP_TX.
> 
> Tests shows 21% improvement on TX pps (from ~3.2Mpps to ~3.9Mpps)
> while transmitting through testpmd from guest to host by
> xdp_redirect_map between tap0 and ixgbe.
> 
> Signed-off-by: Jason Wang <jasowang@redhat.com>

s/underlayer/underlying/ ?

> ---
>  drivers/net/tun.c   | 36 +++++++++++++++++----------
>  drivers/vhost/net.c | 71 ++++++++++++++++++++++++++++++++++++-----------------
>  2 files changed, 71 insertions(+), 36 deletions(-)
> 
> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> index b586b3f..5d16d18 100644
> --- a/drivers/net/tun.c
> +++ b/drivers/net/tun.c
> @@ -1616,7 +1616,6 @@ static u32 tun_do_xdp(struct tun_struct *tun,
>  	switch (act) {
>  	case XDP_REDIRECT:
>  		*err = xdp_do_redirect(tun->dev, xdp, xdp_prog);
> -		xdp_do_flush_map();
>  		if (*err)
>  			break;
>  		goto out;
> @@ -1624,7 +1623,6 @@ static u32 tun_do_xdp(struct tun_struct *tun,
>  		*err = tun_xdp_tx(tun->dev, xdp);
>  		if (*err)
>  			break;
> -		tun_xdp_flush(tun->dev);
>  		goto out;
>  	case XDP_PASS:
>  		goto out;
> @@ -2400,9 +2398,6 @@ static int tun_xdp_one(struct tun_struct *tun,
>  	int err = 0;
>  	bool skb_xdp = false;
>  
> -	preempt_disable();
> -	rcu_read_lock();
> -
>  	xdp_prog = rcu_dereference(tun->xdp_prog);
>  	if (xdp_prog) {
>  		if (gso->gso_type) {
> @@ -2461,15 +2456,12 @@ static int tun_xdp_one(struct tun_struct *tun,
>  		tun_flow_update(tun, rxhash, tfile);
>  
>  out:
> -	rcu_read_unlock();
> -	preempt_enable();
> -
>  	return err;
>  }
>  
>  static int tun_sendmsg(struct socket *sock, struct msghdr *m, size_t total_len)
>  {
> -	int ret;
> +	int ret, i;
>  	struct tun_file *tfile = container_of(sock, struct tun_file, socket);
>  	struct tun_struct *tun = tun_get(tfile);
>  	struct tun_msg_ctl *ctl = m->msg_control;
> @@ -2477,10 +2469,28 @@ static int tun_sendmsg(struct socket *sock, struct msghdr *m, size_t total_len)
>  	if (!tun)
>  		return -EBADFD;
>  
> -	if (ctl && ctl->type == TUN_MSG_PTR) {
> -		ret = tun_xdp_one(tun, tfile, ctl->ptr);
> -		if (!ret)
> -			ret = total_len;
> +	if (ctl && ((ctl->type & 0xF) == TUN_MSG_PTR)) {
> +		int n = ctl->type >> 16;
> +
> +		preempt_disable();
> +		rcu_read_lock();
> +
> +		for (i = 0; i < n; i++) {
> +			struct xdp_buff *x = (struct xdp_buff *)ctl->ptr;
> +			struct xdp_buff *xdp = &x[i];
> +
> +			xdp_set_data_meta_invalid(xdp);
> +			xdp->rxq = &tfile->xdp_rxq;
> +			tun_xdp_one(tun, tfile, xdp);
> +		}
> +
> +		xdp_do_flush_map();
> +		tun_xdp_flush(tun->dev);
> +
> +		rcu_read_unlock();
> +		preempt_enable();
> +
> +		ret = total_len;
>  		goto out;
>  	}
>  
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index 0d84de6..bec4109 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -118,6 +118,7 @@ struct vhost_net_virtqueue {
>  	struct ptr_ring *rx_ring;
>  	struct vhost_net_buf rxq;
>  	struct xdp_buff xdp[VHOST_RX_BATCH];
> +	struct vring_used_elem heads[VHOST_RX_BATCH];
>  };
>  
>  struct vhost_net {
> @@ -511,7 +512,7 @@ static int vhost_net_build_xdp(struct vhost_net_virtqueue *nvq,
>  	void *buf;
>  	int copied;
>  
> -	if (len < nvq->sock_hlen)
> +	if (unlikely(len < nvq->sock_hlen))
>  		return -EFAULT;
>  
>  	if (SKB_DATA_ALIGN(len + pad) +
> @@ -567,11 +568,37 @@ static int vhost_net_build_xdp(struct vhost_net_virtqueue *nvq,
>  	return 0;
>  }
>  
> +static void vhost_tx_batch(struct vhost_net *net,
> +			   struct vhost_net_virtqueue *nvq,
> +			   struct socket *sock,
> +			   struct msghdr *msghdr, int n)
> +{
> +	struct tun_msg_ctl ctl = {
> +		.type = n << 16 | TUN_MSG_PTR,
> +		.ptr = nvq->xdp,
> +	};
> +	int err;
> +
> +	if (n == 0)
> +		return;
> +
> +	msghdr->msg_control = &ctl;
> +	err = sock->ops->sendmsg(sock, msghdr, 0);
> +
> +	if (unlikely(err < 0)) {
> +		/* FIXME vq_err() */
> +		vq_err(&nvq->vq, "sendmsg err!\n");
> +		return;
> +	}
> +	vhost_add_used_and_signal_n(&net->dev, &nvq->vq, nvq->vq.heads, n);
> +}
> +
> +/* Expects to be always run from workqueue - which acts as
> + * read-size critical section for our kind of RCU. */
>  static void handle_tx_copy(struct vhost_net *net)
>  {
>  	struct vhost_net_virtqueue *nvq = &net->vqs[VHOST_NET_VQ_TX];
>  	struct vhost_virtqueue *vq = &nvq->vq;
> -	struct xdp_buff xdp;
>  	unsigned out, in;
>  	int head;
>  	struct msghdr msg = {
> @@ -586,7 +613,6 @@ static void handle_tx_copy(struct vhost_net *net)
>  	size_t hdr_size;
>  	struct socket *sock;
>  	struct vhost_net_ubuf_ref *uninitialized_var(ubufs);
> -	struct tun_msg_ctl ctl;
>  	int sent_pkts = 0;
>  	s16 nheads = 0;
>  
> @@ -631,22 +657,24 @@ static void handle_tx_copy(struct vhost_net *net)
>  		vq->heads[nheads].id = cpu_to_vhost32(vq, head);
>  		vq->heads[nheads].len = 0;
>  
> -		err = vhost_net_build_xdp(nvq, &msg.msg_iter, &xdp);
> -		if (!err) {
> -			ctl.type = TUN_MSG_PTR;
> -			ctl.ptr = &xdp;
> -			msg.msg_control = &ctl;
> -		} else
> -			msg.msg_control = NULL;
> -
>  		total_len += len;
> -		if (total_len < VHOST_NET_WEIGHT &&
> -		    vhost_has_more_pkts(net, vq)) {
> -			msg.msg_flags |= MSG_MORE;
> -		} else {
> -			msg.msg_flags &= ~MSG_MORE;
> +		err = vhost_net_build_xdp(nvq, &msg.msg_iter,
> +					  &nvq->xdp[nheads]);
> +		if (!err) {
> +			if (++nheads == VHOST_RX_BATCH) {
> +				vhost_tx_batch(net, nvq, sock, &msg, nheads);
> +				nheads = 0;
> +			}
> +			goto done;
> +		} else if (unlikely(err != -ENOSPC)) {
> +			vq_err(vq, "Fail to build XDP buffer\n");
> +			break;
>  		}
>  
> +		vhost_tx_batch(net, nvq, sock, &msg, nheads);
> +		msg.msg_control = NULL;
> +		nheads = 0;
> +
>  		/* TODO: Check specific error and bomb out unless ENOBUFS? */
>  		err = sock->ops->sendmsg(sock, &msg, len);
>  		if (unlikely(err < 0)) {
> @@ -657,11 +685,9 @@ static void handle_tx_copy(struct vhost_net *net)
>  		if (err != len)
>  			pr_debug("Truncated TX packet: "
>  				 " len %d != %zd\n", err, len);
> -		if (++nheads == VHOST_RX_BATCH) {
> -			vhost_add_used_and_signal_n(&net->dev, vq, vq->heads,
> -						    nheads);
> -			nheads = 0;
> -		}
> +
> +		vhost_add_used_and_signal(&net->dev, vq, head, 0);
> +done:
>  		if (vhost_exceeds_weight(++sent_pkts, total_len)) {
>  			vhost_poll_queue(&vq->poll);
>  			break;
> @@ -669,8 +695,7 @@ static void handle_tx_copy(struct vhost_net *net)
>  	}
>  out:
>  	if (nheads)
> -		vhost_add_used_and_signal_n(&net->dev, vq, vq->heads,
> -					    nheads);
> +		vhost_tx_batch(net, nvq, sock, &msg, nheads);
>  	mutex_unlock(&vq->mutex);
>  }
>  
> -- 
> 2.7.4

^ permalink raw reply

* Re: [RFC PATCH net-next 06/12] tuntap: enable premmption early
From: Michael S. Tsirkin @ 2018-05-21 14:32 UTC (permalink / raw)
  To: Jason Wang; +Cc: kvm, virtualization, netdev, linux-kernel
In-Reply-To: <1526893473-20128-7-git-send-email-jasowang@redhat.com>

On Mon, May 21, 2018 at 05:04:27PM +0800, Jason Wang wrote:
> Signed-off-by: Jason Wang <jasowang@redhat.com>

typo in subject

> ---
>  drivers/net/tun.c | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> index 44d4f3d..24ecd82 100644
> --- a/drivers/net/tun.c
> +++ b/drivers/net/tun.c
> @@ -1697,6 +1697,8 @@ static struct sk_buff *tun_build_skb(struct tun_struct *tun,
>  			goto err_xdp;
>  		}
>  	}
> +	rcu_read_unlock();
> +	preempt_enable();
>  
>  	skb = build_skb(buf, buflen);
>  	if (!skb) {
> @@ -1710,9 +1712,6 @@ static struct sk_buff *tun_build_skb(struct tun_struct *tun,
>  	get_page(alloc_frag->page);
>  	alloc_frag->offset += buflen;
>  
> -	rcu_read_unlock();
> -	preempt_enable();
> -
>  	return skb;
>  
>  err_redirect:
> -- 
> 2.7.4

^ permalink raw reply

* Re: [PATCH net-next] sctp: add support for SCTP_REUSE_PORT sockopt
From: Michael Tuexen @ 2018-05-21 14:09 UTC (permalink / raw)
  To: Neil Horman
  Cc: Marcelo Ricardo Leitner, Xin Long, network dev, linux-sctp, davem
In-Reply-To: <20180521134857.GC17593@hmswarspite.think-freely.org>

[-- Attachment #1: Type: text/plain, Size: 6459 bytes --]

> On 21. May 2018, at 15:48, Neil Horman <nhorman@tuxdriver.com> wrote:
> 
> On Mon, May 21, 2018 at 02:16:56PM +0200, Michael Tuexen wrote:
>>> On 21. May 2018, at 13:39, Neil Horman <nhorman@tuxdriver.com> wrote:
>>> 
>>> On Sun, May 20, 2018 at 10:54:04PM -0300, Marcelo Ricardo Leitner wrote:
>>>> On Sun, May 20, 2018 at 08:50:59PM -0400, Neil Horman wrote:
>>>>> On Sat, May 19, 2018 at 03:44:40PM +0800, Xin Long wrote:
>>>>>> This feature is actually already supported by sk->sk_reuse which can be
>>>>>> set by SO_REUSEADDR. But it's not working exactly as RFC6458 demands in
>>>>>> section 8.1.27, like:
>>>>>> 
>>>>>> - This option only supports one-to-one style SCTP sockets
>>>>>> - This socket option must not be used after calling bind()
>>>>>>   or sctp_bindx().
>>>>>> 
>>>>>> Besides, SCTP_REUSE_PORT sockopt should be provided for user's programs.
>>>>>> Otherwise, the programs with SCTP_REUSE_PORT from other systems will not
>>>>>> work in linux.
>>>>>> 
>>>>>> This patch reuses sk->sk_reuse and works pretty much as SO_REUSEADDR,
>>>>>> just with some extra setup limitations that are neeeded when it is being
>>>>>> enabled.
>>>>>> 
>>>>>> "It should be noted that the behavior of the socket-level socket option
>>>>>> to reuse ports and/or addresses for SCTP sockets is unspecified", so it
>>>>>> leaves SO_REUSEADDR as is for the compatibility.
>>>>>> 
>>>>>> Signed-off-by: Xin Long <lucien.xin@gmail.com>
>>>>>> ---
>>>>>> include/uapi/linux/sctp.h |  1 +
>>>>>> net/sctp/socket.c         | 48 +++++++++++++++++++++++++++++++++++++++++++++++
>>>>>> 2 files changed, 49 insertions(+)
>>>>>> 
>>>>> A few things:
>>>>> 
>>>>> 1) I agree with Tom, this feature is a complete duplication of the SK_REUSEPORT
>>>>> socket option.  I understand that this is an implementation of the option in the
>>>>> RFC, but its definately a duplication of a feature, which makes several things
>>>>> really messy.
>>>>> 
>>>>> 2) The overloading of the sk_reuse opeion is a bad idea, for several reasons.
>>>>> Chief among them is the behavioral interference between this patch and the
>>>>> SO_REUSEADDR socket level option, that also sets this feature.  If you set
>>>>> sk_reuse via SO_REUSEADDR, you will set the SCTP port reuse feature regardless
>>>>> of the bind or 1:1/1:m state of the socket.  Vice versa, if you set this socket
>>>>> option via the SCTP_PORT_REUSE option you will inadvertently turn on address
>>>>> reuse for the socket.  We can't do that.
>>>> 
>>>> Given your comments, going a bit further here, one other big
>>>> implication is that a port would never be able to be considered to
>>>> fully meet SCTP standards regarding reuse because a rogue application
>>>> may always abuse of the socket level opt to gain access to the port.
>>>> 
>>>> IOW, the patch allows the application to use such restrictions against
>>>> itself and nothing else, which undermines the patch idea.
>>>> 
>>> Agreed.
>>> 
>>>> I lack the knowledge on why the SCTP option was proposed in the RFC. I
>>>> guess they had a good reason to add the restriction on 1:1/1:m style.
>>>> Does the usage of the current imply in any risk to SCTP sockets? If
>>>> yes, that would give some grounds for going forward with the SCTP
>>>> option.
>>>> 
>>> I'm also not privy to why the sctp option was proposed, though I expect that the
>>> lack of standardization of SO_REUSEPORT probably had something to do with it.
>>> As for the reasoning behind restriction to only 1:1 sockets, if I had to guess,
>>> I would say it likely because it creates ordering difficulty at the application
>>> level.
>>> 
>>> CC-ing Michael Tuxen, who I believe had some input on this RFC.  Hopefully he
>>> can shed some light on this.
>> Dear all,
>> 
>> the reason this was added is to have a specified way to allow a system to
>> behave like a client and server making use of the INIT collision.
>> 
>> For 1-to-many style sockets you can do this by creating a socket, binding it,
>> calling listen on it and trying to connect to the peer.
>> 
>> For 1-to-1 style sockets you need two sockets for it. One listener and one
>> you use to connect (and close it in case of failure, open a new one...).
>> 
>> It was not clear if one can achieve this with SO_REUSEPORT and/or SO_REUSEADDR
>> on all platforms. We left that unspecified.
>> 
>> I hope this makes the intention clearer.
>> 
> I think it makes the intention clearer yes, but it unfortunately does nothing in
> my mind to clarify how the implementation should best handle the potential
> overlap in functionality.  What I see here is that we have two functional paths
> (the SO_REUSEPORT path and the SCTP_PORT_REUSE path), which may or may not
> (depending on the OS implementation achieve the same functional goal (allowing
> multiple sockets to share a port while allowing one socket to listen and the
> other connect to a remote peer).  If both implementations do the same thing on a
> given platform, we can either just alias one to another and be done, but if they
> don't then we either have to implement both paths, and ensure that the
> SO_REUSEPORT path is a no-op/error return for SCTP sockets, or that each path
> implements a distinct feature set that is cleaarly documented.
> 
> That said, I think we may be in luck.  Looking at the connect and listen paths,
> it appears to me that:
> 
> 1) Sockets ignore SO_REUSEPORT in the connect and listen paths (save for any
> autobinding) so it would appear that the intent of the SCTP rfc can be honored
> via SO_REUSEPORT on linux.  
> 
> 2) SO_REUSEPORT prevents changing state after a bind has occured, so we can honr
> that part of the SCTP RFC.
> 
> The only missing part is the restriction that SCTP_REUSE_PORT has which is
> unaccounted for is that 1:M sockets aren't allowed to enable port reuse.
> However, I think the implication from Michaels description above is that port
> reuse on a 1:M socket is implicit because a single socket can connect and listen
> in that use case, rather than there being a danger to doing so.
> 
> As such, I would propose that we implement this socket option by simply setting
> the sk->sk_reuseport field in the sock structure, and document the fact that
> linux does not restrict port reuse from 1:M sockets.
> 
> Thoughts?
Sounds acceptable to me...

Best regards
Michael
> Neil
> 


[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5367 bytes --]

^ permalink raw reply

* Re: [PATCH] [RFC] bpf: tracing: new helper bpf_get_current_cgroup_ino
From: Alban Crequy @ 2018-05-21 13:52 UTC (permalink / raw)
  To: Y Song; +Cc: Alban Crequy, netdev, LKML, Linux Containers, cgroups
In-Reply-To: <CAH3MdRUe7K8zJHuGAfnY6_VEkBLAWY1F_WaJgcLs4qDdQv1bTA@mail.gmail.com>

On Mon, May 14, 2018 at 9:38 PM, Y Song <ys114321@gmail.com> wrote:
>
> On Sun, May 13, 2018 at 10:33 AM, Alban Crequy <alban.crequy@gmail.com> wrote:
> > From: Alban Crequy <alban@kinvolk.io>
> >
> > bpf_get_current_cgroup_ino() allows BPF trace programs to get the inode
> > of the cgroup where the current process resides.
> >
> > My use case is to get statistics about syscalls done by a specific
> > Kubernetes container. I have a tracepoint on raw_syscalls/sys_enter and
> > a BPF map containing the cgroup inode that I want to trace. I use
> > bpf_get_current_cgroup_ino() and I quickly return from the tracepoint if
> > the inode is not in the BPF hash map.
>
> Alternatively, the kernel already has bpf_current_task_under_cgroup helper
> which uses a cgroup array to store cgroup fd's. If the current task is
> in the hierarchy of a particular cgroup, the helper will return true.
>
> One difference between your helper and bpf_current_task_under_cgroup() is
> that your helper tests against a particular cgroup, not including its
> children, but
> bpf_current_task_under_cgroup() will return true even the task is in a
> nested cgroup.
>
> Maybe this will work for you?

I like the behaviour that it checks for children cgroups. But with the
cgroup array, I can test only one cgroup at a time. I would like to be
able to enable my tracer for a few Kubernetes containers or all by
adding the inodes of a few cgroups in a hash map. So I could keep
separate stats for each. With bpf_current_task_under_cgroup(), I would
need to iterate over the list of cgroups, which is difficult with BPF.

Also, Kubernetes is cgroup-v1 only and bpf_current_task_under_cgroup()
is cgroup-v2 only. In Kubernetes, the processes remain in the root of
the v2 hierarchy. I'd like to be able to select the cgroup hierarchy
in my helper so it'd work for both v1 and v2.

> > Without this BPF helper, I would need to keep track of all pids in the
> > container. The Netlink proc connector can be used to follow process
> > creation and destruction but it is racy.
> >
> > This patch only looks at the memory cgroup, which was enough for me
> > since each Kubernetes container is placed in a different mem cgroup.
> > For a generic implementation, I'm not sure how to proceed: it seems I
> > would need to use 'for_each_root(root)' (see example in
> > proc_cgroup_show() from kernel/cgroup/cgroup.c) but I don't know if
> > taking the cgroup mutex is possible in the BPF helper function. It might
> > be ok in the tracepoint raw_syscalls/sys_enter but could the mutex
> > already be taken in some other tracepoints?
>
> mutex is not allowed in a helper since it can block.

Ok. I don't know how to implement my helper properly then. Maybe I
could just use the 1st cgroup-v1 hierarchy (the name=systemd one) so I
don't have to iterate over the hierarchies. But would that be
acceptable?

Cheers,
Alban

> > Signed-off-by: Alban Crequy <alban@kinvolk.io>
> > ---
> >  include/uapi/linux/bpf.h | 11 ++++++++++-
> >  kernel/trace/bpf_trace.c | 25 +++++++++++++++++++++++++
> >  2 files changed, 35 insertions(+), 1 deletion(-)
> >
> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > index c5ec89732a8d..38ac3959cdf3 100644
> > --- a/include/uapi/linux/bpf.h
> > +++ b/include/uapi/linux/bpf.h
> > @@ -755,6 +755,14 @@ union bpf_attr {
> >   *     @addr: pointer to struct sockaddr to bind socket to
> >   *     @addr_len: length of sockaddr structure
> >   *     Return: 0 on success or negative error code
> > + *
> > + * u64 bpf_get_current_cgroup_ino(hierarchy, flags)
> > + *     Get the cgroup{1,2} inode of current task under the specified hierarchy.
> > + *     @hierarchy: cgroup hierarchy
>
> Not sure what is the value to specify hierarchy here.
> A cgroup directory fd?
>
> > + *     @flags: reserved for future use
> > + *     Return:
> > + *       == 0 error
>
> looks like < 0 means error.
>
> > + *        > 0 inode of the cgroup
>                >= 0 means good?
> >   */
> >  #define __BPF_FUNC_MAPPER(FN)          \
> >         FN(unspec),                     \
> > @@ -821,7 +829,8 @@ union bpf_attr {
> >         FN(msg_apply_bytes),            \
> >         FN(msg_cork_bytes),             \
> >         FN(msg_pull_data),              \
> > -       FN(bind),
> > +       FN(bind),                       \
> > +       FN(get_current_cgroup_ino),
> >
> >  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
> >   * function eBPF program intends to call
> > diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> > index 56ba0f2a01db..9bf92a786639 100644
> > --- a/kernel/trace/bpf_trace.c
> > +++ b/kernel/trace/bpf_trace.c
> > @@ -524,6 +524,29 @@ static const struct bpf_func_proto bpf_probe_read_str_proto = {
> >         .arg3_type      = ARG_ANYTHING,
> >  };
> >
> > +BPF_CALL_2(bpf_get_current_cgroup_ino, u32, hierarchy, u64, flags)
> > +{
> > +       // TODO: pick the correct hierarchy instead of the mem controller
> > +       struct cgroup *cgrp = task_cgroup(current, memory_cgrp_id);
> > +
> > +       if (unlikely(!cgrp))
> > +               return -EINVAL;
> > +       if (unlikely(hierarchy))
> > +               return -EINVAL;
> > +       if (unlikely(flags))
> > +               return -EINVAL;
> > +
> > +       return cgrp->kn->id.ino;
> > +}
> > +
> > +static const struct bpf_func_proto bpf_get_current_cgroup_ino_proto = {
> > +       .func           = bpf_get_current_cgroup_ino,
> > +       .gpl_only       = false,
> > +       .ret_type       = RET_INTEGER,
> > +       .arg1_type      = ARG_DONTCARE,
> > +       .arg2_type      = ARG_DONTCARE,
> > +};
> > +
> >  static const struct bpf_func_proto *
> >  tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
> >  {
> > @@ -564,6 +587,8 @@ tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
> >                 return &bpf_get_prandom_u32_proto;
> >         case BPF_FUNC_probe_read_str:
> >                 return &bpf_probe_read_str_proto;
> > +       case BPF_FUNC_get_current_cgroup_ino:
> > +               return &bpf_get_current_cgroup_ino_proto;
> >         default:
> >                 return NULL;
> >         }
> > --
> > 2.14.3
> >

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox