Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH 0/4] Add SOCFPGA System Manager
From: Thor Thayer @ 2018-10-17 13:59 UTC (permalink / raw)
  To: lee.jones, peppe.cavallaro, dinguyen, linux, alexandre.torgue,
	joabreu
  Cc: davem, mchehab+samsung, catalin.marinas, akpm, arnd, aisheng.dong,
	linux-kernel, netdev, linux-arm-kernel
In-Reply-To: <021aaa02-1138-e538-3d02-ff639541d0f3@linux.intel.com>

Gentle ping again...

Any comments on this patch series?

On 10/10/2018 09:42 AM, Thor Thayer wrote:
> Hi
> On 09/24/2018 05:09 PM, thor.thayer@linux.intel.com wrote:
>> From: Thor Thayer <thor.thayer@linux.intel.com>
>>
>> Add MFD driver for ARM64 SOCFPGA System Manager to steer
>> System Manager calls appropriately.
>> The SOCFPGA System Manager includes registers from several
>> SOC peripherals.
>>
>> On ARM32, syscon handles this aggregated register grouping.
>> Redirect System Manager calls to syscon for ARM32 SOCFPGA
>> systems.
>>
>> The ARM64 System Manager can only be accessed from priority
>> level EL3 so this new MFD driver handles the calls to EL3.
>>
>> Thor Thayer (4):
>>    mfd: altera-sysmgr: Add SOCFPGA System Manager abstraction
>>    ARM: socfpga_defconfig: Enable CONFIG_MTD_ALTERA_SYSMGR
>>    arm64: defconfig: Enable CONFIG_MTD_ALTERA_SYSMGR
>>    net: stmmac: socfpga: Convert to shared System Manager driver
>>
>>   MAINTAINERS                                        |   6 +
>>   arch/arm/configs/socfpga_defconfig                 |   1 +
>>   arch/arm64/configs/defconfig                       |   1 +
>>   drivers/mfd/Kconfig                                |   9 +
>>   drivers/mfd/Makefile                               |   1 +
>>   drivers/mfd/altera-sysmgr.c                        | 310 
>> +++++++++++++++++++++
>>   .../net/ethernet/stmicro/stmmac/dwmac-socfpga.c    |   4 +-
>>   include/linux/mfd/altera-sysmgr.h                  | 113 ++++++++
>>   8 files changed, 444 insertions(+), 1 deletion(-)
>>   create mode 100644 drivers/mfd/altera-sysmgr.c
>>   create mode 100644 include/linux/mfd/altera-sysmgr.h
>>
> Gentle ping.
> 
> Thor

^ permalink raw reply

* Re: [PATCH] net-xfrm: add build time cfg option to PF_KEY SHA256 to use RFC4868-compliant truncation
From: Maciej Żenczykowski @ 2018-10-17  6:01 UTC (permalink / raw)
  To: Herbert Xu
  Cc: David S. Miller, steffen.klassert, Linux NetDev, Lorenzo Colitti
In-Reply-To: <20181017055950.m3wpucoyobt43k44@gondor.apana.org.au>

I'm afraid it's nothing we're using.  It's what people are using.
I guess we'll just carry this patch for a few more years.

^ permalink raw reply

* Re: [PATCH] net-xfrm: add build time cfg option to PF_KEY SHA256 to use RFC4868-compliant truncation
From: Herbert Xu @ 2018-10-17  5:59 UTC (permalink / raw)
  To: Maciej Żenczykowski; +Cc: maze, davem, steffen.klassert, netdev, lorenzo
In-Reply-To: <20181016080634.139776-1-zenczykowski@gmail.com>

Maciej Żenczykowski <zenczykowski@gmail.com> wrote:
>
> +#if IS_ENABLED(CONFIG_XFRM_HMAC_SHA256_RFC4868)
> +                       .icv_truncbits = 128,
> +#else
>                        .icv_truncbits = 96,
> +#endif

Nack.  We don't want a build-time configuration knob for this.
This needs to be decided at run-time.

In fact you can already do this at run-time anyway through the
xfrm interface.  So please please please just ditch whatever that
you're using that's still glued to the long-obsolete (more than a
decade) af_key interface and switch it over to xfrm.

Thanks,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* [PATCH bpf-next 1/2] bpf: add cg_skb_is_valid_access for BPF_PROG_TYPE_CGROUP_SKB
From: Song Liu @ 2018-10-17  5:56 UTC (permalink / raw)
  To: netdev; +Cc: ast, daniel, kernel-team, Song Liu
In-Reply-To: <20181017055606.353449-1-songliubraving@fb.com>

BPF programs of BPF_PROG_TYPE_CGROUP_SKB need to access headers in the
skb. This patch enables direct access of skb for these programs.

In __cgroup_bpf_run_filter_skb(), bpf_compute_data_pointers() is called
to compute proper data_end for the BPF program.

Signed-off-by: Song Liu <songliubraving@fb.com>
---
 kernel/bpf/cgroup.c |  4 ++++
 net/core/filter.c   | 26 +++++++++++++++++++++++++-
 2 files changed, 29 insertions(+), 1 deletion(-)

diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
index 00f6ed2e4f9a..340d496f35bd 100644
--- a/kernel/bpf/cgroup.c
+++ b/kernel/bpf/cgroup.c
@@ -566,6 +566,10 @@ int __cgroup_bpf_run_filter_skb(struct sock *sk,
 	save_sk = skb->sk;
 	skb->sk = sk;
 	__skb_push(skb, offset);
+
+	/* compute pointers for the bpf prog */
+	bpf_compute_data_pointers(skb);
+
 	ret = BPF_PROG_RUN_ARRAY(cgrp->bpf.effective[type], skb,
 				 bpf_prog_run_save_cb);
 	__skb_pull(skb, offset);
diff --git a/net/core/filter.c b/net/core/filter.c
index 1a3ac6c46873..8b5a502e241f 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5346,6 +5346,30 @@ static bool sk_filter_is_valid_access(int off, int size,
 	return bpf_skb_is_valid_access(off, size, type, prog, info);
 }
 
+static bool cg_skb_is_valid_access(int off, int size,
+				   enum bpf_access_type type,
+				   const struct bpf_prog *prog,
+				   struct bpf_insn_access_aux *info)
+{
+	if (type == BPF_WRITE)
+		return false;
+
+	switch (off) {
+	case bpf_ctx_range(struct __sk_buff, len):
+		break;
+	case bpf_ctx_range(struct __sk_buff, data):
+		info->reg_type = PTR_TO_PACKET;
+		break;
+	case bpf_ctx_range(struct __sk_buff, data_end):
+		info->reg_type = PTR_TO_PACKET_END;
+		break;
+	default:
+		return false;
+	}
+
+	return bpf_skb_is_valid_access(off, size, type, prog, info);
+}
+
 static bool lwt_is_valid_access(int off, int size,
 				enum bpf_access_type type,
 				const struct bpf_prog *prog,
@@ -7038,7 +7062,7 @@ const struct bpf_prog_ops xdp_prog_ops = {
 
 const struct bpf_verifier_ops cg_skb_verifier_ops = {
 	.get_func_proto		= cg_skb_func_proto,
-	.is_valid_access	= sk_filter_is_valid_access,
+	.is_valid_access	= cg_skb_is_valid_access,
 	.convert_ctx_access	= bpf_convert_ctx_access,
 };
 
-- 
2.17.1

^ permalink raw reply related

* [PATCH bpf-next 2/2] bpf: add tests for direct packet access from CGROUP_SKB
From: Song Liu @ 2018-10-17  5:56 UTC (permalink / raw)
  To: netdev; +Cc: ast, daniel, kernel-team, Song Liu
In-Reply-To: <20181017055606.353449-1-songliubraving@fb.com>

Tests are added to make sure CGROUP_SKB can directly access len, data,
and data_end in __sk_buff, but not other fields.

Signed-off-by: Song Liu <songliubraving@fb.com>
---
 tools/testing/selftests/bpf/test_verifier.c | 30 +++++++++++++++++++++
 1 file changed, 30 insertions(+)

diff --git a/tools/testing/selftests/bpf/test_verifier.c b/tools/testing/selftests/bpf/test_verifier.c
index cf4cd32b6772..aaf2ceba83dd 100644
--- a/tools/testing/selftests/bpf/test_verifier.c
+++ b/tools/testing/selftests/bpf/test_verifier.c
@@ -4862,6 +4862,36 @@ static struct bpf_test tests[] = {
 		.result = REJECT,
 		.flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
 	},
+	{
+		"direct packet read for CGROUP_SKB",
+		.insns = {
+			BPF_LDX_MEM(BPF_W, BPF_REG_2, BPF_REG_1,
+				    offsetof(struct __sk_buff, data)),
+			BPF_LDX_MEM(BPF_W, BPF_REG_3, BPF_REG_1,
+				    offsetof(struct __sk_buff, data_end)),
+			BPF_LDX_MEM(BPF_W, BPF_REG_4, BPF_REG_1,
+				    offsetof(struct __sk_buff, len)),
+			BPF_MOV64_REG(BPF_REG_0, BPF_REG_2),
+			BPF_ALU64_IMM(BPF_ADD, BPF_REG_0, 8),
+			BPF_JMP_REG(BPF_JGT, BPF_REG_0, BPF_REG_3, 1),
+			BPF_LDX_MEM(BPF_B, BPF_REG_0, BPF_REG_2, 0),
+			BPF_MOV64_IMM(BPF_REG_0, 0),
+			BPF_EXIT_INSN(),
+		},
+		.result = ACCEPT,
+		.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+	},
+	{
+		"invalid access of tc_classid for CGROUP_SKB",
+		.insns = {
+			BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1,
+				    offsetof(struct __sk_buff, tc_classid)),
+			BPF_EXIT_INSN(),
+		},
+		.result = REJECT,
+		.errstr = "invalid bpf_context access",
+		.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+	},
 	{
 		"valid cgroup storage access",
 		.insns = {
-- 
2.17.1

^ permalink raw reply related

* [PATCH bpf-next 0/2] bpf: add cg_skb_is_valid_access
From: Song Liu @ 2018-10-17  5:56 UTC (permalink / raw)
  To: netdev; +Cc: ast, daniel, kernel-team, Song Liu

This set enables BPF program of type BPF_PROG_TYPE_CGROUP_SKB to access
__skb_buff->len/data/data_end directly.

Song Liu (2):
  bpf: add cg_skb_is_valid_access for BPF_PROG_TYPE_CGROUP_SKB
  bpf: add tests for direct packet access from CGROUP_SKB

 kernel/bpf/cgroup.c                         |  4 +++
 net/core/filter.c                           | 26 +++++++++++++++++-
 tools/testing/selftests/bpf/test_verifier.c | 30 +++++++++++++++++++++
 3 files changed, 59 insertions(+), 1 deletion(-)

^ permalink raw reply

* Re: netconsole warning in 4.19.0-rc7
From: Cong Wang @ 2018-10-17  5:16 UTC (permalink / raw)
  To: Dave Jones, Meelis Roos, LKML, Linux Kernel Network Developers
In-Reply-To: <20181017035038.czaqown24rnjn2pw@codemonkey.org.uk>

On Tue, Oct 16, 2018 at 8:50 PM Dave Jones <davej@codemonkey.org.uk> wrote:
>
> On Tue, Oct 16, 2018 at 11:40:47PM -0400, Dave Jones wrote:
>
>  >  > This is exactly what I mentioned in my review here:
>  >  > https://marc.info/?l=linux-netdev&m=153816136624679&w=2
>  >  >
>  >  > "But irq is disabled here, so not sure if rcu_read_lock_bh()
>  >  > could cause trouble... "
>  >
>  > Not sure why this didn't show up for me when I was developing that
>  > patch, but I can now reproduce this.  The patch below fixes it for
>  > me, but I'm not sure if there are still any side-effects.
>  > There's also a missed unlock in the error path.
>
> I took another look at that error path. Turns out this is all we need I
> think..

Double check if rcu_dereference_bh() is happy with it.

Don't trust me, enable the relevant Kconfig and trust your test. ;)

^ permalink raw reply

* [PATCH net-next] net: skbuff.h: Mark expected switch fall-throughs
From: Gustavo A. R. Silva @ 2018-10-17 13:01 UTC (permalink / raw)
  To: David S. Miller, Kees Cook; +Cc: linux-kernel, netdev, Gustavo A. R. Silva

In preparation to enabling -Wimplicit-fallthrough, mark switch cases
where we are expecting to fall through.

Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
---
 include/linux/skbuff.h | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 119d092..0ba6874 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -3505,13 +3505,19 @@ static inline bool __skb_metadata_differs(const struct sk_buff *skb_a,
 #define __it(x, op) (x -= sizeof(u##op))
 #define __it_diff(a, b, op) (*(u##op *)__it(a, op)) ^ (*(u##op *)__it(b, op))
 	case 32: diffs |= __it_diff(a, b, 64);
+		 /* fall through */
 	case 24: diffs |= __it_diff(a, b, 64);
+		 /* fall through */
 	case 16: diffs |= __it_diff(a, b, 64);
+		 /* fall through */
 	case  8: diffs |= __it_diff(a, b, 64);
 		break;
 	case 28: diffs |= __it_diff(a, b, 64);
+		 /* fall through */
 	case 20: diffs |= __it_diff(a, b, 64);
+		 /* fall through */
 	case 12: diffs |= __it_diff(a, b, 64);
+		 /* fall through */
 	case  4: diffs |= __it_diff(a, b, 32);
 		break;
 	}
-- 
2.7.4

^ permalink raw reply related

* Re: [PATCH v3] virtio_net: avoid using netif_tx_disable() for serializing tx routine
From: Jason Wang @ 2018-10-17 12:30 UTC (permalink / raw)
  To: Ake Koomsin
  Cc: netdev, virtualization, David S. Miller, linux-kernel,
	Michael S. Tsirkin
In-Reply-To: <20181017104419.13003-1-ake@igel.co.jp>


On 2018/10/17 下午6:44, Ake Koomsin wrote:
> Commit 713a98d90c5e ("virtio-net: serialize tx routine during reset")
> introduces netif_tx_disable() after netif_device_detach() in order to
> avoid use-after-free of tx queues. However, there are two issues.
>
> 1) Its operation is redundant with netif_device_detach() in case the
>     interface is running.
> 2) In case of the interface is not running before suspending and
>     resuming, the tx does not get resumed by netif_device_attach().
>     This results in losing network connectivity.
>
> It is better to use netif_tx_lock_bh()/netif_tx_unlock_bh() instead for
> serializing tx routine during reset. This also preserves the symmetry
> of netif_device_detach() and netif_device_attach().
>
> Fixes commit 713a98d90c5e ("virtio-net: serialize tx routine during reset")
> Signed-off-by: Ake Koomsin <ake@igel.co.jp>
> ---
>   drivers/net/virtio_net.c | 5 ++++-
>   1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 3f5aa59c37b7..3e2c041d76ac 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -2267,8 +2267,9 @@ static void virtnet_freeze_down(struct virtio_device *vdev)
>   	/* Make sure no work handler is accessing the device */
>   	flush_work(&vi->config_work);
>   
> +	netif_tx_lock_bh(vi->dev);
>   	netif_device_detach(vi->dev);
> -	netif_tx_disable(vi->dev);
> +	netif_tx_unlock_bh(vi->dev);
>   	cancel_delayed_work_sync(&vi->refill);
>   
>   	if (netif_running(vi->dev)) {
> @@ -2304,7 +2305,9 @@ static int virtnet_restore_up(struct virtio_device *vdev)
>   		}
>   	}
>   
> +	netif_tx_lock_bh(vi->dev);
>   	netif_device_attach(vi->dev);
> +	netif_tx_unlock_bh(vi->dev);
>   	return err;
>   }
>   


Acked-by: Jason Wang <jasowang@redhat.com>

Thanks

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Application of f8b39039cbf2a15f ("net: fs_enet: do not call phy_stop() in interrupts") to 4.9 and 4.14 stable
From: Christophe LEROY @ 2018-10-17 12:08 UTC (permalink / raw)
  To: David Miller, netdev@vger.kernel.org; +Cc: Greg KH, stable@vger.kernel.org

Hi,

Could you please apply f8b39039cbf2a15f2b8c9f081e1cbd5dee00aaf5 to 4.9 
and 4.14 ?

It fixes an Oops when Ethernet transmission times out.

As you can observe in the commit log, the Oops what initially observed 
in 4.9.

The fix has been in mainline since 4.15

Thanks
Christophe

^ permalink raw reply

* Re: [PATCH net-next V2 6/8] vhost: packed ring support
From: Maxime Coquelin @ 2018-10-17 12:02 UTC (permalink / raw)
  To: Jason Wang, Michael S. Tsirkin, Tiwei Bie
  Cc: kvm, virtualization, netdev, linux-kernel, wexu, jfreimann
In-Reply-To: <0f3827e5-a7fa-e54a-725d-7726e90333b8@redhat.com>



On 10/17/2018 08:54 AM, Jason Wang wrote:
> 
> On 2018/10/16 下午9:58, Maxime Coquelin wrote:
>>
>> On 10/15/2018 04:22 AM, Jason Wang wrote:
>>>
>>>
>>> On 2018年10月13日 01:23, Michael S. Tsirkin wrote:
>>>> On Fri, Oct 12, 2018 at 10:32:44PM +0800, Tiwei Bie wrote:
>>>>> On Mon, Jul 16, 2018 at 11:28:09AM +0800, Jason Wang wrote:
>>>>> [...]
>>>>>> @@ -1367,10 +1397,48 @@ long vhost_vring_ioctl(struct vhost_dev 
>>>>>> *d, unsigned int ioctl, void __user *arg
>>>>>>           vq->last_avail_idx = s.num;
>>>>>>           /* Forget the cached index value. */
>>>>>>           vq->avail_idx = vq->last_avail_idx;
>>>>>> +        if (vhost_has_feature(vq, VIRTIO_F_RING_PACKED)) {
>>>>>> +            vq->last_avail_wrap_counter = wrap_counter;
>>>>>> +            vq->avail_wrap_counter = vq->last_avail_wrap_counter;
>>>>>> +        }
>>>>>>           break;
>>>>>>       case VHOST_GET_VRING_BASE:
>>>>>>           s.index = idx;
>>>>>>           s.num = vq->last_avail_idx;
>>>>>> +        if (vhost_has_feature(vq, VIRTIO_F_RING_PACKED))
>>>>>> +            s.num |= vq->last_avail_wrap_counter << 31;
>>>>>> +        if (copy_to_user(argp, &s, sizeof(s)))
>>>>>> +            r = -EFAULT;
>>>>>> +        break;
>>>>>> +    case VHOST_SET_VRING_USED_BASE:
>>>>>> +        /* Moving base with an active backend?
>>>>>> +         * You don't want to do that.
>>>>>> +         */
>>>>>> +        if (vq->private_data) {
>>>>>> +            r = -EBUSY;
>>>>>> +            break;
>>>>>> +        }
>>>>>> +        if (copy_from_user(&s, argp, sizeof(s))) {
>>>>>> +            r = -EFAULT;
>>>>>> +            break;
>>>>>> +        }
>>>>>> +        if (vhost_has_feature(vq, VIRTIO_F_RING_PACKED)) {
>>>>>> +            wrap_counter = s.num >> 31;
>>>>>> +            s.num &= ~(1 << 31);
>>>>>> +        }
>>>>>> +        if (s.num > 0xffff) {
>>>>>> +            r = -EINVAL;
>>>>>> +            break;
>>>>>> +        }
>>>>> Do we want to put wrap_counter at bit 15?
>>>> I think I second that - seems to be consistent with
>>>> e.g. event suppression structure and the proposed
>>>> extension to driver notifications.
>>>
>>> Ok, I assumes packed virtqueue support 64K but looks not. I can 
>>> change it to bit 15 and GET_VRING_BASE need to be changed as well.
>>>
>>>>
>>>>
>>>>> If put wrap_counter at bit 31, the check (s.num > 0xffff)
>>>>> won't be able to catch the illegal index 0x8000~0xffff for
>>>>> packed ring.
>>>>>
>>>
>>> Do we need to clarify this in the spec?
>>>
>>>>>> +        vq->last_used_idx = s.num;
>>>>>> +        if (vhost_has_feature(vq, VIRTIO_F_RING_PACKED))
>>>>>> +            vq->last_used_wrap_counter = wrap_counter;
>>>>>> +        break;
>>>>>> +    case VHOST_GET_VRING_USED_BASE:
>>>>> Do we need the new VHOST_GET_VRING_USED_BASE and
>>>>> VHOST_SET_VRING_USED_BASE ops?
>>>>>
>>>>> We are going to merge below series in DPDK:
>>>>>
>>>>> http://patches.dpdk.org/patch/45874/
>>>>>
>>>>> We may need to reach an agreement first.
>>>
>>> If we agree that 64K virtqueue won't be supported, I'm ok with either.
>>
>> I'm fine to put wrap_counter at bit 15.
>> I will post a new version of the DPDK series soon.
>>
>>> Btw the code assumes used_wrap_counter is equal to avail_wrap_counter 
>>> which looks wrong?
>>
>> For split ring, we used to set the last_used_idx to the same value as
>> last_avail_idx as VHOST_USER_GET_VRING_BASE cannot be called while the
>> ring is being processed, so their value is always the same at the time
>> the request is handled.
> 
> 
> I may miss something, but it looks to me we should sync last_used_idx 
> from used_idx.

Ok, so as proposed off-list by Jason, we could extend
VHOST_USER_[GET|SET]_VRING_BASE to have the following payload when
VIRTIO_F_RING_PACKED is negotiated:

Bit[0:14] avail index
Bit[15] avail wrap counter
Bit[16:30] used index
Bit[31] used wrap counter

Is everyone ok with that?

Another thing that I'd like to discuss is how do we reconnect in case of
user backend crash. When it happens, the frontend hasn't queried the
backend for last_avail_idx/last_used_idx and their wrap counters.

With split ring, when it happens, we set last_avail_idx to device's used
index (see virtio_queue_restore_last_avail_idx()).
Problem with packed ring is that wrap counters information is only in 
the backend.

Can we get device's used index and deduce the wrap counter value from
corresponding descriptor flag?

Any thoughts?

Regards,
Maxime
> Thanks
> 
> 
>>
>>
>> I kept the same behavior for packed ring, and so the wrap counter have
>> to be the same.
>>
>> Regards,
>> Maxime
>>
>>> Thanks
>>>
>>>>>
>>>>>> +        s.index = idx;
>>>>>> +        s.num = vq->last_used_idx;
>>>>>> +        if (vhost_has_feature(vq, VIRTIO_F_RING_PACKED))
>>>>>> +            s.num |= vq->last_used_wrap_counter << 31;
>>>>>>           if (copy_to_user(argp, &s, sizeof s))
>>>>>>               r = -EFAULT;
>>>>>>           break;
>>>>> [...]
>>>

^ permalink raw reply

* Re: netconsole warning in 4.19.0-rc7
From: Dave Jones @ 2018-10-17  3:50 UTC (permalink / raw)
  To: Cong Wang, Meelis Roos, LKML, Linux Kernel Network Developers
In-Reply-To: <20181017034047.nrhstc3b4wzevnri@codemonkey.org.uk>

On Tue, Oct 16, 2018 at 11:40:47PM -0400, Dave Jones wrote:

 >  > This is exactly what I mentioned in my review here:
 >  > https://marc.info/?l=linux-netdev&m=153816136624679&w=2
 >  > 
 >  > "But irq is disabled here, so not sure if rcu_read_lock_bh()
 >  > could cause trouble... "
 > 
 > Not sure why this didn't show up for me when I was developing that
 > patch, but I can now reproduce this.  The patch below fixes it for
 > me, but I'm not sure if there are still any side-effects.
 > There's also a missed unlock in the error path.

I took another look at that error path. Turns out this is all we need I
think..


diff --git a/net/core/netpoll.c b/net/core/netpoll.c
index de1d1ba92f2d..f9322d5db899 100644
--- a/net/core/netpoll.c
+++ b/net/core/netpoll.c
@@ -318,6 +318,7 @@ void netpoll_send_skb_on_dev(struct netpoll *np, struct sk_buff *skb,
 	npinfo = rcu_dereference_bh(np->dev->npinfo);
 	if (!npinfo || !netif_running(dev) || !netif_device_present(dev)) {
 		dev_kfree_skb_irq(skb);
+		rcu_read_unlock_bh();
 		return;
 	}
 

^ permalink raw reply related

* Re: [RFC] VSOCK: The performance problem of vhost_vsock.
From: jiangyiwen @ 2018-10-17 11:41 UTC (permalink / raw)
  To: Jason Wang, stefanha; +Cc: netdev, kvm, virtualization
In-Reply-To: <8edff784-b311-bfb6-1bf4-1970d564279d@redhat.com>

On 2018/10/17 17:51, Jason Wang wrote:
> 
> On 2018/10/17 下午5:39, Jason Wang wrote:
>>>>
>>> Hi Jason and Stefan,
>>>
>>> Maybe I find the reason of bad performance.
>>>
>>> I found pkt_len is limited to VIRTIO_VSOCK_DEFAULT_RX_BUF_SIZE(4K),
>>> it will cause the bandwidth is limited to 500~600MB/s. And once I
>>> increase to 64k, it can improve about 3 times(~1500MB/s).
>>
>>
>> Looks like the value was chosen for a balance between rx buffer size and performance. Allocating 64K always even for small packet is kind of waste and stress for guest memory. Virito-net try to avoid this by inventing the merge able rx buffer which allows big packet to be scattered in into different buffers. We can reuse this idea or revisit the idea of using virtio-net/vhost-net as a transport of vsock.
>>
>> What interesting is the performance is still behind vhost-net.
>>
>> Thanks
>>
>>>
>>> By the way, I send to 64K in application once, and I don't use
>>> sg_init_one and rewrite function to packet sg list because pkt_len
>>> include multiple pages.
>>>
>>> Thanks,
>>> Yiwen. 
> 
> 
> Btw, if you're using vsock for transferring large files, maybe it's more efficient to implement sendpage() for vsock to allow sendfile()/splice() work.
> 
> Thanks
>

I can't agree more.

why vhost_vsock is still behind vhost_net?
Because I use sendfile() to test performance at first, and then
I found vsock don't implement sendpage() and cause the bandwidth
can't be increased. So I use read() and send() to replace sendfile(),
it will increase some switch between kernel and user mode, and sendfile()
can support zero copy. I think this is main reason.

Thanks.

> 
> .
> 


_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [PATCH bpf-next 00/13] bpf: add btf func info support
From: Yonghong Song @ 2018-10-17  3:25 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Alexei Starovoitov, Martin Lau, daniel@iogearbox.net,
	netdev@vger.kernel.org, Kernel Team
In-Reply-To: <20181016182707.fb3cifqilasxbddv@ast-mbp.dhcp.thefacebook.com>



On 10/16/18 11:27 AM, Alexei Starovoitov wrote:
> On Fri, Oct 12, 2018 at 11:54:20AM -0700, Yonghong Song wrote:
>> The BTF support was added to kernel by Commit 69b693f0aefa
>> ("bpf: btf: Introduce BPF Type Format (BTF)"), which introduced
>> .BTF section into ELF file and is primarily
>> used for map pretty print.
>> pahole is used to convert dwarf to BTF for ELF files.
>>
>> The next step would be add func type info and debug line info
>> into BTF. For debug line info, it is desirable to encode
>> source code directly in the BTF to ease deployment and
>> introspection.
> 
> it's kinda confusing that cover letter talks about line info next step,
> but these kernel side patches are only for full prog name from btf.
> It certainly makes sense for llvm to do both at the same time.
> Please make the cover letter more clear.

Make sense. Will remove line_info stuff from the cover letter.

^ permalink raw reply

* Re: [PATCH bpf-next 05/13] bpf: get better bpf_prog ksyms based on btf func type_id
From: Yonghong Song @ 2018-10-17  3:24 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Alexei Starovoitov, Martin Lau, daniel@iogearbox.net,
	netdev@vger.kernel.org, Kernel Team
In-Reply-To: <20181016175901.vninfgdsjtizssvt@ast-mbp.dhcp.thefacebook.com>



On 10/16/18 10:59 AM, Alexei Starovoitov wrote:
> On Fri, Oct 12, 2018 at 11:54:42AM -0700, Yonghong Song wrote:
>> This patch added interface to load a program with the following
>> additional information:
>>     . prog_btf_fd
>>     . func_info and func_info_len
>> where func_info will provides function range and type_id
>> corresponding to each function.
>>
>> If verifier agrees with function range provided by the user,
>> the bpf_prog ksym for each function will use the func name
>> provided in the type_id, which is supposed to provide better
>> encoding as it is not limited by 16 bytes program name
>> limitation and this is better for bpf program which contains
>> multiple subprograms.
>>
>> The bpf_prog_info interface is also extended to
>> return btf_id and jited_func_types, so user spaces can
>> print out the function prototype for each jited function.
>>
>> Signed-off-by: Yonghong Song <yhs@fb.com>
> ...
>>   	BUILD_BUG_ON(sizeof("bpf_prog_") +
>>   		     sizeof(prog->tag) * 2 +
>> @@ -401,6 +403,13 @@ static void bpf_get_prog_name(const struct bpf_prog *prog, char *sym)
>>   
>>   	sym += snprintf(sym, KSYM_NAME_LEN, "bpf_prog_");
>>   	sym  = bin2hex(sym, prog->tag, sizeof(prog->tag));
>> +
>> +	if (prog->aux->btf) {
>> +		func_name = btf_get_name_by_id(prog->aux->btf, prog->aux->type_id);
>> +		snprintf(sym, (size_t)(end - sym), "_%s", func_name);
>> +		return;
> 
> Would it make sense to add a comment here that prog->aux->name is ignored
> when full btf name is available? (otherwise the same name will appear twice in ksym)

Will add a comment.

> 
>> +	}
>> +
>>   	if (prog->aux->name[0])
>>   		snprintf(sym, (size_t)(end - sym), "_%s", prog->aux->name);
> ...
>> +static int check_btf_func(struct bpf_prog *prog, struct bpf_verifier_env *env,
>> +			  union bpf_attr *attr)
>> +{
>> +	struct bpf_func_info *data;
>> +	int i, nfuncs, ret = 0;
>> +
>> +	if (!attr->func_info_len)
>> +		return 0;
>> +
>> +	nfuncs = attr->func_info_len / sizeof(struct bpf_func_info);
>> +	if (env->subprog_cnt != nfuncs) {
>> +		verbose(env, "number of funcs in func_info does not match verifier\n");
> 
> 'does not match verifier' is hard to make sense of.
> How about 'number of funcs in func_info doesn't match number of subprogs' ?

Sounds good to me.

> 
>> +		return -EINVAL;
>> +	}
>> +
>> +	data = kvmalloc(attr->func_info_len, GFP_KERNEL | __GFP_NOWARN);
>> +	if (!data) {
>> +		verbose(env, "no memory to allocate attr func_info\n");
> 
> I don't think we ever print such warnings for memory allocations.
> imo this can be removed, since enomem is enough.

Okay.

> 
>> +		return -ENOMEM;
>> +	}
>> +
>> +	if (copy_from_user(data, u64_to_user_ptr(attr->func_info),
>> +			   attr->func_info_len)) {
>> +		verbose(env, "memory copy error for attr func_info\n");
> 
> similar thing. kernel never warns about copy_from_user errors.

Okay.

> 
>> +		ret = -EFAULT;
>> +		goto cleanup;
>> +		}
>> +
>> +	for (i = 0; i < nfuncs; i++) {
>> +		if (env->subprog_info[i].start != data[i].insn_offset) {
>> +			verbose(env, "func_info subprog start (%d) does not match verifier (%d)\n",
>> +				env->subprog_info[i].start, data[i].insn_offset);
> 
> I think printing exact insn offset isn't going to be much help
> for regular user to debug it. If this happens, it's likely llvm issue.
> How about 'func_info BTF section doesn't match subprog layout in BPF program' ?

Okay.

> 

^ permalink raw reply

* Re: [PATCH bpf-next 02/13] bpf: btf: Add BTF_KIND_FUNC and BTF_KIND_FUNC_PROTO
From: Yonghong Song @ 2018-10-17  3:22 UTC (permalink / raw)
  To: Daniel Borkmann, Alexei Starovoitov, Martin Lau,
	netdev@vger.kernel.org
  Cc: Kernel Team
In-Reply-To: <0f3f22dd-1f31-ac13-7bde-31d727ca523d@iogearbox.net>



On 10/15/18 3:36 PM, Daniel Borkmann wrote:
> On 10/12/2018 08:54 PM, Yonghong Song wrote:
> [...]
>> +static bool btf_name_valid_identifier(const struct btf *btf, u32 offset)
>> +{
>> +	/* offset must be valid */
>> +	const char *src = &btf->strings[offset];
>> +
>> +	if (!isalpha(*src) && *src != '_')
>> +		return false;
>> +
>> +	src++;
>> +	while (*src) {
>> +		if (!isalnum(*src) && *src != '_')
>> +			return false;
>> +		src++;
>> +	}
>> +
>> +	return true;
>> +}
> 
> Should there be an upper name length limit like KSYM_NAME_LEN? (Is it implied
> by the kvmalloc() limit?)

KSYM_NAME_LEN is good choice. Here, we check function names and 
struct/union member names. In C, based on
https://stackoverflow.com/questions/2352209/max-identifier-length,
the identifier max length is 63. Some compiler implementation may vary.
KSYM_NAME_LEN is 128.

> 
>>   static const char *btf_name_by_offset(const struct btf *btf, u32 offset)
>>   {
>>   	if (!offset)
>> @@ -747,7 +782,9 @@ static bool env_type_is_resolve_sink(const struct btf_verifier_env *env,
>>   		/* int, enum or void is a sink */
>>   		return !btf_type_needs_resolve(next_type);
>>   	case RESOLVE_PTR:
>> -		/* int, enum, void, struct or array is a sink for ptr */
>> +		/* int, enum, void, struct, array or func_ptoto is a sink
>> +		 * for ptr
>> +		 */
>>   		return !btf_type_is_modifier(next_type) &&
>>   			!btf_type_is_ptr(next_type);
>>   	case RESOLVE_STRUCT_OR_ARRAY:

^ permalink raw reply

* Re: [PATCH bpf-next 02/13] bpf: btf: Add BTF_KIND_FUNC and BTF_KIND_FUNC_PROTO
From: Yonghong Song @ 2018-10-17  3:11 UTC (permalink / raw)
  To: Daniel Borkmann, Alexei Starovoitov, Martin Lau,
	netdev@vger.kernel.org
  Cc: Kernel Team
In-Reply-To: <3a4bc2f7-9797-ccbe-d38f-c2d3ce662889@iogearbox.net>



On 10/15/18 3:30 PM, Daniel Borkmann wrote:
> On 10/12/2018 08:54 PM, Yonghong Song wrote:
>> This patch adds BTF_KIND_FUNC and BTF_KIND_FUNC_PROTO
>> support to the type section. BTF_KIND_FUNC_PROTO is used
>> to specify the type of a function pointer. With this,
>> BTF has a complete set of C types (except float).
>>
>> BTF_KIND_FUNC is used to specify the signature of a
>> defined subprogram. BTF_KIND_FUNC_PROTO can be referenced
>> by another type, e.g., a pointer type, and BTF_KIND_FUNC
>> type cannot be referenced by another type.
>>
>> For both BTF_KIND_FUNC and BTF_KIND_FUNC_PROTO types,
>> the func return type is in t->type (where t is a
>> "struct btf_type" object). The func args are an array of
>> u32s immediately following object "t".
>>
>> As a concrete example, for the C program below,
>>    $ cat test.c
>>    int foo(int (*bar)(int)) { return bar(5); }
>> with latest llvm trunk built with Debug mode, we have
>>    $ clang -target bpf -g -O2 -mllvm -debug-only=btf -c test.c
>>    Type Table:
>>    [1] FUNC name_off=1 info=0x0c000001 size/type=2
>>            param_type=3
>>    [2] INT name_off=11 info=0x01000000 size/type=4
>>            desc=0x01000020
>>    [3] PTR name_off=0 info=0x02000000 size/type=4
>>    [4] FUNC_PROTO name_off=0 info=0x0d000001 size/type=2
>>            param_type=2
>>
>>    String Table:
>>    0 :
>>    1 : foo
>>    5 : .text
>>    11 : int
>>    15 : test.c
>>    22 : int foo(int (*bar)(int)) { return bar(5); }
>>
>>    FuncInfo Table:
>>    sec_name_off=5
>>            insn_offset=<Omitted> type_id=1
>>
>>    ...
>>
>> (Eventually we shall have bpftool to dump btf information
>>   like the above.)
>>
>> Function "foo" has a FUNC type (type_id = 1).
>> The parameter of "foo" has type_id 3 which is PTR->FUNC_PROTO,
>> where FUNC_PROTO refers to function pointer "bar".
> 
> Should also "bar" be part of the string table (at least at some point in future)?

Yes, we can do it. The dwarf for the abovee example looks like

0x00000043:     DW_TAG_formal_parameter
                   DW_AT_location        (0x00000000
                      [0x0000000000000000,  0x0000000000000008): 
DW_OP_reg1 W1
                      [0x0000000000000008,  0x0000000000000018): 
DW_OP_reg2 W2)
                   DW_AT_name    ("bar")
                   DW_AT_decl_file       ("/home/yhs/tmp/t.c")
                   DW_AT_decl_line       (1)
                   DW_AT_type    (0x0000005a "subroutine int*")

0x0000005a:   DW_TAG_pointer_type
                 DW_AT_type      (0x0000005f "subroutine int")

0x0000005f:   DW_TAG_subroutine_type
                 DW_AT_type      (0x00000053 "int")
                 DW_AT_prototyped        (true)

0x00000064:     DW_TAG_formal_parameter
                   DW_AT_type    (0x00000053 "int")

0x00000069:     NULL

0x0000006a:   NULL

The current llvm implementation does not record func
parameter name, so "bar" got lost. We could associate
"bar" with pointer type in the future implementation.

> Iow, if verifier hints to an issue in the program when it would for example walk
> pointers and rewrite ctx access, then it could dump the var name along with it.
> It might be useful as well in combination with 22 from str table, when annotating
> the source. We might need support for variadic functions, though. How is LLVM
> handling the latter with the recent BTF support?

The LLVM implementation does support variadic functions.
The last type id 0 indicates a variadic function.

> 
>> In FuncInfo Table, for section .text, the function,
>> with to-be-determined offset (marked as <Omitted>),
>> has type_id=1 which refers to a FUNC type.
>> This way, the function signature is
>> available to both kernel and user space.
>> Here, the insn offset is not available during the dump time
>> as relocation is resolved pretty late in the compilation process.
>>
>> Signed-off-by: Martin KaFai Lau <kafai@fb.com>
>> Signed-off-by: Yonghong Song <yhs@fb.com>

^ permalink raw reply

* Re: [PATCH net] r8169: fix NAPI handling under high load
From: Florian Fainelli @ 2018-10-17  3:10 UTC (permalink / raw)
  To: Eric Dumazet, Stephen Hemminger, Holger Hoffstätte
  Cc: Heiner Kallweit, David Miller, Realtek linux nic maintainers,
	netdev@vger.kernel.org
In-Reply-To: <f22a01e4-a8d4-7358-00e8-89fd8401976f@gmail.com>



On 10/16/2018 5:23 PM, Eric Dumazet wrote:
> 
> 
> On 10/16/2018 04:08 PM, Florian Fainelli wrote:
> 
>> I had started doing that about a month ago in light of the ixbge
>> ndo_poll_controller vs. napi problem, but have not had time to submit
>> that series yet:
>>
>> https://github.com/ffainelli/linux/commits/napi-check
>>
>> feel free to piggy back on top of that series if you would like to
>> address this.
> 
> 
> But the root cause was different, you remember ?
> 
> We fixed the real issue with netpoll ability to stick all NIC IRQ onto one
> victim CPU.

I do remember, after seeing the resolution I kind of decided to leave
this in a branch but not submit it because it was not particularly
relevant anymore.
-- 
Florian

^ permalink raw reply

* Re: [PATCH net-next v5] net/ncsi: Add NCSI Broadcom OEM command
From: Samuel Mendoza-Jonas @ 2018-10-17  2:50 UTC (permalink / raw)
  To: Vijay Khemka, David S. Miller, netdev, linux-kernel
  Cc: openbmc @ lists . ozlabs . org, Justin.Lee1, joel, linux-aspeed
In-Reply-To: <20181016191319.1909502-1-vijaykhemka@fb.com>

On Tue, 2018-10-16 at 12:13 -0700, Vijay Khemka wrote:
> This patch adds OEM Broadcom commands and response handling. It also
> defines OEM Get MAC Address handler to get and configure the device.
> 
> ncsi_oem_gma_handler_bcm: This handler send NCSI broadcom command for
> getting mac address.
> ncsi_rsp_handler_oem_bcm: This handles response received for all
> broadcom OEM commands.
> ncsi_rsp_handler_oem_bcm_gma: This handles get mac address response and
> set it to device.
> 
> Signed-off-by: Vijay Khemka <vijaykhemka@fb.com>

Reviewed-by: Samuel Mendoza-Jonas <sam@mendozajonas.com>

^ permalink raw reply

* [PATCH v3] virtio_net: avoid using netif_tx_disable() for serializing tx routine
From: Ake Koomsin @ 2018-10-17 10:44 UTC (permalink / raw)
  To: Jason Wang
  Cc: ake, Michael S. Tsirkin, David S. Miller, virtualization, netdev,
	linux-kernel
In-Reply-To: <276ac5bb-90f7-5fb6-a826-0bb05ecfa069@redhat.com>

Commit 713a98d90c5e ("virtio-net: serialize tx routine during reset")
introduces netif_tx_disable() after netif_device_detach() in order to
avoid use-after-free of tx queues. However, there are two issues.

1) Its operation is redundant with netif_device_detach() in case the
   interface is running.
2) In case of the interface is not running before suspending and
   resuming, the tx does not get resumed by netif_device_attach().
   This results in losing network connectivity.

It is better to use netif_tx_lock_bh()/netif_tx_unlock_bh() instead for
serializing tx routine during reset. This also preserves the symmetry
of netif_device_detach() and netif_device_attach().

Fixes commit 713a98d90c5e ("virtio-net: serialize tx routine during reset")
Signed-off-by: Ake Koomsin <ake@igel.co.jp>
---
 drivers/net/virtio_net.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 3f5aa59c37b7..3e2c041d76ac 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -2267,8 +2267,9 @@ static void virtnet_freeze_down(struct virtio_device *vdev)
 	/* Make sure no work handler is accessing the device */
 	flush_work(&vi->config_work);

+	netif_tx_lock_bh(vi->dev);
 	netif_device_detach(vi->dev);
-	netif_tx_disable(vi->dev);
+	netif_tx_unlock_bh(vi->dev);
 	cancel_delayed_work_sync(&vi->refill);

 	if (netif_running(vi->dev)) {
@@ -2304,7 +2305,9 @@ static int virtnet_restore_up(struct virtio_device *vdev)
 		}
 	}

+	netif_tx_lock_bh(vi->dev);
 	netif_device_attach(vi->dev);
+	netif_tx_unlock_bh(vi->dev);
 	return err;
 }

-- 
2.19.1

^ permalink raw reply related

* Re: [PATCH bpf-next v2 3/7] bpf: add MAP_LOOKUP_AND_DELETE_ELEM syscall
From: Mauricio Vasquez @ 2018-10-17  2:29 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Song Liu, Alexei Starovoitov, Daniel Borkmann, Networking
In-Reply-To: <20181016232006.rmvjgglbybf4ypwk@ast-mbp.dhcp.thefacebook.com>



On 10/16/2018 06:20 PM, Alexei Starovoitov wrote:
> On Tue, Oct 16, 2018 at 04:16:39PM -0500, Mauricio Vasquez wrote:
>>
>> On 10/11/2018 06:51 PM, Alexei Starovoitov wrote:
>>> On Wed, Oct 10, 2018 at 05:50:01PM -0500, Mauricio Vasquez wrote:
>>>>>> Does it make sense to you?
>>>>> I reread the other patch, and found it does NOT use the following logic for
>>>>> queue and stack:
>>>>>
>>>>>                   rcu_read_lock();
>>>>>                   ptr = map->ops->map_lookup_and_delete_elem(map, key);
>>>>>                   if (ptr)
>>>>>                           memcpy(value, ptr, value_size);
>>>>>
>>>>> I guess this part is not used at all? Can we just remove it?
>>>>>
>>>>> Thanks,
>>>>> Song
>>>> This is the base code for map_lookup_and_delete support, it is not used in
>>>> queue/stack maps.
>>>>
>>>> I think we can leave it there, so when somebody implements lookup_and_delete
>>>> for other maps doesn't have to care about implementing also this.
>>> The code looks useful to me, but I also agree with Song. And in the kernel
>>> we don't typically add 'almost dead code'.
>>> May be provide an implementation of the lookup_and_delete for hash map
>>> so it's actually used ?
>> I haven't written any code but I think there is a potential problem here.
>> Current lookup_and_delete returns a pointer to the element, hence deletion
>> of the element should be done using call_rcu to guarantee this is valid
>> after returning.
>> In the hashtab, the deletion only uses call_rcu when there is not prealloc,
>> otherwise the element is pushed on the list of free elements immediately.
>> If we move the logic to push elements into the free list under a call_rcu
>> invocation, it could happen that the free list is empty because the call_rcu
>> is still pending (a similar issue we had with the queue/stack maps when they
>> used a pass by reference API).
>>
>> There is another way to implement it without this issue, in syscall.c:
>> l = ops->lookup(key);
>> memcpy(l, some_buffer)
>> ops->delete(key)
>>
>> The point here is that the lookup_and_delete operation is not being used at
>> all, so, is lookup + delete = lookup_and_delete?, can it be generalized?
>> If this is true, then what is the sense of having lookup_and_delete
>> syscall?,
> I though of lookup_and_delete command as atomic operation.
> Only in such case it would make sense.
> Otherwise there is no point in having additional cmd.
> In case of hash map the implementation would be similar to delete:
>    raw_spin_lock_irqsave(&b->lock, flags);
>    l = lookup_elem_raw(head, hash, key, key_size);
>    if (l) {
>            hlist_nulls_del_rcu(&l->hash_node);
>            bpf_long_memcpy(); // into temp kernel area
>            free_htab_elem(htab, l);
>            ret = 0;
>    }
>    raw_spin_unlock_irqrestore(&b->lock, flags);
>    copy_to_user();

Well, this is new approach, currently operations have no enough info to 
perform the copy_to_user directly, btw, is there any technical reason 
why a double mem copy is done? (from the map value into a temp kernel 
buffer and then to userspace?)

>
> there is a chance that some other cpu is doing lookup in parallel
> and may be modifying value, so bpf_long_mempcy() isn't fully atomic.

I think we already have that case, if an eBPF program is updating the 
map, a lookup from userspace could see a partially updated value.
> But bpf side is written together with user space side,
> so above almost-atomic lookup_and_delete is usable instead
> of lookup and then delete which races too much.
>
> Having said that I think it's fine to defer this new ndo for now
> and leave lookup_and_delete syscall cmd for stack/queue map only.
>
I agree, just a question, should we remove the "almost dead code" or 
leave it there?

Thanks,
Mauricio.

^ permalink raw reply

* Re: [RFC] VSOCK: The performance problem of vhost_vsock.
From: jiangyiwen @ 2018-10-17  9:27 UTC (permalink / raw)
  To: Jason Wang, stefanha; +Cc: netdev, kvm, virtualization
In-Reply-To: <5BC42FD8.2070104@huawei.com>

On 2018/10/15 14:12, jiangyiwen wrote:
> On 2018/10/15 10:33, Jason Wang wrote:
>>
>>
>> On 2018年10月15日 09:43, jiangyiwen wrote:
>>> Hi Stefan & All:
>>>
>>> Now I find vhost-vsock has two performance problems even if it
>>> is not designed for performance.
>>>
>>> First, I think vhost-vsock should faster than vhost-net because it
>>> is no TCP/IP stack, but the real test result vhost-net is 5~10
>>> times than vhost-vsock, currently I am looking for the reason.
>>
>> TCP/IP is not a must for vhost-net.
>>
>> How do you test and compare the performance?
>>
>> Thanks
>>
> 
> I test the performance used my test tool, like follows:
> 
> Server                   Client
> socket()
> bind()
> listen()
> 
>                          socket(AF_VSOCK) or socket(AF_INET)
> Accept() <-------------->connect()
>                          *======Start Record Time======*
>                          Call syscall sendfile()
> Recv()
>                          Send end
> Receive end
> Send(file_size)
>                          Recv(file_size)
>                          *======End Record Time======*
> 
> The test result, vhost-vsock is about 500MB/s, and vhost-net is about 2500MB/s.
> 
> By the way, vhost-net use single queue.
> 
> Thanks.
> 
>>> Second, vhost-vsock only supports two vqs(tx and rx), that means
>>> if multiple sockets in the guest will use the same vq to transmit
>>> the message and get the response. So if there are multiple applications
>>> in the guest, we should support "Multiqueue" feature for Virtio-vsock.
>>>
>>> Stefan, have you encountered these problems?
>>>
>>> Thanks,
>>> Yiwen.
>>>
>>
>>
>> .
>>
> 
> 

Hi Jason and Stefan,

Maybe I find the reason of bad performance.

I found pkt_len is limited to VIRTIO_VSOCK_DEFAULT_RX_BUF_SIZE(4K),
it will cause the bandwidth is limited to 500~600MB/s. And once I
increase to 64k, it can improve about 3 times(~1500MB/s).

By the way, I send to 64K in application once, and I don't use
sg_init_one and rewrite function to packet sg list because pkt_len
include multiple pages.

Thanks,
Yiwen.

> _______________________________________________
> Virtualization mailing list
> Virtualization@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/virtualization
> 


_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [RFC] virtio_net: add local_bh_disable() around u64_stats_update_begin
From: Toshiaki Makita @ 2018-10-17  1:13 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior, netdev, virtualization
  Cc: tglx, Michael S. Tsirkin, Jason Wang, David S. Miller
In-Reply-To: <20181016165545.guksrl23ulcudxrk@linutronix.de>

On 2018/10/17 1:55, Sebastian Andrzej Siewior wrote:
> on 32bit, lockdep notices:
> | ================================
> | WARNING: inconsistent lock state
> | 4.19.0-rc8+ #9 Tainted: G        W
> | --------------------------------
> | inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
> | ip/1106 [HC0[0]:SC1[1]:HE1:SE0] takes:
> | (ptrval) (&syncp->seq#2){+.?.}, at: net_rx_action+0xc8/0x380
> | {SOFTIRQ-ON-W} state was registered at:
> |   lock_acquire+0x7e/0x170
> |   try_fill_recv+0x5fa/0x700
> |   virtnet_open+0xe0/0x180
> |   __dev_open+0xae/0x130
> |   __dev_change_flags+0x17f/0x200
> |   dev_change_flags+0x23/0x60
> |   do_setlink+0x2bb/0xa20
> |   rtnl_newlink+0x523/0x830
> |   rtnetlink_rcv_msg+0x14b/0x470
> |   netlink_rcv_skb+0x6e/0xf0
> |   rtnetlink_rcv+0xd/0x10
> |   netlink_unicast+0x16e/0x1f0
> |   netlink_sendmsg+0x1af/0x3a0
> |   ___sys_sendmsg+0x20f/0x240
> |   __sys_sendmsg+0x39/0x80
> |   sys_socketcall+0x13a/0x2a0
> |   do_int80_syscall_32+0x50/0x180
> |   restore_all+0x0/0xb2
> | irq event stamp: 3326
> | hardirqs last  enabled at (3326): [<c159e6d0>] net_rx_action+0x80/0x380
> | hardirqs last disabled at (3325): [<c159e6aa>] net_rx_action+0x5a/0x380
> | softirqs last  enabled at (3322): [<c14b440d>] virtnet_napi_enable+0xd/0x60
> | softirqs last disabled at (3323): [<c101d63d>] call_on_stack+0xd/0x50
> |
> | other info that might help us debug this:
> |  Possible unsafe locking scenario:
> |
> |        CPU0
> |        ----
> |   lock(&syncp->seq#2);
> |   <Interrupt>
> |     lock(&syncp->seq#2);
> |
> |  *** DEADLOCK ***

IIUC try_fill_recv is called only when NAPI is disabled from process
context, so there should be no point to race with virtnet_receive which
is called from NAPI handler.

I'm not sure what condition triggered this warning.


Toshiaki Makita


> 
> This is the "up" path which is not a hotpath. There is also
> refill_work().
> It might be unwise to add the local_bh_disable() to try_fill_recv()
> because if it is used mostly in BH so that local_bh_en+dis might be a
> waste of cycles.
> 
> Adding local_bh_disable() around try_fill_recv() for the non-BH call
> sites would render GFP_KERNEL pointless.
> 
> Also, ptr->var++ is not an atomic operation even on 64bit CPUs. Which
> means if try_fill_recv() runs on CPU0 (via virtnet_receive()) then the
> worker might run on CPU1.
> 
> Do we care or is this just stupid stats?  Any suggestions?
> 
> This warning appears since commit 461f03dc99cf6 ("virtio_net: Add kick stats").
> 
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> ---
>  drivers/net/virtio_net.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index dab504ec5e502..d782160cfa882 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -1206,9 +1206,11 @@ static bool try_fill_recv(struct virtnet_info *vi, struct receive_queue *rq,
>  			break;
>  	} while (rq->vq->num_free);
>  	if (virtqueue_kick_prepare(rq->vq) && virtqueue_notify(rq->vq)) {
> +		local_bh_disable();
>  		u64_stats_update_begin(&rq->stats.syncp);
>  		rq->stats.kicks++;
>  		u64_stats_update_end(&rq->stats.syncp);
> +		local_bh_enable();
>  	}
>  
>  	return !oom;
> 

^ permalink raw reply

* Re: [PATCH v2] virtio_net: avoid using netif_tx_disable() for serializing tx routine
From: Jason Wang @ 2018-10-17  9:02 UTC (permalink / raw)
  To: Ake Koomsin
  Cc: Michael S. Tsirkin, David S. Miller, virtualization, netdev,
	linux-kernel
In-Reply-To: <20181017075918.7043-1-ake@igel.co.jp>


On 2018/10/17 下午3:59, Ake Koomsin wrote:
> Commit 713a98d90c5e ("virtio-net: serialize tx routine during reset")
> introduces netif_tx_disable() after netif_device_detach() in order to
> avoid use-after-free of tx queues. However, there are two issues.
>
> 1) Its operation is redundant with netif_device_detach() if the interface
>     is running.
> 2) In case of the interface is not running before suspending and
>     resuming, the tx does not get resumed by netif_device_attach().
>     This results in losing network connectivity.
>
> It is better to use netif_tx_lock()/netif_tx_unlock() instead for
> serializing tx routine during reset. This also preserves the symmetry
> of netif_device_detach() and netif_device_attach().
>
> Fixes commit 713a98d90c5e ("virtio-net: serialize tx routine during reset")
> Signed-off-by: Ake Koomsin <ake@igel.co.jp>
> ---
>   drivers/net/virtio_net.c | 5 ++++-
>   1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 3f5aa59c37b7..41ccf9c994a4 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -2267,8 +2267,9 @@ static void virtnet_freeze_down(struct virtio_device *vdev)
>   	/* Make sure no work handler is accessing the device */
>   	flush_work(&vi->config_work);
>   
> +	netif_tx_lock(vi->dev);
>   	netif_device_detach(vi->dev);
> -	netif_tx_disable(vi->dev);
> +	netif_tx_unlock(vi->dev);


Sorry for not finding this earlier. I think we should use 
netif_tx_lock_bh() to prevent start_xmit() to run under bh.

Thanks


>   	cancel_delayed_work_sync(&vi->refill);
>   
>   	if (netif_running(vi->dev)) {
> @@ -2304,7 +2305,9 @@ static int virtnet_restore_up(struct virtio_device *vdev)
>   		}
>   	}
>   
> +	netif_tx_lock(vi->dev);
>   	netif_device_attach(vi->dev);
> +	netif_tx_unlock(vi->dev);
>   	return err;
>   }
>   

^ permalink raw reply

* Re: [bpf-next PATCH] bpf: sockmap, fix skmsg recvmsg handler to track size correctly
From: Daniel Borkmann @ 2018-10-17  0:32 UTC (permalink / raw)
  To: John Fastabend, ast; +Cc: netdev
In-Reply-To: <20181016173601.9211.1573.stgit@john-Precision-Tower-5810>

On 10/16/2018 07:36 PM, John Fastabend wrote:
> When converting sockmap to new skmsg generic data structures we missed
> that the recvmsg handler did not correctly use sg.size and instead was
> using individual elements length. The result is if a sock is closed
> with outstanding data we omit the call to sk_mem_uncharge() and can
> get the warning below.
> 
> [   66.728282] WARNING: CPU: 6 PID: 5783 at net/core/stream.c:206 sk_stream_kill_queues+0x1fa/0x210
> 
> To fix this correct the redirect handler to xfer the size along with
> the scatterlist and also decrement the size from the recvmsg handler.
> Now when a sock is closed the remaining 'size' will be decremented
> with sk_mem_uncharge().
> 
> Signed-off-by: John Fastabend <john.fastabend@gmail.com>

Applied to bpf-next, thanks!

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox