* [PATCH bpf-next 2/2] bpf: add tests for direct packet access from CGROUP_SKB
From: Song Liu @ 2018-10-17 5:56 UTC (permalink / raw)
To: netdev; +Cc: ast, daniel, kernel-team, Song Liu
In-Reply-To: <20181017055606.353449-1-songliubraving@fb.com>
Tests are added to make sure CGROUP_SKB can directly access len, data,
and data_end in __sk_buff, but not other fields.
Signed-off-by: Song Liu <songliubraving@fb.com>
---
tools/testing/selftests/bpf/test_verifier.c | 30 +++++++++++++++++++++
1 file changed, 30 insertions(+)
diff --git a/tools/testing/selftests/bpf/test_verifier.c b/tools/testing/selftests/bpf/test_verifier.c
index cf4cd32b6772..aaf2ceba83dd 100644
--- a/tools/testing/selftests/bpf/test_verifier.c
+++ b/tools/testing/selftests/bpf/test_verifier.c
@@ -4862,6 +4862,36 @@ static struct bpf_test tests[] = {
.result = REJECT,
.flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
},
+ {
+ "direct packet read for CGROUP_SKB",
+ .insns = {
+ BPF_LDX_MEM(BPF_W, BPF_REG_2, BPF_REG_1,
+ offsetof(struct __sk_buff, data)),
+ BPF_LDX_MEM(BPF_W, BPF_REG_3, BPF_REG_1,
+ offsetof(struct __sk_buff, data_end)),
+ BPF_LDX_MEM(BPF_W, BPF_REG_4, BPF_REG_1,
+ offsetof(struct __sk_buff, len)),
+ BPF_MOV64_REG(BPF_REG_0, BPF_REG_2),
+ BPF_ALU64_IMM(BPF_ADD, BPF_REG_0, 8),
+ BPF_JMP_REG(BPF_JGT, BPF_REG_0, BPF_REG_3, 1),
+ BPF_LDX_MEM(BPF_B, BPF_REG_0, BPF_REG_2, 0),
+ BPF_MOV64_IMM(BPF_REG_0, 0),
+ BPF_EXIT_INSN(),
+ },
+ .result = ACCEPT,
+ .prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+ },
+ {
+ "invalid access of tc_classid for CGROUP_SKB",
+ .insns = {
+ BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1,
+ offsetof(struct __sk_buff, tc_classid)),
+ BPF_EXIT_INSN(),
+ },
+ .result = REJECT,
+ .errstr = "invalid bpf_context access",
+ .prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+ },
{
"valid cgroup storage access",
.insns = {
--
2.17.1
^ permalink raw reply related
* [PATCH bpf-next 0/2] bpf: add cg_skb_is_valid_access
From: Song Liu @ 2018-10-17 5:56 UTC (permalink / raw)
To: netdev; +Cc: ast, daniel, kernel-team, Song Liu
This set enables BPF program of type BPF_PROG_TYPE_CGROUP_SKB to access
__skb_buff->len/data/data_end directly.
Song Liu (2):
bpf: add cg_skb_is_valid_access for BPF_PROG_TYPE_CGROUP_SKB
bpf: add tests for direct packet access from CGROUP_SKB
kernel/bpf/cgroup.c | 4 +++
net/core/filter.c | 26 +++++++++++++++++-
tools/testing/selftests/bpf/test_verifier.c | 30 +++++++++++++++++++++
3 files changed, 59 insertions(+), 1 deletion(-)
^ permalink raw reply
* Re: netconsole warning in 4.19.0-rc7
From: Cong Wang @ 2018-10-17 5:16 UTC (permalink / raw)
To: Dave Jones, Meelis Roos, LKML, Linux Kernel Network Developers
In-Reply-To: <20181017035038.czaqown24rnjn2pw@codemonkey.org.uk>
On Tue, Oct 16, 2018 at 8:50 PM Dave Jones <davej@codemonkey.org.uk> wrote:
>
> On Tue, Oct 16, 2018 at 11:40:47PM -0400, Dave Jones wrote:
>
> > > This is exactly what I mentioned in my review here:
> > > https://marc.info/?l=linux-netdev&m=153816136624679&w=2
> > >
> > > "But irq is disabled here, so not sure if rcu_read_lock_bh()
> > > could cause trouble... "
> >
> > Not sure why this didn't show up for me when I was developing that
> > patch, but I can now reproduce this. The patch below fixes it for
> > me, but I'm not sure if there are still any side-effects.
> > There's also a missed unlock in the error path.
>
> I took another look at that error path. Turns out this is all we need I
> think..
Double check if rcu_dereference_bh() is happy with it.
Don't trust me, enable the relevant Kconfig and trust your test. ;)
^ permalink raw reply
* [PATCH net-next] net: skbuff.h: Mark expected switch fall-throughs
From: Gustavo A. R. Silva @ 2018-10-17 13:01 UTC (permalink / raw)
To: David S. Miller, Kees Cook; +Cc: linux-kernel, netdev, Gustavo A. R. Silva
In preparation to enabling -Wimplicit-fallthrough, mark switch cases
where we are expecting to fall through.
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
---
include/linux/skbuff.h | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 119d092..0ba6874 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -3505,13 +3505,19 @@ static inline bool __skb_metadata_differs(const struct sk_buff *skb_a,
#define __it(x, op) (x -= sizeof(u##op))
#define __it_diff(a, b, op) (*(u##op *)__it(a, op)) ^ (*(u##op *)__it(b, op))
case 32: diffs |= __it_diff(a, b, 64);
+ /* fall through */
case 24: diffs |= __it_diff(a, b, 64);
+ /* fall through */
case 16: diffs |= __it_diff(a, b, 64);
+ /* fall through */
case 8: diffs |= __it_diff(a, b, 64);
break;
case 28: diffs |= __it_diff(a, b, 64);
+ /* fall through */
case 20: diffs |= __it_diff(a, b, 64);
+ /* fall through */
case 12: diffs |= __it_diff(a, b, 64);
+ /* fall through */
case 4: diffs |= __it_diff(a, b, 32);
break;
}
--
2.7.4
^ permalink raw reply related
* Re: [PATCH v3] virtio_net: avoid using netif_tx_disable() for serializing tx routine
From: Jason Wang @ 2018-10-17 12:30 UTC (permalink / raw)
To: Ake Koomsin
Cc: netdev, virtualization, David S. Miller, linux-kernel,
Michael S. Tsirkin
In-Reply-To: <20181017104419.13003-1-ake@igel.co.jp>
On 2018/10/17 下午6:44, Ake Koomsin wrote:
> Commit 713a98d90c5e ("virtio-net: serialize tx routine during reset")
> introduces netif_tx_disable() after netif_device_detach() in order to
> avoid use-after-free of tx queues. However, there are two issues.
>
> 1) Its operation is redundant with netif_device_detach() in case the
> interface is running.
> 2) In case of the interface is not running before suspending and
> resuming, the tx does not get resumed by netif_device_attach().
> This results in losing network connectivity.
>
> It is better to use netif_tx_lock_bh()/netif_tx_unlock_bh() instead for
> serializing tx routine during reset. This also preserves the symmetry
> of netif_device_detach() and netif_device_attach().
>
> Fixes commit 713a98d90c5e ("virtio-net: serialize tx routine during reset")
> Signed-off-by: Ake Koomsin <ake@igel.co.jp>
> ---
> drivers/net/virtio_net.c | 5 ++++-
> 1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 3f5aa59c37b7..3e2c041d76ac 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -2267,8 +2267,9 @@ static void virtnet_freeze_down(struct virtio_device *vdev)
> /* Make sure no work handler is accessing the device */
> flush_work(&vi->config_work);
>
> + netif_tx_lock_bh(vi->dev);
> netif_device_detach(vi->dev);
> - netif_tx_disable(vi->dev);
> + netif_tx_unlock_bh(vi->dev);
> cancel_delayed_work_sync(&vi->refill);
>
> if (netif_running(vi->dev)) {
> @@ -2304,7 +2305,9 @@ static int virtnet_restore_up(struct virtio_device *vdev)
> }
> }
>
> + netif_tx_lock_bh(vi->dev);
> netif_device_attach(vi->dev);
> + netif_tx_unlock_bh(vi->dev);
> return err;
> }
>
Acked-by: Jason Wang <jasowang@redhat.com>
Thanks
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* Application of f8b39039cbf2a15f ("net: fs_enet: do not call phy_stop() in interrupts") to 4.9 and 4.14 stable
From: Christophe LEROY @ 2018-10-17 12:08 UTC (permalink / raw)
To: David Miller, netdev@vger.kernel.org; +Cc: Greg KH, stable@vger.kernel.org
Hi,
Could you please apply f8b39039cbf2a15f2b8c9f081e1cbd5dee00aaf5 to 4.9
and 4.14 ?
It fixes an Oops when Ethernet transmission times out.
As you can observe in the commit log, the Oops what initially observed
in 4.9.
The fix has been in mainline since 4.15
Thanks
Christophe
^ permalink raw reply
* Re: [PATCH net-next V2 6/8] vhost: packed ring support
From: Maxime Coquelin @ 2018-10-17 12:02 UTC (permalink / raw)
To: Jason Wang, Michael S. Tsirkin, Tiwei Bie
Cc: kvm, virtualization, netdev, linux-kernel, wexu, jfreimann
In-Reply-To: <0f3827e5-a7fa-e54a-725d-7726e90333b8@redhat.com>
On 10/17/2018 08:54 AM, Jason Wang wrote:
>
> On 2018/10/16 下午9:58, Maxime Coquelin wrote:
>>
>> On 10/15/2018 04:22 AM, Jason Wang wrote:
>>>
>>>
>>> On 2018年10月13日 01:23, Michael S. Tsirkin wrote:
>>>> On Fri, Oct 12, 2018 at 10:32:44PM +0800, Tiwei Bie wrote:
>>>>> On Mon, Jul 16, 2018 at 11:28:09AM +0800, Jason Wang wrote:
>>>>> [...]
>>>>>> @@ -1367,10 +1397,48 @@ long vhost_vring_ioctl(struct vhost_dev
>>>>>> *d, unsigned int ioctl, void __user *arg
>>>>>> vq->last_avail_idx = s.num;
>>>>>> /* Forget the cached index value. */
>>>>>> vq->avail_idx = vq->last_avail_idx;
>>>>>> + if (vhost_has_feature(vq, VIRTIO_F_RING_PACKED)) {
>>>>>> + vq->last_avail_wrap_counter = wrap_counter;
>>>>>> + vq->avail_wrap_counter = vq->last_avail_wrap_counter;
>>>>>> + }
>>>>>> break;
>>>>>> case VHOST_GET_VRING_BASE:
>>>>>> s.index = idx;
>>>>>> s.num = vq->last_avail_idx;
>>>>>> + if (vhost_has_feature(vq, VIRTIO_F_RING_PACKED))
>>>>>> + s.num |= vq->last_avail_wrap_counter << 31;
>>>>>> + if (copy_to_user(argp, &s, sizeof(s)))
>>>>>> + r = -EFAULT;
>>>>>> + break;
>>>>>> + case VHOST_SET_VRING_USED_BASE:
>>>>>> + /* Moving base with an active backend?
>>>>>> + * You don't want to do that.
>>>>>> + */
>>>>>> + if (vq->private_data) {
>>>>>> + r = -EBUSY;
>>>>>> + break;
>>>>>> + }
>>>>>> + if (copy_from_user(&s, argp, sizeof(s))) {
>>>>>> + r = -EFAULT;
>>>>>> + break;
>>>>>> + }
>>>>>> + if (vhost_has_feature(vq, VIRTIO_F_RING_PACKED)) {
>>>>>> + wrap_counter = s.num >> 31;
>>>>>> + s.num &= ~(1 << 31);
>>>>>> + }
>>>>>> + if (s.num > 0xffff) {
>>>>>> + r = -EINVAL;
>>>>>> + break;
>>>>>> + }
>>>>> Do we want to put wrap_counter at bit 15?
>>>> I think I second that - seems to be consistent with
>>>> e.g. event suppression structure and the proposed
>>>> extension to driver notifications.
>>>
>>> Ok, I assumes packed virtqueue support 64K but looks not. I can
>>> change it to bit 15 and GET_VRING_BASE need to be changed as well.
>>>
>>>>
>>>>
>>>>> If put wrap_counter at bit 31, the check (s.num > 0xffff)
>>>>> won't be able to catch the illegal index 0x8000~0xffff for
>>>>> packed ring.
>>>>>
>>>
>>> Do we need to clarify this in the spec?
>>>
>>>>>> + vq->last_used_idx = s.num;
>>>>>> + if (vhost_has_feature(vq, VIRTIO_F_RING_PACKED))
>>>>>> + vq->last_used_wrap_counter = wrap_counter;
>>>>>> + break;
>>>>>> + case VHOST_GET_VRING_USED_BASE:
>>>>> Do we need the new VHOST_GET_VRING_USED_BASE and
>>>>> VHOST_SET_VRING_USED_BASE ops?
>>>>>
>>>>> We are going to merge below series in DPDK:
>>>>>
>>>>> http://patches.dpdk.org/patch/45874/
>>>>>
>>>>> We may need to reach an agreement first.
>>>
>>> If we agree that 64K virtqueue won't be supported, I'm ok with either.
>>
>> I'm fine to put wrap_counter at bit 15.
>> I will post a new version of the DPDK series soon.
>>
>>> Btw the code assumes used_wrap_counter is equal to avail_wrap_counter
>>> which looks wrong?
>>
>> For split ring, we used to set the last_used_idx to the same value as
>> last_avail_idx as VHOST_USER_GET_VRING_BASE cannot be called while the
>> ring is being processed, so their value is always the same at the time
>> the request is handled.
>
>
> I may miss something, but it looks to me we should sync last_used_idx
> from used_idx.
Ok, so as proposed off-list by Jason, we could extend
VHOST_USER_[GET|SET]_VRING_BASE to have the following payload when
VIRTIO_F_RING_PACKED is negotiated:
Bit[0:14] avail index
Bit[15] avail wrap counter
Bit[16:30] used index
Bit[31] used wrap counter
Is everyone ok with that?
Another thing that I'd like to discuss is how do we reconnect in case of
user backend crash. When it happens, the frontend hasn't queried the
backend for last_avail_idx/last_used_idx and their wrap counters.
With split ring, when it happens, we set last_avail_idx to device's used
index (see virtio_queue_restore_last_avail_idx()).
Problem with packed ring is that wrap counters information is only in
the backend.
Can we get device's used index and deduce the wrap counter value from
corresponding descriptor flag?
Any thoughts?
Regards,
Maxime
> Thanks
>
>
>>
>>
>> I kept the same behavior for packed ring, and so the wrap counter have
>> to be the same.
>>
>> Regards,
>> Maxime
>>
>>> Thanks
>>>
>>>>>
>>>>>> + s.index = idx;
>>>>>> + s.num = vq->last_used_idx;
>>>>>> + if (vhost_has_feature(vq, VIRTIO_F_RING_PACKED))
>>>>>> + s.num |= vq->last_used_wrap_counter << 31;
>>>>>> if (copy_to_user(argp, &s, sizeof s))
>>>>>> r = -EFAULT;
>>>>>> break;
>>>>> [...]
>>>
^ permalink raw reply
* Re: netconsole warning in 4.19.0-rc7
From: Dave Jones @ 2018-10-17 3:50 UTC (permalink / raw)
To: Cong Wang, Meelis Roos, LKML, Linux Kernel Network Developers
In-Reply-To: <20181017034047.nrhstc3b4wzevnri@codemonkey.org.uk>
On Tue, Oct 16, 2018 at 11:40:47PM -0400, Dave Jones wrote:
> > This is exactly what I mentioned in my review here:
> > https://marc.info/?l=linux-netdev&m=153816136624679&w=2
> >
> > "But irq is disabled here, so not sure if rcu_read_lock_bh()
> > could cause trouble... "
>
> Not sure why this didn't show up for me when I was developing that
> patch, but I can now reproduce this. The patch below fixes it for
> me, but I'm not sure if there are still any side-effects.
> There's also a missed unlock in the error path.
I took another look at that error path. Turns out this is all we need I
think..
diff --git a/net/core/netpoll.c b/net/core/netpoll.c
index de1d1ba92f2d..f9322d5db899 100644
--- a/net/core/netpoll.c
+++ b/net/core/netpoll.c
@@ -318,6 +318,7 @@ void netpoll_send_skb_on_dev(struct netpoll *np, struct sk_buff *skb,
npinfo = rcu_dereference_bh(np->dev->npinfo);
if (!npinfo || !netif_running(dev) || !netif_device_present(dev)) {
dev_kfree_skb_irq(skb);
+ rcu_read_unlock_bh();
return;
}
^ permalink raw reply related
* Re: [RFC] VSOCK: The performance problem of vhost_vsock.
From: jiangyiwen @ 2018-10-17 11:41 UTC (permalink / raw)
To: Jason Wang, stefanha; +Cc: netdev, kvm, virtualization
In-Reply-To: <8edff784-b311-bfb6-1bf4-1970d564279d@redhat.com>
On 2018/10/17 17:51, Jason Wang wrote:
>
> On 2018/10/17 下午5:39, Jason Wang wrote:
>>>>
>>> Hi Jason and Stefan,
>>>
>>> Maybe I find the reason of bad performance.
>>>
>>> I found pkt_len is limited to VIRTIO_VSOCK_DEFAULT_RX_BUF_SIZE(4K),
>>> it will cause the bandwidth is limited to 500~600MB/s. And once I
>>> increase to 64k, it can improve about 3 times(~1500MB/s).
>>
>>
>> Looks like the value was chosen for a balance between rx buffer size and performance. Allocating 64K always even for small packet is kind of waste and stress for guest memory. Virito-net try to avoid this by inventing the merge able rx buffer which allows big packet to be scattered in into different buffers. We can reuse this idea or revisit the idea of using virtio-net/vhost-net as a transport of vsock.
>>
>> What interesting is the performance is still behind vhost-net.
>>
>> Thanks
>>
>>>
>>> By the way, I send to 64K in application once, and I don't use
>>> sg_init_one and rewrite function to packet sg list because pkt_len
>>> include multiple pages.
>>>
>>> Thanks,
>>> Yiwen.
>
>
> Btw, if you're using vsock for transferring large files, maybe it's more efficient to implement sendpage() for vsock to allow sendfile()/splice() work.
>
> Thanks
>
I can't agree more.
why vhost_vsock is still behind vhost_net?
Because I use sendfile() to test performance at first, and then
I found vsock don't implement sendpage() and cause the bandwidth
can't be increased. So I use read() and send() to replace sendfile(),
it will increase some switch between kernel and user mode, and sendfile()
can support zero copy. I think this is main reason.
Thanks.
>
> .
>
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* Re: [PATCH bpf-next 00/13] bpf: add btf func info support
From: Yonghong Song @ 2018-10-17 3:25 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Alexei Starovoitov, Martin Lau, daniel@iogearbox.net,
netdev@vger.kernel.org, Kernel Team
In-Reply-To: <20181016182707.fb3cifqilasxbddv@ast-mbp.dhcp.thefacebook.com>
On 10/16/18 11:27 AM, Alexei Starovoitov wrote:
> On Fri, Oct 12, 2018 at 11:54:20AM -0700, Yonghong Song wrote:
>> The BTF support was added to kernel by Commit 69b693f0aefa
>> ("bpf: btf: Introduce BPF Type Format (BTF)"), which introduced
>> .BTF section into ELF file and is primarily
>> used for map pretty print.
>> pahole is used to convert dwarf to BTF for ELF files.
>>
>> The next step would be add func type info and debug line info
>> into BTF. For debug line info, it is desirable to encode
>> source code directly in the BTF to ease deployment and
>> introspection.
>
> it's kinda confusing that cover letter talks about line info next step,
> but these kernel side patches are only for full prog name from btf.
> It certainly makes sense for llvm to do both at the same time.
> Please make the cover letter more clear.
Make sense. Will remove line_info stuff from the cover letter.
^ permalink raw reply
* Re: [PATCH bpf-next 05/13] bpf: get better bpf_prog ksyms based on btf func type_id
From: Yonghong Song @ 2018-10-17 3:24 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Alexei Starovoitov, Martin Lau, daniel@iogearbox.net,
netdev@vger.kernel.org, Kernel Team
In-Reply-To: <20181016175901.vninfgdsjtizssvt@ast-mbp.dhcp.thefacebook.com>
On 10/16/18 10:59 AM, Alexei Starovoitov wrote:
> On Fri, Oct 12, 2018 at 11:54:42AM -0700, Yonghong Song wrote:
>> This patch added interface to load a program with the following
>> additional information:
>> . prog_btf_fd
>> . func_info and func_info_len
>> where func_info will provides function range and type_id
>> corresponding to each function.
>>
>> If verifier agrees with function range provided by the user,
>> the bpf_prog ksym for each function will use the func name
>> provided in the type_id, which is supposed to provide better
>> encoding as it is not limited by 16 bytes program name
>> limitation and this is better for bpf program which contains
>> multiple subprograms.
>>
>> The bpf_prog_info interface is also extended to
>> return btf_id and jited_func_types, so user spaces can
>> print out the function prototype for each jited function.
>>
>> Signed-off-by: Yonghong Song <yhs@fb.com>
> ...
>> BUILD_BUG_ON(sizeof("bpf_prog_") +
>> sizeof(prog->tag) * 2 +
>> @@ -401,6 +403,13 @@ static void bpf_get_prog_name(const struct bpf_prog *prog, char *sym)
>>
>> sym += snprintf(sym, KSYM_NAME_LEN, "bpf_prog_");
>> sym = bin2hex(sym, prog->tag, sizeof(prog->tag));
>> +
>> + if (prog->aux->btf) {
>> + func_name = btf_get_name_by_id(prog->aux->btf, prog->aux->type_id);
>> + snprintf(sym, (size_t)(end - sym), "_%s", func_name);
>> + return;
>
> Would it make sense to add a comment here that prog->aux->name is ignored
> when full btf name is available? (otherwise the same name will appear twice in ksym)
Will add a comment.
>
>> + }
>> +
>> if (prog->aux->name[0])
>> snprintf(sym, (size_t)(end - sym), "_%s", prog->aux->name);
> ...
>> +static int check_btf_func(struct bpf_prog *prog, struct bpf_verifier_env *env,
>> + union bpf_attr *attr)
>> +{
>> + struct bpf_func_info *data;
>> + int i, nfuncs, ret = 0;
>> +
>> + if (!attr->func_info_len)
>> + return 0;
>> +
>> + nfuncs = attr->func_info_len / sizeof(struct bpf_func_info);
>> + if (env->subprog_cnt != nfuncs) {
>> + verbose(env, "number of funcs in func_info does not match verifier\n");
>
> 'does not match verifier' is hard to make sense of.
> How about 'number of funcs in func_info doesn't match number of subprogs' ?
Sounds good to me.
>
>> + return -EINVAL;
>> + }
>> +
>> + data = kvmalloc(attr->func_info_len, GFP_KERNEL | __GFP_NOWARN);
>> + if (!data) {
>> + verbose(env, "no memory to allocate attr func_info\n");
>
> I don't think we ever print such warnings for memory allocations.
> imo this can be removed, since enomem is enough.
Okay.
>
>> + return -ENOMEM;
>> + }
>> +
>> + if (copy_from_user(data, u64_to_user_ptr(attr->func_info),
>> + attr->func_info_len)) {
>> + verbose(env, "memory copy error for attr func_info\n");
>
> similar thing. kernel never warns about copy_from_user errors.
Okay.
>
>> + ret = -EFAULT;
>> + goto cleanup;
>> + }
>> +
>> + for (i = 0; i < nfuncs; i++) {
>> + if (env->subprog_info[i].start != data[i].insn_offset) {
>> + verbose(env, "func_info subprog start (%d) does not match verifier (%d)\n",
>> + env->subprog_info[i].start, data[i].insn_offset);
>
> I think printing exact insn offset isn't going to be much help
> for regular user to debug it. If this happens, it's likely llvm issue.
> How about 'func_info BTF section doesn't match subprog layout in BPF program' ?
Okay.
>
^ permalink raw reply
* Re: [PATCH bpf-next 02/13] bpf: btf: Add BTF_KIND_FUNC and BTF_KIND_FUNC_PROTO
From: Yonghong Song @ 2018-10-17 3:22 UTC (permalink / raw)
To: Daniel Borkmann, Alexei Starovoitov, Martin Lau,
netdev@vger.kernel.org
Cc: Kernel Team
In-Reply-To: <0f3f22dd-1f31-ac13-7bde-31d727ca523d@iogearbox.net>
On 10/15/18 3:36 PM, Daniel Borkmann wrote:
> On 10/12/2018 08:54 PM, Yonghong Song wrote:
> [...]
>> +static bool btf_name_valid_identifier(const struct btf *btf, u32 offset)
>> +{
>> + /* offset must be valid */
>> + const char *src = &btf->strings[offset];
>> +
>> + if (!isalpha(*src) && *src != '_')
>> + return false;
>> +
>> + src++;
>> + while (*src) {
>> + if (!isalnum(*src) && *src != '_')
>> + return false;
>> + src++;
>> + }
>> +
>> + return true;
>> +}
>
> Should there be an upper name length limit like KSYM_NAME_LEN? (Is it implied
> by the kvmalloc() limit?)
KSYM_NAME_LEN is good choice. Here, we check function names and
struct/union member names. In C, based on
https://stackoverflow.com/questions/2352209/max-identifier-length,
the identifier max length is 63. Some compiler implementation may vary.
KSYM_NAME_LEN is 128.
>
>> static const char *btf_name_by_offset(const struct btf *btf, u32 offset)
>> {
>> if (!offset)
>> @@ -747,7 +782,9 @@ static bool env_type_is_resolve_sink(const struct btf_verifier_env *env,
>> /* int, enum or void is a sink */
>> return !btf_type_needs_resolve(next_type);
>> case RESOLVE_PTR:
>> - /* int, enum, void, struct or array is a sink for ptr */
>> + /* int, enum, void, struct, array or func_ptoto is a sink
>> + * for ptr
>> + */
>> return !btf_type_is_modifier(next_type) &&
>> !btf_type_is_ptr(next_type);
>> case RESOLVE_STRUCT_OR_ARRAY:
^ permalink raw reply
* Re: [PATCH bpf-next 02/13] bpf: btf: Add BTF_KIND_FUNC and BTF_KIND_FUNC_PROTO
From: Yonghong Song @ 2018-10-17 3:11 UTC (permalink / raw)
To: Daniel Borkmann, Alexei Starovoitov, Martin Lau,
netdev@vger.kernel.org
Cc: Kernel Team
In-Reply-To: <3a4bc2f7-9797-ccbe-d38f-c2d3ce662889@iogearbox.net>
On 10/15/18 3:30 PM, Daniel Borkmann wrote:
> On 10/12/2018 08:54 PM, Yonghong Song wrote:
>> This patch adds BTF_KIND_FUNC and BTF_KIND_FUNC_PROTO
>> support to the type section. BTF_KIND_FUNC_PROTO is used
>> to specify the type of a function pointer. With this,
>> BTF has a complete set of C types (except float).
>>
>> BTF_KIND_FUNC is used to specify the signature of a
>> defined subprogram. BTF_KIND_FUNC_PROTO can be referenced
>> by another type, e.g., a pointer type, and BTF_KIND_FUNC
>> type cannot be referenced by another type.
>>
>> For both BTF_KIND_FUNC and BTF_KIND_FUNC_PROTO types,
>> the func return type is in t->type (where t is a
>> "struct btf_type" object). The func args are an array of
>> u32s immediately following object "t".
>>
>> As a concrete example, for the C program below,
>> $ cat test.c
>> int foo(int (*bar)(int)) { return bar(5); }
>> with latest llvm trunk built with Debug mode, we have
>> $ clang -target bpf -g -O2 -mllvm -debug-only=btf -c test.c
>> Type Table:
>> [1] FUNC name_off=1 info=0x0c000001 size/type=2
>> param_type=3
>> [2] INT name_off=11 info=0x01000000 size/type=4
>> desc=0x01000020
>> [3] PTR name_off=0 info=0x02000000 size/type=4
>> [4] FUNC_PROTO name_off=0 info=0x0d000001 size/type=2
>> param_type=2
>>
>> String Table:
>> 0 :
>> 1 : foo
>> 5 : .text
>> 11 : int
>> 15 : test.c
>> 22 : int foo(int (*bar)(int)) { return bar(5); }
>>
>> FuncInfo Table:
>> sec_name_off=5
>> insn_offset=<Omitted> type_id=1
>>
>> ...
>>
>> (Eventually we shall have bpftool to dump btf information
>> like the above.)
>>
>> Function "foo" has a FUNC type (type_id = 1).
>> The parameter of "foo" has type_id 3 which is PTR->FUNC_PROTO,
>> where FUNC_PROTO refers to function pointer "bar".
>
> Should also "bar" be part of the string table (at least at some point in future)?
Yes, we can do it. The dwarf for the abovee example looks like
0x00000043: DW_TAG_formal_parameter
DW_AT_location (0x00000000
[0x0000000000000000, 0x0000000000000008):
DW_OP_reg1 W1
[0x0000000000000008, 0x0000000000000018):
DW_OP_reg2 W2)
DW_AT_name ("bar")
DW_AT_decl_file ("/home/yhs/tmp/t.c")
DW_AT_decl_line (1)
DW_AT_type (0x0000005a "subroutine int*")
0x0000005a: DW_TAG_pointer_type
DW_AT_type (0x0000005f "subroutine int")
0x0000005f: DW_TAG_subroutine_type
DW_AT_type (0x00000053 "int")
DW_AT_prototyped (true)
0x00000064: DW_TAG_formal_parameter
DW_AT_type (0x00000053 "int")
0x00000069: NULL
0x0000006a: NULL
The current llvm implementation does not record func
parameter name, so "bar" got lost. We could associate
"bar" with pointer type in the future implementation.
> Iow, if verifier hints to an issue in the program when it would for example walk
> pointers and rewrite ctx access, then it could dump the var name along with it.
> It might be useful as well in combination with 22 from str table, when annotating
> the source. We might need support for variadic functions, though. How is LLVM
> handling the latter with the recent BTF support?
The LLVM implementation does support variadic functions.
The last type id 0 indicates a variadic function.
>
>> In FuncInfo Table, for section .text, the function,
>> with to-be-determined offset (marked as <Omitted>),
>> has type_id=1 which refers to a FUNC type.
>> This way, the function signature is
>> available to both kernel and user space.
>> Here, the insn offset is not available during the dump time
>> as relocation is resolved pretty late in the compilation process.
>>
>> Signed-off-by: Martin KaFai Lau <kafai@fb.com>
>> Signed-off-by: Yonghong Song <yhs@fb.com>
^ permalink raw reply
* Re: [PATCH net] r8169: fix NAPI handling under high load
From: Florian Fainelli @ 2018-10-17 3:10 UTC (permalink / raw)
To: Eric Dumazet, Stephen Hemminger, Holger Hoffstätte
Cc: Heiner Kallweit, David Miller, Realtek linux nic maintainers,
netdev@vger.kernel.org
In-Reply-To: <f22a01e4-a8d4-7358-00e8-89fd8401976f@gmail.com>
On 10/16/2018 5:23 PM, Eric Dumazet wrote:
>
>
> On 10/16/2018 04:08 PM, Florian Fainelli wrote:
>
>> I had started doing that about a month ago in light of the ixbge
>> ndo_poll_controller vs. napi problem, but have not had time to submit
>> that series yet:
>>
>> https://github.com/ffainelli/linux/commits/napi-check
>>
>> feel free to piggy back on top of that series if you would like to
>> address this.
>
>
> But the root cause was different, you remember ?
>
> We fixed the real issue with netpoll ability to stick all NIC IRQ onto one
> victim CPU.
I do remember, after seeing the resolution I kind of decided to leave
this in a branch but not submit it because it was not particularly
relevant anymore.
--
Florian
^ permalink raw reply
* Re: [PATCH net-next v5] net/ncsi: Add NCSI Broadcom OEM command
From: Samuel Mendoza-Jonas @ 2018-10-17 2:50 UTC (permalink / raw)
To: Vijay Khemka, David S. Miller, netdev, linux-kernel
Cc: openbmc @ lists . ozlabs . org, Justin.Lee1, joel, linux-aspeed
In-Reply-To: <20181016191319.1909502-1-vijaykhemka@fb.com>
On Tue, 2018-10-16 at 12:13 -0700, Vijay Khemka wrote:
> This patch adds OEM Broadcom commands and response handling. It also
> defines OEM Get MAC Address handler to get and configure the device.
>
> ncsi_oem_gma_handler_bcm: This handler send NCSI broadcom command for
> getting mac address.
> ncsi_rsp_handler_oem_bcm: This handles response received for all
> broadcom OEM commands.
> ncsi_rsp_handler_oem_bcm_gma: This handles get mac address response and
> set it to device.
>
> Signed-off-by: Vijay Khemka <vijaykhemka@fb.com>
Reviewed-by: Samuel Mendoza-Jonas <sam@mendozajonas.com>
^ permalink raw reply
* [PATCH v3] virtio_net: avoid using netif_tx_disable() for serializing tx routine
From: Ake Koomsin @ 2018-10-17 10:44 UTC (permalink / raw)
To: Jason Wang
Cc: ake, Michael S. Tsirkin, David S. Miller, virtualization, netdev,
linux-kernel
In-Reply-To: <276ac5bb-90f7-5fb6-a826-0bb05ecfa069@redhat.com>
Commit 713a98d90c5e ("virtio-net: serialize tx routine during reset")
introduces netif_tx_disable() after netif_device_detach() in order to
avoid use-after-free of tx queues. However, there are two issues.
1) Its operation is redundant with netif_device_detach() in case the
interface is running.
2) In case of the interface is not running before suspending and
resuming, the tx does not get resumed by netif_device_attach().
This results in losing network connectivity.
It is better to use netif_tx_lock_bh()/netif_tx_unlock_bh() instead for
serializing tx routine during reset. This also preserves the symmetry
of netif_device_detach() and netif_device_attach().
Fixes commit 713a98d90c5e ("virtio-net: serialize tx routine during reset")
Signed-off-by: Ake Koomsin <ake@igel.co.jp>
---
drivers/net/virtio_net.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 3f5aa59c37b7..3e2c041d76ac 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -2267,8 +2267,9 @@ static void virtnet_freeze_down(struct virtio_device *vdev)
/* Make sure no work handler is accessing the device */
flush_work(&vi->config_work);
+ netif_tx_lock_bh(vi->dev);
netif_device_detach(vi->dev);
- netif_tx_disable(vi->dev);
+ netif_tx_unlock_bh(vi->dev);
cancel_delayed_work_sync(&vi->refill);
if (netif_running(vi->dev)) {
@@ -2304,7 +2305,9 @@ static int virtnet_restore_up(struct virtio_device *vdev)
}
}
+ netif_tx_lock_bh(vi->dev);
netif_device_attach(vi->dev);
+ netif_tx_unlock_bh(vi->dev);
return err;
}
--
2.19.1
^ permalink raw reply related
* Re: [PATCH bpf-next v2 3/7] bpf: add MAP_LOOKUP_AND_DELETE_ELEM syscall
From: Mauricio Vasquez @ 2018-10-17 2:29 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Song Liu, Alexei Starovoitov, Daniel Borkmann, Networking
In-Reply-To: <20181016232006.rmvjgglbybf4ypwk@ast-mbp.dhcp.thefacebook.com>
On 10/16/2018 06:20 PM, Alexei Starovoitov wrote:
> On Tue, Oct 16, 2018 at 04:16:39PM -0500, Mauricio Vasquez wrote:
>>
>> On 10/11/2018 06:51 PM, Alexei Starovoitov wrote:
>>> On Wed, Oct 10, 2018 at 05:50:01PM -0500, Mauricio Vasquez wrote:
>>>>>> Does it make sense to you?
>>>>> I reread the other patch, and found it does NOT use the following logic for
>>>>> queue and stack:
>>>>>
>>>>> rcu_read_lock();
>>>>> ptr = map->ops->map_lookup_and_delete_elem(map, key);
>>>>> if (ptr)
>>>>> memcpy(value, ptr, value_size);
>>>>>
>>>>> I guess this part is not used at all? Can we just remove it?
>>>>>
>>>>> Thanks,
>>>>> Song
>>>> This is the base code for map_lookup_and_delete support, it is not used in
>>>> queue/stack maps.
>>>>
>>>> I think we can leave it there, so when somebody implements lookup_and_delete
>>>> for other maps doesn't have to care about implementing also this.
>>> The code looks useful to me, but I also agree with Song. And in the kernel
>>> we don't typically add 'almost dead code'.
>>> May be provide an implementation of the lookup_and_delete for hash map
>>> so it's actually used ?
>> I haven't written any code but I think there is a potential problem here.
>> Current lookup_and_delete returns a pointer to the element, hence deletion
>> of the element should be done using call_rcu to guarantee this is valid
>> after returning.
>> In the hashtab, the deletion only uses call_rcu when there is not prealloc,
>> otherwise the element is pushed on the list of free elements immediately.
>> If we move the logic to push elements into the free list under a call_rcu
>> invocation, it could happen that the free list is empty because the call_rcu
>> is still pending (a similar issue we had with the queue/stack maps when they
>> used a pass by reference API).
>>
>> There is another way to implement it without this issue, in syscall.c:
>> l = ops->lookup(key);
>> memcpy(l, some_buffer)
>> ops->delete(key)
>>
>> The point here is that the lookup_and_delete operation is not being used at
>> all, so, is lookup + delete = lookup_and_delete?, can it be generalized?
>> If this is true, then what is the sense of having lookup_and_delete
>> syscall?,
> I though of lookup_and_delete command as atomic operation.
> Only in such case it would make sense.
> Otherwise there is no point in having additional cmd.
> In case of hash map the implementation would be similar to delete:
> raw_spin_lock_irqsave(&b->lock, flags);
> l = lookup_elem_raw(head, hash, key, key_size);
> if (l) {
> hlist_nulls_del_rcu(&l->hash_node);
> bpf_long_memcpy(); // into temp kernel area
> free_htab_elem(htab, l);
> ret = 0;
> }
> raw_spin_unlock_irqrestore(&b->lock, flags);
> copy_to_user();
Well, this is new approach, currently operations have no enough info to
perform the copy_to_user directly, btw, is there any technical reason
why a double mem copy is done? (from the map value into a temp kernel
buffer and then to userspace?)
>
> there is a chance that some other cpu is doing lookup in parallel
> and may be modifying value, so bpf_long_mempcy() isn't fully atomic.
I think we already have that case, if an eBPF program is updating the
map, a lookup from userspace could see a partially updated value.
> But bpf side is written together with user space side,
> so above almost-atomic lookup_and_delete is usable instead
> of lookup and then delete which races too much.
>
> Having said that I think it's fine to defer this new ndo for now
> and leave lookup_and_delete syscall cmd for stack/queue map only.
>
I agree, just a question, should we remove the "almost dead code" or
leave it there?
Thanks,
Mauricio.
^ permalink raw reply
* Re: [RFC] VSOCK: The performance problem of vhost_vsock.
From: jiangyiwen @ 2018-10-17 9:27 UTC (permalink / raw)
To: Jason Wang, stefanha; +Cc: netdev, kvm, virtualization
In-Reply-To: <5BC42FD8.2070104@huawei.com>
On 2018/10/15 14:12, jiangyiwen wrote:
> On 2018/10/15 10:33, Jason Wang wrote:
>>
>>
>> On 2018年10月15日 09:43, jiangyiwen wrote:
>>> Hi Stefan & All:
>>>
>>> Now I find vhost-vsock has two performance problems even if it
>>> is not designed for performance.
>>>
>>> First, I think vhost-vsock should faster than vhost-net because it
>>> is no TCP/IP stack, but the real test result vhost-net is 5~10
>>> times than vhost-vsock, currently I am looking for the reason.
>>
>> TCP/IP is not a must for vhost-net.
>>
>> How do you test and compare the performance?
>>
>> Thanks
>>
>
> I test the performance used my test tool, like follows:
>
> Server Client
> socket()
> bind()
> listen()
>
> socket(AF_VSOCK) or socket(AF_INET)
> Accept() <-------------->connect()
> *======Start Record Time======*
> Call syscall sendfile()
> Recv()
> Send end
> Receive end
> Send(file_size)
> Recv(file_size)
> *======End Record Time======*
>
> The test result, vhost-vsock is about 500MB/s, and vhost-net is about 2500MB/s.
>
> By the way, vhost-net use single queue.
>
> Thanks.
>
>>> Second, vhost-vsock only supports two vqs(tx and rx), that means
>>> if multiple sockets in the guest will use the same vq to transmit
>>> the message and get the response. So if there are multiple applications
>>> in the guest, we should support "Multiqueue" feature for Virtio-vsock.
>>>
>>> Stefan, have you encountered these problems?
>>>
>>> Thanks,
>>> Yiwen.
>>>
>>
>>
>> .
>>
>
>
Hi Jason and Stefan,
Maybe I find the reason of bad performance.
I found pkt_len is limited to VIRTIO_VSOCK_DEFAULT_RX_BUF_SIZE(4K),
it will cause the bandwidth is limited to 500~600MB/s. And once I
increase to 64k, it can improve about 3 times(~1500MB/s).
By the way, I send to 64K in application once, and I don't use
sg_init_one and rewrite function to packet sg list because pkt_len
include multiple pages.
Thanks,
Yiwen.
> _______________________________________________
> Virtualization mailing list
> Virtualization@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/virtualization
>
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* Re: [RFC] virtio_net: add local_bh_disable() around u64_stats_update_begin
From: Toshiaki Makita @ 2018-10-17 1:13 UTC (permalink / raw)
To: Sebastian Andrzej Siewior, netdev, virtualization
Cc: tglx, Michael S. Tsirkin, Jason Wang, David S. Miller
In-Reply-To: <20181016165545.guksrl23ulcudxrk@linutronix.de>
On 2018/10/17 1:55, Sebastian Andrzej Siewior wrote:
> on 32bit, lockdep notices:
> | ================================
> | WARNING: inconsistent lock state
> | 4.19.0-rc8+ #9 Tainted: G W
> | --------------------------------
> | inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
> | ip/1106 [HC0[0]:SC1[1]:HE1:SE0] takes:
> | (ptrval) (&syncp->seq#2){+.?.}, at: net_rx_action+0xc8/0x380
> | {SOFTIRQ-ON-W} state was registered at:
> | lock_acquire+0x7e/0x170
> | try_fill_recv+0x5fa/0x700
> | virtnet_open+0xe0/0x180
> | __dev_open+0xae/0x130
> | __dev_change_flags+0x17f/0x200
> | dev_change_flags+0x23/0x60
> | do_setlink+0x2bb/0xa20
> | rtnl_newlink+0x523/0x830
> | rtnetlink_rcv_msg+0x14b/0x470
> | netlink_rcv_skb+0x6e/0xf0
> | rtnetlink_rcv+0xd/0x10
> | netlink_unicast+0x16e/0x1f0
> | netlink_sendmsg+0x1af/0x3a0
> | ___sys_sendmsg+0x20f/0x240
> | __sys_sendmsg+0x39/0x80
> | sys_socketcall+0x13a/0x2a0
> | do_int80_syscall_32+0x50/0x180
> | restore_all+0x0/0xb2
> | irq event stamp: 3326
> | hardirqs last enabled at (3326): [<c159e6d0>] net_rx_action+0x80/0x380
> | hardirqs last disabled at (3325): [<c159e6aa>] net_rx_action+0x5a/0x380
> | softirqs last enabled at (3322): [<c14b440d>] virtnet_napi_enable+0xd/0x60
> | softirqs last disabled at (3323): [<c101d63d>] call_on_stack+0xd/0x50
> |
> | other info that might help us debug this:
> | Possible unsafe locking scenario:
> |
> | CPU0
> | ----
> | lock(&syncp->seq#2);
> | <Interrupt>
> | lock(&syncp->seq#2);
> |
> | *** DEADLOCK ***
IIUC try_fill_recv is called only when NAPI is disabled from process
context, so there should be no point to race with virtnet_receive which
is called from NAPI handler.
I'm not sure what condition triggered this warning.
Toshiaki Makita
>
> This is the "up" path which is not a hotpath. There is also
> refill_work().
> It might be unwise to add the local_bh_disable() to try_fill_recv()
> because if it is used mostly in BH so that local_bh_en+dis might be a
> waste of cycles.
>
> Adding local_bh_disable() around try_fill_recv() for the non-BH call
> sites would render GFP_KERNEL pointless.
>
> Also, ptr->var++ is not an atomic operation even on 64bit CPUs. Which
> means if try_fill_recv() runs on CPU0 (via virtnet_receive()) then the
> worker might run on CPU1.
>
> Do we care or is this just stupid stats? Any suggestions?
>
> This warning appears since commit 461f03dc99cf6 ("virtio_net: Add kick stats").
>
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> ---
> drivers/net/virtio_net.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index dab504ec5e502..d782160cfa882 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -1206,9 +1206,11 @@ static bool try_fill_recv(struct virtnet_info *vi, struct receive_queue *rq,
> break;
> } while (rq->vq->num_free);
> if (virtqueue_kick_prepare(rq->vq) && virtqueue_notify(rq->vq)) {
> + local_bh_disable();
> u64_stats_update_begin(&rq->stats.syncp);
> rq->stats.kicks++;
> u64_stats_update_end(&rq->stats.syncp);
> + local_bh_enable();
> }
>
> return !oom;
>
^ permalink raw reply
* Re: [PATCH v2] virtio_net: avoid using netif_tx_disable() for serializing tx routine
From: Jason Wang @ 2018-10-17 9:02 UTC (permalink / raw)
To: Ake Koomsin
Cc: Michael S. Tsirkin, David S. Miller, virtualization, netdev,
linux-kernel
In-Reply-To: <20181017075918.7043-1-ake@igel.co.jp>
On 2018/10/17 下午3:59, Ake Koomsin wrote:
> Commit 713a98d90c5e ("virtio-net: serialize tx routine during reset")
> introduces netif_tx_disable() after netif_device_detach() in order to
> avoid use-after-free of tx queues. However, there are two issues.
>
> 1) Its operation is redundant with netif_device_detach() if the interface
> is running.
> 2) In case of the interface is not running before suspending and
> resuming, the tx does not get resumed by netif_device_attach().
> This results in losing network connectivity.
>
> It is better to use netif_tx_lock()/netif_tx_unlock() instead for
> serializing tx routine during reset. This also preserves the symmetry
> of netif_device_detach() and netif_device_attach().
>
> Fixes commit 713a98d90c5e ("virtio-net: serialize tx routine during reset")
> Signed-off-by: Ake Koomsin <ake@igel.co.jp>
> ---
> drivers/net/virtio_net.c | 5 ++++-
> 1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 3f5aa59c37b7..41ccf9c994a4 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -2267,8 +2267,9 @@ static void virtnet_freeze_down(struct virtio_device *vdev)
> /* Make sure no work handler is accessing the device */
> flush_work(&vi->config_work);
>
> + netif_tx_lock(vi->dev);
> netif_device_detach(vi->dev);
> - netif_tx_disable(vi->dev);
> + netif_tx_unlock(vi->dev);
Sorry for not finding this earlier. I think we should use
netif_tx_lock_bh() to prevent start_xmit() to run under bh.
Thanks
> cancel_delayed_work_sync(&vi->refill);
>
> if (netif_running(vi->dev)) {
> @@ -2304,7 +2305,9 @@ static int virtnet_restore_up(struct virtio_device *vdev)
> }
> }
>
> + netif_tx_lock(vi->dev);
> netif_device_attach(vi->dev);
> + netif_tx_unlock(vi->dev);
> return err;
> }
>
^ permalink raw reply
* Re: [bpf-next PATCH] bpf: sockmap, fix skmsg recvmsg handler to track size correctly
From: Daniel Borkmann @ 2018-10-17 0:32 UTC (permalink / raw)
To: John Fastabend, ast; +Cc: netdev
In-Reply-To: <20181016173601.9211.1573.stgit@john-Precision-Tower-5810>
On 10/16/2018 07:36 PM, John Fastabend wrote:
> When converting sockmap to new skmsg generic data structures we missed
> that the recvmsg handler did not correctly use sg.size and instead was
> using individual elements length. The result is if a sock is closed
> with outstanding data we omit the call to sk_mem_uncharge() and can
> get the warning below.
>
> [ 66.728282] WARNING: CPU: 6 PID: 5783 at net/core/stream.c:206 sk_stream_kill_queues+0x1fa/0x210
>
> To fix this correct the redirect handler to xfer the size along with
> the scatterlist and also decrement the size from the recvmsg handler.
> Now when a sock is closed the remaining 'size' will be decremented
> with sk_mem_uncharge().
>
> Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Applied to bpf-next, thanks!
^ permalink raw reply
* Re: [bpf-next PATCH 0/3] sockmap support for msg_peek flag
From: Daniel Borkmann @ 2018-10-17 0:32 UTC (permalink / raw)
To: Alexei Starovoitov, John Fastabend; +Cc: ast, netdev
In-Reply-To: <20181016184144.dryrzu7zbfzioz7q@ast-mbp.dhcp.thefacebook.com>
On 10/16/2018 08:41 PM, Alexei Starovoitov wrote:
> On Tue, Oct 16, 2018 at 11:07:54AM -0700, John Fastabend wrote:
>> This adds support for the MSG_PEEK flag when redirecting into an
>> ingress psock sk_msg queue.
>>
>> The first patch adds some base support to the helpers, then the
>> feature, and finally we add an option for the test suite to do
>> a duplicate MSG_PEEK call on every recv to test the feature.
>>
>> With duplicate MSG_PEEK call all tests continue to PASS.
>
> for the set
> Acked-by: Alexei Starovoitov <ast@kernel.org>
Applied to bpf-next, thanks!
^ permalink raw reply
* Re: [PATCH bpf-next 05/13] bpf: get better bpf_prog ksyms based on btf func type_id
From: Yonghong Song @ 2018-10-17 0:32 UTC (permalink / raw)
To: Martin Lau
Cc: Alexei Starovoitov, daniel@iogearbox.net, netdev@vger.kernel.org,
Kernel Team
In-Reply-To: <20181015231223.cwe3zblnh4wt7xds@kafai-mbp.dhcp.thefacebook.com>
On 10/15/18 4:12 PM, Martin Lau wrote:
> On Fri, Oct 12, 2018 at 11:54:42AM -0700, Yonghong Song wrote:
>> This patch added interface to load a program with the following
>> additional information:
>> . prog_btf_fd
>> . func_info and func_info_len
>> where func_info will provides function range and type_id
>> corresponding to each function.
>>
>> If verifier agrees with function range provided by the user,
>> the bpf_prog ksym for each function will use the func name
>> provided in the type_id, which is supposed to provide better
>> encoding as it is not limited by 16 bytes program name
>> limitation and this is better for bpf program which contains
>> multiple subprograms.
>>
>> The bpf_prog_info interface is also extended to
>> return btf_id and jited_func_types, so user spaces can
>> print out the function prototype for each jited function.
> Some nits.
>
>>
>> Signed-off-by: Yonghong Song <yhs@fb.com>
>> ---
>> include/linux/bpf.h | 2 +
>> include/linux/bpf_verifier.h | 1 +
>> include/linux/btf.h | 2 +
>> include/uapi/linux/bpf.h | 11 +++++
>> kernel/bpf/btf.c | 16 +++++++
>> kernel/bpf/core.c | 9 ++++
>> kernel/bpf/syscall.c | 86 +++++++++++++++++++++++++++++++++++-
>> kernel/bpf/verifier.c | 50 +++++++++++++++++++++
>> 8 files changed, 176 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
>> index 9b558713447f..e9c63ffa01af 100644
>> --- a/include/linux/bpf.h
>> +++ b/include/linux/bpf.h
>> @@ -308,6 +308,8 @@ struct bpf_prog_aux {
>> void *security;
>> #endif
>> struct bpf_prog_offload *offload;
>> + struct btf *btf;
>> + u32 type_id; /* type id for this prog/func */
>> union {
>> struct work_struct work;
>> struct rcu_head rcu;
>> diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
>> index 9e8056ec20fa..e84782ec50ac 100644
>> --- a/include/linux/bpf_verifier.h
>> +++ b/include/linux/bpf_verifier.h
>> @@ -201,6 +201,7 @@ static inline bool bpf_verifier_log_needed(const struct bpf_verifier_log *log)
>> struct bpf_subprog_info {
>> u32 start; /* insn idx of function entry point */
>> u16 stack_depth; /* max. stack depth used by this function */
>> + u32 type_id; /* btf type_id for this subprog */
>> };
>>
>> /* single container for all structs
>> diff --git a/include/linux/btf.h b/include/linux/btf.h
>> index e076c4697049..90e91b52aa90 100644
>> --- a/include/linux/btf.h
>> +++ b/include/linux/btf.h
>> @@ -46,5 +46,7 @@ void btf_type_seq_show(const struct btf *btf, u32 type_id, void *obj,
>> struct seq_file *m);
>> int btf_get_fd_by_id(u32 id);
>> u32 btf_id(const struct btf *btf);
>> +bool is_btf_func_type(const struct btf *btf, u32 type_id);
>> +const char *btf_get_name_by_id(const struct btf *btf, u32 type_id);
>>
>> #endif
>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>> index f9187b41dff6..7ebbf4f06a65 100644
>> --- a/include/uapi/linux/bpf.h
>> +++ b/include/uapi/linux/bpf.h
>> @@ -332,6 +332,9 @@ union bpf_attr {
>> * (context accesses, allowed helpers, etc).
>> */
>> __u32 expected_attach_type;
>> + __u32 prog_btf_fd; /* fd pointing to BTF type data */
>> + __u32 func_info_len; /* func_info length */
>> + __aligned_u64 func_info; /* func type info */
>> };
>>
>> struct { /* anonymous struct used by BPF_OBJ_* commands */
>> @@ -2585,6 +2588,9 @@ struct bpf_prog_info {
>> __u32 nr_jited_func_lens;
>> __aligned_u64 jited_ksyms;
>> __aligned_u64 jited_func_lens;
>> + __u32 btf_id;
>> + __u32 nr_jited_func_types;
>> + __aligned_u64 jited_func_types;
>> } __attribute__((aligned(8)));
>>
>> struct bpf_map_info {
>> @@ -2896,4 +2902,9 @@ struct bpf_flow_keys {
>> };
>> };
>>
>> +struct bpf_func_info {
>> + __u32 insn_offset;
>> + __u32 type_id;
>> +};
>> +
>> #endif /* _UAPI__LINUX_BPF_H__ */
>> diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
>> index 794a185f11bf..85b8eeccddbd 100644
>> --- a/kernel/bpf/btf.c
>> +++ b/kernel/bpf/btf.c
>> @@ -486,6 +486,15 @@ static const struct btf_type *btf_type_by_id(const struct btf *btf, u32 type_id)
>> return btf->types[type_id];
>> }
>>
>> +bool is_btf_func_type(const struct btf *btf, u32 type_id)
>> +{
>> + const struct btf_type *type = btf_type_by_id(btf, type_id);
>> +
>> + if (!type || BTF_INFO_KIND(type->info) != BTF_KIND_FUNC)
>> + return false;
>> + return true;
>> +}
> Can btf_type_is_func() (from patch 2) be reused?
> The btf_type_by_id() can be done by the caller.
> I don't think it worths to add a similar helper
> for just one user for now.
Currently, btf_type_by_id() is not exposed.
bpf/btf.c:
static const struct btf_type *btf_type_by_id(const struct btf *btf, u32
type_id)
Do you want to expose this function as global one?
We cannot put the whole definition in the header as it touches
btf internals.
>
> The !type check can be added to btf_type_is_func() if
> it is needed.
>
>> +
>> /*
>> * Regular int is not a bit field and it must be either
>> * u8/u16/u32/u64.
>> @@ -2579,3 +2588,10 @@ u32 btf_id(const struct btf *btf)
>> {
>> return btf->id;
>> }
>> +
>> +const char *btf_get_name_by_id(const struct btf *btf, u32 type_id)
>> +{
>> + const struct btf_type *t = btf_type_by_id(btf, type_id);
>> +
>> + return btf_name_by_offset(btf, t->name_off);
>> +}
>> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
>> index 3f5bf1af0826..add3866a120e 100644
>> --- a/kernel/bpf/core.c
>> +++ b/kernel/bpf/core.c
>> @@ -27,6 +27,7 @@
>> #include <linux/random.h>
>> #include <linux/moduleloader.h>
>> #include <linux/bpf.h>
>> +#include <linux/btf.h>
>> #include <linux/frame.h>
>> #include <linux/rbtree_latch.h>
>> #include <linux/kallsyms.h>
>> @@ -387,6 +388,7 @@ bpf_get_prog_addr_region(const struct bpf_prog *prog,
>> static void bpf_get_prog_name(const struct bpf_prog *prog, char *sym)
>> {
>> const char *end = sym + KSYM_NAME_LEN;
>> + const char *func_name;
>>
>> BUILD_BUG_ON(sizeof("bpf_prog_") +
>> sizeof(prog->tag) * 2 +
>> @@ -401,6 +403,13 @@ static void bpf_get_prog_name(const struct bpf_prog *prog, char *sym)
>>
>> sym += snprintf(sym, KSYM_NAME_LEN, "bpf_prog_");
>> sym = bin2hex(sym, prog->tag, sizeof(prog->tag));
>> +
>> + if (prog->aux->btf) {
>> + func_name = btf_get_name_by_id(prog->aux->btf, prog->aux->type_id);
>> + snprintf(sym, (size_t)(end - sym), "_%s", func_name);
>> + return;
>> + }
>> +
>> if (prog->aux->name[0])
>> snprintf(sym, (size_t)(end - sym), "_%s", prog->aux->name);
>> else
>> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
>> index 4f416234251f..aa4688a1a137 100644
>> --- a/kernel/bpf/syscall.c
>> +++ b/kernel/bpf/syscall.c
>> @@ -1120,6 +1120,7 @@ static void __bpf_prog_put(struct bpf_prog *prog, bool do_idr_lock)
>> /* bpf_prog_free_id() must be called first */
>> bpf_prog_free_id(prog, do_idr_lock);
>> bpf_prog_kallsyms_del_all(prog);
>> + btf_put(prog->aux->btf);
>>
>> call_rcu(&prog->aux->rcu, __bpf_prog_put_rcu);
>> }
>> @@ -1343,8 +1344,45 @@ bpf_prog_load_check_attach_type(enum bpf_prog_type prog_type,
>> }
>> }
>>
>> +static int prog_check_btf(const struct bpf_prog *prog, const struct btf *btf,
>> + union bpf_attr *attr)
>> +{
>> + struct bpf_func_info __user *uinfo, info;
>> + int i, nfuncs, usize;
>> + u32 prev_offset;
>> +
>> + usize = sizeof(struct bpf_func_info);
>> + if (attr->func_info_len % usize)
>> + return -EINVAL;
>> +
>> + /* func_info section should have increasing and valid insn_offset
>> + * and type should be BTF_KIND_FUNC.
>> + */
>> + nfuncs = attr->func_info_len / usize;
>> + uinfo = u64_to_user_ptr(attr->func_info);
>> + for (i = 0; i < nfuncs; i++) {
>> + if (copy_from_user(&info, uinfo, usize))
>> + return -EFAULT;
>> +
>> + if (!is_btf_func_type(btf, info.type_id))
>> + return -EINVAL;
>> +
>> + if (i == 0) {
>> + if (info.insn_offset)
>> + return -EINVAL;
>> + prev_offset = 0;
>> + } else if (info.insn_offset < prev_offset) {
>> + return -EINVAL;
>> + }
>> +
>> + prev_offset = info.insn_offset;
>> + }
>> +
>> + return 0;
>> +}
>> +
>> /* last field in 'union bpf_attr' used by this command */
>> -#define BPF_PROG_LOAD_LAST_FIELD expected_attach_type
>> +#define BPF_PROG_LOAD_LAST_FIELD func_info
>>
>> static int bpf_prog_load(union bpf_attr *attr)
>> {
>> @@ -1431,6 +1469,27 @@ static int bpf_prog_load(union bpf_attr *attr)
>> if (err)
>> goto free_prog;
>>
>> + /* copy func_info before verifier which may make
>> + * some adjustment.
>> + */
> Is it a left over comment? I don't see the intention of
> copying func_info to avoid verifier modification from below.
> I could be missing something.
>
> or should the comments be moved to the new "check_btf_func()" below?
>
>> + if (attr->func_info_len) {
>> + struct btf *btf;
>> +
>> + btf = btf_get_by_fd(attr->prog_btf_fd);
>> + if (IS_ERR(btf)) {
>> + err = PTR_ERR(btf);
>> + goto free_prog;
>> + }
>> +
>> + err = prog_check_btf(prog, btf, attr);
>> + if (err) {
>> + btf_put(btf);
>> + goto free_prog;
>> + }
>> +
>> + prog->aux->btf = btf;
>> + }
>> +
>> /* run eBPF verifier */
>> err = bpf_check(&prog, attr);
>> if (err < 0)
>> @@ -1463,6 +1522,7 @@ static int bpf_prog_load(union bpf_attr *attr)
>> bpf_prog_kallsyms_del_subprogs(prog);
>> free_used_maps(prog->aux);
>> free_prog:
>> + btf_put(prog->aux->btf);
>> bpf_prog_uncharge_memlock(prog);
>> free_prog_sec:
>> security_bpf_prog_free(prog->aux);
>> @@ -2108,6 +2168,30 @@ static int bpf_prog_get_info_by_fd(struct bpf_prog *prog,
>> }
>> }
>>
>> + if (prog->aux->btf) {
>> + info.btf_id = btf_id(prog->aux->btf);
>> +
>> + ulen = info.nr_jited_func_types;
>> + info.nr_jited_func_types = prog->aux->func_cnt;
>> + if (info.nr_jited_func_types && ulen) {
>> + if (bpf_dump_raw_ok()) {
>> + u32 __user *user_types;
>> + u32 func_type, i;
>> +
>> + ulen = min_t(u32, info.nr_jited_func_types,
>> + ulen);
>> + user_types = u64_to_user_ptr(info.jited_func_types);
>> + for (i = 0; i < ulen; i++) {
>> + func_type = prog->aux->func[i]->aux->type_id;
>> + if (put_user(func_type, &user_types[i]))
>> + return -EFAULT;
>> + }
>> + } else {
>> + info.jited_func_types = 0;
>> + }
>> + }
>> + }
>> +
>> done:
>> if (copy_to_user(uinfo, &info, info_len) ||
>> put_user(info_len, &uattr->info.info_len))
>> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
>> index 3f93a548a642..97c408e84322 100644
>> --- a/kernel/bpf/verifier.c
>> +++ b/kernel/bpf/verifier.c
>> @@ -4589,6 +4589,50 @@ static int check_cfg(struct bpf_verifier_env *env)
>> return ret;
>> }
>>
>> +static int check_btf_func(struct bpf_prog *prog, struct bpf_verifier_env *env,
>> + union bpf_attr *attr)
>> +{
>> + struct bpf_func_info *data;
>> + int i, nfuncs, ret = 0;
>> +
>> + if (!attr->func_info_len)
>> + return 0;
>> +
>> + nfuncs = attr->func_info_len / sizeof(struct bpf_func_info);
>> + if (env->subprog_cnt != nfuncs) {
>> + verbose(env, "number of funcs in func_info does not match verifier\n");
>> + return -EINVAL;
>> + }
>> +
>> + data = kvmalloc(attr->func_info_len, GFP_KERNEL | __GFP_NOWARN);
>> + if (!data) {
>> + verbose(env, "no memory to allocate attr func_info\n");
>> + return -ENOMEM;
>> + }
>> +
>> + if (copy_from_user(data, u64_to_user_ptr(attr->func_info),
>> + attr->func_info_len)) {
>> + verbose(env, "memory copy error for attr func_info\n");
>> + ret = -EFAULT;
>> + goto cleanup;
>> + }
> Extra indentation.
>
>> +
>> + for (i = 0; i < nfuncs; i++) {
>> + if (env->subprog_info[i].start != data[i].insn_offset) {
>> + verbose(env, "func_info subprog start (%d) does not match verifier (%d)\n",
>> + env->subprog_info[i].start, data[i].insn_offset);
> The printing args are swapped? env->subprog_info[i].start should
> go to the "verifier (%d)"?
>
> and s/%d/%u/
>
>> + ret = -EINVAL;
>> + goto cleanup;
>> + }
>> + env->subprog_info[i].type_id = data[i].type_id;
>> + }
>> +
>> + prog->aux->type_id = data[0].type_id;
>> +cleanup:
>> + kvfree(data);
>> + return ret;
>> +}
>> +
>> /* check %cur's range satisfies %old's */
>> static bool range_within(struct bpf_reg_state *old,
>> struct bpf_reg_state *cur)
>> @@ -5873,6 +5917,8 @@ static int jit_subprogs(struct bpf_verifier_env *env)
>> func[i]->aux->name[0] = 'F';
>> func[i]->aux->stack_depth = env->subprog_info[i].stack_depth;
>> func[i]->jit_requested = 1;
>> + func[i]->aux->btf = prog->aux->btf;
>> + func[i]->aux->type_id = env->subprog_info[i].type_id;
>> func[i] = bpf_int_jit_compile(func[i]);
>> if (!func[i]->jited) {
>> err = -ENOTSUPP;
>> @@ -6307,6 +6353,10 @@ int bpf_check(struct bpf_prog **prog, union bpf_attr *attr)
>> if (ret < 0)
>> goto skip_full_check;
>>
>> + ret = check_btf_func(env->prog, env, attr);
>> + if (ret < 0)
>> + goto skip_full_check;
>> +
>> ret = do_check(env);
>> if (env->cur_state) {
>> free_verifier_state(env->cur_state, true);
>> --
>> 2.17.1
>>
^ permalink raw reply
* Re: [PATCH net] r8169: fix NAPI handling under high load
From: Eric Dumazet @ 2018-10-17 0:23 UTC (permalink / raw)
To: Florian Fainelli, Stephen Hemminger, Holger Hoffstätte
Cc: Heiner Kallweit, David Miller, Realtek linux nic maintainers,
netdev@vger.kernel.org
In-Reply-To: <92db72c9-1f8c-d6a5-bcf4-241fa4c5a310@gmail.com>
On 10/16/2018 04:08 PM, Florian Fainelli wrote:
> I had started doing that about a month ago in light of the ixbge
> ndo_poll_controller vs. napi problem, but have not had time to submit
> that series yet:
>
> https://github.com/ffainelli/linux/commits/napi-check
>
> feel free to piggy back on top of that series if you would like to
> address this.
But the root cause was different, you remember ?
We fixed the real issue with netpoll ability to stick all NIC IRQ onto one
victim CPU.
^ permalink raw reply
* Re: [PATCH net] r8169: fix NAPI handling under high load
From: Eric Dumazet @ 2018-10-17 0:21 UTC (permalink / raw)
To: Stephen Hemminger, Holger Hoffstätte
Cc: Heiner Kallweit, David Miller, Realtek linux nic maintainers,
netdev@vger.kernel.org
In-Reply-To: <20181016160355.1cc0a2e9@xeon-e3>
On 10/16/2018 04:03 PM, Stephen Hemminger wrote:
> Many drivers have buggy usage of napi_complete_done.
>
> Might even be worth forcing all network drivers to check the return
> value. But fixing 150 broken drivers will be a nuisance.
>
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index dc1d9ed33b31..c38bc66ffe74 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -466,7 +466,8 @@ static inline bool napi_reschedule(struct napi_struct *napi)
> return false;
> }
>
> -bool napi_complete_done(struct napi_struct *n, int work_done);
> +bool __must_check napi_complete_done(struct napi_struct *n, int work_done);
> +
> /**
> * napi_complete - NAPI processing complete
> * @n: NAPI context
>
I disagree completely.
This never has been a requirement.
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox