* Re: [virtio-dev] Re: [RFC PATCH 1/3] qemu: virtio-bypass should explicitly bind to a passthrough device
From: Siwei Liu @ 2018-04-07 2:54 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Alexander Duyck, virtio-dev, Jiri Pirko, Michael S. Tsirkin,
Jakub Kicinski, Samudrala, Sridhar, virtualization, Netdev,
Si-Wei Liu, David Miller
In-Reply-To: <d666b8a9-9ff8-4a56-0faa-ef3d0bf81d25@redhat.com>
(click the wrong reply button again, sorry)
On Thu, Apr 5, 2018 at 8:31 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> On 04/04/2018 10:02, Siwei Liu wrote:
>>> pci_bus_num is almost always a bug if not done within
>>> a context of a PCI host, bridge, etc.
>>>
>>> In particular this will not DTRT if run before guest assigns bus
>>> numbers.
>>>
>> I was seeking means to reserve a specific pci bus slot from drivers,
>> and update the driver when guest assigns the bus number but it seems
>> there's no low-hanging fruits. Because of that reason the bus_num is
>> only obtained until it's really needed (during get_config) and I
>> assume at that point the pci bus assignment is already done. I know
>> the current one is not perfect, but we need that information (PCI
>> bus:slot.func number) to name the guest device correctly.
>
> Can you use the -device "id", and look it up as
>
> devices = container_get(qdev_get_machine(), "/peripheral");
> return object_resolve_path_component(devices, id);
No. The problem of using device id is that the vfio device may come
and go at any time, this is particularly true when live migration is
happening. There's no gurantee we can get the bus:device.func info if
that device is gone. Currently the binding between vfio and virtio-net
is weakly coupled through the backup property, there's no better way
than specifying the bus id and addr property directly.
Regards,
-Siwei
>
> ?
>
> Thanks,
>
> Paolo
^ permalink raw reply
* Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice
From: Andrew Lunn @ 2018-04-07 3:19 UTC (permalink / raw)
To: Siwei Liu
Cc: Alexander Duyck, virtio-dev, Jiri Pirko, Michael S. Tsirkin,
Jakub Kicinski, Samudrala, Sridhar, virtualization, Netdev,
David Ahern, si-wei liu, David Miller
In-Reply-To: <CADGSJ206RKi704uoLe3xVs6PyWpURPLb75OPCM_YO9DfnnEzpw@mail.gmail.com>
Hi Siwei
> I think everyone seems to agree not to fiddle with the ":" prefix, but
> rather have a new class of network subsystem under /sys/class thus a
> separate device namespace e.g. /sys/class/net-kernel for those
> auto-managed lower netdevs is needed.
How do you get a device into this new class? I don't know the Linux
driver model too well, but to get a device out of one class and into
another, i think you need to device_del(dev). modify dev->class and
then device_add(dev). However, device_add() says you are not allowed
to do this.
And i don't even see how this helps. Are you also not going to call
list_netdevice()? Are you going to add some other list for these
devices in a different class?
Andrew
^ permalink raw reply
* Re: [RFC PATCH 0/2] use larger max_request_size for virtio_blk
From: Weiping Zhang @ 2018-04-07 8:30 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, Cornelia Huck, virtualization, Michael S . Tsirkin
In-Reply-To: <04280c25-aecb-3d79-e589-5fafa974d9e0@kernel.dk>
2018-04-05 22:29 GMT+08:00 Jens Axboe <axboe@kernel.dk>:
> On 4/5/18 4:09 AM, Weiping Zhang wrote:
>> Hi,
>>
>> For virtio block device, actually there is no a hard limit for max request
>> size, and virtio_blk driver set -1 to blk_queue_max_hw_sectors(q, -1U);.
>> But it doesn't work, because there is a default upper limitation
>> BLK_DEF_MAX_SECTORS (1280 sectors). So this series want to add a new helper
>> blk_queue_max_hw_sectors_no_limit to set a proper max reqeust size.
>>
>> Weiping Zhang (2):
>> blk-setting: add new helper blk_queue_max_hw_sectors_no_limit
>> virtio_blk: add new module parameter to set max request size
>>
>> block/blk-settings.c | 20 ++++++++++++++++++++
>> drivers/block/virtio_blk.c | 32 ++++++++++++++++++++++++++++++--
>> include/linux/blkdev.h | 2 ++
>> 3 files changed, 52 insertions(+), 2 deletions(-)
>
> The driver should just use blk_queue_max_hw_sectors() to set the limit,
> and then the soft limit can be modified by a udev rule. Technically the
> driver doesn't own the software limit, it's imposed to ensure that we
> don't introduce too much latency per request.
>
> Your situation is no different from many other setups, where the
> hw limit is much higher than the default 1280k.
>
Hi Martin, Jens,
It seems more reasonable to change software limitation by udev rule,
thanks you.
>
> _______________________________________________
> Virtualization mailing list
> Virtualization@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice
From: David Miller @ 2018-04-08 16:32 UTC (permalink / raw)
To: loseweigh
Cc: alexander.h.duyck, virtio-dev, jiri, mst, kubakici,
sridhar.samudrala, virtualization, netdev, dsahern, si-wei.liu
In-Reply-To: <CADGSJ206RKi704uoLe3xVs6PyWpURPLb75OPCM_YO9DfnnEzpw@mail.gmail.com>
From: Siwei Liu <loseweigh@gmail.com>
Date: Fri, 6 Apr 2018 19:32:05 -0700
> And I assume everyone here understands the use case for live
> migration (in the context of providing cloud service) is very
> different, and we have to hide the netdevs. If not, I'm more than
> happy to clarify.
I think you still need to clarify.
netdevs are netdevs. If they have special attributes, mark them as
such and the tools base their actions upon that.
"Hiding", or changing classes, doesn't make any sense to me still.
^ permalink raw reply
* Re: [PATCH] vhost-net: set packet weight of tx polling to 2 * vq size
From: David Miller @ 2018-04-08 16:52 UTC (permalink / raw)
To: haibinzhang
Cc: kvm, mst, netdev, linux-kernel, virtualization, yunfangtai,
lidongchen
In-Reply-To: <88D661ADF6AFBF42B2AB88D8E7682B0901FC47D3@EXMBX-SZMAIL011.tencent.com>
From: haibinzhang(张海斌) <haibinzhang@tencent.com>
Date: Fri, 6 Apr 2018 08:22:37 +0000
> handle_tx will delay rx for tens or even hundreds of milliseconds when tx busy
> polling udp packets with small length(e.g. 1byte udp payload), because setting
> VHOST_NET_WEIGHT takes into account only sent-bytes but no single packet length.
>
> Ping-Latencies shown below were tested between two Virtual Machines using
> netperf (UDP_STREAM, len=1), and then another machine pinged the client:
...
> Signed-off-by: Haibin Zhang <haibinzhang@tencent.com>
> Signed-off-by: Yunfang Tai <yunfangtai@tencent.com>
> Signed-off-by: Lidong Chen <lidongchen@tencent.com>
Michael and Jason, please review.
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* Re: kernel BUG at drivers/vhost/vhost.c:LINE! (2)
From: Stefan Hajnoczi @ 2018-04-09 2:37 UTC (permalink / raw)
To: syzbot
Cc: kvm, Michael S. Tsirkin, netdev, syzkaller-bugs, linux-kernel,
Linux Virtualization
In-Reply-To: <001a11449424f11322056932b09c@google.com>
On Sat, Apr 7, 2018 at 3:02 AM, syzbot
<syzbot+65a84dde0214b0387ccd@syzkaller.appspotmail.com> wrote:
> syzbot hit the following crash on upstream commit
> 38c23685b273cfb4ccf31a199feccce3bdcb5d83 (Fri Apr 6 04:29:35 2018 +0000)
> Merge tag 'armsoc-drivers' of
> git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc
> syzbot dashboard link:
> https://syzkaller.appspot.com/bug?extid=65a84dde0214b0387ccd
To prevent duplicated work: I am working on this one.
Stefan
>
> So far this crash happened 4 times on upstream.
> C reproducer: https://syzkaller.appspot.com/x/repro.c?id=6586748079439872
> syzkaller reproducer:
> https://syzkaller.appspot.com/x/repro.syz?id=5974272052822016
> Raw console output:
> https://syzkaller.appspot.com/x/log.txt?id=6224632407392256
> Kernel config:
> https://syzkaller.appspot.com/x/.config?id=-5813481738265533882
> compiler: gcc (GCC) 8.0.1 20180301 (experimental)
>
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+65a84dde0214b0387ccd@syzkaller.appspotmail.com
> It will help syzbot understand when the bug is fixed. See footer for
> details.
> If you forward the report, please keep this part and the footer.
>
> ------------[ cut here ]------------
> kernel BUG at drivers/vhost/vhost.c:1652!
> invalid opcode: 0000 [#1] SMP KASAN
> ------------[ cut here ]------------
> Dumping ftrace buffer:
> kernel BUG at drivers/vhost/vhost.c:1652!
> (ftrace buffer empty)
> Modules linked in:
> CPU: 1 PID: 4461 Comm: syzkaller684218 Not tainted 4.16.0+ #3
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> Google 01/01/2011
> RIP: 0010:set_bit_to_user drivers/vhost/vhost.c:1652 [inline]
> RIP: 0010:log_write+0x42a/0x4d0 drivers/vhost/vhost.c:1676
> RSP: 0018:ffff8801b256f920 EFLAGS: 00010293
> RAX: ffff8801adc9e2c0 RBX: dffffc0000000000 RCX: ffffffff85924a0f
> RDX: 0000000000000000 RSI: ffffffff85924cea RDI: 0000000000000005
> RBP: ffff8801b256fa58 R08: ffff8801adc9e2c0 R09: ffffed003962412d
> R10: ffff8801b256fad8 R11: ffff8801cb12096f R12: 0001ffffffffffff
> R13: ffffed00364adf36 R14: 0000000000000000 R15: ffff8801b256fa30
> FS: 00007fdf24b19700(0000) GS:ffff8801db100000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000000020bf6000 CR3: 00000001ae6a7000 CR4: 00000000001406e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> Call Trace:
> vhost_update_used_flags+0x3af/0x4a0 drivers/vhost/vhost.c:1723
> vhost_vq_init_access+0x117/0x590 drivers/vhost/vhost.c:1763
> vhost_vsock_start drivers/vhost/vsock.c:446 [inline]
> vhost_vsock_dev_ioctl+0x751/0x920 drivers/vhost/vsock.c:678
> vfs_ioctl fs/ioctl.c:46 [inline]
> file_ioctl fs/ioctl.c:500 [inline]
> do_vfs_ioctl+0x1cf/0x1650 fs/ioctl.c:684
> ksys_ioctl+0xa9/0xd0 fs/ioctl.c:701
> SYSC_ioctl fs/ioctl.c:708 [inline]
> SyS_ioctl+0x24/0x30 fs/ioctl.c:706
> do_syscall_64+0x29e/0x9d0 arch/x86/entry/common.c:287
> entry_SYSCALL_64_after_hwframe+0x42/0xb7
> RIP: 0033:0x4456c9
> RSP: 002b:00007fdf24b18da8 EFLAGS: 00000297 ORIG_RAX: 0000000000000010
> RAX: ffffffffffffffda RBX: 00000000006dac24 RCX: 00000000004456c9
> RDX: 0000000020f82ffc RSI: 000000004004af61 RDI: 000000000000001b
> RBP: 00000000006dac20 R08: 0000000000000000 R09: 0000000000000000
> R10: 0000000000000000 R11: 0000000000000297 R12: 6b636f73762d7473
> R13: 6f68762f7665642f R14: fffffffffffffffc R15: 0000000000000007
> Code: e8 7c 5e e4 fb 4c 89 ef e8 e4 16 06 fc 48 8d 85 58 ff ff ff 48 c1 e8
> 03 c6 04 18 f8 e9 46 ff ff ff 45 31 f6 eb 91 e8 56 5e e4 fb <0f> 0b e8 4f 5e
> e4 fb 48 c7 c6 a0 a3 24 88 4c 89 ef e8 60 b6 10
> RIP: set_bit_to_user drivers/vhost/vhost.c:1652 [inline] RSP:
> ffff8801b256f920
> RIP: log_write+0x42a/0x4d0 drivers/vhost/vhost.c:1676 RSP: ffff8801b256f920
> invalid opcode: 0000 [#2] SMP KASAN
> ---[ end trace 0d0ff45aa44d8a23 ]---
> Dumping ftrace buffer:
> (ftrace buffer empty)
> Modules linked in:
>
>
> ---
> This bug is generated by a dumb bot. It may contain errors.
> See https://goo.gl/tpsmEJ for details.
> Direct all questions to syzkaller@googlegroups.com.
>
> syzbot will keep track of this bug report.
> If you forgot to add the Reported-by tag, once the fix for this bug is
> merged
> into any tree, please reply to this email with:
> #syz fix: exact-commit-title
> If you want to test a patch for this bug, please reply with:
> #syz test: git://repo/address.git branch
> and provide the patch inline or as an attachment.
> To mark this as a duplicate of another syzbot report, please reply with:
> #syz dup: exact-subject-of-another-report
> If it's a one-off invalid bug report, please reply with:
> #syz invalid
> Note: if the crash happens again, it will cause creation of a new bug
> report.
> Note: all commands must start from beginning of the line in the email body.
^ permalink raw reply
* Re: [PATCH] vhost-net: set packet weight of tx polling to 2 * vq size
From: Michael S. Tsirkin @ 2018-04-09 2:42 UTC (permalink / raw)
To: haibinzhang(张海斌)
Cc: kvm@vger.kernel.org, netdev@vger.kernel.org,
linux-kernel@vger.kernel.org,
virtualization@lists.linux-foundation.org,
yunfangtai(台运方),
lidongchen(陈立东)
In-Reply-To: <88D661ADF6AFBF42B2AB88D8E7682B0901FC47D3@EXMBX-SZMAIL011.tencent.com>
On Fri, Apr 06, 2018 at 08:22:37AM +0000, haibinzhang(张海斌) wrote:
> handle_tx will delay rx for tens or even hundreds of milliseconds when tx busy
> polling udp packets with small length(e.g. 1byte udp payload), because setting
> VHOST_NET_WEIGHT takes into account only sent-bytes but no single packet length.
>
> Ping-Latencies shown below were tested between two Virtual Machines using
> netperf (UDP_STREAM, len=1), and then another machine pinged the client:
>
> Packet-Weight Ping-Latencies(millisecond)
> min avg max
> Origin 3.319 18.489 57.303
> 64 1.643 2.021 2.552
> 128 1.825 2.600 3.224
> 256 1.997 2.710 4.295
> 512 1.860 3.171 4.631
> 1024 2.002 4.173 9.056
> 2048 2.257 5.650 9.688
> 4096 2.093 8.508 15.943
And this is with Q size 256 right?
> Ring size is a hint from device about a burst size it can tolerate. Based on
> benchmarks, set the weight to 2 * vq size.
>
> To evaluate this change, another tests were done using netperf(RR, TX) between
> two machines with Intel(R) Xeon(R) Gold 6133 CPU @ 2.50GHz, and vq size was
> tweaked through qemu. Results shown below does not show obvious changes.
What I asked for is ping-latency with different VQ sizes,
streaming below does not show anything.
> vq size=256 TCP_RR vq size=512 TCP_RR
> size/sessions/+thu%/+normalize% size/sessions/+thu%/+normalize%
> 1/ 1/ -7%/ -2% 1/ 1/ 0%/ -2%
> 1/ 4/ +1%/ 0% 1/ 4/ +1%/ 0%
> 1/ 8/ +1%/ -2% 1/ 8/ 0%/ +1%
> 64/ 1/ -6%/ 0% 64/ 1/ +7%/ +3%
> 64/ 4/ 0%/ +2% 64/ 4/ -1%/ +1%
> 64/ 8/ 0%/ 0% 64/ 8/ -1%/ -2%
> 256/ 1/ -3%/ -4% 256/ 1/ -4%/ -2%
> 256/ 4/ +3%/ +4% 256/ 4/ +1%/ +2%
> 256/ 8/ +2%/ 0% 256/ 8/ +1%/ -1%
>
> vq size=256 UDP_RR vq size=512 UDP_RR
> size/sessions/+thu%/+normalize% size/sessions/+thu%/+normalize%
> 1/ 1/ -5%/ +1% 1/ 1/ -3%/ -2%
> 1/ 4/ +4%/ +1% 1/ 4/ -2%/ +2%
> 1/ 8/ -1%/ -1% 1/ 8/ -1%/ 0%
> 64/ 1/ -2%/ -3% 64/ 1/ +1%/ +1%
> 64/ 4/ -5%/ -1% 64/ 4/ +2%/ 0%
> 64/ 8/ 0%/ -1% 64/ 8/ -2%/ +1%
> 256/ 1/ +7%/ +1% 256/ 1/ -7%/ 0%
> 256/ 4/ +1%/ +1% 256/ 4/ -3%/ -4%
> 256/ 8/ +2%/ +2% 256/ 8/ +1%/ +1%
>
> vq size=256 TCP_STREAM vq size=512 TCP_STREAM
> size/sessions/+thu%/+normalize% size/sessions/+thu%/+normalize%
> 64/ 1/ 0%/ -3% 64/ 1/ 0%/ 0%
> 64/ 4/ +3%/ -1% 64/ 4/ -2%/ +4%
> 64/ 8/ +9%/ -4% 64/ 8/ -1%/ +2%
> 256/ 1/ +1%/ -4% 256/ 1/ +1%/ +1%
> 256/ 4/ -1%/ -1% 256/ 4/ -3%/ 0%
> 256/ 8/ +7%/ +5% 256/ 8/ -3%/ 0%
> 512/ 1/ +1%/ 0% 512/ 1/ -1%/ -1%
> 512/ 4/ +1%/ -1% 512/ 4/ 0%/ 0%
> 512/ 8/ +7%/ -5% 512/ 8/ +6%/ -1%
> 1024/ 1/ 0%/ -1% 1024/ 1/ 0%/ +1%
> 1024/ 4/ +3%/ 0% 1024/ 4/ +1%/ 0%
> 1024/ 8/ +8%/ +5% 1024/ 8/ -1%/ 0%
> 2048/ 1/ +2%/ +2% 2048/ 1/ -1%/ 0%
> 2048/ 4/ +1%/ 0% 2048/ 4/ 0%/ -1%
> 2048/ 8/ -2%/ 0% 2048/ 8/ 5%/ -1%
> 4096/ 1/ -2%/ 0% 4096/ 1/ -2%/ 0%
> 4096/ 4/ +2%/ 0% 4096/ 4/ 0%/ 0%
> 4096/ 8/ +9%/ -2% 4096/ 8/ -5%/ -1%
>
> Signed-off-by: Haibin Zhang <haibinzhang@tencent.com>
> Signed-off-by: Yunfang Tai <yunfangtai@tencent.com>
> Signed-off-by: Lidong Chen <lidongchen@tencent.com>
Code is fine but I'd like to see validation of the heuristic
2*vq->num with another vq size.
> ---
> drivers/vhost/net.c | 8 +++++++-
> 1 file changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index 8139bc70ad7d..3563a305cc0a 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -44,6 +44,10 @@ MODULE_PARM_DESC(experimental_zcopytx, "Enable Zero Copy TX;"
> * Using this limit prevents one virtqueue from starving others. */
> #define VHOST_NET_WEIGHT 0x80000
>
> +/* Max number of packets transferred before requeueing the job.
> + * Using this limit prevents one virtqueue from starving rx. */
> +#define VHOST_NET_PKT_WEIGHT(vq) ((vq)->num * 2)
> +
> /* MAX number of TX used buffers for outstanding zerocopy */
> #define VHOST_MAX_PEND 128
> #define VHOST_GOODCOPY_LEN 256
> @@ -473,6 +477,7 @@ static void handle_tx(struct vhost_net *net)
> struct socket *sock;
> struct vhost_net_ubuf_ref *uninitialized_var(ubufs);
> bool zcopy, zcopy_used;
> + int sent_pkts = 0;
>
> mutex_lock(&vq->mutex);
> sock = vq->private_data;
> @@ -580,7 +585,8 @@ static void handle_tx(struct vhost_net *net)
> else
> vhost_zerocopy_signal_used(net, vq);
> vhost_net_tx_packet(net);
> - if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
> + if (unlikely(total_len >= VHOST_NET_WEIGHT) ||
> + unlikely(++sent_pkts >= VHOST_NET_PKT_WEIGHT(vq))) {
> vhost_poll_queue(&vq->poll);
> break;
> }
> --
> 2.12.3
>
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* Re: kernel BUG at drivers/vhost/vhost.c:LINE! (2)
From: Michael S. Tsirkin @ 2018-04-09 2:44 UTC (permalink / raw)
To: Stefan Hajnoczi
Cc: kvm, netdev, syzkaller-bugs, syzbot, linux-kernel,
Linux Virtualization
In-Reply-To: <CAJSP0QVHk7NSQcjfLWq13=1Vzm=_vtmKZqp4dDjuh8ETLO5g_g@mail.gmail.com>
On Mon, Apr 09, 2018 at 10:37:45AM +0800, Stefan Hajnoczi wrote:
> On Sat, Apr 7, 2018 at 3:02 AM, syzbot
> <syzbot+65a84dde0214b0387ccd@syzkaller.appspotmail.com> wrote:
> > syzbot hit the following crash on upstream commit
> > 38c23685b273cfb4ccf31a199feccce3bdcb5d83 (Fri Apr 6 04:29:35 2018 +0000)
> > Merge tag 'armsoc-drivers' of
> > git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc
> > syzbot dashboard link:
> > https://syzkaller.appspot.com/bug?extid=65a84dde0214b0387ccd
>
> To prevent duplicated work: I am working on this one.
>
> Stefan
Do you want to try this patchset:
https://lkml.org/lkml/2018/4/5/665
?
--
MST
^ permalink raw reply
* Re: kernel BUG at drivers/vhost/vhost.c:LINE! (2)
From: Stefan Hajnoczi @ 2018-04-09 3:28 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: kvm, netdev, syzkaller-bugs, syzbot, linux-kernel,
Linux Virtualization
In-Reply-To: <20180409054316-mutt-send-email-mst@kernel.org>
[-- Attachment #1.1: Type: text/plain, Size: 961 bytes --]
On Mon, Apr 09, 2018 at 05:44:36AM +0300, Michael S. Tsirkin wrote:
> On Mon, Apr 09, 2018 at 10:37:45AM +0800, Stefan Hajnoczi wrote:
> > On Sat, Apr 7, 2018 at 3:02 AM, syzbot
> > <syzbot+65a84dde0214b0387ccd@syzkaller.appspotmail.com> wrote:
> > > syzbot hit the following crash on upstream commit
> > > 38c23685b273cfb4ccf31a199feccce3bdcb5d83 (Fri Apr 6 04:29:35 2018 +0000)
> > > Merge tag 'armsoc-drivers' of
> > > git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc
> > > syzbot dashboard link:
> > > https://syzkaller.appspot.com/bug?extid=65a84dde0214b0387ccd
> >
> > To prevent duplicated work: I am working on this one.
> >
> > Stefan
>
> Do you want to try this patchset:
> https://lkml.org/lkml/2018/4/5/665
>
> ?
Thanks, I'll give it a shot.
I also noticed a regression in commit
d65026c6c62e7d9616c8ceb5a53b68bcdc050525 ("vhost: validate log when
IOTLB is enabled") and am currently testing a fix.
Stefan
[-- Attachment #1.2: signature.asc --]
[-- Type: application/pgp-signature, Size: 455 bytes --]
[-- Attachment #2: Type: text/plain, Size: 183 bytes --]
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* Re: [PATCH] vhost-net: set packet weight of tx polling to 2 * vq size
From: Michael S. Tsirkin @ 2018-04-09 5:46 UTC (permalink / raw)
To: haibinzhang(张海斌)
Cc: kvm@vger.kernel.org, netdev@vger.kernel.org,
linux-kernel@vger.kernel.org,
virtualization@lists.linux-foundation.org,
yunfangtai(台运方),
lidongchen(陈立东)
In-Reply-To: <88D661ADF6AFBF42B2AB88D8E7682B0901FC49BE@EXMBX-SZMAIL011.tencent.com>
On Mon, Apr 09, 2018 at 04:09:20AM +0000, haibinzhang(张海斌) wrote:
>
> > On Fri, Apr 06, 2018 at 08:22:37AM +0000, haibinzhang(张海斌) wrote:
> > > handle_tx will delay rx for tens or even hundreds of milliseconds when tx busy
> > > polling udp packets with small length(e.g. 1byte udp payload), because setting
> > > VHOST_NET_WEIGHT takes into account only sent-bytes but no single packet length.
> > >
> > > Ping-Latencies shown below were tested between two Virtual Machines using
> > > netperf (UDP_STREAM, len=1), and then another machine pinged the client:
> > >
> > > Packet-Weight Ping-Latencies(millisecond)
> > > min avg max
> > > Origin 3.319 18.489 57.303
> > > 64 1.643 2.021 2.552
> > > 128 1.825 2.600 3.224
> > > 256 1.997 2.710 4.295
> > > 512 1.860 3.171 4.631
> > > 1024 2.002 4.173 9.056
> > > 2048 2.257 5.650 9.688
> > > 4096 2.093 8.508 15.943
> >
> > And this is with Q size 256 right?
>
> Yes. Ping-latencies with 512 VQ size show below.
>
> Packet-Weight Ping-Latencies(millisecond)
> min avg max
> Origin 6.357 29.177 66.245
> 64 2.798 3.614 4.403
> 128 2.861 3.820 4.775
> 256 3.008 4.018 4.807
> 512 3.254 4.523 5.824
> 1024 3.079 5.335 7.747
> 2048 3.944 8.201 12.762
> 4096 4.158 11.057 19.985
>
> We will submit again. Is there anything else?
Seems pretty consistent, a small dip at 2 VQ sizes.
Acked-by: Michael S. Tsirkin <mst@redhat.com>
> >
> > > Ring size is a hint from device about a burst size it can tolerate. Based on
> > > benchmarks, set the weight to 2 * vq size.
> > >
> > > To evaluate this change, another tests were done using netperf(RR, TX) between
> > > two machines with Intel(R) Xeon(R) Gold 6133 CPU @ 2.50GHz, and vq size was
> > > tweaked through qemu. Results shown below does not show obvious changes.
> >
> > What I asked for is ping-latency with different VQ sizes,
> > streaming below does not show anything.
> >
> > > vq size=256 TCP_RR vq size=512 TCP_RR
> > > size/sessions/+thu%/+normalize% size/sessions/+thu%/+normalize%
> > > 1/ 1/ -7%/ -2% 1/ 1/ 0%/ -2%
> > > 1/ 4/ +1%/ 0% 1/ 4/ +1%/ 0%
> > > 1/ 8/ +1%/ -2% 1/ 8/ 0%/ +1%
> > > 64/ 1/ -6%/ 0% 64/ 1/ +7%/ +3%
> > > 64/ 4/ 0%/ +2% 64/ 4/ -1%/ +1%
> > > 64/ 8/ 0%/ 0% 64/ 8/ -1%/ -2%
> > > 256/ 1/ -3%/ -4% 256/ 1/ -4%/ -2%
> > > 256/ 4/ +3%/ +4% 256/ 4/ +1%/ +2%
> > > 256/ 8/ +2%/ 0% 256/ 8/ +1%/ -1%
> > >
> > > vq size=256 UDP_RR vq size=512 UDP_RR
> > > size/sessions/+thu%/+normalize% size/sessions/+thu%/+normalize%
> > > 1/ 1/ -5%/ +1% 1/ 1/ -3%/ -2%
> > > 1/ 4/ +4%/ +1% 1/ 4/ -2%/ +2%
> > > 1/ 8/ -1%/ -1% 1/ 8/ -1%/ 0%
> > > 64/ 1/ -2%/ -3% 64/ 1/ +1%/ +1%
> > > 64/ 4/ -5%/ -1% 64/ 4/ +2%/ 0%
> > > 64/ 8/ 0%/ -1% 64/ 8/ -2%/ +1%
> > > 256/ 1/ +7%/ +1% 256/ 1/ -7%/ 0%
> > > 256/ 4/ +1%/ +1% 256/ 4/ -3%/ -4%
> > > 256/ 8/ +2%/ +2% 256/ 8/ +1%/ +1%
> > >
> > > vq size=256 TCP_STREAM vq size=512 TCP_STREAM
> > > size/sessions/+thu%/+normalize% size/sessions/+thu%/+normalize%
> > > 64/ 1/ 0%/ -3% 64/ 1/ 0%/ 0%
> > > 64/ 4/ +3%/ -1% 64/ 4/ -2%/ +4%
> > > 64/ 8/ +9%/ -4% 64/ 8/ -1%/ +2%
> > > 256/ 1/ +1%/ -4% 256/ 1/ +1%/ +1%
> > > 256/ 4/ -1%/ -1% 256/ 4/ -3%/ 0%
> > > 256/ 8/ +7%/ +5% 256/ 8/ -3%/ 0%
> > > 512/ 1/ +1%/ 0% 512/ 1/ -1%/ -1%
> > > 512/ 4/ +1%/ -1% 512/ 4/ 0%/ 0%
> > > 512/ 8/ +7%/ -5% 512/ 8/ +6%/ -1%
> > > 1024/ 1/ 0%/ -1% 1024/ 1/ 0%/ +1%
> > > 1024/ 4/ +3%/ 0% 1024/ 4/ +1%/ 0%
> > > 1024/ 8/ +8%/ +5% 1024/ 8/ -1%/ 0%
> > > 2048/ 1/ +2%/ +2% 2048/ 1/ -1%/ 0%
> > > 2048/ 4/ +1%/ 0% 2048/ 4/ 0%/ -1%
> > > 2048/ 8/ -2%/ 0% 2048/ 8/ 5%/ -1%
> > > 4096/ 1/ -2%/ 0% 4096/ 1/ -2%/ 0%
> > > 4096/ 4/ +2%/ 0% 4096/ 4/ 0%/ 0%
> > > 4096/ 8/ +9%/ -2% 4096/ 8/ -5%/ -1%
> > >
> > > Signed-off-by: Haibin Zhang <haibinzhang@tencent.com>
> > > Signed-off-by: Yunfang Tai <yunfangtai@tencent.com>
> > > Signed-off-by: Lidong Chen <lidongchen@tencent.com>
> >
> > Code is fine but I'd like to see validation of the heuristic
> > 2*vq->num with another vq size.
> >
> >
> >
> > > ---
> > > drivers/vhost/net.c | 8 +++++++-
> > > 1 file changed, 7 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> > > index 8139bc70ad7d..3563a305cc0a 100644
> > > --- a/drivers/vhost/net.c
> > > +++ b/drivers/vhost/net.c
> > > @@ -44,6 +44,10 @@ MODULE_PARM_DESC(experimental_zcopytx, "Enable Zero Copy TX;"
> > > * Using this limit prevents one virtqueue from starving others. */
> > > #define VHOST_NET_WEIGHT 0x80000
> > >
> > > +/* Max number of packets transferred before requeueing the job.
> > > + * Using this limit prevents one virtqueue from starving rx. */
> > > +#define VHOST_NET_PKT_WEIGHT(vq) ((vq)->num * 2)
> > > +
> > > /* MAX number of TX used buffers for outstanding zerocopy */
> > > #define VHOST_MAX_PEND 128
> > > #define VHOST_GOODCOPY_LEN 256
> > > @@ -473,6 +477,7 @@ static void handle_tx(struct vhost_net *net)
> > > struct socket *sock;
> > > struct vhost_net_ubuf_ref *uninitialized_var(ubufs);
> > > bool zcopy, zcopy_used;
> > > + int sent_pkts = 0;
> > >
> > > mutex_lock(&vq->mutex);
> > > sock = vq->private_data;
> > > @@ -580,7 +585,8 @@ static void handle_tx(struct vhost_net *net)
> > > else
> > > vhost_zerocopy_signal_used(net, vq);
> > > vhost_net_tx_packet(net);
> > > - if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
> > > + if (unlikely(total_len >= VHOST_NET_WEIGHT) ||
> > > + unlikely(++sent_pkts >= VHOST_NET_PKT_WEIGHT(vq))) {
> > > vhost_poll_queue(&vq->poll);
> > > break;
> > > }
> > > --
> > > 2.12.3
> > >
>
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* Re: [PATCH v31 2/4] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT
From: Michael S. Tsirkin @ 2018-04-09 6:03 UTC (permalink / raw)
To: Wei Wang
Cc: yang.zhang.wz, virtio-dev, riel, quan.xu0, kvm, nilal,
liliang.opensource, linux-kernel, mhocko, linux-mm, huangzhichao,
pbonzini, akpm, virtualization
In-Reply-To: <1523017045-18315-3-git-send-email-wei.w.wang@intel.com>
On Fri, Apr 06, 2018 at 08:17:23PM +0800, Wei Wang wrote:
> Negotiation of the VIRTIO_BALLOON_F_FREE_PAGE_HINT feature indicates the
> support of reporting hints of guest free pages to host via virtio-balloon.
>
> Host requests the guest to report free page hints by sending a new cmd
> id to the guest via the free_page_report_cmd_id configuration register.
>
> When the guest starts to report, the first element added to the free page
> vq is the cmd id given by host. When the guest finishes the reporting
> of all the free pages, VIRTIO_BALLOON_FREE_PAGE_REPORT_STOP_ID is added
> to the vq to tell host that the reporting is done. Host polls the free
> page vq after sending the starting cmd id, so the guest doesn't need to
> kick after filling an element to the vq.
>
> Host may also requests the guest to stop the reporting in advance by
> sending the stop cmd id to the guest via the configuration register.
>
> Signed-off-by: Wei Wang <wei.w.wang@intel.com>
> Signed-off-by: Liang Li <liang.z.li@intel.com>
> Cc: Michael S. Tsirkin <mst@redhat.com>
> Cc: Michal Hocko <mhocko@kernel.org>
Pretty good by now, Minor comments below.
> ---
> drivers/virtio/virtio_balloon.c | 272 +++++++++++++++++++++++++++++++-----
> include/uapi/linux/virtio_balloon.h | 4 +
> 2 files changed, 240 insertions(+), 36 deletions(-)
>
> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> index dfe5684..aef73ee 100644
> --- a/drivers/virtio/virtio_balloon.c
> +++ b/drivers/virtio/virtio_balloon.c
> @@ -51,9 +51,22 @@ MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
> static struct vfsmount *balloon_mnt;
> #endif
>
> +enum virtio_balloon_vq {
> + VIRTIO_BALLOON_VQ_INFLATE,
> + VIRTIO_BALLOON_VQ_DEFLATE,
> + VIRTIO_BALLOON_VQ_STATS,
> + VIRTIO_BALLOON_VQ_FREE_PAGE,
> + VIRTIO_BALLOON_VQ_MAX
> +};
> +
> struct virtio_balloon {
> struct virtio_device *vdev;
> - struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
> + struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *free_page_vq;
> +
> + /* Balloon's own wq for cpu-intensive work items */
> + struct workqueue_struct *balloon_wq;
> + /* The free page reporting work item submitted to the balloon wq */
> + struct work_struct report_free_page_work;
>
> /* The balloon servicing is delegated to a freezable workqueue. */
> struct work_struct update_balloon_stats_work;
> @@ -63,6 +76,13 @@ struct virtio_balloon {
> spinlock_t stop_update_lock;
> bool stop_update;
>
> + /* The new cmd id received from host */
> + uint32_t cmd_id_received;
> + /* The cmd id that is in use */
> + __virtio32 cmd_id_use;
I'd prefer cmd_id_active but it's not critical.
> + /* Buffer to store the stop sign */
> + __virtio32 stop_cmd_id;
> +
> /* Waiting for host to ack the pages we released. */
> wait_queue_head_t acked;
>
> @@ -320,17 +340,6 @@ static void stats_handle_request(struct virtio_balloon *vb)
> virtqueue_kick(vq);
> }
>
> -static void virtballoon_changed(struct virtio_device *vdev)
> -{
> - struct virtio_balloon *vb = vdev->priv;
> - unsigned long flags;
> -
> - spin_lock_irqsave(&vb->stop_update_lock, flags);
> - if (!vb->stop_update)
> - queue_work(system_freezable_wq, &vb->update_balloon_size_work);
> - spin_unlock_irqrestore(&vb->stop_update_lock, flags);
> -}
> -
> static inline s64 towards_target(struct virtio_balloon *vb)
> {
> s64 target;
> @@ -347,6 +356,34 @@ static inline s64 towards_target(struct virtio_balloon *vb)
> return target - vb->num_pages;
> }
>
> +static void virtballoon_changed(struct virtio_device *vdev)
> +{
> + struct virtio_balloon *vb = vdev->priv;
> + unsigned long flags;
> + s64 diff = towards_target(vb);
> +
> + if (diff) {
> + spin_lock_irqsave(&vb->stop_update_lock, flags);
> + if (!vb->stop_update)
> + queue_work(system_freezable_wq,
> + &vb->update_balloon_size_work);
> + spin_unlock_irqrestore(&vb->stop_update_lock, flags);
> + }
> +
> + if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_FREE_PAGE_HINT)) {
> + virtio_cread(vdev, struct virtio_balloon_config,
> + free_page_report_cmd_id, &vb->cmd_id_received);
> + if (vb->cmd_id_received !=
> + VIRTIO_BALLOON_FREE_PAGE_REPORT_STOP_ID) {
> + spin_lock_irqsave(&vb->stop_update_lock, flags);
> + if (!vb->stop_update)
> + queue_work(vb->balloon_wq,
> + &vb->report_free_page_work);
> + spin_unlock_irqrestore(&vb->stop_update_lock, flags);
> + }
> + }
> +}
> +
> static void update_balloon_size(struct virtio_balloon *vb)
> {
> u32 actual = vb->num_pages;
> @@ -421,42 +458,178 @@ static void update_balloon_size_func(struct work_struct *work)
>
> static int init_vqs(struct virtio_balloon *vb)
> {
> - struct virtqueue *vqs[3];
> - vq_callback_t *callbacks[] = { balloon_ack, balloon_ack, stats_request };
> - static const char * const names[] = { "inflate", "deflate", "stats" };
> - int err, nvqs;
> + struct virtqueue *vqs[VIRTIO_BALLOON_VQ_MAX];
> + vq_callback_t *callbacks[VIRTIO_BALLOON_VQ_MAX];
> + const char *names[VIRTIO_BALLOON_VQ_MAX];
> + struct scatterlist sg;
> + int ret;
>
> /*
> - * We expect two virtqueues: inflate and deflate, and
> - * optionally stat.
> + * Inflateq and deflateq are used unconditionally. The names[]
> + * will be NULL if the related feature is not enabled, which will
> + * cause no allocation for the corresponding virtqueue in find_vqs.
> */
> - nvqs = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ) ? 3 : 2;
> - err = virtio_find_vqs(vb->vdev, nvqs, vqs, callbacks, names, NULL);
> - if (err)
> - return err;
> + callbacks[VIRTIO_BALLOON_VQ_INFLATE] = balloon_ack;
> + names[VIRTIO_BALLOON_VQ_INFLATE] = "inflate";
> + callbacks[VIRTIO_BALLOON_VQ_DEFLATE] = balloon_ack;
> + names[VIRTIO_BALLOON_VQ_DEFLATE] = "deflate";
> + names[VIRTIO_BALLOON_VQ_STATS] = NULL;
> + names[VIRTIO_BALLOON_VQ_FREE_PAGE] = NULL;
>
> - vb->inflate_vq = vqs[0];
> - vb->deflate_vq = vqs[1];
> if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
> - struct scatterlist sg;
> - unsigned int num_stats;
> - vb->stats_vq = vqs[2];
> + names[VIRTIO_BALLOON_VQ_STATS] = "stats";
> + callbacks[VIRTIO_BALLOON_VQ_STATS] = stats_request;
> + }
>
> + if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_FREE_PAGE_HINT)) {
> + names[VIRTIO_BALLOON_VQ_FREE_PAGE] = "free_page_vq";
> + callbacks[VIRTIO_BALLOON_VQ_FREE_PAGE] = NULL;
> + }
> +
> + ret = vb->vdev->config->find_vqs(vb->vdev, VIRTIO_BALLOON_VQ_MAX,
> + vqs, callbacks, names, NULL, NULL);
> + if (ret)
> + return ret;
> +
> + vb->inflate_vq = vqs[VIRTIO_BALLOON_VQ_INFLATE];
> + vb->deflate_vq = vqs[VIRTIO_BALLOON_VQ_DEFLATE];
> + if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
> + vb->stats_vq = vqs[VIRTIO_BALLOON_VQ_STATS];
> /*
> * Prime this virtqueue with one buffer so the hypervisor can
> * use it to signal us later (it can't be broken yet!).
> */
> - num_stats = update_balloon_stats(vb);
> -
> - sg_init_one(&sg, vb->stats, sizeof(vb->stats[0]) * num_stats);
> - if (virtqueue_add_outbuf(vb->stats_vq, &sg, 1, vb, GFP_KERNEL)
> - < 0)
> - BUG();
> + sg_init_one(&sg, vb->stats, sizeof(vb->stats));
> + ret = virtqueue_add_outbuf(vb->stats_vq, &sg, 1, vb,
> + GFP_KERNEL);
> + if (ret) {
> + dev_warn(&vb->vdev->dev, "%s: add stat_vq failed\n",
> + __func__);
> + return ret;
> + }
> virtqueue_kick(vb->stats_vq);
> }
> +
> + if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_FREE_PAGE_HINT))
> + vb->free_page_vq = vqs[VIRTIO_BALLOON_VQ_FREE_PAGE];
> +
> return 0;
> }
>
> +static int add_one_sg(struct virtqueue *vq, unsigned long pfn, uint32_t len)
> +{
> + struct scatterlist sg;
> + unsigned int unused;
> +
> + sg_init_table(&sg, 1);
> + sg_set_page(&sg, pfn_to_page(pfn), len, 0);
> +
> + /* Detach all the used buffers from the vq */
> + while (virtqueue_get_buf(vq, &unused))
> + ;
> +
> + /*
> + * Since this is an optimization feature, losing a couple of free
> + * pages to report isn't important. We simply return without adding
> + * this page hint if the vq is full.
> + * We are adding one entry each time, which essentially results in no
> + * memory allocation, so the GFP_KERNEL flag below can be ignored.
> + * Host works by polling the free page vq for hints after sending the
> + * starting cmd id, so the driver doesn't need to kick after filling
> + * the vq.
> + * Lastly, there is always one entry reserved for the cmd id to use.
> + *
> + * TODO: The current implementation could be further improved by
> + * stopping the reporting when the vq is full and continuing the
> + * reporting when host notifies the driver that entries have been
> + * used.
> + */
> + if (vq->num_free > 1)
> + return virtqueue_add_inbuf(vq, &sg, 1, vq, GFP_KERNEL);
> +
> + return 0;
> +}
> +
> +static int virtio_balloon_send_free_pages(void *opaque, unsigned long pfn,
> + unsigned long nr_pages)
> +{
> + struct virtio_balloon *vb = (struct virtio_balloon *)opaque;
> + uint32_t len = nr_pages << PAGE_SHIFT;
> +
> + /*
> + * If a stop id or a new cmd id was just received from host, stop
> + * the reporting, and return -EINTR to indicate an active stop.
> + *
> + * Ideally, we could have cmd_id_received accessed under locks, which
> + * ensures that no more entries are added to the vq once a stop cmd
> + * id is received from host. This requires host to wait for the
> + * driver's ACK about finishing the update of cmd_id_received. But
> + * this is not how host side works, because host doesn't work on an
> + * assumption that the driver would always be responsive. So
> + * theorically, there are possibilities that some entries may stay in
> + * the vq when host has exited the optimization. This isn't an issue,
> + * because entries simply contain guest physical addresses. There is no
> + * allocated memory that needs to be freed or dma mapped pages that
> + * need to be unmapped since virtio-balloon works with
> + * VIRTIO_F_IOMMU_PLATFORM disabled.
> + */
> + if (virtio32_to_cpu(vb->vdev, vb->cmd_id_use) != vb->cmd_id_received)
> + return -EINTR;
> +
> + return add_one_sg(vb->free_page_vq, pfn, len);
> +}
> +
> +static int send_start_cmd_id(struct virtio_balloon *vb, uint32_t cmd_id)
> +{
> + struct scatterlist sg;
> + struct virtqueue *vq = vb->free_page_vq;
> +
> + vb->cmd_id_use = cpu_to_virtio32(vb->vdev, cmd_id);
> + sg_init_one(&sg, &vb->cmd_id_use, sizeof(vb->cmd_id_use));
> + return virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
> +}
> +
> +static int send_stop_cmd_id(struct virtio_balloon *vb)
> +{
> + struct scatterlist sg;
> + struct virtqueue *vq = vb->free_page_vq;
> +
> + sg_init_one(&sg, &vb->stop_cmd_id, sizeof(vb->stop_cmd_id));
> + return virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
> +}
> +
> +static void report_free_page_func(struct work_struct *work)
> +{
> + struct virtio_balloon *vb;
> + struct virtqueue *vq;
> + unsigned int unused;
> + int ret;
> +
> + vb = container_of(work, struct virtio_balloon, report_free_page_work);
> + vq = vb->free_page_vq;
> +
> + /* Start by sending the received cmd id to host with an outbuf. */
> + ret = send_start_cmd_id(vb, vb->cmd_id_received);
> + if (unlikely(ret))
> + goto err;
> +
> + ret = walk_free_mem_block(vb, 0, &virtio_balloon_send_free_pages);
> + if (unlikely(ret == -EIO))
> + goto err;
why is EIO special? I think you should special-case EINTR maybe.
> +
> + /* End by sending a stop id to host with an outbuf. */
> + ret = send_stop_cmd_id(vb);
> + if (likely(!ret)) {
What happens on failure? Don't we need to detach anyway?
> + /* Ending: detach all the used buffers from the vq. */
> + while (vq->num_free != virtqueue_get_vring_size(vq))
> + virtqueue_get_buf(vq, &unused);
This isn't all that happens here. It also waits for buffers to
be consumed. Is this by design? And why is it good idea to
busy poll while doing it?
> + return;
Better move this return outside. Fall through to error is confusing.
> + }
> +err:
> + dev_err(&vb->vdev->dev, "%s: free page vq failure, ret=%d\n",
> + __func__, ret);
> +}
> +
> #ifdef CONFIG_BALLOON_COMPACTION
> /*
> * virtballoon_migratepage - perform the balloon page migration on behalf of
> @@ -570,18 +743,36 @@ static int virtballoon_probe(struct virtio_device *vdev)
> if (err)
> goto out_free_vb;
>
> + if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_FREE_PAGE_HINT)) {
> + /*
> + * There is always one entry reserved for cmd id, so the ring
> + * size needs to be at least two to report free page hints.
> + */
> + if (virtqueue_get_vring_size(vb->free_page_vq) < 2)
> + goto out_free_vb;
> + vb->balloon_wq = alloc_workqueue("balloon-wq",
> + WQ_FREEZABLE | WQ_CPU_INTENSIVE, 0);
> + if (!vb->balloon_wq) {
> + err = -ENOMEM;
> + goto out_del_vqs;
> + }
> + vb->stop_cmd_id = cpu_to_virtio32(vb->vdev,
> + VIRTIO_BALLOON_FREE_PAGE_REPORT_STOP_ID);
> + INIT_WORK(&vb->report_free_page_work, report_free_page_func);
> + }
> +
> vb->nb.notifier_call = virtballoon_oom_notify;
> vb->nb.priority = VIRTBALLOON_OOM_NOTIFY_PRIORITY;
> err = register_oom_notifier(&vb->nb);
> if (err < 0)
> - goto out_del_vqs;
> + goto out_del_balloon_wq;
>
> #ifdef CONFIG_BALLOON_COMPACTION
> balloon_mnt = kern_mount(&balloon_fs);
> if (IS_ERR(balloon_mnt)) {
> err = PTR_ERR(balloon_mnt);
> unregister_oom_notifier(&vb->nb);
> - goto out_del_vqs;
> + goto out_del_balloon_wq;
> }
>
> vb->vb_dev_info.migratepage = virtballoon_migratepage;
> @@ -591,7 +782,7 @@ static int virtballoon_probe(struct virtio_device *vdev)
> kern_unmount(balloon_mnt);
> unregister_oom_notifier(&vb->nb);
> vb->vb_dev_info.inode = NULL;
> - goto out_del_vqs;
> + goto out_del_balloon_wq;
> }
> vb->vb_dev_info.inode->i_mapping->a_ops = &balloon_aops;
> #endif
> @@ -602,6 +793,9 @@ static int virtballoon_probe(struct virtio_device *vdev)
> virtballoon_changed(vdev);
> return 0;
>
> +out_del_balloon_wq:
> + if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_FREE_PAGE_HINT))
> + destroy_workqueue(vb->balloon_wq);
> out_del_vqs:
> vdev->config->del_vqs(vdev);
> out_free_vb:
> @@ -635,6 +829,11 @@ static void virtballoon_remove(struct virtio_device *vdev)
> cancel_work_sync(&vb->update_balloon_size_work);
> cancel_work_sync(&vb->update_balloon_stats_work);
>
> + if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_FREE_PAGE_HINT)) {
> + cancel_work_sync(&vb->report_free_page_work);
> + destroy_workqueue(vb->balloon_wq);
> + }
> +
> remove_common(vb);
> #ifdef CONFIG_BALLOON_COMPACTION
> if (vb->vb_dev_info.inode)
> @@ -686,6 +885,7 @@ static unsigned int features[] = {
> VIRTIO_BALLOON_F_MUST_TELL_HOST,
> VIRTIO_BALLOON_F_STATS_VQ,
> VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
> + VIRTIO_BALLOON_F_FREE_PAGE_HINT,
> };
>
> static struct virtio_driver virtio_balloon_driver = {
> diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
> index 4e8b830..b2d86c2 100644
> --- a/include/uapi/linux/virtio_balloon.h
> +++ b/include/uapi/linux/virtio_balloon.h
> @@ -34,15 +34,19 @@
> #define VIRTIO_BALLOON_F_MUST_TELL_HOST 0 /* Tell before reclaiming pages */
> #define VIRTIO_BALLOON_F_STATS_VQ 1 /* Memory Stats virtqueue */
> #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM 2 /* Deflate balloon on OOM */
> +#define VIRTIO_BALLOON_F_FREE_PAGE_HINT 3 /* VQ to report free pages */
>
> /* Size of a PFN in the balloon interface. */
> #define VIRTIO_BALLOON_PFN_SHIFT 12
>
> +#define VIRTIO_BALLOON_FREE_PAGE_REPORT_STOP_ID 0
> struct virtio_balloon_config {
> /* Number of pages host wants Guest to give up. */
> __u32 num_pages;
> /* Number of pages we've actually got in balloon. */
> __u32 actual;
> + /* Free page report command id, readonly by guest */
> + __u32 free_page_report_cmd_id;
> };
>
> #define VIRTIO_BALLOON_S_SWAP_IN 0 /* Amount of memory swapped in */
> --
> 2.7.4
^ permalink raw reply
* Re: [RFC PATCH net-next v5 3/4] virtio_net: Extend virtio to use VF datapath when available
From: Jiri Pirko @ 2018-04-09 8:07 UTC (permalink / raw)
To: Samudrala, Sridhar
Cc: alexander.h.duyck, virtio-dev, mst, kubakici, netdev,
virtualization, loseweigh, davem
In-Reply-To: <bb4f0821-246f-1052-de8b-d96be3388ed3@intel.com>
Sat, Apr 07, 2018 at 12:59:14AM CEST, sridhar.samudrala@intel.com wrote:
>On 4/6/2018 5:48 AM, Jiri Pirko wrote:
>> Thu, Apr 05, 2018 at 11:08:22PM CEST, sridhar.samudrala@intel.com wrote:
[...]
>> > +static int virtnet_bypass_join_child(struct net_device *bypass_netdev,
>> > + struct net_device *child_netdev)
>> > +{
>> > + struct virtnet_bypass_info *vbi;
>> > + bool backup;
>> > +
>> > + vbi = netdev_priv(bypass_netdev);
>> > + backup = (child_netdev->dev.parent == bypass_netdev->dev.parent);
>> > + if (backup ? rtnl_dereference(vbi->backup_netdev) :
>> > + rtnl_dereference(vbi->active_netdev)) {
>> > + netdev_info(bypass_netdev,
>> > + "%s attempting to join bypass dev when %s already present\n",
>> > + child_netdev->name, backup ? "backup" : "active");
>> Bypass module should check if there is already some other netdev
>> enslaved and refuse right there.
>
>This will work for virtio-net with 3 netdev model, but this check has to be done by netvsc
>as its bypass_netdev is same as the backup_netdev.
>Will add a flag while registering with the bypass module to indicate if the driver is doing
>a 2 netdev or 3 netdev model and based on that flag this check can be done in bypass module
>for 3 netdev scenario.
Just let me undestand it clearly. What I expect the difference would be
between 2netdev and3 netdev model is this:
2netdev:
bypass_master
/
/
VF_slave
3netdev:
bypass_master
/ \
/ \
VF_slave backup_slave
Is that correct? If not, how does it look like?
Thanks!
^ permalink raw reply
* Re: [PATCH v31 2/4] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT
From: Wei Wang @ 2018-04-09 11:46 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: yang.zhang.wz, virtio-dev, riel, quan.xu0, kvm, nilal,
liliang.opensource, linux-kernel, mhocko, linux-mm, huangzhichao,
pbonzini, akpm, virtualization
In-Reply-To: <20180409085457-mutt-send-email-mst@kernel.org>
On 04/09/2018 02:03 PM, Michael S. Tsirkin wrote:
> On Fri, Apr 06, 2018 at 08:17:23PM +0800, Wei Wang wrote:
>> Negotiation of the VIRTIO_BALLOON_F_FREE_PAGE_HINT feature indicates the
>> support of reporting hints of guest free pages to host via virtio-balloon.
>>
>> Host requests the guest to report free page hints by sending a new cmd
>> id to the guest via the free_page_report_cmd_id configuration register.
>>
>> When the guest starts to report, the first element added to the free page
>> vq is the cmd id given by host. When the guest finishes the reporting
>> of all the free pages, VIRTIO_BALLOON_FREE_PAGE_REPORT_STOP_ID is added
>> to the vq to tell host that the reporting is done. Host polls the free
>> page vq after sending the starting cmd id, so the guest doesn't need to
>> kick after filling an element to the vq.
>>
>> Host may also requests the guest to stop the reporting in advance by
>> sending the stop cmd id to the guest via the configuration register.
>>
>> Signed-off-by: Wei Wang <wei.w.wang@intel.com>
>> Signed-off-by: Liang Li <liang.z.li@intel.com>
>> Cc: Michael S. Tsirkin <mst@redhat.com>
>> Cc: Michal Hocko <mhocko@kernel.org>
>
> Pretty good by now, Minor comments below.
Thanks for the comments.
>
>> ---
>> drivers/virtio/virtio_balloon.c | 272 +++++++++++++++++++++++++++++++-----
>> include/uapi/linux/virtio_balloon.h | 4 +
>> 2 files changed, 240 insertions(+), 36 deletions(-)
>>
>> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
>> index dfe5684..aef73ee 100644
>> --- a/drivers/virtio/virtio_balloon.c
>> +++ b/drivers/virtio/virtio_balloon.c
>> @@ -51,9 +51,22 @@ MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
>> static struct vfsmount *balloon_mnt;
>> #endif
>>
>> +enum virtio_balloon_vq {
>> + VIRTIO_BALLOON_VQ_INFLATE,
>> + VIRTIO_BALLOON_VQ_DEFLATE,
>> + VIRTIO_BALLOON_VQ_STATS,
>> + VIRTIO_BALLOON_VQ_FREE_PAGE,
>> + VIRTIO_BALLOON_VQ_MAX
>> +};
>> +
>> struct virtio_balloon {
>> struct virtio_device *vdev;
>> - struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
>> + struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *free_page_vq;
>> +
>> + /* Balloon's own wq for cpu-intensive work items */
>> + struct workqueue_struct *balloon_wq;
>> + /* The free page reporting work item submitted to the balloon wq */
>> + struct work_struct report_free_page_work;
>>
>> /* The balloon servicing is delegated to a freezable workqueue. */
>> struct work_struct update_balloon_stats_work;
>> @@ -63,6 +76,13 @@ struct virtio_balloon {
>> spinlock_t stop_update_lock;
>> bool stop_update;
>>
>> + /* The new cmd id received from host */
>> + uint32_t cmd_id_received;
>> + /* The cmd id that is in use */
>> + __virtio32 cmd_id_use;
> I'd prefer cmd_id_active but it's not critical.
OK, will change.
>
> +
> +static void report_free_page_func(struct work_struct *work)
> +{
> + struct virtio_balloon *vb;
> + struct virtqueue *vq;
> + unsigned int unused;
> + int ret;
> +
> + vb = container_of(work, struct virtio_balloon, report_free_page_work);
> + vq = vb->free_page_vq;
> +
> + /* Start by sending the received cmd id to host with an outbuf. */
> + ret = send_start_cmd_id(vb, vb->cmd_id_received);
> + if (unlikely(ret))
> + goto err;
> +
> + ret = walk_free_mem_block(vb, 0, &virtio_balloon_send_free_pages);
> + if (unlikely(ret == -EIO))
> + goto err;
> why is EIO special? I think you should special-case EINTR maybe.
Actually EINTR isn't an error which needs to bail out. That's just the
case that the vq is full, that hint isn't added. Maybe it is not
necessary to treat the "vq full" case as an error.
How about just returning "0" when the vq is full, instead of returning
"EINTR"? (The next hint will continue to be added)
>
>> +
>> + /* End by sending a stop id to host with an outbuf. */
>> + ret = send_stop_cmd_id(vb);
>> + if (likely(!ret)) {
> What happens on failure? Don't we need to detach anyway?
Yes. Please see below, we could make some more change.
>
>> + /* Ending: detach all the used buffers from the vq. */
>> + while (vq->num_free != virtqueue_get_vring_size(vq))
>> + virtqueue_get_buf(vq, &unused);
> This isn't all that happens here. It also waits for buffers to
> be consumed. Is this by design? And why is it good idea to
> busy poll while doing it?
Because host and guest operations happen asynchronously. When the guest
reaches here, host may have not put anything to the vq yet. How about
doing this via the free page vq handler?
Host will send a free page vq interrupt before exiting the optimization.
I think this would be nicer.
Best,
Wei
^ permalink raw reply
* [PATCH] vhost: fix vhost_vq_access_ok() log check
From: Stefan Hajnoczi @ 2018-04-09 13:10 UTC (permalink / raw)
To: virtualization; +Cc: kvm, mst, syzkaller-bugs, linux-kernel, Stefan Hajnoczi
Commit d65026c6c62e7d9616c8ceb5a53b68bcdc050525 ("vhost: validate log
when IOTLB is enabled") introduced a regression. The logic was
originally:
if (vq->iotlb)
return 1;
return A && B;
After the patch the short-circuit logic for A was inverted:
if (A || vq->iotlb)
return A;
return B;
The correct logic is:
if (!A || vq->iotlb)
return A;
return B;
Reported-by: syzbot+65a84dde0214b0387ccd@syzkaller.appspotmail.com
Cc: Jason Wang <jasowang@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
drivers/vhost/vhost.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 5320039671b7..f6af4210679a 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -1246,7 +1246,7 @@ int vhost_vq_access_ok(struct vhost_virtqueue *vq)
{
int ret = vq_log_access_ok(vq, vq->log_base);
- if (ret || vq->iotlb)
+ if (!ret || vq->iotlb)
return ret;
return vq_access_ok(vq, vq->num, vq->desc, vq->avail, vq->used);
--
2.14.3
^ permalink raw reply related
* Re: kernel BUG at drivers/vhost/vhost.c:LINE! (2)
From: Stefan Hajnoczi @ 2018-04-09 13:17 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: kvm, netdev, syzkaller-bugs, syzbot, linux-kernel,
Linux Virtualization
In-Reply-To: <20180409032835.GB1648@stefanha-x1.localdomain>
On Mon, Apr 9, 2018 at 11:28 AM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> On Mon, Apr 09, 2018 at 05:44:36AM +0300, Michael S. Tsirkin wrote:
>> On Mon, Apr 09, 2018 at 10:37:45AM +0800, Stefan Hajnoczi wrote:
>> > On Sat, Apr 7, 2018 at 3:02 AM, syzbot
>> > <syzbot+65a84dde0214b0387ccd@syzkaller.appspotmail.com> wrote:
>> > > syzbot hit the following crash on upstream commit
>> > > 38c23685b273cfb4ccf31a199feccce3bdcb5d83 (Fri Apr 6 04:29:35 2018 +0000)
>> > > Merge tag 'armsoc-drivers' of
>> > > git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc
>> > > syzbot dashboard link:
>> > > https://syzkaller.appspot.com/bug?extid=65a84dde0214b0387ccd
>> >
>> > To prevent duplicated work: I am working on this one.
>> >
>> > Stefan
>>
>> Do you want to try this patchset:
>> https://lkml.org/lkml/2018/4/5/665
>>
>> ?
>
> Thanks, I'll give it a shot.
>
> I also noticed a regression in commit
> d65026c6c62e7d9616c8ceb5a53b68bcdc050525 ("vhost: validate log when
> IOTLB is enabled") and am currently testing a fix.
I have sent a fix:
https://lkml.org/lkml/2018/4/9/390
Stefan
^ permalink raw reply
* Re: [PATCH] vhost: fix vhost_vq_access_ok() log check
From: Michael S. Tsirkin @ 2018-04-09 13:58 UTC (permalink / raw)
To: Stefan Hajnoczi; +Cc: syzkaller-bugs, linux-kernel, kvm, virtualization
In-Reply-To: <20180409131021.5132-1-stefanha@redhat.com>
On Mon, Apr 09, 2018 at 09:10:21PM +0800, Stefan Hajnoczi wrote:
> Commit d65026c6c62e7d9616c8ceb5a53b68bcdc050525 ("vhost: validate log
> when IOTLB is enabled") introduced a regression. The logic was
> originally:
>
> if (vq->iotlb)
> return 1;
> return A && B;
>
> After the patch the short-circuit logic for A was inverted:
>
> if (A || vq->iotlb)
> return A;
> return B;
>
> The correct logic is:
>
> if (!A || vq->iotlb)
> return A;
> return B;
>
> Reported-by: syzbot+65a84dde0214b0387ccd@syzkaller.appspotmail.com
> Cc: Jason Wang <jasowang@redhat.com>
> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
> ---
> drivers/vhost/vhost.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 5320039671b7..f6af4210679a 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -1246,7 +1246,7 @@ int vhost_vq_access_ok(struct vhost_virtqueue *vq)
> {
> int ret = vq_log_access_ok(vq, vq->log_base);
>
> - if (ret || vq->iotlb)
> + if (!ret || vq->iotlb)
> return ret;
>
> return vq_access_ok(vq, vq->num, vq->desc, vq->avail, vq->used);
> --
> 2.14.3
^ permalink raw reply
* Re: [PATCH RESEND v2] vhost-net: set packet weight of tx polling to 2 * vq size
From: David Miller @ 2018-04-09 15:02 UTC (permalink / raw)
To: haibinzhang
Cc: kvm, mst, netdev, linux-kernel, virtualization, yunfangtai,
lidongchen
In-Reply-To: <88D661ADF6AFBF42B2AB88D8E7682B0901FC4A32@EXMBX-SZMAIL011.tencent.com>
From: haibinzhang(张海斌) <haibinzhang@tencent.com>
Date: Mon, 9 Apr 2018 07:22:17 +0000
> handle_tx will delay rx for tens or even hundreds of milliseconds when tx busy
> polling udp packets with small length(e.g. 1byte udp payload), because setting
> VHOST_NET_WEIGHT takes into account only sent-bytes but no single packet length.
>
> Ping-Latencies shown below were tested between two Virtual Machines using
> netperf (UDP_STREAM, len=1), and then another machine pinged the client:
...
> Acked-by: Michael S. Tsirkin <mst@redhat.com>
> Signed-off-by: Haibin Zhang <haibinzhang@tencent.com>
> Signed-off-by: Yunfang Tai <yunfangtai@tencent.com>
> Signed-off-by: Lidong Chen <lidongchen@tencent.com>
Applied, thank you.
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* Re: [PATCH] vhost: fix vhost_vq_access_ok() log check
From: Linus Torvalds @ 2018-04-09 16:52 UTC (permalink / raw)
To: Stefan Hajnoczi
Cc: KVM list, Michael S. Tsirkin, syzkaller-bugs,
Linux Kernel Mailing List, virtualization
In-Reply-To: <20180409131021.5132-1-stefanha@redhat.com>
On Mon, Apr 9, 2018 at 6:10 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> @@ -1246,7 +1246,7 @@ int vhost_vq_access_ok(struct vhost_virtqueue *vq)
> {
> int ret = vq_log_access_ok(vq, vq->log_base);
>
> - if (ret || vq->iotlb)
> + if (!ret || vq->iotlb)
> return ret;
That logic is still very non-obvious.
This code already had one bug because of an odd illegible test
sequence. Let's not keep the crazy code.
Why not just do the *obvious* thing, and get rid of "ret" entirely,
and make the damn thing return a boolean, and then just write it all
as
/* Caller should have vq mutex and device mutex */
bool vhost_vq_access_ok(struct vhost_virtqueue *vq)
{
if (!vq_log_access_ok(vq, vq->log_base))
return false;
if (vq->iotlb || vq_access_ok(vq, vq->num, vq->desc,
vq->avail, vq->used);
}
which makes the logic obvious: if vq_log_access_ok() fails, then then
vhost_vq_access_ok() fails unconditionally.
Otherwise, we need to have an iotlb, or a successful vq_access_ok() check.
Doesn't that all make more sense, and avoid the insane "ret" value use
that is really quite subtle?
Linus
^ permalink raw reply
* Re: [RFC PATCH net-next v5 3/4] virtio_net: Extend virtio to use VF datapath when available
From: Samudrala, Sridhar @ 2018-04-09 18:47 UTC (permalink / raw)
To: Jiri Pirko
Cc: alexander.h.duyck, virtio-dev, mst, kubakici, netdev,
virtualization, loseweigh, davem
In-Reply-To: <20180409080751.GE19345@nanopsycho>
On 4/9/2018 1:07 AM, Jiri Pirko wrote:
> Sat, Apr 07, 2018 at 12:59:14AM CEST, sridhar.samudrala@intel.com wrote:
>> On 4/6/2018 5:48 AM, Jiri Pirko wrote:
>>> Thu, Apr 05, 2018 at 11:08:22PM CEST, sridhar.samudrala@intel.com wrote:
> [...]
>
>>>> +static int virtnet_bypass_join_child(struct net_device *bypass_netdev,
>>>> + struct net_device *child_netdev)
>>>> +{
>>>> + struct virtnet_bypass_info *vbi;
>>>> + bool backup;
>>>> +
>>>> + vbi = netdev_priv(bypass_netdev);
>>>> + backup = (child_netdev->dev.parent == bypass_netdev->dev.parent);
>>>> + if (backup ? rtnl_dereference(vbi->backup_netdev) :
>>>> + rtnl_dereference(vbi->active_netdev)) {
>>>> + netdev_info(bypass_netdev,
>>>> + "%s attempting to join bypass dev when %s already present\n",
>>>> + child_netdev->name, backup ? "backup" : "active");
>>> Bypass module should check if there is already some other netdev
>>> enslaved and refuse right there.
>> This will work for virtio-net with 3 netdev model, but this check has to be done by netvsc
>> as its bypass_netdev is same as the backup_netdev.
>> Will add a flag while registering with the bypass module to indicate if the driver is doing
>> a 2 netdev or 3 netdev model and based on that flag this check can be done in bypass module
>> for 3 netdev scenario.
> Just let me undestand it clearly. What I expect the difference would be
> between 2netdev and3 netdev model is this:
> 2netdev:
> bypass_master
> /
> /
> VF_slave
>
> 3netdev:
> bypass_master
> / \
> / \
> VF_slave backup_slave
>
> Is that correct? If not, how does it look like?
>
>
Looks correct.
VF_slave and backup_slave are the original netdevs and are present in both the models.
In the 3 netdev model, bypass_master netdev is created and VF_slave and backup_slave are
marked as the 2 slaves of this new netdev.
In the 2 netdev model, backup_slave acts as bypass_master and the bypass module doesn't
have access to netdev_priv of the backup_slave.
Once i moved all the ndo_ops of the master netdev to bypass module, i realized that we can
move the create/destroy of the upper netdev also to bypass.c.
That way the changes to virtio_net become very minimal.
With these updates, bypass module now supports both the models by exporting 2 sets of
functions.
3 netdev:
int bypass_master_create(struct net_device *backup_netdev,
struct bypass_master **pbypass_master);
void bypass_master_destroy(struct bypass_master *bypass_master);
2 netdev:
int bypass_master_register(struct net_device *backup_netdev, struct bypass_ops *ops,
struct bypass_master **pbypass_master);
void bypass_master_unregister(struct bypass_master *bypass_master);
Will send the next revision in a day or two.
Thanks
Sridhar
^ permalink raw reply
* [PATCH RESEND net] vhost: fix vhost_vq_access_ok() log check
From: Michael S. Tsirkin @ 2018-04-09 19:40 UTC (permalink / raw)
To: David Miller, netdev, virtualization
Cc: kvm, mst, syzkaller-bugs, linux-kernel, Stefan Hajnoczi
In-Reply-To: <20180409131021.5132-1-stefanha@redhat.com>
From: Stefan Hajnoczi <stefanha@redhat.com>
Commit d65026c6c62e7d9616c8ceb5a53b68bcdc050525 ("vhost: validate log
when IOTLB is enabled") introduced a regression. The logic was
originally:
if (vq->iotlb)
return 1;
return A && B;
After the patch the short-circuit logic for A was inverted:
if (A || vq->iotlb)
return A;
return B;
The correct logic is:
if (!A || vq->iotlb)
return A;
return B;
Reported-by: syzbot+65a84dde0214b0387ccd@syzkaller.appspotmail.com
Cc: Jason Wang <jasowang@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
---
drivers/vhost/vhost.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 5320039671b7..f6af4210679a 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -1246,7 +1246,7 @@ int vhost_vq_access_ok(struct vhost_virtqueue *vq)
{
int ret = vq_log_access_ok(vq, vq->log_base);
- if (ret || vq->iotlb)
+ if (!ret || vq->iotlb)
return ret;
return vq_access_ok(vq, vq->num, vq->desc, vq->avail, vq->used);
--
2.14.3
^ permalink raw reply related
* Re: [PATCH] vhost: fix vhost_vq_access_ok() log check
From: Michael S. Tsirkin @ 2018-04-09 19:54 UTC (permalink / raw)
To: Linus Torvalds
Cc: KVM list, syzkaller-bugs, Linux Kernel Mailing List,
virtualization, Stefan Hajnoczi
In-Reply-To: <CA+55aFzJ_JPtU5SjAHHv--p3ZcAw4gVidxvrLLZF7w+kbHCNdg@mail.gmail.com>
On Mon, Apr 09, 2018 at 09:52:13AM -0700, Linus Torvalds wrote:
> On Mon, Apr 9, 2018 at 6:10 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > @@ -1246,7 +1246,7 @@ int vhost_vq_access_ok(struct vhost_virtqueue *vq)
> > {
> > int ret = vq_log_access_ok(vq, vq->log_base);
> >
> > - if (ret || vq->iotlb)
> > + if (!ret || vq->iotlb)
> > return ret;
>
> That logic is still very non-obvious.
>
> This code already had one bug because of an odd illegible test
> sequence. Let's not keep the crazy code.
>
> Why not just do the *obvious* thing, and get rid of "ret" entirely,
> and make the damn thing return a boolean, and then just write it all
> as
>
> /* Caller should have vq mutex and device mutex */
> bool vhost_vq_access_ok(struct vhost_virtqueue *vq)
> {
> if (!vq_log_access_ok(vq, vq->log_base))
> return false;
>
> if (vq->iotlb || vq_access_ok(vq, vq->num, vq->desc,
> vq->avail, vq->used);
> }
>
> which makes the logic obvious: if vq_log_access_ok() fails, then then
> vhost_vq_access_ok() fails unconditionally.
>
> Otherwise, we need to have an iotlb, or a successful vq_access_ok() check.
>
> Doesn't that all make more sense, and avoid the insane "ret" value use
> that is really quite subtle?
>
> Linus
I agree it's cleaner.
Stefan, I reposted your patch on netdev (since the breakage got applied
there too).
Would you like to self-nak it and post v2? Pls remember to CC netdev.
--
MST
^ permalink raw reply
* Re: [virtio-dev] Re: [PATCH v2] virtio_balloon: export hugetlb page allocation counts
From: Jonathan Helman @ 2018-04-09 21:11 UTC (permalink / raw)
To: Jason Wang, Michael S. Tsirkin
Cc: Mihai Carabas, virtio-dev, linux-kernel, virtualization
In-Reply-To: <3f8a1b91-c5ba-8a24-72b8-02d884a67826@redhat.com>
On 03/22/2018 07:38 PM, Jason Wang wrote:
>
>
> On 2018年03月22日 11:10, Michael S. Tsirkin wrote:
>> On Thu, Mar 22, 2018 at 09:52:18AM +0800, Jason Wang wrote:
>>> On 2018年03月20日 12:26, Jonathan Helman wrote:
>>>>> On Mar 19, 2018, at 7:31 PM, Jason Wang<jasowang@redhat.com> wrote:
>>>>>
>>>>>
>>>>>
>>>>> On 2018年03月20日 06:14, Jonathan Helman wrote:
>>>>>> Export the number of successful and failed hugetlb page
>>>>>> allocations via the virtio balloon driver. These 2 counts
>>>>>> come directly from the vm_events HTLB_BUDDY_PGALLOC and
>>>>>> HTLB_BUDDY_PGALLOC_FAIL.
>>>>>>
>>>>>> Signed-off-by: Jonathan Helman<jonathan.helman@oracle.com>
>>>>> Reviewed-by: Jason Wang<jasowang@redhat.com>
>>>> Thanks.
>>>>
>>>>>> ---
>>>>>> drivers/virtio/virtio_balloon.c | 6 ++++++
>>>>>> include/uapi/linux/virtio_balloon.h | 4 +++-
>>>>>> 2 files changed, 9 insertions(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/drivers/virtio/virtio_balloon.c
>>>>>> b/drivers/virtio/virtio_balloon.c
>>>>>> index dfe5684..6b237e3 100644
>>>>>> --- a/drivers/virtio/virtio_balloon.c
>>>>>> +++ b/drivers/virtio/virtio_balloon.c
>>>>>> @@ -272,6 +272,12 @@ static unsigned int
>>>>>> update_balloon_stats(struct virtio_balloon *vb)
>>>>>> pages_to_bytes(events[PSWPOUT]));
>>>>>> update_stat(vb, idx++, VIRTIO_BALLOON_S_MAJFLT,
>>>>>> events[PGMAJFAULT]);
>>>>>> update_stat(vb, idx++, VIRTIO_BALLOON_S_MINFLT,
>>>>>> events[PGFAULT]);
>>>>>> +#ifdef CONFIG_HUGETLB_PAGE
>>>>>> + update_stat(vb, idx++, VIRTIO_BALLOON_S_HTLB_PGALLOC,
>>>>>> + events[HTLB_BUDDY_PGALLOC]);
>>>>>> + update_stat(vb, idx++, VIRTIO_BALLOON_S_HTLB_PGFAIL,
>>>>>> + events[HTLB_BUDDY_PGALLOC_FAIL]);
>>>>>> +#endif
>>>>>> #endif
>>>>>> update_stat(vb, idx++, VIRTIO_BALLOON_S_MEMFREE,
>>>>>> pages_to_bytes(i.freeram));
>>>>>> diff --git a/include/uapi/linux/virtio_balloon.h
>>>>>> b/include/uapi/linux/virtio_balloon.h
>>>>>> index 4e8b830..40297a3 100644
>>>>>> --- a/include/uapi/linux/virtio_balloon.h
>>>>>> +++ b/include/uapi/linux/virtio_balloon.h
>>>>>> @@ -53,7 +53,9 @@ struct virtio_balloon_config {
>>>>>> #define VIRTIO_BALLOON_S_MEMTOT 5 /* Total amount of memory */
>>>>>> #define VIRTIO_BALLOON_S_AVAIL 6 /* Available memory as in
>>>>>> /proc */
>>>>>> #define VIRTIO_BALLOON_S_CACHES 7 /* Disk caches */
>>>>>> -#define VIRTIO_BALLOON_S_NR 8
>>>>>> +#define VIRTIO_BALLOON_S_HTLB_PGALLOC 8 /* Hugetlb page
>>>>>> allocations */
>>>>>> +#define VIRTIO_BALLOON_S_HTLB_PGFAIL 9 /* Hugetlb page
>>>>>> allocation failures */
>>>>>> +#define VIRTIO_BALLOON_S_NR 10
>>>>>> /*
>>>>>> * Memory statistics structure.
>>>>> Not for this patch, but it looks to me that exporting such nr
>>>>> through uapi is fragile.
>>>> Sorry, can you explain what you mean here?
>>>>
>>>> Jon
>>> Spec said "Within an output buffer submitted to the statsq, the
>>> device MUST
>>> ignore entries with tag values that it does not recognize". So exporting
>>> VIRTIO_BALLOON_S_NR seems useless and device implementation can not
>>> depend
>>> on such number in uapi.
>>>
>>> Thanks
>> Suggestions? I don't like to break build for people ...
>>
>
> Didn't have a good idea. But maybe we should keep VIRTIO_BALLOON_S_NR
> unchanged, and add a comment here.
>
> Thanks
I think Jason's comment is for a future patch. Didn't see this patch get
applied, so wondering if it could be.
Thanks,
Jon
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice
From: Siwei Liu @ 2018-04-09 22:07 UTC (permalink / raw)
To: Andrew Lunn
Cc: Alexander Duyck, virtio-dev, Jiri Pirko, Michael S. Tsirkin,
Jakub Kicinski, Samudrala, Sridhar, virtualization, Netdev,
David Ahern, si-wei liu, David Miller
In-Reply-To: <20180407031919.GB11029@lunn.ch>
On Fri, Apr 6, 2018 at 8:19 PM, Andrew Lunn <andrew@lunn.ch> wrote:
> Hi Siwei
>
>> I think everyone seems to agree not to fiddle with the ":" prefix, but
>> rather have a new class of network subsystem under /sys/class thus a
>> separate device namespace e.g. /sys/class/net-kernel for those
>> auto-managed lower netdevs is needed.
>
> How do you get a device into this new class? I don't know the Linux
> driver model too well, but to get a device out of one class and into
> another, i think you need to device_del(dev). modify dev->class and
> then device_add(dev). However, device_add() says you are not allowed
> to do this.
No, implementation wise I'd avoid changing the class on the fly. What
I'm looking to is a means to add a secondary class or class aliasing
mechanism for netdevs that allows mapping for a kernel device
namespace (/class/net-kernel) to userspace (/class/net). Imagine
creating symlinks between these two namespaces as an analogy. All
userspace visible netdevs today will have both a kernel name and a
userspace visible name, having one (/class/net) referecing the other
(/class/net-kernel) in its own namespace. The newly introduced
IFF_AUTO_MANAGED device will have a kernel name only
(/class/net-kernel). As a result, the existing applications using
/class/net don't break, while we're adding the kernel namespace that
allows IFF_AUTO_MANAGED devices which will not be exposed to userspace
at all.
As this requires changing the internals of driver model core it's a
rather big hammer approach I'd think. If there exists a better
implementation than this to allow adding a separate layer of in-kernel
device namespace, I'd more than welcome to hear about.
>
> And i don't even see how this helps. Are you also not going to call
> list_netdevice()? Are you going to add some other list for these
> devices in a different class?
list_netdevice() is still called. I think with the current RFC patch,
I've added two lists for netdevs under the kernel namespace:
dev_cmpl_list and name_cmpl_hlist. As a result of that, all userspace
netdevs get registered will be added to two types of lists: the
userspace list for e.g. dev_list, and also the kernelspace list e.g.
dev_cmpl_list (I can rename it to something more accurate). The
IFF_AUTO_MANAGED device will be only added to kernelspace list e.g.
dev_cmpl_list.
Hope all your questions are answered.
Thanks,
-Siwei
>
> Andrew
^ permalink raw reply
* Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice
From: Andrew Lunn @ 2018-04-09 22:15 UTC (permalink / raw)
To: Siwei Liu
Cc: Alexander Duyck, virtio-dev, Jiri Pirko, Michael S. Tsirkin,
Jakub Kicinski, Samudrala, Sridhar, virtualization, Netdev,
David Ahern, si-wei liu, David Miller
In-Reply-To: <CADGSJ211kA5ktADbwv2f4ma_ooBmYW4r7YwzGD2G_Gw5+wv5rw@mail.gmail.com>
> No, implementation wise I'd avoid changing the class on the fly. What
> I'm looking to is a means to add a secondary class or class aliasing
> mechanism for netdevs that allows mapping for a kernel device
> namespace (/class/net-kernel) to userspace (/class/net). Imagine
> creating symlinks between these two namespaces as an analogy. All
> userspace visible netdevs today will have both a kernel name and a
> userspace visible name, having one (/class/net) referecing the other
> (/class/net-kernel) in its own namespace. The newly introduced
> IFF_AUTO_MANAGED device will have a kernel name only
> (/class/net-kernel). As a result, the existing applications using
> /class/net don't break, while we're adding the kernel namespace that
> allows IFF_AUTO_MANAGED devices which will not be exposed to userspace
> at all.
My gut feeling is this whole scheme will not fly. You really should be
talking to GregKH.
Anyway, please remember that IFF_AUTO_MANAGED will need to be dynamic.
A device can start out as a normal device, and will change to being
automatic later, when the user on top of it probes.
Andrew
^ permalink raw reply
* Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice
From: Siwei Liu @ 2018-04-09 22:30 UTC (permalink / raw)
To: Andrew Lunn
Cc: Alexander Duyck, virtio-dev, Jiri Pirko, Michael S. Tsirkin,
Jakub Kicinski, Samudrala, Sridhar, virtualization, Netdev,
David Ahern, si-wei liu, David Miller
In-Reply-To: <20180409221523.GH562@lunn.ch>
On Mon, Apr 9, 2018 at 3:15 PM, Andrew Lunn <andrew@lunn.ch> wrote:
>> No, implementation wise I'd avoid changing the class on the fly. What
>> I'm looking to is a means to add a secondary class or class aliasing
>> mechanism for netdevs that allows mapping for a kernel device
>> namespace (/class/net-kernel) to userspace (/class/net). Imagine
>> creating symlinks between these two namespaces as an analogy. All
>> userspace visible netdevs today will have both a kernel name and a
>> userspace visible name, having one (/class/net) referecing the other
>> (/class/net-kernel) in its own namespace. The newly introduced
>> IFF_AUTO_MANAGED device will have a kernel name only
>> (/class/net-kernel). As a result, the existing applications using
>> /class/net don't break, while we're adding the kernel namespace that
>> allows IFF_AUTO_MANAGED devices which will not be exposed to userspace
>> at all.
>
> My gut feeling is this whole scheme will not fly. You really should be
> talking to GregKH.
Will do. Before spreading it out loudly I'd run it within netdev to
clarify the need for why not exposing the lower netdevs is critical
for cloud service providers in the face of introducing a new feature,
and we are not hiding anything but exposing it in a way that don't
break existing userspace applications while introducing feature is
possible with the limitation of keeping old userspace still.
>
> Anyway, please remember that IFF_AUTO_MANAGED will need to be dynamic.
> A device can start out as a normal device, and will change to being
> automatic later, when the user on top of it probes.
Sure. In whatever form it's still a netdev, and changing the namespace
should be more dynamic than changing the class.
Thanks a lot,
-Siwei
>
> Andrew
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox