Linux virtualization list
 help / color / mirror / Atom feed
* Re: [dm-devel] [PATCH v5] fault-injection: introduce kvmalloc fallback options
From: Mikulas Patocka @ 2018-04-27 10:20 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew, eric.dumazet, Michael S. Tsirkin, edumazet, netdev,
	Randy Dunlap, John Stoffel, linux-kernel, Matthew Wilcox,
	virtualization, James Bottomley, Michal, dm-devel, David Miller,
	David Rientjes, Morton, linux-mm, Vlastimil Babka
In-Reply-To: <20180427082555.GC17484@dhcp22.suse.cz>



On Fri, 27 Apr 2018, Michal Hocko wrote:

> On Thu 26-04-18 18:52:05, Mikulas Patocka wrote:
> > 
> > 
> > On Fri, 27 Apr 2018, Michael S. Tsirkin wrote:
> [...]
> > >    But assuming it's important to control this kind of
> > >    fault injection to be controlled from
> > >    a dedicated menuconfig option, why not the rest of
> > >    faults?
> > 
> > The injected faults cause damage to the user, so there's no point to 
> > enable them by default. vmalloc fallback should not cause any damage 
> > (assuming that the code is correctly written).
> 
> But you want to find those bugs which would BUG_ON easier, so there is a
> risk of harm IIUC

Yes, I want to harm them, but I only want to harm the users using the 
debugging kernel. Testers should be "harmed" by crashes - so that the 
users of production kernels are harmed less.

If someone hits this, he should report it, use the kernel parameter to 
turn it off and continue with the testing.

> and this is not much different than other fault injecting paths.

Fault injections causes misbehavior even on completely bug-free code (for 
example, syscalls randomly returning -ENOMEM). This won't cause 
misbehavior on bug-free code.

Mikulas

^ permalink raw reply

* Re: [RFC v3 0/5] virtio: support packed ring
From: Tiwei Bie @ 2018-04-27  9:12 UTC (permalink / raw)
  To: Jason Wang; +Cc: Michael S. Tsirkin, netdev, linux-kernel, virtualization, wexu
In-Reply-To: <5c712aa2-f00e-b472-cdfc-48175aea790d@redhat.com>

On Fri, Apr 27, 2018 at 02:17:51PM +0800, Jason Wang wrote:
> On 2018年04月27日 12:18, Michael S. Tsirkin wrote:
> > On Fri, Apr 27, 2018 at 11:56:05AM +0800, Jason Wang wrote:
> > > On 2018年04月25日 13:15, Tiwei Bie wrote:
> > > > Hello everyone,
> > > > 
> > > > This RFC implements packed ring support in virtio driver.
> > > > 
> > > > Some simple functional tests have been done with Jason's
> > > > packed ring implementation in vhost:
> > > > 
> > > > https://lkml.org/lkml/2018/4/23/12
> > > > 
> > > > Both of ping and netperf worked as expected (with EVENT_IDX
> > > > disabled). But there are below known issues:
> > > > 
> > > > 1. Reloading the guest driver will break the Tx/Rx;
> > > Will have a look at this issue.
> > > 
> > > > 2. Zeroing the flags when detaching a used desc will
> > > >      break the guest -> host path.
> > > I still think zeroing flags is unnecessary or even a bug. At host, I track
> > > last observed avail wrap counter and detect avail like (what is suggested in
> > > the example code in the spec):
> > > 
> > > static bool desc_is_avail(struct vhost_virtqueue *vq, __virtio16 flags)
> > > {
> > >         bool avail = flags & cpu_to_vhost16(vq, DESC_AVAIL);
> > > 
> > >         return avail == vq->avail_wrap_counter;
> > > }
> > > 
> > > So zeroing wrap can not work with this obviously.
> > > 
> > > Thanks
> > I agree. I think what one should do is flip the available bit.
> > 
> 
> But is this flipping a must?
> 
> Thanks

Yeah, that's my question too. It seems to be a requirement
for driver that, the only change to the desc status that a
driver can do during running is to mark the desc as avail,
and any other changes to the desc status are not allowed.
Similarly, the device can only mark the desc as used, and
any other changes to the desc status are also not allowed.
So the question is, are there such requirements?

Based on below contents in the spec:

"""
Thus VIRTQ_DESC_F_AVAIL and VIRTQ_DESC_F_USED bits are different
for an available descriptor and equal for a used descriptor.

Note that this observation is mostly useful for sanity-checking
as these are necessary but not sufficient conditions
"""

It seems that, it's necessary for devices to check whether
the AVAIL bit and USED bit are different.

Best regards,
Tiwei Bie
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [dm-devel] [PATCH v5] fault-injection: introduce kvmalloc fallback options
From: Michal Hocko @ 2018-04-27  8:25 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Andrew, eric.dumazet, Michael S. Tsirkin, edumazet, netdev,
	Randy Dunlap, John Stoffel, linux-kernel, Matthew Wilcox,
	virtualization, James Bottomley, Michal, dm-devel, David Miller,
	David Rientjes, Morton, linux-mm, Vlastimil Babka
In-Reply-To: <alpine.LRH.2.02.1804261829190.30599@file01.intranet.prod.int.rdu2.redhat.com>

On Thu 26-04-18 18:52:05, Mikulas Patocka wrote:
> 
> 
> On Fri, 27 Apr 2018, Michael S. Tsirkin wrote:
[...]
> >    But assuming it's important to control this kind of
> >    fault injection to be controlled from
> >    a dedicated menuconfig option, why not the rest of
> >    faults?
> 
> The injected faults cause damage to the user, so there's no point to 
> enable them by default. vmalloc fallback should not cause any damage 
> (assuming that the code is correctly written).

But you want to find those bugs which would BUG_ON easier, so there is a
risk of harm IIUC and this is not much different than other fault
injecting paths.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply

* Re: [RFC v3 0/5] virtio: support packed ring
From: Jason Wang @ 2018-04-27  6:17 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: netdev, linux-kernel, virtualization, wexu
In-Reply-To: <20180427071725-mutt-send-email-mst@kernel.org>



On 2018年04月27日 12:18, Michael S. Tsirkin wrote:
> On Fri, Apr 27, 2018 at 11:56:05AM +0800, Jason Wang wrote:
>> On 2018年04月25日 13:15, Tiwei Bie wrote:
>>> Hello everyone,
>>>
>>> This RFC implements packed ring support in virtio driver.
>>>
>>> Some simple functional tests have been done with Jason's
>>> packed ring implementation in vhost:
>>>
>>> https://lkml.org/lkml/2018/4/23/12
>>>
>>> Both of ping and netperf worked as expected (with EVENT_IDX
>>> disabled). But there are below known issues:
>>>
>>> 1. Reloading the guest driver will break the Tx/Rx;
>> Will have a look at this issue.
>>
>>> 2. Zeroing the flags when detaching a used desc will
>>>      break the guest -> host path.
>> I still think zeroing flags is unnecessary or even a bug. At host, I track
>> last observed avail wrap counter and detect avail like (what is suggested in
>> the example code in the spec):
>>
>> static bool desc_is_avail(struct vhost_virtqueue *vq, __virtio16 flags)
>> {
>>         bool avail = flags & cpu_to_vhost16(vq, DESC_AVAIL);
>>
>>         return avail == vq->avail_wrap_counter;
>> }
>>
>> So zeroing wrap can not work with this obviously.
>>
>> Thanks
> I agree. I think what one should do is flip the available bit.
>

But is this flipping a must?

Thanks
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* [PATCH 16/17] drm/virtio: Remove unecessary dma_fence_ops
From: Daniel Vetter @ 2018-04-27  6:17 UTC (permalink / raw)
  To: DRI Development
  Cc: David Airlie, Daniel Vetter, Intel Graphics Development,
	virtualization
In-Reply-To: <20180427061724.28497-1-daniel.vetter@ffwll.ch>

dma_fence_default_wait is the default now, same for the trivial
enable_signaling implementation.

Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: David Airlie <airlied@linux.ie>
Cc: Gerd Hoffmann <kraxel@redhat.com>
Cc: virtualization@lists.linux-foundation.org
---
 drivers/gpu/drm/virtio/virtgpu_fence.c | 7 -------
 1 file changed, 7 deletions(-)

diff --git a/drivers/gpu/drm/virtio/virtgpu_fence.c b/drivers/gpu/drm/virtio/virtgpu_fence.c
index 23353521f903..00c742a441bf 100644
--- a/drivers/gpu/drm/virtio/virtgpu_fence.c
+++ b/drivers/gpu/drm/virtio/virtgpu_fence.c
@@ -36,11 +36,6 @@ static const char *virtio_get_timeline_name(struct dma_fence *f)
 	return "controlq";
 }
 
-static bool virtio_enable_signaling(struct dma_fence *f)
-{
-	return true;
-}
-
 static bool virtio_signaled(struct dma_fence *f)
 {
 	struct virtio_gpu_fence *fence = to_virtio_fence(f);
@@ -67,9 +62,7 @@ static void virtio_timeline_value_str(struct dma_fence *f, char *str, int size)
 static const struct dma_fence_ops virtio_fence_ops = {
 	.get_driver_name     = virtio_get_driver_name,
 	.get_timeline_name   = virtio_get_timeline_name,
-	.enable_signaling    = virtio_enable_signaling,
 	.signaled            = virtio_signaled,
-	.wait                = dma_fence_default_wait,
 	.fence_value_str     = virtio_fence_value_str,
 	.timeline_value_str  = virtio_timeline_value_str,
 };
-- 
2.17.0

^ permalink raw reply related

* [PATCH 12/17] drm/qxl: Remove unecessary dma_fence_ops
From: Daniel Vetter @ 2018-04-27  6:17 UTC (permalink / raw)
  To: DRI Development
  Cc: Daniel Vetter, Intel Graphics Development, virtualization,
	Dave Airlie
In-Reply-To: <20180427061724.28497-1-daniel.vetter@ffwll.ch>

dma_fence_default_wait is the default now, same for the trivial
enable_signaling implementation.

Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Dave Airlie <airlied@redhat.com>
Cc: Gerd Hoffmann <kraxel@redhat.com>
Cc: virtualization@lists.linux-foundation.org
---
 drivers/gpu/drm/qxl/qxl_release.c | 7 -------
 1 file changed, 7 deletions(-)

diff --git a/drivers/gpu/drm/qxl/qxl_release.c b/drivers/gpu/drm/qxl/qxl_release.c
index 7cb214577275..e37f0097f744 100644
--- a/drivers/gpu/drm/qxl/qxl_release.c
+++ b/drivers/gpu/drm/qxl/qxl_release.c
@@ -50,12 +50,6 @@ static const char *qxl_get_timeline_name(struct dma_fence *fence)
 	return "release";
 }
 
-static bool qxl_nop_signaling(struct dma_fence *fence)
-{
-	/* fences are always automatically signaled, so just pretend we did this.. */
-	return true;
-}
-
 static long qxl_fence_wait(struct dma_fence *fence, bool intr,
 			   signed long timeout)
 {
@@ -119,7 +113,6 @@ static long qxl_fence_wait(struct dma_fence *fence, bool intr,
 static const struct dma_fence_ops qxl_fence_ops = {
 	.get_driver_name = qxl_get_driver_name,
 	.get_timeline_name = qxl_get_timeline_name,
-	.enable_signaling = qxl_nop_signaling,
 	.wait = qxl_fence_wait,
 };
 
-- 
2.17.0

^ permalink raw reply related

* Re: [RFC v3 0/5] virtio: support packed ring
From: Michael S. Tsirkin @ 2018-04-27  4:18 UTC (permalink / raw)
  To: Jason Wang; +Cc: netdev, linux-kernel, virtualization, wexu
In-Reply-To: <5ad1ca01-1e5c-7105-f303-7e8d42f6a068@redhat.com>

On Fri, Apr 27, 2018 at 11:56:05AM +0800, Jason Wang wrote:
> 
> 
> On 2018年04月25日 13:15, Tiwei Bie wrote:
> > Hello everyone,
> > 
> > This RFC implements packed ring support in virtio driver.
> > 
> > Some simple functional tests have been done with Jason's
> > packed ring implementation in vhost:
> > 
> > https://lkml.org/lkml/2018/4/23/12
> > 
> > Both of ping and netperf worked as expected (with EVENT_IDX
> > disabled). But there are below known issues:
> > 
> > 1. Reloading the guest driver will break the Tx/Rx;
> 
> Will have a look at this issue.
> 
> > 2. Zeroing the flags when detaching a used desc will
> >     break the guest -> host path.
> 
> I still think zeroing flags is unnecessary or even a bug. At host, I track
> last observed avail wrap counter and detect avail like (what is suggested in
> the example code in the spec):
> 
> static bool desc_is_avail(struct vhost_virtqueue *vq, __virtio16 flags)
> {
>        bool avail = flags & cpu_to_vhost16(vq, DESC_AVAIL);
> 
>        return avail == vq->avail_wrap_counter;
> }
> 
> So zeroing wrap can not work with this obviously.
> 
> Thanks

I agree. I think what one should do is flip the available bit.

> > 
> > Some simple functional tests have also been done with
> > Wei's packed ring implementation in QEMU:
> > 
> > http://lists.nongnu.org/archive/html/qemu-devel/2018-04/msg00342.html
> > 
> > Both of ping and netperf worked as expected (with EVENT_IDX
> > disabled). Reloading the guest driver also worked as expected.
> > 
> > TODO:
> > - Refinements (for code and commit log) and bug fixes;
> > - Discuss/fix/test EVENT_IDX support;
> > - Test devices other than net;
> > 
> > RFC v2 -> RFC v3:
> > - Split into small patches (Jason);
> > - Add helper virtqueue_use_indirect() (Jason);
> > - Just set id for the last descriptor of a list (Jason);
> > - Calculate the prev in virtqueue_add_packed() (Jason);
> > - Fix/improve desc suppression code (Jason/MST);
> > - Refine the code layout for XXX_split/packed and wrappers (MST);
> > - Fix the comments and API in uapi (MST);
> > - Remove the BUG_ON() for indirect (Jason);
> > - Some other refinements and bug fixes;
> > 
> > RFC v1 -> RFC v2:
> > - Add indirect descriptor support - compile test only;
> > - Add event suppression supprt - compile test only;
> > - Move vring_packed_init() out of uapi (Jason, MST);
> > - Merge two loops into one in virtqueue_add_packed() (Jason);
> > - Split vring_unmap_one() for packed ring and split ring (Jason);
> > - Avoid using '%' operator (Jason);
> > - Rename free_head -> next_avail_idx (Jason);
> > - Add comments for virtio_wmb() in virtqueue_add_packed() (Jason);
> > - Some other refinements and bug fixes;
> > 
> > Thanks!
> > 
> > Tiwei Bie (5):
> >    virtio: add packed ring definitions
> >    virtio_ring: support creating packed ring
> >    virtio_ring: add packed ring support
> >    virtio_ring: add event idx support in packed ring
> >    virtio_ring: enable packed ring
> > 
> >   drivers/virtio/virtio_ring.c       | 1271 ++++++++++++++++++++++++++++--------
> >   include/linux/virtio_ring.h        |    8 +-
> >   include/uapi/linux/virtio_config.h |   12 +-
> >   include/uapi/linux/virtio_ring.h   |   36 +
> >   4 files changed, 1049 insertions(+), 278 deletions(-)
> > 
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [RFC v3 0/5] virtio: support packed ring
From: Jason Wang @ 2018-04-27  3:56 UTC (permalink / raw)
  To: Tiwei Bie, mst, virtualization, linux-kernel, netdev; +Cc: wexu
In-Reply-To: <20180425051550.24342-1-tiwei.bie@intel.com>



On 2018年04月25日 13:15, Tiwei Bie wrote:
> Hello everyone,
>
> This RFC implements packed ring support in virtio driver.
>
> Some simple functional tests have been done with Jason's
> packed ring implementation in vhost:
>
> https://lkml.org/lkml/2018/4/23/12
>
> Both of ping and netperf worked as expected (with EVENT_IDX
> disabled). But there are below known issues:
>
> 1. Reloading the guest driver will break the Tx/Rx;

Will have a look at this issue.

> 2. Zeroing the flags when detaching a used desc will
>     break the guest -> host path.

I still think zeroing flags is unnecessary or even a bug. At host, I 
track last observed avail wrap counter and detect avail like (what is 
suggested in the example code in the spec):

static bool desc_is_avail(struct vhost_virtqueue *vq, __virtio16 flags)
{
        bool avail = flags & cpu_to_vhost16(vq, DESC_AVAIL);

        return avail == vq->avail_wrap_counter;
}

So zeroing wrap can not work with this obviously.

Thanks

>
> Some simple functional tests have also been done with
> Wei's packed ring implementation in QEMU:
>
> http://lists.nongnu.org/archive/html/qemu-devel/2018-04/msg00342.html
>
> Both of ping and netperf worked as expected (with EVENT_IDX
> disabled). Reloading the guest driver also worked as expected.
>
> TODO:
> - Refinements (for code and commit log) and bug fixes;
> - Discuss/fix/test EVENT_IDX support;
> - Test devices other than net;
>
> RFC v2 -> RFC v3:
> - Split into small patches (Jason);
> - Add helper virtqueue_use_indirect() (Jason);
> - Just set id for the last descriptor of a list (Jason);
> - Calculate the prev in virtqueue_add_packed() (Jason);
> - Fix/improve desc suppression code (Jason/MST);
> - Refine the code layout for XXX_split/packed and wrappers (MST);
> - Fix the comments and API in uapi (MST);
> - Remove the BUG_ON() for indirect (Jason);
> - Some other refinements and bug fixes;
>
> RFC v1 -> RFC v2:
> - Add indirect descriptor support - compile test only;
> - Add event suppression supprt - compile test only;
> - Move vring_packed_init() out of uapi (Jason, MST);
> - Merge two loops into one in virtqueue_add_packed() (Jason);
> - Split vring_unmap_one() for packed ring and split ring (Jason);
> - Avoid using '%' operator (Jason);
> - Rename free_head -> next_avail_idx (Jason);
> - Add comments for virtio_wmb() in virtqueue_add_packed() (Jason);
> - Some other refinements and bug fixes;
>
> Thanks!
>
> Tiwei Bie (5):
>    virtio: add packed ring definitions
>    virtio_ring: support creating packed ring
>    virtio_ring: add packed ring support
>    virtio_ring: add event idx support in packed ring
>    virtio_ring: enable packed ring
>
>   drivers/virtio/virtio_ring.c       | 1271 ++++++++++++++++++++++++++++--------
>   include/linux/virtio_ring.h        |    8 +-
>   include/uapi/linux/virtio_config.h |   12 +-
>   include/uapi/linux/virtio_ring.h   |   36 +
>   4 files changed, 1049 insertions(+), 278 deletions(-)
>

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [PATCH net-next v8 4/4] netvsc: refactor notifier/event handling code to use the failover framework
From: Samudrala, Sridhar @ 2018-04-27  0:30 UTC (permalink / raw)
  To: Michael S. Tsirkin, Stephen Hemminger
  Cc: alexander.h.duyck, virtio-dev, jiri, kubakici, netdev,
	virtualization, loseweigh, aaron.f.brown, davem
In-Reply-To: <20180427024347-mutt-send-email-mst@kernel.org>


On 4/26/2018 5:09 PM, Michael S. Tsirkin wrote:
> On Thu, Apr 26, 2018 at 03:33:26PM -0700, Stephen Hemminger wrote:
>> On Thu, 26 Apr 2018 05:30:05 +0300
>> "Michael S. Tsirkin" <mst@redhat.com> wrote:
>>
>>> On Wed, Apr 25, 2018 at 05:08:37PM -0700, Stephen Hemminger wrote:
>>>> On Wed, 25 Apr 2018 16:59:28 -0700
>>>> Sridhar Samudrala <sridhar.samudrala@intel.com> wrote:
>>>>    
>>>>> Use the registration/notification framework supported by the generic
>>>>> failover infrastructure.
>>>>>
>>>>> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>>>> NAK unless you prove this works on legacy distributions and with DPDK 18.05
>>>> without modification.
>>> It looks like it should work. What kind of proof are you looking for?
>>>
>>
>> I tried this with working Ubuntu 17 on WS2016.
>> It boots if the failover driver is configured in (as module).
> Oh so by "unless you prove" you meant "unless you test"?
>
>> But if the configuration has:
>>
>> $ grep FAILOVER .config
>> # CONFIG_NET_FAILOVER is not set
>> CONFIG_MAY_USE_NET_FAILOVER=y
> So the new option works, what's broken is when it's *not* selected.
> Looks like a bug.

Yes.  Looks like i need to make NET_FAILOVER  to be enabled automatically when
VIRTIO_NET or HYPERV_NET are selected.

Thanks Stephen for confirming that it works when NET_FAILOVER is enabled.



>
>> The netvsc driver fails on boot with:
>>
>> [    0.826447] hv_vmbus: registering driver hv_netvsc
>> [    0.829616] scsi 0:0:0:0: Direct-Access     Msft     Virtual Disk     1.0  PQ: 0 ANSI: 5
>> [    0.836291] input: Microsoft Vmbus HID-compliant Mouse as /devices/0006:045E:0621.0001/input/input1
>> [    0.839139] hid-generic 0006:045E:0621.0001: input: <UNKNOWN> HID v0.01 Mouse [Microsoft Vmbus HID-compliant Mouse] on
>> [    0.964897] hv_vmbus: probe failed for device 849a776e-8120-4e4a-9a36-7e3d95ac75b3 (-95)
>> [    0.968039] hv_netvsc: probe of 849a776e-8120-4e4a-9a36-7e3d95ac75b3 failed with error -95
>> [    1.112877] hv_vmbus: probe failed for device 53557f8e-057d-425b-9265-01c0fd7e273e (-95)
>> [    1.116064] hv_netvsc: probe of 53557f8e-057d-425b-9265-01c0fd7e273e failed with error -95
>>
>> The system has two virtual networks. eth0 is on vswitch for management.
>> eth1 is on vswitch with SR-IOV for performance tests.
>>
>> You probably need to just put the failover part in net/core and select it.
>  From all drivers. Yes. And it does not need to be visible in the menu
> imho.
>
>
>> It is trivial to get an evaluation version of Windows Server 2016 and setup a Linux VM.
>> Please try it.
> I presume some kind of test was done here. Sridhar, do you think you
> could include some notes about testing of this patch? Or in case you
> only build-tested this part, it's a good idea to notice this after ---
> in the patch, or in the cover letter.

I only did compile and build test of netvsc. I don't have a hyperv setup to test it out.
Would appreciate if Stephen or someone who has this setup will test it out when i
submit the next revision with the config fixes.

>

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [PATCH net-next v8 4/4] netvsc: refactor notifier/event handling code to use the failover framework
From: Michael S. Tsirkin @ 2018-04-27  0:09 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: alexander.h.duyck, virtio-dev, jiri, kubakici, Sridhar Samudrala,
	virtualization, loseweigh, netdev, aaron.f.brown, davem
In-Reply-To: <20180426153326.16388dfd@xeon-e3>

On Thu, Apr 26, 2018 at 03:33:26PM -0700, Stephen Hemminger wrote:
> On Thu, 26 Apr 2018 05:30:05 +0300
> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> 
> > On Wed, Apr 25, 2018 at 05:08:37PM -0700, Stephen Hemminger wrote:
> > > On Wed, 25 Apr 2018 16:59:28 -0700
> > > Sridhar Samudrala <sridhar.samudrala@intel.com> wrote:
> > >   
> > > > Use the registration/notification framework supported by the generic
> > > > failover infrastructure.
> > > > 
> > > > Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>  
> > > 
> > > NAK unless you prove this works on legacy distributions and with DPDK 18.05
> > > without modification.  
> > 
> > It looks like it should work. What kind of proof are you looking for?
> > 
> 
> 
> I tried this with working Ubuntu 17 on WS2016.
> It boots if the failover driver is configured in (as module).

Oh so by "unless you prove" you meant "unless you test"?

> But if the configuration has:
> 
> $ grep FAILOVER .config
> # CONFIG_NET_FAILOVER is not set
> CONFIG_MAY_USE_NET_FAILOVER=y

So the new option works, what's broken is when it's *not* selected.
Looks like a bug.

> The netvsc driver fails on boot with:
> 
> [    0.826447] hv_vmbus: registering driver hv_netvsc
> [    0.829616] scsi 0:0:0:0: Direct-Access     Msft     Virtual Disk     1.0  PQ: 0 ANSI: 5
> [    0.836291] input: Microsoft Vmbus HID-compliant Mouse as /devices/0006:045E:0621.0001/input/input1
> [    0.839139] hid-generic 0006:045E:0621.0001: input: <UNKNOWN> HID v0.01 Mouse [Microsoft Vmbus HID-compliant Mouse] on
> [    0.964897] hv_vmbus: probe failed for device 849a776e-8120-4e4a-9a36-7e3d95ac75b3 (-95)
> [    0.968039] hv_netvsc: probe of 849a776e-8120-4e4a-9a36-7e3d95ac75b3 failed with error -95
> [    1.112877] hv_vmbus: probe failed for device 53557f8e-057d-425b-9265-01c0fd7e273e (-95)
> [    1.116064] hv_netvsc: probe of 53557f8e-057d-425b-9265-01c0fd7e273e failed with error -95
> 
> The system has two virtual networks. eth0 is on vswitch for management.
> eth1 is on vswitch with SR-IOV for performance tests.
> 
> You probably need to just put the failover part in net/core and select it.

From all drivers. Yes. And it does not need to be visible in the menu
imho.


> It is trivial to get an evaluation version of Windows Server 2016 and setup a Linux VM.
> Please try it.

I presume some kind of test was done here. Sridhar, do you think you
could include some notes about testing of this patch? Or in case you
only build-tested this part, it's a good idea to notice this after ---
in the patch, or in the cover letter.

-- 
MST

^ permalink raw reply

* Re: [PATCH v7 net-next 4/4] netvsc: refactor notifier/event handling code to use the failover framework
From: Michael S. Tsirkin @ 2018-04-26 23:42 UTC (permalink / raw)
  To: Siwei Liu
  Cc: Alexander Duyck, virtio-dev, Jiri Pirko, Jakub Kicinski,
	Sridhar Samudrala, virtualization, Netdev, David Miller
In-Reply-To: <CADGSJ22D=NTWtjVeLKsgvcesy0b_ugzb9XGHGsWOktUEFhYN_Q@mail.gmail.com>

On Thu, Apr 26, 2018 at 03:14:46PM -0700, Siwei Liu wrote:
> On Wed, Apr 25, 2018 at 7:28 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > On Wed, Apr 25, 2018 at 03:57:57PM -0700, Siwei Liu wrote:
> >> On Wed, Apr 25, 2018 at 3:22 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> >> > On Wed, Apr 25, 2018 at 02:38:57PM -0700, Siwei Liu wrote:
> >> >> On Mon, Apr 23, 2018 at 1:06 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> >> >> > On Mon, Apr 23, 2018 at 12:44:39PM -0700, Siwei Liu wrote:
> >> >> >> On Mon, Apr 23, 2018 at 10:56 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> >> >> >> > On Mon, Apr 23, 2018 at 10:44:40AM -0700, Stephen Hemminger wrote:
> >> >> >> >> On Mon, 23 Apr 2018 20:24:56 +0300
> >> >> >> >> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> >> >> >> >>
> >> >> >> >> > On Mon, Apr 23, 2018 at 10:04:06AM -0700, Stephen Hemminger wrote:
> >> >> >> >> > > > >
> >> >> >> >> > > > >I will NAK patches to change to common code for netvsc especially the
> >> >> >> >> > > > >three device model.  MS worked hard with distro vendors to support transparent
> >> >> >> >> > > > >mode, ans we really can't have a new model; or do backport.
> >> >> >> >> > > > >
> >> >> >> >> > > > >Plus, DPDK is now dependent on existing model.
> >> >> >> >> > > >
> >> >> >> >> > > > Sorry, but nobody here cares about dpdk or other similar oddities.
> >> >> >> >> > >
> >> >> >> >> > > The network device model is a userspace API, and DPDK is a userspace application.
> >> >> >> >> >
> >> >> >> >> > It is userspace but are you sure dpdk is actually poking at netdevs?
> >> >> >> >> > AFAIK it's normally banging device registers directly.
> >> >> >> >> >
> >> >> >> >> > > You can't go breaking userspace even if you don't like the application.
> >> >> >> >> >
> >> >> >> >> > Could you please explain how is the proposed patchset breaking
> >> >> >> >> > userspace? Ignoring DPDK for now, I don't think it changes the userspace
> >> >> >> >> > API at all.
> >> >> >> >> >
> >> >> >> >>
> >> >> >> >> The DPDK has a device driver vdev_netvsc which scans the Linux network devices
> >> >> >> >> to look for Linux netvsc device and the paired VF device and setup the
> >> >> >> >> DPDK environment.  This setup creates a DPDK failsafe (bondingish) instance
> >> >> >> >> and sets up TAP support over the Linux netvsc device as well as the Mellanox
> >> >> >> >> VF device.
> >> >> >> >>
> >> >> >> >> So it depends on existing 2 device model. You can't go to a 3 device model
> >> >> >> >> or start hiding devices from userspace.
> >> >> >> >
> >> >> >> > Okay so how does the existing patch break that? IIUC does not go to
> >> >> >> > a 3 device model since netvsc calls failover_register directly.
> >> >> >> >
> >> >> >> >> Also, I am working on associating netvsc and VF device based on serial number
> >> >> >> >> rather than MAC address. The serial number is how Windows works now, and it makes
> >> >> >> >> sense for Linux and Windows to use the same mechanism if possible.
> >> >> >> >
> >> >> >> > Maybe we should support same for virtio ...
> >> >> >> > Which serial do you mean? From vpd?
> >> >> >> >
> >> >> >> > I guess you will want to keep supporting MAC for old hypervisors?
> >> >> >> >
> >> >> >> > It all seems like a reasonable thing to support in the generic core.
> >> >> >>
> >> >> >> That's the reason why I chose explicit identifier rather than rely on
> >> >> >> MAC address to bind/pair a device. MAC address can change. Even if it
> >> >> >> can't, malicious guest user can fake MAC address to skip binding.
> >> >> >>
> >> >> >> -Siwei
> >> >> >
> >> >> > Address should be sampled at device creation to prevent this
> >> >> > kind of hack. Not that it buys the malicious user much:
> >> >> > if you can poke at MAC addresses you probably already can
> >> >> > break networking.
> >> >>
> >> >> I don't understand why poking at MAC address may potentially break
> >> >> networking.
> >> >
> >> > Set a MAC address to match another device on the same LAN,
> >> > packets will stop reaching that MAC.
> >>
> >> What I meant was guest users may create a virtual link, say veth that
> >> has exactly the same MAC address as that for the VF, which can easily
> >> get around of the binding procedure.
> >
> > This patchset limits binding to PCI devices so it won't be affected
> > by any hacks around virtual devices.
> 
> Wait, I vaguely recall you seemed to like to generalize this feature
> to non-PCI device. 

It's purely a layering thing.  It is cleaner not to have PCI specific
data in the device-specific transport-independent section of the virtio
spec.


> But now you're saying it should stick to PCI. It's
> not that I'm reluctant with sticking to PCI. The fact is that I don't
> think we can go with implementation until the semantics of the
> so-called _F_STANDBY feature can be clearly defined into the spec.
> Previously the boundary of using MAC address as the identifier for
> bonding was quite confusing to me. And now PCI adds to the matrix.

PCI is simply one way to exclude software NICs. It's not the most
elegant one, but it will cover many setups.  We can add more types, but
we do want to exclude software devices since these have
not been supplied by the hypervisor.

> However it still does not gurantee uniqueness I think. It's almost
> incorrect of choosing MAC address as the ID in the beginning since
> that has the implication of breaking existing configs.

IMO there's no chance it will break any existing config since
no existing config sets _F_STANDBY.

> I don't think
> libvirt or QEMU today retricts the MAC address to be unique per VM
> instance. Neither the virtio spec mentions that.

You really don't have to.

> In addition, it's difficult to fake PCI device on Linux does not mean
> the same applies to other OSes that is going to implement this VirtIO
> feature. It's a fragile assumption IMHO.

What an OS does internally is its own business.

What we are telling the guest here is simply that the virtio NIC is
actually the same device as some other NIC. At this point we do not
specify this other NIC in any way. So how do you find it?  Well it has
to have the same MAC clearly.

You point out that there could be multiple NICs with the same
MAC in theory. It's a broken config generally but since it
kind of works in some setups maybe it's worth supporting.
If so we can look for ways to make the matching more specific by e.g.
adding more flags but I see that as a separate issue,
and pretty narrow in scope.

> >
> >> There's no explicit flag to
> >> identify a VF or pass-through device AFAIK. And sometimes this happens
> >> maybe due to user misconfiguring the link. This process should be
> >> hardened to avoid from any potential configuration errors.
> >
> > They are still PCI devices though.
> >
> >> >
> >> >> Unlike VF, passthrough PCI endpoint device has its freedom
> >> >> to change the MAC address. Even on a VF setup it's not neccessarily
> >> >> always safe to assume the VF's MAC address cannot or shouldn't be
> >> >> changed. That depends on the specific need whether the host admin
> >> >> wants to restrict guest from changing the MAC address, although in
> >> >> most cases it's true.
> >> >>
> >> >> I understand we can use the perm_addr to distinguish. But as said,
> >> >> this will pose limitation of flexible configuration where one can
> >> >> assign VFs with identical MAC address at all while each VF belongs to
> >> >> different PF and/or different subnet for e.g. load balancing.
> >> >> And
> >> >> furthermore, the QEMU device model never uses MAC address to be
> >> >> interpreted as an identifier, which requires to be unique per VM
> >> >> instance. Why we're introducing this inconsistency?
> >> >>
> >> >> -Siwei
> >> >
> >> > Because it addresses most of the issues and is simple.  That's already
> >> > much better than what we have now which is nothing unless guest
> >> > configures things manually.
> >>
> >> Did you see my QEMU patch for using BDF as the grouping identifier?
> >
> > Yes. And I don't think it can work because bus numbers are
> > guest specified.
> 
> I know it's not ideal but perhaps its the best one can do in the KVM
> world without adding complex config e.g. PCI bridge.

KVM is just a VMX/SVM driver. I think you mean QEMU.  And well -
"best one can do" is a high bar to clear.


> Even if bus
> number is guest specified, it's readily available in the guest and
> recognizable by any OS, while on the QEMU configuration users specify
> an id instead of the bus number. Unlike Hyper-V PCI bus, I don't think
> there exists a para-virtual PCI bus in QEMU backend to expose VPD
> capability to a passthrough device.

We can always add more interfaces if we need them.  But let's be clear
that we are adding an interface and what are we trying to fix by doing
it. Let's not mix it as part of the failover discussion.

> >
> >> And there can be others like what you suggested, but the point is that
> >> it's requried to support explicit grouping mechanism from day one,
> >> before the backup property cast into stones.
> >
> > Let's start with addressing simple configs with just two NICs.
> >
> > Down the road I can see possible extensions that can work: for example,
> > require that devices are on the same pci bridge. Or we could even make
> > the virtio device actually include a pci bridge (as part of same
> > or a child function), the PT would have to be
> > behind it.
> >
> > As long as we are not breaking anything, adding more flags to fix
> > non-working configurations is always fair game.
> 
> While it may work, the PCI bridge has NUMA and IOMMU implications that
> would restrict the current flexibility to group devices.

It's interesting you should mention that.

If you want to be flexible in placing the primary device WRT NUMA and
IOMMU, and given that both IOMMU and NUMA are keyed by the bus address,
then doesn't this completely break the idea of passing
the bus address to the guest?

> I'm not sure
> if vIOMMU would have to be introduced inadvertently for
> isolation/protection of devices under the PCI bridge which may cause
> negative performance impact on the VF.

No idea how do you introduce an IOMMU inadvertently.

> >
> >> This is orthogonal to
> >> device model being proposed, be it 1-netdev or not. Delaying it would
> >> just mean support and compatibility burden, appearing more like a
> >> design flaw rather than a feature to add later on.
> >
> > Well it's mostly myself who gets to support it, and I see the device
> > model as much more fundamental as userspace will come to depend
> > on it. So I'm not too worried, let's take this one step at a time.
> >
> >> >
> >> > I think ideally the infrastructure should suppport flexible matching of
> >> > NICs - netvsc is already reported to be moving to some kind of serial
> >> > address.
> >> >
> >> As Stephen said, Hyper-V supports the serial UUID thing from day-one.
> >> It's just the Linux netvsc guest driver itself does not leverage that
> >> ID from the very beginging.
> >>
> >> Regards,
> >> -Siwei
> >
> > We could add something like this, too. For example,
> > we could add a virtual VPD capability with a UUID.
> 
> I'm not an expert on that and wonder how you could do this (add a
> virtual VPD capability with a UUID to passthrough device) with
> existing QEMU emulation model and native PCI bus.


I think I see an elegant way to do that.

You could put it in the port where you want to stick you PT device.

Here's how it could work then:


- standby virtio device is tied to a pci bridge.

  Tied how? Well it could be 
  - behind this bridge
  - include a bridge internally
  - have the bridge as a PCI function
  - include a bridge and the bridge as a PCI function
  - have a VPD or serial capability with same UUID as the bridge

- primary passthrough device is placed behind a bridge
  *with the same ID*

	- either simply behind the same bridge
	- or behind another bridge with the same UUID.


The treatment could also be limited just to bridges which have a
specific vendor/device id (maybe a good idea), or in any other arbitrary
way.




> >
> > Do you know how exactly does hyperv pass the UUID for NICs?
> 
> Stephen might know it more and can correct me. But my personal
> interpretation is that the SN is a host generated 32 bit sequence
> number which is unique per VM instance and gets propogated to guest
> via the para-virtual Hyper-V PCI bus.
> 
> Regards,
> -Siwei

Ah, so it's a Hyper-V thing.




> >
> >> >
> >> >> >
> >> >> >
> >> >> >
> >> >> >
> >> >> >>
> >> >> >> >
> >> >> >> > --
> >> >> >> > MST

^ permalink raw reply

* Re: [dm-devel] [PATCH v5] fault-injection: introduce kvmalloc fallback options
From: Mikulas Patocka @ 2018-04-26 22:52 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Andrew, eric.dumazet, linux-mm, edumazet, netdev, Randy Dunlap,
	John Stoffel, linux-kernel, Matthew Wilcox, Hocko,
	James Bottomley, Michal, dm-devel, David Rientjes, Morton,
	virtualization, David Miller, Vlastimil Babka
In-Reply-To: <20180427005213-mutt-send-email-mst@kernel.org>



On Fri, 27 Apr 2018, Michael S. Tsirkin wrote:

> On Thu, Apr 26, 2018 at 05:50:20PM -0400, Mikulas Patocka wrote:
> > How is the user or developer supposed to learn about this option, if 
> > he gets no crash at all?
> 
> Look in /sys/kernel/debug/fail* ? That actually lets you
> filter by module, process etc.
> 
> I think this patch conflates two things:
> 
> 1. Make kvmalloc use the vmalloc path.
>     This seems a bit narrow.
>     What is special about kvmalloc? IMHO nothing - it's yet another user
>     of __GFP_NORETRY or __GFP_RETRY_MAYFAIL. As any such

__GFP_RETRY_MAYFAIL makes the allocator retry the costly_order allocations

>     user, it either recovers correctly or not.
>     So IMHO it's just a case of
>     making __GFP_NORETRY, __GFP_RETRY_MAYFAIL, or both
>     fail once in a while.
>     Seems like a better extension to me than focusing on vmalloc.
>     I think you will find more bugs this way.

If the array is <= PAGE_SIZE, vmalloc will not use __GFP_NORETRY. So it 
still hides some bugs - such as, if a structure grows above 4k, it would 
start randomly crashing due to memory fragmentation.

> 2. Ability to control this from a separate config
>    option.
> 
>    It's still not that clear to me why is this such a
>    hard requirement.  If a distro wants to force specific
>    boot time options, why isn't CONFIG_CMDLINE sufficient?

There are 489 kernel options declared with the __setup keyword. Hardly any 
kernel developer notices that a new one was added and selects it when 
testing his code.

>    But assuming it's important to control this kind of
>    fault injection to be controlled from
>    a dedicated menuconfig option, why not the rest of
>    faults?

The injected faults cause damage to the user, so there's no point to 
enable them by default. vmalloc fallback should not cause any damage 
(assuming that the code is correctly written).

> IMHO if you split 1/2 up, and generalize, the path upstream
> will be much smoother.

This seems like a lost case. So, let's not care about code correctness and 
let's solve crashes only after they are reported. If the upstream wants to 
work this way, there's nothing that can be done about it.

I'm wondering if I can still push it to RHEL or not.

> Hope this helps.
> 
> -- 
> MST

Mikulas

^ permalink raw reply

* Re: [PATCH net-next v8 4/4] netvsc: refactor notifier/event handling code to use the failover framework
From: Stephen Hemminger @ 2018-04-26 22:33 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: alexander.h.duyck, virtio-dev, jiri, kubakici, Sridhar Samudrala,
	virtualization, loseweigh, netdev, aaron.f.brown, davem
In-Reply-To: <20180426052908-mutt-send-email-mst@kernel.org>

On Thu, 26 Apr 2018 05:30:05 +0300
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Wed, Apr 25, 2018 at 05:08:37PM -0700, Stephen Hemminger wrote:
> > On Wed, 25 Apr 2018 16:59:28 -0700
> > Sridhar Samudrala <sridhar.samudrala@intel.com> wrote:
> >   
> > > Use the registration/notification framework supported by the generic
> > > failover infrastructure.
> > > 
> > > Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>  
> > 
> > NAK unless you prove this works on legacy distributions and with DPDK 18.05
> > without modification.  
> 
> It looks like it should work. What kind of proof are you looking for?
> 


I tried this with working Ubuntu 17 on WS2016.
It boots if the failover driver is configured in (as module).
But if the configuration has:

$ grep FAILOVER .config
# CONFIG_NET_FAILOVER is not set
CONFIG_MAY_USE_NET_FAILOVER=y


The netvsc driver fails on boot with:

[    0.826447] hv_vmbus: registering driver hv_netvsc
[    0.829616] scsi 0:0:0:0: Direct-Access     Msft     Virtual Disk     1.0  PQ: 0 ANSI: 5
[    0.836291] input: Microsoft Vmbus HID-compliant Mouse as /devices/0006:045E:0621.0001/input/input1
[    0.839139] hid-generic 0006:045E:0621.0001: input: <UNKNOWN> HID v0.01 Mouse [Microsoft Vmbus HID-compliant Mouse] on
[    0.964897] hv_vmbus: probe failed for device 849a776e-8120-4e4a-9a36-7e3d95ac75b3 (-95)
[    0.968039] hv_netvsc: probe of 849a776e-8120-4e4a-9a36-7e3d95ac75b3 failed with error -95
[    1.112877] hv_vmbus: probe failed for device 53557f8e-057d-425b-9265-01c0fd7e273e (-95)
[    1.116064] hv_netvsc: probe of 53557f8e-057d-425b-9265-01c0fd7e273e failed with error -95

The system has two virtual networks. eth0 is on vswitch for management.
eth1 is on vswitch with SR-IOV for performance tests.

You probably need to just put the failover part in net/core and select it.


It is trivial to get an evaluation version of Windows Server 2016 and setup a Linux VM.
Please try it.

^ permalink raw reply

* Re: [dm-devel] [PATCH v5] fault-injection: introduce kvmalloc fallback options
From: Michael S. Tsirkin @ 2018-04-26 22:21 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Andrew, eric.dumazet, linux-mm, edumazet, netdev, Randy Dunlap,
	John Stoffel, linux-kernel, Matthew Wilcox, Hocko,
	James Bottomley, Michal, dm-devel, David Rientjes, Morton,
	virtualization, David Miller, Vlastimil Babka
In-Reply-To: <alpine.LRH.2.02.1804261726540.13401@file01.intranet.prod.int.rdu2.redhat.com>

On Thu, Apr 26, 2018 at 05:50:20PM -0400, Mikulas Patocka wrote:
> How is the user or developer supposed to learn about this option, if 
> he gets no crash at all?

Look in /sys/kernel/debug/fail* ? That actually lets you
filter by module, process etc.

I think this patch conflates two things:

1. Make kvmalloc use the vmalloc path.
    This seems a bit narrow.
    What is special about kvmalloc? IMHO nothing - it's yet another user
    of __GFP_NORETRY or __GFP_RETRY_MAYFAIL. As any such
    user, it either recovers correctly or not.
    So IMHO it's just a case of
    making __GFP_NORETRY, __GFP_RETRY_MAYFAIL, or both
    fail once in a while.
    Seems like a better extension to me than focusing on vmalloc.
    I think you will find more bugs this way.

2. Ability to control this from a separate config
   option.

   It's still not that clear to me why is this such a
   hard requirement.  If a distro wants to force specific
   boot time options, why isn't CONFIG_CMDLINE sufficient?

   But assuming it's important to control this kind of
   fault injection to be controlled from
   a dedicated menuconfig option, why not the rest of
   faults?

IMHO if you split 1/2 up, and generalize, the path upstream
will be much smoother.

Hope this helps.

-- 
MST

^ permalink raw reply

* Re: [PATCH v7 net-next 4/4] netvsc: refactor notifier/event handling code to use the failover framework
From: Siwei Liu @ 2018-04-26 22:14 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Alexander Duyck, virtio-dev, Jiri Pirko, Jakub Kicinski,
	Sridhar Samudrala, virtualization, Netdev, David Miller
In-Reply-To: <20180426050934-mutt-send-email-mst@kernel.org>

On Wed, Apr 25, 2018 at 7:28 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Wed, Apr 25, 2018 at 03:57:57PM -0700, Siwei Liu wrote:
>> On Wed, Apr 25, 2018 at 3:22 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
>> > On Wed, Apr 25, 2018 at 02:38:57PM -0700, Siwei Liu wrote:
>> >> On Mon, Apr 23, 2018 at 1:06 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
>> >> > On Mon, Apr 23, 2018 at 12:44:39PM -0700, Siwei Liu wrote:
>> >> >> On Mon, Apr 23, 2018 at 10:56 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
>> >> >> > On Mon, Apr 23, 2018 at 10:44:40AM -0700, Stephen Hemminger wrote:
>> >> >> >> On Mon, 23 Apr 2018 20:24:56 +0300
>> >> >> >> "Michael S. Tsirkin" <mst@redhat.com> wrote:
>> >> >> >>
>> >> >> >> > On Mon, Apr 23, 2018 at 10:04:06AM -0700, Stephen Hemminger wrote:
>> >> >> >> > > > >
>> >> >> >> > > > >I will NAK patches to change to common code for netvsc especially the
>> >> >> >> > > > >three device model.  MS worked hard with distro vendors to support transparent
>> >> >> >> > > > >mode, ans we really can't have a new model; or do backport.
>> >> >> >> > > > >
>> >> >> >> > > > >Plus, DPDK is now dependent on existing model.
>> >> >> >> > > >
>> >> >> >> > > > Sorry, but nobody here cares about dpdk or other similar oddities.
>> >> >> >> > >
>> >> >> >> > > The network device model is a userspace API, and DPDK is a userspace application.
>> >> >> >> >
>> >> >> >> > It is userspace but are you sure dpdk is actually poking at netdevs?
>> >> >> >> > AFAIK it's normally banging device registers directly.
>> >> >> >> >
>> >> >> >> > > You can't go breaking userspace even if you don't like the application.
>> >> >> >> >
>> >> >> >> > Could you please explain how is the proposed patchset breaking
>> >> >> >> > userspace? Ignoring DPDK for now, I don't think it changes the userspace
>> >> >> >> > API at all.
>> >> >> >> >
>> >> >> >>
>> >> >> >> The DPDK has a device driver vdev_netvsc which scans the Linux network devices
>> >> >> >> to look for Linux netvsc device and the paired VF device and setup the
>> >> >> >> DPDK environment.  This setup creates a DPDK failsafe (bondingish) instance
>> >> >> >> and sets up TAP support over the Linux netvsc device as well as the Mellanox
>> >> >> >> VF device.
>> >> >> >>
>> >> >> >> So it depends on existing 2 device model. You can't go to a 3 device model
>> >> >> >> or start hiding devices from userspace.
>> >> >> >
>> >> >> > Okay so how does the existing patch break that? IIUC does not go to
>> >> >> > a 3 device model since netvsc calls failover_register directly.
>> >> >> >
>> >> >> >> Also, I am working on associating netvsc and VF device based on serial number
>> >> >> >> rather than MAC address. The serial number is how Windows works now, and it makes
>> >> >> >> sense for Linux and Windows to use the same mechanism if possible.
>> >> >> >
>> >> >> > Maybe we should support same for virtio ...
>> >> >> > Which serial do you mean? From vpd?
>> >> >> >
>> >> >> > I guess you will want to keep supporting MAC for old hypervisors?
>> >> >> >
>> >> >> > It all seems like a reasonable thing to support in the generic core.
>> >> >>
>> >> >> That's the reason why I chose explicit identifier rather than rely on
>> >> >> MAC address to bind/pair a device. MAC address can change. Even if it
>> >> >> can't, malicious guest user can fake MAC address to skip binding.
>> >> >>
>> >> >> -Siwei
>> >> >
>> >> > Address should be sampled at device creation to prevent this
>> >> > kind of hack. Not that it buys the malicious user much:
>> >> > if you can poke at MAC addresses you probably already can
>> >> > break networking.
>> >>
>> >> I don't understand why poking at MAC address may potentially break
>> >> networking.
>> >
>> > Set a MAC address to match another device on the same LAN,
>> > packets will stop reaching that MAC.
>>
>> What I meant was guest users may create a virtual link, say veth that
>> has exactly the same MAC address as that for the VF, which can easily
>> get around of the binding procedure.
>
> This patchset limits binding to PCI devices so it won't be affected
> by any hacks around virtual devices.

Wait, I vaguely recall you seemed to like to generalize this feature
to non-PCI device. But now you're saying it should stick to PCI. It's
not that I'm reluctant with sticking to PCI. The fact is that I don't
think we can go with implementation until the semantics of the
so-called _F_STANDBY feature can be clearly defined into the spec.
Previously the boundary of using MAC address as the identifier for
bonding was quite confusing to me. And now PCI adds to the matrix.
However it still does not gurantee uniqueness I think. It's almost
incorrect of choosing MAC address as the ID in the beginning since
that has the implication of breaking existing configs. I don't think
libvirt or QEMU today retricts the MAC address to be unique per VM
instance. Neither the virtio spec mentions that.

In addition, it's difficult to fake PCI device on Linux does not mean
the same applies to other OSes that is going to implement this VirtIO
feature. It's a fragile assumption IMHO.

>
>> There's no explicit flag to
>> identify a VF or pass-through device AFAIK. And sometimes this happens
>> maybe due to user misconfiguring the link. This process should be
>> hardened to avoid from any potential configuration errors.
>
> They are still PCI devices though.
>
>> >
>> >> Unlike VF, passthrough PCI endpoint device has its freedom
>> >> to change the MAC address. Even on a VF setup it's not neccessarily
>> >> always safe to assume the VF's MAC address cannot or shouldn't be
>> >> changed. That depends on the specific need whether the host admin
>> >> wants to restrict guest from changing the MAC address, although in
>> >> most cases it's true.
>> >>
>> >> I understand we can use the perm_addr to distinguish. But as said,
>> >> this will pose limitation of flexible configuration where one can
>> >> assign VFs with identical MAC address at all while each VF belongs to
>> >> different PF and/or different subnet for e.g. load balancing.
>> >> And
>> >> furthermore, the QEMU device model never uses MAC address to be
>> >> interpreted as an identifier, which requires to be unique per VM
>> >> instance. Why we're introducing this inconsistency?
>> >>
>> >> -Siwei
>> >
>> > Because it addresses most of the issues and is simple.  That's already
>> > much better than what we have now which is nothing unless guest
>> > configures things manually.
>>
>> Did you see my QEMU patch for using BDF as the grouping identifier?
>
> Yes. And I don't think it can work because bus numbers are
> guest specified.

I know it's not ideal but perhaps its the best one can do in the KVM
world without adding complex config e.g. PCI bridge. Even if bus
number is guest specified, it's readily available in the guest and
recognizable by any OS, while on the QEMU configuration users specify
an id instead of the bus number. Unlike Hyper-V PCI bus, I don't think
there exists a para-virtual PCI bus in QEMU backend to expose VPD
capability to a passthrough device.

>
>> And there can be others like what you suggested, but the point is that
>> it's requried to support explicit grouping mechanism from day one,
>> before the backup property cast into stones.
>
> Let's start with addressing simple configs with just two NICs.
>
> Down the road I can see possible extensions that can work: for example,
> require that devices are on the same pci bridge. Or we could even make
> the virtio device actually include a pci bridge (as part of same
> or a child function), the PT would have to be
> behind it.
>
> As long as we are not breaking anything, adding more flags to fix
> non-working configurations is always fair game.

While it may work, the PCI bridge has NUMA and IOMMU implications that
would restrict the current flexibility to group devices. I'm not sure
if vIOMMU would have to be introduced inadvertently for
isolation/protection of devices under the PCI bridge which may cause
negative performance impact on the VF.

>
>> This is orthogonal to
>> device model being proposed, be it 1-netdev or not. Delaying it would
>> just mean support and compatibility burden, appearing more like a
>> design flaw rather than a feature to add later on.
>
> Well it's mostly myself who gets to support it, and I see the device
> model as much more fundamental as userspace will come to depend
> on it. So I'm not too worried, let's take this one step at a time.
>
>> >
>> > I think ideally the infrastructure should suppport flexible matching of
>> > NICs - netvsc is already reported to be moving to some kind of serial
>> > address.
>> >
>> As Stephen said, Hyper-V supports the serial UUID thing from day-one.
>> It's just the Linux netvsc guest driver itself does not leverage that
>> ID from the very beginging.
>>
>> Regards,
>> -Siwei
>
> We could add something like this, too. For example,
> we could add a virtual VPD capability with a UUID.

I'm not an expert on that and wonder how you could do this (add a
virtual VPD capability with a UUID to passthrough device) with
existing QEMU emulation model and native PCI bus.

>
> Do you know how exactly does hyperv pass the UUID for NICs?

Stephen might know it more and can correct me. But my personal
interpretation is that the SN is a host generated 32 bit sequence
number which is unique per VM instance and gets propogated to guest
via the para-virtual Hyper-V PCI bus.

Regards,
-Siwei

>
>> >
>> >> >
>> >> >
>> >> >
>> >> >
>> >> >>
>> >> >> >
>> >> >> > --
>> >> >> > MST

^ permalink raw reply

* Re: [dm-devel] [PATCH v5] fault-injection: introduce kvmalloc fallback options
From: Mikulas Patocka @ 2018-04-26 21:50 UTC (permalink / raw)
  To: John Stoffel
  Cc: Andrew, eric.dumazet, mst, edumazet, netdev, Randy Dunlap,
	linux-kernel, Matthew Wilcox, Hocko, James Bottomley, Michal,
	dm-devel, David Miller, David Rientjes, Morton, virtualization,
	linux-mm, Vlastimil Babka
In-Reply-To: <23266.8532.619051.784274@quad.stoffel.home>



On Thu, 26 Apr 2018, John Stoffel wrote:

> >>>>> "James" == James Bottomley <James.Bottomley@HansenPartnership.com> writes:
> 
> James> I may be an atypical developer but I'd rather have a root canal
> James> than browse through menuconfig options.  The way to get people
> James> to learn about new debugging options is to blog about it (or
> James> write an lwn.net article) which google will find the next time
> James> I ask it how I debug XXX.  Google (probably as a service to
> James> humanity) rarely turns up Kconfig options in response to a
> James> query.
> 
> I agree with James here.  Looking at the SLAB vs SLUB Kconfig entries
> tells me *nothing* about why I should pick one or the other, as an
> example.
> 
> John

I see your point - and I think the misunderstanding is this.

This patch is not really helping people to debug existing crashes. It is 
not like "you get a crash" - "you google for some keywords" - "you get a 
page that suggests to turn this option on" - "you turn it on and solve the 
crash".

What this patch really does is that - it makes the kernel deliberately 
crash in a situation when the code violates the specification, but it 
would not crash otherwise or it would crash very rarely. It helps to 
detect specification violations.

If the kernel developer (or tester) doesn't use this option, his buggy 
code won't crash - and if it won't crash, he won't fix the bug or report 
it. How is the user or developer supposed to learn about this option, if 
he gets no crash at all?

Mikulas

^ permalink raw reply

* Re: [dm-devel] [PATCH v5] fault-injection: introduce kvmalloc fallback options
From: Mikulas Patocka @ 2018-04-26 20:05 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: eric.dumazet, netdev, Randy Dunlap, linux-kernel, Matthew Wilcox,
	Michal Hocko, James Bottomley, linux-mm, dm-devel,
	Vlastimil Babka, David Rientjes, Andrew Morton, virtualization,
	David Miller, edumazet
In-Reply-To: <20180426223925-mutt-send-email-mst@kernel.org>



On Thu, 26 Apr 2018, Michael S. Tsirkin wrote:

> On Thu, Apr 26, 2018 at 03:36:14PM -0400, Mikulas Patocka wrote:
> > People on this list argue "this should be a kernel parameter".
> 
> How about making it a writeable attribute, so it's easy to turn on/off
> after boot. Then you can keep it deterministic, userspace can play with
> the attribute at random if it wants to.
> 
> -- 
> MST

It is already controllable by an attribute in debugfs.

Will you email all the testers about this attribute? How many of them will 
remember to set it? How many of them will remember to set it a year after? 
Will you write a userspace program that manages it and introduce it into 
the distributon?

This is a little feature.

Mikulas

^ permalink raw reply

* Re: [dm-devel] [PATCH v5] fault-injection: introduce kvmalloc fallback options
From: Michael S. Tsirkin @ 2018-04-26 19:45 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: eric.dumazet, netdev, Randy Dunlap, linux-kernel, Matthew Wilcox,
	Michal Hocko, James Bottomley, linux-mm, dm-devel,
	Vlastimil Babka, David Rientjes, Andrew Morton, virtualization,
	David Miller, edumazet
In-Reply-To: <alpine.LRH.2.02.1804261516250.26980@file01.intranet.prod.int.rdu2.redhat.com>

On Thu, Apr 26, 2018 at 03:36:14PM -0400, Mikulas Patocka wrote:
> People on this list argue "this should be a kernel parameter".

How about making it a writeable attribute, so it's easy to turn on/off
after boot. Then you can keep it deterministic, userspace can play with
the attribute at random if it wants to.

-- 
MST

^ permalink raw reply

* Re: [dm-devel] [PATCH v5] fault-injection: introduce kvmalloc fallback options
From: Mikulas Patocka @ 2018-04-26 19:36 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: eric.dumazet, netdev, Randy Dunlap, linux-kernel, Matthew Wilcox,
	Michal Hocko, James Bottomley, linux-mm, dm-devel,
	Vlastimil Babka, David Rientjes, Andrew Morton, virtualization,
	David Miller, edumazet
In-Reply-To: <20180426220523-mutt-send-email-mst@kernel.org>



On Thu, 26 Apr 2018, Michael S. Tsirkin wrote:

> On Thu, Apr 26, 2018 at 02:54:26PM -0400, Mikulas Patocka wrote:
> > 
> > 
> > On Thu, 26 Apr 2018, Michael S. Tsirkin wrote:
> > 
> > > On Thu, Apr 26, 2018 at 12:07:25PM -0400, Mikulas Patocka wrote:
> > > > > IIUC debug kernels mainly exist so people who experience e.g. memory
> > > > > corruption can try and debug the failure. In this case, CONFIG_DEBUG_SG
> > > > > will *already* catch a failure early. Nothing special needs to be done.
> > > > 
> > > > The patch helps people debug such memory coprruptions (such as using DMA 
> > > > API on the result of kvmalloc).
> > > 
> > > That's my point.  I don't think your patch helps debug any memory
> > > corruptions.  With CONFIG_DEBUG_SG using DMA API already causes a
> > > BUG_ON, that's before any memory can get corrupted.
> > 
> > The patch turns a hard-to-reproduce bug into an easy-to-reproduce bug. 
> 
> It's still not a memory corruption. It's a BUG_ON the source of which -
> should it trigger - can be typically found using grep.
> 
> > Obviously we don't want this in production kernels, but in the debug 
> > kernels it should be done.
> > 
> > Mikulas
> 
> I'm not so sure. debug kernels should make debugging easier,
> definitely.
> 
> Unfortunately they are already slower so some races don't trigger.
> 
> If they also start crashing more because we are injecting
> memory allocation errors, people are even less likely to
> be able to use them.

I've actually already pushed this patch to RHEL-7 (just before 7.5 was 
released) and it found out some powerpc issues. See the commit 
ea376cc55bc3 in the RHEL-7 git. It was reverted just before RHEL-7.5 was 
released with the intention that it will be reinstated just after RHEL-7.5 
release, so that these issues could be found and eliminated in the 
7.5->7.6 development cycle. Jeff Moyer asked me to put it upstream because 
they want to follow upstream and they don't like RHEL-specific patches. 
There's clear incentive to put this patch to RHEL-7, that's why I'm 
posting it here.

> Just add a comment near the BUG_ON within DMA API telling people how
> they can inject this error some more if the bug does not
> reproduce, and leave it at that.

But the problem is that the powerpc bug only triggers with this patch. It 
doesn't trigger without it. So, we have a potential random-crashing bug in 
the codebase (and perhaps more others) and we want to eliminate them - 
that's why we need the patch.

People on this list argue "this should be a kernel parameter". But the 
testers won't enable the kernel parameter, the crashes won't happen 
without the kernel parameter and the bugs will stay unreported and 
uncorrected. That's why it needs to be the default.

Mikulas

^ permalink raw reply

* Re: [dm-devel] [PATCH v5] fault-injection: introduce kvmalloc fallback options
From: Michael S. Tsirkin @ 2018-04-26 19:14 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: eric.dumazet, netdev, Randy Dunlap, linux-kernel, Matthew Wilcox,
	Michal Hocko, James Bottomley, linux-mm, dm-devel,
	Vlastimil Babka, David Rientjes, Andrew Morton, virtualization,
	David Miller, edumazet
In-Reply-To: <alpine.LRH.2.02.1804261451120.23716@file01.intranet.prod.int.rdu2.redhat.com>

On Thu, Apr 26, 2018 at 02:54:26PM -0400, Mikulas Patocka wrote:
> 
> 
> On Thu, 26 Apr 2018, Michael S. Tsirkin wrote:
> 
> > On Thu, Apr 26, 2018 at 12:07:25PM -0400, Mikulas Patocka wrote:
> > > > IIUC debug kernels mainly exist so people who experience e.g. memory
> > > > corruption can try and debug the failure. In this case, CONFIG_DEBUG_SG
> > > > will *already* catch a failure early. Nothing special needs to be done.
> > > 
> > > The patch helps people debug such memory coprruptions (such as using DMA 
> > > API on the result of kvmalloc).
> > 
> > That's my point.  I don't think your patch helps debug any memory
> > corruptions.  With CONFIG_DEBUG_SG using DMA API already causes a
> > BUG_ON, that's before any memory can get corrupted.
> 
> The patch turns a hard-to-reproduce bug into an easy-to-reproduce bug. 

It's still not a memory corruption. It's a BUG_ON the source of which -
should it trigger - can be typically found using grep.

> Obviously we don't want this in production kernels, but in the debug 
> kernels it should be done.
> 
> Mikulas

I'm not so sure. debug kernels should make debugging easier,
definitely.

Unfortunately they are already slower so some races don't trigger.

If they also start crashing more because we are injecting
memory allocation errors, people are even less likely to
be able to use them.

Just add a comment near the BUG_ON within DMA API telling people how
they can inject this error some more if the bug does not
reproduce, and leave it at that.

-- 
MST

^ permalink raw reply

* Re: [dm-devel] [PATCH v5] fault-injection: introduce kvmalloc fallback options
From: Michael S. Tsirkin @ 2018-04-26 19:05 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: eric.dumazet, netdev, Randy Dunlap, linux-kernel, Matthew Wilcox,
	Michal Hocko, James Bottomley, linux-mm, dm-devel,
	Vlastimil Babka, David Rientjes, Andrew Morton, virtualization,
	David Miller, edumazet
In-Reply-To: <alpine.LRH.2.02.1804261454380.23716@file01.intranet.prod.int.rdu2.redhat.com>

On Thu, Apr 26, 2018 at 02:58:08PM -0400, Mikulas Patocka wrote:
> 
> 
> On Thu, 26 Apr 2018, Michael S. Tsirkin wrote:
> 
> > How do you make sure QA tests a specific corner case? Add it to
> > the test plan :)
> 
> BTW. how many "lines of code" of corporate bureaucracy would that take? :-)

It's pretty easy at least here at Red Hat.

> > I don't speak for Red Hat, etc.
> > 
> > -- 
> > MST
> 
> Mikulas

^ permalink raw reply

* Re: [dm-devel] [PATCH v5] fault-injection: introduce kvmalloc fallback options
From: Mikulas Patocka @ 2018-04-26 18:58 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: eric.dumazet, netdev, Randy Dunlap, linux-kernel, Matthew Wilcox,
	Michal Hocko, James Bottomley, linux-mm, dm-devel,
	Vlastimil Babka, David Rientjes, Andrew Morton, virtualization,
	David Miller, edumazet
In-Reply-To: <20180426184845-mutt-send-email-mst@kernel.org>



On Thu, 26 Apr 2018, Michael S. Tsirkin wrote:

> How do you make sure QA tests a specific corner case? Add it to
> the test plan :)

BTW. how many "lines of code" of corporate bureaucracy would that take? :-)

> I don't speak for Red Hat, etc.
> 
> -- 
> MST

Mikulas

^ permalink raw reply

* Re: [dm-devel] [PATCH v5] fault-injection: introduce kvmalloc fallback options
From: Mikulas Patocka @ 2018-04-26 18:54 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: eric.dumazet, netdev, Randy Dunlap, linux-kernel, Matthew Wilcox,
	Michal Hocko, James Bottomley, linux-mm, dm-devel,
	Vlastimil Babka, David Rientjes, Andrew Morton, virtualization,
	David Miller, edumazet
In-Reply-To: <20180426214011-mutt-send-email-mst@kernel.org>



On Thu, 26 Apr 2018, Michael S. Tsirkin wrote:

> On Thu, Apr 26, 2018 at 12:07:25PM -0400, Mikulas Patocka wrote:
> > > IIUC debug kernels mainly exist so people who experience e.g. memory
> > > corruption can try and debug the failure. In this case, CONFIG_DEBUG_SG
> > > will *already* catch a failure early. Nothing special needs to be done.
> > 
> > The patch helps people debug such memory coprruptions (such as using DMA 
> > API on the result of kvmalloc).
> 
> That's my point.  I don't think your patch helps debug any memory
> corruptions.  With CONFIG_DEBUG_SG using DMA API already causes a
> BUG_ON, that's before any memory can get corrupted.

The patch turns a hard-to-reproduce bug into an easy-to-reproduce bug. 

Obviously we don't want this in production kernels, but in the debug 
kernels it should be done.

Mikulas

^ permalink raw reply

* [PULL] virtio: fixups
From: Michael S. Tsirkin @ 2018-04-26 18:50 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: kvm, mst, netdev, linux-kernel, stable, virtualization

The following changes since commit 6d08b06e67cd117f6992c46611dfb4ce267cd71e:

  Linux 4.17-rc2 (2018-04-22 19:20:09 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git tags/for_linus

for you to fetch changes up to 5c60300d68da32ca77f7f978039dc72bfc78b06b:

  virtio_console: reset on out of memory (2018-04-25 20:41:29 +0300)

----------------------------------------------------------------
virtio: fixups

Latest header update will break QEMU (if it's rebuilt with the new
header) - and it seems that the code there is so fragile that any change
in this header will break it.  Add a better interface so users do not
need to change their code every time that header changes.

Fix virtio console for spec compliance.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

----------------------------------------------------------------
Michael S. Tsirkin (7):
      virtio_balloon: add array of stat names
      virtio_console: don't tie bufs to a vq
      virtio: add ability to iterate over vqs
      virtio_console: free buffers after reset
      virtio_console: drop custom control queue cleanup
      virtio_console: move removal code
      virtio_console: reset on out of memory

 drivers/char/virtio_console.c       | 157 ++++++++++++++++--------------------
 include/linux/virtio.h              |   3 +
 include/uapi/linux/virtio_balloon.h |  15 ++++
 3 files changed, 89 insertions(+), 86 deletions(-)

^ permalink raw reply

* Re: [dm-devel] [PATCH v5] fault-injection: introduce kvmalloc fallback options
From: Michael S. Tsirkin @ 2018-04-26 18:49 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: eric.dumazet, netdev, Randy Dunlap, linux-kernel, Matthew Wilcox,
	Michal Hocko, James Bottomley, linux-mm, dm-devel,
	Vlastimil Babka, David Rientjes, Andrew Morton, virtualization,
	David Miller, edumazet
In-Reply-To: <alpine.LRH.2.02.1804261202350.24656@file01.intranet.prod.int.rdu2.redhat.com>

On Thu, Apr 26, 2018 at 12:07:25PM -0400, Mikulas Patocka wrote:
> > IIUC debug kernels mainly exist so people who experience e.g. memory
> > corruption can try and debug the failure. In this case, CONFIG_DEBUG_SG
> > will *already* catch a failure early. Nothing special needs to be done.
> 
> The patch helps people debug such memory coprruptions (such as using DMA 
> API on the result of kvmalloc).

That's my point.  I don't think your patch helps debug any memory
corruptions.  With CONFIG_DEBUG_SG using DMA API already causes a
BUG_ON, that's before any memory can get corrupted.

-- 
MST

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox