[PATCH] [RFC] virtio: Limit the retries on a virtio device reset

virtualization.lists.linux-foundation.org archive mirror
 help / color / mirror / Atom feed

* [PATCH] [RFC] virtio: Limit the retries on a virtio device reset
@ 2017-08-23 16:33 Pierre Morel
  2017-08-24 11:07 ` Cornelia Huck
  2017-08-24 14:19 ` Michael S. Tsirkin
  0 siblings, 2 replies; 14+ messages in thread
From: Pierre Morel @ 2017-08-23 16:33 UTC (permalink / raw)
  To: virtualization; +Cc: cohuck, mst

Reseting a device can sometime fail, even a virtual device.
If the device is not reseted after a while the driver should
abandon the retries.
This is the change proposed for the modern virtio_pci.

More generally, when this happens,the virtio driver can set the
VIRTIO_CONFIG_S_FAILED status flag to advertise the caller.

The virtio core can test if the reset was succesful by testing
this flag after a reset.

This behavior is backward compatible with existing drivers.
This behavior seems to me compatible with Virtio-1.0 specifications,
Chapters 2.1 Device Status Field.
There I definitively need your opinion: Is it right?

This patch also lead to another question:
do we care if a device provided by the hypervisor is buggy?

Signed-off-by: Pierre Morel <pmorel@linux.vnet.ibm.com>
---
 drivers/virtio/virtio.c            |  4 ++++
 drivers/virtio/virtio_pci_modern.c | 11 ++++++++++-
 2 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/drivers/virtio/virtio.c b/drivers/virtio/virtio.c
index 48230a5..6255dc4 100644
--- a/drivers/virtio/virtio.c
+++ b/drivers/virtio/virtio.c
@@ -324,6 +324,8 @@ int register_virtio_device(struct virtio_device *dev)
 	/* We always start by resetting the device, in case a previous
 	 * driver messed it up.  This also tests that code path a little. */
 	dev->config->reset(dev);
+	if (dev->config->get_status(dev) & VIRTIO_CONFIG_S_FAILED)
+		return -EIO;
 
 	/* Acknowledge that we've seen the device. */
 	virtio_add_status(dev, VIRTIO_CONFIG_S_ACKNOWLEDGE);
@@ -373,6 +375,8 @@ int virtio_device_restore(struct virtio_device *dev)
 	/* We always start by resetting the device, in case a previous
 	 * driver messed it up. */
 	dev->config->reset(dev);
+	if (dev->config->get_status(dev) & VIRTIO_CONFIG_S_FAILED)
+		return -EIO;
 
 	/* Acknowledge that we've seen the device. */
 	virtio_add_status(dev, VIRTIO_CONFIG_S_ACKNOWLEDGE);
diff --git a/drivers/virtio/virtio_pci_modern.c b/drivers/virtio/virtio_pci_modern.c
index 2555d80..bfc5fc1 100644
--- a/drivers/virtio/virtio_pci_modern.c
+++ b/drivers/virtio/virtio_pci_modern.c
@@ -270,6 +270,7 @@ static void vp_set_status(struct virtio_device *vdev, u8 status)
 static void vp_reset(struct virtio_device *vdev)
 {
 	struct virtio_pci_device *vp_dev = to_vp_device(vdev);
+	int retry_count = 10;
 	/* 0 status means a reset. */
 	vp_iowrite8(0, &vp_dev->common->device_status);
 	/* After writing 0 to device_status, the driver MUST wait for a read of
@@ -277,8 +278,16 @@ static void vp_reset(struct virtio_device *vdev)
 	 * This will flush out the status write, and flush in device writes,
 	 * including MSI-X interrupts, if any.
 	 */
-	while (vp_ioread8(&vp_dev->common->device_status))
+	while (vp_ioread8(&vp_dev->common->device_status) && retry_count--)
 		msleep(1);
+	/* If the read did not return 0 before the timeout consider that
+	 * the device failed.
+	 */
+	if (retry_count <= 0) {
+		virtio_add_status(vdev, VIRTIO_CONFIG_S_FAILED);
+		return;
+	}
+	virtio_add_status(vdev, VIRTIO_CONFIG_S_ACKNOWLEDGE);
 	/* Flush pending VQ/configuration callbacks. */
 	vp_synchronize_vectors(vdev);
 }
-- 
2.3.0

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH] [RFC] virtio: Limit the retries on a virtio device reset
  2017-08-23 16:33 [PATCH] [RFC] virtio: Limit the retries on a virtio device reset Pierre Morel
@ 2017-08-24 11:07 ` Cornelia Huck
  2017-08-24 12:16   ` Pierre Morel
  2017-08-24 14:19 ` Michael S. Tsirkin
  1 sibling, 1 reply; 14+ messages in thread
From: Cornelia Huck @ 2017-08-24 11:07 UTC (permalink / raw)
  To: Pierre Morel; +Cc: mst, virtualization

On Wed, 23 Aug 2017 18:33:02 +0200
Pierre Morel <pmorel@linux.vnet.ibm.com> wrote:

> Reseting a device can sometime fail, even a virtual device.
> If the device is not reseted after a while the driver should
> abandon the retries.
> This is the change proposed for the modern virtio_pci.
> 
> More generally, when this happens,the virtio driver can set the
> VIRTIO_CONFIG_S_FAILED status flag to advertise the caller.
> 
> The virtio core can test if the reset was succesful by testing
> this flag after a reset.
> 
> This behavior is backward compatible with existing drivers.
> This behavior seems to me compatible with Virtio-1.0 specifications,
> Chapters 2.1 Device Status Field.
> There I definitively need your opinion: Is it right?

Will have to double check with the spec.

> 
> This patch also lead to another question:
> do we care if a device provided by the hypervisor is buggy?

Getting into a hang because of a broken device is not nice, but I'm not
sure we need to plan for this. Have you seen this in the wild?

> 
> Signed-off-by: Pierre Morel <pmorel@linux.vnet.ibm.com>
> ---
>  drivers/virtio/virtio.c            |  4 ++++
>  drivers/virtio/virtio_pci_modern.c | 11 ++++++++++-
>  2 files changed, 14 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/virtio/virtio.c b/drivers/virtio/virtio.c
> index 48230a5..6255dc4 100644
> --- a/drivers/virtio/virtio.c
> +++ b/drivers/virtio/virtio.c
> @@ -324,6 +324,8 @@ int register_virtio_device(struct virtio_device *dev)
>  	/* We always start by resetting the device, in case a previous
>  	 * driver messed it up.  This also tests that code path a little. */
>  	dev->config->reset(dev);
> +	if (dev->config->get_status(dev) & VIRTIO_CONFIG_S_FAILED)
> +		return -EIO;
>  
>  	/* Acknowledge that we've seen the device. */
>  	virtio_add_status(dev, VIRTIO_CONFIG_S_ACKNOWLEDGE);
> @@ -373,6 +375,8 @@ int virtio_device_restore(struct virtio_device *dev)
>  	/* We always start by resetting the device, in case a previous
>  	 * driver messed it up. */
>  	dev->config->reset(dev);
> +	if (dev->config->get_status(dev) & VIRTIO_CONFIG_S_FAILED)
> +		return -EIO;

virtio-ccw prior to rev 2 won't ever see this (as the read command did
not exist then), but this is not really a problem.

>  
>  	/* Acknowledge that we've seen the device. */
>  	virtio_add_status(dev, VIRTIO_CONFIG_S_ACKNOWLEDGE);
> diff --git a/drivers/virtio/virtio_pci_modern.c b/drivers/virtio/virtio_pci_modern.c
> index 2555d80..bfc5fc1 100644
> --- a/drivers/virtio/virtio_pci_modern.c
> +++ b/drivers/virtio/virtio_pci_modern.c
> @@ -270,6 +270,7 @@ static void vp_set_status(struct virtio_device *vdev, u8 status)
>  static void vp_reset(struct virtio_device *vdev)
>  {
>  	struct virtio_pci_device *vp_dev = to_vp_device(vdev);
> +	int retry_count = 10;

When you're touching this anyway, it would be a good time to add an
extra blank line :)

>  	/* 0 status means a reset. */
>  	vp_iowrite8(0, &vp_dev->common->device_status);
>  	/* After writing 0 to device_status, the driver MUST wait for a read of
> @@ -277,8 +278,16 @@ static void vp_reset(struct virtio_device *vdev)
>  	 * This will flush out the status write, and flush in device writes,
>  	 * including MSI-X interrupts, if any.
>  	 */
> -	while (vp_ioread8(&vp_dev->common->device_status))
> +	while (vp_ioread8(&vp_dev->common->device_status) && retry_count--)
>  		msleep(1);
> +	/* If the read did not return 0 before the timeout consider that
> +	 * the device failed.
> +	 */
> +	if (retry_count <= 0) {
> +		virtio_add_status(vdev, VIRTIO_CONFIG_S_FAILED);
> +		return;
> +	}
> +	virtio_add_status(vdev, VIRTIO_CONFIG_S_ACKNOWLEDGE);

Adding ACK here seems wrong?

>  	/* Flush pending VQ/configuration callbacks. */
>  	vp_synchronize_vectors(vdev);
>  }

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] [RFC] virtio: Limit the retries on a virtio device reset
  2017-08-24 11:07 ` Cornelia Huck
@ 2017-08-24 12:16   ` Pierre Morel
  2017-08-24 14:12     ` Michael S. Tsirkin
  0 siblings, 1 reply; 14+ messages in thread
From: Pierre Morel @ 2017-08-24 12:16 UTC (permalink / raw)
  To: Cornelia Huck; +Cc: mst, virtualization

On 24/08/2017 13:07, Cornelia Huck wrote:
> On Wed, 23 Aug 2017 18:33:02 +0200
> Pierre Morel <pmorel@linux.vnet.ibm.com> wrote:
> 
>> Reseting a device can sometime fail, even a virtual device.
>> If the device is not reseted after a while the driver should
>> abandon the retries.
>> This is the change proposed for the modern virtio_pci.
>>
>> More generally, when this happens,the virtio driver can set the
>> VIRTIO_CONFIG_S_FAILED status flag to advertise the caller.
>>
>> The virtio core can test if the reset was succesful by testing
>> this flag after a reset.
>>
>> This behavior is backward compatible with existing drivers.
>> This behavior seems to me compatible with Virtio-1.0 specifications,
>> Chapters 2.1 Device Status Field.
>> There I definitively need your opinion: Is it right?
> 
> Will have to double check with the spec.
> 
>>
>> This patch also lead to another question:
>> do we care if a device provided by the hypervisor is buggy?
> 
> Getting into a hang because of a broken device is not nice, but I'm not
> sure we need to plan for this. Have you seen this in the wild?

Yes, with virtio-pci on S390.


> 
>>
>> Signed-off-by: Pierre Morel <pmorel@linux.vnet.ibm.com>
>> ---
>>   drivers/virtio/virtio.c            |  4 ++++
>>   drivers/virtio/virtio_pci_modern.c | 11 ++++++++++-
>>   2 files changed, 14 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/virtio/virtio.c b/drivers/virtio/virtio.c
>> index 48230a5..6255dc4 100644
>> --- a/drivers/virtio/virtio.c
>> +++ b/drivers/virtio/virtio.c
>> @@ -324,6 +324,8 @@ int register_virtio_device(struct virtio_device *dev)
>>   	/* We always start by resetting the device, in case a previous
>>   	 * driver messed it up.  This also tests that code path a little. */
>>   	dev->config->reset(dev);
>> +	if (dev->config->get_status(dev) & VIRTIO_CONFIG_S_FAILED)
>> +		return -EIO;
>>   
>>   	/* Acknowledge that we've seen the device. */
>>   	virtio_add_status(dev, VIRTIO_CONFIG_S_ACKNOWLEDGE);
>> @@ -373,6 +375,8 @@ int virtio_device_restore(struct virtio_device *dev)
>>   	/* We always start by resetting the device, in case a previous
>>   	 * driver messed it up. */
>>   	dev->config->reset(dev);
>> +	if (dev->config->get_status(dev) & VIRTIO_CONFIG_S_FAILED)
>> +		return -EIO;
> 
> virtio-ccw prior to rev 2 won't ever see this (as the read command did
> not exist then), but this is not really a problem.
> 
>>   
>>   	/* Acknowledge that we've seen the device. */
>>   	virtio_add_status(dev, VIRTIO_CONFIG_S_ACKNOWLEDGE);
>> diff --git a/drivers/virtio/virtio_pci_modern.c b/drivers/virtio/virtio_pci_modern.c
>> index 2555d80..bfc5fc1 100644
>> --- a/drivers/virtio/virtio_pci_modern.c
>> +++ b/drivers/virtio/virtio_pci_modern.c
>> @@ -270,6 +270,7 @@ static void vp_set_status(struct virtio_device *vdev, u8 status)
>>   static void vp_reset(struct virtio_device *vdev)
>>   {
>>   	struct virtio_pci_device *vp_dev = to_vp_device(vdev);
>> +	int retry_count = 10;
> 
> When you're touching this anyway, it would be a good time to add an
> extra blank line :)

Yes, I like blank lines too.

> 
>>   	/* 0 status means a reset. */
>>   	vp_iowrite8(0, &vp_dev->common->device_status);
>>   	/* After writing 0 to device_status, the driver MUST wait for a read of
>> @@ -277,8 +278,16 @@ static void vp_reset(struct virtio_device *vdev)
>>   	 * This will flush out the status write, and flush in device writes,
>>   	 * including MSI-X interrupts, if any.
>>   	 */
>> -	while (vp_ioread8(&vp_dev->common->device_status))
>> +	while (vp_ioread8(&vp_dev->common->device_status) && retry_count--)
>>   		msleep(1);
>> +	/* If the read did not return 0 before the timeout consider that
>> +	 * the device failed.
>> +	 */
>> +	if (retry_count <= 0) {
>> +		virtio_add_status(vdev, VIRTIO_CONFIG_S_FAILED);
>> +		return;
>> +	}
>> +	virtio_add_status(vdev, VIRTIO_CONFIG_S_ACKNOWLEDGE);
> 
> Adding ACK here seems wrong?

Exact, I forgot to remove this from a previous test.
I wait a little and post a v2

Thanks for reviewing.

Pierre

> 
>>   	/* Flush pending VQ/configuration callbacks. */
>>   	vp_synchronize_vectors(vdev);
>>   }
> 


-- 
Pierre Morel
Linux/KVM/QEMU in Böblingen - Germany

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] [RFC] virtio: Limit the retries on a virtio device reset
  2017-08-24 12:16   ` Pierre Morel
@ 2017-08-24 14:12     ` Michael S. Tsirkin
  2017-08-24 17:07       ` Pierre Morel
  0 siblings, 1 reply; 14+ messages in thread
From: Michael S. Tsirkin @ 2017-08-24 14:12 UTC (permalink / raw)
  To: Pierre Morel; +Cc: Cornelia Huck, virtualization

On Thu, Aug 24, 2017 at 02:16:11PM +0200, Pierre Morel wrote:
> On 24/08/2017 13:07, Cornelia Huck wrote:
> > On Wed, 23 Aug 2017 18:33:02 +0200
> > Pierre Morel <pmorel@linux.vnet.ibm.com> wrote:
> > 
> > > Reseting a device can sometime fail, even a virtual device.
> > > If the device is not reseted after a while the driver should
> > > abandon the retries.
> > > This is the change proposed for the modern virtio_pci.
> > > 
> > > More generally, when this happens,the virtio driver can set the
> > > VIRTIO_CONFIG_S_FAILED status flag to advertise the caller.
> > > 
> > > The virtio core can test if the reset was succesful by testing
> > > this flag after a reset.
> > > 
> > > This behavior is backward compatible with existing drivers.
> > > This behavior seems to me compatible with Virtio-1.0 specifications,
> > > Chapters 2.1 Device Status Field.
> > > There I definitively need your opinion: Is it right?
> > 
> > Will have to double check with the spec.
> > 
> > > 
> > > This patch also lead to another question:
> > > do we care if a device provided by the hypervisor is buggy?
> > 
> > Getting into a hang because of a broken device is not nice, but I'm not
> > sure we need to plan for this. Have you seen this in the wild?
> 
> Yes, with virtio-pci on S390.

And what triggered this?
I don't think we can recover from a failed reset in all cases.

> 
> > 
> > > 
> > > Signed-off-by: Pierre Morel <pmorel@linux.vnet.ibm.com>
> > > ---
> > >   drivers/virtio/virtio.c            |  4 ++++
> > >   drivers/virtio/virtio_pci_modern.c | 11 ++++++++++-
> > >   2 files changed, 14 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/drivers/virtio/virtio.c b/drivers/virtio/virtio.c
> > > index 48230a5..6255dc4 100644
> > > --- a/drivers/virtio/virtio.c
> > > +++ b/drivers/virtio/virtio.c
> > > @@ -324,6 +324,8 @@ int register_virtio_device(struct virtio_device *dev)
> > >   	/* We always start by resetting the device, in case a previous
> > >   	 * driver messed it up.  This also tests that code path a little. */
> > >   	dev->config->reset(dev);
> > > +	if (dev->config->get_status(dev) & VIRTIO_CONFIG_S_FAILED)
> > > +		return -EIO;
> > >   	/* Acknowledge that we've seen the device. */
> > >   	virtio_add_status(dev, VIRTIO_CONFIG_S_ACKNOWLEDGE);
> > > @@ -373,6 +375,8 @@ int virtio_device_restore(struct virtio_device *dev)
> > >   	/* We always start by resetting the device, in case a previous
> > >   	 * driver messed it up. */
> > >   	dev->config->reset(dev);
> > > +	if (dev->config->get_status(dev) & VIRTIO_CONFIG_S_FAILED)
> > > +		return -EIO;
> > 
> > virtio-ccw prior to rev 2 won't ever see this (as the read command did
> > not exist then), but this is not really a problem.
> > 
> > >   	/* Acknowledge that we've seen the device. */
> > >   	virtio_add_status(dev, VIRTIO_CONFIG_S_ACKNOWLEDGE);
> > > diff --git a/drivers/virtio/virtio_pci_modern.c b/drivers/virtio/virtio_pci_modern.c
> > > index 2555d80..bfc5fc1 100644
> > > --- a/drivers/virtio/virtio_pci_modern.c
> > > +++ b/drivers/virtio/virtio_pci_modern.c
> > > @@ -270,6 +270,7 @@ static void vp_set_status(struct virtio_device *vdev, u8 status)
> > >   static void vp_reset(struct virtio_device *vdev)
> > >   {
> > >   	struct virtio_pci_device *vp_dev = to_vp_device(vdev);
> > > +	int retry_count = 10;
> > 
> > When you're touching this anyway, it would be a good time to add an
> > extra blank line :)
> 
> Yes, I like blank lines too.
> 
> > 
> > >   	/* 0 status means a reset. */
> > >   	vp_iowrite8(0, &vp_dev->common->device_status);
> > >   	/* After writing 0 to device_status, the driver MUST wait for a read of
> > > @@ -277,8 +278,16 @@ static void vp_reset(struct virtio_device *vdev)
> > >   	 * This will flush out the status write, and flush in device writes,
> > >   	 * including MSI-X interrupts, if any.
> > >   	 */
> > > -	while (vp_ioread8(&vp_dev->common->device_status))
> > > +	while (vp_ioread8(&vp_dev->common->device_status) && retry_count--)
> > >   		msleep(1);
> > > +	/* If the read did not return 0 before the timeout consider that
> > > +	 * the device failed.
> > > +	 */
> > > +	if (retry_count <= 0) {
> > > +		virtio_add_status(vdev, VIRTIO_CONFIG_S_FAILED);
> > > +		return;
> > > +	}

I'm not sure what's the right approach by I don't really like this one:
- an arbitrary number of retries looks wrong. why 10?
- doing this on probe might be reasonable but any other reset
  is expected to actually reset the device
- we'll have to spread these tests all over the place.
  Allowing reset to fail would be better. 


> > > +	virtio_add_status(vdev, VIRTIO_CONFIG_S_ACKNOWLEDGE);
> > 
> > Adding ACK here seems wrong?
> 
> Exact, I forgot to remove this from a previous test.
> I wait a little and post a v2
> 
> Thanks for reviewing.
> 
> Pierre
> 
> > 
> > >   	/* Flush pending VQ/configuration callbacks. */
> > >   	vp_synchronize_vectors(vdev);
> > >   }
> > 
> 
> 
> -- 
> Pierre Morel
> Linux/KVM/QEMU in Böblingen - Germany

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] [RFC] virtio: Limit the retries on a virtio device reset
  2017-08-23 16:33 [PATCH] [RFC] virtio: Limit the retries on a virtio device reset Pierre Morel
  2017-08-24 11:07 ` Cornelia Huck
@ 2017-08-24 14:19 ` Michael S. Tsirkin
  2017-08-24 17:42   ` Pierre Morel
  1 sibling, 1 reply; 14+ messages in thread
From: Michael S. Tsirkin @ 2017-08-24 14:19 UTC (permalink / raw)
  To: Pierre Morel; +Cc: cohuck, virtualization

On Wed, Aug 23, 2017 at 06:33:02PM +0200, Pierre Morel wrote:
> Reseting a device can sometime fail, even a virtual device.
> If the device is not reseted after a while the driver should
> abandon the retries.
> This is the change proposed for the modern virtio_pci.
> 
> More generally, when this happens,the virtio driver can set the
> VIRTIO_CONFIG_S_FAILED status flag to advertise the caller.
> 
> The virtio core can test if the reset was succesful by testing
> this flag after a reset.
> 
> This behavior is backward compatible with existing drivers.
> This behavior seems to me compatible with Virtio-1.0 specifications,
> Chapters 2.1 Device Status Field.
> There I definitively need your opinion: Is it right?
> 
> This patch also lead to another question:
> do we care if a device provided by the hypervisor is buggy?
> 
> Signed-off-by: Pierre Morel <pmorel@linux.vnet.ibm.com>

So I think this is not the best place to start to add error recovery.
It should be much more common to have a situation where device gets
broken while it's being used.  Spec has a NEEDS_RESET flag for this.

I think we should start by coding up that support in all virtio drivers.

As a next step, we can add more code to detect unexpected behaviour by
the host and mark device as broken. Then we can do more things by
looking at the broken flag.


> ---
>  drivers/virtio/virtio.c            |  4 ++++
>  drivers/virtio/virtio_pci_modern.c | 11 ++++++++++-
>  2 files changed, 14 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/virtio/virtio.c b/drivers/virtio/virtio.c
> index 48230a5..6255dc4 100644
> --- a/drivers/virtio/virtio.c
> +++ b/drivers/virtio/virtio.c
> @@ -324,6 +324,8 @@ int register_virtio_device(struct virtio_device *dev)
>  	/* We always start by resetting the device, in case a previous
>  	 * driver messed it up.  This also tests that code path a little. */
>  	dev->config->reset(dev);
> +	if (dev->config->get_status(dev) & VIRTIO_CONFIG_S_FAILED)
> +		return -EIO;
>  
>  	/* Acknowledge that we've seen the device. */
>  	virtio_add_status(dev, VIRTIO_CONFIG_S_ACKNOWLEDGE);
> @@ -373,6 +375,8 @@ int virtio_device_restore(struct virtio_device *dev)
>  	/* We always start by resetting the device, in case a previous
>  	 * driver messed it up. */
>  	dev->config->reset(dev);
> +	if (dev->config->get_status(dev) & VIRTIO_CONFIG_S_FAILED)
> +		return -EIO;
>  
>  	/* Acknowledge that we've seen the device. */
>  	virtio_add_status(dev, VIRTIO_CONFIG_S_ACKNOWLEDGE);
> diff --git a/drivers/virtio/virtio_pci_modern.c b/drivers/virtio/virtio_pci_modern.c
> index 2555d80..bfc5fc1 100644
> --- a/drivers/virtio/virtio_pci_modern.c
> +++ b/drivers/virtio/virtio_pci_modern.c
> @@ -270,6 +270,7 @@ static void vp_set_status(struct virtio_device *vdev, u8 status)
>  static void vp_reset(struct virtio_device *vdev)
>  {
>  	struct virtio_pci_device *vp_dev = to_vp_device(vdev);
> +	int retry_count = 10;
>  	/* 0 status means a reset. */
>  	vp_iowrite8(0, &vp_dev->common->device_status);
>  	/* After writing 0 to device_status, the driver MUST wait for a read of
> @@ -277,8 +278,16 @@ static void vp_reset(struct virtio_device *vdev)
>  	 * This will flush out the status write, and flush in device writes,
>  	 * including MSI-X interrupts, if any.
>  	 */
> -	while (vp_ioread8(&vp_dev->common->device_status))
> +	while (vp_ioread8(&vp_dev->common->device_status) && retry_count--)
>  		msleep(1);
> +	/* If the read did not return 0 before the timeout consider that
> +	 * the device failed.
> +	 */
> +	if (retry_count <= 0) {
> +		virtio_add_status(vdev, VIRTIO_CONFIG_S_FAILED);
> +		return;
> +	}
> +	virtio_add_status(vdev, VIRTIO_CONFIG_S_ACKNOWLEDGE);
>  	/* Flush pending VQ/configuration callbacks. */
>  	vp_synchronize_vectors(vdev);
>  }
> -- 
> 2.3.0

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] [RFC] virtio: Limit the retries on a virtio device reset
  2017-08-24 14:12     ` Michael S. Tsirkin
@ 2017-08-24 17:07       ` Pierre Morel
  2017-08-24 21:16         ` Michael S. Tsirkin
  0 siblings, 1 reply; 14+ messages in thread
From: Pierre Morel @ 2017-08-24 17:07 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Cornelia Huck, virtualization

On 24/08/2017 16:12, Michael S. Tsirkin wrote:
> On Thu, Aug 24, 2017 at 02:16:11PM +0200, Pierre Morel wrote:
>> On 24/08/2017 13:07, Cornelia Huck wrote:
>>> On Wed, 23 Aug 2017 18:33:02 +0200
>>> Pierre Morel <pmorel@linux.vnet.ibm.com> wrote:
>>>
>>>> Reseting a device can sometime fail, even a virtual device.
>>>> If the device is not reseted after a while the driver should
>>>> abandon the retries.
>>>> This is the change proposed for the modern virtio_pci.
>>>>
>>>> More generally, when this happens,the virtio driver can set the
>>>> VIRTIO_CONFIG_S_FAILED status flag to advertise the caller.
>>>>
>>>> The virtio core can test if the reset was succesful by testing
>>>> this flag after a reset.
>>>>
>>>> This behavior is backward compatible with existing drivers.
>>>> This behavior seems to me compatible with Virtio-1.0 specifications,
>>>> Chapters 2.1 Device Status Field.
>>>> There I definitively need your opinion: Is it right?
>>>
>>> Will have to double check with the spec.
>>>
>>>>
>>>> This patch also lead to another question:
>>>> do we care if a device provided by the hypervisor is buggy?
>>>
>>> Getting into a hang because of a broken device is not nice, but I'm not
>>> sure we need to plan for this. Have you seen this in the wild?
>>
>> Yes, with virtio-pci on S390.
> 
> And what triggered this?

Buggy zPCI QEMU device we are currently put right

> I don't think we can recover from a failed reset in all cases.

I do not think so too.
The device must be abandoned.
Too dangerous to be used.

Normaly the hypervisor should not be buggy. But... nobdy's perfect

> 
>>
>>>
>>>>
>>>> Signed-off-by: Pierre Morel <pmorel@linux.vnet.ibm.com>
>>>> ---
>>>>    drivers/virtio/virtio.c            |  4 ++++
>>>>    drivers/virtio/virtio_pci_modern.c | 11 ++++++++++-
>>>>    2 files changed, 14 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/virtio/virtio.c b/drivers/virtio/virtio.c
>>>> index 48230a5..6255dc4 100644
>>>> --- a/drivers/virtio/virtio.c
>>>> +++ b/drivers/virtio/virtio.c
>>>> @@ -324,6 +324,8 @@ int register_virtio_device(struct virtio_device *dev)
>>>>    	/* We always start by resetting the device, in case a previous
>>>>    	 * driver messed it up.  This also tests that code path a little. */
>>>>    	dev->config->reset(dev);
>>>> +	if (dev->config->get_status(dev) & VIRTIO_CONFIG_S_FAILED)
>>>> +		return -EIO;
>>>>    	/* Acknowledge that we've seen the device. */
>>>>    	virtio_add_status(dev, VIRTIO_CONFIG_S_ACKNOWLEDGE);
>>>> @@ -373,6 +375,8 @@ int virtio_device_restore(struct virtio_device *dev)
>>>>    	/* We always start by resetting the device, in case a previous
>>>>    	 * driver messed it up. */
>>>>    	dev->config->reset(dev);
>>>> +	if (dev->config->get_status(dev) & VIRTIO_CONFIG_S_FAILED)
>>>> +		return -EIO;
>>>
>>> virtio-ccw prior to rev 2 won't ever see this (as the read command did
>>> not exist then), but this is not really a problem.
>>>
>>>>    	/* Acknowledge that we've seen the device. */
>>>>    	virtio_add_status(dev, VIRTIO_CONFIG_S_ACKNOWLEDGE);
>>>> diff --git a/drivers/virtio/virtio_pci_modern.c b/drivers/virtio/virtio_pci_modern.c
>>>> index 2555d80..bfc5fc1 100644
>>>> --- a/drivers/virtio/virtio_pci_modern.c
>>>> +++ b/drivers/virtio/virtio_pci_modern.c
>>>> @@ -270,6 +270,7 @@ static void vp_set_status(struct virtio_device *vdev, u8 status)
>>>>    static void vp_reset(struct virtio_device *vdev)
>>>>    {
>>>>    	struct virtio_pci_device *vp_dev = to_vp_device(vdev);
>>>> +	int retry_count = 10;
>>>
>>> When you're touching this anyway, it would be a good time to add an
>>> extra blank line :)
>>
>> Yes, I like blank lines too.
>>
>>>
>>>>    	/* 0 status means a reset. */
>>>>    	vp_iowrite8(0, &vp_dev->common->device_status);
>>>>    	/* After writing 0 to device_status, the driver MUST wait for a read of
>>>> @@ -277,8 +278,16 @@ static void vp_reset(struct virtio_device *vdev)
>>>>    	 * This will flush out the status write, and flush in device writes,
>>>>    	 * including MSI-X interrupts, if any.
>>>>    	 */
>>>> -	while (vp_ioread8(&vp_dev->common->device_status))
>>>> +	while (vp_ioread8(&vp_dev->common->device_status) && retry_count--)
>>>>    		msleep(1);
>>>> +	/* If the read did not return 0 before the timeout consider that
>>>> +	 * the device failed.
>>>> +	 */
>>>> +	if (retry_count <= 0) {
>>>> +		virtio_add_status(vdev, VIRTIO_CONFIG_S_FAILED);
>>>> +		return;
>>>> +	}
> 
> I'm not sure what's the right approach by I don't really like this one:
> - an arbitrary number of retries looks wrong. why 10?

I fear that at this moment we can not rely on a lot of information on 
the device. An arbitrary value may not be so bad.

But I agree 10 can be discussed :). It is just a convenient value for 
testing.
Something leading to a waiting time of around some seconds would be more 
appropriate I think.

> - doing this on probe might be reasonable but any other reset
>    is expected to actually reset the device

We are handling virtual devices.
If we consider that if one reset works the next reset will take the same 
path and work, we do not have to.
But... not completely sure, bugs can hide everywhere.

> - we'll have to spread these tests all over the place.

I counted 19 places where to check if the reset went OK.

None of them touch the device anymore after reset and just free driver's 
resources.

So that if reset failed, nothing goes wrong, no device access, but the 
probability that the next probe fail is high. (If it ever succeed).

>    Allowing reset to fail would be better.

May be I did not understand what you mean.
Testing the flag or a return value is as expensive.

Of course the implementation is a mater of taste.

I notice two other things to do:

- May be adding a warning would be fine too.
- Virtio_ccw may add a fail flag when allocation of CCW failed.
   I did not find anything to do for virtio_mmio or legacy virtio_pci.

Regards,

Pierre

> 
> 
>>>> +	virtio_add_status(vdev, VIRTIO_CONFIG_S_ACKNOWLEDGE);
>>>
>>> Adding ACK here seems wrong?
>>
>> Exact, I forgot to remove this from a previous test.
>> I wait a little and post a v2
>>
>> Thanks for reviewing.
>>
>> Pierre
>>
>>>
>>>>    	/* Flush pending VQ/configuration callbacks. */
>>>>    	vp_synchronize_vectors(vdev);
>>>>    }
>>>
>>
>>
>> -- 
>> Pierre Morel
>> Linux/KVM/QEMU in Böblingen - Germany
> 


-- 
Pierre Morel
Linux/KVM/QEMU in Böblingen - Germany

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] [RFC] virtio: Limit the retries on a virtio device reset
  2017-08-24 14:19 ` Michael S. Tsirkin
@ 2017-08-24 17:42   ` Pierre Morel
  2017-08-24 21:23     ` Michael S. Tsirkin
  0 siblings, 1 reply; 14+ messages in thread
From: Pierre Morel @ 2017-08-24 17:42 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: cohuck, virtualization

On 24/08/2017 16:19, Michael S. Tsirkin wrote:
> On Wed, Aug 23, 2017 at 06:33:02PM +0200, Pierre Morel wrote:
>> Reseting a device can sometime fail, even a virtual device.
>> If the device is not reseted after a while the driver should
>> abandon the retries.
>> This is the change proposed for the modern virtio_pci.
>>
>> More generally, when this happens,the virtio driver can set the
>> VIRTIO_CONFIG_S_FAILED status flag to advertise the caller.
>>
>> The virtio core can test if the reset was succesful by testing
>> this flag after a reset.
>>
>> This behavior is backward compatible with existing drivers.
>> This behavior seems to me compatible with Virtio-1.0 specifications,
>> Chapters 2.1 Device Status Field.
>> There I definitively need your opinion: Is it right?
>>
>> This patch also lead to another question:
>> do we care if a device provided by the hypervisor is buggy?
>>
>> Signed-off-by: Pierre Morel <pmorel@linux.vnet.ibm.com>
> 
> So I think this is not the best place to start to add error recovery.

I agree, there can not be any error recovery there.
If reset does not work we can let fall the device until next reset of 
the hypervisor.

> It should be much more common to have a situation where device gets
> broken while it's being used.  Spec has a NEEDS_RESET flag for this.

Yes the device side can set this flag, but it is another problem, it is 
supposing that:
- the transport, device side, still works.
- it is able to detect that the device need a reset
- a reset is effective

> 
> I think we should start by coding up that support in all virtio drivers.
> 
> As a next step, we can add more code to detect unexpected behaviour by
> the host and mark device as broken. Then we can do more things by
> looking at the broken flag.

It seems difficult to me.
But may be I went too fast to the conclusion that there is nothing to do.
I still think about it.

Best regards

Pierre

> 
> 
>> ---
>>   drivers/virtio/virtio.c            |  4 ++++
>>   drivers/virtio/virtio_pci_modern.c | 11 ++++++++++-
>>   2 files changed, 14 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/virtio/virtio.c b/drivers/virtio/virtio.c
>> index 48230a5..6255dc4 100644
>> --- a/drivers/virtio/virtio.c
>> +++ b/drivers/virtio/virtio.c
>> @@ -324,6 +324,8 @@ int register_virtio_device(struct virtio_device *dev)
>>   	/* We always start by resetting the device, in case a previous
>>   	 * driver messed it up.  This also tests that code path a little. */
>>   	dev->config->reset(dev);
>> +	if (dev->config->get_status(dev) & VIRTIO_CONFIG_S_FAILED)
>> +		return -EIO;
>>   
>>   	/* Acknowledge that we've seen the device. */
>>   	virtio_add_status(dev, VIRTIO_CONFIG_S_ACKNOWLEDGE);
>> @@ -373,6 +375,8 @@ int virtio_device_restore(struct virtio_device *dev)
>>   	/* We always start by resetting the device, in case a previous
>>   	 * driver messed it up. */
>>   	dev->config->reset(dev);
>> +	if (dev->config->get_status(dev) & VIRTIO_CONFIG_S_FAILED)
>> +		return -EIO;
>>   
>>   	/* Acknowledge that we've seen the device. */
>>   	virtio_add_status(dev, VIRTIO_CONFIG_S_ACKNOWLEDGE);
>> diff --git a/drivers/virtio/virtio_pci_modern.c b/drivers/virtio/virtio_pci_modern.c
>> index 2555d80..bfc5fc1 100644
>> --- a/drivers/virtio/virtio_pci_modern.c
>> +++ b/drivers/virtio/virtio_pci_modern.c
>> @@ -270,6 +270,7 @@ static void vp_set_status(struct virtio_device *vdev, u8 status)
>>   static void vp_reset(struct virtio_device *vdev)
>>   {
>>   	struct virtio_pci_device *vp_dev = to_vp_device(vdev);
>> +	int retry_count = 10;
>>   	/* 0 status means a reset. */
>>   	vp_iowrite8(0, &vp_dev->common->device_status);
>>   	/* After writing 0 to device_status, the driver MUST wait for a read of
>> @@ -277,8 +278,16 @@ static void vp_reset(struct virtio_device *vdev)
>>   	 * This will flush out the status write, and flush in device writes,
>>   	 * including MSI-X interrupts, if any.
>>   	 */
>> -	while (vp_ioread8(&vp_dev->common->device_status))
>> +	while (vp_ioread8(&vp_dev->common->device_status) && retry_count--)
>>   		msleep(1);
>> +	/* If the read did not return 0 before the timeout consider that
>> +	 * the device failed.
>> +	 */
>> +	if (retry_count <= 0) {
>> +		virtio_add_status(vdev, VIRTIO_CONFIG_S_FAILED);
>> +		return;
>> +	}
>> +	virtio_add_status(vdev, VIRTIO_CONFIG_S_ACKNOWLEDGE);
>>   	/* Flush pending VQ/configuration callbacks. */
>>   	vp_synchronize_vectors(vdev);
>>   }
>> -- 
>> 2.3.0
> 


-- 
Pierre Morel
Linux/KVM/QEMU in Böblingen - Germany

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] [RFC] virtio: Limit the retries on a virtio device reset
  2017-08-24 17:07       ` Pierre Morel
@ 2017-08-24 21:16         ` Michael S. Tsirkin
  2017-08-25  8:26           ` Cornelia Huck
  0 siblings, 1 reply; 14+ messages in thread
From: Michael S. Tsirkin @ 2017-08-24 21:16 UTC (permalink / raw)
  To: Pierre Morel; +Cc: Cornelia Huck, virtualization

On Thu, Aug 24, 2017 at 07:07:42PM +0200, Pierre Morel wrote:
> > - we'll have to spread these tests all over the place.
> 
> I counted 19 places where to check if the reset went OK.
> 
> None of them touch the device anymore after reset and just free driver's
> resources.

... and then hypervisor uses the resources after free. Not good.

> So that if reset failed, nothing goes wrong, no device access, but the
> probability that the next probe fail is high. (If it ever succeed).
> 
> >    Allowing reset to fail would be better.
> 
> May be I did not understand what you mean.
> Testing the flag or a return value is as expensive.
> 
> Of course the implementation is a mater of taste.

If a function can fail it should return an error, not just set a flag.


> I notice two other things to do:
> 
> - May be adding a warning would be fine too.
> - Virtio_ccw may add a fail flag when allocation of CCW failed.
>   I did not find anything to do for virtio_mmio or legacy virtio_pci.
> 
> Regards,
> 
> Pierre

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] [RFC] virtio: Limit the retries on a virtio device reset
  2017-08-24 17:42   ` Pierre Morel
@ 2017-08-24 21:23     ` Michael S. Tsirkin
  2017-08-25  8:33       ` Pierre Morel
  0 siblings, 1 reply; 14+ messages in thread
From: Michael S. Tsirkin @ 2017-08-24 21:23 UTC (permalink / raw)
  To: Pierre Morel; +Cc: cohuck, virtualization

On Thu, Aug 24, 2017 at 07:42:07PM +0200, Pierre Morel wrote:
> On 24/08/2017 16:19, Michael S. Tsirkin wrote:
> > On Wed, Aug 23, 2017 at 06:33:02PM +0200, Pierre Morel wrote:
> > > Reseting a device can sometime fail, even a virtual device.
> > > If the device is not reseted after a while the driver should
> > > abandon the retries.
> > > This is the change proposed for the modern virtio_pci.
> > > 
> > > More generally, when this happens,the virtio driver can set the
> > > VIRTIO_CONFIG_S_FAILED status flag to advertise the caller.
> > > 
> > > The virtio core can test if the reset was succesful by testing
> > > this flag after a reset.
> > > 
> > > This behavior is backward compatible with existing drivers.
> > > This behavior seems to me compatible with Virtio-1.0 specifications,
> > > Chapters 2.1 Device Status Field.
> > > There I definitively need your opinion: Is it right?
> > > 
> > > This patch also lead to another question:
> > > do we care if a device provided by the hypervisor is buggy?
> > > 
> > > Signed-off-by: Pierre Morel <pmorel@linux.vnet.ibm.com>
> > 
> > So I think this is not the best place to start to add error recovery.
> 
> I agree, there can not be any error recovery there.
> If reset does not work we can let fall the device until next reset of the
> hypervisor.

On probe, yes. But failures are more likely to trigger at other times.

> > It should be much more common to have a situation where device gets
> > broken while it's being used.  Spec has a NEEDS_RESET flag for this.
> 
> Yes the device side can set this flag, but it is another problem, it is
> supposing that:
> - the transport, device side, still works.
> - it is able to detect that the device need a reset
> - a reset is effective

Right. OTOH in this case there's more we can do.


> > 
> > I think we should start by coding up that support in all virtio drivers.
> > 
> > As a next step, we can add more code to detect unexpected behaviour by
> > the host and mark device as broken. Then we can do more things by
> > looking at the broken flag.
> 
> It seems difficult to me.
> But may be I went too fast to the conclusion that there is nothing to do.
> I still think about it.
> 
> Best regards
> 
> Pierre
> 
> > 
> > 
> > > ---
> > >   drivers/virtio/virtio.c            |  4 ++++
> > >   drivers/virtio/virtio_pci_modern.c | 11 ++++++++++-
> > >   2 files changed, 14 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/drivers/virtio/virtio.c b/drivers/virtio/virtio.c
> > > index 48230a5..6255dc4 100644
> > > --- a/drivers/virtio/virtio.c
> > > +++ b/drivers/virtio/virtio.c
> > > @@ -324,6 +324,8 @@ int register_virtio_device(struct virtio_device *dev)
> > >   	/* We always start by resetting the device, in case a previous
> > >   	 * driver messed it up.  This also tests that code path a little. */
> > >   	dev->config->reset(dev);
> > > +	if (dev->config->get_status(dev) & VIRTIO_CONFIG_S_FAILED)
> > > +		return -EIO;
> > >   	/* Acknowledge that we've seen the device. */
> > >   	virtio_add_status(dev, VIRTIO_CONFIG_S_ACKNOWLEDGE);
> > > @@ -373,6 +375,8 @@ int virtio_device_restore(struct virtio_device *dev)
> > >   	/* We always start by resetting the device, in case a previous
> > >   	 * driver messed it up. */
> > >   	dev->config->reset(dev);
> > > +	if (dev->config->get_status(dev) & VIRTIO_CONFIG_S_FAILED)
> > > +		return -EIO;
> > >   	/* Acknowledge that we've seen the device. */
> > >   	virtio_add_status(dev, VIRTIO_CONFIG_S_ACKNOWLEDGE);
> > > diff --git a/drivers/virtio/virtio_pci_modern.c b/drivers/virtio/virtio_pci_modern.c
> > > index 2555d80..bfc5fc1 100644
> > > --- a/drivers/virtio/virtio_pci_modern.c
> > > +++ b/drivers/virtio/virtio_pci_modern.c
> > > @@ -270,6 +270,7 @@ static void vp_set_status(struct virtio_device *vdev, u8 status)
> > >   static void vp_reset(struct virtio_device *vdev)
> > >   {
> > >   	struct virtio_pci_device *vp_dev = to_vp_device(vdev);
> > > +	int retry_count = 10;
> > >   	/* 0 status means a reset. */
> > >   	vp_iowrite8(0, &vp_dev->common->device_status);
> > >   	/* After writing 0 to device_status, the driver MUST wait for a read of
> > > @@ -277,8 +278,16 @@ static void vp_reset(struct virtio_device *vdev)
> > >   	 * This will flush out the status write, and flush in device writes,
> > >   	 * including MSI-X interrupts, if any.
> > >   	 */
> > > -	while (vp_ioread8(&vp_dev->common->device_status))
> > > +	while (vp_ioread8(&vp_dev->common->device_status) && retry_count--)
> > >   		msleep(1);
> > > +	/* If the read did not return 0 before the timeout consider that
> > > +	 * the device failed.
> > > +	 */
> > > +	if (retry_count <= 0) {
> > > +		virtio_add_status(vdev, VIRTIO_CONFIG_S_FAILED);
> > > +		return;
> > > +	}
> > > +	virtio_add_status(vdev, VIRTIO_CONFIG_S_ACKNOWLEDGE);
> > >   	/* Flush pending VQ/configuration callbacks. */
> > >   	vp_synchronize_vectors(vdev);
> > >   }
> > > -- 
> > > 2.3.0
> > 
> 
> 
> -- 
> Pierre Morel
> Linux/KVM/QEMU in Böblingen - Germany

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] [RFC] virtio: Limit the retries on a virtio device reset
  2017-08-24 21:16         ` Michael S. Tsirkin
@ 2017-08-25  8:26           ` Cornelia Huck
  2017-08-25 11:21             ` Pierre Morel
  2017-08-25 16:43             ` Michael S. Tsirkin
  0 siblings, 2 replies; 14+ messages in thread
From: Cornelia Huck @ 2017-08-25  8:26 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Pierre Morel, virtualization

On Fri, 25 Aug 2017 00:16:05 +0300
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Thu, Aug 24, 2017 at 07:07:42PM +0200, Pierre Morel wrote:
> > > - we'll have to spread these tests all over the place.  
> > 
> > I counted 19 places where to check if the reset went OK.
> > 
> > None of them touch the device anymore after reset and just free driver's
> > resources.  
> 
> ... and then hypervisor uses the resources after free. Not good.

The only place where we can simply give up on the device and be sure
that nothing bad happens is during initial setup. In the other places,
it seems we have the choice between looping (as now) or panic.

> 
> > So that if reset failed, nothing goes wrong, no device access, but the
> > probability that the next probe fail is high. (If it ever succeed).
> >   
> > >    Allowing reset to fail would be better.  
> > 
> > May be I did not understand what you mean.
> > Testing the flag or a return value is as expensive.
> > 
> > Of course the implementation is a mater of taste.  
> 
> If a function can fail it should return an error, not just set a flag.

ccw reset can (a) succeed, (b) fail (by returning an error status via
standard channel subsystem mechanisms), or (c) run into a timeout
(where the driver should recover via csch and friends). [1] In any
case, we need to able to rely on the channel subsystem to make sure a
broken device is dead.

pci-modern reset can (a) succeed, or (b) be in an indeterminate state
(it has not yet completed, but we don't know whether it will complete
in the future). It may be reasonable to just give up if (b) happens
during initial setup.

I'm not sure that allowing to fail reset will help much, other than
allowing to give up on a pci device early.

> 
> 
> > I notice two other things to do:
> > 
> > - May be adding a warning would be fine too.
> > - Virtio_ccw may add a fail flag when allocation of CCW failed.
> >   I did not find anything to do for virtio_mmio or legacy virtio_pci.
> > 
> > Regards,
> > 
> > Pierre  

[1] At that point, we either succeed with csch and can either retry or
fail, or fail with device gone, in which case the device is, well,
gone, which we also should be able to handle fine.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] [RFC] virtio: Limit the retries on a virtio device reset
  2017-08-24 21:23     ` Michael S. Tsirkin
@ 2017-08-25  8:33       ` Pierre Morel
  2017-08-25 16:46         ` Michael S. Tsirkin
  0 siblings, 1 reply; 14+ messages in thread
From: Pierre Morel @ 2017-08-25  8:33 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: cohuck, virtualization

On 24/08/2017 23:23, Michael S. Tsirkin wrote:
> On Thu, Aug 24, 2017 at 07:42:07PM +0200, Pierre Morel wrote:
>> On 24/08/2017 16:19, Michael S. Tsirkin wrote:
>>> On Wed, Aug 23, 2017 at 06:33:02PM +0200, Pierre Morel wrote:
>>>> Reseting a device can sometime fail, even a virtual device.
>>>> If the device is not reseted after a while the driver should
>>>> abandon the retries.
>>>> This is the change proposed for the modern virtio_pci.
>>>>
>>>> More generally, when this happens,the virtio driver can set the
>>>> VIRTIO_CONFIG_S_FAILED status flag to advertise the caller.
>>>>
>>>> The virtio core can test if the reset was succesful by testing
>>>> this flag after a reset.
>>>>
>>>> This behavior is backward compatible with existing drivers.
>>>> This behavior seems to me compatible with Virtio-1.0 specifications,
>>>> Chapters 2.1 Device Status Field.
>>>> There I definitively need your opinion: Is it right?
>>>>
>>>> This patch also lead to another question:
>>>> do we care if a device provided by the hypervisor is buggy?
>>>>
>>>> Signed-off-by: Pierre Morel <pmorel@linux.vnet.ibm.com>
>>>
>>> So I think this is not the best place to start to add error recovery.
>>
>> I agree, there can not be any error recovery there.
>> If reset does not work we can let fall the device until next reset of the
>> hypervisor.
> 
> On probe, yes. But failures are more likely to trigger at other times.

OK, what about:
- On probe if reset fail, the probe fail.

- On freeze and remove : we can not free resources which are common
	with the device, at least the queues.
	... we can only signal the error and give up with the device.

> 
>>> It should be much more common to have a situation where device gets
>>> broken while it's being used.  Spec has a NEEDS_RESET flag for this.
>>
>> Yes the device side can set this flag, but it is another problem, it is
>> supposing that:
>> - the transport, device side, still works.
>> - it is able to detect that the device need a reset
>> - a reset is effective
> 
> Right. OTOH in this case there's more we can do.

Yes, I did not find a single test of this flag (NEEDS_RESET).
even QEMU set it quite often (though virtio_error())

The decision to reset the device must come from the driver.
The protocol to reset the device is device/driver specific... lotta work

Shouldn't it be separate from the "reset failed" problem?


Regards,

Pierre


> 
> 
>>>
>>> I think we should start by coding up that support in all virtio drivers.
>>>
>>> As a next step, we can add more code to detect unexpected behaviour by
>>> the host and mark device as broken. Then we can do more things by
>>> looking at the broken flag.
>>
>> It seems difficult to me.
>> But may be I went too fast to the conclusion that there is nothing to do.
>> I still think about it.
>>
>> Best regards
>>
>> Pierre
>>
>>>
>>>
>>>> ---
>>>>    drivers/virtio/virtio.c            |  4 ++++
>>>>    drivers/virtio/virtio_pci_modern.c | 11 ++++++++++-
>>>>    2 files changed, 14 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/virtio/virtio.c b/drivers/virtio/virtio.c
>>>> index 48230a5..6255dc4 100644
>>>> --- a/drivers/virtio/virtio.c
>>>> +++ b/drivers/virtio/virtio.c
>>>> @@ -324,6 +324,8 @@ int register_virtio_device(struct virtio_device *dev)
>>>>    	/* We always start by resetting the device, in case a previous
>>>>    	 * driver messed it up.  This also tests that code path a little. */
>>>>    	dev->config->reset(dev);
>>>> +	if (dev->config->get_status(dev) & VIRTIO_CONFIG_S_FAILED)
>>>> +		return -EIO;
>>>>    	/* Acknowledge that we've seen the device. */
>>>>    	virtio_add_status(dev, VIRTIO_CONFIG_S_ACKNOWLEDGE);
>>>> @@ -373,6 +375,8 @@ int virtio_device_restore(struct virtio_device *dev)
>>>>    	/* We always start by resetting the device, in case a previous
>>>>    	 * driver messed it up. */
>>>>    	dev->config->reset(dev);
>>>> +	if (dev->config->get_status(dev) & VIRTIO_CONFIG_S_FAILED)
>>>> +		return -EIO;
>>>>    	/* Acknowledge that we've seen the device. */
>>>>    	virtio_add_status(dev, VIRTIO_CONFIG_S_ACKNOWLEDGE);
>>>> diff --git a/drivers/virtio/virtio_pci_modern.c b/drivers/virtio/virtio_pci_modern.c
>>>> index 2555d80..bfc5fc1 100644
>>>> --- a/drivers/virtio/virtio_pci_modern.c
>>>> +++ b/drivers/virtio/virtio_pci_modern.c
>>>> @@ -270,6 +270,7 @@ static void vp_set_status(struct virtio_device *vdev, u8 status)
>>>>    static void vp_reset(struct virtio_device *vdev)
>>>>    {
>>>>    	struct virtio_pci_device *vp_dev = to_vp_device(vdev);
>>>> +	int retry_count = 10;
>>>>    	/* 0 status means a reset. */
>>>>    	vp_iowrite8(0, &vp_dev->common->device_status);
>>>>    	/* After writing 0 to device_status, the driver MUST wait for a read of
>>>> @@ -277,8 +278,16 @@ static void vp_reset(struct virtio_device *vdev)
>>>>    	 * This will flush out the status write, and flush in device writes,
>>>>    	 * including MSI-X interrupts, if any.
>>>>    	 */
>>>> -	while (vp_ioread8(&vp_dev->common->device_status))
>>>> +	while (vp_ioread8(&vp_dev->common->device_status) && retry_count--)
>>>>    		msleep(1);
>>>> +	/* If the read did not return 0 before the timeout consider that
>>>> +	 * the device failed.
>>>> +	 */
>>>> +	if (retry_count <= 0) {
>>>> +		virtio_add_status(vdev, VIRTIO_CONFIG_S_FAILED);
>>>> +		return;
>>>> +	}
>>>> +	virtio_add_status(vdev, VIRTIO_CONFIG_S_ACKNOWLEDGE);
>>>>    	/* Flush pending VQ/configuration callbacks. */
>>>>    	vp_synchronize_vectors(vdev);
>>>>    }
>>>> -- 
>>>> 2.3.0
>>>
>>
>>
>> -- 
>> Pierre Morel
>> Linux/KVM/QEMU in Böblingen - Germany
> 


-- 
Pierre Morel
Linux/KVM/QEMU in Böblingen - Germany

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] [RFC] virtio: Limit the retries on a virtio device reset
  2017-08-25  8:26           ` Cornelia Huck
@ 2017-08-25 11:21             ` Pierre Morel
  2017-08-25 16:43             ` Michael S. Tsirkin
  1 sibling, 0 replies; 14+ messages in thread
From: Pierre Morel @ 2017-08-25 11:21 UTC (permalink / raw)
  To: Cornelia Huck, Michael S. Tsirkin; +Cc: virtualization

On 25/08/2017 10:26, Cornelia Huck wrote:
> On Fri, 25 Aug 2017 00:16:05 +0300
> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> 
>> On Thu, Aug 24, 2017 at 07:07:42PM +0200, Pierre Morel wrote:
>>>> - we'll have to spread these tests all over the place.
>>>
>>> I counted 19 places where to check if the reset went OK.
>>>
>>> None of them touch the device anymore after reset and just free driver's
>>> resources.
>>
>> ... and then hypervisor uses the resources after free. Not good.

hum... yes, no good.
we can not assume anything about the host side at that time anymore.

> 
> The only place where we can simply give up on the device and be sure
> that nothing bad happens is during initial setup. In the other places,
> it seems we have the choice between looping (as now) or panic.

Yes
Note also that the probability that an initial reset works but further 
reset don't is very small.
I think we catch most of the problem during the probe.

IMHO, looping if not in initial setup let the administrator more freedom.
OTOH panic reset to a known state

Problem open: how to recognize an initial setup:
- just make reset return an error and let the driver take the decision
- find a way to let the reset function that this is an initial setup
	- add a state in virtio_device?
	- other?

If we add a state to the virtio_device we could also add other 
information as the reset_retries_count.

> 
>>
>>> So that if reset failed, nothing goes wrong, no device access, but the
>>> probability that the next probe fail is high. (If it ever succeed).
>>>    
>>>>     Allowing reset to fail would be better.
>>>
>>> May be I did not understand what you mean.
>>> Testing the flag or a return value is as expensive.
>>>
>>> Of course the implementation is a mater of taste.
>>
>> If a function can fail it should return an error, not just set a flag.

You are right in this case (at least), setting the flag is bad, since 
the flag is set by iowrite8(), it is handled on the device side... 
untrusted in the case we explore.

> 
> ccw reset can (a) succeed, (b) fail (by returning an error status via
> standard channel subsystem mechanisms), or (c) run into a timeout
> (where the driver should recover via csch and friends). [1] In any
> case, we need to able to rely on the channel subsystem to make sure a
> broken device is dead.
> 
> pci-modern reset can (a) succeed, or (b) be in an indeterminate state
> (it has not yet completed, but we don't know whether it will complete
> in the future). It may be reasonable to just give up if (b) happens
> during initial setup.
> 
> I'm not sure that allowing to fail reset will help much, other than
> allowing to give up on a pci device early.
We have nowhere a sanity check to verify that the virtio transport is in 
order.
The first successful reset can be seen as such a check, independently 
from the virtio transport type.
Give up the device early if the transport is not in order is indeed the 
only thing we can do.

Regards,

Pierre

> 
>>
>>
>>> I notice two other things to do:
>>>
>>> - May be adding a warning would be fine too.
>>> - Virtio_ccw may add a fail flag when allocation of CCW failed.
>>>    I did not find anything to do for virtio_mmio or legacy virtio_pci.
>>>
>>> Regards,
>>>
>>> Pierre
> 
> [1] At that point, we either succeed with csch and can either retry or
> fail, or fail with device gone, in which case the device is, well,
> gone, which we also should be able to handle fine.
> 


-- 
Pierre Morel
Linux/KVM/QEMU in Böblingen - Germany

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] [RFC] virtio: Limit the retries on a virtio device reset
  2017-08-25  8:26           ` Cornelia Huck
  2017-08-25 11:21             ` Pierre Morel
@ 2017-08-25 16:43             ` Michael S. Tsirkin
  1 sibling, 0 replies; 14+ messages in thread
From: Michael S. Tsirkin @ 2017-08-25 16:43 UTC (permalink / raw)
  To: Cornelia Huck; +Cc: Pierre Morel, virtualization

On Fri, Aug 25, 2017 at 10:26:12AM +0200, Cornelia Huck wrote:
> On Fri, 25 Aug 2017 00:16:05 +0300
> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> 
> > On Thu, Aug 24, 2017 at 07:07:42PM +0200, Pierre Morel wrote:
> > > > - we'll have to spread these tests all over the place.  
> > > 
> > > I counted 19 places where to check if the reset went OK.
> > > 
> > > None of them touch the device anymore after reset and just free driver's
> > > resources.  
> > 
> > ... and then hypervisor uses the resources after free. Not good.
> 
> The only place where we can simply give up on the device and be sure
> that nothing bad happens is during initial setup. In the other places,
> it seems we have the choice between looping (as now) or panic.

Right. Whether it's even worth it to handle just this corner case,
I don't really know.

-- 
MST

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] [RFC] virtio: Limit the retries on a virtio device reset
  2017-08-25  8:33       ` Pierre Morel
@ 2017-08-25 16:46         ` Michael S. Tsirkin
  0 siblings, 0 replies; 14+ messages in thread
From: Michael S. Tsirkin @ 2017-08-25 16:46 UTC (permalink / raw)
  To: Pierre Morel; +Cc: cohuck, virtualization

On Fri, Aug 25, 2017 at 10:33:57AM +0200, Pierre Morel wrote:
> On 24/08/2017 23:23, Michael S. Tsirkin wrote:
> > On Thu, Aug 24, 2017 at 07:42:07PM +0200, Pierre Morel wrote:
> > > On 24/08/2017 16:19, Michael S. Tsirkin wrote:
> > > > On Wed, Aug 23, 2017 at 06:33:02PM +0200, Pierre Morel wrote:
> > > > > Reseting a device can sometime fail, even a virtual device.
> > > > > If the device is not reseted after a while the driver should
> > > > > abandon the retries.
> > > > > This is the change proposed for the modern virtio_pci.
> > > > > 
> > > > > More generally, when this happens,the virtio driver can set the
> > > > > VIRTIO_CONFIG_S_FAILED status flag to advertise the caller.
> > > > > 
> > > > > The virtio core can test if the reset was succesful by testing
> > > > > this flag after a reset.
> > > > > 
> > > > > This behavior is backward compatible with existing drivers.
> > > > > This behavior seems to me compatible with Virtio-1.0 specifications,
> > > > > Chapters 2.1 Device Status Field.
> > > > > There I definitively need your opinion: Is it right?
> > > > > 
> > > > > This patch also lead to another question:
> > > > > do we care if a device provided by the hypervisor is buggy?
> > > > > 
> > > > > Signed-off-by: Pierre Morel <pmorel@linux.vnet.ibm.com>
> > > > 
> > > > So I think this is not the best place to start to add error recovery.
> > > 
> > > I agree, there can not be any error recovery there.
> > > If reset does not work we can let fall the device until next reset of the
> > > hypervisor.
> > 
> > On probe, yes. But failures are more likely to trigger at other times.
> 
> OK, what about:
> - On probe if reset fail, the probe fail.
> 
> - On freeze and remove : we can not free resources which are common
> 	with the device, at least the queues.
> 	... we can only signal the error and give up with the device.
> 
> > 
> > > > It should be much more common to have a situation where device gets
> > > > broken while it's being used.  Spec has a NEEDS_RESET flag for this.
> > > 
> > > Yes the device side can set this flag, but it is another problem, it is
> > > supposing that:
> > > - the transport, device side, still works.
> > > - it is able to detect that the device need a reset
> > > - a reset is effective
> > 
> > Right. OTOH in this case there's more we can do.
> 
> Yes, I did not find a single test of this flag (NEEDS_RESET).
> even QEMU set it quite often (though virtio_error())
> 
> The decision to reset the device must come from the driver.
> The protocol to reset the device is device/driver specific... lotta work
> 
> Shouldn't it be separate from the "reset failed" problem?
> 
> 
> Regards,
> 
> Pierre
> 

I just don't think we can do a lot about reset failed without risk of
breaking some working config. So I would start with need reset
and maybe some reset failures will be fixable as a side effect.

Yes it's a lot of work. For example we need to validate device
input, can't rely on it to be consistent.

-- 
MST

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2017-08-25 16:46 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2017-08-23 16:33 [PATCH] [RFC] virtio: Limit the retries on a virtio device reset Pierre Morel
2017-08-24 11:07 ` Cornelia Huck
2017-08-24 12:16   ` Pierre Morel
2017-08-24 14:12     ` Michael S. Tsirkin
2017-08-24 17:07       ` Pierre Morel
2017-08-24 21:16         ` Michael S. Tsirkin
2017-08-25  8:26           ` Cornelia Huck
2017-08-25 11:21             ` Pierre Morel
2017-08-25 16:43             ` Michael S. Tsirkin
2017-08-24 14:19 ` Michael S. Tsirkin
2017-08-24 17:42   ` Pierre Morel
2017-08-24 21:23     ` Michael S. Tsirkin
2017-08-25  8:33       ` Pierre Morel
2017-08-25 16:46         ` Michael S. Tsirkin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).