virtualization.lists.linux-foundation.org archive mirror
 help / color / mirror / Atom feed
* Re: [PATCH net-next] virtio_net: set link state down when virtqueue is broken
       [not found] <79907bf6c835572b4af92f16d9a3ff2822b1c7ea.1622028946.git.wangyunjian@huawei.com>
@ 2021-05-27  4:22 ` Jason Wang
       [not found]   ` <d18383f7e675452d9392321506db6fa0@huawei.com>
  0 siblings, 1 reply; 4+ messages in thread
From: Jason Wang @ 2021-05-27  4:22 UTC (permalink / raw)
  To: wangyunjian, netdev; +Cc: kuba, virtualization, dingxiaoxiong, davem, mst


在 2021/5/26 下午7:39, wangyunjian 写道:
> From: Yunjian Wang <wangyunjian@huawei.com>
>
> The NIC can't receive/send packets if a rx/tx virtqueue is broken.
> However, the link state of the NIC is still normal. As a result,
> the user cannot detect the NIC exception.


Doesn't we have:

        /* This should not happen! */
         if (unlikely(err)) {
                 dev->stats.tx_fifo_errors++;
                 if (net_ratelimit())
                         dev_warn(&dev->dev,
                                  "Unexpected TXQ (%d) queue failure: %d\n",
                                  qnum, err);
                 dev->stats.tx_dropped++;
                 dev_kfree_skb_any(skb);
                 return NETDEV_TX_OK;
         }

Which should be sufficient?


>
> The driver can set the link state down when the virtqueue is broken.
> If the state is down, the user can switch over to another NIC.


Note that, we probably need the watchdog for virtio-net in order to be a 
complete solution.

Thanks


>
> Signed-off-by: Yunjian Wang <wangyunjian@huawei.com>
> ---
>   drivers/net/virtio_net.c | 36 +++++++++++++++++++++++++++++++++++-
>   1 file changed, 35 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 073fec4c0df1..05a3cd1c589b 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -237,6 +237,10 @@ struct virtnet_info {
>   
>   	/* failover when STANDBY feature enabled */
>   	struct failover *failover;
> +
> +	/* Work struct for checking vq status, stop NIC if vq is broken. */
> +	struct delayed_work vq_check_work;
> +	bool broken;
>   };
>   
>   struct padded_vnet_hdr {
> @@ -1407,6 +1411,27 @@ static void refill_work(struct work_struct *work)
>   	}
>   }
>   
> +static void virnet_vq_check_work(struct work_struct *work)
> +{
> +	struct virtnet_info *vi =
> +		container_of(work, struct virtnet_info, vq_check_work.work);
> +	struct net_device *netdev = vi->dev;
> +	int i;
> +
> +	if (vi->broken)
> +		return;
> +
> +	/* If virtqueue is broken, set link down and stop all queues */
> +	for (i = 0; i < vi->max_queue_pairs; i++) {
> +		if (virtqueue_is_broken(vi->rq[i].vq) || virtqueue_is_broken(vi->sq[i].vq)) {
> +			netif_carrier_off(netdev);
> +			netif_tx_stop_all_queues(netdev);
> +			vi->broken = true;
> +			break;
> +		}
> +	}
> +}
> +
>   static int virtnet_receive(struct receive_queue *rq, int budget,
>   			   unsigned int *xdp_xmit)
>   {
> @@ -1432,6 +1457,9 @@ static int virtnet_receive(struct receive_queue *rq, int budget,
>   		}
>   	}
>   
> +	if (unlikely(!virtqueue_is_broken(rq->vq)))
> +		schedule_delayed_work(&vi->vq_check_work, HZ);
> +
>   	if (rq->vq->num_free > min((unsigned int)budget, virtqueue_get_vring_size(rq->vq)) / 2) {
>   		if (!try_fill_recv(vi, rq, GFP_ATOMIC))
>   			schedule_delayed_work(&vi->refill, 0);
> @@ -1681,6 +1709,7 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
>   				 qnum, err);
>   		dev->stats.tx_dropped++;
>   		dev_kfree_skb_any(skb);
> +		schedule_delayed_work(&vi->vq_check_work, HZ);
>   		return NETDEV_TX_OK;
>   	}
>   
> @@ -1905,6 +1934,7 @@ static int virtnet_close(struct net_device *dev)
>   
>   	/* Make sure refill_work doesn't re-enable napi! */
>   	cancel_delayed_work_sync(&vi->refill);
> +	cancel_delayed_work_sync(&vi->vq_check_work);
>   
>   	for (i = 0; i < vi->max_queue_pairs; i++) {
>   		xdp_rxq_info_unreg(&vi->rq[i].xdp_rxq);
> @@ -2381,6 +2411,7 @@ static void virtnet_freeze_down(struct virtio_device *vdev)
>   	netif_device_detach(vi->dev);
>   	netif_tx_unlock_bh(vi->dev);
>   	cancel_delayed_work_sync(&vi->refill);
> +	cancel_delayed_work_sync(&vi->vq_check_work);
>   
>   	if (netif_running(vi->dev)) {
>   		for (i = 0; i < vi->max_queue_pairs; i++) {
> @@ -2662,7 +2693,7 @@ static void virtnet_config_changed_work(struct work_struct *work)
>   
>   	vi->status = v;
>   
> -	if (vi->status & VIRTIO_NET_S_LINK_UP) {
> +	if ((vi->status & VIRTIO_NET_S_LINK_UP) && !vi->broken) {
>   		virtnet_update_settings(vi);
>   		netif_carrier_on(vi->dev);
>   		netif_tx_wake_all_queues(vi->dev);
> @@ -2889,6 +2920,8 @@ static int virtnet_alloc_queues(struct virtnet_info *vi)
>   		goto err_rq;
>   
>   	INIT_DELAYED_WORK(&vi->refill, refill_work);
> +	INIT_DELAYED_WORK(&vi->vq_check_work, virnet_vq_check_work);
> +
>   	for (i = 0; i < vi->max_queue_pairs; i++) {
>   		vi->rq[i].pages = NULL;
>   		netif_napi_add(vi->dev, &vi->rq[i].napi, virtnet_poll,
> @@ -3240,6 +3273,7 @@ static int virtnet_probe(struct virtio_device *vdev)
>   	net_failover_destroy(vi->failover);
>   free_vqs:
>   	cancel_delayed_work_sync(&vi->refill);
> +	cancel_delayed_work_sync(&vi->vq_check_work);
>   	free_receive_page_frags(vi);
>   	virtnet_del_vqs(vi);
>   free:

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH net-next] virtio_net: set link state down when virtqueue is broken
       [not found]   ` <d18383f7e675452d9392321506db6fa0@huawei.com>
@ 2021-05-31  3:28     ` Jason Wang
       [not found]       ` <20a5f1bd8a5a49fa8c0f90875a49631b@huawei.com>
  0 siblings, 1 reply; 4+ messages in thread
From: Jason Wang @ 2021-05-31  3:28 UTC (permalink / raw)
  To: wangyunjian, netdev@vger.kernel.org
  Cc: kuba@kernel.org, virtualization@lists.linux-foundation.org,
	dingxiaoxiong, davem@davemloft.net, mst@redhat.com


在 2021/5/28 下午6:58, wangyunjian 写道:
>> -----Original Message-----
>>> From: Yunjian Wang <wangyunjian@huawei.com>
>>>
>>> The NIC can't receive/send packets if a rx/tx virtqueue is broken.
>>> However, the link state of the NIC is still normal. As a result, the
>>> user cannot detect the NIC exception.
>>
>> Doesn't we have:
>>
>>          /* This should not happen! */
>>           if (unlikely(err)) {
>>                   dev->stats.tx_fifo_errors++;
>>                   if (net_ratelimit())
>>                           dev_warn(&dev->dev,
>>                                    "Unexpected TXQ (%d) queue
>> failure: %d\n",
>>                                    qnum, err);
>>                   dev->stats.tx_dropped++;
>>                   dev_kfree_skb_any(skb);
>>                   return NETDEV_TX_OK;
>>           }
>>
>> Which should be sufficient?
> There may be other reasons for this error, e.g -ENOSPC(no free desc).


This should not happen unless the device or driver is buggy. We always 
reserved sufficient slots:

         if (sq->vq->num_free < 2+MAX_SKB_FRAGS) {
                 netif_stop_subqueue(dev, qnum);
...


> And if rx virtqueue is broken, there is no error statistics.


Feel free to add one if it's necessary.

Let's leave the policy decision (link down) to userspace.


>
>>
>>> The driver can set the link state down when the virtqueue is broken.
>>> If the state is down, the user can switch over to another NIC.
>>
>> Note that, we probably need the watchdog for virtio-net in order to be a
>> complete solution.
> Yes, I can think of is that the virtqueue's broken exception is detected on watchdog.
> Is there anything else that needs to be done?


Basically, it's all about TX stall which watchdog tries to catch. Broken 
vq is only one of the possible reason.

Thanks


>
> Thanks
>
>> Thanks
>>
>>

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH net-next] virtio_net: set link state down when virtqueue is broken
       [not found]       ` <20a5f1bd8a5a49fa8c0f90875a49631b@huawei.com>
@ 2021-06-04  2:37         ` Jason Wang
       [not found]           ` <5d6fdd5c8e62498ba804aa22d71eb6a8@huawei.com>
  0 siblings, 1 reply; 4+ messages in thread
From: Jason Wang @ 2021-06-04  2:37 UTC (permalink / raw)
  To: wangyunjian, netdev@vger.kernel.org
  Cc: kuba@kernel.org, virtualization@lists.linux-foundation.org,
	dingxiaoxiong, davem@davemloft.net, mst@redhat.com


在 2021/6/3 下午7:34, wangyunjian 写道:
>> -----Original Message-----
>> From: Jason Wang [mailto:jasowang@redhat.com]
>> Sent: Monday, May 31, 2021 11:29 AM
>> To: wangyunjian <wangyunjian@huawei.com>; netdev@vger.kernel.org
>> Cc: kuba@kernel.org; davem@davemloft.net; mst@redhat.com;
>> virtualization@lists.linux-foundation.org; dingxiaoxiong
>> <dingxiaoxiong@huawei.com>
>> Subject: Re: [PATCH net-next] virtio_net: set link state down when virtqueue is
>> broken
>>
>>
>> 在 2021/5/28 下午6:58, wangyunjian 写道:
>>>> -----Original Message-----
>>>>> From: Yunjian Wang <wangyunjian@huawei.com>
>>>>>
>>>>> The NIC can't receive/send packets if a rx/tx virtqueue is broken.
>>>>> However, the link state of the NIC is still normal. As a result, the
>>>>> user cannot detect the NIC exception.
>>>> Doesn't we have:
>>>>
>>>>           /* This should not happen! */
>>>>            if (unlikely(err)) {
>>>>                    dev->stats.tx_fifo_errors++;
>>>>                    if (net_ratelimit())
>>>>                            dev_warn(&dev->dev,
>>>>                                     "Unexpected TXQ (%d)
>> queue
>>>> failure: %d\n",
>>>>                                     qnum, err);
>>>>                    dev->stats.tx_dropped++;
>>>>                    dev_kfree_skb_any(skb);
>>>>                    return NETDEV_TX_OK;
>>>>            }
>>>>
>>>> Which should be sufficient?
>>> There may be other reasons for this error, e.g -ENOSPC(no free desc).
>>
>> This should not happen unless the device or driver is buggy. We always reserved
>> sufficient slots:
>>
>>           if (sq->vq->num_free < 2+MAX_SKB_FRAGS) {
>>                   netif_stop_subqueue(dev, qnum); ...
>>
>>
>>> And if rx virtqueue is broken, there is no error statistics.
>>
>> Feel free to add one if it's necessary.
> Currently receiving scenario, it is impossible to distinguish whether the reason for
> not receiving packet is virtqueue's broken or no packet.


Can we introduce rx_fifo_errors for that?


>
>> Let's leave the policy decision (link down) to userspace.
>>
>>
>>>>> The driver can set the link state down when the virtqueue is broken.
>>>>> If the state is down, the user can switch over to another NIC.
>>>> Note that, we probably need the watchdog for virtio-net in order to
>>>> be a complete solution.
>>> Yes, I can think of is that the virtqueue's broken exception is detected on
>> watchdog.
>>> Is there anything else that needs to be done?
>>
>> Basically, it's all about TX stall which watchdog tries to catch. Broken vq is only
>> one of the possible reason.
> Are there any plans for the watchdog?


Somebody posted a prototype 3 or 4 years ago, you can search it and 
maybe we can start from there.

Thanks


>
> Thanks
>
>> Thanks
>>
>>
>>> Thanks
>>>
>>>> Thanks
>>>>
>>>>

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH net-next] virtio_net: set link state down when virtqueue is broken
       [not found]           ` <5d6fdd5c8e62498ba804aa22d71eb6a8@huawei.com>
@ 2021-06-07  2:28             ` Jason Wang
  0 siblings, 0 replies; 4+ messages in thread
From: Jason Wang @ 2021-06-07  2:28 UTC (permalink / raw)
  To: wangyunjian, netdev@vger.kernel.org, jcfaracco@gmail.com
  Cc: kuba@kernel.org, virtualization@lists.linux-foundation.org,
	dingxiaoxiong, davem@davemloft.net, mst@redhat.com


在 2021/6/5 下午3:10, wangyunjian 写道:
>> -----Original Message-----
>> From: Jason Wang [mailto:jasowang@redhat.com]
>> Sent: Friday, June 4, 2021 10:38 AM
>> To: wangyunjian <wangyunjian@huawei.com>; netdev@vger.kernel.org
>> Cc: kuba@kernel.org; davem@davemloft.net; mst@redhat.com;
>> virtualization@lists.linux-foundation.org; dingxiaoxiong
>> <dingxiaoxiong@huawei.com>
>> Subject: Re: [PATCH net-next] virtio_net: set link state down when virtqueue is
>> broken
>>
>>
>> 在 2021/6/3 下午7:34, wangyunjian 写道:
>>>> -----Original Message-----
>>>> From: Jason Wang [mailto:jasowang@redhat.com]
>>>> Sent: Monday, May 31, 2021 11:29 AM
>>>> To: wangyunjian <wangyunjian@huawei.com>; netdev@vger.kernel.org
>>>> Cc: kuba@kernel.org; davem@davemloft.net; mst@redhat.com;
>>>> virtualization@lists.linux-foundation.org; dingxiaoxiong
>>>> <dingxiaoxiong@huawei.com>
>>>> Subject: Re: [PATCH net-next] virtio_net: set link state down when
>>>> virtqueue is broken
>>>>
>>>>
>>>> 在 2021/5/28 下午6:58, wangyunjian 写道:
>>>>>> -----Original Message-----
>>>>>>> From: Yunjian Wang <wangyunjian@huawei.com>
>>>>>>>
>>>>>>> The NIC can't receive/send packets if a rx/tx virtqueue is broken.
>>>>>>> However, the link state of the NIC is still normal. As a result,
>>>>>>> the user cannot detect the NIC exception.
>>>>>> Doesn't we have:
>>>>>>
>>>>>>            /* This should not happen! */
>>>>>>             if (unlikely(err)) {
>>>>>>                     dev->stats.tx_fifo_errors++;
>>>>>>                     if (net_ratelimit())
>>>>>>                             dev_warn(&dev->dev,
>>>>>>                                      "Unexpected TXQ (%d)
>>>> queue
>>>>>> failure: %d\n",
>>>>>>                                      qnum, err);
>>>>>>                     dev->stats.tx_dropped++;
>>>>>>                     dev_kfree_skb_any(skb);
>>>>>>                     return NETDEV_TX_OK;
>>>>>>             }
>>>>>>
>>>>>> Which should be sufficient?
>>>>> There may be other reasons for this error, e.g -ENOSPC(no free desc).
>>>> This should not happen unless the device or driver is buggy. We
>>>> always reserved sufficient slots:
>>>>
>>>>            if (sq->vq->num_free < 2+MAX_SKB_FRAGS) {
>>>>                    netif_stop_subqueue(dev, qnum); ...
>>>>
>>>>
>>>>> And if rx virtqueue is broken, there is no error statistics.
>>>> Feel free to add one if it's necessary.
>>> Currently receiving scenario, it is impossible to distinguish whether
>>> the reason for not receiving packet is virtqueue's broken or no packet.
>>
>> Can we introduce rx_fifo_errors for that?
>>
>>
>>>> Let's leave the policy decision (link down) to userspace.
>>>>
>>>>
>>>>>>> The driver can set the link state down when the virtqueue is broken.
>>>>>>> If the state is down, the user can switch over to another NIC.
>>>>>> Note that, we probably need the watchdog for virtio-net in order to
>>>>>> be a complete solution.
>>>>> Yes, I can think of is that the virtqueue's broken exception is
>>>>> detected on
>>>> watchdog.
>>>>> Is there anything else that needs to be done?
>>>> Basically, it's all about TX stall which watchdog tries to catch.
>>>> Broken vq is only one of the possible reason.
>>> Are there any plans for the watchdog?
>>
>> Somebody posted a prototype 3 or 4 years ago, you can search it and maybe we
>> can start from there.
> I find the patch (https://patchwork.ozlabs.org/project/netdev/patch/20191126200628.22251-3-jcfaracco@gmail.com/)
>
> The patch checks only the scenario where the sending queue is abnormal, but can
> not detect the exception in the receiving queue.


It's almost impossible to detect the abnormal of receiving queue since 
we there's no deterministic time for a receiving packet.


>
> And the patch restores the NIC by reset, which is inappropriate because the broken
> state may be caused by a front-end or back-end bug. We should keep the scene to
> locate bugs.


This could be changed, we can increase the error counters and let 
userspce to decide what to do.

Thanks


>
> Thanks
>
>> Thanks
>>
>>
>>> Thanks
>>>
>>>> Thanks
>>>>
>>>>
>>>>> Thanks
>>>>>
>>>>>> Thanks
>>>>>>
>>>>>>

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2021-06-07  2:29 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <79907bf6c835572b4af92f16d9a3ff2822b1c7ea.1622028946.git.wangyunjian@huawei.com>
2021-05-27  4:22 ` [PATCH net-next] virtio_net: set link state down when virtqueue is broken Jason Wang
     [not found]   ` <d18383f7e675452d9392321506db6fa0@huawei.com>
2021-05-31  3:28     ` Jason Wang
     [not found]       ` <20a5f1bd8a5a49fa8c0f90875a49631b@huawei.com>
2021-06-04  2:37         ` Jason Wang
     [not found]           ` <5d6fdd5c8e62498ba804aa22d71eb6a8@huawei.com>
2021-06-07  2:28             ` Jason Wang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).