From: "Michael S. Tsirkin" <mst@redhat.com>
To: "Steven Luong (sluong)" <sluong@cisco.com>
Cc: "Jerome Tollet (jtollet)" <jtollet@cisco.com>,
	"Damjan Marion (damarion)" <damarion@cisco.com>,
	"virtualization@lists.linux-foundation.org"
	<virtualization@lists.linux-foundation.org>
Subject: Re: kernel vhost demands an interrupt from guest when the ring is full in order to enable guest to submit new packets to the queue
Date: Mon, 17 Dec 2018 19:15:55 -0500
Message-ID: <20181217191142-mutt-send-email-mst@kernel.org>
In-Reply-To: <4FE702F9-8680-461F-822B-03E81082F4F2@cisco.com>

On Mon, Dec 17, 2018 at 11:56:59PM +0000, Steven Luong (sluong) wrote:
> 
> 
> On 12/17/18, 2:55 PM, "Michael S. Tsirkin" <mst@redhat.com> wrote:
> 
>     On Thu, Dec 13, 2018 at 11:24:28PM +0000, Steven Luong (sluong) via Virtualization wrote:
>     > Folks,
>     > 
>     >  
>     > 
>     > We came across a memory race condition between the VPP vhost driver and the kernel
>     > vhost. VPP is running a tap interface over the vhost backend. In this case, VPP is
>     > acting in the vhost driver role and the kernel vhost is acting in the vhost
>     > device role.
>     > 
>     >  
>     > 
>     > In the kernel vhost’s TX traffic direction, which is VPP’s RX traffic direction,
>     > kernel vhost is the producer and VPP is the consumer.
>     
>     Looking at vhost net kernel code, it seems to use the
>     same terminology as virtio net.
>     Can you please clarify which ring number you use, 0 or 1?
> 
> <SVL> 0. </SVL>
>     
>     I will assume ring 0 below.
>     
>     
>     
>     > Kernel vhost places
>     > traffic in kernel vhost’s TX vring. VPP removes traffic from VPP’s RX vring.
>     
>     
>     So VPP makes buffers available and vhost kernel uses them?
>     
> <SVL> Correct. </SVL>
>     
>     > It
>     > is inevitable that the vring may become full under heavy traffic and the kernel
>     > vhost may have to stop and wait for the guest (VPP) to empty the vring and to
>     > refill the vring with descriptors. When that case happens, kernel vhost clears
>     > the bit in the vring’s used flags to request an interrupt/notification. Due to
>     > a shared-memory race condition, VPP may miss the clearing of the vring’s used
>     > flags by kernel vhost and fail to send kernel vhost an interrupt after VPP
>     > empties and refills the vring with new descriptors.
>     
>     So kernel vhost's wakeups are signalled through eventfd - I assume
>     this is what you mean by an interrupt?
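(Side note for readers of the archive: with the kernel vhost backend, this driver-to-device notification is just an 8-byte counter write to the kick eventfd registered via VHOST_SET_VRING_KICK. A minimal sketch follows; the function and variable names are illustrative, not VPP identifiers.)

/* Hedged illustration, not VPP code: "kicking" kernel vhost is an eventfd write. */
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

void kick_vhost(int kickfd)                 /* kickfd: hypothetical fd variable */
{
        uint64_t one = 1;                   /* eventfd expects a 64-bit counter value */

        if (write(kickfd, &one, sizeof(one)) != sizeof(one))
                perror("kick eventfd write");
}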
>     
>     
>     To prevent the race that you describe, vhost has this code:
>     
>                     /* OK, now we need to know about added descriptors. */
>                     if (!headcount) {
>                             if (unlikely(busyloop_intr)) {
>                                     vhost_poll_queue(&vq->poll);
>                             } else if (unlikely(vhost_enable_notify(&net->dev, vq))) {
>                                     /* They have slipped one in as we were
>                                      * doing that: check again. */
>                                     vhost_disable_notify(&net->dev, vq);
>                                     continue;
>                             }
>                             /* Nothing new?  Wait for eventfd to tell us
>                              * they refilled. */
>     
>     So ring should be re-checked after notifications are enabled.
>     If ring is no longer empty, vhost will proceed to handle the ring.  
>     This code has been around for many years.
>     
>     Thus if VPP doesn't work with it then I suspect it does something
>     differently, e.g. is it possible that it is missing a memory
>     barrier after writing out the available buffers and before
>     checking the flags?
>     
>     <SVL> Can the race condition be avoided totally? Here is the scenario.
> 
> t0: VPP refills the vring with new descriptors but has not yet set avail->idx to make them available to kernel vhost.
> t1: kernel vhost checks the vring, still sees no space available, but has not yet executed the line of code that clears used->flags.
> t2: VPP executes an sfence and sets avail->idx to make the descriptors available to kernel vhost.

At this point you need a full memory barrier, e.g. mfence or a locked
instruction such as lock; xor. Otherwise the read can get reordered before the write.

> t3: VPP checks used->flags; it is still 1, so VPP skips sending the interrupt.
> t4: kernel vhost clears used->flags to indicate that an interrupt is required, but VPP has just missed it.

Fine, but at this point kernel vhost will recheck using vhost_get_avail()
and process the buffers, so the lack of a notification isn't a problem:

bool vhost_enable_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
{
        __virtio16 avail_idx;
        int r;

        if (!(vq->used_flags & VRING_USED_F_NO_NOTIFY))
                return false;
        vq->used_flags &= ~VRING_USED_F_NO_NOTIFY;
        if (!vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX)) {
                r = vhost_update_used_flags(vq);
                if (r) {
                        vq_err(vq, "Failed to enable notification at %p: %d\n",
                               &vq->used->flags, r);
                        return false;
                }
        } else {
                r = vhost_update_avail_event(vq, vq->avail_idx);
                if (r) {
                        vq_err(vq, "Failed to update avail event index at %p: %d\n",
                               vhost_avail_event(vq), r);
                        return false;
                }
        }
        /* They could have slipped one in as we were doing that: make
         * sure it's written, then check again. */
        smp_mb();
        r = vhost_get_avail(vq, avail_idx, &vq->avail->idx);
        if (r) {
                vq_err(vq, "Failed to check avail idx at %p: %d\n",
                       &vq->avail->idx, r);
                return false;
        }

        return vhost16_to_cpu(vq, avail_idx) != vq->avail_idx;
}

> The result is that kernel vhost gets stuck even though the vring is filled with new descriptors.

Shouldn't be. Maybe you are missing a fence before reading used->flags?
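To make the required ordering concrete, here is a minimal driver-side sketch
(the struct, field, and helper names are hypothetical illustrations, not VPP or
kernel identifiers). The point is Dekker-style pairing: the driver's store to
avail->idx and its load of used->flags must be separated by a full (store-load)
barrier, which pairs with the smp_mb() between clearing VRING_USED_F_NO_NOTIFY
and re-reading avail->idx in vhost_enable_notify() above. An sfence only orders
stores, so the t3 read of used->flags can still be satisfied before the t2 store
becomes visible; with full barriers on both sides, at least one side is
guaranteed to see the other's update.

/* Hedged sketch, not VPP code: publish new avail descriptors, then decide
 * whether to kick, with a full barrier in between. */
#include <stdint.h>

#define VRING_USED_F_NO_NOTIFY 1

struct ring_view {                     /* hypothetical: only the fields needed here */
	volatile uint16_t *avail_idx;  /* written by the driver, read by the device  */
	volatile uint16_t *used_flags; /* written by the device, read by the driver  */
};

void kick_vhost(int kickfd);           /* e.g. the eventfd write sketched earlier */

void publish_and_maybe_kick(struct ring_view *vq, int kickfd, uint16_t new_avail_idx)
{
	/* t2: expose the refilled descriptors to the device. */
	*vq->avail_idx = new_avail_idx;

	/* Full store-load barrier (mfence or a locked instruction on x86); an
	 * sfence is not enough to keep the load below from moving ahead of the
	 * store above. */
	__sync_synchronize();

	/* t3: only now sample the suppression flag.  Either we observe that the
	 * device cleared VRING_USED_F_NO_NOTIFY and kick it, or the device's
	 * re-check in vhost_enable_notify() observes the new avail_idx. */
	if (!(*vq->used_flags & VRING_USED_F_NO_NOTIFY))
		kick_vhost(kickfd);
}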


> Steven
> 
> </SVL>
>     
>     
>     > Unfortunately, this missed
>     > notification causes the kernel vhost to be stuck: once the kernel vhost
>     > is waiting for an interrupt/notification from the guest, only an interrupt/
>     > notification from the guest can wake it up so that new packets submitted by
>     > the guest are processed from the vring. This design seems vulnerable.
>     
>     The protocol right now relies on notifications never being lost.
>     
>     I can imagine an alternative where the device also re-checks
>     the rings periodically, but that would involve running
>     timers on the host side, which is only possible if there's a
>     free processor that can handle them. Further,
>     it would still lead to stalls and packet drops, which would
>     be harder to debug.
>     
>     That's just my $.02; please feel free to take it up with the virtio TC,
>     maybe others will feel differently.
>     
>     > Should the kernel vhost
>     > totally depend on the notification from the guest? Quoting the text from
>     > 
>     >  
>     > 
>     > http://docs.oasis-open.org/virtio/virtio/v1.0/virtio-v1.0.html
>     > 
>     >  
>     > 
>     > /* The device uses this in used->flags to advise the driver: don’t kick me 
>     >  * when you add a buffer.  It’s unreliable, so it’s simply an 
>     >  * optimization. */ 
>     > #define VIRTQ_USED_F_NO_NOTIFY  1 
>     > 
>     >  
>     > 
>     > I interpret this to mean that the notification is simply an optimization, not a reliable
>     > notification mechanism.
>     
>     What was meant I think is that suppression is unreliable.
>     
>     > So the kernel vhost should not bet the farm on it.
>     > 
>     >  
>     > 
>     > We encounter the same race condition in kernel vhost’s RX traffic direction,
>     > which causes the kernel vhost to be stuck due to a missed interrupt/notification,
>     > although it is less frequent.
>     
>     
>     For the reverse direction the spec actually has some pseudo-code and a suggestion
>     about handling that race:
>     
>     	For optimal performance, a driver MAY disable used buffer notifications while processing the used
>     	ring, but beware the problem of missing notifications between emptying the ring and reenabling
>     	notifications. This is usually handled by re-checking for more used buffers after notifications are
>     	re-enabled:
>     
>     		virtq_disable_used_buffer_notifications(vq);
>     
>     		for (;;) {
>     			if (vq->last_seen_used != le16_to_cpu(virtq->used.idx)) {
>     				virtq_enable_used_buffer_notifications(vq);
>     				mb();
>     
>     				if (vq->last_seen_used != le16_to_cpu(virtq->used.idx))
>     					break;
>     
>     				virtq_disable_used_buffer_notifications(vq);
>     			}
>     
>     			struct virtq_used_elem *e = virtq.used->ring[vq->last_seen_used%vsz];
>     			process_buffer(e);
>     			vq->last_seen_used++;
>     		}
>     
>     
>     
>     >  
>     > 
>     > Steven
>     > 
>     
>     
>     
> 
