From: "Michael S. Tsirkin" <mst@redhat.com>
To: Denis Plotnikov <dplotnikov@virtuozzo.com>
Cc: "kwolf@redhat.com" <kwolf@redhat.com>,
	Denis Lunev <den@virtuozzo.com>,
	"qemu-devel@nongnu.org" <qemu-devel@nongnu.org>,
	"mreitz@redhat.com" <mreitz@redhat.com>,
	Roman Kagan <rkagan@virtuozzo.com>,
	"stefanha@redhat.com" <stefanha@redhat.com>,
	"kraxel@redhat.com" <kraxel@redhat.com>
Subject: Re: [PATCH] virtio: fix IO request length in virtio SCSI/block #PSBM-78839
Date: Fri, 13 Dec 2019 07:40:41 -0500
Message-ID: <20191213072714-mutt-send-email-mst@kernel.org>
In-Reply-To: <c39d90bc-92c0-6b53-b7a6-992ec69bb0b0@virtuozzo.com>

On Mon, Nov 25, 2019 at 09:16:10AM +0000, Denis Plotnikov wrote:
> 
> 
> On 06.11.2019 15:03, Michael S. Tsirkin wrote:
> > On Thu, Oct 24, 2019 at 11:34:34AM +0000, Denis Lunev wrote:
> >> On 10/24/19 12:28 AM, Michael S. Tsirkin wrote:
> >>> On Fri, Oct 18, 2019 at 02:55:47PM +0300, Denis Plotnikov wrote:
> >>>> From: "Denis V. Lunev" <den@openvz.org>
> >>>>
> >>>> Linux guests submit IO requests no longer than PAGE_SIZE * max_seg, the
> >>>> field reported by the SCSI controller. Thus a typical sequential read of
> >>>> 1 MB results in the following IO pattern from the guest:
> >>>>    8,16   1    15754     2.766095122  2071  D   R 2095104 + 1008 [dd]
> >>>>    8,16   1    15755     2.766108785  2071  D   R 2096112 + 1008 [dd]
> >>>>    8,16   1    15756     2.766113486  2071  D   R 2097120 + 32 [dd]
> >>>>    8,16   1    15757     2.767668961     0  C   R 2095104 + 1008 [0]
> >>>>    8,16   1    15758     2.768534315     0  C   R 2096112 + 1008 [0]
> >>>>    8,16   1    15759     2.768539782     0  C   R 2097120 + 32 [0]
> >>>> The IO was generated by
> >>>>    dd if=/dev/sda of=/dev/null bs=1024 iflag=direct
> >>>>
> >>>> This effectively means that on rotational disks we will observe 3 IOPS
> >>>> for every 2 MB processed, which definitely hurts both guest and host IO
> >>>> performance.
> >>>>
> >>>> The cure is relatively simple - we should report a larger scatter-gather
> >>>> capability for the SCSI controller. Fortunately the situation here is
> >>>> very good: the VirtIO transport layer can accommodate 1024 items in one
> >>>> request while we are using only 128, and this has been the case almost
> >>>> from the very beginning. Two items are dedicated to request metadata,
> >>>> thus we should publish VIRTQUEUE_MAX_SIZE - 2 as max_seg.
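
(To put numbers on this, assuming the usual 4 KiB PAGE_SIZE and 512-byte
sectors: the current seg_max of 128 - 2 = 126 allows 126 * 4 KiB = 504 KiB =
1008 sectors per request, which is exactly the "+ 1008" seen in the trace
above, while VIRTQUEUE_MAX_SIZE - 2 = 1022 segments would allow roughly 4 MiB
per request.)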
> >>>>
> >>>> The following pattern is observed after the patch:
> >>>>    8,16   1     9921     2.662721340  2063  D   R 2095104 + 1024 [dd]
> >>>>    8,16   1     9922     2.662737585  2063  D   R 2096128 + 1024 [dd]
> >>>>    8,16   1     9923     2.665188167     0  C   R 2095104 + 1024 [0]
> >>>>    8,16   1     9924     2.665198777     0  C   R 2096128 + 1024 [0]
> >>>> which is much better.
> >>>>
> >>>> The dark side of this patch is that we are tweaking a guest-visible
> >>>> parameter, though this should be relatively safe as the transport-layer
> >>>> support above has been present in QEMU/host Linux for a very long time.
> >>>> The patch adds a configurable property for VirtIO SCSI with a new default
> >>>> and a hardcoded value for VirtIO block, which does not provide a good
> >>>> configuration framework.
> >>>>
> >>>> Unfortunately the commit cannot be applied as is. For the real cure we
> >>>> need the guest to be fixed to accommodate that queue length, which is
> >>>> done only in the latest 4.14 kernel. Thus we are going to expose the
> >>>> property and tweak it at the machine type level.
> >>>>
> >>>> The problem with the old kernels is that they have a
> >>>> max_segments <= virtqueue_size restriction, violation of which causes
> >>>> the guest to crash.
> >>> This isn't just in the guests: the virtio spec also seems to imply this,
> >>> or at least is vague on this point.
> >>>
> >>> So I think it'll need a feature bit.
> >>> Doing that in a safe way will also allow being compatible with old guests.
> >>>
> >>> The only downside is it's a bit more work as we need to
> >>> spec this out and add guest support.
> >>>
> >>>> To fix the case described above for the old kernels we can increase
> >>>> virtqueue_size to 256 and max_segments to 254. The pitfall here is
> >>>> that seabios limits the virtqueue size to 128; however, the seabios
> >>>> patch extending that limit to 256 is pending.
> >>> And the fix here is just to limit the large vq size to virtio 1.0 (modern) mode.
> >>> In that mode it's fine I think:
> >>>
> >>>
> >>>     /* check if the queue is available */
> >>>     if (vp->use_modern) {
> >>>         num = vp_read(&vp->common, virtio_pci_common_cfg, queue_size);
> >>>         if (num > MAX_QUEUE_NUM) {
> >>>             vp_write(&vp->common, virtio_pci_common_cfg, queue_size,
> >>>                      MAX_QUEUE_NUM);
> >>>             num = vp_read(&vp->common, virtio_pci_common_cfg, queue_size);
> >>>         }
> >>>     } else {
> >>>         num = vp_read(&vp->legacy, virtio_pci_legacy, queue_num);
> >>>     }
> The same seabios snippet, but in more detail:
> 
> vp_find_vq()
> {
>     ...
>     /* check if the queue is available */
>     if (vp->use_modern) {
>         num = vp_read(&vp->common, virtio_pci_common_cfg, queue_size);
>         if (num > MAX_QUEUE_NUM) {
>             vp_write(&vp->common, virtio_pci_common_cfg, queue_size,
>                      MAX_QUEUE_NUM);
>             num = vp_read(&vp->common, virtio_pci_common_cfg, queue_size);

So how about we drop this last re-read in the BIOS?

That will fix things for existing hypervisors;
the spec does not say guests need to re-read the value.
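
Something like this, as a sketch (untested; note the local value then has to
be clamped too, or the num > MAX_QUEUE_NUM check further down would still
trip):

        if (num > MAX_QUEUE_NUM) {
            vp_write(&vp->common, virtio_pci_common_cfg, queue_size,
                     MAX_QUEUE_NUM);
            /* no re-read: trust our own write and clamp the local copy */
            num = MAX_QUEUE_NUM;
        }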

>         }
>     } else {
>         num = vp_read(&vp->legacy, virtio_pci_legacy, queue_num);
>     }
>     if (!num) {
>         dprintf(1, "ERROR: queue size is 0\n");
>         goto fail;
>     }
>     if (num > MAX_QUEUE_NUM) {
>         dprintf(1, "ERROR: queue size %d > %d\n", num, MAX_QUEUE_NUM);
>         goto fail;
>     }
> ...
> }
> 
> It turned out that the problem is here, but not because of the seabios code:
> the virtqueue size is written and then an incorrect value is re-read.
> Thanks to Roman Kagan (rkagan@virtuozzo.com) for investigating the root
> cause of the problem.
> 
> As the code shows, for modern devices seabios reads the queue size and,
> if it is greater than seabios can support, reduces it to the maximum
> value seabios supports.
> 
> This doesn't work.
> 
> The reason is that the size is read from the virtio device,
> 
> virtio_pci_common_read()
> {
>      ...
>      case VIRTIO_PCI_COMMON_Q_SIZE:
>          val = virtio_queue_get_num(vdev, vdev->queue_sel);
>          break;
>      ...
> }
> 
> but is written to the proxy
> 
> virtio_pci_common_write()
> {
>      ...
>      case VIRTIO_PCI_COMMON_Q_SIZE:
>          proxy->vqs[vdev->queue_sel].num = val;
>          break;
>     ...
> }.

Yeah, that's a bug. Here's a hacky way to fix it,
but I think we should really just get rid of the
two copies down the road.


diff --git a/hw/virtio/virtio-pci.c b/hw/virtio/virtio-pci.c
index c6b47a9c73..e5c759e19e 100644
--- a/hw/virtio/virtio-pci.c
+++ b/hw/virtio/virtio-pci.c
@@ -1256,6 +1256,8 @@ static void virtio_pci_common_write(void *opaque, hwaddr addr,
         break;
     case VIRTIO_PCI_COMMON_Q_SIZE:
         proxy->vqs[vdev->queue_sel].num = val;
+        virtio_queue_set_num(vdev, vdev->queue_sel,
+                             proxy->vqs[vdev->queue_sel].num);
         break;
     case VIRTIO_PCI_COMMON_Q_MSIX:
         msix_vector_unuse(&proxy->pci_dev,

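An alternative (equally untested) sketch would be to make the read side use
the proxy copy as well, so the guest always reads back the value it wrote;
the device copy would still be synced at Q_ENABLE time, as in the code
quoted below:

    case VIRTIO_PCI_COMMON_Q_SIZE:
        val = proxy->vqs[vdev->queue_sel].num;
        break;
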

> In the final stage the size is propagated from the proxy to the device
> when the virtqueue is enabled:
> 
> virtio_pci_common_write()
> {
>      ...
>      case VIRTIO_PCI_COMMON_Q_ENABLE:
>          virtio_queue_set_num(vdev, vdev->queue_sel,
>                               proxy->vqs[vdev->queue_sel].num);
>          virtio_queue_set_rings(vdev, vdev->queue_sel,
>                     ((uint64_t)proxy->vqs[vdev->queue_sel].desc[1]) << 32 |
>                     proxy->vqs[vdev->queue_sel].desc[0],
>                     ((uint64_t)proxy->vqs[vdev->queue_sel].avail[1]) << 32 |
>                     proxy->vqs[vdev->queue_sel].avail[0],
>                     ((uint64_t)proxy->vqs[vdev->queue_sel].used[1]) << 32 |
>                     proxy->vqs[vdev->queue_sel].used[0]);
>          proxy->vqs[vdev->queue_sel].enabled = 1;
>          break;
>      ...
> }.
> 
> So we have the following workflow:
> suppose the device has virtqueue size = 256 and seabios has MAX_QUEUE_NUM = 128.
> In that case seabios works like this:
> 
> 1. in the vp->use_modern branch, read the size (256)
> 2. 256 > 128
> 3. write virtqueue size = 128
> 4. re-read virtqueue size = 256 !!!

The BIOS probably should not re-read the size; it's a waste of CPU cycles anyway.


> 5. fail because of the check
>      if (num > MAX_QUEUE_NUM) {
>          dprintf(1, "ERROR: queue size %d > %d\n", num, MAX_QUEUE_NUM);
>          goto fail;
>      }
> 
> To fix the issue, we need to read and write the virtqueue size in the
> same place.
> Should we do it with the proxy?
> Is there any reason to read from the device and write to the proxy?
> 
> Furthermore, the size setting has a few flaws:
> 
> 1. The size being set should be a power of 2
> 2. The size being set should be less than or equal to the virtqueue size
> (and be greater than 2?)

I think 1 is checked in virtio_queue_set_num.
I guess we should check 2 as well?
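
For reference, a standalone sketch of the kind of check being discussed
(a hypothetical helper, not the actual QEMU code; dev_max stands for the
size the virtqueue was created with):

    #include <stdbool.h>

    /* Accept a guest-written queue size only if it is a power of two,
     * at least 2, and no larger than the device's own maximum. */
    static bool vq_size_valid(unsigned int num, unsigned int dev_max)
    {
        return num >= 2 && num <= dev_max && (num & (num - 1)) == 0;
    }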


> 
> Denis
> >> you mean to put code like this into virtio_pci_realize() inside QEMU?
> >>
> >> If not, can you please clarify which component should be touched.
> >>
> >> Den
> > I mean:
> >   - add an API to change the default queue size
> >   - add a validate-features callback; in there, check for the modern
> >     flag set in the features and increase the queue size
> >
> > maybe all this is too much work; we could block this
> > for transitional devices, but your patch does not do that:
> > you need to check that legacy is enabled, not that modern
> > is not disabled.
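
Going back to that validate-features direction, the rough shape might be
something like the sketch below (not a tested patch: the callback and helper
names are from memory, and the queue count and new size are placeholder
values that would really come from the device/machine-type configuration):

    /* Hypothetical sketch: grow the queues only once a modern guest has
     * negotiated VIRTIO_F_VERSION_1. */
    static int virtio_scsi_validate_features(VirtIODevice *vdev)
    {
        if (virtio_vdev_has_feature(vdev, VIRTIO_F_VERSION_1)) {
            int i;

            for (i = 0; i < 3 /* placeholder queue count */; i++) {
                virtio_queue_set_num(vdev, i, 256 /* placeholder size */);
            }
        }
        return 0;
    }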