All of lore.kernel.org
 help / color / mirror / Atom feed
From: Christian Schoenebeck <qemu_oss@crudebyte.com>
To: qemu-devel@nongnu.org
Cc: "Kevin Wolf" <kwolf@redhat.com>,
	"Laurent Vivier" <lvivier@redhat.com>,
	qemu-block@nongnu.org, "Michael S. Tsirkin" <mst@redhat.com>,
	"Jason Wang" <jasowang@redhat.com>, "Amit Shah" <amit@kernel.org>,
	"David Hildenbrand" <david@redhat.com>,
	"Raphael Norwitz" <raphael.norwitz@nutanix.com>,
	virtio-fs@redhat.com, "Eric Auger" <eric.auger@redhat.com>,
	"Hanna Reitz" <hreitz@redhat.com>,
	"Gonglei (Arei)" <arei.gonglei@huawei.com>,
	"Gerd Hoffmann" <kraxel@redhat.com>,
	"Marc-André Lureau" <marcandre.lureau@redhat.com>,
	"Paolo Bonzini" <pbonzini@redhat.com>,
	"Fam Zheng" <fam@euphon.net>
Subject: Re: [Virtio-fs] [PATCH v2 1/3] virtio: turn VIRTQUEUE_MAX_SIZE into a variable
Date: Fri, 08 Oct 2021 16:48:33 +0200	[thread overview]
Message-ID: <1903319.Sgii2z2lGP@silver> (raw)
In-Reply-To: <YV8Pqwap4oR8fpvc@stefanha-x1.localdomain>

On Donnerstag, 7. Oktober 2021 17:18:03 CEST Stefan Hajnoczi wrote:
> On Thu, Oct 07, 2021 at 03:09:16PM +0200, Christian Schoenebeck wrote:
> > On Mittwoch, 6. Oktober 2021 16:42:34 CEST Stefan Hajnoczi wrote:
> > > On Wed, Oct 06, 2021 at 02:50:07PM +0200, Christian Schoenebeck wrote:
> > > > On Mittwoch, 6. Oktober 2021 13:06:55 CEST Stefan Hajnoczi wrote:
> > > > > On Tue, Oct 05, 2021 at 06:32:46PM +0200, Christian Schoenebeck 
wrote:
> > > > > > On Dienstag, 5. Oktober 2021 17:10:40 CEST Stefan Hajnoczi wrote:
> > > > > > > On Tue, Oct 05, 2021 at 03:15:26PM +0200, Christian Schoenebeck
> > 
> > wrote:
> > > > > > > > On Dienstag, 5. Oktober 2021 14:45:56 CEST Stefan Hajnoczi 
wrote:
> > > > > > > > > On Mon, Oct 04, 2021 at 09:38:04PM +0200, Christian
> > > > > > > > > Schoenebeck
> > > > 
> > > > wrote:
> > > > > > > > > > Refactor VIRTQUEUE_MAX_SIZE to effectively become a
> > > > > > > > > > runtime
> > > > > > > > > > variable per virtio user.
> > > > > > > > > 
> > > > > > > > > virtio user == virtio device model?
> > > > > > > > 
> > > > > > > > Yes
> > > > > > > > 
> > > > > > > > > > Reasons:
> > > > > > > > > > 
> > > > > > > > > > (1) VIRTQUEUE_MAX_SIZE should reflect the absolute
> > > > > > > > > > theoretical
> > > > > > > > > > 
> > > > > > > > > >     maximum queue size possible. Which is actually the
> > > > > > > > > >     maximum
> > > > > > > > > >     queue size allowed by the virtio protocol. The
> > > > > > > > > >     appropriate
> > > > > > > > > >     value for VIRTQUEUE_MAX_SIZE would therefore be 32768:
> > > > > > > > > >     
> > > > > > > > > >     https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/vi
> > > > > > > > > >     rtio
> > > > > > > > > >     -v1.
> > > > > > > > > >     1-cs
> > > > > > > > > >     01.h
> > > > > > > > > >     tml#x1-240006
> > > > > > > > > >     
> > > > > > > > > >     Apparently VIRTQUEUE_MAX_SIZE was instead defined with
> > > > > > > > > >     a
> > > > > > > > > >     more or less arbitrary value of 1024 in the past,
> > > > > > > > > >     which
> > > > > > > > > >     limits the maximum transfer size with virtio to 4M
> > > > > > > > > >     (more precise: 1024 * PAGE_SIZE, with the latter
> > > > > > > > > >     typically
> > > > > > > > > >     being 4k).
> > > > > > > > > 
> > > > > > > > > Being equal to IOV_MAX is a likely reason. Buffers with more
> > > > > > > > > iovecs
> > > > > > > > > than
> > > > > > > > > that cannot be passed to host system calls (sendmsg(2),
> > > > > > > > > pwritev(2),
> > > > > > > > > etc).
> > > > > > > > 
> > > > > > > > Yes, that's use case dependent. Hence the solution to opt-in
> > > > > > > > if it
> > > > > > > > is
> > > > > > > > desired and feasible.
> > > > > > > > 
> > > > > > > > > > (2) Additionally the current value of 1024 poses a hidden
> > > > > > > > > > limit,
> > > > > > > > > > 
> > > > > > > > > >     invisible to guest, which causes a system hang with
> > > > > > > > > >     the
> > > > > > > > > >     following QEMU error if guest tries to exceed it:
> > > > > > > > > >     
> > > > > > > > > >     virtio: too many write descriptors in indirect table
> > > > > > > > > 
> > > > > > > > > I don't understand this point. 2.6.5 The Virtqueue
> > > > > > > > > Descriptor
> > > > > > > > > Table
> > > > > > 
> > > > > > says:
> > > > > > > > >   The number of descriptors in the table is defined by the
> > > > > > > > >   queue
> > > > > > > > >   size
> > > > > > > > >   for
> > > > > > > > > 
> > > > > > > > > this virtqueue: this is the maximum possible descriptor
> > > > > > > > > chain
> > > > > > > > > length.
> > > > > > > > > 
> > > > > > > > > and 2.6.5.3.1 Driver Requirements: Indirect Descriptors 
says:
> > > > > > > > >   A driver MUST NOT create a descriptor chain longer than
> > > > > > > > >   the
> > > > > > > > >   Queue
> > > > > > > > >   Size
> > > > > > > > >   of
> > > > > > > > > 
> > > > > > > > > the device.
> > > > > > > > > 
> > > > > > > > > Do you mean a broken/malicious guest driver that is
> > > > > > > > > violating
> > > > > > > > > the
> > > > > > > > > spec?
> > > > > > > > > That's not a hidden limit, it's defined by the spec.
> > > > > > > > 
> > > > > > > > https://lists.gnu.org/archive/html/qemu-devel/2021-10/msg00781
> > > > > > > > .htm
> > > > > > > > l
> > > > > > > > https://lists.gnu.org/archive/html/qemu-devel/2021-10/msg00788
> > > > > > > > .htm
> > > > > > > > l
> > > > > > > > 
> > > > > > > > You can already go beyond that queue size at runtime with the
> > > > > > > > indirection
> > > > > > > > table. The only actual limit is the currently hard coded value
> > > > > > > > of
> > > > > > > > 1k
> > > > > > > > pages.
> > > > > > > > Hence the suggestion to turn that into a variable.
> > > > > > > 
> > > > > > > Exceeding Queue Size is a VIRTIO spec violation. Drivers that
> > > > > > > operate
> > > > > > > outsided the spec do so at their own risk. They may not be
> > > > > > > compatible
> > > > > > > with all device implementations.
> > > > > > 
> > > > > > Yes, I am ware about that. And still, this practice is already
> > > > > > done,
> > > > > > which
> > > > > > apparently is not limited to 9pfs.
> > > > > > 
> > > > > > > The limit is not hidden, it's Queue Size as defined by the spec
> > > > > > > :).
> > > > > > > 
> > > > > > > If you have a driver that is exceeding the limit, then please
> > > > > > > fix
> > > > > > > the
> > > > > > > driver.
> > > > > > 
> > > > > > I absolutely understand your position, but I hope you also
> > > > > > understand
> > > > > > that
> > > > > > this violation of the specs is a theoretical issue, it is not a
> > > > > > real-life
> > > > > > problem right now, and due to lack of man power unfortunately I
> > > > > > have
> > > > > > to
> > > > > > prioritize real-life problems over theoretical ones ATM. Keep in
> > > > > > mind
> > > > > > that
> > > > > > right now I am the only person working on 9pfs actively, I do this
> > > > > > voluntarily whenever I find a free time slice, and I am not paid
> > > > > > for
> > > > > > it
> > > > > > either.
> > > > > > 
> > > > > > I don't see any reasonable way with reasonable effort to do what
> > > > > > you
> > > > > > are
> > > > > > asking for here in 9pfs, and Greg may correct me here if I am
> > > > > > saying
> > > > > > anything wrong. If you are seeing any specific real-life issue
> > > > > > here,
> > > > > > then
> > > > > > please tell me which one, otherwise I have to postpone that "specs
> > > > > > violation" issue.
> > > > > > 
> > > > > > There is still a long list of real problems that I need to hunt
> > > > > > down
> > > > > > in
> > > > > > 9pfs, afterwards I can continue with theoretical ones if you want,
> > > > > > but
> > > > > > right now I simply can't, sorry.
> > > > > 
> > > > > I understand. If you don't have time to fix the Linux virtio-9p
> > > > > driver
> > > > > then that's fine.
> > > > 
> > > > I will look at this again, but it might be tricky. On doubt I'll
> > > > postpone
> > > > it.>
> > > > 
> > > > > I still wanted us to agree on the spec position because the commit
> > > > > description says it's a "hidden limit", which is incorrect. It might
> > > > > seem pedantic, but my concern is that misconceptions can spread if
> > > > > we
> > > > > let them. That could cause people to write incorrect code later on.
> > > > > Please update the commit description either by dropping 2) or by
> > > > > 
> > > > > replacing it with something else. For example:
> > > > >   2) The Linux virtio-9p guest driver does not honor the VIRTIO
> > > > >   Queue
> > > > >   
> > > > >      Size value and can submit descriptor chains that exceed it.
> > > > >      That is
> > > > >      a spec violation but is accepted by QEMU's device
> > > > >      implementation.
> > > > >      
> > > > >      When the guest creates a descriptor chain larger than 1024 the
> > > > >      following QEMU error is printed and the guest hangs:
> > > > >      
> > > > >      virtio: too many write descriptors in indirect table
> > > > 
> > > > I am fine with both, probably preferring the text block above instead
> > > > of
> > > > silently dropping the reason, just for clarity.
> > > > 
> > > > But keep in mind that this might not be limited to virtio-9p as your
> > > > text
> > > > would suggest, see below.
> > > > 
> > > > > > > > > > (3) Unfortunately not all virtio users in QEMU would
> > > > > > > > > > currently
> > > > > > > > > > 
> > > > > > > > > >     work correctly with the new value of 32768.
> > > > > > > > > > 
> > > > > > > > > > So let's turn this hard coded global value into a runtime
> > > > > > > > > > variable as a first step in this commit, configurable for
> > > > > > > > > > each
> > > > > > > > > > virtio user by passing a corresponding value with
> > > > > > > > > > virtio_init()
> > > > > > > > > > call.
> > > > > > > > > 
> > > > > > > > > virtio_add_queue() already has an int queue_size argument,
> > > > > > > > > why
> > > > > > > > > isn't
> > > > > > > > > that enough to deal with the maximum queue size? There's
> > > > > > > > > probably a
> > > > > > > > > good
> > > > > > > > > reason for it, but please include it in the commit
> > > > > > > > > description.
> > > > > > > > 
> > > > > > > > [...]
> > > > > > > > 
> > > > > > > > > Can you make this value per-vq instead of per-vdev since
> > > > > > > > > virtqueues
> > > > > > > > > can
> > > > > > > > > have different queue sizes?
> > > > > > > > > 
> > > > > > > > > The same applies to the rest of this patch. Anything using
> > > > > > > > > vdev->queue_max_size should probably use vq->vring.num
> > > > > > > > > instead.
> > > > > > > > 
> > > > > > > > I would like to avoid that and keep it per device. The maximum
> > > > > > > > size
> > > > > > > > stored
> > > > > > > > there is the maximum size supported by virtio user (or vortio
> > > > > > > > device
> > > > > > > > model,
> > > > > > > > however you want to call it). So that's really a limit per
> > > > > > > > device,
> > > > > > > > not
> > > > > > > > per
> > > > > > > > queue, as no queue of the device would ever exceed that limit.
> > > > > > > > 
> > > > > > > > Plus a lot more code would need to be refactored, which I
> > > > > > > > think is
> > > > > > > > unnecessary.
> > > > > > > 
> > > > > > > I'm against a per-device limit because it's a concept that
> > > > > > > cannot
> > > > > > > accurately describe reality. Some devices have multiple classes
> > > > > > > of
> > > > > > 
> > > > > > It describes current reality, because VIRTQUEUE_MAX_SIZE obviously
> > > > > > is
> > > > > > not
> > > > > > per queue either ATM, and nobody ever cared.
> > > > > > 
> > > > > > All this series does, is allowing to override that currently
> > > > > > project-wide
> > > > > > compile-time constant to a per-driver-model compile-time constant.
> > > > > > Which
> > > > > > makes sense, because that's what it is: some drivers could cope
> > > > > > with
> > > > > > any
> > > > > > transfer size, and some drivers are constrained to a certain
> > > > > > maximum
> > > > > > application specific transfer size (e.g. IOV_MAX).
> > > > > > 
> > > > > > > virtqueues and they are sized differently, so a per-device limit
> > > > > > > is
> > > > > > > insufficient. virtio-net has separate rx_queue_size and
> > > > > > > tx_queue_size
> > > > > > > parameters (plus a control vq hardcoded to 64 descriptors).
> > > > > > 
> > > > > > I simply find this overkill. This value semantically means "my
> > > > > > driver
> > > > > > model
> > > > > > supports at any time and at any coincidence at the very most x *
> > > > > > PAGE_SIZE
> > > > > > = max_transfer_size". Do you see any driver that might want a more
> > > > > > fine
> > > > > > graded control over this?
> > > > > 
> > > > > One reason why per-vq limits could make sense is that the maximum
> > > > > possible number of struct elements is allocated upfront in some code
> > > > > paths. Those code paths may need to differentiate between per-vq
> > > > > limits
> > > > > for performance or memory utilization reasons. Today some places
> > > > > allocate 1024 elements on the stack in some code paths, but maybe
> > > > > that's
> > > > > not acceptable when the per-device limit is 32k. This can matter
> > > > > when a
> > > > > device has vqs with very different sizes.
> > > > 
> > > > [...]
> > > > 
> > > > > > ... I leave that up to Michael or whoever might be in charge to
> > > > > > decide. I
> > > > > > still find this overkill, but I will adapt this to whatever the
> > > > > > decision
> > > > > > eventually will be in v3.
> > > > > > 
> > > > > > But then please tell me the precise representation that you find
> > > > > > appropriate, i.e. whether you want a new function for that, or
> > > > > > rather
> > > > > > an
> > > > > > additional argument to virtio_add_queue(). Your call.
> > > > > 
> > > > > virtio_add_queue() already takes an int queue_size argument. I think
> > > > > the
> > > > > necessary information is already there.
> > > > > 
> > > > > This patch just needs to be tweaked to use the
> > > > > virtio_queue_get_num()
> > > > > (or a new virtqueue_get_num() API if that's easier because only a
> > > > > VirtQueue *vq pointer is available) instead of introducing a new
> > > > > per-device limit.
> > > > 
> > > > My understanding is that both the original 9p virtio device authors,
> > > > as
> > > > well as other virtio device authors in QEMU have been and are still
> > > > using
> > > > this as a default value (i.e. to allocate some upfront, and the rest
> > > > on
> > > > demand).
> > > > 
> > > > So yes, I know your argument about the specs, but AFAICS if I would
> > > > just
> > > > take this existing numeric argument for the limit, then it would
> > > > probably
> > > > break those other QEMU devices as well.
> > > 
> > > This is a good point that I didn't consider. If guest drivers currently
> > > violate the spec, then restricting descriptor chain length to vring.num
> > > will introduce regressions.
> > > 
> > > We can't use virtio_queue_get_num() directly. A backwards-compatible
> > > 
> > > limit is required:
> > >   int virtio_queue_get_desc_chain_max(VirtIODevice *vdev, int n)
> > >   {
> > >   
> > >       /*
> > >       
> > >        * QEMU historically allowed 1024 descriptors even if the
> > >        * descriptor table was smaller.
> > >        */
> > >       
> > >       return MAX(virtio_queue_get_num(vdev, qidx), 1024);
> > >   
> > >   }
> > 
> > That was an alternative that I thought about as well, but decided against.
> > It would require devices (that would want to support large transmissions
> > sizes)> 
> > to create the virtio queue(s) with the maximum possible size, i.e:
> >   virtio_add_queue(32k);
> 
> The spec allows drivers to set the size of the vring as long as they do
> not exceed Queue Size.
> 
> The Linux drivers accept the device's default size, so you're right that
> this would cause large vrings to be allocated if the device sets the
> virtqueue size to 32k.
> 
> > And that's the point where my current lack of knowledge, of what this
> > would
> > precisely mean to the resulting allocation set, decided against it. I mean
> > would that mean would QEMU's virtio implementation would just a) allocate
> > 32k scatter gather list entries? Or would it rather b) additionally also
> > allocate the destination memory pages as well?
> 
> The vring consumes guest RAM but it just consists of descriptors, not
> the buffer memory pages. The guest RAM requirements are:
> - split layout: 32k * 16 + 6 + 32k * 2 + 6 + 8 * 32k = 851,980 bytes
> - packed layout: 32k * 16 + 4 + 4 = 524,296 bytes
> 
> That's still quite large!
> 
> By the way, virtio-blk currently uses a virtqueue size of 256
> descriptors and this has been found reasonable for disk I/O performance.
> The Linux block layer splits requests at around 1.25 MB for virtio-blk.
> The virtio-blk queue limits are reported by the device and the guest
> Linux block layer uses them to size/split requests appropriately. I'm
> not sure 9p really needs 32k, although you're right that fragmented
> physical memory requires 32k descriptors to describe 128 MB of buffers.
> 
> Going back to the original problem, a vring feature bit could be added
> to the VIRTIO specification indicating that indirect descriptor tables
> are limited to the maximum (32k) instead of Queue Size. This way the
> device's default vring size could be small but drivers could allocate
> indirect descriptor tables that are large when necessary. Then the Linux
> virtio driver API would need to report the maximum supported sglist
> length for a given virtqueue so drivers can take advantage of this
> information.

Due to forced pragmatism, ~1M unused/wasted space would be acceptable IMHO.

But if that might really go up to 128M+ as you said if physical RAM is highly 
fragmented, then a cleaner solution would definitely make sense, yes, if 
that's possible.

But as changing specs etc. is probably a long process, it would make sense 
first doing some more tests with the kernel patches to find out whether there 
was probably some show stopper like IOV_MAX anyway.

As for whether large transfer sizes make sense at all: well, from the 
benchmarks I made so far I "think" it does make sense going >4M. It might be 
something specific to 9p (its a full file server), I guess it has a higher 
latency than raw virtio block devices. OTOH with M.2 SSDs we now have several 
thousand MB/s, so not sure if the old common transfer size of 1M for block 
devices is still reasonable today.

Best regards,
Christian Schoenebeck



WARNING: multiple messages have this Message-ID (diff)
From: Christian Schoenebeck <qemu_oss@crudebyte.com>
To: qemu-devel@nongnu.org
Cc: "Stefan Hajnoczi" <stefanha@redhat.com>,
	"Kevin Wolf" <kwolf@redhat.com>,
	"Laurent Vivier" <lvivier@redhat.com>,
	qemu-block@nongnu.org, "Michael S. Tsirkin" <mst@redhat.com>,
	"Jason Wang" <jasowang@redhat.com>, "Amit Shah" <amit@kernel.org>,
	"David Hildenbrand" <david@redhat.com>,
	"Greg Kurz" <groug@kaod.org>,
	virtio-fs@redhat.com, "Eric Auger" <eric.auger@redhat.com>,
	"Hanna Reitz" <hreitz@redhat.com>,
	"Gonglei (Arei)" <arei.gonglei@huawei.com>,
	"Gerd Hoffmann" <kraxel@redhat.com>,
	"Paolo Bonzini" <pbonzini@redhat.com>,
	"Marc-André Lureau" <marcandre.lureau@redhat.com>,
	"Fam Zheng" <fam@euphon.net>,
	"Raphael Norwitz" <raphael.norwitz@nutanix.com>,
	"Dr. David Alan Gilbert" <dgilbert@redhat.com>
Subject: Re: [PATCH v2 1/3] virtio: turn VIRTQUEUE_MAX_SIZE into a variable
Date: Fri, 08 Oct 2021 16:48:33 +0200	[thread overview]
Message-ID: <1903319.Sgii2z2lGP@silver> (raw)
In-Reply-To: <YV8Pqwap4oR8fpvc@stefanha-x1.localdomain>

On Donnerstag, 7. Oktober 2021 17:18:03 CEST Stefan Hajnoczi wrote:
> On Thu, Oct 07, 2021 at 03:09:16PM +0200, Christian Schoenebeck wrote:
> > On Mittwoch, 6. Oktober 2021 16:42:34 CEST Stefan Hajnoczi wrote:
> > > On Wed, Oct 06, 2021 at 02:50:07PM +0200, Christian Schoenebeck wrote:
> > > > On Mittwoch, 6. Oktober 2021 13:06:55 CEST Stefan Hajnoczi wrote:
> > > > > On Tue, Oct 05, 2021 at 06:32:46PM +0200, Christian Schoenebeck 
wrote:
> > > > > > On Dienstag, 5. Oktober 2021 17:10:40 CEST Stefan Hajnoczi wrote:
> > > > > > > On Tue, Oct 05, 2021 at 03:15:26PM +0200, Christian Schoenebeck
> > 
> > wrote:
> > > > > > > > On Dienstag, 5. Oktober 2021 14:45:56 CEST Stefan Hajnoczi 
wrote:
> > > > > > > > > On Mon, Oct 04, 2021 at 09:38:04PM +0200, Christian
> > > > > > > > > Schoenebeck
> > > > 
> > > > wrote:
> > > > > > > > > > Refactor VIRTQUEUE_MAX_SIZE to effectively become a
> > > > > > > > > > runtime
> > > > > > > > > > variable per virtio user.
> > > > > > > > > 
> > > > > > > > > virtio user == virtio device model?
> > > > > > > > 
> > > > > > > > Yes
> > > > > > > > 
> > > > > > > > > > Reasons:
> > > > > > > > > > 
> > > > > > > > > > (1) VIRTQUEUE_MAX_SIZE should reflect the absolute
> > > > > > > > > > theoretical
> > > > > > > > > > 
> > > > > > > > > >     maximum queue size possible. Which is actually the
> > > > > > > > > >     maximum
> > > > > > > > > >     queue size allowed by the virtio protocol. The
> > > > > > > > > >     appropriate
> > > > > > > > > >     value for VIRTQUEUE_MAX_SIZE would therefore be 32768:
> > > > > > > > > >     
> > > > > > > > > >     https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/vi
> > > > > > > > > >     rtio
> > > > > > > > > >     -v1.
> > > > > > > > > >     1-cs
> > > > > > > > > >     01.h
> > > > > > > > > >     tml#x1-240006
> > > > > > > > > >     
> > > > > > > > > >     Apparently VIRTQUEUE_MAX_SIZE was instead defined with
> > > > > > > > > >     a
> > > > > > > > > >     more or less arbitrary value of 1024 in the past,
> > > > > > > > > >     which
> > > > > > > > > >     limits the maximum transfer size with virtio to 4M
> > > > > > > > > >     (more precise: 1024 * PAGE_SIZE, with the latter
> > > > > > > > > >     typically
> > > > > > > > > >     being 4k).
> > > > > > > > > 
> > > > > > > > > Being equal to IOV_MAX is a likely reason. Buffers with more
> > > > > > > > > iovecs
> > > > > > > > > than
> > > > > > > > > that cannot be passed to host system calls (sendmsg(2),
> > > > > > > > > pwritev(2),
> > > > > > > > > etc).
> > > > > > > > 
> > > > > > > > Yes, that's use case dependent. Hence the solution to opt-in
> > > > > > > > if it
> > > > > > > > is
> > > > > > > > desired and feasible.
> > > > > > > > 
> > > > > > > > > > (2) Additionally the current value of 1024 poses a hidden
> > > > > > > > > > limit,
> > > > > > > > > > 
> > > > > > > > > >     invisible to guest, which causes a system hang with
> > > > > > > > > >     the
> > > > > > > > > >     following QEMU error if guest tries to exceed it:
> > > > > > > > > >     
> > > > > > > > > >     virtio: too many write descriptors in indirect table
> > > > > > > > > 
> > > > > > > > > I don't understand this point. 2.6.5 The Virtqueue
> > > > > > > > > Descriptor
> > > > > > > > > Table
> > > > > > 
> > > > > > says:
> > > > > > > > >   The number of descriptors in the table is defined by the
> > > > > > > > >   queue
> > > > > > > > >   size
> > > > > > > > >   for
> > > > > > > > > 
> > > > > > > > > this virtqueue: this is the maximum possible descriptor
> > > > > > > > > chain
> > > > > > > > > length.
> > > > > > > > > 
> > > > > > > > > and 2.6.5.3.1 Driver Requirements: Indirect Descriptors 
says:
> > > > > > > > >   A driver MUST NOT create a descriptor chain longer than
> > > > > > > > >   the
> > > > > > > > >   Queue
> > > > > > > > >   Size
> > > > > > > > >   of
> > > > > > > > > 
> > > > > > > > > the device.
> > > > > > > > > 
> > > > > > > > > Do you mean a broken/malicious guest driver that is
> > > > > > > > > violating
> > > > > > > > > the
> > > > > > > > > spec?
> > > > > > > > > That's not a hidden limit, it's defined by the spec.
> > > > > > > > 
> > > > > > > > https://lists.gnu.org/archive/html/qemu-devel/2021-10/msg00781
> > > > > > > > .htm
> > > > > > > > l
> > > > > > > > https://lists.gnu.org/archive/html/qemu-devel/2021-10/msg00788
> > > > > > > > .htm
> > > > > > > > l
> > > > > > > > 
> > > > > > > > You can already go beyond that queue size at runtime with the
> > > > > > > > indirection
> > > > > > > > table. The only actual limit is the currently hard coded value
> > > > > > > > of
> > > > > > > > 1k
> > > > > > > > pages.
> > > > > > > > Hence the suggestion to turn that into a variable.
> > > > > > > 
> > > > > > > Exceeding Queue Size is a VIRTIO spec violation. Drivers that
> > > > > > > operate
> > > > > > > outsided the spec do so at their own risk. They may not be
> > > > > > > compatible
> > > > > > > with all device implementations.
> > > > > > 
> > > > > > Yes, I am ware about that. And still, this practice is already
> > > > > > done,
> > > > > > which
> > > > > > apparently is not limited to 9pfs.
> > > > > > 
> > > > > > > The limit is not hidden, it's Queue Size as defined by the spec
> > > > > > > :).
> > > > > > > 
> > > > > > > If you have a driver that is exceeding the limit, then please
> > > > > > > fix
> > > > > > > the
> > > > > > > driver.
> > > > > > 
> > > > > > I absolutely understand your position, but I hope you also
> > > > > > understand
> > > > > > that
> > > > > > this violation of the specs is a theoretical issue, it is not a
> > > > > > real-life
> > > > > > problem right now, and due to lack of man power unfortunately I
> > > > > > have
> > > > > > to
> > > > > > prioritize real-life problems over theoretical ones ATM. Keep in
> > > > > > mind
> > > > > > that
> > > > > > right now I am the only person working on 9pfs actively, I do this
> > > > > > voluntarily whenever I find a free time slice, and I am not paid
> > > > > > for
> > > > > > it
> > > > > > either.
> > > > > > 
> > > > > > I don't see any reasonable way with reasonable effort to do what
> > > > > > you
> > > > > > are
> > > > > > asking for here in 9pfs, and Greg may correct me here if I am
> > > > > > saying
> > > > > > anything wrong. If you are seeing any specific real-life issue
> > > > > > here,
> > > > > > then
> > > > > > please tell me which one, otherwise I have to postpone that "specs
> > > > > > violation" issue.
> > > > > > 
> > > > > > There is still a long list of real problems that I need to hunt
> > > > > > down
> > > > > > in
> > > > > > 9pfs, afterwards I can continue with theoretical ones if you want,
> > > > > > but
> > > > > > right now I simply can't, sorry.
> > > > > 
> > > > > I understand. If you don't have time to fix the Linux virtio-9p
> > > > > driver
> > > > > then that's fine.
> > > > 
> > > > I will look at this again, but it might be tricky. On doubt I'll
> > > > postpone
> > > > it.>
> > > > 
> > > > > I still wanted us to agree on the spec position because the commit
> > > > > description says it's a "hidden limit", which is incorrect. It might
> > > > > seem pedantic, but my concern is that misconceptions can spread if
> > > > > we
> > > > > let them. That could cause people to write incorrect code later on.
> > > > > Please update the commit description either by dropping 2) or by
> > > > > 
> > > > > replacing it with something else. For example:
> > > > >   2) The Linux virtio-9p guest driver does not honor the VIRTIO
> > > > >   Queue
> > > > >   
> > > > >      Size value and can submit descriptor chains that exceed it.
> > > > >      That is
> > > > >      a spec violation but is accepted by QEMU's device
> > > > >      implementation.
> > > > >      
> > > > >      When the guest creates a descriptor chain larger than 1024 the
> > > > >      following QEMU error is printed and the guest hangs:
> > > > >      
> > > > >      virtio: too many write descriptors in indirect table
> > > > 
> > > > I am fine with both, probably preferring the text block above instead
> > > > of
> > > > silently dropping the reason, just for clarity.
> > > > 
> > > > But keep in mind that this might not be limited to virtio-9p as your
> > > > text
> > > > would suggest, see below.
> > > > 
> > > > > > > > > > (3) Unfortunately not all virtio users in QEMU would
> > > > > > > > > > currently
> > > > > > > > > > 
> > > > > > > > > >     work correctly with the new value of 32768.
> > > > > > > > > > 
> > > > > > > > > > So let's turn this hard coded global value into a runtime
> > > > > > > > > > variable as a first step in this commit, configurable for
> > > > > > > > > > each
> > > > > > > > > > virtio user by passing a corresponding value with
> > > > > > > > > > virtio_init()
> > > > > > > > > > call.
> > > > > > > > > 
> > > > > > > > > virtio_add_queue() already has an int queue_size argument,
> > > > > > > > > why
> > > > > > > > > isn't
> > > > > > > > > that enough to deal with the maximum queue size? There's
> > > > > > > > > probably a
> > > > > > > > > good
> > > > > > > > > reason for it, but please include it in the commit
> > > > > > > > > description.
> > > > > > > > 
> > > > > > > > [...]
> > > > > > > > 
> > > > > > > > > Can you make this value per-vq instead of per-vdev since
> > > > > > > > > virtqueues
> > > > > > > > > can
> > > > > > > > > have different queue sizes?
> > > > > > > > > 
> > > > > > > > > The same applies to the rest of this patch. Anything using
> > > > > > > > > vdev->queue_max_size should probably use vq->vring.num
> > > > > > > > > instead.
> > > > > > > > 
> > > > > > > > I would like to avoid that and keep it per device. The maximum
> > > > > > > > size
> > > > > > > > stored
> > > > > > > > there is the maximum size supported by virtio user (or vortio
> > > > > > > > device
> > > > > > > > model,
> > > > > > > > however you want to call it). So that's really a limit per
> > > > > > > > device,
> > > > > > > > not
> > > > > > > > per
> > > > > > > > queue, as no queue of the device would ever exceed that limit.
> > > > > > > > 
> > > > > > > > Plus a lot more code would need to be refactored, which I
> > > > > > > > think is
> > > > > > > > unnecessary.
> > > > > > > 
> > > > > > > I'm against a per-device limit because it's a concept that
> > > > > > > cannot
> > > > > > > accurately describe reality. Some devices have multiple classes
> > > > > > > of
> > > > > > 
> > > > > > It describes current reality, because VIRTQUEUE_MAX_SIZE obviously
> > > > > > is
> > > > > > not
> > > > > > per queue either ATM, and nobody ever cared.
> > > > > > 
> > > > > > All this series does, is allowing to override that currently
> > > > > > project-wide
> > > > > > compile-time constant to a per-driver-model compile-time constant.
> > > > > > Which
> > > > > > makes sense, because that's what it is: some drivers could cope
> > > > > > with
> > > > > > any
> > > > > > transfer size, and some drivers are constrained to a certain
> > > > > > maximum
> > > > > > application specific transfer size (e.g. IOV_MAX).
> > > > > > 
> > > > > > > virtqueues and they are sized differently, so a per-device limit
> > > > > > > is
> > > > > > > insufficient. virtio-net has separate rx_queue_size and
> > > > > > > tx_queue_size
> > > > > > > parameters (plus a control vq hardcoded to 64 descriptors).
> > > > > > 
> > > > > > I simply find this overkill. This value semantically means "my
> > > > > > driver
> > > > > > model
> > > > > > supports at any time and at any coincidence at the very most x *
> > > > > > PAGE_SIZE
> > > > > > = max_transfer_size". Do you see any driver that might want a more
> > > > > > fine
> > > > > > graded control over this?
> > > > > 
> > > > > One reason why per-vq limits could make sense is that the maximum
> > > > > possible number of struct elements is allocated upfront in some code
> > > > > paths. Those code paths may need to differentiate between per-vq
> > > > > limits
> > > > > for performance or memory utilization reasons. Today some places
> > > > > allocate 1024 elements on the stack in some code paths, but maybe
> > > > > that's
> > > > > not acceptable when the per-device limit is 32k. This can matter
> > > > > when a
> > > > > device has vqs with very different sizes.
> > > > 
> > > > [...]
> > > > 
> > > > > > ... I leave that up to Michael or whoever might be in charge to
> > > > > > decide. I
> > > > > > still find this overkill, but I will adapt this to whatever the
> > > > > > decision
> > > > > > eventually will be in v3.
> > > > > > 
> > > > > > But then please tell me the precise representation that you find
> > > > > > appropriate, i.e. whether you want a new function for that, or
> > > > > > rather
> > > > > > an
> > > > > > additional argument to virtio_add_queue(). Your call.
> > > > > 
> > > > > virtio_add_queue() already takes an int queue_size argument. I think
> > > > > the
> > > > > necessary information is already there.
> > > > > 
> > > > > This patch just needs to be tweaked to use the
> > > > > virtio_queue_get_num()
> > > > > (or a new virtqueue_get_num() API if that's easier because only a
> > > > > VirtQueue *vq pointer is available) instead of introducing a new
> > > > > per-device limit.
> > > > 
> > > > My understanding is that both the original 9p virtio device authors,
> > > > as
> > > > well as other virtio device authors in QEMU have been and are still
> > > > using
> > > > this as a default value (i.e. to allocate some upfront, and the rest
> > > > on
> > > > demand).
> > > > 
> > > > So yes, I know your argument about the specs, but AFAICS if I would
> > > > just
> > > > take this existing numeric argument for the limit, then it would
> > > > probably
> > > > break those other QEMU devices as well.
> > > 
> > > This is a good point that I didn't consider. If guest drivers currently
> > > violate the spec, then restricting descriptor chain length to vring.num
> > > will introduce regressions.
> > > 
> > > We can't use virtio_queue_get_num() directly. A backwards-compatible
> > > 
> > > limit is required:
> > >   int virtio_queue_get_desc_chain_max(VirtIODevice *vdev, int n)
> > >   {
> > >   
> > >       /*
> > >       
> > >        * QEMU historically allowed 1024 descriptors even if the
> > >        * descriptor table was smaller.
> > >        */
> > >       
> > >       return MAX(virtio_queue_get_num(vdev, qidx), 1024);
> > >   
> > >   }
> > 
> > That was an alternative that I thought about as well, but decided against.
> > It would require devices (that would want to support large transmissions
> > sizes)> 
> > to create the virtio queue(s) with the maximum possible size, i.e:
> >   virtio_add_queue(32k);
> 
> The spec allows drivers to set the size of the vring as long as they do
> not exceed Queue Size.
> 
> The Linux drivers accept the device's default size, so you're right that
> this would cause large vrings to be allocated if the device sets the
> virtqueue size to 32k.
> 
> > And that's the point where my current lack of knowledge, of what this
> > would
> > precisely mean to the resulting allocation set, decided against it. I mean
> > would that mean would QEMU's virtio implementation would just a) allocate
> > 32k scatter gather list entries? Or would it rather b) additionally also
> > allocate the destination memory pages as well?
> 
> The vring consumes guest RAM but it just consists of descriptors, not
> the buffer memory pages. The guest RAM requirements are:
> - split layout: 32k * 16 + 6 + 32k * 2 + 6 + 8 * 32k = 851,980 bytes
> - packed layout: 32k * 16 + 4 + 4 = 524,296 bytes
> 
> That's still quite large!
> 
> By the way, virtio-blk currently uses a virtqueue size of 256
> descriptors and this has been found reasonable for disk I/O performance.
> The Linux block layer splits requests at around 1.25 MB for virtio-blk.
> The virtio-blk queue limits are reported by the device and the guest
> Linux block layer uses them to size/split requests appropriately. I'm
> not sure 9p really needs 32k, although you're right that fragmented
> physical memory requires 32k descriptors to describe 128 MB of buffers.
> 
> Going back to the original problem, a vring feature bit could be added
> to the VIRTIO specification indicating that indirect descriptor tables
> are limited to the maximum (32k) instead of Queue Size. This way the
> device's default vring size could be small but drivers could allocate
> indirect descriptor tables that are large when necessary. Then the Linux
> virtio driver API would need to report the maximum supported sglist
> length for a given virtqueue so drivers can take advantage of this
> information.

Due to forced pragmatism, ~1M unused/wasted space would be acceptable IMHO.

But if that might really go up to 128M+ as you said if physical RAM is highly 
fragmented, then a cleaner solution would definitely make sense, yes, if 
that's possible.

But as changing specs etc. is probably a long process, it would make sense 
first doing some more tests with the kernel patches to find out whether there 
was probably some show stopper like IOV_MAX anyway.

As for whether large transfer sizes make sense at all: well, from the 
benchmarks I made so far I "think" it does make sense going >4M. It might be 
something specific to 9p (its a full file server), I guess it has a higher 
latency than raw virtio block devices. OTOH with M.2 SSDs we now have several 
thousand MB/s, so not sure if the old common transfer size of 1M for block 
devices is still reasonable today.

Best regards,
Christian Schoenebeck




  reply	other threads:[~2021-10-08 14:48 UTC|newest]

Thread overview: 97+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-10-04 19:38 [Virtio-fs] [PATCH v2 0/3] virtio: increase VIRTQUEUE_MAX_SIZE to 32k Christian Schoenebeck
2021-10-04 19:38 ` Christian Schoenebeck
2021-10-04 19:38 ` [Virtio-fs] [PATCH v2 1/3] virtio: turn VIRTQUEUE_MAX_SIZE into a variable Christian Schoenebeck
2021-10-04 19:38   ` Christian Schoenebeck
2021-10-05  7:36   ` [Virtio-fs] " Greg Kurz
2021-10-05  7:36     ` Greg Kurz
2021-10-05 12:45   ` [Virtio-fs] " Stefan Hajnoczi
2021-10-05 12:45     ` Stefan Hajnoczi
2021-10-05 13:15     ` [Virtio-fs] " Christian Schoenebeck
2021-10-05 13:15       ` Christian Schoenebeck
2021-10-05 15:10       ` [Virtio-fs] " Stefan Hajnoczi
2021-10-05 15:10         ` Stefan Hajnoczi
2021-10-05 16:32         ` [Virtio-fs] " Christian Schoenebeck
2021-10-05 16:32           ` Christian Schoenebeck
2021-10-06 11:06           ` [Virtio-fs] " Stefan Hajnoczi
2021-10-06 11:06             ` Stefan Hajnoczi
2021-10-06 12:50             ` [Virtio-fs] " Christian Schoenebeck
2021-10-06 12:50               ` Christian Schoenebeck
2021-10-06 14:42               ` [Virtio-fs] " Stefan Hajnoczi
2021-10-06 14:42                 ` Stefan Hajnoczi
2021-10-07 13:09                 ` [Virtio-fs] " Christian Schoenebeck
2021-10-07 13:09                   ` Christian Schoenebeck
2021-10-07 15:18                   ` [Virtio-fs] " Stefan Hajnoczi
2021-10-07 15:18                     ` Stefan Hajnoczi
2021-10-08 14:48                     ` Christian Schoenebeck [this message]
2021-10-08 14:48                       ` Christian Schoenebeck
2021-10-04 19:38 ` [Virtio-fs] [PATCH v2 2/3] virtio: increase VIRTQUEUE_MAX_SIZE to 32k Christian Schoenebeck
2021-10-04 19:38   ` Christian Schoenebeck
2021-10-05  7:16   ` [Virtio-fs] " Michael S. Tsirkin
2021-10-05  7:16     ` Michael S. Tsirkin
2021-10-05  7:35     ` [Virtio-fs] " Greg Kurz
2021-10-05  7:35       ` Greg Kurz
2021-10-05 11:17     ` [Virtio-fs] " Christian Schoenebeck
2021-10-05 11:17       ` Christian Schoenebeck
2021-10-05 11:24       ` [Virtio-fs] " Michael S. Tsirkin
2021-10-05 11:24         ` Michael S. Tsirkin
2021-10-05 12:01         ` [Virtio-fs] " Christian Schoenebeck
2021-10-05 12:01           ` Christian Schoenebeck
2021-10-04 19:38 ` [Virtio-fs] [PATCH v2 3/3] virtio-9p-device: switch to 32k max. transfer size Christian Schoenebeck
2021-10-04 19:38   ` Christian Schoenebeck
2021-10-05  7:38 ` [Virtio-fs] [PATCH v2 0/3] virtio: increase VIRTQUEUE_MAX_SIZE to 32k David Hildenbrand
2021-10-05  7:38   ` David Hildenbrand
2021-10-05 11:10   ` [Virtio-fs] " Christian Schoenebeck
2021-10-05 11:10     ` Christian Schoenebeck
2021-10-05 11:19     ` [Virtio-fs] " Michael S. Tsirkin
2021-10-05 11:19       ` Michael S. Tsirkin
2021-10-05 11:43       ` [Virtio-fs] " Christian Schoenebeck
2021-10-05 11:43         ` Christian Schoenebeck
2021-10-07  5:23 ` [Virtio-fs] " Stefan Hajnoczi
2021-10-07  5:23   ` Stefan Hajnoczi
2021-10-07 12:51   ` [Virtio-fs] " Christian Schoenebeck
2021-10-07 12:51     ` Christian Schoenebeck
2021-10-07 15:42     ` [Virtio-fs] " Stefan Hajnoczi
2021-10-07 15:42       ` Stefan Hajnoczi
2021-10-08  7:25       ` [Virtio-fs] " Greg Kurz
2021-10-08  7:25         ` Greg Kurz
2021-10-08  8:27         ` [Virtio-fs] " Greg Kurz
2021-10-08 14:24         ` Christian Schoenebeck
2021-10-08 14:24           ` Christian Schoenebeck
2021-10-08 16:08           ` [Virtio-fs] " Christian Schoenebeck
2021-10-08 16:08             ` Christian Schoenebeck
2021-10-21 15:39             ` [Virtio-fs] " Christian Schoenebeck
2021-10-21 15:39               ` Christian Schoenebeck
2021-10-25 10:30               ` [Virtio-fs] " Stefan Hajnoczi
2021-10-25 10:30                 ` Stefan Hajnoczi
2021-10-25 15:03                 ` [Virtio-fs] " Christian Schoenebeck
2021-10-25 15:03                   ` Christian Schoenebeck
2021-10-28  9:00                   ` [Virtio-fs] " Stefan Hajnoczi
2021-10-28  9:00                     ` Stefan Hajnoczi
2021-11-01 20:29                     ` [Virtio-fs] " Christian Schoenebeck
2021-11-01 20:29                       ` Christian Schoenebeck
2021-11-03 11:33                       ` [Virtio-fs] " Stefan Hajnoczi
2021-11-03 11:33                         ` Stefan Hajnoczi
2021-11-04 14:41                         ` [Virtio-fs] " Christian Schoenebeck
2021-11-04 14:41                           ` Christian Schoenebeck
2021-11-09 10:56                           ` [Virtio-fs] " Stefan Hajnoczi
2021-11-09 10:56                             ` Stefan Hajnoczi
2021-11-09 13:09                             ` [Virtio-fs] " Christian Schoenebeck
2021-11-09 13:09                               ` Christian Schoenebeck
2021-11-10 10:05                               ` [Virtio-fs] " Stefan Hajnoczi
2021-11-10 10:05                                 ` Stefan Hajnoczi
2021-11-10 13:14                                 ` [Virtio-fs] " Christian Schoenebeck
2021-11-10 13:14                                   ` Christian Schoenebeck
2021-11-10 15:14                                   ` [Virtio-fs] " Stefan Hajnoczi
2021-11-10 15:14                                     ` Stefan Hajnoczi
2021-11-10 15:53                                     ` [Virtio-fs] " Christian Schoenebeck
2021-11-10 15:53                                       ` Christian Schoenebeck
2021-11-11 16:31                                       ` [Virtio-fs] " Stefan Hajnoczi
2021-11-11 16:31                                         ` Stefan Hajnoczi
2021-11-11 17:54                                         ` [Virtio-fs] " Christian Schoenebeck
2021-11-11 17:54                                           ` Christian Schoenebeck
2021-11-15 11:54                                           ` [Virtio-fs] " Stefan Hajnoczi
2021-11-15 11:54                                             ` Stefan Hajnoczi
2021-11-15 14:32                                             ` [Virtio-fs] " Christian Schoenebeck
2021-11-15 14:32                                               ` Christian Schoenebeck
2021-11-16 11:13                                               ` [Virtio-fs] " Stefan Hajnoczi
2021-11-16 11:13                                                 ` Stefan Hajnoczi

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1903319.Sgii2z2lGP@silver \
    --to=qemu_oss@crudebyte.com \
    --cc=amit@kernel.org \
    --cc=arei.gonglei@huawei.com \
    --cc=david@redhat.com \
    --cc=eric.auger@redhat.com \
    --cc=fam@euphon.net \
    --cc=hreitz@redhat.com \
    --cc=jasowang@redhat.com \
    --cc=kraxel@redhat.com \
    --cc=kwolf@redhat.com \
    --cc=lvivier@redhat.com \
    --cc=marcandre.lureau@redhat.com \
    --cc=mst@redhat.com \
    --cc=pbonzini@redhat.com \
    --cc=qemu-block@nongnu.org \
    --cc=qemu-devel@nongnu.org \
    --cc=raphael.norwitz@nutanix.com \
    --cc=virtio-fs@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.