All of lore.kernel.org
 help / color / mirror / Atom feed
* [virtio-comment] [PATCH] Introduce VIRTIO_F_ISOLATE_INDIRECT_DESC feature
@ 2022-10-13  7:45 Baptiste Afsa
  2023-01-13 12:46 ` Michael S. Tsirkin
       [not found] ` <6380471.4BWXO1n1mU@silver>
  0 siblings, 2 replies; 20+ messages in thread
From: Baptiste Afsa @ 2022-10-13  7:45 UTC (permalink / raw)
  To: virtio-comment; +Cc: Baptiste Afsa

When negotiated, this feature bit serves two purposes:

  - It instructs the driver to allocate every indirect descriptor on its
    own dedicated memory pages. A single memory page must only hold data
    for exactly one indirect descriptor.

  - It allows the host to unmap these pages from the guest address space
    while the indirect descriptor is available to the device.

This feature may cause extra memory consumption on the guest side but is
particularly helpful to support the memory isolation scheme described
below. However, note that this feature is not limited to this specific
use case and other applications may be considered.

The main idea is to keep the memory of the VM that runs the driver
isolated from the memory that runs the device, while still allowing
zero-copy transfers between the two domains. The virtio buffers are
mapped dynamically in the host address space by the hypervisor.

In this model, the virtqueues shared with the device are not the
original virtqueues allocated by the driver but a copy maintained by the
hypervisor. The hypervisor copies the descriptors from the driver
virtqueues to these second virtqueues when the descriptors are made
available to the device, along with mapping the I/O buffers to the
device VM address space.

The use of this second set of virtqueues avoids the device the need to
verify that the buffers are actually accessible to the device since the
driver cannot update these copies.

Dealing with indirect descriptors in this model brings additional issues
because the hypervisor needs to grant the device access to both the
indirect descriptor table and all the I/O buffers pointed to by this
table. However, a compromised guest can modify the table while the
device is using it, which may lead to faults in the device.

This problem could be solved by creating a copy of the indirect
descriptor table as it is done with other descriptors but this approach
requires some sort of dynamic memory allocation in the hypervisor, which
might be problematic depending on the situation.

Using the VIRTIO_F_ISOLATE_INDIRECT_DESC feature allows the hypervisor
to unmap the indirect descriptor table from the guest address space
while the indirect descriptor is on the device side and guarantees that
it will not be modified while the device is using it.
---
 content.tex | 26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

diff --git a/content.tex b/content.tex
index e863709..20e17b7 100644
--- a/content.tex
+++ b/content.tex
@@ -6944,6 +6944,23 @@ \chapter{Reserved Feature Bits}\label{sec:Reserved Feature Bits}
   that the driver can reset a queue individually.
   See \ref{sec:Basic Facilities of a Virtio Device / Virtqueues / Virtqueue Reset}.
 
+  \item[VIRTIO_F_ISOLATE_INDIRECT_DESC(41)] This feature indicates that the
+  device requires the driver to allocate each indirect descriptor on its own
+  dedicated memory pages which MUST NOT hold any other data than this indirect
+  descriptor.
+
+  This allows the host to unmap these pages from the guest address space while
+  the indirect descriptors are available to the device. The device
+  implementation is therefore guaranteed that the driver cannot tampered with an
+  indirect descriptor table while the device is using it.
+
+  This mechanism notably allows to implement a memory isolation scheme where the
+  virtio buffers are shared dynamically between the host and the guest as they
+  are exchanged through the virtqueues. It permits the host to share indirect
+  descriptor tables and all the associated buffers as is, without the need for
+  the device to verify that the buffers referenced in an indirect descriptor
+  table are actually accessible to the device.
+
 \end{description}
 
 \drivernormative{\section}{Reserved Feature Bits}{Reserved Feature Bits}
@@ -6980,6 +6997,12 @@ \chapter{Reserved Feature Bits}\label{sec:Reserved Feature Bits}
 
 A driver SHOULD accept VIRTIO_F_NOTIF_CONFIG_DATA if it is offered.
 
+A driver SHOULD accept VIRTIO_F_ISOLATE_INDIRECT_DESC if it is offered. If
+VIRTIO_F_ISOLATE_INDIRECT_DESC has been negotiated, a driver MUST NOT access the
+memory pages that contain an indirect descriptor after the indirect descriptor
+has been made available to the device and before it is returned as used,
+otherwise the resulting behavior is undefined.
+
 \devicenormative{\section}{Reserved Feature Bits}{Reserved Feature Bits}
 
 A device MUST offer VIRTIO_F_VERSION_1.  A device MAY fail to operate further
@@ -7009,6 +7032,9 @@ \chapter{Reserved Feature Bits}\label{sec:Reserved Feature Bits}
 and presents a PCI SR-IOV capability structure, otherwise
 it MUST NOT offer VIRTIO_F_SR_IOV.
 
+A device MAY fail to operate further if VIRTIO_F_ISOLATE_INDIRECT_DESC is not
+accepted.
+
 \section{Legacy Interface: Reserved Feature Bits}\label{sec:Reserved Feature Bits / Legacy Interface: Reserved Feature Bits}
 
 Transitional devices MAY offer the following:
-- 
2.38.0


This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [virtio-comment] [PATCH] Introduce VIRTIO_F_ISOLATE_INDIRECT_DESC feature
  2022-10-13  7:45 [virtio-comment] [PATCH] Introduce VIRTIO_F_ISOLATE_INDIRECT_DESC feature Baptiste Afsa
@ 2023-01-13 12:46 ` Michael S. Tsirkin
  2023-01-17 15:19   ` Afsa, Baptiste
       [not found] ` <6380471.4BWXO1n1mU@silver>
  1 sibling, 1 reply; 20+ messages in thread
From: Michael S. Tsirkin @ 2023-01-13 12:46 UTC (permalink / raw)
  To: Baptiste Afsa; +Cc: virtio-comment, Eugenio Pérez

On Thu, Oct 13, 2022 at 09:45:13AM +0200, Baptiste Afsa wrote:
> When negotiated, this feature bit serves two purposes:
> 
>   - It instructs the driver to allocate every indirect descriptor on its
>     own dedicated memory pages. A single memory page must only hold data
>     for exactly one indirect descriptor.
> 
>   - It allows the host to unmap these pages from the guest address space
>     while the indirect descriptor is available to the device.
> 
> This feature may cause extra memory consumption on the guest side but is
> particularly helpful to support the memory isolation scheme described
> below. However, note that this feature is not limited to this specific
> use case and other applications may be considered.
> 
> The main idea is to keep the memory of the VM that runs the driver
> isolated from the memory that runs the device, while still allowing
> zero-copy transfers between the two domains. The virtio buffers are
> mapped dynamically in the host address space by the hypervisor.
> 
> In this model, the virtqueues shared with the device are not the
> original virtqueues allocated by the driver but a copy maintained by the
> hypervisor. The hypervisor copies the descriptors from the driver
> virtqueues to these second virtqueues when the descriptors are made
> available to the device, along with mapping the I/O buffers to the
> device VM address space.
> 
> The use of this second set of virtqueues avoids the device the need to
> verify that the buffers are actually accessible to the device since the
> driver cannot update these copies.
> 
> Dealing with indirect descriptors in this model brings additional issues
> because the hypervisor needs to grant the device access to both the
> indirect descriptor table and all the I/O buffers pointed to by this
> table. However, a compromised guest can modify the table while the
> device is using it, which may lead to faults in the device.
> 
> This problem could be solved by creating a copy of the indirect
> descriptor table as it is done with other descriptors but this approach
> requires some sort of dynamic memory allocation in the hypervisor, which
> might be problematic depending on the situation.
> 
> Using the VIRTIO_F_ISOLATE_INDIRECT_DESC feature allows the hypervisor
> to unmap the indirect descriptor table from the guest address space
> while the indirect descriptor is on the device side and guarantees that
> it will not be modified while the device is using it.

I missed this when this was posted :( Creating the github issue
drew my attention to it. Replying now.

I also CC Eugenio who seems to have implemented shadow VQs in QEMU
(motivation for this patch) without need for new feature bits.

> ---
>  content.tex | 26 ++++++++++++++++++++++++++
>  1 file changed, 26 insertions(+)
> 
> diff --git a/content.tex b/content.tex
> index e863709..20e17b7 100644
> --- a/content.tex
> +++ b/content.tex
> @@ -6944,6 +6944,23 @@ \chapter{Reserved Feature Bits}\label{sec:Reserved Feature Bits}
>    that the driver can reset a queue individually.
>    See \ref{sec:Basic Facilities of a Virtio Device / Virtqueues / Virtqueue Reset}.
>  
> +  \item[VIRTIO_F_ISOLATE_INDIRECT_DESC(41)] This feature indicates that the
> +  device requires the driver to allocate each indirect descriptor on its own
> +  dedicated memory pages which MUST NOT hold any other data than this indirect
> +  descriptor.

what are "memory pages"? the spec does not define anything like this.


> +
> +  This allows the host to unmap these pages from the guest address space while
> +  the indirect descriptors are available to the device. The device
> +  implementation is therefore guaranteed that the driver cannot tampered with an
> +  indirect descriptor table while the device is using it.

grammar issues in this text.

> +
> +  This mechanism notably allows to implement a memory isolation scheme where the
> +  virtio buffers are shared dynamically between the host and the guest as they
> +  are exchanged through the virtqueues. It permits the host to share indirect
> +  descriptor tables and all the associated buffers as is, without the need for
> +  the device to verify that the buffers referenced in an indirect descriptor
> +  table are actually accessible to the device.

It's not really clear from this text what this verification is and how
device can access without verifying it's accessible.  Generally we talk
about device and driver.  Some examples talk about host and guest
instead.  But here we have host, guest and device. This is not clear
without much more explanation.

> +
>  \end{description}
>  
>  \drivernormative{\section}{Reserved Feature Bits}{Reserved Feature Bits}
> @@ -6980,6 +6997,12 @@ \chapter{Reserved Feature Bits}\label{sec:Reserved Feature Bits}
>  
>  A driver SHOULD accept VIRTIO_F_NOTIF_CONFIG_DATA if it is offered.
>  
> +A driver SHOULD accept VIRTIO_F_ISOLATE_INDIRECT_DESC if it is offered. If
> +VIRTIO_F_ISOLATE_INDIRECT_DESC has been negotiated, a driver MUST NOT access the
> +memory pages that contain an indirect descriptor after the indirect descriptor
> +has been made available to the device and before it is returned as used,
> +otherwise the resulting behavior is undefined.

What size are "memory pages"? descriptors are not returned as used,
buffers are used, pls use consistent terminology.

> +
>  \devicenormative{\section}{Reserved Feature Bits}{Reserved Feature Bits}
>  
>  A device MUST offer VIRTIO_F_VERSION_1.  A device MAY fail to operate further
> @@ -7009,6 +7032,9 @@ \chapter{Reserved Feature Bits}\label{sec:Reserved Feature Bits}
>  and presents a PCI SR-IOV capability structure, otherwise
>  it MUST NOT offer VIRTIO_F_SR_IOV.
>  
> +A device MAY fail to operate further if VIRTIO_F_ISOLATE_INDIRECT_DESC is not
> +accepted.

what does this mean? fail how? reject FEATURES_OK? that's normal for any
bit I guess.


> +
>  \section{Legacy Interface: Reserved Feature Bits}\label{sec:Reserved Feature Bits / Legacy Interface: Reserved Feature Bits}
>  
>  Transitional devices MAY offer the following:
> -- 
> 2.38.0
> 
> 
> This publicly archived list offers a means to provide input to the
> OASIS Virtual I/O Device (VIRTIO) TC.
> 
> In order to verify user consent to the Feedback License terms and
> to minimize spam in the list archive, subscription is required
> before posting.
> 
> Subscribe: virtio-comment-subscribe@lists.oasis-open.org
> Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
> List help: virtio-comment-help@lists.oasis-open.org
> List archive: https://lists.oasis-open.org/archives/virtio-comment/
> Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
> List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
> Committee: https://www.oasis-open.org/committees/virtio/
> Join OASIS: https://www.oasis-open.org/join/


This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [virtio-comment] [PATCH] Introduce VIRTIO_F_ISOLATE_INDIRECT_DESC feature
  2023-01-13 12:46 ` Michael S. Tsirkin
@ 2023-01-17 15:19   ` Afsa, Baptiste
  2023-01-17 18:27     ` Eugenio Perez Martin
  0 siblings, 1 reply; 20+ messages in thread
From: Afsa, Baptiste @ 2023-01-17 15:19 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: virtio-comment@lists.oasis-open.org, Eugenio Pérez

> I missed this when this was posted :( Creating the github issue
> drew my attention to it. Replying now.
> 
> I also CC Eugenio who seems to have implemented shadow VQs in QEMU
> (motivation for this patch) without need for new feature bits.
>

Hello Michael,

No worries, thanks for the review!

I didn't know there was such support in QEMU. I'll look at it in more details
because it is not clear to me whether we are talking about the same thing or
not.

Keep in mind that we use shadow virtqueues in a context where the device do not
have full access to the memory of the guest where the driver runs. Instead we
grant buffers to the device transparently from the hypervisor as buffers are
added to the virtqueues. This new feature bit helps supporting indirect
descriptors in this context.

When an indirect descriptor is added to a virtqueue, we need to "share" the
indirect descriptor table in addition to the buffers themselves. To do this we
have two options:

  - Share a shadow copy of the table with the device but this requires some kind
    of dynamic memory allocation in the hypervisor.

  - Grant the indirect descriptor table from the guest as-is. Note that in our
    use case we do not translate buffer addresses when importing them into the
    device address space.

The issue with the second option is that we need to ensure that the driver will
not modify the indirect descriptor table while it is on the device side.
Otherwise the device could attempt to access the buffers using the new addresses
which would not correspond to the one the hypervisor used when granting memory.

Since this is something that a correct driver implementation would not do, this
led to the idea of remapping the table read-only while it is used by the device.
Because an indirect descriptor table may not fill an entire page, we needed a
way to ensure that the OS would not re-use the rest of these pages for other
purposes while we have made them read-only.

> > ---
> >  content.tex | 26 ++++++++++++++++++++++++++
> >  1 file changed, 26 insertions(+)
> >
> > diff --git a/content.tex b/content.tex
> > index e863709..20e17b7 100644
> > --- a/content.tex
> > +++ b/content.tex
> > @@ -6944,6 +6944,23 @@ \chapter{Reserved Feature Bits}\label{sec:Reserved Feature Bits}
> >    that the driver can reset a queue individually.
> >    See \ref{sec:Basic Facilities of a Virtio Device / Virtqueues / Virtqueue Reset}.
> > 
> > +  \item[VIRTIO_F_ISOLATE_INDIRECT_DESC(41)] This feature indicates that the
> > +  device requires the driver to allocate each indirect descriptor on its own
> > +  dedicated memory pages which MUST NOT hold any other data than this indirect
> > +  descriptor.
> 
> what are "memory pages"? the spec does not define anything like this.
> 
> 
> > +
> > +  This allows the host to unmap these pages from the guest address space while
> > +  the indirect descriptors are available to the device. The device
> > +  implementation is therefore guaranteed that the driver cannot tampered with an
> > +  indirect descriptor table while the device is using it.
> 
> grammar issues in this text.
> 
> > +
> > +  This mechanism notably allows to implement a memory isolation scheme where the
> > +  virtio buffers are shared dynamically between the host and the guest as they
> > +  are exchanged through the virtqueues. It permits the host to share indirect
> > +  descriptor tables and all the associated buffers as is, without the need for
> > +  the device to verify that the buffers referenced in an indirect descriptor
> > +  table are actually accessible to the device.
> 
> It's not really clear from this text what this verification is and how
> device can access without verifying it's accessible.  Generally we talk
> about device and driver.  Some examples talk about host and guest
> instead.  But here we have host, guest and device. This is not clear
> without much more explanation.
> 

Fair enough. I agree this could be presented in more details. However this
brings up a question: do you think that the specification should describe this
specific application of this feature bit?

The spec should define what the feature bit implies, how it can be used is
another topic. I can certainly provide more details on how we used it but there
may be other applications.

What do you think?

> > +
> >  \end{description}
> > 
> >  \drivernormative{\section}{Reserved Feature Bits}{Reserved Feature Bits}
> > @@ -6980,6 +6997,12 @@ \chapter{Reserved Feature Bits}\label{sec:Reserved Feature Bits}
> > 
> >  A driver SHOULD accept VIRTIO_F_NOTIF_CONFIG_DATA if it is offered.
> > 
> > +A driver SHOULD accept VIRTIO_F_ISOLATE_INDIRECT_DESC if it is offered. If
> > +VIRTIO_F_ISOLATE_INDIRECT_DESC has been negotiated, a driver MUST NOT access the
> > +memory pages that contain an indirect descriptor after the indirect descriptor
> > +has been made available to the device and before it is returned as used,
> > +otherwise the resulting behavior is undefined.
> 
> What size are "memory pages"? descriptors are not returned as used,
> buffers are used, pls use consistent terminology.
> 
> > +
> >  \devicenormative{\section}{Reserved Feature Bits}{Reserved Feature Bits}
> > 
> >  A device MUST offer VIRTIO_F_VERSION_1.  A device MAY fail to operate further
> > @@ -7009,6 +7032,9 @@ \chapter{Reserved Feature Bits}\label{sec:Reserved Feature Bits}
> >  and presents a PCI SR-IOV capability structure, otherwise
> >  it MUST NOT offer VIRTIO_F_SR_IOV.
> > 
> > +A device MAY fail to operate further if VIRTIO_F_ISOLATE_INDIRECT_DESC is not
> > +accepted.
> 
> what does this mean? fail how? reject FEATURES_OK? that's normal for any
> bit I guess.
>

Yes, rejecting FEATURES_OK is probably the best option here. To be honest with
you, I simply re-used the wording used for some other feature bits like
VIRTIO_F_VERSION_1 or VIRTIO_F_ACCESS_PLATFORM.

I thought that this was important to mention, because most of the feature bits
are here to describe optional features that the driver is free to use or
not. Not accepting them should not make the device unusable. Whereas this new
feature bit is more like some sort of requirement from the device on the driver.

At the same time, I agree with you that the specification is already clear on
the fact that the device may reject FEATURES_OK if it is not happy with the
subset of features selected by the driver. If you think this is not relevant, I
do not mind dropping it.

> 
> > +
> >  \section{Legacy Interface: Reserved Feature Bits}\label{sec:Reserved Feature Bits / Legacy Interface: Reserved Feature Bits}
> > 
> >  Transitional devices MAY offer the following:
> > --
> > 2.38.0
> >
> >
> > This publicly archived list offers a means to provide input to the
> > OASIS Virtual I/O Device (VIRTIO) TC.
> >
> > In order to verify user consent to the Feedback License terms and
> > to minimize spam in the list archive, subscription is required
> > before posting.
> >
> > Subscribe: virtio-comment-subscribe@lists.oasis-open.org
> > Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
> > List help: virtio-comment-help@lists.oasis-open.org
> > List archive: https://lists.oasis-open.org/archives/virtio-comment/
> > Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
> > List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
> > Committee: https://www.oasis-open.org/committees/virtio/
> > Join OASIS: https://www.oasis-open.org/join/
This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [virtio-comment] [PATCH] Introduce VIRTIO_F_ISOLATE_INDIRECT_DESC feature
  2023-01-17 15:19   ` Afsa, Baptiste
@ 2023-01-17 18:27     ` Eugenio Perez Martin
  2023-02-27 14:53       ` Afsa, Baptiste
  0 siblings, 1 reply; 20+ messages in thread
From: Eugenio Perez Martin @ 2023-01-17 18:27 UTC (permalink / raw)
  To: Afsa, Baptiste; +Cc: Michael S. Tsirkin, virtio-comment@lists.oasis-open.org

On Tue, Jan 17, 2023 at 4:24 PM Afsa, Baptiste <Baptiste.Afsa@harman.com> wrote:
>
> > I missed this when this was posted :( Creating the github issue
> > drew my attention to it. Replying now.
> >
> > I also CC Eugenio who seems to have implemented shadow VQs in QEMU
> > (motivation for this patch) without need for new feature bits.
> >
>
> Hello Michael,
>
> No worries, thanks for the review!
>
> I didn't know there was such support in QEMU. I'll look at it in more details
> because it is not clear to me whether we are talking about the same thing or
> not.
>
> Keep in mind that we use shadow virtqueues in a context where the device do not
> have full access to the memory of the guest where the driver runs. Instead we
> grant buffers to the device transparently from the hypervisor as buffers are
> added to the virtqueues.

Hi Baptiste,

It is not the current way of working for shadow virtqueue in qemu, but
it is doable.

QEMU offers a new address space to the device, including all the guest
memory and the shadow vrings in qemu memory. By default, qemu
translates each descriptor to this new address space and copies it in
the shadow vring, making sure it's a valid guest address.

Net CVQ is a special case, as qemu copies also the buffer so a
malicious guest will not be able to change it and make vdpa device and
qemu's knowledge of it out of sync. This is done in a region mapped at
device start for this purpose for performance and simplicity, not for
each descriptor.

It is technically possible to develop the copy of the buffer content
in a memory region too. Once a suitable size is agreed (or given by a
parameter in cmdline, for example), SVQ already supports things like
backoff in case this region is full. To malloc and map this region
dynamically is also possible.

> This new feature bit helps supporting indirect
> descriptors in this context.
>
> When an indirect descriptor is added to a virtqueue, we need to "share" the
> indirect descriptor table in addition to the buffers themselves. To do this we
> have two options:
>
>   - Share a shadow copy of the table with the device but this requires some kind
>     of dynamic memory allocation in the hypervisor.
>
>   - Grant the indirect descriptor table from the guest as-is. Note that in our
>     use case we do not translate buffer addresses when importing them into the
>     device address space.
>

The plan with indirect descriptors is similar to CVQ buffer treatment:
To preallocate a big enough chunk of memory able to hold a copy of the
indirect table and then translate each descriptor. The current dynamic
allocation scheme is very simple but it should work as long as there
is enough memory. As commented before, backoff is already available
for other cases.

Going a step forward with current qemu's SVQ, a feature that would
help is to allow the device to translate buffers addresses with a
different IOVA from vring and indirect table. But this is far from
standard POV, since it does not have even an address space concept.

Do you think both modes can converge?

> The issue with the second option is that we need to ensure that the driver will
> not modify the indirect descriptor table while it is on the device side.
> Otherwise the device could attempt to access the buffers using the new addresses
> which would not correspond to the one the hypervisor used when granting memory.
>
> Since this is something that a correct driver implementation would not do, this
> led to the idea of remapping the table read-only while it is used by the device.
> Because an indirect descriptor table may not fill an entire page, we needed a
> way to ensure that the OS would not re-use the rest of these pages for other
> purposes while we have made them read-only.
>

How do you guarantee it for vring itself?

> > > ---
> > >  content.tex | 26 ++++++++++++++++++++++++++
> > >  1 file changed, 26 insertions(+)
> > >
> > > diff --git a/content.tex b/content.tex
> > > index e863709..20e17b7 100644
> > > --- a/content.tex
> > > +++ b/content.tex
> > > @@ -6944,6 +6944,23 @@ \chapter{Reserved Feature Bits}\label{sec:Reserved Feature Bits}
> > >    that the driver can reset a queue individually.
> > >    See \ref{sec:Basic Facilities of a Virtio Device / Virtqueues / Virtqueue Reset}.
> > >
> > > +  \item[VIRTIO_F_ISOLATE_INDIRECT_DESC(41)] This feature indicates that the
> > > +  device requires the driver to allocate each indirect descriptor on its own
> > > +  dedicated memory pages which MUST NOT hold any other data than this indirect
> > > +  descriptor.
> >
> > what are "memory pages"? the spec does not define anything like this.
> >
> >
> > > +
> > > +  This allows the host to unmap these pages from the guest address space while
> > > +  the indirect descriptors are available to the device. The device
> > > +  implementation is therefore guaranteed that the driver cannot tampered with an
> > > +  indirect descriptor table while the device is using it.
> >
> > grammar issues in this text.
> >
> > > +
> > > +  This mechanism notably allows to implement a memory isolation scheme where the
> > > +  virtio buffers are shared dynamically between the host and the guest as they
> > > +  are exchanged through the virtqueues. It permits the host to share indirect
> > > +  descriptor tables and all the associated buffers as is, without the need for
> > > +  the device to verify that the buffers referenced in an indirect descriptor
> > > +  table are actually accessible to the device.
> >
> > It's not really clear from this text what this verification is and how
> > device can access without verifying it's accessible.  Generally we talk
> > about device and driver.  Some examples talk about host and guest
> > instead.  But here we have host, guest and device. This is not clear
> > without much more explanation.
> >
>
> Fair enough. I agree this could be presented in more details. However this
> brings up a question: do you think that the specification should describe this
> specific application of this feature bit?
>
> The spec should define what the feature bit implies, how it can be used is
> another topic. I can certainly provide more details on how we used it but there
> may be other applications.
>
> What do you think?
>

I think one problem of the description is that it requires having the
hypervisor in mind, with most of the other features also valid in
scenarios where a virtio device is used without one. This should be
explicit, and at least why it cannot be achieved with this feature (or
what makes it convenient).

The closer example I have in mind is net gratuitous packet sending.
The standard explicitly explains why it is easier to delegate it to
the guest (hypervisor may not have full knowledge of what to
announce). I think a rewrite of the last two paragraphs in terms of
"if not done this way, these are the consequences" may help here.

Thanks!

> > > +
> > >  \end{description}
> > >
> > >  \drivernormative{\section}{Reserved Feature Bits}{Reserved Feature Bits}
> > > @@ -6980,6 +6997,12 @@ \chapter{Reserved Feature Bits}\label{sec:Reserved Feature Bits}
> > >
> > >  A driver SHOULD accept VIRTIO_F_NOTIF_CONFIG_DATA if it is offered.
> > >
> > > +A driver SHOULD accept VIRTIO_F_ISOLATE_INDIRECT_DESC if it is offered. If
> > > +VIRTIO_F_ISOLATE_INDIRECT_DESC has been negotiated, a driver MUST NOT access the
> > > +memory pages that contain an indirect descriptor after the indirect descriptor
> > > +has been made available to the device and before it is returned as used,
> > > +otherwise the resulting behavior is undefined.
> >
> > What size are "memory pages"? descriptors are not returned as used,
> > buffers are used, pls use consistent terminology.
> >
> > > +
> > >  \devicenormative{\section}{Reserved Feature Bits}{Reserved Feature Bits}
> > >
> > >  A device MUST offer VIRTIO_F_VERSION_1.  A device MAY fail to operate further
> > > @@ -7009,6 +7032,9 @@ \chapter{Reserved Feature Bits}\label{sec:Reserved Feature Bits}
> > >  and presents a PCI SR-IOV capability structure, otherwise
> > >  it MUST NOT offer VIRTIO_F_SR_IOV.
> > >
> > > +A device MAY fail to operate further if VIRTIO_F_ISOLATE_INDIRECT_DESC is not
> > > +accepted.
> >
> > what does this mean? fail how? reject FEATURES_OK? that's normal for any
> > bit I guess.
> >
>
> Yes, rejecting FEATURES_OK is probably the best option here. To be honest with
> you, I simply re-used the wording used for some other feature bits like
> VIRTIO_F_VERSION_1 or VIRTIO_F_ACCESS_PLATFORM.
>
> I thought that this was important to mention, because most of the feature bits
> are here to describe optional features that the driver is free to use or
> not. Not accepting them should not make the device unusable. Whereas this new
> feature bit is more like some sort of requirement from the device on the driver.
>
> At the same time, I agree with you that the specification is already clear on
> the fact that the device may reject FEATURES_OK if it is not happy with the
> subset of features selected by the driver. If you think this is not relevant, I
> do not mind dropping it.
>
> >
> > > +
> > >  \section{Legacy Interface: Reserved Feature Bits}\label{sec:Reserved Feature Bits / Legacy Interface: Reserved Feature Bits}
> > >
> > >  Transitional devices MAY offer the following:
> > > --
> > > 2.38.0
> > >
> > >
> > > This publicly archived list offers a means to provide input to the
> > > OASIS Virtual I/O Device (VIRTIO) TC.
> > >
> > > In order to verify user consent to the Feedback License terms and
> > > to minimize spam in the list archive, subscription is required
> > > before posting.
> > >
> > > Subscribe: virtio-comment-subscribe@lists.oasis-open.org
> > > Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
> > > List help: virtio-comment-help@lists.oasis-open.org
> > > List archive: https://lists.oasis-open.org/archives/virtio-comment/
> > > Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
> > > List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
> > > Committee: https://www.oasis-open.org/committees/virtio/
> > > Join OASIS: https://www.oasis-open.org/join/
>


This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [virtio-comment] [PATCH] Introduce VIRTIO_F_ISOLATE_INDIRECT_DESC feature
  2023-01-17 18:27     ` Eugenio Perez Martin
@ 2023-02-27 14:53       ` Afsa, Baptiste
  2023-02-27 15:45         ` Stefan Hajnoczi
  0 siblings, 1 reply; 20+ messages in thread
From: Afsa, Baptiste @ 2023-02-27 14:53 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: Michael S. Tsirkin, virtio-comment@lists.oasis-open.org

> Hi Baptiste,
>
> It is not the current way of working for shadow virtqueue in qemu, but
> it is doable.
>
> QEMU offers a new address space to the device, including all the guest
> memory and the shadow vrings in qemu memory. By default, qemu
> translates each descriptor to this new address space and copies it in
> the shadow vring, making sure it's a valid guest address.
>
> Net CVQ is a special case, as qemu copies also the buffer so a
> malicious guest will not be able to change it and make vdpa device and
> qemu's knowledge of it out of sync. This is done in a region mapped at
> device start for this purpose for performance and simplicity, not for
> each descriptor.
>
> It is technically possible to develop the copy of the buffer content
> in a memory region too. Once a suitable size is agreed (or given by a
> parameter in cmdline, for example), SVQ already supports things like
> backoff in case this region is full. To malloc and map this region
> dynamically is also possible.
>

Hello Eugenio,

Thanks for the insight. It does have some similarities with what we have done on
our side.

We have also tried the approach of doing a copy of the buffer content into a
shared memory region but we did that on the guest side using things like the
swiotlb or restricted DMA pools. This works well when buffers are small, but
when they get larger, we get better performance by granting the memory
dynamically.

> > This new feature bit helps supporting indirect
> > descriptors in this context.
> >
> > When an indirect descriptor is added to a virtqueue, we need to "share" the
> > indirect descriptor table in addition to the buffers themselves. To do this we
> > have two options:
> >
> >   - Share a shadow copy of the table with the device but this requires some kind
> >     of dynamic memory allocation in the hypervisor.
> >
> >   - Grant the indirect descriptor table from the guest as-is. Note that in our
> >     use case we do not translate buffer addresses when importing them into the
> >     device address space.
> >
>
> The plan with indirect descriptors is similar to CVQ buffer treatment:
> To preallocate a big enough chunk of memory able to hold a copy of the
> indirect table and then translate each descriptor. The current dynamic
> allocation scheme is very simple but it should work as long as there
> is enough memory. As commented before, backoff is already available
> for other cases.

Ok. This is one of the options we considered initially, but that we rejected
because we wanted to avoid dynamic memory allocation.

However, we are reconsidering this idea now. We initially overlooked the fact
that the spec limits the size of an indirect descriptor table to the queue
size. Having this allows us to figure out the maximum size for a single indirect
descriptor table and the total size needed for the memory pool that will store
the copies of the indirect descriptor tables.

The issue that we have now, is that this limitation does not seem to be enforced
in Linux virtio drivers today. I came across:

https://github.com/oasis-tcs/virtio-spec/issues/122

which looks like a good base for us to build upon, but I'm not sure what is the
status with this issue. Do you know whether there is any plan regarding this?

> Going a step forward with current qemu's SVQ, a feature that would
> help is to allow the device to translate buffers addresses with a
> different IOVA from vring and indirect table. But this is far from
> standard POV, since it does not have even an address space concept.

I'm not sure to see what this would bring compared to the current design.

> Do you think both modes can converge?

I think so since we are now thinking of using a copy of the indirect descriptor
table. This would eliminate the need for any additional feature bit.

> > The issue with the second option is that we need to ensure that the driver will
> > not modify the indirect descriptor table while it is on the device side.
> > Otherwise the device could attempt to access the buffers using the new addresses
> > which would not correspond to the one the hypervisor used when granting memory.
> >
> > Since this is something that a correct driver implementation would not do, this
> > led to the idea of remapping the table read-only while it is used by the device.
> > Because an indirect descriptor table may not fill an entire page, we needed a
> > way to ensure that the OS would not re-use the rest of these pages for other
> > purposes while we have made them read-only.
> >
>
> How do you guarantee it for vring itself?

For the vrings we do exactly what you described for QEMU. The virtqueues that
the device sees are not the virtqueues allocated by the driver but the shadow
copies managed by the hypervisor. If the driver touches the descriptors after
they have been copied to the shadow virtqueue, the hypervisor will ignore these
changes.

Thanks.
This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [virtio-comment] [PATCH] Introduce VIRTIO_F_ISOLATE_INDIRECT_DESC feature
  2023-02-27 14:53       ` Afsa, Baptiste
@ 2023-02-27 15:45         ` Stefan Hajnoczi
       [not found]           ` <2244126.gP0zCk8Q6A@silver>
  0 siblings, 1 reply; 20+ messages in thread
From: Stefan Hajnoczi @ 2023-02-27 15:45 UTC (permalink / raw)
  To: Afsa, Baptiste
  Cc: Eugenio Perez Martin, Michael S. Tsirkin,
	virtio-comment@lists.oasis-open.org, Christian Schoenebeck

[-- Attachment #1: Type: text/plain, Size: 6133 bytes --]

On Mon, Feb 27, 2023 at 02:53:55PM +0000, Afsa, Baptiste wrote:
> > Hi Baptiste,
> >
> > It is not the current way of working for shadow virtqueue in qemu, but
> > it is doable.
> >
> > QEMU offers a new address space to the device, including all the guest
> > memory and the shadow vrings in qemu memory. By default, qemu
> > translates each descriptor to this new address space and copies it in
> > the shadow vring, making sure it's a valid guest address.
> >
> > Net CVQ is a special case, as qemu copies also the buffer so a
> > malicious guest will not be able to change it and make vdpa device and
> > qemu's knowledge of it out of sync. This is done in a region mapped at
> > device start for this purpose for performance and simplicity, not for
> > each descriptor.
> >
> > It is technically possible to develop the copy of the buffer content
> > in a memory region too. Once a suitable size is agreed (or given by a
> > parameter in cmdline, for example), SVQ already supports things like
> > backoff in case this region is full. To malloc and map this region
> > dynamically is also possible.
> >
> 
> Hello Eugenio,
> 
> Thanks for the insight. It does have some similarities with what we have done on
> our side.
> 
> We have also tried the approach of doing a copy of the buffer content into a
> shared memory region but we did that on the guest side using things like the
> swiotlb or restricted DMA pools. This works well when buffers are small, but
> when they get larger, we get better performance by granting the memory
> dynamically.
> 
> > > This new feature bit helps supporting indirect
> > > descriptors in this context.
> > >
> > > When an indirect descriptor is added to a virtqueue, we need to "share" the
> > > indirect descriptor table in addition to the buffers themselves. To do this we
> > > have two options:
> > >
> > >   - Share a shadow copy of the table with the device but this requires some kind
> > >     of dynamic memory allocation in the hypervisor.
> > >
> > >   - Grant the indirect descriptor table from the guest as-is. Note that in our
> > >     use case we do not translate buffer addresses when importing them into the
> > >     device address space.
> > >
> >
> > The plan with indirect descriptors is similar to CVQ buffer treatment:
> > To preallocate a big enough chunk of memory able to hold a copy of the
> > indirect table and then translate each descriptor. The current dynamic
> > allocation scheme is very simple but it should work as long as there
> > is enough memory. As commented before, backoff is already available
> > for other cases.
> 
> Ok. This is one of the options we considered initially, but that we rejected
> because we wanted to avoid dynamic memory allocation.
> 
> However, we are reconsidering this idea now. We initially overlooked the fact
> that the spec limits the size of an indirect descriptor table to the queue
> size. Having this allows us to figure out the maximum size for a single indirect
> descriptor table and the total size needed for the memory pool that will store
> the copies of the indirect descriptor tables.
> 
> The issue that we have now, is that this limitation does not seem to be enforced
> in Linux virtio drivers today. I came across:
> 
> https://github.com/oasis-tcs/virtio-spec/issues/122
> 
> which looks like a good base for us to build upon, but I'm not sure what is the
> status with this issue. Do you know whether there is any plan regarding this?

CCing Christian regarding extending queue size limits.

> > Going a step forward with current qemu's SVQ, a feature that would
> > help is to allow the device to translate buffers addresses with a
> > different IOVA from vring and indirect table. But this is far from
> > standard POV, since it does not have even an address space concept.
> 
> I'm not sure to see what this would bring compared to the current design.
> 
> > Do you think both modes can converge?
> 
> I think so since we are now thinking of using a copy of the indirect descriptor
> table. This would eliminate the need for any additional feature bit.
> 
> > > The issue with the second option is that we need to ensure that the driver will
> > > not modify the indirect descriptor table while it is on the device side.
> > > Otherwise the device could attempt to access the buffers using the new addresses
> > > which would not correspond to the one the hypervisor used when granting memory.
> > >
> > > Since this is something that a correct driver implementation would not do, this
> > > led to the idea of remapping the table read-only while it is used by the device.
> > > Because an indirect descriptor table may not fill an entire page, we needed a
> > > way to ensure that the OS would not re-use the rest of these pages for other
> > > purposes while we have made them read-only.
> > >
> >
> > How do you guarantee it for vring itself?
> 
> For the vrings we do exactly what you described for QEMU. The virtqueues that
> the device sees are not the virtqueues allocated by the driver but the shadow
> copies managed by the hypervisor. If the driver touches the descriptors after
> they have been copied to the shadow virtqueue, the hypervisor will ignore these
> changes.
> 
> Thanks.
> This publicly archived list offers a means to provide input to the
> OASIS Virtual I/O Device (VIRTIO) TC.
> 
> In order to verify user consent to the Feedback License terms and
> to minimize spam in the list archive, subscription is required
> before posting.
> 
> Subscribe: virtio-comment-subscribe@lists.oasis-open.org
> Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
> List help: virtio-comment-help@lists.oasis-open.org
> List archive: https://lists.oasis-open.org/archives/virtio-comment/
> Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
> List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
> Committee: https://www.oasis-open.org/committees/virtio/
> Join OASIS: https://www.oasis-open.org/join/
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [virtio-comment] Re: VIRTIO_RING_F_INDIRECT_SIZE status
       [not found]           ` <2244126.gP0zCk8Q6A@silver>
@ 2023-02-27 17:41             ` Michael S. Tsirkin
       [not found]               ` <2494182.5W6NY9sLyD@silver>
  0 siblings, 1 reply; 20+ messages in thread
From: Michael S. Tsirkin @ 2023-02-27 17:41 UTC (permalink / raw)
  To: Christian Schoenebeck
  Cc: Afsa, Baptiste, Stefan Hajnoczi, Eugenio Perez Martin,
	virtio-comment@lists.oasis-open.org

On Mon, Feb 27, 2023 at 05:13:12PM +0100, Christian Schoenebeck wrote:
> On Monday, February 27, 2023 4:45:45 PM CET Stefan Hajnoczi wrote:
> > On Mon, Feb 27, 2023 at 02:53:55PM +0000, Afsa, Baptiste wrote:
> > > The issue that we have now, is that this limitation does not seem to be enforced
> > > in Linux virtio drivers today. I came across:
> > > 
> > > https://github.com/oasis-tcs/virtio-spec/issues/122
> > > 
> > > which looks like a good base for us to build upon, but I'm not sure what is the
> > > status with this issue. Do you know whether there is any plan regarding this?
> > 
> > CCing Christian regarding extending queue size limits.
> 
> Status on this issue: it was suspended in May last year. AFAICR Michael
> expressed the need to give some more thought about it, and a new virtio spec
> release was just ahead at that time.
> 
> Michael, suggestions how to bring this forward?
> 
> Best regards,
> Christian Schoenebeck
> 

How about a summary letter listing various options, pros and cons?

-- 
MST


This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [virtio-comment] Re: VIRTIO_RING_F_INDIRECT_SIZE status
       [not found]               ` <2494182.5W6NY9sLyD@silver>
@ 2023-02-28 12:05                 ` Michael S. Tsirkin
  0 siblings, 0 replies; 20+ messages in thread
From: Michael S. Tsirkin @ 2023-02-28 12:05 UTC (permalink / raw)
  To: Christian Schoenebeck
  Cc: Afsa, Baptiste, Stefan Hajnoczi, Eugenio Perez Martin,
	virtio-comment@lists.oasis-open.org

On Tue, Feb 28, 2023 at 12:49:00PM +0100, Christian Schoenebeck wrote:
> On Monday, February 27, 2023 6:41:12 PM CET Michael S. Tsirkin wrote:
> > On Mon, Feb 27, 2023 at 05:13:12PM +0100, Christian Schoenebeck wrote:
> > > On Monday, February 27, 2023 4:45:45 PM CET Stefan Hajnoczi wrote:
> > > > On Mon, Feb 27, 2023 at 02:53:55PM +0000, Afsa, Baptiste wrote:
> > > > > The issue that we have now, is that this limitation does not seem to be enforced
> > > > > in Linux virtio drivers today. I came across:
> > > > > 
> > > > > https://github.com/oasis-tcs/virtio-spec/issues/122
> > > > > 
> > > > > which looks like a good base for us to build upon, but I'm not sure what is the
> > > > > status with this issue. Do you know whether there is any plan regarding this?
> > > > 
> > > > CCing Christian regarding extending queue size limits.
> > > 
> > > Status on this issue: it was suspended in May last year. AFAICR Michael
> > > expressed the need to give some more thought about it, and a new virtio spec
> > > release was just ahead at that time.
> > > 
> > > Michael, suggestions how to bring this forward?
> > > 
> > > Best regards,
> > > Christian Schoenebeck
> > > 
> > 
> > How about a summary letter listing various options, pros and cons?
> 
> Actually I had submitted a draft for this feature (latest version linked
> above), and AFAICR Michael was the only person who expressed concerns from
> design perspective. Comments by other people were already just aboout precise
> wording, but not about the design itself.
> 
> We stopped the discussion at the point where Michael expressed the need to
> think more about it, but as his expressed concerns were a bit vague, I still
> don't see how I could bring this issue forward.
> 
> Best regards,
> Christian Schoenebeck
> 


So the last time there were two things:


1. bugs introduced during the packed ring work. For example:

	> 
	> I don't think so:
	> 
	>   2.6.5.3.1 Driver Requirements: Indirect Descriptors
	>   ..
	>   "A driver MUST NOT set both VIRTQ_DESC_F_INDIRECT and VIRTQ_DESC_F_NEXT in
	>   flags."
	> So as far as I can see it, the amount of direct descriptors is currently
	> always exactly one if an indirect table is used.
	> 
	> Best regards,
	> Christian Schoenebeck
	> 

	Oh. You are right.  Weirdly this text is not in packed-ring.tex - I
	think this and a bunch of other cases like this are an oversight,
	however we need to fix them first before adding features that assume
	they are fixed ...

we need to get them fixed before we poke at core ring otherwise it's a
mess - maybe the TC will accept just fixing this
without a feature bit, or maybe people will feel a feature bit
is required.



2. Given this is a lot of work I am trying to find a way to
make the impact bigger. In particular to cover the use-case
of limiting s/g to 1k while making queues deeper (with
or without indirect). For this I proposed:

	So I think that given this, we can limit the total number
	of non-indirect descriptors, including non-indirect ones
	in a chain + all the ones in indirect pointer table if any,
	and excluding the indirect descriptor itself, and this
	will address the issue you are describing here, right?

people seemed to be ok with this idea?



-- 
MST


This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [virtio-comment] Re: VIRTIO_RING_F_INDIRECT_SIZE status
       [not found]     ` <2458440.T3bEdP9vpG@silver>
@ 2023-03-06 16:27       ` Stefan Hajnoczi
  0 siblings, 0 replies; 20+ messages in thread
From: Stefan Hajnoczi @ 2023-03-06 16:27 UTC (permalink / raw)
  To: Christian Schoenebeck
  Cc: Michael S. Tsirkin, Afsa, Baptiste, Eugenio Perez Martin,
	virtio-comment@lists.oasis-open.org

[-- Attachment #1: Type: text/plain, Size: 6582 bytes --]

On Mon, Mar 06, 2023 at 03:51:13PM +0100, Christian Schoenebeck wrote:
> On Wednesday, March 1, 2023 2:57:57 PM CET Stefan Hajnoczi wrote:
> > On Wed, Mar 01, 2023 at 01:55:14PM +0100, Christian Schoenebeck wrote:
> > > On Tuesday, February 28, 2023 1:05:21 PM CET Michael S. Tsirkin wrote:
> > > > On Tue, Feb 28, 2023 at 12:49:00PM +0100, Christian Schoenebeck wrote:
> > > > > On Monday, February 27, 2023 6:41:12 PM CET Michael S. Tsirkin wrote:
> > > > > > On Mon, Feb 27, 2023 at 05:13:12PM +0100, Christian Schoenebeck wrote:
> > > > > > > On Monday, February 27, 2023 4:45:45 PM CET Stefan Hajnoczi wrote:
> > > > > > > > On Mon, Feb 27, 2023 at 02:53:55PM +0000, Afsa, Baptiste wrote:
> > > > > > > > > The issue that we have now, is that this limitation does not seem to be enforced
> > > > > > > > > in Linux virtio drivers today. I came across:
> > > > > > > > > 
> > > > > > > > > https://github.com/oasis-tcs/virtio-spec/issues/122
> > > > > > > > > 
> > > > > > > > > which looks like a good base for us to build upon, but I'm not sure what is the
> > > > > > > > > status with this issue. Do you know whether there is any plan regarding this?
> > > > > > > > 
> > > > > > > > CCing Christian regarding extending queue size limits.
> > > > > > > 
> > > > > > > Status on this issue: it was suspended in May last year. AFAICR Michael
> > > > > > > expressed the need to give some more thought about it, and a new virtio spec
> > > > > > > release was just ahead at that time.
> > > > > > > 
> > > > > > > Michael, suggestions how to bring this forward?
> > > > > > > 
> > > > > > > Best regards,
> > > > > > > Christian Schoenebeck
> > > > > > > 
> > > > > > 
> > > > > > How about a summary letter listing various options, pros and cons?
> > > > > 
> > > > > Actually I had submitted a draft for this feature (latest version linked
> > > > > above), and AFAICR Michael was the only person who expressed concerns from
> > > > > design perspective. Comments by other people were already just aboout precise
> > > > > wording, but not about the design itself.
> > > > > 
> > > > > We stopped the discussion at the point where Michael expressed the need to
> > > > > think more about it, but as his expressed concerns were a bit vague, I still
> > > > > don't see how I could bring this issue forward.
> > > > > 
> > > > > Best regards,
> > > > > Christian Schoenebeck
> > > > > 
> > > > 
> > > > 
> > > > So the last time there were two things:
> > > > 
> > > > 
> > > > 1. bugs introduced during the packed ring work. For example:
> > > > 
> > > > 	> 
> > > > 	> I don't think so:
> > > > 	> 
> > > > 	>   2.6.5.3.1 Driver Requirements: Indirect Descriptors
> > > > 	>   ..
> > > > 	>   "A driver MUST NOT set both VIRTQ_DESC_F_INDIRECT and VIRTQ_DESC_F_NEXT in
> > > > 	>   flags."
> > > > 	> So as far as I can see it, the amount of direct descriptors is currently
> > > > 	> always exactly one if an indirect table is used.
> > > > 	> 
> > > > 	> Best regards,
> > > > 	> Christian Schoenebeck
> > > > 	> 
> > > > 
> > > > 	Oh. You are right.  Weirdly this text is not in packed-ring.tex - I
> > > > 	think this and a bunch of other cases like this are an oversight,
> > > > 	however we need to fix them first before adding features that assume
> > > > 	they are fixed ...
> > > > 
> > > > we need to get them fixed before we poke at core ring otherwise it's a
> > > > mess - maybe the TC will accept just fixing this
> > > > without a feature bit, or maybe people will feel a feature bit
> > > > is required.
> > > 
> > > It's been a while. Looking at v1.2 it says:
> > > 
> > >   2.8 Packed Virtqueues
> > >   ...
> > >   2.8.5 Scatter-Gather Support [1]
> > >   ...
> > >   While unusual (most implementations either create all lists solely using   
> > >   non-indirect descriptors, or always use a single indirect element), if both 
> > >   features have been negotiated, mixing indirect and non-indirect descriptors 
> > >   in a ring is valid, as long as each list only contains descriptors of a 
> > >   given type.
> > > 
> > >   [1] https://docs.oasis-open.org/virtio/virtio/v1.2/cs01/virtio-v1.2-cs01.html#x1-770005
> > > 
> > > To avoid misapprehensions: the way I understand it, same restrictions apply to
> > > packed queues as split queues, in the sense that you may neither chain several
> > > tables in a single message, nor multi-level nest tables, nor mix a list of
> > > direct descriptors and indirect descriptors on the same level within one
> > > message. So the explicit exception described here, only means you may use
> > > *one* indirect table in one message, while using chained direct descriptors in
> > > another message. But that's it, right?
> > 
> > Hi Christian,
> > Yes, for packed virtqueues it is forbidden to mix normal and indirect
> > descriptors in a single buffer. Further support for this interpretation
> > is in 2.8.19 Driver Requirements: Indirect Descriptors:
> > 
> >   A driver MUST NOT write direct descriptors with VIRTQ_DESC_F_INDIRECT
> >   set in a scatter-gather list linked by VIRTQ_DESC_F_NEXT. flags.
> > 
> > VIRTQ_DESC_F_INDIRECT cannot be set if the descriptor chain uses
> > VIRTQ_DESC_F_NEXT, therefore it's not possible to mix normal and
> > indirect descriptors in a single buffer.
> > 
> > However, the Split Virtqueues section documents an exception where
> > mixing normal and indirect descriptors in a single buffer is allowed,
> > but only in a specific order. 2.7.5.3.2 Device Requirements: Indirect
> > Descriptors says:
> > 
> >   The device MUST handle the case of zero or more normal chained
> >   descriptors followed by a single descriptor with
> >   flags&VIRTQ_DESC_F_INDIRECT. Note: While unusual (most implementations
> >   either create a chain solely using non-indirect descriptors, or use a
> >   single indirect element), such a layout is valid.
> > 
> > So what degree of mixing normal and indirect descriptors is allowed
> > depends on the virtqueue format.
> 
> Oh, good catch! I was not aware about this device requirement for split
> queues. I thought 2.7.5.3.1 "A driver MUST NOT set both VIRTQ_DESC_F_INDIRECT
> and VIRTQ_DESC_F_NEXT in flags" would rule that mixed case out, but right, if
> the indirect descriptor is last in the buffer then 2.7.5.3.1 is not violated.
> 
> Which makes Michael's suggestion even mandatory.
> 
> Out of pure curiosity: is this mixed case actively used by somebody?

I would love to know as well. :)

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [virtio-comment] Re: VIRTIO_RING_F_INDIRECT_SIZE status
       [not found]     ` <2812377.Px9Efocobp@silver>
@ 2023-03-06 17:41       ` Michael S. Tsirkin
  2023-03-06 20:46         ` Stefan Hajnoczi
  0 siblings, 1 reply; 20+ messages in thread
From: Michael S. Tsirkin @ 2023-03-06 17:41 UTC (permalink / raw)
  To: Christian Schoenebeck
  Cc: Afsa, Baptiste, Stefan Hajnoczi, Eugenio Perez Martin,
	virtio-comment@lists.oasis-open.org

On Mon, Mar 06, 2023 at 04:00:37PM +0100, Christian Schoenebeck wrote:
> On Wednesday, March 1, 2023 3:55:57 PM CET Michael S. Tsirkin wrote:
> > On Wed, Mar 01, 2023 at 01:55:14PM +0100, Christian Schoenebeck wrote:
> > >   2.8 Packed Virtqueues
> > >   ...
> > >   2.8.5 Scatter-Gather Support [1]
> > >   ...
> > >   While unusual (most implementations either create all lists solely using   
> > >   non-indirect descriptors, or always use a single indirect element), if both 
> > >   features have been negotiated, mixing indirect and non-indirect descriptors 
> > >   in a ring is valid, as long as each list only contains descriptors of a 
> > >   given type.
> > > 
> > >   [1] https://docs.oasis-open.org/virtio/virtio/v1.2/cs01/virtio-v1.2-cs01.html#x1-770005
> > > 
> > > To avoid misapprehensions: the way I understand it, same restrictions apply to
> > > packed queues as split queues, in the sense that you may neither chain several
> > > tables in a single message, nor multi-level nest tables, nor mix a list of
> > > direct descriptors and indirect descriptors on the same level within one
> > > message. So the explicit exception described here, only means you may use
> > > *one* indirect table in one message, while using chained direct descriptors in
> > > another message. But that's it, right?
> > 
> > 
> > That's my understanding.
> > 
> > > > 2. Given this is a lot of work I am trying to find a way to
> > > > make the impact bigger. In particular to cover the use-case
> > > > of limiting s/g to 1k while making queues deeper (with
> > > > or without indirect). For this I proposed:
> > > > 
> > > > 	So I think that given this, we can limit the total number
> > > > 	of non-indirect descriptors, including non-indirect ones
> > > > 	in a chain + all the ones in indirect pointer table if any,
> > > > 	and excluding the indirect descriptor itself, and this
> > > > 	will address the issue you are describing here, right?
> > > > 
> > > > people seemed to be ok with this idea?
> > > 
> > > IIUIC it would not make a difference from design perspective from what I
> > > proposed, as virtio currently neither allows to mix, chain or mult-level nest
> > > indirect descriptor tables within a single message), and hence it would just
> > > boil down to adjusting the wording. So yes, it would therefore cover my
> > > intended use case.
> > > 
> > > Best regards,
> > > Christian Schoenebeck
> > 
> > 
> > Sounds good to me. One interesting case is scsi and blk which have
> > a seg_max field. This is defined as
> > 
> > \item[\field{seg_max}] is the maximum number of segments that can be in a
> >     command. A bidirectional command can include \field{seg_max} input
> >     segments and \field{seg_max} output segments.
> > 
> > it is never explained what *are* the segments, or how does it
> > interact with VQ depth. Current drivers interpret this
> > strictly and assume that this limits the s/g length but does not
> > allow you to exceed vq size.
> > 
> > Do we thus want two limits (for read and write descriptors)?
> 
> No opinion on that, as my intended use case was just extending the buffer size
> beyond queue size, not limiting it below queue size. Either way is fine with
> me.
> 
> Anyhow, as this now gets broader scope, that also means the suggested flag
> VIRTIO_RING_F_INDIRECT_SIZE needs to be renamed. VIRTIO_RING_F_BUFFER_SIZE?
> 
> Best regards,
> Christian Schoenebeck


Hmm that's unclear in that it might be in bytes too.
Given blk and scsi call these "segments" how about
VIRTIO_RING_F_SEG_MAX?


This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [virtio-comment] Re: VIRTIO_RING_F_INDIRECT_SIZE status
  2023-03-06 17:41       ` Michael S. Tsirkin
@ 2023-03-06 20:46         ` Stefan Hajnoczi
  2023-03-06 21:50           ` Michael S. Tsirkin
  0 siblings, 1 reply; 20+ messages in thread
From: Stefan Hajnoczi @ 2023-03-06 20:46 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Christian Schoenebeck, Afsa, Baptiste, Eugenio Perez Martin,
	virtio-comment@lists.oasis-open.org

[-- Attachment #1: Type: text/plain, Size: 4884 bytes --]

On Mon, Mar 06, 2023 at 12:41:25PM -0500, Michael S. Tsirkin wrote:
> On Mon, Mar 06, 2023 at 04:00:37PM +0100, Christian Schoenebeck wrote:
> > On Wednesday, March 1, 2023 3:55:57 PM CET Michael S. Tsirkin wrote:
> > > On Wed, Mar 01, 2023 at 01:55:14PM +0100, Christian Schoenebeck wrote:
> > > >   2.8 Packed Virtqueues
> > > >   ...
> > > >   2.8.5 Scatter-Gather Support [1]
> > > >   ...
> > > >   While unusual (most implementations either create all lists solely using   
> > > >   non-indirect descriptors, or always use a single indirect element), if both 
> > > >   features have been negotiated, mixing indirect and non-indirect descriptors 
> > > >   in a ring is valid, as long as each list only contains descriptors of a 
> > > >   given type.
> > > > 
> > > >   [1] https://docs.oasis-open.org/virtio/virtio/v1.2/cs01/virtio-v1.2-cs01.html#x1-770005
> > > > 
> > > > To avoid misapprehensions: the way I understand it, same restrictions apply to
> > > > packed queues as split queues, in the sense that you may neither chain several
> > > > tables in a single message, nor multi-level nest tables, nor mix a list of
> > > > direct descriptors and indirect descriptors on the same level within one
> > > > message. So the explicit exception described here, only means you may use
> > > > *one* indirect table in one message, while using chained direct descriptors in
> > > > another message. But that's it, right?
> > > 
> > > 
> > > That's my understanding.
> > > 
> > > > > 2. Given this is a lot of work I am trying to find a way to
> > > > > make the impact bigger. In particular to cover the use-case
> > > > > of limiting s/g to 1k while making queues deeper (with
> > > > > or without indirect). For this I proposed:
> > > > > 
> > > > > 	So I think that given this, we can limit the total number
> > > > > 	of non-indirect descriptors, including non-indirect ones
> > > > > 	in a chain + all the ones in indirect pointer table if any,
> > > > > 	and excluding the indirect descriptor itself, and this
> > > > > 	will address the issue you are describing here, right?
> > > > > 
> > > > > people seemed to be ok with this idea?
> > > > 
> > > > IIUIC it would not make a difference from design perspective from what I
> > > > proposed, as virtio currently neither allows to mix, chain or mult-level nest
> > > > indirect descriptor tables within a single message), and hence it would just
> > > > boil down to adjusting the wording. So yes, it would therefore cover my
> > > > intended use case.
> > > > 
> > > > Best regards,
> > > > Christian Schoenebeck
> > > 
> > > 
> > > Sounds good to me. One interesting case is scsi and blk which have
> > > a seg_max field. This is defined as
> > > 
> > > \item[\field{seg_max}] is the maximum number of segments that can be in a
> > >     command. A bidirectional command can include \field{seg_max} input
> > >     segments and \field{seg_max} output segments.
> > > 
> > > it is never explained what *are* the segments, or how does it
> > > interact with VQ depth. Current drivers interpret this
> > > strictly and assume that this limits the s/g length but does not
> > > allow you to exceed vq size.
> > > 
> > > Do we thus want two limits (for read and write descriptors)?
> > 
> > No opinion on that, as my intended use case was just extending the buffer size
> > beyond queue size, not limiting it below queue size. Either way is fine with
> > me.
> > 
> > Anyhow, as this now gets broader scope, that also means the suggested flag
> > VIRTIO_RING_F_INDIRECT_SIZE needs to be renamed. VIRTIO_RING_F_BUFFER_SIZE?
> > 
> > Best regards,
> > Christian Schoenebeck
> 
> 
> Hmm that's unclear in that it might be in bytes too.
> Given blk and scsi call these "segments" how about
> VIRTIO_RING_F_SEG_MAX?

The VIRTIO equivalent of a "segment" is an "element". I don't think the
term "segment" is needed at the VIRTIO device model level since there is
already a word for it.

I'm confused because VIRTIO_RING_F_BUFFER_SIZE and VIRTIO_RING_F_SEG_MAX
mean different things to me and have different units (bytes vs number of
segments).

I wouldn't worry about virtio-blk/scsi seg_max. Although the segments
map to virtqueue elements, seg_max has a specific SCSI/block level
meaning related to data transfer and is not about constraints that apply
to all virtqueue requests. I/O requests have headers/footers, so they
can actually consume more elements than seg_max. Also, there could be
non-data transfer requests that happen to consume more than seg_max and
the storage controller would be happy with that (e.g. because VIRTIO
mandates flexible framing so you could break a request into 1-byte
elements). It's confusing the talk about seg_max at the VIRTIO device
model level - it's not about virtqueues at all.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [virtio-comment] Re: VIRTIO_RING_F_INDIRECT_SIZE status
  2023-03-06 20:46         ` Stefan Hajnoczi
@ 2023-03-06 21:50           ` Michael S. Tsirkin
  2023-03-07 12:40             ` Christian Schoenebeck
  2023-03-07 13:26             ` Stefan Hajnoczi
  0 siblings, 2 replies; 20+ messages in thread
From: Michael S. Tsirkin @ 2023-03-06 21:50 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Christian Schoenebeck, Afsa, Baptiste, Eugenio Perez Martin,
	virtio-comment@lists.oasis-open.org

On Mon, Mar 06, 2023 at 03:46:01PM -0500, Stefan Hajnoczi wrote:
> On Mon, Mar 06, 2023 at 12:41:25PM -0500, Michael S. Tsirkin wrote:
> > On Mon, Mar 06, 2023 at 04:00:37PM +0100, Christian Schoenebeck wrote:
> > > On Wednesday, March 1, 2023 3:55:57 PM CET Michael S. Tsirkin wrote:
> > > > On Wed, Mar 01, 2023 at 01:55:14PM +0100, Christian Schoenebeck wrote:
> > > > >   2.8 Packed Virtqueues
> > > > >   ...
> > > > >   2.8.5 Scatter-Gather Support [1]
> > > > >   ...
> > > > >   While unusual (most implementations either create all lists solely using   
> > > > >   non-indirect descriptors, or always use a single indirect element), if both 
> > > > >   features have been negotiated, mixing indirect and non-indirect descriptors 
> > > > >   in a ring is valid, as long as each list only contains descriptors of a 
> > > > >   given type.
> > > > > 
> > > > >   [1] https://docs.oasis-open.org/virtio/virtio/v1.2/cs01/virtio-v1.2-cs01.html#x1-770005
> > > > > 
> > > > > To avoid misapprehensions: the way I understand it, same restrictions apply to
> > > > > packed queues as split queues, in the sense that you may neither chain several
> > > > > tables in a single message, nor multi-level nest tables, nor mix a list of
> > > > > direct descriptors and indirect descriptors on the same level within one
> > > > > message. So the explicit exception described here, only means you may use
> > > > > *one* indirect table in one message, while using chained direct descriptors in
> > > > > another message. But that's it, right?
> > > > 
> > > > 
> > > > That's my understanding.
> > > > 
> > > > > > 2. Given this is a lot of work I am trying to find a way to
> > > > > > make the impact bigger. In particular to cover the use-case
> > > > > > of limiting s/g to 1k while making queues deeper (with
> > > > > > or without indirect). For this I proposed:
> > > > > > 
> > > > > > 	So I think that given this, we can limit the total number
> > > > > > 	of non-indirect descriptors, including non-indirect ones
> > > > > > 	in a chain + all the ones in indirect pointer table if any,
> > > > > > 	and excluding the indirect descriptor itself, and this
> > > > > > 	will address the issue you are describing here, right?
> > > > > > 
> > > > > > people seemed to be ok with this idea?
> > > > > 
> > > > > IIUIC it would not make a difference from design perspective from what I
> > > > > proposed, as virtio currently neither allows to mix, chain or mult-level nest
> > > > > indirect descriptor tables within a single message), and hence it would just
> > > > > boil down to adjusting the wording. So yes, it would therefore cover my
> > > > > intended use case.
> > > > > 
> > > > > Best regards,
> > > > > Christian Schoenebeck
> > > > 
> > > > 
> > > > Sounds good to me. One interesting case is scsi and blk which have
> > > > a seg_max field. This is defined as
> > > > 
> > > > \item[\field{seg_max}] is the maximum number of segments that can be in a
> > > >     command. A bidirectional command can include \field{seg_max} input
> > > >     segments and \field{seg_max} output segments.
> > > > 
> > > > it is never explained what *are* the segments, or how does it
> > > > interact with VQ depth. Current drivers interpret this
> > > > strictly and assume that this limits the s/g length but does not
> > > > allow you to exceed vq size.
> > > > 
> > > > Do we thus want two limits (for read and write descriptors)?
> > > 
> > > No opinion on that, as my intended use case was just extending the buffer size
> > > beyond queue size, not limiting it below queue size. Either way is fine with
> > > me.
> > > 
> > > Anyhow, as this now gets broader scope, that also means the suggested flag
> > > VIRTIO_RING_F_INDIRECT_SIZE needs to be renamed. VIRTIO_RING_F_BUFFER_SIZE?
> > > 
> > > Best regards,
> > > Christian Schoenebeck
> > 
> > 
> > Hmm that's unclear in that it might be in bytes too.
> > Given blk and scsi call these "segments" how about
> > VIRTIO_RING_F_SEG_MAX?
> 
> The VIRTIO equivalent of a "segment" is an "element".

Hmm true:
	A buffer consists of zero or more device-readable physically-contiguous
	elements followed by zero or more physically-contiguous
	device-writable elements (each buffer has at least one element).

However we then need to clean this up, since

- At least in one place we say

indirect elements to mean indirect descriptors.

- we also say "queue elements" to mean "avail/desc/used"
- We also say "descriptor elements" - not 100% sure it's the same.

so we need to clean this up a bit first and maybe add
text about indirect descriptors not counting as elements.




> I don't think the
> term "segment" is needed at the VIRTIO device model level since there is
> already a word for it.
> 
> I'm confused because VIRTIO_RING_F_BUFFER_SIZE and VIRTIO_RING_F_SEG_MAX
> mean different things to me and have different units (bytes vs number of
> segments).

What was requested was number of segments/elements.

> I wouldn't worry about virtio-blk/scsi seg_max. Although the segments
> map to virtqueue elements, seg_max has a specific SCSI/block level
> meaning related to data transfer and is not about constraints that apply
> to all virtqueue requests. I/O requests have headers/footers, so they
> can actually consume more elements than seg_max. Also, there could be
> non-data transfer requests that happen to consume more than seg_max and
> the storage controller would be happy with that (e.g. because VIRTIO
> mandates flexible framing so you could break a request into 1-byte
> elements). It's confusing the talk about seg_max at the VIRTIO device
> model level - it's not about virtqueues at all.
> 
> Stefan

OK thanks for the clarification.




This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [virtio-comment] Re: VIRTIO_RING_F_INDIRECT_SIZE status
  2023-03-06 21:50           ` Michael S. Tsirkin
@ 2023-03-07 12:40             ` Christian Schoenebeck
  2023-03-13 11:48               ` Christian Schoenebeck
  2023-03-07 13:26             ` Stefan Hajnoczi
  1 sibling, 1 reply; 20+ messages in thread
From: Christian Schoenebeck @ 2023-03-07 12:40 UTC (permalink / raw)
  To: Stefan Hajnoczi, Michael S. Tsirkin
  Cc: Afsa, Baptiste, Eugenio Perez Martin,
	virtio-comment@lists.oasis-open.org

On Monday, March 6, 2023 10:50:53 PM CET Michael S. Tsirkin wrote:
> On Mon, Mar 06, 2023 at 03:46:01PM -0500, Stefan Hajnoczi wrote:
> > On Mon, Mar 06, 2023 at 12:41:25PM -0500, Michael S. Tsirkin wrote:
> > > On Mon, Mar 06, 2023 at 04:00:37PM +0100, Christian Schoenebeck wrote:
[...]
> > > > Anyhow, as this now gets broader scope, that also means the suggested flag
> > > > VIRTIO_RING_F_INDIRECT_SIZE needs to be renamed. VIRTIO_RING_F_BUFFER_SIZE?
> > > > 
> > > > Best regards,
> > > > Christian Schoenebeck
> > > 
> > > 
> > > Hmm that's unclear in that it might be in bytes too.
> > > Given blk and scsi call these "segments" how about
> > > VIRTIO_RING_F_SEG_MAX?
> > 
> > The VIRTIO equivalent of a "segment" is an "element".
> 
> Hmm true:
> 	A buffer consists of zero or more device-readable physically-contiguous
> 	elements followed by zero or more physically-contiguous
> 	device-writable elements (each buffer has at least one element).
> 
> However we then need to clean this up, since
> 
> - At least in one place we say
> 
> indirect elements to mean indirect descriptors.
> 
> - we also say "queue elements" to mean "avail/desc/used"
> - We also say "descriptor elements" - not 100% sure it's the same.
> 
> so we need to clean this up a bit first and maybe add
> text about indirect descriptors not counting as elements.

With VIRTIO_RING_F_BUFFER_SIZE I actually had in mind that this limit was tied
to per buffer/message (i.e. in contrast to a limit that would apply to the
entire queue). But you are right, the name could mislead to the interpretation
as if it was in bytes.

How about VIRTIO_RING_F_MAX_BUF_ELEMENTS?




This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [virtio-comment] Re: VIRTIO_RING_F_INDIRECT_SIZE status
  2023-03-06 21:50           ` Michael S. Tsirkin
  2023-03-07 12:40             ` Christian Schoenebeck
@ 2023-03-07 13:26             ` Stefan Hajnoczi
  2023-03-07 16:47               ` Michael S. Tsirkin
  1 sibling, 1 reply; 20+ messages in thread
From: Stefan Hajnoczi @ 2023-03-07 13:26 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Christian Schoenebeck, Afsa, Baptiste, Eugenio Perez Martin,
	virtio-comment@lists.oasis-open.org

[-- Attachment #1: Type: text/plain, Size: 5606 bytes --]

On Mon, Mar 06, 2023 at 04:50:53PM -0500, Michael S. Tsirkin wrote:
> On Mon, Mar 06, 2023 at 03:46:01PM -0500, Stefan Hajnoczi wrote:
> > On Mon, Mar 06, 2023 at 12:41:25PM -0500, Michael S. Tsirkin wrote:
> > > On Mon, Mar 06, 2023 at 04:00:37PM +0100, Christian Schoenebeck wrote:
> > > > On Wednesday, March 1, 2023 3:55:57 PM CET Michael S. Tsirkin wrote:
> > > > > On Wed, Mar 01, 2023 at 01:55:14PM +0100, Christian Schoenebeck wrote:
> > > > > >   2.8 Packed Virtqueues
> > > > > >   ...
> > > > > >   2.8.5 Scatter-Gather Support [1]
> > > > > >   ...
> > > > > >   While unusual (most implementations either create all lists solely using   
> > > > > >   non-indirect descriptors, or always use a single indirect element), if both 
> > > > > >   features have been negotiated, mixing indirect and non-indirect descriptors 
> > > > > >   in a ring is valid, as long as each list only contains descriptors of a 
> > > > > >   given type.
> > > > > > 
> > > > > >   [1] https://docs.oasis-open.org/virtio/virtio/v1.2/cs01/virtio-v1.2-cs01.html#x1-770005
> > > > > > 
> > > > > > To avoid misapprehensions: the way I understand it, same restrictions apply to
> > > > > > packed queues as split queues, in the sense that you may neither chain several
> > > > > > tables in a single message, nor multi-level nest tables, nor mix a list of
> > > > > > direct descriptors and indirect descriptors on the same level within one
> > > > > > message. So the explicit exception described here, only means you may use
> > > > > > *one* indirect table in one message, while using chained direct descriptors in
> > > > > > another message. But that's it, right?
> > > > > 
> > > > > 
> > > > > That's my understanding.
> > > > > 
> > > > > > > 2. Given this is a lot of work I am trying to find a way to
> > > > > > > make the impact bigger. In particular to cover the use-case
> > > > > > > of limiting s/g to 1k while making queues deeper (with
> > > > > > > or without indirect). For this I proposed:
> > > > > > > 
> > > > > > > 	So I think that given this, we can limit the total number
> > > > > > > 	of non-indirect descriptors, including non-indirect ones
> > > > > > > 	in a chain + all the ones in indirect pointer table if any,
> > > > > > > 	and excluding the indirect descriptor itself, and this
> > > > > > > 	will address the issue you are describing here, right?
> > > > > > > 
> > > > > > > people seemed to be ok with this idea?
> > > > > > 
> > > > > > IIUIC it would not make a difference from design perspective from what I
> > > > > > proposed, as virtio currently neither allows to mix, chain or mult-level nest
> > > > > > indirect descriptor tables within a single message), and hence it would just
> > > > > > boil down to adjusting the wording. So yes, it would therefore cover my
> > > > > > intended use case.
> > > > > > 
> > > > > > Best regards,
> > > > > > Christian Schoenebeck
> > > > > 
> > > > > 
> > > > > Sounds good to me. One interesting case is scsi and blk which have
> > > > > a seg_max field. This is defined as
> > > > > 
> > > > > \item[\field{seg_max}] is the maximum number of segments that can be in a
> > > > >     command. A bidirectional command can include \field{seg_max} input
> > > > >     segments and \field{seg_max} output segments.
> > > > > 
> > > > > it is never explained what *are* the segments, or how does it
> > > > > interact with VQ depth. Current drivers interpret this
> > > > > strictly and assume that this limits the s/g length but does not
> > > > > allow you to exceed vq size.
> > > > > 
> > > > > Do we thus want two limits (for read and write descriptors)?
> > > > 
> > > > No opinion on that, as my intended use case was just extending the buffer size
> > > > beyond queue size, not limiting it below queue size. Either way is fine with
> > > > me.
> > > > 
> > > > Anyhow, as this now gets broader scope, that also means the suggested flag
> > > > VIRTIO_RING_F_INDIRECT_SIZE needs to be renamed. VIRTIO_RING_F_BUFFER_SIZE?
> > > > 
> > > > Best regards,
> > > > Christian Schoenebeck
> > > 
> > > 
> > > Hmm that's unclear in that it might be in bytes too.
> > > Given blk and scsi call these "segments" how about
> > > VIRTIO_RING_F_SEG_MAX?
> > 
> > The VIRTIO equivalent of a "segment" is an "element".
> 
> Hmm true:
> 	A buffer consists of zero or more device-readable physically-contiguous
> 	elements followed by zero or more physically-contiguous
> 	device-writable elements (each buffer has at least one element).
> 
> However we then need to clean this up, since
> 
> - At least in one place we say
> 
> indirect elements to mean indirect descriptors.
> 
> - we also say "queue elements" to mean "avail/desc/used"
> - We also say "descriptor elements" - not 100% sure it's the same.
> 
> so we need to clean this up a bit first and maybe add
> text about indirect descriptors not counting as elements.

Haha, yes. I also remembered that QEMU's type for a virtqueue buffer is
called VirtQueueElement :).

My impression from the spec is that when talking about virtqueues an
element is a data blob that's part of a buffer and when talking about
vrings an element descriptor is the ring entry that points to the data
blob. Often the terms are used interchangeably (just "descriptors" or
"elements").

I'm not sure if the distinction is necessary. It might be simpler to
always talk about descriptors and remove the term "element", since there
is no way to avoid talking about descriptors eventually.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [virtio-comment] Re: VIRTIO_RING_F_INDIRECT_SIZE status
  2023-03-07 13:26             ` Stefan Hajnoczi
@ 2023-03-07 16:47               ` Michael S. Tsirkin
  2023-03-07 19:35                 ` Stefan Hajnoczi
  0 siblings, 1 reply; 20+ messages in thread
From: Michael S. Tsirkin @ 2023-03-07 16:47 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Christian Schoenebeck, Afsa, Baptiste, Eugenio Perez Martin,
	virtio-comment@lists.oasis-open.org

On Tue, Mar 07, 2023 at 08:26:27AM -0500, Stefan Hajnoczi wrote:
> On Mon, Mar 06, 2023 at 04:50:53PM -0500, Michael S. Tsirkin wrote:
> > On Mon, Mar 06, 2023 at 03:46:01PM -0500, Stefan Hajnoczi wrote:
> > > On Mon, Mar 06, 2023 at 12:41:25PM -0500, Michael S. Tsirkin wrote:
> > > > On Mon, Mar 06, 2023 at 04:00:37PM +0100, Christian Schoenebeck wrote:
> > > > > On Wednesday, March 1, 2023 3:55:57 PM CET Michael S. Tsirkin wrote:
> > > > > > On Wed, Mar 01, 2023 at 01:55:14PM +0100, Christian Schoenebeck wrote:
> > > > > > >   2.8 Packed Virtqueues
> > > > > > >   ...
> > > > > > >   2.8.5 Scatter-Gather Support [1]
> > > > > > >   ...
> > > > > > >   While unusual (most implementations either create all lists solely using   
> > > > > > >   non-indirect descriptors, or always use a single indirect element), if both 
> > > > > > >   features have been negotiated, mixing indirect and non-indirect descriptors 
> > > > > > >   in a ring is valid, as long as each list only contains descriptors of a 
> > > > > > >   given type.
> > > > > > > 
> > > > > > >   [1] https://docs.oasis-open.org/virtio/virtio/v1.2/cs01/virtio-v1.2-cs01.html#x1-770005
> > > > > > > 
> > > > > > > To avoid misapprehensions: the way I understand it, same restrictions apply to
> > > > > > > packed queues as split queues, in the sense that you may neither chain several
> > > > > > > tables in a single message, nor multi-level nest tables, nor mix a list of
> > > > > > > direct descriptors and indirect descriptors on the same level within one
> > > > > > > message. So the explicit exception described here, only means you may use
> > > > > > > *one* indirect table in one message, while using chained direct descriptors in
> > > > > > > another message. But that's it, right?
> > > > > > 
> > > > > > 
> > > > > > That's my understanding.
> > > > > > 
> > > > > > > > 2. Given this is a lot of work I am trying to find a way to
> > > > > > > > make the impact bigger. In particular to cover the use-case
> > > > > > > > of limiting s/g to 1k while making queues deeper (with
> > > > > > > > or without indirect). For this I proposed:
> > > > > > > > 
> > > > > > > > 	So I think that given this, we can limit the total number
> > > > > > > > 	of non-indirect descriptors, including non-indirect ones
> > > > > > > > 	in a chain + all the ones in indirect pointer table if any,
> > > > > > > > 	and excluding the indirect descriptor itself, and this
> > > > > > > > 	will address the issue you are describing here, right?
> > > > > > > > 
> > > > > > > > people seemed to be ok with this idea?
> > > > > > > 
> > > > > > > IIUIC it would not make a difference from design perspective from what I
> > > > > > > proposed, as virtio currently neither allows to mix, chain or mult-level nest
> > > > > > > indirect descriptor tables within a single message), and hence it would just
> > > > > > > boil down to adjusting the wording. So yes, it would therefore cover my
> > > > > > > intended use case.
> > > > > > > 
> > > > > > > Best regards,
> > > > > > > Christian Schoenebeck
> > > > > > 
> > > > > > 
> > > > > > Sounds good to me. One interesting case is scsi and blk which have
> > > > > > a seg_max field. This is defined as
> > > > > > 
> > > > > > \item[\field{seg_max}] is the maximum number of segments that can be in a
> > > > > >     command. A bidirectional command can include \field{seg_max} input
> > > > > >     segments and \field{seg_max} output segments.
> > > > > > 
> > > > > > it is never explained what *are* the segments, or how does it
> > > > > > interact with VQ depth. Current drivers interpret this
> > > > > > strictly and assume that this limits the s/g length but does not
> > > > > > allow you to exceed vq size.
> > > > > > 
> > > > > > Do we thus want two limits (for read and write descriptors)?
> > > > > 
> > > > > No opinion on that, as my intended use case was just extending the buffer size
> > > > > beyond queue size, not limiting it below queue size. Either way is fine with
> > > > > me.
> > > > > 
> > > > > Anyhow, as this now gets broader scope, that also means the suggested flag
> > > > > VIRTIO_RING_F_INDIRECT_SIZE needs to be renamed. VIRTIO_RING_F_BUFFER_SIZE?
> > > > > 
> > > > > Best regards,
> > > > > Christian Schoenebeck
> > > > 
> > > > 
> > > > Hmm that's unclear in that it might be in bytes too.
> > > > Given blk and scsi call these "segments" how about
> > > > VIRTIO_RING_F_SEG_MAX?
> > > 
> > > The VIRTIO equivalent of a "segment" is an "element".
> > 
> > Hmm true:
> > 	A buffer consists of zero or more device-readable physically-contiguous
> > 	elements followed by zero or more physically-contiguous
> > 	device-writable elements (each buffer has at least one element).
> > 
> > However we then need to clean this up, since
> > 
> > - At least in one place we say
> > 
> > indirect elements to mean indirect descriptors.
> > 
> > - we also say "queue elements" to mean "avail/desc/used"
> > - We also say "descriptor elements" - not 100% sure it's the same.
> > 
> > so we need to clean this up a bit first and maybe add
> > text about indirect descriptors not counting as elements.
> 
> Haha, yes. I also remembered that QEMU's type for a virtqueue buffer is
> called VirtQueueElement :).
> 
> My impression from the spec is that when talking about virtqueues an
> element is a data blob that's part of a buffer and when talking about
> vrings an element descriptor is the ring entry that points to the data
> blob. Often the terms are used interchangeably (just "descriptors" or
> "elements").
> 
> I'm not sure if the distinction is necessary. It might be simpler to
> always talk about descriptors and remove the term "element", since there
> is no way to avoid talking about descriptors eventually.
> 
> Stefan

Well this whole discussion started from the point that
indirect descriptors do not describe parts of a buffer.

So I feel there's a distinction.

-- 
MST


This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [virtio-comment] Re: VIRTIO_RING_F_INDIRECT_SIZE status
  2023-03-07 16:47               ` Michael S. Tsirkin
@ 2023-03-07 19:35                 ` Stefan Hajnoczi
  0 siblings, 0 replies; 20+ messages in thread
From: Stefan Hajnoczi @ 2023-03-07 19:35 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Christian Schoenebeck, Afsa, Baptiste, Eugenio Perez Martin,
	virtio-comment@lists.oasis-open.org

[-- Attachment #1: Type: text/plain, Size: 6385 bytes --]

On Tue, Mar 07, 2023 at 11:47:32AM -0500, Michael S. Tsirkin wrote:
> On Tue, Mar 07, 2023 at 08:26:27AM -0500, Stefan Hajnoczi wrote:
> > On Mon, Mar 06, 2023 at 04:50:53PM -0500, Michael S. Tsirkin wrote:
> > > On Mon, Mar 06, 2023 at 03:46:01PM -0500, Stefan Hajnoczi wrote:
> > > > On Mon, Mar 06, 2023 at 12:41:25PM -0500, Michael S. Tsirkin wrote:
> > > > > On Mon, Mar 06, 2023 at 04:00:37PM +0100, Christian Schoenebeck wrote:
> > > > > > On Wednesday, March 1, 2023 3:55:57 PM CET Michael S. Tsirkin wrote:
> > > > > > > On Wed, Mar 01, 2023 at 01:55:14PM +0100, Christian Schoenebeck wrote:
> > > > > > > >   2.8 Packed Virtqueues
> > > > > > > >   ...
> > > > > > > >   2.8.5 Scatter-Gather Support [1]
> > > > > > > >   ...
> > > > > > > >   While unusual (most implementations either create all lists solely using   
> > > > > > > >   non-indirect descriptors, or always use a single indirect element), if both 
> > > > > > > >   features have been negotiated, mixing indirect and non-indirect descriptors 
> > > > > > > >   in a ring is valid, as long as each list only contains descriptors of a 
> > > > > > > >   given type.
> > > > > > > > 
> > > > > > > >   [1] https://docs.oasis-open.org/virtio/virtio/v1.2/cs01/virtio-v1.2-cs01.html#x1-770005
> > > > > > > > 
> > > > > > > > To avoid misapprehensions: the way I understand it, same restrictions apply to
> > > > > > > > packed queues as split queues, in the sense that you may neither chain several
> > > > > > > > tables in a single message, nor multi-level nest tables, nor mix a list of
> > > > > > > > direct descriptors and indirect descriptors on the same level within one
> > > > > > > > message. So the explicit exception described here, only means you may use
> > > > > > > > *one* indirect table in one message, while using chained direct descriptors in
> > > > > > > > another message. But that's it, right?
> > > > > > > 
> > > > > > > 
> > > > > > > That's my understanding.
> > > > > > > 
> > > > > > > > > 2. Given this is a lot of work I am trying to find a way to
> > > > > > > > > make the impact bigger. In particular to cover the use-case
> > > > > > > > > of limiting s/g to 1k while making queues deeper (with
> > > > > > > > > or without indirect). For this I proposed:
> > > > > > > > > 
> > > > > > > > > 	So I think that given this, we can limit the total number
> > > > > > > > > 	of non-indirect descriptors, including non-indirect ones
> > > > > > > > > 	in a chain + all the ones in indirect pointer table if any,
> > > > > > > > > 	and excluding the indirect descriptor itself, and this
> > > > > > > > > 	will address the issue you are describing here, right?
> > > > > > > > > 
> > > > > > > > > people seemed to be ok with this idea?
> > > > > > > > 
> > > > > > > > IIUIC it would not make a difference from design perspective from what I
> > > > > > > > proposed, as virtio currently neither allows to mix, chain or mult-level nest
> > > > > > > > indirect descriptor tables within a single message), and hence it would just
> > > > > > > > boil down to adjusting the wording. So yes, it would therefore cover my
> > > > > > > > intended use case.
> > > > > > > > 
> > > > > > > > Best regards,
> > > > > > > > Christian Schoenebeck
> > > > > > > 
> > > > > > > 
> > > > > > > Sounds good to me. One interesting case is scsi and blk which have
> > > > > > > a seg_max field. This is defined as
> > > > > > > 
> > > > > > > \item[\field{seg_max}] is the maximum number of segments that can be in a
> > > > > > >     command. A bidirectional command can include \field{seg_max} input
> > > > > > >     segments and \field{seg_max} output segments.
> > > > > > > 
> > > > > > > it is never explained what *are* the segments, or how does it
> > > > > > > interact with VQ depth. Current drivers interpret this
> > > > > > > strictly and assume that this limits the s/g length but does not
> > > > > > > allow you to exceed vq size.
> > > > > > > 
> > > > > > > Do we thus want two limits (for read and write descriptors)?
> > > > > > 
> > > > > > No opinion on that, as my intended use case was just extending the buffer size
> > > > > > beyond queue size, not limiting it below queue size. Either way is fine with
> > > > > > me.
> > > > > > 
> > > > > > Anyhow, as this now gets broader scope, that also means the suggested flag
> > > > > > VIRTIO_RING_F_INDIRECT_SIZE needs to be renamed. VIRTIO_RING_F_BUFFER_SIZE?
> > > > > > 
> > > > > > Best regards,
> > > > > > Christian Schoenebeck
> > > > > 
> > > > > 
> > > > > Hmm that's unclear in that it might be in bytes too.
> > > > > Given blk and scsi call these "segments" how about
> > > > > VIRTIO_RING_F_SEG_MAX?
> > > > 
> > > > The VIRTIO equivalent of a "segment" is an "element".
> > > 
> > > Hmm true:
> > > 	A buffer consists of zero or more device-readable physically-contiguous
> > > 	elements followed by zero or more physically-contiguous
> > > 	device-writable elements (each buffer has at least one element).
> > > 
> > > However we then need to clean this up, since
> > > 
> > > - At least in one place we say
> > > 
> > > indirect elements to mean indirect descriptors.
> > > 
> > > - we also say "queue elements" to mean "avail/desc/used"
> > > - We also say "descriptor elements" - not 100% sure it's the same.
> > > 
> > > so we need to clean this up a bit first and maybe add
> > > text about indirect descriptors not counting as elements.
> > 
> > Haha, yes. I also remembered that QEMU's type for a virtqueue buffer is
> > called VirtQueueElement :).
> > 
> > My impression from the spec is that when talking about virtqueues an
> > element is a data blob that's part of a buffer and when talking about
> > vrings an element descriptor is the ring entry that points to the data
> > blob. Often the terms are used interchangeably (just "descriptors" or
> > "elements").
> > 
> > I'm not sure if the distinction is necessary. It might be simpler to
> > always talk about descriptors and remove the term "element", since there
> > is no way to avoid talking about descriptors eventually.
> > 
> > Stefan
> 
> Well this whole discussion started from the point that
> indirect descriptors do not describe parts of a buffer.
> 
> So I feel there's a distinction.

Good point.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [virtio-comment] Re: VIRTIO_RING_F_INDIRECT_SIZE status
  2023-03-07 12:40             ` Christian Schoenebeck
@ 2023-03-13 11:48               ` Christian Schoenebeck
  2023-03-13 13:06                 ` Michael S. Tsirkin
  0 siblings, 1 reply; 20+ messages in thread
From: Christian Schoenebeck @ 2023-03-13 11:48 UTC (permalink / raw)
  To: Stefan Hajnoczi, Michael S. Tsirkin, virtio-comment
  Cc: Afsa, Baptiste, Eugenio Perez Martin,
	virtio-comment@lists.oasis-open.org

On Tuesday, March 7, 2023 1:40:24 PM CET Christian Schoenebeck wrote:
> On Monday, March 6, 2023 10:50:53 PM CET Michael S. Tsirkin wrote:
> > On Mon, Mar 06, 2023 at 03:46:01PM -0500, Stefan Hajnoczi wrote:
> > > On Mon, Mar 06, 2023 at 12:41:25PM -0500, Michael S. Tsirkin wrote:
> > > > On Mon, Mar 06, 2023 at 04:00:37PM +0100, Christian Schoenebeck wrote:
> [...]
> > > > > Anyhow, as this now gets broader scope, that also means the suggested flag
> > > > > VIRTIO_RING_F_INDIRECT_SIZE needs to be renamed. VIRTIO_RING_F_BUFFER_SIZE?
> > > > > 
> > > > > Best regards,
> > > > > Christian Schoenebeck
> > > > 
> > > > 
> > > > Hmm that's unclear in that it might be in bytes too.
> > > > Given blk and scsi call these "segments" how about
> > > > VIRTIO_RING_F_SEG_MAX?
> > > 
> > > The VIRTIO equivalent of a "segment" is an "element".
> > 
> > Hmm true:
> > 	A buffer consists of zero or more device-readable physically-contiguous
> > 	elements followed by zero or more physically-contiguous
> > 	device-writable elements (each buffer has at least one element).
> > 
> > However we then need to clean this up, since
> > 
> > - At least in one place we say
> > 
> > indirect elements to mean indirect descriptors.
> > 
> > - we also say "queue elements" to mean "avail/desc/used"
> > - We also say "descriptor elements" - not 100% sure it's the same.
> > 
> > so we need to clean this up a bit first and maybe add
> > text about indirect descriptors not counting as elements.
> 
> With VIRTIO_RING_F_BUFFER_SIZE I actually had in mind that this limit was tied
> to per buffer/message (i.e. in contrast to a limit that would apply to the
> entire queue). But you are right, the name could mislead to the interpretation
> as if it was in bytes.
> 
> How about VIRTIO_RING_F_MAX_BUF_ELEMENTS?

Ping?

Michael, another thing: what was the rationale for your suggestion to exclude
the indirect descriptor itself from this limit?

While I agree that it makes sense to broaden the scope to the total amount of
all descriptors per buffer (as I wasn't aware before that it is indeed
permitted to mix chained direct and indirect descriptors in a split queue
buffer, and given that QEMU's implementation for instance just allocates them
on the stack), however I currently don't see why it would make sense to
exclude the indirect descriptor itself, as it makes things more complicated?
Did you have something specific in mind for that exclusion?

Best regards,
Christian Schoenebeck



This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [virtio-comment] Re: VIRTIO_RING_F_INDIRECT_SIZE status
  2023-03-13 11:48               ` Christian Schoenebeck
@ 2023-03-13 13:06                 ` Michael S. Tsirkin
  2023-03-13 13:48                   ` Christian Schoenebeck
  0 siblings, 1 reply; 20+ messages in thread
From: Michael S. Tsirkin @ 2023-03-13 13:06 UTC (permalink / raw)
  To: Christian Schoenebeck
  Cc: Stefan Hajnoczi, virtio-comment, Afsa, Baptiste,
	Eugenio Perez Martin

On Mon, Mar 13, 2023 at 12:48:11PM +0100, Christian Schoenebeck wrote:
> On Tuesday, March 7, 2023 1:40:24 PM CET Christian Schoenebeck wrote:
> > On Monday, March 6, 2023 10:50:53 PM CET Michael S. Tsirkin wrote:
> > > On Mon, Mar 06, 2023 at 03:46:01PM -0500, Stefan Hajnoczi wrote:
> > > > On Mon, Mar 06, 2023 at 12:41:25PM -0500, Michael S. Tsirkin wrote:
> > > > > On Mon, Mar 06, 2023 at 04:00:37PM +0100, Christian Schoenebeck wrote:
> > [...]
> > > > > > Anyhow, as this now gets broader scope, that also means the suggested flag
> > > > > > VIRTIO_RING_F_INDIRECT_SIZE needs to be renamed. VIRTIO_RING_F_BUFFER_SIZE?
> > > > > > 
> > > > > > Best regards,
> > > > > > Christian Schoenebeck
> > > > > 
> > > > > 
> > > > > Hmm that's unclear in that it might be in bytes too.
> > > > > Given blk and scsi call these "segments" how about
> > > > > VIRTIO_RING_F_SEG_MAX?
> > > > 
> > > > The VIRTIO equivalent of a "segment" is an "element".
> > > 
> > > Hmm true:
> > > 	A buffer consists of zero or more device-readable physically-contiguous
> > > 	elements followed by zero or more physically-contiguous
> > > 	device-writable elements (each buffer has at least one element).
> > > 
> > > However we then need to clean this up, since
> > > 
> > > - At least in one place we say
> > > 
> > > indirect elements to mean indirect descriptors.
> > > 
> > > - we also say "queue elements" to mean "avail/desc/used"
> > > - We also say "descriptor elements" - not 100% sure it's the same.
> > > 
> > > so we need to clean this up a bit first and maybe add
> > > text about indirect descriptors not counting as elements.
> > 
> > With VIRTIO_RING_F_BUFFER_SIZE I actually had in mind that this limit was tied
> > to per buffer/message (i.e. in contrast to a limit that would apply to the
> > entire queue). But you are right, the name could mislead to the interpretation
> > as if it was in bytes.
> > 
> > How about VIRTIO_RING_F_MAX_BUF_ELEMENTS?
> 
> Ping?
> 
> Michael, another thing: what was the rationale for your suggestion to exclude
> the indirect descriptor itself from this limit?
> 
> While I agree that it makes sense to broaden the scope to the total amount of
> all descriptors per buffer (as I wasn't aware before that it is indeed
> permitted to mix chained direct and indirect descriptors in a split queue
> buffer, and given that QEMU's implementation for instance just allocates them
> on the stack), however I currently don't see why it would make sense to
> exclude the indirect descriptor itself, as it makes things more complicated?
> Did you have something specific in mind for that exclusion?
> 
> Best regards,
> Christian Schoenebeck
> 

It matches the requirement to limit the number of s/g elements. For
example, both qemu and vhost need to limit this to 1K since  they are
using the iov[1024] structure to pass buffers around.
Indirect descriptors are just a different way to link to s/g elements
they are not s/g elements themselves.

-- 
MST


This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [virtio-comment] Re: VIRTIO_RING_F_INDIRECT_SIZE status
  2023-03-13 13:06                 ` Michael S. Tsirkin
@ 2023-03-13 13:48                   ` Christian Schoenebeck
  2023-03-13 13:54                     ` Michael S. Tsirkin
  0 siblings, 1 reply; 20+ messages in thread
From: Christian Schoenebeck @ 2023-03-13 13:48 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Stefan Hajnoczi, virtio-comment, Afsa, Baptiste,
	Eugenio Perez Martin

On Monday, March 13, 2023 2:06:00 PM CET Michael S. Tsirkin wrote:
> On Mon, Mar 13, 2023 at 12:48:11PM +0100, Christian Schoenebeck wrote:
> > On Tuesday, March 7, 2023 1:40:24 PM CET Christian Schoenebeck wrote:
> > > On Monday, March 6, 2023 10:50:53 PM CET Michael S. Tsirkin wrote:
> > > > On Mon, Mar 06, 2023 at 03:46:01PM -0500, Stefan Hajnoczi wrote:
> > > > > On Mon, Mar 06, 2023 at 12:41:25PM -0500, Michael S. Tsirkin wrote:
> > > > > > On Mon, Mar 06, 2023 at 04:00:37PM +0100, Christian Schoenebeck wrote:
> > > [...]
> > > > > > > Anyhow, as this now gets broader scope, that also means the suggested flag
> > > > > > > VIRTIO_RING_F_INDIRECT_SIZE needs to be renamed. VIRTIO_RING_F_BUFFER_SIZE?
> > > > > > > 
> > > > > > > Best regards,
> > > > > > > Christian Schoenebeck
> > > > > > 
> > > > > > 
> > > > > > Hmm that's unclear in that it might be in bytes too.
> > > > > > Given blk and scsi call these "segments" how about
> > > > > > VIRTIO_RING_F_SEG_MAX?
> > > > > 
> > > > > The VIRTIO equivalent of a "segment" is an "element".
> > > > 
> > > > Hmm true:
> > > > 	A buffer consists of zero or more device-readable physically-contiguous
> > > > 	elements followed by zero or more physically-contiguous
> > > > 	device-writable elements (each buffer has at least one element).
> > > > 
> > > > However we then need to clean this up, since
> > > > 
> > > > - At least in one place we say
> > > > 
> > > > indirect elements to mean indirect descriptors.
> > > > 
> > > > - we also say "queue elements" to mean "avail/desc/used"
> > > > - We also say "descriptor elements" - not 100% sure it's the same.
> > > > 
> > > > so we need to clean this up a bit first and maybe add
> > > > text about indirect descriptors not counting as elements.
> > > 
> > > With VIRTIO_RING_F_BUFFER_SIZE I actually had in mind that this limit was tied
> > > to per buffer/message (i.e. in contrast to a limit that would apply to the
> > > entire queue). But you are right, the name could mislead to the interpretation
> > > as if it was in bytes.
> > > 
> > > How about VIRTIO_RING_F_MAX_BUF_ELEMENTS?
> > 
> > Ping?
> > 
> > Michael, another thing: what was the rationale for your suggestion to exclude
> > the indirect descriptor itself from this limit?
> > 
> > While I agree that it makes sense to broaden the scope to the total amount of
> > all descriptors per buffer (as I wasn't aware before that it is indeed
> > permitted to mix chained direct and indirect descriptors in a split queue
> > buffer, and given that QEMU's implementation for instance just allocates them
> > on the stack), however I currently don't see why it would make sense to
> > exclude the indirect descriptor itself, as it makes things more complicated?
> > Did you have something specific in mind for that exclusion?
> > 
> > Best regards,
> > Christian Schoenebeck
> > 
> 
> It matches the requirement to limit the number of s/g elements. For
> example, both qemu and vhost need to limit this to 1K since  they are
> using the iov[1024] structure to pass buffers around.
> Indirect descriptors are just a different way to link to s/g elements
> they are not s/g elements themselves.

OK, I see, I am just wondering whether that wouldn't make phrasing this in the
spec unnecessarily more complicated and confusing. Not that I'd care much.

And what about my latest name suggestion VIRTIO_RING_F_MAX_BUF_ELEMENTS?

Best regards,
Christian Schoenebeck



This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [virtio-comment] Re: VIRTIO_RING_F_INDIRECT_SIZE status
  2023-03-13 13:48                   ` Christian Schoenebeck
@ 2023-03-13 13:54                     ` Michael S. Tsirkin
  0 siblings, 0 replies; 20+ messages in thread
From: Michael S. Tsirkin @ 2023-03-13 13:54 UTC (permalink / raw)
  To: Christian Schoenebeck
  Cc: Stefan Hajnoczi, virtio-comment, Afsa, Baptiste,
	Eugenio Perez Martin

On Mon, Mar 13, 2023 at 02:48:23PM +0100, Christian Schoenebeck wrote:
> On Monday, March 13, 2023 2:06:00 PM CET Michael S. Tsirkin wrote:
> > On Mon, Mar 13, 2023 at 12:48:11PM +0100, Christian Schoenebeck wrote:
> > > On Tuesday, March 7, 2023 1:40:24 PM CET Christian Schoenebeck wrote:
> > > > On Monday, March 6, 2023 10:50:53 PM CET Michael S. Tsirkin wrote:
> > > > > On Mon, Mar 06, 2023 at 03:46:01PM -0500, Stefan Hajnoczi wrote:
> > > > > > On Mon, Mar 06, 2023 at 12:41:25PM -0500, Michael S. Tsirkin wrote:
> > > > > > > On Mon, Mar 06, 2023 at 04:00:37PM +0100, Christian Schoenebeck wrote:
> > > > [...]
> > > > > > > > Anyhow, as this now gets broader scope, that also means the suggested flag
> > > > > > > > VIRTIO_RING_F_INDIRECT_SIZE needs to be renamed. VIRTIO_RING_F_BUFFER_SIZE?
> > > > > > > > 
> > > > > > > > Best regards,
> > > > > > > > Christian Schoenebeck
> > > > > > > 
> > > > > > > 
> > > > > > > Hmm that's unclear in that it might be in bytes too.
> > > > > > > Given blk and scsi call these "segments" how about
> > > > > > > VIRTIO_RING_F_SEG_MAX?
> > > > > > 
> > > > > > The VIRTIO equivalent of a "segment" is an "element".
> > > > > 
> > > > > Hmm true:
> > > > > 	A buffer consists of zero or more device-readable physically-contiguous
> > > > > 	elements followed by zero or more physically-contiguous
> > > > > 	device-writable elements (each buffer has at least one element).
> > > > > 
> > > > > However we then need to clean this up, since
> > > > > 
> > > > > - At least in one place we say
> > > > > 
> > > > > indirect elements to mean indirect descriptors.
> > > > > 
> > > > > - we also say "queue elements" to mean "avail/desc/used"
> > > > > - We also say "descriptor elements" - not 100% sure it's the same.
> > > > > 
> > > > > so we need to clean this up a bit first and maybe add
> > > > > text about indirect descriptors not counting as elements.
> > > > 
> > > > With VIRTIO_RING_F_BUFFER_SIZE I actually had in mind that this limit was tied
> > > > to per buffer/message (i.e. in contrast to a limit that would apply to the
> > > > entire queue). But you are right, the name could mislead to the interpretation
> > > > as if it was in bytes.
> > > > 
> > > > How about VIRTIO_RING_F_MAX_BUF_ELEMENTS?
> > > 
> > > Ping?
> > > 
> > > Michael, another thing: what was the rationale for your suggestion to exclude
> > > the indirect descriptor itself from this limit?
> > > 
> > > While I agree that it makes sense to broaden the scope to the total amount of
> > > all descriptors per buffer (as I wasn't aware before that it is indeed
> > > permitted to mix chained direct and indirect descriptors in a split queue
> > > buffer, and given that QEMU's implementation for instance just allocates them
> > > on the stack), however I currently don't see why it would make sense to
> > > exclude the indirect descriptor itself, as it makes things more complicated?
> > > Did you have something specific in mind for that exclusion?
> > > 
> > > Best regards,
> > > Christian Schoenebeck
> > > 
> > 
> > It matches the requirement to limit the number of s/g elements. For
> > example, both qemu and vhost need to limit this to 1K since  they are
> > using the iov[1024] structure to pass buffers around.
> > Indirect descriptors are just a different way to link to s/g elements
> > they are not s/g elements themselves.
> 
> OK, I see, I am just wondering whether that wouldn't make phrasing this in the
> spec unnecessarily more complicated and confusing. Not that I'd care much.
> 
> And what about my latest name suggestion VIRTIO_RING_F_MAX_BUF_ELEMENTS?
> 
> Best regards,
> Christian Schoenebeck


Fine by me. What do others think?


This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2023-03-13 13:54 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2022-10-13  7:45 [virtio-comment] [PATCH] Introduce VIRTIO_F_ISOLATE_INDIRECT_DESC feature Baptiste Afsa
2023-01-13 12:46 ` Michael S. Tsirkin
2023-01-17 15:19   ` Afsa, Baptiste
2023-01-17 18:27     ` Eugenio Perez Martin
2023-02-27 14:53       ` Afsa, Baptiste
2023-02-27 15:45         ` Stefan Hajnoczi
     [not found]           ` <2244126.gP0zCk8Q6A@silver>
2023-02-27 17:41             ` [virtio-comment] Re: VIRTIO_RING_F_INDIRECT_SIZE status Michael S. Tsirkin
     [not found]               ` <2494182.5W6NY9sLyD@silver>
2023-02-28 12:05                 ` Michael S. Tsirkin
     [not found] ` <6380471.4BWXO1n1mU@silver>
     [not found]   ` <Y/9Z5fphn34/HSKs@fedora>
     [not found]     ` <2458440.T3bEdP9vpG@silver>
2023-03-06 16:27       ` Stefan Hajnoczi
     [not found]   ` <20230301095017-mutt-send-email-mst@kernel.org>
     [not found]     ` <2812377.Px9Efocobp@silver>
2023-03-06 17:41       ` Michael S. Tsirkin
2023-03-06 20:46         ` Stefan Hajnoczi
2023-03-06 21:50           ` Michael S. Tsirkin
2023-03-07 12:40             ` Christian Schoenebeck
2023-03-13 11:48               ` Christian Schoenebeck
2023-03-13 13:06                 ` Michael S. Tsirkin
2023-03-13 13:48                   ` Christian Schoenebeck
2023-03-13 13:54                     ` Michael S. Tsirkin
2023-03-07 13:26             ` Stefan Hajnoczi
2023-03-07 16:47               ` Michael S. Tsirkin
2023-03-07 19:35                 ` Stefan Hajnoczi

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.