All of lore.kernel.org
 help / color / mirror / Atom feed
* [virtio-dev] virtio packed ring concerns
@ 2020-01-28 17:26 Walker, Benjamin
  2020-01-29  9:41 ` Stefan Hajnoczi
  2020-01-29  9:53 ` Michael S. Tsirkin
  0 siblings, 2 replies; 5+ messages in thread
From: Walker, Benjamin @ 2020-01-28 17:26 UTC (permalink / raw)
  To: virtio-dev@lists.oasis-open.org

Hi all - I'm a core maintainer for the SPDK project, which includes a vhost-user 
target implementation that handles both virtio-blk and virtio-scsi. Recently we
had some engineers go off and attempt to add support for packed virtqueues, as
defined here: 
https://github.com/oasis-tcs/virtio-spec/blob/master/packed-ring.tex

While reviewing the patches they submitted I noticed two separate issues. The
first issue is that the ring cannot function correctly under certain
circumstances by spec. The second issue is a performance issue. I'm going to try
to keep both of these entirely separate.

Comments are very welcome here, and I'm hoping I'm just missing some critical
piece of information that resolves the whole thing, but testing is so far
showing that I'm finding a real problem.

First, the functional issue. The problem scenario involves out of order
completions during ring wrap around. Here are the most relevant sections of the
specification:

~~~
Note: after reading driver descriptors and starting their processing in order,
the device might complete their processing out of order.  Used device
descriptors are written in the order in which their processing is complete.

The counter maintained by the driver is called the Driver Ring Wrap Counter. The
driver changes the value of this counter each time it makes available the last
descriptor in the ring (after making the last descriptor available).

The counter maintained by the device is called the Device Ring Wrap
Counter.  The device changes the value of this counter
each time it uses the last descriptor in
the ring (after marking the last descriptor used).

To mark a descriptor as available, the driver sets the VIRTQ_DESC_F_AVAIL bit in
Flags to match the internal Driver Ring Wrap Counter.  It also sets the
VIRTQ_DESC_F_USED bit to match the inverse value (i.e. to not match the internal
Driver Ring Wrap Counter).

To mark a descriptor as used, the device sets the VIRTQ_DESC_F_USED bit in Flags
to match the internal Device Ring Wrap Counter.  It also sets the
VIRTQ_DESC_F_AVAIL bit to match the \emph{same} value.
~~~

Imagine a packed ring with 4 descriptors. At some earlier time let's say the
first 3 descriptors were submitted and completed, in order. Then, two more
descriptors were submitted (so they went into slot 3 and then 0). Each time
through the ring, the "ring wrap counter" toggles from 1 to 0 or vice versa. The
avail and used flags are initialized to 1 at start up and the ring wrap counter
is 0.

So when submitting into slot 3, the driver flips the avail bit from 1 to 0. When
submitting into slot 0 after the ring-wrap, it flips the avail bit from 0 to 1.
The spec says it also sets the used flag, but the used flag is already the
correct value. That's all solid so far.

The device is then polling on slot 3 waiting for new descriptors to be
submitted. Note here that the device must be maintaining its own copy of what it
thinks is the driver's ring wrap counter value to detect descriptors being made
available, which isn't called out in the specification at all but all actual
implementations do. Regardless, once it sees the avail flag in the descriptor in
slot 3 flip, it begins processing the descriptor. It then moves on to poll slot
0 for incoming descriptors (and updates its expected avail ring wrap counter
value internally). In this case, it sees the bit flip for slot 0 and begins
processing the descriptor. Descriptor processing is asynchronous and may
complete out of order, at least for storage devices.

So let's say that the descriptors do, in fact, complete out of order, such that
the descriptor in slot 0 completes first and then the descriptor in slot 3
completes second. The specification is very clear that this is allowed.

When completing the descriptor in slot 0, the device has not yet completed the
last descriptor in the ring (they're out of order!). So the ring wrap counter is
still 0. So it then proceeds to write both the avail flag and the used flag to
0. But the driver is expecting the used flag to get flipped to 1 to indicate a
completion. This is broken.

I wrote a unit test for the SPDK implementation and it is indeed broken in this
scenario, causing the driver to hang. I've also at least looked through the code
in DPDK and in the Linux kernel implementations and they appear broken in
various ways as well. It appears that DPDK may decide the ring is corrupt and
"drop the packet" in this scenario, which is data corruption for storage, but I
haven't yet run a test to confirm. The Linux kernel implementation looks to me
like it will never consider the descriptor as complete and eventually hang.

The device doesn't actually need to track a ring wrap counter for the used flag
at all. The device does need to track a ring wrap counter for the expected value
of the avail bit, to mirror the driver side value while polling for incoming
descriptors. But to mark descriptors as used, it simply needs to set the used
flag to the current value of the avail flag in that descriptor.

----------------------

Moving on to my separate concerns about performance. Critically, it's important
to note that storage devices these days are inherently parallel pieces of
hardware. You can submit a large number of commands at once and the device
actually does process some number of them simultaneously. That means things
complete out of order all the time. For example, imagine submitting two commands
to a NAND flash storage device where one is a 1MB read and the second one is a
4k read. Let's say that the data these are reading are on entirely separate NAND
chips internally. That 4k read will complete ahead of the 1MB read, and it would
be awful to force that small read to wait on the larger one.

But with the packed ring design, there's only one ring for both submission and
completion, so descriptor slots can only be re-used in order. Even if the device
implementation is able to complete the 4K read out of order ahead of time and
mark the descriptor as used, a driver that's polling for completions won't see
it until the earlier descriptor completes. A driver using interrupts may be able
to see it if the interrupt tells it specifically which slot to look at, but it
can't re-use the descriptor slot for a new command because submissions must go
in order. While the bug I outlined in the first part of my email is fixable,
this design issue is not.

Thanks,
Ben

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [virtio-dev] virtio packed ring concerns
  2020-01-28 17:26 [virtio-dev] virtio packed ring concerns Walker, Benjamin
@ 2020-01-29  9:41 ` Stefan Hajnoczi
  2020-01-29 18:54   ` Walker, Benjamin
  2020-01-29  9:53 ` Michael S. Tsirkin
  1 sibling, 1 reply; 5+ messages in thread
From: Stefan Hajnoczi @ 2020-01-29  9:41 UTC (permalink / raw)
  To: Walker, Benjamin; +Cc: virtio-dev@lists.oasis-open.org

[-- Attachment #1: Type: text/plain, Size: 8463 bytes --]

On Tue, Jan 28, 2020 at 05:26:07PM +0000, Walker, Benjamin wrote:
> Hi all - I'm a core maintainer for the SPDK project, which includes a vhost-user 
> target implementation that handles both virtio-blk and virtio-scsi. Recently we
> had some engineers go off and attempt to add support for packed virtqueues, as
> defined here: 
> https://github.com/oasis-tcs/virtio-spec/blob/master/packed-ring.tex
> 
> While reviewing the patches they submitted I noticed two separate issues. The
> first issue is that the ring cannot function correctly under certain
> circumstances by spec. The second issue is a performance issue. I'm going to try
> to keep both of these entirely separate.
> 
> Comments are very welcome here, and I'm hoping I'm just missing some critical
> piece of information that resolves the whole thing, but testing is so far
> showing that I'm finding a real problem.
> 
> First, the functional issue. The problem scenario involves out of order
> completions during ring wrap around. Here are the most relevant sections of the
> specification:
> 
> ~~~
> Note: after reading driver descriptors and starting their processing in order,
> the device might complete their processing out of order.  Used device
> descriptors are written in the order in which their processing is complete.
> 
> The counter maintained by the driver is called the Driver Ring Wrap Counter. The
> driver changes the value of this counter each time it makes available the last
> descriptor in the ring (after making the last descriptor available).
> 
> The counter maintained by the device is called the Device Ring Wrap
> Counter.  The device changes the value of this counter
> each time it uses the last descriptor in
> the ring (after marking the last descriptor used).
> 
> To mark a descriptor as available, the driver sets the VIRTQ_DESC_F_AVAIL bit in
> Flags to match the internal Driver Ring Wrap Counter.  It also sets the
> VIRTQ_DESC_F_USED bit to match the inverse value (i.e. to not match the internal
> Driver Ring Wrap Counter).
> 
> To mark a descriptor as used, the device sets the VIRTQ_DESC_F_USED bit in Flags
> to match the internal Device Ring Wrap Counter.  It also sets the
> VIRTQ_DESC_F_AVAIL bit to match the \emph{same} value.
> ~~~
> 
> Imagine a packed ring with 4 descriptors. At some earlier time let's say the
> first 3 descriptors were submitted and completed, in order. Then, two more
> descriptors were submitted (so they went into slot 3 and then 0). Each time
> through the ring, the "ring wrap counter" toggles from 1 to 0 or vice versa. The
> avail and used flags are initialized to 1 at start up and the ring wrap counter
> is 0.

Both the Driver and Device Ring Wrap Counters are initialized to 1
according to the spec, not 0.  This doesn't matter for the rest of the
discussion though.  I just wanted to mention it in case your
implementation has the value inverted.

> So when submitting into slot 3, the driver flips the avail bit from 1 to 0. When
> submitting into slot 0 after the ring-wrap, it flips the avail bit from 0 to 1.
> The spec says it also sets the used flag, but the used flag is already the
> correct value. That's all solid so far.
> 
> The device is then polling on slot 3 waiting for new descriptors to be
> submitted. Note here that the device must be maintaining its own copy of what it
> thinks is the driver's ring wrap counter value to detect descriptors being made
> available, which isn't called out in the specification at all but all actual
> implementations do. Regardless, once it sees the avail flag in the descriptor in
> slot 3 flip, it begins processing the descriptor. It then moves on to poll slot
> 0 for incoming descriptors (and updates its expected avail ring wrap counter
> value internally). In this case, it sees the bit flip for slot 0 and begins
> processing the descriptor. Descriptor processing is asynchronous and may
> complete out of order, at least for storage devices.
> 
> So let's say that the descriptors do, in fact, complete out of order, such that
> the descriptor in slot 0 completes first and then the descriptor in slot 3
> completes second. The specification is very clear that this is allowed.
> 
> When completing the descriptor in slot 0, the device has not yet completed the
> last descriptor in the ring (they're out of order!). So the ring wrap counter is
> still 0. So it then proceeds to write both the avail flag and the used flag to
> 0. But the driver is expecting the used flag to get flipped to 1 to indicate a
> completion. This is broken.

You're saying used descriptors overwrite the same descriptor slot where
they were made available by the driver.  This is not the case.  The
device overwrites the descriptor ring in completion order.  See my
response to your concern about performance for an illustration.

> I wrote a unit test for the SPDK implementation and it is indeed broken in this
> scenario, causing the driver to hang. I've also at least looked through the code
> in DPDK and in the Linux kernel implementations and they appear broken in
> various ways as well. It appears that DPDK may decide the ring is corrupt and
> "drop the packet" in this scenario, which is data corruption for storage, but I
> haven't yet run a test to confirm. The Linux kernel implementation looks to me
> like it will never consider the descriptor as complete and eventually hang.
> 
> The device doesn't actually need to track a ring wrap counter for the used flag
> at all. The device does need to track a ring wrap counter for the expected value
> of the avail bit, to mirror the driver side value while polling for incoming
> descriptors. But to mark descriptors as used, it simply needs to set the used
> flag to the current value of the avail flag in that descriptor.
> 
> ----------------------
> 
> Moving on to my separate concerns about performance. Critically, it's important
> to note that storage devices these days are inherently parallel pieces of
> hardware. You can submit a large number of commands at once and the device
> actually does process some number of them simultaneously. That means things
> complete out of order all the time. For example, imagine submitting two commands
> to a NAND flash storage device where one is a 1MB read and the second one is a
> 4k read. Let's say that the data these are reading are on entirely separate NAND
> chips internally. That 4k read will complete ahead of the 1MB read, and it would
> be awful to force that small read to wait on the larger one.
> 
> But with the packed ring design, there's only one ring for both submission and
> completion, so descriptor slots can only be re-used in order. Even if the device
> implementation is able to complete the 4K read out of order ahead of time and
> mark the descriptor as used, a driver that's polling for completions won't see
> it until the earlier descriptor completes. A driver using interrupts may be able
> to see it if the interrupt tells it specifically which slot to look at, but it
> can't re-use the descriptor slot for a new command because submissions must go
> in order. While the bug I outlined in the first part of my email is fixable,
> this design issue is not.

The spec says:

  When the device has finished processing the buffer, it writes a used
  device descriptor including the Buffer ID into the Descriptor Ring
  (overwriting a driver descriptor previously made available)

  ...

  Note: after reading driver descriptors and starting their processing
  in order, the device might complete their processing out of order.
  Used device descriptors are written in the order in which their
  processing is complete.

Say the write has the following requests:

  | 1M (Avail) | 4K (Avail) |

The device sees both requests and completes the 4K request.  The device
overwrites the oldest available descriptor:

  | 4K (Used)  | 4K (Avail/seen) |

The driver polls the first descriptor for completions.  Therefore, the
driver can process the 4K request completion while the 1M request is
still pending.  Now the driver can submit the next request while the 1M
request is in flight:

  | 8K (Avail) | 4K (Avail/seen) |

With this mental model of how the packed ring layout works I think
neither of the issues you've raised is a problem.  Have I missed
anything?

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [virtio-dev] virtio packed ring concerns
  2020-01-28 17:26 [virtio-dev] virtio packed ring concerns Walker, Benjamin
  2020-01-29  9:41 ` Stefan Hajnoczi
@ 2020-01-29  9:53 ` Michael S. Tsirkin
  2020-01-29 18:57   ` Walker, Benjamin
  1 sibling, 1 reply; 5+ messages in thread
From: Michael S. Tsirkin @ 2020-01-29  9:53 UTC (permalink / raw)
  To: Walker, Benjamin; +Cc: virtio-dev@lists.oasis-open.org

I second Stefan's comments. To add to that, I think the spec
could be improved to avoid this kind of misunderstanding.
See comments below:

On Tue, Jan 28, 2020 at 05:26:07PM +0000, Walker, Benjamin wrote:
> Hi all - I'm a core maintainer for the SPDK project, which includes a vhost-user 
> target implementation that handles both virtio-blk and virtio-scsi. Recently we
> had some engineers go off and attempt to add support for packed virtqueues, as
> defined here: 
> https://github.com/oasis-tcs/virtio-spec/blob/master/packed-ring.tex
> 
> While reviewing the patches they submitted I noticed two separate issues. The
> first issue is that the ring cannot function correctly under certain
> circumstances by spec. The second issue is a performance issue. I'm going to try
> to keep both of these entirely separate.
> 
> Comments are very welcome here, and I'm hoping I'm just missing some critical
> piece of information that resolves the whole thing, but testing is so far
> showing that I'm finding a real problem.
> 
> First, the functional issue. The problem scenario involves out of order
> completions during ring wrap around. Here are the most relevant sections of the
> specification:
> 
> ~~~
> Note: after reading driver descriptors and starting their processing in order,
> the device might complete their processing out of order.  Used device
> descriptors are written in the order in which their processing is complete.
> 
> The counter maintained by the driver is called the Driver Ring Wrap Counter. The
> driver changes the value of this counter each time it makes available the last
> descriptor in the ring (after making the last descriptor available).
> 
> The counter maintained by the device is called the Device Ring Wrap
> Counter.  The device changes the value of this counter
> each time it uses the last descriptor in
> the ring (after marking the last descriptor used).
> 
> To mark a descriptor as available, the driver sets the VIRTQ_DESC_F_AVAIL bit in
> Flags to match the internal Driver Ring Wrap Counter.  It also sets the
> VIRTQ_DESC_F_USED bit to match the inverse value (i.e. to not match the internal
> Driver Ring Wrap Counter).
> 
> To mark a descriptor as used, the device sets the VIRTQ_DESC_F_USED bit in Flags
> to match the internal Device Ring Wrap Counter.  It also sets the
> VIRTQ_DESC_F_AVAIL bit to match the \emph{same} value.
> ~~~
> 
> Imagine a packed ring with 4 descriptors. At some earlier time let's say the
> first 3 descriptors were submitted and completed, in order. Then, two more
> descriptors were submitted (so they went into slot 3 and then 0).

And let's assume they have IDs 3 and 0, respectively.

> Each time
> through the ring, the "ring wrap counter" toggles from 1 to 0 or vice versa. The
> avail and used flags are initialized to 1 at start up and the ring wrap counter
> is 0.
> 
> So when submitting into slot 3, the driver flips the avail bit from 1 to 0. When
> submitting into slot 0 after the ring-wrap, it flips the avail bit from 0 to 1.
> The spec says it also sets the used flag, but the used flag is already the
> correct value. That's all solid so far.
> 
> The device is then polling on slot 3 waiting for new descriptors to be
> submitted. Note here that the device must be maintaining its own copy of what it
> thinks is the driver's ring wrap counter value to detect descriptors being made
> available, which isn't called out in the specification at all but all actual
> implementations do. Regardless, once it sees the avail flag in the descriptor in
> slot 3 flip, it begins processing the descriptor. It then moves on to poll slot
> 0 for incoming descriptors (and updates its expected avail ring wrap counter
> value internally). In this case, it sees the bit flip for slot 0 and begins
> processing the descriptor. Descriptor processing is asynchronous and may
> complete out of order, at least for storage devices.
> 
> So let's say that the descriptors do, in fact, complete out of order, such that
> the descriptor in slot 0 completes first and then the descriptor in slot 3
> completes second. The specification is very clear that this is allowed.
> 
> When completing the descriptor in slot 0, the device has not yet completed the
> last descriptor in the ring (they're out of order!). So the ring wrap counter is
> still 0. So it then proceeds to write both the avail flag and the used flag to
> 0. But the driver is expecting the used flag to get flipped to 1 to indicate a
> completion. This is broken.

Right but the spec says
	 Used device
	 descriptors are written in the order in which their processing is complete.

so what should happen, is this:
	 descriptor with ID 0 is written into slot 3.
	 descriptor with ID 3 is written into slot 0.


The following might be helpful to describe this better:

the spec already says:

	The Descriptor Ring is used in a circular manner: the driver writes descriptors into the ring in order. After
	reaching the end of the ring, the next descriptor is placed at the head of the ring. Once the ring is full of driver
	descriptors, the driver stops sending new requests and waits for the device to start processing descriptors
	and to write out some used descriptors before making new driver descriptors available.

and maybe we should add:

	Similarly, device writes out used descriptors in a circular manner:
	starting at the head of the ring, and until it reaches the end of the
	ring. After reaching the end of the ring, the next used descriptor is
	again placed at the head of the ring.


Also spec currently says:

	The counter maintained by the device is called the Device Ring Wrap Counter. The device changes the
	value of this counter each time it uses the last descriptor in the ring (after marking the last descriptor used).

in fact this part may be confusing: it is not clear what does "uses the last descriptor"
mean - you took it to mean "writes out descriptor with ID from the last
descriptor", what spec means is "writes out a used descriptor over the
last descriptor location".  The text also does not properly account for skipping descriptors.

What it should say is:

	The counter maintained by the device is called the Device Ring Wrap Counter. The device changes the
	value of this counter each time it reaches the last descriptor
	in the ring when writing used descriptors
        (after writing out or skipping over the last used descriptor).

Please let me know whether the changes suggested above would clarify
things for you, or whether you have other suggestions. Thanks!

-- 
MST


---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [virtio-dev] virtio packed ring concerns
  2020-01-29  9:41 ` Stefan Hajnoczi
@ 2020-01-29 18:54   ` Walker, Benjamin
  0 siblings, 0 replies; 5+ messages in thread
From: Walker, Benjamin @ 2020-01-29 18:54 UTC (permalink / raw)
  To: stefanha@redhat.com; +Cc: virtio-dev@lists.oasis-open.org

On Wed, 2020-01-29 at 09:41 +0000, Stefan Hajnoczi wrote:
> On Tue, Jan 28, 2020 at 05:26:07PM +0000, Walker, Benjamin wrote:
> > Hi all - I'm a core maintainer for the SPDK project, which includes a vhost-
> > user 
> > target implementation that handles both virtio-blk and virtio-scsi. Recently
> > we
> > had some engineers go off and attempt to add support for packed virtqueues,
> > as
> > defined here: 
> > https://github.com/oasis-tcs/virtio-spec/blob/master/packed-ring.tex
> > 
> > While reviewing the patches they submitted I noticed two separate issues.
> > The
> > first issue is that the ring cannot function correctly under certain
> > circumstances by spec. The second issue is a performance issue. I'm going to
> > try
> > to keep both of these entirely separate.
> > 
> > Comments are very welcome here, and I'm hoping I'm just missing some
> > critical
> > piece of information that resolves the whole thing, but testing is so far
> > showing that I'm finding a real problem.
> > 
> > First, the functional issue. The problem scenario involves out of order
> > completions during ring wrap around. Here are the most relevant sections of
> > the
> > specification:
> > 
> > ~~~
> > Note: after reading driver descriptors and starting their processing in
> > order,
> > the device might complete their processing out of order.  Used device
> > descriptors are written in the order in which their processing is complete.
> > 
> > The counter maintained by the driver is called the Driver Ring Wrap Counter.
> > The
> > driver changes the value of this counter each time it makes available the
> > last
> > descriptor in the ring (after making the last descriptor available).
> > 
> > The counter maintained by the device is called the Device Ring Wrap
> > Counter.  The device changes the value of this counter
> > each time it uses the last descriptor in
> > the ring (after marking the last descriptor used).
> > 
> > To mark a descriptor as available, the driver sets the VIRTQ_DESC_F_AVAIL
> > bit in
> > Flags to match the internal Driver Ring Wrap Counter.  It also sets the
> > VIRTQ_DESC_F_USED bit to match the inverse value (i.e. to not match the
> > internal
> > Driver Ring Wrap Counter).
> > 
> > To mark a descriptor as used, the device sets the VIRTQ_DESC_F_USED bit in
> > Flags
> > to match the internal Device Ring Wrap Counter.  It also sets the
> > VIRTQ_DESC_F_AVAIL bit to match the \emph{same} value.
> > ~~~
> > 
> > Imagine a packed ring with 4 descriptors. At some earlier time let's say the
> > first 3 descriptors were submitted and completed, in order. Then, two more
> > descriptors were submitted (so they went into slot 3 and then 0). Each time
> > through the ring, the "ring wrap counter" toggles from 1 to 0 or vice versa.
> > The
> > avail and used flags are initialized to 1 at start up and the ring wrap
> > counter
> > is 0.
> 
> Both the Driver and Device Ring Wrap Counters are initialized to 1
> according to the spec, not 0.  This doesn't matter for the rest of the
> discussion though.  I just wanted to mention it in case your
> implementation has the value inverted.
> 
> > So when submitting into slot 3, the driver flips the avail bit from 1 to 0.
> > When
> > submitting into slot 0 after the ring-wrap, it flips the avail bit from 0 to
> > 1.
> > The spec says it also sets the used flag, but the used flag is already the
> > correct value. That's all solid so far.
> > 
> > The device is then polling on slot 3 waiting for new descriptors to be
> > submitted. Note here that the device must be maintaining its own copy of
> > what it
> > thinks is the driver's ring wrap counter value to detect descriptors being
> > made
> > available, which isn't called out in the specification at all but all actual
> > implementations do. Regardless, once it sees the avail flag in the
> > descriptor in
> > slot 3 flip, it begins processing the descriptor. It then moves on to poll
> > slot
> > 0 for incoming descriptors (and updates its expected avail ring wrap counter
> > value internally). In this case, it sees the bit flip for slot 0 and begins
> > processing the descriptor. Descriptor processing is asynchronous and may
> > complete out of order, at least for storage devices.
> > 
> > So let's say that the descriptors do, in fact, complete out of order, such
> > that
> > the descriptor in slot 0 completes first and then the descriptor in slot 3
> > completes second. The specification is very clear that this is allowed.
> > 
> > When completing the descriptor in slot 0, the device has not yet completed
> > the
> > last descriptor in the ring (they're out of order!). So the ring wrap
> > counter is
> > still 0. So it then proceeds to write both the avail flag and the used flag
> > to
> > 0. But the driver is expecting the used flag to get flipped to 1 to indicate
> > a
> > completion. This is broken.
> 
> You're saying used descriptors overwrite the same descriptor slot where
> they were made available by the driver.  This is not the case.  The
> device overwrites the descriptor ring in completion order.  See my
> response to your concern about performance for an illustration.

Thank you! Now I get it! Here's the portion of the spec that describes this
behavior. I agonized over this for hours trying to decipher what it was saying,
and even sent it to several other engineers, and we still misunderstood.

~~~
The driver then notifies the device. When the device has finished
processing the buffer, it writes a used device descriptor
including the Buffer ID into the Descriptor Ring (overwriting a
driver descriptor previously made available), and sends a
used event notification.

The Descriptor Ring is used in a circular manner: the driver writes
descriptors into the ring in order. After reaching the end of the ring,
the next descriptor is placed at the head of the ring.  Once the ring is
full of driver descriptors, the driver stops sending new requests and
waits for the device to start processing descriptors and to write out
some used descriptors before making new driver descriptors
available.

Similarly, the device reads descriptors from the ring in order and
detects that a driver descriptor has been made available.  As
processing of descriptors is completed, used descriptors are
written by the device back into the ring.
~~~

Note that the wording says the "driver writes descriptors into the ring in
order" - it doesn't say the device does. It does say the device "reads
descriptors from the ring in order" but it doesn't say it writes them back in
order. The only hint of the behavior you're saying is in the first paragraph,
where it says "(overwriting a driver descriptor previously made available)"
because why would it be worded that way if it was updating a descriptor instead
of completely replacing it. But unfortunately I wasn't able to piece it all
together correctly.

> 
> > I wrote a unit test for the SPDK implementation and it is indeed broken in
> > this
> > scenario, causing the driver to hang. I've also at least looked through the
> > code
> > in DPDK and in the Linux kernel implementations and they appear broken in
> > various ways as well. It appears that DPDK may decide the ring is corrupt
> > and
> > "drop the packet" in this scenario, which is data corruption for storage,
> > but I
> > haven't yet run a test to confirm. The Linux kernel implementation looks to
> > me
> > like it will never consider the descriptor as complete and eventually hang.
> > 
> > The device doesn't actually need to track a ring wrap counter for the used
> > flag
> > at all. The device does need to track a ring wrap counter for the expected
> > value
> > of the avail bit, to mirror the driver side value while polling for incoming
> > descriptors. But to mark descriptors as used, it simply needs to set the
> > used
> > flag to the current value of the avail flag in that descriptor.
> > 
> > ----------------------
> > 
> > Moving on to my separate concerns about performance. Critically, it's
> > important
> > to note that storage devices these days are inherently parallel pieces of
> > hardware. You can submit a large number of commands at once and the device
> > actually does process some number of them simultaneously. That means things
> > complete out of order all the time. For example, imagine submitting two
> > commands
> > to a NAND flash storage device where one is a 1MB read and the second one is
> > a
> > 4k read. Let's say that the data these are reading are on entirely separate
> > NAND
> > chips internally. That 4k read will complete ahead of the 1MB read, and it
> > would
> > be awful to force that small read to wait on the larger one.
> > 
> > But with the packed ring design, there's only one ring for both submission
> > and
> > completion, so descriptor slots can only be re-used in order. Even if the
> > device
> > implementation is able to complete the 4K read out of order ahead of time
> > and
> > mark the descriptor as used, a driver that's polling for completions won't
> > see
> > it until the earlier descriptor completes. A driver using interrupts may be
> > able
> > to see it if the interrupt tells it specifically which slot to look at, but
> > it
> > can't re-use the descriptor slot for a new command because submissions must
> > go
> > in order. While the bug I outlined in the first part of my email is fixable,
> > this design issue is not.
> 
> The spec says:
> 
>   When the device has finished processing the buffer, it writes a used
>   device descriptor including the Buffer ID into the Descriptor Ring
>   (overwriting a driver descriptor previously made available)
> 
>   ...
> 
>   Note: after reading driver descriptors and starting their processing
>   in order, the device might complete their processing out of order.
>   Used device descriptors are written in the order in which their
>   processing is complete.
> 
> Say the write has the following requests:
> 
>   | 1M (Avail) | 4K (Avail) |
> 
> The device sees both requests and completes the 4K request.  The device
> overwrites the oldest available descriptor:
> 
>   | 4K (Used)  | 4K (Avail/seen) |
> 
> The driver polls the first descriptor for completions.  Therefore, the
> driver can process the 4K request completion while the 1M request is
> still pending.  Now the driver can submit the next request while the 1M
> request is in flight:
> 
>   | 8K (Avail) | 4K (Avail/seen) |
> 
> With this mental model of how the packed ring layout works I think
> neither of the issues you've raised is a problem.  Have I missed
> anything?
> 

Now that I see that the used descriptors are written in order, replacing
whatever descriptor was previously in the slot, I no longer have performance
concerns.

> Stefan


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [virtio-dev] virtio packed ring concerns
  2020-01-29  9:53 ` Michael S. Tsirkin
@ 2020-01-29 18:57   ` Walker, Benjamin
  0 siblings, 0 replies; 5+ messages in thread
From: Walker, Benjamin @ 2020-01-29 18:57 UTC (permalink / raw)
  To: mst@redhat.com; +Cc: virtio-dev@lists.oasis-open.org

On Wed, 2020-01-29 at 04:53 -0500, Michael S. Tsirkin wrote:
> I second Stefan's comments. To add to that, I think the spec
> could be improved to avoid this kind of misunderstanding.
> See comments below:
> 
> On Tue, Jan 28, 2020 at 05:26:07PM +0000, Walker, Benjamin wrote:
> > Hi all - I'm a core maintainer for the SPDK project, which includes a vhost-
> > user 
> > target implementation that handles both virtio-blk and virtio-scsi. Recently
> > we
> > had some engineers go off and attempt to add support for packed virtqueues,
> > as
> > defined here: 
> > https://github.com/oasis-tcs/virtio-spec/blob/master/packed-ring.tex
> > 
> > While reviewing the patches they submitted I noticed two separate issues.
> > The
> > first issue is that the ring cannot function correctly under certain
> > circumstances by spec. The second issue is a performance issue. I'm going to
> > try
> > to keep both of these entirely separate.
> > 
> > Comments are very welcome here, and I'm hoping I'm just missing some
> > critical
> > piece of information that resolves the whole thing, but testing is so far
> > showing that I'm finding a real problem.
> > 
> > First, the functional issue. The problem scenario involves out of order
> > completions during ring wrap around. Here are the most relevant sections of
> > the
> > specification:
> > 
> > ~~~
> > Note: after reading driver descriptors and starting their processing in
> > order,
> > the device might complete their processing out of order.  Used device
> > descriptors are written in the order in which their processing is complete.
> > 
> > The counter maintained by the driver is called the Driver Ring Wrap Counter.
> > The
> > driver changes the value of this counter each time it makes available the
> > last
> > descriptor in the ring (after making the last descriptor available).
> > 
> > The counter maintained by the device is called the Device Ring Wrap
> > Counter.  The device changes the value of this counter
> > each time it uses the last descriptor in
> > the ring (after marking the last descriptor used).
> > 
> > To mark a descriptor as available, the driver sets the VIRTQ_DESC_F_AVAIL
> > bit in
> > Flags to match the internal Driver Ring Wrap Counter.  It also sets the
> > VIRTQ_DESC_F_USED bit to match the inverse value (i.e. to not match the
> > internal
> > Driver Ring Wrap Counter).
> > 
> > To mark a descriptor as used, the device sets the VIRTQ_DESC_F_USED bit in
> > Flags
> > to match the internal Device Ring Wrap Counter.  It also sets the
> > VIRTQ_DESC_F_AVAIL bit to match the \emph{same} value.
> > ~~~
> > 
> > Imagine a packed ring with 4 descriptors. At some earlier time let's say the
> > first 3 descriptors were submitted and completed, in order. Then, two more
> > descriptors were submitted (so they went into slot 3 and then 0).
> 
> And let's assume they have IDs 3 and 0, respectively.
> 
> > Each time
> > through the ring, the "ring wrap counter" toggles from 1 to 0 or vice versa.
> > The
> > avail and used flags are initialized to 1 at start up and the ring wrap
> > counter
> > is 0.
> > 
> > So when submitting into slot 3, the driver flips the avail bit from 1 to 0.
> > When
> > submitting into slot 0 after the ring-wrap, it flips the avail bit from 0 to
> > 1.
> > The spec says it also sets the used flag, but the used flag is already the
> > correct value. That's all solid so far.
> > 
> > The device is then polling on slot 3 waiting for new descriptors to be
> > submitted. Note here that the device must be maintaining its own copy of
> > what it
> > thinks is the driver's ring wrap counter value to detect descriptors being
> > made
> > available, which isn't called out in the specification at all but all actual
> > implementations do. Regardless, once it sees the avail flag in the
> > descriptor in
> > slot 3 flip, it begins processing the descriptor. It then moves on to poll
> > slot
> > 0 for incoming descriptors (and updates its expected avail ring wrap counter
> > value internally). In this case, it sees the bit flip for slot 0 and begins
> > processing the descriptor. Descriptor processing is asynchronous and may
> > complete out of order, at least for storage devices.
> > 
> > So let's say that the descriptors do, in fact, complete out of order, such
> > that
> > the descriptor in slot 0 completes first and then the descriptor in slot 3
> > completes second. The specification is very clear that this is allowed.
> > 
> > When completing the descriptor in slot 0, the device has not yet completed
> > the
> > last descriptor in the ring (they're out of order!). So the ring wrap
> > counter is
> > still 0. So it then proceeds to write both the avail flag and the used flag
> > to
> > 0. But the driver is expecting the used flag to get flipped to 1 to indicate
> > a
> > completion. This is broken.
> 
> Right but the spec says
> 	 Used device
> 	 descriptors are written in the order in which their processing is
> complete.
> 
> so what should happen, is this:
> 	 descriptor with ID 0 is written into slot 3.
> 	 descriptor with ID 3 is written into slot 0.
> 
> 
> The following might be helpful to describe this better:
> 
> the spec already says:
> 
> 	The Descriptor Ring is used in a circular manner: the driver writes
> descriptors into the ring in order. After
> 	reaching the end of the ring, the next descriptor is placed at the head
> of the ring. Once the ring is full of driver
> 	descriptors, the driver stops sending new requests and waits for the
> device to start processing descriptors
> 	and to write out some used descriptors before making new driver
> descriptors available.
> 
> and maybe we should add:
> 
> 	Similarly, device writes out used descriptors in a circular manner:
> 	starting at the head of the ring, and until it reaches the end of the
> 	ring. After reaching the end of the ring, the next used descriptor is
> 	again placed at the head of the ring.

Yes please! No where in the current wording does it say the device writes the
descriptors circularly. It says the driver writes them circularly and that the
device reads them circularly only.

> 
> 
> Also spec currently says:
> 
> 	The counter maintained by the device is called the Device Ring Wrap
> Counter. The device changes the
> 	value of this counter each time it uses the last descriptor in the ring
> (after marking the last descriptor used).
> 
> in fact this part may be confusing: it is not clear what does "uses the last
> descriptor"
> mean - you took it to mean "writes out descriptor with ID from the last
> descriptor", what spec means is "writes out a used descriptor over the
> last descriptor location".  The text also does not properly account for
> skipping descriptors.
> 
> What it should say is:
> 
> 	The counter maintained by the device is called the Device Ring Wrap
> Counter. The device changes the
> 	value of this counter each time it reaches the last descriptor
> 	in the ring when writing used descriptors
>         (after writing out or skipping over the last used descriptor).
> 
> Please let me know whether the changes suggested above would clarify
> things for you, or whether you have other suggestions. Thanks!
> 

I wholeheartedly agree that the new wording is much better and would have made
this much more obvious. Thanks for the assistance.

Thanks,
Ben

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2020-01-29 18:58 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2020-01-28 17:26 [virtio-dev] virtio packed ring concerns Walker, Benjamin
2020-01-29  9:41 ` Stefan Hajnoczi
2020-01-29 18:54   ` Walker, Benjamin
2020-01-29  9:53 ` Michael S. Tsirkin
2020-01-29 18:57   ` Walker, Benjamin

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.