* [PATCH 0/7] vhost-user: call VHOST_USER_SET_VRING_ENABLE synchronously
@ 2023-08-27 18:29 Laszlo Ersek
2023-08-27 18:29 ` [PATCH 1/7] vhost-user: strip superfluous whitespace Laszlo Ersek
` (7 more replies)
0 siblings, 8 replies; 58+ messages in thread
From: Laszlo Ersek @ 2023-08-27 18:29 UTC (permalink / raw)
To: qemu-devel, lersek
Cc: Michael S. Tsirkin, Eugenio Perez Martin, German Maglione,
Liu Jiang, Sergio Lopez Pascual, Stefano Garzarella
The last patch in the series states and fixes the problem; prior patches
only refactor the code.
Cc: "Michael S. Tsirkin" <mst@redhat.com> (supporter:vhost)
Cc: Eugenio Perez Martin <eperezma@redhat.com>
Cc: German Maglione <gmaglione@redhat.com>
Cc: Liu Jiang <gerry@linux.alibaba.com>
Cc: Sergio Lopez Pascual <slp@redhat.com>
Cc: Stefano Garzarella <sgarzare@redhat.com>
Thanks,
Laszlo
Laszlo Ersek (7):
vhost-user: strip superfluous whitespace
vhost-user: tighten "reply_supported" scope in "set_vring_addr"
vhost-user: factor out "vhost_user_write_msg"
vhost-user: flatten "enforce_reply" into "vhost_user_write_msg"
vhost-user: hoist "write_msg", "get_features", "get_u64"
vhost-user: allow "vhost_set_vring" to wait for a reply
vhost-user: call VHOST_USER_SET_VRING_ENABLE synchronously
hw/virtio/vhost-user.c | 202 +++++++++-----------
1 file changed, 94 insertions(+), 108 deletions(-)
base-commit: 50e7a40af372ee5931c99ef7390f5d3d6fbf6ec4
* [PATCH 1/7] vhost-user: strip superfluous whitespace
2023-08-27 18:29 [PATCH 0/7] vhost-user: call VHOST_USER_SET_VRING_ENABLE synchronously Laszlo Ersek
@ 2023-08-27 18:29 ` Laszlo Ersek
2023-08-30 8:26 ` Stefano Garzarella
2023-08-30 15:04 ` Philippe Mathieu-Daudé
2023-08-27 18:29 ` [PATCH 2/7] vhost-user: tighten "reply_supported" scope in "set_vring_addr" Laszlo Ersek
` (6 subsequent siblings)
7 siblings, 2 replies; 58+ messages in thread
From: Laszlo Ersek @ 2023-08-27 18:29 UTC (permalink / raw)
To: qemu-devel, lersek
Cc: Michael S. Tsirkin, Eugenio Perez Martin, German Maglione,
Liu Jiang, Sergio Lopez Pascual, Stefano Garzarella
Cc: "Michael S. Tsirkin" <mst@redhat.com> (supporter:vhost)
Cc: Eugenio Perez Martin <eperezma@redhat.com>
Cc: German Maglione <gmaglione@redhat.com>
Cc: Liu Jiang <gerry@linux.alibaba.com>
Cc: Sergio Lopez Pascual <slp@redhat.com>
Cc: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Laszlo Ersek <lersek@redhat.com>
---
hw/virtio/vhost-user.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
index 8dcf049d422b..b4b677c1ce66 100644
--- a/hw/virtio/vhost-user.c
+++ b/hw/virtio/vhost-user.c
@@ -398,7 +398,7 @@ static int vhost_user_write(struct vhost_dev *dev, VhostUserMsg *msg,
* operations such as configuring device memory mappings or issuing device
* resets, which affect the whole device instead of individual VQs,
* vhost-user messages should only be sent once.
- *
+ *
* Devices with multiple vhost_devs are given an associated dev->vq_index
* so per_device requests are only sent if vq_index is 0.
*/
* [PATCH 2/7] vhost-user: tighten "reply_supported" scope in "set_vring_addr"
2023-08-27 18:29 [PATCH 0/7] vhost-user: call VHOST_USER_SET_VRING_ENABLE synchronously Laszlo Ersek
2023-08-27 18:29 ` [PATCH 1/7] vhost-user: strip superfluous whitespace Laszlo Ersek
@ 2023-08-27 18:29 ` Laszlo Ersek
2023-08-30 8:27 ` Stefano Garzarella
2023-08-30 15:04 ` Philippe Mathieu-Daudé
2023-08-27 18:29 ` [PATCH 3/7] vhost-user: factor out "vhost_user_write_msg" Laszlo Ersek
` (5 subsequent siblings)
7 siblings, 2 replies; 58+ messages in thread
From: Laszlo Ersek @ 2023-08-27 18:29 UTC (permalink / raw)
To: qemu-devel, lersek
Cc: Michael S. Tsirkin, Eugenio Perez Martin, German Maglione,
Liu Jiang, Sergio Lopez Pascual, Stefano Garzarella
In the vhost_user_set_vring_addr() function, we calculate
"reply_supported" unconditionally, even though we'll only need it if
"wait_for_reply" is also true.
Restrict the scope of "reply_supported" to the minimum.
This is purely refactoring -- no observable change.
Cc: "Michael S. Tsirkin" <mst@redhat.com> (supporter:vhost)
Cc: Eugenio Perez Martin <eperezma@redhat.com>
Cc: German Maglione <gmaglione@redhat.com>
Cc: Liu Jiang <gerry@linux.alibaba.com>
Cc: Sergio Lopez Pascual <slp@redhat.com>
Cc: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Laszlo Ersek <lersek@redhat.com>
---
hw/virtio/vhost-user.c | 11 ++++++-----
1 file changed, 6 insertions(+), 5 deletions(-)
diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
index b4b677c1ce66..64eac317bfb2 100644
--- a/hw/virtio/vhost-user.c
+++ b/hw/virtio/vhost-user.c
@@ -1331,17 +1331,18 @@ static int vhost_user_set_vring_addr(struct vhost_dev *dev,
.hdr.size = sizeof(msg.payload.addr),
};
- bool reply_supported = virtio_has_feature(dev->protocol_features,
- VHOST_USER_PROTOCOL_F_REPLY_ACK);
-
/*
* wait for a reply if logging is enabled to make sure
* backend is actually logging changes
*/
bool wait_for_reply = addr->flags & (1 << VHOST_VRING_F_LOG);
- if (reply_supported && wait_for_reply) {
- msg.hdr.flags |= VHOST_USER_NEED_REPLY_MASK;
+ if (wait_for_reply) {
+ bool reply_supported = virtio_has_feature(dev->protocol_features,
+ VHOST_USER_PROTOCOL_F_REPLY_ACK);
+ if (reply_supported) {
+ msg.hdr.flags |= VHOST_USER_NEED_REPLY_MASK;
+ }
}
ret = vhost_user_write(dev, &msg, NULL, 0);
* [PATCH 3/7] vhost-user: factor out "vhost_user_write_msg"
2023-08-27 18:29 [PATCH 0/7] vhost-user: call VHOST_USER_SET_VRING_ENABLE synchronously Laszlo Ersek
2023-08-27 18:29 ` [PATCH 1/7] vhost-user: strip superfluous whitespace Laszlo Ersek
2023-08-27 18:29 ` [PATCH 2/7] vhost-user: tighten "reply_supported" scope in "set_vring_addr" Laszlo Ersek
@ 2023-08-27 18:29 ` Laszlo Ersek
2023-08-28 22:46 ` Philippe Mathieu-Daudé
2023-08-30 8:31 ` Stefano Garzarella
2023-08-27 18:29 ` [PATCH 4/7] vhost-user: flatten "enforce_reply" into "vhost_user_write_msg" Laszlo Ersek
` (4 subsequent siblings)
7 siblings, 2 replies; 58+ messages in thread
From: Laszlo Ersek @ 2023-08-27 18:29 UTC (permalink / raw)
To: qemu-devel, lersek
Cc: Michael S. Tsirkin, Eugenio Perez Martin, German Maglione,
Liu Jiang, Sergio Lopez Pascual, Stefano Garzarella
The tails of the "vhost_user_set_vring_addr" and "vhost_user_set_u64"
functions are now byte-for-byte identical. Factor the common tail out to a
new function called "vhost_user_write_msg".
This is purely refactoring -- no observable change.
Cc: "Michael S. Tsirkin" <mst@redhat.com> (supporter:vhost)
Cc: Eugenio Perez Martin <eperezma@redhat.com>
Cc: German Maglione <gmaglione@redhat.com>
Cc: Liu Jiang <gerry@linux.alibaba.com>
Cc: Sergio Lopez Pascual <slp@redhat.com>
Cc: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Laszlo Ersek <lersek@redhat.com>
---
hw/virtio/vhost-user.c | 66 +++++++++-----------
1 file changed, 28 insertions(+), 38 deletions(-)
diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
index 64eac317bfb2..36f99b66a644 100644
--- a/hw/virtio/vhost-user.c
+++ b/hw/virtio/vhost-user.c
@@ -1320,10 +1320,35 @@ static int enforce_reply(struct vhost_dev *dev,
return vhost_user_get_features(dev, &dummy);
}
+/* Note: "msg->hdr.flags" may be modified. */
+static int vhost_user_write_msg(struct vhost_dev *dev, VhostUserMsg *msg,
+ bool wait_for_reply)
+{
+ int ret;
+
+ if (wait_for_reply) {
+ bool reply_supported = virtio_has_feature(dev->protocol_features,
+ VHOST_USER_PROTOCOL_F_REPLY_ACK);
+ if (reply_supported) {
+ msg->hdr.flags |= VHOST_USER_NEED_REPLY_MASK;
+ }
+ }
+
+ ret = vhost_user_write(dev, msg, NULL, 0);
+ if (ret < 0) {
+ return ret;
+ }
+
+ if (wait_for_reply) {
+ return enforce_reply(dev, msg);
+ }
+
+ return 0;
+}
+
static int vhost_user_set_vring_addr(struct vhost_dev *dev,
struct vhost_vring_addr *addr)
{
- int ret;
VhostUserMsg msg = {
.hdr.request = VHOST_USER_SET_VRING_ADDR,
.hdr.flags = VHOST_USER_VERSION,
@@ -1337,24 +1362,7 @@ static int vhost_user_set_vring_addr(struct vhost_dev *dev,
*/
bool wait_for_reply = addr->flags & (1 << VHOST_VRING_F_LOG);
- if (wait_for_reply) {
- bool reply_supported = virtio_has_feature(dev->protocol_features,
- VHOST_USER_PROTOCOL_F_REPLY_ACK);
- if (reply_supported) {
- msg.hdr.flags |= VHOST_USER_NEED_REPLY_MASK;
- }
- }
-
- ret = vhost_user_write(dev, &msg, NULL, 0);
- if (ret < 0) {
- return ret;
- }
-
- if (wait_for_reply) {
- return enforce_reply(dev, &msg);
- }
-
- return 0;
+ return vhost_user_write_msg(dev, &msg, wait_for_reply);
}
static int vhost_user_set_u64(struct vhost_dev *dev, int request, uint64_t u64,
@@ -1366,26 +1374,8 @@ static int vhost_user_set_u64(struct vhost_dev *dev, int request, uint64_t u64,
.payload.u64 = u64,
.hdr.size = sizeof(msg.payload.u64),
};
- int ret;
- if (wait_for_reply) {
- bool reply_supported = virtio_has_feature(dev->protocol_features,
- VHOST_USER_PROTOCOL_F_REPLY_ACK);
- if (reply_supported) {
- msg.hdr.flags |= VHOST_USER_NEED_REPLY_MASK;
- }
- }
-
- ret = vhost_user_write(dev, &msg, NULL, 0);
- if (ret < 0) {
- return ret;
- }
-
- if (wait_for_reply) {
- return enforce_reply(dev, &msg);
- }
-
- return 0;
+ return vhost_user_write_msg(dev, &msg, wait_for_reply);
}
static int vhost_user_set_status(struct vhost_dev *dev, uint8_t status)
* [PATCH 4/7] vhost-user: flatten "enforce_reply" into "vhost_user_write_msg"
2023-08-27 18:29 [PATCH 0/7] vhost-user: call VHOST_USER_SET_VRING_ENABLE synchronously Laszlo Ersek
` (2 preceding siblings ...)
2023-08-27 18:29 ` [PATCH 3/7] vhost-user: factor out "vhost_user_write_msg" Laszlo Ersek
@ 2023-08-27 18:29 ` Laszlo Ersek
2023-08-28 22:47 ` Philippe Mathieu-Daudé
2023-08-30 8:31 ` Stefano Garzarella
2023-08-27 18:29 ` [PATCH 5/7] vhost-user: hoist "write_msg", "get_features", "get_u64" Laszlo Ersek
` (3 subsequent siblings)
7 siblings, 2 replies; 58+ messages in thread
From: Laszlo Ersek @ 2023-08-27 18:29 UTC (permalink / raw)
To: qemu-devel, lersek
Cc: Michael S. Tsirkin, Eugenio Perez Martin, German Maglione,
Liu Jiang, Sergio Lopez Pascual, Stefano Garzarella
At this point, only "vhost_user_write_msg" calls "enforce_reply"; embed
the latter into the former.
This is purely refactoring -- no observable change.
Cc: "Michael S. Tsirkin" <mst@redhat.com> (supporter:vhost)
Cc: Eugenio Perez Martin <eperezma@redhat.com>
Cc: German Maglione <gmaglione@redhat.com>
Cc: Liu Jiang <gerry@linux.alibaba.com>
Cc: Sergio Lopez Pascual <slp@redhat.com>
Cc: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Laszlo Ersek <lersek@redhat.com>
---
hw/virtio/vhost-user.c | 32 ++++++++------------
1 file changed, 13 insertions(+), 19 deletions(-)
diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
index 36f99b66a644..8eb7fd094c43 100644
--- a/hw/virtio/vhost-user.c
+++ b/hw/virtio/vhost-user.c
@@ -1302,24 +1302,6 @@ static int vhost_user_get_features(struct vhost_dev *dev, uint64_t *features)
return 0;
}
-static int enforce_reply(struct vhost_dev *dev,
- const VhostUserMsg *msg)
-{
- uint64_t dummy;
-
- if (msg->hdr.flags & VHOST_USER_NEED_REPLY_MASK) {
- return process_message_reply(dev, msg);
- }
-
- /*
- * We need to wait for a reply but the backend does not
- * support replies for the command we just sent.
- * Send VHOST_USER_GET_FEATURES which makes all backends
- * send a reply.
- */
- return vhost_user_get_features(dev, &dummy);
-}
-
/* Note: "msg->hdr.flags" may be modified. */
static int vhost_user_write_msg(struct vhost_dev *dev, VhostUserMsg *msg,
bool wait_for_reply)
@@ -1340,7 +1322,19 @@ static int vhost_user_write_msg(struct vhost_dev *dev, VhostUserMsg *msg,
}
if (wait_for_reply) {
- return enforce_reply(dev, msg);
+ uint64_t dummy;
+
+ if (msg->hdr.flags & VHOST_USER_NEED_REPLY_MASK) {
+ return process_message_reply(dev, msg);
+ }
+
+ /*
+ * We need to wait for a reply but the backend does not
+ * support replies for the command we just sent.
+ * Send VHOST_USER_GET_FEATURES which makes all backends
+ * send a reply.
+ */
+ return vhost_user_get_features(dev, &dummy);
}
return 0;
* [PATCH 5/7] vhost-user: hoist "write_msg", "get_features", "get_u64"
2023-08-27 18:29 [PATCH 0/7] vhost-user: call VHOST_USER_SET_VRING_ENABLE synchronously Laszlo Ersek
` (3 preceding siblings ...)
2023-08-27 18:29 ` [PATCH 4/7] vhost-user: flatten "enforce_reply" into "vhost_user_write_msg" Laszlo Ersek
@ 2023-08-27 18:29 ` Laszlo Ersek
2023-08-30 8:32 ` Stefano Garzarella
2023-08-30 15:04 ` Philippe Mathieu-Daudé
2023-08-27 18:29 ` [PATCH 6/7] vhost-user: allow "vhost_set_vring" to wait for a reply Laszlo Ersek
` (2 subsequent siblings)
7 siblings, 2 replies; 58+ messages in thread
From: Laszlo Ersek @ 2023-08-27 18:29 UTC (permalink / raw)
To: qemu-devel, lersek
Cc: Michael S. Tsirkin, Eugenio Perez Martin, German Maglione,
Liu Jiang, Sergio Lopez Pascual, Stefano Garzarella
In order to avoid a forward-declaration for "vhost_user_write_msg" in a
subsequent patch, hoist "vhost_user_write_msg" ->
"vhost_user_get_features" -> "vhost_user_get_u64" just above
"vhost_set_vring".
This is purely code movement -- no observable change.
Cc: "Michael S. Tsirkin" <mst@redhat.com> (supporter:vhost)
Cc: Eugenio Perez Martin <eperezma@redhat.com>
Cc: German Maglione <gmaglione@redhat.com>
Cc: Liu Jiang <gerry@linux.alibaba.com>
Cc: Sergio Lopez Pascual <slp@redhat.com>
Cc: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Laszlo Ersek <lersek@redhat.com>
---
hw/virtio/vhost-user.c | 170 ++++++++++----------
1 file changed, 85 insertions(+), 85 deletions(-)
diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
index 8eb7fd094c43..cadafebd0767 100644
--- a/hw/virtio/vhost-user.c
+++ b/hw/virtio/vhost-user.c
@@ -1083,6 +1083,91 @@ static int vhost_user_set_vring_endian(struct vhost_dev *dev,
return vhost_user_write(dev, &msg, NULL, 0);
}
+static int vhost_user_get_u64(struct vhost_dev *dev, int request, uint64_t *u64)
+{
+ int ret;
+ VhostUserMsg msg = {
+ .hdr.request = request,
+ .hdr.flags = VHOST_USER_VERSION,
+ };
+
+ if (vhost_user_per_device_request(request) && dev->vq_index != 0) {
+ return 0;
+ }
+
+ ret = vhost_user_write(dev, &msg, NULL, 0);
+ if (ret < 0) {
+ return ret;
+ }
+
+ ret = vhost_user_read(dev, &msg);
+ if (ret < 0) {
+ return ret;
+ }
+
+ if (msg.hdr.request != request) {
+ error_report("Received unexpected msg type. Expected %d received %d",
+ request, msg.hdr.request);
+ return -EPROTO;
+ }
+
+ if (msg.hdr.size != sizeof(msg.payload.u64)) {
+ error_report("Received bad msg size.");
+ return -EPROTO;
+ }
+
+ *u64 = msg.payload.u64;
+
+ return 0;
+}
+
+static int vhost_user_get_features(struct vhost_dev *dev, uint64_t *features)
+{
+ if (vhost_user_get_u64(dev, VHOST_USER_GET_FEATURES, features) < 0) {
+ return -EPROTO;
+ }
+
+ return 0;
+}
+
+/* Note: "msg->hdr.flags" may be modified. */
+static int vhost_user_write_msg(struct vhost_dev *dev, VhostUserMsg *msg,
+ bool wait_for_reply)
+{
+ int ret;
+
+ if (wait_for_reply) {
+ bool reply_supported = virtio_has_feature(dev->protocol_features,
+ VHOST_USER_PROTOCOL_F_REPLY_ACK);
+ if (reply_supported) {
+ msg->hdr.flags |= VHOST_USER_NEED_REPLY_MASK;
+ }
+ }
+
+ ret = vhost_user_write(dev, msg, NULL, 0);
+ if (ret < 0) {
+ return ret;
+ }
+
+ if (wait_for_reply) {
+ uint64_t dummy;
+
+ if (msg->hdr.flags & VHOST_USER_NEED_REPLY_MASK) {
+ return process_message_reply(dev, msg);
+ }
+
+ /*
+ * We need to wait for a reply but the backend does not
+ * support replies for the command we just sent.
+ * Send VHOST_USER_GET_FEATURES which makes all backends
+ * send a reply.
+ */
+ return vhost_user_get_features(dev, &dummy);
+ }
+
+ return 0;
+}
+
static int vhost_set_vring(struct vhost_dev *dev,
unsigned long int request,
struct vhost_vring_state *ring)
@@ -1255,91 +1340,6 @@ static int vhost_user_set_vring_err(struct vhost_dev *dev,
return vhost_set_vring_file(dev, VHOST_USER_SET_VRING_ERR, file);
}
-static int vhost_user_get_u64(struct vhost_dev *dev, int request, uint64_t *u64)
-{
- int ret;
- VhostUserMsg msg = {
- .hdr.request = request,
- .hdr.flags = VHOST_USER_VERSION,
- };
-
- if (vhost_user_per_device_request(request) && dev->vq_index != 0) {
- return 0;
- }
-
- ret = vhost_user_write(dev, &msg, NULL, 0);
- if (ret < 0) {
- return ret;
- }
-
- ret = vhost_user_read(dev, &msg);
- if (ret < 0) {
- return ret;
- }
-
- if (msg.hdr.request != request) {
- error_report("Received unexpected msg type. Expected %d received %d",
- request, msg.hdr.request);
- return -EPROTO;
- }
-
- if (msg.hdr.size != sizeof(msg.payload.u64)) {
- error_report("Received bad msg size.");
- return -EPROTO;
- }
-
- *u64 = msg.payload.u64;
-
- return 0;
-}
-
-static int vhost_user_get_features(struct vhost_dev *dev, uint64_t *features)
-{
- if (vhost_user_get_u64(dev, VHOST_USER_GET_FEATURES, features) < 0) {
- return -EPROTO;
- }
-
- return 0;
-}
-
-/* Note: "msg->hdr.flags" may be modified. */
-static int vhost_user_write_msg(struct vhost_dev *dev, VhostUserMsg *msg,
- bool wait_for_reply)
-{
- int ret;
-
- if (wait_for_reply) {
- bool reply_supported = virtio_has_feature(dev->protocol_features,
- VHOST_USER_PROTOCOL_F_REPLY_ACK);
- if (reply_supported) {
- msg->hdr.flags |= VHOST_USER_NEED_REPLY_MASK;
- }
- }
-
- ret = vhost_user_write(dev, msg, NULL, 0);
- if (ret < 0) {
- return ret;
- }
-
- if (wait_for_reply) {
- uint64_t dummy;
-
- if (msg->hdr.flags & VHOST_USER_NEED_REPLY_MASK) {
- return process_message_reply(dev, msg);
- }
-
- /*
- * We need to wait for a reply but the backend does not
- * support replies for the command we just sent.
- * Send VHOST_USER_GET_FEATURES which makes all backends
- * send a reply.
- */
- return vhost_user_get_features(dev, &dummy);
- }
-
- return 0;
-}
-
static int vhost_user_set_vring_addr(struct vhost_dev *dev,
struct vhost_vring_addr *addr)
{
* [PATCH 6/7] vhost-user: allow "vhost_set_vring" to wait for a reply
2023-08-27 18:29 [PATCH 0/7] vhost-user: call VHOST_USER_SET_VRING_ENABLE synchronously Laszlo Ersek
` (4 preceding siblings ...)
2023-08-27 18:29 ` [PATCH 5/7] vhost-user: hoist "write_msg", "get_features", "get_u64" Laszlo Ersek
@ 2023-08-27 18:29 ` Laszlo Ersek
2023-08-28 22:49 ` Philippe Mathieu-Daudé
2023-08-30 8:32 ` Stefano Garzarella
2023-08-27 18:29 ` [PATCH 7/7] vhost-user: call VHOST_USER_SET_VRING_ENABLE synchronously Laszlo Ersek
2023-08-30 8:48 ` [PATCH 0/7] " Stefano Garzarella
7 siblings, 2 replies; 58+ messages in thread
From: Laszlo Ersek @ 2023-08-27 18:29 UTC (permalink / raw)
To: qemu-devel, lersek
Cc: Michael S. Tsirkin, Eugenio Perez Martin, German Maglione,
Liu Jiang, Sergio Lopez Pascual, Stefano Garzarella
The "vhost_set_vring" function already centralizes the common parts of
"vhost_user_set_vring_num", "vhost_user_set_vring_base" and
"vhost_user_set_vring_enable". We'll want to allow some of those callers
to wait for a reply.
Therefore, rebase "vhost_set_vring" from just "vhost_user_write" to
"vhost_user_write_msg", exposing the "wait_for_reply" parameter.
This is purely refactoring -- there is no observable change. That's
because:
- all three callers pass in "false" for "wait_for_reply", which disables
all logic in "vhost_user_write_msg" except the call to
"vhost_user_write";
- the fds=NULL and fd_num=0 arguments of the original "vhost_user_write"
call inside "vhost_set_vring" are hard-coded within
"vhost_user_write_msg".
Cc: "Michael S. Tsirkin" <mst@redhat.com> (supporter:vhost)
Cc: Eugenio Perez Martin <eperezma@redhat.com>
Cc: German Maglione <gmaglione@redhat.com>
Cc: Liu Jiang <gerry@linux.alibaba.com>
Cc: Sergio Lopez Pascual <slp@redhat.com>
Cc: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Laszlo Ersek <lersek@redhat.com>
---
hw/virtio/vhost-user.c | 11 ++++++-----
1 file changed, 6 insertions(+), 5 deletions(-)
diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
index cadafebd0767..beb4b832245e 100644
--- a/hw/virtio/vhost-user.c
+++ b/hw/virtio/vhost-user.c
@@ -1170,7 +1170,8 @@ static int vhost_user_write_msg(struct vhost_dev *dev, VhostUserMsg *msg,
static int vhost_set_vring(struct vhost_dev *dev,
unsigned long int request,
- struct vhost_vring_state *ring)
+ struct vhost_vring_state *ring,
+ bool wait_for_reply)
{
VhostUserMsg msg = {
.hdr.request = request,
@@ -1179,13 +1180,13 @@ static int vhost_set_vring(struct vhost_dev *dev,
.hdr.size = sizeof(msg.payload.state),
};
- return vhost_user_write(dev, &msg, NULL, 0);
+ return vhost_user_write_msg(dev, &msg, wait_for_reply);
}
static int vhost_user_set_vring_num(struct vhost_dev *dev,
struct vhost_vring_state *ring)
{
- return vhost_set_vring(dev, VHOST_USER_SET_VRING_NUM, ring);
+ return vhost_set_vring(dev, VHOST_USER_SET_VRING_NUM, ring, false);
}
static void vhost_user_host_notifier_free(VhostUserHostNotifier *n)
@@ -1216,7 +1217,7 @@ static void vhost_user_host_notifier_remove(VhostUserHostNotifier *n,
static int vhost_user_set_vring_base(struct vhost_dev *dev,
struct vhost_vring_state *ring)
{
- return vhost_set_vring(dev, VHOST_USER_SET_VRING_BASE, ring);
+ return vhost_set_vring(dev, VHOST_USER_SET_VRING_BASE, ring, false);
}
static int vhost_user_set_vring_enable(struct vhost_dev *dev, int enable)
@@ -1234,7 +1235,7 @@ static int vhost_user_set_vring_enable(struct vhost_dev *dev, int enable)
.num = enable,
};
- ret = vhost_set_vring(dev, VHOST_USER_SET_VRING_ENABLE, &state);
+ ret = vhost_set_vring(dev, VHOST_USER_SET_VRING_ENABLE, &state, false);
if (ret < 0) {
/*
* Restoring the previous state is likely infeasible, as well as
* [PATCH 7/7] vhost-user: call VHOST_USER_SET_VRING_ENABLE synchronously
2023-08-27 18:29 [PATCH 0/7] vhost-user: call VHOST_USER_SET_VRING_ENABLE synchronously Laszlo Ersek
` (5 preceding siblings ...)
2023-08-27 18:29 ` [PATCH 6/7] vhost-user: allow "vhost_set_vring" to wait for a reply Laszlo Ersek
@ 2023-08-27 18:29 ` Laszlo Ersek
2023-08-30 8:39 ` Stefano Garzarella
` (3 more replies)
2023-08-30 8:48 ` [PATCH 0/7] " Stefano Garzarella
7 siblings, 4 replies; 58+ messages in thread
From: Laszlo Ersek @ 2023-08-27 18:29 UTC (permalink / raw)
To: qemu-devel, lersek
Cc: Michael S. Tsirkin, Eugenio Perez Martin, German Maglione,
Liu Jiang, Sergio Lopez Pascual, Stefano Garzarella
(1) The virtio-1.0 specification
<http://docs.oasis-open.org/virtio/virtio/v1.0/virtio-v1.0.html> writes:
> 3 General Initialization And Device Operation
> 3.1 Device Initialization
> 3.1.1 Driver Requirements: Device Initialization
>
> [...]
>
> 7. Perform device-specific setup, including discovery of virtqueues for
> the device, optional per-bus setup, reading and possibly writing the
> device’s virtio configuration space, and population of virtqueues.
>
> 8. Set the DRIVER_OK status bit. At this point the device is “live”.
and
> 4 Virtio Transport Options
> 4.1 Virtio Over PCI Bus
> 4.1.4 Virtio Structure PCI Capabilities
> 4.1.4.3 Common configuration structure layout
> 4.1.4.3.2 Driver Requirements: Common configuration structure layout
>
> [...]
>
> The driver MUST configure the other virtqueue fields before enabling the
> virtqueue with queue_enable.
>
> [...]
These together mean that the following sub-sequence of steps is valid for
a virtio-1.0 guest driver:
(1.1) set "queue_enable" for the needed queues as the final part of device
initialization step (7),
(1.2) set DRIVER_OK in step (8),
(1.3) immediately start sending virtio requests to the device.
(2) When vhost-user is enabled, and the VHOST_USER_F_PROTOCOL_FEATURES
special virtio feature is negotiated, then virtio rings start in disabled
state, according to
<https://qemu-project.gitlab.io/qemu/interop/vhost-user.html#ring-states>.
In this case, explicit VHOST_USER_SET_VRING_ENABLE messages are needed for
enabling vrings.
Therefore setting "queue_enable" from the guest (1.1) is a *control plane*
operation, which travels from the guest through QEMU to the vhost-user
backend, using a unix domain socket.
Whereas sending a virtio request (1.3) is a *data plane* operation, which
evades QEMU -- it travels from guest to the vhost-user backend via
eventfd.
This means that steps (1.1) and (1.3) travel through different channels,
and their relative order can be reversed, as perceived by the vhost-user
backend.
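
For illustration only, the guest-side tail of device initialization can be
sketched as follows; the helper names below are placeholders rather than a
real virtio driver API, and the comments mark which channel each step ends
up on:

    /*
     * Placeholder helpers, not a real virtio driver API: they only stand
     * in for steps (1.1)-(1.3) and the channel each one travels on.
     */
    #include <stdint.h>

    static void write_queue_enable(uint16_t q)
    {
        /* (1.1) register write, trapped by QEMU and forwarded to the
         *       vhost-user backend as VHOST_USER_SET_VRING_ENABLE over
         *       the unix domain socket (control plane) */
        (void)q;
    }

    static void write_driver_ok(void)
    {
        /* (1.2) DRIVER_OK status bit -- the device is now "live" */
    }

    static void kick_queue(uint16_t q)
    {
        /* (1.3) eventfd kick that reaches the backend directly
         *       (data plane), bypassing QEMU */
        (void)q;
    }

    static void guest_driver_init_tail(void)
    {
        write_queue_enable(0); /* may still be in flight on the socket... */
        write_driver_ok();
        kick_queue(0);         /* ...when this kick already reaches the backend */
    }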
That's exactly what happens when OVMF's virtiofs driver (VirtioFsDxe) runs
against the Rust-language virtiofsd version 1.7.2. (Which uses version
0.10.1 of the vhost-user-backend crate, and version 0.8.1 of the vhost
crate.)
Namely, when VirtioFsDxe binds a virtiofs device, it goes through the
device initialization steps (i.e., control plane operations), and
immediately sends a FUSE_INIT request too (i.e., performs a data plane
operation). In the Rust-language virtiofsd, this creates a race between
two components that run *concurrently*, i.e., in different threads or
processes:
- Control plane, handling vhost-user protocol messages:
The "VhostUserSlaveReqHandlerMut::set_vring_enable" method
[crates/vhost-user-backend/src/handler.rs] handles
VHOST_USER_SET_VRING_ENABLE messages, and updates each vring's "enabled"
flag according to the message processed.
- Data plane, handling virtio / FUSE requests:
The "VringEpollHandler::handle_event" method
[crates/vhost-user-backend/src/event_loop.rs] handles the incoming
virtio / FUSE request, consuming the virtio kick at the same time. If
the vring's "enabled" flag is set, the virtio / FUSE request is
processed genuinely. If the vring's "enabled" flag is clear, then the
virtio / FUSE request is discarded.
Note that OVMF enables the queue *first*, and sends FUSE_INIT *second*.
However, if the data plane processor in virtiofsd wins the race, then it
sees the FUSE_INIT *before* the control plane processor took notice of
VHOST_USER_SET_VRING_ENABLE and green-lit the queue for the data plane
processor. Therefore the latter drops FUSE_INIT on the floor, and goes
back to waiting for further virtio / FUSE requests with epoll_wait.
Meanwhile OVMF is stuck waiting for the FUSE_INIT response -- a deadlock.
The deadlock is not deterministic. OVMF hangs infrequently during first
boot. However, OVMF hangs almost certainly during reboots from the UEFI
shell.
The race can be "reliably masked" by inserting a very small delay -- a
single debug message -- at the top of "VringEpollHandler::handle_event",
i.e., just before the data plane processor checks the "enabled" field of
the vring. That delay suffices for the control plane processor to act upon
VHOST_USER_SET_VRING_ENABLE.
We can deterministically prevent the race in QEMU, by blocking OVMF inside
step (1.1) -- i.e., in the write to the "queue_enable" register -- until
VHOST_USER_SET_VRING_ENABLE actually *completes*. That way OVMF's VCPU
cannot advance to the FUSE_INIT submission before virtiofsd's control
plane processor takes notice of the queue being enabled.
Wait for VHOST_USER_SET_VRING_ENABLE completion by:
- setting the NEED_REPLY flag on VHOST_USER_SET_VRING_ENABLE, and waiting
for the reply, if the VHOST_USER_PROTOCOL_F_REPLY_ACK vhost-user feature
has been negotiated, or
- performing a separate VHOST_USER_GET_FEATURES *exchange*, which requires
a backend response regardless of VHOST_USER_PROTOCOL_F_REPLY_ACK.
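For illustration, the wait boils down to the following self-contained
sketch; the helper names and the toy flag value are illustrative stand-ins,
not QEMU's actual API -- in the series the real logic lives in
"vhost_user_write_msg":

    #include <stdbool.h>

    #define TOY_NEED_REPLY_MASK (1u << 3) /* stand-in for VHOST_USER_NEED_REPLY_MASK */

    struct toy_msg {
        unsigned flags;
    };

    /* true iff VHOST_USER_PROTOCOL_F_REPLY_ACK was negotiated */
    static bool reply_ack_negotiated;

    static int send_to_backend(struct toy_msg *m) { (void)m; return 0; } /* socket write  */
    static int read_reply_ack(struct toy_msg *m)  { (void)m; return 0; } /* REPLY_ACK u64 */
    static int get_features_round_trip(void)      { return 0; }          /* always replied */

    /* Return only once the backend has actually processed the message. */
    static int send_and_wait(struct toy_msg *m)
    {
        int ret;

        if (reply_ack_negotiated) {
            m->flags |= TOY_NEED_REPLY_MASK;
        }

        ret = send_to_backend(m);
        if (ret < 0) {
            return ret;
        }

        if (m->flags & TOY_NEED_REPLY_MASK) {
            return read_reply_ack(m);     /* backend ACKs the message itself */
        }
        return get_features_round_trip(); /* GET_FEATURES forces a reply */
    }

With the last patch, "vhost_user_set_vring_enable" goes through this path
with wait_for_reply=true, so the guest's "queue_enable" write (1.1) does not
complete until the backend has processed VHOST_USER_SET_VRING_ENABLE.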
Cc: "Michael S. Tsirkin" <mst@redhat.com> (supporter:vhost)
Cc: Eugenio Perez Martin <eperezma@redhat.com>
Cc: German Maglione <gmaglione@redhat.com>
Cc: Liu Jiang <gerry@linux.alibaba.com>
Cc: Sergio Lopez Pascual <slp@redhat.com>
Cc: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Laszlo Ersek <lersek@redhat.com>
---
hw/virtio/vhost-user.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
index beb4b832245e..01e0ca90c538 100644
--- a/hw/virtio/vhost-user.c
+++ b/hw/virtio/vhost-user.c
@@ -1235,7 +1235,7 @@ static int vhost_user_set_vring_enable(struct vhost_dev *dev, int enable)
.num = enable,
};
- ret = vhost_set_vring(dev, VHOST_USER_SET_VRING_ENABLE, &state, false);
+ ret = vhost_set_vring(dev, VHOST_USER_SET_VRING_ENABLE, &state, true);
if (ret < 0) {
/*
* Restoring the previous state is likely infeasible, as well as
* Re: [PATCH 3/7] vhost-user: factor out "vhost_user_write_msg"
2023-08-27 18:29 ` [PATCH 3/7] vhost-user: factor out "vhost_user_write_msg" Laszlo Ersek
@ 2023-08-28 22:46 ` Philippe Mathieu-Daudé
2023-08-30 8:31 ` Stefano Garzarella
1 sibling, 0 replies; 58+ messages in thread
From: Philippe Mathieu-Daudé @ 2023-08-28 22:46 UTC (permalink / raw)
To: Laszlo Ersek, qemu-devel
Cc: Michael S. Tsirkin, Eugenio Perez Martin, German Maglione,
Liu Jiang, Sergio Lopez Pascual, Stefano Garzarella
On 27/8/23 20:29, Laszlo Ersek wrote:
> The tails of the "vhost_user_set_vring_addr" and "vhost_user_set_u64"
> functions are now byte-for-byte identical. Factor the common tail out to a
> new function called "vhost_user_write_msg".
>
> This is purely refactoring -- no observable change.
>
> Cc: "Michael S. Tsirkin" <mst@redhat.com> (supporter:vhost)
> Cc: Eugenio Perez Martin <eperezma@redhat.com>
> Cc: German Maglione <gmaglione@redhat.com>
> Cc: Liu Jiang <gerry@linux.alibaba.com>
> Cc: Sergio Lopez Pascual <slp@redhat.com>
> Cc: Stefano Garzarella <sgarzare@redhat.com>
> Signed-off-by: Laszlo Ersek <lersek@redhat.com>
> ---
> hw/virtio/vhost-user.c | 66 +++++++++-----------
> 1 file changed, 28 insertions(+), 38 deletions(-)
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
* Re: [PATCH 4/7] vhost-user: flatten "enforce_reply" into "vhost_user_write_msg"
2023-08-27 18:29 ` [PATCH 4/7] vhost-user: flatten "enforce_reply" into "vhost_user_write_msg" Laszlo Ersek
@ 2023-08-28 22:47 ` Philippe Mathieu-Daudé
2023-08-30 8:31 ` Stefano Garzarella
1 sibling, 0 replies; 58+ messages in thread
From: Philippe Mathieu-Daudé @ 2023-08-28 22:47 UTC (permalink / raw)
To: Laszlo Ersek, qemu-devel
Cc: Michael S. Tsirkin, Eugenio Perez Martin, German Maglione,
Liu Jiang, Sergio Lopez Pascual, Stefano Garzarella
On 27/8/23 20:29, Laszlo Ersek wrote:
> At this point, only "vhost_user_write_msg" calls "enforce_reply"; embed
> the latter into the former.
>
> This is purely refactoring -- no observable change.
>
> Cc: "Michael S. Tsirkin" <mst@redhat.com> (supporter:vhost)
> Cc: Eugenio Perez Martin <eperezma@redhat.com>
> Cc: German Maglione <gmaglione@redhat.com>
> Cc: Liu Jiang <gerry@linux.alibaba.com>
> Cc: Sergio Lopez Pascual <slp@redhat.com>
> Cc: Stefano Garzarella <sgarzare@redhat.com>
> Signed-off-by: Laszlo Ersek <lersek@redhat.com>
> ---
> hw/virtio/vhost-user.c | 32 ++++++++------------
> 1 file changed, 13 insertions(+), 19 deletions(-)
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
* Re: [PATCH 6/7] vhost-user: allow "vhost_set_vring" to wait for a reply
2023-08-27 18:29 ` [PATCH 6/7] vhost-user: allow "vhost_set_vring" to wait for a reply Laszlo Ersek
@ 2023-08-28 22:49 ` Philippe Mathieu-Daudé
2023-08-30 8:32 ` Stefano Garzarella
1 sibling, 0 replies; 58+ messages in thread
From: Philippe Mathieu-Daudé @ 2023-08-28 22:49 UTC (permalink / raw)
To: Laszlo Ersek, qemu-devel
Cc: Michael S. Tsirkin, Eugenio Perez Martin, German Maglione,
Liu Jiang, Sergio Lopez Pascual, Stefano Garzarella
On 27/8/23 20:29, Laszlo Ersek wrote:
> The "vhost_set_vring" function already centralizes the common parts of
> "vhost_user_set_vring_num", "vhost_user_set_vring_base" and
> "vhost_user_set_vring_enable". We'll want to allow some of those callers
> to wait for a reply.
>
> Therefore, rebase "vhost_set_vring" from just "vhost_user_write" to
> "vhost_user_write_msg", exposing the "wait_for_reply" parameter.
>
> This is purely refactoring -- there is no observable change. That's
> because:
>
> - all three callers pass in "false" for "wait_for_reply", which disables
> all logic in "vhost_user_write_msg" except the call to
> "vhost_user_write";
>
> - the fds=NULL and fd_num=0 arguments of the original "vhost_user_write"
> call inside "vhost_set_vring" are hard-coded within
> "vhost_user_write_msg".
>
> Cc: "Michael S. Tsirkin" <mst@redhat.com> (supporter:vhost)
> Cc: Eugenio Perez Martin <eperezma@redhat.com>
> Cc: German Maglione <gmaglione@redhat.com>
> Cc: Liu Jiang <gerry@linux.alibaba.com>
> Cc: Sergio Lopez Pascual <slp@redhat.com>
> Cc: Stefano Garzarella <sgarzare@redhat.com>
> Signed-off-by: Laszlo Ersek <lersek@redhat.com>
> ---
> hw/virtio/vhost-user.c | 11 ++++++-----
> 1 file changed, 6 insertions(+), 5 deletions(-)
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
* Re: [PATCH 1/7] vhost-user: strip superfluous whitespace
2023-08-27 18:29 ` [PATCH 1/7] vhost-user: strip superfluous whitespace Laszlo Ersek
@ 2023-08-30 8:26 ` Stefano Garzarella
2023-08-30 15:04 ` Philippe Mathieu-Daudé
1 sibling, 0 replies; 58+ messages in thread
From: Stefano Garzarella @ 2023-08-30 8:26 UTC (permalink / raw)
To: Laszlo Ersek
Cc: qemu-devel, Michael S. Tsirkin, Eugenio Perez Martin,
German Maglione, Liu Jiang, Sergio Lopez Pascual
On Sun, Aug 27, 2023 at 08:29:31PM +0200, Laszlo Ersek wrote:
>Cc: "Michael S. Tsirkin" <mst@redhat.com> (supporter:vhost)
>Cc: Eugenio Perez Martin <eperezma@redhat.com>
>Cc: German Maglione <gmaglione@redhat.com>
>Cc: Liu Jiang <gerry@linux.alibaba.com>
>Cc: Sergio Lopez Pascual <slp@redhat.com>
>Cc: Stefano Garzarella <sgarzare@redhat.com>
>Signed-off-by: Laszlo Ersek <lersek@redhat.com>
>---
> hw/virtio/vhost-user.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
* Re: [PATCH 2/7] vhost-user: tighten "reply_supported" scope in "set_vring_addr"
2023-08-27 18:29 ` [PATCH 2/7] vhost-user: tighten "reply_supported" scope in "set_vring_addr" Laszlo Ersek
@ 2023-08-30 8:27 ` Stefano Garzarella
2023-08-30 15:04 ` Philippe Mathieu-Daudé
1 sibling, 0 replies; 58+ messages in thread
From: Stefano Garzarella @ 2023-08-30 8:27 UTC (permalink / raw)
To: Laszlo Ersek
Cc: qemu-devel, Michael S. Tsirkin, Eugenio Perez Martin,
German Maglione, Liu Jiang, Sergio Lopez Pascual
On Sun, Aug 27, 2023 at 08:29:32PM +0200, Laszlo Ersek wrote:
>In the vhost_user_set_vring_addr() function, we calculate
>"reply_supported" unconditionally, even though we'll only need it if
>"wait_for_reply" is also true.
>
>Restrict the scope of "reply_supported" to the minimum.
>
>This is purely refactoring -- no observable change.
>
>Cc: "Michael S. Tsirkin" <mst@redhat.com> (supporter:vhost)
>Cc: Eugenio Perez Martin <eperezma@redhat.com>
>Cc: German Maglione <gmaglione@redhat.com>
>Cc: Liu Jiang <gerry@linux.alibaba.com>
>Cc: Sergio Lopez Pascual <slp@redhat.com>
>Cc: Stefano Garzarella <sgarzare@redhat.com>
>Signed-off-by: Laszlo Ersek <lersek@redhat.com>
>---
> hw/virtio/vhost-user.c | 11 ++++++-----
> 1 file changed, 6 insertions(+), 5 deletions(-)
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
* Re: [PATCH 3/7] vhost-user: factor out "vhost_user_write_msg"
2023-08-27 18:29 ` [PATCH 3/7] vhost-user: factor out "vhost_user_write_msg" Laszlo Ersek
2023-08-28 22:46 ` Philippe Mathieu-Daudé
@ 2023-08-30 8:31 ` Stefano Garzarella
2023-08-30 9:14 ` Laszlo Ersek
1 sibling, 1 reply; 58+ messages in thread
From: Stefano Garzarella @ 2023-08-30 8:31 UTC (permalink / raw)
To: Laszlo Ersek
Cc: qemu-devel, Michael S. Tsirkin, Eugenio Perez Martin,
German Maglione, Liu Jiang, Sergio Lopez Pascual
On Sun, Aug 27, 2023 at 08:29:33PM +0200, Laszlo Ersek wrote:
>The tails of the "vhost_user_set_vring_addr" and "vhost_user_set_u64"
>functions are now byte-for-byte identical. Factor the common tail out to a
>new function called "vhost_user_write_msg".
>
>This is purely refactoring -- no observable change.
>
>Cc: "Michael S. Tsirkin" <mst@redhat.com> (supporter:vhost)
>Cc: Eugenio Perez Martin <eperezma@redhat.com>
>Cc: German Maglione <gmaglione@redhat.com>
>Cc: Liu Jiang <gerry@linux.alibaba.com>
>Cc: Sergio Lopez Pascual <slp@redhat.com>
>Cc: Stefano Garzarella <sgarzare@redhat.com>
>Signed-off-by: Laszlo Ersek <lersek@redhat.com>
>---
> hw/virtio/vhost-user.c | 66 +++++++++-----------
> 1 file changed, 28 insertions(+), 38 deletions(-)
>
>diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
>index 64eac317bfb2..36f99b66a644 100644
>--- a/hw/virtio/vhost-user.c
>+++ b/hw/virtio/vhost-user.c
>@@ -1320,10 +1320,35 @@ static int enforce_reply(struct vhost_dev *dev,
> return vhost_user_get_features(dev, &dummy);
> }
>
>+/* Note: "msg->hdr.flags" may be modified. */
>+static int vhost_user_write_msg(struct vhost_dev *dev, VhostUserMsg *msg,
>+ bool wait_for_reply)
The difference between vhost_user_write() and vhost_user_write_msg() is
not immediately obvious from the function name, so I would propose
something different, like vhost_user_write_sync() or
vhost_user_write_wait().
Anyway, I'm not good with names and don't have a strong opinion, so this
version is fine with me as well :-)
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
* Re: [PATCH 4/7] vhost-user: flatten "enforce_reply" into "vhost_user_write_msg"
2023-08-27 18:29 ` [PATCH 4/7] vhost-user: flatten "enforce_reply" into "vhost_user_write_msg" Laszlo Ersek
2023-08-28 22:47 ` Philippe Mathieu-Daudé
@ 2023-08-30 8:31 ` Stefano Garzarella
1 sibling, 0 replies; 58+ messages in thread
From: Stefano Garzarella @ 2023-08-30 8:31 UTC (permalink / raw)
To: Laszlo Ersek
Cc: qemu-devel, Michael S. Tsirkin, Eugenio Perez Martin,
German Maglione, Liu Jiang, Sergio Lopez Pascual
On Sun, Aug 27, 2023 at 08:29:34PM +0200, Laszlo Ersek wrote:
>At this point, only "vhost_user_write_msg" calls "enforce_reply"; embed
>the latter into the former.
>
>This is purely refactoring -- no observable change.
>
>Cc: "Michael S. Tsirkin" <mst@redhat.com> (supporter:vhost)
>Cc: Eugenio Perez Martin <eperezma@redhat.com>
>Cc: German Maglione <gmaglione@redhat.com>
>Cc: Liu Jiang <gerry@linux.alibaba.com>
>Cc: Sergio Lopez Pascual <slp@redhat.com>
>Cc: Stefano Garzarella <sgarzare@redhat.com>
>Signed-off-by: Laszlo Ersek <lersek@redhat.com>
>---
> hw/virtio/vhost-user.c | 32 ++++++++------------
> 1 file changed, 13 insertions(+), 19 deletions(-)
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
* Re: [PATCH 5/7] vhost-user: hoist "write_msg", "get_features", "get_u64"
2023-08-27 18:29 ` [PATCH 5/7] vhost-user: hoist "write_msg", "get_features", "get_u64" Laszlo Ersek
@ 2023-08-30 8:32 ` Stefano Garzarella
2023-08-30 15:04 ` Philippe Mathieu-Daudé
1 sibling, 0 replies; 58+ messages in thread
From: Stefano Garzarella @ 2023-08-30 8:32 UTC (permalink / raw)
To: Laszlo Ersek
Cc: qemu-devel, Michael S. Tsirkin, Eugenio Perez Martin,
German Maglione, Liu Jiang, Sergio Lopez Pascual
On Sun, Aug 27, 2023 at 08:29:35PM +0200, Laszlo Ersek wrote:
>In order to avoid a forward-declaration for "vhost_user_write_msg" in a
>subsequent patch, hoist "vhost_user_write_msg" ->
>"vhost_user_get_features" -> "vhost_user_get_u64" just above
>"vhost_set_vring".
>
>This is purely code movement -- no observable change.
>
>Cc: "Michael S. Tsirkin" <mst@redhat.com> (supporter:vhost)
>Cc: Eugenio Perez Martin <eperezma@redhat.com>
>Cc: German Maglione <gmaglione@redhat.com>
>Cc: Liu Jiang <gerry@linux.alibaba.com>
>Cc: Sergio Lopez Pascual <slp@redhat.com>
>Cc: Stefano Garzarella <sgarzare@redhat.com>
>Signed-off-by: Laszlo Ersek <lersek@redhat.com>
>---
> hw/virtio/vhost-user.c | 170 ++++++++++----------
> 1 file changed, 85 insertions(+), 85 deletions(-)
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
* Re: [PATCH 6/7] vhost-user: allow "vhost_set_vring" to wait for a reply
2023-08-27 18:29 ` [PATCH 6/7] vhost-user: allow "vhost_set_vring" to wait for a reply Laszlo Ersek
2023-08-28 22:49 ` Philippe Mathieu-Daudé
@ 2023-08-30 8:32 ` Stefano Garzarella
1 sibling, 0 replies; 58+ messages in thread
From: Stefano Garzarella @ 2023-08-30 8:32 UTC (permalink / raw)
To: Laszlo Ersek
Cc: qemu-devel, Michael S. Tsirkin, Eugenio Perez Martin,
German Maglione, Liu Jiang, Sergio Lopez Pascual
On Sun, Aug 27, 2023 at 08:29:36PM +0200, Laszlo Ersek wrote:
>The "vhost_set_vring" function already centralizes the common parts of
>"vhost_user_set_vring_num", "vhost_user_set_vring_base" and
>"vhost_user_set_vring_enable". We'll want to allow some of those callers
>to wait for a reply.
>
>Therefore, rebase "vhost_set_vring" from just "vhost_user_write" to
>"vhost_user_write_msg", exposing the "wait_for_reply" parameter.
>
>This is purely refactoring -- there is no observable change. That's
>because:
>
>- all three callers pass in "false" for "wait_for_reply", which disables
> all logic in "vhost_user_write_msg" except the call to
> "vhost_user_write";
>
>- the fds=NULL and fd_num=0 arguments of the original "vhost_user_write"
> call inside "vhost_set_vring" are hard-coded within
> "vhost_user_write_msg".
>
>Cc: "Michael S. Tsirkin" <mst@redhat.com> (supporter:vhost)
>Cc: Eugenio Perez Martin <eperezma@redhat.com>
>Cc: German Maglione <gmaglione@redhat.com>
>Cc: Liu Jiang <gerry@linux.alibaba.com>
>Cc: Sergio Lopez Pascual <slp@redhat.com>
>Cc: Stefano Garzarella <sgarzare@redhat.com>
>Signed-off-by: Laszlo Ersek <lersek@redhat.com>
>---
> hw/virtio/vhost-user.c | 11 ++++++-----
> 1 file changed, 6 insertions(+), 5 deletions(-)
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
* Re: [PATCH 7/7] vhost-user: call VHOST_USER_SET_VRING_ENABLE synchronously
2023-08-27 18:29 ` [PATCH 7/7] vhost-user: call VHOST_USER_SET_VRING_ENABLE synchronously Laszlo Ersek
@ 2023-08-30 8:39 ` Stefano Garzarella
2023-08-30 9:26 ` Laszlo Ersek
2023-08-30 8:41 ` Laszlo Ersek
` (2 subsequent siblings)
3 siblings, 1 reply; 58+ messages in thread
From: Stefano Garzarella @ 2023-08-30 8:39 UTC (permalink / raw)
To: Laszlo Ersek
Cc: qemu-devel, Michael S. Tsirkin, Eugenio Perez Martin,
German Maglione, Liu Jiang, Sergio Lopez Pascual
On Sun, Aug 27, 2023 at 08:29:37PM +0200, Laszlo Ersek wrote:
>(1) The virtio-1.0 specification
><http://docs.oasis-open.org/virtio/virtio/v1.0/virtio-v1.0.html> writes:
What about referring to the latest spec available now (1.2)?
>
>> 3 General Initialization And Device Operation
>> 3.1 Device Initialization
>> 3.1.1 Driver Requirements: Device Initialization
>>
>> [...]
>>
>> 7. Perform device-specific setup, including discovery of virtqueues for
>> the device, optional per-bus setup, reading and possibly writing the
>> device’s virtio configuration space, and population of virtqueues.
>>
>> 8. Set the DRIVER_OK status bit. At this point the device is “live”.
>
>and
>
>> 4 Virtio Transport Options
>> 4.1 Virtio Over PCI Bus
>> 4.1.4 Virtio Structure PCI Capabilities
>> 4.1.4.3 Common configuration structure layout
>> 4.1.4.3.2 Driver Requirements: Common configuration structure layout
>>
>> [...]
>>
>> The driver MUST configure the other virtqueue fields before enabling the
>> virtqueue with queue_enable.
>>
>> [...]
>
>These together mean that the following sub-sequence of steps is valid for
>a virtio-1.0 guest driver:
>
>(1.1) set "queue_enable" for the needed queues as the final part of device
>initialization step (7),
>
>(1.2) set DRIVER_OK in step (8),
>
>(1.3) immediately start sending virtio requests to the device.
>
>(2) When vhost-user is enabled, and the VHOST_USER_F_PROTOCOL_FEATURES
>special virtio feature is negotiated, then virtio rings start in disabled
>state, according to
><https://qemu-project.gitlab.io/qemu/interop/vhost-user.html#ring-states>.
>In this case, explicit VHOST_USER_SET_VRING_ENABLE messages are needed for
>enabling vrings.
>
>Therefore setting "queue_enable" from the guest (1.1) is a *control plane*
>operation, which travels from the guest through QEMU to the vhost-user
>backend, using a unix domain socket.
>
>Whereas sending a virtio request (1.3) is a *data plane* operation, which
>evades QEMU -- it travels from guest to the vhost-user backend via
>eventfd.
>
>This means that steps (1.1) and (1.3) travel through different channels,
>and their relative order can be reversed, as perceived by the vhost-user
>backend.
>
>That's exactly what happens when OVMF's virtiofs driver (VirtioFsDxe) runs
>against the Rust-language virtiofsd version 1.7.2. (Which uses version
>0.10.1 of the vhost-user-backend crate, and version 0.8.1 of the vhost
>crate.)
>
>Namely, when VirtioFsDxe binds a virtiofs device, it goes through the
>device initialization steps (i.e., control plane operations), and
>immediately sends a FUSE_INIT request too (i.e., performs a data plane
>operation). In the Rust-language virtiofsd, this creates a race between
>two components that run *concurrently*, i.e., in different threads or
>processes:
>
>- Control plane, handling vhost-user protocol messages:
>
> The "VhostUserSlaveReqHandlerMut::set_vring_enable" method
> [crates/vhost-user-backend/src/handler.rs] handles
> VHOST_USER_SET_VRING_ENABLE messages, and updates each vring's "enabled"
> flag according to the message processed.
>
>- Data plane, handling virtio / FUSE requests:
>
> The "VringEpollHandler::handle_event" method
> [crates/vhost-user-backend/src/event_loop.rs] handles the incoming
> virtio / FUSE request, consuming the virtio kick at the same time. If
> the vring's "enabled" flag is set, the virtio / FUSE request is
> processed genuinely. If the vring's "enabled" flag is clear, then the
> virtio / FUSE request is discarded.
>
>Note that OVMF enables the queue *first*, and sends FUSE_INIT *second*.
>However, if the data plane processor in virtiofsd wins the race, then it
>sees the FUSE_INIT *before* the control plane processor took notice of
>VHOST_USER_SET_VRING_ENABLE and green-lit the queue for the data plane
>processor. Therefore the latter drops FUSE_INIT on the floor, and goes
>back to waiting for further virtio / FUSE requests with epoll_wait.
>Meanwhile OVMF is stuck waiting for the FUSE_INIT response -- a deadlock.
>
>The deadlock is not deterministic. OVMF hangs infrequently during first
>boot. However, OVMF hangs almost certainly during reboots from the UEFI
>shell.
>
>The race can be "reliably masked" by inserting a very small delay -- a
>single debug message -- at the top of "VringEpollHandler::handle_event",
>i.e., just before the data plane processor checks the "enabled" field of
>the vring. That delay suffices for the control plane processor to act upon
>VHOST_USER_SET_VRING_ENABLE.
>
>We can deterministically prevent the race in QEMU, by blocking OVMF inside
>step (1.1) -- i.e., in the write to the "queue_enable" register -- until
>VHOST_USER_SET_VRING_ENABLE actually *completes*. That way OVMF's VCPU
>cannot advance to the FUSE_INIT submission before virtiofsd's control
>plane processor takes notice of the queue being enabled.
>
>Wait for VHOST_USER_SET_VRING_ENABLE completion by:
>
>- setting the NEED_REPLY flag on VHOST_USER_SET_VRING_ENABLE, and waiting
> for the reply, if the VHOST_USER_PROTOCOL_F_REPLY_ACK vhost-user feature
> has been negotiated, or
>
>- performing a separate VHOST_USER_GET_FEATURES *exchange*, which requires
> a backend response regardless of VHOST_USER_PROTOCOL_F_REPLY_ACK.
Thanks for the excellent analysis (and fix of course!).
>
>Cc: "Michael S. Tsirkin" <mst@redhat.com> (supporter:vhost)
>Cc: Eugenio Perez Martin <eperezma@redhat.com>
>Cc: German Maglione <gmaglione@redhat.com>
>Cc: Liu Jiang <gerry@linux.alibaba.com>
>Cc: Sergio Lopez Pascual <slp@redhat.com>
>Cc: Stefano Garzarella <sgarzare@redhat.com>
>Signed-off-by: Laszlo Ersek <lersek@redhat.com>
>---
> hw/virtio/vhost-user.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
>diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
>index beb4b832245e..01e0ca90c538 100644
>--- a/hw/virtio/vhost-user.c
>+++ b/hw/virtio/vhost-user.c
>@@ -1235,7 +1235,7 @@ static int vhost_user_set_vring_enable(struct vhost_dev *dev, int enable)
> .num = enable,
> };
>
How about adding a small comment here summarizing the commit message in
a few lines?
Should we cc stable for this fix?
In any case, the fix LGTM, so:
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
>- ret = vhost_set_vring(dev, VHOST_USER_SET_VRING_ENABLE, &state, false);
>+ ret = vhost_set_vring(dev, VHOST_USER_SET_VRING_ENABLE, &state, true);
> if (ret < 0) {
> /*
> * Restoring the previous state is likely infeasible, as well as
* Re: [PATCH 7/7] vhost-user: call VHOST_USER_SET_VRING_ENABLE synchronously
2023-08-27 18:29 ` [PATCH 7/7] vhost-user: call VHOST_USER_SET_VRING_ENABLE synchronously Laszlo Ersek
2023-08-30 8:39 ` Stefano Garzarella
@ 2023-08-30 8:41 ` Laszlo Ersek
2023-08-30 8:59 ` Laszlo Ersek
2023-08-30 12:10 ` Stefan Hajnoczi
2023-10-03 14:41 ` Michael S. Tsirkin
3 siblings, 1 reply; 58+ messages in thread
From: Laszlo Ersek @ 2023-08-30 8:41 UTC (permalink / raw)
To: qemu-devel
Cc: Michael S. Tsirkin, Eugenio Perez Martin, German Maglione,
Liu Jiang, Sergio Lopez Pascual, Stefano Garzarella,
Stefan Hajnoczi
I'm adding Stefan to the CC list, and an additional piece of explanation
below:
On 8/27/23 20:29, Laszlo Ersek wrote:
> (1) The virtio-1.0 specification
> <http://docs.oasis-open.org/virtio/virtio/v1.0/virtio-v1.0.html> writes:
>
>> 3 General Initialization And Device Operation
>> 3.1 Device Initialization
>> 3.1.1 Driver Requirements: Device Initialization
>>
>> [...]
>>
>> 7. Perform device-specific setup, including discovery of virtqueues for
>> the device, optional per-bus setup, reading and possibly writing the
>> device’s virtio configuration space, and population of virtqueues.
>>
>> 8. Set the DRIVER_OK status bit. At this point the device is “live”.
>
> and
>
>> 4 Virtio Transport Options
>> 4.1 Virtio Over PCI Bus
>> 4.1.4 Virtio Structure PCI Capabilities
>> 4.1.4.3 Common configuration structure layout
>> 4.1.4.3.2 Driver Requirements: Common configuration structure layout
>>
>> [...]
>>
>> The driver MUST configure the other virtqueue fields before enabling the
>> virtqueue with queue_enable.
>>
>> [...]
>
> These together mean that the following sub-sequence of steps is valid for
> a virtio-1.0 guest driver:
>
> (1.1) set "queue_enable" for the needed queues as the final part of device
> initialization step (7),
>
> (1.2) set DRIVER_OK in step (8),
>
> (1.3) immediately start sending virtio requests to the device.
>
> (2) When vhost-user is enabled, and the VHOST_USER_F_PROTOCOL_FEATURES
> special virtio feature is negotiated, then virtio rings start in disabled
> state, according to
> <https://qemu-project.gitlab.io/qemu/interop/vhost-user.html#ring-states>.
> In this case, explicit VHOST_USER_SET_VRING_ENABLE messages are needed for
> enabling vrings.
>
> Therefore setting "queue_enable" from the guest (1.1) is a *control plane*
> operation, which travels from the guest through QEMU to the vhost-user
> backend, using a unix domain socket.
>
> Whereas sending a virtio request (1.3) is a *data plane* operation, which
> evades QEMU -- it travels from guest to the vhost-user backend via
> eventfd.
>
> This means that steps (1.1) and (1.3) travel through different channels,
> and their relative order can be reversed, as perceived by the vhost-user
> backend.
>
> That's exactly what happens when OVMF's virtiofs driver (VirtioFsDxe) runs
> against the Rust-language virtiofsd version 1.7.2. (Which uses version
> 0.10.1 of the vhost-user-backend crate, and version 0.8.1 of the vhost
> crate.)
>
> Namely, when VirtioFsDxe binds a virtiofs device, it goes through the
> device initialization steps (i.e., control plane operations), and
> immediately sends a FUSE_INIT request too (i.e., performs a data plane
> operation). In the Rust-language virtiofsd, this creates a race between
> two components that run *concurrently*, i.e., in different threads or
> processes:
>
> - Control plane, handling vhost-user protocol messages:
>
> The "VhostUserSlaveReqHandlerMut::set_vring_enable" method
> [crates/vhost-user-backend/src/handler.rs] handles
> VHOST_USER_SET_VRING_ENABLE messages, and updates each vring's "enabled"
> flag according to the message processed.
>
> - Data plane, handling virtio / FUSE requests:
>
> The "VringEpollHandler::handle_event" method
> [crates/vhost-user-backend/src/event_loop.rs] handles the incoming
> virtio / FUSE request, consuming the virtio kick at the same time. If
> the vring's "enabled" flag is set, the virtio / FUSE request is
> processed genuinely. If the vring's "enabled" flag is clear, then the
> virtio / FUSE request is discarded.
>
> Note that OVMF enables the queue *first*, and sends FUSE_INIT *second*.
> However, if the data plane processor in virtiofsd wins the race, then it
> sees the FUSE_INIT *before* the control plane processor took notice of
> VHOST_USER_SET_VRING_ENABLE and green-lit the queue for the data plane
> processor. Therefore the latter drops FUSE_INIT on the floor, and goes
> back to waiting for further virtio / FUSE requests with epoll_wait.
> Meanwhile OVMF is stuck waiting for the FUSE_INIT response -- a deadlock.
I can explain why this issue has not been triggered by / witnessed with
the Linux guest driver for virtiofs ("fs/fuse/virtio_fs.c").
That driver registers *two* driver (callback) structures: a virtio
driver and a filesystem driver.
(1) The virtio driver half initializes the virtio device, and takes a
note of the particular virtio filesystem, remembering its "tag". See
virtio_fs_probe() -> virtio_device_ready(), and then virtio_fs_probe()
-> virtio_fs_add_instance().
Importantly, at this time, no FUSE_INIT request is sent.
(2) The filesystem driver half has a totally independent entry point.
The relevant parts (after the driver registration) are:
(a) virtio_fs_get_tree() -> virtio_fs_find_instance(), and
(b) if the "tag" was found, (b) virtio_fs_get_tree() ->
virtio_fs_fill_super() -> fuse_send_init().
Importantly, this occurs when guest userspace (i.e., an interactive
user, or a userspace automatism such as systemd) tries to mount a
*concrete* virtio filesystem, identified by its tag (such as in "mount
-t virtiofs TAG /mount/point").
This means that there is an *arbitrarily long* delay between (1)
VHOST_USER_SET_VRING_ENABLE (which QEMU sends to virtiofsd while the
guest is inside virtio_fs_probe()) and (2) FUSE_INIT (which the guest
kernel driver sends to virtiofsd while inside virtio_fs_get_tree()).
That huge delay is plenty for masking the race.
But the race is there nonetheless.
Also note that this race does not exist for vhost-net. For vhost-net,
AIUI, such queue operations are handled with ioctl()s, and ioctl()s are
synchronous by nature. Cf.
<https://qemu-project.gitlab.io/qemu/interop/vhost-user.html#vhost-user-protocol-f-reply-ack>:
"The original vhost-user specification only demands replies for certain
commands. This differs from the vhost protocol implementation where
commands are sent over an ioctl() call and block until the back-end has
completed."
Laszlo
>
> The deadlock is not deterministic. OVMF hangs infrequently during first
> boot. However, OVMF hangs almost certainly during reboots from the UEFI
> shell.
>
> The race can be "reliably masked" by inserting a very small delay -- a
> single debug message -- at the top of "VringEpollHandler::handle_event",
> i.e., just before the data plane processor checks the "enabled" field of
> the vring. That delay suffices for the control plane processor to act upon
> VHOST_USER_SET_VRING_ENABLE.
>
> We can deterministically prevent the race in QEMU, by blocking OVMF inside
> step (1.1) -- i.e., in the write to the "queue_enable" register -- until
> VHOST_USER_SET_VRING_ENABLE actually *completes*. That way OVMF's VCPU
> cannot advance to the FUSE_INIT submission before virtiofsd's control
> plane processor takes notice of the queue being enabled.
>
> Wait for VHOST_USER_SET_VRING_ENABLE completion by:
>
> - setting the NEED_REPLY flag on VHOST_USER_SET_VRING_ENABLE, and waiting
> for the reply, if the VHOST_USER_PROTOCOL_F_REPLY_ACK vhost-user feature
> has been negotiated, or
>
> - performing a separate VHOST_USER_GET_FEATURES *exchange*, which requires
> a backend response regardless of VHOST_USER_PROTOCOL_F_REPLY_ACK.
>
> Cc: "Michael S. Tsirkin" <mst@redhat.com> (supporter:vhost)
> Cc: Eugenio Perez Martin <eperezma@redhat.com>
> Cc: German Maglione <gmaglione@redhat.com>
> Cc: Liu Jiang <gerry@linux.alibaba.com>
> Cc: Sergio Lopez Pascual <slp@redhat.com>
> Cc: Stefano Garzarella <sgarzare@redhat.com>
> Signed-off-by: Laszlo Ersek <lersek@redhat.com>
> ---
> hw/virtio/vhost-user.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
> index beb4b832245e..01e0ca90c538 100644
> --- a/hw/virtio/vhost-user.c
> +++ b/hw/virtio/vhost-user.c
> @@ -1235,7 +1235,7 @@ static int vhost_user_set_vring_enable(struct vhost_dev *dev, int enable)
> .num = enable,
> };
>
> - ret = vhost_set_vring(dev, VHOST_USER_SET_VRING_ENABLE, &state, false);
> + ret = vhost_set_vring(dev, VHOST_USER_SET_VRING_ENABLE, &state, true);
> if (ret < 0) {
> /*
> * Restoring the previous state is likely infeasible, as well as
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH 0/7] vhost-user: call VHOST_USER_SET_VRING_ENABLE synchronously
2023-08-27 18:29 [PATCH 0/7] vhost-user: call VHOST_USER_SET_VRING_ENABLE synchronously Laszlo Ersek
` (6 preceding siblings ...)
2023-08-27 18:29 ` [PATCH 7/7] vhost-user: call VHOST_USER_SET_VRING_ENABLE synchronously Laszlo Ersek
@ 2023-08-30 8:48 ` Stefano Garzarella
2023-08-30 9:32 ` Laszlo Ersek
7 siblings, 1 reply; 58+ messages in thread
From: Stefano Garzarella @ 2023-08-30 8:48 UTC (permalink / raw)
To: Laszlo Ersek
Cc: qemu-devel, Michael S. Tsirkin, Eugenio Perez Martin,
German Maglione, Liu Jiang, Sergio Lopez Pascual
On Sun, Aug 27, 2023 at 08:29:30PM +0200, Laszlo Ersek wrote:
>The last patch in the series states and fixes the problem; prior patches
>only refactor the code.
Thanks for the fix and great cleanup!
I fully reviewed the series and LGTM.
An additional step that we can take (not in this series) crossed my
mind, though. In some places we repeat the following pattern:
vhost_user_write(dev, &msg, NULL, 0);
...
if (reply_supported) {
return process_message_reply(dev, &msg);
}
So what about extending the vhost_user_write_msg() added in this series
to also support these cases and remove some code?
Or maybe integrate vhost_user_write_msg() into vhost_user_write().
I mean something like this (untested):
diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
index 01e0ca90c5..9ee2a78afa 100644
--- a/hw/virtio/vhost-user.c
+++ b/hw/virtio/vhost-user.c
@@ -1130,13 +1130,19 @@ static int vhost_user_get_features(struct vhost_dev *dev, uint64_t *features)
return 0;
}
+typedef enum {
+ NO_REPLY,
+ REPLY_IF_SUPPORTED,
+ REPLY_FORCED,
+} VhostUserReply;
+
/* Note: "msg->hdr.flags" may be modified. */
static int vhost_user_write_msg(struct vhost_dev *dev, VhostUserMsg *msg,
- bool wait_for_reply)
+ VhostUserReply reply)
{
int ret;
- if (wait_for_reply) {
+ if (reply != NO_REPLY) {
bool reply_supported = virtio_has_feature(dev->protocol_features,
VHOST_USER_PROTOCOL_F_REPLY_ACK);
if (reply_supported) {
@@ -1149,7 +1155,7 @@ static int vhost_user_write_msg(struct vhost_dev *dev, VhostUserMsg *msg,
return ret;
}
- if (wait_for_reply) {
+ if (reply != NO_REPLY) {
uint64_t dummy;
if (msg->hdr.flags & VHOST_USER_NEED_REPLY_MASK) {
@@ -1162,7 +1168,9 @@ static int vhost_user_write_msg(struct vhost_dev *dev, VhostUserMsg *msg,
* Send VHOST_USER_GET_FEATURES which makes all backends
* send a reply.
*/
- return vhost_user_get_features(dev, &dummy);
+ if (reply == REPLY_FORCED) {
+ return vhost_user_get_features(dev, &dummy);
+ }
}
return 0;
@@ -2207,9 +2228,6 @@ static bool vhost_user_can_merge(struct vhost_dev *dev,
static int vhost_user_net_set_mtu(struct vhost_dev *dev, uint16_t mtu)
{
VhostUserMsg msg;
- bool reply_supported = virtio_has_feature(dev->protocol_features,
- VHOST_USER_PROTOCOL_F_REPLY_ACK);
- int ret;
if (!(dev->protocol_features & (1ULL << VHOST_USER_PROTOCOL_F_NET_MTU))) {
return 0;
@@ -2219,21 +2237,9 @@ static int vhost_user_net_set_mtu(struct vhost_dev *dev, uint16_t mtu)
msg.payload.u64 = mtu;
msg.hdr.size = sizeof(msg.payload.u64);
msg.hdr.flags = VHOST_USER_VERSION;
- if (reply_supported) {
- msg.hdr.flags |= VHOST_USER_NEED_REPLY_MASK;
- }
-
- ret = vhost_user_write(dev, &msg, NULL, 0);
- if (ret < 0) {
- return ret;
- }
/* If reply_ack supported, backend has to ack specified MTU is valid */
- if (reply_supported) {
- return process_message_reply(dev, &msg);
- }
-
- return 0;
+ return vhost_user_write_msg(dev, &msg, REPLY_IF_SUPPORTED);
}
static int vhost_user_send_device_iotlb_msg(struct vhost_dev *dev,
@@ -2313,10 +2319,7 @@ static int vhost_user_get_config(struct vhost_dev *dev, uint8_t *config,
static int vhost_user_set_config(struct vhost_dev *dev, const uint8_t *data,
uint32_t offset, uint32_t size, uint32_t flags)
{
- int ret;
uint8_t *p;
- bool reply_supported = virtio_has_feature(dev->protocol_features,
- VHOST_USER_PROTOCOL_F_REPLY_ACK);
VhostUserMsg msg = {
.hdr.request = VHOST_USER_SET_CONFIG,
@@ -2329,10 +2332,6 @@ static int vhost_user_set_config(struct vhost_dev *dev, const uint8_t *data,
return -ENOTSUP;
}
- if (reply_supported) {
- msg.hdr.flags |= VHOST_USER_NEED_REPLY_MASK;
- }
-
if (size > VHOST_USER_MAX_CONFIG_SIZE) {
return -EINVAL;
}
@@ -2343,16 +2342,7 @@ static int vhost_user_set_config(struct vhost_dev *dev, const uint8_t *data,
p = msg.payload.config.region;
memcpy(p, data, size);
- ret = vhost_user_write(dev, &msg, NULL, 0);
- if (ret < 0) {
- return ret;
- }
-
- if (reply_supported) {
- return process_message_reply(dev, &msg);
- }
-
- return 0;
+ return vhost_user_write_msg(dev, &msg, REPLY_IF_SUPPORTED);
}
Thanks,
Stefano
^ permalink raw reply related [flat|nested] 58+ messages in thread
* Re: [PATCH 7/7] vhost-user: call VHOST_USER_SET_VRING_ENABLE synchronously
2023-08-30 8:41 ` Laszlo Ersek
@ 2023-08-30 8:59 ` Laszlo Ersek
2023-08-30 9:04 ` Laszlo Ersek
0 siblings, 1 reply; 58+ messages in thread
From: Laszlo Ersek @ 2023-08-30 8:59 UTC (permalink / raw)
To: qemu-devel
Cc: Michael S. Tsirkin, Eugenio Perez Martin, German Maglione,
Liu Jiang, Sergio Lopez Pascual, Stefano Garzarella,
Stefan Hajnoczi
On 8/30/23 10:41, Laszlo Ersek wrote:
> I'm adding Stefan to the CC list, and an additional piece of explanation
> below:
>
> On 8/27/23 20:29, Laszlo Ersek wrote:
>> (1) The virtio-1.0 specification
>> <http://docs.oasis-open.org/virtio/virtio/v1.0/virtio-v1.0.html> writes:
>>
>>> 3 General Initialization And Device Operation
>>> 3.1 Device Initialization
>>> 3.1.1 Driver Requirements: Device Initialization
>>>
>>> [...]
>>>
>>> 7. Perform device-specific setup, including discovery of virtqueues for
>>> the device, optional per-bus setup, reading and possibly writing the
>>> device’s virtio configuration space, and population of virtqueues.
>>>
>>> 8. Set the DRIVER_OK status bit. At this point the device is “live”.
>>
>> and
>>
>>> 4 Virtio Transport Options
>>> 4.1 Virtio Over PCI Bus
>>> 4.1.4 Virtio Structure PCI Capabilities
>>> 4.1.4.3 Common configuration structure layout
>>> 4.1.4.3.2 Driver Requirements: Common configuration structure layout
>>>
>>> [...]
>>>
>>> The driver MUST configure the other virtqueue fields before enabling the
>>> virtqueue with queue_enable.
>>>
>>> [...]
>>
>> These together mean that the following sub-sequence of steps is valid for
>> a virtio-1.0 guest driver:
>>
>> (1.1) set "queue_enable" for the needed queues as the final part of device
>> initialization step (7),
>>
>> (1.2) set DRIVER_OK in step (8),
>>
>> (1.3) immediately start sending virtio requests to the device.
>>
>> (2) When vhost-user is enabled, and the VHOST_USER_F_PROTOCOL_FEATURES
>> special virtio feature is negotiated, then virtio rings start in disabled
>> state, according to
>> <https://qemu-project.gitlab.io/qemu/interop/vhost-user.html#ring-states>.
>> In this case, explicit VHOST_USER_SET_VRING_ENABLE messages are needed for
>> enabling vrings.
>>
>> Therefore setting "queue_enable" from the guest (1.1) is a *control plane*
>> operation, which travels from the guest through QEMU to the vhost-user
>> backend, using a unix domain socket.
>>
>> Whereas sending a virtio request (1.3) is a *data plane* operation, which
>> evades QEMU -- it travels from guest to the vhost-user backend via
>> eventfd.
>>
>> This means that steps (1.1) and (1.3) travel through different channels,
>> and their relative order can be reversed, as perceived by the vhost-user
>> backend.
>>
>> That's exactly what happens when OVMF's virtiofs driver (VirtioFsDxe) runs
>> against the Rust-language virtiofsd version 1.7.2. (Which uses version
>> 0.10.1 of the vhost-user-backend crate, and version 0.8.1 of the vhost
>> crate.)
>>
>> Namely, when VirtioFsDxe binds a virtiofs device, it goes through the
>> device initialization steps (i.e., control plane operations), and
>> immediately sends a FUSE_INIT request too (i.e., performs a data plane
>> operation). In the Rust-language virtiofsd, this creates a race between
>> two components that run *concurrently*, i.e., in different threads or
>> processes:
>>
>> - Control plane, handling vhost-user protocol messages:
>>
>> The "VhostUserSlaveReqHandlerMut::set_vring_enable" method
>> [crates/vhost-user-backend/src/handler.rs] handles
>> VHOST_USER_SET_VRING_ENABLE messages, and updates each vring's "enabled"
>> flag according to the message processed.
>>
>> - Data plane, handling virtio / FUSE requests:
>>
>> The "VringEpollHandler::handle_event" method
>> [crates/vhost-user-backend/src/event_loop.rs] handles the incoming
>> virtio / FUSE request, consuming the virtio kick at the same time. If
>> the vring's "enabled" flag is set, the virtio / FUSE request is
>> processed genuinely. If the vring's "enabled" flag is clear, then the
>> virtio / FUSE request is discarded.
>>
>> Note that OVMF enables the queue *first*, and sends FUSE_INIT *second*.
>> However, if the data plane processor in virtiofsd wins the race, then it
>> sees the FUSE_INIT *before* the control plane processor took notice of
>> VHOST_USER_SET_VRING_ENABLE and green-lit the queue for the data plane
>> processor. Therefore the latter drops FUSE_INIT on the floor, and goes
>> back to waiting for further virtio / FUSE requests with epoll_wait.
>> Meanwhile OVMF is stuck waiting for the FUSE_INIT response -- a deadlock.
>
> I can explain why this issue has not been triggered by / witnessed with
> the Linux guest driver for virtiofs ("fs/fuse/virtio_fs.c").
>
> That driver registers *two* driver (callback) structures, a virtio
> driver, and a filesystem driver.
>
> (1) The virtio driver half initializes the virtio device, and takes a
> note of the particular virtio filesystem, remembering its "tag". See
> virtio_fs_probe() -> virtio_device_ready(), and then virtio_fs_probe()
> -> virtio_fs_add_instance().
>
> Importantly, at this time, no FUSE_INIT request is sent.
>
> (2) The filesystem driver half has a totally independent entry point.
> The relevant parts (after the driver registration) are:
>
> (a) virtio_fs_get_tree() -> virtio_fs_find_instance(), and
>
> (b) if the "tag" was found, (b) virtio_fs_get_tree() ->
> virtio_fs_fill_super() -> fuse_send_init().
>
> Importantly, this occurs when guest userspace (i.e., an interactive
> user, or a userspace automatism such as systemd) tries to mount a
> *concrete* virtio filesystem, identified by its tag (such as in "mount
> -t virtiofs TAG /mount/point").
>
>
> This means that there is an *arbitrarily long* delay between (1)
> VHOST_USER_SET_VRING_ENABLE (which QEMU sends to virtiofsd while the
> guest is inside virtio_fs_probe()) and (2) FUSE_INIT (which the guest
> kernel driver sends to virtiofsd while inside virtio_fs_get_tree()).
>
> That huge delay is plenty for masking the race.
>
> But the race is there nonetheless.
Furthermore, the race was not seen in the C-language virtiofsd
implementation (removed in QEMU commit
e0dc2631ec4ac718ebe22ddea0ab25524eb37b0e) for the following reason:
The C language virtiofsd *did not care* about
VHOST_USER_SET_VRING_ENABLE at all:
- Upon VHOST_USER_GET_VRING_BASE, vu_get_vring_base_exec() in
libvhost-user would call fv_queue_set_started() in virtiofsd, and the
latter would start the data plane thread fv_queue_thread().
- Upon VHOST_USER_SET_VRING_ENABLE, vu_set_vring_enable_exec() in
libvhost-user would set the "enable" field, but not call back into
virtiofsd. And virtiofsd ("tools/virtiofsd/fuse_virtio.c") nowhere
checks the "enable" field.
In summary, the C-language virtiofsd didn't implement queue enablement
in a conformant way. The Rust-language version does, but that exposes a
race in how QEMU sends VHOST_USER_SET_VRING_ENABLE. The race is
triggered by the OVMF guest driver, and not triggered by the Linux guest
driver (since the latter introduces an unbounded delay between vring
enablement and FUSE_INIT submission).
Laszlo
>
>
> Also note that this race does not exist for vhost-net. For vhost-net,
> AIUI, such queue operations are handled with ioctl()s, and ioctl()s are
> synchronous by nature. Cf.
> <https://qemu-project.gitlab.io/qemu/interop/vhost-user.html#vhost-user-protocol-f-reply-ack>:
>
> "The original vhost-user specification only demands replies for certain
> commands. This differs from the vhost protocol implementation where
> commands are sent over an ioctl() call and block until the back-end has
> completed."
>
> Laszlo
>
>>
>> The deadlock is not deterministic. OVMF hangs infrequently during first
>> boot. However, OVMF hangs almost certainly during reboots from the UEFI
>> shell.
>>
>> The race can be "reliably masked" by inserting a very small delay -- a
>> single debug message -- at the top of "VringEpollHandler::handle_event",
>> i.e., just before the data plane processor checks the "enabled" field of
>> the vring. That delay suffices for the control plane processor to act upon
>> VHOST_USER_SET_VRING_ENABLE.
>>
>> We can deterministically prevent the race in QEMU, by blocking OVMF inside
>> step (1.1) -- i.e., in the write to the "queue_enable" register -- until
>> VHOST_USER_SET_VRING_ENABLE actually *completes*. That way OVMF's VCPU
>> cannot advance to the FUSE_INIT submission before virtiofsd's control
>> plane processor takes notice of the queue being enabled.
>>
>> Wait for VHOST_USER_SET_VRING_ENABLE completion by:
>>
>> - setting the NEED_REPLY flag on VHOST_USER_SET_VRING_ENABLE, and waiting
>> for the reply, if the VHOST_USER_PROTOCOL_F_REPLY_ACK vhost-user feature
>> has been negotiated, or
>>
>> - performing a separate VHOST_USER_GET_FEATURES *exchange*, which requires
>> a backend response regardless of VHOST_USER_PROTOCOL_F_REPLY_ACK.
>>
>> Cc: "Michael S. Tsirkin" <mst@redhat.com> (supporter:vhost)
>> Cc: Eugenio Perez Martin <eperezma@redhat.com>
>> Cc: German Maglione <gmaglione@redhat.com>
>> Cc: Liu Jiang <gerry@linux.alibaba.com>
>> Cc: Sergio Lopez Pascual <slp@redhat.com>
>> Cc: Stefano Garzarella <sgarzare@redhat.com>
>> Signed-off-by: Laszlo Ersek <lersek@redhat.com>
>> ---
>> hw/virtio/vhost-user.c | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
>> index beb4b832245e..01e0ca90c538 100644
>> --- a/hw/virtio/vhost-user.c
>> +++ b/hw/virtio/vhost-user.c
>> @@ -1235,7 +1235,7 @@ static int vhost_user_set_vring_enable(struct vhost_dev *dev, int enable)
>> .num = enable,
>> };
>>
>> - ret = vhost_set_vring(dev, VHOST_USER_SET_VRING_ENABLE, &state, false);
>> + ret = vhost_set_vring(dev, VHOST_USER_SET_VRING_ENABLE, &state, true);
>> if (ret < 0) {
>> /*
>> * Restoring the previous state is likely infeasible, as well as
>
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH 7/7] vhost-user: call VHOST_USER_SET_VRING_ENABLE synchronously
2023-08-30 8:59 ` Laszlo Ersek
@ 2023-08-30 9:04 ` Laszlo Ersek
0 siblings, 0 replies; 58+ messages in thread
From: Laszlo Ersek @ 2023-08-30 9:04 UTC (permalink / raw)
To: qemu-devel
Cc: Michael S. Tsirkin, Eugenio Perez Martin, German Maglione,
Liu Jiang, Sergio Lopez Pascual, Stefano Garzarella,
Stefan Hajnoczi
On 8/30/23 10:59, Laszlo Ersek wrote:
> On 8/30/23 10:41, Laszlo Ersek wrote:
>> I'm adding Stefan to the CC list, and an additional piece of explanation
>> below:
>>
>> On 8/27/23 20:29, Laszlo Ersek wrote:
>>> (1) The virtio-1.0 specification
>>> <http://docs.oasis-open.org/virtio/virtio/v1.0/virtio-v1.0.html> writes:
>>>
>>>> 3 General Initialization And Device Operation
>>>> 3.1 Device Initialization
>>>> 3.1.1 Driver Requirements: Device Initialization
>>>>
>>>> [...]
>>>>
>>>> 7. Perform device-specific setup, including discovery of virtqueues for
>>>> the device, optional per-bus setup, reading and possibly writing the
>>>> device’s virtio configuration space, and population of virtqueues.
>>>>
>>>> 8. Set the DRIVER_OK status bit. At this point the device is “live”.
>>>
>>> and
>>>
>>>> 4 Virtio Transport Options
>>>> 4.1 Virtio Over PCI Bus
>>>> 4.1.4 Virtio Structure PCI Capabilities
>>>> 4.1.4.3 Common configuration structure layout
>>>> 4.1.4.3.2 Driver Requirements: Common configuration structure layout
>>>>
>>>> [...]
>>>>
>>>> The driver MUST configure the other virtqueue fields before enabling the
>>>> virtqueue with queue_enable.
>>>>
>>>> [...]
>>>
>>> These together mean that the following sub-sequence of steps is valid for
>>> a virtio-1.0 guest driver:
>>>
>>> (1.1) set "queue_enable" for the needed queues as the final part of device
>>> initialization step (7),
>>>
>>> (1.2) set DRIVER_OK in step (8),
>>>
>>> (1.3) immediately start sending virtio requests to the device.
>>>
>>> (2) When vhost-user is enabled, and the VHOST_USER_F_PROTOCOL_FEATURES
>>> special virtio feature is negotiated, then virtio rings start in disabled
>>> state, according to
>>> <https://qemu-project.gitlab.io/qemu/interop/vhost-user.html#ring-states>.
>>> In this case, explicit VHOST_USER_SET_VRING_ENABLE messages are needed for
>>> enabling vrings.
>>>
>>> Therefore setting "queue_enable" from the guest (1.1) is a *control plane*
>>> operation, which travels from the guest through QEMU to the vhost-user
>>> backend, using a unix domain socket.
>>>
>>> Whereas sending a virtio request (1.3) is a *data plane* operation, which
>>> evades QEMU -- it travels from guest to the vhost-user backend via
>>> eventfd.
>>>
>>> This means that steps (1.1) and (1.3) travel through different channels,
>>> and their relative order can be reversed, as perceived by the vhost-user
>>> backend.
>>>
>>> That's exactly what happens when OVMF's virtiofs driver (VirtioFsDxe) runs
>>> against the Rust-language virtiofsd version 1.7.2. (Which uses version
>>> 0.10.1 of the vhost-user-backend crate, and version 0.8.1 of the vhost
>>> crate.)
>>>
>>> Namely, when VirtioFsDxe binds a virtiofs device, it goes through the
>>> device initialization steps (i.e., control plane operations), and
>>> immediately sends a FUSE_INIT request too (i.e., performs a data plane
>>> operation). In the Rust-language virtiofsd, this creates a race between
>>> two components that run *concurrently*, i.e., in different threads or
>>> processes:
>>>
>>> - Control plane, handling vhost-user protocol messages:
>>>
>>> The "VhostUserSlaveReqHandlerMut::set_vring_enable" method
>>> [crates/vhost-user-backend/src/handler.rs] handles
>>> VHOST_USER_SET_VRING_ENABLE messages, and updates each vring's "enabled"
>>> flag according to the message processed.
>>>
>>> - Data plane, handling virtio / FUSE requests:
>>>
>>> The "VringEpollHandler::handle_event" method
>>> [crates/vhost-user-backend/src/event_loop.rs] handles the incoming
>>> virtio / FUSE request, consuming the virtio kick at the same time. If
>>> the vring's "enabled" flag is set, the virtio / FUSE request is
>>> processed genuinely. If the vring's "enabled" flag is clear, then the
>>> virtio / FUSE request is discarded.
>>>
>>> Note that OVMF enables the queue *first*, and sends FUSE_INIT *second*.
>>> However, if the data plane processor in virtiofsd wins the race, then it
>>> sees the FUSE_INIT *before* the control plane processor took notice of
>>> VHOST_USER_SET_VRING_ENABLE and green-lit the queue for the data plane
>>> processor. Therefore the latter drops FUSE_INIT on the floor, and goes
>>> back to waiting for further virtio / FUSE requests with epoll_wait.
>>> Meanwhile OVMF is stuck waiting for the FUSE_INIT response -- a deadlock.
>>
>> I can explain why this issue has not been triggered by / witnessed with
>> the Linux guest driver for virtiofs ("fs/fuse/virtio_fs.c").
>>
>> That driver registers *two* driver (callback) structures, a virtio
>> driver, and a filesystem driver.
>>
>> (1) The virtio driver half initializes the virtio device, and takes a
>> note of the particular virtio filesystem, remembering its "tag". See
>> virtio_fs_probe() -> virtio_device_ready(), and then virtio_fs_probe()
>> -> virtio_fs_add_instance().
>>
>> Importantly, at this time, no FUSE_INIT request is sent.
>>
>> (2) The filesystem driver half has a totally independent entry point.
>> The relevant parts (after the driver registration) are:
>>
>> (a) virtio_fs_get_tree() -> virtio_fs_find_instance(), and
>>
>> (b) if the "tag" was found, (b) virtio_fs_get_tree() ->
>> virtio_fs_fill_super() -> fuse_send_init().
>>
>> Importantly, this occurs when guest userspace (i.e., an interactive
>> user, or a userspace automatism such as systemd) tries to mount a
>> *concrete* virtio filesystem, identified by its tag (such as in "mount
>> -t virtiofs TAG /mount/point").
>>
>>
>> This means that there is an *arbitrarily long* delay between (1)
>> VHOST_USER_SET_VRING_ENABLE (which QEMU sends to virtiofsd while the
>> guest is inside virtio_fs_probe()) and (2) FUSE_INIT (which the guest
>> kernel driver sends to virtiofsd while inside virtio_fs_get_tree()).
>>
>> That huge delay is plenty for masking the race.
>>
>> But the race is there nonetheless.
>
> Furthermore, the race was not seen in the C-language virtiofsd
> implementation (removed in QEMU commit
> e0dc2631ec4ac718ebe22ddea0ab25524eb37b0e) for the following reason:
>
> The C language virtiofsd *did not care* about
> VHOST_USER_SET_VRING_ENABLE at all:
>
> - Upon VHOST_USER_GET_VRING_BASE, vu_get_vring_base_exec() in
> libvhost-user would call fv_queue_set_started() in virtiofsd, and the
> latter would start the data plane thread fv_queue_thread().
Sorry, not on VHOST_USER_GET_VRING_BASE, but on
VHOST_USER_SET_VRING_KICK / VHOST_USER_VRING_KICK. But that doesn't
change the rest of the argument, namely that VHOST_USER_SET_VRING_ENABLE
had no effect whatsoever on the C-language virtiofsd.
Laszlo
>
> - Upon VHOST_USER_SET_VRING_ENABLE, vu_set_vring_enable_exec() in
> libvhost-user would set the "enable" field, but not call back into
> virtiofsd. And virtiofsd ("tools/virtiofsd/fuse_virtio.c") nowhere
> checks the "enable" field.
>
> In summary, the C-language virtiofsd didn't implement queue enablement
> in a conformant way. The Rust-language version does, but that exposes a
> race in how QEMU sends VHOST_USER_SET_VRING_ENABLE. The race is
> triggered by the OVMF guest driver, and not triggered by the Linux guest
> driver (since the latter introduces an unbounded delay between vring
> enablement and FUSE_INIT submission).
>
> Laszlo
>
>>
>>
>> Also note that this race does not exist for vhost-net. For vhost-net,
>> AIUI, such queue operations are handled with ioctl()s, and ioctl()s are
>> synchronous by nature. Cf.
>> <https://qemu-project.gitlab.io/qemu/interop/vhost-user.html#vhost-user-protocol-f-reply-ack>:
>>
>> "The original vhost-user specification only demands replies for certain
>> commands. This differs from the vhost protocol implementation where
>> commands are sent over an ioctl() call and block until the back-end has
>> completed."
>>
>> Laszlo
>>
>>>
>>> The deadlock is not deterministic. OVMF hangs infrequently during first
>>> boot. However, OVMF hangs almost certainly during reboots from the UEFI
>>> shell.
>>>
>>> The race can be "reliably masked" by inserting a very small delay -- a
>>> single debug message -- at the top of "VringEpollHandler::handle_event",
>>> i.e., just before the data plane processor checks the "enabled" field of
>>> the vring. That delay suffices for the control plane processor to act upon
>>> VHOST_USER_SET_VRING_ENABLE.
>>>
>>> We can deterministically prevent the race in QEMU, by blocking OVMF inside
>>> step (1.1) -- i.e., in the write to the "queue_enable" register -- until
>>> VHOST_USER_SET_VRING_ENABLE actually *completes*. That way OVMF's VCPU
>>> cannot advance to the FUSE_INIT submission before virtiofsd's control
>>> plane processor takes notice of the queue being enabled.
>>>
>>> Wait for VHOST_USER_SET_VRING_ENABLE completion by:
>>>
>>> - setting the NEED_REPLY flag on VHOST_USER_SET_VRING_ENABLE, and waiting
>>> for the reply, if the VHOST_USER_PROTOCOL_F_REPLY_ACK vhost-user feature
>>> has been negotiated, or
>>>
>>> - performing a separate VHOST_USER_GET_FEATURES *exchange*, which requires
>>> a backend response regardless of VHOST_USER_PROTOCOL_F_REPLY_ACK.
>>>
>>> Cc: "Michael S. Tsirkin" <mst@redhat.com> (supporter:vhost)
>>> Cc: Eugenio Perez Martin <eperezma@redhat.com>
>>> Cc: German Maglione <gmaglione@redhat.com>
>>> Cc: Liu Jiang <gerry@linux.alibaba.com>
>>> Cc: Sergio Lopez Pascual <slp@redhat.com>
>>> Cc: Stefano Garzarella <sgarzare@redhat.com>
>>> Signed-off-by: Laszlo Ersek <lersek@redhat.com>
>>> ---
>>> hw/virtio/vhost-user.c | 2 +-
>>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
>>> index beb4b832245e..01e0ca90c538 100644
>>> --- a/hw/virtio/vhost-user.c
>>> +++ b/hw/virtio/vhost-user.c
>>> @@ -1235,7 +1235,7 @@ static int vhost_user_set_vring_enable(struct vhost_dev *dev, int enable)
>>> .num = enable,
>>> };
>>>
>>> - ret = vhost_set_vring(dev, VHOST_USER_SET_VRING_ENABLE, &state, false);
>>> + ret = vhost_set_vring(dev, VHOST_USER_SET_VRING_ENABLE, &state, true);
>>> if (ret < 0) {
>>> /*
>>> * Restoring the previous state is likely infeasible, as well as
>>
>
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH 3/7] vhost-user: factor out "vhost_user_write_msg"
2023-08-30 8:31 ` Stefano Garzarella
@ 2023-08-30 9:14 ` Laszlo Ersek
2023-08-30 9:54 ` Laszlo Ersek
0 siblings, 1 reply; 58+ messages in thread
From: Laszlo Ersek @ 2023-08-30 9:14 UTC (permalink / raw)
To: Stefano Garzarella
Cc: qemu-devel, Michael S. Tsirkin, Eugenio Perez Martin,
German Maglione, Liu Jiang, Sergio Lopez Pascual
On 8/30/23 10:31, Stefano Garzarella wrote:
> On Sun, Aug 27, 2023 at 08:29:33PM +0200, Laszlo Ersek wrote:
>> The tails of the "vhost_user_set_vring_addr" and "vhost_user_set_u64"
>> functions are now byte-for-byte identical. Factor the common tail out
>> to a
>> new function called "vhost_user_write_msg".
>>
>> This is purely refactoring -- no observable change.
>>
>> Cc: "Michael S. Tsirkin" <mst@redhat.com> (supporter:vhost)
>> Cc: Eugenio Perez Martin <eperezma@redhat.com>
>> Cc: German Maglione <gmaglione@redhat.com>
>> Cc: Liu Jiang <gerry@linux.alibaba.com>
>> Cc: Sergio Lopez Pascual <slp@redhat.com>
>> Cc: Stefano Garzarella <sgarzare@redhat.com>
>> Signed-off-by: Laszlo Ersek <lersek@redhat.com>
>> ---
>> hw/virtio/vhost-user.c | 66 +++++++++-----------
>> 1 file changed, 28 insertions(+), 38 deletions(-)
>>
>> diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
>> index 64eac317bfb2..36f99b66a644 100644
>> --- a/hw/virtio/vhost-user.c
>> +++ b/hw/virtio/vhost-user.c
>> @@ -1320,10 +1320,35 @@ static int enforce_reply(struct vhost_dev *dev,
>> return vhost_user_get_features(dev, &dummy);
>> }
>>
>> +/* Note: "msg->hdr.flags" may be modified. */
>>> +static int vhost_user_write_msg(struct vhost_dev *dev, VhostUserMsg *msg,
>> + bool wait_for_reply)
>
> The difference between vhost_user_write() and vhost_user_write_msg() is
> not immediately obvious from the function name, so I would propose
> something different, like vhost_user_write_sync() or
> vhost_user_write_wait().
I'm mostly OK with either variant; I think I may have thought of _sync
myself, but didn't like it because the wait would be *optional*,
dependent on caller choice. And I didn't like
vhost_user_write_maybe_wait() either; that one seemed awkward / too verbose.
Let's see what others prefer. :)
>
> Anyway, I'm not good with names and don't have a strong opinion, so this
> version is fine with me as well :-)
>
> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
>
Thanks!
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH 7/7] vhost-user: call VHOST_USER_SET_VRING_ENABLE synchronously
2023-08-30 8:39 ` Stefano Garzarella
@ 2023-08-30 9:26 ` Laszlo Ersek
2023-08-30 14:24 ` Stefano Garzarella
0 siblings, 1 reply; 58+ messages in thread
From: Laszlo Ersek @ 2023-08-30 9:26 UTC (permalink / raw)
To: Stefano Garzarella
Cc: qemu-devel, Michael S. Tsirkin, Eugenio Perez Martin,
German Maglione, Liu Jiang, Sergio Lopez Pascual
On 8/30/23 10:39, Stefano Garzarella wrote:
> On Sun, Aug 27, 2023 at 08:29:37PM +0200, Laszlo Ersek wrote:
>> (1) The virtio-1.0 specification
>> <http://docs.oasis-open.org/virtio/virtio/v1.0/virtio-v1.0.html> writes:
>
> What about referring the latest spec available now (1.2)?
I didn't want to do that because the OVMF guest driver was written
against 1.0 (and the spec and the device are backwards compatible).
But, I don't feel strongly about this; I'm OK updating the reference /
quote to 1.2.
>
>>
>>> 3 General Initialization And Device Operation
>>> 3.1 Device Initialization
>>> 3.1.1 Driver Requirements: Device Initialization
>>>
>>> [...]
>>>
>>> 7. Perform device-specific setup, including discovery of virtqueues for
>>> the device, optional per-bus setup, reading and possibly writing the
>>> device’s virtio configuration space, and population of virtqueues.
>>>
>>> 8. Set the DRIVER_OK status bit. At this point the device is “live”.
>>
>> and
>>
>>> 4 Virtio Transport Options
>>> 4.1 Virtio Over PCI Bus
>>> 4.1.4 Virtio Structure PCI Capabilities
>>> 4.1.4.3 Common configuration structure layout
>>> 4.1.4.3.2 Driver Requirements: Common configuration structure layout
>>>
>>> [...]
>>>
>>> The driver MUST configure the other virtqueue fields before enabling the
>>> virtqueue with queue_enable.
>>>
>>> [...]
>>
>> These together mean that the following sub-sequence of steps is valid for
>> a virtio-1.0 guest driver:
>>
>> (1.1) set "queue_enable" for the needed queues as the final part of
>> device
>> initialization step (7),
>>
>> (1.2) set DRIVER_OK in step (8),
>>
>> (1.3) immediately start sending virtio requests to the device.
>>
>> (2) When vhost-user is enabled, and the VHOST_USER_F_PROTOCOL_FEATURES
>> special virtio feature is negotiated, then virtio rings start in disabled
>> state, according to
>> <https://qemu-project.gitlab.io/qemu/interop/vhost-user.html#ring-states>.
>> In this case, explicit VHOST_USER_SET_VRING_ENABLE messages are needed
>> for
>> enabling vrings.
>>
>> Therefore setting "queue_enable" from the guest (1.1) is a *control
>> plane*
>> operation, which travels from the guest through QEMU to the vhost-user
>> backend, using a unix domain socket.
>>
>> Whereas sending a virtio request (1.3) is a *data plane* operation, which
>> evades QEMU -- it travels from guest to the vhost-user backend via
>> eventfd.
>>
>> This means that steps (1.1) and (1.3) travel through different channels,
>> and their relative order can be reversed, as perceived by the vhost-user
>> backend.
>>
>> That's exactly what happens when OVMF's virtiofs driver (VirtioFsDxe)
>> runs
>> against the Rust-language virtiofsd version 1.7.2. (Which uses version
>> 0.10.1 of the vhost-user-backend crate, and version 0.8.1 of the vhost
>> crate.)
>>
>> Namely, when VirtioFsDxe binds a virtiofs device, it goes through the
>> device initialization steps (i.e., control plane operations), and
>> immediately sends a FUSE_INIT request too (i.e., performs a data plane
>> operation). In the Rust-language virtiofsd, this creates a race between
>> two components that run *concurrently*, i.e., in different threads or
>> processes:
>>
>> - Control plane, handling vhost-user protocol messages:
>>
>> The "VhostUserSlaveReqHandlerMut::set_vring_enable" method
>> [crates/vhost-user-backend/src/handler.rs] handles
>> VHOST_USER_SET_VRING_ENABLE messages, and updates each vring's "enabled"
>> flag according to the message processed.
>>
>> - Data plane, handling virtio / FUSE requests:
>>
>> The "VringEpollHandler::handle_event" method
>> [crates/vhost-user-backend/src/event_loop.rs] handles the incoming
>> virtio / FUSE request, consuming the virtio kick at the same time. If
>> the vring's "enabled" flag is set, the virtio / FUSE request is
>> processed genuinely. If the vring's "enabled" flag is clear, then the
>> virtio / FUSE request is discarded.
>>
>> Note that OVMF enables the queue *first*, and sends FUSE_INIT *second*.
>> However, if the data plane processor in virtiofsd wins the race, then it
>> sees the FUSE_INIT *before* the control plane processor took notice of
>> VHOST_USER_SET_VRING_ENABLE and green-lit the queue for the data plane
>> processor. Therefore the latter drops FUSE_INIT on the floor, and goes
>> back to waiting for further virtio / FUSE requests with epoll_wait.
>> Meanwhile OVMF is stuck waiting for the FUSE_INIT response -- a
>> deadlock.
>>
>> The deadlock is not deterministic. OVMF hangs infrequently during first
>> boot. However, OVMF hangs almost certainly during reboots from the UEFI
>> shell.
>>
>> The race can be "reliably masked" by inserting a very small delay -- a
>> single debug message -- at the top of "VringEpollHandler::handle_event",
>> i.e., just before the data plane processor checks the "enabled" field of
>> the vring. That delay suffices for the control plane processor to act
>> upon
>> VHOST_USER_SET_VRING_ENABLE.
>>
>> We can deterministically prevent the race in QEMU, by blocking OVMF
>> inside
>> step (1.1) -- i.e., in the write to the "queue_enable" register -- until
>> VHOST_USER_SET_VRING_ENABLE actually *completes*. That way OVMF's VCPU
>> cannot advance to the FUSE_INIT submission before virtiofsd's control
>> plane processor takes notice of the queue being enabled.
>>
>> Wait for VHOST_USER_SET_VRING_ENABLE completion by:
>>
>> - setting the NEED_REPLY flag on VHOST_USER_SET_VRING_ENABLE, and waiting
>> for the reply, if the VHOST_USER_PROTOCOL_F_REPLY_ACK vhost-user feature
>> has been negotiated, or
>>
>> - performing a separate VHOST_USER_GET_FEATURES *exchange*, which
>> requires
>> a backend response regardless of VHOST_USER_PROTOCOL_F_REPLY_ACK.
>
> Thanks for the excellent analysis (and fix of course!).
>
>>
>> Cc: "Michael S. Tsirkin" <mst@redhat.com> (supporter:vhost)
>> Cc: Eugenio Perez Martin <eperezma@redhat.com>
>> Cc: German Maglione <gmaglione@redhat.com>
>> Cc: Liu Jiang <gerry@linux.alibaba.com>
>> Cc: Sergio Lopez Pascual <slp@redhat.com>
>> Cc: Stefano Garzarella <sgarzare@redhat.com>
>> Signed-off-by: Laszlo Ersek <lersek@redhat.com>
>> ---
>> hw/virtio/vhost-user.c | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
>> index beb4b832245e..01e0ca90c538 100644
>> --- a/hw/virtio/vhost-user.c
>> +++ b/hw/virtio/vhost-user.c
>> @@ -1235,7 +1235,7 @@ static int vhost_user_set_vring_enable(struct vhost_dev *dev, int enable)
>> .num = enable,
>> };
>>
>
> How about adding a small comment here summarizing the commit message in
> a few lines?
Right, I can do that!
>
> Should we cc stable for this fix?
Hm, that didn't occur to me.
AFAICT, the issue goes back to the introduction of
VHOST_USER_SET_VRING_ENABLE, in commit 7263a0ad7899 ("vhost-user: add a
new message to disable/enable a specific virt queue.", 2015-09-24) --
part of release v2.5.0.
What are the "live" stable branches at this time?
Applying the series on top of v8.1.0 shouldn't be hard, as
"hw/virtio/vhost-user.c" is identical between v8.1.0 and 50e7a40af372 (=
the base commit of this series).
Applying the series on top of v8.0.0 looks more messy, the file had seen
significant changes between 8.0 and 8.1. I'd rather not attempt the
backport (bunch of refactorings etc) to 8.0.
If I just CC stable, what stable branch is going to be targeted?
>
>
> In any case, the fix LGTM, so:
>
> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Thanks!
Laszlo
>> - ret = vhost_set_vring(dev, VHOST_USER_SET_VRING_ENABLE, &state, false);
>> + ret = vhost_set_vring(dev, VHOST_USER_SET_VRING_ENABLE, &state, true);
>> if (ret < 0) {
>> /*
>> * Restoring the previous state is likely infeasible, as well as
>
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH 0/7] vhost-user: call VHOST_USER_SET_VRING_ENABLE synchronously
2023-08-30 8:48 ` [PATCH 0/7] " Stefano Garzarella
@ 2023-08-30 9:32 ` Laszlo Ersek
0 siblings, 0 replies; 58+ messages in thread
From: Laszlo Ersek @ 2023-08-30 9:32 UTC (permalink / raw)
To: Stefano Garzarella
Cc: qemu-devel, Michael S. Tsirkin, Eugenio Perez Martin,
German Maglione, Liu Jiang, Sergio Lopez Pascual
On 8/30/23 10:48, Stefano Garzarella wrote:
> On Sun, Aug 27, 2023 at 08:29:30PM +0200, Laszlo Ersek wrote:
>> The last patch in the series states and fixes the problem; prior patches
>> only refactor the code.
>
> Thanks for the fix and great cleanup!
>
> I fully reviewed the series and LGTM.
>
> An additional step that we can take (not in this series) crossed my
> mind, though. In some places we repeat the following pattern:
>
> vhost_user_write(dev, &msg, NULL, 0);
> ...
>
> if (reply_supported) {
> return process_message_reply(dev, &msg);
> }
>
>> So what about extending the vhost_user_write_msg() added in this series
>> to also support these cases and remove some code?
>> Or maybe integrate vhost_user_write_msg() into vhost_user_write().
Good idea, I'd just like someone else to do it -- and as you say, after
this series :)
This series is relatively packed with "thought" already (in the last
patch), plus a week ago I knew absolutely nothing about vhost /
vhost-user. (And, I read the whole blog series at
<https://www.redhat.com/en/virtio-networking-series> in 1-2 days, while
analyzing this issue, to understand the design of vhost.)
So I'd prefer keeping my first contribution in this area limited -- what
you are suggesting touches on some of the requests that require genuine
responses, and I didn't want to fiddle with those.
(I think your patch should be fine BTW!)
Laszlo
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH 3/7] vhost-user: factor out "vhost_user_write_msg"
2023-08-30 9:14 ` Laszlo Ersek
@ 2023-08-30 9:54 ` Laszlo Ersek
0 siblings, 0 replies; 58+ messages in thread
From: Laszlo Ersek @ 2023-08-30 9:54 UTC (permalink / raw)
To: Stefano Garzarella
Cc: qemu-devel, Michael S. Tsirkin, Eugenio Perez Martin,
German Maglione, Liu Jiang, Sergio Lopez Pascual
On 8/30/23 11:14, Laszlo Ersek wrote:
> On 8/30/23 10:31, Stefano Garzarella wrote:
>> On Sun, Aug 27, 2023 at 08:29:33PM +0200, Laszlo Ersek wrote:
>>> The tails of the "vhost_user_set_vring_addr" and "vhost_user_set_u64"
>>> functions are now byte-for-byte identical. Factor the common tail out
>>> to a
>>> new function called "vhost_user_write_msg".
>>>
>>> This is purely refactoring -- no observable change.
>>>
>>> Cc: "Michael S. Tsirkin" <mst@redhat.com> (supporter:vhost)
>>> Cc: Eugenio Perez Martin <eperezma@redhat.com>
>>> Cc: German Maglione <gmaglione@redhat.com>
>>> Cc: Liu Jiang <gerry@linux.alibaba.com>
>>> Cc: Sergio Lopez Pascual <slp@redhat.com>
>>> Cc: Stefano Garzarella <sgarzare@redhat.com>
>>> Signed-off-by: Laszlo Ersek <lersek@redhat.com>
>>> ---
>>> hw/virtio/vhost-user.c | 66 +++++++++-----------
>>> 1 file changed, 28 insertions(+), 38 deletions(-)
>>>
>>> diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
>>> index 64eac317bfb2..36f99b66a644 100644
>>> --- a/hw/virtio/vhost-user.c
>>> +++ b/hw/virtio/vhost-user.c
>>> @@ -1320,10 +1320,35 @@ static int enforce_reply(struct vhost_dev *dev,
>>> return vhost_user_get_features(dev, &dummy);
>>> }
>>>
>>> +/* Note: "msg->hdr.flags" may be modified. */
>>> +static int vhost_user_write_msg(struct vhost_dev *dev, VhostUserMsg *msg,
>>> + bool wait_for_reply)
>>
>> The difference between vhost_user_write() and vhost_user_write_msg() is
>> not immediately obvious from the function name, so I would propose
>> something different, like vhost_user_write_sync() or
>> vhost_user_write_wait().
>
> I'm mostly OK with either variant; I think I may have thought of _sync
> myself, but didn't like it because the wait would be *optional*,
> dependent on caller choice. And I didn't like
> vhost_user_write_maybe_wait() either; that one seemed awkward / too verbose.
>
> Let's see what others prefer. :)
... I went with vhost_user_write_sync.
>
>>
>> Anyway, I'm not good with names and don't have a strong opinion, so this
>> version is fine with me as well :-)
>>
>> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
>>
>
> Thanks!
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH 7/7] vhost-user: call VHOST_USER_SET_VRING_ENABLE synchronously
2023-08-27 18:29 ` [PATCH 7/7] vhost-user: call VHOST_USER_SET_VRING_ENABLE synchronously Laszlo Ersek
2023-08-30 8:39 ` Stefano Garzarella
2023-08-30 8:41 ` Laszlo Ersek
@ 2023-08-30 12:10 ` Stefan Hajnoczi
2023-08-30 13:30 ` Laszlo Ersek
2023-10-03 14:41 ` Michael S. Tsirkin
3 siblings, 1 reply; 58+ messages in thread
From: Stefan Hajnoczi @ 2023-08-30 12:10 UTC (permalink / raw)
To: Laszlo Ersek
Cc: qemu-devel, Michael S. Tsirkin, Eugenio Perez Martin,
German Maglione, Liu Jiang, Sergio Lopez Pascual,
Stefano Garzarella
On Sun, 27 Aug 2023 at 14:31, Laszlo Ersek <lersek@redhat.com> wrote:
>
> (1) The virtio-1.0 specification
> <http://docs.oasis-open.org/virtio/virtio/v1.0/virtio-v1.0.html> writes:
>
> > 3 General Initialization And Device Operation
> > 3.1 Device Initialization
> > 3.1.1 Driver Requirements: Device Initialization
> >
> > [...]
> >
> > 7. Perform device-specific setup, including discovery of virtqueues for
> > the device, optional per-bus setup, reading and possibly writing the
> > device’s virtio configuration space, and population of virtqueues.
> >
> > 8. Set the DRIVER_OK status bit. At this point the device is “live”.
>
> and
>
> > 4 Virtio Transport Options
> > 4.1 Virtio Over PCI Bus
> > 4.1.4 Virtio Structure PCI Capabilities
> > 4.1.4.3 Common configuration structure layout
> > 4.1.4.3.2 Driver Requirements: Common configuration structure layout
> >
> > [...]
> >
> > The driver MUST configure the other virtqueue fields before enabling the
> > virtqueue with queue_enable.
> >
> > [...]
>
> These together mean that the following sub-sequence of steps is valid for
> a virtio-1.0 guest driver:
>
> (1.1) set "queue_enable" for the needed queues as the final part of device
> initialization step (7),
>
> (1.2) set DRIVER_OK in step (8),
>
> (1.3) immediately start sending virtio requests to the device.
>
> (2) When vhost-user is enabled, and the VHOST_USER_F_PROTOCOL_FEATURES
> special virtio feature is negotiated, then virtio rings start in disabled
> state, according to
> <https://qemu-project.gitlab.io/qemu/interop/vhost-user.html#ring-states>.
> In this case, explicit VHOST_USER_SET_VRING_ENABLE messages are needed for
> enabling vrings.
>
> Therefore setting "queue_enable" from the guest (1.1) is a *control plane*
> operation, which travels from the guest through QEMU to the vhost-user
> backend, using a unix domain socket.
>
> Whereas sending a virtio request (1.3) is a *data plane* operation, which
> evades QEMU -- it travels from guest to the vhost-user backend via
> eventfd.
>
> This means that steps (1.1) and (1.3) travel through different channels,
> and their relative order can be reversed, as perceived by the vhost-user
> backend.
>
> That's exactly what happens when OVMF's virtiofs driver (VirtioFsDxe) runs
> against the Rust-language virtiofsd version 1.7.2. (Which uses version
> 0.10.1 of the vhost-user-backend crate, and version 0.8.1 of the vhost
> crate.)
>
> Namely, when VirtioFsDxe binds a virtiofs device, it goes through the
> device initialization steps (i.e., control plane operations), and
> immediately sends a FUSE_INIT request too (i.e., performs a data plane
> operation). In the Rust-language virtiofsd, this creates a race between
> two components that run *concurrently*, i.e., in different threads or
> processes:
>
> - Control plane, handling vhost-user protocol messages:
>
> The "VhostUserSlaveReqHandlerMut::set_vring_enable" method
> [crates/vhost-user-backend/src/handler.rs] handles
> VHOST_USER_SET_VRING_ENABLE messages, and updates each vring's "enabled"
> flag according to the message processed.
>
> - Data plane, handling virtio / FUSE requests:
>
> The "VringEpollHandler::handle_event" method
> [crates/vhost-user-backend/src/event_loop.rs] handles the incoming
> virtio / FUSE request, consuming the virtio kick at the same time. If
> the vring's "enabled" flag is set, the virtio / FUSE request is
> processed genuinely. If the vring's "enabled" flag is clear, then the
> virtio / FUSE request is discarded.
Why is virtiofsd monitoring the virtqueue and discarding requests
while it's disabled? This seems like a bug in the vhost-user backend
to me.
When the virtqueue is disabled, don't monitor the kickfd.
When the virtqueue transitions from disabled to enabled, the control
plane should self-trigger the kickfd so that any available buffers
will be processed.
QEMU uses this scheme to switch between vhost/IOThreads and built-in
virtqueue kick processing.
This approach is more robust than relying on buffers being enqueued after
the virtqueue is enabled.
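A minimal sketch of that scheme on the backend side (the struct and field
names below are made up; this is not actual virtiofsd or QEMU code):

#include <stdbool.h>
#include <stdint.h>
#include <unistd.h>

struct vring_ctx {
    int  kick_fd;   /* eventfd; the data plane only watches it while enabled */
    bool enabled;
};

static void set_vring_enable(struct vring_ctx *vq, bool enable)
{
    bool was_enabled = vq->enabled;

    vq->enabled = enable;

    if (enable && !was_enabled) {
        uint64_t one = 1;

        /* Self-trigger the kick: any buffers the guest queued while the
         * ring was disabled get picked up by the data plane right away. */
        (void)write(vq->kick_fd, &one, sizeof(one));
    }
    /* On disable, the data plane simply stops monitoring kick_fd. */
}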
Stefan
>
> Note that OVMF enables the queue *first*, and sends FUSE_INIT *second*.
> However, if the data plane processor in virtiofsd wins the race, then it
> sees the FUSE_INIT *before* the control plane processor took notice of
> VHOST_USER_SET_VRING_ENABLE and green-lit the queue for the data plane
> processor. Therefore the latter drops FUSE_INIT on the floor, and goes
> back to waiting for further virtio / FUSE requests with epoll_wait.
> Meanwhile OVMF is stuck waiting for the FUSE_INIT response -- a deadlock.
>
> The deadlock is not deterministic. OVMF hangs infrequently during first
> boot. However, OVMF hangs almost certainly during reboots from the UEFI
> shell.
>
> The race can be "reliably masked" by inserting a very small delay -- a
> single debug message -- at the top of "VringEpollHandler::handle_event",
> i.e., just before the data plane processor checks the "enabled" field of
> the vring. That delay suffices for the control plane processor to act upon
> VHOST_USER_SET_VRING_ENABLE.
>
> We can deterministically prevent the race in QEMU, by blocking OVMF inside
> step (1.1) -- i.e., in the write to the "queue_enable" register -- until
> VHOST_USER_SET_VRING_ENABLE actually *completes*. That way OVMF's VCPU
> cannot advance to the FUSE_INIT submission before virtiofsd's control
> plane processor takes notice of the queue being enabled.
>
> Wait for VHOST_USER_SET_VRING_ENABLE completion by:
>
> - setting the NEED_REPLY flag on VHOST_USER_SET_VRING_ENABLE, and waiting
> for the reply, if the VHOST_USER_PROTOCOL_F_REPLY_ACK vhost-user feature
> has been negotiated, or
>
> - performing a separate VHOST_USER_GET_FEATURES *exchange*, which requires
> a backend response regardless of VHOST_USER_PROTOCOL_F_REPLY_ACK.
>
> Cc: "Michael S. Tsirkin" <mst@redhat.com> (supporter:vhost)
> Cc: Eugenio Perez Martin <eperezma@redhat.com>
> Cc: German Maglione <gmaglione@redhat.com>
> Cc: Liu Jiang <gerry@linux.alibaba.com>
> Cc: Sergio Lopez Pascual <slp@redhat.com>
> Cc: Stefano Garzarella <sgarzare@redhat.com>
> Signed-off-by: Laszlo Ersek <lersek@redhat.com>
> ---
> hw/virtio/vhost-user.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
> index beb4b832245e..01e0ca90c538 100644
> --- a/hw/virtio/vhost-user.c
> +++ b/hw/virtio/vhost-user.c
> @@ -1235,7 +1235,7 @@ static int vhost_user_set_vring_enable(struct vhost_dev *dev, int enable)
> .num = enable,
> };
>
> - ret = vhost_set_vring(dev, VHOST_USER_SET_VRING_ENABLE, &state, false);
> + ret = vhost_set_vring(dev, VHOST_USER_SET_VRING_ENABLE, &state, true);
> if (ret < 0) {
> /*
> * Restoring the previous state is likely infeasible, as well as
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH 7/7] vhost-user: call VHOST_USER_SET_VRING_ENABLE synchronously
2023-08-30 12:10 ` Stefan Hajnoczi
@ 2023-08-30 13:30 ` Laszlo Ersek
2023-08-30 15:37 ` Stefan Hajnoczi
0 siblings, 1 reply; 58+ messages in thread
From: Laszlo Ersek @ 2023-08-30 13:30 UTC (permalink / raw)
To: Stefan Hajnoczi
Cc: qemu-devel, Michael S. Tsirkin, Eugenio Perez Martin,
German Maglione, Liu Jiang, Sergio Lopez Pascual,
Stefano Garzarella
On 8/30/23 14:10, Stefan Hajnoczi wrote:
> On Sun, 27 Aug 2023 at 14:31, Laszlo Ersek <lersek@redhat.com> wrote:
>>
>> (1) The virtio-1.0 specification
>> <http://docs.oasis-open.org/virtio/virtio/v1.0/virtio-v1.0.html> writes:
>>
>>> 3 General Initialization And Device Operation
>>> 3.1 Device Initialization
>>> 3.1.1 Driver Requirements: Device Initialization
>>>
>>> [...]
>>>
>>> 7. Perform device-specific setup, including discovery of virtqueues for
>>> the device, optional per-bus setup, reading and possibly writing the
>>> device’s virtio configuration space, and population of virtqueues.
>>>
>>> 8. Set the DRIVER_OK status bit. At this point the device is “live”.
>>
>> and
>>
>>> 4 Virtio Transport Options
>>> 4.1 Virtio Over PCI Bus
>>> 4.1.4 Virtio Structure PCI Capabilities
>>> 4.1.4.3 Common configuration structure layout
>>> 4.1.4.3.2 Driver Requirements: Common configuration structure layout
>>>
>>> [...]
>>>
>>> The driver MUST configure the other virtqueue fields before enabling the
>>> virtqueue with queue_enable.
>>>
>>> [...]
>>
>> These together mean that the following sub-sequence of steps is valid for
>> a virtio-1.0 guest driver:
>>
>> (1.1) set "queue_enable" for the needed queues as the final part of device
>> initialization step (7),
>>
>> (1.2) set DRIVER_OK in step (8),
>>
>> (1.3) immediately start sending virtio requests to the device.
>>
>> (2) When vhost-user is enabled, and the VHOST_USER_F_PROTOCOL_FEATURES
>> special virtio feature is negotiated, then virtio rings start in disabled
>> state, according to
>> <https://qemu-project.gitlab.io/qemu/interop/vhost-user.html#ring-states>.
>> In this case, explicit VHOST_USER_SET_VRING_ENABLE messages are needed for
>> enabling vrings.
>>
>> Therefore setting "queue_enable" from the guest (1.1) is a *control plane*
>> operation, which travels from the guest through QEMU to the vhost-user
>> backend, using a unix domain socket.
>>
>> Whereas sending a virtio request (1.3) is a *data plane* operation, which
>> evades QEMU -- it travels from guest to the vhost-user backend via
>> eventfd.
>>
>> This means that steps (1.1) and (1.3) travel through different channels,
>> and their relative order can be reversed, as perceived by the vhost-user
>> backend.
>>
>> That's exactly what happens when OVMF's virtiofs driver (VirtioFsDxe) runs
>> against the Rust-language virtiofsd version 1.7.2. (Which uses version
>> 0.10.1 of the vhost-user-backend crate, and version 0.8.1 of the vhost
>> crate.)
>>
>> Namely, when VirtioFsDxe binds a virtiofs device, it goes through the
>> device initialization steps (i.e., control plane operations), and
>> immediately sends a FUSE_INIT request too (i.e., performs a data plane
>> operation). In the Rust-language virtiofsd, this creates a race between
>> two components that run *concurrently*, i.e., in different threads or
>> processes:
>>
>> - Control plane, handling vhost-user protocol messages:
>>
>> The "VhostUserSlaveReqHandlerMut::set_vring_enable" method
>> [crates/vhost-user-backend/src/handler.rs] handles
>> VHOST_USER_SET_VRING_ENABLE messages, and updates each vring's "enabled"
>> flag according to the message processed.
>>
>> - Data plane, handling virtio / FUSE requests:
>>
>> The "VringEpollHandler::handle_event" method
>> [crates/vhost-user-backend/src/event_loop.rs] handles the incoming
>> virtio / FUSE request, consuming the virtio kick at the same time. If
>> the vring's "enabled" flag is set, the virtio / FUSE request is
>> processed genuinely. If the vring's "enabled" flag is clear, then the
>> virtio / FUSE request is discarded.
>
> Why is virtiofsd monitoring the virtqueue and discarding requests
> while it's disabled?
That's what the vhost-user spec requires:
https://qemu-project.gitlab.io/qemu/interop/vhost-user.html#ring-states
"""
started but disabled: the back-end must process the ring without causing
any side effects. For example, for a networking device, in the disabled
state the back-end must not supply any new RX packets, but must process
and discard any TX packets.
"""
This state is different from "stopped", where "the back-end must not
process the ring at all".
The spec also says,
"""
If VHOST_USER_F_PROTOCOL_FEATURES has been negotiated, the ring is
initialized in a disabled state and is enabled by
VHOST_USER_SET_VRING_ENABLE with parameter 1.
"""
AFAICT virtiofsd follows this requirement.
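(Spelled out as code, purely for illustration -- made-up types, not virtiofsd
or libvhost-user code -- those two requirements amount to:)

#include <stdio.h>

enum ring_state {
    RING_STOPPED,           /* must not process the ring at all           */
    RING_STARTED_DISABLED,  /* process without side effects, i.e. discard */
    RING_STARTED_ENABLED,   /* process normally                           */
};

static void on_kick(enum ring_state state, int request_id)
{
    switch (state) {
    case RING_STOPPED:
        break;
    case RING_STARTED_DISABLED:
        printf("request %d consumed and discarded\n", request_id);
        break;
    case RING_STARTED_ENABLED:
        printf("request %d handled\n", request_id);
        break;
    }
}

int main(void)
{
    on_kick(RING_STARTED_DISABLED, 1); /* FUSE_INIT when the data plane wins */
    on_kick(RING_STARTED_ENABLED, 2);  /* after SET_VRING_ENABLE is handled  */
    return 0;
}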
> This seems like a bug in the vhost-user backend to me.
I didn't want to exclude that possiblity; that's why I included Eugenio,
German, Liu Jiang, and Sergio in the CC list.
>
> When the virtqueue is disabled, don't monitor the kickfd.
>
> When the virtqueue transitions from disabled to enabled, the control
> plane should self-trigger the kickfd so that any available buffers
> will be processed.
>
> QEMU uses this scheme to switch between vhost/IOThreads and built-in
> virtqueue kick processing.
>
> This approach is more robust than relying on buffers being enqueued after
> the virtqueue is enabled.
I'm happy to drop the series if the virtiofsd maintainers agree that the
bug is in virtiofsd, and can propose a design to fix it. (I do think
that such a fix would require an architectural change.)
FWIW, my own interpretation of the vhost-user spec (see above) was that
virtiofsd was right to behave the way it did, and that there was simply
no way to prevent out-of-order delivery other than synchronizing the
guest end-to-end with the vhost-user backend, concerning
VHOST_USER_SET_VRING_ENABLE.
This end-to-end synchronization is present "naturally" in vhost-net,
where ioctl()s are automatically synchronous -- in fact *all* operations
on the control plane are synchronous. (Which is just a different way to
say that the guest is tightly coupled with the control plane.)
Note that there has been at least one race like this before; see commit
699f2e535d93 ("vhost: make SET_VRING_ADDR, SET_FEATURES send replies",
2021-09-04). Basically every pre-existent call to enforce_reply() is a
cover-up for the vhost-user spec turning (somewhat recklessly?) most
operations into async ones.
At some point this became apparent and so the REPLY_ACK flag was
introduced; see commit ca525ce5618b ("vhost-user: Introduce a new
protocol feature REPLY_ACK.", 2016-08-10). (That commit doesn't go into
details, but I'm pretty sure there was a similar race around SET_MEM_TABLE!)
BTW even if we drop this series for QEMU, I don't think it will have
been in vain. The first few patches are cleanups which could be merged
for their own sake. And the last patch is essentially the proof of the
problem statement / analysis. It can be considered an elaborate bug
report for virtiofsd, *if* we decide the bug is in virtiofsd. I did have
that avenue in mind as well, when writing the commit message / patch.
For now I'm going to post v2 -- that's not to say that I'm dismissing
your feedback (see above!), just want to get the latest version on-list.
Thanks!
Laszlo
>
> Stefan
>
>>
>> Note that OVMF enables the queue *first*, and sends FUSE_INIT *second*.
>> However, if the data plane processor in virtiofsd wins the race, then it
>> sees the FUSE_INIT *before* the control plane processor took notice of
>> VHOST_USER_SET_VRING_ENABLE and green-lit the queue for the data plane
>> processor. Therefore the latter drops FUSE_INIT on the floor, and goes
>> back to waiting for further virtio / FUSE requests with epoll_wait.
>> Meanwhile OVMF is stuck waiting for the FUSE_INIT response -- a deadlock.
>>
>> The deadlock is not deterministic. OVMF hangs infrequently during first
>> boot. However, OVMF hangs almost certainly during reboots from the UEFI
>> shell.
>>
>> The race can be "reliably masked" by inserting a very small delay -- a
>> single debug message -- at the top of "VringEpollHandler::handle_event",
>> i.e., just before the data plane processor checks the "enabled" field of
>> the vring. That delay suffices for the control plane processor to act upon
>> VHOST_USER_SET_VRING_ENABLE.
>>
>> We can deterministically prevent the race in QEMU, by blocking OVMF inside
>> step (1.1) -- i.e., in the write to the "queue_enable" register -- until
>> VHOST_USER_SET_VRING_ENABLE actually *completes*. That way OVMF's VCPU
>> cannot advance to the FUSE_INIT submission before virtiofsd's control
>> plane processor takes notice of the queue being enabled.
>>
>> Wait for VHOST_USER_SET_VRING_ENABLE completion by:
>>
>> - setting the NEED_REPLY flag on VHOST_USER_SET_VRING_ENABLE, and waiting
>> for the reply, if the VHOST_USER_PROTOCOL_F_REPLY_ACK vhost-user feature
>> has been negotiated, or
>>
>> - performing a separate VHOST_USER_GET_FEATURES *exchange*, which requires
>> a backend response regardless of VHOST_USER_PROTOCOL_F_REPLY_ACK.
>>
>> Cc: "Michael S. Tsirkin" <mst@redhat.com> (supporter:vhost)
>> Cc: Eugenio Perez Martin <eperezma@redhat.com>
>> Cc: German Maglione <gmaglione@redhat.com>
>> Cc: Liu Jiang <gerry@linux.alibaba.com>
>> Cc: Sergio Lopez Pascual <slp@redhat.com>
>> Cc: Stefano Garzarella <sgarzare@redhat.com>
>> Signed-off-by: Laszlo Ersek <lersek@redhat.com>
>> ---
>> hw/virtio/vhost-user.c | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
>> index beb4b832245e..01e0ca90c538 100644
>> --- a/hw/virtio/vhost-user.c
>> +++ b/hw/virtio/vhost-user.c
>> @@ -1235,7 +1235,7 @@ static int vhost_user_set_vring_enable(struct vhost_dev *dev, int enable)
>> .num = enable,
>> };
>>
>> - ret = vhost_set_vring(dev, VHOST_USER_SET_VRING_ENABLE, &state, false);
>> + ret = vhost_set_vring(dev, VHOST_USER_SET_VRING_ENABLE, &state, true);
>> if (ret < 0) {
>> /*
>> * Restoring the previous state is likely infeasible, as well as
>
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH 7/7] vhost-user: call VHOST_USER_SET_VRING_ENABLE synchronously
2023-08-30 9:26 ` Laszlo Ersek
@ 2023-08-30 14:24 ` Stefano Garzarella
0 siblings, 0 replies; 58+ messages in thread
From: Stefano Garzarella @ 2023-08-30 14:24 UTC (permalink / raw)
To: Laszlo Ersek
Cc: qemu-devel, Michael S. Tsirkin, Eugenio Perez Martin,
German Maglione, Liu Jiang, Sergio Lopez Pascual
On Wed, Aug 30, 2023 at 11:26:41AM +0200, Laszlo Ersek wrote:
>On 8/30/23 10:39, Stefano Garzarella wrote:
>> On Sun, Aug 27, 2023 at 08:29:37PM +0200, Laszlo Ersek wrote:
>>> (1) The virtio-1.0 specification
>>> <http://docs.oasis-open.org/virtio/virtio/v1.0/virtio-v1.0.html> writes:
>>
>> What about referring the latest spec available now (1.2)?
>
>I didn't want to do that because the OVMF guest driver was written
>against 1.0 (and the spec and the device are backwards compatible).
>
>But, I don't feel strongly about this; I'm OK updating the reference /
>quote to 1.2.
>
>>
>>>
>>>> 3 General Initialization And Device Operation
>>>> 3.1 Device Initialization
>>>> 3.1.1 Driver Requirements: Device Initialization
>>>>
>>>> [...]
>>>>
>>>> 7. Perform device-specific setup, including discovery of virtqueues for
>>>> the device, optional per-bus setup, reading and possibly writing the
>>>> device’s virtio configuration space, and population of virtqueues.
>>>>
>>>> 8. Set the DRIVER_OK status bit. At this point the device is “live”.
>>>
>>> and
>>>
>>>> 4 Virtio Transport Options
>>>> 4.1 Virtio Over PCI Bus
>>>> 4.1.4 Virtio Structure PCI Capabilities
>>>> 4.1.4.3 Common configuration structure layout
>>>> 4.1.4.3.2 Driver Requirements: Common configuration structure layout
>>>>
>>>> [...]
>>>>
>>>> The driver MUST configure the other virtqueue fields before enabling the
>>>> virtqueue with queue_enable.
>>>>
>>>> [...]
>>>
>>> These together mean that the following sub-sequence of steps is valid for
>>> a virtio-1.0 guest driver:
>>>
>>> (1.1) set "queue_enable" for the needed queues as the final part of
>>> device
>>> initialization step (7),
>>>
>>> (1.2) set DRIVER_OK in step (8),
>>>
>>> (1.3) immediately start sending virtio requests to the device.
>>>
>>> (2) When vhost-user is enabled, and the VHOST_USER_F_PROTOCOL_FEATURES
>>> special virtio feature is negotiated, then virtio rings start in disabled
>>> state, according to
>>> <https://qemu-project.gitlab.io/qemu/interop/vhost-user.html#ring-states>.
>>> In this case, explicit VHOST_USER_SET_VRING_ENABLE messages are needed
>>> for
>>> enabling vrings.
>>>
>>> Therefore setting "queue_enable" from the guest (1.1) is a *control
>>> plane*
>>> operation, which travels from the guest through QEMU to the vhost-user
>>> backend, using a unix domain socket.
>>>
>>> Whereas sending a virtio request (1.3) is a *data plane* operation, which
>>> evades QEMU -- it travels from guest to the vhost-user backend via
>>> eventfd.
>>>
>>> This means that steps (1.1) and (1.3) travel through different channels,
>>> and their relative order can be reversed, as perceived by the vhost-user
>>> backend.
>>>
>>> That's exactly what happens when OVMF's virtiofs driver (VirtioFsDxe)
>>> runs
>>> against the Rust-language virtiofsd version 1.7.2. (Which uses version
>>> 0.10.1 of the vhost-user-backend crate, and version 0.8.1 of the vhost
>>> crate.)
>>>
>>> Namely, when VirtioFsDxe binds a virtiofs device, it goes through the
>>> device initialization steps (i.e., control plane operations), and
>>> immediately sends a FUSE_INIT request too (i.e., performs a data plane
>>> operation). In the Rust-language virtiofsd, this creates a race between
>>> two components that run *concurrently*, i.e., in different threads or
>>> processes:
>>>
>>> - Control plane, handling vhost-user protocol messages:
>>>
>>> The "VhostUserSlaveReqHandlerMut::set_vring_enable" method
>>> [crates/vhost-user-backend/src/handler.rs] handles
>>> VHOST_USER_SET_VRING_ENABLE messages, and updates each vring's "enabled"
>>> flag according to the message processed.
>>>
>>> - Data plane, handling virtio / FUSE requests:
>>>
>>> The "VringEpollHandler::handle_event" method
>>> [crates/vhost-user-backend/src/event_loop.rs] handles the incoming
>>> virtio / FUSE request, consuming the virtio kick at the same time. If
>>> the vring's "enabled" flag is set, the virtio / FUSE request is
>>> processed genuinely. If the vring's "enabled" flag is clear, then the
>>> virtio / FUSE request is discarded.
>>>
>>> Note that OVMF enables the queue *first*, and sends FUSE_INIT *second*.
>>> However, if the data plane processor in virtiofsd wins the race, then it
>>> sees the FUSE_INIT *before* the control plane processor took notice of
>>> VHOST_USER_SET_VRING_ENABLE and green-lit the queue for the data plane
>>> processor. Therefore the latter drops FUSE_INIT on the floor, and goes
>>> back to waiting for further virtio / FUSE requests with epoll_wait.
>>> Meanwhile OVMF is stuck waiting for the FUSE_INIT response -- a
>>> deadlock.
>>>
>>> The deadlock is not deterministic. OVMF hangs infrequently during first
>>> boot. However, OVMF hangs almost certainly during reboots from the UEFI
>>> shell.
>>>
>>> The race can be "reliably masked" by inserting a very small delay -- a
>>> single debug message -- at the top of "VringEpollHandler::handle_event",
>>> i.e., just before the data plane processor checks the "enabled" field of
>>> the vring. That delay suffices for the control plane processor to act
>>> upon
>>> VHOST_USER_SET_VRING_ENABLE.
>>>
>>> We can deterministically prevent the race in QEMU, by blocking OVMF
>>> inside
>>> step (1.1) -- i.e., in the write to the "queue_enable" register -- until
>>> VHOST_USER_SET_VRING_ENABLE actually *completes*. That way OVMF's VCPU
>>> cannot advance to the FUSE_INIT submission before virtiofsd's control
>>> plane processor takes notice of the queue being enabled.
>>>
>>> Wait for VHOST_USER_SET_VRING_ENABLE completion by:
>>>
>>> - setting the NEED_REPLY flag on VHOST_USER_SET_VRING_ENABLE, and waiting
>>> for the reply, if the VHOST_USER_PROTOCOL_F_REPLY_ACK vhost-user feature
>>> has been negotiated, or
>>>
>>> - performing a separate VHOST_USER_GET_FEATURES *exchange*, which
>>> requires
>>> a backend response regardless of VHOST_USER_PROTOCOL_F_REPLY_ACK.
>>
>> Thanks for the excellent analysis (and fix of course!).
>>
>>>
>>> Cc: "Michael S. Tsirkin" <mst@redhat.com> (supporter:vhost)
>>> Cc: Eugenio Perez Martin <eperezma@redhat.com>
>>> Cc: German Maglione <gmaglione@redhat.com>
>>> Cc: Liu Jiang <gerry@linux.alibaba.com>
>>> Cc: Sergio Lopez Pascual <slp@redhat.com>
>>> Cc: Stefano Garzarella <sgarzare@redhat.com>
>>> Signed-off-by: Laszlo Ersek <lersek@redhat.com>
>>> ---
>>> hw/virtio/vhost-user.c | 2 +-
>>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
>>> index beb4b832245e..01e0ca90c538 100644
>>> --- a/hw/virtio/vhost-user.c
>>> +++ b/hw/virtio/vhost-user.c
>>> @@ -1235,7 +1235,7 @@ static int vhost_user_set_vring_enable(struct
>>> vhost_dev *dev, int enable)
>>> .num = enable,
>>> };
>>>
>>
>> How about adding a small comment here summarizing the commit message in
>> a few lines?
>
>Right, I can do that!
>
>>
>> Should we cc stable for this fix?
>
>Hm, that didn't occur to me.
>
>AFAICT, the issue goes back to the introduction of
>VHOST_USER_SET_VRING_ENABLE, in commit 7263a0ad7899 ("vhost-user: add a
>new message to disable/enable a specific virt queue.", 2015-09-24) --
>part of release v2.5.0.
>
>What are the "live" stable branches at this time?
>
>Applying the series on top of v8.1.0 shouldn't be hard, as
>"hw/virtio/vhost-user.c" is identical between v8.1.0 and 50e7a40af372 (=
>the base commit of this series).
>
>Applying the series on top of v8.0.0 looks messier; the file had seen
>significant changes between 8.0 and 8.1. I'd rather not attempt the
>backport (a bunch of refactorings, etc.) to 8.0.
>
>If I just CC stable, what stable branch is going to be targeted?
From https://www.qemu.org/docs/master/devel/stable-process.html
the target stable branch should be v8.1.x
But since it is not a regression from 8.0, I agree with you about not CCing
it.
Thanks,
Stefano
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH 5/7] vhost-user: hoist "write_msg", "get_features", "get_u64"
2023-08-27 18:29 ` [PATCH 5/7] vhost-user: hoist "write_msg", "get_features", "get_u64" Laszlo Ersek
2023-08-30 8:32 ` Stefano Garzarella
@ 2023-08-30 15:04 ` Philippe Mathieu-Daudé
1 sibling, 0 replies; 58+ messages in thread
From: Philippe Mathieu-Daudé @ 2023-08-30 15:04 UTC (permalink / raw)
To: Laszlo Ersek, qemu-devel
Cc: Michael S. Tsirkin, Eugenio Perez Martin, German Maglione,
Liu Jiang, Sergio Lopez Pascual, Stefano Garzarella
On 27/8/23 20:29, Laszlo Ersek wrote:
> In order to avoid a forward-declaration for "vhost_user_write_msg" in a
> subsequent patch, hoist "vhost_user_write_msg" ->
> "vhost_user_get_features" -> "vhost_user_get_u64" just above
> "vhost_set_vring".
>
> This is purely code movement -- no observable change.
>
> Cc: "Michael S. Tsirkin" <mst@redhat.com> (supporter:vhost)
> Cc: Eugenio Perez Martin <eperezma@redhat.com>
> Cc: German Maglione <gmaglione@redhat.com>
> Cc: Liu Jiang <gerry@linux.alibaba.com>
> Cc: Sergio Lopez Pascual <slp@redhat.com>
> Cc: Stefano Garzarella <sgarzare@redhat.com>
> Signed-off-by: Laszlo Ersek <lersek@redhat.com>
> ---
> hw/virtio/vhost-user.c | 170 ++++++++++----------
> 1 file changed, 85 insertions(+), 85 deletions(-)
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH 2/7] vhost-user: tighten "reply_supported" scope in "set_vring_addr"
2023-08-27 18:29 ` [PATCH 2/7] vhost-user: tighten "reply_supported" scope in "set_vring_addr" Laszlo Ersek
2023-08-30 8:27 ` Stefano Garzarella
@ 2023-08-30 15:04 ` Philippe Mathieu-Daudé
1 sibling, 0 replies; 58+ messages in thread
From: Philippe Mathieu-Daudé @ 2023-08-30 15:04 UTC (permalink / raw)
To: Laszlo Ersek, qemu-devel
Cc: Michael S. Tsirkin, Eugenio Perez Martin, German Maglione,
Liu Jiang, Sergio Lopez Pascual, Stefano Garzarella
On 27/8/23 20:29, Laszlo Ersek wrote:
> In the vhost_user_set_vring_addr() function, we calculate
> "reply_supported" unconditionally, even though we'll only need it if
> "wait_for_reply" is also true.
>
> Restrict the scope of "reply_supported" to the minimum.
>
> This is purely refactoring -- no observable change.
>
> Cc: "Michael S. Tsirkin" <mst@redhat.com> (supporter:vhost)
> Cc: Eugenio Perez Martin <eperezma@redhat.com>
> Cc: German Maglione <gmaglione@redhat.com>
> Cc: Liu Jiang <gerry@linux.alibaba.com>
> Cc: Sergio Lopez Pascual <slp@redhat.com>
> Cc: Stefano Garzarella <sgarzare@redhat.com>
> Signed-off-by: Laszlo Ersek <lersek@redhat.com>
> ---
> hw/virtio/vhost-user.c | 11 ++++++-----
> 1 file changed, 6 insertions(+), 5 deletions(-)
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH 1/7] vhost-user: strip superfluous whitespace
2023-08-27 18:29 ` [PATCH 1/7] vhost-user: strip superfluous whitespace Laszlo Ersek
2023-08-30 8:26 ` Stefano Garzarella
@ 2023-08-30 15:04 ` Philippe Mathieu-Daudé
1 sibling, 0 replies; 58+ messages in thread
From: Philippe Mathieu-Daudé @ 2023-08-30 15:04 UTC (permalink / raw)
To: Laszlo Ersek, qemu-devel
Cc: Michael S. Tsirkin, Eugenio Perez Martin, German Maglione,
Liu Jiang, Sergio Lopez Pascual, Stefano Garzarella
On 27/8/23 20:29, Laszlo Ersek wrote:
> Cc: "Michael S. Tsirkin" <mst@redhat.com> (supporter:vhost)
> Cc: Eugenio Perez Martin <eperezma@redhat.com>
> Cc: German Maglione <gmaglione@redhat.com>
> Cc: Liu Jiang <gerry@linux.alibaba.com>
> Cc: Sergio Lopez Pascual <slp@redhat.com>
> Cc: Stefano Garzarella <sgarzare@redhat.com>
> Signed-off-by: Laszlo Ersek <lersek@redhat.com>
> ---
> hw/virtio/vhost-user.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH 7/7] vhost-user: call VHOST_USER_SET_VRING_ENABLE synchronously
2023-08-30 13:30 ` Laszlo Ersek
@ 2023-08-30 15:37 ` Stefan Hajnoczi
2023-09-05 6:30 ` Laszlo Ersek
2023-10-02 6:49 ` Michael S. Tsirkin
0 siblings, 2 replies; 58+ messages in thread
From: Stefan Hajnoczi @ 2023-08-30 15:37 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: qemu-devel, Laszlo Ersek, Eugenio Perez Martin, German Maglione,
Liu Jiang, Sergio Lopez Pascual, Stefano Garzarella
On Wed, 30 Aug 2023 at 09:30, Laszlo Ersek <lersek@redhat.com> wrote:
>
> On 8/30/23 14:10, Stefan Hajnoczi wrote:
> > On Sun, 27 Aug 2023 at 14:31, Laszlo Ersek <lersek@redhat.com> wrote:
> >>
> >> (1) The virtio-1.0 specification
> >> <http://docs.oasis-open.org/virtio/virtio/v1.0/virtio-v1.0.html> writes:
> >>
> >>> 3 General Initialization And Device Operation
> >>> 3.1 Device Initialization
> >>> 3.1.1 Driver Requirements: Device Initialization
> >>>
> >>> [...]
> >>>
> >>> 7. Perform device-specific setup, including discovery of virtqueues for
> >>> the device, optional per-bus setup, reading and possibly writing the
> >>> device’s virtio configuration space, and population of virtqueues.
> >>>
> >>> 8. Set the DRIVER_OK status bit. At this point the device is “live”.
> >>
> >> and
> >>
> >>> 4 Virtio Transport Options
> >>> 4.1 Virtio Over PCI Bus
> >>> 4.1.4 Virtio Structure PCI Capabilities
> >>> 4.1.4.3 Common configuration structure layout
> >>> 4.1.4.3.2 Driver Requirements: Common configuration structure layout
> >>>
> >>> [...]
> >>>
> >>> The driver MUST configure the other virtqueue fields before enabling the
> >>> virtqueue with queue_enable.
> >>>
> >>> [...]
> >>
> >> These together mean that the following sub-sequence of steps is valid for
> >> a virtio-1.0 guest driver:
> >>
> >> (1.1) set "queue_enable" for the needed queues as the final part of device
> >> initialization step (7),
> >>
> >> (1.2) set DRIVER_OK in step (8),
> >>
> >> (1.3) immediately start sending virtio requests to the device.
> >>
> >> (2) When vhost-user is enabled, and the VHOST_USER_F_PROTOCOL_FEATURES
> >> special virtio feature is negotiated, then virtio rings start in disabled
> >> state, according to
> >> <https://qemu-project.gitlab.io/qemu/interop/vhost-user.html#ring-states>.
> >> In this case, explicit VHOST_USER_SET_VRING_ENABLE messages are needed for
> >> enabling vrings.
> >>
> >> Therefore setting "queue_enable" from the guest (1.1) is a *control plane*
> >> operation, which travels from the guest through QEMU to the vhost-user
> >> backend, using a unix domain socket.
> >>
> >> Whereas sending a virtio request (1.3) is a *data plane* operation, which
> >> evades QEMU -- it travels from guest to the vhost-user backend via
> >> eventfd.
> >>
> >> This means that steps (1.1) and (1.3) travel through different channels,
> >> and their relative order can be reversed, as perceived by the vhost-user
> >> backend.
> >>
> >> That's exactly what happens when OVMF's virtiofs driver (VirtioFsDxe) runs
> >> against the Rust-language virtiofsd version 1.7.2. (Which uses version
> >> 0.10.1 of the vhost-user-backend crate, and version 0.8.1 of the vhost
> >> crate.)
> >>
> >> Namely, when VirtioFsDxe binds a virtiofs device, it goes through the
> >> device initialization steps (i.e., control plane operations), and
> >> immediately sends a FUSE_INIT request too (i.e., performs a data plane
> >> operation). In the Rust-language virtiofsd, this creates a race between
> >> two components that run *concurrently*, i.e., in different threads or
> >> processes:
> >>
> >> - Control plane, handling vhost-user protocol messages:
> >>
> >> The "VhostUserSlaveReqHandlerMut::set_vring_enable" method
> >> [crates/vhost-user-backend/src/handler.rs] handles
> >> VHOST_USER_SET_VRING_ENABLE messages, and updates each vring's "enabled"
> >> flag according to the message processed.
> >>
> >> - Data plane, handling virtio / FUSE requests:
> >>
> >> The "VringEpollHandler::handle_event" method
> >> [crates/vhost-user-backend/src/event_loop.rs] handles the incoming
> >> virtio / FUSE request, consuming the virtio kick at the same time. If
> >> the vring's "enabled" flag is set, the virtio / FUSE request is
> >> processed genuinely. If the vring's "enabled" flag is clear, then the
> >> virtio / FUSE request is discarded.
> >
> > Why is virtiofsd monitoring the virtqueue and discarding requests
> > while it's disabled?
>
> That's what the vhost-user spec requires:
>
> https://qemu-project.gitlab.io/qemu/interop/vhost-user.html#ring-states
>
> """
> started but disabled: the back-end must process the ring without causing
> any side effects. For example, for a networking device, in the disabled
> state the back-end must not supply any new RX packets, but must process
> and discard any TX packets.
> """
>
> This state is different from "stopped", where "the back-end must not
> process the ring at all".
>
> The spec also says,
>
> """
> If VHOST_USER_F_PROTOCOL_FEATURES has been negotiated, the ring is
> initialized in a disabled state and is enabled by
> VHOST_USER_SET_VRING_ENABLE with parameter 1.
> """
>
> AFAICT virtiofsd follows this requirement.
Hi Michael,
You documented the disabled ring state in QEMU commit
c61f09ed855b5009f816242ce281fd01586d4646 ("vhost-user: clarify start
and enable") where virtio-net devices discard tx buffers. The disabled
state seems to be specific to vhost-user and not covered in the VIRTIO
specification.
Do you remember what the purpose of the disabled state was? Why is it
necessary to discard tx buffers instead of postponing ring processing
until the virtqueue is enabled?
My concern is that the semantics are unclear for virtqueue types that
are different from virtio-net rx/tx. Even the virtio-net controlq
would be problematic - should buffers be silently discarded with
VIRTIO_NET_OK or should they fail?
Thanks,
Stefan
>
> > This seems like a bug in the vhost-user backend to me.
>
> I didn't want to exclude that possibility; that's why I included Eugenio,
> German, Liu Jiang, and Sergio in the CC list.
>
> >
> > When the virtqueue is disabled, don't monitor the kickfd.
> >
> > When the virtqueue transitions from disabled to enabled, the control
> > plane should self-trigger the kickfd so that any available buffers
> > will be processed.
> >
> > QEMU uses this scheme to switch between vhost/IOThreads and built-in
> > virtqueue kick processing.
> >
> > This approach is more robust than relying on buffers being enqueued after
> > the virtqueue is enabled.
>
> I'm happy to drop the series if the virtiofsd maintainers agree that the
> bug is in virtiofsd, and can propose a design to fix it. (I do think
> that such a fix would require an architectural change.)
>
> FWIW, my own interpretation of the vhost-user spec (see above) was that
> virtiofsd was right to behave the way it did, and that there was simply
> no way to prevent out-of-order delivery other than synchronizing the
> guest end-to-end with the vhost-user backend, concerning
> VHOST_USER_SET_VRING_ENABLE.
>
> This end-to-end synchronization is present "naturally" in vhost-net,
> where ioctl()s are automatically synchronous -- in fact *all* operations
> on the control plane are synchronous. (Which is just a different way to
> say that the guest is tightly coupled with the control plane.)
>
> Note that there has been at least one race like this before; see commit
> 699f2e535d93 ("vhost: make SET_VRING_ADDR, SET_FEATURES send replies",
> 2021-09-04). Basically every pre-existent call to enforce_reply() is a
> cover-up for the vhost-user spec turning (somewhat recklessly?) most
> operations into async ones.
>
> At some point this became apparent and so the REPLY_ACK flag was
> introduced; see commit ca525ce5618b ("vhost-user: Introduce a new
> protocol feature REPLY_ACK.", 2016-08-10). (That commit doesn't go into
> details, but I'm pretty sure there was a similar race around SET_MEM_TABLE!)
>
> BTW even if we drop this series for QEMU, I don't think it will have
> been in vain. The first few patches are cleanups which could be merged
> for their own sake. And the last patch is essentially the proof of the
> problem statement / analysis. It can be considered an elaborate bug
> report for virtiofsd, *if* we decide the bug is in virtiofsd. I did have
> that avenue in mind as well, when writing the commit message / patch.
>
> For now I'm going to post v2 -- that's not to say that I'm dismissing
> your feedback (see above!), just want to get the latest version on-list.
>
> Thanks!
> Laszlo
>
> >
> > Stefan
> >
> >>
> >> Note that OVMF enables the queue *first*, and sends FUSE_INIT *second*.
> >> However, if the data plane processor in virtiofsd wins the race, then it
> >> sees the FUSE_INIT *before* the control plane processor took notice of
> >> VHOST_USER_SET_VRING_ENABLE and green-lit the queue for the data plane
> >> processor. Therefore the latter drops FUSE_INIT on the floor, and goes
> >> back to waiting for further virtio / FUSE requests with epoll_wait.
> >> Meanwhile OVMF is stuck waiting for the FUSE_INIT response -- a deadlock.
> >>
> >> The deadlock is not deterministic. OVMF hangs infrequently during first
> >> boot. However, OVMF hangs almost certainly during reboots from the UEFI
> >> shell.
> >>
> >> The race can be "reliably masked" by inserting a very small delay -- a
> >> single debug message -- at the top of "VringEpollHandler::handle_event",
> >> i.e., just before the data plane processor checks the "enabled" field of
> >> the vring. That delay suffices for the control plane processor to act upon
> >> VHOST_USER_SET_VRING_ENABLE.
> >>
> >> We can deterministically prevent the race in QEMU, by blocking OVMF inside
> >> step (1.1) -- i.e., in the write to the "queue_enable" register -- until
> >> VHOST_USER_SET_VRING_ENABLE actually *completes*. That way OVMF's VCPU
> >> cannot advance to the FUSE_INIT submission before virtiofsd's control
> >> plane processor takes notice of the queue being enabled.
> >>
> >> Wait for VHOST_USER_SET_VRING_ENABLE completion by:
> >>
> >> - setting the NEED_REPLY flag on VHOST_USER_SET_VRING_ENABLE, and waiting
> >> for the reply, if the VHOST_USER_PROTOCOL_F_REPLY_ACK vhost-user feature
> >> has been negotiated, or
> >>
> >> - performing a separate VHOST_USER_GET_FEATURES *exchange*, which requires
> >> a backend response regardless of VHOST_USER_PROTOCOL_F_REPLY_ACK.
> >>
> >> Cc: "Michael S. Tsirkin" <mst@redhat.com> (supporter:vhost)
> >> Cc: Eugenio Perez Martin <eperezma@redhat.com>
> >> Cc: German Maglione <gmaglione@redhat.com>
> >> Cc: Liu Jiang <gerry@linux.alibaba.com>
> >> Cc: Sergio Lopez Pascual <slp@redhat.com>
> >> Cc: Stefano Garzarella <sgarzare@redhat.com>
> >> Signed-off-by: Laszlo Ersek <lersek@redhat.com>
> >> ---
> >> hw/virtio/vhost-user.c | 2 +-
> >> 1 file changed, 1 insertion(+), 1 deletion(-)
> >>
> >> diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
> >> index beb4b832245e..01e0ca90c538 100644
> >> --- a/hw/virtio/vhost-user.c
> >> +++ b/hw/virtio/vhost-user.c
> >> @@ -1235,7 +1235,7 @@ static int vhost_user_set_vring_enable(struct vhost_dev *dev, int enable)
> >> .num = enable,
> >> };
> >>
> >> - ret = vhost_set_vring(dev, VHOST_USER_SET_VRING_ENABLE, &state, false);
> >> + ret = vhost_set_vring(dev, VHOST_USER_SET_VRING_ENABLE, &state, true);
> >> if (ret < 0) {
> >> /*
> >> * Restoring the previous state is likely infeasible, as well as
> >
>
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH 7/7] vhost-user: call VHOST_USER_SET_VRING_ENABLE synchronously
2023-08-30 15:37 ` Stefan Hajnoczi
@ 2023-09-05 6:30 ` Laszlo Ersek
2023-09-25 15:31 ` Laszlo Ersek
2023-10-02 6:49 ` Michael S. Tsirkin
1 sibling, 1 reply; 58+ messages in thread
From: Laszlo Ersek @ 2023-09-05 6:30 UTC (permalink / raw)
To: Stefan Hajnoczi, Michael S. Tsirkin
Cc: qemu-devel, Eugenio Perez Martin, German Maglione, Liu Jiang,
Sergio Lopez Pascual, Stefano Garzarella
Michael,
On 8/30/23 17:37, Stefan Hajnoczi wrote:
> On Wed, 30 Aug 2023 at 09:30, Laszlo Ersek <lersek@redhat.com> wrote:
>>
>> On 8/30/23 14:10, Stefan Hajnoczi wrote:
>>> On Sun, 27 Aug 2023 at 14:31, Laszlo Ersek <lersek@redhat.com> wrote:
>>>>
>>>> (1) The virtio-1.0 specification
>>>> <http://docs.oasis-open.org/virtio/virtio/v1.0/virtio-v1.0.html> writes:
>>>>
>>>>> 3 General Initialization And Device Operation
>>>>> 3.1 Device Initialization
>>>>> 3.1.1 Driver Requirements: Device Initialization
>>>>>
>>>>> [...]
>>>>>
>>>>> 7. Perform device-specific setup, including discovery of virtqueues for
>>>>> the device, optional per-bus setup, reading and possibly writing the
>>>>> device’s virtio configuration space, and population of virtqueues.
>>>>>
>>>>> 8. Set the DRIVER_OK status bit. At this point the device is “live”.
>>>>
>>>> and
>>>>
>>>>> 4 Virtio Transport Options
>>>>> 4.1 Virtio Over PCI Bus
>>>>> 4.1.4 Virtio Structure PCI Capabilities
>>>>> 4.1.4.3 Common configuration structure layout
>>>>> 4.1.4.3.2 Driver Requirements: Common configuration structure layout
>>>>>
>>>>> [...]
>>>>>
>>>>> The driver MUST configure the other virtqueue fields before enabling the
>>>>> virtqueue with queue_enable.
>>>>>
>>>>> [...]
>>>>
>>>> These together mean that the following sub-sequence of steps is valid for
>>>> a virtio-1.0 guest driver:
>>>>
>>>> (1.1) set "queue_enable" for the needed queues as the final part of device
>>>> initialization step (7),
>>>>
>>>> (1.2) set DRIVER_OK in step (8),
>>>>
>>>> (1.3) immediately start sending virtio requests to the device.
>>>>
>>>> (2) When vhost-user is enabled, and the VHOST_USER_F_PROTOCOL_FEATURES
>>>> special virtio feature is negotiated, then virtio rings start in disabled
>>>> state, according to
>>>> <https://qemu-project.gitlab.io/qemu/interop/vhost-user.html#ring-states>.
>>>> In this case, explicit VHOST_USER_SET_VRING_ENABLE messages are needed for
>>>> enabling vrings.
>>>>
>>>> Therefore setting "queue_enable" from the guest (1.1) is a *control plane*
>>>> operation, which travels from the guest through QEMU to the vhost-user
>>>> backend, using a unix domain socket.
>>>>
>>>> Whereas sending a virtio request (1.3) is a *data plane* operation, which
>>>> evades QEMU -- it travels from guest to the vhost-user backend via
>>>> eventfd.
>>>>
>>>> This means that steps (1.1) and (1.3) travel through different channels,
>>>> and their relative order can be reversed, as perceived by the vhost-user
>>>> backend.
>>>>
>>>> That's exactly what happens when OVMF's virtiofs driver (VirtioFsDxe) runs
>>>> against the Rust-language virtiofsd version 1.7.2. (Which uses version
>>>> 0.10.1 of the vhost-user-backend crate, and version 0.8.1 of the vhost
>>>> crate.)
>>>>
>>>> Namely, when VirtioFsDxe binds a virtiofs device, it goes through the
>>>> device initialization steps (i.e., control plane operations), and
>>>> immediately sends a FUSE_INIT request too (i.e., performs a data plane
>>>> operation). In the Rust-language virtiofsd, this creates a race between
>>>> two components that run *concurrently*, i.e., in different threads or
>>>> processes:
>>>>
>>>> - Control plane, handling vhost-user protocol messages:
>>>>
>>>> The "VhostUserSlaveReqHandlerMut::set_vring_enable" method
>>>> [crates/vhost-user-backend/src/handler.rs] handles
>>>> VHOST_USER_SET_VRING_ENABLE messages, and updates each vring's "enabled"
>>>> flag according to the message processed.
>>>>
>>>> - Data plane, handling virtio / FUSE requests:
>>>>
>>>> The "VringEpollHandler::handle_event" method
>>>> [crates/vhost-user-backend/src/event_loop.rs] handles the incoming
>>>> virtio / FUSE request, consuming the virtio kick at the same time. If
>>>> the vring's "enabled" flag is set, the virtio / FUSE request is
>>>> processed genuinely. If the vring's "enabled" flag is clear, then the
>>>> virtio / FUSE request is discarded.
>>>
>>> Why is virtiofsd monitoring the virtqueue and discarding requests
>>> while it's disabled?
>>
>> That's what the vhost-user spec requires:
>>
>> https://qemu-project.gitlab.io/qemu/interop/vhost-user.html#ring-states
>>
>> """
>> started but disabled: the back-end must process the ring without causing
>> any side effects. For example, for a networking device, in the disabled
>> state the back-end must not supply any new RX packets, but must process
>> and discard any TX packets.
>> """
>>
>> This state is different from "stopped", where "the back-end must not
>> process the ring at all".
>>
>> The spec also says,
>>
>> """
>> If VHOST_USER_F_PROTOCOL_FEATURES has been negotiated, the ring is
>> initialized in a disabled state and is enabled by
>> VHOST_USER_SET_VRING_ENABLE with parameter 1.
>> """
>>
>> AFAICT virtiofsd follows this requirement.
>
> Hi Michael,
> You documented the disabled ring state in QEMU commit
> c61f09ed855b5009f816242ce281fd01586d4646 ("vhost-user: clarify start
> and enable") where virtio-net devices discard tx buffers. The disabled
> state seems to be specific to vhost-user and not covered in the VIRTIO
> specification.
>
> Do you remember what the purpose of the disabled state was? Why is it
> necessary to discard tx buffers instead of postponing ring processing
> until the virtqueue is enabled?
>
> My concern is that the semantics are unclear for virtqueue types that
> are different from virtio-net rx/tx. Even the virtio-net controlq
> would be problematic - should buffers be silently discarded with
> VIRTIO_NET_OK or should they fail?
Can you comment please?
Thanks
Laszlo
>>> This seems like a bug in the vhost-user backend to me.
>>
>>> I didn't want to exclude that possibility; that's why I included Eugenio,
>> German, Liu Jiang, and Sergio in the CC list.
>>
>>>
>>> When the virtqueue is disabled, don't monitor the kickfd.
>>>
>>> When the virtqueue transitions from disabled to enabled, the control
>>> plane should self-trigger the kickfd so that any available buffers
>>> will be processed.
>>>
>>> QEMU uses this scheme to switch between vhost/IOThreads and built-in
>>> virtqueue kick processing.
>>>
>>> This approach is more robust than relying on buffers being enqueued after
>>> the virtqueue is enabled.
>>
>> I'm happy to drop the series if the virtiofsd maintainers agree that the
>> bug is in virtiofsd, and can propose a design to fix it. (I do think
>> that such a fix would require an architectural change.)
>>
>> FWIW, my own interpretation of the vhost-user spec (see above) was that
>> virtiofsd was right to behave the way it did, and that there was simply
>> no way to prevent out-of-order delivery other than synchronizing the
>> guest end-to-end with the vhost-user backend, concerning
>> VHOST_USER_SET_VRING_ENABLE.
>>
>> This end-to-end synchronization is present "naturally" in vhost-net,
>> where ioctl()s are automatically synchronous -- in fact *all* operations
>> on the control plane are synchronous. (Which is just a different way to
>> say that the guest is tightly coupled with the control plane.)
>>
>> Note that there has been at least one race like this before; see commit
>> 699f2e535d93 ("vhost: make SET_VRING_ADDR, SET_FEATURES send replies",
>> 2021-09-04). Basically every pre-existent call to enforce_reply() is a
>> cover-up for the vhost-user spec turning (somewhat recklessly?) most
>> operations into async ones.
>>
>> At some point this became apparent and so the REPLY_ACK flag was
>> introduced; see commit ca525ce5618b ("vhost-user: Introduce a new
>> protocol feature REPLY_ACK.", 2016-08-10). (That commit doesn't go into
>> details, but I'm pretty sure there was a similar race around SET_MEM_TABLE!)
>>
>> BTW even if we drop this series for QEMU, I don't think it will have
>> been in vain. The first few patches are cleanups which could be merged
>> for their own sake. And the last patch is essentially the proof of the
>> problem statement / analysis. It can be considered an elaborate bug
>> report for virtiofsd, *if* we decide the bug is in virtiofsd. I did have
>> that avenue in mind as well, when writing the commit message / patch.
>>
>> For now I'm going to post v2 -- that's not to say that I'm dismissing
>> your feedback (see above!), just want to get the latest version on-list.
>>
>> Thanks!
>> Laszlo
>>
>>>
>>> Stefan
>>>
>>>>
>>>> Note that OVMF enables the queue *first*, and sends FUSE_INIT *second*.
>>>> However, if the data plane processor in virtiofsd wins the race, then it
>>>> sees the FUSE_INIT *before* the control plane processor took notice of
>>>> VHOST_USER_SET_VRING_ENABLE and green-lit the queue for the data plane
>>>> processor. Therefore the latter drops FUSE_INIT on the floor, and goes
>>>> back to waiting for further virtio / FUSE requests with epoll_wait.
>>>> Meanwhile OVMF is stuck waiting for the FUSE_INIT response -- a deadlock.
>>>>
>>>> The deadlock is not deterministic. OVMF hangs infrequently during first
>>>> boot. However, OVMF hangs almost certainly during reboots from the UEFI
>>>> shell.
>>>>
>>>> The race can be "reliably masked" by inserting a very small delay -- a
>>>> single debug message -- at the top of "VringEpollHandler::handle_event",
>>>> i.e., just before the data plane processor checks the "enabled" field of
>>>> the vring. That delay suffices for the control plane processor to act upon
>>>> VHOST_USER_SET_VRING_ENABLE.
>>>>
>>>> We can deterministically prevent the race in QEMU, by blocking OVMF inside
>>>> step (1.1) -- i.e., in the write to the "queue_enable" register -- until
>>>> VHOST_USER_SET_VRING_ENABLE actually *completes*. That way OVMF's VCPU
>>>> cannot advance to the FUSE_INIT submission before virtiofsd's control
>>>> plane processor takes notice of the queue being enabled.
>>>>
>>>> Wait for VHOST_USER_SET_VRING_ENABLE completion by:
>>>>
>>>> - setting the NEED_REPLY flag on VHOST_USER_SET_VRING_ENABLE, and waiting
>>>> for the reply, if the VHOST_USER_PROTOCOL_F_REPLY_ACK vhost-user feature
>>>> has been negotiated, or
>>>>
>>>> - performing a separate VHOST_USER_GET_FEATURES *exchange*, which requires
>>>> a backend response regardless of VHOST_USER_PROTOCOL_F_REPLY_ACK.
>>>>
>>>> Cc: "Michael S. Tsirkin" <mst@redhat.com> (supporter:vhost)
>>>> Cc: Eugenio Perez Martin <eperezma@redhat.com>
>>>> Cc: German Maglione <gmaglione@redhat.com>
>>>> Cc: Liu Jiang <gerry@linux.alibaba.com>
>>>> Cc: Sergio Lopez Pascual <slp@redhat.com>
>>>> Cc: Stefano Garzarella <sgarzare@redhat.com>
>>>> Signed-off-by: Laszlo Ersek <lersek@redhat.com>
>>>> ---
>>>> hw/virtio/vhost-user.c | 2 +-
>>>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>>>
>>>> diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
>>>> index beb4b832245e..01e0ca90c538 100644
>>>> --- a/hw/virtio/vhost-user.c
>>>> +++ b/hw/virtio/vhost-user.c
>>>> @@ -1235,7 +1235,7 @@ static int vhost_user_set_vring_enable(struct vhost_dev *dev, int enable)
>>>> .num = enable,
>>>> };
>>>>
>>>> - ret = vhost_set_vring(dev, VHOST_USER_SET_VRING_ENABLE, &state, false);
>>>> + ret = vhost_set_vring(dev, VHOST_USER_SET_VRING_ENABLE, &state, true);
>>>> if (ret < 0) {
>>>> /*
>>>> * Restoring the previous state is likely infeasible, as well as
>>>
>>
>
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH 7/7] vhost-user: call VHOST_USER_SET_VRING_ENABLE synchronously
2023-09-05 6:30 ` Laszlo Ersek
@ 2023-09-25 15:31 ` Laszlo Ersek
2023-10-01 19:24 ` Michael S. Tsirkin
0 siblings, 1 reply; 58+ messages in thread
From: Laszlo Ersek @ 2023-09-25 15:31 UTC (permalink / raw)
To: Stefan Hajnoczi, Michael S. Tsirkin
Cc: qemu-devel, Eugenio Perez Martin, German Maglione, Liu Jiang,
Sergio Lopez Pascual, Stefano Garzarella
Ping -- Michael, any comments please? This set (now at v2) has been
waiting on your answer since Aug 30th.
Laszlo
On 9/5/23 08:30, Laszlo Ersek wrote:
> Michael,
>
> On 8/30/23 17:37, Stefan Hajnoczi wrote:
>> On Wed, 30 Aug 2023 at 09:30, Laszlo Ersek <lersek@redhat.com> wrote:
>>>
>>> On 8/30/23 14:10, Stefan Hajnoczi wrote:
>>>> On Sun, 27 Aug 2023 at 14:31, Laszlo Ersek <lersek@redhat.com> wrote:
>>>>>
>>>>> (1) The virtio-1.0 specification
>>>>> <http://docs.oasis-open.org/virtio/virtio/v1.0/virtio-v1.0.html> writes:
>>>>>
>>>>>> 3 General Initialization And Device Operation
>>>>>> 3.1 Device Initialization
>>>>>> 3.1.1 Driver Requirements: Device Initialization
>>>>>>
>>>>>> [...]
>>>>>>
>>>>>> 7. Perform device-specific setup, including discovery of virtqueues for
>>>>>> the device, optional per-bus setup, reading and possibly writing the
>>>>>> device’s virtio configuration space, and population of virtqueues.
>>>>>>
>>>>>> 8. Set the DRIVER_OK status bit. At this point the device is “live”.
>>>>>
>>>>> and
>>>>>
>>>>>> 4 Virtio Transport Options
>>>>>> 4.1 Virtio Over PCI Bus
>>>>>> 4.1.4 Virtio Structure PCI Capabilities
>>>>>> 4.1.4.3 Common configuration structure layout
>>>>>> 4.1.4.3.2 Driver Requirements: Common configuration structure layout
>>>>>>
>>>>>> [...]
>>>>>>
>>>>>> The driver MUST configure the other virtqueue fields before enabling the
>>>>>> virtqueue with queue_enable.
>>>>>>
>>>>>> [...]
>>>>>
>>>>> These together mean that the following sub-sequence of steps is valid for
>>>>> a virtio-1.0 guest driver:
>>>>>
>>>>> (1.1) set "queue_enable" for the needed queues as the final part of device
>>>>> initialization step (7),
>>>>>
>>>>> (1.2) set DRIVER_OK in step (8),
>>>>>
>>>>> (1.3) immediately start sending virtio requests to the device.
>>>>>
>>>>> (2) When vhost-user is enabled, and the VHOST_USER_F_PROTOCOL_FEATURES
>>>>> special virtio feature is negotiated, then virtio rings start in disabled
>>>>> state, according to
>>>>> <https://qemu-project.gitlab.io/qemu/interop/vhost-user.html#ring-states>.
>>>>> In this case, explicit VHOST_USER_SET_VRING_ENABLE messages are needed for
>>>>> enabling vrings.
>>>>>
>>>>> Therefore setting "queue_enable" from the guest (1.1) is a *control plane*
>>>>> operation, which travels from the guest through QEMU to the vhost-user
>>>>> backend, using a unix domain socket.
>>>>>
>>>>> Whereas sending a virtio request (1.3) is a *data plane* operation, which
>>>>> evades QEMU -- it travels from guest to the vhost-user backend via
>>>>> eventfd.
>>>>>
>>>>> This means that steps (1.1) and (1.3) travel through different channels,
>>>>> and their relative order can be reversed, as perceived by the vhost-user
>>>>> backend.
>>>>>
>>>>> That's exactly what happens when OVMF's virtiofs driver (VirtioFsDxe) runs
>>>>> against the Rust-language virtiofsd version 1.7.2. (Which uses version
>>>>> 0.10.1 of the vhost-user-backend crate, and version 0.8.1 of the vhost
>>>>> crate.)
>>>>>
>>>>> Namely, when VirtioFsDxe binds a virtiofs device, it goes through the
>>>>> device initialization steps (i.e., control plane operations), and
>>>>> immediately sends a FUSE_INIT request too (i.e., performs a data plane
>>>>> operation). In the Rust-language virtiofsd, this creates a race between
>>>>> two components that run *concurrently*, i.e., in different threads or
>>>>> processes:
>>>>>
>>>>> - Control plane, handling vhost-user protocol messages:
>>>>>
>>>>> The "VhostUserSlaveReqHandlerMut::set_vring_enable" method
>>>>> [crates/vhost-user-backend/src/handler.rs] handles
>>>>> VHOST_USER_SET_VRING_ENABLE messages, and updates each vring's "enabled"
>>>>> flag according to the message processed.
>>>>>
>>>>> - Data plane, handling virtio / FUSE requests:
>>>>>
>>>>> The "VringEpollHandler::handle_event" method
>>>>> [crates/vhost-user-backend/src/event_loop.rs] handles the incoming
>>>>> virtio / FUSE request, consuming the virtio kick at the same time. If
>>>>> the vring's "enabled" flag is set, the virtio / FUSE request is
>>>>> processed genuinely. If the vring's "enabled" flag is clear, then the
>>>>> virtio / FUSE request is discarded.
>>>>
>>>> Why is virtiofsd monitoring the virtqueue and discarding requests
>>>> while it's disabled?
>>>
>>> That's what the vhost-user spec requires:
>>>
>>> https://qemu-project.gitlab.io/qemu/interop/vhost-user.html#ring-states
>>>
>>> """
>>> started but disabled: the back-end must process the ring without causing
>>> any side effects. For example, for a networking device, in the disabled
>>> state the back-end must not supply any new RX packets, but must process
>>> and discard any TX packets.
>>> """
>>>
>>> This state is different from "stopped", where "the back-end must not
>>> process the ring at all".
>>>
>>> The spec also says,
>>>
>>> """
>>> If VHOST_USER_F_PROTOCOL_FEATURES has been negotiated, the ring is
>>> initialized in a disabled state and is enabled by
>>> VHOST_USER_SET_VRING_ENABLE with parameter 1.
>>> """
>>>
>>> AFAICT virtiofsd follows this requirement.
>>
>> Hi Michael,
>> You documented the disabled ring state in QEMU commit
>> c61f09ed855b5009f816242ce281fd01586d4646 ("vhost-user: clarify start
>> and enable") where virtio-net devices discard tx buffers. The disabled
>> state seems to be specific to vhost-user and not covered in the VIRTIO
>> specification.
>>
>> Do you remember what the purpose of the disabled state was? Why is it
>> necessary to discard tx buffers instead of postponing ring processing
>> until the virtqueue is enabled?
>>
>> My concern is that the semantics are unclear for virtqueue types that
>> are different from virtio-net rx/tx. Even the virtio-net controlq
>> would be problematic - should buffers be silently discarded with
>> VIRTIO_NET_OK or should they fail?
>
> Can you comment please?
>
> Thanks
> Laszlo
>
>
>>>> This seems like a bug in the vhost-user backend to me.
>>>
>>> I didn't want to exclude that possibility; that's why I included Eugenio,
>>> German, Liu Jiang, and Sergio in the CC list.
>>>
>>>>
>>>> When the virtqueue is disabled, don't monitor the kickfd.
>>>>
>>>> When the virtqueue transitions from disabled to enabled, the control
>>>> plane should self-trigger the kickfd so that any available buffers
>>>> will be processed.
>>>>
>>>> QEMU uses this scheme to switch between vhost/IOThreads and built-in
>>>> virtqueue kick processing.
>>>>
>>>> This approach is more robust than relying on buffers being enqueued after
>>>> the virtqueue is enabled.
>>>
>>> I'm happy to drop the series if the virtiofsd maintainers agree that the
>>> bug is in virtiofsd, and can propose a design to fix it. (I do think
>>> that such a fix would require an architectural change.)
>>>
>>> FWIW, my own interpretation of the vhost-user spec (see above) was that
>>> virtiofsd was right to behave the way it did, and that there was simply
>>> no way to prevent out-of-order delivery other than synchronizing the
>>> guest end-to-end with the vhost-user backend, concerning
>>> VHOST_USER_SET_VRING_ENABLE.
>>>
>>> This end-to-end synchronization is present "naturally" in vhost-net,
>>> where ioctl()s are automatically synchronous -- in fact *all* operations
>>> on the control plane are synchronous. (Which is just a different way to
>>> say that the guest is tightly coupled with the control plane.)
>>>
>>> Note that there has been at least one race like this before; see commit
>>> 699f2e535d93 ("vhost: make SET_VRING_ADDR, SET_FEATURES send replies",
>>> 2021-09-04). Basically every pre-existent call to enforce_reply() is a
>>> cover-up for the vhost-user spec turning (somewhat recklessly?) most
>>> operations into async ones.
>>>
>>> At some point this became apparent and so the REPLY_ACK flag was
>>> introduced; see commit ca525ce5618b ("vhost-user: Introduce a new
>>> protocol feature REPLY_ACK.", 2016-08-10). (That commit doesn't go into
>>> details, but I'm pretty sure there was a similar race around SET_MEM_TABLE!)
>>>
>>> BTW even if we drop this series for QEMU, I don't think it will have
>>> been in vain. The first few patches are cleanups which could be merged
>>> for their own sake. And the last patch is essentially the proof of the
>>> problem statement / analysis. It can be considered an elaborate bug
>>> report for virtiofsd, *if* we decide the bug is in virtiofsd. I did have
>>> that avenue in mind as well, when writing the commit message / patch.
>>>
>>> For now I'm going to post v2 -- that's not to say that I'm dismissing
>>> your feedback (see above!), just want to get the latest version on-list.
>>>
>>> Thanks!
>>> Laszlo
>>>
>>>>
>>>> Stefan
>>>>
>>>>>
>>>>> Note that OVMF enables the queue *first*, and sends FUSE_INIT *second*.
>>>>> However, if the data plane processor in virtiofsd wins the race, then it
>>>>> sees the FUSE_INIT *before* the control plane processor took notice of
>>>>> VHOST_USER_SET_VRING_ENABLE and green-lit the queue for the data plane
>>>>> processor. Therefore the latter drops FUSE_INIT on the floor, and goes
>>>>> back to waiting for further virtio / FUSE requests with epoll_wait.
>>>>> Meanwhile OVMF is stuck waiting for the FUSE_INIT response -- a deadlock.
>>>>>
>>>>> The deadlock is not deterministic. OVMF hangs infrequently during first
>>>>> boot. However, OVMF hangs almost certainly during reboots from the UEFI
>>>>> shell.
>>>>>
>>>>> The race can be "reliably masked" by inserting a very small delay -- a
>>>>> single debug message -- at the top of "VringEpollHandler::handle_event",
>>>>> i.e., just before the data plane processor checks the "enabled" field of
>>>>> the vring. That delay suffices for the control plane processor to act upon
>>>>> VHOST_USER_SET_VRING_ENABLE.
>>>>>
>>>>> We can deterministically prevent the race in QEMU, by blocking OVMF inside
>>>>> step (1.1) -- i.e., in the write to the "queue_enable" register -- until
>>>>> VHOST_USER_SET_VRING_ENABLE actually *completes*. That way OVMF's VCPU
>>>>> cannot advance to the FUSE_INIT submission before virtiofsd's control
>>>>> plane processor takes notice of the queue being enabled.
>>>>>
>>>>> Wait for VHOST_USER_SET_VRING_ENABLE completion by:
>>>>>
>>>>> - setting the NEED_REPLY flag on VHOST_USER_SET_VRING_ENABLE, and waiting
>>>>> for the reply, if the VHOST_USER_PROTOCOL_F_REPLY_ACK vhost-user feature
>>>>> has been negotiated, or
>>>>>
>>>>> - performing a separate VHOST_USER_GET_FEATURES *exchange*, which requires
>>>>> a backend response regardless of VHOST_USER_PROTOCOL_F_REPLY_ACK.
>>>>>
>>>>> Cc: "Michael S. Tsirkin" <mst@redhat.com> (supporter:vhost)
>>>>> Cc: Eugenio Perez Martin <eperezma@redhat.com>
>>>>> Cc: German Maglione <gmaglione@redhat.com>
>>>>> Cc: Liu Jiang <gerry@linux.alibaba.com>
>>>>> Cc: Sergio Lopez Pascual <slp@redhat.com>
>>>>> Cc: Stefano Garzarella <sgarzare@redhat.com>
>>>>> Signed-off-by: Laszlo Ersek <lersek@redhat.com>
>>>>> ---
>>>>> hw/virtio/vhost-user.c | 2 +-
>>>>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
>>>>> index beb4b832245e..01e0ca90c538 100644
>>>>> --- a/hw/virtio/vhost-user.c
>>>>> +++ b/hw/virtio/vhost-user.c
>>>>> @@ -1235,7 +1235,7 @@ static int vhost_user_set_vring_enable(struct vhost_dev *dev, int enable)
>>>>> .num = enable,
>>>>> };
>>>>>
>>>>> - ret = vhost_set_vring(dev, VHOST_USER_SET_VRING_ENABLE, &state, false);
>>>>> + ret = vhost_set_vring(dev, VHOST_USER_SET_VRING_ENABLE, &state, true);
>>>>> if (ret < 0) {
>>>>> /*
>>>>> * Restoring the previous state is likely infeasible, as well as
>>>>
>>>
>>
>
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH 7/7] vhost-user: call VHOST_USER_SET_VRING_ENABLE synchronously
2023-09-25 15:31 ` Laszlo Ersek
@ 2023-10-01 19:24 ` Michael S. Tsirkin
2023-10-01 19:25 ` Michael S. Tsirkin
0 siblings, 1 reply; 58+ messages in thread
From: Michael S. Tsirkin @ 2023-10-01 19:24 UTC (permalink / raw)
To: Laszlo Ersek
Cc: Stefan Hajnoczi, qemu-devel, Eugenio Perez Martin,
German Maglione, Liu Jiang, Sergio Lopez Pascual,
Stefano Garzarella
Yes, sorry -- I am working on a pull request with this
included.
On Mon, Sep 25, 2023 at 05:31:17PM +0200, Laszlo Ersek wrote:
> Ping -- Michael, any comments please? This set (now at v2) has been
> waiting on your answer since Aug 30th.
>
> Laszlo
>
> On 9/5/23 08:30, Laszlo Ersek wrote:
> > Michael,
> >
> > On 8/30/23 17:37, Stefan Hajnoczi wrote:
> >> On Wed, 30 Aug 2023 at 09:30, Laszlo Ersek <lersek@redhat.com> wrote:
> >>>
> >>> On 8/30/23 14:10, Stefan Hajnoczi wrote:
> >>>> On Sun, 27 Aug 2023 at 14:31, Laszlo Ersek <lersek@redhat.com> wrote:
> >>>>>
> >>>>> (1) The virtio-1.0 specification
> >>>>> <http://docs.oasis-open.org/virtio/virtio/v1.0/virtio-v1.0.html> writes:
> >>>>>
> >>>>>> 3 General Initialization And Device Operation
> >>>>>> 3.1 Device Initialization
> >>>>>> 3.1.1 Driver Requirements: Device Initialization
> >>>>>>
> >>>>>> [...]
> >>>>>>
> >>>>>> 7. Perform device-specific setup, including discovery of virtqueues for
> >>>>>> the device, optional per-bus setup, reading and possibly writing the
> >>>>>> device’s virtio configuration space, and population of virtqueues.
> >>>>>>
> >>>>>> 8. Set the DRIVER_OK status bit. At this point the device is “live”.
> >>>>>
> >>>>> and
> >>>>>
> >>>>>> 4 Virtio Transport Options
> >>>>>> 4.1 Virtio Over PCI Bus
> >>>>>> 4.1.4 Virtio Structure PCI Capabilities
> >>>>>> 4.1.4.3 Common configuration structure layout
> >>>>>> 4.1.4.3.2 Driver Requirements: Common configuration structure layout
> >>>>>>
> >>>>>> [...]
> >>>>>>
> >>>>>> The driver MUST configure the other virtqueue fields before enabling the
> >>>>>> virtqueue with queue_enable.
> >>>>>>
> >>>>>> [...]
> >>>>>
> >>>>> These together mean that the following sub-sequence of steps is valid for
> >>>>> a virtio-1.0 guest driver:
> >>>>>
> >>>>> (1.1) set "queue_enable" for the needed queues as the final part of device
> >>>>> initialization step (7),
> >>>>>
> >>>>> (1.2) set DRIVER_OK in step (8),
> >>>>>
> >>>>> (1.3) immediately start sending virtio requests to the device.
> >>>>>
> >>>>> (2) When vhost-user is enabled, and the VHOST_USER_F_PROTOCOL_FEATURES
> >>>>> special virtio feature is negotiated, then virtio rings start in disabled
> >>>>> state, according to
> >>>>> <https://qemu-project.gitlab.io/qemu/interop/vhost-user.html#ring-states>.
> >>>>> In this case, explicit VHOST_USER_SET_VRING_ENABLE messages are needed for
> >>>>> enabling vrings.
> >>>>>
> >>>>> Therefore setting "queue_enable" from the guest (1.1) is a *control plane*
> >>>>> operation, which travels from the guest through QEMU to the vhost-user
> >>>>> backend, using a unix domain socket.
> >>>>>
> >>>>> Whereas sending a virtio request (1.3) is a *data plane* operation, which
> >>>>> evades QEMU -- it travels from guest to the vhost-user backend via
> >>>>> eventfd.
> >>>>>
> >>>>> This means that steps (1.1) and (1.3) travel through different channels,
> >>>>> and their relative order can be reversed, as perceived by the vhost-user
> >>>>> backend.
> >>>>>
> >>>>> That's exactly what happens when OVMF's virtiofs driver (VirtioFsDxe) runs
> >>>>> against the Rust-language virtiofsd version 1.7.2. (Which uses version
> >>>>> 0.10.1 of the vhost-user-backend crate, and version 0.8.1 of the vhost
> >>>>> crate.)
> >>>>>
> >>>>> Namely, when VirtioFsDxe binds a virtiofs device, it goes through the
> >>>>> device initialization steps (i.e., control plane operations), and
> >>>>> immediately sends a FUSE_INIT request too (i.e., performs a data plane
> >>>>> operation). In the Rust-language virtiofsd, this creates a race between
> >>>>> two components that run *concurrently*, i.e., in different threads or
> >>>>> processes:
> >>>>>
> >>>>> - Control plane, handling vhost-user protocol messages:
> >>>>>
> >>>>> The "VhostUserSlaveReqHandlerMut::set_vring_enable" method
> >>>>> [crates/vhost-user-backend/src/handler.rs] handles
> >>>>> VHOST_USER_SET_VRING_ENABLE messages, and updates each vring's "enabled"
> >>>>> flag according to the message processed.
> >>>>>
> >>>>> - Data plane, handling virtio / FUSE requests:
> >>>>>
> >>>>> The "VringEpollHandler::handle_event" method
> >>>>> [crates/vhost-user-backend/src/event_loop.rs] handles the incoming
> >>>>> virtio / FUSE request, consuming the virtio kick at the same time. If
> >>>>> the vring's "enabled" flag is set, the virtio / FUSE request is
> >>>>> processed genuinely. If the vring's "enabled" flag is clear, then the
> >>>>> virtio / FUSE request is discarded.
> >>>>
> >>>> Why is virtiofsd monitoring the virtqueue and discarding requests
> >>>> while it's disabled?
> >>>
> >>> That's what the vhost-user spec requires:
> >>>
> >>> https://qemu-project.gitlab.io/qemu/interop/vhost-user.html#ring-states
> >>>
> >>> """
> >>> started but disabled: the back-end must process the ring without causing
> >>> any side effects. For example, for a networking device, in the disabled
> >>> state the back-end must not supply any new RX packets, but must process
> >>> and discard any TX packets.
> >>> """
> >>>
> >>> This state is different from "stopped", where "the back-end must not
> >>> process the ring at all".
> >>>
> >>> The spec also says,
> >>>
> >>> """
> >>> If VHOST_USER_F_PROTOCOL_FEATURES has been negotiated, the ring is
> >>> initialized in a disabled state and is enabled by
> >>> VHOST_USER_SET_VRING_ENABLE with parameter 1.
> >>> """
> >>>
> >>> AFAICT virtiofsd follows this requirement.
> >>
> >> Hi Michael,
> >> You documented the disabled ring state in QEMU commit
> >> c61f09ed855b5009f816242ce281fd01586d4646 ("vhost-user: clarify start
> >> and enable") where virtio-net devices discard tx buffers. The disabled
> >> state seems to be specific to vhost-user and not covered in the VIRTIO
> >> specification.
> >>
> >> Do you remember what the purpose of the disabled state was? Why is it
> >> necessary to discard tx buffers instead of postponing ring processing
> >> until the virtqueue is enabled?
> >>
> >> My concern is that the semantics are unclear for virtqueue types that
> >> are different from virtio-net rx/tx. Even the virtio-net controlq
> >> would be problematic - should buffers be silently discarded with
> >> VIRTIO_NET_OK or should they fail?
> >
> > Can you comment please?
> >
> > Thanks
> > Laszlo
> >
> >
> >>>> This seems like a bug in the vhost-user backend to me.
> >>>
> >>> I didn't want to exclude that possibility; that's why I included Eugenio,
> >>> German, Liu Jiang, and Sergio in the CC list.
> >>>
> >>>>
> >>>> When the virtqueue is disabled, don't monitor the kickfd.
> >>>>
> >>>> When the virtqueue transitions from disabled to enabled, the control
> >>>> plane should self-trigger the kickfd so that any available buffers
> >>>> will be processed.
> >>>>
> >>>> QEMU uses this scheme to switch between vhost/IOThreads and built-in
> >>>> virtqueue kick processing.
> >>>>
> >>>> This approach is more robust than relying on buffers being enqueued after
> >>>> the virtqueue is enabled.
> >>>
> >>> I'm happy to drop the series if the virtiofsd maintainers agree that the
> >>> bug is in virtiofsd, and can propose a design to fix it. (I do think
> >>> that such a fix would require an architectural change.)
> >>>
> >>> FWIW, my own interpretation of the vhost-user spec (see above) was that
> >>> virtiofsd was right to behave the way it did, and that there was simply
> >>> no way to prevent out-of-order delivery other than synchronizing the
> >>> guest end-to-end with the vhost-user backend, concerning
> >>> VHOST_USER_SET_VRING_ENABLE.
> >>>
> >>> This end-to-end synchronization is present "naturally" in vhost-net,
> >>> where ioctl()s are automatically synchronous -- in fact *all* operations
> >>> on the control plane are synchronous. (Which is just a different way to
> >>> say that the guest is tightly coupled with the control plane.)
> >>>
> >>> Note that there has been at least one race like this before; see commit
> >>> 699f2e535d93 ("vhost: make SET_VRING_ADDR, SET_FEATURES send replies",
> >>> 2021-09-04). Basically every pre-existent call to enforce_reply() is a
> >>> cover-up for the vhost-user spec turning (somewhat recklessly?) most
> >>> operations into async ones.
> >>>
> >>> At some point this became apparent and so the REPLY_ACK flag was
> >>> introduced; see commit ca525ce5618b ("vhost-user: Introduce a new
> >>> protocol feature REPLY_ACK.", 2016-08-10). (That commit doesn't go into
> >>> details, but I'm pretty sure there was a similar race around SET_MEM_TABLE!)
> >>>
> >>> BTW even if we drop this series for QEMU, I don't think it will have
> >>> been in vain. The first few patches are cleanups which could be merged
> >>> for their own sake. And the last patch is essentially the proof of the
> >>> problem statement / analysis. It can be considered an elaborate bug
> >>> report for virtiofsd, *if* we decide the bug is in virtiofsd. I did have
> >>> that avenue in mind as well, when writing the commit message / patch.
> >>>
> >>> For now I'm going to post v2 -- that's not to say that I'm dismissing
> >>> your feedback (see above!), just want to get the latest version on-list.
> >>>
> >>> Thanks!
> >>> Laszlo
> >>>
> >>>>
> >>>> Stefan
> >>>>
> >>>>>
> >>>>> Note that OVMF enables the queue *first*, and sends FUSE_INIT *second*.
> >>>>> However, if the data plane processor in virtiofsd wins the race, then it
> >>>>> sees the FUSE_INIT *before* the control plane processor took notice of
> >>>>> VHOST_USER_SET_VRING_ENABLE and green-lit the queue for the data plane
> >>>>> processor. Therefore the latter drops FUSE_INIT on the floor, and goes
> >>>>> back to waiting for further virtio / FUSE requests with epoll_wait.
> >>>>> Meanwhile OVMF is stuck waiting for the FUSE_INIT response -- a deadlock.
> >>>>>
> >>>>> The deadlock is not deterministic. OVMF hangs infrequently during first
> >>>>> boot. However, OVMF hangs almost certainly during reboots from the UEFI
> >>>>> shell.
> >>>>>
> >>>>> The race can be "reliably masked" by inserting a very small delay -- a
> >>>>> single debug message -- at the top of "VringEpollHandler::handle_event",
> >>>>> i.e., just before the data plane processor checks the "enabled" field of
> >>>>> the vring. That delay suffices for the control plane processor to act upon
> >>>>> VHOST_USER_SET_VRING_ENABLE.
> >>>>>
> >>>>> We can deterministically prevent the race in QEMU, by blocking OVMF inside
> >>>>> step (1.1) -- i.e., in the write to the "queue_enable" register -- until
> >>>>> VHOST_USER_SET_VRING_ENABLE actually *completes*. That way OVMF's VCPU
> >>>>> cannot advance to the FUSE_INIT submission before virtiofsd's control
> >>>>> plane processor takes notice of the queue being enabled.
> >>>>>
> >>>>> Wait for VHOST_USER_SET_VRING_ENABLE completion by:
> >>>>>
> >>>>> - setting the NEED_REPLY flag on VHOST_USER_SET_VRING_ENABLE, and waiting
> >>>>> for the reply, if the VHOST_USER_PROTOCOL_F_REPLY_ACK vhost-user feature
> >>>>> has been negotiated, or
> >>>>>
> >>>>> - performing a separate VHOST_USER_GET_FEATURES *exchange*, which requires
> >>>>> a backend response regardless of VHOST_USER_PROTOCOL_F_REPLY_ACK.
> >>>>>
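
To make the two strategies above concrete, here is a rough sketch of the
intended message flow (the send_msg/recv_ack/get_features helpers are
stand-ins invented for illustration, not the QEMU vhost-user functions; the
request number is the one listed in the vhost-user specification):

    #include <stdint.h>

    struct backend;                          /* opaque connection state (hypothetical) */
    struct ring_state { unsigned index; unsigned num; };
    enum { SET_VRING_ENABLE = 18 };          /* request number per the vhost-user spec */

    /* Hypothetical transport helpers -- not the QEMU ones: */
    int send_msg(struct backend *be, int request, const struct ring_state *state,
                 int need_reply);
    int recv_ack(struct backend *be, int request);
    int get_features(struct backend *be, uint64_t *features);
    int reply_ack_negotiated(const struct backend *be);

    static int set_vring_enable_sync(struct backend *be, unsigned index, int enable)
    {
        struct ring_state state = { .index = index, .num = enable };
        int r;

        if (reply_ack_negotiated(be)) {
            /* Set NEED_REPLY and block until the ack arrives. */
            r = send_msg(be, SET_VRING_ENABLE, &state, 1);
            return r < 0 ? r : recv_ack(be, SET_VRING_ENABLE);
        }

        /* No REPLY_ACK: piggyback on a request that always has a response. */
        r = send_msg(be, SET_VRING_ENABLE, &state, 0);
        if (r < 0) {
            return r;
        }
        /*
         * The backend handles messages in order, so by the time the
         * GET_FEATURES reply comes back, SET_VRING_ENABLE has been
         * processed as well.
         */
        uint64_t features;
        return get_features(be, &features);
    }
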
> >>>>> Cc: "Michael S. Tsirkin" <mst@redhat.com> (supporter:vhost)
> >>>>> Cc: Eugenio Perez Martin <eperezma@redhat.com>
> >>>>> Cc: German Maglione <gmaglione@redhat.com>
> >>>>> Cc: Liu Jiang <gerry@linux.alibaba.com>
> >>>>> Cc: Sergio Lopez Pascual <slp@redhat.com>
> >>>>> Cc: Stefano Garzarella <sgarzare@redhat.com>
> >>>>> Signed-off-by: Laszlo Ersek <lersek@redhat.com>
> >>>>> ---
> >>>>> hw/virtio/vhost-user.c | 2 +-
> >>>>> 1 file changed, 1 insertion(+), 1 deletion(-)
> >>>>>
> >>>>> diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
> >>>>> index beb4b832245e..01e0ca90c538 100644
> >>>>> --- a/hw/virtio/vhost-user.c
> >>>>> +++ b/hw/virtio/vhost-user.c
> >>>>> @@ -1235,7 +1235,7 @@ static int vhost_user_set_vring_enable(struct vhost_dev *dev, int enable)
> >>>>> .num = enable,
> >>>>> };
> >>>>>
> >>>>> - ret = vhost_set_vring(dev, VHOST_USER_SET_VRING_ENABLE, &state, false);
> >>>>> + ret = vhost_set_vring(dev, VHOST_USER_SET_VRING_ENABLE, &state, true);
> >>>>> if (ret < 0) {
> >>>>> /*
> >>>>> * Restoring the previous state is likely infeasible, as well as
> >>>>
> >>>
> >>
> >
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH 7/7] vhost-user: call VHOST_USER_SET_VRING_ENABLE synchronously
2023-10-01 19:24 ` Michael S. Tsirkin
@ 2023-10-01 19:25 ` Michael S. Tsirkin
2023-10-02 1:56 ` Laszlo Ersek
0 siblings, 1 reply; 58+ messages in thread
From: Michael S. Tsirkin @ 2023-10-01 19:25 UTC (permalink / raw)
To: Laszlo Ersek
Cc: Stefan Hajnoczi, qemu-devel, Eugenio Perez Martin,
German Maglione, Liu Jiang, Sergio Lopez Pascual,
Stefano Garzarella
Not this actually - v2 of this.
On Sun, Oct 01, 2023 at 03:24:59PM -0400, Michael S. Tsirkin wrote:
> yes sorry - I am working on a pull request with this
> included.
>
> On Mon, Sep 25, 2023 at 05:31:17PM +0200, Laszlo Ersek wrote:
> > Ping -- Michael, any comments please? This set (now at v2) has been
> > waiting on your answer since Aug 30th.
> >
> > Laszlo
> >
> > On 9/5/23 08:30, Laszlo Ersek wrote:
> > > Michael,
> > >
> > > On 8/30/23 17:37, Stefan Hajnoczi wrote:
> > >> On Wed, 30 Aug 2023 at 09:30, Laszlo Ersek <lersek@redhat.com> wrote:
> > >>>
> > >>> On 8/30/23 14:10, Stefan Hajnoczi wrote:
> > >>>> On Sun, 27 Aug 2023 at 14:31, Laszlo Ersek <lersek@redhat.com> wrote:
> > >>>>>
> > >>>>> (1) The virtio-1.0 specification
> > >>>>> <http://docs.oasis-open.org/virtio/virtio/v1.0/virtio-v1.0.html> writes:
> > >>>>>
> > >>>>>> 3 General Initialization And Device Operation
> > >>>>>> 3.1 Device Initialization
> > >>>>>> 3.1.1 Driver Requirements: Device Initialization
> > >>>>>>
> > >>>>>> [...]
> > >>>>>>
> > >>>>>> 7. Perform device-specific setup, including discovery of virtqueues for
> > >>>>>> the device, optional per-bus setup, reading and possibly writing the
> > >>>>>> device’s virtio configuration space, and population of virtqueues.
> > >>>>>>
> > >>>>>> 8. Set the DRIVER_OK status bit. At this point the device is “live”.
> > >>>>>
> > >>>>> and
> > >>>>>
> > >>>>>> 4 Virtio Transport Options
> > >>>>>> 4.1 Virtio Over PCI Bus
> > >>>>>> 4.1.4 Virtio Structure PCI Capabilities
> > >>>>>> 4.1.4.3 Common configuration structure layout
> > >>>>>> 4.1.4.3.2 Driver Requirements: Common configuration structure layout
> > >>>>>>
> > >>>>>> [...]
> > >>>>>>
> > >>>>>> The driver MUST configure the other virtqueue fields before enabling the
> > >>>>>> virtqueue with queue_enable.
> > >>>>>>
> > >>>>>> [...]
> > >>>>>
> > >>>>> These together mean that the following sub-sequence of steps is valid for
> > >>>>> a virtio-1.0 guest driver:
> > >>>>>
> > >>>>> (1.1) set "queue_enable" for the needed queues as the final part of device
> > >>>>> initialization step (7),
> > >>>>>
> > >>>>> (1.2) set DRIVER_OK in step (8),
> > >>>>>
> > >>>>> (1.3) immediately start sending virtio requests to the device.
> > >>>>>
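
As an illustration of the ordering allowed by (1.1)-(1.3), a guest driver's
bring-up of one queue could look like the sketch below (the vdev_* helpers
and fuse_init_submit are hypothetical accessors; only the DRIVER_OK status
bit value comes from the virtio 1.0 spec):

    #include <stdint.h>

    #define VIRTIO_CONFIG_S_DRIVER_OK  4     /* DRIVER_OK status bit (virtio 1.0) */

    /* Hypothetical transport accessors wrapping the PCI common config: */
    void vdev_queue_enable(uint16_t queue);  /* queue_select + queue_enable = 1 */
    void vdev_add_status(uint8_t bits);      /* OR bits into device_status */
    void vdev_kick(uint16_t queue);          /* write the queue notify address */
    int  fuse_init_submit(uint16_t queue);   /* stand-in for "queue the first request" */

    static void driver_bring_up(uint16_t req_queue)
    {
        /* (1.1) last part of init step 7: enable the already-configured queue */
        vdev_queue_enable(req_queue);

        /* (1.2) init step 8: the device is now "live" */
        vdev_add_status(VIRTIO_CONFIG_S_DRIVER_OK);

        /*
         * (1.3) the driver may submit a request right away. The kick reaches
         * the vhost-user backend directly via eventfd, while the queue_enable
         * write from (1.1) still has to be relayed by QEMU as a vhost-user
         * message -- which is exactly the race described below.
         */
        fuse_init_submit(req_queue);
        vdev_kick(req_queue);
    }
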
> > >>>>> (2) When vhost-user is enabled, and the VHOST_USER_F_PROTOCOL_FEATURES
> > >>>>> special virtio feature is negotiated, then virtio rings start in disabled
> > >>>>> state, according to
> > >>>>> <https://qemu-project.gitlab.io/qemu/interop/vhost-user.html#ring-states>.
> > >>>>> In this case, explicit VHOST_USER_SET_VRING_ENABLE messages are needed for
> > >>>>> enabling vrings.
> > >>>>>
> > >>>>> Therefore setting "queue_enable" from the guest (1.1) is a *control plane*
> > >>>>> operation, which travels from the guest through QEMU to the vhost-user
> > >>>>> backend, using a unix domain socket.
> > >>>>>
> > >>>>> Whereas sending a virtio request (1.3) is a *data plane* operation, which
> > >>>>> evades QEMU -- it travels from guest to the vhost-user backend via
> > >>>>> eventfd.
> > >>>>>
> > >>>>> This means that steps (1.1) and (1.3) travel through different channels,
> > >>>>> and their relative order can be reversed, as perceived by the vhost-user
> > >>>>> backend.
> > >>>>>
> > >>>>> That's exactly what happens when OVMF's virtiofs driver (VirtioFsDxe) runs
> > >>>>> against the Rust-language virtiofsd version 1.7.2. (Which uses version
> > >>>>> 0.10.1 of the vhost-user-backend crate, and version 0.8.1 of the vhost
> > >>>>> crate.)
> > >>>>>
> > >>>>> Namely, when VirtioFsDxe binds a virtiofs device, it goes through the
> > >>>>> device initialization steps (i.e., control plane operations), and
> > >>>>> immediately sends a FUSE_INIT request too (i.e., performs a data plane
> > >>>>> operation). In the Rust-language virtiofsd, this creates a race between
> > >>>>> two components that run *concurrently*, i.e., in different threads or
> > >>>>> processes:
> > >>>>>
> > >>>>> - Control plane, handling vhost-user protocol messages:
> > >>>>>
> > >>>>> The "VhostUserSlaveReqHandlerMut::set_vring_enable" method
> > >>>>> [crates/vhost-user-backend/src/handler.rs] handles
> > >>>>> VHOST_USER_SET_VRING_ENABLE messages, and updates each vring's "enabled"
> > >>>>> flag according to the message processed.
> > >>>>>
> > >>>>> - Data plane, handling virtio / FUSE requests:
> > >>>>>
> > >>>>> The "VringEpollHandler::handle_event" method
> > >>>>> [crates/vhost-user-backend/src/event_loop.rs] handles the incoming
> > >>>>> virtio / FUSE request, consuming the virtio kick at the same time. If
> > >>>>> the vring's "enabled" flag is set, the virtio / FUSE request is
> > >>>>> processed genuinely. If the vring's "enabled" flag is clear, then the
> > >>>>> virtio / FUSE request is discarded.
> > >>>>
> > >>>> Why is virtiofsd monitoring the virtqueue and discarding requests
> > >>>> while it's disabled?
> > >>>
> > >>> That's what the vhost-user spec requires:
> > >>>
> > >>> https://qemu-project.gitlab.io/qemu/interop/vhost-user.html#ring-states
> > >>>
> > >>> """
> > >>> started but disabled: the back-end must process the ring without causing
> > >>> any side effects. For example, for a networking device, in the disabled
> > >>> state the back-end must not supply any new RX packets, but must process
> > >>> and discard any TX packets.
> > >>> """
> > >>>
> > >>> This state is different from "stopped", where "the back-end must not
> > >>> process the ring at all".
> > >>>
> > >>> The spec also says,
> > >>>
> > >>> """
> > >>> If VHOST_USER_F_PROTOCOL_FEATURES has been negotiated, the ring is
> > >>> initialized in a disabled state and is enabled by
> > >>> VHOST_USER_SET_VRING_ENABLE with parameter 1.
> > >>> """
> > >>>
> > >>> AFAICT virtiofsd follows this requirement.
> > >>
> > >> Hi Michael,
> > >> You documented the disabled ring state in QEMU commit
> > >> c61f09ed855b5009f816242ce281fd01586d4646 ("vhost-user: clarify start
> > >> and enable") where virtio-net devices discard tx buffers. The disabled
> > >> state seems to be specific to vhost-user and not covered in the VIRTIO
> > >> specification.
> > >>
> > >> Do you remember what the purpose of the disabled state was? Why is it
> > >> necessary to discard tx buffers instead of postponing ring processing
> > >> until the virtqueue is enabled?
> > >>
> > >> My concern is that the semantics are unclear for virtqueue types that
> > >> are different from virtio-net rx/tx. Even the virtio-net controlq
> > >> would be problematic - should buffers be silently discarded with
> > >> VIRTIO_NET_OK or should they fail?
> > >
> > > Can you comment please?
> > >
> > > Thanks
> > > Laszlo
> > >
> > >
> > >>>> This seems like a bug in the vhost-user backend to me.
> > >>>
> > >>> I didn't want to exclude that possibility; that's why I included Eugenio,
> > >>> German, Liu Jiang, and Sergio in the CC list.
> > >>>
> > >>>>
> > >>>> When the virtqueue is disabled, don't monitor the kickfd.
> > >>>>
> > >>>> When the virtqueue transitions from disabled to enabled, the control
> > >>>> plane should self-trigger the kickfd so that any available buffers
> > >>>> will be processed.
> > >>>>
> > >>>> QEMU uses this scheme to switch between vhost/IOThreads and built-in
> > >>>> virtqueue kick processing.
> > >>>>
> > >>>> This approach is more robust than relying on buffers being enqueued after
> > >>>> the virtqueue is enabled.
> > >>>
> > >>> I'm happy to drop the series if the virtiofsd maintainers agree that the
> > >>> bug is in virtiofsd, and can propose a design to fix it. (I do think
> > >>> that such a fix would require an architectural change.)
> > >>>
> > >>> FWIW, my own interpretation of the vhost-user spec (see above) was that
> > >>> virtiofsd was right to behave the way it did, and that there was simply
> > >>> no way to prevent out-of-order delivery other than synchronizing the
> > >>> guest end-to-end with the vhost-user backend, concerning
> > >>> VHOST_USER_SET_VRING_ENABLE.
> > >>>
> > >>> This end-to-end synchronization is present "naturally" in vhost-net,
> > >>> where ioctl()s are automatically synchronous -- in fact *all* operations
> > >>> on the control plane are synchronous. (Which is just a different way to
> > >>> say that the guest is tightly coupled with the control plane.)
> > >>>
> > >>> Note that there has been at least one race like this before; see commit
> > >>> 699f2e535d93 ("vhost: make SET_VRING_ADDR, SET_FEATURES send replies",
> > >>> 2021-09-04). Basically every pre-existent call to enforce_reply() is a
> > >>> cover-up for the vhost-user spec turning (somewhat recklessly?) most
> > >>> operations into async ones.
> > >>>
> > >>> At some point this became apparent and so the REPLY_ACK flag was
> > >>> introduced; see commit ca525ce5618b ("vhost-user: Introduce a new
> > >>> protocol feature REPLY_ACK.", 2016-08-10). (That commit doesn't go into
> > >>> details, but I'm pretty sure there was a similar race around SET_MEM_TABLE!)
> > >>>
> > >>> BTW even if we drop this series for QEMU, I don't think it will have
> > >>> been in vain. The first few patches are cleanups which could be merged
> > >>> for their own sake. And the last patch is essentially the proof of the
> > >>> problem statement / analysis. It can be considered an elaborate bug
> > >>> report for virtiofsd, *if* we decide the bug is in virtiofsd. I did have
> > >>> that avenue in mind as well, when writing the commit message / patch.
> > >>>
> > >>> For now I'm going to post v2 -- that's not to say that I'm dismissing
> > >>> your feedback (see above!), just want to get the latest version on-list.
> > >>>
> > >>> Thanks!
> > >>> Laszlo
> > >>>
> > >>>>
> > >>>> Stefan
> > >>>>
> > >>>>>
> > >>>>> Note that OVMF enables the queue *first*, and sends FUSE_INIT *second*.
> > >>>>> However, if the data plane processor in virtiofsd wins the race, then it
> > >>>>> sees the FUSE_INIT *before* the control plane processor took notice of
> > >>>>> VHOST_USER_SET_VRING_ENABLE and green-lit the queue for the data plane
> > >>>>> processor. Therefore the latter drops FUSE_INIT on the floor, and goes
> > >>>>> back to waiting for further virtio / FUSE requests with epoll_wait.
> > >>>>> Meanwhile OVMF is stuck waiting for the FUSE_INIT response -- a deadlock.
> > >>>>>
> > >>>>> The deadlock is not deterministic. OVMF hangs infrequently during first
> > >>>>> boot. However, OVMF hangs almost certainly during reboots from the UEFI
> > >>>>> shell.
> > >>>>>
> > >>>>> The race can be "reliably masked" by inserting a very small delay -- a
> > >>>>> single debug message -- at the top of "VringEpollHandler::handle_event",
> > >>>>> i.e., just before the data plane processor checks the "enabled" field of
> > >>>>> the vring. That delay suffices for the control plane processor to act upon
> > >>>>> VHOST_USER_SET_VRING_ENABLE.
> > >>>>>
> > >>>>> We can deterministically prevent the race in QEMU, by blocking OVMF inside
> > >>>>> step (1.1) -- i.e., in the write to the "queue_enable" register -- until
> > >>>>> VHOST_USER_SET_VRING_ENABLE actually *completes*. That way OVMF's VCPU
> > >>>>> cannot advance to the FUSE_INIT submission before virtiofsd's control
> > >>>>> plane processor takes notice of the queue being enabled.
> > >>>>>
> > >>>>> Wait for VHOST_USER_SET_VRING_ENABLE completion by:
> > >>>>>
> > >>>>> - setting the NEED_REPLY flag on VHOST_USER_SET_VRING_ENABLE, and waiting
> > >>>>> for the reply, if the VHOST_USER_PROTOCOL_F_REPLY_ACK vhost-user feature
> > >>>>> has been negotiated, or
> > >>>>>
> > >>>>> - performing a separate VHOST_USER_GET_FEATURES *exchange*, which requires
> > >>>>> a backend response regardless of VHOST_USER_PROTOCOL_F_REPLY_ACK.
> > >>>>>
> > >>>>> Cc: "Michael S. Tsirkin" <mst@redhat.com> (supporter:vhost)
> > >>>>> Cc: Eugenio Perez Martin <eperezma@redhat.com>
> > >>>>> Cc: German Maglione <gmaglione@redhat.com>
> > >>>>> Cc: Liu Jiang <gerry@linux.alibaba.com>
> > >>>>> Cc: Sergio Lopez Pascual <slp@redhat.com>
> > >>>>> Cc: Stefano Garzarella <sgarzare@redhat.com>
> > >>>>> Signed-off-by: Laszlo Ersek <lersek@redhat.com>
> > >>>>> ---
> > >>>>> hw/virtio/vhost-user.c | 2 +-
> > >>>>> 1 file changed, 1 insertion(+), 1 deletion(-)
> > >>>>>
> > >>>>> diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
> > >>>>> index beb4b832245e..01e0ca90c538 100644
> > >>>>> --- a/hw/virtio/vhost-user.c
> > >>>>> +++ b/hw/virtio/vhost-user.c
> > >>>>> @@ -1235,7 +1235,7 @@ static int vhost_user_set_vring_enable(struct vhost_dev *dev, int enable)
> > >>>>> .num = enable,
> > >>>>> };
> > >>>>>
> > >>>>> - ret = vhost_set_vring(dev, VHOST_USER_SET_VRING_ENABLE, &state, false);
> > >>>>> + ret = vhost_set_vring(dev, VHOST_USER_SET_VRING_ENABLE, &state, true);
> > >>>>> if (ret < 0) {
> > >>>>> /*
> > >>>>> * Restoring the previous state is likely infeasible, as well as
> > >>>>
> > >>>
> > >>
> > >
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH 7/7] vhost-user: call VHOST_USER_SET_VRING_ENABLE synchronously
2023-10-01 19:25 ` Michael S. Tsirkin
@ 2023-10-02 1:56 ` Laszlo Ersek
2023-10-02 6:57 ` Michael S. Tsirkin
0 siblings, 1 reply; 58+ messages in thread
From: Laszlo Ersek @ 2023-10-02 1:56 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Stefan Hajnoczi, qemu-devel, Eugenio Perez Martin,
German Maglione, Liu Jiang, Sergio Lopez Pascual,
Stefano Garzarella
On 10/1/23 21:25, Michael S. Tsirkin wrote:
> Not this actually - v2 of this.
Thank you, but:
- Stefan's question should still be answered IMO (although if you pick
up this series, then that could be interpreted as "QEMU bug, not spec bug")
- I was supposed to update the commit message on 7/7 in v3; I didn't
want to do it before Stefan's question was answered
Thanks!
Laszlo
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH 7/7] vhost-user: call VHOST_USER_SET_VRING_ENABLE synchronously
2023-08-30 15:37 ` Stefan Hajnoczi
2023-09-05 6:30 ` Laszlo Ersek
@ 2023-10-02 6:49 ` Michael S. Tsirkin
2023-10-02 21:12 ` Stefan Hajnoczi
1 sibling, 1 reply; 58+ messages in thread
From: Michael S. Tsirkin @ 2023-10-02 6:49 UTC (permalink / raw)
To: Stefan Hajnoczi
Cc: qemu-devel, Laszlo Ersek, Eugenio Perez Martin, German Maglione,
Liu Jiang, Sergio Lopez Pascual, Stefano Garzarella, Jason Wang
On Wed, Aug 30, 2023 at 11:37:50AM -0400, Stefan Hajnoczi wrote:
> On Wed, 30 Aug 2023 at 09:30, Laszlo Ersek <lersek@redhat.com> wrote:
> >
> > On 8/30/23 14:10, Stefan Hajnoczi wrote:
> > > On Sun, 27 Aug 2023 at 14:31, Laszlo Ersek <lersek@redhat.com> wrote:
> > >>
> > >> (1) The virtio-1.0 specification
> > >> <http://docs.oasis-open.org/virtio/virtio/v1.0/virtio-v1.0.html> writes:
> > >>
> > >>> 3 General Initialization And Device Operation
> > >>> 3.1 Device Initialization
> > >>> 3.1.1 Driver Requirements: Device Initialization
> > >>>
> > >>> [...]
> > >>>
> > >>> 7. Perform device-specific setup, including discovery of virtqueues for
> > >>> the device, optional per-bus setup, reading and possibly writing the
> > >>> device’s virtio configuration space, and population of virtqueues.
> > >>>
> > >>> 8. Set the DRIVER_OK status bit. At this point the device is “live”.
> > >>
> > >> and
> > >>
> > >>> 4 Virtio Transport Options
> > >>> 4.1 Virtio Over PCI Bus
> > >>> 4.1.4 Virtio Structure PCI Capabilities
> > >>> 4.1.4.3 Common configuration structure layout
> > >>> 4.1.4.3.2 Driver Requirements: Common configuration structure layout
> > >>>
> > >>> [...]
> > >>>
> > >>> The driver MUST configure the other virtqueue fields before enabling the
> > >>> virtqueue with queue_enable.
> > >>>
> > >>> [...]
> > >>
> > >> These together mean that the following sub-sequence of steps is valid for
> > >> a virtio-1.0 guest driver:
> > >>
> > >> (1.1) set "queue_enable" for the needed queues as the final part of device
> > >> initialization step (7),
> > >>
> > >> (1.2) set DRIVER_OK in step (8),
> > >>
> > >> (1.3) immediately start sending virtio requests to the device.
> > >>
> > >> (2) When vhost-user is enabled, and the VHOST_USER_F_PROTOCOL_FEATURES
> > >> special virtio feature is negotiated, then virtio rings start in disabled
> > >> state, according to
> > >> <https://qemu-project.gitlab.io/qemu/interop/vhost-user.html#ring-states>.
> > >> In this case, explicit VHOST_USER_SET_VRING_ENABLE messages are needed for
> > >> enabling vrings.
> > >>
> > >> Therefore setting "queue_enable" from the guest (1.1) is a *control plane*
> > >> operation, which travels from the guest through QEMU to the vhost-user
> > >> backend, using a unix domain socket.
> > >>
> > >> Whereas sending a virtio request (1.3) is a *data plane* operation, which
> > >> evades QEMU -- it travels from guest to the vhost-user backend via
> > >> eventfd.
> > >>
> > >> This means that steps (1.1) and (1.3) travel through different channels,
> > >> and their relative order can be reversed, as perceived by the vhost-user
> > >> backend.
> > >>
> > >> That's exactly what happens when OVMF's virtiofs driver (VirtioFsDxe) runs
> > >> against the Rust-language virtiofsd version 1.7.2. (Which uses version
> > >> 0.10.1 of the vhost-user-backend crate, and version 0.8.1 of the vhost
> > >> crate.)
> > >>
> > >> Namely, when VirtioFsDxe binds a virtiofs device, it goes through the
> > >> device initialization steps (i.e., control plane operations), and
> > >> immediately sends a FUSE_INIT request too (i.e., performs a data plane
> > >> operation). In the Rust-language virtiofsd, this creates a race between
> > >> two components that run *concurrently*, i.e., in different threads or
> > >> processes:
> > >>
> > >> - Control plane, handling vhost-user protocol messages:
> > >>
> > >> The "VhostUserSlaveReqHandlerMut::set_vring_enable" method
> > >> [crates/vhost-user-backend/src/handler.rs] handles
> > >> VHOST_USER_SET_VRING_ENABLE messages, and updates each vring's "enabled"
> > >> flag according to the message processed.
> > >>
> > >> - Data plane, handling virtio / FUSE requests:
> > >>
> > >> The "VringEpollHandler::handle_event" method
> > >> [crates/vhost-user-backend/src/event_loop.rs] handles the incoming
> > >> virtio / FUSE request, consuming the virtio kick at the same time. If
> > >> the vring's "enabled" flag is set, the virtio / FUSE request is
> > >> processed genuinely. If the vring's "enabled" flag is clear, then the
> > >> virtio / FUSE request is discarded.
> > >
> > > Why is virtiofsd monitoring the virtqueue and discarding requests
> > > while it's disabled?
> >
> > That's what the vhost-user spec requires:
> >
> > https://qemu-project.gitlab.io/qemu/interop/vhost-user.html#ring-states
> >
> > """
> > started but disabled: the back-end must process the ring without causing
> > any side effects. For example, for a networking device, in the disabled
> > state the back-end must not supply any new RX packets, but must process
> > and discard any TX packets.
> > """
> >
> > This state is different from "stopped", where "the back-end must not
> > process the ring at all".
> >
> > The spec also says,
> >
> > """
> > If VHOST_USER_F_PROTOCOL_FEATURES has been negotiated, the ring is
> > initialized in a disabled state and is enabled by
> > VHOST_USER_SET_VRING_ENABLE with parameter 1.
> > """
> >
> > AFAICT virtiofsd follows this requirement.
>
> Hi Michael,
> > You documented the disabled ring state in QEMU commit
> c61f09ed855b5009f816242ce281fd01586d4646 ("vhost-user: clarify start
> and enable") where virtio-net devices discard tx buffers. The disabled
> state seems to be specific to vhost-user and not covered in the VIRTIO
> specification.
>
> Do you remember what the purpose of the disabled state was? Why is it
> necessary to discard tx buffers instead of postponing ring processing
> until the virtqueue is enabled?
>
> My concern is that the semantics are unclear for virtqueue types that
> are different from virtio-net rx/tx. Even the virtio-net controlq
> would be problematic - should buffers be silently discarded with
> VIRTIO_NET_OK or should they fail?
>
> Thanks,
> Stefan
I think I got it now.
This weird state happens when Linux first queues packets on multiple
queues, then changes max queues to 1; the queued packets still need to
be freed eventually.
Yes, I am not sure this can apply to devices or queue types
other than virtio net. Maybe.
When we say:
must process the ring without causing any side effects.
then I think it would be better to say
must process the ring if it can be done without causing
guest visible side effects.
processing rx ring would have a side effect of causing
guest to get malformed buffers, so we don't process it.
processing command queue - we can't fail for sure since
that is guest visible. but practically we don't do this
for cvq.
what should happen for virtiofsd? I don't know -
I am guessing discarding would have a side effect
so should not happen.
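
A rough sketch of the distinction being drawn here, for a net-style tx queue
(the vq_* helpers and types are invented for illustration; the behaviour
follows the spec wording quoted earlier, read with the "no guest-visible side
effects" qualification above):

    #include <stdbool.h>

    enum ring_state { RING_STOPPED, RING_STARTED_DISABLED, RING_STARTED_ENABLED };

    struct vq;                               /* opaque, hypothetical */
    struct desc_chain { unsigned head; };    /* minimal stub */
    bool vq_pop(struct vq *q, struct desc_chain *c);
    void vq_push_used(struct vq *q, const struct desc_chain *c, unsigned len);
    void tx_transmit(const struct desc_chain *c);

    static void handle_tx_kick(struct vq *q, enum ring_state state)
    {
        struct desc_chain c;

        if (state == RING_STOPPED) {
            return;                          /* must not touch the ring at all */
        }

        while (vq_pop(q, &c)) {
            if (state == RING_STARTED_ENABLED) {
                tx_transmit(&c);             /* normal operation: send the packet */
            }
            /*
             * In the started-but-disabled state the packet is simply not
             * sent, but the buffer is still completed below so the guest can
             * reclaim it (e.g. after max queues was reduced). For a
             * request/response queue like virtiofsd's, even that completion
             * would be guest-visible -- which is the open question in this
             * thread.
             */
            vq_push_used(q, &c, 0);
        }
    }
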
> >
> > > This seems like a bug in the vhost-user backend to me.
> >
> > I didn't want to exclude that possibility; that's why I included Eugenio,
> > German, Liu Jiang, and Sergio in the CC list.
> >
> > >
> > > When the virtqueue is disabled, don't monitor the kickfd.
> > >
> > > When the virtqueue transitions from disabled to enabled, the control
> > > plane should self-trigger the kickfd so that any available buffers
> > > will be processed.
> > >
> > > QEMU uses this scheme to switch between vhost/IOThreads and built-in
> > > virtqueue kick processing.
> > >
> > > This approach is more robust than relying on buffers being enqueued after
> > > the virtqueue is enabled.
> >
> > I'm happy to drop the series if the virtiofsd maintainers agree that the
> > bug is in virtiofsd, and can propose a design to fix it. (I do think
> > that such a fix would require an architectural change.)
> >
> > FWIW, my own interpretation of the vhost-user spec (see above) was that
> > virtiofsd was right to behave the way it did, and that there was simply
> > no way to prevent out-of-order delivery other than synchronizing the
> > guest end-to-end with the vhost-user backend, concerning
> > VHOST_USER_SET_VRING_ENABLE.
> >
> > This end-to-end synchronization is present "naturally" in vhost-net,
> > where ioctl()s are automatically synchronous -- in fact *all* operations
> > on the control plane are synchronous. (Which is just a different way to
> > say that the guest is tightly coupled with the control plane.)
> >
> > Note that there has been at least one race like this before; see commit
> > 699f2e535d93 ("vhost: make SET_VRING_ADDR, SET_FEATURES send replies",
> > 2021-09-04). Basically every pre-existent call to enforce_reply() is a
> > cover-up for the vhost-user spec turning (somewhat recklessly?) most
> > operations into async ones.
> >
> > At some point this became apparent and so the REPLY_ACK flag was
> > introduced; see commit ca525ce5618b ("vhost-user: Introduce a new
> > protocol feature REPLY_ACK.", 2016-08-10). (That commit doesn't go into
> > details, but I'm pretty sure there was a similar race around SET_MEM_TABLE!)
> >
> > BTW even if we drop this series for QEMU, I don't think it will have
> > been in vain. The first few patches are cleanups which could be merged
> > for their own sake. And the last patch is essentially the proof of the
> > problem statement / analysis. It can be considered an elaborate bug
> > report for virtiofsd, *if* we decide the bug is in virtiofsd. I did have
> > that avenue in mind as well, when writing the commit message / patch.
> >
> > For now I'm going to post v2 -- that's not to say that I'm dismissing
> > your feedback (see above!), just want to get the latest version on-list.
> >
> > Thanks!
> > Laszlo
> >
> > >
> > > Stefan
> > >
> > >>
> > >> Note that OVMF enables the queue *first*, and sends FUSE_INIT *second*.
> > >> However, if the data plane processor in virtiofsd wins the race, then it
> > >> sees the FUSE_INIT *before* the control plane processor took notice of
> > >> VHOST_USER_SET_VRING_ENABLE and green-lit the queue for the data plane
> > >> processor. Therefore the latter drops FUSE_INIT on the floor, and goes
> > >> back to waiting for further virtio / FUSE requests with epoll_wait.
> > >> Meanwhile OVMF is stuck waiting for the FUSE_INIT response -- a deadlock.
> > >>
> > >> The deadlock is not deterministic. OVMF hangs infrequently during first
> > >> boot. However, OVMF hangs almost certainly during reboots from the UEFI
> > >> shell.
> > >>
> > >> The race can be "reliably masked" by inserting a very small delay -- a
> > >> single debug message -- at the top of "VringEpollHandler::handle_event",
> > >> i.e., just before the data plane processor checks the "enabled" field of
> > >> the vring. That delay suffices for the control plane processor to act upon
> > >> VHOST_USER_SET_VRING_ENABLE.
> > >>
> > >> We can deterministically prevent the race in QEMU, by blocking OVMF inside
> > >> step (1.1) -- i.e., in the write to the "queue_enable" register -- until
> > >> VHOST_USER_SET_VRING_ENABLE actually *completes*. That way OVMF's VCPU
> > >> cannot advance to the FUSE_INIT submission before virtiofsd's control
> > >> plane processor takes notice of the queue being enabled.
> > >>
> > >> Wait for VHOST_USER_SET_VRING_ENABLE completion by:
> > >>
> > >> - setting the NEED_REPLY flag on VHOST_USER_SET_VRING_ENABLE, and waiting
> > >> for the reply, if the VHOST_USER_PROTOCOL_F_REPLY_ACK vhost-user feature
> > >> has been negotiated, or
> > >>
> > >> - performing a separate VHOST_USER_GET_FEATURES *exchange*, which requires
> > >> a backend response regardless of VHOST_USER_PROTOCOL_F_REPLY_ACK.
> > >>
> > >> Cc: "Michael S. Tsirkin" <mst@redhat.com> (supporter:vhost)
> > >> Cc: Eugenio Perez Martin <eperezma@redhat.com>
> > >> Cc: German Maglione <gmaglione@redhat.com>
> > >> Cc: Liu Jiang <gerry@linux.alibaba.com>
> > >> Cc: Sergio Lopez Pascual <slp@redhat.com>
> > >> Cc: Stefano Garzarella <sgarzare@redhat.com>
> > >> Signed-off-by: Laszlo Ersek <lersek@redhat.com>
> > >> ---
> > >> hw/virtio/vhost-user.c | 2 +-
> > >> 1 file changed, 1 insertion(+), 1 deletion(-)
> > >>
> > >> diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
> > >> index beb4b832245e..01e0ca90c538 100644
> > >> --- a/hw/virtio/vhost-user.c
> > >> +++ b/hw/virtio/vhost-user.c
> > >> @@ -1235,7 +1235,7 @@ static int vhost_user_set_vring_enable(struct vhost_dev *dev, int enable)
> > >> .num = enable,
> > >> };
> > >>
> > >> - ret = vhost_set_vring(dev, VHOST_USER_SET_VRING_ENABLE, &state, false);
> > >> + ret = vhost_set_vring(dev, VHOST_USER_SET_VRING_ENABLE, &state, true);
> > >> if (ret < 0) {
> > >> /*
> > >> * Restoring the previous state is likely infeasible, as well as
> > >
> >
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH 7/7] vhost-user: call VHOST_USER_SET_VRING_ENABLE synchronously
2023-10-02 1:56 ` Laszlo Ersek
@ 2023-10-02 6:57 ` Michael S. Tsirkin
2023-10-02 14:02 ` Laszlo Ersek
0 siblings, 1 reply; 58+ messages in thread
From: Michael S. Tsirkin @ 2023-10-02 6:57 UTC (permalink / raw)
To: Laszlo Ersek
Cc: Stefan Hajnoczi, qemu-devel, Eugenio Perez Martin,
German Maglione, Liu Jiang, Sergio Lopez Pascual,
Stefano Garzarella
On Mon, Oct 02, 2023 at 03:56:03AM +0200, Laszlo Ersek wrote:
> On 10/1/23 21:25, Michael S. Tsirkin wrote:
> > Not this actually - v2 of this.
>
> Thank you, but:
>
> - Stefan's question should still be answered IMO (although if you pick
> up this series, then that could be interpreted as "QEMU bug, not spec bug")
>
> - I was supposed to update the commit message on 7/7 in v3; I didn't
> want to do it before Stefan's question was answered
>
> Thanks!
> Laszlo
OK I just answered. I am fine with the patch but I think
virtiofsd should be fixed too.
--
MST
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH 7/7] vhost-user: call VHOST_USER_SET_VRING_ENABLE synchronously
2023-10-02 6:57 ` Michael S. Tsirkin
@ 2023-10-02 14:02 ` Laszlo Ersek
0 siblings, 0 replies; 58+ messages in thread
From: Laszlo Ersek @ 2023-10-02 14:02 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Stefan Hajnoczi, qemu-devel, Eugenio Perez Martin,
German Maglione, Liu Jiang, Sergio Lopez Pascual,
Stefano Garzarella
On 10/2/23 08:57, Michael S. Tsirkin wrote:
> On Mon, Oct 02, 2023 at 03:56:03AM +0200, Laszlo Ersek wrote:
>> On 10/1/23 21:25, Michael S. Tsirkin wrote:
>>> Not this actually - v2 of this.
>>
>> Thank you, but:
>>
>> - Stefan's question should still be answered IMO (although if you pick
>> up this series, then that could be interpreted as "QEMU bug, not spec bug")
>>
>> - I was supposed to update the commit message on 7/7 in v3; I didn't
>> want to do it before Stefan's question was answered
>>
>> Thanks!
>> Laszlo
>
> OK I just answered. I am fine with the patch but I think
> virtiofsd should be fixed too.
Thanks. I'll prepare a v3 with an updated commit message on 7/7 (plus
picking up any new feedback tags).
Cheers
Laszlo
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH 7/7] vhost-user: call VHOST_USER_SET_VRING_ENABLE synchronously
2023-10-02 6:49 ` Michael S. Tsirkin
@ 2023-10-02 21:12 ` Stefan Hajnoczi
2023-10-02 21:13 ` Stefan Hajnoczi
2023-10-02 22:36 ` Michael S. Tsirkin
0 siblings, 2 replies; 58+ messages in thread
From: Stefan Hajnoczi @ 2023-10-02 21:12 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: qemu-devel, Laszlo Ersek, Eugenio Perez Martin, German Maglione,
Liu Jiang, Sergio Lopez Pascual, Stefano Garzarella, Jason Wang
On Mon, 2 Oct 2023 at 02:49, Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Wed, Aug 30, 2023 at 11:37:50AM -0400, Stefan Hajnoczi wrote:
> > On Wed, 30 Aug 2023 at 09:30, Laszlo Ersek <lersek@redhat.com> wrote:
> > >
> > > On 8/30/23 14:10, Stefan Hajnoczi wrote:
> > > > On Sun, 27 Aug 2023 at 14:31, Laszlo Ersek <lersek@redhat.com> wrote:
> > > >>
> > > >> (1) The virtio-1.0 specification
> > > >> <http://docs.oasis-open.org/virtio/virtio/v1.0/virtio-v1.0.html> writes:
> > > >>
> > > >>> 3 General Initialization And Device Operation
> > > >>> 3.1 Device Initialization
> > > >>> 3.1.1 Driver Requirements: Device Initialization
> > > >>>
> > > >>> [...]
> > > >>>
> > > >>> 7. Perform device-specific setup, including discovery of virtqueues for
> > > >>> the device, optional per-bus setup, reading and possibly writing the
> > > >>> device’s virtio configuration space, and population of virtqueues.
> > > >>>
> > > >>> 8. Set the DRIVER_OK status bit. At this point the device is “live”.
> > > >>
> > > >> and
> > > >>
> > > >>> 4 Virtio Transport Options
> > > >>> 4.1 Virtio Over PCI Bus
> > > >>> 4.1.4 Virtio Structure PCI Capabilities
> > > >>> 4.1.4.3 Common configuration structure layout
> > > >>> 4.1.4.3.2 Driver Requirements: Common configuration structure layout
> > > >>>
> > > >>> [...]
> > > >>>
> > > >>> The driver MUST configure the other virtqueue fields before enabling the
> > > >>> virtqueue with queue_enable.
> > > >>>
> > > >>> [...]
> > > >>
> > > >> These together mean that the following sub-sequence of steps is valid for
> > > >> a virtio-1.0 guest driver:
> > > >>
> > > >> (1.1) set "queue_enable" for the needed queues as the final part of device
> > > >> initialization step (7),
> > > >>
> > > >> (1.2) set DRIVER_OK in step (8),
> > > >>
> > > >> (1.3) immediately start sending virtio requests to the device.
> > > >>
> > > >> (2) When vhost-user is enabled, and the VHOST_USER_F_PROTOCOL_FEATURES
> > > >> special virtio feature is negotiated, then virtio rings start in disabled
> > > >> state, according to
> > > >> <https://qemu-project.gitlab.io/qemu/interop/vhost-user.html#ring-states>.
> > > >> In this case, explicit VHOST_USER_SET_VRING_ENABLE messages are needed for
> > > >> enabling vrings.
> > > >>
> > > >> Therefore setting "queue_enable" from the guest (1.1) is a *control plane*
> > > >> operation, which travels from the guest through QEMU to the vhost-user
> > > >> backend, using a unix domain socket.
> > > >>
> > > >> Whereas sending a virtio request (1.3) is a *data plane* operation, which
> > > >> evades QEMU -- it travels from guest to the vhost-user backend via
> > > >> eventfd.
> > > >>
> > > >> This means that steps (1.1) and (1.3) travel through different channels,
> > > >> and their relative order can be reversed, as perceived by the vhost-user
> > > >> backend.
> > > >>
> > > >> That's exactly what happens when OVMF's virtiofs driver (VirtioFsDxe) runs
> > > >> against the Rust-language virtiofsd version 1.7.2. (Which uses version
> > > >> 0.10.1 of the vhost-user-backend crate, and version 0.8.1 of the vhost
> > > >> crate.)
> > > >>
> > > >> Namely, when VirtioFsDxe binds a virtiofs device, it goes through the
> > > >> device initialization steps (i.e., control plane operations), and
> > > >> immediately sends a FUSE_INIT request too (i.e., performs a data plane
> > > >> operation). In the Rust-language virtiofsd, this creates a race between
> > > >> two components that run *concurrently*, i.e., in different threads or
> > > >> processes:
> > > >>
> > > >> - Control plane, handling vhost-user protocol messages:
> > > >>
> > > >> The "VhostUserSlaveReqHandlerMut::set_vring_enable" method
> > > >> [crates/vhost-user-backend/src/handler.rs] handles
> > > >> VHOST_USER_SET_VRING_ENABLE messages, and updates each vring's "enabled"
> > > >> flag according to the message processed.
> > > >>
> > > >> - Data plane, handling virtio / FUSE requests:
> > > >>
> > > >> The "VringEpollHandler::handle_event" method
> > > >> [crates/vhost-user-backend/src/event_loop.rs] handles the incoming
> > > >> virtio / FUSE request, consuming the virtio kick at the same time. If
> > > >> the vring's "enabled" flag is set, the virtio / FUSE request is
> > > >> processed genuinely. If the vring's "enabled" flag is clear, then the
> > > >> virtio / FUSE request is discarded.
> > > >
> > > > Why is virtiofsd monitoring the virtqueue and discarding requests
> > > > while it's disabled?
> > >
> > > That's what the vhost-user spec requires:
> > >
> > > https://qemu-project.gitlab.io/qemu/interop/vhost-user.html#ring-states
> > >
> > > """
> > > started but disabled: the back-end must process the ring without causing
> > > any side effects. For example, for a networking device, in the disabled
> > > state the back-end must not supply any new RX packets, but must process
> > > and discard any TX packets.
> > > """
> > >
> > > This state is different from "stopped", where "the back-end must not
> > > process the ring at all".
> > >
> > > The spec also says,
> > >
> > > """
> > > If VHOST_USER_F_PROTOCOL_FEATURES has been negotiated, the ring is
> > > initialized in a disabled state and is enabled by
> > > VHOST_USER_SET_VRING_ENABLE with parameter 1.
> > > """
> > >
> > > AFAICT virtiofsd follows this requirement.
> >
> > Hi Michael,
> > > You documented the disabled ring state in QEMU commit
> > c61f09ed855b5009f816242ce281fd01586d4646 ("vhost-user: clarify start
> > and enable") where virtio-net devices discard tx buffers. The disabled
> > state seems to be specific to vhost-user and not covered in the VIRTIO
> > specification.
> >
> > Do you remember what the purpose of the disabled state was? Why is it
> > necessary to discard tx buffers instead of postponing ring processing
> > until the virtqueue is enabled?
> >
> > My concern is that the semantics are unclear for virtqueue types that
> > are different from virtio-net rx/tx. Even the virtio-net controlq
> > would be problematic - should buffers be silently discarded with
> > VIRTIO_NET_OK or should they fail?
> >
> > Thanks,
> > Stefan
>
> I think I got it now.
> This weird state happens when Linux first queues packets on multiple
> queues, then changes max queues to 1; the queued packets still need to
> be freed eventually.
Can you explain what is happening in the guest driver, QEMU, and the
vhost-user-net device in more detail? I don't understand the scenario.
> Yes, I am not sure this can apply to devices or queue types
> other than virtio net. Maybe.
>
> When we say:
> must process the ring without causing any side effects.
> then I think it would be better to say
> must process the ring if it can be done without causing
> guest visible side effects.
Completing a tx buffer is guest-visible, so I'm confused by this statement.
> processing rx ring would have a side effect of causing
> guest to get malformed buffers, so we don't process it.
Why are they malformed? Do you mean the rx buffers are stale (the
guest driver has changed the number of queues and doesn't expect to
receive them anymore)?
> processing command queue - we can't fail for sure since
> that is guest visible. but practically we don't do this
> for cvq.
>
> what should happen for virtiofsd? I don't know -
> I am guessing discarding would have a side effect
> so should not happen.
>
>
>
>
> > >
> > > > This seems like a bug in the vhost-user backend to me.
> > >
> > > I didn't want to exclude that possibility; that's why I included Eugenio,
> > > German, Liu Jiang, and Sergio in the CC list.
> > >
> > > >
> > > > When the virtqueue is disabled, don't monitor the kickfd.
> > > >
> > > > When the virtqueue transitions from disabled to enabled, the control
> > > > plane should self-trigger the kickfd so that any available buffers
> > > > will be processed.
> > > >
> > > > QEMU uses this scheme to switch between vhost/IOThreads and built-in
> > > > virtqueue kick processing.
> > > >
> > > > This approach is more robust than relying on buffers being enqueued after
> > > > the virtqueue is enabled.
> > >
> > > I'm happy to drop the series if the virtiofsd maintainers agree that the
> > > bug is in virtiofsd, and can propose a design to fix it. (I do think
> > > that such a fix would require an architectural change.)
> > >
> > > FWIW, my own interpretation of the vhost-user spec (see above) was that
> > > virtiofsd was right to behave the way it did, and that there was simply
> > > no way to prevent out-of-order delivery other than synchronizing the
> > > guest end-to-end with the vhost-user backend, concerning
> > > VHOST_USER_SET_VRING_ENABLE.
> > >
> > > This end-to-end synchronization is present "naturally" in vhost-net,
> > > where ioctl()s are automatically synchronous -- in fact *all* operations
> > > on the control plane are synchronous. (Which is just a different way to
> > > say that the guest is tightly coupled with the control plane.)
> > >
> > > Note that there has been at least one race like this before; see commit
> > > 699f2e535d93 ("vhost: make SET_VRING_ADDR, SET_FEATURES send replies",
> > > 2021-09-04). Basically every pre-existent call to enforce_reply() is a
> > > cover-up for the vhost-user spec turning (somewhat recklessly?) most
> > > operations into async ones.
> > >
> > > At some point this became apparent and so the REPLY_ACK flag was
> > > introduced; see commit ca525ce5618b ("vhost-user: Introduce a new
> > > protocol feature REPLY_ACK.", 2016-08-10). (That commit doesn't go into
> > > details, but I'm pretty sure there was a similar race around SET_MEM_TABLE!)
> > >
> > > BTW even if we drop this series for QEMU, I don't think it will have
> > > been in vain. The first few patches are cleanups which could be merged
> > > for their own sake. And the last patch is essentially the proof of the
> > > problem statement / analysis. It can be considered an elaborate bug
> > > report for virtiofsd, *if* we decide the bug is in virtiofsd. I did have
> > > that avenue in mind as well, when writing the commit message / patch.
> > >
> > > For now I'm going to post v2 -- that's not to say that I'm dismissing
> > > your feedback (see above!), just want to get the latest version on-list.
> > >
> > > Thanks!
> > > Laszlo
> > >
> > > >
> > > > Stefan
> > > >
> > > >>
> > > >> Note that OVMF enables the queue *first*, and sends FUSE_INIT *second*.
> > > >> However, if the data plane processor in virtiofsd wins the race, then it
> > > >> sees the FUSE_INIT *before* the control plane processor took notice of
> > > >> VHOST_USER_SET_VRING_ENABLE and green-lit the queue for the data plane
> > > >> processor. Therefore the latter drops FUSE_INIT on the floor, and goes
> > > >> back to waiting for further virtio / FUSE requests with epoll_wait.
> > > >> Meanwhile OVMF is stuck waiting for the FUSE_INIT response -- a deadlock.
> > > >>
> > > >> The deadlock is not deterministic. OVMF hangs infrequently during first
> > > >> boot. However, OVMF hangs almost certainly during reboots from the UEFI
> > > >> shell.
> > > >>
> > > >> The race can be "reliably masked" by inserting a very small delay -- a
> > > >> single debug message -- at the top of "VringEpollHandler::handle_event",
> > > >> i.e., just before the data plane processor checks the "enabled" field of
> > > >> the vring. That delay suffices for the control plane processor to act upon
> > > >> VHOST_USER_SET_VRING_ENABLE.
> > > >>
> > > >> We can deterministically prevent the race in QEMU, by blocking OVMF inside
> > > >> step (1.1) -- i.e., in the write to the "queue_enable" register -- until
> > > >> VHOST_USER_SET_VRING_ENABLE actually *completes*. That way OVMF's VCPU
> > > >> cannot advance to the FUSE_INIT submission before virtiofsd's control
> > > >> plane processor takes notice of the queue being enabled.
> > > >>
> > > >> Wait for VHOST_USER_SET_VRING_ENABLE completion by:
> > > >>
> > > >> - setting the NEED_REPLY flag on VHOST_USER_SET_VRING_ENABLE, and waiting
> > > >> for the reply, if the VHOST_USER_PROTOCOL_F_REPLY_ACK vhost-user feature
> > > >> has been negotiated, or
> > > >>
> > > >> - performing a separate VHOST_USER_GET_FEATURES *exchange*, which requires
> > > >> a backend response regardless of VHOST_USER_PROTOCOL_F_REPLY_ACK.
> > > >>
> > > >> Cc: "Michael S. Tsirkin" <mst@redhat.com> (supporter:vhost)
> > > >> Cc: Eugenio Perez Martin <eperezma@redhat.com>
> > > >> Cc: German Maglione <gmaglione@redhat.com>
> > > >> Cc: Liu Jiang <gerry@linux.alibaba.com>
> > > >> Cc: Sergio Lopez Pascual <slp@redhat.com>
> > > >> Cc: Stefano Garzarella <sgarzare@redhat.com>
> > > >> Signed-off-by: Laszlo Ersek <lersek@redhat.com>
> > > >> ---
> > > >> hw/virtio/vhost-user.c | 2 +-
> > > >> 1 file changed, 1 insertion(+), 1 deletion(-)
> > > >>
> > > >> diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
> > > >> index beb4b832245e..01e0ca90c538 100644
> > > >> --- a/hw/virtio/vhost-user.c
> > > >> +++ b/hw/virtio/vhost-user.c
> > > >> @@ -1235,7 +1235,7 @@ static int vhost_user_set_vring_enable(struct vhost_dev *dev, int enable)
> > > >> .num = enable,
> > > >> };
> > > >>
> > > >> - ret = vhost_set_vring(dev, VHOST_USER_SET_VRING_ENABLE, &state, false);
> > > >> + ret = vhost_set_vring(dev, VHOST_USER_SET_VRING_ENABLE, &state, true);
> > > >> if (ret < 0) {
> > > >> /*
> > > >> * Restoring the previous state is likely infeasible, as well as
> > > >
> > >
>
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH 7/7] vhost-user: call VHOST_USER_SET_VRING_ENABLE synchronously
2023-10-02 21:12 ` Stefan Hajnoczi
@ 2023-10-02 21:13 ` Stefan Hajnoczi
2023-10-03 12:26 ` Michael S. Tsirkin
2023-10-02 22:36 ` Michael S. Tsirkin
1 sibling, 1 reply; 58+ messages in thread
From: Stefan Hajnoczi @ 2023-10-02 21:13 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: qemu-devel, Laszlo Ersek, Eugenio Perez Martin, German Maglione,
Liu Jiang, Sergio Lopez Pascual, Stefano Garzarella, Jason Wang
One more question:
Why is the disabled state not needed by regular (non-vhost) virtio-net devices?
On Mon, 2 Oct 2023 at 17:12, Stefan Hajnoczi <stefanha@gmail.com> wrote:
>
> On Mon, 2 Oct 2023 at 02:49, Michael S. Tsirkin <mst@redhat.com> wrote:
> >
> > On Wed, Aug 30, 2023 at 11:37:50AM -0400, Stefan Hajnoczi wrote:
> > > On Wed, 30 Aug 2023 at 09:30, Laszlo Ersek <lersek@redhat.com> wrote:
> > > >
> > > > On 8/30/23 14:10, Stefan Hajnoczi wrote:
> > > > > On Sun, 27 Aug 2023 at 14:31, Laszlo Ersek <lersek@redhat.com> wrote:
> > > > >>
> > > > >> (1) The virtio-1.0 specification
> > > > >> <http://docs.oasis-open.org/virtio/virtio/v1.0/virtio-v1.0.html> writes:
> > > > >>
> > > > >>> 3 General Initialization And Device Operation
> > > > >>> 3.1 Device Initialization
> > > > >>> 3.1.1 Driver Requirements: Device Initialization
> > > > >>>
> > > > >>> [...]
> > > > >>>
> > > > >>> 7. Perform device-specific setup, including discovery of virtqueues for
> > > > >>> the device, optional per-bus setup, reading and possibly writing the
> > > > >>> device’s virtio configuration space, and population of virtqueues.
> > > > >>>
> > > > >>> 8. Set the DRIVER_OK status bit. At this point the device is “live”.
> > > > >>
> > > > >> and
> > > > >>
> > > > >>> 4 Virtio Transport Options
> > > > >>> 4.1 Virtio Over PCI Bus
> > > > >>> 4.1.4 Virtio Structure PCI Capabilities
> > > > >>> 4.1.4.3 Common configuration structure layout
> > > > >>> 4.1.4.3.2 Driver Requirements: Common configuration structure layout
> > > > >>>
> > > > >>> [...]
> > > > >>>
> > > > >>> The driver MUST configure the other virtqueue fields before enabling the
> > > > >>> virtqueue with queue_enable.
> > > > >>>
> > > > >>> [...]
> > > > >>
> > > > >> These together mean that the following sub-sequence of steps is valid for
> > > > >> a virtio-1.0 guest driver:
> > > > >>
> > > > >> (1.1) set "queue_enable" for the needed queues as the final part of device
> > > > >> initialization step (7),
> > > > >>
> > > > >> (1.2) set DRIVER_OK in step (8),
> > > > >>
> > > > >> (1.3) immediately start sending virtio requests to the device.
> > > > >>
> > > > >> (2) When vhost-user is enabled, and the VHOST_USER_F_PROTOCOL_FEATURES
> > > > >> special virtio feature is negotiated, then virtio rings start in disabled
> > > > >> state, according to
> > > > >> <https://qemu-project.gitlab.io/qemu/interop/vhost-user.html#ring-states>.
> > > > >> In this case, explicit VHOST_USER_SET_VRING_ENABLE messages are needed for
> > > > >> enabling vrings.
> > > > >>
> > > > >> Therefore setting "queue_enable" from the guest (1.1) is a *control plane*
> > > > >> operation, which travels from the guest through QEMU to the vhost-user
> > > > >> backend, using a unix domain socket.
> > > > >>
> > > > >> Whereas sending a virtio request (1.3) is a *data plane* operation, which
> > > > >> evades QEMU -- it travels from guest to the vhost-user backend via
> > > > >> eventfd.
> > > > >>
> > > > >> This means that steps (1.1) and (1.3) travel through different channels,
> > > > >> and their relative order can be reversed, as perceived by the vhost-user
> > > > >> backend.
> > > > >>
> > > > >> That's exactly what happens when OVMF's virtiofs driver (VirtioFsDxe) runs
> > > > >> against the Rust-language virtiofsd version 1.7.2. (Which uses version
> > > > >> 0.10.1 of the vhost-user-backend crate, and version 0.8.1 of the vhost
> > > > >> crate.)
> > > > >>
> > > > >> Namely, when VirtioFsDxe binds a virtiofs device, it goes through the
> > > > >> device initialization steps (i.e., control plane operations), and
> > > > >> immediately sends a FUSE_INIT request too (i.e., performs a data plane
> > > > >> operation). In the Rust-language virtiofsd, this creates a race between
> > > > >> two components that run *concurrently*, i.e., in different threads or
> > > > >> processes:
> > > > >>
> > > > >> - Control plane, handling vhost-user protocol messages:
> > > > >>
> > > > >> The "VhostUserSlaveReqHandlerMut::set_vring_enable" method
> > > > >> [crates/vhost-user-backend/src/handler.rs] handles
> > > > >> VHOST_USER_SET_VRING_ENABLE messages, and updates each vring's "enabled"
> > > > >> flag according to the message processed.
> > > > >>
> > > > >> - Data plane, handling virtio / FUSE requests:
> > > > >>
> > > > >> The "VringEpollHandler::handle_event" method
> > > > >> [crates/vhost-user-backend/src/event_loop.rs] handles the incoming
> > > > >> virtio / FUSE request, consuming the virtio kick at the same time. If
> > > > >> the vring's "enabled" flag is set, the virtio / FUSE request is
> > > > >> processed genuinely. If the vring's "enabled" flag is clear, then the
> > > > >> virtio / FUSE request is discarded.
> > > > >
> > > > > Why is virtiofsd monitoring the virtqueue and discarding requests
> > > > > while it's disabled?
> > > >
> > > > That's what the vhost-user spec requires:
> > > >
> > > > https://qemu-project.gitlab.io/qemu/interop/vhost-user.html#ring-states
> > > >
> > > > """
> > > > started but disabled: the back-end must process the ring without causing
> > > > any side effects. For example, for a networking device, in the disabled
> > > > state the back-end must not supply any new RX packets, but must process
> > > > and discard any TX packets.
> > > > """
> > > >
> > > > This state is different from "stopped", where "the back-end must not
> > > > process the ring at all".
> > > >
> > > > The spec also says,
> > > >
> > > > """
> > > > If VHOST_USER_F_PROTOCOL_FEATURES has been negotiated, the ring is
> > > > initialized in a disabled state and is enabled by
> > > > VHOST_USER_SET_VRING_ENABLE with parameter 1.
> > > > """
> > > >
> > > > AFAICT virtiofsd follows this requirement.
> > >
> > > Hi Michael,
> > > > You documented the disabled ring state in QEMU commit
> > > c61f09ed855b5009f816242ce281fd01586d4646 ("vhost-user: clarify start
> > > and enable") where virtio-net devices discard tx buffers. The disabled
> > > state seems to be specific to vhost-user and not covered in the VIRTIO
> > > specification.
> > >
> > > Do you remember what the purpose of the disabled state was? Why is it
> > > necessary to discard tx buffers instead of postponing ring processing
> > > until the virtqueue is enabled?
> > >
> > > My concern is that the semantics are unclear for virtqueue types that
> > > are different from virtio-net rx/tx. Even the virtio-net controlq
> > > would be problematic - should buffers be silently discarded with
> > > VIRTIO_NET_OK or should they fail?
> > >
> > > Thanks,
> > > Stefan
> >
> > I think I got it now.
> > This weird state happens when Linux first queues packets on multiple
> > queues, then changes max queues to 1; the queued packets still need to
> > be freed eventually.
>
> Can you explain what is happening in the guest driver, QEMU, and the
> vhost-user-net device in more detail? I don't understand the scenario.
>
> > Yes, I am not sure this can apply to devices or queue types
> > other than virtio net. Maybe.
> >
> > When we say:
> > must process the ring without causing any side effects.
> > then I think it would be better to say
> > must process the ring if it can be done without causing
> > guest visible side effects.
>
> Completing a tx buffer is guest-visible, so I'm confused by this statement.
>
> > processing rx ring would have a side effect of causing
> > guest to get malformed buffers, so we don't process it.
>
> Why are they malformed? Do you mean the rx buffers are stale (the
> guest driver has changed the number of queues and doesn't expect to
> receive them anymore)?
>
> > processing command queue - we can't fail for sure since
> > that is guest visible. but practically we don't do this
> > for cvq.
> >
> > what should happen for virtiofsd? I don't know -
> > I am guessing discarding would have a side effect
> > so should not happen.
> >
> >
> >
> >
> > > >
> > > > > This seems like a bug in the vhost-user backend to me.
> > > >
> > > > I didn't want to exclude that possibility; that's why I included Eugenio,
> > > > German, Liu Jiang, and Sergio in the CC list.
> > > >
> > > > >
> > > > > When the virtqueue is disabled, don't monitor the kickfd.
> > > > >
> > > > > When the virtqueue transitions from disabled to enabled, the control
> > > > > plane should self-trigger the kickfd so that any available buffers
> > > > > will be processed.
> > > > >
> > > > > QEMU uses this scheme to switch between vhost/IOThreads and built-in
> > > > > virtqueue kick processing.
> > > > >
> > > > > This approach is more robust than relying on buffers being enqueued after
> > > > > the virtqueue is enabled.
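To make the suggested scheme concrete, here is a minimal C sketch, assuming
an eventfd-based kickfd; the struct and the epoll helpers below are
hypothetical names, not the actual virtiofsd or QEMU API.

#include <stdbool.h>
#include <stdint.h>
#include <unistd.h>

struct vring_ctx {
    int kick_fd;                       /* eventfd the guest driver kicks */
    bool enabled;
};

void epoll_add_kickfd(struct vring_ctx *vq);   /* placeholder */
void epoll_del_kickfd(struct vring_ctx *vq);   /* placeholder */

void vring_set_enabled(struct vring_ctx *vq, bool enable)
{
    if (enable && !vq->enabled) {
        vq->enabled = true;
        epoll_add_kickfd(vq);          /* resume monitoring the kickfd */
        /* Self-trigger: pretend a kick arrived, so buffers that became
         * available while the ring was disabled are processed now. */
        uint64_t one = 1;
        (void)write(vq->kick_fd, &one, sizeof(one));
    } else if (!enable && vq->enabled) {
        vq->enabled = false;
        epoll_del_kickfd(vq);          /* stop monitoring while disabled */
    }
}

Writing an 8-byte value is the standard way to raise an eventfd, so the event
loop wakes up exactly as if the guest had kicked the ring.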
> > > >
> > > > I'm happy to drop the series if the virtiofsd maintainers agree that the
> > > > bug is in virtiofsd, and can propose a design to fix it. (I do think
> > > > that such a fix would require an architectural change.)
> > > >
> > > > FWIW, my own interpretation of the vhost-user spec (see above) was that
> > > > virtiofsd was right to behave the way it did, and that there was simply
> > > > no way to prevent out-of-order delivery other than synchronizing the
> > > > guest end-to-end with the vhost-user backend, concerning
> > > > VHOST_USER_SET_VRING_ENABLE.
> > > >
> > > > This end-to-end synchronization is present "naturally" in vhost-net,
> > > > where ioctl()s are automatically synchronous -- in fact *all* operations
> > > > on the control plane are synchronous. (Which is just a different way to
> > > > say that the guest is tightly coupled with the control plane.)
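As a small illustration of that contrast (vhost_fd, sock_fd and the message
buffer here are placeholders):

#include <linux/vhost.h>
#include <sys/ioctl.h>
#include <unistd.h>

void contrast(int vhost_fd, int sock_fd, const void *msg, size_t msg_len)
{
    struct vhost_vring_state state = { .index = 0, .num = 0 };

    /* vhost-net: the ioctl() returns only after the kernel has acted on
     * it, so every control-plane operation is synchronous by construction. */
    ioctl(vhost_fd, VHOST_SET_VRING_BASE, &state);

    /* vhost-user: write() returns as soon as the message is queued on the
     * unix domain socket; without NEED_REPLY or a reply-bearing request,
     * the back-end may not have processed it yet. */
    write(sock_fd, msg, msg_len);
}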
> > > >
> > > > Note that there has been at least one race like this before; see commit
> > > > 699f2e535d93 ("vhost: make SET_VRING_ADDR, SET_FEATURES send replies",
> > > > 2021-09-04). Basically every pre-existent call to enforce_reply() is a
> > > > cover-up for the vhost-user spec turning (somewhat recklessly?) most
> > > > operations into async ones.
> > > >
> > > > At some point this became apparent and so the REPLY_ACK flag was
> > > > introduced; see commit ca525ce5618b ("vhost-user: Introduce a new
> > > > protocol feature REPLY_ACK.", 2016-08-10). (That commit doesn't go into
> > > > details, but I'm pretty sure there was a similar race around SET_MEM_TABLE!)
> > > >
> > > > BTW even if we drop this series for QEMU, I don't think it will have
> > > > been in vain. The first few patches are cleanups which could be merged
> > > > for their own sake. And the last patch is essentially the proof of the
> > > > problem statement / analysis. It can be considered an elaborate bug
> > > > report for virtiofsd, *if* we decide the bug is in virtiofsd. I did have
> > > > that avenue in mind as well, when writing the commit message / patch.
> > > >
> > > > For now I'm going to post v2 -- that's not to say that I'm dismissing
> > > > your feedback (see above!), just want to get the latest version on-list.
> > > >
> > > > Thanks!
> > > > Laszlo
> > > >
> > > > >
> > > > > Stefan
> > > > >
> > > > >>
> > > > >> Note that OVMF enables the queue *first*, and sends FUSE_INIT *second*.
> > > > >> However, if the data plane processor in virtiofsd wins the race, then it
> > > > >> sees the FUSE_INIT *before* the control plane processor took notice of
> > > > >> VHOST_USER_SET_VRING_ENABLE and green-lit the queue for the data plane
> > > > >> processor. Therefore the latter drops FUSE_INIT on the floor, and goes
> > > > >> back to waiting for further virtio / FUSE requests with epoll_wait.
> > > > >> Meanwhile OVMF is stuck waiting for the FUSE_INIT response -- a deadlock.
> > > > >>
> > > > >> The deadlock is not deterministic. OVMF hangs infrequently during first
> > > > >> boot. However, OVMF hangs almost certainly during reboots from the UEFI
> > > > >> shell.
> > > > >>
> > > > >> The race can be "reliably masked" by inserting a very small delay -- a
> > > > >> single debug message -- at the top of "VringEpollHandler::handle_event",
> > > > >> i.e., just before the data plane processor checks the "enabled" field of
> > > > >> the vring. That delay suffices for the control plane processor to act upon
> > > > >> VHOST_USER_SET_VRING_ENABLE.
> > > > >>
> > > > >> We can deterministically prevent the race in QEMU, by blocking OVMF inside
> > > > >> step (1.1) -- i.e., in the write to the "queue_enable" register -- until
> > > > >> VHOST_USER_SET_VRING_ENABLE actually *completes*. That way OVMF's VCPU
> > > > >> cannot advance to the FUSE_INIT submission before virtiofsd's control
> > > > >> plane processor takes notice of the queue being enabled.
> > > > >>
> > > > >> Wait for VHOST_USER_SET_VRING_ENABLE completion by:
> > > > >>
> > > > >> - setting the NEED_REPLY flag on VHOST_USER_SET_VRING_ENABLE, and waiting
> > > > >> for the reply, if the VHOST_USER_PROTOCOL_F_REPLY_ACK vhost-user feature
> > > > >> has been negotiated, or
> > > > >>
> > > > >> - performing a separate VHOST_USER_GET_FEATURES *exchange*, which requires
> > > > >> a backend response regardless of VHOST_USER_PROTOCOL_F_REPLY_ACK.
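A conceptual C sketch of that completion-wait; every type and helper here is
a placeholder rather than the actual hw/virtio/vhost-user.c code, and only
the NEED_REPLY header flag (bit 3) is taken from the vhost-user spec.

#include <stdbool.h>

struct vu_dev;                            /* placeholder device handle */
struct vu_msg { unsigned flags; };        /* placeholder message */

struct vu_msg make_set_vring_enable(unsigned idx, bool enable);
bool has_reply_ack(struct vu_dev *d);     /* VHOST_USER_PROTOCOL_F_REPLY_ACK? */
int send_msg(struct vu_dev *d, struct vu_msg *m);
int wait_for_reply_ack(struct vu_dev *d, struct vu_msg *m);
int get_features_roundtrip(struct vu_dev *d);

int set_vring_enable_sync(struct vu_dev *d, unsigned idx, bool enable)
{
    struct vu_msg m = make_set_vring_enable(idx, enable);
    int ret;

    if (has_reply_ack(d)) {
        /* Option 1: NEED_REPLY + REPLY_ACK -- block until the back-end
         * explicitly acknowledges the enable. */
        m.flags |= 1u << 3;               /* NEED_REPLY header flag */
        ret = send_msg(d, &m);
        return ret < 0 ? ret : wait_for_reply_ack(d, &m);
    }

    /* Option 2: no REPLY_ACK.  Messages on the vhost-user socket are
     * processed in order, so a GET_FEATURES exchange (which always has a
     * response) cannot complete before the enable has been processed. */
    ret = send_msg(d, &m);
    return ret < 0 ? ret : get_features_roundtrip(d);
}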
> > > > >>
> > > > >> Cc: "Michael S. Tsirkin" <mst@redhat.com> (supporter:vhost)
> > > > >> Cc: Eugenio Perez Martin <eperezma@redhat.com>
> > > > >> Cc: German Maglione <gmaglione@redhat.com>
> > > > >> Cc: Liu Jiang <gerry@linux.alibaba.com>
> > > > >> Cc: Sergio Lopez Pascual <slp@redhat.com>
> > > > >> Cc: Stefano Garzarella <sgarzare@redhat.com>
> > > > >> Signed-off-by: Laszlo Ersek <lersek@redhat.com>
> > > > >> ---
> > > > >> hw/virtio/vhost-user.c | 2 +-
> > > > >> 1 file changed, 1 insertion(+), 1 deletion(-)
> > > > >>
> > > > >> diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
> > > > >> index beb4b832245e..01e0ca90c538 100644
> > > > >> --- a/hw/virtio/vhost-user.c
> > > > >> +++ b/hw/virtio/vhost-user.c
> > > > >> @@ -1235,7 +1235,7 @@ static int vhost_user_set_vring_enable(struct vhost_dev *dev, int enable)
> > > > >> .num = enable,
> > > > >> };
> > > > >>
> > > > >> - ret = vhost_set_vring(dev, VHOST_USER_SET_VRING_ENABLE, &state, false);
> > > > >> + ret = vhost_set_vring(dev, VHOST_USER_SET_VRING_ENABLE, &state, true);
> > > > >> if (ret < 0) {
> > > > >> /*
> > > > >> * Restoring the previous state is likely infeasible, as well as
> > > > >
> > > >
> >
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH 7/7] vhost-user: call VHOST_USER_SET_VRING_ENABLE synchronously
2023-10-02 21:12 ` Stefan Hajnoczi
2023-10-02 21:13 ` Stefan Hajnoczi
@ 2023-10-02 22:36 ` Michael S. Tsirkin
2023-10-03 0:17 ` Stefan Hajnoczi
1 sibling, 1 reply; 58+ messages in thread
From: Michael S. Tsirkin @ 2023-10-02 22:36 UTC (permalink / raw)
To: Stefan Hajnoczi
Cc: qemu-devel, Laszlo Ersek, Eugenio Perez Martin, German Maglione,
Liu Jiang, Sergio Lopez Pascual, Stefano Garzarella, Jason Wang
On Mon, Oct 02, 2023 at 05:12:27PM -0400, Stefan Hajnoczi wrote:
> On Mon, 2 Oct 2023 at 02:49, Michael S. Tsirkin <mst@redhat.com> wrote:
> >
> > On Wed, Aug 30, 2023 at 11:37:50AM -0400, Stefan Hajnoczi wrote:
> > > On Wed, 30 Aug 2023 at 09:30, Laszlo Ersek <lersek@redhat.com> wrote:
> > > >
> > > > On 8/30/23 14:10, Stefan Hajnoczi wrote:
> > > > > On Sun, 27 Aug 2023 at 14:31, Laszlo Ersek <lersek@redhat.com> wrote:
> > > > >>
> > > > >> (1) The virtio-1.0 specification
> > > > >> <http://docs.oasis-open.org/virtio/virtio/v1.0/virtio-v1.0.html> writes:
> > > > >>
> > > > >>> 3 General Initialization And Device Operation
> > > > >>> 3.1 Device Initialization
> > > > >>> 3.1.1 Driver Requirements: Device Initialization
> > > > >>>
> > > > >>> [...]
> > > > >>>
> > > > >>> 7. Perform device-specific setup, including discovery of virtqueues for
> > > > >>> the device, optional per-bus setup, reading and possibly writing the
> > > > >>> device’s virtio configuration space, and population of virtqueues.
> > > > >>>
> > > > >>> 8. Set the DRIVER_OK status bit. At this point the device is “live”.
> > > > >>
> > > > >> and
> > > > >>
> > > > >>> 4 Virtio Transport Options
> > > > >>> 4.1 Virtio Over PCI Bus
> > > > >>> 4.1.4 Virtio Structure PCI Capabilities
> > > > >>> 4.1.4.3 Common configuration structure layout
> > > > >>> 4.1.4.3.2 Driver Requirements: Common configuration structure layout
> > > > >>>
> > > > >>> [...]
> > > > >>>
> > > > >>> The driver MUST configure the other virtqueue fields before enabling the
> > > > >>> virtqueue with queue_enable.
> > > > >>>
> > > > >>> [...]
> > > > >>
> > > > >> These together mean that the following sub-sequence of steps is valid for
> > > > >> a virtio-1.0 guest driver:
> > > > >>
> > > > >> (1.1) set "queue_enable" for the needed queues as the final part of device
> > > > >> initialization step (7),
> > > > >>
> > > > >> (1.2) set DRIVER_OK in step (8),
> > > > >>
> > > > >> (1.3) immediately start sending virtio requests to the device.
> > > > >>
> > > > >> (2) When vhost-user is enabled, and the VHOST_USER_F_PROTOCOL_FEATURES
> > > > >> special virtio feature is negotiated, then virtio rings start in disabled
> > > > >> state, according to
> > > > >> <https://qemu-project.gitlab.io/qemu/interop/vhost-user.html#ring-states>.
> > > > >> In this case, explicit VHOST_USER_SET_VRING_ENABLE messages are needed for
> > > > >> enabling vrings.
> > > > >>
> > > > >> Therefore setting "queue_enable" from the guest (1.1) is a *control plane*
> > > > >> operation, which travels from the guest through QEMU to the vhost-user
> > > > >> backend, using a unix domain socket.
> > > > >>
> > > > >> Whereas sending a virtio request (1.3) is a *data plane* operation, which
> > > > >> evades QEMU -- it travels from guest to the vhost-user backend via
> > > > >> eventfd.
> > > > >>
> > > > >> This means that steps (1.1) and (1.3) travel through different channels,
> > > > >> and their relative order can be reversed, as perceived by the vhost-user
> > > > >> backend.
> > > > >>
> > > > >> That's exactly what happens when OVMF's virtiofs driver (VirtioFsDxe) runs
> > > > >> against the Rust-language virtiofsd version 1.7.2. (Which uses version
> > > > >> 0.10.1 of the vhost-user-backend crate, and version 0.8.1 of the vhost
> > > > >> crate.)
> > > > >>
> > > > >> Namely, when VirtioFsDxe binds a virtiofs device, it goes through the
> > > > >> device initialization steps (i.e., control plane operations), and
> > > > >> immediately sends a FUSE_INIT request too (i.e., performs a data plane
> > > > >> operation). In the Rust-language virtiofsd, this creates a race between
> > > > >> two components that run *concurrently*, i.e., in different threads or
> > > > >> processes:
> > > > >>
> > > > >> - Control plane, handling vhost-user protocol messages:
> > > > >>
> > > > >> The "VhostUserSlaveReqHandlerMut::set_vring_enable" method
> > > > >> [crates/vhost-user-backend/src/handler.rs] handles
> > > > >> VHOST_USER_SET_VRING_ENABLE messages, and updates each vring's "enabled"
> > > > >> flag according to the message processed.
> > > > >>
> > > > >> - Data plane, handling virtio / FUSE requests:
> > > > >>
> > > > >> The "VringEpollHandler::handle_event" method
> > > > >> [crates/vhost-user-backend/src/event_loop.rs] handles the incoming
> > > > >> virtio / FUSE request, consuming the virtio kick at the same time. If
> > > > >> the vring's "enabled" flag is set, the virtio / FUSE request is
> > > > >> processed genuinely. If the vring's "enabled" flag is clear, then the
> > > > >> virtio / FUSE request is discarded.
> > > > >
> > > > > Why is virtiofsd monitoring the virtqueue and discarding requests
> > > > > while it's disabled?
> > > >
> > > > That's what the vhost-user spec requires:
> > > >
> > > > https://qemu-project.gitlab.io/qemu/interop/vhost-user.html#ring-states
> > > >
> > > > """
> > > > started but disabled: the back-end must process the ring without causing
> > > > any side effects. For example, for a networking device, in the disabled
> > > > state the back-end must not supply any new RX packets, but must process
> > > > and discard any TX packets.
> > > > """
> > > >
> > > > This state is different from "stopped", where "the back-end must not
> > > > process the ring at all".
> > > >
> > > > The spec also says,
> > > >
> > > > """
> > > > If VHOST_USER_F_PROTOCOL_FEATURES has been negotiated, the ring is
> > > > initialized in a disabled state and is enabled by
> > > > VHOST_USER_SET_VRING_ENABLE with parameter 1.
> > > > """
> > > >
> > > > AFAICT virtiofsd follows this requirement.
> > >
> > > Hi Michael,
> > > > You documented the disabled ring state in QEMU commit
> > > c61f09ed855b5009f816242ce281fd01586d4646 ("vhost-user: clarify start
> > > and enable") where virtio-net devices discard tx buffers. The disabled
> > > state seems to be specific to vhost-user and not covered in the VIRTIO
> > > specification.
> > >
> > > Do you remember what the purpose of the disabled state was? Why is it
> > > necessary to discard tx buffers instead of postponing ring processing
> > > until the virtqueue is enabled?
> > >
> > > My concern is that the semantics are unclear for virtqueue types that
> > > are different from virtio-net rx/tx. Even the virtio-net controlq
> > > would be problematic - should buffers be silently discarded with
> > > VIRTIO_NET_OK or should they fail?
> > >
> > > Thanks,
> > > Stefan
> >
> > I think I got it now.
> > This weird state happens when Linux first queues packets
> > on multiple queues and then changes max queues to 1; the packets
> > already queued still need to be freed eventually.
>
> Can you explain what is happening in the guest driver, QEMU, and the
> vhost-user-net device in more detail? I don't understand the scenario.
The guest changes max vq pairs, making it smaller;
QEMU then disables the ring.
> > Yes, I am not sure this can apply to devices or queue types
> > other than virtio net. Maybe.
> >
> > When we say:
> > must process the ring without causing any side effects.
> > then I think it would be better to say
> > must process the ring if it can be done without causing
> > guest visible side effects.
>
> Completing a tx buffer is guest-visible, so I'm confused by this statement.
yes but it's not immediately guest visible whether packet was
transmitted or discarded.
> > processing rx ring would have a side effect of causing
> > guest to get malformed buffers, so we don't process it.
>
> Why are they malformed? Do you mean the rx buffers are stale (the
> guest driver has changed the number of queues and doesn't expect to
> receive them anymore)?
there's no way to consume an rx buffer without supplying
an rx packet to guest.
> > processing command queue - we can't fail for sure since
> > that is guest visible. but practically we don't do this
> > for cvq.
> >
> > what should happen for virtiofsd? I don't know -
> > I am guessing discarding would have a side effect
> > so should not happen.
> >
> >
> >
> >
> > > >
> > > > > This seems like a bug in the vhost-user backend to me.
> > > >
> > > > I didn't want to exclude that possibility; that's why I included Eugenio,
> > > > German, Liu Jiang, and Sergio in the CC list.
> > > >
> > > > >
> > > > > When the virtqueue is disabled, don't monitor the kickfd.
> > > > >
> > > > > When the virtqueue transitions from disabled to enabled, the control
> > > > > plane should self-trigger the kickfd so that any available buffers
> > > > > will be processed.
> > > > >
> > > > > QEMU uses this scheme to switch between vhost/IOThreads and built-in
> > > > > virtqueue kick processing.
> > > > >
> > > > > This approach is more robust than relying on buffers being enqueued after
> > > > > the virtqueue is enabled.
> > > >
> > > > I'm happy to drop the series if the virtiofsd maintainers agree that the
> > > > bug is in virtiofsd, and can propose a design to fix it. (I do think
> > > > that such a fix would require an architectural change.)
> > > >
> > > > FWIW, my own interpretation of the vhost-user spec (see above) was that
> > > > virtiofsd was right to behave the way it did, and that there was simply
> > > > no way to prevent out-of-order delivery other than synchronizing the
> > > > guest end-to-end with the vhost-user backend, concerning
> > > > VHOST_USER_SET_VRING_ENABLE.
> > > >
> > > > This end-to-end synchronization is present "naturally" in vhost-net,
> > > > where ioctl()s are automatically synchronous -- in fact *all* operations
> > > > on the control plane are synchronous. (Which is just a different way to
> > > > say that the guest is tightly coupled with the control plane.)
> > > >
> > > > Note that there has been at least one race like this before; see commit
> > > > 699f2e535d93 ("vhost: make SET_VRING_ADDR, SET_FEATURES send replies",
> > > > 2021-09-04). Basically every pre-existent call to enforce_reply() is a
> > > > cover-up for the vhost-user spec turning (somewhat recklessly?) most
> > > > operations into async ones.
> > > >
> > > > At some point this became apparent and so the REPLY_ACK flag was
> > > > introduced; see commit ca525ce5618b ("vhost-user: Introduce a new
> > > > protocol feature REPLY_ACK.", 2016-08-10). (That commit doesn't go into
> > > > details, but I'm pretty sure there was a similar race around SET_MEM_TABLE!)
> > > >
> > > > BTW even if we drop this series for QEMU, I don't think it will have
> > > > been in vain. The first few patches are cleanups which could be merged
> > > > for their own sake. And the last patch is essentially the proof of the
> > > > problem statement / analysis. It can be considered an elaborate bug
> > > > report for virtiofsd, *if* we decide the bug is in virtiofsd. I did have
> > > > that avenue in mind as well, when writing the commit message / patch.
> > > >
> > > > For now I'm going to post v2 -- that's not to say that I'm dismissing
> > > > your feedback (see above!), just want to get the latest version on-list.
> > > >
> > > > Thanks!
> > > > Laszlo
> > > >
> > > > >
> > > > > Stefan
> > > > >
> > > > >>
> > > > >> Note that OVMF enables the queue *first*, and sends FUSE_INIT *second*.
> > > > >> However, if the data plane processor in virtiofsd wins the race, then it
> > > > >> sees the FUSE_INIT *before* the control plane processor took notice of
> > > > >> VHOST_USER_SET_VRING_ENABLE and green-lit the queue for the data plane
> > > > >> processor. Therefore the latter drops FUSE_INIT on the floor, and goes
> > > > >> back to waiting for further virtio / FUSE requests with epoll_wait.
> > > > >> Meanwhile OVMF is stuck waiting for the FUSE_INIT response -- a deadlock.
> > > > >>
> > > > >> The deadlock is not deterministic. OVMF hangs infrequently during first
> > > > >> boot. However, OVMF hangs almost certainly during reboots from the UEFI
> > > > >> shell.
> > > > >>
> > > > >> The race can be "reliably masked" by inserting a very small delay -- a
> > > > >> single debug message -- at the top of "VringEpollHandler::handle_event",
> > > > >> i.e., just before the data plane processor checks the "enabled" field of
> > > > >> the vring. That delay suffices for the control plane processor to act upon
> > > > >> VHOST_USER_SET_VRING_ENABLE.
> > > > >>
> > > > >> We can deterministically prevent the race in QEMU, by blocking OVMF inside
> > > > >> step (1.1) -- i.e., in the write to the "queue_enable" register -- until
> > > > >> VHOST_USER_SET_VRING_ENABLE actually *completes*. That way OVMF's VCPU
> > > > >> cannot advance to the FUSE_INIT submission before virtiofsd's control
> > > > >> plane processor takes notice of the queue being enabled.
> > > > >>
> > > > >> Wait for VHOST_USER_SET_VRING_ENABLE completion by:
> > > > >>
> > > > >> - setting the NEED_REPLY flag on VHOST_USER_SET_VRING_ENABLE, and waiting
> > > > >> for the reply, if the VHOST_USER_PROTOCOL_F_REPLY_ACK vhost-user feature
> > > > >> has been negotiated, or
> > > > >>
> > > > >> - performing a separate VHOST_USER_GET_FEATURES *exchange*, which requires
> > > > >> a backend response regardless of VHOST_USER_PROTOCOL_F_REPLY_ACK.
> > > > >>
> > > > >> Cc: "Michael S. Tsirkin" <mst@redhat.com> (supporter:vhost)
> > > > >> Cc: Eugenio Perez Martin <eperezma@redhat.com>
> > > > >> Cc: German Maglione <gmaglione@redhat.com>
> > > > >> Cc: Liu Jiang <gerry@linux.alibaba.com>
> > > > >> Cc: Sergio Lopez Pascual <slp@redhat.com>
> > > > >> Cc: Stefano Garzarella <sgarzare@redhat.com>
> > > > >> Signed-off-by: Laszlo Ersek <lersek@redhat.com>
> > > > >> ---
> > > > >> hw/virtio/vhost-user.c | 2 +-
> > > > >> 1 file changed, 1 insertion(+), 1 deletion(-)
> > > > >>
> > > > >> diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
> > > > >> index beb4b832245e..01e0ca90c538 100644
> > > > >> --- a/hw/virtio/vhost-user.c
> > > > >> +++ b/hw/virtio/vhost-user.c
> > > > >> @@ -1235,7 +1235,7 @@ static int vhost_user_set_vring_enable(struct vhost_dev *dev, int enable)
> > > > >> .num = enable,
> > > > >> };
> > > > >>
> > > > >> - ret = vhost_set_vring(dev, VHOST_USER_SET_VRING_ENABLE, &state, false);
> > > > >> + ret = vhost_set_vring(dev, VHOST_USER_SET_VRING_ENABLE, &state, true);
> > > > >> if (ret < 0) {
> > > > >> /*
> > > > >> * Restoring the previous state is likely infeasible, as well as
> > > > >
> > > >
> >
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH 7/7] vhost-user: call VHOST_USER_SET_VRING_ENABLE synchronously
2023-10-02 22:36 ` Michael S. Tsirkin
@ 2023-10-03 0:17 ` Stefan Hajnoczi
2023-10-03 14:28 ` Michael S. Tsirkin
0 siblings, 1 reply; 58+ messages in thread
From: Stefan Hajnoczi @ 2023-10-03 0:17 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: qemu-devel, Laszlo Ersek, Eugenio Perez Martin, German Maglione,
Liu Jiang, Sergio Lopez Pascual, Stefano Garzarella, Jason Wang
On Mon, 2 Oct 2023 at 18:36, Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Mon, Oct 02, 2023 at 05:12:27PM -0400, Stefan Hajnoczi wrote:
> > On Mon, 2 Oct 2023 at 02:49, Michael S. Tsirkin <mst@redhat.com> wrote:
> > >
> > > On Wed, Aug 30, 2023 at 11:37:50AM -0400, Stefan Hajnoczi wrote:
> > > > On Wed, 30 Aug 2023 at 09:30, Laszlo Ersek <lersek@redhat.com> wrote:
> > > > >
> > > > > On 8/30/23 14:10, Stefan Hajnoczi wrote:
> > > > > > On Sun, 27 Aug 2023 at 14:31, Laszlo Ersek <lersek@redhat.com> wrote:
> > > > > >>
> > > > > >> (1) The virtio-1.0 specification
> > > > > >> <http://docs.oasis-open.org/virtio/virtio/v1.0/virtio-v1.0.html> writes:
> > > > > >>
> > > > > >>> 3 General Initialization And Device Operation
> > > > > >>> 3.1 Device Initialization
> > > > > >>> 3.1.1 Driver Requirements: Device Initialization
> > > > > >>>
> > > > > >>> [...]
> > > > > >>>
> > > > > >>> 7. Perform device-specific setup, including discovery of virtqueues for
> > > > > >>> the device, optional per-bus setup, reading and possibly writing the
> > > > > >>> device’s virtio configuration space, and population of virtqueues.
> > > > > >>>
> > > > > >>> 8. Set the DRIVER_OK status bit. At this point the device is “live”.
> > > > > >>
> > > > > >> and
> > > > > >>
> > > > > >>> 4 Virtio Transport Options
> > > > > >>> 4.1 Virtio Over PCI Bus
> > > > > >>> 4.1.4 Virtio Structure PCI Capabilities
> > > > > >>> 4.1.4.3 Common configuration structure layout
> > > > > >>> 4.1.4.3.2 Driver Requirements: Common configuration structure layout
> > > > > >>>
> > > > > >>> [...]
> > > > > >>>
> > > > > >>> The driver MUST configure the other virtqueue fields before enabling the
> > > > > >>> virtqueue with queue_enable.
> > > > > >>>
> > > > > >>> [...]
> > > > > >>
> > > > > >> These together mean that the following sub-sequence of steps is valid for
> > > > > >> a virtio-1.0 guest driver:
> > > > > >>
> > > > > >> (1.1) set "queue_enable" for the needed queues as the final part of device
> > > > > >> initialization step (7),
> > > > > >>
> > > > > >> (1.2) set DRIVER_OK in step (8),
> > > > > >>
> > > > > >> (1.3) immediately start sending virtio requests to the device.
> > > > > >>
> > > > > >> (2) When vhost-user is enabled, and the VHOST_USER_F_PROTOCOL_FEATURES
> > > > > >> special virtio feature is negotiated, then virtio rings start in disabled
> > > > > >> state, according to
> > > > > >> <https://qemu-project.gitlab.io/qemu/interop/vhost-user.html#ring-states>.
> > > > > >> In this case, explicit VHOST_USER_SET_VRING_ENABLE messages are needed for
> > > > > >> enabling vrings.
> > > > > >>
> > > > > >> Therefore setting "queue_enable" from the guest (1.1) is a *control plane*
> > > > > >> operation, which travels from the guest through QEMU to the vhost-user
> > > > > >> backend, using a unix domain socket.
> > > > > >>
> > > > > >> Whereas sending a virtio request (1.3) is a *data plane* operation, which
> > > > > >> evades QEMU -- it travels from guest to the vhost-user backend via
> > > > > >> eventfd.
> > > > > >>
> > > > > >> This means that steps (1.1) and (1.3) travel through different channels,
> > > > > >> and their relative order can be reversed, as perceived by the vhost-user
> > > > > >> backend.
> > > > > >>
> > > > > >> That's exactly what happens when OVMF's virtiofs driver (VirtioFsDxe) runs
> > > > > >> against the Rust-language virtiofsd version 1.7.2. (Which uses version
> > > > > >> 0.10.1 of the vhost-user-backend crate, and version 0.8.1 of the vhost
> > > > > >> crate.)
> > > > > >>
> > > > > >> Namely, when VirtioFsDxe binds a virtiofs device, it goes through the
> > > > > >> device initialization steps (i.e., control plane operations), and
> > > > > >> immediately sends a FUSE_INIT request too (i.e., performs a data plane
> > > > > >> operation). In the Rust-language virtiofsd, this creates a race between
> > > > > >> two components that run *concurrently*, i.e., in different threads or
> > > > > >> processes:
> > > > > >>
> > > > > >> - Control plane, handling vhost-user protocol messages:
> > > > > >>
> > > > > >> The "VhostUserSlaveReqHandlerMut::set_vring_enable" method
> > > > > >> [crates/vhost-user-backend/src/handler.rs] handles
> > > > > >> VHOST_USER_SET_VRING_ENABLE messages, and updates each vring's "enabled"
> > > > > >> flag according to the message processed.
> > > > > >>
> > > > > >> - Data plane, handling virtio / FUSE requests:
> > > > > >>
> > > > > >> The "VringEpollHandler::handle_event" method
> > > > > >> [crates/vhost-user-backend/src/event_loop.rs] handles the incoming
> > > > > >> virtio / FUSE request, consuming the virtio kick at the same time. If
> > > > > >> the vring's "enabled" flag is set, the virtio / FUSE request is
> > > > > >> processed genuinely. If the vring's "enabled" flag is clear, then the
> > > > > >> virtio / FUSE request is discarded.
> > > > > >
> > > > > > Why is virtiofsd monitoring the virtqueue and discarding requests
> > > > > > while it's disabled?
> > > > >
> > > > > That's what the vhost-user spec requires:
> > > > >
> > > > > https://qemu-project.gitlab.io/qemu/interop/vhost-user.html#ring-states
> > > > >
> > > > > """
> > > > > started but disabled: the back-end must process the ring without causing
> > > > > any side effects. For example, for a networking device, in the disabled
> > > > > state the back-end must not supply any new RX packets, but must process
> > > > > and discard any TX packets.
> > > > > """
> > > > >
> > > > > This state is different from "stopped", where "the back-end must not
> > > > > process the ring at all".
> > > > >
> > > > > The spec also says,
> > > > >
> > > > > """
> > > > > If VHOST_USER_F_PROTOCOL_FEATURES has been negotiated, the ring is
> > > > > initialized in a disabled state and is enabled by
> > > > > VHOST_USER_SET_VRING_ENABLE with parameter 1.
> > > > > """
> > > > >
> > > > > AFAICT virtiofsd follows this requirement.
> > > >
> > > > Hi Michael,
> > > > You documented the disabled ring state in QEMU commit
> > > > c61f09ed855b5009f816242ce281fd01586d4646 ("vhost-user: clarify start
> > > > and enable") where virtio-net devices discard tx buffers. The disabled
> > > > state seems to be specific to vhost-user and not covered in the VIRTIO
> > > > specification.
> > > >
> > > > Do you remember what the purpose of the disabled state was? Why is it
> > > > necessary to discard tx buffers instead of postponing ring processing
> > > > until the virtqueue is enabled?
> > > >
> > > > My concern is that the semantics are unclear for virtqueue types that
> > > > are different from virtio-net rx/tx. Even the virtio-net controlq
> > > > would be problematic - should buffers be silently discarded with
> > > > VIRTIO_NET_OK or should they fail?
> > > >
> > > > Thanks,
> > > > Stefan
> > >
> > > I think I got it now.
> > > This weird state happens when Linux first queues packets
> > > on multiple queues and then changes max queues to 1; the packets
> > > already queued still need to be freed eventually.
> >
> > Can you explain what is happening in the guest driver, QEMU, and the
> > vhost-user-net device in more detail? I don't understand the scenario.
>
> The guest changes max vq pairs, making it smaller;
> QEMU then disables the ring.
The purpose of the "ignore rx, discard tx" semantics is still unclear
to me. Can you explain why we do this?
Stefan
> > > Yes, I am not sure this can apply to devices or queue types
> > > other than virtio net. Maybe.
> > >
> > > When we say:
> > > must process the ring without causing any side effects.
> > > then I think it would be better to say
> > > must process the ring if it can be done without causing
> > > guest visible side effects.
> >
> > Completing a tx buffer is guest-visible, so I'm confused by this statement.
>
> yes but it's not immediately guest visible whether packet was
> transmitted or discarded.
>
> > > processing rx ring would have a side effect of causing
> > > guest to get malformed buffers, so we don't process it.
> >
> > Why are they malformed? Do you mean the rx buffers are stale (the
> > guest driver has changed the number of queues and doesn't expect to
> > receive them anymore)?
>
> there's no way to consume an rx buffer without supplying
> an rx packet to guest.
Stefan
> > > processing command queue - we can't fail for sure since
> > > that is guest visible. but practically we don't do this
> > > for cvq.
> > >
> > > what should happen for virtiofsd? I don't know -
> > > I am guessing discarding would have a side effect
> > > so should not happen.
> > >
> > >
> > >
> > >
> > > > >
> > > > > > This seems like a bug in the vhost-user backend to me.
> > > > >
> > > > > I didn't want to exclude that possibility; that's why I included Eugenio,
> > > > > German, Liu Jiang, and Sergio in the CC list.
> > > > >
> > > > > >
> > > > > > When the virtqueue is disabled, don't monitor the kickfd.
> > > > > >
> > > > > > When the virtqueue transitions from disabled to enabled, the control
> > > > > > plane should self-trigger the kickfd so that any available buffers
> > > > > > will be processed.
> > > > > >
> > > > > > QEMU uses this scheme to switch between vhost/IOThreads and built-in
> > > > > > virtqueue kick processing.
> > > > > >
> > > > > > This approach is more robust than relying on buffers being enqueued after
> > > > > > the virtqueue is enabled.
> > > > >
> > > > > I'm happy to drop the series if the virtiofsd maintainers agree that the
> > > > > bug is in virtiofsd, and can propose a design to fix it. (I do think
> > > > > that such a fix would require an architectural change.)
> > > > >
> > > > > FWIW, my own interpretation of the vhost-user spec (see above) was that
> > > > > virtiofsd was right to behave the way it did, and that there was simply
> > > > > no way to prevent out-of-order delivery other than synchronizing the
> > > > > guest end-to-end with the vhost-user backend, concerning
> > > > > VHOST_USER_SET_VRING_ENABLE.
> > > > >
> > > > > This end-to-end synchronization is present "naturally" in vhost-net,
> > > > > where ioctl()s are automatically synchronous -- in fact *all* operations
> > > > > on the control plane are synchronous. (Which is just a different way to
> > > > > say that the guest is tightly coupled with the control plane.)
> > > > >
> > > > > Note that there has been at least one race like this before; see commit
> > > > > 699f2e535d93 ("vhost: make SET_VRING_ADDR, SET_FEATURES send replies",
> > > > > 2021-09-04). Basically every pre-existent call to enforce_reply() is a
> > > > > cover-up for the vhost-user spec turning (somewhat recklessly?) most
> > > > > operations into async ones.
> > > > >
> > > > > At some point this became apparent and so the REPLY_ACK flag was
> > > > > introduced; see commit ca525ce5618b ("vhost-user: Introduce a new
> > > > > protocol feature REPLY_ACK.", 2016-08-10). (That commit doesn't go into
> > > > > details, but I'm pretty sure there was a similar race around SET_MEM_TABLE!)
> > > > >
> > > > > BTW even if we drop this series for QEMU, I don't think it will have
> > > > > been in vain. The first few patches are cleanups which could be merged
> > > > > for their own sake. And the last patch is essentially the proof of the
> > > > > problem statement / analysis. It can be considered an elaborate bug
> > > > > report for virtiofsd, *if* we decide the bug is in virtiofsd. I did have
> > > > > that avenue in mind as well, when writing the commit message / patch.
> > > > >
> > > > > For now I'm going to post v2 -- that's not to say that I'm dismissing
> > > > > your feedback (see above!), just want to get the latest version on-list.
> > > > >
> > > > > Thanks!
> > > > > Laszlo
> > > > >
> > > > > >
> > > > > > Stefan
> > > > > >
> > > > > >>
> > > > > >> Note that OVMF enables the queue *first*, and sends FUSE_INIT *second*.
> > > > > >> However, if the data plane processor in virtiofsd wins the race, then it
> > > > > >> sees the FUSE_INIT *before* the control plane processor took notice of
> > > > > >> VHOST_USER_SET_VRING_ENABLE and green-lit the queue for the data plane
> > > > > >> processor. Therefore the latter drops FUSE_INIT on the floor, and goes
> > > > > >> back to waiting for further virtio / FUSE requests with epoll_wait.
> > > > > >> Meanwhile OVMF is stuck waiting for the FUSE_INIT response -- a deadlock.
> > > > > >>
> > > > > >> The deadlock is not deterministic. OVMF hangs infrequently during first
> > > > > >> boot. However, OVMF hangs almost certainly during reboots from the UEFI
> > > > > >> shell.
> > > > > >>
> > > > > >> The race can be "reliably masked" by inserting a very small delay -- a
> > > > > >> single debug message -- at the top of "VringEpollHandler::handle_event",
> > > > > >> i.e., just before the data plane processor checks the "enabled" field of
> > > > > >> the vring. That delay suffices for the control plane processor to act upon
> > > > > >> VHOST_USER_SET_VRING_ENABLE.
> > > > > >>
> > > > > >> We can deterministically prevent the race in QEMU, by blocking OVMF inside
> > > > > >> step (1.1) -- i.e., in the write to the "queue_enable" register -- until
> > > > > >> VHOST_USER_SET_VRING_ENABLE actually *completes*. That way OVMF's VCPU
> > > > > >> cannot advance to the FUSE_INIT submission before virtiofsd's control
> > > > > >> plane processor takes notice of the queue being enabled.
> > > > > >>
> > > > > >> Wait for VHOST_USER_SET_VRING_ENABLE completion by:
> > > > > >>
> > > > > >> - setting the NEED_REPLY flag on VHOST_USER_SET_VRING_ENABLE, and waiting
> > > > > >> for the reply, if the VHOST_USER_PROTOCOL_F_REPLY_ACK vhost-user feature
> > > > > >> has been negotiated, or
> > > > > >>
> > > > > >> - performing a separate VHOST_USER_GET_FEATURES *exchange*, which requires
> > > > > >> a backend response regardless of VHOST_USER_PROTOCOL_F_REPLY_ACK.
> > > > > >>
> > > > > >> Cc: "Michael S. Tsirkin" <mst@redhat.com> (supporter:vhost)
> > > > > >> Cc: Eugenio Perez Martin <eperezma@redhat.com>
> > > > > >> Cc: German Maglione <gmaglione@redhat.com>
> > > > > >> Cc: Liu Jiang <gerry@linux.alibaba.com>
> > > > > >> Cc: Sergio Lopez Pascual <slp@redhat.com>
> > > > > >> Cc: Stefano Garzarella <sgarzare@redhat.com>
> > > > > >> Signed-off-by: Laszlo Ersek <lersek@redhat.com>
> > > > > >> ---
> > > > > >> hw/virtio/vhost-user.c | 2 +-
> > > > > >> 1 file changed, 1 insertion(+), 1 deletion(-)
> > > > > >>
> > > > > >> diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
> > > > > >> index beb4b832245e..01e0ca90c538 100644
> > > > > >> --- a/hw/virtio/vhost-user.c
> > > > > >> +++ b/hw/virtio/vhost-user.c
> > > > > >> @@ -1235,7 +1235,7 @@ static int vhost_user_set_vring_enable(struct vhost_dev *dev, int enable)
> > > > > >> .num = enable,
> > > > > >> };
> > > > > >>
> > > > > >> - ret = vhost_set_vring(dev, VHOST_USER_SET_VRING_ENABLE, &state, false);
> > > > > >> + ret = vhost_set_vring(dev, VHOST_USER_SET_VRING_ENABLE, &state, true);
> > > > > >> if (ret < 0) {
> > > > > >> /*
> > > > > >> * Restoring the previous state is likely infeasible, as well as
> > > > > >
> > > > >
> > >
>
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH 7/7] vhost-user: call VHOST_USER_SET_VRING_ENABLE synchronously
2023-10-02 21:13 ` Stefan Hajnoczi
@ 2023-10-03 12:26 ` Michael S. Tsirkin
2023-10-03 13:08 ` Stefan Hajnoczi
0 siblings, 1 reply; 58+ messages in thread
From: Michael S. Tsirkin @ 2023-10-03 12:26 UTC (permalink / raw)
To: Stefan Hajnoczi
Cc: qemu-devel, Laszlo Ersek, Eugenio Perez Martin, German Maglione,
Liu Jiang, Sergio Lopez Pascual, Stefano Garzarella, Jason Wang
On Mon, Oct 02, 2023 at 05:13:26PM -0400, Stefan Hajnoczi wrote:
> One more question:
>
> Why is the disabled state not needed by regular (non-vhost) virtio-net devices?
Tap does the same - it purges queued packets:
int tap_disable(NetClientState *nc)
{
    TAPState *s = DO_UPCAST(TAPState, nc, nc);
    int ret;

    if (s->enabled == 0) {
        return 0;
    } else {
        ret = tap_fd_disable(s->fd);
        if (ret == 0) {
            qemu_purge_queued_packets(nc);
            s->enabled = false;
            tap_update_fd_handler(s);
        }
        return ret;
    }
}
what about non tap backends? I suspect they just aren't
used widely with multiqueue so no one noticed.
--
MST
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH 7/7] vhost-user: call VHOST_USER_SET_VRING_ENABLE synchronously
2023-10-03 12:26 ` Michael S. Tsirkin
@ 2023-10-03 13:08 ` Stefan Hajnoczi
2023-10-03 13:23 ` Laszlo Ersek
2023-10-03 14:40 ` Michael S. Tsirkin
0 siblings, 2 replies; 58+ messages in thread
From: Stefan Hajnoczi @ 2023-10-03 13:08 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: qemu-devel, Laszlo Ersek, Eugenio Perez Martin, German Maglione,
Liu Jiang, Sergio Lopez Pascual, Stefano Garzarella, Jason Wang
On Tue, 3 Oct 2023 at 08:27, Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Mon, Oct 02, 2023 at 05:13:26PM -0400, Stefan Hajnoczi wrote:
> > One more question:
> >
> > Why is the disabled state not needed by regular (non-vhost) virtio-net devices?
>
> Tap does the same - it purges queued packets:
>
> int tap_disable(NetClientState *nc)
> {
>     TAPState *s = DO_UPCAST(TAPState, nc, nc);
>     int ret;
>
>     if (s->enabled == 0) {
>         return 0;
>     } else {
>         ret = tap_fd_disable(s->fd);
>         if (ret == 0) {
>             qemu_purge_queued_packets(nc);
>             s->enabled = false;
>             tap_update_fd_handler(s);
>         }
>         return ret;
>     }
> }
tap_disable() is not equivalent to the vhost-user "started but
disabled" ring state. tap_disable() is a synchronous one-time action,
while "started but disabled" is a continuous state.
The "started but disabled" ring state isn't needed to achieve this.
The back-end can just drop tx buffers upon receiving
VHOST_USER_SET_VRING_ENABLE .num=0.
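A sketch of that alternative, with all types and helpers as placeholders
rather than a real back-end API: the drain happens once, when the disable
message arrives, instead of continuously while disabled.

#include <stdbool.h>
#include <stddef.h>

struct vring;                                   /* placeholder ring type */
struct desc_chain;                              /* placeholder descriptor chain */
struct desc_chain *pop_avail(struct vring *vq); /* placeholder: pop one chain */
void push_used(struct vring *vq, struct desc_chain *c, unsigned len);
void notify_guest(struct vring *vq);
void vring_mark_enabled(struct vring *vq, bool enable);

void handle_set_vring_enable(struct vring *vq, bool enable)
{
    vring_mark_enabled(vq, enable);
    if (enable) {
        return;
    }
    /* Discard pending tx buffers right now: complete each chain with a
     * zero used length, then leave the ring untouched until it is
     * enabled again. */
    struct desc_chain *c;
    while ((c = pop_avail(vq)) != NULL) {
        push_used(vq, c, 0);
    }
    notify_guest(vq);
}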
The history of the spec is curious. VHOST_USER_SET_VRING_ENABLE was
introduced before the "started but disabled" state was defined,
and it explicitly mentions tap attach/detach:
commit 7263a0ad7899994b719ebed736a1119cc2e08110
Author: Changchun Ouyang <changchun.ouyang@intel.com>
Date: Wed Sep 23 12:20:01 2015 +0800
vhost-user: add a new message to disable/enable a specific virt queue.
Add a new message, VHOST_USER_SET_VRING_ENABLE, to enable or disable
a specific virt queue, which is similar to attach/detach queue for
tap device.
and then later:
commit c61f09ed855b5009f816242ce281fd01586d4646
Author: Michael S. Tsirkin <mst@redhat.com>
Date: Mon Nov 23 12:48:52 2015 +0200
vhost-user: clarify start and enable
>
> what about non tap backends? I suspect they just aren't
> used widely with multiqueue so no one noticed.
I still don't understand why "started but disabled" is needed instead
of just two ring states: enabled and disabled.
It seems like the cleanest path going forward is to keep the "ignore
rx, discard tx" semantics for virtio-net devices but to clarify in the
spec that other device types do not process the ring:
"
* started but disabled: the back-end must not process the ring. For legacy
reasons there is an exception for the networking device, where the
back-end must process and discard any TX packets and not process
other rings.
"
What do you think?
Stefan
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH 7/7] vhost-user: call VHOST_USER_SET_VRING_ENABLE synchronously
2023-10-03 13:08 ` Stefan Hajnoczi
@ 2023-10-03 13:23 ` Laszlo Ersek
2023-10-03 14:25 ` Michael S. Tsirkin
2023-10-03 14:40 ` Michael S. Tsirkin
1 sibling, 1 reply; 58+ messages in thread
From: Laszlo Ersek @ 2023-10-03 13:23 UTC (permalink / raw)
To: Stefan Hajnoczi, Michael S. Tsirkin
Cc: qemu-devel, Eugenio Perez Martin, German Maglione, Liu Jiang,
Sergio Lopez Pascual, Stefano Garzarella, Jason Wang
On 10/3/23 15:08, Stefan Hajnoczi wrote:
> On Tue, 3 Oct 2023 at 08:27, Michael S. Tsirkin <mst@redhat.com> wrote:
>>
>> On Mon, Oct 02, 2023 at 05:13:26PM -0400, Stefan Hajnoczi wrote:
>>> One more question:
>>>
>>> Why is the disabled state not needed by regular (non-vhost) virtio-net devices?
>>
>> Tap does the same - it purges queued packets:
>>
>> int tap_disable(NetClientState *nc)
>> {
>>     TAPState *s = DO_UPCAST(TAPState, nc, nc);
>>     int ret;
>>
>>     if (s->enabled == 0) {
>>         return 0;
>>     } else {
>>         ret = tap_fd_disable(s->fd);
>>         if (ret == 0) {
>>             qemu_purge_queued_packets(nc);
>>             s->enabled = false;
>>             tap_update_fd_handler(s);
>>         }
>>         return ret;
>>     }
>> }
>
> tap_disable() is not equivalent to the vhost-user "started but
> disabled" ring state. tap_disable() is a synchronous one-time action,
> while "started but disabled" is a continuous state.
>
> The "started but disabled" ring state isn't needed to achieve this.
> The back-end can just drop tx buffers upon receiving
> VHOST_USER_SET_VRING_ENABLE .num=0.
>
> The history of the spec is curious. VHOST_USER_SET_VRING_ENABLE was
> introduced before the "started but disabled" state was defined,
> and it explicitly mentions tap attach/detach:
>
> commit 7263a0ad7899994b719ebed736a1119cc2e08110
> Author: Changchun Ouyang <changchun.ouyang@intel.com>
> Date: Wed Sep 23 12:20:01 2015 +0800
>
> vhost-user: add a new message to disable/enable a specific virt queue.
>
> Add a new message, VHOST_USER_SET_VRING_ENABLE, to enable or disable
> a specific virt queue, which is similar to attach/detach queue for
> tap device.
>
> and then later:
>
> commit c61f09ed855b5009f816242ce281fd01586d4646
> Author: Michael S. Tsirkin <mst@redhat.com>
> Date: Mon Nov 23 12:48:52 2015 +0200
>
> vhost-user: clarify start and enable
>
>>
>> what about non tap backends? I suspect they just aren't
>> used widely with multiqueue so no one noticed.
>
> I still don't understand why "started but disabled" is needed instead
> of just two ring states: enabled and disabled.
>
> It seems like the cleanest path going forward is to keep the "ignore
> rx, discard tx" semantics for virtio-net devices but to clarify in the
> spec that other device types do not process the ring:
>
> "
> * started but disabled: the back-end must not process the ring. For legacy
> reasons there is an exception for the networking device, where the
> back-end must process and discard any TX packets and not process
> other rings.
> "
>
> What do you think?
... from a vhost-user backend perspective, won't this create a need for
all "ring processor" (~ virtio event loop) implementations to support
both methods? IIUC, the "virtio pop" is usually independent of the
particular device to which the requests are ultimately delivered. So the
event loop would have to grow a new parameter regarding "what to do in
the started-but-disabled state", the network device would have to pass
in one value (-> pop & drop), and all other devices would have to pass
in the other value (stop popping).
... I figure in rust-vmm/vhost it would affect the "handle_event"
function in "crates/vhost-user-backend/src/event_loop.rs".
Do I understand right? (Not disagreeing, just pondering the impact on
backends.)
Laszlo
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH 7/7] vhost-user: call VHOST_USER_SET_VRING_ENABLE synchronously
2023-10-03 13:23 ` Laszlo Ersek
@ 2023-10-03 14:25 ` Michael S. Tsirkin
2023-10-03 14:28 ` Laszlo Ersek
0 siblings, 1 reply; 58+ messages in thread
From: Michael S. Tsirkin @ 2023-10-03 14:25 UTC (permalink / raw)
To: Laszlo Ersek
Cc: Stefan Hajnoczi, qemu-devel, Eugenio Perez Martin,
German Maglione, Liu Jiang, Sergio Lopez Pascual,
Stefano Garzarella, Jason Wang
On Tue, Oct 03, 2023 at 03:23:24PM +0200, Laszlo Ersek wrote:
> On 10/3/23 15:08, Stefan Hajnoczi wrote:
> > On Tue, 3 Oct 2023 at 08:27, Michael S. Tsirkin <mst@redhat.com> wrote:
> >>
> >> On Mon, Oct 02, 2023 at 05:13:26PM -0400, Stefan Hajnoczi wrote:
> >>> One more question:
> >>>
> >>> Why is the disabled state not needed by regular (non-vhost) virtio-net devices?
> >>
> >> Tap does the same - it purges queued packets:
> >>
> >> int tap_disable(NetClientState *nc)
> >> {
> >>     TAPState *s = DO_UPCAST(TAPState, nc, nc);
> >>     int ret;
> >>
> >>     if (s->enabled == 0) {
> >>         return 0;
> >>     } else {
> >>         ret = tap_fd_disable(s->fd);
> >>         if (ret == 0) {
> >>             qemu_purge_queued_packets(nc);
> >>             s->enabled = false;
> >>             tap_update_fd_handler(s);
> >>         }
> >>         return ret;
> >>     }
> >> }
> >
> > tap_disable() is not equivalent to the vhost-user "started but
> > disabled" ring state. tap_disable() is a synchronous one-time action,
> > while "started but disabled" is a continuous state.
> >
> > The "started but disabled" ring state isn't needed to achieve this.
> > The back-end can just drop tx buffers upon receiving
> > VHOST_USER_SET_VRING_ENABLE .num=0.
> >
> > The history of the spec is curious. VHOST_USER_SET_VRING_ENABLE was
> > introduced before the the "started but disabled" state was defined,
> > and it explicitly mentions tap attach/detach:
> >
> > commit 7263a0ad7899994b719ebed736a1119cc2e08110
> > Author: Changchun Ouyang <changchun.ouyang@intel.com>
> > Date: Wed Sep 23 12:20:01 2015 +0800
> >
> > vhost-user: add a new message to disable/enable a specific virt queue.
> >
> > Add a new message, VHOST_USER_SET_VRING_ENABLE, to enable or disable
> > a specific virt queue, which is similar to attach/detach queue for
> > tap device.
> >
> > and then later:
> >
> > commit c61f09ed855b5009f816242ce281fd01586d4646
> > Author: Michael S. Tsirkin <mst@redhat.com>
> > Date: Mon Nov 23 12:48:52 2015 +0200
> >
> > vhost-user: clarify start and enable
> >
> >>
> >> what about non tap backends? I suspect they just aren't
> >> used widely with multiqueue so no one noticed.
> >
> > I still don't understand why "started but disabled" is needed instead
> > of just two ring states: enabled and disabled.
> >
> > It seems like the cleanest path going forward is to keep the "ignore
> > rx, discard tx" semantics for virtio-net devices but to clarify in the
> > spec that other device types do not process the ring:
> >
> > "
> > * started but disabled: the back-end must not process the ring. For legacy
> > reasons there is an exception for the networking device, where the
> > back-end must process and discard any TX packets and not process
> > other rings.
> > "
> >
> > What do you think?
>
> ... from a vhost-user backend perspective, won't this create a need for
> all "ring processor" (~ virtio event loop) implementations to support
> both methods? IIUC, the "virtio pop" is usually independent of the
> particular device to which the requests are ultimately delivered. So the
> event loop would have to grow a new parameter regarding "what to do in
> the started-but-disabled state", the network device would have to pass
> in one value (-> pop & drop), and all other devices would have to pass
> in the other value (stop popping).
>
> ... I figure in rust-vmm/vhost it would affect the "handle_event"
> function in "crates/vhost-user-backend/src/event_loop.rs".
>
> Do I understand right? (Not disagreeing, just pondering the impact on
> backends.)
>
> Laszlo
Already the case I guess - RX ring is not processed, TX is. Right?
--
MST
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH 7/7] vhost-user: call VHOST_USER_SET_VRING_ENABLE synchronously
2023-10-03 14:25 ` Michael S. Tsirkin
@ 2023-10-03 14:28 ` Laszlo Ersek
0 siblings, 0 replies; 58+ messages in thread
From: Laszlo Ersek @ 2023-10-03 14:28 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Stefan Hajnoczi, qemu-devel, Eugenio Perez Martin,
German Maglione, Liu Jiang, Sergio Lopez Pascual,
Stefano Garzarella, Jason Wang
On 10/3/23 16:25, Michael S. Tsirkin wrote:
> On Tue, Oct 03, 2023 at 03:23:24PM +0200, Laszlo Ersek wrote:
>> On 10/3/23 15:08, Stefan Hajnoczi wrote:
>>> On Tue, 3 Oct 2023 at 08:27, Michael S. Tsirkin <mst@redhat.com> wrote:
>>>>
>>>> On Mon, Oct 02, 2023 at 05:13:26PM -0400, Stefan Hajnoczi wrote:
>>>>> One more question:
>>>>>
>>>>> Why is the disabled state not needed by regular (non-vhost) virtio-net devices?
>>>>
>>>> Tap does the same - it purges queued packets:
>>>>
>>>> int tap_disable(NetClientState *nc)
>>>> {
>>>>     TAPState *s = DO_UPCAST(TAPState, nc, nc);
>>>>     int ret;
>>>>
>>>>     if (s->enabled == 0) {
>>>>         return 0;
>>>>     } else {
>>>>         ret = tap_fd_disable(s->fd);
>>>>         if (ret == 0) {
>>>>             qemu_purge_queued_packets(nc);
>>>>             s->enabled = false;
>>>>             tap_update_fd_handler(s);
>>>>         }
>>>>         return ret;
>>>>     }
>>>> }
>>>
>>> tap_disable() is not equivalent to the vhost-user "started but
>>> disabled" ring state. tap_disable() is a synchronous one-time action,
>>> while "started but disabled" is a continuous state.
>>>
>>> The "started but disabled" ring state isn't needed to achieve this.
>>> The back-end can just drop tx buffers upon receiving
>>> VHOST_USER_SET_VRING_ENABLE .num=0.
>>>
>>> The history of the spec is curious. VHOST_USER_SET_VRING_ENABLE was
>>> introduced before the "started but disabled" state was defined,
>>> and it explicitly mentions tap attach/detach:
>>>
>>> commit 7263a0ad7899994b719ebed736a1119cc2e08110
>>> Author: Changchun Ouyang <changchun.ouyang@intel.com>
>>> Date: Wed Sep 23 12:20:01 2015 +0800
>>>
>>> vhost-user: add a new message to disable/enable a specific virt queue.
>>>
>>> Add a new message, VHOST_USER_SET_VRING_ENABLE, to enable or disable
>>> a specific virt queue, which is similar to attach/detach queue for
>>> tap device.
>>>
>>> and then later:
>>>
>>> commit c61f09ed855b5009f816242ce281fd01586d4646
>>> Author: Michael S. Tsirkin <mst@redhat.com>
>>> Date: Mon Nov 23 12:48:52 2015 +0200
>>>
>>> vhost-user: clarify start and enable
>>>
>>>>
>>>> what about non tap backends? I suspect they just aren't
>>>> used widely with multiqueue so no one noticed.
>>>
>>> I still don't understand why "started but disabled" is needed instead
>>> of just two ring states: enabled and disabled.
>>>
>>> It seems like the cleanest path going forward is to keep the "ignore
>>> rx, discard tx" semantics for virtio-net devices but to clarify in the
>>> spec that other device types do not process the ring:
>>>
>>> "
>>> * started but disabled: the back-end must not process the ring. For legacy
>>> reasons there is an exception for the networking device, where the
>>> back-end must process and discard any TX packets and not process
>>> other rings.
>>> "
>>>
>>> What do you think?
>>
>> ... from a vhost-user backend perspective, won't this create a need for
>> all "ring processor" (~ virtio event loop) implementations to support
>> both methods? IIUC, the "virtio pop" is usually independent of the
>> particular device to which the requests are ultimately delivered. So the
>> event loop would have to grow a new parameter regarding "what to do in
>> the started-but-disabled state", the network device would have to pass
>> in one value (-> pop & drop), and all other devices would have to pass
>> in the other value (stop popping).
>>
>> ... I figure in rust-vmm/vhost it would affect the "handle_event"
>> function in "crates/vhost-user-backend/src/event_loop.rs".
>>
>> Do I understand right? (Not disagreeing, just pondering the impact on
>> backends.)
>>
>> Laszlo
>
> Already the case I guess - RX ring is not processed, TX is. Right?
>
Ah, I see your point; this distinction must already exist in event loops.
But... as far as I can tell, it's not there in rust-vmm/vhost.
Laszlo
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH 7/7] vhost-user: call VHOST_USER_SET_VRING_ENABLE synchronously
2023-10-03 0:17 ` Stefan Hajnoczi
@ 2023-10-03 14:28 ` Michael S. Tsirkin
0 siblings, 0 replies; 58+ messages in thread
From: Michael S. Tsirkin @ 2023-10-03 14:28 UTC (permalink / raw)
To: Stefan Hajnoczi
Cc: qemu-devel, Laszlo Ersek, Eugenio Perez Martin, German Maglione,
Liu Jiang, Sergio Lopez Pascual, Stefano Garzarella, Jason Wang
On Mon, Oct 02, 2023 at 08:17:07PM -0400, Stefan Hajnoczi wrote:
> On Mon, 2 Oct 2023 at 18:36, Michael S. Tsirkin <mst@redhat.com> wrote:
> >
> > On Mon, Oct 02, 2023 at 05:12:27PM -0400, Stefan Hajnoczi wrote:
> > > On Mon, 2 Oct 2023 at 02:49, Michael S. Tsirkin <mst@redhat.com> wrote:
> > > >
> > > > On Wed, Aug 30, 2023 at 11:37:50AM -0400, Stefan Hajnoczi wrote:
> > > > > On Wed, 30 Aug 2023 at 09:30, Laszlo Ersek <lersek@redhat.com> wrote:
> > > > > >
> > > > > > On 8/30/23 14:10, Stefan Hajnoczi wrote:
> > > > > > > On Sun, 27 Aug 2023 at 14:31, Laszlo Ersek <lersek@redhat.com> wrote:
> > > > > > >>
> > > > > > >> (1) The virtio-1.0 specification
> > > > > > >> <http://docs.oasis-open.org/virtio/virtio/v1.0/virtio-v1.0.html> writes:
> > > > > > >>
> > > > > > >>> 3 General Initialization And Device Operation
> > > > > > >>> 3.1 Device Initialization
> > > > > > >>> 3.1.1 Driver Requirements: Device Initialization
> > > > > > >>>
> > > > > > >>> [...]
> > > > > > >>>
> > > > > > >>> 7. Perform device-specific setup, including discovery of virtqueues for
> > > > > > >>> the device, optional per-bus setup, reading and possibly writing the
> > > > > > >>> device’s virtio configuration space, and population of virtqueues.
> > > > > > >>>
> > > > > > >>> 8. Set the DRIVER_OK status bit. At this point the device is “live”.
> > > > > > >>
> > > > > > >> and
> > > > > > >>
> > > > > > >>> 4 Virtio Transport Options
> > > > > > >>> 4.1 Virtio Over PCI Bus
> > > > > > >>> 4.1.4 Virtio Structure PCI Capabilities
> > > > > > >>> 4.1.4.3 Common configuration structure layout
> > > > > > >>> 4.1.4.3.2 Driver Requirements: Common configuration structure layout
> > > > > > >>>
> > > > > > >>> [...]
> > > > > > >>>
> > > > > > >>> The driver MUST configure the other virtqueue fields before enabling the
> > > > > > >>> virtqueue with queue_enable.
> > > > > > >>>
> > > > > > >>> [...]
> > > > > > >>
> > > > > > >> These together mean that the following sub-sequence of steps is valid for
> > > > > > >> a virtio-1.0 guest driver:
> > > > > > >>
> > > > > > >> (1.1) set "queue_enable" for the needed queues as the final part of device
> > > > > > >> initialization step (7),
> > > > > > >>
> > > > > > >> (1.2) set DRIVER_OK in step (8),
> > > > > > >>
> > > > > > >> (1.3) immediately start sending virtio requests to the device.
> > > > > > >>
> > > > > > >> (2) When vhost-user is enabled, and the VHOST_USER_F_PROTOCOL_FEATURES
> > > > > > >> special virtio feature is negotiated, then virtio rings start in disabled
> > > > > > >> state, according to
> > > > > > >> <https://qemu-project.gitlab.io/qemu/interop/vhost-user.html#ring-states>.
> > > > > > >> In this case, explicit VHOST_USER_SET_VRING_ENABLE messages are needed for
> > > > > > >> enabling vrings.
> > > > > > >>
> > > > > > >> Therefore setting "queue_enable" from the guest (1.1) is a *control plane*
> > > > > > >> operation, which travels from the guest through QEMU to the vhost-user
> > > > > > >> backend, using a unix domain socket.
> > > > > > >>
> > > > > > >> Whereas sending a virtio request (1.3) is a *data plane* operation, which
> > > > > > >> evades QEMU -- it travels from guest to the vhost-user backend via
> > > > > > >> eventfd.
> > > > > > >>
> > > > > > >> This means that steps (1.1) and (1.3) travel through different channels,
> > > > > > >> and their relative order can be reversed, as perceived by the vhost-user
> > > > > > >> backend.
> > > > > > >>
> > > > > > >> That's exactly what happens when OVMF's virtiofs driver (VirtioFsDxe) runs
> > > > > > >> against the Rust-language virtiofsd version 1.7.2. (Which uses version
> > > > > > >> 0.10.1 of the vhost-user-backend crate, and version 0.8.1 of the vhost
> > > > > > >> crate.)
> > > > > > >>
> > > > > > >> Namely, when VirtioFsDxe binds a virtiofs device, it goes through the
> > > > > > >> device initialization steps (i.e., control plane operations), and
> > > > > > >> immediately sends a FUSE_INIT request too (i.e., performs a data plane
> > > > > > >> operation). In the Rust-language virtiofsd, this creates a race between
> > > > > > >> two components that run *concurrently*, i.e., in different threads or
> > > > > > >> processes:
> > > > > > >>
> > > > > > >> - Control plane, handling vhost-user protocol messages:
> > > > > > >>
> > > > > > >> The "VhostUserSlaveReqHandlerMut::set_vring_enable" method
> > > > > > >> [crates/vhost-user-backend/src/handler.rs] handles
> > > > > > >> VHOST_USER_SET_VRING_ENABLE messages, and updates each vring's "enabled"
> > > > > > >> flag according to the message processed.
> > > > > > >>
> > > > > > >> - Data plane, handling virtio / FUSE requests:
> > > > > > >>
> > > > > > >> The "VringEpollHandler::handle_event" method
> > > > > > >> [crates/vhost-user-backend/src/event_loop.rs] handles the incoming
> > > > > > >> virtio / FUSE request, consuming the virtio kick at the same time. If
> > > > > > >> the vring's "enabled" flag is set, the virtio / FUSE request is
> > > > > > >> processed genuinely. If the vring's "enabled" flag is clear, then the
> > > > > > >> virtio / FUSE request is discarded.
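A stripped-down, self-contained C model of this race (pthreads plus an eventfd standing in for the virtio kick; the names are illustrative, not virtiofsd's actual Rust code):

#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/eventfd.h>

static atomic_bool enabled;     /* the vring's "enabled" flag */
static int kickfd;

static void *control_plane(void *arg)
{
    (void)arg;
    usleep(1000);               /* models unix-socket latency of SET_VRING_ENABLE */
    atomic_store(&enabled, true);
    return NULL;
}

static void *data_plane(void *arg)
{
    (void)arg;
    uint64_t n;
    read(kickfd, &n, sizeof(n));            /* consume the virtio kick */
    if (atomic_load(&enabled)) {
        printf("FUSE_INIT processed\n");
    } else {
        printf("FUSE_INIT discarded -- the guest now waits forever\n");
    }
    return NULL;
}

int main(void)
{
    pthread_t cp, dp;
    uint64_t one = 1;

    kickfd = eventfd(0, 0);
    pthread_create(&cp, NULL, control_plane, NULL);
    pthread_create(&dp, NULL, data_plane, NULL);

    /* the guest: queue_enable has been written (racing with cp above),
     * and the kick for FUSE_INIT follows immediately via eventfd */
    write(kickfd, &one, sizeof(one));

    pthread_join(cp, NULL);
    pthread_join(dp, NULL);
    return 0;
}

With the artificial latency in control_plane() the data-plane thread reliably wins, which is the dropped-FUSE_INIT deadlock described above.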
> > > > > > >
> > > > > > > Why is virtiofsd monitoring the virtqueue and discarding requests
> > > > > > > while it's disabled?
> > > > > >
> > > > > > That's what the vhost-user spec requires:
> > > > > >
> > > > > > https://qemu-project.gitlab.io/qemu/interop/vhost-user.html#ring-states
> > > > > >
> > > > > > """
> > > > > > started but disabled: the back-end must process the ring without causing
> > > > > > any side effects. For example, for a networking device, in the disabled
> > > > > > state the back-end must not supply any new RX packets, but must process
> > > > > > and discard any TX packets.
> > > > > > """
> > > > > >
> > > > > > This state is different from "stopped", where "the back-end must not
> > > > > > process the ring at all".
> > > > > >
> > > > > > The spec also says,
> > > > > >
> > > > > > """
> > > > > > If VHOST_USER_F_PROTOCOL_FEATURES has been negotiated, the ring is
> > > > > > initialized in a disabled state and is enabled by
> > > > > > VHOST_USER_SET_VRING_ENABLE with parameter 1.
> > > > > > """
> > > > > >
> > > > > > AFAICT virtiofsd follows this requirement.
> > > > >
> > > > > Hi Michael,
> > > > > You documented the disabled ring state in QEMU commit
> > > > > c61f09ed855b5009f816242ce281fd01586d4646 ("vhost-user: clarify start
> > > > > and enable") where virtio-net devices discard tx buffers. The disabled
> > > > > state seems to be specific to vhost-user and not covered in the VIRTIO
> > > > > specification.
> > > > >
> > > > > Do you remember what the purpose of the disabled state was? Why is it
> > > > > necessary to discard tx buffers instead of postponing ring processing
> > > > > until the virtqueue is enabled?
> > > > >
> > > > > My concern is that the semantics are unclear for virtqueue types that
> > > > > are different from virtio-net rx/tx. Even the virtio-net controlq
> > > > > would be problematic - should buffers be silently discarded with
> > > > > VIRTIO_NET_OK or should they fail?
> > > > >
> > > > > Thanks,
> > > > > Stefan
> > > >
> > > > I think I got it now.
> > > > This weird state happens when Linux first queues packets
> > > > on multiple queues, then changes max queues to 1; the queued packets
> > > > still need to be freed eventually.
> > >
> > > Can you explain what is happening in the guest driver, QEMU, and the
> > > vhost-user-net device in more detail? I don't understand the scenario.
> >
> > guest changes max vq pairs making it smaller
> > qemu disables ring
>
> The purpose of the "ignore rx, discard tx" semantics is still unclear
> to me. Can you explain why we do this?
>
> Stefan
We ignore rx since there's nothing we can do with it.
We discard tx because it was reported that some guests would
queue packets and then reduce the # of queues. They would then
be surprised if buffers queued on tx are never used.
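Concretely, the "ignore rx, discard tx" contract amounts to something like this toy C sketch, where the ring is reduced to a counter of pending TX buffers (illustrative only; a real back-end walks the virtqueue and marks each buffer used):

#include <stdbool.h>
#include <stdio.h>

static int tx_avail = 3;        /* TX buffers the guest already queued */

static void service_tx(bool ring_enabled)
{
    while (tx_avail > 0) {
        tx_avail--;             /* consume the buffer either way */
        if (ring_enabled) {
            printf("transmit packet, return buffer as used\n");
        } else {
            /* started but disabled: the packet is dropped, but the
             * buffer still comes back to the guest as used */
            printf("discard packet, return buffer as used\n");
        }
    }
}

static void service_rx(bool ring_enabled)
{
    if (!ring_enabled) {
        return;                 /* ignore rx: never supply new packets */
    }
    printf("fill rx buffers with received packets\n");
}

int main(void)
{
    service_tx(false);          /* queued tx buffers are not leaked */
    service_rx(false);
    return 0;
}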
> > > > Yes, I am not sure this can apply to devices or queue types
> > > > other than virtio net. Maybe.
> > > >
> > > > When we say:
> > > > must process the ring without causing any side effects.
> > > > then I think it would be better to say
> > > > must process the ring if it can be done without causing
> > > > guest visible side effects.
> > >
> > > Completing a tx buffer is guest-visible, so I'm confused by this statement.
> >
> > yes but it's not immediately guest visible whether packet was
> > transmitted or discarded.
> >
> > > > processing rx ring would have a side effect of causing
> > > > guest to get malformed buffers, so we don't process it.
> > >
> > > Why are they malformed? Do you mean the rx buffers are stale (the
> > > guest driver has changed the number of queues and doesn't expect to
> > > receive them anymore)?
> >
> > there's no way to consume an rx buffer without supplying
> > an rx packet to guest.
>
> Stefan
>
> > > > processing command queue - we can't fail for sure since
> > > > that is guest visible. but practically we don't do this
> > > > for cvq.
> > > >
> > > > what should happen for virtiofsd? I don't know -
> > > > I am guessing discarding would have a side effect
> > > > so should not happen.
> > > >
> > > >
> > > >
> > > >
> > > > > >
> > > > > > > This seems like a bug in the vhost-user backend to me.
> > > > > >
> > > > > > I didn't want to exclude that possibility; that's why I included Eugenio,
> > > > > > German, Liu Jiang, and Sergio in the CC list.
> > > > > >
> > > > > > >
> > > > > > > When the virtqueue is disabled, don't monitor the kickfd.
> > > > > > >
> > > > > > > When the virtqueue transitions from disabled to enabled, the control
> > > > > > > plane should self-trigger the kickfd so that any available buffers
> > > > > > > will be processed.
> > > > > > >
> > > > > > > QEMU uses this scheme to switch between vhost/IOThreads and built-in
> > > > > > > virtqueue kick processing.
> > > > > > >
> > > > > > > This approach is more robust than relying on buffers being enqueued after
> > > > > > > the virtqueue is enabled.
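A minimal C sketch of that scheme, assuming an epoll-based event loop and an eventfd kick; the helper names are hypothetical, not the vhost-user-backend crate's API:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>

static int epfd, kickfd;

/* The data plane only wakes up for a kickfd it is actually watching. */
static void watch_kick(bool on)
{
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = kickfd };
    epoll_ctl(epfd, on ? EPOLL_CTL_ADD : EPOLL_CTL_DEL, kickfd, &ev);
}

static void set_vring_enable(bool enable)
{
    if (!enable) {
        watch_kick(false);      /* disabled: stop monitoring the kickfd */
        return;
    }
    watch_kick(true);
    /* Self-trigger: buffers queued while we were not watching are
     * picked up on the next epoll_wait(), without a new guest kick. */
    uint64_t one = 1;
    write(kickfd, &one, sizeof(one));
}

int main(void)
{
    epfd = epoll_create1(0);
    kickfd = eventfd(0, EFD_NONBLOCK);

    set_vring_enable(true);     /* enable; the self-kick fires */

    struct epoll_event ev;
    if (epoll_wait(epfd, &ev, 1, 1000) == 1) {
        uint64_t n;
        read(ev.data.fd, &n, sizeof(n));
        printf("kick consumed, drain any already-available buffers\n");
    }
    return 0;
}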
> > > > > >
> > > > > > I'm happy to drop the series if the virtiofsd maintainers agree that the
> > > > > > bug is in virtiofsd, and can propose a design to fix it. (I do think
> > > > > > that such a fix would require an architectural change.)
> > > > > >
> > > > > > FWIW, my own interpretation of the vhost-user spec (see above) was that
> > > > > > virtiofsd was right to behave the way it did, and that there was simply
> > > > > > no way to prevent out-of-order delivery other than synchronizing the
> > > > > > guest end-to-end with the vhost-user backend, concerning
> > > > > > VHOST_USER_SET_VRING_ENABLE.
> > > > > >
> > > > > > This end-to-end synchronization is present "naturally" in vhost-net,
> > > > > > where ioctl()s are automatically synchronous -- in fact *all* operations
> > > > > > on the control plane are synchronous. (Which is just a different way to
> > > > > > say that the guest is tightly coupled with the control plane.)
> > > > > >
> > > > > > Note that there has been at least one race like this before; see commit
> > > > > > 699f2e535d93 ("vhost: make SET_VRING_ADDR, SET_FEATURES send replies",
> > > > > > 2021-09-04). Basically every pre-existent call to enforce_reply() is a
> > > > > > cover-up for the vhost-user spec turning (somewhat recklessly?) most
> > > > > > operations into async ones.
> > > > > >
> > > > > > At some point this became apparent and so the REPLY_ACK flag was
> > > > > > introduced; see commit ca525ce5618b ("vhost-user: Introduce a new
> > > > > > protocol feature REPLY_ACK.", 2016-08-10). (That commit doesn't go into
> > > > > > details, but I'm pretty sure there was a similar race around SET_MEM_TABLE!)
> > > > > >
> > > > > > BTW even if we drop this series for QEMU, I don't think it will have
> > > > > > been in vain. The first few patches are cleanups which could be merged
> > > > > > for their own sake. And the last patch is essentially the proof of the
> > > > > > problem statement / analysis. It can be considered an elaborate bug
> > > > > > report for virtiofsd, *if* we decide the bug is in virtiofsd. I did have
> > > > > > that avenue in mind as well, when writing the commit message / patch.
> > > > > >
> > > > > > For now I'm going to post v2 -- that's not to say that I'm dismissing
> > > > > > your feedback (see above!), just want to get the latest version on-list.
> > > > > >
> > > > > > Thanks!
> > > > > > Laszlo
> > > > > >
> > > > > > >
> > > > > > > Stefan
> > > > > > >
> > > > > > >>
> > > > > > >> Note that OVMF enables the queue *first*, and sends FUSE_INIT *second*.
> > > > > > >> However, if the data plane processor in virtiofsd wins the race, then it
> > > > > > >> sees the FUSE_INIT *before* the control plane processor took notice of
> > > > > > >> VHOST_USER_SET_VRING_ENABLE and green-lit the queue for the data plane
> > > > > > >> processor. Therefore the latter drops FUSE_INIT on the floor, and goes
> > > > > > >> back to waiting for further virtio / FUSE requests with epoll_wait.
> > > > > > >> Meanwhile OVMF is stuck waiting for the FUSE_INIT response -- a deadlock.
> > > > > > >>
> > > > > > >> The deadlock is not deterministic. OVMF hangs infrequently during first
> > > > > > >> boot. However, OVMF hangs almost certainly during reboots from the UEFI
> > > > > > >> shell.
> > > > > > >>
> > > > > > >> The race can be "reliably masked" by inserting a very small delay -- a
> > > > > > >> single debug message -- at the top of "VringEpollHandler::handle_event",
> > > > > > >> i.e., just before the data plane processor checks the "enabled" field of
> > > > > > >> the vring. That delay suffices for the control plane processor to act upon
> > > > > > >> VHOST_USER_SET_VRING_ENABLE.
> > > > > > >>
> > > > > > >> We can deterministically prevent the race in QEMU, by blocking OVMF inside
> > > > > > >> step (1.1) -- i.e., in the write to the "queue_enable" register -- until
> > > > > > >> VHOST_USER_SET_VRING_ENABLE actually *completes*. That way OVMF's VCPU
> > > > > > >> cannot advance to the FUSE_INIT submission before virtiofsd's control
> > > > > > >> plane processor takes notice of the queue being enabled.
> > > > > > >>
> > > > > > >> Wait for VHOST_USER_SET_VRING_ENABLE completion by:
> > > > > > >>
> > > > > > >> - setting the NEED_REPLY flag on VHOST_USER_SET_VRING_ENABLE, and waiting
> > > > > > >> for the reply, if the VHOST_USER_PROTOCOL_F_REPLY_ACK vhost-user feature
> > > > > > >> has been negotiated, or
> > > > > > >>
> > > > > > >> - performing a separate VHOST_USER_GET_FEATURES *exchange*, which requires
> > > > > > >> a backend response regardless of VHOST_USER_PROTOCOL_F_REPLY_ACK.
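A hedged sketch of the resulting control flow; the helpers below are printf-level stand-ins for the vhost-user socket plumbing (hypothetical names, not QEMU's functions):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Stand-ins for sending/receiving vhost-user messages on the socket. */
static int send_set_vring_enable(unsigned idx, bool on, bool need_reply)
{
    printf("SET_VRING_ENABLE index=%u num=%d need_reply=%d\n",
           idx, on, need_reply);
    return 0;
}
static int recv_reply_ack(void)      { return 0; }  /* u64 payload, 0 = success */
static int get_features(uint64_t *f) { *f = 0; return 0; }  /* always answered */

static int set_vring_enable_sync(unsigned idx, bool on, bool reply_ack)
{
    int r = send_set_vring_enable(idx, on, reply_ack);
    if (r < 0) {
        return r;
    }
    if (reply_ack) {
        /* REPLY_ACK negotiated: block until the back-end acks this message. */
        return recv_reply_ack();
    }
    /* Otherwise a GET_FEATURES exchange acts as a barrier: the back-end
     * must answer it, and it does so only after it has processed the
     * enable message queued ahead of it on the same socket. */
    uint64_t ignored;
    return get_features(&ignored);
}

int main(void)
{
    return set_vring_enable_sync(0, true, false);
}

Either way, the guest's "queue_enable" write does not complete before the back-end has observed the state change.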
> > > > > > >>
> > > > > > >> Cc: "Michael S. Tsirkin" <mst@redhat.com> (supporter:vhost)
> > > > > > >> Cc: Eugenio Perez Martin <eperezma@redhat.com>
> > > > > > >> Cc: German Maglione <gmaglione@redhat.com>
> > > > > > >> Cc: Liu Jiang <gerry@linux.alibaba.com>
> > > > > > >> Cc: Sergio Lopez Pascual <slp@redhat.com>
> > > > > > >> Cc: Stefano Garzarella <sgarzare@redhat.com>
> > > > > > >> Signed-off-by: Laszlo Ersek <lersek@redhat.com>
> > > > > > >> ---
> > > > > > >> hw/virtio/vhost-user.c | 2 +-
> > > > > > >> 1 file changed, 1 insertion(+), 1 deletion(-)
> > > > > > >>
> > > > > > >> diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
> > > > > > >> index beb4b832245e..01e0ca90c538 100644
> > > > > > >> --- a/hw/virtio/vhost-user.c
> > > > > > >> +++ b/hw/virtio/vhost-user.c
> > > > > > >> @@ -1235,7 +1235,7 @@ static int vhost_user_set_vring_enable(struct vhost_dev *dev, int enable)
> > > > > > >> .num = enable,
> > > > > > >> };
> > > > > > >>
> > > > > > >> - ret = vhost_set_vring(dev, VHOST_USER_SET_VRING_ENABLE, &state, false);
> > > > > > >> + ret = vhost_set_vring(dev, VHOST_USER_SET_VRING_ENABLE, &state, true);
> > > > > > >> if (ret < 0) {
> > > > > > >> /*
> > > > > > >> * Restoring the previous state is likely infeasible, as well as
> > > > > > >
> > > > > >
> > > >
> >
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH 7/7] vhost-user: call VHOST_USER_SET_VRING_ENABLE synchronously
2023-10-03 13:08 ` Stefan Hajnoczi
2023-10-03 13:23 ` Laszlo Ersek
@ 2023-10-03 14:40 ` Michael S. Tsirkin
2023-10-03 15:45 ` Stefan Hajnoczi
1 sibling, 1 reply; 58+ messages in thread
From: Michael S. Tsirkin @ 2023-10-03 14:40 UTC (permalink / raw)
To: Stefan Hajnoczi
Cc: qemu-devel, Laszlo Ersek, Eugenio Perez Martin, German Maglione,
Liu Jiang, Sergio Lopez Pascual, Stefano Garzarella, Jason Wang
On Tue, Oct 03, 2023 at 09:08:15AM -0400, Stefan Hajnoczi wrote:
> On Tue, 3 Oct 2023 at 08:27, Michael S. Tsirkin <mst@redhat.com> wrote:
> >
> > On Mon, Oct 02, 2023 at 05:13:26PM -0400, Stefan Hajnoczi wrote:
> > > One more question:
> > >
> > > Why is the disabled state not needed by regular (non-vhost) virtio-net devices?
> >
> > Tap does the same - it purges queued packets:
> >
> > int tap_disable(NetClientState *nc)
> > {
> > TAPState *s = DO_UPCAST(TAPState, nc, nc);
> > int ret;
> >
> > if (s->enabled == 0) {
> > return 0;
> > } else {
> > ret = tap_fd_disable(s->fd);
> > if (ret == 0) {
> > qemu_purge_queued_packets(nc);
> > s->enabled = false;
> > tap_update_fd_handler(s);
> > }
> > return ret;
> > }
> > }
>
> tap_disable() is not equivalent to the vhost-user "started but
> disabled" ring state. tap_disable() is a synchronous one-time action,
> while "started but disabled" is a continuous state.
Well, yes. But practically guests do not queue too many buffers
after disabling a queue. I don't know whether they reliably don't,
or whether it's racy and we just didn't notice it yet - I think it
was mostly dpdk that had this, and that's usually
used with vhost-user.
> The "started but disabled" ring state isn't needed to achieve this.
> The back-end can just drop tx buffers upon receiving
> VHOST_USER_SET_VRING_ENABLE .num=0.
yes, maybe that would have been a better way to do this.
> The history of the spec is curious. VHOST_USER_SET_VRING_ENABLE was
> introduced before the "started but disabled" state was defined,
> and it explicitly mentions tap attach/detach:
>
> commit 7263a0ad7899994b719ebed736a1119cc2e08110
> Author: Changchun Ouyang <changchun.ouyang@intel.com>
> Date: Wed Sep 23 12:20:01 2015 +0800
>
> vhost-user: add a new message to disable/enable a specific virt queue.
>
> Add a new message, VHOST_USER_SET_VRING_ENABLE, to enable or disable
> a specific virt queue, which is similar to attach/detach queue for
> tap device.
>
> and then later:
>
> commit c61f09ed855b5009f816242ce281fd01586d4646
> Author: Michael S. Tsirkin <mst@redhat.com>
> Date: Mon Nov 23 12:48:52 2015 +0200
>
> vhost-user: clarify start and enable
>
> >
> > what about non tap backends? I suspect they just aren't
> > used widely with multiqueue so no one noticed.
>
> I still don't understand why "started but disabled" is needed instead
> of just two ring states: enabled and disabled.
With dropping packets when ring is disabled? Maybe that would
have been enough. I also failed to realize it's specific to
net, seemed generic to me :(
> It seems like the cleanest path going forward is to keep the "ignore
> rx, discard tx" semantics for virtio-net devices but to clarify in the
> spec that other device types do not process the ring:
>
> "
> * started but disabled: the back-end must not process the ring. For legacy
> reasons there is an exception for the networking device, where the
> back-end must process and discard any TX packets and not process
> other rings.
> "
>
> What do you think?
>
> Stefan
Okay... I hope we are not missing any devices which need virtio net
semantics. Care to check them all?
--
MST
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH 7/7] vhost-user: call VHOST_USER_SET_VRING_ENABLE synchronously
2023-08-27 18:29 ` [PATCH 7/7] vhost-user: call VHOST_USER_SET_VRING_ENABLE synchronously Laszlo Ersek
` (2 preceding siblings ...)
2023-08-30 12:10 ` Stefan Hajnoczi
@ 2023-10-03 14:41 ` Michael S. Tsirkin
2023-10-03 15:55 ` Stefan Hajnoczi
3 siblings, 1 reply; 58+ messages in thread
From: Michael S. Tsirkin @ 2023-10-03 14:41 UTC (permalink / raw)
To: Laszlo Ersek
Cc: qemu-devel, Eugenio Perez Martin, German Maglione, Liu Jiang,
Sergio Lopez Pascual, Stefano Garzarella
On Sun, Aug 27, 2023 at 08:29:37PM +0200, Laszlo Ersek wrote:
> (1) The virtio-1.0 specification
> <http://docs.oasis-open.org/virtio/virtio/v1.0/virtio-v1.0.html> writes:
>
> > 3 General Initialization And Device Operation
> > 3.1 Device Initialization
> > 3.1.1 Driver Requirements: Device Initialization
> >
> > [...]
> >
> > 7. Perform device-specific setup, including discovery of virtqueues for
> > the device, optional per-bus setup, reading and possibly writing the
> > device’s virtio configuration space, and population of virtqueues.
> >
> > 8. Set the DRIVER_OK status bit. At this point the device is “live”.
>
> and
>
> > 4 Virtio Transport Options
> > 4.1 Virtio Over PCI Bus
> > 4.1.4 Virtio Structure PCI Capabilities
> > 4.1.4.3 Common configuration structure layout
> > 4.1.4.3.2 Driver Requirements: Common configuration structure layout
> >
> > [...]
> >
> > The driver MUST configure the other virtqueue fields before enabling the
> > virtqueue with queue_enable.
> >
> > [...]
>
> These together mean that the following sub-sequence of steps is valid for
> a virtio-1.0 guest driver:
>
> (1.1) set "queue_enable" for the needed queues as the final part of device
> initialization step (7),
>
> (1.2) set DRIVER_OK in step (8),
>
> (1.3) immediately start sending virtio requests to the device.
>
> (2) When vhost-user is enabled, and the VHOST_USER_F_PROTOCOL_FEATURES
> special virtio feature is negotiated, then virtio rings start in disabled
> state, according to
> <https://qemu-project.gitlab.io/qemu/interop/vhost-user.html#ring-states>.
> In this case, explicit VHOST_USER_SET_VRING_ENABLE messages are needed for
> enabling vrings.
>
> Therefore setting "queue_enable" from the guest (1.1) is a *control plane*
> operation, which travels from the guest through QEMU to the vhost-user
> backend, using a unix domain socket.
>
> Whereas sending a virtio request (1.3) is a *data plane* operation, which
> evades QEMU -- it travels from guest to the vhost-user backend via
> eventfd.
>
> This means that steps (1.1) and (1.3) travel through different channels,
> and their relative order can be reversed, as perceived by the vhost-user
> backend.
>
> That's exactly what happens when OVMF's virtiofs driver (VirtioFsDxe) runs
> against the Rust-language virtiofsd version 1.7.2. (Which uses version
> 0.10.1 of the vhost-user-backend crate, and version 0.8.1 of the vhost
> crate.)
>
> Namely, when VirtioFsDxe binds a virtiofs device, it goes through the
> device initialization steps (i.e., control plane operations), and
> immediately sends a FUSE_INIT request too (i.e., performs a data plane
> operation). In the Rust-language virtiofsd, this creates a race between
> two components that run *concurrently*, i.e., in different threads or
> processes:
>
> - Control plane, handling vhost-user protocol messages:
>
> The "VhostUserSlaveReqHandlerMut::set_vring_enable" method
> [crates/vhost-user-backend/src/handler.rs] handles
> VHOST_USER_SET_VRING_ENABLE messages, and updates each vring's "enabled"
> flag according to the message processed.
>
> - Data plane, handling virtio / FUSE requests:
>
> The "VringEpollHandler::handle_event" method
> [crates/vhost-user-backend/src/event_loop.rs] handles the incoming
> virtio / FUSE request, consuming the virtio kick at the same time. If
> the vring's "enabled" flag is set, the virtio / FUSE request is
> processed genuinely. If the vring's "enabled" flag is clear, then the
> virtio / FUSE request is discarded.
>
> Note that OVMF enables the queue *first*, and sends FUSE_INIT *second*.
> However, if the data plane processor in virtiofsd wins the race, then it
> sees the FUSE_INIT *before* the control plane processor took notice of
> VHOST_USER_SET_VRING_ENABLE and green-lit the queue for the data plane
> processor. Therefore the latter drops FUSE_INIT on the floor, and goes
> back to waiting for further virtio / FUSE requests with epoll_wait.
> > Meanwhile OVMF is stuck waiting for the FUSE_INIT response -- a deadlock.
>
> The deadlock is not deterministic. OVMF hangs infrequently during first
> boot. However, OVMF hangs almost certainly during reboots from the UEFI
> shell.
>
> The race can be "reliably masked" by inserting a very small delay -- a
> single debug message -- at the top of "VringEpollHandler::handle_event",
> i.e., just before the data plane processor checks the "enabled" field of
> the vring. That delay suffices for the control plane processor to act upon
> VHOST_USER_SET_VRING_ENABLE.
>
> We can deterministically prevent the race in QEMU, by blocking OVMF inside
> step (1.1) -- i.e., in the write to the "queue_enable" register -- until
> VHOST_USER_SET_VRING_ENABLE actually *completes*. That way OVMF's VCPU
> cannot advance to the FUSE_INIT submission before virtiofsd's control
> plane processor takes notice of the queue being enabled.
>
> Wait for VHOST_USER_SET_VRING_ENABLE completion by:
>
> - setting the NEED_REPLY flag on VHOST_USER_SET_VRING_ENABLE, and waiting
> for the reply, if the VHOST_USER_PROTOCOL_F_REPLY_ACK vhost-user feature
> has been negotiated, or
>
> - performing a separate VHOST_USER_GET_FEATURES *exchange*, which requires
> a backend response regardless of VHOST_USER_PROTOCOL_F_REPLY_ACK.
>
> Cc: "Michael S. Tsirkin" <mst@redhat.com> (supporter:vhost)
> Cc: Eugenio Perez Martin <eperezma@redhat.com>
> Cc: German Maglione <gmaglione@redhat.com>
> Cc: Liu Jiang <gerry@linux.alibaba.com>
> Cc: Sergio Lopez Pascual <slp@redhat.com>
> Cc: Stefano Garzarella <sgarzare@redhat.com>
> Signed-off-by: Laszlo Ersek <lersek@redhat.com>
So you want me to hold on to this patch 7/7 for now?
And maybe merge rest of the patchset?
> ---
> hw/virtio/vhost-user.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
> index beb4b832245e..01e0ca90c538 100644
> --- a/hw/virtio/vhost-user.c
> +++ b/hw/virtio/vhost-user.c
> @@ -1235,7 +1235,7 @@ static int vhost_user_set_vring_enable(struct vhost_dev *dev, int enable)
> .num = enable,
> };
>
> - ret = vhost_set_vring(dev, VHOST_USER_SET_VRING_ENABLE, &state, false);
> + ret = vhost_set_vring(dev, VHOST_USER_SET_VRING_ENABLE, &state, true);
> if (ret < 0) {
> /*
> * Restoring the previous state is likely infeasible, as well as
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH 7/7] vhost-user: call VHOST_USER_SET_VRING_ENABLE synchronously
2023-10-03 14:40 ` Michael S. Tsirkin
@ 2023-10-03 15:45 ` Stefan Hajnoczi
0 siblings, 0 replies; 58+ messages in thread
From: Stefan Hajnoczi @ 2023-10-03 15:45 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: qemu-devel, Laszlo Ersek, Eugenio Perez Martin, German Maglione,
Liu Jiang, Sergio Lopez Pascual, Stefano Garzarella, Jason Wang
On Tue, 3 Oct 2023 at 10:40, Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Tue, Oct 03, 2023 at 09:08:15AM -0400, Stefan Hajnoczi wrote:
> > On Tue, 3 Oct 2023 at 08:27, Michael S. Tsirkin <mst@redhat.com> wrote:
> > >
> > > On Mon, Oct 02, 2023 at 05:13:26PM -0400, Stefan Hajnoczi wrote:
> > > > One more question:
> > > >
> > > > Why is the disabled state not needed by regular (non-vhost) virtio-net devices?
> > >
> > > Tap does the same - it purges queued packets:
> > >
> > > int tap_disable(NetClientState *nc)
> > > {
> > > TAPState *s = DO_UPCAST(TAPState, nc, nc);
> > > int ret;
> > >
> > > if (s->enabled == 0) {
> > > return 0;
> > > } else {
> > > ret = tap_fd_disable(s->fd);
> > > if (ret == 0) {
> > > qemu_purge_queued_packets(nc);
> > > s->enabled = false;
> > > tap_update_fd_handler(s);
> > > }
> > > return ret;
> > > }
> > > }
> >
> > tap_disable() is not equivalent to the vhost-user "started but
> > disabled" ring state. tap_disable() is a synchronous one-time action,
> > while "started but disabled" is a continuous state.
>
> Well, yes. But practically guests do not queue too many buffers
> after disabling a queue. I don't know whether they reliably don't,
> or whether it's racy and we just didn't notice it yet - I think it
> was mostly dpdk that had this, and that's usually
> used with vhost-user.
>
> > The "started but disabled" ring state isn't needed to achieve this.
> > The back-end can just drop tx buffers upon receiving
> > VHOST_USER_SET_VRING_ENABLE .num=0.
>
> yes, maybe that would have been a better way to do this.
>
>
> > The history of the spec is curious. VHOST_USER_SET_VRING_ENABLE was
> > introduced before the "started but disabled" state was defined,
> > and it explicitly mentions tap attach/detach:
> >
> > commit 7263a0ad7899994b719ebed736a1119cc2e08110
> > Author: Changchun Ouyang <changchun.ouyang@intel.com>
> > Date: Wed Sep 23 12:20:01 2015 +0800
> >
> > vhost-user: add a new message to disable/enable a specific virt queue.
> >
> > Add a new message, VHOST_USER_SET_VRING_ENABLE, to enable or disable
> > a specific virt queue, which is similar to attach/detach queue for
> > tap device.
> >
> > and then later:
> >
> > commit c61f09ed855b5009f816242ce281fd01586d4646
> > Author: Michael S. Tsirkin <mst@redhat.com>
> > Date: Mon Nov 23 12:48:52 2015 +0200
> >
> > vhost-user: clarify start and enable
> >
> > >
> > > what about non tap backends? I suspect they just aren't
> > > used widely with multiqueue so no one noticed.
> >
> > I still don't understand why "started but disabled" is needed instead
> > of just two ring states: enabled and disabled.
>
> With dropping packets when ring is disabled? Maybe that would
> have been enough. I also failed to realize it's specific to
> net, seemed generic to me :(
>
> > It seems like the cleanest path going forward is to keep the "ignore
> > rx, discard tx" semantics for virtio-net devices but to clarify in the
> > spec that other device types do not process the ring:
> >
> > "
> > * started but disabled: the back-end must not process the ring. For legacy
> > reasons there is an exception for the networking device, where the
> > back-end must process and discard any TX packets and not process
> > other rings.
> > "
> >
> > What do you think?
> >
> > Stefan
>
> Okay... I hope we are not missing any devices which need virtio net
> semantics. Care to check them all?
Sure, I will check them. I'm very curious myself whether virtio-vsock
is affected (it has rx and tx queues).
I will report back.
Stefan
>
> --
> MST
>
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH 7/7] vhost-user: call VHOST_USER_SET_VRING_ENABLE synchronously
2023-10-03 14:41 ` Michael S. Tsirkin
@ 2023-10-03 15:55 ` Stefan Hajnoczi
2023-10-04 10:15 ` Laszlo Ersek
2023-10-04 10:17 ` Laszlo Ersek
0 siblings, 2 replies; 58+ messages in thread
From: Stefan Hajnoczi @ 2023-10-03 15:55 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Laszlo Ersek, qemu-devel, Eugenio Perez Martin, German Maglione,
Liu Jiang, Sergio Lopez Pascual, Stefano Garzarella
On Tue, 3 Oct 2023 at 10:41, Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Sun, Aug 27, 2023 at 08:29:37PM +0200, Laszlo Ersek wrote:
> > (1) The virtio-1.0 specification
> > <http://docs.oasis-open.org/virtio/virtio/v1.0/virtio-v1.0.html> writes:
> >
> > > 3 General Initialization And Device Operation
> > > 3.1 Device Initialization
> > > 3.1.1 Driver Requirements: Device Initialization
> > >
> > > [...]
> > >
> > > 7. Perform device-specific setup, including discovery of virtqueues for
> > > the device, optional per-bus setup, reading and possibly writing the
> > > device’s virtio configuration space, and population of virtqueues.
> > >
> > > 8. Set the DRIVER_OK status bit. At this point the device is “live”.
> >
> > and
> >
> > > 4 Virtio Transport Options
> > > 4.1 Virtio Over PCI Bus
> > > 4.1.4 Virtio Structure PCI Capabilities
> > > 4.1.4.3 Common configuration structure layout
> > > 4.1.4.3.2 Driver Requirements: Common configuration structure layout
> > >
> > > [...]
> > >
> > > The driver MUST configure the other virtqueue fields before enabling the
> > > virtqueue with queue_enable.
> > >
> > > [...]
> >
> > These together mean that the following sub-sequence of steps is valid for
> > a virtio-1.0 guest driver:
> >
> > (1.1) set "queue_enable" for the needed queues as the final part of device
> > initialization step (7),
> >
> > (1.2) set DRIVER_OK in step (8),
> >
> > (1.3) immediately start sending virtio requests to the device.
> >
> > (2) When vhost-user is enabled, and the VHOST_USER_F_PROTOCOL_FEATURES
> > special virtio feature is negotiated, then virtio rings start in disabled
> > state, according to
> > <https://qemu-project.gitlab.io/qemu/interop/vhost-user.html#ring-states>.
> > In this case, explicit VHOST_USER_SET_VRING_ENABLE messages are needed for
> > enabling vrings.
> >
> > Therefore setting "queue_enable" from the guest (1.1) is a *control plane*
> > operation, which travels from the guest through QEMU to the vhost-user
> > backend, using a unix domain socket.
> >
> > Whereas sending a virtio request (1.3) is a *data plane* operation, which
> > evades QEMU -- it travels from guest to the vhost-user backend via
> > eventfd.
> >
> > This means that steps (1.1) and (1.3) travel through different channels,
> > and their relative order can be reversed, as perceived by the vhost-user
> > backend.
> >
> > That's exactly what happens when OVMF's virtiofs driver (VirtioFsDxe) runs
> > against the Rust-language virtiofsd version 1.7.2. (Which uses version
> > 0.10.1 of the vhost-user-backend crate, and version 0.8.1 of the vhost
> > crate.)
> >
> > Namely, when VirtioFsDxe binds a virtiofs device, it goes through the
> > device initialization steps (i.e., control plane operations), and
> > immediately sends a FUSE_INIT request too (i.e., performs a data plane
> > operation). In the Rust-language virtiofsd, this creates a race between
> > two components that run *concurrently*, i.e., in different threads or
> > processes:
> >
> > - Control plane, handling vhost-user protocol messages:
> >
> > The "VhostUserSlaveReqHandlerMut::set_vring_enable" method
> > [crates/vhost-user-backend/src/handler.rs] handles
> > VHOST_USER_SET_VRING_ENABLE messages, and updates each vring's "enabled"
> > flag according to the message processed.
> >
> > - Data plane, handling virtio / FUSE requests:
> >
> > The "VringEpollHandler::handle_event" method
> > [crates/vhost-user-backend/src/event_loop.rs] handles the incoming
> > virtio / FUSE request, consuming the virtio kick at the same time. If
> > the vring's "enabled" flag is set, the virtio / FUSE request is
> > processed genuinely. If the vring's "enabled" flag is clear, then the
> > virtio / FUSE request is discarded.
> >
> > Note that OVMF enables the queue *first*, and sends FUSE_INIT *second*.
> > However, if the data plane processor in virtiofsd wins the race, then it
> > sees the FUSE_INIT *before* the control plane processor took notice of
> > VHOST_USER_SET_VRING_ENABLE and green-lit the queue for the data plane
> > processor. Therefore the latter drops FUSE_INIT on the floor, and goes
> > back to waiting for further virtio / FUSE requests with epoll_wait.
> > > Meanwhile OVMF is stuck waiting for the FUSE_INIT response -- a deadlock.
> >
> > The deadlock is not deterministic. OVMF hangs infrequently during first
> > boot. However, OVMF hangs almost certainly during reboots from the UEFI
> > shell.
> >
> > The race can be "reliably masked" by inserting a very small delay -- a
> > single debug message -- at the top of "VringEpollHandler::handle_event",
> > i.e., just before the data plane processor checks the "enabled" field of
> > the vring. That delay suffices for the control plane processor to act upon
> > VHOST_USER_SET_VRING_ENABLE.
> >
> > We can deterministically prevent the race in QEMU, by blocking OVMF inside
> > step (1.1) -- i.e., in the write to the "queue_enable" register -- until
> > VHOST_USER_SET_VRING_ENABLE actually *completes*. That way OVMF's VCPU
> > cannot advance to the FUSE_INIT submission before virtiofsd's control
> > plane processor takes notice of the queue being enabled.
> >
> > Wait for VHOST_USER_SET_VRING_ENABLE completion by:
> >
> > - setting the NEED_REPLY flag on VHOST_USER_SET_VRING_ENABLE, and waiting
> > for the reply, if the VHOST_USER_PROTOCOL_F_REPLY_ACK vhost-user feature
> > has been negotiated, or
> >
> > - performing a separate VHOST_USER_GET_FEATURES *exchange*, which requires
> > a backend response regardless of VHOST_USER_PROTOCOL_F_REPLY_ACK.
> >
> > Cc: "Michael S. Tsirkin" <mst@redhat.com> (supporter:vhost)
> > Cc: Eugenio Perez Martin <eperezma@redhat.com>
> > Cc: German Maglione <gmaglione@redhat.com>
> > Cc: Liu Jiang <gerry@linux.alibaba.com>
> > Cc: Sergio Lopez Pascual <slp@redhat.com>
> > Cc: Stefano Garzarella <sgarzare@redhat.com>
> > Signed-off-by: Laszlo Ersek <lersek@redhat.com>
>
>
> So you want me to hold on to this patch 7/7 for now?
> And maybe merge rest of the patchset?
Up to Laszlo, but I wanted to mention that I support merging this
patch series. A ring has not been enabled/disabled until the back-end
replies, so I think this patch series makes sense.
Stefan
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH 7/7] vhost-user: call VHOST_USER_SET_VRING_ENABLE synchronously
2023-10-03 15:55 ` Stefan Hajnoczi
@ 2023-10-04 10:15 ` Laszlo Ersek
2023-10-04 16:30 ` Michael S. Tsirkin
2023-10-04 10:17 ` Laszlo Ersek
1 sibling, 1 reply; 58+ messages in thread
From: Laszlo Ersek @ 2023-10-04 10:15 UTC (permalink / raw)
To: Stefan Hajnoczi, Michael S. Tsirkin
Cc: qemu-devel, Eugenio Perez Martin, German Maglione, Liu Jiang,
Sergio Lopez Pascual, Stefano Garzarella
On 10/3/23 17:55, Stefan Hajnoczi wrote:
> On Tue, 3 Oct 2023 at 10:41, Michael S. Tsirkin <mst@redhat.com> wrote:
>>
>> On Sun, Aug 27, 2023 at 08:29:37PM +0200, Laszlo Ersek wrote:
>>> (1) The virtio-1.0 specification
>>> <http://docs.oasis-open.org/virtio/virtio/v1.0/virtio-v1.0.html> writes:
>>>
>>>> 3 General Initialization And Device Operation
>>>> 3.1 Device Initialization
>>>> 3.1.1 Driver Requirements: Device Initialization
>>>>
>>>> [...]
>>>>
>>>> 7. Perform device-specific setup, including discovery of virtqueues for
>>>> the device, optional per-bus setup, reading and possibly writing the
>>>> device’s virtio configuration space, and population of virtqueues.
>>>>
>>>> 8. Set the DRIVER_OK status bit. At this point the device is “live”.
>>>
>>> and
>>>
>>>> 4 Virtio Transport Options
>>>> 4.1 Virtio Over PCI Bus
>>>> 4.1.4 Virtio Structure PCI Capabilities
>>>> 4.1.4.3 Common configuration structure layout
>>>> 4.1.4.3.2 Driver Requirements: Common configuration structure layout
>>>>
>>>> [...]
>>>>
>>>> The driver MUST configure the other virtqueue fields before enabling the
>>>> virtqueue with queue_enable.
>>>>
>>>> [...]
>>>
>>> These together mean that the following sub-sequence of steps is valid for
>>> a virtio-1.0 guest driver:
>>>
>>> (1.1) set "queue_enable" for the needed queues as the final part of device
>>> initialization step (7),
>>>
>>> (1.2) set DRIVER_OK in step (8),
>>>
>>> (1.3) immediately start sending virtio requests to the device.
>>>
>>> (2) When vhost-user is enabled, and the VHOST_USER_F_PROTOCOL_FEATURES
>>> special virtio feature is negotiated, then virtio rings start in disabled
>>> state, according to
>>> <https://qemu-project.gitlab.io/qemu/interop/vhost-user.html#ring-states>.
>>> In this case, explicit VHOST_USER_SET_VRING_ENABLE messages are needed for
>>> enabling vrings.
>>>
>>> Therefore setting "queue_enable" from the guest (1.1) is a *control plane*
>>> operation, which travels from the guest through QEMU to the vhost-user
>>> backend, using a unix domain socket.
>>>
>>> Whereas sending a virtio request (1.3) is a *data plane* operation, which
>>> evades QEMU -- it travels from guest to the vhost-user backend via
>>> eventfd.
>>>
>>> This means that steps (1.1) and (1.3) travel through different channels,
>>> and their relative order can be reversed, as perceived by the vhost-user
>>> backend.
>>>
>>> That's exactly what happens when OVMF's virtiofs driver (VirtioFsDxe) runs
>>> against the Rust-language virtiofsd version 1.7.2. (Which uses version
>>> 0.10.1 of the vhost-user-backend crate, and version 0.8.1 of the vhost
>>> crate.)
>>>
>>> Namely, when VirtioFsDxe binds a virtiofs device, it goes through the
>>> device initialization steps (i.e., control plane operations), and
>>> immediately sends a FUSE_INIT request too (i.e., performs a data plane
>>> operation). In the Rust-language virtiofsd, this creates a race between
>>> two components that run *concurrently*, i.e., in different threads or
>>> processes:
>>>
>>> - Control plane, handling vhost-user protocol messages:
>>>
>>> The "VhostUserSlaveReqHandlerMut::set_vring_enable" method
>>> [crates/vhost-user-backend/src/handler.rs] handles
>>> VHOST_USER_SET_VRING_ENABLE messages, and updates each vring's "enabled"
>>> flag according to the message processed.
>>>
>>> - Data plane, handling virtio / FUSE requests:
>>>
>>> The "VringEpollHandler::handle_event" method
>>> [crates/vhost-user-backend/src/event_loop.rs] handles the incoming
>>> virtio / FUSE request, consuming the virtio kick at the same time. If
>>> the vring's "enabled" flag is set, the virtio / FUSE request is
>>> processed genuinely. If the vring's "enabled" flag is clear, then the
>>> virtio / FUSE request is discarded.
>>>
>>> Note that OVMF enables the queue *first*, and sends FUSE_INIT *second*.
>>> However, if the data plane processor in virtiofsd wins the race, then it
>>> sees the FUSE_INIT *before* the control plane processor took notice of
>>> VHOST_USER_SET_VRING_ENABLE and green-lit the queue for the data plane
>>> processor. Therefore the latter drops FUSE_INIT on the floor, and goes
>>> back to waiting for further virtio / FUSE requests with epoll_wait.
>>> Meanwhile OVMF is stuck waiting for the FUSE_INIT response -- a deadlock.
>>>
>>> The deadlock is not deterministic. OVMF hangs infrequently during first
>>> boot. However, OVMF hangs almost certainly during reboots from the UEFI
>>> shell.
>>>
>>> The race can be "reliably masked" by inserting a very small delay -- a
>>> single debug message -- at the top of "VringEpollHandler::handle_event",
>>> i.e., just before the data plane processor checks the "enabled" field of
>>> the vring. That delay suffices for the control plane processor to act upon
>>> VHOST_USER_SET_VRING_ENABLE.
>>>
>>> We can deterministically prevent the race in QEMU, by blocking OVMF inside
>>> step (1.1) -- i.e., in the write to the "queue_enable" register -- until
>>> VHOST_USER_SET_VRING_ENABLE actually *completes*. That way OVMF's VCPU
>>> cannot advance to the FUSE_INIT submission before virtiofsd's control
>>> plane processor takes notice of the queue being enabled.
>>>
>>> Wait for VHOST_USER_SET_VRING_ENABLE completion by:
>>>
>>> - setting the NEED_REPLY flag on VHOST_USER_SET_VRING_ENABLE, and waiting
>>> for the reply, if the VHOST_USER_PROTOCOL_F_REPLY_ACK vhost-user feature
>>> has been negotiated, or
>>>
>>> - performing a separate VHOST_USER_GET_FEATURES *exchange*, which requires
>>> a backend response regardless of VHOST_USER_PROTOCOL_F_REPLY_ACK.
>>>
>>> Cc: "Michael S. Tsirkin" <mst@redhat.com> (supporter:vhost)
>>> Cc: Eugenio Perez Martin <eperezma@redhat.com>
>>> Cc: German Maglione <gmaglione@redhat.com>
>>> Cc: Liu Jiang <gerry@linux.alibaba.com>
>>> Cc: Sergio Lopez Pascual <slp@redhat.com>
>>> Cc: Stefano Garzarella <sgarzare@redhat.com>
>>> Signed-off-by: Laszlo Ersek <lersek@redhat.com>
>>
>>
>> So you want me to hold on to this patch 7/7 for now?
>> And maybe merge rest of the patchset?
>
> Up to Laszlo, but I wanted to mention that I support merging this
> patch series. A ring has not been enabled/disabled until the back-end
> replies, so I think this patch series makes sense.
Sorry, I didn't get to see this part of the discussion yesterday, and
now I see that Michael has gone ahead with a PR that contains v2 of this
set. The night before yesterday I posted v3
<https://patchwork.ozlabs.org/project/qemu-devel/cover/20231002203221.17241-1-lersek@redhat.com/>,
with commit message updates / improvements only (based on feedback), so
please merge that one.
Thanks!
Laszlo
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH 7/7] vhost-user: call VHOST_USER_SET_VRING_ENABLE synchronously
2023-10-03 15:55 ` Stefan Hajnoczi
2023-10-04 10:15 ` Laszlo Ersek
@ 2023-10-04 10:17 ` Laszlo Ersek
1 sibling, 0 replies; 58+ messages in thread
From: Laszlo Ersek @ 2023-10-04 10:17 UTC (permalink / raw)
To: Stefan Hajnoczi, Michael S. Tsirkin
Cc: qemu-devel, Eugenio Perez Martin, German Maglione, Liu Jiang,
Sergio Lopez Pascual, Stefano Garzarella
On 10/3/23 17:55, Stefan Hajnoczi wrote:
> On Tue, 3 Oct 2023 at 10:41, Michael S. Tsirkin <mst@redhat.com> wrote:
>>
>> On Sun, Aug 27, 2023 at 08:29:37PM +0200, Laszlo Ersek wrote:
>>> (1) The virtio-1.0 specification
>>> <http://docs.oasis-open.org/virtio/virtio/v1.0/virtio-v1.0.html> writes:
>>>
>>>> 3 General Initialization And Device Operation
>>>> 3.1 Device Initialization
>>>> 3.1.1 Driver Requirements: Device Initialization
>>>>
>>>> [...]
>>>>
>>>> 7. Perform device-specific setup, including discovery of virtqueues for
>>>> the device, optional per-bus setup, reading and possibly writing the
>>>> device’s virtio configuration space, and population of virtqueues.
>>>>
>>>> 8. Set the DRIVER_OK status bit. At this point the device is “live”.
>>>
>>> and
>>>
>>>> 4 Virtio Transport Options
>>>> 4.1 Virtio Over PCI Bus
>>>> 4.1.4 Virtio Structure PCI Capabilities
>>>> 4.1.4.3 Common configuration structure layout
>>>> 4.1.4.3.2 Driver Requirements: Common configuration structure layout
>>>>
>>>> [...]
>>>>
>>>> The driver MUST configure the other virtqueue fields before enabling the
>>>> virtqueue with queue_enable.
>>>>
>>>> [...]
>>>
>>> These together mean that the following sub-sequence of steps is valid for
>>> a virtio-1.0 guest driver:
>>>
>>> (1.1) set "queue_enable" for the needed queues as the final part of device
>>> initialization step (7),
>>>
>>> (1.2) set DRIVER_OK in step (8),
>>>
>>> (1.3) immediately start sending virtio requests to the device.
>>>
>>> (2) When vhost-user is enabled, and the VHOST_USER_F_PROTOCOL_FEATURES
>>> special virtio feature is negotiated, then virtio rings start in disabled
>>> state, according to
>>> <https://qemu-project.gitlab.io/qemu/interop/vhost-user.html#ring-states>.
>>> In this case, explicit VHOST_USER_SET_VRING_ENABLE messages are needed for
>>> enabling vrings.
>>>
>>> Therefore setting "queue_enable" from the guest (1.1) is a *control plane*
>>> operation, which travels from the guest through QEMU to the vhost-user
>>> backend, using a unix domain socket.
>>>
>>> Whereas sending a virtio request (1.3) is a *data plane* operation, which
>>> evades QEMU -- it travels from guest to the vhost-user backend via
>>> eventfd.
>>>
>>> This means that steps (1.1) and (1.3) travel through different channels,
>>> and their relative order can be reversed, as perceived by the vhost-user
>>> backend.
>>>
>>> That's exactly what happens when OVMF's virtiofs driver (VirtioFsDxe) runs
>>> against the Rust-language virtiofsd version 1.7.2. (Which uses version
>>> 0.10.1 of the vhost-user-backend crate, and version 0.8.1 of the vhost
>>> crate.)
>>>
>>> Namely, when VirtioFsDxe binds a virtiofs device, it goes through the
>>> device initialization steps (i.e., control plane operations), and
>>> immediately sends a FUSE_INIT request too (i.e., performs a data plane
>>> operation). In the Rust-language virtiofsd, this creates a race between
>>> two components that run *concurrently*, i.e., in different threads or
>>> processes:
>>>
>>> - Control plane, handling vhost-user protocol messages:
>>>
>>> The "VhostUserSlaveReqHandlerMut::set_vring_enable" method
>>> [crates/vhost-user-backend/src/handler.rs] handles
>>> VHOST_USER_SET_VRING_ENABLE messages, and updates each vring's "enabled"
>>> flag according to the message processed.
>>>
>>> - Data plane, handling virtio / FUSE requests:
>>>
>>> The "VringEpollHandler::handle_event" method
>>> [crates/vhost-user-backend/src/event_loop.rs] handles the incoming
>>> virtio / FUSE request, consuming the virtio kick at the same time. If
>>> the vring's "enabled" flag is set, the virtio / FUSE request is
>>> processed genuinely. If the vring's "enabled" flag is clear, then the
>>> virtio / FUSE request is discarded.
>>>
>>> Note that OVMF enables the queue *first*, and sends FUSE_INIT *second*.
>>> However, if the data plane processor in virtiofsd wins the race, then it
>>> sees the FUSE_INIT *before* the control plane processor took notice of
>>> VHOST_USER_SET_VRING_ENABLE and green-lit the queue for the data plane
>>> processor. Therefore the latter drops FUSE_INIT on the floor, and goes
>>> back to waiting for further virtio / FUSE requests with epoll_wait.
>>> Meanwhile OVMF is stuck waiting for the FUSE_INIT response -- a deadlock.
>>>
>>> The deadlock is not deterministic. OVMF hangs infrequently during first
>>> boot. However, OVMF hangs almost certainly during reboots from the UEFI
>>> shell.
>>>
>>> The race can be "reliably masked" by inserting a very small delay -- a
>>> single debug message -- at the top of "VringEpollHandler::handle_event",
>>> i.e., just before the data plane processor checks the "enabled" field of
>>> the vring. That delay suffices for the control plane processor to act upon
>>> VHOST_USER_SET_VRING_ENABLE.
>>>
>>> We can deterministically prevent the race in QEMU, by blocking OVMF inside
>>> step (1.1) -- i.e., in the write to the "queue_enable" register -- until
>>> VHOST_USER_SET_VRING_ENABLE actually *completes*. That way OVMF's VCPU
>>> cannot advance to the FUSE_INIT submission before virtiofsd's control
>>> plane processor takes notice of the queue being enabled.
>>>
>>> Wait for VHOST_USER_SET_VRING_ENABLE completion by:
>>>
>>> - setting the NEED_REPLY flag on VHOST_USER_SET_VRING_ENABLE, and waiting
>>> for the reply, if the VHOST_USER_PROTOCOL_F_REPLY_ACK vhost-user feature
>>> has been negotiated, or
>>>
>>> - performing a separate VHOST_USER_GET_FEATURES *exchange*, which requires
>>> a backend response regardless of VHOST_USER_PROTOCOL_F_REPLY_ACK.
>>>
>>> Cc: "Michael S. Tsirkin" <mst@redhat.com> (supporter:vhost)
>>> Cc: Eugenio Perez Martin <eperezma@redhat.com>
>>> Cc: German Maglione <gmaglione@redhat.com>
>>> Cc: Liu Jiang <gerry@linux.alibaba.com>
>>> Cc: Sergio Lopez Pascual <slp@redhat.com>
>>> Cc: Stefano Garzarella <sgarzare@redhat.com>
>>> Signed-off-by: Laszlo Ersek <lersek@redhat.com>
>>
>>
>> So you want me to hold on to this patch 7/7 for now?
>> And maybe merge rest of the patchset?
>
> Up to Laszlo, but I wanted to mention that I support merging this
> patch series.
Oh and I wanted to say thanks for that. :)
Laszlo
> A ring has not been enabled/disabled until the back-end
> replies, so I think this patch series makes sense.
>
> Stefan
>
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH 7/7] vhost-user: call VHOST_USER_SET_VRING_ENABLE synchronously
2023-10-04 10:15 ` Laszlo Ersek
@ 2023-10-04 16:30 ` Michael S. Tsirkin
0 siblings, 0 replies; 58+ messages in thread
From: Michael S. Tsirkin @ 2023-10-04 16:30 UTC (permalink / raw)
To: Laszlo Ersek
Cc: Stefan Hajnoczi, qemu-devel, Eugenio Perez Martin,
German Maglione, Liu Jiang, Sergio Lopez Pascual,
Stefano Garzarella
On Wed, Oct 04, 2023 at 12:15:48PM +0200, Laszlo Ersek wrote:
> On 10/3/23 17:55, Stefan Hajnoczi wrote:
> > On Tue, 3 Oct 2023 at 10:41, Michael S. Tsirkin <mst@redhat.com> wrote:
> >>
> >> On Sun, Aug 27, 2023 at 08:29:37PM +0200, Laszlo Ersek wrote:
> >>> (1) The virtio-1.0 specification
> >>> <http://docs.oasis-open.org/virtio/virtio/v1.0/virtio-v1.0.html> writes:
> >>>
> >>>> 3 General Initialization And Device Operation
> >>>> 3.1 Device Initialization
> >>>> 3.1.1 Driver Requirements: Device Initialization
> >>>>
> >>>> [...]
> >>>>
> >>>> 7. Perform device-specific setup, including discovery of virtqueues for
> >>>> the device, optional per-bus setup, reading and possibly writing the
> >>>> device’s virtio configuration space, and population of virtqueues.
> >>>>
> >>>> 8. Set the DRIVER_OK status bit. At this point the device is “live”.
> >>>
> >>> and
> >>>
> >>>> 4 Virtio Transport Options
> >>>> 4.1 Virtio Over PCI Bus
> >>>> 4.1.4 Virtio Structure PCI Capabilities
> >>>> 4.1.4.3 Common configuration structure layout
> >>>> 4.1.4.3.2 Driver Requirements: Common configuration structure layout
> >>>>
> >>>> [...]
> >>>>
> >>>> The driver MUST configure the other virtqueue fields before enabling the
> >>>> virtqueue with queue_enable.
> >>>>
> >>>> [...]
> >>>
> >>> These together mean that the following sub-sequence of steps is valid for
> >>> a virtio-1.0 guest driver:
> >>>
> >>> (1.1) set "queue_enable" for the needed queues as the final part of device
> >>> initialization step (7),
> >>>
> >>> (1.2) set DRIVER_OK in step (8),
> >>>
> >>> (1.3) immediately start sending virtio requests to the device.
> >>>
> >>> (2) When vhost-user is enabled, and the VHOST_USER_F_PROTOCOL_FEATURES
> >>> special virtio feature is negotiated, then virtio rings start in disabled
> >>> state, according to
> >>> <https://qemu-project.gitlab.io/qemu/interop/vhost-user.html#ring-states>.
> >>> In this case, explicit VHOST_USER_SET_VRING_ENABLE messages are needed for
> >>> enabling vrings.
> >>>
> >>> Therefore setting "queue_enable" from the guest (1.1) is a *control plane*
> >>> operation, which travels from the guest through QEMU to the vhost-user
> >>> backend, using a unix domain socket.
> >>>
> >>> Whereas sending a virtio request (1.3) is a *data plane* operation, which
> >>> evades QEMU -- it travels from guest to the vhost-user backend via
> >>> eventfd.
> >>>
> >>> This means that steps (1.1) and (1.3) travel through different channels,
> >>> and their relative order can be reversed, as perceived by the vhost-user
> >>> backend.
> >>>
> >>> That's exactly what happens when OVMF's virtiofs driver (VirtioFsDxe) runs
> >>> against the Rust-language virtiofsd version 1.7.2. (Which uses version
> >>> 0.10.1 of the vhost-user-backend crate, and version 0.8.1 of the vhost
> >>> crate.)
> >>>
> >>> Namely, when VirtioFsDxe binds a virtiofs device, it goes through the
> >>> device initialization steps (i.e., control plane operations), and
> >>> immediately sends a FUSE_INIT request too (i.e., performs a data plane
> >>> operation). In the Rust-language virtiofsd, this creates a race between
> >>> two components that run *concurrently*, i.e., in different threads or
> >>> processes:
> >>>
> >>> - Control plane, handling vhost-user protocol messages:
> >>>
> >>> The "VhostUserSlaveReqHandlerMut::set_vring_enable" method
> >>> [crates/vhost-user-backend/src/handler.rs] handles
> >>> VHOST_USER_SET_VRING_ENABLE messages, and updates each vring's "enabled"
> >>> flag according to the message processed.
> >>>
> >>> - Data plane, handling virtio / FUSE requests:
> >>>
> >>> The "VringEpollHandler::handle_event" method
> >>> [crates/vhost-user-backend/src/event_loop.rs] handles the incoming
> >>> virtio / FUSE request, consuming the virtio kick at the same time. If
> >>> the vring's "enabled" flag is set, the virtio / FUSE request is
> >>> processed genuinely. If the vring's "enabled" flag is clear, then the
> >>> virtio / FUSE request is discarded.
> >>>
> >>> Note that OVMF enables the queue *first*, and sends FUSE_INIT *second*.
> >>> However, if the data plane processor in virtiofsd wins the race, then it
> >>> sees the FUSE_INIT *before* the control plane processor took notice of
> >>> VHOST_USER_SET_VRING_ENABLE and green-lit the queue for the data plane
> >>> processor. Therefore the latter drops FUSE_INIT on the floor, and goes
> >>> back to waiting for further virtio / FUSE requests with epoll_wait.
> >>> Meanwhile OVMF is stuck waiting for the FUSE_INIT response -- a deadlock.
> >>>
> >>> The deadlock is not deterministic. OVMF hangs infrequently during first
> >>> boot. However, OVMF hangs almost certainly during reboots from the UEFI
> >>> shell.
> >>>
> >>> The race can be "reliably masked" by inserting a very small delay -- a
> >>> single debug message -- at the top of "VringEpollHandler::handle_event",
> >>> i.e., just before the data plane processor checks the "enabled" field of
> >>> the vring. That delay suffices for the control plane processor to act upon
> >>> VHOST_USER_SET_VRING_ENABLE.
> >>>
> >>> We can deterministically prevent the race in QEMU, by blocking OVMF inside
> >>> step (1.1) -- i.e., in the write to the "queue_enable" register -- until
> >>> VHOST_USER_SET_VRING_ENABLE actually *completes*. That way OVMF's VCPU
> >>> cannot advance to the FUSE_INIT submission before virtiofsd's control
> >>> plane processor takes notice of the queue being enabled.
> >>>
> >>> Wait for VHOST_USER_SET_VRING_ENABLE completion by:
> >>>
> >>> - setting the NEED_REPLY flag on VHOST_USER_SET_VRING_ENABLE, and waiting
> >>> for the reply, if the VHOST_USER_PROTOCOL_F_REPLY_ACK vhost-user feature
> >>> has been negotiated, or
> >>>
> >>> - performing a separate VHOST_USER_GET_FEATURES *exchange*, which requires
> >>> a backend response regardless of VHOST_USER_PROTOCOL_F_REPLY_ACK.
> >>>
> >>> Cc: "Michael S. Tsirkin" <mst@redhat.com> (supporter:vhost)
> >>> Cc: Eugenio Perez Martin <eperezma@redhat.com>
> >>> Cc: German Maglione <gmaglione@redhat.com>
> >>> Cc: Liu Jiang <gerry@linux.alibaba.com>
> >>> Cc: Sergio Lopez Pascual <slp@redhat.com>
> >>> Cc: Stefano Garzarella <sgarzare@redhat.com>
> >>> Signed-off-by: Laszlo Ersek <lersek@redhat.com>
> >>
> >>
> >> So you want me to hold on to this patch 7/7 for now?
> >> And maybe merge rest of the patchset?
> >
> > Up to Laszlo, but I wanted to mention that I support merging this
> > patch series. A ring has not been enabled/disabled until the back-end
> > replies, so I think this patch series makes sense.
>
> Sorry, I didn't get to see this part of the discussion yesterday, and
> now I see that Michael has gone ahead with a PR that contains v2 of this
> set. The night before yesterday I posted v3
> <https://patchwork.ozlabs.org/project/qemu-devel/cover/20231002203221.17241-1-lersek@redhat.com/>,
> with commit message updates / improvements only (based on feedback), so
> please merge that one.
>
> Thanks!
> Laszlo
OK. I'll need to do another PR soonish since a bunch of patchsets
which I wanted in this PR had issues and I had to drop them.
v3 will be there.
--
MST
^ permalink raw reply [flat|nested] 58+ messages in thread
end of thread, other threads:[~2023-10-04 16:31 UTC | newest]
Thread overview: 58+ messages
2023-08-27 18:29 [PATCH 0/7] vhost-user: call VHOST_USER_SET_VRING_ENABLE synchronously Laszlo Ersek
2023-08-27 18:29 ` [PATCH 1/7] vhost-user: strip superfluous whitespace Laszlo Ersek
2023-08-30 8:26 ` Stefano Garzarella
2023-08-30 15:04 ` Philippe Mathieu-Daudé
2023-08-27 18:29 ` [PATCH 2/7] vhost-user: tighten "reply_supported" scope in "set_vring_addr" Laszlo Ersek
2023-08-30 8:27 ` Stefano Garzarella
2023-08-30 15:04 ` Philippe Mathieu-Daudé
2023-08-27 18:29 ` [PATCH 3/7] vhost-user: factor out "vhost_user_write_msg" Laszlo Ersek
2023-08-28 22:46 ` Philippe Mathieu-Daudé
2023-08-30 8:31 ` Stefano Garzarella
2023-08-30 9:14 ` Laszlo Ersek
2023-08-30 9:54 ` Laszlo Ersek
2023-08-27 18:29 ` [PATCH 4/7] vhost-user: flatten "enforce_reply" into "vhost_user_write_msg" Laszlo Ersek
2023-08-28 22:47 ` Philippe Mathieu-Daudé
2023-08-30 8:31 ` Stefano Garzarella
2023-08-27 18:29 ` [PATCH 5/7] vhost-user: hoist "write_msg", "get_features", "get_u64" Laszlo Ersek
2023-08-30 8:32 ` Stefano Garzarella
2023-08-30 15:04 ` Philippe Mathieu-Daudé
2023-08-27 18:29 ` [PATCH 6/7] vhost-user: allow "vhost_set_vring" to wait for a reply Laszlo Ersek
2023-08-28 22:49 ` Philippe Mathieu-Daudé
2023-08-30 8:32 ` Stefano Garzarella
2023-08-27 18:29 ` [PATCH 7/7] vhost-user: call VHOST_USER_SET_VRING_ENABLE synchronously Laszlo Ersek
2023-08-30 8:39 ` Stefano Garzarella
2023-08-30 9:26 ` Laszlo Ersek
2023-08-30 14:24 ` Stefano Garzarella
2023-08-30 8:41 ` Laszlo Ersek
2023-08-30 8:59 ` Laszlo Ersek
2023-08-30 9:04 ` Laszlo Ersek
2023-08-30 12:10 ` Stefan Hajnoczi
2023-08-30 13:30 ` Laszlo Ersek
2023-08-30 15:37 ` Stefan Hajnoczi
2023-09-05 6:30 ` Laszlo Ersek
2023-09-25 15:31 ` Laszlo Ersek
2023-10-01 19:24 ` Michael S. Tsirkin
2023-10-01 19:25 ` Michael S. Tsirkin
2023-10-02 1:56 ` Laszlo Ersek
2023-10-02 6:57 ` Michael S. Tsirkin
2023-10-02 14:02 ` Laszlo Ersek
2023-10-02 6:49 ` Michael S. Tsirkin
2023-10-02 21:12 ` Stefan Hajnoczi
2023-10-02 21:13 ` Stefan Hajnoczi
2023-10-03 12:26 ` Michael S. Tsirkin
2023-10-03 13:08 ` Stefan Hajnoczi
2023-10-03 13:23 ` Laszlo Ersek
2023-10-03 14:25 ` Michael S. Tsirkin
2023-10-03 14:28 ` Laszlo Ersek
2023-10-03 14:40 ` Michael S. Tsirkin
2023-10-03 15:45 ` Stefan Hajnoczi
2023-10-02 22:36 ` Michael S. Tsirkin
2023-10-03 0:17 ` Stefan Hajnoczi
2023-10-03 14:28 ` Michael S. Tsirkin
2023-10-03 14:41 ` Michael S. Tsirkin
2023-10-03 15:55 ` Stefan Hajnoczi
2023-10-04 10:15 ` Laszlo Ersek
2023-10-04 16:30 ` Michael S. Tsirkin
2023-10-04 10:17 ` Laszlo Ersek
2023-08-30 8:48 ` [PATCH 0/7] " Stefano Garzarella
2023-08-30 9:32 ` Laszlo Ersek