* [PATCH 0/2] Send all the SVQ control commands in parallel
From: Hawkins Jiawei @ 2023-04-19 11:49 UTC
To: eperezma, jasowang
Cc: yin31149, 18801353760, qemu-devel

This patchset allows QEMU to poll and check the device used buffers
after sending all SVQ control commands, instead of polling and checking
immediately after sending each SVQ control command, so that QEMU can
send all the SVQ control commands in parallel, which improves
performance.

I use vdpa_sim_net to simulate a vdpa device, refactor
vhost_vdpa_net_load() to call vhost_vdpa_net_load_mac() 30 times, and
refactor `net_vhost_vdpa_cvq_info.load` to call vhost_vdpa_net_load()
1000 times, in order to build a test environment for sending multiple
SVQ control commands. Time on the monotonic clock to finish
`net_vhost_vdpa_cvq_info.load`:

                 QEMU monotonic time
  --------------------------------------------------
  not patched    89202
  --------------------------------------------------
  patched        80455

This patchset resolves the GitLab issue at
https://gitlab.com/qemu-project/qemu/-/issues/1578.

Hawkins Jiawei (2):
  vdpa: rename vhost_vdpa_net_cvq_add()
  vdpa: send CVQ state load commands in parallel

 net/vhost-vdpa.c | 150 +++++++++++++++++++++++++++++++++++------------
 1 file changed, 112 insertions(+), 38 deletions(-)

-- 
2.25.1
* [PATCH 1/2] vdpa: rename vhost_vdpa_net_cvq_add()
From: Hawkins Jiawei @ 2023-04-19 11:49 UTC
To: eperezma, jasowang
Cc: yin31149, 18801353760, qemu-devel

We want to introduce a new version of vhost_vdpa_net_cvq_add() that
does not poll immediately after forwarding custom buffers to the
device, so that QEMU can send all the SVQ control commands in parallel
instead of serially. Rename the current function to
vhost_vdpa_net_cvq_add_and_wait() to make room for the new variant.

Signed-off-by: Hawkins Jiawei <yin31149@gmail.com>
---
 net/vhost-vdpa.c | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
index 99904a0da7..10804c7200 100644
--- a/net/vhost-vdpa.c
+++ b/net/vhost-vdpa.c
@@ -590,8 +590,14 @@ static void vhost_vdpa_net_cvq_stop(NetClientState *nc)
     vhost_vdpa_net_client_stop(nc);
 }
 
-static ssize_t vhost_vdpa_net_cvq_add(VhostVDPAState *s, size_t out_len,
-                                      size_t in_len)
+/**
+ * vhost_vdpa_net_cvq_add_and_wait() adds SVQ control commands to SVQ,
+ * kicks the device and polls the device used buffers.
+ *
+ * Return the length written by the device.
+ */
+static ssize_t vhost_vdpa_net_cvq_add_and_wait(VhostVDPAState *s,
+                                               size_t out_len, size_t in_len)
 {
     /* Buffers for the device */
     const struct iovec out = {
@@ -636,7 +642,7 @@ static ssize_t vhost_vdpa_net_load_cmd(VhostVDPAState *s, uint8_t class,
     memcpy(s->cvq_cmd_out_buffer, &ctrl, sizeof(ctrl));
     memcpy(s->cvq_cmd_out_buffer + sizeof(ctrl), data, data_size);
 
-    return vhost_vdpa_net_cvq_add(s, sizeof(ctrl) + data_size,
+    return vhost_vdpa_net_cvq_add_and_wait(s, sizeof(ctrl) + data_size,
                                   sizeof(virtio_net_ctrl_ack));
 }
 
@@ -753,7 +759,8 @@ static int vhost_vdpa_net_handle_ctrl_avail(VhostShadowVirtqueue *svq,
         dev_written = sizeof(status);
         *s->status = VIRTIO_NET_OK;
     } else {
-        dev_written = vhost_vdpa_net_cvq_add(s, out.iov_len, sizeof(status));
+        dev_written = vhost_vdpa_net_cvq_add_and_wait(s, out.iov_len,
+                                                      sizeof(status));
         if (unlikely(dev_written < 0)) {
             goto out;
         }
-- 
2.25.1
* [PATCH 2/2] vdpa: send CVQ state load commands in parallel
From: Hawkins Jiawei @ 2023-04-19 11:49 UTC
To: eperezma, jasowang
Cc: yin31149, 18801353760, qemu-devel

This patch introduces vhost_vdpa_net_cvq_add() and refactors
vhost_vdpa_net_load*(), so that QEMU can send CVQ state load
commands in parallel.

To be more specific, this patch introduces vhost_vdpa_net_cvq_add()
to add SVQ control commands to SVQ and kick the device, without
polling the device used buffers. QEMU does not poll and check the
device used buffers in vhost_vdpa_net_load() until all CVQ state
load commands have been sent to the device.

What's more, in order to avoid buffer overwriting caused by using
`s->cvq_cmd_out_buffer` and `s->status` as the buffers for all CVQ
state load commands when sending them in parallel, this patch
introduces `out_cursor` and `in_cursor` in vhost_vdpa_net_load(),
pointing to the next available space in the out descriptor and in
descriptor buffers, so that each CVQ state load command gets its
own unique buffer.

Resolves: https://gitlab.com/qemu-project/qemu/-/issues/1578
Signed-off-by: Hawkins Jiawei <yin31149@gmail.com>
---
 net/vhost-vdpa.c | 137 +++++++++++++++++++++++++++++++++++------------
 1 file changed, 102 insertions(+), 35 deletions(-)

diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
index 10804c7200..d1f7113992 100644
--- a/net/vhost-vdpa.c
+++ b/net/vhost-vdpa.c
@@ -590,6 +590,44 @@ static void vhost_vdpa_net_cvq_stop(NetClientState *nc)
     vhost_vdpa_net_client_stop(nc);
 }
 
+/**
+ * vhost_vdpa_net_cvq_add() adds SVQ control commands to SVQ,
+ * kicks the device but does not poll the device used buffers.
+ *
+ * Return the length of the device used buffers.
+ */
+static ssize_t vhost_vdpa_net_cvq_add(VhostVDPAState *s,
+                                      void **out_cursor, size_t out_len,
+                                      void **in_cursor, size_t in_len)
+{
+    /* Buffers for the device */
+    const struct iovec out = {
+        .iov_base = *out_cursor,
+        .iov_len = out_len,
+    };
+    const struct iovec in = {
+        .iov_base = *in_cursor,
+        .iov_len = in_len,
+    };
+    VhostShadowVirtqueue *svq = g_ptr_array_index(s->vhost_vdpa.shadow_vqs, 0);
+    int r;
+
+    r = vhost_svq_add(svq, &out, 1, &in, 1, NULL);
+    if (unlikely(r != 0)) {
+        if (unlikely(r == -ENOSPC)) {
+            qemu_log_mask(LOG_GUEST_ERROR, "%s: No space on device queue\n",
+                          __func__);
+        }
+        return r;
+    }
+
+    /* Update the cursor */
+    *out_cursor += out_len;
+    *in_cursor += in_len;
+
+    return in_len;
+}
+
 /**
  * vhost_vdpa_net_cvq_add_and_wait() adds SVQ control commands to SVQ,
  * kicks the device and polls the device used buffers.
@@ -628,69 +666,64 @@ static ssize_t vhost_vdpa_net_cvq_add_and_wait(VhostVDPAState *s,
     return vhost_svq_poll(svq);
 }
 
-static ssize_t vhost_vdpa_net_load_cmd(VhostVDPAState *s, uint8_t class,
-                                       uint8_t cmd, const void *data,
-                                       size_t data_size)
+static ssize_t vhost_vdpa_net_load_cmd(VhostVDPAState *s,
+                                       void **out_cursor, uint8_t class,
+                                       uint8_t cmd, const void *data,
+                                       size_t data_size, void **in_cursor)
 {
     const struct virtio_net_ctrl_hdr ctrl = {
         .class = class,
         .cmd = cmd,
     };
 
-    assert(data_size < vhost_vdpa_net_cvq_cmd_page_len() - sizeof(ctrl));
+    assert(sizeof(ctrl) < vhost_vdpa_net_cvq_cmd_page_len() -
+                          (*out_cursor - s->cvq_cmd_out_buffer));
+    assert(data_size < vhost_vdpa_net_cvq_cmd_page_len() - sizeof(ctrl) -
+                       (*out_cursor - s->cvq_cmd_out_buffer));
 
-    memcpy(s->cvq_cmd_out_buffer, &ctrl, sizeof(ctrl));
-    memcpy(s->cvq_cmd_out_buffer + sizeof(ctrl), data, data_size);
+    memcpy(*out_cursor, &ctrl, sizeof(ctrl));
+    memcpy(*out_cursor + sizeof(ctrl), data, data_size);
 
-    return vhost_vdpa_net_cvq_add_and_wait(s, sizeof(ctrl) + data_size,
-                                           sizeof(virtio_net_ctrl_ack));
+    return vhost_vdpa_net_cvq_add(s, out_cursor, sizeof(ctrl) + data_size,
+                                  in_cursor, sizeof(virtio_net_ctrl_ack));
 }
 
-static int vhost_vdpa_net_load_mac(VhostVDPAState *s, const VirtIONet *n)
+static ssize_t vhost_vdpa_net_load_mac(VhostVDPAState *s, const VirtIONet *n,
+                                       void **out_cursor, void **in_cursor)
 {
     uint64_t features = n->parent_obj.guest_features;
     if (features & BIT_ULL(VIRTIO_NET_F_CTRL_MAC_ADDR)) {
-        ssize_t dev_written = vhost_vdpa_net_load_cmd(s, VIRTIO_NET_CTRL_MAC,
-                                                  VIRTIO_NET_CTRL_MAC_ADDR_SET,
-                                                  n->mac, sizeof(n->mac));
-        if (unlikely(dev_written < 0)) {
-            return dev_written;
-        }
-
-        return *s->status != VIRTIO_NET_OK;
+        return vhost_vdpa_net_load_cmd(s, out_cursor, VIRTIO_NET_CTRL_MAC,
+                                       VIRTIO_NET_CTRL_MAC_ADDR_SET,
+                                       n->mac, sizeof(n->mac), in_cursor);
     }
 
     return 0;
 }
 
-static int vhost_vdpa_net_load_mq(VhostVDPAState *s,
-                                  const VirtIONet *n)
+static ssize_t vhost_vdpa_net_load_mq(VhostVDPAState *s, const VirtIONet *n,
+                                      void **out_cursor, void **in_cursor)
 {
     struct virtio_net_ctrl_mq mq;
     uint64_t features = n->parent_obj.guest_features;
-    ssize_t dev_written;
 
     if (!(features & BIT_ULL(VIRTIO_NET_F_MQ))) {
         return 0;
     }
 
-    mq.virtqueue_pairs = cpu_to_le16(n->curr_queue_pairs);
-    dev_written = vhost_vdpa_net_load_cmd(s, VIRTIO_NET_CTRL_MQ,
-                                          VIRTIO_NET_CTRL_MQ_VQ_PAIRS_SET, &mq,
-                                          sizeof(mq));
-    if (unlikely(dev_written < 0)) {
-        return dev_written;
-    }
-
-    return *s->status != VIRTIO_NET_OK;
+    return vhost_vdpa_net_load_cmd(s, out_cursor, VIRTIO_NET_CTRL_MQ,
+                                   VIRTIO_NET_CTRL_MQ_VQ_PAIRS_SET,
+                                   &mq, sizeof(mq), in_cursor);
 }
 
 static int vhost_vdpa_net_load(NetClientState *nc)
 {
     VhostVDPAState *s = DO_UPCAST(VhostVDPAState, nc, nc);
+    VhostShadowVirtqueue *svq;
+    void *out_cursor, *in_cursor;
     struct vhost_vdpa *v = &s->vhost_vdpa;
     const VirtIONet *n;
-    int r;
+    ssize_t need_poll_len = 0, r, dev_written;
 
     assert(nc->info->type == NET_CLIENT_DRIVER_VHOST_VDPA);
 
@@ -699,16 +732,50 @@ static int vhost_vdpa_net_load(NetClientState *nc)
     }
 
     n = VIRTIO_NET(v->dev->vdev);
-    r = vhost_vdpa_net_load_mac(s, n);
+
+    need_poll_len = 0;
+    out_cursor = s->cvq_cmd_out_buffer;
+    in_cursor = s->status;
+
+    r = vhost_vdpa_net_load_mac(s, n, &out_cursor, &in_cursor);
     if (unlikely(r < 0)) {
-        return r;
+        goto load_err;
+    }
+    need_poll_len += r;
+
+    r = vhost_vdpa_net_load_mq(s, n, &out_cursor, &in_cursor);
+    if (unlikely(r < 0)) {
+        goto load_err;
+    }
+    need_poll_len += r;
+
+load_err:
+    /* Poll for all SVQ control commands used buffer from device */
+    svq = g_ptr_array_index(s->vhost_vdpa.shadow_vqs, 0);
+    while (need_poll_len > 0) {
+        /*
+         * We can poll here since we've had BQL from the time we sent the
+         * descriptor. Also, we need to take the answer before SVQ pulls
+         * by itself, when BQL is released
+         */
+        dev_written = vhost_svq_poll(svq);
+        need_poll_len -= dev_written;
     }
-    r = vhost_vdpa_net_load_mq(s, n);
-    if (unlikely(r)) {
+
+    /* If code comes from `load_err` label, then we should return directly */
+    if (unlikely(r < 0)) {
         return r;
     }
 
-    return 0;
+    /* Check the used buffer from device */
+    for (virtio_net_ctrl_ack *status = s->status; (void *)status < in_cursor;
+         ++status) {
+        if (*status != VIRTIO_NET_OK) {
+            ++r;
+        }
+    }
+
+    return r;
 }
 
 static NetClientInfo net_vhost_vdpa_cvq_info = {
-- 
2.25.1
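The cursor bookkeeping described in the commit message above can be
illustrated in isolation. The following is a standalone sketch, not the
QEMU code: the fixed-size buffers, command sizes, and the queue_cmd()
helper are all stand-ins for `s->cvq_cmd_out_buffer`, `s->status` and
vhost_vdpa_net_cvq_add().

    #include <assert.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define PAGE_LEN 4096

    static char out_buffer[PAGE_LEN];   /* models s->cvq_cmd_out_buffer */
    static uint8_t in_buffer[PAGE_LEN]; /* models the s->status acks */

    /*
     * Copy one command at *out_cursor and advance both cursors,
     * mirroring the cursor updates in vhost_vdpa_net_cvq_add().
     */
    static void queue_cmd(void **out_cursor, const void *cmd, size_t out_len,
                          void **in_cursor, size_t in_len)
    {
        assert((char *)*out_cursor + out_len <= out_buffer + PAGE_LEN);
        memcpy(*out_cursor, cmd, out_len);
        *out_cursor = (char *)*out_cursor + out_len;
        *in_cursor = (uint8_t *)*in_cursor + in_len;
    }

    int main(void)
    {
        void *out_cursor = out_buffer, *in_cursor = in_buffer;

        queue_cmd(&out_cursor, "MAC-cmd", 8, &in_cursor, 1);
        queue_cmd(&out_cursor, "MQ-cmd", 7, &in_cursor, 1);

        /* The two commands occupy disjoint regions of the shared buffers. */
        printf("out used: %td bytes, acks pending: %td\n",
               (char *)out_cursor - out_buffer,
               (uint8_t *)in_cursor - in_buffer);
        return 0;
    }

Because every queued command advances the cursors by its own lengths, no
two in-flight commands ever share buffer space, which is what makes the
parallel submission safe.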
* Re: [PATCH 2/2] vdpa: send CVQ state load commands in parallel
From: Eugenio Perez Martin @ 2023-04-19 17:42 UTC
To: Hawkins Jiawei
Cc: jasowang, 18801353760, qemu-devel

On Wed, Apr 19, 2023 at 1:50 PM Hawkins Jiawei <yin31149@gmail.com> wrote:
>
> This patch introduces vhost_vdpa_net_cvq_add() and refactors
> vhost_vdpa_net_load*(), so that QEMU can send CVQ state load
> commands in parallel.
[...]
> -    mq.virtqueue_pairs = cpu_to_le16(n->curr_queue_pairs);

So where is mq filled now?

[...]
>      n = VIRTIO_NET(v->dev->vdev);
> -    r = vhost_vdpa_net_load_mac(s, n);
> +

Extra newline.

> +    need_poll_len = 0;

Maybe we can call it "cmds_in_flight" or something similar? It is not a
length.

[...]
> +        dev_written = vhost_svq_poll(svq);

If vhost_svq_poll returns 0 we must return an error from this code.
Otherwise this will be an infinite loop.

> +        need_poll_len -= dev_written;

On the other hand, vhost_svq_poll returns 1 because that is the length
written in the "in" buffer, but it is not obvious here. It is more
clear if we just do need_poll_len-- here.

[...]
> +    /* If code comes from `load_err` label, then we should return directly */
> +    if (unlikely(r < 0)) {

If this function returns a failure we can avoid the sending of the
queued commands, as the caller must cancel the start of the device
anyway. We can return directly from the failure in
vhost_vdpa_net_load_* and avoid the label.

[...]
> +    /* Check the used buffer from device */
> +    for (virtio_net_ctrl_ack *status = s->status; (void *)status < in_cursor;

in_cursor can be a virtio_net_ctrl_ack so we don't need the casting here.

[...]
> +    return r;

Although the caller is fine with >=0, I think we should keep the 0 ==
success. The number of commands delivered does not make a lot of sense
for the callers, just if the call succeeded or not.

Thanks!
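For illustration, a minimal sketch of how the poll-and-check tail of
vhost_vdpa_net_load() could look with the comments above applied. This
is not the committed code: the vhost_vdpa_net_svq_flush() name and the
-EAGAIN/-EINVAL error codes are illustrative, each command is assumed to
use a one-byte ack in `s->status`, and vhost_svq_poll() is assumed to
return 0 when polling fails.

    /*
     * Sketch: wait for one used buffer per queued command, then map the
     * per-command acks onto a single 0-on-success return value.
     */
    static int vhost_vdpa_net_svq_flush(VhostVDPAState *s,
                                        VhostShadowVirtqueue *svq,
                                        size_t cmds_in_flight)
    {
        /* One ack per command; decrement the count, not a byte length. */
        for (size_t left = cmds_in_flight; left > 0; left--) {
            ssize_t dev_written = vhost_svq_poll(svq);
            if (unlikely(dev_written == 0)) {
                return -EAGAIN; /* device did not answer in time */
            }
        }

        /* Keep 0 == success: callers only care whether the load worked. */
        for (size_t i = 0; i < cmds_in_flight; i++) {
            if (s->status[i] != VIRTIO_NET_OK) {
                return -EINVAL;
            }
        }
        return 0;
    }

With the failures in vhost_vdpa_net_load_*() returned directly, the
`load_err` label disappears and this helper only runs once every command
has been queued successfully.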
* Re: [PATCH 2/2] vdpa: send CVQ state load commands in parallel
From: Hawkins Jiawei @ 2023-04-20 11:38 UTC
To: Eugenio Perez Martin
Cc: jasowang, 18801353760, qemu-devel

On Thu, 20 Apr 2023 at 01:43, Eugenio Perez Martin <eperezma@redhat.com> wrote:
>
> On Wed, Apr 19, 2023 at 1:50 PM Hawkins Jiawei <yin31149@gmail.com> wrote:
[...]
> > -    mq.virtqueue_pairs = cpu_to_le16(n->curr_queue_pairs);
>
> So where is mq filled now?

This is a typo; this line should not have been deleted. I will correct
it in the next version of the patch.

[...]
> >      n = VIRTIO_NET(v->dev->vdev);
> > -    r = vhost_vdpa_net_load_mac(s, n);
> > +
>
> Extra newline.

Thanks for the comment, I will remove it in the next version.

> > +    need_poll_len = 0;
>
> Maybe we can call it "cmds_in_flight" or something similar? It is not a
> length.

Do you mean we should maintain the number of in-flight commands here?

My initial thought was to maintain the total length of the device used
buffers, so that we know when to stop polling for them. So far, each
SVQ command uses a one-byte used buffer, so maintaining the total
used-buffer length and maintaining the number of in-flight commands
have the same effect.

Still, I prefer to maintain the total length of the device used
buffers, because it avoids the assumption that each SVQ command uses a
one-byte used buffer, and it makes better use of the vhost_svq_poll()
return value during polling, which looks more reasonable to me. What do
you think?

[...]
> > +        dev_written = vhost_svq_poll(svq);
>
> If vhost_svq_poll returns 0 we must return an error from this code.
> Otherwise this will be an infinite loop.

Yes, you are right, we should return an error here. What error code do
you think we should return? It seems that vhost_svq_poll() returns 0
when QEMU waits too long or there is no available used buffer from the
device, so is it okay to return -EAGAIN (resource temporarily
unavailable)?

> > +        need_poll_len -= dev_written;
>
> On the other hand, vhost_svq_poll returns 1 because that is the length
> written in the "in" buffer, but it is not obvious here. It is more
> clear if we just do need_poll_len-- here.

It is the same point as above: if we maintain the total length of the
device used buffers in `need_poll_len`, subtracting the polled length
looks more reasonable here. Maybe `used_buffer_len` is a clearer name?

[...]
> If this function returns a failure we can avoid the sending of the
> queued commands, as the caller must cancel the start of the device
> anyway. We can return directly from the failure in
> vhost_vdpa_net_load_* and avoid the label.

Thanks for the explanation, I will refactor the patch as you suggested.

[...]
> in_cursor can be a virtio_net_ctrl_ack so we don't need the casting
> here.

Thanks for the suggestion, I will refactor the patch as you suggested.

[...]
> Although the caller is fine with >=0, I think we should keep the 0 ==
> success. The number of commands delivered does not make a lot of sense
> for the callers, just if the call succeeded or not.

Thanks for the explanation, I will refactor the patch as you suggested.

> Thanks!
* Re: [PATCH 2/2] vdpa: send CVQ state load commands in parallel
From: Hawkins Jiawei @ 2023-04-20 15:22 UTC
To: Eugenio Perez Martin
Cc: jasowang, 18801353760, qemu-devel

On Thu, 20 Apr 2023 at 19:38, Hawkins Jiawei <yin31149@gmail.com> wrote:
>
> On Thu, 20 Apr 2023 at 01:43, Eugenio Perez Martin <eperezma@redhat.com> wrote:
> >
> > Although the caller is fine with >=0, I think we should keep the 0 ==
> > success. The number of commands delivered does not make a lot of sense
> > for the callers, just if the call succeeded or not.
>
> Thanks for the explanation, I will refactor the patch as you suggested.

I still have some questions about the check on the device used buffers.

My initial thought was to return the number of commands whose `in`
buffer value is not VIRTIO_NET_OK. If we are not supposed to return a
value > 0, what should we return if some commands' `in` buffer value is
not VIRTIO_NET_OK?

Should we return an error code, such as -EINVAL (invalid argument),
indicating that QEMU cannot successfully send all SVQ commands in the
current state? Or should we simply not check the device used buffers,
and return 0 when QEMU finishes polling?

Thanks!
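For reference, the first option could look like the sketch below. It is
illustrative only: the thread leaves the choice open, check_cvq_acks()
is a hypothetical helper, and -EINVAL is just one candidate error code.

    /* Sketch: map any non-OK ack onto a single error code. */
    static int check_cvq_acks(const virtio_net_ctrl_ack *status,
                              const virtio_net_ctrl_ack *in_cursor)
    {
        for (; status < in_cursor; ++status) {
            if (*status != VIRTIO_NET_OK) {
                return -EINVAL; /* at least one command was rejected */
            }
        }
        return 0; /* every command was acked with VIRTIO_NET_OK */
    }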
* Re: [PATCH 0/2] Send all the SVQ control commands in parallel
From: Eugenio Perez Martin @ 2023-04-19 17:16 UTC
To: Hawkins Jiawei
Cc: jasowang, 18801353760, qemu-devel

On Wed, Apr 19, 2023 at 1:50 PM Hawkins Jiawei <yin31149@gmail.com> wrote:
>
> This patchset allows QEMU to poll and check the device used buffers
> after sending all SVQ control commands, instead of polling and checking
> immediately after sending each SVQ control command, so that QEMU can
> send all the SVQ control commands in parallel, which improves
> performance.
>
> I use vdpa_sim_net to simulate a vdpa device, refactor
> vhost_vdpa_net_load() to call vhost_vdpa_net_load_mac() 30 times, and
> refactor `net_vhost_vdpa_cvq_info.load` to call vhost_vdpa_net_load()
> 1000 times,

Maybe a little bit too high for real scenarios but it gives us a hint
for sure :). Maybe it is more realistic to send ~10 or ~100 commands?

> in order to build a test environment for sending multiple
> SVQ control commands. Time on the monotonic clock to finish
> `net_vhost_vdpa_cvq_info.load`:
>
>                  QEMU monotonic time
>   --------------------------------------------------
>   not patched    89202
>   --------------------------------------------------
>   patched        80455

Is the time expressed in seconds or milliseconds? I'm going to assume
ms.

So let's say all the time was spent in the context switch between QEMU
and the kernel; this is a saving of (89202 - 80455) / 30000 = 0.3 ms
per command?

Thanks!

[...]
* Re: [PATCH 0/2] Send all the SVQ control commands in parallel
From: Hawkins Jiawei @ 2023-04-20 8:32 UTC
To: Eugenio Perez Martin
Cc: jasowang, 18801353760, qemu-devel

On Thu, 20 Apr 2023 at 01:17, Eugenio Perez Martin <eperezma@redhat.com> wrote:
>
> On Wed, Apr 19, 2023 at 1:50 PM Hawkins Jiawei <yin31149@gmail.com> wrote:
> >
> > I use vdpa_sim_net to simulate a vdpa device, refactor
> > vhost_vdpa_net_load() to call vhost_vdpa_net_load_mac() 30 times, and
> > refactor `net_vhost_vdpa_cvq_info.load` to call vhost_vdpa_net_load()
> > 1000 times,
>
> Maybe a little bit too high for real scenarios but it gives us a hint
> for sure :). Maybe it is more realistic to send ~10 or ~100 commands?

Yes, calling vhost_vdpa_net_load() 1000 times is absolutely too high
for real scenarios. But the time to execute vhost_vdpa_net_load_mac()
30 times is very short, so the measured result could be highly unstable
and fluctuate greatly. That is why I call vhost_vdpa_net_load() 1000
times, hoping to get a more stable result.

> Is the time expressed in seconds or milliseconds? I'm going to assume
> ms.

I got the numbers by calling g_get_monotonic_time(), so they should be
in microseconds according to [1].

[1]. https://docs.gtk.org/glib/func.get_monotonic_time.html

> So let's say all the time was spent in the context switch between QEMU
> and the kernel; this is a saving of (89202 - 80455) / 30000 = 0.3 ms
> per command?

Yes, except that I think it is a saving of about 0.3 microseconds per
command.

Thanks!
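To make the measurement above concrete, here is a standalone sketch of
the timing harness. It is illustrative only: the actual test code is
not part of the patchset, and do_load() is a stand-in for the
refactored `net_vhost_vdpa_cvq_info.load` on the vdpa_sim_net device.

    /*
     * Time 1000 load iterations with GLib's monotonic clock, which
     * g_get_monotonic_time() reports in microseconds.
     * Build with: gcc bench.c $(pkg-config --cflags --libs glib-2.0)
     */
    #include <glib.h>
    #include <stdio.h>

    static void do_load(void)
    {
        /* placeholder for vhost_vdpa_net_load() on the test device */
    }

    int main(void)
    {
        gint64 start = g_get_monotonic_time();

        for (int i = 0; i < 1000; i++) {
            do_load();
        }

        gint64 elapsed_us = g_get_monotonic_time() - start;
        printf("QEMU monotonic time: %" G_GINT64_FORMAT " us\n", elapsed_us);
        return 0;
    }

With 30 commands per load and 1000 loads, the reported 89202 vs 80455
microseconds works out to the roughly 0.3 us-per-command saving
discussed above.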