From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
To: michael@hinespot.com
Cc: qemu-devel@nongnu.org
Subject: [Qemu-devel] An RDMA race?
Date: Fri, 11 Dec 2015 17:48:50 +0000 [thread overview]
Message-ID: <20151211174850.GK2987@work-vm> (raw)
Hi Michael,
I think I've got an RDMA race condition, but I'm being a little
cautious at the moment and wondered if you agree with the following
diagnosis.
It's showing up in a world of mine that's sending more control messages
from the destination->source and I'm seeing the following.
We normally expect:
src dest
----------->control ready->
Sees SEND_CONTROL signal to ack that it has been sent
<-----control message--
Sees RECV_CONTROL message from dest
but what I'm seeing is:
src dest
----------->control ready->
<-----control message--
Sees RECV_CONTROL message from dest
Sees SEND_CONTROL signal to ack that it has been sent
which seems a bit odd - that means that the source is sending
a message, and getting the reply before it gets the acknowledgment
from it's own stack that it sent the message the reply is to ?!
It's rare (~1 in 100 migrations ish) and only in my world
where I'm sending more control messages.
But that confuses qemu_rdma_post_send_control when it sends
the READY, because it calls block_for_wrid(SEND_CONTROL),
that sees the RECV_CONTROL (which it loses) and then the SEND_CONTROL.
Does this sound a sane explanation? My current fix (below)
I only use it for RDMA_CONTROL_READY , I'm torn between worrying
that the race is potentially more general, but worry what will
happen if I stop making post_send_control wait at the end for
all the other control messages.
It seems to work :-)
What do you reckon?
Dave
From 332b867fb0f63be47d89822b7a10b2a519b431fc Mon Sep 17 00:00:00 2001
From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Date: Fri, 11 Dec 2015 14:53:11 +0000
Subject: [PATCH] Don't wait for send-control on a ready
---
migration/rdma.c | 46 +++++++++++++++++++++++++++++++++++++++++-----
trace-events | 1 +
2 files changed, 42 insertions(+), 5 deletions(-)
diff --git a/migration/rdma.c b/migration/rdma.c
index b0feddc..6a0e59f 100644
--- a/migration/rdma.c
+++ b/migration/rdma.c
@@ -314,7 +314,13 @@ typedef struct RDMAContext {
* the READY message before send() does, in which case we need to
* know if it completed.
*/
- int control_ready_expected;
+ bool control_ready_expected;
+
+ /*
+ * Set when we've sent a control message but not yet waited for the
+ * result.
+ */
+ bool control_sent_expected;
/* number of outstanding writes */
int nb_sent;
@@ -1454,6 +1460,13 @@ static uint64_t qemu_rdma_poll(RDMAContext *rdma, uint64_t *wr_id_out,
rdma->control_ready_expected = 0;
}
+ if (rdma->control_sent_expected &&
+ (wr_id >= RDMA_WRID_SEND_CONTROL)) {
+ trace_qemu_rdma_poll_send(wrid_desc[RDMA_WRID_SEND_CONTROL],
+ wr_id - RDMA_WRID_SEND_CONTROL, wr_id, rdma->nb_sent);
+ rdma->control_sent_expected = 0;
+ }
+
if (wr_id == RDMA_WRID_RDMA_WRITE) {
uint64_t chunk =
(wc.wr_id & RDMA_WRID_CHUNK_MASK) >> RDMA_WRID_CHUNK_SHIFT;
@@ -1609,6 +1622,7 @@ static int qemu_rdma_post_send_control(RDMAContext *rdma, uint8_t *buf,
RDMAControlHeader *head)
{
int ret = 0;
+ bool wait;
RDMAWorkRequestData *wr = &rdma->wr_data[RDMA_WRID_CONTROL];
struct ibv_send_wr *bad_wr;
struct ibv_sge sge = {
@@ -1626,6 +1640,26 @@ static int qemu_rdma_post_send_control(RDMAContext *rdma, uint8_t *buf,
trace_qemu_rdma_post_send_control(control_desc[head->type]);
+ if (rdma->control_sent_expected) {
+ ret = qemu_rdma_block_for_wrid(rdma, RDMA_WRID_SEND_CONTROL, NULL);
+ if (ret < 0) {
+ error_report("%s: send polling control error (entry) %s",
+ __func__, strerror(ret));
+ return ret;
+ }
+ }
+
+ switch (head->type) {
+ case RDMA_CONTROL_READY:
+ wait=false;
+ rdma->control_sent_expected = true;
+ break;
+
+ default:
+ wait=true;
+ break;
+ }
+
/*
* We don't actually need to do a memcpy() in here if we used
* the "sge" properly, but since we're only sending control messages
@@ -1646,13 +1680,15 @@ static int qemu_rdma_post_send_control(RDMAContext *rdma, uint8_t *buf,
ret = ibv_post_send(rdma->qp, &send_wr, &bad_wr);
if (ret > 0) {
- error_report("Failed to use post IB SEND for control");
+ error_report("Failed to use post IB SEND for control: %s", strerror(ret));
return -ret;
}
- ret = qemu_rdma_block_for_wrid(rdma, RDMA_WRID_SEND_CONTROL, NULL);
- if (ret < 0) {
- error_report("rdma migration: send polling control error");
+ if (wait) {
+ ret = qemu_rdma_block_for_wrid(rdma, RDMA_WRID_SEND_CONTROL, NULL);
+ if (ret < 0) {
+ error_report("rdma migration: send polling control error");
+ }
}
return ret;
diff --git a/trace-events b/trace-events
index 9043b56..0eeb6b2 100644
--- a/trace-events
+++ b/trace-events
@@ -1447,6 +1447,7 @@ qemu_rdma_exchange_send_received(const char *desc) "Response %s received."
qemu_rdma_fill(size_t control_len, size_t size) "RDMA %zd of %zd bytes already in buffer"
qemu_rdma_init_ram_blocks(int blocks) "Allocated %d local ram block structures"
qemu_rdma_poll_recv(const char *compstr, int64_t comp, int64_t id, int sent) "completion %s #%" PRId64 " received (%" PRId64 ") left %d"
+qemu_rdma_poll_send(const char *compstr, int64_t comp, int64_t id, int sent) "completion %s #%" PRId64 " received (%" PRId64 ") left %d"
qemu_rdma_poll_write(const char *compstr, int64_t comp, int left, uint64_t block, uint64_t chunk, void *local, void *remote) "completions %s (%" PRId64 ") left %d, block %" PRIu64 ", chunk: %" PRIu64 " %p %p"
qemu_rdma_poll_other(const char *compstr, int64_t comp, int left) "other completion %s (%" PRId64 ") received left %d"
qemu_rdma_post_send_control(const char *desc) "CONTROL: sending %s.."
--
2.5.0
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
next reply other threads:[~2015-12-11 17:49 UTC|newest]
Thread overview: 7+ messages / expand[flat|nested] mbox.gz Atom feed top
2015-12-11 17:48 Dr. David Alan Gilbert [this message]
2015-12-12 3:40 ` [Qemu-devel] An RDMA race? Michael R. Hines
2015-12-14 10:53 ` Dr. David Alan Gilbert
2015-12-20 7:08 ` Michael R. Hines
2015-12-20 7:09 ` Michael R. Hines
2016-01-04 18:15 ` Dr. David Alan Gilbert
2016-01-09 11:03 ` Michael R. Hines
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20151211174850.GK2987@work-vm \
--to=dgilbert@redhat.com \
--cc=michael@hinespot.com \
--cc=qemu-devel@nongnu.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.