From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
To: michael@hinespot.com
Cc: qemu-devel@nongnu.org
Subject: [Qemu-devel] An RDMA race?
Date: Fri, 11 Dec 2015 17:48:50 +0000 [thread overview]
Message-ID: <20151211174850.GK2987@work-vm> (raw)
Hi Michael,
I think I've got an RDMA race condition, but I'm being a little
cautious at the moment and wondered if you agree with the following
diagnosis.
It's showing up in a world of mine that's sending more control messages
from the destination->source and I'm seeing the following.
We normally expect:
src dest
----------->control ready->
Sees SEND_CONTROL signal to ack that it has been sent
<-----control message--
Sees RECV_CONTROL message from dest
but what I'm seeing is:
src dest
----------->control ready->
<-----control message--
Sees RECV_CONTROL message from dest
Sees SEND_CONTROL signal to ack that it has been sent
which seems a bit odd - that means that the source is sending
a message, and getting the reply before it gets the acknowledgment
from it's own stack that it sent the message the reply is to ?!
It's rare (~1 in 100 migrations ish) and only in my world
where I'm sending more control messages.
But that confuses qemu_rdma_post_send_control when it sends
the READY, because it calls block_for_wrid(SEND_CONTROL),
that sees the RECV_CONTROL (which it loses) and then the SEND_CONTROL.
Does this sound a sane explanation? My current fix (below)
I only use it for RDMA_CONTROL_READY , I'm torn between worrying
that the race is potentially more general, but worry what will
happen if I stop making post_send_control wait at the end for
all the other control messages.
It seems to work :-)
What do you reckon?
Dave
From 332b867fb0f63be47d89822b7a10b2a519b431fc Mon Sep 17 00:00:00 2001
From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Date: Fri, 11 Dec 2015 14:53:11 +0000
Subject: [PATCH] Don't wait for send-control on a ready
---
migration/rdma.c | 46 +++++++++++++++++++++++++++++++++++++++++-----
trace-events | 1 +
2 files changed, 42 insertions(+), 5 deletions(-)
diff --git a/migration/rdma.c b/migration/rdma.c
index b0feddc..6a0e59f 100644
--- a/migration/rdma.c
+++ b/migration/rdma.c
@@ -314,7 +314,13 @@ typedef struct RDMAContext {
* the READY message before send() does, in which case we need to
* know if it completed.
*/
- int control_ready_expected;
+ bool control_ready_expected;
+
+ /*
+ * Set when we've sent a control message but not yet waited for the
+ * result.
+ */
+ bool control_sent_expected;
/* number of outstanding writes */
int nb_sent;
@@ -1454,6 +1460,13 @@ static uint64_t qemu_rdma_poll(RDMAContext *rdma, uint64_t *wr_id_out,
rdma->control_ready_expected = 0;
}
+ if (rdma->control_sent_expected &&
+ (wr_id >= RDMA_WRID_SEND_CONTROL)) {
+ trace_qemu_rdma_poll_send(wrid_desc[RDMA_WRID_SEND_CONTROL],
+ wr_id - RDMA_WRID_SEND_CONTROL, wr_id, rdma->nb_sent);
+ rdma->control_sent_expected = 0;
+ }
+
if (wr_id == RDMA_WRID_RDMA_WRITE) {
uint64_t chunk =
(wc.wr_id & RDMA_WRID_CHUNK_MASK) >> RDMA_WRID_CHUNK_SHIFT;
@@ -1609,6 +1622,7 @@ static int qemu_rdma_post_send_control(RDMAContext *rdma, uint8_t *buf,
RDMAControlHeader *head)
{
int ret = 0;
+ bool wait;
RDMAWorkRequestData *wr = &rdma->wr_data[RDMA_WRID_CONTROL];
struct ibv_send_wr *bad_wr;
struct ibv_sge sge = {
@@ -1626,6 +1640,26 @@ static int qemu_rdma_post_send_control(RDMAContext *rdma, uint8_t *buf,
trace_qemu_rdma_post_send_control(control_desc[head->type]);
+ if (rdma->control_sent_expected) {
+ ret = qemu_rdma_block_for_wrid(rdma, RDMA_WRID_SEND_CONTROL, NULL);
+ if (ret < 0) {
+ error_report("%s: send polling control error (entry) %s",
+ __func__, strerror(ret));
+ return ret;
+ }
+ }
+
+ switch (head->type) {
+ case RDMA_CONTROL_READY:
+ wait=false;
+ rdma->control_sent_expected = true;
+ break;
+
+ default:
+ wait=true;
+ break;
+ }
+
/*
* We don't actually need to do a memcpy() in here if we used
* the "sge" properly, but since we're only sending control messages
@@ -1646,13 +1680,15 @@ static int qemu_rdma_post_send_control(RDMAContext *rdma, uint8_t *buf,
ret = ibv_post_send(rdma->qp, &send_wr, &bad_wr);
if (ret > 0) {
- error_report("Failed to use post IB SEND for control");
+ error_report("Failed to use post IB SEND for control: %s", strerror(ret));
return -ret;
}
- ret = qemu_rdma_block_for_wrid(rdma, RDMA_WRID_SEND_CONTROL, NULL);
- if (ret < 0) {
- error_report("rdma migration: send polling control error");
+ if (wait) {
+ ret = qemu_rdma_block_for_wrid(rdma, RDMA_WRID_SEND_CONTROL, NULL);
+ if (ret < 0) {
+ error_report("rdma migration: send polling control error");
+ }
}
return ret;
diff --git a/trace-events b/trace-events
index 9043b56..0eeb6b2 100644
--- a/trace-events
+++ b/trace-events
@@ -1447,6 +1447,7 @@ qemu_rdma_exchange_send_received(const char *desc) "Response %s received."
qemu_rdma_fill(size_t control_len, size_t size) "RDMA %zd of %zd bytes already in buffer"
qemu_rdma_init_ram_blocks(int blocks) "Allocated %d local ram block structures"
qemu_rdma_poll_recv(const char *compstr, int64_t comp, int64_t id, int sent) "completion %s #%" PRId64 " received (%" PRId64 ") left %d"
+qemu_rdma_poll_send(const char *compstr, int64_t comp, int64_t id, int sent) "completion %s #%" PRId64 " received (%" PRId64 ") left %d"
qemu_rdma_poll_write(const char *compstr, int64_t comp, int left, uint64_t block, uint64_t chunk, void *local, void *remote) "completions %s (%" PRId64 ") left %d, block %" PRIu64 ", chunk: %" PRIu64 " %p %p"
qemu_rdma_poll_other(const char *compstr, int64_t comp, int left) "other completion %s (%" PRId64 ") received left %d"
qemu_rdma_post_send_control(const char *desc) "CONTROL: sending %s.."
--
2.5.0
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
next reply other threads:[~2015-12-11 17:49 UTC|newest]
Thread overview: 7+ messages / expand[flat|nested] mbox.gz Atom feed top
2015-12-11 17:48 Dr. David Alan Gilbert [this message]
2015-12-12 3:40 ` [Qemu-devel] An RDMA race? Michael R. Hines
2015-12-14 10:53 ` Dr. David Alan Gilbert
2015-12-20 7:08 ` Michael R. Hines
2015-12-20 7:09 ` Michael R. Hines
2016-01-04 18:15 ` Dr. David Alan Gilbert
2016-01-09 11:03 ` Michael R. Hines
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20151211174850.GK2987@work-vm \
--to=dgilbert@redhat.com \
--cc=michael@hinespot.com \
--cc=qemu-devel@nongnu.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).