From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from eggs.gnu.org ([2001:4830:134:3::10]:59562)
    by lists.gnu.org with esmtp (Exim 4.71) (envelope-from )
    id 1aHrId-00073g-5P for qemu-devel@nongnu.org;
    Sat, 09 Jan 2016 06:03:12 -0500
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
    (envelope-from ) id 1aHrIZ-0003Ez-Pm for qemu-devel@nongnu.org;
    Sat, 09 Jan 2016 06:03:11 -0500
Received: from mail-oi0-x232.google.com ([2607:f8b0:4003:c06::232]:33762)
    by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from )
    id 1aHrIZ-0003Ev-JM for qemu-devel@nongnu.org;
    Sat, 09 Jan 2016 06:03:07 -0500
Received: by mail-oi0-x232.google.com with SMTP id y66so349737494oig.0
    for ; Sat, 09 Jan 2016 03:03:06 -0800 (PST)
References: <20151211174850.GK2987@work-vm>
    <20151214105308.GE2493@work-vm>
    <20160104181555.GB12368@work-vm>
From: "Michael R. Hines"
Message-ID: <5690E8E9.8060701@digitalocean.com>
Date: Sat, 9 Jan 2016 05:03:05 -0600
MIME-Version: 1.0
In-Reply-To: <20160104181555.GB12368@work-vm>
Content-Type: text/plain; charset=windows-1252; format=flowed
Content-Transfer-Encoding: 7bit
Subject: Re: [Qemu-devel] An RDMA race?
List-Id:
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
To: "Dr. David Alan Gilbert"
Cc: Michael Hines , qemu-devel@nongnu.org

I don't mind ACKing this change if we could agree on some kind of
regression test for this. (I have an RDMA card at home that I could run
tests on if need be).

The way that virt-test goes about this is not sufficient. The way I do
testing for RDMA is that I not only confirm that the migration succeeded
or failed, but I actually compare serial console output for funny
keywords, like "panic" and so forth, as a poor man's attempt at guessing
whether or not there was any memory corruption.

Do you have a testing harness for yourself? (I'd also like to know what
the COLO guys are doing). Maybe we can coalesce around something?

- Michael

On 01/04/2016 12:15 PM, Dr. David Alan Gilbert wrote:
> * Michael R. Hines (mhines@digitalocean.com) wrote:
>> Adding such a control message would defeat the benefits of RDMA, as there
>> shouldn't be any signalling in the actual DMA path, or RDMA latency would
>> be too high. If you're sending control messages for individual writes, then
>> you need to change up your design. It's OK to design ACKs for groups of
>> writes, depending on the requirements.
> I started off with sending individual messages, and then once I had it working
> I made it group them to send one message every 2048 pages.
> The performance isn't very good though, and I've not yet analysed why.
>
>> So, the out-of-order issue you're seeing is only with your new message, not
>> the original messages?
> Yes I believe they're only on the new messages; however:
>   1) I'm sending a lot more control messages, so if there's a race I'm
>      a lot more likely to trigger it. (I'm not sure I'm triggering it in the
>      case where I group those 2048 together) - so does this mean it would
>      occasionally trigger on the unmodified code?
>
>   2) My reading of the existing code is that I think it could happen;
>      a) the source is ready to send something and is waiting for a CONTROL_READY,
>      b) the destination sends the CONTROL_READY
>         (blocking in qemu_rdma_post_send_control call to
>          qemu_rdma_block_for_wrid(rdma, RDMA_WRID_SEND_CONTROL, NULL))
>      c) The source sends its data
>      d) That arrives at the destination
>      e) finally the WRID_SEND_CONTROL arrives back
>
>      It's having d/e the wrong way round which is the race I think I'm seeing
>      and then we lose (d)'s data.
>
>> Can you describe/document it in more detail so I can help advise?
> There are 2 cases where the destination needs to know which pages it's received:
>   i) In COLO or checkpointing where it's receiving a partial new checkpoint;
>      since it's only receiving a partial checkpoint it needs to know what it's
>      received. This allows the destination to avoid copying the whole of its
>      received checkpoint and only copy the bits that changed.
>
>   ii) On postcopy once a page is received by the destination the page has to
>       be atomically placed; I've not thought too hard about that yet.
>
> Dave
>
>> - Michael
>>
>> On Mon, Dec 14, 2015 at 6:53 PM, Dr. David Alan Gilbert wrote:
>>> * Michael R. Hines (mhines@digitalocean.com) wrote:
>>>> David,
>>>>
>>>> Thanks for including my email directly. It helps a lot.
>>>>
>>>> Below, I'm going to assume that only "dest" is calling
>>>> qemu_rdma_exchange_recv() and only src is calling
>>>> qemu_rdma_exchange_send(), since you didn't specify who
>>>> is sending and who is receiving.
>>>>
>>>> If that assumption is wrong, please respond again.
>>> That's correct.
>>>
>>>> Comments inline.....
>>>>
>>>> On Sat, Dec 12, 2015 at 1:48 AM, Dr. David Alan Gilbert
>>>> <dgilbert@redhat.com> wrote:
>>>>> Hi Michael,
>>>>>   I think I've got an RDMA race condition, but I'm being a little
>>>>> cautious at the moment and wondered if you agree with the following
>>>>> diagnosis.
>>>>>
>>>>> It's showing up in a world of mine that's sending more control messages
>>>>> from the destination->source and I'm seeing the following.
>>>>>
>>>>> We normally expect:
>>>>>
>>>>>    src                          dest
>>>>>       ----------->control ready->
>>>>>
>>>> If src is sending, this is not correct. Dest should send the ready message
>>>> if it is receiving, not src, which breaks the above assumption. So, I'll
>>>> reverse the assumption previously and continue with your observation and
>>>> assume that src is receiving instead of dest, which should instead look
>>>> like:
>>> Gah! Yes, I got the label the wrong way around; it's dest sending control
>>> ready.
>>>
>>>>    src (receiving)              dest (sending)
>>>>       ----------->control ready->
>>>>
>>>>
>>>>
>>>>>    Sees SEND_CONTROL signal to ack that it has been sent
>>>>>
>>>> I'll assume here that you meant that dest sees the ready message and
>>>> then later sends something.
>>>>
>>>>
>>>>>       <-----control message--
>>>>>    Sees RECV_CONTROL message from dest
>>>>>
>>>>>
>>>> Similar assumption for the receiver (src).
>>>>
>>>>
>>>>> but what I'm seeing is:
>>>>>    src                          dest
>>>>>       ----------->control ready->
>>>>>       <-----control message--
>>>>>    Sees RECV_CONTROL message from dest
>>>>>
>>>> hmmmmm....
>>>>
>>>>
>>>>>    Sees SEND_CONTROL signal to ack that it has been sent
>>>>>
>>>>>
>>>> There's not enough information here....... do you have a multi-threaded
>>>> send or receive or something?
>>> No, I've been trying to wire RDMA into the COLO fault-tolerant setup;
>>> so the change which got me to trigger this bug was that I'd
>>> added a new control message 'notify write' which explicitly
>>> told the destination it had a page written to; at the RDMA level
>>> that was the only change.
>>>
>>>> Do the work request IDs match up?
>>> Yes I think so; I also added a sequence number to the 'ready' messages
>>> to check I wasn't losing one.
>>> I had a chat to one of our RDMA guys (Doug Ledford) and he said
>>> it's perfectly legal for RDMA to take longer to return the signal
>>> from the send than for the round trip of the destination responding;
>>> the 'signal' doesn't happen until an ack has been received from the
>>> destination card anyway, so the ack can get delayed or retried.
>>> So I think we do need to fix this; the question then is how do we fix
>>> it for all control messages without breaking anything else. Are there
>>> any cases that rely on having received the signal from the send before
>>> continuing, or could I just do what I'm doing for all control messages?
>>>
>>> Dave
>>>
>>>> - Michael
>>> --
>>> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>>>
>>
>>
>> --
>> /*
>> * Michael R. Hines
>> * https://michael.hinespot.com
>> */
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

--
/*
 * Michael R. Hines
 * Platform Engineer, DigitalOcean.
 */
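
For reference, the serial-console check described at the top of this message could be sketched roughly as below. This is only an illustration, not the harness actually used: the function names (scan_console, migration_looks_clean) and every keyword other than "panic" are assumptions made up for the example, and the console output is assumed to have already been captured as text.

```python
import re

# Hypothetical keyword list: only "panic" is taken from the message above,
# the other entries are illustrative guesses.
SUSPICIOUS = ("panic", "oops", "segfault", "bug:")


def scan_console(text, keywords=SUSPICIOUS):
    """Return (line_number, line) pairs for console lines containing a keyword."""
    pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        if pattern.search(line):
            hits.append((number, line.strip()))
    return hits


def migration_looks_clean(migration_ok, console_text):
    """Pass only if the migration reported success and no keyword was seen."""
    return bool(migration_ok) and not scan_console(console_text)


if __name__ == "__main__":
    sample = "Booting guest\nKernel panic - not syncing: Attempted to kill init!\n"
    for number, line in scan_console(sample):
        print(f"line {number}: {line}")
```

A plain keyword match like this is crude and can give false positives (for example a "panic=" option echoed from the kernel command line), so it is a rough signal of possible memory corruption rather than proof either way.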