* is it possible to avoid syncing after an rdma write?
@ 2010-02-16 23:29 Andy Grover
From: Andy Grover @ 2010-02-16 23:29 UTC (permalink / raw)
To: linux-rdma-u79uwXL29TY76Z2rM5mHXA
Right now, RDS follows each RDMA write op with a Send op, which 1)
causes an interrupt and 2) includes the info we need to call
ib_dma_sync_sg_for_cpu() for the target of the rdma write.
We want to omit the Send. If we don't do the sync on the machine that is
the target of the RDMA write, the result is... what exactly? I assume
the write to memory is snooped by CPUs, so their cachelines will be
properly invalidated. However, Linux DMA-API docs seem pretty clear in
insisting on the sync.
Is the issue IOMMUs? Or for compatibility with bounce buffering?
Thanks in advance -- Regards -- Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
* Re: is it possible to avoid syncing after an rdma write?
  2010-02-17  0:58 ` Jason Gunthorpe
From: Jason Gunthorpe @ 2010-02-17 0:58 UTC (permalink / raw)
To: Andy Grover; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Tue, Feb 16, 2010 at 03:29:48PM -0800, Andy Grover wrote:
> Right now, RDS follows each RDMA write op with a Send op, which 1)
> causes an interrupt and 2) includes the info we need to call
> ib_dma_sync_sg_for_cpu() for the target of the rdma write.
>
> We want to omit the Send. If we don't do the sync on the machine that is
> the target of the RDMA write, the result is... what exactly? I assume
> the write to memory is snooped by CPUs, so their cachelines will be
> properly invalidated. However, Linux DMA-API docs seem pretty clear in
> insisting on the sync.

I'm curious about this too, but I will point out that at least the user
RDMA interface has no match for the kernel DMA calls, so in practice
RDMA does not work on systems that require them. That means bounce
buffering is not used and IO/CPU caches are coherent. Though, I guess,
the kernel could use weaker memory ordering types in kernel mode that
do require the DMA API calls.

> Is the issue IOMMUs? Or for compatibility with bounce buffering?

As long as the memory is registered, the IOMMU should remain configured.

What do you intend to replace the SEND with? Spin on last byte? There
are other issues to consider, like ordering within the PCI-E fabric..

Jason
* RE: is it possible to avoid syncing after an rdma write?
  2010-02-17  1:05 ` Paul Grun
From: Paul Grun @ 2010-02-17 1:05 UTC (permalink / raw)
To: 'Jason Gunthorpe', 'Andy Grover'
Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA

Why not use an RDMA write w/ immed? That forces the consumption of a
receive WQE and can be used to create a completion event. Since the
immediate data is carried in the last packet of a multi-packet RDMA
write, you are guaranteed that all data has been placed in the receive
buffer, in order.

I'm a hardware guy, so this may be completely off-the-wall w.r.t. this
particular discussion.

-Paul
* Re: is it possible to avoid syncing after an rdma write?
  2010-02-17  1:12 ` Jason Gunthorpe
From: Jason Gunthorpe @ 2010-02-17 1:12 UTC (permalink / raw)
To: Paul Grun; +Cc: 'Andy Grover', linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Tue, Feb 16, 2010 at 05:05:21PM -0800, Paul Grun wrote:
> Why not use an RDMA write w/ immed? That forces the consumption of a
> receive WQE and can be used to create a completion event. Since the
> immediate data is carried in the last packet of a multi-packet RDMA
> write, you are guaranteed that all data has been placed in the receive
> buffer, in order.

Yes, RDMA WRITE with immediate data is perfectly fine. I've even
implemented some protocols that use it to good effect. Not sure what
the performance trade-off is like, though. The immediate data pretty
much behaves exactly like a SEND WC on the receive side, but there may
be some performance and latency advantages, particularly on the send
side.

Jason
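[For reference, the shape of the two send-side approaches being compared can be sketched as below. The struct and enum are deliberately simplified stand-ins for libibverbs' `struct ibv_send_wr` and `enum ibv_wr_opcode` (the real types live in `<infiniband/verbs.h>`) so the sketch stays self-contained; the function names are invented for illustration.]

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-ins for the libibverbs work-request types. */
enum wr_opcode { WR_SEND, WR_RDMA_WRITE, WR_RDMA_WRITE_WITH_IMM };

struct send_wr {
    enum wr_opcode opcode;
    uint64_t remote_addr;   /* wr.rdma.remote_addr in the real struct */
    uint32_t rkey;          /* wr.rdma.rkey */
    uint32_t imm_data;      /* carried in the last packet of the WRITE */
};

/* RDMA WRITE followed by a synchronizing SEND: two WRs on the send
 * side, and the SEND generates a receive completion at the target. */
static int post_write_then_send(struct send_wr wrs[2],
                                uint64_t raddr, uint32_t rkey)
{
    wrs[0] = (struct send_wr){ WR_RDMA_WRITE, raddr, rkey, 0 };
    wrs[1] = (struct send_wr){ WR_SEND, 0, 0, 0 };
    return 2;               /* send-side WQEs consumed */
}

/* RDMA WRITE with immediate: a single WR; the immediate data rides in
 * the last packet, consumes a receive WQE, and is only delivered after
 * all payload has been placed. */
static int post_write_with_imm(struct send_wr wrs[1],
                               uint64_t raddr, uint32_t rkey, uint32_t imm)
{
    wrs[0] = (struct send_wr){ WR_RDMA_WRITE_WITH_IMM, raddr, rkey, imm };
    return 1;
}
```

[This is just the WR bookkeeping; real code would hand the chain to `ibv_post_send()` and reap the completion on the receive side.]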
* RE: is it possible to avoid syncing after an rdma write?
  2010-02-17  6:40 ` Paul Grun
From: Paul Grun @ 2010-02-17 6:40 UTC (permalink / raw)
To: 'Jason Gunthorpe'
Cc: 'Andy Grover', linux-rdma-u79uwXL29TY76Z2rM5mHXA

Two advantages come to mind vs an RDMA Write followed by a SEND: using
a SEND will consume a second WQE on the send side, and the
synchronizing SEND will cause an entire new transaction, which will
consume an (infinitesimally) small amount of additional wire bandwidth,
as well as incurring an (infinitesimally) small likelihood of a dropped
or lost packet. Nits? Yes, probably infinitesimally small ones.
(Hardware guys tend to worry about the small ones.)

To answer Andy's original question, the behavior on the receive side is
not guaranteed until control of the receive buffer has been formally
returned to the receiver. I expect that most HCAs are pretty well
behaved here, as are most CPU/memory/root complexes... but you never
know. Can anybody guarantee that the inbound packet gets written to
memory in order? If something odd did happen, it seems like one of
those places that would require an incredible stroke of luck to debug.
OTOH, I know that many applications simply poll the receive buffer
looking for a flag every day and get away with it.
* Re: is it possible to avoid syncing after an rdma write?
  2010-02-17 18:59 ` Jason Gunthorpe
From: Jason Gunthorpe @ 2010-02-17 18:59 UTC (permalink / raw)
To: Paul Grun; +Cc: 'Andy Grover', linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Tue, Feb 16, 2010 at 10:40:45PM -0800, Paul Grun wrote:
> Two advantages come to mind vs an RDMA Write followed by a SEND:
> Using a SEND will consume a second WQE on the send side, and the
> synchronizing SEND will cause an entire new transaction, which will
> consume a(n infinitesimally) small amount of additional wire
> bandwidth, as well as incurring a(infinitesimally) small likelihood
> of a dropped or lost packets.

The practical reason I've used it is because it doesn't consume an
unack'd message slot. My apps can exceed the max unacked messages in
flight ...

> To answer Andy's original question, the behavior on the receive side
> is not guaranteed until control of the receive buffer has been
> formally returned to the receiver. I expect that most HCAs are
> pretty well behaved here, as are most CPU/memory/root
> complexes...but you never know. Can anybody guarantee that the
> inbound packet gets written to the memory in order?

The HCA could guarantee this. It needs to issue the PCI-E transaction
that contains the last byte with relaxed ordering cleared, after all
other transactions have been issued. In this case it would have
identical synchronizing properties to the WCE write. PCI-E requires
that the host observe the writes in a transaction in-order, and that a
host observe multiple transactions according to PCI-E ordering rules.
Presumably common HCAs do all this to support MPI.

Andy, I think if you want to do poll-on-completion in the kernel then
maybe introducing a new ib_dma_poll call for that purpose is a good
idea? A first crack could just fail if there is bounce buffering,
non-coherent caching, etc.

Jason
* Re: is it possible to avoid syncing after an rdma write?
  2010-02-17 19:54 ` Andy Grover
From: Andy Grover @ 2010-02-17 19:54 UTC (permalink / raw)
Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA

Jason Gunthorpe wrote:
> On Tue, Feb 16, 2010 at 03:29:48PM -0800, Andy Grover wrote:
>> Right now, RDS follows each RDMA write op with a Send op, which 1)
>> causes an interrupt and 2) includes the info we need to call
>> ib_dma_sync_sg_for_cpu() for the target of the rdma write.
>>
>> We want to omit the Send. If we don't do the sync on the machine that is
>> the target of the RDMA write, the result is... what exactly? I assume
>> the write to memory is snooped by CPUs, so their cachelines will be
>> properly invalidated. However, Linux DMA-API docs seem pretty clear in
>> insisting on the sync.
<snip>
> What do you intend to replace the SEND with? spin on last byte? There
> are other issues to consider like ordering within the PCI-E fabric..

Well, hopefully nothing. What I'm looking for is to write to a target
region multiple times, as efficiently as possible, but be able to
occasionally read it on the target machine and get consistent results.
I definitely don't want to take an event, and avoiding the CQE would be
nice.

What I'm hearing is that I don't have to worry about what the Linux
DMA-API docs say about noncoherent mappings, but I need to be mindful
of IB spec section 9.5, o9-20:

---
o9-20: An application shall not depend on the contents of an RDMA WRITE
buffer at the responder until one of the following has occurred:

* Arrival and completion of the last RDMA WRITE request packet when
  used with Immediate data.
* Arrival and completion of a subsequent SEND message.
* Update of a memory element by a subsequent ATOMIC operation.
---

So if I do an RDMA write and follow it up with an atomic op, it sounds
like I can achieve the behavior I want, without an event or CQE.
Although for my particular use case with ongoing writes, the CPU
couldn't fetch more than one value (64-bit?) without potentially
reading data from a later write, I would think.

Regards -- Andy
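[The spin-on-last-byte alternative raised earlier in the thread can be sketched roughly as below. This is a minimal single-process simulation with invented names, assuming a cache-coherent host; in a real deployment the remote HCA, not a local function, fills the buffer, and the HCA/PCI-E ordering discussed above must guarantee the flag byte lands last.]

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAYLOAD_LEN 64

/* Receive region: payload followed by a flag byte written last. */
struct rdma_region {
    uint8_t payload[PAYLOAD_LEN];
    volatile uint8_t done;      /* reader spins on this */
};

/* Stand-in for the remote HCA: place the data, then the flag.
 * On real hardware the "flag last" ordering must come from the HCA
 * issuing the final byte after all other PCI-E transactions. */
static void remote_write(struct rdma_region *r, const uint8_t *src, size_t n)
{
    memcpy(r->payload, src, n);
    r->done = 1;
}

/* Reader: check the flag, then consume the payload. Returns 0 when
 * the write has not completed yet, so the caller can retry (poll). */
static int poll_and_read(struct rdma_region *r, uint8_t *dst, size_t n)
{
    if (!r->done)
        return 0;
    memcpy(dst, r->payload, n);
    return 1;
}
```

[Note this only synchronizes a single write; as the rest of the thread discusses, rewriting the same region repeatedly needs additional flow control so the reader knows a new write has not already begun.]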
* Re: is it possible to avoid syncing after an rdma write?
  2010-02-17 22:25 ` Jason Gunthorpe
From: Jason Gunthorpe @ 2010-02-17 22:25 UTC (permalink / raw)
To: Andy Grover; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Wed, Feb 17, 2010 at 11:54:44AM -0800, Andy Grover wrote:
> > What do you intend to replace the SEND with? spin on last byte? There
> > are other issues to consider like ordering within the PCI-E fabric..
>
> Well, hopefully nothing. What I'm looking for is to write to a target
> region multiple times, as efficiently as possible, but be able to
> occasionally read it on the target machine and get consistent results. I
> definitely don't want to take an event, and avoiding the CQE would be nice.

Ahhh, interesting, I've thought about doing something like that as
well. Sounds to me like you want to often RDMA WRITE some state
information and have the CPU read that state from time to time, ie
some kind of pointer values or whatever. I didn't come to a
satisfactory method and gave up on the idea..

IMHO, the critical problem to solve is that you cannot re-write over
the same region again and again. Guaranteeing CPU and RDMA consistency
is hard. For instance, if the CPU reads two 64-bit values from your
WRITE region there is no way to guarantee anything about them, other
than that all of the bytes were written at some point by the far side.
Likewise, a 32-bit CPU might read a 64-bit value with two memory
transactions, with no chance of guaranteed coherence.

Basically, it depends on your requirements for the data. If you have
an array of 32-bit values that have no inter-relationships then I
think it can work OK. Anything else becomes a lot harder.

> What I'm hearing is that I don't have to worry about what the Linux
> DMA-API docs say about noncoherent mappings, but I need to be mindful of
> IB spec 9.5 section o9-20:

You cannot ignore it completely, but to support userspace there is a
way to ensure you get the right kind of mapping for this to work.

> So if I do an RDMA write and follow it up with an atomic op, it sounds
> like I can achieve the behavior I want, and without an event or CQE.
> Although for my particular use case with ongoing writes, the CPU
> couldn't fetch more than one value (64bit?) without potentially reading
> data from a later write, I would think.

You don't need the atomic at all; it doesn't do anything if you intend
to start another RDMA WRITE to the same memory soon. The problem you
face is not knowing when the last write finished but knowing when the
next write is going to start. sizeof(atomic_t) is probably all you get,
which will be 32 bits on 32-bit Linux.

For instance, a strategy that can work OK would be to have an array of
your states, where the far side RDMA WRITEs into consecutive positions
and uses an unsignaled immediate data to indicate the tail. The recv
side runs through the CQEs and determines the latest write region. If
you run out of slots or out of CQEs then the sender waits for more..
Or replace the immediate data with a last-byte-written poll (like
MPI). Either way, the key is that you are never writing twice without
synchronizing both sides.

Jason
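[The slot-ring strategy described above can be sketched roughly as follows. This is a single-process simulation with invented names: the "writer" stands in for the far side posting RDMA WRITEs into consecutive slots, and each slot ends with a sequence byte that plays the role of the last-byte-written poll. In a real protocol the reader's "freeing" of a slot would have to travel back to the remote writer as flow-control credit.]

```c
#include <stdint.h>
#include <string.h>

#define NSLOTS 8
#define STATE_LEN 32

/* One slot: a state payload plus a sequence byte written last.
 * seq == 0 means the slot is free for the writer to (re)fill. */
struct slot {
    uint8_t state[STATE_LEN];
    volatile uint8_t seq;
};

struct ring {
    struct slot slots[NSLOTS];
    unsigned head;        /* next slot the writer fills */
    uint8_t next_seq;     /* cycles 1..255, skipping 0 */
};

/* Writer side: stand-in for an RDMA WRITE into the next slot.
 * Returns 0 if the ring is full and the sender must wait. */
static int ring_write(struct ring *rg, const uint8_t *st)
{
    struct slot *s = &rg->slots[rg->head % NSLOTS];
    if (s->seq != 0)
        return 0;                  /* reader hasn't consumed it yet */
    memcpy(s->state, st, STATE_LEN);
    s->seq = rg->next_seq;         /* written last: marks completion */
    rg->next_seq = (uint8_t)((rg->next_seq % 255) + 1);
    rg->head++;
    return 1;
}

/* Reader side: consume the oldest completed slot and free it. */
static int ring_read(struct ring *rg, unsigned *tail, uint8_t *out)
{
    struct slot *s = &rg->slots[*tail % NSLOTS];
    if (s->seq == 0)
        return 0;                  /* nothing new */
    memcpy(out, s->state, STATE_LEN);
    s->seq = 0;                    /* hand the slot back */
    (*tail)++;
    return 1;
}
```

[Because each slot is written at most once before both sides synchronize, the reader never races an in-progress rewrite of the region it is consuming, which is exactly the property the atomic-op idea could not provide.]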
* Re: is it possible to avoid syncing after an rdma write?
  2010-02-17 10:40 ` Or Gerlitz
From: Or Gerlitz @ 2010-02-17 10:40 UTC (permalink / raw)
To: Andy Grover; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA

Andy Grover wrote:
> RDS follows each RDMA write op with a Send op [...] we want to omit the Send

Andy,

One way or another, the side which isn't initiating the rdma write has
to be notified that the local buffer && rkey (stag) they advertised can
now be invalidated from the HCA/RNIC, its mapping removed from the node
IOMMU, returned to the pool it was allocated from, reclaimed by higher
layers, etc.

Or.
end of thread, other threads: [~2010-02-17 22:25 UTC | newest]

Thread overview: 9+ messages
2010-02-16 23:29 is it possible to avoid syncing after an rdma write? Andy Grover
2010-02-17 0:58 ` Jason Gunthorpe
2010-02-17 1:05 ` Paul Grun
2010-02-17 1:12 ` Jason Gunthorpe
2010-02-17 6:40 ` Paul Grun
2010-02-17 18:59 ` Jason Gunthorpe
2010-02-17 19:54 ` Andy Grover
2010-02-17 22:25 ` Jason Gunthorpe
2010-02-17 10:40 ` Or Gerlitz