public inbox for linux-rdma@vger.kernel.org
* is it possible to avoid syncing after an rdma write?
@ 2010-02-16 23:29 Andy Grover
       [not found] ` <4B7B2A6C.80101-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 9+ messages in thread
From: Andy Grover @ 2010-02-16 23:29 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA

Right now, RDS follows each RDMA write op with a Send op, which 1)
causes an interrupt and 2) includes the info we need to call
ib_dma_sync_sg_for_cpu() for the target of the rdma write.

We want to omit the Send. If we don't do the sync on the machine that is
the target of the RDMA write, the result is... what exactly? I assume
the write to memory is snooped by CPUs, so their cachelines will be
properly invalidated. However, Linux DMA-API docs seem pretty clear in
insisting on the sync.

Is the issue IOMMUs? Or for compatibility with bounce buffering?
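For concreteness, the pattern in question can be modeled in a few lines of plain C: the receive completion for the trailing Send is what carries enough information to sync the RDMA WRITE target before the CPU looks at it. All names below (rds_recv_completion, the stubs) are invented stand-ins for illustration, not the real RDS/IB code.

```c
/* Toy model of the current RDS receive path: the completion handler
 * for the trailing Send syncs the RDMA WRITE target for the CPU
 * before anything reads it. */

enum { EV_SYNC = 1, EV_READ = 2 };
static int event_log[4];
static int event_count;

/* stand-in for ib_dma_sync_sg_for_cpu() */
static void ib_dma_sync_stub(void *buf, int len)
{
    (void)buf; (void)len;
    event_log[event_count++] = EV_SYNC;
}

static void cpu_read_stub(const char *buf)
{
    (void)buf;
    event_log[event_count++] = EV_READ;
}

/* The Send's receive completion tells us which buffer to sync. */
static void rds_recv_completion(char *rdma_target, int len)
{
    ib_dma_sync_stub(rdma_target, len);   /* sync first... */
    cpu_read_stub(rdma_target);           /* ...then it is safe to read */
}

static int sync_happens_before_read(void)
{
    char buf[64];
    event_count = 0;
    rds_recv_completion(buf, (int)sizeof(buf));
    return event_count == 2 &&
           event_log[0] == EV_SYNC && event_log[1] == EV_READ;
}
```

Omitting the Send removes exactly the hook where that sync call hangs, which is the question at issue.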

Thanks in advance -- Regards -- Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: is it possible to avoid syncing after an rdma write?
       [not found] ` <4B7B2A6C.80101-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
@ 2010-02-17  0:58   ` Jason Gunthorpe
       [not found]     ` <20100217005827.GF16490-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  2010-02-17 10:40   ` Or Gerlitz
  1 sibling, 1 reply; 9+ messages in thread
From: Jason Gunthorpe @ 2010-02-17  0:58 UTC (permalink / raw)
  To: Andy Grover; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Tue, Feb 16, 2010 at 03:29:48PM -0800, Andy Grover wrote:
> Right now, RDS follows each RDMA write op with a Send op, which 1)
> causes an interrupt and 2) includes the info we need to call
> ib_dma_sync_sg_for_cpu() for the target of the rdma write.
> 
> We want to omit the Send. If we don't do the sync on the machine that is
> the target of the RDMA write, the result is... what exactly? I assume
> the write to memory is snooped by CPUs, so their cachelines will be
> properly invalidated. However, Linux DMA-API docs seem pretty clear in
> insisting on the sync.

I'm curious about this too, but I will point out that at least the
user RDMA interface has no match for the kernel DMA calls, so in
practice RDMA does not work on systems that require them. That means
bounce buffering is not used and IO/CPU caches are coherent.

Though, I guess, the kernel could use weaker memory ordering types in
kernel mode that do require the DMA api calls.

> Is the issue IOMMUs? Or for compatibility with bounce buffering?

As long as the memory is registered the IOMMU should remain
configured.

What do you intend to replace the SEND with? spin on last byte? There
are other issues to consider like ordering within the PCI-E fabric..

Jason


* RE: is it possible to avoid syncing after an rdma write?
       [not found]     ` <20100217005827.GF16490-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2010-02-17  1:05       ` Paul Grun
  2010-02-17  1:12         ` Jason Gunthorpe
  2010-02-17 19:54       ` Andy Grover
  1 sibling, 1 reply; 9+ messages in thread
From: Paul Grun @ 2010-02-17  1:05 UTC (permalink / raw)
  To: 'Jason Gunthorpe', 'Andy Grover'
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA

Why not use an RDMA write w/ immed?  That forces the consumption of a
receive WQE and can be used to create a completion event.  Since the
immediate data is carried in the last packet of a multi-packet RDMA write,
you are guaranteed that all data has been placed in the receive buffer, in
order.

I'm a hardware guy, so this may be completely off-the-wall w.r.t. this
particular discussion.
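A toy model of that ordering guarantee, with the wire replaced by memcpy and all names and packet sizes invented: it only illustrates that a completion carrying the immediate cannot be observed before the full payload has been placed, since the immediate rides in the last packet.

```c
#include <string.h>

#define PKT_SIZE 4
#define NPKTS    3

struct rwi_model {
    char buf[PKT_SIZE * NPKTS];
    int  completed;     /* has the receiver seen a completion? */
    unsigned imm;       /* immediate data delivered with it */
};

/* Deliver a multi-packet RDMA WRITE in order; only the final packet
 * carries the immediate and generates the receive completion. */
static void deliver_write_with_imm(struct rwi_model *m,
                                   const char *payload, unsigned imm)
{
    for (int p = 0; p < NPKTS; p++) {
        memcpy(m->buf + p * PKT_SIZE, payload + p * PKT_SIZE, PKT_SIZE);
        if (p == NPKTS - 1) {       /* last packet: immediate + CQE */
            m->imm = imm;
            m->completed = 1;
        }
    }
}

static int data_complete_at_cqe(void)
{
    struct rwi_model m = {0};
    deliver_write_with_imm(&m, "aaaabbbbcccc", 0x1234);
    /* by the time the completion is visible, the whole buffer is too */
    return m.completed && m.imm == 0x1234 &&
           memcmp(m.buf, "aaaabbbbcccc", sizeof(m.buf)) == 0;
}
```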
-Paul



* Re: is it possible to avoid syncing after an rdma write?
  2010-02-17  1:05       ` Paul Grun
@ 2010-02-17  1:12         ` Jason Gunthorpe
       [not found]           ` <20100217011224.GH16490-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 9+ messages in thread
From: Jason Gunthorpe @ 2010-02-17  1:12 UTC (permalink / raw)
  To: Paul Grun; +Cc: 'Andy Grover', linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Tue, Feb 16, 2010 at 05:05:21PM -0800, Paul Grun wrote:
> Why not use an RDMA write w/ immed?  That forces the consumption of a
> receive WQE and can be used to create a completion event.  Since the
> immediate data is carried in the last packet of a multi-packet RDMA write,
> you are guaranteed that all data has been placed in the receive buffer, in
> order.

Yes, RDMA WRITE w/ immediate data is perfectly fine. I've even
implemented some protocols that use it to good effect.

Not sure what the performance trade off is like though. The immediate
data pretty much behaves exactly like a SEND WC on the receive side,
but there may be some performance and latency advantages, particularly
on the send side.

Jason


* RE: is it possible to avoid syncing after an rdma write?
       [not found]           ` <20100217011224.GH16490-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2010-02-17  6:40             ` Paul Grun
  2010-02-17 18:59               ` Jason Gunthorpe
  0 siblings, 1 reply; 9+ messages in thread
From: Paul Grun @ 2010-02-17  6:40 UTC (permalink / raw)
  To: 'Jason Gunthorpe'
  Cc: 'Andy Grover', linux-rdma-u79uwXL29TY76Z2rM5mHXA

Two advantages come to mind vs an RDMA Write followed by a SEND:
Using a SEND will consume a second WQE on the send side, and the
synchronizing SEND will cause an entire new transaction, which will consume
an (infinitesimally) small amount of additional wire bandwidth, as well as
incurring an (infinitesimally) small likelihood of dropped or lost packets.

Nits?  Yes, probably infinitesimally small ones.  (Hardware guys tend to
worry about the small ones.)

To answer Andy's original question, the behavior on the receive side is not
guaranteed until control of the receive buffer has been formally returned to
the receiver.  I expect that most HCAs are pretty well behaved here, as are
most CPU/memory/root complexes...but you never know.  Can anybody guarantee
that the inbound packet gets written to the memory in order?  

If something odd did happen, it seems like one of those places that would
require an incredible stroke of luck to debug.

OTOH, I know that many applications simply poll the receive buffer
looking for a flag every day and get away with it.



* Re: is it possible to avoid syncing after an rdma write?
       [not found] ` <4B7B2A6C.80101-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
  2010-02-17  0:58   ` Jason Gunthorpe
@ 2010-02-17 10:40   ` Or Gerlitz
  1 sibling, 0 replies; 9+ messages in thread
From: Or Gerlitz @ 2010-02-17 10:40 UTC (permalink / raw)
  To: Andy Grover; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA

Andy Grover wrote:
> RDS follows each RDMA write op with a Send op [...] we want to omit the Send 
Andy,

One way or another, the side that isn't initiating the RDMA write has
to be notified that the local buffer && rkey (stag) they advertised can
now be invalidated from the HCA/RNIC, have its mapping removed from the
node IOMMU, be returned to the pool it was allocated from, be reclaimed
by higher layers, etc.

Or.


* Re: is it possible to avoid syncing after an rdma write?
  2010-02-17  6:40             ` Paul Grun
@ 2010-02-17 18:59               ` Jason Gunthorpe
  0 siblings, 0 replies; 9+ messages in thread
From: Jason Gunthorpe @ 2010-02-17 18:59 UTC (permalink / raw)
  To: Paul Grun; +Cc: 'Andy Grover', linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Tue, Feb 16, 2010 at 10:40:45PM -0800, Paul Grun wrote:

> Two advantages come to mind vs an RDMA Write followed by a SEND:
> Using a SEND will consume a second WQE on the send side, and the
> synchronizing SEND will cause an entire new transaction, which will
> consume a(n infinitesimally) small amount of additional wire
> bandwidth, as well as incurring a(infinitesimally) small likelihood
> of a dropped or lost packets.

The practical reason I've used it is because it doesn't consume an
unack'd message slot. My apps can exceed the max unacked messages in
flight ...

> To answer Andy's original question, the behavior on the receive side
> is not guaranteed until control of the receive buffer has been
> formally returned to the receiver.  I expect that most HCAs are
> pretty well behaved here, as are most CPU/memory/root
> complexes...but you never know.  Can anybody guarantee that the
> inbound packet gets written to the memory in order?

The HCA could guarantee this. It needs to issue the PCI-E transaction
that contains the last byte with relaxed ordering cleared, after all
other transactions have been issued. In this case it would have
identical synchronizing properties to the WCE write. PCI-E requires
that the host observe the writes in a transaction in-order and that a
host observe multiple transactions according to PCI-E rules.

Presumably common HCAs do all this to support MPI.

Andy, I think if you want to do poll-on-completion in the kernel then
maybe introducing a new ib_dma_poll call for that purpose is a good
idea? First crack could just fail if there is bounce buffering,
non-coherent caching, etc.
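To make the proposal concrete, here is one sketch of what such a call could look like. Everything below is hypothetical: ib_dma_poll does not exist in the kernel, and the dma_props struct and its flags are invented stand-ins for whatever the DMA layer would really expose.

```c
#include <errno.h>

/* Invented description of a mapping's properties, standing in for
 * whatever the real DMA layer would report. */
struct dma_props {
    int bounce_buffered;   /* e.g. swiotlb sits in the path */
    int noncoherent;       /* CPU caches not snooped by DMA */
};

/* Hypothetical ib_dma_poll(): a ULP would call this before spinning
 * on RDMA-written memory.  First crack simply refuses to work when
 * polling the memory directly is unsafe. */
static int ib_dma_poll(const struct dma_props *p)
{
    if (p->bounce_buffered || p->noncoherent)
        return -EOPNOTSUPP;     /* fall back to the Send-based path */
    return 0;                   /* coherent mapping: polling is fine */
}

static int poll_check_demo(void)
{
    struct dma_props coherent = {0, 0};
    struct dma_props bounced  = {1, 0};
    return ib_dma_poll(&coherent) == 0 &&
           ib_dma_poll(&bounced) == -EOPNOTSUPP;
}
```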

Jason


* Re: is it possible to avoid syncing after an rdma write?
       [not found]     ` <20100217005827.GF16490-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  2010-02-17  1:05       ` Paul Grun
@ 2010-02-17 19:54       ` Andy Grover
       [not found]         ` <4B7C4984.9050004-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
  1 sibling, 1 reply; 9+ messages in thread
From: Andy Grover @ 2010-02-17 19:54 UTC (permalink / raw)
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA

Jason Gunthorpe wrote:
> On Tue, Feb 16, 2010 at 03:29:48PM -0800, Andy Grover wrote:
>> Right now, RDS follows each RDMA write op with a Send op, which 1)
>> causes an interrupt and 2) includes the info we need to call
>> ib_dma_sync_sg_for_cpu() for the target of the rdma write.
>>
>> We want to omit the Send. If we don't do the sync on the machine that is
>> the target of the RDMA write, the result is... what exactly? I assume
>> the write to memory is snooped by CPUs, so their cachelines will be
>> properly invalidated. However, Linux DMA-API docs seem pretty clear in
>> insisting on the sync.

<snip>

> What do you intend to replace the SEND with? spin on last byte? There
> are other issues to consider like ordering within the PCI-E fabric..

Well, hopefully nothing. What I'm looking for is to write to a target
region multiple times, as efficiently as possible, but be able to
occasionally read it on the target machine and get consistent results. I
definitely don't want to take an event, and avoiding the CQE would be nice.

What I'm hearing is that I don't have to worry about what the Linux
DMA-API docs say about noncoherent mappings, but I need to be mindful of
IB spec 9.5 section o9-20:

---
o9-20: An application shall not depend on the contents of an RDMA WRITE
buffer at the responder until one of the following has occurred:
* Arrival and Completion of the last RDMA WRITE request packet when
  used with Immediate data.
* Arrival and completion of a subsequent SEND message.
* Update of a memory element by a subsequent ATOMIC operation.
---

So if I do an RDMA write and follow it up with an atomic op, it sounds
like I can achieve the behavior I want, and without an event or CQE.
Although for my particular use case with ongoing writes, the CPU
couldn't fetch more than one value (64bit?) without potentially reading
data from a later write, I would think.

Regards -- Andy



* Re: is it possible to avoid syncing after an rdma write?
       [not found]         ` <4B7C4984.9050004-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
@ 2010-02-17 22:25           ` Jason Gunthorpe
  0 siblings, 0 replies; 9+ messages in thread
From: Jason Gunthorpe @ 2010-02-17 22:25 UTC (permalink / raw)
  To: Andy Grover; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Wed, Feb 17, 2010 at 11:54:44AM -0800, Andy Grover wrote:

> > What do you intend to replace the SEND with? spin on last byte? There
> > are other issues to consider like ordering within the PCI-E fabric..
> 
> Well, hopefully nothing. What I'm looking for is to write to a target
> region multiple times, as efficiently as possible, but be able to
> occasionally read it on the target machine and get consistent results. I
> definitely don't want to take an event, and avoiding the CQE would be nice.

Ahhh, interesting, I've thought about doing something like that as
well. Sounds to me like you want to often RDMA WRITE some state
information and have the CPU read that state from time to time, ie
some kind of pointer values or whatever.

I didn't come to a satisfactory method and gave up on the idea..

IMHO, the critical problem to solve is that you cannot re-write over
the same region again and again. Guaranteeing CPU and RDMA consistency
is hard. For instance if the CPU reads two 64 bit values from your
WRITE region there is no way to guarantee anything about them, other
than all of the bytes were written at some point by the far side.

For instance, a 32 bit CPU might read a 64 bit value with two memory
transactions and there is no chance of guaranteed coherence.

Basically, it depends on your requirements for the data. If you have
an array of 32 bit values that have no inter-relationships then I
think it can work OK. Anything else becomes a lot harder.
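A contrived demonstration of that hazard, with the two 32-bit memory transactions and the interleaved far-side write simulated explicitly in plain C (on real hardware the interleaving would of course be a race, not deterministic):

```c
#include <stdint.h>

static uint64_t shared;   /* stands in for the RDMA WRITE target */

/* A 32-bit CPU reads the 64-bit value as two separate transactions. */
static uint32_t read_low(void)  { return (uint32_t)(shared & 0xffffffffu); }
static uint32_t read_high(void) { return (uint32_t)(shared >> 32); }

static int torn_read_demo(void)
{
    shared = 0x1111111122222222ULL;      /* first RDMA WRITE lands */

    uint32_t lo = read_low();            /* reader sees the old low half */
    shared = 0x3333333344444444ULL;      /* second WRITE lands mid-read */
    uint32_t hi = read_high();           /* reader sees the new high half */

    uint64_t observed = ((uint64_t)hi << 32) | lo;
    /* the observed value (0x3333333322222222) was never written */
    return observed != 0x1111111122222222ULL &&
           observed != 0x3333333344444444ULL;
}
```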

> What I'm hearing is that I don't have to worry about what the Linux
> DMA-API docs say about noncoherent mappings, but I need to be mindful of
> IB spec 9.5 section o9-20:

You cannot ignore it completely, but to support userspace there is a
way to ensure you get the right kind of mapping for this to work.

> So if I do an RDMA write and follow it up with an atomic op, it sounds
> like I can achieve the behavior I want, and without an event or CQE.
> Although for my particular use case with ongoing writes, the CPU
> couldn't fetch more than one value (64bit?) without potentially reading
> data from a later write, I would think.

You don't need the atomic at all, it doesn't do anything if you intend
to start another RDMA WRITE to the same memory soon. The problem you
face is not knowing when the last write finished but knowing when the
next write is going to start.

sizeof(atomic_t) is probably all you get, which will be 32 bits on 32
bit Linux.

For instance, a strategy that can work OK would be to have an array of
your states: the far side RDMA WRITEs into consecutive positions
and uses unsignaled immediate data to indicate the tail. The recv
side runs through the CQEs and determines the latest write region. If
you run out of slots or out of CQEs then the sender waits for more..

Or replace the immediate data with a last-byte-written poll (like MPI).

Either way, the key is that you are never writing twice without
synchronizing both sides.
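A sketch of that slot-array strategy, with the RDMA transport replaced by memcpy and all sizes and names invented; the point is only that the writer always advances to a fresh slot and never rewrites one the reader may be examining until slots are freed.

```c
#include <string.h>

#define NSLOTS   4
#define STATE_SZ 8

struct slot_ring {
    char slots[NSLOTS][STATE_SZ];
    int  tail;    /* last slot written, learned via the immediate data */
    int  next;    /* writer's next free slot */
};

/* Writer side: RDMA WRITE the state into the next slot, then publish
 * the tail (standing in for the immediate data / last-byte poll). */
static int ring_write(struct slot_ring *r, const char *state)
{
    if ((r->next + 1) % NSLOTS == r->tail)   /* out of slots: */
        return -1;                           /* sender must wait */
    memcpy(r->slots[r->next], state, STATE_SZ);
    r->tail = r->next;                       /* publish after the copy */
    r->next = (r->next + 1) % NSLOTS;
    return 0;
}

/* Reader side: the tail slot is stable, because the writer moved on
 * to a different slot and will not reuse this one until it is freed. */
static const char *ring_latest(const struct slot_ring *r)
{
    return r->slots[r->tail];
}

static int ring_demo(void)
{
    struct slot_ring r = {0};
    r.tail = NSLOTS - 1;          /* empty ring: tail trails next */
    ring_write(&r, "state001");
    ring_write(&r, "state002");
    return memcmp(ring_latest(&r), "state002", STATE_SZ) == 0;
}
```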

Jason


end of thread, other threads:[~2010-02-17 22:25 UTC | newest]

Thread overview: 9+ messages
2010-02-16 23:29 is it possible to avoid syncing after an rdma write? Andy Grover
     [not found] ` <4B7B2A6C.80101-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
2010-02-17  0:58   ` Jason Gunthorpe
     [not found]     ` <20100217005827.GF16490-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2010-02-17  1:05       ` Paul Grun
2010-02-17  1:12         ` Jason Gunthorpe
     [not found]           ` <20100217011224.GH16490-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2010-02-17  6:40             ` Paul Grun
2010-02-17 18:59               ` Jason Gunthorpe
2010-02-17 19:54       ` Andy Grover
     [not found]         ` <4B7C4984.9050004-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
2010-02-17 22:25           ` Jason Gunthorpe
2010-02-17 10:40   ` Or Gerlitz
