* RDMA and memory ordering
@ 2013-11-10 10:46 Anuj Kalia
From: Anuj Kalia @ 2013-11-10 10:46 UTC (permalink / raw)
To: linux-rdma-u79uwXL29TY76Z2rM5mHXA
Hi.
I am running a server which essentially does the following operations in a loop:
A[i].value = counter;  // it's actually something else
asm volatile ("" : : : "memory");
asm volatile ("mfence" ::: "memory");
A[i].counter = counter;
printf("%d %d\n", A[i].value, A[i].counter);
counter++;
Basically, I want a fresh value of A[i].counter to indicate a fresh A[i].value.
I have a remote client which reads the struct A[i] from the server
(via RDMA) in a loop. Sometimes, in the copy that the client reads,
A[i].counter is larger than A[i].value; that is, the client sees the
newer value of A[i].counter while A[i].value still corresponds to a
previous iteration of the server's loop.
How can this happen in the presence of memory barriers? With the
barriers in place, A[i].counter is written after A[i].value, so it
should never be larger than A[i].value.
Thanks for your help!
Anuj Kalia,
Carnegie Mellon University
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
* Re: RDMA and memory ordering
From: Gabriele Svelto @ 2013-11-12 10:16 UTC (permalink / raw)
To: Anuj Kalia, linux-rdma-u79uwXL29TY76Z2rM5mHXA

Hi Anuj,

On 10/11/2013 11:46, Anuj Kalia wrote:
> How can this happen in the presence of memory barriers? With barriers,
> A[i].counter should be updated later and therefore should always be
> smaller than A[i].value.

Memory barriers such as mfence synchronize memory operations from the
point of view of the CPUs only. In practice this means that the stores
you wrote may go out to memory in a different order than the one the
processor sees, and an external device such as a PCIe HCA may thus
observe a different ordering even in the presence of memory barriers.

To ensure that an external device sees your stores in the order you
meant, you would need some form of external barrier, though I do not
know whether that is possible at all from userspace, and it would be a
fragile solution in any case.

Instead I would suggest using the verbs atomic operations, such as
IBV_WR_ATOMIC_CMP_AND_SWP and IBV_WR_ATOMIC_FETCH_AND_ADD, to implement
what you have in mind.

Gabriele
* Re: RDMA and memory ordering
From: Anuj Kalia @ 2013-11-12 10:31 UTC (permalink / raw)
To: Gabriele Svelto; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org

Hi Gabriele,

Thanks for your reply.

That makes sense. This way we have no consistency between the CPU's
view and the HCA's view; it all depends on when the cache gets flushed
to RAM.

However, if the HCA performs reads from the L3 cache, then everything
should be consistent, right? As for ordering the writes, I think we can
assume that they are ordered up to the cache hierarchy (with no
guarantees about when they appear in RAM). Ido Shamai (@Mellanox) told
me that RDMA writes go to the L3 cache. This, plus the on-chip memory
controllers, makes me think that reads should come from the L3 cache
too.

I believe the atomic operations would be a lot more expensive than
reads/writes. I'm targeting maximum performance, so I don't want to
look that way yet.

--Anuj

On Tue, Nov 12, 2013 at 6:16 AM, Gabriele Svelto
<gabriele.svelto-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> Memory barriers such as mfence synchronize memory operations from the
> point of view of the CPUs only. In practice this means that the stores
> you wrote may go out to memory in a different order than the one the
> processor sees, and an external device such as a PCIe HCA may thus
> observe a different ordering even in the presence of memory barriers.
>
> Instead I would suggest using the verbs atomic operations, such as
> IBV_WR_ATOMIC_CMP_AND_SWP and IBV_WR_ATOMIC_FETCH_AND_ADD, to implement
> what you have in mind.
>
> Gabriele
* Re: RDMA and memory ordering
From: Jason Gunthorpe @ 2013-11-12 18:31 UTC (permalink / raw)
To: Anuj Kalia
Cc: Gabriele Svelto, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org

On Tue, Nov 12, 2013 at 06:31:04AM -0400, Anuj Kalia wrote:

> That makes sense. This way, we have no consistency between the CPU's
> view and the HCA's view - it all depends when the cache gets flushed
> to RAM.

What you are talking about is firmly in undefined territory. You might
be able to get something to work today, but tomorrow's CPUs and HCAs
might break it.

You will never reliably get the guarantee you want with the scheme you
have. Even with two CPUs it is not going to happen.

> I have a remote client which reads the struct A[i] from the server
> (via RDMA) in a loop. Sometimes in the value that the client reads,
> A[i].counter is larger than A[i].value. i.e., I see the newer value of
> A[i].counter but A[i].value corresponds to a previous iteration of the
> server's loop.

This is a fundamental misunderstanding of what FENCE does: it only
makes the writes happen in order, it doesn't constrain the reader side.

    CPU1                     CPU2
                             read a.value
    a.value = counter
    FENCE
    a.counter = counter
                             read a.counter

    result: value < counter

    CPU1                     CPU2
    a.value = counter
                             read a.value
    FENCE
    a.counter = counter
                             read a.counter

    result: value < counter

    CPU1                     CPU2
    a.value = counter
    FENCE
                             read a.value
    <SCHEDULE>
    a.counter = counter
                             read a.counter

    result: value < counter

etc.

This stuff is hard; if you want a clever scheme to be reliable you need
a really detailed understanding of what is actually being guaranteed.

> However, if the HCA performs reads from L3 cache, then everything
> should be consistent, right? While ordering the writes, I think we
> can

No. The cache makes no difference. Fundamentally you aren't atomically
writing cache lines; you are writing single values.

99% of the time it might look like atomic cache line writes, but there
is a 1% where that assumption will break.

Probably the best you can do is a collision-detect scheme:

    uint64_t counter;
    void data[];

    writer:
        counter++
        FENCE
        data = [.....]
        FENCE
        counter++

    reader:
        read counter
        if counter % 2 == 1: retry
        read data
        read counter
        if counter != last_counter: retry

But even something as simple as that probably has scary races - I only
thought about it for a few moments. :)

Jason
* Re: RDMA and memory ordering
From: Anuj Kalia @ 2013-11-12 20:59 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Gabriele Svelto, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org

[Included missed conversation with Jason at end.]

Jason,

Thanks again. So we conclude there is nothing like an atomic cacheline
read. Then my current design is a dud. But there should be 8-byte
atomicity, right? I think I can leverage that to get what I want.

This part is interesting (from Jason's reply):

"If you burst read from the HCA value and counter then the result is
undefined, you don't know if counter was read before value, or the
other way around."

Is there a way of knowing the order in which they are read? For
example, I heard in a talk that there is a left-to-right ordering when
an HCA reads a contiguous buffer. This could be totally architecture
specific, so I just want the answer for Mellanox ConnectX-3 cards. I
think I can check this experimentally, but a definitive answer would be
great.

--Anuj

[Conversation with Jason follows]

Jason,

Thanks a lot for your reply.

I think I understand that the RDMA reader will not see the ordering in
the updates to A[i].value and A[i].counter if they are in different L3
cache lines. But what are the guarantees when they are in the same
cache line? For example, 32-bit processors have atomic 32-bit loads and
stores, i.e. memory operations to the same 32-bit (aligned) word are
linearizable.

On 12 Nov 2013 13:31, "Jason Gunthorpe"
<jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> wrote:
> This is a fundamental misunderstanding of what FENCE does: it only
> makes the writes happen in order, it doesn't constrain the reader side
>
>     CPU1                     CPU2
>                              read a.value
>     a.value = counter
>     FENCE
>     a.counter = counter
>                              read a.counter
>
>     result: value < counter

That's right - thanks for the detailed explanation! However, I'm
assuming that the HCA performs atomic cacheline reads (I don't have a
lot of basis for this assumption and it would be great if someone could
tell me more about it). If that is true, 'read a.value' and 'read
a.counter' are not 2 separate operations. Instead, there is one 'read
cacheline(a)' - that should provide a snapshot of a's state at CPU1.

> No. The cache makes no difference. Fundamentally you aren't atomically
> writing cache lines; you are writing single values.

I was not assuming atomic writes to the entire cacheline - I was only
assuming that the ordering imposed by mfence is preserved in cache: the
write to 'a.value' appears in the cache hierarchy before the write to
'a.counter'.

So I guess my primary question is this now: does the HCA perform atomic
cacheline reads (wrt other CPU operations to the same cacheline)?

On Tue, Nov 12, 2013 at 03:18:35PM -0400, Anuj Kalia wrote:
> I think I understand that the RDMA reader will not see the ordering

This isn't just RDMA; CPU-to-CPU coherency is the same.

To be honest, your test doesn't really show anything: the reads and
writes can be interleaved in any way, and value >, ==, or < counter are
all valid outcomes. What the fence gives you is this: read counter,
then value; FENCE ensures that value >= counter.

If you burst read value and counter from the HCA then the result is
undefined: you don't know if counter was read before value, or the
other way around.

> in the updates to A[i].value and A[i].counter if they are in
> different L3 cache lines. But what are the guarantees when they are
> in the same cache line?

Cache lines make no difference. They are not really modeled as part of
the coherency API the processor presents. Two nearby writes in the
instruction stream might be merged into an atomic cache line update, or
they might not. You have no control over this.

> That's right - thanks for the detailed explanation! However, I'm
> assuming that the HCA performs atomic cacheline reads (I don't have
> a lot of basis for this assumption and it would be great if someone
> could tell me more about it).

That is an implementation detail; there is no architectural guarantee.
I don't think any current implementation provides atomic cacheline
reads.

> I was not assuming atomic writes to the entire cacheline - I was
> only assuming that the ordering imposed by mfence is preserved in
> cache - the write to 'a.value' appears in the cache hierarchy before
> the write to 'a.counter'.

mfence preserves the ordering, but there is no such thing as an atomic
cache line read or write. So the only way to see the ordering created
by mfence is with two non-burst reads, strongly ordered in time.

(Note: transactional memory extensions create something that looks an
awful lot like an atomic cache line write. However, that stuff is still
really new, so there is not a lot of info on how it co-exists with
DMA, etc.)
* Re: RDMA and memory ordering
From: Jason Gunthorpe @ 2013-11-12 21:11 UTC (permalink / raw)
To: Anuj Kalia
Cc: Gabriele Svelto, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org

On Tue, Nov 12, 2013 at 04:59:19PM -0400, Anuj Kalia wrote:

> Thanks again. So we conclude there is nothing like an atomic cacheline
> read. Then my current design is a dud. But there should be 8-byte
> atomicity, right? I think I can leverage that to get what I want.

64-bit CPUs do have 64-bit atomic stores, so you can rely on DMAs
seeing only values you've written, and not some combination of old and
new bits.

> This part is interesting (from Jason's reply):
> "If you burst read from the HCA value and counter then the result is
> undefined, you don't know if counter was read before value, or the
> other way around."
> Is there a way of knowing the order in which they are read - for
> example, I heard in a talk that there is a left-to-right ordering
> when

So, this I don't know. I don't think anyone has ever had a need to look
into it, and it is certainly not defined. What you are asking is how
memory write ordering interacts with a burst read.

> a HCA reads a contiguous buffer. This could be totally architecture
> specific, for example, I just want the answer for Mellanox ConnectX-3
> cards. I think I can check this experimentally, but a definitive
> answer would be great.

The talk you heard about left-to-right ordering was probably in the
context of DMA burst writes and MPI polling.

In this case the DMA would write DDDDDP, and the MPI side would poll on
P. Once P is written, it assumes that D is visible.

This is undefined in general, but ensured in some cases on Intel and
Mellanox. I'm not sure if D and P have to be in the same cache line,
but you probably need a fence after reading P.

Jason
* Re: RDMA and memory ordering
From: Anuj Kalia @ 2013-11-13 6:55 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Gabriele Svelto, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org

On Tue, Nov 12, 2013 at 5:11 PM, Jason Gunthorpe
<jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> wrote:
> 64-bit CPUs do have 64-bit atomic stores, so you can rely on DMAs
> seeing only values you've written, and not some combination of old and
> new bits.

That's a relief :).

> So, this I don't know. I don't think anyone has ever had a need to look
> into it, and it is certainly not defined. What you are asking is how
> memory write ordering interacts with a burst read.

OK. I'll do some experiments to figure out the order in which cacheline
words are read by the HCA. I'll post my findings if they're
interesting.

> The talk you heard about left-to-right ordering was probably in the
> context of DMA burst writes and MPI polling.
>
> In this case the DMA would write DDDDDP, and the MPI side would poll on
> P. Once P is written, it assumes that D is visible.

The talk wasn't about MPI, but you're right: it was about RDMA writes
and CPU polls. Thanks for making that clear.

I don't know what you meant by burst writes: do you mean several RDMA
writes or one large write? I'm concerned with the order in which data
is written out in one large RDMA write (and with RDMA reads too). For
example, if I read/write 64 bytes addressed from "buf" to "buf+64",
does [buf, buf+7] get read/written first, or does [buf+56, buf+63]?

I guess now is the time to run lots of micro experiments. Thanks a lot
for the help, everyone.

> This is undefined in general, but ensured in some cases on Intel and
> Mellanox. I'm not sure if D and P have to be in the same cache line,
> but you probably need a fence after reading P.
>
> Jason
* Re: RDMA and memory ordering
From: Jason Gunthorpe @ 2013-11-13 18:09 UTC (permalink / raw)
To: Anuj Kalia
Cc: Gabriele Svelto, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org

On Wed, Nov 13, 2013 at 02:55:53AM -0400, Anuj Kalia wrote:

> I don't know what you meant by burst writes: do you mean several RDMA
> writes or one large write? I'm concerned with the order in which data

An RDMA write will be split up by the HCA into a burst of PCI MemoryWr
operations.

> I guess now is the time to run lots of micro experiments. Thanks a lot
> for the help, everyone.

Careful: experiments can't prove that ordering is guaranteed to be
present; they can only show when it certainly isn't.

Intel hardware is very good at hiding ordering issues 99% of the time,
but in many cases there can be a stressed condition that will show a
different result.

Jason
* Re: RDMA and memory ordering
From: Anuj Kalia @ 2013-11-14 5:12 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Gabriele Svelto, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org

Jason,

Thanks again :). I found another similar thread:
http://www.spinics.net/lists/linux-rdma/msg02709.html. The conclusion
there was that although the InfiniBand specs don't specify any ordering
of writes, many people assume left-to-right ordering anyway. There is
no mention of reads, though.

So I ran the micro experiments, and I found that although writes follow
the left-to-right ordering, reads do not. More details follow:

1. Write ordering experiment:

1.a. In the nth iteration, a client writes a buffer containing C ~ 1024
integers (each equal to 'n') to the server. The client sleeps for 2000
us between iterations.

1.b. The server busily polls for a change to the Cth integer. When the
Cth integer changes from i to i+1, it checks whether the entire buffer
is equal to i+1. The check always passes (I've tried over 15 million
checks). The test would fail if the polled integer were not the
rightmost one.

2. Read ordering experiment:

2.a. In the nth iteration, the server writes 'n' to C ~ 1024 integers
in a local buffer. The server does the writes in reverse order
(starting from index C-1). It then sleeps for 2000 us.

2.b. The client continuously reads the buffer. When the Cth integer in
the read sink changes from i to i+1, it checks whether all the integers
in the buffer are i+1. This check fails (although rarely), which shows
that reads are NOT ordered left to right. The read pattern I'd expect
is HHHH...HHHH (where H corresponds to i+1). However, I can see
patterns like HH..LLLLL...HH (where L corresponds to i). This is wrong
because no i's should be lingering around after the first integer has
become i+1 (under the false assumption that reads happen
left-to-right). Curiously, whenever there are stale i's, they always
form a contiguous chunk that would fit inside a cacheline; I usually
see runs of 16 or 48 i's.

2.c. The check always succeeds if C is 16 (the buffer fits inside a
cacheline). I've done 15 million checks, and will do many more tonight.

So, another question: why are the reads unordered while the writes are
ordered? I think by now we can assume write ordering (my experiments +
MVAPICH uses it). Can PCI reorder the reads issued by the HCA?

On Wed, Nov 13, 2013 at 2:09 PM, Jason Gunthorpe
<jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> wrote:
> Careful: experiments can't prove that ordering is guaranteed to be
> present; they can only show when it certainly isn't.

Aah, unfortunately that's true. However, I ran experiments anyway. If
people have been assuming an ordering on writes, I guess I can check
whether reads are ordered too.

> Intel hardware is very good at hiding ordering issues 99% of the time,
> but in many cases there can be a stressed condition that will show a
> different result.

Hmm.. I'm willing to run billions of iterations of the test. That
should give some confidence.

> Jason
* Re: RDMA and memory ordering [not found] ` <CADPSxAiepGuzWYXjyDxnSzER5MqL57fZ9mh83SLwV461PwZO3Q-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2013-11-14 19:05 ` Jason Gunthorpe [not found] ` <20131114190514.GB21549-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> 0 siblings, 1 reply; 16+ messages in thread From: Jason Gunthorpe @ 2013-11-14 19:05 UTC (permalink / raw) To: Anuj Kalia Cc: Gabriele Svelto, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org On Thu, Nov 14, 2013 at 01:12:55AM -0400, Anuj Kalia wrote: > So, another question: why are the reads unordered while the writes are > ordered? I think by now we can assume write ordering (my experiments + > MVAPICH uses it). Can the PCI reorder the reads issued by the HCA? Without fencing there is no gurantee in what order things are made visible, and the CPU will flush its write buffers however it likes. The PCI subsystem can also re-order reads however it likes, that is part of the PCI spec. In a 2 socket system don't be surprised if cache lines on different sockets complete out of order. Think of this as a classic multi-threaded race condition, and not related to PCI. If you do the same test using 2 threads you probably get the same results. > > Intel hardware is very good at hiding ordering issues 99% of the time, > > but in many cases there can be a stress'd condition that will show a > > different result. > Hmm.. I'm willing to run billions of iterations of the test. That > should give some confidence. Not really, repeating the same test billions of times is not comprehensive. You need to stress the system in all sorts of different ways to see different behavior. For instance, in a 2 socket system there are likely all sorts of crazy sensitivities that depend on which socket the memory lives, which socket holds the newest cacheline, which socket has an old line, which socket is connected directly to the HCA, etc. 
Jason
[parent not found: <20131114190514.GB21549-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>]
* Re: RDMA and memory ordering
  [not found] ` <20131114190514.GB21549-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2013-11-14 19:33 ` Anuj Kalia
  [not found] ` <CADPSxAg0k5SuxCX=3CMNV8-xME55p3iL4BMqnq0ji---kN6ZEg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 16+ messages in thread
From: Anuj Kalia @ 2013-11-14 19:33 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Gabriele Svelto, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org

Jason,

I just got an email saying that Mellanox does in fact order the reads
and writes. So I think we can blame the CPU or the PCI subsystem for
the unordered reads.

On Thu, Nov 14, 2013 at 3:05 PM, Jason Gunthorpe
<jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> wrote:
> On Thu, Nov 14, 2013 at 01:12:55AM -0400, Anuj Kalia wrote:
>
>> So, another question: why are the reads unordered while the writes are
>> ordered? I think by now we can assume write ordering (my experiments +
>> MVAPICH uses it). Can the PCI reorder the reads issued by the HCA?
>
> Without fencing there is no guarantee in what order things are made
> visible, and the CPU will flush its write buffers however it likes.

I'm using fencing in the read experiment. The code at the server looks
like this:

    while(1) {
        for(i = 0; i < EXTENT_CAPACITY; i++) {
            ptr[EXTENT_CAPACITY - i - 1] = iter;
            asm volatile ("" : : : "memory");
            asm volatile("mfence" ::: "memory");
        }
        iter++;
        usleep(2000 + (rand() % 200));
    }

> The PCI subsystem can also reorder reads however it likes; that is
> part of the PCI spec. In a 2-socket system don't be surprised if cache
> lines on different sockets complete out of order.
>
> Think of this as a classic multi-threaded race condition, and not
> related to PCI. If you do the same test using 2 threads you probably
> get the same results.

The PCI explanation sounds good. However, with a fence after every
update, I don't think multiple sockets will be a problem.

>>> Intel hardware is very good at hiding ordering issues 99% of the time,
>>> but in many cases there can be a stressed condition that will show a
>>> different result.
>
>> Hmm.. I'm willing to run billions of iterations of the test. That
>> should give some confidence.
>
> Not really, repeating the same test billions of times is not
> comprehensive. You need to stress the system in all sorts of
> different ways to see different behavior.

Hmm.. It's not really the same test. My server sleeps for a randomly
chosen large duration between updates. If the test passes for many
iterations, we can assume that we've tested a lot of interleavings.
But yes, that doesn't give 100% confidence.

> For instance, in a 2-socket system there are likely all sorts of crazy
> sensitivities that depend on which socket the memory lives, which
> socket holds the newest cacheline, which socket has an old line, which
> socket is connected directly to the HCA, etc.

Again, does that matter with fences? With a fence after every update,
there is a real-time ordering for when the updates appear in the cache
hierarchy, regardless of the socket.

> Jason

Regards,
Anuj
[parent not found: <CADPSxAg0k5SuxCX=3CMNV8-xME55p3iL4BMqnq0ji---kN6ZEg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: RDMA and memory ordering
  [not found] ` <CADPSxAg0k5SuxCX=3CMNV8-xME55p3iL4BMqnq0ji---kN6ZEg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2013-11-14 19:47 ` Anuj Kalia
  0 siblings, 0 replies; 16+ messages in thread
From: Anuj Kalia @ 2013-11-14 19:47 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Gabriele Svelto, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org

I should do the experiment with 2 processes, however.

On Thu, Nov 14, 2013 at 3:33 PM, Anuj Kalia
<anujkaliaiitd-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> Jason,
>
> I just got an email saying that Mellanox does in fact order the reads
> and writes. So I think we can blame the CPU or the PCI subsystem for
> the unordered reads.
> [...]
* Re: RDMA and memory ordering
  2013-11-12 10:31 ` Anuj Kalia
  2013-11-12 18:31 ` Jason Gunthorpe
@ 2013-11-13 18:23 ` Gabriele Svelto
  [not found] ` <5283C3B2.6010106-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  1 sibling, 1 reply; 16+ messages in thread
From: Gabriele Svelto @ 2013-11-13 18:23 UTC (permalink / raw)
  To: Anuj Kalia; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org

On 12/11/2013 11:31, Anuj Kalia wrote:
> I believe the atomic operations would be a lot more expensive than
> reads/writes. I'm targeting maximum performance so I don't want to
> look that way yet.

This sounds like premature optimization to me which, as you know, is
the root of all evil :)

Try using the atomic primitives: they have been designed specifically
for this kind of scenario. Then measure their performance in the real
world before spending time on optimizing something that might be fast
enough for your purposes (and far more robust). If you're already
polling your CQs, those operations will be *very* fast.

Gabriele
[parent not found: <5283C3B2.6010106-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>]
* Re: RDMA and memory ordering
  [not found] ` <5283C3B2.6010106-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2013-11-14  2:11 ` Anuj Kalia
  0 siblings, 0 replies; 16+ messages in thread
From: Anuj Kalia @ 2013-11-14 2:11 UTC (permalink / raw)
  To: Gabriele Svelto; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org

On Wed, Nov 13, 2013 at 2:23 PM, Gabriele Svelto
<gabriele.svelto-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> On 12/11/2013 11:31, Anuj Kalia wrote:
>> I believe the atomic operations would be a lot more expensive than
>> reads/writes. I'm targeting maximum performance so I don't want to
>> look that way yet.
>
> This sounds like premature optimization to me which, as you know, is
> the root of all evil :)
>
> Try using the atomic primitives: they have been designed specifically
> for this kind of scenario. Then measure their performance in the real
> world before spending time on optimizing something that might be fast
> enough for your purposes (and far more robust). If you're already
> polling your CQs, those operations will be *very* fast.

I'm working on a project where I'm trying to extract the maximum IOPS
from a server for an application. If atomic operations are even 2x
slower than RDMA writes (which I'd expect, because they involve both a
read and a write), I can't use them. However, it would be interesting
to measure their performance. I'll try that. Thanks!

> Gabriele
* RE: RDMA and memory ordering
@ 2013-11-11 23:13 Hefty, Sean
[not found] ` <1828884A29C6694DAF28B7E6B8A8237388CF721E-P5GAC/sN6hkd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org>
0 siblings, 1 reply; 16+ messages in thread
From: Hefty, Sean @ 2013-11-11 23:13 UTC (permalink / raw)
To: Anuj Kalia, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> I am running a server which essentially does the following operations in a
> loop:
>
> A[i].value = counter; //It's actually something else
> asm volatile ("" : : : "memory");
> asm volatile("mfence" ::: "memory");
> A[i].counter = counter;
> printf("%d %d\n", A[i].value, A[i].counter);
> counter ++;
>
> Basically, I want a fresh value of A[i].counter to indicate a fresh
> A[i].value.
>
> I have a remote client which reads the struct A[i] from the server
> (via RDMA) in a loop. Sometimes in the value that the client reads,
> A[i].counter is larger than A[i].value. i.e., I see the newer value of
> A[i].counter but A[i].value corresponds to a previous iteration of the
> server's loop.
>
> How can this happen in the presence of memory barriers? With barriers,
> A[i].counter should be updated later and therefore should always be
> smaller than A[i].value.
>
> Thanks for your help!
It seems possible for a remote read to start retrieving memory before an update, such that A[i].value is read and placed on the wire, the server modifies the memory, and then A[i].counter is read and placed on the wire. It may depend on how large the data is that's being read and the RDMA read implementation.
- Sean
[parent not found: <1828884A29C6694DAF28B7E6B8A8237388CF721E-P5GAC/sN6hkd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org>]
* Re: RDMA and memory ordering
  [not found] ` <1828884A29C6694DAF28B7E6B8A8237388CF721E-P5GAC/sN6hkd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org>
@ 2013-11-12  7:28 ` Anuj Kalia
  0 siblings, 0 replies; 16+ messages in thread
From: Anuj Kalia @ 2013-11-12 7:28 UTC (permalink / raw)
  To: Hefty, Sean; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org

Sean,

Thanks for your reply. Sorry for the duplicate email!

Your argument is correct if the structs are large. In my case, the
array A[] contains 32-byte structs that don't span multiple L3
cachelines (ensured via memalign). I've heard that RDMA reads happen
at L3 cacheline granularity - in that case, A[i] will be read once and
placed on the wire. Then we could see partial writes to A[i].value or
A[i].counter, but we'll never see a completed update to A[i].counter
before the corresponding update to A[i].value.

I had some parallel questions that could help me understand the issue
better:

1. Am I right in assuming that RDMA reads happen from the remote
host's L3 cache? My processors are from the AMD Opteron 6200 series.
The argument I heard in favor of this is that 'modern' processors have
on-chip memory controllers, so DMA reads always come from the L3
cache.

2. Are reads from the L3 cache always consistent with the L1 and L2
caches? i.e., can some update be cached inside the L1 cache so that an
L3 read sees an old value? I think this doesn't happen, or I would be
seeing lots of stale reads.

3. When we do an RDMA write, is there an order in which the bytes get
written? For example, I heard during a talk that there is a
left-to-right ordering, i.e., the lower-addressed bytes get written
before higher-addressed bytes. Is this correct?

In general, can I read more about the hardware aspects of RDMA
somewhere?

--Anuj

On Mon, Nov 11, 2013 at 7:13 PM, Hefty, Sean
<sean.hefty-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> wrote:
> [...]
end of thread, other threads: [~2013-11-14 19:47 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-11-10 10:46 RDMA and memory ordering Anuj Kalia
[not found] ` <CADPSxAhAGYZude8CM65-UDvfiPscStgcNsAfs=2XBbntg-wL0w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2013-11-12 10:16 ` Gabriele Svelto
[not found] ` <5281FFF9.5070705-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2013-11-12 10:31 ` Anuj Kalia
2013-11-12 18:31 ` Jason Gunthorpe
[not found] ` <20131112183142.GB6639-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2013-11-12 20:59 ` Anuj Kalia
[not found] ` <CADPSxAgF1CAiYoYbxbCON4NCD-tH8cAsJFRtECkTGJJQC4MXCg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2013-11-12 21:11 ` Jason Gunthorpe
[not found] ` <20131112211123.GA29132-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2013-11-13 6:55 ` Anuj Kalia
[not found] ` <CADPSxAhzmaut9s9L1fv5urhzX8xKU9GbL6z1TkOX3FuM4NUsww-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2013-11-13 18:09 ` Jason Gunthorpe
[not found] ` <20131113180915.GA6597-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2013-11-14 5:12 ` Anuj Kalia
[not found] ` <CADPSxAiepGuzWYXjyDxnSzER5MqL57fZ9mh83SLwV461PwZO3Q-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2013-11-14 19:05 ` Jason Gunthorpe
[not found] ` <20131114190514.GB21549-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2013-11-14 19:33 ` Anuj Kalia
[not found] ` <CADPSxAg0k5SuxCX=3CMNV8-xME55p3iL4BMqnq0ji---kN6ZEg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2013-11-14 19:47 ` Anuj Kalia
2013-11-13 18:23 ` Gabriele Svelto
[not found] ` <5283C3B2.6010106-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2013-11-14 2:11 ` Anuj Kalia
-- strict thread matches above, loose matches on Subject: below --
2013-11-11 23:13 Hefty, Sean
[not found] ` <1828884A29C6694DAF28B7E6B8A8237388CF721E-P5GAC/sN6hkd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org>
2013-11-12 7:28 ` Anuj Kalia
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox