* PCIe 2.0 motherboard + ConnectX-3 cards
@ 2013-11-21 22:26 Anuj Kalia
From: Anuj Kalia @ 2013-11-21 22:26 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org

I have machines with Mellanox ConnectX-3 cards installed in PCIe 2.0
motherboards. I have some questions about the performance of this
setup:

1. When multiple clients issue small (32-byte) RDMA writes to the
server, the combined throughput is about 22 million operations per
second. With ConnectX-3 I should be able to reach 35 million (the
figure quoted by Mellanox).

Is 22 million DMAs per second a PCIe 2.0 bottleneck?

2. When one client machine issues RDMA writes to multiple server
machines, it can issue at most 10-11 million writes per second. Is
this a PCIe bottleneck again?

I believe it's a PCIe issue because an RDMA operation should involve 2
(or more) PCIe operations on the active side:
a. The work request is written to the HCA (or maybe the HCA reads it).
b. The HCA reads the payload from host memory.

Does this reasoning sound correct?

3. Is there a way to reduce the number of PCIe operations at the
active side? I don't think that posting a linked list of WQEs will
help because the HCA should read them one by one.

--Anuj
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
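
A rough link-bandwidth budget can help frame question 1. The sketch
below is a back-of-envelope estimate, not a measurement. It assumes an
x8 slot and about 24 bytes of per-packet (TLP) overhead for framing,
sequence number, header and LCRC; both values are assumptions, and the
function names are made up for illustration. PCIe 2.0 signals at 5 GT/s
per lane with 8b/10b encoding.

```python
# Back-of-envelope PCIe 2.0 link budget.
# LANES and TLP_OVERHEAD_BYTES are assumptions, not values from the thread.

GT_PER_S_PER_LANE = 5e9        # PCIe 2.0 signalling rate per lane
ENCODING_EFFICIENCY = 8 / 10   # 8b/10b line encoding (PCIe 1.x and 2.0)
LANES = 8                      # assumed x8 slot
TLP_OVERHEAD_BYTES = 24        # assumed framing + sequence + header + LCRC


def link_bytes_per_second(lanes=LANES):
    """Usable bytes per second in one direction, before packet overhead."""
    return GT_PER_S_PER_LANE * ENCODING_EFFICIENCY * lanes / 8


def max_tlps_per_second(payload_bytes, lanes=LANES, overhead=TLP_OVERHEAD_BYTES):
    """Upper bound on TLPs per second in one direction for a payload size."""
    return link_bytes_per_second(lanes) / (payload_bytes + overhead)


if __name__ == "__main__":
    print(f"link: {link_bytes_per_second() / 1e9:.1f} GB/s per direction")
    print(f"32-byte TLPs: {max_tlps_per_second(32) / 1e6:.1f} M/s upper bound")
```

Under these assumptions the link carries about 4 GB/s per direction,
which bounds 32-byte payload packets at roughly 71 million per second.
This only bounds raw bandwidth in one direction. It says nothing about
per-transaction limits in the root complex or the HCA, or about the
extra traffic from work requests and completions, so it cannot by
itself confirm or rule out PCIe as the cause of the 22 million figure.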


* Re: PCIe 2.0 motherboard + ConnectX-3 cards
@ 2013-11-22  7:59   ` Anuj Kalia
From: Anuj Kalia @ 2013-11-22  7:59 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org

I had a related question regarding PCIe usage.

How exactly does the userspace driver interact with the HCA? I'm
reading the code for libmlx4 but I can't find any code for interaction
with PCIe. There are some references to 'ringing a doorbell via PCI
MMIO' - can someone please tell me how that works?

In general, it would be great if someone could explain the CPU-HCA
communication steps involved in doing an RDMA operation. If there is
an online resource where I can read about this, I'd appreciate a
pointer.

Thanks for your time!

--Anuj




* Re: PCIe 2.0 motherboard + ConnectX-3 cards
@ 2013-11-23  4:13       ` Anuj Kalia
From: Anuj Kalia @ 2013-11-23  4:13 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org

Update: I found ways to improve active side performance from 10
million RDMA writes per second to 20 million (which I believe is the
PCIe bottleneck):

1. Use inline payload - I think this reduces PCIe traffic.
2. Use non-signalled RDMA writes + don't poll for completion for every
write - I don't know if ibv_poll_cq() uses the PCIe much.

I'd appreciate any other ideas for reducing PCIe traffic, or
confirmation that my explanation of the bottleneck and of these
improvements is correct.

--Anuj
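
The two changes above amount to choosing per-request send flags. Here
is a minimal sketch of that choice as a plain Python model; it does not
call libibverbs. SEND_SIGNALED and SEND_INLINE are local stand-ins for
the IBV_SEND_SIGNALED and IBV_SEND_INLINE flags, and send_flags,
plan_batch and completions_expected are hypothetical helper names.

```python
# Model of a posting loop that inlines payloads and signals only every
# k-th write. No hardware or libibverbs involved.

SEND_SIGNALED = 1 << 1  # stand-in for IBV_SEND_SIGNALED
SEND_INLINE = 1 << 3    # stand-in for IBV_SEND_INLINE


def send_flags(index, signal_every, inline=True):
    """Flags for the write at position `index` (0-based) in the stream."""
    flags = SEND_INLINE if inline else 0
    # Request a completion only on every `signal_every`-th write.
    if (index + 1) % signal_every == 0:
        flags |= SEND_SIGNALED
    return flags


def plan_batch(num_writes, signal_every, inline=True):
    """Flags for a batch of consecutive writes."""
    return [send_flags(i, signal_every, inline) for i in range(num_writes)]


def completions_expected(plan):
    """Number of completions the sender should poll for after this batch."""
    return sum(1 for flags in plan if flags & SEND_SIGNALED)


if __name__ == "__main__":
    plan = plan_batch(num_writes=8, signal_every=4)
    print(plan, completions_expected(plan))
```

Some writes still have to be signaled: as far as I know, send queue
slots used by unsignaled requests are only reclaimed once a later
signaled request completes, so signal_every should stay below the send
queue depth.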



* Re: PCIe 2.0 motherboard + ConnectX-3 cards
@ 2013-11-24 17:57           ` Steve Wise
From: Steve Wise @ 2013-11-24 17:57 UTC (permalink / raw)
  To: Anuj Kalia, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org

On 11/22/2013 10:13 PM, Anuj Kalia wrote:
> Update: I found ways to improve active side performance from 10
> million RDMA writes per second to 20 million (which I believe is the
> PCIe bottleneck):
>
> 1. Use inline payload - I think this reduces PCIe traffic.

Yes, without inline, each IO requires 2 PCIe transactions: one to fetch
(or push) the work request, and one to fetch the payload/data. If you
use inline, the data is included in the work request, so you cut the
required transactions in half.

> 2. Use non-signalled RDMA writes + don't poll for completion for every
> write - I don't know if ibv_poll_cq() uses the PCIe much.

Each signaled work request generates a completion entry (CQE) which is
pushed from the adapter into the CQ in host memory. So reducing the
number of signaled work requests also reduces the number of PCIe
transactions.

Steve.
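
The accounting in this reply can be written down as a small model. It
is a sketch that counts only the transactions named above (work
request, payload, CQE). Doorbell writes and anything else the hardware
does are not modelled, and pcie_transactions_per_write is a
hypothetical name.

```python
from fractions import Fraction


def pcie_transactions_per_write(inline, signal_every):
    """Average PCIe transactions per RDMA write on the active side.

    Counts one transaction for the work request, one more for the payload
    unless it is inlined, and one CQE write per `signal_every` requests.
    """
    if signal_every < 1:
        raise ValueError("signal_every must be at least 1")
    work_request = 1
    payload = 0 if inline else 1
    cqe = Fraction(1, signal_every)
    return work_request + payload + cqe


if __name__ == "__main__":
    for inline, every in [(False, 1), (True, 1), (True, 64)]:
        count = pcie_transactions_per_write(inline, every)
        print(f"inline={inline} signal_every={every}: {float(count):.3f}")
```

In this model a signaled, non-inline write costs 3 transactions, while
an inline write signaled once every 64 requests costs just over 1. The
ratio is illustrative only and is not meant to predict the measured
rates.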

* Re: PCIe 2.0 motherboard + ConnectX-3 cards
@ 2013-11-26  2:08               ` Anuj Kalia
From: Anuj Kalia @ 2013-11-26  2:08 UTC (permalink / raw)
  To: Steve Wise; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org

Steve,

Thanks for the confirmation.

--Anuj
