From: Matan Barak <matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
To: Jason Gunthorpe
<jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>,
Christoph Lameter <cl-vYTEC60ixJUAvxtiuMwx3w@public.gmane.org>
Cc: Yishai Hadas
<yishaih-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>,
Yishai Hadas <yishaih-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>,
dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org,
linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
majd-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org,
talal-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org,
ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org
Subject: Re: [PATCH V1 libibverbs 1/8] Add ibv_poll_cq_ex verb
Date: Mon, 21 Mar 2016 17:24:23 +0200 [thread overview]
Message-ID: <56F01227.9050900@mellanox.com> (raw)
In-Reply-To: <20160302195138.GA8427-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
On 02/03/2016 21:51, Jason Gunthorpe wrote:
> On Wed, Mar 02, 2016 at 01:08:30PM -0600, Christoph Lameter wrote:
>> On Wed, 2 Mar 2016, Jason Gunthorpe wrote:
>>
>>> So all this ugly API to minimize cache line usage has no measured
>>> performance gain?
>>
>> We have seen an increased cacheline footprint adding ~100-200ns to receive
>> latencies during loops. This does not show up in synthetic loads that do
>> not do much processing since their cache footprint is minimal.
>
> I know you've seen effects..
>
> But apparently the patch authors haven't.
>
>>>> Does the opaque pointer guarantees an aligned access? Who allocates the
>>>> space for the vendor's CQE? Any concrete example?
>>>> One of the problems here are that CQEs could be formatted as -
>>>> "if QP type is y, then copy the field z from o". Doing that this way may
>>>> result doing the same "if" multiple times. The current approach could still
>>>> avoid memcpy and write straight to the user's buffer.
>>>
>>> No, none of that...
>>>
>>> struct ibv_cq
>>> {
>>> int (*read_next_cq)(ibv_cq *cq,struct common_wr *res);
>>> int (*read_address)(ibv_cq *cq,struct wr_address *res);
>>> uint64_t (*read_timestamp(ibv_cq *cq);
>>> uint32_t (*read_immediate_data(ibv_cq *cq);
>>> void (*read_something_else)(ibv_cq *cq,struct something_else *res);
>>> };
>>
>> Argh. You are requiring multiple indirect function calls
>> top retrieve the same imformation and therefore significantly increase
>> latency.
>
> It depends. These are unconditional branches and they elide alot of
> other code and conditional branches by their design.
>
Our performance team and I measured the new ibv_poll_cq_ex vs the old
poll_cq implementation. We measured the number of cycles around the
poll_cq call. We saw ~15 cycles increase for every CQE, which is about
2% over the regular poll_cq (which is not extensible). We've used
ib_write_lat in order to do these measurements.
Pinpointing the source of this increase, we found that the function
pointer of our internal poll_one routine is the source of this. Our
poll_cq_ex implementation uses a per-CQ poll_one_ex function, which is
*almost* tailored for the user required fields. Using a static poll_one
will cause a lot of conditional branches and will decrease performance.
Meaning, using a function pointer vs using a static poll_one function
(that the compiler is free to inline) causes this effect, but using a
monolithic poll_one function will incur substantial computation overhead.
Using a per field getter introduces such a call (unconditional jump) for
every field. If we were using static linking (+ whole program linkage),
I agree this route is better. However, based on the results we saw
earlier, we are worried this might incur tremendous overhead.
In addition, the user has to call read_next_cq for every completion
entry (which is another unconditional jump).
> The mlx4 driver does something like this on every CQE to parse the
> immediate data:
>
> + if (is_send) {
> + switch (cqe->owner_sr_opcode & MLX4_CQE_OPCODE_MASK) {
> + case MLX4_RECV_OPCODE_RDMA_WRITE_IMM:
> + if (wc_flags & IBV_WC_EX_WITH_IMM) {
> + *wc_buffer.b32++ = cqe->immed_rss_invalid;
>
> With this additional overhead for all the parse paths that have no
> immediate:
>
> + if (wc_flags & IBV_WC_EX_WITH_IMM)
> + wc_buffer.b32++; // No imm to set
>
> Whereas my suggestion would looke more like:
>
> mlx4_read_immediate_data(cq) {return cq->cur_cqe->immed_rss_invalid;};
>
> [and even that can be micro-optimized further]
>
> And the setting of WITH_IMM during the common parse is much less branchy:
>
> opcode = cqe->owner_sr_opcode & MLX4_CQE_OPCODE_MASK;
> wc->flags |= WITH_IMM_BIT << ((unsigned int)(opcode == MLX4_RECV_OPCODE_RDMA_WRITE_IMM ||
> opcode == MLX4_RECV_OPCODE_SEND_IMM ...))
>
> I bet that is a lot less work going on, even including the indirect
> jump overhead.
>
So we would have read the owner_sr_opcode several times (for almost
every related wc field) and parse it accordingly or we'll have to store
the pre-processed owner_sr_opcode in the CQ itself and incur the
overhead of "copying" it. In addition, we'll need another API (end_cq)
in order to indicate return-cq-element-to-hardware-ownership (and
release vendor's per poll_cq locks if exist). So, as far as we
understand, you propose something like:
struct ibv_cq
{
/* more argument is new */
int (*read_next_cq)(ibv_cq *cq,struct common_wr *res, bool more);
/* function is new */
int (*end_cq)(ibv_cq *cq);
int (*read_address)(ibv_cq *cq,struct wr_address *res);
uint64_t (*read_timestamp(ibv_cq *cq);
uint32_t (*read_immediate_data(ibv_cq *cq);
void (*read_something_else)(ibv_cq *cq,struct something_else *res);
};
Again, every call here introduces an unconditional jump and makes the CQ
structure larger. So, the CQ struct itself might not fit in the cache line.
In addition, ibv_cq isn't extensible as it's today. Addign a new
ibv_cq_ex will change all APIs that use it.
>> This is going to cause lots of problems for procesing at high
>> speed where we have to use the processor caches as carefully as possible
>> to squeeze out all we can get.
>
> The driver should be built so the branch targets are all in close
> cache lines, with function prologues deliberately elided so the branch
> targets are small. There may even be further techniques to micro
> optimize the jump, if one wanted to spend the effort.
>
> Further, since we are only running the code the caller actually needs
> we are not wasting icache lines trundling through the driver CQE parse
> that has to handle all cases, even ones the caller doesn't care about.
>
The current optimization we implement in our user-space drivers is to
introduce multiple poll_one_ex functions, where every function assigns
only some of the WC fields. Each function is tailored for a different
use case and common scenario. By doing this, we don't assign fields that
the user doesn't care about and we can avoid all conditional branches.
> It is possible this uses fewer icache lines than what we have today,
> it depends how big the calls end up..
>
The current proposal offers several poll_one_ex function - a function
per common case. That's true we have more functions, but the function we
actually used is *almost* tailored to the user's requirements.
All checks and conditions are done once. We trade-off the function
pointer + re-checking/re-formatting some fields over and over again by
copying the fields to ibv_wc_ex.
>>> 4) A basic analysis says this trades cache line dirtying of the wc
>>> array for unconditional branches.
>>> It eliminates at least 1 conditional branch per CQE iteration by
>>> using only 1 loop.
>>
>> This done none of that stuff at all if the device directly follows the
>> programmed format. There will be no need to do any driver formatting at
>> all.
>
> Maybe, but no such device exists? I'm not sure I want to see a
> miserable API on the faint hope of hardware support..
>
> Even if hardware appears, this basic API pattern can still support
> that optimally by using the #5 technique - and still avoids the memcpy
> from the DMA buffer to the user memory.
>
>>> Compared to the poll_ex proposal this eliminates a huge
>>> number of conditional branches as 'wc_flags' and related no longer
>>> exist at all.
>>
>> wc_flags may be something bothersome. You do not want to check inside the
>> loop.
>
> Did you look at the mlx4 driver patch? It checks wc_flags constantly
> when memcpying the HW CQE to the user memory - in the per CQE loop:
>
> + if (wc_flags & IBV_WC_EX_WITH_IMM) {
> + *wc_buffer.b32++ = cqe->immed_rss_invalid;
>
> Every single user CQE write (or even not write) is guarded by a
> conditional branch on the flags, and use of post-increment like that
> creates critical path data dependencies and kills ILP. All this extra
> code eats icache lines.
>
>
Look at the new code for libmlx5. This code is optimized *at compile
time*. We create a poll_one_ex function for a common case. This has zero
overhead.
Regarding user-space check for wc_flags, I agree on eliminating this
check. If you created a CQ with a set of attributes, they are guaranteed
to exist on known values for the opcode.
> Sure, the scheme saves dritying cache lines, but at the cost of a huge
> number of these conditional branches and more code. I'm deeply
> skeptical that is an overall win compared to dirtying more cache
> lines - mispredicted branches are expensive.
>
> What I've suggested avoids all the cache line dirtying and avoids all
> the above conditional branching and data dependencies at the cost of
> a few indirect unconditional jumps. (and if #5 is possible they aren't
> even jumps)
>
When eliminating "if (wc_flags & IBV_WC_EX_WITH_IMM)" in the user's
space application, conditional branching is eliminated in this proposal
as well.
The question becomes - what costs more - copying the data or using an
unconditional branch.
> I say there is no way to tell which scheme performs better without
> careful benchmarking - it is not obvious any of these trade offs are
> winners.
>
Agree. It may depend on the user application as well.
>> All cqe's should come with the fields requested and the
>> layout of the data must be fixed when in the receive loop. No additional
>> branches in the loop.
>
> Sure, for the caller, I'm looking at this from the whole system
> perspective. The driver is doing a wack more crap to produce this
> formatting and it certainly isn't free.
>
Dynamic WC format (or getter functions) isn't going to be free, at least
not without linking the vendor's user-space driver to the application
itself or making all vendors behave the same.
> Jason
>
Yishai and Matan.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
next prev parent reply other threads:[~2016-03-21 15:24 UTC|newest]
Thread overview: 23+ messages / expand[flat|nested] mbox.gz Atom feed top
2016-02-24 9:41 [PATCH V1 libibverbs 0/8] Completion timestamping Yishai Hadas
[not found] ` <1456306924-31298-1-git-send-email-yishaih-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2016-02-24 9:41 ` [PATCH V1 libibverbs 1/8] Add ibv_poll_cq_ex verb Yishai Hadas
[not found] ` <1456306924-31298-2-git-send-email-yishaih-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2016-02-24 19:02 ` Jason Gunthorpe
[not found] ` <20160224190230.GA10588-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2016-02-25 8:01 ` Yishai Hadas
[not found] ` <56CEB4C7.60607-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
2016-02-25 17:05 ` Jason Gunthorpe
[not found] ` <20160225170541.GA22513-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2016-02-28 16:03 ` Matan Barak (External)
[not found] ` <56D31A58.20205-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2016-02-29 19:17 ` Jason Gunthorpe
[not found] ` <20160229191734.GA15042-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2016-03-01 8:52 ` Matan Barak (External)
[not found] ` <56D55851.60206-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2016-03-01 16:10 ` Christoph Lameter
2016-03-01 17:24 ` Jason Gunthorpe
[not found] ` <20160301172448.GA24031-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2016-03-02 7:34 ` Matan Barak (External)
[not found] ` <56D6979F.6000400-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2016-03-02 18:28 ` Jason Gunthorpe
[not found] ` <20160302182836.GA7084-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2016-03-02 19:08 ` Christoph Lameter
[not found] ` <alpine.DEB.2.20.1603021300491.15609-wcBtFHqTun5QOdAKl3ChDw@public.gmane.org>
2016-03-02 19:51 ` Jason Gunthorpe
[not found] ` <20160302195138.GA8427-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2016-03-21 15:24 ` Matan Barak [this message]
[not found] ` <56F01227.9050900-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2016-03-21 17:09 ` Jason Gunthorpe
2016-02-24 9:41 ` [PATCH V1 libibverbs 2/8] Add timestamp_mask and hca_core_clock to ibv_query_device_ex Yishai Hadas
2016-02-24 9:41 ` [PATCH V1 libibverbs 3/8] Add support for extended ibv_create_cq Yishai Hadas
2016-02-24 9:42 ` [PATCH V1 libibverbs 4/8] Add completion timestamp support for ibv_poll_cq_ex Yishai Hadas
2016-02-24 9:42 ` [PATCH V1 libibverbs 5/8] Add helper functions to work with the extended WC Yishai Hadas
2016-02-24 9:42 ` [PATCH V1 libibverbs 6/8] Add ibv_query_rt_values_ex Yishai Hadas
2016-02-24 9:42 ` [PATCH V1 libibverbs 7/8] Man pages for time stamping support Yishai Hadas
2016-02-24 9:42 ` [PATCH V1 libibverbs 8/8] Add timestamp support in rc_pingpong Yishai Hadas
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=56F01227.9050900@mellanox.com \
--to=matanb-vpraknaxozvwk0htik3j/w@public.gmane.org \
--cc=cl-vYTEC60ixJUAvxtiuMwx3w@public.gmane.org \
--cc=dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org \
--cc=jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org \
--cc=linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org \
--cc=majd-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org \
--cc=ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org \
--cc=talal-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org \
--cc=yishaih-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org \
--cc=yishaih-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.