From: Matan Barak <matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
To: Jason Gunthorpe
<jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>,
Christoph Lameter <cl-vYTEC60ixJUAvxtiuMwx3w@public.gmane.org>
Cc: Yishai Hadas
<yishaih-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>,
Yishai Hadas <yishaih-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>,
dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org,
linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
majd-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org,
talal-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org,
ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org
Subject: Re: [PATCH V1 libibverbs 1/8] Add ibv_poll_cq_ex verb
Date: Mon, 21 Mar 2016 17:24:23 +0200
Message-ID: <56F01227.9050900@mellanox.com>
In-Reply-To: <20160302195138.GA8427-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
On 02/03/2016 21:51, Jason Gunthorpe wrote:
> On Wed, Mar 02, 2016 at 01:08:30PM -0600, Christoph Lameter wrote:
>> On Wed, 2 Mar 2016, Jason Gunthorpe wrote:
>>
>>> So all this ugly API to minimize cache line usage has no measured
>>> performance gain?
>>
>> We have seen an increased cacheline footprint adding ~100-200ns to receive
>> latencies during loops. This does not show up in synthetic loads that do
>> not do much processing since their cache footprint is minimal.
>
> I know you've seen effects..
>
> But apparently the patch authors haven't.
>
>>>> Does the opaque pointer guarantee an aligned access? Who allocates the
>>>> space for the vendor's CQE? Any concrete example?
>>>> One of the problems here is that CQEs could be formatted as -
>>>> "if QP type is y, then copy the field z from o". Doing it this way may
>>>> result in doing the same "if" multiple times. The current approach could
>>>> still avoid memcpy and write straight to the user's buffer.
>>>
>>> No, none of that...
>>>
>>> struct ibv_cq
>>> {
>>>     int (*read_next_cq)(struct ibv_cq *cq, struct common_wr *res);
>>>     int (*read_address)(struct ibv_cq *cq, struct wr_address *res);
>>>     uint64_t (*read_timestamp)(struct ibv_cq *cq);
>>>     uint32_t (*read_immediate_data)(struct ibv_cq *cq);
>>>     void (*read_something_else)(struct ibv_cq *cq, struct something_else *res);
>>> };
>>
>> Argh. You are requiring multiple indirect function calls
>> to retrieve the same information and therefore significantly increase
>> latency.
>
> It depends. These are unconditional branches and they elide a lot of
> other code and conditional branches by their design.
>
Our performance team and I measured the new ibv_poll_cq_ex against the
old poll_cq implementation, counting the cycles around the poll_cq
call. We saw a ~15 cycle increase for every CQE, which is about 2%
over the regular (non-extensible) poll_cq. We used ib_write_lat for
these measurements.
Pinpointing this increase, we found it comes from the indirect call
through the function pointer to our internal poll_one routine. Our
poll_cq_ex implementation uses a per-CQ poll_one_ex function, which is
*almost* tailored to the fields the user requested. A single static
poll_one that handles every case would need a lot of conditional
branches and would decrease performance. In other words, calling
through a function pointer instead of a static poll_one (which the
compiler is free to inline) is what costs the ~15 cycles, but a
monolithic poll_one function would incur substantial computation
overhead of its own.
A per-field getter introduces such a call (an unconditional jump) for
every field. If we were using static linking (plus whole-program
optimization), I agree that route would be better. However, based on
the results above, we are worried it would incur far more overhead.
In addition, the user has to call read_next_cq for every completion
entry (which is another unconditional jump).
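To make this concrete, here is a minimal, self-contained sketch of the
per-CQ specialization (toy types and hypothetical names, not the actual
libmlx5 code):

#include <stdint.h>

/* Toy sketch of per-CQ poll_one specialization; all type and
 * function names here are hypothetical, not the libmlx5 ones. */
struct hw_cqe { uint32_t opcode; uint32_t imm; uint64_t ts; };
struct wc_basic { uint32_t opcode; };
struct wc_with_ts { uint32_t opcode; uint64_t ts; };

struct toy_cq {
	struct hw_cqe *cur;
	/* Chosen once, at CQ creation, from the requested WC fields. */
	int (*poll_one)(struct toy_cq *cq, void *wc);
};

/* Each variant is compiled with a fixed field set, so the per-CQE
 * path contains no wc_flags branches at all. */
static int poll_one_basic(struct toy_cq *cq, void *wc)
{
	((struct wc_basic *)wc)->opcode = cq->cur->opcode;
	return 0;
}

static int poll_one_with_ts(struct toy_cq *cq, void *wc)
{
	struct wc_with_ts *out = wc;

	out->opcode = cq->cur->opcode;
	out->ts = cq->cur->ts;
	return 0;
}

/* All checks happen once, here, rather than once per CQE. */
static void toy_create_cq(struct toy_cq *cq, int want_timestamp)
{
	cq->poll_one = want_timestamp ? poll_one_with_ts : poll_one_basic;
}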
> The mlx4 driver does something like this on every CQE to parse the
> immediate data:
>
> + if (is_send) {
> + switch (cqe->owner_sr_opcode & MLX4_CQE_OPCODE_MASK) {
> + case MLX4_RECV_OPCODE_RDMA_WRITE_IMM:
> + if (wc_flags & IBV_WC_EX_WITH_IMM) {
> + *wc_buffer.b32++ = cqe->immed_rss_invalid;
>
> With this additional overhead for all the parse paths that have no
> immediate:
>
> + if (wc_flags & IBV_WC_EX_WITH_IMM)
> + wc_buffer.b32++; // No imm to set
>
> Whereas my suggestion would look more like:
>
> mlx4_read_immediate_data(cq) {return cq->cur_cqe->immed_rss_invalid;};
>
> [and even that can be micro-optimized further]
>
> And the setting of WITH_IMM during the common parse is much less branchy:
>
> opcode = cqe->owner_sr_opcode & MLX4_CQE_OPCODE_MASK;
> wc->flags |= WITH_IMM_BIT << ((unsigned int)(opcode == MLX4_RECV_OPCODE_RDMA_WRITE_IMM ||
> opcode == MLX4_RECV_OPCODE_SEND_IMM ...))
>
> I bet that is a lot less work going on, even including the indirect
> jump overhead.
>
So we would either read the owner_sr_opcode several times (once for
almost every related wc field) and parse it each time, or store the
pre-processed owner_sr_opcode in the CQ itself and incur the overhead
of "copying" it. In addition, we would need another API (end_cq) to
indicate returning the CQ element to hardware ownership (and to
release the vendor's per-poll_cq locks, if any). So, as far as we
understand, you propose something like:
struct ibv_cq
{
	/* "more" argument is new */
	int (*read_next_cq)(struct ibv_cq *cq, struct common_wr *res, bool more);
	/* function is new */
	int (*end_cq)(struct ibv_cq *cq);
	int (*read_address)(struct ibv_cq *cq, struct wr_address *res);
	uint64_t (*read_timestamp)(struct ibv_cq *cq);
	uint32_t (*read_immediate_data)(struct ibv_cq *cq);
	void (*read_something_else)(struct ibv_cq *cq, struct something_else *res);
};
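For clarity, a consumer loop under that scheme would look roughly like
the following sketch (the types, return conventions and the
"more"/end_cq semantics are our assumptions, not a spec):

#include <stdbool.h>
#include <stdint.h>

/* Stand-in for the sketched struct above; not the real verbs ibv_cq. */
struct common_wr { uint32_t opcode; uint32_t status; };

struct ibv_cq {
	int (*read_next_cq)(struct ibv_cq *cq, struct common_wr *res, bool more);
	int (*end_cq)(struct ibv_cq *cq);
	uint64_t (*read_timestamp)(struct ibv_cq *cq);
	uint32_t (*read_immediate_data)(struct ibv_cq *cq);
};

/* One indirect call per CQE, plus one per field the caller reads. */
static void drain_cq(struct ibv_cq *cq)
{
	struct common_wr wr;

	while (cq->read_next_cq(cq, &wr, true) == 0) {
		uint64_t ts  = cq->read_timestamp(cq);      /* indirect call */
		uint32_t imm = cq->read_immediate_data(cq); /* indirect call */
		(void)ts; (void)imm; /* the application would consume these */
	}
	cq->end_cq(cq); /* return CQEs to HW ownership; drop vendor locks */
}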
Again, every call here introduces an unconditional jump and makes the
CQ structure larger, so the CQ struct itself might no longer fit in a
single cache line. In addition, ibv_cq isn't extensible as it is
today; adding a new ibv_cq_ex would change all the APIs that use it.
>> This is going to cause lots of problems for processing at high
>> speed where we have to use the processor caches as carefully as possible
>> to squeeze out all we can get.
>
> The driver should be built so the branch targets are all in close
> cache lines, with function prologues deliberately elided so the branch
> targets are small. There may even be further techniques to
> micro-optimize the jump, if one wanted to spend the effort.
>
> Further, since we are only running the code the caller actually needs,
> we are not wasting icache lines trundling through the driver CQE parse
> that has to handle all cases, even ones the caller doesn't care about.
>
The current optimization we implement in our user-space drivers is to
introduce multiple poll_one_ex functions, where every function assigns
only some of the WC fields. Each function is tailored for a different
use case and common scenario. By doing this, we don't assign fields that
the user doesn't care about and we can avoid all conditional branches.
> It is possible this uses fewer icache lines than what we have today,
> it depends how big the calls end up..
>
The current proposal offers several poll_one_ex functions - one per
common case. True, we have more functions, but the function actually
used is *almost* tailored to the user's requirements, and all checks
and conditions are done once, at CQ creation. We trade one function
pointer call plus copying the fields into ibv_wc_ex against
re-checking/re-formatting the same fields over and over again.
>>> 4) A basic analysis says this trades cache line dirtying of the wc
>>> array for unconditional branches.
>>> It eliminates at least 1 conditional branch per CQE iteration by
>>> using only 1 loop.
>>
>> This does none of that stuff at all if the device directly follows the
>> programmed format. There will be no need to do any driver formatting at
>> all.
>
> Maybe, but no such device exists? I'm not sure I want to see a
> miserable API on the faint hope of hardware support..
>
> Even if hardware appears, this basic API pattern can still support
> that optimally by using the #5 technique - and still avoids the memcpy
> from the DMA buffer to the user memory.
>
>>> Compared to the poll_ex proposal this eliminates a huge
>>> number of conditional branches as 'wc_flags' and related no longer
>>> exist at all.
>>
>> wc_flags may be something bothersome. You do not want to check inside the
>> loop.
>
> Did you look at the mlx4 driver patch? It checks wc_flags constantly
> when memcpying the HW CQE to the user memory - in the per CQE loop:
>
> + if (wc_flags & IBV_WC_EX_WITH_IMM) {
> + *wc_buffer.b32++ = cqe->immed_rss_invalid;
>
> Every single user CQE write (or even not write) is guarded by a
> conditional branch on the flags, and use of post-increment like that
> creates critical path data dependencies and kills ILP. All this extra
> code eats icache lines.
>
>
Look at the new code for libmlx5. This code is optimized *at compile
time*: we create a poll_one_ex function for each common case, so this
has zero per-CQE overhead.
Regarding the user-space check of wc_flags, I agree this check can be
eliminated. If you created the CQ with a given set of attributes, the
corresponding fields are guaranteed to be present for the opcodes that
carry them.
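For example, with a hypothetical fixed WC layout (not the exact
ibv_wc_ex definition), a CQ created with timestamp and immediate-data
fields lets the application read them at fixed offsets with no flag
tests:

#include <stdint.h>

/* Hypothetical fixed WC layout for a CQ created with timestamp and
 * immediate-data fields; not the exact ibv_wc_ex definition. */
struct wc_ts_imm {
	uint64_t wr_id;
	uint32_t status;
	uint32_t opcode;
	uint64_t completion_timestamp;
	uint32_t imm_data;
};

/* The layout was fixed at CQ creation time, so the consumer loop has
 * no wc_flags branches: every field is read unconditionally. */
static uint64_t sum_timestamps(const struct wc_ts_imm *wc, int n)
{
	uint64_t sum = 0;
	int i;

	for (i = 0; i < n; i++)
		sum += wc[i].completion_timestamp;
	return sum;
}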
> Sure, the scheme saves dirtying cache lines, but at the cost of a huge
> number of these conditional branches and more code. I'm deeply
> skeptical that is an overall win compared to dirtying more cache
> lines - mispredicted branches are expensive.
>
> What I've suggested avoids all the cache line dirtying and avoids all
> the above conditional branching and data dependencies at the cost of
> a few indirect unconditional jumps. (and if #5 is possible they aren't
> even jumps)
>
If the "if (wc_flags & IBV_WC_EX_WITH_IMM)" check is eliminated in the
user-space application, then that conditional branching is eliminated
in this proposal as well.
The question becomes: what costs more - copying the data, or taking an
unconditional branch per field?
> I say there is no way to tell which scheme performs better without
> careful benchmarking - it is not obvious any of these trade offs are
> winners.
>
Agree. It may depend on the user application as well.
>> All cqe's should come with the fields requested and the
>> layout of the data must be fixed when in the receive loop. No additional
>> branches in the loop.
>
> Sure, for the caller, I'm looking at this from the whole system
> perspective. The driver is doing a whack more crap to produce this
> formatting and it certainly isn't free.
>
A dynamic WC format (or getter functions) isn't going to be free, at
least not without linking the vendor's user-space driver into the
application itself or making all vendors behave the same.
> Jason
>
Yishai and Matan.