From: Jason Gunthorpe <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
To: Christoph Lameter <cl-vYTEC60ixJUAvxtiuMwx3w@public.gmane.org>
Cc: "Matan Barak (External)"
<matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>,
Yishai Hadas
<yishaih-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>,
Yishai Hadas <yishaih-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>,
dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org,
linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
majd-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org,
talal-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org,
ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org
Subject: Re: [PATCH V1 libibverbs 1/8] Add ibv_poll_cq_ex verb
Date: Wed, 2 Mar 2016 12:51:38 -0700 [thread overview]
Message-ID: <20160302195138.GA8427@obsidianresearch.com> (raw)
In-Reply-To: <alpine.DEB.2.20.1603021300491.15609-wcBtFHqTun5QOdAKl3ChDw@public.gmane.org>
On Wed, Mar 02, 2016 at 01:08:30PM -0600, Christoph Lameter wrote:
> On Wed, 2 Mar 2016, Jason Gunthorpe wrote:
>
> > So all this ugly API to minimize cache line usage has no measured
> > performance gain?
>
> We have seen an increased cacheline footprint adding ~100-200ns to receive
> latencies during loops. This does not show up in synthetic loads that do
> not do much processing since their cache footprint is minimal.
I know you've seen effects..
But apparently the patch authors haven't.
> > > Does the opaque pointer guarantee aligned access? Who allocates the
> > > space for the vendor's CQE? Any concrete example?
> > > One of the problems here is that CQEs could be formatted as -
> > > "if QP type is y, then copy the field z from o". Doing it this way may
> > > result in doing the same "if" multiple times. The current approach can still
> > > avoid the memcpy and write straight to the user's buffer.
> >
> > No, none of that...
> >
> > struct ibv_cq
> > {
> >     int (*read_next_cq)(struct ibv_cq *cq, struct common_wr *res);
> >     int (*read_address)(struct ibv_cq *cq, struct wr_address *res);
> >     uint64_t (*read_timestamp)(struct ibv_cq *cq);
> >     uint32_t (*read_immediate_data)(struct ibv_cq *cq);
> >     void (*read_something_else)(struct ibv_cq *cq, struct something_else *res);
> > };
>
> Argh. You are requiring multiple indirect function calls
> to retrieve the same information, thereby significantly increasing
> latency.
It depends. These are unconditional branches, and they elide a lot of
other code and conditional branches by their design.
The mlx4 driver does something like this on every CQE to parse the
immediate data:
+ if (is_send) {
+ switch (cqe->owner_sr_opcode & MLX4_CQE_OPCODE_MASK) {
+ case MLX4_RECV_OPCODE_RDMA_WRITE_IMM:
+ if (wc_flags & IBV_WC_EX_WITH_IMM) {
+ *wc_buffer.b32++ = cqe->immed_rss_invalid;
With this additional overhead for all the parse paths that have no
immediate:
+ if (wc_flags & IBV_WC_EX_WITH_IMM)
+ wc_buffer.b32++; // No imm to set
Whereas my suggestion would look more like:
uint32_t mlx4_read_immediate_data(struct ibv_cq *cq) {return cq->cur_cqe->immed_rss_invalid;}
[and even that can be micro-optimized further]
And the setting of WITH_IMM during the common parse is much less branchy:
opcode = cqe->owner_sr_opcode & MLX4_CQE_OPCODE_MASK;
wc->flags |= WITH_IMM_BIT * (unsigned int)(opcode == MLX4_RECV_OPCODE_RDMA_WRITE_IMM ||
                                           opcode == MLX4_RECV_OPCODE_SEND_IMM ...);
I bet that is a lot less work going on, even including the indirect
jump overhead.
> This is going to cause lots of problems for processing at high
> speed where we have to use the processor caches as carefully as possible
> to squeeze out all we can get.
The driver should be built so the branch targets are all in close
cache lines, with function prologues deliberately elided so the branch
targets are small. There may even be further techniques to
micro-optimize the jump, if one wanted to spend the effort.
Further, since we are only running the code the caller actually needs
we are not wasting icache lines trundling through the driver CQE parse
that has to handle all cases, even ones the caller doesn't care about.
It is possible this uses fewer icache lines than what we have today;
it depends on how big the calls end up..
> > 4) A basic analysis says this trades cache line dirtying of the wc
> > array for unconditional branches.
> > It eliminates at least 1 conditional branch per CQE iteration by
> > using only 1 loop.
>
> This does none of that stuff at all if the device directly follows the
> programmed format. There will be no need to do any driver formatting at
> all.
Maybe, but no such device exists? I'm not sure I want to see a
miserable API on the faint hope of hardware support..
Even if hardware appears, this basic API pattern can still support
that optimally by using the #5 technique - and still avoids the memcpy
from the DMA buffer to the user memory.
> > Compared to the poll_ex proposal this eliminates a huge
> > number of conditional branches as 'wc_flags' and related no longer
> > exist at all.
>
> wc_flags may be bothersome. You do not want to check it inside the
> loop.
Did you look at the mlx4 driver patch? It checks wc_flags constantly
when memcpying the HW CQE to the user memory - in the per-CQE loop:
+ if (wc_flags & IBV_WC_EX_WITH_IMM) {
+ *wc_buffer.b32++ = cqe->immed_rss_invalid;
Every single user CQE write (or even a skipped write) is guarded by a
conditional branch on the flags, and post-increment like that
creates critical-path data dependencies and kills ILP. All this extra
code eats icache lines.
Sure, the scheme saves dirtying cache lines, but at the cost of a huge
number of these conditional branches and more code. I'm deeply
skeptical that is an overall win compared to dirtying more cache
lines - mispredicted branches are expensive.
What I've suggested avoids all the cache line dirtying and avoids all
the above conditional branching and data dependencies at the cost of
a few indirect unconditional jumps. (and if #5 is possible they aren't
even jumps)
I say there is no way to tell which scheme performs better without
careful benchmarking - it is not obvious that any of these trade-offs
is a winner.
> All cqe's should come with the fields requested and the
> layout of the data must be fixed when in the receive loop. No additional
> branches in the loop.
Sure, for the caller; I'm looking at this from the whole-system
perspective. The driver is doing a whack more work to produce this
formatting, and it certainly isn't free.
Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html