From mboxrd@z Thu Jan 1 00:00:00 1970 From: Matan Barak Subject: Re: [PATCH V1 libibverbs 1/8] Add ibv_poll_cq_ex verb Date: Mon, 21 Mar 2016 17:24:23 +0200 Message-ID: <56F01227.9050900@mellanox.com> References: <20160224190230.GA10588@obsidianresearch.com> <56CEB4C7.60607@dev.mellanox.co.il> <20160225170541.GA22513@obsidianresearch.com> <56D31A58.20205@mellanox.com> <20160229191734.GA15042@obsidianresearch.com> <56D55851.60206@mellanox.com> <20160301172448.GA24031@obsidianresearch.com> <56D6979F.6000400@mellanox.com> <20160302182836.GA7084@obsidianresearch.com> <20160302195138.GA8427@obsidianresearch.com> Mime-Version: 1.0 Content-Type: text/plain; charset="windows-1252"; format=flowed Content-Transfer-Encoding: 7bit Return-path: In-Reply-To: <20160302195138.GA8427-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> Sender: linux-rdma-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org To: Jason Gunthorpe , Christoph Lameter Cc: Yishai Hadas , Yishai Hadas , dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, majd-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, talal-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org List-Id: linux-rdma@vger.kernel.org On 02/03/2016 21:51, Jason Gunthorpe wrote: > On Wed, Mar 02, 2016 at 01:08:30PM -0600, Christoph Lameter wrote: >> On Wed, 2 Mar 2016, Jason Gunthorpe wrote: >> >>> So all this ugly API to minimize cache line usage has no measured >>> performance gain? >> >> We have seen an increased cacheline footprint adding ~100-200ns to receive >> latencies during loops. This does not show up in synthetic loads that do >> not do much processing since their cache footprint is minimal. > > I know you've seen effects.. > > But apparently the patch authors haven't. > >>>> Does the opaque pointer guarantees an aligned access? Who allocates the >>>> space for the vendor's CQE? Any concrete example? 
>>>> One of the problems here is that CQEs could be formatted as -
>>>> "if QP type is y, then copy field z from o". Doing it that way may
>>>> mean evaluating the same "if" multiple times. The current approach can
>>>> still avoid a memcpy and write straight to the user's buffer.
>>>
>>> No, none of that...
>>>
>>> struct ibv_cq
>>> {
>>>     int (*read_next_cq)(struct ibv_cq *cq, struct common_wr *res);
>>>     int (*read_address)(struct ibv_cq *cq, struct wr_address *res);
>>>     uint64_t (*read_timestamp)(struct ibv_cq *cq);
>>>     uint32_t (*read_immediate_data)(struct ibv_cq *cq);
>>>     void (*read_something_else)(struct ibv_cq *cq, struct something_else *res);
>>> };
>>
>> Argh. You are requiring multiple indirect function calls
>> to retrieve the same information and therefore significantly increase
>> latency.
>
> It depends. These are unconditional branches and they elide a lot of
> other code and conditional branches by their design.
>

Our performance team and I measured the new ibv_poll_cq_ex against the old
poll_cq implementation, counting the cycles spent around the poll_cq call
(we used ib_write_lat for these measurements). We saw an increase of ~15
cycles per CQE, which is about 2% over the regular, non-extensible poll_cq.

Pinpointing the source of this increase, we found it is the function pointer
to our internal poll_one routine. Our poll_cq_ex implementation uses a
per-CQ poll_one_ex function which is *almost* tailored to the fields the
user requested; a single static poll_one would instead need many conditional
branches and would perform worse. In other words, the indirect call (versus
a static poll_one function the compiler is free to inline) is what costs the
~15 cycles, while a monolithic poll_one function would pay substantial
branching overhead instead. A per-field getter introduces such a call (an
unconditional jump) for every field. If we were using static linking (plus
whole-program linkage), I would agree this route is better.
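To make the trade-off concrete, the per-CQ specialization can be sketched
roughly like this (my_cq, poll_one_with_imm, POLL_WITH_IMM and friends are
illustrative names I made up for this sketch, not the actual
libibverbs/libmlx5 identifiers):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative sketch only: my_cq, my_wc, POLL_WITH_IMM etc. are made-up
 * names, not the real libibverbs/libmlx5 identifiers. */

struct my_wc {
	uint64_t wr_id;
	uint32_t imm_data;
};

struct my_cqe {				/* stand-in for a hardware CQE */
	uint64_t wr_id;
	uint32_t imm;
};

struct my_cq {
	/* per-CQ poll_one pointer, chosen once at CQ creation */
	int (*poll_one)(struct my_cq *cq, struct my_wc *wc);
	struct my_cqe cur;
};

#define POLL_WITH_IMM 0x1u

/* Specialized variant: the caller asked for immediate data. */
static int poll_one_with_imm(struct my_cq *cq, struct my_wc *wc)
{
	wc->wr_id = cq->cur.wr_id;
	wc->imm_data = cq->cur.imm;	/* no per-CQE flags test here */
	return 1;
}

/* Specialized variant: the caller did not ask for immediate data. */
static int poll_one_no_imm(struct my_cq *cq, struct my_wc *wc)
{
	wc->wr_id = cq->cur.wr_id;	/* imm_data is simply never written */
	return 1;
}

/* The wc_flags checks run once here, not once per polled CQE. */
static void setup_cq(struct my_cq *cq, unsigned int wc_flags)
{
	cq->poll_one = (wc_flags & POLL_WITH_IMM) ? poll_one_with_imm
						  : poll_one_no_imm;
}
```

The flags test runs once at CQ creation; the per-CQE path pays only the one
indirect call, which is where the ~15 cycles come from.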
However, based on the results above, we are worried this might incur
substantial overhead. In addition, the user has to call read_next_cq for
every completion entry, which is yet another unconditional jump.

> The mlx4 driver does something like this on every CQE to parse the
> immediate data:
>
> +	if (is_send) {
> +		switch (cqe->owner_sr_opcode & MLX4_CQE_OPCODE_MASK) {
> +		case MLX4_RECV_OPCODE_RDMA_WRITE_IMM:
> +			if (wc_flags & IBV_WC_EX_WITH_IMM) {
> +				*wc_buffer.b32++ = cqe->immed_rss_invalid;
>
> With this additional overhead for all the parse paths that have no
> immediate:
>
> +	if (wc_flags & IBV_WC_EX_WITH_IMM)
> +		wc_buffer.b32++; // No imm to set
>
> Whereas my suggestion would look more like:
>
> mlx4_read_immediate_data(cq) {return cq->cur_cqe->immed_rss_invalid;};
>
> [and even that can be micro-optimized further]
>
> And the setting of WITH_IMM during the common parse is much less branchy:
>
>  opcode = cqe->owner_sr_opcode & MLX4_CQE_OPCODE_MASK;
>  wc->flags |= WITH_IMM_BIT << ((unsigned int)(opcode == MLX4_RECV_OPCODE_RDMA_WRITE_IMM ||
>                                               opcode == MLX4_RECV_OPCODE_SEND_IMM ...))
>
> I bet that is a lot less work going on, even including the indirect
> jump overhead.
>

With per-field getters we would either read owner_sr_opcode several times
(once for almost every related wc field) and re-parse it each time, or store
a pre-processed owner_sr_opcode in the CQ itself and pay the overhead of
"copying" it. In addition, we would need another API (end_cq) to return the
consumed CQ entries to hardware ownership (and to release the vendor's
per-poll_cq locks, if any).
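Put together, a consumer of the getter-style interface would look roughly
like the sketch below (getter_cq, drain_cq and the stub backend are
illustrative names for this sketch, not a real libibverbs API). Note the two
indirect calls per completion plus the final end_cq:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative consumer of a getter-style CQ interface; getter_cq,
 * drain_cq and the stub backend below are made-up names, not a real
 * libibverbs API. */

struct common_wr {
	uint64_t wr_id;
	int status;
};

struct getter_cq {
	/* one indirect call per completion just to advance... */
	int (*read_next_cq)(struct getter_cq *cq, struct common_wr *res, int more);
	/* ...plus one per field the caller actually wants */
	uint64_t (*read_timestamp)(struct getter_cq *cq);
	/* returns the consumed CQEs to hardware ownership and drops
	 * any vendor per-poll locks */
	int (*end_cq)(struct getter_cq *cq);
};

static int drain_cq(struct getter_cq *cq, uint64_t *last_ts)
{
	struct common_wr wr;
	int n = 0;

	while (cq->read_next_cq(cq, &wr, 1) > 0) {
		*last_ts = cq->read_timestamp(cq);
		n++;
	}
	cq->end_cq(cq);
	return n;
}

/* Tiny stub backend so the loop above can actually run. */
static int stub_left = 3;

static int stub_next(struct getter_cq *cq, struct common_wr *res, int more)
{
	(void)cq; (void)more;
	if (stub_left == 0)
		return 0;
	res->wr_id = (uint64_t)stub_left--;
	res->status = 0;
	return 1;
}

static uint64_t stub_ts(struct getter_cq *cq) { (void)cq; return 1234; }
static int stub_end(struct getter_cq *cq) { (void)cq; return 0; }
```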
So, as far as we understand, you propose something like:

struct ibv_cq {
	/* "more" argument is new */
	int (*read_next_cq)(struct ibv_cq *cq, struct common_wr *res, bool more);
	/* function is new */
	int (*end_cq)(struct ibv_cq *cq);
	int (*read_address)(struct ibv_cq *cq, struct wr_address *res);
	uint64_t (*read_timestamp)(struct ibv_cq *cq);
	uint32_t (*read_immediate_data)(struct ibv_cq *cq);
	void (*read_something_else)(struct ibv_cq *cq, struct something_else *res);
};

Again, every call here introduces an unconditional jump and makes the CQ
structure larger, so the CQ struct itself might not fit in a cache line. In
addition, ibv_cq would no longer be extensible the way it is today: adding a
new ibv_cq_ex would change every API that uses it.

>> This is going to cause lots of problems for processing at high
>> speed, where we have to use the processor caches as carefully as possible
>> to squeeze out all we can get.
>
> The driver should be built so the branch targets are all in close
> cache lines, with function prologues deliberately elided so the branch
> targets are small. There may even be further techniques to micro
> optimize the jump, if one wanted to spend the effort.
>
> Further, since we are only running the code the caller actually needs
> we are not wasting icache lines trundling through the driver CQE parse
> that has to handle all cases, even ones the caller doesn't care about.
>

The current optimization we implement in our user-space drivers is to
provide multiple poll_one_ex functions, where each function assigns only
some of the WC fields. Each function is tailored to a different use case and
common scenario. This way we don't assign fields the user doesn't care
about, and we avoid all the conditional branches.

> It is possible this uses fewer icache lines than what we have today,
> it depends how big the calls end up..
>

The current proposal offers several poll_one_ex functions - one function per
common case.
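The per-common-case specialization can be sketched as a single inline
"template" body instantiated with constant flags (hw_cqe, wc_ex, F_WITH_IMM
etc. are names I made up for this sketch, not the real mlx5 identifiers):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative sketch: one inline "template" body instantiated with
 * constant wc_flags, so the compiler folds the branches away in each
 * specialized copy.  hw_cqe, wc_ex, F_WITH_IMM etc. are made-up names. */

#define F_WITH_IMM 0x1u
#define F_WITH_TS  0x2u

struct hw_cqe { uint64_t wr_id; uint32_t imm; uint64_t ts; };
struct wc_ex  { uint64_t wr_id; uint32_t imm; uint64_t ts; };

static inline int poll_one_tmpl(const struct hw_cqe *cqe, struct wc_ex *wc,
				uint32_t wc_flags /* compile-time constant */)
{
	wc->wr_id = cqe->wr_id;
	if (wc_flags & F_WITH_IMM)	/* folded away after inlining */
		wc->imm = cqe->imm;
	if (wc_flags & F_WITH_TS)	/* folded away after inlining */
		wc->ts = cqe->ts;
	return 1;
}

/* Specialized instances, one per common case; the CQ would hold a
 * pointer to whichever one matches its creation attributes. */
static int poll_one_basic(const struct hw_cqe *c, struct wc_ex *w)
{
	return poll_one_tmpl(c, w, 0);
}

static int poll_one_imm_ts(const struct hw_cqe *c, struct wc_ex *w)
{
	return poll_one_tmpl(c, w, F_WITH_IMM | F_WITH_TS);
}
```

Each specialized instance touches only the fields its case needs; the flags
tests exist only in the source, not in the generated per-CQE path.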
It's true that we have more functions, but the function we actually use is
*almost* tailored to the user's requirements, and all checks and conditions
are done once. We trade the function pointer call, plus re-checking and
re-formatting some fields over and over again, for copying the fields into
ibv_wc_ex.

>>> 4) A basic analysis says this trades cache line dirtying of the wc
>>> array for unconditional branches.
>>> It eliminates at least 1 conditional branch per CQE iteration by
>>> using only 1 loop.
>>
>> This does none of that stuff at all if the device directly follows the
>> programmed format. There will be no need to do any driver formatting at
>> all.
>
> Maybe, but no such device exists? I'm not sure I want to see a
> miserable API on the faint hope of hardware support..
>
> Even if hardware appears, this basic API pattern can still support
> that optimally by using the #5 technique - and still avoids the memcpy
> from the DMA buffer to the user memory.
>
>>> Compared to the poll_ex proposal this eliminates a huge
>>> number of conditional branches as 'wc_flags' and related no longer
>>> exist at all.
>>
>> wc_flags may be something bothersome. You do not want to check inside the
>> loop.
>
> Did you look at the mlx4 driver patch? It checks wc_flags constantly
> when memcpying the HW CQE to the user memory - in the per CQE loop:
>
> +	if (wc_flags & IBV_WC_EX_WITH_IMM) {
> +		*wc_buffer.b32++ = cqe->immed_rss_invalid;
>
> Every single user CQE write (or even not write) is guarded by a
> conditional branch on the flags, and use of post-increment like that
> creates critical path data dependencies and kills ILP. All this extra
> code eats icache lines.
>

Look at the new code for libmlx5. That code is optimized *at compile time*:
we create a poll_one_ex function for each common case, so these checks have
zero run-time overhead. Regarding the user-space check of wc_flags, I agree
it can be eliminated.
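On the branchless WITH_IMM computation quoted earlier: as written, shifting
WITH_IMM_BIT by the boolean would leave the bit set even when the opcode
carries no immediate (shift by 0), and set the wrong bit when it does (shift
by 1). A multiply (or negate-and-mask) form keeps it branchless and correct;
a sketch with made-up opcode values:

```c
#include <assert.h>
#include <stdint.h>

/* Made-up opcode values, for illustration only. */
enum {
	OP_RDMA_WRITE_IMM = 0x9,
	OP_SEND_IMM       = 0xb,
	OP_SEND           = 0xa,
};

#define WITH_IMM_BIT 0x2u

/* Branchless: the comparisons typically compile to flag-setting
 * instructions (e.g. setcc on x86), not mispredictable jumps; the
 * 0-or-1 result is then multiplied into the flag bit. */
static uint32_t imm_flag(uint32_t opcode)
{
	uint32_t has_imm = (opcode == OP_RDMA_WRITE_IMM) |
			   (opcode == OP_SEND_IMM);
	return WITH_IMM_BIT * has_imm;
}
```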
If you created the CQ with a given set of attributes, the corresponding
fields are guaranteed to be present for the opcodes that carry them.

> Sure, the scheme saves dirtying cache lines, but at the cost of a huge
> number of these conditional branches and more code. I'm deeply
> skeptical that is an overall win compared to dirtying more cache
> lines - mispredicted branches are expensive.
>
> What I've suggested avoids all the cache line dirtying and avoids all
> the above conditional branching and data dependencies at the cost of
> a few indirect unconditional jumps. (and if #5 is possible they aren't
> even jumps)
>

When "if (wc_flags & IBV_WC_EX_WITH_IMM)" is eliminated in the user-space
application, the conditional branching is eliminated in this proposal as
well. The question becomes: what costs more - copying the data, or the
unconditional branch?

> I say there is no way to tell which scheme performs better without
> careful benchmarking - it is not obvious any of these trade offs are
> winners.
>

Agreed. It may depend on the user application as well.

>> All cqe's should come with the fields requested and the
>> layout of the data must be fixed when in the receive loop. No additional
>> branches in the loop.
>
> Sure, for the caller, I'm looking at this from the whole system
> perspective. The driver is doing a wack more crap to produce this
> formatting and it certainly isn't free.
>

A dynamic WC format (or getter functions) isn't going to be free either, at
least not without statically linking the vendor's user-space driver into the
application itself or making all vendors behave identically.

> Jason
>

Yishai and Matan.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html