On 5/29/2016 10:51 AM, Yishai Hadas wrote:
> From: Matan Barak
>
> Currently, ibv_poll_cq is a one-stop shop for polling all of a
> completion's attributes. The current implementation has a few
> implications:
> 1. The vendor's completion format is transformed to an agnostic IB
>    format (copying data which might have been used directly).
> 2. Adding new completion attributes means wasting more time in copies.
> 3. Extensible functions require wasting cycles on logic which
>    determines whether a function is known and implemented in the
>    provider.
>
> In order to solve these problems, we introduce a new poll_cq
> mechanism for extensible CQs. Polling is done in batches, where every
> batch could contain more than one completion. A batch is started by
> calling the start_poll function, which already fetches a completion
> that the user can query using the various query functions. Advancing
> to the next completion is done using next_poll. After querying all
> completions (or when the user wants to stop fetching completions in
> the current batch), end_poll is called.
>
> As 'wr_id' and 'status' are the most commonly used fields, they can
> be accessed directly without a function call.

So I've been busy with kernel stuff, but I don't really like this
patch. I'm not really enamored with the entire poll_ex API.

First, I don't like the start_poll/poll/end_poll setup. I would rather
we do something like refcounting on the WCs: maybe return an array of
pointers to WCs, where each WC already has a reference taken, and when
the user is done, they call ibv_release_wc() (or something like that)
on one or more of these WCs, at which point they are freed and the
hardware can do with them as it wishes.

Then, I've been thinking about how to get this data to the user, and
again, I don't really like where we are headed. I was thinking about
changing it so that we create an enum like so:

enum wc_elements {
	WC_WR_ID,
	WC_WR_STATUS,
	/* etc. */
};

Then we define each CQ struct to have a new array called offsets:

struct cq {
	unsigned char offsets[64]; /* One cache line, should be 64-byte aligned */
	....
} __attribute__((aligned(64)));

When a CQ is created by the driver, the driver is responsible for
filling out the offsets array (it can be static if your driver only
has one CQE format, but if the format changes with different options,
this allows for that). Essentially, it is a mapping from the enum of
WC items to the position of each item in the hardware's CQE.

Then you define something like this in the verbs header file:

static inline u8 __ibv_get_wc_u8(struct cq *cq, struct wc *wc,
				 unsigned char wc_element)
{
	return *((u8 *)wc + cq->offsets[wc_element]);
}

and one for each of the other data sizes too. Then something like
this:

#define ibv_get_wc_element(cq, wc, wc_element, data)		\
{								\
	switch (wc_element) {					\
	/* All char-sized elements */				\
	case WC_WR_STATUS:					\
		data = __ibv_get_wc_u8(cq, wc, wc_element);	\
		break;						\
	...

You get the idea. The compiler will optimize the above down to static
code at each call site. And you could simplify things a bit by having
another static array that keeps the element sizes in it, and then run
the switch on that.

The long and short of this method is that accessing WC struct elements
becomes essentially a load of a variable offset from the
cq->offsets[] array (which should stay hot in cache), followed by an
lea of the CQE address and the offset.

I would prefer not to have an individual routine to access each
element. I would prefer to avoid all of the function calls if
possible, and I want something that can be optimized away by the
compiler as much as possible. The above accomplishes that, and it has
room to support up to 64 different WC struct elements with a maximum
offset of 256 bytes (much larger than any WC today).

What do people think of this idea? If no one shoots holes in it, I
will try to code something up this upcoming week (although I can't
promise it; I currently have two sick children at home, which prevents
me from getting much coding done).
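For what it's worth, here is a minimal, self-contained sketch of the
offsets idea, just to show it compiles and works end to end. The CQE
layout, the element list, and all the names (fake_hw_cqe,
fake_driver_create_cq, and so on) are made up for illustration and not
taken from any real provider:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical WC element identifiers; a real list would be longer. */
enum wc_elements {
	WC_WR_ID,
	WC_WR_STATUS,
	WC_NUM_ELEMENTS
};

/* A made-up hardware CQE layout, standing in for a vendor's format. */
struct fake_hw_cqe {
	uint64_t wr_id;		/* byte offset 0 */
	uint8_t  status;	/* byte offset 8 */
	uint8_t  pad[55];
};

/* One cache line of byte offsets: element id -> position in the CQE. */
struct cq {
	unsigned char offsets[64];
} __attribute__((aligned(64)));

/* What a driver would do at CQ creation time for this CQE layout. */
static void fake_driver_create_cq(struct cq *cq)
{
	cq->offsets[WC_WR_ID] = offsetof(struct fake_hw_cqe, wr_id);
	cq->offsets[WC_WR_STATUS] = offsetof(struct fake_hw_cqe, status);
}

static inline uint8_t __ibv_get_wc_u8(struct cq *cq, void *wc,
				       unsigned char wc_element)
{
	return *((uint8_t *)wc + cq->offsets[wc_element]);
}

static inline uint64_t __ibv_get_wc_u64(struct cq *cq, void *wc,
					unsigned char wc_element)
{
	uint64_t v;

	/* memcpy sidesteps alignment issues; it compiles to one load. */
	__builtin_memcpy(&v, (uint8_t *)wc + cq->offsets[wc_element],
			 sizeof(v));
	return v;
}

/*
 * wc_element is a compile-time constant at every call site, so the
 * compiler folds this switch down to a single sized load.
 */
#define ibv_get_wc_element(cq, wc, wc_element, data)			\
do {									\
	switch (wc_element) {						\
	case WC_WR_STATUS:	/* u8-sized elements */			\
		(data) = __ibv_get_wc_u8((cq), (wc), (wc_element));	\
		break;							\
	case WC_WR_ID:		/* u64-sized elements */		\
		(data) = __ibv_get_wc_u64((cq), (wc), (wc_element));	\
		break;							\
	}								\
} while (0)
```

With the offsets table filled in by the (fake) driver, reading an
element is just the indexed load described above, and the per-element
dispatch disappears at compile time.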