From: Or Gerlitz
Subject: Re: [PATCH] RDMA/nes: IB_QPT_RAW_PACKET QP type support for nes driver
Date: Wed, 07 Jul 2010 09:45:28 +0300
To: "Walukiewicz, Miroslaw"

Walukiewicz, Miroslaw wrote:

> From my measurements it looks like the problem is related to memory
> allocation in the user-space and kernel paths, which is a very, very
> expensive operation. Look at the tx path (rx is very similar):
>
> ibv_post_send():
>     post_send_wrapper_1_0
>         for (w = wr; w; w = w->next) {
>             real_wr = alloca(sizeof *real_wr);   <- 1. dyn alloc
>             real_wr->wr_id = w->wr_id;
>     next the call to the HW-specific part
>     and prepare the message to send
>             cmd = alloca(cmd_size);              <- 2. dyn allocation

Hi Mirek,

I don't think there are applications around which use a raw QP AND are
linked against libibverbs-1.0, such that they would exercise the 1_0
wrapper, so we can ignore the 1st allocation, the one in the wrapper
code.

As for the 2nd allocation: since a WQE --posting-- is synchronous,
using the maximal values specified during the creation of the QP, I
believe this allocation can be done once per QP and reused later (see
sketch 1 below).

> dive to kernel:
>
> ib_uverbs_post_send()
>     user_wr = kmalloc(cmd.wqe_size, GFP_KERNEL);             <- 3. dyn alloc
>     next = kmalloc(ALIGN(sizeof *next, sizeof (struct ib_sge)) +
>                    user_wr->num_sge * sizeof (struct ib_sge),
>                    GFP_KERNEL);                              <- 4. dyn alloc
>
> And now there is the final call to the driver.

~same here: for #4 you can compute/allocate the maximal possible size
for "next" once per QP and use it later (sketch 2 below). As for #3,
this needs further thinking.

But before diving into all these design changes, what penalty do these
allocations actually introduce? Is it in packets-per-second? In
latency?

> Diving to the kernel is treated as something like passing a signal to
> the kernel that there is prepared information for post_send/post_recv.
> The information about the buffers is passed through a shared page
> (available to user space through mmap) to avoid copying of data. The
> write() op is used to pass the signal about post_send, and the read()
> op to pass information about post_recv(). We avoid additional copying
> of the data that way.

Thanks for the heads-up. I took a look, and this user/kernel shared
memory page is used to hold the work request; it has nothing to do
with the data. As for the work request, you still have to copy it in
user space, from the user's work request to the library's mmaped
buffer, so the only difference would be the copy_from_user done by
uverbs, for a few tens of bytes. Can you tell if/what extra penalty
this copy introduces?

> struct nes_ud_send_wr {
>     u32 wr_cnt;
>     u32 qpn;
>     u32 flags;
>     u32 resv[1];
>     struct ib_sge sg_list[64];
> };
>
> struct nes_ud_recv_wr {
>     u32 wr_cnt;
>     u32 qpn;
>     u32 resv[2];
>     struct ib_sge sg_list[64];
> };

Looking at struct nes_ud_send/recv_wr, I wasn't sure I followed: the
same instance can be used to post a list of work requests, where each
work request is limited to one SGE, am I correct? (sketch 3 below
spells out my reading)
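
Sketch 1, re the 2nd allocation above. Hypothetical names, not the
actual libibverbs code; it assumes posts to a given QP are serialized,
which is what makes the reuse safe:

    #include <errno.h>
    #include <stdlib.h>
    #include <infiniband/verbs.h>

    /* sketch 1: size the command buffer once, at QP creation, from the
     * worst case allowed by the init attributes, and reuse it on every
     * post instead of alloca()ing per call */
    struct my_qp {
            struct ibv_qp *qp;
            void          *cmd_buf;    /* reused by every post_send */
            size_t         cmd_size;
    };

    static int my_qp_alloc_cmd_buf(struct my_qp *mqp,
                                   const struct ibv_qp_init_attr *attr)
    {
            /* worst case: max_send_wr requests, each carrying
             * max_send_sge SGEs */
            mqp->cmd_size = attr->cap.max_send_wr *
                            (sizeof(struct ibv_send_wr) +
                             attr->cap.max_send_sge *
                             sizeof(struct ibv_sge));
            mqp->cmd_buf = malloc(mqp->cmd_size);
            return mqp->cmd_buf ? 0 : ENOMEM;
    }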
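
Sketch 2, re #4. Again only a sketch with made-up names, not the uverbs
code as it stands: allocate the worst-case "next" buffer once at
create_qp time and keep it hanging off the QP, instead of kmalloc()ing
on every post:

    #include <linux/kernel.h>
    #include <linux/slab.h>
    #include <rdma/ib_verbs.h>

    /* sketch 2: per-QP scratch sized for the largest possible work
     * request, allocated once and reused by the post_send path */
    struct uqp_wr_scratch {
            struct ib_send_wr *next;
    };

    static int uqp_wr_scratch_alloc(struct uqp_wr_scratch *s,
                                    const struct ib_qp_init_attr *attr)
    {
            size_t sz = ALIGN(sizeof(struct ib_send_wr),
                              sizeof(struct ib_sge)) +
                        attr->cap.max_send_sge * sizeof(struct ib_sge);

            s->next = kmalloc(sz, GFP_KERNEL);
            return s->next ? 0 : -ENOMEM;
    }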
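
Sketch 3, re the nes_ud structs. This is just my reading, spelled out
as a hypothetical helper (using the structs you quoted above and kernel
u32 types), so you can tell me where I got it wrong:

    /* sketch 3: one nes_ud_send_wr instance describes up to 64 work
     * requests, each limited to a single SGE; 'page' is the mmap()ed
     * shared page */
    static void post_n_single_sge_sends(struct nes_ud_send_wr *page,
                                        u32 qpn,
                                        const struct ib_sge *sges, u32 n)
    {
            u32 i;

            page->qpn = qpn;
            page->wr_cnt = n;        /* n work requests, one SGE each */
            for (i = 0; i < n; i++)
                    page->sg_list[i] = sges[i];
            /* a write() on the verbs fd would then signal the kernel
             * that the page holds n requests ready to post */
    }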
I don't think there's a need to support posting 64 --send-- requests;
for recv it might make sense, but it could be done in a
"batch/background" flow. Thoughts?

Or.