From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754703AbYJFSKt (ORCPT ); Mon, 6 Oct 2008 14:10:49 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1752411AbYJFSKk (ORCPT ); Mon, 6 Oct 2008 14:10:40 -0400 Received: from smtp.opengridcomputing.com ([209.198.142.2]:42322 "EHLO smtp.opengridcomputing.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751724AbYJFSKj (ORCPT ); Mon, 6 Oct 2008 14:10:39 -0400 Message-ID: <48EA549E.1050001@opengridcomputing.com> Date: Mon, 06 Oct 2008 13:10:38 -0500 From: Tom Tucker User-Agent: Thunderbird 2.0.0.17 (Macintosh/20080914) MIME-Version: 1.0 To: Roland Dreier CC: ericvh@gmail.com, linux-kernel@vger.kernel.org, v9fs-devel@vger.kernel.org, rminnich@dancer.ca.sandia.gov, lionkov@lanl.gov Subject: Re: [PATCH 01/03] 9prdma: RDMA Transport Support for 9P References: <1222906119-14310-1-git-send-email-tom@opengridcomputing.com> <1222906119-14310-2-git-send-email-tom@opengridcomputing.com> In-Reply-To: Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Roland Dreier wrote: > very cool... neat that it only takes 1000 lines of code to do this too. > > a few quick comments from a cursory read: > > > This file implements the RDMA transport provider for 9P. It allows > > mounts to be performed over iWARP and IB capable network interfaces > > and uses the OpenFabrics API to perform I/O. > > I don't like this "openfabrics API" terminology -- the RDMA API is just > one part of the kernel API that you're using, and I'm not sure it's > worth calling out specially. ie just delete that third line from the > changelog. > Ok. > > + atomic_t next_tag; > > this seems to be a write-only field... delete it? This is left over garbage. Nice catch. > > > + * @wc_op: Mellanox's broken HW doesn't provide the original WR op > > + * when the CQE completes in error. This forces apps to keep track of > > + * the op themselves. Yes, it's a Pet Peeve of mine ;-) > > there's nothing broken about this behavior -- the IB spec very > explicitly calls out that the opcode field is undefined for completions > with error status. > Fair enough... > > +find_req(struct p9_trans_rdma *rdma, u16 tag) > > > + return found ? req : 0; > > using 0 instead of NULL here... probably worth running all this through > sparse to see what else if flags. > will do. > > + if (atomic_inc_return(&rdma->sq_count) >= rdma->sq_depth) > > + wait_event_interruptible > > + (rdma->send_wait, > > + (atomic_read(&rdma->sq_count) < rdma->sq_depth)); > > this worries me a bit... for one thing you use wait_event_interruptible() > without checking the return value, so it's entirely possible that this > gets woken up before the condition actually becomes true. And to handle > that correctly, you'd need extra code to decrement the sq_count and wake > up other waiters if it was interrupted, etc. > Yes, I think this needs to be wait_event. > I know counting semaphores are out of favor in the kernel but this seems > to be a good place to use real semaphores, rather than concocting your > own in a much more complicated way. > > > + wait_for_completion_interruptible(&rdma->cm_done); > > similarly there are lots of places you do this interruptible wait but > then don't check if you were interrupted. > Well. The only place I do that is when the connection is being established. And actually, I don't(didn't?) really care why it stopped waiting, I only care about the state of the connection and if it's not what I need it to be I error out. I suppose there is a theoretical race here where the user cancels the wait with ctrl-C, but the connection keeps going because the transport always gets back fast enough. This would eventually fail in rdma_rpc. I'll add the code to check so that it's obvious. > > + if (0 == (devattr.device_cap_flags & IB_DEVICE_LOCAL_DMA_LKEY)) { > > + rdma->dma_mr = ib_get_dma_mr(rdma->pd, IB_ACCESS_LOCAL_WRITE); > > + if (IS_ERR(rdma->dma_mr)) > > + goto error; > > + rdma->lkey = rdma->dma_mr->lkey; > > + } else > > + rdma->lkey = rdma->cm_id->device->local_dma_lkey; > > seems to me this could be written more naturally and avoid the "if > (0==(test))" construction (which I always have to reread to parse) if > you reverse the test: > > if (devattr.device_cap_flags & IB_DEVICE_LOCAL_DMA_LKEY) { > rdma->lkey = rdma->cm_id->device->local_dma_lkey; > } else { > rdma->dma_mr = ib_get_dma_mr(rdma->pd, IB_ACCESS_LOCAL_WRITE); > if (IS_ERR(rdma->dma_mr)) > goto error; > rdma->lkey = rdma->dma_mr->lkey; > } > agree. thanks. > The only place I see a send request posted has: > > > + wr.opcode = IB_WR_SEND; > > so I guess your protocol is only send/receives (ie no RDMA in the > pedantic sense, just two-sided operations). I'm wondering what the > motivation for all this is: is using send/receive on IB/iWARP a big > performance win over non-offloaded transport? Even compared to TCP on a > NIC with LSO and (software) LRO? > For IB it avoids IPoIB. Which I think is a big win. For iWARP, it's not so clear. There is the potential for RDMA ops in the future. > Is your protocol documented anywhere? It seems npfs has support for the > same transport. > When using SEND/RECV, there isn't a protocol other than 9P itself. Once we add RDMA ops, one will be required to exchange RKEY, etc... But maybe I don't understand the question? > Finally, you create your QP with: > > > + qp_attr.sq_sig_type = IB_SIGNAL_REQ_WR; > > but you unconditionally do: > > > + wr.send_flags = IB_SEND_SIGNALED; > > not a big deal but you could save one line of code and a correspondingly > tiny amount of .text by just making the send queue signal all requests. > ok. Thanks for the review, Tom > - R.