From: Sagi Grimberg
Subject: kernel memory registration (was: RDMA/core: Transport-independent access flags)
Date: Fri, 10 Jul 2015 11:55:29 +0300
To: Jason Gunthorpe
Cc: Tom Talpey, Steve Wise, 'Christoph Hellwig', dledford, sagig, ogerlitz, roid, linux-rdma, eli, Oren Duer
List-Id: linux-rdma@vger.kernel.org

On 7/9/2015 8:01 PM, Jason Gunthorpe wrote:
> On Thu, Jul 09, 2015 at 02:02:03PM +0300, Sagi Grimberg wrote:
>
>> We have protocols that involve remote memory key transfer in their
>> standards, so I don't see how we can remove it altogether from ULPs.
>
> This is why I've been talking about local and remote MRs
> differently.

IMHO, memory registration is memory registration. The fact that we are
distinguishing between local and remote may be a sign that this is the
wrong direction to take. Sorry.

Besides, if a ULP wants to register memory for local access, why should
we tamper with that or deny it? What if a ULP has a pre-allocated pool
of large buffers that it knows it is going to use for its entire
lifetime? Silent, driver-driven FRWRs would perform a lot worse than
pre-registering these buffers. Or what if the ULP wants to register the
memory region with data integrity (signature) parameters?

I must say, whenever I find myself trying to assume/guess what the
ULP/APP might do from the driver PoV and trying to see if I'm covered,
I shake my head and say: "This is a hack, go drink some water and
rethink the whole thing". If there is one thing worse than a
complicated API, it is a restrictive one. I'd much rather ULPs just
have a simple API for registering memory.

> A Local MR is one where the Key is never put on the wire,
> it exists solely to facilitate DMA between the CPU and the local HCA,
> and it would never be needed if we had infinitely long S/G lists.
>
>> My main problem with this approach is that once you do non-trivial
>> things such as memory registration completely under the hood, it is
>> a slippery slope for device drivers.
>
> Yes, there is going to be some stuff going on, but the simplification
> for the ULP side is incredible, it is certainly something that should
> be explored and not dismissed without some really good reasons.
>
>> If say a driver decides to register memory without the caller knowing,
>> it would need to post an extra work request on the send queue.
>
> Yes, the first issue is how to flow control the sendq.
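For concreteness, the per-QP accounting Jason refers to below, which
every ULP already carries in some form today, looks roughly like this
(a minimal sketch only; the struct and helper names are illustrative,
not an existing kernel API):

    #include <linux/types.h>

    /* Per-QP send queue accounting a ULP keeps today.
     * Names are illustrative only.
     */
    struct ulp_sq_account {
            u32 max_sqe;    /* send queue depth from the QP init attrs */
            u32 avail_sqe;  /* entries we may still post */
    };

    /* Reserve n entries before posting; on failure the caller reaps
     * completions and retries.
     */
    static bool ulp_sq_reserve(struct ulp_sq_account *sq, u32 n)
    {
            if (sq->avail_sqe < n)
                    return false;
            sq->avail_sqe -= n;
            return true;
    }

    /* Give entries back as send completions are reaped. */
    static void ulp_sq_release(struct ulp_sq_account *sq, u32 n)
    {
            sq->avail_sqe += n;
    }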
> But this seems easily solved, every ULP is already tracking the
> number of available entries in the sendq, and it will not start new
> ops until there is space, so instead of doing the computation
> internally on how much space is needed to do X, we factor it out:
>
>	if (rdma_sqes_post_read(...) < avail_sqe)
>		avail_sqe -= rdma_post_read(...)
>	else
>		// Try again after completions advance
>
> Every new-style post op is paired with a 'how many entries do I need'
> call.
>
> This is not a new concept, a ULP working with FRMR already has to
> know it cannot start an FRMR-using op unless there are 2 SQEs
> available (and it has to make all this conditional on whether it is
> using FRMR or something else). All this is doing is shifting the
> computation of '2' out of the ULP and into the driver.
>
>> So once it sees the completion, it needs to silently consume it and
>> have some non-trivial logic to invalidate it (another work request!)
>> either from poll_cq context or another thread.
>
> Completions are driven by the ULP. Every new-style post also has a
> completion entry point. The ULP calls it when it knows the op is
> done, either because the WRID it provided has signaled completion, or
> because a later op has completed (signaling was suppressed).
>
> Since that may also be an implicitly posting API (if necessary, is
> it?), it follows the same rules as above. This isn't changing
> anything. ULPs would already have to drive invalidate posts from
> completion with flow control; we are just moving the actual post
> construction and computation of the needed SQEs out of the ULP.
>
>> This would also require the drivers to take a heuristic approach on
>> how many memory registration resources are needed for all possible
>> consumers (ipoib, sdp, srp, iser, nfs, more...) which might have
>> different requirements.
>
> That doesn't seem like a big issue. The ULP can give a hint on the PD
> or QP about what sort of usage it expects: 'Up to 16 RDMA READs',
> 'Up to 1MB transfer per RDMA', and the core can use a pre-allocated
> pool scheme.
>
> I was thinking about a pre-allocation for local here, as Christoph
> suggests. I think that is a refinement we could certainly add on,
> once there is some clear idea what allocations are actually necessary
> to spin up a temp MR. The basic issue I'd see is that the
> preallocation would be done without knowledge of the desired SG list,
> but maybe some kind of canonical 'max' SG could be used as a
> stand-in...
>
> Put together, it would look like this:
>
>	if (rdma_sqes_post_read(...) < avail_sqe)
>		avail_sqe -= rdma_post_read(qp,...,read_wrid)
>
>	[.. fetch wcs ...]
>
>	if (wc.wrid == read_wrid)
>		if (rdma_sqes_post_complete_read(...,read_wrid) < avail_sqe)
>			rdma_post_complete_read(qp,...,read_wrid);
>		else
>			// queue read_wrid for later rdma_post_complete_read
>
> I'm not really seeing anything here that screams out this is
> impossible, or performance is impacted, or it is too messy on either
> the ULP or driver side.

I think it is possible (at the moment), but I don't know if we should
have the drivers abusing the send/completion queues like that. I can't
say I'm fully on board with the idea of silent send-queue posting and
silent completion consuming.

> Laid out like this, I think it even means we can nuke the IB DMA API
> for these cases. rdma_post_read and rdma_post_complete_read are the
> two points that need DMA API calls (cache flushes), and they can just
> do them internally.
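To illustrate what folding the DMA API in might mean, here is a
hypothetical sketch of the simple IB case of the proposed
rdma_post_read. The function name and signature come from this thread
only, and RDMA_MAX_SGE is a made-up bound; ib_dma_map_sg() and
ib_post_send() are the existing kernel calls it would absorb. It also
assumes a device that supports IB_DEVICE_LOCAL_DMA_LKEY:

    #include <rdma/ib_verbs.h>
    #include <linux/scatterlist.h>

    #define RDMA_MAX_SGE 16         /* made-up bound for the sketch */

    /* Hypothetical: the simple IB case where the S/G list fits and no
     * MR is needed. Returns the number of SQEs consumed (always 1 here)
     * or a negative errno.
     */
    static int rdma_post_read(struct ib_qp *qp, struct scatterlist *sgl,
                              int nents, u64 remote_addr, u32 rkey,
                              u64 wrid)
    {
            struct ib_device *dev = qp->device;
            struct ib_sge sge[RDMA_MAX_SGE];
            struct ib_send_wr wr, *bad_wr;
            struct scatterlist *sg;
            int i, mapped;

            /* The DMA API (cache flushes, IOMMU) hides in here */
            mapped = ib_dma_map_sg(dev, sgl, nents, DMA_FROM_DEVICE);
            if (!mapped)
                    return -ENOMEM;
            if (mapped > RDMA_MAX_SGE) {
                    /* real code would fall back to FRMR here */
                    ib_dma_unmap_sg(dev, sgl, nents, DMA_FROM_DEVICE);
                    return -EINVAL;
            }

            for_each_sg(sgl, sg, mapped, i) {
                    sge[i].addr   = ib_sg_dma_address(dev, sg);
                    sge[i].length = ib_sg_dma_len(dev, sg);
                    sge[i].lkey   = dev->local_dma_lkey;
            }

            memset(&wr, 0, sizeof(wr));
            wr.opcode              = IB_WR_RDMA_READ;
            wr.send_flags          = IB_SEND_SIGNALED;
            wr.wr_id               = wrid;
            wr.sg_list             = sge;
            wr.num_sge             = mapped;
            wr.wr.rdma.remote_addr = remote_addr;
            wr.wr.rdma.rkey        = rkey;

            if (ib_post_send(qp, &wr, &bad_wr)) {
                    ib_dma_unmap_sg(dev, sgl, nents, DMA_FROM_DEVICE);
                    return -EIO;
            }
            return 1;               /* one SQE consumed */
    }

The matching rdma_post_complete_read would then be where
ib_dma_unmap_sg lands on the success path, which is exactly the "local
read buffer is ready to go" semantic described below.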
> This also tells me that the above call sites must already exist in
> every ULP, so we, again, are not really substantially changing
> core control flow for the ULP.
>
> Are there more details that wreck the world?
>
> Just to break it down:
>  - rdma_sqes_post_read figures out how many SQEs are needed to post
>    the specified RDMA READ.
>    On IB, if the SG list can be used then this is always 1.
>    If the RDMA READ is split into N RDMA READs then it is N.
>    For something like iWARP this would be (?):
>      * FRMR SQE
>      * RDMA READ SQE
>      * FRMR Invalidate (signaled)
>
>    Presumably we can squeeze FMR and so forth into this scheme as
>    well? They don't seem to use SQEs, so it looks simpler...
>
>    Perhaps if an internal MR pool is exhausted this returns 0xFFFF
>    and the caller will do a completion cycle, which may provide
>    free MRs back to the pool. Ultimately, once the SQ and CQ are
>    totally drained, the pool should be back to 100%?
>  - rdma_post_read generates the necessary number of posts.
>    The SQ must have the right number of entries available
>    (see above).
>  - rdma_post_complete_read does any clean-up posts needed to make an
>    MR ready to go again. Perhaps this isn't even posting?
>
>    Semantically, I'd want to see rdma_post_complete_read returning to
>    mean that the local read buffer is ready to go, and the ULP can
>    start using it instantly. All invalidation is complete and all
>    CPU caches are synced.
>
>    This is where we'd start the recycling process for any temp MR,
>    whatever that means...
>
> I expect all these calls would be function pointers, and each driver
> would provide a function pointer that is optimal for its use. E.g.
> mlx4 would provide a pointer that uses the S/G list, then falls back
> to FRMR if the S/G list is exhausted. The core code would provide a
> toolbox of common functions the drivers can use here.

Maybe it's just me, but I can't help but wonder if this is facilitating
an atmosphere where drivers will keep finding new ways to abuse even
the most simple operations. I need more time to comprehend this.

> I didn't explore how errors work, but, I think, errors are just a
> labeling exercise:
>
>	if (wc is error && wc.wrid == read_wrid)
>		rdma_error_complete_read(...,read_wrid,wc)
>
> Error recovery blows up the QP, so we just need to do the bookkeeping
> and get the MRs accounted for. The driver could do a synchronous
> clean-up of whatever mess is left during the next create_qp, or on
> the PD destroy.
>
>> I know that these are implementation details, but the point is that
>> vendor drivers can easily become a complete mess. I think we should
>> try to find a balanced approach where both consumers and providers
>> are not completely messed up.
>
> Sure, but today vendor drivers and the core are trivial while ULPs
> are an absolute mess.
>
> Goal #1 should be to move all the mess into the API and support all
> the existing methods. We should try as hard as possible to do that,
> and if along the way it just isn't possible, then fine. But that
> should be the first thing we try to reach for.
>
> Just tidying FRMR so it unifies with indirect is a fine consolation
> prize, but I believe we can do better.
>
> To your point in another message, I'd say, as long as the new API
> supports FRMR at full speed with no performance penalty, we are
> good. If the other variants out there take a performance hit, then I
> think that is OK. As you say, they are on the way out; we just need
> to make sure that ULPs continue to work with FMR under the new API
> so legacy cards don't completely break.
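To make the function-pointer idea concrete, I'd imagine something like
the following per-device table. Again, this is purely a sketch: struct
rdma_read_ops and its member signatures are invented here, and only the
op names come from this thread:

    #include <rdma/ib_verbs.h>
    #include <linux/scatterlist.h>

    /* Purely illustrative: the per-device op table Jason describes.
     * Nothing here exists in the kernel today.
     */
    struct rdma_read_ops {
            /* How many SQEs would this READ consume? 1 on IB when the
             * S/G list fits, N if the READ must be split, or 3 on iWARP
             * (FRMR + READ + invalidate). Could return 0xFFFF when an
             * internal MR pool is exhausted, asking the caller to reap
             * completions first.
             */
            u32 (*sqes_post_read)(struct ib_qp *qp,
                                  struct scatterlist *sgl, int nents);

            /* Build and post the WR(s), folding in the DMA API calls;
             * returns the number of SQEs actually consumed.
             */
            int (*post_read)(struct ib_qp *qp, struct scatterlist *sgl,
                             int nents, u64 remote_addr, u32 rkey,
                             u64 wrid);

            /* Post/perform whatever invalidation is needed so that on
             * return the local buffer is CPU-ready and any temp MR is
             * recycled.
             */
            int (*post_complete_read)(struct ib_qp *qp, u64 wrid);
    };

mlx4 could then plug in an S/G-list-first implementation that falls
back to FRMR, iWARP drivers an FRMR-based one, and the core could carry
both as library functions.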
My intention is to improve the FRWR API and gradually remove the other
APIs from the kernel (i.e. FMR/FMR_POOL/MW).

As I said, I don't think that striving for an API that implicitly
chooses how to register memory is a good idea.