From mboxrd@z Thu Jan 1 00:00:00 1970 Content-Type: multipart/mixed; boundary="===============7642012843478362618==" MIME-Version: 1.0 From: Walker, Benjamin Subject: Re: [SPDK] SPDK support for NVMe CMBs and PMRs with WDS/RDS Date: Wed, 31 Jan 2018 21:43:09 +0000 Message-ID: <1517434988.25907.154.camel@intel.com> In-Reply-To: DD04B694-1A59-47BD-98F4-CBC3D8B07692@raithlin.com List-ID: To: spdk@lists.01.org --===============7642012843478362618== Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable On Wed, 2018-01-31 at 20:08 +0000, Stephen Bates wrote: > I am not opposed to this but I would like others to chime in on this appr= oach. > It does involve two function calls rather than one which complicates life= for > the user. Does anyone else have opinion on this? I did like the way "impr= oved" > spdk_mem_register to accomodate this new class of memory region rather th= an > add a separate function. > = > I am going to give it a few more days for others to review and comment on= this > and other aspects of the patch series and then I will resubmit. > = I've been mulling this over for a few days and now I believe I see a strate= gy that's going to simplify this a lot. First, a little bit of background. The function spdk_mem_register registers a virtual address with *all* SPDK memory translation tables. The system is set up to track an arbitrarily lar= ge number - in fact user code can create their own translation tables if they = want. In practice, there are two important ones: 1) The va to pa translation table (or va to iova if IOMMU is enabled). This= one is created automatically by the env layer. 2) The va to RDMA lkey translation table. This one is created if you're usi= ng NVMe-oF in some form. When spdk_mem_register is called, it loops through all existing translation tables and calls a function pointer on each to notify them that a new virtu= al address is registered. That function pointer is responsible for determining= the appropriate mapping of that virtual address and setting the appropriate translation. The first patch in your series alters spdk_mem_register so that it has an optional second argument (paddr). But that doesn't make sense - only one of= the tables knows what a physical address is. What we really want to do is chang= e the code so that the notify function in the code managing the va to pa table can correctly identify the pa for the given va when the va is a BAR. That funct= ion is spdk_vtophys_notify() in lib/env_dpdk/vtophys.c. Right now spdk_vtophys_notify() first attempts to look up a virtual address= in DPDK's list of known allocated hugepages (it has translations pre-stored in= that list). If that fails (and there is no IOMMU), the code falls back to scanni= ng /proc/self/pagemap in sysfs. Is the correct translation from va to pa for t= he BAR present in /proc/self/pagemap? I would have guessed that it was, but I'= ll assume you already tried this and it wasn't (but I'll test it myself once I= get a device with CMB plugged in to my machine). If the address translation isn't available in /proc/self/pagemap, we can ad= d a step in between looking up the va in DPDK's hugepages and scanning /proc/self/pagemap that iterates through all attached PCI devices and check= s if the va is in the range of one of their BARs. There is a facility for iterat= ing the devices already - FOREACH_DEVICE_ON_PCIBUS, which iterates over structu= res that contain an array named "mem_resource" where the index is the BAR numbe= r and each element is this: struct rte_mem_resource { uint64_t phys_addr; /**< Physical address, 0 if not resource. */ uint64_t len; /**< Length of the resource. */ void *addr; /**< Virtual address, NULL when not mapped. */ }; I think that's the right way to approach this. I know that when you call spdk_mem_register, you have the physical address sitting there, so on the surface this seems much less efficient. But it keeps our nice delineation between the roles and responsibilities of the software stack such that spdk_mem_register is entirely unaffected - just the code that is in charge = of managing specifically the va to pa map is modified (vtophys.c). The CMB is = also going to be automatically registered with RDMA NICs for NVMe-oF as well. I haven't thought as deeply about the rest of the patches, but they look fi= ne to me right now. I think this is going to be pretty neat! Thanks, Ben --===============7642012843478362618==--