All of lore.kernel.org
 help / color / mirror / Atom feed
From: Walker, Benjamin <benjamin.walker at intel.com>
To: spdk@lists.01.org
Subject: Re: [SPDK] SPDK support for NVMe CMBs and PMRs with WDS/RDS
Date: Wed, 31 Jan 2018 21:43:09 +0000	[thread overview]
Message-ID: <1517434988.25907.154.camel@intel.com> (raw)
In-Reply-To: DD04B694-1A59-47BD-98F4-CBC3D8B07692@raithlin.com

[-- Attachment #1: Type: text/plain, Size: 3943 bytes --]

On Wed, 2018-01-31 at 20:08 +0000, Stephen  Bates wrote:
> I am not opposed to this but I would like others to chime in on this approach.
> It does involve two function calls rather than one which complicates life for
> the user. Does anyone else have opinion on this? I did like the way "improved"
> spdk_mem_register to accomodate this new class of memory region rather than
> add a separate function.
> 
> I am going to give it a few more days for others to review and comment on this
> and other aspects of the patch series and then I will resubmit.
> 

I've been mulling this over for a few days and now I believe I see a strategy
that's going to simplify this a lot. First, a little bit of background.

The function spdk_mem_register registers a virtual address with *all* SPDK
memory translation tables. The system is set up to track an arbitrarily large
number - in fact user code can create their own translation tables if they want.
In practice, there are two important ones:

1) The va to pa translation table (or va to iova if IOMMU is enabled). This one
is created automatically by the env layer.
2) The va to RDMA lkey translation table. This one is created if you're using
NVMe-oF in some form.

When spdk_mem_register is called, it loops through all existing translation
tables and calls a function pointer on each to notify them that a new virtual
address is registered. That function pointer is responsible for determining the
appropriate mapping of that virtual address and setting the appropriate
translation.

The first patch in your series alters spdk_mem_register so that it has an
optional second argument (paddr). But that doesn't make sense - only one of the
tables knows what a physical address is. What we really want to do is change the
code so that the notify function in the code managing the va to pa table can
correctly identify the pa for the given va when the va is a BAR. That function
is spdk_vtophys_notify() in lib/env_dpdk/vtophys.c.

Right now spdk_vtophys_notify() first attempts to look up a virtual address in
DPDK's list of known allocated hugepages (it has translations pre-stored in that
list). If that fails (and there is no IOMMU), the code falls back to scanning
/proc/self/pagemap in sysfs. Is the correct translation from va to pa for the
BAR present in /proc/self/pagemap? I would have guessed that it was, but I'll
assume you already tried this and it wasn't (but I'll test it myself once I get
a device with CMB plugged in to my machine).

If the address translation isn't available in /proc/self/pagemap, we can add a
step in between looking up the va in DPDK's hugepages and scanning
/proc/self/pagemap that iterates through all attached PCI devices and checks if
the va is in the range of one of their BARs. There is a facility for iterating
the devices already - FOREACH_DEVICE_ON_PCIBUS, which iterates over structures
that contain an array named "mem_resource" where the index is the BAR number and
each element is this:

struct rte_mem_resource {
	uint64_t phys_addr; /**< Physical address, 0 if not resource. */
	uint64_t len;       /**< Length of the resource. */
	void *addr;         /**< Virtual address, NULL when not mapped. */
};

I think that's the right way to approach this. I know that when you call
spdk_mem_register, you have the physical address sitting there, so on the
surface this seems much less efficient. But it keeps our nice delineation
between the roles and responsibilities of the software stack such that
spdk_mem_register is entirely unaffected - just the code that is in charge of
managing specifically the va to pa map is modified (vtophys.c). The CMB is also
going to be automatically registered with RDMA NICs for NVMe-oF as well.

I haven't thought as deeply about the rest of the patches, but they look fine to
me right now. I think this is going to be pretty neat!

Thanks,
Ben

             reply	other threads:[~2018-01-31 21:43 UTC|newest]

Thread overview: 19+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-01-31 21:43 Walker, Benjamin [this message]
  -- strict thread matches above, loose matches on Subject: below --
2018-02-09  0:29 [SPDK] SPDK support for NVMe CMBs and PMRs with WDS/RDS Stephen Bates
2018-02-09  0:07 Stephen Bates
2018-02-08 23:46 Harris, James R
2018-02-08 23:36 Stephen Bates
2018-02-08  5:17 Stephen Bates
2018-02-08  5:07 Harris, James R
2018-02-08  3:05 Liu, Changpeng
2018-02-08  2:35 Stephen Bates
2018-02-07 23:18 Stephen Bates
2018-02-07 18:53 Stephen Bates
2018-02-07  0:20 Harris, James R
2018-02-06  4:05 Harris, James R
2018-02-06  3:06 Stephen Bates
2018-02-01 22:04 Stephen Bates
2018-01-31 20:14 Harris, James R
2018-01-31 20:08 Stephen Bates
2018-01-30  8:45 Stojaczyk, DariuszX
2018-01-30  3:35 Stephen Bates

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1517434988.25907.154.camel@intel.com \
    --to=spdk@lists.01.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.