From: Walker, Benjamin <benjamin.walker at intel.com>
To: spdk@lists.01.org
Subject: Re: [SPDK] Determination of NVMe max_io_xfer_size (NVME_MAX_XFER_SIZE) ?
Date: Wed, 09 Aug 2017 19:04:48 +0000
Message-ID: <1502305486.2934.6.camel@intel.com>
In-Reply-To: 703D359C-FC0A-4692-A049-7F387B27CCC5@oracle.com

On Wed, 2017-08-09 at 10:06 -0500, Lance Hartmann ORACLE wrote:
> 
> Ah, eureka!  Thank you!  506 was so unique, I just knew I was overlooking
> something.  *whew*  I feel so much better now! ;-)

To expand a bit more on our choices in this area, SPDK defines two key
structures - requests and trackers. Trackers correspond 1:1 with entries in the
NVMe submission queues. Requests represent requests from the user to perform
I/O. There may be more requests outstanding than available trackers, in which
case we'll do software queueing until a tracker becomes available. We also split
requests for a number of reasons (large I/O, issues with SGL-to-PRP
translation, etc.) to make the API more convenient to use, so requests often
have many children.
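
As a rough illustration of the software queueing described above, here is a
minimal model in C. It only tracks counts; the struct and function names are
made up for this sketch and are not SPDK's actual definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of one queue pair: counts only, no real I/O. */
struct qpair_model {
    uint32_t free_trackers;   /* trackers not currently bound to a command */
    uint32_t queued_requests; /* requests waiting in software for a tracker */
    uint32_t in_flight;       /* commands occupying a submission queue entry */
};

/* Submit a request: take a tracker if one is free, otherwise queue it. */
static void submit_request(struct qpair_model *qp)
{
    if (qp->free_trackers > 0) {
        qp->free_trackers--;
        qp->in_flight++;
    } else {
        qp->queued_requests++;
    }
}

/* Complete one command: the freed tracker goes straight to a waiting
 * request if there is one, otherwise it returns to the free set. */
static void complete_one(struct qpair_model *qp)
{
    assert(qp->in_flight > 0);
    if (qp->queued_requests > 0) {
        qp->queued_requests--;   /* in_flight is unchanged: tracker reused */
    } else {
        qp->in_flight--;
        qp->free_trackers++;
    }
}
```

With two trackers and three submitted requests, the third request waits in
software until the first completion frees a tracker.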

In NVMe, when a command is submitted the user gets to provide a command ID
(cid), which is a 16-bit number that will be provided in the completion entry.
This is how software is supposed to match completions up with submitted
commands. We elected to allocate the trackers for each NVMe submission queue as
an array and set the cid to the index of the tracker in this array. The tracker
holds our context for the command, including a pointer back to the request.
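
A minimal sketch of that cid-as-index scheme follows. The names and the queue
depth (NUM_TRACKERS, struct queue_pair, tracker_from_cid) are hypothetical and
chosen for illustration; this is not the actual SPDK code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_TRACKERS 128u       /* hypothetical number of trackers per queue */

struct request;                 /* user-level I/O request, opaque here */

struct tracker {
    struct request *req;        /* pointer back to the owning request */
    uint16_t cid;               /* equals this tracker's index in the array */
};

struct queue_pair {
    struct tracker tr[NUM_TRACKERS];
};

/* Give every tracker a cid equal to its position in the array. */
static void qpair_init(struct queue_pair *qp)
{
    for (uint16_t i = 0; i < NUM_TRACKERS; i++) {
        qp->tr[i].req = NULL;
        qp->tr[i].cid = i;
    }
}

/* The cid echoed back in a completion entry indexes directly into the
 * tracker array, so matching a completion needs no search. */
static struct tracker *tracker_from_cid(struct queue_pair *qp, uint16_t cid)
{
    assert(cid < NUM_TRACKERS);
    return &qp->tr[cid];
}
```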

When a command requires more than two PRP entries, it must provide a separate
PRP list (an array of 64-bit integers describing physical memory pages). Each
segment of that PRP list must fit within a physical page of memory, but segments
can be chained together to make very long arrays. We could have allocated these
PRP list segments separately from the trackers, in their own pool. Then, if we
needed a PRP list for a command, we could have grabbed one from the pool and
made the tracker point at it. This design has a few advantages in that if we
needed to chain together PRP list segments for a very, very large I/O we could
do that by grabbing multiple PRP list segments from the pool and assigning them
to one tracker, and then free them back to the pool when the associated command
completed.

However, we elected to do something a bit simpler that we hope is also faster.
One observation is that each PRP entry describes 4KiB of memory and real SSDs
limit the maximum transfer size to somewhere between 64KiB and 1MiB. 1MiB is
only 256 PRP entries, which is 2KiB of PRP list. There are good reasons not to
send I/O larger than that (QoS considerations), even if the device supports it,
so it's fairly clear that in practice no more than one PRP list segment is ever
needed. In fact, only half of a physical page is probably required. Therefore,
we decided to make our trackers exactly one physical 4KiB page, where the
context for the command is stored at the front and the remainder is the single
PRP list segment. This places a maximum on the I/O size SPDK can support
(although it's nearly 2 MiB and won't ever be the limiting factor in practice),
but it allows us to avoid pointer chasing through the tracker to the PRP list
and keep the first portion of the PRP list on the same cache line as the rest of
the tracker. A lot of our choices in the code come down to jamming as many
things into the fewest number of cache lines as possible, and this is just an
example of that.
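
To make the layout concrete, here is a sketch of a one-page tracker in C. The
48-byte context size is simply what is left when 506 eight-byte entries are
embedded in a 4096-byte page; the real fields are not spelled out here, and the
names (struct tracker_page, CONTEXT_BYTES, prp_list_bytes) are made up for
this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE        4096u
#define CONTEXT_BYTES    48u   /* assumed: 4096 - 506 * 8 */
#define PRP_LIST_ENTRIES ((PAGE_SIZE - CONTEXT_BYTES) / sizeof(uint64_t))

struct tracker_page {
    uint8_t  context[CONTEXT_BYTES];      /* stand-in for the command context */
    uint64_t prp_list[PRP_LIST_ENTRIES];  /* embedded PRP list segment */
};

/* The tracker fills exactly one page and leaves room for 506 entries. */
static_assert(sizeof(struct tracker_page) == PAGE_SIZE,
              "tracker must be exactly one 4KiB page");
static_assert(PRP_LIST_ENTRIES == 506, "506 PRP list entries expected");

/* Bytes of PRP list needed to describe a page-aligned transfer. */
static size_t prp_list_bytes(size_t xfer_bytes)
{
    return (xfer_bytes / PAGE_SIZE) * sizeof(uint64_t);
}
```

For a 1MiB transfer, prp_list_bytes() gives 2048 bytes, i.e. half a page. With
this assumed 48-byte context and a 64-byte cache line, the first two list
entries land on the same cache line as the context.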

The math on NVME_MAX_XFER_SIZE may not be obvious either. The command has one
PRP entry baked in, plus 506 in the associated array. That's a total of 507
entries. Each entry describes a 4KiB page, so why is NVME_MAX_XFER_SIZE set to
506 * PAGE_SIZE instead of 507 * PAGE_SIZE? The answer is that the data buffer
provided may not be page aligned. In the PRP format, the first element of the
list is allowed to start unaligned, as long as it ends on a page boundary.
Similarly, the last element must start on a page boundary, but may end
unaligned. Accounting for this takes one extra PRP entry in the worst case. So
if the buffer is perfectly 4KiB aligned, we could indeed support 507 * PAGE_SIZE
I/O, but since we don't require 4KiB alignment we have to subtract one page's
worth.
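
The alignment argument can be checked with a few lines of C. This sketch
counts one PRP entry per page touched by the buffer; the function name and
MAX_PRP_ENTRIES are hypothetical, and addresses are treated as plain integers.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE       4096u
#define MAX_PRP_ENTRIES 507u   /* prp1 in the command + 506 in the list */

/* Number of PRP entries needed for the buffer [addr, addr + len):
 * one entry for every page the buffer touches. */
static uint64_t prp_entries_needed(uint64_t addr, uint64_t len)
{
    assert(len > 0);
    uint64_t first_page = addr / PAGE_SIZE;
    uint64_t last_page  = (addr + len - 1) / PAGE_SIZE;
    return last_page - first_page + 1;
}
```

A page-aligned buffer of 507 * PAGE_SIZE bytes needs exactly 507 entries, but
the same length starting at offset 512 into a page needs 508, which does not
fit. A 506 * PAGE_SIZE buffer needs at most 507 entries at any alignment.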

Thanks everyone else for jumping in and providing the right answer more quickly
than I was able to.

-Ben


> --
> Lance Hartmann
> lance.hartmann(a)oracle.com
> 
> > On Aug 9, 2017, at 12:18 AM, Liu, Changpeng <changpeng.liu(a)intel.com> wrote:
> > 
> > Yes, you are right.
> > SPDK embeds the PRP list in struct nvme_tracker, and that data structure
> > is 4KiB aligned. It also holds several other fields, so only 506 entries
> > are left for the PRP list.
> > 
> > > -----Original Message-----
> > > From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Lance Hartmann
> > > ORACLE
> > > Sent: Wednesday, August 9, 2017 1:01 PM
> > > To: Storage Performance Development Kit <spdk(a)lists.01.org>
> > > Subject: Re: [SPDK] Determination of NVMe max_io_xfer_size
> > > (NVME_MAX_XFER_SIZE) ?
> > > 
> > > 
> > > Ok, but 506 * PAGE_SIZE?  Surely 506 wasn’t arbitrarily selected?  I
> > > understand that the controller’s Identify Controller structure may
> > > indicate far fewer pages supported, but if, as the comment suggests, PRP2
> > > is pointing to a list, then why reduce the number “just a few”?  I feel
> > > like I’m missing something.
> > > 
> > > Let’s say PRP1 is aligned to a memory page boundary and the length of the
> > > data transfer is more than two (2) memory pages.  PRP1 points to the first
> > > memory page of data, and PRP2 points to a memory page containing PRP
> > > entries; i.e. a PRP list.  If the memory page size is 4096 (4KB), then up
> > > to 4096 / (size of PRP pointer in bytes) = 4096 / 8 = 512 of PRP entries
> > > could be created in that page.  Thus, if I follow it correctly, with PRP1
> > > pointing to 4KB of data in the first memory page, and with PRP2 pointing
> > > to a 4KB page of PRP entries, we should be able to transfer 1 + 512 = 513
> > > memory pages, and so in this case 513 * 4096 = 2,101,248 bytes of data.
> > > And, that’s only if the implementation of the SPDK NVMe driver elects not
> > > to support the mechanism of using the last entry of the page of PRP
> > > entries to point to another page of PRP entries.
> > > 
> > > --
> > > Lance Hartmann
> > > lance.hartmann(a)oracle.com
> > > 
> > > 
> > > > On Aug 8, 2017, at 11:24 PM, Liu, Changpeng <changpeng.liu(a)intel.com>
> > > > wrote:
> > > > 
> > > > Hi Lance,
> > > > 
> > > > NVME_MAX_XFER_SIZE is the maximum data length supported by the SPDK
> > > > driver. The NVMe controller also has a field (MDTS) that reports the
> > > > hardware limit, so the driver chooses the smaller of the two as the
> > > > command limit and splits commands bigger than that.
> > > > Most Intel NVMe SSDs have a hardware limit of 128KiB, so the driver
> > > > limit of (506*4) KiB is big enough to support it.
> > > > > -----Original Message-----
> > > > > From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Lance
> > > > > Hartmann
> > > > > ORACLE
> > > > > Sent: Wednesday, August 9, 2017 11:52 AM
> > > > > To: Storage Performance Development Kit <spdk(a)lists.01.org>
> > > > > Subject: [SPDK] Determination of NVMe max_io_xfer_size
> > > > > (NVME_MAX_XFER_SIZE) ?
> > > > > 
> > > > > Hello,
> > > > > 
> > > > > I’m trying to reconcile the #define NVME_MAX_XFER_SIZE and leading
> > > > > comment:
> > > > > /*
> > > > > * For commands requiring more than 2 PRP entries, one PRP will be
> > > > > *  embedded in the command (prp1), and the rest of the PRP entries
> > > > > *  will be in a list pointed to by the command (prp2).  This means
> > > > > *  that real max number of PRP entries we support is 506+1, which
> > > > > *  results in a max xfer size of 506*PAGE_SIZE.
> > > > > */
> > > > > 
> > > > > in lib/nvme/nvme_pcie.c with my interpretation from reading the NVMe
> > > > > spec.  I’d greatly appreciate if someone could “show me the math” or
> > > > > otherwise help me to understand this.  How was
> > > > > NVME_MAX_PRP_LIST_ENTRIES (506) derived?  I don’t know if I’m lost in
> > > > > the semantics of the naming, the comment, or perhaps there’s a nuance
> > > > > in the “…we support…” part.  I would’ve guessed, otherwise, that the
> > > > > max # of PRP entries would be a function of the PAGE_SIZE.
> > > > > 
> > > > > I did see that the driver in nvme_ctrlr_identify() compares this
> > > > > derived maximum transfer size with that which the controller can
> > > > > actually support as reported in the Identify Controller structure,
> > > > > choosing the minimum of the two values, but that’s understood and
> > > > > separate from the above.
> > > > > 
> > > > > regards,
> > > > > 
> > > > > 
> > > > > --
> > > > > Lance Hartmann
> > > > > lance.hartmann(a)oracle.com
> > > > > 
> > > > > 


