* [SPDK] Determination of NVMe max_io_xfer_size (NVME_MAX_XFER_SIZE) ?
@ 2017-08-09 3:52 Lance Hartmann ORACLE
0 siblings, 0 replies; 8+ messages in thread
From: Lance Hartmann ORACLE @ 2017-08-09 3:52 UTC (permalink / raw)
To: spdk
[-- Attachment #1: Type: text/plain, Size: 1264 bytes --]
Hello,
I’m trying to reconcile the #define NVME_MAX_XFER_SIZE and leading comment:
/*
* For commands requiring more than 2 PRP entries, one PRP will be
* embedded in the command (prp1), and the rest of the PRP entries
* will be in a list pointed to by the command (prp2). This means
* that real max number of PRP entries we support is 506+1, which
* results in a max xfer size of 506*PAGE_SIZE.
*/
in lib/nvme/nvme_pcie.c with my interpretation from reading the NVMe spec. I’d greatly appreciate if someone could “show me the math” or otherwise help me to understand this. How was NVME_MAX_PRP_LIST_ENTRIES (506) derived? I don’t know if I’m lost in the semantics of the naming, the comment, or perhaps there’s a nuance in the “…we support…” part. I would’ve guessed, otherwise, that the max # of PRP entries would be a function of the PAGE_SIZE.
I did see that the driver in nvme_ctrlr_identify() compares this derived maximum transfer size with that which the controller can actually support as reported in the Identify Controller structure, choosing the minimum of the two values, but that’s understood and separate from the above.
regards,
--
Lance Hartmann
lance.hartmann(a)oracle.com
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [SPDK] Determination of NVMe max_io_xfer_size (NVME_MAX_XFER_SIZE) ?
@ 2017-08-09 4:24 Liu, Changpeng
0 siblings, 0 replies; 8+ messages in thread
From: Liu, Changpeng @ 2017-08-09 4:24 UTC (permalink / raw)
To: spdk
[-- Attachment #1: Type: text/plain, Size: 2175 bytes --]
Hi Lance,
NVME_MAX_XFER_SIZE is the maximum data length supported by the SPDK driver. The NVMe controller also has a field (MDTS)
that reports the hardware limit, so the driver takes the smaller of the two as the command limit and splits commands larger than that.
Most Intel NVMe SSDs report a hardware limit of 128KiB, so the driver limit of (506*4) KiB is big enough to cover it.
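A minimal sketch of that "choose the smaller one" step, assuming a 4KiB page size. The helper name and the SKETCH_ constants are hypothetical and are not the SPDK code; the only spec detail relied on is that MDTS is a power-of-two exponent in units of the controller's minimum page size, with 0 meaning no limit is reported.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE            4096u
#define SKETCH_MAX_PRP_LIST_ENTRIES 506u
#define SKETCH_MAX_XFER_SIZE        (SKETCH_MAX_PRP_LIST_ENTRIES * SKETCH_PAGE_SIZE)

/*
 * Hypothetical helper: return the per-command transfer limit in bytes.
 * mdts is the Identify Controller MDTS value, min_page_size the
 * controller's minimum memory page size in bytes.
 */
static uint32_t
sketch_effective_max_xfer_size(uint8_t mdts, uint32_t min_page_size)
{
	uint64_t ctrlr_limit;

	/* 0 means "no limit"; very large exponents always exceed the driver limit. */
	if (mdts == 0 || mdts >= 32) {
		return SKETCH_MAX_XFER_SIZE;
	}

	ctrlr_limit = (uint64_t)min_page_size << mdts;
	return ctrlr_limit < SKETCH_MAX_XFER_SIZE ? (uint32_t)ctrlr_limit : SKETCH_MAX_XFER_SIZE;
}
```

With a 4KiB minimum page size, a 128KiB hardware limit corresponds to MDTS = 5, so the controller value wins in that case.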
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [SPDK] Determination of NVMe max_io_xfer_size (NVME_MAX_XFER_SIZE) ?
@ 2017-08-09 5:01 Lance Hartmann ORACLE
0 siblings, 0 replies; 8+ messages in thread
From: Lance Hartmann ORACLE @ 2017-08-09 5:01 UTC (permalink / raw)
To: spdk
[-- Attachment #1: Type: text/plain, Size: 3691 bytes --]
Ok, but 506 * PAGE_SIZE? Surely 506 wasn’t arbitrarily selected? I understand that the controller’s Identify Controller structure may indicate far fewer pages supported, but if, as the comment suggests, PRP2 is pointing to a list, then why reduce the number “just a few”? I feel like I’m missing something.
Let’s say PRP1 is aligned to a memory page boundary and the length of the data transfer is more than two (2) memory pages. PRP1 points to the first memory page of data, and PRP2 points to a memory page containing PRP entries; i.e. a PRP list. If the memory page size is 4096 (4KB), then up to 4096 / (size of PRP pointer in bytes) = 4096 / 8 = 512 of PRP entries could be created in that page. Thus, if I follow it correctly, with PRP1 pointing to 4KB of data in the first memory page, and with PRP2 pointing to a 4KB page of PRP entries, we should be able to transfer 1 + 512 = 513 memory pages, and so in this case 513 * 4096 = 2,101,248 bytes of data. And, that’s only if the implementation of the SPDK NVMe driver elects not to support the mechanism of using the last entry of the page of PRP entries to point to another page of PRP entries.
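The arithmetic in the paragraph above can be checked with a short C sketch. It assumes a 4096-byte memory page and 8-byte PRP entries, and it does not use the chaining mechanism; the names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

enum {
	SKETCH_PAGE_SIZE      = 4096, /* assumed memory page size in bytes */
	SKETCH_PRP_ENTRY_SIZE = 8     /* one PRP entry is a 64-bit address */
};

/* Number of PRP entries that fit in one page used as a PRP list. */
static uint32_t
sketch_prp_entries_per_page(void)
{
	return SKETCH_PAGE_SIZE / SKETCH_PRP_ENTRY_SIZE;
}

/* PRP1 covers one data page; PRP2 points at one full page of entries. */
static uint64_t
sketch_max_xfer_one_list_page(void)
{
	return (uint64_t)(1 + sketch_prp_entries_per_page()) * SKETCH_PAGE_SIZE;
}
```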
--
Lance Hartmann
lance.hartmann(a)oracle.com
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [SPDK] Determination of NVMe max_io_xfer_size (NVME_MAX_XFER_SIZE) ?
@ 2017-08-09 5:18 Liu, Changpeng
0 siblings, 0 replies; 8+ messages in thread
From: Liu, Changpeng @ 2017-08-09 5:18 UTC (permalink / raw)
To: spdk
[-- Attachment #1: Type: text/plain, Size: 4528 bytes --]
Yes, you are right.
SPDK embeds the PRP list in struct nvme_tracker, and that data structure is 4KiB aligned.
It also holds several other fields, so only 506 entries are left for the PRP list.
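As a rough illustration of where 506 comes from, here is a sketch of a 4KiB structure with the PRP list embedded. The layout is hypothetical: the 48 bytes of leading context are an assumption chosen so the numbers work out (4096 - 506 * 8 = 48), and the field names are not the real struct nvme_tracker members.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_PRP_LIST_ENTRIES 506

/* Hypothetical stand-in for a tracker: context first, PRP list after it. */
struct sketch_tracker {
	uint64_t ctx[6];                           /* 48 bytes of per-command context (assumed) */
	uint64_t prp[SKETCH_MAX_PRP_LIST_ENTRIES]; /* 506 * 8 = 4048 bytes of PRP entries */
};

/* The whole structure occupies exactly one 4KiB page. */
static_assert(sizeof(struct sketch_tracker) == 4096, "tracker must be exactly 4KiB");

/* How many 8-byte PRP entries remain once ctx_bytes of the page are taken. */
static size_t
sketch_prp_entries_left(size_t ctx_bytes)
{
	return (4096 - ctx_bytes) / sizeof(uint64_t);
}
```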
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [SPDK] Determination of NVMe max_io_xfer_size (NVME_MAX_XFER_SIZE) ?
@ 2017-08-09 5:58 Crane Chu
0 siblings, 0 replies; 8+ messages in thread
From: Crane Chu @ 2017-08-09 5:58 UTC (permalink / raw)
To: spdk
[-- Attachment #1: Type: text/plain, Size: 5467 bytes --]
Here is the comment in nvme_pcie_qpair_construct():
/*
* Reserve space for all of the trackers in a single allocation.
* struct nvme_tracker must be padded so that its size is already a power of 2.
* This ensures the PRP list embedded in the nvme_tracker object will not span a
* 4KB boundary, while allowing access to trackers in tr[] via normal array indexing.
*/
It's a beautiful design in SPDK. :)
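A small sketch of why that works, under two assumptions that are mine rather than from the comment: the tr[] array itself starts on a 4KiB boundary, and each 4096-byte tracker keeps 48 bytes of context ahead of a 506-entry PRP list. The helper names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_TRACKER_SIZE 4096u        /* power of 2, as the comment requires */
#define SKETCH_PRP_OFFSET   48u          /* assumed size of the leading context */
#define SKETCH_PRP_BYTES    (506u * 8u)  /* embedded PRP list, 4048 bytes */

/* True if the byte range [start, start + len) touches more than one 4KiB page. */
static bool
sketch_crosses_4k_boundary(uint64_t start, uint64_t len)
{
	return (start / 4096u) != ((start + len - 1) / 4096u);
}

/* Address of the PRP list inside tr[i], found by plain array indexing. */
static uint64_t
sketch_prp_list_addr(uint64_t tr_base, uint32_t i)
{
	return tr_base + (uint64_t)i * SKETCH_TRACKER_SIZE + SKETCH_PRP_OFFSET;
}

/* Check every tracker in an array of num_trackers starting at tr_base. */
static bool
sketch_any_prp_list_crosses(uint64_t tr_base, uint32_t num_trackers)
{
	for (uint32_t i = 0; i < num_trackers; i++) {
		if (sketch_crosses_4k_boundary(sketch_prp_list_addr(tr_base, i), SKETCH_PRP_BYTES)) {
			return true;
		}
	}
	return false;
}
```

With an aligned base no list crosses a boundary; shifting the base by even 64 bytes makes every list spill into the next page, which is why the alignment of the allocation matters as much as the padded size.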
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [SPDK] Determination of NVMe max_io_xfer_size (NVME_MAX_XFER_SIZE) ?
@ 2017-08-09 15:06 Lance Hartmann ORACLE
0 siblings, 0 replies; 8+ messages in thread
From: Lance Hartmann ORACLE @ 2017-08-09 15:06 UTC (permalink / raw)
To: spdk
[-- Attachment #1: Type: text/plain, Size: 5005 bytes --]
Ah, eureka! Thank you! 506 was so unique, I just knew I was overlooking something. *whew* I feel so much better now! ;-)
--
Lance Hartmann
lance.hartmann(a)oracle.com
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [SPDK] Determination of NVMe max_io_xfer_size (NVME_MAX_XFER_SIZE) ?
@ 2017-08-09 16:28 Luse, Paul E
0 siblings, 0 replies; 8+ messages in thread
From: Luse, Paul E @ 2017-08-09 16:28 UTC (permalink / raw)
To: spdk
[-- Attachment #1: Type: text/plain, Size: 5804 bytes --]
Hi Lance,
Have you tried using IRC for engaging the SPDK community? We’re on freenode at #spdk and it’s a great way to get real-time conversations happening. Nothing wrong with the dist list either of course.
Also, feel free to submit a patch with suggested wording changes to the comment. If you’re not set up to submit patches, you can read how on spdk.io under the developer tab; submitting one with a comment change would be a great and easy way to learn the process and prepare yourself for when you have something else you want to submit down the road…
Thx
Paul
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [SPDK] Determination of NVMe max_io_xfer_size (NVME_MAX_XFER_SIZE) ?
@ 2017-08-09 19:04 Walker, Benjamin
0 siblings, 0 replies; 8+ messages in thread
From: Walker, Benjamin @ 2017-08-09 19:04 UTC (permalink / raw)
To: spdk
[-- Attachment #1: Type: text/plain, Size: 8998 bytes --]
On Wed, 2017-08-09 at 10:06 -0500, Lance Hartmann ORACLE wrote:
>
> Ah, eureka! Thank you! 506 was so unique, I just knew I was overlooking
> something. *whew* I feel so much better now! ;-)
To expand a bit more on our choices in this area, SPDK defines two key
structures - requests and trackers. Trackers correspond 1:1 with entries in the
NVMe submission queues. Requests represent requests from the user to perform
I/O. There may be more requests outstanding than available trackers, in which
case we'll do software queueing until a tracker becomes available. We also split
requests for a huge number of reasons (large I/O, issues with sgl to prp
translation, etc.) to make the API more convenient to use, so requests often
have many children.
In NVMe, when a command is submitted the user gets to provide a command ID
(cid), which is a 16 bit number that will be provided in the completion entry.
This is how software is supposed to match completions up with submitted
commands. We elected to allocate the trackers for each NVMe submission queue as
an array and set the cid to the index of the tracker in this array. The tracker
holds our context for the command, including a pointer back to the request.
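The cid-as-array-index idea can be sketched roughly as follows. All of the types and function names here are hypothetical simplifications, not the SPDK definitions, and the queue depth of 128 is an arbitrary choice for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_NUM_TRACKERS 128 /* arbitrary queue depth for the example */

struct sketch_request {
	int id; /* stands in for the user's I/O request */
};

struct sketch_tracker {
	uint16_t cid;               /* equals this tracker's index in tr[] */
	struct sketch_request *req; /* pointer back to the request */
};

struct sketch_qpair {
	struct sketch_tracker tr[SKETCH_NUM_TRACKERS];
};

/* Give every tracker a cid equal to its array index. */
static void
sketch_qpair_init(struct sketch_qpair *q)
{
	for (uint16_t i = 0; i < SKETCH_NUM_TRACKERS; i++) {
		q->tr[i].cid = i;
		q->tr[i].req = NULL;
	}
}

/* On completion, the cid from the completion entry indexes straight into tr[]. */
static struct sketch_tracker *
sketch_tracker_from_cid(struct sketch_qpair *q, uint16_t cid)
{
	return cid < SKETCH_NUM_TRACKERS ? &q->tr[cid] : NULL;
}
```

Matching a completion to its command is then a bounds check and an array access, with no search or hash lookup.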
When a command requires more than two PRP entries, it must provide a separate
PRP list (an array of 64 bit integers describing physical memory pages). Each
segment of that PRP list must fit within a physical page of memory, but segments
can be chained together to make very long arrays. We could have allocated these
PRP list segments separately from the trackers, in their own pool. Then, if we
needed a PRP list for a command, we could have grabbed one from the pool and
made the tracker point at it. This design has a few advantages in that if we
needed to chain together PRP list segments for a very, very large I/O we could
do that by grabbing multiple PRP list segments from the pool and assigning them
to one tracker, and then free them back to the pool when the associated command
completed.
However, we elected to do something a bit simpler that we hope is also faster.
One observation is that each PRP entry describes 4KiB of memory and real SSDs
limit the maximum transfer size to somewhere between 64KiB and 1MiB. 1MiB is
only 256 PRP entries, which is 2KiB of PRP list. There are good reasons not to
send I/O larger than that (QoS considerations), even if the device supports it,
so it's fairly clear that in practice no more than one PRP list segment is ever
needed. In fact, only half of a physical page is probably required. Therefore,
we decided to make our trackers exactly one physical 4KiB page, where the
context for the command is stored at the front and the remainder is the single
PRP list segment. This places a maximum on the I/O size SPDK can support
(although at 506 * 4KiB it's nearly 2MiB and won't ever be the limiting factor
in practice), but it allows us to avoid pointer chasing through the tracker to
the PRP list
and keep the first portion of the PRP list on the same cache line as the rest of
the tracker. A lot of our choices in the code come down to jamming as many
things into the fewest number of cache lines as possible, and this is just an
example of that.
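The sizing argument above can be checked with a small sketch in C. The layout
below is an assumption for illustration only: the 48-byte context size is
inferred from 4096 - 506 * 8, the field and macro names are hypothetical, and
the 64-byte cache line is an assumption about typical hardware. It is not the
actual struct nvme_tracker.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE        4096u  /* one physical 4KiB page */
#define SKETCH_PRP_ENTRY_SIZE   8u     /* each PRP entry is a 64-bit address */
#define SKETCH_MAX_PRP_ENTRIES  506u   /* entries that fit after the context */
#define SKETCH_CACHE_LINE       64u    /* assumed cache line size */

/* Hypothetical tracker: command context up front, PRP list in the remainder. */
struct sketch_tracker {
    uint8_t  ctx[48];                       /* placeholder for command context */
    uint64_t prp[SKETCH_MAX_PRP_ENTRIES];   /* the single PRP list segment */
};

/* The tracker is exactly one page: 48 + 506 * 8 = 4096 bytes. */
_Static_assert(sizeof(struct sketch_tracker) == SKETCH_PAGE_SIZE,
               "tracker must be exactly one 4KiB page");

/* PRP entries needed to describe a page-aligned transfer of xfer_bytes. */
static uint32_t prp_entries_for(uint32_t xfer_bytes)
{
    return (xfer_bytes + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE;
}

/* Bytes of PRP list consumed by that many entries. */
static uint32_t prp_list_bytes(uint32_t xfer_bytes)
{
    return prp_entries_for(xfer_bytes) * SKETCH_PRP_ENTRY_SIZE;
}

/* How many PRP entries share the first cache line with the context. */
static uint32_t prp_entries_in_first_cache_line(void)
{
    return (uint32_t)((SKETCH_CACHE_LINE - offsetof(struct sketch_tracker, prp))
                      / SKETCH_PRP_ENTRY_SIZE);
}
```

Under these assumptions a 1MiB transfer needs 256 entries (2KiB of list), and
the first two PRP entries sit on the same cache line as the context.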
The math on NVME_MAX_XFER_SIZE may not be obvious either. The command has one
PRP entry baked in, plus 506 in the associated array. That's a total of 507
entries. Each entry describes a 4KiB page, so why is NVME_MAX_XFER_SIZE set to
506 * PAGE_SIZE instead of 507 * PAGE_SIZE? The answer is that the data buffer
provided may not be page aligned. In the PRP format, the first entry is
allowed to start unaligned, as long as it ends on a page boundary.
Similarly, the last entry must start on a page boundary, but may end
unaligned. An unaligned buffer therefore touches one more page than its length
alone would suggest, which takes one extra PRP entry in the worst case. So
if the buffer is perfectly 4KiB aligned, we could indeed support 507 * PAGE_SIZE
I/O, but since we don't require 4KiB alignment we have to subtract one page
worth.
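The alignment argument can be made concrete with a short C sketch that counts
how many 4KiB pages, and therefore PRP entries, a buffer touches. The function
and macro names are hypothetical and the addresses are just example values;
this is an illustration of the arithmetic, not code from the driver.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE   4096u
#define SKETCH_MAX_ENTRIES 507u   /* prp1 in the command + 506 in the list */

/*
 * Number of PRP entries needed for a buffer starting at addr with len bytes:
 * one entry per distinct 4KiB page the buffer touches.
 */
static uint64_t prp_entries_needed(uint64_t addr, uint64_t len)
{
    assert(len > 0);
    uint64_t first_page = addr / SKETCH_PAGE_SIZE;
    uint64_t last_page = (addr + len - 1) / SKETCH_PAGE_SIZE;
    return last_page - first_page + 1;
}

/* Nonzero if the buffer can be described with the entries available. */
static int fits_in_tracker(uint64_t addr, uint64_t len)
{
    return prp_entries_needed(addr, len) <= SKETCH_MAX_ENTRIES;
}
```

For example, a 506 * 4096 byte buffer starting 512 bytes into a page touches
507 pages and still fits, while a 507 * 4096 byte buffer at the same offset
touches 508 pages and does not. That is why the limit is 506 * PAGE_SIZE
rather than 507 * PAGE_SIZE.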
Thanks everyone else for jumping in and providing the right answer more quickly
than I was able to.
-Ben
> --
> Lance Hartmann
> lance.hartmann(a)oracle.com
>
> > On Aug 9, 2017, at 12:18 AM, Liu, Changpeng <changpeng.liu(a)intel.com> wrote:
> >
> > Yes, you are right.
> > SPDK embedded PRP list into the struct nvme_tracker, and the data structure
> > is 4KiB aligned,
> > And also several other fields, so only 506 entries left for PRP lists.
> >
> > > -----Original Message-----
> > > From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Lance Hartmann
> > > ORACLE
> > > Sent: Wednesday, August 9, 2017 1:01 PM
> > > To: Storage Performance Development Kit <spdk(a)lists.01.org>
> > > Subject: Re: [SPDK] Determination of NVMe max_io_xfer_size
> > > (NVME_MAX_XFER_SIZE) ?
> > >
> > >
> > > Ok, but 506 * PAGE_SIZE? Surely 506 wasn’t arbitrarily selected? I
> > > understand
> > > that the controller’s Identify Controller structure may indicate far fewer
> > > pages
> > > supported, but if, as the comment suggests, PRP2 is pointing to a list,
> > > then why
> > > reduce the number “just a few”? I feel like I’m missing something.
> > >
> > > Let’s say PRP1 is aligned to a memory page boundary and the length of the
> > > data
> > > transfer is more than two (2) memory pages. PRP1 points to the first
> > > memory
> > > page of data, and PRP2 points to a memory page containing PRP entries;
> > > i.e. a
> > > PRP list. If the memory page size is 4096 (4KB), then up to 4096 / (size
> > > of PRP
> > > pointer in bytes) = 4096 / 8 = 512 of PRP entries could be created in that
> > > page.
> > > Thus, if I follow it correctly, with PRP1 pointing to 4KB of data in the
> > > first memory
> > > page, and with PRP2 pointing to a 4KB page of PRP entries, we should be
> > > able to
> > > transfer 1 + 512 = 513 memory pages, and so in this case 513 * 4096 =
> > > 2,101,248
> > > bytes of data. And, that’s only if the implementation of the SPDK NVMe
> > > driver
> > > elects not to support the mechanism of using the last entry of the page of
> > > PRP
> > > entries to point to another page of PRP entries.
> > >
> > > --
> > > Lance Hartmann
> > > lance.hartmann(a)oracle.com
> > >
> > >
> > > > On Aug 8, 2017, at 11:24 PM, Liu, Changpeng <changpeng.liu(a)intel.com>
> > > > wrote:
> > > >
> > > > Hi Lance,
> > > >
> > > > NVME_MAX_XFER_SIZE is the maximum data length supported by SPDK driver,
> > > of course the NVMe controller has a field(MDTS)
> > > > to show the limit from hardware, so choose the smaller one as the
> > > > command
> > > limit to split commands bigger than this number.
> > > > Most of Intel NVMe SSDs has a hardware value 128KiB, so the driver
> > > > limit with
> > > (506*4) KiB is big enough to support it.
> > > > > -----Original Message-----
> > > > > From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Lance
> > > > > Hartmann
> > > > > ORACLE
> > > > > Sent: Wednesday, August 9, 2017 11:52 AM
> > > > > To: Storage Performance Development Kit <spdk(a)lists.01.org>
> > > > > Subject: [SPDK] Determination of NVMe max_io_xfer_size
> > > > > (NVME_MAX_XFER_SIZE) ?
> > > > >
> > > > > Hello,
> > > > >
> > > > > I’m trying to reconcile the #define NVME_MAX_XFER_SIZE and leading
> > > comment:
> > > > > /*
> > > > > * For commands requiring more than 2 PRP entries, one PRP will be
> > > > > * embedded in the command (prp1), and the rest of the PRP entries
> > > > > * will be in a list pointed to by the command (prp2). This means
> > > > > * that real max number of PRP entries we support is 506+1, which
> > > > > * results in a max xfer size of 506*PAGE_SIZE.
> > > > > */
> > > > >
> > > > > in lib/nvme/nvme_pcie.c with my interpretation from reading the NVMe
> > > > > spec.
> > > > > I’d greatly appreciate if someone could “show me the math” or
> > > > > otherwise
> > > help
> > > > > me to understand this. How was NVME_MAX_PRP_LIST_ENTRIES (506)
> > > derived?
> > > > > I don’t know if I’m lost in the semantics of the naming, the comment,
> > > > > or
> > > perhaps
> > > > > there’s a nuance in the “…we support…” part. I would’ve guessed,
> > > > > otherwise,
> > > > > that the max # of PRP entries would be a function of the PAGE_SIZE.
> > > > >
> > > > > I did see that the driver in nvme_ctrlr_identify() compares this
> > > > > derived
> > > maximum
> > > > > transfer size with that which the controller can actually support as
> > > > > reported in
> > > > > the Identify Controller structure, choosing the minimum of the two
> > > > > values,
> > > but
> > > > > that’s understood and separate from the above.
> > > > >
> > > > > regards,
> > > > >
> > > > >
> > > > > --
> > > > > Lance Hartmann
> > > > > lance.hartmann(a)oracle.com
> > > > >
> > > > >
end of thread, other threads:[~2017-08-09 19:04 UTC | newest]
Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2017-08-09 19:04 [SPDK] Determination of NVMe max_io_xfer_size (NVME_MAX_XFER_SIZE) ? Walker, Benjamin
-- strict thread matches above, loose matches on Subject: below --
2017-08-09 16:28 Luse, Paul E
2017-08-09 15:06 Lance Hartmann ORACLE
2017-08-09 5:58 Crane Chu
2017-08-09 5:18 Liu, Changpeng
2017-08-09 5:01 Lance Hartmann ORACLE
2017-08-09 4:24 Liu, Changpeng
2017-08-09 3:52 Lance Hartmann ORACLE