From: Walker, Benjamin
Subject: Re: [SPDK] Determination of NVMe max_io_xfer_size (NVME_MAX_XFER_SIZE) ?
Date: Wed, 09 Aug 2017 19:04:48 +0000
Message-ID: <1502305486.2934.6.camel@intel.com>
In-Reply-To: 703D359C-FC0A-4692-A049-7F387B27CCC5@oracle.com
To: spdk@lists.01.org

On Wed, 2017-08-09 at 10:06 -0500, Lance Hartmann ORACLE wrote:
>
> Ah, eureka! Thank you! 506 was so unique, I just knew I was overlooking
> something. *whew* I feel so much better now! ;-)

To expand a bit more on our choices in this area, SPDK defines two key
structures - requests and trackers. Trackers correspond 1:1 with entries in
the NVMe submission queues. Requests represent requests from the user to
perform I/O. There may be more requests outstanding than available trackers,
in which case we'll do software queueing until a tracker becomes available.
We also split requests for a huge number of reasons (large I/O, issues with
sgl to prp translation, etc.) to make the API more convenient to use, so
requests often have many children.

In NVMe, when a command is submitted the user gets to provide a command ID
(cid), which is a 16 bit number that will be provided in the completion
entry. This is how software is supposed to match completions up with
submitted commands. We elected to allocate the trackers for each NVMe
submission queue as an array and set the cid to the index of the tracker in
this array. The tracker holds our context for the command, including a
pointer back to the request.

When a command requires more than two PRP entries, it must provide a
separate PRP list (an array of 64 bit integers describing physical memory
pages).
Each segment of that PRP list must fit within a physical page of memory, but
segments can be chained together to make very long arrays. We could have
allocated these PRP list segments separately from the trackers, in their own
pool. Then, if we needed a PRP list for a command, we could have grabbed one
from the pool and made the tracker point at it. This design has a few
advantages in that if we needed to chain together PRP list segments for a
very, very large I/O we could do that by grabbing multiple PRP list segments
from the pool and assigning them to one tracker, and then free them back to
the pool when the associated command completed.

However, we elected to do something a bit simpler that we hope is also
faster. One observation is that each PRP entry describes 4KiB of memory and
real SSDs limit the maximum transfer size to somewhere between 64KiB and
1MiB. 1MiB is only 256 PRP entries, which is 2KiB of PRP list. There are
good reasons not to send I/O larger than that (QoS considerations), even if
the device supports it, so it's fairly clear that in practice no more than
one PRP list segment is ever needed. In fact, only half of a physical page
is probably required. Therefore, we decided to make our trackers exactly one
physical 4KiB page, where the context for the command is stored at the front
and the remainder is the single PRP list segment. This places a maximum on
the I/O size SPDK can support (although it's nearly 2 MiB and won't ever be
the limiting factor in practice), but it allows us to avoid pointer chasing
through the tracker to the PRP list and keep the first portion of the PRP
list on the same cache line as the rest of the tracker. A lot of our choices
in the code come down to jamming as many things into the fewest number of
cache lines as possible, and this is just an example of that.

The math on NVME_MAX_XFER_SIZE may not be obvious either.
The command has one PRP entry baked in, plus 506 in the associated array.
That's a total of 507 entries. Each entry describes a 4KiB page, so why is
NVME_MAX_XFER_SIZE set to 506 * PAGE_SIZE instead of 507 * PAGE_SIZE? The
answer is that the data buffer provided may not be page aligned. In the PRP
format, the first element of the list is allowed to start unaligned, as long
as it ends on a page boundary. Similarly, the last element must start on a
page boundary, but may end unaligned. Accounting for this takes one extra
PRP entry in the worst case. So if the buffer is perfectly 4KiB aligned, we
could indeed support 507 * PAGE_SIZE I/O, but since we don't require 4KiB
alignment we have to subtract one page's worth.

Thanks everyone else for jumping in and providing the right answer more
quickly than I was able to.

-Ben

> --
> Lance Hartmann
> lance.hartmann(a)oracle.com
>
> > On Aug 9, 2017, at 12:18 AM, Liu, Changpeng wrote:
> >
> > Yes, you are right.
> > SPDK embedded PRP list into the struct nvme_tracker, and the data
> > structure is 4KiB aligned,
> > And also several other fields, so only 506 entries left for PRP lists.
> >
> > > -----Original Message-----
> > > From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Lance
> > > Hartmann ORACLE
> > > Sent: Wednesday, August 9, 2017 1:01 PM
> > > To: Storage Performance Development Kit
> > > Subject: Re: [SPDK] Determination of NVMe max_io_xfer_size
> > > (NVME_MAX_XFER_SIZE) ?
> > >
> > >
> > > Ok, but 506 * PAGE_SIZE? Surely 506 wasn’t arbitrarily selected? I
> > > understand that the controller’s Identify Controller structure may
> > > indicate far fewer pages supported, but if, as the comment suggests,
> > > PRP2 is pointing to a list, then why reduce the number “just a few”?
> > > I feel like I’m missing something.
> > >
> > > Let’s say PRP1 is aligned to a memory page boundary and the length of
> > > the data transfer is more than two (2) memory pages. PRP1 points to
> > > the first memory page of data, and PRP2 points to a memory page
> > > containing PRP entries; i.e. a PRP list. If the memory page size is
> > > 4096 (4KB), then up to 4096 / (size of PRP pointer in bytes) =
> > > 4096 / 8 = 512 of PRP entries could be created in that page. Thus, if
> > > I follow it correctly, with PRP1 pointing to 4KB of data in the first
> > > memory page, and with PRP2 pointing to a 4KB page of PRP entries, we
> > > should be able to transfer 1 + 512 = 513 memory pages, and so in this
> > > case 513 * 4096 = 2,101,248 bytes of data. And, that’s only if the
> > > implementation of the SPDK NVMe driver elects not to support the
> > > mechanism of using the last entry of the page of PRP entries to point
> > > to another page of PRP entries.
> > >
> > > --
> > > Lance Hartmann
> > > lance.hartmann(a)oracle.com
> > >
> > >
> > > > On Aug 8, 2017, at 11:24 PM, Liu, Changpeng wrote:
> > > >
> > > > Hi Lance,
> > > >
> > > > NVME_MAX_XFER_SIZE is the maximum data length supported by SPDK
> > > > driver, of course the NVMe controller has a field(MDTS) to show the
> > > > limit from hardware, so choose the smaller one as the command limit
> > > > to split commands bigger than this number.
> > > > Most of Intel NVMe SSDs has a hardware value 128KiB, so the driver
> > > > limit with (506*4) KiB is big enough to support it.
> > > > > -----Original Message-----
> > > > > From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of
> > > > > Lance Hartmann ORACLE
> > > > > Sent: Wednesday, August 9, 2017 11:52 AM
> > > > > To: Storage Performance Development Kit
> > > > > Subject: [SPDK] Determination of NVMe max_io_xfer_size
> > > > > (NVME_MAX_XFER_SIZE) ?
> > > > >
> > > > > Hello,
> > > > >
> > > > > I’m trying to reconcile the #define NVME_MAX_XFER_SIZE and leading
> > > > > comment:
> > > > > /*
> > > > >  * For commands requiring more than 2 PRP entries, one PRP will be
> > > > >  * embedded in the command (prp1), and the rest of the PRP entries
> > > > >  * will be in a list pointed to by the command (prp2). This means
> > > > >  * that real max number of PRP entries we support is 506+1, which
> > > > >  * results in a max xfer size of 506*PAGE_SIZE.
> > > > >  */
> > > > >
> > > > > in lib/nvme/nvme_pcie.c with my interpretation from reading the
> > > > > NVMe spec. I’d greatly appreciate if someone could “show me the
> > > > > math” or otherwise help me to understand this. How was
> > > > > NVME_MAX_PRP_LIST_ENTRIES (506) derived? I don’t know if I’m lost
> > > > > in the semantics of the naming, the comment, or perhaps there’s a
> > > > > nuance in the “…we support…” part. I would’ve guessed, otherwise,
> > > > > that the max # of PRP entries would be a function of the
> > > > > PAGE_SIZE.
> > > > >
> > > > > I did see that the driver in nvme_ctrlr_identify() compares this
> > > > > derived maximum transfer size with that which the controller can
> > > > > actually support as reported in the Identify Controller structure,
> > > > > choosing the minimum of the two values, but that’s understood and
> > > > > separate from the above.
> > > > >
> > > > > regards,
> > > > >
> > > > >
> > > > > --
> > > > > Lance Hartmann
> > > > > lance.hartmann(a)oracle.com
> > > > >
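
The arithmetic discussed in this thread can be sketched in C. This is only
an illustration under stated assumptions, not SPDK's actual definitions: the
SKETCH_* macros, struct sketch_tracker, prp_entries_needed() and
fits_in_tracker() are hypothetical names, a 4KiB page is assumed, and the
48-byte context size is simply what is left over when 506 eight-byte PRP
entries are placed in one 4KiB page.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values mirroring the numbers quoted in the thread. */
#define SKETCH_PAGE_SIZE            4096u
#define SKETCH_MAX_PRP_LIST_ENTRIES 506u
#define SKETCH_MAX_XFER_SIZE \
    (SKETCH_MAX_PRP_LIST_ENTRIES * SKETCH_PAGE_SIZE)

/* Bytes left for per-command context once the PRP list is carved out. */
#define SKETCH_CONTEXT_BYTES \
    (SKETCH_PAGE_SIZE - SKETCH_MAX_PRP_LIST_ENTRIES * sizeof(uint64_t))

/*
 * Hypothetical one-page tracker: command context at the front, a single
 * PRP list segment filling the remainder. The real struct has named
 * fields; an opaque byte array stands in for them here.
 */
struct sketch_tracker {
    uint8_t  context[SKETCH_CONTEXT_BYTES];
    uint64_t prp_list[SKETCH_MAX_PRP_LIST_ENTRIES];
};

_Static_assert(sizeof(struct sketch_tracker) == SKETCH_PAGE_SIZE,
               "tracker must occupy exactly one page");

/*
 * Number of PRP entries (prp1 included) needed to describe len bytes
 * starting at addr, for len > 0. An unaligned start makes the first
 * entry cover only part of a page, which can cost one extra entry.
 */
static uint32_t prp_entries_needed(uint64_t addr, uint32_t len)
{
    uint64_t offset = addr % SKETCH_PAGE_SIZE;

    return (uint32_t)((offset + len + SKETCH_PAGE_SIZE - 1) /
                      SKETCH_PAGE_SIZE);
}

/* True if the transfer fits in prp1 plus the embedded 506-entry list. */
static int fits_in_tracker(uint64_t addr, uint32_t len)
{
    return prp_entries_needed(addr, len) <=
           1u + SKETCH_MAX_PRP_LIST_ENTRIES;
}
```

With these definitions, a transfer of SKETCH_MAX_XFER_SIZE bytes starting at
the page-aligned address 0x1000 needs 506 entries, and the same length
starting at 0x1200 needs 507, which still fits in prp1 plus the list. A
507-page transfer fits only when the buffer is page aligned; starting at
0x1200 it would need 508 entries, which is why the limit is set one page
lower.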