From mboxrd@z Thu Jan 1 00:00:00 1970
Content-Type: multipart/mixed; boundary="===============9201772281321219670=="
MIME-Version: 1.0
From: Walker, Benjamin
Subject: Re: [SPDK] bdev/NVMe pass-though command
Date: Wed, 22 Mar 2017 17:42:39 +0000
Message-ID: <1490204558.20464.11.camel@intel.com>
In-Reply-To: 622F4407872BA447A16110F65453358C05E87152548C@FMSAMAIL.fmsa.local
List-ID:
To: spdk@lists.01.org

--===============9201772281321219670==
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable

On Wed, 2017-03-22 at 05:51 -0700, Paul Von-Stamwitz wrote:
>
> We had an off-line discussion on implementing a NVMe pass-through
> command at the bdev level, and I thought to include the community in
> the discussion. Our primary use case is for the retrieval of
> SMART/Health information via the Get Log page, but it could be used
> for other purposes.
>
> How do you envision this?
> Should the upper layer send down a raw NVME command which gets passed
> down to blockdev_nvme and is handled similarly to nvmef/direct.c?
> Since multiple bdev contexts can share the same admin queue pair,
> should we limit which context is allowed to use the pass-through?
> Technically we could have an I/O pass-through, but I think we should
> limit it to admin commands.
> Should we put checks on what is allowed (i.e. read-only commands) or
> let anything go through?

We've spent some time thinking about this so let me lay out how I think
this should work. The bdev layer today supports 5 commands -
read/write/unmap/flush/reset. We're (collectively) proposing to add a
6th - NVMe passthru. The command would consist of a 64 byte NVMe
command, an optional pointer to some data, and the length of that data.
This part is simple. The tricky part, as you note, is that there are
really two categories of NVMe commands - I/O and Admin - and they need
to be submitted on different types of NVMe queue pairs.
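As a rough illustration of the request shape described above (a 64 byte
command, an optional data pointer, and a length), here is a minimal
sketch. Every name in it is hypothetical and chosen for this example
only; SPDK's own command type is struct spdk_nvme_cmd, and the stand-in
below names just the first two dwords and treats the rest as opaque.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a 64-byte NVMe submission queue entry
 * (16 dwords). Only dwords 0 and 1 are broken out into fields. */
struct nvme_cmd_sketch {
    uint8_t  opc;       /* opcode, low byte of dword 0 */
    uint8_t  flags;     /* fused-operation and PSDT bits */
    uint16_t cid;       /* command identifier */
    uint32_t nsid;      /* namespace identifier, dword 1 */
    uint32_t rest[14];  /* dwords 2..15, opaque in this sketch */
};
static_assert(sizeof(struct nvme_cmd_sketch) == 64,
              "an NVMe command is 64 bytes");

/* Hypothetical passthru request: command, optional buffer, length. */
struct passthru_req_sketch {
    struct nvme_cmd_sketch cmd;
    void   *buf;     /* NULL when the command carries no data */
    size_t  nbytes;  /* length of buf in bytes, 0 when buf is NULL */
};

/* Fill in a request. Returns 0 on success, or -1 if an argument is
 * missing or the buffer and length disagree (one set, the other not). */
static int
passthru_req_init(struct passthru_req_sketch *req,
                  const struct nvme_cmd_sketch *cmd,
                  void *buf, size_t nbytes)
{
    if (req == NULL || cmd == NULL) {
        return -1;
    }
    if ((buf == NULL) != (nbytes == 0)) {
        return -1;
    }
    memcpy(&req->cmd, cmd, sizeof(req->cmd));
    req->buf = buf;
    req->nbytes = nbytes;
    return 0;
}
```

The consistency check on buf and nbytes is an assumption of this sketch,
not something the proposal specifies.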
The API the user sees at the bdev level exposes an spdk_io_channel
object on which the user may submit commands, but that channel is not
typed. In reality, it is a thin wrapper around an NVMe I/O queue pair,
so it cannot be used for Admin commands. This has worked well until
now, because the user never needed to submit any operation that
resulted in an Admin command from the bdev layer.

The easiest way to implement NVMe passthru would be to only allow I/O
commands, but that isn't particularly interesting. All of the commands
that we envision people would want to send are Admin commands. The SPDK
NVMe driver already protects the one global Admin queue pair per
controller using a lock, so it's safe to submit Admin commands from
multiple threads. On the submission side in the bdev layer, then, we
can look at the NVMe command being passed in, decide if it is Admin or
I/O, and route it to the associated NVMe I/O queue pair or the global
Admin queue pair. That part will work out fine.

The challenge is on the completion side. The spdk_io_channel object is
tied to a thread, so that means each NVMe I/O queue pair is also tied
to a thread. When the user submits a command on a channel, they provide
a callback that will be called when the command completes. The bdev
layer guarantees that the callback will be called on the thread that
the command was submitted on (i.e. the one associated with the
channel). Today, since all the commands go through I/O queue pairs, we
set up a poller per channel (on the thread it is associated with) that
polls the underlying NVMe I/O queue pair. If we were to instead route
some commands to the global Admin queue pair, we would run into the
case where that Admin queue pair is polled by a different thread,
causing the completion callback to execute on a different thread. This
would then require users of the bdev layer to coordinate with locks,
which is unacceptable.
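The guarantee described above (callbacks run on the submitting thread)
implies that a completion reaped by some other thread has to be handed
back to the channel's own poller before the user callback runs. Below
is a minimal sketch of such a hand-off, assuming a lock-free list built
on C11 atomics. All names are hypothetical, this is not SPDK code, and
a real implementation would more likely use whatever ring or message
primitive the framework already provides.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

typedef void (*completion_cb)(void *cb_arg, int status);

/* One finished command waiting for its callback to be run. */
struct completion_msg {
    struct completion_msg *next;
    completion_cb cb;
    void *cb_arg;
    int status;
};

/* Hypothetical per-channel state: a list that any thread may push to
 * and that only the channel's own thread drains. */
struct channel_sketch {
    _Atomic(struct completion_msg *) pending;
};

static void
channel_init(struct channel_sketch *ch)
{
    atomic_init(&ch->pending, NULL);
}

/* Called by whichever thread polls the shared Admin queue pair.
 * It only records the completion; it never runs the user callback. */
static void
channel_post_completion(struct channel_sketch *ch,
                        struct completion_msg *msg)
{
    struct completion_msg *head =
        atomic_load_explicit(&ch->pending, memory_order_relaxed);
    do {
        msg->next = head;
    } while (!atomic_compare_exchange_weak_explicit(
                 &ch->pending, &head, msg,
                 memory_order_release, memory_order_relaxed));
}

/* Called from the channel's poller, on the channel's own thread.
 * Runs the queued callbacks in posting order and returns how many. */
static int
channel_poll_completions(struct channel_sketch *ch)
{
    struct completion_msg *list = atomic_exchange_explicit(
        &ch->pending, NULL, memory_order_acquire);
    struct completion_msg *fifo = NULL, *next;
    int count = 0;

    /* Pushes build the list newest-first; reverse it. */
    while (list != NULL) {
        next = list->next;
        list->next = fifo;
        fifo = list;
        list = next;
    }
    while (fifo != NULL) {
        next = fifo->next;  /* read before cb, which may free fifo */
        fifo->cb(fifo->cb_arg, fifo->status);
        fifo = next;
        count++;
    }
    return count;
}

/* Example user callback: appends each status to a small log. */
struct status_log { int values[8]; int n; };

static void
log_status(void *cb_arg, int status)
{
    struct status_log *log = cb_arg;
    if (log->n < 8) {
        log->values[log->n++] = status;
    }
}
```

The point of the split is that channel_post_completion() is safe to
call from any thread, while user callbacks only ever run inside
channel_poll_completions(), so users of the channel need no locks of
their own.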
I think the solution is to add a completion queue to each
spdk_io_channel in the blockdev_nvme code. We can have a single thread
polling the Admin queue pair as we do today, but when each command
completes it drops a message onto the appropriate spdk_io_channel's
completion queue. The next time that spdk_io_channel is polled for
completions, it can execute the user callbacks (which will now be on
the correct thread).

There is another set of problems that I haven't touched on yet. The
bdev layer doesn't expose the concept of a namespace or LUN - each bdev
is just one sequential collection of blocks. For devices that support
multiple namespaces/LUNs, we expose a different bdev for each one. If
the user is limited to just doing I/O commands, this works out fine.
However, a number of Admin commands can change the size or number of
namespaces, or change the state of the NVMe controller more globally,
and so sending an Admin command to a bdev may impact other bdevs.

I think there are a few ways we could work this out. One way is to only
allow informational Admin commands through (log pages and such). This
mostly fixes the problem, except getting a log page actually does
update global state on the controller regarding asynchronous event
requests. However, if we don't allow the user to generate asynchronous
event requests through the bdev layer (and handle them entirely
internally), then I think we can still work this out.

The other option is to only allow NVMe passthrough on devices with one
namespace/LUN and just block it otherwise. This is also reasonably
simple and probably meets your needs.

>
> I would appreciate your thoughts, since we would like to get started
> on this.
>
> Thanks,
> Paul

--===============9201772281321219670==--