From: Walker, Benjamin
Subject: Re: [SPDK] NVMe-oF Target Library
Date: Wed, 19 Jul 2017 17:10:28 +0000
Message-ID: <1500484227.3169.1.camel@intel.com>
In-Reply-To: 37B08312E007AE46A00101F43DA919DD79AE02F2@FMSMSX105.amr.corp.intel.com
To: spdk@lists.01.org

I'm reviving this old thread because I'm back to working in this area again.
The challenge for changes this large is to figure out how to do them in small
pieces. Some of the major changes to the threading model will need to be done
in one go, but I think there are a few incremental steps we can take to
improve the code base and prepare it for the big transition.

I think the first of those changes is to remove the "mode" parameter from
subsystems. Today, NVMe-oF subsystems can be in either direct (I/O routed to
nvme library) or virtual (I/O routed to bdev library) mode. Recently, a new
bdev command was added by the wider community (thanks for the patch!) that
adds an NVMe passthrough command to the bdev layer. That allows us to send an
NVMe command through the regular bdev stack. The commands are generally only
interpreted by the NVMe bdev module - the other backing devices don't report
support for the NVMe passthrough command - but that's good enough. Given that
new capability, we can do anything in virtual mode that we previously did in
direct mode.

The only reason we didn't remove direct mode immediately after the addition of
NVMe passthrough was because we wanted to do a full performance evaluation to
verify the bdev layer doesn't have a measurable amount of overhead.
I'm glad to report those results have come in and the overhead of routing I/O
through the bdev library instead of nvme isn't measurable for any hardware
setup we were able to build. I wrote up a patch here:

https://review.gerrithub.io/#/c/369496/

The next big step is probably to make some changes to the transport API to
accommodate the new ideas in my previous email. Discussion and requests are
always welcome!

Thanks,
Ben

On Fri, 2017-07-14 at 17:16 +0000, Walker, Benjamin wrote:
>
> -----Original Message-----
> From: Walker, Benjamin
> Sent: Wednesday, April 26, 2017 2:06 PM
> To: spdk@lists.01.org
> Subject: NVMe-oF Target Library
>
> Hi all,
>
> I was hoping to start a bit of a design discussion about the future of the
> NVMe-oF target library (lib/nvmf). The NVMe-oF target was originally created
> as part of a skunkworks project and was very much an application. It wasn't
> divided into a library and an app as it is today. Right before we released
> it, I decided to attempt to break it up into a library and an application,
> but I never really finished that task. I'd like to resume that work now, but
> let the entire community weigh in on what the library looks like.
>
> First, libraries in SPDK (most things that live in lib/) shouldn't enforce a
> threading model. They should, as much as possible, be entirely passive C
> libraries with as few dependencies as we can manage. Applications in SPDK
> (things that live in app/), on the other hand, necessarily must choose a
> particular threading model. We universally use our application/event
> framework (lib/event) for apps, which spawns one thread per core, etc. We'll
> continue this model for NVMe-oF where app/nvmf_tgt will be a full application
> with a threading model dictated by the application/event framework, while
> lib/nvmf will be a passive C library that will depend only on other passive C
> libraries.
> I don't think this distinction is at all the reality today, but let's work
> to make it so.
>
> The other major issue with the NVMe-oF target implementation is that it has
> quite a few baked-in assumptions about what the backing storage device looks
> like. In particular, it was written assuming that it was talking directly to
> an NVMe device (Direct mode), and the ability to route I/O to the bdev layer
> (Virtual mode) was added much later and isn't entirely fleshed out yet. One
> of these assumptions is that real NVMe devices don't benefit from multiple
> queues - you can get the full performance from an NVMe device using just one
> queue pair. That isn't necessarily true for bdevs, which may be arbitrarily
> complex virtualized devices. Given that assumption, the NVMe-oF target
> today only creates a single queue pair to the backing storage device and only
> uses a single thread to route I/O to it. We're definitely going to need to
> break that assumption.
>
> The first discussion that I want to have is around what the high-level
> concepts should be. We clearly need to expose things like "subsystem", "queue
> pair/connection", "namespace", and "port". We should probably have an object
> that represents the entire target too, maybe "nvmf_tgt". However, in order to
> separate the threading model from the library I think we'll need at least two
> more concepts.
>
> First, some thread has to be in charge of polling for new connections. We
> typically refer to this as the "acceptor" thread today. Maybe the best way to
> handle this is to add an "accept" function that takes the nvmf_tgt object as
> an argument. This function can only be called on a single thread at a time
> and is repeatedly called to discover new connections. I think the user will
> end up passing a callback in to this function that will be called for each
> new connection discovered.
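
To make the "accept" idea above a bit more concrete, here is a minimal sketch
in C. Everything named example_* is hypothetical and is not an existing SPDK
API. The pending array is only a stand-in for whatever a transport would use
to notice new connections.

```c
#include <assert.h>
#include <stddef.h>

#define EXAMPLE_MAX_PENDING 8

/* Hypothetical connection (queue pair) object. */
struct example_conn {
	int id;
};

/* Hypothetical target object. A real one would also own subsystems,
 * namespaces, and ports; only the accept path is modeled here. */
struct example_tgt {
	struct example_conn *pending[EXAMPLE_MAX_PENDING];
	size_t num_pending;
};

/* User callback, invoked once per newly discovered connection. */
typedef void (*example_new_conn_fn)(struct example_conn *conn, void *cb_arg);

/* Stand-in for a transport noticing a connect request.
 * Returns 0 on success, -1 if the pending list is full. */
int
example_tgt_inject(struct example_tgt *tgt, struct example_conn *conn)
{
	if (tgt->num_pending == EXAMPLE_MAX_PENDING) {
		return -1;
	}
	tgt->pending[tgt->num_pending++] = conn;
	return 0;
}

/* Poll the target once for new connections. Never blocks. Must only be
 * called from one thread at a time. Returns the number of connections
 * handed to the callback. */
int
example_tgt_accept(struct example_tgt *tgt, example_new_conn_fn cb, void *cb_arg)
{
	size_t i;
	size_t n = tgt->num_pending;

	assert(cb != NULL);
	for (i = 0; i < n; i++) {
		cb(tgt->pending[i], cb_arg);
	}
	tgt->num_pending = 0;
	return (int)n;
}

/* Example callback: count the connections that were discovered. */
void
example_count_conn(struct example_conn *conn, void *cb_arg)
{
	(void)conn;
	(*(int *)cb_arg)++;
}
```

An acceptor thread would call example_tgt_accept() repeatedly from its poll
loop. Because the function never blocks and keeps no per-thread state, the
library stays passive and the application decides which thread runs it.
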
>
> Second, once a new connection is discovered, we need to hand it off to some
> collection that a dedicated thread can poll. This collection of connections
> would be tied specifically to that dedicated thread, but it wouldn't
> necessarily be tied to a subsystem or a particular storage device. I don't
> really know what to call this thing - right now I'm kind of thinking
> "io_handler".
>
> So the general flow for an application would be to construct a target, add
> subsystems, namespaces, and ports as needed, and then poll the target for
> incoming connections. For each new connection, the application would assign
> it to an io_handler (using whatever algorithm it wanted) and then poll the
> io_handlers to actually handle I/O on the connections. Does this seem like a
> reasonable design at a very high level? Feedback is very much welcome and
> encouraged.
>
> If I don't hear back with a bunch of "you're wrong!" or "that's stupid!" type
> replies over the next few days, the next step will be to write up a new
> header file for the library that we can discuss in more detail.
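
As a rough illustration of the "io_handler" concept and the assign-then-poll
flow described above, here is a small C sketch. All example_* names are
hypothetical and not part of SPDK. The queued/completed counters merely
simulate I/O, and round-robin is just one possible assignment policy the
application could pick.

```c
#include <assert.h>
#include <stddef.h>

#define EXAMPLE_MAX_CONNS 16

/* Hypothetical connection; the counters simulate outstanding I/O. */
struct example_conn {
	int id;
	int queued;    /* requests waiting to be processed */
	int completed; /* requests processed so far */
};

/* Hypothetical io_handler: a set of connections owned by one thread.
 * It is not tied to a subsystem or a storage device. */
struct example_io_handler {
	struct example_conn *conns[EXAMPLE_MAX_CONNS];
	size_t num_conns;
};

/* Hand a connection to a handler. Returns 0, or -1 if the handler is full. */
int
example_io_handler_add(struct example_io_handler *h, struct example_conn *conn)
{
	if (h->num_conns == EXAMPLE_MAX_CONNS) {
		return -1;
	}
	h->conns[h->num_conns++] = conn;
	return 0;
}

/* Poll every connection in the handler once. Intended to be called only
 * from the thread that owns the handler. Returns requests processed. */
int
example_io_handler_poll(struct example_io_handler *h)
{
	size_t i;
	int done = 0;

	for (i = 0; i < h->num_conns; i++) {
		struct example_conn *conn = h->conns[i];

		done += conn->queued;
		conn->completed += conn->queued;
		conn->queued = 0;
	}
	return done;
}

/* Application-side policy: pick handlers in round-robin order.
 * *next holds the index to use for the next connection. */
struct example_io_handler *
example_pick_round_robin(struct example_io_handler *handlers, size_t n, size_t *next)
{
	struct example_io_handler *h;

	assert(n > 0);
	h = &handlers[*next % n];
	*next = (*next + 1) % n;
	return h;
}
```

The assignment policy lives in the application rather than the library, so an
app built on the event framework could use one handler per core, while a
different app could choose something else entirely.
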
>
> Thanks,
> Ben