From mboxrd@z Thu Jan 1 00:00:00 1970
Content-Type: multipart/mixed; boundary="===============0067328666987468033=="
MIME-Version: 1.0
From: Walker, Benjamin
Subject: Re: [SPDK] SPDK Blob Store Fundamentals
Date: Wed, 29 Mar 2017 20:40:57 +0000
Message-ID: <1490820056.62307.20.camel@intel.com>
In-Reply-To: BN6PR19MB1153A0E2D192EC333E4BA380D5350@BN6PR19MB1153.namprd19.prod.outlook.com
List-ID:
To: spdk@lists.01.org

--===============0067328666987468033==
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable

On Wed, 2017-03-29 at 19:06 +0000, George Kondiles wrote:
> Hello,
>
> I am attempting to use the SPDK blob store to implement a basic
> NVMe-based flat file store. I understand that this is a new addition
> to the SPDK that is under active development and that
> documentation/examples of usage are sparse. But this is a great new
> addition to the SPDK that I've been tracking and so I'm eager to
> begin using it.

I'm glad you're using it! Note that this is not even part of an
official release yet. Further, the API we're going to release as part
of SPDK 17.03 is not even the API that I envision the blobstore will
have when all is said and done. I just want to correctly set
expectations - I'm going to change the API quite a bit and not
everything in the API currently makes sense. I also reserve the right
to change the on-disk format still, for at least a few more months.
Feedback of any kind is very much welcome.

>
> With that being said, I've been scouring through its usage in the
> bdev component, as well as the test cases in an attempt to glean how
> I might integrate it into my code base (specifically, I am already
> successfully using the SPDK to interact with NVMe devices) but have a
> few high-level questions that I hope are easy to answer.
>
> 1) In the most basic usage, it seems IO channels should be 1-to-1
> with threads.
> It looks like I must start a thread, call spdk_allocate_thread(),
> then spdk_get_io_channel() to get the spdk_io_channel instance
> created and associated with that thread.

You'll always need to call spdk_allocate_thread when each new thread
starts up that is using the blobstore (unless you are using our event
framework from lib/event, which does that for you).

If you want the blobstore to talk to the bdev layer, then you'll want
to call spdk_get_io_channel and pass it the bdev as the io_device
parameter. There is a full example of how to do this in
lib/blob/bdev/blob_bdev.c. I highly recommend for this first version
that you just follow that example.

If you want the blobstore to talk directly to the NVMe driver,
however, I haven't written an example to show you how just yet. I
think the easiest way to implement spdk_bs_dev::create_channel
directly on the NVMe driver is to make it call
spdk_nvme_ctrlr_alloc_io_qpair and then return (and cast) the queue
pair to an spdk_io_channel object. That's cheating a bit but I think
it will work out. I'll try to write up an example that demonstrates
the best way to do this in the next week or two. There are some other
challenges here, such as who polls each queue pair for completions,
that using the bdev layer just solves for you.

>
> Since spdk_bs_dev.create_channel is synchronous, it looks like I must
> block the create_channel() call while the above is happening in the
> new IO thread. Is this a reasonable approach, or am I misinterpreting
> how IO channels are intended to work?

The spdk_bs_dev::create_channel function will only be called on the
thread that will be using that channel. That thread should have
already been set up with spdk_allocate_thread when it started, so you
can just call spdk_get_io_channel from within the create_channel
callback. See lib/blob/bdev/blob_bdev.c for an example that you can
probably just use outright.
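
For illustration only, the per-thread pattern described above can be
sketched in plain C without SPDK. Every name below (sketch_io_channel,
sketch_get_io_channel, sketch_bs_dev and so on) is hypothetical and is
not part of the SPDK API; the thread-local slot is an assumption made
to keep the sketch short, and the real code to follow is still
lib/blob/bdev/blob_bdev.c. The point it tries to show is that
create_channel runs on the thread that will use the channel, so it can
look up that thread's channel directly and never has to block.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins; none of these are SPDK types or functions. */
struct sketch_io_channel {
	void *io_device;	/* device this per-thread context belongs to */
	int ref;		/* number of users on the owning thread */
};

/* Assumption: one channel slot per thread, i.e. a single device. */
static _Thread_local struct sketch_io_channel *g_thread_channel;

/* Return the calling thread's channel for io_device, creating it once. */
static struct sketch_io_channel *
sketch_get_io_channel(void *io_device)
{
	struct sketch_io_channel *ch = g_thread_channel;

	if (ch != NULL) {
		if (ch->io_device != io_device) {
			return NULL;	/* the sketch supports one device only */
		}
		ch->ref++;
		return ch;
	}

	ch = calloc(1, sizeof(*ch));
	if (ch == NULL) {
		return NULL;
	}
	ch->io_device = io_device;
	ch->ref = 1;
	g_thread_channel = ch;
	return ch;
}

/* Drop one reference; free the channel when the last user is gone. */
static void
sketch_put_io_channel(struct sketch_io_channel *ch)
{
	if (--ch->ref == 0) {
		if (g_thread_channel == ch) {
			g_thread_channel = NULL;
		}
		free(ch);
	}
}

/* Hypothetical analogue of a blobstore device with channel callbacks. */
struct sketch_bs_dev {
	void *backing;
	struct sketch_io_channel *(*create_channel)(struct sketch_bs_dev *dev);
	void (*destroy_channel)(struct sketch_bs_dev *dev,
				struct sketch_io_channel *ch);
};

/* Called on the thread that will use the channel, so nothing blocks. */
static struct sketch_io_channel *
sketch_create_channel(struct sketch_bs_dev *dev)
{
	return sketch_get_io_channel(dev->backing);
}

static void
sketch_destroy_channel(struct sketch_bs_dev *dev, struct sketch_io_channel *ch)
{
	(void)dev;
	sketch_put_io_channel(ch);
}

/* Exercises the pattern on the calling thread; returns 0 on success. */
static int
sketch_demo(void)
{
	int fake_bdev = 0;	/* placeholder for the backing device */
	struct sketch_bs_dev dev = {
		.backing = &fake_bdev,
		.create_channel = sketch_create_channel,
		.destroy_channel = sketch_destroy_channel,
	};
	struct sketch_io_channel *a = dev.create_channel(&dev);
	struct sketch_io_channel *b = dev.create_channel(&dev);

	if (a == NULL || a != b || a->ref != 2) {
		return -1;
	}
	dev.destroy_channel(&dev, b);
	dev.destroy_channel(&dev, a);
	return g_thread_channel == NULL ? 0 : -1;
}
```

Two create_channel calls on the same thread hand back the same
reference-counted context, which is the sense in which a channel is
1-to-1 with a thread in this sketch.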
>
> 2) I've already got a set of IO threads for executing asynchronous
> NVMe operations (e.g. spdk_nvme_ns_cmd_read(...)) against one or more
> devices. These IO threads each own a set of NVMe queue pairs, and
> have queuing mechanisms allowing for the submission of work to be
> performed against a specific device. Given this, I am interpreting an
> IO channel to essentially be an additional "outer" queue of pending
> blob-IO operations that are processed by an additional, dedicated
> thread. A call to spdk_bs_dev.read() or .write() would find the
> correct IO channel thread, enqueue an "outer" blob op, and the
> channel IO thread would then enqueue one or more lower-level NVMe IO
> operations on the "inner" queue. Does this interpretation match the
> intended usage? Am I missing something?

I think you're on the right track here. Our spdk_io_channel structure
is just a software construct for tracking per-thread contexts up and
down the I/O stack. The bottom of that stack is an NVMe queue pair
typically. This idea is a powerful one, but one that we haven't done a
great job explaining just yet. It is a dramatic departure from
concepts present in POSIX too, so it will be unfamiliar to most
people.

>
> 3) spdk_bs_dev.unmap() appears to correspond to dealloc/TRIM. Is this
> correct?

Yes. SATA calls it TRIM. NVMe calls it dealloc. SCSI calls it UNMAP.
Maybe I should call it dealloc because that's actually the most
descriptive term and we're very NVMe-centric. Of the three terms, I'm
sure that's the one least used though.

>
> 4) I've read through the docs at http://www.spdk.io/doc/blob.html and
> understand at a high level how things are being stored on disk, but
> there are references to the caching of metadata. My current workload
> will likely generate on the order of 100K to 1M blobs of sizes
> ranging from 512KB to 32MB, each with a couple of small attributes.
> Is there any way to estimate the total size (in memory) of the cache?
> Also, are any metadata modifications O(n) in the number of blobs?

Blob metadata is cached, but only while a blob is open. If you close
the blob all of the memory is released. I don't have exact counts (and
it is very much subject to change), but you can expect maybe ~128B per
open blob.

There are a few operations (e.g. opening a blob) that are currently
O(N) where N is the number of OPEN blobs. This is just because I
haven't had a chance to implement a better algorithm yet. There aren't
any operations that are O(N) where N is the total number of blobs.

In general, blobs are entirely independent of one another because they
each have their own blocks for metadata and data and the location of
that metadata can be determined entirely from the blobid with no
shared data structure. That's the real key to this design - with the
exception of a bit mask that requires central coordination for a
brief, synchronous period only when doing a few rare metadata
operations (create, sync, delete), every other operation on the
blobstore can happen entirely in parallel with no locks.

>
> Thanks in advance for any help or insight anyone can provide. Any
> assistance is greatly appreciated.
>
> - George Kondiles
> _______________________________________________
> SPDK mailing list
> SPDK(a)lists.01.org
> https://lists.01.org/mailman/listinfo/spdk

--===============0067328666987468033==--