From mboxrd@z Thu Jan 1 00:00:00 1970 Content-Type: multipart/mixed; boundary="===============9011884236213846059==" MIME-Version: 1.0 From: Walker, Benjamin Subject: Re: [SPDK] Handling of physical disk removals Date: Wed, 30 May 2018 21:27:11 +0000 Message-ID: <1527715630.27143.55.camel@intel.com> In-Reply-To: CAKye4QZzOsZnuHo3t040n532e49YyjUWgQ3ieuCK-3qiK6maug@mail.gmail.com List-ID: To: spdk@lists.01.org --===============9011884236213846059== Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable On Wed, 2018-05-30 at 14:46 +0300, Baruch Even wrote: > As for threading, the only thing I need to make things integrate better i= nto > my green-threads is to use a function call to do a context switch if need= ed > (spdk_yield() of sorts), that is empty/no-op for most users and can be a > compile time option (weak symbol? define?) to use some user-provided func= tion. > This will immediately integrate into any green thread system by switching= to > another user-thread and returning for a poll when other actions are taken. > This way the posix-thread will not block and there is no special async lo= gic > that needs to happen. Understood and we'd like to accommodate frameworks like you are using. See = the ongoing thread titled SPDK Dynamic Threading Model. We're just getting star= ted on that effort. In parallel, any time you see code that performs a blocking= or excessively long operation, please do let us know (or help us remove it). > = > As for the low level api, my application already has its own queue for ea= ch > device, and we split our requests as we need to and combine them as possi= ble > to improve performance. It took us by surprise to find that SPDK internal= ly > breaks out the requests to smaller chunks, queues them internally and > completions may not really complete the full request. There was even was > scenario that we had a deadlock because we overflowed the spdk queues. On= ce we > found these out we made sure to expose to our application all the device > constraints and break the IOs with the device limitations in mind and now= all > the extra work and checks that SPDK does are a waste of cpu cycles as far= as > we are concerned. SPDK will indeed split I/O automatically based on device characteristics. Splitting is non-trivial because I/O may need to be split for a large numbe= r of reasons (max I/O size, device-internal striping, PRP/SGL restrictions, etc.= ). Getting this right is very difficult and I'd highly recommend that people a= void re-implementing it (see the top half of lib/nvme/nvme_ns_cmd.c). SPDK will = also queue I/O sent beyond the available queue depth. These were both done to ma= ke code using the NVMe driver not have to deal with device-specific concerns (= i.e. the splitting required on one device is different than the splitting on ano= ther, and the queue depth available on one device is different than the queue dep= th available on another). The splitting and queueing should all be entirely transparent to the user though. There shouldn't ever be a case where a user completion callback is called prior to all split fragments of the I/O completing - you should get = your completion callback called just one time for each time you called to submit= a new I/O at the public NVMe API level. > = > There is also the issue that we track our IOs and SPDK because it does the > extra work has IO tracking of its own and it means there is an extra memo= ry > allocation on the data path for no good reason from our point of view. The memory that SPDK uses for this tracking is allocated when the queue pai= r is allocated, so while certainly an extra cache line is accessed, it isn't a m= emory allocation. > = > When I'm building my command in the queue I need to keep the list of buff= ers > in some data structure, currently I keep it in a vector and then this gets > translated to PRP/SGL when the command is submitted. I could save that st= ep if > I could just maintain the PRP/SGL myself and just pass the ready-made lis= t to > the command. This will save me some memory and time. Building valid PRP/SGL data structures is challenging, but for those up to = the task I'd be willing to consider an API that let the user provide them (we'l= l see if the rest of the community agrees). I'm always in favor of improvements f= or performance, but that seems like a lot of extra work to save what is probab= ly just a handful of CPU instructions. > = > All of this just means that spdk is wasting some cpu cycles on things we = don't > want it to do and after all the main reason to use spdk and dpdk is to ek= e out > all the possible performance from the system (cpu/nvme devices) by a more > integrated design. I can understand that other users of spdk do want the > higher levels and do need the features provided but for us we have better > places to do these actions. > = > Baruch > = > On Thu, May 24, 2018 at 7:44 PM Walker, Benjamin > wrote: > > Hi Baruch, > > = > > Regarding blocking threads =E2=80=93 only the NVMe initialization path = does that > > today. We considered doing a fully asynchronous initialization path for= NVMe > > devices, but we received a lot of feedback that it makes it very diffic= ult > > to use. I=E2=80=99d personally be open to adding a second initializatio= n path that > > was entirely asynchronous and required the user to explicitly poll some > > initialization context to advance the state machine until the devices c= ome > > online. This clearly makes much more sense in the context of hot plug s= ince > > the device initialization may occur on a thread that is also processing= I/O. > > Contributions are always very welcome! > > = > > I do expect to see some fairly major movement this year regarding > > abstractions around threading models and I=E2=80=99d love to define som= e abstraction > > that allows SPDK to nicely integrate into applications using green thre= ads > > or coroutines without tying SPDK to any specific implementation. This is > > definitely at the front of mind currently (as is dynamic memory managem= ent =E2=80=93 > > the other major hurdle for integrating with existing applications). I t= hink > > over the coming months there are going to be several lively discussions= on > > the mailing list and in IRC as we try to sort all of this out. > > = > > Regarding a lower level API =E2=80=93 we have some lower level interfac= es, although > > not as low-level as you=E2=80=99re talking about. See spdk_nvme_ctrlr_i= o_cmd_raw(), > > for example, which lets you build an arbitrary NVMe command and send it= . We > > haven=E2=80=99t exposed any knobs to control internal queueing or to al= low a user to > > build their own SGL/PRP. Can you outline what the use cases for those a= re? > > Why would you want to deviate from what SPDK does by default today? We= =E2=80=99re > > always amenable to providing more control, if that control has value. > > = > > Thanks, > > Ben > > = > > From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Baruch Even > > Sent: Thursday, May 24, 2018 1:17 AM > > = > > To: Storage Performance Development Kit > > Subject: Re: [SPDK] Handling of physical disk removals > > = > > Hi, > > = > > I logged the issue as https://github.com/spdk/spdk/issues/310 > > = > > I do call detach but I do not use the empty probe call, instead I have > > another part of my application monitor the same uevent information and > > trigger a larger process to remove the drive. I also have this triggere= d by > > the timeout callback. I had to remove the timeout callback or I'd get > > bombarded with the callback until it actually happened, looks like the > > timeout callback assumes that the reset call will happen from inside it= and > > complete before it returns but that doesn't work very well with our > > application. > > = > > I have an issue with the threading model you are using as we are using = user- > > space-threads (aka green threads or coroutines) and spdk will block the > > entire application for no good reason. It would have been nice if ther= e was > > a yield callback I can provide you so you won't block the main thread. = For > > now I need to resort to send the spdk admin calls to a thread to be done > > there since spdk may block. > > = > > If I am on a ranting roll I also would have preferred a lower level > > interface than currently provided that doesn't implement internal queui= ng > > and internal allocations so much and doesn't hide all the nvme details = (prp, > > sgl), though by that stage I'm probably better of with unvme instead of > > spdk. If they had support uio_pci_generic in addition to vfio-pci it wo= uld > > probably be a better direction for me. > > = > > Baruch > > = > > On Wed, May 23, 2018 at 8:18 PM Harris, James R > > wrote: > > > Hi Baruch, > > > = > > > Thanks for raising this issue =E2=80=93 there are absolutely changes = that SPDK > > > needs to make here. > > > = > > > Can you describe your code path a bit more? Or let me try to guess a= nd > > > you can tell me where I=E2=80=99m wrong. > > > = > > > 1) You=E2=80=99re using spdk_nvme_probe() with a NULL trid and = a remove_cb > > > handler to detect physically removed devices. > > > 2) In your remove_cb handler, you call spdk_nvme_detach(). > > > = > > > The SPDK bdev nvme module doesn=E2=80=99t call spdk_nvme_detach() in = its remove_cb > > > which is why the SPDK automated tests don=E2=80=99t run into this iss= ue. But I > > > don=E2=80=99t this is correct =E2=80=93 we should be calling spdk_nvm= e_detach() at some > > > point to clean up any allocated resources. It needs to make sure any > > > associate IO channels are freed up first (to avoid racing between the > > > remove callback and different threads submitting IO to that removed > > > controller). > > > = > > > Could you file this as a bug in github? Please add any additional de= tails > > > on how you=E2=80=99re hitting this issue if it=E2=80=99s different th= an what I=E2=80=99ve guessed > > > above. > > > = > > > https://github.com/spdk/spdk/issues > > > = > > > Thanks, > > > = > > > Jim > > > = > > > = > > > From: SPDK on behalf of Baruch Even > > ka.io> > > > Reply-To: Storage Performance Development Kit > > > Date: Wednesday, May 23, 2018 at 1:44 AM > > > To: Storage Performance Development Kit > > > Subject: [SPDK] Handling of physical disk removals > > > = > > > Hi, > > > = > > > I'm using spdk for local nvme through the nvme interface, I find that > > > physical disk removals are not handled properly for my use case and w= onder > > > if others see it that way as well and if there is an intention to fix > > > this. > > > = > > > Our system uses long running processes that control one or more disks= at a > > > time, if a disk fails it may drop completely from the pcie bus and it= will > > > also look like that if the disk is physically removed (say a technici= an > > > mistakes the disk that he should replace). > > > = > > > The problem that I see is that spdk doesnt consider a device complete= ly > > > disappearing from the bus and will try to release the io qpair by sen= ding > > > the delete io sq and delete io cq commands, both of these will never = get > > > an answer (the device is not on the pcie device anymore) and there is= no > > > timeout logic in that code path. This means two things, the process w= ill > > > halt forever and there is an effective memory leak which currently me= ans > > > that we need to restart the process. Now, our system is resilient eno= ugh > > > that restarting the process is not a big deal but it is a very messy = way > > > to go about handlign a physical drive removal. > > > = > > > Have others seen this behavior? Does it bother others? > > > = > > > For my own use I put a timeout in there of a few seconds and that sol= ves > > > it for me. > > > = > > > Baruch Even > > > = > > > -- > > > = > > > Baruch Even, Software Developer = > > > E baruch(a)weka.io = > > > www.weka.io > > > _______________________________________________ > > > SPDK mailing list > > > SPDK(a)lists.01.org > > > https://lists.01.org/mailman/listinfo/spdk > > = > > -- > > = > > Baruch Even, Software Developer = > > E baruch(a)weka.io = > > www.weka.io > > _______________________________________________ > > SPDK mailing list > > SPDK(a)lists.01.org > > https://lists.01.org/mailman/listinfo/spdk > = > -- = > = > Baruch Even, Software Developer = > E baruch(a)weka.io = > www.weka.io > _______________________________________________ > SPDK mailing list > SPDK(a)lists.01.org > https://lists.01.org/mailman/listinfo/spdk --===============9011884236213846059==--