From: Walker, Benjamin
To: spdk@lists.01.org
Subject: Re: [SPDK] Trying to recover from one SPDK process crashing in a multi-process environment
Date: Tue, 10 Apr 2018 17:50:12 +0000
Message-ID: <1523382610.2684.51.camel@intel.com>
In-Reply-To: F009CE4E1CB4E047B169243B6A3189273155FA32@SHSMSX101.ccr.corp.intel.com

On Tue, 2018-04-10 at 13:43 +0000, Cao, Gang wrote:
> Hi all,
>
> This topic is regarding the usage of SPDK NVMe driver in the multi-process
> mode. There are some cases that any process (primary or secondary) could exit
> in the unexpectedly way to leave the allocated memory not released, the held
> lock also not released or even some severe problem like in the middle of the
> memory allocation. Some of these issues may have a way to solve and others may
> be difficult to solve.
> Would like to initiate some discussion on this topic to get more input,
> suggestions and comments.

I'm very interested to hear from SPDK users who plan to deploy SPDK using its
multi-process capabilities. Specifically, I want to know what their strategy is
for handling a process crash. As far as I'm aware, DPDK (which provides the
low-level multi-process handling) recommends that all processes be restarted
fresh after any process crashes.

To recap, SPDK's NVMe driver allows multiple separate processes to start up,
map some shared memory regions, and then each allocate an NVMe I/O queue pair
of their own. From that point on, each process can submit commands to the NVMe
device without any further coordination with the others. See
http://www.spdk.io/doc/nvme.html#nvme_multi_process for the full docs.
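
The hazard raised in the quoted message (a process exiting while it holds a
lock on shared memory) can be reproduced without SPDK at all. The following is
a minimal sketch in plain POSIX C, not SPDK or DPDK code: struct shared_state,
its head/tail invariant, and crash_leaves_state_inconsistent() are all
hypothetical names made up for illustration, and _exit() stands in for a real
crash.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stdatomic.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for a data structure kept in process-shared memory. */
struct shared_state {
    atomic_flag lock; /* cross-process spinlock */
    int head;         /* invariant whenever the lock is free: head == tail */
    int tail;
};

/*
 * Forks a child that takes the lock, performs half of an update, and then
 * exits without cleaning up. Returns 1 if the surviving process finds the
 * lock still held and the invariant broken, 0 if the state looks clean,
 * and -1 if setup failed.
 */
int crash_leaves_state_inconsistent(void)
{
    struct shared_state *s = mmap(NULL, sizeof(*s), PROT_READ | PROT_WRITE,
                                  MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (s == MAP_FAILED) {
        return -1;
    }
    atomic_flag_clear(&s->lock);
    s->head = 0;
    s->tail = 0;

    pid_t pid = fork();
    if (pid < 0) {
        munmap(s, sizeof(*s));
        return -1;
    }
    if (pid == 0) {
        /* Child: acquire the lock and start modifying the structure. */
        while (atomic_flag_test_and_set(&s->lock)) {
            /* spin */
        }
        s->head = 1;
        /* Simulated crash: tail is never updated, lock is never released. */
        _exit(1);
    }

    int status = 0;
    if (waitpid(pid, &status, 0) < 0) {
        munmap(s, sizeof(*s));
        return -1;
    }
    assert(WIFEXITED(status)); /* the child ended the way this sketch intends */

    /* Survivor: nothing released the lock or finished the update for us. */
    int lock_still_held = atomic_flag_test_and_set(&s->lock);
    int invariant_broken = (s->head != s->tail);

    munmap(s, sizeof(*s));
    return lock_still_held && invariant_broken;
}
```

Calling crash_leaves_state_inconsistent() returns 1: the operating system
reclaims the dead process, but it neither releases a lock stored in shared
memory nor completes the half-finished update, so the survivor has no safe way
to tell which fields can still be trusted.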

If one of these processes crashes, there are several potential issues that must
be dealt with:

1) The DPDK-allocated memory assigned to the crashed process will never be
released. This is more than just memory allocated by spdk_dma_malloc and such -
it's also memory that needs to be put back into memory pools.
2) The process may have been holding a cross-process lock and/or modifying some
of the data structures in shared memory at the time of the crash. None of the
code in the critical areas of SPDK or DPDK is designed to guarantee that the
in-memory data structures are always in a valid or consistent state, such that
a process could crash in the middle of a modification and another process could
continue using them. Moving to atomic data structures everywhere, assuming it
is even possible, would be both a huge amount of effort and probably a huge
performance hit.

So the above are the facts. Now my opinion, which can easily be swayed based on
the feedback we receive here:

It's not clear that it is even possible to rewrite all of the critical DPDK and
SPDK data structures to be atomic, the effort would be enormous, and the end
result would probably have significantly degraded performance. So the only
reasonable way to guarantee correct recovery from a process crash is to restart
all processes involved. That, in my opinion, makes the use of NVMe multi-process
features for more than simple management tools with short lifetimes a brittle
architectural choice.

In my opinion, a more robust design would be to use something like SPDK's vhost
target as a dispatcher, where one process owns the storage devices and exposes
shared memory queues to the other processes on which they can submit requests.
The vhost model does use process-shared memory, but in a limited way that is
protected against both issues I outlined above.

To be fair, there are a few drawbacks to using vhost as a dispatcher.
First, there are currently restrictions on the memory layout that can be
described to vhost for the shared memory regions; these need to be lifted (some
work is in flight). Second, the vhost target polls both the incoming queues and
the storage devices, so it may consume additional CPU compared to the
multi-process model. There are probably ways that we can work to mitigate this
over time, and even right now the cost can be amortized across a large number
of client processes because the vhost target is so efficient.

Thanks,
Ben
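
The dispatcher idea described above can be illustrated with a small
shared-memory request queue. This is a rough sketch under stated assumptions,
not the vhost or virtio layout: struct req_ring, ring_submit(), ring_drain()
and dispatcher_after_client_crash() are hypothetical names, the "request" is
just a block address, and it assumes atomic_uint is lock-free and usable across
processes on the target platform. The point is only that a queue with one
writer per index needs no cross-process lock, and that a client dying
mid-request leaves nothing the owning process has to repair.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

#define RING_SLOTS 8u /* power of two */

/* Hypothetical single-producer, single-consumer request ring. */
struct req_ring {
    atomic_uint tail;         /* written only by the client */
    atomic_uint head;         /* written only by the dispatcher */
    uint64_t lba[RING_SLOTS]; /* request payload: a block address */
};

/* Client side: fill a slot first, then publish it. Returns -1 if full. */
static int ring_submit(struct req_ring *r, uint64_t lba)
{
    unsigned tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    unsigned head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (tail - head >= RING_SLOTS) {
        return -1;
    }
    r->lba[tail % RING_SLOTS] = lba;
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return 0;
}

/* Dispatcher side: consume every published request, adding each to *sum. */
static unsigned ring_drain(struct req_ring *r, uint64_t *sum)
{
    unsigned head = atomic_load_explicit(&r->head, memory_order_relaxed);
    unsigned tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    unsigned n = 0;
    if (tail - head > RING_SLOTS) {
        tail = head; /* never trust a client index: treat garbage as empty */
    }
    while (head != tail) {
        *sum += r->lba[head % RING_SLOTS];
        head++;
        n++;
    }
    atomic_store_explicit(&r->head, head, memory_order_release);
    return n;
}

/*
 * Forks a client that publishes two requests and dies while writing a third.
 * Returns how many requests the dispatcher sees afterwards (0 on failure).
 */
unsigned dispatcher_after_client_crash(uint64_t *sum)
{
    struct req_ring *r = mmap(NULL, sizeof(*r), PROT_READ | PROT_WRITE,
                              MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (r == MAP_FAILED) {
        return 0;
    }
    atomic_init(&r->tail, 0);
    atomic_init(&r->head, 0);

    pid_t pid = fork();
    if (pid < 0) {
        munmap(r, sizeof(*r));
        return 0;
    }
    if (pid == 0) {
        (void)ring_submit(r, 10);
        (void)ring_submit(r, 20);
        r->lba[2 % RING_SLOTS] = 99; /* third request written, never published */
        _exit(1);                    /* simulated crash */
    }
    (void)waitpid(pid, NULL, 0);

    *sum = 0;
    unsigned n = ring_drain(r, sum);
    munmap(r, sizeof(*r));
    return n;
}
```

With this sketch the dispatcher drains exactly the two published requests
(sum 30) and never observes the half-written third one. It holds no lock that
the client could have taken with it, and the client never allocated from a
pool the dispatcher depends on, which is the property the dispatcher design is
after. A real target would also have to validate the contents of each request,
not just the indices.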