From mboxrd@z Thu Jan  1 00:00:00 1970
Content-Type: multipart/mixed; boundary="===============7041778373981651054=="
MIME-Version: 1.0
From: Walker, Benjamin <benjamin.walker at intel.com>
Subject: Re: [SPDK] Trying to recover from one SPDK process crashing in a
 multi-process environment
Date: Tue, 10 Apr 2018 17:50:12 +0000
Message-ID: <1523382610.2684.51.camel@intel.com>
In-Reply-To: F009CE4E1CB4E047B169243B6A3189273155FA32@SHSMSX101.ccr.corp.intel.com
List-ID: <spdk@lists.01.org>
To: spdk@lists.01.org

--===============7041778373981651054==
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable

On Tue, 2018-04-10 at 13:43 +0000, Cao, Gang wrote:
> Hi all,
>
> This topic is regarding the usage of SPDK NVMe driver in the multi-process
> mode. There are some cases that any process (primary or secondary) could exit
> in the unexpectedly way to leave the allocated memory not released, the held
> lock also not released or even some severe problem like in the middle of the
> memory allocation. Some of these issues may have a way to solve and others may
> be difficult to solve.

> Would like to initiate some discussion on this topic to get more input,
> suggestions and comments.

I'm very interested to hear from SPDK users who plan to deploy SPDK using its
multi-process capabilities. Specifically, I want to know what their strategy is
for handling a process crash. As far as I'm aware, DPDK (which provides the
low-level multi-process handling) recommends that all processes be restarted
fresh after any process crashes.

To recap, SPDK's NVMe driver allows multiple separate processes to start up,
map some shared memory regions, and then each allocate an NVMe I/O queue pair
of its own. From that point on, each process can submit commands to the NVMe
device without any further coordination with the other processes. See
http://www.spdk.io/doc/nvme.html#nvme_multi_process for the full docs.

If one of these processes crashes, there are several potential issues that must
be dealt with:

1) The DPDK-allocated memory assigned to the crashed process will never be
released. This is more than just memory allocated by spdk_dma_malloc and such -
it's also memory that needs to be put back into memory pools.
2) The process may have been holding a cross-process lock and/or modifying some
of the data structures in shared memory at the time of the crash. None of the
code in the critical areas of SPDK or DPDK is designed to guarantee that the
in-memory data structures are always in a valid or consistent state, such that
a process could crash in the middle of a modification and another process could
continue using them. Moving to atomic data structures everywhere, assuming it
is even possible, would be both a huge amount of effort and probably a huge
performance hit.
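
To make issue 2 concrete, here is a minimal sketch in plain Python. It is not
SPDK or DPDK code: the two-field record, its "both fields are equal" invariant,
and the function names are hypothetical stand-ins, and it assumes a Unix-like
host where the "fork" start method is available. A child takes a cross-process
lock, performs half of an update to shared memory, and then dies without any
cleanup.

```python
import multiprocessing as mp
import os

# Assumption: Unix-like host, so the "fork" start method exists.
ctx = mp.get_context("fork")


def crash_mid_update(lock, record):
    """Hypothetical worker: lock, half-update the shared record, then die."""
    lock.acquire()
    record[0] = 2   # first half of a two-field update
    os._exit(1)     # simulated crash: record[1] is never written and
                    # the lock is never released


def crash_demo():
    lock = ctx.Lock()                            # cross-process lock
    record = ctx.Array("i", [1, 1], lock=False)  # invariant: fields are equal
    child = ctx.Process(target=crash_mid_update, args=(lock, record))
    child.start()
    child.join()
    # The survivor cannot take the lock, and the record it would read
    # violates the invariant.
    got_lock = lock.acquire(timeout=0.5)
    return child.exitcode, got_lock, list(record)
```

Calling crash_demo() from the surviving process shows both symptoms at once:
the lock can no longer be acquired, and the record is left in a state that no
completed update would ever produce.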

So the above are the facts. Now my opinion, which can easily be swayed based on
the feedback we receive here:

It's not clear that it is even possible to rewrite all of the critical DPDK and
SPDK data structures to be atomic. Even if it is, the effort would be enormous
and the end result would probably have significantly degraded performance. So
the only reasonable way to guarantee correct recovery from a process crash is to
restart all processes involved. That, in my opinion, makes the use of NVMe
multi-process features for more than simple management tools with short
lifetimes a brittle architectural choice.

In my opinion, a more robust design would be to use something like SPDK's vhost
target as a dispatcher, where one process owns the storage devices and exposes
shared memory queues to the other processes on which they can submit requests.
The vhost model does use process-shared memory, but in a limited way that is
protected against both issues I outlined above. To be fair, there are a few
drawbacks to using vhost as a dispatcher. First, vhost currently places
restrictions on the memory layout that can be described for the shared memory
regions, and these need to be lifted (some work is in flight). Second, the vhost
target is polling the incoming queues and the storage devices, so it may consume
additional CPU compared to the multi-process model. There are probably ways that
we can work to mitigate this over time, and even right now the cost can be
amortized across a large number of client processes because the vhost target is
so efficient.
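
The shape of the dispatcher design can be sketched in plain Python. This is
only an illustration of the ownership model, not of vhost itself: it uses one
pipe per client instead of shared-memory queues, a dict as a stand-in for the
storage device, and hypothetical names throughout, and it assumes a Unix-like
host with the "fork" start method. Only the dispatcher ever touches the
device state, so a client that crashes takes down nothing but its own channel.

```python
import multiprocessing as mp
import os

# Assumption: Unix-like host, so the "fork" start method exists.
ctx = mp.get_context("fork")


def client(conn, client_id, crash):
    """Hypothetical client: submit one write, then maybe crash."""
    conn.send(("write", client_id, f"data-{client_id}"))
    if crash:
        os._exit(1)   # simulated crash right after submitting
    conn.recv()       # wait for the completion
    conn.close()


def serve(plan):
    """Dispatcher: sole owner of `device`; one private channel per client."""
    device = {}       # stand-in for the storage device
    clients = []
    for client_id, crash in plan:
        ours, theirs = ctx.Pipe()
        proc = ctx.Process(target=client, args=(theirs, client_id, crash))
        proc.start()
        theirs.close()  # the dispatcher keeps only its own end
        clients.append((proc, ours))
    for proc, conn in clients:
        try:
            _op, key, value = conn.recv()
            device[key] = value   # only the dispatcher mutates the device
            conn.send("done")
        except (EOFError, OSError):
            pass                  # the client died; drop its channel
        finally:
            conn.close()
    for proc, _ in clients:
        proc.join()
    return device, [proc.exitcode for proc, _ in clients]
```

With a plan such as serve([(0, False), (1, True), (2, False)]), client 1 crashes
but clients 0 and 2 still complete normally, and the dispatcher's state stays
consistent because no client ever held a lock on it or modified it directly.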

Thanks,
Ben
--===============7041778373981651054==--