From mboxrd@z Thu Jan  1 00:00:00 1970
Content-Type: multipart/mixed; boundary="===============0432739158840272302=="
MIME-Version: 1.0
From: Vladislav Bolkhovitin <vst at vlnb.net>
Subject: Re: [SPDK] SCST Usermode iSCSI Storage Server now handles Intel SPDK
 backing storage
Date: Wed, 06 Sep 2017 16:32:04 -0700
Message-ID: <59B08574.5020902@vlnb.net>
In-Reply-To: CALiN7ryf3J9ej0EWFBxKT_te4wgVPJDWw1JSQ+AimOHsnr0bwQ@mail.gmail.com
List-ID: <spdk@lists.01.org>
To: spdk@lists.01.org

--===============0432739158840272302==
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable


David Butterfield wrote on 09/06/2017 03:48 PM:
> On Tue, Sep 5, 2017 at 9:25 PM, Vladislav Bolkhovitin <vst(a)vlnb.net> wrote:
>> The only note would be that, as I have already mentioned before, tcmu does data copy
>> between user mode module and kernel, so usage with SCST zero-copy scst_user instead
>> would be more performance efficient.
>
> Yes, it's better not to have to copy the data; but I'm not sure that's
> the limiting factor for TCMU performance.
>
> A ring buffer mediates communication in the TCMU datapath between
> tcm_user (in the kernel) and libtcmu (in usermode).  One fairly
> fundamental characteristic of the TCMU model is that the granularity
> of transaction through the ring buffer is the CDB.  There is overhead
> cost to access and maintain the ring four times per SCSI command
> (Request+Response) * (Sender+Receiver).
>
> Concerning me more than that is the problem of timely scheduling of
> the threads on each side of the ring.  One might expect at least one
> wakeup per SCSI command, because whichever side of the ring is faster
> to process a command must inevitably sleep waiting for the slower
> side.
>
> In practice it averages fewer than one wakeup per command (with
> sufficient queue-depth) because multiple commands can accumulate in
> the ring during the scheduling delay for the first command, and the
> entire backlog can be processed in one wakeup.  But you only get such
> batching in return for enduring thread scheduling latency on the
> datapath (with its own issues).
>
> It is too complicated to determine from analysis alone how all the
> factors combine into overall performance behavior under various
> loading conditions -- the only way to really know is to observe and
> measure it.  How many IOPS can get through that ring, and what happens
> if the load is not quite 100%, or the load is light at queue-depth 2
> or even 1?  Or when the required protocol work is heavier on the
> kernel side versus heavier on the usermode side?
>
> TCMU has had some time to gain usermode clients.  Finding even *one*
> such client -- that has been well-measured under a variety of
> conditions and demonstrated to work reliably with high performance --
> would prove that it is possible to do through the TCMU API,
> substantially reducing the concern.  There may be an example out
> there, but I looked around a couple of months ago and did not find
> anything except "we haven't done performance tuning yet".  But the
> concern is toward factors that are inherent in the TCMU model, not
> amenable to simple "performance tuning at the end".  Given
> CDB-granularity, I expect the TCMU IOPS bottleneck is going to be
> around that ring.
>
> In contrast to the CDB-ring model, Usermode SCST uses socket(2) and
> related system calls for communication with the iSCSI initiator --
> these socket calls are where the datapath crosses between the kernel
> and usermode.  Here the granularity of transaction between the two can
> theoretically be as large as the socket buffer size -- much larger
> than one SCSI command.
>
> Especially when using SPDK for backing storage, another step is to
> re-implement the network I/O using DPDK calls, eliminating the socket
> I/O calls altogether (I expect that to be straightforward in
> iscsi-scst/kernel/nthread.c).  Then the entire datapath would be in
> usermode (down to the I/O instructions, I think).
>
> (Caveat: this analysis is based only on considering the TCMU model,
> not any actual performance experimentation with TCMU)

I see, interesting analysis. Just one correction: netlink sockets are used for
kernel-user mode communication in iSCSI-SCST, and only to establish the connection;
after that everything is done entirely inside the kernel (in user space in your port).
Scst_user uses an IOCTL-based interface, with 2 calls per CDB that could be batched too.
Everything runs inside a single thread context, with no extra inter-thread switches. In
your user space port it could be translated to just a regular function call, leading to
a very interesting marriage between SPDK frontend and existing user mode SCST backends :)

Vlad


--===============0432739158840272302==--