From mboxrd@z Thu Jan 1 00:00:00 1970
Content-Type: multipart/mixed; boundary="===============0432739158840272302=="
MIME-Version: 1.0
From: Vladislav Bolkhovitin
Subject: Re: [SPDK] SCST Usermode iSCSI Storage Server now handles Intel SPDK backing storage
Date: Wed, 06 Sep 2017 16:32:04 -0700
Message-ID: <59B08574.5020902@vlnb.net>
In-Reply-To: CALiN7ryf3J9ej0EWFBxKT_te4wgVPJDWw1JSQ+AimOHsnr0bwQ@mail.gmail.com
List-ID:
To: spdk@lists.01.org

--===============0432739158840272302==
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable

David Butterfield wrote on 09/06/2017 03:48 PM:
> On Tue, Sep 5, 2017 at 9:25 PM, Vladislav Bolkhovitin wrote:
>> The only note would be that, as I have already mentioned before,
>> tcmu does data copy between user mode module and kernel, so usage
>> with SCST zero-copy scst_user instead would be more performance
>> efficient.
>
> Yes, it's better not to have to copy the data; but I'm not sure that's
> the limiting factor for TCMU performance.
>
> A ring buffer mediates communication in the TCMU datapath between
> tcm_user (in the kernel) and libtcmu (in usermode). One fairly
> fundamental characteristic of the TCMU model is that the granularity
> of transaction through the ring buffer is the CDB. There is overhead
> cost to access and maintain the ring four times per SCSI command
> (Request+Response) * (Sender+Receiver).
>
> Concerning me more than that is the problem of timely scheduling of
> the threads on each side of the ring. One might expect at least one
> wakeup per SCSI command, because whichever side of the ring is faster
> to process a command must inevitably sleep waiting for the slower
> side.
>
> In practice it averages fewer than one wakeup per command (with
> sufficient queue-depth) because multiple commands can accumulate in
> the ring during the scheduling delay for the first command, and the
> entire backlog can be processed in one wakeup.
> But you only get such batching in return for enduring thread
> scheduling latency on the datapath (with its own issues).
>
> It is too complicated to determine from analysis alone how all the
> factors combine into overall performance behavior under various
> loading conditions -- the only way to really know is to observe and
> measure it. How many IOPS can get through that ring, and what happens
> if the load is not quite 100%, or the load is light at queue-depth 2
> or even 1? Or when the required protocol work is heavier on the
> kernel side versus heavier on the usermode side?
>
> TCMU has had some time to gain usermode clients. Finding even *one*
> such client -- that has been well-measured under a variety of
> conditions and demonstrated to work reliably with high performance --
> would prove that it is possible to do through the TCMU API,
> substantially reducing the concern. There may be an example out
> there, but I looked around a couple of months ago and did not find
> anything except "we haven't done performance tuning yet". But the
> concern is toward factors that are inherent in the TCMU model, not
> amenable to simple "performance tuning at the end". Given
> CDB-granularity, I expect the TCMU IOPS bottleneck is going to be
> around that ring.
>
> In contrast to the CDB-ring model, Usermode SCST uses socket(2) and
> related system calls for communication with the iSCSI initiator --
> these socket calls are where the datapath crosses between the kernel
> and usermode. Here the granularity of transaction between the two can
> theoretically be as large as the socket buffer size -- much larger
> than one SCSI command.
>
> Especially when using SPDK for backing storage, another step is to
> re-implement the network I/O using DPDK calls, eliminating the socket
> I/O calls altogether (I expect that to be straightforward in
> iscsi-scst/kernel/nthread.c). Then the entire datapath would be in
> usermode (down to the I/O instructions, I think).
>
> (Caveat: this analysis is based only on considering the TCMU model,
> not any actual performance experimentation with TCMU)

I see, interesting analysis. Just one correction: netlink sockets are used
for kernel-user mode communication in iSCSI-SCST, and only to establish
the connection; after that, everything is done entirely inside the kernel
(in user space in your port).

Scst_user uses an IOCTL-based interface, with 2 calls per CDB that could
be batched too. Everything runs inside a single thread context, with no
extra inter-thread switches. In your user space port it could be
translated to just a regular function call, leading to a very interesting
marriage between SPDK frontend and existing user mode SCST backends :)

Vlad

--===============0432739158840272302==--