From mboxrd@z Thu Jan  1 00:00:00 1970
Content-Type: multipart/mixed; boundary="===============3186708450900795814=="
MIME-Version: 1.0
From: Walker, Benjamin <benjamin.walker at intel.com>
Subject: Re: [SPDK] SPDK aio examples
Date: Wed, 22 Jun 2016 17:35:49 +0000
Message-ID: <1466616949.26925.170.camel@intel.com>
In-Reply-To: 3A3AEA95-16B6-44BA-B5A7-691CBED02A9B@playstation.sony.com
List-ID: <spdk@lists.01.org>
To: spdk@lists.01.org

--===============3186708450900795814==
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable

On Wed, 2016-06-22 at 16:50 +0000, Bhadauria, Varun wrote:
> Hi Ben
>
> Thank you for the reply.
>
> To keep application I/Os from being submitted from different threads to
> the same queue pair, one can allocate a queue pair per logical core
> (provided the hardware supports creating that many queue pairs).
> However, getting the current CPU from application code involves system
> call overhead (some OSes may not even support this).
>
> The other approach is to have some worker threads, each with its own
> queue pair, that feed off the application-maintained pending I/O queues.
> However, this approach introduces various locking overheads (to
> establish this producer-consumer model), which may introduce contention
> and prevent getting the maximum performance.
>
> How do you think this problem can be avoided?

I think you are assuming that the layer doing the I/O submission is not
designed with knowledge of the application logic above it. That's true
for something like the Linux kernel's block-mq layer - it doesn't know
what threading model the application(s) running on it use so it just
allocates 1 queue pair per core (sharing if necessary) and then has to
ask which core a thread is on to choose the right queue pair. One of the
major advantages of SPDK, however, is that the I/O submission layer is
part of the application and can therefore take advantage of additional
knowledge. Most applications using SPDK will be designed to have 1
thread per CPU core where the thread is running in a tight event loop,
polling a queue. I/O coming in off of the network will be immediately
routed to a particular CPU core and it will be processed there until I/O
is submitted to the disk. In that model, you never have to look up what
core you are on - you just have to associate network connections with
particular threads one time when the connection is established. We
provide a basic framework for applications to use this model inside of
SPDK (header is at include/spdk/event.h). The framework isn't required
to use our drivers, but all of our example applications and our NVMf
target use it.
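
As a rough illustration of that model (a sketch only, not the SPDK event
framework; every name below is made up for the example), each connection
is bound to a worker once at accept time, and every later submission
simply uses the queue pair owned by that worker. The integer qpair_id is
a stand-in for a real NVMe queue pair.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_CORES 4          /* assumption: one worker thread per core */

/* Hypothetical per-core state. In a real application the queue pair
 * would be allocated once by the thread that owns it. */
struct worker {
    int core;                /* core this worker is pinned to */
    int qpair_id;            /* stand-in for the worker's private qpair */
    unsigned submitted;      /* I/Os submitted through this worker */
};

struct connection {
    int id;
    struct worker *owner;    /* fixed when the connection is accepted */
};

static struct worker workers[NUM_CORES];
static unsigned next_worker; /* touched only by the accepting thread */

static void workers_init(void)
{
    for (int i = 0; i < NUM_CORES; i++) {
        workers[i].core = i;
        workers[i].qpair_id = i;
        workers[i].submitted = 0;
    }
    next_worker = 0;
}

/* Called once per connection: pick an owner round-robin. */
static void connection_accept(struct connection *c, int id)
{
    assert(c != NULL);
    c->id = id;
    c->owner = &workers[next_worker];
    next_worker = (next_worker + 1) % NUM_CORES;
}

/* Meant to run on the owner's thread. No core lookup is needed: the
 * queue pair is reached through the connection itself. Returns the id
 * of the queue pair that was used. */
static int connection_submit_io(struct connection *c)
{
    struct worker *w = c->owner;

    w->submitted++;
    return w->qpair_id;
}
```

In a real application each worker would be a thread pinned to its core,
and connection_submit_io would only ever be called from that thread, so
the I/O path neither asks which core it is on nor takes a lock.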

>
> Also, I don't see any API for issuing a trim command. Is that being
> implemented as well?
Every specification uses a different word for trim for some reason. TRIM
is the term used by the ATA command set, SCSI calls it UNMAP, and NVMe
calls it deallocate. See
http://www.spdk.io/spdk/doc/nvme_8h.html#ae275923b7e982b115483e425c2972ec5
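
For illustration: in NVMe, a deallocate request is carried by the
Dataset Management command, which takes a list of up to 256 range
descriptors of 16 bytes each. The sketch below only builds such a list.
The struct, field and function names are hypothetical, max_len is a
caller-chosen cap on the size of one range, and nothing here calls into
SPDK.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DSM_MAX_RANGES 256   /* NVMe limit per Dataset Management cmd */

/* One range descriptor as laid out by the NVMe specification.
 * The names are made up for this sketch. */
struct dsm_range {
    uint32_t attributes;     /* context attributes, 0 when unused */
    uint32_t length;         /* number of logical blocks */
    uint64_t starting_lba;
};

_Static_assert(sizeof(struct dsm_range) == 16, "descriptor is 16 bytes");

/* Split [lba, lba + blocks) into ranges of at most max_len blocks each.
 * Returns the number of descriptors written, or 0 if there is nothing
 * to describe or the ranges do not fit in the output array. */
static size_t dsm_fill_ranges(struct dsm_range *out, size_t cap,
                              uint64_t lba, uint64_t blocks,
                              uint32_t max_len)
{
    size_t n = 0;

    if (max_len == 0) {
        return 0;
    }
    if (cap > DSM_MAX_RANGES) {
        cap = DSM_MAX_RANGES;
    }
    while (blocks > 0) {
        uint32_t len = blocks < max_len ? (uint32_t)blocks : max_len;

        if (n == cap) {
            return 0;        /* would need more than one command */
        }
        out[n].attributes = 0;
        out[n].length = len;
        out[n].starting_lba = lba;
        n++;
        lba += len;
        blocks -= len;
    }
    return n;
}
```

The filled array would then be handed, together with the range count, to
whichever deallocate call the linked header documents.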

>
> Also
> Regards,
> Varun Bhadauria
>
> On 6/17/16, 2:57 PM, "SPDK on behalf of Walker, Benjamin"
> <spdk-bounces(a)lists.01.org on behalf of benjamin.walker(a)intel.com>
> wrote:
>
> >
> > On Fri, 2016-06-17 at 20:52 +0000, Bhadauria, Varun wrote:
> > >
> > > Thanks Ben
> > >
> > > Can you also shed some light on the expected behavior when more than
> > > one I/O is erroneously submitted on the same qpair? Do the
> > > spdk_nvme_ns_cmd_read/write*() functions return a specific error
> > > value in this case?
> > >
> > You can submit many I/O per queue pair at the same time as long as you
> > do it from a single thread, and you can submit I/O to different queue
> > pairs on different threads simultaneously with no locks. Are you
> > asking what happens when I/O is submitted simultaneously from
> > different threads to the same queue pair? In that case, you run the
> > risk of corrupting the memory state of the queue. The queue is
> > implemented as an array in memory with a head and a tail pointer.
> > Submitting an I/O to the queue places a command into the next slot,
> > increments the head pointer, and rings a doorbell register to tell the
> > device new commands are present. If you do this from two threads
> > simultaneously, they'd both be copying into the same spot and ringing
> > the doorbell, meaning the device may receive part of one command and
> > part of another. The code is in
> > lib/nvme/nvme_qpair.c:nvme_qpair_submit_tracker if you want to look.
> >
> > There is no expected error value for this case - the behavior is
> > simply undefined. In order to catch a user doing this, we'd have to
> > look at some shared state (which means a lock) and the whole purpose
> > of queue pairs is to avoid locking.
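
To picture the submission path described in the quoted text, here is a
simplified model of it (not the code from nvme_qpair.c; the struct and
function names are invented, the doorbell is an ordinary variable rather
than a device register, and the check for a full queue is left out).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SQ_ENTRIES 8
#define CMD_SIZE   64        /* NVMe submission entries are 64 bytes */

struct sq {
    uint8_t slots[SQ_ENTRIES][CMD_SIZE];
    uint32_t next;           /* index of the next slot to fill */
    uint32_t doorbell;       /* stand-in for the device register */
};

/* Deliberately lock-free and therefore not thread safe: the copy, the
 * index update and the doorbell write are three separate steps. Two
 * threads running this at once can both read the same value of next
 * and copy into the same slot. */
static void sq_submit(struct sq *q, const uint8_t cmd[CMD_SIZE])
{
    memcpy(q->slots[q->next], cmd, CMD_SIZE);
    q->next = (q->next + 1) % SQ_ENTRIES;
    q->doorbell = q->next;
}
```

Detecting a second concurrent caller would require some shared flag or
lock in sq_submit, which is exactly the cost queue pairs are meant to
avoid.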
> >
> > >
> > > Also, does spdk_nvme_qpair_process_completions() for a qpair need to
> > > be invoked from the same thread that is responsible for issuing I/O
> > > on the qpair?
> > Yes - you need to call that function from the same thread that you
> > submitted the I/O on. It's fairly obvious that you can only call
> > spdk_nvme_qpair_process_completions on a particular queue pair from 1
> > thread at a time, but it isn't as obvious why you can't reap your
> > completions on a different thread than your submissions, so let me try
> > and explain that.
> >
> > We define two objects, a request and a tracker, that are placed on
> > lists. A request represents a single user call to submit an I/O. A
> > tracker is an entry on the hardware queue. We allow more requests
> > outstanding than available trackers. Submissions and completions
> > manipulate the lists of free requests and trackers using a simple
> > linked list, which is not thread safe. Further, each time a completion
> > happens and frees up a tracker, we check if there are any pending
> > requests and submit them. If we find any on the completion side but
> > we're on a different thread than the submission path, this would be
> > equivalent to doing submissions from two threads simultaneously.
> >
> > I'm not sure this technical challenge couldn't be overcome, but I am
> > fairly confident that you don't actually want to do this in your
> > software anyway. Not only is it more complicated, but you end up
> > thrashing your CPU cache. The request objects are sitting nicely in
> > your L1 or L2 CPU cache from submission, so when you complete on the
> > same core it is ideal.
> >
> > >
> > > When outstanding completions are processed as a result of calling
> > > spdk_nvme_qpair_process_completions(), is a request's callback
> > > called on the same core?
> > Yes - whatever thread you call spdk_nvme_qpair_process_completions on,
> > for each completion it finds it will call that callback immediately
> > inside of the current thread. So all of the callbacks for completions
> > found will have been called by the time
> > spdk_nvme_qpair_process_completions returns. The code is in
> > lib/nvme/nvme_qpair.c:spdk_nvme_qpair_process_completions() - you can
> > see it just loop over the completion entries and call
> > nvme_qpair_complete_tracker for each one. Inside of
> > nvme_qpair_complete_tracker, it calls the callback function.
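
The shape of that loop can be sketched as follows. This is a simplified
model, not the driver source: the names are hypothetical, and a plain
valid flag replaces the phase bit that real NVMe completion entries use.

```c
#include <assert.h>
#include <stddef.h>

#define CQ_ENTRIES 8

typedef void (*io_cb)(void *cb_arg);

struct cq_entry {
    int valid;               /* set by the (simulated) device */
    io_cb cb;                /* callback given at submission time */
    void *cb_arg;
};

struct cq {
    struct cq_entry entries[CQ_ENTRIES];
    size_t head;             /* next entry the host will look at */
};

/* Walk the entries that are ready and invoke each callback inline.
 * Returns the number of completions handled; every callback has
 * already run by the time this function returns. */
static int cq_process_completions(struct cq *q)
{
    int done = 0;

    while (done < CQ_ENTRIES && q->entries[q->head].valid) {
        struct cq_entry *e = &q->entries[q->head];

        e->valid = 0;
        e->cb(e->cb_arg);    /* runs on the calling thread, right now */
        q->head = (q->head + 1) % CQ_ENTRIES;
        done++;
    }
    return done;
}

/* Example callback: count completions in a caller-owned integer. */
static void count_cb(void *cb_arg)
{
    (*(int *)cb_arg)++;
}
```

Nothing in this model runs unless the application calls
cq_process_completions, which mirrors a driver with no interrupts and no
background threads.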
> >
> > >
> > > Is it always necessary to call
> > > spdk_nvme_qpair_process_completions() to process completions?
> > Yes - there are no interrupts or background threads so the driver will
> > only execute in response to calls from the user.
> >
> > >
> > > Regards,
> > > Varun Bhadauria
> > >
> > > On 6/17/16, 10:24 AM, "SPDK on behalf of Walker, Benjamin"
> > > <spdk-bounces(a)lists.01.org on behalf of
> > > benjamin.walker(a)intel.com> wrote:
> > >
> > > >
> > > > On Wed, 2016-06-15 at 23:56 +0000, Bhadauria, Varun wrote:
> > > > >
> > > > > Hello Ben
> > > > >
> > > > > Thank you for the clarification. I was under the false
> > > > > impression that Linux AIO can be made to use SPDK under the
> > > > > hood, which is clearly not the case since they will have to go
> > > > > through the filesystem.
> > > > I'm sure someone could wrap the AIO interface around the SPDK
> > > > driver for the specific case where the user is opening a block
> > > > device directly with O_DIRECT. It's nearly a 1:1 translation for
> > > > that case. Unfortunately, most people use Linux AIO on files
> > > > instead of block devices.
> > > >
> > > > >
> > > > > BTW, are there any known early filesystem implementations,
> > > > > besides Ceph's RocksDB-based BlueStore FS, which use SPDK?
> > > > The only publicly announced one that I'm aware of is Bluestore
> > > > inside of Ceph. As long as SPDK continues to be valuable, I fully
> > > > expect many filesystems with different designs to appear over
> > > > time. If you have a particular use case where you'd like some sort
> > > > of filesystem-like layer on top of SPDK, I'd love to hear about
> > > > it. At a minimum, it's useful to collect requirements from a
> > > > number of sources.
> > > >
> > > > >
> > > > > Regards,
> > > > > Varun Bhadauria
> > > > >
> > > > > On 6/15/16, 4:37 PM, "SPDK on behalf of Walker, Benjamin"
> > > > > <spdk-bounces(a)lists.01.org on behalf of
> > > > > benjamin.walker(a)intel.com> wrote:
> > > > >
> > > > > >
> > > > > > Can you explain a bit more about why you want to use AIO? Are
> > > > > > you referring to Linux AIO or POSIX AIO? If you want to do a
> > > > > > performance comparison of Linux AIO and the SPDK NVMe driver
> > > > > > then the perf tool is your best bet.
> > > > > >
> > > > > > You can run the perf tool against a block device using Linux
> > > > > > AIO by binding your NVMe device to the kernel
> > > > > > ("./scripts/setup.sh reset" will hand them all back to the
> > > > > > kernel) and then doing something like:
> > > > > >
> > > > > > ./perf -q 1 -s 4096 -w read -t 10 /dev/nvme0n1 /dev/nvme1n1
> > > > > >
> > > > > > -----Original Message-----
> > > > > > From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of
> > > > > > Bhadauria, Varun
> > > > > > Sent: Wednesday, June 15, 2016 4:30 PM
> > > > > > To: Storage Performance Development Kit <spdk(a)lists.01.org>
> > > > > > Subject: [SPDK] SPDK air examples
> > > > > >
> > > > > > Hello
> > > > > >
> > > > > > Are there any SPDK examples which use AIO? Perf.c has very
> > > > > > little documentation in the usage for AIO.
> > > > > >
> > > > > > Regards,
> > > > > > Varun Bhadauria
> > > > > >
> > > > > > _______________________________________________
> > > > > > SPDK mailing list
> > > > > > SPDK(a)lists.01.org
> > > > > > https://lists.01.org/mailman/listinfo/spdk
--===============3186708450900795814==--